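Time: Wednesday, April 19, 2023, 15:30-17:30
Venue: Room A1716, Science Building (理科大楼)
Speaker: Haojie Ren (任好洁), Associate Professor, Shanghai Jiao Tong University
Host: Guanghui Wang (王光辉), Associate Professor
Abstract: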
In the big data era, subsampling or sub-data selection techniques are often adopted to extract a fraction of informative individuals from massive data sets. Existing subsampling algorithms focus mainly on obtaining a representative subset to achieve the best estimation accuracy under a given class of models. In this work, we consider a semi-supervised setting wherein a small or moderate-sized “labeled” data set is available in addition to a much larger “unlabeled” data set. The goal is to sample from the unlabeled data under a given budget to obtain informative individuals that are characterized by their unobserved responses. We propose an optimal subsampling procedure that maximizes the diversity of the selected subsample while controlling the false selection rate (FSR), allowing us to explore as much reliable information as possible. The key ingredients of our method are the use of predictive inference for quantifying the uncertainty of response predictions and a reformulation of the objective into a constrained optimization problem. We show that the proposed method is asymptotically optimal in the sense that the diversity of the subsample converges to its oracle counterpart with FSR control. Numerical simulations and a real-data example validate the superior performance of the proposed strategy.
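About the speaker:
Haojie Ren is a tenure-track Associate Professor in the School of Mathematical Sciences at Shanghai Jiao Tong University. She received her PhD from Nankai University in 2018 and then carried out postdoctoral research at Pennsylvania State University. Her research interests include statistical anomaly detection, online learning and monitoring, and high-dimensional data inference. She has published more than 10 academic papers in journals including JASA and Biometrika.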