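Title: Model-free Feature Screening Based on Hellinger Distance for Ultrahigh Dimensional Data
Time: 9:00 a.m., August 12, 2024
Venue: Conference Room, 4th Floor, Old Library, Nanhu Campus
Organizers: School of Mathematics and Statistics / Office of Scientific Research
Speaker: Hengjian Cui (崔恒建)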
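Speaker bio: Hengjian Cui is a professor and doctoral supervisor at Capital Normal University, a member of the 10th National Committee of the China Association for Science and Technology, and a former member of the Discipline Appraisal Group of the Academic Degrees Committee of the State Council. He received his PhD from the Institute of Systems Science, Chinese Academy of Sciences. He has produced many important results in big-data statistical modeling, high-dimensional statistics and its robust statistical theory and methods, statistical machine learning, financial statistics, and quality management, and has published more than 180 papers, including papers in the leading international statistics and econometrics journals JASA, AoS, JRSS(B), Biometrika, and JoE. He has served as principal investigator on multiple Key Program projects of the National Natural Science Foundation of China. He currently serves on the editorial boards of the Chinese and English editions of Acta Mathematica Sinica and Acta Mathematicae Applicatae Sinica, and of Statistical Theory and Related Fields. He is also Vice President of the Chinese Association for Applied Statistics, Vice President of the National Association for Industrial Statistics Education, President of the Beijing Applied Statistics Society, and an executive council member of the Institute of Mathematical Statistics (China Chapter). His honors include a Second Prize of the Natural Science Award under the Ministry of Education Science and Technology Awards for Higher Education Institutions and a First Prize of the National Award for Outstanding Achievements in Statistical Science Research, among others.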
Abstract: With the explosive development of data acquisition and processing technology, the dimension of features can increase exponentially with the sample size, which poses great challenges for data analysis. It is vital to accurately identify the useful features among thousands of candidates. In this paper, we develop an omnibus model-free feature screening procedure based on the Hellinger distance, which has several appealing merits. First, we define the Hellinger distance index for discrete response variables in discriminant analysis. Second, the procedure works consistently for continuous response variables, which are discretized by a slice-and-fuse technique. Third, it is robust to potential outliers and model misspecification. Theoretically, the procedure possesses the sure screening property and the ranking consistency property under mild conditions, for both discrete and continuous response variables. Numerical studies demonstrate that the procedure is strongly competitive on heavy-tailed and skewed data while remaining comparable to existing approaches on light-tailed data, indicating robust performance across a range of data types. Two real-data examples, one with a discrete response and one with a continuous response, illustrate the effectiveness of the proposed method.
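For readers unfamiliar with the idea, the following is a minimal Python sketch of Hellinger-distance-based screening for a discrete response. It is not the speaker's index, whose exact definition is not given in the abstract. It assumes a simple equal-width histogram estimate of each class-conditional distribution and scores a feature by the largest pairwise Hellinger distance between classes. The function names (`hellinger`, `class_histograms`, `screening_score`), the bin count, and the simulated data are all hypothetical choices made for illustration.

```python
import math
import random


def hellinger(p, q):
    """Hellinger distance between two discrete distributions on a common support."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def class_histograms(x, y, n_bins):
    """Relative bin frequencies of feature x within each class of y (equal-width bins)."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins
    counts = {}
    for value, label in zip(x, y):
        # Clamp the maximum value into the last bin; a constant feature uses bin 0.
        k = min(int((value - lo) / width), n_bins - 1) if width > 0 else 0
        counts.setdefault(label, [0] * n_bins)[k] += 1
    return {label: [c / sum(h) for c in h] for label, h in counts.items()}


def screening_score(x, y, n_bins=8):
    """Assumed marginal utility: largest pairwise Hellinger distance between classes."""
    hists = list(class_histograms(x, y, n_bins).values())
    pairs = (hellinger(p, q) for i, p in enumerate(hists) for q in hists[i + 1:])
    return max(pairs, default=0.0)


# Simulated illustration (hypothetical data): feature 0 shifts with the class label,
# feature 1 is pure noise.
random.seed(0)
n = 400
y = [i % 2 for i in range(n)]
features = [
    [3.0 * label + random.gauss(0.0, 1.0) for label in y],  # informative feature
    [random.gauss(0.0, 1.0) for _ in y],                    # noise feature
]
scores = [screening_score(x, y) for x in features]

# Rank features by score, largest first; screening keeps the top-ranked ones.
ranking = sorted(range(len(features)), key=lambda j: scores[j], reverse=True)
print(scores, ranking)
```

In this sketch a feature whose distribution differs across classes receives a larger score than one that does not, and screening retains the features with the largest scores. How many features to keep, and how a continuous response would be sliced into classes, are left open here because the abstract does not specify them.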