Article abstract
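Li Yemian, Zhao Peng, Yang Yuhui, Wang Jingxian, Yan Hong, Chen Fangyao. Simulation study on missing data imputation methods for longitudinal data in cohort studies[J]. Chinese Journal of Epidemiology, 2021, 42(10): 1889-1894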
Simulation study on missing data imputation methods for longitudinal data in cohort studies
Received: 2020-11-30  Published: 2021-10-23
DOI:10.3760/cma.j.cn112338-20201130-01363
Keywords: Longitudinal data; Missing data; Imputation
Funding: National Natural Science Foundation of China (81703325); National Key Research and Development Program of China (2017YFC0907200, 2017YFC0907201)
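Author, affiliation, e-mail:
Li Yemian (李业棉), Department of Epidemiology and Health Statistics, School of Public Health, Xi'an Jiaotong University Health Science Center, 710061
Zhao Peng (赵芃), Department of Epidemiology and Health Statistics, School of Public Health, Xi'an Jiaotong University Health Science Center, 710061
Yang Yuhui (杨嵛惠), Department of Epidemiology and Health Statistics, School of Public Health, Xi'an Jiaotong University Health Science Center, 710061
Wang Jingxian (王静娴), Department of Epidemiology and Health Statistics, School of Public Health, Xi'an Jiaotong University Health Science Center, 710061
Yan Hong (颜虹), Department of Epidemiology and Health Statistics, School of Public Health, Xi'an Jiaotong University Health Science Center, 710061
Chen Fangyao (陈方尧), Department of Epidemiology and Health Statistics, School of Public Health, Xi'an Jiaotong University Health Science Center, 710061, chenfy@xjtu.edu.cn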
Abstract:
      Objective Missing data are an almost unavoidable problem in cohort studies. This simulation study compares the imputation performance of eight commonly used missing-data methods on longitudinal data with missing values, to provide a useful reference for handling missing longitudinal data. Methods The simulation was implemented in R. Longitudinal data with missing values were generated by the Monte Carlo method. The average absolute deviation, average relative deviation, and type I error of the regression analysis were compared across imputation methods to evaluate their imputation performance on missing longitudinal data and their influence on subsequent multivariable analysis. Results Mean imputation, k-nearest neighbor (KNN) imputation, regression imputation, and random forest imputation performed similarly and stably. Multiple imputation and hot deck imputation were inferior to these four methods. K-means clustering and the expectation-maximization (EM) algorithm performed worst and were the least stable. Mean imputation, the EM algorithm, random forest, KNN, and regression imputation controlled the type I error well, whereas multiple imputation, hot deck imputation, and K-means clustering did not control it effectively. Conclusions For longitudinal data missing at random, mean imputation, KNN, regression imputation, and random forest can all serve as good imputation methods. When the proportion of missing data is not too large, multiple imputation and hot deck imputation also perform well. K-means clustering and the EM algorithm are not recommended.
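The evaluation loop described in the abstract can be illustrated with a minimal sketch. The study itself was run in R and its code is not reproduced on this page, so everything below is an assumption made for illustration: the data-generating model, the missingness rule, the function names, and the exact definitions of the two deviation measures (taken here as the mean of |imputed - true| and of |imputed - true| / |true| over the masked cells). Only mean imputation is shown; the other seven methods would plug into the same comparison.

```python
import random
import statistics


def simulate_longitudinal(n_subjects=200, n_visits=4, seed=0):
    """Hypothetical generator: random intercept + linear time trend + noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_subjects):
        intercept = rng.gauss(50.0, 5.0)
        data.append([intercept + 2.0 * t + rng.gauss(0.0, 1.0)
                     for t in range(n_visits)])
    return data


def make_mar(data, base_rate=0.3, seed=1):
    """Mask follow-up values; missingness depends only on the observed
    baseline visit, which is one simple way to obtain missing at random."""
    rng = random.Random(seed)
    baseline_median = statistics.median(row[0] for row in data)
    masked = []
    for row in data:
        # Assumption: a high baseline makes later visits more likely to be missed.
        p = base_rate * (1.5 if row[0] > baseline_median else 0.5)
        masked.append([row[0]] + [None if rng.random() < p else v
                                  for v in row[1:]])
    return masked


def mean_impute(masked):
    """Fill each missing cell with the observed mean of the same visit."""
    n_visits = len(masked[0])
    visit_means = [
        statistics.fmean(row[t] for row in masked if row[t] is not None)
        for t in range(n_visits)
    ]
    return [[visit_means[t] if v is None else v for t, v in enumerate(row)]
            for row in masked]


def deviations(truth, masked, imputed):
    """Average absolute and relative deviation over the masked cells only."""
    abs_dev, rel_dev = [], []
    for true_row, masked_row, imputed_row in zip(truth, masked, imputed):
        for t, m, i in zip(true_row, masked_row, imputed_row):
            if m is None:
                abs_dev.append(abs(i - t))
                rel_dev.append(abs(i - t) / abs(t))
    return statistics.fmean(abs_dev), statistics.fmean(rel_dev)


if __name__ == "__main__":
    truth = simulate_longitudinal()
    masked = make_mar(truth)
    imputed = mean_impute(masked)
    avg_abs, avg_rel = deviations(truth, masked, imputed)
    print(f"average absolute deviation: {avg_abs:.3f}")
    print(f"average relative deviation: {avg_rel:.3f}")
```

A full Monte Carlo comparison would repeat these steps over many seeds and missing proportions, and would also refit the regression model on each imputed data set to estimate the type I error; that part is omitted here.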