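Li Yemian, Zhao Peng, Yang Yuhui, Wang Jingxian, Yan Hong, Chen Fangyao. Simulation study on missing data imputation methods for longitudinal data in cohort studies[J]. Chinese Journal of Epidemiology, 2021, 42(10): 1889-1894 |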
Simulation study on missing data imputation methods for longitudinal data in cohort studies |
Received: November 30, 2020 |
DOI:10.3760/cma.j.cn112338-20201130-01363 |
Keywords: Longitudinal data; Missing data; Imputation |
Funding: National Natural Science Foundation of China (81703325); National Key Research and Development Program of China (2017YFC0907200, 2017YFC0907201) |
Author Name | Affiliation | E-mail |
Li Yemian | Department of Epidemiology and Biostatistics, School of Public Health of Xi'an Jiaotong University Health Science Center, Xi'an 710061, China | |
Zhao Peng | Department of Epidemiology and Biostatistics, School of Public Health of Xi'an Jiaotong University Health Science Center, Xi'an 710061, China | |
Yang Yuhui | Department of Epidemiology and Biostatistics, School of Public Health of Xi'an Jiaotong University Health Science Center, Xi'an 710061, China | |
Wang Jingxian | Department of Epidemiology and Biostatistics, School of Public Health of Xi'an Jiaotong University Health Science Center, Xi'an 710061, China | |
Yan Hong | Department of Epidemiology and Biostatistics, School of Public Health of Xi'an Jiaotong University Health Science Center, Xi'an 710061, China | |
Chen Fangyao | Department of Epidemiology and Biostatistics, School of Public Health of Xi'an Jiaotong University Health Science Center, Xi'an 710061, China | chenfy@xjtu.edu.cn |
Abstract: |
Objective Missing data are almost unavoidable in cohort studies. This simulation study compares the imputation performance of eight common methods for handling missing values in longitudinal data, to provide a useful reference for the treatment of missing longitudinal data. Methods The simulation was implemented in R. Longitudinal data with missing values were generated by the Monte Carlo method. The imputation performance of each method, and its influence on subsequent multivariable analysis, was evaluated by comparing the mean absolute deviation, the mean relative deviation, and the Type I error of the regression analysis. Results Mean imputation, k-nearest neighbor (KNN) imputation, regression imputation, and random forest performed similarly and stably. Multiple imputation and hot deck imputation were inferior to these four methods. K-means clustering and the expectation-maximization (EM) algorithm performed worst and were the least stable. Mean imputation, the EM algorithm, random forest, KNN, and regression imputation controlled the Type I error well, whereas multiple imputation, hot deck imputation, and K-means clustering did not control it effectively. Conclusions For longitudinal data with missing values under the missing at random mechanism, mean imputation, KNN, regression imputation, and random forest are all good choices of imputation method. When the proportion of missing data is not too large, multiple imputation and hot deck imputation also perform well. K-means clustering and the EM algorithm are not recommended. |
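The evaluation loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the study was implemented in R, while this sketch uses Python with NumPy. The data-generating model, the missingness rule, all parameter values, and the exact definitions of the two deviation measures are assumptions made for illustration, and only mean imputation is shown. The Type I error assessment is not covered.

```python
import numpy as np

def simulate_longitudinal(n_subjects, n_visits, rng):
    """Assumed random-intercept model: y_ij = 50 + b_i + 2*j + e_ij (values hypothetical)."""
    b = rng.normal(0.0, 5.0, size=(n_subjects, 1))          # subject-level intercepts
    t = np.arange(n_visits)                                  # visit index 0..n_visits-1
    e = rng.normal(0.0, 2.0, size=(n_subjects, n_visits))    # measurement noise
    return 50.0 + b + 2.0 * t + e

def make_mar(y, rate, rng):
    """Mask follow-up visits with a probability that depends only on the observed baseline (MAR)."""
    baseline = y[:, 0]
    z = (baseline - baseline.mean()) / baseline.std()
    p = np.clip(rate * (1.0 + 0.5 * z), 0.0, 1.0)            # higher baseline, more missingness
    mask = rng.random(y.shape) < p[:, None]
    mask[:, 0] = False                                       # baseline is always observed
    y_miss = y.copy()
    y_miss[mask] = np.nan
    return y_miss, mask

def mean_impute(y_miss):
    """Replace each missing value with the observed mean of its visit (column)."""
    col_means = np.nanmean(y_miss, axis=0)
    return np.where(np.isnan(y_miss), col_means, y_miss)

def mean_absolute_deviation(y_true, y_imp, mask):
    """Assumed definition: mean |imputed - true| over the masked cells."""
    return float(np.mean(np.abs(y_imp[mask] - y_true[mask])))

def mean_relative_deviation(y_true, y_imp, mask):
    """Assumed definition: mean |imputed - true| / |true| over the masked cells."""
    return float(np.mean(np.abs(y_imp[mask] - y_true[mask]) / np.abs(y_true[mask])))

def run_simulation(n_reps=200, n_subjects=200, n_visits=4, rate=0.2, seed=0):
    """Monte Carlo loop: generate, mask, impute, and average both deviation measures."""
    rng = np.random.default_rng(seed)
    mad, mrd = [], []
    for _ in range(n_reps):
        y = simulate_longitudinal(n_subjects, n_visits, rng)
        y_miss, mask = make_mar(y, rate, rng)
        y_imp = mean_impute(y_miss)
        mad.append(mean_absolute_deviation(y, y_imp, mask))
        mrd.append(mean_relative_deviation(y, y_imp, mask))
    return float(np.mean(mad)), float(np.mean(mrd))

if __name__ == "__main__":
    print(run_simulation())
```

Comparing several methods under this design would amount to swapping `mean_impute` for another imputation function and repeating the loop over a range of missing proportions.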