Simulation study on missing data imputation methods for longitudinal data in cohort studies

Li Yemian; Zhao Peng; Yang Yuhui; Wang Jingxian; Yan Hong; Chen Fangyao

Abstract

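Li Yemian, Zhao Peng, Yang Yuhui, Wang Jingxian, Yan Hong, Chen Fangyao. Simulation study on missing data imputation methods for longitudinal data in cohort studies[J]. Chinese Journal of Epidemiology, 2021, 42(10): 1889-1894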

Simulation study on missing data imputation methods for longitudinal data in cohort studies

Received: November 30, 2020

DOI: 10.3760/cma.j.cn112338-20201130-01363

Keywords: Longitudinal data; Missing data; Imputation

Funding: National Natural Science Foundation of China (81703325); National Key Research and Development Program of China (2017YFC0907200, 2017YFC0907201)

Author Name	Affiliation	E-mail
Li Yemian	Department of Epidemiology and Biostatistics, School of Public Health of Xi'an Jiaotong University Health Science Center, Xi'an 710061, China
Zhao Peng	Department of Epidemiology and Biostatistics, School of Public Health of Xi'an Jiaotong University Health Science Center, Xi'an 710061, China
Yang Yuhui	Department of Epidemiology and Biostatistics, School of Public Health of Xi'an Jiaotong University Health Science Center, Xi'an 710061, China
Wang Jingxian	Department of Epidemiology and Biostatistics, School of Public Health of Xi'an Jiaotong University Health Science Center, Xi'an 710061, China
Yan Hong	Department of Epidemiology and Biostatistics, School of Public Health of Xi'an Jiaotong University Health Science Center, Xi'an 710061, China
Chen Fangyao	Department of Epidemiology and Biostatistics, School of Public Health of Xi'an Jiaotong University Health Science Center, Xi'an 710061, China	chenfy@xjtu.edu.cn

Abstract:

Objective Missing data are almost unavoidable in cohort studies. This paper uses a simulation study to compare how well eight commonly used missing-data handling methods impute missing longitudinal data, in order to provide a useful reference for handling such data. Methods The simulation was programmed in R. Longitudinal data with missing values were generated by the Monte Carlo method. The imputation methods were compared on their mean absolute deviation, their mean relative deviation, and the Type I error of a subsequent regression analysis, so as to assess both the quality of the imputed values and their influence on subsequent multivariable analysis. Results Mean imputation, k-nearest neighbor (KNN) imputation, regression imputation, and random forest performed similarly and stably. Multiple imputation and hot deck imputation were inferior to these four methods. K-means clustering and the expectation maximization (EM) algorithm performed worst and were the least stable. Mean imputation, the EM algorithm, random forest, KNN, and regression imputation controlled the Type I error well, whereas multiple imputation, hot deck imputation, and K-means clustering did not control it effectively. Conclusions For missing longitudinal data under the missing at random mechanism, mean imputation, KNN, regression imputation, and random forest can all serve as good imputation methods. When the proportion of missing data is not too large, multiple imputation and hot deck imputation also perform well. K-means clustering and the EM algorithm are not recommended.
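
The evaluation loop described in the Methods can be illustrated with a minimal sketch. It is written in Python with NumPy rather than the R code used in the study, and it covers only mean imputation. Everything specific in it is an assumption made for illustration and is not taken from the paper: the data-generating model (random intercept plus linear trend), the logistic missing at random mechanism driven by the baseline visit, the formulas used for the mean absolute and mean relative deviation, and all function names and parameter values.

```python
import numpy as np


def simulate_longitudinal(n_subjects, n_visits, rng):
    """Random-intercept, linear-trend outcomes (illustrative parameters only)."""
    intercepts = rng.normal(loc=50.0, scale=5.0, size=(n_subjects, 1))
    time = np.arange(n_visits)
    noise = rng.normal(scale=2.0, size=(n_subjects, n_visits))
    return intercepts + 1.5 * time + noise


def impose_mar(y, target_rate, rng):
    """Mask follow-up visits with a probability that depends on the
    always-observed baseline value (an assumed MAR mechanism)."""
    baseline = y[:, 0]
    z = (baseline - baseline.mean()) / baseline.std()
    prob = 1.0 / (1.0 + np.exp(-z))
    # Rescale so the average missing probability equals target_rate.
    prob = np.clip(prob * target_rate / prob.mean(), 0.0, 1.0)
    mask = np.zeros(y.shape, dtype=bool)
    mask[:, 1:] = rng.random((y.shape[0], y.shape[1] - 1)) < prob[:, None]
    return mask


def mean_impute(y, mask):
    """Replace each missing cell with the observed mean of its visit."""
    y_obs = np.where(mask, np.nan, y)
    visit_means = np.nanmean(y_obs, axis=0)
    return np.where(mask, visit_means, y)


def evaluate(y_true, y_imputed, mask):
    """Mean absolute and mean relative deviation over the imputed cells
    (assumed definitions)."""
    abs_dev = np.abs(y_imputed[mask] - y_true[mask])
    return abs_dev.mean(), (abs_dev / np.abs(y_true[mask])).mean()


def run_simulation(n_reps=200, n_subjects=200, n_visits=4,
                   target_rate=0.2, seed=2021):
    """Monte Carlo loop: average both metrics over n_reps replicates."""
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(n_reps):
        y = simulate_longitudinal(n_subjects, n_visits, rng)
        mask = impose_mar(y, target_rate, rng)
        results.append(evaluate(y, mean_impute(y, mask), mask))
    mad, mrd = np.mean(results, axis=0)
    return mad, mrd


if __name__ == "__main__":
    mad, mrd = run_simulation()
    print(f"mean absolute deviation: {mad:.3f}")
    print(f"mean relative deviation: {mrd:.4f}")
```

Swapping `mean_impute` for another imputation function with the same signature would allow a side-by-side comparison in the same loop. The Type I error assessment mentioned in the Methods is not covered by this sketch.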
