Machine learning-based risk prediction model for large for gestational age infants

白皙; 罗云云; 周智博; 苏明亮; 杨柳青; 陈适; 阳洪波; 朱惠娟; 潘慧

Article abstract

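白皙, 罗云云, 周智博, 苏明亮, 杨柳青, 陈适, 阳洪波, 朱惠娟, 潘慧. Development and evaluation of a machine learning prediction model for large for gestational age[J]. Chinese Journal of Epidemiology (中华流行病学杂志), 2021, 42(12): 2143-2148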

Development and evaluation of a machine learning prediction model for large for gestational age

Received: 2021-08-24 Published: 2021-12-16

DOI: 10.3760/cma.j.cn112338-20210824-00677

Chinese keywords: 机器学习; 大于胎龄儿; 风险预测模型

English keywords: Machine learning; Large for gestational age; Risk prediction model

Funding:

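Author	Affiliation	E-mail
白皙	Chinese Academy of Medical Sciences / Peking Union Medical College / Department of Endocrinology, Peking Union Medical College Hospital / Key Laboratory of Endocrinology of National Health Commission / State Key Laboratory of Complex Severe and Rare Diseases, Beijing 100730
罗云云	Chinese Academy of Medical Sciences / Peking Union Medical College / Department of Endocrinology, Peking Union Medical College Hospital / Key Laboratory of Endocrinology of National Health Commission / State Key Laboratory of Complex Severe and Rare Diseases, Beijing 100730
周智博	Chinese Academy of Medical Sciences / Peking Union Medical College / Department of Endocrinology, Peking Union Medical College Hospital / Key Laboratory of Endocrinology of National Health Commission / State Key Laboratory of Complex Severe and Rare Diseases, Beijing 100730
苏明亮	DHC Mediway Technology Co., Ltd. (东华医为科技有限公司), Beijing 100190
杨柳青	DHC Mediway Technology Co., Ltd. (东华医为科技有限公司), Beijing 100190
陈适	Chinese Academy of Medical Sciences / Peking Union Medical College / Department of Endocrinology, Peking Union Medical College Hospital / Key Laboratory of Endocrinology of National Health Commission / State Key Laboratory of Complex Severe and Rare Diseases, Beijing 100730
阳洪波	Chinese Academy of Medical Sciences / Peking Union Medical College / Department of Endocrinology, Peking Union Medical College Hospital / Key Laboratory of Endocrinology of National Health Commission / State Key Laboratory of Complex Severe and Rare Diseases, Beijing 100730
朱惠娟	Chinese Academy of Medical Sciences / Peking Union Medical College / Department of Endocrinology, Peking Union Medical College Hospital / Key Laboratory of Endocrinology of National Health Commission / State Key Laboratory of Complex Severe and Rare Diseases, Beijing 100730
潘慧	Chinese Academy of Medical Sciences / Peking Union Medical College / Department of Endocrinology, Peking Union Medical College Hospital / Key Laboratory of Endocrinology of National Health Commission / State Key Laboratory of Complex Severe and Rare Diseases, Beijing 100730	panhui20111111@163.com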

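Chinese abstract (translated):

Objective: To develop and validate a machine learning-based model for predicting the risk of large for gestational age (LGA) during pregnancy, and to compare its performance with that of a model built with traditional logistic regression. Methods: Participants came from the National Free Preconception Health Examination Project in China, which was carried out in 2010-2012 in 220 counties across 31 provinces and covered all rural couples planning a pregnancy. This study included all couples of childbearing age, and their newborns, with a singleton live birth at a gestational age of 24-42 weeks. Ten machine learning algorithms were each used to build an LGA prediction model, and the models' predictive performance for LGA was evaluated. Results: A total of 104 936 newborns were included, 54 856 boys (52.3%) and 50 080 girls (47.7%); the incidence of LGA was 11.7% (12 279 cases). After the data were balanced by under-sampling, the overall performance of the machine learning models improved markedly. The CatBoost model performed best in predicting LGA risk, with an area under the receiver operating characteristic curve (AUC) of 0.932; the logistic regression model performed worst, with an AUC of only 0.555. Conclusion: Compared with traditional logistic regression, machine learning algorithms can build a more effective model for predicting LGA risk during pregnancy, with potential application value.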

English abstract:

Objective: To develop and validate a predictive model for large for gestational age (LGA) in pregnancy using machine learning (ML) algorithms, and to compare its performance with that of the traditional logistic regression model. Methods: Data were obtained from the National Free Preconception Health Examination Project in China, carried out in 220 counties of 31 provinces from 2010 to 2012 and covering all rural couples with a planned pregnancy. This study included all couples of childbearing age who had a singleton live birth at 24-42 weeks of gestational age, together with their newborns. Ten different ML algorithms were used to establish LGA prediction models, and the predictive performance of these models was evaluated. Results: A total of 104 936 newborns were included, 54 856 boys (52.3%) and 50 080 girls (47.7%). The incidence of LGA was 11.7% (12 279 cases). The imbalance between the two groups was addressed by under-sampling, after which the overall performance of the ML models improved markedly. The CatBoost model achieved the highest area under the receiver operating characteristic curve (AUC), 0.932. The logistic regression model had the worst performance, with an AUC of 0.555. Conclusions: In predicting the risk of LGA in pregnancy, ML algorithms outperformed the traditional logistic regression method. CatBoost performed best among the ML algorithms compared and deserves further investigation.
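
The two technical steps named in the abstract, balancing the classes by under-sampling and scoring a model by AUC, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `undersample` and `auc`, the single made-up feature, and the toy cohort are all assumptions for illustration, and only the 11.7% positive rate is borrowed from the abstract.

```python
import random


def undersample(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority row plus an equally sized random draw from the majority.
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in kept], [labels[i] for i in kept]


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic toy cohort (hypothetical data): 117 LGA-like positives out of 1000 rows.
rng = random.Random(42)
labels = [1] * 117 + [0] * 883
rows = [[rng.gauss(1.0 if y else 0.0, 1.0)] for y in labels]

balanced_rows, balanced_labels = undersample(rows, labels, seed=1)
print(len(balanced_labels), sum(balanced_labels))  # 234 117

# Hand-checkable AUC: three of the four positive/negative pairs are ordered correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

In practice the balancing step is usually applied to the training portion only, so that AUC is measured on data with the original class proportions; the abstract does not state how the split was handled.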
