Statistical methods for extremely unbalanced data in genome-wide association study (1)

Xie Ning; Bi Wenjian; Zhang Zhongwen; Shao Fang; Wei Yongyue; Zhao Yang; Zhang Ruyang; Chen Feng

Article abstract

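Xie Ning, Bi Wenjian, Zhang Zhongwen, Shao Fang, Wei Yongyue, Zhao Yang, Zhang Ruyang, Chen Feng. Statistical methods for extremely unbalanced data in genome-wide association study (1)[J]. Chinese Journal of Epidemiology (中华流行病学杂志), 2024, 45(11): 1582-1589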

Statistical methods for extremely unbalanced data in genome-wide association study (1)

Received: 2024-05-06 Published: 2024-11-15

DOI: 10.3760/cma.j.cn112338-20240506-00235

Keywords: Extremely unbalanced data; Genome-wide association study; Asymptotic distribution; Rare variants

Funding: National Natural Science Foundation of China (82220108002, 82273737)

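Author	Affiliation	E-mail
Xie Ning	Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing 211166
Bi Wenjian	Department of Medical Genetics, School of Basic Medical Sciences, Peking University, Beijing 100191
Zhang Zhongwen	Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing 211166
Shao Fang	Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing 211166
Wei Yongyue	Center for Public Health and Epidemic Preparedness & Response, Peking University, Beijing 100191
Zhao Yang	Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing 211166; International Joint Research Center for Environment and Human Health, Nanjing Medical University, Nanjing 211166
Zhang Ruyang	Department of Information, The Affiliated Changzhou No. 2 People's Hospital of Nanjing Medical University, Changzhou 213164	zhangruyang@njmu.edu.cn
Chen Feng	Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing 211166; International Joint Research Center for Environment and Human Health, Nanjing Medical University, Nanjing 211166	fengchen@njmu.edu.cn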

Abstract:

Extremely unbalanced data are data in which the values of an independent or dependent variable are severely skewed in proportion, for example an extremely unbalanced case-control ratio, a very low disease incidence, heavily censored survival data, or low-frequency or rare genetic variants. In such settings, the classical test statistics for parameters in models such as the logistic regression model and the Cox proportional hazards regression model deviate from their theoretical asymptotic distributions, so the type I error is difficult to control and may be inflated or deflated. In recent years, as genome-wide association study (GWAS) resources from very large population cohorts have been increasingly shared and mined in depth, the need for efficient and accurate statistical methods for extremely unbalanced data in independent or non-independent samples has become increasingly prominent. This paper first introduces the classical statistical methods used in statistical genetics and then uses simulation experiments to show how these methods fail on extremely unbalanced data, with the aim of drawing researchers' attention to extremely unbalanced data in GWAS.
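The kind of failure the abstract describes can be illustrated with a small null simulation. The sketch below is not the authors' simulation design, which this page does not give. It assumes an intercept-only logistic regression score test (equivalent, up to a factor of n/(n-1), to the Cochran-Armitage trend test), genotypes in Hardy-Weinberg equilibrium, and illustrative sample sizes, minor allele frequencies, and significance level chosen here. The function name `score_test_type1_error` and all of its parameters are hypothetical.

```python
from statistics import NormalDist

import numpy as np


def score_test_type1_error(n_cases, n_controls, maf, alpha=1e-3,
                           reps=200_000, seed=0):
    """Empirical type I error of the intercept-only logistic score test.

    Genotypes are drawn independently of case status, so the null
    hypothesis of no association holds in every replicate.
    """
    rng = np.random.default_rng(seed)
    n = n_cases + n_controls
    ybar = n_cases / n  # fitted null probability of being a case

    # Hardy-Weinberg probabilities of carrying 0, 1, 2 minor alleles (assumed).
    probs = [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]

    # Simulate genotype counts per group instead of individual genotypes;
    # the score statistic only depends on these counts.
    case_counts = rng.multinomial(n_cases, probs, size=reps)
    ctrl_counts = rng.multinomial(n_controls, probs, size=reps)

    dosage = np.array([0, 1, 2])
    g_cases = case_counts @ dosage                    # sum of g_i over cases
    g_all = g_cases + ctrl_counts @ dosage            # sum of g_i over everyone
    g_sq = (case_counts + ctrl_counts) @ dosage ** 2  # sum of g_i^2

    # Score U = sum_i g_i (y_i - ybar); Var(U) = ybar (1 - ybar) sum_i (g_i - gbar)^2
    score = g_cases - ybar * g_all
    var = ybar * (1 - ybar) * (g_sq - g_all ** 2 / n)

    # Monomorphic replicates (var == 0) are counted as non-rejections.
    valid = var > 0
    z = np.zeros(reps)
    z[valid] = score[valid] / np.sqrt(var[valid])

    # Two-sided test against the asymptotic standard normal reference.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return float(np.mean(np.abs(z) > z_crit))


# Balanced design with a common variant vs. a 1:99 case-control ratio
# with a rare variant (illustrative settings, not taken from the paper).
balanced = score_test_type1_error(5000, 5000, maf=0.3)
unbalanced = score_test_type1_error(100, 9900, maf=0.005)
print(f"balanced:   {balanced:.5f}")
print(f"unbalanced: {unbalanced:.5f}")
```

Under these assumptions, the balanced setting should give a rejection rate close to the nominal level of 0.001, while the unbalanced, rare-variant setting should reject noticeably more often, because only about one minor allele is expected among the 100 cases and the statistic is strongly right-skewed rather than approximately normal.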
