Study on the application of classification tree model in screening the risk factors of malignant tumor

ZHANG Yong-jing; CHEN Kun; JIN Ming-juan; FAN Chun-hong

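ZHANG Yong-jing, CHEN Kun, JIN Ming-juan, FAN Chun-hong. Study on the application of classification tree model in screening the risk factors of malignant tumor[J]. Chinese Journal of Epidemiology, 2006, 27(6): 540-543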

Study on the application of classification tree model in screening the risk factors of malignant tumor

Received: November 03, 2005

Key words: Classification tree model; Breast neoplasm; Risk factor; Exhaustive chi-square automatic interaction detection (Exhaustive CHAID) method

Fund Project: National Natural Science Foundation of China (30471492)

Author Name	Affiliation	E-mail
ZHANG Yong-jing	Department of Epidemiology and Health Statistics, School of Public Health, Zhejiang University, Hangzhou 310031, China
CHEN Kun	Department of Epidemiology and Health Statistics, School of Public Health, Zhejiang University, Hangzhou 310031, China	ck@zju.edu.cn
JIN Ming-juan	Department of Epidemiology and Health Statistics, School of Public Health, Zhejiang University, Hangzhou 310031, China
FAN Chun-hong	Department of Epidemiology and Health Statistics, School of Public Health, Zhejiang University, Hangzhou 310031, China

Abstract:

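Objective To introduce the basic principles, algorithm, and application value of the classification tree model for screening risk factors of malignant tumors. Methods Using field survey data on breast cancer from Jiashan County, Zhejiang Province, as an example, a classification tree model was built with the Exhaustive CHAID method to screen risk factors from the survey results. The model was evaluated with the misclassification probability (Risk value) and the area under the ROC curve. Results The classification tree model selected 9 risk factors from all 105 candidate variables. Occupation was the most important factor: workers, teachers, and retirees had a significantly higher probability of breast cancer than other groups. The model also showed that the effect of regular physical exercise on breast cancer differed across subgroups. The misclassification Risk value of the model was 0.174, and the area under the ROC curve drawn from the predicted probabilities was 0.872, which differed significantly from 0.5, indicating that the model fit well. Conclusion The classification tree model can not only effectively screen out the main influencing factors but also define well-founded cut-points for study variables and reveal complex interactions among variables, giving it high application value in epidemiological research.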
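The split-selection idea behind CHAID-type trees can be illustrated with a minimal sketch. It is not the authors' procedure: Exhaustive CHAID also merges predictor categories and applies Bonferroni-adjusted p-values, both omitted here. The function names (`chi_square`, `chi2_sf`, `best_split`) and the toy records are hypothetical and are not data from the Jiashan survey. The sketch only shows the core step of choosing, at one node, the categorical predictor whose chi-square test against the outcome gives the smallest p-value.

```python
import math
from collections import Counter


def chi_square(predictor, outcome):
    """Pearson chi-square statistic and degrees of freedom for two categorical lists."""
    n = len(outcome)
    row_totals = Counter(predictor)
    col_totals = Counter(outcome)
    observed = Counter(zip(predictor, outcome))
    stat = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution (integer df >= 1).

    Uses the recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1),
    starting from Q(1, y) = exp(-y) or Q(1/2, y) = erfc(sqrt(y)).
    """
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2.0 - 1e-9:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def best_split(candidates, outcome):
    """Return the predictor with the smallest (unadjusted) p-value, plus all p-values."""
    p_values = {}
    for name, values in candidates.items():
        stat, df = chi_square(values, outcome)
        # A predictor with a single category cannot split the node.
        p_values[name] = chi2_sf(stat, df) if df > 0 else 1.0
    return min(p_values, key=p_values.get), p_values


# Toy, made-up records (1 = case, 0 = control); purely illustrative.
outcome = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
candidates = {
    "occupation": ["worker"] * 4 + ["farmer"] * 8,
    "exercise": ["yes", "no"] * 6,
}
chosen, p_values = best_split(candidates, outcome)
print(chosen, p_values)
```

In a full tree the same selection would be repeated within each child node, which is how a factor such as physical exercise can appear with different effects in different branches.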

English Abstract:

Objective To introduce the partitioning algorithm of the classification tree model and to explore the value of this data mining technique in analyzing data on multifactorial diseases such as malignant tumors. Methods Data were analyzed from a survey conducted on 84 breast cancer patients and 273 randomly selected cancer-free controls in Jiashan County. The classification tree model was constructed using the Exhaustive CHAID method and evaluated by the Risk statistic and the area under the ROC curve. Results Nine of the 105 candidate variables were selected as risk factors. Occupation was the most important: workers, teachers, and retirees were at much higher risk than others. The number of pregnancies, breast examination, reason for menopause, age at menarche, intake of shrimp, crab, kipper, kelp, laver, etc. were also risk factors for breast cancer. Physical exercise, however, played different roles in different subgroups. The Risk statistic of the model was 0.174, and the area under the ROC curve was 0.872, which was significantly different from 0.5, suggesting that the classification tree model fit the data very well. Conclusion The classification tree model could screen out the major influencing factors quickly and effectively, identify cut-points for continuous and ordinal variables, and reveal complex interactions among the factors at many levels. This model might become a powerful tool for exploring the complexity of disease risk.
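The two evaluation measures named above can be sketched as follows. This is an illustration under assumptions, not the authors' computation: the Risk statistic is taken here to be the proportion of misclassified subjects, and the abstract does not say whether it was estimated by resubstitution or cross-validation. The function names, the 0.5 classification threshold, and the toy values are hypothetical. The area under the ROC curve is computed through its rank interpretation, the probability that a randomly chosen case receives a higher predicted probability than a randomly chosen control, with ties counted as one half.

```python
def misclassification_risk(y_true, y_pred):
    """Proportion of subjects whose predicted class differs from the observed class."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise case/control comparisons."""
    cases = [s for t, s in zip(y_true, scores) if t == 1]
    controls = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5  # ties contribute one half
    return wins / (len(cases) * len(controls))


# Toy, made-up values (1 = case, 0 = control); not results from the study.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.3, 0.6, 0.2, 0.2, 0.1, 0.1]  # predicted probabilities from terminal nodes
y_pred = [1 if s >= 0.5 else 0 for s in scores]     # assumed 0.5 classification threshold

risk = misclassification_risk(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(risk, auc)
```

An area of 0.5 corresponds to a model that ranks cases and controls no better than chance, which is why the reported area is compared against that value. The significance test itself is not shown in this sketch.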
