Risk prediction study of prostate tumors based on Borderline-SMOTE algorithm and Stacking ensemble learning

Journal Of Modern Oncology[ISSN:1672-4992/CN:61-1415/R]

Issue:
2023 16
Page:
3075-3081

Info

Title:
Risk prediction study of prostate tumors based on Borderline-SMOTE algorithm and Stacking ensemble learning
Author(s):
XIONG Siwei (1), LIU Yulin (1, 2)
1. School of Microelectronics and Data Science, Anhui University of Technology, Ma'anshan, Anhui 243032, China; 2. Anhui Provincial Joint Key Laboratory of Disciplines for Industrial Big Data Analysis and Intelligent Decision, Ma'anshan, Anhui 243032, China.
Keywords:
prostate tumors; mutual information; Borderline-SMOTE; Stacking ensemble learning
PACS:
R737.25
DOI:
10.3969/j.issn.1672-4992.2023.16.025
Abstract:
Objective: To build a high-accuracy combined model using data mining methods to predict the risk of prostate cancer (PCa), providing a reference for its prevention and diagnosis. Methods: A total of 682 patients who underwent prostate biopsy were selected from the Clinical Medical Science Data Center (301 Hospital). Mutual information was used as the evaluation criterion to screen for feature attributes related to PCa. Single models were built with the XGBoost, logistic regression, AdaBoost, K-nearest neighbor, and random forest algorithms, and the three models with the best predictive ability were selected by five-fold cross-validation. Borderline-SMOTE oversampling was then applied to construct both single models and Stacking combination models, and the influence of different combination methods was explored. Finally, 37 clinical cases from 301 Hospital and Wuhu Yijishan Hospital were used as an external validation set to test the model. Results: Mutual information screening identified 19 key feature attributes. Among the single models, random forest, XGBoost, and AdaBoost performed best. Applying Borderline-SMOTE balanced the class labels and substantially increased the AUC of the single models. Of the three combination models built with Borderline-SMOTE and Stacking, the one with XGBoost and random forest as primary classifiers and AdaBoost as the secondary classifier had the best predictive ability: accuracy 0.9454, recall 0.9375, precision 0.9573, F1 score 0.9470, and AUC 0.9823. The combined model also predicted well on the clinical validation set. Conclusion: Borderline-SMOTE oversampling is effective for handling the imbalanced data set. Compared with single models, the PCa risk prediction method based on Stacking ensemble learning with multi-model fusion achieves higher prediction accuracy and good generalization performance, making it more helpful for the clinical diagnosis of PCa.
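
The abstract names Borderline-SMOTE as the oversampling step but gives no implementation details. The following is a minimal sketch of the general Borderline-SMOTE1 idea, not the authors' code: only minority samples whose neighborhood is mostly (but not entirely) majority-class are treated as borderline, and new minority points are interpolated from those. The function name `borderline_smote`, the parameters `m` and `k`, the Euclidean distance, and the toy data are all assumptions made for illustration.

```python
import math
import random


def borderline_smote(X, y, minority=1, m=5, k=5, seed=0):
    """Return synthetic minority samples (illustrative Borderline-SMOTE1 sketch)."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    # Number of new points needed to match the majority class size.
    n_needed = len(y) - 2 * len(minority_idx)

    def nearest(i, candidates, n):
        # Indices of the n candidates closest to sample i, excluding i itself.
        others = [j for j in candidates if j != i]
        return sorted(others, key=lambda j: math.dist(X[i], X[j]))[:n]

    # Step 1: keep minority samples in "danger", i.e. at least half,
    # but not all, of their m nearest neighbours belong to the majority.
    danger = []
    for i in minority_idx:
        neighbours = nearest(i, range(len(X)), m)
        n_major = sum(1 for j in neighbours if y[j] != minority)
        if len(neighbours) / 2 <= n_major < len(neighbours):
            danger.append(i)

    # Step 2: interpolate between a danger sample and one of its
    # k nearest minority neighbours to create each synthetic point.
    synthetic = []
    while danger and len(synthetic) < n_needed:
        i = rng.choice(danger)
        neighbours = nearest(i, minority_idx, k)
        if not neighbours:
            break
        j = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic


# Hypothetical toy data: 8 majority points (label 0), 4 minority points (label 1).
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [1.5, 0.5], [1.5, 1.5], [0.5, 1.5],
     [2, 1], [2.2, 1.2], [3, 3], [3.2, 3.1]]
y = [0] * 8 + [1] * 4

synthetic = borderline_smote(X, y, minority=1, m=3, k=2, seed=0)
print(len(synthetic))  # 4 new minority points, balancing the 8 majority points
```

In this toy set only the two minority points next to the majority cluster count as borderline, so every synthetic point starts from one of them. In a cross-validated study, oversampling of this kind would normally be applied to the training folds only, so that synthetic points never leak into the evaluation data.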

Memo

Memo:
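Anhui Provincial Teaching Research Project (No. 2020jyxm0212); Anhui Provincial Quality Engineering Project (No. 2021xxkc017); College Students' Innovation and Entrepreneurship Training Program (No. 202110360319)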