Feature Selection and Ensemble Methods for Predicting Parkinson’s Disease Outcomes: A Data-Driven Approach


Creative Commons License

Buyrukoglu S., Buyrukoğlu G.

All Sciences Academy, Konya, 2025

  • Publication Type: Book / Research Book
  • Publication Date: 2025
  • Publisher: All Sciences Academy
  • City: Konya
  • Abdullah Gül University Affiliated: Yes

Abstract

Early prediction and stratification of Parkinson’s disease (PD) progression are important for individualized treatment planning and for the design of clinical trials. In this work, we designed and critically assessed a complete machine learning pipeline on clinical data from the Parkinson’s Progression Markers Initiative (PPMI) to predict a three-class outcome representing distinct disease progression patterns in a cohort of 1,008 subjects with complete follow-up data. The analytical procedure involved systematic preprocessing (removal of data with more than 50% missing values, feature labelling, median imputation, z-score normalization), univariate feature selection by ANOVA F-statistics (SelectKBest, k=20), and a rigorous comparison of several classification frameworks: tree-based ensemble models (Random Forest, Gradient Boosting, Extra Trees, AdaBoost), a support vector machine, logistic regression, a stacking ensemble with a logistic regression meta-learner, and a super learner based on meta-learning using out-of-fold predictions. The models were evaluated in terms of accuracy, macro-averaged precision, recall, F1-score, ROC-AUC, and confusion matrices under 5-fold stratified cross-validation. The super learner achieved the best predictive performance (accuracy: 94.94% ± 2.58%, F1-score: 93.06%) and exceeded the single Gradient Boosting baseline (93.35% ± 1.73%) by 1.59 percentage points, indicating that meta-learner selection is a critical factor in the effectiveness of ensembles on medium-sized clinical data. Feature selection improved the performance of gradient-boosting algorithms but had inconsistent effects across other model families.
The proposed workflow shows that properly configured meta-learning architectures can achieve state-of-the-art predictive performance on structured clinical tabular data and yield interpretable results through feature importance and SHAP analysis, and that it can serve as a methodological template for future studies on PD progression modelling.
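
The kind of workflow the abstract describes can be illustrated with a minimal scikit-learn sketch. It is not the authors' implementation: the data are synthetic (not PPMI), the choice of three base learners, their hyperparameters, and the random seeds are assumptions made for brevity, and scikit-learn's StackingClassifier stands in for the stacking and super learner ensembles, whose exact configuration the abstract does not specify. The sketch drops features with more than 50% missing values, applies median imputation, z-score normalization, and SelectKBest (ANOVA F-test, k=20), and evaluates a stacked ensemble with a logistic regression meta-learner under 5-fold stratified cross-validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a clinical table (NOT PPMI data): 3 classes, 40 features.
rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=40, n_informative=10,
                           n_classes=3, random_state=0)
X[rng.random(X.shape) < 0.05] = np.nan  # scatter roughly 5% missing values
X[:180, 0] = np.nan                     # make feature 0 60% missing

# Drop features with more than 50% missing values (assumed interpretation).
keep = np.isnan(X).mean(axis=0) <= 0.5
X = X[:, keep]

# Base learners (illustrative subset; hyperparameters are assumptions).
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ("svm", SVC(kernel="rbf")),
]

# The meta-learner is fitted on out-of-fold predictions of the base learners.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

# Imputation, scaling, and feature selection live inside the pipeline so
# they are re-fitted on the training portion of every cross-validation fold.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    SelectKBest(score_func=f_classif, k=20),
    stack,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "f1_macro"])

acc = scores["test_accuracy"]
f1 = scores["test_f1_macro"]
print(f"accuracy: {acc.mean():.3f} +/- {acc.std():.3f}")
print(f"macro F1: {f1.mean():.3f}")
```

Keeping the imputer, scaler, and selector inside the pipeline is a deliberate choice in this sketch: fitting them on the full dataset before cross-validation would leak information from the held-out folds into the reported scores.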