Evaluating machine learning and statistical learning techniques for cancer classification and diagnosis

Alwazy, Asmaa; Buyrukoğlu, Gonca; Buyrukoğlu, Selim; Baker, Mohammed

doi:10.1007/s42044-025-00233-z

Alwazy A. S. H., Buyrukoğlu G., Buyrukoğlu S., Baker M. R.

Iran Journal of Computer Science, vol. 8, no. 2, pp. 471-490, 2025 (Scopus)

Publication Type: Article / Full Article
Volume: 8, Issue: 2
Publication Date: 2025
DOI: 10.1007/s42044-025-00233-z
Journal Name: Iran Journal of Computer Science
Journal Indexes: Scopus
Pages: pp. 471-490
Keywords: Cancer classification, Data mining, Diagnosis, Machine learning algorithms, Statistical learning algorithms
Affiliated with Abdullah Gül University: No

Abstract

Accurate cancer diagnosis is critical for effective treatment and positive patient outcomes. This study investigates the robustness of machine learning (ML) and statistical learning (SL) algorithms in classifying and diagnosing lung, prostate, and breast cancer as well as heart disease. We evaluated and compared the performance of several algorithms, including support vector classifier (SVC), random forest (RF), XGBoost, decision tree (DT), elastic net, Lasso, and ridge, using four medical data sets. Multiple metrics, namely accuracy, precision, recall, F1-score, and area under the curve (AUC), were used to assess performance. Hyperparameter tuning was conducted using GridSearchCV. The results demonstrated that RF and SVC achieved near-perfect accuracy (up to 98%) and AUC scores (1.00) in distinguishing between benign and malignant lung and breast cancer samples. SL algorithms exhibited robust and consistent performance for prostate cancer, achieving accuracy, precision, recall, and F1-scores of approximately 90%, making them suitable for data sets with smaller sample sizes. However, SVC-based ML algorithms outperformed SL methods in overall accuracy for prostate cancer, although SL methods provided a stable baseline. For heart disease, SVC achieved the highest accuracy (82%), while regularized SL models such as elastic net and Lasso performed similarly, with an accuracy of 80%. This study highlights the importance of selecting algorithms tailored to the data set’s characteristics and the specific diagnostic requirements. It also underscores the complementary strengths of ML and SL approaches, suggesting that an integrated strategy may offer the most effective solution. The findings contribute to the growing evidence supporting the integration of computational methods into clinical practice to enhance diagnostic accuracy and improve patient care.
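
For readers unfamiliar with the evaluation metrics named in the abstract, the following is a minimal sketch of how accuracy, precision, recall, F1-score, and AUC can be computed for a binary classifier. It is not the authors' code: the function names `binary_metrics` and `roc_auc`, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and none of the numbers come from the study.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for 0/1 labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample receives the higher score; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data invented for illustration only (not taken from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Accuracy, precision, recall, and F1 depend on the chosen decision threshold, whereas AUC is computed from the raw scores and summarizes ranking quality across all thresholds. This is one reason the two kinds of metric can diverge, for example a model reaching an AUC of 1.00 while its accuracy at a fixed threshold stays below 100%.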