Addressing Class Imbalance in Healthcare Data: Machine Learning Solutions for Age-Related Macular Degeneration and Preeclampsia



Healthcare Domain, Class Imbalance, Ensemble Classifiers, Diagnostic Decision-Making, Personalized Medicine, Machine Learning Techniques


The use of machine learning in healthcare has transformed how diseases are diagnosed and treatments are optimized. However, medical datasets are often imbalanced, in part because privacy regulations make data collection difficult. As a result, certain health conditions are underrepresented, which hampers machine learning performance. To address this problem, a hybrid approach is proposed that combines the Synthetic Minority Oversampling Technique (SMOTE) with undersampling and uses two techniques tailored for imbalanced datasets. These were compared with popular machine learning methods, using various thresholds for reducing one class and using Balanced Accuracy as the metric to mitigate bias toward the majority class. The results showed that Balanced Bagging and Balanced Random Forest consistently outperformed the other methods, achieving the best average rankings of 1.42 and 3.58 out of 32 configurations on the two datasets, respectively. Tree-based approaches such as Random Forest and Gradient Boosting demonstrated similar effectiveness, emphasizing the power of aggregating predictions from multiple trees to reduce bias. Notably, undersampling and SMOTE proved advantageous for non-tree-based models such as KNN, SVM, and Logistic Regression, showing their usefulness across different algorithms. This study provides a robust solution for handling imbalanced datasets in healthcare, which could help optimize healthcare interventions and improve patient outcomes and care.
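The two ideas the abstract relies on, hybrid resampling and Balanced Accuracy, can be illustrated with a minimal sketch. This is not the study's implementation: it uses only the Python standard library, a simplified SMOTE-style interpolation, and hypothetical names and parameters (`smote_like`, `hybrid_resample`, `over_ratio`, `under_ratio`) chosen for illustration. It assumes a binary problem with at least two minority samples, each represented as a tuple of numbers.

```python
import random
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Average of per-class recall, so every class weighs the same."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, label in enumerate(y_true) if label == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def smote_like(minority, n_new, k=3, rng=None):
    """Simplified SMOTE-style step: interpolate between a minority point
    and one of its k nearest minority neighbours (illustrative only)."""
    if rng is None:
        rng = random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


def hybrid_resample(X, y, minority_label, over_ratio=0.5, under_ratio=1.0, seed=0):
    """Oversample the minority class, then undersample the majority class.

    over_ratio and under_ratio are hypothetical thresholds: the desired
    minority/majority ratio after each of the two steps.
    """
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    majority = [(x, label) for x, label in zip(X, y) if label != minority_label]

    # Step 1: add synthetic minority samples up to over_ratio * majority size.
    n_new = max(0, int(over_ratio * len(majority)) - len(minority))
    minority = minority + smote_like(minority, n_new, rng=rng)

    # Step 2: randomly drop majority samples until the ratio reaches under_ratio.
    n_keep = min(len(majority), int(len(minority) / under_ratio))
    majority = rng.sample(majority, n_keep)

    X_res = minority + [x for x, _ in majority]
    y_res = [minority_label] * len(minority) + [label for _, label in majority]
    return X_res, y_res


if __name__ == "__main__":
    rng = random.Random(42)
    X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(90)]
    X += [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(10)]
    y = [0] * 90 + [1] * 10

    X_res, y_res = hybrid_resample(X, y, minority_label=1)
    print(Counter(y), "->", Counter(y_res))

    # A classifier that always predicts the majority class is 90% accurate
    # on this data, yet its balanced accuracy exposes the failure.
    print(balanced_accuracy(y, [0] * len(y)))
```

In practice, library implementations of SMOTE, random undersampling, and balanced ensemble classifiers would normally be preferred over a hand-written version. The sketch is only meant to show why resampling changes the class proportions seen during training and why Balanced Accuracy penalizes a model that ignores the minority class.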


Author Biographies

Antonieta Martínez-Velasco, Universidad Panamericana

Antonieta Martínez-Velasco is a professor and researcher in the Engineering School at the Universidad Panamericana. She received her Ph.D. in engineering from Universidad Panamericana. Her main research areas are data analytics, artificial intelligence, and machine learning techniques applied to social and health sciences. She is a member of the Mexican National Researchers System.

Lourdes Martínez-Villaseñor, Universidad Panamericana

Lourdes Martínez-Villaseñor is a Full-time Professor in the School of Engineering at the Universidad Panamericana, Mexico, and head of the postgraduate academic area. She is a Computer Systems Engineer and a Doctor in Computational Sciences from Tecnológico de Monterrey, Mexico. She has the distinction of level 1 of the National System of Researchers of CONACYT. Her main research interests are artificial intelligence applied to healthcare systems and ethics for artificial intelligence.

Luis Miralles-Pechuán, University College Dublin

Luis Miralles-Pechuán is a Lecturer at Technological University Dublin. He obtained his PhD and Bachelor's degree in Computer Science at the University of Murcia (Spain). He worked as a full-time researcher and lecturer at the Universidad Panamericana in Mexico for three years. He began his PhD in 2012, working on new approaches for online advertising. During his PhD, he became familiar with machine learning (ML) and with many papers on applying ML to online advertising. After finishing his PhD, he held postdoctoral positions at levels I and II at CeADAR, University College Dublin, where he won the prize for supervising the best student paper at the Digital Forensic conference. His research topic is applying reinforcement learning to fight the COVID-19 pandemic and plan containment levels, taking both public health and the economy into account. He also has expertise in human activity recognition, generalized zero-shot learning (GZSL), and applying machine learning to improve the accessibility of websites.


