A Machine Learning Approach to Predicting Type 2 Diabetes in Young Adults
Khang Le (1), Damla Duendar (1), Amelia Blanton (2), Temitope Ogundare (2), Anne Thompson (2,3), Brittany Gouse (2,3), Hannah E. Brown (2,3), Archana Venkataraman (1)
1. Department of Electrical and Computer Engineering, Boston University, USA
2. Wellness & Recovery After Psychosis Research Program, Boston Medical Center, Boston, MA
3. Boston University Avedisian and Chobanian School of Medicine, Boston, MA
Background: Type 2 diabetes mellitus (T2DM) associated with second-generation antipsychotics is a leading driver of premature mortality in schizophrenia. Cardiometabolic risk across the first-episode psychosis (FEP) population is heterogeneous. We developed novel machine learning methods to predict incident T2DM as a first step towards incorporating this information into subsequent studies to reduce mortality in FEP.
Methods: Using data from the 2019–2020 Healthy Minds Study, we applied a conventional machine learning pipeline to predict T2DM. Samples with missing values were excluded. For feature selection, the variance of each feature was calculated and features falling below a variance threshold were removed. The dataset was then split into training, validation, and test sets in a 60-20-20 ratio. To focus on the minority positive class, the training data were under-sampled to achieve a 1:3 class ratio between negative and positive classes. Hyperparameters for each model (XGBoost, Ordinal Logistic Regression, SVM, and Random Forest) were tuned using Latin Hypercube Sampling. Models were trained on the processed training set and evaluated on the test set to assess predictive performance.
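The preprocessing and search steps described in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not report the variance threshold, the software used, or the hyperparameter ranges, so every function name, the toy data, and the search bounds below are hypothetical. The sketch uses only the Python standard library, reads the 1:3 ratio literally as negatives to positives, and omits model training.

```python
import random
import statistics


def drop_low_variance(rows, threshold):
    """Keep only the columns whose population variance exceeds `threshold`."""
    n_cols = len(rows[0])
    kept = [j for j in range(n_cols)
            if statistics.pvariance([row[j] for row in rows]) > threshold]
    return [[row[j] for j in kept] for row in rows], kept


def split_60_20_20(n_samples, rng):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = n_samples * 60 // 100
    n_val = n_samples * 20 // 100
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])


def undersample_negatives(indices, labels, neg_parts, pos_parts, rng):
    """Keep every positive and a random subset of negatives so that
    negatives : positives is at most neg_parts : pos_parts."""
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    n_keep = min(len(neg), len(pos) * neg_parts // pos_parts)
    return pos + rng.sample(neg, n_keep)


def latin_hypercube(n_samples, bounds, rng):
    """Draw n_samples points; each dimension is cut into n_samples equal
    strata and every stratum is used exactly once."""
    columns = []
    for low, high in bounds:
        unit = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(unit)  # decouple the strata order across dimensions
        columns.append([low + u * (high - low) for u in unit])
    return [tuple(col[i] for col in columns) for i in range(n_samples)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Toy labels, NOT from the study: 100 samples, 30 of them positive.
labels = [1] * 30 + [0] * 70
train_idx, val_idx, test_idx = split_60_20_20(len(labels), rng)

# Under-sample the training part only; validation and test stay untouched.
balanced_train = undersample_negatives(train_idx, labels, 1, 3, rng)

# Hypothetical search ranges for two hyperparameters (assumed, not reported).
candidates = latin_hypercube(8, [(0.01, 0.3), (2.0, 10.0)], rng)
```

Each tuple in `candidates` would be one hyperparameter configuration to fit on the training set and score on the validation set. Unlike plain random search, the stratification guarantees that every slice of each range is tried once.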
Results: The XGBoost model performed best among the models evaluated, achieving an ROC-AUC of 0.7082 (±0.0414), accuracy of 0.6835 (±0.0334), precision of 0.7553 (±0.0584), recall of 0.6425 (±0.0507), specificity of 0.8524 (±0.0479), and an AUC-PR of 0.7432 (±0.0344). XGBoost was further evaluated on a schizophrenia dataset, achieving an ROC-AUC of 0.8729, accuracy of 0.8973, precision of 0.6101, recall of 0.6250, specificity of 0.9544, and an AUC-PR of 0.6183.
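For readers less familiar with the metrics reported in the Results, the sketch below shows their standard definitions in terms of confusion-matrix counts, plus ROC-AUC as a pairwise ranking statistic. The function names and the toy labels and scores are hypothetical and are not taken from the study; AUC-PR is omitted for brevity.

```python
def binary_metrics(y_true, y_pred):
    """Threshold-based metrics from confusion-matrix counts.

    Assumes both classes are present and at least one positive prediction,
    so that no denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp),    # of predicted positives, share correct
        "recall": tp / (tp + fn),       # of true positives, share found
        "specificity": tn / (tn + fp),  # of true negatives, share found
    }


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative
    (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example, NOT study data: 3 positives and 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cut-off

metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Because accuracy is a prevalence-weighted average of recall and specificity, it can look high on an imbalanced sample even when recall is modest, which is one reason to report all of these together.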
Conclusions: Predicting T2DM using machine learning is feasible in a community sample of young adults, and the approach may generalize to a schizophrenia-only sample. External validation in a larger clinical sample of FEP patients is needed to assess the clinical utility of our model.