Introduction: This study assesses how well machine learning models predict gestational diabetes mellitus (GDM) from electronic health record (EHR) data collected in the first trimester.

Methods: Data were extracted from the EHRs of The Coombe Hospital, Dublin, covering 2018 to 2022. The year 2020 was excluded because of COVID-19-related deviations from usual screening practices. We evaluated four machine learning models (Random Forest, XGBoost, Logistic Regression, and Explainable Boosting Machine) using the area under the receiver operating characteristic curve (ROC AUC) and average precision (AP). Models were trained on EHR data collected during the first prenatal visit (8-14 weeks).
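For readers less familiar with the two evaluation metrics, the following is a minimal sketch of how ROC AUC and AP can be computed from true labels and predicted scores. It is not the study's evaluation code: the function names `roc_auc` and `average_precision` are illustrative, and the labels and scores are toy values, not study data.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random positive outranks a random
    negative (Mann-Whitney formulation); tied scores count as half a win."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """AP as the sum over score thresholds of (recall gain) * precision."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank cases from highest to lowest predicted score.
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    ap, tp, i = 0.0, 0, 0
    while i < len(ranked):
        # Treat all cases sharing one score as a single threshold.
        j, tp_group = i, 0
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp_group += ranked[j][1]
            j += 1
        tp += tp_group
        precision = tp / j  # j cases have been flagged so far
        ap += (tp_group / n_pos) * precision
        i = j
    return ap


# Toy example (hypothetical values, not study data).
labels = [0, 0, 1, 1]
predicted = [0.1, 0.4, 0.35, 0.8]
print(round(roc_auc(labels, predicted), 3))            # 0.75
print(round(average_precision(labels, predicted), 3))  # 0.833
```

ROC AUC summarizes ranking quality across all thresholds, whereas AP focuses on the positive class and is therefore more sensitive to class imbalance.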

Results: After processing, 27,500 pregnancies and 3,100 GDM cases were analyzed. Logistic Regression performed consistently well (full feature set: ROC AUC = 0.821, AP = 0.39; top 13 features: ROC AUC = 0.818, AP = 0.39; first pregnancy only: ROC AUC = 0.826, AP = 0.38). The other models varied in performance, and some declined slightly when the feature set was reduced or when only first pregnancies were considered.
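One way to put the AP values in context is to compare them with the AP expected from a classifier with no skill, which is approximately the prevalence of the positive class. The sketch below does this arithmetic under two assumptions not stated in the abstract: that the 3,100 GDM cases are counted among the 27,500 pregnancies, and that the evaluation data had roughly the same prevalence as the full cohort.

```python
# Counts and AP as reported in the Results paragraph.
n_pregnancies = 27_500
n_gdm_cases = 3_100
reported_ap = 0.39  # Logistic Regression, full feature set

# Assumption: the GDM cases are a subset of the analyzed pregnancies.
prevalence = n_gdm_cases / n_pregnancies

# A no-skill classifier has an expected AP close to the prevalence.
lift_over_baseline = reported_ap / prevalence

print(f"prevalence (approximate no-skill AP): {prevalence:.3f}")
print(f"reported AP relative to baseline: {lift_over_baseline:.2f}x")
```

Under these assumptions the reported AP is roughly three and a half times the no-skill baseline, which is easier to interpret than the raw value alone.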

Conclusion: Logistic regression appears to match or exceed the performance of more sophisticated machine learning models when predicting GDM from EHR data. It has the added benefit of producing models that are easily explainable to healthcare practitioners and stakeholders.

Disclosure

M.A. Germaine: None. A.C. O'Higgins: None. G. Healy: None. B. Egan: None.

Funding

This work has emanated from research supported in part by a grant from Science Foundation Ireland (18/CRT/6183).

Readers may use this article as long as the work is properly cited, the use is educational and not for profit, and the work is not altered. More information is available at http://www.diabetesjournals.org/content/license.