Introduction: Many biobanks use machine learning algorithms to predict patient phenotypes such as type 1 diabetes (T1D). It is not known how well these algorithms perform in patients with different ancestral backgrounds, or whether incorporating genetic data improves phenotypic classification.

Methods: We examined 40,843 patients in the Mass General Brigham (MGB) Biobank, including 34,870 white patients and 4,872 non-white patients. The MGB machine learning algorithm predicted 242 patients to have T1D. Using chart review as the gold-standard diagnosis, we assessed the accuracy of the MGB algorithm across self-reported races and genetic ancestries. We divided the dataset into separate training and validation cohorts and constructed a logistic regression model that combined the output of the machine learning algorithm with previously published T1D polygenic scores to classify patients as T1D cases or controls.
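The modeling step described in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the data are synthetic, the continuous `algo` score, the `pgs` values, the 280/120 split, and the gradient-descent fitting routine are all hypothetical choices made for the example. It fits one logistic model on the algorithm score alone and one on the algorithm score plus a polygenic score, then compares their AUCs in the held-out cohort.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(w, x):
    # Predicted probability of being a T1D case; w[0] is the intercept.
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent.

    Returns [intercept, w_1, ..., w_d].
    """
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def auc(scores, labels):
    # Probability that a random case outscores a random control (ties count 0.5).
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Synthetic data only (hypothetical; not drawn from the MGB Biobank).
rng = random.Random(0)
labels, algo, pgs = [], [], []
for _ in range(400):
    case = 1 if rng.random() < 0.3 else 0
    labels.append(case)
    algo.append(rng.gauss(1.0 if case else -1.0, 1.0))  # assumed algorithm score
    pgs.append(rng.gauss(1.0 if case else 0.0, 1.0))    # assumed polygenic score

split = 280  # first 280 rows train, remaining 120 validate
X_algo = [[a] for a in algo]
X_both = [[a, g] for a, g in zip(algo, pgs)]

w_algo = fit_logistic(X_algo[:split], labels[:split])
w_both = fit_logistic(X_both[:split], labels[:split])

auc_algo = auc([predict(w_algo, x) for x in X_algo[split:]], labels[split:])
auc_both = auc([predict(w_both, x) for x in X_both[split:]], labels[split:])
print(f"validation AUC, algorithm only: {auc_algo:.3f}")
print(f"validation AUC, algorithm + polygenic score: {auc_both:.3f}")
```

In practice a statistics package would replace the hand-written fitting loop; the loop is kept here only so that the sketch has no dependencies.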

Results: The MGB machine learning algorithm for T1D had a positive predictive value of 77.8% for white patients, but only 62.2% for non-white patients (p = 0.04). Among patients with predominantly European genetic ancestry, adding a polygenic score derived from European populations to the MGB classification algorithm improved the area under the receiver operating characteristic curve (AUC) from 0.928 to 0.974 (p < 0.05). Among patients whose predominant genetic ancestry was not European, adding a polygenic score derived from African populations improved the AUC from 0.889 to 0.920 (p > 0.05). In both cases, the improvement in AUC was less substantial when using non-ancestry-matched polygenic scores.
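For readers unfamiliar with the metric, positive predictive value (PPV) is the fraction of algorithm-predicted cases that chart review confirms, and PPVs in two groups can be compared with a test on the corresponding 2x2 table. The sketch below is illustrative only: the abstract does not state which test produced p = 0.04, so the use of Fisher's exact test is an assumption, and the counts are invented placeholders rather than study data.

```python
from math import comb


def ppv(true_pos, false_pos):
    # Positive predictive value: confirmed cases / all predicted cases.
    return true_pos / (true_pos + false_pos)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Hypothetical counts (NOT from the study): confirmed vs. unconfirmed
# algorithm-predicted cases in two groups.
group_a = (40, 10)  # (true positives, false positives)
group_b = (12, 8)

print(f"PPV group A: {ppv(*group_a):.3f}")
print(f"PPV group B: {ppv(*group_b):.3f}")
print(f"Fisher exact p: {fisher_exact_two_sided(*group_a, *group_b):.3f}")
```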

Conclusions: Automated prediction tools for T1D are imperfect, and performance may differ by patient race. Incorporating ancestry-specific polygenic scores into phenotyping algorithms can improve diagnostic accuracy. This may help reduce healthcare and research disparities by refining the automated classification of T1D in all racial and ancestral groups.


Disclosures: A.J. Deutsch: None. T. Majarian: Employee; Vertex Pharmaceuticals Incorporated. J.M. Mercader: None. J.C. Florez: Consultant; AstraZeneca, Goldfinch Bio, Inc. Other Relationship; AstraZeneca, Merck & Co., Inc., Novo Nordisk. M. Udler: None.


Funding: National Institutes of Health (K23DK114551, T32DK007028)

Readers may use this article as long as the work is properly cited, the use is educational and not for profit, and the work is not altered. More information is available at