Prediabetes in young people is an emerging epidemic that disproportionately impacts Hispanic populations. We aimed to develop a metabolite-based prediction model for prediabetes in young people with overweight/obesity at risk for type 2 diabetes.
In independent, prospective cohorts of Hispanic youth (discovery; n = 143 without baseline prediabetes) and predominately Hispanic young adults (validation; n = 56 without baseline prediabetes), we assessed prediabetes via 2-h oral glucose tolerance tests. Baseline metabolite levels were measured in plasma from a 2-h postglucose challenge. In the discovery cohort, least absolute shrinkage and selection operator regression with a stability selection procedure was used to identify robust predictive metabolites for prediabetes. Predictive performance was evaluated in the discovery and validation cohorts using logistic regression.
Two metabolites (allylphenol sulfate and caprylic acid) were found to predict prediabetes beyond known risk factors, including sex, BMI, age, ethnicity, fasting/2-h glucose, total cholesterol, and triglycerides. In the discovery cohort, the area under the receiver operating characteristic curve (AUC) of the model with metabolites and known risk factors was 0.80 (95% CI 0.72–0.87), which was higher than that of the risk factor-only model (AUC 0.63 [0.53–0.73]; P = 0.001). When the predictive models developed in the discovery cohort were applied to the validation cohort, the model with metabolites and risk factors predicted prediabetes more accurately (AUC 0.70 [95% CI 0.55–0.86]) than the same model without metabolites (AUC 0.62 [0.46–0.79]).
Metabolite profiles may help improve prediabetes prediction compared with traditional risk factors. Findings suggest that medium-chain fatty acids and phytochemicals are early indicators of prediabetes in high-risk youth.
Introduction
Nearly one in five adolescents and one in four young adults in the U.S. have prediabetes (1). Young people with prediabetes have a high type 2 diabetes risk, and 8% develop young-onset type 2 diabetes within 3 years of diagnosis (2). Youth with type 2 diabetes have reduced quality of life and an ∼15-year shorter life span than metabolically healthy youth (3). There are significant racial and ethnic disparities in prediabetes, and Hispanic youth and young adults are at particular risk: 22.5% of Hispanic youth and 28.7% of Hispanic young adults have prediabetes compared with 15.8% and 21% of non-Hispanic White youth and young adults, respectively (1). Lifestyle interventions for individuals with prediabetes can decrease type 2 diabetes risk by 40–70% (4), so early identification of youth at risk for prediabetes may allow for targeted intervention to prevent the development of type 2 diabetes. Thus, there is an urgent need to identify biomarkers for early detection of prediabetes in young people at risk for type 2 diabetes.
Plasma metabolites can be informative of metabolic dysregulation and may be early indicators of glucose dysregulation (5). In adults, fasting branched-chain amino acids (6–9), aromatic amino acids (6–9), and bile acids (10) have been found to predict type 2 diabetes development beyond traditional clinical risk factors, including age, sex, BMI, fasting and postprandial glucose, and fasting lipids. Despite the potential for metabolites to predict prediabetes or type 2 diabetes in young people, prospective studies examining associations of metabolite profiles and prediabetes risk in this population are lacking. In two small-scale studies, plasma levels of α-hydroxybutyrate and branched-chain amino acids were associated with insulin resistance (11,12), although these studies both included <20 participants. Almost all previous studies have examined the predictive ability of fasting metabolite profiles. Prediabetes is characterized by decreased metabolic flexibility, which impairs the ability to respond to blood glucose changes (13). Since decreased metabolic flexibility and altered postprandial metabolite profiles may be present years before prediabetes develops (14), metabolite levels after a glucose challenge may be more informative for prediabetes risk in young people than fasting metabolites.
This study aimed to use machine learning techniques to develop a metabolite-based prediction model for risk of prediabetes in youth with overweight/obesity at risk for type 2 diabetes and to validate findings in an independent cohort of young adults with a history of overweight/obesity in adolescence. We screened for impaired fasting glucose and impaired glucose tolerance using 2-h oral glucose tolerance tests (OGTT) and measured untargeted metabolomics at baseline from blood samples collected 2 h after the glucose challenge. We hypothesized that differences in postprandial aromatic and branched-chain amino acids, fatty acids, and bile acids would predict prediabetes development in adolescents and young adults.
Research Design and Methods
Study Populations
Discovery Cohort: Study of Latino Adolescents at Risk
From the Study of Latino Adolescents at Risk (SOLAR), 143 Hispanic adolescents with overweight/obesity and without baseline prediabetes were included in the analysis. Details of SOLAR have been described previously (15,16). SOLAR recruited 328 children in two waves between 2001 and 2012. Participants underwent annual clinical visits at the University of Southern California (USC) General Clinical Research Center or the Clinical Trials Unit. Inclusion criteria for recruitment were age 8–13 years, age- and sex-specific BMI >85th percentile, direct familial history of type 2 diabetes, and Hispanic/Latino ethnicity, defined as both parents and all grandparents self-reporting as Hispanic/Latino. Participants were excluded if they had type 1 or 2 diabetes or were on medications impacting glucose or insulin metabolism. For this analysis, participants were included if they completed a 2-h OGTT at their first and second visits and had metabolomics measured at baseline. Participants were excluded if they had baseline prediabetes, resulting in 143 participants for analysis (Supplementary Fig. 1A). The USC Institutional Review Board provided ethics approval. Participants/guardians provided written informed assent/consent before participation.
Validation Cohort: Southern California Children’s Health Study
To examine replication and generalizability of the metabolites identified in SOLAR, we analyzed 56 young adults without baseline prediabetes from the MetaAir cohort (17,18), a subset of the Southern California Children’s Health Study (CHS) (19). MetaAir included 172 young adults (ages 17–23 years) of mixed ethnicity recruited between 2014 and 2018 (17); 86 participants completed a follow-up visit between 2020 and 2022 (18). Visits occurred at the USC Diabetes and Obesity Research Institute or the Clinical Trials Unit. During visits, demographics were collected via questionnaires, and a 2-h OGTT was performed. Inclusion criteria at baseline were age 17–23 years and a history of overweight/obesity (BMI >85th percentile) in 9th–10th grade (age 14–15 years). Participants were excluded if they had type 1 or 2 diabetes or were on medications impacting glucose or insulin metabolism. For this study, we further excluded participants with baseline prediabetes, leaving 56 participants (Supplementary Fig. 1B). The USC Institutional Review Board provided ethics approval. Before participation, participants/guardians (if applicable) provided written informed assent/consent.
Clinical Outcomes
At each visit, fasting glucose, 2-h glucose, and HbA1c were measured in plasma samples from the OGTT, performed following an overnight fast. Prediabetes/type 2 diabetes were defined using the American Diabetes Association criteria for fasting glucose, 2-h glucose, and HbA1c (20). Type 2 diabetes was defined as fasting glucose ≥126 mg/dL, 2-h glucose ≥200 mg/dL, or HbA1c ≥6.5%. Prediabetes was defined as impaired fasting glucose (fasting glucose between 100 and 125 mg/dL), impaired glucose tolerance (2-h glucose between 140 and 199 mg/dL), or elevated HbA1c (HbA1c between 5.7 and 6.4%) (20). In a subset of SOLAR participants with a third visit (n = 106), stability of continuous outcomes was assessed using intraclass correlation coefficients.
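As an illustration of the outcome definitions above, the following minimal sketch applies the quoted cut points to a single visit. The function name, argument names, and the handling of a missing HbA1c value are assumptions made for this example and are not taken from the study's analytic code.

```python
from typing import Optional


def classify_glycemia(fasting_mg_dl: float,
                      two_hour_mg_dl: float,
                      hba1c_pct: Optional[float] = None) -> str:
    """Classify one visit with the cut points quoted in the text.

    A missing HbA1c (None) is treated as not meeting any HbA1c criterion;
    that choice is an assumption of this sketch.
    """
    a1c = hba1c_pct if hba1c_pct is not None else float("-inf")

    # Type 2 diabetes: any single criterion at or above the upper cut point.
    if fasting_mg_dl >= 126 or two_hour_mg_dl >= 200 or a1c >= 6.5:
        return "type 2 diabetes"

    # Prediabetes: impaired fasting glucose, impaired glucose tolerance,
    # or elevated HbA1c (the diabetes range was already excluded above).
    if fasting_mg_dl >= 100 or two_hour_mg_dl >= 140 or a1c >= 5.7:
        return "prediabetes"

    return "normal"
```

Checking the diabetes range first means the intermediate ranges (100–125 mg/dL, 140–199 mg/dL, 5.7–6.4%) do not need explicit upper bounds.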
Covariates
In both cohorts, sociodemographic information, including age, sex, and ethnicity, was collected with questionnaires, as described previously (15–18). At each visit, height (m) and weight (kg) were measured to calculate BMI (kg/m2). Fasting levels of triglycerides and total, HDL, LDL, and VLDL cholesterol were measured at baseline. To determine pubertal stages in SOLAR, Tanner staging was performed by a physician at each visit using pubic hair and testicular size in boys and pubic hair and breast development in girls (21,22).
Covariates were grouped into general and specific risk factors for type 2 diabetes and were selected based on their established predictive ability of type 2 diabetes in Hispanic and non-Hispanic populations in the U.S. (23). General risk factors were chosen as noninvasive characteristics easily measured in children in a clinical setting (23) and included baseline age, BMI, and sex, and in CHS, ethnicity. Specific risk factors included fasting and 2-h glucose, total cholesterol, and triglycerides. To allow for greater generalizability of prediction models between adolescents and young adults, Tanner stage was not included as a risk factor in the primary models.
Untargeted Plasma Metabolomics
Liquid chromatography with high-resolution mass spectrometry was used to measure untargeted metabolomics in plasma samples from the 2-h OGTT time point by expanding on established methods (24), as described previously (25). Details are provided in the Supplementary Methods. SOLAR and CHS analyses were performed consecutively within 7 days to reduce measurement variability between cohorts. Metabolite features were removed from analysis if they had a coefficient of variability in the quality control samples >30% or if they were detected in <25% of study samples. Nondetected values were imputed with 1 divided by square root of lowest non-0 feature specific value. The number of metabolite features that passed quality control was 23,166.
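The two quality control filters described above can be sketched as a single check per metabolite feature. This is a hedged illustration only: the function name, the coding of nondetected values as `None` or 0, and the use of the sample SD for the coefficient of variability are assumptions, not details confirmed by the source.

```python
from statistics import mean, stdev


def passes_qc(qc_intensities, study_intensities,
              max_cv=0.30, min_detect_rate=0.25):
    """Return True if a feature survives both filters described in the text.

    qc_intensities: intensities of the feature in the quality control samples.
    study_intensities: intensities in study samples; None or 0 marks a
        nondetected value (an assumption of this sketch).
    """
    # Coefficient of variability in the quality control samples.
    cv = stdev(qc_intensities) / mean(qc_intensities)

    # Share of study samples in which the feature was detected.
    detected = sum(1 for v in study_intensities if v is not None and v > 0)
    detect_rate = detected / len(study_intensities)

    # Features are removed if CV > 30% or detection < 25%.
    return cv <= max_cv and detect_rate >= min_detect_rate
```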
Statistical Analysis
Feature selection, model building, and validation were performed in four steps (Fig. 1). First, metabolite annotation and dimensionality reduction were performed. Second, feature selection was conducted in the discovery cohort. Third, the three selected metabolites were refined to those with consistent effects across cohorts. Finally, the predictive performance of the selected metabolites was assessed in the discovery and validation cohorts.
Before analysis, metabolites were log2 transformed and scaled to a standard normal distribution. Extreme values (>7 SDs from the mean) were replaced by the mean feature intensity to reduce the impact of outliers in any single metabolite. Continuous covariates were scaled to obtain comparable effect estimates across variables. Statistical significance was set to P < 0.05. Analyses were performed in R version 4.2 software.
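The preprocessing of a single metabolite feature might look like the sketch below. The text does not state the order in which scaling and outlier replacement were applied, so the order used here (log2 transform, replace extreme values with the feature mean, then standardize) is an assumption, as are the function and parameter names.

```python
import math
from statistics import mean, stdev


def preprocess_feature(intensities, max_sd=7.0):
    """Log2 transform, damp extreme values, and standardize one feature.

    intensities: positive intensities for one metabolite feature
        (nondetected values are assumed to be imputed already).
    """
    logged = [math.log2(x) for x in intensities]

    # Replace values more than max_sd SDs from the mean with the mean.
    mu, sd = mean(logged), stdev(logged)
    cleaned = [mu if abs(v - mu) > max_sd * sd else v for v in logged]

    # Scale to mean 0 and SD 1 using the cleaned values.
    mu2, sd2 = mean(cleaned), stdev(cleaned)
    if sd2 == 0:
        return [0.0 for _ in cleaned]
    return [(v - mu2) / sd2 for v in cleaned]
```

With the sample SD, a value can only lie more than 7 SDs from the mean when the feature has roughly 50 or more observations, so the replacement step has no effect in very small samples.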
Metabolite Annotation and Dimensionality Reduction: Discovery (SOLAR) Cohort
To tentatively annotate the 23,166 metabolite features, we used version 2 of the MetaboAnalyst version 5.0 MS peaks-to-paths module (26), using a mass tolerance of 5.0 ppm and the Human MetaFishNet (MFN) library (27). All features that were tentatively annotated were included in the feature selection (described below). Following the feature selection, the identity of the selected metabolites was further confirmed by comparison with a database of authentic standards analyzed on the same instrument using identical instrumental methods (described in Supplementary Methods).
Feature Selection: Discovery (SOLAR) Cohort
Feature selection was performed in the discovery cohort using least absolute shrinkage and selection operator (LASSO) regression with a stratified subsampling-based stability selection procedure (Fig. 1, step 2). Before feature selection, we used 10-fold cross-validation to determine the LASSO tuning parameter λ that minimized the mean cross-validated error. General and specific risk factors were included in the LASSO regression model by setting λ = 0 for these covariates, so that they were not penalized and were retained in every model. The LASSO regression was implemented using glmnet (28). Feature selection was implemented in a stability selection procedure with 5,000 LASSO regression iterations, each using a stratified random subsample of 70% of participants. Features selected in >85% of the iterations were included in the models. The threshold of 85% was chosen to ensure a parsimonious model that included the minimal number of metabolites needed to predict prediabetes. The top three metabolites were selected in 87–91% of the iterations, whereas the next feature was selected in 81%. The selected metabolites were then refined by retaining those with consistent effects across cohorts (Fig. 1, step 3).
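The subsampling logic of the stability selection procedure can be sketched as follows. This is not the study's implementation: the study used LASSO regression in glmnet as the per-iteration selector, whereas the sketch accepts any selector function and ships with a deliberately simple stand-in (`top_mean_difference`, a hypothetical helper that is not LASSO) so that it runs with the standard library alone. All function names and defaults other than the 5,000 iterations, 70% subsample, and 85% threshold are assumptions.

```python
import random
from collections import Counter


def stratified_subsample(labels, fraction, rng):
    """Indices of a random subsample drawn separately within each outcome class."""
    chosen = []
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        k = max(1, round(fraction * len(idx)))
        chosen.extend(rng.sample(idx, k))
    return sorted(chosen)


def stability_selection(X, y, select_features, n_iter=5000,
                        fraction=0.7, threshold=0.85, seed=0):
    """Tally how often each feature is selected across stratified subsamples.

    select_features(X_sub, y_sub) must return the selected column indices;
    in the study this role was played by a LASSO fit.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_iter):
        idx = stratified_subsample(y, fraction, rng)
        X_sub = [X[i] for i in idx]
        y_sub = [y[i] for i in idx]
        counts.update(set(select_features(X_sub, y_sub)))
    freq = {j: counts[j] / n_iter for j in counts}
    # Keep features selected in more than `threshold` of the iterations.
    stable = sorted(j for j, f in freq.items() if f > threshold)
    return stable, freq


def top_mean_difference(X, y, k=1):
    """Toy stand-in selector (NOT LASSO): k columns with the largest
    absolute difference in means between cases (1) and controls (0)."""
    cases = [row for row, lab in zip(X, y) if lab == 1]
    ctrls = [row for row, lab in zip(X, y) if lab == 0]
    diffs = []
    for j in range(len(X[0])):
        d = (sum(r[j] for r in cases) / len(cases)
             - sum(r[j] for r in ctrls) / len(ctrls))
        diffs.append((abs(d), j))
    return [j for _, j in sorted(diffs, reverse=True)[:k]]
```

Calling `stability_selection(X, y, top_mean_difference, n_iter=200)` returns the stable feature indices and a dictionary of selection frequencies. Swapping in a penalized regression selector would bring the sketch closer to the procedure described above.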
Prediction Models: Discovery (SOLAR) and Validation (CHS) Cohorts
The performance of the LASSO-selected metabolites for predicting prediabetes, impaired fasting glucose, impaired glucose tolerance, and elevated HbA1c (SOLAR only) was evaluated in both cohorts using logistic regression (Fig. 1, step 4). To determine predictive performance in CHS, we used two approaches. First, to validate the metabolites in CHS, model coefficients were estimated in CHS independent of the SOLAR models. Fitting separate models in each cohort allowed for validating metabolite-prediabetes associations despite differences in risk factors between cohorts. Second, to examine the transportability and calibration of the prediction model, we trained a model with data from SOLAR and used the estimates to predict outcomes in CHS. This analysis did not include age and ethnicity as general risk factors due to differences in the study populations.
For each outcome, we compared the area under the receiver operating characteristic curve (AUC) and used the DeLong method to test statistical significance (29) between the following five models:
1. General risk factors (sex, baseline age and BMI, and in CHS, ethnicity)
2. General risk factors plus specific risk factors (fasting glucose, 2-h glucose, fasting total cholesterol, and fasting triglycerides)
3. General risk factors plus metabolites
4. General risk factors, specific risk factors, and metabolites
5. Metabolites only
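For readers less familiar with the AUC used to compare these models, the sketch below computes it directly from its rank-based definition: the probability that a randomly chosen participant who developed the outcome receives a higher predicted risk than one who did not, with ties counted as one half. The function name is an assumption, and the DeLong test for comparing two AUCs is not implemented here.

```python
def roc_auc(scores, labels):
    """AUC as the share of case/control pairs ranked correctly.

    scores: predicted risks (for example, from a logistic regression model).
    labels: 1 for participants who developed the outcome, 0 otherwise.
    """
    cases = [s for s, lab in zip(scores, labels) if lab == 1]
    controls = [s for s, lab in zip(scores, labels) if lab == 0]
    if not cases or not controls:
        raise ValueError("AUC needs at least one case and one control")

    wins = 0.0
    for c in cases:
        for n in controls:
            if c > n:
                wins += 1.0   # case ranked above control
            elif c == n:
                wins += 0.5   # ties count as half
    return wins / (len(cases) * len(controls))
```

The pairwise loop is quadratic in sample size, which is acceptable for cohorts of the size analyzed here.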
Sensitivity Analysis
To examine whether using pubertal status instead of age improved the predictive ability of the statistical models, we refit the five prediction models in SOLAR with Tanner stage instead of age. To examine whether the predictive performance of the models was similar when including all three selected metabolites in step 2 in Fig. 1, we reran the same five models described above with all three metabolites. To assess the potential impact of including cholesterol and triglycerides in the LASSO selection models, we reran the feature selection process omitting these risk factors from the model.
To evaluate the robustness of the feature selection approach in the validation cohort, we compared the AUC from the logistic regression with all risk factors and LASSO-selected metabolites to an empirical distribution of the AUC under the null hypothesis that no metabolites were predictive of prediabetes. First, we randomly selected 2 of the 595 annotated metabolites (the same number as in our final prediction models). We then fit a predictive model in the discovery cohort with the randomly selected metabolites and overlapping general and specific risk factors. Using the coefficients from the discovery cohort, we examined the predictive ability and calculated the AUC in the validation cohort. This was repeated 10,000 times to obtain the empirical AUC distribution under the null. To statistically test the estimated AUC from the final prediction model developed in the discovery cohort and tested in the validation cohort, we calculated the probability of obtaining an AUC equal to or greater than that of the final prediction model under the null hypothesis. We a priori set the probability of a type I error to 0.05 to assess whether the observed AUC from the full model fell within the rejection region.
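The final comparison against the empirical null distribution reduces to two small calculations, sketched below. The inputs are assumed to be the AUCs already obtained from the repeated random draws; the function names and the nearest-rank quantile rule are choices made for this example rather than details reported in the text.

```python
import math


def empirical_p_value(observed_auc, null_aucs):
    """Probability, under the empirical null, of an AUC at least as large
    as the observed one."""
    if not null_aucs:
        raise ValueError("null_aucs must not be empty")
    return sum(a >= observed_auc for a in null_aucs) / len(null_aucs)


def empirical_quantile(values, q):
    """Nearest-rank quantile, e.g. q=0.95 for the rejection threshold."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]
```

An observed AUC is declared better than chance at the 0.05 level when `empirical_p_value` returns a value of 0.05 or less, or equivalently when it exceeds the 95th percentile of the null AUCs.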
Data and Resource Availability
The data sets for the current study are not publicly available because they contain information that could compromise participant privacy but are available from the corresponding author upon reasonable request. The analytic code is available on GitHub (Goodrich-Lab/Plasma_Metabolites_and_risk_of_prediabetes).
Results
Participant Characteristics
Table 1 shows participant characteristics. In SOLAR, 37 participants (26.6%) developed prediabetes at follow-up, and 1 developed type 2 diabetes. In CHS, 15 participants (26.8%) developed prediabetes at follow-up. Owing to the limited sample size, individuals who developed prediabetes or type 2 diabetes were pooled for analysis. In a subset of SOLAR participants with a third visit, the intraclass correlation coefficients for fasting glucose, 2-h glucose, and HbA1c were 0.56, 0.44, and 0.71, respectively, suggesting these outcomes were relatively stable.
Characteristic | SOLAR (n = 143), mean ± SD or n (%) | CHS (n = 56), mean ± SD or n (%) |
---|---|---|
Age (years) | 11.1 ± 1.8 | 19.9 ± 1.2 |
Follow-up duration (years) | 1.2 ± 0.4 | 4.1 ± 1.0 |
Sex | ||
Male | 83 (58) | 27 (48) |
Female | 60 (42) | 29 (52) |
Ethnicity | ||
Hispanic/Latino | 143 (100) | 30 (54) |
Non-Hispanic | — | 26 (46) |
Puberty status | ||
Prepuberty (Tanner stage 1) | 57 (40) | — |
Puberty (Tanner stages 2–4) | 76 (53) | — |
Postpuberty (Tanner stage 5) | 10 (7) | — |
BMI (kg/m2) | 27.3 ± 5.5 | 29.2 ± 4.2 |
BMI percentile | 96.1 ± 4.9 | — |
Fasting glucose (mg/dL) | 88.3 ± 5.0 | 88.6 ± 6.0 |
2-h glucose (mg/dL) | 118.3 ± 12.4 | 109.6 ± 17.6 |
Total cholesterol (mg/dL) | 153.3 ± 24.6 | 156.1 ± 37.9 |
HDL cholesterol (mg/dL) | 38.9 ± 9.6 | 40.5 ± 9.9 |
LDL cholesterol (mg/dL) | 93.7 ± 21.1 | 99.8 ± 33.0 |
VLDL cholesterol (mg/dL) | 20.9 ± 9.6 | 15.8 ± 8.6 |
Triglycerides (mg/dL) | 103.9 ± 48.1 | 79.0 ± 43.0 |
At follow-up | ||
Prediabetes/type 2 diabetes | 38 (27) | 15 (27) |
Fasting glucose >99 mg/dL | 8 (6) | 8 (14) |
2-h glucose >139 mg/dL | 18 (13) | 10 (18) |
HbA1c ≥5.7% (39 mmol/mol) | 15 (10) | 1 (2) |
Metabolites Associated With Prediabetes Risk in Youth
In SOLAR, the metabolome-wide association study identified 1,105 untargeted features (4.2%) associated with risk of prediabetes/type 2 diabetes. Pathway analysis of the metabolome-wide association study identified 83 unique metabolic pathways that were linked to 595 metabolite features and 186 unique metabolites (Supplementary Table 1). The stability selection procedure on the 595 metabolite features identified 3 metabolites that were selected in >85% of the iterations (Supplementary Table 2). These metabolites were caprylic acid, allylphenol sulfate, and taurocholic acid. Taurocholic acid exhibited a different direction of effect between cohorts and was not included in the primary prediction models (Supplementary Table 3).
Metabolites Improve Prediction Beyond Clinical Variables
In models fit in both cohorts independently, the full prediction model containing the general risk factors, the specific risk factors, and the two selected metabolites resulted in the highest AUC for predicting both prediabetes and impaired glucose tolerance (Fig. 2A and Table 2). In SOLAR, the AUC for predicting prediabetes with the complete model, including all risk factors and the two metabolites (model 4), was 0.80 (95% CI 0.72–0.87). This AUC was higher than those of the risk factor-only models, including model 2 with the general and specific risk factors (AUC 0.63 [95% CI 0.53–0.73]) and model 1 with the general risk factors only (AUC 0.59 [0.49–0.69]). The AUCs for predicting impaired fasting glucose, impaired glucose tolerance, and elevated HbA1c were similar to those for prediabetes and improved when adding the two metabolites (Table 2).
Clinical outcome and risk factors | SOLAR (discovery): AUC without metabolites | SOLAR (discovery): AUC with two metabolites | CHS (validation), prediction models refit: AUC without metabolites | CHS (validation), prediction models refit: AUC with two metabolites | CHS (transportability), prediction models from SOLAR†: AUC without metabolites | CHS (transportability), prediction models from SOLAR†: AUC with two metabolites |
---|---|---|---|---|---|---|
Prediabetes/type 2 diabetes | | | | | | |
GRF | 0.59 (0.49, 0.69) | 0.76 (0.68, 0.84)* | 0.68 (0.51, 0.85) | 0.74 (0.59, 0.88) | 0.69 (0.53, 0.84) | 0.72 (0.57, 0.86) |
GRF + SRF | 0.63 (0.53, 0.73) | 0.80 (0.72, 0.87)* | 0.74 (0.61, 0.88) | 0.80 (0.68, 0.92) | 0.62 (0.46, 0.79) | 0.70 (0.55, 0.86) |
No risk factors | — | 0.75 (0.66, 0.83) | — | 0.70 (0.55, 0.85) | — | 0.70 (0.54, 0.85) |
Impaired fasting glucose (fasting glucose ≥100 mg/dL) | | | | | | |
GRF | 0.71 (0.46, 0.96) | 0.73 (0.50, 0.95) | 0.90 (0.80, 1.00) | 0.93 (0.85, 1.00) | 0.69 (0.49, 0.89) | 0.74 (0.57, 0.91) |
GRF + SRF | 0.88 (0.69, 1.00) | 0.90 (0.73, 1.00) | 0.97 (0.92, 1.00) | 0.97 (0.93, 1.00) | 0.79 (0.60, 0.99) | 0.79 (0.66, 0.93) |
No risk factors | — | 0.64 (0.45, 0.83) | — | 0.71 (0.55, 0.87) | — | 0.43 (0.19, 0.67) |
Impaired glucose tolerance (2-h glucose ≥140 mg/dL) | | | | | | |
GRF | 0.60 (0.45, 0.76) | 0.74 (0.62, 0.87)* | 0.70 (0.52, 0.90) | 0.85 (0.74, 0.96) | 0.62 (0.45, 0.81) | 0.76 (0.61, 0.91) |
GRF + SRF | 0.63 (0.49, 0.76) | 0.78 (0.67, 0.88)* | 0.80 (0.64, 0.96) | 0.88 (0.79, 0.97) | 0.64 (0.46, 0.82) | 0.75 (0.60, 0.90) |
No risk factors | — | 0.75 (0.64, 0.85) | — | 0.78 (0.61, 0.96) | — | 0.78 (0.61, 0.96) |
HbA1c ≥5.7% | | | | | | |
GRF | 0.67 (0.49, 0.84) | 0.78 (0.66, 0.90) | — | — | — | — |
GRF + SRF | 0.71 (0.58, 0.84) | 0.81 (0.70, 0.92)* | — | — | — | — |
No risk factors | — | 0.71 (0.55, 0.88) | — | — | — | — |
Data are presented as AUC (95% CI). The two predictive metabolites (caprylic acid and allylphenol sulfate) were identified using LASSO regression and a stability selection procedure in SOLAR, and prediction models with and without the identified metabolites were developed independently in SOLAR and CHS. In CHS, validation was assessed by refitting the prediction models in CHS and using the cohort-specific coefficients to predict the outcomes. Model transportability was assessed in CHS by using the coefficients from the prediction models in SOLAR to predict outcomes in CHS. In CHS, only one participant developed elevated HbA1c, so prediction models were not fit for this outcome. GRF, general risk factors only (sex, age, BMI, and in CHS, Hispanic/Latino ethnicity); SRF, specific risk factors, including fasting glucose, 2-h glucose, total cholesterol, and triglycerides.
*P < 0.05 for the comparison of differences between the AUCs for the model with the metabolites vs. the model without the metabolites.
†General risk factors for assessing transportability included only risk factors that overlapped in both cohorts (sex and BMI).
The predictive ability of the metabolites identified in SOLAR was validated in young adults from CHS by reestimating the prediction coefficients in CHS. The AUC of the complete model with all risk factors and two metabolites (model 4) was 0.80 (95% CI 0.68–0.92) (Fig. 2A and Table 2). This was higher than the AUC from both risk factor-only models, including model 2 with the general and specific risk factors (AUC 0.74 [95% CI 0.61–0.88]) and model 1 with the general risk factors only (AUC 0.68 [0.51–0.85]). The AUCs for predicting impaired glucose tolerance improved when the two metabolites were added and were higher than the AUCs for prediabetes (Table 2). When the transportability and calibration of the prediction models between SOLAR and CHS were assessed, similar results were observed for predicting prediabetes and impaired glucose tolerance (Table 2). For example, for impaired glucose tolerance, the model with the two metabolites and all risk factors outperformed the same model without the two metabolites (AUC 0.75 [95% CI 0.60–0.90] vs. 0.64 [0.46–0.82]) (Table 2).
Sensitivity Analysis
In SOLAR, adding Tanner stage as a covariate did not change the results. When predicting prediabetes, the model with the two metabolites, all risk factors, and Tanner stage significantly outperformed the same model without the two metabolites (AUC 0.81 [95% CI 0.74–0.89] vs. 0.72 [0.63–0.81]). When all three metabolites selected in >85% of the LASSO iterations were included, the prediction results of the models fit independently in SOLAR and CHS were similar to those of the two-metabolite models (Supplementary Table 4). When fasting cholesterol and triglycerides were excluded from the selection model, the same three metabolites were selected, and no new features were selected in >85% of iterations.
In CHS, we compared the AUC of 0.70 for the full prediabetes prediction model developed in SOLAR and applied to CHS with the empirical distribution of AUC values under the null hypothesis (Supplementary Fig. 2). The 95th percentile of this distribution was 0.68, indicating that only 5% of AUC values for the full model would be >0.68 if the two metabolites were selected randomly. The AUC of our final full model was 0.70, suggesting that the LASSO-selected metabolites performed significantly better than randomly selected metabolites for predicting prediabetes using a type I error rate of 0.05.
Conclusions
To our knowledge, this is the largest study to date to identify predictive metabolites for risk of prediabetes in adolescents and young adults. It is also the first to focus on Hispanic adolescents, a high-risk and understudied population for type 2 diabetes. In independent cohorts, we measured untargeted metabolomics in plasma samples collected 2 h after the OGTT glucose challenge. We used robust statistical analysis methods to identify two metabolites associated with risk of prediabetes, impaired glucose tolerance, and elevated HbA1c. These were caprylic acid, a straight-chain fatty acid and potential marker of insulin resistance (30), and allylphenol sulfate, a phenyl sulfate phytochemical found in fruits and vegetables (31). Taurocholic acid, a bile acid linked to gut microbiome dysbiosis (32), was also associated with prediabetes in both cohorts but exhibited opposite directions of effect. In both cohorts, predictive models with the two selected metabolites outperformed models that included only established type 2 diabetes risk factors. These metabolites may be early surrogate markers of prediabetes in adolescents and young adults at elevated risk of type 2 diabetes.
The metabolites identified in our study likely reflect biological and lifestyle factors that together determine risk of prediabetes. For example, allylphenol sulfate is a polyphenol predominately found in fruits and vegetables that was negatively associated with risk of prediabetes. Since this metabolite is not produced endogenously, it may be a surrogate marker of dietary quality in this population. Conversely, caprylic acid levels postglucose challenge may signify biological risk factors for prediabetes. Although previous studies show that caprylic acid can improve insulin secretion and decrease type 2 diabetes risk (33,34), higher caprylic acid levels postglucose challenge may be an early marker of insulin resistance. Caprylic acid and other medium-chain fatty acids decrease following a glucose challenge, but individuals with insulin resistance exhibit smaller decreases than metabolically healthy individuals (30). Thus, lower caprylic acid levels postglucose challenge likely reflect better insulin sensitivity, consistent with our findings. The third metabolite identified was taurocholic acid, a bile acid associated with higher and lower risk of prediabetes in SOLAR and CHS, respectively. Bile acids play a complex role in glucose and lipid metabolism (35). Some studies suggest that bile acids increase the risk of type 2 diabetes (8,10), whereas others have found the opposite associations (36). Although taurocholic acid was selected in SOLAR, it was not necessary to accurately predict prediabetes because the two-metabolite model improved the prediction of prediabetes beyond previously established risk factors.
In contrast to our hypothesis and contrary to previous studies with fasting metabolites, we found that amino acids, short-chain fatty acids, long-chain fatty acids, or lipids did not notably improve prediabetes prediction. Several studies have found that higher fasting levels of aromatic and branched-chain amino acids increase the risk of type 2 diabetes (6–12). In our study, these metabolites were not consistently identified during feature selection. For example, tyrosine, an aromatic amino acid, was selected in 3% of the subsampling iterations, and all other aromatic/branched-chain amino acids were selected less. Similar results were observed for short- and long-chain fatty acids; 2-hydroxybutyrate, a short-chain fatty acid linked to glucose dysregulation in adolescents and adults (7,12), was selected in 9% of the subsampling iterations. We did not consistently identify any longer-chain fatty acids or lipids such as phosphatidylcholines, diacylglycerols, and lysophospholipids, which have been linked to type 2 diabetes (8,10). In a sensitivity analysis, we found that excluding fasting total cholesterol and triglyceride levels as covariates in the LASSO regression did not change whether these metabolites were identified in feature selection; when these covariates were excluded, the same three metabolites were consistently identified. There are several potential reasons we did not identify these metabolites beyond the fact that we assessed metabolites postglucose challenge. First, the duration of sample storage in SOLAR may have led to the degradation of some fatty acids (37). Second, because we aimed to develop a parsimonious and robust prediction model, some metabolites with weaker predictive power but with biologically meaningful associations may have been excluded from our model. Third, although this is the largest study to date in youth, we may have been underpowered to detect associations with these metabolites. 
Although we did not replicate metabolites from adult studies, it is biologically plausible that the metabolites we identified explain important biological and environmental factors that combine to impact prediabetes risk in young people.
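The selection frequencies quoted above (for example, tyrosine in 3% and 2-hydroxybutyrate in 9% of subsampling iterations) come from a stability selection procedure around LASSO regression. As a rough illustration of how such frequencies arise, the following Python sketch refits an L1-penalized logistic regression on random half-samples of synthetic data and records how often each feature keeps a nonzero coefficient. The data, the penalty `lam`, the half-sample fraction, the number of subsamples, the within-subsample standardization, and the proximal-gradient solver are all assumptions made for illustration and are not the settings or software used in this study.

```python
import numpy as np


def l1_logistic(X, y, lam, n_iter=500):
    """L1-penalized logistic regression fitted by proximal gradient descent.

    Illustrative solver only; the intercept is left unpenalized.
    """
    n, p = X.shape
    w = np.zeros(p)
    b = 0.0
    # Step size 1/L, where L upper-bounds the curvature of the mean logistic loss.
    step = 4.0 * n / (np.linalg.norm(X, 2) ** 2 + n)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        resid = prob - y
        w = w - step * (X.T @ resid) / n
        b = b - step * resid.mean()
        # Soft-thresholding sets weak coefficients exactly to zero.
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)
    return w


def selection_frequency(X, y, lam, n_subsamples=100, frac=0.5, seed=0):
    """Fraction of random subsamples in which each feature has a nonzero coefficient."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        Xs = X[idx]
        # Standardize within the subsample so the penalty treats features equally.
        Xs = (Xs - Xs.mean(axis=0)) / Xs.std(axis=0)
        counts += l1_logistic(Xs, y[idx], lam) != 0
    return counts / n_subsamples


# Synthetic stand-in data (hypothetical): features 0 and 1 carry signal, the rest are noise.
rng = np.random.default_rng(42)
n, p = 300, 10
X = rng.normal(size=(n, p))
logit = 1.5 * X[:, 0] - 1.5 * X[:, 1]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

freq = selection_frequency(X, y, lam=0.1)
for j, f in enumerate(freq):
    print(f"feature {j}: selected in {100 * f:.0f}% of subsamples")
```

In this toy setup, only features 0 and 1 carry signal, so they should be retained in most subsamples, whereas the noise features should be retained rarely. A metabolite chosen in only a few percent of iterations behaves like the latter. The cut-off frequency used to call a feature stable is an analyst's choice and is deliberately left out of the sketch.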
This study has several strengths. First, the discovery cohort included only Hispanic adolescents, a high-risk and understudied population for young-onset type 2 diabetes (38). We validated results in a mixed-ethnicity cohort, suggesting that these associations may be generalizable to other populations. Second, we used robust statistical analysis methods and validated our findings in an independent cohort of young adults, adding evidence that the identified metabolites may predict prediabetes in young people. Third, we used untargeted metabolomics, allowing for greater coverage of metabolite species than some targeted panels.
Despite these strengths, this study has some limitations. First, although this is the largest study in young people, our sample size is smaller than some adult studies, potentially limiting the detection of other significant metabolite associations and leading to relatively large CIs for the AUC in the validation cohort.
Second, we observed a relatively low conversion to type 2 diabetes. In both cohorts, only 1% of participants developed type 2 diabetes, contrasting with data in adults indicating that >5% of individuals with obesity develop type 2 diabetes (39). However, since prediabetes in adolescents increases early-onset type 2 diabetes risk (2), longer follow-up may reveal a substantial progression to type 2 diabetes.
Third, although the OGTT was performed following an overnight fast under the same guidelines as clinical settings, activity and diet were not strictly controlled in the days prior to the OGTT. Although it is possible that these factors differed between cohorts and that this could have introduced noise into the metabolite measurements and prevented detection of other predictive metabolites, using the same protocol as clinical settings makes our findings more generalizable to clinical practice.
Fourth, although caprylic acid and taurocholic acid were confirmed with level 1 annotation, the annotation for allylphenol sulfate was limited to level 2b (40). Level 2b annotations have confirmed molecular formulas and a probable structure. Allylphenol sulfate is the only molecule in the Human Metabolome Database with this specific molecular formula, increasing the likelihood that it is correctly annotated.
Fifth, long-term sample storage can lead to metabolite degradation (37), especially for fatty acids. This potentially explains discrepancies with previous studies regarding long-chain fatty acids.
Finally, because the prediction model relies on metabolite levels measured 2 h after the glucose challenge, implementing it in routine clinical practice may be challenging. However, the identified metabolites could supplement recommended OGTT screenings in high-risk adolescents (20), which could help improve risk stratification and prevent under- or overtreatment of adolescents at risk for prediabetes.
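The wide CIs noted in the first limitation follow directly from sample size. The Python sketch below, offered only as an illustration, computes a percentile bootstrap CI for the AUC on simulated scores at the two cohort sizes reported here (n = 56 and n = 143) and at a hypothetical larger cohort (n = 1,000). The simulated prevalence, the effect size, and the choice of a percentile bootstrap are assumptions; the method used to derive the CIs in this study may differ.

```python
import numpy as np


def auc(y, score):
    """AUC as the probability that a random case outranks a random non-case (ties count half)."""
    pos = score[y == 1]
    neg = score[y == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))


def bootstrap_auc_ci(y, score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (illustrative choice of method)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)
        if y[idx].min() == y[idx].max():
            continue  # resample contains a single class, so the AUC is undefined
        stats.append(auc(y[idx], score[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi


def simulate(n, seed):
    """Hypothetical cohort: 30% incidence, with scores shifted upward for cases."""
    rng = np.random.default_rng(seed)
    y = (rng.random(n) < 0.3).astype(int)
    score = rng.normal(size=n) + 0.8 * y
    return y, score


widths = {}
for n in (56, 143, 1000):
    y, score = simulate(n, seed=1)
    lo, hi = bootstrap_auc_ci(y, score)
    widths[n] = hi - lo
    print(f"n = {n}: AUC {auc(y, score):.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

Because the interval width shrinks roughly with the square root of the number of cases and non-cases, a cohort the size of the validation sample will generally yield a much wider interval than a cohort of 1,000, even when the underlying discrimination is identical.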
In conclusion, we identified two metabolites that may improve prediabetes prediction in Hispanic adolescents at risk for type 2 diabetes beyond previously established risk factors. These findings provide a new avenue for identifying adolescents at high risk of prediabetes and could help guide personalized treatment to prevent prediabetes/type 2 diabetes in this high-risk population.
This article contains supplementary material online at https://doi.org/10.2337/figshare.24437953.
J.A.G. and H.W. contributed equally to this work.
Article Information
Funding. Funding for SOLAR came from the National Institutes of Health, National Institute of Diabetes and Digestive and Kidney Diseases grant R01DK59211 and National Institute of Environmental Health Sciences (NIEHS) grant R01ES029944. Funding for the MetaAir/MetaChem study came from NIEHS (R01ES029944), the Southern California Children’s Environmental Health Center grants funded by NIEHS (P01ES022845, 5P30ES007048, P01ES011627), the United States Environmental Protection Agency (RD83544101), and the Hastings Foundation. Additional funding from National Institutes of Health supported J.A.G., L.C., D.V.C., and T.L.A. (National Human Genome Research Institute [NHGRI], U01HG013288), J.A.G., L.C., and D.V.C. (NIEHS P30ES007048), J.A.G. (NIEHS, T32ES013678, National Institute of General Medical Sciences, R25GM143298), L.C. and D.V.C. (NIEHS, R01ES030364, R01ES030691, R21ES028903, and R21ES029681), Z.C. (NIEHS, R00ES027870), D.V. (NIEHS, R01ES033688, R21ES029328, K12ES033594, and P30ES023515), D.I.W. (NIEHS, U2CES030859, R01ES032831), D.V.C. (National Cancer Institute, P01CA196569), T.L.A. and M.I.G. (National Institute on Minority Health and Health Disparities, P50MD017344), T.L.A. (NIEHS, R00ES027853), and D.P.J. (NIEHS, U2CES030163, P30ES019776, R24ES029490, R01ES032189, and R21ES031824).
Duality of Interest. No potential conflicts of interest relevant to this article were reported.
Author Contributions. J.A.G., H.W., D.I.W., X.L., X.H., T.L.A., Z.C., D.V., B.O.B., S.R., K.B., F.D.G., M.I.G., D.P.J., D.V.C., and L.C. acquired, analyzed, or interpreted the data. J.A.G., H.W., D.I.W., X.L., X.H., T.L.A., Z.C., D.V., B.O.B., S.R., K.B., F.D.G., M.I.G., D.P.J., D.V.C., and L.C. critically revised the manuscript for important intellectual content. J.A.G., H.W., D.I.W., T.L.A., Z.C., D.V., K.B., F.D.G., M.I.G., D.P.J., D.V.C., and L.C. contributed to concept and design. J.A.G., H.W., D.V.C., and L.C. drafted the manuscript. J.A.G., H.W., D.V.C., and L.C. performed the statistical analysis. D.I.W., X.L., X.H., S.R., and D.P.J. provided administrative, technical, or material support. J.A.G. and L.C. are the guarantors of this work and, as such, had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.