In the rapidly evolving landscape of health care, commercially available large language models (LLMs) like GPT-4 are being examined for their potential to assist in clinical decision-making, especially in scenarios where no single “right” answer exists (1). This challenge is particularly evident in the management of type 2 diabetes, where providers often face multiple guideline-concordant options for first-line therapy yet lack clear consensus on the best approach (2–4). With the advent of newer therapeutic options, questions of drug accessibility and cost, and considerations of other metabolic risks, clinical complexity and uncertainty have made management decisions increasingly difficult. Against this backdrop, in this issue of Diabetes Care, Flory et al. (1) examine LLMs’ decision-making in diabetes care, particularly medication choices, such as between metformin and its alternatives, that depend on patient characteristics (1,5–11). In comparing LLM-driven decisions with those made by experienced endocrinologists, the article highlights where LLMs can contribute and where they may fall short in supporting diabetes care.

Clinical guidelines from the American Diabetes Association and the European Association for the Study of Diabetes (2), as well as the American Association of Clinical Endocrinology (3) and the Endocrine Society (4), have traditionally advocated first-line metformin use. However, given metformin’s relative and absolute contraindications, treatment plans must account for individual patient factors, including patient-centered glycemic targets. These guidelines continue to champion lifestyle modifications, but successfully implementing such modifications requires attention to several factors related to social determinants and health access (2). The 2024 update of the American Diabetes Association “Standards of Care in Diabetes” now highlights obesity management and the cardiorenal benefits of sodium–glucose cotransporter 2 inhibitors and glucagon-like peptide 1 receptor agonists (5,6,12,13). Given this complexity, it is crucial to evaluate how well LLMs like GPT-4 navigate these decisions.

In their study, Flory et al. (1) assess agreement between endocrinologists and GPT-4 in selection of initial antidiabetes medications for patients with type 2 diabetes, particularly focusing on metformin use. The analysis compared the endocrinologists’ (n = 31) and GPT-4’s preferences for metformin versus alternative treatments in response to 40 clinical vignettes, the consistency and robustness of GPT-4’s responses across multiple runs, and the influence of varied prompts that emphasized safety, cost, or adherence to guidelines.
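
To make the mechanics of such an evaluation concrete, the following is a minimal sketch, not the authors’ actual protocol: the vignette text, prompt wordings, answer-parsing rule, and run count are invented placeholders. It queries a chat model repeatedly for each vignette under different prompt framings and tallies how often metformin is chosen, which also exposes run-to-run variability of the kind the study examined.

```python
# Minimal sketch of a vignette-based LLM evaluation, loosely modeled on the
# study design. Prompts, vignettes, and the answer-parsing rule are invented
# placeholders. Requires the `openai` package and an OPENAI_API_KEY.
from collections import Counter
from openai import OpenAI

client = OpenAI()

PROMPTS = {  # hypothetical prompt variants emphasizing different priorities
    "baseline": "Choose the single best initial diabetes medication.",
    "safety": "Prioritize safety. Choose the single best initial medication.",
    "cost": "Prioritize low cost. Choose the single best initial medication.",
}

def ask(vignette: str, system_prompt: str, runs: int = 5) -> Counter:
    """Query the model `runs` times and tally first-line drug choices."""
    tally = Counter()
    for _ in range(runs):
        resp = client.chat.completions.create(
            model="gpt-4",
            temperature=1.0,  # nonzero temperature exposes run-to-run variability
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": vignette + "\nAnswer with the drug name only."},
            ],
        )
        answer = resp.choices[0].message.content.strip().lower()
        tally["metformin" if "metformin" in answer else "alternative"] += 1
    return tally

vignettes = ["68-year-old with type 2 diabetes, eGFR 38, BMI 34, A1C 8.2%"]  # placeholder
for v in vignettes:
    for name, prompt in PROMPTS.items():
        print(name, ask(v, prompt))
```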

The findings revealed that endocrinologists chose metformin in 31% of cases, while GPT-4 selected metformin 25% of the time after prompt refinement, an absolute difference of 5.5 percentage points. Both groups were less likely to recommend metformin for patients with reduced kidney function. GPT-4 generally aligned with clinical guidelines but was more cautious, especially regarding gastrointestinal symptoms. The model’s recommendations also varied across prompts, particularly when a “metformin nudge” was included, indicating some inconsistency.

The study by Flory et al. has several strengths, including real-world relevance in testing GPT-4’s decision-making in scenarios that mirror routine practice. The comparison of GPT-4’s responses with those of endocrinologists provides a valuable benchmark for assessment of alignment with expert opinion. The analysis of personalized responses, based on patient characteristics, enhances applicability to precision medicine. Additionally, exploring the variability in GPT-4’s recommendations, driven by factors like temperature settings and tailored prompts, deepens understanding of how LLMs handle clinical ambiguity. The use of various prompt versions, including nudges toward metformin prioritization and cost reduction, further clarifies their impact on LLM decision-making.

The study also has notable limitations. Hypothetical vignettes cannot capture the complexity and unpredictability of actual clinical practice, which may limit the generalizability of the results (14). Bias introduced by certain prompt structures might also influence GPT-4’s decision-making. Furthermore, while variability in responses could indicate clinical equipoise, it might also reflect gaps in the LLM’s understanding rather than true uncertainty.

One important finding is GPT-4’s tendency to favor newer drug classes, reflecting its reliance on up-to-date data but raising concerns about patient costs and cost-effectiveness. In a health care system already burdened by rising costs, adopting artificial intelligence (AI)-driven recommendations without clinician oversight could add to this burden (15). The results of the study by Flory et al. emphasize the need to evaluate LLM recommendations in terms of cost and patient access. To evaluate the cost impacts of GPT-4’s recommendations, a direct comparison between GPT-4’s suggested treatments and clinician choices would clarify potential financial impacts on patients and payers. For example, GPT-4’s preference for newer, more expensive drugs like sodium–glucose cotransporter 2 inhibitors or glucagon-like peptide 1 receptor agonists can be contrasted with lower-cost alternatives like metformin or sulfonylureas. Such an analysis would reveal the incremental cost difference per patient vignette and whether LLM recommendations lead to significantly higher drug costs. Also crucial is consideration of the potential impact on health equity, as LLM recommendations could exacerbate disparities by favoring higher-cost treatments. Further analyses could illuminate how variations in clinical vignettes and prompts, such as those addressing drug pricing, patient adherence, or insurance coverage, affect GPT-4’s recommendations, which is essential for preventing the propagation of existing health care disparities.
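
As a toy illustration of such a cost comparison, with invented placeholder prices and per-vignette selections rather than real formulary or study data, the sketch below computes the incremental monthly drug cost per vignette between model and clinician choices.

```python
# Toy incremental-cost comparison between LLM and clinician drug choices.
# All prices and selections are illustrative placeholders, not real data.
MONTHLY_COST_USD = {
    "metformin": 10,
    "sulfonylurea": 12,
    "sglt2_inhibitor": 550,
    "glp1_receptor_agonist": 950,
}

# Hypothetical per-vignette selections: (llm_choice, clinician_choice).
choices = [
    ("sglt2_inhibitor", "metformin"),
    ("glp1_receptor_agonist", "metformin"),
    ("metformin", "metformin"),
]

deltas = [MONTHLY_COST_USD[llm] - MONTHLY_COST_USD[doc] for llm, doc in choices]
for (llm, doc), d in zip(choices, deltas):
    print(f"LLM: {llm:22s} clinician: {doc:12s} incremental ${d:+}/month")
print(f"Mean incremental cost: ${sum(deltas) / len(deltas):+.2f}/month per vignette")
```

Even this crude arithmetic makes the equity stakes visible: a systematic tilt toward newer agents translates directly into higher per-patient spending unless clinicians or payers intervene.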

The study results also underscore the need for human oversight in AI-assisted diabetes care, as LLMs like GPT-4 can analyze data but lack the clinical judgment required to address individual patient needs (14). Trust in AI must be built on its ability to support, not replace, clinician decision-making. Recent studies highlight both the potential and limitations of LLM-generated medical responses. Goodman et al. (16) and Johnson et al. (17) found that while AI models often provided accurate answers, they lacked consistency due to incomplete or incorrect information, impacting clinical decision-making. Ye et al. (18) reported that patients rated AI responses similarly to physicians’, but physicians considered them inferior. Another study showed that AI responses matched or exceeded physician responses in empathy and quality (19). These findings suggest that while AI holds promise, thorough validation is crucial for its safe and effective use in clinical decision-making (20).

Human-in-the-loop (HITL) systems offer a promising solution in combining AI with clinician expertise (21). In an HITL system, AI generates recommendations or insights, but the final decision is made by a human who takes the patient’s unique circumstances, preferences, and medical history into account. This approach ensures that LLM suggestions are integrated into the clinician-patient dialogue for diabetes care. HITL systems may improve both the quality and safety of care, as the clinician retains control over decisions, using AI to enhance efficiency and support informed choices. As health care adopts AI, HITL models can balance innovation with the irreplaceable value of human judgment. A feedback loop for reporting issues and safety events could be used to identify hallucinations and inaccuracies, enabling refinement and retraining of these systems to produce outputs that align more closely with clinical expertise for safer, more effective care (22,23) (Fig. 1).

Figure 1

HITL for LLM-assisted diabetes medication management. HITL ensures that clinicians maintain control over AI-assisted decisions, with a feedback loop between clinician and LLM system to detect inaccuracies and refine the system for safer, expert-aligned care.

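One way such a feedback loop could be wired up in software is sketched below; this is a schematic under invented interfaces (the `Recommendation`, `FeedbackLog`, and `clinician_review` names are hypothetical), not a production system. Every model recommendation passes through an explicit clinician review step, and overrides are logged for later review and retraining, mirroring the loop in Fig. 1.

```python
# Schematic HITL flow: the LLM proposes, the clinician disposes, and every
# override is logged as feedback for refinement and retraining (Fig. 1).
# All names and interfaces here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Recommendation:
    patient_id: str
    drug: str
    rationale: str

@dataclass
class FeedbackLog:
    events: list = field(default_factory=list)

    def record(self, rec: Recommendation, final_drug: str, note: str) -> None:
        # Overrides and safety events become training/refinement signal.
        self.events.append({
            "patient_id": rec.patient_id,
            "llm_drug": rec.drug,
            "final_drug": final_drug,
            "overridden": rec.drug != final_drug,
            "note": note,
        })

def clinician_review(rec: Recommendation) -> tuple[str, str]:
    """Placeholder for the human decision; a real system would surface the
    recommendation and rationale in the EHR and capture the clinician's choice."""
    if "eGFR 28" in rec.rationale:  # toy rule: metformin avoided below eGFR 30
        return "glp1_receptor_agonist", "metformin avoided due to low eGFR"
    return rec.drug, "accepted"

log = FeedbackLog()
rec = Recommendation("pt-001", "metformin", "A1C 8.1%, eGFR 28, cost-sensitive")
final, note = clinician_review(rec)
log.record(rec, final, note)
print(log.events)
```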

For successful implementation of LLMs in diabetes care, frameworks like the Consolidated Framework for Implementation Research (CFIR) and Reach, Effectiveness, Adoption, Implementation, and Maintenance (RE-AIM) are crucial in evaluating effectiveness, adoption, and scalability (24,25). Piloting LLMs in diverse clinical settings will provide real-world data on their role in treatment decisions. Engaging stakeholders will help identify needs and barriers, ensuring that LLMs are refined to enhance clinician buy-in and patient engagement. Additionally, training clinicians to interpret LLM recommendations will reinforce human oversight and build trust in AI for diabetes management.

The work by Flory et al. is a key first step for understanding how LLMs can aid complex diabetes treatment decisions, especially in uncertain cases. We emphasize that LLMs must work within HITL systems that prioritize clinician oversight, patient-centered care, and evidence-based practice. Future research should focus on demonstrating AI’s impact on diabetes outcomes and developing frameworks for its responsible, equitable use. The findings of Flory et al. set an important benchmark for the utility of current LLMs, which are only going to improve in the years to come.

See accompanying article, p. 185.

Funding. This work was supported by National Institute on Aging grant 2P30-AG028716-16 (to J.M.P.) and a Research Career Scientist award from the Department of Veterans Affairs (RCS 10-391 to M.L.M.).

Duality of Interest. No potential conflicts of interest relevant to this article were reported.

Handling Editors. The journal editors responsible for overseeing the review of the manuscript were Steven E. Kahn and Matthew J. Crowley.

1. Flory JH, Ancker JS, Kim SYH, Kuperman G, Petrov A, Vickers A. Large language model GPT-4 compared to endocrinologist responses on initial choice of antidiabetic medication under conditions of clinical uncertainty. Diabetes Care 2025;48:185–192.
2. Davies MJ, Aroda VR, Collins BS, et al. Management of hyperglycemia in type 2 diabetes, 2022. A consensus report by the American Diabetes Association (ADA) and the European Association for the Study of Diabetes (EASD). Diabetes Care 2022;45:2753–2786.
3. Samson SL, Vellanki P, Blonde L, et al. American Association of Clinical Endocrinology consensus statement: comprehensive type 2 diabetes management algorithm - 2023 update. Endocr Pract 2023;29:305–340.
4. LeRoith D, Biessels GJ, Braithwaite SS, et al. Treatment of diabetes in older adults: an Endocrine Society clinical practice guideline. J Clin Endocrinol Metab 2019;104:1520–1574.
5. American Diabetes Association Professional Practice Committee. Introduction and methodology: Standards of Care in Diabetes—2024. Diabetes Care 2024;47(Suppl. 1):S1–S4.
6. Malik ME, Falkentoft AC, Jensen J, et al. Discontinuation and reinitiation of SGLT-2 inhibitors and GLP-1R agonists in patients with type 2 diabetes: a nationwide study from 2013 to 2021. Lancet Reg Health Eur 2023;29:100617.
7. Eberly LA, Yang L, Eneanya ND, et al. Association of race/ethnicity, gender, and socioeconomic status with sodium-glucose cotransporter 2 inhibitor use among patients with diabetes in the US. JAMA Netw Open 2021;4:e216139.
8. Eberly LA, Yang L, Essien UR, et al. Racial, ethnic, and socioeconomic inequities in glucagon-like peptide-1 receptor agonist use among patients with diabetes in the US. JAMA Health Forum 2021;2:e214182.
9. Tummalapalli SL, Montealegre JL, Warnock N, Green M, Ibrahim SA, Estrella MM. Coverage, formulary restrictions, and affordability of sodium-glucose cotransporter 2 inhibitors by US insurance plan types. JAMA Health Forum 2021;2:e214205.
10. Lamprea-Montealegre JA, Madden E, Tummalapalli SL, et al. Association of race and ethnicity with prescription of SGLT2 inhibitors and GLP1 receptor agonists among patients with type 2 diabetes in the Veterans Health Administration system. JAMA 2022;328:861–871.
11. Green JB, Lee RH. “The price is right” for diabetes management of older adults - evidence for the closest glycemic target without going over. J Am Geriatr Soc 2023;71:3680–3682.
12. Brown E, Heerspink HJL, Cuthbertson DJ, Wilding JPH. SGLT2 inhibitors and GLP-1 receptor agonists: established and emerging indications. Lancet 2021;398:262–276.
13. Steinberg J, Carlson L. Type 2 diabetes therapies: a STEPS approach. Am Fam Physician 2019;99:237–243.
14. Bedi S, Liu Y, Orr-Ewing L, et al. Testing and evaluation of health care applications of large language models: a systematic review. JAMA 2024:e2421700.
15. Park J, Zhang P, Wang Y, Zhou X, Look KA, Bigman ET. High out-of-pocket health care cost burden among Medicare beneficiaries with diabetes, 1999–2017. Diabetes Care 2021;44:1797–1804.
16. Goodman RS, Patrinely JR, Stone CA, et al. Accuracy and reliability of chatbot responses to physician questions. JAMA Netw Open 2023;6:e2336483.
17. Johnson D, Goodman R, Patrinely J, et al. Assessing the accuracy and reliability of AI-generated medical responses: an evaluation of the Chat-GPT model. 28 February 2023 [preprint]. Res Sq: rs.3.rs-2566942.
18. Ye C, Zweck E, Ma Z, Smith J, Katz S. Doctor versus artificial intelligence: patient and physician evaluation of large language model responses to rheumatology patient questions in a cross-sectional study. Arthritis Rheumatol 2024;76:479–484.
19. Ayers JW, Poliak A, Dredze M, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med 2023;183:589–596.
20. Shah NH, Entwistle D, Pfeffer MA. Creation and adoption of large language models in medicine. JAMA 2023;330:866–869.
21. Sezgin E. Artificial intelligence in healthcare: complementing, not replacing, doctors and healthcare providers. Digit Health 2023;9:20552076231186520.
22. Reddy S. Generative AI in healthcare: an implementation science informed translational path on application, integration and governance. Implement Sci 2024;19:27.
23. Yu P, Xu H, Hu X, Deng C. Leveraging generative AI and large language models: a comprehensive roadmap for healthcare integration. Healthcare (Basel) 2023;11:2776.
24. Damschroder LJ, Aron DC, Keith RE, Kirsh SR, Alexander JA, Lowery JC. Fostering implementation of health services research findings into practice: a consolidated framework for advancing implementation science. Implement Sci 2009;4:50.
25. Glasgow RE, Vogt TM, Boles SM. Evaluating the public health impact of health promotion interventions: the RE-AIM framework. Am J Public Health 1999;89:1322–1327.
Readers may use this article as long as the work is properly cited, the use is educational and not for profit, and the work is not altered. More information is available at https://www.diabetesjournals.org/journals/pages/license.