In the rapidly evolving landscape of health care, commercially available large language models (LLMs) like GPT-4 are being examined for their potential to assist in clinical decision-making, especially in scenarios where no single “right” answer exists (1). This challenge is particularly evident in the management of type 2 diabetes, where providers often face multiple guideline-concordant options for first-line therapy yet lack clear consensus on the best approach (2–4). With the advent of newer therapeutic options, ramifications for drug accessibility and cost, and considerations of other metabolic risks, growing clinical complexity and uncertainty have made management decisions increasingly difficult. Against this backdrop, in this issue of Diabetes Care, Flory et al. (1) examine LLMs’ decision-making in diabetes care, particularly regarding medication choices, such as between metformin and its alternatives, that depend on patient characteristics (1,5–11). In comparing LLM-driven decisions with those made by experienced endocrinologists, the article highlights where LLMs can contribute to diabetes care and where they may fall short.
Clinical guidelines from the American Diabetes Association and the European Association for the Study of Diabetes (2), as well as the American Association of Clinical Endocrinology (3) and the Endocrine Society (4), have traditionally advocated for first-line metformin use. However, given metformin’s relative and absolute contraindications, treatment plans must account for individual patient factors, including patient-centered glycemic targets. These guidelines continue to champion lifestyle modifications, but successfully implementing these modifications requires consideration of several factors related to social determinants and health access (2). Recent updates of the 2024 American Diabetes Association “Standards of Care in Diabetes” now highlight obesity management and the cardiorenal benefits of sodium–glucose cotransporter 2 inhibitors and glucagon-like peptide 1 receptor agonists (5,6,12,13). Given this complexity, it is crucial to evaluate how well LLMs like GPT-4 navigate these decisions.
In their study, Flory et al. (1) assess agreement between endocrinologists and GPT-4 in selection of initial antidiabetes medications for patients with type 2 diabetes, particularly focusing on metformin use. The analysis compared the endocrinologists’ (n = 31) and GPT-4’s preferences for metformin versus alternative treatments in response to 40 clinical vignettes, the consistency and robustness of GPT-4’s responses across multiple runs, and the influence of varied prompts that emphasized safety, cost, or adherence to guidelines.
The findings revealed that endocrinologists chose metformin in 31% of cases, while GPT-4 selected metformin 25% of the time after prompt refinement, a modest difference of 5.5 percentage points. Both groups were less likely to recommend metformin for patients with reduced kidney function. GPT-4 generally aligned with clinical guidelines but was more cautious, especially regarding gastrointestinal symptoms. The model’s variability across prompts revealed some inconsistency, particularly when prompts containing a “metformin nudge” shifted its recommendations.
The study by Flory et al. has several strengths, including real-world relevance in testing GPT-4’s decision-making in scenarios that mirror routine practice. The comparison of GPT-4’s responses with those of endocrinologists provides a valuable benchmark for assessment of alignment with expert opinions. The analysis of personalized responses, based on patient characteristics, enhances applicability to precision medicine. Additionally, exploring the variability in GPT-4’s recommendations, driven by factors like temperature settings and tailored prompts, deepens understanding of how LLMs handle clinical ambiguity. The use of various prompt versions, including nudges toward metformin prioritization and cost reduction, further clarifies how prompt design shapes LLM decision-making.
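To make this design concrete, the following is a minimal sketch, assuming the OpenAI Python client, of how run-to-run and prompt-to-prompt variability of this kind can be probed. It is not the authors’ actual protocol; the vignette text, prompt variants, and repetition count are hypothetical.

```python
# Minimal sketch (not the authors' protocol): repeat one vignette under
# several prompt framings and a nonzero temperature, then tally how often
# the model picks each drug to expose run-to-run variability.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

VIGNETTE = "68-year-old with new type 2 diabetes, eGFR 48, BMI 34."  # hypothetical
PROMPTS = {  # hypothetical prompt variants, including a cost "nudge"
    "neutral": "Choose a first-line diabetes medication for this patient.",
    "guideline": "Following clinical guidelines, choose a first-line medication.",
    "cost": "Minimizing patient cost, choose a first-line medication.",
}

def first_line_choice(system_prompt: str, temperature: float) -> str:
    """Return the model's drug choice for a single run."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=temperature,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": VIGNETTE + " Answer with the drug name only."},
        ],
    )
    return response.choices[0].message.content.strip().lower()

# Ten repetitions per prompt variant; the spread of the tallies is the
# variability of interest.
for label, prompt in PROMPTS.items():
    tally = Counter(first_line_choice(prompt, temperature=0.7) for _ in range(10))
    print(label, dict(tally))
```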
The study also has notable limitations. Hypothetical vignettes cannot capture the complexity and unpredictability of actual clinical practice, which may limit the generalizability of the results (14). Bias introduced by certain prompt structures might also influence GPT-4’s decision-making. Furthermore, while variability in responses could indicate clinical equipoise, it might also reflect gaps in the LLM’s understanding rather than true uncertainty.
One important finding is GPT-4’s tendency to favor newer drug classes, reflecting its reliance on up-to-date data but raising concerns about patient costs and cost-effectiveness. In a health care system already burdened by rising costs, adopting artificial intelligence (AI)-driven recommendations without clinician oversight could add to this burden (15). The results of the study by Flory et al. emphasize the need to evaluate LLM recommendations in terms of cost and patient access. To evaluate the cost impact of GPT-4’s recommendations, a direct comparison between GPT-4’s suggested treatments and clinician choices would clarify the potential financial impact on patients and payers. For example, GPT-4’s preference for newer, more expensive drugs like sodium–glucose cotransporter 2 inhibitors or glucagon-like peptide 1 receptor agonists can be contrasted with the choice of lower-cost alternatives like metformin or sulfonylureas; such an analysis would reveal the incremental cost difference per patient vignette and whether LLM recommendations lead to significantly higher drug costs (a minimal sketch of this comparison follows). Also crucial is consideration of the potential impact on health equity, as LLM recommendations could exacerbate disparities by favoring higher-cost treatments. Further analyses could illuminate how variations in clinical vignettes and prompts, such as those involving drug pricing, patient adherence, or insurance coverage, affect GPT-4’s overall recommendations, which is essential for preventing the propagation of existing health care disparities.
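A minimal sketch of that incremental-cost comparison is shown below; all drug prices, vignette counts, and treatment picks are invented for illustration and are not taken from the study.

```python
# Hypothetical per-vignette cost comparison between LLM and clinician picks.
# Prices are invented round numbers, not real cash prices.
MONTHLY_COST_USD = {
    "metformin": 10,        # generic biguanide
    "glipizide": 12,        # generic sulfonylurea
    "empagliflozin": 600,   # SGLT2 inhibitor
    "semaglutide": 1000,    # GLP-1 receptor agonist
}

clinician_picks = ["metformin", "metformin", "empagliflozin", "glipizide"]
llm_picks = ["empagliflozin", "semaglutide", "empagliflozin", "metformin"]

# Incremental monthly cost of the LLM's pick relative to the clinician's,
# computed vignette by vignette.
diffs = [
    MONTHLY_COST_USD[llm] - MONTHLY_COST_USD[doc]
    for llm, doc in zip(llm_picks, clinician_picks)
]
for i, d in enumerate(diffs, start=1):
    print(f"vignette {i}: incremental cost {d:+d} USD/month")
print(f"mean incremental cost: {sum(diffs) / len(diffs):+.2f} USD/month")
```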
The study results also underscore the need for human oversight in AI-assisted diabetes care, as LLMs like GPT-4 can analyze data but lack the clinical judgment required to address individual patient needs (14). Trust in AI must be built on its ability to support, not replace, clinician decision-making. Recent studies highlight both the potential and limitations of LLM-generated medical responses. Goodman et al. (16) and Johnson et al. (17) found that while AI models often provided accurate answers, they lacked consistency, with incomplete or incorrect information that could impair clinical decision-making. Ye et al. (18) reported that patients rated AI responses similarly to physicians’ responses, whereas physicians considered them inferior. Another study showed that AI responses matched or exceeded physician responses in empathy and quality (19). These findings suggest that while AI holds promise, thorough validation is crucial for its safe and effective use in clinical decision-making (20).
Human-in-the-loop (HITL) systems offer a promising approach that combines AI with clinician expertise (21). In an HITL system, AI generates recommendations or insights, but the final decision is made by a human who takes the patient’s unique circumstances, preferences, and medical history into account. This approach ensures that LLM suggestions are integrated into the clinician-patient dialogue for diabetes care. HITL systems may improve both the quality and safety of care, as the clinician retains control over decisions, using AI to enhance efficiency and support informed choices. As health care adopts AI, HITL models can balance innovation with the irreplaceable value of human judgment. A feedback loop for reporting issues and safety events could be used to identify hallucinations and inaccuracies, enabling refinement and retraining of these systems to produce outputs that align more closely with clinical expertise for safer, more effective care (22,23) (Fig. 1).
Figure 1—HITL for LLM-assisted diabetes medication management. HITL ensures that clinicians maintain control over AI-assisted decisions, with a feedback loop between clinician and LLM system to detect inaccuracies and refine the system for safer, expert-aligned care.
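To sketch how such a feedback loop might be wired in software, the hypothetical Python example below has the LLM draft a recommendation, leaves the final decision to the clinician, and logs every override for later review and retraining. All interfaces, names, and data here are assumptions for illustration, not a description of any deployed system.

```python
# Hypothetical HITL sketch: the LLM drafts, the clinician decides, and
# disagreements are logged as feedback for refinement and retraining.
from dataclasses import dataclass, field

@dataclass
class FeedbackLog:
    """Accumulates clinician overrides for later model review."""
    events: list = field(default_factory=list)

    def record(self, vignette: str, llm_choice: str, final_choice: str, note: str):
        self.events.append(
            {"vignette": vignette, "llm": llm_choice, "final": final_choice, "note": note}
        )

def hitl_decision(vignette, llm_choice, clinician_review, log: FeedbackLog) -> str:
    """The clinician reviews the LLM's draft; only the human choice is enacted."""
    final_choice, note = clinician_review(vignette, llm_choice)
    if final_choice != llm_choice:
        log.record(vignette, llm_choice, final_choice, note)  # feedback loop
    return final_choice

# Example: the clinician overrides a costly draft pick for an uninsured patient.
log = FeedbackLog()
review = lambda vignette, draft: ("metformin", "uninsured; GLP-1 RA unaffordable")
final = hitl_decision("uninsured 52-year-old, A1C 8.1%", "semaglutide", review, log)
print(final)        # metformin
print(log.events)   # one override recorded for retraining review
```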
For successful implementation of LLMs in diabetes care, frameworks like the Consolidated Framework for Implementation Research (CFIR) and Reach, Effectiveness, Adoption, Implementation, and Maintenance (RE-AIM) are crucial in evaluating effectiveness, adoption, and scalability (24,25). Piloting LLMs in diverse clinical settings will provide real-world data on their role in treatment decisions. Engaging stakeholders will help identify needs and barriers, ensuring that LLMs are refined to enhance clinician buy-in and patient engagement. Additionally, training clinicians to interpret LLM recommendations will reinforce human oversight and build trust in AI for diabetes management.
The work by Flory et al. is a key first step in understanding how LLMs can aid complex diabetes treatment decisions, especially in uncertain cases. We emphasize that LLMs must work within HITL systems that prioritize clinician oversight, patient-centered care, and evidence-based practice. Future research should focus on demonstrating AI’s impact on diabetes outcomes and on developing frameworks for its responsible, equitable use. The findings of Flory et al. set an important benchmark for the utility of current LLMs, which will only improve in the years to come.
See accompanying article, p. 185.
Article Information
Funding. This work was supported by National Institute on Aging grant 2P30-AG028716-16 (to J.M.P.) and a Research Career Scientist award from the Department of Veterans Affairs (RCS 10-391 to M.L.M.).
Duality of Interest. No potential conflicts of interest relevant to this article were reported.
Handling Editors. The journal editors responsible for overseeing the review of the manuscript were Steven E. Kahn and Matthew J. Crowley.