Current guidelines recommend that individuals with diabetes receive yearly eye exams for detection of referable diabetic retinopathy (DR), one of the leading causes of new-onset blindness. To address the immense screening burden, artificial intelligence (AI) algorithms have been developed to autonomously screen for DR from fundus photographs without human input. Over the last 10 years, many AI algorithms have achieved good sensitivity and specificity (>85%) for detection of referable DR compared with human graders; however, many questions remain. In this narrative review on AI in DR screening, we discuss key concepts in AI algorithm development as a background for understanding the algorithms. We present the AI algorithms that have been prospectively validated against human graders and demonstrate the variability of reference standards and cohort demographics. We review the limited head-to-head validation studies where investigators attempt to directly compare the available algorithms. Next, we discuss the literature regarding cost-effectiveness, equity and bias, and medicolegal considerations, all of which play a role in the implementation of these AI algorithms in clinical practice. Lastly, we highlight ongoing efforts to bridge gaps in AI model data sets to pursue equitable development and delivery.
Introduction
Diabetes presents a considerable global health challenge, with an estimated 463 million people worldwide living with diabetes. This number is expected to climb to 700 million by 2045 (1). A common complication of diabetes, diabetic retinopathy (DR), stands as the leading cause of new-onset blindness among adults aged 20–74 years (2). As the prevalence of diabetes continues to grow, the global DR patient population is expected to expand to 160 million by 2045 (1). Current guidelines by the American Academy of Ophthalmology recommend that patients diagnosed with diabetes undergo yearly screenings for detection of referable cases of DR (3). DR screening includes a vision and retinal examination. Retinal exams can consist of direct or indirect ophthalmoscopy or fundus photography with interpretation by a qualified reader either on-site or via telemedicine (4). During retinal examination, DR severity is graded according to a scale. There are many scales for DR severity classification; a commonly used scale is the International Clinical Diabetic Retinopathy Severity Scale (ICDR). This is a 5-class scale: 0, no apparent retinopathy; 1, mild nonproliferative DR; 2, moderate nonproliferative DR; 3, severe nonproliferative DR; and 4, proliferative DR (5). Additionally, there is a separate classification of present or absent for clinically significant diabetic macular edema (DME) (5). Patients with referable DR require a comprehensive ophthalmic examination and medical/surgical therapy to prevent blindness. To address the extensive DR screening burden for patients with diabetes, developers have been building artificial intelligence (AI) devices for >20 years (6). The earliest AI techniques for DR screening identified pathological features of fundus images, such as hemorrhages, neovascularization, and exudates, and then used these features to determine whether the patient had DR (7–9). With more recent advances in computing power, deep learning is now the primary AI technique used in DR screening, with many deep learning models outperforming strictly feature-based machine learning methods (10). Thus far, three algorithms are cleared by the U.S. Food and Drug Administration (FDA) for clinical use in DR screening: IDx-DR, EyeArt, and AEYE Diagnostic Screening (AEYE-DS). All three are fully autonomous algorithms that operate without human supervision.
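To make the referral threshold concrete, the short sketch below maps ICDR grades to a referable/non-referable decision under one commonly used definition (moderate nonproliferative DR or worse, or any DME). The exact threshold varies across studies and grading scales, and the function name and threshold here are illustrative assumptions rather than any device's logic.

```python
# Illustrative sketch only: maps ICDR grades to a referral decision under a
# commonly used "more than mild DR" threshold (moderate NPDR or worse, or any DME).
# The grade values follow the 5-class ICDR scale described above; the function
# name and threshold are assumptions for illustration, not any device's logic.

ICDR_GRADES = {
    0: "no apparent retinopathy",
    1: "mild nonproliferative DR",
    2: "moderate nonproliferative DR",
    3: "severe nonproliferative DR",
    4: "proliferative DR",
}

def is_referable(icdr_grade: int, has_dme: bool) -> bool:
    """Return True if the eye meets a more-than-mild-DR referral threshold."""
    return icdr_grade >= 2 or has_dme

# Example: mild NPDR without DME stays in annual screening; moderate NPDR is referred.
assert not is_referable(1, has_dme=False)
assert is_referable(2, has_dme=False)
```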
Basic Concepts
Before discussing the existing technology, it is pertinent to review how models are developed and trained. The key to machine learning is the ability to “learn”: a model is fed a large amount of data and learns how to use those data to perform a task such as classifying DR severity. For the algorithm to learn, the input data must be labeled. Labeling is the process of matching each input with an associated label for the outcome task; in DR screening, this means attaching a DR severity grade created by an expert grader to each color fundus photo in the data set. These labels, also known as the ground truth, are what the model uses to learn through iterative feedback (11). Figure 1 shows a schematic of the iterative training cycle for a deep learning algorithm. To ensure the models’ performance, these labels are often created by expert clinicians or teleretinal graders. The model learns to perform the task so that its outputs match the ground truth labels as closely as possible. If there are any errors in the ground truths, the model will perpetuate those errors in its predictions. Therefore, it is imperative to ensure accurate ground truths, generated with a reliable labeling protocol, when training models to ensure accurate algorithm performance.
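Because the ground truth drives everything the model learns, a reliable labeling protocol typically aggregates several expert grades per image. The sketch below shows one hypothetical adjudication rule (majority vote, with ties escalated); it is a toy illustration, and any particular study's grading workflow may differ.

```python
# Toy sketch of label adjudication: several graders assign an ICDR grade to the
# same image, and the majority grade becomes the ground truth; disagreements are
# escalated. This rule is an assumption for illustration, not any study's protocol.
from collections import Counter
from typing import Optional

def adjudicate(grades: list[int]) -> Optional[int]:
    """Return the majority ICDR grade across graders, or None (escalate to a
    senior adjudicator) when no grade has a strict majority."""
    counts = Counter(grades)
    grade, votes = counts.most_common(1)[0]
    return grade if votes > len(grades) / 2 else None

# Two of three graders agree on moderate NPDR (grade 2) -> accepted as ground truth.
print(adjudicate([2, 2, 3]))   # 2
# Full disagreement -> None, flagging the image for adjudication.
print(adjudicate([1, 2, 3]))   # None
```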
Once an accurately labeled data set is obtained, a model can be trained. The data set is partitioned into three splits: training, validation, and testing. The training split is used for the model to learn how to perform the task. During training, the model is fed input images and outputs a prediction for the task; the output prediction is then compared with the ground truth label. The model weights, which are variables that determine the model output, are automatically adjusted to make the model’s prediction more similar to the ground truth label. While training, the model’s performance on the validation split is also assessed, and this can be used to make adjustments in model design, such as tweaking the hyperparameters, and to prevent overfitting. This training process is repeated until the model’s performance converges, and then the model is ready for evaluation on the testing split. The model does not “learn” anything from the testing split, also referred to as the test set, and thus does not adjust its weights during this phase. Instead, this phase is used to test the model’s performance on a different data set to demonstrate generalizability. An ideal test set is independent of the training and validation splits and represents data that the model may see in clinical deployment. Oftentimes this means that the test set should be from an entirely different geographical setting and be composed of a diverse patient population to ensure that the algorithm can perform adequately across patients from different regions of the world. In most studies reporting AI model performance, the metrics are derived from the model’s evaluation on the test set. It is important to note that when AI models are evaluated for regulatory approval and eventually deployed, developers freeze their weights, which means they can no longer be updated or modified.
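To make this workflow concrete, the sketch below (a minimal illustration, not any commercial device's pipeline) uses random tensors in place of fundus photographs and a tiny classifier in place of a deep network: the data set is partitioned into training, validation, and test splits, weights are updated only on the training split, the validation split is monitored during training, and the held-out test split is scored once with the weights frozen. The architecture, split proportions, and hyperparameters are arbitrary assumptions.

```python
# Minimal illustrative sketch of the split/train/validate/test structure.
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader, random_split

# Fake "images" (flattened) and binary referable-DR labels.
images = torch.randn(1000, 64)
labels = torch.randint(0, 2, (1000,))
dataset = TensorDataset(images, labels)

# Partition into training, validation, and test splits (70/15/15 here, an assumption).
train_set, val_set, test_set = random_split(dataset, [700, 150, 150])

model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def accuracy(split):
    loader = DataLoader(split, batch_size=128)
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total

for epoch in range(5):
    # Weights are adjusted only on the training split ...
    for x, y in DataLoader(train_set, batch_size=64, shuffle=True):
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    # ... while the validation split monitors for overfitting and guides tuning.
    print(f"epoch {epoch}: validation accuracy = {accuracy(val_set):.2f}")

# At evaluation time the weights are frozen: no further updates; score the test split once.
model.eval()
for p in model.parameters():
    p.requires_grad_(False)
print(f"held-out test accuracy = {accuracy(test_set):.2f}")
```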
Prospectively Studied Algorithms
There is wide variability in the regulatory approval studies for the commercially approved DR screening algorithms. Of those with results published in scientific journals, many include examination of algorithm performance on open-source data sets, retrospective data sets, and prospectively collected data sets. We chose to report only on algorithms for which prospective data sets are available, as these are more closely aligned with real-world clinical implementation and involve integrating AI algorithms into clinical workflows (Table 1 and Fig. 2). Additionally, there are already many review articles with evaluation of the DR screening AI algorithms tested on retrospective and open-source data sets (12–14).
First author | Date | Model | Data set | Field of view | Camera | No. of photos | Mydriasis | No. of patients | Task | Reference grading standard | Sensitivity | Specificity |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Abramoff | 2018 | IDx-DR | 10 primary care clinics within the same U.S. health system | Macula centered and disc centered 45° | Topcon TRC-NW400 | 2 images per eye | Nonmydriatic but dilated if required | 900 | mtmDR (ETDRS score ≥35 and/or one eye with DME) | FPRC standards (4 stereoscopic images in each eye) with optical coherence tomography for DME assessment if patient had ETDRS level ≥35 | 87.2 (81.8–91.2) | 90.7 (88.3–92.7) |
Bellemo | 2019 | SELENA | Mobile diabetes screening unit in Zambia | Macula centered and retinal centered 45° | Digital Retinopathy System | 2 images per eye | Not mentioned | 1,574 | mtmDR via ICDR | Up to 3 graders in Singapore | 92.3 (90.1–94.1) | 89.0 (87.9–90.3) |
Zhang | 2020 | VoxelCloud Retina | 155 diabetes centers in China | Macular centered 45° | Topcon TRC-NW400, MiiS DSC-200, Canon CR-2 PLUS AF, Canon CR-2 AF, Zeiss VISUCAM200 | 1 image per eye | Nonmydriatic | 47,269 | mtmDR via ICDR | Up to 3 ophthalmologists classifying using ICDR guidelines | 83.3 (81.9–84.6) | 92.5 (92.1–92.9) |
Heydon | 2021 | EyeArt | 3 diabetic eye screening program sites in the U.K. | Macular centered and disc centered 45° | Not mentioned | 2 images per eye | Nonmydriatic | 30,405 | mtmDR (ETDRS score >35) | National Health System grading scale with 3 human graders | 95.7 (94.8–96.5) | 54.0 (53.4–54.5) |
Ipp | 2021 | EyeArt | 6 primary care clinics, 6 general ophthalmology clinics, and 3 retina centers in the U.S. | Macula and disc centered 45° | Canon CR-2 AF or Canon CR-2 PLUS AF | 2 images per eye | Nonmydriatic | 893 | mtmDR (ETDRS score ≥35) | FPRC standards (4 stereoscopic images in each eye) with optical coherence tomography for DME assessment if patient had ETDRS level ≥35 | 95.5 (92.4–98.5) | 85.0 (82.6–87.4) |
Scheetz | 2021 | — | 2 Australian endocrinology clinics and 3 Aboriginal medical services clinics | Macula centered 45° | Digital Retinography System, Canon CR-2 AF, Topcon 3D OCT1 Maestro | 1 image per eye | Nonmydriatic | 236 | Referable DR (preproliferative DR or worse and/or DME via National Health System grading scale) | 2 National Health System–certified retinal graders | 96.9 | 87.7 |
Ruamviboonsuk | 2022 | — | 9 primary care clinics in Thailand | Macula centered 45° | Topcon 3D OCT1 Maestro, Topcon TRC-NW300, Nidek AFC-230, Nidek AFC-210, Nidek AFC-300 | 1 image per eye | Nonmydriatic but dilated if ungradable | 7,940 | vtDR via ICDR (severe NPDR, PDR, and/or DME) | 3 U.S. board-certified retina specialists as reference graders | 94.7 (93.0–96.2) | 91.4 (87.1–95.0) |
Yang | 2022 | AIDRScreening system | 3 eye hospitals in China | Macula centered and disc centered 45° | Zeiss VISUCAM FF450, Topcon TRC-50DX, Topcon TRC-NW400 | 2 images per eye | Mydriatic | 1,001 | Referable DR as defined by the International Council of Ophthalmology Guidelines for Diabetic Eye Care | Up to 3 human graders using International Council of Ophthalmology Guidelines for Diabetic Eye Care classifications | 86.7 (83.4–90.1) | 96.1 (94.1–97.5) |
NPDR, nonproliferative DR; PDR, proliferative DR.
There are currently three FDA-cleared DR screening AI devices in the U.S.: IDx-DR, EyeArt, and AEYE-DS. According to FDA guidelines, DR screening algorithms fall under class II devices. Devices in this class are categorized as moderate to high risk to patients and must undergo premarket review before they can be cleared by the FDA. For a device to obtain clearance, substantial equivalence to existing technology must be demonstrated (15). The existing technology for comparison is the IDx-DR, version 2.0, algorithm, which was approved in 2018 (16). Both EyeArt and AEYE-DS demonstrated equivalence to IDx-DR to receive FDA clearance as class II devices (17). In the European Union (EU), multiple devices have class IIa approval (CE marking), including EyeArt, IDx-DR, Retmarker, Google, and Singapore Eye Lesion Analyzer (SELENA). Class IIa devices require an external audit and certification process, and the results are voluntarily uploaded into the European Database on Medical Devices (EUDAMED) (18). The European certification system is more difficult to search, and many algorithms claim to be certified, but we were unable to verify this through a database search. Although regulatory approval is an important aspect of clinical deployment, the difficulty of independently confirming each algorithm's approval status remains a significant barrier to third-party verification.
IDx-DR is notable for being the first fully autonomous AI system in any field of medicine to receive FDA clearance, which occurred in 2018 via a de novo approval process (IDx-DR, version 2.0) (19). The pivotal trial was a multicenter prospective study in which the AI algorithm was compared with a reading center’s standard for Early Treatment Diabetic Retinopathy Study (ETDRS) grading and DME detection, with the task of identifying fundus photos of more-than-mild DR (mtmDR), defined as an ETDRS score ≥35 and/or one eye with DME. The algorithm showed a sensitivity of 87.2% and a specificity of 90.7% compared with a stringent reference human grader standard. As of 2023, the IDx-DR algorithm has been renamed LumineticsCore. The most recent FDA-cleared version, IDx-DR, version 2.3, demonstrated an improved ability to read ungradable images and increased processing speed compared with IDx-DR, version 2.0, but without change to the algorithm that classifies DR (20).
EyeArt is a deep learning–based classification tool, developed by Eyenuk, that has been extensively tested in research studies. Two prospective trials have been conducted to evaluate the performance of the tool. In the first, >30,000 patients were screened for referable DR based on the ETDRS standard (score >35) at three centers in the U.K. Investigators found that EyeArt had 95.7% sensitivity and 54.0% specificity compared with the reference standard of three human graders (21). In the second, which was the trial used for FDA clearance, EyeArt was evaluated in a prospective multicenter trial in the U.S. with 893 patients, with the AI algorithm’s detection of both more-than-mild DR (ETDRS score ≥35) and vision-threatening DR (vtDR) (ETDRS score ≥53 or clinically significant macular edema) compared with a reading center reference standard that used the same protocol as the IDx-DR pivotal trial. The results from this study showed that EyeArt had a sensitivity of 95.5% and a specificity of 85.0% for detecting mtmDR (22).
AEYE-DS, an algorithm created by AEYE Health, received FDA clearance in November 2022 for screening of DR (23). The company reported the completion of a phase III clinical trial but has not released details in the form of a scientific manuscript, and thus we elected not to report their results.
SELENA was first described by Ting et al. (24); it was trained on a data set for referable DR from a Singaporean population and validated retrospectively in an ethnically diverse cohort that included patients from China, Hong Kong, Singapore, Mexico, Australia, and the U.S. The algorithm was prospectively validated in Zambia in 2019 by Bellemo et al. (25), who compared the SELENA algorithm with human graders and showed a sensitivity of 92.25% and a specificity of 89.04% for referable DR, which was defined as more-than-mild DR by the ICDR. The model also had a sensitivity of 99.42% and a specificity of 97.19% for detection of vtDR via the ICDR. SELENA is currently in clinical use in Singapore as part of the national diabetes screening program and is a class IIa CE marked device in Europe (26).
Google has developed its own unnamed DR detection system, which was one of the first deep learning–based screening AI algorithms. It was trained on a mixed data set from multiple sites in the U.S. and India and evaluated on the EyePACS and Messidor-2 data sets and data from a nationwide DR screening program in Thailand (27,28). This algorithm was improved (29) and then prospectively evaluated at multiple primary care sites within Thailand, with its performance and that of regional retina specialists each compared with the ground truth of a panel of three U.S. retina specialists (30). The outcome was vtDR, which was defined as severe nonproliferative DR, proliferative DR, or referable DME by the ICDR. The algorithm had a sensitivity of 91.4% and a specificity of 95.4% for vtDR, which was superior to the performance of the regional retina specialists measured against the same consensus panel. In interviews conducted following the study, the 12 nurses who interfaced between the AI and the patients indicated that the immediate results provided by the AI system helped aid the decision-making process for referrals. However, interviewees acknowledged that working with the algorithm could be challenging because of the additional steps required to upload images. In addition, the algorithm would deem some images ungradable, which required repeated imaging or human interpretation (30).
Li et al. (31) developed an unnamed AI algorithm that was initially trained with a set of 106,244 retinal images of Malay, Caucasian Australian, and Indigenous Australian individuals. This algorithm was later prospectively evaluated in a cohort of 242 Australians in either endocrinology outpatient or Aboriginal medical service primary care clinics and shown to have a sensitivity of 96.9% and specificity of 87.7% for referable DR (preproliferative DR or worse [equivalent to ETDRS ≥43] and/or DME), as defined by the U.K. National Health Service Diabetic Eye Screening Programme (32). The study reported that 93.7% of patients were satisfied or extremely satisfied with the service and 93.2% would be likely or extremely likely to use the service again (33).
VoxelCloud Retina is an AI algorithm that is part of the VoxelCloud software suite. Its prospective evaluation included 155 diabetes centers in China with a total of 15,805 patients who were randomly selected from a larger cohort. The AI output was compared with the reference standard of a panel of three ophthalmologists. The algorithm had a sensitivity of 83.3% and specificity of 92.5% for detection of referable DR, defined as more-than-mild DR by the ICDR (34).
The AIDRScreening system is a Chinese AI algorithm that was the first AI-based DR screening system to obtain a certificate from the National Institutes for Food and Drug Control of China. It was prospectively evaluated in a cohort of 1,001 patients from three centers in China. The algorithm was compared against the reference standard of a panel of up to three ophthalmologist graders who labeled images as referable DR according to the International Council of Ophthalmology criteria. The algorithm had a sensitivity of 86.72% and a specificity of 96.09% for referable DR (35). Unlike most of the other algorithms that were prospectively evaluated, the AIDRScreening system cannot detect DME.
In summary, there are many algorithms with prospective evaluation demonstrating >85% sensitivity and specificity compared with a human grader reference standard. The number of photos required varies by algorithm, with both EyeArt and IDx-DR (the two FDA-cleared algorithms with publicly available data) requiring two images per eye. All of the algorithms were evaluated with nonmydriatic imaging except the AIDRScreening system. The reference grading standard was not consistent across algorithms (ICDR, Fundus Photograph Reading Center [FPRC], etc.), and the populations in which these algorithms were evaluated had vastly different demographic characteristics. Finally, because the algorithms tend to have a lower threshold than human graders for deeming an image ungradable, AI devices often yield higher rates of ungradable images than human graders. This has the potential to increase clinical workflow time and the number of unnecessary referrals.
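As background for interpreting these figures, the sensitivities and specificities in Table 1 are computed by cross-tabulating each algorithm's binary referable/non-referable output against the reference standard. The short sketch below illustrates that calculation on synthetic data, using a Wilson score interval as one of several possible choices for the 95% CI; the arrays and interval method are illustrative assumptions, not drawn from any of the cited studies.

```python
# Illustrative sketch: how a study's sensitivity/specificity for referable DR is
# computed from paired AI outputs and reference-standard grades.
import numpy as np

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# 1 = referable DR, 0 = non-referable, one entry per patient (synthetic data).
reference = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 1])
ai_output = np.array([1, 1, 0, 0, 0, 1, 0, 1, 0, 1])

tp = int(np.sum((ai_output == 1) & (reference == 1)))
tn = int(np.sum((ai_output == 0) & (reference == 0)))
fp = int(np.sum((ai_output == 1) & (reference == 0)))
fn = int(np.sum((ai_output == 0) & (reference == 1)))

sensitivity = tp / (tp + fn)   # proportion of referable cases the AI flags
specificity = tn / (tn + fp)   # proportion of non-referable cases the AI clears
print(f"sensitivity = {sensitivity:.2f}, 95% CI {wilson_ci(tp, tp + fn)}")
print(f"specificity = {specificity:.2f}, 95% CI {wilson_ci(tn, tn + fp)}")
```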
Non–Prospectively Studied Algorithms That Are Commercially Available
Although many non–prospectively studied algorithms are commercially available, we elected to review only two of these algorithms because they were included in subsequent head-to-head validation studies that are discussed later in this manuscript.
Retmarker is a feature-based machine learning model developed in 2011 that identifies microaneurysms in color fundus photos to classify images as showing “disease” or “no disease” (8). Although it does not specify whether a patient’s disease warrants a referral for an in-person ophthalmic examination, it was implemented in a two-step process in a Portuguese reading center for DR screening: all fundus photos were first passed to the algorithm for classification as “disease” or “no disease,” and each “disease” photo was then graded by humans to determine whether in-person ophthalmic examination was warranted. The study findings suggested that this workflow had the potential to reduce clinician workload in reading images (36). Retmarker was included in many head-to-head comparisons and cost evaluation studies (10,37) and was acquired by Meteda in January 2022 (38).
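This two-step deployment essentially uses the algorithm as a triage filter ahead of human grading. The hypothetical sketch below illustrates that workflow; the function, class, and field names are invented for illustration and do not reflect Retmarker's actual interface.

```python
# Hypothetical sketch of a two-step ("AI triage, then human grading") screening
# workflow of the kind described above. All names are invented for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScreeningResult:
    patient_id: str
    ai_disease_flag: bool               # step 1: AI labels "disease" vs "no disease"
    human_refer: Optional[bool] = None  # step 2: only flagged images are human graded

def triage(patient_id: str, fundus_image, ai_model, human_grader) -> ScreeningResult:
    result = ScreeningResult(patient_id, ai_disease_flag=ai_model(fundus_image))
    if result.ai_disease_flag:
        # Only "disease" images reach a human grader, reducing grading workload.
        result.human_refer = human_grader(fundus_image)
    return result

# Toy stand-ins for the model and the human grader.
ai_model = lambda img: img["microaneurysms"] > 0
human_grader = lambda img: img["microaneurysms"] >= 3
print(triage("A001", {"microaneurysms": 0}, ai_model, human_grader))  # cleared by AI
print(triage("A002", {"microaneurysms": 5}, ai_model, human_grader))  # referred by human
```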
RetinaLyze is an AI software that was first described in 2003 (39); however, there are no publications within the last 5 years describing its methods or reporting its performance on a clinical data set. The algorithm is compatible with multiple fundus cameras and can identify microaneurysms from fundus photographs, but it does not classify whether a fundus photo shows referable versus non-referable DR. In Europe, it is CE marked class I, which means it must be used with human oversight and has not been independently certified, although it was previously reported that the authors had applied for class IIa approval (40).
Algorithms for Other Fundus Imaging Modalities
Although the algorithms discussed above were all trained and evaluated with traditional, nonportable fundus cameras, additional algorithms are under development for other fundus photography modalities. One is portable fundus photography, with algorithms such as Medios DR having undergone prospective evaluation (41,42). Ultrawide field imaging, which is increasingly adopted in clinical settings and allows a greater field of view of the retina than traditional fundus photography, has also been used, and some algorithms have shown promising results for detecting referable DR (43,44). No algorithms for these alternative imaging modalities are FDA cleared or CE marked.
Head-to-head Validation
Although many studies have evaluated the performance of a given AI model against a specific reference standard of human graders, comparison between these algorithms is challenging. This is because the data sets on which the models are evaluated all have different compositions of patients, and model performance can vary dramatically based on the test data set (Fig. 3). Without head-to-head validation on the same test data set, it is difficult to estimate a model’s relative performance.
Tufail et al. (10) published one of the first head-to-head validation studies in 2017, with comparison of EyeArt, Retmarker, and human graders against a third-party reference standard of independent graders. The algorithms in this study used traditional computer vision methods, as deep learning models for DR screening were not yet available. The cohort was composed of 20,258 patients who were seen in the U.K. National Health Service Diabetic Eye Screening Programme. The study findings showed that human graders had greater positive and negative predictive values than EyeArt and Retmarker. In this cohort, EyeArt had a sensitivity of 93.8% for any retinopathy and Retmarker had a sensitivity of 73.0% for any retinopathy. It is important to understand that these models have been modified since the study was conducted and may perform better now. In this study, the authors also performed a cost estimation to determine the threshold cost per patient below which the AI device would be less expensive than human grading. The amount was £3.82 per patient for Retmarker and £2.71 per patient for EyeArt. This study set the stage for further head-to-head validation studies and provided important information on how commercially available software compared against each other and against human graders.
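The logic behind such a threshold can be illustrated with a simple, hypothetical break-even calculation for an AI first-pass filter followed by human regrading of the AI positives; this is a toy sketch under stated assumptions, not the economic model used by Tufail et al.

```python
# Hypothetical break-even sketch: an AI first-pass filter is cheaper than
# all-human grading only if the per-patient AI price is below the human grading
# cost avoided, net of regrading the AI positives. Numbers are illustrative only.
def breakeven_ai_price(human_cost_per_patient: float, ai_positive_rate: float) -> float:
    """Max per-patient AI price at which (AI triage + human regrade of positives)
    costs no more than grading every patient by humans."""
    # all-human cost per patient: human_cost_per_patient
    # triage cost per patient:    ai_price + ai_positive_rate * human_cost_per_patient
    return human_cost_per_patient * (1 - ai_positive_rate)

print(breakeven_ai_price(human_cost_per_patient=5.00, ai_positive_rate=0.40))  # 3.0
```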
Another head-to-head study was conducted by Grzybowski and Brona (45). In this study, investigators compared the commercially available IDx-DR and RetinaLyze algorithms on a retrospective data set of 170 patients from a Polish diabetes clinic, graded by a single ophthalmologist. Investigators found that the agreement with the reader for DR-positive and DR-negative cases was 93.3% and 95.5%, respectively, for IDx-DR and 74.1–89.7% and 71.8–93.6%, respectively, for different RetinaLyze methods. Although the authors discussed the limitations regarding small sample size, patient selection, and the reference grading standard, this is one of the few head-to-head validation studies that exist and could serve as the groundwork for more extensive validation studies.
In an attempt to validate the performance of multiple commercially available algorithms across a diverse data set, Lee et al. (37) performed the largest deep learning head-to-head validation study to date in 2021. The authors contacted 23 companies to evaluate their model performance, of which 5 agreed to submit their algorithms to the study. Each algorithm was anonymized, and the results were blinded for all researchers, although the companies could see their own results and make adjustments to their software in the future as necessary. The data set was composed of fundus photographs from two Veterans Affairs (VA) hospitals, one in Seattle and one in Atlanta. The performance of the algorithms was compared with that of the VA graders, and a subset of the images was regraded by a third-party panel of arbitration graders, against which both the AI models and the VA graders were then compared. The authors found wide variability in model performance across the data set, with sensitivities ranging between 50.98% and 85.90% and specificities between 60.42% and 83.69%. Most of the algorithms were not superior to VA graders in comparisons against the arbitration grader standard, but two achieved higher sensitivities and one yielded comparable sensitivity and specificity. Interestingly, the performance of the models varied between the two cohorts, with most algorithms performing worse in terms of sensitivity and specificity in the Seattle cohort. The Atlanta cohort was more racially and ethnically diverse than the Seattle cohort, and all patients in the Atlanta cohort underwent pharmacologic dilation before retinal imaging, which was not routinely performed in the Seattle cohort. The results from this larger study highlight significant variability in AI devices’ performance and the alarming fact that the majority do not outperform human graders. In addition, models can be exquisitely sensitive to minor differences between the training and test data sets, which can degrade performance when the models are deployed. Thus, additional studies comparing AI DR algorithms are critical for clinicians to decide which algorithms may apply to their clinical practice.
These head-to-head studies have limitations. First, while investigators attempted to test performance on diverse data sets, the patient populations studied do not represent the full spectrum of patients with whom these algorithms may be used: the study by Tufail et al. (10) involved a British population, that by Grzybowski and Brona (45) a very limited Polish population, and that by Lee et al. (37) an American population composed largely of older males. It is imperative that future head-to-head studies evaluate performance within the populations in which the devices will be deployed. Second, the management of DR varies substantially across health care systems. Some systems refer all DR, including mild, while others refer only more-than-mild DR, which makes evaluating the algorithms difficult because the reference standards vary. Finally, the limited number of publicly available, large, diverse data sets makes head-to-head comparison difficult, although data sets from programs such as the NIH Bridge to Artificial Intelligence (Bridge2AI) are currently being collected to help fill this gap.
Cost-effectiveness
As the use of AI algorithms for the diagnosis and treatment of DR becomes more widespread, questions around the cost-effectiveness of these technologies are increasingly important to consider. While AI has demonstrated promising results in accurately detecting DR, whether the implementation of these algorithms is cost-effective in comparison with human graders remains unclear.
The literature on the cost-effectiveness of AI screening is conflicting, and factors such as geography and deployment strategy appear to play a large role. In cost estimation studies based on U.S. population data, researchers have predicted that AI screening is more cost-effective than human graders (37,46,47). One of the main arguments is that the high cost of human graders in countries such as the U.S. allows AI devices to be priced lower than human grading, thus providing cost-effective care. Whether this fully explains cost-effectiveness is unclear, as separate studies with data from China and Thailand have similarly found that AI screening is cost-effective despite the lower cost of human grading (48–50). However, investigators of separate studies from China and Brazil using similar methods estimated that AI algorithms are less cost-effective than human graders (51,52). Results of other studies have suggested that the deployment strategy may play a role in cost-effectiveness. Findings from a study based on Singaporean clinical data indicated that AI was more cost-effective than human grading for DR screening but that a semi-autonomous system with a two-step process of AI evaluation, followed by human grading of all the DR-positive images, was the most cost-effective option (53). From these studies, it is unclear whether AI is universally cost-effective. Grader cost varies by country, with high-income countries having higher human grader costs and low- and middle-income countries having lower grader costs, which may make AI devices less cost-effective in low- and middle-income countries. Some of the discrepancies among the studies may be related to the study period, as the analyses use varying model horizons, ranging from 1 year up to a lifetime. Estimating cost-effectiveness also requires aligning impacts on patient outcomes, health care accessibility, and the efficiency of health care systems, which may partly explain the discrepancy in the study results.
AI algorithms have the potential to reduce costs for health care systems according to some of the aforementioned studies, but incorporating new technology into complex billing, insurance, and payment structures presents a significant obstacle. The U.S., in particular, has a complex reimbursement system with a mix of private and public payors. Before IDx-DR, no autonomous AI devices had been cleared by the FDA; thus, there was no applicable billing framework in the U.S. (54). In the fall of 2021, the Centers for Medicare & Medicaid Services (CMS) approved the first Current Procedural Terminology (CPT) billing code for autonomous AI devices for detecting disease: 92229, “imaging of retina for detection or monitoring of disease; point-of-care autonomous analysis and report, unilateral or bilateral” (55). According to the 2023 Medicare Physician Fee Schedule, the nonfacility (outpatient) reimbursement amount for CPT code 92229 was $45.75. For comparison, CPT code 92228-TC (“imaging of retina for detection or monitoring of disease; with remote physician or other qualified health care professional interpretation and report, unilateral or bilateral”) has a nonfacility price of $12.88 when the actual imaging is performed by a technician or primary care physician (56). Given that there is no well-established billing framework into which AI software falls, it is difficult to evaluate the value an AI tool provides and thus to determine a fair reimbursement amount (54,57,58). CMS has also attempted to supplement the cost of these novel AI devices with the New Technology Add-on Payment (NTAP) system, which provides additional reimbursement for devices that are not covered by standard reimbursement (54). Additionally, the Medicaid reimbursement amount for AI DR screening varies by state and insurer, and some states may not reimburse telescreening for significant DR if a patient has mild DR, which is less severe than the commonly used referable standard of more-than-mild DR (59). Despite many recent changes, significant financial and reimbursement obstacles still impede the deployment of AI devices in the U.S. Although every country has a unique payment structure for health care, other systems will likely encounter similar obstacles. They will likely need to change their reimbursement policies and incorporate new billing frameworks for AI to effectively implement these devices into their systems.
Equity and Bias
Equity and bias are important factors to consider in the implementation of AI algorithms, and there are concerns that AI bias could lead to inequitable outcomes (60,61). Biases can cause harm to specific groups, for example via underdiagnosis or high false-positive rates. Developers and users of AI devices have an ethical responsibility to ensure equitable outcomes among underrepresented or marginalized communities (62). Biases can be introduced in different phases of the model development-to-deployment pipeline. In development, inconsistencies in data labeling or the inclusion/exclusion of specific groups in the data set can instill bias into the model. Not accounting for these biases during development could lead to inequitable performance across subgroups. Although there are numerous methods to reduce bias during model development, such as data augmentation and transfer learning (62), the most effective way to evaluate for bias is to evaluate the model on extensive, diverse test cohorts (63). Investigators of some studies have set out to evaluate for bias by using diverse validation cohorts to see whether model performance deteriorates across subgroups. Ting et al. (24), in the initial retrospective study of SELENA, evaluated the model’s performance over a wide range of geographic locations, ethnicities, and camera types. Although the model’s area under the receiver operating characteristic curve for each respective subgroup was >0.9, performance varied among Malaysian, Indian, Chinese, African American, Mexican, and Hong Kong patient subgroups. The reasons for varying performance are likely multifactorial, including the overall number of the respective ethnicities in the training set, the different grading standards for each of the data sets, and the different camera types or acquisition methods for each of the ethnic cohorts. In the head-to-head validation study by Tufail et al. (10), the authors found that the performance of EyeArt was not significantly affected by ethnicity, sex, or camera type. As previously discussed for the head-to-head validation study by Lee et al. (37), the performance of the algorithms varied significantly based on the training and test data sets, including the demographic characteristics of the study populations. Thus, potential biases must be accounted for when applying algorithms that may not have been validated in the targeted populations.
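One straightforward way to probe for this kind of subgroup bias on a test set is simply to stratify the performance metrics by demographic group, as in the brief sketch below; the data are synthetic, and the column names and the choice to report sensitivity alone are illustrative assumptions.

```python
# Minimal sketch of subgroup performance auditing on a test set: compute
# sensitivity per demographic group and flag large gaps. Data are synthetic.
import pandas as pd

df = pd.DataFrame({
    "group":     ["A", "A", "A", "B", "B", "B", "B", "A"],
    "reference": [1,   1,   0,   1,   1,   1,   0,   1],   # referable DR per reference standard
    "ai_output": [1,   1,   0,   1,   0,   0,   0,   1],   # AI referral decision
})

def sensitivity(sub: pd.DataFrame) -> float:
    positives = sub[sub["reference"] == 1]
    return float((positives["ai_output"] == 1).mean())

by_group = df.groupby("group").apply(sensitivity)
print(by_group)                                   # per-subgroup sensitivity
print("max gap:", by_group.max() - by_group.min())
```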
To help increase transparency in model development, the Asia-Pacific Academy of Ophthalmology (APAO) has recommended that future AI devices follow standardized reporting guidelines describing their methodology, training protocols, and evaluations in specific subgroups (64). Other authors have suggested that algorithm developers report a population-adjusted sensitivity that incorporates both sensitivity and compliance with an intervention across different populations within a cohort (65). It is important to consider that algorithm performance on a test set in a research setting is only one way to assess bias. Therefore, algorithms should be continuously monitored after deployment to assess patient outcomes and ensure equitable access to these novel technologies (63). Currently, there is no standard for continuous model evaluation (66).
Medicolegal Considerations
As with all emerging technologies in the medical field, AI will need to be regulated in a manner that protects patient safety, privacy, and autonomy (65). Because AI is a relatively new technology, many regulatory bodies have had to adapt to incorporate it. In the U.S., the FDA has classified AI algorithms as devices through the Software as a Medical Device (SaMD) framework (67). However, several legal challenges will still need to be addressed to fully integrate AI into daily clinical practice (68). It is currently unclear how responsibility will be shared among clinicians, facilities, developers, and regulating bodies. Historically, clinicians have borne the brunt of liability in the U.S. (69). However, there is currently a lack of consensus regarding who shares responsibility. Some legal experts believe that facilities will likely need to perform their due diligence when vetting algorithms for incorporation into their practice, much as with other devices (70). Others argue that in the U.S. tort law may not apply properly to AI decision-making without assigning the technology “personhood” and being able to understand the reasoning for the algorithms’ decisions (71). Tort law in the U.S. often varies by state, and it will be difficult to follow the myriad regulatory changes needed to address a rapidly changing technology. The “black box” nature of deep learning decision-making will prove to be a challenge for today’s court system, especially if the technology is given “personhood.” Currently, the American Medical Association holds the developer of the AI software liable for any issues regarding misdiagnosis and system failure (72).
In the current teleretinal grading paradigm, human graders also assess for other ocular pathology during DR screening, such as age-related macular degeneration, glaucoma, or choroidal melanoma. If such pathology is detected incidentally, patients are informed and referred for appropriate consultation. The DR algorithms discussed here are currently limited to detecting DR and do not detect other diseases, although algorithms that may detect multiple diseases are in development. There is a lack of clarity around liability when another serious condition, such as choroidal melanoma, is missed by an AI DR screening device and this omission leads to significant morbidity or mortality. Perhaps future developments in AI can include both the ability for models to explain their decision-making and comprehensive retinal evaluation, but until then medicine must rely on the legal system to adapt to these emerging technologies as they evolve.
The EU proposed the AI Act in 2021, a set of horizontal rules on AI for ensuring safety and trustworthiness (73). In 2022, the EU proposed two additional directives for updated guidance on the management of liability with AI devices, the Product Liability Directive and the AI Liability Directive. These directives, which serve as recommendations for constituents in making local laws, restructure the way that liability is assessed for AI devices (74). The laws and directives are dense but show that governments are changing their legal structures to adopt these new devices. While most of these laws, frameworks, and guidelines have been published in the U.S. or Europe, low- and middle-income countries have a relative dearth of guidelines regarding AI regulation, which may potentially exacerbate existing disparities (67).
Ideally, the medical community should be prepared to communicate the risks, benefits, and limitations of AI and to establish its role and liability when deploying these algorithms in clinical practice (75,76). Algorithm developers will have an ethical responsibility to adhere to standards and demonstrate equitable performance, while clinicians will be responsible for understanding the respective risks and benefits of the AI devices that they use (77). Issues such as patient privacy, informed consent, and the potential for AI to exacerbate health care disparities are all critical considerations that deserve attention in the future.
Future AI Data Sets
As discussed in this review, the training and testing data sets significantly impact the overall performance of AI models. The ability to train, validate, and compare AI algorithms for DR screening in an equitable fashion has been limited by a lack of well-designed, high-quality, large, and inclusive multimodal data sets that are freely accessible. A new project funded by the National Institutes of Health, titled AI Ready and Equitable Atlas for Diabetes Insights (AI-READI), is poised to help fill this need by generating an open-source, equitable, and powerful data set for AI training, validation, and comparison, in addition to addressing other highly impactful clinical questions. The aim of the study is to develop a cross-sectional data set of >4,000 people across the U.S. with triple balancing for sex, self-reported race/ethnicity (White, Black, Asian American, and Hispanic), and four stages of diabetes severity (no diabetes, lifestyle controlled, oral medication controlled, and insulin dependent). Building balanced training data sets is critical for the development of unbiased machine learning models. Therefore, this data set will include an equal number of patients from each of the four racial/ethnic groups while balancing the diabetes diagnoses equally across groups. Additionally, the AI/machine learning–ready data will include social determinants of health surveys; continuous glucose monitoring; serological testing for endocrine, cardiac, and renal biomarkers; retinal imaging (color fundus photos, optical coherence tomography, and optical coherence tomography angiography); electrocardiography; environmental sensor data; and 24-h wearable activity monitoring (Fig. 4). This comprehensive data set will pave the way for numerous AI applications to enhance patient care for people with diabetes while simultaneously contributing to the advancement of AI in medicine.
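The triple-balancing design amounts to drawing an equal number of participants from each cell of the sex × race/ethnicity × diabetes-severity grid. A simplified sketch of that sampling logic is shown below; the column names, cell sizes, and per-cell target are illustrative assumptions and do not reflect AI-READI's actual recruitment protocol.

```python
# Simplified sketch of "triple balanced" cohort construction: sample an equal
# number of participants from every sex x race/ethnicity x severity cell.
import itertools
import pandas as pd

def balanced_sample(candidates: pd.DataFrame, per_cell: int, seed: int = 0) -> pd.DataFrame:
    cells = candidates.groupby(["sex", "race_ethnicity", "severity"])
    return cells.sample(n=per_cell, random_state=seed).reset_index(drop=True)

# Toy candidate pool: 5 candidates in every cell of the 2 x 4 x 4 grid.
grid = itertools.product(
    ["F", "M"],
    ["White", "Black", "Asian American", "Hispanic"],
    ["none", "lifestyle", "oral", "insulin"],
)
pool = pd.DataFrame(
    [(s, r, sev) for (s, r, sev) in grid for _ in range(5)],
    columns=["sex", "race_ethnicity", "severity"],
)

cohort = balanced_sample(pool, per_cell=2)
print(len(cohort))                                                           # 64 participants
print(cohort.value_counts(["sex", "race_ethnicity", "severity"]).unique())   # [2] per cell
```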
Conclusions
In conclusion, there are numerous AI devices poised to make an impact in DR screening, with several already showing good performance on prospective data sets. However, to ensure beneficial changes in patient care, it is essential to fill a few substantial knowledge gaps. Few studies directly compare the available devices; thus, there is an urgent need for additional head-to-head validation studies to inform clinicians about which AI devices to deploy. These algorithms may appear cost-effective in estimation studies, but complex billing requirements and variability across health care systems make the true costs difficult to estimate. Finally, these algorithms have variable performance across different data sets and need to continue to demonstrate equitable diagnoses and outcomes after deployment in clinical settings. Ultimately, AI devices may significantly reduce the screening burden of DR worldwide, but these knowledge gaps need to be addressed to ensure the effective use of this new technology.
This article is featured in a podcast available at diabetesjournals.org/care/pages/diabetes_care_on_air.
Article Information
Funding. This research has been funded by National Institutes of Health grants K23EY029246, R01AG060942, and OT2OD032644; the Latham Vision Research Innovation Award (Seattle, WA), the C. Dan and Irene Hunter Endowed Professorship, the Klorfine Family Endowed Professorship, and the Roger H. and Angie Karalis Johnson Retina Center and by an unrestricted grant from Research to Prevent Blindness. A.Y.L. reports support from the FDA.
The sponsors or funding organizations had no role in the design or conduct of this research. This article does not reflect the opinions of the FDA.
Duality of Interest. A.Y.L. reports grants from Santen, Carl Zeiss Meditec, and Novartis and personal fees from Genentech, Topcon, and Verana Health, outside of the submitted work. No other potential conflicts of interest relevant to this article were reported.
Author Contributions. A.E.R., O.Q.D., C.S.L., and A.Y.L. all wrote, reviewed, and edited the manuscript. A.Y.L. is the guarantor of this work and, as such, had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.