Diabetic macular edema (DME) is the leading cause of vision loss in people with diabetes. Application of artificial intelligence (AI) in interpreting fundus photography (FP) and optical coherence tomography (OCT) images enables prompt detection of DME and timely intervention.
To evaluate the performance of AI in detecting DME from FP or OCT images and to identify potential factors affecting model performance.
We searched seven electronic libraries up to 12 February 2023.
We included studies using AI to detect DME from FP or OCT images.
We extracted study characteristics and performance parameters.
Fifty-three studies were included in the meta-analysis. FP-based algorithms of 25 studies yielded pooled area under the receiver operating characteristic curve (AUROC), sensitivity, and specificity of 0.964, 92.6%, and 91.1%, respectively. OCT-based algorithms of 28 studies yielded pooled AUROC, sensitivity, and specificity of 0.985, 95.9%, and 97.9%, respectively. Potential factors improving model performance included deep learning techniques and larger, more diverse training data sets. Models performed better when validated internally than externally, and those trained with multiple data sets showed better results upon external validation.
Analyses were limited by unstandardized algorithm outcomes and insufficient data in patient demographics, OCT volumetric scans, and external validation.
This meta-analysis demonstrates satisfactory performance of AI in detecting DME from FP or OCT images. External validation is warranted for future studies to evaluate model generalizability. Further investigations may estimate optimal sample size, effect of class balance, patient demographics, and additional benefits of OCT volumetric scans.
Introduction
Diabetic retinopathy (DR) is a neurovascular complication of diabetes (1,2). Diabetic macular edema (DME), which can develop at any stage of DR, is the primary cause of irreversible vision loss in people with diabetes (1,3), having overtaken proliferative DR as the most frequent cause of visual impairment among people with diabetes in developed countries (4). As blindness from DME is preventable with timely treatment, it is important to precisely identify DME among people with diabetes. Given that the population with diabetes is projected to approach 600 million worldwide by 2035 (5), DME is likely to be responsible for substantial vision loss in the future unless detected early and treated adequately.
DR screening programs are currently implemented at the primary care level in many countries, typically with two-dimensional (2D) nonstereoscopic digital fundus photography (FP) captured by fundus cameras (6–9). When signs of DME or other sight-threatening DR are identified, patients are referred to ophthalmologists for further clinical examination and management. However, the diagnosis of DME requires identification of retinal thickening, a three-dimensional (3D) finding that is difficult to diagnose reliably from 2D FP. Notably, in screening settings, manual interpretation of FP images for DME has been reported to have high false-positive rates (e.g., >86% in Hong Kong [10] and >79% in the U.K. [11]), causing unnecessary referral of suspected DME to ophthalmologists and substantially increasing medical costs and waiting times for patients.
Spectral domain optical coherence tomography (SD-OCT) is a noninvasive imaging modality that provides 3D volumetric scans of the layered retinal structures. It has been widely used as the gold standard for DME diagnosis in clinical settings and clinical trials (12,13), for monitoring treatment response, and for providing prognostic information. Its role as a screening tool for DME has been investigated in pilot studies (13,14), which showed that implementing OCT in DR screening programs reduces referrals for diabetic maculopathy by 40% (15). Nevertheless, commercially available OCT devices can be at least three times more expensive than fundus cameras. Therefore, further studies are required to demonstrate the feasibility and cost-effectiveness of implementing OCT in DME screening. More importantly, one common and critical issue in using FP or OCT for DR screening is the requirement for professionals to review a tremendous number of images. Taking OCT as an example, a volumetric data cube usually comprises >100 images per eye, and manual slice-by-slice assessment is required to ensure that no positive cases are missed. Considering the large population of individuals with diabetes, this is a time- and labor-intensive task.
Artificial intelligence (AI), particularly deep learning, is a major area of research in medical image analysis (16). Recently, deep learning has made remarkable breakthroughs in the field of DR, enabling DME detection from FP and OCT images in an automated, convenient, and efficient fashion (17,18). Despite this excellent diagnostic performance, several gaps remain. First, whether FP-based AI can achieve satisfactory performance for screening DME has not yet been evaluated comprehensively. Second, although OCT is regarded as the gold standard for DME diagnosis, whether implementing OCT-based AI in DR screening provides significantly better performance requires validation studies. Third, the factors that determine AI's discriminative performance in detecting DME also require further elucidation.
We conducted a meta-analysis to evaluate the synthesized discriminative performance of AI in detecting DME from FP or OCT images and to identify factors that affect a model’s performance. In addition, on the basis of current literature, we further discuss potential research directions, aiming to facilitate clinical translation of AI models to real-world practice for DR screening.
Research Design and Methods
We conducted a systematic review and meta-analysis to assess the diagnostic performance of AI for detecting DME from FP and OCT images in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (19). Institutional review board approval and informed consent were not required, as all data were extracted from past publications. All research adhered to the principles of the Declaration of Helsinki.
Eligibility Criteria
Studies that used AI algorithms to detect DME from FP or OCT macular scans, published from 1 January 1991 to 12 February 2023, were included in the current study. We excluded records for the following reasons: 1) did not specify patients' diabetes status; 2) did not provide a qualitative outcome of DME (i.e., presence or absence of DME); 3) used traditional computer vision methods instead of AI techniques; 4) did not report sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC); 5) used ocular images other than FP or OCT macular scans as input; 6) were conference abstracts, ongoing studies, reviews, meta-analyses, comments, editorials, book chapters, theses, non–English-language studies, or nonhuman studies; or 7) did not have full text available. We also excluded any studies of low quality as assessed with the QUADAS-2 tool (20).
Electronic Search Strategy
Two independent reviewers (C.L. and Y.L.W.) conducted the literature search in seven electronic libraries, including PubMed, Embase, Web of Science, Google Scholar, Scopus, the Cochrane Library, and CINAHL, using a hierarchical search strategy with combinations of keywords related to AI (e.g., artificial intelligence, AI, machine learning, deep learning, automated detection), the target condition (e.g., DME), and image modalities (e.g., FP, OCT) and Medical Subject Headings terms, as appropriate. Full details of the search strategy for each database are described in Supplementary Table 1.
Study Selection and Quality Assessment
We ran our search strategy through the electronic libraries and collected all search results in EndNote. Duplicates were removed before selection. The selection process proceeded in three phases. First, two reviewers (C.L. and Y.L.W.) independently screened the titles and abstracts of the records and identified studies related to the topic. Second, the same two reviewers independently assessed the full texts and excluded articles that met the exclusion criteria. Third, we evaluated the quality of included studies using the QUADAS-2 tool and eliminated studies of low quality. Each study was evaluated for risk of bias and applicability across four key domains: patient selection, index test, reference standard, and flow and timing. Reference lists of the selected studies were manually searched and screened. Throughout this process, discrepancies between the two reviewers were resolved through discussion with a senior reviewer (Z.T.).
Data Collection
Relevant data were extracted from the included studies, including 1) study characteristics (author names, year of publication, and country); 2) data set characteristics (source of database; imaging modality and device; imaging protocol; image resolution; number of participants, eyes, and images; number of participants and eyes with DME; number and experience of graders; and DME definition [i.e., based on FP, OCT, or ophthalmoscopy]); 3) algorithm characteristics (type of network; data splitting and data distribution in training, testing, and validation sets; and outcomes); and 4) performance metrics (AUROC; sensitivity, recall, or true-positive rate; specificity or true-negative rate; false-positive rate; accuracy; false-negative rate; precision or positive predictive value; negative predictive value; and other reported performance parameters).
Data Synthesis and Statistical Analysis
We constructed a 2 × 2 contingency table for each study in RevMan (version 5.4.1) with the built-in calculator, using the collected study characteristics and performance parameters. We imported the 2 × 2 contingency tables into R (version 4.2.0) and used the mada package (21) and VassarStats (22) to perform statistical analyses. To demonstrate the discriminative performance of FP- and OCT-based AI for DME detection, respectively, we calculated the pooled AUROC using a hierarchical model. Pooled sensitivity and specificity were calculated with a bivariate random-effects model, which jointly models the two measures while accounting for their correlation across studies, to quantify the AI's ability to identify DME-positive and DME-negative eyes. As a single indicator, the diagnostic odds ratio (DOR) evaluates how much greater the odds of a positive diagnosis of DME are for patients with test-positive versus test-negative results, providing a comprehensible overview of the models' overall diagnostic performance (23). We generated heterogeneity parameters, forest plots, and summary receiver operating characteristic (SROC) curves. Subgroup analyses were performed according to the type of AI (machine learning vs. deep learning), data set size (smaller vs. larger than the median), data set diversity (single vs. multiple data sets), and testing set (internal vs. external). A small developmental data set was defined as one smaller than the median number of images among the studies included in this meta-analysis, and multiple training data sets were defined as data from different institutions or with systematically different population characteristics (18). We also created funnel plots to assess for publication bias in our meta-analysis.
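To make this workflow concrete, a minimal sketch in R is shown below, using the mada package as in our analysis; the 2 × 2 counts are hypothetical placeholders rather than data from any included study, and the plotting options are illustrative.

```r
library(mada)

# Hypothetical per-study 2 x 2 counts (true positives, false negatives,
# false positives, true negatives); in the actual analysis these tables
# were reconstructed in RevMan for each included study
dat <- data.frame(
  TP = c(160,  95, 310),
  FN = c( 14,   9,  21),
  FP = c( 22,  12,  35),
  TN = c(240, 180, 512)
)

# Study-level descriptive statistics, including sensitivity, specificity,
# and the DOR, where DOR = (TP x TN) / (FP x FN)
madad(dat)

# Bivariate random-effects (Reitsma) model pooling sensitivity and
# specificity while accounting for their between-study correlation
fit <- reitsma(dat)
summary(fit)

# SROC curve and the area under it
plot(fit, sroclwd = 2, main = "SROC curve")
AUC(fit)
```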
Results
Study Selection
The selection process for the 53 included studies is shown in Fig. 1. Initially, 4,776 search results were identified, and 1,338 duplicates were removed before selection. We screened the titles and abstracts of the remaining 3,438 records, and 2,796 were excluded because of irrelevant topics. Of the 642 studies retained for full-text review, 589 were excluded according to the exclusion criteria. We assessed the quality of the remaining 53 studies, and none were excluded for low quality (Supplementary Fig. 1). Finally, 53 studies were included in the meta-analysis (full citations provided in the Supplementary Material).
Study Characteristics
Basic characteristics of the included studies are presented in Table 1. Among the 53 studies, 25 used FP as input images, while 28 used OCT B scans; of the latter, 2 studies also included volumetric scans as input data for algorithm development. Among FP studies, 8 developed machine learning algorithms and 17 developed deep learning algorithms; among OCT studies, 3 developed machine learning algorithms and 25 developed deep learning algorithms. The Heidelberg Spectralis SD-OCT was the most commonly used device (21 studies), while the Cirrus HD-OCT, Cirrus SD-OCT, Topcon Triton OCT, Topcon 1000 SD-OCT, Topcon 3D OCT-1 Maestro, and Optovue RTVue-XR Avanti were used in other studies. Regarding publicly available data sets for training, testing, and validation, the Methods to Evaluate Segmentation and Indexing Techniques in the Field of Retinal Ophthalmology (MESSIDOR) and Kermany et al. (24) data sets were the most common for FP and OCT studies, respectively.
Table 1. Characteristics of the included studies

Study | Year | Training database | Imaging modality (device) | Imaging protocol | Image resolution, pixels | Total images | DME images | Outcome | Type of network (best result) | AUROC
---|---|---|---|---|---|---|---|---|---|---|
FP | ||||||||||
Agurto et al. | 2011 | RIST | Topcon TRC-50EX | 45° mydriatic images centered on the macula, optic disc, and superior temporal region of the retina | 1,888 × 2,224 | 238 | 174 | Yes/no CSME | ML PLS regression classifier | 0.980 |
UTHSCSA | Canon CF-60UV | 45° mydriatic images centered on the macula, optic disc, and superior temporal region of the retina | 2,048 × 2,392 | 323 | 207 | Yes/no CSME | ML PLS regression classifier | 0.970 | ||
Akram et al. | 2014 | HEI-MED | Not specified | Not specified | Not specified | 169 | 54 | Normal/CSME | Hybrid GMM SVM | 0.940 |
MESSIDOR | Topcon TRC-NW6 | 45° mydriatic and nonmydriatic centered between the macula and the disc | 1,440 × 960, 2,240 × 1,488, and 2,304 × 1,536 | 1,200 | 229 | Normal/Non-CSME/CSME | Hybrid GMM SVM | 0.970 | ||
Bressler et al. | 2022 | EyePACS | Topcon, Canon, CenterVue, Crystalvue, Zeiss, unassigned | 45° mydriatic and nonmydriatic images centered on the macula and between the macula and the disc | Not specified | 32,049 | 15,595 | Yes/no DME | DL neural network | 0.954 (EyePACS), 0.971 (MESSIDOR-2) |
Chalakkal et al. | 2021 | MESSIDOR | Topcon TRC-NW6 | 45° mydriatic and nonmydriatic images centered between the macula and disc | 1,440 × 960, 2,240 × 1,488, and 2,304 × 1,536 | 1,187 | 150 | Non-CSME/CSME | CNN ResNet-50 | 0.962** |
UoA-DR | Zeiss VISUCAM 500 | 45° centered on macular and optic disc | 2,124 × 2,056 | 200 | 74 | Non-CSME/CSME | CNN ResNet-50 | |||
IDRiD | Kowa VX-10α | 50° nonmydriatic images centered between the macula and the disc | 4,288 × 2,848 | 516 | 243 | Non-CSME/CSME | CNN ResNet-50 | |||
Dai et al. | 2021 | SIM | Canon CR-1 Mark II/CR-2, Topcon TRC-NW200, Zeiss VISUCAM 200 | 45° nonmydriatic images centered on optic disc and macula | Not specified | 666,383 | 3,926 | Yes/no DME | ResNet, Mask R-CNN | 0.946 |
Deepak et al. | 2012 | HEI-MED | Not specified | Not specified | Not specified | 122 | 54 | Normal/Hard Exudate (moderate/severe) | Gaussian data description, PCA data description | 0.990 (DMED), 0.960 (MESSIDOR) |
Gulshan et al. | 2019 | Aravind Eye Hospital/Sankara Nethralaya | Topcon NM TRC, Forus 3nethra | 40°–45° nonmydriatic images centered on the macula | Not specified | 140,853 | 20,002 | Yes/no RDME* | CNN Inception v4 | 0.984 |
He et al. | 2019 | IDRiD | Topcon TRC-NW6 | 50° nonmydriatic images centered between the macula and the disc | 4,288 × 2,848 | 516 | 284 | 0, 1, 2# | CNN VGG-16, XGBoost classifier | 0.964 |
MESSIDOR | Topcon TRC-NW6 | 45° mydriatic and nonmydriatic centered between the macula and the disc | 1,440 × 960, 2,240 × 1,488, and 2,304 × 1,536 | 1,200 | 226 | 0, 1, 2# | CNN VGG-16, XGBoost classifier | 0.982 | ||
Li et al. | 2021 | Shanghai First People’s Hospital, MESSIDOR-2 | Shanghai: not specified MESSIDOR-2: Topcon TRC-NW6 | Shanghai: 45° centered between the macula and the disc MESSIDOR-2: 45° nonmydriatic images centered on the macula | Shanghai: 1,488 × 1,488 MESSIDOR-2: 1,440 × 960, 2,240 × 1,488, and 2,304 × 1,536 | 45,806 | 5,158 | NRDME/RDME | CNN Inception v4 | Shanghai: 0.994 MESSIDOR: 0.948 |
Li et al. | 2020 | MESSIDOR and 2018 ISBI IDRiD challenge data set | MESSIDOR: Topcon TRC-NW6 IDRiD: Kowa VX-10α | MESSIDOR: 45° mydriatic and nonmydriatic centered between the macula and the disc IDRiD: 50° nonmydriatic images centered between the macula and the disc | MESSIDOR: 1,440 × 960, 2,240 × 1,488, and 2,304 × 1,536 IDRiD: 4,288 × 2,848 | 1,716 | 545 | 0, 1, 2# | CNN ResNet-50 | 0.942 |
Li et al. | 2018 | LabelMe | Topcon, Canon, CenterVue, Heidelberg | Not specified | 2,480 × 3,280, 576 × 768, 1,900 × 2,285, 1,958 × 2,588, 1,900 × 2,265, 1,956 × 2,448, and 1,944 × 2,464 | 71,043 | 14,598 | Yes/no DME | CNN Inception v3 | 0.986 |
Liu et al. | 2022 | Lerdsin and Rajavithi Hospital, Thailand; Moorfields Eye Hospital, U.K.; Alameda County Health System, U.S. | Topcon DRI OCT Triton, Kowa VX-10, Canon CR-DGi CFP | 45° centered between the macula and the disc | Not specified | 1,167,791 | 852,437 | Yes/no thickness-based, IRF-based CI-DME, or thickness-based DME | Deep CNN trained on TensorFlow | 0.860–0.960 |
Mo et al. | 2018 | HEI-MED | Not specified | Not specified | 2,196 × 1,958 | 169 | 54 | Yes/no DME | Convolutional residual network | 0.971** |
E-Ophtha EX | Not specified | 45° centered on the macula and between the macula and the disc | 1,440 × 960 to 2,544 × 1,696 | 82 | 47 | Yes/no DME | Convolutional residual network | |||
Mookiah et al. | 2015 | MESSIDOR | Topcon TRC-NW6 | 45° mydriatic and nonmydriatic centered between the macula and the disc | 1,440 × 960, 2,240 × 1,488, and 2,304 × 1,536 | 300 | 185 | No DME, Non-CSME, CSME | NB SVM-linear | 0.969 |
Kasturba Medical College | Topcon TRC-NW200 | 45° FOV | 480 × 382 | 300 | 200 | No DME, Non-CSME, CSME | NB SVM-linear | 0.975 | ||
Rajput et al. | 2020 | MESSIDOR | Topcon TRC-NW6 | 45° mydriatic and nonmydriatic centered between the macula and the disc | 1,440 × 960, 2,240 × 1,488, and 2,304 × 1,536 | 94 | Not specified | Yes/no DME | Color edge detection and mathematical morphology | 0.971 |
Raumviboonsuk et al. | 2019 | Nationwide screening program in Thailand | 3nethra; Canon CR2; Kowa VX-10, VX-20; Nonmyd 7, Nonmyd WD, Nonmyd α-DIII 8300; Nidek AFC-210, AFC-230, AFC-300; Topcon TRC-NW8; Zeiss VISUCAM 200 | 45° centered on the macula | 779 × 779 | 29,985 | 1,868 | Yes/no DME | CNN Inception v4 | 0.993 |
Sahlsten et al. | 2019 | Digifundus Ltd. | Canon CR2 | 45° mydriatic images centered on the macula and optic disc | 3,888 × 2,592 to 5,184 × 3,456 | 35,630 | 5,536 | NRDME/RDME | CNN Inception v3 | 0.992 |
Singh et al. | 2020 | IDRiD | Kowa VX-10α | 50° centered on the posterior pole | 4,288 × 2,848 | 516 | 284 | 0, 1, 2# | HE-CNN | 0.965 |
MESSIDOR | Topcon TRC-NW6 | 45° mydriatic and nonmydriatic centered between the macula and the disc | 1,440 × 960, 2,240 × 1,488, and 2,304 × 1,536 | 1,200 | 226 | 0, 1, 2# | HE-CNN | 0.965 | ||
Stevenson et al. | 2019 | MESSIDOR | Topcon TRC-NW6 | 45° mydriatic and nonmydriatic centered between the macula and the disc | 1,440 × 960, 2,240 × 1,488, and 2,304 × 1,536 | 2,283 | 226 | Normal/DME (/AMD, /DR, /RVO, /glaucoma)† | CNN Inception v3 | 0.746 |
Sundaresan et al. | 2015 | Local hospital | Not specified | Not specified | Not specified | 181 | Not specified | M0, M1, M2^ | GMM | 0.950 |
Tariq et al. | 2012 | MESSIDOR | Topcon TRC-NW6 | 45° mydriatic and nonmydriatic centered between the macula and the disc | 1,440 × 960, 2,240 × 1,488, and 2,304 × 1,536 | 1,200 | 226 | Healthy, non-CSME, CSME | SVM | 0.967 |
STARE | Topcon TRV-50 | 35° FOV with varying imaging settings | 700 × 605 | 81 | 50 | Healthy, non-CSME, CSME | SVM | 0.973 | ||
Tariq et al. | 2013 | MESSIDOR | Topcon TRC-NW6 | 45° mydriatic and nonmydriatic centered between the macula and the disc | 1,440 × 960, 2,240 × 1,488, and 2,304 × 1,536 | 1,200 | 226 | Healthy, non-CSME, CSME | Filter bank and GMM | 0.961 |
STARE | Topcon TRV-50 | 35° FOV with varying imaging settings | 700 × 605 | 81 | 50 | Healthy, non-CSME, CSME | Filter bank and GMM | 0.976 | ||
Varadarajan et al. | 2020 | Thailand Rajavithi Hospital | Kowa VX-10 | 50° centered on the macula | 4,288 × 2,848 | 7,072 | 1,990 | Yes/no CI-DME | CNN Inception v3 | 0.890 (Thailand), 0.840 (EyePACS) |
Wang et al. | 2022 | 3 Taiwan medical centers | Zeiss VISUCAM 200; Nidek AFC-330; Canon CF-1, CR-DGI, CR2, CR2-AF | 45° macula-centered on macula and between the macula and optic disc | 724 × 722 to 4,288 × 2,848 | 35,001 | 14,001 | DME/non-DME | EfficientDet-D1, bidirectional feature pyramid network | 0.981 (Taiwan), 0.952 (MESSIDOR-1), 0.958 (MESSIDOR-2) |
Yu et al. | 2022 | MESSIDOR | Topcon TRC-NW6 | 45° mydriatic and nonmydriatic centered between the macula and the disc | 1,440 × 960, 2,240 × 1,488, and 2,304 × 1,536 | 1,716 | 462 | Yes/no DME | CNN + residual attention network | 0.882 |
IDRiD | Kowa VX-10α | 50° centered on posterior pole | 4,288 × 2,848 | 516 | 284 | Yes/no DME | CNN + residual attention network | 0.772 | |
OCT | ||||||||||
Ai et al. | 2022 | Kermany data set | Heidelberg Spectralis | Not specified | 256 × 256 | 62,488/2,000 (limited) | 11,348/1,000 (limited) | Normal/DME | Inception v3 /Inception-ResNet v2 /Xception + convolutional block attention mechanism | 0.773–1.000 |
Alqudah et al. | 2020 | Kermany data set | Heidelberg Spectralis | Not specified | 256 × 256 | 62,489 | 11,599 | Normal/DME | CNN 19 layer | 1.000 |
Altan et al. | 2021 | Kermany data set, SERI | Heidelberg Spectralis, Cirrus SD-OCT | Not specified | 512 × 1,024 | 66,585 | 13,397 | Normal/DME | Lightweight CNN DeepOCT | 0.983 |
Bhatia et al. | 2020 | Noor Eye Hospital | Heidelberg SD-OCT | Not specified | 224 × 224 | 100 volumes | 50 volumes | Normal/DME | CNN VGG16, 21 layers | 0.980** |
Ophthalmology and Microsurgery Institute | Heidelberg SD-OCT | Not specified | 224 × 224 | 50 volumes | 25 volumes | Normal/DME | CNN VGG16, 21 layers | |||
Das et al. | 2019 | Kermany data set | Heidelberg SD-OCT | Not specified | 256 × 256 | 38,163 | 11,598 | Normal/DME | CNN multiscale deep feature fusion-based classifier | 0.990 |
Dash et al. | 2018 | Merry Eye Care, Puducherry | Not specified | Not specified | Not specified | 150 | 90 | Normal/DME | SVM | 0.980 |
Fang et al. | 2019 | Kermany data set | Heidelberg Spectralis | Not specified | 224 × 224 | 38,163 | 11,598 | Normal/DME | LACNN Inception v3 | 0.974 |
Hassan et al. | 2020 | Kermany data set | Heidelberg Spectralis | Not specified | 256 × 256 | 62,489 | 11,599 | Healthy/DME | Recurrent Residual Inception Network | 0.986 |
Hecht et al. | 2019 | Munk and Israel data set combined | Heidelberg Spectralis | Centered on macula, including B scan (a minimum of 10 frames) | 1,563 × 1,563 | 153 | 96 | DME/PCME | Decision tree | 0.937 |
Hussain et al. | 2018 | Duke, CERA, NYU, Tian | Heidelberg SD-OCT, EDI-OCT | Not specified | 512 × 1,024 | 11,662 | 2,940 | Normal/DME (/AMD)† | Random forest | 0.990 |
Hwang et al. | 2020 | Taipei Veterans General Hospital | Cirrus HD-OCT 4000, Optovue RTVue XR Avanti | Not specified | 3,499 × 2,329, 2,474 × 2,777, or 948 × 879 | 3,495 | Not specified | DME/non-DME | CNN (MobileNet, using Sigmoid Cross-Entropy and RMSprop) | 0.960 |
Joshi et al. | 2020 | Kermany data set | Heidelberg Spectralis | Not specified | 256 × 256 | 16,440 | 7,118 | Normal/DME | CNN | 0.990 |
Kermany et al. | 2018 | Kermany data set | Heidelberg Spectralis | Not specified | 256 × 256 | 62,489 | 11,599 | Healthy/DME | Inception v3 | 0.999 |
Khaothanthong et al. | 2023 | Rajavithi Hospital | Heidelberg Spectralis | Centered on macula, including radial scans from six lines per eye | Not specified | 6,356 | 1,455 | Yes/no DME | CNN ResNet-50, RelayNet/graph cut | 0.980 |
Li et al. | 2019 | Kermany data set | Heidelberg Spectralis | Not specified | 256 × 256 | 62,489 | 11,599 | Normal/DME | VGG16 | 0.999 |
Li et al. | 2019 | Shanghai Zhongshan Hospital and the Shanghai First People’s Hospital | Heidelberg Spectralis | Not specified | Not specified | 9,674 | 3,238 | Normal/DME | CNN multi-ResNet-50 ensembling | 0.996 |
Liu et al. | 2022 | Multiple centers | Topcon 3D OCT-1 Maestro | 6 mm × 6 mm | Not specified | >20,000 | Not specified | Yes/no ME | R-CNN | 0.944 |
Perdomo et al. | 2018 | SERI | Cirrus SD-OCT | Not specified | 1,024 × 512 | 4,096 | 2,048 | Normal/DME | VGG16 | 0.927 |
Perdomo et al. | 2019 | SERI + CUHK | Cirrus SD-OCT | Not specified | 1,024 × 512 | 9,600 | 5,352 | Normal/DME | VGG inspired | 0.860 |
Rajagopalan et al. | 2021 | Kermany data set | Heidelberg Spectralis | Not specified | 224 × 224 | 6,000 | 3,000 | Normal/DME | CNN | 0.960 |
Rasti et al. | 2018 | Not specified | Topcon 1000 SD-OCT | Not specified | 650 × 512 | 7,680 | 3,840 | Normal/DME | Wavelet-based CNN random forests | 0.993 |
Duke data set | Heidelberg SD-OCT | Not specified | 512 × 496, 768 × 496 | 30 volumes | 15 volumes | Normal/DME | Wavelet-based CNN random forests | 0.993 | ||
Rastogi et al. | 2019 | Kermany data set | Heidelberg Spectralis | Not specified | 128 × 128 | 62,489 | 11,599 | Normal/DME | DenseNet | 0.992 |
Saraiva et al. | 2020 | Kermany data set | Heidelberg Spectralis | Not specified | 150 × 150 | 62,489 | 11,599 | Normal/DME | CNN | 0.990 |
Tang et al. | 2021 | CUHK-STDR | Cirrus HD-OCT | 6 mm × 6 mm | 1,024 × 512 × 128 | 3,788 volumes | 1,208 volumes | No DME/non-CI-DME/CI-DME | CNN ResNet-34 | 0.964 |
CUHK-STDR | Heidelberg Spectralis | 6.3 mm × 6.3 mm and 6.5 mm × 4.9 mm | 1,024 × 25, 1,024 × 19 | 30,515 | 5,542 | No DME/non-CI-DME/CI-DME | CNN ResNet-18 | 0.846 | ||
CUHK-STDR | Topcon Triton OCT | Radial 9 mm × 30° | 1,024 × 12 | 39,443 | 7,804 | No DME/non-CI-DME/CI-DME | CNN ResNet-18 | 0.935 | ||
Togacar et al. | 2022 | Kermany data set, Duke data set, Noor data set | Spectralis SD-OCT | Not specified | Not specified | 91,969 | 13,803 | Normal/DME | 9 CNN models | 1.000 |
Wang et al. | 2020 | Duke data set | Spectralis SD-OCT | Not specified | 224 × 224 | 1,920 | 522 | Normal/DME | CNN VGG16 | 0.980 |
Wang et al. | 2023 | CUHK Eye Center | Triton, Spectralis | Radial 9 mm × 30° | 1,024 × 992 × 12, 1,024 × 496 × 25 | 69,491 B scans, 4,644 volumes | 2,910 volumes | Yes/no DME | Deep semisupervised multiple instance learning | 0.934 and 0.963 for B scan; 0.926 and 0.950 for volume |
Xu et al. | 2021 | Noor Eye Hospital (Tehran), Kermany data set | Spectralis SD-OCT | Not specified | Not specified | 87,738 | 9,720 | Normal/DME | Multibranch hybrid attention network | 0.970 and 1.000, respectively |
For the full citation of each study, see the Supplementary Material. AMD, age-related macular degeneration; CERA, Center for Eye Research Australia; CI-DME, center-involved diabetic macular edema; CNN, convolutional neural network; CUHK, Chinese University of Hong Kong; DL, deep learning; EDI-OCT, enhanced depth imaging optical coherence tomography; EyePACS, Picture Archive Communication System for Eye Care; FOV, field of view; GMM, Gaussian mixture model; HE-CNN, hierarchical ensemble of convolutional neural networks; HEI-MED, Hamilton Eye Institute Macular Edema Dataset (formerly DMED); IDRiD, Indian Diabetic Retinopathy Image Dataset; IRF, intraretinal fluid; ISBI, International Symposium on Biomedical Imaging; LACNN, lesion-aware convolutional neural network; ML, machine learning; NA, not available; NB, naive Bayes; NRDME, nonreferable diabetic macular edema; NYU, New York University; PCA, principal component analysis; PCME, pseudophakic cystoid macular edema; PLS, partial least squares; R-CNN, region-based convolutional neural network; RDME, referable diabetic macular edema; RIST, Retina Institute of South Texas; RVO, retinal vein occlusion; SERI, Singapore Eye Research Institute; SIM, Shanghai Integration Model; STARE, Structured Analysis of the Retina; STDR, sight-threatening diabetic retinopathy; SVM, support vector machine; UoA-DR, University of Auckland Diabetic Retinopathy; UTHSCSA, The University of Texas Health Science Center at San Antonio.
*RDME is defined as hard exudates within 1 disc diameter of the macula.
#Grade 0 is defined as no visible hard exudate, grade 1 as exudate distance >1 papilla diameter, and grade 2 as exudate distance ≤1 papilla diameter.
†Other pathologies detected by the models are listed in brackets.
^M0 (nil DME) is defined as no visible hard exudate, M1 (observable DME) as the presence of hard exudate within a circular zone of 3 optic disc diameters around the macula, and M2 (RDME) as the presence of hard exudate within a circular zone of 1 optic disc diameter around the macula.
**AUROC based on the study's data sets combined.
Performance of AI in DME Detection
Detailed statistical results are shown in Table 2. FP-based AI of 25 studies yielded a pooled AUROC of 0.964 (95% CI 0.964–0.964), with a sensitivity of 92.6% (95% CI 90.2–94.4%), specificity of 91.1% (95% CI 88.0–93.4%), and DOR of 147.3 (95% CI 94.0–230.8), while OCT-based AI of 28 studies yielded a pooled AUROC of 0.985 (95% CI 0.985–0.985), with a sensitivity of 95.9% (95% CI 94.1–97.2%), specificity of 97.9% (95% CI 96.6–98.6%), and DOR of 1,154.6 (95% CI 691.0–1,929.1). The overall pooled AUROC, sensitivity, specificity, and DOR of the 53 studies were 0.979 (95% CI 0.979–0.979), 94.6% (95% CI 93.1–95.7%), 95.8% (95% CI 94.3–96.9%), and 456.1 (95% CI 311.9–667.1), respectively. Forest plots and SROC curves of the FP and OCT studies are shown in Fig. 2A–D.
Table 2. Pooled performance of AI in detecting DME and subgroup analyses

Subgroup | Studies (validation data sets) | Pooled AUROC (95% CI) | P | Pooled sensitivity, % (95% CI) | Pooled specificity, % (95% CI) | DOR (95% CI)
---|---|---|---|---|---|---|
Overall | 53 (101) | 0.979 (0.979–0.979) | 94.6 (93.1–95.7) | 95.8 (94.3–96.9) | 456.1 (311.9–667.1) | |
FP | 25 (45) | 0.964 (0.964–0.964) | <0.001 | 92.6 (90.2–94.4) | 91.1 (88.0–93.4) | 147.3 (94.0–230.8) |
OCT | 28 (56) | 0.985 (0.985–0.985) | 95.9 (94.1–97.2) | 97.9 (96.6–98.6) | 1,154.6 (691.0–1,929.1) | |
Type of AI | ||||||
Machine learning | 11 (16) | 0.966 (0.966–0.966) | <0.001 | 96.7 (95.7–97.4) | 94.5 (89.2–97.3) | 723.7 (403.6–1,297.8) |
Deep learning | 42 (85) | 0.979 (0.979–0.979) | 94.2 (92.4–95.5) | 96.0 (94.4–97.2) | 421.1 (279.0–635.7) | |
Developmental data set size# | ||||||
Smaller than median (≤8,655) | 27 (46) | 0.975 (0.975–0.975) | <0.001 | 93.6 (91.3–95.3) | 95.5 (93.1–97.0) | 353.4 (211.9–589.6) |
Larger than median (>8,655) | 28 (55) | 0.981 (0.981–0.981) | 95.3 (93.3–96.7) | 96.1 (94.0–97.5) | 575.0 (323.8–1,021.2) | |
Studies with validation | ||||||
Internal | 10 (18) | 0.985 (0.985–0.985) | <0.001 | 96.9 (93.9–98.5) | 97.5 (93.2–99.1) | 1,089.5 (466.9–2,542.4) |
External | 13 (32) | 0.967 (0.967–0.967) | 91.7 (87.2–94.8) | 93.4 (88.7–96.2) | 162.6 (91.1–290.2) | |
Data set diversity (performance upon external validation) | ||||||
Single data set | 8 (20) | 0.965 (0.965–0.965) | 0.002 | 90.8 (85.0–94.5) | 93.7 (88.2–96.8) | 158.3 (79.3–316.2) |
Multiple data sets^ | 5 (12) | 0.971 (0.971–0.971) | 93.3 (84.2–97.4) | 92.7 (81.4–97.4) | 187.9 (50.0–706.0) |
#The median number of images among studies included in this meta-analysis was 8,655.
^Multiple data sets are defined as data from different institutions or with systematically different population characteristics.
Subgroup Analyses
Regarding the type of AI, deep learning algorithms showed a higher pooled AUROC (0.979) than machine learning algorithms (0.966; P < 0.001). Regarding developmental data set size, the median image number (8,655) was used to stratify algorithms into those using smaller and larger data sets. For studies that proposed multiple algorithms or models developed on separate data sets, each model was evaluated separately. Developmental data sets larger than the median (i.e., >8,655 images) were associated with better validation results, with higher pooled AUROC, sensitivity, specificity, and DOR (0.981, 95.3%, 96.1%, and 575.0, respectively) than data sets smaller than the median (0.975, 93.6%, 95.5%, and 353.4, respectively).
Regarding testing data sets, 50 studies were internally validated, but only 13 studies performed external validation. Among externally validated studies, we compared performance upon internal and external validation. The algorithms showed lower pooled AUROC, sensitivity, specificity, and DOR when validated externally (0.967, 91.7%, 93.4%, and 162.6, respectively) than internally (0.985, 96.9%, 97.5%, and 1,089.5, respectively). Moreover, upon external validation, models trained on multiple data sets showed slightly better performance than those trained on a single data set (pooled AUROC 0.971 and DOR 187.9 for multiple data sets vs. 0.965 and 158.3 for a single data set; P = 0.002).
As shown in Supplementary Fig. 2A and B, the funnel plots are asymmetrical, with bias toward the bottom right, which could result from the large DORs and SEs of certain studies. Studies with small SEs were also distributed across a rather scattered range of DORs instead of concentrating around the funnel. Therefore, the results should be interpreted with caution and indicate the need for further rigorous research to reduce bias.
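For readers who wish to reproduce this type of funnel plot, a minimal sketch is given below; it uses the metafor package (an assumption for illustration; our own analyses used the mada package and VassarStats), and the log-DOR values and SEs are invented rather than extracted from the included studies.

```r
library(metafor)

# Invented per-study log-DOR effect sizes and standard errors,
# standing in for values extracted from primary studies
log_dor <- log(c(120, 45, 300, 88, 15, 600, 950, 60))
se_log  <- c(0.35, 0.50, 0.40, 0.30, 0.60, 0.45, 0.55, 0.25)

# Random-effects model on the log-DOR scale
res <- rma(yi = log_dor, sei = se_log, method = "REML")

# Funnel plot; asymmetry (e.g., points clustered at the bottom right)
# suggests possible publication bias
funnel(res, xlab = "Log diagnostic odds ratio")
```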
Conclusions
The present meta-analysis investigated the performance of AI in detecting DME using FP or OCT images. Overall, our results indicate good discriminative performance for both FP- and OCT-based AI models in terms of pooled AUROC, sensitivity, specificity, and DOR. Potential factors that may increase model performance include the type of AI, sample size, and diversity in the developmental data set.
FP-Based AI
Among the 25 FP-based AI studies, the pooled AUROC was 0.964, with sensitivity and specificity both >90%. This indicates the potential application of FP-based AI in primary care settings, considering the affordability and accessibility of FP. It may serve as an ideal tool to aid health care providers, especially in resource-constrained areas, by reducing intensive human and time input. For example, as FP is already used for DR screening, AI can serve as a first-pass tool that assists human graders' subsequent grading in existing screening programs (6,7). This benefits the population with diabetes and the public health infrastructure by enhancing clinical workflow and referrals.
However, we noticed incomplete reporting of the details of ground truth labeling in each study, i.e., whether it was based on FP alone or on FP with additional clinical examinations (Supplementary Table 3). Since FP alone cannot serve as the gold standard for DME diagnosis, AI trained on these data may not overcome the previously reported high false-positive rate in current screening programs (10,11), despite greater labeling expertise (retinal specialists and ophthalmologists). Thus, standardized reporting of the labeling gold standard and training materials is recommended for future AI studies. Developing models with OCT-based diagnoses used to grade paired FP images (25) may also improve the quality of ground truth labeling.
OCT-Based AI
The 28 OCT-based AI studies, which detected retinal thickening or DME features such as intraretinal or subretinal fluid, yielded a pooled AUROC of 0.985, with sensitivity and specificity >95%. Previous publications attributed the improved performance to OCT providing 3D scans of the whole retinal structure, which are more informative for AI models detecting eyes with DME, especially in cases with central subfield thickness <300 μm (26). In practice, such cases can be difficult to differentiate from normal eyes using 2D FP alone. The high discriminative performance obtained by OCT-based AI demonstrates its potential for application in tertiary settings and eye hospitals, especially in better resourced settings where OCT machines are affordable, not only for detecting the presence of DME but also for further classifying center-involved DME (CI-DME) and non-CI-DME. This is of clinical value because eyes with CI-DME require urgent intervention, such as anti–vascular endothelial growth factor injection, while for eyes with non-CI-DME, initial observation is an acceptable option. Moreover, Tang et al. (17) developed deep learning models that used whole 3D volumetric scans not only to detect DME with classification of CI-DME or non-CI-DME but also to identify non-DME retinal abnormalities, such as epiretinal membrane and macular hole. This implies another advantage of OCT-based AI models: efficiently differentiating retinal abnormalities based on the comprehensive information provided by volumetric scans, which may facilitate more personalized treatment plans for patients.
On the other hand, analysis of 3D volumetric OCT images may be useful for future algorithm development. In our meta-analysis, only two studies (17,27) developed OCT-based algorithms that detected DME from 3D volumetric scans directly. As the data remain insufficient for subgroup analysis, we could not evaluate and compare accuracy between OCT 2D B scans and 3D volumetric scans. Previous studies suggested substantial merits of deep learning trained on 3D volumetric scans in terms of labeling effort (28,29); compared with volume-level annotation, ground truth labeling of individual B scans is more labor intensive and time consuming (17). Outputting a single result per volumetric scan may also facilitate clinical application by graders with less expertise. However, analysis of volumetric scans requires greater computational power. Therefore, future meta-analyses may compare the performance of AI-based DME detection using OCT B scans and volumetric scans and evaluate whether volumetric input provides additional benefits in terms of model development and clinical application.
Subgroup Analyses
We also conducted subgroup analyses to investigate how the type of AI, developmental data set size, validation methods, and number of training data resources affect the performance of AI algorithms.
Type of AI
Regarding AI type, we found that both deep learning and traditional machine learning algorithms obtained satisfactory performance, with pooled AUROC, sensitivity, and specificity >90%, while deep learning algorithms demonstrated a slightly higher pooled AUROC. Whereas traditional machine learning requires complex feature engineering (29), most included studies adopted deep learning, which allows automatic pattern recognition with high-volume modeling capability and may enable identification of subtle markers of early-stage DME. Since deep learning requires greater computational power and larger training data sets (30), some studies further used transfer learning and semisupervised learning techniques. Transfer learning allows pretrained networks to be retrained on specific tasks with fewer training examples and less computational power, while semisupervised learning, such as with generative adversarial networks, addresses the scarcity of labeled data. Self-supervised learning, a newer technique combining high label efficiency with strong generalization capacity, further indicates the potential of deep learning techniques to accelerate medical AI development.
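As a concrete illustration of the transfer-learning recipe, the following minimal R (keras) sketch retrains only a new binary DME head on top of a frozen, ImageNet-pretrained Inception v3 backbone, mirroring the backbones used by several included studies; the input size, learning rate, and metric name are illustrative assumptions rather than settings from any specific study.

```r
library(keras)

# ImageNet-pretrained Inception v3 backbone without its classification top
base <- application_inception_v3(
  include_top = FALSE, weights = "imagenet", input_shape = c(299, 299, 3)
)
freeze_weights(base)  # keep pretrained features fixed; train only the head

# New binary classification head for DME vs. no DME
inputs  <- layer_input(shape = c(299, 299, 3))
outputs <- inputs %>%
  base() %>%
  layer_global_average_pooling_2d() %>%
  layer_dense(units = 1, activation = "sigmoid")

model <- keras_model(inputs, outputs)
model %>% compile(
  optimizer = optimizer_adam(learning_rate = 1e-4),
  loss = "binary_crossentropy",
  metrics = list(metric_auc(name = "auroc"))
)
```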
Model Development
Our results show that developing AI with a larger data set provides better performance. This supports the outcomes of previous subsampling experiments, which showed that the AUROC of an FP-based deep learning model for DME detection increases with sample size (25) and that the performance of another FP-based deep learning model for referable DR increased with training data set size, plateauing at ∼60,000 images (18). However, we also noticed a discrepancy in an individual study. Ai et al. (30) developed two models for DME detection: a complete model trained on a large but imbalanced data set of >60,000 normal and DME images, and a limited model trained on a small but balanced data set of only 1,000 images from each class. The limited model outperformed the complete model despite being trained on far fewer images. In our meta-analysis, however, we could not compare the effects of class balance and sample size on model development and performance.
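Although our meta-analysis could not isolate the effect of class balance, one standard mitigation when the training set is imbalanced is class weighting. The toy R (keras) sketch below up-weights the minority DME class in proportion to its rarity; the data, architecture, and all settings are purely illustrative.

```r
library(keras)

# Toy stand-in for an imbalanced FP data set: ~15% DME prevalence
x <- array(runif(200 * 64 * 64 * 3), dim = c(200, 64, 64, 3))
y <- rbinom(200, 1, 0.15)

# Small placeholder classifier
model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 8, kernel_size = 3, activation = "relu",
                input_shape = c(64, 64, 3)) %>%
  layer_global_average_pooling_2d() %>%
  layer_dense(units = 1, activation = "sigmoid")
model %>% compile(optimizer = "adam", loss = "binary_crossentropy")

# Weight the minority (DME) class so the loss is not dominated
# by the majority (non-DME) class
w_pos <- sum(y == 0) / max(sum(y == 1), 1)
model %>% fit(x, y, epochs = 2, batch_size = 32,
              class_weight = list("0" = 1, "1" = w_pos))
```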
Model Validation
Apart from data acquisition, we noticed deficiencies in external validation among the included studies: it was performed in only 13 of 53 studies. External validation is important because overfitting is a common pitfall in AI training, wherein models learn patterns specific to the training data and show strong performance on similar internal testing data, resulting in large discrepancies between internal and external validation. This is supported by our subgroup analysis, in which algorithms showed a significant decline in all pooled performance parameters when validated on external, unseen data. Accordingly, guidelines on AI development and performance reporting emphasize the importance of external validation before clinical translation (31,32). The insufficiency of external validation among the included studies limits our assessment of the algorithms' real-world generalizability and may lead to overestimation of their performance in clinical settings. Future research should be directed toward external validation of algorithms.
Regarding training data diversity, we found better performance upon external validation among studies trained on data from multiple institutions or from populations with systematically different characteristics. Greater data diversity from multiple sources may make algorithms more resilient across the different baseline populations encountered in unseen external data. This suggests that increasing developmental data diversity may enhance the generalizability of AI upon clinical application, helping maintain its performance in detecting DME among patients with various baseline characteristics.
We also noticed discrepancies in baseline characteristics between validation data sets and real-world data. Epidemiological studies showed that DME prevalence is 1.4–12.8% among the population with type 2 diabetes worldwide (33,34). However, the proportion of DME images in the validation data sets of included studies had a median of 27.5% (Supplementary Table 3), much higher than the general prevalence. Such discrepancy may hinder our understanding of the models' realistic applicability in primary care settings. Validation of AI using data sets with a DME prevalence more representative of current epidemiology would therefore provide more representative statistics. In view of the unrepresentative DME prevalence in retrospective data sets, it is also important to perform real-world prospective validation studies before these models can be used clinically with confidence.
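The practical consequence of this prevalence gap can be shown with a short positive predictive value (PPV) calculation via Bayes' rule. The sketch below uses the pooled FP-based sensitivity and specificity from this meta-analysis as illustrative inputs; the 5% screening prevalence is an assumption chosen from within the epidemiological range cited above, not a result of any included study.

```python
# PPV at different DME prevalences via Bayes' rule. Illustrative inputs:
# pooled FP-based sensitivity 92.6% and specificity 91.1% from this
# meta-analysis; the 5% screening prevalence is an assumed value.
def ppv(sensitivity, specificity, prevalence):
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

sens, spec = 0.926, 0.911
for prev in (0.275, 0.05):  # median validation-set proportion vs. assumed screening prevalence
    print(f"prevalence {prev:.1%}: PPV = {ppv(sens, spec, prev):.1%}")
# prevalence 27.5%: PPV = 79.8%
# prevalence 5.0%:  PPV = 35.4%
```

At the same sensitivity and specificity, moving from the enriched validation-set prevalence to a realistic screening prevalence roughly halves the PPV, illustrating why enriched retrospective data sets can overstate real-world applicability.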
Strengths and Limitations
Our meta-analysis has several strengths. First, our study demonstrated that machine learning and deep learning can be used to develop AI with high diagnostic performance for detecting and classifying DME from FP or OCT images. From a clinical and public health standpoint, these advancements can streamline clinical workflow and optimize identification of patients requiring referral to specialists or retina clinics. FP-based AI could be incorporated into existing DR screening programs, while OCT-based AI could be used as a secondary tool to screen individuals with positive FP results and determine treatment options. Second, we identified factors that may enhance model performance, such as deep learning techniques, a larger developmental data set, and greater training data diversity. Third, regarding gaps in existing research, we recognized insufficiencies in external validation and discrepancies between validation data and realistic, primary care-level populations. Addressing these gaps may enhance the clinical translatability of these AI models and accelerate their implementation in real-life settings.
Our study also has several limitations. First, the outcomes of algorithms were not standardized and were labeled based on different disease definitions. We noticed variable algorithm outcomes, i.e., DME, clinically significant macular edema (CSME), non–clinically significant macular edema (non-CSME), center-involved DME (CI-DME), non-CI-DME, and referable DME; CSME and non-CSME were graded based on exudate location on FP and on retinal thickness or DME features on OCT, while CI-DME and non-CI-DME were graded based on retinal thickness on OCT. There is also no standard definition of referable DME. Clinically, this may complicate interpretation of results and, thus, referral procedures. Future guidelines should standardize model outcomes and disease definitions to facilitate clinical translation and allow fair performance evaluation. Second, there were insufficient data to compare model performance between OCT 2D B scans and 3D volumetric scans or across OCT machines and to evaluate the generalizability of models on external data sets. Third, the small sample size of some included studies may reduce the representativeness of their reported AI performance. Fourth, few articles reported detailed demographics of their databases, such as age, sex, and duration of diabetes, leaving insufficient data for meta-regression to evaluate database diversity and its effect on AI development. Finally, no included study assessed whether AI performance is affected by DME severity; future research stratified by DME grade is required to elucidate performance discrepancies among AI algorithms.
Future Directions
For future development of AI models, data acquisition remains a major challenge in terms of data diversity and sample size. Collaboration among multiple centers or integration of multiple public data sets with different baseline characteristics would help increase the diversity and size of developmental data. Data-efficient techniques, such as generative adversarial networks (35), the synthetic minority oversampling technique (36), few-shot transfer learning (37), and self-supervised learning, may aid algorithm development when training data are limited, as sketched below. However, to date, no studies have provided an estimation of the optimal sample size and class balance for DME model development. Further studies on sample size determination methodologies are required to estimate a developmental sample size that balances diagnostic performance with resource availability (38). Standardized reporting of the labeling gold standard, patient demographics (such as age, sex, ethnicity, and type and duration of diabetes), and image acquisition settings would also facilitate assessment of the value of AI as a potential population screening tool for DME. Clinically, AI-relevant education of health care providers and development of a systematic approach for AI implementation by multiple stakeholders, including legislation addressing accountability and data sharing, are crucial (39).
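Of the data-efficient techniques listed above, the synthetic minority oversampling technique is the simplest to demonstrate. The sketch below applies imbalanced-learn's SMOTE to synthetic feature vectors; in an imaging pipeline it would typically be applied to extracted features rather than raw pixels, and all sizes here are illustrative assumptions.

```python
# Sketch of SMOTE for class imbalance (synthetic feature vectors; in an
# imaging pipeline this would typically act on extracted features, not
# raw pixels -- sizes here are illustrative assumptions).
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Imbalanced toy data set: roughly 5% minority (DME-like) class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# SMOTE synthesizes new minority examples by interpolating between
# existing minority samples and their nearest neighbors.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))        # classes balanced
```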
In conclusion, AI algorithms show satisfactory discriminative performance in detecting DME from either FP or OCT images. Potential factors that may improve model performance include the type of AI, sample size, and diversity of the developmental data set. Significant gaps remain in external validation among current studies, limiting evaluation of models' generalizability for clinical translation. Further studies could evaluate the potential added value of AI analysis of OCT volumetric scans, estimate the optimal sample size and the effects of class balance and patient demographics, and compare the effectiveness of AI-based DME detection with human evaluation.
This article contains supplementary material online at https://doi.org/10.2337/figshare.24518287.
PROSPERO reg. no. CRD4202127609, https://www.crd.york.ac.uk/prospero/
Article Information
Duality of Interest. No potential conflicts of interest relevant to this article were reported.
Author Contributions. C.L. and Y.L.W. researched data; performed the systematic search, study selection, data extraction, and statistical analyses; and wrote the manuscript. Z.T. and A.R.R. researched data, contributed to the discussion, and reviewed and edited the manuscript. X.H., T.X.N., D.Y., and S.Z. conducted the data extraction and reviewed and edited the manuscript. J.D., S.K.H.S., and C.Y.C. contributed to the discussion and reviewed and edited the manuscript. All authors approved the decision to submit for publication. A.R.R. and C.Y.C. are the guarantors of this work and, as such, had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.