To determine the diagnostic accuracy in a real-world primary care setting of a deep learning–enhanced device for automated detection of diabetic retinopathy (DR).
Retinal images of people with type 2 diabetes visiting a primary care screening program were graded by a hybrid deep learning–enhanced device (IDx-DR-EU-2.1; IDx, Amsterdam, the Netherlands), and its classification of retinopathy (vision-threatening [vt]DR, more than mild [mtm]DR, and mild or more [mom]DR) was compared with a reference standard. This reference standard consisted of grading according to the International Clinical Classification of DR by the Rotterdam Study reading center. We determined the diagnostic accuracy of the hybrid deep learning–enhanced device (IDx-DR-EU-2.1) against the reference standard.
A total of 1,616 people with type 2 diabetes were imaged. The hybrid deep learning–enhanced device’s sensitivity/specificity against the reference standard was, respectively, for vtDR 100% (95% CI 77.1–100)/97.8% (95% CI 96.8–98.5) and for mtmDR 79.4% (95% CI 66.5–87.9)/93.8% (95% CI 92.1–94.9).
The hybrid deep learning–enhanced device had high diagnostic accuracy for the detection of both vtDR (although the number of vtDR cases was low) and mtmDR in a primary care setting against an independent reading center. This allows its safe use in a primary care setting.
Introduction
With the growing prevalence of diabetes, the prevalence of diabetic retinopathy (DR) is rising as well. Screening for DR has proven to be effective in the prevention of visual loss and blindness from DR (1). National health authorities (2) and most professional organizations (3) recommend regular DR screening programs, which are usually integrated within regular diabetes care (4). Automated medical diagnosis has achieved parity with or even superiority to clinical experts’ diagnosis for an increasing number of clinical tasks, including detection of DR (5–7), and can help to improve health care efficiency, affordability, and accessibility of DR screening. Moreover, automated diagnosis reduces the diagnostic variability that is common in expert review of medical images (8).
Multiple diagnostic algorithms for the detection of DR are now commercially available for which the performance has been independently evaluated (9–12). One of these, the IDx-DR-EU-2.1 device, has been enhanced with deep learning. Deep learning, a machine learning technique that uses multilayer neural networks, has allowed substantial improvements in artificial intelligence (AI)-based diagnostic systems (13–17). Because deep learning is used to build its explicit retinopathy lesion (biomarker) detectors, the IDx-DR-EU-2.1 is a lesion-based AI system, mimicking human visual processing (13,18). While most deep learning applications associate images directly with a diagnostic output, lesion-based AI systems detect lesions and other abnormalities and are thought to be more robust to catastrophic failure from small perturbations in images (18). The lesion-based AI system allowed significantly improved diagnostic accuracy on a laboratory data set (13) and is designed to detect multiple levels of DR and diabetic macular edema (DME) according to the International Clinical Diabetic Retinopathy Severity Scale (ICDR) (13,19,20).
The purpose of this study was to determine the diagnostic accuracy of the hybrid deep learning–enhanced device (IDx-DR-EU-2.1) to detect more than mild DR and/or DME (mtmDR) and vision-threatening DR or DME (vtDR), according to the ICDR grading system compared with the reference standard, in people with type 2 diabetes in a primary care setting.
Research Design and Methods
Study Design, Population, and Setting
This retrospective study included all people with type 2 diabetes who were screened in 2015 at Star-SHL, a diagnostic center in Rotterdam, the Netherlands. Star-SHL is a so-called “primary care diagnostic center,” a facility that provides medical diagnostics to general practitioners in the Southwest region of the Netherlands. Under the guidance of general practitioners, Star-SHL counsels patients with chronic diseases, including diabetes. Study inclusion criteria were an existing diagnosis of type 2 diabetes, no previous diagnosis of DR, and the ability to undergo fundus photography. Patients were not otherwise selected and reflected the multiethnic general population of Rotterdam, with around 15% non-Caucasian inhabitants.
Imaging
Participants underwent fundus imaging according to a strict standardized protocol (two images per eye: one macula centered and one disc centered [45° field of view]) using Topcon TRC-NW200 cameras operated by experienced Star-SHL technicians. The images were acquired at eight different sites, and camera settings were identical across sites. Pharmacological dilation was applied when the technician decided that the images did not meet the requirements for grading. Image sets for each participant were stored in a proprietary Picture Archival System (PACS). Approval was obtained from the Human Subjects Committee of Star-SHL to conduct the study in accordance with the tenets of the Declaration of Helsinki.
Reference Standard Grading
A reading center determined the exam quality, as well as the presence and severity of DR, according to the ICDR grading system for all exams (20,21). The reading center protocol was as follows: two experienced readers from the Rotterdam Study at Erasmus Medical Center (22–24) independently graded each exam per the ICDR grading system. Graders were masked to any algorithm outputs. Disagreements between the two readers were adjudicated by an experienced retinal specialist (F.D.V.) for the final grade. For analysis, the final ICDR grades were combined into three categories: no or mild DR (and no DME), moderate DR (mtmDR but not vtDR), and vtDR (see Supplementary Table 1). The presence of exudates or retinal thickening (if visible on nonstereo photographs) within 1 disc diameter of the fovea was taken as evidence of DME (19).
Automated Detection of DR
All images for which a reference standard according to the reading center was available were graded by a deep learning–enhanced device (IDx-DR-EU-2.1), referred to here as “the device.” The device’s core is a lesion-based algorithm with explicit lesion detectors, enhanced by deep learning, thought to closely resemble human visual processing (13,25). The underlying algorithms have been described extensively (13,26). Briefly, the lesion-based algorithm consists of multiple mutually dependent detectors of DR characteristic lesions, many of them implemented as convolutional neural networks. The detector outputs are integrated into an index, a numerical output varying between 0 and 1, indicating the likelihood of the exam having DR. Both images (fovea centered and optic disc centered) are colocalized and integrated using the optic disc and the larger retinal vessels as landmarks. A categorical outcome is provided: no or mild DR, moderate DR, or vtDR (see Supplementary Table 2). In contrast to the reference standard, the device puts both no DR and mild DR into one grade. If the exam is of insufficient quality, no outputs for vtDR or moderate DR are provided.
Statistical Analysis of Performance
For assessment of the interobserver agreement of the reference standard, specific agreement between the two graders was calculated for the categories moderate DR and vtDR using a method described recently (27). Specific agreement was expressed as the chance that one of the graders scored the same grade, i.e., moderate DR or vtDR, as the other grader. The 95% CIs for specific agreement were obtained by bootstrap resampling using 1,000 bootstrap replicates.
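The specific agreement calculation can be illustrated with a minimal Python sketch. It assumes the usual two-rater form of specific agreement (twice the number of exams both graders placed in a category, divided by the total number of times either grader used that category) and a simple percentile bootstrap; the exact procedure of reference 27 may differ in detail. The function names and the toy grades are hypothetical and are not study data.

```python
import random


def specific_agreement(grades_a, grades_b, category):
    """Specific agreement for one category between two graders.

    Assumed formula: 2 * (exams both graders put in the category)
    divided by the total number of times either grader used it.
    """
    both = sum(1 for a, b in zip(grades_a, grades_b)
               if a == category and b == category)
    total = grades_a.count(category) + grades_b.count(category)
    return 2 * both / total if total else float("nan")


def bootstrap_ci(grades_a, grades_b, category, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling exams (paired grades) with replacement."""
    rng = random.Random(seed)
    n = len(grades_a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = specific_agreement([grades_a[i] for i in idx],
                                   [grades_b[i] for i in idx], category)
        if value == value:  # skip replicates where the category never occurs (NaN)
            stats.append(value)
    if not stats:
        return float("nan"), float("nan")
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


# Toy grades for illustration only (not study data).
grader_1 = ["moderate", "none", "moderate", "vtDR"]
grader_2 = ["moderate", "moderate", "none", "vtDR"]
agreement_moderate = specific_agreement(grader_1, grader_2, "moderate")
ci_moderate = bootstrap_ci(grader_1, grader_2, "moderate")
```

Resampling whole exams keeps each pair of grades together, so the dependence between the two graders on the same exam is preserved in every replicate.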
With use of the ICDR classification, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), with their 95% CIs, were calculated for the device outputs no or mild DR, mtmDR, and vtDR, compared with the corresponding ICDR reference standard classifications of no or mild DR, moderate DR, and vtDR (20).
The analysis was based on exact binomial distribution. Exams of insufficient quality per the ICDR reference standard, or the device, were excluded from diagnostic accuracy analysis.
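One common interval based on the exact binomial distribution is the Clopper-Pearson interval. The sketch below is an assumption about what "exact" could mean in practice: the text does not name the variant used, so intervals computed this way need not reproduce the reported ones. The helper names are hypothetical, and only the Python standard library is used.

```python
from math import exp, fsum, lgamma, log


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with 0 < p < 1, summed in log space."""
    log_p, log_q = log(p), log(1.0 - p)
    return fsum(
        exp(lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q)
        for i in range(k + 1)
    )


def _solve_decreasing(f, target, iters=60):
    """Bisection on (0, 1) for f(p) = target, where f decreases in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid  # p is still too small
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Two-sided Clopper-Pearson interval for k successes out of n trials."""
    # Lower bound: P(X >= k | p) = alpha / 2, i.e. P(X <= k - 1 | p) = 1 - alpha / 2.
    lower = 0.0 if k == 0 else _solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: P(X <= k | p) = alpha / 2.
    upper = 1.0 if k == n else _solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


# Illustrative call with round numbers (not study data).
example_interval = clopper_pearson(8, 10)
```

Working in log space avoids overflow of the binomial coefficients when the denominator runs to more than a thousand exams, as it does for the specificity estimates.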
Results
Between 1 January 2015 and 31 December 2015, 1,616 participants were imaged. Mean age was 63 years (SD 11.3), and 53% of the participants were male (see STARD diagrams [Figs. 1 and 2]).
Of these 1,616 participants, the images of 191 (11.7%) were graded as of insufficient quality by the reference standard. Of the 1,425 participants with exams of sufficient quality, 1,187 (83.3%) had no DR, 167 (11.7%) had mild DR, 55 (3.9%) had moderate DR, and 16 (1.1%) had vtDR (15 of these 16 vtDR cases had DME and 1 [0.1%] had vtDR without DME, but with severe nonproliferative DR)—all according to the reference standard per the ICDR grading system. The interobserver agreement of the reference standard, expressed as specific agreement, i.e., the chance that one of the graders scored the same grade as the other, was 53% (95% CI 43–62) in case of moderate DR and 48% (95% CI 26–68) for vtDR.
The device gave an output of insufficient quality for 280 participants (17.3%) per the ICDR grading system. Of the 1,293 participants (90.6%) with exams of sufficient quality for both the reference standard and device, 1,167 (90.3%) had no or mild DR, 82 (6.4%) moderate DR, and 44 (3.4%) vtDR, including 15 (34.1%) with DME (see Table 1).
Reading center reference standard (ICDR grading system) | Device output: no or mild DR | Device output: mtmDR | Device output: vtDR | Device output: insufficient quality | Total
---|---|---|---|---|---
No DR | 1,050 | 19 | 6 | 112 | 1,187
Mild DR | 104 | 40 | 11 | 12 | 167
Moderate DR | 13 | 23 | 11 | 8 | 55
vtDR | 0 | 0 | 16 | 0 | 16
All | 1,167 | 82 | 44 | 132 | 1,425
Data are n.
The sensitivity/specificity, per the ICDR reference standard, for the device to detect vtDR was 100% (95% CI 77.1–100)/97.8% (95% CI 96.8–98.5) and mtmDR 79.4% (95% CI 66.5–87.9)/93.8% (95% CI 92.1–94.9). The PPV and NPV for vtDR were 36.4% (95% CI 28.4–45.2) and 100%, respectively. For mtmDR, the PPV and NPV were 39.7% (95% CI 33.8–45.8) and 98.9% (95% CI 98.2–99.3), respectively.
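The point estimates above can be recomputed from the counts in Table 1 once exams of insufficient quality are excluded. The following Python sketch does this; the counts are taken from Table 1, while the dictionary layout and the function name are illustrative assumptions rather than the authors' analysis code.

```python
# Counts from Table 1, restricted to exams of sufficient quality for both
# the reference standard and the device.
# Rows: reference standard grade. Columns: device output
# (index 0 = no or mild DR, 1 = mtmDR, 2 = vtDR).
TABLE_1 = {
    "no_dr":    (1050, 19, 6),
    "mild":     (104, 40, 11),
    "moderate": (13, 23, 11),
    "vtdr":     (0, 0, 16),
}


def diagnostic_accuracy(table, positive_rows, positive_cols):
    """Sensitivity, specificity, PPV, and NPV for one positive/negative split."""
    tp = fp = fn = tn = 0
    for row, counts in table.items():
        for col, n in enumerate(counts):
            ref_positive = row in positive_rows
            dev_positive = col in positive_cols
            if ref_positive and dev_positive:
                tp += n
            elif ref_positive:
                fn += n
            elif dev_positive:
                fp += n
            else:
                tn += n
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# mtmDR: moderate DR or vtDR counts as positive on both axes.
mtm = diagnostic_accuracy(TABLE_1, {"moderate", "vtdr"}, {1, 2})
# vtDR: only vtDR counts as positive on both axes.
vt = diagnostic_accuracy(TABLE_1, {"vtdr"}, {2})
```

For mtmDR this gives 50 true positives and 13 false negatives (50/63 sensitivity), and for vtDR 16 true positives against 28 false positives (16/44 PPV), in line with the percentages reported above.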
There were 13 false negative exams for the enhanced device’s mtmDR output according to the ICDR reference standard, and all images for these participants are shown in Fig. 3. Review of the images of the 13 false negative cases in Fig. 3 indicated that these participants had a single isolated hemorrhage or cotton wool spot and had no microaneurysms.
Conclusions
The results show that a hybrid lesion-based device, with deep learning enhancements, for the automated detection of DR achieved high diagnostic accuracy in a primary care setting in a study with a predetermined protocol and an independent reference standard. These results confirm corresponding results in an earlier study of essentially the same algorithm in a laboratory setting (13). Specifically, the device achieved high sensitivity (100%) in people with vtDR, as the device did not miss any vtDR, or DME, according to the ICDR grading system. It also achieved high specificity (97.8%). However, the number of vtDR cases, although representative for the studied patient population, was low and prevents definite conclusions. The device also had a high sensitivity to detect mtmDR of 79.4%, at a specificity of 93.8%.
Implementing the device in the health care system at primary care sites, where patients with diabetes are regularly seen, could improve the percentage of patients screened when indicated. In addition, such a device would lead to improved accuracy compared with the present standard of care and to a higher number of patients with images of sufficient quality, owing to the device’s direct feedback on image quality. Nongradable images can either be reviewed by a human grader or lead to direct referral to an eye care provider, so that no diagnoses of DR are missed as a result of images of insufficient quality and good clinical care is safeguarded. Overall, this system has the potential to reduce the sociomedical burden of DR.
Clinicians increasingly deviate from the methods used by reading centers, as defined in the original standards (31). For example, whether a single red lesion is a microaneurysm or a hemorrhage can make the difference between a mild versus moderate level of DR. These levels were used in the primary outcome studies that to a great degree still determine the management of DR, such as the Diabetic Retinopathy Study (DRS) (32), Early Treatment of Diabetic Retinopathy Study (ETDRS) (33), and DCCT/EDIC (Diabetes Control and Complications Trial/Epidemiology of Diabetes Interventions and Complications) studies (34), and so it is important to use methods that are as close as possible to methods of these original standards to avoid conflicts based solely on differences in definitions. The ICDR classification used in the current study is a simplified classification based on the original ETDRS classification, which was often too complicated to use in clinical studies. It is widely accepted in the ophthalmological community and the preferred classification in leading reading centers around the world.
A relatively low sensitivity to detect DR in a standard of care setting, using single human graders, has been shown in previous studies (11,35,36). This is also clear from the current study, with a modest interobserver agreement of the graders of roughly 50%. One of the advantages of using a device for the automated detection of DR is the consistently high diagnostic accuracy—not accomplished by single human graders.
The results also show that the diagnostic accuracy of a device to detect DR is typically lower in a real-world setting than in a laboratory setting, as we and others have shown previously (9,13,19). Image quality in published data sets is likely higher than is found in a real-world setting. Finally, there are often differences in the prevalence of DR, with laboratory studies thus far typically showing higher prevalence than in real-life studies such as this one (19). The recent studies by Gulshan et al. (16) and by Ting et al. (17) do report overlapping diagnostic accuracy values for automated screening of DR. The random subject sample of real-life images, prospectively collected, which will inherently have a large number of poor-quality photographs, was unique to our study.
The study has limitations. The reference standard was graded from retinal color images, which lack stereo, and no macular optical coherence tomography was available—now a widely used method for determining the presence of DME. Isolated retinal thickening may be underappreciated (37), though human expert detection of DME from exudates only, in nonstereo images, was shown to be almost as sensitive as clinical stereo biomicroscopic analysis of retinal thickening (38,39). DME prevalence and severity may be underestimated in this data set, and a reference standard including optical coherence tomography could lead to differences in a device’s measured algorithmic performance.
The application of mydriatics was unfortunately not reported to the diagnostic center, and the influence of mydriatics on quality of the images could not be analyzed.
Missing diagnoses other than DR is inherent to most algorithms for automated screening. Other pathologies that trigger a positive output, like venous occlusions or exudative (wet) age-related macular degeneration, will be referred to the ophthalmologist, but other, more subtle, diagnoses, like glaucoma or dry age-related macular degeneration, may be missed. These diagnoses are relatively infrequent (and in many cases probably already known), so the importance of missing other diagnoses is limited and, in our opinion, acceptable.
The retrospective nature of the study can be considered to be a limitation but allowed for the analysis of an unselected, unbiased, real-life data set.
The device used in the current study has recently received U.S. Food and Drug Administration approval for providing a screening decision without the need for a clinician to also interpret the image or results, making it usable by health care providers who may not normally be involved in eye care (40).
In summary, the device had high diagnostic accuracy for the detection of vtDR and a more modest but still adequate accuracy in detection of mtmDR in a primary care setting using an independent reference standard. The diagnostic accuracy of the device therefore allows safe use in a primary care setting.
Article Information
Funding. M.D.A. is the Robert C. Watzke Professor of Ophthalmology and Visual Sciences, University of Iowa; Research to Prevent Blindness, New York, NY. This material is the result of work supported with resources and the use of facilities at the Iowa City VA Medical Center.
Contents are solely the responsibility of the authors and do not necessarily represent the official views of the Department of Veterans Affairs or the U.S. government.
Duality of Interest. This study was funded by IDx. M.D.A. is listed as inventor on patents and patent applications related to the study subject. M.D.A. is director of and shareholder in IDx. All authors, with the exception of G.N., received financial support from IDx. No other potential conflicts of interest relevant to this article were reported.
Author Contributions. F.D.V. drafted the manuscript and supervised the study. F.D.V., M.D.A., G.C.F.B., C.K., G.N., and A.A.v.d.H. were responsible for study concept and design. F.D.V., G.C.F.B., C.K., and A.A.v.d.H. interpreted data. F.D.V., G.C.F.B., and C.K. acquired data. A.A.v.d.H. analyzed data and performed statistical analysis. All authors critically revised the manuscript for important intellectual content and provided administrative, technical, or material support. F.D.V. is the guarantor of this work and, as such, had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Prior Presentation. Parts of this study were presented in abstract form at the 53rd Annual Meeting of the European Association for the Study of Diabetes, Lisbon, Portugal, 11–15 September 2017.