OBJECTIVE—To evaluate the performance of a system for automated detection of diabetic retinopathy in digital retinal photographs, built from published algorithms, in a large, representative, screening population.
RESEARCH DESIGN AND METHODS—We conducted a retrospective analysis of 10,000 consecutive exams (one per patient visit; each exam comprising four retinal photographs, two of the left eye and two of the right) from 5,692 unique patients in the EyeCheck diabetic retinopathy screening project, imaged with three types of cameras at 10 centers. Inclusion criteria included no previous diagnosis of diabetic retinopathy, no previous visit to an ophthalmologist for a dilated eye exam, and both eyes photographed. One of three retinal specialists evaluated each exam as unacceptable quality, no referable retinopathy, or referable retinopathy. We then selected exams with sufficient image quality and determined presence or absence of referable retinopathy. Outcome measures included area under the receiver operating characteristic curve, number needed to miss one case (NNM), and type of false negative.
RESULTS—Total area under the receiver operating characteristic curve was 0.84, and NNM was 80 at a sensitivity of 0.84 and a specificity of 0.64. At this point, 7,689 of 10,000 exams had sufficient image quality, 4,648 of 7,689 (60%) were true negatives, 59 of 7,689 (0.8%) were false negatives, 319 of 7,689 (4%) were true positives, and 2,581 of 7,689 (33%) were false positives. Twenty-seven percent of false negatives contained large hemorrhages and/or neovascularizations.
CONCLUSIONS—Automated detection of diabetic retinopathy using published algorithms cannot yet be recommended for clinical practice. However, performance is such that evaluation on validated, publicly available datasets should be pursued. If algorithms can be improved, such a system may in the future lead to improved prevention of blindness and vision loss in patients with diabetes.
Diabetic retinopathy blinds ∼25,000 patients with diabetes annually in the U.S. alone and is the main cause of blindness in the working-age populations of the U.S. and Europe (1). Almost 50% of the 18 million patients with diabetes in the U.S. do not undergo any form of regular documented dilated eye exam (2). This is in spite of overwhelming scientific evidence that such exams, combined with appropriate management, can prevent up to 95% of cases of vision loss and blindness (3–10), and in spite of guidelines by the American Diabetes Association and the American Academy of Ophthalmology that advise an annual dilated eye exam for most patients with diabetes (11). Digital photography of the retina examined by ophthalmologists or other qualified readers has been shown to have sensitivity and specificity comparable with or better than indirect ophthalmoscopy by an ophthalmologist (12,13) and has been proposed as an approach to make the dilated eye exam available to underserved populations that do not receive regular exams by ophthalmologists. If all of these populations were to be served with digital imaging, the number of retinal images to be evaluated annually would be 32 million (∼50% of patients with diabetes, at least two photographs per eye) (13,14).
Over the last few years, we and others have developed and published computer algorithms that can aid the ophthalmologist with the evaluation of digital fundus photographs for early forms of diabetic retinopathy in a community (screening) population (15–27). Our studies on small numbers of patients show that such algorithms have the capacity to perform comparably with retinal specialists and outperform most other published algorithms on limited tasks including:
correct localization of the optic disc in 999 of 1,000 retinal images (28)
segmentation of retinal vessels at an accuracy of 94.2% in 20 retinal images (29)
detection of hemorrhages, microaneurysms, and vascular abnormalities with sensitivity of 100% and specificity of 87% in 100 retinal images (30)
detection and separation of exudates, cotton-wool spots, and drusen with a lesion sensitivity of 95% and specificity of 88% in 300 retinal images (31)
detection of retinal images with insufficient image quality with an accuracy of 97.4% in 1,000 retinal images (32).
Though these algorithms were designed for narrowly focused tasks, they can potentially be combined into a complete system for the detection of diabetic retinopathy in a screening setting, i.e., in a population in which the incidence of newly diagnosed diabetic retinopathy is <30% (14,33–35). No studies have examined the performance of a combined system, built from published algorithms, for the detection of diabetic retinopathy in a large group of patients from a true screening population, i.e., one with a high proportion (up to 90%) of normal-appearing fundi and a low proportion (up to 30%) of fundoscopically detectable diabetic retinopathy. Typical screening projects involve multiple sites, often use multiple types of digital fundus cameras, and employ camera operators with varying levels of experience, resulting in retinal images that exhibit considerable variation in size, resolution, and image quality.
The gold standard for the evaluation of diabetic retinopathy is seven-field stereo fundus photography read by trained readers in accordance with the Early Treatment Diabetic Retinopathy Study (ETDRS) standard (36), and ideally the performance of a complete system would be compared with the evaluation of ETDRS stereo slides taken from the same patients on the same day. However, the expense of such a study with seven-field stereo photography by trained photographers and evaluation by ETDRS readers on a sufficiently large number of patients can only be warranted if preliminary studies show that such a research effort may be worthwhile.
The EyeCheck project for online diabetic retinopathy detection in the Netherlands uses so-called “nonmydriatic” fundus cameras. Over 20,000 exams have been performed to date, with two photographs per eye, on patients not known to have diabetic retinopathy; each exam is read by one of three participating ophthalmologists according to a strict protocol. The prevalence of referable diabetic retinopathy in this population ranges from 5 to 10% (33).
The present study was designed as a preliminary study to determine how well the performance of a combination of published algorithms for automated detection of diabetic retinopathy holds up against the clinical evaluation of a single retinal specialist on the same 10,000 retinal exams, consisting of 40,000 retinal images obtained with a variety of retinal cameras from the EyeCheck project.
RESEARCH DESIGN AND METHODS—
In this retrospective study, we selected 10,000 consecutive patient visits from 5,692 unique patients over the period 2003–2005 from the EyeCheck project, which has been described elsewhere (33). Each visit consisted of four retinal photographs (two of the left eye and two of the right), as well as textual information about age, A1C, sex, and presence of risk factors. Patients are photographed annually in community health centers according to the imaging protocol described below. The study protocol was approved by the institutional review board of the University of Iowa. Because of the retrospective nature of the study and because full anonymity was maintained throughout, informed consent was judged not to be necessary by the review board.
Inclusion/exclusion criteria
Patients were included if they had a diagnosis of diabetes according to World Health Organization criteria and were aged 18 years or over. Patients were excluded if they had a previous diagnosis of diabetic retinopathy, had a previous visit to an ophthalmologist for a dilated retinal eye exam, had only one eye photographed, had fewer than two photographs per eye, or had no demographic data available. Once a patient is evaluated as having referable retinopathy, he or she is referred and not rephotographed. Thus, every exam was of a patient who, at the time of imaging, was either known not to have referable diabetic retinopathy (based on a previous exam) or not yet known to have diabetic retinopathy at all.
Imaging protocol
Patients were photographed with “nonmydriatic” digital retinal cameras by trained technicians at 10 different sites, using either the Topcon NW 100, the Topcon NW 200 (Topcon, Tokyo, Japan), or the Canon CR5-45NM (Canon, Tokyo, Japan) nonmydriatic cameras. Across sites, four different camera settings were used: 1) 640 × 480 = 0.3 megapixels (Mp) with a 45° field of view (fov), 2) 768 × 576 = 0.4 Mp with a 35° fov, 3) 1,792 × 1,184 = 2.1 Mp with a 35° fov, and 4) 2,048 × 1,536 = 3.1 Mp with a 35° fov; all images were JPEG compressed at the minimal compression setting available. For each exam, four images were acquired, two of each eye: one centered on the fovea and one centered on the optic disc. The exam, date of birth, duration of diabetes, and A1C status were transmitted over the internet to the Web site (www.eyecheck.nl) and evaluated by one of three ophthalmologists, based on the International Clinical Diabetic Retinopathy Disease Severity Scale, as either “no referable retinopathy” or “referable retinopathy;” sufficiency of image quality was also evaluated (37). This evaluation was documented online and was immediately available to the primary care physician and patient at the community health center. Patients were recommended for repeat imaging after 1 or 2 years, depending on risk factors not available for this study. For this study, the following were available from each exam: the anonymized retinal images, the age of the patient when the photographs were taken, an anonymized sequential unique site identification number, an anonymized unique patient identification number, and an aggregate human expert evaluation of the four images in each exam as unacceptable quality, no referable retinopathy (no apparent diabetic retinopathy), or referable retinopathy (at least mild nonproliferative retinopathy).
System of computer algorithms for evaluation of retinal images for diabetic retinopathy
After automatic cropping of the black border, all retinal images were automatically resampled to 640 × 640 pixels. Then, the following sequence of image operations was performed on each image (a schematic code sketch follows the list):
Automatic determination of the probability that the image has a sufficient quality for evaluation, based on the presence of particular image structures with a particular spatial distribution, as described previously (32). After preliminary testing, the threshold for insufficient image quality was set so that ∼80% of exams judged by the human expert to have unacceptable quality were rejected, while 20% of exams judged by the human expert to have acceptable image quality were rejected by the system as having unacceptable quality.
Automatic segmentation of the vessels using pixel feature classification as we have described previously, resulting in a vessel probability image, which is necessary to exclude false-positive “red lesions” (29,38).
Automatic detection and masking of the optic disc using a method that determines where blood vessels converge, as we have described previously (28).
Automatic detection of red lesions—microaneurysms, hemorrhages, or vascular abnormalities—using feature classification, as we have described previously, resulting in a red lesion probability map (30).
Automatic detection of “bright lesions”—exudates, cotton-wool spots, and drusen—using pixel feature classification, resulting in a combined probability map for exudates, cotton-wool spots, and drusen (the differentiation between drusen and the other two is not used by the combined system we present here), as described previously (31).
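To make this sequence concrete, the following minimal Python sketch chains the five steps. It is illustrative only: every function body is a trivial stand-in, and all names (crop_and_resample, assess_quality, and so on) are our hypothetical placeholders rather than the published algorithms (28–32), which are far more involved.

```python
# Illustrative per-image pipeline; each function body is a placeholder for
# the corresponding published algorithm (refs. 28-32), not a reimplementation.
from dataclasses import dataclass
import numpy as np

def crop_and_resample(img: np.ndarray, size: int = 640) -> np.ndarray:
    """Crop the black border, then resample to size x size (nearest neighbor)."""
    ys, xs = np.where(img.sum(axis=2) > 10)        # non-black pixels (RGB image)
    img = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    ry = np.linspace(0, img.shape[0] - 1, size).astype(int)
    rx = np.linspace(0, img.shape[1] - 1, size).astype(int)
    return img[np.ix_(ry, rx)]

def assess_quality(img) -> float:                      # placeholder for ref. 32
    return 1.0                                         # P(image is gradable)

def segment_vessels(img) -> np.ndarray:                # placeholder for ref. 29
    return np.zeros(img.shape[:2], dtype=bool)         # vessel mask

def detect_optic_disc(img, vessels) -> np.ndarray:     # placeholder for ref. 28
    return np.zeros(img.shape[:2], dtype=bool)         # optic disc mask

def detect_red_lesions(img, vessels) -> np.ndarray:    # placeholder for ref. 30
    return np.zeros(img.shape[:2])                     # red lesion probability map

def detect_bright_lesions(img, disc) -> np.ndarray:    # placeholder for ref. 31
    return np.zeros(img.shape[:2])                     # bright lesion probability map

@dataclass
class ImageResult:
    quality: float        # probability the image has sufficient quality
    red: np.ndarray       # red lesion probability map
    bright: np.ndarray    # bright lesion probability map

def process_image(raw: np.ndarray) -> ImageResult:
    img = crop_and_resample(raw)               # preprocessing
    quality = assess_quality(img)              # step 1: image quality
    vessels = segment_vessels(img)             # step 2: vessel segmentation
    disc = detect_optic_disc(img, vessels)     # step 3: locate/mask optic disc
    red = detect_red_lesions(img, vessels)     # step 4: vessel map suppresses
                                               #         false-positive red lesions
    bright = detect_bright_lesions(img, disc)  # step 5: bright lesions
    return ImageResult(quality, red, bright)
```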
We used these machine-learning algorithms as is; in other words, no retraining of the algorithms was performed for this study. The total runtime of the complete system for a single image on a standard Windows PC was 7 min, so one exam took 4 × 7 = 28 min. To evaluate the complete set of 10,000 exams quickly, we implemented our system on a cluster of 30 PCs with an automatic job distribution system (see the sketch below).
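The job distribution system itself is not described further; purely as an illustration, the same fan-out could be sketched with Python's standard library, reusing the hypothetical process_image from the pipeline sketch above and assuming one worker per machine or core.

```python
# Hypothetical fan-out of per-image jobs over workers, in the same spirit as
# the 30-PC cluster; the actual cluster software is not described in the paper.
from concurrent.futures import ProcessPoolExecutor

def evaluate_exams(exams, n_workers: int = 30):
    """exams: list of exams, each a list of four raw images (numpy arrays)."""
    images = [img for exam in exams for img in exam]            # flatten
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(process_image, images))         # parallel map
    return [results[i:i + 4] for i in range(0, len(results), 4)]  # regroup per exam
```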
Outcome parameters and data analysis
Several features, such as the number and probability of the lesions detected by the red and bright lesion algorithms in all four images of an exam, were used to create a per-exam retinopathy probability; each of the four images, from both eyes, was considered independently. Combined with the output of the image-level quality algorithm, two probabilities were thus created at the exam level: the probability that an exam has sufficient quality for evaluation and the probability that the exam shows referable retinopathy. The image quality threshold determines the probability below which an image is considered ungradable. By varying the thresholds for red and bright lesions, the sensitivity and specificity of the system compared with the clinician's reading can be made to vary. Sensitivity and specificity of the complete system relative to the clinical evaluation were calculated at each threshold setting. These sensitivity/specificity pairs were used to create receiver operating characteristic (ROC) curves, showing the sensitivity and specificity at the various thresholds. The area under the ROC curve is regarded as a comprehensive measure of system performance: an area of 1.0 corresponds to sensitivity = specificity = 1 and represents perfect detection, while an area of 0.5 corresponds to a system that, on average, performs no better than a coin toss. From the ROC curve, the “optimal threshold” is the subjective point at which we deemed sensitivity and specificity to be optimal for a screening setting.
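As a minimal sketch of this threshold sweep and area computation, assuming only a numpy vector of per-exam system scores and the expert's binary labels (this is not the authors' code):

```python
# Exam-level ROC analysis sketch: `scores` are per-exam retinopathy
# probabilities from the system; `labels` are the expert reading
# (1 = referable retinopathy, 0 = no referable retinopathy).
import numpy as np

def roc_points(scores: np.ndarray, labels: np.ndarray):
    """Return (sensitivity, specificity) at every observed threshold."""
    points = []
    for t in np.unique(scores):
        pred = scores >= t                     # system calls exam positive
        tp = np.sum(pred & (labels == 1))
        fn = np.sum(~pred & (labels == 1))
        tn = np.sum(~pred & (labels == 0))
        fp = np.sum(pred & (labels == 0))
        points.append((tp / (tp + fn), tn / (tn + fp)))
    return points

def auc(scores: np.ndarray, labels: np.ndarray) -> float:
    """Area under the ROC curve via the Mann-Whitney rank formulation."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    correct = (pos[:, None] > neg[None, :]).mean()   # correctly ranked pairs
    ties = (pos[:, None] == neg[None, :]).mean()     # ties count half
    return correct + 0.5 * ties
```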
In addition, we defined the number needed to miss (NNM) [= 1/(false negatives/all system-classified negatives)] as the number of exams classified as negative after which, on average, one case of referable retinopathy will be missed for a specific threshold setting. The effect of camera type and of administration of pharmacologic dilation, if any, on system performance was evaluated by determining the sensitivity and specificity of the algorithm at the optimal threshold.
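Written out, with TN and FN denoting true and false negatives at a given threshold (a restatement of the in-text definition, not a new quantity):

```latex
% NNM: reciprocal of the miss rate among exams the system classifies negative
\mathrm{NNM} \;=\; \frac{1}{\mathrm{FN}/(\mathrm{TN}+\mathrm{FN})}
             \;=\; \frac{\mathrm{TN}+\mathrm{FN}}{\mathrm{FN}}
```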
RESULTS—
Of the 5,692 patients included, 3,107 had one exam, 1,601 had two exams, and 986 had three or more exams over the time period. All included patients had two photographs of each eye taken (see inclusion criteria). The average age of these 5,692 patients at the time of their first exam was 61.8 years (±1 SD 12.81 years); 3,833 (49.85%) were male, and 3,856 (50.15%) were female. Of 10,000 exams, 9,017 (90%) were evaluated by an ophthalmologist as having no apparent retinopathy, 498 (4.98%) as having at least minimal nonproliferative retinopathy, and 485 (4.85%) as having insufficient image quality on one or more images.
The system detected 2,311 of 10,000 exams as not having sufficient quality, and 7,689 of 10,000 exams were determined by the system to have sufficient image quality for all images in the exam. The 7,689 exams were from 4,739 unique patients, average age 59.9 ± 11.7 years. Of the 7,689 exams with system-determined sufficient image quality on all four images, 7,229 (94%) were evaluated by the human expert as having no referable retinopathy, 378 (4.92%) as having referable retinopathy, and 81 (1.05%) as having insufficient image quality on one or more images. As mentioned in the research design and methods section, the quality threshold for the automatic system was set so that 80% of the 9,515 exams that were thought by humans to have sufficient quality were allowed to pass.
We performed a limited validation of the single human expert reading by selecting a random set of 500 of the 7,689 exams. The exams in this set were read independently by three masked retinal specialists. Compared with the original reading, their sensitivity/specificity (referable retinopathy or not) was 0.85/0.89, 0.73/0.89, and 0.62/0.84, respectively.
If the human expert reading was taken as the reference standard or “truth,” the area under the ROC curve was 0.84 for all 7,689 sufficient image quality exams, 0.85 for all 4,739 patients on their first visit, and 0.84 for the 1,701 patients with a second visit (Fig. 1). The optimal threshold had a sensitivity of 0.84 and a specificity of 0.64. At this point, 4,648 of 7,689 exams (60%) were correctly identified by the system as having no referable retinopathy (true negatives), 59 of 7,689 exams (0.8%) had referable retinopathy but were missed by the system (false negatives), 319 of 7,689 exams (4%) were correctly identified as having referable retinopathy (true positives) (Figs. 2A and B), and 2,581 of 7,689 exams (33%) were false positives, i.e., the system estimated them as having diabetic retinopathy while the human expert did not. The NNM at this point was 80, meaning that for every 80 exams classified as negative by the system, on average one case of referable retinopathy is missed at this chosen “optimal” threshold. Figure 2C shows an example of such a missed, false negative exam.
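As an arithmetic check, the reported operating point follows directly from these counts:

```latex
% Operating point recomputed from the confusion matrix reported above
\text{sensitivity} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}
                   = \frac{319}{319+59} \approx 0.84, \qquad
\text{specificity} = \frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}
                   = \frac{4{,}648}{4{,}648+2{,}581} \approx 0.64, \qquad
\mathrm{NNM} = \frac{\mathrm{TN}+\mathrm{FN}}{\mathrm{FN}}
             = \frac{4{,}707}{59} \approx 80
```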
If the sensitivity of the system was increased by adjusting the threshold, the NNM could be increased up to a maximum of 127, at which point the system had a (very low) specificity of 0.22.
We performed a preliminary, qualitative analysis of a sample of 87 false negative exams, obtained at a specific point on the ROC curve, to understand where the system was failing and its potential for improvement. Of these 87 false negative exams, 24 (27%) contained large hemorrhages and/or neovascularizations of the disc or elsewhere, 23 contained one or a few small hemorrhages, 18 contained exudates or cotton-wool spots only, and 22 contained other lesions that the system was not designed to detect, including nevi and retinal scars, that were not deemed to be diabetic retinopathy by a second expert.
The effect of camera type and pharmacologic dilation on system performance at the optimal threshold was as follows: in the subgroup that was pharmacologically dilated, sensitivity/specificity was 0.76/0.39, while in the group that was not pharmacologically dilated, it was 0.80/0.43. Across the four camera settings, sensitivity/specificity was 1) 0.86/0.22, 2) 0.88/0.56, 3) 0.79/0.83, and 4) 0.78/0.73, respectively.
CONCLUSIONS—
To our knowledge, this is the first time that an automated diabetic retinopathy screening system, based exclusively on previously published algorithms, has been tested on an unselected set of exams of this size obtained from a representative diabetic population. The data used in this study were collected over a 3-year period and represent real, unselected screening data obtained in a diabetic retinopathy screening program. The area under the curve of 0.85 on the first visit and an optimal sensitivity of 0.84 and specificity of 0.64 show the potential of such a system. That the system is relatively stable is shown by the fact that the area under the curve for the second visit is 0.84. The limited validation on a sample of 500 exams shows that the original single human expert reading corresponds quite closely to the independent readings by three masked retinal specialists. However, because independent expert readings on the full dataset are not available, it is impossible to determine how the system compares with human readers on this aspect.
However, the system cannot yet be recommended for clinical practice because the NNM was 80 at the optimal threshold and never exceeded 127. A second, more important reason is that of the 87 false negatives at the optimal threshold, ∼25% were isolated neovascularizations or large hemorrhages not accompanied by any other lesion such as exudates or microaneurysms, and such isolated lesions still require urgent referral to an ophthalmologist. There is therefore a compelling need for algorithm improvement, including better detection of neovascularizations, for which no specific algorithm has yet been published. For comparison, Usher et al. (39) previously tested a complete system on a set of 733 patients with a high prevalence of diabetic retinopathy (38%) and found a maximum sensitivity of 95.1% and a specificity of 46.3%. Larsen et al. (40) tested a previously commercially available system on 100 patients, 63% with diabetic retinopathy, and found a sensitivity of 96.7% and a specificity of 71.4%. These smaller test populations, with a higher prevalence of diabetic retinopathy, should not be considered representative of the true, large-scale screening population in which an automated system would be deployed.
The results do show that performance is close enough to that of human experts to warrant additional studies on high-quality, validated datasets such as that of the ETDRS (41). They also show that reporting only sensitivity and specificity or the area under the ROC curve may not be sufficient to evaluate a system for diabetic retinopathy screening, as our preliminary analysis of false negatives illustrates. Finally, the results show that performance comparable with human experts on limited tasks, such as red lesion detection, does not translate directly into comparable results on much larger, unselected datasets. This caveat, to our knowledge, also applies to all other published diabetic retinopathy lesion detection algorithms.

There is a large difference between the rate of exams with images of insufficient quality in this study (4.85%) and in most previous studies, including our own, in which we reported a rate of ungradable photographs of 12% (33). The EyeCheck population is a partially closed population, not a cohort. Patients, once photographed, remain in the population until they either fail to show up for their repeat exam or are referred because of insufficient image quality or referable abnormalities, including diabetic retinopathy. Thus, over time, patients with initially insufficient image quality leave the population, while patients remaining in it accrue insufficient image quality much more slowly, for example, from cataract formation. Similarly, patients with preexisting abnormalities leave the population at their first exam, while the accrual of new diabetic retinopathy is much slower. On the other hand, new screening sites are added almost monthly, so there is a continuous influx of new patients. We expect that the observed lower rate of insufficient quality exams reflects this balance.

There seemed to be a slight effect of increasing camera resolution resulting in improved system performance, meriting additional study.
There are several important limitations with this study. Most importantly, the estimate by the system was compared with a single reading by a human expert, and because of the nature of the study could not be compared with the gold standard seven-field stereo photography read by experts (41). It is therefore impossible to state whether the performance of the system would have been different if compared with the gold standard.
The automated system is also slow, taking 28 min per exam (four images). This was a nonoptimized version running in debug mode, and we have since more than halved this time. For online screening projects, response time can be reduced simply by running the system on multiple PCs. However, for real-time diagnosis directly in the camera, an option we currently see as less attractive, the processing time is likely still too long.
In summary, this study indicates that automated detection of diabetic retinopathy using a combination of published algorithms cannot yet be recommended for clinical practice, based on this test on a true population of patients screened for diabetic retinopathy. On the other hand, the performance is such that evaluation on validated, publicly available datasets should be pursued. Individual algorithms for lesion detection, as well as the manner in which lesion detection algorithms are combined, merit additional research before they can be considered for automated detection of diabetic retinopathy in patients with diabetes, where they may aid in the prevention of blindness and vision loss in these patients.
Article Information
M.D.A. was supported by National Eye Institute Grant R01-EY017066; Department of Defense Grant A064-032-0056; Research to Prevent Blindness NY, NY; the Wellmark Foundation; the U.S. Department of Agriculture; the University of Iowa; and Netherlands ZonMw. M.N. was supported by the Dutch Ministry of Economic Affairs (IOP IBVA02016).
M.D.A. has patent applications pending with the U.S. Patent and Trademark Office for computer-assisted diagnosis of diabetic retinopathy and glaucoma. M.N. and B.v.G. have patent applications pending with the U.S. Patent and Trademark Office for computer-assisted diagnosis of diabetic retinopathy.
Published ahead of print at http://care.diabetesjournals.org on 16 November 2007. DOI: 10.2337/dc07-1312.
M.D.A. is a director and shareholder of iOptics.
References