Allow me to make these observations on the article by Lee et al. (1) in this issue of Diabetes Care. The topic of artificial intelligence (AI) techniques, deep learning in particular, for the interpretation of medical images in clinical environments is timely. I congratulate the authors for their contribution to a highly relevant topic.

The strengths of the article are many. For example, the results show that some implementations of AI for diabetic retinopathy (DR) screening are highly effective and can have a significant impact on increasing access to DR screening in a cost-effective manner.

There are five notable observations that I wish to make. First, the referral criteria were somewhat academic and not consistent with the goal of evaluating the algorithms in a realistic environment. The case for setting the referral criterion at level 1 (any DR) or level 2 (mild nonproliferative DR [NPDR]) should be justified in light of realistic environments. I can envision environments where these criteria would have clinical value. However, as shown in Table 1, 72% of referrals present with only mild NPDR; referring those 3,321 mild NPDR cases of the 3,861 total would overwhelm the ophthalmology clinic. The U.S. Food and Drug Administration has permitted marketing of two AI systems for DR screening based on a detection threshold of more than mild DR, i.e., moderate or greater NPDR. The referral would be to an eye care professional, and patients with less than moderate NPDR would be reevaluated in 12 months.
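The effect of the referral threshold on clinic volume can be illustrated with a minimal sketch. The severity grades, ordering, and counts below are invented for illustration; they are not data from the study:

```python
# Hypothetical ICDR-style severity ordering (illustrative only;
# not the grading scheme or data of Lee et al.).
ORDER = {"none": 0, "mild": 1, "moderate": 2, "severe": 3, "proliferative": 4}

def count_referrals(grades, threshold):
    """Count patients whose severity meets or exceeds the referral threshold."""
    return sum(ORDER[g] >= ORDER[threshold] for g in grades)

# Invented grades for seven screened patients.
grades = ["none", "mild", "mild", "moderate", "severe", "none", "mild"]

count_referrals(grades, "mild")      # "any DR" criterion: 5 of 7 referred
count_referrals(grades, "moderate")  # "more than mild DR" criterion: 2 of 7 referred
```

Raising the threshold from mild to moderate removes the mild-only cases from the referral stream, which is the letter's point about the 3,321 mild NPDR referrals.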

Second, most studies report performance by “case,” i.e., whether or not a patient is to be referred. Reporting performance by the individual image, on average about 13 images per patient (311,604 images across 23,727 cases), makes comparison with other studies difficult. This manner of reporting clouds the referral process and the assessment of the algorithms’ effectiveness. For example, were patients’ referrals counted multiple times if multiple images from the same person were positive?
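One way to make case-level accounting explicit is to collapse per-image outputs into a single decision per patient. This is a minimal sketch with invented patient identifiers and per-image flags, not the authors' pipeline:

```python
from collections import defaultdict

# Hypothetical (case_id, referable) pairs for per-image algorithm output;
# the identifiers and flags are invented for illustration.
predictions = [
    ("patient_A", True), ("patient_A", True), ("patient_A", False),
    ("patient_B", False), ("patient_B", False),
    ("patient_C", True),
]

per_case = defaultdict(list)
for case_id, referable in predictions:
    per_case[case_id].append(referable)

# One referral decision per case: refer if ANY image is referable,
# so a patient with several positive images is still counted once.
referrals = {case_id: any(flags) for case_id, flags in per_case.items()}

image_level_positives = sum(flag for _, flag in predictions)  # 3 positive images
case_level_referrals = sum(referrals.values())                # 2 referred patients
```

The gap between the two counts (3 vs. 2) is exactly the ambiguity the letter raises about image-level reporting.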

Third, access to both mydriatic and nonmydriatic data presented an opportunity to shed light on the effect of dilation on the performance of the algorithms. The data from the two Veterans Affairs hospitals, one primarily using nonmydriatic imaging and the other dilating all patients, offered a natural comparison, yet these two categories of mydriasis were not analyzed separately. The impact of mydriasis would be an important consideration when implementing a DR screening program.

Fourth, the study used a binary grade for image quality, i.e., an image was either gradable or not. Was it possible that a patient had multiple images of acceptable quality, sufficient to determine a referral/no referral result, even though that same patient also had one or more ungradable images?
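Under the pooling suggested here, a case could still yield a decision whenever at least one of its images is of acceptable quality. A minimal sketch with invented quality flags:

```python
# Hypothetical per-patient image-quality flags (illustrative only;
# not data from the study).
images = {
    "patient_A": ["gradable", "ungradable", "gradable"],
    "patient_B": ["ungradable", "ungradable"],
}

# A case remains usable for a referral decision if at least one of
# its images is gradable, even when others for the same patient are not.
case_gradable = {pid: any(q == "gradable" for q in flags)
                 for pid, flags in images.items()}
# → {'patient_A': True, 'patient_B': False}
```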

Fifth, no results were given for diabetic macular edema or clinically significant macular edema. Granted, macular edema is best detected with optical coherence tomography (OCT), yet for screening purposes many direct eye examinations and algorithms use surrogate markers, e.g., hard exudates on or near the fovea. This is a critical omission.

Duality of Interest. P.S. is a paid employee of VisionQuest Biomedical and stockholder. No other potential conflicts of interest relevant to this article were reported.

1. Lee, et al. Multicenter, head-to-head, real-world validation study of seven automated artificial intelligence diabetic retinopathy screening systems. Diabetes Care