Background: With large data sets of electronic health records (EHRs) available, predicting future disease status is important for decision making in medical treatment. Modern machine learning techniques make it increasingly easy to build complex predictive models. These models derive their explanatory variables from past records; however, little is known about how far back data should be collected. In some cases very recent trends influence the future disease status, while in others older events are the important causes of a change in disease status. Our interest therefore lies in how far back records must be processed to build good prediction models.

Method: We compared a set of machine learning algorithms for predicting the future diabetic nephropathy stage, using input variables collected from past records of different time spans. The algorithms were Logistic Regression, AdaBoost, Gradient Boosting, Decision Tree, Multi-layer Perceptron, and Random Forest. We prepared sets of EHR variables covering the past 30, 60, 90, 180, 210, 240, 270, 300, 330, and 360 days, and extracted several longitudinal statistics from each as input variables. Using data from about 65 thousand patients with type 2 diabetes, the models classify whether the nephropathy stage worsens or stays unchanged in 180 days.
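The abstract does not say which longitudinal statistics were extracted or how the records were laid out. As a minimal sketch only, assuming a hypothetical per-patient list of (date, value) measurements for a single lab test and an illustrative choice of statistics (count, mean, min, max, last value, least-squares slope), summarizing one look-back window could look like this:

```python
from datetime import date, timedelta
from statistics import mean


def window_features(records, index_date, lookback_days):
    """Summarize one measurement series over a look-back window.

    records: list of (date, float) pairs for one patient (hypothetical layout).
    The window covers the lookback_days days ending at index_date, inclusive.
    Returns None when the window contains no measurements.
    """
    start = index_date - timedelta(days=lookback_days)
    window = sorted((d, v) for d, v in records if start < d <= index_date)
    if not window:
        return None

    values = [v for _, v in window]
    days = [(d - start).days for d, _ in window]

    # Least-squares slope in units per day; 0.0 if all dates coincide.
    x_bar, y_bar = mean(days), mean(values)
    denom = sum((x - x_bar) ** 2 for x in days)
    if denom:
        slope = sum((x - x_bar) * (y - y_bar)
                    for x, y in zip(days, values)) / denom
    else:
        slope = 0.0

    return {
        "n": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "last": values[-1],
        "slope": slope,
    }


# Illustrative values only, not study data.
example_records = [
    (date(2020, 1, 10), 60.0),
    (date(2020, 7, 1), 55.0),
    (date(2020, 12, 15), 50.0),
]
features_by_window = {
    days: window_features(example_records, date(2020, 12, 31), days)
    for days in (30, 180, 360)
}
```

In this made-up series the 30-day window sees a single measurement, so it carries no trend information, while the 360-day window sees all three and yields a negative slope. That is the kind of difference between short and long look-back periods the comparison is designed to measure.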

Results: For almost all algorithms, AUC improved as older data were included, and the 360-day data sets gave the best results. Among the algorithms, Gradient Boosting gave the best AUC, 0.77, with the 360-day data set. With the same 360-day data sets, Decision Tree gave the worst AUC, 0.61.
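For readers less familiar with the metric, AUC can be read as the probability that a randomly chosen patient whose stage worsened receives a higher predicted score than a randomly chosen patient whose stage did not. The abstract does not describe how AUC was computed, so the following is only a sketch of that standard pairwise definition, with ties counted as one half; the function name and the example values are illustrative.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 outcomes (1 = stage worsened, by assumption).
    scores: predicted scores, higher meaning more likely to worsen.
    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Illustrative values only: 3 of the 4 positive-negative pairs are ordered correctly.
example_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])  # 0.75
```

On this scale 0.5 corresponds to random ranking and 1.0 to perfect ranking, which places the reported 0.61 and 0.77 in context. The quadratic loop is fine for illustration; a rank-based computation would be preferable at the scale of tens of thousands of patients.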

Conclusion: Among look-back periods of up to 360 days, the data set reaching furthest back gave the best prediction performance. Longitudinal statistics over a rather long span provide good explanatory information for future nephropathy development.


A. Koseki: Employee; Self; IBM. M. Ono: None. M. Kudo: Employee; Self; IBM. K. Haida: None. M. Makino: None. A. Suzuki: Research Support; Self; Chugai Pharmaceutical Co., Ltd., Dai-ichi Life Insurance Company, IBM, MSD, Ono Pharmaceutical Co., Ltd., Takeda Pharmaceutical Company Limited. Speaker's Bureau; Self; Asahi Kasei Corporation, Daiichi Sankyo Company, Limited, Eli Lilly and Company, Mitsubishi Tanabe Pharma Corporation, Taisho Pharmaceutical Co., Ltd.

Readers may use this article as long as the work is properly cited, the use is educational and not for profit, and the work is not altered. More information is available at