Technological progress in the past half century has greatly increased our ability to collect, store, and transmit vast quantities of information, giving rise to the term “big data.” This term refers to very large data sets that can be analyzed to identify patterns, trends, and associations. In medicine—including diabetes care and research—big data come from three main sources: electronic medical records (EMRs), surveys and registries, and randomized controlled trials (RCTs). These systems have evolved in different ways, each with strengths and limitations. EMRs continuously accumulate information about patients and make it readily accessible but are limited by missing data or data that are not quality assured. Because EMRs vary in structure and management, comparisons of data between health systems may be difficult. Registries and surveys provide data that are consistently collected and representative of broad populations but are limited in scope and may be updated only intermittently. RCT databases excel in the specificity, completeness, and accuracy of their data, but rarely include a fully representative sample of the general population. Also, they are costly to build and seldom maintained after a trial’s end. To consider these issues, and the challenges and opportunities they present, the editors of Diabetes Care convened a group of experts in management of diabetes-related data on 21 June 2018, in conjunction with the American Diabetes Association’s 78th Scientific Sessions in Orlando, FL. This article summarizes the discussion and conclusions of that forum, offering a vision of benefits that might be realized from prospectively designed and unified data-management systems to support the collective needs of clinical, surveillance, and research activities related to diabetes.
INTRODUCTION
Within the span of their professional careers, older physicians and investigators have experienced a revolution in the management of data. In the 1960s, we wrote chart notes and prescriptions by hand, pored over large volumes in libraries, recorded notes from these tomes on 3- by 5-inch notecards, computed means and standard deviations on mechanical calculators, and composed manuscripts for publication on typewriters. Digital technologies ushered in a paradigm change for all of these practices. Large, slow mainframe computers were developed in that decade and, concurrently, defense and academic groups established electronic communication networks. These innovations were followed in the 1970s by smaller, yet more powerful, computers and, in the 1980s, by personal computers and expanded networks. Now we have the Internet, the World Wide Web, and “cloud” storage capabilities, all of which can be accessed anywhere and at any time by individuals with smartphones and other small electronic devices. Our ability to collect, store, analyze, and transmit data has increased remarkably, giving rise to the collective term “big data”: extremely large data sets that can be analyzed to identify patterns, trends, and associations. It is now, at least in principle, possible to manage huge quantities of data over decades of time and among regions globally.
These tools for data management have long been recognized as relevant to our efforts to improve health care (1), and certainly this applies to clinical care and research in the field of diabetes. Some notable examples deserve mention. Henry J. Kaiser, a prominent defense contractor, developed health systems for employees at his shipyards in the 1940s. The Kaiser systems applied business principles to health care, including early adoption of electronic medical records (EMRs) (2). The U.S. Department of Veterans Affairs created an electronic database for its geographically dispersed medical systems in the 1980s (3). Population-based medical registries have been established in the U.K. (4) and other countries. In the U.S., the National Health and Nutrition Examination Survey (NHANES) began in 1971, and data collection continues to the present time (5). Likewise, the first large, randomized trials testing interventions for diabetes were facilitated by digital data management. The UK Prospective Diabetes Study (UKPDS) was launched in 1977 (6), and enrollment in the Diabetes Control and Complications Trial (DCCT) began in 1983 (7).
Despite the strong influence of digital technology on these projects, the systems used for clinical care, epidemiologic surveillance, and interventional trials have grown and evolved in quite different ways. Their purposes and designs differ considerably, and collected data are not easily compared among systems. To consider these issues, and both the challenges and opportunities presented by them, Diabetes Care convened a group of experts in the field of diabetes digital technology on 21 June 2018. Here, we report a summary of the discussion and conclusions of that forum. The discussion was divided into the three categories of data-management systems briefly described in Table 1.
Table 1. Typical features of current systems for managing medical data
Feature | EMRs | Public surveys and registries | RCTs |
---|---|---|---|
Financial support | Health system | Government | Government, industry, or voluntary health organization |
Governance | System administrators | Government employees | Academic partnership with sponsor |
Population included | Enrolled in public or commercial system | National or regional | Selected for study, may be international |
Time of data collection | Continuous | Periodic | Specific interval |
EMRs, electronic medical records; RCTs, randomized controlled trials.
ELECTRONIC MEDICAL RECORDS: ENORMOUS POTENTIAL BUT LIMITATIONS
Ancient Egyptian physicians recorded their patients’ medical information on papyri (8), and until recently, handwritten records continued to be the norm. However, as digital technology developed, it was quickly applied to medical records. Some health systems introduced electronic data management in the 1960s, with the focus initially on scheduling and billing. Over time, electronic medical record (EMR) systems have expanded to other aspects of patient care and, in the past decade, a growing number of health care organizations have largely abandoned paper-based records.
The potential of EMRs to make patient-related information more accessible is enormous. In 1863, Florence Nightingale complained that, “In attempting to arrive at the truth, I have applied everywhere for information, but in scarcely an instance have I been able to obtain hospital records fit for any purposes of comparison” (9). Recent studies have demonstrated that use of EMRs can improve preventive health services, decrease medication errors, and facilitate population health management (10–13). In the case of diabetes, analyses of data from EMRs have been shown, in appropriate settings, to help improve success in controlling glycemia, lipids, and blood pressure and to reduce the frequency of emergency department visits and nonelective hospitalizations (14–16).
At the organizational level, review of EMR data allows for the assessment of clinical visit scheduling and reimbursement, attendance and wait times, medication prescription and dispensation, and tracking of variously defined measures of quality of care. For providers, EMRs allow immediate access to patients’ clinical histories, physical and laboratory findings, and other care-related information. Providers potentially can access clinical information independent of where they or their patients may be at a given time. The importance of this ability was illustrated by the experience after Hurricane Katrina struck New Orleans and nearby areas in 2005. Clinicians who had EMR access could provide information and advice and fill prescriptions for their patients who were widely dispersed across the country, whereas those without EMRs lost contact with patients and permanently lost their paper records to storm and flood damage. Electronic records allow many different users to access medical information simultaneously and eliminate the costs of creating and delivering hard copies of records to each clinician. Virtually instantaneous remote access by on-call clinicians, including those in emergency departments and distant institutions, can assist in timely provision of care. For the care of those with diabetes, use of EMRs facilitates tracking of relevant clinical data over time, including weight, blood pressure, A1C, lipid measurements, and medications for control of various risk factors. Because a team of providers—including physicians and advanced practice providers, diabetes educators, nutritionists, and others—is typically involved in care for people with diabetes, EMRs assist in coordinating multidisciplinary care.
There are also potential limitations to the use of EMRs. Both isolated and systematic unintended consequences have been reported (17–21). Workloads of clinical providers may be increased and their morale impaired by the need to enter orders for tests, prescriptions, and consultations—tasks previously performed by other health care personnel. Because of the need to review prior encounters and enter current data in the examining room, both clinicians and patients have sometimes complained about EMRs interfering with communication during visits (22,23). Although much energy is devoted to optimizing the use of EMRs in managing the logistical aspects of care (e.g., scheduling, billing, and process-based quality assessment), medical information needed for personalized management of complex conditions such as diabetes may be less easily collected, recorded, and visualized. Whereas the consistency and accuracy of entries concerning financial or operational matters are routinely checked by specialized personnel within health systems, similar quality control is rarely attempted for clinical entries. The result is variability and inconsistency in capturing even the most crucial medical information in many cases. A notable example is the difficulty of tracking insulin doses prescribed, as well as those actually taken—especially when patients are actively self-managing their glycemic control. Another is the lack of consistency in distinguishing between type 1 (autoimmune-mediated) diabetes, type 2 diabetes, and less common forms of diabetes in EMRs.
Electronic record systems come in a bewildering variety of configurations, and they frequently evolve over time. Therefore, careful implementation procedures, including user training, are crucial to their success. Although broad principles of EMR design are well established (24,25), they are not universally followed. As a result, many systems suffer from discrepancies between software design, user needs, and clinical workflow, sometimes leading to negative perceptions of their value and reliability (Fig. 1) (26–29). Alignment of EMRs with the activities and concerns of medical providers can be improved, but in many cases this is not occurring. Business-related aspects of EMR use can also pose barriers. For example, EMR system vendors may have contractual hold-harmless clauses that limit their accountability for harm or inconvenience related to defects and malfunction. It may be unclear who is responsible for maintenance of services, and difficulties may not be reliably reported. Governmental oversight of the quality of EMR products and services is limited (30–33).
Figure 1. A conceptual model of differences between how electronic medical records are designed (Designer model), functionality desired by the users (User model), and how they are actually utilized (Activity model). Reprinted from Zhang J, Walji MF. TURF: toward a unified framework of EHR usability. J Biomed Inform 2011;44:1056–1067, with permission from Elsevier (29).
There is considerable potential for the use of data collected routinely in EMRs for epidemiological surveillance or prospectively designed medical research (34,35). Some large health systems with long-standing databases have published useful epidemiologic reports of their experience. Notable examples relevant to diabetes include early reports of clinical inertia in advancing pharmacotherapy of diabetes (36,37) and clinical features associated with hypoglycemia in clinical practice (38). However, there are limitations to such use of data collected in EMRs under current circumstances. These include missing or unreliable data, collected without consistent definitions or quality control, and uncertain generalizability when data originate from a single institution. These problems could be addressed and some attempts have been made, although the success of such efforts depends on allocation of additional resources and support by health system administrators (39–42).
SURVEYS AND REGISTRIES: MONITORING POPULATION-WIDE TRENDS
Public health surveillance for chronic diseases has also been greatly facilitated by electronic data-management systems. Surveillance can be defined as quantitative monitoring of population-level incidence (risk) and prevalence (frequency) of disease and of provision of preventive care, with attention to variations according to personal characteristics, time, and location (43,44). Periodic surveys can identify emerging risk factors, new health problems and comorbid conditions, gaps in care, and adverse events of treatment. Surveillance aims to identify subpopulations that are most at risk for a given disease or most likely to benefit from intervention. Data grouped according to specific characteristics of individuals may be described as a registry, which can be systematically updated to provide targeted surveillance of individuals sharing this characteristic.
Such information provides timely guidance for short-term decisions by policy makers, health plan administrators, clinicians, and the public. It also permits more in-depth etiological analyses, cost-effectiveness determinations, and health impact modeling, all relevant to long-term decisions. When combined with related disciplines (e.g., clinical epidemiology, health services and policy research, health economics, and program management evaluations), population surveillance forms the basis for public health strategies and resource allocation.
Population-level surveillance for diabetes is undergoing a rapid transformation, driven by new health-related data sources and by new computing and analytic approaches to large data sets (45). Diabetes surveillance in the U.S., Canada, Europe, Australia, Israel, and some Asian countries originated mainly from public survey– and direct registry–based systems. In some settings, it is now extending to include health system–based electronic registries linking EMR data, hospital and ambulatory services, laboratory and pharmacy data, and, most recently, various non-health-related data sources (46,47).
Surveillance Through Public Systems
Nationally representative surveys in the U.S. that include assessment of diabetes prevalence have existed for more than 50 years (Fig. 2), beginning with the National Health Survey in the 1960s. Next came the National Health Interview Survey (NHIS), the first National Health and Nutrition Examination Survey (NHANES I) in the 1970s, NHANES II in the 1970s and 1980s, NHANES III in 1988–1994, and continuous NHANES surveys from 1999 to the present (48–51). These are coordinated by the Centers for Disease Control and Prevention’s National Center for Health Statistics.
Figure 2. Overview of diabetes-related metrics monitored in the U.S. via publicly available survey data throughout the natural history of the disease. Adapted from Desai et al. (43).
A suite of other health care surveys, including the National Ambulatory Medical Care Survey (52), Medical Expenditure Panel Surveys from health care settings (53), and the National Hospital Discharge Survey (later supplanted by the National Inpatient Sample [NIS] [54]), collects data at the level of hospitals rather than individuals. Since 1993 the Behavioral Risk Factor Surveillance System (BRFSS) has provided population-based surveys conducted at the state level (46). These surveys are complemented by registries for selected conditions such as the United States Renal Data System for end-stage renal disease (55), or for special problems and populations (e.g., the prevalence of type 1 vs. type 2 diabetes in children in the SEARCH for Diabetes in Youth study) (56). Similar evolution of surveillance has occurred in other countries as well. For example, the National Diabetes Audit in the U.K. is one of the largest annual clinical audits in the world. It integrates data from both primary and secondary care sources, with providers legally required to supply the data from their clinical practices (57).
Most of these surveys are designed to obtain repeated cross-sectional, complex samples with analytic weighting so that the estimates derived are representative of the noninstitutionalized population, including people without health insurance. NHANES is the most comprehensive survey in the U.S., consisting of a questionnaire, physical exam, and laboratory examinations every 2 years. It is the primary source for tracking total prevalence of diabetes, prediabetes, and undiagnosed diabetes, as well as selected risk factors and complications, including peripheral arterial disease, retinopathy, and chronic kidney disease (58–61). NHIS includes the single largest sample of the U.S. population and is the primary source of self-reported incidence of diagnosed diabetes. It serves as the key platform for supplemental surveys of issues ranging from health care access to preventive care (62,63). NIS is the main source of data for hospitalizations and procedures and is used to estimate and track the incidence of cardiovascular disease, stroke, and amputation (64). BRFSS has been crucial in providing state-level and, with assistance of small-area statistical modeling, county-level prevalence and incidence rates of diabetes and prevalence of obesity and physical inactivity (65). Several surveys, including NHIS and NHANES, also have linkage to the National Death Index. This is an important association that allows mortality rates to be estimated for consecutive cohorts (66). Collectively, the publicly available surveys permit researchers and policy makers to monitor a broad range of metrics such as behavioral and biochemical risk factors, preventive behaviors, receipt of preventive care, risk factor management, diabetes-related complications, disability, and mortality (Fig. 2) (43).
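The role of analytic weighting described above can be illustrated with a minimal sketch. The respondents, weights, and function names below are hypothetical and are not drawn from NHANES or any other survey; each weight is assumed to be the number of people in the target population that a respondent represents. A real analysis would also need design-based variance estimation (strata and sampling units), which is omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Respondent:
    has_diabetes: bool
    weight: float  # hypothetical analytic weight: people represented by this respondent


def unweighted_prevalence(sample):
    """Share of respondents with diabetes, ignoring the sampling design."""
    return sum(r.has_diabetes for r in sample) / len(sample)


def weighted_prevalence(sample):
    """Share of the represented population with diabetes, using analytic weights."""
    total_weight = sum(r.weight for r in sample)
    case_weight = sum(r.weight for r in sample if r.has_diabetes)
    return case_weight / total_weight


# Hypothetical sample in which a high-prevalence group was oversampled,
# so each of its respondents carries a smaller weight.
sample = [
    Respondent(has_diabetes=True, weight=1000.0),
    Respondent(has_diabetes=True, weight=1000.0),
    Respondent(has_diabetes=False, weight=4000.0),
    Respondent(has_diabetes=False, weight=4000.0),
]

print(unweighted_prevalence(sample))  # share among respondents
print(weighted_prevalence(sample))    # share in the represented population
```

In this made-up sample, half of the respondents have diabetes, but they represent only one-fifth of the population, which is why unweighted estimates from a complex sample can be misleading.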
However, these public surveys have some fundamental limitations. First, they are largely cross-sectional data sets. Apart from the mortality linkage, the lack of longitudinal data limits assessment of changes in risk and care and the ability to examine the effectiveness of treatments or the etiology of conditions in individuals. Second, the ability to examine geographic variation in risk, care, or outcomes is limited in most surveys. Thus, their utility for directly targeting interventions to areas of greatest need in regions below the national level is impaired. While BRFSS has been useful for estimation of state- and county-level prevalence of diabetes, obesity, and physical inactivity, there are limitations of its design and data collection that allow incidence rates to be tracked reliably only at the national level (46,62,63,65). Third, although they are designed and weighted to be representative of noninstitutionalized populations, steadily declining response rates are a growing threat to validity. Finally, despite improvements in the timeliness of collection and disclosure of data, periodic surveys do not always allow real-time assessment of emerging problems. Also, incorporating new elements into the surveys requires administrative review and approval, which can be a lengthy process.
Surveillance Within Health Systems
As noted earlier, integrated health systems in the U.S. and elsewhere have used EMRs and other systematically collected data for surveillance, development of registries, and evaluation of care within their populations. Direct clinical data in such systems can be linked to pharmacy and laboratory information, allowing broader assessment of processes and outcomes (67,68). This experience has set the stage for linkage of previously existing public surveys and registries to data derived from direct patient contact within private systems. This trend has been paralleled by conceptually similar population-wide registries in countries with single-payer health systems, including Sweden, Finland, Denmark, and the U.K. (47,57,69–71). Combining these registries has the advantage of allowing estimation of levels of care, risk-factor management, and rates of outcomes, taking a broader perspective than is possible within a single database. Such analyses can lead to reevaluation of medical practice methods, medication use, and the cost-effectiveness of specific interventions within each system.
Development of EMR-based registries by privately managed health systems has also provided an opportunity for large multi–health system aggregators such as IBM MarketScan Research, DARTNet, Optum, the Centers for Medicare & Medicaid Services, and others. Their databases contain information on billing claims for various services, pharmacy records, and laboratory data on large segments of the population. This information can be linked to other factors at the health plan level or to external information on geographic location and socioeconomic patterns. Thus, they can broaden the population included beyond that of individual health systems. However, these aggregating systems require substantial financial resources and can have other limitations. Although individual-level longitudinal analyses are possible with such systems, they can be complicated by the flow of individuals in and out of health plans, requiring careful distinction between cross-sections and cohorts. Aggregated health-system data also may lack routinely collected information on health behaviors and any information on the historically and geographically variable proportion of the U.S. population that is uninsured. Finally, as with databases within individual health systems, the completeness and reliability of aggregated data sets vary widely and pose significant problems of interpretation.
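The distinction between cross-sections and cohorts in the presence of enrollment turnover can be sketched as follows. The member identifiers, dates, and function names are hypothetical, and the rule used here (one enrollment span must cover the whole study window) is a simplifying assumption rather than a description of how any particular aggregator defines continuous enrollment.

```python
from datetime import date


def enrolled_on(spans, index_date):
    """True if any enrollment span includes the index date (cross-sectional view)."""
    return any(start <= index_date <= end for start, end in spans)


def continuously_enrolled(spans, window_start, window_end):
    """True if a single enrollment span covers the entire window (cohort view)."""
    return any(start <= window_start and end >= window_end for start, end in spans)


# Hypothetical enrollment histories: member id -> list of (start, end) spans.
enrollment = {
    "m1": [(date(2015, 1, 1), date(2018, 12, 31))],
    "m2": [(date(2016, 6, 1), date(2017, 3, 31))],
    "m3": [(date(2015, 1, 1), date(2016, 2, 28)),
           (date(2017, 1, 1), date(2018, 12, 31))],
}

cross_section = sorted(
    member for member, spans in enrollment.items()
    if enrolled_on(spans, date(2017, 1, 15))
)
cohort = sorted(
    member for member, spans in enrollment.items()
    if continuously_enrolled(spans, date(2016, 1, 1), date(2017, 12, 31))
)

print(cross_section)  # members present on the index date
print(cohort)         # members followed through the whole window
```

All three hypothetical members appear in the cross-section, but only one can contribute to a two-year longitudinal analysis, which illustrates how turnover shrinks and potentially biases a cohort.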
CLINICAL TRIAL DATABASES: OPTIMIZED COLLECTION AND ANALYSIS FOR SPECIFIC QUESTIONS
Complete and accurate quantitative data are required for success in all disciplines involved in scientific research. Clinical research can generally be classified as either observational or experimental. Observational research relies on data generated by people, clinics, institutions, health systems, or devices that are obtained, often passively, from sources such as an EMR system. Any variety of exposures, differences, or changes can be analyzed to identify relationships between the topic of investigation and various outcomes. Examples of topics for study include the uptake of a new drug, a change in health policy, an increase or decrease in access to health care providers, genetic characteristics, or increasing duration of disease or surveillance. Clinical assessments that can be related to such topics include weight or blood pressure, laboratory tests (e.g., A1C), health system utilization (e.g., emergency room visits), symptomatic events (e.g., hypoglycemia), and medical outcomes (e.g., myocardial infarction). All kinds of information collected during routine medical care might be used for observational research, and data are increasingly stored in easily accessible digital forms to facilitate their analysis.
Although observational research can identify relationships, whether any relationship is caused by the exposure or by something else linked to the exposure (i.e., a confounding variable) is more difficult to discern. Although sophisticated statistical techniques can account for potential confounding variables, they can only account for those that are both known to be possible confounders and available in the database. Because any relationship may reflect the effect of an unknown number of confounding factors, both measured and unmeasured, a causal effect suggested from observational analysis should be viewed as hypothesis-generating rather than definitive evidence. The only exception would be relationships that are extremely strong, such as the effect of smoking on the risk of lung cancer or the ability of insulin to prevent death in patients with type 1 diabetes.
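A small worked example may help show how a confounding variable can produce an apparent association. The counts below are invented for illustration, with disease duration as a hypothetical confounder that is linked both to receiving the exposure and to the outcome. Within each stratum the exposure has no effect, yet the crude comparison suggests more than a doubling of risk.

```python
# Hypothetical counts: (events, people) by exposure within each stratum.
strata = {
    "long duration": {"exposed": (40, 80), "unexposed": (10, 20)},
    "short duration": {"exposed": (2, 20), "unexposed": (8, 80)},
}


def risk(events, people):
    return events / people


def risk_ratio(exposed, unexposed):
    """Risk in the exposed divided by risk in the unexposed."""
    return risk(*exposed) / risk(*unexposed)


def pooled(strata, group):
    """Sum events and people for one exposure group across all strata."""
    events = sum(s[group][0] for s in strata.values())
    people = sum(s[group][1] for s in strata.values())
    return events, people


crude_rr = risk_ratio(pooled(strata, "exposed"), pooled(strata, "unexposed"))
stratum_rrs = {
    name: risk_ratio(s["exposed"], s["unexposed"]) for name, s in strata.items()
}

print(round(crude_rr, 2))  # crude comparison, ignoring disease duration
print(stratum_rrs)         # comparison within each level of the confounder
```

Stratification removes the spurious association here only because the confounder was measured; an unmeasured confounder would leave the crude estimate uncorrected, which is the limitation the paragraph above describes.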
Observational studies are no substitute for randomized controlled trials (RCTs) in establishing efficacy (72). The RCT is the gold standard for detecting modest but clinically important effects of a treatment or intervention. Indeed, a large number of RCTs conducted in the past 25 years (73) have provided crucial insights into the management of diabetes and have identified novel life-saving therapies. In an RCT, the administration of the exposure versus the comparator is randomly determined for two or more groups, and the effect of one versus alternative exposures (comparators) is then measured. The randomization process reduces confounding by constructing treatment groups that are, on average, expected to be similar except for the extent of the exposure being studied. Thus, any difference in outcomes is attributable to the exposure and not something else, with a level of confidence that depends on the rigor with which the study is designed and conducted. Many different exposures can be tested in RCTs, including drugs, devices, monitoring procedures (e.g., continuous glucose monitoring [CGM]), treatment algorithms, and administrative policies. Whereas the unit of randomization is typically an individual, groups or clusters of individuals can also be randomly assigned, with different clusters being randomly allocated to different exposures.
Although the methodological strengths of randomization are profound, randomization alone is not sufficient. Other requirements must be met (Fig. 3). First, a clearly formulated and ethical research question or hypothesis must be articulated as part of a carefully designed protocol. This should be reviewed by impartial experts to ensure that the question is important and that the research plan is ethical and feasible. Second, sufficient numbers of participants must be enrolled within a short enough period of time to ensure that the allocated groups are well matched and that the trial will be finished quickly enough to be relevant. Third, systems should be in place to ensure that people who are allocated to the exposure being tested actually adhere to or receive it. The lower the level of adherence, the smaller the difference between the allocated groups will be, so that a trial with low adherence may fail to detect very important effects of the exposure. Fourth, follow-up of study outcomes must be as close to 100% as possible to avoid the possibility that those who are not followed in each of the treatment groups may differ in important ways from those who are followed. Otherwise, these differences may confound the randomization and limit the confidence of conclusions about the effect of the exposure on the outcome. As illustrated in Table 2, most large-scale, randomized cardiovascular outcomes trials in diabetes have achieved follow-up rates for vital status approaching 100% (74–86). Fifth, systems need to be in place to ensure that outcomes of interest occurring during follow-up are reliably collected and analyzed. This is accomplished for very high percentages of participants in trials such as those shown in Table 2. Finally, the results should be analyzed according to the originally allocated exposure (i.e., through an intention-to-treat approach) regardless of adherence to the exposure.
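The intention-to-treat principle described above can be sketched with a toy data set. The participants, adherence flags, and outcomes are hypothetical, and the per-protocol comparison is included only as a contrast; it is not a method recommended in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Participant:
    allocated: str  # arm assigned by randomization: "treatment" or "control"
    adhered: bool   # whether the participant actually received the allocated exposure
    event: bool     # whether the outcome of interest occurred


def event_rate(people):
    people = list(people)
    return sum(p.event for p in people) / len(people)


def intention_to_treat(participants):
    """Compare arms as randomized, regardless of adherence."""
    return {
        arm: event_rate(p for p in participants if p.allocated == arm)
        for arm in ("treatment", "control")
    }


def per_protocol(participants):
    """Compare adherent participants only; this discards part of each randomized group."""
    return {
        arm: event_rate(p for p in participants if p.allocated == arm and p.adhered)
        for arm in ("treatment", "control")
    }


# Hypothetical trial with 10 participants per arm.
participants = (
    [Participant("treatment", True, True)]
    + [Participant("treatment", True, False)] * 5
    + [Participant("treatment", False, True)] * 2
    + [Participant("treatment", False, False)] * 2
    + [Participant("control", True, True)] * 4
    + [Participant("control", True, False)] * 6
)

print(intention_to_treat(participants))
print(per_protocol(participants))
```

In this invented example the per-protocol comparison makes the treatment look more effective than the intention-to-treat comparison does, because the non-adherent participants who were dropped had more events. Analyzing by original allocation keeps the groups comparable in the way randomization intended.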
Figure 3. Six crucial components that ensure the robustness of a randomized controlled trial.
Table 2. Ascertainment of vital status in recent diabetes outcomes trials
Trial acronym (drug studied) | Patients randomized (n) | Median follow-up (years) | Vital status known (%) |
---|---|---|---|
ORIGIN (glargine & omega-3 fatty acid) (74) | 12,537 | 6.2 | 99.0 |
SAVOR-TIMI 53 (saxagliptin) (75) | 16,492 | 2.1 | 99.1 |
EXAMINE (alogliptin) (76) | 5,308 | 1.5 | 99.5 |
TECOS (sitagliptin) (77) | 14,671 | 3.0 | 97.5 |
EMPA-REG OUTCOME (empagliflozin) (78) | 7,020 | 3.1 | 99.2 |
ELIXA (lixisenatide) (79) | 6,068 | 2.1 | 99.0 |
LEADER (liraglutide) (80) | 9,340 | 3.8 | 96.8 |
SUSTAIN-6 (semaglutide) (81) | 3,297 | 2.1 | 99.6 |
CANVAS Program (canagliflozin) (82) | 10,142 | 2.4 | 99.6 |
EXSCEL (exenatide) (83) | 14,752 | 3.2 | 98.8 |
ACE (acarbose) (84) | 6,522 | 5.0 | 94.4 |
HARMONY Outcomes (albiglutide) (85) | 9,463 | 1.6 | 99.4 |
DECLARE-TIMI 58 (dapagliflozin) (86) | 17,160 | 4.2 | 99.5 |
ACE, Acarbose Cardiovascular Evaluation; CANVAS, Canagliflozin Cardiovascular Assessment Study; DECLARE-TIMI 58, Dapagliflozin Effect on Cardiovascular Events; ELIXA, Evaluation of Lixisenatide in Acute Coronary Syndrome; EMPA-REG OUTCOME, BI 10773 (Empagliflozin) Cardiovascular Outcome Event Trial in Type 2 Diabetes Mellitus Patients; EXAMINE, Examination of Cardiovascular Outcomes with Alogliptin versus Standard of Care; EXSCEL, Exenatide Study of Cardiovascular Event Lowering; HARMONY Outcomes, A Long Term, Randomized, Double-blind, Placebo-Controlled Study to Determine the Effect of Albiglutide, When Added to Standard Blood Glucose Lowering Therapies, on Major Cardiovascular Events in Patients With Type 2 Diabetes Mellitus; LEADER, Liraglutide Effect and Action in Diabetes: Evaluation of Cardiovascular Outcome Results; ORIGIN, Outcome Reduction With Initial Glargine Intervention; SAVOR-TIMI 53, Saxagliptin Assessment of Vascular Outcomes Recorded in Patients with Diabetes Mellitus–Thrombolysis in Myocardial Infarction 53; TECOS, Trial Evaluating Cardiovascular Outcomes With Sitagliptin; SUSTAIN-6, Trial to Evaluate Cardiovascular and Other Long-term Outcomes With Semaglutide in Subjects With Type 2 Diabetes.
RCTs have additional strengths. First, if two or more interventions work in very different ways, they can be tested at the same time in a large enough RCT. For example, in a 2-by-2 factorial design all participants in the study population are randomized to one intervention being tested or to its comparator and also to another intervention being tested or to its comparator. Such designs have been extremely successful and have seldom been undermined by unanticipated interactions between the therapies. Second, the database from an RCT can also be used for observational research. Indeed, any analyses done on the data that are not related to the comparison of the randomized treatment groups are essentially observational analyses through which associations can be explored. Third, substudies that rely upon collection of additional data (e.g., blood tests, images, or physiological measures) can also be built into trials to help determine the mechanism of action of the exposure (e.g., the effects of a new drug).
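The 2-by-2 factorial allocation described above can be sketched as follows. The intervention labels, the seed, and the use of permuted blocks of four are assumptions made for illustration; the source does not specify how any particular trial generated its allocation sequence.

```python
import random
from collections import Counter
from itertools import product

# The four cells of a 2-by-2 factorial design: every participant receives
# one allocation for factor A and one for factor B.
CELLS = list(product(("A", "A-comparator"), ("B", "B-comparator")))


def factorial_allocation(participant_ids, seed=0):
    """Assign each participant to one cell using permuted blocks of four (an assumption)."""
    rng = random.Random(seed)
    allocation = {}
    block = []
    for pid in participant_ids:
        if not block:
            block = CELLS.copy()
            rng.shuffle(block)
        allocation[pid] = block.pop()
    return allocation


allocation = factorial_allocation(range(1, 13), seed=2018)
cell_counts = Counter(allocation.values())
factor_a_counts = Counter(a for a, _ in allocation.values())
factor_b_counts = Counter(b for _, b in allocation.values())

print(cell_counts)      # participants per cell
print(factor_a_counts)  # margin used to compare A with its comparator
print(factor_b_counts)  # margin used to compare B with its comparator
```

Because each factor is balanced across the levels of the other, the full study population contributes to both comparisons, provided the two interventions do not interact.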
Because accuracy and completeness of data are centrally important to RCTs, procedures have been devised to ensure that the data collected are of the highest quality. Digital technology has been essential to this effort. Most RCTs have been done by specialized groups within specialized infrastructures erected to monitor and support the work of each trial (i.e., a coordinating center, a steering committee, multiple clinical sites, an event adjudication committee, and a data and safety monitoring board). Such infrastructures are complex and expensive, especially when there are many study participants who are geographically dispersed. The personnel and procedures assembled for large RCTs are capable of meticulously collecting and rapidly processing massive amounts of data. For example, the ORIGIN (Outcome Reduction With Initial Glargine Intervention) trial gathered data for more than 6 years concerning more than 12,500 participants in 40 countries (74). Each trial’s infrastructure is usually disbanded on completion of the trial.
Databases of well-conducted RCTs contain the highest-quality data, both to answer the primary question that prompted the study and to generate additional hypotheses from observational analyses. There are, however, some limitations to these data sets. They are usually focused on a very specific question that is addressed by measuring a specific primary outcome of interest and may not include observations or measurements that are relevant to some other questions. Furthermore, the population studied—which has been selected for being both able and willing to participate—may not be completely representative of the general population. Efforts are commonly made to analyze the data in such a way as to assess generalizability of the conclusions of the RCT, but questions often remain (87). The rapid growth of both EMRs and large public and private registries has the potential to address these problems. Specifically, such databases may facilitate enrollment of suitable study populations for randomized trials and also assist in tracking various measures of outcome (88).
OPPORTUNITIES FOR IMPROVING MEDICAL DATA MANAGEMENT
Although there are substantial variations between databases within each category, those derived from EMRs, those created from surveys and registries, and those created for RCTs or prospective observational trials generally differ in their strengths and limitations. Some of their characteristics are summarized in Table 3.
Table 3. Attributes of current and potential future systems for managing medical data
Attributes | EMRs | Public surveys and registries | RCTs | Retrospectively aggregated data sets | Prospectively integrated data systems |
---|---|---|---|---|---|
Representative | ++ | +++ | + | ++ | +++ |
Consistent | + | +++ | +++ | ++ | +++ |
Accurate | + | + | +++ | + | +++ |
Comprehensive | +++ | + | + | + | +++ |
Up-to-date | +++ | + | +++ | + | +++ |
+, ++, or +++ refer to the relative strength of each attribute, with +++ denoting the strongest. EMRs, electronic medical records; RCTs, randomized controlled trials.
Information in EMRs deals with large groups of individual patients, includes a comprehensive range of clinical material, is collected continuously, and is intended to be stored indefinitely. However, except in the case of certain countrywide health systems that provide care for all citizens, the population included in an EMR may not be fully representative of the general population. Interpretation of observations is generally limited by lack of consistency and accuracy in data entries. One important reason for this difficulty is that data are typically entered by many individual providers with little support or monitoring by administrative personnel. Further, the structure of EMRs differs widely among health systems and, thus, pooling or comparison of data from different systems is difficult.
Registries or surveys can include data that are consistently collected and often representative of a whole population. However, surveillance data sets may not be reliable in all cases (e.g., when collected by self-report), usually focus on one or a few categories of data, and may be updated only intermittently. When repeated cross-sectional information is collected, longitudinal observation of individuals may not be possible.
Databases created specifically for large clinical trials excel in the specificity, completeness, and accuracy of the data collected. However, they rarely include a fully representative sample of the patient population and may contain a relatively limited range of observations. They are also costly to build and maintain, and often are not maintained after completion of the trial.
An Evolving Approach: Distributed Data Networks
To some degree, the limitations of these several models for collecting and managing data can be overcome by development of comprehensive, distributed, multicenter data networks. Such “hub-and-spoke” models for linking individual databases differ from traditional multicenter studies or surveillance systems (in which all data are held centrally) in that individual data are maintained at their source, with analyses conducted peripherally using centrally coordinated common data models and analytic routines. Aggregate results standardized by the common data models and pre-established covariate adjustment can then be returned to the coordinating center for final analyses. Typically, such models have been used for comparative effectiveness studies of pharmacological options, bariatric surgery, and adverse events of drugs, but less often for surveillance of variation in risk, care, or outcomes of diabetes or for postmarketing drug surveillance programs (89–91). There are also opportunities for wider integration across care providers, including linkage of clinic EMRs with pharmacies to allow better monitoring of therapy adherence, and for automated acquisition of data from personal devices such as glucose monitors, insulin pumps and pens, exercise trackers, and health and fitness apps. One early example of regional EMR linkage for assessment of diabetes was the DARTS (Diabetes Audit and Research in Tayside, Scotland) study, which linked EMRs within a Scottish community to create a diabetes registry (92). More recently, groups of clinical investigators, such as the Blood Pressure Lowering Treatment Trialists’ Collaboration and the Cholesterol Treatment Trialists’ Collaboration, have formed for the purpose of aggregating individual data from large trials (93,94). 
The possibility of expanding the range of data collection and analysis through such networks is obvious, but the quality of data can still be limited by inconsistencies and inaccuracies in clinical observations and data entry, and even aggregated data may not fully represent the general population.
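The hub-and-spoke pattern described above can be illustrated with a minimal sketch. It assumes a deliberately simple common data model (one record per patient with a Boolean `event` field) and a single shared analytic routine; the names `SiteSummary`, `local_analysis`, and `pool`, and the toy records, are hypothetical and do not describe any particular network.

```python
from dataclasses import dataclass

@dataclass
class SiteSummary:
    """Aggregate result returned by a site; contains no patient-level data."""
    site: str
    n_patients: int
    n_events: int

def local_analysis(site, records):
    """Shared routine run at each source site.
    Patient-level records never leave the site; only counts do."""
    n_events = sum(1 for r in records if r["event"])
    return SiteSummary(site=site, n_patients=len(records), n_events=n_events)

def pool(summaries):
    """Run at the coordinating center on the returned aggregates."""
    total_n = sum(s.n_patients for s in summaries)
    total_events = sum(s.n_events for s in summaries)
    return total_events / total_n if total_n else float("nan")

# Toy data held separately at two hypothetical sites.
site_a = [{"event": True}, {"event": False}, {"event": False}, {"event": False}]
site_b = [{"event": True}, {"event": True}, {"event": False},
          {"event": False}, {"event": False}, {"event": False}]

summaries = [local_analysis("A", site_a), local_analysis("B", site_b)]
pooled_rate = pool(summaries)  # 3 events among 10 patients
```

A working network would add covariate adjustment and privacy safeguards such as suppression of small cell counts, but the division of labor is the same: analysis at the source, aggregation at the hub.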
A Potential Solution: Unified Data-Management Systems
In an ideal world, prospectively designed and unified data-management systems could support clinical, surveillance, and research activities all together in a way that circumvents many of the limitations of current systems while drawing on the strengths of each. An integrated system would, in theory, allow substantial savings in costs of design, development, operation, and maintenance, and these costs could be shared among multiple stakeholders.
Specifically, EMRs could be improved by incorporating the more stringent monitoring of data integrity, including automated validation of quantitative clinical and laboratory entries, which is typically used in trial-management systems. Population-wide surveillance of drug safety, rates of various adverse outcomes, regional differences in patterns of care, and other public health concerns could be based on improved and structured data collection during routine health care. Additionally, testing of new therapeutic agents, devices, or regimens could be embedded within existing health systems, using prospectively designed protocols for randomized or nonrandomized treatment choices and assessment of outcomes. Such an approach would facilitate enrollment of more representative and larger patient cohorts at lower cost. Patient follow-up and therapeutic adherence likely would be better if research studies were performed in a familiar usual-care setting.
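Automated validation of quantitative entries, of the kind used in trial-management systems, can be as simple as a plausibility check applied at the point of data entry. The sketch below is illustrative only: the field names and the numeric limits in `PLAUSIBLE_RANGES` are assumptions chosen to catch gross entry errors, not clinical reference ranges.

```python
# Hypothetical plausibility limits; chosen only to flag gross entry errors.
PLAUSIBLE_RANGES = {
    "hba1c_percent": (3.0, 20.0),
    "systolic_bp_mmhg": (50, 300),
    "weight_kg": (20, 400),
}

def validate_entry(field, value):
    """Return a list of problems found; an empty list means the entry passed."""
    if field not in PLAUSIBLE_RANGES:
        return [f"unknown field: {field}"]
    # Reject text and Boolean values before comparing against limits.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return [f"{field}: value is not numeric"]
    low, high = PLAUSIBLE_RANGES[field]
    if not low <= value <= high:
        return [f"{field}: {value} is outside the plausible range {low} to {high}"]
    return []

ok = validate_entry("hba1c_percent", 7.2)        # passes
misplaced = validate_entry("hba1c_percent", 72)  # likely a missing decimal point
as_text = validate_entry("weight_kg", "80")      # entered as text, not a number
```

Flagging an entry for review at the moment it is typed, rather than discovering it during later analysis, is what distinguishes this kind of monitoring from retrospective data cleaning.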
Additional benefits of conducting RCTs or prospective observational studies using an EMR-based system would include the ability to follow patients passively long after the more structured initial part of the study. Long-term—even lifetime—individual follow-up could more fully capture the risks and benefits of the interventions evaluated and identify potential “legacy effects” persisting after completion of an active intervention. Beyond the opportunities related to individual trials and data cohorts, widespread implementation of unified data-management systems would facilitate routine analysis of pooled individual patient data, allowing greater representation of the whole population than is possible with current meta-analytic techniques. Access to such rich and long-term phenotypic information might also facilitate use of biobank and genetic data to identify new biomarkers and their relationships to disease outcomes and to existing and future therapies. Such unified big-data systems could provide a unique platform for testing, validating, and refining new analytic techniques such as artificial intelligence technologies, potentially leading to new diagnostic (95) and interventional tactics.
SUMMARY: MOVING TOWARD INTEGRATED SYSTEMS FOR MANAGING MEDICAL DATA
As noted above, networks that draw on varied sources are already accumulating experience with large, long-term aggregated data sets. For example, data collected over several decades in multiple population-based registries in both Norway and Sweden have been analyzed with the aim of improving regional health practices. These efforts have led to recent reports of marked and apparently continuing reduction of end-stage renal disease in type 1 diabetes during the period of surveillance (96,97). Also, a 5-year population-based intervention program in Hong Kong, based on structured EMR surveillance and decision-making by designated personnel trained in diabetes management, prospectively demonstrated large concurrent reductions of deaths, hospitalizations, and costs for patients with type 2 diabetes (98,99). Thus, movement toward integration of various kinds of health-related data is already underway.
As suggested by the examples above, the scale of the systems involved may be relevant to the success of integrated data management. Sweden, Norway, and Hong Kong all have populations in the 5- to 10-million range, and all have comprehensive publicly supervised health systems providing services to nearly all citizens. Fortunately, at present the computational power of electronic systems should not be a limiting factor in pursuing the goal of integrated data management for diabetes. However, the organizational and practical barriers to implementing integrated programs may be daunting over a larger geographic range and larger population than were demonstrated in these examples.
Several specific requirements appear to be necessary for implementing such systems in any setting. One is agreement on the definitions of key terms and goals in the management of diabetes, as for other medical conditions. Some progress has been made on this front for diabetes, as evidenced by consensus statements regarding glycemic measurements, glycemic targets, and hypoglycemia prompted by the recent development of CGM devices (100–102). Similarly, growing agreement on the properties and best uses of various glucose-lowering therapies is apparent in recent consensus statements by professional organizations (103,104). However, much remains to be done. Unified data systems will require national and international standardization of nomenclature and the incorporation of data dictionaries that can be used to encode diagnoses, procedures, drugs, and clinical outcomes. This may require building on established systems such as the Systematized Nomenclature of Medicine (SNOMED), International Classification of Diseases (ICD), and World Health Organization (WHO) classifications. Where required, mapping tools might be developed to allow translation of data sets among disparate coding systems. This is already being done by the National Library of Medicine, which maps the ICD-9 Clinical Modification, ICD-10 Clinical Modification, ICD-10 Procedure Coding System, and other classification systems to SNOMED, with the goal of establishing a universal taxonomy.
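A mapping tool of the kind mentioned above can be pictured as a lookup from (coding system, code) pairs to a single target vocabulary. The sketch below is a toy under stated assumptions: the four-entry table is hand-written for illustration and is not an extract of any published map, the function names are hypothetical, and a production system would load an officially maintained map rather than hard-code entries.

```python
# Illustrative entries only; not an extract of any official map.
# Real maps are often not one-to-one: ICD-9-CM 250.00, for example, covers
# "type II or unspecified type", so mapping it to type 2 is an approximation.
TO_SNOMED = {
    ("ICD-10-CM", "E11.9"): "44054006",   # type 2 diabetes mellitus
    ("ICD-10-CM", "E10.9"): "46635009",   # type 1 diabetes mellitus
    ("ICD-9-CM", "250.00"): "44054006",
    ("ICD-9-CM", "250.01"): "46635009",
}

def to_snomed(system, code):
    """Return the target concept identifier, or None if no mapping is known."""
    return TO_SNOMED.get((system, code.strip().upper()))

def harmonize(records):
    """Split records into those that could be translated and those that could not."""
    mapped, unmapped = [], []
    for rec in records:
        concept = to_snomed(rec["system"], rec["code"])
        target = mapped if concept is not None else unmapped
        target.append({**rec, "snomed": concept})
    return mapped, unmapped

# Hypothetical records from two systems that use different coding schemes.
records = [
    {"patient": 1, "system": "ICD-10-CM", "code": "e11.9"},
    {"patient": 2, "system": "ICD-9-CM", "code": "250.01"},
    {"patient": 3, "system": "ICD-10-CM", "code": "Z99.9"},
]
mapped, unmapped = harmonize(records)
```

Keeping the unmapped records visible, rather than silently dropping them, makes the coverage of a map measurable when data sets from disparate systems are pooled.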
Additional difficulties are posed by proprietary concerns of competing health systems, hardware and software manufacturers, and data-management groups. Sharing of data and agreement on standardization of systems among businesses that are competing in the same markets may pose a significant barrier. In addition, security of protected health information must be ensured, and procedures to accomplish it must be agreed upon by various stakeholders.
However, there is precedent for resolving difficulties such as these. Standardization of electronic systems, definitions, procedures, and regulations allowed for the development of international telephone service in the last century, and more recently the mechanics of the Internet and the World Wide Web. There seems no reason to believe that greater integration of data-management systems to facilitate diabetes care, surveillance, and research cannot be attained, given the potential for simultaneously improving medical outcomes and reducing overall costs.
In summary, integrated and improved management of big data has the potential to open a brave new world for diabetes care and research. Already we see successful proof-of-concept efforts, but further progress depends on overcoming logistical, administrative, and ethical obstacles to linking currently separate data-based activities.
The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.
This article is featured in a podcast available at http://www.diabetesjournals.org/content/diabetes-core-update-podcasts.
Article Information
Acknowledgments. Writing and editing support services for this article were provided by Debbie Kendall of Kendall Editorial in Richmond, VA. The authors thank Christian S. Kohler, the American Diabetes Association’s Associate Publisher for Scholarly Journals, and his staff for their assistance, guidance, and expertise in convening the 2018 Expert Forum.
Duality of Interest. M.C.R. has received research grant support from AstraZeneca and Eli Lilly; honoraria for consulting from Adocia, AstraZeneca, DalCor, Dance, Elcelyx, Eli Lilly, GlaxoSmithKline, Sanofi, and Theracos; and honoraria for speaking at a scientific meeting from Sanofi. L.B. has received research support from Janssen Pharmaceuticals, Lexicon Pharmaceuticals, Merck, Novo Nordisk, and Sanofi; has been a speaker for Janssen Pharmaceuticals, Novo Nordisk, and Sanofi; and has been a consultant for AstraZeneca, Gilead Sciences, Janssen Pharmaceuticals, Merck, Novo Nordisk, and Sanofi. H.C.G. holds the McMaster-Sanofi Population Health Institute Chair in Diabetes Research and Care. He has received research grant support from AstraZeneca, Eli Lilly, Merck, Novo Nordisk, and Sanofi; honoraria for speaking from AstraZeneca, Boehringer Ingelheim, Eli Lilly, Novo Nordisk, and Sanofi; and consulting fees from Abbott, AstraZeneca, Boehringer Ingelheim, Eli Lilly, Merck, Novo Nordisk, Janssen, and Sanofi. R.R.H. has received research grant support from AstraZeneca, Bayer AG, and Merck Sharp & Dohme; honoraria for speaking from Bayer AG; and consulting fees from Boehringer Ingelheim, Merck Sharp & Dohme, Novartis, and Novo Nordisk. G.A.N. has received research support from Boehringer Ingelheim, Merck, and Sanofi. A.T. has received research grant support from Eli Lilly and consulting fees from Monarch Medical Technologies and has equity in Brio Systems. No other potential conflicts of interest relevant to this article were reported.