A systematic review is a rigorous process that involves identifying, selecting, and synthesizing available evidence pertaining to an a priori–defined research question. The resulting evidence base may be summarized qualitatively or through a quantitative analytic approach known as meta-analysis. Systematic reviews and meta-analyses (SRMAs) have risen in popularity across the scientific realm, including diabetes research. Although well-conducted SRMAs are an indispensable tool in informing evidence-based medicine, the proliferation of SRMAs has led to many reviews of questionable quality and misleading conclusions. The objective of this article is to provide up-to-date knowledge and a comprehensive understanding of the strengths and limitations of SRMAs. We first provide an overview of the SRMA process and offer ways to identify common pitfalls at key steps. We then describe best practices as well as evolving approaches to mitigate biases, improve transparency, and enhance rigor. We discuss several recent developments in SRMAs, including individual-level meta-analyses, network meta-analyses, umbrella reviews, and prospective meta-analyses. Additionally, we outline several strategies that can be used to enhance the quality of SRMAs and present key questions that authors, editors, and readers should consider in preparing or critically reviewing SRMAs.
Introduction
Systematic review and meta-analysis (SRMA) research has risen in popularity across the scientific realm, even securing a place atop “evidence pyramids” and study design hierarchies (1,2). A “systematic review” refers to the process of identifying all research meeting the scope and eligibility criteria for a specific scientific question defined a priori. The collation of this evidence is then summarized qualitatively and/or quantitatively, with the quantitative analytic approaches referred to as “meta-analysis.”
Well-conducted SRMAs are an indispensable tool that provide a comprehensive synthesis of available evidence (3). A fundamental strength is the standardized approach that serves to minimize selective reporting and other author biases and has generally been a welcome replacement for the often cherry-picked narrative reviews and expert opinion articles. As such, SRMAs are routinely used to inform clinical care guidelines such as the Standards of Care in Diabetes recommendations put forth by the American Diabetes Association (4). Policymakers also heavily rely on data synthesized through SRMA processes to develop their recommendations and monitor implementation effectiveness (5,6). Another advantage of SRMAs is that combining effect estimates from multiple studies can improve statistical power, and therefore precision, for an exposure-outcome association. An example is the case of dipeptidyl peptidase 4 inhibitors and glucagon-like peptide 1 receptor agonists and pancreatic cancer risk: initial retrospective case-control studies, a design with high potential for selection bias and reverse causation but with greater statistical power for rare outcomes, showed signals of a positive association (7). Prospective cohort studies are less prone to these biases but were largely underpowered individually; however, SRMAs combining the estimates of multiple cohort studies or randomized clinical trials did not substantiate the early concerns observed in retrospective studies (7,8).
There is a misconception, however, that the standardized infrastructure to guide implementation means that SRMAs inherently arrive at unbiased and even definitive conclusions. Within the procedurally systematic framework, there are several subjective decision points and methodological considerations in conducting SRMAs, each with the potential to influence the authors’ findings, interpretations, and conclusions. A recent systematic review showed generally poor methodological and reporting quality of published SRMAs in diabetes research (9). The objective of this article is to provide up-to-date knowledge and a comprehensive understanding of strengths and limitations of SRMAs. We first provide an overview of the SRMA process and then offer ways to identify and overcome common pitfalls at key steps. We also present key questions that authors, editors, and readers should consider in preparing or critically reviewing SRMAs (Table 1).
Key questions to consider in reviewing or developing an SRMA
Does the SRMA have a well-defined research question that has not been addressed by recent SRMAs?
Is the SRMA protocol registered?
Does the SRMA article include use of the PRISMA checklist?
Are there a sufficient number of studies (a common rule of thumb is at least five) with adequate samples for drawing reliable conclusions?
Are the included studies sufficiently similar in terms of study designs, populations, and outcome measures to be combined?
Are certain studies intentionally included or omitted to veer the conclusion in a certain direction?
Is there a comparator exposure in examining the effects of an exposure of interest such as a food or nutrient?
Are the study selection and data extraction done independently by two or more reviewers?
Does the SRMA include assessment of methodological quality and biases of the included studies?
Does the forest plot reveal large between-study heterogeneity or certain outliers (e.g., implausibly large effect sizes or extremely narrow CIs)?
Are there signs that one or two big studies drive the pooled estimates?
Are there signs of small study effects (i.e., disproportionate weight assigned to studies with small samples)?
Do the pooled estimates and their 95% CIs differ substantially between random-effects and fixed-effects models?
Does the SRMA include use of appropriate tools to assess the certainty of the meta-evidence?
Does the SRMA include predefined subgroup analyses to examine robustness of the findings and potential effect modifications?
Does the SRMA include meta-regression to examine sources of heterogeneity?
Does the SRMA include assessment of publication bias using funnel plots and appropriate statistics?
Does the SRMA include examination of the impact of funding sources of individual studies on the findings and description of the funding source of the SRMA?
Does the SRMA include description of the limitations of the original studies and the SRMA methodology?
Does the SRMA include consideration of findings in alignment with other types of study designs on the topic?
When Are SRMAs Useless or Even Counterproductive?
The proliferation of SRMAs has also ushered in redundant reviews and reviews of questionable quality. A common misuse of the technique has led some to question the value of SRMAs altogether (10). As for any scientific endeavor, the investigator should first identify whether their investigation will address a gap in the evidence base. If a substantial amount of original research has accumulated, an SRMA may indeed be warranted. If there are already SRMAs addressing the same hypothesis, conducting an updated review could be justifiable if a critical mass of studies has been published since. However, sparse or heterogeneous evidence often precludes the ability to draw meaningful conclusions, leading the SRMA to be uninformative and add to uncertainty and confusion. There are many useful roadmaps to guide investigators in this decision-making process (11–13), which we have summarized in Fig. 1.
Deciding whether a systematic review of your hypothesis is warranted.
Even when an SRMA is justified and suitable to advance a particular hypothesis, it is critical to recognize that this approach remains fraught with potential pitfalls. For example, in a recent study investigators examined the 20 most cited meta-analyses in the field of strength and conditioning and found that most of them (85%) suffered one or more common statistical errors such as mixing up SEs and SDs of estimates and double counting the same studies (14). These common errors highlight the importance of quality control by authors and readers in preparing or critically reviewing SRMAs.
When Should a Meta-analysis Be Conducted Within a Systematic Review?
A meta-analysis is simply the calculation of a weighted average of the individual studies’ effect estimates to generate a single summary statistic pertaining to an exposure or intervention and outcome relationship. There is no universally accepted decision tree or threshold to indicate the point at which a meta-analysis is warranted. Conducting a meta-analysis to statistically summarize findings in a systematic review is tempting, and technically, only two data points and their measures of variance are needed. One algorithm broadly outlines major decisions to aid quantitative synthesis decision-making, with consideration of compatibility of study hypotheses and designs, structure of the exposure and outcome data, and units of measurement (15). However, the question of “whether data should be statistically combined to begin with” supersedes any downstream considerations, and this relies heavily on having a well-defined hypothesis, a clearly specified SRMA protocol, and subject matter expertise.
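To make the weighted-average idea concrete, below is a minimal sketch in Python (NumPy only) of fixed-effect inverse-variance pooling. The log risk ratios and SEs are invented for illustration and are not data from any cited study.

```python
import numpy as np

# Illustrative (made-up) log risk ratios and their standard errors from five studies
log_rr = np.array([-0.22, -0.10, -0.35, 0.05, -0.18])
se = np.array([0.10, 0.08, 0.20, 0.15, 0.12])

# Fixed-effect inverse-variance weights: more precise studies receive more weight
weights = 1.0 / se**2
pooled = np.sum(weights * log_rr) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

# 95% CI on the log scale, exponentiated back to the risk ratio scale
ci_low, ci_high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"Pooled RR = {np.exp(pooled):.2f} "
      f"(95% CI {np.exp(ci_low):.2f}-{np.exp(ci_high):.2f})")
```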
If the investigator anticipates the possibility of synthesizing the results of their literature search and data extraction with a meta-analysis, the analytic plan and rationale should be specified in the protocol, including steps for harmonizing across units of exposure and outcome, effect estimate scales, and measures of variance. The protocol should also provide the rationale for the weighting scheme (e.g., fixed effects vs. random effects) and, where appropriate, methods for dose-response analysis and meta-regression. There are existing comprehensive resources for conducting these analyses that we refer the reader to (16). However, it may be the case that a meta-analysis should ultimately not be performed. There are several reasons why a meta-analysis may not be appropriate, such as studies using different methods for assessment of the exposure (e.g., daily dietary vitamin D intake vs. serum vitamin D concentrations) or having different end points (e.g., change in fasting blood glucose concentration vs. HbA1c levels) or study designs (e.g., cross-sectional studies vs. prospective cohort studies vs. randomized intervention trials). Low-quality protocols with vague definitions of exposures and outcomes can lead to the inclusion of poorly aligned studies that may not be directly relevant or matched to the research question or objective of a systematic review.
Meta-analyzing More Than Randomized Controlled Trials
As meta-analyses were originally developed for placebo-controlled randomized controlled trials (RCTs), conducting a meta-analysis is often more complicated for other types of research, such as in the case of observational data. A meta-analysis of exposure/outcome effect estimates derived from observational data should include consideration that the study populations likely have different exposure distributions, exposure and outcome ascertainment methods (e.g., self-report, registry database, blood biomarker, etc.), and analytic approaches. Further, it is common in observational research for investigators to fit several multivariable-adjusted models with varying degrees of adjustment for plausible confounders. The SRMA investigator who is familiar with the literature should anticipate this and specify a priori in their protocol what constitutes appropriate inclusion and exclusion criteria for the literature search. Inappropriate inclusion and exclusion criteria without consideration of the analytic methods of the original studies can also lead to biased results. For example, some meta-analyses arrived at conclusions that those with a higher BMI have a lower risk of all-cause mortality than those with normal weight (17). However, these meta-analyses implicitly included studies conducted in patients with prevalent chronic conditions, such as cancer or neurodegenerative diseases, in which gradual declines in body weight precede death, resulting in “reverse causation” bias.
Key Components of an SRMA
We will not provide an exhaustive guide on conducting an SRMA, as there are already numerous resources available to the research community (11–13,16,18). Generally, the process can be distilled to four main components as shown in Fig. 2: 1) developing a clear research hypothesis and protocol, 2) implementing the protocol to identify and characterize the evidence base, 3) evidence synthesis and statistical analysis, and 4) formulating a conclusion informed from both the results and the quality of the evidence.
Begin With a Clear Hypothesis
The SRMA process begins with articulating a clear research hypothesis and developing an appropriate protocol to address the question. There are several frameworks available that guide researchers through the process of formulating a well-defined hypothesis, with PICO (population, intervention, comparator, outcome) and its variants among the most commonly used (19). Briefly, elements of a well-defined research question include specifying the target patient population, intervention (or exposure) and comparator, and outcome definition. Other considerations include defining the relevant time frame or duration that would be reasonable for development of the outcome, identifying optimal study designs and analytic approaches, and being as detailed as if the investigators were planning to conduct an original study themselves.
Protocol, Protocol, Protocol
A well-developed protocol is critical for conducting a high-quality SRMA as it serves as the investigators’ roadmap for the systematic review process. The 17-item extension of Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) for protocols (PRISMA-P) checklist and several extensions for different types of reviews have been used to improve the quality of developing SRMA protocols (20,21). However, most published SRMAs related to diabetes research did not have available protocols and, even among those with available protocols, the adherence to the PRISMA-P checklist was poor (9). The absence or low quality of a prospective protocol may raise concerns about the rigor of the SRMA but does not necessarily mean it should be dismissed entirely.
Composing an SRMA protocol is seemingly straightforward, and standardized tools such as PRISMA-P foster “systematic” and replicable results. However, this stage is arguably one of the major determinants of overall quality and bias. An ill-defined research question, such as in the case of failing to specify a comparator exposure, can lead to inappropriate inclusion of studies misaligned with the investigators’ original intent, result in substantial heterogeneity, and undermine the quality and certainty of the evidence base and the ability to draw meaningful conclusions. Similarly, vague protocol criteria introduce unnecessary subjectivity during screening and data extraction. The subject matter expertise required to formulate a research hypothesis and translate it into a protocol is also often underestimated. It is not uncommon for SRMAs to combine results from observational cohorts and randomized intervention trials, which can lead to misleading findings, given differences in study designs, duration of follow-up, exposure and comparator types, participant criteria, and outcome measurements. Conversely, some SRMAs have omitted available and seemingly eligible studies without explanation.
Requiring authors to prospectively register SRMA protocols seeks to improve transparency and reproducibility of the literature search process and potentially decrease some biases; at the very least, these records will serve as a resource for understanding discrepancies should similar SRMAs arrive at different conclusions. Scientific journals increasingly require authors of SRMAs to have prospectively filed their protocol (i.e., before initiating the literature search), and registries such as the International Prospective Register of Systematic Reviews (PROSPERO) facilitate this (22–24). Preregistered SRMAs were found to have higher overall methodological quality compared with nonregistered reviews (25). Readers, journal editors, and reviewers should be encouraged to examine registered protocols while reviewing SRMAs to ensure the SRMA adheres to the predefined methods in the protocol.
Literature Search and Data Extraction
A well-developed protocol should include a comprehensive literature search strategy that is unbiased and reproducible. In general, inclusion and exclusion criteria should be clearly stated based on the research question before the literature search is started. Multiple databases such as PubMed, Embase, Web of Science, APA PsycInfo, and Cochrane Library should be searched, and unpublished studies and non–English language studies included, to ensure that all potentially eligible studies are identified. Title and abstract screening, and full-text review of the retrieved studies and data extraction of eligible studies, should be conducted by two authors independently, and a third researcher may be called on to resolve any disagreements. Newer systematic review platforms, such as Covidence (https://www.covidence.org/), have made the duplicate screening and consensus processes efficient and easy to track. Often, the authors of original articles need to be contacted for missing or partially reported data. The Peer Review of Electronic Search Strategies (PRESS) checklist is a useful tool to guide and improve the quality of literature search strategies, and a PRISMA flow diagram is used to display the detailed search process and results (26). The PRISMA 2020 statement includes a revised flow diagram that incorporates the number of studies from previous reviews and those identified through other search strategies (27). Figure 3 is an example of a thorough PRISMA flowchart that describes the study selection process from a meta-analysis on the association between weight status and risk of diabetes in adults (28). As shown here, the flowchart should include the results of the initial search and any updated searches of databases from inception until the start of data analysis. It should also describe the results from the database search and other sources, e.g., hand searching the reference lists of the included studies. It is important for the authors to provide as much detailed information as possible about the reasons for excluding studies at every step. The diagram can be modified to accommodate updated or continually updated (“living”) systematic reviews. The final set of manuscripts included in the SRMA is then described in a table, which provides high-level summaries of data extracted on the study design, study population, interventions or exposures, and outcomes for each manuscript.
PRISMA 2020 flow diagram example for systematic reviews. Cochrane, Cochrane Library. Reprinted with permission from Yu et al. (28). OB, obesity; NW, normal weight; OW, overweight.
Quantifying and Interpreting Heterogeneity
A critical step of data synthesis is assessing heterogeneity, which refers to clinically and/or statistically different effect estimates among the eligible studies. Despite a well-defined hypothesis and SRMA protocol, two or more seemingly similar studies may estimate statistically different effects due to random chance or factors the investigator did not control for in the protocol development and screening process. In a qualitative synthesis of evidence, results of the individual studies should be summarized in a table, with enough information for the reader to interpret the effect estimate, such as the units of the outcome and scale of the effect estimate and variance.
In a quantitative synthesis, forest plots are emblematic of meta-analyses as they provide a visual representation of the individual studies’ estimates and contain important information about the evidence base and a qualitative glimpse at consistency in the results, or lack thereof, between studies. The plots are useful in inspecting implausibly large effect sizes, implausibly narrow 95% CIs, outlier studies, and large between-study heterogeneity.
The heterogeneity among studies in a meta-analysis is quantified by the Cochran Q test and the I2 statistic. The I2 describes the percentage of total variation in the summary estimates that can be attributed to between-study heterogeneity. The Cochran Q (χ2) test addresses the null hypothesis that the individual study effect estimates are similar and that any differences are due to chance alone. A higher I2 value and a significant Q test P value would indicate the presence of statistical between-study heterogeneity. I2 values of 25%, 50%, and 75% are generally considered indicative of low, moderate, and high heterogeneity, respectively. However, these cut points are somewhat arbitrary. Of note, because I2 is a relative rather than an absolute measure of statistical heterogeneity, it tends to be inflated in meta-analyses of large observational studies where the variability due to sampling error is relatively low (29). Therefore, it is common to observe a high I2 value in a meta-analysis of large cohort studies, even if the estimates across individual studies are relatively consistent. In addition, meta-analyses of continuous outcomes often exhibit substantially higher I2 values compared with meta-analyses of binary outcomes (30).
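As a rough illustration of how these quantities relate, the following sketch computes Cochran Q, its P value, and I2 from a set of study effect estimates and their SEs; the input numbers are invented solely for demonstration.

```python
import numpy as np
from scipy import stats

def q_and_i2(effects, ses):
    """Cochran Q and I-squared from study effect estimates and their SEs."""
    y = np.asarray(effects)
    w = 1.0 / np.asarray(ses) ** 2
    pooled = np.sum(w * y) / np.sum(w)              # fixed-effect pooled estimate
    q = np.sum(w * (y - pooled) ** 2)               # Cochran Q statistic
    df = len(y) - 1
    p = stats.chi2.sf(q, df)                        # P value of the heterogeneity test
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, p, i2

# Made-up example: five log risk ratios and their SEs
q, p, i2 = q_and_i2([-0.22, -0.10, -0.35, 0.05, -0.18],
                    [0.10, 0.08, 0.20, 0.15, 0.12])
print(f"Q = {q:.2f}, P = {p:.3f}, I2 = {i2:.0f}%")
```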
There are important considerations for interpreting a meta-analysis summary statistic in the presence of between-study heterogeneity. First, it is crucial to assess what features of study design, population characteristics, and intervention/exposure, among others, might have led to different results. The SRMA protocol should include a list of possible factors the investigator anticipates examining, mainly based on prior knowledge or biological plausibility. Sensitivity analyses where justified outliers are excluded can help with assessment of their impact on the meta-analysis. Meta-regression analyses may also be useful in assessing whether a specific study-level factor explains between-study differences, keeping in mind that inferences of study-level differences cannot be attributed to individual-level effects (i.e., ecological fallacy) (16). It is possible that significant effect modification by a study-level factor would warrant presenting a stratified meta-analysis. This decision should be clearly documented and justification should be included if it deviates from the original analysis plan, usually driven by biological plausibility and/or strong prior evidence (31). While significant between-study heterogeneity is often considered as a negative aspect in assessing the degree of certainty of evidence, as it may suggest inconsistency of the literature, it may suggest true etiological differences that should be explored rather than dismissed outright (32). As discussed above, it is important to interpret a high I2 value from meta-analyses of large cohort studies with caution and not automatically assume it to be an indicator of inconsistency or a justification for downgrading the certainty of the meta-evidence (29).
In a series of SRMAs on diet and lifestyle exposures with incident chronic disease risk from observational data, the Global Burden of Disease group used a statistical method to model and account for between-study heterogeneity (33). With this method they calculated estimated uncertainty intervals (UIs) that were several times wider than the 95% CIs generated by conventional random-effects models. As an example, although there was a highly statistically significant and approximately linear positive association between red meat intake and risk of type 2 diabetes, the lower boundary of the 95% UI included 1 due to an ∼2.5-fold inflation of the width of the conventional CIs, resulting in a rating of weak evidence (two out of five stars) for the association (33). In their analyses of smoking and health outcomes (34), the evidence on smoking and heart disease was rated “moderate” (three-star) despite overwhelming evidence from multiple sources of data to support a strong causal relationship (35). The UIs were calculated with methods similar to those used to estimate prediction intervals for random-effects summary estimates (36,37). Because UIs incorporate the uncertainty in both the mean effect from a random-effects model and the heterogeneity parameter, they are much wider than 95% CIs. UIs are intended to account for additional variability arising from the wide range of effects observed in individual studies and therefore are useful in predicting the range of effects that may be observed in a new study. However, it is important to note that UIs should not be used to draw conclusions about the overall impact of an exposure or treatment based on existing evidence (38). It is more appropriate to use the pooled effect estimates and corresponding 95% CIs from SRMAs in estimating and interpreting population average effects, which are crucial for making public health or clinical recommendations.
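For readers who want to see how a prediction interval of this kind is typically obtained, the sketch below implements a common Higgins-style approximation, assuming the random-effects pooled estimate, its SE, and the between-study variance tau^2 have already been calculated; all numbers are hypothetical.

```python
import numpy as np
from scipy import stats

def prediction_interval(pooled, pooled_se, tau2, k):
    """Approximate 95% prediction interval for the effect in a new study,
    given a random-effects pooled estimate, its SE, the between-study
    variance tau^2, and the number of studies k (requires k >= 3)."""
    t_crit = stats.t.ppf(0.975, df=k - 2)
    half_width = t_crit * np.sqrt(tau2 + pooled_se**2)
    return pooled - half_width, pooled + half_width

# Made-up numbers on the log risk ratio scale
low, high = prediction_interval(pooled=0.15, pooled_se=0.05, tau2=0.04, k=12)
print(f"95% prediction interval for the RR: {np.exp(low):.2f} to {np.exp(high):.2f}")
```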
Fixed-Effects Versus Random-Effects Meta-analysis
There has been a long-standing debate regarding whether fixed-effects or random-effects meta-analysis is the preferred model, especially in the presence of significant heterogeneity and small study effects. While the fixed-effects model assumes a common true treatment effect across studies, the random-effects model assumes a distribution of true treatment effects across studies. Although inverse variance weighting is used in both fixed-effects and random-effects models, the former considers only within-study variability, whereas the latter incorporates both within-study and between-study variability. The DerSimonian and Laird method is the most commonly used method to combine data from individual studies using random-effects models (39). As shown in Fig. 4, under a random-effects model the study weights are distributed more uniformly than under the fixed-effects model, so small studies contribute relatively more to the overall estimate. In this example, the largest study contributes 17.7% of the weight in the random-effects model as opposed to 92.8% in the fixed-effects model. In the case of very limited or absent between-study heterogeneity, random-effects and fixed-effects weights are very similar or the same.
Random-effects and fixed-effect meta-analyses comparing the effect of intravenous magnesium with placebo on overall mortality in patients with acute myocardial infarction. A risk ratio (RR) <1 indicates that intravenous magnesium is better than placebo. Reprinted with permission from da Costa and Juni (12).
When small study effects exist (i.e., small studies demonstrate more extreme effects), the results of fixed-effects and random-effects models can be substantially different. A classic example is a meta-analysis of the effect of intravenous magnesium on mortality following myocardial infarction, in which beneficial effects of intervention were found in a meta-analysis of small studies, with these findings subsequently challenged when the very large ISIS-4 (Fourth International Study of Infarct Survival) trial had null results (40). Because there was substantial between-trial heterogeneity, the studies were weighted much more uniformly in the random-effects analysis than in the fixed-effects analysis, with small studies contributing more to the pooled estimate. In the fixed-effects analysis, ISIS-4 contributes >90% of the weight and so the pooled estimate shows no beneficial intervention effect. In the random-effects analysis, the small studies contributed most of the weight and there appeared to be strong evidence of a beneficial effect of intervention (Fig. 4). Of note, the proportion of events contributed by ISIS-4 was 92% (4,319 of 4,696). In interpreting the evidence, it is crucial to make a judgment about the validity of the combined estimates from the smaller studies in comparison with that from ISIS-4.
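To make the mechanics concrete, the sketch below computes the DerSimonian-Laird estimate of the between-study variance and shows how it shifts the percentage weights from the fixed-effects to the random-effects model. The example data (one very precise trial with a null result alongside several small, apparently beneficial trials) are invented to mimic the pattern just described and are not the actual ISIS-4 data.

```python
import numpy as np

def dl_weights(effects, ses):
    """DerSimonian-Laird between-study variance and the resulting
    fixed-effect vs. random-effects percentage weights."""
    y, se = np.asarray(effects), np.asarray(ses)
    w = 1.0 / se**2
    pooled_fe = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - pooled_fe) ** 2)
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)          # DL estimator, truncated at 0
    w_re = 1.0 / (se**2 + tau2)                      # random-effects weights add tau^2
    return tau2, 100 * w / w.sum(), 100 * w_re / w_re.sum()

# Made-up example: four small "beneficial" trials and one very large null trial
tau2, fe_pct, re_pct = dl_weights([-0.70, -0.50, -0.60, -0.80, 0.02],
                                  [0.30, 0.35, 0.40, 0.30, 0.03])
print(f"tau^2 = {tau2:.3f}")
print("Fixed-effect weights (%):  ", np.round(fe_pct, 1))
print("Random-effects weights (%):", np.round(re_pct, 1))
```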
A common practice is to use random-effects meta-analysis when tests of heterogeneity are statistically significant; otherwise, a fixed-effects meta-analysis is conducted. One caveat with this approach is that the heterogeneity test largely depends on the sample sizes of the included studies. Another consideration is that random-effects models give disproportionate weight to small studies, thus penalizing large studies, which results in wider CIs of the pooled estimates and greater uncertainty of the findings (41). For these reasons, some have argued that fixed-effects models are more appropriate than random-effects models regardless of between-study heterogeneity (42). In practice, it may be useful to report both fixed-effects and random-effects summary estimates and their 95% CIs to gauge the robustness of the findings. The choice between the two models depends primarily on several factors, including the possibility of small study effects and the expected clinical heterogeneity in population characteristics, study designs, and intervention doses or types.
There are several strategies to address the presence of a large degree of heterogeneity. One option is not to perform a meta-analysis at all when heterogeneity is so severe that there is clear inconsistency in the direction and magnitude of the effects, which may mean that the studies are not comparable enough to be meta-analyzed. If a decision to meta-analyze the data is reasonable, then exploring the potential causes of heterogeneity is mandatory. Implementing a random-effects model takes into account between-study heterogeneity, but it does not substitute for a thorough investigation of heterogeneity, which is typically performed through subgroup analysis and meta-regression. Characteristics of the studies that may be associated with heterogeneity should be prespecified in the review protocol. Also, it should be noted that lack of power to meaningfully explore heterogeneity is common when only a few studies are synthesized.
Grading the Degree of Certainty of the Evidence
A meta-analysis summary statistic does not signify the completion of the SRMA evidence synthesis and reporting process. Once the evidence base has been summarized qualitatively or quantitatively, the next critical step is to assess the overall validity and certainty of the synthesized evidence. This begins with evaluating the individual studies for their quality and transparency in design, conduct, analysis, and reporting, including careful consideration of potential biases undermining the validity of their effect estimates. In general, it is not advisable to draw conclusions about the validity of evidence based solely on the type of study design, for example, automatically favoring RCTs over observational studies. There is no one-size-fits-all approach to systematically evaluating these, and an appropriate appraisal will require an understanding of the study designs’ strengths and limitations, statistical approaches, and substantive knowledge.
Next, a critical appraisal of the totality of the evidence included in the review is conducted to determine the level of confidence that the investigator puts in the SRMA’s analytic findings. The appraisal involves the assessment of individual study quality and the consistency of existing evidence, biological plausibility, and level of evidence certainty, often from multiple lines of evidence. In Table 2, we recommend several quality, bias, and certainty appraisal tools that can help authors enhance the overall quality of an SRMA. The appraisal of evidence levels is crucial to formulating the SRMA’s final conclusions. In a recent assessment of the epidemiological characteristics and the overall methodological quality of SRMAs of diabetes treatment, less than half of the SRMAs (45.2%) assessed and documented the scientific quality of included studies, and only 34.5% of SRMAs considered it when formulating conclusions (43). Failing to properly appraise and apply validity and certainty in an SRMA’s conclusions may lead to a falsely inflated or diminished level of confidence in the underlying evidence base, with the potential to negatively influence clinical and public health recommendations.
Good practice tools for assessment of study-level quality and risk of bias and improvement of overall SRMA reporting
| Study design | Tool | Description | Source of guidance |
|---|---|---|---|
| Systematic reviews | ROBIS | The Risk of Bias in Systematic Reviews (ROBIS) tool has three parts: 1) assess relevance (optional), 2) identify concerns with the review process, and 3) judge risk of bias | https://www.bristol.ac.uk/population-health-sciences/projects/robis/ |
| Systematic reviews | AMSTAR 2 | A MeaSurement Tool to Assess systematic Reviews (AMSTAR) 2 is a 16-item evaluation tool that enables a detailed assessment of systematic reviews that include randomized (RCT) or nonrandomized studies of health care interventions | https://amstar.ca/ |
| Randomized trials | RoB 2 | Version 2 of the Cochrane risk-of-bias tool for randomized trials (RoB 2) is structured into five domains of bias, according to the stages of a trial in which problems may arise: 1) the randomization process, 2) deviations from intended intervention, 3) missing outcome data, 4) measurement of the outcome, and 5) selection of the reported result | https://www.riskofbias.info/ |
| Observational studies | ROBINS-I | ROBINS-I is composed of seven domains to assess bias due to confounding, selection of participants, classification of interventions, departures from intended interventions, missing data, measurement of outcomes, and selection of the reported result | https://methods.cochrane.org/methods-cochrane/robins-i-tool |
| Observational studies | NOS | The Newcastle-Ottawa Scale (NOS) is used to assess selection of study groups, comparability of the groups, and ascertainment of exposure/outcome in nonrandomized studies | https://www.ohri.ca/programs/clinical_epidemiology/oxford.asp |
| Diagnostic accuracy | QUADAS-2 | The Quality Assessment of Diagnostic Accuracy Studies (QUADAS)-2 has four domains: patient selection, index test, reference standard, and flow and timing | https://jbi.global/sites/default/files/2020-08/Checklist_for_Diagnostic_Test_Accuracy_Studies.pdf |
| Prognostic factors | QUIPS | The Quality in Prognosis Studies (QUIPS) tool includes six important domains that should be critically appraised in evaluating validity and bias in studies of prognostic factors: 1) study participation, 2) study attrition, 3) prognostic factor measurement, 4) outcome measurement, 5) study confounding, and 6) statistical analysis and reporting | https://methods.cochrane.org/prognosis/our-publications |
An example highlighting complexities of evidence appraisal is the series of SRMAs evaluating intake of red and processed meats with risks of major chronic disease incidence and mortality. As would be anticipated, the evidence base consisted almost exclusively of observational cohort studies, given the long-term health outcomes of interest. The authors applied the Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach for evaluating the certainty of evidence, which is a well-established and validated tool for evaluating strengths of evidence from RCTs of pharmacological or other medical interventions. Its direct application to evaluating observational evidence, however, can be problematic (32). The authors assigned the evidence to the level of “very low and/or low certainty” owing to lack of randomization and substantial between-study heterogeneity (44) and as such concluded with recommendations for individuals to continue their red and processed meat consumption habits. In contrast, using a certainty rating system tailored specifically for nutritional exposure research, a separate group rated the same body of literature as “moderate quality” and “high quality” on the associations of red and processed meat intakes with mortality and type 2 diabetes, respectively (45,46). A key difference between the GRADE and NutriGRADE approach is that NutriGRADE does not automatically assign a low rating to observational evidence as a default starting point. Instead, it is tailored to specific characteristics, strengths and limitations, and potential biases of nutrition research (32).
The Risk of Bias in Non-randomized Studies of Interventions (ROBINS-I) tool has been used to assess the risk of bias across multiple domains in observational studies (47). Under this framework, an effect size consistent with a plausible biological exposure/outcome relationship informed by other lines of evidence, as well as the presence of a dose-response relationship, can serve as justification for upgrading the certainty of observational evidence. The incorporation of ROBINS-I with GRADE offers a more suitable method for evaluating the certainty of observational evidence related to nutritional, lifestyle, and environmental exposures (48). This integration ensures that GRADE does not assign an automatic low rating to observational evidence as a starting point. It is also important that a high I2 value from meta-analyses of large cohort studies does not lead to an automatic downgrade of the certainty of observational evidence (29).
It is common for an SRMA protocol to incorporate rationale for the inclusion of multiple study designs that address complementary hypotheses concerning the same exposure/outcome association, such as efficacy intervention trials alongside epidemiologic studies assessing long-term habitual exposure and the same outcomes. In such cases, evidence synthesis may be best performed separately for each study design, and quality and certainty assessments should be performed with tools that are optimized to specific types of research being evaluated.
Assessing Publication and Selective Reporting Biases
Understanding of the mechanisms by which publication bias and other forms of selective reporting bias might arise is a prerequisite for minimizing their impact on our interpretation of the literature, and correspondingly these mechanisms need to be thoroughly examined by the diabetes research community. Such mechanisms include confirmation bias (selective preference for new results that agree with prior evidence), improper study parameters (e.g., lack of power, improper specification of the population for the intervention), hypothesis testing practice (discontinuation of the manuscript development due to negative results in the analysis), lack of appropriate avenues for reporting negative studies beyond “grey literature” (ideally, the probability of publication of study findings should be independent of statistical significance), and selective outcomes reporting bias (reporting only outcomes with nonnull findings).
In a recent assessment of the methodological quality of SRMAs on diabetes treatment, investigators concluded that <40% included assessment of the potential for publication bias (43). A thorough and comprehensive search for relevant records is extremely important for minimizing publication bias. Searching for so-called “grey literature,” including dissertations, preprints, and reports from the government or industry or conference proceedings, as well as preregistration databases with additional study information and outcome data, such as ClinicalTrials.gov, may be worth the effort (49). Although we can examine the presence of publication bias through statistical tests, we cannot prove publication bias directly with use of any of these methods. Nonetheless, with these methods we can examine certain properties of the data that may be indicative of publication bias.
The most commonly used graphical method to explore publication bias is the funnel plot (Fig. 5). In the absence of publication bias, the scatter will be due to sampling variation only, with studies distributed roughly symmetrically on both sides of the bottom of the funnel plot. In the presence of publication bias, studies will be missing from one side of the bottom of the funnel plot. However, funnel plot asymmetry can also be due to true study heterogeneity, other reporting biases, or chance (50). Eyeballing a funnel plot for evidence of publication bias is subjective and therefore can be misleading. It is important to note that funnel plots can be symmetrical even in the presence of publication and other reporting biases (51,52).
Asymmetric funnel plot with evidence of publication bias (left) and symmetric funnel plot with no evidence of publication bias (right). The graphs were created with simulated data. A funnel plot is a scatter plot of effect sizes on the x-axis and a measure of their SEs on the y-axis. The y-axis is inverted in the funnel plot, with studies with a small SE (typically, larger studies have more precision) to occupy the top of the funnel.
The role of chance is of particular importance in the interpretation of a funnel plot because most meta-analyses include only a few studies and therefore may be underpowered. In evaluating the role of chance, statistical tests for funnel plot asymmetry (small study effects) are used to examine whether the association between published effect estimates and measures of study size is greater than would be expected by chance alone. These tests, such as the Begg rank test (53) and the Egger regression test (54) and its extensions (55–57), are often underpowered. Thus, even if their results are null, publication and other reporting biases cannot be excluded.
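As an illustration of how a regression-based asymmetry test works, the sketch below implements a basic version of the Egger test (regressing the standardized effect on precision and testing whether the intercept differs from zero) using statsmodels; the study data are simulated to show small study effects and are not from any real meta-analysis.

```python
import numpy as np
import statsmodels.api as sm

def egger_test(effects, ses):
    """Basic Egger regression test for funnel plot asymmetry:
    regress effect/SE on 1/SE and test the intercept against zero."""
    y = np.asarray(effects) / np.asarray(ses)       # standardized effects
    x = sm.add_constant(1.0 / np.asarray(ses))      # precision with an intercept term
    fit = sm.OLS(y, x).fit()
    return fit.params[0], fit.pvalues[0]            # intercept and its P value

# Simulated example: smaller (larger-SE) studies show systematically larger effects
effects = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.35, 0.3, 0.25, 0.2])
ses = np.array([0.5, 0.45, 0.4, 0.35, 0.3, 0.25, 0.2, 0.15, 0.1, 0.05])
b0, p = egger_test(effects, ses)
print(f"Egger intercept = {b0:.2f}, P = {p:.3f}")   # small P suggests asymmetry
```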
Adhering to Recognized Standards for Reporting SRMAs
It is essential for authors to follow recognized standards for reporting and publishing SRMAs so that all of the steps are reported in detail and reproducible by an independent researcher. The PRISMA 2020 statement, which was updated from the 2009 statement, provides guidance for reporting and assessment of the quality of SRMAs (27). The 27-item checklist covers the study inclusion and exclusion criteria, databases and search terms to be used, literature screening and extraction procedures, statistical methods for meta-analysis, methods for individual study quality and bias assessment, and methods for assessing the certainty level of the meta-evidence. The checklist is useful for improving the transparency and methodological standards of systematic reviews.
Do Not Overinterpret the Findings From SRMAs
Because well-conducted SRMAs sit atop the evidence hierarchy, there is a tendency for authors to draw overly confident conclusions beyond the strength of the data. For example, some authors may portray their findings from SRMAs as definitive or causal. It should be noted that an SRMA per se is not a tool for establishing causality, although the findings from well-conducted SRMAs quantify the strength, consistency, and dose-response relationship of evidence to inform causal inference (58,59). Often, authors translate their SRMA findings directly into public health recommendations, without examining other important considerations such as biological plausibility, implementation and scalability, cost-effectiveness, environmental impact, side effects or safety, and more. For example, the previously mentioned SRMAs on red meat and health outcomes were published alongside the authors’ proposed revised dietary guidelines for individuals to continue their current meat consumption habits, which has led to a great deal of public confusion (44). Similarly, in the Burden of Proof studies investigators converted the SRMA findings of epidemiological studies on lifestyle factors and chronic diseases into a simplified one- to five-star rating system for policy recommendations without considering other lines of evidence (60). These examples highlight the importance of exercising caution in drawing conclusions with policy implications based on evidence from SRMAs.
Recent Developments in SRMAs
There are several newer variants in meta-analysis with expanded scope and capabilities in comparison with more conventional methods of research synthesis. Examples include individual-level meta-analyses, network meta-analyses, prospective meta-analyses, and umbrella reviews. Here we briefly describe the strengths and limitations of these approaches.
Individual-Level Data Meta-analysis
A meta-analysis is classically performed through analysis of aggregate data; however, the quality of study reporting, different outcome definitions, and analyses performed may limit the validity of and ability to combine these data (16). The individual participant data meta-analysis addresses many of these concerns and could yield higher-quality meta-analyses in comparison with literature-based SRMAs. This type of SRMA is performed by collecting individual-level data from study investigators. This allows for consistent inclusion/exclusion criteria and outcome definitions, standardized analytic approaches and effect estimates, and analyses including unpublished data. Analyses with individual-level data also avoid ecological fallacy in examining sources of between-study heterogeneity and allow for examination of intervention interactions by patient-level characteristics and duration of follow-up. An important barrier to using the individual participant data meta-analysis is the higher cost, longer timeline, and greater coordination required, including data-sharing agreements and data transfers. Detailed protocols are needed to harmonize exposures, covariates, and outcomes and to pool and analyze data from diverse data resources. In a recent systematic review of individual participant data meta-analyses published from 1991 to 2019, investigators found that the methodologic quality of these was far from satisfactory (61); as for aggregate-level meta-analyses, the validity of the individual participant–level meta-analysis is contingent on high-quality methodology and reporting.
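To illustrate the general idea, the sketch below shows a simple two-stage approach: the same logistic regression model is fit to each study's participant-level data, and the study-specific log odds ratios are then pooled by inverse-variance weighting. The data frame and column names (study, outcome, exposure, age, sex) are hypothetical placeholders, and a real individual participant data analysis would involve far more careful harmonization and modeling choices.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def two_stage_ipd(df, formula="outcome ~ exposure + age + sex", term="exposure"):
    """Two-stage IPD meta-analysis sketch: fit the same logistic model in each
    study, then inverse-variance pool the log odds ratios for `term`."""
    log_ors, ses = [], []
    for _, study_df in df.groupby("study"):
        fit = smf.logit(formula, data=study_df).fit(disp=False)
        log_ors.append(fit.params[term])
        ses.append(fit.bse[term])
    w = 1.0 / np.asarray(ses) ** 2
    pooled = np.sum(w * np.asarray(log_ors)) / np.sum(w)
    pooled_se = np.sqrt(1.0 / np.sum(w))
    return (np.exp(pooled),
            np.exp(pooled - 1.96 * pooled_se),
            np.exp(pooled + 1.96 * pooled_se))

# Usage (assuming `data` is a harmonized participant-level DataFrame):
# pooled_or, ci_low, ci_high = two_stage_ipd(data)
```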
Network Meta-analysis
Since conventional SRMA only accommodates pairwise exposure comparisons, the network meta-analysis approach was developed to consider more than two exposures of interest. Network meta-analysis statistically contrasts any number of pairwise exposure effect estimates with a common outcome, using both direct and indirect comparisons (62). Thus, it offers a comparison of two interventions even if they have never been directly tested head-to-head. If used appropriately, this is a powerful tool to inform clinical decision-making when resources are limited for conducting multiple comparative effectiveness trials. As an example, in a recent network meta-analysis investigators examined 816 trials with 471,038 patients, together including evaluation of 13 different drug classes, with confirmation of the benefits of sodium–glucose cotransporter 2 inhibitors and glucagon-like peptide 1 receptor agonists in reducing cardiovascular disease and end-stage kidney disease compared with standard care (63). Of note, the estimates of indirect comparisons between two treatments should be interpreted with caution when the treatment groups were not derived from the same study population and estimates were obtained under different study protocols (64). Further, inherent to the design, foundational steps such as risk of bias and assessment of heterogeneity may be more challenging to interpret (65). Since a network meta-analysis yields more than one effect estimate, bias from any single trial or heterogeneity between trials may affect several pooled effect estimates or impact multiple other comparisons.
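The simplest building block of an indirect comparison is the Bucher method: if treatments A and B have each been compared with a common comparator C, an A versus B estimate can be formed from the difference of the two log effect estimates, with their variances added. The sketch below illustrates this with made-up odds ratios; full network meta-analyses rely on more general multivariate (frequentist or Bayesian) models.

```python
import numpy as np

def bucher_indirect(log_or_ac, se_ac, log_or_bc, se_bc):
    """Bucher adjusted indirect comparison: estimate A vs. B from
    A vs. C and B vs. C effect estimates sharing the common comparator C."""
    log_or_ab = log_or_ac - log_or_bc               # difference on the log scale
    se_ab = np.sqrt(se_ac**2 + se_bc**2)            # variances add
    return (np.exp(log_or_ab),
            np.exp(log_or_ab - 1.96 * se_ab),
            np.exp(log_or_ab + 1.96 * se_ab))

# Made-up example: drug A vs. placebo OR = 0.75, drug B vs. placebo OR = 0.90
or_ab, lo, hi = bucher_indirect(np.log(0.75), 0.08, np.log(0.90), 0.10)
print(f"Indirect OR, A vs. B: {or_ab:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```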
Prospective Meta-analysis
As for all SRMAs, a protocol specifying the research question in a prospective meta-analysis is critical; however, the key feature that differentiates this type of SRMA from a conventional meta-analysis is that protocol development and the identification of studies for inclusion precede the reporting of individual study results (66). With this type of SRMA investigators seek to minimize the pitfalls of publication bias and selective reporting of outcomes after knowledge of the results. Additionally, with the prospective inclusion of studies, efforts are made to enhance consistency between study designs and analytic plans. Some drawbacks of prospective meta-analyses include the longer duration and higher cost, as well as the high level of coordination, planning, and collaboration required, as is seen with individual-patient data meta-analysis (66). While the publication of prospective meta-analyses has increased over time, they remain relatively rare in the literature. However, with the emergence of the COVID-19 pandemic there was a sudden and urgent need for evidence regarding the prognosis and treatment of COVID-19. The explosion of clinical trials and prospective cohort studies occurring across the globe offered an ideal opportunity for the prospective meta-analysis, and thus several RCTs and prospective cohorts were designed with harmonization of treatment protocols, exposure definitions, and data analysis plans (67,68). More work is required to develop evidence-based reporting tools for these reviews.
Umbrella Reviews
The proliferation of the SRMA has sparked the need for an additional methodology to summarize findings across multiple SRMAs, called an umbrella review. Also called an “overview of reviews,” the umbrella review includes identification and compilation of available systematic reviews in an area of research. This methodology may be particularly helpful to summarize evidence where there are multiple interventions for the same condition, to examine the same intervention across different populations, or to examine adverse events related to a given intervention in different populations (69). It is also useful in assessing risk factors for diseases, with the goal of identifying those with robust evidence for an association (70). While transparent reporting is a cornerstone of SRMAs, studies examining the quality of reporting of umbrella reviews have revealed that insufficient reporting is commonplace (71–73). It was only recently that an evidence-based reporting guideline for umbrella reviews was published (74). The Preferred Reporting Items for Overviews of Reviews (PRIOR) statement includes a comprehensive 27-item checklist with 19 subitems recommended for the complete reporting of umbrella reviews (74).
Summary and Conclusions
High-quality SRMAs will remain an important and robust methodology to inform clinical practice and research. However, with the sheer number of published SRMAs it is not surprising that SRMAs of poor methodologic quality are all too frequent. A systematic review of diabetes-related SRMAs suggested several critical areas for quality improvement: adherence to guidelines for protocol development, more careful assessment of heterogeneity, and investigating risk of bias in individual studies and meta-analyses (9). For enhancement of the quality and trust of SRMAs, there are a number of key questions that authors, editors, and readers should ask in preparing or critically reviewing SRMAs (Table 1).
There are several commonly used checklists/tools available that should accompany SRMA submissions, and the choice among them will depend on the type of review and studies included. These include reporting checklists (e.g., PRISMA and Meta-analysis of Observational Studies in Epidemiology [MOOSE]) and reporting standards/tools (e.g., A MeaSurement Tool to Assess systematic Reviews [AMSTAR]). A summary of some common checklists is included in the sample resources presented in Table 2. Of note, these tools remain subjective, dependent on reviewer experience, bias, and subject matter expertise, and many fail to adequately capture key sources of bias and uncertainty across diverse research areas.
Diabetes research and clinical practice will continue to rely on SRMAs to synthesize important and growing bodies of evidence, underscoring the need for investigators, peer reviewers, and journal editors to uphold high standards of quality. Although guidelines of best practices for protocol development, literature search and screening tools, and tools for analysis and interpretation undoubtedly improve rigor and minimize biases, many aspects of the SRMA process are still susceptible to error, subjectivity, and bias. Therefore, continued scrutiny and vigilance are warranted for authors, editors, and readers in preparing or critically reviewing SRMAs to ensure reliability and integrity of the findings.
Article Information
Funding. This work was supported by National Institutes of Health grants DK127601 and HL60712.
Duality of Interest. No potential conflicts of interest relevant to this article were reported.
Author Contributions. D.K.T., S.P., J.M.Y., and F.B.H. wrote the manuscript, contributed to the discussion, and reviewed and edited the manuscript.