Genome-wide association studies that compare the statistical association between thousands of DNA variations and a human trait have detected 958 loci across 127 different diseases and traits. However, these statistical associations only provide evidence for genomic regions likely to harbor a causal gene(s) and do not directly identify such genes. We combined gene variation and expression data in a human cohort to identify causal genes.
Global gene transcription activity was obtained for each individual in a large human cohort (n = 1,240). These quantitative transcript data were tested for correlation with genotype data generated from the same individuals to identify gene expression patterns influenced by the variants.
Variant rs8050136 lies within intron 1 of the FTO gene on chromosome 16 and marks a locus strongly associated with type 2 diabetes and obesity and widely replicated across many populations. We report that genetic variation at this locus does not influence FTO gene expression levels (P = 0.38), but is strongly correlated with expression of RBL2 (P = 2.7 × 10−5), ∼270,000 base pairs distant to FTO.
These data suggest that variants at FTO influence RBL2 gene expression at large genetic distances. This observation underscores the complexity of human transcriptional regulation and highlights the utility of large human cohorts in which both genetic variation and global gene expression data are available to identify disease genes. Expedient identification of genes mediating the effects of genome-wide association study–identified loci will enable mechanism-of-action studies and accelerate understanding of human disease processes under genetic influence.
The identification of genes contributing to disease development in humans will provide medical benefits in two ways. First, the defining of causative risk predisposing genetic variation will allow improvements to be made to current clinical risk models for disease onset and progression. Second, knowledge of the gene's identity that mediates the biological consequences of this variation will improve understanding of the molecular pathways that participate in disease pathogenic processes. Genome-wide association studies have yielded 958 loci associated with 127 human diseases and traits over the past several years (1). Yet these associations, like many successful genetic linkage analysis studies that predate them (2), only identify a genetic region or loci and do not directly locate the causative variant(s). Without comprehensive sequence information to achieve better resolution, the associated variant may merely be in linkage disequilibrium with another untyped functional variant. Depending upon the population and sample design, the span of potential linkage disequilibrium may easily be as large as 500 kb (3) and potentially much larger (4). Alternatively, the observed associated variant may be directly functional, working to modulate transcriptional elements operating at large genetic distances. Without disease gene identifications, the benefits to medicine cannot be fully realized.
A major limitation to identification of the genes through which these loci cause disease has been the dearth of large-scale cohorts in which both gene expression and genotype data have been measured in the same individuals (5). In this study, we analyze the San Antonio Family Heart Study (SAFHS), where large-scale gene expression and variation measurements have been undertaken in the same individuals (6) and demonstrate its utility in identification of genes mediating the consequences of biological variation. We apply this approach to the FTO locus (Entrez GeneID: 79068) that was the most strongly associated region identified by genome-wide association study that predisposes to type 2 diabetes through an effect on body mass (7,,–10) and has been widely replicated (11). These data indicate that human transcriptional regulation is more complex than has been previously anticipated.
RESEARCH DESIGN AND METHODS
All study participants provided informed consent. The study and all protocols presented here were approved by the Institutional Review Board at the University of Texas Health Science Center at San Antonio.
This study was based on participants in the San Antonio Family Heart Study (12), which focuses on genetic aspects of cardiovascular health and disease in Mexican Americans.
Total RNA was isolated from lymphocyte samples from 1,240 individuals and gene expression measured by hybridization of labeled cRNA to Illumina Sentrix Human Whole Genome (WG-6) Series I BeadChips and scanning on the Illumina BeadArray 500GX Reader as described previously (6) and in the supplementary methods section, available in an online appendix at http://diabetes.diabetesjournals.org/content/full/db09-1277/DC1.
Identification of expressed transcripts.
To identify transcripts that exhibited sufficient quantitative expression in lymphocytes, the distribution of expression values for a given transcript was compared with the distribution of the expression values of the controls that are embedded in each chip. For each transcript, we performed a χ2 “tail” test of whether there was a significant excess of samples with values above the 95th percentile of the control null distribution. This test was used because it allows detection of even those transcripts that are clearly present above baseline levels in only a subset of individuals, but not detectable at these levels in most individuals. Using a false discovery rate of 0.05, we identified 20,413 transcripts that exhibited significant expression by this criterion (6).
Standardization of expression values.
To minimize the influence of overall signal levels, which may reflect RNA quantity and quality rather than a true biological difference among individuals, abundance values of all 20,413 retained transcripts were first standardized by z scoring within individuals (using decile percentage bins of transcripts, grouped by average log-transformed raw signals across individuals), followed by linear regression against individual-specific average log-transformed raw signal and its squared value. Lastly, for each transcript, we directly normalized these residual expression scores using an inverse Gaussian transformation across individuals, to ensure that the assumptions underlying variance component–based analyses were not violated. This conservative procedure results in normalized expression phenotypes that are comparable among individuals and across transcripts (6).
Microsatellite marker genotyping.
The Human MapPairs Genome-Wide Screening Set Versions 6 and 8 from Research Genetics (Huntsville, AL) were used. Genotyping occurred according to the manufacturer's instructions by PCR on lymphocyte-derived DNA samples from study participants. The PCR products were subsequently pooled into multiplex panels for genotype calling on an automated DNA sequencer (model 377 with Genescan 672 and Genotyper software programs; Applied Biosystems, Foster City, CA). Overall, 1,345 individuals were genotyped at 432 highly polymorphic microsatellite markers, distributed with average intermarker spacing of <10 cM across all 22 autosomes. The median distance between the detected autosomal transcripts and the nearest marker is ∼2.7 cM, and 80 and 90% of transcripts are within 4.3 and 5.5 cM from the nearest marker, respectively. These genotypes were subjected to extensive data cleaning that has been previously described (6).
Single nucleotide polymorphism genotyping.
The Illumina HumanHap550 Genotyping BeadChip was used to genotype single nucleotide polymorphisms (SNPs) in 858 individuals in whom gene expression data were available using the Infinium II Assay (Illumina, San Diego, CA). The Infinium II Assay combines a single-tube, whole-genome amplification method with direct, array-based capture and enzymatic scoring of the SNP loci. Locus discrimination is provided by a combination of sequence-specific hybridization capture and array-based, single-base primer extensions. A single primer is used to interrogate a SNP locus. The 3′ end of the primer is juxtaposed to the SNP site, and extension of the primer incorporates a biotin-labeled nucleotide (C or G) or a dinitrophenyl-labeled nucleotide (A or T). Signal amplification of the incorporated label further improves the overall signal-to-noise ratio of the assay. The Infinium II genotyping was performed using Illumina's standard protocols on a Tecan Freedom Evo Robot. After staining, each BeadChip was imaged on the Illumina BeadArray 500GX Reader using Illumina BeadScan image data acquisition software (Version 3.2.6). Genotype data were then assessed using the Genotyping Module of Illumina's BeadStudio software (Version 2.0).
Genotype data cleaning.
Multivariate methods generally require that individuals with missing data be excluded from the analysis, which can result in a significant decrease in sample size. To minimize this issue, we used likelihood-based imputation with the MERLIN computer package as has been previously described (6) to rapidly impute SNP data in pedigrees. For each pedigree, imputation is performed and the posterior genotypic probabilities for each missing genotype are stored. These posterior probabilities are used to construct an appropriately weighted covariate for each SNP that is then used in association analyses.
To eliminate the potential for hidden population stratification that the standard fixed-effect association approach is susceptible to, we also used the quantitative transmission disequilibrium test approach to association testing implemented in SOLAR (13). Although this approach exhibits substantially decreased power than measured genotype analysis (14), it maintains the appropriate significance levels under the null hypothesis, even in the presence of population stratification. However, because of its poor power, we used it only as a check to identify consistence of qualitative inferences drawn from measured genotype analysis.
Statistical genetic analysis: linkage and association.
All statistical analyses on related individuals were performed using variance component–based methodology and software package SOLAR Version 4.0 (15). As preparation for linkage analysis, the probabilities of multipoint identity-by-descent allele sharing among pairs of related individuals were computed by the Monte Carlo Markov chain multipoint approach implemented in software package Loki (16), using the genotypes at all linked markers jointly in the computations. Maximum likelihood marker allele frequencies were used as previously described (6). Covariates were included in the variance components framework as linear predictors of phenotype. Measured genotype analysis (17), embedded in a variance component–based linkage model, was used for association testing, assuming an additive model of allelic effect (i.e., the SNP genotypes AA, AB, and BB were coded as −1, 0, and 1, respectively, and used as a linear predictor of phenotype) (13).
Correction for multiple testing.
To correct for the effect of multiple testing for a given phenotype, we estimated the effective number of SNPs using the method of Li and Ji (18), which uses a modification of an earlier approach by Nyholt (19). After obtaining the effective number of SNPs, we used a modified Bonferroni procedure to identify a target α level that would maintain an overall significance level of ≤0.05.
Heritability analysis was performed under the classical approach decomposing the phenotypic variance into independent genetic and environmental components, assuming an additive model of gene action (“narrow sense” heritability) and expected kinship coefficients based on the observed intrafamilial relationships.
Expression profiles were obtained from the lymphocytes of 1,240 individuals in the SAFHS (6). The potential consequences of variants genotyped in this cohort were assessed by performing correlation analysis of the genetic variation using a measured genotype approach with the high-dimensional quantitative expression data.
The obesity-associated locus within the FTO gene is a 47-kb region of high linkage disequilibrium (LD) that spans part of intron 1, all of exon 2, and part of intron 2 of the FTO gene. Two SNPs in this region, rs8050136 and rs9939609, are in high LD (r2 > 0.95) with each other and have been strongly associated with type 2 diabetes (P = 5.2 × 10−8) (20) and obesity (P = 1.1 × 10−47) (11). We examined the association of rs8050136 with levels of transcripts up to 5 Mb in either direction of the SNP. No association with FTO gene expression was observed (P = 0.54; n = 854); however, a strong association was observed with the retinoblastoma-like 2 (RBL2; Entrez GeneID: 5934) transcript (P = 1.5 × 10−5; n = 854) that is located ∼270 kb proximal to FTO (Fig. 1).
To avoid missing the potential effects of untyped genetic variation and capture of the combined effects of all genetic variation in the region on RBL2 and FTO expression, we performed multipoint genetic linkage analysis on chromosome 16 using RBL2 and FTO transcript levels as expression quantitative traits. Linkage analysis was conducted on highly polymorphic simple tandem repeat markers that have been previously described in this cohort (12). We observed a nominally significant linkage signal for RBL2 expression (logarithm of odds [LOD] score = 0.997, P = 0.016; n = 1,240) but not FTO expression directly at the RBL2-FTO locus (Fig. 2), indicating that only RBL2 expression is likely to be affected by genetic variation in this region.
To identify the genetic variation in this region responsible for the observed linkage signal and to test whether rs8050136 was a contributor, we genotyped 136 SNPs spanning 706 kb in 858 of the same individuals in whom expression levels were measured. We observed strong association of several groups of variants with RBL2 expression using the measured genotype test, one of which included rs8050136 (P = 1.5 × 10−5; Fig. 3 and supplementary Table 1). Two other regions showed strong association, one located directly at the location of the RBL2 structural gene itself (rs8043918; P = 1.8 × 10−17), potentially affecting local promoter elements, and the second region 87.4 kb distal to rs8050136 at 52.5 Mb and within the FTO gene, but spanning introns 4 and 5 and exon 5 of FTO (rs7197983; P = 3.8 × 10−8). The associations remained significant after adjustment for multiple testing after allowing for LD (target α P = 4.9 × 10−4). To determine whether these associations were independent of background linkage, we undertook the quantitative transmission disequilibrium test, which is resistant to inflation of type I error rate in the presence of linkage (14). The following P values were observed for association with RBL2: rs8050136, P = 0.006; rs8043918, P = 6.6 × 10−5; and rs7197983, P = 0.027, suggesting that these associations are independent of background linkage (supplementary Table 1). The variants were also tested for association with FTO gene expression, but none remained significant after adjustment for multiple testing (supplementary Table 1).
The minor allele frequency (MAF) of rs8050136 (A allele) was 0.23 in the SAFHS, consistent with prior data in the Mexican American population (MAF = 0.20) (21) and lower than that observed in the CEPH (Utah residents with ancestry from northern and western Europe) cohort (MAF = 0.46) and Yoruba in Ibadan, Nigeria (MAF = 0.46), but higher than the Han Chinese in Beijing, China (MAF = 0.14), and Japanese in Tokyo cohorts (MAF = 0.18). We observed a similar 40-kb region of high LD (r2 > 0.5) around rs8050136 and rs9939609 that rapidly fell outside this interval as previously reported (10).
We performed conditional linkage analysis to determine whether rs8050136 contributed to the observed linkage signal for RBL2 gene expression. Incorporation of rs8050136 into the linkage model as a covariate reduced the LOD score from 0.997 to 0.201 (Fig. 4,B). Further addition of the two SNPs most strongly associated with RBL2 expression levels (rs8043918 and rs9921587; n = 852) reduced the LOD score to zero (Fig. 4 C), indicating that either these variants or those correlated with them account for the majority of the observed linkage signal, although the contribution of additional variants of small effect size cannot be ruled out.
The initial observation of association of rs8050136 and rs9939609 was with type 2 diabetes; however, after adjustment for BMI, the association was abrogated (10). We tested for association of rs8050136 with type 2 diabetes in the SAFHS and observed a nominal association (P = 0.050). Furthermore, and consistent with its initial discovery, the association was lost after adjustment for BMI (P = 0.20). The strength of this association is lower than that previously reported, likely due to the smaller sample size (n = 854) in the SAFHS cohort; however, the observed trend is consistent with prior published data. Although we observed no association between FTO and RBL2 expression levels with diabetes traits, the SNP most strongly associated with RBL2 expression (rs8043918) exhibited nominal association with diabetes status (P = 0.047) and homeostasis model assessment–insulin resistance (HOMA-IR) (P = 0.028) after adjustment for BMI.
The majority of the replicated genome-wide association studies have identified variants that lie in intronic or intergenic regions; however, follow-up on identification of genes mediating the consequence of biological variation has been problematic because candidate genes may lie hundreds of kilobases from the variant (1). Mapping expression quantitative trait loci have the potential to identify which genes mediate the effects of genetic variation. We used a large family-based cohort where both genotype and gene expression data are available from the same individuals to address the question of which genes juxtaposed to disease-associated loci confer the biological consequences of the variation. We examined the most strongly type 2 diabetes– and obesity-associated loci denoted by rs8050136 located in intron 1 of the FTO gene. We observed no effect of this variant on FTO gene expression levels, but observed a strong association with RBL2 gene expression located ∼270-kb proximal to FTO. Genetic linkage analysis in the same cohort revealed a modest linkage signal for RBL2 expression and none for FTO. Variants typed in this region, including rs8050136, were significantly associated with RBL2 expression and found to account for the majority of the linkage signal, whereas no variants were associated with FTO expression after adjustment for multiple testing. These data implicate RBL2 as the major contributor to the biological consequences of rs8050136 or genetic variation in LD with it.
The SNPs rs8050136 and rs9939609 that are in high LD (r2 > 0.95) have been genotyped in an extensive number of replication studies that predominantly confirm the original findings. Because these replication studies evaluate only disease association with genotype data, our results are not in conflict with this body of literature.
The lack of correlation of SNPs rs8050136 and rs9939609 with FTO mRNA expression has been previously reported in human muscle and adipose tissues in a number of studies. Our data are consistent with these reports. Despite the lack of correlation with these SNPs, some human studies have been able to identify relationships between FTO mRNA and measures of obesity. In a large cohort of female Scandinavians (n = 306), FTO mRNA was slightly higher in subcutaneous adipose tissue of obese (BMI >30 kg/m2) than in nonobese (BMI <25 kg/m2) individuals (22). A second large study of Danish twins (n = 496) similarly observed increased FTO mRNA with increased BMI (23), as did a smaller Spanish cohort (24). In contrast, however, a study reported decreased mRNA with increased BMI in a Scandinavian cohort (25). Mouse models of obesity show decreased FTO mRNA expression with increased obesity (26), but conflict with the study by Church et al. (27) on an N-ethyl-N-nitrosurea (ENU)–induced COOH-terminal mutant of FTO (mFTOI367F) in mice that exhibited reduced levels of FTO protein and decreased body and fat mass. Collectively, it appears that FTO may play a role in obesity, although the mechanism remains controversial. In humans, because the SNPs rs8050136 and rs9939609 consistently fail to correlate with FTO mRNA expression across multiple tissues, it is plausible that other genes such as the neighboring RBL2 may participate in mediating these effects and that downstream processes may have the potential to affect FTO gene expression indirectly.
We observed nominal association between rs8043918 and HOMA-IR (P = 0.028) and diabetes status (P = 0.047) after adjustment for BMI, suggestive of a potential role that RBL2 may play in obesity and diabetes. However, the observed strength of the association prior to adjustment for multiple testing indicates that the relationship should be considered preliminary at this stage. Consistent with this result, we note that in the original genome-wide association study, genetic variants juxtaposed to RBL2 were nominally associated with type 2 diabetes (P ∼ 1 × 10−4; 20). Differences in strength of these associations might be explained by different tiers of transcriptional control that operate independently and in different circumstances according to availability of transcription factors and temporal development. For example, the locus identified within intron 1 of FTO, although 270 kb distant, may play a specific role in control of RBL2 gene expression in a particular tissue type and/or in a specific stage of development that influences obesity. Conversely, genetic variation directly at the RBL2 locus potentially mediates their effects through interaction with an alternate set of transcription factors responding independently to internal programs and external stimuli and affect cellular properties that have little impact on development of obesity.
A limitation of this study is its reliance on expression from a single human tissue and currently a lack of knowledge regarding coordinate gene expression control among multiple tissues. It is unclear whether gene expression patterns observed in lymphocytes can be extrapolated to other tissues such as adipose and muscle; however, in prior work, valid inferences have been made by such extrapolations, particularly in the area of central nervous system disorders (28). Recent studies suggest, however, that variants affecting gene expression in a consistent manner among fibroblasts, B-cells, and T-cells are limited to ∼17% of genes for two or more tissue types (29). Additionally, work by Heintzman et al. (30) suggest that consistency among cell types is dependent upon the type of expression control region, with promoters and insulators showing highest between–cell type consistency and enhancer regions showing greatest divergence. Extrapolating these findings to the FTO-RBL2 locus is difficult because the control elements for the locus have not been fully delineated. Our data and that of others show that the lack of correlation between FTO SNPs and FTO gene expression was consistently observed across human muscle, adipose, and lymphocyte tissue types (22,,,–26,31), suggesting this control region may be among those conserved between cell types. Further investigation into genome-wide transcriptional analysis across tissues is warranted in this area, particularly in metabolic disease given the potential benefits to understanding of disease processes and development of biomarkers. The SNP rs8050136 creates a cut-like 1 (CUX1) transcription factor binding site (26) that may participate in mediating its biological effects; however, we note that CUX1 is broadly expressed across human tissue types including adipose, liver, muscle, and lymphoid (32).
Replication of expression datasets is an important consideration but difficult to achieve. Although large cohorts exist for replication of genetic variation analyses such as genome-wide association studies, there are far fewer large-scale cohorts that contain genome-wide expression data, making replication of expression-related findings problematic. This is further complicated by differences in tissue types that are relevant to a specific disease under investigation. As noted above, some genes share coordinate expression patterns between tissues, whereas others do not. Availability of transcription factors in different cell types as well as methylation patterns contribute to this diversity. To address this, we and colleagues are seeking to establish large-scale gene expression replication cohorts with a range of tissues for analysis, which in the future may allow expression replication studies to be undertaken.
RBL2 is a member of the retinoblastoma (RB) family of tumor suppressor genes. Like RB, RBL2 binds members of the DNA binding E2F transcription factor family that regulate transcription of many genes associated with cell cycle progression and cellular differentiation (33). Notably, RBL2 plays a role in preadipocyte proliferation and differentiation, specifically binding E2F in the preadipocyte quiescent nonproliferating and differentiated states (29).
Adipocyte number is a major determinant of fat mass in adults, and a recent study has shown that the number, which is set in childhood and adolescence, remains relatively constant over adult life (34). The A allele of rs8050136 (which is in high LD with the A allele of rs9939609) confers increased risk of obesity that is present by the early age of 11 years (10). We observed that the A allele of rs8050136 is strongly positively correlated with RBL2 expression levels, raising the possibility that increased levels of RBL2 may serve to limit clonal expansion of the preadipocyte population during development. A finite pool of adipocytes, with the environmental trigger of excess consumption of highly processed energy-dense food, may undergo cellular hypertrophy that is characterized by a hyperlipolytic state, resistance to insulin's antilipolytic effects, and increased nonesterified fatty acid flux (35). The effects of excess nonesterified fatty acids on the liver and periphery promote insulin resistance, loss of glucose homeostasis, and development of type 2 diabetes, providing a possible explanation as to why this locus was originally identified based on association with type 2 diabetes, which was dependent upon the presence of obesity.
In summary, we present data to suggest that the RBL2 gene may participate in mediating the biological consequences of genetic variation in the fat mass and obesity-associated locus within the FTO gene denoted by rs8050136.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
A generous donation from the Azar/Shepperd families paid for the transcriptional profiling. The statistical genetics computer package, SOLAR, was supported by a grant from the National Institute of Mental Health (MH059490). The supercomputing facilities used for this work at the AT&T Genetics Computing Center were supported in part by a gift from the AT&T Foundation. The laboratory work was conducted in facilities constructed with support from the National Center for Research Resources (RR013556). Data collection was supported by a grant from the National Heart, Lung, and Blood Institute (HL045222).
Additional funds for transcriptional profiling, sequencing, genotyping, and statistical analysis were provided by ChemGenex Pharmaceuticals. No other potential conflicts of interest relevant to this article were reported.
We are grateful to the participants in the San Antonio Family Heart Study.