Identifying the tissue-specific molecular signatures of active regulatory elements is critical to understand gene regulatory mechanisms. Here, we identify transcription start sites (TSS) using cap analysis of gene expression (CAGE) across 57 human pancreatic islet samples. We identify 9,954 reproducible CAGE tag clusters (TCs), ∼20% of which are islet specific and occur mostly distal to known gene TSS. We integrated islet CAGE data with histone modification and chromatin accessibility profiles to identify epigenomic signatures of transcription initiation. Using a massively parallel reporter assay, we validated the transcriptional enhancer activity for 2,279 of 3,378 (∼68%) tested islet CAGE elements (5% false discovery rate). TCs within accessible enhancers show higher enrichment to overlap type 2 diabetes genome-wide association study (GWAS) signals than existing islet annotations, which emphasizes the utility of mapping CAGE profiles in disease-relevant tissue. This work provides a high-resolution map of transcriptional initiation in human pancreatic islets with utility for dissecting active enhancers at GWAS loci.
Genome-wide association studies (GWAS) for complex diseases such as type 2 diabetes (T2D) have identified hundreds of signals associated with disease risk; however, most of these lie in non-protein-coding regions and the underlying mechanisms are still unclear (1). T2D GWAS variants are highly enriched to overlap islet-specific enhancer regions, which suggests that these variants affect gene expression (2–4). Many GWAS signals are marked by numerous single nucleotide polymorphisms (SNPs) in high linkage disequilibrium (LD), which makes identifying causal SNPs extremely difficult using genetic information alone.
For delineating regulatory elements, profiling histone modifications such as the enhancer-associated H3 lysine 27 acetylation (H3K27ac) (5,6) and the promoter-associated H3 lysine 4 trimethylation (H3K4me3) (6,7), among others, can be useful. However, the identified regions typically span hundreds of base pairs (bp). Profiling transcription factor (TF)-accessible chromatin regions can identify the functional DNA bases with these broad regulatory elements in pancreatic islets (1,4,8–12). Integrating other epigenomic data such as DNA methylation and chromatin looping has been valuable in identifying biological mechanisms (4,13,14). Transcription is a robust predictor of enhancer activity, and a subset of enhancers are transcribed into enhancer RNA (eRNA) (15,16). eRNAs are nuclear, short, mostly unspliced, 5′ capped, usually nonpolyadenylated, and usually bidirectionally transcribed (15,17,18). Therefore, identifying the location of transcription initiation can pinpoint active enhancer regulatory elements in addition to active promoters.
Genome-wide sequencing of 5′-capped RNAs with cap analysis of gene expression (CAGE) can detect transcription start sites (TSS) (15,17). CAGE can be applied on RNA samples from hard-to-acquire biological tissue such as islets and does not require live cells that are imperative for other TSS profiling techniques such as a variation of global run-on sequencing (GRO-seq) called GRO-cap (19–21). The functional annotation of the mammalian genome (FANTOM) project (22) has generated a CAGE expression atlas across 573 primary cell types and tissues, including the pancreas. However, pancreatic islets that secrete insulin and are relevant for T2D and related traits constitute only ∼1% of the pancreas tissue. Therefore, a pancreas TSS map may not accurately represent the islet TSS landscape. To date, there are no publicly available CAGE data sets for islet tissue. Here, we present a CAGE-based TSS map of pancreatic islets with enhancer validation using a massively parallel reporter assay (MPRA). Finally, we integrate our data with existing epigenomic data sets to reveal molecular signatures of noncoding islet elements and their role in T2D and related traits.
Research Design and Methods
Sample Collection and CAGE Library Preparation
We processed 71 human pancreatic islet samples obtained from unrelated organ donors (Supplementary Table 1) received from the Integrated Islet Distribution Program, the National Disease Research Interchange, and Prodo Laboratories. We prewarmed islets to 37°C in shipping media for 1–2 h before harvest. Total RNA from 2,000–3,000 islet equivalents was extracted and purified with Trizol (Life Technologies). RNA quality was confirmed with Bioanalyzer 2100 (Agilent); samples with RNA integrity number >6.5 were prepared for CAGE sequencing. We sent 1 μg total RNA per sample to DNAFORM (Kanagawa, Japan), where CAGE libraries were generated. The library preparation included polyA-negative selection and size selection (<1,000 bp) in an attempt to enrich for the short and nonpolyadenylated eRNA transcripts. Stranded CAGE libraries were generated for each islet sample with use of the no-amplification nontagging CAGE libraries for Illumina next-generation sequencers (nAnT-iCAGE) protocol (23). Each islet CAGE library was barcoded and was pooled into 24-sample batches and sequenced over multiple lanes of HiSeq 2000. All procedures followed ethics guidelines of the National Institutes of Health (NIH).
CAGE Data Processing
We trimmed adapter sequences and mapped the reads to hg19, performed unique molecular identifier–based deduplication, and identified TSS. We selected 57 islet samples with strandedness measures >0.85 calculated from Quality of RNA-seq Tool-Set (QoRTS) (24) for all downstream analyses. We identified tag clusters (TCs) in each sample in a strand-specific manner using paraclu (25), allowing single bp TCs (“singletons”) if supported by more than two tags. We identified a “consensus” set of reproducible islet TCs by merging TCs on each strand across samples and retaining segments supported by a conservative threshold of 10 samples (Supplementary Fig. 1). We then filtered out regions blacklisted by the Encyclopedia of DNA Elements (ENCODE) consortium. The TC coordinates for the selected threshold and a more lenient threshold of 5 are shared in Supplementary Table 2.
We downloaded the FANTOM CAGE-TSS data for 118 tissues (https://fantom.gsc.riken.jp/5/datafiles/latest/basic/human.tissue.hCAGE/) (22) and called TCs using paraclu (25) with the same parameters as described above.
Overlap Enrichment Between Annotations
Enrichment for overlap between islet TCs and various annotations was calculated with the Genomic Association Tester (GAT) tool (26). GAT randomly samples segments from the genomic workspace and computes the expected overlaps. We used 10,000 GAT samplings for each enrichment run and obtained empirical P values.
Experimental Validation Using MPRA
We generated a barcoded plasmid library of N = 7,188 islet CAGE elements (198 bp flanked by 16 bp anchors) to test in the MPRA. We electroporated 50 μg of library into 25 million 832/13 rat insulinoma cells in three biological replicates, harvested the cells 24 h later, and isolated total RNA. We mapped the bar codes corresponding to each CAGE element in the MPRA plasmid using PCR and sequencing. We sequenced the input DNA bar code library along with three cDNA barcode libraries. We quantified bar code counts while accounting for sequencing errors using the sequence clustering algorithm Starcode (https://github.com/gui11aume/starcode) (27) and removed PCR duplicates using the unique molecular identifier (https://github.com/parkerlab/starr-seq-analysis-pipeline). We selected N = 3,446 quantifiable CAGE elements, which had at least two bar codes each with at least 10 DNA counts. We used MPRAnalyze (version 1.3.1) (https://github.com/yoseflab/mpranalyze) (28) to model DNA and RNA counts in negative binomial generalized linear models to quantify enhancer activities. To estimate the null in our experiment, within MPRAnalyze, we conservatively assume that the mode of the distribution of transcription activity estimates is the center of the null distribution. Therefore, values lower than the mode are used to estimate the null variance.
Least Absolute Shrinkage and Selection Operator Regression
We modeled the CAGE element MPRA z scores as a function of TF motif occurrences within the element using least absolute shrinkage and selection operator (LASSO) regression. We identified 540 nonredundant motifs (Supplementary Information) and scanned for these in the hg19 reference using Find Individual Motif Occurrences (FIMO) (29). For each CAGE element, we considered the inverse-normalized (RNOmni package, version 0.7.1) FIMO −log10 (P value) of each motif occurrence as motif “scores.” We again inverse normalized the scores for each TF motif across the CAGE elements so that the regression coefficients would be comparable across motifs. The LASSO regression was run with use of the glmnet package (v2.0-16) with parameter α = 1.
Functional GWAS Analyses and Fine Mapping
We used fgwas (version 0.3.6) (30) to compute enrichment of GWAS and expression quantitative trait loci (eQTL) data in TC-related and other annotations. We obtained summary data for T2D GWAS (1) and islet eQTL (31) and lymphoblastoid cell line eQTL (32) and organized summary statistics as required by fgwas. For eQTL data, we selected SNP-gene associations for eGenes identified at 1% false discovery rate (FDR) and included a unique “SEGNUMBER” for each eGene. We used fgwas with default parameters for enrichment analyses and included the “-fine” flag for eQTL analyses.
We performed conditional analyses using the “-cond” option where the enrichment parameters for the first annotation were modeled and fixed the maximum likelihood values. An additional parameter for the second annotation was included and estimated.
To reweight GWAS summary data based on functional annotation overlap, we used the -print option while including multiple annotations in the model that were individually enriched or depleted. We included islet active TSS, active enhancer, quiescent and polycomb repressed chromatin states, and Assay for Transposable-Accessible Chromatin with high-throughput sequencing (ATAC-seq) peaks with or without TCs.
Data and Resource Availability
We submitted islet CAGE data to the database of Genotypes and Phenotypes (dbGaP) (phs001188.v2.p1) and MPRA data to Gene Expression Omnibus (GEO) (GSE137693). A UCSC Genome Browser session is available from https://genome.ucsc.edu/s/arushiv/cage_2021. Scripts are shared on GitHub (https://github.com/ParkerLab/islet_cage), and the processed data files are at Zenodo (https://zenodo.org/record/3524578).
The CAGE Landscape in Human Pancreatic Islets
We performed CAGE in 71 human pancreatic islet total RNA samples obtained from unrelated organ donors (Supplementary Table 1). Selecting 57 high-quality sample’s, we identified a consensus set of 9,954 reproducible TCs (median length of 176 bp) (Supplementary Fig. 2 and Supplementary Table 2), spanning a total genomic territory of ∼2.4 Mb. As a resource, Supplementary Table 3 includes the islet TC identified to be the closest to a known gene TSS (GENCODE Human Release 19 [GENCODE V19]) (33). To explore the chromatin landscape underlying islet TCs, we overlaid publicly available chromatin immunoprecipitation sequencing data for five histone modifications (Supplementary Table 4) integrated into 11 distinct chromatin states using ChromHMM (34) (Supplementary Fig. 3 and Supplementary Information), along with bulk and single nucleus ATAC-seq data in islets (10,12). Figure 1A shows an example islet TC in the intronic region of the ST18 gene that overlaps the islet active TSS chromatin state and an ATAC-seq peak. Importantly, this region does not overlap any annotated TSS on the basis of conservative definitions from coding/noncoding/pseudogene genes in both GENCODE V19, the official hg19 release, and GENCODE V33 lifted over to hg19 (V33lift37). The regulatory activity of this element was validated by the VISTA Enhancer Browser in an in vivo reporter assay in mouse embryos (35).
We next compared our islet CAGE data with FANTOM CAGE data available for 118 human tissues (22). Islet TCs showed the highest overlap with pancreas (Supplementary Fig. 4). Approximately 20% of islet TCs were unique to islets (N = 1,974 with no overlap in any FANTOM tissue), whereas ∼60% of islet TCs were shared across ≥60 FANTOM tissues (Fig. 1B). With categorizing of islet TC segments by the number of FANTOM tissues in which they overlap TCs (colored bars in Fig. 1B), islet-specific TCs (0 overlap with FANTOM) occurred farthest from known TSS (Fig. 1C). We highlight an example locus where an islet TC in the AP1G2 gene occurs in active TSS chromatin states across multiple tissues and overlaps shared ATAC-seq peaks in islet and the lymphoblastoid cell line GM12878 (36) (Fig. 1D, blue box). TCs across FANTOM tissues are identified in this region (Fig. 1D, FANTOM TCs track). The islet TC segment (Fig. 1D, blue box) overlaps TCs in 88 FANTOM tissues. Another islet TC ∼34 kb away, however, occurs in a region lacking gene annotations and overlaps a more islet-specific active enhancer chromatin state and ATAC-seq peak (Fig. 1D, orange box). This region was not identified as a TC in any of the 118 analyzed FANTOM tissues. At other islet-relevant loci such as the potassium channel subfamily K gene KCNK16 TSS, we observe TCs in islets but not in any other FANTOM tissues (Supplementary Fig. 5). Collectively, these results highlight that CAGE profiling in islets identifies islet-specific sites of transcription initiation, including at TSS-distal enhancers.
We computed the enrichment of islet TCs in islet annotations such as chromatin states and ATAC-seq peaks (identified in bulk islets and in islet α- and β-cells [10,12]) and other “common” annotations including annotations aggregated across multiple cell types using GAT (26). Islet TCs were highly enriched to overlap islet active TSS chromatin states (fold enrichment = 69.72, P value = 1e−04) (Fig. 1E and Supplementary Table 5), as expected, since the transcription initiation sites would likely resemble the “active TSS” chromatin state. TCs identified in FANTOM tissues that also had publicly available chromatin data (37) were also overwhelmingly enriched to overlap active TSS chromatin states in the corresponding tissue, which demonstrates how our protocol yielded CAGE profiles comparable with existing data (Supplementary Fig. 6). Islet-specific TCs were more enriched for islet active enhancer chromatin states (Supplementary Fig. 7). Islet TCs were enriched in bulk islet and islet α- and β-cell ATAC-seq peaks (for all three annotations, fold enrichment >37.58, P value = 1e−04) (Fig. 1E), signifying that the identified transcription initiation sites constitute TF-accessible chromatin.
Aggregated CAGE signal over ATAC-seq narrow peak summits highlighted a bidirectional pattern of transcription initiation flanking the ATAC-seq peak summit (Fig. 2A). Conversely, anchoring in the islet TC centers showed that the ATAC-seq signal summit lies upstream (relative to CAGE strand) of the TC center (Fig. 2B). Islet TF footprint motifs (binding sites supported by islet ATAC-seq data and TF DNA-binding motifs) (10) were more enriched to overlap the 500 bp TC upstream region compared with the 500 bp downstream region relative to TCs (Fig. 2C and Supplementary Table 6). These observations show that, as expected, the region just upstream of the TC is highly accessible where more TF binding events occur and indicate the high quality of our islet TC map.
We next compared the characteristics of TCs that occur in accessible regions of two main regulatory classes: promoters and enhancers. We considered TCs in ATAC-seq peaks in promoter (active, weak, or flanking TSS) versus enhancer (active, weak, or genic enhancer) chromatin states either 5 kb proximal or distal from the nearest protein-coding genes (GENCODE V19). We then explored the chromatin landscape at these regions across 98 Roadmap Epigenomics cell types using the 18-state “extended model” (37). TSS proximal islet TCs in accessible promoter states (N = 7,064 segments) were nearly ubiquitously identified as promoter states across Roadmap Epigenomics cell types (Fig. 2D, left). A subset of TSS distal islet TCs in accessible islet promoter states (N = 443 segments) were more specific for pancreatic islets (Fig. 2D, right). In contrast, islet TCs in accessible islet enhancer states, both proximal (N = 254 segments) and distal (N = 289 segments) to known gene TSS, more specifically overlapped enhancer states in islets (Fig. 2E). Such specificity was not observed for whole pancreas (Fig. 2D and E), which highlights differences in the chromatin architecture underlying islet TCs in islets versus pancreas.
Footprint motifs for the regulatory factor X (RFX) TF family were enriched to overlap both enhancer and promoter states; however, the fold enrichment in enhancers was considerably higher in comparison with promoters (for five different motifs: enhancer, >4.0-fold; promoter, 1.3- to 1.5-fold; P value = 1e−4) (Fig. 2F and Supplementary Table 7). TCs in accessible promoter regions were highly enriched to overlap footprint motifs of the E26 transformation-specific (ETS) TF family (Fig. 2F). We observed divergent aggregate CAGE profiles over TF footprint motifs enriched in enhancers, e.g., RFX5_known8 footprint motifs in 5 kb TSS distal regions and ELK4_1 motifs (Fig. 2G and H). These results highlight the characteristics of transcription initiation sites based on the underlying chromatin context.
Experimental Validation of Transcribed Regions
We experimentally validated the enhancer activity of islet CAGE-profiled regions. Self-transcribing active regulatory region sequencing (STARR-seq) is an MPRA technique where candidate elements are cloned downstream of the core promoter into a reporter gene’s (e.g., GFP) 3′-untranslated region, and enhancer activity of the elements leads to reporter mRNA transcription harboring the candidates’ sequences (38–40). We generated a library of 7,188 candidate CAGE elements (198 bp each) and used a modified MPRA approach, cloning the elements 3′ to the GFP polyA signal and cloning a random 16-bp bar code into the GFP 3′ region so that each candidate enhancer element is represented by multiple transcribed barcodes. We transfected the MPRA libraries into the rat β-cell insulinoma (INS1 832/13) cell line in triplicate, extracted DNA and RNA, and sequenced the bar codes as the reporter readout. After quality control procedures (Research Design and Methods) we identified 3,378 quantifiable CAGE elements. We observed high correlations between the normalized sum of RNA counts of the CAGE element bar codes across the three biological replicates (Pearson r = 0.97) (Supplementary Fig. 8). We modeled the RNA and DNA bar code counts in generalized linear models (Supplementary Table 8) and observed that ∼68% (N = 2,279) of the quantifiable CAGE elements showed significant enhancer activity (5% FDR) (Fig. 3A, top), a large fraction of which occurred in promoter states (Fig. 3A, bottom). CAGE elements in promoter states showed higher MPRA activity compared with the elements in enhancer states (Wilcoxon rank sum test P = 1.02 × 10−6) (Fig. 3B). CAGE elements overlapping ATAC-seq peaks showed higher enhancer activities than elements not in ATAC-seq peaks (Wilcoxon rank sum test P = 5.50 × 10−16) (Fig. 3C), and elements 5 kb proximal to protein-coding gene TSS showed higher enhancer activities then TSS distal elements (Wilcoxon rank sum test P = 5.38 × 10−9) (Fig. 3D). These results are consistent with results of a recent MPRA study in GM12878 (41).
We next aimed to identify the biological-relevant sequence-based features of active CAGE elements by modeling MPRA enhancer activity as a function of TF motif instances using linear regression. Since many TF motifs are correlated, we used the LASSO procedure, which shrinks some regression coefficients to zero, resulting in a simpler model. We modeled CAGE element MPRA z scores on TF motif scores in the element (Fig. 3E and Supplementary Table 9). TF motifs from the ETS family showed positive LASSO coefficients, indicating that these sequence elements are associated with high enhancer activity. These motifs were also enriched to occur in TCs in accessible promoter regions (Fig. 2F). NRF-1 motif showed a positive coefficient; β-cell–specific Nrf1-knockout mice have shown decreased glucose-stimulated insulin secretion (42). TF motifs with negative LASSO coefficients such as ZBTB16 and GZF1 have been shown to act as repressors (43,44). In Fig. 3F, we highlight an islet TC overlapping an islet ATAC-seq peak, active TSS, and enhancer states for which we tested three tiled elements. All three elements showed significant transcriptional activity in our assay (z score >2.94, P values <0.001). Overall, there was a significant positive correlation (Pearson r = 0.64, P = 1 × 10−9) between TF motif LASSO coefficient and TF footprint motif enrichment in TCs, indicating a strong correspondence between CAGE TC profiling and active enhancer activity measured from the MPRA (Supplementary Fig. 9).
TCs Augment Functional Annotations in GWAS Fine Mapping
We asked whether islet TCs supplement our understanding of T2D GWAS (1) or islet eQTL (31). We classified genomic annotations as 1) chromatin states, 2) accessible regions within the chromatin states, and 3) TCs in accessible regions within the chromatin states. TCs in accessible enhancers were highly enriched for T2D GWAS loci, with use of the Bayesian hierarchical model in fgwas (30) (Fig. 4A, left, and Supplementary Table 10) and logistic regression in GWAS analysis of regulatory or functional information enrichment with LD correction (GARFIELD) (45) (Supplementary Fig. 10). TCs in accessible enhancers were highly enriched to overlap islet eQTL (Fig. 4A, right) but not eQTL in unrelated lymphoblastoid cell lines (32) (Supplementary Fig. 11).
TCs showed higher conditional enrichment over enhancer states to broad and length-matched ATAC-seq peaks (Fig. 4B and Supplementary Fig. 12) for T2D GWAS and higher conditional enrichment over enhancer and promoter states versus broad and length-matched ATAC-seq peaks for islet eQTL (Fig. 4B and Supplementary Fig. 12). Functional reweighting of T2D GWAS (1) with islet chromatin states, ATAC-seq peaks, and TCs in fgwas resulted in higher maximal SNP posterior probability of association (PPA) at many loci compared with maximal SNP PPAs from genetic fine mapping alone (Supplementary Fig. 13), consistent with other studies (1,30). Including TCs along with chromatin states and ATAC-seq achieved higher maximal reweighted SNP PPAs than chromatin state and ATAC-seq data, suggesting that TCs add valuable information in fine mapping (Fig. 4C and Supplementary Table 11). We highlight one such GWAS locus named LCORL (lead SNP rs12640250, P value = 3.7 × 10−8). The 99% genetic credible set at this locus includes 74 variants (1), with lead SNP rs12640250 PPA = 0.15 (Supplementary Fig. 14A). Functional reweighting using islet TCs, chromatin states, and ATAC-seq peaks resulted in 44 SNPs in the 99% credible set, where rs7667864 (genetic PPA = 0.12, LD r2 0.97 with the lead GWAS SNP) obtained the maximum reweighted PPA = 0.62 (Supplementary Fig. 14B). This SNP overlaps an ATAC-seq peak and a TC in islets (Fig. 4D and E). The eQTL lead SNP rs2074974 (genetic PPA = 0.026, LD r2 = 0.96 with lead GWAS SNP) occurs upstream of the TC and overlaps the ATAC-seq peak and obtained a reweighted PPA = 0.096 (Fig. 4E). An element overlapping this TC showed significant activity in our MPRA (z score = 18.48, P value = 1.56 × 10−76), and several TF motifs that showed positive MPRA LASSO regression coefficients also occur in this region (Fig. 4E). These analyses demonstrate that transcription initiation sites demarcate active regulatory elements in islets, and this information can be useful in fine mapping and prioritizing GWAS variants.
Our work shows that islet CAGE TCs mark active, specific, and relevant islet regulatory elements. A large proportion of TCs overlapped the active TSS chromatin state. Using an MPRA, we validated the enhancer activity of 2,279 CAGE elements. Our results show that sequences associated with native promoter chromatin landscapes can show strong enhancer activity when cloned downstream of a reporter gene in an episomal MPRA paradigm.
Several ETS family footprint motifs were highly enriched in transcribed and accessible promoter regions, and these motifs were also strong predictors of the elements’ activity in the MPRA. ETS family TFs are found in all metazoans and contain the conserved ETS DNA-binding domain and can recruit acetyl transferases or deacetylases to modulate transcription (46). The regulatory potential of ETS motifs has been described before in MPRAs (47). A previous islet eQTL study demonstrated that for eQTL SNPs (eSNPs) occurring in ETS footprint motifs, the preferred bases in the motifs were significantly more often associated with increased expression of the target gene (31). RFX footprint motifs were highly enriched to overlap transcribed and accessible enhancer regions. RFX TFs contain the X-box DNA-binding motif and are involved in cellular specialization and terminal differentiation (48). T2D GWAS risk alleles were previously shown to confluently disrupt RFX footprint motifs (10). The concordance of our findings with these orthogonal studies highlights the robustness of our islet TC map.
A small fraction of TCs (0.4%) overlapped with the enhancer chromatin states. Since gene-distal transcripts are more unstable, some enhancers may be actively transcribed but fall below the limits of detection of CAGE. It is plausible that CAGE profiling using total RNA from whole islet preps, as we have performed, would comprise more stable promoter-associated RNA transcripts and have a lesser representation of weaker transcripts originating from enhancer regions. Recent technologies such as native elongating transcript-cap analysis of gene expression (NET-CAGE) show promise in more efficiently identifying more unstable transcripts from fixed tissues (49). We note that CAGE-based enhancer calls represent only the most transcriptionally active subset of enhancers in the genome. The Roadmap Epigenomics Consortium used DNase I hypersensitivity sequencing and histone modification chromatin immunoprecipitation sequencing to identify 2,328,936 enhancers across 127 cell types (37), whereas the FANTOM5 Consortium in their extensive catalog of CAGE enhancers identified 43,011 enhancers across 808 CAGE libraries (432 primary cell, 135 tissue, and 241 cell lines) (15). CAGE profiling therefore has several advantages and limitations when compared with other epigenomic modalities. While CAGE identifies transcription initiation at bp resolution, the technique can be limited to a subset of most active elements. Alternatively, integrating three-dimensional chromatin interaction data with other epigenomic profiles can identify active regulatory elements; however, the resolution is generally limited.
Previously, we showed that genetic variants in more cell type–specific enhancer regions have lower effects on gene expression than the variants occurring in more ubiquitous promoter regions (31,50). This finding is consistent with our observation that enhancer chromatin states comprised a smaller proportion of active transcription initiation sites and lower enhancer activities relative to promoter chromatin state regions. The basal transcription initiation landscape could change under stimulatory conditions where relevant enhancers help orchestrate a response.
Our work demonstrates that islet CAGE elements can help GWAS fine mapping in addition to other relevant epigenomic information such as chromatin states and chromatin accessibility. Identifying target genes remains a challenging task where overlaying dense eQTL maps and correlating transcription initiation in enhancers with gene TSS while also leveraging chromatin conformation data would be useful in future studies.
This article contains supplementary material online at https://doi.org/10.2337/figshare.14394707.
Acknowledgments. The authors thank Sally A. Camper, Mats Ljungman, Cristen J. Willer, and Parker laboratory members (University of Michigan) for their feedback.
Funding. The authors acknowledge support from the University of Michigan Rackham Predoctoral Fellowship (to A.V.); National Human Genome Research Institute, NIH, grant T32 HG00040 (to P.O.); American Diabetes Association (ADA) grant 1-18-ACE-15 (to M.L.S.); NIH intramural support from project ZIA-HG000024 (to F.S.C.); ADA Pathway to Stop Diabetes grant 1-14-INI-07; and National Institute of Diabetes and Digestive and Kidney Diseases, NIH, grants 1UM1DK126185-01 and R01 DK117960 (to S.C.J.P.).
Duality of Interest. No potential conflicts of interest relevant to this article were reported.
Author Contributions. A.V. designed and performed analyses and wrote and edited the manuscript. Y.K. performed the MPRA experiments. C.W. performed the LASSO regression. M.R.E. processed the islet samples. N.N., R.D.A., P.O., and J.O.K. contributed to analyses. Y.K., V.R.E., C.W., R.D.A., and P.O. wrote sections of the manuscript. M.L.S., F.S.C., J.O.K., and S.C.J.P. contributed to designing the study. R.D.A., P.O., F.S.C., and S.C.J.P. edited the manuscript. S.C.J.P. supervised all aspects of the study. S.C.J.P. is the guarantor of this work and, as such, had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.