DNA sequences that regulate expression of the insulin gene are located within a region spanning ∼400 bp that flanks the transcription start site. This region, the insulin promoter, contains a number of cis-acting elements that bind transcription factors, some of which are expressed only in the β-cell and a few other endocrine or neural cell types, while others have a widespread tissue distribution. The sequencing of the genomes of a number of species has allowed us to examine the manner in which the insulin promoter has evolved over a 450 million–year period. The major findings are that the A-box sites that bind PDX-1 are among the most highly conserved regulatory sequences, and that the conservation of the C1, E1, and CRE sequences emphasizes the importance of MafA, E47/β2, and cAMP-associated regulation. The review also reveals that of all the insulin gene promoters studied, the rodent insulin promoters are considerably dissimilar to the human, leading to the conclusion that extreme care should be taken when extrapolating rodent-based data on the insulin gene to humans.
The cloning and sequencing of the insulin gene in 1980 (1) was a landmark breakthrough that opened up a new field of research on the mechanisms controlling expression of the gene. This in turn led to the discovery of transcription factors that, in addition to regulating the insulin gene in a tissue-specific and temporal manner, participated in the development of the endocrine pancreas and in the maintenance of islet cell function (2). Some of these transcription factors have been identified as maturity-onset diabetes of the young (MODY) genes (3), and at least one has been associated with type 2 diabetes. Their use in the development of novel therapies for diabetes based on the differentiation of embryonic or adult stem cells toward a β-cell–like phenotype (4) and the forced expression of endogenous insulin genes in nonislet cells (5–7) has also been exploited.
The early work on characterizing the DNA sequences involved in regulating insulin gene expression focused on the rat insulin 1 gene (8). The reason for this was that at the time there were no available human β-cell lines and it was felt important to correlate data from transfected promoter constructs with effects on the endogenous insulin gene. As it turned out, most of the studies involved transfecting the rat insulin gene constructs in the Syrian hamster HITm2.2.5 cell line, which transfected much more efficiently using available techniques than the rat RINm5F cell line. There was also a perception that human insulin promoter constructs would not function in transfected rodent cells. However, these worries proved to be unfounded after it was later shown that there is a very high degree of sequence and functional conservation within the transcription factors that regulate the gene (e.g., 89% identity between rat and human PDX-1) and the human insulin promoter exhibited the expected pattern of activity in transgenic mice (9,10). As a result of the decision to concentrate on a detailed analysis of the rat insulin promoter, most of the literature on the insulin promoter pertains to this promoter.
The structure and evolution of the insulin gene have been previously reviewed (11). In this article, we focus on the sequences that lie upstream of, or flank, the transcription start site and are known to affect transcription of the gene. One major conclusion is that the rodent promoters are markedly different from the human promoter, and we urge caution in extrapolating data from rodent promoter studies to the etiology and therapy of diabetes.
INSULIN GENE EXPRESSION
Humans, in keeping with the overwhelming majority of species, have a single copy of the insulin gene, which is located on chromosome 11 (p15.5) (12). Of the small number of species with two nonallelic insulin genes, the best known are Xenopus laevis (13) and the popular laboratory rodents, rat (14) and mouse (15), in which insulin 2 corresponds to the single copy found in most animals.
In the adult, insulin is expressed almost exclusively in the β-cells of the pancreatic islets of Langerhans (16), hence its name from Latin insula or “island.” Low levels of extrapancreatic insulin have been detected in a number of other tissues (17,18) including brain (19), thymus (20–22), lachrymal glands (23), and salivary glands (24). The role of insulin expression in non-β-cells is unclear. In some tissues it may play a role in the complex hormonal communication required for the maintenance of overall energy balance (25,26) or in the establishment of immune tolerance (27). Very little is known about the regulatory sequences that control insulin gene expression in nonpancreatic tissue, although the sequence containing variable numbers of tandem repeats (see later) has been implicated in thymus expression of insulin.
In the β-cell, sophisticated mechanisms have evolved to control insulin expression at the correct time and place during embryonic development. In the adult, related mechanisms and a variety of signaling pathways are involved in restricting insulin expression to β-cells (notwithstanding the low-level extrapancreatic expression about which little is known) and in coordinating insulin expression in response to diverse afferent signals (16). Positive and negative crosstalk between the various signaling pathways, formation of homo- or heterodimers permitting individual transcription factors to act as activators, nonactivators, or repressors, reversible phosphorylation of transcription factors, multiple isoforms of several transcription factors, and synergistic interactions between certain combinations of transcription factors extend the gamut of signals influencing the regulation of insulin gene expression.
Insulin transcriptional control is conferred by cis-acting regulatory sequences believed to be located within 300–400 bp of the transcription start site (28), which bind β-cell restricted and ubiquitous transcription factors (16). The principal regulatory elements within the human insulin promoter are outlined in Fig. 1. The compact nature of the insulin promoter results in the close proximity of regulatory elements that can bind an extensive range of factors, thereby permitting a multiplicity of outcomes through additive and synergistic interactions between the bound proteins (29–31). In addition, regulatory elements can overlap in certain species (e.g., the A3 site and a cAMP response element [CRE] site in humans), introducing another layer of complexity through binding competition between alternative transcription factors.
MULTIPLE-SPECIES COMPARISON OF INSULIN PROMOTERS
There is no general approach to interpreting and predicting transcriptional evolution; however, the insulin promoter is one of the most extensively studied, and knowledge of the signals that bear upon insulin transcriptional regulation facilitates our understanding of possible functional consequences of insulin promoter evolutionary differences. By classic convention, the sequences that regulate basal promoters were divided into two classes. These are upstream regulatory elements (UREs) that are often located within 100–200 bp upstream of the site of initiation and display directional qualities, and enhancers that can function over distances of many kilobase pairs, regardless of orientation or whether they lie upstream or downstream of the start site. However, as more promoters and enhancers have been identified and studied, it has become apparent that there is a continuum between these two classes of regulatory elements with promoter and enhancer motifs sharing many physical and functional traits. Therefore, in keeping with current opinion, we have reviewed the cis-regulatory elements within the compact insulin promoter without further categorization.
This review has drawn upon publicly available DNA sequences to compare the human insulin promoter sequence (−1,500 to +100) to the insulin promoters in an evolutionarily and taxonomically divergent range of species. Definitive identification of insulin genes and their promoters lags well behind the isolation of the corresponding cDNA sequences; hence, care has been taken to ensure that only unambiguous insulin promoters have been included. These belong to human (Homo sapiens), great apes (chimpanzee [Pan troglodytes], orangutan [Pongo pygmaeus], and gorilla [Gorilla gorilla]), Old World monkeys (African green monkey [Cercopithecus aethiops] and rhesus macaque [Macaca mulatta]), New World monkey (owl monkey [Aotus trivirgatus]), rodents (rat [Rattus norvegicus] and mouse [Mus musculus]), mammals with diverse diets (carnivorous dog [Canis familiaris], herbivorous cow [Bos taurus], and omnivorous pig [Sus scrofa]), bird (chicken [Gallus gallus]), and fish (zebrafish [Danio rerio]). The promoter sequences of gorilla, orangutan, African green monkey, and owl monkey are currently incomplete, extending upstream to positions −295, −290, −426, and −510, respectively. The phylogenetic relationships based on molecular analyses (32,33) between these species are outlined in Fig. 2.
A preliminary evaluation of the relatedness of homologues can be generated from the number and relative position of introns, and these are shown in Fig. 3 (34). There are minor variations in the sizes of the introns among mammals, while large dissimilarities are seen in the introns of chicken and zebrafish. The insulin 1 genes of rat and mouse have lost the second intron and also contain the remnant of a polydeoxyadenylic acid tract preceding the downstream direct repeat. Together, these structural features have led to the suggestion that the insulin 1 gene is a functional transposon (14) that was generated by an RNA-mediated duplication-transposition event involving a transcript of the insulin 2 gene that was initiated upstream from the normal capping site. This duplication-transposition event clearly preceded separation of rat and mouse 15 million years ago. Along this evolutionary road, additional divergence has taken place, resulting in rat having the two insulin genes residing about 55 Mbp apart on chromosome 1, whereas in the mouse they lie on different chromosomes, namely 6 and 7.
Synteny (i.e., the preserved order of genes between organisms) provides an expedient higher-level assessment of the association between homologues. The identification and annotation of genes in most genomes remains fragmentary; however, it is clear from currently available data that all of the studied insulin genes display remarkable synteny extending all the way back to zebrafish, which diverged from humans 450 million years ago. Not only are the immediate upstream and downstream flanking genes of tyrosine hydroxylase (TH) and insulin-like growth factor 2 (IGF-2) retained, but inspection of ∼500 Mbp confirms extensive maintenance of synteny of many important genes including syt8, lsp1, tnnt3, mrpL23, cd81, and tssc4. While gene order and direction of transcription are preserved, the spacing between specific genes can vary. This is most dramatically illustrated with the insulin and TH genes, which are separated by 2–22 kbp in all species except mouse and rat, where the insulin 2 gene lies ∼210 and 230 kbp distant from the TH gene, respectively. Despite evidence of different rates of insertion and deletion mutation within the insulin gene region, maintenance of synteny across vast evolutionary timescales points to a common and vital function for the insulin gene product, which is wholly consistent with the high degree of insulin protein conservation.
HOMOLOGY BETWEEN INSULIN PROMOTERS
It has been estimated from large-scale studies that the number of conserved intergenic sequences is similar to that of coding sequences (35–37), and evolutionary changes in promoters together with their attendant alteration in transcriptional response to physiological and environmental demands have been documented (38,39). This is facilitated by the fact that promoters of protein-encoding genes are laid out into functional modules (40), allowing independent evolutionary selection of distinct characteristics of the overall transcription profile. Promoters are also considered to be more prone to genetic change than coding sequences (41,42) as the constraints typical of coding sequences are absent. In light of different regions of vertebrate genomes diverging at dissimilar rates (43) and this heterotachy being witnessed across different classes of mutation and lineages (44), this study utilized a variety of comparative alignment and transcription factor binding site search techniques with parameters that were appropriate for the evolutionary distances between species in order to detect meaningful evolutionary conserved regions (ECRs). The computational tools included CLUSTAL W (45), T-Coffee (46), GraphAlign (47), ECR Browser (48), Mulan (49), zPicture (50), TRES (51), and TRANSFAC (52).
Calculations of homology between the different insulin promoters and the human version were carried out across the region spanning −600 to +1. The downstream 100 bp, which contains two cyclic AMP response element (CRE) sites in human (see the section on CREs below), is comprised mostly of the extremely poorly conserved first intron that unduly influences the overall results. Percentage identity plots (PIPs) comparing the human insulin promoter to those of the other species reveal that, not surprisingly, the most closely related chimpanzee and other great apes share the greatest homology to human, making discernment of conserved regions impossible. Mammals that are more distantly diverged from human display several regions of conservation within the first 350 bp upstream, which correspond to the major regulatory elements. There is a clear falloff in homology beyond −350 or −400 bp upstream from the start of transcription, which is especially apparent in rhesus macaque. While PIPs are useful for identifying ECRs, a detailed breakdown of identity values for specific regions can expose the overall relatedness of different insulin promoters (Table 1). Interestingly, the degree of homology does not follow a simple direct correlation with time from divergence. For example, African green monkey and owl monkey diverged from humans 25 and 35 million years ago, but the main regulatory region of their promoters (−300 to +1) displays 90 and 98% identity, respectively. Similarly, most nonprimate mammals have 65–69% identity in this region and 49–55% in the adjacent upstream 50 bp. Dog stands out in having much higher homology with 69 and 75% identity for these two regions, respectively.
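As a rough illustration of how identity values of the kind reported in Table 1 can be derived, the following sketch computes percent identity between two sequences that have already been aligned (for example, by CLUSTAL W). The function name, the toy sequences, and the decision to count gapped columns as mismatches are assumptions made for illustration only; the review does not state how gaps were scored.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns in which both sequences carry the same base.

    Assumption: the inputs are already aligned (equal length, '-' for gaps),
    and any column containing a gap is scored as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("empty alignment")
    matches = sum(
        1
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


# Toy example (hypothetical sequences, not taken from any insulin promoter).
print(percent_identity("ACGT", "ACGA"))  # 3 of 4 columns match
```

Applying such a function to separate windows (for instance −300 to +1 and the adjacent upstream 50 bp) would give region-specific values comparable in form to those discussed above.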
Together, these results are in agreement with the opinion that vertebrate genes and immediate upstream flanks are highly constrained and, more important, confirm the accepted demarcation of the insulin promoter. There is no discernible significant homology between human and either chicken or zebrafish insulin promoters, which is in keeping with the view that most human DNA is not alignable to species separated by more than 200 million years. Likewise, there is no homology between chicken and zebrafish insulin promoters.
Computational analysis of the insulin promoters for novel evolutionarily conserved sequences uncovered a single short region immediately upstream of the A3 box (see A boxes); however, this region does not appear to contain any currently known transcription factor consensus sequences.
THE PROXIMAL PROMOTER REGION
Within a promoter, the fundamental component is the ∼100-bp basal promoter that provides an assembly platform for the RNA polymerase II initiation complex. These modules vary among genes and can contain a TATA box 25–30 bp upstream of the transcription start site, an initiator element lacking the TATA sequence or a null basal promoter containing neither. All of the studied insulin promoters contain a TATA box. However, the chicken promoter is distinct from the others in that at least two isoforms can be transcribed from alternative initiation sites (53). In E1.5 chicken embryo pancreas, the single insulin gene is also transcribed from an upstream secondary promoter to yield an mRNA with an additional 32-bp leader sequence. Inspection of available chicken genome sequence reveals that this alternative start site must be the product of a secondary basal transcription complex, as the transcript includes the genomic sequence from immediately upstream of the TATA box (25 bp upstream from the start of transcription) to the beginning of exon 1. The lack of another TATA box within the promoter and the presence of a C at −1 and an A at +1 of the longer transcript suggest that transcription is most likely established by an initiator element.
REGULATORY ELEMENTS WITHIN INSULIN PROMOTERS
Regulatory elements within promoters can originate at different times, and species comparisons indicate that promoters evolve through transcription factor binding site turnover and accretion (54,55). The relative numbers of the principal insulin promoter regulatory elements in the surveyed species are listed in Table 2.
A-box sequences containing the TAAT motif bind homeodomain proteins (56), the most important of which is pancreatic duodenum homeobox-1 (PDX-1) (57–61), which has been shown to be a potent stimulator of transcription of rat, mouse, and human insulin genes (62). There are three principal A boxes in the human promoter: A1 (−82), A3 (−216), and A5 (−319) (Fig. 1). PDX-1 stimulates expression at A3 (58,63–65) and mutation of A3 has the most significant effect on transcription (61,65,66). Contrary to the opinion that A3 is not the most conserved (16), this survey has shown that A3 is the only A box present in all the mammals and, therefore, must be considered to be the most conserved and central to PDX-1 stimulation. PDX-1 bound to A1 has been shown to interact synergistically with E47/β2 in rat insulin 1 (30).
As the 4-bp TAAT motif can occur every 256 bp, the ability of PDX-1 to differentiate between potential regulatory elements must be influenced by adjacent sequences. The 3-bp flanking sequences have been shown to make an important contribution to the binding affinity of PDX-1 to TAAT core elements with a concomitant effect on activation. However, variations in these sequences are insufficient to completely explain differences in PDX-1 binding affinities (67). Therefore, the 8-bp flanking regions of all A boxes were assayed for homology (Table 3). The A3 box and 5′ flanking region lie within a novel ECR, and this is reflected in the high degree of conservation. The lack of any other regulatory elements within this ECR based on computational analysis raises the intriguing possibility that, while the TAAT motif is symmetrical, binding of PDX-1 to the promoter may be directional. Clear, though less well defined, asymmetrical homology of the other A box flanking regions to the human sequences is also apparent. Regulatory elements present in multiple copies often exist in both orientations (42), thereby increasing potential phenoplasty.
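The figure of one TAAT per 256 bp follows from a simple model, sketched here under the assumption (ours, for illustration) that the four bases occur independently and with equal frequency on the strand being scanned. For a fixed motif of length k in a sequence of length L:

```latex
\begin{align*}
P(\text{motif at a given position}) &= \left(\tfrac{1}{4}\right)^{k}, \\
\text{expected spacing} &= 4^{k}\ \text{bp}, \\
\text{expected number of occurrences} &= \frac{L - k + 1}{4^{k}} .
\end{align*}
% For TAAT, k = 4: spacing = 4^4 = 256 bp.
% In a 600-bp promoter: (600 - 4 + 1) / 256 \approx 2.3 occurrences.
```

The same formula gives 4,096 bp for a 6-bp motif. Real promoters deviate from equal base composition, so these values are only a baseline against which clustering of motifs can be judged.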
The A3 5′ flanking region in rat insulin 1 has two additional TAAT sequences as a consequence of two single base pair changes. This creates the A4 site (29), which is juxtaposed to A3 to generate an additional regulatory element that has been reported to bind other homeodomain transcription factors, some of which have been shown to affect transcription. One of the best studied is hepatocyte nuclear factor (HNF)-1α, which has been reported to activate the rat insulin 1 gene in the HIT cell line (68). Similarly, Isl-1 has been found to bind to this site (69) and to interact with islet cell–specific transcription factor β2 to stimulate rat insulin 1 expression (70). Other transcription factors reported to bind to the A3/A4 box include cdx-3 (29) and HMGI(Y) (71). Inspection of all other insulin promoters shows that this homeodomain-binding sequence is unique to rat insulin 1. It would, therefore, seem logical to conclude that these transcription factors play no role in other species. However, HNF-1α provides an example of how the promiscuity of transcription factors creates obstacles in predicting insulin promoter effectors. Although the consensus binding sequence is not present in the human insulin promoter, the A3 region is sufficiently similar for the protein to bind, at least in vitro, and stimulate reporter assays (72). On the other hand, in vivo chromatin immunoprecipitation (ChIP) assays have shown that HNF-1α is not necessary for either insulin 1 or 2 expression in mice, which lack A4 (73). Surprisingly, both the 5′ and 3′ flanking regions of each of the A4 TAAT sequences have higher homology to the human A3 region than rat insulin 1 A3, differing by only 1 bp. This raises the interesting possibility that, although the rat insulin 1 A3 box seems to be the main binding site (67), A4 could also bind and be regulated by PDX-1.
Regardless of the regulatory capacity of the alternative A boxes, the binding kinetics of PDX-1 to the primary A3 regulatory element could be appreciably different in rat insulin 1 compared with humans and other mammals.
The greatly diverged chicken and zebrafish insulin promoters lack mammalian A boxes; however, several TAAT motifs are present. The chicken has two at −359 and −386, and zebrafish has three at −142, −347, and −359 plus two more further upstream at −473 and −510. The clustering of TAAT motifs is greater than would be expected from random nucleotide arrangements. While TAAT motifs are targets for a large number of homeodomain transcription factors, it is worth noting that the 5′ and 3′ flanks of the zebrafish A boxes at −359 and −142 have 3-bp sequences associated with strong PDX-1 binding (67), suggesting a possible role for PDX-1 in regulating these insulin genes. The flanking regions share no homology with human. This is unlikely to reflect divergence of the PDX-1 proteins (rodent, chicken, and zebrafish PDX-1 proteins share 89, 26, and 49% amino acid sequence identity with the human protein, respectively) as the homeodomains are well conserved and there is no evidence of species specificity in DNA binding.
In addition to A boxes, the GGAAAT-containing GG2 motif (−145) is also activated by PDX-1 (74) despite its deviation from the homeodomain consensus. The human insulin promoter contains a second GG motif 5 bp downstream of GG2 that is commonly referred to as GG1 (75) or A2 (28). Mutation of these GG regulatory elements either singly or together has been shown to drastically reduce transcription (76), and the transcription factor binding to GG1 interacts with a transcription factor binding to the adjacent C1 site (77). Together, these findings suggest that both of the GG regulatory elements have a function in insulin expression. Of the two, GG2 is by far the more conserved, being present in all mammals except the rodent insulin promoters. GG1, on the other hand, is absent from insulin promoters that diverged from human more than 25 million years ago, with the exception of the rat insulin 1 gene and dog. The presence of the highly conserved GG1 and C1 regulatory elements immediately downstream of GG2 and GG1, respectively, precludes useful comparison of flanking regions. The chicken insulin promoter has a GG motif at −130, which is in the same general region as GG1 in human (−133); however, there is no homology with the flanking regions of either human GG1 or GG2. The zebrafish insulin promoter does not contain any GG motifs.
Cyclic AMP response element.
In the context of the insulin promoter, cAMP responsive elements bind the broadest array of transcription factors. These are generally closely related members of the bZIP CREB/ATF family, which can exist as multiple isoforms (78) that can interact with transcription factors activated by cAMP and diacylglycerol signaling pathways (79,80) to create activators, nonactivators, or repressors. The human insulin gene has four CRE sites: CRE1 at −210; CRE2 at −183; CRE3 at +18; and CRE4 at +61 (81). Although none of the CRE sites contains the consensus CRE sequence of TGACGTCA, mutagenesis experiments have shown that all are transcriptionally active (82).
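One simple way to express how far a candidate site departs from the consensus CRE is to count mismatched positions against TGACGTCA. The sketch below does this; the function name and the example octamers are hypothetical and are not the actual human CRE1 to CRE4 sequences, which are not reproduced in this section.

```python
CRE_CONSENSUS = "TGACGTCA"  # consensus CRE octamer quoted in the text


def mismatches_to_consensus(site: str, consensus: str = CRE_CONSENSUS) -> int:
    """Number of positions at which `site` differs from the consensus."""
    site = site.upper()
    if len(site) != len(consensus):
        raise ValueError("site and consensus must have the same length")
    return sum(1 for a, b in zip(site, consensus) if a != b)


# Hypothetical octamers chosen only to exercise the function.
for candidate in ("TGACGTCA", "TGACGTCC", "TGACCTCC"):
    print(candidate, mismatches_to_consensus(candidate))
```

A mismatch count says nothing about function on its own; as the mutagenesis data cited above show, nonconsensus sites can still be transcriptionally active.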
Comparison of CRE sites between species (Table 4) reveals that only primates have multiple copies of CREs, with other mammals containing a single CRE corresponding to CRE2. Of these, only the dog CRE is identical to the conserved human CRE2 site. The multiple CRE sites in primates could be due to several factors, the most likely being dietary. It should be noted that while gorillas are often considered to be predominantly folivorous, it has become apparent that they also consume a significant amount of fruit (83). This is even truer of the Western gorilla (Gorilla gorilla), whose genome is being sequenced for assembly, than of Eastern gorillas (Gorilla beringei). Also, all the primates, especially the great apes, are partly omnivorous since they supplement their diets with birds, eggs, small reptiles, and insects. In comparison to the other mammals studied, only primates consume large quantities of fruit in their diet. However, the number of CRE sites is not in a simple direct correlation with the amount of fruit consumed, as all the studied primates eat large quantities. Another possible reason is that while primates are omnivorous to varying degrees, they often gorge themselves on a single food (e.g., ripe fruit when a tree is in season or meat when a whole carcass is consumed quickly), which would give rise to major alterations in metabolic demands. This would be particularly pertinent to early humans and necessitate an insulin promoter that could respond accordingly. The phenomenon of increased numbers of CREs in primates may be expedited by the fact that primate promoters have an increased rate of evolution (44).
As with other regulatory elements, the chicken and zebrafish insulin promoters do not contain obvious CRE sites. The chicken insulin promoter contains four possible (three overlapping) nonconsensus sequences in the vicinity of the conserved mammalian CRE site, while the zebrafish has two potential nonconsensus octamers at −46 and −226.
It is impossible to draw conclusions on the effects of the numerous minor nucleotide changes on CRE site activity, as most regulatory elements can tolerate one or more substitutions without total loss of function (84,85). Therefore, it may be very significant that, even with the variability of the octamer in the conserved CRE site, sequences that include the CRE core along with at least 8 bp of both 5′ and 3′ flanking regions represent one of the most prominent ECRs in all mammalian insulin promoters. This strongly points to the importance of CRE sites in insulin gene regulation.
Initial expression studies on the C1 element at −128 (5′TGCAGCCTCAGCC) were carried out on the rat insulin 2 gene, showing that it binds the transcription factor RIPE3b1, which was subsequently identified as the basic leucine zipper (bZIP) protein MafA (86–90). Mutagenesis of the human C1 MafA binding site reduces promoter activity by 74% in INS-1 β-cells (91) and blocks activation by glucose in MIN6 β-cells (92). MafA can also interact with β2 and PDX-1 (93). All the mammalian insulin genes show extremely high conservation of the C1 site, and all are identical to human, except dog and pig, which have 1- and 2-bp substitutions at the 3′ and 5′ regions of the consensus sequence, respectively. As the recognition site is 13 bp long, it is possible that mutations at its extremities would not necessarily eliminate MafA stimulation. Despite the clear conservation pressures on the C1 site, no comparable sequence was detected in the chicken and zebrafish insulin promoters.
The human insulin promoter has a bipartite C2 element (5′CAGGGACAGG) at −252 (94), and rat insulin 1 promoter has been reported to contain a dissimilar, though active, sequence between −329 and −307. The C2 site can bind PAX4 and PAX6, which repress (95) and stimulate (96), respectively. A search of insulin promoters showed that the human C2 site is present in all primates, although African green monkey has a single base pair substitution between the two CAGG motifs. Among nonprimates, dog has two substitutions between the direct repeat and cow has three repeats with the intervening regions containing 1- and 2-bp deletions. It is not immediately apparent from DNA sequence alone whether these latter sites are functional.
E boxes (5′CANNTG) bind proteins of the basic helix loop helix (bHLH) class of transcription factor with ubiquitous E47 forming a heterodimer with neuroendocrine cell specific NeuroD/β2 (97). Two important E boxes were initially identified in the rat insulin 1 promoter between −104 and −112 (E1 or IEB1) and between −233 and −241 (E2 or IEB2) (98). The E1 box is the more conserved of the two and analysis showed that it is present in all mammal insulin promoters. Mutagenesis of this site in the rat insulin 1 and 2 promoters results in reduced transcription (98,99), and in the human insulin promoter drastically reduces basal transcription (91) and responsiveness to glucose (92). The E2 motif is less well conserved and the homologous sequence in the human insulin promoter at −239 (5′GCCACCGG) (75) contains a nonconsensus recognition site. The human E2 sequence can bind the ubiquitous transcription factor USF (100) but it does not appear to have a measurable effect on the overall activity of the promoter. In addition to the E1 and E2 boxes, a search of the insulin promoters revealed the presence of many other “CANNTG” consensus sequences (Table 2), including two in the negative regulatory element that lie just 23 and 33 bp upstream of the human E2 site. The presence of numerous potential E boxes suggests that regulation of the insulin promoter by bHLH transcription factors remains to be fully elucidated. Chicken and zebrafish insulin promoters contain neither E1 nor E2; however, they possess several consensus E box sites.
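A search for potential E boxes of the kind summarized in Table 2 can be sketched with a regular expression for CANNTG. The function name and the test sequence below are illustrative assumptions, not the tools or sequences used in this review. Because the reverse complement of CANNTG is again CANNTG, scanning one strand is sufficient; a lookahead is used so that overlapping sites are not missed.

```python
import re

# CANNTG with N = any base; the lookahead lets overlapping sites be reported.
E_BOX = re.compile(r"(?=(CA[ACGT]{2}TG))")


def find_e_boxes(seq: str) -> list[tuple[int, str]]:
    """Return (0-based start, hexamer) for every CANNTG site in `seq`."""
    return [(m.start(), m.group(1)) for m in E_BOX.finditer(seq.upper())]


# Hypothetical sequence containing two E-box-like hexamers.
print(find_e_boxes("GGCAGCTGTTCATATGAA"))
```

Positions are returned as 0-based offsets into the supplied string; converting them to coordinates relative to the transcription start site requires knowing where +1 lies in that string.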
An unnamed sequence at −232 (5′GGGCCC), which we have tentatively termed G2 in Fig. 1, overlaps the 5′ end of the E2 box and binds a factor with limited tissue distribution (101). This sequence, which is known to induce DNA curvature, may serve to bring together proteins that bind at sites flanking this motif. Examination of the other insulin promoters reveals that within the primates, chimpanzee and gorilla contain the G2 sequence at the same location, while orangutan, rhesus macaque, and African green monkey share a transition at the first nucleotide. The G2 site is absent from owl monkey; however, this primate has an alternative G2 motif at −453. Among the other mammalian insulin promoters, mouse insulin 2 and cow have a G2 site in the same region, while dog, mouse insulin 1, and pig have alternative G2 sequences at −329, −400, and −16, respectively. Since a 6-bp motif would be expected to occur only once every 4,096 bp by chance, the existence of alternative G2 motifs may indicate that G2-facilitated DNA bending abets interactions between proteins binding to the promoter. The G2 motif is absent from the rat insulin paralogues, chicken, and zebrafish.
Negative regulatory element.
The human insulin promoter contains an inhibitory sequence (−279 to −258) referred to as the negative regulatory element (NRE) (5′GAGACATTTGCCCCCAGCTGT) (75,102) that lies within the glucose sensing Z element (−243 to −292) (103,104). It displays contrary properties acting as both a potent glucose-responsive transcriptional enhancer in primary cultured islet cells and as a transcriptional repressor in immortalized β- and non-β-cells and in primary fibroblasts (103). Searches of the insulin promoters detected the NRE sequence in all primates; however, it is absent from all other species, which is in agreement with reports that there is no evidence for a β-cell–specific NRE in rat insulin 1 (98,105).
Insulin-linked polymorphic region.
A hypervariable region containing variable numbers of tandem 14-bp repeats (5′TCTGGGGAGAGGGG) (insulin-linked polymorphic region [ILPR] or variable number of tandem repeats [VNTR]) is located at approximately −360 in the human insulin promoter. The ILPR adopts an altered structure, which has been characterized as a quadruplex involving interactions between the G residues on the top strand (106). This sequence, which binds the transcription factor Pur-1/MAZ (107), has a powerful effect on promoter activity in β-cells. Three classes of VNTR alleles have been identified based on the number of repeats of the 14-bp sequence: class I (20–63 repeats), class II (64–139 repeats), and class III (140–210 repeats). There is a correlation between the number of repeats in this region (IDDM2 locus) and susceptibility to type 1 diabetes, with the highest risk conferred by class I (108), while class III has been linked to type 2 diabetes (109). On the other hand, studies involving large cohorts have shown that this region has no impact on early growth (110), insulin release, or diabetes (111). The class I allele is associated with higher levels of insulin mRNA in the pancreas, whereas class III alleles are associated with higher levels of insulin gene transcription in the thymus (20). The increased levels of insulin in the thymus may promote efficient deletion of autoreactive T-cells for proinsulin and immune tolerance to a key antigen implicated in type 1 diabetes. The ILPR sequence was found in only the chimpanzee promoter.
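The allele classes can be expressed as a simple lookup on repeat number. The sketch below uses the class boundaries exactly as given in this section; the function and constant names are hypothetical, and repeat counts outside the stated ranges are returned as unclassified rather than being forced into a class.

```python
from typing import Optional

# (class name, minimum repeats, maximum repeats), as stated in the text.
VNTR_CLASSES = (("I", 20, 63), ("II", 64, 139), ("III", 140, 210))


def classify_vntr(repeats: int) -> Optional[str]:
    """Return the VNTR class for a repeat count, or None if out of range."""
    for name, low, high in VNTR_CLASSES:
        if low <= repeats <= high:
            return name
    return None


for n in (40, 100, 150, 5):
    print(n, classify_vntr(n))
```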
The G1 box (5′GTAGGGGA) at −52 contains a sequence similar to that in the ILPR repeat sequence. The human insulin promoter G1 box binds the transcription factor Pur-1/MAZ (107,112). Although rat insulin 1 and 2 promoters lack the 5′GTAGGGGA motif, Pur-1/MAZ can bind to the adjacent guanine-rich region that often contains a GAGA box to stimulate transcription (113). A search of insulin promoters shows that chimpanzee, orangutan, and owl monkey have a G1 sequence identical to human. Gorilla and rhesus macaque share a single nucleotide change; however, like African green monkey, rat insulin 1, and both mouse paralogues, they retain the GAGA box. Therefore, it is likely that Pur-1/MAZ is active in regulation of these insulin promoters. Pig, cow, and dog all contain deletions in this region, and chicken and zebrafish lack homologous motifs.
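A comparison of this kind, in which some species carry the identical motif and others a single nucleotide change, amounts to scanning each promoter for the motif while tolerating a limited number of mismatches. The Python sketch below shows one way this might be done. The function names and the test sequences are hypothetical, and a plain mismatch count says nothing about whether Pur-1/MAZ would actually bind a variant site.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def scan_motif(seq: str, motif: str, max_mismatches: int = 1):
    """Return (0-based start, matched window, mismatches) for every window of
    `seq` that differs from `motif` at no more than `max_mismatches` positions."""
    seq, motif = seq.upper(), motif.upper()
    hits = []
    for start in range(len(seq) - len(motif) + 1):
        window = seq[start:start + len(motif)]
        mismatches = hamming(window, motif)
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


G1_BOX = "GTAGGGGA"  # motif quoted in the text

# Both sequences are invented for illustration only.
identical_site = "CCGTAGGGGATT"
variant_site = "CCGTAGGAGATT"  # one substitution within the motif

print(scan_motif(identical_site, G1_BOX))
print(scan_motif(variant_site, G1_BOX))
print(scan_motif(variant_site, G1_BOX, max_mismatches=0))
```

Only the forward strand is scanned here; a fuller search would also scan the reverse complement.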
The core element (5′TGTGGAAAG) at −312 has a perfect match to the binding site for the CCAAT-enhancer binding protein (C/EBP) and probably other factors. Very little is known about this regulatory element, although it may act along with the adjacent A5 to mediate MafA-PDX-1 interactions (104). The enhancer core is present in all the primates for which sequence is available (not gorilla and orangutan). Rat insulin 2, mouse insulins 1 and 2, and dog share a single conservative transition at the most 3′ position. Rat insulin 1 has an additional mutation at the most 5′ position that may significantly reduce stimulatory potential, and the motif is absent from all other species.
The SP1 site (5′CCGCCC) at −345 was originally identified as a sequence that could bind a factor present in HIT T15 β-cells. The SP1 site appeared to exhibit powerful transcriptional effects, but mutations that abolished protein binding had no effect on its transcriptional activity (A.R. Clark and K.D., unpublished findings), suggesting possible interactions with adjacent sites. The SP1 site has also been identified as a potential binding site for the SP1-like factor KLF11, variants of which may contribute to the development of diabetes (114). Examination of primates for which sequence is available (not gorilla and orangutan) shows that all but African green monkey contain the identical SP1 site in the same position. African green monkey has a single nucleotide substitution of a C to T that reduces but does not eliminate KLF11 binding to the oligonucleotide in electrophoretic mobility shift assay (EMSA) studies (114). The SP1 site is absent from all other species.
The Ink (for insulin kilobase upstream) sequence at −1,030 contains a cluster of potential binding sites comprising a palindromic element with zero spacing overlapping a direct-repeat element with 2 bp pairing (5′AGGTCCCCAGGTCATGCCCTC) and is responsive to both retinoic acid and thyroid hormone (115). Searches of available insulin promoter sequences upstream to −1,500 show that the Ink box is absent from all nonprimates. Of the primates, distant upstream sequence is available for only chimpanzee and rhesus macaque. These two species contain the Ink motif at −854 and −947, respectively. Although the positions are quite removed from the human, the immediate 30-bp regions display 95% identity with the human Ink region, suggesting that this regulatory element may be influential in insulin expression, perhaps playing a role in energy homeostasis.
SEQUENCE ELEMENTS THAT ARE ABSENT FROM THE HUMAN INSULIN PROMOTER
Several of the descriptions of the effects of transcription factors on insulin expression are based on results from single species. For example, there is a transcriptionally active CCAAT regulatory element that overlaps the single CRE site in the insulin promoters of both rat and mouse. Expression studies using rat insulin 1 promoter have shown that the combined CRE/CCAAT site shows preferential binding for the nuclear transcription factor-Y (NF-Y), which leads to reduced influence of CRE-associated signaling (116). A search of the other insulin promoters revealed not only that no nonrodent species have a CCAAT site that overlaps with CRE, but that CCAAT sites are totally absent from all of the insulin promoters except zebrafish, which has three at −164, −130, and −85. Therefore, NF-Y signaling, which has an absolute requirement for all five bases in the CCAAT consensus sequence (117), is unique to rodents within mammals and does not typically play a role in insulin regulation.
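Because NF-Y requires all five bases of the CCAAT consensus, a search of this type reduces to exact matching on both strands, with hits reported relative to the transcription start site. The Python sketch below illustrates the idea. The function names, the coordinate convention (the base immediately upstream of the start site is −1), and the toy promoter are assumptions made for illustration and are not taken from the survey.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence containing only A, C, G, and T."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_exact_sites(promoter: str, tss_index: int, motif: str = "CCAAT"):
    """Return sorted (position, strand) pairs for every exact match to `motif`.

    `tss_index` is the 0-based index of the transcription start site in
    `promoter`. Positions refer to the leftmost base of each hit, so a hit
    ending just upstream of the start site has a small negative position.
    Assumption: the motif is not its own reverse complement (true for CCAAT);
    a palindromic motif would be reported once on each strand.
    """
    promoter = promoter.upper()
    hits = []
    for strand, pattern in (("+", motif.upper()), ("-", reverse_complement(motif))):
        start = promoter.find(pattern)
        while start != -1:
            hits.append((start - tss_index, strand))
            start = promoter.find(pattern, start + 1)
    return sorted(hits)


# Invented upstream sequence; the start site is taken to lie just past its end.
toy_promoter = "GGCCAATGGGATTGGAA"
print(find_exact_sites(toy_promoter, tss_index=len(toy_promoter)))
```

Exact matching is deliberately strict here; any tolerance for mismatches would report sites that, on the evidence cited above, NF-Y would not be expected to use.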
HNF-4α regulatory element.
Rat and mouse insulin 1 and 2 promoters contain a consensus binding site for HNF-4α (5′ACGGCAAAGTCC) located between −69 and −57. The rat insulin 1 promoter has been shown to be activated directly by HNF-4α, which can interact synergistically with PDX-1 at the adjacent A1 site (118). In contrast, the HNF-4α binding site does not exist in the human insulin promoter, and HNF-4α fails to activate the gene in reporter assays (72). A search of all other insulin promoters found no evidence of any HNF-4α binding sites. Therefore, HNF-4α transactivation is unique to rodents and does not generally have a function in insulin regulation.
STAT regulatory element.
Hormones involved in energy homeostasis and growth (e.g., leptin, prolactin, and growth hormone) have been reported to modify rat insulin 1 expression at −330 to −322 (5′TTCTGGGAA) through the transcription factors STAT3 (119) and STAT5 (120). Examination of all insulin promoters revealed that the STAT regulatory element is present only in the rat insulin 1 promoter, although the other rodent insulin promoters have only a single base pair substitution within the consensus sequence. The differences in the human insulin promoter were much greater, raising uncertainty about the relevance of direct influence by the STAT signaling pathways on insulin expression in humans.
COUP-TFII binding element.
Chicken ovalbumin upstream promoter–transcription factor II (COUP-TFII), which is also known as NR2F2 and ARP-1, binds a direct repeat in the chicken ovalbumin promoter (121) and has been reported to bind an unrelated imperfect repeat in rat insulin 2 promoter between −55 and −38 (5′GGGTCAGGGGGGGGGTGC) (122) through a different molecular mechanism (123). COUP-TFII has recently been implicated in the control of blood glucose in heterozygous knockout transgenic mice that had increased insulin secretion in low glucose and decreased insulin secretion in high glucose (124). The corresponding regions in human and all primates (5′AGGTAGGGGAGATGGGCT) have several nucleotide differences that result in loss of essential guanine nucleotides. In rat insulin 1 and both mouse insulin promoters, there are transitions of essential guanines to adenines in positions where either purine can serve for recognition, if not intermolecular association. Given the irregular binding affinities of COUP-TFII, it is difficult to make unequivocal statements regarding its possible effects on these insulin promoters. The other mammals (cow, dog, and pig) have deletions in the COUP-TFII binding region, and no known form of the consensus sequence is found in the chicken and zebrafish insulin promoters. Within the context of this survey, the action of COUP-TFII would seem to be limited to rodents.
REGULATORY ELEMENT SPACING
The spacing between the individual regulatory elements within the particularly well-conserved cassette of C1, E1, and A1 boxes has been shown to alter the relative stimulatory effects of the transcription factors that bind along with their synergistic interactions (91). Comparison of mammalian insulin promoters in this region showed that the relative spacing of the regulatory elements has been maintained for at least 35 million years, as there is no deviation in the primates. On the other hand, all the rodent insulin promoters contained insertions and deletions between all three sites. In mammals lacking A1, the C1-E1 spacing was maintained in both pig and dog while cow had a one base pair insertion between C1 and E1.
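Once the elements have been located in each promoter, a spacing comparison of this kind can be reduced to differences in the gaps between consecutive elements. The Python sketch below assumes that each element has already been mapped to a (start, end) interval with an exclusive end. The function names and all coordinates are hypothetical and serve only to show how a one-base insertion between two elements would appear.

```python
def element_gaps(elements):
    """Number of bases separating consecutive elements.

    `elements` maps an element name to a (start, end) interval, end exclusive,
    in any consistent coordinate system. Elements are ordered by start.
    """
    ordered = sorted(elements.items(), key=lambda item: item[1][0])
    gaps = {}
    for (name_a, (_, end_a)), (name_b, (start_b, _)) in zip(ordered, ordered[1:]):
        gaps[f"{name_a}-{name_b}"] = start_b - end_a
    return gaps


def spacing_changes(reference, other):
    """Gap differences (other minus reference) for element pairs present in both."""
    ref_gaps, other_gaps = element_gaps(reference), element_gaps(other)
    return {pair: other_gaps[pair] - ref_gaps[pair]
            for pair in ref_gaps if pair in other_gaps}


# Hypothetical coordinates, chosen only to illustrate the calculation.
reference = {"C1": (0, 10), "E1": (14, 20), "A1": (25, 31)}
one_bp_insertion = {"C1": (0, 10), "E1": (15, 21), "A1": (26, 32)}
lacking_a1 = {"C1": (0, 10), "E1": (14, 20)}

print(spacing_changes(reference, one_bp_insertion))
print(spacing_changes(reference, lacking_a1))
```

Pairs are compared only when both promoters contain the same adjacent elements, so a promoter lacking A1 is still assessed on its C1-E1 spacing.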
EFFECTS OF CHROMATIN STRUCTURE
Efficient transcription is the outcome of coordinated dynamic arrangements upon the promoter. ChIP assays using MIN6 β-cells have shown that PDX-1, MafA, E47, and β2 bind to the mouse insulin 2 promoter in a cyclical manner with a periodicity of ∼10–15 min (125). Insulin gene regulation is also influenced by epigenetic factors that include DNA methylation and alterations in histone modifications, which affect the packaging of DNA within chromatin. There are a number of studies on the role of histone acetylation and methylation in the control of insulin gene expression. A key role for histone acetyl transferase (HAT) p300 in insulin promoter regulation has been demonstrated by the observations that PDX-1 and β2 mediate their effects on the rat insulin 2 gene through an interaction with p300 (31,126,127), while activation of a rat insulin 1 promoter construct in HeLa cells by PDX-1 requires interactions with p300 (128). It has also been shown that the effects of glucose on a rat insulin 1 promoter construct in the mouse MIN6 β-cell line involved the recruitment by PDX-1 of HAT and histone deacetylase (HDAC) activities. Thus, under low-glucose conditions, PDX-1 associated with HDACs to repress transcription (129), whereas under high-glucose conditions PDX-1 recruited the HAT p300 to activate transcription (130). PDX-1 has also been linked to the presence of methylated histone H3, i.e., H3K4me (nomenclature as per (131)), at the proximal promoter and coding regions of the insulin gene in rodent cells (132). More recently, the histone methyl transferase set9 has been localized to β-cells in association with the insulin gene (133).
Investigations into the role of chromatin accessibility in insulin expression have revealed that PDX-1 shows preferential binding to open chromatin (euchromatin) over condensed chromatin (heterochromatin). In particular, PDX-1 occupies the endogenous insulin promoter in mouse βTC3 β-cells but not in mPAC ductal cells, which do not express insulin. Furthermore, the binding affinity of PDX-1 is strongly influenced by the position of nucleosomes relative to its regulatory element (134). Even within euchromatin, the degree of openness varies as the A3/A4 region (−126 to −296) to which PDX-1 can bind contained the most open chromatin structure based on micrococcal nuclease digestion, whereas the adjacent region (−297 to −460), which is not as crucial for β-cell–specific insulin transcription, was more condensed. Although it is likely that the insulin gene is embedded in euchromatin in β-cells and in more condensed heterochromatin in non-β-cells, it may be of relevance that the synteny studies (see insulin genes) show that the human insulin gene lies only 2 kbp from the transcriptionally active TH gene, whereas this distance is >100-fold greater in rodents. Thus, the diverse efforts to induce insulin expression in non-β-cells may be less problematic in humans than in rodents.
The extraordinary synteny of insulin genes from zebrafish to human substantiates the key importance of the insulin hormone product. Comparison of insulin promoters spanning 450 million years of evolution has permitted identification of the central regulatory elements as well as several valuable observations.
The transcription factor PDX-1 emerges as one of the fundamental regulators of insulin expression for several reasons: all promoters have at least one A box with A3 being the most conserved, the weaker PDX-1-binding GG boxes also form ECRs with GG2 being the more conserved, and the transcription factor is known to interact with MafA and E47/β2.
The other strongly conserved regulatory elements of C1, E1, and the conserved CRE site attest to the importance of MafA, E47/β2, and cAMP-associated regulation. Of these regulatory elements, the CRE site is unusual due to the remarkable degree of variability and extensive array of associating transcription factors.
Regulatory element conservation is not limited to the consensus sequences as the flanking regions also contribute to ECRs. This is true of regulatory elements with both short and long recognition sequences indicating that flanking regions are necessary for transcription factor specificity and binding. This may be of particular consequence for the capricious CRE sites while the asymmetrical nature of the conserved A3 box flanking regions may reflect directional binding of PDX-1.
Within mammals, dog stands out due to its much greater homology to humans. The similarities include higher percentage identity, possessing more PDX-1–binding A boxes and GG elements, and having a CRE site that is identical to human. It is interesting to speculate whether these likenesses between a carnivore and omnivore correlate to the increased contribution of meat in the human diet over evolutionary time compared with other primates.
The chicken and zebrafish insulin promoters bear no obvious homology with mammals and exhibit a dearth of readily discernible regulatory elements.
Investigations based on rodents and their insulin genes have provided invaluable insights into diabetes and the workings of insulin promoters. However, the findings reported here illustrate that notable dissimilarities exist between the human and rodent promoters, which may reflect both divergence and the degree to which these promoters have been studied. The atypical characteristics of rodent insulin promoters are exemplified most clearly by the rat insulin 1 promoter, whose unusual attributes include an active dominant CCAAT site overlapping the single CRE site; HNF-1α and HNF-4α regulatory elements; a functional Isl-1 binding site at A3/A4; a STAT3 binding site; a potential COUP-TFII binding site; a consensus-containing E2 site; loss of GG boxes; lower conservation of A3 flanking regions; and changed spacing between regulatory elements in the C1-E1-A1 module leading to alternative synergistic interactions. The most plausible basis for the complexity of rodent insulin promoters is the duplication of their associated genes. Gene duplication can lead to functional divergence of the cis-regulatory elements (135,136) that can be swift even in recently duplicated genes (137). In addition, the signaling pathways regulating an essential gene like insulin will undoubtedly incorporate redundancy to extend responses and to act as a buffer against the consequences of mutation of key components. The fundamental differences in regulatory elements should serve as a salutary warning to be cautious when extrapolating rodent-based data to humans.
A major obstacle in diabetes research has been the lack of a human pancreatic β-cell line that is functionally equivalent to primary β-cells. It is essential that new human β-cell lines be developed and widely distributed in order that physiologically and medically relevant studies on the human insulin promoter can be carried out. This is especially true of in vivo epigenetic and ChIP-based experiments that will accurately map the position and define the role of nucleosomes and undoubtedly help to unravel the precise mechanisms responsible for insulin gene regulation.
These are exciting times as genome sequencing progresses rapidly. The availability of insulin genes from a wider range of species will provide tools that will permit the relatively straightforward answering of points raised in this report and allow us to advance our comprehension and appreciation of the subtle and sophisticated insulin promoter.