`J. Mol. Biol.
`
Sequence and Comparative Analysis of the Rabbit α-Like Globin Gene Cluster Reveals a Rapid Mode of Evolution in a G + C-rich Region of Mammalian Genomes

Ross Hardison¹†, Dan Krane¹‡, David Vandenbergh¹§, Jan-Fang Cheng¹‖, James Mansberger¹, John Taddie¹, Scott Schwartz², Xiaoqiu Huang²¶ and Webb Miller²

¹ Department of Molecular and Cell Biology
² Department of Computer Science and Institute of Molecular Evolutionary Genetics
The Pennsylvania State University
University Park, PA 16802, U.S.A.
`
`
`
`(Received 8 April 1991; accepted 24 July 1991)
`
`
`
A sequence of 10,621 base-pairs from the α-like globin gene cluster of rabbit has been determined. It includes the sequence of gene ζ1 (a pseudogene for the rabbit embryonic ζ-globin), the functional rabbit α-globin gene, and the θ1 pseudogene, along with the sequences of eight C repeats (short interspersed repeats in rabbit) and a J sequence implicated in recombination. The region is quite G + C-rich (62%) and contains two CpG islands. As expected for a very G + C-rich region, it has an abundance of open reading frames, but few of the long open reading frames are associated with the coding regions of genes. Alignments between the sequences of the rabbit and human α-like globin gene clusters reveal matches primarily in the immediate vicinity of genes and CpG islands, while the intergenic regions of these gene clusters have many fewer matches than are seen between the β-like globin gene clusters of these two species. Furthermore, the non-coding sequences in this portion of the rabbit α-like globin gene cluster are shorter than in human, indicating a strong tendency either for sequence contraction in the rabbit gene cluster or for expansion in the human gene cluster. Thus, the intergenic regions of the α-like globin gene clusters have evolved in a relatively fast mode since the mammalian radiation, but not exclusively by nucleotide substitution. Despite this rapid mode of evolution, some strong matches are found 5' to the start sites of the human and rabbit α genes, perhaps indicating conservation of a regulatory element. The rabbit J sequence is over 1000 base-pairs long; it contains a C repeat at its 5' end and an internal region of homology to the 3'-untranslated region of the α-globin gene. Part of the rabbit J sequence matches with sequences within the X homology block in human. Both of these regions have been implicated as hot-spots for recombination, hence the matching sequences are good candidates for such a function. All the interspersed repeats within both gene clusters are retroposon SINEs that appear to have inserted independently in the rabbit and human lineages.
`
Keywords: α-globin gene cluster; CpG islands; G + C-rich isochores; DNA sequence alignments; evolutionary rates; recombination sequences; short interspersed repeats
`
† Author to whom correspondence should be addressed.
‡ Present address: Department of Genetics, Washington University School of Medicine, 4566 Scott Avenue, St Louis, MO 63110-1095, U.S.A.
§ Present address: National Institute on Drug Abuse/Addiction Research Center, P.O. Box 5180, Baltimore, MD 21224, U.S.A.
‖ Present address: Human Genome Center, Donner Laboratory, Lawrence Berkeley Laboratories, Berkeley, CA 94720, U.S.A.
¶ Present address: Department of Computer Science, Michigan Technological University, Houghton, MI 49931, U.S.A.
`
`
`
0022-2836/91/220233-17 $03.00/0
© 1991 Academic Press Limited
`
`SKI Exhibit 2027 - Page 1 of 17
`
`
`
R. Hardison et al.
`
1. Introduction

Much work has been devoted to understanding the mechanisms involved in the co-ordinate temporal and tissue-specific regulation of α and β-globin genes (for a review, see Collins & Weissman, 1984; Evans et al., 1990; Orkin, 1990). Given that equal amounts of α-like and β-like globin must be synthesized to produce the hemoglobin tetramer, α2β2, one might anticipate that this co-ordinate expression would be achieved by utilizing identical regulatory schemes. However, that is certainly not the case. The human α-globin gene is expressed permissively in a variety of cell lines after transfection (Mellon et al., 1981), whereas the β-globin gene requires either a viral enhancer in cis or erythroid induction to be expressed after transfection (Banerji et al., 1981; Humphries et al., 1982; Wright et al., 1984; Charnay et al., 1984). This permissive expression of α-globin genes in non-erythroid cells is observed for genes from both human and rabbit, but not mouse (Cheng et al., 1986; Whitelaw et al., 1989). A dominant control element for the β-like globin gene cluster is the locus control region (LCR†) located 5 to 20 kb 5' to the ε-globin gene (Grosveld et al., 1987; Forrester et al., 1987). A strong positive regulatory region has also been found 40 kb 5' to the human ζ-globin gene (Higgs et al., 1990); although the α-globin LCR shares some properties with that of the β-globin gene cluster, it is not clear that the regulatory features of these two LCRs are identical. Also, mammalian adult β-globin genes reach full induction at later stages of development than adult α-globin genes (Rohrbaugh & Hardison, 1983; Peschle et al., 1985). Some of the critical sequences that account for these differences are within or 3' to these genes (Charnay et al., 1984; Wright et al., 1984).

These differences in regulation may be related to the substantially different genomic contexts observed for the α and β-like globin gene clusters. After the α-like and β-like globin gene clusters moved to different chromosomes in the progenitor to birds and mammals (for a review, see Collins & Weissman, 1984; Hardison, 1991), they evolved into very different segments of the genome in some mammalian lineages. For example, the G + C-rich α-like gene cluster in humans contains several CpG-rich islands (Fischel-Ghodsian et al., 1987) that are never methylated (Bird et al., 1987), and this gene cluster is found in the most dense (most G + C-rich) isochore in both human and rabbit genomes (Bernardi et al., 1985). Isochores are very long (probably thousands of kb) segments of homogeneous base composition that may correspond to the Giemsa light and dark bands seen in metaphase chromosomes (Bernardi, 1989). Unlike most tissue-specific genes, the α-like globin gene clusters are replicated early in S phase in both erythroid and non-erythroid cells (Calza et al., 1984; Goldman et al., 1984). These characteristics contrast with the A + T-rich and CpG-deficient β-like gene cluster, which has been shown to be methylated in non-erythroid tissues in human (van der Ploeg & Flavell, 1980) and rabbit (Shen & Maniatis, 1980). The β-like globin gene clusters, like the bulk of mammalian genomic DNA, are present in low-density isochores (Bernardi et al., 1985), and like most tissue-specific genes, they are replicated late in S phase in non-erythroid cells but early in erythroid cells (Dhar et al., 1988; Epner et al., 1988).

The correlation between striking differences in genomic context and in types of regulation suggests that a detailed comparison of both these gene clusters between mammalian species would be productive in generating insights into their regulation. Indeed, analyses of the extensively sequenced human and rabbit β-like globin gene clusters (73·4 and 44·6 kb, respectively) reveal long stretches of sequence similarity that extend through and even between each of the β-like globin genes (Margot et al., 1989). The sequence similarity between the mouse and human β-like globin gene clusters is less extensive (Shehee et al., 1989), as is expected given the more rapid rate of evolution in rodents (Wu & Li, 1985). One notable segment of extragenic sequence, located about 6 kb 5' to the human ε-globin gene, has been highly conserved between human, rabbit and mouse (Hardison, 1991), and it has been shown to be part of the LCR of the β-like globin gene cluster (Forrester et al., 1987; Grosveld et al., 1987). A similar comparison is carried out in this paper for the α-like globin gene clusters of rabbit and human.

The rabbit α-like globin gene cluster is located in a Giemsa-light band at the terminus of the long arm of chromosome 6 (Xu & Hardison, 1991). The minimal gene cluster includes one adult α-globin gene, five homologs to the embryonic ζ-globin gene, and two θ-globin pseudogenes, arranged in the order 5'-ζ0-ζ1-α-θ1-ζ2-ζ3-θ2-ζ4-3' within a 38 kb DNA segment (Cheng et al., 1986, 1987, 1988). This gene cluster probably evolved by a duplication of a large DNA block containing the ζ-ζ-α-θ gene set, followed by deletion of the α-globin gene in the 3' duplicated ζ2-ζ3-θ2 gene set (Cheng et al., 1987). The rabbit α-like globin gene cluster is highly polymorphic both for the number of duplicated gene sets and for restriction fragment lengths around ζ0 and ζ1 (Cheng & Hardison, 1988). A similar sequence is found at the breakpoints proposed for the recombinations involved in duplications of ζ, block duplications of ζ-ζ-α-θ, and deletion of α; this common junction sequence is called a J sequence (Cheng et al., 1987). Part of the J sequence is very similar to the 3'-untranslated sequence of the α-globin gene, and this homology is likely to have been involved in the recombination that deleted the α-globin gene from the ζ2-ζ3-θ2 gene set. The deletions that occurred frequently during the propagation of λ clones carrying rabbit genomic DNA containing this gene cluster also mapped close to the J sequences

† Abbreviations used: LCR, locus control region; kb, 10³ bases or base-pairs; bp, base-pair(s).
`
`
`
`
`
`
`Rapid Evolution in G + C-rich Regions
`
`
`
(Cheng et al., 1987), arguing that these sequences could constitute a hot-spot for recombination. This gene cluster contains at least one active gene, the adult α-globin gene (Cheng et al., 1986), and the gene ζ0 is the most likely candidate to encode the embryonic ζ-globin found in rabbit. The remaining ζ-globin genes appear to be pseudogenes (Cheng et al., 1988).

The α-like globin gene cluster in human is located very close to the telomere of the short arm of chromosome 16 in a segment of very G + C-rich DNA that continues as far as 2000 kb (Harris et al., 1990). The cluster contains a functional ζ2 gene encoding an embryonic ζ-globin polypeptide, a non-functional ζ1 gene that is only slightly divergent from ζ2, a highly divergent ψα2 pseudogene, a moderately divergent pseudogene ψα1 that has lost its CpG island, duplicated functional adult α-globin genes α2 and α1, and a θ1 gene that produces transcripts, but for which no polypeptide product has been identified (for a review, see Higgs et al., 1989). The genes are arranged in the order 5'-ζ2-ζ1-ψα2-ψα1-α2-α1-θ1-3'. The α-globin genes were duplicated in the stem simians (Sawada & Schmid, 1986), and the 5' α gene found in several other mammals (Schon et al., 1982) is orthologous to the human ψα1 gene, based on sequence similarities in the 5' flank (Hardison & Gelinas, 1986; Sawada & Schmid, 1986). The duplication of α genes in higher primates has left a long homology block of about 4 kb that is divided into three regions called X, Y and Z. A sequence that confers an enhanced rate of recombination in COS cells has been mapped to the first 300 bp of the X region (Hu & Shen, 1987). Almost 20 kb of continuous sequence has been determined from the human α-like globin gene cluster (see Materials and Methods), encompassing the region from ζ1 through θ1. This sequence, along with that reported for rabbit in this paper, allows a comprehensive comparison of a major portion of the gene clusters in rabbit and human, including the three major members of the α-like globin gene cluster, ζ, α and θ. Parallels and differences are discussed for these interspecies comparisons of α-like and β-like globin gene clusters.

2. Materials and Methods

(a) Determination of DNA sequence

Clones of rabbit DNA containing the α-like globin gene cluster were isolated from a library of rabbit genomic DNA (Maniatis et al., 1978; Cheng et al., 1986, 1987, 1988). The recombinant phage λRαG1 containing the genes ζ1, α and θ1 was used to generate restriction fragments that were subcloned into plasmids pBR322, pUC or pBluescript, and into the phage M13. Most of the DNA sequence was determined by the dideoxynucleotide chain termination method (Sanger et al., 1977) and was confirmed in some regions with the base-specific chemical degradation method (Maxam & Gilbert, 1980). In some cases, directed deletions for rapid sequence determination were constructed by using exonuclease III and mung bean nuclease (Henikoff, 1984). The strategies employed in determining the sequence are shown in Fig. 1. Much of the sequence was determined from both strands. The sequence was not determined through 5 restriction sites that were the ends of fragments used to construct subclones. These sites are internal to either C repeats or J sequences. In 3 cases, BglII*, PstI* and BamHI* in C47 (Fig. 1), the sequences around the restriction site match with similar sequences repeated elsewhere in the gene cluster or genome, hence it is unlikely that any sequence is missing. The short segment from 4278 to 4283 was highly compressed on the gels and hence this sequence could not be determined unambiguously. The new sequences were combined with previously determined sequences (Cheng et al., 1986, 1987, 1988; Krane et al., 1991) to generate a composite sequence extending from the 5' flank of ζ1 to the 3' flank of θ1. The ζ1-α-θ1 sequence is available as GenBank accession number M35026, and the sequence of Jθ1 is available as EMBL accession number X60985.

A composite file of sequences from the human α-like globin gene cluster was assembled from data in the following sources: the 5' flank of gene ζ1 (Willard et al., 1985), gene ζ1 (Proudfoot et al., 1982), pseudogene ψα2 (Hardison et al., 1986), intergenic sequence between ζ1 and ψα1 (Sawada et al., 1983), pseudogene ψα1 (Proudfoot & Maniatis, 1980), homology blocks containing α2 and α1 (Liebhaber et al., 1980; Michelson & Orkin, 1980, 1983; Hess et al., 1983, 1984), 3' flank of α1 (Hardison & Gelinas, 1986), intergenic sequence between α1 and θ1 (Bailey, 1990), and θ1 (Hsu et al., 1988).

(b) Analysis of the DNA sequence

Direct and inverted repeats, open reading frames and nucleotide strings were identified with the computer program DNA Inspector IIe (Textco) running on a Macintosh computer. Plots of G + C richness and CpG and GpC dinucleotides were made from the output of the BASIC computer program "Di-nt Frequency" (Krane, 1990) scanning windows 50 bp in length.

Local alignments of the 2 sequences were generated using the program SIM (Huang et al., 1990), run on a Sun4 workstation. SIM generates alignments between very long DNA sequences while using computer space efficiently, and it produces alignments that are optimized to parameters set by the user. All alignments discussed in this paper were obtained where matches count 1, mismatches count -1, the gap-open penalty is 4·0, and the gap-extension penalty is 0·4 per nucleotide. With the single exception of Fig. 7, the number of local alignments to be used was determined by using theoretical results (Karlin & Altschul, 1990) on the expected number of gap-free alignments. Specifically, we used only those alignments whose score exceeds a threshold r, defined so that the probability is 0·8 that random sequences matching the given sequences in length and in nucleotide composition have a gap-free alignment scoring at least r. Informally speaking, r is a threshold where we expect that 2 random sequences of the given length and composition would exhibit 1 or several gap-free alignments, i.e. a dot-plot of random sequences at these criteria would contain a few specks. The large number of local alignments generated by SIM were organized and viewed by a graphical user interface called LAV (local alignment viewer; Schwartz et al., 1991). Figs 6, 7 and 8 were drawn directly from the SIM alignments and from hand-generated files giving positions of sequence features (such as exons, introns and repeated sequences) using the program LAD (local alignment diagramer).
`
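The window scan behind the G + C and dinucleotide plots can be illustrated with a minimal sketch. This is not the BASIC program "Di-nt Frequency"; the function name window_stats, the use of non-overlapping windows, and the choice to ignore dinucleotides that straddle a window boundary are assumptions made only for illustration.

```python
def window_stats(seq, window=50):
    """Return (start, percent_gc, cpg_count, gpc_count) for each full window.

    Assumptions (not from the paper): windows do not overlap, a trailing
    partial window is dropped, and dinucleotides spanning two windows
    are not counted.
    """
    seq = seq.upper()
    stats = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        cpg = chunk.count("CG")  # CpG: C followed by G on the same strand
        gpc = chunk.count("GC")  # GpC: G followed by C on the same strand
        stats.append((start, 100.0 * gc / window, cpg, gpc))
    return stats


if __name__ == "__main__":
    toy = "CG" * 25 + "AT" * 25  # one CpG-rich window, one A + T-only window
    for row in window_stats(toy):
        print(row)
```

Comparing the CpG and GpC counts window by window is what distinguishes a CpG island from a region that is merely G + C-rich: in most of the genome CpG is depleted relative to GpC, whereas in an island the two are comparable.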
`
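To make the scoring scheme concrete (match 1, mismatch -1, gap-open 4·0, gap-extension 0·4 per nucleotide), the sketch below computes the best local alignment score under affine gap penalties with a standard quadratic-space dynamic programme. It is not the SIM program, which additionally reports many non-overlapping alignments and is far more economical with memory; the function name and the convention that a gap of k nucleotides costs 4·0 + 0·4k are assumptions.

```python
def local_align_score(a, b, match=1.0, mismatch=-1.0,
                      gap_open=4.0, gap_extend=0.4):
    """Best local alignment score with affine gaps (score only).

    Assumed convention: a gap of length k costs gap_open + k * gap_extend.
    """
    neg_inf = float("-inf")
    best = 0.0
    h_prev = [0.0] * (len(b) + 1)   # best score ending at (i-1, j)
    f = [neg_inf] * (len(b) + 1)    # best score ending in a gap in b, per column
    for i in range(1, len(a) + 1):
        h_cur = [0.0] * (len(b) + 1)
        e = neg_inf                 # best score ending in a gap in a, this row
        for j in range(1, len(b) + 1):
            e = max(e - gap_extend, h_cur[j - 1] - gap_open - gap_extend)
            f[j] = max(f[j] - gap_extend, h_prev[j] - gap_open - gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            h_cur[j] = max(0.0, h_prev[j - 1] + s, e, f[j])
            best = max(best, h_cur[j])
        h_prev = h_cur
    return best


if __name__ == "__main__":
    # Twenty matches bridged by one 4-nucleotide gap: 20 - (4.0 + 4 * 0.4)
    print(local_align_score("A" * 10 + "C" * 10, "A" * 10 + "GGGG" + "C" * 10))
```

With these parameters a gap is worth opening only when it rescues more than about four or five additional matches, which is one reason long unmatched intergenic stretches break the comparison into separate local alignments rather than one long gapped alignment.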
`
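The Materials and Methods also describe chaining SIM local alignments into "meta-alignments" by computing an optimal path in a directed acyclic graph, where one alignment may follow another only if it starts after the other ends in both sequences, and the chain maximizes the number of matches. A minimal sketch of that idea follows; the LocalAlignment record, its field names and the quadratic-time search are illustrative assumptions, not the authors' implementation.

```python
from typing import List, NamedTuple, Tuple


class LocalAlignment(NamedTuple):
    start_a: int   # first aligned position in sequence A
    end_a: int     # last aligned position in sequence A
    start_b: int   # first aligned position in sequence B
    end_b: int     # last aligned position in sequence B
    matches: int   # number of identical aligned nucleotides


def chain_alignments(alns: List[LocalAlignment]) -> Tuple[int, List[LocalAlignment]]:
    """Return (total matches, chain) for the best chain of alignments."""
    if not alns:
        return 0, []
    # Sorting by start position gives a topological order of the DAG,
    # because a predecessor always starts before its successor.
    order = sorted(range(len(alns)), key=lambda k: (alns[k].start_a, alns[k].start_b))
    best = {}
    prev = {}
    for pos, k in enumerate(order):
        best[k] = alns[k].matches
        prev[k] = None
        for u in order[:pos]:
            follows = alns[u].end_a < alns[k].start_a and alns[u].end_b < alns[k].start_b
            if follows and best[u] + alns[k].matches > best[k]:
                best[k] = best[u] + alns[k].matches
                prev[k] = u
    tail = max(best, key=best.get)
    total = best[tail]
    chain = []
    while tail is not None:
        chain.append(alns[tail])
        tail = prev[tail]
    chain.reverse()
    return total, chain


if __name__ == "__main__":
    demo = [
        LocalAlignment(1, 100, 1, 100, 90),
        LocalAlignment(50, 200, 150, 300, 120),
        LocalAlignment(120, 220, 120, 210, 80),
        LocalAlignment(250, 300, 320, 400, 40),
    ]
    print(chain_alignments(demo))
```

Dividing the total number of matches in the chain by the length of the region gives the kind of "fraction matching" estimate that the meta-alignments are used for.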
`
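The divergence between the rabbit and human α-globin genes is corrected for multiple substitutions at a single site (Jukes & Cantor, 1969), with 80 million years taken as the divergence time. The sketch below shows the standard Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), where p is the observed fraction of differing sites. The function names are hypothetical, and dividing by twice the divergence time to get a per-lineage rate is an assumption about how a rate would be derived, not a statement of the authors' procedure.

```python
import math


def jukes_cantor_distance(p):
    """Substitutions per site, corrected for multiple hits.

    p is the observed proportion of sites that differ (0 <= p < 0.75).
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("Jukes-Cantor correction is defined for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def rate_per_site_per_year(distance, divergence_years=80e6):
    """Assumed per-lineage rate: both lineages accumulate changes since the split."""
    return distance / (2.0 * divergence_years)


if __name__ == "__main__":
    d = jukes_cantor_distance(0.30)   # 30% observed differences (illustrative)
    print(d, rate_per_site_per_year(d))
```

The correction matters most for diverged intergenic sequence: at 30% observed difference the corrected distance is already about 0·38 substitutions per site.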
To obtain a quantitative estimate of the fraction of the α-like or β-like globin gene clusters that match between rabbit and human (see Results, section (h)), local alignments were optimally chained together to make "meta-alignments" using an algorithm for computing optimal paths in a directed acyclic graph (Cormen et al., 1990). In the meta-alignments, 1 SIM alignment follows another only if its starting positions in the 2 sequences follow the ending positions of the other alignment. The chaining was done so as to maximize the number of matches in the meta-alignment. For the rabbit and human α-globin genes, the divergence was determined from the aligned sequences and corrected for multiple substitutions at a single site (Jukes & Cantor, 1969) to obtain the number of substitutions per site. The time of divergence between rabbit and human was taken as the time of the mammalian radiation, about 80 million years ago (Romero-Herrera et al., 1973).

Figure 1. Strategies used to determine the DNA sequence of regions I, II and III. Arrows above the line indicate the extent of individual readings of the top strand, and arrows below the line correspond to readings of the lower strand. The dotted lines drawn above parts of regions II and III cover the sequence previously determined by Cheng et al. (1987). Open arrows indicate the positions of C repeats within the gene cluster; they point in the direction of their poly(A) tracts. Asterisks mark internal restriction sites through which the sequence has not been determined.

3. Results

(a) Nucleotide sequence of the rabbit α-like globin gene cluster

A diagram of the portion of the rabbit α-like globin gene cluster isolated in 38 kb of cloned DNA (Cheng et al., 1986, 1987, 1988) is shown in Figure 2. Analysis of a population of laboratory rabbits by genomic blot-hybridization shows that the gene cluster can extend farther 3' in some haplotypes as a result of additional duplications of the ζ-ζ-θ gene set (Cheng & Hardison, 1988). The homology blocks containing ζ and θ genes (Z blocks and T blocks, respectively) are bounded by a characteristic junction sequence called a J sequence (Fig. 2). In this paper they will be referred to by the name of the gene that they follow, e.g. Jζ0 is 3' to gene ζ0. As will be explained below, the J sequences extend from the C repeat at the 5' end through a sequence homologous to the 3' portions of the α-globin gene (Cheng et al., 1987).

New sequence data (Fig. 1) were combined with previously published sequences to make a contiguous sequence of 10,621 bp, beginning 2273 bp 5' to ζ1, extending through the α-globin gene and ending 204 bp 3' to the polyadenylation signal of the θ1 pseudogene. It is available from the GenBank database under accession number M35026. This three-gene set contains homologs to each of the three α-like globin genes found in mammalian species, and it contains most of the DNA in the basic set of genes that has duplicated to evolve this gene cluster.

Figure 2. Organization of the rabbit α-like globin gene cluster. Related genes are represented by boxes with the same shading, C repeats are shown as filled triangles and J sequences are shown as open, pointed boxes. A repetitive sequence found by hybridization experiments (Cheng et al., 1987) is shown as a stippled polygon between genes ζ2 and ζ3; current data do not exclude the possibility that it is one or several divergent C repeats. BamHI (B) sites within the gene cluster are indicated on the 2nd line, and sequenced segments are shown as thick regions on this line. The new sequences reported in this paper are shaded on the 2nd line. Boxes below the gene map show the T homology blocks containing θ genes and Z homology blocks containing ζ genes separated by junction, or J, sequences. Horizontal lines below the homology blocks indicate the positions of λ clones isolated from this region of the rabbit genome by Cheng et al. (1987, 1988).

(b) Short, interspersed C repeats

The rabbit α-like globin gene cluster contains at least 15 C repeats, the predominant short interspersed repeat in the rabbit genome (Cheng et al., 1984; Hardison & Printz, 1985). These repeats tend to insert into or nearby one another in groups (Krane et al., 1991); this is readily apparent in the 5' flanks of genes ζ0 and ζ2 (Fig. 2), although the segment from α through θ1 has remained free of C repeats. A total of eight C repeats are in the ζ1-α-θ1 sequence, comprising 2·7 of the 10·6 kb, or 25% of the contiguous sequence. At least seven additional C repeats have also been detected by sequence analysis and hybridization studies in the remainder of the gene cluster (Fig. 2), and it is likely that the unidentified repeats between ζ2 and ζ3 include more C repeats and possibly a J sequence. In contrast, no members of the predominant long interspersed family of repeated DNA, L1Oc (Demers et al., 1989), have been found in the α-like globin gene cluster either by sequence determination or by hybridization studies (Cheng et al., 1987).

(c) Base composition

The base composition of the ζ1-α-θ1 DNA sequence is 62% G + C and 38% A + T; this is essentially the reverse of the values reported for the whole rabbit genome, 44% G + C and 56% A + T (Sober, 1968), or for the rabbit β-like globin gene cluster, 39% G + C and 61% A + T (Margot et al., 1989). The high G + C content of the α-like globin gene cluster is uniform over most of the gene cluster (Fig. 3(a)), and regions of greater than 65% G + C content can be found throughout the sequence, not just in close association with functional genes. This is in striking contrast to the β-like globin gene cluster, which has a very low G + C content throughout, exemplified by the ψδ-β region shown in Figure 3(b). This A + T richness is interrupted mainly by a G + C-rich C repeat, C14, from a recently transposing subfamily (Krane et al., 1991). Seven segments of the rabbit α-like globin gene cluster are noticeably high in A + T content (> 60%) relative to the remainder of the cluster (Fig. 3(a)), but in five of these cases the A + T richness is derived from the poly(A) and (CT)n tracts found at the 3' end of C repeat sequences (Krane et al., 1991) that have transposed into this region of the rabbit genome. As previously noted (Cheng et al., 1986), the longest A + T-rich stretch (between α and θ1) is flanked by 10 bp-long inverted repeats, suggesting that it too may have entered this gene cluster by a transposition event.

As expected for a sequence with a low A + T content, many open reading frames are observed on both strands (Fig. 4). Some of the open reading frames correspond to the exons of the α-globin gene, the only functional gene in this region, but most do not correspond to regions that encode known polypeptides. This illustrates the difficulty in identifying potential genes by mapping open reading frames in G + C-rich sequences.

(d) CpG islands

Similar to the situation in human (Bird et al., 1987; Fischel-Ghodsian et al., 1987), the rabbit α-like globin gene cluster contains CpG islands, whereas the β-like globin gene cluster does not. The rabbit α-globin gene has many CpG dinucleotides in its 5' flank and internally (Fig. 3(a)). This abundance of CpGs is much greater than is seen for gene ζ1, which is equally rich in G + C content; thus, the cluster of CpG dinucleotides in the α-globin gene does not simply result from a high G + C content. The θ1-globin gene also has a strikingly high level of
`
`
`
[Figure 3, panels (a) and (b): plots of G + C content (%) and of the numbers of CpG and GpC dinucleotides against nucleotide position.]
`
`
`
`
`
CpG dinucleotides (Fig. 3(a)), even though it does not encode a functional globin polypeptide (Cheng et al., 1986). Like the bulk of the genome, the rabbit β-like globin gene cluster has very few CpG dinucleotides (note the change of scale in Fig. 3(b)), and the small group of CpGs that are seen are in the recently inserted C14 repeat.

Figure 3. Base composition and dinucleotide frequencies within the rabbit α and β-like gene clusters. Genes and C repeats are shown as in Fig. 2; L1 repeats within the β-like gene cluster are shown as striped arrowheads. The number of CpG and GpC dinucleotides in 50-nucleotide segments across the regions are plotted in the lower 2 panels of (a) and (b). The α-like gene cluster is shown in (a), and a region of the β-like gene cluster of equal size containing the ψδ and β genes is shown in (b). C repeats are numbered.

(e) Simple repeats in the rabbit α-like globin gene cluster

Only one string of pyrimidine nucleotides longer than 14 is found in the rabbit α-like globin gene cluster (Fig. 4), and no strings of polypurines are found, whereas such sequences are common in the β-like globin gene cluster (Margot et al., 1989). Most polypyrimidine strings within the α-like globin gene cluster occur in the (CT)n sequence in C repeats, and any such sequences involving C repeats have been omitted from this analysis. Tracts of tandemly repeating purine/pyrimidine dinucleotides, (RY)n, where n is greater than six, are found mainly around the θ1 gene; these tracts also occur frequently, but are relatively evenly dispersed in the β-like globin gene cluster (Margot et al., 1989). Two of the five tracts shown in Figure 4 correspond to tandem repeats of the dinucleotide ApC. These are found immediately 3' to the α gene and 570 bp 5' to θ1. No homologs of either of these sequences have been found in the human α-like globin gene cluster, unlike an (AC)13 tract within the rabbit β-like globin gene cluster that has been shown to have a homolog in human (Margot et al., 1989).

Direct repeats that contain at least 18 nucleotides with no more than two mismatches are also listed in Figure 4. One especially notable region is a segment of 134 nucleotides between the α and θ1 genes (number 13, Fig. 4) that contains five tandem, imperfect repeats of the 26 nucleotide sequence CACCGCCGTAGCCGGGAATGGTGGGG (Cheng et al., 1986). (AC)n tracts account for two of the direct repeats (numbers 6 and 11, Fig. 4). Direct repeats 8, 9, 10 and 12 contain imperfect copies of the sequence GCCC; note that these are found in the CpG-rich islands of the α and θ1 genes. An inverted repeat of 18 nucleotides separated by 37 nucleotides

Figure 4. Summary of sequence features in the rabbit α-like globin gene cluster. Genes and repeats are shown as in Fig. 2, except that the introns of the genes are shown as open boxes. Vertical bars below the line are BamHI restriction sites. ORFs →, open reading frames longer than 300 nucleotides on the top strand; ORFs ←, open reading frames longer than 300 nucleotides on the bottom strand; Y, strings of pyrimidine residues longer than 14 nucleotides on the top strand; RY, alternating purine and pyrimidine residues longer than 14 nucleotides; Dir Rpt, direct repeats longer than 18 nucleotides with no more than 2 mismatches; Inv Rpt, inverted repeats longer than 18 nucleotides with no more than 2 mismatches. The direct and inverted repeats are indicated by arrowheads either pointing upward for the first repeat or downward for the second repeat; a series of 5 26-nucleotide tandem repeats between the α and θ1 genes is shown as 5 triangles numbered 13. Repeats involving pairs of C repeats or homologous regions of the coding regions of genes are not shown.
`
`
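The simple-repeat classes used in Figure 4 (pyrimidine strings and alternating purine/pyrimidine tracts longer than 14 nucleotides) are easy to express as a scan over the sequence. The sketch below is not DNA Inspector IIe; the function names, the half-open (start, end) coordinates and the treatment of ambiguous bases as tract breakers are assumptions for illustration.

```python
import re


def pyrimidine_tracts(seq, min_len=15):
    """Runs of C/T at least min_len long, as half-open (start, end) pairs."""
    pattern = re.compile(r"[CT]{%d,}" % min_len)
    return [(m.start(), m.end()) for m in pattern.finditer(seq.upper())]


def alternating_ry_tracts(seq, min_len=15):
    """Runs in which purines (R = A/G) and pyrimidines (Y = C/T) strictly alternate."""
    classes = "".join(
        "R" if base in "AG" else "Y" if base in "CT" else "N"
        for base in seq.upper()
    )
    tracts = []
    start = 0
    for i in range(1, len(classes) + 1):
        run_ends = (
            i == len(classes)
            or classes[i] == "N"
            or classes[i - 1] == "N"
            or classes[i] == classes[i - 1]
        )
        if run_ends:
            if i - start >= min_len and "N" not in classes[start:i]:
                tracts.append((start, i))
            start = i
    return tracts


if __name__ == "__main__":
    print(pyrimidine_tracts("AAAA" + "CT" * 8 + "GGGG"))      # a (CT)8 string
    print(alternating_ry_tracts("CC" + "AC" * 8 + "TT"))      # an (AC)8 tract
```

An (AC)n tract such as those noted 3' to the α gene is reported by the second function, because A is a purine and C a pyrimidine, while a (CT)n tract of the kind found at the 3' end of C repeats is reported by the first.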
`
`
(f) Rabbit J sequences

their 5' end, hence this C repeat can be considered part of the J sequence, but some J sequences, such as Jθ1 and Jζ3, do not have C repeats at their 3' ends (Fig. 2). The J sequence, therefore, begins with a C repeat and continues for a total of about 1000 nucleotides. Part of the sequence 5' to Jθ1 matches with the consensus C repeat, indicating an additional C repeat, C46, in the gene cluster.

(g) Homology with the human α-like globin gene cluster

(i) Overall patterns of matches

The program SIM (Huang et al., 1990) was used to align the sequence of the rabbit α-like globin gene cluster (10,621 nucleotides) with a composite sequence of the human gene cluster (19,574 nucleotides), containing ζ1-ψα2-ψα1-α2-α1-θ1. This program generates non-overlapping local alignments of very long sequences that are optimized to the scoring parameters specified by the user. These alignments are readily analyzed using a graphical user interface called LAV (local alignment viewer)

Figure 5. Schematic diagram showing matches among rabbit junction (J) sequences, human ψα1 and rabbit α. The percentage of matching nucleotides, determined from pairwise alignments generated by SIM, are given in the shaded areas between the diagrams of the genes and J sequences. C and Alu repeats are filled arrows, exons are grey boxes, and the 3'-untranslated region of the α-globin gene (3' UTα) or its homologs are cross-hatched boxes. A slanted line in a gene means non-matching sequences were omitted for clarity. nt, nucleotides.
is found 5' to the θ1 gene (