`
`
`
GENOMICS 13, 741-760 (1992)

The β Globin Gene Cluster of the Prosimian Primate Galago crassicaudatus: Nucleotide Sequence Determination of the 41-kb Cluster and Comparative Sequence Analyses

DANILO A. TAGLE,*,1 MICHAEL J. STANHOPE,† DAVID R. SIEMIENIAK,‡,2 PHILIP BENSON,† MORRIS GOODMAN,† AND JERRY L. SLIGHTOM†,‡

Departments of *Molecular Biology and Genetics and †Anatomy and Cell Biology, Wayne State University School of Medicine, Detroit, Michigan 48201; and ‡Molecular Biology Unit 7242, The Upjohn Company, Kalamazoo, Michigan 49007

Received November 1, 1991; revised March 27, 1992
`
The nucleotide sequence of the β globin gene cluster of the prosimian Galago crassicaudatus has been determined. A total sequence spanning 41,101 bp contains and links together previously published sequences of the five galago β-like globin genes (5'-ε-γ-ψη-δ-β-3'). A computer-aided search for middle interspersed repetitive sequences identified 10 LINE (L1) elements, including a 5' truncated repeat that is orthologous to the full-length L1 element found in the human ε-γ intergenic region. SINE elements that were identified included one Alu type I repeat, four Alu type II repeats, and two methionine tRNA-derived Monomer (type III) elements. Alu type II and Monomer sequences are unique to the galago genome. Structural analyses of the cluster sequence reveal that it is relatively A + T rich (about 62%) and that regions with high G + C content are associated primarily with globin coding regions. Comparative analyses with the β globin cluster sequences of human, rabbit, and mouse reveal extensive sequence homologies in their genic regions, but only human, galago, and rabbit sequences share extensive intergenic sequence homologies. Divergence analyses of aligned intergenic and flanking sequences from orthologous human, galago, and rabbit sequences show a gradation in the rate of nucleotide sequence evolution along the cluster, where sequences 5' of the ε globin gene region show the least sequence divergence and sequences just 5' of the β globin gene region show the greatest sequence divergence. © 1992 Academic Press, Inc.

Sequence data from this article have been deposited with the GenBank Data Library under Accession No. M73981.

1 Present addresses: Department of Human Genetics, 4562 MSRB II, The University of Michigan Medical Center, Ann Arbor, MI 48109-0650.

2 Howard Hughes Institute 4522 MSRB I, The University of Michigan Medical Center, Ann Arbor, MI 48109-0650.

0888-7543/92 $5.00
Copyright © 1992 by Academic Press, Inc.
All rights of reproduction in any form reserved.

INTRODUCTION

The different developmentally expressed β-type globin genes of primates and other mammals can be traced back to a single progenitor gene, which tandemly duplicated some 150 to 200 million years ago (MYA) in the early mammals. By about 110 to 135 MYA, the two gene lines from the tandem duplication had differentiated into an embryonically expressed locus (proto-ε) and an ontogenetically later expressed locus (proto-β) (Efstratiadis et al., 1980; Czelusniak et al., 1982; Koop and Goodman, 1988). Further tandem duplications of the embryonic 5' locus and adult 3' locus, in the early eutherian mammals (85-100 MYA), led to a genomic domain of five developmentally regulated loci (5'-ε-γ-η-δ-β-3'). In this β globin gene cluster, ε, γ, and η originated from the 5' proto-ε locus and were embryonically expressed, while δ and β originated from the 3' proto-β locus and were expressed in the non-embryonic, later ontogenetic stages of life (Hardies et al., 1984; Hardison, 1984; Hill et al., 1984; Goodman et al., 1984, 1987). This ancestral eutherian β globin gene cluster underwent varying degrees of change during the evolution of different eutherian orders. For example, lagomorph (rabbit) and rodent (mouse) β globin gene clusters lack the η globin locus, whereas artiodactyl (goat, sheep, and ox) clusters lack the γ globin locus. However, each primate β globin gene cluster so far examined contains sequences from all five loci (ε, γ, η, δ, β) in the ancestral 5' to 3' arrangement.

The β globin gene cluster in mammals ranges in length from approximately 20 kb in lemurs to about 90 kb in goats (Harris et al., 1984, 1986; Townes et al., 1984). The complete nucleotide sequences of the β globin gene clusters of human (Collins and Weissman, 1984; Li et al., 1985), rabbit (Margot et al., 1989), and mouse (Shehee et al., 1989) have been determined, and these sequences reveal that only about 10% of the DNA in a mammalian β globin gene cluster encodes globin mRNAs.

Very few orthologous sequence data sets that compare noncoding DNAs from a series of species are presently available. Of those that do exist, the most extensive are from closely related simian primates; they involve intergenic sequences that flank the ψα globin genes (Sawada et al., 1985), noncoding sequences that include and flank
`
`SKI Exhibit 2030 - Page 1 of 20
`
`
`
the ψη globin gene (Bailey et al., 1991; Fitch et al., 1988; Koop et al., 1986; Maeda et al., 1988; Miyamoto et al., 1987), and δ-β intergenic sequences (Savatier et al., 1985, 1987). Consequently, much still remains to be learned about the sequence features, evolution, and functional constraints that act upon intergenic noncoding sequences. Comparative analyses of the β globin gene cluster sequences from distantly related species can be used to identify evolutionarily conserved DNA elements or phylogenetic footprints, which can provide insights into the evolution of gene clusters at the molecular level (Tagle et al., 1988; Gumucio et al., 1991).

In this context, we decided to determine and analyze the complete nucleotide sequence of the β globin gene cluster from a distantly related primate, the prosimian galago (Galago crassicaudatus). Fossil evidence indicates that the prosimian (galago)/simian (human) divergence time dates back to about 55 MYA (Fleagle, 1988). Here we report 41,101 bp of continuous sequence spanning the entire β globin cluster of the galago. This cluster contains five globin-like genes that are arranged in the order of their developmental expression: 5'-ε-γ-(embryonic)-ψη-(nonexpressed)-δ-β-(fetal and postnatal)-3' (Tagle, 1990; Tagle et al., 1988, 1991). An analysis of the compositional and structural features of the complete cluster sequence of galago is presented. We have identified the DNA sequences that are orthologous (derived from the same DNA sequence in the last common ancestral species) among galago, human, rabbit, and mouse β globin gene clusters and the locations of short and long interspersed repetitive nuclear elements (SINEs and LINEs) within the galago cluster. Differences in the degree of sequence divergence among the orthologous noncoding regions of galago, human, and rabbit have also been determined.

MATERIALS AND METHODS

Cloning and nucleotide sequencing. Twelve overlapping recombinant Charon 35 phage clones that span the galago β globin gene cluster (Fig. 1) were previously isolated and described (Tagle, 1990; Tagle et al., 1988, 1991). Phage clones λ Gcr 18.3, λ Gcr 11.9, λ Gcr 15.4, λ Gcr 16.1, λ Gcr 15.2A, and λ Gcr 15.2B were used to generate a complete and ordered set of EcoRI fragments (R1 to R22) that were subcloned into pUC-18 (Yanisch-Perron et al., 1985). Plasmid subclones R5 to R22 were sequenced by the dideoxynucleotide chain termination method (Sanger et al., 1977) and/or by the chemical cleavage method (Maxam and Gilbert, 1980). In the latter method, restriction fragments from R7, R9, R12, and parts of R8 and R10 were end-labeled and sequenced as described in Tagle et al. (1988) for the ε and γ globin genes and in Koop et al. (1989) for the ψη gene. The sequences of the galago δ and β globin genes have been presented by Tagle et al. (1991). These previously published galago β globin gene sequences represent only about 9 kb of the β globin gene cluster sequence presented here. The remaining intergenic sequences were obtained by the dideoxynucleotide chain termination method, initially using universal forward and reverse vector primers and then using synthetic oligomers that were designed from the previously determined sequence (referred to as oligomer walking). Nucleotide sequence readings ranged from 400 to 950 bp. Approximately 95% of the intergenic nucleotide sequences were determined on both DNA strands, and the remaining 5% were determined from the same strand at least twice using independent sequence reactions (see Fig. 2). Areas of compression or strong secondary structures were resolved by sequencing at a higher temperature (55°C) using Taq DNA polymerase. Selected BamHI- and HindIII-generated fragments (not shown) were also subcloned into pUC-18 and sequenced to verify nucleotide sequence contiguity across the cloning site junctions of certain EcoRI clones.

Computer-aided analyses of nucleotide sequences. Base composition and other sequence features [such as open reading frames (ORFs), strand asymmetry, subsequence breakdown, and repetitive elements] of the galago β globin gene cluster were identified and analyzed using the DNA analysis computer programs available from The Genetics Computer Group package (GCG; Madison, WI; Devereux et al., 1984) and the MacVector Package Version 6.6 from IBI Technologies (New Haven, CT). In addition to the ORF search, the galago cluster sequence was submitted to GRAIL (Gene Recognition and Analysis Internet Link; Oak Ridge, TN; Uberbacher and Mural, 1991) for additional searches of potential protein coding regions. The locations of shared nucleotide sequence identities between the galago β globin gene cluster sequences and human (Collins and Weissman, 1984; Li et al., 1985), rabbit (Margot et al., 1989), or mouse (Shehee et al., 1989) β globin gene cluster sequences were first identified by dot-plot comparisons using the GCG computer program COMPARE. Homology plots of the galago globin gene cluster sequence with itself were used to locate repeated regions. In addition, identity searches using SINE (Daniels and Deininger, 1983, 1985, 1991; Daniels et al., 1983) and L1 (Hattori et al., 1986; Scott et al., 1987) consensus sequences were used to further identify and define the boundaries of these repetitive elements.

Pairwise alignments among human, galago, and rabbit β globin gene cluster sequences in their matching regions were obtained using the alignment programs of Smith and Waterman (1981), Wilbur and Lipman (1983), Lipman and Pearson (1985), and Needleman and Wunsch (1970). These aligned sequences are available on diskettes from the authors upon request. In all cases, gaps were inserted into the alignments to increase sequence identities. Due to the conserved nature of the orthologous gene loci, alignments in these regions were used as anchor points to align the more diverged intergenic regions.

The Local Alignment Diagrammer program (LAD; Schwartz et al., 1991) was used to display the aligned sequences as plots. Each plot depicts pairwise interspecies alignments computed by the SIM program (Huang et al., 1990) with a score of 1 for matches, -1.5 for mismatches, -6 for opening a gap, and -0.2 for each symbol in the gap. An alignment is displayed only if its score exceeds a threshold (r), chosen so that the probability is 0.05 that random sequences matching the given sequences in length and nucleotide composition have a gap-free alignment scoring at least r.

Pairwise divergence values for the aligned sequences were calculated following the method of Nei and Gojobori (1986). Nucleotide substitutions (both transitions and transversions) and gaps (regardless of length) were counted as single events. Divergence values were corrected for hidden, superimposed substitutions by the method of Jukes and Cantor (1969).

RESULTS AND DISCUSSION

Organization and Nucleotide Sequence of the Galago β Globin Gene Cluster

A schematic diagram showing the organization of the galago β globin gene cluster is shown in Fig. 1A. The 12 overlapping recombinant phage clones used to reconstruct the structure of the cluster are also shown. Restriction enzyme site maps of these 12 clones revealed the extent of overlaps for each clone. Southern blot analysis of EcoRI restriction digests of the phage clones localized the five linked β-like globin genes and was confirmed by Southern blot analysis of EcoRI-digested galago genomic DNA (data not shown). Gene identities were confirmed by sequence orthology with other known primate and mammalian globin genes previously presented and characterized in Tagle et al. (1988) for the ε and γ globin genes, Tagle et al. (1991) for the δ and β globin genes, and Koop et al. (1989) and Bailey et al. (1991) for the ψη globin gene locus.

The nucleotide sequences of the remaining intergenic regions of the galago β globin gene cluster were determined from the ordered set of EcoRI plasmid subclones R5 through R22 that span the galago cluster (Fig. 1B). The complete sequence of the galago β globin gene cluster is presented in Fig. 2. In general, the organization and position of the β-type genes in the galago globin gene cluster (5'-ε-γ-ψη-δ-β-3') are similar to those hypothesized for the ancestral eutherian β globin gene cluster (Efstratiadis et al., 1980; Hardies et al., 1984; Hardison, 1984; Harris et al., 1984; Hill et al., 1984; Goodman et al., 1984, 1987). A general scheme that depicts some of the major elements and molecular events that have occurred in the evolution of the mammalian β globin gene cluster is presented in Fig. 3. The entire sequenced region of the galago β globin gene cluster spans 41,101 bp and includes sequences starting 4.3 kb 5' of the ε gene, intergenic DNAs spanning 6.1, 3.8, 11.1, and 3.1 kb between the ε and γ, γ and ψη, ψη and δ, and δ and β globin genes, respectively, and extending 4.7 kb 3' of the β globin gene.

Galago Globin Genes

The nucleotide sequence positions (from Fig. 2) for each β-like globin gene are summarized in Table 1.

FIG. 1. (A) Overview of the galago β globin gene cluster. The top most line shows the organization of the galago β globin gene cluster. The β-like globin genes are denoted by the rectangles and are labeled on top according to their orthologous relationship with known mammalian globin genes. The approximate position of the three exons separated by two introns and their 5' and 3' untranslated region are indicated inside the rectangles where the filled areas denote the coding regions. EcoRI restriction sites along the cluster are indicated by arrows below the line and the size of each EcoRI fragment is listed (kilobases) between the arrows. The ordered set of EcoRI fragments that were subcloned into pUC-18 and numbered 1 through 22 are shown below the cluster map. The 12 overlapping recombinant λ clones used to reconstruct the cluster organization are also shown below the cluster map. The length and position of the λ phage clones correspond directly to the galago linkage map. (B) Synopsis of sequence features of the galago β globin gene cluster. The extent of the galago β globin gene cluster region that was sequenced (EcoRI fragments 5 to 22) is indicated in the top line. The β-like globin genes are represented by the rectangles on the cluster map. The position, direction, and length of the interspersed repeat elements (SINEs and LINEs) are shown below the cluster map. Left-pointing arrowheads indicate that the repeat is oriented in the 3' to 5' direction relative to the direction of globin transcription. Right-pointing arrowheads for the repeats indicate a 5' to 3' direction. The locations of purine (R)n and pyrimidine (Y)n tracts (where n ≥ 10) and of alternating purine/pyrimidine (RY)n (where n ≥ 5) tracts are indicated by numbers corresponding to their lengths.
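The tract definitions in the Fig. 1B legend can be illustrated with a short sketch. This is not the program the authors used (they cite the GCG and MacVector packages). The function name `find_tracts` and the choice to report (RY)n tracts as a count of dinucleotide units are assumptions made for illustration only.

```python
import re

# Thresholds quoted in the Fig. 1 legend.
MIN_HOMO = 10  # (R)n and (Y)n tracts: n >= 10 bases
MIN_ALT = 5    # (RY)n tracts: n >= 5 purine/pyrimidine units

# R = purine (A or G), Y = pyrimidine (C or T).
PATTERNS = {
    "R": re.compile(r"[AG]{%d,}" % MIN_HOMO),
    "Y": re.compile(r"[CT]{%d,}" % MIN_HOMO),
    "RY": re.compile(r"(?:[AG][CT]){%d,}" % MIN_ALT),
}


def find_tracts(seq):
    """Return (kind, start, n) tuples sorted by position.

    start is 1-based; n is the number of bases for R and Y tracts and the
    number of dinucleotide units for RY tracts (an assumption, see above).
    """
    seq = seq.upper()
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(seq):
            n = len(match.group())
            if kind == "RY":
                n //= 2  # convert bases to RY units
            hits.append((kind, match.start() + 1, n))
    return sorted(hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    demo = "C" + "GA" * 6 + "C" + "TC" * 5 + "GG" + "ACAT" * 3
    for kind, start, n in find_tracts(demo):
        print(kind, start, n)
```

Regular expressions keep the definitions close to the legend's notation; a hand-written scanner would work equally well on a sequence of this size.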
`
`
`
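The divergence calculation described in Materials and Methods (substitutions and gaps counted as single events, followed by the Jukes and Cantor correction) can be sketched as follows. This is a minimal illustration, not the authors' program. The function names are hypothetical, and the choice of denominator (matches plus substitutions plus gap events) is an assumption, since the text does not state it.

```python
import math


def count_events(seq_a, seq_b, gap="-"):
    """Count matches, substitutions, and gap events in a pairwise alignment.

    A run of consecutive gap columns in the same row is one event,
    regardless of its length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    matches = substitutions = gap_events = 0
    open_gap = None  # row ("a" or "b") that currently has an open gap
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap and b == gap:
            continue  # column with gaps in both rows is uninformative
        if a == gap or b == gap:
            row = "a" if a == gap else "b"
            if row != open_gap:
                gap_events += 1
            open_gap = row
            continue
        open_gap = None
        if a == b:
            matches += 1
        else:
            substitutions += 1
    return matches, substitutions, gap_events


def jukes_cantor(p):
    """Jukes-Cantor corrected distance: d = -(3/4) ln(1 - 4p/3)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def divergence(seq_a, seq_b):
    """Return (uncorrected p, corrected d) for an aligned pair."""
    matches, substitutions, gap_events = count_events(seq_a, seq_b)
    # Assumed denominator: every scored event, with each gap counted once.
    p = (substitutions + gap_events) / (matches + substitutions + gap_events)
    return p, jukes_cantor(p)


if __name__ == "__main__":
    p, d = divergence("ACGTACGTAC--GTAC", "ACGTTCGTACGGGTAC")
    print(round(p, 4), round(d, 4))
```

The correction always increases the estimate (d ≥ p) because it accounts for multiple substitutions at the same site.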
`
The four expressed genes exhibit the basic exon-intron structure of globin genes consisting of three exons separated by two intervening sequences (reviewed in Collins and Weissman, 1984), and each is structurally able to encode the 146-residue globin polypeptide (Tagle et al., 1988, 1991). Codon usage for these expressed galago β globin genes was analyzed (Tagle, 1990), and there appears to be the same codon usage bias for the amino acids Leu (the codon CTG is preferred 51/67), Val (the codon GTG is preferred 39/63), and Gly (the codon GGC is preferred 22/47) as that found for human globin genes. This codon bias has also been observed for the genes of other mammalian and avian species that are available in GenBank (Wada et al., 1991). The preference for C and G in the third codon position is consistent with the prevalence of most mammalian and avian genes in GC-rich genomic regions (Wada et al., 1991).

Like all other primate ψη globin genes studied thus far, structural anomalies have maintained the galago ψη locus as a nonfunctional gene or pseudogene. These anomalies include two deletions (that resulted in a frameshift and six in-frame termination codons) and the loss of consensus intron splice junction sequences for introns 1 and 2 (Tagle, 1990; Bailey et al., 1991). In addition to having deleterious mutations in its coding and 5' regulatory regions (Tagle, 1990; Bailey et al., 1991), the galago ψη gene is truncated by the insertion of L1 elements in intron 2 (discussed below).

Search for Other Protein Coding Regions

The galago β globin gene cluster sequence was searched by computer for open reading frames that are at least 30 amino acids in length [enough to detect the smallest globin exon (exon 1) of 91 bp] and that follow Fickett's criteria (Fickett, 1982) of G + C base composition (window of 200 with a probability of 0.92 or greater). This conventional search routine identified a total of 354 ORFs (Fig. 4). Of these, 11 of the 16 ORFs that are each greater than 300 bp in length are associated with a functional globin gene (ε, γ, δ, or β). The longest ORF in the cluster is found in the β globin gene; it consists of 576 bp that begin 34 bp into intron 1 and extends through 225 bp into intron 2. The second most extensive ORF (489 bp) is in the δ globin gene. It also starts 34 bp into intron 1 and extends through 141 bp into intron 2, and the similarity of this ORF with that of the β globin gene is due to concerted evolution between the two loci (Tagle et al., 1991). Of the remaining five ORFs greater than 300 bp, three are associated with L1 elements (positions 20,796 to 21,128; 20,897 to 21,196; and 27,828 to 27,493 of Fig. 2). Another ORF is associated with the Alu element (GcAluII-3), which shows an ORF (positions 1847 to 2168 of Fig. 2) throughout its entirety and may represent a relatively recent insertion event (discussed below). The remaining ORF at positions 32,071 to 32,391 (on the complementary strand, the strand not presented in Fig. 2) appears not to be associated with any known structural feature of the β globin gene cluster sequence (i.e., with globin genes, repetitive elements, or simple repeat sequences). A search of the translated sequences of this ORF against GenBank did not reveal homology with any known gene.

While the above search routine of examining the positional and compositional bias of the sequence for ORFs correctly identified all the expressed galago β-like globin genes, the background noise was too high. The galago sequence was therefore analyzed by GRAIL for protein coding regions. This search routine combines a set of seven sensor algorithms that measure important attributes of coding DNA versus noncoding DNA (such as statistical frame bias, Fickett's base composition, dinucleotide frequencies, and coding six-tuple word preferences) on a sliding window of 100 bases (Uberbacher and Mural, 1991). These sensors were applied to the known coding and noncoding human DNA sequences that are in GenBank, and the output was used to train a neural network to dissect potential protein coding regions from a set of unknown sequence, such as the galago globin cluster sequence, based on these "learned" attributes. The results of this analysis on the galago cluster sequence are shown in Fig. 4. Both sense and antisense strands were searched for potential exons, but only the sense strand showed significant peaks. Peaks with scores of greater than 0.8 were ranked as excellent candidates for coding sequences. The results of this analysis correctly predicted 75% or at least one of the exons of each transcribed galago globin gene (Fig. 4). The results of this analysis indicate that GRAIL, trained on human gene sequences, can also be used to identify protein coding regions of other mammalian species. Of the transcribed exons greater than 100 bp (exons 2 and 3), 88% were identified. Only exon 3 of the ε globin gene was missed. However, only about 50% of the exons less than 100 bp were found. As expected, the search did not identify the η globin pseudogene as one of the protein coding regions but did identify segments of two L1 elements (L1Gc-5 and L1Gc-6; see Figs. 1B and 4) as potential exons. Two peaks (positions 7011 to 7091 and 33,804 to 33,851) were also identified as potential coding sequences where no

FIG. 2. The complete nucleotide sequence of the galago β globin gene cluster. The total sequence of 41,101 bp was obtained from 13 EcoRI subcloned fragments. The EcoRI recognition sites are indicated in lowercase letters. The coding regions of the expressed β-like globin genes are indicated on top of the nucleotide sequence by their encoded amino acid sequences. Exons 1 and 2 of the ψη gene are indicated by asterisks above the nucleotide sequence. The promoter elements CACA, CCAAT, and TATA boxes as well as the putative CAP sites are labeled as such above their nucleotide sequence. SINE elements are indicated by an overline above their nucleotide sequence. LINE elements are indicated by a double overline above their corresponding nucleotide sequence. For both families of interspersed repeats, left and right arrows indicate the 3' to 5' or 5' to 3' directionality of the repeats, respectively. A series of arrowheads are used to indicate where short direct repeats flank insertion elements. The known structural features of the sequence are indicated to the right. DNA regions sequenced multiple times on only one strand include positions 33,050 to 33,350; 40,600 to 41,101; 29,500 to 29,730; 26,450 to 27,100; and 22,430 to 22,960 presented in this figure.
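The length criterion of the conventional ORF search (at least 30 amino acids) can be sketched as below. This is an illustration under stated assumptions, not the GCG or MacVector routine the authors ran. An ORF is taken here to be any stop-free stretch between stop codons, which is consistent with the reported ORFs that begin inside introns. Fickett's compositional test is not implemented, and all function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def stop_free_stretches(seq, min_codons=30):
    """Yield (frame, start, end) for stop-free runs of >= min_codons codons.

    Coordinates are 0-based and half-open on the strand that was passed in.
    """
    seq = seq.upper()
    for frame in range(3):
        run_start = frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - run_start) // 3 >= min_codons:
                    yield frame, run_start, pos
                run_start = pos + 3
        # Run that extends to the last complete codon of the frame.
        end = frame + ((len(seq) - frame) // 3) * 3
        if (end - run_start) // 3 >= min_codons:
            yield frame, run_start, end


def six_frame_orfs(seq, min_codons=30):
    """Search both strands; minus-strand hits use plus-strand coordinates."""
    seq = seq.upper()
    n = len(seq)
    hits = [("+", f, s, e) for f, s, e in stop_free_stretches(seq, min_codons)]
    for f, s, e in stop_free_stretches(reverse_complement(seq), min_codons):
        hits.append(("-", f, n - e, n - s))
    return hits


if __name__ == "__main__":
    demo = "TAA" + "GCT" * 30 + "TAG" + "GCT" * 10 + "TGA"
    for hit in six_frame_orfs(demo):
        print(hit)
```

A length-only filter of this kind reports many stretches in a 41-kb sequence, which matches the remark in the text that the conventional search had high background noise.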
`
`
`
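The codon preference figures quoted in the text (for example, CTG used for 51 of 67 Leu codons) are simple tallies over the coding sequences. The sketch below shows how such a tally could be computed. It is an illustration only: the function name `codon_preferences` is hypothetical, only the three amino acids discussed in the text are tabulated, and the demonstration sequence is made up rather than taken from the galago genes.

```python
from collections import Counter

# Synonymous codon families for the three amino acids discussed in the text.
SYNONYMOUS = {
    "Leu": {"CTT", "CTC", "CTA", "CTG", "TTA", "TTG"},
    "Val": {"GTT", "GTC", "GTA", "GTG"},
    "Gly": {"GGT", "GGC", "GGA", "GGG"},
}


def codon_preferences(cds):
    """Return {amino acid: (most used codon, its count, family total)}.

    cds is an in-frame coding sequence; a trailing partial codon is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    result = {}
    for amino_acid, family in SYNONYMOUS.items():
        used = {codon: counts[codon] for codon in family if counts[codon]}
        total = sum(used.values())
        if total:
            # sorted() makes tie-breaking deterministic (alphabetical).
            best = max(sorted(used), key=used.get)
            result[amino_acid] = (best, used[best], total)
    return result


if __name__ == "__main__":
    demo = "ATG" + "CTG" * 3 + "CTC" + "GTG" * 2 + "GTC" + "GGC" + "TAA"
    for amino_acid, (codon, count, total) in codon_preferences(demo).items():
        print(f"{amino_acid}: {codon} preferred {count}/{total}")
```

Concatenating the exons of the four expressed genes and passing the result to a function like this would give ratios in the same form as those reported in the text.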
`
[FIG. 2, sequence panel: nucleotide positions 1 to about 4950 of the galago β globin gene cluster sequence, printed 150 bases per row with position markers above the rows and cumulative position numbers at the right. Features labeled in this portion include the SINE elements GcAluII-1, GcAluII-2, and GcAluII-3 and a fourth Alu element; the CACA, CCAAT, and TATA-box promoter labels and the CAP site; and exons 1 and 2 of the ε globin gene with their encoded amino acids shown above the nucleotide sequence. Sequence data: GenBank Accession No. M73981.]