Showing papers in "Genome Research in 2002"


Journal ArticleDOI
TL;DR: A mature web tool for rapid and reliable display of any requested portion of the genome at any scale, together with several dozen aligned annotation tracks, is provided at http://genome.ucsc.edu.
Abstract: As vertebrate genome sequences near completion and research refocuses to their analysis, the issue of effective genome annotation display becomes critical. A mature web tool for rapid and reliable display of any requested portion of the genome at any scale, together with several dozen aligned annotation tracks, is provided at http://genome.ucsc.edu. This browser displays assembly contigs and gaps, mRNA and expressed sequence tag alignments, multiple gene predictions, cross-species homologies, single nucleotide polymorphisms, sequence-tagged sites, radiation hybrid data, transposon repeats, and more as a stack of coregistered tracks. Text and sequence-based searches provide quick and precise access to any region of specific interest. Secondary links from individual features lead to sequence details and supplementary off-site databases. One-half of the annotation tracks are computed at the University of California, Santa Cruz from publicly available sequence data; collaborators worldwide provide the rest. Users can stably add their own custom tracks to the browser for educational or research purposes. The conceptual and technical framework of the browser, its underlying MYSQL database, and overall use are described. The web site currently serves over 50,000 pages per day to over 3000 different users.

9,605 citations


Journal ArticleDOI
TL;DR: This paper describes how BLAT was optimized; BLAT is more accurate and 500 times faster than popular existing tools for mRNA/DNA alignments and 50 times faster for protein alignments at sensitivity settings typically used when comparing vertebrate sequences.
Abstract: Analyzing vertebrate genomes requires rapid mRNA/DNA and cross-species protein alignments. A new tool, BLAT, is more accurate and 500 times faster than popular existing tools for mRNA/DNA alignments and 50 times faster for protein alignments at sensitivity settings typically used when comparing vertebrate sequences. BLAT's speed stems from an index of all nonoverlapping K-mers in the genome. This index fits inside the RAM of inexpensive computers, and need only be computed once for each genome assembly. BLAT has several major stages. It uses the index to find regions in the genome likely to be homologous to the query sequence. It performs an alignment between homologous regions. It stitches together these aligned regions (often exons) into larger alignments (typically genes). Finally, BLAT revisits small internal exons possibly missed at the first stage and adjusts large gap boundaries that have canonical splice sites where feasible. This paper describes how BLAT was optimized. Effects on speed and sensitivity are explored for various K-mer sizes, mismatch schemes, and number of required index matches. BLAT is compared with other alignment programs on various test sets and then used in several genome-wide applications. http://genome.ucsc.edu hosts a web-based BLAT server for the human genome.
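The indexing idea in the abstract above can be illustrated with a minimal sketch. It assumes exact K-mer matches only and uses hypothetical helper names (`build_index`, `find_hits`) and a toy sequence; it is not BLAT's actual implementation, which also handles mismatch schemes, multiple required hits, alignment, and stitching.

```python
from collections import defaultdict


def build_index(genome: str, k: int) -> dict:
    """Index every NON-overlapping k-mer of the genome by its start position."""
    index = defaultdict(list)
    for start in range(0, len(genome) - k + 1, k):
        index[genome[start:start + k]].append(start)
    return index


def find_hits(index: dict, query: str, k: int) -> list:
    """Look up every OVERLAPPING k-mer of the query in the genome index.

    Returns (query_position, genome_position) pairs. Hits that share the
    same diagonal (genome_position - query_position) suggest a candidate
    homologous region.
    """
    hits = []
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], ()):
            hits.append((q, g))
    return hits


# Toy data, purely illustrative.
genome = "ACGTACGGTTCAGGCTAACCGT"
index = build_index(genome, 4)
hits = find_hits(index, "GGTTCAGGCT", 4)
print(hits)
```

Because the genome is sampled only every K bases while the query is scanned at every offset, the index stays small (which is what lets it fit in RAM) yet a sufficiently long matching region still produces at least one hit.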

8,326 citations


Journal ArticleDOI
TL;DR: The overall architecture of the Bioperl toolkit and the problem domains that it addresses are described, and specific examples of how the toolkit can be used to solve common life-sciences problems are given.
Abstract: The Bioperl project is an international open-source collaboration of biologists, bioinformaticians, and computer scientists that has evolved over the past 7 yr into the most comprehensive library of Perl modules available for managing and manipulating life-science information. Bioperl provides an easy-to-use, stable, and consistent programming interface for bioinformatics application programmers. The Bioperl modules have been successfully and repeatedly used to reduce otherwise complex tasks to only a few lines of code. The Bioperl object model has been proven to be flexible enough to support enterprise-level applications such as EnsEMBL, while maintaining an easy learning curve for novice Perl programmers. Bioperl is capable of executing analyses and processing results from programs such as BLAST, ClustalW, or the EMBOSS suite. Interoperation with modules written in Python and Java is supported through the evolving BioCORBA bridge. Bioperl provides access to data stores such as GenBank and SwissProt via a flexible series of sequence input/output modules, and to the emerging common sequence data storage format of the Open Bioinformatics Database Access project. This study describes the overall architecture of the toolkit, the problem domains that it addresses, and gives specific examples of how the toolkit can be used to solve common life-sciences problems. We conclude with a discussion of how the open-source nature of the project has contributed to the development effort.

1,694 citations


Journal ArticleDOI
TL;DR: The Generic Genome Browser (GBrowse), a Web-based application for displaying genomic annotations and other features, is described; for data providers it offers easy integration with other components of a model organism system Web site.
Abstract: The Generic Model Organism System Database Project (GMOD) seeks to develop reusable software components for model organism system databases. In this paper we describe the Generic Genome Browser (GBrowse), a Web-based application for displaying genomic annotations and other features. For the end user, features of the browser include the ability to scroll and zoom through arbitrary regions of a genome, to enter a region of the genome by searching for a landmark or performing a full text search of all features, and the ability to enable and disable tracks and change their relative order and appearance. The user can upload private annotations to view them in the context of the public ones, and publish those annotations to the community. For the data provider, features of the browser software include reliance on readily available open source components, simple installation, flexible configuration, and easy integration with other components of a model organism system Web site. GBrowse is freely available under an open source license. The software, its documentation, and support are available at http://www.gmod.org.

1,177 citations


Journal ArticleDOI
TL;DR: This investigation provides the most detailed inventory of human spliceosome-associated factors to date, and the data indicate a number of interesting links coordinating splicing with other steps in the gene expression pathway.
Abstract: In a previous proteomic study of the human spliceosome, we identified 42 spliceosome-associated factors, including 19 novel ones. Using enhanced mass spectrometric tools and improved databases, we now report identification of 311 proteins that copurify with splicing complexes assembled on two separate pre-mRNAs. All known essential human splicing factors were found, and 96 novel proteins were identified, of which 55 contain domains directly linking them to functions in splicing/RNA processing. We also detected 20 proteins related to transcription, which indicates a direct connection between this process and splicing. This investigation provides the most detailed inventory of human spliceosome-associated factors to date, and the data indicate a number of interesting links coordinating splicing with other steps in the gene expression pathway.

922 citations


Journal ArticleDOI
TL;DR: The positional candidate cloning of this QTL is reported, involving the construction of a BAC contig spanning the corresponding marker interval, and the demonstration that a very strong candidate gene, acylCoA:diacylglycerol acyltransferase (DGAT1), maps to that contig.
Abstract: We recently mapped a quantitative trait locus (QTL) with a major effect on milk composition--particularly fat content--to the centromeric end of bovine chromosome 14. We subsequently exploited linkage disequilibrium to refine the map position of this QTL to a 3-cM chromosome interval bounded by microsatellite markers BULGE13 and BULGE09. We herein report the positional candidate cloning of this QTL, involving (1) the construction of a BAC contig spanning the corresponding marker interval, (2) the demonstration that a very strong candidate gene, acylCoA:diacylglycerol acyltransferase (DGAT1), maps to that contig, and (3) the identification of a nonconservative K232A substitution in the DGAT1 gene with a major effect on milk fat content and other milk characteristics.

899 citations


Journal ArticleDOI
TL;DR: An analysis of single nucleotide polymorphisms with allele frequencies that were determined in three populations provides a first generation natural selection map of the human genome and provides compelling evidence that selection has shaped extant patterns of human genomic variation.
Abstract: Natural selection, which can be defined as the differential contribution of genetic variants to future generations (Aquadro et al. 2001), is the driving force of Darwinian evolution. Despite intense research, only a relatively small number of regions and genes have been directly implicated as targets of selection in the human genome (Kitano and Saitou 1999; Rana et al. 1999; Huttley et al. 2000; Hollox et al. 2001; Hull et al. 2001; Hurst and Pal 2001; Koda et al. 2001; Sullivan et al. 2001; Tishkoff et al. 2001; Baum et al. 2002; Fullerton et al. 2002; Gilad et al. 2002; Hamblin et al. 2002). A more comprehensive and genomic understanding of how and where natural selection has shaped patterns of genetic variation may provide important insights into the mechanisms of evolutionary change (Otto 2000), guide selection of loci for inclusion in population genetic studies (Vitalis et al. 2001), facilitate the annotation of functionally significant genomic regions (Nielsen 2001), and help elucidate genotype-phenotype correlations in complex diseases (Przeworski et al. 2000; Nielsen 2001). Detecting unambiguous evidence for natural selection remains challenging because the effect of selection on the distribution of genetic variation can be mimicked by population demographic history (i.e., the size, structure, and mating pattern of a population). For instance, both adaptive hitchhiking and population expansion can cause an excess of rare variants observed in DNA sequence data compared with what is expected under a standard neutral model (Tajima 1989; Przeworski et al. 2000). Despite these difficulties, the recent deluge of publicly available single nucleotide polymorphisms (SNPs) provides an exciting opportunity to identify genome-wide signatures of selection (Sunyaev et al. 2000; Fay et al. 2001; Sachidanandam et al. 2001). 
To this end, examining the variation in SNP allele frequencies between populations, which can be quantified by the statistic FST, is a promising strategy for detecting signatures of natural selection (Lewontin and Krakauer 1973; Rana et al. 1999; Hollox et al. 2001; Fullerton et al. 2002; Gilad et al. 2002; Hamblin et al. 2002). Under selective neutrality, FST is determined by genetic drift, which will affect all loci across the genome in a similar and predictable fashion. On the other hand, natural selection is a locus-specific force that can cause systematic deviations in FST values for a selected gene and nearby genetic markers. For example, geographically restricted directional selection may lead to an increase in FST of a selected locus, whereas balancing or species-wide directional selection may lead to a decrease in FST compared with neutrally evolving loci (Cavalli-Sforza 1966; Bowcock et al. 1991; Andolfatto 2001). Previous studies that have attempted to identify natural selection based on patterns of population differentiation relied on simulations to obtain the expected distribution of FST under selective neutrality (Lewontin and Krakauer 1973; Bowcock et al. 1991; Beaumont and Nichols 1996). However, the simulated distribution of FST strongly depends on the assumed population demographic history, which is rarely known with any degree of certainty. As an expanding number of SNPs are genotyped across multiple populations, a complementary approach that does not require tenuous assumptions about population demographic history is now becoming feasible. Specifically, by sampling a large number of SNPs throughout the genome, loci that have been affected by natural selection can simply be identified as outliers in the extreme tails of the empirical distribution of FST (Cavalli-Sforza 1966; Black et al. 2001; Goldstein and Chikhi 2002). 
Recently, this strategy has been used to infer natural selection in the CAPN10 gene; however, the empirical distribution of FST contained <100 loci (Fullerton et al. 2002). In this work, we describe an analysis of 26,530 SNPs with allele frequencies that were determined in three populations: African-American, East Asian, and European-American. The density of this SNP allele frequency map provides a unique and powerful opportunity to interrogate the genome for signatures of natural selection. Through a variety of analyses, we have found statistically significant evidence supporting the hypothesis that selection has influenced extant patterns of human genetic variation. Furthermore, we have identified 174 candidate genes that demonstrate signatures of selection when contrasted to the empirical genome-wide distribution of FST. This analysis provides the conceptual foundation for constructing a high-resolution natural selection map, which will be an important resource in understanding the recent evolutionary history of our species, and will facilitate detailed studies on the identified candidate genes.
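The outlier strategy described above can be sketched in a few lines. The sketch assumes a simple textbook heterozygosity-based FST for a biallelic SNP with equally weighted populations; the estimator actually used in the paper is not given in this excerpt. The function names, SNP identifiers, and allele frequencies are hypothetical.

```python
def fst(freqs):
    """Heterozygosity-based FST for one biallelic SNP.

    freqs: frequency of one allele in each population (equal weights assumed).
    FST = (H_T - H_S) / H_T, with H_T from the pooled frequency and H_S the
    mean within-population expected heterozygosity.
    """
    p_bar = sum(freqs) / len(freqs)
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:  # SNP is monomorphic overall, so there is no differentiation
        return 0.0
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    return (h_t - h_s) / h_t


def upper_tail(fst_by_snp, fraction):
    """Return the SNPs in the top `fraction` of the empirical FST distribution."""
    ranked = sorted(fst_by_snp, key=fst_by_snp.get, reverse=True)
    count = max(1, int(len(ranked) * fraction))
    return ranked[:count]


# Hypothetical allele frequencies in three populations.
snps = {
    "snp_a": [0.50, 0.50, 0.50],
    "snp_b": [0.10, 0.90, 0.50],
    "snp_c": [0.40, 0.50, 0.60],
    "snp_d": [0.45, 0.50, 0.55],
}
fst_by_snp = {name: fst(freqs) for name, freqs in snps.items()}
candidates = upper_tail(fst_by_snp, 0.25)
print(candidates)
```

The point of the empirical approach is visible here: no neutral simulation or demographic model is needed, only a ranking of observed values. The lower tail (unusually low FST) could be examined the same way by reversing the sort.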

890 citations


Journal ArticleDOI
TL;DR: A simple set of rules was developed to unambiguously label the different clades nested within a single most parsimonious phylogeny, which supersedes and unifies past nomenclatures and allows the inclusion of additional mutations and haplogroups yet to be discovered.
Abstract: The Y chromosome contains the largest nonrecombining block in the human genome. By virtue of its many polymorphisms, it is now the most informative haplotyping system, with applications in evolutionary studies, forensics, medical genetics, and genealogical reconstruction. However, the emergence of several unrelated and nonsystematic nomenclatures for Y-chromosomal binary haplogroups is an increasing source of confusion. To resolve this issue, 245 markers were genotyped in a globally representative set of samples, 74 of which were males from the Y Chromosome Consortium cell line repository. A single most parsimonious phylogeny was constructed for the 153 binary haplogroups observed. A simple set of rules was developed to unambiguously label the different clades nested within this tree. This hierarchical nomenclature system supersedes and unifies past nomenclatures and allows the inclusion of additional mutations and haplogroups yet to be discovered.

797 citations


Journal ArticleDOI
TL;DR: An approach for the de novo identification and classification of repeat sequence families that is based on extensions to the usual approach of single linkage clustering of local pairwise alignments between genomic sequences that was able to properly identify and group known transposable elements.
Abstract: Repetitive sequences make up a major part of eukaryotic genomes. We have developed an approach for the de novo identification and classification of repeat sequence families that is based on extensions to the usual approach of single linkage clustering of local pairwise alignments between genomic sequences. Our extensions use multiple alignment information to define the boundaries of individual copies of the repeats and to distinguish homologous but distinct repeat element families. When tested on the human genome, our approach was able to properly identify and group known transposable elements. The program, RECON, should be useful for first-pass automatic classification of repeats in newly sequenced genomes. [The following individuals kindly provided reagents, samples, or unpublished information as indicated in the paper: R. Klein.]
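For orientation, the baseline that the abstract calls "the usual approach" (single linkage clustering of pairwise alignments) can be sketched with a union-find structure. This is only the baseline, not RECON's extensions; the function name and element labels are hypothetical, and each alignment is reduced to a pair of element identifiers.

```python
def single_linkage(elements, aligned_pairs):
    """Group elements so that any two linked by a chain of alignments share a cluster."""
    parent = {e: e for e in elements}

    def find(x):
        # Walk to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in aligned_pairs:
        parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for e in elements:
        clusters.setdefault(find(e), []).append(e)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical sequence elements and the pairs that aligned to each other.
elements = ["r1", "r2", "r3", "r4", "r5"]
aligned_pairs = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
families = single_linkage(elements, aligned_pairs)
print(families)
```

Note that r1 and r3 end up together although they never aligned directly. This chaining is why plain single linkage can merge related but distinct families, which is the problem the abstract's multiple-alignment extensions are meant to address.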

745 citations


Journal ArticleDOI
TL;DR: The SIFT (Sorting Intolerant From Tolerant) program was used to predict that 25% of 3084 nsSNPs from dbSNP, a public SNP database, would affect protein function; the number of damaging nsSNPs in an individual's genome is found likely to be much lower than the thousands previously reported.
Abstract: A major interest in human genetics is to determine whether a nonsynonymous single-base nucleotide polymorphism (nsSNP) in a gene affects its protein product and, consequently, impacts the carrier's health. We used the SIFT (Sorting Intolerant From Tolerant) program to predict that 25% of 3084 nsSNPs from dbSNP, a public SNP database, would affect protein function. Some of the nsSNPs predicted to affect function were variants known to be associated with disease. Others were artifacts of SNP discovery. Two reports have indicated that there are thousands of damaging nsSNPs in an individual's human genome; we find the number is likely to be much lower.

698 citations


Journal ArticleDOI
TL;DR: ARACHNE is a new computer system for assembling genome sequence using paired-end whole-genome shotgun reads; its key features include an efficient and sensitive procedure for finding read overlaps, a procedure for scoring overlaps that achieves high accuracy by correcting errors before assembly, read merger based on forward-reverse links, and detection of repeat contigs by forward-reverse link inconsistency.
Abstract: We describe a new computer system, called ARACHNE, for assembling genome sequence using paired-end whole-genome shotgun reads. ARACHNE has several key features, including an efficient and sensitive procedure for finding read overlaps, a procedure for scoring overlaps that achieves high accuracy by correcting errors before assembly, read merger based on forward-reverse links, and detection of repeat contigs by forward-reverse link inconsistency. To test ARACHNE, we created simulated reads providing ∼10-fold coverage of the genomes of H. influenzae, S. cerevisiae, and D. melanogaster, as well as human chromosomes 21 and 22. The assemblies of these simulated reads yielded nearly complete coverage of the respective genomes, with a small number of contigs joined into a smaller number of supercontigs (or scaffolds). For example, analysis of the D. melanogaster genome yielded ∼98% coverage with an N50 contig length of 324 kb and an N50 supercontig length of 5143 kb. The assembly accuracy was high, although not perfect: small errors occurred at a frequency of roughly 1 per 1 Mb (typically, deletion of ∼1 kb in size), with a very small number of other misassemblies. The assembly was rapid: the Drosophila assembly required only 21 hours on a single 667 MHz processor and used 8.4 Gb of memory.
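The N50 figures quoted above are a standard assembly summary statistic. A minimal sketch follows, assuming the common definition (the length L such that contigs of length at least L contain at least half of the assembled bases); the abstract itself does not spell out the definition, and the contig lengths below are made up.

```python
def n50(lengths):
    """Return the N50 of a list of contig (or supercontig) lengths."""
    if not lengths:
        raise ValueError("n50 is undefined for an empty assembly")
    total = sum(lengths)
    running = 0
    # Accumulate from the longest contig down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths in kb.
contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(contigs))
```

Unlike the mean contig length, N50 is weighted by bases rather than by contig count, so many tiny contigs do not drag it down much.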

Journal ArticleDOI
TL;DR: The Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the NCBI Entrez Protein Database based on domain architecture, defined as the sequential order of conserved domains in proteins.
Abstract: The Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the NCBI Entrez Protein Database based on domain architecture, defined as the sequential order of conserved domains in proteins. The algorithm finds protein similarities across significant evolutionary distances using sensitive protein domain profiles rather than by direct sequence similarity. Proteins similar to a query protein are grouped and scored by architecture. Relying on domain profiles allows CDART to be fast, and, because it relies on annotated functional domains, informative. Domain profiles are derived from several collections of domain definitions that include functional annotation. Searches can be further refined by taxonomy and by selecting domains of interest. CDART is available at http://www.ncbi.nlm.nih.gov/Structure/lexington/lexington.cgi.

Journal ArticleDOI
TL;DR: The relationship of protein-protein interactions with mRNA expression levels, by integrating a variety of data sources for yeast, is investigated, finding that permanent complexes, such as the ribosome and proteasome, have a particularly strong relationship with expression, while transient ones do not.
Abstract: We investigate the relationship of protein-protein interactions with mRNA expression levels, by integrating a variety of data sources for yeast. We focus on known protein complexes that have clearly defined interactions between their subunits. We find that subunits of the same protein complex show significant coexpression, both in terms of similarities of absolute mRNA levels and expression profiles, e.g., we can often see subunits of a complex having correlated patterns of expression over a time course. We classify the yeast protein complexes as either permanent or transient, with permanent ones being maintained through most cellular conditions. We find that, generally, permanent complexes, such as the ribosome and proteasome, have a particularly strong relationship with expression, while transient ones do not. However, we note that several transient complexes, such as the RNA polymerase II holoenzyme and the replication complex, can be subdivided into smaller permanent ones, which do have a strong relationship to gene expression. We also investigated the interactions in aggregated, genome-wide data sets, such as the comprehensive yeast two-hybrid experiments, and found them to have only a weak relationship with gene expression, similar to that of transient complexes. (Further details on genecensus.org/expression/interactions and bioinfo.mbb.yale.edu/expression/interactions.)
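One way to quantify "correlated patterns of expression over a time course" among subunits of a complex is the mean pairwise Pearson correlation of their expression profiles. The sketch below makes that assumption explicit; the abstract does not state which measure the authors used, and the subunit names and profiles are invented.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (neither may be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def mean_pairwise_correlation(profiles):
    """Average Pearson correlation over all pairs of subunit profiles."""
    pairs = list(combinations(profiles.values(), 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Hypothetical mRNA levels of three subunits at four time points.
complex_profiles = {
    "subunit_1": [1.0, 2.0, 3.0, 4.0],
    "subunit_2": [2.0, 4.0, 6.0, 8.0],
    "subunit_3": [1.0, 3.0, 5.0, 7.0],
}
score = mean_pairwise_correlation(complex_profiles)
print(round(score, 6))
```

Comparing this score for a known complex against the same score for randomly chosen gene sets would give a rough sense of whether the coexpression is stronger than expected by chance.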

Journal ArticleDOI
TL;DR: The results indicate that illegitimate recombination is the driving force behind genome size decrease in Arabidopsis, removing at least fivefold more DNA than unequal homologous recombination.
Abstract: Genome size varies greatly across angiosperms. It is well documented that, in addition to polyploidization, retrotransposon amplification has been a major cause of genome expansion. The lack of evidence for counterbalancing mechanisms that curtail unlimited genome growth has made many of us wonder whether angiosperms have a "one-way ticket to genomic obesity." We have therefore investigated an angiosperm with a well-characterized and notably small genome, Arabidopsis thaliana, for evidence of genomic DNA loss. Our results indicate that illegitimate recombination is the driving force behind genome size decrease in Arabidopsis, removing at least fivefold more DNA than unequal homologous recombination. The presence of highly degraded retroelements also suggests that retrotransposon amplification has not been confined to the last 4 million years, as is indicated by the dating of intact retroelements.

Journal ArticleDOI
TL;DR: The complete genome sequence of an acetate-utilizing methanogen, Methanosarcina acetivorans C2A, is reported, which indicates the likelihood of undiscovered natural energy sources for methanogenesis, whereas the presence of single-subunit carbon monoxide dehydrogenases raises the possibility of nonmethanogenic growth.
Abstract: The Archaea remain the most poorly understood domain of life despite their importance to the biosphere. Methanogenesis, which plays a pivotal role in the global carbon cycle, is unique to the Archaea. Each year, an estimated 900 million metric tons of methane are biologically produced, representing the major global source for this greenhouse gas and contributing significantly to global warming (Schlesinger 1997). Methanogenesis is critical to the waste-treatment industry and biologically produced methane also represents an important alternative fuel source. At least two-thirds of the methane in nature is derived from acetate, although only two genera of methanogens are known to be capable of utilizing this substrate. We report here the first complete genome sequence of an acetate-utilizing (acetoclastic) methanogen, Methanosarcina acetivorans C2A. The Methanosarcineae are metabolically and physiologically the most versatile methanogens. Only Methanosarcina species possess all three known pathways for methanogenesis (Fig. 1) and are capable of utilizing no less than nine methanogenic substrates, including acetate. In contrast, all other orders of methanogens possess a single pathway for methanogenesis, and many utilize no more than two substrates. Among methanogens, the Methanosarcineae also display extensive environmental diversity. Individual species of Methanosarcina have been found in freshwater and marine sediments, decaying leaves and garden soils, oil wells, sewage and animal waste digesters and lagoons, thermophilic digesters, feces of herbivorous animals, and the rumens of ungulates (Zinder 1993).
Figure 1: Three pathways for methanogenesis. Methanogenesis is a form of anaerobic respiration using a variety of one-carbon (C-1) compounds or acetic acid as a terminal electron acceptor. All three pathways converge on the reduction of methyl-CoM to methane (CH ...
The Methanosarcineae are unique among the Archaea in forming complex multicellular structures during different phases of growth and in response to environmental change (Fig. 2). Within the Methanosarcineae, a number of distinct morphological forms have been characterized, including single cells with and without a cell envelope, as well as multicellular packets and lamina (Macario and Conway de Macario 2001). Packets and lamina display internal morphological heterogeneity, suggesting the possibility of cellular differentiation. Moreover, it has been suggested that cells within lamina may display differential production of extracellular material, a potential form of cellular specialization (Macario and Conway de Macario 2001). The formation of multicellular structures has been proposed to act as an adaptation to stress and likely plays a role in the ability of Methanosarcina species to colonize diverse environments.
Figure 2: Different morphological forms of Methanosarcina acetivorans. Thin-section electron micrographs showing M. acetivorans growing as both single cells (center of micrograph) and within multicellular aggregates (top left, bottom right). Cells were harvested ...
Significantly, powerful methods for genetic analysis exist for Methanosarcina species. These tools include plasmid shuttle vectors (Metcalf et al. 1997), very high efficiency transformation (Metcalf et al. 1997), random in vivo transposon mutagenesis (Zhang et al. 2000), directed mutagenesis of specific genes (Zhang et al. 2000), multiple selectable markers (Boccazzi et al. 2000), reporter gene fusions (M. Pritchett and W. Metcalf, unpubl.), integration vectors (Conway de Macario et al. 1996), and anaerobic incubators for large-scale growth of methanogens on solid media (Metcalf et al. 1998).
Furthermore, and in contrast to other known methanogens, genetic analysis can be used to study the process of methanogenesis: Because Methanosarcina species are able to utilize each of the three known methanogenic pathways, mutants in a single pathway are viable (M. Pritchett and W. Metcalf, unpubl.). The availability of genetic methods allowing immediate exploitation of genomic sequence, coupled with the genetic, physiological, and environmental diversity of M. acetivorans make this species an outstanding model organism for the study of archaeal biology. For these reasons, we set out to study the genome of M. acetivorans.

Journal ArticleDOI
TL;DR: A systematic computational analysis of protein sequences containing known nuclear domains led to the identification of 28 novel domain families, which represents a 26% increase in the starting set of 107 known nuclear domain families used for the analysis.
Abstract: A systematic computational analysis of protein sequences containing known nuclear domains led to the identification of 28 novel domain families. This represents a 26% increase in the starting set of 107 known nuclear domain families used for the analysis. Most of the novel domains are present in all major eukaryotic lineages, but 3 are species specific. For about 500 of the 1200 proteins that contain these new domains, nuclear localization could be inferred, and for 700, additional features could be predicted. For example, we identified a new domain, likely to have a role downstream of the unfolded protein response; a nematode-specific signalling domain; and a widespread domain, likely to be a noncatalytic homolog of ubiquitin-conjugating enzymes.

Journal ArticleDOI
TL;DR: In this paper, the authors combine genomic sequence data with experimental knockout data to compare the rates of evolution and the levels of selection for essential versus nonessential bacterial genes, finding that essential bacterial genes appear to be more conserved than are nonessential genes over both relatively short (microevolutionary) and longer (macroevolutionary) time scales.
Abstract: The “knockout-rate” prediction holds that essential genes should be more evolutionarily conserved than are nonessential genes. This is because negative (purifying) selection acting on essential genes is expected to be more stringent than that for nonessential genes, which are more functionally dispensable and/or redundant. However, a recent survey of evolutionary distances between Saccharomyces cerevisiae and Caenorhabditis elegans proteins did not reveal any difference between the rates of evolution for essential and nonessential genes. An analysis of mouse and rat orthologous genes also found that essential and nonessential genes evolved at similar rates when genes thought to evolve under directional selection were excluded from the analysis. In the present study, we combine genomic sequence data with experimental knockout data to compare the rates of evolution and the levels of selection for essential versus nonessential bacterial genes. In contrast to the results obtained for eukaryotic genes, essential bacterial genes appear to be more conserved than are nonessential genes over both relatively short (microevolutionary) and longer (macroevolutionary) time scales.

Journal ArticleDOI
TL;DR: Results indicate that internal exons that contain an Alu sequence are predominantly, if not exclusively, alternatively spliced, and that evolutionary events that cause a constitutive insertion of an Alu sequence into an mRNA are presumably deleterious and selected against.
Abstract: Alu repetitive elements are found in ∼1.4 million copies in the human genome, comprising more than one-tenth of it. Numerous studies describe exonizations of Alu elements, that is, splicing-mediated insertions of parts of Alu sequences into mature mRNAs. To study the connection between the exonization of Alu elements and alternative splicing, we used a database of ESTs and cDNAs aligned to the human genome. We compiled two exon sets, one of 1176 alternatively spliced internal exons, and another of 4151 constitutively spliced internal exons. Sixty one alternatively spliced internal exons (5.2%) had a significant BLAST hit to an Alu sequence, but none of the constitutively spliced internal exons had such a hit. The vast majority (84%) of the Alu-containing exons that appeared within the coding region of mRNAs caused a frame-shift or a premature termination codon. Alu-containing exons were included in transcripts at lower frequencies than alternatively spliced exons that do not contain an Alu sequence. These results indicate that internal exons that contain an Alu sequence are predominantly, if not exclusively, alternatively spliced. Presumably, evolutionary events that cause a constitutive insertion of an Alu sequence into an mRNA are deleterious and selected against.

Journal ArticleDOI
TL;DR: In this paper, the authors developed a computational tool, rVISTA, for high-throughput discovery of cis-regulatory elements that combines clustering of predicted transcription factor binding sites (TFBSs) and the analysis of interspecies sequence conservation to maximize the identification of functional sites.
Abstract: Identifying transcriptional regulatory elements represents a significant challenge in annotating the genomes of higher vertebrates. We have developed a computational tool, rVISTA, for high-throughput discovery of cis-regulatory elements that combines clustering of predicted transcription factor binding sites (TFBSs) and the analysis of interspecies sequence conservation to maximize the identification of functional sites. To assess the ability of rVISTA to discover true positive TFBSs while minimizing the prediction of false positives, we analyzed the distribution of several TFBSs across 1 Mb of the well-annotated cytokine gene cluster (Hs5q31; Mm11). Because a large number of AP-1, NFAT, and GATA-3 sites have been experimentally identified in this interval, we focused our analysis on the distribution of all binding sites specific for these transcription factors. The exploitation of the orthologous human–mouse dataset resulted in the elimination of >95% of the ∼58,000 binding sites predicted on analysis of the human sequence alone, whereas it identified 88% of the experimentally verified binding sites in this region.

Journal ArticleDOI
TL;DR: With the advancement of genomic technology and genome-wide analysis of organisms, more and more organisms are being studied extensively for gene expression on a global scale, and computational methods have been developed to predict protein–protein interactions.
Abstract: With the advancement of genomic technology and genome-wide analysis of organisms, more and more organisms are being studied extensively for gene expression on a global scale. Expression profiling is now being used increasingly to analyze gene functions or to functionally group genes on the basis of their expression profiles (Lockhart and Winzeler 2000). After the completion of the genome sequence of Saccharomyces cerevisiae (Goffeau et al. 1996), a budding yeast, many researchers have undertaken the task of functionally analyzing the yeast genome, comprising ∼6280 proteins (YPD), of which roughly one-third do not have known functions (Mewes et al. 2002). Genes can be clustered on the basis of similar expression profiles. This makes it possible to assign a biological function to genes, depending on the functions of other genes in the cluster (Eisen et al. 1998). However, expression profiling gives an indirect measure of a gene product's biological and cellular function. A more complete study of an organism could possibly be achieved by looking at not only the mRNA levels but also the proteins they encode. It is well known that mRNA levels alone are not sufficient to group genes into different functions, because not all mRNAs end up being translated. Most biological functions within a cell are carried out by proteins and most cellular processes and biochemical events are ultimately achieved by interactions of proteins with one another. Thus, it is important to look at protein expression and their interactions simultaneously. Affinity chromatography, two-hybrid assay, copurification, coimmunoprecipitation, and cross-linking are some of the tools used to verify proteins that are associated physically with one another. Among these techniques, the two-hybrid assay has been used widely to analyze protein–protein interactions in Saccharomyces cerevisiae (Ito et al. 2000, 2001a; Uetz et al. 2000). 
Their protein interaction profiles have made it possible to look at the interaction networks comprising a large number of proteins and to also functionally classify proteins of unknown function. Uetz et al. (2000) used two different approaches in their two-hybrid experiments. The first was a protein array approach with 192 yeast proteins as bait, Gal4–DNA-binding domain fusions, and ∼6000 yeast transformants as prey, Gal4-activation domain fusions. The second, an interaction sequence tag (IST) approach, used high-throughput screens of an activation domain library encoding ∼6000 yeast genes that were pooled. All yeast proteins were cloned into DNA-binding domain vectors. Of the 6144 yeast ORF PCR products, 5345 were successfully cloned. Their first approach revealed 281 interactions, with less stringent selection criteria, using HIS3. The second approach revealed 692 interactions with the more stringent URA3 selection method. Ito et al. (2001a) used a similar method and reported 4549 interactions among 3278 proteins. Some interactions in both data sets were repeated (bait and prey exchanged). They imposed a more rigorous selection criterion including four reporter genes, ADE2, HIS3, URA3, and MEL1, to minimize false positives due to promoter-specific activation. All of these genes have Gal4-responsive promoter. Computational methods have been developed to predict protein–protein interactions. Those approaches include the Rosetta stone/gene fusion method (Enright et al. 1999; Marcotte et al. 1999a), the phylogenetic profile method (Pellegrini et al. 1999) and the method combining multiple sources of data (Marcotte et al. 1999b). Other computational methods to predict protein–protein interaction have been presented on the basis of different principles, including the interaction domain pair profile method (Rain et al. 2001; Wojcik and Schachter 2001) and the support vector machine learning method (Bock and Gough 2001). Gomez et al. 
(2001) developed probabilistic models for protein–protein interactions. Sprinzak and Margalit (2001) analyzed over-represented sequence-signature pairs among protein–protein interactions. In our study, we use the protein–protein interaction (PPI) data sets of Uetz and Ito to predict domain–domain interactions (DDI) in yeast proteins. The protein-domain information is obtained from a protein-domain family database called PFAM (Bateman et al. 2000). Because every protein can be characterized by either a distinct domain or a combination of domains, understanding domain interactions is crucial to understanding the nature and extent of biomolecular interactions. Our study predicts probable domain–domain interactions solely on the basis of the information of protein–protein interactions. Because proteins interact with one another through their specific domains, predicting domain–domain interactions on a global scale from the entire protein interaction data set makes it possible to predict previously unknown protein–protein interactions from their domains. Thus, domain interactions extend the functional significance of proteins and present a global view of the protein–protein interaction network within a cell responsible for carrying out various biological and cellular functions. It is known that the yeast two-hybrid assay is not accurate in determining protein–protein interactions, and the interaction data used in our study certainly contain many false positive and false negative errors (Legrain and Selig 2000; Hazbun and Fields 2001; Mrowka et al. 2001). Taking into account these errors, we apply the Maximum Likelihood approach to estimate the probability of domain–domain interactions. We have also taken into account multiplicity of observations in the two data sets as evidenced by exchanged baits and preys, repeated interactions, and synonymously used gene names. 
To assess the accuracy of our method, we predict protein–protein interactions using the inferred domain–domain interactions, and compare them with the observed interactions. The following results are obtained: (1) Our method has shown robustness in analyzing incomplete data sets and dealing with various experimental errors, and we achieve 42.5% specificity and 77.6% sensitivity using the combined Uetz and Ito data. The relatively low specificity may be caused by the fact that the observed protein–protein interactions in the Uetz and Ito combined data represent only a small fraction of all of the real interactions. (2) Comparing our predicted protein–protein interactions with the MIPS protein–protein interactions obtained by methods other than the two-hybrid assays, we show that the prediction rate of our method is about 100 times better than that of a random assignment. (3) We also compare the gene expression profile correlation coefficients of our predictions with those of random protein pairs, and our predictions have a higher mean correlation coefficient. (4) Finally, we check for biological significance of our novel predictions, and find several interesting interactions such as RPS0A interacting with APG17 and TAF40 interacting with SPT3, which are consistent with the functions of the proteins. A complete description of our model and the results are given in the sections below.
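The step from inferred domain–domain interactions to predicted protein–protein interactions can be illustrated with a small sketch. It assumes, as a simplification not spelled out in this abstract, that two proteins interact if at least one of their domain pairs interacts and that domain pairs act independently, so the protein-level probability is one minus the product of the per-pair non-interaction probabilities. The domain names, probability values, and function name are hypothetical.

```python
from itertools import product


def interaction_probability(domains_a, domains_b, pair_prob):
    """Probability that two proteins interact, assuming independence between
    domain pairs and that one interacting pair is enough.

    pair_prob maps frozenset({domain_m, domain_n}) to an estimated
    domain-domain interaction probability; missing pairs count as 0.
    """
    p_no_pair_interacts = 1.0
    for m, n in product(domains_a, domains_b):
        p_no_pair_interacts *= 1.0 - pair_prob.get(frozenset((m, n)), 0.0)
    return 1.0 - p_no_pair_interacts


# Hypothetical domain-domain probabilities (illustrative values only).
pair_prob = {
    frozenset(("D1", "D2")): 0.5,
    frozenset(("D1", "D3")): 0.5,
}

protein_a = ["D1"]          # hypothetical protein with a single domain
protein_b = ["D2", "D3"]    # hypothetical protein with two domains

p_interact = interaction_probability(protein_a, protein_b, pair_prob)
print(p_interact)  # 1 - (1 - 0.5) * (1 - 0.5) = 0.75
```

In a maximum-likelihood setting, per-pair probabilities like those in `pair_prob` would be the parameters estimated from the observed interaction data; the sketch only shows how they could be combined once estimated.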

Journal Article
TL;DR: A new multiple genome rearrangement algorithm that is based on the rearrangement (rather than breakpoint) distance and that is applicable to both unichromosomal and multichromosomal genomes is proposed and applied for genome-scale phylogenetic tree reconstruction and deriving ancestral gene orders.
Abstract: Recent progress in genome-scale sequencing and comparative mapping raises new challenges in studies of genome rearrangements. Although the pairwise genome rearrangement problem is well-studied, algorithms for reconstructing rearrangement scenarios for multiple species are in great need. The previous approaches to multiple genome rearrangement problem were largely based on the breakpoint distance rather than on a more biologically accurate rearrangement (reversal) distance. Another shortcoming of the existing software tools is their inability to analyze rearrangements (inversions, translocations, fusions, and fissions) of multichromosomal genomes. This paper proposes a new multiple genome rearrangement algorithm that is based on the rearrangement (rather than breakpoint) distance and that is applicable to both unichromosomal and multichromosomal genomes. We further apply this algorithm for genome-scale phylogenetic tree reconstruction and deriving ancestral gene orders. In particular, our analysis suggests a new improved rearrangement scenario for a very difficult Campanulaceae cpDNA dataset and a putative rearrangement scenario for human, mouse and cat genomes.
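To make the distinction between breakpoint distance and rearrangement distance concrete, the following sketch counts breakpoints between two gene orders. It is a minimal illustration under simplifying assumptions added here (unsigned genes, a single linear chromosome, identical gene sets) and is not the algorithm proposed in the paper; the function name is hypothetical.

```python
def breakpoint_distance(order_a, order_b):
    """Count adjacencies in order_b that are not adjacencies in order_a.

    Genes are unsigned, each genome is one linear chromosome, and both
    genomes must contain exactly the same genes.
    """
    if sorted(order_a) != sorted(order_b):
        raise ValueError("both genomes must contain the same genes")
    # Relabel genes so that order_a becomes 1, 2, ..., n.
    rank = {gene: i + 1 for i, gene in enumerate(order_a)}
    # Frame the permutation with 0 and n + 1 so the chromosome ends count too.
    perm = [0] + [rank[gene] for gene in order_b] + [len(order_a) + 1]
    # A breakpoint is a pair of neighbours that are not consecutive in order_a.
    return sum(1 for x, y in zip(perm, perm[1:]) if abs(x - y) != 1)


ancestor = ["a", "b", "c", "d"]
descendant = ["a", "c", "b", "d"]   # one reversal of the segment b, c

print(breakpoint_distance(ancestor, descendant))  # 2
```

Here a single reversal creates two breakpoints, which shows why a breakpoint count is only a proxy for the number of rearrangement events that separate two genomes.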

Journal ArticleDOI
TL;DR: A computational procedure was developed for systematic detection of lineage-specific expansions (LSEs) of protein families in sequenced genomes and applied to obtain a census of LSEs in five eukaryotic species, indicating that their diversification occurred after the divergence of the major lineages of the eukaryotic crown group.
Abstract: The eukaryotic crown group (the unresolved assemblage of lineages in the eukaryotic tree, which includes plants, animals, fungi, and some protists, as opposed to early branching eukaryotes, which are all unicellular protists), although only representing the proverbial tip of the eukaryotic phylogenetic iceberg, encompasses a remarkable variety of organisms (Patterson 1999; Dacks and Doolittle 2001). This diversity is apparent in both morphological and biochemical features of the crown group that spans the entire range from unicellular yeasts and chlorophytes, through facultatively multicellular slime molds, to genuine multicellular organisms, plants, animals, and fungi (Sogin et al. 1996; Patterson 1999). The complete, or nearly complete, genome sequences from three major branches of the crown group, plants, animals, and fungi are starting to provide the first molecular explanations for both their unity and diversity. From one viewpoint, the crown-group eukaryotes are remarkably uniform in that they share a large set of conserved orthologs in the core components of their essential functional systems, such as those involved in DNA replication and repair, most aspects of RNA metabolism, cytoskeletal organization, protein degradation, and secretion (Chervitz et al. 1998; Rubin et al. 2000; Lander et al. 2001). Furthermore, components of the signal transduction pathways, structural and regulatory components of the nucleus, and pre-mRNA processing complexes, although showing clear differences between the major crown-group lineages, are largely constructed from the same set of protein domains, and are based on the same architectural principles (Chervitz et al. 1998; Aravind and Subramanian 1999; Rubin et al. 2000; Lander et al. 2001). This unity notwithstanding, preliminary comparative studies on the sequenced eukaryotic genomes also provided clues as to what evolutionary phenomena might underlie their diversity. 
At the level of the protein sets encoded in the crown-group genomes, the main contributing forces appear to be the emergence of new domain architectures through domain accretion and domain shuffling, lineage-specific gene loss, and lineage-specific expansion of protein families (Aravind and Subramanian 1999; Aravind et al. 2000; Rubin et al. 2000; Lander et al. 2001). Lineage-specific expansion (LSE) is defined in relative terms, as the proliferation of a protein family in a particular lineage, relative to the sister lineage, with which it is compared (Jordan et al. 2001). Thus, if two sister lineages, for example, Drosophila and Caenorhabditis representing insects and nematodes, respectively, are compared, all protein-family proliferation events (duplications to n-plications) that occurred in either of these lineages after their separation are considered LSEs. Preliminary analysis of proteins from the crown-group eukaryotic genomes revealed some tangible correlations between LSE and emergence of new biological functions, response to diverse environmental pressures, and organizational complexity. Some of the most striking cases of LSE are related to pathogen and stress response and include, among other families, expansions of the immunoglobulin superfamily associated with the vertebrate immune system, AP-ATPases involved in plant disease resistance (Hulbert et al. 2001), and the cytochrome P450 family, which participates in detoxification systems in both plants and animals (Nelson 1999; Tijet et al. 2001). Transcription factors represent another functional category of proteins that tend to show widespread LSE: the independent expansions of the POZ–C2H2 and C4DM–C2H2 fusions in insects, the nuclear hormone receptors in nematodes, and the KRAB-domain-fused Zn-fingers in vertebrates, apparently made substantial contributions to the evolution of developmental and differentiation features specific to each of these lineages (Sluder et al. 1999; Aravind et al. 
2000; Riechmann et al. 2000; Coulson et al. 2001; Lander et al. 2001). Despite a wealth of anecdotal information, we are unaware of a systematic comparative analysis of LSEs in eukaryotic genomes. With this objective, we devised a procedure to systematically detect LSEs. Having identified LSEs in five eukaryotic proteomes, those of Saccharomyces cerevisiae, Schizosaccharomyces pombe, Caenorhabditis elegans, Drosophila melanogaster, and Arabidopsis thaliana, we predicted, wherever feasible, the biochemical or biological functions of the lineage-specific clusters (LSC) and explored their potential roles in the diversification of the crown group. Here, we present a systematic analysis of the demography of LSEs and provide evidence for a major involvement of LSEs in the generation of the diversity of biological functions in multicellular eukaryotes.

Journal ArticleDOI
TL;DR: The pattern of origin of the genes created by retroposition in Drosophila was investigated and it was found that most of these X-derived autosomal retrogenes had evolved a testis expression pattern.
Abstract: New genes that originated by various molecular mechanisms are an essential component in understanding the evolution of genetic systems. We investigated the pattern of origin of the genes created by retroposition in Drosophila. We surveyed the whole Drosophila melanogaster genome for such new retrogenes and experimentally analyzed their functionality and evolutionary process. These retrogenes, functional as revealed by the analysis of expression, substitution, and population genetics, show a surprisingly asymmetric pattern in their origin. There is a significant excess of retrogenes that originate from the X chromosome and retropose to autosomes; new genes retroposed from autosomes are scarce. Further, we found that most of these X-derived autosomal retrogenes had evolved a testis expression pattern. These observations may be explained by natural selection favoring those new retrogenes that moved to autosomes and avoided the spermatogenesis X inactivation, and suggest the important role of genome position for the origin of new genes.

Journal ArticleDOI
TL;DR: The Drosophila genome carries 51 potential OBP genes, a number comparable to that of its odorant-receptor genes, and an intriguing subfamily of 12 putative OBPs that share a unique C-terminal structure with three conserved cysteines and a conserved proline are reported.
Abstract: Olfactory signal transduction has been well-studied and is generally similar in vertebrates, insects, crustaceans, and nematodes (Ache 1994; Hildebrand and Shepherd 1997; Prasad and Reed 1999). In all of these systems, odorant molecules are detected through interactions with specific G-protein-linked receptors present on the dendrites of olfactory receptor neurons. G-protein activation then produces a second-messenger cascade leading to ion channel activation and receptor neuron depolarization. How is the olfactory system capable of perceiving and discriminating among a myriad of different airborne odorants? One possibility is that these odorants are recognized by a correspondingly large number of receptors. In fact, large numbers of different odorant-receptor genes are found in both mammals (∼1000 genes in mice and rats; Mombaerts 1999) and the roundworm Caenorhabditis elegans (∼800 genes; Bargmann 1998; Robertson 2000). In contrast, recent analyses of the Drosophila melanogaster genome revealed far fewer potential odorant-receptor genes: 60 genes of which only 43 are expressed in the antenna or maxillary palp (Clyne et al. 1999; Gao and Chess 1999; Vosshall et al. 1999, 2000; Vosshall 2001). A related family of 56 receptors is expressed primarily in gustatory neurons (Scott et al. 2001). Why is the variety of odorant-receptor diversity in Drosophila more than an order of magnitude lower than it is in either mammals or C. elegans? Perhaps odorant receptors are not the only molecules involved in odorant recognition by insects. One attractive possibility is that another class of molecules, the odorant-binding proteins (OBPs), contributes substantially to the recognition of odorants in insects. OBPs are small, soluble proteins present at high levels in the fluid surrounding olfactory-receptor neurons (Pelosi 1994). They are generally thought to solubilize hydrophobic odorants and shuttle them to the underlying receptors (Vogt et al. 
1991; Pelosi 1994; Prestwich et al. 1995). However, they could potentially function in odorant recognition, perhaps by presentation of the odorant molecule to the underlying receptor (Pelosi 1994; Prestwich et al. 1995). In fact, there is increasing evidence that OBPs do play an active role in odorant recognition rather than merely serving as passive odorant shuttles. One line of evidence is the large number of OBPs present within a variety of insect species. For example, five OBPs have been described in the moth Antheraea pernyi (Breer et al. 1990; Raming et al. 1990; Krieger et al. 1991, 1997). Several studies have shown that the different OBPs found within a single insect species display distinct odorant-binding specificities (Du and Prestwich 1995; Prestwich et al. 1995; Maibeche-Coisne et al. 1997; Plettner et al. 2000). Furthermore, Drosophila that lack the “LUSH” OBP show specific deficits in response to the odorants ethanol or benzaldehyde (Kim et al. 1998; Wang et al. 2001). Also, different OBPs show differential expression patterns in distinct subsets of the olfactory sensory hairs (sensilla) on an insect's antenna (Steinbrecht et al. 1995; Steinbrecht 1996; Park et al. 2000). Each sensillum carries a limited number of olfactory receptor neurons that are exposed only to OBPs present within that particular sensillum. If OBPs and odorant receptors are expressed within different, but overlapping subsets of sensilla, the result would be a mosaic of sensilla with different odorant thresholds. Thus, a moderate number of OBPs could act in a combinatorial manner with a moderate number of odorant receptors to greatly increase the discriminating power of an insect's olfactory system. This combinatorial strategy does not appear to be the case for mammals. Odorant discrimination appears to be largely due to the diversity of olfactory receptors (∼1000; Mombaerts 1999) because only one or a few OBPs are present in the mammalian olfactory mucosa (Tegoni et al. 
2000), and they show fairly broad odorant specificities (Lobel et al. 2002). C. elegans also resembles the mammalian system with a large olfactory receptor population (∼800; Bargmann 1998; Robertson 2000). In the case of C. elegans, no OBP has been described (Rubin et al. 2000). Hence, we have two seemingly contrasting situations: Some organisms (mammals and nematodes) have large numbers of olfactory receptors and few or no OBPs, whereas insects have a moderate number of receptors coupled with a moderate number of OBPs. Exactly how many OBPs are there in insects, and how are their genes organized? In this study, we provide a comprehensive examination of OBP-like genes in Drosophila. We find that the Drosophila genome carries 51 potential OBP genes, a number comparable to that of its odorant-receptor genes (Clyne et al. 1999; Gao and Chess 1999; Vosshall et al. 1999, 2000; Vosshall 2001). We find that the majority (73%) of OBP-like genes occur in clusters of four to nine genes; two of these presumptive OBP gene clusters also include an odorant-receptor gene. Our analysis also reveals an apparently monophyletic subfamily of OBP-like proteins whose 12 members have a conserved C terminus.

Journal ArticleDOI
TL;DR: Microarrays containing 195,000 in situ synthesized oligonucleotide features have been created using a benchtop, maskless photolithographic instrument that eliminates the need for expensive and time-consuming chromium masks.
Abstract: Microarrays containing 195,000 in situ synthesized oligonucleotide features have been created using a benchtop, maskless photolithographic instrument. This instrument, the Maskless Array Synthesizer (MAS), uses a digital light processor (DLP) developed by Texas Instruments. The DLP creates the patterns of UV light used in the light-directed synthesis of oligonucleotides. This digital mask eliminates the need for expensive and time-consuming chromium masks. In this report, we describe experiments in which we tested this maskless technology for DNA synthesis on glass surfaces. Parameters examined included deprotection rates, repetitive yields, and oligonucleotide length. Custom gene expression arrays were manufactured and hybridized to Drosophila melanogaster and mouse samples. Quantitative PCR was used to validate the gene expression data from the mouse arrays.

Journal ArticleDOI
TL;DR: A group of large mammalian microarray datasets including the NCI60 cancer cell line panel, a leukemia tumor panel, and a phorbol ester induction time course as well as human and mouse tissue panels are analyzed, showing the problems inherent in the classical use of control genes in estimating gene expression levels in different mammalian cell contexts.
Abstract: Control genes, commonly defined as genes that are ubiquitously expressed at stable levels in different biological contexts, have been used to standardize quantitative expression studies for more than 25 yr. We analyzed a group of large mammalian microarray datasets including the NCI60 cancer cell line panel, a leukemia tumor panel, and a phorbol ester induction time course as well as human and mouse tissue panels. Twelve housekeeping genes commonly used as controls in classical expression studies (including GAPD, ACTB, B2M, TUBA, G6PD, LDHA, and HPRT) show considerable variability of expression both within and across microarray datasets. Although we can identify genes with lower variability within individual datasets by heuristic filtering, such genes invariably show different expression levels when compared across other microarray datasets. We confirm these results with an analysis of variance in a controlled mouse dataset, showing the extent of variability in gene expression across tissues. The results show the problems inherent in the classical use of control genes in estimating gene expression levels in different mammalian cell contexts, and highlight the importance of controlled study design in the construction of microarray experiments.

Journal ArticleDOI
TL;DR: This work reconstructs the gene content of ancestral Archaea and Proteobacteria and quantifies the processes connecting them to their present day representatives based on the distribution of genes in completely sequenced genomes.
Abstract: In the course of evolution, genomes are shaped by processes like gene loss, gene duplication, horizontal gene transfer, and gene genesis (the de novo origin of genes). Here we reconstruct the gene content of ancestral Archaea and Proteobacteria and quantify the processes connecting them to their present day representatives based on the distribution of genes in completely sequenced genomes. We estimate that the ancestor of the Proteobacteria contained around 2500 genes, and the ancestor of the Archaea around 2050 genes. Although it is necessary to invoke horizontal gene transfer to explain the content of present day genomes, gene loss, gene genesis, and simple vertical inheritance are quantitatively the most dominant processes in shaping the genome. Together they result in a turnover of gene content such that even the lineage leading from the ancestor of the Proteobacteria to the relatively large genome of Escherichia coli has lost at least 950 genes. Gene loss, unlike the other processes, correlates fairly well with time. This clock-like behavior suggests that gene loss is under negative selection, while the processes that add genes are under positive selection.

Journal ArticleDOI
TL;DR: A fully automatic system for microarray image quantification that automatically locates both subarray grids and individual spots, requiring no user identification of any image coordinates, and yields more accurate estimates of ratios than systems assuming spot circularity.
Abstract: DNA microarrays are now widely used to measure expression levels and DNA copy number in biological samples. Ratios of relative abundance of nucleic acids are derived from images of regular arrays of spots containing target genetic material to which fluorescently labeled samples are hybridized. Whereas there are a number of methods in use for the quantification of images, many of the software systems in wide use either encourage or require extensive human interaction at the level of individual spots on arrays. We present a fully automatic system for microarray image quantification. The system automatically locates both subarray grids and individual spots, requiring no user identification of any image coordinates. Ratios are computed based on explicit segmentation of each spot. On a typical image of 6000 spots, the entire process takes less than 20 sec. We present a quantitative assessment of performance on multiple replicates of genome-wide array-based comparative genomic hybridization experiments. By explicitly identifying the pixels in each spot, the system yields more accurate estimates of ratios than systems assuming spot circularity. The software, called UCSF Spot, runs on Windows platforms and is available free of charge for academic use.

Journal ArticleDOI
TL;DR: It is concluded that enrichment of clusters for biological function is, in general, highest at rather low cluster numbers, and that, at the optimal choice of cluster number, no method outperforms Euclidean distance for ratio-based measurements or Pearson distance for non-ratio-based measurements.
Abstract: We compare several commonly used expression-based gene clustering algorithms using a figure of merit based on the mutual information between cluster membership and known gene attributes. By studying various publicly available expression data sets we conclude that enrichment of clusters for biological function is, in general, highest at rather low cluster numbers. As a measure of dissimilarity between the expression patterns of two genes, no method outperforms Euclidean distance for ratio-based measurements, or Pearson distance for non-ratio-based measurements at the optimal choice of cluster number. We show the self-organized-map approach to be best for both measurement types at higher numbers of clusters. Clusters of genes derived from single- and average-linkage hierarchical clustering tend to produce worse-than-random results. [The algorithm described is available at http://llama.med.harvard.edu, under Software.]
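The quantity at the core of the figure of merit described above, the mutual information between cluster membership and a known gene attribute, can be sketched as follows. This is only the raw mutual information in bits for a single categorical attribute; how the authors turn it into their final score is not described in the abstract, and the function name and toy labels are assumptions.

```python
from collections import Counter
from math import log2


def mutual_information(clusters, attributes):
    """Mutual information (in bits) between two equally long label lists:
    one cluster label and one attribute label per gene."""
    if len(clusters) != len(attributes):
        raise ValueError("one cluster label and one attribute per gene")
    n = len(clusters)
    joint_counts = Counter(zip(clusters, attributes))
    cluster_counts = Counter(clusters)
    attribute_counts = Counter(attributes)
    mi = 0.0
    for (c, a), count in joint_counts.items():
        p_joint = count / n
        # p_joint / (p_c * p_a), written with counts to avoid extra divisions.
        ratio = count * n / (cluster_counts[c] * attribute_counts[a])
        mi += p_joint * log2(ratio)
    return mi


# Toy labels for four genes (hypothetical).
clusters = [0, 0, 1, 1]
informative = ["x", "x", "y", "y"]     # attribute tracks the clustering exactly
uninformative = ["x", "y", "x", "y"]   # attribute is independent of the clustering

print(mutual_information(clusters, informative))    # 1.0
print(mutual_information(clusters, uninformative))  # 0.0
```

A clustering whose groups line up with known annotations scores high, while one that ignores them scores near zero, which is the intuition behind comparing algorithms by this measure.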

Journal ArticleDOI
TL;DR: A computer algorithm is described, making use of the phylogenetic relationships among the sequences under study to make more accurate predictions, and finding several highly conserved motifs for which no function is yet known.
Abstract: Phylogenetic footprinting is a method for the discovery of regulatory elements in a set of orthologous regulatory regions from multiple species. It does so by identifying the best conserved motifs in those orthologous regions. We describe a computer algorithm designed specifically for this purpose, making use of the phylogenetic relationships among the sequences under study to make more accurate predictions. The program is guaranteed to report all sets of motifs with the lowest parsimony scores, calculated with respect to the phylogenetic tree relating the input species. We report the results of this algorithm on several data sets of interest. A large number of known functional binding sites are identified by our method, but we also find several highly conserved motifs for which no function is yet known.