Journal Article

Initial sequencing and analysis of the human genome.

Eric S. Lander, Lauren Linton, Bruce W. Birren, Chad Nusbaum, +245 more (29 institutions)
15 Feb 2001-Nature (Nature Publishing Group)-Vol. 409, Iss: 6822, pp 860-921
TL;DR: The results of an international collaboration to produce and make freely available a draft sequence of the human genome are reported and an initial analysis is presented, describing some of the insights that can be gleaned from the sequence.
Abstract: The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.


Citations
Journal Article
TL;DR: The definition and use of family-specific, manually curated gathering thresholds are explained and some of the features of domains of unknown function (also known as DUFs) are discussed, which constitute a rapidly growing class of families within Pfam.
Abstract: Pfam is a widely used database of protein families and domains. This article describes a set of major updates that we have implemented in the latest release (version 24.0). The most important change is that we now use HMMER3, the latest version of the popular profile hidden Markov model package. This software is approximately 100 times faster than HMMER2 and is more sensitive due to the routine use of the forward algorithm. The move to HMMER3 has necessitated numerous changes to Pfam that are described in detail. Pfam release 24.0 contains 11,912 families, of which a large number have been significantly updated during the past two years. Pfam is available via servers in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/).

14,075 citations

Journal Article
J. Craig Venter, Mark Raymond Adams, Eugene W. Myers, Peter W. Li, +269 more (12 institutions)
16 Feb 2001-Science
TL;DR: Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems.
Abstract: A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies-a whole-genome assembly and a regional chromosome assembly-were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional approximately 12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. 
Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.

12,098 citations

Journal Article
14 Jan 2005-Cell
TL;DR: In a four-genome analysis of 3' UTRs, approximately 13,000 regulatory relationships were detected above the estimate of false-positive predictions, thereby implicating as miRNA targets more than 5300 human genes, which represented 30% of the gene set.

11,624 citations

Journal Article
TL;DR: A mature web tool for rapid and reliable display of any requested portion of the genome at any scale, together with several dozen aligned annotation tracks, is provided at http://genome.ucsc.edu.
Abstract: As vertebrate genome sequences near completion and research refocuses to their analysis, the issue of effective genome annotation display becomes critical. A mature web tool for rapid and reliable display of any requested portion of the genome at any scale, together with several dozen aligned annotation tracks, is provided at http://genome.ucsc.edu. This browser displays assembly contigs and gaps, mRNA and expressed sequence tag alignments, multiple gene predictions, cross-species homologies, single nucleotide polymorphisms, sequence-tagged sites, radiation hybrid data, transposon repeats, and more as a stack of coregistered tracks. Text and sequence-based searches provide quick and precise access to any region of specific interest. Secondary links from individual features lead to sequence details and supplementary off-site databases. One-half of the annotation tracks are computed at the University of California, Santa Cruz from publicly available sequence data; collaborators worldwide provide the rest. Users can stably add their own custom tracks to the browser for educational or research purposes. The conceptual and technical framework of the browser, its underlying MYSQL database, and overall use are described. The web site currently serves over 50,000 pages per day to over 3000 different users.

9,605 citations

Journal Article
TL;DR: Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies; on real Solexa data sets without read pairs, its contig lengths were in close agreement with simulated results without read-pair information.
Abstract: We have developed a new set of algorithms, collectively called "Velvet," to manipulate de Bruijn graphs for genomic sequence assembly. A de Bruijn graph is a compact representation based on short words (k-mers) that is ideal for high coverage, very short read (25-50 bp) data sets. Applying Velvet to very short reads and paired-ends information only, one can produce contigs of significant length, up to 50-kb N50 length in simulations of prokaryotic data and 3-kb N50 on simulated mammalian BACs. When applied to real Solexa data sets without read pairs, Velvet generated contigs of approximately 8 kb in a prokaryote and 2 kb in a mammalian BAC, in close agreement with our simulated results without read-pair information. Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies.

9,389 citations
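
The Velvet abstract above names two ideas that are easier to follow with a small example: a de Bruijn graph built from short words (k-mers), and the N50 statistic used to report contig lengths. The following is a minimal Python sketch of both, not Velvet's implementation. The function names `de_bruijn_graph` and `n50`, the toy reads, and the value of k are hypothetical, and the sketch leaves out everything a real assembler needs (reverse complements, error correction, graph simplification, read-pair handling).

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a toy de Bruijn graph (illustrative only, not Velvet's data structure).

    Each k-mer found in a read contributes one edge from its (k-1)-base
    prefix to its (k-1)-base suffix. Repeated k-mers yield repeated edges.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases. Returns 0 for an
    empty list.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


if __name__ == "__main__":
    # Hypothetical toy input; real short reads would be 25-50 bp.
    toy_reads = ["ACGTAC", "GTACG"]
    for node, successors in de_bruijn_graph(toy_reads, k=3).items():
        print(node, "->", successors)
    print("N50:", n50([2, 3, 4, 5, 6]))
```

In this representation, overlapping reads share nodes automatically because they share k-mers, which is why the abstract describes the graph as compact for high-coverage data.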

References
Book Chapter
01 Jan 2000
TL;DR: It is interesting to note that there is a substantial literature indicating that this part of the ribosome does indeed move during protein synthesis, and ESSENS can also be used to find less generic structures, such as the sarcin-ricin loop.
Abstract: The ribonucleoproteins called ribosomes were discovered by cytologists in the mid-1950s, and by 1960 it was apparent that they catalyze protein synthesis. Ribosomes consume aminoacyl transfer RNAs, and the sequences of the proteins they produce are determined by those of the mRNAs with which they interact. The first ribosome crystals did not diffract to atomic resolution, but even if they had, it is uncertain what would have come of it in the short term; their analysis would have severely tested the crystallographic technology of the day. Heavy-atom cluster compounds are useful for phasing macromolecular diffraction patterns of large macromolecules at low resolution. The structure of an ordinary macromolecular crystal is solved when the experimental phases available are accurate to a resolution high enough so that an all-atom model of the molecule’s sequence can be fitted into the resulting electron density map. ESSENS can also be used to find less generic structures, such as the sarcin-ricin loop (SRL). The SRL is a critical part of the factor binding center, or GTPase center, of the large ribosomal subunit. A conformational change that moves L11 and its associated rRNA towards the putative factor binding site seems at least equally likely, and since there is no ribosomal material in the way to prevent it, that kind of motion is possible. In this connection, it is interesting to note that there is a substantial literature indicating that this part of the ribosome does indeed move during protein synthesis.

32 citations

Journal Article
TL;DR: Acquisition of 275 kb of mouse genomic sequence and comparative sequence analysis with HSA 21 and HSA 22 narrowed the junction from 380 kb to 18 kb; sequence-level analysis of an interchromosomal rearrangement during evolution had not been reported previously.
Abstract: During evolution, chromosomes are rearranged and become fixed into new patterns in new species. The relatively conservative nature of this process supports predictions of the arrangement of ancestral mammalian chromosomes, but the basis for these rearrangements is unknown. Physical mapping of mouse chromosome 10 (MMU 10) previously identified a 380-kb region containing the junction of material represented in human on chromosomes 21 (HSA 21) and 22 (HSA 22) that occurred in the evolutionary lineage of the mouse. Here, acquisition of 275 kb of mouse genomic sequence from this region and comparative sequence analysis with HSA 21 and HSA 22 narrowed the junction from 380 kb to 18 kb. The minimal junction region on MMU 10 contains a variety of repeats, including an L32-like ribosomal element and low-copy sequences found on several mouse chromosomes and represented in the mouse EST database. Sequence level analysis of an interchromosomal rearrangement during evolution has not been reported previously.

32 citations

Journal Article
24 Feb 2000-Nature
TL;DR: Ensembl is an open-source project and will provide both a common object framework for annotation as well as the synchronization tools needed for anyone to set up to serve annotation for all to see and use, although it could be feared that open annotation will swamp biologists with alternative contradictory views of the sequence.
Abstract: democratic solution to genome sequencing Sir — Jean-Michel Claverie writes in Correspondence about the problems of annotating the whole human genome sequence, given that a draft form will be available in a few months. While we agree with many of his points, we disagree with what he says about the lack of bioinformatics capacity to provide a useful basic analysis. The Sanger laboratories, with the European Molecular Biology Laboratory’s European Bioinformatics Institute, have been developing an automatic analysis system for some months; the results of the first full release of Ensembl can be seen at http://www.ensembl.org/. The system now tracks the daily output of human genomic sequence in real time. It is based on confirming ab initio predictions by homology and providing functional annotation via Pfam. So far 17,045 gene fragments are annotated from the 1,405,539,258 bases processed. We agree with Claverie about the limitations of any automatic analysis system, having ourselves worked on the semi-manual analysis of the human chromosome 22 sequence. However, a large subset of genes can already be predicted accurately, which will be very useful as a way into this huge volume of data. A key aspect of the system is its ability to keep track of genes despite revisions to the sequence. This will be important as the genome is completely sequenced over the next couple of years. Ensembl accession numbers assigned to genes are permanent identifiers that will refer to the same genes throughout this process. How can we go beyond this baseline automatic annotation? Claverie points out the chaos that would result from duplicated annotation efforts, each with different standards and different ways of presenting the data. He is also correct in arguing that no single collaborative group will be capable of annotating the entire genome consistently and to high quality. 
One way to deal with this is to have a monolithic single entity that invests 300 person-years into annotating the genome. A better one is ‘open annotation’, where the annotation required is distributed across a highly motivated community of biologists. We believe that many of the problems with open annotation are technical ones, which can be and are being addressed. The web allows different data sources to be readily crosslinked, but different websites have different formats and interfaces. An alternative, particularly appropriate for sequence data, is for a browser to merge annotation from multiple data sources on top of a baseline coordinate system to provide the user with a single annotation view. Lincoln Stein and colleagues are developing such a system (DAS) based on XML (see http://stein.cshl.org/das/). All that is then required for any centre to contribute annotation of all or part of the genome is to synchronize its coordinate system with its baseline server. Maintaining the coordinate system across a changing genome does require substantial resources, but keeping in synchronization with this need not. Ensembl is an open-source project and will provide both a common object framework for annotation as well as the synchronization tools needed for anyone to set up to serve annotation for all to see and use. The power of open-source software is well recognized, although it could be feared that open annotation will swamp biologists with alternative contradictory views of the sequence. We are more optimistic. Browsers will allow biologists to select only the data sources they wish to view. Just as some websites become popular, word of useful annotation will spread quickly, since selecting it will be as easy as bookmarking a new website. Software development has been democratized by open-source projects such as Linux, which have allowed everyone the opportunity to contribute. 
Open annotation provides the same opportunity for genomes, and so should speed our collective decoding of genetics without centralized annotation centres or commercial monopolies. Tim Hubbard*, Ewan Birney† *Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK †EMBL European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK

31 citations

Journal Article
TL;DR: Describes how the first glimpses of genomic sequence from human chromosome 7 are helping to refine human-mouse comparative maps and to assemble sequence-ready mouse physical maps, illustrating how the availability of genomic sequence directly facilitates studies in comparative genomics and genome evolution.
Abstract: The success of the ongoing Human Genome Project has resulted in accelerated plans for completing the human genome sequence and the earlier-than-anticipated initiation of efforts to sequence the mouse genome. As a complement to these efforts, we are utilizing the available human sequence to refine human-mouse comparative maps and to assemble sequence-ready mouse physical maps. Here we describe how the first glimpses of genomic sequence from human chromosome 7 are directly facilitating these activities. Specifically, we are actively enhancing the available human-mouse comparative map by analyzing human chromosome 7 sequence for the presence of orthologs of mapped mouse genes. Such orthologs can then be precisely positioned relative to mapped human STSs and other genes. The chromosome 7 sequence generated to date has allowed us to more than double the number of genes that can be placed on the comparative map. The latter effort reveals that human chromosome 7 is represented by at least 20 orthologous segments of DNA in the mouse genome. A second component of our program involves systematically analyzing the evolving human chromosome 7 sequence for the presence of matching mouse genes and expressed-sequence tags (ESTs). Mouse-specific hybridization probes are designed from such sequences and used to screen a mouse bacterial artificial chromosome (BAC) library, with the resulting data used to assemble BAC contigs based on probe-content data. Nascent contigs are then expanded using probes derived from newly generated BAC-end sequences. This approach produces BAC-based sequence-ready maps that are known to contain a gene(s) and are homologous to segments of the human genome for which sequence is already available. Our ongoing efforts have thus far resulted in the isolation and mapping of >3,800 mouse BACs, which have been assembled into >100 contigs. 
These contigs include >250 genes and represent approximately 40% of the mouse genome that is homologous to human chromosome 7. Together, these approaches illustrate how the availability of genomic sequence directly facilitates studies in comparative genomics and genome evolution.

29 citations

Journal Article
19 Mar 1999-Science
TL;DR: Several large grants announced this week by the U.S. government and the Wellcome Trust may make it possible for researchers to determine the order of the 3 billion bases in the human genetic code much earlier than expected--by the spring of 2000.
Abstract: Several large grants announced this week by the U.S. government and the Wellcome Trust, a U.K. charity, may make it possible for researchers to determine the order of the 3 billion bases in the human genetic code much earlier than expected--by the spring of 2000. On 15 March, the National Human Genome Research Institute announced that it had selected three major centers to do high-volume human DNA sequencing, awarding them $81.6 million over the next 10 months. At the same time, the Wellcome Trust upped this year's support of the human genome sequencing effort by the Sanger Centre in Cambridge, England, from $57 million to $77 million. But some fear that the smaller sequencing centers, left out of this round of competition, may become obsolete, and some international partners in the effort are feeling left out.

29 citations