Journal ArticleDOI

Initial sequencing and analysis of the human genome.

Eric S. Lander, Lauren Linton, Bruce W. Birren, Chad Nusbaum, +245 more (29 institutions)
15 Feb 2001, Nature (Nature Publishing Group), Vol. 409, Iss. 6822, pp. 860-921
TL;DR: The results of an international collaboration to produce and make freely available a draft sequence of the human genome are reported and an initial analysis is presented, describing some of the insights that can be gleaned from the sequence.
Abstract: The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.


Citations
Journal ArticleDOI
TL;DR: The development of a method—called long serial analysis of gene expression (LongSAGE), an adaption of the original SAGE approach—that can be used to rapidly identify novel genes and exons is described.
Abstract: A remaining challenge for the human genome project involves the identification and annotation of expressed genes. The public and private sequencing efforts have identified ∼15,000 sequences that meet stringent criteria for genes, such as correspondence with known genes from humans or other species, and have made another ∼10,000–20,000 gene predictions of lower confidence, supported by various types of in silico evidence, including homology studies, domain searches, and ab initio gene predictions1,2. These computational methods have limitations, both because they are unable to identify a significant fraction of genes and exons and because they are unable to provide definitive evidence about whether a hypothetical gene is actually expressed3,4. As the in silico approaches identified a smaller number of genes than anticipated5,6,7,8,9, we wondered whether high-throughput experimental analyses could be used to provide evidence for the expression of hypothetical genes and to reveal previously undiscovered genes. We describe here the development of such a method—called long serial analysis of gene expression (LongSAGE), an adaption of the original SAGE approach10—that can be used to rapidly identify novel genes and exons.

632 citations

Journal ArticleDOI
TL;DR: The extent of antisense transcription in the human genome is studied by analyzing public databases of expressed sequences with computational tools designed to identify sense-antisense transcriptional units on opposite DNA strands of the same genomic locus; the results indicate that antisense modulation of gene expression in human cells may be a common regulatory mechanism.
Abstract: An increasing number of eukaryotic genes are being found to have naturally occurring antisense transcripts. Here we study the extent of antisense transcription in the human genome by analyzing the public databases of expressed sequences using a set of computational tools designed to identify sense-antisense transcriptional units on opposite DNA strands of the same genomic locus. The resulting data set of 2,667 sense-antisense pairs was evaluated by microarrays containing strand-specific oligonucleotide probes derived from the region of overlap. Verification of specific cases by northern blot analysis with strand-specific riboprobes proved transcription from both DNA strands. We conclude that ≥60% of this data set, or ∼1,600 predicted sense-antisense transcriptional units, are transcribed from both DNA strands. This indicates that the occurrence of antisense transcription, usually regarded as infrequent, is a very common phenomenon in the human genome. Therefore, antisense modulation of gene expression in human cells may be a common regulatory mechanism.

630 citations

Journal ArticleDOI
Katja Luck, Dae-Kyum Kim, Luke Lambourne, Kerstin Spirohn, Bridget E. Begg, Wenting Bian, Ruth Brignall, Tiziana M. Cafarelli, Francisco J. Campos-Laborie, Benoit Charloteaux, Dong-Sic Choi, Atina G. Cote, Meaghan Daley, Steven Deimling, Alice Desbuleux, Amélie Dricot, Marinella Gebbia, Madeleine F. Hardy, Nishka Kishore, Jennifer J. Knapp, István Kovács, Irma Lemmens, Miles W. Mee, Joseph C. Mellor, Carl Pollis, Carles Pons, Aaron Richardson, Sadie Schlabach, Bridget Teeking, Anupama Yadav, Mariana Babor, Dawit Balcha, Omer Basha, Christian Bowman-Colin, Suet-Feung Chin, Soon Gang Choi, Claudia Colabella, Georges Coppin, Cassandra D’Amata, David De Ridder, Steffi De Rouck, Miquel Duran-Frigola, Hanane Ennajdaoui, Florian Goebels, Liana Goehring, Anjali Gopal, Ghazal Haddad, Elodie Hatchi, Mohamed Helmy, Yves Jacob, Yoseph Kassa, Serena Landini, Roujia Li, Natascha van Lieshout, Andrew MacWilliams, Dylan Markey, Joseph N. Paulson, Sudharshan Rangarajan, John Rasla, Ashyad Rayhan, Thomas Rolland, Adriana San-Miguel, Yun Shen, Dayag Sheykhkarimli, Gloria M. Sheynkman, Eyal Simonovsky, Murat Tasan, Alexander O. Tejeda, Vincent Tropepe, Jean-Claude Twizere, Yang Wang, Robert J. Weatheritt, Jochen Weile, Yu Xia, Xinping Yang, Esti Yeger-Lotem, Quan Zhong, Patrick Aloy, Gary D. Bader, Javier De Las Rivas, Suzanne Gaudet, Tong Hao, Janusz Rak, Jan Tavernier, David E. Hill, Marc Vidal, Frederick P. Roth, Michael A. Calderwood
08 Apr 2020, Nature
TL;DR: The utility of HuRI is demonstrated in identifying the specific subcellular roles of protein–protein interactions and in identifying potential molecular mechanisms that might underlie tissue-specific phenotypes of Mendelian diseases.
Abstract: Global insights into cellular organization and genome function require comprehensive understanding of the interactome networks that mediate genotype–phenotype relationships1,2. Here we present a human ‘all-by-all’ reference interactome map of human binary protein interactions, or ‘HuRI’. With approximately 53,000 protein–protein interactions, HuRI has approximately four times as many such interactions as there are high-quality curated interactions from small-scale studies. The integration of HuRI with genome3, transcriptome4 and proteome5 data enables cellular function to be studied within most physiological or pathological cellular contexts. We demonstrate the utility of HuRI in identifying the specific subcellular roles of protein–protein interactions. Inferred tissue-specific networks reveal general principles for the formation of cellular context-specific functions and elucidate potential molecular mechanisms that might underlie tissue-specific phenotypes of Mendelian diseases. HuRI is a systematic proteome-wide reference that links genomic variation to phenotypic outcomes. A human binary protein interactome map that includes around 53,000 protein–protein interactions involving more than 8,000 proteins provides a reference for the study of human cellular function in health and disease.

630 citations

Journal ArticleDOI
TL;DR: The haplotype phasing methods that are available are assessed, focusing in particular on statistical methods, and the practical aspects of their application are discussed, and recent developments that may transform this field are described.
Abstract: Determination of haplotype phase is becoming increasingly important as we enter the era of large-scale sequencing because many of its applications, such as imputing low-frequency variants and characterizing the relationship between genetic variation and disease susceptibility, are particularly relevant to sequence data. Haplotype phase can be generated through laboratory-based experimental methods, or it can be estimated using computational approaches. We assess the haplotype phasing methods that are available, focusing in particular on statistical methods, and we discuss the practical aspects of their application. We also describe recent developments that may transform this field, particularly the use of identity-by-descent for computational phasing.

630 citations

Journal ArticleDOI
TL;DR: This review summarizes the current understanding of aryl hydrocarbon receptor (AHR) diversity among animal species and the evolution of the AHR signaling pathway, as inferred from molecular studies in vertebrate and invertebrate animals.

627 citations

References
Journal ArticleDOI
TL;DR: A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.
Abstract: The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.

70,111 citations

Journal ArticleDOI
TL;DR: The definition and use of family-specific, manually curated gathering thresholds are explained and some of the features of domains of unknown function (also known as DUFs) are discussed, which constitute a rapidly growing class of families within Pfam.
Abstract: Pfam is a widely used database of protein families and domains. This article describes a set of major updates that we have implemented in the latest release (version 24.0). The most important change is that we now use HMMER3, the latest version of the popular profile hidden Markov model package. This software is approximately 100 times faster than HMMER2 and is more sensitive due to the routine use of the forward algorithm. The move to HMMER3 has necessitated numerous changes to Pfam that are described in detail. Pfam release 24.0 contains 11,912 families, of which a large number have been significantly updated during the past two years. Pfam is available via servers in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/).

14,075 citations

Journal ArticleDOI
J. Craig Venter, Mark Raymond Adams, Eugene W. Myers, Peter W. Li, +269 more (12 institutions)
16 Feb 2001, Science
TL;DR: Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems.
Abstract: A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies (a whole-genome assembly and a regional chromosome assembly) were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional approximately 12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.

12,098 citations

Journal ArticleDOI
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

10,262 citations

Journal ArticleDOI
09 Apr 1981
TL;DR: The complete sequence of the 16,569-base pair human mitochondrial genome is presented and shows extreme economy in that the genes have none or only a few noncoding bases between them, and in many cases the termination codons are not coded in the DNA but are created post-transcriptionally by polyadenylation of the mRNAs.
Abstract: The complete sequence of the 16,569-base pair human mitochondrial genome is presented. The genes for the 12S and 16S rRNAs, 22 tRNAs, cytochrome c oxidase subunits I, II and III, ATPase subunit 6, cytochrome b and eight other predicted protein coding genes have been located. The sequence shows extreme economy in that the genes have none or only a few noncoding bases between them, and in many cases the termination codons are not coded in the DNA but are created post-transcriptionally by polyadenylation of the mRNAs.

8,783 citations