Author

Jeremy Heil

Bio: Jeremy Heil is an academic researcher from Celera Corporation. The author has contributed to research on the topics Genome and Hybrid genome assembly. The author has an h-index of 8 and has co-authored 10 publications receiving 13,892 citations.

Papers
Journal Article
J. Craig Venter, Mark Raymond Adams, Eugene W. Myers, Peter W. Li, and 269 more authors (12 institutions)
16 Feb 2001-Science
TL;DR: Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems.
Abstract: A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies, a whole-genome assembly and a regional chromosome assembly, were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional approximately 12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history.
Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.
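The coverage figures quoted in this abstract can be checked with simple arithmetic. The following is a minimal sketch, not taken from the paper: the variable and function names are illustrative, and dividing by the 2.91-billion bp consensus length is an assumption made here (the stated 5.11-fold figure presumably rests on a slightly different genome-size estimate, so the naive result differs a little).

```python
# Figures quoted in the abstract; all names below are illustrative.
TOTAL_SEQUENCED_BP = 14.8e9    # total bases in high-quality reads
NUM_READS = 27_271_853         # number of high-quality reads
CONSENSUS_BP = 2.91e9          # euchromatic consensus length (assumed denominator)
CELERA_COVERAGE = 5.11         # fold coverage stated for the Celera reads
PUBLIC_SHRED_COVERAGE = 2.9    # fold coverage from the shredded public data


def fold_coverage(total_bases, genome_size):
    """Fold coverage = total sequenced bases / genome size."""
    return total_bases / genome_size


# Average read length implied by the totals (a little over 540 bp).
mean_read_length = TOTAL_SEQUENCED_BP / NUM_READS

# Naive coverage using the consensus length as the genome size.
naive_coverage = fold_coverage(TOTAL_SEQUENCED_BP, CONSENSUS_BP)

# Adding the shredded public data gives the "eightfold" effective coverage.
combined_coverage = CELERA_COVERAGE + PUBLIC_SHRED_COVERAGE

print(f"mean read length: {mean_read_length:.0f} bp")
print(f"naive coverage: {naive_coverage:.2f}x")
print(f"combined coverage: {combined_coverage:.2f}x")
```

The sum of the two stated coverages (5.11 + 2.9) is what matches the abstract's "eightfold" effective coverage.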

12,098 citations

Journal Article
TL;DR: Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems.
Abstract: A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies—a whole-genome assembly and a regional chromosome assembly—were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional ∼12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. 
Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.

1,674 citations

Journal Article
TL;DR: Identifies and characterizes 2,000 human diallelic insertion/deletion polymorphisms (indels) distributed throughout the human genome, and finds that new alleles were generally lower in frequency than old alleles.
Abstract: We report the identification and characterization of 2,000 human diallelic insertion/deletion polymorphisms (indels) distributed throughout the human genome. Candidate indels were identified by comparison of overlapping genomic or cDNA sequences. Average confirmation rate for indels with a ⩾2-nt allele-length difference was 58%, but the confirmation rate for indels with a 1-nt length difference was only 14%. The vast majority of the human diallelic indels were monomorphic in chimpanzees and gorillas. The ratio of deletion:insertion mutations was 4.1. Allele frequencies for the indels were measured in Europeans, Africans, Japanese, and Native Americans. New alleles were generally lower in frequency than old alleles. This tendency was most pronounced for the Africans, who are likely to be closest among the four groups to the original modern human population. Diallelic indels comprise ∼8% of all human polymorphisms. Their abundance and ease of analysis make them useful for many applications.

367 citations

Journal Article
TL;DR: Evaluations indicate that this SNP screening set is more informative than the Marshfield Clinic's commonly used microsatellite-based screening set and provides a resource for fast genome scanning for disease genes.
Abstract: Recent advances in technologies for high-throughput single-nucleotide polymorphism (SNP)–based genotyping have improved efficiency and cost so that it is now becoming reasonable to consider the use of SNPs for genomewide linkage analysis. However, a suitable screening set of SNPs and a corresponding linkage map have yet to be described. The SNP maps described here fill this void and provide a resource for fast genome scanning for disease genes. We have evaluated 6,297 SNPs in a diversity panel composed of European Americans, African Americans, and Asians. The markers were assessed for assay robustness, suitable allele frequencies, and informativeness of multi-SNP clusters. Individuals from 56 Centre d'Etude du Polymorphisme Humain pedigrees, with >770 potentially informative meioses altogether, were genotyped with a subset of 2,988 SNPs, for map construction. Extensive genotyping-error analysis was performed, and the resulting SNP linkage map has an average map resolution of 3.9 cM, with map positions containing either a single SNP or several tightly linked SNPs. The order of markers on this map compares favorably with several other linkage and physical maps. We compared map distances between the SNP linkage map and the interpolated SNP linkage map constructed by the deCode Genetics group. We also evaluated cM/Mb distance ratios in females and males, along each chromosome, showing broadly defined regions of increased and decreased rates of recombination. Evaluations indicate that this SNP screening set is more informative than the Marshfield Clinic’s commonly used microsatellite-based screening set.

118 citations

Journal Article
TL;DR: It is found that ascertainment-corrected rho varies along the genome by more than two orders of magnitude, implying great differences in the recombinational history of different portions of the genome, a finding that has direct bearing on the design and utility of LD mapping and on the HapMap project.
Abstract: The prospect of using linkage disequilibrium (LD) for fine-scale mapping in humans has attracted considerable attention, and, during the validation of a set of single-nucleotide polymorphisms (SNPs) for linkage analysis, a set of data for 4,833 SNPs in 538 clusters was produced that provides a rich picture of local attributes of LD across the genome. LD estimates may be biased depending on the means by which SNPs are first identified, and a particular problem of ascertainment bias arises when SNPs identified in small heterogeneous panels are subsequently typed in larger population samples. Understanding and correcting ascertainment bias is essential for a useful quantitative assessment of the landscape of LD across the human genome. Heterogeneity in the population recombination rate, rho=4Nr, along the genome reflects how variable the density of markers will have to be for optimal coverage. We find that ascertainment-corrected rho varies along the genome by more than two orders of magnitude, implying great differences in the recombinational history of different portions of our genome. The distribution of rho is unimodal, and we show that this is compatible with a wide range of mixtures of hotspots in a background of variable recombination rate. Although rho is significantly correlated across the three population samples, some regions of the genome exhibit population-specific spikes or troughs in rho that are too large to be explained by sampling. This result is consistent with differences in the genealogical depth of local genomic regions, a finding that has direct bearing on the design and utility of LD mapping and on the National Institutes of Health HapMap project.
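The relation rho=4Nr in this abstract is the standard population recombination rate, where N is the effective population size and r is the recombination rate per generation. The sketch below only illustrates the formula; the function name and every numeric value are hypothetical choices made here and are not taken from the paper.

```python
def population_recombination_rate(effective_population_size, recombination_rate):
    """rho = 4 * N * r, with r given per generation (here per base pair)."""
    return 4 * effective_population_size * recombination_rate


# Hypothetical inputs: one effective population size and three local
# recombination rates spanning two orders of magnitude.
N_E = 10_000
r_values = [1e-9, 1e-8, 1e-7]

rhos = [population_recombination_rate(N_E, r) for r in r_values]
fold_range = max(rhos) / min(rhos)

for r, rho in zip(r_values, rhos):
    print(f"r = {r:.0e} per bp  ->  rho = {rho:.1e} per bp")
print(f"fold range in rho: {fold_range:.0f}")
```

With N held fixed, rho scales linearly with r, so a hundredfold spread in the local recombination rate produces a hundredfold spread in rho, the scale of variation the abstract reports.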

83 citations


Cited by
Journal Article
Eric S. Lander, Lauren Linton, Bruce W. Birren, Chad Nusbaum, and 245 more authors (29 institutions)
15 Feb 2001-Nature
TL;DR: The results of an international collaboration to produce and make freely available a draft sequence of the human genome are reported and an initial analysis is presented, describing some of the insights that can be gleaned from the sequence.
Abstract: The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.

22,269 citations

Journal Article
TL;DR: Velvet is a set of de Bruijn graph algorithms for sequence assembly that can leverage very short reads in combination with read pairs to produce useful assemblies; on real Solexa data without read pairs, its contig lengths closely agreed with simulated results.
Abstract: We have developed a new set of algorithms, collectively called "Velvet," to manipulate de Bruijn graphs for genomic sequence assembly. A de Bruijn graph is a compact representation based on short words (k-mers) that is ideal for high coverage, very short read (25-50 bp) data sets. Applying Velvet to very short reads and paired-ends information only, one can produce contigs of significant length, up to 50-kb N50 length in simulations of prokaryotic data and 3-kb N50 on simulated mammalian BACs. When applied to real Solexa data sets without read pairs, Velvet generated contigs of approximately 8 kb in a prokaryote and 2 kb in a mammalian BAC, in close agreement with our simulated results without read-pair information. Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies.
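The two ideas this abstract relies on, a de Bruijn graph built from k-mers and the N50 contig statistic, can be illustrated in a few lines. This is a minimal textbook-style sketch and not Velvet's actual data structure or code: here nodes are (k-1)-mers and each distinct k-mer is an edge, and the function names and toy reads are made up for illustration.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers it points to.

    Every distinct k-mer found in the reads contributes one edge,
    from its first k-1 bases to its last k-1 bases.
    """
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])

    graph = defaultdict(list)
    for kmer in sorted(kmers):  # sorted only to make the output deterministic
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def n50(contig_lengths):
    """Largest length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # no contigs


# Toy example: two overlapping reads and a handful of contig lengths.
toy_graph = build_de_bruijn_graph(["ACGTAC", "GTACGG"], k=3)
toy_n50 = n50([2, 3, 4, 5, 6])
print(toy_graph)
print(toy_n50)
```

Because overlapping reads share k-mers, they collapse onto the same edges, which is why the representation stays compact at high coverage.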

9,389 citations

Journal Article
06 Dec 2002-Science
TL;DR: The protein kinase complement of the human genome is catalogued using public and proprietary genomic, complementary DNA, and expressed sequence tag sequences to provide a starting point for comprehensive analysis of protein phosphorylation in normal and disease states and a detailed view of the current state of human genome analysis through a focus on one large gene family.
Abstract: We have catalogued the protein kinase complement of the human genome (the "kinome") using public and proprietary genomic, complementary DNA, and expressed sequence tag (EST) sequences. This provides a starting point for comprehensive analysis of protein phosphorylation in normal and disease states, as well as a detailed view of the current state of human genome analysis through a focus on one large gene family. We identify 518 putative protein kinase genes, of which 71 have not previously been reported or described as kinases, and we extend or correct the protein sequences of 56 more kinases. New genes include members of well-studied families as well as previously unidentified families, some of which are conserved in model organisms. Classification and comparison with model organism kinomes identified orthologous groups and highlighted expansions specific to human and other lineages. We also identified 106 protein kinase pseudogenes. Chromosomal mapping revealed several small clusters of kinase genes and revealed that 244 kinases map to disease loci or cancer amplicons.

7,486 citations

Journal Article
TL;DR: The heritability of methylation states and the secondary nature of the decision to invite or exclude methylation support the idea that DNA methylation is adapted for a specific cellular memory function in development.
Abstract: The character of a cell is defined by its constituent proteins, which are the result of specific patterns of gene expression. Crucial determinants of gene expression patterns are DNA-binding transcription factors that choose genes for transcriptional activation or repression by recognizing the sequence of DNA bases in their promoter regions. Interaction of these factors with their cognate sequences triggers a chain of events, often involving changes in the structure of chromatin, that leads to the assembly of an active transcription complex (e.g., Cosma et al. 1999). But the types of transcription factors present in a cell are not alone sufficient to define its spectrum of gene activity, as the transcriptional potential of a genome can become restricted in a stable manner during development. The constraints imposed by developmental history probably account for the very low efficiency of cloning animals from the nuclei of differentiated cells (Rideout et al. 2001; Wakayama and Yanagimachi 2001). A “transcription factors only” model would predict that the gene expression pattern of a differentiated nucleus would be completely reversible upon exposure to a new spectrum of factors. Although many aspects of expression can be reprogrammed in this way (Gurdon 1999), some marks of differentiation are evidently so stable that immersion in an alien cytoplasm cannot erase the memory. The genomic sequence of a differentiated cell is thought to be identical in most cases to that of the zygote from which it is descended (mammalian B and T cells being an obvious exception). This means that the marks of developmental history are unlikely to be caused by widespread somatic mutation. Processes less irrevocable than mutation fall under the umbrella term “epigenetic” mechanisms. A current definition of epigenetics is: “The study of mitotically and/or meiotically heritable changes in gene function that cannot be explained by changes in DNA sequence” (Russo et al. 1996). 
There are two epigenetic systems that affect animal development and fulfill the criterion of heritability: DNA methylation and the Polycomb-trithorax group (Pc-G/trx) protein complexes. (Histone modification has some attributes of an epigenetic process, but the issue of heritability has yet to be resolved.) This review concerns DNA methylation, focusing on the generation, inheritance, and biological significance of genomic methylation patterns in the development of mammals. Data will be discussed favoring the notion that DNA methylation may only affect genes that are already silenced by other mechanisms in the embryo. Embryonic transcription, on the other hand, may cause the exclusion of the DNA methylation machinery. The heritability of methylation states and the secondary nature of the decision to invite or exclude methylation support the idea that DNA methylation is adapted for a specific cellular memory function in development. Indeed, the possibility will be discussed that DNA methylation and Pc-G/trx may represent alternative systems of epigenetic memory that have been interchanged over evolutionary time. Animal DNA methylation has been the subject of several recent reviews (Bird and Wolffe 1999; Bestor 2000; Hsieh 2000; Costello and Plass 2001; Jones and Takai 2001). For recent reviews of plant and fungal DNA methylation, see Finnegan et al. (2000), Martienssen and Colot (2001), and Matzke et al. (2001).

6,691 citations

Journal Article
Robert H. Waterston, Kerstin Lindblad-Toh, Ewan Birney, Jane Rogers, and 219 more authors (26 institutions)
05 Dec 2002-Nature
TL;DR: The results of an international collaboration to produce a high-quality draft sequence of the mouse genome are reported and an initial comparative analysis of the mouse and human genomes is presented, describing some of the insights that can be gleaned from the two sequences.
Abstract: The sequence of the mouse genome is a key informational tool for understanding the contents of the human genome and a key experimental tool for biomedical research. Here, we report the results of an international collaboration to produce a high-quality draft sequence of the mouse genome. We also present an initial comparative analysis of the mouse and human genomes, describing some of the insights that can be gleaned from the two sequences. We discuss topics including the analysis of the evolutionary forces shaping the size, structure and sequence of the genomes; the conservation of large-scale synteny across most of the genomes; the much lower extent of sequence orthology covering less than half of the genomes; the proportions of the genomes under selection; the number of protein-coding genes; the expansion of gene families related to reproduction and immunity; the evolution of proteins; and the identification of intraspecies polymorphism.

6,643 citations