
Showing papers by "Richard K. Wilson" published in 2007


Journal Article
14 Jun 2007-Nature
TL;DR: Functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project are reported, providing convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts.
Abstract: We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view of chromatin structure has emerged, including its inter-relationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded new mechanistic and evolutionary insights concerning the functional landscape of the human genome. Together, these studies are defining a path for pursuit of a more comprehensive characterization of human genome function.

5,091 citations


Journal Article
18 Oct 2007-Nature
TL;DR: The Phase II HapMap is described, which characterizes over 3.1 million human single nucleotide polymorphisms genotyped in 270 individuals from four geographically diverse populations and includes 25–35% of common SNP variation in the populations surveyed, and increased differentiation at non-synonymous, compared to synonymous, SNPs is demonstrated.
Abstract: We describe the Phase II HapMap, which characterizes over 3.1 million human single nucleotide polymorphisms (SNPs) genotyped in 270 individuals from four geographically diverse populations and includes 25-35% of common SNP variation in the populations surveyed. The map is estimated to capture untyped common variation with an average maximum r² of between 0.9 and 0.96 depending on population. We demonstrate that the current generation of commercial genome-wide genotyping products captures common Phase II SNPs with an average maximum r² of up to 0.8 in African and up to 0.95 in non-African populations, and that potential gains in power in association studies can be obtained through imputation. These data also reveal novel aspects of the structure of linkage disequilibrium. We show that 10-30% of pairs of individuals within a population share at least one region of extended genetic identity arising from recent ancestry and that up to 1% of all common variants are untaggable, primarily because they lie within recombination hotspots. We show that recombination rates vary systematically around genes and between genes of different function. Finally, we demonstrate increased differentiation at non-synonymous, compared to synonymous, SNPs, resulting from systematic differences in the strength or efficacy of natural selection between populations.
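The "average maximum r2" statistic quoted in this abstract can be illustrated with a small sketch. The abstract does not define the statistic, so this assumes the standard squared-correlation measure of linkage disequilibrium between two biallelic sites, and assumes that "average maximum" means: for each untyped site, take the best r2 against any typed site, then average. The function names and the toy haplotype data are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def r_squared(a: Sequence[int], b: Sequence[int]) -> float:
    """Squared correlation between two biallelic sites.

    Each argument lists the allele (0 or 1) carried by each phased haplotype.
    Assumed standard definition: r2 = D^2 / (pA(1-pA) pB(1-pB)), D = pAB - pA*pB.
    """
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(x & y for x, y in zip(a, b)) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # A monomorphic site carries no LD information; report 0 by convention here.
    return 0.0 if denom == 0 else d * d / denom


def mean_max_r2(sites: List[Sequence[int]], typed: List[int], untyped: List[int]) -> float:
    """Average, over untyped sites, of the best r2 achieved by any typed site."""
    best = [max(r_squared(sites[u], sites[t]) for t in typed) for u in untyped]
    return sum(best) / len(best)


if __name__ == "__main__":
    # Toy data: four sites observed on four haplotypes (hypothetical values).
    toy_sites = [
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 1, 0, 1],
        [1, 1, 0, 0],
    ]
    print(mean_max_r2(toy_sites, typed=[0], untyped=[1, 2, 3]))
```

In this toy case the single typed site perfectly tags two of the three untyped sites and is uncorrelated with the third, so the statistic reflects partial coverage.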

4,565 citations


Journal Article
Andrew G. Clark1, Michael B. Eisen2, Michael B. Eisen3, Douglas Smith, and 426 more (70 institutions)
08 Nov 2007-Nature
TL;DR: These genome sequences augment the formidable genetic tools that have made Drosophila melanogaster a pre-eminent model for animal genetics, and will further catalyse fundamental research on mechanisms of development, cell biology, genetics, disease, neurobiology, behaviour, physiology and evolution.
Abstract: Comparative analysis of multiple genomes in a phylogenetic framework dramatically improves the precision and sensitivity of evolutionary inference, producing more robust results than single-genome analyses can provide. The genomes of 12 Drosophila species, ten of which are presented here for the first time (sechellia, simulans, yakuba, erecta, ananassae, persimilis, willistoni, mojavensis, virilis and grimshawi), illustrate how rates and patterns of sequence divergence across taxa can illuminate evolutionary processes on a genomic scale. These genome sequences augment the formidable genetic tools that have made Drosophila melanogaster a pre-eminent model for animal genetics, and will further catalyse fundamental research on mechanisms of development, cell biology, genetics, disease, neurobiology, behaviour, physiology and evolution. Despite remarkable similarities among these Drosophila species, we identified many putatively non-neutral changes in protein-coding genes, non-coding RNA genes, and cis-regulatory regions. These may prove to underlie differences in the ecology and behaviour of these diverse species.

2,057 citations


Journal Article
Pardis C. Sabeti1, Pardis C. Sabeti2, Patrick Varilly2, Patrick Varilly1, and 255 more (50 institutions)
18 Oct 2007-Nature
TL;DR: ‘Long-range haplotype’ methods, which were developed to identify alleles segregating in a population that have undergone recent selection, are applied to HapMap2 data together with new methods based on cross-population comparisons that discover alleles that have swept to near-fixation within a population.
Abstract: With the advent of dense maps of human genetic variation, it is now possible to detect positive natural selection across the human genome. Here we report an analysis of over 3 million polymorphisms from the International HapMap Project Phase 2 (HapMap2). We used 'long-range haplotype' methods, which were developed to identify alleles segregating in a population that have undergone recent selection, and we also developed new methods that are based on cross-population comparisons to discover alleles that have swept to near-fixation within a population. The analysis reveals more than 300 strong candidate regions. Focusing on the strongest 22 regions, we develop a heuristic for scrutinizing these regions to identify candidate targets of selection. In a complementary analysis, we identify 26 non-synonymous, coding, single nucleotide polymorphisms showing regional evidence of positive selection. Examination of these candidates highlights three cases in which two genes in a common biological process have apparently undergone positive selection in the same population: LARGE and DMD, both related to infection by the Lassa virus, in West Africa; SLC24A5 and SLC45A2, both involved in skin pigmentation, in Europe; and EDAR and EDA2R, both involved in development of hair follicles, in Asia.

1,778 citations


Journal Article
13 Apr 2007-Science
TL;DR: The genome sequence of an Indian-origin Macaca mulatta female is determined and compared with chimpanzees and humans to reveal the structure of ancestral primate genomes and to identify evidence for positive selection and lineage-specific expansions and contractions of gene families.
Abstract: The rhesus macaque (Macaca mulatta) is an abundant primate species that diverged from the ancestors of Homo sapiens about 25 million years ago. Because they are genetically and physiologically similar to humans, rhesus monkeys are the most widely used nonhuman primate in basic and applied biomedical research. We determined the genome sequence of an Indian-origin Macaca mulatta female and compared the data with chimpanzees and humans to reveal the structure of ancestral primate genomes and to identify evidence for positive selection and lineage-specific expansions and contractions of gene families. A comparison of sequences from individual animals was used to investigate their underlying genetic diversity. The complete description of the macaque genome blueprint enhances the utility of this animal model for biomedical research and improves our understanding of the basic biology of the species.

1,297 citations


Journal Article
Barbara A. Weir1, Barbara A. Weir2, Michele S. Woo1, Gad Getz2, Sven Perner1, Sven Perner3, Li Ding4, Rameen Beroukhim1, Rameen Beroukhim2, William M. Lin2, William M. Lin1, Michael A. Province4, Aldi T. Kraja4, Laura A. Johnson1, Kinjal Shah2, Kinjal Shah1, Mitsuo Sato5, Roman K. Thomas6, Justine A. Barletta1, Ingrid B. Borecki4, Stephen R. Broderick7, Andrew C. Chang8, Derek Y. Chiang1, Derek Y. Chiang2, Lucian R. Chirieac1, Jeonghee Cho1, Yoshitaka Fujii9, Adi F. Gazdar5, Thomas J. Giordano8, Heidi Greulich2, Heidi Greulich1, Megan Hanna2, Megan Hanna1, Bruce E. Johnson1, Mark G. Kris7, Alex E. Lash7, Ling Lin4, Neal I. Lindeman1, Elaine R. Mardis4, John Douglas Mcpherson10, John D. Minna5, Margaret Morgan10, Mark Nadel2, Mark Nadel1, Mark B. Orringer8, John R. Osborne4, Brad Ozenberger11, Alex H. Ramos1, Alex H. Ramos2, James T. Robinson2, Jack A. Roth12, Valerie W. Rusch7, Hidefumi Sasaki9, Frances A. Shepherd13, Carrie Sougnez2, Margaret R. Spitz12, Ming-Sound Tsao13, David Twomey2, Roel G.W. Verhaak14, George M. Weinstock10, David A. Wheeler10, Wendy Winckler1, Wendy Winckler2, Akihiko Yoshizawa7, Soyoung Yu1, Maureen F. Zakowski7, Qunyuan Zhang4, David G. Beer8, Ignacio I. Wistuba12, Mark A. Watson4, Levi A. Garraway2, Levi A. Garraway1, Marc Ladanyi7, William D. Travis7, William Pao7, Mark A. Rubin2, Mark A. Rubin1, Stacey Gabriel2, Richard A. Gibbs10, Harold E. Varmus7, Richard K. Wilson4, Eric S. Lander1, Eric S. Lander14, Eric S. Lander2, Matthew Meyerson2, Matthew Meyerson1 
06 Dec 2007-Nature
TL;DR: A large-scale project to characterize copy-number alterations in primary lung adenocarcinomas using dense single nucleotide polymorphism arrays identifies NKX2-1 (NK2 homeobox 1, also called TITF1), which lies in the minimal 14q13.3 amplification interval and encodes a lineage-specific transcription factor, as a novel candidate proto-oncogene involved in a significant fraction of lung carcinomas.
Abstract: Somatic alterations in cellular DNA underlie almost all human cancers [1]. The prospect of targeted therapies [2] and the development of high-resolution, genome-wide approaches [3–8] are now spurring systematic efforts to characterize cancer genomes. Here we report a large-scale project to characterize copy-number alterations in primary lung adenocarcinomas. By analysis of a large collection of tumours (n = 371) using dense single nucleotide polymorphism arrays, we identify a total of 57 significantly recurrent events. We find that 26 of 39 autosomal chromosome arms show consistent large-scale copy-number gain or loss, of which only a handful have been linked to a specific gene. We also identify 31 recurrent focal events, including 24 amplifications and 7 homozygous deletions. Only six of these focal events are currently associated with known mutations in lung carcinomas. The most common event, amplification of chromosome 14q13.3, is found in 12% of samples. On the basis of genomic and functional analyses, we identify NKX2-1 (NK2 homeobox 1, also called TITF1), which lies in the minimal 14q13.3 amplification interval and encodes a lineage-specific transcription factor, as a novel candidate proto-oncogene involved in a significant fraction of lung adenocarcinomas. More generally, our results indicate that many of the genes that are involved in lung adenocarcinoma remain to be discovered. A collection of 528 snap-frozen lung adenocarcinoma resection specimens, with at least 70% estimated tumour content, was selected by a panel of thoracic pathologists (Supplementary Table 1); samples were anonymized to protect patient privacy. Tumour and normal DNAs were hybridized to Affymetrix 250K Sty single nucleotide polymorphism (SNP) arrays. Genomic copy number for each of over 238,000 probe sets was determined by calculating the intensity ratio between the tumour DNA and the average of a set of normal DNAs [9,10]. Segmented copy numbers for each tumour were inferred with the GLAD (gain and loss analysis of DNA) algorithm [11] and normalized to a median of two copies. Each copy number profile was then subjected to quality control, resulting in 371 high-quality samples used for further analysis, of which 242 had matched normal

1,087 citations


Journal Article
TL;DR: In this article, the authors examined how the intestinal environment affects microbial genome evolution and found that lateral gene transfer, mobile elements, and gene amplification have played important roles in affecting the ability of gut-dwelling Bacteroidetes to vary their cell surface, sense their environment, and harvest nutrient resources present in the distal intestine.
Abstract: The adult human intestine contains trillions of bacteria, representing hundreds of species and thousands of subspecies. Little is known about the selective pressures that have shaped and are shaping this community's component species, which are dominated by members of the Bacteroidetes and Firmicutes divisions. To examine how the intestinal environment affects microbial genome evolution, we have sequenced the genomes of two members of the normal distal human gut microbiota, Bacteroides vulgatus and Bacteroides distasonis, and by comparison with the few other sequenced gut and non-gut Bacteroidetes, analyzed their niche and habitat adaptations. The results show that lateral gene transfer, mobile elements, and gene amplification have played important roles in affecting the ability of gut-dwelling Bacteroidetes to vary their cell surface, sense their environment, and harvest nutrient resources present in the distal intestine. Our findings show that these processes have been a driving force in the adaptation of Bacteroidetes to the distal gut environment, and emphasize the importance of considering the evolution of humans from an additional perspective, namely the evolution of our microbiomes.

558 citations


Journal Article
Elliott H. Margulies1, Gregory M. Cooper2, Gregory M. Cooper3, George Asimenos2, Daryl J. Thomas4, Colin N. Dewey5, Colin N. Dewey6, Adam Siepel7, Adam Siepel4, Ewan Birney, Damian Keefe, Ariel S. Schwartz5, Minmei Hou8, James Taylor8, Sergey Nikolaev9, Juan I. Montoya-Burgos9, Ari Löytynoja, Simon Whelan10, Fabio Pardi, Tim Massingham, James B. Brown5, Peter J. Bickel5, Ian Holmes5, James C. Mullikin1, Abel Ureta-Vidal, Benedict Paten, Eric A. Stone2, Kate R. Rosenbloom4, W. James Kent4, Gerard G. Bouffard1, Xiaobin Guan1, Nancy F. Hansen1, Jacquelyn R. Idol1, Valerie Maduro1, Baishali Maskeri1, Jennifer C. McDowell1, Morgan Park1, Pamela J. Thomas1, Alice C. Young1, Robert W. Blakesley1, Donna M. Muzny11, Erica Sodergren11, David A. Wheeler11, Kim C. Worley11, Huaiyang Jiang11, George M. Weinstock11, Richard A. Gibbs11, Tina Graves12, Robert S. Fulton12, Elaine R. Mardis12, Richard K. Wilson12, Michele Clamp13, James Cuff13, Sante Gnerre13, David B. Jaffe13, Jean L. Chang13, Kerstin Lindblad-Toh13, Eric S. Lander13, Eric S. Lander14, Angie S. Hinrichs4, Heather Trumbower4, Hiram Clawson4, Ann S. Zweig4, Robert M. Kuhn4, Galt P. Barber4, Rachel A. Harte4, Donna Karolchik4, Matthew A. Field15, Richard A. Moore15, Carrie A. Matthewson4, Jacqueline E. Schein15, Marco A. Marra15, Stylianos E. Antonarakis9, Serafim Batzoglou2, Nick Goldman, Ross C. Hardison, David Haussler5, David Haussler4, Webb Miller8, Lior Pachter5, Eric D. Green1, Arend Sidow2 
TL;DR: The quantitative and qualitative trade-offs concomitant with alignment method choice and the levels of technical error that need to be accounted for in applications that require multisequence alignments are described.
Abstract: A key component of the ongoing ENCODE project involves rigorous comparative sequence analyses for the initially targeted 1% of the human genome. Here, we present orthologous sequence generation, alignment, and evolutionary constraint analyses of 23 mammalian species for all ENCODE targets. Alignments were generated using four different methods; comparisons of these methods reveal large-scale consistency but substantial differences in terms of small genomic rearrangements, sensitivity (sequence coverage), and specificity (alignment accuracy). We describe the quantitative and qualitative trade-offs concomitant with alignment method choice and the levels of technical error that need to be accounted for in applications that require multisequence alignments. Using the generated alignments, we identified constrained regions using three different methods. While the different constraint-detecting methods are in general agreement, there are important discrepancies relating to both the underlying alignments and the specific algorithms. However, by integrating the results across the alignments and constraint-detecting methods, we produced constraint annotations that were found to be robust based on multiple independent measures. Analyses of these annotations illustrate that most classes of experimentally annotated functional elements are enriched for constrained sequences; however, large portions of each class (with the exception of protein-coding sequences) do not overlap constrained regions. The latter elements might not be under primary sequence constraint, might not be constrained across all mammals, or might have expendable molecular functions. Conversely, 40% of the constrained sequences do not overlap any of the functional elements that have been experimentally identified. Together, these findings demonstrate and quantify how many genomic functional elements await basic molecular characterization.

214 citations


Journal Article
09 May 2007-PLOS ONE
TL;DR: This study is one of the first comprehensive mutational analyses of major genes in a specific signaling pathway in a sizeable cohort of lung adenocarcinomas and suggests the majority of gain-of-function mutations within kinase genes in the EGFR signaling pathway have already been identified.
Abstract: Background: Fifty percent of lung adenocarcinomas harbor somatic mutations in six genes that encode proteins in the EGFR signaling pathway, i.e., EGFR, HER2/ERBB2, HER4/ERBB4, PIK3CA, BRAF, and KRAS. We performed mutational profiling of a large cohort of lung adenocarcinomas to uncover other potential somatic mutations in genes of this signaling pathway that could contribute to lung tumorigenesis. Methodology/Principal Findings: We analyzed genomic DNA from a total of 261 resected, clinically annotated non-small cell lung cancer (NSCLC) specimens. The coding sequences of 39 genes were screened for somatic mutations via high-throughput dideoxynucleotide sequencing of PCR-amplified gene products. Mutations were considered to be somatic only if they were found in an independent tumor-derived PCR product but not in matched normal tissue. Sequencing of 9MB of tumor sequence identified 239 putative genetic variants. We further examined 22 variants found in RAS family genes and 135 variants localized to exons encoding the kinase domain of respective proteins. We identified a total of 37 non-synonymous somatic mutations; 36 were found collectively in EGFR, KRAS, BRAF, and PIK3CA. One somatic mutation was a previously unreported mutation in the kinase domain (exon 16) of FGFR4 (Glu681Lys), identified in 1 of 158 tumors. The FGFR4 mutation is analogous to a reported tumor-specific somatic mutation in ERBB2 and is located in the same exon as a previously reported kinase domain mutation in FGFR4 (Pro712Thr) in a lung adenocarcinoma cell line. Conclusions/Significance: This study is one of the first comprehensive mutational analyses of major genes in a specific signaling pathway in a sizeable cohort of lung adenocarcinomas. Our results suggest the majority of gain-of-function mutations within kinase genes in the EGFR signaling pathway have already been identified. Our findings also implicate FGFR4 in the pathogenesis of a subset of lung adenocarcinomas.

88 citations



Journal Article
TL;DR: PolyScan is presented, an algorithm and software implementation designed to provide de novo heterozygous indel detection and improved SNP identification in the context of high-throughput medical resequencing and suggests that PolyScan may play a useful role in the post human genome project research era.
Abstract: Small insertions and deletions (indels) and single nucleotide polymorphisms (SNPs) are common genetic variants that are thought to be associated with a wide variety of human diseases. Owing to the genome's size and complexity, manually characterizing each one of these variations in an individual is not practical. While significant progress has been made in automated single-base mutation discovery from the sequences of diploid PCR products, automated and reliable detection of indels continues to pose difficult challenges. In this paper, we present PolyScan, an algorithm and software implementation designed to provide de novo heterozygous indel detection and improved SNP identification in the context of high-throughput medical resequencing. Tests on a human diploid PCR-based sequence data set, consisting of 90,270 traces from 13 genes, indicate that PolyScan identified approximately 90% of the 151 consensus indel sites and approximately 84% of the 1546 heterozygous indels previously identified by manual inspection. Tests on tumor-derived data show that PolyScan better identifies high-quality, low-level mutations as compared with other mutation detection software. Moreover, SNP identification improves when reprocessing the results of other programs. These results suggest that PolyScan may play a useful role in the post human genome project research era.

Journal Article
TL;DR: The first detailed clone framework map of the gibbon genome is provided and the location of 86 evolutionary breakpoints to <1 Mb resolution is refined, suggesting that chromosomal rearrangement has been a longstanding property of this particular ape lineage.
Abstract: The gibbon karyotype is known to be extensively rearranged when compared to the human and to the ancestral primate karyotype. By combining a bioinformatics (paired-end sequence analysis) approach and a molecular cytogenetics approach, we have refined the synteny block arrangement of the white-cheeked gibbon (Nomascus leucogenys, NLE) with respect to the human genome. We provide the first detailed clone framework map of the gibbon genome and refine the location of 86 evolutionary breakpoints to <1 Mb resolution. An additional 12 breakpoints, mapping primarily to centromeric and telomeric regions, were mapped to ∼5 Mb resolution. Our combined FISH and BES analysis indicates that we have effectively subcloned 49 of these breakpoints within NLE gibbon BAC clones, mapped to a median resolution of 79.7 kb. Interestingly, many of the intervals associated with translocations were gene-rich, including some genes associated with normal skeletal development. Comparisons of NLE breakpoints with those of other gibbon species reveal variability in the position, suggesting that chromosomal rearrangement has been a longstanding property of this particular ape lineage. Our data emphasize the synergistic effect of combining computational genomics and cytogenetics and provide a framework for ultimate sequence and assembly of the gibbon genome.

Journal Article
TL;DR: It is concluded that excess low-frequency variation, intragenic recombination and lack of common disruptive exonic variants favor complete resequencing as the optimal approach for genetic association studies to identify regulatory SFTPB variants that cause neonatal respiratory distress syndrome in genetically diverse populations.
Abstract: Completely penetrant mutations in the surfactant protein B gene (SFTPB) and >75% reduction of SFTPB expression disrupt pulmonary surfactant function and cause neonatal respiratory distress syndrome. To inform studies of genetic regulation of SFTPB expression, we created a catalogue of SFTPB variants by comprehensive resequencing from an unselected, population-based cohort (n = 1,116). We found an excess of low-frequency variation [81 SNPs and five small insertion/deletions (in/dels)]. Despite its small genomic size (9.7 kb), SFTPB was characterized by weak linkage disequilibrium (LD) and high haplotype diversity. Using the HapMap Yoruban and European populations, we identified a recombination hot spot that spans SFTPB, was not detectable in our focused resequencing data, and accounts for weak LD. Using homology-based software tools, we discovered no definitively damaging exonic variants. We conclude that excess low-frequency variation, intragenic recombination and lack of common disruptive exonic variants favor complete resequencing as the optimal approach for genetic association studies to identify regulatory SFTPB variants that cause neonatal respiratory distress syndrome in genetically diverse populations.

Journal Article
TL;DR: A general modeling framework for laboratory data and its implementation as an information management system that handles all transactions underlying a throughput rate of about 9 million sequencing reactions of various kinds per month and has handily weathered a number of major pipeline reconfigurations.
Abstract: Investigators in the biological sciences continue to exploit laboratory automation methods and have dramatically increased the rates at which they can generate data. In many environments, the methods themselves also evolve in a rapid and fluid manner. These observations point to the importance of robust information management systems in the modern laboratory. Designing and implementing such systems is non-trivial and it appears that in many cases a database project ultimately proves unserviceable. We describe a general modeling framework for laboratory data and its implementation as an information management system. The model utilizes several abstraction techniques, focusing especially on the concepts of inheritance and meta-data. Traditional approaches commingle event-oriented data with regular entity data in ad hoc ways. Instead, we define distinct regular entity and event schemas, but fully integrate these via a standardized interface. The design allows straightforward definition of a "processing pipeline" as a sequence of events, obviating the need for separate workflow management systems. A layer above the event-oriented schema integrates events into a workflow by defining "processing directives", which act as automated project managers of items in the system. Directives can be added or modified in an almost trivial fashion, i.e., without the need for schema modification or re-certification of applications. Association between regular entities and events is managed via simple "many-to-many" relationships. We describe the programming interface, as well as techniques for handling input/output, process control, and state transitions. The implementation described here has served as the Washington University Genome Sequencing Center's primary information system for several years. It handles all transactions underlying a throughput rate of about 9 million sequencing reactions of various kinds per month and has handily weathered a number of major pipeline reconfigurations. The basic data model can be readily adapted to other high-volume processing environments.
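The design ideas in this abstract (separate entity and event records, a many-to-many association between them, and "processing directives" stored as data rather than schema) can be sketched in a few lines. This is only an illustrative toy under stated assumptions, not the system the authors built: the class names, field names, and step names (`Entity`, `Event`, `Registry`, "pick", "sequence", "analyze") are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Entity:
    """A regular entity, e.g. a plate or sample (hypothetical kinds)."""
    entity_id: int
    kind: str


@dataclass
class Event:
    """Something that happened; may involve many entities (many-to-many)."""
    event_id: int
    step: str
    entity_ids: List[int]


class Registry:
    """Toy in-memory store keeping entities, events, and directives apart."""

    def __init__(self) -> None:
        self.entities: Dict[int, Entity] = {}
        self.events: List[Event] = []
        # Directives are plain data: "after step X, schedule step Y".
        self.directives: Dict[str, str] = {}

    def add_entity(self, entity_id: int, kind: str) -> Entity:
        entity = Entity(entity_id, kind)
        self.entities[entity_id] = entity
        return entity

    def add_directive(self, after_step: str, next_step: str) -> None:
        # Adding or changing a directive needs no change to the classes above.
        self.directives[after_step] = next_step

    def record(self, step: str, entity_ids: List[int]) -> Event:
        event = Event(len(self.events) + 1, step, list(entity_ids))
        self.events.append(event)
        return event

    def events_for(self, entity_id: int) -> List[Event]:
        return [e for e in self.events if entity_id in e.entity_ids]

    def next_step(self, entity_id: int) -> Optional[str]:
        """Look up what the directives say should happen next for an entity."""
        history = self.events_for(entity_id)
        if not history:
            return None
        return self.directives.get(history[-1].step)


if __name__ == "__main__":
    reg = Registry()
    reg.add_entity(1, "plate")
    reg.add_directive("pick", "sequence")
    reg.record("pick", [1, 2])
    print(reg.next_step(1))
```

Because the pipeline is expressed as directive entries rather than code paths, reconfiguring it in this sketch amounts to calling `add_directive` again, which loosely mirrors the abstract's point that directives change without schema modification.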

Journal Article
TL;DR: A study to establish the safety and tolerability of FOLFIRI combined with vandetanib, a once-daily oral agent in Phase III development that selectively targets key signaling pathways in cancer by inhibiting VEGF, EGF and RET receptor tyrosine kinases.
Abstract: 4085 Background: Vandetanib (ZD6474) is a once-daily oral agent in Phase III development that selectively targets key signaling pathways in cancer by inhibiting VEGF, EGF and RET receptor tyrosine ...