
Showing papers by "Wellcome Trust Sanger Institute published in 2008"


Journal ArticleDOI
06 Nov 2008-Nature
TL;DR: An approach that generates several billion bases of accurate nucleotide sequence per experiment at low cost is reported, effective for accurate, rapid and economical whole-genome re-sequencing and many other biomedical applications.
Abstract: DNA sequence information underpins genetic research, enabling discoveries of important biological or medical benefit. Sequencing projects have traditionally used long (400-800 base pair) reads, but the existence of reference sequences for the human and many other genomes makes it possible to develop new, fast approaches to re-sequencing, whereby shorter reads are compared to a reference to identify intraspecies genetic variation. Here we report an approach that generates several billion bases of accurate nucleotide sequence per experiment at low cost. Single molecules of DNA are attached to a flat surface, amplified in situ and used as templates for synthetic sequencing with fluorescent reversible terminator deoxyribonucleotides. Images of the surface are analysed to generate high-quality sequence. We demonstrate application of this approach to human genome sequencing on flow-sorted X chromosomes and then scale the approach to determine the genome sequence of a male Yoruba from Ibadan, Nigeria. We build an accurate consensus sequence from >30x average depth of paired 35-base reads. We characterize four million single-nucleotide polymorphisms and four hundred thousand structural variants, many of which were previously unknown. Our approach is effective for accurate, rapid and economical whole-genome re-sequencing and many other biomedical applications.
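The depth figure in this abstract can be related to read counts with simple arithmetic: average depth is total sequenced bases divided by genome size. The sketch below is illustrative only; the 3-gigabase genome size is a round-number assumption, not a value from the abstract, and `reads_needed` is a hypothetical helper name.

```python
def reads_needed(genome_size_bp: int, depth: int, read_length_bp: int) -> int:
    """Reads required so that reads * read_length / genome_size >= depth."""
    total_bases = genome_size_bp * depth
    # Ceiling division using integers only, so no floating-point rounding.
    return -(-total_bases // read_length_bp)


# Assumption: a haploid human genome of roughly 3 billion bases.
ASSUMED_GENOME_SIZE_BP = 3_000_000_000

reads = reads_needed(ASSUMED_GENOME_SIZE_BP, depth=30, read_length_bp=35)
pairs = reads // 2  # the abstract describes paired 35-base reads
print(f"{reads:,} reads, i.e. about {pairs:,} read pairs")
```

Under these assumptions, 30x depth from 35-base reads corresponds to a few billion reads, which is consistent with the abstract's "several billion bases" per experiment being repeated across many runs.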

3,802 citations


Journal ArticleDOI
TL;DR: This work describes MAQ, software that can build assemblies by mapping shotgun short reads to a reference genome, using quality scores to derive genotype calls of the consensus sequence of a diploid genome, e.g., from a human sample.
Abstract: New sequencing technologies promise a new era in the use of DNA sequence. However, some of these technologies produce very short reads, typically of a few tens of base pairs, and to use these reads effectively requires new algorithms and software. In particular, there is a major issue in efficiently aligning short reads to a reference genome and handling ambiguity or lack of accuracy in this alignment. Here we introduce the concept of mapping quality, a measure of the confidence that a read actually comes from the position it is aligned to by the mapping algorithm. We describe the software MAQ that can build assemblies by mapping shotgun short reads to a reference genome, using quality scores to derive genotype calls of the consensus sequence of a diploid genome, e.g., from a human sample. MAQ makes full use of mate-pair information and estimates the error probability of each read alignment. Error probabilities are also derived for the final genotype calls, using a Bayesian statistical model that incorporates the mapping qualities, error probabilities from the raw sequence quality scores, sampling of the two haplotypes, and an empirical model for correlated errors at a site. Both read mapping and genotype calling are evaluated on simulated data and real data. MAQ is accurate, efficient, versatile, and user-friendly. It is freely available at http://maq.sourceforge.net.
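The abstract defines mapping quality as a confidence measure but does not give a formula. A common way to express such a confidence is Phred-style scaling, Q = -10 log10(P), where P is the probability that the read is placed at the wrong position. The sketch below assumes that scaling; the function names and the cap of 99 are illustrative choices, not MAQ's API.

```python
import math


def error_prob_to_mapq(p_wrong: float, cap: int = 99) -> int:
    """Phred-scale the probability that an alignment position is wrong.

    The cap is an assumed upper bound so that p_wrong == 0 stays finite.
    """
    if p_wrong <= 0.0:
        return cap
    return min(cap, round(-10.0 * math.log10(p_wrong)))


def mapq_to_error_prob(mapq: float) -> float:
    """Invert the Phred scaling: Q = -10 * log10(P)  =>  P = 10 ** (-Q / 10)."""
    return 10.0 ** (-mapq / 10.0)


for p in (0.1, 0.01, 0.001):
    print(p, "->", error_prob_to_mapq(p))
```

On this scale each additional 10 points means a tenfold lower chance that the read was mapped to the wrong place, which makes it easy to filter ambiguous alignments with a single threshold.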

2,927 citations


Journal ArticleDOI
TL;DR: The results strongly confirm 11 previously reported loci and provide genome-wide significant evidence for 21 additional loci, including the regions containing STAT3, JAK2, ICOSLG, CDKAL1 and ITLN1, which offer promise for informed therapeutic development.
Abstract: Several risk factors for Crohn's disease have been identified in recent genome-wide association studies. To advance gene discovery further, we combined data from three studies on Crohn's disease (a total of 3,230 cases and 4,829 controls) and carried out replication in 3,664 independent cases with a mixture of population-based and family-based controls. The results strongly confirm 11 previously reported loci and provide genome-wide significant evidence for 21 additional loci, including the regions containing STAT3, JAK2, ICOSLG, CDKAL1 and ITLN1. The expanded molecular understanding of the basis of this disease offers promise for informed therapeutic development.

2,584 citations


Journal ArticleDOI
Eleftheria Zeggini, Laura J. Scott, Richa Saxena, Benjamin F. Voight, Jonathan Marchini, T Hu, P. I. W. de Bakker, Gonçalo R. Abecasis, Peter Almgren, Gregers S. Andersen, Kristin Ardlie, Kristina Bengtsson Boström, Richard N. Bergman, Lori L. Bonnycastle, Knut Borch-Johnsen, Noël P. Burtt, H Chen, Peter S. Chines, Mark J. Daly, P Deodhar, C.-J. Ding, A. S. F. Doney, William L. Duren, Katherine S. Elliott, Mike Erdos, Timothy M. Frayling, Rachel M. Freathy, Lauren Gianniny, Harald Grallert, Niels Grarup, Christopher J. Groves, Candace Guiducci, Torben Hansen, Christian Herder, Graham A. Hitman, Thomas Edward Hughes, Bo Isomaa, Anne U. Jackson, Torben Jørgensen, Augustine Kong, Kari Kubalanza, Finny G Kuruvilla, Johanna Kuusisto, Claudia Langenberg, Hana Lango, Torsten Lauritzen, Yun Li, Cecilia M. Lindgren, Valeriya Lyssenko, Amanda F. Marvelle, Christine Meisinger, Kristian Midthjell, Karen L. Mohlke, Mario A. Morken, Andrew D. Morris, Narisu Narisu, Peter M. Nilsson, Katharine R. Owen, C. N. A. Palmer, Felicity Payne, J. R. B. Perry, E Pettersen, Carl Platou, Inga Prokopenko, Lu Qi, L Qin, Nigel W. Rayner, Matthew G. Rees, J J Roix, A Sandbaek, Beverley M. Shields, Marketa Sjögren, Valgerdur Steinthorsdottir, Heather M. Stringham, Amy J. Swift, Gudmar Thorleifsson, Unnur Thorsteinsdottir, Nicholas J. Timpson, Tiinamaija Tuomi, Jaakko Tuomilehto, Mark Walker, Richard M. Watanabe, Michael N. Weedon, Cristen J. Willer, Thomas Illig, Kristian Hveem, Frank B. Hu, Markku Laakso, Kari Stefansson, Oluf Pedersen, Nicholas J. Wareham, Inês Barroso, Andrew T. Hattersley, Francis S. Collins, Leif Groop, Mark I. McCarthy, Michael Boehnke, David Altshuler
TL;DR: The results illustrate the value of large discovery and follow-up samples for gaining further insights into the inherited basis of T2D, and detect at least six previously unknown loci with robust evidence for association.
Abstract: Genome-wide association (GWA) studies have identified multiple loci at which common variants modestly but reproducibly influence risk of type 2 diabetes (T2D). Established associations to common and rare variants explain only a small proportion of the heritability of T2D. As previously published analyses had limited power to identify variants with modest effects, we carried out meta-analysis of three T2D GWA scans comprising 10,128 individuals of European descent and approximately 2.2 million SNPs (directly genotyped and imputed), followed by replication testing in an independent sample with an effective sample size of up to 53,975. We detected at least six previously unknown loci with robust evidence for association, including the JAZF1 (P = 5.0 × 10^-14), CDC123-CAMK1D (P = 1.2 × 10^-10), TSPAN8-LGR5 (P = 1.1 × 10^-9), THADA (P = 1.1 × 10^-9), ADAMTS9 (P = 1.2 × 10^-8) and NOTCH2 (P = 4.1 × 10^-8) gene regions. Our results illustrate the value of large discovery and follow-up samples for gaining further insights into the inherited basis of T2D.

1,872 citations


Journal ArticleDOI
Hreinn Stefansson, Dan Rujescu, Sven Cichon, Olli Pietiläinen, Andres Ingason, Stacy Steinberg, Ragnheidur Fossdal, Engilbert Sigurdsson, Thordur Sigmundsson, Jacobine E. Buizer-Voskamp, Thomas Hansen, Klaus D. Jakobsen, Pierandrea Muglia, Clyde Francks, Paul M. Matthews, Arnaldur Gylfason, Bjarni V. Halldorsson, Daniel F. Gudbjartsson, Thorgeir E. Thorgeirsson, Asgeir Sigurdsson, Adalbjorg Jonasdottir, Aslaug Jonasdottir, Asgeir Björnsson, Sigurborg Mattiasdottir, Thorarinn Blondal, Magnús Haraldsson, Brynja B. Magnusdottir, Ina Giegling, Hans-Jürgen Möller, Annette M. Hartmann, Kevin V. Shianna, Dongliang Ge, Anna C. Need, Caroline Crombie, Gillian Fraser, Nicholas Walker, Jouko Lönnqvist, Jaana Suvisaari, Annamarie Tuulio-Henriksson, Tiina Paunio, T. Toulopoulou, Elvira Bramon, Marta Di Forti, Robin M. Murray, Mirella Ruggeri, Evangelos Vassos, Sarah Tosato, Muriel Walshe, Tao Li, Catalina Vasilescu, Thomas W. Mühleisen, August G. Wang, Henrik Ullum, Srdjan Djurovic, Ingrid Melle, Jes Olesen, Lambertus A. Kiemeney, Barbara Franke, Chiara Sabatti, Nelson B. Freimer, Jeffrey R. Gulcher, Unnur Thorsteinsdottir, Augustine Kong, Ole A. Andreassen, Roel A. Ophoff, Alexander Georgi, Marcella Rietschel, Thomas Werge, Hannes Petursson, David Goldstein, Markus M. Nöthen, Leena Peltonen, David A. Collier, David St Clair, Kari Stefansson
11 Sep 2008-Nature
TL;DR: In a genome-wide search for CNVs associating with schizophrenia, a population-based sample was used to identify de novo CNVs by analysing 9,878 transmissions from parents to offspring; three deletions (at 1q21.1, 15q11.2 and 15q13.3) significantly associate with schizophrenia and related psychoses in the combined sample.
Abstract: Reduced fecundity, associated with severe mental disorders, places negative selection pressure on risk alleles and may explain, in part, why common variants have not been found that confer risk of disorders such as autism, schizophrenia and mental retardation. Thus, rare variants may account for a larger fraction of the overall genetic risk than previously assumed. In contrast to rare single nucleotide mutations, rare copy number variations (CNVs) can be detected using genome-wide single nucleotide polymorphism arrays. This has led to the identification of CNVs associated with mental retardation and autism. In a genome-wide search for CNVs associating with schizophrenia, we used a population-based sample to identify de novo CNVs by analysing 9,878 transmissions from parents to offspring. The 66 de novo CNVs identified were tested for association in a sample of 1,433 schizophrenia cases and 33,250 controls. Three deletions at 1q21.1, 15q11.2 and 15q13.3 showing nominal association with schizophrenia in the first sample (phase I) were followed up in a second sample of 3,285 cases and 7,951 controls (phase II). All three deletions significantly associate with schizophrenia and related psychoses in the combined sample. The identification of these rare, recurrent risk variants, having occurred independently in multiple founders and being subject to negative selection, is important in itself. CNV analysis may also point the way to the identification of additional and more prevalent risk variants in genes and pathways involved in schizophrenia.

1,767 citations


Journal ArticleDOI
TL;DR: It is shown that tumor cells can disseminate systemically from earliest epithelial alterations in HER-2 and PyMT transgenic mice and from ductal carcinoma in situ in women, and release from dormancy of early-disseminated cancer cells may frequently account for metachronous metastasis.

1,126 citations


Journal ArticleDOI
TL;DR: Here, the minimum information about a genome sequence (MIGS) specification is introduced with the intent of promoting participation in its development and discussing the resources that will be required to develop improved mechanisms of metadata capture and exchange.
Abstract: With the quantity of genomic data increasing at an exponential rate, it is imperative that these data be captured electronically, in a standard format. Standardization activities must proceed within the auspices of open-access and international working bodies. To tackle the issues surrounding the development of better descriptions of genomic investigations, we have formed the Genomic Standards Consortium (GSC). Here, we introduce the minimum information about a genome sequence (MIGS) specification with the intent of promoting participation in its development and discussing the resources that will be required to develop improved mechanisms of metadata capture and exchange. As part of its wider goals, the GSC also supports improving the 'transparency' of the information contained in existing genomic databases.

1,097 citations


Journal ArticleDOI
26 Jun 2008-Nature
TL;DR: High-throughput sequencing of complementary DNAs (RNA-Seq) and strand-specific array data provide rich condition-specific information on novel, mostly non-coding transcripts, untranslated regions and gene structures, thus improving the existing genome annotation.
Abstract: Until recently, it was thought that much of a genome sequence is silent for much of the time. Now a study in the fission yeast Schizosaccharomyces pombe, using recently developed DNA sequencing technologies, shows that almost all of the yeast genome is genetically active. More than 90% of the genome is transcribed into RNA, including more than 450 newly discovered transcripts, many of them non-coding, with regulatory or other unknown roles. Using recently developed DNA sequencing technologies, nucleic acid transcripts are characterized in unprecedented detail from the yeast Schizosaccharomyces pombe. The sequences definitively demonstrate that 90% or more of the genome is transcribed into RNA, and show a previously unseen link between transcription and splicing efficiency at different points in the cell's growth. Recent data from several organisms indicate that the transcribed portions of genomes are larger and more complex than expected, and that many functional properties of transcripts are based not on coding sequences but on regulatory sequences in untranslated regions or non-coding RNAs. Alternative start and polyadenylation sites and regulation of intron splicing add additional dimensions to the rich transcriptional output. This transcriptional complexity has been sampled mainly using hybridization-based methods under one or few experimental conditions. Here we applied direct high-throughput sequencing of complementary DNAs (RNA-Seq), supplemented with data from high-density tiling arrays, to globally sample transcripts of the fission yeast Schizosaccharomyces pombe, independently from available gene annotations. We interrogated transcriptomes under multiple conditions, including rapid proliferation, meiotic differentiation and environmental stress, as well as in RNA processing mutants to reveal the dynamic plasticity of the transcriptional landscape as a function of environmental, developmental and genetic factors.
High-throughput sequencing proved to be a powerful and quantitative method to sample transcriptomes deeply at maximal resolution. In contrast to hybridization, sequencing showed little, if any, background noise and was sensitive enough to detect widespread transcription in >90% of the genome, including traces of RNAs that were not robustly transcribed or rapidly degraded. The combined sequencing and strand-specific array data provide rich condition-specific information on novel, mostly non-coding transcripts, untranslated regions and gene structures, thus improving the existing genome annotation. Sequence reads spanning exon–exon or exon–intron junctions give unique insight into a surprising variability in splicing efficiency across introns, genes and conditions. Splicing efficiency was largely coordinated with transcript levels, and increased transcription led to increased splicing in test genes. Hundreds of introns showed such regulated splicing during cellular proliferation or differentiation.

991 citations


Journal ArticleDOI
06 Nov 2008-Nature
TL;DR: Genotyping analysis showed that SNP identification had high accuracy and consistency, indicating the high sequence quality of this assembly, and the potential usefulness of next-generation sequencing technologies for personal genomics.
Abstract: Here we present the first diploid genome sequence of an Asian individual. The genome was sequenced to 36-fold average coverage using massively parallel sequencing technology. We aligned the short reads onto the NCBI human reference genome to 99.97% coverage, and guided by the reference genome, we used uniquely mapped reads to assemble a high-quality consensus sequence for 92% of the Asian individual's genome. We identified approximately 3 million single-nucleotide polymorphisms (SNPs) inside this region, of which 13.6% were not in the dbSNP database. Genotyping analysis showed that SNP identification had high accuracy and consistency, indicating the high sequence quality of this assembly. We also carried out heterozygote phasing and haplotype prediction against HapMap CHB and JPT haplotypes (Chinese and Japanese, respectively), sequence comparison with the two available individual genomes (J. D. Watson and J. C. Venter), and structural variation identification. These variations were considered for their potential biological impact. Our sequence data and analyses demonstrate the potential usefulness of next-generation sequencing technologies for personal genomics.

963 citations


Journal ArticleDOI
TL;DR: The results demonstrate the feasibility of systematic, genome-wide characterization of rearrangements in complex human cancer genomes, raising the prospect of a new harvest of genes associated with cancer using this strategy.
Abstract: Human cancers often carry many somatically acquired genomic rearrangements, some of which may be implicated in cancer development. However, conventional strategies for characterizing rearrangements are laborious and low-throughput and have low sensitivity or poor resolution. We used massively parallel sequencing to generate sequence reads from both ends of short DNA fragments derived from the genomes of two individuals with lung cancer. By investigating read pairs that did not align correctly with respect to each other on the reference human genome, we characterized 306 germline structural variants and 103 somatic rearrangements to the base-pair level of resolution. The patterns of germline and somatic rearrangement were markedly different. Many somatic rearrangements were from amplicons, although rearrangements outside these regions, notably including tandem duplications, were also observed. Some somatic rearrangements led to abnormal transcripts, including two from internal tandem duplications and two fusion transcripts created by interchromosomal rearrangements. Germline variants were predominantly mediated by retrotransposition, often involving AluY and LINE elements. The results demonstrate the feasibility of systematic, genome-wide characterization of rearrangements in complex human cancer genomes, raising the prospect of a new harvest of genes associated with cancer using this strategy.

899 citations


Journal ArticleDOI
TL;DR: The loci the authors identified implicate genes in Hedgehog signaling, extracellular matrix, and cancer pathways, providing new insights into human growth and developmental processes and into the genetic architecture of a classic quantitative trait.
Abstract: Adult height is a model polygenic trait, but there has been limited success in identifying the genes underlying its normal variation. To identify genetic variants influencing adult human height, we used genome-wide association data from 13,665 individuals and genotyped 39 variants in an additional 16,482 samples. We identified 20 variants associated with adult height (P < 5 × 10^-7, with 10 reaching P < 1 × 10^-10). Combined, the 20 SNPs explain approximately 3% of height variation, with an approximately 5 cm difference between the 6.2% of people with 17 or fewer 'tall' alleles compared to the 5.5% with 27 or more 'tall' alleles. The loci we identified implicate genes in Hedgehog signaling (IHH, HHIP, PTCH1), extracellular matrix (EFEMP1, ADAMTSL3, ACAN) and cancer (CDK6, HMGA2, DLEU7) pathways, and provide new insights into human growth and developmental processes. Finally, our results provide insights into the genetic architecture of a classic quantitative trait.

Journal ArticleDOI
TL;DR: Pangenome calculations indicate that E. coli genomic diversity represents an open pangenome model containing a reservoir of more than 13,000 genes, many of which may be uncharacterized but important virulence factors; this comparative study should provide the basis for future functional work on this important group of pathogens.
Abstract: Whole-genome sequencing has been skewed toward bacterial pathogens as a consequence of the prioritization of medical and veterinary diseases. However, it is becoming clear that in order to accurately measure genetic variation within and between pathogenic groups, multiple isolates, as well as commensal species, must be sequenced. This study examined the pangenomic content of Escherichia coli. Six distinct E. coli pathovars can be distinguished using molecular or phenotypic markers, but only two of the six pathovars have been subjected to any genome sequencing previously. Thus, this report provides a seminal description of the genomic contents and unique features of three unsequenced pathovars, enterotoxigenic E. coli, enteropathogenic E. coli, and enteroaggregative E. coli. We also determined the first genome sequence of a human commensal E. coli isolate, E. coli HS, which will undoubtedly provide a new baseline from which workers can examine the evolution of pathogenic E. coli. Comparison of 17 E. coli genomes, 8 of which are new, resulted in identification of ∼2,200 genes conserved in all isolates. We were also able to identify genes that were isolate and pathovar specific. Fewer pathovar-specific genes were identified than anticipated, suggesting that each isolate may have independently developed virulence capabilities. Pangenome calculations indicate that E. coli genomic diversity represents an open pangenome model containing a reservoir of more than 13,000 genes, many of which may be uncharacterized but important virulence factors. This comparative study of the species E. coli, while descriptive, should provide the basis for future functional work on this important group of pathogens.

Journal ArticleDOI
TL;DR: A set of improvements to the standard Illumina protocols is described that makes library preparation more reliable in a high-throughput environment, reduces bias, tightens insert size distribution and reliably gives high yields of data.
Abstract: The Wellcome Trust Sanger Institute is one of the world's largest genome centers, and a substantial amount of our sequencing is performed with 'next-generation' massively parallel sequencing technologies: in June 2008 the quantity of purity-filtered sequence data generated by our Genome Analyzer (Illumina) platforms reached 1 terabase, and our average weekly Illumina production output is currently 64 gigabases. Here we describe a set of improvements we have made to the standard Illumina protocols to make the library preparation more reliable in a high-throughput environment, to reduce bias, tighten insert size distribution and reliably obtain high yields of data.

Journal ArticleDOI
Heather C Mefford, Andrew J. Sharp, Carl Baker, Andy Itsara, Zhaoshi Jiang, Karen Buysse, Shuwen Huang, Viv K. Maloney, John A. Crolla, Diana Baralle, Amanda L. Collins, Catherine Mercer, Koenraad Norga, Thomy de Ravel, Koenraad Devriendt, Ernie M.H.F. Bongers, Nicole de Leeuw, William Reardon, Stefania Gimelli, Frédérique Béna, Raoul C.M. Hennekam, Alison Male, Lorraine Gaunt, Jill Clayton-Smith, Ingrid Simonic, Soo Mi Park, Sarju G. Mehta, Serena Nik-Zainal, C. Geoffrey Woods, Helen V. Firth, Georgina Parkin, Marco Fichera, Santina Reitano, Mariangela Lo Giudice, Kelly Li, Iris Casuga, Adam Broomer, Bernard Conrad, Markus Schwerzmann, Lorenz Räber, Sabina Gallati, Pasquale Striano, Antonietta Coppola, John Tolmie, Edward S. Tobias, Chris Lilley, Lluís Armengol, Yves Spysschaert, Patrick Verloo, Anja De Coene, Linde Goossens, Geert Mortier, Frank Speleman, Ellen van Binsbergen, Marcel R. Nelen, Ron Hochstenbach, Martin Poot, Louise Gallagher, Michael Gill, Jon McClellan, Mary Claire King, Regina Regan, Cindy Skinner, Roger E. Stevenson, Stylianos E. Antonarakis, Caifu Chen, Xavier Estivill, Björn Menten, Giorgio Gimelli, Susan M. Gribble, Stuart Schwartz, James S. Sutcliffe, Tom Walsh, Samantha J. L. Knight, Jonathan Sebat, Corrado Romano, Charles E. Schwartz, Joris A. Veltman, Bert B.A. de Vries, Joris Vermeesch, John C. K. Barber, Lionel Willatt, May Tassabehji, Evan E. Eichler
TL;DR: Recurrent molecular lesions that elude syndromic classification and whose disease manifestations must be considered in a broader context of development as opposed to being assigned to a specific disease are identified.
Abstract: BACKGROUND: Duplications and deletions in the human genome can cause disease or predispose persons to disease. Advances in technologies to detect these changes allow for the routine identification of submicroscopic imbalances in large numbers of patients. METHODS: We tested for the presence of microdeletions and microduplications at a specific region of chromosome 1q21.1 in two groups of patients with unexplained mental retardation, autism, or congenital anomalies and in unaffected persons. RESULTS: We identified 25 persons with a recurrent 1.35-Mb deletion within 1q21.1 from screening 5218 patients. The microdeletions had arisen de novo in eight patients, were inherited from a mildly affected parent in three patients, were inherited from an apparently unaffected parent in six patients, and were of unknown inheritance in eight patients. The deletion was absent in a series of 4737 control persons (P = 1.1 × 10^-7). We found considerable variability in the level of phenotypic expression of the microdeletion; phenotypes included mild-to-moderate mental retardation, microcephaly, cardiac abnormalities, and cataracts. The reciprocal duplication was enriched in nine children with mental retardation or autism spectrum disorder and other variable features (P = 0.02). We identified three deletions and three duplications of the 1q21.1 region in an independent sample of 788 patients with mental retardation and congenital anomalies. CONCLUSIONS: We have identified recurrent molecular lesions that elude syndromic classification and whose disease manifestations must be considered in a broader context of development as opposed to being assigned to a specific disease. Clinical diagnosis in patients with these lesions may be most readily achieved on the basis of genotype rather than phenotype.
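To give a feel for the case-control comparison reported above (25 deletions among 5218 patients versus none among 4737 controls), the counts can be put through a one-sided Fisher's exact test. This is only a sketch: the abstract does not say which test the authors used, and `fisher_one_sided` is a hypothetical helper written for this illustration.

```python
from math import comb


def fisher_one_sided(case_hits: int, case_total: int,
                     ctrl_hits: int, ctrl_total: int) -> float:
    """P(at least case_hits carriers fall among cases) under the hypergeometric null.

    The null assumes the carriers are spread at random over cases and controls.
    """
    hits = case_hits + ctrl_hits
    total = case_total + ctrl_total
    denom = comb(total, hits)
    p = 0.0
    for k in range(case_hits, min(hits, case_total) + 1):
        # Probability that exactly k of the carriers are cases.
        p += comb(case_total, k) * comb(ctrl_total, hits - k) / denom
    return p


# Counts taken from the abstract; the choice of test is an assumption.
p_value = fisher_one_sided(25, 5218, 0, 4737)
print(f"one-sided p = {p_value:.2e}")
```

Under these assumptions the result is on the order of 10^-7, the same order of magnitude as the value in the abstract; an exact match is not expected, since the authors' test and sidedness are not stated.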

Journal ArticleDOI
TL;DR: This extensive genome-wide association follow-up study has identified additional celiac disease risk variants in relevant biological pathways and identified seven previously unknown risk regions.
Abstract: Our genome-wide association study of celiac disease previously identified risk variants in the IL2-IL21 region. To identify additional risk variants, we genotyped 1,020 of the most strongly associated non-HLA markers in an additional 1,643 cases and 3,406 controls. Through joint analysis including the genome-wide association study data (767 cases, 1,422 controls), we identified seven previously unknown risk regions (P < 5 × 10^-7). Six regions harbor genes controlling immune responses, including CCR3, IL12A, IL18RAP, RGS1, SH2B3 (nsSNP rs3184504) and TAGAP. Whole-blood IL18RAP mRNA expression correlated with IL18RAP genotype. Type 1 diabetes and celiac disease share HLA-DQ, IL2-IL21, CCR3 and SH2B3 risk regions. Thus, this extensive genome-wide association follow-up study has identified additional celiac disease risk variants in relevant biological pathways.

Journal ArticleDOI
Wesley C. Warren, LaDeana W. Hillier, Jennifer A. Marshall Graves, Ewan Birney, Chris P. Ponting, Frank Grützner, Katherine Belov, Webb Miller, Laura Clarke, Asif T. Chinwalla, Shiaw Pyng Yang, Andreas Heger, Devin P. Locke, Pat Miethke, Paul D. Waters, Frédéric Veyrunes, Lucinda Fulton, Bob Fulton, Tina Graves, John W. Wallis, Xose S. Puente, Carlos López-Otín, Gonzalo R. Ordóñez, Evan E. Eichler, Lin Chen, Ze Cheng, Janine E. Deakin, Amber E. Alsop, Katherine Thompson, Patrick J. Kirby, Anthony T. Papenfuss, Matthew Wakefield, Tsviya Olender, Doron Lancet, Gavin A. Huttley, Arian F.A. Smit, Andrew J Pask, Peter Temple-Smith, Mark A. Batzer, Jerilyn A. Walker, Miriam K. Konkel, Robert S. Harris, Camilla M. Whittington, Emily S. W. Wong, Neil J. Gemmell, Emmanuel Buschiazzo, Iris M. Vargas Jentzsch, Angelika Merkel, Juergen Schmitz, Anja Zemann, Gennady Churakov, Jan Ole Kriegs, Juergen Brosius, Elizabeth P. Murchison, Ravi Sachidanandam, Carly Smith, Gregory J. Hannon, Enkhjargal Tsend-Ayush, Daniel McMillan, Rosalind Attenborough, Willem Rens, Malcolm A. Ferguson-Smith, Christophe Lefevre, Julie A. Sharp, Kevin R. Nicholas, David A. Ray, Michael Kube, Richard Reinhardt, Thomas H. Pringle, James Taylor, Russell C. Jones, Brett Nixon, Jean Louis Dacheux, Hitoshi Niwa, Yoko Sekita, Xiaoqiu Huang, Alexander Stark, Pouya Kheradpour, Manolis Kellis, Paul Flicek, Yuan Chen, Caleb Webber, Ross C. Hardison, Joanne O. Nelson, Kym Hallsworth-Pepin, Kim D. Delehaunty, Chris Markovic, Patrick Minx, Yucheng Feng, Colin Kremitzki, Makedonka Mitreva, Jarret Glasscock, Todd Wylie, Patricia Wohldmann, Prathapan Thiru, Michael N. Nhan, Craig Pohl, Scott M. Smith, Shunfeng Hou, Marilyn B. Renfree, Elaine R. Mardis, Richard K. Wilson
08 May 2008-Nature
TL;DR: It is found that reptile and platypus venom proteins have been co-opted independently from the same gene families; milk protein genes are conserved despite platypuses laying eggs; and immune gene family expansions are directly related to platypus biology.
Abstract: We present a draft genome sequence of the platypus, Ornithorhynchus anatinus. This monotreme exhibits a fascinating combination of reptilian and mammalian characters. For example, platypuses have a coat of fur adapted to an aquatic lifestyle; platypus females lactate, yet lay eggs; and males are equipped with venom similar to that of reptiles. Analysis of the first monotreme genome aligned these features with genetic innovations. We find that reptile and platypus venom proteins have been co-opted independently from the same gene families; milk protein genes are conserved despite platypuses laying eggs; and immune gene family expansions are directly related to platypus biology. Expansions of protein, non-protein-coding RNA and microRNA families, as well as repeat elements, are identified. Sequencing of this genome now provides a valuable resource for deep mammalian comparative analyses, as well as for monotreme biology and conservation.

Journal ArticleDOI
TL;DR: This work has developed a cross-platform algorithm, the Bayesian tool for methylation analysis (Batman), for analyzing methylated DNA immunoprecipitation profiles generated using oligonucleotide arrays or next-generation sequencing; the sequencing-based approach provides a high-resolution whole-genome DNA methylation profile (DNA methylome) of a mammalian genome.
Abstract: DNA methylation is an indispensable epigenetic modification required for regulating the expression of mammalian genomes. Immunoprecipitation-based methods for DNA methylome analysis are rapidly shifting the bottleneck in this field from data generation to data analysis, necessitating the development of better analytical tools. In particular, an inability to estimate absolute methylation levels remains a major analytical difficulty associated with immunoprecipitation-based DNA methylation profiling. To address this issue, we developed a cross-platform algorithm, the Bayesian tool for methylation analysis (Batman), for analyzing methylated DNA immunoprecipitation (MeDIP) profiles generated using oligonucleotide arrays (MeDIP-chip) or next-generation sequencing (MeDIP-seq). We developed the latter approach to provide a high-resolution whole-genome DNA methylation profile (DNA methylome) of a mammalian genome. Strong correlation of our data, obtained using mature human spermatozoa, with those obtained using bisulfite sequencing suggests that combining MeDIP-seq or MeDIP-chip with Batman provides a robust, quantitative and cost-effective functional genomic strategy for elucidating the function of DNA methylation.


Journal ArticleDOI
TL;DR: A fast and reliable pipeline to study protein function in mammalian cells based on protein tagging in bacterial artificial chromosomes (BACs) is described and it is shown that BAC transgenes can be rapidly and reliably generated using 96-well-format recombineering.
Abstract: The interpretation of genome sequences requires reliable and standardized methods to assess protein function at high throughput. Here we describe a fast and reliable pipeline to study protein function in mammalian cells based on protein tagging in bacterial artificial chromosomes (BACs). The large size of the BAC transgenes ensures the presence of most, if not all, regulatory elements and results in expression that closely matches that of the endogenous gene. We show that BAC transgenes can be rapidly and reliably generated using 96-well-format recombineering. After stable transfection of these transgenes into human tissue culture cells or mouse embryonic stem cells, the localization, protein-protein and/or protein-DNA interactions of the tagged protein are studied using generic, tag-based assays. The same high-throughput approach will be generally applicable to other model systems. NOTE: In the version of this article initially published online, the name of one individual was misspelled in the Acknowledgments. The second sentence of the Acknowledgments paragraph should read, “We thank I. Cheesman for helpful discussions.” The error has been corrected for all versions of the article.

Journal ArticleDOI
TL;DR: A meta-analysis of genome-wide association study data of height from 15,821 individuals at 2.2 million SNPs found 10 newly identified and two previously reported loci were strongly associated with variation in height, and highlight several pathways as important regulators of human stature.
Abstract: Identification of ten loci associated with height highlights new biological pathways in human growth

Journal ArticleDOI
TL;DR: The incidence of typhoid varied substantially between sites, being high in India and Pakistan, intermediate in Indonesia, and low in China and Viet Nam, and underscore the importance of evidence on disease burden in making policy decisions about interventions to control this disease.
Abstract: Objective: To inform policy-makers about introduction of preventive interventions against typhoid, including vaccination. Methods: A population-based prospective surveillance design was used. Study sites where typhoid was considered a problem by local authorities were established in China, India, Indonesia, Pakistan and Viet Nam. Standardized clinical, laboratory, and surveillance methods were used to investigate cases of fever of ≥ 3 days’ duration for a one-year period. A total of 441 435 persons were under surveillance, 159 856 of whom were aged 5–15 years. Findings: A total of 21 874 episodes of fever were detected. Salmonella typhi was isolated from 475 (2%) blood cultures, 57% (273/475) of which were from 5–15 year-olds. The annual typhoid incidence (per 100 000 person years) among this age group varied from 24.2 and 29.3 in sites in Viet Nam and China, respectively, to 180.3 in the site in Indonesia; and to 412.9 and 493.5 in sites in Pakistan and India, respectively. Altogether, 23% (96/413) of isolates were multidrug resistant (chloramphenicol, ampicillin and trimethoprim-sulfamethoxazole). Conclusion: The incidence of typhoid varied substantially between sites, being high in India and Pakistan, intermediate in Indonesia, and low in China and Viet Nam. These findings highlight the considerable, but geographically heterogeneous, burden of typhoid fever in endemic areas of Asia, and underscore the importance of evidence on disease burden in making policy decisions about interventions to control this disease.

Journal ArticleDOI
TL;DR: An important role for mRNA stability in determining steady-state mRNA levels is suggested, and the potential of eQTL mapping as a high-resolution tool for studying the determinants of gene regulation is highlighted.
Abstract: Recent studies of the HapMap lymphoblastoid cell lines have identified large numbers of quantitative trait loci for gene expression (eQTLs). Reanalyzing these data using a novel Bayesian hierarchical model, we were able to create a surprisingly high-resolution map of the typical locations of sites that affect mRNA levels in cis. Strikingly, we found a strong enrichment of eQTLs in the 250 bp just upstream of the transcription end site (TES), in addition to an enrichment around the transcription start site (TSS). Most eQTLs lie either within genes or close to genes; for example, we estimate that only 5% of eQTLs lie more than 20 kb upstream of the TSS. After controlling for position effects, SNPs in exons are approximately 2-fold more likely than SNPs in introns to be eQTLs. Our results suggest an important role for mRNA stability in determining steady-state mRNA levels, and highlight the potential of eQTL mapping as a high-resolution tool for studying the determinants of gene regulation.

Journal ArticleDOI
TL;DR: A new version of Artemis has been developed, which reads from and writes to a relational database schema, and allows users to annotate more complex, often large and fragmented, genome sequences.
Abstract: Motivation: Artemis and Artemis Comparison Tool (ACT) have become mainstream tools for viewing and annotating sequence data, particularly for microbial genomes. Since its first release, Artemis has been continuously developed and supported with additional functionality for editing and analysing sequences based on feedback from an active user community of laboratory biologists and professional annotators. Nevertheless, its utility has been somewhat restricted by its limitation to reading and writing from flat files. Therefore, a new version of Artemis has been developed, which reads from and writes to a relational database schema, and allows users to annotate more complex, often large and fragmented, genome sequences. Results: Artemis and ACT have now been extended to read and write directly to the Generic Model Organism Database (GMOD, http://www.gmod.org) Chado relational database schema. In addition, a Gene Builder tool has been developed to provide structured forms and tables to edit coordinates of gene models and edit functional annotation, based on standard ontologies, controlled vocabularies and free text. Availability: Artemis and ACT are freely available (under a GPL licence) for download (for MacOSX, UNIX and Windows) at the Wellcome Trust Sanger Institute web sites: http://www.sanger.ac.uk/Software/Artemis/ and http://www.sanger.ac.uk/Software/ACT/ Contact: artemis@sanger.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: The Minimum Information for Biological and Biomedical Investigations (MIBBI) project aims to foster the coordinated development of minimum-information checklists and provide a resource for those exploring the range of extant checklists.
Abstract: The Minimum Information for Biological and Biomedical Investigations (MIBBI) project aims to foster the coordinated development of minimum-information checklists and provide a resource for those exploring the range of extant checklists.

Journal ArticleDOI
TL;DR: The complete genome sequence of S. aureus Newman is reported, which carries four integrated prophages, as well as two large pathogenicity islands, and the absence of drug resistance genes reflects the general antibiotic-susceptible phenotype of S. aureus Newman.
Abstract: Strains of Staphylococcus aureus, an important human pathogen, display up to 20% variability in their genome sequence, and most sequence information is available for human clinical isolates that have not been subjected to genetic analysis of virulence attributes. S. aureus strain Newman, which was also isolated from a human infection, displays robust virulence properties in animal models of disease and has already been extensively analyzed for its molecular traits of staphylococcal pathogenesis. We report here the complete genome sequence of S. aureus Newman, which carries four integrated prophages, as well as two large pathogenicity islands. In agreement with the view that S. aureus Newman prophages contribute important properties to pathogenesis, fewer virulence factors are found outside of the prophages than for the highly virulent strain MW2. The absence of drug resistance genes reflects the general antibiotic-susceptible phenotype of S. aureus Newman. Phylogenetic analyses reveal clonal relationships between the staphylococcal strains Newman, COL, NCTC8325, and USA300 and a greater evolutionary distance to strains MRSA252, MW2, MSSA476, N315, Mu50, JH1, JH9, and RF122. However, polymorphism analysis of two large pathogenicity islands distributed among these strains shows that the two islands were acquired independently from the evolutionary pathway of the chromosomal backbones of staphylococcal genomes. Prophages and pathogenicity islands play central roles in S. aureus virulence and evolution.

Journal ArticleDOI
TL;DR: The genome of the M strain of M. marinum comprises a 6,636,827-bp circular chromosome with 5424 CDS, 10 prophages, and a 23-kb mercury-resistance plasmid.
Abstract: Mycobacterium marinum, a ubiquitous pathogen of fish and amphibia, is a near relative of Mycobacterium tuberculosis, the etiologic agent of tuberculosis in humans. The genome of the M strain of M. marinum comprises a 6,636,827-bp circular chromosome with 5424 CDS, 10 prophages, and a 23-kb mercury-resistance plasmid. Prominent features are the very large number of genes (57) encoding polyketide synthases (PKSs) and nonribosomal peptide synthases (NRPSs) and the most extensive repertoire yet reported of the mycobacteria-restricted PE and PPE proteins, and related-ESX secretion systems. Some of the NRPS genes comprise a novel family and seem to have been acquired horizontally. M. marinum is used widely as a model organism to study M. tuberculosis pathogenesis, and genome comparisons confirmed the close genetic relationship between these two species, as they share 3000 orthologs with an average amino acid identity of 85%. Comparisons with the more distantly related Mycobacterium avium subspecies paratuberculosis and Mycobacterium smegmatis reveal how an ancestral generalist mycobacterium evolved into M. tuberculosis and M. marinum. M. tuberculosis has undergone genome downsizing and extensive lateral gene transfer to become a specialized pathogen of humans and other primates without retaining an environmental niche. M. marinum has maintained a large genome so as to retain the capacity for environmental survival while becoming a broad host range pathogen that produces disease strikingly similar to M. tuberculosis. The work described herein provides a foundation for using M. marinum to better understand the determinants of pathogenesis of tuberculosis.

Journal ArticleDOI
TL;DR: Key aspects of GO are described, which, when overlooked, can cause erroneous results, and how these pitfalls can be avoided.
Abstract: The Gene Ontology (GO) project is a collaboration among model organism databases to describe gene products from all organisms using a consistent and computable language. GO produces sets of explicitly defined, structured vocabularies that describe biological processes, molecular functions and cellular components of gene products in both a computer- and human-readable manner. Here we describe key aspects of GO, which, when overlooked, can cause erroneous results, and address how these pitfalls can be avoided.

Journal ArticleDOI
TL;DR: The panoply of antimicrobial drug resistance genes and mobile genetic elements found suggests that the organism can act as a reservoir of antimicrobial drug resistance determinants in a clinical environment, which is an issue of considerable concern.
Abstract: Background Stenotrophomonas maltophilia is a nosocomial opportunistic pathogen of the Xanthomonadaceae. The organism has been isolated from both clinical and soil environments in addition to the sputum of cystic fibrosis patients and the immunocompromised. Whilst relatively distant phylogenetically, the closest sequenced relatives of S. maltophilia are the plant pathogenic xanthomonads.

Journal ArticleDOI
TL;DR: The observed patterns of genetic isolation and drift are consistent with the proposed key role of asymptomatic carriers of Typhi as the main reservoir of this pathogen, highlighting the need for identification and treatment of carriers.
Abstract: Isolates of Salmonella enterica serovar Typhi (Typhi), a human-restricted bacterial pathogen that causes typhoid, show limited genetic variation. We generated whole-genome sequences for 19 Typhi isolates using 454 (Roche) and Solexa (Illumina) technologies. Isolates, including the previously sequenced CT18 and Ty2 isolates, were selected to represent major nodes in the phylogenetic tree. Comparative analysis showed little evidence of purifying selection, antigenic variation or recombination between isolates. Rather, evolution in the Typhi population seems to be characterized by ongoing loss of gene function, consistent with a small effective population size. The lack of evidence for antigenic variation driven by immune selection is in contrast to strong adaptive selection for mutations conferring antibiotic resistance in Typhi. The observed patterns of genetic isolation and drift are consistent with the proposed key role of asymptomatic carriers of Typhi as the main reservoir of this pathogen, highlighting the need for identification and treatment of carriers.

Journal ArticleDOI
TL;DR: Results of a nonsynonymous SNP scan for ulcerative colitis and a previously unknown susceptibility locus at ECM1 are reported, providing the first detailed illustration of the genetic relationship between these common inflammatory bowel diseases.
Abstract: We report results of a nonsynonymous SNP scan for ulcerative colitis and identify a previously unknown susceptibility locus at ECM1. We also show that several risk loci are common to ulcerative colitis and Crohn's disease (IL23R, IL12B, HLA, NKX2-3 and MST1), whereas autophagy genes ATG16L1 and IRGM, along with NOD2 (also known as CARD15), are specific for Crohn's disease. These data provide the first detailed illustration of the genetic relationship between these common inflammatory bowel diseases.