Showing papers by "Wellcome Trust Sanger Institute" published in 2014
••
TL;DR: Pfam is a widely used database of protein families, containing 14 831 manually curated entries in the current release, version 27.0; 1182 new families have been generated since the previous update article two years earlier.
Abstract: Pfam, available via servers in the UK (http://pfam.sanger.ac.uk/) and the USA (http://pfam.janelia.org/), is a widely used database of protein families, containing 14 831 manually curated entries in the current release, version 27.0. Since the last update article 2 years ago, we have generated 1182 new families and maintained sequence coverage of the UniProt Knowledgebase (UniProtKB) at nearly 80%, despite a 50% increase in the size of the underlying sequence database. Since our 2012 article describing Pfam, we have also undertaken a comprehensive review of the features that are provided by Pfam over and above the basic family data. For each feature, we determined the relevance, computational burden, usage statistics and the functionality of the feature in a website context. As a consequence of this review, we have removed some features, enhanced others and developed new ones to meet the changing demands of computational biology. Here, we describe the changes to Pfam content. Notably, we now provide family alignments based on four different representative proteome sequence data sets and a new interactive DNA search interface. We also discuss the mapping between Pfam and known 3D structures.
9,415 citations
••
TL;DR: A new Java-based architecture for the widely used protein function prediction software package InterProScan is described, resulting in a flexible and stable system that is able to use both multiprocessor machines and/or conventional clusters to achieve scalable distributed data analysis.
Abstract: Motivation: Robust, large-scale sequence analysis is a major challenge in modern genomic science, where biologists are frequently trying to characterise many millions of sequences. Here we describe a new Java-based architecture for the widely-used protein function prediction software package InterProScan. Developments include improvements and additions to the outputs of the software and the complete re-implementation of the software framework, resulting in a flexible and stable system that is able to utilise both multiprocessor machines and/or conventional clusters to achieve scalable distributed data analysis. InterProScan is freely available for download from the EMBL-EBI FTP site and the (open) source code is hosted at Google Code. Availability: InterProScan is distributed via FTP at ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/ and the source code is available from http://code.google.com/p/interproscan/. Contact: http://www.ebi.ac.uk/support or interhelp@ebi.ac.uk
5,434 citations
••
TL;DR: It is demonstrated that contaminating DNA is ubiquitous in commonly used DNA extraction kits and other laboratory reagents, varies greatly in composition between different kits and kit batches, and that this contamination critically impacts results obtained from samples containing a low microbial biomass.
Abstract: The study of microbial communities has been revolutionised in recent years by the widespread adoption of culture independent analytical techniques such as 16S rRNA gene sequencing and metagenomics. One potential confounder of these sequence-based approaches is the presence of contamination in DNA extraction kits and other laboratory reagents. In this study we demonstrate that contaminating DNA is ubiquitous in commonly used DNA extraction kits and other laboratory reagents, varies greatly in composition between different kits and kit batches, and that this contamination critically impacts results obtained from samples containing a low microbial biomass. Contamination impacts both PCR-based 16S rRNA gene surveys and shotgun metagenomics. We provide an extensive list of potential contaminating genera, and guidelines on how to mitigate the effects of contamination. These results suggest that caution should be advised when applying sequence-based techniques to the study of microbiota present in low biomass environments. Concurrent sequencing of negative control samples is strongly advised.
2,459 citations
••
Icahn School of Medicine at Mount Sinai1, Carnegie Mellon University2, Harvard University3, University of Toronto4, Wellcome Trust Sanger Institute5, University of Pittsburgh6, Nagoya University7, University of Freiburg8, King's College London9, Vanderbilt University10, King Abdulaziz University11, University of Santiago de Compostela12, University of Utah13, Duke University14, Memorial University of Newfoundland15, Trinity College, Dublin16, University of Pennsylvania17, University of Illinois at Chicago18, Boston Children's Hospital19, Columbia University20, German Cancer Research Center21, University College London22, Kaiser Permanente23, Broad Institute24, Cardiff University25, Complutense University of Madrid26, Newcastle University27, Baylor College of Medicine28, University of California, San Francisco29, RWTH Aachen University30, National Health Service31, McMaster University32, Saarland University33, Karolinska Institutet34, National Institutes of Health35, University of Helsinki36, Emory University37
TL;DR: Using exome sequencing, it is shown that analysis of rare coding variation in 3,871 autism cases and 9,937 ancestry-matched or parental controls implicates 22 autosomal genes at a false discovery rate of < 0.05, plus a set of 107 genes strongly enriched for those likely to affect risk (FDR < 0.30).
Abstract: The genetic architecture of autism spectrum disorder involves the interplay of common and rare variants and their impact on hundreds of genes. Using exome sequencing, here we show that analysis of rare coding variation in 3,871 autism cases and 9,937 ancestry-matched or parental controls implicates 22 autosomal genes at a false discovery rate (FDR) < 0.05, plus a set of 107 autosomal genes strongly enriched for those likely to affect risk (FDR < 0.30). These 107 genes, which show unusual evolutionary constraint against mutations, incur de novo loss-of-function mutations in over 5% of autistic subjects. Many of the genes implicated encode proteins for synaptic formation, transcriptional regulation and chromatin-remodelling pathways. These include voltage-gated ion channels regulating the propagation of action potentials, pacemaking and excitability-transcription coupling, as well as histone-modifying enzymes and chromatin remodellers, most prominently those that mediate post-translational lysine methylation/demethylation modifications of histones.
2,228 citations
••
TL;DR: This article identified 697 variants at genome-wide significance that together explained one-fifth of the heritability for adult height, and all common variants together captured 60% of heritability.
Abstract: Using genome-wide data from 253,288 individuals, we identified 697 variants at genome-wide significance that together explained one-fifth of the heritability for adult height. By testing different numbers of variants in independent studies, we show that the most strongly associated ∼2,000, ∼3,700 and ∼9,500 SNPs explained ∼21%, ∼24% and ∼29% of phenotypic variance. Furthermore, all common variants together captured 60% of heritability. The 697 variants clustered in 423 loci were enriched for genes, pathways and tissue types known to be involved in growth and together implicated genes and pathways not highlighted in earlier efforts, such as signaling by fibroblast growth factors, WNT/β-catenin and chondroitin sulfate-related genes. We identified several genes and pathways not previously connected with human skeletal growth, including mTOR, osteoglycin and binding of hyaluronic acid. Our results indicate a genetic architecture for human height that is characterized by a very large but finite number (thousands) of causal variants.
1,872 citations
••
TL;DR: Strong correlations between the presence of a mutant allele, in vitro parasite survival rates and in vivo parasite clearance rates indicate that K13-propeller mutations are important determinants of artemisinin resistance.
Abstract: Plasmodium falciparum resistance to artemisinin derivatives in southeast Asia threatens malaria control and elimination activities worldwide. To monitor the spread of artemisinin resistance, a molecular marker is urgently needed. Here, using whole-genome sequencing of an artemisinin-resistant parasite line from Africa and clinical parasite isolates from Cambodia, we associate mutations in the PF3D7_1343700 kelch propeller domain ('K13-propeller') with artemisinin resistance in vitro and in vivo. Mutant K13-propeller alleles cluster in Cambodian provinces where resistance is prevalent, and the increasing frequency of a dominant mutant K13-propeller allele correlates with the recent spread of resistance in western Cambodia. Strong correlations between the presence of a mutant allele, in vitro parasite survival rates and in vivo parasite clearance rates indicate that K13-propeller mutations are important determinants of artemisinin resistance. K13-propeller polymorphism constitutes a useful molecular marker for large-scale surveillance efforts to contain artemisinin resistance in the Greater Mekong Subregion and prevent its global spread.
1,639 citations
••
TL;DR: Genes affected by mutations in schizophrenia overlap those mutated in autism and intellectual disability, as do mutation-enriched synaptic pathways, revealing pathophysiology shared with other neurodevelopmental disorders.
Abstract: Inherited alleles account for most of the genetic risk for schizophrenia. However, new (de novo) mutations, in the form of large chromosomal copy number changes, occur in a small fraction of cases and disproportionally disrupt genes encoding postsynaptic proteins. Here we show that small de novo mutations, affecting one or a few nucleotides, are overrepresented among glutamatergic postsynaptic proteins comprising activity-regulated cytoskeleton-associated protein (ARC) and N-methyl-d-aspartate receptor (NMDAR) complexes. Mutations are additionally enriched in proteins that interact with these complexes to modulate synaptic strength, namely proteins regulating actin filament dynamics and those whose messenger RNAs are targets of fragile X mental retardation protein (FMRP). Genes affected by mutations in schizophrenia overlap those mutated in autism and intellectual disability, as do mutation-enriched synaptic pathways. Aligning our findings with a parallel case–control study, we demonstrate reproducible insights into aetiological mechanisms for schizophrenia and reveal pathophysiology shared with other neurodevelopmental disorders.
1,501 citations
••
TL;DR: In this article, the exome sequences of 2,536 schizophrenia cases and 2,543 controls were analyzed and the authors demonstrated a polygenic burden primarily arising from rare (less than 1 in 10,000), disruptive mutations distributed across many genes.
Abstract: Schizophrenia is a common disease with a complex aetiology, probably involving multiple and heterogeneous genetic factors. Here, by analysing the exome sequences of 2,536 schizophrenia cases and 2,543 controls, we demonstrate a polygenic burden primarily arising from rare (less than 1 in 10,000), disruptive mutations distributed across many genes. Particularly enriched gene sets include the voltage-gated calcium ion channel and the signalling complex formed by the activity-regulated cytoskeleton-associated scaffold protein (ARC) of the postsynaptic density, sets previously implicated by genome-wide association and copy-number variation studies. Similar to reports in autism, targets of the fragile X mental retardation protein (FMRP, product of FMR1) are enriched for case mutations. No individual gene-based test achieves significance after correction for multiple testing and we do not detect any alleles of moderately low frequency (approximately 0.5 to 1 per cent) and moderately large effect. Taken together, these data suggest that population-based exome sequencing can discover risk alleles and complements established gene-mapping paradigms in neuropsychiatric disease.
1,323 citations
••
Harvard University1, National Institutes of Health2, Medical College of Wisconsin3, University of Washington4, University of Michigan5, Stanford University6, University of Geneva7, Wellcome Trust Sanger Institute8, Washington University in St. Louis9, University of Chicago10, Yale University11, Duke University12, Boston Children's Hospital13, Baylor College of Medicine14, Lawrence Berkeley National Laboratory15, Johns Hopkins University16, University of Pennsylvania17, Broad Institute18
TL;DR: The key challenges of assessing sequence variants in human disease are discussed, integrating both gene-level and variant-level support for causality, and guidelines for summarizing confidence in variant pathogenicity are proposed.
Abstract: The discovery of rare genetic variants is accelerating, and clear guidelines for distinguishing disease-causing sequence variants from the many potentially functional variants present in any human genome are urgently needed. Without rigorous standards we risk an acceleration of false-positive reports of causality, which would impede the translation of genomic research findings into the clinical diagnostic setting and hinder biological understanding of disease. Here we discuss the key challenges of assessing sequence variants in human disease, integrating both gene-level and variant-level support for causality. We propose guidelines for summarizing confidence in variant pathogenicity and highlight several areas that require further resource development.
1,165 citations
••
TL;DR: The results demonstrate the potential for efficient loss-of-function screening using the CRISPR-Cas9 system and identify 27 known and 4 previously unknown genes implicated in these phenotypes.
Abstract: Identification of genes influencing a phenotype of interest is frequently achieved through genetic screening by RNA interference (RNAi) or knockouts. However, RNAi may only achieve partial depletion of gene activity, and knockout-based screens are difficult in diploid mammalian cells. Here we took advantage of the efficiency and high throughput of genome editing based on type II, clustered, regularly interspaced, short palindromic repeats (CRISPR)-CRISPR-associated (Cas) systems to introduce genome-wide targeted mutations in mouse embryonic stem cells (ESCs). We designed 87,897 guide RNAs (gRNAs) targeting 19,150 mouse protein-coding genes and used a lentiviral vector to express these gRNAs in ESCs that constitutively express Cas9. Screening the resulting ESC mutant libraries for resistance to either Clostridium septicum alpha-toxin or 6-thioguanine identified 27 known and 4 previously unknown genes implicated in these phenotypes. Our results demonstrate the potential for efficient loss-of-function screening using the CRISPR-Cas9 system.
1,001 citations
••
TL;DR: The most comprehensive exploration of genetic loci influencing human metabolism thus far, comprising 7,824 adult individuals from 2 European population studies, is reported, with genome-wide significant associations at 145 metabolic loci and their biochemical connectivity with more than 400 metabolites in human blood.
Abstract: Genome-wide association scans with high-throughput metabolic profiling provide unprecedented insights into how genetic variation influences metabolism and complex disease. Here we report the most comprehensive exploration of genetic loci influencing human metabolism thus far, comprising 7,824 adult individuals from 2 European population studies. We report genome-wide significant associations at 145 metabolic loci and their biochemical connectivity with more than 400 metabolites in human blood. We extensively characterize the resulting in vivo blueprint of metabolism in human blood by integrating it with information on gene expression, heritability and overlap with known loci for complex disorders, inborn errors of metabolism and pharmacological targets. We further developed a database and web-based resources for data mining and results visualization. Our findings provide new insights into the role of inherited variation in blood metabolic diversity and identify potential new opportunities for drug development and for understanding disease.
••
TL;DR: Sequencing of 25 spatially distinct regions from seven operable NSCLCs revealed evidence of branched evolution, with driver mutations arising before and after subclonal diversification, and pronounced intratumor heterogeneity in copy number alterations, translocations, and mutations associated with APOBEC cytidine deaminase activity.
Abstract: Spatial and temporal dissection of the genomic changes occurring during the evolution of human non–small cell lung cancer (NSCLC) may help elucidate the basis for its dismal prognosis. We sequenced 25 spatially distinct regions from seven operable NSCLCs and found evidence of branched evolution, with driver mutations arising before and after subclonal diversification. There was pronounced intratumor heterogeneity in copy number alterations, translocations, and mutations associated with APOBEC cytidine deaminase activity. Despite maintained carcinogen exposure, tumors from smokers showed a relative decrease in smoking-related mutations over time, accompanied by an increase in APOBEC-associated mutations. In tumors from former smokers, genome-doubling occurred within a smoking-signature context before subclonal diversification, which suggested that a long period of tumor latency had preceded clinical detection. The regionally separated driver mutations, coupled with the relentless and heterogeneous nature of the genome instability processes, are likely to confound treatment success in NSCLC.
••
TL;DR: In this paper, the authors aggregated published meta-analyses of genome-wide association studies (GWAS), including 26,488 cases and 83,964 controls of European, east Asian, south Asian and Mexican and Mexican American ancestry.
Abstract: To further understanding of the genetic basis of type 2 diabetes (T2D) susceptibility, we aggregated published meta-analyses of genome-wide association studies (GWAS), including 26,488 cases and 83,964 controls of European, east Asian, south Asian and Mexican and Mexican American ancestry. We observed a significant excess in the directional consistency of T2D risk alleles across ancestry groups, even at SNPs demonstrating only weak evidence of association. By following up the strongest signals of association from the trans-ethnic meta-analysis in an additional 21,491 cases and 55,647 controls of European ancestry, we identified seven new T2D susceptibility loci. Furthermore, we observed considerable improvements in the fine-mapping resolution of common variant association signals at several T2D susceptibility loci. These observations highlight the benefits of trans-ethnic GWAS for the discovery and characterization of complex trait loci and emphasize an exciting opportunity to extend insight into the genetic architecture and pathogenesis of human diseases across populations of diverse ancestry.
••
Harvard University1, Yale University2, Broad Institute3, Baylor College of Medicine4, Beth Israel Deaconess Medical Center5, Wellcome Trust Sanger Institute6, Icahn School of Medicine at Mount Sinai7, University of Texas Health Science Center at Houston8, University of Illinois at Chicago9, University of Pennsylvania10, Vanderbilt University11, University of Pittsburgh12, Carnegie Mellon University13
TL;DR: A statistical model of de novo mutation rates at the individual gene level is used to identify ∼1,000 genes that are significantly lacking in functional coding variation in non-ASD samples and are enriched for de novo loss-of-function mutations identified in ASD cases, suggesting that the role of de novo mutations in ASDs might reside in fundamental neurodevelopmental processes.
Abstract: Mark Daly and colleagues present a statistical framework to evaluate the role of de novo mutations in human disease by calibrating a model of de novo mutation rates at the individual gene level. The mutation probabilities defined by their model and list of constrained genes can be used to help identify genetic variants that have a significant role in disease.
••
TL;DR: In this article, a single-cell bisulfite sequencing (scBS-seq) method was used to accurately measure DNA methylation at up to 48.4% of CpG sites.
Abstract: We report a single-cell bisulfite sequencing (scBS-seq) method that can be used to accurately measure DNA methylation at up to 48.4% of CpG sites. Embryonic stem cells grown in serum or in 2i medium displayed epigenetic heterogeneity, with '2i-like' cells present in serum culture. Integration of 12 individual mouse oocyte datasets largely recapitulated the whole DNA methylome, which makes scBS-seq a versatile tool to explore DNA methylation in rare cells and heterogeneous populations.
••
University of Texas Health Science Center at Houston1, Harvard University2, Broad Institute3, University of Wisconsin–Milwaukee4, University of Washington5, Washington University in St. Louis6, University of North Carolina at Chapel Hill7, Icahn School of Medicine at Mount Sinai8, University of Michigan9, Lund University10, University of Leicester11, Queen Mary University of London12, University of Oxford13, University of Milan14, University of Verona15, Merck & Co.16, National Institutes of Health17, Levanger Hospital18, Norwegian University of Science and Technology19, University of Ottawa20, Stanford University21, University of Iowa22, George Washington University23, Umeå University24, University of Dundee25, Cambridge University Hospitals NHS Foundation Trust26, Technische Universität München27, University of Kiel28, University of Lübeck29, University of Bonn30, Group Health Cooperative31, Houston Methodist Hospital32, Baylor College of Medicine33, IMDEA34, Tufts University35, University of Leeds36, King Abdulaziz University37, Wellcome Trust Sanger Institute38, University of Mississippi39, Fred Hutchinson Cancer Research Center40, University of Virginia41, University of Vermont42, Boston University43
TL;DR: Rare mutations that disrupt APOC3 function were associated with lower levels of plasma triglycerides and APOC3, and carriers of these mutations were found to have a reduced risk of coronary heart disease.
Abstract: Background: Plasma triglyceride levels are heritable and are correlated with the risk of coronary heart disease. Sequencing of the protein-coding regions of the human genome (the exome) has the potential to identify rare mutations that have a large effect on phenotype. Methods: We sequenced the protein-coding regions of 18,666 genes in each of 3734 participants of European or African ancestry in the Exome Sequencing Project. We conducted tests to determine whether rare mutations in coding sequence, individually or in aggregate within a gene, were associated with plasma triglyceride levels. For mutations associated with triglyceride levels, we subsequently evaluated their association with the risk of coronary heart disease in 110,970 persons. Results: An aggregate of rare mutations in the gene encoding apolipoprotein C3 (APOC3) was associated with lower plasma triglyceride levels. Among the four mutations that drove this result, three were loss-of-function mutations: a nonsense mutation (R19X) and two splice-site mutations (IVS2+1G→A and IVS3+1G→T). The fourth was a missense mutation (A43T). Approximately 1 in 150 persons in the study was a heterozygous carrier of at least one of these four mutations. Triglyceride levels in the carriers were 39% lower than levels in noncarriers (P<1×10⁻²⁰), and circulating levels of APOC3 in carriers were 46% lower than levels in noncarriers (P = 8×10⁻¹⁰). The risk of coronary heart disease among 498 carriers of any rare APOC3 mutation was 40% lower than the risk among 110,472 noncarriers (odds ratio, 0.60; 95% confidence interval, 0.47 to 0.75; P = 4×10⁻⁶). Conclusions: Rare mutations that disrupt APOC3 function were associated with lower levels of plasma triglycerides and APOC3. Carriers of these mutations were found to have a reduced risk of coronary heart disease. (Funded by the National Heart, Lung, and Blood Institute and others.)
••
TL;DR: Results from applying the multiple sequentially Markovian coalescent (MSMC) to genome sequences from nine populations across the world suggest that the genetic separation of non-African ancestors from African Yoruban ancestors started long before 50,000 years ago and give information about human population history as recent as 2,000 years ago.
Abstract: The availability of complete human genome sequences from populations across the world has given rise to new population genetic inference methods that explicitly model ancestral relationships under recombination and mutation. So far, application of these methods to evolutionary history more recent than 20,000-30,000 years ago and to population separations has been limited. Here we present a new method that overcomes these shortcomings. The multiple sequentially Markovian coalescent (MSMC) analyzes the observed pattern of mutations in multiple individuals, focusing on the first coalescence between any two individuals. Results from applying MSMC to genome sequences from nine populations across the world suggest that the genetic separation of non-African ancestors from African Yoruban ancestors started long before 50,000 years ago and give information about human population history as recent as 2,000 years ago, including the bottleneck in the peopling of the Americas and separations within Africa, East Asia and Europe.
••
TL;DR: WES data indicate that a larger subclonal mutation fraction may be associated with increased likelihood of postsurgical relapse in patients with localized lung adenocarcinomas, and different mutations are present in different regions of any given lung cancer, and their pattern may predict patient relapse.
Abstract: Cancers are composed of populations of cells with distinct molecular and phenotypic features, a phenomenon termed intratumor heterogeneity (ITH). ITH in lung cancers has not been well studied. We applied multiregion whole-exome sequencing (WES) on 11 localized lung adenocarcinomas. All tumors showed clear evidence of ITH. On average, 76% of all mutations and 20 out of 21 known cancer gene mutations were identified in all regions of individual tumors, which suggested that single-region sequencing may be adequate to identify the majority of known cancer gene mutations in localized lung adenocarcinomas. With a median follow-up of 21 months after surgery, three patients have relapsed, and all three patients had significantly larger fractions of subclonal mutations in their primary tumors than patients without relapse. These data indicate that a larger subclonal mutation fraction may be associated with increased likelihood of postsurgical relapse in patients with localized lung adenocarcinomas.
••
Broad Institute1, University of Oxford2, Swiss Federal Institute of Aquatic Science and Technology3, University of Bern4, Wellcome Trust/Cancer Research UK Gurdon Institute5, Wellcome Trust Sanger Institute6, University of Konstanz7, Agency for Science, Technology and Research8, Reed College9, Stanford University10, California Institute of Technology11, Benaroya Research Institute12, University of Rennes13, Georgia Institute of Technology14, University of Maryland, College Park15, University of Basel16, University of Texas at Austin17, Tokyo Institute of Technology18, National Museum of Natural History19, University of Stirling20, Carnegie Institution for Science21, National Cheng Kung University22, Science for Life Laboratory23, Norwich University24
TL;DR: This article found an excess of gene duplications in the East African lineage compared to Nile tilapia and other teleosts, an abundance of non-coding element divergence, accelerated coding sequence evolution, expression divergence associated with transposable element insertions, and regulation by novel microRNAs.
Abstract: Cichlid fishes are famous for large, diverse and replicated adaptive radiations in the Great Lakes of East Africa. To understand the molecular mechanisms underlying cichlid phenotypic diversity, we sequenced the genomes and transcriptomes of five lineages of African cichlids: the Nile tilapia (Oreochromis niloticus), an ancestral lineage with low diversity; and four members of the East African lineage: Neolamprologus brichardi/pulcher (older radiation, Lake Tanganyika), Metriaclima zebra (recent radiation, Lake Malawi), Pundamilia nyererei (very recent radiation, Lake Victoria), and Astatotilapia burtoni (riverine species around Lake Tanganyika). We found an excess of gene duplications in the East African lineage compared to tilapia and other teleosts, an abundance of non-coding element divergence, accelerated coding sequence evolution, expression divergence associated with transposable element insertions, and regulation by novel microRNAs. In addition, we analysed sequence data from sixty individuals representing six closely related species from Lake Victoria, and show genome-wide diversifying selection on coding and regulatory variants, some of which were recruited from ancient polymorphisms. We conclude that a number of molecular mechanisms shaped East African cichlid genomes, and that amassing of standing variation during periods of relaxed purifying selection may have been important in facilitating subsequent evolutionary diversification.
••
Charité1, Lawrence Berkeley National Laboratory2, Wellcome Trust Sanger Institute3, Cambridge University Hospitals NHS Foundation Trust4, Paul Sabatier University5, University of Manchester6, Newcastle University7, University of Toronto8, Leeds Teaching Hospitals NHS Trust9, Katholieke Universiteit Leuven10, University of Kiel11, University College London12, Drexel University13, University of Cambridge14, Geisinger Medical Center15, St George’s University Hospitals NHS Foundation Trust16, University of Bristol17, Columbia University18, University of Oxford19, Radboud University Nijmegen20, University of Oregon21, Aberystwyth University22, Max Planck Society23
TL;DR: The updated HPO database is described, which provides annotations of 7,278 human hereditary syndromes listed in OMIM, Orphanet and DECIPHER to classes of the HPO, allowing integration of existing datasets and interoperability with multiple biomedical resources.
Abstract: The Human Phenotype Ontology (HPO) project, available at http://www.human-phenotype-ontology.org, provides a structured, comprehensive and well-defined set of 10,088 classes (terms) describing human phenotypic abnormalities and 13,326 subclass relations between the HPO classes. In addition we have developed logical definitions for 46% of all HPO classes using terms from ontologies for anatomy, cell types, function, embryology, pathology and other domains. This allows interoperability with several resources, especially those containing phenotype information on model organisms such as mouse and zebrafish. Here we describe the updated HPO database, which provides annotations of 7,278 human hereditary syndromes listed in OMIM, Orphanet and DECIPHER to classes of the HPO. Various meta-attributes such as frequency, references and negations are associated with each annotation. Several large-scale projects worldwide utilize the HPO for describing phenotype information in their datasets. We have therefore generated equivalence mappings to other phenotype vocabularies such as LDDB, Orphanet, MedDRA, UMLS and phenoDB, allowing integration of existing datasets and interoperability with multiple biomedical resources. We have created various ways to access the HPO database content using flat files, a MySQL database, and Web-based tools. All data and documentation on the HPO project can be found online.
••
TL;DR: It is reported that short-term expression of two components, NANOG and KLF2, is sufficient to ignite other elements of the network and reset the human pluripotent state, demonstrating the feasibility of installing and propagating functional control circuitry for ground-state pluripotency in human cells.
••
TL;DR: This work has shown that co-microinjection of mouse embryos with Cas9 mRNA and single guide RNAs induces on-target and off-target mutations that are transmissible to offspring, but Cas9 nickase can be used to efficiently mutate genes without detectable damage at known off-target sites.
Abstract: Bacterial RNA-directed Cas9 endonuclease is a versatile tool for site-specific genome modification in eukaryotes. Co-microinjection of mouse embryos with Cas9 mRNA and single guide RNAs induces on-target and off-target mutations that are transmissible to offspring. However, Cas9 nickase can be used to efficiently mutate genes without detectable damage at known off-target sites. This method is applicable for genome editing of any model organism and minimizes confounding problems of off-target mutations.
••
TL;DR: The myeloma genome is heterogeneous across the cohort, and exhibits diversity in clonal admixture and in dynamics of evolution, which may impact prognostic stratification, therapeutic approaches and assessment of disease response to treatment.
Abstract: Multiple myeloma is an incurable plasma cell malignancy with a complex and incompletely understood molecular pathogenesis. Here we use whole-exome sequencing, copy-number profiling and cytogenetics to analyse 84 myeloma samples. Most cases have a complex subclonal structure and show clusters of subclonal variants, including subclonal driver mutations. Serial sampling reveals diverse patterns of clonal evolution, including linear evolution, differential clonal response and branching evolution. Diverse processes contribute to the mutational repertoire, including kataegis and somatic hypermutation, and their relative contribution changes over time. We find heterogeneity of mutational spectrum across samples, with few recurrent genes. We identify new candidate genes, including truncations of SP140, LTB, ROBO1 and clustered missense mutations in EGR1. The myeloma genome is heterogeneous across the cohort, and exhibits diversity in clonal admixture and in dynamics of evolution, which may impact prognostic stratification, therapeutic approaches and assessment of disease response to treatment.
••
TL;DR: Mutational signatures can be used as a physiological readout of the biological history of a cancer and also have potential use for discerning ongoing mutational processes from historical ones, thus possibly revealing new targets for anticancer therapies.
Abstract: The collective somatic mutations observed in a cancer are the outcome of multiple mutagenic processes that have been operative over the lifetime of a patient. Each process leaves a characteristic imprint--a mutational signature--on the cancer genome, which is defined by the type of DNA damage and DNA repair processes that result in base substitutions, insertions and deletions or structural variations. With the advent of whole-genome sequencing, researchers are identifying an increasing array of these signatures. Mutational signatures can be used as a physiological readout of the biological history of a cancer and also have potential use for discerning ongoing mutational processes from historical ones, thus possibly revealing new targets for anticancer therapies.
••
TL;DR: This work mapped interindividual variation in gene expression as a quantitative trait, defining expression quantitative trait loci (eQTLs), and found that trans associations to the major histocompatibility complex are dependent on context, paralleling the expression of class II genes.
Abstract: To systematically investigate the impact of immune stimulation upon regulatory variant activity, we exposed primary monocytes from 432 healthy Europeans to interferon-γ (IFN-γ) or differing durations of lipopolysaccharide and mapped expression quantitative trait loci (eQTLs). More than half of cis-eQTLs identified, involving hundreds of genes and associated pathways, are detected specifically in stimulated monocytes. Induced innate immune activity reveals multiple master regulatory trans-eQTLs including the major histocompatibility complex (MHC), coding variants altering enzyme and receptor function, an IFN-β cytokine network showing temporal specificity, and an interferon regulatory factor 2 (IRF2) transcription factor-modulated network. Induced eQTL are significantly enriched for genome-wide association study loci, identifying context-specific associations to putative causal genes including CARD9, ATM, and IRF8. Thus, applying pathophysiologically relevant immune stimuli assists resolution of functional genetic variants.
••
Massachusetts Institute of Technology1, California Institute of Technology2, Stanford University3, Harvard University4, Broad Institute5, Duke University6, University of Massachusetts Medical School7, National Institutes of Health8, University of Southern California9, Yale University10, Florida State University11, Cold Spring Harbor Laboratory12, Wellcome Trust Sanger Institute13, University of California, Santa Cruz14, Princeton University15, University of California, San Diego16, University of Washington17, University of Chicago18, Pennsylvania State University19
TL;DR: The strengths and limitations of biochemical, evolutionary, and genetic approaches for defining functional DNA segments, potential sources for the observed differences in estimated genomic coverage, and the biological implications of these discrepancies are reviewed.
Abstract: With the completion of the human genome sequence, attention turned to identifying and annotating its functional DNA elements. As a complement to genetic and comparative genomics approaches, the Encyclopedia of DNA Elements Project was launched to contribute maps of RNA transcripts, transcriptional regulator binding sites, and chromatin states in many cell types. The resulting genome-wide data reveal sites of biochemical activity with high positional resolution and cell type specificity that facilitate studies of gene regulation and interpretation of noncoding variants associated with human disease. However, the biochemically active regions cover a much larger fraction of the genome than do evolutionarily conserved regions, raising the question of whether nonconserved but biochemically active regions are truly functional. Here, we review the strengths and limitations of biochemical, evolutionary, and genetic approaches for defining functional DNA segments, potential sources for the observed differences in estimated genomic coverage, and the biological implications of these discrepancies. We also analyze the relationship between signal intensity, genomic coverage, and evolutionary conservation. Our results reinforce the principle that each approach provides complementary information and that we need to use combinations of all three to elucidate genome function in human biology and disease.
••
National Institute for Health Research1, University of Leicester2, Wellcome Trust Sanger Institute3, Science for Life Laboratory4, Institute of Chartered Accountants of Nigeria5, University of Paris6, French Institute of Health and Medical Research7, Aix-Marseille University8, University of Lübeck9, Technische Universität München10, National Health Service11, University of Cambridge12, King's College London13, Queen Mary University of London14, King Abdulaziz University15
TL;DR: Increased BMI in adults of European origin is associated with increased methylation at the HIF3A locus in blood cells and in adipose tissue, and perturbation of hypoxia inducible transcription factor pathways could have an important role in the response to increased weight in people.
••
University of Oxford1, Broad Institute2, University of Bern3, Swiss Federal Institute of Aquatic Science and Technology4, Wellcome Trust Sanger Institute5, Wellcome Trust/Cancer Research UK Gurdon Institute6, University of Konstanz7, Agency for Science, Technology and Research8, Reed College9, Stanford University10, California Institute of Technology11, Benaroya Research Institute12, University of Rennes13, Georgia Institute of Technology14, University of Maryland, College Park15, University of Basel16, University of Texas at Austin17, Tokyo Institute of Technology18, National Museum of Natural History19, University of Stirling20, Carnegie Institution for Science21, National Cheng Kung University22, Science for Life Laboratory23, Norwich University24
TL;DR: It is concluded that a number of molecular mechanisms shaped East African cichlid genomes, and that amassing of standing variation during periods of relaxed purifying selection may have been important in facilitating subsequent evolutionary diversification.
Abstract: Cichlid fishes are famous for large, diverse and replicated adaptive radiations in the Great Lakes of East Africa. To understand the molecular mechanisms underlying cichlid phenotypic diversity, we sequenced the genomes and transcriptomes of five lineages of African cichlids: the Nile tilapia (Oreochromis niloticus), an ancestral lineage with low diversity; and four members of the East African lineage: Neolamprologus brichardi/pulcher (older radiation, Lake Tanganyika), Metriaclima zebra (recent radiation, Lake Malawi), Pundamilia nyererei (very recent radiation, Lake Victoria), and Astatotilapia burtoni (riverine species around Lake Tanganyika). We found an excess of gene duplications in the East African lineage compared to tilapia and other teleosts, an abundance of non-coding element divergence, accelerated coding sequence evolution, expression divergence associated with transposable element insertions, and regulation by novel microRNAs. In addition, we analysed sequence data from sixty individuals representing six closely related species from Lake Victoria, and show genome-wide diversifying selection on coding and regulatory variants, some of which were recruited from ancient polymorphisms. We conclude that a number of molecular mechanisms shaped East African cichlid genomes, and that amassing of standing variation during periods of relaxed purifying selection may have been important in facilitating subsequent evolutionary diversification.
••
TL;DR: The goal has always been to develop a system optimized to meet the demands of experimentalists not highly experienced in bioinformatics, and the PredictProtein results are presented as both text and a series of intuitive, interactive and visually appealing figures.
Abstract: PredictProtein is a meta-service for sequence analysis that has been predicting structural and functional features of proteins since 1992. Queried with a protein sequence it returns: multiple sequence alignments, predicted aspects of structure (secondary structure, solvent accessibility, transmembrane helices (TMSEG) and strands, coiled-coil regions, disulfide bonds and disordered regions) and function. The service incorporates analysis methods for the identification of functional regions (ConSurf), homology-based inference of Gene Ontology terms (metastudent), comprehensive subcellular localization prediction (LocTree3), protein–protein binding sites (ISIS2), protein–polynucleotide binding sites (SomeNA) and predictions of the effect of point mutations (non-synonymous SNPs) on protein function (SNAP2). Our goal has always been to develop a system optimized to meet the demands of experimentalists not highly experienced in bioinformatics. To this end, the PredictProtein results are presented as both text and a series of intuitive, interactive and visually appealing figures. The web server and sources are available at http://ppopen.rostlab.org.
••
TL;DR: It is found that SHAPEIT2 produces much lower switch error rates in all cohorts compared to other methods, including those designed specifically for isolated populations, and a general strategy for phasing cohorts with any level of implicit or explicit relatedness between individuals is developed.
Abstract: Many existing cohorts contain a range of relatedness between genotyped individuals, either by design or by chance. Haplotype estimation in such cohorts is a central step in many downstream analyses. Using genotypes from six cohorts from isolated populations and two cohorts from non-isolated populations, we have investigated the performance of different phasing methods designed for nominally ‘unrelated’ individuals. We find that SHAPEIT2 produces much lower switch error rates in all cohorts compared to other methods, including those designed specifically for isolated populations. In particular, when large amounts of IBD sharing are present, SHAPEIT2 infers close to perfect haplotypes. Based on these results, we have developed a general strategy for phasing cohorts with any level of implicit or explicit relatedness between individuals. First, SHAPEIT2 is run ignoring all explicit family information. We then apply a novel HMM method (duoHMM) to combine the SHAPEIT2 haplotypes with any family information to infer the inheritance pattern of each meiosis at all sites across each chromosome. This allows the correction of switch errors and the detection of recombination events and genotyping errors. We show that the method detects numbers of recombination events that align very well with expectations based on genetic maps, and that it infers far fewer spurious recombination events than Merlin. The method can also detect genotyping errors and infer recombination events in otherwise uninformative families, such as trios and duos. The detected recombination events can be used in association scans for recombination phenotypes. The method provides a simple and unified approach to haplotype estimation that will be of interest to researchers in the fields of human, animal and plant genetics.
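The abstract above compares phasing methods by switch error rate but does not define the metric. As a rough illustration only, the sketch below uses the commonly used definition: among an individual's heterozygous sites, count how often the inferred phase flips relative to the true phase between consecutive sites, then divide by the number of consecutive pairs. The function name `switch_error_rate` and the 0/1 allele encoding are assumptions made for this example; this is not code from SHAPEIT2 or duoHMM.

```python
def switch_error_rate(true_hap, inferred_hap, het_sites):
    """Fraction of consecutive heterozygous-site pairs where the phase flips.

    true_hap, inferred_hap: 0/1 alleles carried on the first haplotype,
    one entry per site (hypothetical encoding for this sketch).
    het_sites: indices of the individual's heterozygous sites, in order.
    """
    # At a heterozygous site the inferred first haplotype either agrees with
    # the true first haplotype or with the true second haplotype.
    agrees = [true_hap[i] == inferred_hap[i] for i in het_sites]
    if len(agrees) < 2:
        return 0.0  # no consecutive pair, so no switch can be counted
    # A switch is a change in agreement between neighbouring heterozygous sites.
    switches = sum(1 for a, b in zip(agrees, agrees[1:]) if a != b)
    return switches / (len(agrees) - 1)


if __name__ == "__main__":
    truth = [0, 1, 0, 1, 1, 0]
    inferred = [0, 1, 1, 0, 1, 0]  # phase flips after site 1 and again after site 3
    print(switch_error_rate(truth, inferred, list(range(6))))
```

Under this definition, a haplotype that is inferred with both strands swapped throughout has no switch errors, since the labelling of the two haplotypes is arbitrary.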