Showing papers by "Mark Gerstein" published in 2015

Open Access

Journal Article•DOI•

A global reference for human genetic variation.

Adam Auton¹, Gonçalo R. Abecasis², David Altshuler³, Richard Durbin⁴ +514 more•Institutions (90)

01 Oct 2015-Nature

TL;DR: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations, and has reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping.

Abstract: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

12,661 citations

A global reference for human genetic variation

Adam Auton, Gonçalo R. Abecasis, David Altshuler, Richard Durbin +476 more

01 Oct 2015

TL;DR: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations, and reported completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping.

3,247 citations

Journal Article•DOI•

The Molecular Taxonomy of Primary Prostate Cancer

Adam Abeshouse¹, Jaeil Ahn¹, Rehan Akbani¹, Adrian Ally¹ +308 more•Institutions (1)

05 Nov 2015-Cell

TL;DR: As part of The Cancer Genome Atlas (TCGA), a comprehensive molecular analysis of primary prostate carcinomas revealed substantial heterogeneity among primary prostate cancers, evident in the spectrum of molecular abnormalities and in the variable clinical course.

2,109 citations

Journal Article•DOI•

An integrated map of structural variation in 2,504 human genomes

Peter H. Sudmant¹, Tobias Rausch, Eugene J. Gardner², Robert E. Handsaker³, Robert E. Handsaker⁴, Alexej Abyzov⁵, John Huddleston¹, Yan Zhang⁶, Kai Ye⁷, Goo Jun⁸, Goo Jun⁹, Markus His Yang Fritz, Miriam K. Konkel¹⁰, Ankit Malhotra, Adrian M. Stütz, Xinghua Shi¹¹, Francesco Paolo Casale¹², Jieming Chen⁶, Fereydoun Hormozdiari¹, Gargi Dayama⁹, Ken Chen¹³, Maika Malig¹, Mark Chaisson¹, Klaudia Walter¹², Sascha Meiers, Seva Kashin³, Seva Kashin⁴, Erik Garrison¹⁴, Adam Auton¹⁵, Hugo Y. K. Lam, Xinmeng Jasmine Mu³, Xinmeng Jasmine Mu⁶, Can Alkan¹⁶, Danny Antaki¹⁷, Taejeong Bae⁵, Eliza Cerveira, Peter S. Chines¹⁸, Zechen Chong¹³, Laura Clarke¹², Elif Dal¹⁶, Li Ding⁷, S. Emery⁹, Xian Fan¹³, Madhusudan Gujral¹⁷, Fatma Kahveci¹⁶, Jeffrey M. Kidd⁹, Yu Kong¹⁵, Eric-Wubbo Lameijer¹⁹, Shane A. McCarthy¹², Paul Flicek¹², Richard A. Gibbs²⁰, Gabor T. Marth¹⁴, Christopher E. Mason²¹, Androniki Menelaou²², Androniki Menelaou²³, Donna M. Muzny²⁴, Bradley J. Nelson¹, Amina Noor¹⁷, Nicholas F. Parrish²⁵, Matthew Pendleton²⁴, Andrew Quitadamo¹¹, Benjamin Raeder, Eric E. Schadt²⁴, Mallory Romanovitch, Andreas Schlattl, Robert Sebra²⁴, Andrey A. Shabalin²⁶, Andreas Untergasser²⁷, Jerilyn A. Walker¹⁰, Min Wang²⁰, Fuli Yu²⁰, Chengsheng Zhang, Jing Zhang⁶, Xiangqun Zheng-Bradley¹², Wanding Zhou¹³, Thomas Zichner, Jonathan Sebat¹⁷, Mark A. Batzer¹⁰, Steven A. McCarroll⁴, Steven A. McCarroll³, Ryan E. Mills⁹, Mark Gerstein⁶, Ali Bashir²⁴, Oliver Stegle¹², Scott E. Devine², Charles Lee²⁸, Evan E. Eichler¹, Jan O. Korbel¹²•Institutions (28)

01 Oct 2015-Nature

TL;DR: In this paper, the authors describe an integrated set of eight structural variant classes comprising both balanced and unbalanced variants, which are constructed using short-read DNA sequencing data and statistically phased onto haplotype blocks in 26 human populations.

Abstract: Structural variants are implicated in numerous diseases and make up the majority of varying nucleotides among human genomes. Here we describe an integrated set of eight structural variant classes comprising both balanced and unbalanced variants, which we constructed using short-read DNA sequencing data and statistically phased onto haplotype blocks in 26 human populations. Analysing this set, we identify numerous gene-intersecting structural variants exhibiting population stratification and describe naturally occurring homozygous gene knockouts that suggest the dispensability of a variety of human genes. We demonstrate that structural variants are enriched on haplotypes identified by genome-wide association studies and exhibit enrichment for expression quantitative trait loci. Additionally, we uncover appreciable levels of structural variant complexity at different scales, including genic loci subject to clusters of repeated rearrangement and complex structural variants with multiple breakpoints likely to have formed through individual mutational events. Our catalogue will enhance future studies into structural variant demography, functional impact and disease association.

1,971 citations

The Molecular Taxonomy of Primary Prostate Cancer

Adam Abeshouse, Jaeil Ahn, Rehan Akbani, Adrian Ally +306 more

01 Nov 2015

TL;DR: A comprehensive molecular analysis of 333 primary prostate carcinomas revealed a molecular taxonomy in which 74% of these tumors fell into one of seven subtypes defined by specific gene fusions (ERG, ETV1/4, and FLI1) or mutations (SPOP, FOXA1, and IDH1).

Abstract: There is substantial heterogeneity among primary prostate cancers, evident in the spectrum of molecular abnormalities and its variable clinical course. As part of The Cancer Genome Atlas (TCGA), we present a comprehensive molecular analysis of 333 primary prostate carcinomas. Our results revealed a molecular taxonomy in which 74% of these tumors fell into one of seven subtypes defined by specific gene fusions (ERG, ETV1/4, and FLI1) or mutations (SPOP, FOXA1, and IDH1). Epigenetic profiles showed substantial heterogeneity, including an IDH1 mutant subset with a methylator phenotype. Androgen receptor (AR) activity varied widely and in a subtype-specific manner, with SPOP and FOXA1 mutant tumors having the highest levels of AR-induced transcripts. 25% of the prostate cancers had a presumed actionable lesion in the PI3K or MAPK signaling pathways, and DNA repair genes were inactivated in 19%. Our analysis reveals molecular heterogeneity among primary prostate cancers, as well as potentially actionable molecular defects.

1,794 citations

Journal Article•DOI•

FOXG1-Dependent Dysregulation of GABA/Glutamate Neuron Differentiation in Autism Spectrum Disorders

Jessica Mariani¹, Gianfilippo Coppola¹, Ping Zhang¹, Alexej Abyzov¹, Lauren E. Provini¹, Livia Tomasini¹, Mariangela Amenduni¹, Anna Szekely¹, Dean Palejev¹, Michael Wilson¹, Mark Gerstein, Elena L. Grigorenko¹, Katarzyna Chawarska¹, Kevin A. Pelphrey¹, James R. Howe¹, Flora M. Vaccarino¹•Institutions (1)

Yale University¹

16 Jul 2015-Cell

TL;DR: Three-dimensional neural cultures derived from induced pluripotent stem cells are used to investigate neurodevelopmental alterations in individuals with severe idiopathic ASD and show that overexpression of the transcription factor FOXG1 is responsible for the overproduction of GABAergic neurons.

843 citations

Journal Article•DOI•

The PsychENCODE project

Schahram Akbarian¹, Chunyu Liu², James A. Knowles³, Flora M. Vaccarino⁴, Peggy J. Farnham³, Gregory E. Crawford⁵, Andrew E. Jaffe, Dalila Pinto¹, Stella Dracheva¹, Daniel H. Geschwind⁶, Jonathan Mill⁷, Jonathan Mill⁸, Angus C. Nairn⁴, Alexej Abyzov⁹, Sirisha Pochareddy⁴, Shyam Prabhakar¹⁰, Sherman M. Weissman⁴, Patrick F. Sullivan¹¹, Matthew W. State¹², Zhiping Weng¹³, Mette A. Peters¹⁴, Kevin P. White¹⁵, Mark Gerstein⁴, Anahita Amiri⁴, Chris Armoskus³, Allison E. Ashley-Koch⁵, Taejeong Bae⁹, Andrea Beckel-Mitchener¹⁶, Benjamin P. Berman³, Gerhard A. Coetzee³, Gianfilippo Coppola⁴, Nancy Francoeur¹, Menachem Fromer¹, Robert Gao³, Kay Grennan², Jennifer Herstein³, David H. Kavanagh¹, Nikolay A. Ivanov, Yan Jiang¹, Robert R. Kitchen⁴, Alexey Kozlenkov¹, Marija Kundakovic¹, Mingfeng Li⁴, Zhen Li⁴, Shuang Liu⁴, Lara M. Mangravite¹⁴, Eugenio Mattei¹³, Eirene Markenscoff-Papadimitriou¹², Fabio C. P. Navarro⁴, Nicole North¹⁶, Larsson Omberg¹⁴, David M. Panchision¹⁶, Neelroop N. Parikshak⁶, Jeremie Poschmann⁷, Amanda J. Price, Michael J. Purcaro¹³, Timothy E. Reddy⁵, Panos Roussos¹, Shannon Schreiner³, Soraya Scuderi⁴, Robert Sebra¹, Mikihito Shibata⁴, Annie W. Shieh², Mario Skarica⁴, Wenjie Sun¹⁰, Vivek Swarup⁶, Amber Thomas¹⁵, Junko Tsuji¹³, Harm van Bakel¹, Daifeng Wang⁴, Yongjun Wang², Kai Wang³, Donna M. Werling¹², A. Jeremy Willsey¹², Heather Witt³, Hyejung Won⁶, Chloe C. Y. Wong⁷, Chloe C. Y. Wong⁸, Gregory A. Wray⁵, Emily Wu⁶, Xuming Xu⁴, Lijing Yao³, Geetha Senthil¹⁶, Thomas Lehner¹⁶, Pamela Sklar¹, Nenad Sestan⁴•Institutions (16)

Icahn School of Medicine at Mount Sinai¹, University of Illinois at Chicago², University of Southern California³, Yale University⁴, Duke University⁵, University of California, Los Angeles⁶, University of Exeter⁷, King's College London⁸, Mayo Clinic⁹, Agency for Science, Technology and Research¹⁰, University of North Carolina at Chapel Hill¹¹, University of California, San Francisco¹², University of Massachusetts Medical School¹³, Sage Bionetworks¹⁴, University of Chicago¹⁵, National Institutes of Health¹⁶

25 Nov 2015-Nature Neuroscience

TL;DR: The PsychENCODE project aims to produce a public resource of multidimensional genomic data using tissue- and cell type–specific samples from approximately 1,000 phenotypically well-characterized, high-quality healthy and disease-affected human post-mortem brains, as well as functionally characterize disease-associated regulatory elements and variants in model systems.

Abstract: Recent research on disparate psychiatric disorders has implicated rare variants in genes involved in global gene regulation and chromatin modification, as well as many common variants located primarily in regulatory regions of the genome. Understanding precisely how these variants contribute to disease will require a deeper appreciation for the mechanisms of gene regulation in the developing and adult human brain. The PsychENCODE project aims to produce a public resource of multidimensional genomic data using tissue- and cell type–specific samples from approximately 1,000 phenotypically well-characterized, high-quality healthy and disease-affected human post-mortem brains, as well as functionally characterize disease-associated regulatory elements and variants in model systems. We are beginning with a focus on autism spectrum disorder, bipolar disorder and schizophrenia, and expect that this knowledge will apply to a wide variety of psychiatric disorders. This paper outlines the motivation and design of PsychENCODE.

347 citations

Journal Article•DOI•

Tracking Distinct RNA Populations Using Efficient and Reversible Covalent Chemistry.

Erin E. Duffy¹, Michael Rutenberg-Schoenberg¹, Catherine D. Stark¹, Robert R. Kitchen¹, Mark Gerstein¹, Matthew D. Simon¹•Institutions (1)

Yale University¹

03 Sep 2015-Molecular Cell

TL;DR: It is demonstrated that methanethiosulfonate (MTS) reagents form disulfide bonds with s⁴U more efficiently than the commonly used HPDP-biotin, leading to higher yields and less biased enrichment.

171 citations

Journal Article•DOI•

MetaSV: an accurate and integrative structural-variant caller for next generation sequencing

Marghoob Mohiyuddin¹, John C. Mu¹, Jian Li¹, Narges Bani Asadi¹, Mark Gerstein², Alexej Abyzov³, Wing Hung Wong⁴, Hugo Y. K. Lam¹•Institutions (4)

Hoffmann-La Roche¹, Yale University², Mayo Clinic³, Stanford University⁴

15 Aug 2015-Bioinformatics

TL;DR: MetaSV is proposed, an integrated SV caller which leverages multiple orthogonal SV signals for high accuracy and resolution and analyzes soft-clipped reads from alignment to detect insertions accurately since existing tools underestimate insertion SVs.

Abstract: Summary: Structural variations (SVs) are large genomic rearrangements that vary significantly in size, making them challenging to detect with the relatively short reads from next-generation sequencing (NGS). Different SV detection methods have been developed; however, each is limited to specific kinds of SVs with varying accuracy and resolution. Previous works have attempted to combine different methods, but they still suffer from poor accuracy particularly for insertions. We propose MetaSV, an integrated SV caller which leverages multiple orthogonal SV signals for high accuracy and resolution. MetaSV proceeds by merging SVs from multiple tools for all types of SVs. It also analyzes soft-clipped reads from alignment to detect insertions accurately since existing tools underestimate insertion SVs. Local assembly in combination with dynamic programming is used to improve breakpoint resolution. Paired-end and coverage information is used to predict SV genotypes. Using simulation and experimental data, we demonstrate the effectiveness of MetaSV across various SV types and sizes. Availability and implementation: Code in Python is at http://bioinform.github.io/metasv/. Contact: rd@bina.com Supplementary information: Supplementary data are available at Bioinformatics online.
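
Illustrative sketch (not from the paper): the step of merging SV calls from multiple tools could look roughly like the Python below. The `SVCall` and `MergedSV` types, the tool names, and the 50% reciprocal-overlap rule are all assumptions made for illustration; this is not MetaSV's actual code or merging criterion.

```python
from typing import List, NamedTuple


class SVCall(NamedTuple):
    chrom: str
    start: int
    end: int
    sv_type: str  # e.g. "DEL" or "DUP"
    tool: str     # hypothetical caller name


class MergedSV(NamedTuple):
    chrom: str
    start: int
    end: int
    sv_type: str
    tools: frozenset  # names of the callers supporting this merged call


def reciprocal_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> float:
    """Smaller of the two overlap fractions; 0.0 when the intervals do not overlap."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def merge_calls(calls: List[SVCall], min_overlap: float = 0.5) -> List[MergedSV]:
    """Greedily merge same-type calls on one chromosome that overlap reciprocally.

    The 0.5 threshold is an assumed default, chosen only for this sketch.
    """
    merged: List[MergedSV] = []
    ordered = sorted(calls, key=lambda c: (c.chrom, c.sv_type, c.start, c.end))
    for call in ordered:
        last = merged[-1] if merged else None
        if (
            last is not None
            and last.chrom == call.chrom
            and last.sv_type == call.sv_type
            and reciprocal_overlap(last.start, last.end, call.start, call.end) >= min_overlap
        ):
            # Extend the previous merged call and record the extra supporting tool.
            merged[-1] = MergedSV(
                last.chrom,
                min(last.start, call.start),
                max(last.end, call.end),
                last.sv_type,
                last.tools | {call.tool},
            )
        else:
            merged.append(
                MergedSV(call.chrom, call.start, call.end, call.sv_type, frozenset({call.tool}))
            )
    return merged
```

Two deletions reported at chr1:100-200 and chr1:110-210 by different tools would collapse into one call spanning 100-210 that lists both tools, while a duplication over the same region stays separate. Zero-length calls, such as insertions reported at a single position, never merge under this rule and would need their own handling.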

122 citations

Journal Article•DOI•

An ensemble approach to accurately detect somatic mutations using SomaticSeq

Li Tai Fang¹, Pegah Tootoonchi Afshar², Aparna Chhibber¹, Marghoob Mohiyuddin¹, Yu Fan³, John C. Mu¹, Greg Gibeling¹, Sharon Y. Barr¹, Narges Bani Asadi¹, Mark Gerstein⁴, Daniel C. Koboldt⁵, Wenyi Wang³, Wing Hung Wong², Hugo Y. K. Lam¹•Institutions (5)

Hoffmann-La Roche¹, Stanford University², University of Texas MD Anderson Cancer Center³, Yale University⁴, Washington University in St. Louis⁵

17 Sep 2015-Genome Biology

TL;DR: SomaticSeq is a somatic mutation detection pipeline implementing a stochastic boosting algorithm to produce highly accurate somatic mutation calls for both single nucleotide variants and small insertions and deletions; it achieves better overall accuracy than any individual tool incorporated.

Abstract: SomaticSeq is an accurate somatic mutation detection pipeline implementing a stochastic boosting algorithm to produce highly accurate somatic mutation calls for both single nucleotide variants and small insertions and deletions. The workflow currently incorporates five state-of-the-art somatic mutation callers, and extracts over 70 individual genomic and sequencing features for each candidate site. A training set is provided to an adaptively boosted decision tree learner to create a classifier for predicting mutation statuses. We validate our results with both synthetic and real data. We report that SomaticSeq is able to achieve better overall accuracy than any individual tool incorporated.

95 citations

MetaSV: an accurate and integrative structural-variant caller for next generation

Marghoob Mohiyuddin, John C. Mu, Jian Li, Narges Bani Asadi, Mark Gerstein, Alexej Abyzov, Wing Hung Wong

01 Jan 2015

TL;DR: MetaSV combines multiple orthogonal SV signals for high accuracy and resolution by merging SVs from multiple tools for all types of SVs, and analyzes soft-clipped reads from alignment to detect insertions accurately.

Abstract: Summary: Structural variations (SVs) are large genomic rearrangements that vary significantly in size, making them challenging to detect with the relatively short reads from next-generation sequencing (NGS). Different SV detection methods have been developed; however, each is limited to specific kinds of SVs with varying accuracy and resolution. Previous works have attempted to combine different methods, but they still suffer from poor accuracy particularly for insertions. We propose MetaSV, an integrated SV caller which leverages multiple orthogonal SV signals for high accuracy and resolution. MetaSV proceeds by merging SVs from multiple tools for all types of SVs. It also analyzes soft-clipped reads from alignment to detect insertions accurately since existing tools underestimate insertion SVs. Local assembly in combination with dynamic programming is used to improve breakpoint resolution. Paired-end and coverage information is used to predict SV genotypes. Using simulation and experimental data, we demonstrate the effectiveness of MetaSV across various SV types and sizes. Availability and implementation: Code in Python is at http://bioinform.github.io/metasv/. Contact: rd@bina.com Supplementary information: Supplementary data are available at Bioinformatics online.

Journal Article•DOI•

Enhanced transcriptome maps from multiple mouse tissues reveal evolutionary constraint in gene expression

Dmitri D. Pervouchine¹, Sarah Djebali, Alessandra Breschi, Carrie A. Davis², Pablo Prieto Barja, Alexander Dobin², Andrea Tanzer³, Julien Lagarde, Chris Zaleski², Lei Hoon See², Meagan Fastuca², Jorg Drenkow², Huaien Wang², Giovanni Bussotti, Baikang Pei⁴, Suganthi Balasubramanian⁴, Jean Monlong⁵, Arif Harmanci⁴, Mark Gerstein⁴, Michael A. Beer⁶, Cedric Notredame, Roderic Guigó, Thomas R. Gingeras²•Institutions (6)

Moscow State University¹, Cold Spring Harbor Laboratory², University of Vienna³, Yale University⁴, McGill University⁵, Johns Hopkins University⁶

13 Jan 2015-Nature Communications

TL;DR: In this article, the authors characterize, by RNA sequencing, the transcriptional profiles of a large and heterogeneous collection of mouse tissues, augmenting the mouse transcriptome with thousands of novel transcript candidates.

Abstract: Mice have been a long-standing model for human biology and disease. Here we characterize, by RNA sequencing, the transcriptional profiles of a large and heterogeneous collection of mouse tissues, augmenting the mouse transcriptome with thousands of novel transcript candidates. Comparison with transcriptome profiles in human cell lines reveals substantial conservation of transcriptional programmes, and uncovers a distinct class of genes with levels of expression that have been constrained early in vertebrate evolution. This core set of genes captures a substantial fraction of the transcriptional output of mammalian cells, and participates in basic functional and structural housekeeping processes common to all cell types. Perturbation of these constrained genes is associated with significant phenotypes including embryonic lethality and cancer. Evolutionary constraint in gene expression levels is not reflected in the conservation of the genomic sequences, but is associated with conserved epigenetic marking, as well as with characteristic post-transcriptional regulatory programme, in which sub-cellular localization and alternative splicing play comparatively large roles.

Journal Article•DOI•

Analysis of deletion breakpoints from 1,092 humans reveals details of mutation mechanisms.

Alexej Abyzov¹, Shantao Li², Daniel Rhee Kim², Marghoob Mohiyuddin³, Adrian M. Stütz, Nicholas F. Parrish⁴, Xinmeng Jasmine Mu², Wyatt T. Clark², Ken Chen⁵, Matthew E. Hurles⁶, Jan O. Korbel⁷, Hugo Y. K. Lam, Charles Kai-Wu Lee, Mark Gerstein²•Institutions (7)

Mayo Clinic¹, Yale University², Hoffmann-La Roche³, Kyoto University⁴, University of Texas MD Anderson Cancer Center⁵, Wellcome Trust Sanger Institute⁶, European Bioinformatics Institute⁷

01 Jun 2015-Nature Communications

TL;DR: The identification, classification and analysis of a large database of variants giving an insight into mechanisms generating them are presented, and a major source of complexity in the human genome is identified.

Abstract: Investigating genomic structural variants at basepair resolution is crucial for understanding their formation mechanisms. We identify and analyse 8,943 deletion breakpoints in 1,092 samples from the 1000 Genomes Project. We find breakpoints have more nearby SNPs and indels than the genomic average, likely a consequence of relaxed selection. By investigating the correlation of breakpoints with DNA methylation, Hi-C interactions, and histone marks and the substitution patterns of nucleotides near them, we find that breakpoints with the signature of non-allelic homologous recombination (NAHR) are associated with open chromatin. We hypothesize that some NAHR deletions occur without DNA replication and cell division, in embryonic and germline cells. In contrast, breakpoints associated with non-homologous (NH) mechanisms often have sequence microinsertions, templated from later replicating genomic sites, spaced at two characteristic distances from the breakpoint. These microinsertions are consistent with template-switching events and suggest a particular spatiotemporal configuration for DNA during the events.

Journal Article•DOI•

LARVA: an integrative framework for large-scale analysis of recurrent variants in noncoding annotations

Lucas Lochovsky¹, Jing Zhang¹, Yao Fu¹, Ekta Khurana², Mark Gerstein¹•Institutions (2)

Yale University¹, Cornell University²

30 Sep 2015-Nucleic Acids Research

TL;DR: LARVA is a new computational framework that integrates variants with a comprehensive set of noncoding functional elements, models the mutation counts of the elements with a β-binomial distribution to handle overdispersion, and highlights several novel highly mutated regulatory sites that could potentially be noncoding drivers.

Abstract: In cancer research, background models for mutation rates have been extensively calibrated in coding regions, leading to the identification of many driver genes, recurrently mutated more than expected. Noncoding regions are also associated with disease; however, background models for them have not been investigated in as much detail. This is partially due to limited noncoding functional annotation. Also, great mutation heterogeneity and potential correlations between neighboring sites give rise to substantial overdispersion in mutation count, resulting in problematic background rate estimation. Here, we address these issues with a new computational framework called LARVA. It integrates variants with a comprehensive set of noncoding functional elements, modeling the mutation counts of the elements with a β-binomial distribution to handle overdispersion. LARVA, moreover, uses regional genomic features such as replication timing to better estimate local mutation rates and mutational hotspots. We demonstrate LARVA's effectiveness on 760 whole-genome tumor sequences, showing that it identifies well-known noncoding drivers, such as mutations in the TERT promoter. Furthermore, LARVA highlights several novel highly mutated regulatory sites that could potentially be noncoding drivers. We make LARVA available as a software tool and release our highly mutated annotations as an online resource (larva.gersteinlab.org).
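
Illustrative sketch (not from the paper): to see why a β-binomial background handles overdispersion, the Python below compares the upper-tail probability of a mutation count under a β-binomial and under a plain binomial with the same mean. The function names and the parameters `n`, `k`, `a`, `b` are hypothetical, and the sketch does not reproduce how LARVA fits or parameterizes its model.

```python
from math import comb, exp, lgamma


def log_beta(x: float, y: float) -> float:
    """Natural log of the Beta function B(x, y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)


def beta_binomial_pmf(k: int, n: int, a: float, b: float) -> float:
    """P(X = k) for a beta-binomial with n trials and shape parameters a, b."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))


def beta_binomial_upper_tail(k: int, n: int, a: float, b: float) -> float:
    """P(X >= k): a p-value-like score for observing k or more mutated trials."""
    return sum(beta_binomial_pmf(i, n, a, b) for i in range(k, n + 1))


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) under a plain binomial with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Same mean (0.5) in both models; a = b = 1 is an extreme, purely
    # illustrative amount of overdispersion.
    print(beta_binomial_upper_tail(8, 10, 1.0, 1.0))  # 3/11, about 0.273
    print(binomial_upper_tail(8, 10, 0.5))            # 56/1024, about 0.055
```

With the same mean, the overdispersed model assigns a much larger tail probability to the same count, so a count that looks significant under the binomial may be unremarkable once the extra variance is modeled.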

Journal Article•DOI•

VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications

John C. Mu¹, Marghoob Mohiyuddin¹, Jian Li¹, Narges Bani Asadi¹, Mark Gerstein¹, Alexej Abyzov¹, Wing Hung Wong¹, Hugo Y. K. Lam¹•Institutions (1)

Stanford University¹

01 May 2015-Bioinformatics

TL;DR: A novel map data structure to validate read alignments, a strategy to compare variants binned in size ranges and a lightweight, interactive, graphical report to visualize validation results with detailed statistics make VarSim the most comprehensive validation tool for secondary analysis in next generation sequencing.

Abstract: Summary: VarSim is a framework for assessing alignment and variant calling accuracy in high-throughput genome sequencing through simulation or real data. In contrast to simulating a random mutation spectrum, it synthesizes diploid genomes with germline and somatic mutations based on a realistic model. This model leverages information such as previously reported mutations to make the synthetic genomes biologically relevant. VarSim simulates and validates a wide range of variants, including single nucleotide variants, small indels and large structural variants. It is an automated, comprehensive compute framework supporting parallel computation and multiple read simulators. Furthermore, we developed a novel map data structure to validate read alignments, a strategy to compare variants binned in size ranges and a lightweight, interactive, graphical report to visualize validation results with detailed statistics. Thus far, it is the most comprehensive validation tool for secondary analysis in next generation sequencing. Availability and implementation: Code in Java and Python along with instructions to download the reads and variants is at http://bioinform.github.io/varsim. Contact: rd@bina.com Supplementary information: Supplementary data are available at Bioinformatics online.

Journal Article•DOI•

Integration of extracellular RNA profiling data using metadata, biomedical ontologies and Linked Data technologies.

Sai Lakshmi Subramanian¹, Robert R. Kitchen², Roger P. Alexander³, Bob S. Carter⁴, Kei-Hoi Cheung², Louise C. Laurent⁴, Alexander R. Pico⁵, Lewis R. Roberts⁶, Matthew E. Roth¹, Joel Rozowsky², Andrew I. Su⁷, Mark Gerstein², Aleksandar Milosavljevic¹•Institutions (7)

Baylor College of Medicine¹, Yale University², Pacific Northwest Diabetes Research Institute³, University of California, San Diego⁴, Gladstone Institutes⁵, Mayo Clinic⁶, Scripps Research Institute⁷

28 Aug 2015-Journal of Extracellular Vesicles

TL;DR: The strategy that is being implemented by the exRNA Data Management and Resource Repository is presented, which employs metadata, biomedical ontologies and Linked Data technologies, such as Resource Description Framework to integrate a diverse set of exRNA profiles into an exRNA Atlas and enable integrative exRNA analysis.

Abstract: The large diversity and volume of extracellular RNA (exRNA) data that will form the basis of the exRNA Atlas generated by the Extracellular RNA Communication Consortium pose a substantial data integration challenge. We here present the strategy that is being implemented by the exRNA Data Management and Resource Repository, which employs metadata, biomedical ontologies and Linked Data technologies, such as Resource Description Framework to integrate a diverse set of exRNA profiles into an exRNA Atlas and enable integrative exRNA analysis. We focus on the following three specific data integration tasks: (a) selection of samples from a virtual biorepository for exRNA profiling and for inclusion in the exRNA Atlas; (b) retrieval of a data slice from the exRNA Atlas for integrative analysis and (c) interpretation of exRNA analysis results in the context of pathways and networks. As exRNA profiling gains wide adoption in the research community, we anticipate that the strategies discussed here will increasingly be required to enable data reuse and to facilitate integrative analysis of exRNA data.

Journal Article•DOI•

High-order neural networks and kernel methods for peptide-MHC binding prediction.

Pavel P. Kuksa¹, Martin Renqiang Min², Rishabh Dugar², Mark Gerstein³•Institutions (3)

University of Pennsylvania¹, Princeton University², Yale University³

15 Nov 2015-Bioinformatics

TL;DR: The experiments show that the proposed shallow HONN outperforms the popular pre-trained deep neural network on most tasks, which demonstrates the effectiveness of modelling high-order feature interactions for predicting major histocompatibility complex-peptide binding.

Abstract: Motivation: Effective computational methods for peptide-protein binding prediction can greatly help clinical peptide vaccine search and design. However, previous computational methods fail to capture key nonlinear high-order dependencies between different amino acid positions. As a result, they often produce low-quality rankings of strong binding peptides. To solve this problem, we propose nonlinear high-order machine learning methods including high-order neural networks (HONNs) with possible deep extensions and high-order kernel support vector machines to predict major histocompatibility complex-peptide binding. Results: The proposed high-order methods improve quality of binding predictions over other prediction methods. With the proposed methods, a significant gain of up to 25-40% is observed on the benchmark and reference peptide datasets and tasks. In addition, for the first time, our experiments show that pre-training with high-order semi-restricted Boltzmann machines significantly improves the performance of feed-forward HONNs. Moreover, our experiments show that the proposed shallow HONN outperforms the popular pre-trained deep neural network on most tasks, which demonstrates the effectiveness of modelling high-order feature interactions for predicting major histocompatibility complex-peptide binding. Availability and implementation: There is no associated distributable software. Contact: renqiang@nec-labs.com or mark.gerstein@yale.edu Supplementary information: Supplementary data are available at Bioinformatics online.

Journal Article•DOI•

An approach for determining and measuring network hierarchy applied to comparing the phosphorylome and the regulome

Chao Cheng¹, Erik Andrews¹, Koon-Kiu Yan², Matthew Ung¹, Daifeng Wang², Mark Gerstein²•Institutions (2)

Dartmouth College¹, Yale University²

31 Mar 2015-Genome Biology

TL;DR: A score is defined to quantify the degree of hierarchy in a network and a simulated-annealing algorithm is developed to maximize the hierarchical score globally over a network.

Abstract: Many biological networks naturally form a hierarchy with a preponderance of downward information flow. In this study, we define a score to quantify the degree of hierarchy in a network and develop a simulated-annealing algorithm to maximize the hierarchical score globally over a network. We apply our algorithm to determine the hierarchical structure of the phosphorylome in detail and investigate the correlation between its hierarchy and kinase properties. We also compare it to the regulatory network, finding that the phosphorylome is more hierarchical than the regulome.

Journal Article•DOI•

Loregic: A Method to Characterize the Cooperative Logic of Regulatory Factors

Daifeng Wang¹, Koon-Kiu Yan¹, Cristina Sisu¹, Chao Cheng², Joel Rozowsky¹, William Meyerson¹, Mark Gerstein¹•Institutions (2)

Yale University¹, Dartmouth College²

17 Apr 2015-PLOS Computational Biology

TL;DR: Loregic, a computational method integrating gene expression and regulatory network data, is presented to characterize the cooperativity of regulatory factors; its gate logic is then inter-related with other aspects of regulation, such as indirect binding via protein-protein interactions, feed-forward loop motifs and global regulatory hierarchy.

Abstract: The topology of the gene-regulatory network has been extensively analyzed. Now, given the large amount of available functional genomic data, it is possible to go beyond this and systematically study regulatory circuits in terms of logic elements. To this end, we present Loregic, a computational method integrating gene expression and regulatory network data, to characterize the cooperativity of regulatory factors. Loregic uses all 16 possible two-input-one-output logic gates (e.g. AND or XOR) to describe triplets of two factors regulating a common target. We attempt to find the gate that best matches each triplet’s observed gene expression pattern across many conditions. We make Loregic available as a general-purpose tool (github.com/gersteinlab/loregic). We validate it with known yeast transcription-factor knockout experiments. Next, using human ENCODE ChIP-Seq and TCGA RNA-Seq data, we are able to demonstrate how Loregic characterizes complex circuits involving both proximally and distally regulating transcription factors (TFs) and also miRNAs. Furthermore, we show that MYC, a well-known oncogenic driving TF, can be modeled as acting independently from other TFs (e.g., using OR gates) but antagonistically with repressing miRNAs. Finally, we inter-relate Loregic’s gate logic with other aspects of regulation, such as indirect binding via protein-protein interactions, feed-forward loop motifs and global regulatory hierarchy.

Journal Article•DOI•

Leveraging long read sequencing from a single individual to provide a comprehensive resource for benchmarking variant calling methods

John C. Mu¹, Pegah Tootoonchi Afshar², Marghoob Mohiyuddin¹, Xi Chen², Jian Li¹, Narges Bani Asadi¹, Mark Gerstein³, Wing Hung Wong², Hugo Y. K. Lam¹•Institutions (3)

Hoffmann-La Roche¹, Stanford University², Yale University³

28 Sep 2015-Scientific Reports

TL;DR: This work leveraged the massive high-quality Sanger sequences from the HuRef genome to construct by far the most comprehensive gold set of a single individual, which was cross validated with deep Illumina sequencing, population datasets, and well-established algorithms.

Abstract: A high-confidence, comprehensive human variant set is critical in assessing accuracy of sequencing algorithms, which are crucial in precision medicine based on high-throughput sequencing. Although recent works have attempted to provide such a resource, they still do not encompass all major types of variants including structural variants (SVs). Thus, we leveraged the massive high-quality Sanger sequences from the HuRef genome to construct by far the most comprehensive gold set of a single individual, which was cross validated with deep Illumina sequencing, population datasets, and well-established algorithms. It was a necessary effort to completely reanalyze the HuRef genome as its previously published variants were mostly reported five years ago, suffering from compatibility, organization, and accuracy issues that prevent their direct use in benchmarking. Our extensive analysis and validation resulted in a gold set with high specificity and sensitivity. In contrast to the current gold sets of the NA12878 or HS1011 genomes, our gold set is the first that includes small variants, deletion SVs and insertion SVs up to a hundred thousand base-pairs. We demonstrate the utility of our HuRef gold set to benchmark several published SV detection tools.

Journal Article•DOI•

Reads meet rotamers: structural biology in the age of deep sequencing

Anurag Sethi¹, Declan Clarke¹, Jieming Chen¹, Sushant Kumar¹, Timur R. Galeev¹, Lynne Regan¹, Mark Gerstein¹•Institutions (1)

Yale University¹

01 Dec 2015-Current Opinion in Structural Biology

TL;DR: Within structural biology there is now less emphasis on the discovery of novel folds and more on relating structures to networks of protein interactions; this review covers that changing mindset.

Journal Article•DOI•

The computer connection

Dov Greenbaum¹,², Mark Gerstein¹•Institutions (2)

Yale University¹, Interdisciplinary Center Herzliya²

27 Feb 2015-Science

TL;DR: In The Innovators: How a Group of Hackers, Geniuses, and Geeks Created the Digital Revolution, Walter Isaacson tells the story of the people who invented the computer and the Internet.

Abstract: In The Innovators: How a Group of Hackers, Geniuses, and Geeks Created the Digital Revolution, Walter Isaacson tells the story of the people who invented the computer and the Internet. From complementary duos like Grace Hopper and Howard Aiken, who developed the first computer that automatically executed long computations, to synergistic rivals like Larry Roberts and Bob Taylor, who worked together to create the global internet precursor, ARPANET, Isaacson argues that "innovation comes from teams more often than from the lightbulb moments of lone geniuses." But how do teams really work? And will "citizen science" change how we think of teamwork in the future?

Journal Article•DOI•

Illuminating the Genome’s Dark Matter

Dov Greenbaum¹,², Mark Gerstein¹•Institutions (2)

Yale University¹, Interdisciplinary Center Herzliya²

19 Nov 2015-Cell

TL;DR: John Parrington's book The Deeper Genome takes a closer look at the enigma of junk DNA, which, like the great expanses of dark matter in our universe, makes up the vast majority of the genome.

K-mer Analysis on Developmental and Housekeeping Enhancer Peaks

Yunsi Yang, Anurag Sethi, Mark Gerstein

01 Jan 2015

TL;DR: In this paper, the authors propose a method to solve the problem of self-diagnosis in cancer patients, and propose an approach to diagnose self-declarative cancer patients.
