Showing papers by "Mark Gerstein" published in 2015


Journal Article
Adam Auton, Gonçalo R. Abecasis, David Altshuler, Richard Durbin, +514 more; Institutions (90)
01 Oct 2015-Nature
TL;DR: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations, and has reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping.
Abstract: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

12,661 citations


01 Oct 2015
TL;DR: This paper reports completion of the 1000 Genomes Project, which set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations; the genomes of 2,504 individuals from 26 populations were reconstructed using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping.
Abstract: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

3,247 citations


Journal Article
Adam Abeshouse, Jaeil Ahn, Rehan Akbani, Adrian Ally, +308 more; Institutions (1)
05 Nov 2015-Cell
TL;DR: As part of The Cancer Genome Atlas (TCGA), a comprehensive molecular analysis of primary prostate carcinomas reveals substantial heterogeneity among these cancers, evident in the spectrum of molecular abnormalities and the variable clinical course of the disease.

2,109 citations


Journal Article
Peter H. Sudmant, Tobias Rausch, Eugene J. Gardner, Robert E. Handsaker, Alexej Abyzov, John Huddleston, Yan Zhang, Kai Ye, Goo Jun, Markus Hsi-Yang Fritz, Miriam K. Konkel, Ankit Malhotra, Adrian M. Stütz, Xinghua Shi, Francesco Paolo Casale, Jieming Chen, Fereydoun Hormozdiari, Gargi Dayama, Ken Chen, Maika Malig, Mark Chaisson, Klaudia Walter, Sascha Meiers, Seva Kashin, Erik Garrison, Adam Auton, Hugo Y. K. Lam, Xinmeng Jasmine Mu, Can Alkan, Danny Antaki, Taejeong Bae, Eliza Cerveira, Peter S. Chines, Zechen Chong, Laura Clarke, Elif Dal, Li Ding, S. Emery, Xian Fan, Madhusudan Gujral, Fatma Kahveci, Jeffrey M. Kidd, Yu Kong, Eric-Wubbo Lameijer, Shane A. McCarthy, Paul Flicek, Richard A. Gibbs, Gabor T. Marth, Christopher E. Mason, Androniki Menelaou, Donna M. Muzny, Bradley J. Nelson, Amina Noor, Nicholas F. Parrish, Matthew Pendleton, Andrew Quitadamo, Benjamin Raeder, Eric E. Schadt, Mallory Romanovitch, Andreas Schlattl, Robert Sebra, Andrey A. Shabalin, Andreas Untergasser, Jerilyn A. Walker, Min Wang, Fuli Yu, Chengsheng Zhang, Jing Zhang, Xiangqun Zheng-Bradley, Wanding Zhou, Thomas Zichner, Jonathan Sebat, Mark A. Batzer, Steven A. McCarroll, Ryan E. Mills, Mark Gerstein, Ali Bashir, Oliver Stegle, Scott E. Devine, Charles Lee, Evan E. Eichler, Jan O. Korbel
01 Oct 2015-Nature
TL;DR: In this paper, the authors describe an integrated set of eight structural variant classes comprising both balanced and unbalanced variants, which are constructed using short-read DNA sequencing data and statistically phased onto haplotype blocks in 26 human populations.
Abstract: Structural variants are implicated in numerous diseases and make up the majority of varying nucleotides among human genomes. Here we describe an integrated set of eight structural variant classes comprising both balanced and unbalanced variants, which we constructed using short-read DNA sequencing data and statistically phased onto haplotype blocks in 26 human populations. Analysing this set, we identify numerous gene-intersecting structural variants exhibiting population stratification and describe naturally occurring homozygous gene knockouts that suggest the dispensability of a variety of human genes. We demonstrate that structural variants are enriched on haplotypes identified by genome-wide association studies and exhibit enrichment for expression quantitative trait loci. Additionally, we uncover appreciable levels of structural variant complexity at different scales, including genic loci subject to clusters of repeated rearrangement and complex structural variants with multiple breakpoints likely to have formed through individual mutational events. Our catalogue will enhance future studies into structural variant demography, functional impact and disease association.

1,971 citations


01 Nov 2015
TL;DR: A comprehensive molecular analysis of 333 primary prostate carcinomas revealed a molecular taxonomy in which 74% of these tumors fell into one of seven subtypes defined by specific gene fusions (ERG, ETV1/4, and FLI1) or mutations (SPOP, FOXA1, and IDH1).
Abstract: There is substantial heterogeneity among primary prostate cancers, evident in the spectrum of molecular abnormalities and its variable clinical course. As part of The Cancer Genome Atlas (TCGA), we present a comprehensive molecular analysis of 333 primary prostate carcinomas. Our results revealed a molecular taxonomy in which 74% of these tumors fell into one of seven subtypes defined by specific gene fusions (ERG, ETV1/4, and FLI1) or mutations (SPOP, FOXA1, and IDH1). Epigenetic profiles showed substantial heterogeneity, including an IDH1 mutant subset with a methylator phenotype. Androgen receptor (AR) activity varied widely and in a subtype-specific manner, with SPOP and FOXA1 mutant tumors having the highest levels of AR-induced transcripts. 25% of the prostate cancers had a presumed actionable lesion in the PI3K or MAPK signaling pathways, and DNA repair genes were inactivated in 19%. Our analysis reveals molecular heterogeneity among primary prostate cancers, as well as potentially actionable molecular defects.

1,794 citations


Journal Article
16 Jul 2015-Cell
TL;DR: Three-dimensional neural cultures derived from induced pluripotent stem cells are used to investigate neurodevelopmental alterations in individuals with severe idiopathic ASD and show that overexpression of the transcription factor FOXG1 is responsible for the overproduction of GABAergic neurons.

843 citations


Journal Article
Schahram Akbarian, Chunyu Liu, James A. Knowles, Flora M. Vaccarino, Peggy J. Farnham, Gregory E. Crawford, Andrew E. Jaffe, Dalila Pinto, Stella Dracheva, Daniel H. Geschwind, Jonathan Mill, Angus C. Nairn, Alexej Abyzov, Sirisha Pochareddy, Shyam Prabhakar, Sherman M. Weissman, Patrick F. Sullivan, Matthew W. State, Zhiping Weng, Mette A. Peters, Kevin P. White, Mark Gerstein, Anahita Amiri, Chris Armoskus, Allison E. Ashley-Koch, Taejeong Bae, Andrea Beckel-Mitchener, Benjamin P. Berman, Gerhard A. Coetzee, Gianfilippo Coppola, Nancy Francoeur, Menachem Fromer, Robert Gao, Kay Grennan, Jennifer Herstein, David H. Kavanagh, Nikolay A. Ivanov, Yan Jiang, Robert R. Kitchen, Alexey Kozlenkov, Marija Kundakovic, Mingfeng Li, Zhen Li, Shuang Liu, Lara M. Mangravite, Eugenio Mattei, Eirene Markenscoff-Papadimitriou, Fabio C. P. Navarro, Nicole North, Larsson Omberg, David M. Panchision, Neelroop N. Parikshak, Jeremie Poschmann, Amanda J. Price, Michael J. Purcaro, Timothy E. Reddy, Panos Roussos, Shannon Schreiner, Soraya Scuderi, Robert Sebra, Mikihito Shibata, Annie W. Shieh, Mario Skarica, Wenjie Sun, Vivek Swarup, Amber Thomas, Junko Tsuji, Harm van Bakel, Daifeng Wang, Yongjun Wang, Kai Wang, Donna M. Werling, A. Jeremy Willsey, Heather Witt, Hyejung Won, Chloe C. Y. Wong, Gregory A. Wray, Emily Wu, Xuming Xu, Lijing Yao, Geetha Senthil, Thomas Lehner, Pamela Sklar, Nenad Sestan
TL;DR: The PsychENCODE project aims to produce a public resource of multidimensional genomic data using tissue- and cell type–specific samples from approximately 1,000 phenotypically well-characterized, high-quality healthy and disease-affected human post-mortem brains, as well as functionally characterize disease-associated regulatory elements and variants in model systems.
Abstract: Recent research on disparate psychiatric disorders has implicated rare variants in genes involved in global gene regulation and chromatin modification, as well as many common variants located primarily in regulatory regions of the genome. Understanding precisely how these variants contribute to disease will require a deeper appreciation for the mechanisms of gene regulation in the developing and adult human brain. The PsychENCODE project aims to produce a public resource of multidimensional genomic data using tissue- and cell type–specific samples from approximately 1,000 phenotypically well-characterized, high-quality healthy and disease-affected human post-mortem brains, as well as functionally characterize disease-associated regulatory elements and variants in model systems. We are beginning with a focus on autism spectrum disorder, bipolar disorder and schizophrenia, and expect that this knowledge will apply to a wide variety of psychiatric disorders. This paper outlines the motivation and design of PsychENCODE.

347 citations


Journal Article
TL;DR: It is demonstrated that methanethiosulfonate (MTS) reagents form disulfide bonds with 4-thiouridine (s4U) more efficiently than the commonly used HPDP-biotin, leading to higher yields and less biased enrichment.

171 citations


Journal Article
TL;DR: MetaSV is proposed, an integrated SV caller which leverages multiple orthogonal SV signals for high accuracy and resolution and analyzes soft-clipped reads from alignment to detect insertions accurately since existing tools underestimate insertion SVs.
Abstract: Summary: Structural variations (SVs) are large genomic rearrangements that vary significantly in size, making them challenging to detect with the relatively short reads from next-generation sequencing (NGS). Different SV detection methods have been developed; however, each is limited to specific kinds of SVs with varying accuracy and resolution. Previous works have attempted to combine different methods, but they still suffer from poor accuracy particularly for insertions. We propose MetaSV, an integrated SV caller which leverages multiple orthogonal SV signals for high accuracy and resolution. MetaSV proceeds by merging SVs from multiple tools for all types of SVs. It also analyzes soft-clipped reads from alignment to detect insertions accurately since existing tools underestimate insertion SVs. Local assembly in combination with dynamic programming is used to improve breakpoint resolution. Paired-end and coverage information is used to predict SV genotypes. Using simulation and experimental data, we demonstrate the effectiveness of MetaSV across various SV types and sizes. Availability and implementation: Code in Python is at http://bioinform.github.io/metasv/. Contact: rd@bina.com Supplementary information: Supplementary data are available at Bioinformatics online.
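
The merging step described in this abstract can be illustrated with a minimal sketch. This is not MetaSV's code: the `SVCall` record, the reciprocal-overlap criterion, the 0.5 threshold, and the tool names are all assumptions chosen for illustration. Zero-length calls such as insertions never overlap under this rule and would need a separate distance-based criterion.

```python
from dataclasses import dataclass, field


@dataclass
class SVCall:
    """Hypothetical record for one SV call; not MetaSV's data model."""
    chrom: str
    start: int
    end: int
    sv_type: str                      # e.g. "DEL", "DUP"
    tools: set = field(default_factory=set)


def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions (intersection / longer interval)."""
    inter = min(a.end, b.end) - max(a.start, b.start)
    if inter <= 0:
        return 0.0
    return inter / max(a.end - a.start, b.end - b.start)


def merge_calls(calls, min_overlap=0.5):
    """Greedily merge same-type calls on the same chromosome that overlap enough."""
    merged = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.sv_type, c.start)):
        last = merged[-1] if merged else None
        if (last is not None
                and last.chrom == call.chrom
                and last.sv_type == call.sv_type
                and reciprocal_overlap(last, call) >= min_overlap):
            # Extend the merged interval and remember every supporting tool.
            last.start = min(last.start, call.start)
            last.end = max(last.end, call.end)
            last.tools |= call.tools
        else:
            merged.append(SVCall(call.chrom, call.start, call.end,
                                 call.sv_type, set(call.tools)))
    return merged


calls = [
    SVCall("chr1", 1000, 2000, "DEL", {"toolA"}),
    SVCall("chr1", 1100, 2050, "DEL", {"toolB"}),
    SVCall("chr1", 5000, 5600, "DEL", {"toolA"}),
    SVCall("chr1", 1000, 2000, "DUP", {"toolC"}),
]
for sv in merge_calls(calls):
    print(sv.chrom, sv.start, sv.end, sv.sv_type, sorted(sv.tools))
```

Keeping the set of supporting tools on each merged call makes it easy to filter afterwards, for example by requiring support from at least two callers.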

122 citations


Journal Article
TL;DR: SomaticSeq is a somatic mutation detection pipeline that uses a stochastic boosting algorithm to produce highly accurate somatic mutation calls for both single nucleotide variants and small insertions and deletions, achieving better overall accuracy than any individual tool incorporated.
Abstract: SomaticSeq is an accurate somatic mutation detection pipeline implementing a stochastic boosting algorithm to produce highly accurate somatic mutation calls for both single nucleotide variants and small insertions and deletions. The workflow currently incorporates five state-of-the-art somatic mutation callers, and extracts over 70 individual genomic and sequencing features for each candidate site. A training set is provided to an adaptively boosted decision tree learner to create a classifier for predicting mutation statuses. We validate our results with both synthetic and real data. We report that SomaticSeq is able to achieve better overall accuracy than any individual tool incorporated.

95 citations


01 Jan 2015
TL;DR: MetaSV combines multiple orthogonal SV signals for high accuracy and resolution by merging SVs from multiple tools for all types of SVs and analyzing soft-clipped reads from alignment to detect insertions accurately.
Abstract: Summary: Structural variations (SVs) are large genomic rearrangements that vary significantly in size, making them challenging to detect with the relatively short reads from next-generation sequencing (NGS). Different SV detection methods have been developed; however, each is limited to specific kinds of SVs with varying accuracy and resolution. Previous works have attempted to combine different methods, but they still suffer from poor accuracy particularly for insertions. We propose MetaSV, an integrated SV caller which leverages multiple orthogonal SV signals for high accuracy and resolution. MetaSV proceeds by merging SVs from multiple tools for all types of SVs. It also analyzes soft-clipped reads from alignment to detect insertions accurately since existing tools underestimate insertion SVs. Local assembly in combination with dynamic programming is used to improve breakpoint resolution. Paired-end and coverage information is used to predict SV genotypes. Using simulation and experimental data, we demonstrate the effectiveness of MetaSV across various SV types and sizes. Availability and implementation: Code in Python is at http://bioinform.github.io/metasv/. Contact: rd@bina.com Supplementary information: Supplementary data are available at Bioinformatics online.

Journal Article
TL;DR: In this article, the authors characterize, by RNA sequencing, the transcriptional profiles of a large and heterogeneous collection of mouse tissues, augmenting the mouse transcriptome with thousands of novel transcript candidates.
Abstract: Mice have been a long-standing model for human biology and disease. Here we characterize, by RNA sequencing, the transcriptional profiles of a large and heterogeneous collection of mouse tissues, augmenting the mouse transcriptome with thousands of novel transcript candidates. Comparison with transcriptome profiles in human cell lines reveals substantial conservation of transcriptional programmes, and uncovers a distinct class of genes with levels of expression that have been constrained early in vertebrate evolution. This core set of genes captures a substantial fraction of the transcriptional output of mammalian cells, and participates in basic functional and structural housekeeping processes common to all cell types. Perturbation of these constrained genes is associated with significant phenotypes including embryonic lethality and cancer. Evolutionary constraint in gene expression levels is not reflected in the conservation of the genomic sequences, but is associated with conserved epigenetic marking, as well as with characteristic post-transcriptional regulatory programme, in which sub-cellular localization and alternative splicing play comparatively large roles.

Journal Article
TL;DR: This work identifies and analyses 8,943 deletion breakpoints in 1,092 samples from the 1000 Genomes Project, giving insight into the mechanisms that generate structural variants.
Abstract: Investigating genomic structural variants at basepair resolution is crucial for understanding their formation mechanisms. We identify and analyse 8,943 deletion breakpoints in 1,092 samples from the 1000 Genomes Project. We find breakpoints have more nearby SNPs and indels than the genomic average, likely a consequence of relaxed selection. By investigating the correlation of breakpoints with DNA methylation, Hi-C interactions, and histone marks and the substitution patterns of nucleotides near them, we find that breakpoints with the signature of non-allelic homologous recombination (NAHR) are associated with open chromatin. We hypothesize that some NAHR deletions occur without DNA replication and cell division, in embryonic and germline cells. In contrast, breakpoints associated with non-homologous (NH) mechanisms often have sequence microinsertions, templated from later replicating genomic sites, spaced at two characteristic distances from the breakpoint. These microinsertions are consistent with template-switching events and suggest a particular spatiotemporal configuration for DNA during the events.

Journal Article
TL;DR: LARVA is a new computational framework that integrates variants with a comprehensive set of noncoding functional elements, modeling the mutation counts of the elements with a β-binomial distribution to handle overdispersion, and highlights several novel highly mutated regulatory sites that could potentially be noncoding drivers.
Abstract: In cancer research, background models for mutation rates have been extensively calibrated in coding regions, leading to the identification of many driver genes, recurrently mutated more than expected. Noncoding regions are also associated with disease; however, background models for them have not been investigated in as much detail. This is partially due to limited noncoding functional annotation. Also, great mutation heterogeneity and potential correlations between neighboring sites give rise to substantial overdispersion in mutation count, resulting in problematic background rate estimation. Here, we address these issues with a new computational framework called LARVA. It integrates variants with a comprehensive set of noncoding functional elements, modeling the mutation counts of the elements with a β-binomial distribution to handle overdispersion. LARVA, moreover, uses regional genomic features such as replication timing to better estimate local mutation rates and mutational hotspots. We demonstrate LARVA's effectiveness on 760 whole-genome tumor sequences, showing that it identifies well-known noncoding drivers, such as mutations in the TERT promoter. Furthermore, LARVA highlights several novel highly mutated regulatory sites that could potentially be noncoding drivers. We make LARVA available as a software tool and release our highly mutated annotations as an online resource (larva.gersteinlab.org).
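
To see why a β-binomial background matters for overdispersed mutation counts, the following sketch compares upper-tail probabilities under a binomial and a β-binomial model with the same mean. It is an illustration only, not LARVA's implementation: the mean/overdispersion parameterization (`p`, `rho`) and the example numbers are assumptions.

```python
import math


def log_choose(n, k):
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def binom_pmf(k, n, p):
    return math.exp(log_choose(n, k) + k * math.log(p) + (n - k) * math.log(1 - p))


def betabinom_pmf(k, n, alpha, beta):
    # P(X = k) = C(n, k) * B(k + alpha, n - k + beta) / B(alpha, beta)
    return math.exp(log_choose(n, k)
                    + log_beta(k + alpha, n - k + beta)
                    - log_beta(alpha, beta))


def alpha_beta(p, rho):
    """Assumed parameterization: mean p, overdispersion rho in (0, 1)."""
    scale = (1 - rho) / rho
    return p * scale, (1 - p) * scale


def upper_tail(pmf, k, n):
    """P(X >= k): how surprising is a count of at least k mutations."""
    return sum(pmf(i) for i in range(k, n + 1))


# Hypothetical element: 760 trials, background rate 1%, 20 mutations observed.
n, p, k, rho = 760, 0.01, 20, 0.05
a, b = alpha_beta(p, rho)
p_binom = upper_tail(lambda i: binom_pmf(i, n, p), k, n)
p_betabinom = upper_tail(lambda i: betabinom_pmf(i, n, a, b), k, n)
print(f"binomial tail: {p_binom:.3g}  beta-binomial tail: {p_betabinom:.3g}")
```

With overdispersion, the β-binomial tail is heavier than the binomial one, so the same observed count is judged less extreme; a plain binomial background would tend to overcall highly mutated elements.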

Journal Article
TL;DR: A novel map data structure to validate read alignments, a strategy to compare variants binned in size ranges and a lightweight, interactive, graphical report to visualize validation results with detailed statistics make VarSim the most comprehensive validation tool for secondary analysis in next generation sequencing.
Abstract: Summary: VarSim is a framework for assessing alignment and variant calling accuracy in high-throughput genome sequencing through simulation or real data. In contrast to simulating a random mutation spectrum, it synthesizes diploid genomes with germline and somatic mutations based on a realistic model. This model leverages information such as previously reported mutations to make the synthetic genomes biologically relevant. VarSim simulates and validates a wide range of variants, including single nucleotide variants, small indels and large structural variants. It is an automated, comprehensive compute framework supporting parallel computation and multiple read simulators. Furthermore, we developed a novel map data structure to validate read alignments, a strategy to compare variants binned in size ranges and a lightweight, interactive, graphical report to visualize validation results with detailed statistics. Thus far, it is the most comprehensive validation tool for secondary analysis in next generation sequencing. Availability and implementation: Code in Java and Python along with instructions to download the reads and variants is at http://bioinform.github.io/varsim. Contact: rd@bina.com Supplementary information: Supplementary data are available at Bioinformatics online.

Journal Article
TL;DR: The strategy that is being implemented by the exRNA Data Management and Resource Repository is presented, which employs metadata, biomedical ontologies and Linked Data technologies, such as Resource Description Framework to integrate a diverse set of exRNA profiles into an exRNA Atlas and enable integrative exRNA analysis.
Abstract: The large diversity and volume of extracellular RNA (exRNA) data that will form the basis of the exRNA Atlas generated by the Extracellular RNA Communication Consortium pose a substantial data integration challenge. We here present the strategy that is being implemented by the exRNA Data Management and Resource Repository, which employs metadata, biomedical ontologies and Linked Data technologies, such as Resource Description Framework to integrate a diverse set of exRNA profiles into an exRNA Atlas and enable integrative exRNA analysis. We focus on the following three specific data integration tasks: (a) selection of samples from a virtual biorepository for exRNA profiling and for inclusion in the exRNA Atlas; (b) retrieval of a data slice from the exRNA Atlas for integrative analysis and (c) interpretation of exRNA analysis results in the context of pathways and networks. As exRNA profiling gains wide adoption in the research community, we anticipate that the strategies discussed here will increasingly be required to enable data reuse and to facilitate integrative analysis of exRNA data.

Journal Article
TL;DR: Experiments show that the proposed shallow high-order neural network (HONN) outperforms the popular pre-trained deep neural network on most tasks, which demonstrates the effectiveness of modelling high-order feature interactions for predicting major histocompatibility complex-peptide binding.
Abstract: Motivation: Effective computational methods for peptide-protein binding prediction can greatly help clinical peptide vaccine search and design. However, previous computational methods fail to capture key nonlinear high-order dependencies between different amino acid positions. As a result, they often produce low-quality rankings of strong binding peptides. To solve this problem, we propose nonlinear high-order machine learning methods including high-order neural networks (HONNs) with possible deep extensions and high-order kernel support vector machines to predict major histocompatibility complex-peptide binding. Results: The proposed high-order methods improve quality of binding predictions over other prediction methods. With the proposed methods, a significant gain of up to 25-40% is observed on the benchmark and reference peptide datasets and tasks. In addition, for the first time, our experiments show that pre-training with high-order semi-restricted Boltzmann machines significantly improves the performance of feed-forward HONNs. Moreover, our experiments show that the proposed shallow HONN outperforms the popular pre-trained deep neural network on most tasks, which demonstrates the effectiveness of modelling high-order feature interactions for predicting major histocompatibility complex-peptide binding. Availability and implementation: There is no associated distributable software. Contact: renqiang@nec-labs.com or mark.gerstein@yale.edu Supplementary information: Supplementary data are available at Bioinformatics online.

Journal Article
TL;DR: A score is defined to quantify the degree of hierarchy in a network and a simulated-annealing algorithm is developed to maximize the hierarchical score globally over a network.
Abstract: Many biological networks naturally form a hierarchy with a preponderance of downward information flow. In this study, we define a score to quantify the degree of hierarchy in a network and develop a simulated-annealing algorithm to maximize the hierarchical score globally over a network. We apply our algorithm to determine the hierarchical structure of the phosphorylome in detail and investigate the correlation between its hierarchy and kinase properties. We also compare it to the regulatory network, finding that the phosphorylome is more hierarchical than the regulome.
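
The idea of maximizing a hierarchy score by simulated annealing can be sketched as follows. The abstract does not give the score's definition, so this sketch assumes a simple stand-in (the fraction of edges that point from a higher level to a strictly lower one); the number of levels, cooling schedule, and toy networks are likewise assumptions rather than the paper's settings.

```python
import math
import random


def hierarchy_score(edges, level):
    """Assumed score: fraction of edges going from a higher to a strictly lower level."""
    down = sum(1 for u, v in edges if level[u] > level[v])
    return down / len(edges)


def anneal_levels(nodes, edges, n_levels=3, steps=5000,
                  t_start=1.0, t_end=0.01, seed=0):
    """Search for a level assignment that maximizes hierarchy_score."""
    rng = random.Random(seed)
    level = {n: rng.randrange(n_levels) for n in nodes}
    current = hierarchy_score(edges, level)
    best, best_level = current, dict(level)
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / steps)
        node = rng.choice(nodes)
        old = level[node]
        level[node] = rng.randrange(n_levels)
        proposed = hierarchy_score(edges, level)
        # Metropolis rule: always accept improvements, sometimes accept losses.
        if proposed >= current or rng.random() < math.exp((proposed - current) / temp):
            current = proposed
            if current > best:
                best, best_level = current, dict(level)
        else:
            level[node] = old
    return best, best_level


# Toy network: a kinase "a" acting on "b" and "c", which both act on "d".
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
score, levels = anneal_levels(nodes, edges)
print(score, levels)
```

A network containing a directed cycle cannot reach a score of 1 under this definition, which is one way such a score separates strongly hierarchical networks from less hierarchical ones.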

Journal Article
TL;DR: Loregic, a computational method integrating gene expression and regulatory network data, characterizes the cooperativity of regulatory factors; its gate logic is inter-related with other aspects of regulation, such as indirect binding via protein-protein interactions, feed-forward loop motifs and global regulatory hierarchy.
Abstract: The topology of the gene-regulatory network has been extensively analyzed. Now, given the large amount of available functional genomic data, it is possible to go beyond this and systematically study regulatory circuits in terms of logic elements. To this end, we present Loregic, a computational method integrating gene expression and regulatory network data, to characterize the cooperativity of regulatory factors. Loregic uses all 16 possible two-input-one-output logic gates (e.g. AND or XOR) to describe triplets of two factors regulating a common target. We attempt to find the gate that best matches each triplet’s observed gene expression pattern across many conditions. We make Loregic available as a general-purpose tool (github.com/gersteinlab/loregic). We validate it with known yeast transcription-factor knockout experiments. Next, using human ENCODE ChIP-Seq and TCGA RNA-Seq data, we are able to demonstrate how Loregic characterizes complex circuits involving both proximally and distally regulating transcription factors (TFs) and also miRNAs. Furthermore, we show that MYC, a well-known oncogenic driving TF, can be modeled as acting independently from other TFs (e.g., using OR gates) but antagonistically with repressing miRNAs. Finally, we inter-relate Loregic’s gate logic with other aspects of regulation, such as indirect binding via protein-protein interactions, feed-forward loop motifs and global regulatory hierarchy.
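
The gate-matching idea in this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not Loregic's code: expression is taken to be already binarized (0 = off, 1 = on), the score is simply the fraction of conditions a gate reproduces, and the example triplet is made up.

```python
from itertools import product

# Input rows in fixed order: (0, 0), (0, 1), (1, 0), (1, 1).
INPUTS = list(product((0, 1), repeat=2))

# A gate is a tuple of four outputs, one per input row: 2**4 = 16 gates.
GATES = list(product((0, 1), repeat=4))

# Names for a few familiar gates; the remaining ones are left unnamed here.
NAMES = {
    (0, 0, 0, 1): "AND",
    (0, 1, 1, 1): "OR",
    (0, 1, 1, 0): "XOR",
    (1, 1, 1, 0): "NAND",
    (1, 0, 0, 0): "NOR",
    (1, 0, 0, 1): "XNOR",
}


def gate_output(gate, a, b):
    """Look up the gate's output for inputs a and b (row index is 2*a + b)."""
    return gate[2 * a + b]


def best_gates(rf1, rf2, target):
    """Return the top agreement score and every gate that achieves it."""
    scores = []
    for gate in GATES:
        hits = sum(gate_output(gate, a, b) == t
                   for a, b, t in zip(rf1, rf2, target))
        scores.append((hits / len(target), gate))
    top = max(s for s, _ in scores)
    return top, [g for s, g in scores if s == top]


# Made-up binarized expression of two regulators and their target
# across six conditions; the target is on only when both regulators are on.
rf1 = [0, 0, 1, 1, 1, 0]
rf2 = [0, 1, 0, 1, 1, 1]
target = [0, 0, 0, 1, 1, 0]
score, gates = best_gates(rf1, rf2, target)
print(score, [NAMES.get(g, g) for g in gates])
```

When some input combinations never occur across the conditions, several gates tie for the best score, so returning all top-scoring gates avoids reporting an arbitrary one.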

Journal Article
TL;DR: This work leveraged the massive high-quality Sanger sequences from the HuRef genome to construct by far the most comprehensive gold set of a single individual, which was cross validated with deep Illumina sequencing, population datasets, and well-established algorithms.
Abstract: A high-confidence, comprehensive human variant set is critical in assessing accuracy of sequencing algorithms, which are crucial in precision medicine based on high-throughput sequencing. Although recent works have attempted to provide such a resource, they still do not encompass all major types of variants including structural variants (SVs). Thus, we leveraged the massive high-quality Sanger sequences from the HuRef genome to construct by far the most comprehensive gold set of a single individual, which was cross validated with deep Illumina sequencing, population datasets, and well-established algorithms. It was a necessary effort to completely reanalyze the HuRef genome as its previously published variants were mostly reported five years ago, suffering from compatibility, organization, and accuracy issues that prevent their direct use in benchmarking. Our extensive analysis and validation resulted in a gold set with high specificity and sensitivity. In contrast to the current gold sets of the NA12878 or HS1011 genomes, our gold set is the first that includes small variants, deletion SVs and insertion SVs up to a hundred thousand base-pairs. We demonstrate the utility of our HuRef gold set to benchmark several published SV detection tools.

Journal Article
TL;DR: Within structural biology there is now less emphasis on the discovery of novel folds and more on relating structures to networks of protein interactions; this article covers that changing mindset.

Journal Article
27 Feb 2015-Science
TL;DR: In The Innovators: How a Group of Hackers, Geniuses, and Geeks Created the Digital Revolution, Walter Isaacson tells the story of the people who invented the computer and the Internet.
Abstract: In The Innovators: How a Group of Hackers, Geniuses, and Geeks Created the Digital Revolution, Walter Isaacson tells the story of the people who invented the computer and the Internet. From complementary duos like Grace Hopper and Howard Aiken, who developed the first computer that automatically executed long computations, to synergistic rivals like Larry Roberts and Bob Taylor, who worked together to create the global internet precursor, ARPANET, Isaacson argues that "innovation comes from teams more often than from the lightbulb moments of lone geniuses." But how do teams really work? And will "citizen science" change how we think of teamwork in the future?

Journal Article
19 Nov 2015-Cell
TL;DR: John Parrington's book The Deeper Genome provides a closer look at the enigma of junk DNA. Akin to the great expanses of dark matter within our universe, junk DNA makes up the vast majority of the genome.

01 Jan 2015
TL;DR: In this paper, the authors propose a method to solve the problem of self-diagnosis in cancer patients, and propose an approach to diagnose self-declarative cancer patients.