
Showing papers in "bioRxiv in 2014"


Posted ContentDOI
17 Nov 2014-bioRxiv
TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.
Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-Seq data, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data. DESeq2 uses shrinkage estimation for dispersions and fold changes to improve stability and interpretability of the estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression and facilitates downstream tasks such as gene ranking and visualization. DESeq2 is available as an R/Bioconductor package.

17,014 citations


Posted ContentDOI
14 May 2014-bioRxiv
TL;DR: The qqman package enables the flexible creation of manhattan plots, both genome-wide and for single chromosomes, with optional highlighting of SNPs of interest.
Abstract: Summary: Genome-wide association studies (GWAS) have identified thousands of human trait-associated single nucleotide polymorphisms. Here, I describe a freely available R package for visualizing GWAS results using Q-Q and manhattan plots. The qqman package enables the flexible creation of manhattan plots, both genome-wide and for single chromosomes, with optional highlighting of SNPs of interest. Availability: qqman is released under the GNU General Public License, and is freely available on the Comprehensive R Archive Network (http://cran.r-project.org/package=qqman). The source code is available on GitHub (https://github.com/stephenturner/qqman).

713 citations
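
As a rough illustration of what a Q-Q plot of GWAS results shows, the sketch below computes the expected and observed -log10(p) pairs that such a plot draws. It is written in plain Python and does not use qqman (an R package). The function name `qq_points`, the plotting position `(rank - 0.5) / n`, and the example p-values are assumptions made for this sketch.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs for a Q-Q plot.

    Assumption: expected p-values use the plotting position (rank - 0.5) / n;
    other conventions exist and real packages may choose differently.
    """
    observed = sorted(p_values)  # smallest p-value first
    n = len(observed)
    points = []
    for rank, p in enumerate(observed, start=1):
        expected_p = (rank - 0.5) / n
        points.append((-math.log10(expected_p), -math.log10(p)))
    return points

# Hypothetical p-values, for illustration only.
pts = qq_points([0.5, 0.01, 0.2, 0.9])
```

Points lying well above the diagonal (observed larger than expected) are the ones a Q-Q plot is meant to draw attention to.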


Posted ContentDOI
19 Aug 2014-bioRxiv
TL;DR: This work presents HTSeq, a Python library to facilitate the rapid development of custom scripts for high-throughput sequencing data analysis, and htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes.
Abstract: Motivation: A large choice of tools exists for many standard tasks in the analysis of high-throughput sequencing (HTS) data. However, once a project deviates from standard work flows, custom scripts are needed. Results: We present HTSeq, a Python library to facilitate the rapid development of such scripts. HTSeq offers parsers for many common data formats in HTS projects, as well as classes to represent data such as genomic coordinates, sequences, sequencing reads, alignments, gene model information, variant calls, and provides data structures that allow for querying via genomic coordinates. We also present htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes. Availability: HTSeq is released as open-source software under the GNU General Public Licence and available from http://www-huber.embl.de/HTSeq or from the Python Package Index, https://pypi.python.org/pypi/HTSeq

433 citations
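
The idea of counting the overlap of reads with genes, as described in the HTSeq abstract above, can be sketched in plain Python. This sketch does not use the HTSeq API. The function `count_reads`, the half-open coordinates, the rule that a read is counted only when it overlaps exactly one gene, and the bucket names `__no_feature` and `__ambiguous` are all assumptions chosen for illustration.

```python
from collections import Counter

def count_reads(genes, reads):
    """Count reads per gene on a single chromosome.

    genes: dict mapping gene name -> (start, end), half-open intervals
    reads: iterable of (start, end), half-open intervals
    Simplified rule (an assumption of this sketch): a read is assigned to a
    gene only if it overlaps exactly one gene.
    """
    counts = Counter({name: 0 for name in genes})
    for r_start, r_end in reads:
        # Two half-open intervals overlap when each starts before the other ends.
        hits = [name for name, (g_start, g_end) in genes.items()
                if r_start < g_end and g_start < r_end]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return counts

# Hypothetical gene models and read alignments.
genes = {"geneA": (100, 200), "geneB": (180, 300)}
reads = [(110, 160), (190, 240), (250, 290), (400, 450)]
counts = count_reads(genes, reads)
```

The linear scan over all genes keeps the sketch short; a real tool would need an indexed structure that can be queried by genomic coordinates, which is the kind of data structure the abstract says HTSeq provides.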


Posted ContentDOI
19 Nov 2014-bioRxiv
TL;DR: FusionCatcher is a software tool for finding somatic fusion genes in paired-end RNA-sequencing data from human or other vertebrates that achieves competitive detection rates and real-time PCR validation rates in RNA-sequencing data from tumor cells.
Abstract: FusionCatcher is a software tool for finding somatic fusion genes in paired-end RNA-sequencing data from human or other vertebrates. FusionCatcher achieves competitive detection rates and real-time PCR validation rates in RNA-sequencing data from tumor cells. FusionCatcher is available at http://code.google.com/p/fusioncatcher

335 citations


Posted ContentDOI
28 Jun 2014-bioRxiv
TL;DR: A Genome-scale CRISPR Knock-Out (GeCKO) library is used to identify loss-of-function mutations conferring vemurafenib resistance in a melanoma model and reveal new mechanisms in diverse biological models.
Abstract: Genome-wide, targeted loss-of-function pooled screens using the CRISPR (clustered regularly interspaced short palindrome repeats)-associated nuclease Cas9 in human and mouse cells provide an alternative screening system to RNA interference (RNAi). Initial lentiviral delivery systems for CRISPR screening had low viral titer or required a cell line already expressing Cas9, limiting the range of biological systems amenable to screening. In this work, we present 1- and 2-vector lentiCRISPR systems capable of producing higher viral titers and, in these vectors, new human and mouse libraries for genome-scale CRISPR knock-out (GeCKO) screening.

309 citations


Posted ContentDOI
05 Mar 2014-bioRxiv
TL;DR: An efficient digital gene expression profiling protocol that enables surveying of mRNA in thousands of single cells at a time is developed and applied to profile 12,832 cells collected at multiple time points during directed adipogenic differentiation of human adipose-derived stem/stromal cells in vitro.
Abstract: Directed differentiation of cells in vitro is a powerful approach for dissection of developmental pathways, disease modeling and regenerative medicine, but analysis of such systems is complicated by heterogeneous and asynchronous cellular responses to differentiation-inducing stimuli. To enable deep characterization of heterogeneous cell populations, we developed an efficient digital gene expression profiling protocol that enables surveying of mRNA in thousands of single cells at a time. We then applied this protocol to profile 12,832 cells collected at multiple time points during directed adipogenic differentiation of human adipose-derived stem/stromal cells in vitro. The resulting data reveal the major axes of cell-to-cell variation within and between time points, and an inverse relationship between inflammatory gene expression and lipid accumulation across cells from a single donor.

251 citations


Posted ContentDOI
20 Dec 2014-bioRxiv
TL;DR: The development of an improved transcriptional regulator through the rational design of a tripartite activator, VP64-p65-Rta (VPR), fused to Cas9 is described and its utility in activating expression of endogenous coding and non-coding genes, targeting several genes simultaneously and stimulating neuronal differentiation of induced pluripotent stem cells (iPSCs).
Abstract: The RNA-guided bacterial nuclease Cas9 can be reengineered as a programmable transcription factor by a series of changes to the Cas9 protein in addition to the fusion of a transcriptional activation domain (AD). However, the modest levels of gene activation achieved by current Cas9 activators have limited their potential applications. Here we describe the development of an improved transcriptional regulator through the rational design of a tripartite activator, VP64-p65-Rta (VPR), fused to Cas9. We demonstrate its utility in activating expression of endogenous coding and non-coding genes, targeting several genes simultaneously and stimulating neuronal differentiation of induced pluripotent stem cells (iPSCs).

198 citations


Posted ContentDOI
21 May 2014-bioRxiv
TL;DR: Results from applying Multiple Sequentially Markovian Coalescent (MSMC) to genome sequences from nine populations across the world suggest that the genetic separation of non-African ancestors from African Yoruban ancestors started long before 50,000 years ago, and give information about human population history as recently as 2,000 years ago.
Abstract: The availability of complete human genome sequences from populations across the world has given rise to new population genetic inference methods that explicitly model their ancestral relationship under recombination and mutation. So far, application of these methods to evolutionary history more recent than 20-30 thousand years ago and to population separations has been limited. Here we present a new method that overcomes these shortcomings. The Multiple Sequentially Markovian Coalescent (MSMC) analyses the observed pattern of mutations in multiple individuals, focusing on the first coalescence between any two individuals. Results from applying MSMC to genome sequences from nine populations across the world suggest that the genetic separation of non-African ancestors from African Yoruban ancestors started long before 50,000 years ago, and give information about human population history as recently as 2,000 years ago, including the bottleneck in the peopling of the Americas, and separations within Africa, East Asia and Europe.

189 citations


Posted ContentDOI
30 Oct 2014-bioRxiv
TL;DR: These findings suggest complementary roles of arousal and locomotion in promoting functional flexibility in cortical circuits: arousal suppressed spontaneous firing, whereas increased firing in anticipation of and during movement was attributable to locomotion effects.
Abstract: Spontaneous and sensory-evoked cortical activity is highly state-dependent, yet relatively little is known about transitions between distinct waking states. Patterns of activity in mouse V1 differ dramatically between quiescence and locomotion, but this difference could be explained by either motor feedback or a change in arousal levels. We recorded single cells and local field potentials from area V1 in mice head-fixed on a running wheel and monitored pupil diameter to assay arousal. Using naturally occurring and induced state transitions, we dissociated arousal and locomotion effects in V1. Arousal suppressed spontaneous firing and strongly altered the temporal patterning of population activity. Moreover, heightened arousal increased the signal-to-noise ratio of visual responses and reduced noise correlations. In contrast, increased firing in anticipation of and during movement was attributable to locomotion effects. Our findings suggest complementary roles of arousal and locomotion in promoting functional flexibility in cortical circuits.

189 citations


Posted ContentDOI
14 Aug 2014-bioRxiv
TL;DR: The MinHash Alignment Process (MHAP) is introduced for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing and demonstrated that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.
Abstract: We report reference-grade de novo assemblies of four model organisms and the human genome from single-molecule, real-time (SMRT) sequencing. Long-read SMRT sequencing is routinely used to finish microbial genomes, but the available assembly methods have not scaled well to larger genomes. Here we introduce the MinHash Alignment Process (MHAP) for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing. Together with Celera Assembler, MHAP was used to reconstruct the genomes of Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and human from high-coverage SMRT sequencing. The resulting assemblies include fully resolved chromosome arms and close persistent gaps in these important reference genomes, including heterochromatic and telomeric transition sequences. For D. melanogaster, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars. These results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.

186 citations
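
To illustrate the general idea behind MinHash-based overlap detection mentioned in the MHAP abstract above, the sketch below estimates the Jaccard similarity of the k-mer sets of two sequences from small sketches. It is a generic MinHash in plain Python, not MHAP's implementation: the salted BLAKE2b hashing, the values `k=4` and `num_hashes=64`, and all function names are assumptions made for this example.

```python
import hashlib

def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_sketch(seq, k=4, num_hashes=64):
    """For each salted hash function, keep the minimum hash over all k-mers."""
    kmer_set = kmers(seq, k)
    sketch = []
    for salt in range(num_hashes):
        sketch.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{salt}:{km}".encode(), digest_size=8).digest(),
                "big",
            )
            for km in kmer_set
        ))
    return sketch

def estimated_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions where both minima agree."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)

# Hypothetical reads, for illustration only.
read_a = "ACGTACGTTGCAACGTAGCTAGCTAGGA"
read_b = "ACGTACGTTGCAACGTAGCTAGCTTGGA"
similarity = estimated_jaccard(minhash_sketch(read_a), minhash_sketch(read_b))
```

Comparing two fixed-size sketches costs the same regardless of read length, which is why sketching is attractive when many long reads must be compared against each other.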


Posted ContentDOI
08 Dec 2014-bioRxiv
TL;DR: DensiTree is extended to allow visualisation of meta-data associated with branches such as population size and evolutionary rates and various methods for positioning internal nodes are explored, which make it easier to comprehend distributions over trees.
Abstract: Motivation: Phylogenetic analyses like Bayesian MCMC or bootstrapping result in a collection of trees. Trees are discrete objects and it is generally difficult to get a mental grip on a distribution over trees. Visualisation tools like DensiTree can give good intuition on tree distributions. It works by drawing all trees in the set transparently, thus highlighting areas where the trees in the set agree. In this way, both uncertainty in clade heights and uncertainty in topology can be visualised. In our experience, a vanilla DensiTree can turn out to be misleading in that it shows too much uncertainty due to wrongly ordering taxa or due to unlucky placement of internal nodes. Results: DensiTree is extended to allow visualisation of meta-data associated with branches such as population size and evolutionary rates. Furthermore, geographic locations of taxa can be shown on a map, making it easy to visually check whether there is some geographic pattern in a phylogeny. Taxa orderings have a large impact on the layout of the tree set, and advances have been made in finding better orderings resulting in significantly more informative visualisations. We also explored various methods for positioning internal nodes, which can improve the quality of the image. Together, these advances make it easier to comprehend distributions over trees. Availability: DensiTree is freely available from http://compevol.auckland.ac.nz/software/.

Posted ContentDOI
28 Jul 2014-bioRxiv
TL;DR: Using >900 genomes from common pathogens, SRST2 is highly accurate and outperforms assembly-based methods in terms of both gene detection and allele assignment and represents a powerful tool for rapidly extracting clinically useful information from raw WGS data.
Abstract: Rapid molecular typing of bacterial pathogens is critical for public health epidemiology, surveillance and infection control, yet routine use of whole genome sequencing (WGS) for these purposes poses significant challenges. Here we present SRST2, a read mapping-based tool for fast and accurate detection of genes, alleles and multi-locus sequence types (MLST) from WGS data. Using >900 genomes from common pathogens, we show SRST2 is highly accurate and outperforms assembly-based methods in terms of both gene detection and allele assignment. Here we have demonstrated the use of SRST2 for microbial genome surveillance in a variety of public health and hospital settings. In the face of rising threats of antimicrobial resistance and emerging virulence amongst bacterial pathogens, SRST2 represents a powerful tool for rapidly extracting clinically useful information from raw WGS data. Source code is available from http://katholt.github.io/srst2/

Posted ContentDOI
17 Mar 2014-bioRxiv
TL;DR: The results indicate that CRISPR technology is more sensitive than RNAi, and that both techniques have nontrivial false discovery rates that can be mitigated by rigorous analytical methods.
Abstract: Technological advancement has opened the door to systematic genetics in mammalian cells. Genome-scale loss-of-function screens can assay fitness defects induced by partial gene knockdown, using RNA interference, or complete gene knockout, using new CRISPR techniques. These screens can reveal the basic blueprint required for cellular proliferation. Moreover, comparing healthy to cancerous tissue can uncover genes that are essential only in the tumor; these genes are targets for the development of specific anticancer therapies. Unfortunately, progress in this field has been hampered by off-target effects of perturbation reagents and poorly quantified error rates in large-scale screens. To improve the quality of information derived from these screens, and to provide a framework for understanding the capabilities and limitations of CRISPR technology, we derive gold-standard reference sets of essential and nonessential genes, and provide a Bayesian classifier of gene essentiality that outperforms current methods on both RNAi and CRISPR screens. Our results indicate that CRISPR technology is more sensitive than RNAi, and that both techniques have nontrivial false discovery rates that can be mitigated by rigorous analytical methods.

Posted ContentDOI
02 Jul 2014-bioRxiv
TL;DR: Bonsai, a modular, high-performance, open-source visual programming framework for the acquisition and online processing of data streams, is described, and it is demonstrated how it allows for flexible and rapid prototyping of integrated experimental designs in neuroscience.
Abstract: The design of modern scientific experiments requires the control and monitoring of many parallel data streams. However, the serial execution of programming instructions in a computer makes it a challenge to develop software that can deal with the asynchronous, parallel nature of scientific data. Here we present Bonsai, a modular, high-performance, open-source visual programming framework for the acquisition and online processing of data streams. We describe Bonsai's core principles and architecture and demonstrate how it allows for flexible and rapid prototyping of integrated experimental designs in neuroscience. We specifically highlight different possible applications which require the combination of many different hardware and software components, including behaviour video tracking, electrophysiology and closed-loop control of stimulation parameters.

Posted ContentDOI
Iosif Lazaridis1, Nick Patterson2, Alissa Mittnik3, Gabriel Renaud4, Swapan Mallick1, Karola Kirsanow5, Peter H. Sudmant6, Joshua G. Schraiber7, Sergi Castellano4, Mark Lipson8, Bonnie Berger8, Christos Economou9, Ruth Bollongino5, Qiaomei Fu4, Kirsten I. Bos3, Susanne Nordenfelt1, Heng Li2, Cesare de Filippo4, Kay Prüfer4, Susanna Sawyer4, Cosimo Posth3, Wolfgang Haak10, Fredrik Hallgren11, Elin Fornander11, Nadin Rohland1, Dominique Delsate12, Michael Francken3, Jean-Michel Guinet13, Joachim Wahl, George Ayodo, Hamza A. Babiker14, Graciela Bailliet15, Elena Balanovska, Oleg Balanovsky, Ramiro Barrantes16, Gabriel Bedoya17, Haim Ben-Ami18, Judit Bene19, Fouad Berrada20, Claudio M. Bravi15, Francesca Brisighelli21, George B.J. Busby22, Francesco Calì, Mikhail Churnosov23, David E. C. Cole24, Daniel Corach25, Larissa Damba26, George van Driem27, Stanislav Dryomov26, Jean-Michel Dugoujon28, Sardana A. Fedorova29, Irene Gallego Romero30, Marina Gubina31, Michael F. Hammer32, Brenna M. Henn33, Tor Hervig34, Ugur Hodoglugil35, Aashish R. Jha30, Sena Karachanak-Yankova36, Rita Khusainova31, Elza Khusnutdinova31, Rick A. Kittles37, Toomas Kivisild38, William Klitz7, Vaidutis Kučinskas39, Alena Kushniarevich40, Leila Laredj41, Sergey Litvinov31, Theologos Loukidis42, Robert W. Mahley43, Béla Melegh19, Ene Metspalu44, Julio Molina, Joanna L. Mountain, Klemetti Näkkäläjärvi45, Desislava Nesheva36, Thomas B. Nyambo46, Ludmila P. Osipova31, Jüri Parik44, Fedor Platonov29, Olga L. Posukh31, Valentino Romano47, Francisco Rothhammer48, Igor Rudan14, Ruslan Ruizbakiev49, Hovhannes Sahakyan40, Antti Sajantila50, Antonio Salas51, Elena B. Starikovskaya31, Ayele Tarekegn, Draga Toncheva36, Shahlo Turdikulova49, Ingrida Uktveryte39, Olga Utevska52, René Vasquez53, Mercedes Villena53, Mikhail Voevoda31, Cheryl A. Winkler54, Levon Yepiskoposyan55, Pierre Zalloua56, Tatijana Zemunik57, Alan Cooper10, Cristian Capelli22, Mark G. Thomas58, Andres Ruiz-Linares58, Sarah A. Tishkoff59, Lalji Singh60, Kumarasamy Thangaraj60, Richard Villems40, David Comas61, Rem I. Sukernik31, Mait Metspalu40, Matthias Meyer4, Evan E. Eichler6, Joachim Burger5, Montgomery Slatkin7, Svante Pääbo4, Janet Kelso4, David Reich1, Johannes Krause3
Harvard University1, Broad Institute2, University of Tübingen3, Max Planck Society4, University of Mainz5, University of Washington6, University of California, Berkeley7, Massachusetts Institute of Technology8, Stockholm University9, University of Adelaide10, The Heritage Foundation11, National Museum of Natural History12, American Museum of Natural History13, University of Edinburgh14, National Scientific and Technical Research Council15, University of Costa Rica16, University of Antioquia17, Rambam Health Care Campus18, University of Pécs19, Al Akhawayn University20, Catholic University of the Sacred Heart21, University of Oxford22, Belgorod State University23, University of Toronto24, University of Buenos Aires25, Russian Academy26, University of Bern27, Paul Sabatier University28, North-Eastern Federal University29, University of Chicago30, Russian Academy of Sciences31, University of Arizona32, Stony Brook University33, University of Bergen34, Illumina35, Sofia Medical University36, University of Illinois at Chicago37, University of Cambridge38, Vilnius University39, Estonian Biocentre40, University of Strasbourg41, Amgen42, Gladstone Institutes43, University of Tartu44, University of Oulu45, Muhimbili University of Health and Allied Sciences46, University of Palermo47, University of Tarapacá48, Academy of Sciences of Uzbekistan49, University of Helsinki50, University of Santiago de Compostela51, University of Kharkiv52, Higher University of San Andrés53, Leidos54, Armenian National Academy of Sciences55, Lebanese American University56, University of Split57, University College London58, University of Pennsylvania59, Centre for Cellular and Molecular Biology60, Pompeu Fabra University61
02 Apr 2014-bioRxiv
TL;DR: It is shown that the great majority of present-day Europeans derive from at least three highly differentiated populations: West European Hunter-Gatherers (WHG), who contributed ancestry to all Europeans but not to Near Easterners; Ancient North Eurasians (ANE); and Early European Farmers (EEF), who were mainly of Near Eastern origin but also harbored WHG-related ancestry.
Abstract: We sequenced genomes from a ~7,000 year old early farmer from Stuttgart in Germany, an ~8,000 year old hunter-gatherer from Luxembourg, and seven ~8,000 year old hunter-gatherers from southern Sweden. We analyzed these data together with other ancient genomes and 2,345 contemporary humans to show that the great majority of present-day Europeans derive from at least three highly differentiated populations: West European Hunter-Gatherers (WHG), who contributed ancestry to all Europeans but not to Near Easterners; Ancient North Eurasians (ANE), who were most closely related to Upper Paleolithic Siberians and contributed to both Europeans and Near Easterners; and Early European Farmers (EEF), who were mainly of Near Eastern origin but also harbored WHG-related ancestry. We model these populations' deep relationships and show that EEF had ~44% ancestry from a "Basal Eurasian" lineage that split prior to the diversification of all other non-African lineages.

Posted ContentDOI
17 Jul 2014-bioRxiv
TL;DR: The potential for RNA-guided gene drives based on the CRISPR nuclease Cas9 to serve as a general method for spreading altered traits through wild populations over many generations is considered.
Abstract: Gene drives may be capable of addressing ecological problems by altering entire populations of wild organisms, but their use has remained largely theoretical due to technical constraints. Here we consider the potential for RNA-guided gene drives based on the CRISPR nuclease Cas9 to serve as a general method for spreading altered traits through wild populations over many generations. We detail likely capabilities, discuss limitations, and provide novel precautionary strategies to control the spread of gene drives and reverse genomic changes. The ability to edit populations of sexual species would offer substantial benefits to humanity and the environment. For example, RNA-guided gene drives could potentially prevent the spread of disease, support agriculture by reversing pesticide and herbicide resistance in insects and weeds, and control damaging invasive species. However, the possibility of unwanted ecological effects and near-certainty of spread across political borders demand careful assessment of each potential application. We call for thoughtful, inclusive, and well-informed public discussions to explore the responsible use of this currently theoretical technology.

Posted ContentDOI
26 Jun 2014-bioRxiv
TL;DR: An update to the sva approach is described that can be applied to analyze count data or FPKMs from sequencing experiments, along with the addition of supervised sva (ssva), which uses control probes to identify the part of the genomic data only affected by artifacts.
Abstract: It is now well known that unwanted noise and unmodeled artifacts such as batch effects can dramatically reduce the accuracy of statistical inference in genomic experiments. We introduced surrogate variable analysis for estimating these artifacts by (1) identifying the part of the genomic data only affected by artifacts and (2) estimating the artifacts with principal components or singular vectors of the subset of the data matrix. The resulting estimates of artifacts can be used in subsequent analyses as adjustment factors. Here I describe an update to the sva approach that can be applied to analyze count data or FPKMs from sequencing experiments. I also describe the addition of supervised sva (ssva) for using control probes to identify the part of the genomic data only affected by artifacts. These updates are available through the surrogate variable analysis (sva) Bioconductor package.

Posted ContentDOI
26 Nov 2014-bioRxiv
TL;DR: This work presents a method to quantify and visualize variation in effective migration across the habitat, which can be used to identify potential barriers to gene flow, from geographically indexed large-scale genetic data.
Abstract: Genetic data often exhibit patterns that are broadly consistent with "isolation by distance" - a phenomenon where genetic similarity tends to decay with geographic distance. In a heterogeneous habitat, decay may occur more quickly in some regions than others: for example, barriers to gene flow can accelerate the genetic differentiation between groups located close in space. We use the concept of "effective migration" to model the relationship between genetics and geography: in this paradigm, effective migration is low in regions where genetic similarity decays quickly. We present a method to quantify and visualize variation in effective migration across the habitat, which can be used to identify potential barriers to gene flow, from geographically indexed large-scale genetic data. Our approach uses a population genetic model to relate underlying migration rates to expected pairwise genetic dissimilarities, and estimates migration rates by matching these expectations to the observed dissimilarities. We illustrate the potential and limitations of our method using simulations and data from elephant, human, and Arabidopsis thaliana populations. The resulting visualizations highlight important features of the spatial population structure that are difficult to discern using existing methods for summarizing genetic variation such as principal components analysis.

Posted ContentDOI
18 Jun 2014-bioRxiv
TL;DR: A new data-driven model using support vector regression is developed that can accurately predict assembly performance, and a novel hybrid error correction algorithm for long PacBio reads is shown to achieve near-perfect assemblies of small genomes and substantially improved assemblies of larger ones.
Abstract: Third generation single molecule sequencing technology is poised to revolutionize genomics by enabling the sequencing of long, individual molecules of DNA and RNA. These technologies now routinely produce reads exceeding 5,000 basepairs, and can achieve reads as long as 50,000 basepairs. Here we evaluate the limits of single molecule sequencing by assessing the impact of long read sequencing in the assembly of the human genome and 25 other important genomes across the tree of life. From this, we develop a new data-driven model using support vector regression that can accurately predict assembly performance. We also present a novel hybrid error correction algorithm for long PacBio sequencing reads that uses pre-assembled Illumina sequences for the error correction. We apply it to several prokaryotic and eukaryotic genomes, and show it can achieve near-perfect assemblies of small genomes (< 100 Mbp) and substantially improved assemblies of larger ones. All source code and the assembly model are available open-source.

Posted ContentDOI
09 Sep 2014-bioRxiv
TL;DR: The MDTraj package is a modern, lightweight and efficient software package for analyzing molecular dynamics simulations, bridging the gap between molecular dynamics data and the rapidly-growing collection of industry-standard statistical analysis and visualization tools in Python.
Abstract: Summary: MDTraj is a modern, lightweight and efficient software package for analyzing molecular dynamics simulations. MDTraj reads trajectory data from a wide variety of commonly used formats. It provides a large number of trajectory analysis capabilities including RMSD, DSSP secondary structure assignment and the extraction of common order parameters. The package has a strong focus on interoperability with the wider scientific Python ecosystem, bridging the gap between molecular dynamics data and the rapidly-growing collection of industry-standard statistical analysis and visualization tools in Python. Availability: Package downloads, detailed examples and full documentation are available at http://mdtraj.org. The source code is distributed under the GNU Lesser General Public License at https://github.com/simtk/mdtraj.

Posted ContentDOI
02 Mar 2014-bioRxiv
TL;DR: An experimental determination of a parameter-free evolutionary model via mutagenesis, functional selection, and deep sequencing is demonstrated: an evolutionary model for influenza nucleoprotein that describes the gene phylogeny far better than existing models with dozens or even hundreds of free parameters.
Abstract: All modern approaches to molecular phylogenetics require a quantitative model for how genes evolve. Unfortunately, existing evolutionary models do not realistically represent the site-heterogeneous selection that governs actual sequence change. Attempts to remedy this problem have involved augmenting these models with a burgeoning number of free parameters. Here I demonstrate an alternative: experimental determination of a parameter-free evolutionary model via mutagenesis, functional selection, and deep sequencing. Using this strategy, I create an evolutionary model for influenza nucleoprotein that describes the gene phylogeny far better than existing models with dozens or even hundreds of free parameters. Emerging high-throughput experimental strategies such as the one employed here provide fundamentally new information that has the potential to transform the sensitivity of phylogenetic and genetic analyses.

Posted ContentDOI
18 Oct 2014-bioRxiv
TL;DR: Ancestry Composition is described, a modular three-stage pipeline that efficiently and accurately identifies the ancestral origin of chromosomal segments in admixed individuals and achieves high precision and recall for labeling chromosomal segments across over 25 different populations worldwide.
Abstract: Ancestry deconvolution, the task of identifying the ancestral origin of chromosomal segments in admixed individuals, has important implications, from mapping disease genes to identifying candidate loci under natural selection. To date, however, most existing methods for ancestry deconvolution are typically limited to two or three ancestral populations, and cannot resolve contributions from populations related at a sub-continental scale. We describe Ancestry Composition, a modular three-stage pipeline that efficiently and accurately identifies the ancestral origin of chromosomal segments in admixed individuals. It assumes the genotype data have been phased. In the first stage, a support vector machine classifier assigns tentative ancestry labels to short local phased genomic regions. In the second stage, an autoregressive pair hidden Markov model simultaneously corrects phasing errors and produces reconciled local ancestry estimates and confidence scores based on the tentative ancestry labels. In the third stage, confidence estimates are recalibrated using isotonic regression. We compiled a reference panel of almost 10,000 individuals of homogeneous ancestry, derived from a combination of several publicly available datasets and over 8,000 individuals reporting four grandparents with the same country-of-origin from the member database of the personal genetics company, 23andMe, Inc., and excluding outliers identified through principal components analysis (PCA). In cross-validation experiments, Ancestry Composition achieves high precision and recall for labeling chromosomal segments across over 25 different populations worldwide.
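
The third stage above recalibrates confidence estimates with isotonic regression. The abstract does not say how this is implemented, so the following is only a generic sketch of the standard pool-adjacent-violators algorithm in plain Python. The function name and the input, a hypothetical list of observed accuracies for bins ordered by increasing raw confidence, are assumptions.

```python
def isotonic_regression(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit to `values`."""
    blocks = []  # each block stores [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted

# Hypothetical observed accuracy per bin, bins sorted by raw confidence score.
observed_accuracy = [0.5, 0.75, 0.25, 1.0]
calibrated = isotonic_regression(observed_accuracy)
# calibrated == [0.5, 0.5, 0.5, 1.0]
```

The fit forces calibrated confidence to be non-decreasing in the raw score, so a segment with a higher raw score is never assigned a lower calibrated confidence.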

Posted ContentDOI
13 Jan 2014-bioRxiv
TL;DR: Experimental evolution in Saccharomyces cerevisiae is used to quantify the effect of epistasis, finding dramatic differences in adaptability between 64 closely related genotypes, suggesting that many beneficial mutations affecting a variety of biological processes are globally coupled.
Abstract: Epistasis can make adaptation highly unpredictable, rendering evolutionary trajectories contingent on the chance effects of initial mutations. We used experimental evolution in Saccharomyces cerevisiae to quantify this effect, finding dramatic differences in adaptability between 64 closely related genotypes. Despite these differences, sequencing of 105 evolved clones showed no significant effect of initial genotype on future sequence-level evolution. Instead, reconstruction experiments revealed a consistent pattern of diminishing returns epistasis. Our results suggest that many beneficial mutations affecting a variety of biological processes are globally coupled: they interact strongly, but only through their combined effect on fitness. Sequence-level adaptation is thus highly stochastic. Nevertheless, fitness evolution is strikingly predictable because differences in adaptability are determined only by global fitness-mediated epistasis, not by the identity of individual mutations.

Posted ContentDOI
04 Jan 2014-bioRxiv
TL;DR: A recently introduced dynamic programming algorithm for estimating species trees, which bypasses MCMC integration over gene trees, is combined with sophisticated methods for estimating marginal likelihoods, needed for Bayesian model selection, to provide a rigorous and computationally tractable technique for genome-wide species delimitation.
Abstract: The multi-species coalescent has provided important progress for evolutionary inferences, including increasing the statistical rigor and objectivity of comparisons among competing species delimitation models. However, Bayesian species delimitation methods typically require brute force integration over gene trees via Markov chain Monte Carlo (MCMC), which introduces a large computation burden and precludes their application to genomic-scale data. Here we combine a recently introduced dynamic programming algorithm for estimating species trees that bypasses MCMC integration over gene trees with sophisticated methods for estimating marginal likelihoods, needed for Bayesian model selection, to provide a rigorous and computationally tractable technique for genome-wide species delimitation. We provide a critical yet simple correction that brings the likelihoods of different species trees, and more importantly their corresponding marginal likelihoods, to the same common denominator, which enables direct and accurate comparisons of competing species delimitation models using Bayes factors. We test this approach, which we call Bayes factor delimitation (*with genomic data; BFD*), using common species delimitation scenarios with computer simulations. Varying the number of loci and the number of samples suggests that the approach can distinguish the true model even with few loci and limited samples per species. Misspecification of the prior for population size θ has little impact on support for the true model. We apply the approach to West African forest geckos (Hemidactylus fasciatus complex) using genome-wide SNP data. This new Bayesian method for species delimitation builds on a growing trend for objective species delimitation methods with explicit model assumptions that are easily tested.
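
Comparing delimitation models with Bayes factors reduces to arithmetic on log marginal likelihoods once those have been estimated. The sketch below shows only that final step, under assumptions: the model names and log marginal likelihood values are invented for illustration, the helper names are hypothetical, and the sign convention (positive values favor the alternative over the baseline) is a choice made here rather than something stated in the abstract.

```python
def two_ln_bayes_factor(log_ml_baseline, log_ml_alternative):
    """Return 2 * ln(BF) for the alternative model relative to the baseline.

    Both arguments are natural-log marginal likelihoods. Positive results
    favor the alternative under the convention assumed in this sketch.
    """
    return 2.0 * (log_ml_alternative - log_ml_baseline)


def rank_models(log_mls, baseline):
    """Rank models by log marginal likelihood, best first.

    log_mls  -- dict mapping a model name to its log marginal likelihood
    baseline -- name of the model every other model is compared against
    Returns a list of (model name, 2 ln BF versus baseline) pairs.
    """
    base = log_mls[baseline]
    ordered = sorted(log_mls.items(), key=lambda item: item[1], reverse=True)
    return [(name, two_ln_bayes_factor(base, value)) for name, value in ordered]


# Illustrative values only; these are not results from the paper.
example_log_mls = {
    "one_species": -1050.0,
    "two_species": -1020.0,
    "three_species": -1025.5,
}
ranking = rank_models(example_log_mls, baseline="one_species")
```

With these made-up numbers the two-species model ranks first. The comparison is only meaningful when the marginal likelihoods of all models are on a common scale, which is the issue the correction described in the abstract addresses.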

Posted ContentDOI
08 Dec 2014-bioRxiv
TL;DR: This comprehensive tree will fuel fundamental research on the nature of biological diversity, ultimately providing up-to-date phylogenies for downstream applications in comparative biology, ecology, conservation biology, climate change, agriculture, and genomics.
Abstract: Reconstructing the phylogenetic relationships that unite all biological lineages (the tree of life) is a grand challenge of biology. However, the paucity of readily available homologous character data across disparately related lineages renders direct phylogenetic inference currently untenable. Our best recourse towards realizing the tree of life is therefore the synthesis of existing collective phylogenetic knowledge available from the wealth of published primary phylogenetic hypotheses, together with taxonomic hierarchy information for unsampled taxa. We combined phylogenetic and taxonomic data to produce a draft tree of life, the Open Tree of Life, containing 2.3 million tips. Realization of this draft tree required the assembly of two resources that should prove valuable to the community: 1) a novel comprehensive global reference taxonomy, and 2) a database of published phylogenetic trees mapped to this common taxonomy. Our open source framework facilitates community comment and contribution, enabling a continuously updatable tree when new phylogenetic and taxonomic data become digitally available. While data coverage and phylogenetic conflict across the Open Tree of Life illuminate significant gaps in both the underlying data available for phylogenetic reconstruction and the publication of trees as digital objects, the tree provides a compelling starting point from which we can continue to improve through community contributions. Having a comprehensive tree of life will fuel fundamental research on the nature of biological diversity, ultimately providing up-to-date phylogenies for downstream applications in comparative biology, ecology, conservation biology, climate change studies, agriculture, and genomics.

Posted ContentDOI
28 Jun 2014-bioRxiv
TL;DR: It is found that U2AF1 mutations influence the similarity of splicing programs in leukemias, but do not give rise to widespread splicing failure.
Abstract: Whole-exome sequencing studies have identified common mutations affecting genes encoding components of the RNA splicing machinery in hematological malignancies. Here, we sought to determine how mutations affecting the 3' splice site recognition factor U2AF1 alter its normal role in RNA splicing. We find that U2AF1 mutations influence the similarity of splicing programs in leukemias, but do not give rise to widespread splicing failure. U2AF1 mutations cause differential splicing of hundreds of genes, affecting biological pathways implicated in myeloid disease such as DNA methylation (DNMT3B), X chromosome inactivation (H2AFY), the DNA damage response (ATR, FANCA), and apoptosis (CASP8). We show that U2AF1 mutations alter the preferred 3' splice site motif in patients, in cell culture, and in vitro. Mutations affecting the first and second zinc fingers give rise to different alterations in splice site preference and largely distinct downstream splicing programs. These allele-specific effects are consistent with a computationally predicted model of U2AF1 in complex with RNA. Our findings suggest that U2AF1 mutations contribute to pathogenesis by causing quantitative changes in splicing that affect diverse cellular pathways, and give insight into the normal function of U2AF1's zinc finger domains.

Posted ContentDOI
12 May 2014-bioRxiv
TL;DR: The CasFinder system is demonstrated by generating human and mouse exome-wide catalogs of specific sites for three varieties of Cas9 – S. pyogenes, S. thermophilus (ST1), and N. meningitidis – that each target 56-74% of all exons.
Abstract: CRISPR/Cas9 systems enable many molecular activities to be efficiently directed in vivo to user-specifiable DNA sequences of interest, including generation of dsDNA cuts and nicks, transcriptional activation and repression, and fluorescence. CRISPR targeting relies on base pairing of short RNA transcripts with their target DNA sequences that must also be adjacent to fixed DNA motifs. However, rules for Cas9 targeting specificity are incompletely known. With increasing numbers of Cas9 systems being developed and deployed in more and more organisms, there is now a strong need for a flexible and rational method for finding Cas9 sites with low off-targeting potential. We address this through the CasFinder system, which we demonstrate by generating human and mouse exome-wide catalogs of specific sites for three varieties of Cas9 - S. pyogenes, S. thermophilus (ST1), and N. meningitidis - that each target 56-74% of all exons. We also generate reduced sets of up to 3 targets per gene for use in high-throughput Cas9-based gene knockout screens that target 75-80% of all genes.
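
The requirement that a target sequence sit next to a fixed DNA motif can be made concrete with a small scan for candidate sites. The sketch below is not CasFinder and does not score off-target potential. It assumes the NGG motif commonly cited for S. pyogenes Cas9 and a 20-nucleotide target, neither of which is stated in the abstract, and it searches the forward strand only. The function name `find_ngg_sites` is hypothetical.

```python
import re


def find_ngg_sites(seq, protospacer_len=20):
    """List forward-strand candidate sites followed directly by an NGG motif.

    seq             -- DNA sequence as a string of A, C, G and T
    protospacer_len -- target length; 20 is an assumption of this sketch
    Returns a list of (start index, target sequence, motif) tuples.
    """
    seq = seq.upper()
    # A zero-width lookahead lets overlapping candidate sites all be reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % protospacer_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]


# Toy sequence: twenty A bases, then TGG, then two trailing bases.
sites = find_ngg_sites("A" * 20 + "TGG" + "CC")
```

A fuller search would also scan the reverse complement and then filter candidates by similarity to other genomic locations, which is the specificity problem the abstract describes.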

Posted ContentDOI
15 Apr 2014-bioRxiv
TL;DR: This work has successfully applied READemption to analyze RNA-Seq reads from other sample types, including whole transcriptomes, RNA immunoprecipitated with proteins, not only from bacteria, but also from eukaryotes and archaea.
Abstract: Summary: RNA-Seq has become a potent and widely used method to qualitatively and quantitatively study transcriptomes. In order to draw biological conclusions based on RNA-Seq data, several steps, some of which are computationally intensive, have to be taken. Our READemption pipeline takes care of these individual tasks and integrates them into an easy-to-use tool with a command line interface. To leverage the full power of modern computers, most subcommands of READemption offer parallel data processing. While READemption was mainly developed for the analysis of bacterial primary transcriptomes, we have successfully applied it to analyze RNA-Seq reads from other sample types, including whole transcriptomes, RNA immunoprecipitated with proteins, not only from bacteria, but also from eukaryotes and archaea. Availability and Implementation: READemption is implemented in Python and is published under the ISC open source license. The tool and documentation are hosted at http://pythonhosted.org/READemption (DOI:10.6084/m9.figshare.977849).

Posted ContentDOI
16 Jul 2014-bioRxiv
TL;DR: It is demonstrated that contaminating DNA is ubiquitous in commonly used DNA extraction kits, varies greatly in composition between different kits and kit batches, and that this contamination critically impacts results obtained from samples containing a low microbial biomass.
Abstract: The study of microbial communities has been revolutionised in recent years by the widespread adoption of culture independent analytical techniques such as 16S rRNA gene sequencing and metagenomics. One potential confounder of these sequence-based approaches is the presence of contamination in DNA extraction kits and other laboratory reagents. In this study we demonstrate that contaminating DNA is ubiquitous in commonly used DNA extraction kits, varies greatly in composition between different kits and kit batches, and that this contamination critically impacts results obtained from samples containing a low microbial biomass. Contamination impacts both PCR-based 16S rRNA gene surveys and shotgun metagenomics. These results suggest that caution should be advised when applying sequence-based techniques to the study of microbiota present in low biomass environments. We provide an extensive list of potential contaminating genera, and guidelines on how to mitigate the effects of contamination. Concurrent sequencing of negative control samples is strongly advised.

Posted ContentDOI
27 May 2014-bioRxiv
TL;DR: It is shown that all sampled Austronesian groups harbor ancestry that is more closely related to aboriginal Taiwanese than to any present-day mainland population, and that western Island Southeast Asian populations have also inherited ancestry related to present-day Austro-Asiatic speakers, suggesting that either there was once a substantial Austro-Asiatic presence in Island Southeast Asia, or Austronesian speakers migrated to and through the mainland, admixing there before continuing to western Indonesia.
Abstract: Austronesian languages are spread across half the globe, from Easter Island to Madagascar. Evidence from linguistics and archaeology indicates that the "Austronesian expansion," which began 4-5 thousand years ago, likely had roots in Taiwan, but the ancestry of present-day Austronesian-speaking populations remains controversial. Here, focusing primarily on Island Southeast Asia, we analyze genome-wide data from 56 populations using new methods for tracing ancestral gene flow. We show that all sampled Austronesian groups harbor ancestry that is more closely related to aboriginal Taiwanese than to any present-day mainland population. Surprisingly, western Island Southeast Asian populations have also inherited ancestry from a source nested within the variation of present-day populations speaking Austro-Asiatic languages, which have historically been nearly exclusive to the mainland. Thus, either there was once a substantial Austro-Asiatic presence in Island Southeast Asia, or Austronesian speakers migrated to and through the mainland, admixing there before continuing to western Indonesia.