
Showing papers in "PLOS Computational Biology in 2018"


Journal ArticleDOI
TL;DR: MUMmer4 is described, a substantially improved version of MUMmer that addresses genome size constraints by changing the 32-bit suffix tree data structure at the core of MUMmer to a 48-bit suffix array, and that offers improved speed through parallel processing of input query sequences.
Abstract: The MUMmer system and the genome sequence aligner nucmer included within it are among the most widely used alignment packages in genomics. Since the last major release of MUMmer version 3 in 2004, it has been applied to many types of problems including aligning whole genome sequences, aligning reads to a reference genome, and comparing different assemblies of the same genome. Despite its broad utility, MUMmer3 has limitations that can make it difficult to use for large genomes and for the very large sequence data sets that are common today. In this paper we describe MUMmer4, a substantially improved version of MUMmer that addresses genome size constraints by changing the 32-bit suffix tree data structure at the core of MUMmer to a 48-bit suffix array, and that offers improved speed through parallel processing of input query sequences. With a theoretical limit on the input size of 141 Tbp, MUMmer4 can now work with input sequences of any biologically realistic length. We show that as a result of these enhancements, the nucmer program in MUMmer4 is easily able to handle alignments of large genomes; we illustrate this with an alignment of the human and chimpanzee genomes, which allows us to compute that the two species are 98% identical across 96% of their length. With the enhancements described here, MUMmer4 can also be used to efficiently align reads to reference genomes, although it is less sensitive and accurate than the dedicated read aligners. The nucmer aligner in MUMmer4 can now be called from scripting languages such as Perl, Python and Ruby. These improvements make MUMmer4 one of the most versatile genome alignment packages available.
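
The abstract notes that nucmer can be driven from scripting languages such as Python. A minimal sketch of one way to do this, assuming the MUMmer4 binaries are installed and on the PATH, and that ref.fa and qry.fa are placeholder FASTA files:

    import subprocess

    # Align the query genome against the reference; writes out.delta
    subprocess.run(["nucmer", "--prefix", "out", "ref.fa", "qry.fa"], check=True)

    # Summarise the alignments as per-segment coordinates and identities
    coords = subprocess.run(["show-coords", "-rcl", "out.delta"],
                            capture_output=True, text=True, check=True)
    print(coords.stdout)

(MUMmer4 also provides direct language bindings, as the abstract mentions; this sketch simply shells out to the command-line tools.)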

1,131 citations


Journal ArticleDOI
TL;DR: OpenSim is an extensible and user-friendly software package built on decades of knowledge about computational modeling and simulation of biomechanical systems that enables computational scientists to create new state-of-the-art software tools and empowers others to use these tools in research and clinical applications.
Abstract: Movement is fundamental to human and animal life, emerging through interaction of complex neural, muscular, and skeletal systems. Study of movement draws from and contributes to diverse fields, including biology, neuroscience, mechanics, and robotics. OpenSim unites methods from these fields to create fast and accurate simulations of movement, enabling two fundamental tasks. First, the software can calculate variables that are difficult to measure experimentally, such as the forces generated by muscles and the stretch and recoil of tendons during movement. Second, OpenSim can predict novel movements from models of motor control, such as kinematic adaptations of human gait during loaded or inclined walking. Changes in musculoskeletal dynamics following surgery or due to human–device interaction can also be simulated; these simulations have played a vital role in several applications, including the design of implantable mechanical devices to improve human grasping in individuals with paralysis. OpenSim is an extensible and user-friendly software package built on decades of knowledge about computational modeling and simulation of biomechanical systems. OpenSim’s design enables computational scientists to create new state-of-the-art software tools and empowers others to use these tools in research and clinical applications. OpenSim supports a large and growing community of biomechanics and rehabilitation researchers, facilitating exchange of models and simulations for reproducing and extending discoveries. Examples, tutorials, documentation, and an active user forum support this community. The OpenSim software is covered by the Apache License 2.0, which permits its use for any purpose including both nonprofit and commercial applications. The source code is freely and anonymously accessible on GitHub, where the community is welcomed to make contributions. Platform-specific installers of OpenSim include a GUI and are available on simtk.org.
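
A small sketch of OpenSim's Python scripting interface, assuming the opensim package is installed and that "arm26.osim" stands in for any available model file:

    import opensim as osim

    model = osim.Model("arm26.osim")   # load a musculoskeletal model
    state = model.initSystem()         # build the underlying multibody system

    # List the muscles defined in the model
    muscles = model.getMuscles()
    for i in range(muscles.getSize()):
        print(muscles.get(i).getName())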

642 citations


Journal ArticleDOI
TL;DR: A novel simulation method to determine thermodynamic phase diagrams as a function of the total protein concentration and temperature is developed and is capable of capturing qualitative changes in the phase diagram due to phosphomimetic mutations of FUS and to the presence or absence of the large folded domain in LAF-1.
Abstract: Membraneless organelles important to intracellular compartmentalization have recently been shown to comprise assemblies of proteins which undergo liquid-liquid phase separation (LLPS). However, many proteins involved in this phase separation are at least partially disordered. The molecular mechanism and the sequence determinants of this process are challenging to determine experimentally owing to the disordered nature of the assemblies, motivating the use of theoretical and simulation methods. This work advances a computational framework for conducting simulations of LLPS with residue-level detail, and allows for the determination of phase diagrams and coexistence densities of proteins in the two phases. The model includes a short-range contact potential as well as a simplified treatment of electrostatic energy. Interaction parameters are optimized against experimentally determined radius of gyration data for multiple unfolded or intrinsically disordered proteins (IDPs). These models are applied to two systems which undergo LLPS: the low complexity domain of the RNA-binding protein FUS and the DEAD-box helicase protein LAF-1. We develop a novel simulation method to determine thermodynamic phase diagrams as a function of the total protein concentration and temperature. We show that the model is capable of capturing qualitative changes in the phase diagram due to phosphomimetic mutations of FUS and to the presence or absence of the large folded domain in LAF-1. We also explore the effects of chain-length, or multivalency, on the phase diagram, and obtain results consistent with Flory-Huggins theory for polymers. Most importantly, the methodology presented here is flexible so that it can be easily extended to other pair potentials, be used with other enhanced sampling methods, and may incorporate additional features for biological systems of interest.
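
The abstract describes a residue-level model combining a short-range contact potential with a simplified electrostatic term. Below is a generic illustration of that kind of pair energy, using a Lennard-Jones-style contact term plus Debye-Hückel screened electrostatics; the functional forms and parameter values here are placeholders, not the fitted values from the paper:

    import numpy as np

    def pair_energy(r, sigma=0.6, eps=0.2, q1=0.0, q2=0.0, kappa=1.0, l_bjerrum=0.7):
        """Illustrative energy between two residues at separation r (nm)."""
        contact = 4.0 * eps * ((sigma / r) ** 12 - (sigma / r) ** 6)   # short-range term
        screened = l_bjerrum * q1 * q2 * np.exp(-kappa * r) / r        # screened Coulomb
        return contact + screened

    print(pair_energy(0.8, q1=1.0, q2=-1.0))   # an oppositely charged residue pair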

381 citations


Journal ArticleDOI
TL;DR: The Neo4j graph database and its query language, Cypher, provide efficient access to the complex Reactome data model, facilitating easy traversal and knowledge discovery and the Reactome graph database use case shows the power of NoSQL database engines for complex biological data types.
Abstract: Reactome is a free, open-source, open-data, curated and peer-reviewed knowledgebase of biomolecular pathways. One of its main priorities is to provide easy and efficient access to its high quality curated data. At present, biological pathway databases typically store their contents in relational databases. This limits access efficiency because there are performance issues associated with queries traversing highly interconnected data. The same data in a graph database can be queried more efficiently. Here we present the rationale behind the adoption of a graph database (Neo4j) as well as the new ContentService (REST API) that provides access to these data. The Neo4j graph database and its query language, Cypher, provide efficient access to the complex Reactome data model, facilitating easy traversal and knowledge discovery. The adoption of this technology greatly improved query efficiency, reducing the average query time by 93%. The web service built on top of the graph database provides programmatic access to Reactome data by object oriented queries, but also supports more complex queries that take advantage of the new underlying graph-based data storage. By adopting graph database technology we are providing a high performance pathway data resource to the community. The Reactome graph database use case shows the power of NoSQL database engines for complex biological data types.
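
Cypher queries can also be scripted from Python with the official neo4j driver. A sketch assuming a local Reactome Neo4j instance; the node label, relationship type, and property names below follow the general shape of the Reactome data model but should be checked against the Reactome graph database documentation:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    query = """
    MATCH (p:Pathway {speciesName: 'Homo sapiens'})-[:hasEvent]->(e)
    RETURN p.displayName AS pathway, e.displayName AS event
    LIMIT 10
    """
    with driver.session() as session:
        for record in session.run(query):
            print(record["pathway"], "->", record["event"])
    driver.close()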

324 citations


Journal ArticleDOI
TL;DR: PhysiCell is demonstrated by simulating the impact of necrotic core biomechanics, 3-D geometry, and stochasticity on the dynamics of hanging drop tumor spheroids and ductal carcinoma in situ (DCIS) of the breast.
Abstract: Many multicellular systems problems can only be understood by studying how cells move, grow, divide, interact, and die. Tissue-scale dynamics emerge from systems of many interacting cells as they respond to and influence their microenvironment. The ideal "virtual laboratory" for such multicellular systems simulates both the biochemical microenvironment (the "stage") and many mechanically and biochemically interacting cells (the "players" upon the stage). PhysiCell, a physics-based multicellular simulator, is an open source agent-based simulator that provides both the stage and the players for studying many interacting cells in dynamic tissue microenvironments. It builds upon a multi-substrate biotransport solver to link cell phenotype to multiple diffusing substrates and signaling factors. It includes biologically-driven sub-models for cell cycling, apoptosis, necrosis, solid and fluid volume changes, mechanics, and motility "out of the box." The C++ code has minimal dependencies, making it simple to maintain and deploy across platforms. PhysiCell has been parallelized with OpenMP, and its performance scales linearly with the number of cells. Simulations of up to 10^5-10^6 cells are feasible on quad-core desktop workstations; larger simulations are attainable on single HPC compute nodes. We demonstrate PhysiCell by simulating the impact of necrotic core biomechanics, 3-D geometry, and stochasticity on the dynamics of hanging drop tumor spheroids and ductal carcinoma in situ (DCIS) of the breast. We demonstrate stochastic motility, chemical and contact-based interaction of multiple cell types, and the extensibility of PhysiCell with examples in synthetic multicellular systems (a "cellular cargo delivery" system, with application to anti-cancer treatments), cancer heterogeneity, and cancer immunology. PhysiCell is a powerful multicellular systems simulator that will be continually improved with new capabilities and performance improvements. It also represents a significant independent code base for replicating results from other simulation platforms. The PhysiCell source code, examples, documentation, and support are available under the BSD license at http://PhysiCell.MathCancer.org and http://PhysiCell.sf.net.

303 citations


Journal ArticleDOI
TL;DR: A computational model of Matrix Decomposition and Heterogeneous Graph Inference for miRNA-disease association prediction (MDHGI) is developed to discover new miRNA-disease associations by integrating the predicted association probability obtained from matrix decomposition through a sparse learning method, the miRNA functional similarity, the disease semantic similarity, and the Gaussian interaction profile kernel similarity for diseases and miRNAs into a heterogeneous network.
Abstract: Recently, a growing number of biological research and scientific experiments have demonstrated that microRNA (miRNA) affects the development of human complex diseases. Discovering miRNA-disease associations plays an increasingly vital role in devising diagnostic and therapeutic tools for diseases. However, since uncovering associations via experimental methods is expensive and time-consuming, novel and effective computational methods for association prediction are in demand. In this study, we developed a computational model of Matrix Decomposition and Heterogeneous Graph Inference for miRNA-disease association prediction (MDHGI) to discover new miRNA-disease associations by integrating the predicted association probability obtained from matrix decomposition through a sparse learning method, the miRNA functional similarity, the disease semantic similarity, and the Gaussian interaction profile kernel similarity for diseases and miRNAs into a heterogeneous network. Compared with previous computational models based on heterogeneous networks, our model took full advantage of matrix decomposition before the construction of the heterogeneous network, thereby improving the prediction accuracy. MDHGI obtained AUCs of 0.8945 and 0.8240 in the global and the local leave-one-out cross validation, respectively. Moreover, the AUC of 0.8794±0.0021 in 5-fold cross validation confirmed the stability of its predictive performance. In addition, to further evaluate the model's accuracy, we applied MDHGI to four important human cancers in three different kinds of case studies. In the first type, 98% (Esophageal Neoplasms) and 98% (Lymphoma) of the top 50 predicted miRNAs have been confirmed by at least one of the two databases (dbDEMC and miR2Disease) or at least one experimental literature in PubMed. In the second type of case study, what made a difference was that we removed all known associations between the miRNAs and Lung Neoplasms before implementing MDHGI on Lung Neoplasms. As a result, 100% (Lung Neoplasms) of the top 50 related miRNAs have been indexed by at least one of the three databases (dbDEMC, miR2Disease and HMDD V2.0) or at least one experimental literature in PubMed. Furthermore, we also tested our prediction method on the HMDD V1.0 database to prove the applicability of MDHGI to different datasets. The results showed that 50 out of the top 50 miRNAs related to breast neoplasms were validated by at least one of the three databases (HMDD V2.0, dbDEMC, and miR2Disease) or at least one experimental literature.
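
One ingredient named in the abstract, the Gaussian interaction profile (GIP) kernel similarity, has a standard closed form: rows of the binary association matrix are treated as interaction profiles and compared with a Gaussian kernel. A small numpy sketch of that step (MDHGI's full pipeline additionally performs matrix decomposition and heterogeneous graph inference):

    import numpy as np

    def gip_kernel(A, gamma_prime=1.0):
        """GIP similarity between the rows (e.g. miRNAs) of association matrix A."""
        norms_sq = np.sum(A ** 2, axis=1)
        gamma = gamma_prime / np.mean(norms_sq)                        # bandwidth normalisation
        d2 = norms_sq[:, None] + norms_sq[None, :] - 2.0 * A @ A.T     # squared profile distances
        return np.exp(-gamma * d2)

    np.random.seed(0)
    A = np.random.randint(0, 2, size=(5, 8)).astype(float)   # toy miRNA x disease matrix
    print(gip_kernel(A).round(3))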

273 citations


Journal ArticleDOI
TL;DR: Evidence is provided that DCNNs have access to some local shape information in the form of local edge relations, but they have no access to global object shapes.
Abstract: Deep convolutional networks (DCNNs) are achieving previously unseen performance in object classification, raising questions about whether DCNNs operate similarly to human vision. In biological vision, shape is arguably the most important cue for recognition. We tested the role of shape information in DCNNs trained to recognize objects. In Experiment 1, we presented a trained DCNN with object silhouettes that preserved overall shape but were filled with surface texture taken from other objects. Shape cues appeared to play some role in the classification of artifacts, but little or none for animals. In Experiments 2–4, DCNNs showed no ability to classify glass figurines or outlines but correctly classified some silhouettes. Aspects of these results led us to hypothesize that DCNNs do not distinguish objects' bounding contours from other edges, and that DCNNs access some local shape features, but not global shape. In Experiment 5, we tested this hypothesis with displays that preserved local features but disrupted global shape, and vice versa. With disrupted global shape, which reduced human accuracy to 28%, DCNNs gave the same classification labels as with ordinary shapes. Conversely, local contour changes eliminated accurate DCNN classification but caused no difficulty for human observers. These results provide evidence that DCNNs have access to some local shape information in the form of local edge relations, but they have no access to global object shapes.

270 citations


Journal ArticleDOI
TL;DR: A new ANN framework called Cox-nnet is developed to predict patient prognosis from high throughput transcriptomics data with functional biological insights; it achieves the same or better predictive accuracy compared to other methods.
Abstract: Artificial neural networks (ANN) are computing architectures with many interconnections of simple neural-inspired computing elements, and have been applied to biomedical fields such as imaging analysis and diagnosis. We have developed a new ANN framework called Cox-nnet to predict patient prognosis from high throughput transcriptomics data. In 10 TCGA RNA-Seq data sets, Cox-nnet achieves the same or better predictive accuracy compared to other methods, including Cox-proportional hazards regression (with LASSO, ridge, and minimax concave penalty), Random Forests Survival and CoxBoost. Cox-nnet also reveals richer biological information, at both the pathway and gene levels. The outputs from the hidden layer nodes provide an alternative approach for survival-sensitive dimension reduction. In summary, we have developed a new method for accurate and efficient prognosis prediction on high throughput data, with functional biological insights. The source code is freely available at https://github.com/lanagarmire/cox-nnet.
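
Cox-nnet combines a neural network with the Cox proportional hazards framework; a natural loss for such survival models is the negative Cox partial log-likelihood. A minimal numpy sketch of that quantity, ignoring ties and using fixed risk scores as a stand-in for the network output:

    import numpy as np

    def neg_cox_partial_loglik(score, time, event):
        """score: risk scores; time: follow-up times; event: 1 = event, 0 = censored."""
        order = np.argsort(-time)                       # sort by decreasing follow-up time
        score, event = score[order], event[order]
        log_risk_set = np.logaddexp.accumulate(score)   # log of sum(exp(score)) over each risk set
        return -np.sum((score - log_risk_set) * event)

    score = np.array([0.5, -1.2, 0.3, 2.0])
    time = np.array([5.0, 8.0, 3.0, 1.0])
    event = np.array([1, 0, 1, 1])
    print(neg_cox_partial_loglik(score, time, event))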

246 citations


Journal ArticleDOI
TL;DR: The scRNA-tools database is created to catalogue and curate analysis tools as they become available, and shows that many tools perform tasks specific to scRNA-seq analysis, particularly clustering and ordering of cells.
Abstract: As single-cell RNA-sequencing (scRNA-seq) datasets have become more widespread the number of tools designed to analyse these data has dramatically increased. Navigating the vast sea of tools now available is becoming increasingly challenging for researchers. In order to better facilitate selection of appropriate analysis tools we have created the scRNA-tools database (www.scRNA-tools.org) to catalogue and curate analysis tools as they become available. Our database collects a range of information on each scRNA-seq analysis tool and categorises them according to the analysis tasks they perform. Exploration of this database gives insights into the areas of rapid development of analysis methods for scRNA-seq data. We see that many tools perform tasks specific to scRNA-seq analysis, particularly clustering and ordering of cells. We also find that the scRNA-seq community embraces an open-source and open-science approach, with most tools available under open-source licenses and preprints being extensively used as a means to describe methods. The scRNA-tools database provides a valuable resource for researchers embarking on scRNA-seq analysis and records the growth of the field over time.

222 citations


Journal ArticleDOI
TL;DR: RAVEN Toolbox 2.0 is presented with major enhancements, including de novo reconstruction of GEMs based on the MetaCyc pathway database; a redesigned KEGG-based reconstruction pipeline; convergence of reconstructions from various sources; and improved performance, usability, and compatibility with the COBRA Toolbox.
Abstract: RAVEN is a commonly used MATLAB toolbox for genome-scale metabolic model (GEM) reconstruction, curation and constraint-based modelling and simulation. Here we present RAVEN Toolbox 2.0 with major enhancements, including: (i) de novo reconstruction of GEMs based on the MetaCyc pathway database; (ii) a redesigned KEGG-based reconstruction pipeline; (iii) convergence of reconstructions from various sources; (iv) improved performance, usability, and compatibility with the COBRA Toolbox. Capabilities of RAVEN 2.0 are here illustrated through de novo reconstruction of GEMs for the antibiotic-producing bacterium Streptomyces coelicolor. Comparison of the automated de novo reconstructions with the iMK1208 model, a previously published high-quality S. coelicolor GEM, exemplifies that RAVEN 2.0 can capture most of the manually curated model. The generated de novo reconstruction is subsequently used to curate iMK1208 resulting in Sco4, the most comprehensive GEM of S. coelicolor, with increased coverage of both primary and secondary metabolism. This increased coverage allows the use of Sco4 to predict novel genome editing targets for optimized secondary metabolites production. As such, we demonstrate that RAVEN 2.0 can be used not only for de novo GEM reconstruction, but also for curating existing models based on up-to-date databases. Both RAVEN 2.0 and Sco4 are distributed through GitHub to facilitate usage and further development by the community (https://github.com/SysBioChalmers/RAVEN and https://github.com/SysBioChalmers/Streptomyces_coelicolor-GEM).

210 citations


Journal ArticleDOI
TL;DR: This work shows how molecular networking can be used to improve the accuracy of in silico predictions through propagation of structural annotations, even when there is no match to a MS/MS spectrum in spectral libraries.
Abstract: The annotation of small molecules is one of the most challenging and important steps in untargeted mass spectrometry analysis, as most of our biological interpretations rely on structural annotations. Molecular networking has emerged as a structured way to organize and mine data from untargeted tandem mass spectrometry (MS/MS) experiments and has been widely applied to propagate annotations. However, propagation is done through manual inspection of MS/MS spectra connected in the spectral networks and is only possible when a reference library spectrum is available. One of the alternative approaches used to annotate an unknown fragmentation mass spectrum is through the use of in silico predictions. One of the challenges of in silico annotation is the uncertainty around the correct structure among the predicted candidate lists. Here we show how molecular networking can be used to improve the accuracy of in silico predictions through propagation of structural annotations, even when there is no match to a MS/MS spectrum in spectral libraries. This is accomplished through creating a network consensus of re-ranked structural candidates using the molecular network topology and structural similarity to improve in silico annotations. The Network Annotation Propagation (NAP) tool is accessible through the GNPS web-platform https://gnps.ucsd.edu/ProteoSAFe/static/gnps-theoretical.jsp.

Journal ArticleDOI
TL;DR: The technical status quo on computer vision approaches for plant species identification is reviewed, the main research challenges to overcome in providing applicable tools are highlighted, and open and future research thrusts are discussed.
Abstract: Current rates of species loss have triggered numerous attempts to protect and conserve biodiversity. Species conservation, however, requires species identification skills, a competence obtained through intensive training and experience. Field researchers, land managers, educators, civil servants, and the interested public would greatly benefit from accessible, up-to-date tools automating the process of species identification. Currently, relevant technologies, such as digital cameras, mobile devices, and remote access to databases, are ubiquitously available, accompanied by significant advances in image processing and pattern recognition. The idea of automated species identification is approaching reality. We review the technical status quo on computer vision approaches for plant species identification, highlight the main research challenges to overcome in providing applicable tools, and conclude with a discussion of open and future research thrusts.

Journal ArticleDOI
TL;DR: SGZ (somatic-germline-zygosity), a computational method for predicting somatic vs. germline origin and homozygous vs. heterozygous or sub-clonal state of variants identified from deep massively parallel sequencing (MPS) of cancer specimens, is introduced.
Abstract: A key constraint in genomic testing in oncology is that matched normal specimens are not commonly obtained in clinical practice. Thus, while well-characterized genomic alterations do not require normal tissue for interpretation, in the absence of a matched normal control it will be unknown whether a significant number of alterations are germline or somatic. We introduce SGZ (somatic-germline-zygosity), a computational method for predicting somatic vs. germline origin and homozygous vs. heterozygous or sub-clonal state of variants identified from deep massively parallel sequencing (MPS) of cancer specimens. The method does not require a patient-matched normal control, enabling broad application in clinical research. SGZ predicts the somatic vs. germline status of each alteration identified by modeling the alteration's allele frequency (AF), taking into account the tumor content, tumor ploidy, and the local copy number. Accuracy of the prediction depends on the depth of sequencing and copy number model fit, which are achieved in our clinical assay by sequencing to high depth (>500x) using MPS, covering 394 cancer-related genes and over 3,500 genome-wide single nucleotide polymorphisms (SNPs). Calls are made using a statistic based on read depth and local variability of SNP AF. To validate the method, we first evaluated performance on samples from 30 lung and colon cancer patients, where we sequenced tumors and matched normal tissue. We examined predictions for 17 somatic hotspot mutations and 20 common germline SNPs in 20,182 clinical cancer specimens. To assess the impact of stromal admixture, we examined three cell lines, which were titrated with their matched normal to six levels (10-75%). Overall, predictions were made in 85% of cases, with 95-99% of variants predicted correctly, a significantly superior performance compared to a basic approach based on AF alone. We then applied the SGZ method to the COSMIC database of known somatic variants in cancer and found >50 that are in fact more likely to be germline.
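
The core idea, modelling a variant's expected allele frequency from tumor purity, local copy number, and the number of mutated copies, can be illustrated with a simplified calculation. This is a generic sketch of that reasoning, not the paper's exact statistic:

    def expected_af(purity, tumor_cn, mutated_copies, germline=False):
        """Expected variant allele frequency in an impure tumour sample."""
        normal_alt = 1.0 if germline else 0.0   # a germline het is also present in normal cells
        alt_copies = purity * mutated_copies + (1.0 - purity) * normal_alt
        total_copies = purity * tumor_cn + (1.0 - purity) * 2.0
        return alt_copies / total_copies

    # Variant on 1 of 3 tumour copies at 60% purity:
    print(expected_af(0.6, 3, 1, germline=False))   # ~0.23 if somatic
    print(expected_af(0.6, 3, 1, germline=True))    # ~0.38 if a germline heterozygous SNP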

Journal ArticleDOI
TL;DR: A novel phylogenetic approach that has been tailor-made for microbial GWAS is introduced, which is applicable to organisms ranging from purely clonal to frequently recombining, and to both binary and continuous phenotypes, and is robust to the confounding effects of both population structure and recombination.
Abstract: Genome-Wide Association Studies (GWAS) in microbial organisms have the potential to vastly improve the way we understand, manage, and treat infectious diseases. Yet, microbial GWAS methods established thus far remain insufficiently able to capitalise on the growing wealth of bacterial and viral genetic sequence data. Facing clonal population structure and homologous recombination, existing GWAS methods struggle to achieve both the precision necessary to reject spurious findings and the power required to detect associations in microbes. In this paper, we introduce a novel phylogenetic approach that has been tailor-made for microbial GWAS, which is applicable to organisms ranging from purely clonal to frequently recombining, and to both binary and continuous phenotypes. Our approach is robust to the confounding effects of both population structure and recombination, while maintaining high statistical power to detect associations. Thorough testing via application to simulated data provides strong support for the power and specificity of our approach and demonstrates the advantages offered over alternative cluster-based and dimension-reduction methods. Two applications to Neisseria meningitidis illustrate the versatility and potential of our method, confirming previously-identified penicillin resistance loci and resulting in the identification of both well-characterised and novel drivers of invasive disease. Our method is implemented as an open-source R package called treeWAS which is freely available at https://github.com/caitiecollins/treeWAS.

Journal ArticleDOI
TL;DR: In this paper, a number of algebraic topology approaches, including multi-component persistent homology, multi-level persistent homologies, and electrostatic persistence, are introduced for representation, characterization, and description of small molecules and biomolecular complexes.
Abstract: This work introduces a number of algebraic topology approaches, including multi-component persistent homology, multi-level persistent homology, and electrostatic persistence for the representation, characterization, and description of small molecules and biomolecular complexes. In contrast to the conventional persistent homology, multi-component persistent homology retains critical chemical and biological information during the topological simplification of biomolecular geometric complexity. Multi-level persistent homology enables a tailored topological description of inter- and/or intra-molecular interactions of interest. Electrostatic persistence incorporates partial charge information into topological invariants. These topological methods are paired with Wasserstein distance to characterize similarities between molecules and are further integrated with a variety of machine learning algorithms, including k-nearest neighbors, ensemble of trees, and deep convolutional neural networks, to manifest their descriptive and predictive powers for protein-ligand binding analysis and virtual screening of small molecules. Extensive numerical experiments involving 4,414 protein-ligand complexes from the PDBBind database and 128,374 ligand-target and decoy-target pairs in the DUD database are performed to test respectively the scoring power and the discriminatory power of the proposed topological learning strategies. It is demonstrated that the present topological learning outperforms other existing methods in protein-ligand binding affinity prediction and ligand-decoy discrimination.
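
The persistent homology machinery underlying these descriptors can be computed with standard topology libraries. A small sketch using GUDHI's Vietoris-Rips filtration on a toy point cloud; the paper's element-specific, multi-component, and electrostatic constructions build further on this basic computation:

    import numpy as np
    import gudhi

    points = np.random.rand(50, 3)                        # toy "atom" coordinates
    rips = gudhi.RipsComplex(points=points, max_edge_length=0.8)
    st = rips.create_simplex_tree(max_dimension=2)
    diagram = st.persistence()                            # list of (dimension, (birth, death))

    for dim, (birth, death) in diagram[:5]:
        print(f"H{dim}: born {birth:.3f}, dies {death:.3f}")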

Journal ArticleDOI
TL;DR: The results of the spikefinder challenge, launched to catalyze the development of new spike rate inference algorithms through crowd-sourcing, are reported, and the top-performing algorithms are based on a wide range of principles, yet provide highly correlated estimates of the neural activity.
Abstract: In recent years, two-photon calcium imaging has become a standard tool to probe the function of neural circuits and to study computations in neuronal populations. However, the acquired signal is only an indirect measurement of neural activity due to the comparatively slow dynamics of fluorescent calcium indicators. Different algorithms for estimating spike rates from noisy calcium measurements have been proposed in the past, but it is an open question how far performance can be improved. Here, we report the results of the spikefinder challenge, launched to catalyze the development of new spike rate inference algorithms through crowd-sourcing. We present ten of the submitted algorithms which show improved performance compared to previously evaluated methods. Interestingly, the top-performing algorithms are based on a wide range of principles from deep neural networks to generative models, yet provide highly correlated estimates of the neural activity. The competition shows that benchmark challenges can drive algorithmic developments in neuroscience.

Journal ArticleDOI
TL;DR: Deepbinner is a tool for Oxford Nanopore demultiplexing that uses a deep neural network to classify reads based on the raw electrical read signal, which allows for greater accuracy than existing ‘base-space’ tools.
Abstract: Multiplexing, the simultaneous sequencing of multiple barcoded DNA samples on a single flow cell, has made Oxford Nanopore sequencing cost-effective for small genomes. However, it depends on the ability to sort the resulting sequencing reads by barcode, and current demultiplexing tools fail to classify many reads. Here we present Deepbinner, a tool for Oxford Nanopore demultiplexing that uses a deep neural network to classify reads based on the raw electrical read signal. This 'signal-space' approach allows for greater accuracy than existing 'base-space' tools (Albacore and Porechop) for which signals must first be converted to DNA base calls, itself a complex problem that can introduce noise into the barcode sequence. To assess Deepbinner and existing tools, we performed multiplex sequencing on 12 amplicons chosen for their distinguishability. This allowed us to establish a ground truth classification for each read based on internal sequence alone. Deepbinner had the lowest rate of unclassified reads (7.8%) and the highest demultiplexing precision (98.5% of classified reads were correctly assigned). It can be used alone (to maximise the number of classified reads) or in conjunction with other demultiplexers (to maximise precision and minimise false positive classifications). We also found cross-sample chimeric reads (0.3%) and evidence of barcode switching (0.3%) in our dataset, which likely arise during library preparation and may be detrimental for quantitative studies that use multiplexing. Deepbinner is open source (GPLv3) and available at https://github.com/rrwick/Deepbinner.

Journal ArticleDOI
TL;DR: In this article, a convolutional neural network based open-source pipeline was developed for detecting ultrasonic, full-spectrum, search-phase calls produced by echolocating bats.
Abstract: Passive acoustic sensing has emerged as a powerful tool for quantifying anthropogenic impacts on biodiversity, especially for echolocating bat species. To better assess bat population trends there is a critical need for accurate, reliable, and open source tools that allow the detection and classification of bat calls in large collections of audio recordings. The majority of existing tools are commercial or have focused on the species classification task, neglecting the important problem of first localizing echolocation calls in audio which is particularly problematic in noisy recordings. We developed a convolutional neural network based open-source pipeline for detecting ultrasonic, full-spectrum, search-phase calls produced by echolocating bats. Our deep learning algorithms were trained on full-spectrum ultrasonic audio collected along road-transects across Europe and labelled by citizen scientists from www.batdetective.org. When compared to other existing algorithms and commercial systems, we show significantly higher detection performance of search-phase echolocation calls with our test sets. As an example application, we ran our detection pipeline on bat monitoring data collected over five years from Jersey (UK), and compared results to a widely-used commercial system. Our detection pipeline can be used for the automatic detection and monitoring of bat populations, and further facilitates their use as indicator species on a large scale. Our proposed pipeline makes only a small number of bat specific design decisions, and with appropriate training data it could be applied to detecting other species in audio. A crucial novelty of our work is showing that with careful, non-trivial, design and implementation considerations, state-of-the-art deep learning methods can be used for accurate and efficient monitoring in audio.

Journal ArticleDOI
TL;DR: This study is a first step to include evolutionary aspects into parameter estimation, allowing inference of properties of species for which very little is known.
Abstract: We developed new methods for parameter estimation-in-context and, with the help of 125 authors, built the AmP (Add-my-Pet) database of Dynamic Energy Budget (DEB) models, parameters and referenced underlying data for animals, where each species constitutes one database entry. The combination of DEB parameters covers all aspects of energetics throughout the full organism's life cycle, from the start of embryo development to death by aging. The species-specific parameter values capture biodiversity and can now, for the first time, be compared between animal species. An important insight brought by the AmP project is the classification of animal energetics according to a family of related DEB models that is structured on the basis of the mode of metabolic acceleration, which links up with the development of larval stages. We discuss the evolution of metabolism in this context, among animals in general, and ray-finned fish, mollusks and crustaceans in particular. New DEBtool code for estimating DEB parameters from data has been written. AmPtool code for analyzing patterns in parameter values has also been created. A new web-interface supports multiple ways to visualize data, parameters, and implied properties from the entire collection as well as on an entry by entry basis. The DEB models proved to fit data well: the median relative error is only 0.07 for the 1035 animal species as of 2018/03/12, including some extinct ones, from all large phyla and all chordate orders, spanning a range of body masses of 16 orders of magnitude. This study is a first step to include evolutionary aspects into parameter estimation, allowing inference of properties of species for which very little is known.

Journal ArticleDOI
TL;DR: ggsashimi is a command-line tool for the visualization of splicing events across multiple samples; it uses popular bioinformatics file formats, is annotation-independent, and allows visualization even for large genomic regions by scaling down the genomic segments between splice sites.
Abstract: We present ggsashimi, a command-line tool for the visualization of splicing events across multiple samples. Given a specified genomic region, ggsashimi creates sashimi plots for individual RNA-seq experiments as well as aggregated plots for groups of experiments, a feature unique to this software. Compared to the existing versions of programs generating sashimi plots, it uses popular bioinformatics file formats, it is annotation-independent, and allows the visualization of splicing events even for large genomic regions by scaling down the genomic segments between splice sites. ggsashimi is freely available at https://github.com/guigolab/ggsashimi. It is implemented in python, and internally generates R code for plotting.

Journal ArticleDOI
TL;DR: An analysis of 15 million English scientific full-text articles published during the period 1823–2016 is presented, showing the development in article length and publication sub-topics during these nearly 200 years.
Abstract: Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823-2016. We describe the development in article length and publication sub-topics during these nearly 200 years. We showcase the potential of text mining by extracting published protein-protein, disease-gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.

Journal ArticleDOI
Weilong Zhao, Xinwei Sher
TL;DR: An unbiased view for establishing best practice of T-cell epitope predictions is presented, facilitating future development of methods in immunogenomics.
Abstract: A number of machine learning-based predictors have been developed for identifying immunogenic T-cell epitopes based on major histocompatibility complex (MHC) class I and II binding affinities. Rationally selecting the most appropriate tool has been complicated by the evolving training data and machine learning methods. Despite the recent advances made in generating high-quality MHC-eluted, naturally processed ligandomes, the reliability of new predictors on these epitopes has yet to be evaluated. This study reports the latest benchmarking on an extensive set of MHC-binding predictors by using newly available, untested data of both synthetic and naturally processed epitopes. 32 human leukocyte antigen (HLA) class I and 24 HLA class II alleles are included in the blind test set. Artificial neural network (ANN)-based approaches demonstrated better performance than regression-based machine learning and structural modeling. Among the 18 predictors benchmarked, ANN-based mhcflurry and nn_align perform the best for MHC class I 9-mer and class II 15-mer predictions, respectively, on binding/non-binding classification (Area Under the Curve = 0.911). NetMHCpan4 also demonstrated comparable predictive power. Our customization of mhcflurry to a pan-HLA predictor has achieved similar accuracy to NetMHCpan. The overall accuracy of these methods is comparable between 9-mer and 10-mer testing data. However, the top methods deliver low correlations between the predicted versus the experimental affinities for strong MHC binders. When used on naturally processed MHC-ligands, tools that have been trained on elution data (NetMHCpan4 and MixMHCpred) show better accuracy than pure binding affinity predictors. The variability of false prediction rate is considerable among HLA types and datasets. Finally, the structure-based predictor Rosetta FlexPepDock is less optimal compared to the machine learning approaches. With our benchmarking of MHC-binding and MHC-elution predictors using comprehensive metrics, an unbiased view for establishing best practice of T-cell epitope predictions is presented, facilitating future development of methods in immunogenomics.
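
One of the top-ranked tools, mhcflurry, exposes a Python API. A sketch assuming mhcflurry and its downloaded model weights are installed; allele naming conventions and method signatures should be checked against the mhcflurry documentation:

    from mhcflurry import Class1AffinityPredictor

    predictor = Class1AffinityPredictor.load()   # load the pre-trained class I models
    affinities = predictor.predict(peptides=["GILGFVFTL", "NLVPMVATV"],
                                   allele="HLA-A*02:01")
    print(affinities)                            # predicted binding affinities (nM)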

Journal ArticleDOI
TL;DR: COBRAme provides tools to simplify constructing and editing ME-models to enable ME-model reconstructions for new organisms and is used to reconstruct a condensed E. coli ME-model called iJL1678b-ME.
Abstract: Genome-scale models of metabolism and macromolecular expression (ME-models) explicitly compute the optimal proteome composition of a growing cell. ME-models expand upon the well-established genome-scale models of metabolism (M-models), and they enable a new fundamental understanding of cellular growth. ME-models have increased predictive capabilities and accuracy due to their inclusion of the biosynthetic costs for the machinery of life, but they come with a significant increase in model size and complexity. This challenge results in models which are both difficult to compute and challenging to understand conceptually. As a result, ME-models exist for only two organisms (Escherichia coli and Thermotoga maritima) and are still used by relatively few researchers. To address these challenges, we have developed a new software framework called COBRAme for building and simulating ME-models. It is coded in Python and built on COBRApy, a popular platform for using M-models. COBRAme streamlines computation and analysis of ME-models. It provides tools to simplify constructing and editing ME-models to enable ME-model reconstructions for new organisms. We used COBRAme to reconstruct a condensed E. coli ME-model called iJL1678b-ME. This reformulated model gives functionally identical solutions to previous E. coli ME-models while using 1/6 the number of free variables and solving in less than 10 minutes, a marked improvement over the 6 hour solve time of previous ME-model formulations. Errors in previous ME-models were also corrected leading to 52 additional genes that must be expressed in iJL1678b-ME to grow aerobically in glucose minimal in silico media. This manuscript outlines the architecture of COBRAme and demonstrates how ME-models can be created, modified, and shared most efficiently using the new software framework.
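
COBRAme is built on COBRApy, whose M-model workflow is widely used. A brief sketch of that underlying layer for context; the SBML file name is a placeholder, and constructing and solving ME-models follows COBRAme's own API as described in its documentation:

    import cobra

    model = cobra.io.read_sbml_model("e_coli_core.xml")   # placeholder M-model file
    solution = model.optimize()                           # flux balance analysis
    print(solution.objective_value)                       # e.g. predicted growth rate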

Journal ArticleDOI
TL;DR: The sequence-based feature projection ensemble learning method, “SFPEL-LPI”, is proposed, which accurately predicts lncRNA-protein associations and outperforms other state-of-the-art methods.
Abstract: LncRNA-protein interactions play important roles in post-transcriptional gene regulation, poly-adenylation, splicing and translation. Identification of lncRNA-protein interactions helps to understand lncRNA-related activities. Existing computational methods utilize multiple lncRNA features or multiple protein features to predict lncRNA-protein interactions, but features are not available for all lncRNAs or proteins; most existing methods are not capable of predicting interacting proteins (or lncRNAs) for new lncRNAs (or proteins), which have no known interactions. In this paper, we propose the sequence-based feature projection ensemble learning method, "SFPEL-LPI", to predict lncRNA-protein interactions. First, SFPEL-LPI extracts lncRNA sequence-based features and protein sequence-based features. Second, SFPEL-LPI calculates multiple lncRNA-lncRNA similarities and protein-protein similarities by using lncRNA sequences, protein sequences and known lncRNA-protein interactions. Then, SFPEL-LPI combines multiple similarities and multiple features with a feature projection ensemble learning framework. In computational experiments, SFPEL-LPI accurately predicts lncRNA-protein associations and outperforms other state-of-the-art methods. More importantly, SFPEL-LPI can be applied to new lncRNAs (or proteins). The case studies demonstrate that our method can identify novel lncRNA-protein interactions, which are confirmed by the literature. Finally, we construct a user-friendly web server, available at http://www.bioinfotech.cn/SFPEL-LPI/.

Journal ArticleDOI
TL;DR: This work presents a nonparametric model-based method, Dirichlet process Gaussian process mixture model (DPGP), which jointly models data clusters with a Dirichlet process and temporal dependencies with Gaussian processes, and demonstrates that jointly modeling cluster number and temporal dependencies can reveal shared regulatory mechanisms.
Abstract: Transcriptome-wide time series expression profiling is used to characterize the cellular response to environmental perturbations. The first step to analyzing transcriptional response data is often to cluster genes with similar responses. Here, we present a nonparametric model-based method, Dirichlet process Gaussian process mixture model (DPGP), which jointly models data clusters with a Dirichlet process and temporal dependencies with Gaussian processes. We demonstrate the accuracy of DPGP in comparison to state-of-the-art approaches using hundreds of simulated data sets. To further test our method, we apply DPGP to published microarray data from a microbial model organism exposed to stress and to novel RNA-seq data from a human cell line exposed to the glucocorticoid dexamethasone. We validate our clusters by examining local transcription factor binding and histone modifications. Our results demonstrate that jointly modeling cluster number and temporal dependencies can reveal shared regulatory mechanisms. DPGP software is freely available online at https://github.com/PrincetonUniversity/DP_GP_cluster.
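
The Gaussian-process half of DPGP models each cluster's expression trajectory as a smooth function of time. A minimal sketch of that ingredient, using a squared-exponential kernel and drawing one trajectory; the Dirichlet-process clustering layer is not shown, and parameters are illustrative:

    import numpy as np

    def se_kernel(t1, t2, lengthscale=2.0, variance=1.0):
        d = t1[:, None] - t2[None, :]
        return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

    t = np.linspace(0, 10, 25)                          # sampling times
    K = se_kernel(t, t) + 1e-6 * np.eye(len(t))         # jitter for numerical stability
    trajectory = np.random.multivariate_normal(np.zeros(len(t)), K)
    print(trajectory.round(2))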

Journal ArticleDOI
TL;DR: A coevolutionary model is considered in which, besides the payoff-driven competition of cooperator and defector players, the level of a renewable resource depends sensitively on the fraction of cooperators and the total consumption of all players.
Abstract: Utilizing common resources is always a dilemma for community members. While cooperator players restrain themselves and consider the proper state of resources, defectors demand more than their supposed share for a higher payoff. To avoid the tragedy of the commons, punishing the latter group seems to be an adequate reaction. This conclusion, however, is less straightforward when we acknowledge the fact that resources are finite and even a renewable resource has limited growing capacity. To clarify the possible consequences, we consider a coevolutionary model where, besides the payoff-driven competition of cooperator and defector players, the level of a renewable resource depends sensitively on the fraction of cooperators and the total consumption of all players. The applied feedback-evolving game reveals that besides a delicately adjusted punishment it is also fundamental that cooperators should pay special attention to the growing capacity of renewable resources. Otherwise, even the usage of tough punishment cannot save the community from an undesired end.
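
A generic illustration of the kind of coupled dynamics described: the cooperator fraction evolves by payoff-driven (replicator-style) competition while the resource grows logistically and is depleted by consumption. The equations and parameters below are illustrative only, not the paper's exact model:

    def step(x, r, dt=0.01, growth=0.5, capacity=1.0, use_c=0.1, use_d=0.3, fine=0.2):
        """One Euler step for cooperator fraction x and resource level r."""
        payoff_c = use_c * r                      # cooperators consume modestly
        payoff_d = use_d * r - fine               # defectors consume more but risk punishment
        dx = x * (1 - x) * (payoff_c - payoff_d)  # replicator-style dynamics
        dr = growth * r * (1 - r / capacity) - r * (x * use_c + (1 - x) * use_d)
        return x + dt * dx, max(r + dt * dr, 0.0)

    x, r = 0.5, 1.0
    for _ in range(5000):
        x, r = step(x, r)
    print(f"cooperator fraction {x:.2f}, resource level {r:.2f}")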

Journal ArticleDOI
TL;DR: A general-purpose framework for recording and simplifying genealogical data is implemented, which can be used to make simulations of any population model more efficient, yielding large efficiency gains over classical forward-time simulations.
Abstract: In this paper we describe how to efficiently record the entire genetic history of a population in forwards-time, individual-based population genetics simulations with arbitrary breeding models, population structure and demography. This approach dramatically reduces the computational burden of tracking individual genomes by allowing us to simulate only those loci that may affect reproduction (those having non-neutral variants). The genetic history of the population is recorded as a succinct tree sequence as introduced in the software package msprime, on which neutral mutations can be quickly placed afterwards. Recording the results of each breeding event requires storage that grows linearly with time, but there is a great deal of redundancy in this information. We solve this storage problem by providing an algorithm to quickly 'simplify' a tree sequence by removing this irrelevant history for a given set of genomes. By periodically simplifying the history with respect to the extant population, we show that the total storage space required is modest and overall large efficiency gains can be made over classical forward-time simulations. We implement a general-purpose framework for recording and simplifying genealogical data, which can be used to make simulations of any population model more efficient. We modify two popular forwards-time simulation frameworks to use this new approach and observe efficiency gains in large, whole-genome simulations of one to two orders of magnitude. In addition to speed, our method for recording pedigrees has several advantages: (1) All marginal genealogies of the simulated individuals are recorded, rather than just genotypes. (2) A population of N individuals with M polymorphic sites can be stored in O(N log N + M) space, making it feasible to store a simulation's entire final generation as well as its history. (3) A simulation can easily be initialized with a more efficient coalescent simulation of deep history. The software for recording and processing tree sequences is named tskit.
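
The tree-sequence recording and simplification described here are exposed through the msprime/tskit Python APIs. A short sketch: simulate a genealogy, then simplify it down to the history of a subset of samples, discarding lineages irrelevant to that subset (parameter values are arbitrary):

    import msprime

    ts = msprime.simulate(sample_size=20, Ne=1000, length=1e4,
                          recombination_rate=1e-8, mutation_rate=1e-8, random_seed=42)
    print(ts.num_nodes, ts.num_edges)

    # Keep only the history relevant to the first four samples
    small_ts = ts.simplify(samples=[0, 1, 2, 3])
    print(small_ts.num_nodes, small_ts.num_edges)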

Journal ArticleDOI
TL;DR: It is hypothesized that in the spontaneous condition the brain operates in a metastable regime where cortico-cortical projections target excitatory and inhibitory populations in a balanced manner that produces substantial inter-area interactions while maintaining global stability.
Abstract: Cortical activity has distinct features across scales, from the spiking statistics of individual cells to global resting-state networks. We here describe the first full-density multi-area spiking network model of cortex, using macaque visual cortex as a test system. The model represents each area by a microcircuit with area-specific architecture and features layer- and population-resolved connectivity between areas. Simulations reveal a structured asynchronous irregular ground state. In a metastable regime, the network reproduces spiking statistics from electrophysiological recordings and cortico-cortical interaction patterns in fMRI functional connectivity under resting-state conditions. Stable inter-area propagation is supported by cortico-cortical synapses that are moderately strong onto excitatory neurons and stronger onto inhibitory neurons. Causal interactions depend on both cortical structure and the dynamical state of populations. Activity propagates mainly in the feedback direction, similar to experimental results associated with visual imagery and sleep. The model unifies local and large-scale accounts of cortex, and clarifies how the detailed connectivity of cortex shapes its dynamics on multiple scales. Based on our simulations, we hypothesize that in the spontaneous condition the brain operates in a metastable regime where cortico-cortical projections target excitatory and inhibitory populations in a balanced manner that produces substantial inter-area interactions while maintaining global stability.

Journal ArticleDOI
TL;DR: It is demonstrated that antibiotic resistance in E. coli can be accurately predicted from whole genome sequences without a priori knowledge of mechanisms, and that both genomic and epidemiological data can be informative.
Abstract: The emergence of microbial antibiotic resistance is a global health threat. In clinical settings, the key to controlling spread of resistant strains is accurate and rapid detection. As traditional culture-based methods are time consuming, genetic approaches have recently been developed for this task. The detection of antibiotic resistance is typically made by measuring a few known determinants previously identified from genome sequencing, and thus requires the prior knowledge of its biological mechanisms. To overcome this limitation, we employed machine learning models to predict resistance to 11 compounds across four classes of antibiotics from existing and novel whole genome sequences of 1936 E. coli strains. We considered a range of methods, and examined population structure, isolation year, gene content, and polymorphism information as predictors. Gradient boosted decision trees consistently outperformed alternative models with an average accuracy of 0.91 on held-out data (range 0.81-0.97). While the best models most frequently employed gene content, an average accuracy score of 0.79 could be obtained using population structure information alone. Single nucleotide variation data were less useful, and significantly improved prediction only for two antibiotics, including ciprofloxacin. These results demonstrate that antibiotic resistance in E. coli can be accurately predicted from whole genome sequences without a priori knowledge of mechanisms, and that both genomic and epidemiological data can be informative. This paves the way for integrating machine learning approaches into diagnostic tools in the clinic.
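
The best-performing approach reported, gradient boosted decision trees on genomic features, is straightforward to prototype. A sketch using scikit-learn on a random placeholder gene presence/absence matrix; the authors' exact boosting implementation and feature encodings may differ:

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(500, 300))    # strains x gene presence/absence (toy data)
    y = rng.integers(0, 2, size=500)           # 1 = resistant, 0 = susceptible (toy labels)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = GradientBoostingClassifier(n_estimators=200, max_depth=3).fit(X_tr, y_tr)
    print("held-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))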

Journal ArticleDOI
TL;DR: This work rigorously benchmarked RAbD on a set of 60 diverse antibody–antigen complexes, using two design strategies (optimizing total Rosetta energy and optimizing interface energy alone), and utilized two novel metrics for measuring success in computational protein design.
Abstract: A structural-bioinformatics-based computational methodology and framework have been developed for the design of antibodies to targets of interest. RosettaAntibodyDesign (RAbD) samples the diverse sequence, structure, and binding space of an antibody to an antigen in highly customizable protocols for the design of antibodies in a broad range of applications. The program samples antibody sequences and structures by grafting structures from a widely accepted set of the canonical clusters of CDRs (North et al., J. Mol. Biol., 406:228–256, 2011). It then performs sequence design according to amino acid sequence profiles of each cluster, and samples CDR backbones using a flexible-backbone design protocol incorporating cluster-based CDR constraints. Starting from an existing experimental or computationally modeled antigen-antibody structure, RAbD can be used to redesign a single CDR or multiple CDRs with loops of different length, conformation, and sequence. We rigorously benchmarked RAbD on a set of 60 diverse antibody–antigen complexes, using two design strategies—optimizing total Rosetta energy and optimizing interface energy alone. We utilized two novel metrics for measuring success in computational protein design. The design risk ratio (DRR) is equal to the frequency of recovery of native CDR lengths and clusters divided by the frequency of sampling of those features during the Monte Carlo design procedure. Ratios greater than 1.0 indicate that the design process is picking out the native more frequently than expected from their sampled rate. We achieved DRRs for the non-H3 CDRs of between 2.4 and 4.0. The antigen risk ratio (ARR) is the ratio of frequencies of the native amino acid types, CDR lengths, and clusters in the output decoys for simulations performed in the presence and absence of the antigen. For CDRs, we achieved cluster ARRs as high as 2.5 for L1 and 1.5 for H2. For sequence design simulations without CDR grafting, the overall recovery for the native amino acid types for residues that contact the antigen in the native structures was 72% in simulations performed in the presence of the antigen and 48% in simulations performed without the antigen, for an ARR of 1.5. For the non-contacting residues, the ARR was 1.08. This shows that the sequence profiles are able to maintain the amino acid types of these conserved, buried sites, while recovery of the exposed, contacting residues requires the presence of the antigen-antibody interface. We tested RAbD experimentally on both a lambda and kappa antibody–antigen complex, successfully improving their affinities 10 to 50 fold by replacing individual CDRs of the native antibody with new CDR lengths and clusters.
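
The design risk ratio defined above is a simple frequency ratio; a tiny sketch of the calculation with made-up counts:

    def design_risk_ratio(n_recovered, n_designs, n_sampled, n_moves):
        """Frequency of native-feature recovery divided by its sampling frequency."""
        return (n_recovered / n_designs) / (n_sampled / n_moves)

    # e.g. the native CDR cluster retained in 60 of 100 final designs but proposed in only
    # 20% of Monte Carlo moves gives a DRR of 3.0
    print(design_risk_ratio(60, 100, 200, 1000))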