
Showing papers in "PLOS Computational Biology in 2017"


Journal ArticleDOI
TL;DR: Tests on both synthetic and real reads show Unicycler can assemble larger contigs with fewer misassemblies than other hybrid assemblers, even when long-read depth and accuracy are low.
Abstract: The Illumina DNA sequencing platform generates accurate but short reads, which can be used to produce accurate but fragmented genome assemblies. Pacific Biosciences and Oxford Nanopore Technologies DNA sequencing platforms generate long reads that can produce complete genome assemblies, but the sequencing is more expensive and error-prone. There is significant interest in combining data from these complementary sequencing technologies to generate more accurate "hybrid" assemblies. However, few tools exist that truly leverage the benefits of both types of data, namely the accuracy of short reads and the structural resolving power of long reads. Here we present Unicycler, a new tool for assembling bacterial genomes from a combination of short and long reads, which produces assemblies that are accurate, complete and cost-effective. Unicycler builds an initial assembly graph from short reads using the de novo assembler SPAdes and then simplifies the graph using information from short and long reads. Unicycler uses a novel semi-global aligner to align long reads to the assembly graph. Tests on both synthetic and real reads show Unicycler can assemble larger contigs with fewer misassemblies than other hybrid assemblers, even when long-read depth and accuracy are low. Unicycler is open source (GPLv3) and available at github.com/rrwick/Unicycler.
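
As a rough illustration of the hybrid-assembly workflow described above, the sketch below wraps a typical Unicycler invocation in Python. The flag names (-1/-2 for paired Illumina reads, -l for long reads, -o for the output directory) follow the tool's commonly documented usage, but the file names are placeholders and the options should be checked against the installed version.

    import subprocess

    # Hybrid assembly: accurate short reads plus structure-resolving long reads.
    # File names are placeholders; verify flags against your Unicycler version.
    subprocess.run(
        [
            "unicycler",
            "-1", "illumina_R1.fastq.gz",   # short paired-end reads
            "-2", "illumina_R2.fastq.gz",
            "-l", "long_reads.fastq.gz",    # ONT or PacBio reads
            "-o", "hybrid_assembly",        # output directory
        ],
        check=True,
    )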

2,245 citations


Journal ArticleDOI
TL;DR: mixOmics, an R package dedicated to the multivariate analysis of biological data sets with a specific focus on data exploration, dimension reduction and visualisation, is introduced; it extends Projection to Latent Structure models for discriminant analysis.
Abstract: The advent of high throughput technologies has led to a wealth of publicly available 'omics data coming from different sources, such as transcriptomics, proteomics, metabolomics. Combining such large-scale biological data sets can lead to the discovery of important biological insights, provided that relevant information can be extracted in a holistic manner. Current statistical approaches have been focusing on identifying small subsets of molecules (a 'molecular signature') to explain or predict biological conditions, but mainly for a single type of 'omics. In addition, commonly used methods are univariate and consider each biological feature independently. We introduce mixOmics, an R package dedicated to the multivariate analysis of biological data sets with a specific focus on data exploration, dimension reduction and visualisation. By adopting a systems biology approach, the toolkit provides a wide range of methods that statistically integrate several data sets at once to probe relationships between heterogeneous 'omics data sets. Our recent methods extend Projection to Latent Structure (PLS) models for discriminant analysis, for data integration across multiple 'omics data or across independent studies, and for the identification of molecular signatures. We illustrate our latest mixOmics integrative frameworks for the multivariate analyses of 'omics data available from the package.
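
mixOmics itself is an R package; as a language-neutral sketch of the core idea (a PLS-based discriminant analysis of a high-dimensional 'omics matrix), the following Python snippet uses scikit-learn's PLS implementation on simulated data. The data, component number, and feature ranking below are illustrative assumptions, not the package's API.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    # Simulated 'omics matrix: 60 samples x 500 features, two biological conditions.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((60, 500))
    y = np.repeat([0, 1], 30)
    X[y == 1, :10] += 1.5                      # 10 discriminative features

    # PLS-DA: regress a one-hot encoding of the classes on the feature matrix.
    Y = np.eye(2)[y]
    pls = PLSRegression(n_components=2).fit(X, Y)
    scores = pls.transform(X)                  # low-dimensional sample projection
    loadings = np.abs(pls.x_loadings_[:, 0])   # feature weights on component 1
    print(np.argsort(loadings)[-10:])          # candidate 'molecular signature'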

1,862 citations


Journal ArticleDOI
TL;DR: OpenMM is a molecular dynamics simulation toolkit with a unique focus on extensibility, which makes it an ideal tool for researchers developing new simulation methods, and also allows those new methods to be immediately available to the larger community.
Abstract: OpenMM is a molecular dynamics simulation toolkit with a unique focus on extensibility. It allows users to easily add new features, including forces with novel functional forms, new integration algorithms, and new simulation protocols. Those features automatically work on all supported hardware types (including both CPUs and GPUs) and perform well on all of them. In many cases they require minimal coding, just a mathematical description of the desired function. They also require no modification to OpenMM itself and can be distributed independently of OpenMM. This makes it an ideal tool for researchers developing new simulation methods, and also allows those new methods to be immediately available to the larger community.
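
The extensibility highlighted here is visible in OpenMM's Python API, where a new pairwise interaction can be supplied as a plain algebraic expression. The two-particle toy system below is a minimal sketch (recent versions import as openmm, older ones as simtk.openmm); the Lennard-Jones-style expression and parameter values are illustrative only.

    import openmm as mm
    import openmm.unit as unit

    # A custom pairwise force defined purely by a mathematical expression.
    force = mm.CustomNonbondedForce(
        "4*eps*((sig/r)^12 - (sig/r)^6); eps=sqrt(eps1*eps2); sig=0.5*(sig1+sig2)")
    force.addPerParticleParameter("sig")
    force.addPerParticleParameter("eps")

    system = mm.System()
    for _ in range(2):
        system.addParticle(39.9 * unit.amu)     # argon-like particles
        force.addParticle([0.34, 0.996])        # sigma (nm), epsilon (kJ/mol)
    system.addForce(force)

    integrator = mm.LangevinIntegrator(
        300 * unit.kelvin, 1.0 / unit.picosecond, 2.0 * unit.femtoseconds)
    context = mm.Context(system, integrator)
    context.setPositions([mm.Vec3(0, 0, 0), mm.Vec3(0.4, 0, 0)] * unit.nanometer)
    print(context.getState(getEnergy=True).getPotentialEnergy())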

1,364 citations


Journal ArticleDOI
TL;DR: A new deep learning method that predicts contacts by integrating both evolutionary coupling (EC) and sequence conservation information through an ultra-deep neural network formed by two deep residual neural networks that greatly outperforms existing methods and leads to much more accurate contact-assisted folding.
Abstract: Motivation Protein contacts contain key information for the understanding of protein structure and function and thus, contact prediction from sequence is an important problem. Recently exciting progress has been made on this problem, but the predicted contacts for proteins without many sequence homologs is still of low quality and not very useful for de novo structure prediction. Method This paper presents a new deep learning method that predicts contacts by integrating both evolutionary coupling (EC) and sequence conservation information through an ultra-deep neural network formed by two deep residual neural networks. The first residual network conducts a series of 1-dimensional convolutional transformation of sequential features; the second residual network conducts a series of 2-dimensional convolutional transformation of pairwise information including output of the first residual network, EC information and pairwise potential. By using very deep residual networks, we can accurately model contact occurrence patterns and complex sequence-structure relationship and thus, obtain higher-quality contact prediction regardless of how many sequence homologs are available for proteins in question. Results Our method greatly outperforms existing methods and leads to much more accurate contact-assisted folding. Tested on 105 CASP11 targets, 76 past CAMEO hard targets, and 398 membrane proteins, the average top L long-range prediction accuracy obtained by our method, one representative EC method CCMpred and the CASP11 winner MetaPSICOV is 0.47, 0.21 and 0.30, respectively; the average top L/10 long-range accuracy of our method, CCMpred and MetaPSICOV is 0.77, 0.47 and 0.59, respectively. Ab initio folding using our predicted contacts as restraints but without any force fields can yield correct folds (i.e., TMscore>0.6) for 203 of the 579 test proteins, while that using MetaPSICOV- and CCMpred-predicted contacts can do so for only 79 and 62 of them, respectively. Our contact-assisted models also have much better quality than template-based models especially for membrane proteins. The 3D models built from our contact prediction have TMscore>0.5 for 208 of the 398 membrane proteins, while those from homology modeling have TMscore>0.5 for only 10 of them. Further, even if trained mostly by soluble proteins, our deep learning method works very well on membrane proteins. In the recent blind CAMEO benchmark, our fully-automated web server implementing this method successfully folded 6 targets with a new fold and only 0.3L-2.3L effective sequence homologs, including one β protein of 182 residues, one α+β protein of 125 residues, one α protein of 140 residues, one α protein of 217 residues, one α/β of 260 residues and one α protein of 462 residues. Our method also achieved the highest F1 score on free-modeling targets in the latest CASP (Critical Assessment of Structure Prediction), although it was not fully implemented back then. Availability http://raptorx.uchicago.edu/ContactMap/
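
To make the architecture description concrete, here is a schematic 2D residual block operating on pairwise features, written in PyTorch on random tensors; it illustrates the residual-network idea rather than RaptorX's exact layer counts, normalization, or feature construction.

    import torch
    import torch.nn as nn

    class ResBlock2D(nn.Module):
        """One 2D residual block: two 3x3 convolutions plus an identity shortcut."""
        def __init__(self, channels=64):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.norm1 = nn.BatchNorm2d(channels)
            self.norm2 = nn.BatchNorm2d(channels)

        def forward(self, x):
            out = torch.relu(self.norm1(self.conv1(x)))
            out = self.norm2(self.conv2(out))
            return torch.relu(out + x)

    # Pairwise features (coevolution scores, pairwise potentials, ...) for a
    # length-L protein enter as a (batch, channels, L, L) tensor.
    L = 120
    pairwise = torch.randn(1, 64, L, L)
    head = nn.Conv2d(64, 1, kernel_size=1)      # per-residue-pair contact score
    logits = head(ResBlock2D(64)(pairwise))
    print(logits.shape)                         # torch.Size([1, 1, 120, 120])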

779 citations


Journal ArticleDOI
TL;DR: This study provides estimates of mixing patterns for societies for which contact data such as POLYMOD are not yet available, finding that contact patterns are highly assortative with age across all countries considered, although pronounced regional differences in the age-specific contacts at home were noticeable.
Abstract: Heterogeneities in contact networks have a major effect in determining whether a pathogen can become epidemic or persist at endemic levels. Epidemic models that determine which interventions can successfully prevent an outbreak need to account for social structure and mixing patterns. Contact patterns vary across age and locations (e.g. home, work, and school), and including them as predictors in transmission dynamic models of pathogens that spread socially will improve the models' realism. Data from population-based contact diaries in eight European countries from the POLYMOD study were projected to 144 other countries using a Bayesian hierarchical model that estimated the proclivity of age-and-location-specific contact patterns for the countries, using Markov chain Monte Carlo simulation. Household level data from the Demographic and Health Surveys for nine lower-income countries and socio-demographic factors from several on-line databases for 152 countries were used to quantify similarity of countries to estimate contact patterns in the home, work, school and other locations for countries for which no contact data are available, accounting for demographic structure, household structure where known, and a variety of metrics including workforce participation and school enrolment. Contacts are highly assortative with age across all countries considered, but pronounced regional differences in the age-specific contacts at home were noticeable, with more inter-generational contacts in Asian countries than in other settings. Moreover, there were variations in contact patterns by location, with work-place contacts being least assortative. These variations led to differences in the effect of social distancing measures in an age structured epidemic model. Contacts have an important role in transmission dynamic models that use contact rates to characterize the spread of contact-transmissible diseases. This study provides estimates of mixing patterns for societies for which contact data such as POLYMOD are not yet available.
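
To show where such contact matrices enter a transmission model, the equations below give a generic age-structured SIR formulation in which the estimated matrix C couples age groups; this is the standard construction rather than the paper's specific model.

    \lambda_i(t) = \beta \sum_{j} C_{ij}\,\frac{I_j(t)}{N_j}, \qquad
    \frac{dS_i}{dt} = -\lambda_i S_i, \quad
    \frac{dI_i}{dt} = \lambda_i S_i - \gamma I_i, \quad
    \frac{dR_i}{dt} = \gamma I_i

Here C_{ij} is the mean number of contacts an individual in age group i has with group j, so changing C (for example by school closure) directly changes the force of infection \lambda_i.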

734 citations


Journal ArticleDOI
TL;DR: The first attempts to study the whole transcriptome began in the early 1990s, and technological advances since the late 1990s have made transcriptomics a widespread discipline that has enabled the study of how gene expression changes in different organisms and has been instrumental in the understanding of human disease.
Abstract: Transcriptomics technologies are the techniques used to study an organism’s transcriptome, the sum of all of its RNA transcripts. The information content of an organism is recorded in the DNA of its genome and expressed through transcription. Here, mRNA serves as a transient intermediary molecule in the information network, whilst noncoding RNAs perform additional diverse functions. A transcriptome captures a snapshot in time of the total transcripts present in a cell. The first attempts to study the whole transcriptome began in the early 1990s, and technological advances since the late 1990s have made transcriptomics a widespread discipline. Transcriptomics has been defined by repeated technological innovations that transform the field. There are two key contemporary techniques in the field: microarrays, which quantify a set of predetermined sequences, and RNA sequencing (RNA-Seq), which uses high-throughput sequencing to capture all sequences. Measuring the expression of an organism’s genes in different tissues, conditions, or time points gives information on how genes are regulated and reveals details of an organism’s biology. It can also help to infer the functions of previously unannotated genes. Transcriptomic analysis has enabled the study of how gene expression changes in different organisms and has been instrumental in the understanding of human disease. An analysis of gene expression in its entirety allows detection of broad coordinated trends which cannot be discerned by more targeted assays.

525 citations


Journal ArticleDOI
TL;DR: Metacoder, an R package for easily parsing, manipulating, and graphing publication-ready plots of hierarchical data, designed for data from metabarcoding research, can easily be applied to any data that has a hierarchical component such as gene ontology or geographic location data.
Abstract: Community-level data, the type generated by an increasing number of metabarcoding studies, is often graphed as stacked bar charts or pie graphs that use color to represent taxa. These graph types do not convey the hierarchical structure of taxonomic classifications and are limited by the use of color for categories. As an alternative, we developed metacoder, an R package for easily parsing, manipulating, and graphing publication-ready plots of hierarchical data. Metacoder includes a dynamic and flexible function that can parse most text-based formats that contain taxonomic classifications, taxon names, taxon identifiers, or sequence identifiers. Metacoder can then subset, sample, and order this parsed data using a set of intuitive functions that take into account the hierarchical nature of the data. Finally, an extremely flexible plotting function enables quantitative representation of up to 4 arbitrary statistics simultaneously in a tree format by mapping statistics to the color and size of tree nodes and edges. Metacoder also allows exploration of barcode primer bias by integrating functions to run digital PCR. Although it has been designed for data from metabarcoding research, metacoder can easily be applied to any data that has a hierarchical component such as gene ontology or geographic location data. Our package complements currently available tools for community analysis and is provided open source with an extensive online user manual.

409 citations


Journal ArticleDOI
TL;DR: Mindboggle’s algorithms are evaluated using the largest set of manually labeled, publicly available brain images in the world and compared against state-of-the-art algorithms where they exist; all data, code, and results are publicly available.
Abstract: Mindboggle (http://mindboggle.info) is an open source brain morphometry platform that takes in preprocessed T1-weighted MRI data and outputs volume, surface, and tabular data containing label, feature, and shape information for further analysis. In this article, we document the software and demonstrate its use in studies of shape variation in healthy and diseased humans. The number of different shape measures and the size of the populations make this the largest and most detailed shape analysis of human brains ever conducted. Brain image morphometry shows great potential for providing much-needed biological markers for diagnosing, tracking, and predicting progression of mental health disorders. Very few software algorithms provide more than measures of volume and cortical thickness, while more subtle shape measures may provide more sensitive and specific biomarkers. Mindboggle computes a variety of (primarily surface-based) shapes: area, volume, thickness, curvature, depth, Laplace-Beltrami spectra, Zernike moments, etc. We evaluate Mindboggle’s algorithms using the largest set of manually labeled, publicly available brain images in the world and compare them against state-of-the-art algorithms where they exist. All data, code, and results of these evaluations are publicly available.

403 citations


Journal ArticleDOI
TL;DR: The algorithm is a generalization of the pool adjacent violators algorithm (PAVA) for isotonic regression and inherits its linear-time computational complexity, yielding remarkable increases in processing speed: more than one order of magnitude compared to currently employed state-of-the-art convex solvers relying on interior point methods.
Abstract: Fluorescent calcium indicators are a popular means for observing the spiking activity of large neuronal populations, but extracting the activity of each neuron from raw fluorescence calcium imaging data is a nontrivial problem. We present a fast online active set method to solve this sparse non-negative deconvolution problem. Importantly, the algorithm progresses through each time series sequentially from beginning to end, thus enabling real-time online estimation of neural activity during the imaging session. Our algorithm is a generalization of the pool adjacent violators algorithm (PAVA) for isotonic regression and inherits its linear-time computational complexity. We gain remarkable increases in processing speed: more than one order of magnitude compared to currently employed state of the art convex solvers relying on interior point methods. Unlike these approaches, our method can exploit warm starts; therefore optimizing model hyperparameters only requires a handful of passes through the data. A minor modification can further improve the quality of activity inference by imposing a constraint on the minimum spike size. The algorithm enables real-time simultaneous deconvolution of O(10^5) traces of whole-brain larval zebrafish imaging data on a laptop.
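
As a reference point for the optimization problem being solved (though not the paper's fast online PAVA-style algorithm), the sketch below poses non-negative deconvolution of an AR(1) calcium trace as a non-negative least-squares problem on simulated data; the decay constant and noise level are arbitrary assumptions.

    import numpy as np
    from scipy.optimize import nnls

    # Simulate a noisy AR(1) calcium trace: c_t = gamma*c_{t-1} + s_t
    rng = np.random.default_rng(0)
    T, gamma = 200, 0.95
    s_true = (rng.random(T) < 0.05) * rng.uniform(0.5, 1.5, T)   # sparse spikes
    c_true = np.zeros(T)
    for t in range(T):
        c_true[t] = (gamma * c_true[t - 1] if t else 0.0) + s_true[t]
    y = c_true + 0.15 * rng.standard_normal(T)

    # y ~ K s with K[t, k] = gamma**(t - k) for k <= t; solve min ||K s - y||, s >= 0
    K = np.tril(gamma ** np.subtract.outer(np.arange(T), np.arange(T)))
    s_hat, _ = nnls(K, y)
    print(np.round(s_hat[:10], 2))   # inferred spike amplitudes, first 10 frames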

390 citations


Journal ArticleDOI
TL;DR: The reliable performance of the Path-Based MiRNA-Disease Association (PBMDA) model is demonstrated, showing that PBMDA could serve as a powerful computational tool to accelerate the identification of disease-miRNA associations.
Abstract: In the recent few years, an increasing number of studies have shown that microRNAs (miRNAs) play critical roles in many fundamental and important biological processes. As one of pathogenetic factors, the molecular mechanisms underlying human complex diseases still have not been completely understood from the perspective of miRNA. Predicting potential miRNA-disease associations makes important contributions to understanding the pathogenesis of diseases, developing new drugs, and formulating individualized diagnosis and treatment for diverse human complex diseases. Instead of only depending on expensive and time-consuming biological experiments, computational prediction models are effective by predicting potential miRNA-disease associations, prioritizing candidate miRNAs for the investigated diseases, and selecting those miRNAs with higher association probabilities for further experimental validation. In this study, Path-Based MiRNA-Disease Association (PBMDA) prediction model was proposed by integrating known human miRNA-disease associations, miRNA functional similarity, disease semantic similarity, and Gaussian interaction profile kernel similarity for miRNAs and diseases. This model constructed a heterogeneous graph consisting of three interlinked sub-graphs and further adopted depth-first search algorithm to infer potential miRNA-disease associations. As a result, PBMDA achieved reliable performance in the frameworks of both local and global LOOCV (AUCs of 0.8341 and 0.9169, respectively) and 5-fold cross validation (average AUC of 0.9172). In the cases studies of three important human diseases, 88% (Esophageal Neoplasms), 88% (Kidney Neoplasms) and 90% (Colon Neoplasms) of top-50 predicted miRNAs have been manually confirmed by previous experimental reports from literatures. Through the comparison performance between PBMDA and other previous models in case studies, the reliable performance also demonstrates that PBMDA could serve as a powerful computational tool to accelerate the identification of disease-miRNA associations.

323 citations


Journal ArticleDOI
TL;DR: An improved and easy-to-use circRNA read simulator that can produce mimicking backsplicing reads supporting circRNAs deposited in CircBase is provided and the performance of 11 circRNA detection tools on both simulated and real datasets is compared.
Abstract: Circular RNA (circRNA) is mainly generated by the splice donor of a downstream exon joining to an upstream splice acceptor, a phenomenon known as backsplicing. It has been reported that circRNA can function as microRNA (miRNA) sponges, transcriptional regulators, or potential biomarkers. The availability of massive non-polyadenylated transcriptomes data has facilitated the genome-wide identification of thousands of circRNAs. Several circRNA detection tools or pipelines have recently been developed, and it is essential to provide useful guidelines on these pipelines for users, including a comprehensive and unbiased comparison. Here, we provide an improved and easy-to-use circRNA read simulator that can produce mimicking backsplicing reads supporting circRNAs deposited in CircBase. Moreover, we compared the performance of 11 circRNA detection tools on both simulated and real datasets. We assessed their performance regarding metrics such as precision, sensitivity, F1 score, and Area under Curve. It is concluded that no single method dominated on all of these metrics. Among all of the state-of-the-art tools, CIRI, CIRCexplorer, and KNIFE, which achieved better balanced performance between their precision and sensitivity, compared favorably to the other methods.

Journal ArticleDOI
TL;DR: In this article, the authors present a set of good computing practices that every researcher can adopt, regardless of their current level of computational skill, which encompass data management, programming, collaborating with colleagues, organizing projects, tracking work, and writing manuscripts.
Abstract: Author summary Computers are now essential in all branches of science, but most researchers are never taught the equivalent of basic lab skills for research computing. As a result, data can get lost, analyses can take much longer than necessary, and researchers are limited in how effectively they can work with software and data. Computing workflows need to follow the same practices as lab projects and notebooks, with organized data, documented steps, and the project structured for reproducibility, but researchers new to computing often don't know where to start. This paper presents a set of good computing practices that every researcher can adopt, regardless of their current level of computational skill. These practices, which encompass data management, programming, collaborating with colleagues, organizing projects, tracking work, and writing manuscripts, are drawn from a wide variety of published sources from our daily lives and from our work with volunteer organizations that have delivered workshops to over 11,000 people since 2010.
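
One directory layout consistent with the organizational practices summarized above (read-only raw data, scripted analyses, regenerable results) might look like the sketch below; the specific folder names are illustrative, not prescribed by the paper.

    project/
        README.md      notes on what the project is and how to reproduce it
        data/          raw data, treated as read-only
        src/           scripts and reusable functions, under version control
        results/       generated outputs, safe to delete and regenerate
        doc/           manuscript drafts and documentation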

Journal ArticleDOI
TL;DR: A common mathematical framework is developed for understanding the relationship of these three methods, which share one core commonality: all three evaluate the second moment of the distribution of activity profiles, which determines the representational geometry, and thus how well any feature can be decoded from population activity.
Abstract: Representational models specify how activity patterns in populations of neurons (or, more generally, in multivariate brain-activity measurements) relate to sensory stimuli, motor responses, or cognitive processes. In an experimental context, representational models can be defined as hypotheses about the distribution of activity profiles across experimental conditions. Currently, three different methods are being used to test such hypotheses: encoding analysis, pattern component modeling (PCM), and representational similarity analysis (RSA). Here we develop a common mathematical framework for understanding the relationship of these three methods, which share one core commonality: all three evaluate the second moment of the distribution of activity profiles, which determines the representational geometry, and thus how well any feature can be decoded from population activity. Using simulated data for three different experimental designs, we compare the power of the methods to adjudicate between competing representational models. PCM implements a likelihood-ratio test and therefore provides the most powerful test if its assumptions hold. However, the other two approaches-when conducted appropriately-can perform similarly. In encoding analysis, the linear model needs to be appropriately regularized, which effectively imposes a prior on the activity profiles. With such a prior, an encoding model specifies a well-defined distribution of activity profiles. In RSA, the unequal variances and statistical dependencies of the dissimilarity estimates need to be taken into account to reach near-optimal power in inference. The three methods render different aspects of the information explicit (e.g. single-response tuning in encoding analysis and population-response representational dissimilarity in RSA) and have specific advantages in terms of computational demands, ease of use, and extensibility. The three methods are properly construed as complementary components of a single data-analytical toolkit for understanding neural representations on the basis of multivariate brain-activity data.
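
The shared quantity can be written compactly: with U the conditions-by-channels matrix of activity profiles (P channels), the second moment and its relation to the squared Euclidean dissimilarities used in RSA are, in the usual notation,

    G = \frac{1}{P}\, U U^{\top}, \qquad
    d_{ij} = G_{ii} + G_{jj} - 2\,G_{ij}

so encoding analysis (through the prior it places on activity profiles), PCM (through the modeled covariance), and RSA (through the d_{ij}) all constrain the same matrix G.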

Journal ArticleDOI
TL;DR: The tools of big data research are increasingly woven into the authors' daily lives, including mining digital medical records for scientific and economic insights, mapping relationships via social media, capturing individuals’ speech and action via sensors, tracking movement across space, shaping police and security policy via “predictive policing,” and much more.
Abstract: The use of big data research methods has grown tremendously over the past five years in both academia and industry. As the size and complexity of available datasets has grown, so too have the ethical questions raised by big data research. These questions become increasingly urgent as data and research agendas move well beyond those typical of the computational and natural sciences, to more directly address sensitive aspects of human behavior, interaction, and health. The tools of big data research are increasingly woven into our daily lives, including mining digital medical records for scientific and economic insights, mapping relationships via social media, capturing individuals’ speech and action via sensors, tracking movement across space, shaping police and security policy via “predictive policing,” and much more. The beneficial possibilities for big data in science and industry are tempered by new challenges facing researchers that often lie outside their training and comfort zone. Social scientists now grapple with data structures and cloud computing, while computer scientists must contend with human subject protocols and institutional review boards (IRBs). While the connection between individual datum and actual human beings can appear quite abstract, the scope, scale, and complexity of many forms of big data creates a rich ecosystem in which human participants and their communities are deeply embedded and susceptible to harm. This complexity challenges any normative set of rules and makes devising universal guidelines difficult. Nevertheless, the need for direction in responsible big data research is evident, and this article provides a set of “ten simple rules” for addressing the complex ethical issues that will inevitably arise. Modeled on PLOS Computational Biology’s ongoing collection of rules, the recommendations we outline involve more nuance than the words “simple” and “rules” suggest. This nuance is inevitably tied to our paper’s starting premise: all big data research on social, medical, psychological, and economic phenomena engages with human subjects, and researchers have the ethical responsibility to minimize potential harm. The variety in data sources, research topics, and methodological approaches in big data belies a one-size-fits-all checklist; as a result, these rules are less specific than some might hope. Rather, we exhort researchers to recognize the human participants and complex systems contained within their data and make grappling with ethical questions part of their standard workflow. Towards this end, we structure the first five rules around how to reduce the chance of harm resulting from big data research practices; the second five rules focus on ways researchers can contribute to building best practices that fit their disciplinary and methodological approaches. At the core of these rules, we challenge big data researchers who consider their data disentangled from the ability to harm to reexamine their assumptions. The examples in this paper show how often even seemingly innocuous and anonymized data have produced unanticipated ethical questions and detrimental impacts. This paper is a result of a two-year National Science Foundation (NSF)-funded project that established the Council for Big Data, Ethics, and Society, a group of 20 scholars from a wide range of social, natural, and computational sciences (http://bdes.datasociety.net/). 
The Council was charged with providing guidance to the NSF on how to best encourage ethical practices in scientific and engineering research, utilizing big data research methods and infrastructures [1].

Journal ArticleDOI
TL;DR: This work introduces a framework for creating, testing, versioning and archiving portable applications for analyzing neuroimaging data organized and described in compliance with the Brain Imaging Data Structure (BIDS).
Abstract: The rate of progress in human neurosciences is limited by the inability to easily apply a wide range of analysis methods to the plethora of different datasets acquired in labs around the world. In this work, we introduce a framework for creating, testing, versioning and archiving portable applications for analyzing neuroimaging data organized and described in compliance with the Brain Imaging Data Structure (BIDS). The portability of these applications (BIDS Apps) is achieved by using container technologies that encapsulate all binary and other dependencies in one convenient package. BIDS Apps run on all three major operating systems with no need for complex setup and configuration and thanks to the comprehensiveness of the BIDS standard they require little manual user input. Previous containerized data processing solutions were limited to single user environments and not compatible with most multi-tenant High Performance Computing systems. BIDS Apps overcome this limitation by taking advantage of the Singularity container technology. As a proof of concept, this work is accompanied by 22 ready to use BIDS Apps, packaging a diverse set of commonly used neuroimaging algorithms.

Journal ArticleDOI
TL;DR: TADbit provides three-dimensional models built from 3C-based experiments, which are ready for visualization and for characterizing their relation to gene expression and epigenetic states, and TADbit is an open-source Python library available for download.
Abstract: The sequence of a genome is insufficient to understand all genomic processes carried out in the cell nucleus. To achieve this, the knowledge of its three-dimensional architecture is necessary. Advances in genomic technologies and the development of new analytical methods, such as Chromosome Conformation Capture (3C) and its derivatives, provide unprecedented insights in the spatial organization of genomes. Here we present TADbit, a computational framework to analyze and model the chromatin fiber in three dimensions. Our package takes as input the sequencing reads of 3C-based experiments and performs the following main tasks: (i) pre-process the reads, (ii) map the reads to a reference genome, (iii) filter and normalize the interaction data, (iv) analyze the resulting interaction matrices, (v) build 3D models of selected genomic domains, and (vi) analyze the resulting models to characterize their structural properties. To illustrate the use of TADbit, we automatically modeled 50 genomic domains from the fly genome revealing differential structural features of the previously defined chromatin colors, establishing a link between the conformation of the genome and the local chromatin composition. TADbit provides three-dimensional models built from 3C-based experiments, which are ready for visualization and for characterizing their relation to gene expression and epigenetic states. TADbit is an open-source Python library available for download from https://github.com/3DGenomes/tadbit.

Journal ArticleDOI
TL;DR: A multi-task multichannel topological convolutional neural network (MM-TCNN) is proposed that outperforms the latest methods in the prediction of protein-ligand binding affinities, mutation-induced globular protein folding free energy changes, and mutation-induced membrane protein folding free energy changes.
Abstract: Although deep learning approaches have had tremendous success in image, video and audio processing, computer vision, and speech recognition, their applications to three-dimensional (3D) biomolecular structural data sets have been hindered by the geometric and biological complexity. To address this problem we introduce the element-specific persistent homology (ESPH) method. ESPH represents 3D complex geometry by one-dimensional (1D) topological invariants and retains important biological information via a multichannel image-like representation. This representation reveals hidden structure-function relationships in biomolecules. We further integrate ESPH and deep convolutional neural networks to construct a multichannel topological neural network (TopologyNet) for the predictions of protein-ligand binding affinities and protein stability changes upon mutation. To overcome the deep learning limitations from small and noisy training sets, we propose a multi-task multichannel topological convolutional neural network (MM-TCNN). We demonstrate that TopologyNet outperforms the latest methods in the prediction of protein-ligand binding affinities, mutation induced globular protein folding free energy changes, and mutation induced membrane protein folding free energy changes. Availability: weilab.math.msu.edu/TDL/
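
A toy version of the "multichannel image-like representation" can be sketched as follows: each element-specific barcode (a list of birth-death intervals) is binned over a filtration radius to give one 1D channel. The barcode values, bin settings, and channel names below are made up for illustration and are not the paper's featurization.

    import numpy as np

    # Hypothetical element-specific barcodes: (birth, death) intervals in Angstroms,
    # one channel per element pair; all values here are invented.
    barcodes = {
        "C-C": [(0.0, 1.8), (0.0, 2.9), (1.2, 4.5)],
        "C-N": [(0.0, 1.5), (0.4, 3.1)],
    }

    def barcode_to_channel(intervals, r_max=6.0, n_bins=60):
        """Count how many bars persist in each radius bin (a 1D image-like feature)."""
        edges = np.linspace(0.0, r_max, n_bins + 1)
        centers = 0.5 * (edges[:-1] + edges[1:])
        channel = np.zeros(n_bins)
        for birth, death in intervals:
            channel += (centers >= birth) & (centers < death)
        return channel

    image = np.stack([barcode_to_channel(b) for b in barcodes.values()])
    print(image.shape)   # (n_channels, n_bins): input for a convolutional network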

Journal ArticleDOI
TL;DR: This work lays out a family of approaches by which model-based computation may be built upon a core of TD learning, and suggests that this framework represents a neurally plausible family of mechanisms for model-based evaluation.
Abstract: Humans and animals are capable of evaluating actions by considering their long-run future rewards through a process described using model-based reinforcement learning (RL) algorithms. The mechanisms by which neural circuits perform the computations prescribed by model-based RL remain largely unknown; however, multiple lines of evidence suggest that neural circuits supporting model-based behavior are structurally homologous to and overlapping with those thought to carry out model-free temporal difference (TD) learning. Here, we lay out a family of approaches by which model-based computation may be built upon a core of TD learning. The foundation of this framework is the successor representation, a predictive state representation that, when combined with TD learning of value predictions, can produce a subset of the behaviors associated with model-based learning, while requiring less decision-time computation than dynamic programming. Using simulations, we delineate the precise behavioral capabilities enabled by evaluating actions using this approach, and compare them to those demonstrated by biological organisms. We then introduce two new algorithms that build upon the successor representation while progressively mitigating its limitations. Because this framework can account for the full range of observed putatively model-based behaviors while still utilizing a core TD framework, we suggest that it represents a neurally plausible family of mechanisms for model-based evaluation.
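
A minimal sketch of the framework's foundation, TD learning of the successor representation on a small random-walk MDP (numpy only; the environment, policy, and learning rates are illustrative, not the paper's extended algorithms):

    import numpy as np

    # TD learning of the successor representation M on an 8-state ring world.
    rng = np.random.default_rng(1)
    n_states, gamma, alpha = 8, 0.9, 0.1
    M = np.zeros((n_states, n_states))          # expected discounted future occupancy
    R = np.zeros(n_states); R[5] = 1.0          # reward only at state 5

    s = 0
    for _ in range(20000):
        s_next = (s + rng.choice([1, -1])) % n_states      # random-walk policy
        target = np.eye(n_states)[s] + gamma * M[s_next]   # TD target for row M[s]
        M[s] += alpha * (target - M[s])
        s = s_next

    V = M @ R        # values recombine cached occupancy predictions with rewards
    print(np.round(V, 2))

Because V = M R, the cached long-run occupancy predictions can be recombined with new reward estimates without dynamic-programming-style replanning, which is the kind of behavior the paper compares against fully model-based evaluation.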

Journal ArticleDOI
TL;DR: It is shown that the approaches reveal interesting structure in the data but do not meaningfully describe the hierarchy of information processing in the microprocessor, suggesting current analytic approaches in neuroscience may fall short of producing meaningful understanding of neural systems, regardless of the amount of data.
Abstract: There is a popular belief in neuroscience that we are primarily data limited, and that producing large, multimodal, and complex datasets will, with the help of advanced data analysis algorithms, lead to fundamental insights into the way the brain processes information. These datasets do not yet exist, and if they did we would have no way of evaluating whether or not the algorithmically-generated insights were sufficient or even correct. To address this, here we take a classical microprocessor as a model organism, and use our ability to perform arbitrary experiments on it to see if popular data analysis methods from neuroscience can elucidate the way it processes information. Microprocessors are among those artificial information processing systems that are both complex and that we understand at all levels, from the overall logical flow, via logical gates, to the dynamics of transistors. We show that the approaches reveal interesting structure in the data but do not meaningfully describe the hierarchy of information processing in the microprocessor. This suggests current analytic approaches in neuroscience may fall short of producing meaningful understanding of neural systems, regardless of the amount of data. Additionally, we argue for scientists using complex non-linear dynamical systems with known ground truth, such as the microprocessor as a validation platform for time-series and structure discovery methods.

Journal ArticleDOI
TL;DR: This paper compares model implementations using four case studies, chosen to reflect the key cellular processes of proliferation, adhesion, and short- and long-range signalling, and demonstrates the applicability of each model and provides a guide for model usage.
Abstract: The coordinated behaviour of populations of cells plays a central role in tissue growth and renewal. Cells react to their microenvironment by modulating processes such as movement, growth and proliferation, and signalling. Alongside experimental studies, computational models offer a useful means by which to investigate these processes. To this end a variety of cell-based modelling approaches have been developed, ranging from lattice-based cellular automata to lattice-free models that treat cells as point-like particles or extended shapes. However, it remains unclear how these approaches compare when applied to the same biological problem, and what differences in behaviour are due to different model assumptions and abstractions. Here, we exploit the availability of an implementation of five popular cell-based modelling approaches within a consistent computational framework, Chaste (http://www.cs.ox.ac.uk/chaste). This framework allows one to easily change constitutive assumptions within these models. In each case we provide full details of all technical aspects of our model implementations. We compare model implementations using four case studies, chosen to reflect the key cellular processes of proliferation, adhesion, and short- and long-range signalling. These case studies demonstrate the applicability of each model and provide a guide for model usage.

Journal ArticleDOI
TL;DR: A computational model named Laplacian Regularized Sparse Subspace Learning for MiRNA-Disease Association prediction (LRSSLMDA) is presented, which projects miRNAs/diseases’ statistical and graph theoretical feature profiles to a common subspace and is shown to be a valuable computational tool for miRNA-disease association prediction.
Abstract: Predicting novel microRNA (miRNA)-disease associations is clinically significant due to miRNAs' potential roles of diagnostic biomarkers and therapeutic targets for various human diseases. Previous studies have demonstrated the viability of utilizing different types of biological data to computationally infer new disease-related miRNAs. Yet researchers face the challenge of how to effectively integrate diverse datasets and make reliable predictions. In this study, we presented a computational model named Laplacian Regularized Sparse Subspace Learning for MiRNA-Disease Association prediction (LRSSLMDA), which projected miRNAs/diseases' statistical feature profile and graph theoretical feature profile to a common subspace. It used Laplacian regularization to preserve the local structures of the training data and a L1-norm constraint to select important miRNA/disease features for prediction. The strength of dimensionality reduction enabled the model to be easily extended to much higher dimensional datasets than those exploited in this study. Experimental results showed that LRSSLMDA outperformed ten previous models: the AUC of 0.9178 in global leave-one-out cross validation (LOOCV) and the AUC of 0.8418 in local LOOCV indicated the model's superior prediction accuracy; and the average AUC of 0.9181+/-0.0004 in 5-fold cross validation justified its accuracy and stability. In addition, three types of case studies further demonstrated its predictive power. Potential miRNAs related to Colon Neoplasms, Lymphoma, Kidney Neoplasms, Esophageal Neoplasms and Breast Neoplasms were predicted by LRSSLMDA. Respectively, 98%, 88%, 96%, 98% and 98% out of the top 50 predictions were validated by experimental evidences. Therefore, we conclude that LRSSLMDA would be a valuable computational tool for miRNA-disease association prediction.
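
The abstract's ingredients (a common subspace projection, an L1 sparsity constraint, and Laplacian regularization) correspond to a generic objective of the following form; this is a generic template of such methods, not necessarily the paper's exact formulation.

    \min_{W}\; \lVert X W - Y \rVert_F^2
      \;+\; \lambda_1 \lVert W \rVert_1
      \;+\; \lambda_2\,\mathrm{tr}\!\left(W^{\top} L W\right)

Here X stacks the miRNA/disease feature profiles, Y encodes known associations, and L is a graph Laplacian built from the similarity networks, so the trace term penalizes projections that vary sharply between similar miRNAs or diseases.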

Journal ArticleDOI
TL;DR: The approach recapitulates and refines known motifs for 43 of the most frequent alleles, uncovers new motifs and provides a scalable framework to incorporate additional HLA peptidomics studies in the future and improves neo-antigen and cancer testis antigen predictions.
Abstract: The precise identification of Human Leukocyte Antigen class I (HLA-I) binding motifs plays a central role in our ability to understand and predict (neo-)antigen presentation in infectious diseases and cancer. Here, by exploiting co-occurrence of HLA-I alleles across ten newly generated as well as forty public HLA peptidomics datasets comprising more than 115,000 unique peptides, we show that we can rapidly and accurately identify many HLA-I binding motifs and map them to their corresponding alleles without any a priori knowledge of HLA-I binding specificity. Our approach recapitulates and refines known motifs for 43 of the most frequent alleles, uncovers new motifs for 9 alleles that up to now had less than five known ligands and provides a scalable framework to incorporate additional HLA peptidomics studies in the future. The refined motifs improve neo-antigen and cancer testis antigen predictions, indicating that unbiased HLA peptidomics data are ideal for in silico predictions of neo-antigens from tumor exome sequencing data. The new motifs further reveal distant modulation of the binding specificity at P2 for some HLA-I alleles by residues in the HLA-I binding site but outside of the B-pocket and we unravel the underlying mechanisms by protein structure analysis, mutagenesis and in vitro binding assays.

Journal ArticleDOI
TL;DR: The Active Vertex Model (AVM) is proposed for cell-resolution studies of the mechanics of confluent epithelial tissues consisting of tens of thousands of cells, with a level of detail inaccessible to similar methods.
Abstract: We introduce an Active Vertex Model (AVM) for cell-resolution studies of the mechanics of confluent epithelial tissues consisting of tens of thousands of cells, with a level of detail inaccessible to similar methods. The AVM combines the Vertex Model for confluent epithelial tissues with active matter dynamics. This introduces a natural description of the cell motion and accounts for motion patterns observed on multiple scales. Furthermore, cell contacts are generated dynamically from positions of cell centres. This not only enables efficient numerical implementation, but provides a natural description of the T1 transition events responsible for local tissue rearrangements. The AVM also includes cell alignment, cell-specific mechanical properties, cell growth, division and apoptosis. In addition, the AVM introduces a flexible, dynamically changing boundary of the epithelial sheet allowing for studies of phenomena such as the fingering instability or wound healing. We illustrate these capabilities with a number of case studies.
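
For orientation, models in this family combine the standard vertex-model energy with overdamped active dynamics of the cells, roughly as below; conventions and terms differ between implementations, so this is a generic sketch rather than the AVM's exact equations.

    E = \sum_{c}\left[\frac{K_c}{2}\,\bigl(A_c - A_c^{0}\bigr)^{2}
          + \frac{\Gamma_c}{2}\,P_c^{2}\right]
      + \sum_{\langle i,j\rangle}\Lambda_{ij}\, l_{ij},
    \qquad
    \dot{\mathbf{r}}_i = -\frac{1}{\zeta}\,\nabla_{\mathbf{r}_i} E + v_0\,\mathbf{n}_i

Here A_c and P_c are the area and perimeter of cell c, l_{ij} are junction lengths, and v_0 n_i is a self-propulsion term along a slowly rotating polarity direction, which supplies the "active matter" part of the dynamics.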

Journal ArticleDOI
TL;DR: This study combines constraint-based and individual-based modeling techniques into the R package BacArena to generate novel biological insights into Pseudomonas aeruginosa biofilm formation as well as a seven species model community of the human gut.
Abstract: Recent advances focusing on the metabolic interactions within and between cellular populations have emphasized the importance of microbial communities for human health. Constraint-based modeling, with flux balance analysis in particular, has been established as a key approach for studying microbial metabolism, whereas individual-based modeling has been commonly used to study complex dynamics between interacting organisms. In this study, we combine both techniques into the R package BacArena (https://cran.r-project.org/package=BacArena) to generate novel biological insights into Pseudomonas aeruginosa biofilm formation as well as a seven species model community of the human gut. For our P. aeruginosa model, we found that cross-feeding of fermentation products causes a spatial differentiation of emerging metabolic phenotypes in the biofilm over time. In the human gut model community, we found that spatial gradients of mucus glycans are important for niche formations which shape the overall community structure. Additionally, we could provide novel hypotheses concerning the metabolic interactions between the microbes. These results demonstrate the importance of spatial and temporal multi-scale modeling approaches such as BacArena.
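
The constraint-based core referred to here is flux balance analysis, the linear program

    \max_{v}\; c^{\top} v
    \quad \text{subject to} \quad
    S\,v = 0, \qquad v_{\min} \le v \le v_{\max}

where S is the stoichiometric matrix, v the reaction fluxes, and c an objective such as biomass production; the individual-based layer then places many such models on a spatial grid and iterates growth, movement, and metabolite exchange between them.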

Journal ArticleDOI
TL;DR: Gravity model estimates indicate a sharp decay in influenza transmission with the distance between infectious and susceptible cities, consistent with spread dominated by work commutes rather than air traffic.
Abstract: Seasonal influenza epidemics offer unique opportunities to study the invasion and re-invasion waves of a pathogen in a partially immune population. Detailed patterns of spread remain elusive, however, due to lack of granular disease data. Here we model high-volume city-level medical claims data and human mobility proxies to explore the drivers of influenza spread in the US during 2002-2010. Although the speed and pathways of spread varied across seasons, seven of eight epidemics likely originated in the Southern US. Each epidemic was associated with 1-5 early long-range transmission events, half of which sparked onward transmission. Gravity model estimates indicate a sharp decay in influenza transmission with the distance between infectious and susceptible cities, consistent with spread dominated by work commutes rather than air traffic. Two early-onset seasons associated with antigenic novelty had particularly localized modes of spread, suggesting that novel strains may spread in a more localized fashion than previously anticipated.
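
The gravity model referred to takes the generic form

    F_{ij} \;\propto\; \frac{N_i^{\alpha}\, N_j^{\beta}}{d_{ij}^{\,\gamma}}

coupling transmission between cities i and j through their population sizes N and the distance d_{ij}; a large distance exponent \gamma produces the sharp distance decay reported. The paper's exact parameterization and covariates may differ from this generic form.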

Journal ArticleDOI
TL;DR: In this article, a hierarchical RBC extraction method is proposed to detect the RBC region of interest (ROI) from the background and then separate touching RBCs in the ROI images by applying an improved random walk method based on automatic seed generation.
Abstract: Sickle cell disease (SCD) is a hematological disorder leading to blood vessel occlusion accompanied by painful episodes and even death. Red blood cells (RBCs) of SCD patients have diverse shapes that reveal important biomechanical and bio-rheological characteristics, e.g. their density, fragility, adhesive properties, etc. Hence, having an objective and effective way of RBC shape quantification and classification will lead to better insights and eventual better prognosis of the disease. To this end, we have developed an automated, high-throughput, ex-vivo RBC shape classification framework that consists of three stages. First, we present an automatic hierarchical RBC extraction method to detect the RBC region (ROI) from the background, and then separate touching RBCs in the ROI images by applying an improved random walk method based on automatic seed generation. Second, we apply a mask-based RBC patch-size normalization method to normalize the variant size of segmented single RBC patches into uniform size. Third, we employ deep convolutional neural networks (CNNs) to realize RBC classification; the alternating convolution and pooling operations can deal with non-linear and complex patterns. Furthermore, we investigate the specific shape factor quantification for the classified RBC image data in order to develop a general multiscale shape analysis. We perform several experiments on raw microscopy image datasets from 8 SCD patients (over 7,000 single RBC images) through a 5-fold cross validation method both for oxygenated and deoxygenated RBCs. We demonstrate that the proposed framework can successfully classify sickle shape RBCs in an automated manner with high accuracy, and we also provide the corresponding shape factor analysis, which can be used synergistically with the CNN analysis for more robust predictions. Moreover, the trained deep CNN exhibits good performance even for a deoxygenated dataset and distinguishes the subtle differences in texture alteration inside the oxygenated and deoxygenated RBCs.

Journal ArticleDOI
TL;DR: Estimating the contribution of transcript levels to two orthogonal sources of variability shows that scaled mRNA levels can account for most of the mean-level-variability but not necessarily for across-tissues variability, suggesting extensive post-transcriptional regulation.
Abstract: Transcriptional and post-transcriptional regulation shape tissue-type-specific proteomes, but their relative contributions remain contested. Estimates of the factors determining protein levels in human tissues do not distinguish between (i) the factors determining the variability between the abundances of different proteins, i.e., mean-level-variability and, (ii) the factors determining the physiological variability of the same protein across different tissue types, i.e., across-tissues variability. We sought to estimate the contribution of transcript levels to these two orthogonal sources of variability, and found that scaled mRNA levels can account for most of the mean-level-variability but not necessarily for across-tissues variability. The reliable quantification of the latter estimate is limited by substantial measurement noise. However, protein-to-mRNA ratios exhibit substantial across-tissues variability that is functionally concerted and reproducible across different datasets, suggesting extensive post-transcriptional regulation. These results caution against estimating protein fold-changes from mRNA fold-changes between different cell-types, and highlight the contribution of post-transcriptional regulation to shaping tissue-type-specific proteomes.

Journal ArticleDOI
TL;DR: The results suggest that the dynamics of a gene circuit are mainly determined by its topology, not by detailed circuit parameters, providing a theoretical foundation for circuit-based systems biology modeling.
Abstract: One of the most important roles of cells is performing their cellular tasks properly for survival. Cells usually achieve robust functionality, for example, cell-fate decision-making and signal transduction, through multiple layers of regulation involving many genes. Despite the combinatorial complexity of gene regulation, its quantitative behavior has been typically studied on the basis of experimentally verified core gene regulatory circuitry, composed of a small set of important elements. It is still unclear how such a core circuit operates in the presence of many other regulatory molecules and in a crowded and noisy cellular environment. Here we report a new computational method, named random circuit perturbation (RACIPE), for interrogating the robust dynamical behavior of a gene regulatory circuit even without accurate measurements of circuit kinetic parameters. RACIPE generates an ensemble of random kinetic models corresponding to a fixed circuit topology, and utilizes statistical tools to identify generic properties of the circuit. By applying RACIPE to simple toggle-switch-like motifs, we observed that the stable states of all models converge to experimentally observed gene state clusters even when the parameters are strongly perturbed. RACIPE was further applied to a proposed 22-gene network of the Epithelial-to-Mesenchymal Transition (EMT), from which we identified four experimentally observed gene states, including the states that are associated with two different types of hybrid Epithelial/Mesenchymal phenotypes. Our results suggest that dynamics of a gene circuit is mainly determined by its topology, not by detailed circuit parameters. Our work provides a theoretical foundation for circuit-based systems biology modeling. We anticipate RACIPE to be a powerful tool to predict and decode circuit design principles in an unbiased manner, and to quantitatively evaluate the robustness and heterogeneity of gene expression.
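
A minimal RACIPE-style sketch for a two-gene toggle switch: sample random kinetic parameters, integrate the ODEs from random initial conditions, and collect the stable states for downstream clustering. The equations and sampling ranges below are illustrative assumptions, not the paper's exact settings.

    import numpy as np
    from scipy.integrate import odeint

    rng = np.random.default_rng(0)

    def toggle(x, t, g, k, x0, n):
        """Mutually inhibitory two-gene circuit with Hill-function repression."""
        a, b = x
        da = g[0] / (1.0 + (b / x0[0]) ** n[0]) - k[0] * a
        db = g[1] / (1.0 + (a / x0[1]) ** n[1]) - k[1] * b
        return [da, db]

    states = []
    for _ in range(200):                     # ensemble of random kinetic models
        g  = rng.uniform(1, 100, 2)          # production rates
        k  = rng.uniform(0.1, 1, 2)          # degradation rates
        x0 = rng.uniform(1, 100, 2)          # repression thresholds
        n  = rng.integers(1, 7, 2)           # Hill coefficients
        for _ in range(5):                   # several random initial conditions
            xi = rng.uniform(0, 100, 2)
            traj = odeint(toggle, xi, np.linspace(0, 500, 200),
                          args=(g, k, x0, n))
            states.append(np.log10(traj[-1] + 1e-9))
    states = np.array(states)
    print(states.shape)   # cluster these endpoints to find the circuit's generic states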

Journal ArticleDOI
TL;DR: This study used an iterative modified-sure independence screening (ISIS) approach to reduce the number of SNPs to a moderate size and identified most previously reported genes, suggesting that the new method is a good alternative for multi-locus GWAS.
Abstract: Genome-wide association study (GWAS) entails examining a large number of single nucleotide polymorphisms (SNPs) in a limited sample with hundreds of individuals, implying a variable selection problem in the high dimensional dataset. Although many single-locus GWAS approaches under polygenic background and population structure controls have been widely used, some significant loci fail to be detected. In this study, we used an iterative modified-sure independence screening (ISIS) approach in reducing the number of SNPs to a moderate size. Expectation-Maximization (EM)-Bayesian least absolute shrinkage and selection operator (BLASSO) was used to estimate all the selected SNP effects for true quantitative trait nucleotide (QTN) detection. This method is referred to as ISIS EM-BLASSO algorithm. Monte Carlo simulation studies validated the new method, which has the highest empirical power in QTN detection and the highest accuracy in QTN effect estimation, and it is the fastest, as compared with efficient mixed-model association (EMMA), smoothly clipped absolute deviation (SCAD), fixed and random model circulating probability unification (FarmCPU), and multi-locus random-SNP-effect mixed linear model (mrMLM). To further demonstrate the new method, six flowering time traits in Arabidopsis thaliana were re-analyzed by four methods (New method, EMMA, FarmCPU, and mrMLM). As a result, the new method identified most previously reported genes. Therefore, the new method is a good alternative for multi-locus GWAS.
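
The screen-then-shrink logic can be illustrated with ordinary tools: marginal-correlation screening followed by an L1-penalized fit using scikit-learn's Lasso. This stands in for, but is not, the paper's EM-Bayesian LASSO step, and the simulated genotypes and tuning values are purely illustrative.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, p = 200, 10000                                     # individuals, SNPs (simulated)
    X = rng.integers(0, 3, size=(n, p)).astype(float)     # genotypes coded 0/1/2
    beta = np.zeros(p); beta[[11, 505, 7070]] = [0.8, -0.6, 0.5]
    y = X @ beta + rng.standard_normal(n)

    # Screening step: rank SNPs by marginal correlation, keep a moderate subset.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    keep = np.argsort(corr)[-300:]

    # Shrinkage step (stand-in for EM-BLASSO): sparse multi-locus fit on the subset.
    fit = Lasso(alpha=0.05).fit(X[:, keep], y)
    print(sorted(keep[np.flatnonzero(fit.coef_)]))        # candidate QTN indices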

Journal ArticleDOI
TL;DR: The theory establishes a general framework for modeling finite-size neural population dynamics based on single cell and synapse parameters and offers an efficient approach to analyzing cortical circuits and computations.
Abstract: Neural population equations such as neural mass or field models are widely used to study brain activity on a large scale. However, the relation of these models to the properties of single neurons is unclear. Here we derive an equation for several interacting populations at the mesoscopic scale starting from a microscopic model of randomly connected generalized integrate-and-fire neuron models. Each population consists of 50-2000 neurons of the same type but different populations account for different neuron types. The stochastic population equations that we find reveal how spike-history effects in single-neuron dynamics such as refractoriness and adaptation interact with finite-size fluctuations on the population level. Efficient integration of the stochastic mesoscopic equations reproduces the statistical behavior of the population activities obtained from microscopic simulations of a full spiking neural network model. The theory describes nonlinear emergent dynamics such as finite-size-induced stochastic transitions in multistable networks and synchronization in balanced networks of excitatory and inhibitory neurons. The mesoscopic equations are employed to rapidly integrate a model of a cortical microcircuit consisting of eight neuron types, which allows us to predict spontaneous population activities as well as evoked responses to thalamic input. Our theory establishes a general framework for modeling finite-size neural population dynamics based on single cell and synapse parameters and offers an efficient approach to analyzing cortical circuits and computations.