
Showing papers in "PLOS Computational Biology in 2020"


Journal ArticleDOI
TL;DR: The purpose of this document is to summarize the challenges of estimating the effective reproductive number Rt, illustrate them with examples from synthetic data, and, where possible, make recommendations.
Abstract: Estimation of the effective reproductive number Rt is important for detecting changes in disease transmission over time. During the Coronavirus Disease 2019 (COVID-19) pandemic, policy makers and public health officials are using Rt to assess the effectiveness of interventions and to inform policy. However, estimation of Rt from available data presents several challenges, with critical implications for the interpretation of the course of the pandemic. The purpose of this document is to summarize these challenges, illustrate them with examples from synthetic data, and, where possible, make recommendations. For near real-time estimation of Rt, we recommend the approach of Cori and colleagues, which uses data from before time t and empirical estimates of the distribution of time between infections. Methods that require data from after time t, such as Wallinga and Teunis, are conceptually and methodologically less suited for near real-time estimation, but may be appropriate for retrospective analyses of how individuals infected at different time points contributed to the spread. We advise caution when using methods derived from the approach of Bettencourt and Ribeiro, as the resulting Rt estimates may be biased if the underlying structural assumptions are not met. Two key challenges common to all approaches are accurate specification of the generation interval and reconstruction of the time series of new infections from observations occurring long after the moment of transmission. Naive approaches for dealing with observation delays, such as subtracting delays sampled from a distribution, can introduce bias. We provide suggestions for how to mitigate this and other technical challenges and highlight open problems in Rt estimation.
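
The recommended Cori-style estimator reduces to a ratio of new infections to the total infectiousness of earlier cases. The following is a minimal sketch of that idea, not the authors' code; the generation-interval pmf `w`, the window length, and the synthetic incidence are all illustrative assumptions.

```python
import numpy as np

# Minimal sketch of a Cori-style R_t estimator (illustrative, not the paper's code).
# R_t is the ratio of incident cases to the total infectiousness of earlier cases,
# aggregated over a trailing window; w is an assumed generation-interval pmf.
def estimate_rt(incidence, w, window=7):
    incidence = np.asarray(incidence, dtype=float)
    rt = np.full(len(incidence), np.nan)
    for t in range(window, len(incidence)):
        lam = sum(                      # total infectiousness over the window
            sum(w[s] * incidence[tau - s] for s in range(1, min(len(w), tau + 1)))
            for tau in range(t - window + 1, t + 1)
        )
        cases = incidence[t - window + 1 : t + 1].sum()
        if lam > 0:
            rt[t] = cases / lam
    return rt

w = np.array([0.0, 0.25, 0.5, 0.25])             # assumed generation-interval pmf
incidence = np.round(10 * 1.2 ** np.arange(30))  # synthetic exponential growth
print(estimate_rt(incidence, w)[-5:])            # roughly constant R_t > 1, as expected
```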

360 citations


Journal ArticleDOI
TL;DR: A deep learning-based image analysis pipeline that performs segmentation, tracking, and lineage reconstruction on time-lapse movies of Escherichia coli cells trapped in a "mother machine" microfluidic device, a scalable platform for long-term single-cell analysis that is widely used in the field.
Abstract: Microscopy image analysis is a major bottleneck in quantification of single-cell microscopy data, typically requiring human oversight and curation, which limit both accuracy and throughput. To address this, we developed a deep learning-based image analysis pipeline that performs segmentation, tracking, and lineage reconstruction. Our analysis focuses on time-lapse movies of Escherichia coli cells trapped in a "mother machine" microfluidic device, a scalable platform for long-term single-cell analysis that is widely used in the field. While deep learning has been applied to cell segmentation problems before, our approach is fundamentally innovative in that it also uses machine learning to perform cell tracking and lineage reconstruction. With this framework we are able to get high fidelity results (1% error rate), without human intervention. Further, the algorithm is fast, with complete analysis of a typical frame containing ~150 cells taking <700msec. The framework is not constrained to a particular experimental set up and has the potential to generalize to time-lapse images of other organisms or different experimental configurations. These advances open the door to a myriad of applications including real-time tracking of gene expression and high throughput analysis of strain libraries at single-cell resolution.

116 citations


Journal ArticleDOI
TL;DR: This report presents the assembly polishing tool POLCA (POLishing by Calling Alternatives) and compares its performance with two other popular polishing programs, Pilon and Racon, and shows that on simulated data POLCA is more accurate than Pilon, and comparable in accuracy to Racon.
Abstract: The introduction of third-generation DNA sequencing technologies in recent years has allowed scientists to generate dramatically longer sequence reads, which when used in whole-genome sequencing projects have yielded better repeat resolution and far more contiguous genome assemblies. While the promise of better contiguity has held true, the relatively high error rate of long reads, averaging 8-15%, has made it challenging to generate a highly accurate final sequence. Current long-read sequencing technologies display a tendency toward systematic errors, in particular in homopolymer regions, which present additional challenges. A cost-effective strategy to generate highly contiguous assemblies with a very low overall error rate is to combine long reads with low-cost short-read data, which currently have an error rate below 0.5%. This hybrid strategy can be pursued either by incorporating the short-read data into the early phase of assembly, during the read correction step, or by using short reads to "polish" the consensus built from long reads. In this report, we present the assembly polishing tool POLCA (POLishing by Calling Alternatives) and compare its performance with two other popular polishing programs, Pilon and Racon. We show that on simulated data POLCA is more accurate than Pilon, and comparable in accuracy to Racon. On real data, all three programs show similar performance, but POLCA is consistently much faster than either of the other polishing programs.

115 citations


Journal ArticleDOI
TL;DR: It is found that drugs targeting the two pathways, although independent, could display strong synergy in blocking virus entry, which may help improve the deployability of drug combinations targeting host proteases required for entry.
Abstract: The entry of SARS-CoV-2 into target cells requires the activation of its surface spike protein, S, by host proteases. The host serine protease TMPRSS2 and cysteine proteases Cathepsin B/L can activate S, making two independent entry pathways accessible to SARS-CoV-2. Blocking the proteases prevents SARS-CoV-2 entry in vitro. This blockade may be achieved in vivo through 'repurposing' drugs, a potential treatment option for COVID-19 that is now in clinical trials. Here, we found, surprisingly, that drugs targeting the two pathways, although independent, could display strong synergy in blocking virus entry. We predicted this synergy first using a mathematical model of SARS-CoV-2 entry and dynamics in vitro. The model considered the two pathways explicitly, let the entry efficiency through a pathway depend on the corresponding protease expression level, which varied across cells, and let inhibitors compromise the efficiency in a dose-dependent manner. The synergy predicted was novel and arose from effects of the drugs at both the single cell and the cell population levels. Validating our predictions, available in vitro data on SARS-CoV-2 and SARS-CoV entry displayed this synergy. Further, analysing the data using our model, we estimated the relative usage of the two pathways and found it to vary widely across cell lines, suggesting that targeting both pathways in vivo may be important and synergistic given the broad tissue tropism of SARS-CoV-2. Our findings provide insights into SARS-CoV-2 entry into target cells and may help improve the deployability of drug combinations targeting host proteases required for the entry.
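
The modelling idea described above lends itself to a compact illustration: per-cell protease levels vary across the population, each inhibitor reduces its pathway dose-dependently, and the combination is compared against a Bliss-independence expectation. All parameter values and functional forms below are hypothetical stand-ins, not the paper's fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-cell protease expression, log-normal across the population
tmprss2 = rng.lognormal(0.0, 1.0, size=10_000)
cathepsin = rng.lognormal(0.0, 1.0, size=10_000)

def remaining(dose, ic50):
    """Fraction of pathway activity left under a dose-dependent inhibitor."""
    return 1.0 / (1.0 + dose / ic50)

def mean_entry(dose_t, dose_c, ic50_t=1.0, ic50_c=1.0):
    # entry efficiency through each pathway scales with protease level and is
    # reduced by its inhibitor; per-cell entry saturates with total efficiency
    eff = tmprss2 * remaining(dose_t, ic50_t) + cathepsin * remaining(dose_c, ic50_c)
    return np.mean(1.0 - np.exp(-eff))

e0 = mean_entry(0.0, 0.0)
e_t, e_c, e_combo = mean_entry(5.0, 0.0), mean_entry(0.0, 5.0), mean_entry(5.0, 5.0)
bliss = e_t * e_c / e0   # Bliss-independence expectation for the combination
print(e_combo, bliss)    # combination entry below Bliss indicates synergy
```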

112 citations


Journal ArticleDOI
TL;DR: This work provides a solution in the form of an R/Bioconductor package tximeta that performs numerous annotation and metadata gathering tasks automatically on behalf of users during the import of transcript quantification files.
Abstract: Correct annotation metadata is critical for reproducible and accurate RNA-seq analysis. When files are shared publicly or among collaborators with incorrect or missing annotation metadata, it becomes difficult or impossible to reproduce bioinformatic analyses from raw data. It also makes it more difficult to locate the transcriptomic features, such as transcripts or genes, in their proper genomic context, which is necessary for overlapping expression data with other datasets. We provide a solution in the form of an R/Bioconductor package tximeta that performs numerous annotation and metadata gathering tasks automatically on behalf of users during the import of transcript quantification files. The correct reference transcriptome is identified via a hashed checksum stored in the quantification output, and key transcript databases are downloaded and cached locally. The computational paradigm of automatically adding annotation metadata based on reference sequence checksums can greatly facilitate genomic workflows, by helping to reduce overhead during bioinformatic analyses, preventing costly bioinformatic mistakes, and promoting computational reproducibility. The tximeta package is available at https://bioconductor.org/packages/tximeta.
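
The paradigm of identifying a reference by a checksum of its sequence content, rather than by filename, is easy to illustrate. The sketch below uses a plain SHA-256 digest over sorted records; tximeta itself relies on the hashed checksum embedded in the quantifier output, so this is an analogy for the idea rather than its actual mechanism.

```python
import hashlib, os, tempfile

def reference_digest(fasta_path):
    """SHA-256 digest of a FASTA's sequence content, ignoring headers and record order."""
    seqs, name = {}, None
    with open(fasta_path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                seqs[name] = []
            elif name is not None:
                seqs[name].append(line.upper())
    h = hashlib.sha256()
    for key in sorted(seqs):
        h.update("".join(seqs[key]).encode())
    return h.hexdigest()

# Hypothetical lookup table from digest to annotation metadata
KNOWN_REFERENCES = {}  # e.g. {"3f2a...": {"source": "GENCODE", "release": "33"}}

# write a tiny toy FASTA so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as fh:
    fh.write(">tx1 some description\nACGTACGT\n>tx2\nGGGCCC\n")
    path = fh.name
digest = reference_digest(path)
print(digest, KNOWN_REFERENCES.get(digest, "unknown reference"))
os.remove(path)
```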

112 citations


Journal ArticleDOI
TL;DR: A new systems-biology-informed deep learning algorithm is developed that incorporates the system of ordinary differential equations into the neural networks and is able to infer the dynamics of unobserved species, external forcing, and the unknown model parameters.
Abstract: Mathematical models of biological reactions at the system-level lead to a set of ordinary differential equations with many unknown parameters that need to be inferred using relatively few experimental measurements. Having a reliable and robust algorithm for parameter inference and prediction of the hidden dynamics has been one of the core subjects in systems biology, and is the focus of this study. We have developed a new systems-biology-informed deep learning algorithm that incorporates the system of ordinary differential equations into the neural networks. Enforcing these equations effectively adds constraints to the optimization procedure that manifests itself as an imposed structure on the observational data. Using few scattered and noisy measurements, we are able to infer the dynamics of unobserved species, external forcing, and the unknown model parameters. We have successfully tested the algorithm for three different benchmark problems.
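
The key construction (adding the ODE residual to the training loss so that unknown parameters are inferred jointly with the hidden dynamics) can be shown on a toy one-species system. This is a generic sketch of the technique in PyTorch, not the authors' implementation; the network size, learning rate, and decay model are arbitrary choices.

```python
import torch

# Toy system: infer k in dx/dt = -k x from sparse noisy data, PINN-style.
torch.manual_seed(0)
k_true = 0.5
t_data = torch.linspace(0, 5, 10).reshape(-1, 1)
x_data = torch.exp(-k_true * t_data) + 0.01 * torch.randn_like(t_data)

net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)
log_k = torch.nn.Parameter(torch.zeros(()))  # unknown rate, trained jointly
opt = torch.optim.Adam(list(net.parameters()) + [log_k], lr=1e-3)
t_col = torch.linspace(0, 5, 100).reshape(-1, 1).requires_grad_(True)

for step in range(3000):
    opt.zero_grad()
    data_loss = ((net(t_data) - x_data) ** 2).mean()
    x_col = net(t_col)                      # collocation points enforce the ODE
    dxdt = torch.autograd.grad(x_col, t_col, torch.ones_like(x_col),
                               create_graph=True)[0]
    ode_loss = ((dxdt + torch.exp(log_k) * x_col) ** 2).mean()
    (data_loss + ode_loss).backward()       # ODE residual acts as a soft constraint
    opt.step()

print(float(torch.exp(log_k)))  # should approach k_true = 0.5
```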

103 citations


Journal ArticleDOI
TL;DR: A transmission model combining age-stratified contact frequencies with age-dependent susceptibility, probability of clinical symptoms, and transmission from asymptomatic cases is introduced and used to estimate the country-specific basic reproductive ratio of COVID-19 for 152 countries.
Abstract: The 2019-2020 pandemic of atypical pneumonia (COVID-19) caused by the virus SARS-CoV-2 has spread globally and has the potential to infect large numbers of people in every country. Estimating the country-specific basic reproductive ratio is a vital first step in public-health planning. The basic reproductive ratio (R0) is determined by both the nature of the pathogen and the network of human contacts through which the disease can spread, which is itself dependent on population age structure and household composition. Here we introduce a transmission model combining age-stratified contact frequencies with age-dependent susceptibility, probability of clinical symptoms, and transmission from asymptomatic (or mild) cases, which we use to estimate the country-specific basic reproductive ratio of COVID-19 for 152 countries. Using early outbreak data from China and a synthetic contact matrix, we estimate an age-stratified transmission structure which can then be extrapolated to 151 other countries for which synthetic contact matrices also exist. This defines a set of country-specific transmission structures from which we can calculate the basic reproductive ratio for each country. Our predicted R0 is critically sensitive to the intensity of transmission from asymptomatic cases; with low asymptomatic transmission, the highest values are predicted across Eastern Europe and Japan and the lowest across Africa, Central America and South-Western Asia. This pattern is largely driven by the ratio of children to older adults in each country and the observed propensity of clinical cases in the elderly. If asymptomatic cases have comparable transmission to detected cases, the pattern is reversed. Our results demonstrate the importance of age-specific heterogeneities, beyond contact structure, to the spread of COVID-19. These heterogeneities give COVID-19 the capacity to spread particularly quickly in countries with older populations, and intensive control measures are likely to be necessary to impede its progress in these countries.
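
For models of this family, R0 is typically obtained as the spectral radius of a next-generation matrix that combines the contact matrix with age-dependent susceptibility and infectivity. A minimal sketch with made-up numbers for three age classes; the infectious period is folded into the transmission scale `beta`.

```python
import numpy as np

# Hypothetical three-age-class example (children, adults, elderly).
contacts = np.array([[8.0, 3.0, 1.0],   # mean daily contacts between classes
                     [3.0, 5.0, 2.0],
                     [1.0, 2.0, 3.0]])
susceptibility = np.array([0.4, 0.8, 1.0])  # age-dependent susceptibility
infectivity = np.array([0.5, 1.0, 1.0])     # lower if young cases are mostly mild
beta = 0.05   # transmission scale; the infectious period is folded in here

# Next-generation matrix: K[i, j] ~ infections in class i caused by one case in j
K = beta * np.diag(susceptibility) @ contacts @ np.diag(infectivity)
r0 = max(abs(np.linalg.eigvals(K)))  # spectral radius
print(round(float(r0), 3))
```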

100 citations


Journal ArticleDOI
TL;DR: The graph-based approach proposed by PPanGGOLiN is useful to depict the overall genomic diversity of thousands of strains in a compact structure and provides an effective basis for very large scale comparative genomics.
Abstract: Microorganisms have the greatest biodiversity and evolutionary history on earth. At the genomic level, this is reflected in a highly variable gene content even among organisms from the same species, which explains the ability of microbes to be pathogenic or to grow in specific environments. We developed a new method called PPanGGOLiN which accurately represents the genomic diversity of a species (i.e. its pangenome) using a compact graph structure. Based on this pangenome graph, we classify genes by a statistical method according to their occurrence in the genomes. This method allowed us to build pangenomes even for uncultivated species at an unprecedented scale. We applied our method to all available genomes in databanks in order to depict the overall diversity of hundreds of species. Overall, our work enables microbiologists to explore and visualize pangenomes like a subway map.

92 citations


Journal ArticleDOI
TL;DR: A computational modeling approach is used to show that epithelial-mesenchymal heterogeneity can emerge from the noise in the partitioning of biomolecules among daughter cells during the division of a cancer cell, and captures the experimentally observed temporal changes in the fractions of different phenotypes in a population of murine prostate cancer cells.
Abstract: Epithelial-mesenchymal heterogeneity implies that cells within the same tumor can exhibit different phenotypes—epithelial, mesenchymal, or one or more hybrid epithelial-mesenchymal phenotypes. This behavior has been reported across cancer types, both in vitro and in vivo, and implicated in multiple processes associated with metastatic aggressiveness including immune evasion, collective dissemination of tumor cells, and emergence of cancer cell subpopulations with stem cell-like properties. However, the ability of a population of cancer cells to generate, maintain, and propagate this heterogeneity has remained a mystifying feature. Here, we used a computational modeling approach to show that epithelial-mesenchymal heterogeneity can emerge from the noise in the partitioning of biomolecules (such as RNAs and proteins) among daughter cells during the division of a cancer cell. Our model captures the experimentally observed temporal changes in the fractions of different phenotypes in a population of murine prostate cancer cells, and describes the hysteresis in the population-level dynamics of epithelial-mesenchymal plasticity. The model is further able to predict how factors known to promote a hybrid epithelial-mesenchymal phenotype can alter the phenotypic composition of a population. Finally, we used the model to probe the implications of phenotypic heterogeneity and plasticity for different therapeutic regimens and found that co-targeting of epithelial and mesenchymal cells is likely to be the most effective strategy for restricting tumor growth. By connecting the dynamics of an intracellular circuit to the phenotypic composition of a population, our study serves as a first step towards understanding the generation and maintenance of non-genetic heterogeneity in a population of cancer cells, and towards the therapeutic targeting of phenotypic heterogeneity and plasticity in cancer cell populations.
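
The proposed noise source is straightforward to simulate: at each division, every biomolecule species is split binomially between the daughters, so copy numbers drift across generations. The toy sketch below (copy numbers, resynthesis rule, and phenotype thresholds all hypothetical) shows heterogeneity emerging from partitioning noise alone.

```python
import numpy as np

rng = np.random.default_rng(1)

def divide(molecules):
    """Binomial partitioning of each species between two daughter cells."""
    d1 = rng.binomial(molecules, 0.5)
    return d1, molecules - d1

# one hypothetical 'EMT regulator'; production simply re-doubles the inherited pool
population = [np.array([40])]
for generation in range(10):
    next_gen = []
    for cell in population:
        d1, d2 = divide(cell)
        next_gen += [d1 * 2, d2 * 2]
    population = next_gen

levels = np.array([cell[0] for cell in population])
# hypothetical phenotype thresholds: low, hybrid, and high regulator states
print((levels < 30).mean(), ((levels >= 30) & (levels <= 50)).mean(), (levels > 50).mean())
```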

89 citations


Journal ArticleDOI
TL;DR: A set of computational methods for projecting animal vocalizations into low dimensional latent representational spaces that are directly learned from the spectrograms of vocal signals are presented, enabling high-powered comparative analyses of vocal acoustics.
Abstract: Animals produce vocalizations that range in complexity from a single repeated call to hundreds of unique vocal elements patterned in sequences unfolding over hours. Characterizing complex vocalizations can require considerable effort and a deep intuition about each species' vocal behavior. Even with a great deal of experience, human characterizations of animal communication can be affected by human perceptual biases. We present a set of computational methods for projecting animal vocalizations into low dimensional latent representational spaces that are directly learned from the spectrograms of vocal signals. We apply these methods to diverse datasets from over 20 species, including humans, bats, songbirds, mice, cetaceans, and nonhuman primates. Latent projections uncover complex features of data in visually intuitive and quantifiable ways, enabling high-powered comparative analyses of vocal acoustics. We introduce methods for analyzing vocalizations as both discrete sequences and as continuous latent variables. Each method can be used to disentangle complex spectro-temporal structure and observe long-timescale organization in communication.
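
Pipelines of this kind compute a spectrogram per vocal element and project the flattened spectrograms into a low-dimensional latent space. A minimal sketch with scipy and umap-learn, assuming segmentation into syllables has happened upstream; the random arrays stand in for real audio.

```python
import numpy as np
from scipy.signal import spectrogram
import umap  # pip install umap-learn

def to_features(waveform, rate=44_100, n_frames=60):
    """Log spectrogram cropped to a fixed number of frames, then flattened."""
    _, _, s = spectrogram(waveform, fs=rate, nperseg=512, noverlap=256)
    if s.shape[1] < n_frames:
        return None
    return np.log(s[:, :n_frames] + 1e-12).ravel()

# stand-ins for pre-segmented syllable waveforms (real audio assumed upstream)
rng = np.random.default_rng(0)
syllables = [rng.standard_normal(30_000) for _ in range(200)]
feats = np.array([f for f in (to_features(w) for w in syllables) if f is not None])

embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(feats)
print(embedding.shape)  # (n_syllables, 2): latent coordinates for comparison
```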

87 citations


Journal ArticleDOI
TL;DR: This study proposes a computational method called GCNCDA, based on the Fast learning with Graph Convolutional Networks (FastGCN) deep learning algorithm, to predict potential disease-associated circRNAs, and shows that it is strongly competitive.
Abstract: Numerous lines of evidence indicate that circular RNAs (circRNAs) are widely involved in the occurrence and development of diseases. Identifying the associations between circRNAs and diseases plays a crucial role in exploring the pathogenesis of complex diseases and improving their diagnosis and treatment. However, due to the complex mechanisms linking circRNAs and diseases, it is expensive and time-consuming to discover new circRNA-disease associations by biological experiment. There is therefore an increasingly urgent need for computational methods to predict novel circRNA-disease associations. In this study, we propose a computational method called GCNCDA, based on the Fast learning with Graph Convolutional Networks (FastGCN) deep learning algorithm, to predict potential disease-associated circRNAs. Specifically, the method first forms a unified descriptor by fusing disease semantic similarity information with disease and circRNA Gaussian Interaction Profile (GIP) kernel similarity information computed from known circRNA-disease associations. The FastGCN algorithm is then used to objectively extract the high-level features contained in the fused descriptor. Finally, new circRNA-disease associations are predicted by the Forest by Penalizing Attributes (Forest PA) classifier. In 5-fold cross-validation, GCNCDA achieved 91.2% accuracy with 92.78% sensitivity at an AUC of 90.90% on the circR2Disease benchmark dataset. In comparison with different classifier models, feature extraction models and other state-of-the-art methods, GCNCDA shows strong competitiveness. Furthermore, we conducted case studies on breast cancer, glioma and colorectal cancer; 16, 15 and 17 of the top 20 candidate circRNAs with the highest prediction scores, respectively, were confirmed by relevant literature and databases. These results suggest that GCNCDA can effectively predict potential circRNA-disease associations and provide highly credible candidates for biological experiments.
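
The Gaussian Interaction Profile (GIP) kernel used as one of the fused similarity inputs has a standard closed form: similarity between two entities is a Gaussian of the distance between their rows of the binary association matrix. A compact sketch:

```python
import numpy as np

def gip_kernel(assoc):
    """Gaussian Interaction Profile kernel between rows of a binary association matrix."""
    norms = (assoc ** 2).sum(axis=1)
    gamma = 1.0 / norms.mean()            # bandwidth normalized by mean profile norm
    d2 = norms[:, None] + norms[None, :] - 2 * assoc @ assoc.T  # squared distances
    return np.exp(-gamma * d2)

# toy: 5 circRNAs x 4 diseases
assoc = np.array([[1, 0, 0, 1],
                  [1, 0, 0, 0],
                  [0, 1, 1, 0],
                  [0, 1, 0, 0],
                  [1, 0, 0, 1]], dtype=float)
print(gip_kernel(assoc).round(2))  # identical profiles (rows 0 and 4) give similarity 1.0
```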

Journal ArticleDOI
TL;DR: This work compared neuronal spikes and fluorescence in matched neural populations in behaving mice and developed a model transforming spike trains to synthetic-imaging data, which highlights challenges in relating electrophysiology and imaging data and suggests forward modeling as an effective way to understand differences between these data.
Abstract: Calcium imaging with fluorescent protein sensors is widely used to record activity in neuronal populations. The transform between neural activity and calcium-related fluorescence involves nonlinearities and low-pass filtering, but the effects of the transformation on analyses of neural populations are not well understood. We compared neuronal spikes and fluorescence in matched neural populations in behaving mice. We report multiple discrepancies between analyses performed on the two types of data, including changes in single-neuron selectivity and population decoding. These were only partially resolved by spike inference algorithms applied to fluorescence. To model the relation between spiking and fluorescence we simultaneously recorded spikes and fluorescence from individual neurons. Using these recordings we developed a model transforming spike trains to synthetic-imaging data. The model recapitulated the differences in analyses. Our analysis highlights challenges in relating electrophysiology and imaging data, and suggests forward modeling as an effective way to understand differences between these data.
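
A forward model of the kind described can be sketched in a few lines: convolve the spike train with a calcium impulse response, apply a saturating nonlinearity, and add measurement noise. The kernel time constants, Hill parameters, and noise level below are placeholders, not the values fitted in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def spikes_to_fluorescence(spikes, dt=1 / 30, tau_rise=0.05, tau_decay=0.5,
                           kd=2.0, hill=2.0, noise_sd=0.05):
    t = np.arange(0, 5 * tau_decay, dt)
    kernel = (1 - np.exp(-t / tau_rise)) * np.exp(-t / tau_decay)  # Ca impulse response
    ca = np.convolve(spikes, kernel)[: len(spikes)]                # low-pass filtering
    f = ca ** hill / (ca ** hill + kd ** hill)                     # saturating nonlinearity
    return f + rng.normal(0, noise_sd, size=f.shape)               # measurement noise

spikes = rng.poisson(0.1, size=900).astype(float)  # ~3 Hz spiking sampled at 30 Hz
fluor = spikes_to_fluorescence(spikes)
print(fluor.shape, float(fluor.max()))
```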

Journal ArticleDOI
TL;DR: It is shown that introducing a temporal relationship between cases considerably improves performance when the reporting delay distribution is time-varying, and trade-offs in the role of moving windows to accurately capture changes in the delay are identified.
Abstract: Achieving accurate, real-time estimates of disease activity is challenged by delays in case reporting. "Nowcast" approaches attempt to estimate the complete case counts for a given reporting date, using a time series of case reports that is known to be incomplete due to reporting delays. Modeling the reporting delay distribution is a common feature of nowcast approaches. However, many nowcast approaches ignore a crucial feature of infectious disease transmission, namely that future cases are intrinsically linked to past reported cases, and are optimized to one or two applications, which may limit generalizability. Here, we present a Bayesian approach, NobBS (Nowcasting by Bayesian Smoothing), capable of producing smooth and accurate nowcasts in multiple disease settings. We test NobBS on dengue in Puerto Rico and influenza-like illness (ILI) in the United States to examine performance and robustness across settings exhibiting a range of common reporting delay characteristics (from stable to time-varying), and compare this approach with a published nowcasting software package while investigating the features of each approach that contribute to good or poor performance. We show that introducing a temporal relationship between cases considerably improves performance when the reporting delay distribution is time-varying, and we identify trade-offs in the role of moving windows to accurately capture changes in the delay. We present software implementing this new approach (R package "NobBS") for widespread application and provide practical guidance on implementation.
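
The baseline correction that nowcasting improves upon is simple: inflate each recent, still-incomplete count by the probability that a case would have been reported by now. The sketch below implements that naive inversion with an assumed known delay pmf; NobBS replaces it with a Bayesian model that links adjacent days.

```python
import numpy as np

# assumed known delay pmf: P(case reported d days after onset)
delay_pmf = np.array([0.3, 0.3, 0.2, 0.1, 0.1])
reported_by = np.cumsum(delay_pmf)  # P(reported within d days of onset)

def naive_nowcast(partial_counts):
    """Inflate each day's partial count by the fraction expected reported so far."""
    n = len(partial_counts)
    est = np.array(partial_counts, dtype=float)
    for i in range(n):
        days_elapsed = n - 1 - i  # time day i has had to accumulate reports
        est[i] /= reported_by[min(days_elapsed, len(reported_by) - 1)]
    return est

reports = [50, 48, 40, 27, 15]          # right-truncated counts for the last 5 days
print(naive_nowcast(reports).round(1))  # rough completed-count estimates
```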

Journal ArticleDOI
TL;DR: A novel and powerful approach applies mouse regulatory models to analyze human genetic variants associated with molecular phenotypes and disease, unleashing thousands of non-human epigenetic and transcriptional profiles toward more effective investigation of how gene regulation affects human disease.
Abstract: Machine learning algorithms trained to predict the regulatory activity of nucleic acid sequences have revealed principles of gene regulation and guided genetic variation analysis. While the human genome has been extensively annotated and studied, model organisms have been less explored. Model organism genomes offer both additional training sequences and unique annotations describing tissue and cell states unavailable in humans. Here, we develop a strategy to train deep convolutional neural networks simultaneously on multiple genomes and apply it to learn sequence predictors for large compendia of human and mouse data. Training on both genomes improves gene expression prediction accuracy on held out and variant sequences. We further demonstrate a novel and powerful approach to apply mouse regulatory models to analyze human genetic variants associated with molecular phenotypes and disease. Together these techniques unleash thousands of non-human epigenetic and transcriptional profiles toward more effective investigation of how gene regulation affects human disease.

Journal ArticleDOI
TL;DR: A computational model of cell migration through a degradable viscoelastic ECM is presented and it is demonstrated that changes in ECM stiffness and cell strength affect cell migration and are accompanied by changes in number, lifetime and length of protrusions.
Abstract: Actin protrusion dynamics plays an important role in the regulation of three-dimensional (3D) cell migration. Cells form protrusions that adhere to the surrounding extracellular matrix (ECM), mechanically probe the ECM and contract in order to displace the cell body. This results in cell migration that can be directed by the mechanical anisotropy of the ECM. However, the subcellular processes that regulate protrusion dynamics in 3D cell migration are difficult to investigate experimentally and therefore not well understood. Here, we present a computational model of cell migration through a degradable viscoelastic ECM. This model is a 2D representation of 3D cell migration. The cell is modeled as an active deformable object that captures the viscoelastic behavior of the actin cortex and the subcellular processes underlying 3D cell migration. The ECM is regarded as a viscoelastic material, with or without anisotropy due to fibrillar strain stiffening, and modeled by means of the meshless Lagrangian smoothed particle hydrodynamics (SPH) method. ECM degradation is captured by local fluidization of the material and permits cell migration through the ECM. We demonstrate that changes in ECM stiffness and cell strength affect cell migration and are accompanied by changes in number, lifetime and length of protrusions. Interestingly, directly changing the total protrusion number or the average lifetime or length of protrusions does not affect cell migration. A stochastic variability in protrusion lifetime proves to be enough to explain differences in cell migration velocity. Force-dependent adhesion disassembly does not result in faster migration, but can make migration more efficient. We also demonstrate that when a number of simultaneous protrusions is enforced, the optimal number of simultaneous protrusions is one or two, depending on ECM anisotropy. Together, the model provides non-trivial new insights into the role of protrusions in 3D cell migration and can be a valuable contribution to increasing the understanding of 3D cell migration mechanics.

Journal ArticleDOI
TL;DR: NuSeT addresses common challenges in nuclear segmentation such as variability in nuclear signal and shape, limited training sample size, and sample preparation artifacts, and consistently fares better in generating accurate segmentation masks and assigning boundaries for touching nuclei.
Abstract: Segmenting cell nuclei within microscopy images is a ubiquitous task in biological research and clinical applications. Unfortunately, segmenting low-contrast overlapping objects that may be tightly packed is a major bottleneck in standard deep learning-based models. We report a Nuclear Segmentation Tool (NuSeT) based on deep learning that accurately segments nuclei across multiple types of fluorescence imaging data. Using a hybrid network consisting of U-Net and Region Proposal Networks (RPN), followed by a watershed step, we have achieved superior performance in detecting and delineating nuclear boundaries in 2D and 3D images of varying complexities. By using foreground normalization and additional training on synthetic images containing non-cellular artifacts, NuSeT improves nuclear detection and reduces false positives. NuSeT addresses common challenges in nuclear segmentation such as variability in nuclear signal and shape, limited training sample size, and sample preparation artifacts. Compared to other segmentation models, NuSeT consistently fares better in generating accurate segmentation masks and assigning boundaries for touching nuclei.
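
The final watershed step described above is a standard construction for splitting touching nuclei. A sketch with scikit-image, where `prob` stands for the network's per-pixel nucleus probability map (here faked with two overlapping Gaussian blobs):

```python
import numpy as np
from scipy import ndimage
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def split_touching_nuclei(prob, threshold=0.5, min_distance=5):
    """Label nucleus instances from a per-pixel foreground probability map."""
    mask = prob > threshold
    distance = ndimage.distance_transform_edt(mask)
    peaks = peak_local_max(distance, min_distance=min_distance,
                           labels=ndimage.label(mask)[0])
    markers = np.zeros(prob.shape, dtype=int)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
    return watershed(-distance, markers, mask=mask)  # one label per nucleus

# fake probability map: two overlapping blobs standing in for network output
yy, xx = np.mgrid[0:64, 0:64]
prob = np.exp(-((yy - 30) ** 2 + (xx - 25) ** 2) / 60) \
     + np.exp(-((yy - 30) ** 2 + (xx - 40) ** 2) / 60)
labels = split_touching_nuclei(np.clip(prob, 0, 1))
print(labels.max())  # expect 2: the touching blobs are separated
```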

Journal ArticleDOI
Yuting Chen1, Haoyu Lu1, Ning Zhang1, Zefeng Zhu1, Shuqin Wang1, Minghui Li1 
TL;DR: A new computational method called PremPS was developed to more accurately evaluate the effects of missense mutations on protein stability; it is composed of only ten evolutionary- and structure-based features and is parameterized on a balanced dataset with an equal number of stabilizing and destabilizing mutations.
Abstract: Computational methods that predict protein stability changes induced by missense mutations have made a lot of progress over the past decades. Most of the available methods, however, have very limited accuracy in predicting stabilizing mutations because existing experimental sets are dominated by mutations reducing protein stability. Moreover, few approaches could consistently perform well across different test cases. To address these issues, we developed a new computational method, PremPS, to more accurately evaluate the effects of missense mutations on protein stability. The PremPS method is composed of only ten evolutionary- and structure-based features and is parameterized on a balanced dataset with an equal number of stabilizing and destabilizing mutations. A comprehensive comparison of the predictive performance of PremPS with other available methods on nine benchmark datasets confirms that our approach consistently outperforms the others and shows considerable improvement in estimating the impacts of stabilizing mutations. A protein could have multiple structures available, and if another structure of the same protein is used, the predicted change in stability for structure-based methods might differ. Thus, we further estimated the impact of using different structures on prediction accuracy, and demonstrate that our method performs well across different types of structures, except for low-resolution structures and models built from templates with low sequence identity. PremPS can be used for finding functionally important variants, revealing the molecular mechanisms of functional influences and protein design. PremPS is freely available at https://lilab.jysw.suda.edu.cn/research/PremPS/; it allows large-scale mutational scanning, takes about four minutes to perform calculations for a single mutation in a protein with ~300 residues, and requires ~0.4 seconds for each additional mutation.

Journal ArticleDOI
TL;DR: 10 rules are presented to help labs develop antiracist policies and actions in an effort to promote racial and ethnic diversity, equity, and inclusion in science.
Abstract: Demographics of the science, technology, engineering, and mathematics (STEM) workforce and student body in the US and Europe continue to show severe underrepresentation of Black, Indigenous, and people of color (BIPOC). Among the documented causes of the persistent lack of diversity in STEM are bias, discrimination, and harassment of members of underrepresented minority groups (URMs). These issues persist due to continued marginalization, power imbalances, and lack of adequate policies against misconduct in academic and other scientific institutions. All scientists can play important roles in reversing this trend by shifting the culture of academic workplaces to intentionally implement equitable and inclusive policies, set norms for acceptable workplace conduct, and provide opportunities for mentorship and networking. As scientists are increasingly acknowledging the lack of racial and ethnic diversity in science, there is a need for clear direction on how to take antiracist action. Here we present 10 rules to help labs develop antiracist policies and actions in an effort to promote racial and ethnic diversity, equity, and inclusion in science.

Journal ArticleDOI
TL;DR: A suite of phage-oriented tools housed in open, user-friendly web-based interfaces is developed, providing a multi-purpose platform that enables researchers to easily and accurately annotate an entire phage genome.
Abstract: In the modern genomic era, scientists without extensive bioinformatic training need to apply high-power computational analyses to critical tasks like phage genome annotation. At the Center for Phage Technology (CPT), we developed a suite of phage-oriented tools housed in open, user-friendly web-based interfaces. A Galaxy platform conducts computationally intensive analyses and Apollo, a collaborative genome annotation editor, visualizes the results of these analyses. The collection includes open source applications such as the BLAST+ suite, InterProScan, and several gene callers, as well as unique tools developed at the CPT that allow maximum user flexibility. We describe in detail programs for finding Shine-Dalgarno sequences, resources used for confident identification of lysis genes such as spanins, and methods used for identifying interrupted genes that contain frameshifts or introns. At the CPT, genome annotation is separated into two robust segments that are facilitated through the automated execution of many tools chained together in an operation called a workflow. First, the structural annotation workflow results in gene and other feature calls. This is followed by a functional annotation workflow that combines sequence comparisons and conserved domain searching, which is contextualized to allow integrated evidence assessment in functional prediction. Finally, we describe a workflow used for comparative genomics. Using this multi-purpose platform enables researchers to easily and accurately annotate an entire phage genome. The portal can be accessed at https://cpt.tamu.edu/galaxy-pub with accompanying user training material.
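
One of the components mentioned (finding Shine-Dalgarno sequences) reduces to scanning a window upstream of a candidate start codon for the best match to the SD core motif. The sketch below is a simplification; the scoring scheme and spacer window are illustrative, not the CPT tool's exact parameters.

```python
SD_CORE = "AGGAGG"  # canonical Shine-Dalgarno core (complement of the 16S anti-SD)

def best_sd_site(genome, start_pos, spacer_range=(4, 14)):
    """Best SD-core match in the spacer region upstream of a start codon (0-based)."""
    best = (0, None)  # (matching bases, spacer length)
    for spacer in range(*spacer_range):
        site_end = start_pos - spacer
        site = genome[max(0, site_end - len(SD_CORE)):site_end]
        if len(site) < len(SD_CORE):
            continue
        matches = sum(a == b for a, b in zip(site, SD_CORE))
        if matches > best[0]:
            best = (matches, spacer)
    return best

genome = "CCCCAGGAGGTTTTTTTATGAAACCC"  # toy sequence; ATG starts at index 17
print(best_sd_site(genome, 17))        # (6, 7): perfect SD match 7 nt upstream
```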

Journal ArticleDOI
TL;DR: The theory provides a quantitative definition of downward causation and introduces a complementary modality of emergent behaviour, termed causal decoupling, while allowing practical criteria that can be efficiently calculated in large systems.
Abstract: The broad concept of emergence is instrumental in many of the most challenging open scientific questions—yet, few quantitative theories of what constitutes emergent phenomena have been proposed. This article introduces a formal theory of causal emergence in multivariate systems, which studies the relationship between the dynamics of parts of a system and macroscopic features of interest. Our theory provides a quantitative definition of downward causation, and introduces a complementary modality of emergent behaviour—which we refer to as causal decoupling. Moreover, the theory allows practical criteria that can be efficiently calculated in large systems, making our framework applicable in a range of scenarios of practical interest. We illustrate our findings in a number of case studies, including Conway’s Game of Life, Reynolds’ flocking model, and neural activity as measured by electrocorticography.

Journal ArticleDOI
TL;DR: The use of Orbit is described in three different real-world applications: quantification of idiopathic lung fibrosis, nerve fibre density quantification, and glomeruli detection in the kidney.
Abstract: We describe Orbit Image Analysis, an open-source whole slide image analysis tool. The tool consists of a generic tile-processing engine which allows the execution of various image analysis algorithms provided by either Orbit itself or from other open-source platforms using a tile-based map-reduce execution framework. Orbit Image Analysis is capable of sophisticated whole slide imaging analyses due to several key features. First, Orbit has machine-learning capabilities. This deep learning segmentation can be integrated with complex object detection for analysis of intricate tissues. In addition, Orbit can run locally as standalone or connect to the open-source image server OMERO. Another important characteristic is its scale-out functionality, using the Apache Spark framework for distributed computing. In this paper, we describe the use of Orbit in three different real-world applications: quantification of idiopathic lung fibrosis, nerve fibre density quantification, and glomeruli detection in the kidney.

Journal ArticleDOI
TL;DR: This work combines coarse-grained molecular dynamics simulations with previously measured small-angle scattering data to study the conformation of three-domain protein TIA-1 in solution and finds that as long as the initial simulation is relatively good, reweighting against experiments is very robust.
Abstract: Many proteins contain multiple folded domains separated by flexible linkers, and the ability to describe the structure and conformational heterogeneity of such flexible systems pushes the limits of structural biology. Using the three-domain protein TIA-1 as an example, we here combine coarse-grained molecular dynamics simulations with previously measured small-angle scattering data to study the conformation of TIA-1 in solution. We show that while the coarse-grained potential (Martini) in itself leads to too compact conformations, increasing the strength of protein-water interactions results in ensembles that are in very good agreement with experiments. We show how these ensembles can be refined further using a Bayesian/Maximum Entropy approach, and examine the robustness to errors in the energy function. In particular we find that as long as the initial simulation is relatively good, reweighting against experiments is very robust. We also study the relative information in X-ray and neutron scattering experiments and find that refining against the SAXS experiments leads to improvement in the SANS data. Our results suggest a general strategy for studying the conformation of multi-domain proteins in solution that combines coarse-grained simulations with small-angle X-ray scattering data that are generally most easy to obtain. These results may in turn be used to design further small-angle neutron scattering experiments that exploit contrast variation through 1H/2H isotope substitutions.
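
The Bayesian/Maximum Entropy refinement balances agreement with experiment against deviation from the initial ensemble weights. A generic sketch of that optimization follows (synthetic observables, hypothetical confidence parameter `theta`; not the authors' code).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# toy ensemble: 200 frames, 10 synthetic "scattering" observables per frame
calc = rng.normal(size=(200, 10))
y_exp = calc[:50].mean(axis=0) + rng.normal(0, 0.05, size=10)  # pseudo-experiment
sigma = np.full(10, 0.05)
w0 = np.full(200, 1 / 200)  # initial weights from the unbiased simulation

def objective(a, theta=2.0):  # theta balances data fit against ensemble perturbation
    w = np.exp(a - a.max()); w /= w.sum()          # softmax keeps weights normalized
    chi2 = (((w @ calc - y_exp) / sigma) ** 2).sum()
    s_rel = np.sum(w * np.log(w / w0))             # relative-entropy penalty
    return 0.5 * chi2 + theta * s_rel

res = minimize(objective, np.zeros(200), method="L-BFGS-B")
w = np.exp(res.x - res.x.max()); w /= w.sum()
n_eff = np.exp(-np.sum(w * np.log(w / w0)))  # fraction of frames effectively retained
print(res.fun, round(float(n_eff), 3))
```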

Journal ArticleDOI
TL;DR: It is confirmed that feedback via a trained reinforcement learning agent can be used to maintain populations at target levels, and that model-free performance with bang-bang control can outperform a traditional proportional integral controller with continuous control, when faced with infrequent sampling.
Abstract: Multi-species microbial communities are widespread in natural ecosystems. When employed for biomanufacturing, engineered synthetic communities have shown increased productivity in comparison with monocultures and allow for the reduction of metabolic load by compartmentalising bioprocesses between multiple sub-populations. Despite these benefits, co-cultures are rarely used in practice because control over the constituent species of an assembled community has proven challenging. Here we demonstrate, in silico, the efficacy of an approach from artificial intelligence, reinforcement learning, for the control of co-cultures within continuous bioreactors. We confirm that feedback via a trained reinforcement learning agent can be used to maintain populations at target levels, and that model-free performance with bang-bang control can outperform a traditional proportional integral controller with continuous control, when faced with infrequent sampling. Further, we demonstrate that a satisfactory control policy can be learned in one twenty-four hour experiment by running five bioreactors in parallel. Finally, we show that reinforcement learning can directly optimise the output of a co-culture bioprocess. Overall, reinforcement learning is a promising technique for the control of microbial communities.

Journal ArticleDOI
TL;DR: OpenSim Moco is a software toolkit for optimizing the motion and control of musculoskeletal models built in the OpenSim modeling and simulation package; it can handle a wide range of problems that interest biomechanists, including motion tracking, motion prediction, model fitting, electromyography-driven simulation, and device design.
Abstract: Musculoskeletal simulations are used in many different applications, ranging from the design of wearable robots that interact with humans to the analysis of patients with impaired movement. Here, we introduce OpenSim Moco, a software toolkit for optimizing the motion and control of musculoskeletal models built in the OpenSim modeling and simulation package. OpenSim Moco uses the direct collocation method, which is often faster and can handle more diverse problems than other methods for musculoskeletal simulation. Moco frees researchers from implementing direct collocation themselves, which typically requires extensive technical expertise, and allows them to focus on their scientific questions. The software can handle a wide range of problems that interest biomechanists, including motion tracking, motion prediction, parameter optimization, model fitting, electromyography-driven simulation, and device design. Moco is the first musculoskeletal direct collocation tool to handle kinematic constraints, which enable modeling of kinematic loops (e.g., cycling models) and complex anatomy (e.g., patellar motion). To show the abilities of Moco, we first solved for muscle activity that produced an observed walking motion while minimizing squared muscle excitations and knee joint loading. Next, we predicted how muscle weakness may cause deviations from a normal walking motion. Lastly, we predicted a squat-to-stand motion and optimized the stiffness of an assistive device placed at the knee. We designed Moco to be easy to use, customizable, and extensible, thereby accelerating the use of simulations to understand the movement of humans and other animals.

Journal ArticleDOI
TL;DR: This work exploits the normative framework of active inference to show that efficient action-oriented models can be learned by balancing goal-oriented and epistemic behaviours in a principled manner, yielding a method for learning adaptive models from limited interactions with an environment.
Abstract: Converging theories suggest that organisms learn and exploit probabilistic models of their environment. However, it remains unclear how such models can be learned in practice. The open-ended complexity of natural environments means that it is generally infeasible for organisms to model their environment comprehensively. Alternatively, action-oriented models attempt to encode a parsimonious representation of adaptive agent-environment interactions. One approach to learning action-oriented models is to learn online in the presence of goal-directed behaviours. This constrains an agent to behaviourally relevant trajectories, reducing the diversity of the data a model needs to account for. Unfortunately, this approach can cause models to prematurely converge to sub-optimal solutions, through a process we refer to as a bad-bootstrap. Here, we exploit the normative framework of active inference to show that efficient action-oriented models can be learned by balancing goal-oriented and epistemic (information-seeking) behaviours in a principled manner. We illustrate our approach using a simple agent-based model of bacterial chemotaxis. We first demonstrate that learning via goal-directed behaviour indeed constrains models to behaviourally relevant aspects of the environment, but that this approach is prone to sub-optimal convergence. We then demonstrate that epistemic behaviours facilitate the construction of accurate and comprehensive models, but that these models are not tailored to any specific behavioural niche and are therefore less efficient in their use of data. Finally, we show that active inference agents learn models that are parsimonious, tailored to action, and which avoid bad bootstraps and sub-optimal convergence. Critically, our results indicate that models learned through active inference can support adaptive behaviour in spite of, and indeed because of, their departure from veridical representations of the environment. Our approach provides a principled method for learning adaptive models from limited interactions with an environment, highlighting a route to sample efficient learning algorithms.

Journal ArticleDOI
TL;DR: A Bayesian epidemiological model is introduced in which a proportion of individuals are willing and able to participate in distancing, with the timing of distancing measures informed by survey data on attitudes to distancing and COVID-19.
Abstract: Extensive non-pharmaceutical and physical distancing measures are currently the primary interventions against coronavirus disease 2019 (COVID-19) worldwide. It is therefore urgent to estimate the impact such measures are having. We introduce a Bayesian epidemiological model in which a proportion of individuals are willing and able to participate in distancing, with the timing of distancing measures informed by survey data on attitudes to distancing and COVID-19. We fit our model to reported COVID-19 cases in British Columbia (BC), Canada, and five other jurisdictions, using an observation model that accounts for both underestimation and the delay between symptom onset and reporting. We estimated the impact that physical distancing (social distancing) has had on the contact rate and examined the projected impact of relaxing distancing measures. We found that, as of April 11, 2020, distancing had a strong impact in BC, consistent with declines in reported cases and in hospitalization and intensive care unit numbers; individuals practising physical distancing experienced approximately 0.22 (0.11-0.34 90% CI [credible interval]) of their normal contact rate. The threshold above which prevalence was expected to grow was 0.55. We define the "contact ratio" to be the ratio of the estimated contact rate to the threshold rate at which cases are expected to grow; we estimated this contact ratio to be 0.40 (0.19-0.60) in BC. We developed an R package 'covidseir' to make our model available, and used it to quantify the impact of distancing in five additional jurisdictions. As of May 7, 2020, we estimated that New Zealand was well below its threshold value (contact ratio of 0.22 [0.11-0.34]), New York (0.60 [0.43-0.74]), Washington (0.84 [0.79-0.90]) and Florida (0.86 [0.76-0.96]) were progressively closer to theirs yet still below, but California (1.15 [1.07-1.23]) was above its threshold overall, with cases still rising. Accordingly, we found that BC, New Zealand, and New York may have had more room to relax distancing measures than the other jurisdictions, though this would need to be done cautiously and with total case volumes in mind. Our projections indicate that intermittent distancing measures, if sufficiently strong and robustly followed, could control COVID-19 transmission. This approach provides a useful tool for jurisdictions to monitor and assess current levels of distancing relative to their threshold, which will continue to be essential through subsequent waves of this pandemic.

Journal ArticleDOI
TL;DR: This review introduces multiview learning, an emerging machine learning field, envisions its potentially powerful applications to multiomics, and discusses the potential applications of each method to genomics, transcriptomics, and epigenomics, with the aim of discovering the functional and mechanistic interpretations across omics.
Abstract: The molecular mechanisms and functions in complex biological systems currently remain elusive. Recent high-throughput techniques, such as next-generation sequencing, have generated a wide variety of multiomics datasets that enable the identification of biological functions and mechanisms via multiple facets. However, integrating these large-scale multiomics data and discovering functional insights are, nevertheless, challenging tasks. To address these challenges, machine learning has been broadly applied to analyze multiomics. This review introduces multiview learning, an emerging machine learning field, and envisions its potentially powerful applications to multiomics. In particular, multiview learning is more effective than previous integrative methods for learning the heterogeneity of data and revealing cross-talk patterns. Although it has been applied to various contexts, such as computer vision and speech recognition, multiview learning has not yet been widely applied to biological data, specifically multiomics data. Therefore, this paper first reviews recent multiview learning methods and unifies them in a framework called multiview empirical risk minimization (MV-ERM). We further discuss the potential applications of each method to multiomics, including genomics, transcriptomics, and epigenomics, with the aim of discovering the functional and mechanistic interpretations across omics. Second, we explore possible applications to different biological systems, including human diseases (e.g., brain disorders and cancers), plants, and single-cell analysis, and discuss both the benefits and caveats of using multiview learning to discover the molecular mechanisms and functions of these systems.

Journal ArticleDOI
TL;DR: Modelling revealed that, during an interoceptive perturbation condition (inspiratory breath-holding during heartbeat tapping), healthy individuals assigned greater precision to ascending cardiac signals than individuals with symptoms of anxiety, depression, or co-morbid depression/anxiety, who failed to increase their precision estimates from resting levels.
Abstract: Recent neurocomputational theories have hypothesized that abnormalities in prior beliefs and/or the precision-weighting of afferent interoceptive signals may facilitate the transdiagnostic emergence of psychopathology. Specifically, it has been suggested that, in certain psychiatric disorders, interoceptive processing mechanisms either over-weight prior beliefs or under-weight signals from the viscera (or both), leading to a failure to accurately update beliefs about the body. However, this has not been directly tested empirically. To evaluate the potential roles of prior beliefs and interoceptive precision in this context, we fit a Bayesian computational model to behavior in a transdiagnostic patient sample during an interoceptive awareness (heartbeat tapping) task. Modelling revealed that, during an interoceptive perturbation condition (inspiratory breath-holding during heartbeat tapping), healthy individuals (N = 52) assigned greater precision to ascending cardiac signals than individuals with symptoms of anxiety (N = 15), depression (N = 69), co-morbid depression/anxiety (N = 153), substance use disorders (N = 131), and eating disorders (N = 14), all of whom failed to increase their precision estimates from resting levels. In contrast, we did not find strong evidence for differences in prior beliefs. These results provide the first empirical computational modeling evidence of a selective dysfunction in adaptive interoceptive processing in psychiatric conditions, and lay the groundwork for future studies examining how reduced interoceptive precision influences visceral regulation and interoceptively-guided decision-making.

Journal ArticleDOI
TL;DR: The reduced Gompertz model was found to exhibit the best results, with drastic improvements when using Bayesian inference as compared to likelihood maximization alone, for both accuracy and precision.
Abstract: Tumor growth curves are classically modeled by means of ordinary differential equations. In analyzing the Gompertz model several studies have reported a striking correlation between the two parameters of the model, which could be used to reduce the dimensionality and improve predictive power. We analyzed tumor growth kinetics within the statistical framework of nonlinear mixed-effects (population approach). This allowed the simultaneous modeling of tumor dynamics and inter-animal variability. Experimental data comprised three animal models of breast and lung cancers, with 833 measurements in 94 animals. Candidate models of tumor growth included the exponential, logistic and Gompertz models. The exponential and, more notably, the logistic models failed to describe the experimental data, whereas the Gompertz model generated very good fits. The previously reported population-level correlation between the Gompertz parameters was further confirmed in our analysis (R2 > 0.92 in all groups). Combining this structural correlation with rigorous population parameter estimation, we propose a reduced Gompertz function consisting of a single individual parameter (and one population parameter). Leveraging the population approach using Bayesian inference, we estimated times of tumor initiation using three late measurement timepoints. The reduced Gompertz model was found to exhibit the best results, with drastic improvements when using Bayesian inference as compared to likelihood maximization alone, for both accuracy and precision. Specifically, mean accuracy (prediction error) was 12.2% versus 78% and mean precision (width of the 95% prediction interval) was 15.6 days versus 210 days, for the breast cancer cell line. These results demonstrate the superior predictive power of the reduced Gompertz model, especially when combined with Bayesian estimation. They offer possible clinical perspectives for personalized prediction of the age of a tumor from limited data at diagnosis. The code and data used in our analysis are publicly available at https://github.com/cristinavaghi/plumky.
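
The reduction rests on the reported linear correlation between the two Gompertz parameters: substituting that relation into the growth law leaves a single individual parameter, which can then be estimated from a handful of late measurements. A sketch of the idea with scipy; the population coefficients `K_POP` and `C_POP` are hypothetical stand-ins, not the paper's estimates.

```python
import numpy as np
from scipy.optimize import curve_fit

V0 = 1.0  # volume at injection (arbitrary units)

def gompertz(t, alpha, beta):
    return V0 * np.exp((alpha / beta) * (1 - np.exp(-beta * t)))

# reduced model: substitute the population-level relation alpha = k*beta + c,
# leaving beta as the single individual parameter (k, c hypothetical here)
K_POP, C_POP = 10.0, 0.1

def reduced_gompertz(t, beta):
    return gompertz(t, K_POP * beta + C_POP, beta)

rng = np.random.default_rng(3)
t_obs = np.array([15.0, 18.0, 21.0])                 # three late measurements
beta_true = 0.07
v_obs = gompertz(t_obs, K_POP * beta_true + C_POP, beta_true) \
        * rng.lognormal(0, 0.1, size=3)              # multiplicative noise

(beta_hat,), _ = curve_fit(reduced_gompertz, t_obs, v_obs, p0=[0.05])
print(beta_hat)  # single individual parameter recovered from only 3 points
```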

Journal ArticleDOI
TL;DR: This paper presents a novel approach called iDrug, which seamlessly integrates drug repositioning and drug-target prediction into one coherent model via cross-network embedding and provides a principled way to transfer knowledge from these two domains and to enhance prediction performance for both tasks.
Abstract: Computational drug repositioning and drug-target prediction have become essential tasks in the early stage of drug discovery. In previous studies, these two tasks have often been considered separately. However, the entities studied in these two tasks (i.e., drugs, targets, and diseases) are inherently related. On one hand, drugs interact with targets in cells to modulate target activities, which in turn alter biological pathways to promote healthy functions and to treat diseases. On the other hand, both drug repositioning and drug-target prediction involve the same drug feature space, which naturally connects these two problems and the two domains (diseases and targets). By using the wisdom of the crowds, it is possible to transfer knowledge from one of the domains to the other. The existence of relationships among drug-target-disease motivates us to jointly consider drug repositioning and drug-target prediction in drug discovery. In this paper, we present a novel approach called iDrug, which seamlessly integrates drug repositioning and drug-target prediction into one coherent model via cross-network embedding. In particular, we provide a principled way to transfer knowledge from these two domains and to enhance prediction performance for both tasks. Using real-world datasets, we demonstrate that iDrug achieves superior performance on both learning tasks compared to several state-of-the-art approaches. Our code and datasets are available at: https://github.com/Case-esaC/iDrug.