
Showing papers in "PLOS Computational Biology in 2014"


Journal ArticleDOI
TL;DR: BEAST 2 now has a fully developed package management system that allows third party developers to write additional functionality that can be directly installed to the BEAST 2 analysis platform via a package manager without requiring a new software release of the platform.
Abstract: We present a new open source, extensible and flexible software platform for Bayesian evolutionary analysis called BEAST 2. This software platform is a re-design of the popular BEAST 1 platform to correct structural deficiencies that became evident as the BEAST 1 software evolved. Key among those deficiencies was the lack of post-deployment extensibility. BEAST 2 now has a fully developed package management system that allows third party developers to write additional functionality that can be directly installed to the BEAST 2 analysis platform via a package manager without requiring a new software release of the platform. This package architecture is showcased with a number of recently published new models encompassing birth-death-sampling tree priors, phylodynamics and model averaging for substitution models and site partitioning. A second major improvement is the ability to read/write the entire state of the MCMC chain to/from disk allowing it to be easily shared between multiple instances of the BEAST software. This facilitates checkpointing and better support for multi-processor and high-end computing extensions. Finally, the functionality in new packages can be easily added to the user interface (BEAUti 2) by a simple XML template-based mechanism because BEAST 2 has been re-designed to provide greater integration between the analysis engine and the user interface so that, for example, BEAST and BEAUti use exactly the same XML file format.

5,183 citations


Journal ArticleDOI
TL;DR: It is advocated that investigators avoid rarefying altogether; well-established statistical theory is provided that simultaneously accounts for library size differences and biological variability using an appropriate mixture model.
Abstract: Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.
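The abstract argues for Negative Binomial models with library-size offsets (as in edgeR/DESeq) instead of rarefying. The Python sketch below illustrates that idea for a single taxon; the simulated counts, group labels, and the fixed dispersion value are illustrative assumptions, not the phyloseq or metagenomeSeq implementations.

```python
# Minimal sketch: test one taxon for differential abundance with a Negative
# Binomial GLM and a log(library size) offset, instead of rarefying counts.
# Simulated data and the fixed dispersion (alpha) are illustrative assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_per_group = 10
group = np.repeat([0, 1], n_per_group)               # two sample classes
lib_size = rng.integers(5_000, 50_000, group.size)   # unequal sequencing depth

# Simulate counts for one taxon: 2x higher relative abundance in group 1.
rel_abund = np.where(group == 1, 2e-3, 1e-3)
mu = rel_abund * lib_size
counts = rng.negative_binomial(n=5, p=5 / (5 + mu))  # overdispersed counts

X = sm.add_constant(group.astype(float))
model = sm.GLM(counts, X,
               family=sm.families.NegativeBinomial(alpha=0.2),
               offset=np.log(lib_size))
fit = model.fit()
print(fit.summary())   # the 'x1' coefficient estimates the log fold-change
```

The offset term models library size directly, so no reads are discarded, which is the point the abstract makes against rarefying.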

2,184 citations


Journal ArticleDOI
TL;DR: The results suggest that explaining IT requires computational features trained through supervised learning to emphasize the behaviorally important categorical divisions prominently reflected in IT.
Abstract: Inferior temporal (IT) cortex in human and nonhuman primates serves visual object recognition. Computational object-vision models, although continually improving, do not yet reach human performance. It is unclear to what extent the internal representations of computational models can explain the IT representation. Here we investigate a wide range of computational model representations (37 in total), testing their categorization performance and their ability to account for the IT representational geometry. The models include well-known neuroscientific object-recognition models (e.g. HMAX, VisNet) along with several models from computer vision (e.g. SIFT, GIST, self-similarity features, and a deep convolutional neural network). We compared the representational dissimilarity matrices (RDMs) of the model representations with the RDMs obtained from human IT (measured with fMRI) and monkey IT (measured with cell recording) for the same set of stimuli (not used in training the models). Better performing models were more similar to IT in that they showed greater clustering of representational patterns by category. In addition, better performing models also more strongly resembled IT in terms of their within-category representational dissimilarities. Representational geometries were significantly correlated between IT and many of the models. However, the categorical clustering observed in IT was largely unexplained by the unsupervised models. The deep convolutional network, which was trained by supervision with over a million category-labeled images, reached the highest categorization performance and also best explained IT, although it did not fully explain the IT data. Combining the features of this model with appropriate weights and adding linear combinations that maximize the margin between animate and inanimate objects and between faces and other objects yielded a representation that fully explained our IT data. Overall, our results suggest that explaining IT requires computational features trained through supervised learning to emphasize the behaviorally important categorical divisions prominently reflected in IT.

1,138 citations


Journal ArticleDOI
TL;DR: Integrated Information Theory of consciousness 3.0 is presented, which incorporates several advances over previous formulations and arrives at an identity: an experience is a maximally irreducible conceptual structure (MICS, a constellation of concepts in qualia space), and the set of elements that generates it constitutes a complex.
Abstract: This paper presents Integrated Information Theory (IIT) of consciousness 3.0, which incorporates several advances over previous formulations. IIT starts from phenomenological axioms: information says that each experience is specific – it is what it is by how it differs from alternative experiences; integration says that it is unified – irreducible to non-interdependent components; exclusion says that it has unique borders and a particular spatio-temporal grain. These axioms are formalized into postulates that prescribe how physical mechanisms, such as neurons or logic gates, must be configured to generate experience (phenomenology). The postulates are used to define intrinsic information as “differences that make a difference” within a system, and integrated information as information specified by a whole that cannot be reduced to that specified by its parts. By applying the postulates both at the level of individual mechanisms and at the level of systems of mechanisms, IIT arrives at an identity: an experience is a maximally irreducible conceptual structure (MICS, a constellation of concepts in qualia space), and the set of elements that generates it constitutes a complex. According to IIT, a MICS specifies the quality of an experience and integrated information ΦMax its quantity. From the theory follow several results, including: a system of mechanisms may condense into a major complex and non-overlapping minor complexes; the concepts that specify the quality of an experience are always about the complex itself and relate only indirectly to the external environment; anatomical connectivity influences complexes and associated MICS; a complex can generate a MICS even if its elements are inactive; simple systems can be minimally conscious; complicated systems can be unconscious; there can be true “zombies” – unconscious feed-forward systems that are functionally equivalent to conscious complexes.

787 citations


Journal ArticleDOI
TL;DR: A Matlab toolbox for representational similarity analysis is introduced, designed to help integrate a wide range of computational models into the analysis of multichannel brain-activity measurements as provided by modern functional imaging and neuronal recording techniques.
Abstract: Neuronal population codes are increasingly being investigated with multivariate pattern-information analyses. A key challenge is to use measured brain-activity patterns to test computational models of brain information processing. One approach to this problem is representational similarity analysis (RSA), which characterizes a representation in a brain or computational model by the distance matrix of the response patterns elicited by a set of stimuli. The representational distance matrix encapsulates what distinctions between stimuli are emphasized and what distinctions are de-emphasized in the representation. A model is tested by comparing the representational distance matrix it predicts to that of a measured brain region. RSA also enables us to compare representations between stages of processing within a given brain or model, between brain and behavioral data, and between individuals and species. Here, we introduce a Matlab toolbox for RSA. The toolbox supports an analysis approach that is simultaneously data- and hypothesis-driven. It is designed to help integrate a wide range of computational models into the analysis of multichannel brain-activity measurements as provided by modern functional imaging and neuronal recording techniques. Tools for visualization and inference enable the user to relate sets of models to sets of brain regions and to statistically test and compare the models using nonparametric inference methods. The toolbox supports searchlight-based RSA, to continuously map a measured brain volume in search of a neuronal population code with a specific geometry. Finally, we introduce the linear-discriminant t value as a measure of representational discriminability that bridges the gap between linear decoding analyses and RSA. In order to demonstrate the capabilities of the toolbox, we apply it to both simulated and real fMRI data. The key functions are equally applicable to other modalities of brain-activity measurement. The toolbox is freely available to the community under an open-source license agreement (http://www.mrc-cbu.cam.ac.uk/methods-and-resources/toolboxes/license/).
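As a concrete illustration of the core RSA computation described above, the following Python sketch builds representational dissimilarity matrices from two stimulus-by-channel response matrices and compares them via a Spearman correlation over their upper triangles. The random data and the choice of correlation distance are illustrative assumptions; this stands in for, rather than reproduces, the Matlab toolbox functions.

```python
# Minimal RSA sketch (illustrative, not the Matlab toolbox): build RDMs from
# stimulus x channel response patterns and compare a model RDM to a brain RDM.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n_stimuli, n_channels = 20, 100

brain_patterns = rng.normal(size=(n_stimuli, n_channels))   # e.g. fMRI voxels
model_patterns = brain_patterns + rng.normal(scale=2.0, size=(n_stimuli, n_channels))

# RDM = pairwise correlation distance between stimulus response patterns.
brain_rdm = squareform(pdist(brain_patterns, metric="correlation"))
model_rdm = squareform(pdist(model_patterns, metric="correlation"))

# Compare RDMs on their upper triangles (RDMs are symmetric with zero diagonal).
iu = np.triu_indices(n_stimuli, k=1)
rho, p = spearmanr(brain_rdm[iu], model_rdm[iu])
print(f"model-brain RDM correlation: rho={rho:.2f}, p={p:.3f}")
```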

779 citations


Journal ArticleDOI
TL;DR: These evaluations show that, unlike previous bio-inspired models, the latest DNNs rival the representational performance of IT cortex on this visual object recognition task and propose an extension of “kernel analysis” that measures the generalization accuracy as a function of representational complexity.
Abstract: The primate visual system achieves remarkable visual object recognition performance even in brief presentations, and under changes to object exemplar, geometric transformations, and background variation (a.k.a. core visual object recognition). This remarkable performance is mediated by the representation formed in inferior temporal (IT) cortex. In parallel, recent advances in machine learning have led to ever higher performing models of object recognition using artificial deep neural networks (DNNs). It remains unclear, however, whether the representational performance of DNNs rivals that of the brain. To accurately produce such a comparison, a major difficulty has been a unifying metric that accounts for experimental limitations, such as the amount of noise, the number of neural recording sites, and the number of trials, and computational limitations, such as the complexity of the decoding classifier and the number of classifier training examples. In this work, we perform a direct comparison that corrects for these experimental limitations and computational considerations. As part of our methodology, we propose an extension of ‘‘kernel analysis’’ that measures the generalization accuracy as a function of representational complexity. Our evaluations show that, unlike previous bio-inspired models, the latest DNNs rival the representational performance of IT cortex on this visual object recognition task. Furthermore, we show that models that perform well on measures of representational performance also perform well on measures of representational similarity to IT, and on measures of predicting individual IT multi-unit responses. Whether these DNNs rely on computational mechanisms similar to the primate visual system is yet to be determined, but, unlike all previous bioinspired models, that possibility cannot be ruled out merely on representational performance grounds.

773 citations


Journal ArticleDOI
TL;DR: p53 is over-activated in breast cancer cells, followed by RNA-seq and ChIP-seq, identifying an extensive up-regulated network controlled directly by p53 and mapping a repressive network with no indication of direct p53 regulation but rather an indirect effect via E2F and NFY.
Abstract: Identifying master regulators of biological processes and mapping their downstream gene networks are key challenges in systems biology. We developed a computational method, called iRegulon, to reverse-engineer the transcriptional regulatory network underlying a co-expressed gene set using cis-regulatory sequence analysis. iRegulon implements a genome-wide ranking-and-recovery approach to detect enriched transcription factor motifs and their optimal sets of direct targets. We increase the accuracy of network inference by using very large motif collections of up to ten thousand position weight matrices collected from various species, and linking these to candidate human TFs via a motif2TF procedure. We validate iRegulon on gene sets derived from ENCODE ChIP-seq data with increasing levels of noise, and we compare iRegulon with existing motif discovery methods. Next, we use iRegulon on more challenging types of gene lists, including microRNA target sets, protein-protein interaction networks, and genetic perturbation data. In particular, we over-activate p53 in breast cancer cells, followed by RNA-seq and ChIP-seq, and could identify an extensive up-regulated network controlled directly by p53. Similarly we map a repressive network with no indication of direct p53 regulation but rather an indirect effect via E2F and NFY. Finally, we generalize our computational framework to include regulatory tracks such as ChIP-seq data and show how motif and track discovery can be combined to map functional regulatory interactions among co-expressed genes. iRegulon is available as a Cytoscape plugin from http://iregulon.aertslab.org.

680 citations


Journal ArticleDOI
TL;DR: This study constructed three independent datasets by removing all duplicates, inconsistencies and mutations previously used in the training of the evaluated prediction tools; the resulting consensus classifier returned results for all mutations, confirming that consensus prediction represents an accurate and robust alternative to the predictions delivered by individual tools.
Abstract: Single nucleotide variants represent a prevalent form of genetic variation. Mutations in the coding regions are frequently associated with the development of various genetic diseases. Computational tools for the prediction of the effects of mutations on protein function are very important for analysis of single nucleotide variants and their prioritization for experimental characterization. Many computational tools are already widely employed for this purpose. Unfortunately, their comparison and further improvement is hindered by large overlaps between the training datasets and benchmark datasets, which lead to biased and overly optimistic reported performances. In this study, we have constructed three independent datasets by removing all duplicates, inconsistencies and mutations previously used in the training of evaluated tools. The benchmark dataset containing over 43,000 mutations was employed for the unbiased evaluation of eight established prediction tools: MAPP, nsSNPAnalyzer, PANTHER, PhD-SNP, PolyPhen-1, PolyPhen-2, SIFT and SNAP. The six best performing tools were combined into a consensus classifier PredictSNP, resulting in significantly improved prediction performance, and at the same time returning results for all mutations, confirming that consensus prediction represents an accurate and robust alternative to the predictions delivered by individual tools. A user-friendly web interface enables easy access to all eight prediction tools, the consensus classifier PredictSNP and annotations from the Protein Mutant Database and the UniProt database. The web server and the datasets are freely available to the academic community at http://loschmidt.chemi.muni.cz/predictsnp.
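The abstract describes combining the six best tools into the consensus classifier PredictSNP. A minimal sketch of one way such a consensus could work, a confidence-weighted majority vote over per-tool predictions, is given below; the tool names shown, the weights, and the voting rule are illustrative assumptions and not the published PredictSNP algorithm.

```python
# Illustrative consensus over per-tool predictions (+1 = deleterious, -1 = neutral).
# The weights and voting rule are assumptions, not the published PredictSNP scheme.
from typing import Dict

def consensus_predict(predictions: Dict[str, int],
                      weights: Dict[str, float]) -> int:
    """Weighted majority vote; ties default to neutral (-1)."""
    score = sum(weights.get(tool, 1.0) * pred
                for tool, pred in predictions.items())
    return 1 if score > 0 else -1

# Example: hypothetical predictions from individual tools for one mutation.
preds = {"MAPP": 1, "PhD-SNP": 1, "PolyPhen-1": -1,
         "PolyPhen-2": 1, "SIFT": 1, "SNAP": -1}
weights = {tool: 1.0 for tool in preds}       # equal weights for illustration
print(consensus_predict(preds, weights))      # -> 1 (predicted deleterious)
```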

571 citations


Journal ArticleDOI
TL;DR: The gkm-SVM predicts functional genomic regulatory elements and tissue specific enhancers with significantly improved accuracy, increasing the precision by up to a factor of two, and the general utility of this method is demonstrated using a Naïve-Bayes classifier.
Abstract: Oligomers of length k, or k-mers, are convenient and widely used features for modeling the properties and functions of DNA and protein sequences. However, k-mers suffer from the inherent limitation that if the parameter k is increased to resolve longer features, the probability of observing any specific k-mer becomes very small, and k-mer counts approach a binary variable, with most k-mers absent and a few present once. Thus, any statistical learning approach using k-mers as features becomes susceptible to noisy training set k-mer frequencies once k becomes large. To address this problem, we introduce alternative feature sets using gapped k-mers, a new classifier, gkm-SVM, and a general method for robust estimation of k-mer frequencies. To make the method applicable to large-scale genome wide applications, we develop an efficient tree data structure for computing the kernel matrix. We show that compared to our original kmer-SVM and alternative approaches, our gkm-SVM predicts functional genomic regulatory elements and tissue specific enhancers with significantly improved accuracy, increasing the precision by up to a factor of two. We then show that gkm-SVM consistently outperforms kmer-SVM on human ENCODE ChIP-seq datasets, and further demonstrate the general utility of our method using a Naive-Bayes classifier. Although developed for regulatory sequence analysis, these methods can be applied to any sequence classification problem.
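To make the gapped k-mer idea concrete, the sketch below counts gapped k-mer features for a DNA sequence in Python: every length-l window is reduced to k informative positions, with the remaining l-k positions treated as wildcards. This is only an illustration of the feature definition, not the gkm-SVM kernel or its efficient tree data structure.

```python
# Illustrative gapped k-mer feature counting: length-l words with k informative
# positions and l-k wildcard gaps. Sketches the feature definition only, not
# the gkm-SVM kernel or its tree-based implementation.
from collections import Counter
from itertools import combinations

def gapped_kmer_counts(seq: str, l: int = 6, k: int = 4) -> Counter:
    counts = Counter()
    for start in range(len(seq) - l + 1):
        word = seq[start:start + l]
        for keep in combinations(range(l), k):   # choose informative positions
            feature = tuple((pos, word[pos]) for pos in keep)
            counts[feature] += 1
    return counts

counts = gapped_kmer_counts("ACGTACGTGGCA", l=6, k=4)
print(len(counts), "distinct gapped k-mer features")
```

Because many windows share the same informative positions, each gapped k-mer is observed far more often than the corresponding full k-mer, which is the robustness argument made in the abstract.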

456 citations


Journal ArticleDOI
TL;DR: The results not only have important implications for biological theories of tumor growth and the use of mathematical modeling in preclinical anti-cancer drug investigations, but also may assist in defining how mathematical models could serve as potential prognostic tools in the clinic.
Abstract: Despite internal complexity, tumor growth kinetics follow relatively simple laws that can be expressed as mathematical models. To explore this further, quantitative analysis of the most classical of these were performed. The models were assessed against data from two in vivo experimental systems: an ectopic syngeneic tumor (Lewis lung carcinoma) and an orthotopically xenografted human breast carcinoma. The goals were threefold: 1) to determine a statistical model for description of the measurement error, 2) to establish the descriptive power of each model, using several goodness-of-fit metrics and a study of parametric identifiability, and 3) to assess the models' ability to forecast future tumor growth. The models included in the study comprised the exponential, exponential-linear, power law, Gompertz, logistic, generalized logistic, von Bertalanffy and a model with dynamic carrying capacity. For the breast data, the dynamics were best captured by the Gompertz and exponential-linear models. The latter also exhibited the highest predictive power, with excellent prediction scores (≥80%) extending out as far as 12 days in the future. For the lung data, the Gompertz and power law models provided the most parsimonious and parametrically identifiable description. However, not one of the models was able to achieve a substantial prediction rate (≥70%) beyond the next day data point. In this context, adjunction of a priori information on the parameter distribution led to considerable improvement. For instance, forecast success rates went from 14.9% to 62.7% when using the power law model to predict the full future tumor growth curves, using just three data points. These results not only have important implications for biological theories of tumor growth and the use of mathematical modeling in preclinical anti-cancer drug investigations, but also may assist in defining how mathematical models could serve as potential prognostic tools in the clinic.
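Since the abstract compares several classical growth laws, a short Python sketch of fitting one of them, the Gompertz model, to tumor volume measurements is included below. The synthetic data, the initial parameter guesses, and the particular parameterization V(t) = V0*exp((a/b)*(1 - exp(-b*t))) are assumptions for illustration; the paper's error model and identifiability analysis are not reproduced.

```python
# Illustrative fit of a Gompertz growth law to synthetic tumor volume data.
# Parameterization, noise model, and data are assumptions, not the paper's setup.
import numpy as np
from scipy.optimize import curve_fit

def gompertz(t, V0, a, b):
    return V0 * np.exp((a / b) * (1.0 - np.exp(-b * t)))

t = np.arange(0, 20, 2.0)                       # days
true = dict(V0=50.0, a=0.8, b=0.15)             # mm^3, 1/day, 1/day
rng = np.random.default_rng(2)
V = gompertz(t, **true) * rng.lognormal(sigma=0.1, size=t.size)  # proportional noise

params, cov = curve_fit(gompertz, t, V, p0=[40.0, 0.5, 0.1])
print(dict(zip(["V0", "a", "b"], params.round(3))))
```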

453 citations


Journal ArticleDOI
TL;DR: SciClone is a computational method used to detect subclones in acute myeloid leukemia and breast cancer samples that, though present at disease onset, are not evident from a single primary tumor sample; it can also track tumor evolution and identify the spatial origins of cells resisting therapy.
Abstract: The sensitivity of massively-parallel sequencing has confirmed that most cancers are oligoclonal, with subpopulations of neoplastic cells harboring distinct mutations. A fine resolution view of this clonal architecture provides insight into tumor heterogeneity, evolution, and treatment response, all of which may have clinical implications. Single tumor analysis already contributes to understanding these phenomena. However, cryptic subclones are frequently revealed by additional patient samples (e.g., collected at relapse or following treatment), indicating that accurately characterizing a tumor requires analyzing multiple samples from the same patient. To address this need, we present SciClone, a computational method that identifies the number and genetic composition of subclones by analyzing the variant allele frequencies of somatic mutations. We use it to detect subclones in acute myeloid leukemia and breast cancer samples that, though present at disease onset, are not evident from a single primary tumor sample. By doing so, we can track tumor evolution and identify the spatial origins of cells resisting therapy.
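SciClone infers subclones from the variant allele frequencies (VAFs) of somatic mutations; the Python sketch below conveys the underlying idea by clustering VAFs from two samples of the same patient with a generic Gaussian mixture and choosing the number of clusters by BIC. The simulated VAF clusters and the Gaussian mixture are stand-in assumptions, not SciClone's published mixture model.

```python
# Illustrative subclone detection: cluster two-sample variant allele frequencies
# (VAFs) with a Gaussian mixture and pick the cluster number by BIC.
# This is a generic stand-in, not SciClone's mixture model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Three hypothetical subclones with different (primary, relapse) VAF centers.
centers = np.array([[0.48, 0.50],    # founding clone
                    [0.25, 0.02],    # subclone lost at relapse
                    [0.05, 0.30]])   # subclone expanding under therapy
vafs = np.vstack([c + rng.normal(scale=0.03, size=(60, 2)) for c in centers])

models = [GaussianMixture(n_components=k, random_state=0).fit(vafs)
          for k in range(1, 6)]
best = min(models, key=lambda m: m.bic(vafs))
print("inferred subclones:", best.n_components)
print("cluster centers:\n", best.means_.round(2))
```

The toy example mirrors the point of the abstract: the second and third clusters are only separable because two samples from the same patient are analyzed jointly.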

Journal ArticleDOI
TL;DR: rDock is a molecular docking program developed at Vernalis for high-throughput virtual screening (HTVS) applications; it can be used against proteins and nucleic acids, and allows the user to incorporate additional constraints and information as a bias to guide docking.
Abstract: Identification of chemical compounds with specific biological activities is an important step in both chemical biology and drug discovery. When the structure of the intended target is available, one approach is to use molecular docking programs to assess the chemical complementarity of small molecules with the target; such calculations provide a qualitative measure of affinity that can be used in virtual screening (VS) to rank order a list of compounds according to their potential to be active. rDock is a molecular docking program developed at Vernalis for high-throughput VS (HTVS) applications. Evolved from RiboDock, the program can be used against proteins and nucleic acids, is designed to be computationally very efficient and allows the user to incorporate additional constraints and information as a bias to guide docking. This article provides an overview of the program structure and features and compares rDock to two reference programs, AutoDock Vina (open source) and Schrodinger's Glide (commercial). In terms of computational speed for VS, rDock is faster than Vina and comparable to Glide. For binding mode prediction, rDock and Vina are superior to Glide. The VS performance of rDock is significantly better than Vina, but inferior to Glide for most systems unless pharmacophore constraints are used; in that case rDock and Glide are of equal performance. The program is released under the Lesser General Public License and is freely available for download, together with the manuals, example files and the complete test sets, at http://rdock.sourceforge.net/

Journal ArticleDOI
TL;DR: A survey of recently published methods that use transcript levels to try to improve metabolic flux predictions either by generating flux distributions or by creating context-specific models finds that for many conditions, the predictions obtained by simple flux balance analysis using growth maximization and parsimony criteria are as good or better than those obtained using methods that incorporate transcriptomic data.
Abstract: Constraint-based models of metabolism are a widely used framework for predicting flux distributions in genome-scale biochemical networks. The number of published methods for integration of transcriptomic data into constraint-based models has been rapidly increasing. So far the predictive capability of these methods has not been critically evaluated and compared. This work presents a survey of recently published methods that use transcript levels to try to improve metabolic flux predictions either by generating flux distributions or by creating context-specific models. A subset of these methods is then systematically evaluated using published data from three different case studies in E. coli and S. cerevisiae. The flux predictions made by different methods using transcriptomic data are compared against experimentally determined extracellular and intracellular fluxes (from 13C-labeling data). The sensitivity of the results to method-specific parameters is also evaluated, as well as their robustness to noise in the data. The results show that none of the methods outperforms the others for all cases. Also, it is observed that for many conditions, the predictions obtained by simple flux balance analysis using growth maximization and parsimony criteria are as good or better than those obtained using methods that incorporate transcriptomic data. We further discuss the differences in the mathematical formulation of the methods, and their relation to the results we have obtained, as well as the connection to the underlying biological principles of metabolic regulation.
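Because the comparison baseline in this survey is plain flux balance analysis with growth maximization, a tiny worked FBA example is sketched below in Python: maximize a biomass flux subject to steady-state mass balance S·v = 0 and flux bounds, solved as a linear program. The toy stoichiometric matrix and bounds are made-up assumptions; genome-scale analyses use dedicated packages such as COBRApy.

```python
# Minimal flux balance analysis (FBA) sketch: maximize biomass flux subject to
# S v = 0 and flux bounds, as a linear program. Toy network with made-up numbers.
import numpy as np
from scipy.optimize import linprog

# Reactions: R1 (uptake -> A), R2 (A -> B), R3 (B -> biomass sink)
S = np.array([[ 1, -1,  0],    # metabolite A
              [ 0,  1, -1]])   # metabolite B
bounds = [(0, 10), (0, 1000), (0, 1000)]   # uptake capped at 10 units
c = np.array([0, 0, -1])                   # linprog minimizes, so negate biomass

res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]), bounds=bounds, method="highs")
print("optimal biomass flux:", -res.fun)   # -> 10.0
print("flux distribution:", res.x)
```

Parsimonious FBA, also mentioned in the abstract, would add a second step that minimizes the total flux while holding the biomass flux at this optimum.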

Journal ArticleDOI
TL;DR: A unified view of how allostery works is described, which not only clarifies the elusive allosteric mechanism but also provides a structural grasp of agonist-mediated signaling pathways and guides allosteric drug discovery.
Abstract: The question of how allostery works was posed almost 50 years ago. Since then it has been the focus of much effort. This is for two reasons: first, the intellectual curiosity of basic science and the desire to understand fundamental phenomena, and second, its vast practical importance. Allostery is at play in all processes in the living cell, and increasingly in drug discovery. Many models have been successfully formulated, and are able to describe allostery even in the absence of a detailed structural mechanism. However, conceptual schemes designed to qualitatively explain allosteric mechanisms usually lack a quantitative mathematical model, and are unable to link its thermodynamic and structural foundations. This hampers insight into oncogenic mutations in cancer progression and biased agonists' actions. Here, we describe how allostery works from three different standpoints: thermodynamics, free energy landscape of population shift, and structure; all with exactly the same allosteric descriptors. This results in a unified view which not only clarifies the elusive allosteric mechanism but also provides structural grasp of agonist-mediated signaling pathways, and guides allosteric drug discovery. Of note, the unified view reasons that allosteric coupling (or communication) does not determine the allosteric efficacy; however, a communication channel is what makes potential binding sites allosteric.

Journal ArticleDOI
TL;DR: Results suggest that proxies perform differently in approximating commuting patterns for disease spread at different resolution scales, with the radiation model showing higher accuracy than mobile phone data when the seed is central in the network, the opposite being observed for peripheral locations.
Abstract: Human mobility is a key component of large-scale spatial-transmission models of infectious diseases. Correctly modeling and quantifying human mobility is critical for improving epidemic control, but may be hindered by data incompleteness or unavailability. Here we explore the opportunity of using proxies for individual mobility to describe commuting flows and predict the diffusion of an influenza-like-illness epidemic. We consider three European countries and the corresponding commuting networks at different resolution scales, obtained from (i) official census surveys, (ii) proxy mobility data extracted from mobile phone call records, and (iii) the radiation model calibrated with census data. Metapopulation models defined on these countries and integrating the different mobility layers are compared in terms of epidemic observables. We show that commuting networks from mobile phone data capture the empirical commuting patterns well, accounting for more than 87% of the total fluxes. The distributions of commuting fluxes per link from mobile phones and census sources are similar and highly correlated, however a systematic overestimation of commuting traffic in the mobile phone data is observed. This leads to epidemics that spread faster than on census commuting networks, once the mobile phone commuting network is considered in the epidemic model, however preserving to a high degree the order of infection of newly affected locations. Proxies' calibration affects the arrival times' agreement across different models, and the observed topological and traffic discrepancies among mobility sources alter the resulting epidemic invasion patterns. Results also suggest that proxies perform differently in approximating commuting patterns for disease spread at different resolution scales, with the radiation model showing higher accuracy than mobile phone data when the seed is central in the network, the opposite being observed for peripheral locations. Proxies should therefore be chosen in light of the desired accuracy for the epidemic situation under study.
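For reference, the radiation model used as the third mobility proxy predicts commuting flows from population data alone; in its standard form from the literature (Simini et al., 2012), which the study calibrates with census data, the average flux from location i to location j is

```latex
\langle T_{ij} \rangle \;=\; T_i \,\frac{m_i \, n_j}{(m_i + s_{ij})\,(m_i + n_j + s_{ij})},
```

where T_i is the total number of commuters leaving i, m_i and n_j are the origin and destination populations, and s_ij is the population within a circle of radius r_ij centred on i (excluding m_i and n_j). This is the textbook form of the model; the paper's specific calibration is not reproduced here.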

Journal ArticleDOI
TL;DR: A hierarchical evolutionary classification of all proteins with experimentally determined spatial structures is developed and presented as an interactive and updatable online database that catalogs the largest number of evolutionary links among structural domain classifications.
Abstract: Understanding the evolution of a protein, including both close and distant relationships, often reveals insight into its structure and function. Fast and easy access to such up-to-date information facilitates research. We have developed a hierarchical evolutionary classification of all proteins with experimentally determined spatial structures, and presented it as an interactive and updatable online database. ECOD (Evolutionary Classification of protein Domains) is distinct from other structural classifications in that it groups domains primarily by evolutionary relationships (homology), rather than topology (or “fold”). This distinction highlights cases of homology between domains of differing topology to aid in understanding of protein structure evolution. ECOD uniquely emphasizes distantly related homologs that are difficult to detect, and thus catalogs the largest number of evolutionary links among structural domain classifications. Placing distant homologs together underscores the ancestral similarities of these proteins and draws attention to the most important regions of sequence and structure, as well as conserved functional sites. ECOD also recognizes closer sequence-based relationships between protein domains. Currently, approximately 100,000 protein structures are classified in ECOD into 9,000 sequence families clustered into close to 2,000 evolutionary groups. The classification is assisted by an automated pipeline that quickly and consistently classifies weekly releases of PDB structures and allows for continual updates. This synchronization with PDB uniquely distinguishes ECOD among all protein classifications. Finally, we present several case studies of homologous proteins not recorded in other classifications, illustrating the potential of how ECOD can be used to further biological and evolutionary studies.

Journal ArticleDOI
TL;DR: This work exposes a software toolbox that provides generic, efficient and robust probabilistic solutions to the three problems of model-based analysis of empirical data: data simulation, parameter estimation/model selection, and experimental design optimization.
Abstract: This work is in line with an ongoing effort toward a computational (quantitative and refutable) understanding of human neuro-cognitive processes. Many sophisticated models for behavioural and neurobiological data have flourished during the past decade. Most of these models are partly unspecified (i.e. they have unknown parameters) and nonlinear. This makes them difficult to pair with a formal statistical data analysis framework. In turn, this compromises the reproducibility of model-based empirical studies. This work exposes a software toolbox that provides generic, efficient and robust probabilistic solutions to the three problems of model-based analysis of empirical data: (i) data simulation, (ii) parameter estimation/model selection, and (iii) experimental design optimization.

Journal ArticleDOI
TL;DR: A Bayesian Markov Chain Monte Carlo algorithm is developed and implemented to infer sampled ancestor trees, that is, trees in which sampled individuals can be direct ancestors of other sampled individuals, and this phylogenetic inference accounting for sampled ancestors is applied to epidemiological data.
Abstract: Phylogenetic analyses which include fossils or molecular sequences that are sampled through time require models that allow one sample to be a direct ancestor of another sample. As previously available phylogenetic inference tools assume that all samples are tips, they do not allow for this possibility. We have developed and implemented a Bayesian Markov Chain Monte Carlo (MCMC) algorithm to infer what we call sampled ancestor trees, that is, trees in which sampled individuals can be direct ancestors of other sampled individuals. We use a family of birth-death models where individuals may remain in the tree process after sampling, in particular we extend the birth-death skyline model [Stadler et al., 2013] to sampled ancestor trees. This method allows the detection of sampled ancestors as well as estimation of the probability that an individual will be removed from the process when it is sampled. We show that even if sampled ancestors are not of specific interest in an analysis, failing to account for them leads to significant bias in parameter estimates. We also show that sampled ancestor birth-death models where every sample comes from a different time point are non-identifiable and thus require one parameter to be known in order to infer other parameters. We apply our phylogenetic inference accounting for sampled ancestors to epidemiological data, where the possibility of sampled ancestors enables us to identify individuals that infected other individuals after being sampled and to infer fundamental epidemiological parameters. We also apply the method to infer divergence times and diversification rates when fossils are included along with extant species samples, so that fossilisation events are modelled as a part of the tree branching process. Such modelling has many advantages as argued in the literature. The sampler is available as an open-source BEAST2 package (https://github.com/CompEvol/sampled-ancestors).

Journal ArticleDOI
TL;DR: The properties of the neural vocabulary are explored by estimating its entropy, which constrains the population's capacity to represent visual information, and classifying activity patterns into a small set of metastable collective modes, showing that the neural codeword ensembles are extremely inhomogeneous.
Abstract: Maximum entropy models are the least structured probability distributions that exactly reproduce a chosen set of statistics measured in an interacting network. Here we use this principle to construct probabilistic models which describe the correlated spiking activity of populations of up to 120 neurons in the salamander retina as it responds to natural movies. Already in groups as small as 10 neurons, interactions between spikes can no longer be regarded as small perturbations in an otherwise independent system; for 40 or more neurons pairwise interactions need to be supplemented by a global interaction that controls the distribution of synchrony in the population. Here we show that such “K-pairwise” models—being systematic extensions of the previously used pairwise Ising models—provide an excellent account of the data. We explore the properties of the neural vocabulary by: 1) estimating its entropy, which constrains the population's capacity to represent visual information; 2) classifying activity patterns into a small set of metastable collective modes; 3) showing that the neural codeword ensembles are extremely inhomogeneous; 4) demonstrating that the state of individual neurons is highly predictable from the rest of the population, allowing the capacity for error correction.
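In the notation suggested by the abstract, the “K-pairwise” model extends the pairwise Ising form with a potential on the summed population activity K; a schematic way of writing it (the exact conventions of the paper may differ) is

```latex
P(\sigma) \;=\; \frac{1}{Z}\,
\exp\!\Big( \sum_i h_i \sigma_i \;+\; \tfrac{1}{2}\sum_{i \neq j} J_{ij}\,\sigma_i \sigma_j \;+\; V(K) \Big),
\qquad K \;=\; \sum_i \sigma_i ,
```

where sigma_i marks whether neuron i spikes in a time bin, h_i and J_ij are fit to reproduce the measured firing rates and pairwise correlations, and V(K) is fit to reproduce the observed distribution of synchrony.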

Journal ArticleDOI
TL;DR: The magnitude of the problem, both in terms of total gene number and the number of copies of genes in specific families, and the usefulness of RNA-Seq in improving the gene annotation of draft assemblies, are investigated.
Abstract: Current sequencing methods produce large amounts of data, but genome assemblies based on these data are often woefully incomplete. These incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. In this paper we investigate the magnitude of the problem, both in terms of total gene number and the number of copies of genes in specific families. To do this, we compare multiple draft assemblies against higher-quality versions of the same genomes, using several new assemblies of the chicken genome based on both traditional and next-generation sequencing technologies, as well as published draft assemblies of chimpanzee. We find that upwards of 40% of all gene families are inferred to have the wrong number of genes in draft assemblies, and that these incorrect assemblies both add and subtract genes. Using simulated genome assemblies of Drosophila melanogaster, we find that the major cause of increased gene numbers in draft genomes is the fragmentation of genes onto multiple individual contigs. Finally, we demonstrate the usefulness of RNA-Seq in improving the gene annotation of draft assemblies, largely by connecting genes that have been fragmented in the assembly process.

Journal ArticleDOI
TL;DR: This work develops a framework for quantifying the degree to which pathways suffer these thermodynamic limitations on flux, and calculates a single thermodynamically-derived metric (the Max-min Driving Force, MDF), which enables objective ranking of pathways by the degree to which their flux is constrained by low thermodynamic driving force.
Abstract: In metabolism research, thermodynamics is usually used to determine the directionality of a reaction or the feasibility of a pathway. However, the relationship between thermodynamic potentials and fluxes is not limited to questions of directionality: thermodynamics also affects the kinetics of reactions through the flux-force relationship, which states that the logarithm of the ratio between the forward and reverse fluxes is directly proportional to the change in Gibbs energy due to a reaction (ΔrG′). Accordingly, if an enzyme catalyzes a reaction with a ΔrG′ of -5.7 kJ/mol then the forward flux will be roughly ten times the reverse flux. As ΔrG′ approaches equilibrium (ΔrG′ = 0 kJ/mol), exponentially more enzyme counterproductively catalyzes the reverse reaction, reducing the net rate at which the reaction proceeds. Thus, the enzyme level required to achieve a given flux increases dramatically near equilibrium. Here, we develop a framework for quantifying the degree to which pathways suffer these thermodynamic limitations on flux. For each pathway, we calculate a single thermodynamically-derived metric (the Max-min Driving Force, MDF), which enables objective ranking of pathways by the degree to which their flux is constrained by low thermodynamic driving force. Our framework accounts for the effect of pH, ionic strength and metabolite concentration ranges and allows us to quantify how alterations to the pathway structure affect the pathway's thermodynamics. Applying this methodology to pathways of central metabolism sheds light on some of their features, including metabolic bypasses (e.g., fermentation pathways bypassing substrate-level phosphorylation), substrate channeling (e.g., of oxaloacetate from malate dehydrogenase to citrate synthase), and use of alternative cofactors (e.g., quinone as an electron acceptor instead of NAD). The methods presented here place another arrow in metabolic engineers' quiver, providing a simple means of evaluating the thermodynamic and kinetic quality of different pathway chemistries that produce the same molecules.
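The flux-force relationship quoted above can be written explicitly, and the -5.7 kJ/mol example checked directly:

```latex
\ln\!\left(\frac{J^{+}}{J^{-}}\right) \;=\; -\,\frac{\Delta_r G'}{RT},
\qquad
\frac{J^{+}}{J^{-}} \;=\; e^{5.7/2.48} \;\approx\; e^{2.3} \;\approx\; 10
\quad \text{for } \Delta_r G' = -5.7\ \text{kJ/mol},\ RT \approx 2.48\ \text{kJ/mol at } 25^{\circ}\mathrm{C},
```

so a reaction held 5.7 kJ/mol away from equilibrium carries roughly ten units of forward flux for every unit of reverse flux, consistent with the example in the abstract; as the driving force shrinks toward zero, the ratio approaches one and ever more enzyme is needed to sustain a given net flux.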

Journal ArticleDOI
TL;DR: Functional connectivity reflects the interplay of at least three types of components: a backbone of anatomical connectivity, a stationary dynamical regime directly driven by the underlying anatomy, and other stationary and non-stationary dynamics not directly related to the anatomy.
Abstract: Investigating the relationship between brain structure and function is a central endeavor for neuroscience research. Yet, the mechanisms shaping this relationship largely remain to be elucidated and are highly debated. In particular, the existence and relative contributions of anatomical constraints and dynamical physiological mechanisms of different types remain to be established. We addressed this issue by systematically comparing functional connectivity (FC) from resting-state functional magnetic resonance imaging data with simulations from increasingly complex computational models, and by manipulating anatomical connectivity obtained from fiber tractography based on diffusion-weighted imaging. We hypothesized that FC reflects the interplay of at least three types of components: (i) a backbone of anatomical connectivity, (ii) a stationary dynamical regime directly driven by the underlying anatomy, and (iii) other stationary and non-stationary dynamics not directly related to the anatomy. We showed that anatomical connectivity alone accounts for up to 15% of FC variance; that there is a stationary regime accounting for up to an additional 20% of variance and that this regime can be associated to a stationary FC; that a simple stationary model of FC better explains FC than more complex models; and that there is a large remaining variance (around 65%), which must contain the non-stationarities of FC evidenced in the literature. We also show that homotopic connections across cerebral hemispheres, which are typically improperly estimated, play a strong role in shaping all aspects of FC, notably indirect connections and the topographic organization of brain networks.

Journal ArticleDOI
TL;DR: In this paper, a statistical method exploiting both pathogen sequences and collection dates is proposed to unravel the dynamics of densely sampled outbreaks; it identifies likely transmission events and infers dates of infections, unobserved cases and separate introductions of the disease.
Abstract: Recent years have seen progress in the development of statistically rigorous frameworks to infer outbreak transmission trees (“who infected whom”) from epidemiological and genetic data. Making use of pathogen genome sequences in such analyses remains a challenge, however, with a variety of heuristic approaches having been explored to date. We introduce a statistical method exploiting both pathogen sequences and collection dates to unravel the dynamics of densely sampled outbreaks. Our approach identifies likely transmission events and infers dates of infections, unobserved cases and separate introductions of the disease. It also proves useful for inferring numbers of secondary infections and identifying heterogeneous infectivity and super-spreaders. After testing our approach using simulations, we illustrate the method with the analysis of the beginning of the 2003 Singaporean outbreak of Severe Acute Respiratory Syndrome (SARS), providing new insights into the early stage of this epidemic. Our approach is the first tool for disease outbreak reconstruction from genetic data widely available as free software, the R package outbreaker. It is applicable to various densely sampled epidemics, and improves previous approaches by detecting unobserved and imported cases, as well as allowing multiple introductions of the pathogen. Because of its generality, we believe this method will become a tool of choice for the analysis of densely sampled disease outbreaks, and will form a rigorous framework for subsequent methodological developments.

Journal ArticleDOI
TL;DR: It is found that correlation increases sharply with the swarm's density, indicating that the interaction between midges is based on a metric perception mechanism, suggesting that correlation, rather than order, is the true hallmark of collective behaviour in biological systems.
Abstract: Collective behaviour is a widespread phenomenon in biology, cutting through a huge span of scales, from cell colonies up to bird flocks and fish schools. The most prominent trait of collective behaviour is the emergence of global order: individuals synchronize their states, giving the stunning impression that the group behaves as one. In many biological systems, though, it is unclear whether global order is present. A paradigmatic case is that of insect swarms, whose erratic movements seem to suggest that group formation is a mere epiphenomenon of the independent interaction of each individual with an external landmark. In these cases, whether or not the group behaves truly collectively is debated. Here, we experimentally study swarms of midges in the field and measure how much the change of direction of one midge affects that of other individuals. We discover that, despite the lack of collective order, swarms display very strong correlations, totally incompatible with models of non-interacting particles. We find that correlation increases sharply with the swarm's density, indicating that the interaction between midges is based on a metric perception mechanism. By means of numerical simulations we demonstrate that such growing correlation is typical of a system close to an ordering transition. Our findings suggest that correlation, rather than order, is the true hallmark of collective behaviour in biological systems.
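The quantity measured here, how much one midge's change of direction affects the others, is typically captured by a connected correlation function of velocity fluctuations; one standard estimator from this literature (written here under the assumption that the paper uses the same convention) is

```latex
C(r) \;=\; \frac{\sum_{i \neq j} \delta \vec{v}_i \cdot \delta \vec{v}_j \;\delta(r - r_{ij})}
                {\sum_{i \neq j} \delta(r - r_{ij})},
\qquad
\delta \vec{v}_i \;=\; \vec{v}_i - \frac{1}{N}\sum_{k=1}^{N} \vec{v}_k ,
```

i.e. the average correlation of individual velocity fluctuations, with the swarm's mean motion subtracted, between pairs of midges at mutual distance r. Strong, long-ranged C(r) in the absence of global order is exactly the signature the authors report.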

Journal ArticleDOI
TL;DR: This comprehensive taxonomy of BMC loci, based on their constituent protein domains, foregrounds the functional diversity of BMCs and provides a reference for interpreting the role of BMC gene clusters encoded in isolate, single cell, and metagenomic data.
Abstract: Bacterial microcompartments (BMCs) are proteinaceous organelles involved in both autotrophic and heterotrophic metabolism. All BMCs share homologous shell proteins but differ in their complement of enzymes; these are typically encoded adjacent to shell protein genes in genetic loci, or operons. To enable the identification and prediction of functional (sub)types of BMCs, we developed LoClass, an algorithm that finds putative BMC loci and inventories, weights, and compares their constituent pfam domains to construct a locus similarity network and predict locus (sub)types. In addition to using LoClass to analyze sequences in the Non-redundant Protein Database, we compared predicted BMC loci found in seven candidate bacterial phyla (six from single-cell genomic studies) to the LoClass taxonomy. Together, these analyses resulted in the identification of 23 different types of BMCs encoded in 30 distinct locus (sub)types found in 23 bacterial phyla. These include the two carboxysome types and a divergent set of metabolosomes, BMCs that share a common catalytic core and process distinct substrates via specific signature enzymes. Furthermore, many Candidate BMCs were found that lack one or more core metabolosome components, including one that is predicted to represent an entirely new paradigm for BMC-associated metabolism, joining the carboxysome and metabolosome. By placing these results in a phylogenetic context, we provide a framework for understanding the horizontal transfer of these loci, a starting point for studies aimed at understanding the evolution of BMCs. This comprehensive taxonomy of BMC loci, based on their constituent protein domains, foregrounds the functional diversity of BMCs and provides a reference for interpreting the role of BMC gene clusters encoded in isolate, single cell, and metagenomic data. Many loci encode ancillary functions such as transporters or genes for cofactor assembly; this expanded vocabulary of BMC-related functions should be useful for design of genetic modules for introducing BMCs in bioengineering applications.

Journal ArticleDOI
TL;DR: This work presents fastcore, a generic algorithm for reconstructing context-specific metabolic network models from global genome-wide metabolicnetwork models such as Recon X, and shows that a minimal consistent reconstruction can be defined via a set of sparse modes of the global network.
Abstract: Systemic approaches to the study of a biological cell or tissue rely increasingly on the use of context-specific metabolic network models. The reconstruction of such a model from high-throughput data can routinely involve large numbers of tests under different conditions and extensive parameter tuning, which calls for fast algorithms. We present fastcore, a generic algorithm for reconstructing context-specific metabolic network models from global genome-wide metabolic network models such as Recon X. fastcore takes as input a core set of reactions that are known to be active in the context of interest (e.g., cell or tissue), and it searches for a flux consistent subnetwork of the global network that contains all reactions from the core set and a minimal set of additional reactions. Our key observation is that a minimal consistent reconstruction can be defined via a set of sparse modes of the global network, and fastcore iteratively computes such a set via a series of linear programs. Experiments on liver data demonstrate speedups of several orders of magnitude, and significantly more compact reconstructions, over a rival method. Given its simplicity and its excellent performance, fastcore can form the backbone of many future metabolic network reconstruction algorithms.

Journal ArticleDOI
TL;DR: A Poisson model is developed that accurately estimates the level of ILI activity in the American population, up to two weeks ahead of the CDC, with an absolute average difference between the two estimates of just 0.27% over 294 weeks of data.
Abstract: Circulating levels of both seasonal and pandemic influenza require constant surveillance to ensure the health and safety of the population. While up-to-date information is critical, traditional surveillance systems can have data availability lags of up to two weeks. We introduce a novel method of estimating, in near-real time, the level of influenza-like illness (ILI) in the United States (US) by monitoring the rate of particular Wikipedia article views on a daily basis. We calculated the number of times certain influenza- or health-related Wikipedia articles were accessed each day between December 2007 and August 2013 and compared these data to official ILI activity levels provided by the Centers for Disease Control and Prevention (CDC). We developed a Poisson model that accurately estimates the level of ILI activity in the American population, up to two weeks ahead of the CDC, with an absolute average difference between the two estimates of just 0.27% over 294 weeks of data. Wikipedia-derived ILI models performed well through both abnormally high media coverage events (such as during the 2009 H1N1 pandemic) as well as unusually severe influenza seasons (such as the 2012–2013 influenza season). Wikipedia usage accurately estimated the week of peak ILI activity 17% more often than Google Flu Trends data and was often more accurate in its measure of ILI intensity. With further study, this method could potentially be implemented for continuous monitoring of ILI activity in the US and to provide support for traditional influenza surveillance tools.
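A minimal Python sketch of the kind of Poisson model described, regressing weekly ILI counts on Wikipedia article view counts, is shown below. The simulated data and the single-predictor form are illustrative assumptions; the published model uses a specific set of influenza-related articles and the CDC's ILI definitions.

```python
# Illustrative Poisson regression of weekly ILI case counts on Wikipedia article
# views. Data and the single-predictor form are assumptions, not the paper's model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
weeks = 100
views = rng.poisson(lam=20_000, size=weeks) * (1 + np.sin(np.arange(weeks) / 8))
ili_rate = 0.0001 * views                     # hypothetical link to true ILI level
ili_cases = rng.poisson(ili_rate * 1_000)     # observed weekly ILI counts

X = sm.add_constant(np.log(views))
model = sm.GLM(ili_cases, X, family=sm.families.Poisson())
fit = model.fit()
print(fit.params)   # intercept and elasticity of ILI with respect to article views
```

Because article views are available daily, a model of this shape can be refit continuously, which is what gives the approach its lead over surveillance systems with reporting lags.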

Journal ArticleDOI
TL;DR: Simulation of a complex asymmetric plasma membrane model, which contains seven different lipid species including the glycolipid GM3 in the outer leaflet and the anionic lipid phosphatidylinositol 4,5-bisphosphate in the inner leaflet, reveals emergent nanoscale membrane organization which may be coupled both to fluctuations in local membrane geometry and to interactions with proteins.
Abstract: Cell membranes are complex multicomponent systems, which are highly heterogeneous in the lipid distribution and composition. To date, most molecular simulations have focussed on relatively simple lipid compositions, helping to inform our understanding of in vitro experimental studies. Here we describe simulations of a complex asymmetric plasma membrane model, which contains seven different lipid species including the glycolipid GM3 in the outer leaflet and the anionic lipid phosphatidylinositol 4,5-bisphosphate (PIP2) in the inner leaflet. Plasma membrane models consisting of 1500 lipids and resembling the in vivo composition were constructed and simulations were run for 5 µs. In these simulations the most striking feature was the formation of nano-clusters of GM3 within the outer leaflet. In simulations of protein interactions within a plasma membrane model, GM3, PIP2, and cholesterol all formed favorable interactions with the model α-helical protein. A larger scale simulation of a model plasma membrane containing 6000 lipid molecules revealed correlations between curvature of the bilayer surface and clustering of lipid molecules. In particular, the concave (when viewed from the extracellular side) regions of the bilayer surface were locally enriched in GM3. In summary, these simulations explore the nanoscale dynamics of model bilayers which mimic the in vivo lipid composition of mammalian plasma membranes, revealing emergent nanoscale membrane organization which may be coupled both to fluctuations in local membrane geometry and to interactions with proteins.

Journal ArticleDOI
TL;DR: This Perspectives article gives a broad introduction to the maximum entropy method, in an attempt to encourage its further adoption, and highlights three recent contributions that apply it to molecular simulations.
Abstract: A key component of computational biology is to compare the results of computer modelling with experimental measurements. Despite substantial progress in the models and algorithms used in many areas of computational biology, such comparisons sometimes reveal that the computations are not in quantitative agreement with experimental data. The principle of maximum entropy is a general procedure for constructing probability distributions in the light of new data, making it a natural tool in cases when an initial model provides results that are at odds with experiments. The number of maximum entropy applications in our field has grown steadily in recent years, in areas as diverse as sequence analysis, structural modelling, and neurobiology. In this Perspectives article, we give a broad introduction to the method, in an attempt to encourage its further adoption. The general procedure is explained in the context of a simple example, after which we proceed with a real-world application in the field of molecular simulations, where the maximum entropy procedure has recently provided new insight. Given the limited accuracy of force fields, macromolecular simulations sometimes produce results that are not in complete and quantitative accordance with experiments. A common solution to this problem is to explicitly ensure agreement between the two by perturbing the potential energy function towards the experimental data. So far, a general consensus for how such perturbations should be implemented has been lacking. Three very recent papers have explored this problem using the maximum entropy approach, providing both new theoretical and practical insights to the problem. We highlight each of these contributions in turn and conclude with a discussion on remaining challenges.
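The maximum entropy correction discussed here has a compact general form: given a prior (e.g. force-field) ensemble p0(x) and experimental averages for observables f_i(x), the least-perturbed distribution matching those averages is

```latex
p^{*}(x) \;=\; \frac{1}{Z(\boldsymbol{\lambda})}\, p_0(x)\,
\exp\!\Big(-\sum_i \lambda_i f_i(x)\Big),
\qquad
\int p^{*}(x)\, f_i(x)\, dx \;=\; f_i^{\mathrm{exp}} ,
```

where the Lagrange multipliers lambda_i are chosen to satisfy the experimental constraints and Z(lambda) normalizes the distribution. This is the generic textbook form of the method, not the specific implementations compared in the three papers discussed.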

Journal ArticleDOI
TL;DR: The results show that the cortical encoding of natural sounds entails the formation of multiple representations of sound spectrograms with different degrees of spectral and temporal resolution, and suggest that a spectral-temporal resolution trade-off may govern the modulation tuning of neuronal populations throughout the auditory cortex.
Abstract: Functional neuroimaging research provides detailed observations of the response patterns that natural sounds (e.g. human voices and speech, animal cries, environmental sounds) evoke in the human brain. The computational and representational mechanisms underlying these observations, however, remain largely unknown. Here we combine high spatial resolution (3 and 7 Tesla) functional magnetic resonance imaging (fMRI) with computational modeling to reveal how natural sounds are represented in the human brain. We compare competing models of sound representations and select the model that most accurately predicts fMRI response patterns to natural sounds. Our results show that the cortical encoding of natural sounds entails the formation of multiple representations of sound spectrograms with different degrees of spectral and temporal resolution. The cortex derives these multi-resolution representations through frequency-specific neural processing channels and through the combined analysis of the spectral and temporal modulations in the spectrogram. Furthermore, our findings suggest that a spectral-temporal resolution trade-off may govern the modulation tuning of neuronal populations throughout the auditory cortex. Specifically, our fMRI results suggest that neuronal populations in posterior/dorsal auditory regions preferably encode coarse spectral information with high temporal precision. Vice-versa, neuronal populations in anterior/ventral auditory regions preferably encode fine-grained spectral information with low temporal precision. We propose that such a multi-resolution analysis may be crucially relevant for flexible and behaviorally-relevant sound processing and may constitute one of the computational underpinnings of functional specialization in auditory cortex.