
Showing papers in "PLOS Computational Biology in 2013"


Journal ArticleDOI
TL;DR: This work describes Bioconductor infrastructure for representing and computing on annotated genomic ranges and integrating genomic data with the statistical computing features of R and its extensions, including those for sequence analysis, differential expression analysis and visualization.
Abstract: We describe Bioconductor infrastructure for representing and computing on annotated genomic ranges and integrating genomic data with the statistical computing features of R and its extensions. At the core of the infrastructure are three packages: IRanges, GenomicRanges, and GenomicFeatures. These packages provide scalable data structures for representing annotated ranges on the genome, with special support for transcript structures, read alignments and coverage vectors. Computational facilities include efficient algorithms for overlap and nearest neighbor detection, coverage calculation and other range operations. This infrastructure directly supports more than 80 other Bioconductor packages, including those for sequence analysis, differential expression analysis and visualization.
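
The range operations mentioned above (overlap and nearest-neighbor detection, coverage calculation) are provided in R by IRanges/GenomicRanges. Purely as a language-neutral illustration of one of those operations, the sketch below computes a per-base coverage vector from half-open intervals using a difference array; the intervals are hypothetical and this is not the Bioconductor API.

```python
# Conceptual sketch (not the Bioconductor API): coverage from a set of ranges.
# Intervals are half-open [start, end) on a single chromosome; values are made up.
def coverage(intervals, chrom_length):
    """Return a per-base coverage vector using a difference array."""
    diff = [0] * (chrom_length + 1)
    for start, end in intervals:
        diff[start] += 1          # a read begins covering here
        diff[end] -= 1            # ...and stops covering here
    cov, running = [], 0
    for delta in diff[:chrom_length]:
        running += delta
        cov.append(running)
    return cov

reads = [(2, 7), (4, 10), (6, 8)]     # hypothetical alignments
print(coverage(reads, 12))            # [0, 0, 1, 1, 2, 2, 3, 2, 1, 1, 0, 0]
```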

3,005 citations


Journal ArticleDOI
TL;DR: A set of computational algorithms are presented which, by leveraging the collective power of metabolic pathways and networks, predict functional activity directly from spectral feature tables without a priori identification of metabolites.
Abstract: The functional interpretation of high throughput metabolomics by mass spectrometry is hindered by the identification of metabolites, a tedious and challenging task. We present a set of computational algorithms which, by leveraging the collective power of metabolic pathways and networks, predict functional activity directly from spectral feature tables without a priori identification of metabolites. The algorithms were experimentally validated on the activation of innate immune cells.

633 citations


Journal ArticleDOI
TL;DR: It is emphasized that reproducibility is not only a moral responsibility with respect to the scientific field, but that a lack of reproducibility can also be a burden for you as an individual researcher.
Abstract: Replication is the cornerstone of a cumulative science [1]. However, new tools and technologies, massive amounts of data, interdisciplinary approaches, and the complexity of the questions being asked are complicating replication efforts, as are increased pressures on scientists to advance their research [2]. As full replication of studies on independently collected data is often not feasible, there has recently been a call for reproducible research as an attainable minimum standard for assessing the value of scientific claims [3]. This requires that papers in experimental science describe the results and provide a sufficiently clear protocol to allow successful repetition and extension of analyses based on original data [4]. The importance of replication and reproducibility has recently been exemplified through studies showing that scientific papers commonly leave out experimental details essential for reproduction [5], studies showing difficulties with replicating published experimental results [6], an increase in retracted papers [7], and through a high number of failing clinical trials [8], [9]. This has led to discussions on how individual researchers, institutions, funding bodies, and journals can establish routines that increase transparency and reproducibility. In order to foster such aspects, it has been suggested that the scientific community needs to develop a “culture of reproducibility” for computational science, and to require it for published claims [3]. We want to emphasize that reproducibility is not only a moral responsibility with respect to the scientific field, but that a lack of reproducibility can also be a burden for you as an individual researcher. As an example, a good practice of reproducibility is necessary in order to allow previously developed methodology to be effectively applied on new data, or to allow reuse of code and results for new projects. In other words, good habits of reproducibility may actually turn out to be a time-saver in the longer run. We further note that reproducibility is just as much about the habits that ensure reproducible research as the technologies that can make these processes efficient and realistic. Each of the following ten rules captures a specific aspect of reproducibility, and discusses what is needed in terms of information handling and tracking of procedures. If you are taking a bare-bones approach to bioinformatics analysis, i.e., running various custom scripts from the command line, you will probably need to handle each rule explicitly. If you are instead performing your analyses through an integrated framework (such as GenePattern [10], Galaxy [11], LONI pipeline [12], or Taverna [13]), the system may already provide full or partial support for most of the rules. What is needed on your part is then merely the knowledge of how to exploit these existing possibilities. In a pragmatic setting, with publication pressure and deadlines, one may face the need to make a trade-off between the ideals of reproducibility and the need to get the research out while it is still relevant. This trade-off becomes more important when considering that a large part of the analyses being tried out never end up yielding any results. However, frequently one will, with the wisdom of hindsight, contemplate the missed opportunity to ensure reproducibility, as it may already be too late to take the necessary notes from memory (or at least much more difficult than to do it while underway). 
We believe that the rewards of reproducibility will compensate for the risk of having spent valuable time developing an annotated catalog of analyses that turned out as blind alleys. As a minimal requirement, you should at least be able to reproduce the results yourself. This would satisfy the most basic requirements of sound research, allowing any substantial future questioning of the research to be met with a precise explanation. Although it may sound like a very weak requirement, even this level of reproducibility will often require a certain level of care in order to be met. There will for a given analysis be an exponential number of possible combinations of software versions, parameter values, pre-processing steps, and so on, meaning that a failure to take notes may make exact reproduction essentially impossible. With this basic level of reproducibility in place, there is much more that can be wished for. An obvious extension is to go from a level where you can reproduce results in case of a critical situation to a level where you can practically and routinely reuse your previous work and increase your productivity. A second extension is to ensure that peers have a practical possibility of reproducing your results, which can lead to increased trust in, interest in, and citations of your work [6], [14]. We here present ten simple rules for reproducibility of computational research. These rules are at your disposal whenever you want to make your research more accessible—be it for peers or for your future self.

603 citations


Journal ArticleDOI
TL;DR: Interestingly, the model shows that T-cells are better equipped to recognize viral than human (self) peptides, and the identification of variables that influence immunogenicity will be an important next step in the investigation of T-cell epitopes and the understanding of cellular immune responses.
Abstract: T-cells have to recognize peptides presented on MHC molecules to be activated and elicit their effector functions. Several studies demonstrate that some peptides are more immunogenic than others and therefore more likely to be T-cell epitopes. We set out to determine which properties cause such differences in immunogenicity. To this end, we collected and analyzed a large set of data describing the immunogenicity of peptides presented on various MHC-I molecules. Two main conclusions could be drawn from this analysis: First, in line with previous observations, we showed that positions P4–6 of a presented peptide are more important for immunogenicity. Second, some amino acids, especially those with large and aromatic side chains, are associated with immunogenicity. This information was combined into a simple model that was used to demonstrate that immunogenicity is, to a certain extent, predictable. This model (made available at http://tools.iedb.org/immunogenicity/) was validated with data from two independent epitope discovery studies. Interestingly, with this model we could show that T-cells are equipped to better recognize viral than human (self) peptides. After the past successful elucidation of different steps in the MHC-I presentation pathway, the identification of variables that influence immunogenicity will be an important next step in the investigation of T-cell epitopes and our understanding of cellular immune responses.
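
The "simple model" described above combines per-position importance with amino-acid associations. The sketch below is a generic position-weighted scoring scheme of that flavour; the weights and residue scores are placeholders, not the published parameters (which are available at http://tools.iedb.org/immunogenicity/).

```python
# Generic sketch of a position-weighted amino-acid scoring model for peptide
# immunogenicity. All numbers are illustrative placeholders, not the IEDB model.

# Hypothetical log-enrichment of residues among immunogenic peptides (large,
# aromatic side chains scored higher, per the abstract's observation).
AA_SCORE = {"W": 0.5, "F": 0.4, "I": 0.3, "L": 0.2, "A": -0.1, "S": -0.2, "K": -0.3}

# Hypothetical per-position weights for a 9-mer; P4-P6 (0-based 3-5) weighted up,
# anchor-like positions weighted down because they mostly determine MHC binding.
POS_WEIGHT = [0.1, 0.0, 0.3, 1.0, 1.0, 1.0, 0.3, 0.3, 0.0]

def immunogenicity_score(peptide):
    """Sum position-weighted residue scores; higher means 'more immunogenic'."""
    return sum(POS_WEIGHT[i] * AA_SCORE.get(aa, 0.0)
               for i, aa in enumerate(peptide[:len(POS_WEIGHT)]))

print(immunogenicity_score("ALWIFSSKL"))
```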

566 citations


Journal ArticleDOI
TL;DR: Approximate Bayesian computation (ABC) constitutes a class of computational methods rooted in Bayesian statistics that widens the realm of models for which statistical inference can be considered, although this wider application domain exacerbates the challenges of parameter estimation and model selection.
Abstract: Approximate Bayesian computation (ABC) constitutes a class of computational methods rooted in Bayesian statistics. In all model-based statistical inference, the likelihood function is of central importance, since it expresses the probability of the observed data under a particular statistical model, and thus quantifies the support data lend to particular values of parameters and to choices among different models. For simple models, an analytical formula for the likelihood function can typically be derived. However, for more complex models, an analytical formula might be elusive or the likelihood function might be computationally very costly to evaluate. ABC methods bypass the evaluation of the likelihood function. In this way, ABC methods widen the realm of models for which statistical inference can be considered. ABC methods are mathematically well-founded, but they inevitably make assumptions and approximations whose impact needs to be carefully assessed. Furthermore, the wider application domain of ABC exacerbates the challenges of parameter estimation and model selection. ABC has rapidly gained popularity over the last years and in particular for the analysis of complex problems arising in biological sciences (e.g., in population genetics, ecology, epidemiology, and systems biology).
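
Because ABC replaces likelihood evaluation with simulation, its simplest variant, rejection ABC, can be stated in a few lines. The sketch below assumes user-supplied prior sampler, simulator and summary statistic (all hypothetical names) and keeps parameter draws whose simulated summaries fall within a tolerance of the observed summary.

```python
import random

def rejection_abc(observed_summary, sample_prior, simulate, summarize,
                  tolerance, n_draws=10000):
    """Minimal rejection-ABC sketch: keep draws whose simulated data lie
    within `tolerance` of the observed summary (no likelihood needed)."""
    accepted = []
    for _ in range(n_draws):
        theta = sample_prior()                    # draw parameter from the prior
        x = simulate(theta)                       # simulate data under theta
        if abs(summarize(x) - observed_summary) <= tolerance:
            accepted.append(theta)                # approximate posterior sample
    return accepted

# Toy example: infer the mean of a normal with known sd = 1 from its sample mean.
data_mean = 1.3                                   # hypothetical observed summary
posterior = rejection_abc(
    observed_summary=data_mean,
    sample_prior=lambda: random.uniform(-5, 5),
    simulate=lambda mu: [random.gauss(mu, 1) for _ in range(50)],
    summarize=lambda xs: sum(xs) / len(xs),
    tolerance=0.1,
)
print(len(posterior), sum(posterior) / max(len(posterior), 1))
```

The accepted draws approximate the posterior; shrinking the tolerance trades acceptance rate for approximation quality, which is one of the assessment issues the abstract flags.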

531 citations


Journal ArticleDOI
TL;DR: A novel method to infer microbial community ecology directly from time-resolved metagenomics is presented; it extends generalized Lotka–Volterra dynamics to account for external perturbations and suggests a subnetwork of bacterial groups implicated in protection against C. difficile.
Abstract: The intestinal microbiota is a microbial ecosystem of crucial importance to human health. Understanding how the microbiota confers resistance against enteric pathogens and how antibiotics disrupt that resistance is key to the prevention and cure of intestinal infections. We present a novel method to infer microbial community ecology directly from time-resolved metagenomics. This method extends generalized Lotka–Volterra dynamics to account for external perturbations. Data from recent experiments on antibiotic-mediated Clostridium difficile infection is analyzed to quantify microbial interactions, commensal-pathogen interactions, and the effect of the antibiotic on the community. Stability analysis reveals that the microbiota is intrinsically stable, explaining how antibiotic perturbations and C. difficile inoculation can produce catastrophic shifts that persist even after removal of the perturbations. Importantly, the analysis suggests a subnetwork of bacterial groups implicated in protection against C. difficile. Due to its generality, our method can be applied to any high-resolution ecological time-series data to infer community structure and response to external stimuli.
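
The extension described above is usually written as a generalized Lotka–Volterra system with an added susceptibility term for the external perturbation (here, the antibiotic). Schematically, for the abundance $x_i$ of microbial group $i$:

```latex
\frac{dx_i}{dt} \;=\; x_i\!\left( \mu_i \;+\; \sum_{j} M_{ij}\, x_j \;+\; \epsilon_i\, u(t) \right)
```

where $\mu_i$ are growth rates, $M_{ij}$ pairwise interaction coefficients, $\epsilon_i$ susceptibilities to the perturbation, and $u(t)$ the perturbation time course; this is a schematic form, and the paper's exact parameterization may differ. Stability of the inferred community is then assessed from the fitted $\mu$, $M$ and $\epsilon$.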

521 citations


Journal ArticleDOI
TL;DR: This work tested how the following factors influenced the detection of enterotypes: clustering methodology, distance metrics, OTU-picking approaches, sequencing depth, data type (whole genome shotgun (WGS) vs. 16S rRNA gene sequence data), and 16S rRNA region.
Abstract: Recent analyses of human-associated bacterial diversity have categorized individuals into 'enterotypes' or clusters based on the abundances of key bacterial genera in the gut microbiota. There is a lack of consensus, however, on the analytical basis for enterotypes and on the interpretation of these results. We tested how the following factors influenced the detection of enterotypes: clustering methodology, distance metrics, OTU-picking approaches, sequencing depth, data type (whole genome shotgun (WGS) vs. 16S rRNA gene sequence data), and 16S rRNA region. We included 16S rRNA gene sequences from the Human Microbiome Project (HMP) and from 16 additional studies and WGS sequences from the HMP and MetaHIT. In most body sites, we observed smooth abundance gradients of key genera without discrete clustering of samples. Some body habitats displayed bimodal (e.g., gut) or multimodal (e.g., vagina) distributions of sample abundances, but not all clustering methods and workflows accurately highlight such clusters. Because identifying enterotypes in datasets depends not only on the structure of the data but is also sensitive to the methods applied to identifying clustering strength, we recommend that multiple approaches be used and compared when testing for enterotypes.
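
One common way to test clustering strength, as the abstract recommends, is to sweep the number of clusters and score each partition, for example with the average silhouette width. The sketch below uses scikit-learn k-means on a genus-abundance matrix as a stand-in for the PAM/Jensen–Shannon workflow often used in enterotype analyses; the data here are random placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
abundances = rng.dirichlet(np.ones(20), size=200)   # placeholder: 200 samples x 20 genera

# Score partitions for k = 2..6; weak, flat silhouettes suggest smooth gradients
# rather than discrete enterotype-like clusters.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(abundances)
    print(k, round(silhouette_score(abundances, labels), 3))
```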

466 citations


Journal ArticleDOI
TL;DR: Chaste (Cancer, Heart And Soft Tissue Environment) is an open source C++ library for the computational simulation of mathematical models developed for physiology and biology, such as cardiac electrophysiology and cancer development, and can be used to solve a wide range of problems.
Abstract: Chaste - Cancer, Heart And Soft Tissue Environment - is an open source C++ library for the computational simulation of mathematical models developed for physiology and biology. Code development has been driven by two initial applications: cardiac electrophysiology and cancer development. A large number of cardiac electrophysiology studies have been enabled and performed, including high-performance computational investigations of defibrillation on realistic human cardiac geometries. New models for the initiation and growth of tumours have been developed. In particular, cell-based simulations have provided novel insight into the role of stem cells in the colorectal crypt. Chaste is constantly evolving and is now being applied to a far wider range of problems. The code provides modules for handling common scientific computing components, such as meshes and solvers for ordinary and partial differential equations (ODEs/PDEs). Re-use of these components avoids the need for researchers to 're-invent the wheel' with each new project, accelerating the rate of progress in new applications. Chaste is developed using industrially-derived techniques, in particular test-driven development, to ensure code quality, re-use and reliability. In this article we provide examples that illustrate the types of problems Chaste can be used to solve, which can be run on a desktop computer. We highlight some scientific studies that have used or are using Chaste, and the insights they have provided. The source code, both for specific releases and the development version, is available to download under an open source Berkeley Software Distribution (BSD) licence at http://www.cs.ox.ac.uk/chaste, together with details of a mailing list and links to documentation and tutorials.

440 citations


Journal ArticleDOI
TL;DR: The RAVEN Toolbox workflow was applied to reconstruct a genome-scale metabolic model for the important microbial cell factory Penicillium chrysogenum Wisconsin 54-1255; the resulting model was then used to study the roles of ATP and NADPH in the biosynthesis of penicillin and to identify potential metabolic engineering targets for maximization of penicillin production.
Abstract: We present the RAVEN (Reconstruction, Analysis and Visualization of Metabolic Networks) Toolbox: a software suite that allows for semi-automated reconstruction of genome-scale models. It makes use of published models and/or the KEGG database, coupled with extensive gap-filling and quality control features. The software suite also contains methods for visualizing simulation results and omics data, as well as a range of methods for performing simulations and analyzing the results. The software is a useful tool for system-wide data analysis in a metabolic context and for streamlined reconstruction of metabolic networks based on protein homology. The RAVEN Toolbox workflow was applied to reconstruct a genome-scale metabolic model for the important microbial cell factory Penicillium chrysogenum Wisconsin 54-1255. The model was validated in a bibliomic study of 440 references in total, and it comprises 1471 unique biochemical reactions and 1006 ORFs. It was then used to study the roles of ATP and NADPH in the biosynthesis of penicillin, and to identify potential metabolic engineering targets for maximization of penicillin production.

409 citations


Journal ArticleDOI
TL;DR: GEMINI (GEnome MINIng), a flexible software package for exploring all forms of human genetic variation, is developed; it is designed for reproducibility and flexibility, with the goal of providing researchers with a standard framework for medical genomics.
Abstract: Modern DNA sequencing technologies enable geneticists to rapidly identify genetic variation among many human genomes. However, isolating the minority of variants underlying disease remains an important, yet formidable challenge for medical genetics. We have developed GEMINI (GEnome MINIng), a flexible software package for exploring all forms of human genetic variation. Unlike existing tools, GEMINI integrates genetic variation with a diverse and adaptable set of genome annotations (e.g., dbSNP, ENCODE, UCSC, ClinVar, KEGG) into a unified database to facilitate interpretation and data exploration. Whereas other methods provide an inflexible set of variant filters or prioritization methods, GEMINI allows researchers to compose complex queries based on sample genotypes, inheritance patterns, and both pre-installed and custom genome annotations. GEMINI also provides methods for ad hoc queries and data exploration, a simple programming interface for custom analyses that leverage the underlying database, and both command line and graphical tools for common analyses. We demonstrate GEMINI's utility for exploring variation in personal genomes and family based genetic studies, and illustrate its ability to scale to studies involving thousands of human samples. GEMINI is designed for reproducibility and flexibility and our goal is to provide researchers with a standard framework for medical genomics.

374 citations


Journal ArticleDOI
TL;DR: Viral phylodynamics is defined as the study of how epidemiological, immunological, and evolutionary processes act and potentially interact to shape viral phylogenies; the underlying transmission dynamics can be considered at the level of cells within an infected host, individual hosts within a population, or entire populations of hosts.
Abstract: Viral phylodynamics is defined as the study of how epidemiological, immunological, and evolutionary processes act and potentially interact to shape viral phylogenies. Since the coining of the term in 2004, research on viral phylodynamics has focused on transmission dynamics in an effort to shed light on how these dynamics impact viral genetic variation. Transmission dynamics can be considered at the level of cells within an infected host, individual hosts within a population, or entire populations of hosts. Many viruses, especially RNA viruses, rapidly accumulate genetic variation because of short generation times and high mutation rates. Patterns of viral genetic variation are therefore heavily influenced by how quickly transmission occurs and by which entities transmit to one another. Patterns of viral genetic variation will also be affected by selection acting on viral phenotypes. Although viruses can differ with respect to many phenotypes, phylodynamic studies have to date tended to focus on a limited number of viral phenotypes. These include virulence phenotypes, phenotypes associated with viral transmissibility, cell or tissue tropism phenotypes, and antigenic phenotypes that can facilitate escape from host immunity. Due to the impact that transmission dynamics and selection can have on viral genetic variation, viral phylogenies can therefore be used to investigate important epidemiological, immunological, and evolutionary processes, such as epidemic spread [2], spatio-temporal dynamics including metapopulation dynamics [3], zoonotic transmission, tissue tropism [4], and antigenic drift [5]. The quantitative investigation of these processes through the consideration of viral phylogenies is the central aim of viral phylodynamics.

Journal ArticleDOI
TL;DR: It is concluded that the deployment of text mining for document abstraction or rich search and navigation is best thought of as a process, and that with the right computational tools and data collection strategies this process can be made defined and repeatable.
Abstract: This software article describes the GATE family of open source text analysis tools and processes. GATE is one of the most widely used systems of its type with yearly download rates of tens of thousands and many active users in both academic and industrial contexts. In this paper we report three examples of GATE-based systems operating in the life sciences and in medicine. First, in genome-wide association studies which have contributed to discovery of a head and neck cancer mutation association. Second, medical records analysis which has significantly increased the statistical power of treatment/outcome models in the UK's largest psychiatric patient cohort. Third, richer constructs in drug-related searching. We also explore the ways in which the GATE family supports the various stages of the lifecycle present in our examples. We conclude that the deployment of text mining for document abstraction or rich search and navigation is best thought of as a process, and that with the right computational tools and data collection strategies this process can be made defined and repeatable. The GATE research programme is now 20 years old and has grown from its roots as a specialist development tool for text processing to become a rather comprehensive ecosystem, bringing together software developers, language engineers and research staff from diverse fields. GATE now has a strong claim to cover a uniquely wide range of the lifecycle of text analysis systems. It forms a focal point for the integration and reuse of advances that have been made by many people (the majority outside of the authors' own group) who work in text processing for biomedicine and other areas. GATE is available online under GNU open source licences and runs on all major operating systems. Support is available from an active user and developer community and also on a commercial basis.

Journal ArticleDOI
TL;DR: This study demonstrates that collective motion can be effectively mapped onto a set of order parameters describing the macroscopic group structure, revealing the existence of at least three dynamically-stable collective states: swarm, milling and polarized groups.
Abstract: The spontaneous emergence of pattern formation is ubiquitous in nature, often arising as a collective phenomenon from interactions among a large number of individual constituents or sub-systems. Understanding, and controlling, collective behavior is dependent on determining the low-level dynamical principles from which spatial and temporal patterns emerge; a key question is whether different group-level patterns result from all components of a system responding to the same external factor, individual components changing behavior but in a distributed self-organized way, or whether multiple collective states co-exist for the same individual behaviors. Using schooling fish (golden shiners, in groups of 30 to 300 fish) as a model system, we demonstrate that collective motion can be effectively mapped onto a set of order parameters describing the macroscopic group structure, revealing the existence of at least three dynamically-stable collective states: swarm, milling and polarized groups. Swarms are characterized by slow individual motion and a relatively dense, disordered structure. Increasing swim speed is associated with a transition to one of two locally-ordered states, milling or highly-mobile polarized groups. The stability of the discrete collective behaviors exhibited by a group depends on the number of group members. Transitions between states are influenced by both external (boundary-driven) and internal (changing motion of group members) factors. Whereas transitions between locally-disordered and locally-ordered group states are speed dependent, analysis of local and global properties of groups suggests that, congruent with theory, milling and polarized states co-exist in a bistable regime with transitions largely driven by perturbations. Our study allows us to relate theoretical and empirical understanding of animal group behavior and emphasizes dynamic changes in the structure of such groups.
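
The order parameters used to distinguish swarm, milling and polarized states are typically the group polarization (alignment of headings) and the rotation, or milling, parameter (angular momentum about the group centroid). A minimal sketch of those two quantities, assuming unit heading vectors and 2-D positions, follows.

```python
import numpy as np

def order_parameters(positions, headings):
    """Polarization O_p and rotation (milling) O_r for a 2-D group.

    positions: (N, 2) array of coordinates; headings: (N, 2) unit vectors.
    O_p ~ 1 for polarized groups; O_r ~ 1 for milling; both low for swarms.
    """
    headings = headings / np.linalg.norm(headings, axis=1, keepdims=True)
    o_p = np.linalg.norm(headings.mean(axis=0))

    r = positions - positions.mean(axis=0)            # vectors from group centroid
    r_hat = r / np.linalg.norm(r, axis=1, keepdims=True)
    cross = r_hat[:, 0] * headings[:, 1] - r_hat[:, 1] * headings[:, 0]
    o_r = abs(cross.mean())
    return o_p, o_r

# Toy milling configuration: fish on a circle, headings tangent to it.
theta = np.linspace(0, 2 * np.pi, 30, endpoint=False)
pos = np.c_[np.cos(theta), np.sin(theta)]
head = np.c_[-np.sin(theta), np.cos(theta)]
print(order_parameters(pos, head))                    # approximately (0.0, 1.0)
```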

Journal ArticleDOI
TL;DR: GFT data may not provide reliable surveillance for seasonal or pandemic influenza and should be interpreted with caution until the algorithm can be improved and evaluated; current internet search query data are no substitute for timely local clinical and laboratory surveillance, or national surveillance based on local data collection.
Abstract: The goal of influenza-like illness (ILI) surveillance is to determine the timing, location and magnitude of outbreaks by monitoring the frequency and progression of clinical case incidence. Advances in computational and information technology have allowed for automated collection of higher volumes of electronic data and more timely analyses than previously possible. Novel surveillance systems, including those based on internet search query data like Google Flu Trends (GFT), are being used as surrogates for clinically-based reporting of influenza-like-illness (ILI). We investigated the reliability of GFT during the last decade (2003 to 2013), and compared weekly public health surveillance with search query data to characterize the timing and intensity of seasonal and pandemic influenza at the national (United States), regional (Mid-Atlantic) and local (New York City) levels. We identified substantial flaws in the original and updated GFT models at all three geographic scales, including completely missing the first wave of the 2009 influenza A/H1N1 pandemic, and greatly overestimating the intensity of the A/H3N2 epidemic during the 2012/2013 season. These results were obtained for both the original (2008) and the updated (2009) GFT algorithms. The performance of both models was problematic, perhaps because of changes in internet search behavior and differences in the seasonality, geographical heterogeneity and age-distribution of the epidemics between the periods of GFT model-fitting and prospective use. We conclude that GFT data may not provide reliable surveillance for seasonal or pandemic influenza and should be interpreted with caution until the algorithm can be improved and evaluated. Current internet search query data are no substitute for timely local clinical and laboratory surveillance, or national surveillance based on local data collection. New generation surveillance systems such as GFT should incorporate the use of near-real time electronic health data and computational methods for continued model-fitting and ongoing evaluation and improvement.

Journal ArticleDOI
TL;DR: Using re-sequencing datasets, this work comprehensively characterises the biases and errors introduced by the PGM at both the base and flow level, across a combination of factors, including chip density, sequencing kit, template species and machine.
Abstract: The Ion Torrent Personal Genome Machine (PGM) is a new sequencing platform that substantially differs from other sequencing technologies by measuring pH rather than light to detect polymerisation events. Using re-sequencing datasets, we comprehensively characterise the biases and errors introduced by the PGM at both the base and flow level, across a combination of factors, including chip density, sequencing kit, template species and machine. We found two distinct insertion/deletion (indel) error types that accounted for the majority of errors introduced by the PGM. The main error source was inaccurate flow-calls, which introduced indels at a raw rate of 2.84% (1.38% after quality clipping) using the OneTouch 200 bp kit. Inaccurate flow-calls typically resulted in over-called short-homopolymers and under-called long-homopolymers. Flow-call accuracy decreased with consecutive flow cycles, but we also found significant periodic fluctuations in the flow error-rate, corresponding to specific positions within the flow-cycle pattern. Another less common PGM error, high frequency indel (HFI) errors, are indels that occur at very high frequency in the reads relative to a given base position in the reference genome, but in the majority of instances were not replicated consistently across separate runs. HFI errors occur approximately once every thousand bases in the reference, and correspond to 0.06% of bases in reads. Currently, the PGM does not achieve the accuracy of competing light-based technologies. However, flow-call inaccuracy is systematic and the statistical models of flow-values developed here will enable PGM-specific bioinformatics approaches to be developed, which will account for these errors. HFI errors may prove more challenging to address, especially for polymorphism and amplicon applications, but may be overcome by sequencing the same DNA template across multiple chips.

Journal ArticleDOI
TL;DR: The results suggest that the experimentally observed spontaneous activity and trial-to-trial variability of cortical neurons are essential features of their information processing capability, since their functional role is to represent probability distributions rather than static neural codes.
Abstract: The principles by which networks of neurons compute, and how spike-timing dependent plasticity (STDP) of synaptic weights generates and maintains their computational function, are unknown. Preceding work has shown that soft winner-take-all (WTA) circuits, where pyramidal neurons inhibit each other via interneurons, are a common motif of cortical microcircuits. We show through theoretical analysis and computer simulations that Bayesian computation is induced in these network motifs through STDP in combination with activity-dependent changes in the excitability of neurons. The fundamental components of this emergent Bayesian computation are priors that result from adaptation of neuronal excitability and implicit generative models for hidden causes that are created in the synaptic weights through STDP. In fact, a surprising result is that STDP is able to approximate a powerful principle for fitting such implicit generative models to high-dimensional spike inputs: Expectation Maximization. Our results suggest that the experimentally observed spontaneous activity and trial-to-trial variability of cortical neurons are essential features of their information processing capability, since their functional role is to represent probability distributions rather than static neural codes. Furthermore it suggests networks of Bayesian computation modules as a new model for distributed information processing in the cortex.

Journal ArticleDOI
TL;DR: All the major steps in the analysis of ChIP-seq data are addressed: sequencing depth selection, quality checking, mapping, data normalization, assessment of reproducibility, peak calling, differential binding analysis, controlling the false discovery rate, peak annotation, visualization, and motif analysis.
Abstract: Mapping the chromosomal locations of transcription factors, nucleosomes, histone modifications, chromatin remodeling enzymes, chaperones, and polymerases is one of the key tasks of modern biology, as evidenced by the Encyclopedia of DNA Elements (ENCODE) Project. To this end, chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is the standard methodology. Mapping such protein-DNA interactions in vivo using ChIP-seq presents multiple challenges not only in sample preparation and sequencing but also for computational analysis. Here, we present step-by-step guidelines for the computational analysis of ChIP-seq data. We address all the major steps in the analysis of ChIP-seq data: sequencing depth selection, quality checking, mapping, data normalization, assessment of reproducibility, peak calling, differential binding analysis, controlling the false discovery rate, peak annotation, visualization, and motif analysis. At each step in our guidelines we discuss some of the software tools most frequently used. We also highlight the challenges and problems associated with each step in ChIP-seq data analysis. We present a concise workflow for the analysis of ChIP-seq data in Figure 1 that complements and expands on the recommendations of the ENCODE and modENCODE projects. Each step in the workflow is described in detail in the following sections.

Journal ArticleDOI
TL;DR: In this article, the authors measure brain activity during motor sequencing and characterize network properties based on coherent activity between brain regions, and they find that the complex reconfiguration patterns of the brain's putative functional modules that control learning can be described parsimoniously by the combined presence of a relatively stiff temporal core, composed primarily of sensorimotor and visual regions whose connectivity changes little in time, and a flexible temporal periphery whose connectivity changes frequently.
Abstract: As a person learns a new skill, distinct synapses, brain regions, and circuits are engaged and change over time. In this paper, we develop methods to examine patterns of correlated activity across a large set of brain regions. Our goal is to identify properties that enable robust learning of a motor skill. We measure brain activity during motor sequencing and characterize network properties based on coherent activity between brain regions. Using recently developed algorithms to detect time-evolving communities, we find that the complex reconfiguration patterns of the brain's putative functional modules that control learning can be described parsimoniously by the combined presence of a relatively stiff temporal core that is composed primarily of sensorimotor and visual regions whose connectivity changes little in time and a flexible temporal periphery that is composed primarily of multimodal association regions whose connectivity changes frequently. The separation between temporal core and periphery changes over the course of training and, importantly, is a good predictor of individual differences in learning success. The core of dynamically stiff regions exhibits dense connectivity, which is consistent with notions of core-periphery organization established previously in social networks. Our results demonstrate that core-periphery organization provides an insightful way to understand how putative functional modules are linked. This, in turn, enables the prediction of fundamental human capacities, including the production of complex goal-directed behavior.
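
A standard quantity behind the stiff-core versus flexible-periphery distinction is node flexibility: the fraction of consecutive time windows in which a region changes its community assignment. A small sketch is shown below, assuming community labels have already been obtained (e.g., from the time-evolving community detection the abstract refers to); the labels here are placeholders.

```python
import numpy as np

def flexibility(partitions):
    """Fraction of consecutive-layer transitions in which each node switches
    community. `partitions` is a (T, N) array of community labels over T windows."""
    partitions = np.asarray(partitions)
    changes = partitions[1:] != partitions[:-1]        # (T-1, N) boolean matrix
    return changes.mean(axis=0)

# Placeholder labels for 4 regions across 5 time windows.
labels = [[0, 1, 1, 2],
          [0, 1, 2, 2],
          [0, 1, 2, 0],
          [0, 2, 1, 0],
          [0, 2, 1, 1]]
print(flexibility(labels))   # core-like regions have low values, periphery-like high
```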

Journal ArticleDOI
TL;DR: The Multi-Dendrix algorithm is introduced for the simultaneous identification of multiple driver pathways de novo in somatic mutation data from a cohort of cancer samples; it is shown to outperform other algorithms for identifying combinations of mutations and is also orders of magnitude faster on genome-scale data.
Abstract: Distinguishing the somatic mutations responsible for cancer (driver mutations) from random, passenger mutations is a key challenge in cancer genomics. Driver mutations generally target cellular signaling and regulatory pathways consisting of multiple genes. This heterogeneity complicates the identification of driver mutations by their recurrence across samples, as different combinations of mutations in driver pathways are observed in different samples. We introduce the Multi-Dendrix algorithm for the simultaneous identification of multiple driver pathways de novo in somatic mutation data from a cohort of cancer samples. The algorithm relies on two combinatorial properties of mutations in a driver pathway: high coverage and mutual exclusivity. We derive an integer linear program that finds sets of mutations exhibiting these properties. We apply Multi-Dendrix to somatic mutations from glioblastoma, breast cancer, and lung cancer samples. Multi-Dendrix identifies sets of mutations in genes that overlap with known pathways – including Rb, p53, PI(3)K, and cell cycle pathways – and also novel sets of mutually exclusive mutations, including mutations in several transcription factors or other genes involved in transcriptional regulation. These sets are discovered directly from mutation data with no prior knowledge of pathways or gene interactions. We show that Multi-Dendrix outperforms other algorithms for identifying combinations of mutations and is also orders of magnitude faster on genome-scale data. Software available at: http://compbio.cs.brown.edu/software.
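
The two combinatorial properties named above, high coverage and mutual exclusivity, are commonly folded into a single objective of the Dendrix form W(M) = 2·|Γ(M)| − Σ_g |Γ(g)|, where Γ(·) is the set of samples mutated in a gene or gene set; Multi-Dendrix maximizes objectives of this general form over several gene sets at once via an integer linear program. The sketch below only scores a single candidate gene set against a toy mutation table; it does not implement the ILP, and the genes and samples are placeholders.

```python
def coverage_exclusivity_weight(gene_set, mutations):
    """Coverage-minus-overlap score for one gene set.

    mutations: dict mapping gene -> set of mutated sample IDs.
    W(M) = 2*|samples covered by M| - sum over genes of |samples mutated in g|,
    so overlapping (non-exclusive) mutations are penalized.
    """
    covered = set().union(*(mutations[g] for g in gene_set))
    total_gene_coverage = sum(len(mutations[g]) for g in gene_set)
    return 2 * len(covered) - total_gene_coverage

# Toy mutation data (hypothetical samples s1..s6).
muts = {
    "TP53": {"s1", "s2", "s3"},
    "MDM2": {"s4", "s5"},          # mutually exclusive with TP53
    "EGFR": {"s1", "s2", "s6"},    # overlaps TP53 in s1, s2
}
print(coverage_exclusivity_weight({"TP53", "MDM2"}, muts))   # 10 - 5 = 5 (exclusive)
print(coverage_exclusivity_weight({"TP53", "EGFR"}, muts))   # 8 - 6 = 2 (overlapping)
```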

Journal ArticleDOI
TL;DR: The approach suggests that spikes do matter when considering how the brain computes, and that the reliability of cortical representations could have been strongly underestimated.
Abstract: Two observations about the cortex have puzzled neuroscientists for a long time. First, neural responses are highly variable. Second, the level of excitation and inhibition received by each neuron is tightly balanced at all times. Here, we demonstrate that both properties are necessary consequences of neural networks that represent information efficiently in their spikes. We illustrate this insight with spiking networks that represent dynamical variables. Our approach is based on two assumptions: We assume that information about dynamical variables can be read out linearly from neural spike trains, and we assume that neurons only fire a spike if that improves the representation of the dynamical variables. Based on these assumptions, we derive a network of leaky integrate-and-fire neurons that is able to implement arbitrary linear dynamical systems. We show that the membrane voltage of the neurons is equivalent to a prediction error about a common population-level signal. Among other things, our approach allows us to construct an integrator network of spiking neurons that is robust against many perturbations. Most importantly, neural variability in our networks cannot be equated to noise. Despite exhibiting the same single unit properties as widely used population code models (e.g. tuning curves, Poisson distributed spike trains), balanced networks are orders of magnitudes more reliable. Our approach suggests that spikes do matter when considering how the brain computes, and that the reliability of cortical representations could have been strongly underestimated.

Journal ArticleDOI
TL;DR: A way to combine group contribution estimates with the more accurate reactant contributions, by decomposing each reaction into two parts and applying one of the methods to each of them, is presented; it shows a significant increase in the accuracy of estimations compared to standard group contribution.
Abstract: Standard Gibbs energies of reactions are increasingly being used in metabolic modeling for applying thermodynamic constraints on reaction rates, metabolite concentrations and kinetic parameters. The increasing scope and diversity of metabolic models has led scientists to look for genome-scale solutions that can estimate the standard Gibbs energy of all the reactions in metabolism. Group contribution methods greatly increase coverage, albeit at the price of decreased precision. We present here a way to combine the estimations of group contribution with the more accurate reactant contributions by decomposing each reaction into two parts and applying one of the methods on each of them. This method gives priority to the reactant contributions over group contributions while guaranteeing that all estimations will be consistent, i.e. will not violate the first law of thermodynamics. We show that there is a significant increase in the accuracy of our estimations compared to standard group contribution. Specifically, our cross-validation results show an 80% reduction in the median absolute residual for reactions that can be derived by reactant contributions only. We provide the full framework and source code for deriving estimates of standard reaction Gibbs energy, as well as confidence intervals, and believe this will facilitate the wide use of thermodynamic data for a better understanding of metabolism.
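
Schematically, the combined estimate splits each reaction's stoichiometric vector $s$ into a part $s_{\mathrm{rc}}$ that can be covered by reactant-level formation energies and a remainder $s_{\mathrm{gc}}$ handled by group contributions. The notation below is a simplified sketch of that idea, not the paper's exact derivation, which adds the projections needed to keep all estimates mutually consistent:

```latex
\Delta_r G^{\circ} \;\approx\;
  \underbrace{s_{\mathrm{rc}}^{\top}\, \Delta_f G^{\circ}_{\mathrm{rc}}}_{\text{reactant contributions}}
  \;+\;
  \underbrace{s_{\mathrm{gc}}^{\top}\, \mathbf{G}\, \Delta g^{\circ}_{\mathrm{gc}}}_{\text{group contributions}},
  \qquad s \;=\; s_{\mathrm{rc}} + s_{\mathrm{gc}}
```

Here $\Delta_f G^{\circ}_{\mathrm{rc}}$ are measured reactant formation energies, $\mathbf{G}$ is a compound-to-group incidence matrix, and $\Delta g^{\circ}_{\mathrm{gc}}$ are fitted group contributions; all symbols are schematic labels for this sketch.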

Journal ArticleDOI
TL;DR: Ten simple rules for approaching and carrying out a literature review, learned while working on about 25 literature reviews as a PhD and postdoctoral student.
Abstract: Literature reviews are in great demand in most scientific fields. Their need stems from the ever-increasing output of scientific publications [1]. For example, compared to 1991, in 2008 three, eight, and forty times more papers were indexed in Web of Science on malaria, obesity, and biodiversity, respectively [2]. Given such mountains of papers, scientists cannot be expected to examine in detail every single new paper relevant to their interests [3]. Thus, it is both advantageous and necessary to rely on regular summaries of the recent literature. Although recognition for scientists mainly comes from primary research, timely literature reviews can lead to new synthetic insights and are often widely read [4]. For such summaries to be useful, however, they need to be compiled in a professional way [5]. When starting from scratch, reviewing the literature can require a titanic amount of work. That is why researchers who have spent their career working on a certain research issue are in a perfect position to review that literature. Some graduate schools are now offering courses in reviewing the literature, given that most research students start their project by producing an overview of what has already been done on their research issue [6]. However, it is likely that most scientists have not thought in detail about how to approach and carry out a literature review. Reviewing the literature requires the ability to juggle multiple tasks, from finding and evaluating relevant material to synthesising information from various sources, from critical thinking to paraphrasing, evaluating, and citation skills [7]. In this contribution, I share ten simple rules I learned working on about 25 literature reviews as a PhD and postdoctoral student. Ideas and insights also come from discussions with coauthors and colleagues, as well as feedback from reviewers and editors.

Journal ArticleDOI
TL;DR: The analyses indicate that long-term memory recall engages a network with a distinct thalamic-hippocampal-cortical signature that has small-world properties and contains hub-like regions in the prefrontal cortex and thalamus that may play privileged roles in memory expression.
Abstract: Long-term memories are thought to depend upon the coordinated activation of a broad network of cortical and subcortical brain regions. However, the distributed nature of this representation has made it challenging to define the neural elements of the memory trace, and lesion and electrophysiological approaches provide only a narrow window into what is appreciated to be a much more global network. Here we used a global mapping approach to identify networks of brain regions activated following recall of long-term fear memories in mice. Analysis of Fos expression across 84 brain regions allowed us to identify regions that were co-active following memory recall. These analyses revealed that the functional organization of long-term fear memories depends on memory age and is altered in mutant mice that exhibit premature forgetting. Most importantly, these analyses indicate that long-term memory recall engages a network that has a distinct thalamic-hippocampal-cortical signature. This network is concurrently integrated and segregated and therefore has small-world properties, and contains hub-like regions in the prefrontal cortex and thalamus that may play privileged roles in memory expression.
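
The co-activation analysis described above amounts to correlating regional Fos counts across animals, thresholding the correlation matrix into a graph, and then reading off graph properties such as clustering and hub-like degree. The sketch below illustrates that pipeline on random placeholder data with a hypothetical threshold; the paper's full analysis additionally quantifies small-world structure and compares memory ages.

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(3)

# Placeholder Fos data: 20 mice x 84 brain regions (counts of Fos-positive cells).
fos = rng.poisson(lam=50, size=(20, 84)).astype(float)

# Co-activation network: interregional correlation of Fos counts across animals,
# thresholded to keep only the strongest positive correlations.
corr = np.corrcoef(fos.T)
np.fill_diagonal(corr, 0)
adj = (corr > 0.3).astype(int)          # hypothetical threshold

g = nx.from_numpy_array(adj)
print("mean clustering:", nx.average_clustering(g))
deg = dict(g.degree())
hubs = sorted(deg, key=deg.get, reverse=True)[:5]
print("candidate hub regions (indices):", hubs)
```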

Journal ArticleDOI
TL;DR: The updated model integrates novel results with respect to the cyanobacterial TCA cycle, an alleged glyoxylate shunt, and the role of photorespiration in cellular growth, and extends the computational analysis to diurnal light/dark cycles of cyanobacterial metabolism.
Abstract: Cyanobacteria are versatile unicellular phototrophic microorganisms that are highly abundant in many environments. Owing to their capability to utilize solar energy and atmospheric carbon dioxide for growth, cyanobacteria are increasingly recognized as a prolific resource for the synthesis of valuable chemicals and various biofuels. To fully harness the metabolic capabilities of cyanobacteria necessitates an in-depth understanding of the metabolic interconversions taking place during phototrophic growth, as provided by genome-scale reconstructions of microbial organisms. Here we present an extended reconstruction and analysis of the metabolic network of the unicellular cyanobacterium Synechocystis sp. PCC 6803. Building upon several recent reconstructions of cyanobacterial metabolism, unclear reaction steps are experimentally validated and the functional consequences of unknown or dissenting pathway topologies are discussed. The updated model integrates novel results with respect to the cyanobacterial TCA cycle, an alleged glyoxylate shunt, and the role of photorespiration in cellular growth. Going beyond conventional flux-balance analysis, we extend the computational analysis to diurnal light/dark cycles of cyanobacterial metabolism.
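
Genome-scale reconstructions of this kind are typically interrogated with flux-balance analysis (FBA), a linear program over the stoichiometric matrix $S$; the diurnal analysis mentioned above builds on the same core problem:

```latex
\max_{v}\; c^{\top} v
\quad \text{s.t.} \quad S\,v = 0,
\qquad v^{\mathrm{min}} \le v \le v^{\mathrm{max}}
```

where $v$ are reaction fluxes, $c$ selects the objective (e.g., biomass or a target compound), and the bounds encode reversibility and exchange limits such as light and CO2 uptake in the light and dark phases. This is the generic FBA formulation, stated here schematically rather than with the paper's specific constraints.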

Journal ArticleDOI
TL;DR: A Random-Forest based algorithm, RFECS (Random Forest based Enhancer identification from Chromatin States), is developed to integrate histone modification profiles for identification of enhancers, and is used to identify enhancers in a number of cell types.
Abstract: Transcriptional enhancers play critical roles in regulation of gene expression, but their identification in the eukaryotic genome has been challenging. Recently, it was shown that enhancers in the mammalian genome are associated with characteristic histone modification patterns, which have been increasingly exploited for enhancer identification. However, only a limited number of cell types or chromatin marks have previously been investigated for this purpose, leaving the question unanswered whether there exists an optimal set of histone modifications for enhancer prediction in different cell types. Here, we address this issue by exploring genome-wide profiles of 24 histone modifications in two distinct human cell types, embryonic stem cells and lung fibroblasts. We developed a Random-Forest based algorithm, RFECS (Random Forest based Enhancer identification from Chromatin States) to integrate histone modification profiles for identification of enhancers, and used it to identify enhancers in a number of cell-types. We show that RFECS not only leads to more accurate and precise prediction of enhancers than previous methods, but also helps identify the most informative and robust set of three chromatin marks for enhancer prediction.
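
RFECS itself is a specific multivariate random-forest formulation, but the basic setup, training a forest on binned histone-modification signals with enhancer versus background labels (derived from, e.g., distal coactivator binding sites), can be sketched with scikit-learn as below; the feature matrix and labels are random placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Placeholder data: 2000 genomic bins x 24 histone-mark signal features,
# with binary labels (1 = enhancer-like, 0 = background).
X = rng.normal(size=(2000, 24))
y = ((X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000)) > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())

# Feature importances indicate which marks are most informative for prediction,
# analogous to selecting a compact, robust subset of chromatin marks.
clf.fit(X, y)
print(np.argsort(clf.feature_importances_)[::-1][:3])
```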

Journal ArticleDOI
TL;DR: A novel Bayesian probabilistic approach, denoted “Bayesian 3D constructor for Hi-C data” (BACH), is described to infer the consensus 3D chromosomal structure, and a variant algorithm, BACH-MIX, is described to study the structural variations of chromatin in a cell population.
Abstract: Knowledge of spatial chromosomal organizations is critical for the study of transcriptional regulation and other nuclear processes in the cell. Recently, chromosome conformation capture (3C) based technologies, such as Hi-C and TCC, have been developed to provide a genome-wide, three-dimensional (3D) view of chromatin organization. Appropriate methods for analyzing these data and fully characterizing the 3D chromosomal structure and its structural variations are still under development. Here we describe a novel Bayesian probabilistic approach, denoted as “Bayesian 3D constructor for Hi-C data” (BACH), to infer the consensus 3D chromosomal structure. In addition, we describe a variant algorithm BACH-MIX to study the structural variations of chromatin in a cell population. Applying BACH and BACH-MIX to a high resolution Hi-C dataset generated from mouse embryonic stem cells, we found that most local genomic regions exhibit homogeneous 3D chromosomal structures. We further constructed a model for the spatial arrangement of chromatin, which reveals structural properties associated with euchromatic and heterochromatic regions in the genome. We observed strong associations between structural properties and several genomic and epigenetic features of the chromosome. Using BACH-MIX, we further found that the structural variations of chromatin are correlated with these genomic and epigenetic features. Our results demonstrate that BACH and BACH-MIX have the potential to provide new insights into the chromosomal architecture of mammalian cells.

Journal ArticleDOI
TL;DR: A simple model is developed in which cells are described as point particles whose dynamics rest on two premises: first, cells move in a stochastic manner and, second, they tend to adapt their motion to that of their neighbors.
Abstract: Modelling the displacement of thousands of cells that move in a collective way is required for the simulation and the theoretical analysis of various biological processes. Here, we tackle this question in the controlled setting where the motion of Madin-Darby Canine Kidney (MDCK) cells in a confluent epithelium is triggered by the unmasking of free surface. We develop a simple model in which cells are described as point particles with a dynamic based on the two premises that, first, cells move in a stochastic manner and, second, tend to adapt their motion to that of their neighbors. Detailed comparison to experimental data shows that the model provides a quantitatively accurate description of cell motion in the epithelium bulk at early times. In addition, inclusion of model “leader” cells with modified characteristics accounts for the digitated shape of the interface which develops over the subsequent hours, provided that leader cells invade free surface more easily than other cells and coordinate their motion with their followers. The previously-described progression of the epithelium border is reproduced by the model and quantitatively explained.
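
A minimal version of the two premises (stochastic motion plus alignment with neighbors) is a Vicsek-type update. The sketch below is generic, not the authors' calibrated model, and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
N, steps = 200, 100
speed, radius, noise = 0.05, 0.2, 0.3      # hypothetical parameters

pos = rng.random((N, 2))                    # cells in a unit box
angle = rng.uniform(0, 2 * np.pi, N)        # heading of each cell

for _ in range(steps):
    # Premise 2: adapt heading to the mean heading of neighbours within `radius`.
    d = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)
    neigh = d < radius
    mean_angle = np.arctan2((np.sin(angle) * neigh).sum(1),
                            (np.cos(angle) * neigh).sum(1))
    # Premise 1: stochastic motion, modelled here as angular noise.
    angle = mean_angle + noise * rng.uniform(-np.pi, np.pi, N)
    pos = (pos + speed * np.c_[np.cos(angle), np.sin(angle)]) % 1.0   # periodic box

# Global alignment of the sheet after `steps` updates.
print(np.hypot(np.cos(angle).mean(), np.sin(angle).mean()))
```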

Journal ArticleDOI
TL;DR: A definition of emotional valence is proposed in terms of the negative rate of change of free-energy over time, which highlights the crucial role played by emotions in biological agents' adaptation to unexpected changes in their world.
Abstract: The free-energy principle has recently been proposed as a unified Bayesian account of perception, learning and action. Despite the inextricable link between emotion and cognition, emotion has not yet been formulated under this framework. A core concept that permeates many perspectives on emotion is valence, which broadly refers to the positive and negative character of emotion or some of its aspects. In the present paper, we propose a definition of emotional valence in terms of the negative rate of change of free-energy over time. If the second time-derivative of free-energy is taken into account, the dynamics of basic forms of emotion such as happiness, unhappiness, hope, fear, disappointment and relief can be explained. In this formulation, an important function of emotional valence turns out to be the regulation of the learning rate of the causes of sensory inputs. When sensations increasingly violate the agent's expectations, valence is negative and increases the learning rate. Conversely, when sensations increasingly fulfil the agent's expectations, valence is positive and decreases the learning rate. This dynamic interaction between emotional valence and learning rate highlights the crucial role played by emotions in biological agents' adaptation to unexpected changes in their world.
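
Writing $F(t)$ for the agent's free energy, the proposed definition can be stated compactly as

```latex
V(t) \;=\; -\,\frac{dF(t)}{dt}
```

so valence is positive while free energy decreases (expectations increasingly fulfilled) and negative while it increases; according to the abstract, the sign of the second derivative $d^2F/dt^2$ then separates transitional states such as hope, fear, relief and disappointment.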

Journal ArticleDOI
TL;DR: It is found that human subjects combined single actions to build action sequences and that the formation of such action sequences was sufficient to explain habitual actions, and based on Bayesian model comparison, a family of hierarchical RL models, assuming a hierarchical interaction between habit and goal-directed processes, provided a better fit of the subjects' behavior than a family of flat models.
Abstract: Behavioral evidence suggests that instrumental conditioning is governed by two forms of action control: a goal-directed and a habit learning process. Model-based reinforcement learning (RL) has been argued to underlie the goal-directed process; however, the way in which it interacts with habits and the structure of the habitual process has remained unclear. According to a flat architecture, the habitual process corresponds to model-free RL, and its interaction with the goal-directed process is coordinated by an external arbitration mechanism. Alternatively, the interaction between these systems has recently been argued to be hierarchical, such that the formation of action sequences underlies habit learning and a goal-directed process selects between goal-directed actions and habitual sequences of actions to reach the goal. Here we used a two-stage decision-making task to test predictions from these accounts. The hierarchical account predicts that, because they are tied to each other as an action sequence, selecting a habitual action in the first stage will be followed by a habitual action in the second stage, whereas the flat account predicts that the statuses of the first and second stage actions are independent of each other. We found, based on subjects' choices and reaction times, that human subjects combined single actions to build action sequences and that the formation of such action sequences was sufficient to explain habitual actions. Furthermore, based on Bayesian model comparison, a family of hierarchical RL models, assuming a hierarchical interaction between habit and goal-directed processes, provided a better fit of the subjects' behavior than a family of flat models. Although these findings do not rule out all possible model-free accounts of instrumental conditioning, they do show such accounts are not necessary to explain habitual actions and provide a new basis for understanding how goal-directed and habitual action control interact.

Journal ArticleDOI
TL;DR: The theoretical framework is developed and applied to a range of exemplary problems that highlight how to improve experimental investigations into the structure and dynamics of biological systems and their behavior.
Abstract: Our understanding of most biological systems is in its infancy. Learning their structure and intricacies is fraught with challenges, and often side-stepped in favour of studying the function of different gene products in isolation from their physiological context. Constructing and inferring global mathematical models from experimental data is, however, central to systems biology. Different experimental setups provide different insights into such systems. Here we show how we can combine concepts from Bayesian inference and information theory in order to identify experiments that maximize the information content of the resulting data. This approach allows us to incorporate preliminary information; it is global and not constrained to some local neighbourhood in parameter space and it readily yields information on parameter robustness and confidence. Here we develop the theoretical framework and apply it to a range of exemplary problems that highlight how we can improve experimental investigations into the structure and dynamics of biological systems and their behavior.
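
Combining Bayesian inference with information theory for experiment selection usually means ranking candidate designs $d$ by the expected information gain about the parameters $\theta$, i.e., the mutual information between parameters and prospective data $y$:

```latex
U(d) \;=\; \mathbb{E}_{y \mid d}\!\left[\, D_{\mathrm{KL}}\!\big( p(\theta \mid y, d) \,\big\|\, p(\theta) \big) \right] \;=\; I(\theta;\, y \mid d)
```

A design maximizing $U(d)$ is expected to be most informative given the current (preliminary) prior; this is the standard utility behind such approaches, stated here schematically rather than as the paper's exact criterion.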