
Showing papers in "PLOS Computational Biology in 2009"


Journal ArticleDOI
TL;DR: It is concluded that the Bayesian phylogeographic framework will be an important asset in molecular epidemiology that can be easily generalized to infer biogeography from genetic data for many organisms.
Abstract: As a key factor in endemic and epidemic dynamics, the geographical distribution of viruses has been frequently interpreted in the light of their genetic histories. Unfortunately, inference of historical dispersal or migration patterns of viruses has mainly been restricted to model-free heuristic approaches that provide little insight into the temporal setting of the spatial dynamics. The introduction of probabilistic models of evolution, however, offers unique opportunities to engage in this statistical endeavor. Here we introduce a Bayesian framework for inference, visualization and hypothesis testing of phylogeographic history. By implementing character mapping in a Bayesian software that samples time-scaled phylogenies, we enable the reconstruction of timed viral dispersal patterns while accommodating phylogenetic uncertainty. Standard Markov model inference is extended with a stochastic search variable selection procedure that identifies the parsimonious descriptions of the diffusion process. In addition, we propose priors that can incorporate geographical sampling distributions or characterize alternative hypotheses about the spatial dynamics. To visualize the spatial and temporal information, we summarize inferences using virtual globe software. We describe how Bayesian phylogeography compares with previous parsimony analysis in the investigation of the influenza A H5N1 origin and H5N1 epidemiological linkage among sampling localities. Analysis of rabies in West African dog populations reveals how virus diffusion may enable endemic maintenance through continuous epidemic cycles. From these analyses, we conclude that our phylogeographic framework will be an important asset in molecular epidemiology that can be easily generalized to infer biogeography from genetic data for many organisms.
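The stochastic search variable selection step can be pictured as multiplying each pairwise diffusion rate by a binary indicator that the sampler is free to switch off, so that only a sparse set of migration routes retains non-zero rates. A minimal sketch of that idea (location names, rates, and indicator values are invented; the actual BEAST implementation is far richer):

```python
# Illustrative BSSVS-style masking: each migration rate r_ij is multiplied
# by a 0/1 indicator delta_ij; indicators set to 0 remove that route from
# the diffusion process. All values below are made up for illustration.

def effective_rates(rates, indicators):
    """Mask instantaneous migration rates with binary indicators."""
    return {pair: r * indicators[pair] for pair, r in rates.items()}

rates = {("HK", "GX"): 1.7, ("GX", "HN"): 0.4, ("HK", "HN"): 0.9}
indicators = {("HK", "GX"): 1, ("GX", "HN"): 0, ("HK", "HN"): 1}

masked = effective_rates(rates, indicators)
active = {pair for pair, r in masked.items() if r > 0}  # supported routes
print(sorted(active))
```

In the real method the indicators are themselves sampled by MCMC, and their posterior inclusion probabilities yield Bayes factor support for individual routes.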

1,535 citations


Journal ArticleDOI
TL;DR: Over development, the organization of multiple functional networks shifts from a local anatomical emphasis in children to a more “distributed” architecture in young adults, and it is argued that this “local to distributed” developmental characterization has important implications for understanding the development of neural systems underlying cognition.
Abstract: The mature human brain is organized into a collection of specialized functional networks that flexibly interact to support various cognitive functions. Studies of development often attempt to identify the organizing principles that guide the maturation of these functional networks. In this report, we combine resting state functional connectivity MRI (rs-fcMRI), graph analysis, community detection, and spring-embedding visualization techniques to analyze four separate networks defined in earlier studies. As we have previously reported, we find, across development, a trend toward ‘segregation’ (a general decrease in correlation strength) between regions close in anatomical space and ‘integration’ (an increased correlation strength) between selected regions distant in space. The generalization of these earlier trends across multiple networks suggests that this is a general developmental principle for changes in functional connectivity that would extend to graph-theoretic analyses of large-scale brain networks. Communities in children are predominantly arranged by anatomical proximity, while communities in adults predominantly reflect functional relationships, as defined from adult fMRI studies. In sum, over development, the organization of multiple functional networks shifts from a local anatomical emphasis in children to a more “distributed” architecture in young adults. We argue that this “local to distributed” developmental characterization has important implications for understanding the development of neural systems underlying cognition. Further, graph metrics (e.g., clustering coefficients and average path lengths) are similar in child and adult graphs, with both showing “small-world”-like properties, while community detection by modularity optimization reveals stable communities within the graphs that are clearly different between young children and young adults. These observations suggest that early school age children and adults both have relatively efficient systems that may solve similar information processing problems in divergent ways.
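The graph metrics named here have simple definitions; a self-contained sketch computing the clustering coefficient and average shortest path length on a toy graph (the definitions only, not the paper's fMRI data):

```python
from collections import deque
from itertools import combinations

def clustering(adj, v):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    nbrs = adj[v]
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (len(nbrs) * (len(nbrs) - 1))

def avg_path_length(adj):
    """Mean shortest-path length over all reachable ordered node pairs."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:                      # breadth-first search from src
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        total += sum(d for node, d in dist.items() if node != src)
        pairs += len(dist) - 1
    return total / pairs

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}   # toy undirected graph
cc = sum(clustering(adj, v) for v in adj) / len(adj)
apl = avg_path_length(adj)
print(round(cc, 3), round(apl, 3))
```

"Small-world"-like graphs combine a high average clustering coefficient with a short average path length relative to matched random graphs.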

1,358 citations


Journal ArticleDOI
TL;DR: The methods described in this paper are the first to address clinical metagenomic datasets comprising samples from multiple subjects and are robust across datasets of varied complexity and sampling level.
Abstract: Numerous studies are currently underway to characterize the microbial communities inhabiting our world. These studies aim to dramatically expand our understanding of the microbial biosphere and, more importantly, hope to reveal the secrets of the complex symbiotic relationship between us and our commensal bacterial microflora. An important prerequisite for such discoveries is the availability of computational tools able to rapidly and accurately compare large datasets generated from complex bacterial communities and to identify features that distinguish them. We present a statistical method for comparing clinical metagenomic samples from two treatment populations on the basis of count data (e.g. as obtained through sequencing) to detect differentially abundant features. Our method, Metastats, employs the false discovery rate to improve specificity in high-complexity environments, and separately handles sparsely-sampled features using Fisher's exact test. Under a variety of simulations, we show that Metastats performs well compared to previously used methods, and significantly outperforms other methods for features with sparse counts. We demonstrate the utility of our method on several datasets including a 16S rRNA survey of obese and lean human gut microbiomes, COG functional profiles of infant and mature gut microbiomes, and bacterial and viral metabolic subsystem data inferred from random sequencing of 85 metagenomes. The application of our method to the obesity dataset reveals differences between obese and lean subjects not reported in the original study. For the COG and subsystem datasets, we provide the first statistically rigorous assessment of the differences between these populations. The methods described in this paper are the first to address clinical metagenomic datasets comprising samples from multiple subjects. Our methods are robust across datasets of varied complexity and sampling level. While designed for metagenomic applications, our software can also be applied to digital gene expression studies (e.g. SAGE). A web server implementation of our methods and freely available source code can be found at http://metastats.cbcb.umd.edu/.
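Controlling the false discovery rate across many features typically follows the Benjamini–Hochberg idea of comparing sorted p-values against a rising threshold; a minimal illustration with made-up p-values (not the authors' implementation):

```python
def benjamini_hochberg(pvals, alpha=0.1):
    """Return indices of hypotheses rejected at FDR level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # reject up to the largest rank where p <= rank * alpha / m
        if pvals[i] <= rank * alpha / m:
            k = rank
    return sorted(order[:k])

# invented p-values for ten hypothetical features
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.21, 0.8]
rejected = benjamini_hochberg(pvals, alpha=0.1)
print(rejected)
```

Note that features 3 and 4 fail their own per-rank thresholds yet are still rejected, because a later p-value (0.06 at rank 6) clears its threshold; this "step-up" behavior is characteristic of the procedure.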

1,353 citations


Journal ArticleDOI
TL;DR: The results support a model in which variable exterior components feed into a large, densely connected core composed of ubiquitously expressed intracellular proteins.
Abstract: The parts of the genome transcribed by a cell or tissue reflect the biological processes and functions it carries out. We characterized the features of mammalian tissue transcriptomes at the gene level through analysis of RNA deep sequencing (RNA-Seq) data across human and mouse tissues and cell lines. We observed that roughly 8,000 protein-coding genes were ubiquitously expressed, contributing to around 75% of all mRNAs by message copy number in most tissues. These mRNAs encoded proteins that were often intracellular, and tended to be involved in metabolism, transcription, RNA processing or translation. In contrast, genes for secreted or plasma membrane proteins were generally expressed in only a subset of tissues. The distribution of expression levels was broad but fairly continuous: no support was found for the concept of distinct expression classes of genes. Expression estimates that included reads mapping to coding exons only correlated better with qRT-PCR data than estimates which also included 3′ untranslated regions (UTRs). Muscle and liver had the least complex transcriptomes, in that they expressed predominantly ubiquitous genes and a large fraction of the transcripts came from a few highly expressed genes, whereas brain, kidney and testis expressed more complex transcriptomes with the vast majority of genes expressed and relatively small contributions from the most expressed genes. mRNAs expressed in brain had unusually long 3′ UTRs, and mean 3′ UTR length was higher for genes involved in development, morphogenesis and signal transduction, suggesting added complexity of UTR-based regulation for these genes. Our results support a model in which variable exterior components feed into a large, densely connected core composed of ubiquitously expressed intracellular proteins.
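The gene-level bookkeeping behind statements like "~8,000 ubiquitous genes contribute ~75% of mRNAs" reduces to set membership plus count fractions; a toy sketch with invented gene names and counts:

```python
# Hypothetical per-tissue message counts; a gene is "ubiquitously
# expressed" if detected (count > 0) in every tissue, and we ask what
# share of each tissue's mRNA pool such genes supply.
counts = {
    "metabolic_A": {"liver": 900, "brain": 300, "testis": 200},
    "ribosomal_B": {"liver": 700, "brain": 500, "testis": 400},
    "secreted_C":  {"liver": 400, "brain": 0,   "testis": 0},
    "neural_D":    {"liver": 0,   "brain": 600, "testis": 100},
}
tissues = ["liver", "brain", "testis"]

ubiquitous = [g for g, c in counts.items() if all(c[t] > 0 for t in tissues)]
share = {
    t: sum(counts[g][t] for g in ubiquitous) / sum(c[t] for c in counts.values())
    for t in tissues
}
print(ubiquitous, round(share["liver"], 2))
```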

858 citations


Journal ArticleDOI
TL;DR: This work reviews semantic similarity measures applied to biomedical ontologies and proposes their classification according to the strategies they employ: node-based versus edge-based and pairwise versus groupwise.
Abstract: In recent years, ontologies have become a mainstream topic in biomedical research. When biological entities are described using a common schema, such as an ontology, they can be compared by means of their annotations. This type of comparison is called semantic similarity, since it assesses the degree of relatedness between two entities by the similarity in meaning of their annotations. The application of semantic similarity to biomedical ontologies is recent; nevertheless, several studies have been published in the last few years describing and evaluating diverse approaches. Semantic similarity has become a valuable tool for validating the results drawn from biomedical studies such as gene clustering, gene expression data analysis, prediction and validation of molecular interactions, and disease gene prioritization. We review semantic similarity measures applied to biomedical ontologies and propose their classification according to the strategies they employ: node-based versus edge-based and pairwise versus groupwise. We also present comparative assessment studies and discuss the implications of their results. We survey the existing implementations of semantic similarity measures, and we describe examples of applications to biomedical research. This will clarify how biomedical researchers can benefit from semantic similarity measures and help them choose the approach most suitable for their studies. Biomedical ontologies are evolving toward increased coverage, formality, and integration, and their use for annotation is increasingly becoming a focus of both effort by biomedical experts and application of automated annotation procedures to create corpora of higher quality and completeness than are currently available. Given that semantic similarity measures are directly dependent on these evolutions, we can expect to see them gaining more relevance and even becoming as essential as sequence similarity is today in biomedical research.
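A groupwise measure in the sense classified here compares two genes' whole annotation sets directly rather than averaging pairwise term similarities; a minimal Jaccard-style sketch (the GO identifiers are chosen only for illustration):

```python
def groupwise_jaccard(terms_a, terms_b):
    """Groupwise semantic similarity as overlap of two annotation sets."""
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0

# hypothetical GO annotation sets for two genes
gene1 = {"GO:0006412", "GO:0003735", "GO:0005840"}
gene2 = {"GO:0006412", "GO:0005840", "GO:0016070"}
print(round(groupwise_jaccard(gene1, gene2), 2))
```

Node-based pairwise measures instead score individual term pairs, typically via the information content of their common ancestors in the ontology graph, and then aggregate; the set-overlap version above trades that semantic depth for simplicity.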

789 citations


Journal ArticleDOI
TL;DR: The results suggest that misannotation in enzyme superfamilies containing multiple families that catalyze different reactions is a larger problem than has been recognized, and strategies are suggested for addressing some of the systematic problems contributing to these high levels of misannotation.
Abstract: Due to the rapid release of new data from genome sequencing projects, the majority of protein sequences in public databases have not been experimentally characterized; rather, sequences are annotated using computational analysis. The level of misannotation and the types of misannotation in large public databases are currently unknown and have not been analyzed in depth. We have investigated the misannotation levels for molecular function in four public protein sequence databases (UniProtKB/Swiss-Prot, GenBank NR, UniProtKB/TrEMBL, and KEGG) for a model set of 37 enzyme families for which extensive experimental information is available. The manually curated database Swiss-Prot shows the lowest annotation error levels (close to 0% for most families); the two other protein sequence databases (GenBank NR and TrEMBL) and the protein sequences in the KEGG pathways database exhibit similar and surprisingly high levels of misannotation that average 5%–63% across the six superfamilies studied. For 10 of the 37 families examined, the level of misannotation in one or more of these databases is >80%. Examination of the NR database over time shows that misannotation has increased from 1993 to 2005. The types of misannotation that were found fall into several categories, most associated with “overprediction” of molecular function. These results suggest that misannotation in enzyme superfamilies containing multiple families that catalyze different reactions is a larger problem than has been recognized. Strategies are suggested for addressing some of the systematic problems contributing to these high levels of misannotation.

668 citations


Journal ArticleDOI
TL;DR: This model describes the recruitment of new adipose cells and their subsequent development in different strains, and with different diet regimens, with common mechanisms, but with diet- and genetics-dependent model parameters.
Abstract: Adipose tissue grows by two mechanisms: hyperplasia (cell number increase) and hypertrophy (cell size increase). Genetics and diet affect the relative contributions of these two mechanisms to the growth of adipose tissue in obesity. In this study, the size distributions of epididymal adipose cells from two mouse strains, obesity-resistant FVB/N and obesity-prone C57BL/6, were measured after 2, 4, and 12 weeks under regular and high-fat feeding conditions. The total cell number in the epididymal fat pad was estimated from the fat pad mass and the normalized cell-size distribution. The cell number and volume-weighted mean cell size increase as a function of fat pad mass. To address adipose tissue growth precisely, we developed a mathematical model describing the evolution of the adipose cell-size distributions as a function of the increasing fat pad mass, instead of the increasing chronological time. Our model describes the recruitment of new adipose cells and their subsequent development in different strains, and with different diet regimens, with common mechanisms, but with diet- and genetics-dependent model parameters. Compared to the FVB/N strain, the C57BL/6 strain has greater recruitment of small adipose cells. Hyperplasia is enhanced by high-fat diet in a strain-dependent way, suggesting a synergistic interaction between genetics and diet. Moreover, high-fat feeding increases the rate of adipose cell size growth, independent of strain, reflecting the increase in calories requiring storage. Additionally, high-fat diet leads to a dramatic spreading of the size distribution of adipose cells in both strains; this implies an increase in size fluctuations of adipose cells through lipid turnover.

653 citations


Journal ArticleDOI
TL;DR: A new paradigm of non-oscillatory “asynchronous,” scale-free, changes in cortical potentials, corresponding to changes in mean population-averaged firing rate, to complement the prevalent “synchronous” rhythm-based paradigm is suggested.
Abstract: Recent studies have identified broadband phenomena in the electric potentials produced by the brain. We report the finding of power-law scaling in these signals using subdural electrocorticographic recordings from the surface of human cortex. The power spectral density (PSD) of the electric potential has a power-law form from 80 to 500 Hz. This scaling index is conserved across subjects, cortical area, and local neural activity levels. The shape of the PSD does not change with increases in local cortical activity, but its amplitude increases. We observe a “knee” in the spectra, implying the existence of a characteristic time scale. Below the knee frequency, we explore two-power-law forms of the PSD, and demonstrate that there are activity-related fluctuations in the amplitude of a power-law process lying beneath the rhythms. Finally, we illustrate through simulation how small-scale, simplified neuronal models could lead to these power-law observations. This suggests a new paradigm of non-oscillatory “asynchronous,” scale-free, changes in cortical potentials, corresponding to changes in mean population-averaged firing rate, to complement the prevalent “synchronous” rhythm-based paradigm.
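Power-law scaling of the form P(f) = A·f^(−χ) is typically estimated as the slope of log P against log f over the scaling range; a sketch on noiseless synthetic data (the amplitude and exponent below are invented, not the paper's fitted values):

```python
import math

# Synthetic PSD with the power-law form P(f) = A * f**(-chi), evaluated
# over the 80-500 Hz range discussed in the abstract. A and chi are
# arbitrary illustrative values.
A, chi = 50.0, 2.4
freqs = list(range(80, 501, 5))
psd = [A * f ** (-chi) for f in freqs]

# least-squares slope of log P versus log f recovers -chi
xs = [math.log(f) for f in freqs]
ys = [math.log(p) for p in psd]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
    / sum((x - mx) ** 2 for x in xs)
print(round(-slope, 2))
```

On real spectra the fit is done over a restricted band, since the "knee" and any two-power-law structure make a single global slope meaningless.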

633 citations


Journal ArticleDOI
TL;DR: It is demonstrated that SHRiMP can accurately map reads to this highly polymorphic genome, while confirming high heterozygosity of C. savignyi in this second individual.
Abstract: The development of Next Generation Sequencing technologies, capable of sequencing hundreds of millions of short reads (25–70 bp each) in a single run, is opening the door to population genomic studies of non-model species. In this paper we present SHRiMP - the SHort Read Mapping Package: a set of algorithms and methods to map short reads to a genome, even in the presence of a large amount of polymorphism. Our method is based upon a fast read mapping technique, separate thorough alignment methods for regular letter-space as well as AB SOLiD (color-space) reads, and a statistical model for false positive hits. We use SHRiMP to map reads from a newly sequenced Ciona savignyi individual to the reference genome. We demonstrate that SHRiMP can accurately map reads to this highly polymorphic genome, while confirming high heterozygosity of C. savignyi in this second individual. SHRiMP is freely available at http://compbio.cs.toronto.edu/shrimp.
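Fast read mapping of the seed-and-extend flavour first indexes the reference's k-mers and uses exact k-mer hits to nominate candidate positions, which a thorough aligner then verifies. A toy sketch of that pipeline stage (not SHRiMP's actual spaced-seed algorithm or parameters):

```python
def build_index(ref, k):
    """Map every k-mer of the reference to its start positions."""
    index = {}
    for i in range(len(ref) - k + 1):
        index.setdefault(ref[i:i + k], []).append(i)
    return index

def candidate_positions(read, index, k):
    """Candidate read start positions implied by exact k-mer hits."""
    hits = set()
    for j in range(len(read) - k + 1):
        for pos in index.get(read[j:j + k], []):
            hits.add(pos - j)       # offset hit back to the read's start
    return sorted(h for h in hits if h >= 0)

ref = "ACGTACGTTAGCCATGACGT"        # toy reference sequence
index = build_index(ref, k=4)
cands = candidate_positions("TAGCCTTG", index, k=4)
print(cands)
```

Each candidate is then scored with a full alignment (and, for SOLiD data, a color-space-aware one) together with a statistical model to filter false positive hits.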

610 citations


Journal ArticleDOI
TL;DR: It is shown, in contrast to previous models, that continuous attractor models of grid cell activity can generate regular triangular grid responses, based on inputs that encode only the rat's velocity and heading direction, which form a proof-of-concept that continuous attractor dynamics may underlie velocity integration in the dorsolateral medial entorhinal cortex.
Abstract: Grid cells in the rat entorhinal cortex display strikingly regular firing responses to the animal's position in 2-D space and have been hypothesized to form the neural substrate for dead-reckoning. However, errors accumulate rapidly when velocity inputs are integrated in existing models of grid cell activity. To produce grid-cell-like responses, these models would require frequent resets triggered by external sensory cues. Such inadequacies, shared by various models, cast doubt on the dead-reckoning potential of the grid cell system. Here we focus on the question of accurate path integration, specifically in continuous attractor models of grid cell activity. We show, in contrast to previous models, that continuous attractor models can generate regular triangular grid responses, based on inputs that encode only the rat's velocity and heading direction. We consider the role of the network boundary in the integration performance of the network and show that both periodic and aperiodic networks are capable of accurate path integration, despite important differences in their attractor manifolds. We quantify the rate at which errors in the velocity integration accumulate as a function of network size and intrinsic noise within the network. With a plausible range of parameters and the inclusion of spike variability, our model networks can accurately integrate velocity inputs over a maximum of ∼10–100 meters and ∼1–10 minutes. These findings form a proof-of-concept that continuous attractor dynamics may underlie velocity integration in the dorsolateral medial entorhinal cortex. The simulations also generate pertinent upper bounds on the accuracy of integration that may be achieved by continuous attractor dynamics in the grid cell network. We suggest experiments to test the continuous attractor model and differentiate it from models in which single cells establish their responses independently of each other.
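The core difficulty, that integrating a noisy velocity signal accumulates error, is already visible in a generic random-walk argument (this is not the paper's attractor-network model): the root-mean-square position error grows roughly as the square root of the number of integration steps.

```python
import math
import random

# Toy illustration: integrate zero-mean Gaussian velocity noise over
# `steps` time steps and measure the RMS position error across trials.
# All parameter values are arbitrary.
def rms_error(steps, sigma, trials=200, rng=random.Random(42)):
    sq = 0.0
    for _ in range(trials):
        err = sum(rng.gauss(0.0, sigma) for _ in range(steps))
        sq += err * err
    return math.sqrt(sq / trials)

short, long_ = rms_error(25, 0.1), rms_error(2500, 0.1)
print(round(long_ / short, 1))   # roughly sqrt(2500 / 25) = 10
```

This sqrt-of-time drift is why models without error correction need frequent resets from external sensory cues, and why the paper's ∼10–100 m and ∼1–10 min bounds depend on network size and intrinsic noise.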

589 citations


Journal ArticleDOI
TL;DR: Testing the hypothesis that individual differences in intelligence are associated with brain structural organization, and that higher scores on intelligence tests are related to greater global efficiency of the brain anatomical network, suggests that the efficiency of brain structural organization may be an important biological basis for intelligence.
Abstract: Intuitively, higher intelligence might be assumed to correspond to more efficient information transfer in the brain, but no direct evidence has been reported from the perspective of brain networks. In this study, we performed extensive analyses to test the hypothesis that individual differences in intelligence are associated with brain structural organization, and in particular that higher scores on intelligence tests are related to greater global efficiency of the brain anatomical network. We constructed binary and weighted brain anatomical networks in each of 79 healthy young adults utilizing diffusion tensor tractography and calculated topological properties of the networks using a graph theoretical method. Based on their IQ test scores, all subjects were divided into general and high intelligence groups and significantly higher global efficiencies were found in the networks of the latter group. Moreover, we showed significant correlations between IQ scores and network properties across all subjects while controlling for age and gender. Specifically, higher intelligence scores corresponded to a shorter characteristic path length and a higher global efficiency of the networks, indicating a more efficient parallel information transfer in the brain. The results were consistently observed not only in the binary but also in the weighted networks, which together provide convergent evidence for our hypothesis. Our findings suggest that the efficiency of brain structural organization may be an important biological basis for intelligence.
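Global efficiency, the average inverse shortest-path length over all node pairs, can be computed with a plain breadth-first search; a sketch comparing a ring to a fully connected toy graph (the definition only, not the diffusion-tractography pipeline):

```python
from collections import deque

def global_efficiency(adj):
    """Mean of 1/d(u, v) over all ordered node pairs (unreachable -> 0)."""
    n = len(adj)
    total = 0.0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:                      # BFS shortest paths from src
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        total += sum(1.0 / d for node, d in dist.items() if node != src)
    return total / (n * (n - 1))

ring = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {2, 0}}       # 4-cycle
full = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
print(global_efficiency(full) > global_efficiency(ring))
```

Weighted variants replace hop counts with summed edge weights along shortest paths, which is how the binary and weighted analyses in the study differ.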

Journal ArticleDOI
TL;DR: A new phenotypic database summarizing correlations obtained from the disease history of more than 30 million patients in a Phenotypic Disease Network (PDN) is introduced, offering the potential to enhance the understanding of the origin and evolution of human diseases.
Abstract: The use of networks to integrate different genetic, proteomic, and metabolic datasets has been proposed as a viable path toward elucidating the origins of specific diseases. Here we introduce a new phenotypic database summarizing correlations obtained from the disease history of more than 30 million patients in a Phenotypic Disease Network (PDN). We present evidence that the structure of the PDN is relevant to the understanding of illness progression by showing that (1) patients develop diseases close in the network to those they already have; (2) the progression of disease along the links of the network is different for patients of different genders and ethnicities; (3) patients diagnosed with diseases which are more highly connected in the PDN tend to die sooner than those affected by less connected diseases; and (4) diseases that tend to be preceded by others in the PDN tend to be more connected than diseases that precede other illnesses, and are associated with higher degrees of mortality. Our findings show that disease progression can be represented and studied using network methods, offering the potential to enhance our understanding of the origin and evolution of human diseases. The dataset introduced here, released concurrently with this publication, represents the largest relational phenotypic resource publicly available to the research community.
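Comorbidity links in a phenotypic disease network are commonly quantified by relative risk, RR = C·N / (I_a·I_b), the observed co-occurrence of two diagnoses against what independent prevalences would predict; a toy sketch with invented patient records:

```python
def relative_risk(records, a, b):
    """Observed co-occurrence of a and b over the independence expectation."""
    n = len(records)
    i_a = sum(a in r for r in records)      # patients diagnosed with a
    i_b = sum(b in r for r in records)      # patients diagnosed with b
    c_ab = sum(a in r and b in r for r in records)
    return c_ab * n / (i_a * i_b)

# hypothetical diagnosis sets, one per patient
records = [
    {"diabetes", "hypertension"},
    {"diabetes", "hypertension", "obesity"},
    {"hypertension"},
    {"asthma"},
    {"diabetes", "obesity"},
    {"asthma", "obesity"},
]
rr = relative_risk(records, "diabetes", "hypertension")
print(round(rr, 2))
```

RR > 1 indicates the pair co-occurs more often than chance, which is the criterion for drawing an edge between the two phenotypes.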

Journal ArticleDOI
TL;DR: Scanning several hundred proteomes showed that the occurrence of disordered binding sites increased with the complexity of the organisms even compared to disordered regions in general, and the length distribution of binding sites was different from disordered protein regions in general and was dominated by shorter segments.
Abstract: Many disordered proteins function via binding to a structured partner and undergo a disorder-to-order transition. The coupled folding and binding can confer several functional advantages such as the precise control of binding specificity without increased affinity. Additionally, the inherent flexibility allows the binding site to adopt various conformations and to bind to multiple partners. These features explain the prevalence of such binding elements in signaling and regulatory processes. In this work, we report ANCHOR, a method for the prediction of disordered binding regions. ANCHOR relies on the pairwise energy estimation approach that is the basis of IUPred, a previous general disorder prediction method. In order to predict disordered binding regions, we seek to identify segments that are in disordered regions, cannot form enough favorable intrachain interactions to fold on their own, and are likely to gain stabilizing energy by interacting with a globular protein partner. The performance of ANCHOR was found to be largely independent from the amino acid composition and adopted secondary structure. Longer binding sites generally were predicted to be segmented, in agreement with available experimentally characterized examples. Scanning several hundred proteomes showed that the occurrence of disordered binding sites increased with the complexity of the organisms even compared to disordered regions in general. Furthermore, the length distribution of binding sites was different from disordered protein regions in general and was dominated by shorter segments. These results underline the importance of disordered proteins and protein segments in establishing new binding regions. Due to their specific biophysical properties, disordered binding sites generally carry a robust sequence signal, and this signal is efficiently captured by our method. Through its generality, ANCHOR opens new ways to study the essential functional sites of disordered proteins.

Journal ArticleDOI
TL;DR: It is suggested that human brain functional systems exist in an endogenous state of dynamical criticality, characterized by a greater than random probability of both prolonged periods of phase-locking and occurrence of large rapid changes in the state of global synchronization, analogous to the neuronal “avalanches” previously described in cellular systems.
Abstract: Self-organized criticality is an attractive model for human brain dynamics, but there has been little direct evidence for its existence in large-scale systems measured by neuroimaging. In general, critical systems are associated with fractal or power law scaling, long-range correlations in space and time, and rapid reconfiguration in response to external inputs. Here, we consider two measures of phase synchronization: the phase-lock interval, or duration of coupling between a pair of (neurophysiological) processes, and the lability of global synchronization of a (brain functional) network. Using computational simulations of two mechanistically distinct systems displaying complex dynamics, the Ising model and the Kuramoto model, we show that both synchronization metrics have power law probability distributions specifically when these systems are in a critical state. We then demonstrate power law scaling of both pairwise and global synchronization metrics in functional MRI and magnetoencephalographic data recorded from normal volunteers under resting conditions. These results strongly suggest that human brain functional systems exist in an endogenous state of dynamical criticality, characterized by a greater than random probability of both prolonged periods of phase-locking and occurrence of large rapid changes in the state of global synchronization, analogous to the neuronal “avalanches” previously described in cellular systems. Moreover, evidence for critical dynamics was identified consistently in neurophysiological systems operating at frequency intervals ranging from 0.05–0.11 to 62.5–125 Hz, confirming that criticality is a property of human brain functional network organization at all frequency intervals in the brain's physiological bandwidth.

Journal ArticleDOI
TL;DR: A matching model for short reads that can handle the problem of leading and trailing contaminations caused by primers and poly-A tails in transcriptomics or the length-dependent increase of error rates is introduced and shows significantly increased performance.
Abstract: With few exceptions, current methods for short read mapping make use of simple seed heuristics to speed up the search. Most of the underlying matching models neglect the necessity to allow not only mismatches, but also insertions and deletions. Current evaluations indicate, however, that very different error models apply to the novel high-throughput sequencing methods. While the most frequent error-type in Illumina reads are mismatches, reads produced by 454's GS FLX predominantly contain insertions and deletions (indels). Even though 454 sequencers are able to produce longer reads, the method is frequently applied to small RNA (miRNA and siRNA) sequencing. Fast and accurate matching in particular of short reads with diverse errors is therefore a pressing practical problem. We introduce a matching model for short reads that can, besides mismatches, also cope with indels. It addresses different error models. For example, it can handle the problem of leading and trailing contaminations caused by primers and poly-A tails in transcriptomics or the length-dependent increase of error rates. In these contexts, it thus simplifies the tedious and error-prone trimming step. For efficient searches, our method utilizes index structures in the form of enhanced suffix arrays. In a comparison with current methods for short read mapping, the presented approach shows significantly increased performance not only for 454 reads, but also for Illumina reads. Our approach is implemented in the software segemehl available at http://www.bioinf.uni-leipzig.de/Software/segemehl/.
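Allowing mismatches as well as insertions and deletions amounts to alignment under edit distance; a minimal dynamic-programming sketch (a real mapper like segemehl additionally uses enhanced suffix arrays for seeding and semi-global scoring so that leading/trailing contamination such as primers and poly-A tails is not penalized):

```python
def edit_distance(a, b):
    """Unit-cost Levenshtein distance via rolling-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion from a
                           cur[j - 1] + 1,             # insertion into a
                           prev[j - 1] + (ca != cb)))  # match / mismatch
        prev = cur
    return prev[-1]

# one deletion plus one insertion turn the first read into the second
d = edit_distance("ACGTACGT", "ACGACGTT")
print(d)
```

Under a mismatch-only (Hamming) model the same pair scores 4 differences, which is why indel-tolerant matching matters so much for 454-style reads.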

Journal ArticleDOI
TL;DR: The model suggests that dynamic lesion effects can be predicted on the basis of specific network measures of structural brain networks and that these effects may be related to known behavioral and cognitive consequences of brain lesions.
Abstract: Lesions of anatomical brain networks result in functional disturbances of brain systems and behavior which depend sensitively, often unpredictably, on the lesion site. The availability of whole-brain maps of structural connections within the human cerebrum and our increased understanding of the physiology and large-scale dynamics of cortical networks allow us to investigate the functional consequences of focal brain lesions in a computational model. We simulate the dynamic effects of lesions placed in different regions of the cerebral cortex by recording changes in the pattern of endogenous (“resting-state”) neural activity. We find that lesions produce specific patterns of altered functional connectivity among distant regions of cortex, often affecting both cortical hemispheres. The magnitude of these dynamic effects depends on the lesion location and is partly predicted by structural network properties of the lesion site. In the model, lesions along the cortical midline and in the vicinity of the temporo-parietal junction result in large and widely distributed changes in functional connectivity, while lesions of primary sensory or motor regions remain more localized. The model suggests that dynamic lesion effects can be predicted on the basis of specific network measures of structural brain networks and that these effects may be related to known behavioral and cognitive consequences of brain lesions.

Journal ArticleDOI
TL;DR: It is inferred that the −13,910*T allele first underwent selection among dairying farmers around 7,500 years ago in a region between the central Balkans and central Europe, possibly in association with the dissemination of the Neolithic Linearbandkeramik culture over Central Europe.
Abstract: Lactase persistence (LP) is common among people of European ancestry, but with the exception of some African, Middle Eastern and southern Asian groups, is rare or absent elsewhere in the world. Lactase gene haplotype conservation around a polymorphism strongly associated with LP in Europeans (−13,910 C/T) indicates that the derived allele is recent in origin and has been subject to strong positive selection. Furthermore, ancient DNA work has shown that the −13,910*T (derived) allele was very rare or absent in early Neolithic central Europeans. It is unlikely that LP would provide a selective advantage without a supply of fresh milk, and this has led to a gene-culture coevolutionary model where lactase persistence is only favoured in cultures practicing dairying, and dairying is more favoured in lactase persistent populations. We have developed a flexible demic computer simulation model to explore the spread of lactase persistence, dairying, other subsistence practices and unlinked genetic markers across the geographic space of Europe and western Asia. Using data on −13,910*T allele frequency and farming arrival dates across Europe, and approximate Bayesian computation to estimate parameters of interest, we infer that the −13,910*T allele first underwent selection among dairying farmers around 7,500 years ago in a region between the central Balkans and central Europe, possibly in association with the dissemination of the Neolithic Linearbandkeramik culture over Central Europe. Furthermore, our results suggest that natural selection favouring a lactase persistence allele was not higher in northern latitudes through an increased requirement for dietary vitamin D. Our results provide a coherent and spatially explicit picture of the coevolution of lactase persistence and dairying in Europe.
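The parameter-estimation step, approximate Bayesian computation, can be illustrated with a minimal rejection sampler. The one-parameter simulator below is a toy stand-in for the paper's demic simulation; every name and number in it is hypothetical.

```python
import random

def rejection_abc(sample_prior, simulate, observed, distance, eps, n_draws, seed=0):
    """Minimal rejection ABC: draw parameters from the prior, run the
    simulator, and keep only draws whose simulated summary statistic
    falls within eps of the observed data."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = sample_prior(rng)
        if distance(simulate(theta), observed) <= eps:
            accepted.append(theta)
    return accepted

# Toy stand-in for the demic simulation: the "summary statistic" is a
# deterministic function of a selection-coefficient-like parameter theta.
posterior = rejection_abc(
    sample_prior=lambda rng: rng.uniform(0.0, 1.0),
    simulate=lambda theta: 2.0 * theta,
    observed=1.0,
    distance=lambda sim, obs: abs(sim - obs),
    eps=0.1,
    n_draws=2000,
)
```

The accepted draws approximate the posterior; here they cluster around theta = 0.5, the value whose simulated statistic matches the observation.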

Journal ArticleDOI
TL;DR: The present study identifies major statistical features of audiovisual speech and interprets these data in the context of recent neural theories of speech which suggest that speech communication is a reciprocally coupled, multisensory event, whereby the outputs of the signaler are matched to the neural processes of the receiver.
Abstract: Humans, like other animals, are exposed to a continuous stream of signals, which are dynamic, multimodal, extended, and time-varying in nature. This complex input space must be transduced and sampled by our sensory systems and transmitted to the brain where it can guide the selection of appropriate actions. To simplify this process, it has been suggested that the brain exploits statistical regularities in the stimulus space. Tests of this idea have largely been confined to unimodal signals and natural scenes. One important class of multisensory signals for which a quantitative input space characterization is unavailable is human speech. We do not understand what signals our brain has to actively piece together from an audiovisual speech stream to arrive at a percept versus what is already embedded in the signal structure of the stream itself. In essence, we do not have a clear understanding of the natural statistics of audiovisual speech. In the present study, we identified the following major statistical features of audiovisual speech. First, we observed robust correlations and close temporal correspondence between the area of the mouth opening and the acoustic envelope. Second, we found the strongest correlation between the area of the mouth opening and vocal tract resonances. Third, we observed that both the area of the mouth opening and the voice envelope are temporally modulated in the 2-7 Hz frequency range. Finally, we show that the timing of mouth movements relative to the onset of the voice is consistently between 100 and 300 ms. We interpret these data in the context of recent neural theories of speech which suggest that speech communication is a reciprocally coupled, multisensory event, whereby the outputs of the signaler are matched to the neural processes of the receiver.
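The reported correlations and 100-300 ms timing offsets are the kind of quantity a lagged cross-correlation analysis yields. A minimal sketch on synthetic signals (not the authors' audiovisual recordings; the 150 ms lead of the "mouth" signal is chosen purely for the demo):

```python
import numpy as np

def lagged_corr(x, y, k):
    """Pearson correlation of x against y shifted by k samples
    (positive k means y trails x; negative k means y leads)."""
    if k > 0:
        a, b = x[:-k], y[k:]
    elif k < 0:
        a, b = x[-k:], y[:k]
    else:
        a, b = x, y
    return float(np.corrcoef(a, b)[0, 1])

def peak_lag(x, y, fs, max_lag_s):
    """Lag (in seconds) within +/- max_lag_s at which x and y correlate most."""
    shifts = range(-int(max_lag_s * fs), int(max_lag_s * fs) + 1)
    best = max(shifts, key=lambda k: lagged_corr(x, y, k))
    return best / fs, lagged_corr(x, y, best)

# Synthetic slowly varying "mouth area" and a "voice envelope" that is the
# same underlying signal delayed by 150 samples (150 ms at fs = 1000 Hz).
fs = 1000
rng = np.random.default_rng(0)
base = np.convolve(rng.normal(size=2600), np.ones(100) / 100, mode="valid")
mouth = base[150:2150]   # mouth movement leads ...
voice = base[:2000]      # ... the voice envelope by 150 ms
lag, r = peak_lag(mouth, voice, fs, max_lag_s=0.3)
```

The recovered peak lag of +0.15 s reproduces the built-in delay, illustrating how a mouth-leads-voice offset can be read off real recordings.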

Journal ArticleDOI
TL;DR: This paper presents a meta-analysis of the dynamic response of the immune system to infectious disease and shows clear trends in immune-inflammatory bowel disease and central nervous system disorders.
Abstract: 3,41Max Planck Institute for Dynamics of Complex Technical Systems, Magdeburg, Germany, 2Institute for Mathematical Optimization, Faculty of Mathematics, Otto-von-Guericke University Magdeburg, Magdeburg, Germany, 3Institute for Bioinformatics and Systems Biology, Helmholtz Zentrum Mu¨nchen—German Research Center forEnvironmental Health, Neuherberg, Germany, 4Max Planck Institute for Dynamics and Self-Organization, Go¨ttingen, Germany

Journal ArticleDOI
TL;DR: This paper describes how Bayesian belief propagation in a spatio-temporal hierarchical model, called Hierarchical Temporal Memory (HTM), can lead to a mathematical model for cortical circuits and describes testable predictions that can be derived from the model.
Abstract: The theoretical setting of hierarchical Bayesian inference is gaining acceptance as a framework for understanding cortical computation. In this paper, we describe how Bayesian belief propagation in a spatio-temporal hierarchical model, called Hierarchical Temporal Memory (HTM), can lead to a mathematical model for cortical circuits. An HTM node is abstracted using a coincidence detector and a mixture of Markov chains. Bayesian belief propagation equations for such an HTM node define a set of functional constraints for a neuronal implementation. Anatomical data provide a contrasting set of organizational constraints. The combination of these two constraints suggests a theoretically derived interpretation for many anatomical and physiological features and predicts several others. We describe the pattern recognition capabilities of HTM networks and demonstrate the application of the derived circuits for modeling the subjective contour effect. We also discuss how the theory and the circuit can be extended to explain cortical features that are not explained by the current model and describe testable predictions that can be derived from the model.

Journal ArticleDOI
TL;DR: In this article, the results of nine leading orthology identification projects and methods (COG, KOG, Inparanoid, OrthoMCL, Ensembl Compara, Homologene, RoundUp, EggNOG, and OMA) were systematically compared with respect to both phylogeny and function, using six different tests.
Abstract: Accurate genome-wide identification of orthologs is a central problem in comparative genomics, a fact reflected by the numerous orthology identification projects developed in recent years. However, only a few reports have compared their accuracy, and indeed, several recent efforts have not yet been systematically evaluated. Furthermore, orthology is typically only assessed in terms of function conservation, despite the phylogeny-based original definition of Fitch. We collected and mapped the results of nine leading orthology projects and methods (COG, KOG, Inparanoid, OrthoMCL, Ensembl Compara, Homologene, RoundUp, EggNOG, and OMA) and two standard methods (bidirectional best-hit and reciprocal smallest distance). We systematically compared their predictions with respect to both phylogeny and function, using six different tests. This required the mapping of millions of sequences, the handling of hundreds of millions of predicted pairs of orthologs, and the computation of tens of thousands of trees. In phylogenetic analysis or in functional analysis where high specificity is required, we find that OMA and Homologene perform best. At lower functional specificity but higher coverage level, OrthoMCL outperforms Ensembl Compara, and to a lesser extent Inparanoid. Lastly, the large coverage of the recent EggNOG can be of interest for building broad functional groupings, but the method is not specific enough for phylogenetic or detailed function analyses. In terms of general methodology, we observe that the more sophisticated tree reconstruction/reconciliation approach of Ensembl Compara was at times outperformed by pairwise comparison approaches, even in phylogenetic tests. Furthermore, we show that standard bidirectional best-hit often outperforms projects with more complex algorithms. First, the present study provides guidance for the broad community of orthology data users as to which database best suits their needs. Second, it introduces new methodology to verify orthology. Third, it sets performance standards for current and future approaches.
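As a reference point, the bidirectional best-hit criterion mentioned above is simple enough to state in a few lines. A minimal sketch, assuming precomputed similarity scores (e.g. BLAST bit scores) between two genomes; the data structures are illustrative:

```python
def bidirectional_best_hits(scores_ab, scores_ba):
    """Orthology prediction by bidirectional best hit (BBH): gene a of
    genome A and gene b of genome B are called orthologs when each is the
    other's highest-scoring hit.  scores_ab[a] maps genes of B to
    similarity scores; scores_ba is the symmetric mapping for genome B."""
    best_ab = {a: max(hits, key=hits.get) for a, hits in scores_ab.items() if hits}
    best_ba = {b: max(hits, key=hits.get) for b, hits in scores_ba.items() if hits}
    # keep only pairs where the best hit is reciprocal
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}
```

Despite its simplicity, this is the baseline the study found to often outperform projects with more complex algorithms.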

Journal ArticleDOI
TL;DR: A parsimony approach, called MinPath (Minimal set of Pathways), is developed for biological pathway reconstructions using protein family predictions, which yields a more conservative, yet more faithful, estimation of the biological pathways for a query dataset.
Abstract: A common biological pathway reconstruction approach—as implemented by many automatic biological pathway services (such as the KAAS and RAST servers) and the functional annotation of metagenomic sequences—starts with the identification of protein functions or families (e.g., KO families for the KEGG database and the FIG families for the SEED database) in the query sequences, followed by a direct mapping of the identified protein families onto pathways. Given a predicted patchwork of individual biochemical steps, some metric must be applied in deciding what pathways actually exist in the genome or metagenome represented by the sequences. Commonly, and straightforwardly, a complete biological pathway can be identified in a dataset if at least one of the steps associated with the pathway is found. We report, however, that this naive mapping approach leads to an inflated estimate of biological pathways, and thus overestimates the functional diversity of the sample from which the DNA sequences are derived. We developed a parsimony approach, called MinPath (Minimal set of Pathways), for biological pathway reconstructions using protein family predictions, which yields a more conservative, yet more faithful, estimation of the biological pathways for a query dataset. MinPath identified far fewer pathways for the genomes collected in the KEGG database—as compared to the naive mapping approach—eliminating some obviously spurious pathway annotations. Results from applying MinPath to several metagenomes indicate that the common methods used for metagenome annotation may significantly overestimate the biological pathways encoded by microbial communities.
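The parsimony idea behind MinPath is a set-cover problem: find the smallest set of pathways whose member protein families explain every family observed in the sample. MinPath itself solves this exactly with integer programming; the sketch below uses a greedy approximation for illustration, and the family and pathway identifiers in the test are made up.

```python
def minimal_pathway_set(pathway_to_families, observed_families):
    """Greedy approximation of the MinPath parsimony problem: pick a small
    set of pathways whose member protein families jointly cover every
    observed family.  (MinPath solves the problem exactly with integer
    programming; greedy set cover is used here only for illustration.)"""
    coverable = {f for f in set(observed_families)
                 if any(f in fams for fams in pathway_to_families.values())}
    uncovered, chosen = set(coverable), []
    while uncovered:
        # pathway explaining the most still-unexplained families
        best = max(pathway_to_families,
                   key=lambda p: len(set(pathway_to_families[p]) & uncovered))
        gain = set(pathway_to_families[best]) & uncovered
        if not gain:
            break
        chosen.append(best)
        uncovered -= gain
    return chosen
```

Note how a pathway supported by a single shared family, which naive one-hit mapping would report, is never selected when larger pathways already explain that family.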

Journal ArticleDOI
TL;DR: E-Flux as mentioned in this paper extends the technique of Flux Balance Analysis by modeling maximum flux constraints as a function of measured gene expression, which can predict changes in metabolic flux capacity.
Abstract: Metabolism is central to cell physiology, and metabolic disturbances play a role in numerous disease states. Despite its importance, the ability to study metabolism at a global scale using genomic technologies is limited. In principle, complete genome sequences describe the range of metabolic reactions that are possible for an organism, but cannot quantitatively describe the behavior of these reactions. We present a novel method for modeling metabolic states using whole cell measurements of gene expression. Our method, which we call E-Flux (as a combination of flux and expression), extends the technique of Flux Balance Analysis by modeling maximum flux constraints as a function of measured gene expression. In contrast to previous methods for metabolically interpreting gene expression data, E-Flux utilizes a model of the underlying metabolic network to directly predict changes in metabolic flux capacity. We applied E-Flux to Mycobacterium tuberculosis, the bacterium that causes tuberculosis (TB). Key components of mycobacterial cell walls are mycolic acids, which are targets for several first-line TB drugs. We used E-Flux to predict the impact of 75 different drugs, drug combinations, and nutrient conditions on mycolic acid biosynthesis capacity in M. tuberculosis, using a public compendium of over 400 expression arrays. We tested our method both on a model of mycolic acid biosynthesis and on a genome-scale model of M. tuberculosis metabolism. Our method correctly predicts seven of the eight known fatty acid inhibitors in this compendium and makes accurate predictions regarding the specificity of these compounds for fatty acid biosynthesis. Our method also predicts a number of additional potential modulators of TB mycolic acid biosynthesis. E-Flux thus provides a promising new approach for algorithmically predicting metabolic state from gene expression data.
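The core E-Flux constraint, capping each reaction's maximum flux in proportion to the expression of its associated gene, can be shown without a linear-programming solver if the network is reduced to a single unbranched pathway, where the achievable flux is simply the tightest bound. The real method optimizes over a genome-scale stoichiometric model; the gene names below are real FAS-II genes, but the expression values are invented for the demo.

```python
def eflux_bounds(expression, vmax=1.0):
    """E-Flux-style constraint: cap each reaction's maximum flux in
    proportion to the expression level of its associated gene,
    normalized to the most highly expressed gene."""
    top = max(expression.values())
    return {rxn: vmax * level / top for rxn, level in expression.items()}

def linear_pathway_capacity(reactions, bounds):
    """In an unbranched pathway at steady state every reaction carries the
    same flux, so the achievable maximum is the tightest bound."""
    return min(bounds[r] for r in reactions)

# Hypothetical expression levels for three mycolic acid pathway genes,
# before and after a fatty acid synthesis inhibitor represses inhA.
untreated = {"fabD": 100.0, "kasA": 80.0, "inhA": 90.0}
treated = dict(untreated, inhA=20.0)
pathway = ["fabD", "kasA", "inhA"]
cap_untreated = linear_pathway_capacity(pathway, eflux_bounds(untreated))
cap_treated = linear_pathway_capacity(pathway, eflux_bounds(treated))
```

Repression of a single pathway gene tightens the corresponding flux bound and thus lowers the predicted biosynthesis capacity, which is the signal E-Flux reads out of expression arrays.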

Journal ArticleDOI
TL;DR: The determination of the speech MTF furnishes an additional method for producing speech signals with reduced bandwidth but high intelligibility and evaluated the importance of spectrotemporal modulations for vocal gender identification and found a different region of interest.
Abstract: We systematically determined which spectrotemporal modulations in speech are necessary for comprehension by human listeners. Speech comprehension has been shown to be robust to spectral and temporal degradations, but the specific relevance of particular degradations is arguable due to the complexity of the joint spectral and temporal information in the speech signal. We applied a novel modulation filtering technique to recorded sentences to restrict acoustic information quantitatively and to obtain a joint spectrotemporal modulation transfer function for speech comprehension, the speech MTF. For American English, the speech MTF showed the criticality of low modulation frequencies in both time and frequency. Comprehension was significantly impaired when temporal modulations <12 Hz or spectral modulations <4 cycles/kHz were removed. More specifically, the MTF was band-pass in temporal modulations and low-pass in spectral modulations: temporal modulations from 1 to 7 Hz and spectral modulations <1 cycle/kHz were the most important. We evaluated the importance of spectrotemporal modulations for vocal gender identification and found a different region of interest: removing spectral modulations between 3 and 7 cycles/kHz significantly increases gender misidentifications of female speakers. The determination of the speech MTF furnishes an additional method for producing speech signals with reduced bandwidth but high intelligibility. Such compression could be used for audio applications such as file compression or noise removal and for clinical applications such as signal processing for cochlear implants.
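Modulation filtering can be sketched as masking the two-dimensional Fourier transform of a spectrogram, zeroing all spectrotemporal modulations above chosen cutoffs. The technique applied to recorded sentences in the study involves more care (windowing and reconstruction of an audible waveform); the spectrogram below is synthetic and the cutoffs merely echo the numbers in the abstract.

```python
import numpy as np

def modulation_lowpass(spec, dt, df, wt_max, wf_max):
    """Keep only spectrotemporal modulations below the cutoffs by masking
    the 2D Fourier transform of a (frequency x time) spectrogram.
    dt: spectrogram time step in s (temporal modulations in Hz);
    df: frequency step in kHz (spectral modulations in cycles/kHz)."""
    nf, nt = spec.shape
    wt = np.fft.fftfreq(nt, d=dt)  # temporal modulation axis, Hz
    wf = np.fft.fftfreq(nf, d=df)  # spectral modulation axis, cycles/kHz
    mask = (np.abs(wf)[:, None] <= wf_max) & (np.abs(wt)[None, :] <= wt_max)
    return np.fft.ifft2(np.fft.fft2(spec) * mask).real

# Toy spectrogram carrying a 2 Hz and a 20 Hz temporal modulation;
# filtering at 12 Hz should keep the former and remove the latter.
t = np.arange(100) * 0.01                 # 1 s at dt = 10 ms
slow = np.cos(2 * np.pi * 2 * t)
fast = np.cos(2 * np.pi * 20 * t)
spec = np.tile(slow + fast, (20, 1))      # 20 frequency bins, df = 0.05 kHz
filtered = modulation_lowpass(spec, dt=0.01, df=0.05, wt_max=12.0, wf_max=4.0)
```

After filtering, only the slow modulation survives, which is the manipulation underlying the comprehension experiments.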

Journal ArticleDOI
TL;DR: ConCavity is introduced, a small molecule binding site prediction algorithm that integrates evolutionary sequence conservation estimates with structure-based methods for identifying protein surface cavities and finds that the two approaches provide largely complementary information, which can be combined to improve upon either approach alone.
Abstract: Identifying a protein's functional sites is an important step towards characterizing its molecular function. Numerous structure- and sequence-based methods have been developed for this problem. Here we introduce ConCavity, a small molecule binding site prediction algorithm that integrates evolutionary sequence conservation estimates with structure-based methods for identifying protein surface cavities. In large-scale testing on a diverse set of single- and multi-chain protein structures, we show that ConCavity substantially outperforms existing methods for identifying both 3D ligand binding pockets and individual ligand binding residues. As part of our testing, we perform one of the first direct comparisons of conservation-based and structure-based methods. We find that the two approaches provide largely complementary information, which can be combined to improve upon either approach alone. We also demonstrate that ConCavity has state-of-the-art performance in predicting catalytic sites and drug binding pockets. Overall, the algorithms and analysis presented here significantly improve our ability to identify ligand binding sites and further advance our understanding of the relationship between evolutionary sequence conservation and structural and functional attributes of proteins. Data, source code, and prediction visualizations are available on the ConCavity web site (http://compbio.cs.princeton.edu/concavity/).

Journal ArticleDOI
TL;DR: The Fast Statistical Alignment program is based on pair hidden Markov models which approximate an insertion/deletion process on a tree and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment.
Abstract: We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment program is based on pair hidden Markov models which approximate an insertion/deletion process on a tree and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments which are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment—previously available only with alignment programs which use computationally-expensive Markov Chain Monte Carlo approaches—yet can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/.

Journal ArticleDOI
TL;DR: The protocol described in this paper can be included in a drug discovery pipeline in an effort to discover novel drug leads with desired safety profiles, and therefore accelerate the development of new drugs.
Abstract: The rise of multi-drug resistant (MDR) and extensively drug resistant (XDR) tuberculosis around the world, including in industrialized nations, poses a great threat to human health and defines a need to develop new, effective and inexpensive anti-tubercular agents. Previously we developed a chemical systems biology approach to identify off-targets of major pharmaceuticals on a proteome-wide scale. In this paper we further demonstrate the value of this approach through the discovery that existing commercially available drugs, prescribed for the treatment of Parkinson's disease, have the potential to treat MDR and XDR tuberculosis. These drugs, entacapone and tolcapone, are predicted to bind to the enzyme InhA and directly inhibit substrate binding. The prediction is validated by in vitro and InhA kinetic assays using tablets of Comtan, whose active component is entacapone. The minimum inhibitory concentration (MIC99) of entacapone for Mycobacterium tuberculosis (M. tuberculosis) is approximately 260.0 µM, well below the toxicity concentration determined by an in vitro cytotoxicity model using a human neuroblastoma cell line. Moreover, kinetic assays indicate that Comtan inhibits InhA activity by 47.0% at an entacapone concentration of approximately 80 µM. Thus the active component in Comtan represents a promising lead compound for developing a new class of anti-tubercular therapeutics with excellent safety profiles. More generally, the protocol described in this paper can be included in a drug discovery pipeline in an effort to discover novel drug leads with desired safety profiles, and therefore accelerate the development of new drugs.

Journal ArticleDOI
TL;DR: This approach can yield significant, reproducible gains in performance across an array of basic object recognition tasks, consistently outperforming a variety of state-of-the-art purpose-built vision systems from the literature.
Abstract: While many models of biological object recognition share a common set of "broad-stroke" properties, the performance of any one model depends strongly on the choice of parameters in a particular instantiation of that model—e.g., the number of units per layer, the size of pooling kernels, exponents in normalization operations, etc. Since the number of such parameters (explicit or implicit) is typically large and the computational cost of evaluating one particular parameter set is high, the space of possible model instantiations goes largely unexplored. Thus, when a model fails to approach the abilities of biological visual systems, we are left uncertain whether this failure is because we are missing a fundamental idea or because the correct "parts" have not been tuned correctly, assembled at sufficient scale, or provided with enough training. Here, we present a high-throughput approach to the exploration of such parameter sets, leveraging recent advances in stream processing hardware (high-end NVIDIA graphics cards and the PlayStation 3's IBM Cell Processor). In analogy to high-throughput screening approaches in molecular biology and genetics, we explored thousands of potential network architectures and parameter instantiations, screening those that show promising object recognition performance for further analysis. We show that this approach can yield significant, reproducible gains in performance across an array of basic object recognition tasks, consistently outperforming a variety of state-of-the-art purpose-built vision systems from the literature. As the scale of available computational power continues to expand, we argue that this approach has the potential to greatly accelerate progress in both artificial vision and our understanding of the computational underpinning of biological vision.
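The screening loop itself, sample many parameter instantiations, score each, keep the most promising, is straightforward; the hard part the paper addresses is making each evaluation cheap on stream-processing hardware. A miniature version with an invented parameter space and scoring function:

```python
import random

def screen_models(param_space, evaluate, n_candidates, top_k, seed=0):
    """High-throughput screening in miniature: randomly instantiate many
    models from a discrete parameter space, score each one, and keep the
    top performers for further analysis."""
    rng = random.Random(seed)
    scored = []
    for _ in range(n_candidates):
        params = {name: rng.choice(choices) for name, choices in param_space.items()}
        scored.append((evaluate(params), params))
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return scored[:top_k]

# Toy "recognition accuracy" with a known best configuration; the names
# and values are hypothetical stand-ins for model hyperparameters.
space = {"units_per_layer": [16, 32, 64], "pool_size": [1, 2, 4]}

def accuracy(p):
    return p["units_per_layer"] / 64 + (1.0 if p["pool_size"] == 2 else 0.0)

top = screen_models(space, accuracy, n_candidates=300, top_k=5)
```

With enough random draws the known optimum (64 units, pool size 2) reliably surfaces at the head of the ranking, mirroring the screen-then-analyze workflow.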

Journal ArticleDOI
TL;DR: A concerted effort is advocated to fill this gap through systematic, experimental mapping of neural circuits at a mesoscopic scale of resolution suitable for comprehensive, brainwide coverage, using injections of tracers or viral vectors.
Abstract: In this era of complete genomes, our knowledge of neuroanatomical circuitry remains surprisingly sparse. Such knowledge is critical, however, for both basic and clinical research into brain function. Here we advocate for a concerted effort to fill this gap, through systematic, experimental mapping of neural circuits at a mesoscopic scale of resolution suitable for comprehensive, brainwide coverage, using injections of tracers or viral vectors. We detail the scientific and medical rationale and briefly review existing knowledge and experimental techniques. We define a set of desiderata, including brainwide coverage; validated and extensible experimental techniques suitable for standardization and automation; centralized, open-access data repository; compatibility with existing resources; and tractability with current informatics technology. We discuss a hypothetical but tractable plan for mouse, additional efforts for the macaque, and technique development for human. We estimate that the mouse connectivity project could be completed within five years with a comparatively modest budget.

Journal ArticleDOI
TL;DR: This work shows that the results of a bias-exchange metadynamics simulation can be used for constructing a detailed thermodynamic and kinetic model of the Trp-cage folding kinetics and thermodynamics in agreement with experimental data.
Abstract: Trp-cage is a designed 20-residue polypeptide that, in spite of its size, shares several features with larger globular proteins. Although the system has been intensively investigated experimentally and theoretically, its folding mechanism is not yet fully understood. Indeed, some experiments suggest a two-state behavior, while others point to the presence of intermediates. In this work we show that the results of a bias-exchange metadynamics simulation can be used for constructing a detailed thermodynamic and kinetic model of the system. The model, although constructed from a biased simulation, has a quality similar to those extracted from the analysis of long unbiased molecular dynamics trajectories. This is demonstrated by a careful benchmark of the approach on a smaller system, the solvated Ace-Ala3-Nme peptide. For the Trp-cage folding, the model predicts that the relaxation time of 3100 ns observed experimentally is due to the presence of a compact molten globule-like conformation. This state has an occupancy of only 3% at 300 K, but acts as a kinetic trap. Instead, non-compact structures relax to the folded state on the sub-microsecond timescale. The model also predicts the presence of a state at 4.4 Å from the NMR structure in which the Trp strongly interacts with Pro12. This state can explain the abnormal temperature dependence of particular chemical shifts. The structures of the two most stable misfolded intermediates are in agreement with NMR experiments on the unfolded protein. Our work shows that, using biased molecular dynamics trajectories, it is possible to construct a model describing in detail the Trp-cage folding kinetics and thermodynamics in agreement with experimental data.
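A kinetic model of the kind described assigns transition rates between metastable states, and its relaxation times are the negative inverse nonzero eigenvalues of the rate matrix. A minimal sketch on a two-state toy model (the rates are invented, not the Trp-cage estimates):

```python
import numpy as np

def relaxation_times(K):
    """Relaxation timescales of the master equation dp/dt = K p, where
    K[i][j] is the transition rate from state j to state i and each
    column of K sums to zero.  The zero eigenvalue corresponds to the
    stationary distribution; every other eigenvalue equals -1/tau for a
    relaxation time tau.  Returned slowest first."""
    evals = np.linalg.eigvals(np.asarray(K, dtype=float))
    decaying = sorted((e.real for e in evals if e.real < -1e-12), reverse=True)
    return [-1.0 / e for e in decaying]

# Two-state toy: folding rate 1 per us, unfolding rate 3 per us gives a
# single relaxation time 1 / (1 + 3) = 0.25 us.
K2 = [[-1.0, 3.0],
      [1.0, -3.0]]
taus = relaxation_times(K2)
```

In the paper's multi-state model the slowest such timescale is what is compared against the experimentally observed 3100 ns relaxation.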