
Showing papers in "PLOS Computational Biology in 2005"


Journal ArticleDOI
TL;DR: A research strategy to achieve the connection matrix of the human brain (the human “connectome”) is proposed, and its potential impact is discussed.
Abstract: The connection matrix of the human brain (the human “connectome”) represents an indispensable foundation for basic and applied neurobiological research. However, the network of anatomical connections linking the neuronal elements of the human brain is still largely unknown. While some databases or collations of large-scale anatomical connection patterns exist for other mammalian species, there is currently no connection matrix of the human brain, nor is there a coordinated research effort to collect, archive, and disseminate this important information. We propose a research strategy to achieve this goal, and discuss its potential impact.

2,908 citations


Journal ArticleDOI
TL;DR: The results suggest that sloppy sensitivity spectra are universal in systems biology models, highlight the power of collective fits, and suggest that modelers should focus on predictions rather than on parameters.
Abstract: Quantitative computational models play an increasingly important role in modern biology. Such models typically involve many free parameters, and assigning their values is often a substantial obstacle to model development. Directly measuring in vivo biochemical parameters is difficult, and collectively fitting them to other experimental data often yields large parameter uncertainties. Nevertheless, in earlier work we showed in a growth-factor-signaling model that collective fitting could yield well-constrained predictions, even when it left individual parameters very poorly constrained. We also showed that the model had a “sloppy” spectrum of parameter sensitivities, with eigenvalues roughly evenly distributed over many decades. Here we use a collection of models from the literature to test whether such sloppy spectra are common in systems biology. Strikingly, we find that every model we examine has a sloppy spectrum of sensitivities. We also test several consequences of this sloppiness for building predictive models. In particular, sloppiness suggests that collective fits to even large amounts of ideal time-series data will often leave many parameters poorly constrained. Tests over our model collection are consistent with this suggestion. This difficulty with collective fits may seem to argue for direct parameter measurements, but sloppiness also implies that such measurements must be formidably precise and complete to usefully constrain many model predictions. We confirm this implication in our growth-factor-signaling model. Our results suggest that sloppy sensitivity spectra are universal in systems biology models. The prevalence of sloppiness highlights the power of collective fits and suggests that modelers should focus on predictions rather than on parameters.

1,226 citations


Journal ArticleDOI
TL;DR: It is evident from this analysis that CRISPR/cas loci are larger, more complex, and more heterogeneous than previously appreciated.
Abstract: Clustered regularly interspaced short palindromic repeats (CRISPRs) are a family of DNA direct repeats found in many prokaryotic genomes. Repeats of 21–37 bp typically show weak dyad symmetry and are separated by regularly sized, nonrepetitive spacer sequences. Four CRISPR-associated (Cas) protein families, designated Cas1 to Cas4, are strictly associated with CRISPR elements and always occur near a repeat cluster. Some spacers originate from mobile genetic elements and are thought to confer “immunity” against the elements that harbor these sequences. In the present study, we have systematically investigated uncharacterized proteins encoded in the vicinity of these CRISPRs and found many additional protein families that are strictly associated with CRISPR loci across multiple prokaryotic species. Multiple sequence alignments and hidden Markov models have been built for 45 Cas protein families. These models identify family members with high sensitivity and selectivity and classify key regulators of development, DevR and DevS, in Myxococcus xanthus as Cas proteins. These identifications show that CRISPR/cas gene regions can be quite large, with up to 20 different, tandem-arranged cas genes next to a repeat cluster or filling the region between two repeat clusters. Distinctive subsets of the collection of Cas proteins recur in phylogenetically distant species and correlate with characteristic repeat periodicity. The analyses presented here support initial proposals of mobility of these units, along with the likelihood that loci of different subtypes interact with one another as well as with host cell defensive, replicative, and regulatory systems. It is evident from this analysis that CRISPR/cas loci are larger, more complex, and more heterogeneous than previously appreciated.

1,039 citations


Journal ArticleDOI
Haiyuan Yu1, Philip M. Kim1, Emmett Sprecher1, Valery Trifonov1, Mark Gerstein1 
TL;DR: In this article, the authors define bottlenecks as proteins with a high betweenness centrality (i.e., network nodes that have many "shortest paths" going through them, analogous to major bridges and tunnels on a highway map).
Abstract: It has been a long-standing goal in systems biology to find relations between the topological properties and functional features of protein networks. However, most of the focus in network studies has been on highly connected proteins (“hubs”). As a complementary notion, it is possible to define bottlenecks as proteins with a high betweenness centrality (i.e., network nodes that have many “shortest paths” going through them, analogous to major bridges and tunnels on a highway map). Bottlenecks are, in fact, key connector proteins with surprising functional and dynamic properties. In particular, they are more likely to be essential proteins. In fact, in regulatory and other directed networks, betweenness (i.e., “bottleneck-ness”) is a much more significant indicator of essentiality than degree (i.e., “hub-ness”). Furthermore, bottlenecks correspond to the dynamic components of the interaction network—they are significantly less well coexpressed with their neighbors than nonbottlenecks, implying that expression dynamics is wired into the network topology.

924 citations
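
The bottleneck definition in the abstract above (high betweenness centrality, i.e., many shortest paths passing through a node) can be illustrated with a minimal sketch. It uses Brandes' algorithm on an unweighted, undirected toy graph. The node labels and the graph are hypothetical, and this is not the authors' code or data.

```python
from collections import deque

def betweenness(adj):
    """Betweenness centrality via Brandes' algorithm.

    adj maps each node to an iterable of neighbours (undirected, unweighted).
    Returns the number of shortest paths through each node, with each
    unordered pair of endpoints counted once.
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s: count shortest paths (sigma)
        # and remember each node's predecessors on those paths.
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, starting from the nodes farthest from s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every unordered pair was visited from both of its endpoints.
    return {v: c / 2.0 for v, c in score.items()}

# Hypothetical toy network: A-B-C chain, with D and E hanging off C.
toy = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D", "E"],
    "D": ["C"],
    "E": ["C"],
}
bc = betweenness(toy)
bottleneck = max(bc, key=bc.get)
```

In this toy graph node C has degree 3 and carries the most shortest paths, while B has degree 2 yet still carries three of them; real networks can separate "hub-ness" and "bottleneck-ness" much more sharply.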


Journal ArticleDOI
TL;DR: It is found that optimized component rearrangements could substantially reduce total wiring length in all tested neural networks, suggesting that neural systems are not exclusively optimized for minimal global wiring, but for a variety of factors including the minimization of processing steps.
Abstract: It has been suggested that neural systems across several scales of organization show optimal component placement, in which any spatial rearrangement of the components would lead to an increase of total wiring. Using extensive connectivity datasets for diverse neural networks combined with spatial coordinates for network nodes, we applied an optimization algorithm to the network layouts, in order to search for wire-saving component rearrangements. We found that optimized component rearrangements could substantially reduce total wiring length in all tested neural networks. Specifically, total wiring among 95 primate (Macaque) cortical areas could be decreased by 32%, and wiring of neuronal networks in the nematode Caenorhabditis elegans could be reduced by 48% on the global level, and by 49% for neurons within frontal ganglia. Wiring length reductions were possible due to the existence of long-distance projections in neural networks. We explored the role of these projections by comparing the original networks with minimally rewired networks of the same size, which possessed only the shortest possible connections. In the minimally rewired networks, the number of processing steps along the shortest paths between components was significantly increased compared to the original networks. Additional benchmark comparisons also indicated that neural networks are more similar to network layouts that minimize the length of processing paths, rather than wiring length. These findings suggest that neural systems are not exclusively optimized for minimal global wiring, but for a variety of factors including the minimization of processing steps. Citation: Kaiser M, Hilgetag CC (2006) Nonoptimal component placement, but short processing paths, due to long-distance projections in neural systems. PLoS Comput Biol

646 citations


Journal ArticleDOI
TL;DR: The results show that temporal codes may be a key to understanding the phenomenal processing speed achieved by the visual system and that STDP can lead to fast and selective responses.
Abstract: Spike timing dependent plasticity (STDP) is a learning rule that modifies synaptic strength as a function of the relative timing of pre- and postsynaptic spikes. When a neuron is repeatedly presented with similar inputs, STDP is known to have the effect of concentrating high synaptic weights on afferents that systematically fire early, while postsynaptic spike latencies decrease. Here we use this learning rule in an asynchronous feedforward spiking neural network that mimics the ventral visual pathway and show that when the network is presented with natural images, selectivity to intermediate-complexity visual features emerges. Those features, which correspond to prototypical patterns that are both salient and consistently present in the images, are highly informative and enable robust object recognition, as demonstrated on various classification tasks. Taken together, these results show that temporal codes may be a key to understanding the phenomenal processing speed achieved by the visual system and that STDP can lead to fast and selective responses.

550 citations
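
As a rough illustration of how an STDP rule changes a synaptic weight as a function of relative spike timing, here is a sketch of a common pair-based exponential window. The functional form, the amplitudes, the time constants, and the weight bounds are all assumptions chosen for illustration; the abstract does not give the rule actually used in the paper.

```python
import math

def stdp_dw(dt, a_plus=0.01, a_minus=0.012, tau_plus=20.0, tau_minus=20.0):
    """Weight change for one pre/post spike pair.

    dt = t_post - t_pre in milliseconds. All parameter values are
    illustrative assumptions, not values from the paper.
    """
    if dt > 0:
        # Presynaptic spike came first: potentiation, decaying with the gap.
        return a_plus * math.exp(-dt / tau_plus)
    if dt < 0:
        # Postsynaptic spike came first: depression, decaying with the gap.
        return -a_minus * math.exp(dt / tau_minus)
    return 0.0

def apply_stdp(weight, t_pre, t_post, w_min=0.0, w_max=1.0):
    """Update a weight for one spike pair and clip it to [w_min, w_max]."""
    return min(w_max, max(w_min, weight + stdp_dw(t_post - t_pre)))

# An afferent that fires 5 ms before the postsynaptic spike is strengthened;
# one that fires 5 ms after it is weakened.
w_early = apply_stdp(0.5, t_pre=10.0, t_post=15.0)
w_late = apply_stdp(0.5, t_pre=15.0, t_post=10.0)
```

Repeated over many presentations, a rule of this shape favors afferents that fire early relative to the postsynaptic spike, which is the effect the abstract describes.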


Journal ArticleDOI
TL;DR: The results of this study demonstrate that intrinsic structural disorder is a distinctive and common characteristic of eukaryotic hub proteins, and that disorder may serve as a determinant of protein interactivity.
Abstract: Recent proteome-wide screening approaches have provided a wealth of information about interacting proteins in various organisms. To test for a potential association between protein connectivity and the amount of predicted structural disorder, the disorder propensities of proteins with various numbers of interacting partners from four eukaryotic organisms (Caenorhabditis elegans, Saccharomyces cerevisiae, Drosophila melanogaster, and Homo sapiens) were investigated. The results of PONDR VL-XT disorder analysis show that for all four studied organisms, hub proteins, defined here as those that interact with ≥10 partners, are significantly more disordered than end proteins, defined here as those that interact with just one partner. The proportion of predicted disordered residues, the average disorder score, and the number of predicted disordered regions of various lengths were higher overall in hubs than in ends. A binary classification of hubs and ends into ordered and disordered subclasses using the consensus prediction method showed a significant enrichment of wholly disordered proteins and a significant depletion of wholly ordered proteins in hubs relative to ends in worm, fly, and human. The functional annotation of yeast hubs and ends using GO categories and the correlation of these annotations with disorder predictions demonstrate that proteins with regulation, transcription, and development annotations are enriched in disorder, whereas proteins with catalytic activity, transport, and membrane localization annotations are depleted in disorder. The results of this study demonstrate that intrinsic structural disorder is a distinctive and common characteristic of eukaryotic hub proteins, and that disorder may serve as a determinant of protein interactivity.

544 citations


Journal ArticleDOI
TL;DR: A general comparative genomics method based on phylogenetic stochastic context-free grammars for identifying functional RNAs encoded in the human genome is developed and used to survey an eight-way genome-wide alignment of the human, chimpanzee, mouse, rat, dog, chicken, zebra-fish, and puffer-fish genomes for deeply conserved functional RNAs.
Abstract: The discoveries of microRNAs and riboswitches, among others, have shown functional RNAs to be biologically more important and genomically more prevalent than previously anticipated. We have developed a general comparative genomics method based on phylogenetic stochastic context-free grammars for identifying functional RNAs encoded in the human genome and used it to survey an eight-way genome-wide alignment of the human, chimpanzee, mouse, rat, dog, chicken, zebra-fish, and puffer-fish genomes for deeply conserved functional RNAs. At a loose threshold for acceptance, this search resulted in a set of 48,479 candidate RNA structures. This screen finds a large number of known functional RNAs, including 195 miRNAs, 62 histone 3'UTR stem loops, and various types of known genetic recoding elements. Among the highest-scoring new predictions are 169 new miRNA candidates, as well as new candidate selenocysteine insertion sites, RNA editing hairpins, RNAs involved in transcript auto regulation, and many folds that form singletons or small functional RNA families of completely unknown function. While the rate of false positives in the overall set is difficult to estimate and is likely to be substantial, the results nevertheless provide evidence for many new human functional RNAs and present specific predictions to facilitate their further characterization.

540 citations


Journal ArticleDOI
TL;DR: These findings provide the most extensive microRNA target predictions in Drosophila to date, suggest specific functional roles for most microRNAs, indicate the existence of coordinate gene regulation executed by clustered microRNAs, and shed light on the evolution of microRNA function across large evolutionary distances.
Abstract: microRNAs are small noncoding genes that regulate the protein production of genes by binding to partially complementary sites in the mRNAs of targeted genes. Here, using our algorithm PicTar, we exploit cross-species comparisons to predict, on average, 54 targeted genes per microRNA above noise in Drosophila melanogaster. Analysis of the functional annotation of target genes furthermore suggests specific biological functions for many microRNAs. We also predict combinatorial targets for clustered microRNAs and find that some clustered microRNAs are likely to coordinately regulate target genes. Furthermore, we compare microRNA regulation between insects and vertebrates. We find that the widespread extent of gene regulation by microRNAs is comparable between flies and mammals but that certain microRNAs may function in clade-specific modes of gene regulation. One of these microRNAs (miR-210) is predicted to contribute to the regulation of fly oogenesis. We also list specific regulatory relationships that appear to be conserved between flies and mammals. Our findings provide the most extensive microRNA target predictions in Drosophila to date, suggest specific functional roles for most microRNAs, indicate the existence of coordinate gene regulation executed by clustered microRNAs, and shed light on the evolution of microRNA function across large evolutionary distances. All predictions are freely accessible at our searchable Web site http://pictar.bio.nyu.edu.

476 citations


Journal ArticleDOI
TL;DR: A structure-based clustering approach is presented that is capable of extracting putative RNA classes from genome-wide surveys for structured RNAs; it suggests several novel classes of ncRNAs for which to date no representative has been experimentally characterized.
Abstract: The RFAM database defines families of ncRNAs by means of sequence similarities that are sufficient to establish homology. In some cases, such as microRNAs and box H/ACA snoRNAs, functional commonalities define classes of RNAs that are characterized by structural similarities, and typically consist of multiple RNA families. Recent advances in high-throughput transcriptomics and comparative genomics have produced very large sets of putative noncoding RNAs and regulatory RNA signals. For many of them, evidence for stabilizing selection acting on their secondary structures has been derived, and at least approximate models of their structures have been computed. The overwhelming majority of these hypothetical RNAs cannot be assigned to established families or classes. We present here a structure-based clustering approach that is capable of extracting putative RNA classes from genome-wide surveys for structured RNAs. The LocARNA (local alignment of RNA) tool implements a novel variant of the Sankoff algorithm that is sufficiently fast to deal with several thousand candidate sequences. The method is also robust against false positive predictions, i.e., a contamination of the input data with unstructured or nonconserved sequences. We have successfully tested the LocARNA-based clustering approach on the sequences of the RFAM-seed alignments. Furthermore, we have applied it to a previously published set of 3,332 predicted structured elements in the Ciona intestinalis genome (Missal K, Rose D, Stadler PF (2005) Noncoding RNAs in Ciona intestinalis. Bioinformatics 21 (Supplement 2): i77–i78). In addition to recovering, e.g., tRNAs as a structure-based class, the method identifies several RNA families, including microRNA and snoRNA candidates, and suggests several novel classes of ncRNAs for which to date no representative has been experimentally characterized.

443 citations


Journal ArticleDOI
TL;DR: A statistical model is developed that makes the problem of estimating richness statistically accessible by evaluating the characteristics of samples drawn from simulated communities with parametric community distributions, and shows that generating sufficient sequence data to do so requires less sequencing effort than completely sequencing a bacterial genome.
Abstract: For more than a century, microbiologists have sought to determine the species richness of bacteria in soil, but the extreme complexity and unknown structure of soil microbial communities have obscured the answer. We developed a statistical model that makes the problem of estimating richness statistically accessible by evaluating the characteristics of samples drawn from simulated communities with parametric community distributions. We identified simulated communities with rank-abundance distributions that followed a truncated lognormal distribution whose samples resembled the structure of 16S rRNA gene sequence collections made using Alaskan and Minnesotan soils. The simulated communities constructed based on the distribution of 16S rRNA gene sequences sampled from the Alaskan and Minnesotan soils had a richness of 5,000 and 2,000 operational taxonomic units (OTUs), respectively, where an OTU represents a collection of sequences not more than 3% distant from each other. To sample each of these OTUs in the Alaskan 16S rRNA gene library at least twice, 480,000 sequences would be required; however, to estimate the richness of the simulated communities using nonparametric richness estimators would require only 18,000 sequences. Quantifying the richness of complex environments such as soil is an important step in building an ecological framework. We have shown that generating sufficient sequence data to do so requires less sequencing effort than completely sequencing a bacterial genome.
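
To make the notion of a nonparametric richness estimator concrete, the sketch below computes the bias-corrected Chao1 estimate from a list of OTU assignments. Chao1 is one widely used estimator of this kind; the abstract does not say which estimators the authors applied, and the sample labels here are hypothetical.

```python
from collections import Counter

def chao1(otu_labels):
    """Bias-corrected Chao1 richness estimate.

    otu_labels: one OTU label per sequenced read.
    Estimate = S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)), where S_obs is the
    number of observed OTUs, F1 the number of OTUs seen exactly once
    (singletons) and F2 the number seen exactly twice (doubletons).
    """
    counts = Counter(otu_labels)
    s_obs = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    f2 = sum(1 for c in counts.values() if c == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

# Hypothetical sample: 6 observed OTUs, 3 singletons, 2 doubletons.
sample = ["a"] * 5 + ["b"] * 2 + ["c"] * 2 + ["d", "e", "f"]
estimate = chao1(sample)
```

The estimate exceeds the observed count whenever singletons are present, because rare OTUs in the sample hint at further OTUs that were not sampled at all.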

Journal ArticleDOI
TL;DR: This work has developed a combined evidence-model TE annotation pipeline, analogous to systems used for gene annotation, by integrating results from multiple homology-based and de novo TE identification methods, and annotated “TE models” in Drosophila melanogaster Release 4 genomic sequences.
Abstract: Transposable elements (TEs) are mobile, repetitive sequences that make up significant fractions of metazoan genomes. Despite their near ubiquity and importance in genome and chromosome biology, most efforts to annotate TEs in genome sequences rely on the results of a single computational program, RepeatMasker. In contrast, recent advances in gene annotation indicate that high-quality gene models can be produced from combining multiple independent sources of computational evidence. To elevate the quality of TE annotations to a level comparable to that of gene models, we have developed a combined evidence-model TE annotation pipeline, analogous to systems used for gene annotation, by integrating results from multiple homology-based and de novo TE identification methods. As proof of principle, we have annotated "TE models" in Drosophila melanogaster Release 4 genomic sequences using the combined computational evidence derived from RepeatMasker, BLASTER, TBLASTX, all-by-all BLASTN, RECON, TE-HMM and the previous Release 3.1 annotation. Our system is designed for use with the Apollo genome annotation tool, allowing automatic results to be curated manually to produce reliable annotations. The euchromatic TE fraction of D. melanogaster is now estimated at 5.3% (cf. 3.86% in Release 3.1), and we found a substantially higher number of TEs (n = 6,013) than previously identified (n = 1,572). Most of the new TEs derive from small fragments of a few hundred nucleotides long and highly abundant families not previously annotated (e.g., INE-1). We also estimated that 518 TE copies (8.6%) are inserted into at least one other TE, forming a nest of elements. The pipeline allows rapid and thorough annotation of even the most complex TE models, including highly deleted and/or nested elements such as those often found in heterochromatic sequences. Our pipeline can be easily adapted to other genome sequences, such as those of the D. melanogaster heterochromatin or other species in the genus Drosophila.

Journal ArticleDOI
TL;DR: The aim of this review is to introduce the bioinformatics community to this emerging field by surveying existing techniques and promising new approaches for several of the most interesting of these computational problems.
Abstract: The application of whole-genome shotgun sequencing to microbial communities represents a major development in metagenomics, the study of uncultured microbes via the tools of modern genomic analysis. In the past year, whole-genome shotgun sequencing projects of prokaryotic communities from an acid mine biofilm, the Sargasso Sea, Minnesota farm soil, three deep-sea whale falls, and deep-sea sediments have been reported, adding to previously published work on viral communities from marine and fecal samples. The interpretation of this new kind of data poses a wide variety of exciting and difficult bioinformatics problems. The aim of this review is to introduce the bioinformatics community to this emerging field by surveying existing techniques and promising new approaches for several of the most interesting of these computational problems.

Journal ArticleDOI
TL;DR: A methodology relying on a logical formalism is applied to the functional analysis of the complex signaling network governing the activation of T cells via the T cell receptor, the CD4/CD8 co-receptors, and the accessory signaling receptor CD28, and it proves to be a promising in silico tool.
Abstract: Cellular decisions are determined by complex molecular interaction networks. Large-scale signaling networks are currently being reconstructed, but the kinetic parameters and quantitative data that would allow for dynamic modeling are still scarce. Therefore, computational studies based upon the structure of these networks are of great interest. Here, a methodology relying on a logical formalism is applied to the functional analysis of the complex signaling network governing the activation of T cells via the T cell receptor, the CD4/CD8 co-receptors, and the accessory signaling receptor CD28. Our large-scale Boolean model, which comprises 94 nodes and 123 interactions and is based upon well-established qualitative knowledge from primary T cells, reveals important structural features (e.g., feedback loops and network-wide dependencies) and recapitulates the global behavior of this network for an array of published data on T cell activation in wild-type and knock-out conditions. More importantly, the model predicted unexpected signaling events after antibody-mediated perturbation of CD28 and after genetic knockout of the kinase Fyn that were subsequently experimentally validated. Finally, we show that the logical model reveals key elements and potential failure modes in network functioning and provides candidates for missing links. In summary, our large-scale logical model for T cell activation proved to be a promising in silico tool, and it inspires immunologists to ask new questions. We think that it holds valuable potential in foreseeing the effects of drugs and network modifications.
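
A Boolean (logical) network model of the kind described above can be sketched in a few lines. The four-node cascade, its rules, and the synchronous update scheme below are hypothetical simplifications for illustration only; they are not taken from the 94-node T cell model.

```python
def step(state, rules, knockouts=frozenset()):
    """One synchronous update: every node is recomputed from the old state.

    Knocked-out nodes are held at False regardless of their rule.
    """
    return {
        node: False if node in knockouts else bool(rule(state))
        for node, rule in rules.items()
    }

def steady_state(initial, rules, knockouts=frozenset(), max_steps=50):
    """Iterate until the state stops changing; None if no fixed point is hit."""
    state = {n: (False if n in knockouts else v) for n, v in initial.items()}
    for _ in range(max_steps):
        nxt = step(state, rules, knockouts)
        if nxt == state:
            return state
        state = nxt
    return None  # e.g. the network oscillates instead of settling

# Hypothetical linear cascade: ligand -> receptor -> kinase -> tf.
rules = {
    "ligand": lambda s: s["ligand"],      # external input, keeps its value
    "receptor": lambda s: s["ligand"],
    "kinase": lambda s: s["receptor"],
    "tf": lambda s: s["kinase"],
}
start = {"ligand": True, "receptor": False, "kinase": False, "tf": False}

wild_type = steady_state(start, rules)
kinase_ko = steady_state(start, rules, knockouts=frozenset({"kinase"}))
```

Comparing `wild_type` with `kinase_ko` mirrors, in miniature, how a logical model can be checked against wild-type and knock-out observations: with the kinase removed, the receptor still switches on but the downstream node does not.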

Journal ArticleDOI
TL;DR: This work proposes the first hierarchical classification of whole protein complexes of known 3-D structure, based on representing their fundamental structural features as a graph, and provides the first overview of all the complexes in the Protein Data Bank and allows nonredundant sets to be derived at different levels of detail.
Abstract: Most of the proteins in a cell assemble into complexes to carry out their function. It is therefore crucial to understand the physicochemical properties as well as the evolution of interactions between proteins. The Protein Data Bank represents an important source of information for such studies, because more than half of the structures are homo- or heteromeric protein complexes. Here we propose the first hierarchical classification of whole protein complexes of known 3-D structure, based on representing their fundamental structural features as a graph. This classification provides the first overview of all the complexes in the Protein Data Bank and allows nonredundant sets to be derived at different levels of detail. This reveals that between one-half and two-thirds of known structures are multimeric, depending on the level of redundancy accepted. We also analyse the structures in terms of the topological arrangement of their subunits and find that they form a small number of arrangements compared with all theoretically possible ones. This is because most complexes contain four subunits or less, and the large majority are homomeric. In addition, there is a strong tendency for symmetry in complexes, even for heteromeric complexes. Finally, through comparison of Biological Units in the Protein Data Bank with the Protein Quaternary Structure database, we identified many possible errors in quaternary structure assignments. Our classification, available as a database and Web server at http://www.3Dcomplex.org, will be a starting point for future work aimed at understanding the structure and evolution of protein complexes.

Journal ArticleDOI
TL;DR: It is shown that most known microRNA genes in these four species have the same type of promoters as protein-coding genes have, and a novel promoter prediction method is developed, called common query voting (CoVote), which is more effective than available promoter prediction methods.
Abstract: MicroRNAs are short, noncoding RNAs that play important roles in post-transcriptional gene regulation. Although many functions of microRNAs in plants and animals have been revealed in recent years, the transcriptional mechanism of microRNA genes is not well-understood. To elucidate the transcriptional regulation of microRNA genes, we study and characterize, in a genome scale, the promoters of intergenic microRNA genes in Caenorhabditis elegans, Homo sapiens, Arabidopsis thaliana, and Oryza sativa. We show that most known microRNA genes in these four species have the same type of promoters as protein-coding genes have. To further characterize the promoters of microRNA genes, we developed a novel promoter prediction method, called common query voting (CoVote), which is more effective than available promoter prediction methods. Using this new method, we identify putative core promoters of most known microRNA genes in the four model species. Moreover, we characterize the promoters of microRNA genes in these four species. We discover many significant, characteristic sequence motifs in these core promoters, several of which match or resemble the known cis-acting elements for transcription initiation. Among these motifs, some are conserved across different species while some are specific to microRNA genes of individual species.

Journal ArticleDOI
TL;DR: A new motif sampling algorithm, PhyloGibbs, is presented; it runs on arbitrary collections of multiple local sequence alignments of orthologous sequences and performs significantly better than four other motif-finding algorithms, including algorithms that also take phylogeny into account.
Abstract: A central problem in the bioinformatics of gene regulation is to find the binding sites for regulatory proteins. One of the most promising approaches toward identifying these short and fuzzy sequence patterns is the comparative analysis of orthologous intergenic regions of related species. This analysis is complicated by various factors. First, one needs to take the phylogenetic relationship between the species into account in order to distinguish conservation that is due to the occurrence of functional sites from spurious conservation that is due to evolutionary proximity. Second, one has to deal with the complexities of multiple alignments of orthologous intergenic regions, and one has to consider the possibility that functional sites may occur outside of conserved segments. Here we present a new motif sampling algorithm, PhyloGibbs, that runs on arbitrary collections of multiple local sequence alignments of orthologous sequences. The algorithm searches over all ways in which an arbitrary number of binding sites for an arbitrary number of transcription factors (TFs) can be assigned to the multiple sequence alignments. These binding site configurations are scored by a Bayesian probabilistic model that treats aligned sequences by a model for the evolution of binding sites and “background” intergenic DNA. This model takes the phylogenetic relationship between the species in the alignment explicitly into account. The algorithm uses simulated annealing and Monte Carlo Markov-chain sampling to rigorously assign posterior probabilities to all the binding sites that it reports. In tests on synthetic data and real data from five Saccharomyces species our algorithm performs significantly better than four other motif-finding algorithms, including algorithms that also take phylogeny into account. Our results also show that, in contrast to the other algorithms, PhyloGibbs can make realistic estimates of the reliability of its predictions. 
Our tests suggest that, running on the five-species multiple alignment of a single gene's upstream region, PhyloGibbs on average recovers over 50% of all binding sites in S. cerevisiae at a specificity of about 50%, and 33% of all binding sites at a specificity of about 85%. We also tested PhyloGibbs on collections of multiple alignments of intergenic regions that were recently annotated, based on ChIP-on-chip data, to contain binding sites for the same TF. We compared PhyloGibbs's results with the previous analysis of these data using six other motif-finding algorithms. For 16 of 21 TFs for which all other motif-finding methods failed to find a significant motif, PhyloGibbs did recover a motif that matches the literature consensus. In 11 cases where there was disagreement in the results we compiled lists of known target genes from the literature, and found that running PhyloGibbs on their regulatory regions yielded a binding motif matching the literature consensus in all but one of the cases. Interestingly, these literature gene lists had little overlap with the targets annotated based on the ChIP-on-chip data. The PhyloGibbs code can be downloaded from http://www.biozentrum.unibas.ch/~nimwegen/cgi-bin/phylogibbs.cgi or http://www.imsc.res.in/~rsidd/phylogibbs. The full set of predicted sites from our tests on yeast are available at http://www.swissregulon.unibas.ch.

Journal ArticleDOI
TL;DR: In this article, an acceleration method called query-dependent banding (QDB) is proposed, which uses the probabilistic query CM to precalculate regions of the dynamic programming lattice that have negligible probability, independently of the target database.
Abstract: When searching sequence databases for RNAs, it is desirable to score both primary sequence and RNA secondary structure similarity. Covariance models (CMs) are probabilistic models well-suited for RNA similarity search applications. However, the computational complexity of CM dynamic programming alignment algorithms has limited their practical application. Here we describe an acceleration method called query-dependent banding (QDB), which uses the probabilistic query CM to precalculate regions of the dynamic programming lattice that have negligible probability, independently of the target database. We have implemented QDB in the freely available Infernal software package. QDB reduces the average case time complexity of CM alignment from LN^2.4 to LN^1.3 for a query RNA of N residues and a target database of L residues, resulting in a 4-fold speedup for typical RNA queries. Combined with other improvements to Infernal, including informative mixture Dirichlet priors on model parameters, benchmarks also show increased sensitivity and specificity resulting from improved parameterization.

Journal ArticleDOI
TL;DR: It is concluded that hub proteins are more important for cellular growth rate and under tight regulation but are not slow evolving, while local connectivity does not correlate with the rate of protein evolution even in reliable datasets.
Abstract: It has been claimed that proteins with more interaction partners (hubs) are both physiologically more important (i.e., less dispensable) and, owing to an assumed high density of binding sites, slow evolving. Not all analyses, however, support these results, probably because of biased and less-than-reliable global protein interaction data. Here we provide the first examination of these issues using a comprehensive literature-curated dataset of well-substantiated protein interactions in Saccharomyces cerevisiae. Whereas use of less reliable yeast two-hybrid data alone can reject the possibility that local connectivity correlates with measures of dispensability, in higher quality datasets a relatively robust correlation is observed. In contrast, local connectivity does not correlate with the rate of protein evolution even in reliable datasets. This perhaps surprising lack of correlation with evolutionary rate appears in part to arise from the fact that hub proteins do not have a higher density of residues associated with binding. However, hub proteins do have at least one other set of unusual features, namely rapid turnover and regulation, as manifest in high mRNA decay rates and a large number of phosphorylation sites. This, we suggest, is an adaptation to minimize unwanted activation of pathways that might be mediated by adventitious binding to hubs, were they to actively persist longer than required at any given time point. We conclude that hub proteins are more important for cellular growth rate and under tight regulation but are not slow evolving.

Journal ArticleDOI
TL;DR: This investigation of knotted structures in the Protein Data Bank reveals the most complicated knot discovered to date and suggests that the occurrence of this knot in a human ubiquitin hydrolase might be related to the role of the enzyme in protein degradation.
Abstract: Our investigation of knotted structures in the Protein Data Bank reveals the most complicated knot discovered to date. We suggest that the occurrence of this knot in a human ubiquitin hydrolase might be related to the role of the enzyme in protein degradation. While knots are usually preserved among homologues, we also identify an exception in a transcarbamylase. This allows us to exemplify the function of knots in proteins and to suggest how they may have been created.

Journal ArticleDOI
TL;DR: A large set of 48,828 quantitative peptide-binding affinity measurements relating to 48 different mouse, human, macaque, and chimpanzee MHC class I alleles is made public to provide a transparent prediction evaluation allowing bioinformaticians to identify promising features of prediction methods and providing guidance to immunologists regarding the reliability of prediction tools.
Abstract: Recognition of peptides bound to major histocompatibility complex (MHC) class I molecules by T lymphocytes is an essential part of immune surveillance. Each MHC allele has a characteristic peptide binding preference, which can be captured in prediction algorithms, allowing for the rapid scan of entire pathogen proteomes for peptides likely to bind MHC. Here we make public a large set of 48,828 quantitative peptide-binding affinity measurements relating to 48 different mouse, human, macaque, and chimpanzee MHC class I alleles. We use this data to establish a set of benchmark predictions with one neural network method and two matrix-based prediction methods extensively utilized in our groups. In general, the neural network outperforms the matrix-based predictions mainly due to its ability to generalize even on a small amount of data. We also retrieved predictions from tools publicly available on the internet. While differences in the data used to generate these predictions hamper direct comparisons, we do conclude that tools based on combinatorial peptide libraries perform remarkably well. The transparent prediction evaluation on this dataset provides tool developers with a benchmark for comparison of newly developed prediction methods. In addition, to generate and evaluate our own prediction methods, we have established an easily extensible web-based prediction framework that allows automated side-by-side comparisons of prediction methods implemented by experts. This is an advance over the current practice of tool developers having to generate reference predictions themselves, which can lead to underestimating the performance of prediction methods they are not as familiar with as their own. The overall goal of this effort is to provide a transparent prediction evaluation allowing bioinformaticians to identify promising features of prediction methods and providing guidance to immunologists regarding the reliability of prediction tools.

Journal ArticleDOI
TL;DR: The results propose that the functioning of certain classes of membrane proteins is regulated by changes in the lateral pressure profile, which can be altered by a change in lipid content.
Abstract: The paradigm of biological membranes has recently gone through a major update. Instead of being fluid and homogeneous, recent studies suggest that membranes are characterized by transient domains with varying fluidity. In particular, a number of experimental studies have revealed the existence of highly ordered lateral domains rich in sphingomyelin and cholesterol (CHOL). These domains, called functional lipid rafts, have been suggested to take part in a variety of dynamic cellular processes such as membrane trafficking, signal transduction, and regulation of the activity of membrane proteins. However, despite the proposed importance of these domains, their properties, and even the precise nature of the lipid phases, have remained open issues mainly because the associated short time and length scales have posed a major challenge to experiments. In this work, we employ extensive atom-scale simulations to elucidate the properties of ternary raft mixtures with CHOL, palmitoylsphingomyelin (PSM), and palmitoyloleoylphosphatidylcholine. We simulate two bilayers of 1,024 lipids for 100 ns in the liquid-ordered phase and one system of the same size in the liquid-disordered phase. The studies provide evidence that the presence of PSM and CHOL in raft-like membranes leads to strongly packed and rigid bilayers. We also find that the simulated raft bilayers are characterized by nanoscale lateral heterogeneity, though the slow lateral diffusion renders the interpretation of the observed lateral heterogeneity more difficult. The findings reveal aspects of the role of favored (specific) lipid–lipid interactions within rafts and clarify the prominent role of CHOL in altering the properties of the membrane locally in its neighborhood. Also, we show that the presence of PSM and CHOL in rafts leads to intriguing lateral pressure profiles that are distinctly different from corresponding profiles in nonraft-like membranes. 
The results propose that the functioning of certain classes of membrane proteins is regulated by changes in the lateral pressure profile, which can be altered by a change in lipid content.

Journal ArticleDOI
TL;DR: This paper introduces delays into the stochastic simulation algorithm, thus mimicking delays associated with transcription and translation, and shows that this process may well explain more faithfully than continuous deterministic models the observed sustained oscillations in expression levels of hes1 mRNA and Hes1 protein.
Abstract: Discrete stochastic simulations are a powerful tool for understanding the dynamics of chemical kinetics when there are small-to-moderate numbers of certain molecular species. In this paper we introduce delays into the stochastic simulation algorithm, thus mimicking delays associated with transcription and translation. We then show that this process may well explain more faithfully than continuous deterministic models the observed sustained oscillations in expression levels of hes1 mRNA and Hes1 protein.
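
The idea of adding delays to the stochastic simulation algorithm can be illustrated with a minimal sketch. This is not the authors' implementation or their Hes1 model; it is a toy single-species system, with hypothetical names (`delay_ssa`, `k_init`, `k_deg`, `tau`), in which production is initiated at a constant rate but the product appears only after a fixed delay, while degradation is immediate. It assumes a rejection-style scheme: if a scheduled delayed completion falls before the next sampled reaction time, the state is advanced to that completion and the waiting time is redrawn.

```python
import heapq
import random


def delay_ssa(k_init, k_deg, tau, t_end, seed=0):
    """Toy delayed stochastic simulation (illustrative assumptions only).

    Reactions:
      initiation:  rate k_init, product molecule appears after delay tau
      degradation: rate k_deg * x, takes effect immediately
    Returns a list of (time, copy_number) pairs, one per event.
    """
    rng = random.Random(seed)
    t, x = 0.0, 0
    pending = []  # min-heap of completion times for delayed productions
    trajectory = [(t, x)]
    while True:
        a_init = k_init
        a_deg = k_deg * x
        a_total = a_init + a_deg
        # Waiting time to the next reaction firing under current propensities.
        dt = rng.expovariate(a_total) if a_total > 0 else float("inf")
        t_next = t + dt
        if pending and pending[0] <= t_next:
            # A delayed production completes first: apply it and redraw.
            # Discarding the sampled time is valid because exponential
            # waiting times are memoryless.
            t_done = heapq.heappop(pending)
            if t_done > t_end:
                break
            t = t_done
            x += 1
        else:
            if t_next > t_end:
                break
            t = t_next
            if rng.random() * a_total < a_init:
                # Initiation: schedule the product for time t + tau.
                heapq.heappush(pending, t + tau)
            else:
                # Degradation: remove one molecule now.
                x -= 1
        trajectory.append((t, x))
    return trajectory
```

Calling `delay_ssa(10.0, 1.0, 2.0, 50.0)` returns an event-by-event trajectory in which no molecule can appear before time `tau`. A model of an oscillator such as the one discussed in the abstract would additionally need state-dependent (for example, repressed) initiation rates, which this sketch omits.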

Journal ArticleDOI
TL;DR: Based on the analysis of databases of quantitative architectonic and connection data for primate prefrontal cortices, support is offered for the hypothesis that tension exerted by corticocortical connections is a significant factor in shaping the cerebral cortical landscape.
Abstract: The convoluted cortex of primates is instantly recognizable in its principal morphologic features, yet puzzling in its complex finer structure. Various hypotheses have been proposed about the mechanisms of its formation. Based on the analysis of databases of quantitative architectonic and connection data for primate prefrontal cortices, we offer support for the hypothesis that tension exerted by corticocortical connections is a significant factor in shaping the cerebral cortical landscape. Moreover, forces generated by cortical folding influence laminar morphology, and appear to have a previously unsuspected impact on cellular migration during cortical development. The evidence for a significant role of mechanical factors in cortical morphology opens the possibility of constructing computational models of cortical development based on physical principles. Such models are particularly relevant for understanding the relationship of cortical morphology to the connectivity of normal brains, and structurally altered brains in diseases of developmental origin, such as schizophrenia and autism.

Journal ArticleDOI
TL;DR: A bistable switch is identified in the OCT4–SOX2–NANOG network, which arises due to several positive feedback loops, and is switched on/off by input environmental signals, and can be manipulated to be self-renewing without the requirement of input signals.
Abstract: Recent ChIP experiments of human and mouse embryonic stem cells have elucidated the architecture of the transcriptional regulatory circuitry responsible for cell determination, which involves the transcription factors OCT4, SOX2, and NANOG. In addition to regulating each other through feedback loops, these genes also regulate downstream target genes involved in the maintenance and differentiation of embryonic stem cells. A search for the OCT4–SOX2–NANOG network motif in other species reveals that it is unique to mammals. With a kinetic modeling approach, we ascribe function to the observed OCT4–SOX2–NANOG network by making plausible assumptions about the interactions between the transcription factors at the gene promoter binding sites and RNA polymerase (RNAP), at each of the three genes as well as at the target genes. We identify a bistable switch in the network, which arises due to several positive feedback loops, and is switched on/off by input environmental signals. The switch stabilizes the expression levels of the three genes, and through their regulatory roles on the downstream target genes, leads to a binary decision: when OCT4, SOX2, and NANOG are expressed and the switch is on, the self-renewal genes are on and the differentiation genes are off. The opposite holds when the switch is off. The model is extremely robust to parameter changes. In addition to providing a self-consistent picture of the transcriptional circuit, the model generates several predictions. Increasing the binding strength of NANOG to OCT4 and SOX2, or increasing its basal transcriptional rate, leads to an irreversible bistable switch: the switch remains on even when the activating signal is removed. Hence, the stem cell can be manipulated to be self-renewing without the requirement of input signals. We also suggest tests that could discriminate between a variety of feedforward regulation architectures of the target genes by OCT4, SOX2, and NANOG.
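
How positive feedback can produce a bistable switch can be shown with a generic one-variable toy model. This is not the OCT4–SOX2–NANOG model from the paper: the equation dx/dt = signal + v·x²/(K² + x²) − d·x, the function name `simulate`, and all parameter values are assumptions chosen only so that the system has two stable states (x = 0 and x = 2) separated by an unstable threshold (x = 0.5) when the signal is zero.

```python
def simulate(x0, signal=0.0, v=2.5, K=1.0, d=1.0, dt=0.01, steps=5000):
    """Integrate a toy self-activating gene with forward Euler.

    dx/dt = signal + v * x^2 / (K^2 + x^2) - d * x
    All parameters are illustrative assumptions, not fitted values.
    Returns the expression level x after steps * dt time units.
    """
    x = x0
    for _ in range(steps):
        production = signal + v * x**2 / (K**2 + x**2)  # basal input + positive feedback
        x += dt * (production - d * x)                   # minus first-order decay
    return x


if __name__ == "__main__":
    # Starting just below or just above the threshold leads to different steady states.
    print(simulate(0.4))  # relaxes toward the "off" state near 0
    print(simulate(0.6))  # relaxes toward the "on" state near 2
```

With these assumed parameters, the final state depends on which side of the threshold the system starts, which is the defining feature of bistability.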

Journal ArticleDOI
TL;DR: Using comparative genomics approaches, the authors predict DNA-binding motifs for the transcription factors that regulate dissimilatory nitrogen oxides metabolism and describe the corresponding regulons in available bacterial genomes, demonstrating considerable interconnection between various nitrogen-oxides-responsive regulatory systems for the denitrification and NO detoxification genes and evolutionary plasticity of this transcriptional network.
Abstract: Bacterial response to nitric oxide (NO) is of major importance since NO is an obligatory intermediate of the nitrogen cycle. Transcriptional regulation of the dissimilatory nitric oxides metabolism in bacteria is diverse and involves FNR-like transcription factors HcpR, DNR, and NnrR; two-component systems NarXL and NarQP; NO-responsive activator NorR; and nitrite-sensitive repressor NsrR. Using comparative genomics approaches, we predict DNA-binding motifs for these transcriptional factors and describe corresponding regulons in available bacterial genomes. Within the FNR family of regulators, we observed a correlation of two specificity-determining amino acids and contacting bases in corresponding DNA recognition motif. Highly conserved regulon HcpR for the hybrid cluster protein and some other redox enzymes is present in diverse anaerobic bacteria, including Clostridia, Thermotogales, and delta-proteobacteria. NnrR and DNR control denitrification in alpha- and beta-proteobacteria, respectively. Sigma-54-dependent NorR regulon found in some gamma- and beta-proteobacteria contains various enzymes involved in the NO detoxification. Repressor NsrR, which was previously known to control only nitrite reductase operon in Nitrosomonas spp., appears to be the master regulator of the nitric oxides' metabolism, not only in most gamma- and beta-proteobacteria (including well-studied species such as Escherichia coli), but also in Gram-positive Bacillus and Streptomyces species. Positional analysis and comparison of regulatory regions of NO detoxification genes allows us to propose the candidate NsrR-binding motif. The most conserved member of the predicted NsrR regulon is the NO-detoxifying flavohemoglobin Hmp. In enterobacteria, the regulon also includes two nitrite-responsive loci, nipAB (hcp-hcr) and nipC (dnrN), thus confirming the identity of the effector, i.e. nitrite. 
The proposed NsrR regulons in Neisseria and some other species are extended to include denitrification genes. As the result, we demonstrate considerable interconnection between various nitrogen-oxides-responsive regulatory systems for the denitrification and NO detoxification genes and evolutionary plasticity of this transcriptional network.

Journal ArticleDOI
TL;DR: It is proposed that most of the so-called “atypical kinases” are not intermittently derived from protein kinases, but rather diverged early in evolution to form a distinct phyletic group.
Abstract: The protein kinase family is large and important, but it is only one family in a larger superfamily of homologous kinases that phosphorylate a variety of substrates and play important roles in all three superkingdoms of life. We used a carefully constructed structural alignment of selected kinases as the basis for a study of the structural evolution of the protein kinase–like superfamily. The comparison of structures revealed a “universal core” domain consisting only of regions required for ATP binding and the phosphotransfer reaction. Remarkably, even within the universal core some kinase structures display notable changes, while still retaining essential activity. Hence, the protein kinase–like superfamily has undergone substantial structural and sequence revision over long evolutionary timescales. We constructed a phylogenetic tree for the superfamily using a novel approach that allowed for the combination of sequence and structure information into a unified quantitative analysis. When considered against the backdrop of species distribution and other metrics, our tree provides a compelling scenario for the development of the various kinase families from a shared common ancestor. We propose that most of the so-called “atypical kinases” are not intermittently derived from protein kinases, but rather diverged early in evolution to form a distinct phyletic group. Within the atypical kinases, the aminoglycoside and choline kinase families appear to share the closest relationship. These two families in turn appear to be the most closely related to the protein kinase family. In addition, our analysis suggests that the actin-fragmin kinase, an atypical protein kinase, is more closely related to the phosphoinositide-3 kinase family than to the protein kinase family. The two most divergent families, α-kinases and phosphatidylinositol phosphate kinases (PIPKs), appear to have distinct evolutionary histories. While the PIPKs probably have an evolutionary relationship with the rest of the kinase superfamily, the relationship appears to be very distant (and perhaps indirect). Conversely, the α-kinases appear to be an exception to the scenario of early divergence for the atypical kinases: they apparently arose relatively recently in eukaryotes. We present possible scenarios for the derivation of the α-kinases from an extant kinase fold.

Journal ArticleDOI
TL;DR: Despite differences in the age distribution of tandem arrays, the striking similarities between rice and Arabidopsis indicate similar mechanisms of TAG generation and maintenance, which reflect an evolutionary trend in which successful tandem duplication involves genes either at the end of biochemical pathways or in flexible steps in a pathway, for which fluctuation in copy number is unlikely to affect downstream genes.
Abstract: In Arabidopsis, tandemly arrayed genes (TAGs) comprise >10% of the genes in the genome. These duplicated genes represent a rich template for genetic innovation, but little is known of the evolutionary forces governing their generation and maintenance. Here we compare the organization and evolution of TAGs between Arabidopsis and rice, two plant genomes that diverged ~150 million years ago. TAGs from the two genomes are similar in a number of respects, including the proportion of genes that are tandemly arrayed, the number of genes within an array, the number of tandem arrays, and the dearth of TAGs relative to single copy genes in centromeric regions. Analysis of recombination rates along rice chromosomes confirms a positive correlation between the occurrence of TAGs and recombination rate, as found in Arabidopsis. TAGs are also biased functionally relative to duplicated, nontandemly arrayed genes. In both genomes, TAGs are enriched for genes that encode membrane proteins and function in “abiotic and biotic stress” but underrepresented for genes involved in transcription and DNA or RNA binding functions. We speculate that these observations reflect an evolutionary trend in which successful tandem duplication involves genes either at the end of biochemical pathways or in flexible steps in a pathway, for which fluctuation in copy number is unlikely to affect downstream genes. Despite differences in the age distribution of tandem arrays, the striking similarities between rice and Arabidopsis indicate similar mechanisms of TAG generation and maintenance.

Journal ArticleDOI
TL;DR: An exhaustive study of the relationship between amino acid composition of proteomes, nucleotide composition of DNA, and optimal growth temperature (OGT) of prokaryotes finds strong and independent correlation between OGT and the frequency with which pairs of A and G nucleotides appear as nearest neighbors in genome sequences.
Abstract: There have been considerable attempts in the past to relate a phenotypic trait, the habitat temperature of organisms, to their genotypes, most importantly the compositions of their genomes and proteomes. However, despite accumulation of anecdotal evidence, an exact and conclusive relationship between the former and the latter has been elusive. We present an exhaustive study of the relationship between amino acid composition of proteomes, nucleotide composition of DNA, and optimal growth temperature (OGT) of prokaryotes. Based on 204 complete proteomes of archaea and bacteria spanning the temperature range from −10 °C to 110 °C, we performed an exhaustive enumeration of all possible sets of amino acids and found a set of amino acids whose total fraction in a proteome is correlated, to a remarkable extent, with the OGT. The universal set is Ile, Val, Tyr, Trp, Arg, Glu, Leu (IVYWREL), and the correlation coefficient is as high as 0.93. We also found that the G + C content in 204 complete genomes does not exhibit a significant correlation with OGT (R = −0.10). On the other hand, the fraction of A + G in coding DNA is correlated with temperature, to a considerable extent, due to codon patterns of IVYWREL amino acids. Further, we found strong and independent correlation between OGT and the frequency with which pairs of A and G nucleotides appear as nearest neighbors in genome sequences. This adaptation is achieved via codon bias. These findings present a direct link between principles of protein structure and stability and evolutionary mechanisms of thermophilic adaptation. On the nucleotide level, the analysis provides an example of how nature utilizes codon bias for evolutionary adaptation to extreme conditions. Together these results provide a complete picture of how compositions of proteomes and genomes in prokaryotes adjust to the extreme conditions of the environment.
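
The quantity at the center of this abstract, the total fraction of IVYWREL residues in a proteome, is simple to compute. The sketch below is an illustration only: the function name `ivywrel_fraction` and the input format (an iterable of one-letter amino acid strings) are assumptions, and the abstract does not describe how the authors handled ambiguous residue codes or sequence parsing.

```python
# One-letter codes for Ile, Val, Tyr, Trp, Arg, Glu, Leu, as named in the abstract.
IVYWREL = frozenset("IVYWREL")


def ivywrel_fraction(proteome):
    """Return the fraction of residues that belong to the IVYWREL set.

    `proteome` is an iterable of protein sequences given as one-letter
    strings (a hypothetical input format). Non-letter characters such as
    '*' or '-' are ignored; letters are compared case-insensitively.
    """
    total = 0
    hits = 0
    for sequence in proteome:
        for residue in sequence.upper():
            if residue.isalpha():
                total += 1
                hits += residue in IVYWREL
    if total == 0:
        raise ValueError("proteome contains no residues")
    return hits / total
```

Given per-organism fractions computed this way and the corresponding optimal growth temperatures, the correlation reported in the abstract could be checked with any standard Pearson correlation routine.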

Journal ArticleDOI
TL;DR: The feedback mechanism described in this paper is likely to be a widespread principle on how cells achieve ultrasensitivity, bistability, and irreversibility in caspase activation.
Abstract: The intrinsic, or mitochondrial, pathway of caspase activation is essential for apoptosis induction by various stimuli including cytotoxic stress. It depends on the cellular context, whether cytochrome c released from mitochondria induces caspase activation gradually or in an all-or-none fashion, and whether caspase activation irreversibly commits cells to apoptosis. By analyzing a quantitative kinetic model, we show that inhibition of caspase-3 (Casp3) and Casp9 by inhibitors of apoptosis (IAPs) results in an implicit positive feedback, since cleaved Casp3 augments its own activation by sequestering IAPs away from Casp9. We demonstrate that this positive feedback brings about bistability (i.e., all-or-none behaviour), and that it cooperates with Casp3-mediated feedback cleavage of Casp9 to generate irreversibility in caspase activation. Our calculations also unravel how cell-specific protein expression brings about the observed qualitative differences in caspase activation (gradual versus all-or-none and reversible versus irreversible). Finally, known regulators of the pathway are shown to efficiently shift the apoptotic threshold stimulus, suggesting that the bistable caspase cascade computes multiple inputs into an all-or-none caspase output. As cellular inhibitory proteins (e.g., IAPs) frequently inhibit consecutive intermediates in cellular signaling cascades (e.g., Casp3 and Casp9), the feedback mechanism described in this paper is likely to be a widespread principle on how cells achieve ultrasensitivity, bistability, and irreversibility.