Showing papers in "Bioinformatics in 2006"


Journal ArticleDOI
TL;DR: RAxML-VI-HPC (randomized axelerated maximum likelihood for high performance computing) is a sequential and parallel program for inference of large phylogenies with maximum likelihood (ML) that has been used to compute ML trees on two of the largest alignments to date.
Abstract: Summary: RAxML-VI-HPC (randomized axelerated maximum likelihood for high performance computing) is a sequential and parallel program for inference of large phylogenies with maximum likelihood (ML). Low-level technical optimizations, a modification of the search algorithm, and the use of the GTR+CAT approximation as replacement for GTR+Γ yield a program that is between 2.7 and 52 times faster than the previous version of RAxML. A large-scale performance comparison with GARLI, PHYML, IQPNNI and MrBayes on real data containing 1000 up to 6722 taxa shows that RAxML requires at least 5.6 times less main memory and yields better trees in similar times than the best competing program (GARLI) on datasets up to 2500 taxa. On datasets ≥4000 taxa it also runs 2--3 times faster than GARLI. RAxML has been parallelized with MPI to conduct parallel multiple bootstraps and inferences on distinct starting trees. The program has been used to compute ML trees on two of the largest alignments to date containing 25 057 (1463 bp) and 2182 (51 089 bp) taxa, respectively. Availability: icwww.epfl.ch/~stamatak Contact: Alexandros.Stamatakis@epfl.ch Supplementary information: Supplementary data are available at Bioinformatics online.

14,847 citations


Journal ArticleDOI
TL;DR: Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database and cd-hit-est-2d compares two nucleotide datasets.
Abstract: Motivation: In 2001 and 2002, we published two papers (Bioinformatics, 17, 282--283, Bioinformatics, 18, 77--82) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the applications of the underlying algorithm are not limited to protein sequence clustering; here we present several new programs using the same algorithm, including cd-hit-2d, cd-hit-est and cd-hit-est-2d. Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database and cd-hit-est-2d compares two nucleotide datasets. All these programs can handle huge datasets with millions of sequences and can be hundreds of times faster than methods based on the popular sequence comparison and database search tools, such as BLAST. Availability: http://cd-hit.org Contact: [email protected]

8,306 citations


Journal ArticleDOI
TL;DR: The SWISS-MODEL workspace is a web-based integrated service dedicated to protein structure homology modelling that assists and guides the user in building protein homology models at different levels of complexity.
Abstract: Motivation: Homology models of proteins are of great interest for planning and analysing biological experiments when no experimental three-dimensional structures are available. Building homology models requires specialized programs and up-to-date sequence and structural databases. Integrating all required tools, programs and databases into a single web-based workspace facilitates access to homology modelling from a computer with web connection without the need of downloading and installing large program packages and databases. Results: SWISS-MODEL workspace is a web-based integrated service dedicated to protein structure homology modelling. It assists and guides the user in building protein homology models at different levels of complexity. A personal working environment is provided for each user where several modelling projects can be carried out in parallel. Protein sequence and structure databases necessary for modelling are accessible from the workspace and are updated in regular intervals. Tools for template selection, model building and structure quality evaluation can be invoked from within the workspace. Workflow and usage of the workspace are illustrated by modelling human Cyclin A1 and human Transmembrane Protease 3. Availability: The SWISS-MODEL workspace can be accessed freely at http://swissmodel.expasy.org/workspace/ Contact: Torsten.Schwede@unibas.ch Supplementary information: Supplementary data are available at Bioinformatics online.

7,107 citations


Journal ArticleDOI
TL;DR: COPASI is presented, a platform-independent and user-friendly biochemical simulator that offers several unique features, and numerical issues with these features are discussed; in particular, the criteria to switch between stochastic and deterministic simulation methods, hybrid deterministic-stochastic methods, and the importance of random number generator numerical resolution in stochastic simulation.
Abstract: Motivation: Simulation and modeling is becoming a standard approach to understand complex biochemical processes. Therefore, there is a big need for software tools that allow access to diverse simulation and modeling methods as well as support for the usage of these methods. Results: Here, we present COPASI, a platform-independent and user-friendly biochemical simulator that offers several unique features. We discuss numerical issues with these features; in particular, the criteria to switch between stochastic and deterministic simulation methods, hybrid deterministic--stochastic methods, and the importance of random number generator numerical resolution in stochastic simulation. Availability: The complete software is available in binary (executable) for MS Windows, OS X, Linux (Intel) and Sun Solaris (SPARC), as well as the full source code under an open source license from http://www.copasi.org. Contact: mendes@vbi.vt.edu

2,351 citations


Journal ArticleDOI
TL;DR: Pvclust is an add-on package for the statistical software R that assesses the uncertainty in hierarchical cluster analysis by bootstrap analysis of clustering, an approach that has been popular in phylogenetic analysis.
Abstract: Summary: Pvclust is an add-on package for the statistical software R to assess the uncertainty in hierarchical cluster analysis. Pvclust can be used easily for general statistical problems, such as DNA microarray analysis, to perform the bootstrap analysis of clustering, which has been popular in phylogenetic analysis. Pvclust calculates probability values (p-values) for each cluster using bootstrap resampling techniques. Two types of p-values are available: approximately unbiased (AU) p-value and bootstrap probability (BP) value. Multiscale bootstrap resampling is used for the calculation of the AU p-value, which is less biased than the BP value calculated by ordinary bootstrap resampling. In addition, the computation time can be enormously decreased with the parallel computing option. Availability: The program is freely distributed under GNU General Public License (GPL) and can directly be installed from CRAN (http://cran.r-project.org/), the official R package archive. The instruction and program source code are available at http://www.is.titech.ac.jp/~shimo/prog/pvclust Contact: ryota.suzuki@is.titech.ac.jp

2,155 citations


Journal ArticleDOI
TL;DR: Two novel algorithms that improve GO group scoring using the underlying GO graph topology are presented and it is shown that both methods eliminate local dependencies between GO terms and point to relevant areas in the GO graph that remain undetected with state-of-the-art algorithms for scoring functional terms.
Abstract: Motivation: The result of a typical microarray experiment is a long list of genes with corresponding expression measurements. This list is only the starting point for a meaningful biological interpretation. Modern methods identify relevant biological processes or functions from gene expression data by scoring the statistical significance of predefined functional gene groups, e.g. based on Gene Ontology (GO). We develop methods that increase the explanatory power of this approach by integrating knowledge about relationships between the GO terms into the calculation of the statistical significance. Results: We present two novel algorithms that improve GO group scoring using the underlying GO graph topology. The algorithms are evaluated on real and simulated gene expression data. We show that both methods eliminate local dependencies between GO terms and point to relevant areas in the GO graph that remain undetected with state-of-the-art algorithms for scoring functional terms. A simulation study demonstrates that the new methods exhibit a higher level of detecting relevant biological terms than competing methods. Availability: topgo.bioinf.mpi-inf.mpg.de Contact: alexa@mpi-sb.mpg.de Supplementary Information: Supplementary data are available at Bioinformatics online.

1,843 citations


Journal ArticleDOI
TL;DR: A web-based application for analyzing association studies from a genetic epidemiology point of view; its main capabilities include descriptive analysis and tests for Hardy-Weinberg equilibrium and linkage disequilibrium.
Abstract: Summary: A web-based application has been designed from a genetic epidemiology point of view to analyze association studies. Main capabilities include descriptive analysis, test for Hardy--Weinberg equilibrium and linkage disequilibrium. Analysis of association is based on linear or logistic regression according to the response variable (quantitative or binary disease status, respectively). Analysis of single SNPs: multiple inheritance models (co-dominant, dominant, recessive, over-dominant and log-additive), and analysis of interactions (gene--gene or gene--environment). Analysis of multiple SNPs: haplotype frequency estimation, analysis of association of haplotypes with the response, including analysis of interactions. Availability: http://bioinfo.iconcologia.net/SNPstats. Source code for local installation is available under GNU license. Contact: v.moreno@iconcologia.net Supplementary Information: Figures with a sample run are available on Bioinformatics online. A detailed online tutorial is available within the application.

1,665 citations
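
The Hardy-Weinberg equilibrium test mentioned in the SNPstats abstract above can be illustrated with a short sketch. The abstract does not say which test SNPstats applies, so the chi-square goodness-of-fit test below is an assumption, and the function name `hwe_chi_square` is hypothetical. It takes the three genotype counts of a biallelic SNP and returns the statistic and its p-value with one degree of freedom.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test for Hardy-Weinberg equilibrium.

    Arguments are the observed counts of the genotypes AA, AB and BB.
    Returns (chi2, p_value), using 1 degree of freedom.
    This is an illustrative sketch, not the SNPstats implementation.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")

    # Allele frequencies estimated from the genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p

    # Expected genotype counts under Hardy-Weinberg proportions.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)

    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Upper tail of the chi-square distribution with 1 df:
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

For example, `hwe_chi_square(25, 50, 25)` describes a sample in exact Hardy-Weinberg proportions and gives a statistic of zero, whereas a sample with no heterozygotes, such as `hwe_chi_square(50, 0, 50)`, gives a large statistic and a very small p-value. For small samples or rare alleles an exact test is usually preferred over this approximation.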


Journal ArticleDOI
TL;DR: An automated procedure for the analysis of homologous protein structures has been developed that facilitates the characterization of internal conformational differences and inter-conformer relationships and provides a framework for the analysis of protein structural evolution.
Abstract: Summary: An automated procedure for the analysis of homologous protein structures has been developed. The method facilitates the characterization of internal conformational differences and inter-conformer relationships and provides a framework for the analysis of protein structural evolution. The method is implemented in bio3d, an R package for the exploratory analysis of structure and sequence data. Availability: The bio3d package is distributed with full source code as a platform-independent R package under a GPL2 license from: http://mccammon.ucsd.edu/~bgrant/bio3d/ Contact: bgrant@mccammon.ucsd.edu

1,324 citations


Journal ArticleDOI
TL;DR: CAFE (Computational Analysis of gene Family Evolution) is a tool for the statistical analysis of the evolution of the size of gene families.
Abstract: Summary: We present CAFE (Computational Analysis of gene Family Evolution), a tool for the statistical analysis of the evolution of the size of gene families. It uses a stochastic birth and death process to model the evolution of gene family sizes over a phylogeny. For a specified phylogenetic tree, and given the gene family sizes in the extant species, CAFE can estimate the global birth and death rate of gene families, infer the most likely gene family size at all internal nodes, identify gene families that have accelerated rates of gain and loss (quantified by a p-value) and identify which branches cause the p-value to be small for significant families. Availability: Software is available from http://www.bio.indiana.edu/~hahnlab/Software.html Contact: mwh@indiana.edu

1,170 citations


Journal ArticleDOI
TL;DR: TimeTree brings time estimates from molecular data together in a consistent format and uses a hierarchical structure, corresponding to the tree of life, to maximize their utility.
Abstract: Summary: Biologists and other scientists routinely need to know times of divergence between species and to construct phylogenies calibrated to time (timetrees). Published studies reporting time estimates from molecular data have been increasing rapidly, but the data have been largely inaccessible to the greater community of scientists because of their complexity. TimeTree brings these data together in a consistent format and uses a hierarchical structure, corresponding to the tree of life, to maximize their utility. Results are presented and summarized, allowing users to quickly determine the range and robustness of time estimates and the degree of consensus from the published literature. Availability: TimeTree is available at http://www.timetree.net Contact: [email protected]

1,137 citations


Journal ArticleDOI
TL;DR: The Orientations of Proteins in Membranes (OPM) database provides a collection of transmembrane, monotopic and peripheral proteins from the Protein Data Bank whose spatial arrangements in the lipid bilayer have been calculated theoretically and compared with experimental data.
Abstract: Summary: The Orientations of Proteins in Membranes (OPM) database provides a collection of transmembrane, monotopic and peripheral proteins from the Protein Data Bank whose spatial arrangements in the lipid bilayer have been calculated theoretically and compared with experimental data. The database allows analysis, sorting and searching of membrane proteins based on their structural classification, species, destination membrane, numbers of transmembrane segments and subunits, numbers of secondary structures and the calculated hydrophobic thickness or tilt angle with respect to the bilayer normal. All coordinate files with the calculated membrane boundaries are available for downloading. Availability: http://opm.phar.umich.edu Contact: almz@umich.edu

Journal ArticleDOI
TL;DR: A methodology for comparing and validating biclustering methods is presented; it includes a simple binary reference model that captures the essential features of most biclustering approaches, together with a fast divide-and-conquer algorithm (Bimax).
Abstract: Motivation: In recent years, there have been various efforts to overcome the limitations of standard clustering approaches for the analysis of gene expression data by grouping genes and samples simultaneously. The underlying concept, which is often referred to as biclustering, makes it possible to identify sets of genes sharing compatible expression patterns across subsets of samples, and its usefulness has been demonstrated for different organisms and datasets. Several biclustering methods have been proposed in the literature; however, it is not clear how the different techniques compare with each other with respect to the biological relevance of the clusters as well as with other characteristics such as robustness and sensitivity to noise. Accordingly, no guidelines concerning the choice of the biclustering method are currently available. Results: First, this paper provides a methodology for comparing and validating biclustering methods that includes a simple binary reference model. Although this model captures the essential features of most biclustering approaches, it is still simple enough to exactly determine all optimal groupings; to this end, we propose a fast divide-and-conquer algorithm (Bimax). Second, we evaluate the performance of five salient biclustering algorithms together with the reference model and a hierarchical clustering method on various synthetic and real datasets for Saccharomyces cerevisiae and Arabidopsis thaliana. The comparison reveals that (1) biclustering in general has advantages over a conventional hierarchical clustering approach, (2) there are considerable performance differences between the tested methods and (3) even the simple reference model delivers relevant patterns within all considered settings.
Availability: The datasets used, the outcomes of the biclustering algorithms and the Bimax implementation for the reference model are available at http://www.tik.ee.ethz.ch/sop/bimax Contact: bleuler@tik.ee.ethz.ch Supplementary information: Supplementary data are available at http://www.tik.ee.ethz.ch/sop/bimax

Journal ArticleDOI
TL;DR: It is shown that in gene (protein) association networks CFinder can be used to predict the function(s) of a single protein and to discover novel modules, and CFinder is also very efficient for locating the cliques of large sparse graphs.
Abstract: Summary: Most cellular tasks are performed not by individual proteins, but by groups of functionally associated proteins, often referred to as modules. In a protein association network modules appear as groups of densely interconnected nodes, also called communities or clusters. These modules often overlap with each other and form a network of their own, in which nodes (links) represent the modules (overlaps). We introduce CFinder, a fast program locating and visualizing overlapping, densely interconnected groups of nodes in undirected graphs, and allowing the user to easily navigate between the original graph and the web of these groups. We show that in gene (protein) association networks CFinder can be used to predict the function(s) of a single protein and to discover novel modules. CFinder is also very efficient for locating the cliques of large sparse graphs. Availability: CFinder (for Windows, Linux and Macintosh) and its manual can be downloaded from http://angel.elte.hu/clustering. Supplementary information: Supplementary data are available on Bioinformatics online. Contact: cfinder@angel.elte.hu

Journal ArticleDOI
TL;DR: Using simulated datasets, the Bayesian method generally fares better than the ML approach in accuracy and coverage, although for some values the two approaches are equal in performance.
Abstract: Comparison of the performance and accuracy of different inference methods, such as maximum likelihood (ML) and Bayesian inference, is difficult because the inference methods are implemented in different programs, often written by different authors. Both methods were implemented in the program MIGRATE, which estimates population genetic parameters, such as population sizes and migration rates, using coalescence theory. Both inference methods use the same Markov chain Monte Carlo algorithm and differ from each other in only two aspects: parameter proposal distribution and maximization of the likelihood function. Using simulated datasets, the Bayesian method generally fares better than the ML approach in accuracy and coverage, although for some values the two approaches are equal in performance. Motivation: The Markov chain Monte Carlo-based ML framework can fail on sparse data and can deliver non-conservative support intervals. A Bayesian framework with appropriate prior distribution is able to remedy some of these problems. Results: The program MIGRATE was extended to allow not only for maximum likelihood estimation of population genetics parameters but also for using a Bayesian framework. Comparisons between the Bayesian approach and the ML approach are facilitated because both modes estimate the same parameters under the same population model and assumptions. Availability: The program is available from http://popgen.csit.fsu.edu/ Contact: beerli@csit.fsu.edu

Journal ArticleDOI
TL;DR: A likelihood-based model selection procedure that uses a genetic algorithm to search multiple sequence alignments for evidence of recombination breakpoints and identify putative recombinant sequences and is an extensible and intuitive method that can be run efficiently in parallel.
Abstract: Motivation: Phylogenetic and evolutionary inference can be severely misled if recombination is not accounted for, hence screening for it should be an essential component of nearly every comparative study. The evolution of recombinant sequences cannot be properly explained by a single phylogenetic tree, but several phylogenies may be used to correctly model the evolution of non-recombinant fragments. Results: We developed a likelihood-based model selection procedure that uses a genetic algorithm to search multiple sequence alignments for evidence of recombination breakpoints and identify putative recombinant sequences. GARD is an extensible and intuitive method that can be run efficiently in parallel. Extensive simulation studies show that the method nearly always outperforms other available tools, both in terms of power and accuracy, and that the use of GARD to screen sequences for recombination ensures good statistical properties for methods aimed at detecting positive selection. Availability: Freely available http://www.datamonkey.org/GARD/ Contact: [email protected]

Journal ArticleDOI
TL;DR: A method based on support vector machines (SVMs) is developed that, starting from the protein sequence information, can predict whether a new phenotype derived from a nsSNP can be related to a genetic disease in humans.
Abstract: Motivation: Human single nucleotide polymorphisms (SNPs) are the most frequent type of genetic variation in the human population. One of the most important goals of SNP projects is to understand which human genotype variations are related to Mendelian and complex diseases. Great interest is focused on non-synonymous coding SNPs (nsSNPs) that are responsible for protein single point mutations. nsSNPs can be neutral or disease associated. It is known that the mutation of only one residue in a protein sequence can be related to a number of pathological conditions of dramatic social impact such as Alzheimer's, Parkinson's and Creutzfeldt-Jakob's diseases. The quality and completeness of presently available SNP databases allows the application of machine learning techniques to predict the onset of human diseases due to single point protein mutation starting from the protein sequence. Results: In this paper, we develop a method based on support vector machines (SVMs) that, starting from the protein sequence information, can predict whether a new phenotype derived from a nsSNP can be related to a genetic disease in humans. Using a dataset of 21 185 single point mutations, 61% of which are disease-related, out of 3587 proteins, we show that our predictor can reach more than 74% accuracy in the specific task of predicting whether a single point mutation can be disease related or not. Our method, although based on less information, outperforms other web-available predictors implementing different approaches. Availability: A beta version of the web tool is available at http://gpcr.biocomp.unibo.it/cgi/predictors/PhD-SNP/PhD-SNP.cgi Contact: casadio@alma.unibo.it

Journal ArticleDOI
TL;DR: The Bioconductor package RankProd modifies and extends the rank product method proposed by Breitling et al., to integrate multiple microarray studies from different laboratories and/or platforms and accepts pre-processed expression datasets produced from a wide variety of platforms.
Abstract: Summary: While meta-analysis provides a powerful tool for analyzing microarray experiments by combining data from multiple studies, it presents unique computational challenges. The Bioconductor package RankProd provides a new and intuitive tool for this purpose in detecting differentially expressed genes under two experimental conditions. The package modifies and extends the rank product method proposed by Breitling et al., [(2004) FEBS Lett., 573, 83--92] to integrate multiple microarray studies from different laboratories and/or platforms. It offers several advantages over t-test based methods and accepts pre-processed expression datasets produced from a wide variety of platforms. The significance of the detection is assessed by a non-parametric permutation test, and the associated P-value and false discovery rate (FDR) are included in the output alongside the genes that are detected by user-defined criteria. A visualization plot is provided to view actual expression levels for each gene with estimated significance measurements. Availability: RankProd is available at Bioconductor http://www.bioconductor.org. A web-based interface will soon be available at http://cactus.salk.edu/RankProd Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
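
The rank product idea behind the RankProd abstract above can be sketched in a few lines. This is a simplified illustration, not the RankProd package: the function names are hypothetical, ties are broken by input order, only up-regulation is considered, and the permutation p-value is a pooled estimate chosen for brevity. Each gene is ranked by fold change within every study, the ranks are combined by a geometric mean, and the observed values are compared against rank products from independently shuffled studies.

```python
import bisect
import math
import random


def ranks_descending(values):
    """Rank 1 is the largest value; ties are broken by input order."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def rank_products(fold_changes):
    """fold_changes[i][g] is the log fold change of gene g in study i.

    Returns the geometric mean of each gene's ranks across studies.
    """
    k = len(fold_changes)
    n_genes = len(fold_changes[0])
    per_study = [ranks_descending(study) for study in fold_changes]
    return [
        math.exp(sum(math.log(r[g]) for r in per_study) / k)
        for g in range(n_genes)
    ]


def permutation_p_values(fold_changes, n_perm=1000, seed=0):
    """Pooled permutation estimate (an assumption of this sketch).

    For each gene, returns the fraction of null rank products, pooled
    over all genes and permutations, that are <= the observed value.
    """
    rng = random.Random(seed)
    observed = rank_products(fold_changes)
    n_genes = len(observed)
    counts = [0] * n_genes
    for _ in range(n_perm):
        # Shuffle each study independently to break gene identity.
        shuffled = [rng.sample(study, len(study)) for study in fold_changes]
        null = sorted(rank_products(shuffled))
        for g, rp in enumerate(observed):
            counts[g] += bisect.bisect_right(null, rp)
    return [c / (n_perm * n_genes) for c in counts]
```

A gene that ranks near the top in every study gets a small rank product and therefore a small p-value. Because only ranks enter the calculation, studies measured on different platforms or scales can be combined without rescaling.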

Journal ArticleDOI
TL;DR: New additional methods for processing and visualizing mass spectrometry based molecular profile data, implemented as part of the recently introduced MZmine software, include new features and extensions such as support for mzXML data format.
Abstract: Summary: New additional methods are presented for processing and visualizing mass spectrometry based molecular profile data, implemented as part of the recently introduced MZmine software. They include new features and extensions such as support for mzXML data format, capability to perform batch processing for large number of files, support for parallel processing, new methods for calculating peak areas using post-alignment peak picking algorithm and implementation of Sammon's mapping and curvilinear distance analysis for data visualization and exploratory analysis. Availability: MZmine is available under GNU Public license from http://mzmine.sourceforge.net/ Contact: matej.oresic@vtt.fi

Journal ArticleDOI
TL;DR: FANMOD relies on recently developed algorithms to improve the efficiency of network motif detection by some orders of magnitude over existing tools, which facilitates the detection of larger motifs in bigger networks than previously possible.
Abstract: Summary: Motifs are small connected subnetworks that a network displays in significantly higher frequencies than would be expected for a random network. They have recently gathered much attention as a concept to uncover structural design principles of complex biological networks. FANMOD is a tool for fast network motif detection; it relies on recently developed algorithms to improve the efficiency of network motif detection by some orders of magnitude over existing tools. This facilitates the detection of larger motifs in bigger networks than previously possible. Additional benefits of FANMOD are the ability to analyze colored networks, a graphical user interface and the ability to export results to a variety of machine- and human-readable file formats including comma-separated values and HTML. Availability: The tool is freely available online at http://www.minet.uni-jena.de/~wernicke/motifs/ and runs under Linux, Mac OS and Windows. Contact: wernicke@minet.uni-jena.de

Journal ArticleDOI
TL;DR: A continuous wavelet transform (CWT)-based peak detection algorithm has been devised that identifies peaks with different scales and amplitudes and can identify both strong and weak peaks while keeping false positive rate low.
Abstract: Motivation: A major problem for current peak detection algorithms is that noise in mass spectrometry (MS) spectra gives rise to a high rate of false positives. The false positive rate is especially problematic in detecting peaks with low amplitudes. Usually, various baseline correction algorithms and smoothing methods are applied before attempting peak detection. This approach is very sensitive to the amount of smoothing and aggressiveness of the baseline correction, which contribute to making peak detection results inconsistent between runs, instrumentation and analysis methods. Results: Most peak detection algorithms simply identify peaks based on amplitude, ignoring the additional information present in the shape of the peaks in a spectrum. In our experience, 'true' peaks have characteristic shapes, and providing a shape-matching function that provides a 'goodness of fit' coefficient should provide a more robust peak identification method. Based on these observations, a continuous wavelet transform (CWT)-based peak detection algorithm has been devised that identifies peaks with different scales and amplitudes. By transforming the spectrum into wavelet space, the pattern-matching problem is simplified and in addition provides a powerful technique for identifying and separating the signal from the spike noise and colored noise. This transformation, with the additional information provided by the 2D CWT coefficients can greatly enhance the effective signal-to-noise ratio. Furthermore, with this technique no baseline removal or peak smoothing preprocessing steps are required before peak detection, and this improves the robustness of peak detection under a variety of conditions. The algorithm was evaluated with SELDI-TOF spectra with known polypeptide positions. Comparisons with two other popular algorithms were performed. The results show the CWT-based algorithm can identify both strong and weak peaks while keeping false positive rate low. 
Availability: The algorithm is implemented in R and will be included as an open source module in the Bioconductor project. Contact: s-lin2@northwestern.edu Supplementary material: http://basic.northwestern.edu/publications/peakdetection/. Colour versions of the figures in this article can be found at Bioinformatics Online.

Journal ArticleDOI
TL;DR: A Markov chain Monte Carlo coalescent genealogy sampler, LAMARC 2.0, which estimates population genetic parameters from genetic data, and can perform either maximum-likelihood or Bayesian analysis.
Abstract: Summary: We present a Markov chain Monte Carlo coalescent genealogy sampler, LAMARC 2.0, which estimates population genetic parameters from genetic data. LAMARC can co-estimate subpopulation Θ = 4Neμ, immigration rates, subpopulation exponential growth rates and overall recombination rate, or a user-specified subset of these parameters. It can perform either maximum-likelihood or Bayesian analysis, and accommodates nucleotide sequence, SNP, microsatellite or electrophoretic data, with resolved or unresolved haplotypes. It is available as portable source code and executables for all three major platforms. Availability: LAMARC 2.0 is freely available at http://evolution.gs.washington.edu/lamarc Contact: lamarc@gs.washington.edu

Journal ArticleDOI
TL;DR: This work compares the respective advantages and limits of synchronous versus asynchronous updating assumptions to delineate the asymptotical behaviour of regulatory networks and proposes several intermediate strategies to optimize the computation of asymptotical properties depending on available knowledge.
Abstract: Motivation: To understand the behaviour of complex biological regulatory networks, a proper integration of molecular data into a full-fledged formal dynamical model is ultimately required. As most available data on regulatory interactions are qualitative, logical modelling offers an interesting framework to delineate the main dynamical properties of the underlying networks. Results: Transposing a generic model of the core network controlling the mammalian cell cycle into the logical framework, we compare different strategies to explore its dynamical properties. In particular, we assess the respective advantages and limits of synchronous versus asynchronous updating assumptions to delineate the asymptotical behaviour of regulatory networks. Furthermore, we propose several intermediate strategies to optimize the computation of asymptotical properties depending on available knowledge. Availability: The mammalian cell cycle model is available in a dedicated XML format (GINML) on our website, along with our logical simulation software GINsim ( ). Higher resolution state transition graphs are also found on this website (Model Repository page). Contact: thieffry@ibdm.univ-mrs.fr
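
The difference between synchronous and asynchronous updating discussed in the abstract above can be seen on a toy example. The two-node mutual-inhibition network below is a hypothetical Boolean model chosen for illustration. It is not the mammalian cell cycle model, and it does not use GINsim or the GINML format. Under synchronous updating all nodes change at once, so each state has exactly one successor. Under asynchronous updating one node changes at a time, so a state may have several successors or none.

```python
from itertools import product

# Hypothetical toy network (mutual inhibition), for illustration only.
RULES = {
    "x": lambda s: not s["y"],
    "y": lambda s: not s["x"],
}
NODES = sorted(RULES)


def sync_successor(state):
    """Update every node simultaneously; returns the single next state."""
    s = dict(zip(NODES, state))
    return tuple(int(RULES[n](s)) for n in NODES)


def async_successors(state):
    """Update one node at a time; returns the set of possible next states."""
    s = dict(zip(NODES, state))
    successors = set()
    for i, n in enumerate(NODES):
        target = int(RULES[n](s))
        if target != state[i]:
            successors.add(state[:i] + (target,) + state[i + 1:])
    return successors


def stable_states():
    """States that are their own successor (identical under both schemes)."""
    all_states = product((0, 1), repeat=len(NODES))
    return [st for st in all_states if sync_successor(st) == st]
```

In this toy network the synchronous scheme makes (0, 0) and (1, 1) alternate forever, a cyclic attractor that disappears under asynchronous updating, where both states lead to one of the stable states (0, 1) or (1, 0). The stable states themselves are the same under both assumptions.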

Journal ArticleDOI
TL;DR: An algorithm is developed that predicts the functions of a protein in two steps by estimating its functional similarity with the protein using the local topology of the interaction network as well as the reliability of experimental sources and scoring each function based on its weighted frequency in these neighbours.
Abstract: Motivation: Most approaches in predicting protein function from protein--protein interaction data utilize the observation that a protein often shares functions with proteins that interact with it (its level-1 neighbours). However, proteins that interact with the same proteins (i.e. level-2 neighbours) may also have a greater likelihood of sharing similar physical or biochemical characteristics. We speculate that functional similarity between a protein and its neighbours from the two different levels arises from two distinct forms of functional association, and a protein is likely to share functions with its level-1 and/or level-2 neighbours. We are interested in finding out how significant functional association between level-2 neighbours is and how it can be exploited for protein function prediction. Results: We made a statistical study on recent interaction data and observed that functional association between level-2 neighbours is clearly observable. A substantial number of proteins are observed to share functions with level-2 neighbours but not with level-1 neighbours. We develop an algorithm that predicts the functions of a protein in two steps: (1) assigning a weight to each of its level-1 and level-2 neighbours by estimating its functional similarity with the protein using the local topology of the interaction network as well as the reliability of experimental sources and (2) scoring each function based on its weighted frequency in these neighbours. Using leave-one-out cross validation, we compare the performance of our method against that of several other existing approaches and show that our method performs relatively well. Contact: g0306417@nus.edu.sg
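
The two-step idea (weight the level-1 and level-2 neighbours, then score each function by its weighted frequency) can be sketched in Python as follows. The weight used here, a Jaccard overlap of closed neighbourhoods, is only a placeholder assumption; the paper's actual weighting and its use of experimental-source reliability are not reproduced, and all function names are hypothetical.

```python
from collections import defaultdict


def neighbours_up_to_level2(graph, protein):
    """Return (level-1, level-2) neighbour sets; level-2 excludes level-1 and the protein itself."""
    level1 = set(graph[protein])
    level2 = set()
    for n in level1:
        level2.update(graph[n])
    level2 -= level1 | {protein}
    return level1, level2


def topology_weight(graph, u, v):
    """Placeholder weight (an assumption): Jaccard overlap of the closed neighbourhoods of u and v."""
    nu = set(graph[u]) | {u}
    nv = set(graph[v]) | {v}
    return len(nu & nv) / len(nu | nv)


def score_functions(graph, annotations, protein):
    """Score each function by the summed weights of the neighbours annotated with it."""
    level1, level2 = neighbours_up_to_level2(graph, protein)
    scores = defaultdict(float)
    for n in level1 | level2:
        w = topology_weight(graph, protein, n)
        for function in annotations.get(n, ()):
            scores[function] += w
    return dict(scores)


# Toy interaction network and annotations (invented for illustration).
graph = {"P": ["A", "B"], "A": ["P", "C"], "B": ["P", "C"], "C": ["A", "B"]}
annotations = {"A": {"kinase"}, "B": {"transport"}, "C": {"kinase"}}
scores = score_functions(graph, annotations, "P")
```

Here protein C is a level-2 neighbour of P only, yet it still contributes to the score of the function it is annotated with, which is the behaviour the abstract argues for.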

Journal ArticleDOI
TL;DR: A Bayesian framework for combining multiple classifiers based on the functional taxonomy constraints is developed using a hierarchy of support vector machine (SVM) classifiers trained on multiple data types to obtain the most probable consistent set of predictions.
Abstract: Motivation: Assigning functions for unknown genes based on diverse large-scale data is a key task in functional genomics. Previous work on gene function prediction has addressed this problem using independent classifiers for each function. However, such an approach ignores the structure of functional class taxonomies, such as the Gene Ontology (GO). Over a hierarchy of functional classes, a group of independent classifiers where each one predicts gene membership to a particular class can produce a hierarchically inconsistent set of predictions, where for a given gene a specific class may be predicted positive while its inclusive parent class is predicted negative. Taking the hierarchical structure into account resolves such inconsistencies and provides an opportunity for leveraging all classifiers in the hierarchy to achieve higher specificity of predictions. Results: We developed a Bayesian framework for combining multiple classifiers based on the functional taxonomy constraints. Using a hierarchy of support vector machine (SVM) classifiers trained on multiple data types, we combined predictions in our Bayesian framework to obtain the most probable consistent set of predictions. Experiments show that over a 105-node subhierarchy of the GO, our Bayesian framework improves predictions for 93 nodes. As an additional benefit, our method also provides implicit calibration of SVM margin outputs to probabilities. Using this method, we make function predictions for multiple proteins, and experimentally confirm predictions for proteins involved in mitosis. Supplementary information: Results for the 105 selected GO classes and predictions for 1059 unknown genes are available at: http://function.princeton.edu/genesite/ Contact: ogt@cs.princeton.edu
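
The hierarchical inconsistency that motivates this work can be made concrete with a short Python sketch. It only detects inconsistent outputs of independent classifiers; it does not implement the Bayesian combination described in the abstract. The class names, the `parents` map, and the function name are hypothetical.

```python
def inconsistent_predictions(parents, predicted):
    """Return classes predicted positive while at least one parent class is predicted negative."""
    return sorted(
        child
        for child, parent_list in parents.items()
        if predicted.get(child, False)
        and any(not predicted.get(p, False) for p in parent_list)
    )


# Toy three-level hierarchy (illustrative labels, not taken from the paper).
parents = {
    "mitosis": ["cell cycle"],
    "cell cycle": ["cellular process"],
    "cellular process": [],
}

# Independent per-class decisions for one gene.
predicted = {"mitosis": True, "cell cycle": False, "cellular process": True}

conflicts = inconsistent_predictions(parents, predicted)
```

A consistent set of predictions would require either switching "cell cycle" to positive or "mitosis" to negative; choosing between such repairs in a principled way is the role of the framework described above.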

Journal ArticleDOI
TL;DR: CONTRAfold, a novel secondary structure prediction method based on conditional log-linear models (CLLMs), a flexible class of probabilistic models which generalize upon SCFGs by using discriminative training and feature-rich scoring, achieves the highest single sequence prediction accuracies to date.
Abstract: Motivation: For several decades, free energy minimization methods have been the dominant strategy for single sequence RNA secondary structure prediction. More recently, stochastic context-free grammars (SCFGs) have emerged as an alternative probabilistic methodology for modeling RNA structure. Unlike physics-based methods, which rely on thousands of experimentally-measured thermodynamic parameters, SCFGs use fully-automated statistical learning algorithms to derive model parameters. Despite this advantage, however, probabilistic methods have not replaced free energy minimization methods as the tool of choice for secondary structure prediction, as the accuracies of the best current SCFGs have yet to match those of the best physics-based models. Results: In this paper, we present CONTRAfold, a novel secondary structure prediction method based on conditional log-linear models (CLLMs), a flexible class of probabilistic models which generalize upon SCFGs by using discriminative training and feature-rich scoring. In a series of cross-validation experiments, we show that grammar-based secondary structure prediction methods formulated as CLLMs consistently outperform their SCFG analogs. Furthermore, CONTRAfold, a CLLM incorporating most of the features found in typical thermodynamic models, achieves the highest single sequence prediction accuracies to date, outperforming currently available probabilistic and physics-based techniques. Our result thus closes the gap between probabilistic and thermodynamic models, demonstrating that statistical learning procedures provide an effective alternative to empirical measurement of thermodynamic parameters for RNA secondary structure prediction. Availability: Source code for CONTRAfold is available at http://contra.

Journal ArticleDOI
TL;DR: A Systems Biology Toolbox for the widely used general purpose mathematical software MATLAB, which contains a large number of analysis methods, such as deterministic and stochastic simulation, parameter estimation, network identification, parameter sensitivity analysis and bifurcation analysis.
Abstract: Summary: We present a Systems Biology Toolbox for the widely used general purpose mathematical software MATLAB. The toolbox offers systems biologists an open and extensible environment, in which to explore ideas, prototype and share new algorithms, and build applications for the analysis and simulation of biological and biochemical systems. Additionally it is well suited for educational purposes. The toolbox supports the Systems Biology Markup Language (SBML) by providing an interface for import and export of SBML models. In this way the toolbox connects nicely to other SBML-enabled modelling packages. Models are represented in an internal model format and can be described either by entering ordinary differential equations or, more intuitively, by entering biochemical reaction equations. The toolbox contains a large number of analysis methods, such as deterministic and stochastic simulation, parameter estimation, network identification, parameter sensitivity analysis and bifurcation analysis. Availability: The Systems Biology Toolbox for MATLAB is open source and freely available from http://www.sbtoolbox.org. The website also contains a tutorial, extensive documentation and examples. Contact: henning@fcc.chalmers.se

Journal ArticleDOI
TL;DR: Although UCSC Known Genes offers the highest genomic and CDS coverage among major human and mouse gene sets, more detailed analysis suggests all of them could be further improved.
Abstract: The University of California Santa Cruz (UCSC) Known Genes dataset is constructed by a fully automated process, based on protein data from Swiss-Prot/TrEMBL (UniProt) and the associated mRNA data from Genbank. The detailed steps of this process are described. Extensive cross-references from this dataset to other genomic and proteomic data were constructed. For each known gene, a details page is provided containing rich information about the gene, together with extensive links to other relevant genomic, proteomic and pathway data. As of July 2005, the UCSC Known Genes are available for human, mouse and rat genomes. The Known Genes serves as a foundation to support several key programs: the Genome Browser, Proteome Browser, Gene Sorter and Table Browser offered at the UCSC website. All the associated data files and program source code are also available. They can be accessed at http://genome.ucsc.edu. The genomic coverage of UCSC Known Genes, RefSeq, Ensembl Genes, H-Invitational and CCDS is analyzed. Although UCSC Known Genes offers the highest genomic and CDS coverage among major human and mouse gene sets, more detailed analysis suggests all of them could be further improved. Contact: fanhsu@soe.ucsc.edu

Journal ArticleDOI
TL;DR: It is shown that cancer proteins contain a high ratio of highly promiscuous structural domains, i.e., domains with a high propensity for mediating protein interactions, reflecting the central roles of proteins, whose mutations lead to cancer.
Abstract: Motivation: The study of interactomes, or networks of protein-protein interactions, is increasingly providing valuable information on biological systems. Here we report a study of cancer proteins in an extensive human protein-protein interaction network constructed by computational methods. Results: We show that human proteins translated from known cancer genes exhibit a network topology that is different from that of proteins not documented as being mutated in cancer. In particular, cancer proteins show an increase in the number of proteins they interact with. They also appear to participate in central hubs rather than peripheral ones, mirroring their greater centrality and participation in networks that form the backbone of the proteome. Moreover, we show that cancer proteins contain a high ratio of highly promiscuous structural domains, i.e., domains with a high propensity for mediating protein interactions. These observations indicate an underlying evolutionary distinction between the two groups of proteins, reflecting the central roles of proteins, whose mutations lead to cancer. Contact: paul.bates@cancer.org.uk Supplementary information: The interactome data are available through the PIP (Potential Interactions of Proteins) web server at http://bmm.cancerresearchuk.org/servers/pip. Further additional material is available at http://bmm.cancerresearchuk.org/servers/pip/bioinformatics/

Journal ArticleDOI
TL;DR: Galperin et al. report identification of the PilZ ('pills') domain (Pfam domain PF07238) in the sequences of bacterial cellulose synthases, alginate biosynthesis protein Alg44, proteins of enterobacterial YcgR and firmicute YpfA families, and other proteins encoded in bacterial genomes.
Abstract: Recent studies identified c-di-GMP as a universal bacterial secondary messenger regulating biofilm formation, motility, production of extracellular polysaccharide and multicellular behavior in diverse bacteria. However, except for cellulose synthase, no protein has been shown to bind c-di-GMP and the targets for c-di-GMP action remain unknown. Here we report identification of the PilZ ('pills') domain (Pfam domain PF07238) in the sequences of bacterial cellulose synthases, alginate biosynthesis protein Alg44, proteins of enterobacterial YcgR and firmicute YpfA families, and other proteins encoded in bacterial genomes and present evidence indicating that this domain is (part of) the long-sought c-di-GMP-binding protein. Association of the PilZ domain with a variety of other domains, including likely components of bacterial multidrug secretion system, could provide clues to multiple functions of the c-di-GMP in bacterial pathogenesis and cell development. Contact: galperin@ncbi.nlm.nih.gov Supplementary information: http://www.ncbi.nlm.nih.gov/Complete_Genomes/SigCensus/PilZ.html

Journal ArticleDOI
TL;DR: This study first integrates proteomics and microarray datasets and represents the yeast protein-protein interaction network as a weighted graph, and extends a betweenness-based partition algorithm, and uses it to identify 266 functional modules in the yeast proteome network, showing that these modules are indeed densely connected subgraphs.
Abstract: Motivation: Identification of functional modules in protein interaction networks is a first step in understanding the organization and dynamics of cell functions. To ensure that the identified modules are biologically meaningful, network-partitioning algorithms should take into account not only topological features but also functional relationships, and identified modules should be rigorously validated. Results: In this study we first integrate proteomics and microarray datasets and represent the yeast protein--protein interaction network as a weighted graph. We then extend a betweenness-based partition algorithm, and use it to identify 266 functional modules in the yeast proteome network. For validation we show that the functional modules are indeed densely connected subgraphs. In addition, genes in the same functional module confer a similar phenotype. Furthermore, known protein complexes are largely contained in the functional modules in their entirety. We also analyze an example of a functional module and show that functional modules can be useful for gene annotation. Contact: [email protected] Supplementary Information: Supplementary data are available at Bioinformatics online