
Showing papers in "BMC Bioinformatics in 2006"


Journal ArticleDOI
TL;DR: It is shown that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy.
Abstract: Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection. We investigate the use of random forest for classification of microarray data (including multi-class problems) and propose a new method of gene selection in classification problems based on random forest. Using simulated and nine microarray data sets we show that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy. Because of its performance and features, random forest and gene selection using random forest should probably become part of the "standard tool-box" of methods for class prediction and gene selection with microarray data.
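The gene-selection procedure lends itself to a compact illustration. Below is a minimal Python sketch of backward elimination driven by random forest variable importance, in the spirit of the paper's approach (the authors' own implementation is the R package varSelRF); the dropped fraction, the OOB-error tolerance, and the stopping rule here are illustrative assumptions, not the authors' exact settings.

```python
# Hedged sketch: backward gene elimination with random forest importances.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_gene_selection(X, y, drop_fraction=0.2, min_genes=2, tol=0.01):
    """X: samples x genes, y: class labels. Returns a small gene subset."""
    genes = np.arange(X.shape[1])
    history = []
    while True:
        rf = RandomForestClassifier(n_estimators=2000, oob_score=True,
                                    random_state=0).fit(X[:, genes], y)
        history.append((genes.copy(), 1.0 - rf.oob_score_))   # OOB error
        if len(genes) <= min_genes:
            break
        keep = max(min_genes, int(len(genes) * (1 - drop_fraction)))
        order = np.argsort(rf.feature_importances_)[::-1]     # best first
        genes = genes[order[:keep]]
    # pick the smallest gene set whose OOB error is near the minimum
    best_err = min(err for _, err in history)
    for subset, err in reversed(history):                     # smallest first
        if err <= best_err + tol:                             # tol is assumed
            return subset
```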

2,610 citations


Journal ArticleDOI
TL;DR: This approach should enhance the ability to use microarray data to elucidate functional mechanisms that underlie cellular processes and to identify molecular targets of pharmacological compounds in mammalian cellular networks.
Abstract: Elucidating gene regulatory networks is crucial for understanding normal cell physiology and complex pathologic phenotypes. Existing computational methods for the genome-wide "reverse engineering" of such networks have been successful only for lower eukaryotes with simple genomes. Here we present ARACNE, a novel algorithm, using microarray expression profiles, specifically designed to scale up to the complexity of regulatory networks in mammalian cells, yet general enough to address a wider range of network deconvolution problems. This method uses an information theoretic approach to eliminate the majority of indirect interactions inferred by co-expression methods. We prove that ARACNE reconstructs the network exactly (asymptotically) if the effect of loops in the network topology is negligible, and we show that the algorithm works well in practice, even in the presence of numerous loops and complex topologies. We assess ARACNE's ability to reconstruct transcriptional regulatory networks using both a realistic synthetic dataset and a microarray dataset from human B cells. On synthetic datasets ARACNE achieves very low error rates and outperforms established methods, such as Relevance Networks and Bayesian Networks. Application to the deconvolution of genetic networks in human B cells demonstrates ARACNE's ability to infer validated transcriptional targets of the cMYC proto-oncogene. We also study the effects of misestimation of mutual information on network reconstruction, and show that algorithms based on mutual information ranking are more resilient to estimation errors. ARACNE shows promise in identifying direct transcriptional interactions in mammalian cellular networks, a problem that has challenged existing reverse engineering algorithms. This approach should enhance our ability to use microarray data to elucidate functional mechanisms that underlie cellular processes and to identify molecular targets of pharmacological compounds in mammalian cellular networks.
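A minimal sketch of ARACNE's two core steps, under simplifying assumptions: pairwise mutual information is estimated here by crude histogram binning (the published algorithm uses Gaussian kernel estimators), and the data processing inequality (DPI) then removes the weakest edge in every fully connected triplet, treating it as an indirect interaction. The MI threshold is illustrative.

```python
# Hedged sketch of ARACNE: MI estimation, thresholding, then DPI pruning.
import numpy as np
from itertools import combinations
from sklearn.metrics import mutual_info_score

def aracne(expr, mi_threshold=0.05, bins=10):
    """expr: genes x samples. Returns undirected edges surviving the DPI."""
    n = expr.shape[0]
    disc = [np.digitize(g, np.histogram_bin_edges(g, bins=bins)[1:-1])
            for g in expr]                       # crude discretization
    mi = np.zeros((n, n))
    for i, j in combinations(range(n), 2):
        mi[i, j] = mi[j, i] = mutual_info_score(disc[i], disc[j])
    edges = {(i, j) for i, j in combinations(range(n), 2)
             if mi[i, j] > mi_threshold}
    weakest = set()
    for i, j, k in combinations(range(n), 3):    # DPI on every triangle
        tri = [(i, j), (i, k), (j, k)]
        if all(e in edges for e in tri):
            weakest.add(min(tri, key=lambda e: mi[e]))  # assumed indirect
    return edges - weakest
```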

2,533 citations


Journal ArticleDOI
TL;DR: Data quality control and analysis procedures presented here provide statistical elements for the estimation of the relative expression of genes using real-time PCR.
Abstract: Even though real-time PCR has been broadly applied in biomedical sciences, data processing procedures for the analysis of quantitative real-time PCR are still lacking, specifically in the realm of appropriate statistical treatment. Confidence interval and statistical significance considerations are not explicit in many of the current data analysis approaches. Based on the standard curve method and other useful data analysis methods, we present and compare four statistical approaches and models for the analysis of real-time PCR data. In the first approach, a multiple regression analysis model was developed to derive ΔΔCt from estimation of interaction of gene and treatment effects. In the second approach, an ANCOVA (analysis of covariance) model was proposed, and the ΔΔCt can be derived from analysis of effects of variables. The other two models involve calculation of ΔCt followed by a two-group t-test and the analogous non-parametric Wilcoxon test. SAS programs were developed for all four models, and data output for analysis of a sample set is presented. In addition, a data quality control model was developed and implemented using SAS. Practical statistical solutions with SAS programs were developed for real-time PCR data and a sample dataset was analyzed with the SAS programs. The analysis using the various models and programs yielded similar results. Data quality control and analysis procedures presented here provide statistical elements for the estimation of the relative expression of genes using real-time PCR.
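As a worked example of the simplest of the four models, here is a hedged Python translation of the ΔCt-then-t-test approach (the paper's implementations are SAS programs; this mirror is illustrative only): ΔCt = Ct(target) - Ct(reference) per sample, a two-group t-test compares treatment against control, and ΔΔCt gives the fold change 2^(-ΔΔCt).

```python
# Hedged sketch of the two-group t-test model on ΔCt values.
import numpy as np
from scipy import stats

def ddct_ttest(ct_target_ctrl, ct_ref_ctrl, ct_target_trt, ct_ref_trt):
    dct_ctrl = np.asarray(ct_target_ctrl) - np.asarray(ct_ref_ctrl)
    dct_trt = np.asarray(ct_target_trt) - np.asarray(ct_ref_trt)
    ddct = dct_trt.mean() - dct_ctrl.mean()       # ΔΔCt
    t, p = stats.ttest_ind(dct_trt, dct_ctrl)     # two-group t-test
    return {"ddCt": ddct, "fold_change": 2 ** (-ddct), "t": t, "p": p}
```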

1,798 citations


Journal ArticleDOI
TL;DR: Analysis of previously published sequences from the Columbia River, its estuary, and the adjacent coastal ocean using the UniFrac interface provided insights that were not apparent from the initial data analysis, which used other commonly employed techniques to compare the communities.
Abstract: Background: Moving beyond pairwise significance tests to compare many microbial communities simultaneously is critical for understanding large-scale trends in microbial ecology and community assembly. Techniques that allow microbial communities to be compared in a phylogenetic context are rapidly gaining acceptance, but the widespread application of these techniques has been hindered by the difficulty of performing the analyses. Results: We introduce UniFrac, a web application, available at http://bmf.colorado.edu/unifrac, that allows several phylogenetic tests for differences among communities to be easily applied and interpreted. We demonstrate the use of UniFrac to cluster multiple environments, and to test which environments are significantly different. We show that analysis of previously published sequences from the Columbia River, its estuary, and the adjacent coastal ocean using the UniFrac interface provided insights that were not apparent from the initial data analysis, which used other commonly employed techniques to compare the communities. Conclusion: UniFrac provides easy access to powerful multivariate techniques for comparing microbial communities in a phylogenetic context. We thus expect that it will provide a completely new picture of many microbial interactions and processes in both environmental and medical contexts.
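For intuition, here is a toy implementation of the unweighted UniFrac metric underlying the web application: the fraction of the tree's branch length that leads to descendants of only one of the two communities. The branch-list tree encoding is an assumption made for brevity; real analyses should go through the UniFrac service itself.

```python
# Hedged toy implementation of the unweighted UniFrac distance.
def unifrac(branches, taxa_a, taxa_b):
    """branches: iterable of (branch_length, set_of_leaf_taxa_below)."""
    unique = shared = 0.0
    for length, leaves in branches:
        in_a = bool(leaves & taxa_a)
        in_b = bool(leaves & taxa_b)
        if in_a and in_b:
            shared += length          # branch leads to both communities
        elif in_a or in_b:
            unique += length          # branch unique to one community
    total = unique + shared
    return unique / total if total else 0.0

# Example: two fully separated 2-leaf clades give the maximal distance 1.0.
branches = [(1.0, {"x1"}), (1.0, {"x2"}), (0.5, {"x1", "x2"}),
            (1.0, {"y1"}), (1.0, {"y2"}), (0.5, {"y1", "y2"})]
print(unifrac(branches, {"x1", "x2"}, {"y1", "y2"}))  # 1.0
```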

1,404 citations


Journal ArticleDOI
TL;DR: It is shown that using CV to compute an error estimate for a classifier that has itself been tuned using CV gives a significantly biased estimate of the true error.
Abstract: Cross-validation (CV) is an effective method for estimating the prediction error of a classifier. Some recent articles have proposed methods for optimizing classifiers by choosing classifier parameter values that minimize the CV error estimate. We have evaluated the validity of using the CV error estimate of the optimized classifier as an estimate of the true error expected on independent data. We used CV to optimize the classification parameters for two kinds of classifiers: Shrunken Centroids and Support Vector Machines (SVM). Random training datasets were created, with no difference in the distribution of the features between the two classes. Using these "null" datasets, we selected classifier parameter values that minimized the CV error estimate. 10-fold CV was used for Shrunken Centroids while Leave-One-Out-CV (LOOCV) was used for the SVM. Independent test data were created to estimate the true error. With "null" and "non-null" (with differential expression between the classes) data, we also tested a nested CV procedure, where an inner CV loop is used to perform the tuning of the parameters while an outer CV is used to compute an estimate of the error. The CV error estimate for the classifier with the optimal parameters was found to be a substantially biased estimate of the true error that the classifier would incur on independent data. Even though there is no real difference between the two classes for the "null" datasets, the CV error estimate for the Shrunken Centroid with the optimal parameters was less than 30% on 18.5% of simulated training datasets. For SVM with optimal parameters the estimated error rate was less than 30% on 38% of "null" datasets. Performance of the optimized classifiers on the independent test set was no better than chance. The nested CV procedure reduces the bias considerably and gives an estimate of the error that is very close to that obtained on the independent testing set for both Shrunken Centroids and SVM classifiers for "null" and "non-null" data distributions. We show that using CV to compute an error estimate for a classifier that has itself been tuned using CV gives a significantly biased estimate of the true error. Proper use of CV for estimating true error of a classifier developed using a well defined algorithm requires that all steps of the algorithm, including classifier parameter tuning, be repeated in each CV loop. A nested CV procedure provides an almost unbiased estimate of the true error.
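The recommended nested procedure maps directly onto standard tooling. A hedged sketch with scikit-learn (a stand-in for the paper's own code, with illustrative parameter grids and fold counts): GridSearchCV performs the inner tuning loop and cross_val_score the outer error-estimation loop, on a synthetic "null" dataset where the honest error should hover near 50%.

```python
# Hedged sketch of nested CV: tuning happens inside every outer fold.
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))            # features carry no class signal
y = np.repeat([0, 1], 20)                 # arbitrary labels: a "null" dataset

inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10],
                             "gamma": [1e-4, 1e-3, "scale"]}, cv=5)
outer_err = 1 - cross_val_score(inner, X, y, cv=10).mean()
print(f"nested CV error on null data: {outer_err:.2f}")  # close to 0.5
```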

1,314 citations


Journal ArticleDOI
TL;DR: The unique algorithms STEM implements to cluster and compare short time series gene expression data combined with its visualization capabilities and integration with the Gene Ontology should make STEM useful in the analysis of data from a significant portion of all microarray studies.
Abstract: Time series microarray experiments are widely used to study dynamical biological processes. Due to the cost of microarray experiments, and also in some cases the limited availability of biological material, about 80% of microarray time series experiments are short (3–8 time points). Previously, short time series gene expression data have mainly been analyzed using more general gene expression analysis tools not designed for the unique challenges and opportunities inherent in short time series gene expression data. We introduce the Short Time-series Expression Miner (STEM), the first software program specifically designed for the analysis of short time series microarray gene expression data. STEM implements unique methods to cluster, compare, and visualize such data. STEM also supports efficient and statistically rigorous biological interpretations of short time series data through its integration with the Gene Ontology. The unique algorithms STEM implements to cluster and compare short time series gene expression data, combined with its visualization capabilities and integration with the Gene Ontology, should make STEM useful in the analysis of data from a significant portion of all microarray studies. STEM is available for download for free to academic and non-profit users at http://www.cs.cmu.edu/~jernst/stem.
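A hedged sketch of STEM's central idea, with details deliberately simplified: enumerate a small set of model profiles built from unit changes between consecutive time points, then assign each (non-constant) gene to its best-correlated profile. STEM additionally selects a maximally distinct profile subset and assesses cluster significance by permutation, which this sketch omits.

```python
# Hedged sketch of model-profile clustering for short time series.
import numpy as np
from itertools import product

def model_profiles(n_timepoints, max_unit=1):
    steps = range(-max_unit, max_unit + 1)
    profiles = {tuple(np.cumsum((0,) + p))
                for p in product(steps, repeat=n_timepoints - 1)}
    profiles.discard((0,) * n_timepoints)          # drop the flat profile
    return [np.array(p, float) for p in profiles]

def assign(expr, profiles):
    """expr: genes x timepoints array; rows assumed non-constant."""
    corr = np.array([[np.corrcoef(g, p)[0, 1] for p in profiles]
                     for g in expr])
    return corr.argmax(axis=1)                     # best profile per gene
```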

1,201 citations


Journal ArticleDOI
TL;DR: TreeDyn is a tree visualization and annotation tool that combines tree manipulation and annotation with meta-information, using dynamic graphical operators and scripting, to support the analysis and annotation of single trees or tree collections.
Abstract: Analyses of biomolecules for biodiversity, phylogeny or structure/function studies often use graphical tree representations. Many powerful tree editors are now available, but existing tree visualization tools make little use of meta-information related to the entities under study, such as taxonomic descriptions or gene functions, that can hardly be encoded within the tree itself (if using popular tree formats). Consequently, a tedious manual analysis and post-processing of the tree graphics are required if one needs to use external information for displaying or investigating trees. We have developed TreeDyn, a tool using annotations and dynamic graphical methods for editing and analyzing multiple trees. The main features of TreeDyn are 1) the management of multiple windows and multiple trees per window, 2) the export of graphics to several standard file formats with or without HTML encapsulation and a new format called TGF, which enables saving and restoring graphical analyses, 3) the projection of texts or symbols facing leaf labels or linked to nodes, through manual pasting or by using annotation files, 4) the highlighting of graphical elements after querying leaf labels (or annotations) or by selection of graphical elements and information extraction, 5) the highlighting of targeted trees according to a source tree browsed by the user, 6) powerful scripts for automating repetitive graphical tasks, 7) a command line interpreter enabling the use of TreeDyn through CGI scripts for online building of trees, and 8) the inclusion of a library of packages dedicated to specific research fields involving trees. TreeDyn is a tree visualization and annotation tool that combines tree manipulation and annotation with meta-information, through dynamic graphical operators and scripting, to support analyses and annotations of single trees or tree collections.

1,014 citations


Journal ArticleDOI
TL;DR: A new method, combining robust nonlinear regression with a new outlier-identification procedure, identifies outliers from nonlinear curve fits with reasonable power and few false positives.
Abstract: Nonlinear regression, like linear regression, assumes that the scatter of data around the ideal curve follows a Gaussian or normal distribution. This assumption leads to the familiar goal of regression: to minimize the sum of the squares of the vertical or Y-value distances between the points and the curve. Outliers can dominate the sum-of-the-squares calculation and lead to misleading results. However, we know of no practical method for routinely identifying outliers when fitting curves with nonlinear regression. We describe a new method for identifying outliers when fitting data with nonlinear regression. We first fit the data using a robust form of nonlinear regression, based on the assumption that scatter follows a Lorentzian distribution. We devised a new adaptive method that gradually becomes more robust as the method proceeds. To define outliers, we adapted the false discovery rate approach to handling multiple comparisons. We then remove the outliers, and analyze the data using ordinary least-squares regression. Because the method combines robust regression and outlier removal, we call it the ROUT method. When analyzing simulated data, where all scatter is Gaussian, our method detects (falsely) one or more outliers in only about 1–3% of experiments. When analyzing data contaminated with one or several outliers, the ROUT method performs well at outlier identification, with an average False Discovery Rate less than 1%. Our method, which combines a new method of robust nonlinear regression with a new method of outlier identification, identifies outliers from nonlinear curve fits with reasonable power and few false positives.
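A rough Python rendition of the ROUT pipeline under stated assumptions: scipy's Cauchy loss stands in for the Lorentzian-based robust fit, and a Benjamini-Hochberg test on normal-theory residual p-values stands in for the paper's exact FDR-based outlier rule.

```python
# Hedged sketch of ROUT: robust fit -> FDR outlier removal -> OLS refit.
import numpy as np
from scipy import stats
from scipy.optimize import least_squares

def rout_fit(model, p0, x, y, q=0.01):
    robust = least_squares(lambda p: model(x, *p) - y, p0, loss="cauchy")
    resid = y - model(x, *robust.x)
    scale = stats.median_abs_deviation(resid, scale="normal")  # robust sigma
    pvals = 2 * stats.norm.sf(np.abs(resid) / scale)
    order = np.argsort(pvals)
    bh_line = q * np.arange(1, len(y) + 1) / len(y)            # BH thresholds
    passed = pvals[order] <= bh_line
    n_out = passed.nonzero()[0].max() + 1 if passed.any() else 0
    keep = np.ones(len(y), bool)
    keep[order[:n_out]] = False                                # flag outliers
    ols = least_squares(lambda p: model(x[keep], *p) - y[keep], p0)
    return ols.x, ~keep                            # parameters, outlier mask
```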

981 citations


Journal ArticleDOI
TL;DR: Sensitive probabilistic modeling of extrinsic evidence such as sequence database matches can increase gene prediction accuracy and be used to extend the ab initio gene prediction program AUGUSTUS to a versatile tool that is able to predict genes under user-defined constraints.
Abstract: Background In order to improve gene prediction, extrinsic evidence on the gene structure can be collected from various sources of information such as genome-genome comparisons and EST and protein alignments. However, such evidence is often incomplete and usually uncertain: it is rarely sufficient to recover the complete structure of every gene, and the available evidence is often unreliable. Therefore extrinsic evidence is most valuable when it is balanced with sequence-intrinsic evidence.

946 citations


Journal ArticleDOI
TL;DR: The analysis of high-throughput data supports the superiority of MCL for the extraction of complexes from interaction networks, and shows that MCL is remarkably robust to graph alterations.
Abstract: Protein interactions are crucial components of all cellular processes. Recently, high-throughput methods have been developed to obtain a global description of the interactome (the whole network of protein interactions for a given organism). In 2002, the yeast interactome was estimated to contain up to 80,000 potential interactions. This estimate is based on the integration of data sets obtained by various methods (mass spectrometry, two-hybrid methods, genetic studies). High-throughput methods are known, however, to yield a non-negligible rate of false positives, and to miss a fraction of existing interactions. The interactome can be represented as a graph where nodes correspond to proteins and edges to pairwise interactions. In recent years, clustering methods have been developed and applied in order to extract relevant modules from such graphs. These algorithms require the specification of parameters that may drastically affect the results. In this paper we present a comparative assessment of four algorithms: Markov Clustering (MCL), Restricted Neighborhood Search Clustering (RNSC), Super Paramagnetic Clustering (SPC), and Molecular Complex Detection (MCODE). A test graph was built on the basis of 220 complexes annotated in the MIPS database. To evaluate the robustness to false positives and false negatives, we derived 41 altered graphs by randomly removing edges from or adding edges to the test graph in various proportions. Each clustering algorithm was applied to these graphs with various parameter settings, and the clusters were compared with the annotated complexes. We analyzed the sensitivity of the algorithms to the parameters and determined their optimal parameter values. We also evaluated their robustness to alterations of the test graph. We then applied the four algorithms to six graphs obtained from high-throughput experiments and compared the resulting clusters with the annotated complexes. This analysis shows that MCL is remarkably robust to graph alterations. In the tests of robustness, RNSC is more sensitive to edge deletion but less sensitive to the use of suboptimal parameter values. The other two algorithms are clearly weaker under most conditions. The analysis of high-throughput data supports the superiority of MCL for the extraction of complexes from interaction networks.
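Since MCL emerges as the most robust of the four, here is a compact, hedged sketch of the algorithm itself: alternate expansion (squaring the column-stochastic matrix) with inflation (element-wise powering and renormalization) until convergence, then read clusters off the attractor rows. The defaults and the convergence test are illustrative.

```python
# Hedged sketch of Markov Clustering (MCL) on a small adjacency matrix.
import numpy as np

def mcl(adj, inflation=2.0, max_iter=100, tol=1e-6):
    """adj: symmetric 0/1 numpy array. Returns clusters as sets of node ids."""
    m = adj.astype(float) + np.eye(len(adj))   # self-loops stabilize iteration
    m /= m.sum(axis=0)                         # make columns stochastic
    for _ in range(max_iter):
        prev = m
        m = np.linalg.matrix_power(m, 2)       # expansion
        m = m ** inflation                     # inflation
        m /= m.sum(axis=0)
        if np.abs(m - prev).max() < tol:
            break
    # non-empty rows of the limit matrix are attractors; their support
    # defines the clusters (duplicates removed)
    clusters = [set(np.nonzero(row > tol)[0]) for row in m if row.max() > tol]
    return [c for i, c in enumerate(clusters) if c not in clusters[:i]]
```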

863 citations


Journal ArticleDOI
TL;DR: The VSL2 predictors are applicable to disordered regions of any length and can accurately identify the short disordered regions that are often misclassified by the previous disorder predictors.
Abstract: Due to the functional importance of intrinsically disordered proteins or protein regions, prediction of intrinsic protein disorder from amino acid sequence has become an area of active research, as witnessed in the 6th experiment on Critical Assessment of Techniques for Protein Structure Prediction (CASP6). Since the initial work by Romero et al. (Identifying disordered regions in proteins from amino acid sequences, IEEE Int. Conf. Neural Netw., 1997), our group has developed several predictors optimized for long disordered regions (>30 residues) with prediction accuracy exceeding 85%. However, these predictors are less successful on short disordered regions (≤30 residues). A probable cause is the length dependence of the amino acid compositions and sequence properties of disordered regions. We proposed two new predictor models, VSL2-M1 and VSL2-M2, to address this length-dependency problem in prediction of intrinsic protein disorder. These two predictors are similar to the original VSL1 predictor used in the CASP6 experiment. In both models, two specialized predictors were first built and optimized for short (≤30 residues) and long disordered regions (>30 residues), respectively. A meta predictor was then trained to integrate the specialized predictors into the final predictor model. As the 10-fold cross-validation results showed, the VSL2 predictors achieved well-balanced prediction accuracies of 81% on both short and long disordered regions. Comparisons over the VSL2 training dataset via 10-fold cross-validation and a blind-test set of unrelated recent PDB chains indicated that VSL2 predictors were significantly more accurate than several existing predictors of intrinsic protein disorder. The VSL2 predictors are applicable to disordered regions of any length and can accurately identify the short disordered regions that are often misclassified by our previous disorder predictors. The success of the VSL2 predictors further confirmed the previously observed differences in amino acid compositions and sequence properties between short and long disordered regions, and justified our approaches for modelling short and long disordered regions separately. The VSL2 predictors are freely accessible for non-commercial use at http://www.ist.temple.edu/disprot/predictorVSL2.php

Journal ArticleDOI
TL;DR: The software tools RepbaseSubmitter and Censor, designed to facilitate updating and screening the content of Repbase, are described and made available.
Abstract: Repbase is a reference database of eukaryotic repetitive DNA, which includes prototypic sequences of repeats and basic information described in annotations. Updating and maintenance of the database requires specialized tools, which we have created and made available for use with Repbase, and which may be useful as a template for other curated databases.

Journal ArticleDOI
TL;DR: SIMMAP has been developed to implement stochastic character mapping that is useful to molecular evolutionists, systematists, and bioinformaticians, and it enables users to address questions that require mapping characters onto phylogenies using a probabilistic approach that does not rely on parsimony.
Abstract: Character mapping on phylogenies has played an important, if not critical, role in our understanding of molecular, morphological, and behavioral evolution. Until very recently we have relied on parsimony to infer character changes. Parsimony has a number of serious limitations that hinder our understanding. Recently developed statistical methods free us from these limitations by accommodating uncertainty in evolutionary time, ancestral states, and the phylogeny. SIMMAP has been developed to implement stochastic character mapping that is useful to molecular evolutionists, systematists, and bioinformaticians. Researchers can address questions about positive selection, patterns of amino acid substitution, character association, and patterns of morphological evolution. Stochastic character mapping, as implemented in the SIMMAP software, enables users to address questions that require mapping characters onto phylogenies using a probabilistic approach that does not rely on parsimony. Analyses can be performed using a fully Bayesian approach that is not reliant on considering a single topology, set of substitution model parameters, or reconstruction of ancestral states. Uncertainty in these quantities is accommodated by using MCMC samples from their respective posterior distributions.

Journal ArticleDOI
TL;DR: In this paper, it is shown that combining two complementary computational strategies for conservation-based motif discovery improves the ability to identify the specificity of transcriptional regulators from genome-wide chromatin immunoprecipitation data.
Abstract: The regulatory map of a genome consists of the binding sites for proteins that determine the transcription of nearby genes. An initial regulatory map for S. cerevisiae was recently published using six motif discovery programs to analyze genome-wide chromatin immunoprecipitation data for 203 transcription factors. The programs were used to identify sequence motifs that were likely to correspond to the DNA-binding specificity of the immunoprecipitated proteins. We report improved versions of two conservation-based motif discovery algorithms, PhyloCon and Converge. Using these programs, we create a refined regulatory map for S. cerevisiae by reanalyzing the same chromatin immunoprecipitation data. Applying the same conservative criteria that were applied in the original study, we find that PhyloCon and Converge each separately discover more known specificities than the combination of all six programs in the previous study. Combining the results of PhyloCon and Converge, we discover significant sequence motifs for 36 transcription factors that were previously missed. The new set of motifs identifies 636 more regulatory interactions than the previous one. The new network contains 28% more regulatory interactions among transcription factors, evidence of greater cross-talk between regulators. Combining two complementary computational strategies for conservation-based motif discovery improves the ability to identify the specificity of transcriptional regulators from genome-wide chromatin immunoprecipitation data. The increased sensitivity of these methods significantly expands the map of yeast regulatory sites without the need to alter any of the thresholds for statistical significance. The new map of regulatory sites reveals a more elaborate and complex view of the yeast genetic regulatory network than was observed previously.

Journal ArticleDOI
TL;DR: A new method for comparing sets of GO terms and for assessing the functional similarity of gene products is presented; the proposed map of the functional space provides a new global view on the functional relationships between gene products or protein families.
Abstract: Gene Ontology (GO) is a standard vocabulary of functional terms and allows for coherent annotation of gene products. These annotations provide a basis for new methods that compare gene products regarding their molecular function and biological role. We present a new method for comparing sets of GO terms and for assessing the functional similarity of gene products. The method relies on two semantic similarity measures: simRel and funSim. One measure (simRel) is applied in the comparison of the biological processes found in different groups of organisms. The other measure (funSim) is used to find functionally related gene products within the same or between different genomes. Results indicate that the method, in addition to being in good agreement with established sequence similarity approaches, also provides a means for the identification of functionally related proteins independent of evolutionary relationships. The method is also applied to estimating functional similarity between all proteins in Saccharomyces cerevisiae and to visualizing the molecular function space of yeast in a map of the functional space. A similar approach is used to visualize the functional relationships between protein families. The approach enables the comparison of the underlying molecular biology of different taxonomic groups and provides a new comparative genomics tool identifying functionally related gene products independent of homology. The proposed map of the functional space provides a new global view on the functional relationships between gene products or protein families.
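The simRel measure has a short closed form: Lin's similarity between two terms, computed from the most informative common ancestor (MICA), down-weighted by how common that ancestor is. A minimal sketch, assuming annotation probabilities p(c) and ancestor sets have been precomputed from the GO graph and that terms are more specific than the root (p < 1):

```python
# Hedged sketch of the simRel semantic similarity between two GO terms.
import math

def sim_rel(c1, c2, p, ancestors):
    """p: term -> annotation probability in (0, 1); ancestors: term -> set."""
    common = (ancestors[c1] | {c1}) & (ancestors[c2] | {c2})
    p_mica = min(p[c] for c in common)        # rarest shared ancestor = MICA
    lin = 2 * math.log(p_mica) / (math.log(p[c1]) + math.log(p[c2]))
    return lin * (1 - p_mica)                 # penalize very common ancestors
```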

Journal ArticleDOI
TL;DR: This report demonstrates the necessity of isotope abundance information by mathematical confirmation of the concept and concludes that a mass spectrometer capable of 3 ppm mass accuracy and 2% error for isotopic abundance patterns outperforms mass spectrometers with less than 1 ppm mass errors.
Abstract: Metabolomic studies are targeted at identifying and quantifying all metabolites in a given biological context. Among the tools used for metabolomic research, mass spectrometry is one of the most powerful. However, metabolomics by mass spectrometry always reveals a high number of unknown compounds which complicate in-depth mechanistic or biochemical understanding. In principle, mass spectrometry can be utilized within strategies of de novo structure elucidation of small molecules, starting with the computation of the elemental composition of an unknown metabolite from its accurate mass. However, even high mass accuracy alone does not exclude enough candidate formulas; using isotope abundance patterns as a single further constraint removes more than 95% of false candidates. This orthogonal filter can condense several thousand candidates down to only a small number of molecular formulas. Example calculations for 10, 5, 3, 1 and 0.1 ppm mass accuracy are given. Corresponding software scripts can be downloaded from http://fiehnlab.ucdavis.edu. A comparison of eight chemical databases revealed that PubChem and the Dictionary of Natural Products can be recommended for automatic queries using molecular formulae. More than 1.6 million molecular formulae in the range 0–500 Da were generated in an exhaustive manner under strict observation of mathematical and chemical rules. Assuming that ion species are fully resolved (either by chromatography or by high resolution mass spectrometry), we conclude that a mass spectrometer capable of 3 ppm mass accuracy and 2% error for isotopic abundance patterns outperforms mass spectrometers with less than 1 ppm mass accuracy or even hypothetical mass spectrometers with 0.1 ppm mass accuracy that do not include isotope information in the calculation of molecular formulae.
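The first step the paper analyzes, candidate formula generation within a ppm window, is easy to sketch. The brute-force CHNO enumeration below is an illustrative simplification; the paper's scripts cover more elements, enforce chemical rules, and add the decisive isotope-pattern filter.

```python
# Hedged sketch: enumerate CHNO formulas within a ppm window of a mass.
from itertools import product

MASSES = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}

def candidate_formulas(mass, ppm=3.0, max_atoms=(40, 80, 10, 20)):
    tol = mass * ppm * 1e-6                       # ppm -> absolute Da window
    hits = []
    for c, h, n, o in product(*(range(m + 1) for m in max_atoms)):
        m = (c * MASSES["C"] + h * MASSES["H"]
             + n * MASSES["N"] + o * MASSES["O"])
        if abs(m - mass) <= tol:
            hits.append((f"C{c}H{h}N{n}O{o}", m))
    return hits

print(candidate_formulas(180.06339, ppm=3))  # glucose C6H12O6 among the hits
```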

Journal ArticleDOI
TL;DR: A web-based tool to predict the population coverage of T-cell epitope-based diagnostics and vaccines from MHC binding and/or T-cell restriction data is developed, designed to maximize population coverage while minimizing complexity.
Abstract: Background T cells recognize a complex between a specific major histocompatibility complex (MHC) molecule and a particular pathogen-derived epitope. A given epitope will elicit a response only in individuals that express an MHC molecule capable of binding that particular epitope. MHC molecules are extremely polymorphic and over a thousand different human MHC (HLA) alleles are known. A disproportionate amount of MHC polymorphism occurs in positions constituting the peptide-binding region, and as a result, MHC molecules exhibit a widely varying binding specificity. In the design of peptide-based vaccines and diagnostics, the issue of population coverage in relation to MHC polymorphism is further complicated by the fact that different HLA types are expressed at dramatically different frequencies in different ethnicities. Thus, without careful consideration, a vaccine or diagnostic with ethnically biased population coverage could result.
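The coverage arithmetic behind such a tool can be sketched as follows, under a Hardy-Weinberg assumption: at each locus, a random individual fails to carry any covered allele with probability (1 - f)^2, where f is the summed frequency of covered alleles, and overall coverage is one minus the product of these failure probabilities across loci. This is a simplified stand-in for the tool's full calculation, which also considers the number of epitope hits per individual.

```python
# Hedged sketch of population coverage from covered-allele frequencies.
def population_coverage(covered_freqs_per_locus):
    p_no_response = 1.0
    for freqs in covered_freqs_per_locus:
        f = sum(freqs)                      # covered allele frequency at locus
        p_no_response *= (1 - f) ** 2       # neither chromosome carries one
    return 1 - p_no_response

# e.g. two loci, covered alleles summing to 40% and 25% (illustrative):
print(population_coverage([[0.25, 0.15], [0.25]]))  # ~0.80
```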

Journal ArticleDOI
TL;DR: UCSF Chimera includes a suite of tools for interactive analyses of sequences and structures, which can be used with a researcher's own data, including sequence alignments and annotations, closely or distantly related sets of proteins, etc.
Abstract: Comparing related structures and viewing the structures in the context of sequence alignments are important tasks in protein structure-function research. While many programs exist for individual aspects of such work, there is a need for interactive visualization tools that: (a) provide a deep integration of sequence and structure, far beyond mapping where a sequence region falls in the structure and vice versa; (b) facilitate changing data of one type based on the other (for example, using only sequence-conserved residues to match structures, or adjusting a sequence alignment based on spatial fit); (c) can be used with a researcher's own data, including arbitrary sequence alignments and annotations, closely or distantly related sets of proteins, etc.; and (d) interoperate with each other and with a full complement of molecular graphics features. We describe enhancements to UCSF Chimera to achieve these goals. The molecular graphics program UCSF Chimera includes a suite of tools for interactive analyses of sequences and structures. Structures automatically associate with sequences in imported alignments, allowing many kinds of crosstalk. A novel method is provided to superimpose structures in the absence of a pre-existing sequence alignment. The method uses both sequence and secondary structure, and can match even structures with very low sequence identity. Another tool constructs structure-based sequence alignments from superpositions of two or more proteins. Chimera is designed to be extensible, and mechanisms for incorporating user-specific data without Chimera code development are also provided. The tools described here apply to many problems involving comparison and analysis of protein structures and their sequences. Chimera includes complete documentation and is intended for use by a wide range of scientists, not just those in the computational disciplines. UCSF Chimera is free for non-commercial use and is available for Microsoft Windows, Apple Mac OS X, Linux, and other platforms from http://www.cgl.ucsf.edu/chimera .

Journal ArticleDOI
TL;DR: An algorithm for the rapid, accurate and automated identification of paths leading from buried protein clefts, pockets and cavities in dynamic and static protein structures to the outside solvent is developed and implemented for the CAVER program.
Abstract: Background: The main aim of this study was to develop and implement an algorithm for the rapid, accurate and automated identification of paths leading from buried protein clefts, pockets and cavities in dynamic and static protein structures to the outside solvent. Results: The algorithm to perform a skeleton search was based on a reciprocal distance function grid that was developed and implemented for the CAVER program. The program identifies and visualizes routes from the interior of the protein to the bulk solvent. CAVER was primarily developed for proteins, but the algorithm is sufficiently robust to allow the analysis of any molecular system, including nucleic acids or inorganic material. Calculations can be performed using discrete structures from crystallographic analysis and NMR experiments as well as with trajectories from molecular dynamics simulations. The fully functional program is available as a stand-alone version and as a plug-in for the molecular modeling program PyMOL. Additionally, selected functions are accessible in an online version. Conclusion: The algorithm developed automatically finds the path from a starting point located within the interior of a protein. The algorithm is sufficiently rapid and robust to enable routine analysis of molecular dynamics trajectories containing thousands of snapshots. The algorithm is based on reciprocal metrics and provides an easy method to find a centerline, i.e. the spine, of complicated objects such as a protein tunnel. It can also be applied to many other molecules. CAVER is freely available from the web site http://loschmidt.chemi.muni.cz/caver/.
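A hedged sketch of the search at CAVER's core: on a discretized grid where each node is weighted by the reciprocal of its distance to the nearest atom, a Dijkstra search from the buried starting point reaches the surface along the widest (cheapest) tunnel. Grid construction, the exact cost function, and surface detection are simplified assumptions here.

```python
# Hedged sketch: Dijkstra escape search on a reciprocal-distance grid.
import heapq

def cheapest_escape(cost, start, is_surface):
    """cost: dict grid node -> 1/d weight (d = distance to nearest atom);
    6-connected neighbors are implied by integer (x, y, z) coordinates."""
    dist, heap = {start: 0.0}, [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if is_surface(node):
            return d, node                     # widest tunnel found
        if d > dist.get(node, float("inf")):
            continue                           # stale heap entry
        x, y, z = node
        for nb in ((x+1, y, z), (x-1, y, z), (x, y+1, z),
                   (x, y-1, z), (x, y, z+1), (x, y, z-1)):
            if nb in cost and d + cost[nb] < dist.get(nb, float("inf")):
                dist[nb] = d + cost[nb]
                heapq.heappush(heap, (dist[nb], nb))
    return None                                # no route to the surface
```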

Journal ArticleDOI
TL;DR: VANTED greatly helps researchers in the analysis and interpretation of biochemical data, and thus is a useful tool for modern biological research.
Abstract: Background Recent advances with high-throughput methods in life-science research have increased the need for automatized data analysis and visual exploration techniques. Sophisticated bioinformatics tools are essential to deduct biologically meaningful interpretations from the large amount of experimental data, and help to understand biological processes.

Journal ArticleDOI
TL;DR: The combination of SNP impact analysis, a knowledge-based network of gene relationships and candidate genes, and access to a wide range of data and literature allows a user to quickly assimilate available information and so develop models of gene-pathway-disease interaction.
Abstract: The relationship between disease susceptibility and genetic variation is complex, and many different types of data are relevant. We describe a web resource and database that provides and integrates as much information as possible on disease/gene relationships at the molecular level. The resource http://www.SNPs3D.org has three primary modules. One module identifies which genes are candidates for involvement in a specified disease. A second module provides information about the relationships between sets of candidate genes. The third module analyzes the likely impact of non-synonymous SNPs on protein function. Disease/candidate gene relationships and gene-gene relationships are derived from the literature using simple but effective text profiling. SNP/protein function relationships are derived by two methods, one using principles of protein structure and stability, the other based on sequence conservation. Entries for each gene include a number of links to other data, such as expression profiles, pathway context, mouse knockout information and papers. Gene-gene interactions are presented in an interactive graphical interface, providing rapid access to the underlying information, as well as convenient navigation through the network. Use of the resource is illustrated with aspects of the inflammatory response and hypertension. The combination of SNP impact analysis, a knowledge-based network of gene relationships and candidate genes, and access to a wide range of data and literature allows a user to quickly assimilate available information, and so develop models of gene-pathway-disease interaction.

Journal ArticleDOI
TL;DR: The proposed algorithm makes it possible to detect clusters of proteins in PPI networks that mostly represent molecular biological functional units; such clusters can help to predict the functions of proteins and are useful for understanding and explaining certain biological processes.
Abstract: After complete sequencing of a number of genomes, the focus has now turned to proteomics. Advanced proteomics technologies such as the two-hybrid assay and mass spectrometry are producing huge data sets of protein-protein interactions which can be portrayed as networks, and one of the burning issues is to find protein complexes in such networks. The enormous size of protein-protein interaction (PPI) networks warrants development of efficient computational methods for extraction of significant complexes. This paper presents an algorithm for detection of protein complexes in large interaction networks. In a PPI network, a node represents a protein and an edge represents an interaction. The input to the algorithm is the associated matrix of an interaction network and the outputs are protein complexes. The complexes are determined by way of finding clusters, i.e. the densely connected regions in the network. We also show and analyze some protein complexes generated by the proposed algorithm from typical PPI networks of Escherichia coli and Saccharomyces cerevisiae. A comparison between a PPI and a random network is also performed in the context of the proposed algorithm. The proposed algorithm makes it possible to detect clusters of proteins in PPI networks that mostly represent molecular biological functional units. Protein complexes determined solely from interaction data can therefore help to predict the functions of proteins and are useful for understanding and explaining certain biological processes.
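The general strategy of extracting densely connected regions from the adjacency matrix can be illustrated with a simple seed-and-grow sketch; the density threshold, the degree-ordered growth rule, and the restriction to a seed's direct neighbors are assumptions for brevity, not the paper's exact criteria.

```python
# Hedged sketch: grow dense clusters from high-degree seeds in a PPI graph.
import numpy as np

def density(adj, nodes):
    k = len(nodes)
    if k < 2:
        return 1.0
    sub = adj[np.ix_(nodes, nodes)]           # adj: symmetric 0/1, zero diag
    return sub.sum() / (k * (k - 1))          # fraction of possible edges

def grow_complexes(adj, min_density=0.5, min_size=3):
    unused = set(range(len(adj)))
    complexes = []
    for seed in sorted(unused, key=lambda v: -adj[v].sum()):  # high degree first
        if seed not in unused:
            continue
        cluster = [seed]
        candidates = set(np.nonzero(adj[seed])[0]) & unused
        for v in sorted(candidates, key=lambda v: -adj[v].sum()):
            if density(adj, cluster + [v]) >= min_density:
                cluster.append(v)
        if len(cluster) >= min_size:
            complexes.append(cluster)
            unused -= set(cluster)
    return complexes
```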

Journal ArticleDOI
TL;DR: This work proposes formalisms and methods, relying on adapted and partially newly introduced approaches, which facilitate a structural analysis of signaling and regulatory networks with a focus on functional aspects, and proposes a formalism based on logical interaction hypergraphs, which facilitates a logical steady state analysis (LSSA).
Abstract: Structural analysis of cellular interaction networks contributes to a deeper understanding of network-wide interdependencies, causal relationships, and basic functional capabilities. While the structural analysis of metabolic networks is a well-established field, similar methodologies have been scarcely developed and applied to signaling and regulatory networks. We propose formalisms and methods, relying on adapted and partially newly introduced approaches, which facilitate a structural analysis of signaling and regulatory networks with a focus on functional aspects. We use two different formalisms to represent and analyze interaction networks: interaction graphs and (logical) interaction hypergraphs. We show that, in interaction graphs, the determination of feedback cycles and of all the signaling paths between any pair of species is equivalent to the computation of elementary modes known from metabolic networks. Knowledge on the set of signaling paths and feedback loops facilitates the computation of intervention strategies and the classification of compounds into activators, inhibitors, ambivalent factors, and non-affecting factors with respect to a certain species. In some cases, qualitative effects induced by perturbations can be unambiguously predicted from the network scheme. Interaction graphs, however, are not able to capture the AND relationships that frequently occur in interaction networks. The consequent logical concatenation of all the arcs pointing into a species leads to Boolean networks. For a Boolean representation of cellular interaction networks we propose a formalism based on logical (or signed) interaction hypergraphs, which facilitates in particular a logical steady state analysis (LSSA). LSSA enables studies on the logical processing of signals and the identification of optimal intervention points (targets) in cellular networks. LSSA also reveals network regions whose parametrization and initial states are crucial for the dynamic behavior. We have implemented these methods in our software tool CellNetAnalyzer (successor of FluxAnalyzer) and illustrate their applicability using a logical model of T-cell receptor signaling, providing non-intuitive results regarding feedback loops, essential elements, and (logical) signal processing upon different stimuli. The methods and formalisms we propose herein are another step towards the comprehensive functional analysis of cellular interaction networks. Their potential, shown on a realistic T-cell signaling model, makes them a promising tool.
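A logical steady state analysis reduces, in the simplest case, to clamping the input species and iterating the Boolean update rules to a fixed point. The toy network below is purely illustrative; it is not the paper's T-cell model, and CellNetAnalyzer's actual LSSA also handles partially determined states.

```python
# Hedged sketch: fixed-point iteration of a Boolean network with clamped inputs.
def logical_steady_state(rules, clamped, n_iter=100):
    """rules: node -> function(state dict) -> bool; clamped: fixed inputs."""
    state = dict(clamped)
    for _ in range(n_iter):
        new = dict(state)
        for node, fn in rules.items():
            if node not in clamped:
                new[node] = fn(state)
        if new == state:
            return state                      # logical steady state reached
        state = new
    return None                               # no fixed point within n_iter

rules = {                                     # B = A AND NOT I; C = B
    "B": lambda s: s.get("A", False) and not s.get("I", False),
    "C": lambda s: s.get("B", False),
}
print(logical_steady_state(rules, {"A": True, "I": False}))  # B, C turn on
```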

Journal ArticleDOI
TL;DR: Gene Designer is a software tool that drastically facilitates the design of synthetic genes and is an expandable Synthetic Biology workbench suitable for molecular biologists interested in the de novo creation of genetic constructs.
Abstract: Direct synthesis of genes is rapidly becoming the most efficient way to make functional genetic constructs and enables applications such as codon optimization, RNAi resistant genes and protein engineering. Here we introduce a software tool that drastically facilitates the design of synthetic genes. Gene Designer is a stand-alone software for fast and easy design of synthetic DNA segments. Users can easily add, edit and combine genetic elements such as promoters, open reading frames and tags through an intuitive drag-and-drop graphic interface and a hierarchical DNA/Protein object map. Using advanced optimization algorithms, open reading frames within the DNA construct can readily be codon optimized for protein expression in any host organism. Gene Designer also includes features such as a real-time sliding calculator of oligonucleotide annealing temperatures, sequencing primer generator, tools for avoidance or inclusion of restriction sites, and options to maximize or minimize sequence identity to a reference. Gene Designer is an expandable Synthetic Biology workbench suitable for molecular biologists interested in the de novo creation of genetic constructs.
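The most basic optimization this class of tools performs, back-translating a protein with host-preferred codons, fits in a few lines. The mini codon table below is a small illustrative subset of E. coli preferences; Gene Designer's actual optimizer also weighs restriction sites, repeats, and user-defined constraints.

```python
# Hedged sketch: naive codon optimization via a preferred-codon table.
PREFERRED_ECOLI = {"M": "ATG", "A": "GCG", "K": "AAA", "L": "CTG",
                   "S": "AGC", "*": "TAA"}   # illustrative subset only

def codon_optimize(protein, table=PREFERRED_ECOLI):
    return "".join(table[aa] for aa in protein)

print(codon_optimize("MKLAS*"))  # ATGAAACTGGCGAGCTAA
```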

Journal ArticleDOI
TL;DR: Fast-FIND is easy to implement, takes less than a second to search the entire Drosophila genome sequence for arbitrary patterns adjacent to sites of alternative polyadenylation, and is sufficiently fast to allow sensitivity analysis on the patterns.
Abstract: Many vital biological processes, including transcription and splicing, require a combination of short, degenerate sequence patterns, or motifs, adjacent to defined sequence features. Although these motifs occur frequently by chance, they only have biological meaning within a specific context. Identifying transcripts that contain meaningful combinations of patterns is thus an important problem, which existing tools address poorly. Here we present a new approach, Fast-FIND (Fast-Fully Indexed Nucleotide Database), that uses a relational database to support rapid indexed searches for arbitrary combinations of patterns defined either by sequence or composition. Fast-FIND is easy to implement, takes less than a second to search the entire Drosophila genome sequence for arbitrary patterns adjacent to sites of alternative polyadenylation, and is sufficiently fast to allow sensitivity analysis on the patterns. We have applied this approach to identify transcripts that contain combinations of sequence motifs for RNA-binding proteins that may regulate alternative polyadenylation. Fast-FIND provides an efficient way to identify transcripts that are potentially regulated via alternative polyadenylation. We have used it to generate hypotheses about interactions between specific polyadenylation factors, which we will test experimentally.
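A hedged sketch of the relational idea: index every motif occurrence once, keyed by motif and by position relative to the feature of interest, so that arbitrary motif combinations become indexed joins rather than genome rescans. The schema, the motifs (AAUAAA, UGUA), and the position windows are illustrative assumptions, not Fast-FIND's actual layout.

```python
# Hedged sketch: motif-combination queries as indexed SQL joins.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE hits (transcript TEXT, motif TEXT, pos INT)")
db.execute("CREATE INDEX idx_hits ON hits (motif, transcript, pos)")
# ... populate by scanning the genome once per motif, recording positions
#     relative to each transcript's polyadenylation site ...
rows = db.execute("""
    SELECT DISTINCT a.transcript
    FROM hits a JOIN hits b USING (transcript)
    WHERE a.motif = 'AAUAAA' AND a.pos BETWEEN -50 AND 0
      AND b.motif = 'UGUA'   AND b.pos BETWEEN -100 AND -50
""").fetchall()
```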

Journal ArticleDOI
TL;DR: An empirical study in which some of the most commonly used feature selection methods are compared on 9 publicly available datasets; it is reported that the choice of feature selection method, the number of genes in the gene list, the number of cases, and the noise in the dataset substantially influence classification success.
Abstract: Numerous feature selection methods have been applied to the identification of differentially expressed genes in microarray data. These include simple fold change, the classical t-statistic, and moderated t-statistics. Even though these methods return gene lists that are often dissimilar, few direct comparisons of them exist. We present an empirical study in which we compare some of the most commonly used feature selection methods. We apply these to 9 publicly available datasets and compare both the gene lists produced and how these perform in class prediction of test datasets.
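Two of the compared rankings, simple fold change and the classical t-statistic, can be sketched directly (the moderated t additionally shrinks per-gene variances toward a common prior, as in limma); log-scale expression values are assumed.

```python
# Hedged sketch: two gene rankings from a genes x samples expression matrix.
import numpy as np
from scipy import stats

def rank_genes(expr, labels):
    """labels: array of 0/1 class labels, one per sample column."""
    a, b = expr[:, labels == 0], expr[:, labels == 1]
    fold_change = a.mean(axis=1) - b.mean(axis=1)   # difference of log means
    t_stat, _ = stats.ttest_ind(a, b, axis=1)
    return (np.argsort(-np.abs(fold_change)),       # fold-change ranking
            np.argsort(-np.abs(t_stat)))            # t-statistic ranking
```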

Journal ArticleDOI
TL;DR: MultiSeq is a major extension of the Multiple Alignment tool provided as part of VMD, a structural visualization program for analyzing molecular dynamics simulations, and contains a new algorithm to generate complete evolutionary profiles that represent the topology of the molecular phylogenetic tree of a homologous group of distantly related proteins.
Abstract: Since the publication of the first draft of the human genome in 2000, bioinformatic data have been accumulating at an overwhelming pace. Currently, more than 3 million sequences and 35 thousand structures of proteins and nucleic acids are available in public databases. Finding correlations in and between these data to answer critical research questions is extremely challenging. This problem needs to be approached from several directions: information science to organize and search the data; information visualization to assist in recognizing correlations; mathematics to formulate statistical inferences; and biology to analyze chemical and physical properties in terms of sequence and structure changes.

Journal ArticleDOI
TL;DR: SIGI-HMM is a sensitive tool for the identification of GIs in microbial genomes that allows genomes to be analyzed interactively and in detail, and hypotheses about the origin of acquired genes to be generated or tested.
Abstract: Horizontal gene transfer (HGT) is considered a strong evolutionary force shaping the content of microbial genomes in a substantial manner. It is the difference in speed, enabling rapid adaptation to changing environmental demands, that distinguishes HGT from gene genesis, duplications or mutations. For a precise characterization, algorithms are needed that identify transfer events with high reliability. Frequently, the transferred pieces of DNA have a considerable length, comprise several genes, and are called genomic islands (GIs) or, more specifically, pathogenicity or symbiotic islands. We have implemented the program SIGI-HMM that predicts GIs and the putative donor of each individual alien gene. It is based on the analysis of codon usage (CU) of each individual gene of a genome under study. The CU of each gene is compared against a carefully selected set of CU tables representing microbial donors or highly expressed genes. Multiple tests are used to identify putatively alien genes, to predict putative donors and to mask putatively highly expressed genes. Thus, we determine the states and emission probabilities of an inhomogeneous hidden Markov model working at the gene level. For the transition probabilities, we draw upon classical test theory with the intention of integrating a sensitivity controller in a consistent manner. SIGI-HMM was written in Java and is publicly available. It accepts as input any file created according to the EMBL format. It generates output in the common GFF format readable by genome browsers. Benchmark tests showed that the output of SIGI-HMM is in agreement with known findings. Its predictions were both consistent with annotated GIs and with predictions generated by different methods. SIGI-HMM is a sensitive tool for the identification of GIs in microbial genomes. It allows genomes to be analyzed interactively and in detail, and hypotheses about the origin of acquired genes to be generated or tested.
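The codon-usage comparison at the heart of the method can be summarized as a log-odds score; this sketch is a simplified stand-in for the paper's HMM, which chains such evidence with transition probabilities and a sensitivity controller. A positive score marks a gene whose codon usage looks more like the putative donor's than the host's.

```python
# Hedged sketch: log-odds of a gene's codons under donor vs host CU tables.
import math

def cu_log_odds(codons, donor_cu, host_cu):
    """codons: list like ['ATG', 'AAA', ...]; *_cu: codon -> frequency > 0."""
    return sum(math.log(donor_cu[c] / host_cu[c]) for c in codons)
```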

Journal ArticleDOI
TL;DR: This study describes an algorithm for the synchronous alignment of sequences and their associated secondary structures, as well as the main features of 4SALE, a comprehensive toolbox for RNA analysis based on sequence and secondary structure information.
Abstract: In sequence analysis, the multiple alignment forms the foundation of all subsequent analyses. Errors in an alignment can strongly influence all succeeding analyses and therefore lead to wrong predictions. Hand-crafted and hand-improved alignments are necessary and meanwhile good common practice. For RNA sequences, the primary sequence as well as a secondary structure consensus is often well known, e.g., the cloverleaf structure of the tRNA. Recently, some alignment editors have been proposed that are able to include and model both kinds of information. However, with the advent of a large amount of reliable RNA sequences together with their solved secondary structures (available from, e.g., the ITS2 Database), we are faced with the problem of handling sequences and their associated secondary structures synchronously. 4SALE fills this gap. The application allows a fast sequence and synchronous secondary structure alignment for large data sets and, for the first time, synchronous manual editing of aligned sequences and their secondary structures. This study describes an algorithm for the synchronous alignment of sequences and their associated secondary structures, as well as the main features of 4SALE used for further analyses and editing. 4SALE builds an optimal and unique starting point for every RNA sequence and structure analysis. 4SALE, which provides a user-friendly and intuitive interface, is a comprehensive toolbox for RNA analysis based on sequence and secondary structure information. The program connects sequence and structure databases like the ITS2 Database to phylogeny programs such as the CBCAnalyzer. 4SALE is written in Java and therefore platform independent. The software is freely available and distributed from the website at http://4sale.bioapps.biozentrum.uni-wuerzburg.de

Journal ArticleDOI
TL;DR: PageMan is introduced, a multiplatform, user-friendly, and stand-alone software tool that annotates, investigates, and condenses high-throughput microarray data in the context of functional ontologies, enabling researchers to gain a rapid overview of the biological responses in the experiments.
Abstract: Microarray technology has become a widely accepted and standardized tool in biology. The first microarray data analysis programs were developed to support pair-wise comparison. However, as microarray experiments have become more routine, large scale experiments have become more common, which investigate multiple time points or sets of mutants or transgenics. To extract biological information from such high-throughput expression data, it is necessary to develop efficient analytical platforms, which combine manually curated gene ontologies with efficient visualization and navigation tools. Currently, most tools focus on a few limited biological aspects, rather than offering a holistic, integrated analysis. Here we introduce PageMan, a multiplatform, user-friendly, and stand-alone software tool that annotates, investigates, and condenses high-throughput microarray data in the context of functional ontologies. It includes a GUI tool to transform different ontologies into a suitable format, enabling the user to compare and choose between different ontologies. It is equipped with several statistical modules for data analysis, including over-representation analysis and Wilcoxon statistical testing. Results are exported in a graphical format for direct use, or for further editing in graphics programs. PageMan provides a fast overview of single treatments, allows genome-level responses to be compared across several microarray experiments covering, for example, stress responses at multiple time points. This aids in searching for trait-specific changes in pathways using mutants or transgenics, analyzing development time-courses, and comparison between species. In a case study, we analyze the results of publicly available microarrays of multiple cold stress experiments using PageMan, and compare the results to a previously published meta-analysis. PageMan offers a complete user's guide, a web-based over-representation analysis as well as a tutorial, and is freely available at http://mapman.mpimp-golm.mpg.de/pageman/ . PageMan allows multiple microarray experiments to be efficiently condensed into a single page graphical display. The flexible interface allows data to be quickly and easily visualized, facilitating comparisons within experiments and to published experiments, thus enabling researchers to gain a rapid overview of the biological responses in the experiments.
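The over-representation analysis PageMan offers is, at its core, a hypergeometric tail test per ontology category; a minimal sketch (the gene counts in the usage line are illustrative):

```python
# Hedged sketch: hypergeometric over-representation test for one category.
from scipy.stats import hypergeom

def overrepresentation_p(n_genome, n_category, n_selected, n_overlap):
    # P(X >= n_overlap) when drawing n_selected genes from a genome in
    # which n_category genes belong to the category
    return hypergeom.sf(n_overlap - 1, n_genome, n_category, n_selected)

print(overrepresentation_p(20000, 150, 400, 12))  # small p => enriched
```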