Showing papers in "Bioinformatics in 2003"


Journal ArticleDOI
TL;DR: MrBayes 3 performs Bayesian phylogenetic analysis combining information from different data partitions or subsets evolving under different stochastic evolutionary models to analyze heterogeneous data sets and explore a wide variety of structured models mixing partition-unique and shared parameters.
Abstract: Summary: MrBayes 3 performs Bayesian phylogenetic analysis combining information from different data partitions or subsets evolving under different stochastic evolutionary models. This allows the user to analyze heterogeneous data sets consisting of different data types—e.g. morphological, nucleotide, and protein— and to explore a wide variety of structured models mixing partition-unique and shared parameters. The program employs MPI to parallelize Metropolis coupling on Macintosh or UNIX clusters.

25,931 citations


Journal ArticleDOI
TL;DR: Three complete data methods of performing normalization at the probe intensity level are presented and compared, in terms of the variability and bias of an expression measure, with two methods that use a baseline array: a one number scaling based algorithm and a method that uses a non-linear normalizing relation. The simplest and quickest complete data method is found to perform favorably.
Abstract: Motivation: When running experiments that involve multiple high density oligonucleotide arrays, it is important to remove sources of variation between arrays of non-biological origin. Normalization is a process for reducing this variation. It is common to see non-linear relations between arrays and the standard normalization provided by Affymetrix does not perform well in these situations. Results: We present three methods of performing normalization at the probe intensity level. These methods are called complete data methods because they make use of data from all arrays in an experiment to form the normalizing relation. These algorithms are compared to two methods that make use of a baseline array: a one number scaling based algorithm and a method that uses a non-linear normalizing relation by comparing the variability and bias of an expression measure. Two publicly available datasets are used to carry out the comparisons. The simplest and quickest complete data method is found to perform favorably. Availability: Software implementing all three of the complete data normalization methods is available as part of the R package Affy, which is a part of the Bioconductor project http://www.bioconductor.org. Contact: bolstad@stat.berkeley.edu Supplementary information: Additional figures may be found at http://www.stat.berkeley.edu/~bolstad/normalize/index.html

8,324 citations
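
The abstract above does not name the "simplest and quickest complete data method"; in the published paper it is quantile normalization. The following is a minimal sketch of that idea, not the Bioconductor implementation: the function name `quantile_normalize` and the toy intensities are illustrative assumptions, and ties are broken by sort order rather than averaged.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equal-length lists of probe intensities.

    `arrays` holds one list per array (chip). Illustrative sketch only:
    tied values are ordered by their position, not averaged.
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    # For each array, the probe indices ordered from lowest to highest intensity.
    orders = [sorted(range(n_probes), key=a.__getitem__) for a in arrays]

    # Reference distribution: the mean intensity across arrays at each rank.
    reference = [
        sum(a[order[rank]] for a, order in zip(arrays, orders)) / len(arrays)
        for rank in range(n_probes)
    ]

    # Replace each probe by the reference value for its rank in its own array.
    normalized = []
    for order in orders:
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    chips = [[2.0, 4.0, 6.0], [30.0, 10.0, 20.0]]
    print(quantile_normalize(chips))  # [[6.0, 12.0, 18.0], [18.0, 6.0, 12.0]]
```

After normalization every array has the same empirical distribution, which is why the method uses information from all arrays rather than a single baseline array.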


Journal ArticleDOI
TL;DR: The present version of DnaSP introduces several new modules and features which, among other options, allow handling big data sets and conducting a large number of coalescent-based tests by Monte Carlo computer simulations.
Abstract: Summary: DnaSP is a software package for the analysis of DNA polymorphism data. The present version introduces several new modules and features which, among other options, allow: (1) handling big data sets (~5 Mb per sequence); (2) conducting a large number of coalescent-based tests by Monte Carlo computer simulations; (3) extensive analyses of the genetic differentiation and gene flow among populations; (4) analysing the evolutionary pattern of preferred and unpreferred codons; (5) generating graphical outputs for an easy visualization of results. Availability: The software package, including complete documentation and examples, is freely available to academic users from: http://www.ub.es/dnasp

6,100 citations


Journal ArticleDOI
TL;DR: This work summarizes the Systems Biology Markup Language (SBML) Level 1, a free, open, XML-based format for representing biochemical reaction networks, a software-independent language for describing models common to research in many areas of computational biology.
Abstract: Motivation: Molecular biotechnology now makes it possible to build elaborate systems models, but the systems biology community needs information standards if models are to be shared, evaluated and developed cooperatively. Results: We summarize the Systems Biology Markup Language (SBML) Level 1, a free, open, XML-based format for representing biochemical reaction networks. SBML is a software-independent language for describing models common to research in many areas of computational biology, including cell signaling pathways, metabolic pathways, gene regulation, and others. Availability: The specification of SBML Level 1 is freely available from http://www.sbml.org/.

3,205 citations


Journal ArticleDOI
TL;DR: R/qtl is an extensible, interactive environment for mapping quantitative trait loci (QTLs) in experimental populations derived from inbred lines and includes functions for estimating genetic maps, identifying genotyping errors, and performing single-QTL and two-dimensional, two-QTL genome scans by multiple methods.
Abstract: Summary: R/qtl is an extensible, interactive environment for mapping quantitative trait loci (QTLs) in experimental populations derived from inbred lines. It is implemented as an add-on package for the freely-available statistical software, R, and includes functions for estimating genetic maps, identifying genotyping errors, and performing single-QTL and two-dimensional, two-QTL genome scans by multiple methods, with the possible inclusion of covariates. Availability: The package is freely available at http://www.biostat.jhsph.edu/~kbroman/qtl.

3,111 citations


Journal ArticleDOI
TL;DR: A website for performing power calculations for the design of linkage and association genetic mapping studies of complex traits is presented; the package is made available at statgen.iop.ac.uk/gpc.
Abstract: Summary: A website for performing power calculations for the design of linkage and association genetic mapping studies of complex traits. Availability: The package is made available at http://

2,108 citations


Journal ArticleDOI
TL;DR: Three resampling-based FDR controlling procedures that account for the test statistics distribution are presented, and their performance is compared to that of the naïve application of the linear step-up procedure in Benjamini and Hochberg (1995). The highest power is achieved, at the expense of a more sophisticated algorithm, by the resampling-based procedures that resample the joint distribution of the test statistics and estimate the level of FDR control.
Abstract: Motivation: DNA microarrays have recently been used for the purpose of monitoring expression levels of thousands of genes simultaneously and identifying those genes that are differentially expressed. The probability that a false identification (type I error) is committed can increase sharply when the number of tested genes gets large. Correlation between the test statistics attributed to gene co-regulation and dependency in the measurement errors of the gene expression levels further complicates the problem. In this paper we address this very large multiplicity problem by adopting the false discovery rate (FDR) controlling approach. In order to address the dependency problem, we present three resampling-based FDR controlling procedures, that account for the test statistics distribution, and compare their performance to that of the

1,713 citations
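
For orientation, the baseline that the entry above compares against, the linear step-up procedure of Benjamini and Hochberg (1995), can be sketched in a few lines. This is only the naïve baseline, not the paper's resampling-based procedures; the function name `benjamini_hochberg` and the example p-values are illustrative assumptions.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Linear step-up procedure: return True for each rejected hypothesis.

    Rejects the k smallest p-values, where k is the largest rank (1-based)
    such that p_(k) <= k * q / m, with m the number of tests.
    """
    m = len(p_values)
    order = sorted(range(m), key=p_values.__getitem__)

    # Scan all ranks; keep the largest one that passes its threshold.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank

    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # 0.03 fails its own threshold (0.025) but is still rejected because
    # 0.035 passes at rank 3 (threshold 0.0375): the "step-up" behaviour.
    print(benjamini_hochberg([0.001, 0.03, 0.035, 0.5]))
    # [True, True, True, False]
```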


Journal ArticleDOI
TL;DR: TGICL is a pipeline for analysis of large Expressed Sequence Tags and mRNA databases in which the sequences are first clustered based on pairwise sequence similarity, and then assembled by individual clusters to produce longer, more complete consensus sequences.
Abstract: TGICL is a pipeline for analysis of large Expressed Sequence Tags (EST) and mRNA databases in which the sequences are first clustered based on pairwise sequence similarity, and then assembled by individual clusters (optionally with quality values) to produce longer, more complete consensus sequences. The system can run on multi-CPU architectures including SMP and PVM.

1,703 citations


Journal ArticleDOI
TL;DR: R8s version 1.5 is a program which uses parametric, nonparametric and semiparametric methods to relax the assumption of constant rates of evolution to obtain better estimates of rates and times.
Abstract: Summary: Estimating divergence times and rates of substitution from sequence data is plagued by the problem of rate variation between lineages. R8s version 1.5 is a program which uses parametric, nonparametric and semiparametric methods to relax the assumption of constant rates of evolution to obtain better estimates of rates and times. Unlike most programs for rate inference or phylogenetics, r8s permits users to convert results to absolute rates and ages by constraining one or more node times to be fixed, minimum or maximum ages (using fossil or other evidence). Version 1.5 uses truncated Newton nonlinear optimization code with bound constraints, offering superior performance over previous versions. Availability: The Linux executable, C source code, sample data sets and user manual are available free at http://ginger.ucdavis.edu/r8s.

1,689 citations


Journal ArticleDOI
TL;DR: PISCES is a public server for culling sets of protein sequences from the Protein Data Bank by sequence identity and structural quality criteria and provides better lists than servers that use BLAST, which is unable to identify many relationships below 40% sequence identity.
Abstract: PISCES is a public server for culling sets of protein sequences from the Protein Data Bank (PDB) by sequence identity and structural quality criteria. PISCES can provide lists culled from the entire PDB or from lists of PDB entries or chains provided by the user. The sequence identities are obtained from PSI-BLAST alignments with position-specific substitution matrices derived from the non-redundant protein sequence database. PISCES therefore provides better lists than servers that use BLAST, which is unable to identify many relationships below 40% sequence identity and often overestimates sequence identity by aligning only well-conserved fragments. PDB sequences are updated weekly. PISCES can also cull non-PDB sequences provided by the user as a list of GenBank identifiers, a FASTA format file, or BLAST/PSI-BLAST output.

1,649 citations


Journal ArticleDOI
TL;DR: A new web server, ConSurf, is presented, which automates algorithms for the identification of functionally important regions in proteins of known three dimensional structure by estimating the degree of conservation of the amino-acid sites among their close sequence homologues.
Abstract: We recently developed algorithmic tools for the identification of functionally important regions in proteins of known three dimensional structure by estimating the degree of conservation of the amino-acid sites among their close sequence homologues. Projecting the conservation grades onto the molecular surface of these proteins reveals patches of highly conserved (or occasionally highly variable) residues that are often of important biological function. We present a new web server, ConSurf, which automates these algorithmic tools. ConSurf may be used for high-throughput characterization of functional regions in proteins. Availability: The ConSurf web server is available at http://consurf.tau.ac.il. Supplementary information: A set of examples is available at http://consurf.tau.ac.il under 'GALLERY'.

Journal ArticleDOI
TL;DR: GENIA corpus version 3.0 consists of 2000 MEDLINE abstracts with more than 400 000 words and almost 100 000 annotations for biological terms, and provides reference material for bio-text mining.
Abstract: Motivation: Natural language processing (NLP) methods are regarded as being useful to raise the potential of text mining from biological literature. The lack of an extensively annotated corpus of this literature, however, causes a major bottleneck for applying NLP techniques. GENIA corpus is being developed to provide reference materials to let NLP techniques work for bio-text mining. Results: GENIA corpus version 3.0 consisting of 2000 MEDLINE abstracts has been released with more than 400 000 words and almost 100 000 annotations for biological terms. Availability: GENIA corpus is freely available at http://

Journal ArticleDOI
TL;DR: A multifactor dimensionality reduction (MDR) method for collapsing high-dimensional genetic data into a single dimension thus permitting interactions to be detected in relatively small sample sizes is developed.
Abstract: Motivation: Polymorphisms in human genes are being described in remarkable numbers. Determining which polymorphisms and which environmental factors are associated with common, complex diseases has become a daunting task. This is partly because the effect of any single genetic variation will likely be dependent on other genetic variations (gene–gene interaction or epistasis) and environmental factors (gene–environment interaction). Detecting and characterizing interactions among multiple factors is both a statistical and a computational challenge. To address this problem, we have developed a multifactor dimensionality reduction (MDR) method for collapsing high-dimensional genetic data into a single dimension thus permitting interactions to be detected in relatively small sample sizes. In this paper, we describe the MDR approach and an MDR software package. Results: We developed a program that integrates MDR with a cross-validation strategy for estimating the classification and prediction error of multifactor models. The software can be used to analyze interactions among 2–15 genetic and/or environmental factors. The dataset may contain up to 500 total variables and a maximum of 4000 study subjects.

Journal ArticleDOI
TL;DR: In this paper, the authors investigate the use of ontological annotation to measure the similarities in knowledge content or "semantic similarity" between entries in a data resource, and present a simple extension that enables a semantic search of the knowledge held within sequence databases.
Abstract: Motivation: Many bioinformatics data resources not only hold data in the form of sequences, but also as annotation. In the majority of cases, annotation is written as scientific natural language: this is suitable for humans, but not particularly useful for machine processing. Ontologies offer a mechanism by which knowledge can be represented in a form capable of such processing. In this paper we investigate the use of ontological annotation to measure the similarities in knowledge content or ‘semantic similarity’ between entries in a data resource. These allow a bioinformatician to perform a similarity measure over annotation in an analogous manner to those performed over sequences. A measure of semantic similarity for the knowledge component of bioinformatics resources should afford a biologist a new tool in their repertoire of analyses. Results: We present the results from experiments that investigate the validity of using semantic similarity by comparison with sequence similarity. We show a simple extension that enables a semantic search of the knowledge held within sequence databases. Availability: Software available from http://www.russet.

Journal ArticleDOI
TL;DR: Alignment-free metrics are furthering their usage as a scale-independent methodology that is capable of recognizing homology when loss of contiguity is beyond the possibility of alignment.
Abstract: Motivation: Genetic recombination and, in particular, genetic shuffling are at odds with sequence comparison by alignment, which assumes conservation of contiguity between homologous segments. A variety of theoretical foundations are being used to derive alignment-free methods that overcome this limitation. The formulation of alternative metrics for dissimilarity between sequences and their algorithmic implementations are reviewed. Results: The overwhelming majority of work on alignment-free sequence comparison has taken place in the past two decades, with most reports published in the past 5 years. Two main categories of methods have been proposed: methods based on word (oligomer) frequency, and methods that do not require resolving the sequence with fixed word length segments. The first category is based on the statistics of word frequency, on the distances defined in a Cartesian space defined by the frequency vectors, and on the information content of frequency distribution. The second category includes the use of Kolmogorov complexity and Chaos Theory. Despite their low visibility, alignment-free metrics are in fact already widely used as pre-selection filters for alignment-based querying of large applications. Recent work is furthering their usage as a scale-independent methodology that is capable of recognizing homology when loss of contiguity is beyond the possibility of alignment. Availability: Most of the alignment-free algorithms reviewed were implemented in MATLAB code and are available at http://bioinformatics.musc.edu/resources.html. Contact: almeidaj@musc.edu; svinga@itqb.unl.pt
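
The first category described above, distances between word-frequency vectors, can be illustrated with a short sketch. It is written in Python rather than the MATLAB code the review refers to, and the word length `k`, the DNA alphabet, the choice of Euclidean distance and the function names are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import sqrt


def word_frequencies(seq, k=2, alphabet="ACGT"):
    """Relative frequencies of all overlapping k-letter words in `seq`.

    The vector has one entry per possible word, in lexicographic order.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    words = ("".join(w) for w in product(alphabet, repeat=k))
    return [counts[w] / total if total else 0.0 for w in words]


def euclidean_distance(seq_a, seq_b, k=2):
    """Euclidean distance between the word-frequency vectors of two sequences."""
    fa = word_frequencies(seq_a, k)
    fb = word_frequencies(seq_b, k)
    return sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))


if __name__ == "__main__":
    # Two sequences made of the same blocks in swapped order.
    a, b = "AAAACCCC", "CCCCAAAA"
    print(euclidean_distance(a, b, k=1))  # 0.0: same composition
    print(euclidean_distance(a, b, k=2))  # small: only the junction word differs
```

Because only word counts are compared, swapping the order of blocks changes the distance very little, which is the property that makes such measures tolerant of shuffling.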

Journal ArticleDOI
TL;DR: The goal for the development of the 3D-Jury system is to create a simple but powerful procedure for generating meta-predictions using variable sets of models obtained from diverse sources to improve the quality of structural annotations of novel proteins.
Abstract: Motivation: Consensus structure prediction methods (meta-predictors) have higher accuracy than individual structure prediction algorithms (their components). The goal for the development of the 3D-Jury system is to create a simple but powerful procedure for generating meta-predictions using variable sets of models obtained from diverse sources. The resulting protocol should help to improve the quality of structural annotations of novel proteins. Results: The 3D-Jury system generates meta-predictions from sets of models created using variable methods. It is not necessary to know prior characteristics of the methods. The system is able to utilize immediately new components (additional prediction providers). The accuracy of the system is comparable with other well-tuned prediction servers. The algorithm resembles methods of selecting models generated using ab initio folding simulations. It is simple and offers a portable solution to improve the accuracy of other protein structure prediction protocols. Availability: The 3D-Jury system is available via the Structure Prediction Meta Server (http://BioInfo.PL/Meta/) to the academic community. Contact: leszek@bioinfo.pl Supplementary information: 3D-Jury is coupled to the continuous online server evaluation program, LiveBench (http://BioInfo.PL/LiveBench/).

Journal ArticleDOI
TL;DR: It will be shown that the calculation of Mean Normalized Expressions has to be used for processing simplex PCR data, while multiplex PCR data should preferably be processed by calculating Normalized Expressions.
Abstract: Summary: Q-Gene is an application for the processing of quantitative real-time RT–PCR data. It offers the user the possibility to freely choose between two principally different procedures to calculate normalized gene expressions as either means of Normalized Expressions or Mean Normalized Expressions. In this contribution it will be shown that the calculation of Mean Normalized Expressions has to be used for processing simplex PCR data, while multiplex PCR data should preferably be processed by calculating Normalized Expressions. The two procedures, which are currently in widespread use and regarded as more or less equivalent alternatives, should therefore specifically be applied according to the quantification procedure used. Availability: Web access to this program is provided at http://www.biotechniques.com/softlib/qgene.html

Journal ArticleDOI
TL;DR: While the estimation performance of existing methods depends on model parameters whose determination is difficult, the BPCA method is free from this difficulty, and provides accurate and convenient estimation for missing values.
Abstract: Motivation: Gene expression profile analyses have been used in numerous studies covering a broad range of areas in biology. When unreliable measurements are excluded, missing values are introduced in gene expression profiles. Although existing multivariate analysis methods have difficulty with the treatment of missing values, this problem has received little attention. There are many options for dealing with missing values, each of which reaches drastically different results. Ignoring missing values is the simplest method and is frequently applied. This approach, however, has its flaws. In this article, we propose an estimation method for missing values, which is based on Bayesian principal component analysis (BPCA). Although the methodology that a probabilistic model and latent variables are estimated simultaneously within the framework of Bayes inference is not new in principle, actual BPCA implementation that makes it possible to estimate arbitrary missing variables is new in terms of statistical methodology. Results: When applied to DNA microarray data from various experimental conditions, the BPCA method exhibited markedly better estimation ability than other recently proposed methods, such as singular value decomposition and K-nearest neighbors. While the estimation performance of existing methods depends on model parameters whose determination is difficult, our BPCA method is free from this difficulty. Accordingly, the BPCA method provides accurate and convenient estimation for missing values. Availability: The software is available at http://hawaii.aist

Journal ArticleDOI
TL;DR: A model by which the within gene variances are drawn from an inverse gamma distribution, whose parameters are estimated across all genes is proposed, which results in a test statistic that is a minor variation of those used in standard linear models.
Abstract: Motivation: Microarray techniques provide a valuable way of characterizing the molecular nature of disease. Unfortunately expense and limited specimen availability often lead to studies with small sample sizes. This makes accurate estimation of variability difficult, since variance estimates made on a gene by gene basis will have few degrees of freedom, and the assumption that all genes share equal variance is unlikely to be true. Results: We propose a model by which the within gene variances are drawn from an inverse gamma distribution, whose parameters are estimated across all genes. This results in a test statistic that is a minor variation of those used in standard linear models. We demonstrate that the model assumptions are valid on experimental data, and that the model has more power than standard tests to pick up large changes in expression, while not increasing the rate of false positives. Availability: This method is incorporated into BRB-ArrayTools version 3.0 (http://linus.nci.nih.gov/BRB-ArrayTools.html). Supplementary material: ftp://linus.nci.nih.gov/pub/techreport/RVM_supplement.pdf Contact: wrightge@mail.nih.gov

Journal ArticleDOI
TL;DR: ModLoop is a web server for automated modeling of loops in protein structures that predicts the loop conformations by satisfaction of spatial restraints, without relying on a database of known protein structures.
Abstract: Summary: ModLoop is a web server for automated modeling of loops in protein structures. The input is the atomic coordinates of the protein structure in the Protein Data Bank format, and the specification of the starting and ending residues of one or more segments to be modeled, containing no more than 20 residues in total. The output is the coordinates of the nonhydrogen atoms in the modeled segments. A user provides the input to the server via a simple web interface, and receives the output by e-mail. The server relies on the loop modeling routine in MODELLER that predicts the loop conformations by satisfaction of spatial restraints, without relying on a database of known protein structures. For a rapid response, ModLoop runs on a cluster of Linux PC computers. Availability: The server is freely accessible to academic users at http://salilab.org/modloop

Journal ArticleDOI
TL;DR: PathwayAssist is a software application developed for navigation and analysis of biological pathways, gene regulation networks and protein interaction maps that comes with the built-in natural language processing module MedScan and the comprehensive database describing more than 100 000 events of regulation, interaction and modification between proteins, cell processes and small molecules.
Abstract: Summary: PathwayAssist is a software application developed for navigation and analysis of biological pathways, gene regulation networks and protein interaction maps. It comes with the built-in natural language processing module MedScan and the comprehensive database describing more than 100 000 events of regulation, interaction and modification between proteins, cell processes and small molecules. Availability: PathwayAssist is available for commercial licensing from Ariadne Genomics, Inc. The light version with limited functionality will be available for free for academic users at www.ariadnegenomics.com/downloads/ Contact: mazo@ariadnegenomics.com Information about protein function and cellular pathways is central to the system-level understanding of living organisms. This knowledge is scattered throughout numerous scientific publications. The need to bring the relevant information together calls for software systems to organize and study pathway data. PathwayAssist is a Windows desktop application developed for navigation and analysis of molecular networks. It is written in C++ and runs under Windows ME, 2000 and XP. The application uses the Jet engine as a back-end to store data, but can connect to other databases that support ADO or ODBC access (e.g. MySQL, Oracle). In addition, there is a second data abstraction layer implemented as COM interfaces to allow for the accommodation of different database schemas. PathwayAssist comes with a database of molecular networks automatically assembled from scientific abstracts. It contains more than 100 000 events of regulation, interaction and modification between proteins, cell processes and small molecules. The database has been compiled by the application of the text-mining tool MedScan to the whole PubMed. MedScan preprocesses input text to extract relevant sentences, which are subjected to natural language processing.

Journal ArticleDOI
TL;DR: The findings demonstrate how the network inference performance varies with the training set size, the degree of inadequacy of prior assumptions, the experimental sampling strategy and the inclusion of further, sequence-based information.
Abstract: Motivation: Bayesian networks have been applied to infer genetic regulatory interactions from microarray gene expression data. This inference problem is particularly hard in that interactions between hundreds of genes have to be learned from very small data sets, typically containing only a few dozen time points during a cell cycle. Most previous studies have assessed the inference results on real gene expression data by comparing predicted genetic regulatory interactions with those known from the biological literature. This approach is controversial due to the absence of known gold standards, which renders the estimation of the sensitivity and specificity, that is, the true and (complementary) false detection rate, unreliable and difficult. The objective of the present study is to test the viability of the Bayesian network paradigm in a realistic simulation study. First, gene expression data are simulated from a realistic biological network involving DNAs, mRNAs, inactive protein monomers and active protein dimers. Then, interaction networks are inferred from these data in a reverse engineering approach, using Bayesian networks and Bayesian learning with Markov chain Monte Carlo. Results: The simulation results are presented as receiver operator characteristics curves. This allows estimating the proportion of spurious gene interactions incurred for a specified target proportion of recovered true interactions. The findings demonstrate how the network inference performance varies with the training set size, the degree of inadequacy of prior assumptions, the experimental sampling strategy and the inclusion of further, sequence-based information. Availability: The programs and data used in the present study are available from http://www.bioss.sari.ac.uk/~dirk/Supplements

Journal ArticleDOI
TL;DR: By setting threshold levels for the membership values of the FCM method, genes which are tightly associated to a given cluster can be selected and this selection increases the overall biological significance of the genes within the cluster.
Abstract: Motivation: Clustering analysis of data from DNA microarray hybridization studies is essential for identifying biologically relevant groups of genes. Partitional clustering methods such as K-means or self-organizing maps assign each gene to a single cluster. However, these methods do not provide information about the influence of a given gene for the overall shape of clusters. Here we apply a fuzzy partitioning method, Fuzzy C-means (FCM), to attribute cluster membership values to genes. Results: A major problem in applying the FCM method for clustering microarray data is the choice of the fuzziness parameter m. We show that the commonly used value m = 2 is not appropriate for some data sets, and that optimal values for m vary widely from one data set to another. We propose an empirical method, based on the distribution of distances between genes in a given data set, to determine an adequate value for m. By setting threshold levels for the membership values, genes which are tightly associated to a given cluster can be selected. Using a yeast cell cycle data set as an example, we show that this selection increases the overall biological significance of the genes within the cluster. Availability: Supplementary text and Matlab functions are available at http://www-igbmc.u-strasbg.fr/fcm/
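
To make the role of the fuzziness parameter m concrete, here is a minimal sketch of the standard Fuzzy C-means membership formula for a single gene given fixed cluster centroids. It does not reproduce the authors' empirical method for choosing m or their Matlab functions; the function name `fcm_memberships`, the use of Euclidean distance and the toy coordinates are illustrative assumptions.

```python
from math import dist


def fcm_memberships(point, centroids, m=2.0):
    """Standard FCM membership of `point` in each cluster.

    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)), where d_i is the distance
    from the point to centroid i. Memberships sum to 1.
    """
    if m <= 1:
        raise ValueError("fuzziness parameter m must be greater than 1")
    distances = [dist(point, c) for c in centroids]

    # A point lying exactly on one or more centroids belongs fully to them.
    if 0.0 in distances:
        hits = distances.count(0.0)
        return [1.0 / hits if d == 0.0 else 0.0 for d in distances]

    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]


if __name__ == "__main__":
    gene = (1.0, 0.0)
    centroids = [(0.0, 0.0), (4.0, 0.0)]
    print(fcm_memberships(gene, centroids, m=2.0))  # about [0.9, 0.1]
    print(fcm_memberships(gene, centroids, m=3.0))  # about [0.75, 0.25]
```

Larger values of m pull memberships toward equality, so a fixed membership threshold (for example, keeping genes whose highest membership exceeds a chosen cutoff) selects fewer genes as m grows.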

Journal ArticleDOI
TL;DR: Two new resampling methods, inspired by bagging in prediction, are proposed to improve and assess the accuracy of a given clustering procedure to solve the problem of accurate partitioning of tumor samples into clusters.
Abstract: Motivation: The microarray technology is increasingly being applied in biological and medical research to address a wide range of problems such as the classification of tumors. An important statistical question associated with tumor classification is the identification of new tumor classes using gene expression profiles. Essential aspects of this clustering problem include identifying accurate partitions of the tumor samples into clusters and assessing the confidence of cluster assignments for individual samples. Results: Two new resampling methods, inspired by bagging in prediction, are proposed to improve and assess the accuracy of a given clustering procedure. In these ensemble methods, a partitioning clustering procedure is applied to bootstrap learning sets and the resulting multiple partitions are combined by voting or the creation of a new dissimilarity matrix. As in prediction, the motivation behind bagging is to reduce variability in the partitioning results via averaging. The performances of the new and existing methods were compared using simulated data and gene expression data from two recently published cancer microarray studies. The bagged clustering procedures were in general at least as accurate and often substantially more accurate than a single application of the partitioning clustering procedure. Valuable by-products of bagged clustering are the cluster votes, which can be used to assess the confidence of cluster assignments for individual observations. Supplementary information: For supplementary information on datasets, analyses, and software, consult http://www.stat.berkeley.edu/~sandrine and http://www.bioconductor.org.

Journal ArticleDOI
TL;DR: The distribution of the connection degree of these networks is shown to follow the power law, indicating that the overall structure of all the metabolic networks has the characteristics of a small world network.
Abstract: Motivation: Information from fully sequenced genomes makes it possible to reconstruct strain-specific global metabolic networks for structural and functional studies. These networks are often very large and complex. To properly understand and analyze the global properties of metabolic networks, methods for rationally representing and quantitatively analyzing their structure are needed. Results: In this work, the metabolic networks of 80 fully sequenced organisms are in silico reconstructed from genome data and an extensively revised bioreaction database. The networks are represented as directed graphs and analyzed by using the ‘breadth first searching algorithm’ to identify the shortest pathway (path length) between any pair of the metabolites. The average path lengths of the networks are then calculated and compared for all the organisms. Different from previous studies, the connections through current metabolites and cofactors are deleted to make the path length analysis physiologically more meaningful. The distribution of the connection degree of these networks is shown to follow the power law, indicating that the overall structure of all the metabolic networks has the characteristics of a small world network. However, clear differences exist in the network structure of the three domains of organisms. Eukaryotes and archaea have a longer average path length than bacteria. Availability: The reaction database in Excel format and the programs in VBA (Visual Basic for Applications) are available upon request. Supplementary Material: For Supplementary Material refer to Bioinformatics Online.
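
The path length analysis described above can be sketched as a breadth-first search from every metabolite of a directed graph. This is an illustrative Python sketch, not the authors' VBA programs: the function names, the toy metabolite labels, and the choice to average only over ordered pairs that are actually reachable are assumptions, since the abstract does not say how unreachable pairs are handled.

```python
from collections import deque


def bfs_distances(graph, source):
    """Shortest path length (in reaction steps) from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def average_path_length(graph):
    """Mean shortest path length over all reachable ordered pairs of distinct nodes."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    total = 0
    pairs = 0
    for source in nodes:
        for target, d in bfs_distances(graph, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else float("nan")


# Hypothetical linear pathway A -> B -> C -> D (directed edges only).
toy_network = {"A": ["B"], "B": ["C"], "C": ["D"]}
apl = average_path_length(toy_network)
```

Removing edges that run through cofactors before calling `average_path_length` would lengthen the computed paths, which is the effect the abstract describes as making the analysis physiologically more meaningful.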

Journal ArticleDOI
TL;DR: FuncAssociate is a web-based tool to help researchers use Gene Ontology attributes to characterize large sets of genes derived from experiment with a Monte Carlo simulation approach that is more appropriate to determine significance than other methods, such as Bonferroni or Šidák p-value correction.
Abstract: Summary: FuncAssociate is a web-based tool to help researchers use Gene Ontology attributes to characterize large sets of genes derived from experiment. Distinguishing features of FuncAssociate include the ability to handle ranked input lists, and a Monte Carlo simulation approach that is more appropriate to determine significance than other methods, such as Bonferroni or Šidák p-value correction. FuncAssociate currently supports 10 organisms (Vibrio cholerae, Shewanella oneidensis, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Mus musculus, Rattus norvegicus and Homo sapiens). Availability: FuncAssociate is freely accessible at http://llama.med.harvard.edu/Software.html. Source code (in Perl and C) is freely available to academic users ‘as is’.

Journal ArticleDOI
TL;DR: This work compares the performance of several classes of statistical methods for the classification of cancer based on MS spectra and finds that RF outperforms other methods in the analysis of MS data.
Abstract: Motivation: Novel methods, both molecular and statistical, are urgently needed to take advantage of recent advances in biotechnology and the human genome project for disease diagnosis and prognosis. Mass spectrometry (MS) holds great promise for biomarker identification and genome-wide protein profiling. It has been demonstrated in the literature that biomarkers can be identified to distinguish normal individuals from cancer patients using MS data. Such progress is especially exciting for the detection of early-stage ovarian cancer patients. Although various statistical methods have been utilized to identify biomarkers from MS data, there has been no systematic comparison among these approaches in their relative ability to analyze MS data. Results: We compare the performance of several classes of statistical methods for the classification of cancer based on MS spectra. These methods include: linear discriminant analysis, quadratic discriminant analysis, k-nearest neighbor classifier, bagging and boosting classification trees, support vector machine, and random forest (RF). The methods are applied to ovarian cancer and control serum samples from the National Ovarian Cancer Early Detection Program clinic at Northwestern University Hospital. We found that RF outperforms other methods in the analysis of MS data.

Journal ArticleDOI
TL;DR: A new and efficient algorithm for the sparse logistic regression problem based on the Gauss-Seidel method that is simple and extremely easy to implement and can be applied to a variety of real-world problems like identifying marker genes and building a classifier in the context of cancer diagnosis using microarray data.
Abstract: Motivation: This paper gives a new and efficient algorithm for the sparse logistic regression problem. The proposed algorithm is based on the Gauss–Seidel method and is asymptotically convergent. It is simple and extremely easy to implement; it neither uses any sophisticated mathematical programming software nor needs any matrix operations. It can be applied to a variety of real-world problems like identifying marker genes and building a classifier in the context of cancer diagnosis using microarray data. Results: The gene selection method suggested in this paper is demonstrated on two real-world data sets and the results were found to be consistent with the literature. Availability: The implementation of this algorithm is available at the site http://guppy.mpe.nus.edu.sg/~mpessk/SparseLOGREG.shtml Contact: mpessk@nus.edu.sg Supplementary Information: Supplementary material is available at the site http://guppy.mpe.nus.edu.sg/~mpessk/SparseLOGREG.shtml

Journal ArticleDOI
TL;DR: The occurrence of false positives and false negatives in a microarray analysis could be easily estimated if the distribution of p-values were approximated and then expressed as a mixture of null and alternative densities.
Abstract: Motivation: The occurrence of false positives and false negatives in a microarray analysis could be easily estimated if the distribution of p-values were approximated and then expressed as a mixture of null and alternative densities. Essentially any distribution of p-values can be expressed as such a mixture by extracting a uniform density from it. Results: A model is introduced that frequently describes very accurately the distribution of a set of p-values arising from an array analysis. The model is used to obtain an estimated distribution that is easily expressed as a mixture of null and alternative densities. Given a threshold of significance, the estimated distribution is partitioned into regions corresponding to the occurrences of false positives, false negatives, true positives, and true negatives. Availability: An S-plus function library is available from
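
The idea of extracting a uniform density can be written out generically. The block below is a sketch under stated assumptions rather than the paper's specific model: f is any density of p-values on [0, 1] with distribution function F, the symbol \pi (the largest uniform component that can be removed) and the threshold \tau are notation introduced here, and the four areas follow from treating the uniform part as the null.

```latex
% Any p-value density f on [0,1] splits into a uniform (null) part
% and a remaining (alternative) density h:
\[
  f(p) = \pi \cdot 1 + (1-\pi)\,h(p),
  \qquad
  \pi = \min_{0 \le p \le 1} f(p),
  \qquad
  h(p) = \frac{f(p) - \pi}{1 - \pi}.
\]
% Calling a gene significant when p <= tau partitions the unit mass as:
\begin{align*}
  \text{false positives} &= \pi\,\tau, &
  \text{true positives}  &= F(\tau) - \pi\,\tau, \\
  \text{true negatives}  &= \pi\,(1-\tau), &
  \text{false negatives} &= \bigl(1 - F(\tau)\bigr) - \pi\,(1-\tau).
\end{align*}
```

The four areas sum to one, and because \pi is the largest uniform component that can be extracted, it acts as an upper bound on the proportion of true nulls.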

Journal ArticleDOI
TL;DR: A unified extension of the basic method to predict not only the network structure but also its dynamics using a Genetic Algorithm and an S-system formalism is proposed and successfully inferred the dynamics of a small genetic network constructed with 60 parameters for 5 network variables and feedback loops using only time-course data of gene expression.
Abstract: Motivation: The modeling of system dynamics of genetic networks, metabolic networks or signal transduction cascades from time-course data is formulated as a reverse-problem. Previous studies focused on the estimation of only network structures, and they were ineffective in inferring a network structure with feedback loops. We previously proposed a method to predict not only the network structure but also its dynamics using a Genetic Algorithm (GA) and an S-system formalism. However, it could predict only a small number of parameters and could rarely obtain essential structures. In this work, we propose a unified extension of the basic method. Notable improvements are as follows: (1) an additional term in its evaluation function that aims at eliminating futile parameters; (2) a crossover method called Simplex Crossover (SPX) to improve its optimization ability; and (3) a gradual optimization strategy to increase the number of predictable parameters. Results: The proposed method is implemented as a C program called PEACE1 (Predictor by Evolutionary Algorithms and Canonical Equations 1). Its performance was compared with the basic method. The comparison showed that: (1) the convergence rate increased about 5-fold; (2) the optimization speed was raised about 1.5-fold; and (3) the number of predictable parameters was increased about 5-fold. Moreover, we successfully inferred the dynamics of a small genetic network constructed with 60 parameters for 5 network variables and feedback loops using only time-course data of gene expression.
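
The figure of 60 parameters for 5 network variables matches the standard S-system form, dX_i/dt = alpha_i * prod_j X_j^g_ij - beta_i * prod_j X_j^h_ij, which has 2n rate constants and 2n^2 kinetic orders. The sketch below evaluates that right-hand side and the parameter count. It is not PEACE1: the function names and toy values are hypothetical, and the genetic algorithm, SPX crossover, and gradual optimization strategy are not shown.

```python
def s_system_parameter_count(n):
    """Number of parameters in an n-variable S-system: 2n rate constants + 2n^2 orders."""
    return 2 * n * (n + 1)


def s_system_rates(x, alpha, beta, g, h):
    """Right-hand side dX_i/dt of an S-system at state x.

    alpha[i], beta[i]: production and degradation rate constants.
    g[i][j], h[i][j]:  kinetic orders of X_j in production and degradation of X_i.
    """
    rates = []
    for i in range(len(x)):
        production = alpha[i]
        degradation = beta[i]
        for j, xj in enumerate(x):
            production *= xj ** g[i][j]
            degradation *= xj ** h[i][j]
        rates.append(production - degradation)
    return rates


# Hypothetical two-variable system: X_0 activates production of X_1,
# and each variable degrades in proportion to its own level.
x = [2.0, 1.0]
alpha = [1.0, 1.0]
beta = [1.0, 1.0]
g = [[0.0, 0.0], [1.0, 0.0]]
h = [[1.0, 0.0], [0.0, 1.0]]
rates = s_system_rates(x, alpha, beta, g, h)
n_params_for_five = s_system_parameter_count(5)
```

A kinetic order of zero means the corresponding variable has no influence on that term, so an added penalty that drives futile parameters to zero also prunes edges from the inferred network structure.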