Showing papers in "Bioinformatics in 2002"


Journal ArticleDOI
TL;DR: TREE-PUZZLE is a program package for quartet-based maximum-likelihood phylogenetic analysis that provides methods for reconstruction, comparison, and testing of trees and models on DNA as well as protein sequences; its tree reconstruction step is parallelized to reduce waiting time for larger datasets.
Abstract: Summary: TREE-PUZZLE is a program package for quartet-based maximum-likelihood phylogenetic analysis (formerly PUZZLE, Strimmer and von Haeseler, Mol. Biol. Evol., 13, 964-969, 1996) that provides methods for reconstruction, comparison, and testing of trees and models on DNA as well as protein sequences. To reduce waiting time for larger datasets the tree reconstruction part of the software has been parallelized using message passing that runs on clusters of workstations as well as parallel computers. Availability: http://www.tree-puzzle.de. The program is written in ANSI C. TREE-PUZZLE can be run on UNIX, Windows and Mac systems, including Mac OS X. To run the parallel version of PUZZLE, a Message Passing Interface (MPI) library has to be installed on the system. Free MPI implementations are available on the Web (cf. http://www.lam-mpi.org/mpi/implementations/).

2,581 citations


Journal ArticleDOI
TL;DR: A Monte Carlo computer program is available to generate samples drawn from a population evolving according to a Wright-Fisher neutral model, and the samples produced can be used to investigate the sampling properties of any sample statistic under these neutral models.
Abstract: A Monte Carlo computer program is available to generate samples drawn from a population evolving according to a Wright-Fisher neutral model. The program assumes an infinite-sites model of mutation, and allows recombination, gene conversion, symmetric migration among subpopulations, and a variety of demographic histories. The samples produced can be used to investigate the sampling properties of any sample statistic under these neutral models.

2,566 citations


Journal ArticleDOI
TL;DR: MethPrimer, based on Primer3, is a program for designing PCR primers for methylation mapping that takes a DNA sequence as its input, searches the sequence for potential CpG islands, and picks primers around the predicted CpG islands or around regions specified by users.
Abstract: Motivation: DNA methylation is an epigenetic mechanism of gene regulation. Bisulfite-conversion-based PCR methods, such as bisulfite sequencing PCR (BSP) and methylation specific PCR (MSP), remain the most commonly used techniques for methylation mapping. Existing primer design programs developed for standard PCR cannot handle primer design for bisulfite-conversion-based PCRs due to changes in DNA sequence context caused by bisulfite treatment and many special constraints both on the primers and the region to be amplified for such experiments. Therefore, the present study was designed to develop a program for such applications. Results: MethPrimer, based on Primer3, is a program for designing PCR primers for methylation mapping. It first takes a DNA sequence as its input and searches the sequence for potential CpG islands. Primers are then picked around the predicted CpG islands or around regions specified by users. MethPrimer can design primers for BSP and MSP. Results of primer selection are delivered through a web browser in text and in graphic view. Availability: MethPrimer is freely accessible at the following Web address http://itsa.ucsf.edu/~urolab/methprimer

2,378 citations


Journal ArticleDOI
TL;DR: Genesis integrates various tools for microarray data analysis such as filters, normalization and visualization tools, distance measures as well as common clustering algorithms including hierarchical clustering, self-organizing maps, k-means, principal component analysis, and support vector machines.
Abstract: Summary: A versatile, platform independent and easy to use Java suite for large-scale gene expression analysis was developed. Genesis integrates various tools for microarray data analysis such as filters, normalization and visualization tools, distance measures as well as common clustering algorithms including hierarchical clustering, self-organizing maps, k-means, principal component analysis, and support vector machines. The results of the clustering are transparent across all implemented methods and enable the analysis of the outcome of different algorithms and parameters. Additionally, mapping of gene expression data onto chromosomal sequences was implemented to enhance promoter analysis and investigation of transcriptional control mechanisms. Availability: http://genome.tugraz.at Contact: zlatko.trajanoski@tugraz.at

1,768 citations


Journal ArticleDOI
TL;DR: Probabilistic Boolean Networks (PBN) are introduced that share the appealing rule-based properties of Boolean networks, but are robust in the face of uncertainty.
Abstract: Motivation: Our goal is to construct a model for genetic regulatory networks such that the model class: (i) incorporates rule-based dependencies between genes; (ii) allows the systematic study of global network dynamics; (iii) is able to cope with uncertainty, both in the data and the model selection; and (iv) permits the quantification of the relative influence and sensitivity of genes in their interactions with other genes. Results: We introduce Probabilistic Boolean Networks (PBN) that share the appealing rule-based properties of Boolean networks, but are robust in the face of uncertainty. We show how the dynamics of these networks can be studied in the probabilistic context of Markov chains, with standard Boolean networks being special cases. Then, we discuss the relationship between PBNs and Bayesian networks—a family of graphical models that explicitly represent probabilistic relationships between variables. We show how probabilistic dependencies between a gene and its parent genes, constituting the basic building blocks of Bayesian networks, can be obtained from PBNs. Finally, we present methods for quantifying the influence of genes on other genes, within the context of PBNs. Examples illustrating the above concepts are presented throughout the paper.

1,571 citations
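
The Markov-chain view of a Probabilistic Boolean Network described in the abstract above can be illustrated with a minimal Python sketch. The two-gene network, its candidate Boolean rules, and the selection probabilities are hypothetical values chosen for illustration and are not taken from the paper. The sketch also assumes that each gene independently picks one of its candidate rules at every step.

```python
from itertools import product

# Hypothetical 2-gene PBN: each gene has candidate Boolean update rules,
# each paired with a selection probability (summing to 1 per gene).
PREDICTORS = {
    0: [(lambda s: s[1], 0.7), (lambda s: s[0] and s[1], 0.3)],
    1: [(lambda s: not s[0], 1.0)],
}

def transition_matrix(predictors, n_genes):
    """Return {state: {next_state: probability}} for the induced Markov chain."""
    states = list(product([0, 1], repeat=n_genes))
    matrix = {s: {t: 0.0 for t in states} for s in states}
    for state in states:
        # Enumerate every joint choice of one rule per gene.
        for choice in product(*(predictors[g] for g in range(n_genes))):
            prob = 1.0
            next_state = []
            for rule, p in choice:
                prob *= p
                next_state.append(int(bool(rule(state))))
            matrix[state][tuple(next_state)] += prob
    return matrix

P = transition_matrix(PREDICTORS, 2)
```

A standard Boolean network corresponds to the special case in which every gene has a single rule with probability 1, so each row of `P` contains a single 1.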


Journal ArticleDOI
TL;DR: A set of tools was developed to construct positional weight matrices from known transcription factor binding sites in a species or taxon-specific manner and to search for matches in DNA sequences.
Abstract: We have developed a set of tools to construct positional weight matrices from known transcription factor binding sites in a species or taxon-specific manner, and to search for matches in DNA sequences.

1,136 citations
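
As an illustration of the two tasks named in the abstract above (building a positional weight matrix from known sites, then searching a sequence for matches), here is a minimal Python sketch. It is not the authors' tool: the binding sites, the pseudocount, the uniform background of 0.25, and the log-odds scoring are all assumptions made for the example.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from equal-length binding sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm

def scan(sequence, pwm):
    """Score every window of the sequence; return (offset, score) pairs."""
    width = len(pwm)
    return [
        (start, sum(pwm[i][sequence[start + i]] for i in range(width)))
        for start in range(len(sequence) - width + 1)
    ]

SITES = ["TATAAT", "TATAAA", "TATGAT", "TACAAT"]  # hypothetical sites
pwm = build_pwm(SITES)
hits = scan("GGCTATAATGC", pwm)
best_offset, best_score = max(hits, key=lambda hit: hit[1])
```

In practice a score threshold would decide which windows count as matches; choosing that threshold is outside the scope of this sketch.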


Journal ArticleDOI
TL;DR: A new homology search algorithm 'PatternHunter' is presented that uses a novel seed model for increased sensitivity and new hit-processing techniques for significantly increased speed.
Abstract: Motivation: Genomics and proteomics studies routinely depend on homology searches based on the strategy of finding short seed matches which are then extended. The exploding genomic data growth presents a dilemma for DNA homology search techniques: increasing seed size decreases sensitivity whereas decreasing seed size slows down computation. Results: We present a new homology search algorithm ‘PatternHunter’ that uses a novel seed model for increased sensitivity and new hit-processing techniques for significantly increased speed. At Blast levels of sensitivity, PatternHunter is able to find homologies between sequences as large as human chromosomes, in mere hours on a desktop. Availability: PatternHunter is available at http://www.bioinformaticssolutions.com, as a commercial package. It runs on all platforms that support Java. PatternHunter technology is being patented; commercial use requires a license from BSI, while non-commercial use will be free.

865 citations


Journal ArticleDOI
TL;DR: A novel analysis procedure for classifying (predicting) human tumor samples based on microarray gene expressions is proposed and PLS proves superior to the well known dimension reduction method of Principal Components Analysis (PCA).
Abstract: Motivation: One important application of gene expression microarray data is classification of samples into categories, such as the type of tumor. The use of microarrays allows simultaneous monitoring of thousands of gene expressions per sample. This ability to measure gene expression en masse has resulted in data with the number of variables p (genes) far exceeding the number of samples N. Standard statistical methodologies in classification and prediction do not work well or even at all when N < p. Modification of existing statistical methodologies or development of new methodologies is needed for the analysis of microarray data. Results: We propose a novel analysis procedure for classifying (predicting) human tumor samples based on microarray gene expressions. This procedure involves dimension reduction using Partial Least Squares (PLS) and classification using Logistic Discrimination (LD) and Quadratic Discriminant Analysis (QDA). We compare PLS to the well known dimension reduction method of Principal Components Analysis (PCA). Under many circumstances PLS proves superior; we illustrate a condition when PCA particularly fails to predict well relative to PLS. The proposed methods were applied to five different microarray data sets involving various human tumor samples: (1) normal versus ovarian tumor; (2) Acute Myeloid Leukemia (AML) versus Acute Lymphoblastic Leukemia (ALL); (3) Diffuse Large B-cell Lymphoma (DLBCLL) versus B-cell Chronic Lymphocytic Leukemia (BCLL); (4) normal versus colon tumor; and (5) Non-Small-Cell-Lung-Carcinoma (NSCLC) versus renal samples. Stability of classification results and methods were further assessed by re-randomization studies. Availability: The methodology can be implemented using a combination of standard statistical methods, available, for example, in SAS. Illustrative SAS code is available from the first author.

847 citations


Journal ArticleDOI
TL;DR: This work presents a graph representation of an MSA that can itself be aligned directly by pairwise dynamic programming, eliminating the need to reduce the MSA to a profile, and introduces a new edit operator, homologous recombination, important for multidomain sequences.
Abstract: Motivation: Progressive Multiple Sequence Alignment (MSA) methods depend on reducing an MSA to a linear profile for each alignment step. However, this leads to loss of information needed for accurate alignment, and gap scoring artifacts. Results: We present a graph representation of an MSA that can itself be aligned directly by pairwise dynamic programming, eliminating the need to reduce the MSA to a profile. This enables our algorithm (Partial Order Alignment (POA)) to guarantee that the optimal alignment of each new sequence versus each sequence in the MSA will be considered. Moreover, this algorithm introduces a new edit operator, homologous recombination, important for multidomain sequences. The algorithm has improved speed (linear time complexity) over existing MSA algorithms, enabling construction of massive and complex alignments (e.g. an alignment of 5000 sequences in 4 h on a Pentium II). We demonstrate the utility of this algorithm on a family of multidomain SH2 proteins, and on EST assemblies containing alternative splicing and polymorphism. Availability: The partial order alignment program POA is available at http://www.bioinformatics.ucla.edu/poa.

786 citations


Journal ArticleDOI
TL;DR: A user-friendly website for the analysis of protein secondary structures from Circular Dichroism (CD) and Synchrotron Radiation Circular Dichroism (SRCD) spectra has been created.
Abstract: A user-friendly website for the analysis of protein secondary structures from Circular Dichroism (CD) and Synchrotron Radiation Circular Dichroism (SRCD) spectra has been created.

724 citations


Journal ArticleDOI
TL;DR: This work has succeeded in finding rules whose prediction accuracies come close to that of TargetP, while still retaining a very simple and interpretable form.
Abstract: Motivation: The prediction of localization sites of various proteins is an important and challenging problem in the field of molecular biology. TargetP, by Emanuelsson et al. (J. Mol. Biol., 300, 1005‐1016, 2000) is a neural network based system which is currently the best predictor in the literature for N-terminal sorting signals. One drawback of neural networks, however, is that it is generally difficult to understand and interpret how and why they make such predictions. In this paper, we aim to generate simple and interpretable rules as predictors, and still achieve a practical prediction accuracy. We adopt an approach which consists of an extensive search for simple rules and various attributes which is partially guided by human intuition. Results: We have succeeded in finding rules whose prediction accuracies come close to that of TargetP, while still retaining a very simple and interpretable form. We also discuss and interpret the discovered rules. Availability: An (experimental) web service using rules obtained by our method is provided at http:

Journal ArticleDOI
Earl Hubbell, Wei-Min Liu, Rui Mei
TL;DR: A hierarchy of simple models is used to design robust estimators meeting these goals for both stand alone and comparative experiments, and shows comparable performance to existing standards.
Abstract: Motivation: We consider the problem of estimating values associated with gene expression from oligonucleotide arrays. Such estimates should linearly track concentration, yield non-negative results, have statistical guarantees of robustness against outliers, and allow estimates of significance and variance. Results: A hierarchy of simple models is used to design robust estimators meeting these goals for both stand alone and comparative experiments. This algorithm has been validated against an extensive panel of known spike experiments, and shows comparable performance to existing standards. Availability: Algorithms available commercially as part of the MAS 5.0 software package. Data sets available from the Affymetrix website. Contact: earl hubbell@affymetrix.com Supplemental Information: Additional data at http://www.affymetrix.com/community/publications/affymetrix/index.affx.

Journal ArticleDOI
TL;DR: ESyPred3D is a new automated homology modeling program that benefits from the improved alignment performance of a new alignment strategy; its alignments are among the most accurate compared to those of participants having used the same template.
Abstract: Motivation: Homology or comparative modeling is currently the most accurate method to predict the three-dimensional structure of proteins. It generally consists of four steps: (1) databank searching to identify the structural homolog, (2) target-template alignment, (3) model building and optimization, and (4) model evaluation. The target-template alignment step is generally accepted as the most critical step in homology modeling. Results: We present here ESyPred3D, a new automated homology modeling program. The method benefits from the improved alignment performance of a new alignment strategy. Alignments are obtained by combining, weighting and screening the results of several multiple alignment programs. The final three-dimensional structure is built using the modeling package MODELLER. ESyPred3D was tested on 13 targets in the CASP4 experiment (Critical Assessment of Techniques for Proteins Structural Prediction). Our alignment strategy obtains better results compared to PSI-BLAST alignments and ESyPred3D alignments are among the most accurate compared to those of participants having used the same template. Availability: ESyPred3D is available through its web site at http://www.fundp.ac.be/urbm/bioinfo/esypred/ Contact: christophe.lambert@fundp.ac.be; http://www.fundp.ac.be/~lambertc

Journal ArticleDOI
TL;DR: The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues; relevant subsets of the genes can be selected that reveal interesting clusterings of the tissues that are consistent either with the external classification of the tissues or with background and biological knowledge of these sets.
Abstract: Motivation: This paper introduces the software EMMIX-GENE that has been developed for the specific purpose of a model-based approach to the clustering of microarray expression data, in particular, of tissue samples on a very large number of genes. The latter is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. A feasible approach is provided by first selecting a subset of the genes relevant for the clustering of the tissue samples by fitting mixtures of t distributions to rank the genes in order of increasing size of the likelihood ratio statistic for the test of one versus two components in the mixture model. The imposition of a threshold on the likelihood ratio statistic used in conjunction with a threshold on the size of a cluster allows the selection of a relevant set of genes. However, even this reduced set of genes will usually be too large for a normal mixture model to be fitted directly to the tissues, and so the use of mixtures of factor analyzers is exploited to reduce effectively the dimension of the feature space of genes. Results: The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues. For both data sets, relevant subsets of the genes are able to be selected that reveal interesting clusterings of the tissues that are either consistent with the external classification of the tissues or with background and biological knowledge of these sets.

Journal ArticleDOI
Wei Pan
TL;DR: All three methods compared here are based on the two-sample t-statistic or a minor variation of it, but they differ in how they associate a statistical significance level with the corresponding statistic, leading to possibly large differences in the resulting significance levels and the numbers of genes detected.
Abstract: Motivation: A common task in analyzing microarray data is to determine which genes are differentially expressed across two kinds of tissue samples or samples obtained under two experimental conditions. Recently several statistical methods have been proposed to accomplish this goal when there are replicated samples under each condition. However, it may not be clear how these methods compare with each other. Our main goal here is to compare three methods, the t-test, a regression modeling approach (Thomas et al., Genome Res., 11, 1227-1236, 2001) and a mixture model approach (Pan et al., http://www.biostat.umn.edu/cgi-bin/rrs?print+2001,2001a,b) with particular attention to their different modeling assumptions. Results: It is pointed out that all three methods are based on using the two-sample t-statistic or its minor variation, but they differ in how to associate a statistical significance level to the corresponding statistic, leading to possibly large differences in the resulting significance levels and the numbers of genes detected. In particular, we give an explicit formula for the test statistic used in the regression approach. Using the leukemia data of Golub et al. (Science, 285, 531-537, 1999), we illustrate these points. We also briefly compare the results with those of several other methods, including the empirical Bayesian method of Efron et al. (J. Am. Stat. Assoc., to appear, 2001) and the Significance Analysis of Microarray (SAM) method of Tusher et al. (Proc. Natl Acad. Sci. USA, 98, 5116-5121, 2001).
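
The per-gene two-sample t-statistic that the abstract above identifies as the common core of the compared methods can be sketched in Python as follows. The unequal-variance (Welch) form is an assumption made for this example, since the abstract only says "the two-sample t-statistic or its minor variation". The gene names and expression values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def two_sample_t(x, y):
    """Welch two-sample t-statistic for one gene's expression levels."""
    standard_error = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error

# Hypothetical log-expression values: one list of replicates per gene.
condition_a = {"gene1": [8.1, 7.9, 8.3, 8.0], "gene2": [5.0, 5.2, 4.9, 5.1]}
condition_b = {"gene1": [6.0, 6.2, 5.9, 6.1], "gene2": [5.1, 4.9, 5.0, 5.2]}

t_stats = {g: two_sample_t(condition_a[g], condition_b[g]) for g in condition_a}
```

How a significance level is then attached to each statistic is where the compared methods diverge, and that step is not covered by this sketch.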

Journal ArticleDOI
TL;DR: QTL Express is the first application for Quantitative Trait Locus mapping in outbred populations with a web-based user interface that allows mapping of single or multiple QTL by the regression approach, with the option to perform permutation or bootstrap tests.
Abstract: QTL Express is the first application for Quantitative Trait Locus (QTL) mapping in outbred populations with a web-based user interface. User input of three files containing a marker map, trait data and marker genotypes allows mapping of single or multiple QTL by the regression approach, with the option to perform permutation or bootstrap tests.

Journal ArticleDOI
TL;DR: Clusters of genes and cell lines were discordant between the two technologies, suggesting that relative intra-technology relationships were not preserved, implying a poor prognosis for a broad utilization of gene expression measurements across platforms.
Abstract: Motivation: The existence of several technologies for measuring gene expression makes the question of cross-technology agreement of measurements an important issue. Cross-platform utilization of data from different technologies has the potential to reduce the need to duplicate experiments but requires corresponding measurements to be comparable. Methods: A comparison of mRNA measurements of 2895 sequence-matched genes in 56 cell lines from the standard panel of 60 cancer cell lines from the National Cancer Institute (NCI 60) was carried out by calculating correlation between matched measurements and calculating concordance between clusters from two high-throughput DNA microarray technologies, Stanford type cDNA microarrays and Affymetrix oligonucleotide microarrays. Results: In general, corresponding measurements from the two platforms showed poor correlation. Clusters of genes and cell lines were discordant between the two technologies, suggesting that relative intra-technology relationships were not preserved. GC-content, sequence length, average signal intensity, and an estimator of cross-hybridization were found to be associated with the degree of correlation. This suggests gene-specific, or more correctly probe-specific, factors influencing measurements differently in the two platforms, implying a poor prognosis for a broad utilization of gene expression measurements across platforms.

Journal ArticleDOI
TL;DR: This work presents rank-based algorithms for making detection and comparison calls on expression microarrays based on Wilcoxon's signed-rank test, which should be robust against data outliers over a wide target concentration range.
Abstract: Motivation: We consider the detection of expressed genes and the comparison of them in different experiments with the high-density oligonucleotide microarrays. The results are summarized as the detection calls and comparison calls, and they should be robust against data outliers over a wide target concentration range. It is also helpful to provide parameters that can be adjusted by the user to balance specificity and sensitivity under various experimental conditions. Results: We present rank-based algorithms for making detection and comparison calls on expression microarrays. The detection call algorithm utilizes the discrimination scores. The comparison call algorithm utilizes intensity differences. Both algorithms are based on Wilcoxon’s signed-rank test. Several parameters in the algorithms can be adjusted by the user to alter levels of specificity and sensitivity. The algorithms were developed and analyzed using spiked-in genes arrayed in a Latin square format. In the call process, p-values are calculated to give a confidence level for the pertinent hypotheses. For comparison calls made between two arrays, two primary normalization factors are defined. To overcome the difficulty that constant normalization factors do not fit all probe sets, we perturb these primary normalization factors and make increasing or decreasing calls only if all resulting p-values fall within a defined critical region. Our algorithms also automatically handle scanner saturation. Availability: These algorithms are available commercially as part of the MAS 5.0 software package. Contact: wei-min liu@affymetrix.com

Journal ArticleDOI
TL;DR: A cross-validated study suggests that MARCOIL improves predictions compared to the traditional PSSM algorithm, especially for some protein families and for short CCDs.
Abstract: Motivation: Large-scale sequence data require methods for the automated annotation of protein domains. Many of the predictive methods are based either on a Position Specific Scoring Matrix (PSSM) of fixed length or on a windowless Hidden Markov Model (HMM). The performance of the two approaches is tested for Coiled-Coil Domains (CCDs). The prediction of CCDs is used frequently, and its optimization seems worthwhile. Results: We have conceived MARCOIL, an HMM for the recognition of proteins with a CCD on a genomic scale. A cross-validated study suggests that MARCOIL improves predictions compared to the traditional PSSM algorithm, especially for some protein families and for short CCDs. The study was designed to reveal differences inherent in the two methods. Potential confounding factors such as differences in the dimension of parameter space and in the parameter values were avoided by using the same amino acid propensities and by keeping the transition probabilities of the HMM constant during cross-validation. Availability: The prediction program and the databases are available at http://www.wehi.edu.au/bioweb/Mauro/

Journal ArticleDOI
TL;DR: Geno3D (http://geno3d-pbil.ibcp.fr) is an automatic web server for protein molecular modelling that identifies homologous proteins with known 3D structures by using PSI-BLAST and performs 3D construction of the protein by using a distance geometry approach.
Abstract: Geno3D (http://geno3d-pbil.ibcp.fr) is an automatic web server for protein molecular modelling. Starting with a query protein sequence, the server performs the homology modelling in six successive steps: (i) identify homologous proteins with known 3D structures by using PSI-BLAST; (ii) provide the user all potential templates through a very convenient user interface for target selection; (iii) perform the alignment of both query and subject sequences; (iv) extract geometrical restraints (dihedral angles and distances) for corresponding atoms between the query and the template; (v) perform the 3D construction of the protein by using a distance geometry approach and (vi) finally send the results by e-mail to the user.

Journal ArticleDOI
TL;DR: This paper shows that tolerating some redundancy makes for more efficient use of short-word filters and increases the program's speed 100 times; the new program clusters NR at 70% identity within 2 h, and at 50% identity in approximately 5 days.
Abstract: Motivation: Sequence clustering replaces groups of similar sequences in a database with single representatives. Clustering large protein databases like the NCBI Non-Redundant database (NR) using even the best currently available clustering algorithms is very time-consuming and only practical at relatively high sequence identity thresholds. Our previous program, CD-HI, clustered NR at 90% identity in ∼1 h and at 75% identity in ∼1 day on a 1 GHz Linux PC (Li et al., Bioinformatics, 17, 282, 2001); however, even faster clustering is needed because protein databases are growing rapidly and many applications require lower attainable thresholds. Results: Our previous algorithm (CD-HI) employed short-word filters to speed up the clustering. In this paper, we show that tolerating some redundancy makes for more efficient use of these short-word filters and increases the program’s speed 100 times. Our new program implements this technique and clusters NR at 70% identity within 2 h, and at 50% identity in ∼5 days. Although some redundancy is present after clustering, our new program’s results only differ from our previous program’s by less than 0.4%. Availability: The program and its previous version are available at http://bioinformatics.burnham-inst.org/cd-hi Contact: liwz@burnham-inst.org; adam@burnham-inst.org
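
The idea behind a short-word filter, as mentioned in the abstract above, can be sketched in Python. This is not the CD-HI implementation: the function names, the word length `k=2`, and the bound used are assumptions for illustration. The bound follows from a counting argument for ungapped matches: a sequence of length L contains L - k + 1 overlapping words, and each mismatch can destroy at most k of them.

```python
from collections import Counter

def word_counts(seq, k):
    """Multiset of all overlapping words (k-mers) in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def shared_words(a, b, k):
    """Number of words common to both sequences, counting multiplicity."""
    return sum((word_counts(a, k) & word_counts(b, k)).values())

def min_shared_words(length, identity_pct, k):
    """Fewest shared words an ungapped match at this identity can have."""
    matches_needed = -(-identity_pct * length // 100)  # integer ceiling
    mismatches = length - matches_needed
    return max(0, length - k + 1 - k * mismatches)

def passes_filter(a, b, identity_pct, k=2):
    """Cheap pre-check: only pairs that pass need a full alignment."""
    needed = min_shared_words(min(len(a), len(b)), identity_pct, k)
    return shared_words(a, b, k) >= needed

# Hypothetical peptides: the first pair differs at one position only.
close_pair = passes_filter("MKTAYIAKQR", "MKTAYIAKQL", 90)
distant_pair = passes_filter("MKTAYIAKQR", "GGSWPLNDCE", 90)
```

Pairs that fail the filter cannot reach the identity threshold under the ungapped assumption, so the expensive alignment step is skipped for them.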

Journal ArticleDOI
TL;DR: Studying cell cycle-related gene expression in yeast, it is found that the dominant expression modes could be related to distinct biological functions, such as phases of the cell cycle or the mating response.
Abstract: Motivation: The expression of genes is controlled by specific combinations of cellular variables. We applied Independent Component Analysis (ICA) to gene expression data, deriving a linear model based on hidden variables, which we term ‘expression modes’. The expression of each gene is a linear function of the expression modes, where, according to the ICA model, the linear influences of different modes show a minimal statistical dependence, and their distributions deviate sharply from the normal distribution. Results: Studying cell cycle-related gene expression in yeast, we found that the dominant expression modes could be related to distinct biological functions, such as phases of the cell cycle or the mating response. Analysis of human lymphocytes revealed modes that were related to characteristic differences between cell types. With both data sets, the linear influences of the dominant modes showed distributions with large tails, indicating the existence of specifically up- and downregulated target genes. The expression modes and their influences can be used to visualize the samples and genes in low-dimensional spaces. A projection to expression modes helps to highlight particular biological functions, to reduce noise, and to compress the data in a biologically sensible way. Availability: The FastICA algorithm (Hyvärinen, IEEE

Journal ArticleDOI
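TL;DR: Three model-free approaches (a nonparametric t-test, the Wilcoxon rank sum test, and a heuristic 'ideal discriminator method' based on high Pearson correlation to a perfectly differentiating gene) are compared for identifying differentially expressed genes; applied to lung tumor and lymphoma data sets, they identify biologically relevant genes that allow clear separation of the groups in question.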
Abstract: Motivation: Gene expression experiments provide a fast and systematic way to identify disease markers relevant to clinical care. In this study, we address the problem of robust identification of differentially expressed genes from microarray data. Differentially expressed genes, or discriminator genes, are genes with significantly different expression in two user-defined groups of microarray experiments. We compare three model-free approaches: (1) nonparametric t-test, (2) Wilcoxon (or Mann‐Whitney) rank sum test, and (3) a heuristic method based on high Pearson correlation to a perfectly differentiating gene (‘ideal discriminator method’). We systematically assess the performance of each method based on simulated and biological data under varying noise levels and p-value cutoffs. Results: All methods exhibit very low false positive rates and identify a large fraction of the differentially expressed genes in simulated data sets with noise level similar to that of actual data. Overall, the rank sum test appears most conservative, which may be advantageous when the computationally identified genes need to be tested biologically. However, if a more inclusive list of markers is desired, a higher p-value cutoff or the nonparametric t-test may be appropriate. When applied to data from lung tumor and lymphoma data sets, the methods identify biologically relevant differentially expressed genes that allow clear separation of groups in question. Thus the methods described and evaluated here provide a convenient and robust way to identify differentially expressed genes for further biological and clinical analysis. Availability: By request from the authors.
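
The third approach in the abstract above, correlation to a perfectly differentiating gene, can be sketched in Python. The 0/1 encoding of the ideal profile, the cutoff of 0.9, and the expression values are assumptions made for this example. The abstract does not specify how the authors encode the ideal discriminator or which cutoff they use.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Ideal discriminator: 0 for every sample in group 1, 1 for group 2.
ideal = [0, 0, 0, 1, 1, 1]

# Hypothetical expression profiles across the same six samples.
expression = {
    "geneA": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
}

scores = {gene: pearson(profile, ideal) for gene, profile in expression.items()}
selected = [gene for gene, r in scores.items() if abs(r) >= 0.9]
```

Taking the absolute value keeps genes that are higher in either group; a signed cutoff would restrict the list to one direction.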

Journal ArticleDOI
TL;DR: This work proposes to approach the detection of gene and protein names in scientific abstracts as part-of-speech tagging, the most basic form of linguistic corpus annotation, and demonstrates that this method can be applied to large sets of MEDLINE abstracts, without the need for special conditions or human experts to predetermine relevant subsets.
Abstract: Motivation The MEDLINE database of biomedical abstracts contains scientific knowledge about thousands of interacting genes and proteins. Automated text processing can aid in the comprehension and synthesis of this valuable information. The fundamental task of identifying gene and protein names is a necessary first step towards making full use of the information encoded in biomedical text. This remains a challenging task due to the irregularities and ambiguities in gene and protein nomenclature. We propose to approach the detection of gene and protein names in scientific abstracts as part-of-speech tagging, the most basic form of linguistic corpus annotation. Results We present a method for tagging gene and protein names in biomedical text using a combination of statistical and knowledge-based strategies. This method incorporates automatically generated rules from a transformation-based part-of-speech tagger, and manually generated rules from morphological clues, low frequency trigrams, indicator terms, suffixes and part-of-speech information. Results of an experiment on a test corpus of 56K MEDLINE documents demonstrate that our method to extract gene and protein names can be applied to large sets of MEDLINE abstracts, without the need for special conditions or human experts to predetermine relevant subsets. Availability The programs are available on request from the authors.

Journal ArticleDOI
TL;DR: A clustering procedure based on the Bayesian infinite mixture model and applied to clustering gene expression profiles that allows for incorporation of uncertainties involved in the model selection in the final assessment of confidence in similarities of expression profiles.
Abstract: MOTIVATION The biologic significance of results obtained through cluster analyses of gene expression data generated in microarray experiments has been demonstrated in many studies. In this article we focus on the development of a clustering procedure based on the concept of Bayesian model-averaging and a precise statistical model of expression data. RESULTS We developed a clustering procedure based on the Bayesian infinite mixture model and applied it to clustering gene expression profiles. Clusters of genes with similar expression patterns are identified from the posterior distribution of clusterings defined implicitly by the stochastic data-generation model. The posterior distribution of clusterings is estimated by a Gibbs sampler. We summarized the posterior distribution of clusterings by calculating posterior pairwise probabilities of co-expression and used the complete linkage principle to create clusters. This approach has several advantages over usual clustering procedures. The analysis allows for incorporation of a reasonable probabilistic model for generating data. The method does not require specifying the number of clusters, and the resulting optimal clustering is obtained by averaging over models with all possible numbers of clusters. Expression profiles that are not similar to any other profile are automatically detected, the method incorporates experimental replicates, and it can be extended to accommodate missing data. This approach represents a qualitative shift in the model-based cluster analysis of expression data because it allows for incorporation of uncertainties involved in the model selection in the final assessment of confidence in similarities of expression profiles. We also demonstrated the importance of incorporating the information on experimental variability into the clustering model.
AVAILABILITY The MS Windows(TM) based program implementing the Gibbs sampler and supplemental material is available at http://homepages.uc.edu/~medvedm/BioinformaticsSupplement.htm CONTACT medvedm@email.uc.edu
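The step of summarizing the posterior distribution of clusterings by pairwise co-expression probabilities can be sketched as follows. This is an illustration only: it assumes the Gibbs sampler output is already available as a list of cluster-label vectors (one label per gene per sample), and the function name and toy samples are hypothetical rather than taken from the authors' program.

```python
def pairwise_coexpression(samples):
    """Estimate P(gene i and gene j share a cluster) from sampled clusterings.

    samples: list of clusterings; each clustering is a list of cluster labels,
             one per gene. Label values need not be consistent across samples,
             because only equality within a sample is used.
    """
    n_genes = len(samples[0])
    counts = [[0] * n_genes for _ in range(n_genes)]
    for labels in samples:
        for i in range(n_genes):
            for j in range(n_genes):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    n_samples = len(samples)
    return [[c / n_samples for c in row] for row in counts]


# Four hypothetical sampled clusterings of four genes.
samples = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 2],
]
probs = pairwise_coexpression(samples)
```

Taking `1 - probs[i][j]` as a distance gives a matrix that a complete linkage procedure could then cluster, which is the role these probabilities play in the abstract.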

Journal ArticleDOI
TL;DR: A new framework for representing a set of multi-dimensional gene expression data as a Minimum Spanning Tree (MST), a concept from the graph theory, which can overcome many of the problems faced by classical clustering algorithms.
Abstract: Motivation: Gene expression data clustering provides a powerful tool for studying functional relationships of genes in a biological process. Identifying correlated expression patterns of genes represents the basic challenge in this clustering problem. Results: This paper describes a new framework for representing a set of multi-dimensional gene expression data as a Minimum Spanning Tree (MST), a concept from the graph theory. A key property of this representation is that each cluster of the expression data corresponds to one subtree of the MST, which rigorously converts a multi-dimensional clustering problem to a tree partitioning problem. We have demonstrated that though the inter-data relationship is greatly simplified in the MST representation, no essential information is lost for the purpose of clustering. Two key advantages in representing a set of multi-dimensional data as an MST are: (1) the simple structure of a tree facilitates efficient implementations of rigorous clustering algorithms, which otherwise are highly computationally challenging; and (2) as an MST-based clustering does not depend on detailed geometric shape of a cluster, it can overcome many of the problems faced by classical clustering algorithms. Based on the MST representation, we have developed a number of rigorous and efficient clustering algorithms, including two with guaranteed global optimality. We have implemented these algorithms as a computer software EXpression data Clustering Analysis and VisualizATiOn Resource (EXCAVATOR). To demonstrate its effectiveness, we have tested it on three data sets, i.e. expression data from yeast Saccharomyces cerevisiae, expression data in response of human fibroblasts to serum, and Arabidopsis expression data in response to chitin elicitation. The test results are highly encouraging. Availability: EXCAVATOR is available on request from the authors.
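The correspondence between clusters and subtrees of an MST can be illustrated with a small sketch. It assumes Euclidean distance between expression profiles and uses one simple partitioning rule, cutting the k-1 longest MST edges. The abstract does not spell out the objective functions used in EXCAVATOR, so this rule and all names below are assumptions for illustration.

```python
import math


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (length, u, v) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection of each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges


def mst_clusters(points, k):
    """Split the MST into k subtrees by dropping its k-1 longest edges."""
    edges = sorted(mst_edges(points))
    kept = edges[: len(edges) - (k - 1)]
    root = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while root[i] != i:
            root[i] = root[root[i]]
            i = root[i]
        return i

    for _, u, v in kept:
        root[find(u)] = find(v)
    return [find(i) for i in range(len(points))]


# Two well-separated toy groups of two-dimensional profiles (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cluster_ids = mst_clusters(points, k=2)
```

Each distinct value in `cluster_ids` identifies one subtree, so the multi-dimensional clustering problem has been reduced to choosing which tree edges to cut.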

Journal ArticleDOI
TL;DR: Genome Rearrangements In Man and Mouse (GRIMM) is a tool for analyzing rearrangements of gene orders in pairs of unichromosomal and multichromosomal genomes, with either signed or unsigned gene data.
Abstract: Summary Genome Rearrangements In Man and Mouse (GRIMM) is a tool for analyzing rearrangements of gene orders in pairs of unichromosomal and multichromosomal genomes, with either signed or unsigned gene data. Although there are several programs for analyzing rearrangements in unichromosomal genomes, this is the first to analyze rearrangements in multichromosomal genomes. GRIMM also provides a new algorithm for analyzing comparative maps for which gene directions are unknown. Availability A web server, with instructions and sample data, is available at http://www-cse.ucsd.edu/groups/bioinformatics/GRIMM.

Journal ArticleDOI
TL;DR: A model for random gene perturbations is developed and an explicit formula for the transition probabilities in the new Probabilistic Boolean Networks (PBNs) is derived and it is demonstrated that states of the network that are more 'easily reachable' from other states are more stable in the presence of gene perturbations.
Abstract: Motivation: A major objective of gene regulatory network modeling, in addition to gaining a deeper understanding of genetic regulation and control, is the development of computational tools for the identification and discovery of potential targets for therapeutic intervention in diseases such as cancer. We consider the general question of the potential effect of individual genes on the global dynamical network behavior, both from the view of random gene perturbation as well as intervention in order to elicit desired network behavior. Results: Using a recently introduced class of models, called Probabilistic Boolean Networks (PBNs), this paper develops a model for random gene perturbations and derives an explicit formula for the transition probabilities in the new PBN model. This result provides a building block for performing simulations and deriving other results concerning network dynamics. An example is provided to show how the gene perturbation model can be used to compute long-term influences of genes on other genes. Following this, the problem of intervention is addressed via the development of several computational tools based on first-passage times in Markov chains. The consequence is a methodology for finding the best gene with which to intervene in order to most likely achieve desirable network behavior. The ideas are illustrated with several examples in which the goal is to induce the network to transition into a desired state, or set of states. The corresponding issue of avoiding undesirable states is also addressed. Finally, the paper turns to the important problem of assessing the effect of gene perturbations on long-run network behavior. A bound on the steady-state probabilities is derived in terms of the perturbation probability. The result demonstrates that states of the network that are more ‘easily reachable’ from other states are more stable in the presence of gene perturbations. 
Consequently, these are hypothesized to correspond to cellular functional states. Availability: A library of functions written in MATLAB for …

Journal ArticleDOI
TL;DR: To encourage participation and accelerate progress in this expanding field of literature data mining, it is proposed creating challenge evaluations, and two specific applications are described in this context.
Abstract: We review recent results in literature data mining for biology and discuss the need and the steps for a challenge evaluation for this field. Literature data mining has progressed from simple recognition of terms to extraction of interaction relationships from complex sentences, and has broadened from recognition of protein interactions to a range of problems such as improving homology search, identifying cellular location, and so on. To encourage participation and accelerate progress in this expanding field, we propose creating challenge evaluations, and we describe two specific applications in this context.

Journal ArticleDOI
TL;DR: A simple nearest neighbor approach (BLAST), methods based on multiple alignments generated by a statistical profile Hidden Markov Model (HMM), and methods, including Support Vector Machines (SVMs), that transform protein sequences into fixed-length feature vectors are compared.
Abstract: Motivation: The enormous amount of protein sequence data uncovered by genome research has increased the demand for computer software that can automate the recognition of new proteins. We discuss the relative merits of various automated methods for recognizing G-Protein Coupled Receptors (GPCRs), a superfamily of cell membrane proteins. GPCRs are found in a wide range of organisms and are central to a cellular signalling network that regulates many basic physiological processes. They are the focus of a significant amount of current pharmaceutical research because they play a key role in many diseases. However, their tertiary structures remain largely unsolved. The methods described in this paper use only primary sequence information to make their predictions. We compare a simple nearest neighbor approach (BLAST), methods based on multiple alignments generated by a statistical profile Hidden Markov Model (HMM), and methods, including Support Vector Machines (SVMs), that transform protein sequences into fixed-length feature vectors. Results: The last is the most computationally expensive method, but our experiments show that, for those interested in annotation-quality classification, the results are worth the effort. In two-fold cross-validation experiments testing recognition of GPCR subfamilies that bind a specific ligand (such as a histamine molecule), the errors per sequence at the Minimum Error Point (MEP) were 13.7% for multi-class SVMs, 17.1% for our SVMtree method of hierarchical multi-class SVM classification, 25.5% for BLAST, 30% for profile HMMs, and 49% for classification based on nearest neighbor feature vector Kernel Nearest Neighbor (kernNN). The percentage of true positives recognized before the first false positive was 65% for both SVM methods, 13% for BLAST, 5% for profile HMMs and 4% for kernNN. 
Availability: We have set up a web server for GPCR subfamily classification based on hierarchical multi-class SVMs at http://www.soe.ucsc.edu/research/compbio/gpcr-subclass. By scanning predicted peptides found in the human genome with the SVMtree server, we have identified a large number of genes that encode GPCRs. A list of our predictions for human GPCRs is available at http://www.soe.ucsc.edu/research/compbio/gpcr hg/ class results. We also provide suggested subfamily classification for 18 sequences previously identified as unclassified Class A (rhodopsin-like) GPCRs in GPCRDB (Horn et al., Nucleic Acids Res., 26, 277‐281, 1998), available at http://www.soe.ucsc.edu/research/compbio/ gpcr/classA unclassified/.
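One of the metrics reported in the results, the percentage of true positives recognized before the first false positive, can be computed from a ranked list of classifier scores. The sketch below is a plain reading of that phrase and not the authors' evaluation code. The function name and the toy scores are hypothetical, and ties between scores are broken by input order.

```python
def percent_true_positives_before_first_false_positive(scores, labels):
    """Rank by descending score and count positives seen before the first negative.

    scores: classifier scores, higher meaning more confident membership
    labels: True for actual family members, False otherwise
    Returns the count as a percentage of all positives (0.0 if there are none).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(1 for _, is_positive in ranked if is_positive)
    if total_positives == 0:
        return 0.0
    found = 0
    for _, is_positive in ranked:
        if not is_positive:
            break
        found += 1
    return 100.0 * found / total_positives


# Toy example: the first false positive is ranked third.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [True, True, False, True, False]
result = percent_true_positives_before_first_false_positive(scores, labels)
```

A higher value means more members of a subfamily can be accepted without manual review before any incorrect assignment appears in the ranking.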