
Showing papers in "Journal of Bioinformatics and Computational Biology in 2004"


Journal ArticleDOI
TL;DR: Extending the single optimized spaced seed of PatternHunter to multiple ones, PatternHunter II simultaneously remedies the lack of sensitivity of Blastn and the lack of speed of Smith-Waterman, for homology search.
Abstract: Extending the single optimized spaced seed of PatternHunter(20) to multiple ones, PatternHunter II simultaneously remedies the lack of sensitivity of Blastn and the lack of speed of Smith-Waterman, for homology search. At Blastn speed, PatternHunter II approaches Smith-Waterman sensitivity, bringing homology search methodology research back to a full circle.

301 citations
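
The spaced-seed idea in the abstract above can be illustrated with a minimal sketch. It is not PatternHunter's implementation: the function names are hypothetical, the two toy sequences are made up, and the seed string is the weight-11 seed commonly cited for the original PatternHunter, assumed here for illustration. A '1' in the seed marks a position that must match; a '0' is a "don't care" position.

```python
def seed_hits(a, b, seed):
    """Offsets at which every '1' position of the seed matches between a and b."""
    care = [i for i, c in enumerate(seed) if c == "1"]
    n = min(len(a), len(b))
    return [
        off
        for off in range(n - len(seed) + 1)
        if all(a[off + i] == b[off + i] for i in care)
    ]


def multi_seed_hits(a, b, seeds):
    """Offsets hit by at least one seed (the 'multiple seeds' idea, in miniature)."""
    hits = set()
    for seed in seeds:
        hits.update(seed_hits(a, b, seed))
    return hits


CONTIGUOUS = "1" * 11            # Blastn-style seed: 11 consecutive matches
SPACED = "111010010100110111"    # assumed example seed: weight 11, span 18

# Toy aligned sequences that differ only at indices 5 and 11.
a = "ACGTTGCAAGCTTAGGCA"
b = "ACGTTCCAAGCATAGGCA"

contiguous_hits = seed_hits(a, b, CONTIGUOUS)
spaced_hits = seed_hits(a, b, SPACED)
any_hits = multi_seed_hits(a, b, [CONTIGUOUS, SPACED])
```

In this toy case both mismatches fall on "don't care" positions, so the spaced seed finds the similarity at offset 0 while no window of 11 consecutive matches exists.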


Journal ArticleDOI
TL;DR: It is proved that if there is a galled-tree, then the one produced by the algorithm minimizes the number of recombinations over all phylogenetic networks for the data, even allowing multiple-crossover recombinations.
Abstract: A phylogenetic network is a generalization of a phylogenetic tree, allowing structural properties that are not tree-like. In a seminal paper, Wang et al.(1) studied the problem of constructing a phylogenetic network, allowing recombination between sequences, with the constraint that the resulting cycles must be disjoint. We call such a phylogenetic network a "galled-tree". They gave a polynomial-time algorithm that was intended to determine whether or not a set of sequences could be generated on a galled-tree. Unfortunately, the algorithm by Wang et al.(1) is incomplete and does not constitute a necessary test for the existence of a galled-tree for the data. In this paper, we completely solve the problem. Moreover, we prove that if there is a galled-tree, then the one produced by our algorithm minimizes the number of recombinations over all phylogenetic networks for the data, even allowing multiple-crossover recombinations. We also prove that when there is a galled-tree for the data, the galled-tree minimizing the number of recombinations is "essentially unique". We also note two additional results: first, any set of sequences that can be derived on a galled-tree can be derived on a true tree (without recombination cycles), where at most one back mutation per site is allowed; second, the site compatibility problem (which is NP-hard in general) can be solved in polynomial time for any set of sequences that can be derived on a galled-tree. Perhaps more important than the specific results about galled-trees, we introduce an approach that can be used to study recombination in general phylogenetic networks. This paper greatly extends the conference version that appears in an earlier work.(8) PowerPoint slides of the conference talk can be found at our website.(7)

220 citations


Journal ArticleDOI
TL;DR: This work proposes a statistical method for estimating a gene network based on Bayesian networks from microarray gene expression data together with biological knowledge including protein-protein interactions, protein-DNA interactions, binding site information, existing literature and so on.
Abstract: We propose a statistical method for estimating a gene network based on Bayesian networks from microarray gene expression data together with biological knowledge including protein-protein interactions, protein-DNA interactions, binding site information, existing literature and so on. Microarray data do not contain enough information for constructing gene networks accurately in many cases. Our method adds biological knowledge to the estimation method of gene networks under a Bayesian statistical framework, and also controls the trade-off between microarray information and biological knowledge automatically. We conduct Monte Carlo simulations to show the effectiveness of the proposed method. We analyze Saccharomyces cerevisiae gene expression data as an application.

186 citations


Journal ArticleDOI
TL;DR: An initial review of the various modeling approaches based on Petri net found in the literature, and of the biological systems that have been successfully modeled with these approaches.
Abstract: Petri nets are a discrete event simulation approach developed for system representation, in particular for their concurrency and synchronization properties. Various extensions to the original theory of Petri nets have been used for modeling molecular biology systems and metabolic networks. These extensions are stochastic, colored, hybrid and functional. This paper carries out an initial review of the various modeling approaches based on Petri net found in the literature, and of the biological systems that have been successfully modeled with these approaches. Moreover, the modeling goals and possibilities of qualitative analysis and system simulation of each approach are discussed.

155 citations
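
As a rough illustration of the basic (non-extended) Petri net formalism the review starts from, the sketch below fires transitions on a marking of tokens. The enzyme-substrate example, the place names and the function names are all assumptions made for illustration; none of the stochastic, colored, hybrid or functional extensions discussed in the paper are modeled.

```python
def enabled(marking, transition):
    """A transition is enabled when every input place holds enough tokens."""
    pre, _post = transition
    return all(marking.get(place, 0) >= n for place, n in pre.items())


def fire(marking, transition):
    """Consume tokens from input places and produce tokens in output places."""
    if not enabled(marking, transition):
        raise ValueError("transition is not enabled")
    pre, post = transition
    new = dict(marking)
    for place, n in pre.items():
        new[place] = new.get(place, 0) - n
    for place, n in post.items():
        new[place] = new.get(place, 0) + n
    return new


def run(marking, transitions, max_steps=100):
    """Fire the first enabled transition repeatedly until none is enabled."""
    for _ in range(max_steps):
        ready = [t for t in transitions if enabled(marking, t)]
        if not ready:
            break
        marking = fire(marking, ready[0])
    return marking


# Hypothetical toy network: E + S -> ES, then ES -> E + P.
BIND = ({"E": 1, "S": 1}, {"ES": 1})
RELEASE = ({"ES": 1}, {"E": 1, "P": 1})

final = run({"E": 1, "S": 2, "ES": 0, "P": 0}, [BIND, RELEASE])
```

Starting from one enzyme token and two substrate tokens, the net ends with both substrates converted to product and the enzyme free again.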


Journal ArticleDOI
TL;DR: This is a review of a new and essentially simple method of inferring phylogenetic relationships from complete genome data without using sequence alignment based on counting the appearance frequency of oligopeptides in the collection of protein sequences of a species.
Abstract: This is a review of a new and essentially simple method of inferring phylogenetic relationships from complete genome data without using sequence alignment. The method is based on counting the appearance frequency of oligopeptides of a fixed length (up to K = 6) in the collection of protein sequences of a species. It is a method without fine adjustment and choice of genes. Applied to prokaryotic genomes it has led to results comparable with the bacteriologists' systematics as reflected in the latest 2002 outline of Bergey's Manual of Systematic Bacteriology. The method has also been used to compare chloroplast genomes and to study the phylogeny of coronaviruses, including human SARS-CoV. A key point in our approach is subtraction of a random background from the original counts by using a Markov model of order K-2 in order to highlight the shaping role of natural selection. The implications of the subtraction procedure are specially analyzed and further development of the new approach is indicated.

90 citations
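
The counting and background-subtraction steps described in the abstract above might be sketched as follows. The abstract does not give the formula, so the Markov predictor used here, p0(a1..aK) = p(a1..aK-1) · p(a2..aK) / p(a2..aK-1), and the relative difference (p - p0) / p0 are assumptions taken from the usual formulation of composition-vector methods. The function names and the toy protein are hypothetical, and only K-strings that actually occur are scored.

```python
from collections import Counter


def kmer_freqs(proteins, k):
    """Relative frequencies of all length-k substrings in a collection of proteins."""
    counts = Counter(
        seq[i:i + k] for seq in proteins for i in range(len(seq) - k + 1)
    )
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()} if total else {}


def composition_vector(proteins, k):
    """Observed K-string frequencies with an assumed order-(K-2) Markov background removed."""
    if k < 3:
        raise ValueError("k must be at least 3 for an order-(k-2) background")
    p_k = kmer_freqs(proteins, k)
    p_k1 = kmer_freqs(proteins, k - 1)
    p_k2 = kmer_freqs(proteins, k - 2)
    vector = {}
    for word, p in p_k.items():
        # Background predicted from the two overlapping (K-1)-strings
        # and the (K-2)-string they share.
        p0 = p_k1[word[:-1]] * p_k1[word[1:]] / p_k2[word[1:-1]]
        vector[word] = (p - p0) / p0
    return vector


vec = composition_vector(["MKVLA"], 3)
```

For the single toy sequence, each 3-string has frequency 1/3 and a predicted background of 5/16, giving a component of 1/15.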


Journal ArticleDOI
TL;DR: The positive feedback of cI repressor gene transcription, enhanced by the CI dimer cooperative binding, is the key to the robustness of the phage lambda genetic switch against mutations and fluctuations in kinetic parameter values.
Abstract: Based on the dynamical structure theory for complex networks recently developed by one of us and on the physical-chemical models for gene regulation, developed by Shea and Ackers in the 1980's, we formulate a direct and concise mathematical framework for the genetic switch controlling phage lambda life cycles, which naturally includes the stochastic effect. The dynamical structure theory states that the dynamics of a complex network is determined by its four elementary components: The dissipation (analogous to degradation), the stochastic force, the driving force determined by a potential, and the transverse force. The potential may be interpreted as a landscape for the phage development in terms of attractive basins, saddle points, peaks and valleys. The dissipation gives rise to the adaptivity of the phage in the landscape defined by the potential: The phage always has the tendency to approach the bottom of the nearby attractive basin. The transverse force tends to keep the network on the equal-potential contour of the landscape. The stochastic fluctuation gives the phage the ability to search around the potential landscape by passing through saddle points. With molecular parameters in our model fixed primarily by the experimental data on wild-type phage and supplemented by data on one mutant, our calculated results on mutants agree quantitatively with the available experimental observations on other mutants for protein number, lysogenization frequency, and a lysis frequency in lysogen culture. The calculation reproduces the observed robustness of the phage lambda genetic switch. This is the first mathematical description that successfully represents such a wide variety of major experimental phenomena. 
Specifically, we find: (1) The explanation for both the stability and the efficiency of phage lambda genetic switch is the exponential dependence of saddle point crossing rate on potential barrier height, a result of the stochastic motion in a landscape; and (2) The positive feedback of cI repressor gene transcription, enhanced by the CI dimer cooperative binding, is the key to the robustness of the phage lambda genetic switch against mutations and fluctuations in kinetic parameter values.

88 citations


Journal ArticleDOI
TL;DR: This work studies the problem of computing optimal spaced seeds for detecting homologous coding regions in unannotated genomic sequences, and gives an efficient algorithm to compute the optimal spaced seed when conservation patterns are generated by these models.
Abstract: Optimal spaced seeds were developed as a method to increase sensitivity of local alignment programs similar to BLASTN. Such seeds have been used before in the program PatternHunter, and have given improved sensitivity and running time relative to BLASTN in genome-genome comparison. We study the problem of computing optimal spaced seeds for detecting homologous coding regions in unannotated genomic sequences. By using well-chosen seeds, we are able to improve the sensitivity of coding sequence alignment over that of TBLASTX, while keeping runtime comparable to BLASTN. We identify good seeds by first giving effective hidden Markov models of conservation in alignments of homologous coding regions. We give an efficient algorithm to compute the optimal spaced seed when conservation patterns are generated by these models. Our results offer the hope of improved gene finding due to fewer missed exons in DNA/DNA comparison, and more effective homology search in general, and may have applications outside of bioinformatics.

78 citations


Journal ArticleDOI
TL;DR: LOGOS is presented, an integrated LOcal and GlObal motif Sequence model for biopolymer sequences, which provides a principled framework for developing, modularizing, extending and computing expressive motif models for complex biopolymer sequence analysis.
Abstract: The complexity of the global organization and internal structure of motifs in higher eukaryotic organisms raises significant challenges for motif detection techniques. To achieve successful de novo motif detection, it is necessary to model the complex dependencies within and among motifs and to incorporate biological prior knowledge. In this paper, we present LOGOS, an integrated LOcal and GlObal motif Sequence model for biopolymer sequences, which provides a principled framework for developing, modularizing, extending and computing expressive motif models for complex biopolymer sequence analysis. LOGOS consists of two interacting submodels: HMDM, a local alignment model capturing biological prior knowledge and positional dependency within the motif local structure; and HMM, a global motif distribution model modeling frequencies and dependencies of motif occurrences. Model parameters can be fit using training motifs within an empirical Bayesian framework. A variational EM algorithm is developed for de novo motif detection. LOGOS improves over existing models that ignore biological priors and dependencies in motif structures and motif occurrences, and demonstrates superior performance on both semi-realistic test data and cis-regulatory sequences from yeast and Drosophila genomes with regard to sensitivity, specificity, flexibility and extensibility.

68 citations


Journal ArticleDOI
TL;DR: An optimal, robust prediction model for classifying cancer sub-types using gene expression data is constructed in a step-wise fashion implementing cross-validated quadratic discriminant analysis and finds that the dimensionality of the optimal prediction models is relatively small for these cases.
Abstract: Microarrays can provide genome-wide expression patterns for various cancers, especially for tumor sub-types that may exhibit substantially different patient prognosis. Using such gene expression data, several approaches have been proposed to classify tumor sub-types accurately. These classification methods are not robust and are often dependent on a particular training sample for modelling, which raises issues in utilizing these methods to administer proper treatment for a future patient. We propose to construct an optimal, robust prediction model for classifying cancer sub-types using gene expression data. Our model is constructed in a step-wise fashion implementing cross-validated quadratic discriminant analysis. At each step, all identified models are validated by an independent sample of patients to develop a robust model for future data. We apply the proposed methods to two microarray data sets of cancer: the acute leukemia data by Golub et al. and the colon cancer data by Alon et al. We have found that the dimensionality of our optimal prediction models is relatively small for these cases and that our prediction models with one or two gene factors outperform, or perform competitively with, other methods based on 50 or more predictive gene factors, especially for independent samples. The methodology is implemented and developed by the procedures in R and Splus. The source code can be obtained at http://hesweb1.med.virginia.edu/bioinformatics.

60 citations


Journal ArticleDOI
TL;DR: A novel image-processing program is developed that extracts quantitative data from microscope images automatically about yeast morphology, such as cell size, roundness, bud neck position angle, and bud growth direction, and fits an ellipse to the cell outline.
Abstract: Every living organism has its own species-specific morphology. Despite the relatively simple ellipsoidal shape of budding yeast cells, the global regulation of yeast morphology remains unclear. In the past, each mutated gene from many mutants with abnormal morphology had to be classified manually. To investigate the morphological characteristics of yeast in detail, we developed a novel image-processing program that extracts quantitative data from microscope images automatically. This program extracts data on cells that are often used by yeast morphology researchers, such as cell size, roundness, bud neck position angle, and bud growth direction, and fits an ellipse to the cell outline. We evaluated the ability of the program to extract quantitative parameters. The results suggest that our image-processing program can play a central objective role in yeast morphology studies.

57 citations


Journal ArticleDOI
TL;DR: RuleMiner provides an enhanced capability for protein function analysis; for example, results from the integrated sequence analysis tools for given proteins can be comparatively analyzed owing to the clear feature-PFG relationships.
Abstract: In this paper, we present RuleMiner, a knowledge system to facilitate a seamless integration of multi-sequence analysis tools and define profile-based rules for supporting high-throughput protein function annotations. This system consists of three essential components: Protein Function Groups (PFGs), PFG profiles and rules. The PFGs, established from an integrated analysis of current knowledge of protein functions from the Swiss-Prot database and protein family-based sequence classifications, cover all possible cellular functions available in the database. The PFG profiles illustrate detailed protein features in the PFGs, such as sequence conservation, the occurrences of sequence-based motifs, domains and species distributions. The rules, extracted from the PFG profiles, describe the clear relationships between these PFGs and all possible features. As a result, RuleMiner is able to provide an enhanced capability for protein function analysis; for example, results from the integrated sequence analysis tools for given proteins can be comparatively analyzed owing to the clear feature-PFG relationships. Also, much needed guidance is readily available for such analysis. If the rules describe one-to-one (unique) relationships between the protein features and the PFGs, then these features can be utilized as unique functional identifiers and cellular functions of unknown proteins can be reliably determined. Otherwise, additional information has to be provided.

Journal ArticleDOI
TL;DR: A new general definition of locality for sequence-structure alignments that is biologically motivated and efficiently tractable is suggested and it is proved that the defined locality means connectivity by atomic and non-atomic bonds.
Abstract: Ribonucleic acid (RNA) enjoys increasing interest in molecular biology; despite this interest, fundamental algorithms are lacking, e.g. for identifying local motifs. Like proteins, RNA molecules have a distinctive structure. Therefore, in addition to sequence information, structure plays an important part in assessing the similarity of RNAs. Furthermore, common sequence-structure features in two or several RNA molecules are often only spatially local, where possibly large parts of the molecules are dissimilar. Consequently, we address the problem of comparing RNA molecules by computing an optimal local alignment with respect to sequence and structure information. While local alignment is superior to global alignment for identifying local similarities, no general local sequence-structure alignment algorithms are currently known. We suggest a new general definition of locality for sequence-structure alignments that is biologically motivated and efficiently tractable. To show the former, we discuss locality of RNA and prove that the defined locality means connectivity by atomic and non-atomic bonds. To show the latter, we present an efficient algorithm for the newly defined pairwise local sequence-structure alignment (lssa) problem for RNA. For molecules of lengths n and m, the algorithm has worst-case time complexity of O(n²·m²·max(n,m)) and a space complexity of only O(n·m). An implementation of our algorithm is available at . Its runtime is competitive with global sequence-structure alignment.

Journal ArticleDOI
TL;DR: The cWINNOWER algorithm substantially improves the sensitivity of the winnower method of Pevzner and Sze by imposing a consensus constraint, enabling it to detect much weaker signals.
Abstract: The cWINNOWER algorithm detects fuzzy motifs in DNA sequences rich in protein-binding signals. A signal is defined as any short nucleotide pattern having up to d mutations differing from a motif of length l. The algorithm finds such motifs if a clique consisting of a sufficiently large number of mutated copies of the motif (i.e., the signals) is present in the DNA sequence. The cWINNOWER algorithm substantially improves the sensitivity of the winnower method of Pevzner and Sze by imposing a consensus constraint, enabling it to detect much weaker signals. We studied the minimum detectable clique size q_c as a function of sequence length N for random sequences. We found that q_c increases linearly with N for a fast version of the algorithm based on counting three-member sub-cliques. Imposing consensus constraints reduces q_c by a factor of three in this case, which makes the algorithm dramatically more sensitive. Our most sensitive algorithm, which counts four-member sub-cliques, needs a minimum of only 13 signals to detect motifs in a sequence of length N=12,000 for (l,d)=(15,4).
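
The signal definition in the abstract above can be made concrete with a small sketch. It covers only the definition and the pairwise bound that follows from it: by the triangle inequality, two signals derived from the same motif differ in at most 2d positions. It does not implement the clique search or the consensus constraint, and the motif, the mutated copies and the function names are made up for illustration, using the (l,d)=(15,4) setting mentioned in the abstract.

```python
def hamming(x, y):
    """Number of positions at which two equal-length patterns differ."""
    if len(x) != len(y):
        raise ValueError("patterns must have the same length")
    return sum(a != b for a, b in zip(x, y))


def is_signal(pattern, motif, d):
    """A signal is a pattern with at most d mutations relative to the motif."""
    return hamming(pattern, motif) <= d


def may_share_motif(x, y, d):
    """Necessary condition for two patterns to be signals of one motif."""
    return hamming(x, y) <= 2 * d


L, D = 15, 4
motif = "ACGTACGTACGTACG"       # hypothetical motif of length 15
signal_1 = "TGCAACGTACGTACG"    # first four positions mutated
signal_2 = "ACGTACGTACGATGC"    # last four positions mutated

pair_distance = hamming(signal_1, signal_2)
```

Here the two copies are mutated at disjoint positions, so they sit at the maximum pairwise distance of 2d = 8 while each remains within d = 4 of the motif.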

Journal ArticleDOI
Jung-jae Kim, Jong Cheol Park
TL;DR: A biomedical information extraction system, BioIE, is presented to address both of these needs by utilizing a full-fledged English grammar formalism, or a combinatory categorial grammar, and by annotating the results with the terms of Gene Ontology, which provides a common and controlled vocabulary.
Abstract: The need for extracting general biological interactions of arbitrary types from the rapidly growing volume of the biomedical literature is drawing increased attention, while the need for this much diversity also requires both a robust treatment of complex linguistic phenomena and a method to consistently characterize the results. We present a biomedical information extraction system, BioIE, to address both of these needs by utilizing a full-fledged English grammar formalism, or a combinatory categorial grammar, and by annotating the results with the terms of Gene Ontology, which provides a common and controlled vocabulary. BioIE deals with complex linguistic phenomena such as coordination, relative structures, acronyms, appositive structures, and anaphoric expressions. In order to deal with real-world syntactic variations of ontological terms, BioIE utilizes the syntactic dependencies between words in sentences as well, based on the observation that the component words in an ontological term usually appear in a sentence with known patterns of syntactic dependencies.

Journal ArticleDOI
TL;DR: This review covers classical and early heuristic local alignment methods and then focuses on recent seeding techniques, which give a vast improvement in both sensitivity and specificity over previous methods and can achieve sensitivity at the level of classical algorithms while requiring orders of magnitude less runtime.
Abstract: We review recent results on local alignment. We begin with a review of classical methods and early heuristic methods, and then focus on more recent work on the seeding of local alignment. We show that these techniques give a vast improvement in both sensitivity and specificity over previous methods, and can achieve sensitivity at the level of classical algorithms while requiring orders of magnitude less runtime.

Journal ArticleDOI
TL;DR: This work generated 193 classes of names, and ILP led to criteria defining a select subset of 1,240,462 names; examination of a random sample from the resulting gene/protein name lexicon suggests it is composed of 82% (±3%) complete and accurate gene/protein names.
Abstract: The identification of gene/protein names in natural language text is an important problem in named entity recognition. In previous work we have processed MEDLINE© documents to obtain a collection of over two million names of which we estimate that perhaps two thirds are valid gene/protein names. Our problem has been how to purify this set to obtain a high quality subset of gene/protein names. Here we describe an approach which is based on the generation of certain classes of names that are characterized by common morphological features. Within each class inductive logic programming (ILP) is applied to learn the characteristics of those names that are gene/protein names. The criteria learned in this manner are then applied to our large set of names. We generated 193 classes of names and ILP led to criteria defining a select subset of 1,240,462 names. A simple false positive filter was applied to remove 8% of this set leaving 1,145,913 names. Examination of a random sample from this gene/protein name lexicon suggests it is composed of 82% (±3%) complete and accurate gene/protein names, 12% names related to genes/proteins (too generic, a valid name plus additional text, part of a valid name, etc.), and 6% names unrelated to genes/proteins. The lexicon is freely available at .

Journal ArticleDOI
TL;DR: The trend indicating that GC 5'ss possess strong consensus sequences was observed in all five species, and it was suggested that, for alternative GC-AG introns, the tendency to have a weak consensus sequence at the 5'ss differs between H. sapiens and M. musculus pre-mRNAs.
Abstract: For the purpose of analyzing the relation between the splice sites and the order of introns, we conducted the following analysis for the GT–AG and GC–AG splice site groups. First, the pre-mRNAs of H. sapiens, M. musculus, D. melanogaster, A. thaliana and O. sativa were sampled by mapping the full-length cDNA to the genomes. Next, the consensus sequences at different regions of pre-mRNAs were analyzed in the five species. We also investigated the mononucleotide and dinucleotide frequencies in the extensive regions around the 5' splice sites (5'ss) and 3' splice sites (3'ss). As a result, differential frequencies of nucleotides at the first 5'ss in both the GT–AG and GC–AG splice site groups were observed in A. thaliana and O. sativa pre-mRNAs. The trend, which indicates that GC 5'ss possess strong consensus sequences, was observed not only in mammalian pre-mRNAs but also in the pre-mRNAs of D. melanogaster, A. thaliana and O. sativa. Furthermore, we examined the consensus sequences of the constitutive and alternative splice sites. It was suggested that in the case of the alternative GC–AG introns, the tendency to have a weak consensus sequence at 5'ss is different between H. sapiens and M. musculus pre-mRNAs.

Journal ArticleDOI
TL;DR: This work proves that ancestral maximum likelihood (AML) is NP-complete; the proof follows that for MP given in (Day, Johnson and Sankoff, 1986) in that the same reduction from Vertex Cover is used.
Abstract: Maximum likelihood (ML) (Felsenstein, 1981) is an increasingly popular optimality criterion for selecting evolutionary trees. Finding optimal ML trees appears to be a very hard computational task - ...

Journal ArticleDOI
TL;DR: Various data mining techniques combined with sequence motif information in the promoter region of genes were applied to discover functional genes involved in the defense mechanism of systemic acquired resistance (SAR) in Arabidopsis thaliana; the study demonstrates the strength of such integration methods and suggests a broader application of this approach.
Abstract: Various data mining techniques combined with sequence motif information in the promoter region of genes were applied to discover functional genes that are involved in the defense mechanism of systemic acquired resistance (SAR) in Arabidopsis thaliana. A series of K-Means clustering with difference-in-shape as distance measure was initially applied. A stability measure was used to validate this clustering process. A decision tree algorithm with the discover-and-mask technique was used to identify a group of most informative genes. Appearance and abundance of various transcription factor binding sites in the promoter region of the genes were studied. Through the combination of these techniques, we were able to identify 24 candidate genes involved in the SAR defense mechanism. The candidate genes fell into 2 highly resolved categories, each category showing significantly unique profiles of regulatory elements in their promoter regions. This study demonstrates the strength of such integration methods and suggests a broader application of this approach.

Journal ArticleDOI
TL;DR: An efficient algorithm for detecting putative regulatory elements in the upstream DNA sequences of genes, using gene expression information obtained from microarray experiments, is presented, based on a generalized suffix tree.
Abstract: We present an efficient algorithm for detecting putative regulatory elements in the upstream DNA sequences of genes, using gene expression information obtained from microarray experiments. Based on a generalized suffix tree, our algorithm looks for motif patterns whose appearance in the upstream region is most correlated with the expression levels of the genes. We are able to find the optimal pattern, in time linear in the total length of the upstream sequences. We implement and apply our algorithm to publicly available microarray gene expression data, and show that our method is able to discover biologically significant motifs, including various motifs which have been reported previously using the same data set. We further discuss applications for which the efficiency of the method is essential, as well as possible extensions to our algorithm.

Journal ArticleDOI
TL;DR: An integrated, comprehensive network-inferring system for genetic interactions, named VoyaGene, which can analyze experimentally observed expression profiles by using and combining the following five independent inferring models: Clustering, Threshold-Test, Bayesian, multi-level digraph and S-system models is proposed.
Abstract: We propose an integrated, comprehensive network-inferring system for genetic interactions, named VoyaGene, which can analyze experimentally observed expression profiles by using and combining the following five independent inferring models: Clustering, Threshold-Test, Bayesian, multi-level digraph and S-system models. Since VoyaGene also has effective tools for visualizing the inferred results, researchers may evaluate the combination of appropriate inferring models, and can construct a genetic network to an accuracy that is beyond the reach of a single inferring model. Through the use of VoyaGene, the present study demonstrates the effectiveness of combining different inferring models.

Journal ArticleDOI
TL;DR: This tutorial provides an overview of the various current high-throughput methods for discovering protein-protein interactions, covering both the conventional experimental methods and new computational approaches.
Abstract: The ongoing genomics and proteomics efforts have helped identify many new genes and proteins in living organisms. However, simply knowing the existence of genes and proteins does not tell us much about the biological processes in which they participate. Many major biological processes are controlled by protein interaction networks. A comprehensive description of protein–protein interactions is therefore necessary to understand the genetic program of life. In this tutorial, we provide an overview of the various current high-throughput methods for discovering protein–protein interactions, covering both the conventional experimental methods and new computational approaches.

Journal ArticleDOI
TL;DR: A multivariate entropy distance (MED) algorithm is proposed for the identification of prokaryotic genes; it has a feature to combine the use of a coding potential and an EDP-based sequence similarity analysis.
Abstract: A new simple method is found for efficient and accurate identification of coding sequences in prokaryotic genome. The method employs a Shannon description of artificial language for DNA sequences. It consists in translating a DNA sequence into a pseudo-amino acid sequence with 20 fundamental words according to the universal genetic code. With an entropy-density profile (EDP), the method maps a sequence of finite length to a vector and then analyzes its position in the 20-dimensional phase space depending on its nature. It is found that the ratio of the relative distance to an averaged coding and non-coding EDP over a small number (up to one) of open reading frames (ORFs) can serve as a good coding potential. An iterative algorithm is designed for finding a set of "root" sequences using this coding potential. A multivariate entropy distance (MED) algorithm is then proposed for the identification of prokaryotic genes; it has a feature to combine the use of a coding potential and an EDP-based sequence similarity analysis. The current version of MED is unsupervised, parameter-free and simple to implement. It is demonstrated to be able to detect 95-99% genes with 10-30% of additional genes when tested against the RefSeq database of NCBI and to detect 97.5-99.8% of confirmed genes with known functions. It is also shown to be able to find a set of (functionally known) genes that are missed by other well-known gene finding algorithms. All measurements show that the MED algorithm reaches a similar performance level as the algorithms like GeneMark and Glimmer for prokaryotic gene prediction.

Journal ArticleDOI
TL;DR: This system is designed for the recovery of gene interactions concurrently in many gene regulatory networks related by a tree or a more general graph, and it is shown how this comparative framework can facilitate the recovery of the networks and improve the quality of the solutions inferred.
Abstract: We present a method for gene network inference and revision based on time-series data. Gene networks are modeled using linear differential equations and a generalized stepwise multiple linear regression procedure is used to recover the interaction coefficients. Our system is designed for the recovery of gene interactions concurrently in many gene regulatory networks related by a tree or a more general graph. We show how this comparative framework can facilitate the recovery of the networks and improve the quality of the solutions inferred.
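
As a rough illustration of recovering interaction coefficients from time-series data, the sketch below models expression as dx/dt = W x, approximates the derivative by finite differences, and fits each row of W by ordinary least squares. Ordinary least squares is swapped in here for the generalized stepwise regression named in the abstract, and all function names are hypothetical.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: not enough independent samples")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_interactions(series, dt):
    """Estimate W in dx/dt = W x from equally spaced expression vectors."""
    n = len(series[0])
    X = series[:-1]
    # Forward differences approximate the time derivative at each sample.
    dX = [[(b - a) / dt for a, b in zip(s0, s1)]
          for s0, s1 in zip(series, series[1:])]
    # Normal equations (X^T X) w_i = X^T dX_i, one system per target gene i.
    XtX = [[sum(x[j] * x[k] for x in X) for k in range(n)] for j in range(n)]
    W = []
    for i in range(n):
        Xty = [sum(x[j] * d[i] for x, d in zip(X, dX)) for j in range(n)]
        W.append(solve(XtX, Xty))
    return W


def simulate_euler(W, x0, dt, steps):
    """Generate synthetic data with forward Euler steps of dx/dt = W x."""
    series = [list(x0)]
    for _ in range(steps):
        x = series[-1]
        dx = [sum(w * xj for w, xj in zip(row, x)) for row in W]
        series.append([xi + dt * di for xi, di in zip(x, dx)])
    return series


W_true = [[-0.5, 0.2], [0.3, -0.4]]
data = simulate_euler(W_true, [1.0, 0.5], dt=0.1, steps=10)
print(fit_interactions(data, dt=0.1))
```

Because the synthetic data are produced with the same forward-difference scheme used in the fit, the coefficients are recovered almost exactly; real microarray time series would be noisy and far less well conditioned.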

Journal ArticleDOI
TL;DR: Gene expression analysis uses clustering techniques extensively to unravel relations between genes and to help deduce their biological role, since genes of similar function tend to display similar expression patterns.
Abstract: Self-Organized Maps (SOMs) are a popular approach for analyzing genome-wide expression data. However, most SOM-based approaches ignore prior knowledge about functional gene categories. Also, SOM-based approaches usually develop topographic maps with disjoint and uniform activation regions that correspond to a hard clustering of the patterns at their nodes. We present a novel Self-Organized Map, the Kernel Supervised Dynamic Grid Self-Organized Map (KSDG-SOM). This model adapts its parameters in a kernel space. Gaussian kernels are used and their mean and variance components are adapted in order to optimize the fit to the input density. The KSDG-SOM also grows dynamically up to a size defined by statistical criteria. It is capable of incorporating a priori information about the known functional characteristics of genes. This information forms a supervised bias in cluster formation, and the model has the potential to revise incorrect functional labels. The new method overcomes the main drawback of most existing clustering methods, which lack a mechanism for dynamic extension based on a balance between unsupervised and supervised drives.

Journal ArticleDOI
TL;DR: A practical approach to construct progressive multiple alignments using sequence triplet optimizations rather than a conventional pairwise approach is demonstrated and it is revealed that the triplet based approaches generate more accurate sequence alignments than the traditional pairwise based procedures.
Abstract: In this paper we demonstrate a practical approach to construct progressive multiple alignments using sequence triplet optimizations rather than a conventional pairwise approach. Using the sequence triplet alignments progressively provides a scope for the synthesis of a three-residue exchange amino acid substitution matrix. We develop such a 20×20×20 matrix for the first time and demonstrate how its use in optimal sequence triplet alignments increases the sensitivity of building multiple alignments. Various comparisons were made between alignments generated using the progressive triplet methods and the conventional progressive pairwise procedure. The assessment of these data reveals that, in general, the triplet-based approaches generate more accurate sequence alignments than the traditional pairwise-based procedures, especially between more divergent sets of sequences.

Journal ArticleDOI
TL;DR: A general framework is developed in which a large class of binding site detection methods can be described in a uniform and consistent way and the binding matrix is the most specific matrix based classifier which is consistent with the input set of known binding words.
Abstract: Recognition of protein-DNA binding sites in genomic sequences is a crucial step for discovering biological functions of genomic sequences. Explosive growth in availability of sequence information has resulted in a demand for binding site detection methods with high specificity. The motivation of the work presented here is to address this demand by a systematic approach based on Maximum Likelihood Estimation. A general framework is developed in which a large class of binding site detection methods can be described in a uniform and consistent way. Protein-DNA binding is determined by binding energy, which is an approximately linear function within the space of sequence words. All matrix based binding word detectors can be regarded as different linear classifiers which attempt to estimate the linear separation implied by the binding energy function. The standard approaches of consensus sequences and profile matrices are described using this framework. A maximum likelihood approach for determining this linear separation leads to a novel matrix type, called the binding matrix. The binding matrix is the most specific matrix based classifier which is consistent with the input set of known binding words. It achieves significant improvements in specificity compared to other matrices. This is demonstrated using 95 sets of experimentally determined binding words provided by the TRANSFAC database.
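
The view of a matrix-based detector as a linear classifier can be made concrete with a standard log-odds profile matrix: the score of a word is a sum of per-position entries, and a threshold on that sum separates predicted binding words from the rest. This is a minimal sketch of the conventional profile approach, not of the binding matrix introduced in the paper; the pseudocount, the uniform background, and the function names are assumptions.

```python
import math

ALPHABET = "ACGT"


def profile_matrix(words, pseudocount=1.0, background=0.25):
    """Build a log-odds profile matrix from equal-length binding words."""
    length = len(words[0])
    if any(len(w) != length for w in words):
        raise ValueError("all binding words must have the same length")
    matrix = []
    for pos in range(length):
        column = {}
        for base in ALPHABET:
            count = sum(1 for w in words if w[pos] == base)
            freq = (count + pseudocount) / (len(words) + pseudocount * len(ALPHABET))
            column[base] = math.log(freq / background)
        matrix.append(column)
    return matrix


def score(matrix, word):
    """Linear score: the sum of one matrix entry per position of the word."""
    if len(word) != len(matrix):
        raise ValueError("word length does not match the matrix")
    return sum(column[base] for column, base in zip(matrix, word))


def is_binding_word(matrix, word, threshold):
    """Classify a word by thresholding its linear score."""
    return score(matrix, word) >= threshold


known_sites = ["ACGT", "ACGA", "ACGT"]  # toy input, not TRANSFAC data
pwm = profile_matrix(known_sites)
print(score(pwm, "ACGT") > score(pwm, "TTTT"))
```

Different matrix types then differ only in how the entries are chosen, which is the sense in which they estimate the same linear separation.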

Journal ArticleDOI
TL;DR: Two methods for finding similarities in protein structure databases using feature vectors on triplets of SSEs (Secondary Structure Elements) of proteins using a multidimensional index structure are proposed.
Abstract: We propose new methods for finding similarities in protein structure databases. These methods extract feature vectors on triplets of SSEs (Secondary Structure Elements) of proteins. The feature vectors are then indexed using a multidimensional index structure. Our first technique considers the problem of finding proteins similar to a given query protein in a protein dataset. It quickly finds promising proteins using the index structure. These proteins are then aligned to the query protein using a popular pairwise alignment tool such as VAST. We also develop a novel statistical model to estimate the goodness of a match using the SSEs. Our second technique considers the problem of joining two protein datasets to find an all-to-all similarity. Experimental results show that our techniques improve the pruning time of VAST 3 to 3.5 times, while keeping the sensitivity similar. Our technique can also be incorporated with DALI and CE to improve their running times by a factor of 2 and 2.7 respectively. The software is available online at .

Journal ArticleDOI
TL;DR: The algorithm uses a representation of the backbones that is independent of their relative orientations in space and applies dynamic programming to this representation to compute an initial alignment, which is then refined iteratively.
Abstract: Determining structural similarities between proteins is an important problem since it can help identify functional and evolutionary relationships. In this paper, an algorithm is proposed to align two protein structures. Given the protein backbones, the algorithm finds a rigid motion of one backbone onto the other such that large substructures are matched. The algorithm uses a representation of the backbones that is independent of their relative orientations in space and applies dynamic programming to this representation to compute an initial alignment, which is then refined iteratively. Experiments indicate that the algorithm is competitive with two well-known algorithms, namely DALI and LOCK.

Journal ArticleDOI
Sun Chong Wang
TL;DR: A power-law formalism is described to model the combinatorial effect of regulators on gene transcription and a principled network reconstruction approach is employed that accounts for the high noise and low replicate characteristics of present day microarray data.
Abstract: Different genes of an organism are expressed to different levels at different times during the life cycle and in response to various environmental stresses. Elucidating the network of gene-gene interactions responsible for the expression helps understand living processes. Microarray technology allows concurrent genomic scale measurement of an organism's mRNA levels. We describe a power-law formalism to model the combinatorial effect of regulators on gene transcription. The dynamic model allows delayed transcription. We employ a principled network reconstruction approach that accounts for the high noise and low replicate characteristics of present day microarray data. An important feature of our approach is that the detail of the reconstructed network is limited to the noise level of the data. We apply the methodology to a microarray dataset of yeast cells grown in glucose and experiencing a diauxic transition upon glucose depletion. The reconstructed transcriptional regulations of yeast glycolytic genes are consistent with published findings.
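
A power-law rate law can be sketched as follows. It assumes the generic form rate = k * prod_j x_j^(g_j), where x_j is the level of regulator j and g_j is a kinetic order (positive for activation, negative for repression); the exact form used in the paper is not given in the abstract, so the formula and the function names are assumptions. Delayed transcription could be represented by passing regulator levels measured at an earlier time point.

```python
import math


def transcription_rate(rate_constant, regulator_levels, exponents):
    """Generic power-law rate: k * prod(x_j ** g_j) over all regulators."""
    if len(regulator_levels) != len(exponents):
        raise ValueError("need one exponent per regulator")
    rate = rate_constant
    for level, exponent in zip(regulator_levels, exponents):
        rate *= level ** exponent
    return rate


def log_transcription_rate(rate_constant, regulator_levels, exponents):
    """The same model in log space, where it is linear in the exponents."""
    return math.log(rate_constant) + sum(
        g * math.log(x) for x, g in zip(regulator_levels, exponents)
    )


# One activator (exponent 0.5) and one repressor (exponent -1.0).
rate = transcription_rate(2.0, [4.0, 0.5], [0.5, -1.0])
print(rate, math.isclose(math.log(rate),
                         log_transcription_rate(2.0, [4.0, 0.5], [0.5, -1.0])))
```

Taking logarithms turns the product into a sum, so the exponents can in principle be estimated with linear regression on log-transformed expression levels.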