
Showing papers in "IEEE/ACM Transactions on Computational Biology and Bioinformatics in 2004"


Journal ArticleDOI
TL;DR: In this comprehensive survey, a large number of existing approaches to biclustering are analyzed and classified in accordance with the type of biclusters they can find, the patterns of biclusters that are discovered, the methods used to perform the search, the approaches used to evaluate the solution, and the target applications.
Abstract: A large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the results from the application of standard clustering methods to genes are limited. This limitation is imposed by the existence of a number of experimental conditions where the activity of genes is uncorrelated. A similar limitation exists when clustering of conditions is performed. For this reason, a number of algorithms that perform simultaneous clustering on the row and column dimensions of the data matrix have been proposed. The goal is to find submatrices, that is, subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated activities for every condition. In this paper, we refer to this class of algorithms as biclustering. Biclustering is also referred to in the literature as coclustering and direct clustering, among other names, and has also been used in fields such as information retrieval and data mining. In this comprehensive survey, we analyze a large number of existing approaches to biclustering, and classify them in accordance with the type of biclusters they can find, the patterns of biclusters that are discovered, the methods used to perform the search, the approaches used to evaluate the solution, and the target applications.

2,123 citations
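One concrete merit function covered in surveys of this kind is Cheng and Church's mean squared residue, which is zero exactly when a submatrix follows an additive row-plus-column model. A minimal sketch (the function name and nested-list matrix representation are illustrative, not from the paper):

```python
def mean_squared_residue(matrix, rows, cols):
    # Residue of element (i, j): a_ij - a_iJ - a_Ij + a_IJ, where a_iJ is the
    # row mean, a_Ij the column mean, and a_IJ the overall mean of the bicluster.
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n, m = len(rows), len(cols)
    row_mean = [sum(r) / m for r in sub]
    col_mean = [sum(sub[i][j] for i in range(n)) / n for j in range(m)]
    all_mean = sum(map(sum, sub)) / (n * m)
    return sum((sub[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
               for i in range(n) for j in range(m)) / (n * m)
```

A perfectly additive submatrix (each entry a row effect plus a column effect) scores zero; any deviation raises the score.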


Journal ArticleDOI
TL;DR: This paper presents a general definition of phylogenetic networks in terms of directed acyclic graphs (DAGs) and a set of conditions, distinguishes between model networks and reconstructible ones, and characterizes the effect of extinction and taxon sampling on the reconstructibility of the network.
Abstract: Phylogenetic networks model the evolutionary history of sets of organisms when events such as hybrid speciation and horizontal gene transfer occur. In spite of their widely acknowledged importance in evolutionary biology, phylogenetic networks have so far been studied mostly for specific data sets. We present a general definition of phylogenetic networks in terms of directed acyclic graphs (DAGs) and a set of conditions. Further, we distinguish between model networks and reconstructible ones and characterize the effect of extinction and taxon sampling on the reconstructibility of the network. Simulation studies are a standard technique for assessing the performance of phylogenetic methods. A main step in such studies entails quantifying the topological error between the model and inferred phylogenies. While many measures of tree topological accuracy have been proposed, none exist for phylogenetic networks. Previously, we proposed the first such measure, which applied only to a restricted class of networks. In this paper, we extend that measure to apply to all networks, and prove that it is a metric on the space of phylogenetic networks. Our results allow for the systematic study of existing network methods, and for the design of new accurate ones.

198 citations


Journal ArticleDOI
TL;DR: This paper proposes an alternative, called the analog-spectrum model, which more closely reflects the biochemical process, and reestablishes probe length as the performance-governing factor, adopting "semidegenerate bases" as suitable emulators of currently inadequate universal bases.
Abstract: All published approaches to DNA sequencing by hybridization (SBH) consist of the biochemical acquisition of the spectrum of a target sequence (the set of its subsequences conforming to a given probing pattern) followed by the algorithmic reconstruction of the sequence from its spectrum. In the "standard" or "uniform" approach, the probing pattern is a string of length L and the length of reliably reconstructible sequences is known to be m_len = O(2^L). For a fixed microarray area, higher sequencing performance can be achieved by inserting nonprobing gaps ("wildcards") in the probing pattern. The reconstruction, however, must cope with the emergence of fooling probes due to the gaps and algorithmic failure occurs when the spectrum becomes too densely populated, although we can achieve m_comp = O(4^L). Despite the combinatorial success of gapped probing, all current approaches are based on a biochemically unrealistic spectrum-acquisition model (digital-spectrum). The reality of hybridization is much more complex. Departing from the conventional model, in this paper, we propose an alternative, called the analog-spectrum model, which more closely reflects the biochemical process. This novel modeling reestablishes probe length as the performance-governing factor, adopting "semidegenerate bases" as suitable emulators of currently inadequate universal bases. One important conclusion is that accurate biochemical measurements are pivotal to the success of SBH. The theoretical proposal presented in this paper should be a convincing stimulus for the needed biotechnological work.

194 citations
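The digital-spectrum model the paper departs from is easy to state in code: under a probing pattern with gaps, the spectrum is the set of sequence windows with the gap positions masked out. A toy sketch (the '1'/'0' pattern encoding and the '*' mask character are assumptions for illustration):

```python
def spectrum(seq, pattern):
    # pattern: '1' = probing position, '0' = nonprobing gap ("wildcard")
    L = len(pattern)
    probes = set()
    for i in range(len(seq) - L + 1):
        window = seq[i:i + L]
        probes.add(''.join(c if p == '1' else '*' for c, p in zip(window, pattern)))
    return probes
```

With the uniform pattern "11" the spectrum is just the set of dimers; a gapped pattern like "101" masks the middle base, which is what creates the fooling-probe ambiguities during reconstruction.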


Journal ArticleDOI
TL;DR: This paper presents a dimension reduction and feature extraction scheme, called Uncorrelated Linear Discriminant Analysis (ULDA), for undersampled problems and illustrates its utility on gene expression data.
Abstract: The classification of tissue samples based on gene expression data is an important problem in medical diagnosis of diseases such as cancer. In gene expression data, the number of genes is usually very high (in the thousands) compared to the number of data samples (in the tens or low hundreds); that is, the data dimension is large compared to the number of data points (such data is said to be undersampled). To cope with performance and accuracy problems associated with high dimensionality, it is commonplace to apply a preprocessing step that transforms the data to a space of significantly lower dimension with limited loss of the information present in the original data. Linear Discriminant Analysis (LDA) is a well-known technique for dimension reduction and feature extraction, but it is not applicable for undersampled data due to singularity problems associated with the matrices in the underlying representation. This paper presents a dimension reduction and feature extraction scheme, called Uncorrelated Linear Discriminant Analysis (ULDA), for undersampled problems and illustrates its utility on gene expression data. ULDA employs the Generalized Singular Value Decomposition method to handle undersampled data and the features that it produces in the transformed space are uncorrelated, which makes it attractive for gene expression data. The properties of ULDA are established rigorously and extensive experimental results on gene expression data are presented to illustrate its effectiveness in classifying tissue samples. These results provide a comparative study of various state-of-the-art classification methods on well-known gene expression data sets.

179 citations
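The singularity problem that motivates ULDA is easy to see numerically: with far fewer samples than genes, the centered data matrix has rank at most n - 1, so the scatter matrices that classical LDA inverts are singular. A small numpy sketch (this demonstrates the problem and a generic SVD projection as a workaround, not the paper's GSVD-based ULDA itself):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 200))      # 10 samples, 200 genes: undersampled
Xc = X - X.mean(axis=0)             # center the data
rank = np.linalg.matrix_rank(Xc)    # at most 10 - 1 = 9
# Classical LDA would need to invert a singular 200x200 scatter matrix;
# projecting onto the leading right singular vectors first avoids this.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_red = Xc @ Vt[:rank].T            # samples in a full-rank subspace
```

Any discriminant analysis can then proceed in the reduced space; ULDA's contribution is doing the reduction and discrimination in one principled GSVD step with uncorrelated output features.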


Journal ArticleDOI
TL;DR: A method for computing multiple alignments of RNA secondary structures under the tree alignment model, which is suitable to cluster RNA molecules purely on the structural level, i.e., sequence similarity is not required.
Abstract: In functional, noncoding RNA, structure is often essential to function. While the full 3D structure is very difficult to determine, the 2D structure of an RNA molecule gives good clues to its 3D structure, and for molecules of moderate length, it can be predicted with good reliability. Structure comparison is, in analogy to sequence comparison, the essential technique to infer related function. We provide a method for computing multiple alignments of RNA secondary structures under the tree alignment model, which is suitable to cluster RNA molecules purely on the structural level, i.e., sequence similarity is not required. We give a systematic generalization of the profile alignment method from strings to trees and forests. We introduce a tree profile representation of RNA secondary structure alignments which allows reasonable scoring in structure comparison. Besides the technical aspects, an RNA profile is a useful data structure to represent multiple structures of RNA sequences. Moreover, we propose a visualization of RNA consensus structures that is enriched by the full sequence information.

163 citations
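A tree view of secondary structure, the input representation for tree alignment methods like this one, can be built from dot-bracket notation in a few lines (the nested-list encoding and the 'P' label for paired positions are illustrative choices, not the paper's data structure):

```python
def dotbracket_to_tree(s):
    # Each matched '(' ')' pair becomes a 'P' node whose children are the
    # bases and pairs enclosed by it; unpaired bases ('.') become leaves.
    root = []
    stack = [root]
    for c in s:
        if c == '(':
            node = ['P']
            stack[-1].append(node)
            stack.append(node)
        elif c == ')':
            stack.pop()
        else:
            stack[-1].append(c)
    return root
```

For example, "((.))" yields a pair node nested inside another pair node, with a single unpaired base at the innermost level.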


Journal ArticleDOI
TL;DR: This paper addresses the problem of computing a splits graph, which generalizes the concept of a phylogenetic tree, for a given set of splits; all presented algorithms are implemented in a new program called SplitsTree4.
Abstract: Phylogenetic trees correspond one-to-one to compatible systems of splits and so splits play an important role in theoretical and computational aspects of phylogeny. Whereas any tree reconstruction method can be thought of as producing a compatible system of splits, an increasing number of phylogenetic algorithms are available that compute split systems that are not necessarily compatible and, thus, cannot always be represented by a tree. Such methods include the split decomposition, Neighbor-Net, consensus networks, and the Z-closure method. A more general split system of this kind can be represented graphically by a so-called splits graph, which generalizes the concept of a phylogenetic tree. This paper addresses the problem of computing a splits graph for a given set of splits. We have implemented all presented algorithms in a new program called SplitsTree4.

130 citations
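The compatibility condition underlying the one-to-one correspondence is the classical split-compatibility test: two splits A|B and C|D of the taxon set are compatible iff at least one of the four intersections A∩C, A∩D, B∩C, B∩D is empty. A minimal sketch (function and variable names are illustrative):

```python
def compatible(split1, split2, taxa):
    # split1, split2: one side of each split; the other side is the complement.
    a, c = set(split1), set(split2)
    b, d = taxa - a, taxa - c
    # Compatible iff some pair of sides is disjoint.
    return any(not (p & q) for p in (a, b) for q in (c, d))
```

A set of splits that is pairwise compatible corresponds to a tree; methods like Neighbor-Net and Z-closure produce split systems where this test fails for some pairs, which is exactly when a splits graph rather than a tree is needed.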


Journal ArticleDOI
TL;DR: This paper poses the problem of inferring a phylogenetic super-network from incomplete phylogenetic data and provides an efficient algorithm for doing so, called the Z-closure method, which is implemented as a plug-in for the program SplitsTree4.
Abstract: In practice, one is often faced with incomplete phylogenetic data, such as a collection of partial trees or partial splits. This paper poses the problem of inferring a phylogenetic super-network from such data and provides an efficient algorithm for doing so, called the Z-closure method. Additionally, the questions of assigning lengths to the edges of the network and how to restrict the “dimensionality” of the network are addressed. Applications to a set of five published partial gene trees relating different fungal species and to six published partial gene trees relating different grasses illustrate the usefulness of the method and an experimental study confirms its potential. The method is implemented as a plug-in for the program SplitsTree4.

117 citations


Journal ArticleDOI
TL;DR: A new step in the blast algorithm is proposed to reduce the computational cost of searching with negligible effect on accuracy, and a heuristic is proposed that avoids unlikely evolutionary paths with the aim of reducing gapped alignment cost with negligible impact on accuracy.
Abstract: Homology search is a key tool for understanding the role, structure, and biochemical function of genomic sequences. The most popular technique for rapid homology search is blast, which has been in widespread use within universities, research centers, and commercial enterprises since the early 1990s. In this paper, we propose a new step in the blast algorithm to reduce the computational cost of searching with negligible effect on accuracy. This new step—semigapped alignment—compromises between the efficiency of ungapped alignment and the accuracy of gapped alignment, allowing blast to accurately filter sequences with lower computational cost. In addition, we propose a heuristic—restricted insertion alignment—that avoids unlikely evolutionary paths with the aim of reducing gapped alignment cost with negligible effect on accuracy. Together, after including an optimization of the local alignment recursion, our two techniques more than double the speed of the gapped alignment stages in blast. We conclude that our techniques are an important improvement to the blast algorithm. Source code for the alignment algorithms is available for download at http://www.bsg.rmit.edu.au/iga/.

102 citations
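For context on where the cost being optimized comes from: blast runs hit detection, ungapped extension, then gapped alignment, and the proposed semigapped stage slots between the last two. A toy sketch of the standard ungapped X-drop extension stage (scores and cutoff are illustrative; this is the textbook idea, not the paper's new step):

```python
def xdrop_extend_right(s1, s2, i, j, match=1, mismatch=-2, xdrop=5):
    # Extend an exact hit at (i, j) to the right, stopping when the running
    # score drops more than `xdrop` below the best score seen so far.
    score = best = 0
    best_len = 0
    k = 0
    while i + k < len(s1) and j + k < len(s2):
        score += match if s1[i + k] == s2[j + k] else mismatch
        if score > best:
            best, best_len = score, k + 1
        if best - score > xdrop:
            break
        k += 1
    return best, best_len
```

Sequences surviving this cheap filter would then, under the paper's scheme, pass through semigapped alignment before the expensive fully gapped stage.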


Journal ArticleDOI
TL;DR: This work presents a method for computing consensus structures including pseudoknots based on alignments of a few sequences. The algorithm combines thermodynamic and covariation information to assign scores to all possible base pairs; the base pairs are then chosen with the help of the maximum weighted matching algorithm.
Abstract: Most functional RNA molecules have characteristic structures that are highly conserved in evolution. Many of them contain pseudoknots. Here, we present a method for computing the consensus structures including pseudoknots based on alignments of a few sequences. The algorithm combines thermodynamic and covariation information to assign scores to all possible base pairs; the base pairs are then chosen with the help of the maximum weighted matching algorithm. We applied our algorithm to a number of different types of RNA known to contain pseudoknots. All pseudoknots were predicted correctly and more than 85 percent of the base pairs were identified.

66 citations
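Maximum weighted matching is the step that lets the method keep pseudoknots: unlike nested-structure dynamic programming, a matching may select crossing base pairs. A brute-force toy version of that selection step (fine only for tiny candidate sets; real implementations use a polynomial-time matching algorithm, and the names here are illustrative):

```python
from itertools import combinations

def best_base_pairing(scores):
    # scores: {(i, j): weight} for candidate base pairs. Find the max-weight
    # subset in which each sequence position appears in at most one pair.
    pairs = list(scores)
    best, best_set = 0.0, set()
    for r in range(1, len(pairs) + 1):
        for subset in combinations(pairs, r):
            used = [p for ij in subset for p in ij]
            if len(used) == len(set(used)):   # no position reused
                w = sum(scores[ij] for ij in subset)
                if w > best:
                    best, best_set = w, set(subset)
    return best, best_set
```

Note that nothing here forbids crossing pairs such as (0, 4) and (1, 5), which is exactly why matching-based prediction can recover pseudoknots.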


Journal ArticleDOI
TL;DR: The number of nontrivial connected components, R_c, in the conflict graph for a given set of sequences, computable in time O(nm^2), is also a lower bound on the minimum number of recombination events.
Abstract: We consider the following problem: Given a set of binary sequences, determine lower bounds on the minimum number of recombinations required to explain the history of the sample, under the infinite-sites model of mutation. The problem has implications for finding recombination hotspots and for the Ancestral Recombination Graph reconstruction problem. Hudson and Kaplan gave a lower bound based on the four-gamete test. In practice, their bound R_m often greatly underestimates the minimum number of recombinations. The problem was recently revisited by Myers and Griffiths, who introduced two new lower bounds R_h and R_s which are provably better, and also yield good bounds in practice. However, the worst-case complexities of their procedures for computing R_h and R_s are exponential and super-exponential, respectively. In this paper, we show that the number of nontrivial connected components, R_c, in the conflict graph for a given set of sequences, computable in time O(nm^2), is also a lower bound on the minimum number of recombination events. We show that in many cases, R_c is a better bound than R_h. The conflict graph was used by Gusfield et al. to obtain a polynomial time algorithm for the galled tree problem, which is a special case of the Ancestral Recombination Graph (ARG) reconstruction problem. Our results also offer some insight into the structural properties of this graph and are of interest for the general Ancestral Recombination Graph reconstruction problem.

43 citations
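The bound itself is computable directly from the definitions: put an edge between two sites whenever they fail the four-gamete test, and count the connected components that contain at least one edge. A minimal sketch for binary sequences given as '0'/'1' strings (naive O(nm^2) conflict detection plus union-find; names are illustrative):

```python
def four_gamete_conflict(seqs, i, j):
    # Sites i and j conflict iff all four gametes 00, 01, 10, 11 appear.
    return {(s[i], s[j]) for s in seqs} == \
           {('0', '0'), ('0', '1'), ('1', '0'), ('1', '1')}

def Rc(seqs):
    m = len(seqs[0])
    edges = [(i, j) for i in range(m) for j in range(i + 1, m)
             if four_gamete_conflict(seqs, i, j)]
    # Union-find to count components containing at least one edge.
    parent = list(range(m))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in edges:
        parent[find(i)] = find(j)
    involved = {v for e in edges for v in e}
    return len({find(v) for v in involved})
```

Sequences with no conflicting site pair give R_c = 0, consistent with a recombination-free (tree-like) history.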


Journal ArticleDOI
TL;DR: A novel estimator for the local false discovery rate is introduced that is based on an algorithm which splits all genes into two groups, representing induced and noninduced genes, respectively, and performs comparably in detecting the shape of the local false discovery rate.
Abstract: Screening for differential gene expression in microarray studies leads to difficult large-scale multiple testing problems. The local false discovery rate is a statistical concept for quantifying uncertainty in multiple testing. In this paper, we introduce a novel estimator for the local false discovery rate that is based on an algorithm which splits all genes into two groups, representing induced and noninduced genes, respectively. Starting from the full set of genes, we successively exclude genes until the gene-wise p-values of the remaining genes look like a typical sample from a uniform distribution. In comparison to other methods, our algorithm performs comparably in detecting the shape of the local false discovery rate and has a smaller bias with respect to estimating the overall percentage of noninduced genes. Our algorithm is implemented in the Bioconductor compatible R package TWILIGHT version 1.0.1, which is available from http://compdiag.molgen.mpg.de/software or from the Bioconductor project at http://www.bioconductor.org.
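The exclusion idea can be caricatured in a few lines: repeatedly drop the most significant remaining gene until the surviving p-values are close to uniform (measured here with a Kolmogorov-Smirnov distance); the surviving fraction then estimates the noninduced percentage. This is only a toy with an arbitrary threshold, not TWILIGHT's actual fitting procedure:

```python
def ks_uniform(pvals):
    # Kolmogorov-Smirnov distance between the empirical CDF and U(0, 1).
    xs = sorted(pvals)
    n = len(xs)
    return max(max(abs((k + 1) / n - x), abs(x - k / n))
               for k, x in enumerate(xs))

def estimate_noninduced_fraction(pvals, threshold=0.1):
    # Toy version of the exclusion algorithm: drop the smallest p-value
    # until the remainder looks roughly uniform.
    xs = sorted(pvals)
    n = len(pvals)
    while xs and ks_uniform(xs) > threshold:
        xs.pop(0)
    return len(xs) / n
```

On a sample that is already uniform nothing is excluded; adding a clump of very small p-values (the "induced" genes) drives the estimate below one.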

Journal ArticleDOI
TL;DR: A modification of the random projection algorithm is described, called the uniform projection algorithm, which utilizes a different choice of projections, and replaces the random selection of projections by a greedy heuristic that approximately equalizes the coverage of the projections.
Abstract: Buhler and Tompa (2002) introduced the random projection algorithm for the motif discovery problem and demonstrated that this algorithm performs well on both simulated and biological samples. We describe a modification of the random projection algorithm, called the uniform projection algorithm, which utilizes a different choice of projections. We replace the random selection of projections by a greedy heuristic that approximately equalizes the coverage of the projections. We show that this change in selection of projections leads to improved performance on motif discovery problems. Furthermore, the uniform projection algorithm is directly applicable to other problems where the random projection algorithm has been used, including comparison of protein sequence databases.
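The core operation shared by the random and uniform projection algorithms is hashing every length-k window by a chosen subset of positions, so that planted motif instances collide in a bucket despite mutations at the unselected positions. A minimal sketch (bucketing only; the greedy, coverage-equalizing position selection that distinguishes uniform projection is not shown):

```python
from collections import defaultdict

def project_kmers(seqs, k, positions):
    # Hash every k-mer by the characters at the chosen projection positions.
    buckets = defaultdict(list)
    for sid, s in enumerate(seqs):
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            key = ''.join(kmer[p] for p in positions)
            buckets[key].append((sid, i))
    return buckets
```

Heavily populated buckets are then refined (e.g., by expectation maximization in the original random projection method) into candidate motifs.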

Journal ArticleDOI
TL;DR: The key result states a simple recursive relationship between maximum-scoring segment sets that leads to fast algorithms for finding such segment sets in a sequence of scores.
Abstract: We examine the problem of finding maximum-scoring sets of disjoint segments in a sequence of scores. The problem arises in DNA and protein segmentation and in postprocessing of sequence alignments. Our key result states a simple recursive relationship between maximum-scoring segment sets. The statement leads to fast algorithms for finding such segment sets. We apply our methods to the identification of noncoding RNA genes in thermophiles.
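A standard dynamic program in this spirit finds the best total score of at most k disjoint segments in one left-to-right pass: at each position, either extend the currently open segment or close it and later start the next. A sketch of that DP (this is the textbook formulation, not necessarily the paper's exact recursion):

```python
def best_k_segments(scores, k):
    # open_[j]: best score with j segments used and the j-th ending here.
    # closed[j]: best score with at most j segments, all closed (empty = 0).
    NEG = float("-inf")
    open_ = [NEG] * (k + 1)
    closed = [0] * (k + 1)
    for x in scores:
        for j in range(k, 0, -1):   # descending so closed[j-1] is previous-step
            open_[j] = max(open_[j], closed[j - 1]) + x
            closed[j] = max(closed[j], open_[j])
    return max(closed)
```

With k = 1 this reduces to the classic maximum-subarray recurrence; larger k trades off segment count against total score, which is the regime relevant to segmentation of DNA and protein score sequences.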

Journal ArticleDOI
TL;DR: This work considers an example of a clock-like tree with three taxa, one unknown edge length, a known root state, and a parametric family of scale factor distributions that contains the gamma family, which has the property that there is another edge length and scale factor distribution which generates data with exactly the same distribution.
Abstract: The rates-across-sites assumption in phylogenetic inference posits that the rate matrix governing the Markovian evolution of a character on an edge of the putative phylogenetic tree is the product of a character-specific scale factor and a rate matrix that is particular to that edge. Thus, evolution follows basically the same process for all characters, except that it occurs faster for some characters than others. To allow estimation of tree topologies and edge lengths for such models, it is commonly assumed that the scale factors are not arbitrary unknown constants, but rather unobserved, independent, identically distributed draws from a member of some parametric family of distributions. A popular choice is the gamma family. We consider an example of a clock-like tree with three taxa, one unknown edge length, a known root state, and a parametric family of scale factor distributions that contains the gamma family. This model has the property that, for a generic choice of unknown edge length and scale factor distribution, there is another edge length and scale factor distribution which generates data with exactly the same distribution, so that even with infinitely many data it will be typically impossible to make correct inferences about the unknown edge length.

Journal ArticleDOI
TL;DR: An O(N^2) time algorithm for finding the optimal pair of substring patterns combined with Boolean functions, where N is the total length of the sequences, and an efficient implementation using suffix arrays is presented.
Abstract: We consider the problem of finding the optimal combination of string patterns, which characterizes a given set of strings that have a numeric attribute value assigned to each string. Pattern combinations are scored based on the correlation between their occurrences in the strings and the numeric attribute values. The aim is to find the combination of patterns which is best with respect to an appropriate scoring function. We present an O(N^2) time algorithm for finding the optimal pair of substring patterns combined with Boolean functions, where N is the total length of the sequences. The algorithm looks for all possible Boolean combinations of the patterns, e.g., patterns of the form p ∧ ¬q, which indicates that the pattern pair is considered to occur in a given string s, if p occurs in s, AND q does NOT occur in s. An efficient implementation using suffix arrays is presented, and we further show that the algorithm can be adapted to find the best k-pattern Boolean combination in O(N^k) time. The algorithm is applied to mRNA sequence data sets of moderate size combined with their turnover rates for the purpose of finding regulatory elements that cooperate, complement, or compete with each other in enhancing and/or silencing mRNA decay.
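The scoring step can be made concrete: build the 0/1 occurrence vector of a Boolean combination such as p ∧ ¬q over the strings and correlate it with the attribute values. Pearson correlation is used here as one reasonable choice of scoring function (the paper's framework admits others), and the O(N^2) suffix-array search over all pattern pairs is not shown:

```python
def score_pattern_pair(strings, values, p, q):
    # Indicator for the combination p AND NOT q: p occurs in s, q does not.
    ind = [1 if (p in s and q not in s) else 0 for s in strings]
    n = len(strings)
    mean_i, mean_v = sum(ind) / n, sum(values) / n
    cov = sum((a - mean_i) * (b - mean_v) for a, b in zip(ind, values)) / n
    var_i = sum((a - mean_i) ** 2 for a in ind) / n
    var_v = sum((b - mean_v) ** 2 for b in values) / n
    if var_i == 0 or var_v == 0:
        return 0.0
    return cov / (var_i * var_v) ** 0.5
```

A combination whose occurrence vector tracks the attribute perfectly (e.g., high turnover exactly when p occurs without q) scores 1.0.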

Journal ArticleDOI
TL;DR: This paper provides experimental results which show that contact maps derived from real protein structures can be processed efficiently and answers an open question raised by Vialette whether CMPM for contact maps is NP-hard or solvable in polynomial time.
Abstract: Contact maps are a model to capture the core information in the structure of biological molecules, e.g., proteins. A contact map consists of an ordered set S of elements (representing a protein's sequence of amino acids), and a set A of element pairs of S, called arcs (representing amino acids which are closely neighbored in the structure). Given two contact maps (S, A) and (S_p, A_p) with |A| ≥ |A_p|, the contact map pattern matching (CMPM) problem asks whether the "pattern" (S_p, A_p) "occurs" in (S, A), i.e., informally stated, whether there is a subset of |A_p| arcs in A whose arc structure coincides with A_p. CMPM captures the biological question of finding structural motifs in protein structures. In general, CMPM is NP-hard. In this paper, we show that CMPM is solvable in O(|A|^6 · |A_p|) time when the pattern is {<, ≬}-structured, i.e., when each two arcs in the pattern are disjoint or crossing. Our algorithm extends to other closely related models. In particular, it answers an open question raised by Vialette that, rephrased in terms of contact maps, asked whether CMPM for {<, ≬}-structured patterns is NP-hard or solvable in polynomial time. Our result stands in sharp contrast to the NP-hardness of closely related problems. We provide experimental results which show that contact maps derived from real protein structures can be processed efficiently.

Journal ArticleDOI
TL;DR: This inaugural issue introduces the editorial board of IEEE/ACM TCBB and begins fulfilling the scientific mission of the journal, with the publication of three regular papers and one survey.
Abstract: I would like to welcome you to the IEEE/ACM Transactions on Computational Biology and Bioinformatics. This inaugural issue introduces the editorial board of IEEE/ACM TCBB and begins fulfilling the scientific mission of the journal, with the publication of three regular papers and one survey. Bioinformatics and computational biology are concerned with the use of computation to understand biological phenomena and to acquire and exploit biological data, increasingly large-scale data. Methods from bioinformatics and computational biology are increasingly used to augment or leverage traditional laboratory and observation-based biology. These methods have become critical in biology due to recent changes in our ability and determination to acquire massive biological data sets, and due to the ubiquitous, successful biological insights that have come from the exploitation of those data. This transformation from a data-poor to a data-rich field began with DNA sequence data, but is now occurring in many other areas of biology. At the same time, we are seeing the beginnings of systems biology, which attempts to integrate diverse types of biological data and knowledge, to obtain insights into the high-level workings of biological systems. The shift to data-driven biology and the accumulation and exploitation of large-scale data have led to the need for new computational technology (machines, software, algorithms, theory) and for research into these issues. As this transformation extends into more biological domains, so too will bioinformatics and computational biology expand in scope, importance, and the number of participants. Hence, computational biology and bioinformatics, and research into the underlying computational techniques, have a huge future, requiring a large expansion in publication opportunities.
The IEEE/ACM Transactions on Computational Biology and Bioinformatics is being launched to provide such publication opportunities for high-quality research papers. The establishment of IEEE/ACM TCBB is supported by several societies of the IEEE and by The Association for Computing Machinery (ACM). The supporting societies in the IEEE are the IEEE Computer Society, the IEEE Engineering in Medicine and Biology Society, and the IEEE Neural Networks Society. IEEE/ACM TCBB is also cosponsored by the IEEE Control Systems Society. The cooperation of several IEEE societies, along with the ACM, reflects the wide range of interests in computational biology and bioinformatics that will be reflected in the journal, and also demonstrates a commitment by those organizations to further the development of the fields of computational biology and bioinformatics.

Journal ArticleDOI
TL;DR: This special section of WABI 2004 papers includes a local search method that allows exploration of the complete space of possible duplication trees and shows that the method is superior to other existing methods for reconstructing the tree and recovering its duplication events.
Abstract: THE Fourth International Workshop on Algorithms in BIoinformatics (WABI) 2004 was held in Bergen, Norway, September 2004. The program committee consisted of 33 members and selected, among 117 submissions, 39 to be presented at the workshop and included in the proceedings from the workshop (volume 3240 of Lecture Notes in Bioinformatics, series edited by Sorin Istrail, Pavel Pevzner, and Michael Waterman). The WABI 2004 program committee selected a small number of papers among the 39 to be invited to submit extended versions of their papers to a special section of the IEEE/ACM Transactions on Computational Biology and Bioinformatics. Four papers were published in the October-December 2004 issue of the journal and this issue contains an additional three papers. We would like to thank both the entire program committee for WABI and the reviewers of the papers in this issue for their valuable contributions. The first of the papers is “A New Distance for High Level RNA Secondary Structure Comparison” authored by Julien Allali and Marie-France Sagot. This paper describes algorithms for comparing secondary structures of RNA molecules where the structures are represented by trees. The problem of classifying RNA secondary structure is becoming critical as biologists are discovering more and more noncoding functional elements in the genome (e.g., miRNA). Most likely, the major functional determinants of the elements are their secondary structure and, therefore, a metric between such secondary structures will also help delineate clusters of functional groups. In Allali and Sagot’s paper, two tree representations of secondary structure are compared by analyzing how one tree can be transformed into the other using an allowed set of operations. Each operation can be associated with a cost and the distance between two trees can then be defined as the minimum cost associated with a transform of one tree to the other.
Allali and Sagot introduce two new operations that they name edge fusion and node fusion and show that these alleviate limitations associated with the classical tree edit operations used for RNA comparison. Importantly, they also present algorithms for calculating the distance between trees allowing the new operations in addition to the classical ones, and analyze the performance of the algorithms. The second paper is “Topological Rearrangements and Local Search Method for Tandem Duplication Trees” and is authored by Denis Bertrand and Olivier Gascuel. The paper approaches the problem of estimating the evolutionary history of tandem repeats. A tandem repeat is a stretch of DNA sequence that contains an element that is repeated multiple times and where the repeat occurrences are next to each other in the sequence. Since the repeats are subject to mutations, they are not identical. Therefore, tandem repeats occur through evolution by “copying” (duplication) of repeat elements in blocks of varying size. Bertrand and Gascuel address the problem of finding the most likely sequence of events giving rise to the observed set of repeats. Each sequence of events can be described by a duplication tree and one searches for the tree that is the most parsimonious, i.e., one that explains how the sequence has evolved from an ancestral single copy with a minimum number of mutations along the branches of the tree. The main difference with the standard phylogeny problem is that the linear ordering of the tandem duplications imposes constraints on the possible binary tree forms. This paper describes a local search method that allows exploration of the complete space of possible duplication trees and shows that the method is superior to other existing methods for reconstructing the tree and recovering its duplication events. The third paper is “Optimizing Multiple Seeds for Homology Search” authored by Daniel G. Brown.
The paper presents an approach to selecting starting points for pairwise local alignments of protein sequences. The problem of pairwise local alignment is to find a segment from each sequence so that the two local segments can be aligned to obtain a high score. For commonly used scoring schemes, this can be solved exactly using dynamic programming. However, pairwise alignment is frequently applied to large data sets and heuristic methods for restricting alignments to be considered are frequently used, for instance, in the BLAST programs. The key is to restrict the number of alignments as much as possible, by choosing a few good seeds, without missing high scoring alignments. The paper shows that this can be formulated as an integer programming problem and presents an algorithm for choosing optimal seeds. Analysis is presented showing that the approach gives four times fewer false positives (unnecessary seeds) in comparison with BLASTP without losing more good hits.
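The objects being optimized in Brown's paper are spaced seeds: patterns of required-match and don't-care positions that trigger alignment attempts. A toy hit detector for a single seed (the integer-programming selection of a good seed set, which is the paper's contribution, is not shown; the '1'/'0' encoding is an illustrative convention):

```python
def seed_hits(s1, s2, seed):
    # seed: string of '1' (position must match) and '0' (don't care),
    # e.g. "1011" requires matches at offsets 0, 2, and 3 of the window.
    hits = []
    for i in range(len(s1) - len(seed) + 1):
        for j in range(len(s2) - len(seed) + 1):
            if all(s1[i + k] == s2[j + k]
                   for k, c in enumerate(seed) if c == '1'):
                hits.append((i, j))
    return hits
```

Each hit would then be handed to an extension stage; a well-chosen set of seeds keeps this hit list short while still covering the true high-scoring alignments.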