
Showing papers by "Wing-Kin Sung published in 2007"


Journal ArticleDOI
14 Jun 2007-Nature
TL;DR: Functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project are reported, providing convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts.
Abstract: We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view of chromatin structure has emerged, including its inter-relationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded new mechanistic and evolutionary insights concerning the functional landscape of the human genome. Together, these studies are defining a path for pursuit of a more comprehensive characterization of human genome function.

5,091 citations


Journal ArticleDOI
TL;DR: These global histone methylation maps provide an epigenetic framework that enables the discovery of novel transcriptional networks and delineation of different genetic compartments of the pluripotent cell genome.

597 citations


Journal ArticleDOI
TL;DR: Mapping the genome-wide loci bound by the RELA subunit of NF-kappaB in lipopolysaccharide (LPS)-stimulated human monocytic cells revealed an overrepresentation of the E2F1-binding motif among RELA-bound loci associated with NF-kappaB target genes, demonstrating the critical role of E2F1 in the Toll-like receptor 4 pathway.

194 citations


Journal ArticleDOI
TL;DR: Using the PET approach for comprehensive transcriptome analysis, fusion transcripts and actively expressed retrotransposed pseudogenes were identified; the PET mapping strategy presented here promises to be a useful tool in annotating the human genome, especially aberrations in human cancer genomes.
Abstract: Identification of unconventional functional features such as fusion transcripts is a challenging task in the effort to annotate all functional DNA elements in the human genome. Paired-End diTag (PET) analysis possesses a unique capability to accurately and efficiently characterize the two ends of DNA fragments, which may have either normal or unusual compositions. This unique nature of PET analysis makes it an ideal tool for uncovering unconventional features residing in the human genome. Using the PET approach for comprehensive transcriptome analysis, we were able to identify fusion transcripts derived from genome rearrangements and actively expressed retrotransposed pseudogenes, which would be difficult to capture by other means. Here, we demonstrate this unique capability through the analysis of 865,000 individual transcripts in two types of cancer cells. In addition to the characterization of a large number of differentially expressed alternative 5′ and 3′ transcript variants and novel transcriptional units, we identified 70 fusion transcript candidates in this study. One was validated as the product of a fusion gene between BCAS4 and BCAS3 resulting from an amplification followed by a translocation event between the two loci, chr20q13 and chr17q23. Through an examination of PETs that mapped to multiple genomic locations, we identified 4055 retrotransposed loci in the human genome, of which at least three were found to be transcriptionally active. The PET mapping strategy presented here promises to be a useful tool in annotating the human genome, especially aberrations in human cancer genomes.

105 citations


Proceedings ArticleDOI
07 Jan 2007
TL;DR: The first succinct tree representation supporting every one of the fundamental operations previously proposed for BP or DFUDS, along with some new operations, in constant time is given; its size surpasses the information-theoretic lower bound and matches the entropy of the tree based on the distribution of node degrees.
Abstract: There exist two well-known succinct representations of ordered trees: BP (balanced parentheses) [Munro, Raman 2001] and DFUDS (depth first unary degree sequence) [Benoit et al. 2005]. Both have size 2n + o(n) bits for n-node trees, which asymptotically matches the information-theoretic lower bound. Many fundamental operations on trees can be done in constant time on the word RAM, for example finding the parent, the first child, the next sibling, the number of descendants, etc. However, there has been no single representation supporting every existing operation in constant time; BP does not support i-th child, while DFUDS does not support lca (lowest common ancestor). In this paper, we give the first succinct tree representation supporting every one of the fundamental operations previously proposed for BP or DFUDS, along with some new operations, in constant time. Moreover, its size surpasses the information-theoretic lower bound and matches the entropy of the tree based on the distribution of node degrees. We call this an ultra-succinct data structure. As a consequence, a tree in which every internal node has exactly two children can be represented in n + o(n) bits. We also show applications for ultra-succinct compressed suffix trees and labeled trees.
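To make the encodings concrete, here is a small illustrative sketch (not the paper's data structure): it builds the DFUDS bit string of an ordered tree and computes the degree-entropy bound that the ultra-succinct representation matches. The nested-list tree format is an assumption of the example.

```python
# A toy sketch, assuming a nested-list tree format: each node is the list of
# its child subtrees. Not the paper's succinct structure.
from collections import Counter
from math import log2

def dfuds(tree):
    """Depth-first unary degree sequence: an extra leading '(', then, in
    preorder, one '(' per child of the node followed by a single ')'."""
    bits = ["("]
    def visit(node):
        bits.append("(" * len(node) + ")")
        for child in node:
            visit(child)
    visit(tree)
    return "".join(bits)

def degree_entropy_bits(tree):
    """The n*H*(T) bound based on the node-degree distribution."""
    degrees = []
    def visit(node):
        degrees.append(len(node))
        for child in node:
            visit(child)
    visit(tree)
    n = len(degrees)
    return n * sum(-(c / n) * log2(c / n) for c in Counter(degrees).values())

# Root with two children; the first child has one child of its own (n = 4).
t = [[[]], []]
print(dfuds(t))                  # ((()())) -- 2n = 8 bits, balanced
print(degree_entropy_bits(t))    # 6.0 bits: the entropy bound can beat 2n
```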

92 citations


Journal ArticleDOI
TL;DR: This paper initiates the study of constructing compressed suffix arrays directly from the text, with a construction algorithm that uses only O(n) bits of working memory and O(n log n) time.
Abstract: With the first human DNA being decoded into a sequence of about 2.8 billion characters, much biological research has been centered on analyzing this sequence. Theoretically speaking, it is now feasible to accommodate an index for human DNA in the main memory so that any pattern can be located efficiently. This is due to the recent breakthrough on compressed suffix arrays, which reduces the space requirement from O(n log n) bits to O(n) bits. However, constructing compressed suffix arrays is still not an easy task because we still have to compute suffix arrays first and need a working memory of O(n log n) bits (i.e., more than 13 gigabytes for human DNA). This paper initiates the study of constructing compressed suffix arrays directly from the text. The main contribution is a construction algorithm that uses only O(n) bits of working memory, and the time complexity is O(n log n). Our construction algorithm is also time and space efficient for texts with large alphabets such as Chinese or Japanese. Precisely, when the alphabet size is |Σ|, the working space is O(n log |Σ|) bits, and the time complexity remains O(n log n), which is independent of |Σ|.
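For readers unfamiliar with the objects involved, the following toy sketch shows a naive suffix array and the BWT derived from it; the paper's contribution is precisely avoiding this naive route, whose intermediates cost O(n log n) bits, by building the compressed index directly from the text.

```python
# Textbook objects only, not the paper's algorithm.
def suffix_array(text):
    # O(n^2 log n) comparison sort: fine here, hopeless for a genome
    return sorted(range(len(text)), key=lambda i: text[i:])

def bwt(text, sa):
    # Burrows-Wheeler transform, the backbone of compressed suffix arrays
    return "".join(text[i - 1] for i in sa)

s = "acaacg$"
sa = suffix_array(s)
print(sa)          # [6, 2, 0, 3, 1, 4, 5]
print(bwt(s, sa))  # gc$aaac
```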

84 citations


Journal ArticleDOI
TL;DR: FS-Weighted Averaging can effectively exploit indirect interactions to improve the inference of protein functions from protein interactions, and is general enough to work over a variety of genomes.
Abstract: Protein-protein interaction has been used to complement traditional sequence homology to elucidate protein function. Most existing approaches only make use of direct interactions to infer function; some have studied the application of indirect interactions for functional inference but were unable to improve prediction performance. We have previously proposed an approach, FS-Weighted Averaging, which uses topological weighting and level-2 indirect interactions (protein pairs connected via two interactions) for predicting protein function from protein interactions, and have found that it yields predictions with superior precision on yeast proteins over existing approaches. Here we study the use of this technique to predict functional annotations from the Gene Ontology for seven genomes: Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana, Rattus norvegicus, Mus musculus, and Homo sapiens. Our analysis shows that protein-protein interactions provide supplementary coverage over sequence homology in the inference of protein function and are a clear complement to it. We also find that FS-Weighted Averaging consistently outperforms two classical approaches, Neighbor Counting and Chi-Square, across the seven genomes for all three categories of the Gene Ontology. By randomly adding and removing interactions from the interaction data, we find that FS-Weighted Averaging is also rather robust against noisy interaction data. We have conducted a comprehensive study over seven genomes. We conclude that FS-Weighted Averaging can effectively make use of indirect interactions to improve the inference of protein functions from protein interactions. Furthermore, the technique is general enough to work over a variety of genomes.
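The following is an illustrative sketch of level-2 weighted prediction in the same spirit; the Jaccard overlap used as the weight is a simple stand-in for the paper's FS-Weight, and the toy network and annotations are invented.

```python
# A sketch assuming a toy PPI network; Jaccard overlap stands in for FS-Weight.
ppi = {"A": {"B", "C"}, "B": {"A", "C", "D"}, "C": {"A", "B"}, "D": {"B"}}
annotations = {"B": {"kinase"}, "C": {"kinase", "transport"}, "D": {"transport"}}

def weight(u, v):                        # neighborhood overlap in [0, 1]
    union = ppi[u] | ppi[v]
    return len(ppi[u] & ppi[v]) / len(union) if union else 0.0

def predict(p):
    level1 = ppi[p]                      # direct interaction partners
    level2 = {w for v in level1 for w in ppi[v]} - level1 - {p}
    scores = {}
    for q in level1 | level2:            # weighted vote over both levels
        for fn in annotations.get(q, ()):
            scores[fn] = scores.get(fn, 0.0) + weight(p, q)
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(predict("A"))   # [('transport', 0.83...), ('kinase', 0.58...)]
```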

75 citations


Journal ArticleDOI
TL;DR: This study proposes a new approach for haplotype association analysis, based on a variable-sized sliding-window framework, that employs regularized regression analysis to tackle the problem of multiple degrees of freedom in the haplotype test.
Abstract: Large-scale haplotype association analysis, especially at the whole-genome level, is still a very challenging task without an optimal solution. In this study, we propose a new approach for haplotype association analysis that is based on a variable-sized sliding-window framework and employs regularized regression analysis to tackle the problem of multiple degrees of freedom in the haplotype test. Our method can handle a large number of haplotypes in association analyses more efficiently and effectively than do currently available approaches. We implement a procedure in which the maximum size of a sliding window is determined by local haplotype diversity and sample size, an attractive feature for large-scale haplotype analyses, such as a whole-genome scan, in which linkage disequilibrium patterns are expected to vary widely. We compare the performance of our method with that of three other methods—a test based on a single-nucleotide polymorphism, a cladistic analysis of haplotypes, and variable-length Markov chains—with use of both simulated and experimental data. By analyzing data sets simulated under different disease models, we demonstrate that our method consistently outperforms the other three methods, especially when the region under study has high haplotype diversity. Built on the regression analysis framework, our method can incorporate other risk-factor information into haplotype-based association analysis, which is becoming an increasingly necessary step for studying common disorders to which both genetic and environmental risk factors contribute.
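A hedged sketch of what a single window's test might look like under this idea: distinct haplotypes in the window become indicator columns, and an L2-regularized logistic regression of disease status on them stands in for the paper's regularized analysis. The data, window size and penalty strength are placeholders.

```python
# A loose sketch under stated assumptions, not the paper's procedure.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
haps = rng.integers(0, 2, size=(200, 10))    # 200 chromosomes x 10 SNPs (toy)
status = rng.integers(0, 2, size=200)        # toy case/control labels

def window_test(haps, y, start, width):
    window = [tuple(row) for row in haps[:, start:start + width]]
    kinds = sorted(set(window))              # distinct haplotypes observed
    X = np.array([[h == k for k in kinds] for h in window], dtype=float)
    model = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
    return model, kinds

model, kinds = window_test(haps, status, start=2, width=4)
print(len(kinds), "haplotypes; coefficients:", model.coef_.round(2))
```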

72 citations


Journal ArticleDOI
TL;DR: Integrated Weighted Averaging is proposed: a scalable, efficient and flexible function prediction framework that integrates diverse information using simple weighting strategies and a local prediction method, enabling predictions based on on-the-fly information fusion.
Abstract: Motivation: With the increasing availability of diverse biological information, protein function prediction approaches have converged towards integration of heterogeneous data. Many adapt existing techniques, such as machine learning and probabilistic methods, that have proven successful on specific data types. However, the impact of these approaches is hindered by a couple of factors. First, there is little comparison between existing approaches. This is in part due to a divergence in the focus adopted by different works, which makes comparison difficult or even fuzzy. Second, there seems to be an over-emphasis on the use of computationally demanding machine-learning methods, which sits at odds with the surge in biological data. Analogous to the success of BLAST for sequence homology search, we believe that the ability to tap the escalating quantity, quality and diversity of biological data is crucial to the success of automated function prediction as a useful instrument for the advancement of proteomic research. We address these problems by: (1) providing a useful comparison between some prominent methods; and (2) proposing Integrated Weighted Averaging (IWA), a scalable, efficient and flexible function prediction framework that integrates diverse information using simple weighting strategies and a local prediction method. The simplicity of the approach makes it possible to make predictions based on on-the-fly information fusion. Results: In addition to its greater efficiency, IWA performs exceptionally well against existing approaches. In the presence of cross-genome information, which is overwhelming for existing approaches, IWA makes even better predictions. We also demonstrate the significance of appropriate weighting strategies in data integration. Contact: hnchua@i2r.a-star.edu.sg Supplementary information: Supplementary data are available at Bioinformatics online.
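A bare-bones sketch of the weighted-integration idea follows; the sources, scores and weights are invented for illustration, whereas the paper derives its weights from each source's reliability.

```python
# A minimal sketch, assuming hand-picked weights (the paper's are data-driven).
def integrate(scores_by_source, weights):
    """scores_by_source: {source: {function: score}} -> fused function scores."""
    fused = {}
    for src, scores in scores_by_source.items():
        for fn, s in scores.items():
            fused[fn] = fused.get(fn, 0.0) + weights.get(src, 0.0) * s
    return fused

sources = {
    "ppi":      {"kinase": 0.8, "transport": 0.2},   # e.g. from a PPI predictor
    "homology": {"kinase": 0.5},                     # e.g. from BLAST hits
}
print(integrate(sources, {"ppi": 0.6, "homology": 0.4}))
# {'kinase': ~0.68, 'transport': ~0.12}
```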

66 citations


Journal ArticleDOI
TL;DR: The findings show that microarrays can be used for the robust and accurate diagnosis of pathogens, and further substantiate the use of microarray technology in clinical diagnostics.
Abstract: DNA microarrays used as 'genomic sensors' have great potential in clinical diagnostics. Biases inherent in random PCR-amplification, cross-hybridization effects, and inadequate microarray analysis, however, limit detection sensitivity and specificity. Here, we have studied the relationships between viral amplification efficiency, hybridization signal, and target-probe annealing specificity using a customized microarray platform. Novel features of this platform include the development of a robust algorithm that accurately predicts PCR bias during DNA amplification and can be used to improve PCR primer design, as well as a powerful statistical concept for inferring pathogen identity from probe recognition signatures. Compared to real-time PCR, the microarray platform identified pathogens with 94% accuracy (76% sensitivity and 100% specificity) in a panel of 36 patient specimens. Our findings show that microarrays can be used for the robust and accurate diagnosis of pathogens, and further substantiate the use of microarray technology in clinical diagnostics.

63 citations


Proceedings ArticleDOI
01 Jan 2007
TL;DR: Augmenting protein-protein interaction networks with topologically weighted indirect interactions improves the precision of clusters predicted by various existing clustering algorithms, and the proposed complex-finding algorithm performs very well on networks modified in this way.
Abstract: Protein complexes are fundamental for understanding principles of cellular organization. Accurate and fast protein complex prediction from the PPI networks of increasing sizes can serve as a guide for biological experiments to discover novel protein complexes. However, protein complex prediction from PPI networks is a hard problem, especially when the PPI network is noisy. We know from previous work that proteins that do not interact but share interaction partners (level-2 neighbors) often share biological functions. The strength of functional association can be estimated using a topological weight, FS-Weight. Here we study the use of indirect interactions between level-2 neighbors (level-2 interactions) for protein complex prediction. All direct and indirect interactions are first weighted using the topological weight (FS-Weight). Interactions with low weight are removed from the network, while level-2 interactions with high weight are introduced into the interaction network. Existing clustering algorithms can then be applied to this modified network. We also propose a novel algorithm that searches for cliques in the modified network and merges cliques to form clusters using a "partial clique merging" method. In this paper, we show that (1) augmenting protein-protein interactions with topologically weighted indirect interactions improves the precision of clusters predicted by various existing clustering algorithms; and (2) our complex-finding algorithm performs very well on interaction networks modified in this way. Since no information other than the original PPI network is used, our approach would be very useful for protein complex prediction, especially for the prediction of novel protein complexes.
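A rough sketch of this pipeline is given below, with a Jaccard neighborhood overlap standing in for FS-Weight and a greedy merge simplifying the paper's partial clique merging; the thresholds are arbitrary.

```python
# A sketch under stated assumptions, not the paper's exact algorithm.
import networkx as nx

def overlap_weight(g, u, v):
    nu, nv = set(g[u]), set(g[v])
    return len(nu & nv) / len(nu | nv) if nu | nv else 0.0

def modified_network(g, keep=0.2, add=0.5):
    h = nx.Graph()
    for u, v in g.edges:                        # drop weak direct edges
        if overlap_weight(g, u, v) >= keep:
            h.add_edge(u, v)
    for u in g:                                 # add strong level-2 edges
        for v in g:
            if u < v and not g.has_edge(u, v) and set(g[u]) & set(g[v]):
                if overlap_weight(g, u, v) >= add:
                    h.add_edge(u, v)
    return h

def merged_cliques(h, min_size=3, overlap=0.5):
    clusters = [set(c) for c in nx.find_cliques(h) if len(c) >= min_size]
    merged = True
    while merged:                               # greedily merge overlapping cliques
        merged = False
        for i, a in enumerate(clusters):
            for j in range(i + 1, len(clusters)):
                b = clusters[j]
                if len(a & b) / min(len(a), len(b)) >= overlap:
                    clusters[i] = a | b
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters

g = nx.Graph([("A","B"), ("A","C"), ("B","C"), ("B","D"), ("C","D"), ("D","E")])
print(merged_cliques(modified_network(g)))      # [{'A', 'B', 'C', 'D'}]
```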

Journal ArticleDOI
TL;DR: This article proposes a novel approach for identifying spaced motifs with any number of spacers of different lengths by introducing the notion of submotifs to capture the segments in the spaced motif and providing an algorithm called SPACE to solve the problem.
Abstract: Motivation: Identification of motifs is one of the critical stages in studying the regulatory interactions of genes. Motifs can have complicated patterns. In particular, spaced motifs, an important class of motifs, consist of several short segments separated by spacers of different lengths. Locating spaced motifs is not trivial. Existing motif-finding algorithms are either designed for monad motifs (short contiguous patterns with some mismatches), or make assumptions on the spacer lengths, or can only handle at most two segments. An effective motif finder for generic spaced motifs is highly desirable. Results: This article proposes a novel approach for identifying spaced motifs with any number of spacers of different lengths. We introduce the notion of submotifs to capture the segments in the spaced motif and formulate the motif-finding problem as a frequent submotif mining problem. We provide an algorithm called SPACE to solve the problem. Based on experiments on real biological datasets, synthetic datasets and the motif assessment benchmarks by Tompa et al., we show that our algorithm performs better than existing tools for spaced motifs, with improvements in both sensitivity and specificity, and that for monads SPACE performs as well as other tools. Availability: The source code is available upon request from the authors. Contact: ksung@comp.nus.edu.sg Supplementary information: Supplementary data are available at Bioinformatics online.

Book ChapterDOI
17 Dec 2007
TL;DR: The solution to the pattern-only case improves the matching time of the previous work tremendously in practice, and can be extended to handle optional wildcards, each of which can match zero or one character.
Abstract: Given a text T of length n, the classical indexing problem for pattern matching is to build an index for T so that for any query pattern P, we can report efficiently all occurrences of P in T. Cole et al. (2004) extended this problem to allow don't-care characters (wildcards) in the text and pattern, and they gave the first index that supports efficient pattern matching. The space complexity of this index is linear in n (the text length) but exponential in the number of wildcards. Motivated by bioinformatics applications, we investigate indexes whose size depends on n only. In the literature, space-efficient indexes for wildcard matching are known only for the special case where wildcards appear only in the pattern (Iliopoulos and Rahman, 2007); the space required is O(n). Little is known for the cases where wildcards appear in the text only, or in both the text and the pattern. In this paper we give an O(n)-space index to support efficient wildcard matching in all three cases. Our solution to the pattern-only case improves the matching time of the previous work tremendously in practice. In addition, our solution can be extended to handle optional wildcards, each of which can match zero or one character.
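For concreteness, here is the matching relation being indexed, stated naively; the paper's point is answering such queries from an O(n)-space index rather than by scanning, and extending '?' to optional wildcards.

```python
# Naive scan, for illustration only: '?' in either string matches any character.
def wildcard_occurrences(text, pattern, wild="?"):
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if all(a == b or wild in (a, b)
                   for a, b in zip(text[i:i + m], pattern))]

print(wildcard_occurrences("ab?aba", "a?a"))   # [0, 3]
```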

Book ChapterDOI
08 Sep 2007
TL;DR: This paper studies the adaptive version of the point placement problem on a line, which is motivated by a DNA mapping problem, and shows that 4n/3+O(√n) queries are sufficient for the case of two rounds while the best known result was 3n/2 queries.
Abstract: In this paper, we study the adaptive version of the point placement problem on a line, which is motivated by a DNA mapping problem. To identify the relative positions of n distinct points on a straight line, we are allowed to ask queries of pairwise distances of the points in rounds. The problem is to find the number of queries required to determine a unique solution for the positions of the points up to translation and reflection. We improve the bounds for several cases. We show that 4n/3+O(√n) queries are sufficient for the case of two rounds, while the best previously known result was 3n/2 queries. For an unlimited number of rounds, the best previous result was 4n/3 queries; we obtain a much better result of 5n/4+O(√n) queries using only three rounds. We also improve the lower bound for the case of two rounds from 30n/29 to 17n/16.
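Not the paper's scheme, but a naive baseline that makes the query model concrete: distances to two anchor points pin down every position up to translation and reflection using 2n − 3 queries in a single round; the paper's adaptive rounds push this toward n.

```python
# A naive one-round baseline, for illustration only.
def place_points(n, dist):
    pos = {0: 0.0, 1: dist(0, 1)}            # anchor p0 at 0, p1 to its right
    for i in range(2, n):
        d0, d1 = dist(0, i), dist(1, i)      # two queries per remaining point
        # pi sits at +d0 or -d0; its distance to p1 disambiguates
        pos[i] = d0 if abs(abs(pos[1] - d0) - d1) < 1e-9 else -d0
    return pos

truth = [0.0, 5.0, 2.0, -3.0, 7.5]
queries = 0
def dist(i, j):
    global queries
    queries += 1
    return abs(truth[i] - truth[j])

print(place_points(len(truth), dist))        # recovers truth up to reflection
print(queries, "queries")                    # 7 = 2n - 3 for n = 5
```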

Proceedings ArticleDOI
15 Apr 2007
TL;DR: This paper designs an IO-efficient and compact partitioned suffix tree representation (CPS-tree) on disk that outperforms other disk-based indexes, making the CPS-tree a good disk-based representation of the suffix tree with potential use in practical applications.
Abstract: The suffix tree is an important data structure for indexing a long sequence (like a genome sequence) or a concatenation of sequences. It finds many applications in practice, especially in the domain of bioinformatics. The suffix tree allows for efficient pattern search with time independent of the sequence length. However, the performance of a disk-based suffix tree is a concern, as it is slowed down significantly by poor access locality resulting in heavy disk I/O. The focus of this paper is to design an IO-efficient and compact partitioned suffix tree representation (CPS-tree) on disk. We show that representing the suffix tree using the CPS-tree has several advantages. First, our representation allows us to visit any node in the suffix tree by accessing at most log n pages of the tree, where n is the length of the sequence. Second, our storage scheme improves the access pattern and reduces the number of page faults, resulting in efficient search and efficient tree traversal operations. Third, by bit packing, our index is compact. Experimental results show that the CPS-tree outperforms other disk-based indexes. When fully loaded into main memory, the CPS-tree is still efficient. Hence, we expect the CPS-tree to be a good disk-based representation of the suffix tree, with potential use in practical applications.

Journal ArticleDOI
TL;DR: The small parsimony problem, which is polynomial-time solvable for phylogenetic trees, is studied for reconstructing recombination networks from sequence data and is proved NP-hard even for galled recombination networks.
Abstract: The small parsimony problem is studied for reconstructing recombination networks from sequence data. The small parsimony problem is polynomial-time solvable for phylogenetic trees. However, the problem is proved NP-hard even for galled recombination networks. A dynamic programming algorithm is also developed to solve the small parsimony problem. It takes $O(dn2^{3h})$ time on an input recombination network over length-$d$ sequences in which there are $h$ recombination nodes and $n - h$ tree nodes.
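For orientation, here is the polynomial-time tree case for a single character: Fitch's small-parsimony algorithm, which the paper's dynamic program extends to networks with recombination nodes (at a cost exponential in h).

```python
# Fitch's algorithm on a binary tree, one character; a standard baseline,
# not the paper's network algorithm.
def fitch(children, leaf_state, root="root"):
    """children: internal node -> (left, right); returns (root state set,
    minimum number of state changes on the tree)."""
    changes = 0
    def visit(node):
        nonlocal changes
        if node not in children:             # leaf
            return {leaf_state[node]}
        a, b = (visit(c) for c in children[node])
        if a & b:
            return a & b
        changes += 1                         # empty intersection costs one change
        return a | b
    return visit(root), changes

children = {"root": ("x", "y"), "x": ("L1", "L2"), "y": ("L3", "L4")}
states = {"L1": "A", "L2": "C", "L3": "A", "L4": "A"}
print(fitch(children, states))               # ({'A'}, 1): one substitution suffices
```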

Journal ArticleDOI
TL;DR: The highly correlated results unmistakably point to a systematic downregulation of mitochondrial activities, which the authors hypothesize aims to downgrade mitochondria-mediated apoptosis and the dependency of cancer cells on angiogenesis.
Abstract: Melanoma is the major cause of skin cancer deaths, and melanoma incidence doubles every 10 to 20 years. However, little is known about melanoma pathway aberrations. Here we applied the robust Gene Identification Signature Paired End diTag (GIS-PET) approach to investigate the melanoma transcriptome and characterize the global pathway aberrations. GIS-PET technology directly links 5' mRNA signatures with their corresponding 3' signatures to generate, and then concatenate, PETs for efficient sequencing. We annotated PETs to pathways of the KEGG database and compared the murine B16F1 melanoma transcriptome with three non-melanoma murine transcriptomes (Melan-a2 melanocytes, E14 embryonic stem cells, and E17.5 embryo). Gene expression levels as represented by PET counts were compared across melanoma and melanocyte libraries to identify the most significantly altered pathways and investigate the expression levels of crucial cancer genes. Melanin biosynthesis genes were solely expressed in the cells of melanocytic origin, indicating the feasibility of using the PET approach for transcriptome comparison. The most significantly altered pathways were metabolic pathways, including upregulated pathways: purine metabolism, aminophosphonate metabolism, tyrosine metabolism, selenoamino acid metabolism, galactose utilization, nitrobenzene degradation, and bisphenol A degradation; and downregulated pathways: oxidative phosphorylation, ATPase synthesis, TCA cycle, pyruvate metabolism, and glutathione metabolism. The downregulated pathways concurrently indicated a slowdown of mitochondrial activities. Mitochondrial permeability was also significantly altered, as indicated by transcriptional activation of ATP/ADP, citrate/malate, Mg++, fatty acid and amino acid transporters, and transcriptional repression of zinc and metal ion transporters. Upregulation of the cell cycle progression, MAPK, and PI3K/Akt pathways was more limited to certain regions of each pathway. Expression levels of c-Myc and Trp53 were also higher in melanoma. Moreover, transcriptional variants resulting from alternative transcription start sites or alternative polyadenylation sites were found in Ras and genes encoding adhesion or cytoskeleton proteins such as integrin, β-catenin, α-catenin, and actin. The highly correlated results unmistakably point to a systematic downregulation of mitochondrial activities, which we hypothesize aims to downgrade mitochondria-mediated apoptosis and the dependency of cancer cells on angiogenesis. Our results also demonstrate the advantage of using the PET approach in conjunction with the KEGG database for systematic pathway analysis.

Book ChapterDOI
12 Dec 2007
TL;DR: A compressed version of the dynamic trie data structure is proposed that is not only space-efficient but also allows pattern searching in o(|P|) time and leaf insertion/deletion in o(log n) time, where |P| is the length of the pattern and n is the size of the trie.
Abstract: The dynamic trie is a fundamental data structure with applications in many areas. This paper proposes a compressed version of the dynamic trie data structure. Our data structure is not only space-efficient but also allows pattern searching in o(|P|) time and leaf insertion/deletion in o(log n) time, where |P| is the length of the pattern and n is the size of the trie. To demonstrate the usefulness of the new data structure, we apply it to the LZ-compression problem. For a string S of length s over an alphabet A of size σ, the previously best known algorithms for computing the Ziv-Lempel encoding (LZ78) of S run in either (1) O(s) time and O(s log s) bits of working space, or (2) O(sσ) time and O(sHk + s log σ / logσ s) bits of working space, where Hk is the k-th order entropy of the text. No previous algorithm runs in sublinear time. Our new data structure yields an LZ-compression algorithm that runs in sublinear time and uses optimal working space. More precisely, the algorithm uses O(s(log σ + log logσ s) / logσ s) bits of working space and runs in O(s(log log s)^2 / (logσ s · log log log s)) worst-case time, which is sublinear when σ = 2^{o(log s · log log log s / (log log s)^2)}.
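As a reference point, here is the plain LZ78 computation over an ordinary dict-of-dicts trie; it is neither space-optimal nor sublinear-time, which is exactly what the compressed dynamic trie is designed to fix.

```python
# Reference LZ78 behaviour over an uncompressed trie, for illustration only.
def lz78(s):
    root = {}                        # trie node: char -> (phrase_id, child node)
    node, last_id, next_id, out = root, 0, 1, []
    for ch in s:
        if ch in node:
            last_id, node = node[ch]         # keep extending the current phrase
        else:
            node[ch] = (next_id, {})         # new phrase = known phrase + ch
            out.append((last_id, ch))
            next_id += 1
            node, last_id = root, 0
    if node is not root:
        out.append((last_id, ""))            # flush a pending partial phrase
    return out

print(lz78("abababb"))   # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'b')]
```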

Journal ArticleDOI
TL;DR: A method called Partial Energy ratio for Microarray (PEM) is proposed for the analysis of time course microarray data; experiments show the robustness and generality of PEM in identifying genes of interest.
Abstract: Replication of time series in microarray experiments is costly. To analyze time series data with no replicate, many model-specific approaches have been proposed. However, they fail to identify the genes whose expression patterns do not fit the pre-defined models. Besides, modeling the temporal expression patterns is difficult when the dynamics of gene expression in the experiment is poorly understood. We propose a method called Partial Energy ratio for Microarray (PEM) for the analysis of time course microarray data. In the PEM method, we assume that gene expression varies smoothly in the temporal domain. This assumption is comparatively weak and hence the method is general enough to identify genes expressed in unexpected patterns. To identify the differentially expressed genes, a new statistic is developed by comparing the energies of two convoluted profiles. We further improve the statistic for microarray analysis by introducing the concept of partial energy. The PEM statistic can be easily incorporated into the SAM framework for significance analysis. We evaluated the PEM method on an artificial dataset and two published time course cDNA microarray datasets on yeast. The experimental results show the robustness and the generality of the PEM method in identifying genes of interest.
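A loose sketch of the stated idea (not the PEM statistic itself): smooth a time-course profile by convolution and compare the energy of the smooth component against the residual, so smoothly varying genes score high while flat or erratic ones score low. The kernel and data are invented.

```python
# A loose sketch under stated assumptions; details differ from the paper.
import numpy as np

def energy_ratio(profile, kernel=np.ones(3) / 3):
    centered = profile - profile.mean()
    smooth = np.convolve(centered, kernel, mode="same")   # convoluted profile
    residual = centered - smooth
    return np.sum(smooth**2) / (np.sum(residual**2) + 1e-12)

t = np.linspace(0, 2 * np.pi, 12)
rng = np.random.default_rng(1)
wave = np.sin(t) + 0.1 * rng.normal(size=12)      # smooth temporal change
noise = 0.1 * rng.normal(size=12)                 # no real signal
print(energy_ratio(wave), energy_ratio(noise))    # the smooth profile scores far higher
```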


Journal ArticleDOI
TL;DR: An $O(n)$-time algorithm is given for comparing two galled-trees, an important, biologically meaningful special case of phylogenetic networks, together with an $O(n + kh)$-time algorithm for comparing a galled-tree with another general network, where h and k are the number of hybrid nodes in the latter network and in its biggest biconnected component, respectively.
Abstract: Consider two phylogenetic networks $\mathcal{N}$ and $\mathcal{N}'$ of size n. The tripartition-based distance finds the proportion of tripartitions which are not shared by $\mathcal{N}$ and $\mathcal{N}'$. This distance was proposed by Moret et al. (2004) and is a generalization of the Robinson-Foulds distance, which is originally used to compare two phylogenetic trees. This paper gives an $O(\min\{kn \log n, n \log n + hn\})$-time algorithm to compute this distance, where h is the number of hybrid nodes in $\mathcal{N}$ and $\mathcal{N}'$, while k is the maximum number of hybrid nodes among all biconnected components in $\mathcal{N}$ and $\mathcal{N}'$. Note that $k \ll h \ll n$ in a phylogenetic network. In addition, we propose algorithms for comparing galled-trees, an important, biologically meaningful special case of phylogenetic networks. We give an $O(n)$-time algorithm for comparing two galled-trees. We also give an $O(n + kh)$-time algorithm for comparing a galled-tree with another general network, where h and k are the number of hybrid nodes in the latter network and in its biggest biconnected component, respectively.
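For intuition, here is the tree special case that the tripartition distance generalizes: Robinson-Foulds on rooted trees, counting clades present in exactly one of the two trees. The nested-tuple tree format is an assumption of the example.

```python
# Robinson-Foulds on rooted trees, for illustration; not the network algorithm.
def clades(tree):
    out = set()
    def visit(node):
        if isinstance(node, str):                  # leaf
            return frozenset([node])
        below = frozenset().union(*(visit(c) for c in node))
        out.add(below)
        return below
    everything = visit(tree)
    out.discard(everything)                        # the root clade is always shared
    return out

def rf_distance(t1, t2):
    return len(clades(t1) ^ clades(t2))            # clades in exactly one tree

t1 = ((("a", "b"), "c"), ("d", "e"))
t2 = ((("a", "c"), "b"), ("d", "e"))
print(rf_distance(t1, t2))   # 2: clades {a,b} and {a,c} are unshared
```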

01 Jan 2007
TL;DR: This paper considers anchors with mismatches in order to increase the effectiveness of locating conserved regions in whole genome alignment.
Abstract: Recent work on whole genome alignment has resulted in efficient tools to locate (possibly) conserved regions of two genomic sequences. Most such tools start by locating a set of short and highly similar substrings (called anchors) that are present in both genomes. These anchors provide clues for the conserved regions, and the effectiveness of the tools is highly related to the quality of the anchors. Some popular software tools use the exact-match maximal unique substrings (EM-MUM) as anchors. However, the result is not satisfactory, especially for genomes with high mutation rates (e.g., viruses). In our experiments, we found that more than 40% of the conserved genes are not recovered. In this paper, we consider anchors with mismatches in order to increase the effectiveness of locating conserved regions. Key-Words: Whole genome alignment, anchors with mismatches, conserved regions
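A naive sketch of the anchoring step discussed above: report length-k substring pairs shared with at most one mismatch as candidate anchors. Uniqueness filtering and chaining, which real aligners add, are omitted, and k is arbitrary.

```python
# A quadratic-time illustration only; real anchor finders use indexes.
def mismatch_anchors(a, b, k=8, max_mm=1):
    hits = []
    for i in range(len(a) - k + 1):
        for j in range(len(b) - k + 1):
            mm = sum(x != y for x, y in zip(a[i:i + k], b[j:j + k]))
            if mm <= max_mm:
                hits.append((i, j, mm))    # anchor: a[i:i+k] ~ b[j:j+k]
    return hits

print(mismatch_anchors("ACGTACGTAC", "ACGAACGTAC", k=8))
# [(0, 0, 1), (1, 1, 1), (2, 2, 1)]
```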

Book ChapterDOI
21 Apr 2007
TL;DR: RB-Finder, a fast and accurate distance-based window method to detect recombination in a multiple sequence alignment, is introduced; it is faster than existing phylogeny-based methods since it does not need to construct and compare complex phylogenetic trees.
Abstract: Recombination detection is important before inferring phylogenetic relationships. This will eventually lead to a better understanding of pathogen evolution, more accurate genotyping and advancements in vaccine development. In this paper, we introduce RB-Finder, a fast and accurate distance-based window method to detect recombination in a multiple sequence alignment. Our method introduces a more informative distance measure and a novel weighting strategy to reduce the window size sensitivity problem and hence improve the accuracy of breakpoint detection. Furthermore, our method is faster than existing phylogeny-based methods since we do not need to construct and compare complex phylogenetic trees. When compared with the current best method Pruned-PDM, we are a few hundred times more efficient. Experimental evaluation of RB-Finder using synthetic and biological datasets showed that our method is more accurate than existing phylogeny-based methods. We also show how our method has potential use in other related applications such as genotyping.
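A heavily simplified caricature of distance-based scanning (RB-Finder's distance measure and weighting strategy are richer): slide a window along the alignment and flag positions where the query's nearest reference changes.

```python
# A caricature of windowed distance scanning, not RB-Finder itself.
def breakpoints(query, refs, window=20, step=5):
    calls = []
    for start in range(0, len(query) - window + 1, step):
        q = query[start:start + window]
        best = min(refs, key=lambda name: sum(
            a != b for a, b in zip(q, refs[name][start:start + window])))
        calls.append((start, best))
    return [(pos, who) for i, (pos, who) in enumerate(calls)
            if i > 0 and calls[i - 1][1] != who]   # parent switch = signal

refs = {"parentA": "A" * 40 + "C" * 40, "parentB": "G" * 80}
query = "A" * 40 + "G" * 40                        # recombinant of A and B
print(breakpoints(query, refs))                    # switch detected near position 40
```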

Book ChapterDOI
14 Aug 2007
TL;DR: This paper shows how to build a software tool called BWT-SW that exploits a BWT index of a text T to speed up the dynamic programming for finding all local alignments with any pattern P, and reveals that BWT-SW is the first practical tool that can find all local alignments.
Abstract: Recent experimental studies on compressed indexes (BWT, CSA, FM-index) have confirmed their practicality for indexing long DNA sequences such as the human genome (about 3 billion characters) in the main memory [5,13,16]. However, these indexes are designed for exact pattern matching, which is too stringent for most biological applications. The demand is often on finding local alignments (pairs of similar substrings with gaps allowed). In this paper, we show how to build a software tool called BWT-SW that exploits a BWT index of a text T to speed up the dynamic programming for finding all local alignments with any pattern P. Experiments reveal that BWT-SW is very efficient (e.g., aligning a pattern of length 3,000 with the human genome takes less than a minute). We have also analyzed BWT-SW mathematically, using a simpler model (with gaps disallowed) and random strings. We find that the expected running time is O(|T|^0.628 |P|). As far as we know, BWT-SW is the first practical tool that can find all local alignments.
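Not BWT-SW itself, but the primitive it builds on: FM-index backward search over the BWT, which BWT-SW generalizes from exact matching to dynamic-programming alignment over suffix-array ranges.

```python
# Exact-match backward search; a textbook building block, not BWT-SW.
def fm_index(text):
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    C, total = {}, 0
    for c in sorted(set(text)):          # C[c] = #characters smaller than c
        C[c] = total
        total += text.count(c)
    return bwt, C

def count_occurrences(bwt, C, pattern):
    lo, hi = 0, len(bwt)                 # current suffix-array range [lo, hi)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + bwt[:lo].count(c)
        hi = C[c] + bwt[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo

bwt, C = fm_index("acaacg")
print(count_occurrences(bwt, C, "ac"))   # 2: "ac" occurs at positions 0 and 3
```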

Journal ArticleDOI
TL;DR: A method called MotifVoter is described that identifies transcription factor binding sites by integrating the results found by motif finders of different models. Validation on Tompa's benchmark and on real metazoan and E. coli datasets shows that it can improve sensitivity significantly without sacrificing precision, offering a practical alternative for biologists to study novel transcription factors. The MotifVoter software is available for public use at: http://www.comp.nus.edu.sg/~bioinfo/MotifVoter
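A toy rendition of the ensemble idea (not MotifVoter's actual statistic): pool candidate binding sites from several motif finders and keep those supported by a majority.

```python
# Majority voting over finder outputs; a simplification, not MotifVoter.
from collections import Counter

def vote(site_sets, min_support=None):
    """site_sets: one set of (sequence_id, position) predictions per finder."""
    if min_support is None:
        min_support = len(site_sets) // 2 + 1
    tally = Counter(site for sites in site_sets for site in sites)
    return {site for site, n in tally.items() if n >= min_support}

finders = [{("seq1", 10), ("seq1", 42)},
           {("seq1", 10)},
           {("seq1", 10), ("seq2", 7)}]
print(vote(finders))   # {('seq1', 10)}: the site most finders agree on
```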
Abstract: Locating transcription factor binding sites is a key step in understanding gene regulation. Due to its importance, many _de novo_ motif-finding methods have been proposed. Individually, these motif finders perform unimpressively overall based on Tompa's benchmark datasets. Moreover, these motif finders vary in their definitions of what constitute a motif, and in their methods for finding statistically overrepresented motifs. There is no clear way for biologists to choose the motif finder that is most suitable for their task. The purpose of this work is to describe a method called MotifVoter to identify transcription factor binding sites by integrating the results found by motif finders of different models. Validation of our method on Tompa's benchmark, real metazoan and _E. coli_ datasets show that it can improve the sensitivity significantly without sacrificing the precision. Our approach offers a practical alternative for biologists to study novel transcription factors.The MotifVoter software is available for public use at: http://www.comp.nus.edu.sg/~bioinfo/MotifVoter