
Showing papers by "Wing-Kin Sung published in 2005"


Journal ArticleDOI
TL;DR: GIS analysis, in which 5′ and 3′ signatures of full-length cDNAs are accurately extracted into paired-end ditags (PETs) that are concatenated for efficient sequencing and mapped to genome sequences to demarcate the transcription boundaries of every gene, is developed.
Abstract: We have developed a DNA tag sequencing and mapping strategy called gene identification signature (GIS) analysis, in which 5' and 3' signatures of full-length cDNAs are accurately extracted into paired-end ditags (PETs) that are concatenated for efficient sequencing and mapped to genome sequences to demarcate the transcription boundaries of every gene. GIS analysis is potentially 30-fold more efficient than standard cDNA sequencing approaches for transcriptome characterization. We demonstrated this approach with 116,252 PET sequences derived from mouse embryonic stem cells. Initial analysis of this dataset identified hundreds of previously uncharacterized transcripts, including alternative transcripts of known genes. We also uncovered several intergenically spliced and unusual fusion transcripts, one of which was confirmed as a trans-splicing event and was differentially expressed. The concept of paired-end ditagging described here for transcriptome analysis can also be applied to whole-genome analysis of cis-regulatory and other DNA elements and represents an important technological advance for genome annotation.
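The mapping step described above (locating a transcript's paired 5' and 3' signatures on the genome to demarcate its transcription boundaries) can be illustrated with a toy sketch; the `map_ditag` helper, the tag sequences, and the toy genome below are all invented for illustration and do not reflect the actual GIS pipeline:

```python
def map_ditag(genome, tag5, tag3):
    """Locate a 5'/3' paired-end ditag on a genome string and return the
    implied transcription boundaries (start of 5' tag .. end of 3' tag).
    Returns None if the tags do not map in the expected orientation."""
    start = genome.find(tag5)
    if start < 0:
        return None
    # the 3' tag must map downstream of the 5' tag on the same strand
    end = genome.find(tag3, start + len(tag5))
    if end < 0:
        return None
    return (start, end + len(tag3))

# toy genome with an embedded "transcript" whose ends are the two tags
genome = "TTTTACGTACGTGGGGGGCCCCAATTAATT"
boundaries = map_ditag(genome, "ACGTACGT", "CCCCAATT")
```

Real GIS tags are short fixed-length signatures extracted enzymatically, and genome-scale mapping requires an index rather than a linear scan.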

257 citations


Journal ArticleDOI
TL;DR: The maximum agreement phylogenetic subnetwork problem (MASN) is introduced and it is proved that the problem is NP-hard even if restricted to three phylogenetic networks, and an O(n^2)-time algorithm is given for the special case of two level-1 phylogenetic networks.

87 citations


Journal ArticleDOI
TL;DR: The first evidence that common genetic variation within LRRK2 contributes to the risk of sporadic PD in the Chinese population is provided, including a haplotype that dramatically increases disease risk when present in two copies.
Abstract: Parkinson's disease (PD) is a complex neurodegenerative disorder whose aetiologies are largely unknown. To date, mutations in six genes have been found causal for some rare familial forms of the disease and common variation within at least three of these is associated with the more common sporadic forms of PD. LRRK2 is the most recently identified familial PD gene, although its role in sporadic disease is unknown. In this study, we have performed the first comprehensive evaluation of common genetic variation within LRRK2 and investigated its contribution to risk of sporadic PD. We first characterized the linkage disequilibrium within LRRK2 using a panel of densely spaced SNPs across the gene. We then identified a subset of tagging-SNPs (tSNP) that capture the majority of common variation within LRRK2. Both single tSNP and tSNP haplotype analyses, using a large epidemiologically matched sporadic case-control series comprising 932 individuals, yielded significant evidence for disease association. We identified a haplotype that dramatically increases disease risk when present in two copies (OR=5.5, 95%CI=2.1-14.0, P=0.0001). Thus, we provide the first evidence that common genetic variation within LRRK2 contributes to the risk of sporadic PD in the Chinese population.

81 citations


Journal ArticleDOI
TL;DR: Clustering 20 amino acids into a few groups by the proposed greedy algorithm provides a new way to extract features from protein sequences to cover more adjacent amino acids and hence reduce the dimensionality of the input vector of protein features.
Abstract: Predicting the subcellular localization of proteins is important for determining the function of proteins. Previous works focusing on predicting protein localization in Gram-negative bacteria obtained good results. However, these methods had relatively low accuracies for the localization of extracellular proteins. This paper studies ways to improve the accuracy for predicting extracellular localization in Gram-negative bacteria. We have developed a system for predicting the subcellular localization of proteins for Gram-negative bacteria based on amino acid subalphabets and a combination of multiple support vector machines. The recall of the extracellular site and overall recall of our predictor reach 86.0% and 89.8%, respectively, in 5-fold cross-validation. To the best of our knowledge, these are the most accurate results for predicting subcellular localization in Gram-negative bacteria. Clustering 20 amino acids into a few groups by the proposed greedy algorithm provides a new way to extract features from protein sequences to cover more adjacent amino acids and hence reduce the dimensionality of the input vector of protein features. It was observed that a good amino acid grouping leads to an increase in prediction performance. Furthermore, a proper choice of a subset of complementary support vector machines constructed by different features of proteins maximizes the prediction accuracy.
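The central feature-extraction idea (clustering the 20 amino acids into a few groups and counting adjacent-pair frequencies over the reduced alphabet) can be sketched as follows; the 4-group clustering used here is an arbitrary illustration, not the grouping produced by the paper's greedy algorithm:

```python
from itertools import product

# an illustrative 4-group reduced alphabet (NOT the paper's learned grouping)
GROUPS = {
    "h": "AVLIMFWC",   # hydrophobic
    "p": "STNQYG",     # polar
    "+": "KRH",        # positively charged
    "-": "DEP",        # remaining residues
}
AA_TO_GROUP = {aa: g for g, aas in GROUPS.items() for aa in aas}

def grouped_dipeptide_features(seq):
    """Return dipeptide (adjacent-pair) frequencies over the reduced alphabet.
    With 4 groups the feature vector has 4*4 = 16 dimensions instead of 400."""
    keys = ["".join(p) for p in product(GROUPS, repeat=2)]
    counts = dict.fromkeys(keys, 0)
    pairs = 0
    for a, b in zip(seq, seq[1:]):
        if a in AA_TO_GROUP and b in AA_TO_GROUP:
            counts[AA_TO_GROUP[a] + AA_TO_GROUP[b]] += 1
            pairs += 1
    return {k: v / pairs for k, v in counts.items()} if pairs else counts

feats = grouped_dipeptide_features("MKKLAV")
```

The dimensionality reduction is the point: a grouped dipeptide vector covers two adjacent residues at the cost a single-residue composition vector would otherwise pay.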

80 citations


Journal ArticleDOI
TL;DR: This paper presents a framework using the Tree-Augmented Bayesian Networks (TAN) which performs multi-classification based on the theory of learning Bayesian networks and using improved feature vector representation of (Ding et al., 2001).
Abstract: Due to the large volume of protein sequence data, computational methods to determine the structure class and the fold class of a protein sequence have become essential. Several techniques based on sequence similarity, Neural Networks, Support Vector Machines (SVMs), etc. have been applied. Since most of these classifiers use binary classifiers for multi-classification, there may be NC2 (i.e., N choose 2) classifiers required. This paper presents a framework using Tree-Augmented Bayesian Networks (TAN) which performs multi-classification based on the theory of learning Bayesian Networks and using the improved feature vector representation of (Ding et al., 2001). In order to enhance TAN's performance, pre-processing of data is done by feature discretization and post-processing is done by using a Mean Probability Voting (MPV) scheme. The advantage of using the Bayesian approach over other learning methods is that the network structure is intuitive. In addition, one can read off the TAN structure probabilities to determine the significance of each feature (say, hydrophobicity) for each class, which helps to further understand the complexity in protein structure. The experiments on the datasets used in three prominent recent works show that our approach is more accurate than other discriminative methods. The framework is implemented on the BAYESPROT web server and it is available at . More detailed results are also available on the above website.

59 citations


Journal ArticleDOI
TL;DR: It is proved that MASP is NP-hard for any fixed $k \geq 3$ when $D$ is unrestricted, and also NP-hard for any fixed $D \geq 2$ when $k$ is unrestricted even if each input tree is required to contain at most three leaves.
Abstract: Given a set $\T$ of rooted, unordered trees, where each $T_i \in \T$ is distinctly leaf-labeled by a set $\Lambda(T_i)$ and where the sets $\Lambda(T_i)$ may overlap, the maximum agreement supertree problem~(MASP) is to construct a distinctly leaf-labeled tree $Q$ with leaf set $\Lambda(Q) \subseteq \bigcup_{T_i \in \T} \Lambda(T_i)$ such that $|\Lambda(Q)|$ is maximized and for each $T_i \in \T$, the topological restriction of $T_i$ to $\Lambda(Q)$ is isomorphic to the topological restriction of $Q$ to $\Lambda(T_i)$. Let $n = \left| \bigcup_{T_i \in \T} \Lambda(T_i) \right|$, $k = |\T|$, and $D = \max_{T_i \in \T}\{\deg(T_i)\}$. We first show that MASP with $k = 2$ can be solved in $O(\sqrt{D} n \log (2n/D))$ time, which is $O(n \log n)$ when $D = O(1)$ and $O(n^{1.5})$ when $D$ is unrestricted. We then present an algorithm for MASP with $D = 2$ whose running time is polynomial if $k = O(1)$. On the other hand, we prove that MASP is NP-hard for any fixed $k \geq 3$ when $D$ is unrestricted, and also NP-hard for any fixed $D \geq 2$ when $k$ is unrestricted even if each input tree is required to contain at most three leaves. Finally, we describe a polynomial-time $(n/\!\log n)$-approximation algorithm for MASP.

46 citations


Book ChapterDOI
14 May 2005
TL;DR: Two new efficient algorithms for inferring a phylogenetic network from a set of gene trees of arbitrary degrees, named RGNet and RGNet+, are presented, and it is shown that these methods outperform the existing methods neighbor-joining, NeighborNet, and SpNet.
Abstract: Reticulation events occur frequently in many types of species. Therefore, to develop accurate methods for reconstructing phylogenetic networks in order to describe evolutionary history in the presence of reticulation events is important. Previous work has suggested that constructing phylogenetic networks by merging gene trees is a biologically meaningful approach. This paper presents two new efficient algorithms for inferring a phylogenetic network from a set $\mathcal{T}$ of gene trees of arbitrary degrees. The first algorithm solves the open problem of constructing a refining galled network for $\mathcal{T}$ (if one exists) with no restriction on the number of hybrid nodes; in fact, it outputs the smallest possible solution. In comparison, the previously best method (SpNet) can only construct networks having a single hybrid node. For cases where there exists no refining galled network for $\mathcal{T}$, our second algorithm identifies a minimum subset of the species set to be removed so that the resulting trees can be combined into a galled network. Based on our two algorithms, we propose two general methods named RGNet and RGNet+. Through simulations, we show that our methods outperform the other existing methods neighbor-joining, NeighborNet, and SpNet.

38 citations


Book ChapterDOI
24 May 2005
TL;DR: A positive answer is given to the question of whether, for every undirected graph, it is possible to assign the local orientations in such a way that the resulting perpetual traversal visits every node in O(n) moves.
Abstract: We consider the problem of perpetual traversal by a single agent in an anonymous undirected graph G. Our requirements are: (1) deterministic algorithm, (2) each node is visited within O(n) moves, (3) the agent uses no memory, it can use only the label of the link via which it arrived to the current node, (4) no marking of the underlying graph is allowed and (5) no additional information is stored in the graph (e.g. routing tables, spanning tree) except the ability to distinguish between the incident edges (called Local Orientation). This problem is unsolvable, as has been proven in [9,28] even for a much less restrictive setting. Our approach is to somewhat relax requirement (5). We fix the following traversal algorithm: “Start by taking the edge with the smallest label. Afterwards, whenever you come to a node, continue by taking the successor edge (in the local orientation) to the edge via which you arrived” and ask whether it is possible, for every undirected graph, to assign the local orientations in such a way that the resulting perpetual traversal visits every node in O(n) moves. We give a positive answer to this question, by showing how to construct such local orientations. This leads to an extremely simple, memoryless, yet efficient traversal algorithm.
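The quoted traversal rule is short enough to code directly. The sketch below runs it on a 4-cycle with a hand-picked local orientation; the paper's contribution is the general construction of such orientations, and this example only demonstrates the rule itself:

```python
def traverse(ports, start, moves):
    """Memoryless traversal: from the current node, leave via the successor
    (in the local orientation) of the port through which we arrived.
    `ports[v]` lists v's neighbours in local-orientation order."""
    visited = {start}
    prev, cur = start, ports[start][0]   # first move: smallest-labelled edge
    visited.add(cur)
    for _ in range(moves - 1):
        order = ports[cur]
        nxt = order[(order.index(prev) + 1) % len(order)]
        prev, cur = cur, nxt
        visited.add(cur)
    return visited

# a 4-cycle 0-1-2-3-0 with a local orientation under which the rule
# follows the cycle, visiting all n nodes in n moves
ports = {0: [1, 3], 1: [2, 0], 2: [3, 1], 3: [0, 2]}
seen = traverse(ports, start=0, moves=4)
```

Note that the agent carries no state between moves other than the port it arrived on, matching requirement (3).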

32 citations


01 Jan 2005
TL;DR: A novel statistical technique for detecting promoter regions in long genomic sequences is presented, and a continuous naïve Bayes classifier is developed for the detection of human promoters and transcription start sites in genomic sequences.
Abstract: Summary Objective: The gene promoter region controls transcriptional initiation of a gene, which is the most important step in gene regulation. In-silico detection of promoter regions in genomic sequences has a number of applications in gene discovery and understanding gene expression regulation. However, computational prediction of eukaryotic pol-II promoters has remained a difficult task. This paper introduces a novel statistical technique for detecting promoter regions in long genomic sequences. Method: A number of existing techniques analyze the occurrence frequencies of oligonucleotides in promoter sequences as compared to other genomic regions. In contrast, the present work studies the positional densities of oligonucleotides in promoter sequences. The analysis does not require any non-promoter sequence dataset or any model of the background oligonucleotide content of the genome. The statistical model learnt from a dataset of promoter sequences automatically recognizes a number of transcription factor binding sites simultaneously with their occurrence positions relative to the transcription start site. Based on this model, a

26 citations


Journal ArticleDOI
TL;DR: In this paper, a continuous naive Bayes classifier is developed for the detection of human promoters and transcription start sites in genomic sequences, which can automatically recognize a number of transcription factor binding sites simultaneously with their occurrence positions relative to the transcription start site.

26 citations


Book ChapterDOI
19 Dec 2005
TL;DR: This paper presents a space efficient data structure to solve the 1-mismatch and 1-difference problems, which can be generalized to solve the k-mismatch problem in O(|A|^k m^k (k + log log n) + occ) and O(log^ε n (|A|^k m^k (k + log log n) + occ)) query time using an O(n √(log n))-bit and an O(n)-bit indexing data structure, respectively.
Abstract: Approximate string matching is about finding a given string pattern in a text by allowing some degree of errors. In this paper we present a space efficient data structure to solve the 1-mismatch and 1-difference problems. Given a text T of length n over a fixed alphabet A, we can preprocess T and give an $O(n\sqrt{{\rm log} n})$-bit space data structure so that, for any query pattern P of length m, we can find all 1-mismatch (or 1-difference) occurrences of P in O(m log log n + occ) time, where occ is the number of occurrences. This is the fastest known query time given that the space of the data structure is o(n log^2 n) bits. The space of our data structure can be further reduced to O(n) if we can afford a slow down factor of log^ε n, for 0 < ε ≤ 1. Furthermore, our solution can be generalized to solve the k-mismatch (and the k-difference) problem in O(|A|^k m^k (k + log log n) + occ) and O(log^ε n (|A|^k m^k (k + log log n) + occ)) query time using an $O(n\sqrt{{\rm log} n})$-bit and an O(n)-bit indexing data structure, respectively.
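As a point of reference for what the index accelerates, the naive 1-mismatch scan takes O(nm) time; a minimal brute-force version (not the paper's data structure) is:

```python
def one_mismatch_occurrences(text, pattern):
    """Return all positions i where text[i:i+m] matches pattern with
    Hamming distance at most 1 (the 1-mismatch problem, solved naively)."""
    m = len(pattern)
    occ = []
    for i in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[i:i + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > 1:
                    break
        if mismatches <= 1:
            occ.append(i)
    return occ

positions = one_mismatch_occurrences("acgtacgaacgt", "acga")
```

The indexed solution answers the same query in O(m log log n + occ) time after preprocessing, which is the gap the paper's data structure closes.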

Proceedings ArticleDOI
01 Jan 2005
TL;DR: This paper identifies some limitations in the gap penalty model employed by SVM-Pairwise that prevented the algorithm from realizing its full potential, and studies several ways to overcome them.
Abstract: SVM-Pairwise was a major breakthrough in remote homology detection techniques, significantly outperforming previous approaches. This approach has been extensively evaluated and cited by later works, and is frequently taken as a benchmark. No known work however, has examined the gap penalty model employed by SVM-Pairwise. In this paper, we study in depth the relevance and effectiveness of SVM-Pairwise’s gap penalty model with respect to the homology detection task. We have identified some limitations in this model that prevented the SVM-Pairwise algorithm from realizing its full potential and also studied several ways to overcome them. We discovered a more appropriate gap penalty model that significantly improves the performance of SVM-Pairwise.

Proceedings ArticleDOI
01 Oct 2005
TL;DR: The experimental results clearly show that the proposed fusion schemes give a significant improvement in term of the mean of F1 as well as the number of the detected concepts.
Abstract: In this paper, two discriminative fusion schemes are proposed for automatic image annotation. One is the ensemble-pattern association based fusion and the other is the model-based transformation. The fusion approaches are studied and evaluated in a unified framework for AIA based on the text representation of the image content and the MC MFoM learning. The schemes are flexible for fusing diverse visual features and multiple modalities. The discriminative learning can automatically weight the most important features for the classification. We evaluate the fusion schemes based on the Corel and TRECVID 2003 datasets. The experimental results clearly show that the proposed fusion schemes give a significant improvement in terms of the mean of F1 as well as the number of detected concepts.

Proceedings ArticleDOI
23 Jan 2005
TL;DR: In this article, the problem of determining whether a given set T of rooted triplets can be merged without conflicts into a galled phylogenetic network, and if so, constructing such a network was studied.
Abstract: This paper considers the problem of determining whether a given set T of rooted triplets can be merged without conflicts into a galled phylogenetic network, and if so, constructing such a network. When the input T is dense, we solve the problem in O(|T|) time, which is optimal since the size of the input is Θ(|T|). In comparison, the previously fastest algorithm for this problem runs in O(|T|^2) time. Next, we prove that the problem becomes NP-hard if extended to non-dense inputs, even for the special case of simple phylogenetic networks. We also show that for every positive integer n, there exists some set T of rooted triplets on n leaves such that any galled network can be consistent with at most 0.4883·|T| of the rooted triplets in T. On the other hand, we provide a polynomial-time approximation algorithm that always outputs a galled network consistent with at least a factor of 5/12 (>0.4166) of the rooted triplets in T.
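A basic building block in this area is testing whether a rooted triplet xy|z is consistent with a rooted tree: it is iff the lowest common ancestor (LCA) of x and y lies strictly below the LCA of x and z. A minimal sketch for trees follows (galled networks need considerably more machinery; the tree encoding below is invented for the example):

```python
def lca(parent, u, v):
    """Lowest common ancestor in a tree given as a child -> parent map."""
    ancestors = {u}
    while u in parent:
        u = parent[u]
        ancestors.add(u)
    while v not in ancestors:
        v = parent[v]
    return v

def depth(parent, v):
    """Number of edges from v up to the root."""
    d = 0
    while v in parent:
        v = parent[v]
        d += 1
    return d

def consistent(parent, triplet):
    """True iff rooted triplet xy|z is consistent with the tree, i.e.
    lca(x, y) is a proper descendant of lca(x, z) (= lca(y, z))."""
    (x, y), z = triplet
    return depth(parent, lca(parent, x, y)) > depth(parent, lca(parent, x, z))

# tree ((a,b),c): root r with children u and c; u has children a and b
parent = {"a": "u", "b": "u", "u": "r", "c": "r"}
ok = consistent(parent, (("a", "b"), "c"))    # ab|c agrees with ((a,b),c)
bad = consistent(parent, (("a", "c"), "b"))   # ac|b conflicts with it
```

A set of triplets is consistent with a tree when every triplet passes this test; the paper's dense-input algorithm decides the analogous question for galled networks in optimal time.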

Journal ArticleDOI
TL;DR: TSSA is based on the A* search algorithm and TSSD is a heuristic algorithm; TSSA can find optimal solutions for medium-sized problems in reasonable time, while TSSD can handle very large problems and report approximate solutions very close to the optimal ones.
Abstract: Single nucleotide polymorphisms (SNPs), due to their abundance and low mutation rate, are very useful genetic markers for genetic association studies. However, the current genotyping technology cannot afford to genotype all common SNPs in all the genes. By making use of linkage disequilibrium, we can reduce the experiment cost by genotyping a subset of SNPs, called Tag SNPs, which have a strong association with the ungenotyped SNPs, while being as independent from each other as possible. The problem of selecting Tag SNPs is NP-complete; when there are a large number of SNPs, in order to avoid extremely long computational time, most of the existing Tag SNP selection methods first partition the SNPs into blocks based on certain block definitions, and Tag SNPs are then selected in each block by brute-force search. The size of the Tag SNP set obtained in this way may usually be reduced further due to the inter-dependency among blocks. This paper proposes two algorithms, TSSA and TSSD, to tackle the block-independent Tag SNP selection problem. TSSA is based on the A* search algorithm, and TSSD is a heuristic algorithm. Experiments show that TSSA can find the optimal solutions for medium-sized problems in reasonable time, while TSSD can handle very large problems and report approximate solutions very close to the optimal ones.
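A common greedy baseline for the same task, choosing tag SNPs so that every SNP is in strong linkage disequilibrium (r² above a threshold) with some tag, can be sketched as follows; this generic greedy is only an illustration, not the TSSA or TSSD algorithms of the paper:

```python
def greedy_tag_snps(r2, threshold=0.8):
    """Greedy tag SNP selection: repeatedly pick the SNP covering the most
    still-uncovered SNPs, where SNP i covers SNP j if r2[i][j] >= threshold.
    `r2` is a symmetric matrix of pairwise linkage-disequilibrium r^2 values."""
    n = len(r2)
    uncovered = set(range(n))
    tags = []
    while uncovered:
        best = max(range(n),
                   key=lambda i: sum(1 for j in uncovered if r2[i][j] >= threshold))
        tags.append(best)
        uncovered -= {j for j in uncovered if r2[best][j] >= threshold}
    return tags

# toy 4-SNP example: SNPs 0 and 1 are in strong LD; so are SNPs 2 and 3
r2 = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.95],
    [0.0, 0.1, 0.95, 1.0],
]
tags = greedy_tag_snps(r2)
```

Greedy set cover of this kind gives no optimality guarantee, which is precisely why an exact search such as A* (TSSA) is of interest for medium-sized instances.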

Patent
17 Aug 2005
TL;DR: In this article, a transcript mapping method based on Gene Identification Signature (GIS) analysis is described, and a compressed suffix array (CSA) is used for indexing the genome sequence to improve mapping speed and reduce computational memory requirements.
Abstract: A transcript mapping method according to an embodiment of the invention is described hereinafter and combines short tag based (SAGE and MPSS) efficiency with the accuracy of full-length cDNA (flcDNA) for comprehensive characterization of transcriptomes. This method is also referred to as Gene Identification Signature (GIS) analysis. In this method, the 5' and 3' ends of full-length cDNA clones are initially extracted into a ditag structure, with concatemers of the ditag being subsequently sequenced in an efficient manner, and finally mapped to the genome for defining the gene structure. As a GIS ditag represents the 5' and 3' ends of a transcript, it is more informative than SAGE and MPSS tags. Segment lengths between 5' and 3' tag pairs are obtainable, including orientation, ordering and chromosome family, for efficient transcript mapping and gene location identification. Furthermore, a compressed suffix array (CSA) is used for indexing the genome sequence to improve mapping speed and to reduce computational memory requirements.

Journal ArticleDOI
TL;DR: A hybrid approach integrating MUMmer or MaxMinCluster with the mutated subsequence algorithm (MSS) is shown to have better performance and reliability.
Abstract: Motivation: For the purpose of locating conserved genes on a whole genome scale, this paper proposes a new structural optimization problem called the Mutated Subsequence Problem, which gives consideration to possible mutations between two species (in the form of reversals and transpositions) when comparing the genomes. Results: A practical algorithm called the mutated subsequence algorithm (MSS) is devised to solve this optimization problem, and it has been evaluated using different pairs of human and mouse chromosomes, and different pairs of virus genomes of Baculoviridae. MSS is found to be effective and efficient; in particular, MSS can reveal >90% of the conserved genes of human and mouse that have been reported in the literature. When compared with the existing software tools MUMmer and MaxMinCluster, MSS uncovers 14 and 7% more genes on average, respectively. Furthermore, this paper presents a hybrid approach to integrate MUMmer or MaxMinCluster with MSS, which has better performance and reliability. Availability: http://www.cs.hku.hk/~mss/ Contact: [email protected]

Patent
12 Aug 2005
TL;DR: In this paper, the authors proposed a method of detecting at least one target nucleic acid, if present in a human biological sample, by using an oligonucleotide probe.
Abstract: Provided is a method of designing oligonucleotide probe(s) for nucleic acid detection comprising the following steps in any order: (i) identifying and selecting region(s) of a target nucleic acid to be amplified, the region(s) having an efficiency of amplification (AE) higher than the average AE; and (ii) designing oligonucleotide probe(s) capable of hybridizing to the selected region(s). Also provided is a method of detecting at least one target nucleic acid comprising the steps of: (i) providing a biological sample; (ii) amplifying the nucleic acid(s) of the biological sample; (iii) providing at least an oligonucleotide probe capable of hybridizing to at least a target nucleic acid, if present in the biological sample; and (iv) contacting the probe(s) with the amplified nucleic acids and detecting the probe(s) hybridized to the target nucleic acid(s). In particular, the method indicates the presence of at least a pathogen, for example a virus, in a human biological sample. The probes may be placed on a support, for example a microarray or a biochip.

Journal ArticleDOI
TL;DR: For moderate error rates, only a small fraction of the target sequence is involved in error recovery; thus the remainder of the sequence is expected to be reconstructible by the standard noiseless algorithm, with the provision to switch to operation with increasingly higher thresholds after detecting failure.
Abstract: We consider the problem of sequence reconstruction in sequencing-by-hybridization in the presence of spectrum errors. As suggested by intuition, and reported in the literature, false-negatives (i.e., missing spectrum probes) are by far the leading cause of reconstruction failures. In a recent paper we have described an algorithm, called "threshold-θ", designed to recover from false negatives. This algorithm is based on overcompensating for missing extensions by allowing larger reconstruction subtrees. We demonstrated, both analytically and with simulations, the increasing effectiveness of the approach as the parameter θ grows, but also pointed out that for larger error rates the size of the extension trees translates into an unacceptable computational burden. To obviate this shortcoming, in this paper we propose an adaptive approach which is both effective and efficient. Effective, because for a fixed value of θ it performs as well as its single-threshold counterpart, efficient because it exhibits substantial speed-ups over it. The idea is that, for moderate error rates a small fraction of the target sequence can be involved in error recovery; thus, expectedly the remainder of the sequence is reconstructible by the standard noiseless algorithm, with the provision to switch to operation with increasingly higher thresholds after detecting failure. This policy generates interesting and complex interplays between fooling probes and false negatives. These phenomena are carefully analyzed for random sequences and the results are found to be in excellent agreement with the simulations. In addition, the experimental algorithmic speed-ups of the multithreshold approach are explained in terms of the interaction amongst the different threshold regimes.

Proceedings ArticleDOI
19 Oct 2005
TL;DR: The use of multimodality as a criterion for choosing genes in feature selection is examined, and a novel measure of pairwise dissimilarity is proposed to cluster the genes that have survived the preprocessing step.
Abstract: Gene expression data are often analysed in an unsupervised manner by clustering the samples without reference to any annotations about them. Before clustering, the data are often subjected to a feature selection preprocessing step, in which a subset of genes is chosen for further analysis. We examine the use of multimodality as a criterion for choosing genes in feature selection, and also propose a novel measure of pairwise dissimilarity to cluster the genes that have survived the preprocessing step. The resulting multiple gene subsets usually contain those that are more strongly correlated with the sample annotations of interest than those obtained through variance-based feature selection. Class discovery may be facilitated when gene expression data are analysed using the proposed method.

Proceedings ArticleDOI
19 Oct 2005
TL;DR: A multi-step constrained optimization based position weight matrix (PWM) motif finding methodology called ConstrainedMotif models the cell-cycle regulated gene expression as a linear function of the motif features while the weights of them are constrained to be periodic across the time-course.
Abstract: Cell-cycle associated promoter motif prediction is very important for understanding cell-cycle control and processes. Modeling genome-wide gene expression as a function of promoter sequence motif features has drawn great attention recently. The proposed techniques using this approach are not specific to cell-cycle associated motif discovery; hence they find aperiodic motif weights across the time-course and have lower sensitivity. Motifs are scored based on successive model error reduction steps, which may not reveal all relevant motifs since they are alternatives for the model. Another drawback is that these methods output a list of sequences which may either contain several instances of a dominating motif box (a set of alternative sequence motifs) such as MCB or only a few instances of an important box. To address the above problems, we propose a multi-step constrained optimization based position weight matrix (PWM) motif finding methodology called ConstrainedMotif. It models the cell-cycle regulated gene expression as a linear function of the motif features, while their weights are constrained to be periodic across the time-course. The score of a motif is the error reduction in the prediction by that motif alone. The multi-step modeling starts with a set of sequences and outputs a ranked list of cell-cycle associated PWM motifs. We evaluate this methodology using the S. cerevisiae cell-cycle data published by Spellman et al. The results show that ConstrainedMotif is more sensitive and most of the instances of the boxes are represented by the respective matching PWMs.

Proceedings ArticleDOI
01 Jan 2005
TL;DR: In this paper, the authors consider the problem of constructing a phylogenetic tree/network which is consistent with all of the rooted triplets in a given set and none of the roots in another given set.
Abstract: To construct a phylogenetic tree or phylogenetic network for describing the evolutionary history of a set of species is a well-studied problem in computational biology. One previously proposed method to infer a phylogenetic tree/network for a large set of species is by merging a collection of known smaller phylogenetic trees on overlapping sets of species so that no (or as little as possible) branching information is lost. However, little work has been done so far on inferring a phylogenetic tree/network from a specified set of trees when, in addition, certain evolutionary relationships among the species are known to be highly unlikely. In this paper, we consider the problem of constructing a phylogenetic tree/network which is consistent with all of the rooted triplets in a given set and none of the rooted triplets in another given set. Although NP-hard in the general case, we provide some efficient exact and approximation algorithms for a number of biologically meaningful variants of the problem.

Book ChapterDOI
19 Dec 2005
TL;DR: An O(min{kn log n, n log n+hn})-time algorithm is given to compute this tripartition-based distance, where h is the number of hybrid nodes in N and N′ while k is the maximum number of Hybrid nodes among all biconnected components in N & N′.
Abstract: Consider two phylogenetic networks N and N′ of size n. The tripartition-based distance finds the proportion of tripartitions which are not shared by N and N′. This distance was proposed by Moret et al. (2004) and is a generalization of the Robinson-Foulds distance, which is originally used to compare two phylogenetic trees. This paper gives an O(min{kn log n, n log n+hn})-time algorithm to compute this distance, where h is the number of hybrid nodes in N and N′ while k is the maximum number of hybrid nodes among all biconnected components in N and N′. Note that k << h << n in a phylogenetic network. In addition, we propose algorithms for comparing galled-trees, which are an important, biologically meaningful special case of phylogenetic networks. We give an O(n)-time algorithm for comparing two galled-trees. We also give an O(n + kh)-time algorithm for comparing a galled-tree with another general network, where h and k are the number of hybrid nodes in the latter network and its biggest biconnected component respectively.
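Since the tripartition-based distance generalizes the Robinson-Foulds distance on trees, the tree case is a useful reference point. A minimal Robinson-Foulds sketch over rooted trees, comparing clusters (the leaf sets below internal nodes), is given below; the nested-tuple tree encoding is invented for the example, and this is not the paper's algorithm for networks:

```python
def clusters(tree):
    """Return the set of non-trivial clusters (frozensets of leaf labels)
    of a rooted tree given in nested-tuple form, e.g. (("a","b"),"c")."""
    out = set()
    def leaves(t):
        if isinstance(t, tuple):
            s = frozenset().union(*(leaves(c) for c in t))
            out.add(s)
            return s
        return frozenset([t])
    full = leaves(tree)
    out.discard(full)          # the root cluster is shared by all trees
    return out

def rf_distance(t1, t2):
    """Robinson-Foulds distance: number of clusters not shared by both trees."""
    c1, c2 = clusters(t1), clusters(t2)
    return len(c1 ^ c2)

d = rf_distance((("a", "b"), ("c", "d")), ((("a", "c"), "b"), "d"))
```

The tripartition distance replaces clusters with tripartitions so that hybrid nodes are accounted for; the paper's contribution is computing it in near-linear time.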

Journal ArticleDOI
TL;DR: Comparison of FAMCS with other methods on various proteins shows that FAMCS can address all four requirements and lead to interesting biological discoveries.

Proceedings ArticleDOI
01 Jan 2005
TL;DR: Based on the experiments on 35 pairs of virus genomes using three software tools, it is shown that using anchors with mismatches does increase the effectiveness of locating conserved regions.
Abstract: Based on experiments on 35 pairs of virus genomes using three software tools (MUMmer-3, MaxMinCluster, MSS), we show that using anchors with mismatches does increase the effectiveness of locating conserved regions (about 10% more conserved gene regions are located, while maintaining a high sensitivity). Generating a more comprehensive set of anchors with mismatches is not trivial for long sequences due to time and memory limitations. We propose two practical algorithms for generating this anchor set. One aims at speeding up the process, the other aims at saving memory. Experimental results show that both algorithms are faster (6 times and 5 times, respectively) than a straightforward suffix tree based approach.

Patent
07 Sep 2005
TL;DR: In this paper, the authors proposed a GIS (Gene Identification Signature) analysis combining short tag based SAGE and MPSS efficiency with the accuracy of full-length cDNA.
Abstract: PROBLEM TO BE SOLVED: To provide an improved transcript mapping method, since conventional transcript mapping methods produce poor and incorrect results, and the information given as a transcription structure is incomplete and ambiguous. SOLUTION: The transcript mapping method is a GIS (Gene Identification Signature) analysis combining short tag based SAGE and MPSS efficiency with the accuracy of full-length cDNA. The method comprises obtaining the 5' and 3' end tags from a transcription product of the gene to form a GIS tag, matching the 5' end tag with at least a part of a genome sequence, thereafter matching the 3' end tag, identifying at least one resulting segment, identifying at least one candidate gene position, and constraining each candidate gene position so as not to exceed a previously defined gene length. A compressed suffix array (CSA) is used for indexing the genome sequence. COPYRIGHT: (C)2006,JPO&NCIPI