Author

Thomas G. Marr

Bio: Thomas G. Marr is an academic researcher from Cold Spring Harbor Laboratory. The author has contributed to research in topics: String metric & Schizosaccharomyces pombe. The author has an h-index of 16 and has co-authored 21 publications receiving 1545 citations.

Papers
Proceedings ArticleDOI
24 May 1994
TL;DR: This paper presents an example of combinatorial pattern discovery: the discovery of patterns in protein databases, which give information that is complementary to the best protein classifier available today.
Abstract: Suppose you are given a set of natural entities (e.g., proteins, organisms, weather patterns, etc.) that possess some important common externally observable properties. You also have a structural description of the entities (e.g., sequence, topological, or geometrical data) and a distance metric. Combinatorial pattern discovery is the activity of finding patterns in the structural data that might explain these common properties based on the metric. This paper presents an example of combinatorial pattern discovery: the discovery of patterns in protein databases. The structural representations we consider are strings and the distance metric is string edit distance permitting variable length don't cares. Our techniques incorporate string matching algorithms and novel heuristics for discovery and optimization, most of which generalize to other combinatorial structures. Experimental results of applying the techniques to both generated data and functionally related protein families obtained from the Cold Spring Harbor Laboratory show the effectiveness of the proposed techniques. When we apply the discovered patterns to perform protein classification, they give information that is complementary to the best protein classifier available today.

193 citations

Journal ArticleDOI
TL;DR: The results show that there may exist weak pairwise correlations within the signals and that the proposed weight array method can help to better discriminate these signals.
Abstract: A new method of sequence analysis, using a weight array method (WAM), which generalizes the traditional Staden weight matrix method (WMM), is proposed. With the help of a statistical mechanical model, the discriminant function is identified with the energy function describing macromolecular interactions. The method is applied to the study of 5'-splice signals in Schizosaccharomyces pombe pre-mRNA sequences. The results show that there may exist weak pairwise correlations within the signals and that our method can help to better discriminate these signals. Experiments are proposed to test the predictions of the theory.

191 citations

Journal ArticleDOI
TL;DR: Assessment of 65 pedigrees ascertained through a Bipolar I proband for evidence of linkage, using nonparametric methods in a genome-wide scan and for possible parent of origin effect using several analytical methods identified 15 loci with nominally significant evidence for increased allele sharing among affected relative pairs.
Abstract: The purpose of this study was to assess 65 pedigrees ascertained through a Bipolar I (BPI) proband for evidence of linkage, using nonparametric methods in a genome-wide scan and for possible parent of origin effect using several analytical methods. We identified 15 loci with nominally significant evidence for increased allele sharing among affected relative pairs. Eight of these regions, at 8q24, 18q22, 4q32, 13q12, 4q35, 10q26, 2p12, and 12q24, directly overlap with previously reported evidence of linkage to bipolar disorder. Five regions at 20p13, 2p22, 14q23, 9p13, and 1q41 are within several Mb of previously reported regions. We report our findings in rank order and the top five markers had an NPL>2.5. The peak findings in these regions were D8S256 at 8q24, NPL 3.13; D18S878 at 18q22, NPL 2.90; D4S1629 at 4q32, NPL 2.80; D2S99 at 2p12, NPL 2.54; and D13S1493 at 13q12, NPL 2.53. No locus produced statistically significant evidence for linkage at the genome-wide level. The parent of origin effect was studied and, consistent with our previous findings, evidence for a locus on 18q22 was predominantly from families wherein the father or paternal lineage was affected. There was evidence consistent with paternal imprinting at the loci on 13q12 and 1q41.

148 citations

Journal ArticleDOI
09 Apr 1993-Cell
TL;DR: This work presents the application of a nonrandom sequence-tagged site (STS) content detection method in mapping an entire genome, that of fission yeast, and developed powerful techniques, based on consistency analysis, for error detection and contig assembly.

146 citations

Journal ArticleDOI
TL;DR: It is concluded that a mixture of many length scales (including some relatively long ones) in DNA sequences is responsible for the observed 1/f-like spectral component.

136 citations


Cited by
Proceedings ArticleDOI
06 Mar 1995
TL;DR: Three algorithms are presented to solve the problem of mining sequential patterns over databases of customer transactions, and empirically evaluating their performance using synthetic data shows that two of them have comparable performance.
Abstract: We are given a large database of customer transactions, where each transaction consists of customer-id, transaction time, and the items bought in the transaction. We introduce the problem of mining sequential patterns over such databases. We present three algorithms to solve this problem, and empirically evaluate their performance using synthetic data. Two of the proposed algorithms, AprioriSome and AprioriAll, have comparable performance, albeit AprioriSome performs a little better when the minimum number of customers that must support a sequential pattern is low. Scale-up experiments show that both AprioriSome and AprioriAll scale linearly with the number of customer transactions. They also have excellent scale-up properties with respect to the number of transactions per customer and the number of items in a transaction.

5,663 citations

Journal ArticleDOI
TL;DR: A general probabilistic model of the gene structure of human genomic sequences which incorporates descriptions of the basic transcriptional, translational and splicing signals, as well as length distributions and compositional features of exons, introns and intergenic regions is introduced.

3,709 citations

Book ChapterDOI
25 Mar 1996
TL;DR: This work adds time constraints that specify a minimum and/or maximum time period between adjacent elements in a pattern, and relax the restriction that the items in an element of a sequential pattern must come from the same transaction.
Abstract: The problem of mining sequential patterns was recently introduced in [3]. We are given a database of sequences, where each sequence is a list of transactions ordered by transaction-time, and each transaction is a set of items. The problem is to discover all sequential patterns with a user-specified minimum support, where the support of a pattern is the number of data-sequences that contain the pattern. An example of a sequential pattern is “5% of customers bought ‘Foundation’ and ‘Ringworld’ in one transaction, followed by ‘Second Foundation’ in a later transaction”. We generalize the problem as follows. First, we add time constraints that specify a minimum and/or maximum time period between adjacent elements in a pattern. Second, we relax the restriction that the items in an element of a sequential pattern must come from the same transaction, instead allowing the items to be present in a set of transactions whose transaction-times are within a user-specified time window. Third, given a user-defined taxonomy (is-a hierarchy) on items, we allow sequential patterns to include items across all levels of the taxonomy.

2,973 citations

Journal ArticleDOI
TL;DR: This work surveys the current techniques to cope with the problem of string matching that allows errors, and focuses on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms.
Abstract: We survey the current techniques to cope with the problem of string matching that allows errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices. We conclude with some directions for future work and open problems.

2,723 citations

Journal ArticleDOI
15 Aug 1997-Science
TL;DR: In this paper, the homologous genes from the fission yeast Schizosaccharomyces pombe and human are identified and the proposed telomerase catalytic subunits represent a deep branch in the evolution of reverse transcriptases.
Abstract: Catalytic protein subunits of telomerase from the ciliate Euplotes aediculatus and the yeast Saccharomyces cerevisiae contain reverse transcriptase motifs. Here the homologous genes from the fission yeast Schizosaccharomyces pombe and human are identified. Disruption of the S. pombe gene resulted in telomere shortening and senescence, and expression of mRNA from the human gene correlated with telomerase activity in cell lines. Sequence comparisons placed the telomerase proteins in the reverse transcriptase family but revealed hallmarks that distinguish them from retroviral and retrotransposon relatives. Thus, the proposed telomerase catalytic subunits are phylogenetically conserved and represent a deep branch in the evolution of reverse transcriptases.

2,181 citations