Proceedings ArticleDOI

On the resemblance and containment of documents

11 Jun 1997 - Sequences (IEEE) - pp 21-29
TL;DR: The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document.
Abstract: Given two documents A and B we define two mathematical notions: their resemblance r(A, B) and their containment c(A, B) that seem to capture well the informal notions of "roughly the same" and "roughly contained." The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document. Furthermore, the resemblance can be evaluated using a fixed size sample for each document. This paper discusses the mathematical properties of these measures and the efficient implementation of the sampling process using Rabin (1981) fingerprints.
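
The reduction is concrete enough to sketch in code. The following minimal Python illustration (helper names are ours; the paper uses Rabin fingerprints rather than SHA-1, so this is a sketch of the idea, not the paper's implementation) shows shingling, the two measures, and a fixed-size min-sample:

import hashlib

def shingles(text, w=4):
    """The set of w-shingles (windows of w consecutive tokens) of a document."""
    tokens = text.split()
    return {" ".join(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(a, b):
    """r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|."""
    return len(a & b) / len(a | b)

def containment(a, b):
    """c(A, B) = |S(A) ∩ S(B)| / |S(A)| -- how much of A appears in B."""
    return len(a & b) / len(a)

def min_sample(s, k=100):
    """Fixed-size sketch: the k smallest hash values of the shingle set,
    computable independently for each document."""
    h = lambda x: int.from_bytes(hashlib.sha1(x.encode()).digest()[:8], "big")
    return set(sorted(map(h, s))[:k])

Given two such sketches, r(A, B) can be estimated as the fraction of the k smallest values of min_sample(A) ∪ min_sample(B) that occur in both sketches.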


Citations
Journal ArticleDOI
TL;DR: Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences, is presented, demonstrating that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences or Oxford Nanopore technologies.
Abstract: Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences (PacBio) or Oxford Nanopore technologies and achieves a contig NG50 of >21 Mbp on both human and Drosophila melanogaster PacBio data sets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.

4,806 citations


Cites background from "On the resemblance and containment ..."

  • ...In contrast to the first-stage filter, which uses multiple hash functions (Broder et al. 2000), bottom sketching uses a single hash function from which the s minimum values are retained as the sketch (Broder 1997)....

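The contrast drawn in this excerpt is easy to make concrete. A hedged Python sketch (the hash family and function names are illustrative, not Canu's or the paper's code):

import hashlib

def h(x, seed=0):
    """One member of a seeded family of hash functions."""
    return int.from_bytes(hashlib.sha1(f"{seed}:{x}".encode()).digest()[:8], "big")

def multi_hash_sketch(items, k):
    """k independent hash functions, one minimum retained per function."""
    return [min(h(x, seed) for x in items) for seed in range(k)]

def bottom_sketch(items, s):
    """A single hash function; the s smallest values form the sketch (Broder 1997)."""
    return set(sorted(h(x) for x in items)[:s])

def jaccard_from_bottom(a, b, s):
    """Estimate |A ∩ B| / |A ∪ B| from two bottom-s sketches."""
    union_s = set(sorted(a | b)[:s])  # bottom-s of the sketch union
    return len(union_s & a & b) / len(union_s)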

Proceedings ArticleDOI
23 May 1998
TL;DR: In this paper, the authors present two algorithms for the approximate nearest neighbor problem in high-dimensional spaces, for data sets of size n living in R^d, which require space that is only polynomial in n and d.
Abstract: We present two algorithms for the approximate nearest neighbor problem in high-dimensional spaces. For data sets of size n living in R^d, the algorithms require space that is only polynomial in n and d, while achieving query times that are sub-linear in n and polynomial in d. We also show applications to other high-dimensional geometric problems, such as the approximate minimum spanning tree. The article is based on the material from the authors' STOC'98 and FOCS'01 papers. It unifies, generalizes and simplifies the results from those papers.
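
One concrete instance from the STOC'98 side is bit sampling: hash a point in the Hamming cube by projecting it onto a random subset of coordinates, and keep several such tables so that near neighbors are likely to share a bucket with the query. The sketch below is a simplified rendering with illustrative names, not the paper's full parameterization:

import random

def make_table(points, num_bits, dim, seed):
    """One hash table keyed by a projection onto num_bits random coordinates."""
    rng = random.Random(seed)
    coords = rng.sample(range(dim), num_bits)
    table = {}
    for p in points:
        key = tuple(p[c] for c in coords)
        table.setdefault(key, []).append(p)
    return coords, table

def query(q, tables):
    """Probe only the buckets q falls into, then return the closest
    candidate by Hamming distance (an approximate nearest neighbor)."""
    candidates = []
    for coords, table in tables:
        candidates.extend(table.get(tuple(q[c] for c in coords), ()))
    if not candidates:
        return None
    return min(candidates, key=lambda p: sum(x != y for x, y in zip(p, q)))

# Usage: build, say, 10 tables with independent coordinate sets
# (seeds 0..9), then call query(q, tables).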

4,478 citations

Journal ArticleDOI
TL;DR: The goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs, and to provide pointers into the literature for those who are less familiar with the field.
Abstract: Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term-document, word-context, and pair-pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.
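
For the term-document class, the smallest possible illustration: each document is a column of a term-document matrix, and similarity is the cosine of the angle between columns (a toy sketch, not one of the survey's open source projects):

import math
from collections import Counter

def term_vector(doc):
    """A document as a bag-of-words column of the term-document matrix."""
    return Counter(doc.lower().split())

def cosine(u, v):
    """Cosine of the angle between two term vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0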

2,843 citations


Cites background or methods from "On the resemblance and containment ..."

  • ...dex vectors of each non-unique coordinate of r. Random Indexing was shown to perform as well as LSA on a word synonym selection task (Karlgren & Sahlgren, 2001). Locality sensitive hashing (LSH) (Broder, 1997) is another technique that approximates the similarity matrix with complexity O(n_r^2 δ^2), where δ is a constant number of random projections, which controls the accuracy versus efficiency tradeoff. LSH...


  • ...s, such that two similar vectors are likely to have similar fingerprints. Definitions of LSH functions include the Min-wise independent function, which preserves the Jaccard similarity between vectors (Broder, 1997), and functions that preserve the cosine similarity between vectors (Charikar, 2002). On a word similarity task, Ravichandran et al. (2005) showed that, on average, over 80% of the top-10 similar word...


Proceedings ArticleDOI
19 May 2002
TL;DR: It is shown that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects.
Abstract: A locality sensitive hashing scheme is a distribution on a family F of hash functions operating on a collection of objects, such that for two objects x, y, Pr_{h∈F}[h(x) = h(y)] = sim(x, y), where sim(x, y) ∈ [0, 1] is some similarity function defined on the collection of objects. Such a scheme leads to a compact representation of objects so that similarity of objects can be estimated from their compact sketches, and also leads to efficient algorithms for approximate nearest neighbor search and clustering. Min-wise independent permutations provide an elegant construction of such a locality sensitive hashing scheme for a collection of subsets with the set similarity measure sim(A, B) = |A ∩ B| / |A ∪ B|. We show that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects. Based on this insight, we construct new locality sensitive hashing schemes for: (1) a collection of vectors with the distance between u and v measured by θ(u, v)/π, where θ(u, v) is the angle between u and v; this yields a sketching scheme for estimating the cosine similarity measure between two vectors, as well as a simple alternative to min-wise independent permutations for estimating set similarity; (2) a collection of distributions on n points in a metric space, with distance between distributions measured by the Earth Mover Distance (EMD), a popular distance measure in graphics and vision. Our hash functions map distributions to points in the metric space such that, for distributions P and Q, EMD(P, Q) ≤ E_{h∈F}[d(h(P), h(Q))] ≤ O(log n log log n) · EMD(P, Q).
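
The first construction is the random hyperplane sketch. A minimal Python rendering (names are ours; the collision probability Pr[h_r(u) = h_r(v)] = 1 − θ(u, v)/π is the property stated in the paper):

import math
import random

def hyperplane_hash(dim, seed):
    """h_r(u) = sign(r · u) for a random Gaussian vector r."""
    rng = random.Random(seed)
    r = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return lambda u: sum(ri * ui for ri, ui in zip(r, u)) >= 0.0

def sketch(u, num_bits=256):
    """A compact bit-sketch of u; Hamming distance between two sketches
    estimates the angle, hence the cosine similarity."""
    return [hyperplane_hash(len(u), seed)(u) for seed in range(num_bits)]

def estimated_angle(su, sv):
    """θ ≈ π × (fraction of disagreeing sketch bits)."""
    return math.pi * sum(a != b for a, b in zip(su, sv)) / len(su)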

2,477 citations


Cites background from "On the resemblance and containment ..."

  • ...This new setting has resulted in an increased interest in algorithms that process the input data in restricted ways, including sampling a few data points, making only a few passes over the data, and constructing a succinct sketch of the input which can then be efficiently processed....


Journal ArticleDOI
TL;DR: FastANI is developed, a method to compute ANI using alignment-free approximate sequence mapping, and it is shown that 95% ANI is an accurate threshold for demarcating prokaryotic species by analyzing about 90,000 prokaryotic genomes.
Abstract: A fundamental question in microbiology is whether there is continuum of genetic diversity among genomes, or clear species boundaries prevail instead. Whole-genome similarity metrics such as Average Nucleotide Identity (ANI) help address this question by facilitating high resolution taxonomic analysis of thousands of genomes from diverse phylogenetic lineages. To scale to available genomes and beyond, we present FastANI, a new method to estimate ANI using alignment-free approximate sequence mapping. FastANI is accurate for both finished and draft genomes, and is up to three orders of magnitude faster compared to alignment-based approaches. We leverage FastANI to compute pairwise ANI values among all prokaryotic genomes available in the NCBI database. Our results reveal clear genetic discontinuity, with 99.8% of the total 8 billion genome pairs analyzed conforming to >95% intra-species and <83% inter-species ANI values. This discontinuity is manifested with or without the most frequently sequenced species, and is robust to historic additions in the genome databases. Average Nucleotide Identity (ANI) is a robust and useful measure to gauge genetic relatedness between two genomes. Here, the authors develop FastANI, a method to compute ANI using alignment-free approximate sequence mapping, and show 95% ANI is an accurate threshold for demarcating prokaryotic species by analyzing about 90,000 prokaryotic genomes.
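
FastANI's mapping stage builds on MinHash-style k-mer sketching. For orientation, here is the Mash-family conversion from an estimated k-mer Jaccard index to an ANI value; this is the Mash formula, offered as background, and FastANI's own estimator differs in detail:

import math

def mash_ani(jaccard, k=16):
    """ANI from a k-mer Jaccard estimate j, via the Mash distance
    D = -(1/k) * ln(2j / (1 + j)) and ANI ≈ 100 * (1 - D)."""
    d = -math.log(2 * jaccard / (1 + jaccard)) / k
    return 100.0 * (1.0 - d)

# e.g. mash_ani(0.359, k=16) ≈ 96.0 (percent identity)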

2,176 citations

References
Book
01 Jan 1991
TL;DR: A particular set of problems - all dealing with “good” colorings of an underlying set of points relative to a given family of sets - is explored.
Abstract: The use of randomness is now an accepted tool in Theoretical Computer Science, but not everyone is aware of the underpinnings of this methodology in Combinatorics - particularly, in what is now called the Probabilistic Method as developed primarily by Paul Erdős over the past half century. Here I will explore a particular set of problems - all dealing with “good” colorings of an underlying set of points relative to a given family of sets. A central point will be the evolution of these problems from the purely existential proofs of Erdős to the algorithmic aspects of much interest to this audience.

6,594 citations


"On the resemblance and containment ..." refers background in this paper

  • ...in other words in this case we can ignore the effect of multiple collisions, that is, three or more distinct elements of S having the same image under f. Furthermore, it can be argued that the size of f(S) is fairly well concentrated. By Azuma’s inequality [1] we have...

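For reference, the standard statement of the inequality the excerpt invokes (our transcription, not a quotation from the paper):

% Azuma's inequality: for a martingale X_0, ..., X_n with bounded
% differences |X_i - X_{i-1}| <= c_i, and any t > 0,
\[
  \Pr\bigl(\lvert X_n - X_0\rvert \ge t\bigr)
    \;\le\; 2\exp\!\left(-\frac{t^2}{2\sum_{i=1}^{n} c_i^2}\right).
\]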

Journal ArticleDOI
01 Sep 1997
TL;DR: An efficient way to determine the syntactic similarity of files is developed and applied to every document on the World Wide Web, and a clustering of all the documents that are syntactically similar is built.
Abstract: We have developed an efficient way to determine the syntactic similarity of files and have applied it to every document on the World Wide Web. Using this mechanism, we built a clustering of all the documents that are syntactically similar. Possible applications include a "Lost and Found" service, filtering the results of Web searches, updating widely distributed web-pages, and identifying violations of intellectual property rights.

1,560 citations


"On the resemblance and containment ..." refers background in this paper

  • ...For very large collections of documents where r is such that external storage is needed, different approaches become necessary. For further details see [5]....


  • ...The remaining 1.5 million clusters contained 7 million documents (a mixture of exact duplicates and similar). For further details see [5]....


Journal ArticleDOI
01 Jun 2000
TL;DR: This research was motivated by the fact that such a family of permutations is essential to the algorithm used in practice by the AltaVista web index software to detect and filter near-duplicate documents.
Abstract: We define and study the notion of min-wise independent families of permutations. We say that F ⊆ S_n (the symmetric group) is min-wise independent if for any set X ⊆ [n] and any x ∈ X, when π is chosen at random in F we have Pr(min{π(X)} = π(x)) = 1/|X|. In other words, we require that all the elements of any fixed set X have an equal chance to become the minimum element of the image of X under π. Our research was motivated by the fact that such a family (under some relaxations) is essential to the algorithm used in practice by the AltaVista web index software to detect and filter near-duplicate documents. However, in the course of our investigation we have discovered interesting and challenging theoretical questions related to this concept; we present the solutions to some of them and we list the rest as open problems.
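
The defining property is easy to check empirically for the fully random family; the point of the paper is achieving it, at least approximately, with small and practically implementable families. A toy verification in Python:

import random

def minwise_probability(X, x, n, trials=100_000):
    """Estimate Pr(min π(X) = π(x)) for a uniformly random permutation π
    of {0, ..., n-1}; min-wise independence requires this to be 1/|X|."""
    perm = list(range(n))
    hits = 0
    for _ in range(trials):
        random.shuffle(perm)  # a truly random permutation
        if min(perm[i] for i in X) == perm[x]:
            hits += 1
    return hits / trials

# e.g. minwise_probability({2, 5, 7, 11}, 5, n=16) ≈ 0.25 = 1/|X|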

962 citations

Proceedings Article
Udi Manber
17 Jan 1994
TL;DR: Applications of sif can be found in file management, information collecting, program reuse, file synchronization, data compression, and maybe even plagiarism detection.
Abstract: We present a tool, called sif, for finding all similar files in a large file system. Files are considered similar if they have a significant number of common pieces, even if they are very different otherwise. For example, one file may be contained, possibly with some changes, in another file, or a file may be a reorganization of another file. The running time for finding all groups of similar files, even for as little as 25% similarity, is on the order of 500MB to 1GB an hour. The amount of similarity and several other customized parameters can be determined by the user at a post-processing stage, which is very fast. Sif can also be used to very quickly identify all similar files to a query file using a preprocessed index. Applications of sif can be found in file management, information collecting (to remove duplicates), program reuse, file synchronization, data compression, and maybe even plagiarism detection.

821 citations


"On the resemblance and containment ..." refers methods in this paper

  • ...Related sampling mechanisms for determining similarity were also developed by Manber [7] and within the Stanford SCAM project [2, 8, 9]....


Proceedings ArticleDOI
22 May 1995
TL;DR: This paper proposes a system for registering documents and then detecting copies, either complete copies or partial copies, and describes algorithms for such detection, and metrics required for evaluating detection mechanisms (covering accuracy, efficiency, and security).
Abstract: In a digital library system, documents are available in digital form and therefore are more easily copied and their copyrights are more easily violated. This is a very serious problem, as it discourages owners of valuable information from sharing it with authorized users. There are two main philosophies for addressing this problem: prevention and detection. The former actually makes unauthorized use of documents difficult or impossible while the latter makes it easier to discover such activity. In this paper we propose a system for registering documents and then detecting copies, either complete copies or partial copies. We describe algorithms for such detection, and metrics required for evaluating detection mechanisms (covering accuracy, efficiency, and security). We also describe a working prototype, called COPS, describe implementation issues, and present experimental results that suggest the proper settings for copy detection parameters.

660 citations


"On the resemblance and containment ..." refers methods in this paper

  • ...Related sampling mechanisms for determining similarity were also developed by Manber [7] and within the Stanford SCAM project [2, 8, 9]....
