Proceedings Article

On the resemblance and containment of documents

Andrei Z. Broder
11 Jun 1997, pp 21-29
Abstract
Given two documents A and B we define two mathematical notions: their resemblance r(A, B) and their containment c(A, B) that seem to capture well the informal notions of "roughly the same" and "roughly contained." The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document. Furthermore, the resemblance can be evaluated using a fixed size sample for each document. This paper discusses the mathematical properties of these measures and the efficient implementation of the sampling process using Rabin (1981) fingerprints.
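The reduction to set intersection described in the abstract can be illustrated with a minimal sketch. It assumes each document is represented by its set of contiguous w-word sequences (shingles), that resemblance is the ratio of the intersection to the union of the two sets, and that containment is the ratio of the intersection to the first document's set. The function names, the default shingle width `w=4`, the sketch size `s=100`, and the use of a truncated BLAKE2b hash in place of Rabin fingerprints are all assumptions made for illustration, not details taken from the abstract.

```python
import hashlib


def shingles(text, w=4):
    """Return the set of contiguous w-word sequences (shingles) in text."""
    words = text.lower().split()
    if len(words) < w:
        # Assumption: a document shorter than w words yields one short shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b):
    """|A intersect B| / |A union B| for two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """|A intersect B| / |A|: how much of A appears in B."""
    return len(a & b) / len(a) if a else 0.0


def fingerprint(shingle):
    """64-bit hash of a shingle (a stand-in for a Rabin fingerprint)."""
    data = " ".join(shingle).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def sketch(shingle_set, s=100):
    """Fixed-size sample: the s smallest fingerprints of the document."""
    return sorted(fingerprint(x) for x in shingle_set)[:s]


def estimate_resemblance(sketch_a, sketch_b, s=100):
    """Estimate resemblance from two sketches computed independently."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    # The s smallest fingerprints of the union act as a random sample of it.
    smallest = set(sorted(set_a | set_b)[:s])
    if not smallest:
        return 0.0
    return len(smallest & set_a & set_b) / len(smallest)


doc_a = shingles("a rose is a rose is a rose")
doc_b = shingles("a rose is a rose")
print(resemblance(doc_a, doc_b))   # exact value from the full sets
print(containment(doc_b, doc_a))   # doc_b is entirely contained in doc_a
print(estimate_resemblance(sketch(doc_a), sketch(doc_b)))
```

Each sketch depends only on its own document, so sketches can be computed once and stored; comparing two documents then needs only the two fixed-size samples. For documents with fewer than `s` distinct shingles, as in this toy example, the estimate coincides with the exact resemblance.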


Citations
Proceedings Article

Traffic classification-based spam filter

TL;DR: Proposes Bulk Mail Traffic Classification (BMTC), an unsupervised spam filter that screens junk mail from the perspective of ISPs and can be deployed in high-volume traffic environments handling millions of mails every day with small memory consumption.
Proceedings Article

On-the-fly token similarity joins in relational databases

TL;DR: This work defines tokenize, a new relational operator that generates tokens and allows the similarity join to be fully integrated into relational databases; the operator is implemented in the kernel of PostgreSQL and its performance for similarity joins is evaluated empirically.

On matching nodes between trees

TL;DR: One novelty in the solution is to reduce the problem to computing the upper-envelope of pseudo-planes and then apply the results from Computational Geometry to obtain an efficient algorithm.
Journal Article

GraphClust2: Annotation and discovery of structured RNAs with scalable and accessible integrative clustering.

TL;DR: GraphClust2 bridges the gap between high-throughput sequencing and structural RNA analysis and provides an integrative solution by incorporating diverse experimental and genomic data in an accessible manner via the Galaxy framework and demonstrates that the annotation performance of clustering functional RNAs can be considerably improved.
Patent

Data replication with delta compression

TL;DR: Data replication with delta compression is discussed in this paper, where a primary system and a replica system are determined to both have an identical first data segment that is similar to a second data segment.
References
Book

The Probabilistic Method

Joel Spencer
TL;DR: A particular set of problems - all dealing with “good” colorings of an underlying set of points relative to a given family of sets - is explored.
Journal Article

Syntactic clustering of the Web

TL;DR: An efficient way to determine the syntactic similarity of files is developed and applied to every document on the World Wide Web, and a clustering of all the documents that are syntactically similar is built.
Journal Article

Min-Wise Independent Permutations

TL;DR: This research was motivated by the fact that such a family of permutations is essential to the algorithm used in practice by the AltaVista web index software to detect and filter near-duplicate documents.
Proceedings Article

Finding similar files in a large file system

TL;DR: Application of sif can be found in file management, information collecting, program reuse, file synchronization, data compression, and maybe even plagiarism detection.
Proceedings Article

Copy detection mechanisms for digital documents

TL;DR: This paper proposes a system for registering documents and then detecting copies, either complete copies or partial copies, and describes algorithms for such detection, and metrics required for evaluating detection mechanisms (covering accuracy, efficiency, and security).