Journal ArticleDOI
Syntactic clustering of the Web
Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, Geoffrey Zweig
- Vol. 29, pp 1157-1166
TL;DR: An efficient way to determine the syntactic similarity of files is developed and applied to every document on the World Wide Web, and a clustering of all the documents that are syntactically similar is built.
Abstract:
We have developed an efficient way to determine the syntactic similarity of files and have applied it to every document on the World Wide Web. Using this mechanism, we built a clustering of all the documents that are syntactically similar. Possible applications include a "Lost and Found" service, filtering the results of Web searches, updating widely distributed web-pages, and identifying violations of intellectual property rights.
Citations
Proceedings ArticleDOI
Approximate nearest neighbors: towards removing the curse of dimensionality
Piotr Indyk, Rajeev Motwani
TL;DR: In this paper, the authors present two algorithms for the approximate nearest neighbor problem in high-dimensional spaces, for data sets of size n living in R^d, which require space that is only polynomial in n and d.
Proceedings ArticleDOI
Similarity estimation techniques from rounding algorithms
TL;DR: It is shown that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects.
Proceedings ArticleDOI
On the resemblance and containment of documents
TL;DR: The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that could be done independently for each document.
Journal ArticleDOI
Duplicate Record Detection: A Survey
Elmagarmid, Ipeirotis, Verykios
TL;DR: This paper presents an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database and covers similarity metrics that are commonly used to detect similar field entries.
Journal ArticleDOI
Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
Alexandr Andoni, Piotr Indyk
TL;DR: An algorithm for the c-approximate nearest neighbor problem in a d-dimensional Euclidean space, achieving query time of O(dn^(1/c^2 + o(1))) and space O(dn + n^(1 + 1/c^2 + o(1))), which almost matches the lower bound for hashing-based algorithms recently obtained.
References
Proceedings Article
Finding similar files in a large file system
TL;DR: Application of sif can be found in file management, information collecting, program reuse, file synchronization, data compression, and maybe even plagiarism detection.
Proceedings ArticleDOI
Copy detection mechanisms for digital documents
TL;DR: This paper proposes a system for registering documents and then detecting copies, either complete copies or partial copies, and describes algorithms for such detection, and metrics required for evaluating detection mechanisms (covering accuracy, efficiency, and security).
SCAM: A Copy Detection Mechanism for Digital Documents
TL;DR: A new scheme for detecting copies based on comparing the word frequency occurrences of the new document against those of registered documents, and an experimental comparison between this scheme and COPS, a detection scheme based on sentence overlap is reported on.
Proceedings ArticleDOI
Building a scalable and accurate copy detection mechanism
TL;DR: This paper studies the performance of various copy detection mechanisms, including the disk storage requirements, main memory requirements, response times for registration, and response times for querying, and contrasts performance with the accuracy of the mechanisms (how well they detect partial copies).