Journal ArticleDOI

Syntactic clustering of the Web

TLDR
An efficient way to determine the syntactic similarity of files is developed and applied to every document on the World Wide Web, and a clustering of all the documents that are syntactically similar is built.
Abstract
We have developed an efficient way to determine the syntactic similarity of files and have applied it to every document on the World Wide Web. Using this mechanism, we built a clustering of all the documents that are syntactically similar. Possible applications include a "Lost and Found" service, filtering the results of Web searches, updating widely distributed web-pages, and identifying violations of intellectual property rights.
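The abstract does not spell out the mechanism. As a rough illustration only, the sketch below follows the set-intersection idea described in the "On the resemblance and containment of documents" entry listed under Citations: each document is reduced to a set of word sequences ("shingles"), and similarity is the overlap between two such sets, which can be estimated from a small sample computed independently per document. The function names, the shingle width `w=4`, the sample size `k=64`, and the use of SHA-1 as the hash are all assumptions made for this example, not details taken from the paper.

```python
import hashlib


def shingles(text, w=4):
    """Return the set of contiguous w-word sequences (shingles) in text."""
    words = text.lower().split()
    if len(words) < w:
        # Very short documents become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Exact overlap of the two shingle sets: |A & B| / |A | B|."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def sketch(text, w=4, k=64):
    """Keep the k smallest 64-bit shingle hashes as a fixed-size sample.

    The sample depends only on this one document, so it can be
    computed once and stored.
    """
    hashes = {
        int.from_bytes(
            hashlib.sha1(" ".join(s).encode("utf-8")).digest()[:8], "big"
        )
        for s in shingles(text, w)
    }
    return sorted(hashes)[:k]


def estimated_resemblance(sk_a, sk_b, k=64):
    """Estimate the overlap from two sketches, without the full documents."""
    set_a, set_b = set(sk_a), set(sk_b)
    # The k smallest hashes of the combined sketches form a sample of the union.
    union_sample = sorted(set_a | set_b)[:k]
    if not union_sample:
        return 1.0
    shared = sum(1 for h in union_sample if h in set_a and h in set_b)
    return shared / len(union_sample)
```

When both documents have fewer than `k` distinct shingles, the sketches hold every hash and the estimate coincides with the exact value (barring hash collisions); for longer documents it is only an approximation whose accuracy improves as `k` grows.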


Citations
Proceedings ArticleDOI

Approximate nearest neighbors: towards removing the curse of dimensionality

TL;DR: In this paper, the authors present two algorithms for the approximate nearest neighbor problem in high-dimensional spaces, for data sets of size n living in R^d, which require space that is only polynomial in n and d.
Proceedings ArticleDOI

Similarity estimation techniques from rounding algorithms

TL;DR: It is shown that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects.
Proceedings ArticleDOI

On the resemblance and containment of documents

Andrei Z. Broder
- 11 Jun 1997 - 
TL;DR: The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that could be done independently for each document.
Journal ArticleDOI

Duplicate Record Detection: A Survey

TL;DR: This paper presents an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database and covers similarity metrics that are commonly used to detect similar field entries.
Journal ArticleDOI

Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions

TL;DR: An algorithm for the c-approximate nearest neighbor problem in a d-dimensional Euclidean space, achieving query time of O(dn^(1/c^2 + o(1))) and space O(dn + n^(1 + 1/c^2 + o(1))), which almost matches the recently obtained lower bound for hashing-based algorithms.
References
Proceedings Article

Finding similar files in a large file system

TL;DR: Applications of sif can be found in file management, information collecting, program reuse, file synchronization, data compression, and maybe even plagiarism detection.
Proceedings ArticleDOI

Copy detection mechanisms for digital documents

TL;DR: This paper proposes a system for registering documents and then detecting copies, either complete copies or partial copies, and describes algorithms for such detection, and metrics required for evaluating detection mechanisms (covering accuracy, efficiency, and security).

SCAM: A Copy Detection Mechanism for Digital Documents

TL;DR: A new scheme for detecting copies compares the word-frequency occurrences of a new document against those of registered documents; the paper also reports an experimental comparison between this scheme and COPS, a detection scheme based on sentence overlap.
Proceedings ArticleDOI

Building a scalable and accurate copy detection mechanism

TL;DR: This paper studies the performance of various copy detection mechanisms, including disk storage requirements, main memory requirements, response times for registration, and response times for querying, and contrasts performance with the accuracy of the mechanisms (how well they detect partial copies).