Syntactic clustering of the Web

doi:10.1016/S0169-7552(97)00031-7

Journal Article

- Vol. 29, pp 1157-1166

TLDR

An efficient way to determine the syntactic similarity of files is developed and applied to every document on the World Wide Web, and a clustering of all the documents that are syntactically similar is built.

Abstract:

We have developed an efficient way to determine the syntactic similarity of files and have applied it to every document on the World Wide Web. Using this mechanism, we built a clustering of all the documents that are syntactically similar. Possible applications include a "Lost and Found" service, filtering the results of Web searches, updating widely distributed web-pages, and identifying violations of intellectual property rights.

Citations

Proceedings Article

Approximate nearest neighbors: towards removing the curse of dimensionality

Piotr Indyk, +1 more

TL;DR: In this paper, the authors present two algorithms for the approximate nearest neighbor problem in high-dimensional spaces, for data sets of size n living in R^d, which require space that is only polynomial in n and d.

Proceedings Article

Similarity estimation techniques from rounding algorithms

Moses Charikar

TL;DR: It is shown that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects.
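One well-known instance of such a scheme hashes a vector to the side of a random hyperplane on which it falls; two vectors at angle theta then disagree on a given bit with probability theta/pi. The following is a minimal sketch of that idea, not code from the paper. The function names, the number of bits, and the fixed seed are illustrative assumptions.

```python
import math
import random


def random_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits Gaussian normal vectors (hypothetical helper; seed is arbitrary)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, planes):
    """One bit per hyperplane: 1 if vec lies on the non-negative side, else 0."""
    return [1 if sum(r * x for r, x in zip(plane, vec)) >= 0 else 0
            for plane in planes]


def estimated_angle(sig_u, sig_v):
    """Estimate the angle between two vectors from the fraction of differing bits."""
    disagree = sum(a != b for a, b in zip(sig_u, sig_v)) / len(sig_u)
    return math.pi * disagree


planes = random_hyperplanes(dim=3, num_bits=256)
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]          # same direction as u
w = [-1.0, -2.0, -3.0]       # opposite direction to u
angle_uv = estimated_angle(signature(u, planes), signature(v, planes))
angle_uw = estimated_angle(signature(u, planes), signature(w, planes))
```

Vectors pointing the same way receive identical signatures, and opposite vectors differ in every bit, so the estimate only becomes noisy for intermediate angles, where more bits reduce the variance.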

Proceedings Article

On the resemblance and containment of documents

Andrei Z. Broder

- 11 Jun 1997 -

Sequence

TL;DR: The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that could be done independently for each document.
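The reduction to set intersection can be illustrated with word shingles: each document becomes a set of contiguous word sequences, resemblance is the ratio of intersection to union, and containment is the ratio of intersection to the size of the first set. The sketch below is an illustration under assumptions, not the paper's implementation: the shingle length `k=4`, the SHA-1 based hash family, and the number of hash functions are all arbitrary choices made here.

```python
import hashlib


def shingles(text, k=4):
    """Return the set of contiguous k-word sequences in text (k is an assumption)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b):
    """|A intersect B| / |A union B| for two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def containment(a, b):
    """|A intersect B| / |A|: how much of A also appears in B."""
    return len(a & b) / len(a) if a else 1.0


def _hash(shingle, seed):
    # Hypothetical seeded hash built from SHA-1; any well-mixing hash would do.
    data = (str(seed) + "\x00" + " ".join(shingle)).encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def sketch(shingle_set, num_hashes=200):
    """Keep the minimum hash per seed; needs only one (non-empty) document."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def estimate_resemblance(sketch_a, sketch_b):
    """Fraction of positions where the two sketches agree."""
    return sum(x == y for x, y in zip(sketch_a, sketch_b)) / len(sketch_a)


a = shingles("a rose is a rose is a rose")
b = shingles("a rose is a flower which is a rose")
exact = resemblance(a, b)                      # 1 shared shingle out of 8 distinct
approx = estimate_resemblance(sketch(a), sketch(b))
```

Each sketch is computed from a single document, so sketches can be built independently and compared later without revisiting the full text. The estimate converges on the exact resemblance as the number of hash functions grows.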

Journal Article

Duplicate Record Detection: A Survey

Elmagarmid, +2 more

- 01 Jan 2007 -

IEEE Transactions on Knowledge and Data ...

TL;DR: This paper presents an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database and covers similarity metrics that are commonly used to detect similar field entries.

Journal Article

Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions

Alexandr Andoni, +1 more

- 01 Jan 2008 -

Communications of The ACM

TL;DR: An algorithm for the c-approximate nearest neighbor problem in a d-dimensional Euclidean space, achieving query time of O(dn^(1/c^2 + o(1))) and space O(dn + n^(1 + 1/c^2 + o(1))), which almost matches the lower bound for hashing-based algorithms recently obtained.

References

Proceedings Article

Finding similar files in a large file system

Udi Manber

TL;DR: Applications of sif can be found in file management, information collecting, program reuse, file synchronization, data compression, and maybe even plagiarism detection.

Proceedings Article

Copy detection mechanisms for digital documents

Sergey Brin, +2 more

TL;DR: This paper proposes a system for registering documents and then detecting copies, either complete copies or partial copies, and describes algorithms for such detection, and metrics required for evaluating detection mechanisms (covering accuracy, efficiency, and security).

SCAM: A Copy Detection Mechanism for Digital Documents

Narayanan Shivakumar, +1 more

TL;DR: A new scheme for detecting copies is proposed, based on comparing the word frequency occurrences of the new document against those of registered documents, and an experimental comparison between this scheme and COPS, a detection scheme based on sentence overlap, is reported.
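Comparing word frequency occurrences can be illustrated with a plain cosine similarity over word-count vectors. This is a generic stand-in chosen for illustration and is not claimed to be the similarity measure SCAM itself uses; the function names and the whitespace tokenizer are assumptions.

```python
import math
from collections import Counter


def word_frequencies(text):
    """Count lowercase, whitespace-separated words (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two word-count vectors; 0.0 if either is empty."""
    dot = sum(count * freq_b[word] for word, count in freq_a.items())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


registered = word_frequencies("the quick brown fox jumps over the lazy dog")
candidate = word_frequencies("the lazy dog jumps over the quick brown fox")
unrelated = word_frequencies("syntactic clustering of web pages")

same_words = cosine_similarity(registered, candidate)
no_overlap = cosine_similarity(registered, unrelated)
```

Because word counts ignore order, a reordered copy scores as high as an exact copy, which is the main contrast with approaches based on sentence or shingle overlap.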

Proceedings Article

Building a scalable and accurate copy detection mechanism

Narayanan Shivakumar, +1 more

TL;DR: This paper studies the performance of various copy detection mechanisms, including the disk storage requirements, main memory requirements, response times for registration, and response times for querying, and contrasts performance with the accuracy of the mechanisms (how well they detect partial copies).

Related Papers (5)

On the resemblance and containment of documents

Similarity estimation techniques from rounding algorithms

Approximate nearest neighbors: towards removing the curse of dimensionality

The anatomy of a large-scale hypertextual Web search engine

Min-Wise Independent Permutations