Winnowing: local algorithms for document fingerprinting

doi:10.1145/872757.872770

Proceedings Article


Saul Schleimer, +2 more

- pp 76-85


TLDR

The class of local document fingerprinting algorithms is introduced, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies, and a novel lower bound on the performance of any local algorithm is proved.

Abstract:

Digital content is for copying: quotation, revision, plagiarism, and file sharing all create copies. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents. We introduce the class of local document fingerprinting algorithms, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies. We prove a novel lower bound on the performance of any local algorithm. We also develop winnowing, an efficient local fingerprinting algorithm, and show that winnowing's performance is within 33% of the lower bound. Finally, we also give experimental results on Web data, and report experience with MOSS, a widely-used plagiarism detection service.
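
The abstract names winnowing but does not spell out its steps. As a rough illustration only, the sketch below follows the commonly described form of the algorithm: hash every k-gram of the text, slide a window of w consecutive hashes, and keep the minimum hash in each window (the rightmost one on ties), recording each selected position once. The function names `kgram_hashes` and `winnow`, the use of CRC-32 as the hash, and the sample values are assumptions made for this example, not details taken from this page.

```python
import zlib
from typing import List, Tuple


def kgram_hashes(text: str, k: int) -> List[int]:
    """Hash every contiguous substring of k characters (k-gram) of text.

    CRC-32 is an arbitrary stand-in here; any reasonable hash would do.
    """
    return [
        zlib.crc32(text[i:i + k].encode("utf-8"))
        for i in range(len(text) - k + 1)
    ]


def winnow(hashes: List[int], w: int) -> List[Tuple[int, int]]:
    """Return (position, hash) fingerprints chosen by sliding a window of size w.

    In each window the minimum hash is selected; on ties the rightmost
    occurrence wins. A position is recorded only the first time it is selected.
    """
    fingerprints: List[Tuple[int, int]] = []
    last_pos = -1
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Index of the rightmost occurrence of the minimum inside the window.
        offset = w - 1 - window[::-1].index(smallest)
        pos = start + offset
        # Selected positions never move left, so comparing with the
        # previous selection is enough to avoid duplicates.
        if pos != last_pos:
            fingerprints.append((pos, smallest))
            last_pos = pos
    return fingerprints


if __name__ == "__main__":
    # Arbitrary illustrative hash values, not real CRC-32 output.
    sample = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
    print(winnow(sample, 4))
    print(winnow(kgram_hashes("a do run run run, a do run run", 5), 4))
```

Because every window contributes a selection, two documents that share a run of at least w + k - 1 characters will share at least one fingerprint under this scheme. The choice of each fingerprint depends only on the contents of its window, which is the sense in which the algorithm is local.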

Citations


Journal Article

Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences

Heng Li

- 15 Jul 2016 -

Bioinformatics

TL;DR: A new mapper, minimap, and a de novo assembler, miniasm, are presented for efficiently mapping and assembling SMRT and ONT reads without an error correction stage.


Proceedings Article

DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones

Lingxiao Jiang, +3 more

TL;DR: This paper presents an efficient algorithm for identifying similar subtrees and applies it to tree representations of source code. The algorithm is implemented as a clone detection tool called DECKARD and evaluated on large code bases written in C and Java, including the Linux kernel and JDK.


Journal Article

Scalable statistical bug isolation

Ben Liblit, +4 more

TL;DR: A statistical debugging algorithm that isolates bugs in programs containing multiple undiagnosed bugs and identifies predictors associated with individual bugs. These predictors reveal both the circumstances under which bugs occur and the frequencies of failure modes, making it easier to prioritize debugging efforts.


Journal Article

Comparison and Evaluation of Clone Detection Tools

S. Bellon, +4 more

- 01 Sep 2007 -

IEEE Transactions on Software Engineerin...

TL;DR: An experiment is presented that evaluates six clone detectors based on eight large C and Java programs (altogether almost 850 KLOC) and selects techniques that cover the whole spectrum of the state-of-the-art in clone detection.


A Survey on Software Clone Detection Research

Chanchal K. Roy, +1 more

TL;DR: The state of the art in clone detection research is surveyed, the clone terms commonly used in the literature are described along with their mappings to the commonly used clone types, and several open problems related to clone detection research are pointed out.


References


Proceedings Article

On the resemblance and containment of documents

Andrei Z. Broder

- 11 Jun 1997 -

Sequence

TL;DR: The basic idea is to reduce the resemblance and containment of documents to set intersection problems that can be easily evaluated by a process of random sampling done independently for each document.


Journal Article

Syntactic clustering of the Web

Andrei Z. Broder, +3 more

TL;DR: An efficient way to determine the syntactic similarity of files is developed and applied to every document on the World Wide Web, and a clustering of all the documents that are syntactically similar is built.


Journal Article

On-line construction of suffix trees

Esko Ukkonen

- 01 Sep 1995 -

Algorithmica

TL;DR: An on-line algorithm is presented for constructing the suffix tree for a given string in time linear in the length of the string, developed as a linear-time version of a very simple algorithm for (quadratic size) suffix tries.


Journal Article

Efficient randomized pattern-matching algorithms

Richard M. Karp, +1 more

- 01 Mar 1987 -

IBM Journal of Research and Development

TL;DR: A randomized algorithm is presented for finding the first occurrence of a string X as a consecutive block within a text Y. The algorithm requires a constant number of storage locations and essentially runs in real time.


Proceedings Article

Finding similar files in a large file system

Udi Manber

TL;DR: Applications of sif can be found in file management, information collecting, program reuse, file synchronization, data compression, and maybe even plagiarism detection.



Related Papers (5)

Copy detection mechanisms for digital documents

Finding similar files in a large file system

On the resemblance and containment of documents

CCFinder: a multilinguistic token-based code clone detection system for large scale source code

Clone detection using abstract syntax trees