Proceedings ArticleDOI
Winnowing: local algorithms for document fingerprinting
Saul Schleimer,Daniel Shawcross Wilkerson,Alex Aiken +2 more
- pp 76-85
Reads0
Chats0
TLDR
The class of local document fingerprinting algorithms is introduced, which seems to capture an essential property of any finger-printing technique guaranteed to detect copies, and a novel lower bound on the performance of any local algorithm is proved.Abstract:
Digital content is for copying: quotation, revision, plagiarism, and file sharing all create copies. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents.We introduce the class of local document fingerprinting algorithms, which seems to capture an essential property of any finger-printing technique guaranteed to detect copies. We prove a novel lower bound on the performance of any local algorithm. We also develop winnowing, an efficient local fingerprinting algorithm, and show that winnowing's performance is within 33% of the lower bound. Finally, we also give experimental results on Web data, and report experience with MOSS, a widely-used plagiarism detection service.read more
Citations
More filters
Journal ArticleDOI
Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences
TL;DR: A new mapper, minimap and a de novo assembler, miniasm, is presented for efficiently mapping and assembling SMRT and ONT reads without an error correction stage.
Proceedings ArticleDOI
DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones
TL;DR: This paper presents an efficient algorithm for identifying similar subtrees and apply it to tree representations of source code and implemented this algorithm as a clone detection tool called DECKARD and evaluated it on large code bases written in C and Java including the Linux kernel and JDK.
Journal ArticleDOI
Scalable statistical bug isolation
TL;DR: A statistical debugging algorithm that isolates bugs in programs containing multiple undiagnosed bugs and identifies predictors that are associated with individual bugs that reveal both the circumstances under which bugs occur as well as the frequencies of failure modes, making it easier to prioritize debugging efforts.
Journal ArticleDOI
Comparison and Evaluation of Clone Detection Tools
TL;DR: An experiment is presented that evaluates six clone detectors based on eight large C and Java programs (altogether almost 850 KLOC) and selects techniques that cover the whole spectrum of the state-of-the-art in clone detection.
A Survey on Software Clone Detection Research
Chanchal K. Roy,James R. Cordy +1 more
TL;DR: The state of the art in clone detection research is surveyed, the clone terms commonly used in the literature are described along with their corresponding mappings to the commonly used clone types and several open problems related to clone detectionResearch are pointed out.
References
More filters
Proceedings ArticleDOI
On the resemblance and containment of documents
TL;DR: The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that could be done independently for each document.
Journal ArticleDOI
Syntactic clustering of the Web
TL;DR: An efficient way to determine the syntactic similarity of files is developed and applied to every document on the World Wide Web, and a clustering of all the documents that are syntactically similar is built.
Journal ArticleDOI
On-line construction of suffix trees
TL;DR: An on-line algorithm is presented for constructing the suffix tree for a given string in time linear in the length of the string, developed as a linear-time version of a very simple algorithm for (quadratic size) suffixtries.
Journal ArticleDOI
Efficient randomized pattern-matching algorithms
Richard M. Karp,Michael O. Rabin +1 more
TL;DR: In this article, the first occurrence of a string X as a consecutive block within a text Y is found by using a randomized algorithm. But the algorithm requires a constant number of storage locations, and essentially runs in real time.
Proceedings Article
Finding similar files in a large file system
TL;DR: Application of sif can be found in file management, information collecting, program reuse, file synchronization, data compression, and maybe even plagiarism detection.