Overview of fingerprinting methods for local text reuse detection

doi:10.1109/INNOVATIONS.2016.7880050

Proceedings Article

Leena Lulu, +2 more

- pp 1-6

TLDR

This work defines the context of local text reuse and situates it within the general spectrum of information retrieval to pinpoint its particular applicability and challenges, and it introduces the general principles of fingerprinting algorithms from an information retrieval perspective.

Abstract:

We overview several local text reuse detection methods based on fingerprinting techniques. We first define the context of local text reuse and situate it within the general spectrum of information retrieval in order to pinpoint its particular applicability and challenges. After a brief description of the major text reuse detection approaches, we introduce the general principles of fingerprinting algorithms from an information retrieval perspective. Three classes of fingerprinting methods (overlap, non-overlap, and randomized) are surveyed. Specific algorithms, such as k-gram, winnowing, hailstorm, DCT and hash-breaking, are described. The performance and characteristics of these algorithms are summarized based on data from the literature.
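
The contrast between the overlap and non-overlap classes named in the abstract can be sketched as follows. This is a minimal illustration, not the surveyed paper's implementation: the helper names (`stable_hash`, `kgram_fingerprints`, `hash_breaking_fingerprints`), the word-level granularity, the BLAKE2 hash, and the sample sentences are all assumptions made for the example. The k-gram function hashes every window of k consecutive words; the hash-breaking function cuts the text after each word whose hash is 0 mod p and hashes each segment once.

```python
import hashlib


def stable_hash(text: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest(), "big")


def kgram_fingerprints(words: list[str], k: int) -> list[int]:
    """Overlap method: one hash per window of k consecutive words."""
    return [stable_hash(" ".join(words[i:i + k])) for i in range(len(words) - k + 1)]


def hash_breaking_fingerprints(words: list[str], p: int) -> list[int]:
    """Non-overlap method: a word whose hash is 0 mod p ends the current
    segment; each segment is hashed exactly once."""
    fingerprints, segment = [], []
    for word in words:
        segment.append(word)
        if stable_hash(word) % p == 0:
            fingerprints.append(stable_hash(" ".join(segment)))
            segment = []
    if segment:  # trailing words after the last breakpoint
        fingerprints.append(stable_hash(" ".join(segment)))
    return fingerprints


# Hypothetical sample texts sharing the passage "the quick brown fox jumps over".
doc_a = "the quick brown fox jumps over the lazy dog".split()
doc_b = "yesterday the quick brown fox jumps over a fence".split()

# Fingerprints present in both documents point at the reused passage.
shared = set(kgram_fingerprints(doc_a, 3)) & set(kgram_fingerprints(doc_b, 3))
print(len(shared), "shared 3-gram fingerprints")
```

Because every k-gram is hashed, the overlap variant stores roughly one fingerprint per word, whereas hash-breaking stores about one per p words at the cost of being sensitive to edits inside a segment.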

Citations

Journal Article

Comparison between the Stemmer Porter Effect and Nazief-Adriani on the Performance of Winnowing Algorithms for Measuring Plagiarism

Alam Rahmatulloh, +4 more

- 02 Aug 2019 -

International Journal on Advanced Scienc...

TL;DR: The results of this study indicate that the Nazief-Adriani stemmer has the better effect on the winnowing algorithm's detection performance, lowering the similarity value by only 0.28%, while the Porter stemmer is superior in processing time, running 69% faster.

Dissertation

A corpus-assisted discourse analysis of NHS responses to online patient feedback

Craig Evans

TL;DR: This article used a corpus-assisted discourse studies (CADS) approach to examine linguistic patterns in datasets based on three staff reply text types derived from an 11.5-million-word corpus of NHS replies.

Proceedings Article

Study on a Text Reuse Measurement Method Using Expanded Index Term

Eunji Lee, +4 more

TL;DR: This work is an attempt to improve accuracy of text reuse measurement by using expanded index terms, expanding the range of reused inspection sentences, and circularizing words in order to resolve the issue of undetected reused sentences that arise from the replacement of similar terms.

Applying data mining techniques in the context of social media to improve situational awareness at large-scale events

Rainer Simon, +5 more

TL;DR: The authors apply data mining techniques to social media to improve situational awareness at large-scale events, describing an intelligent situational information portal that leverages open data sources and channels to improve decision support.

Proceedings Article

Identifying text reuse using word net-based extended named entity recognition

Eunji Lee, +1 more

TL;DR: A method of measuring similarity in which named entity recognition is performed on the words appearing in the target document and named entity tags are annotated to them is proposed.

References

Proceedings Article

Locality-sensitive hashing scheme based on p-stable distributions

Mayur Datar, +3 more

TL;DR: A novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under the l_p norm, based on p-stable distributions, that improves the running time of the earlier algorithm and yields the first known provably efficient approximate NN algorithm for the case p<1.

Proceedings Article

On the resemblance and containment of documents

Andrei Z. Broder

- 11 Jun 1997 -

Sequence

TL;DR: The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that could be done independently for each document.
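
A rough sketch of that reduction, under stated assumptions: documents become sets of word shingles, resemblance is the intersection-over-union ratio, containment is the intersection over the first set, and a fixed-size sample of the smallest hash values stands in for the full set. The function names, the BLAKE2 hash, and the shingle width are choices made for this illustration rather than details taken from the listing.

```python
import hashlib


def stable_hash(text: str) -> int:
    """Deterministic 64-bit hash of a shingle."""
    return int.from_bytes(hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest(), "big")


def shingles(text: str, w: int) -> set[str]:
    """Set of all runs of w consecutive words."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a: set[str], b: set[str]) -> float:
    """|A intersect B| / |A union B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a: set[str], b: set[str]) -> float:
    """|A intersect B| / |A|: how much of A appears in B."""
    return len(a & b) / len(a) if a else 0.0


def sketch(shingle_set: set[str], s: int) -> set[int]:
    """Sample: the s smallest hash values, computable per document in isolation."""
    return set(sorted(stable_hash(x) for x in shingle_set)[:s])


def estimated_resemblance(sketch_a: set[int], sketch_b: set[int], s: int) -> float:
    """Estimate resemblance from two sketches: among the s smallest hashes
    of the combined sketches, count the fraction present in both."""
    smallest = set(sorted(sketch_a | sketch_b)[:s])
    return len(smallest & sketch_a & sketch_b) / len(smallest) if smallest else 0.0


a = shingles("a rose is a rose is a rose", 4)
b = shingles("a rose is a flower which is a rose", 4)
print(resemblance(a, b))   # exact value on the full shingle sets
print(containment(a, b))
print(estimated_resemblance(sketch(a, 4), sketch(b, 4), 4))  # sampled estimate
```

With a small sample size the estimate is noisy; it approaches the exact ratio as s grows toward the size of the shingle sets.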

Proceedings Article

Winnowing: local algorithms for document fingerprinting

Saul Schleimer, +2 more

TL;DR: The class of local document fingerprinting algorithms is introduced, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies, and a novel lower bound on the performance of any local algorithm is proved.
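
As a hedged illustration of what "local" means here, the sketch below implements the usual statement of winnowing: slide a window of w consecutive k-gram hashes, keep the minimum of each window (the rightmost one on ties), and record each selected position only once. The selection in a window depends only on that window's contents. The function name `winnow` and the hash sequence are illustrative assumptions; the input is taken to be a list of k-gram hashes that has already been computed.

```python
def winnow(hashes: list[int], w: int) -> list[tuple[int, int]]:
    """Return (hash, position) pairs selected by winnowing with window size w.

    If fewer than w hashes are given there is no full window and the
    result is empty.
    """
    selected = []
    last_pos = -1
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Rightmost occurrence of the minimum inside this window.
        pos = start + (w - 1 - window[::-1].index(smallest))
        # Selected positions never move left, so comparing with the
        # previous selection is enough to avoid duplicates.
        if pos != last_pos:
            selected.append((smallest, pos))
            last_pos = pos
    return selected


# Illustrative sequence of k-gram hashes.
example = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
print(winnow(example, 4))
# [(17, 3), (17, 6), (8, 8), (39, 11), (17, 15)]
```

Since every window contributes a selection, any shared run of at least w + k - 1 characters (w consecutive k-grams) yields at least one common fingerprint in both documents, while far fewer hashes are stored than with full k-gram fingerprinting.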

Proceedings Article

Finding similar files in a large file system

Udi Manber

TL;DR: Application of sif can be found in file management, information collecting, program reuse, file synchronization, data compression, and maybe even plagiarism detection.

Proceedings Article

Copy detection mechanisms for digital documents

Sergey Brin, +2 more

TL;DR: This paper proposes a system for registering documents and then detecting copies, either complete copies or partial copies, and describes algorithms for such detection, and metrics required for evaluating detection mechanisms (covering accuracy, efficiency, and security).
