Proceedings Article

Overview of fingerprinting methods for local text reuse detection

TLDR
This work defines the context of local text reuse and situates it within the general spectrum of information retrieval to pinpoint its particular applicability and challenges, and it introduces the general principles of fingerprinting algorithms from an information retrieval perspective.
Abstract
We overview several local text reuse detection methods based on fingerprinting techniques. We first define the context of local text reuse and situate it within the general spectrum of information retrieval in order to pinpoint its particular applicability and challenges. After a brief description of the major text reuse detection approaches, we introduce the general principles of fingerprinting algorithms from an information retrieval perspective. Three classes of fingerprinting methods (overlap, non-overlap, and randomized) are surveyed. Specific algorithms, such as k-gram, winnowing, hailstorm, DCT and hash-breaking, are described. The performance and characteristics of these algorithms are summarized based on data from the literature.
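As a rough illustration of two of the algorithms named in the abstract, the sketch below hashes every overlapping character k-gram of a normalized text and then applies winnowing, which keeps the minimum hash in each sliding window of w hashes (the rightmost one on ties). It is a minimal sketch under stated assumptions, not the implementation evaluated in the paper: the function names, the alphanumeric normalization, the use of CRC32 as the hash, and the default values k=5 and w=4 are all illustrative choices.

```python
import zlib


def kgram_hashes(text, k):
    """Hash every overlapping character k-gram of the normalized text.

    Normalization (lowercase, alphanumerics only) and CRC32 are
    assumptions made for this sketch.
    """
    s = "".join(ch.lower() for ch in text if ch.isalnum())
    return [zlib.crc32(s[i:i + k].encode("utf-8"))
            for i in range(len(s) - k + 1)]


def winnow(hashes, w):
    """Return the set of (position, hash) pairs selected by winnowing.

    In each window of w consecutive hashes the minimum is selected;
    on ties the rightmost occurrence is taken. Inputs with fewer
    than w hashes yield an empty set in this sketch.
    """
    selected = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Index of the rightmost occurrence of the minimum in the window.
        offset = w - 1 - window[::-1].index(smallest)
        selected.add((start + offset, smallest))
    return selected


def fingerprint(text, k=5, w=4):
    """Document fingerprint: the set of winnowed k-gram hash values."""
    return {h for _, h in winnow(kgram_hashes(text, k), w)}


if __name__ == "__main__":
    a = "The quick brown fox jumps over the lazy dog near the river bank."
    b = "Yesterday, the quick brown fox jumps over the lazy dog again!"
    shared = fingerprint(a) & fingerprint(b)
    print(len(shared), "shared fingerprint hashes")
```

With these settings, any passage shared by two texts that is at least w + k - 1 normalized characters long contributes at least one common hash to both fingerprints, which is the property that makes the method suitable for local (partial) reuse rather than whole-document similarity.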


Citations
Journal Article

Comparison between the Stemmer Porter Effect and Nazief-Adriani on the Performance of Winnowing Algorithms for Measuring Plagiarism

TL;DR: The results of this study indicate that the Nazief-Adriani stemmer has the better effect on the winnowing algorithm's detection performance, decreasing the similarity value by only 0.28%, while the Porter stemmer is superior in processing time, running up to 69% faster.
Dissertation

A corpus-assisted discourse analysis of NHS responses to online patient feedback

Craig Evans
TL;DR: This article used a corpus-assisted discourse studies (CADS) approach to examine linguistic patterns in datasets based on three staff reply text types derived from an 11.5-million-word corpus of NHS replies.
Proceedings Article

Study on a Text Reuse Measurement Method Using Expanded Index Term

TL;DR: This work attempts to improve the accuracy of text reuse measurement by using expanded index terms, expanding the range of reused inspection sentences, and circularizing words, in order to resolve the issue of reused sentences going undetected when similar terms are substituted.

Applying data mining techniques in the context of social media to improve situational awareness at large-scale events

TL;DR: In this article, the authors propose a method for applying data mining techniques in the context of social media to improve situational awareness in large-scale events by describing a solution for an intelligent situational information portal that leverages open data information sources and channels for improving decision support.
Proceedings Article

Identifying text reuse using word net-based extended named entity recognition

TL;DR: A method of measuring similarity in which named entity recognition is performed on the words appearing in the target document and named entity tags are annotated to them is proposed.
References
Proceedings Article

Locality-sensitive hashing scheme based on p-stable distributions

TL;DR: A novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under the l_p norm, based on p-stable distributions, that improves the running time of the earlier algorithm and yields the first known provably efficient approximate NN algorithm for the case p < 1.
Proceedings Article

On the resemblance and containment of documents

Andrei Z. Broder
11 Jun 1997
TL;DR: The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that could be done independently for each document.
Proceedings Article

Winnowing: local algorithms for document fingerprinting

TL;DR: The class of local document fingerprinting algorithms is introduced, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies, and a novel lower bound on the performance of any local algorithm is proved.
Proceedings Article

Finding similar files in a large file system

TL;DR: Applications of sif can be found in file management, information collecting, program reuse, file synchronization, data compression, and perhaps even plagiarism detection.
Proceedings Article

Copy detection mechanisms for digital documents

TL;DR: This paper proposes a system for registering documents and then detecting copies, either complete copies or partial copies, and describes algorithms for such detection, and metrics required for evaluating detection mechanisms (covering accuracy, efficiency, and security).