Proceedings ArticleDOI
Overview of fingerprinting methods for local text reuse detection
Leena Lulu, Boumediene Belkhouche, Saad Harous +2 more
- pp 1-6
TLDR
This work defines the context of local text reuse and situates it within the general spectrum of information retrieval to pinpoint its particular applicability and challenges, and introduces the general principles of fingerprinting algorithms from an information retrieval perspective.
Abstract:
We survey several local text reuse detection methods based on fingerprinting techniques. We first define the context of local text reuse and situate it within the general spectrum of information retrieval in order to pinpoint its particular applicability and challenges. After a brief description of the major text reuse detection approaches, we introduce the general principles of fingerprinting algorithms from an information retrieval perspective. Three classes of fingerprinting methods (overlap, non-overlap, and randomized) are surveyed. Specific algorithms, such as k-gram, winnowing, hailstorm, DCT and hash-breaking, are described. The performance and characteristics of these algorithms are summarized based on data from the literature.
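To make the overlap and non-overlap classes named in the abstract concrete, here is a minimal Python sketch of k-gram hashing, winnowing, and hash-breaking. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the whitespace tokenizer, the CRC32 hash, the parameter names `k`, `w`, and `p`, the helper `shared_fingerprints`, and the decision to keep the trailing segment in hash-breaking are all choices made here for the example.

```python
import zlib


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def stable_hash(s):
    """Deterministic 32-bit hash; the built-in hash() is salted per process."""
    return zlib.crc32(s.encode("utf-8"))


def kgram_hashes(tokens, k):
    """Overlap method: hash every run of k consecutive tokens."""
    return [stable_hash(" ".join(tokens[i:i + k]))
            for i in range(len(tokens) - k + 1)]


def winnow(hashes, w):
    """Keep the minimum hash of each window of w consecutive k-gram hashes.

    Ties are broken by taking the rightmost minimum. Returns a set of
    (hash, position) pairs, so a minimum shared by neighbouring windows
    is recorded only once.
    """
    selected = set()
    if not hashes:
        return selected
    # If there are fewer than w hashes, treat the whole list as one window.
    for start in range(max(1, len(hashes) - w + 1)):
        window = hashes[start:start + w]
        smallest = min(window)
        offset = len(window) - 1 - window[::-1].index(smallest)
        selected.add((smallest, start + offset))
    return selected


def hash_breaking(tokens, p):
    """Non-overlap method: end a segment at each token whose hash is
    divisible by p, then hash each segment as a whole."""
    fingerprints, segment = [], []
    for tok in tokens:
        segment.append(tok)
        if stable_hash(tok) % p == 0:
            fingerprints.append(stable_hash(" ".join(segment)))
            segment = []
    if segment:
        # Assumption: the tokens after the last breakpoint form a final segment.
        fingerprints.append(stable_hash(" ".join(segment)))
    return fingerprints


def shared_fingerprints(text_a, text_b, k=3, w=4):
    """Hypothetical helper: winnowed hash values common to both texts."""
    fa = {h for h, _ in winnow(kgram_hashes(tokenize(text_a), k), w)}
    fb = {h for h, _ in winnow(kgram_hashes(tokenize(text_b), k), w)}
    return fa & fb
```

With these definitions, two texts that share a passage of at least `w + k - 1` tokens always have at least one winnowed hash in common, because some full window of k-gram hashes lies inside the shared passage and selects the same minimum in both texts. Hash-breaking gives no such guarantee: a single changed word inside a segment changes that segment's fingerprint.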
Citations
Journal ArticleDOI
Comparison between the Stemmer Porter Effect and Nazief-Adriani on the Performance of Winnowing Algorithms for Measuring Plagiarism
TL;DR: The results of this study indicate that the Nazief-Adriani stemmer is superior to the Porter stemmer in its effect on the winnowing algorithm, decreasing the similarity value by only 0.28%, whereas the Porter stemmer is superior in processing speed, making processing 69% faster.
DissertationDOI
A corpus-assisted discourse analysis of NHS responses to online patient feedback
TL;DR: This article used a corpus-assisted discourse studies (CADS) approach to examine linguistic patterns in datasets based on three staff reply text types derived from an 11.5-million-word corpus of NHS replies.
Proceedings ArticleDOI
Study on a Text Reuse Measurement Method Using Expanded Index Term
TL;DR: This work attempts to improve the accuracy of text reuse measurement by using expanded index terms, expanding the range of sentences inspected for reuse, and circularizing words, in order to resolve the problem of reused sentences going undetected when terms are replaced with similar ones.
Applying data mining techniques in the context of social media to improve situational awareness at large-scale events
Rainer Simon, Drazen Ignjatovic, Georg Neubauer, Clemens Gutschi, Johannes Pan, Siegfried Vössner +5 more
TL;DR: In this article, the authors propose a method for applying data mining techniques in the context of social media to improve situational awareness in large-scale events by describing a solution for an intelligent situational information portal that leverages open data information sources and channels for improving decision support.
Proceedings ArticleDOI
Identifying text reuse using word net-based extended named entity recognition
Eunji Lee, Pankoo Kim +1 more
TL;DR: A method of measuring similarity in which named entity recognition is performed on the words appearing in the target document and named entity tags are annotated to them is proposed.
References
Proceedings ArticleDOI
Locality-sensitive hashing scheme based on p-stable distributions
TL;DR: A novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under the l_p norm, based on p-stable distributions, that improves the running time of the earlier algorithm and yields the first known provably efficient approximate NN algorithm for the case p<1.
Proceedings ArticleDOI
On the resemblance and containment of documents
TL;DR: The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that could be done independently for each document.
Proceedings ArticleDOI
Winnowing: local algorithms for document fingerprinting
TL;DR: The class of local document fingerprinting algorithms is introduced, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies, and a novel lower bound on the performance of any local algorithm is proved.
Proceedings Article
Finding similar files in a large file system
TL;DR: Applications of sif can be found in file management, information collecting, program reuse, file synchronization, data compression, and maybe even plagiarism detection.
Proceedings ArticleDOI
Copy detection mechanisms for digital documents
TL;DR: This paper proposes a system for registering documents and then detecting copies, either complete copies or partial copies, and describes algorithms for such detection, and metrics required for evaluating detection mechanisms (covering accuracy, efficiency, and security).