Journal ArticleDOI
Intrinsic plagiarism analysis
Benno Stein,Nedim Lipka,Peter Prettenhofer +2 more
- Vol. 45, Iss: 1, pp 63-82
TLDR
The article investigates whether plagiarism can be detected by a computer program if no reference can be provided, e.g., if the foreign sections stem from a book that is not available in digital form.
Abstract:
Research in automatic text plagiarism detection focuses on algorithms that compare suspicious documents against a collection of reference documents. Recent approaches perform well in identifying copied or modified foreign sections, but they assume a closed world where a reference collection is given. This article investigates the question whether plagiarism can be detected by a computer program if no reference can be provided, e.g., if the foreign sections stem from a book that is not available in digital form. We call this problem class intrinsic plagiarism analysis; it is closely related to the problem of authorship verification. Our contributions are threefold. (1) We organize the algorithmic building blocks for intrinsic plagiarism analysis and authorship verification and survey the state of the art. (2) We show how the meta learning approach of Koppel and Schler, termed "unmasking", can be employed to post-process unreliable stylometric analysis results. (3) We operationalize and evaluate an analysis chain that combines document chunking, style model computation, one-class classification, and meta learning.
Citations
Journal ArticleDOI
Understanding Plagiarism Linguistic Patterns, Textual Features, and Detection Methods
TL;DR: A new taxonomy of plagiarism is presented that highlights differences between literal plagiarism and intelligent plagiarism, from the plagiarist's behavioral point of view, and supports deep understanding of different linguistic patterns in committing plagiarism.
Overview of the Author Identification Task at PAN 2013
TL;DR: The author identification task at PAN-2014 focuses on author verification and adopts the c@1 measure, originally proposed for the question answering task, and continues the successful practice of the PAN labs to examine meta-models based on the combination of all submitted systems.
Journal ArticleDOI
Plagiarism detection using stopword n-grams
TL;DR: It is shown that stopword n-grams reveal important information for plagiarism detection since they are able to capture syntactic similarities between suspicious and original documents and they can be used to detect the exact plagiarized passage boundaries.
Journal ArticleDOI
Plagiarism meets paraphrasing: Insights for the next generation in automatic plagiarism detection
TL;DR: The P4P corpus is created, a new resource that uses a paraphrase typology to annotate a subset of the PAN-PC-10 corpus for automatic plagiarism detection, providing critical insights for the improvement of automatic plagiarism detection systems.
Journal ArticleDOI
Surveying Stylometry Techniques and Applications
TL;DR: An extensive performance analysis is performed on a corpus of 1,000 authors to investigate authorship attribution, verification, and clustering using 14 algorithms from the literature.
References
Journal ArticleDOI
SMOTE: synthetic minority over-sampling technique
TL;DR: In this article, a method of over-sampling the minority class by creating synthetic minority class examples is presented and evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
Book
Artificial Intelligence: A Modern Approach
Stuart Russell,Peter Norvig +1 more
TL;DR: In this article, the authors present a comprehensive introduction to the theory and practice of artificial intelligence for modern applications, including game playing, planning and acting, and reinforcement learning with neural networks.
Journal ArticleDOI
Reducing the Dimensionality of Data with Neural Networks
TL;DR: In this article, an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data is described.
Proceedings ArticleDOI
Approximate nearest neighbors: towards removing the curse of dimensionality
Piotr Indyk,Rajeev Motwani +1 more
TL;DR: In this paper, the authors present two algorithms for the approximate nearest neighbor problem in high-dimensional spaces, for data sets of size n living in R^d, which require space that is only polynomial in n and d.