scispace - formally typeset
Topic

Plagiarism detection

About: Plagiarism detection is a research topic. Over its lifetime, 1,790 publications have been published on this topic, receiving 24,740 citations.


Papers
01 Jan 2011
TL;DR: In this article, each suspicious document is divided into a series of consecutive, potentially overlapping "windows" of equal size, represented by vectors containing the relative frequencies of a predetermined set of high-frequency character trigrams.
Abstract: In this paper, we describe a novel approach to intrinsic plagiarism detection. Each suspicious document is divided into a series of consecutive, potentially overlapping 'windows' of equal size. These are represented by vectors containing the relative frequencies of a predetermined set of high-frequency character trigrams. Subsequently, a distance matrix is set up in which each of the document's windows is compared to each other window. The distance measure used is a symmetric adaptation of the normalized distance (nd1) proposed by Stamatatos (17). Finally, an algorithm for outlier detection in multivariate data (based on Principal Components Analysis) is applied to the distance matrix in order to detect plagiarized sections. In the PAN-PC-2011 competition, this system (second place) achieved a competitive recall (.4279) but only reached a plagdet of .1679 due to a disappointing precision (.1075).

34 citations
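
The windowing and trigram-profile steps described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`windows`, `profile`, `distance`, `distance_matrix`, `flag_outliers`), the window sizes, and `top_k` are hypothetical; the trigram set is taken from the document itself rather than predetermined; the distance is only an assumed symmetric form in the spirit of nd1; and a simple mean-plus-k-standard-deviations rule stands in for the PCA-based outlier detection.

```python
from collections import Counter
from statistics import mean, pstdev


def char_trigrams(text):
    """All overlapping character trigrams of a string."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def windows(text, size, step):
    """Consecutive, potentially overlapping windows of equal size."""
    return [text[i:i + size] for i in range(0, len(text) - size + 1, step)]


def profile(window, vocab):
    """Relative frequency of each vocabulary trigram inside one window."""
    counts = Counter(char_trigrams(window))
    total = max(len(window) - 2, 1)
    return [counts[g] / total for g in vocab]


def distance(p, q):
    """Assumed symmetric normalized distance in [0, 1] (not the paper's exact formula)."""
    if not p:
        return 0.0
    total = 0.0
    for a, b in zip(p, q):
        if a + b > 0:  # trigrams absent from both windows contribute nothing
            total += (2 * (a - b) / (a + b)) ** 2
    return total / (4 * len(p))


def distance_matrix(text, size=200, step=100, top_k=50):
    """Compare every window with every other window."""
    # Assumption: the high-frequency trigram set comes from the document itself.
    vocab = [g for g, _ in Counter(char_trigrams(text)).most_common(top_k)]
    profiles = [profile(w, vocab) for w in windows(text, size, step)]
    return [[distance(p, q) for q in profiles] for p in profiles]


def flag_outliers(matrix, k=1.0):
    """Stand-in for PCA-based outlier detection: flag windows whose mean
    distance to all windows exceeds the overall mean by k standard deviations."""
    row_means = [mean(row) for row in matrix]
    threshold = mean(row_means) + k * pstdev(row_means)
    return [i for i, m in enumerate(row_means) if m > threshold]
```

Calling `flag_outliers(distance_matrix(text))` returns the indices of windows whose style deviates most from the rest of the document under these assumptions; the threshold `k` would need tuning on real data.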

Proceedings Article
01 May 2010
TL;DR: A newly developed large-scale corpus of artificial plagiarism is presented, useful for the evaluation of intrinsic as well as external plagiarism detection.
Abstract: The simple access to texts on digital libraries and the World Wide Web has led to an increased number of plagiarism cases in recent years, which renders manual plagiarism detection infeasible at large. Various methods for automatic plagiarism detection have been developed whose objective is to assist human experts in the analysis of documents for plagiarism. The methods can be divided into two main approaches: intrinsic and external. Unlike other tasks in natural language processing and information retrieval, it is not possible to publish a collection of real plagiarism cases for evaluation purposes since they cannot be properly anonymized. Therefore, current evaluations found in the literature are incomparable and, very often, not even reproducible. Our contribution in this respect is a newly developed large-scale corpus of artificial plagiarism useful for the evaluation of intrinsic as well as external plagiarism detection. Additionally, new detection performance measures tailored to the evaluation of plagiarism detection algorithms are proposed.

34 citations

Journal Article
TL;DR: This work presents a data augmentation strategy and a multi-cascaded model for improved paraphrase detection in short texts and shows that it produces a comparable or state-of-the-art performance on all three benchmark datasets.
Abstract: Paraphrase detection is an important task in text analytics with numerous applications such as plagiarism detection, duplicate question identification, and enhanced customer support helpdesks. Deep models have been proposed for representing and classifying paraphrases. These models, however, require large quantities of human-labeled data, which is expensive to obtain. In this work, we present a data augmentation strategy and a multi-cascaded model for improved paraphrase detection in short texts. Our data augmentation strategy considers the notions of paraphrases and non-paraphrases as binary relations over the set of texts. Subsequently, it uses graph theoretic concepts to efficiently generate additional paraphrase and non-paraphrase pairs in a sound manner. Our multi-cascaded model employs three supervised feature learners (cascades) based on CNN and LSTM networks with and without soft-attention. The learned features, together with hand-crafted linguistic features, are then forwarded to a discriminator network for final classification. Our model is both wide and deep and provides greater robustness across clean and noisy short texts. We evaluate our approach on three benchmark datasets and show that it produces a comparable or state-of-the-art performance on all three.

34 citations
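
One plausible reading of the graph-based augmentation described in the abstract above is sketched below. It is an assumption, not the authors' published procedure: paraphrase is treated as an equivalence relation (so labeled pairs are closed transitively), and a non-paraphrase label between two texts is propagated to every pair drawn from their two paraphrase groups. The function name `augment` and the pair representation are hypothetical.

```python
from itertools import combinations


def augment(paraphrases, non_paraphrases):
    """Return (positive_pairs, negative_pairs) as sets of frozensets.

    Assumption: paraphrase is an equivalence relation, so texts connected
    through paraphrase edges form groups whose members all paraphrase
    each other.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for a, b in paraphrases:
        union(a, b)
    for a, b in non_paraphrases:
        find(a)  # register texts that only occur in negative pairs
        find(b)

    # Collect the connected components of the paraphrase graph.
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)

    # Every pair inside a component is an inferred paraphrase pair.
    positives = set()
    for members in groups.values():
        positives.update(frozenset(p) for p in combinations(sorted(members), 2))

    # A non-paraphrase edge between two components is extended to all
    # cross-component pairs.
    negatives = set()
    for a, b in non_paraphrases:
        ra, rb = find(a), find(b)
        if ra == rb:
            continue  # contradictory label; skipped in this sketch
        for x in groups[ra]:
            for y in groups[rb]:
                negatives.add(frozenset((x, y)))
    return positives, negatives
```

For example, from the labels a~b, b~c and "c is not a paraphrase of d", this sketch infers the extra positive pair (a, c) and the extra negative pairs (a, d) and (b, d).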

Journal Article
TL;DR: Two leading plagiarism detection tools, TurnItIn and MyDropBox, are contrasted in detecting submissions that were obviously plagiarized from articles published in IEEE journals.
Abstract: Several tools are marketed to the educational community for plagiarism detection and prevention. This article briefly contrasts the performance of two leading tools, TurnItIn and MyDropBox, in detecting submissions that were obviously plagiarized from articles published in IEEE journals. Both tools performed poorly because they do not compare submitted writings to publications in the IEEE database. Moreover, these tools do not cover the Association for Computing Machinery (ACM) database or several others important for scholarly work in software engineering. Reports from these tools suggesting that a submission has "passed" can encourage false confidence in the integrity of a submitted writing. Additionally, students can submit drafts to determine the extent to which these tools detect plagiarism in their work. Because the tool samples the engineering professional literature narrowly, the student who chooses to plagiarize can use this tool to determine what plagiarism will be invisible to the faculty member. An appearance of successful plagiarism prevention may in fact reflect better training of students to avoid plagiarism detection.

34 citations

Proceedings Article
23 May 2018
TL;DR: This work proposes an adaptive, scalable, and extensible image-based plagiarism detection approach suitable for analyzing the wide range of image similarities observed in academic documents; it can complement other content-based feature analysis approaches to retrieve potential source documents for suspiciously similar content from large collections.
Abstract: Identifying plagiarized content is a crucial task for educational and research institutions, funding agencies, and academic publishers. Plagiarism detection systems available for productive use reliably identify copied text, or near-copies of text, but often fail to detect disguised forms of academic plagiarism, such as paraphrases, translations, and idea plagiarism. To improve the detection capabilities for disguised forms of academic plagiarism, we analyze the images in academic documents as text-independent features. We propose an adaptive, scalable, and extensible image-based plagiarism detection approach suitable for analyzing a wide range of image similarities that we observed in academic documents. The proposed detection approach integrates established image analysis methods, such as perceptual hashing, with newly developed similarity assessments for images, such as ratio hashing and position-aware OCR text matching. We evaluate our approach using 15 image pairs that are representative of the spectrum of image similarity we observed in alleged and confirmed cases of academic plagiarism. We embed the test cases in a collection of 4,500 related images from academic texts. Our detection approach achieved a recall of 0.73 and a precision of 1. These results indicate that our image-based approach can complement other content-based feature analysis approaches to retrieve potential source documents for suspiciously similar content from large collections. We provide our code as open source to facilitate future research on image-based plagiarism detection.

34 citations
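
To give a feel for the perceptual hashing mentioned in the abstract above, here is a minimal sketch of one common variant, the average hash, compared via Hamming distance. Whether the paper uses this particular variant is not stated in the abstract, so treat it as an assumption; the function names are hypothetical, and images are represented as plain lists of grayscale rows at least `hash_size` pixels wide and tall.

```python
def average_hash(pixels, hash_size=8):
    """Average hash of a grayscale image given as a list of pixel rows.

    Assumes the image is at least hash_size x hash_size pixels.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    # Downsample to hash_size x hash_size by averaging rectangular blocks.
    for r in range(hash_size):
        r0, r1 = r * height // hash_size, (r + 1) * height // hash_size
        for c in range(hash_size):
            c0, c1 = c * width // hash_size, (c + 1) * width // hash_size
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            cells.append(sum(block) / len(block))
    # One bit per cell: 1 if the cell is brighter than the overall mean.
    overall = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > overall)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")
```

A small Hamming distance between two hashes suggests visually similar images. Uniform brightness shifts leave this hash unchanged, while stronger edits such as cropping or rotation generally call for other similarity measures.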


Network Information
Related Topics (5)
Active learning
42.3K papers, 1.1M citations
78% related
The Internet
213.2K papers, 3.8M citations
77% related
Software development
73.8K papers, 1.4M citations
77% related
Graph (abstract data type)
69.9K papers, 1.2M citations
76% related
Deep learning
79.8K papers, 2.1M citations
76% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    59
2022    126
2021    83
2020    118
2019    130
2018    125