Topic

Plagiarism detection

About: Plagiarism detection is a research topic. Over its lifetime, 1,790 publications have appeared within this topic, receiving 24,740 citations.


Papers
Journal Article
TL;DR: The reported work aims to explore syntax-semantic concept extractions with genetic algorithm in detecting cases of idea plagiarism, where the source ideas are plagiarized and represented in a summarized form.
Abstract: Plagiarism is increasingly becoming a major issue in the academic and educational domains. Automated and effective plagiarism detection systems are urgently needed to curtail this information breach, especially in tackling idea plagiarism. The proposed approach aims to detect such plagiarism cases, where the idea of a third party is adopted and presented intelligently so that at the surface level, plagiarism cannot be unmasked. The reported work aims to explore syntax-semantic concept extractions with a genetic algorithm in detecting cases of idea plagiarism. The work mainly focuses on idea plagiarism where the source ideas are plagiarized and represented in a summarized form. Plagiarism detection is employed at both the document and passage levels by exploiting the document concepts at various structural levels. Initially, the idea embedded within the given source document is captured using sentence level concept extraction with a genetic algorithm. Document level detection is facilitated with word-level concepts where syntactic information is extracted and the non-plagiarized documents are pruned. A combined similarity metric that utilizes the semantic level concept extraction is then employed for passage level detection. The proposed approach is tested on the PAN13-14 plagiarism corpus for summary obfuscation data, which represents a challenging case of idea plagiarism. The performance of the current approach and its variations is evaluated at both the document and passage levels, using information retrieval and PAN plagiarism measures respectively. The results are also compared against six top ranked plagiarism detection systems submitted as part of the PAN13-14 competition. The results obtained are found to exhibit significant improvement over the compared systems and hence reflect the potency of the proposed syntax-semantic based concept extractions in detecting idea plagiarism.

43 citations
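
The abstract above describes a two-stage flow: unrelated source documents are pruned at the document level, then the surviving candidates are compared at the passage level. The following is only a minimal sketch of that flow under stated assumptions. It does not implement the paper's genetic algorithm or its syntax-semantic concept extraction; word-set Jaccard overlap stands in for the document-level filter, and bag-of-words cosine similarity stands in for the combined passage metric. All function names and thresholds are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def passages(text):
    """Split text into sentence-like passages on ., ! or ? followed by space."""
    return [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]


def jaccard(a, b):
    """Word-set overlap, used here as a stand-in document-level filter."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    """Bag-of-words cosine similarity, a stand-in passage-level metric."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def detect(suspicious, sources, doc_threshold=0.1, passage_threshold=0.5):
    """Return (source name, suspicious passage, source passage, score) hits.

    Thresholds are illustrative assumptions, not values from the paper.
    """
    hits = []
    for name, text in sources.items():
        # Stage 1: prune source documents with too little word overlap.
        if jaccard(suspicious, text) < doc_threshold:
            continue
        # Stage 2: compare every passage pair of the surviving candidates.
        for susp_passage in passages(suspicious):
            for src_passage in passages(text):
                score = cosine(susp_passage, src_passage)
                if score >= passage_threshold:
                    hits.append((name, susp_passage, src_passage, score))
    return hits
```

Surface-level measures such as these would likely miss summarized ideas, which is the case the paper targets; the sketch is meant only to show how pruning and passage comparison fit together.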

01 Jan 2003
TL;DR: A comprehensive survey on natural language text copy detection is given, the developments of copy detection are introduced, and some key detection techniques are listed and compared with each other.
Abstract: Copy detection has important applications in both intellectual property protection and information retrieval. Currently, copy detection concentrates mainly on document copy detection. In the early days, document copy detection concentrated mainly on program plagiarism detection, and now most studies are on text copy detection. In this paper, a comprehensive survey on natural language text copy detection is given, and the development of copy detection is introduced. The approaches and features of a variety of existing text copy detection systems or prototypes are reviewed in detail. Then some key detection techniques are listed and compared with each other. In the end, the future trend of text copy detection is discussed.

43 citations

Proceedings Article
27 Jun 2018
TL;DR: This work presents the first plagiarism detection approach that combines the analysis of mathematical expressions, images, citations and text and demonstrates the usefulness of the hybrid detection and result visualization approaches by using HyPlag to analyze a confirmed case of content reuse present in a retracted research publication.
Abstract: Current plagiarism detection systems reliably find instances of copied and moderately altered text, but often fail to detect strong paraphrases, translations, and the reuse of non-textual content and ideas. To improve upon the detection capabilities for such concealed content reuse in academic publications, we make four contributions: i) We present the first plagiarism detection approach that combines the analysis of mathematical expressions, images, citations and text. ii) We describe the implementation of this hybrid detection approach in the research prototype HyPlag. iii) We present novel visualization and interaction concepts to aid users in reviewing content similarities identified by the hybrid detection approach. iv) We demonstrate the usefulness of the hybrid detection and result visualization approaches by using HyPlag to analyze a confirmed case of content reuse present in a retracted research publication.

43 citations

01 Jan 2009
TL;DR: Kolmogorov Complexity measures are introduced as a way of extracting structural information from texts for Intrinsic Plagiarism Detection and more sophisticated compression algorithms which are suited to compressing the English language show great promise for feature extraction for various text classification problems.
Abstract: We introduce Kolmogorov Complexity measures as a way of extracting structural information from texts for Intrinsic Plagiarism Detection. Kolmogorov complexity measures have been used as features in a variety of machine learning tasks including image recognition, radar signal classification, EEG classification, DNA analysis, speech recognition and some text classification tasks (Chi and Kong, 1998; Zhang, Hu, and Jin, 2003; Bhattacharya, 2000; Menconi, Benci, and Buiatti, 2008; Frank, Chui, and Witten, 2000; Dalkilic et al., 2006; Seaward and Saxton, 2007; Seaward, Inkpen, and Nayak, 2008). Intrinsic Plagiarism detection uses no external corpus for document comparison and thus plagiarism must be detected solely on the basis of style shifts within the text to be analyzed. Given the small amount of text to be analyzed, feature extraction is of particular importance. We give a theoretical background as to why complexity measures are meaningful and we introduce some experimental results on the PAN'09 Intrinsic Plagiarism Corpus. We show complexity features based on the Lempel-Ziv compression algorithm slightly increase performance over features based on normalized counts. Furthermore we believe that more sophisticated compression algorithms which are suited to compressing the English language show great promise for feature extraction for various text classification problems.

43 citations
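
The abstract above uses compression-based complexity as a feature for spotting style shifts inside a single document. A rough sketch of that idea, under stated assumptions, follows. It is not the paper's feature set: Python's `zlib` (DEFLATE, which combines LZ77 with Huffman coding) is used as a convenient stand-in for a Lempel-Ziv complexity estimate, and the window size, step, z-score cutoff, and function names are all hypothetical.

```python
import zlib


def compression_ratio(text):
    """Compressed size divided by raw size; lower means more redundant text."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw, 9)) / len(raw)


def window_ratios(text, window=200, step=100):
    """Compression ratio for each sliding character window, with its offset."""
    last_start = max(len(text) - window, 0)
    return [
        (start, compression_ratio(text[start:start + window]))
        for start in range(0, last_start + 1, step)
    ]


def outlier_windows(text, window=200, step=100, z=2.0):
    """Windows whose ratio deviates from the document mean by more than z
    standard deviations. The cutoff is an illustrative assumption."""
    ratios = window_ratios(text, window, step)
    values = [r for _, r in ratios]
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    if std == 0:
        return []
    return [(s, r) for s, r in ratios if abs(r - mean) / std > z]
```

Because short windows carry a fixed compression overhead, ratios from very small windows are noisy; in practice such a feature would be combined with others rather than used alone.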


Network Information
Related Topics (5)
Active learning
42.3K papers, 1.1M citations
78% related
The Internet
213.2K papers, 3.8M citations
77% related
Software development
73.8K papers, 1.4M citations
77% related
Graph (abstract data type)
69.9K papers, 1.2M citations
76% related
Deep learning
79.8K papers, 2.1M citations
76% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    59
2022    126
2021    83
2020    118
2019    130
2018    125