
Plagiarism detection

About: Plagiarism detection is a research topic. Over its lifetime, 1,790 publications have been published within this topic, receiving 24,740 citations.


Papers
01 Jan 2015
TL;DR: A bilingual Persian-English sentence-aligned parallel corpus is combined with Wikipedia articles to build a plagiarism detection corpus based on parallel-corpus sentences.
Abstract: Plagiarism detection is the process of locating text reuse within a suspicious document. Plagiarism detection corpora are used for evaluating plagiarism detection systems. In this paper, we present a bilingual Persian-English plagiarism detection corpus. We provide our corpus for the text alignment corpus construction task of the PAN 2015 competition. Our approach is based on parallel corpus sentences. We have used a Persian-English sentence-aligned parallel corpus in combination with Wikipedia articles to create our corpus. Paired sentences in the parallel corpus have a similarity score between 0 and 1. We have used these similarity scores to establish the degree of obfuscation when constructing the plagiarism cases.
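
As a rough illustration of how sentence-pair similarity scores from a parallel corpus might be mapped to obfuscation levels when assembling plagiarism cases, here is a minimal Python sketch; the thresholds, field names, and example data are assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed thresholds and field names, not the authors' code):
# map a parallel-corpus sentence-pair similarity in [0, 1] to an obfuscation
# label and assemble cross-lingual plagiarism cases from aligned sentence pairs.

def obfuscation_level(similarity: float) -> str:
    """Higher similarity means a more literal translation, hence less obfuscation."""
    if similarity >= 0.8:
        return "none"
    if similarity >= 0.5:
        return "low"
    if similarity >= 0.3:
        return "medium"
    return "high"

# Aligned (Persian, English, similarity) triples; placeholder content.
aligned_pairs = [
    ("جمله‌ی نمونه‌ی اول", "A first sample sentence.", 0.92),
    ("جمله‌ی نمونه‌ی دوم", "A second sentence, loosely rendered.", 0.41),
]

plagiarism_cases = [
    {"source": fa, "suspicious": en, "obfuscation": obfuscation_level(score)}
    for fa, en, score in aligned_pairs
]
print(plagiarism_cases)
```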

12 citations

01 Jan 2014
TL;DR: This paper proposes two different solutions for relaxing the comparison of two documents so as to consider the semantic relations between them, and prepares a framework that allows different feature types and different strategies for merging the features to be combined.
Abstract: Text alignment is a sub-task of the plagiarism detection process. In this paper we discuss our approach to this problem. As in previous work, our approach maps text alignment to the problem of subsequence matching. We have prepared a framework which lets us combine different feature types and different strategies for merging the features. We have proposed two different solutions to relax the comparison of two documents so as to consider the semantic relations between them. Our first approach is based on defining a new feature type that contains semantic information about its corresponding document. In our second approach we propose a new method for comparing the features that takes their semantic relations into account. Finally, we apply the DBSCAN clustering algorithm to merge features lying in a neighborhood in both the source and suspicious documents. Our experiments indicate that different feature sets are suitable for detecting different types of plagiarism.
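
As an illustration of the final merging step described above, the sketch below clusters matched feature positions with DBSCAN so that nearby matches merge into candidate passages; the offset representation, eps, and min_samples are assumed values, not the paper's settings.

```python
# Illustrative sketch (not the authors' code): cluster matching feature positions
# with DBSCAN so that matches lying close together in the source/suspicious offset
# plane merge into one candidate aligned passage.
import numpy as np
from sklearn.cluster import DBSCAN

# Each row is (offset_in_source, offset_in_suspicious) for one matched feature,
# e.g. a shared word n-gram or a semantically related term pair.
matches = np.array([
    [100, 410], [112, 425], [130, 440],   # one dense region of matches
    [900, 1500], [915, 1510],             # another region
    [5000, 200],                          # isolated match -> treated as noise
])

labels = DBSCAN(eps=50, min_samples=2).fit_predict(matches)

# Matches sharing a cluster label form one aligned passage; label -1 marks noise.
for label in set(labels) - {-1}:
    cluster = matches[labels == label]
    src_span = (cluster[:, 0].min(), cluster[:, 0].max())
    susp_span = (cluster[:, 1].min(), cluster[:, 1].max())
    print(f"candidate passage: source {src_span}, suspicious {susp_span}")
```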

12 citations

Journal ArticleDOI
TL;DR: A TDW-matrix-based algorithm with three phases, rendering, filtering and verification, which receives an input web page and a threshold in its first phase and returns an optimal set of near-duplicate web pages in the verification phase after calculating similarity.
Abstract: The voluminous amount of web documents has weakened the performance and reliability of web search engines. The subsistence of near-duplicate data is an issue that accompanies the growing need to incorporate heterogeneous data. Web content mining faces huge problems due to the existence of duplicate and near-duplicate web pages. These pages either increase the index storage space or increase the serving costs, thereby irritating the users. Near-duplicate detection has been recognized as important in plagiarism detection, spam detection and focused web crawling scenarios. Here we propose a novel idea for finding near-duplicates of an input web page within a huge repository. We propose a TDW-matrix-based algorithm with three phases, rendering, filtering and verification: it receives an input web page and a threshold in its first phase, applies prefix filtering and positional filtering to reduce the size of the record set in the second phase, and returns an optimal set of near-duplicate web pages in the verification phase after calculating their similarity. The experimental results show that our algorithm performs well on two benchmark measures, precision and recall, while reducing the size of the competing record set.
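
The filter-then-verify pattern behind the second and third phases can be sketched as follows; this is a simplified illustration of prefix filtering plus Jaccard verification, not the TDW matrix algorithm itself, and the token ordering, threshold, and helper names are assumptions.

```python
# Simplified sketch of filter-then-verify near-duplicate search: prune candidates
# with prefix filtering, then verify the survivors with an exact Jaccard similarity.
import math

def prefix(tokens: list[str], threshold: float) -> set[str]:
    """Tokens at least one of which any record must share to reach the threshold."""
    ordered = sorted(tokens)  # stand-in for a global rarity ordering of tokens
    keep = len(ordered) - math.ceil(threshold * len(ordered)) + 1
    return set(ordered[:keep])

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b)

def near_duplicates(query: list[str], repository: dict[str, list[str]], threshold: float):
    query_prefix = prefix(query, threshold)
    # Filtering phase: keep only candidates whose prefix overlaps the query prefix.
    candidates = {
        url: tokens for url, tokens in repository.items()
        if query_prefix & prefix(tokens, threshold)
    }
    # Verification phase: compute the real similarity for the surviving candidates.
    return [
        (url, sim) for url, tokens in candidates.items()
        if (sim := jaccard(set(query), set(tokens))) >= threshold
    ]

repo = {"page-a": "the quick brown fox".split(),
        "page-b": "an unrelated document entirely".split()}
print(near_duplicates("the quick brown fox jumps".split(), repo, threshold=0.6))
```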

12 citations

01 Jan 2003
TL;DR: Describes investigations into two plagiarism detection tools, the widely used commercial service Turnitin and an in-house tool, Ferret; Turnitin is more useful for detecting plagiarism from web sources, while Ferret is better at detecting collusion within a group of students.
Abstract: One strategy in the prevention and detection of plagiarism and collusion is to use an automated detection tool. We argue that, for consistent treatment of students, we should be applying these tools to ALL written submissions in a given assignment rather than merely using a detection tool to confirm suspicions that a single text has been plagiarised. In this paper we describe our investigations into two plagiarism detection tools: the widely used commercial service Turnitin, and an in-house tool, Ferret. We conclude that there are technical and practical problems, first in the large-scale use of electronic submission of assignments and then in the further submission of these assignments to a plagiarism detector. Nevertheless, the reporting mechanisms of both tools are fast and easy to use. Turnitin is more useful for detecting plagiarism from web sources, while Ferret is better for detecting collusion within a group of students.

12 citations

Book ChapterDOI
15 Sep 2014
TL;DR: Proposes a probabilistic distribution model that represents each document as a feature set to increase the interpretability of the results and features, and introduces a distance measure to compute the distance between two feature sets.
Abstract: Authorship identification was introduced as one of the important problems in the law and journalism fields, and it is one of the major techniques in plagiarism detection. In this paper, to tackle the authorship verification problem, we propose a probabilistic distribution model that represents each document as a feature set, increasing the interpretability of the results and features. We also introduce a distance measure to compute the distance between two feature sets. Finally, we exploit a KNN-based approach and a dynamic feature selection method to detect the features that discriminate the author's writing style.
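
A toy sketch of such a verification pipeline is given below, assuming character-trigram probability distributions as the document representation, an L1 distance between distributions, and a nearest-neighbour decision rule; the paper's actual model, distance measure, and dynamic feature selection are not reproduced here.

```python
# Toy authorship-verification sketch (assumed representation and distance, not the
# paper's model): documents become probability distributions over character trigrams,
# and a nearest-neighbour vote decides whether the questioned document fits the
# known author's style.
from collections import Counter

def trigram_distribution(text: str) -> dict[str, float]:
    """Relative frequencies of character trigrams in the text."""
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def l1_distance(p: dict[str, float], q: dict[str, float]) -> float:
    """L1 distance between two sparse probability distributions."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

def same_author(known_docs: list[str], questioned: str, k: int = 3, cutoff: float = 1.2) -> bool:
    """Accept if the average distance to the k closest known documents is small enough.

    Assumes known_docs is non-empty; k and cutoff are illustrative values.
    """
    q_dist = trigram_distribution(questioned)
    distances = sorted(l1_distance(trigram_distribution(d), q_dist) for d in known_docs)
    nearest = distances[:k]
    return sum(nearest) / len(nearest) <= cutoff
```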

12 citations


Network Information
Related Topics (5)
Active learning: 42.3K papers, 1.1M citations, 78% related
The Internet: 213.2K papers, 3.8M citations, 77% related
Software development: 73.8K papers, 1.4M citations, 77% related
Graph (abstract data type): 69.9K papers, 1.2M citations, 76% related
Deep learning: 79.8K papers, 2.1M citations, 76% related
Performance Metrics
No. of papers in the topic in previous years

Year    Papers
2023    59
2022    126
2021    83
2020    118
2019    130
2018    125