Topic

Plagiarism detection

About: Plagiarism detection is a research topic. Over its lifetime, 1,790 publications have been published within this topic, receiving 24,740 citations.


Papers
Journal Article
TL;DR: This work considers, for the first time, imbalanced data as a crucial parameter of the problem, experiments with various balancing techniques, and combines features and imbalanced-dataset treatment with various classification methods.
Abstract: The ever-increasing volume of information due to the widespread use of computers and the web has made effective plagiarism detection methods a necessity. Plagiarism can be found in many settings and forms: in literature, in academic papers, even in programming code. Intrinsic plagiarism detection is the task that deals with the discovery of plagiarized passages in a text document by identifying the stylistic changes and inconsistencies within the document itself, given that no reference corpus is available. The main idea consists in profiling the style of the original author and marking the passages that seem to differ significantly. In this work, we follow a supervised machine learning classification approach. We consider, for the first time, the fact of imbalanced data as a crucial parameter of the problem and experiment with various balancing techniques. Apart from this, we propose some novel stylistic features. We combine our features and imbalanced dataset treatment with various classification methods. Our detection system is tested on the data corpora of the PAN Webis intrinsic plagiarism detection shared tasks. It is compared to the best-performing detection systems on these datasets and surpasses the best resulting scores.

9 citations
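
The abstract above names dataset balancing as a key step but does not say which techniques were used. As one common option, the following is a minimal Python sketch of random oversampling, which duplicates minority-class samples until every class is as large as the majority class. The function name `random_oversample`, the two toy stylistic features, and the labels are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes
    have as many samples as the largest class."""
    rng = random.Random(seed)

    # Group the samples by class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    target = max(len(xs) for xs in by_class.values())

    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw (with replacement) as many extra samples as the class is short.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Hypothetical toy data: (average word length, type-token ratio) per passage.
samples = [(4.1, 0.52), (4.3, 0.50), (4.0, 0.55), (4.2, 0.51), (5.9, 0.71)]
labels = ["original", "original", "original", "original", "plagiarized"]

balanced_x, balanced_y = random_oversample(samples, labels)
print(Counter(balanced_y))
```

Oversampling would normally be applied to the training split only, so that duplicated passages do not leak into the evaluation data.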

Book Chapter
06 Sep 2010
TL;DR: The deep duplicate recognizer is combined with two shallow duplicate recognizers in order to guarantee a high recall for texts which are not fully parsable and increases precision considerably in comparison to traditional shallow methods.
Abstract: Identifying duplicate texts is important in many areas like plagiarism detection, information retrieval, text summarization, and question answering. Current approaches are mostly surface-oriented (or use only shallow syntactic representations) and see each text only as a token list. In this work however, we describe a deep, semantically oriented method based on semantic networks which are derived by a syntactico-semantic parser. Semantically identical or similar semantic networks for each sentence of a given base text are efficiently retrieved by using a specialized index. In order to detect many kinds of paraphrases the semantic networks of a candidate text are varied by applying inferences: lexico-semantic relations, relation axioms, and meaning postulates. Important phenomena occurring in difficult duplicates are discussed. The deep approach profits from background knowledge, whose acquisition from corpora is explained briefly. The deep duplicate recognizer is combined with two shallow duplicate recognizers in order to guarantee a high recall for texts which are not fully parsable. The evaluation shows that the combined approach preserves recall and increases precision considerably in comparison to traditional shallow methods.

9 citations

Proceedings Article
10 Jul 2012
TL;DR: A framework is introduced that identifies online plagiarism by exploiting lexical, syntactic and semantic features, including duplication-gram, reordering and alignment of words, POS and phrase tags, and semantic similarity of sentences.
Abstract: In this paper, we introduce a framework that identifies online plagiarism by exploiting lexical, syntactic and semantic features, including duplication-gram, reordering and alignment of words, POS and phrase tags, and semantic similarity of sentences. We establish an ensemble framework to combine the predictions of each model. Results demonstrate that our system can not only find a considerable number of real-world online plagiarism cases but also outperforms several state-of-the-art algorithms and commercial software.

9 citations
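
The idea of combining several feature-based scores into one prediction can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's system: word-trigram overlap and a longest-common-subsequence ratio stand in for the lexical and word-order features, the ensemble is a plain weighted average, and all function names and weights are hypothetical.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams in a text (lowercased, whitespace split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_overlap(suspect, source, n=3):
    """Fraction of the suspect's n-grams that also occur in the source."""
    a, b = word_ngrams(suspect, n), word_ngrams(source, n)
    if not a:
        return 0.0
    return len(a & b) / len(a)


def lcs_ratio(suspect, source):
    """Longest common subsequence of words, relative to the suspect's length.
    Unlike n-gram overlap, this tolerates inserted or deleted words."""
    x, y = suspect.lower().split(), source.lower().split()
    if not x:
        return 0.0
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if xi == yj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / len(x)


def ensemble_score(suspect, source, weights=(0.5, 0.5)):
    """Weighted average of the individual scores (weights are an assumption)."""
    scores = (ngram_overlap(suspect, source), lcs_ratio(suspect, source))
    return sum(w * s for w, s in zip(weights, scores))


identical = ensemble_score("the cat sat on the mat", "the cat sat on the mat")
unrelated = ensemble_score("a b c d", "e f g h")
edited = ensemble_score("the cat quietly sat on the mat", "the cat sat on the mat")
print(identical, unrelated, edited)
```

A learned combiner, such as a classifier trained on the individual scores, could replace the fixed weights; the average is used here only to keep the sketch short.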

Journal Article
TL;DR: This paper focuses on practical assignments (projects) and written documents that students submit to a college or university; its algorithm divides submitted articles into small pieces and scans them for comparison with databases connected to the server over the internet.
Abstract: In today's world, copying something from other sources and claiming it as one's own contribution is a crime. It is also a major problem in academia, where students at the UG, PG or even PhD level copy parts of original documents and publish them under their own name without proper permission from the author or developer. Many software tools exist to assist with the monotonous and time-consuming task of tracing plagiarism, because identifying the owner of an entire text is difficult, if not impossible, for markers. In this presentation we focus on practical assignments (projects) as well as written documents that students submit to a college or university. Because this task is crucial and research in different fields grows day by day, people in industry and academia are demanding software that detects whether submitted articles, books, and national or international papers are genuine. In this paper, our algorithm divides submitted articles into small pieces and scans them for comparison with databases connected to the server over the internet. Some existing work compares submitted articles with previously submitted articles, i.e. with an existing database.

9 citations
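
The divide-and-compare step described in the abstract above can be sketched as follows. This is a minimal Python illustration, not the authors' algorithm: it assumes the "small pieces" are overlapping four-word windows and that the reference database is a small in-memory dictionary instead of databases reached over the internet. The names `shingles`, `build_index` and `match_report`, and the sample documents, are hypothetical.

```python
from collections import defaultdict


def shingles(text, size=4):
    """Split a text into overlapping windows of `size` consecutive words."""
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def build_index(documents, size=4):
    """Map every piece found in the reference documents to the ids of the
    documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for piece in shingles(text, size):
            index[piece].add(doc_id)
    return index


def match_report(submission, index, size=4):
    """For each reference document, return the fraction of the submission's
    pieces that also occur in that document."""
    pieces = shingles(submission, size)
    hits = defaultdict(int)
    for piece in pieces:
        for doc_id in index.get(piece, ()):
            hits[doc_id] += 1
    total = len(pieces) or 1  # avoid division by zero for very short texts
    return {doc_id: count / total for doc_id, count in hits.items()}


# Hypothetical reference database and submission.
documents = {
    "doc_a": "plagiarism detection compares a submission against known sources",
    "doc_b": "the weather today is sunny and warm",
}
index = build_index(documents)
report = match_report(
    "plagiarism detection compares a submission against known sources", index
)
print(report)
```

Exact matching of word windows only catches verbatim copying; paraphrased passages would need the kind of semantic or stylistic comparison discussed in the other entries on this page.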


Network Information
Related Topics (5)
Active learning
42.3K papers, 1.1M citations
78% related
The Internet
213.2K papers, 3.8M citations
77% related
Software development
73.8K papers, 1.4M citations
77% related
Graph (abstract data type)
69.9K papers, 1.2M citations
76% related
Deep learning
79.8K papers, 2.1M citations
76% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    59
2022    126
2021    83
2020    118
2019    130
2018    125