
Plagiarism detection

About: Plagiarism detection is a research topic. Over the lifetime, 1790 publications have been published within this topic receiving 24740 citations.


Papers
Journal ArticleDOI
TL;DR: A novel method for detecting likely portions of reused text that recognizes common plagiarist actions such as word deletion, insertion, and transposition, and represents the identified reused text by a set of features denoting its degree of plagiarism, relevance, and fragmentation.
Abstract: An important task in plagiarism detection is determining and measuring similar text portions between a given pair of documents. One of the main difficulties of this task resides in the fact that reused text is commonly modified with the aim of covering or camouflaging the plagiarism. Another difficulty is that not all similar text fragments are examples of plagiarism, since thematic coincidences also tend to produce portions of similar text. In order to tackle these problems, we propose a novel method for detecting likely portions of reused text. This method is able to detect common actions performed by plagiarists, such as word deletion, insertion, and transposition, allowing it to obtain plausible portions of reused text. We also propose representing the identified reused text by means of a set of features that denote its degree of plagiarism, relevance, and fragmentation. This new representation aims to facilitate the recognition of plagiarism by considering diverse characteristics of the reused text during the classification phase. Experimental results employing a supervised classification strategy showed that the proposed method outperforms traditionally used approaches.
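The alignment idea can be sketched with Python's difflib, which tolerates the word insertions and deletions the abstract mentions (the paper's method additionally handles transpositions). This is a minimal illustration, not the authors' algorithm; the function and parameter names are my own.

```python
# Hedged sketch: find likely reused word spans between two texts,
# surviving small edits (inserted and deleted words). Not the paper's
# method; difflib's monotone alignment stands in for it here.
from difflib import SequenceMatcher

def reused_portions(source: str, suspicious: str, min_len: int = 3):
    """Return word spans shared by the two texts despite small edits."""
    a, b = source.lower().split(), suspicious.lower().split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [" ".join(a[m.a:m.a + m.size])
            for m in matcher.get_matching_blocks()
            if m.size >= min_len]

orig = "the experiment was repeated three times to reduce measurement error in the data"
copy = "the experiment was carefully repeated three times to reduce error in the data"
print(reused_portions(orig, copy))
```

Despite one inserted word ("carefully") and one deleted word ("measurement"), the three surrounding reused spans are still recovered. A classifier, as in the paper, would then score such spans by features like length and fragmentation.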

35 citations

Journal ArticleDOI
01 Oct 2016
TL;DR: This work compares content-based and citation-based approaches to plagiarism detection, evaluating whether they are complementary and whether their combination can improve detection quality, and concludes that combining the methods can be beneficial.
Abstract: The vast amount of scientific publications available online makes it easier for students and researchers to reuse text from other authors and harder to check the originality of a given text. Reusing text without crediting the original authors is considered plagiarism. A number of studies have reported the prevalence of plagiarism in academia. As a consequence, numerous institutions and researchers are dedicated to devising systems to automate the process of checking for plagiarism. This work focuses on the problem of detecting text reuse in scientific papers. The contributions of this paper are twofold: (a) we survey the existing approaches for plagiarism detection based on content, based on content and structure, and based on citations and references; and (b) we compare content-based and citation-based approaches with the goal of evaluating whether they are complementary and whether their combination can improve the quality of the detection. We carry out experiments with real datasets of scientific papers and conclude that a combination of the methods can be beneficial.

35 citations

Proceedings ArticleDOI
01 Oct 2010
TL;DR: A plagiarism detection tool named CCS (Code Comparison System), based on the Abstract Syntax Tree (AST), which performs well in code comparison and can help protect the copyright of source code.
Abstract: Code comparison technology plays a very important part in plagiarism detection and software evaluation. Software plagiarism mainly appears as copy-and-paste, possibly followed by small modifications that do not change the function of the code, such as renaming methods or variables or reordering statements. This paper introduces a plagiarism detection tool named CCS (Code Comparison System), which is based on the Abstract Syntax Tree (AST). According to the syntax tree's characteristics, CCS calculates hash values for the trees, transforms their storage forms, and then compares them node by node, which improves efficiency. Moreover, CCS preprocesses a large amount of source code in its database for potential use, which also accelerates plagiarism detection. CCS additionally takes special measures to reduce mistakes when calculating the hash values of operations such as subtraction and division. It performs well in the code comparison field and can help protect the copyright of source code.
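The core idea of hashing AST subtrees so that renamed identifiers still collide can be sketched in Python (the abstract does not say which languages CCS targets; this is an illustration in the spirit of the approach, not its implementation, and all names here are my own):

```python
# Hedged sketch of AST-subtree hashing: fingerprint every statement
# subtree with identifiers normalised, so renaming variables or
# methods (a change the abstract lists) does not change the hash.
import ast
import hashlib

def subtree_hashes(source: str) -> set:
    """Hash every statement-level subtree of the parsed program."""
    hashes = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.stmt):
            dump = ast.dump(node, annotate_fields=False)
            # Normalise identifiers: renamed variables hash the same.
            for name in {n.id for n in ast.walk(node)
                         if isinstance(n, ast.Name)}:
                dump = dump.replace(repr(name), "'_'")
            hashes.add(hashlib.sha1(dump.encode()).hexdigest())
    return hashes

original = "total = 0\nfor x in data:\n    total = total + x\n"
suspect = "s = 0\nfor item in values:\n    s = s + item\n"
print(len(subtree_hashes(original) & subtree_hashes(suspect)))
```

Although every identifier differs between the two snippets, all of their statement subtrees hash identically, which is the kind of structural match a copy with superficial renaming produces.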

34 citations

Journal Article
TL;DR: This study examines pedagogical approaches for reducing plagiarism alongside Turnitin, an online plagiarism detection service used to detect text copied from electronic documents available through the Internet and other electronic sources.
Abstract: Plagiarism is an increasing problem in high schools and universities. To address the issue of how to teach students not to plagiarize, this study examined several pedagogical approaches for reducing plagiarism and the use of Turnitin, an online plagiarism detection software. The study found a significant difference between the control group and one instructional treatment group that was reflected in the reduced level of plagiarized text. This finding indicates that the lack of knowledge in proper documentation and paraphrasing is a primary reason why some students plagiarize, albeit perhaps inadvertently. Implications point to the need for consistent in-depth instruction in proper quotation, citation, and paraphrasing techniques.

Keywords: plagiarism, pedagogy, paraphrasing, documentation, Turnitin

Introduction

Studies on various forms of academic dishonesty such as cheating on examinations and plagiarism have appeared in academic journals for over 60 years. The rates of student cheating reported in these studies ranged from 23% in 1941 as reported by Drake to 59% in 1964 (Hetherington & Feldman) to 76% in 1992 (Davis, Grover, Becker, & McGregor). Similar to cheating, plagiarism is a growing problem. According to a 1999 Center for Academic Integrity survey that included over 12,000 students on 48 different college campuses, 10% of the students admitted to using other people's ideas and words without proper citation (McCabe, Trevino, & Butterfield, 2001). In a later study (McCabe, 2005), 40% of the students surveyed admitted to plagiarism. Research has shown that plagiarism in the form of copying text from electronic documents available through the Internet and other electronic sources is an increasing problem in universities as well as high schools (Larkham & Manns, 2002; McCabe, Trevino & Butterfield, 2001). The rising trend of plagiarism has been attributed to several factors.
With high speed Internet available in dormitories and computer labs, students can use search engines easily to find relevant electronic sources from which they plagiarize. Internet sites that proffer complete manuscripts on a wide variety of topics - over 100 such sites at last count according to TechTrends magazine (Talab, 2004) - make obtaining entire assignment papers as easy as copying and pasting. While some of these sites charge for downloaded material, others do not. In addition to using entire papers or portions of papers from these "paper mill" Websites, students also use other sites on the Internet. Many universities provide students access to innumerable electronic journal, newspaper, and magazine articles and other documents through the use of databases. Powerful databases such as ProQuest, EBSCOhost, LexisNexis and others offer full-text articles on all topics of interest. While reputable use of these electronic documents is expected, text from these resources can easily be cut and pasted and presented in a student's paper as his or her own work. While the widely available and easily accessible electronic information sources may have facilitated plagiarism, students' attitude toward cheating is another contributing factor. In the Center for Academic Integrity survey (McCabe, 2005), 68% of the students surveyed believed using someone else's ideas, words, or sentences without acknowledging was not a serious problem. Pressure to obtain and keep good grades and stiff competition for admission into college and for jobs were reasons students often give for cheating (Fanning, 2005; Maramark & Maline, 1993). Furthermore, students' indifference toward academic integrity and the prevalent culture of cheating could be a reflection of the "broader sociopolitical context of corporate fraud" and the "underlying cultural nod toward getting ahead while getting away with unethical behavior" (Robinson-Zanartu, Pena, Cook-Morales, Pena, Afshani, & Nguyen, 2005, p. 320).
Besides student attitude, the lack of consistent enforcement of academic honesty policy by faculty members and university administration may have fostered a culture of cheating. …

34 citations

Book ChapterDOI
21 Mar 2010
TL;DR: A model for pre-selecting closely related documents so that the exhaustive comparison of texts can be restricted to a reduced set, with experiments showing that the noise introduced by the length encoding does not significantly reduce the expressiveness of the text.
Abstract: The automatic detection of shared content in written documents (which includes text reuse and, when unacknowledged, plagiarism) has become an important problem in Information Retrieval. This task requires exhaustive comparison of texts in order to determine how similar they are. However, such comparison is impossible in those cases where the number of documents is too high. Therefore, we have designed a model for the proper pre-selection of closely related documents in order to perform the exhaustive comparison afterwards. We use a similarity measure based on word-level n-grams, which has proved quite effective in many applications. As this approach is normally impracticable for real-world large datasets, we propose a method based on a preliminary word-length encoding of texts, substituting each word by its length, which provides three important advantages: (i) since the alphabet of the documents is reduced to nine symbols, the space needed to store n-gram lists is reduced; (ii) computation times are decreased; and (iii) length n-grams can be represented in a trie, allowing a more flexible and fast comparison. We show experimentally, on the basis of the perplexity measure, that the noise introduced by the length encoding does not significantly reduce the expressiveness of the text. The method is then tested on two large datasets of co-derivatives and simulated plagiarism.

34 citations


Network Information
Related Topics (5)
Active learning
42.3K papers, 1.1M citations
78% related
The Internet
213.2K papers, 3.8M citations
77% related
Software development
73.8K papers, 1.4M citations
77% related
Graph (abstract data type)
69.9K papers, 1.2M citations
76% related
Deep learning
79.8K papers, 2.1M citations
76% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    59
2022    126
2021    83
2020    118
2019    130
2018    125