Topic
Plagiarism detection
About: Plagiarism detection is a research topic. Over its lifetime, 1790 publications have appeared within this topic, receiving 24740 citations.
Papers published on a yearly basis
Papers
TL;DR: The terminology on plagiarism is fluid, somewhat ambiguous, and still emerging, and it may take time for the terms to settle more clearly, concretely, and exhaustively.
Abstract: The terminology on plagiarism is not hard and fast. It is fluid, somewhat ambiguous, and still emerging. It may take some time for the terms to settle more clearly, concretely, and exhaustively. This paper aims to provide a terminological discussion of some important and current concepts related to plagiarism. It discusses key terms/concepts such as copyright, citation cartels, citing vs. quoting, compulsive thief, cryptomnesia, data fakery, ignorance of laws and codes of ethics, information literacy, lack of training, misattribution, fair use clause, paraphrasing, plagiarism, plagiarism detection software, publish or perish syndrome, PubPeer, retraction, retraction vs. correction, Retraction Watch, salami publication, similarity score, Society for Scientific Values, and source attribution. The explanation and definition of these terms/concepts can be useful for LIS scholars and professionals in their efforts to fight plagiarism. We expect this terminology to be referred to in future discussions of the topic and to help improve communication between the actors involved.
11 citations
TL;DR: Social and educational aspects of source code plagiarism in an academic environment are discussed, an overview of software tools for source code similarity detection is presented, and results show that 5–10% of students plagiarized their solutions.
Abstract: Computing education usually involves intensive practical training through laboratory exercises, programming projects, and homework assignments. Those assignments are frequent targets for plagiarism. In this paper, we discuss social and educational aspects of source code plagiarism in an academic environment, and present an overview of software tools for source code similarity detection. We present our experiences with the JPlag, Moss, and SPD tools, and compare them using simulated plagiarism based on programming assignment solutions produced after 1, 2, 4, and 8 hours of work on a baseline version, using more than 20 types of lexical and structural modifications that students use to hide plagiarism. We also compare results of the selected tools on real-life student programming solutions from three different courses. The courses were attended by 100 to 300 students, and the programming assignment solutions varied in size and complexity from 50 to 1000 lines of source code. The results show that 5–10% of students plagiarized their solutions. In our experience, JPlag and Moss proved to be effective tools for plagiarism detection, as they clearly indicated cases of similarity which were manually confirmed by human code inspection.
11 citations
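Tools like JPlag and Moss compare token streams rather than raw text, so that renaming variables or reformatting code does not hide copying. A minimal sketch of that idea, using identifier-normalized token k-grams and Jaccard similarity (the real tools use more robust matching such as greedy string tiling or winnowing fingerprints; the lexer and parameters here are illustrative assumptions):

```python
import re

def tokens(src):
    # Crude lexer: split into identifiers/keywords, numbers, and single
    # punctuation characters. Every identifier is normalized to the same
    # "ID" token so that renaming variables does not hide plagiarism.
    out = []
    for tok in re.findall(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]", src):
        out.append("ID" if re.match(r"[A-Za-z_]", tok) else tok)
    return out

def kgram_set(src, k=4):
    # All contiguous k-grams of the normalized token stream.
    ts = tokens(src)
    return {tuple(ts[i:i + k]) for i in range(len(ts) - k + 1)}

def similarity(a, b, k=4):
    # Jaccard index over the two k-gram sets, in [0, 1].
    sa, sb = kgram_set(a, k), kgram_set(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

Because identifiers are normalized, a copy that only renames variables (e.g. `sum` to `total`, `a[i]` to `b[j]`) scores 1.0, while unrelated code scores near 0.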
TL;DR: All manuscripts submitted to Biochemia Medica are now first assigned to a Research Integrity Editor (RIE) before being sent for peer review, in order to implement the CrossCheck plagiarism detection service.
Abstract: In February 2013, Biochemia Medica joined CrossRef, which enabled us to implement the CrossCheck plagiarism detection service. Therefore, all manuscripts submitted to Biochemia Medica are now first assigned to a Research Integrity Editor (RIE) before the manuscript is sent for peer review. The RIE submits the text for CrossCheck analysis and is responsible for reviewing the results of the text similarity analysis. Based on the CrossCheck results, the RIE then provides a recommendation to the Editor-in-Chief (EIC) on whether the manuscript should be forwarded to peer review, corrected for suspected parts prior to peer review, or immediately rejected. The final decision on the manuscript, however, rests with the EIC. We hope that our new policy and manuscript processing algorithm will help us further increase the overall quality of our Journal.
11 citations
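The triage step described above amounts to a threshold rule on the similarity report. A sketch of such a rule, where the cutoff values and outcome labels are purely illustrative assumptions and not Biochemia Medica's actual editorial policy (in practice the RIE also inspects which passages match, not just the overall score):

```python
def triage(similarity_pct, reject_at=40, review_at=15):
    # Map an overall text-similarity percentage to one of the three
    # editorial recommendations described in the abstract.
    # NOTE: the thresholds 40 and 15 are hypothetical placeholders.
    if similarity_pct >= reject_at:
        return "reject"
    if similarity_pct >= review_at:
        return "correct suspected parts before peer review"
    return "forward to peer review"
```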
01 Dec 2011
TL;DR: This paper presents an index-based method for n-gram extraction from large collections, using common data structures such as a B+-tree and a hash table, and demonstrates the method's scalability in experiments on a gigabyte-scale collection.
Abstract: N-grams are applied in applications that search text documents, especially in cases when one must work with phrases, e.g. in plagiarism detection. An n-gram is a sequence of n terms (or, more generally, tokens) from a document. We obtain the set of n-grams by moving a floating window from the beginning to the end of the document. During extraction we must remove duplicate n-grams and store additional values for each n-gram type, e.g. the n-gram type frequency for each document; the details depend on the query model used. Previous works utilize a sorting algorithm to compute n-gram frequencies. These approaches must handle a large number of identical n-grams, resulting in high time and space overhead. Moreover, these techniques are often main-memory only, meaning they can be executed only on small or medium-sized collections. In this paper, we present an index-based method for n-gram extraction from large collections. This method utilizes common data structures such as a B+-tree and a hash table. We show the scalability of our method by presenting experiments on a gigabyte-scale collection.
11 citations
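The floating-window extraction with deduplication and per-type frequency counting described above can be sketched in a few lines. This is an in-memory sketch using a hash table; the paper's contribution is making the same idea work at scale, e.g. with a disk-based B+-tree, for collections that do not fit in main memory:

```python
from collections import defaultdict

def extract_ngrams(terms, n=3):
    # Slide a window of n terms over the document. The hash table both
    # deduplicates n-grams and accumulates each distinct n-gram type's
    # frequency in a single pass, avoiding the sort-based approach's
    # overhead of materializing every duplicate.
    freq = defaultdict(int)
    for i in range(len(terms) - n + 1):
        freq[tuple(terms[i:i + n])] += 1
    return dict(freq)
```

For example, `extract_ngrams("to be or not to be".split(), 2)` yields four distinct bigram types, with `("to", "be")` occurring twice.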
01 Nov 2017
TL;DR: This paper proposes a method for detecting plagiarism in source code using deep features, obtained with a character-level Recurrent Neural Network (char-rnn) pre-trained on Linux Kernel source code.
Abstract: This paper proposes a method for detecting plagiarism in source code using deep features. The embeddings for programs are obtained using a character-level Recurrent Neural Network (char-rnn) pre-trained on Linux Kernel source code. Many popular plagiarism detection tools are based on n-gram techniques at the syntactic level. However, these approaches fail to capture long-term dependencies (non-contiguous interactions) present in the source code. In contrast, the proposed deep features capture non-contiguous interactions within n-grams. The features are generic, and there is no need to fine-tune the char-rnn model for the program submissions of each individual problem set. Our experiments show the effectiveness of the deep features in the task of classifying assignment submissions as copy, partial copy, or non-copy. Comparing the proposed features with handcrafted features (source code metrics and textual features), we report f1-score improvements of 9.5% for binary classification and 5% for three-way classification.
11 citations
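Once each program is mapped to a fixed-length feature vector, plagiarism scoring reduces to comparing vectors. The sketch below is not the paper's method: it substitutes a hashed character n-gram vector for the char-rnn embedding (training a recurrent network is out of scope here), keeping only the character-level, comparison-by-vector structure of the approach. The vector dimension and n-gram length are arbitrary choices:

```python
import math

def char_vector(src, n=5, dim=256):
    # Hash each character n-gram into one of `dim` buckets and count.
    # A stand-in for a learned embedding: unlike a char-rnn, this
    # captures only local (contiguous) character context.
    vec = [0.0] * dim
    for i in range(len(src) - n + 1):
        vec[hash(src[i:i + n]) % dim] += 1.0
    return vec

def cosine(a, b):
    # Cosine similarity between two feature vectors, in [0, 1] for
    # non-negative count vectors; 0.0 if either vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```

A pair of submissions would then be flagged when the cosine similarity of their vectors exceeds a tuned threshold; the paper instead feeds its embeddings to a classifier that distinguishes copy, partial copy, and non-copy.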