scispace - formally typeset

Plagiarism detection

About: Plagiarism detection is a research topic. Over its lifetime, 1,790 publications have been published within this topic, receiving 24,740 citations.


Papers
Journal ArticleDOI
10 Oct 2016
TL;DR: Plagiarism was a common occurrence among manuscripts submitted for publication to a major American specialty medical journal and most manuscripts with plagiarized material were submitted from countries in which English was not an official language.
Abstract: Plagiarism is common and threatens the integrity of the scientific literature. However, its detection is time-consuming and difficult, presenting challenges to editors and publishers who are entrusted with ensuring the integrity of the published literature. In this study, the extent of plagiarism in manuscripts submitted to a major specialty medical journal was documented. We manually curated submitted manuscripts and deemed that an article contained plagiarism if at least one sentence had 80% of its words copied from another published paper. Commercial plagiarism detection software was also used, and its settings were optimized. Among 400 consecutively submitted manuscripts, 17% of submissions contained unacceptable levels of plagiarized material, and 82% of the plagiarized manuscripts were submitted from countries where English was not an official language. Using the most commonly employed commercial plagiarism detection software, sensitivity and specificity were studied with regard to the generated plagiarism score. The cutoff score maximizing both sensitivity and specificity was 15% (sensitivity 84.8%, specificity 80.5%). Plagiarism was thus a common occurrence among manuscripts submitted for publication to a major American specialty medical journal, and most manuscripts with plagiarized material were submitted from countries in which English was not an official language. The use of commercial plagiarism detection software can be optimized by selecting a cutoff score that reflects the desired sensitivity and specificity.
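The cutoff-selection idea in this abstract, picking the plagiarism score that best balances sensitivity and specificity, can be sketched as a small search over candidate cutoffs. This is a hypothetical illustration: the scores, labels, and function names below are invented for the sketch, not the study's actual data or tooling.

```python
# Hypothetical sketch: choose a plagiarism-score cutoff that balances
# sensitivity and specificity, in the spirit of the study above.

def sensitivity_specificity(scores, labels, cutoff):
    """labels: True means the manuscript actually contains plagiarism;
    a score at or above the cutoff is flagged as plagiarized."""
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if not y and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= cutoff)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec

def best_cutoff(scores, labels, candidates):
    # Maximize sensitivity + specificity (equivalent to Youden's J).
    return max(candidates,
               key=lambda c: sum(sensitivity_specificity(scores, labels, c)))
```

With manually curated labels as ground truth, sweeping `best_cutoff` over the software's similarity scores reproduces the kind of trade-off the study reports (e.g. a 15% cutoff giving sensitivity 84.8% and specificity 80.5%).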

38 citations

Journal ArticleDOI
TL;DR: The taxonomy of machine learning-based binary code analysis is provided, the recent advances and key findings on the topic are described, and the thoughts for future directions on this topic are presented.
Abstract: Binary code analysis is crucial in various software engineering tasks, such as malware detection, code refactoring, and plagiarism detection. With the rapid growth of software complexity and the increasing number of heterogeneous computing platforms, binary analysis is particularly critical and more important than ever. Traditionally adopted techniques for binary code analysis face multiple challenges, such as the need for cross-platform analysis, high scalability and speed, and improved fidelity, to name a few. To meet these challenges, machine learning-based binary code analysis frameworks have attracted substantial attention due to their automated feature extraction and the drastically reduced manual effort they require on large-scale programs. In this paper, we provide a taxonomy of machine learning-based binary code analysis, describe the recent advances and key findings on the topic, and discuss the key challenges and opportunities. Finally, we present our thoughts on future directions for this topic.

38 citations

Proceedings ArticleDOI
19 Jul 2010
TL;DR: The aim of this PhD thesis is to address three of the main problems in developing better models for automatic plagiarism detection: the adequate identification of good potential sources for a given suspicious text, the detection of plagiarism despite modifications, and the generation of standard collections of plagiarism and text-reuse cases.
Abstract: Plagiarism, the unacknowledged reuse of text, has increased in recent years due to the large amount of text readily available. For instance, recent studies claim that nowadays a high rate of student reports include plagiarism, making manual plagiarism detection practically infeasible. Automatic plagiarism detection tools assist experts in analysing documents for plagiarism. Nevertheless, the lack of standard collections with cases of plagiarism has prevented accurate comparison between models, making differences hard to appreciate. Seminal efforts on the detection of text reuse [2] have fostered the composition of standard resources for the accurate evaluation and comparison of methods. The aim of this PhD thesis is to address three of the main problems in the development of better models for automatic plagiarism detection: (i) the adequate identification of good potential sources for a given suspicious text; (ii) the detection of plagiarism despite modifications, such as word substitution and paraphrasing (special stress is given to cross-language plagiarism); and (iii) the generation of standard collections of cases of plagiarism and text reuse in order to provide a framework for accurate comparison of models. Regarding difficulties (i) and (ii), we have carried out preliminary experiments over the METER corpus [2]. Given a suspicious document dq and a collection of potential source documents D, the process is divided into two steps. First, a small subset of potential source documents D* in D is retrieved. The documents d in D* are the most related to dq and, therefore, the most likely to contain the sources of its plagiarised fragments. We performed this stage on the basis of the Kullback-Leibler distance, computed over a subsample of the documents' vocabularies. Afterwards, a detailed analysis is carried out comparing dq to every d in D* in order to identify potential cases of plagiarism and their sources.
This comparison was made on the basis of word n-grams, considering n = {2, 3}. These n-gram levels are flexible enough to properly retrieve plagiarised fragments and their sources despite modifications [1]. The result is offered to the user, who makes the final decision. Further experiments were done in both stages in order to compare other similarity measures, such as the cosine measure, the Jaccard coefficient, and diverse fingerprinting and probabilistic models. One of the main weaknesses of currently available models is that they are unable to detect cross-language plagiarism. Approaching the detection of this kind of plagiarism is highly relevant, as most published information is written in English, and authors writing in other languages may find it attractive to make use of direct translations. Our experiments, carried out over parallel and comparable corpora, show that "standard" cross-language information retrieval models are not enough. In fact, if the analysed source and target languages are related in some way (common linguistic ancestors or technical vocabulary), a simple comparison based on character n-grams seems to be a good option. However, in those cases where the relation between the languages involved is weaker, other models, such as those based on statistical machine translation, are necessary [3]. We plan to perform further experiments, mainly to approach the detection of cross-language plagiarism. To do so, we will use the corpora developed under the framework of the PAN competition on plagiarism detection (cf. PAN@CLEF: http://pan.webis.de). Models that consider cross-language thesauri and comparison of cognates will also be applied.
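The second stage described in this abstract, comparing a suspicious document against each retrieved candidate via word n-grams with n = {2, 3}, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's actual implementation; the function names and the containment measure are choices made for the sketch.

```python
# Minimal sketch of n-gram comparison between a suspicious text and a
# candidate source, as in the second stage above. Tokenization here is a
# plain whitespace split; a real system would normalize punctuation etc.

def word_ngrams(text, n):
    """Set of word n-grams of the text (lowercased, whitespace-tokenized)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_containment(suspicious, source, n=3):
    """Fraction of the suspicious text's n-grams that also occur in the
    candidate source; high values suggest copied or lightly edited text."""
    susp = word_ngrams(suspicious, n)
    if not susp:
        return 0.0
    return len(susp & word_ngrams(source, n)) / len(susp)

src = "plagiarism is the unacknowledged reuse of text"
susp = "plagiarism is the unacknowledged reuse of somebody's text"
score = ngram_containment(susp, src, n=2)  # high overlap despite an insertion
```

Bigrams tolerate small edits (here a single inserted word) better than exact sentence matching, which is why low n-gram orders are used to recover modified plagiarised fragments.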

37 citations

Journal ArticleDOI
TL;DR: It is observed that using the online compiler and the plagiarism detection tool reduces the time and effort needed to assess programming assignments, deters the authors' students from plagiarizing, and increases their success in their programming-based Data Structures course.
Abstract: In this study, an online compiler and a source code plagiarism detection tool have been integrated into the Moodle based distance education system of our Computer Engineering department. For this purpose the Moodle system has been extended with the GCC compiler and the Moss source code plagiarism detection tool. We observed that using the online compiler and the plagiarism detection tool reduces the time and effort needed for the assessment of programming assignments, deters our students from plagiarizing, and increases their success in their programming-based Data Structures course. © 2014 Wiley Periodicals, Inc. Comput Appl Eng Educ 23:363–373, 2015; View this article online at wileyonlinelibrary.com/journal/cae; DOI 10.1002/cae.21606

37 citations

Journal ArticleDOI
TL;DR: Regular usage of professional plagiarism detection tools for similarity checks with critical interpretation by the editorial team at the pre-review stage will certainly help in reducing the menace of plagiarism in submitted manuscripts.
Abstract: Plagiarism is one of the most serious forms of scientific misconduct prevalent today and is an important reason for a significant proportion of manuscript rejections and retractions of published articles. It is time for the medical fraternity to unanimously adopt a 'zero tolerance' policy towards this menace. While the responsibility for ensuring a plagiarism-free manuscript primarily lies with the authors, editors cannot absolve themselves of their accountability. The only way for an author to write a plagiarism-free manuscript is to write the article in his/her own words, literally and figuratively. This article discusses the various types of plagiarism, reasons for the increasingly reported instances of plagiarism, the pros and cons of plagiarism detection tools, and the role of authors and editors in preventing plagiarism in a submitted manuscript. Regular use of professional plagiarism detection tools for similarity checks, with critical interpretation by the editorial team at the pre-review stage, will certainly help in reducing the menace of plagiarism in submitted manuscripts.

37 citations


Network Information
Related Topics (5)
Active learning: 42.3K papers, 1.1M citations (78% related)
The Internet: 213.2K papers, 3.8M citations (77% related)
Software development: 73.8K papers, 1.4M citations (77% related)
Graph (abstract data type): 69.9K papers, 1.2M citations (76% related)
Deep learning: 79.8K papers, 2.1M citations (76% related)
Performance
Metrics
No. of papers in the topic in previous years:
Year | Papers
2023 | 59
2022 | 126
2021 | 83
2020 | 118
2019 | 130
2018 | 125