
Plagiarism detection

About: Plagiarism detection is a research topic. Over its lifetime, 1,790 publications have been published within this topic, receiving 24,740 citations.


Papers
Journal ArticleDOI
TL;DR: A substantial level of plagiarism via duplicate publications in the three analyzed predatory journals is found, further diluting credible scientific literature and risking the ability to synthesize evidence accurately to inform practice.
Abstract: PURPOSE This study compared three known predatory nursing journals to determine the percentage of content among them that was plagiarized or duplicated. A serendipitous finding of several instances of plagiarism via duplicate publications, made during the random analysis of articles in a study examining the quality of articles published in predatory journals, prompted this investigation. DESIGN The study utilized a descriptive, comparative design. All articles in each journal (n = 296 articles) from inception (volume 1, number 1) through May 1, 2017, were analyzed. METHODS Each article was evaluated and scored electronically for similarity using an electronic plagiarism detection tool. Articles were then individually reviewed, and exact and near-exact matches (90% or greater plagiarized content) were paired. Articles with plagiarism scores below 70% were randomly sampled, and an in-depth search for matches of partial content in other journals was conducted. Descriptive statistics were used to summarize the data. FINDINGS The extent and direction of duplication from one given journal to another was established. Changes made in subsequent publications, potentially intended to obscure the plagiarism, were also identified. There were 100 (68%) exact and near-exact matches in the paired articles. The time lapse between the original and duplicate publication ranged from 0 to 63 months, with a mean of 27.2 months (SD = 19.68). Authors were from 26 countries, including the United States, Turkey, Iran, and several African nations. Articles with similarity scores in the range of 20% to 70% included possible similarities in content or research plagiarism, but not to the extent of the exact or near-exact matches. The majority of the articles (n = 94) went from Journal A or C to Journal B, although four articles were first published in Journal B and then Journal A.
CONCLUSIONS This study found a substantial level of plagiarism via duplicate publications in the three analyzed predatory journals, further diluting credible scientific literature and risking the ability to synthesize evidence accurately to inform practice. Editors should continue to use electronic plagiarism detection tools. Education about publishing misconduct for editors and authors is a high priority. CLINICAL RELEVANCE Both contributors and consumers of nursing literature rely on integrity in publication. Authors expect appropriate credit for their scholarly contributions without unethical and unauthorized duplication of their work. Readers expect current information from original authors, upon which they can make informed practice decisions.
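The similarity thresholds in the study above (90% for exact or near-exact matches, 70% and 20% as lower bands) lend themselves to a simple illustration. The sketch below is hypothetical and is not the study's actual detection tool: it scores word-trigram Jaccard overlap between two texts and buckets the result using those thresholds.

```python
# Hypothetical sketch (not the study's tool): score document similarity
# with word-trigram Jaccard overlap, then bucket pairs using the 90%
# (exact/near-exact) and 70%/20% thresholds described in the abstract.

def trigrams(text):
    """Set of word trigrams from a text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def similarity(a, b):
    """Jaccard overlap of the two texts' trigram sets, in [0, 1]."""
    ga, gb = trigrams(a), trigrams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def classify(score):
    if score >= 0.90:
        return "exact or near-exact match"
    if score >= 0.70:
        return "substantial overlap"
    if score >= 0.20:
        return "possible partial plagiarism"
    return "no match"

original = "plagiarism dilutes credible scientific literature and risks practice"
duplicate = "plagiarism dilutes credible scientific literature and risks practice"
print(classify(similarity(original, duplicate)))  # identical texts score 1.0
```

Real detection tools use far more robust matching (stemming, fingerprinting, cross-database search), but the bucketing step works the same way.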

21 citations

Journal ArticleDOI
TL;DR: BPlag, as discussed by the authors, applies symbolic execution to analyze execution behavior and represents a program in a novel graph-based format, then detects plagiarism by comparing these graphs and evaluating similarity scores.
Abstract: Source code plagiarism is a long-standing issue in tertiary computer science education. Many source code plagiarism detection tools have been proposed to aid in the detection of source code plagiarism. However, existing detection tools are not robust to pervasive plagiarism-hiding transformations and can be inaccurate in the detection of plagiarised source code. This article presents BPlag, a behavioural approach to source code plagiarism detection. BPlag is designed to be both robust to pervasive plagiarism-hiding transformations and accurate in the detection of plagiarised source code. Greater robustness and accuracy are afforded by analyzing the behavior of a program, as behavior is perceived to be the aspect of a program least affected by plagiarism-hiding transformations. BPlag applies symbolic execution to analyze execution behavior and represents a program in a novel graph-based format. Plagiarism is then detected by comparing these graphs and evaluating similarity scores. BPlag is evaluated for robustness, accuracy and efficiency against five commonly used source code plagiarism detection tools. It is shown that BPlag is more robust to plagiarism-hiding transformations and more accurate in the detection of plagiarised source code, but less efficient than the compared tools.
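The graph-comparison step can be illustrated with a toy example; the edge-set representation and Jaccard similarity below are assumptions for illustration, not BPlag's actual graph model or scoring function. The point is that if two programs' behaviours reduce to the same graph, renaming identifiers or reordering statements alone cannot lower the score.

```python
# Illustrative sketch only (not BPlag's algorithm): model a program's
# behaviour as a set of (operation, operation) dependency edges and
# score similarity as Jaccard overlap of the edge sets. Identifier
# renaming does not change the edges, so it cannot hide plagiarism here.

def graph_similarity(g1, g2):
    """Jaccard overlap of two behaviour graphs given as edge lists."""
    e1, e2 = set(g1), set(g2)
    if not e1 and not e2:
        return 1.0
    return len(e1 & e2) / len(e1 | e2)

# Two renamed but behaviourally identical programs yield identical edges;
# a program with genuinely different behaviour shares none of them.
prog_a = [("read", "mul"), ("mul", "write")]
prog_b = [("read", "mul"), ("mul", "write")]   # renamed copy of prog_a
prog_c = [("read", "add"), ("add", "write")]   # different computation
print(graph_similarity(prog_a, prog_b))  # 1.0
print(graph_similarity(prog_a, prog_c))  # 0.0
```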

21 citations

Book ChapterDOI
23 Sep 2013
TL;DR: The first corpus for the evaluation of Arabic intrinsic plagiarism detection is introduced, consisting of 1024 artificial suspicious documents in which 2833 plagiarism cases have been inserted automatically from source documents.
Abstract: The present paper introduces the first corpus for the evaluation of Arabic intrinsic plagiarism detection. The corpus consists of 1024 artificial suspicious documents in which 2833 plagiarism cases have been inserted automatically from source documents.
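How plagiarism cases might be inserted automatically can be sketched as follows; the splicing function below is a hypothetical illustration, not the corpus's documented construction procedure.

```python
# Hedged sketch of building an artificial suspicious document: splice a
# passage from a source document into a host document at a word offset
# and record the inserted span as the ground-truth plagiarism case.
# (The Arabic corpus's actual construction procedure is not given here.)

def insert_plagiarism(host, source_passage, offset):
    """Return the suspicious text and the (start, end) character span
    of the inserted passage within it."""
    words = host.split()
    prefix = " ".join(words[:offset])
    suffix = " ".join(words[offset:])
    suspicious = f"{prefix} {source_passage} {suffix}".strip()
    start = len(prefix) + 1 if prefix else 0
    return suspicious, (start, start + len(source_passage))

host_doc = "original sentence one original sentence two"
passage = "COPIED PASSAGE FROM SOURCE"
doc, span = insert_plagiarism(host_doc, passage, 3)
print(doc)
print(doc[span[0]:span[1]])  # recovers exactly the inserted passage
```

Recording the span is what makes the corpus usable for evaluation: a detector's reported spans can be scored against the known ground truth.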

21 citations

01 Jan 2012
TL;DR: CitePlag, an open-source prototype of a citation-based plagiarism detection system, is presented; it evaluates the citations of academic documents as language-independent markers to detect plagiarism.
Abstract: This paper presents an open-source prototype of a citation-based plagiarism detection system called CitePlag. The underlying idea of the system is to evaluate the citations of academic documents as language independent markers to detect plagiarism. CitePlag uses three different detection algorithms that analyze the citation sequence of academic documents for similar patterns that may indicate unduly used foreign text or ideas. The algorithms consider multiple citation-related factors such as proximity and order of citations within the text, or their probability of co-occurrence in order to compute document similarity scores. We present technical details of CitePlag’s detection algorithms and the acquisition of test data from the PubMed Central Open Access Subset. Future advancement of the prototype lies in increasing the reference database by enabling the system to process more document and citation formats. Improving CitePlag’s detection algorithms and scoring functions to reduce the number of false positives is another major goal. Eventually, we plan to integrate text-based detection algorithms in addition to the citation-based detection algorithms within CitePlag.
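One way to compare citation sequences for similar patterns in order, as described above, is a longest-common-subsequence score; the sketch below is an assumption in the spirit of that description, not one of CitePlag's actual three algorithms.

```python
# Illustrative citation-order comparison (not CitePlag's implementation):
# score two documents by the longest common subsequence of their citation
# sequences, normalised by the shorter sequence's length. Shared citation
# order is a language-independent signal, since no text is compared.

def lccs(seq_a, seq_b):
    """Length of the longest common (not necessarily contiguous)
    subsequence of two citation sequences, via dynamic programming."""
    m, n = len(seq_a), len(seq_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if seq_a[i] == seq_b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

def citation_similarity(seq_a, seq_b):
    if not seq_a or not seq_b:
        return 0.0
    return lccs(seq_a, seq_b) / min(len(seq_a), len(seq_b))

# Hypothetical citation keys; three citations appear in the same order.
doc1 = ["Smith2001", "Lee2004", "Kim2008", "Rao2010"]
doc2 = ["Lee2004", "Kim2008", "Rao2010", "Park2012"]
print(citation_similarity(doc1, doc2))  # 0.75
```

Proximity and co-occurrence probability, which the abstract also mentions, would add further weighting on top of a raw order score like this.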

21 citations

01 Jan 1998
TL;DR: IDENTIFIED is a system designed to assist with the extraction of count-based metrics from source code and with the development of authorship models using statistical and machine-learning approaches, enabling the analyst to use analysis procedures such as case-based reasoning.
Abstract: Software forensics is the use of authorship analysis techniques to analyse computer programs for a legal or official purpose. This generally consists of plagiarism detection and malicious code analysis. IDENTIFIED is a system that has been designed to assist with the extraction of count-based metrics from source code, and with the development of models of authorship using statistical and machine-learning approaches. Software forensic models can be used for identification, classification, characterisation, and intent analysis. One of the more promising methods for identification is case-based reasoning, where samples of code can be compared to those collected from known authors.

1 SOFTWARE FORENSICS
The frequency and severity of the many forms of computer-based attacks such as viruses and worms, logic bombs, trojan horses, computer fraud, and plagiarism of software code (both object and source) have all become increasingly prevalent and costly for many organisations and individuals involved with information systems. Part of the difficulty experienced in collecting evidence regarding the attack or theft in such situations has been the definition of appropriate measurements to use in models of authorship and the development of appropriate models from these metrics. Several options for developing such models for identification, discrimination, characterisation, and intent analysis exist, including subjective expert opinion (including fuzzy logic as in [Kilgour, et al., 1997]) and statistical and machine learning models using formally defined metrics. Each offers its own set of advantages and disadvantages for the task of software forensics. With the difficulties of data collection and the goal of increasing the accessibility of such modelling techniques in mind, a system called IDENTIFIED is being developed. It is intended to assist with the task of software forensics, which is defined here to be the use of software code authorship analysis for legal or official purposes [Gray, et al., 1997].

IDENTIFIED uses combinations of wildcards and special characters to define count-based metrics, allows for hierarchical meta-metric definitions, automates much of the file handling task, extracts metric values from source code, and assists with the analysis and modelling process. In particular, IDENTIFIED will enable the analyst to use several analysis procedures such as case-based reasoning. It is hoped that the availability of such tools will encourage more detailed research into this area of ever-increasing importance.

2 SOFTWARE FORENSICS
Source code is the textual form of a computer program that is written by a computer programmer in a computer programming language. These programming languages can in some respects be treated as a form of language from a linguistic perspective, or more precisely as a series of languages of particular types, but within some common family. In the same manner that written text can be analysed for evidence of authorship (such as [Sallis, 1994]), computer programs can also be examined from a forensics or linguistics viewpoint [Sallis, et al., 1996] for information regarding the program's authorship. Figure 1 [from Gray et al., 1997] shows two small code fragments that were written in C++ by two separate programmers. Both programs provide the same functionality (calculating the mathematical function factorial(n), normally written as n!) from the users' perspective; that is to say, the same inputs will generate the same outputs for each of these programs. As should be apparent, each programmer has solved the same problem, that of calculating the factorial of an input, in both a different manner (algorithm) and with a different style exhibited in his or her code. These stylistic differences include the use of comments, variable names, use of white space, indentation, and the levels of readability in each function. These fragments are obviously far too short to make any substantial claims about the feasibility of using source code characteristics to make statements regarding the author(s). However, they do illustrate the fact that programmers writing programs will often do so in a significantly different manner to another programmer, without any instruction to do so. Both of these functions were written in the natural styles of their respective authors.

// Factorial takes an integer as an input and returns
// the factorial of the input
// This routine does not deal with negative values!
int Factorial (int Input)
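The idea of count-based metric extraction can be sketched with a few simple counts; the metric definitions below are illustrative assumptions, not IDENTIFIED's wildcard-defined metrics.

```python
# Hypothetical sketch of count-based metric extraction in the spirit of
# IDENTIFIED: a few simple counts over a source file that could feed an
# authorship model. These metric definitions are illustrative only.

def count_metrics(source):
    """Return a dict of simple stylistic counts for one source file."""
    lines = source.splitlines()
    return {
        "lines": len(lines),
        "comment_lines": sum(1 for l in lines if l.lstrip().startswith("//")),
        "blank_lines": sum(1 for l in lines if not l.strip()),
        "avg_line_length": sum(len(l) for l in lines) / max(len(lines), 1),
    }

# A small C++ fragment in the style of the paper's factorial example
# (the function body here is our own illustration, not Figure 1's code).
fragment = """\
// Factorial takes an integer as an input and returns
// the factorial of the input
int Factorial(int Input)
{
    return (Input <= 1) ? 1 : Input * Factorial(Input - 1);
}"""
print(count_metrics(fragment))
```

An authorship model would compare such metric vectors across files: comment density, whitespace habits, and identifier conventions tend to stay stable for a given programmer.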

21 citations


Network Information
Related Topics (5)
Active learning
42.3K papers, 1.1M citations
78% related
The Internet
213.2K papers, 3.8M citations
77% related
Software development
73.8K papers, 1.4M citations
77% related
Graph (abstract data type)
69.9K papers, 1.2M citations
76% related
Deep learning
79.8K papers, 2.1M citations
76% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    59
2022    126
2021    83
2020    118
2019    130
2018    125