scispace - formally typeset
Topic

Plagiarism detection

About: Plagiarism detection is a research topic. Over its lifetime, 1,790 publications have been published within this topic, receiving 24,740 citations.


Papers
Journal ArticleDOI
TL;DR: The results indicate that disciplinary differences do exist in the incidence of matching text, and that the more authors an article has, the more consecutive text-matching can be observed in their published works.

32 citations

Posted Content
TL;DR: This is the first systematic study of the basic features used in BCSA, leveraging interpretable feature engineering on a large-scale benchmark; it shows that a simple interpretable model with a few basic features can achieve results comparable to recent deep learning-based approaches.
Abstract: Binary code similarity analysis (BCSA) is widely used for diverse security applications such as plagiarism detection, software license violation detection, and vulnerability discovery. Despite the surging research interest in BCSA, it is significantly challenging to perform new research in this field for several reasons. First, most existing approaches focus only on the end results, namely, increasing the success rate of BCSA, by adopting uninterpretable machine learning. Moreover, they utilize their own benchmark sharing neither the source code nor the entire dataset. Finally, researchers often use different terminologies or even use the same technique without citing the previous literature properly, which makes it difficult to reproduce or extend previous work. To address these problems, we take a step back from the mainstream and contemplate fundamental research questions for BCSA. Why does a certain technique or a feature show better results than the others? Specifically, we conduct the first systematic study on the basic features used in BCSA by leveraging interpretable feature engineering on a large-scale benchmark. Our study reveals various useful insights on BCSA. For example, we show that a simple interpretable model with a few basic features can achieve a comparable result to that of recent deep learning-based approaches. Furthermore, we show that the way we compile binaries or the correctness of underlying binary analysis tools can significantly affect the performance of BCSA. Lastly, we make all our source code and benchmark public and suggest future directions in this field to help further research.
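The paper's point that a few basic, interpretable features can go a long way can be illustrated with a minimal sketch. The feature names (`n_instrs`, `n_calls`, etc.) and the mean-relative-closeness score below are illustrative assumptions, not the study's actual feature set or model:

```python
# Hypothetical sketch of interpretable BCSA: each binary function is summarized
# by a few simple counts, and similarity is the mean per-feature relative
# closeness. Feature names and the scoring rule are illustrative assumptions.

def basic_features(func):
    """Extract a small, interpretable feature vector from a function summary.

    `func` is a dict such as {"n_instrs": 120, "n_calls": 4, ...}.
    """
    keys = ("n_instrs", "n_calls", "n_branches", "n_consts")
    return [func.get(k, 0) for k in keys]

def similarity(f1, f2):
    """Mean relative closeness of two feature vectors, in [0, 1]."""
    score = 0.0
    for a, b in zip(f1, f2):
        hi = max(a, b)
        score += 1.0 if hi == 0 else min(a, b) / hi
    return score / len(f1)

# Two compilations of the "same" function should score close to 1.0.
fa = basic_features({"n_instrs": 120, "n_calls": 4, "n_branches": 10, "n_consts": 7})
fb = basic_features({"n_instrs": 118, "n_calls": 4, "n_branches": 11, "n_consts": 6})
print(round(similarity(fa, fb), 3))  # prints 0.937
```

A model like this is trivially interpretable: every term in the score traces back to one named feature, which is exactly the property the study contrasts with opaque learned embeddings.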

32 citations

Journal ArticleDOI
TL;DR: In this article, the authors examined a four-prong anti-plagiarism program and its impact on the incidence of plagiarism in a Post-Professional Doctor of Physical Therapy program.
Abstract: Maintaining academic integrity and preventing students from cheating and plagiarising academic work are challenges faced by higher education institutions. These areas have become even more problematic with the growth of the Internet and readily available information, which increase the temptation for students to copy and paste information directly into academic work. Institutions have turned to various strategies to mitigate these problems. This retrospective research study examined a four-prong anti-plagiarism programme and its impact on the incidence of plagiarism in a Post-Professional Doctor of Physical Therapy programme. The results showed that the combination of a structured education module related to plagiarism, Turnitin plagiarism detection software, implementation of policies and procedures, and support from the institution’s writing centre resulted in significant differences in the rate of plagiarism (P < .001) over the five-year period. The rate of plagiarism in year 1 (0.96%) was ...

32 citations

Proceedings ArticleDOI
01 Nov 2019
TL;DR: Among the compared models, the Recurrent Neural Network is, as expected, best suited for the paraphrase identification task; the authors also propose plagiarism detection as one of the areas where paraphrase identification can be effectively applied.
Abstract: Paraphrase Identification, or Natural Language Sentence Matching (NLSM), is one of the important and challenging tasks in Natural Language Processing, where the goal is to identify whether one sentence in a given pair is a paraphrase of the other. A paraphrase of a sentence conveys the same meaning, but its structure and word order vary. It is a challenging task because it is difficult to infer the proper context of a sentence given its short length, and devising similarity metrics for the inferred contexts of a pair of sentences is not straightforward either. At the same time, its applications are numerous. This work explores various machine learning algorithms to model the task and also applies different input encoding schemes. Specifically, we created models using Logistic Regression, Support Vector Machines, and different architectures of Neural Networks. Among the compared models, as expected, the Recurrent Neural Network (RNN) is best suited for our paraphrase identification task. Also, we propose that plagiarism detection is one of the areas where paraphrase identification can be effectively implemented.
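As a rough illustration of the feature-based end of the spectrum the abstract describes (before moving to SVMs or RNNs), a sentence pair can be reduced to simple overlap features and scored with a logistic function. The features and hand-set weights below are illustrative assumptions, not the paper's trained models:

```python
# Illustrative paraphrase-identification baseline (not the paper's models):
# represent a sentence pair by two overlap features and score them with a
# logistic function. A real system would learn the weights from labeled pairs
# (logistic regression / SVM / RNN, as in the paper).
import math

def pair_features(s1, s2):
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    jaccard = len(t1 & t2) / len(t1 | t2)                       # word overlap
    len_ratio = min(len(t1), len(t2)) / max(len(t1), len(t2))   # length agreement
    return jaccard, len_ratio

def paraphrase_score(s1, s2, w=(6.0, 2.0), b=-4.0):
    """Logistic score in (0, 1); the weights here are hand-set, not learned."""
    j, r = pair_features(s1, s2)
    z = w[0] * j + w[1] * r + b
    return 1.0 / (1.0 + math.exp(-z))

a = "the cat sat on the mat"
b = "the cat is sitting on the mat"
print(paraphrase_score(a, b) > 0.5)  # prints True
```

Bag-of-words features like these miss word order entirely, which is precisely the gap that motivates sequence models such as the RNN in the paper.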

32 citations

Journal ArticleDOI
TL;DR: The Croatian Medical Journal (CMJ) appointed a Research Integrity Editor in 2001, paving the way for the introduction of computer-based plagiarism detection.
Abstract: Plagiarism detection software has considerably affected the quality of scientific publishing. No longer is plagiarism detection done by chance or is the sole responsibility of the reviewer and reader (1). The Croatian Medical Journal (CMJ) appointed a Research Integrity Editor in 2001, which paved the way for the introduction of computer detection of plagiarism (2,3). The story began when Mladen Petrovecki and Lidija Bilic-Zulle, members of the CMJ Editorial Board, came up with the idea to measure the prevalence of and attitudes toward plagiarism in the scientific community, as a follow-up to their investigation on plagiarism among students (4,5). Together with Matko Marusic and Ana Marusic, Editors-in-Chief, and Vedran Katavic, Research Integrity Editor, they developed a procedure for detecting and preventing plagiarism using plagiarism detection software, which later became a standard (1,6). The study of research integrity started in the early 2000s at the Rijeka University School of Medicine as part of two consecutive projects supported by the Ministry of Science, Technology, and Sports. Even outside our small scientific community, the projects were recognized as valuable and obtained a Committee on Publication Ethics (COPE) grant in 2010. Membership in the CrossRef association (http://www.crossref.org/) and the introduction of CrossCheck (http://www.crossref.org/crosscheck/index.html), a unique web-service for detecting plagiarism in scientific publications, marked the beginning of a new era at the CMJ. In 2009, we started to systematically check all the submitted manuscripts. The plagiarism detection procedure consisted of automatic scanning of manuscripts using plagiarism detection software (eTBLAST and CrossCheck) and manual verification of manuscripts suspected of having been plagiarized (more than 10% text similarity).
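The screening step described above (automatic similarity scoring, with manual verification triggered above 10% text similarity) can be sketched with a simple shingle-overlap metric. The 3-word-shingle measure below is an illustrative stand-in for what services such as CrossCheck compute internally, not the CMJ's actual algorithm:

```python
# Rough sketch of similarity screening: score a submission against a candidate
# source and flag it for manual verification above a threshold (10% here, as
# in the CMJ procedure). The 3-word shingle metric is an illustrative
# assumption, not the actual metric used by CrossCheck or eTBLAST.

def shingles(text, n=3):
    """Set of overlapping n-word sequences in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def similarity_percent(submission, source, n=3):
    """Share of the submission's shingles that also appear in the source."""
    s1, s2 = shingles(submission, n), shingles(source, n)
    if not s1:
        return 0.0
    return 100.0 * len(s1 & s2) / len(s1)

def needs_manual_check(submission, source, threshold=10.0):
    return similarity_percent(submission, source) > threshold

sub = "methods were adapted from prior work and extended with new controls"
src = "methods were adapted from prior work in the field"
print(needs_manual_check(sub, src))  # prints True: high overlap, flag for review
```

Note that, as the abstract stresses, a score like this only triggers manual reading; the percentage by itself says nothing about whether the overlap is legitimate (e.g., properly cited methods text).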
The criteria for plagiarism were set according to the prior investigations carried out by Bilic-Zulle et al (4,5) and Segal et al (7), and the definition of redundant publication used by the British Medical Journal (8). Manual verification (reading of both manuscripts) was done according to COPE's flowcharts (9) and the CMJ's Guidelines for Authors. Over two years, we detected 85 manuscripts (11%) containing plagiarized parts (8% true plagiarism and 3% self-plagiarism) (6). CrossCheck is an excellent service for detecting plagiarism, which detected almost all plagiarized manuscripts in our study. eTBLAST was less informative, possibly because at the time of the investigation it could only compare the text with abstracts from the Medline database (today eTBLAST searches abstracts in Medline, PubMed Central, Clinical Trials, Wikipedia, and other databases outside the field of medicine). If a suspected case of copy/paste activity was found, the investigator wrote a plagiarism report to the Editorial Board to assist in deciding on the manuscript’s status. Editors mostly accepted the suggestions and, in case of disagreement, the final decision lay with the Research Integrity Editor. Cases of blatant plagiarism were easy to deal with because of text similarity in all sections of the manuscript, while those with less text similarity were sometimes more complicated, and COPE’s flowcharts were not sufficient to conclude whether the manuscript was plagiarized. Special attention was paid to plagiarism in the Results section. Also, there was zero tolerance for plagiarism in the Discussion section. When manuscripts contained plagiarism in the Materials and Methods section, or when the original article was not cited in follow-up investigations, whether accidentally or through ignorance, authors were given an opportunity to rewrite the text and publish their investigation.
These examples once again show that it is of genuine importance for editors to become educators, ie, to teach authors about standards in publishing and research through continuing education (10). We believe that the main reasons for plagiarizing were unawareness of research integrity policies, poor English proficiency, attitudes toward plagiarism, and cultural values (6,11-13). In Croatia, the situation could further deteriorate under a new law on science, higher education, and universities that abolishes the Committee for Ethics in Science and Higher Education, the highest national body dealing with research integrity (14). Integrity issues and the education of future scientists about responsible research conduct will now be the task of Croatian universities and schools alone. Also, since there is considerable pressure in the academic community to publish, and since English is not the first language in Croatia, some authors simply decide to “borrow” a portion of text from previous papers (11). In addition, it has been shown that in post-communist countries, moral and cultural values and attitudes toward plagiarism differ from those in Western countries, which have a longer tradition of high research integrity standards (15). Plagiarism is not easy to define (16); there are still no criteria widely accepted by medical editors/journals as to what constitutes plagiarism. How much textual similarity raises the suspicion of plagiarism? Is it 5% or 10%, as stated by one source, or 100 words, as was argued in the discussion of COPE’s recent paper “How should editors respond to plagiarism?” (5-7,17)? Is there a difference between different types of plagiarism detection software? Plagiarism detection software offers valuable help in preventing plagiarism, but only if followed by manual verification (6).
All manuscripts submitted to the journal should be checked, but never rejected solely on the basis of the similarity report of plagiarism detection software (1,6). Therefore, medical editors are expected to reach a consensus on what constitutes plagiarism and to make clear policies on how to deal with cases of plagiarism. The CMJ was the first scientific journal in Croatia to begin checking all submitted manuscripts for plagiarism (2009) and, to the best of my knowledge, together with the Chinese Journal of Zhejiang University Science, the only journal in the world that has systematically collected data on plagiarism in submitted manuscripts. Furthermore, the CMJ is the first medical journal to publish a standard operating procedure for scanning submitted manuscripts (study protocol), as part of the journal’s “striving for excellence” policy (1,18). Plagiarism detection software enables systematic detection and prevention of plagiarism, leading to fewer retractions. The results of our study were published (6), and we expect other medical journals to publish their results, not merely descriptions of their experiences. In order to reach high research integrity standards and journal quality, journals should perform systematic checking of all submitted manuscripts according to widely accepted standards (protocols), as well as conduct ongoing education of authors.

32 citations


Network Information
Related Topics (5)
Active learning
42.3K papers, 1.1M citations
78% related
The Internet
213.2K papers, 3.8M citations
77% related
Software development
73.8K papers, 1.4M citations
77% related
Graph (abstract data type)
69.9K papers, 1.2M citations
76% related
Deep learning
79.8K papers, 2.1M citations
76% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    59
2022    126
2021    83
2020    118
2019    130
2018    125