
Showing papers on "Plagiarism detection" published in 2022


Posted ContentDOI
27 Dec 2022-bioRxiv
TL;DR: This article evaluated ChatGPT-generated research abstracts using an artificial intelligence (AI) output detector and a plagiarism detector, and had blinded human reviewers try to distinguish whether abstracts were original or generated; only 8% of the generated abstracts correctly followed the specific journal's formatting requirements.
Abstract: Background Large language models such as ChatGPT can produce increasingly realistic text, with unknown information on the accuracy and integrity of using these models in scientific writing. Methods We gathered ten research abstracts from five high impact factor medical journals (n=50) and asked ChatGPT to generate research abstracts based on their titles and journals. We evaluated the abstracts using an artificial intelligence (AI) output detector, plagiarism detector, and had blinded human reviewers try to distinguish whether abstracts were original or generated. Results All ChatGPT-generated abstracts were written clearly but only 8% correctly followed the specific journal’s formatting requirements. Most generated abstracts were detected using the AI output detector, with scores (higher meaning more likely to be generated) of median [interquartile range] of 99.98% [12.73, 99.98] compared with very low probability of AI-generated output in the original abstracts of 0.02% [0.02, 0.09]. The AUROC of the AI output detector was 0.94. Generated abstracts scored very high on originality using the plagiarism detector (100% [100, 100] originality). Generated abstracts had a similar patient cohort size as original abstracts, though the exact numbers were fabricated. When given a mixture of original and generated abstracts, blinded human reviewers correctly identified 68% of generated abstracts as being generated by ChatGPT, but incorrectly identified 14% of original abstracts as being generated. Reviewers indicated that it was surprisingly difficult to differentiate between the two, but that the generated abstracts were vaguer and had a formulaic feel to the writing. Conclusion ChatGPT writes believable scientific abstracts, though with completely generated data. These are original without any plagiarism detected but are often identifiable using an AI output detector and skeptical human reviewers. Abstract evaluation for journals and medical conferences must adapt policy and practice to maintain rigorous scientific standards; we suggest inclusion of AI output detectors in the editorial process and clear disclosure if these technologies are used. The boundaries of ethical and acceptable use of large language models to help scientific writing remain to be determined.

102 citations


Journal ArticleDOI
TL;DR: In this paper, a plagiarism detection system based on POS tag n-grams was proposed; the n-grams capture syntactic similarities between source and suspicious sentences, while the semantic relatedness between words is measured with the word embedding technique called Word2Vec.
Abstract: The aim of this paper is to present an automatic plagiarism detection system to identify plagiarized passages of documents. Our plagiarism detection system uses both syntactic and semantic similarities to identify plagiarized passages. Our proposed method is a novel contribution because of its usage of part-of-speech tag n-grams (POSNG), which are able to show syntactic similarities between source and suspicious sentences. Each source document is indexed according to part-of-speech (POS) tag n-grams by a search engine in order to rapidly access sentences that are possible plagiarism candidates. Even though our plagiarism detection system obtains very good results just using POS tag n-grams, its performance is further improved with the usage of semantic similarities. The semantic relatedness between words is measured with the word embedding technique called Word2Vec, and the longest common subsequence approach is used to measure the semantic similarity between source and suspicious sentences. There are several types of plagiarism, such as verbatim, paraphrasing, source-code, and cross-lingual. High obfuscation paraphrasing is a type of plagiarism whose detection is one of the most difficult plagiarism detection tasks. Our proposed method, which is based on POS tag n-grams, improves the detection performance for the high obfuscation paraphrasing type, and this is the main contribution of this paper. For this study, we use the large dataset called PAN-PC-11, which was created for the evaluation of automatic plagiarism detection algorithms. Our experiments are conducted with the four types of paraphrasing in PAN-PC-11, which are the none, low, high and simulated obfuscation paraphrasing types. We defined various threshold and parameter settings in order to assess the diversity of our results. We compared the performance of our method with the plagiarism detectors in the 3rd International Competition on Plagiarism Detection (PAN11). According to the experimental results, the proposed method achieved the best performance in terms of the plagdet measure for the high and low obfuscation paraphrasing types and produced competitive results for the other paraphrasing types.
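A minimal sketch of the POS-tag n-gram idea described above, comparing the POS n-gram sets of a source and a suspicious sentence (assuming NLTK's tokenizer and tagger; the paper's search-engine indexing, Word2Vec relatedness, and longest-common-subsequence scoring are not reproduced here):

```python
# Illustrative sketch only, not the authors' system. Requires NLTK plus
# nltk.download('punkt') and nltk.download('averaged_perceptron_tagger').
import nltk

def pos_ngrams(sentence, n=3):
    """Set of part-of-speech tag n-grams of a sentence."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))]
    return {tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)}

def pos_ngram_overlap(source, suspicious, n=3):
    """Jaccard overlap of the two POS n-gram sets; higher suggests syntactic reuse."""
    a, b = pos_ngrams(source, n), pos_ngrams(suspicious, n)
    return len(a & b) / len(a | b) if (a | b) else 0.0

print(pos_ngram_overlap(
    "The committee approved the proposal after a long debate.",
    "After a lengthy debate, the committee accepted the proposal."))
```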

9 citations


Journal ArticleDOI
TL;DR: Results on the PAN2016 corpus show that the proposed method, with a run-time of 00:01:27 (h:m:s) per pair of documents, achieves a plagdet of 94.37%, outperforming the support vector machine method and the deep learning method by 4.33% and 3.6%, respectively.

9 citations


Journal ArticleDOI
TL;DR: Dolos is a new source code plagiarism detection tool that is powered by state-of-the-art similarity detection algorithms, offers interactive visualizations, and uses generic parser models to support a broad range of programming languages.
Abstract: Background Learning to code is increasingly embedded in secondary and higher education curricula, where solving programming exercises plays an important role in the learning process and in formative and summative assessment. Unfortunately, students admit that copying code from each other is a common practice and teachers indicate they rarely use plagiarism detection tools. Objectives We want to lower the barrier for teachers to detect plagiarism by introducing a new source code plagiarism detection tool (Dolos) that is powered by state-of-the art similarity detection algorithms, offers interactive visualizations, and uses generic parser models to support a broad range of programming languages. Methods Dolos is compared with state-of-the-art plagiarism detection tools in a benchmark based on a standardized dataset. We describe our experience with integrating Dolos in a programming course with a strong focus on online learning and the impact of transitioning to remote assessment during the COVID-19 pandemic. Results and Conclusions Dolos outperforms other plagiarism detection tools in detecting potential cases of plagiarism and is a valuable tool for preventing and detecting plagiarism in online learning environments. It is available under the permissive MIT open-source license at https://dolos.ugent.be. Implications Dolos lowers barriers for teachers to discover, prove and prevent plagiarism in programming courses. This helps to enable a shift towards open and online learning and assessment environments, and opens up interesting avenues for more effective learning and better assessment. (PsycInfo Database Record (c) 2022 APA, all rights reserved)

8 citations


Journal ArticleDOI
TL;DR: This paper used pre-trained word embedding models combined with machine learning classifiers and state-of-the-art neural language models to detect machine-paraphrased text.
Abstract: Employing paraphrasing tools to conceal plagiarized text is a severe threat to academic integrity. To enable the detection of machine-paraphrased text, we evaluate the effectiveness of five pre-trained word embedding models combined with machine learning classifiers and state-of-the-art neural language models. We analyze preprints of research papers, graduation theses, and Wikipedia articles, which we paraphrased using different configurations of the tools SpinBot and SpinnerChief. The best performing technique, Longformer, achieved an average F1 score of 80.99% (F1 = 99.68% for SpinBot and F1 = 71.64% for SpinnerChief cases), while human evaluators achieved F1 = 78.4% for SpinBot and F1 = 65.6% for SpinnerChief cases. We show that the automated classification alleviates shortcomings of widely-used text-matching systems, such as Turnitin and PlagScan.
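The general recipe evaluated above, representing a passage with pre-trained word embeddings and feeding the averaged vector to a classifier, can be sketched roughly as follows (toy vectors and toy labels stand in for real pre-trained embeddings and real paraphrase data; this does not reproduce the paper's models such as Longformer):

```python
# Hedged sketch: averaged word vectors + a simple classifier. All vectors,
# texts, and labels below are made-up stand-ins for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

DIM = 4
toy_wv = {  # stand-in for Word2Vec / GloVe / fastText vectors
    "results": np.array([0.9, 0.1, 0.0, 0.2]),
    "findings": np.array([0.8, 0.2, 0.1, 0.3]),
    "show": np.array([0.1, 0.9, 0.0, 0.1]),
    "indicate": np.array([0.2, 0.8, 0.1, 0.2]),
}

def passage_vector(text):
    """Average the vectors of in-vocabulary tokens (zero vector if none)."""
    vecs = [toy_wv[w] for w in text.lower().split() if w in toy_wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

texts = ["results show improvement", "findings indicate improvement",
         "results show decline", "findings indicate decline"]
labels = [0, 1, 0, 1]  # 0 = original wording, 1 = machine-paraphrased (toy labels)

clf = LogisticRegression().fit(np.vstack([passage_vector(t) for t in texts]), labels)
print(clf.predict([passage_vector("findings indicate growth")]))
```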

7 citations


Journal ArticleDOI
TL;DR: In this paper, an assessment submission system with automated, personalized, and timely formative feedback that can be used in institutions that apply some leniency in early instances of plagiarism and collusion is presented.
Abstract: To help address programming plagiarism and collusion, students should be informed about acceptable practices and about program similarity, both coincidental and non-coincidental. However, current approaches are usually manual, brief, and delivered well before students are in a situation where they might commit academic misconduct. This article presents an assessment submission system with automated, personalized, and timely formative feedback that can be used in institutions that apply some leniency in early instances of plagiarism and collusion. If a student’s submission shares coincidental or non-coincidental similarity with other submissions, then personalized similarity reports are generated for the involved submissions and the students are expected to explain the similarity and resubmit the work. Otherwise, a report simulating similarities is sent just to the author of the submitted program to enhance their knowledge. Results from two quasi-experiments involving two academic semesters suggest that students with our approach are more aware of programming plagiarism and collusion, including the futility of some program disguises. Further, their submitted programs have lower similarity even at the level of program flow, suggesting that they are less likely to have engaged in programming plagiarism and collusion. Student behavior while using the system is also analyzed based on the statistics of the generated reports and student justifications for the reported similarities.

6 citations


Journal ArticleDOI
Mehdi Akbari1
TL;DR: In this article, two methods are proposed to identify extrinsic plagiarism; the first combines the pre-trained FastText word embedding model with TF-IDF weighting to form two matrices, one structural and one semantic.
Abstract: Plagiarism is a form of misconduct that refers to the use of scientific and literary content from other sources without reference to them. Today, the rise of plagiarism has become a serious problem for publishers and researchers. Many researchers have discussed this problem and tried to identify types of plagiarism; however, most existing methods are not effective in detecting intelligent plagiarism because they focus on direct copying. Therefore, in this study, two methods are proposed to identify extrinsic plagiarism. In both methods, to limit the search space, two stages of filtering based on the bag-of-words (BoW) technique are applied at the document level and at the sentence level, and plagiarism is investigated only in the outputs of these two stages. To detect similarities between suspicious documents and sentences, the first method combines the pre-trained FastText word embedding model with TF-IDF weighting to form a structural matrix and a semantic matrix, while the second method forms the two matrices using the WordNet ontology and TF-IDF weighting. After forming these matrices, the similarity between the matrix pairs of each sentence is calculated, yielding two similarity values: the Dice similarity and a weighted structural similarity. By comparing the similarity of suspicious sentences with a minimum threshold, the document containing the suspicious sentence is labelled as plagiarism or non-plagiarism. Experimental results on the PAN-PC-11 corpus show that the first method achieved 95.1% precision and the second method 93.8% precision, indicating that the word embedding network can be more successful than the WordNet ontology in detecting extrinsic plagiarism.
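For reference, the Dice coefficient used as one of the sentence-level scores above can be sketched in a few lines (word-set version only; the FastText/WordNet matrices, TF-IDF weighting, and two-stage BoW filtering are omitted):

```python
# Minimal word-set Dice similarity; illustrative, not the paper's full method.
def dice_similarity(sent_a, sent_b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) over lower-cased word sets."""
    a, b = set(sent_a.lower().split()), set(sent_b.lower().split())
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

src = "plagiarism is the reuse of text without citing the source"
sus = "plagiarism means reusing text without citing its source"
print(round(dice_similarity(src, sus), 3))  # flag the pair if above a chosen threshold
```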

5 citations


Journal ArticleDOI
TL;DR: In this paper, the authors proposed a novel adaptive meta-heuristic for music plagiarism detection, which combines text similarity-based and clustering-based methods to get an improved hybrid method.
Abstract: Plagiarism is a controversial and debated topic in different fields, especially in music, where the commercial market generates a huge amount of money. The lack of objective metrics to decide whether a song is a plagiarism makes music plagiarism detection a very complex task: decisions often have to be based on subjective arguments. Automated music analysis methods that identify music similarities can be of help. In this work, we first propose two novel such methods: a text similarity-based method and a clustering-based method. Then, we show how to combine them to get an improved (hybrid) method. The result is a novel adaptive meta-heuristic for music plagiarism detection. To assess the effectiveness of the proposed methods, considered both singly and in the combined meta-heuristic, we performed tests on a large dataset of ascertained plagiarism and non-plagiarism cases. Results show that the meta-heuristic outperforms existing methods. Finally, we deployed the meta-heuristic in a tool, accessible as a Web application, and assessed the effectiveness, usefulness, and overall user acceptance of the tool by means of a study involving 20 people, divided into two groups, one of which had access to the tool. The study consisted of having people decide which pairs of songs, in a predefined set of pairs, should be considered plagiarisms and which not. The study shows that the group supported by our tool successfully identified all plagiarism cases, performing all tasks with no errors. The whole sample agreed about the usefulness of an automatic tool that provides a measure of similarity between two songs.

5 citations


Journal ArticleDOI
TL;DR: Different types of traditional algorithms and deep learning algorithms, for instance Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM), are discussed as plagiarism detectors based on intelligent or traditional techniques.
Abstract: The Web provides various kinds of data and applications that are readily available to explore and are considered a powerful tool for humans. Copyright violation in web documents occurs when there is an unauthorized copy of information or text from the original document on the web; this violation is known as plagiarism. Plagiarism Detection (PD) can be defined as the procedure that finds similarities between a document and other documents based on lexical, semantic, and syntactic textual features. Approaches for the numeric representation (vectorization) of text, such as the Vector Space Model (VSM) and word embeddings, along with text similarity measures such as cosine and Jaccard, are essential for plagiarism detection. This paper deals with the concepts of plagiarism, kinds of plagiarism, textual features, text similarity measures, and plagiarism detection methods, which are based on intelligent or traditional techniques. Furthermore, different types of traditional algorithms and deep learning algorithms, for instance Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM), are discussed as plagiarism detectors. Besides that, this work reviews many other papers that address the topic of plagiarism and its detection.
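The vector space model with cosine similarity and the Jaccard measure mentioned above can be illustrated with a small scikit-learn snippet (an illustration of the concepts, not a tool from the survey):

```python
# Minimal VSM example: TF-IDF vectors + cosine similarity, plus set-based Jaccard.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the web provides data and applications",
        "the web offers applications and data to explore"]

tfidf = TfidfVectorizer().fit_transform(docs)           # documents in the vector space model
print("cosine :", cosine_similarity(tfidf[0], tfidf[1])[0, 0])

a, b = (set(d.split()) for d in docs)
print("jaccard:", len(a & b) / len(a | b))              # word-set Jaccard similarity
```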

5 citations


Journal ArticleDOI
TL;DR: In this paper, the plagiarism management policies adopted by a selection of universities in mainland China and Hong Kong were examined and compared, revealing both similarities and divergences in these universities' communication of plagiarism-related information, mechanism for plagiarism detection, provision of academic guidance and support for avoiding plagiarism, and competing discourses on plagiarism underpinning their mixed approaches to the problem.
Abstract: Long characterized as a primary form of academic misconduct and a major threat to academic integrity, the issue of plagiarism has been extensively researched from multiple perspectives, including students' and academic staff's perceptions and attitudes concerning plagiarism, measures for detecting and deterring plagiarism and their effectiveness, and the higher education sector's response to plagiarism. Yet knowledge remains patchy regarding this last strand of research. With the aim of bridging this research gap, we examine and compare the plagiarism management policies adopted by a selection of universities in mainland China and Hong Kong, two contexts that have been influenced by different academic traditions. Analysis reveals both similarities and divergences in these universities' communication of plagiarism-related information, mechanism for plagiarism detection, provision of academic guidance and support for avoiding plagiarism, and competing discourses on plagiarism underpinning their mixed approaches to the problem. Implications for institutional policymaking and academic integrity education are discussed.

4 citations


Journal ArticleDOI
TL;DR: In this article, the authors present a similarity detector that works on many kinds of weekly programming assessments, combining three-layered types of similarity so that even within a set of highly similar submissions, program pairs are still sorted according to their levels of similarity.
Abstract: When weekly programming assessments are used, it is often the case that some of them are either trivial or strongly directed. Common code similarity detectors are not particularly helpful with such assessments: some potential instances of misconduct are not selected for manual investigation as all submissions are expected to be similar and it is not feasible to check them all. Several dedicated similarity detectors have been developed to work with such assessments, but experience is required to determine when to use them. This paper presents a similarity detector that works on many kinds of weekly assessments. It combines three-layered types of similarity so that even within a set of highly similar submissions, program pairs are still sorted according to their levels of similarity. Our similarity detector is more effective than JPlag in distinguishing similar programs and helping to identify plagiarism and collusion. The similarity detector is slower than JPlag, but the longer execution time is partly offset by some optimization that has no negative impact on the effectiveness. As weekly assessments seldom entail large submissions, the execution time does not appear to be a barrier to use.


Proceedings ArticleDOI
10 Jun 2022
TL;DR: This paper proposes to leverage the bottom-k sketch (a.k.a. conditional random sampling) to estimate the similarity of two passages, proves that all the O(n²) passages in a document with n words can be partitioned into O(nk) groups, and develops an algorithm to generate these groups in O(n log n + nk) time.
Abstract: In this paper, we study the near-duplicate text alignment search problem, which, given a collection of source (data) documents and a suspicious (query) document, finds all the near-duplicate passage pairs between the suspicious document and every source document. It finds applications in plagiarism detection. Specifically, the first two steps in plagiarism detection are source retrieval and text alignment. Source retrieval finds candidate source documents in a corpus that share content with the suspicious document, while text alignment finds all the similar passage pairs between the suspicious document and every candidate source document. This problem is computation-intensive, especially for long documents. This is because there are O(n²m²) passage pairs between a single source document with n words and a suspicious document with m words, not to mention the large number of source documents in a corpus. Due to the high computation cost, existing solutions primarily rely on heuristic rules, such as the "seeding-extension-filtering" pipeline, and involve many hard-to-tune hyper-parameters. To address these issues, a recent work ALLIGN leverages the min-wise hash sketch for the text alignment problem. However, ALLIGN only works for two documents and leaves the source retrieval problem unattended. In this paper, we propose to leverage the bottom-k sketch (a.k.a. conditional random sampling) to estimate the similarity of two passages. We observe that many nearby passages in a document would share the same bottom-k sketch. Thus we propose to group all the passages in a document by their sketches. We prove that all the O(n²) passages can be partitioned into O(nk) groups in a document with n words and develop an algorithm to generate these groups in O(n log n + nk) time. Then, to address the source retrieval problem, we only need to find groups of passages with "similar" bottom-k sketches. Every passage pair in two groups with "similar" sketches is a near-duplicate. Experimental results on real-world datasets show that our techniques are highly efficient.
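The core estimator behind the approach, a bottom-k sketch whose overlap approximates passage resemblance, can be sketched as follows (illustrative only; the passage grouping and the O(n log n + nk) construction from the paper are not shown):

```python
# Bottom-k sketch (conditional random sampling) resemblance estimate, toy version.
import hashlib

def bottom_k_sketch(words, k=5):
    """k smallest 64-bit hashes of the distinct words in a passage."""
    hashes = {int(hashlib.md5(w.encode()).hexdigest()[:16], 16) for w in set(words)}
    return sorted(hashes)[:k]

def estimated_resemblance(s1, s2, k=5):
    """Fraction of the union's bottom-k hashes that appear in both sketches."""
    union_bottom = sorted(set(s1) | set(s2))[:k]
    both = set(s1) & set(s2)
    return sum(1 for h in union_bottom if h in both) / len(union_bottom)

p1 = "near duplicate text alignment finds similar passage pairs".split()
p2 = "text alignment finds all similar passage pairs between documents".split()
print(estimated_resemblance(bottom_k_sketch(p1), bottom_k_sketch(p2)))
```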

Journal ArticleDOI
TL;DR: In this article, a prototype of a bespoke software tool that attempts to repurpose some of these techniques into an automated process for detecting plagiarism and/or contract cheating in Word documents is presented.
Abstract: Academic misconduct in all its various forms is a challenge for degree-granting institutions. Whilst text-based plagiarism can be detected using tools such as Turnitin™, Plagscan™ and Urkund™ (amongst others), contract cheating and collusion can be more difficult to detect, and even harder to prove, often falling to no more than a ‘balance of probabilities’ rather than fact. To further complicate the matter, some students will make deliberate attempts to obfuscate cheating behaviours by submitting work in Portable Document Format, in image form, or by inserting hidden glyphs or using alternative character sets which text matching software does not always accurately detect (Rogerson, Int J Educ Integr 13, 2017; Heather, Assess Eval High Educ 35:647-660, 2010). Educators do not tend to think of academic misconduct in terms of criminality per se, but the tools and techniques used by digital forensics experts in law enforcement can teach us much about how to investigate allegations of academic misconduct. The National Institute of Standards and Technology’s Glossary describes digital forensics as ‘the application of computer science and investigative procedures involving the examination of digital evidence - following proper search authority, chain of custody, validation with mathematics, use of validated tools, repeatability, reporting, and possibly expert testimony.’ (NIST, Digital Forensics, 2021). These techniques are used in criminal investigations as a means to identify the perpetrator of, or accomplices to, a crime and their associated actions. They are sometimes used in cases relating to intellectual property to establish the legitimate ownership of a variety of objects, both written and graphical, as well as in fraud and forgery (Jeong and Lee, Digit Investig 23:3-10, 2017; Fu et al., Digit Investig 8:44–55, 2011). Whilst there have been some research articles and case studies that demonstrate the use of digital forensics techniques to detect academic misconduct as proof of concept, there is no evidence of their actual deployment in an academic setting. This paper will examine some of the tools and techniques that are used in law enforcement and the digital forensics field with a view to determining whether they could be repurposed for use in an academic setting. These include methods widely used to determine if a file has been tampered with that could be repurposed to identify if an image is plagiarised; file extraction techniques for examining metadata, used in criminal cases to determine authorship of documents, and tools such as FTK™ and Autopsy™ which are used to forensically examine single files as well as entire hard drives. The paper will also present a prototype of a bespoke software tool that attempts to repurpose some of these techniques into an automated process for detecting plagiarism and/or contract cheating in Word documents. Finally, this article will discuss whether these tools have a place in an academic setting and whether their use in determining if a student’s work is truly their own is ethical.
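As a small, hedged example of the metadata-style checks the article discusses: a .docx submission is a ZIP archive whose docProps/core.xml records the author, last editor, and timestamps, fields that can be compared against the submitting student's details ("submission.docx" is a placeholder; this is not the article's prototype tool):

```python
# Read author/editor/timestamp metadata from a Word document (illustrative check).
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

def docx_core_properties(path):
    """Return creator, last editor, and created/modified timestamps of a .docx file."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    fields = ("dc:creator", "cp:lastModifiedBy", "dcterms:created", "dcterms:modified")
    return {f: root.findtext(f, default="", namespaces=NS) for f in fields}

print(docx_core_properties("submission.docx"))  # placeholder path
```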

Journal ArticleDOI
TL;DR: In this article, the authors evaluate different academic plagiarism detection methods using the fuzzy MCDM (multi-criteria decision-making) method and provide recommendations for the development of efficient plagiarism-detection systems.
Abstract: Due to the overall widespread accessibility of electronic materials available on the internet, the availability and usage of computers in education have resulted in a growth in the incidence of plagiarism among students. A growing number of individuals at colleges around the globe appear to be presenting plagiarised papers to their professors for credit, while no specific details are collected of how much was plagiarised previously or how much is plagiarised currently. Supervisors, who are overburdened with huge responsibility, desire a simple way—similar to a litmus test—to rapidly reform plagiarized papers so that they may focus their work on the remaining students. Plagiarism-checking software programs are useful for detecting plagiarism in examinations, projects, publications, and academic research. A number of the latest research findings dedicated to evaluating and comparing plagiarism-checking methods have demonstrated that these have restrictions in identifying the complicated structures of plagiarism, such as extensive paraphrasing as well as the utilization of technical manipulations, such as substituting original text with similar text from foreign alphanumeric characters. Selecting the best reliable and efficient plagiarism-detection method is a challenging task with so many options available nowadays. This paper evaluates the different academic plagiarism-detection methods using the fuzzy MCDM (multi-criteria decision-making) method and provides recommendations for the development of efficient plagiarism-detection systems. A hierarchy of evaluation is discussed, as well as an examination of the most promising plagiarism-detection methods that have the opportunity to resolve the constraints of current state-of-the-art tools. As a result, the study serves as a “blueprint” for constructing the next generation of plagiarism-checking tools.

Journal ArticleDOI
TL;DR: In this paper, a plagiarism detection system based on intelligent deep learning was proposed for detecting lexical, syntactic, and semantic text plagiarism, which used CNN and RNN architectures.
Abstract: The phenomenon of scientific burglary has seen a significant increase recently due to the technological development in software. Therefore, many types of research have been developed to address this phenomenon. However, detecting lexical, syntactic, and semantic text plagiarism remains to be a challenge. Thus, in this study, we have computed and recorded all the features that reflect different types of text similarities in a new database. The created database is proposed for intelligent learning to solve text plagiarism detection problems. Using the created database, a reliable plagiarism detection system is also proposed, which depends on intelligent deep learning. Different approaches to deep learning, such as convolution and recurrent neural network architectures, were considered during the construction of this system. A comparative study was implemented to evaluate the proposed intelligent system on the two benchmark datasets: PAN 2013 and PAN 2014 of the PAN Workshop series. The experimental results showed that the proposed system based on long short-term memory (LSTM) achieved the first rank compared to up-to-date ranking systems.

Journal ArticleDOI
TL;DR: In this article, the authors discuss a clear concept about plagiarism from its origin to its consequences, with special considerations about its status in the COVID-19 pandemic, and highlight that the main reasons for this malpractice were pressure for publication under a limited time frame and a lack of training for scientific writing.
Abstract: Background: Plagiarism, in simple words meaning theft of ideas or text, is a grave scientific misconduct that is talked about frequently, yet is notable in its conspicuous absence from the formal educational curriculum. Students and young researchers tend to engage in this malpractice, intentionally or unintentionally, due to various reasons. Aim: In this review, we aim to discuss a clear concept about plagiarism from its origin to its consequences, with special considerations about its status in the COVID-19 pandemic. This lucid conceptualization will help young authors invest in original research in terms of both the idea and the script, avoiding unnecessary rejections and breaches of medical ethics. Search Strategy: An electronic search strategy was performed on MEDLINE using the following keywords: “Plagiarism” OR “Plagiarism AND reasons” OR “Plagiarism AND consequences OR retractions” OR “Plagiarism AND detection”. Results: Of 2112 articles obtained, 36 were selected for the review. The main reasons for this malpractice were pressure for publication under a limited time frame along with a lack of training for scientific writing. The forms of plagiarism observed include intentional and unintentional plagiarism, theft of ideas, verbatim copying, copying of graphics, self-plagiarism, and translational plagiarism. Various software tools are available for the detection of plagiarism, such as iThenticate, Turnitin Feedback Studio, and Grammarly; these, along with careful reviewing by authors, reviewers, and editors, can detect this menace and help maintain originality in science. The consequences can be severe, ranging from defamation to monetary and legal action against the authors. Conducting interactive workshops on scientific writing, along with promoting creativity in thought at the level of grassroots education, is the key to preventing the scientific misconduct of plagiarism amongst students and young researchers. Conclusion: Plagiarism is a serious scientific misconduct that must be discussed with students and young researchers, and its prevention is the key to fostering growth in medical science and academics.

Journal ArticleDOI
TL;DR: In this article, the authors presented a novel approach for addressing plagiarism named Multi-agents Indexing System, which is composed of three phases: (1) natural language processing phase, (2) indexing phase and (3) evaluation phase.

Journal ArticleDOI
TL;DR: In this article, the authors evaluated different aspects of plagiarism among students and researchers in all public universities of Morocco based on a 23-question cross-sectional survey and explored factors associated with plagiarism using contingency tables and logistic regression.
Abstract: Plagiarism is widely regarded as an issue of low- and middle-income countries because of several factors such as the lack of ethics policy and poor research training. In Morocco, plagiarism and its perception by academics have not been investigated on a large scale. In this study, we evaluated different aspects of plagiarism among students and researchers in all public universities of Morocco based on a 23-question cross-sectional survey. Factors associated with plagiarism were explored using contingency tables and logistic regression. The survey results covered all public universities (n=12) and included 1,220 responses from undergraduate students (31.4%), followed by PhD students (26.6%), scientific graduates (19%), PhD holders and postdoctoral fellows (12.2%), and lastly university professors (10.7%). The academic level was highly significantly associated with plagiarism (p<0.001). Most respondents that committed plagiarism were respectively scientific graduates (58.2%), PhD students (44.6%), PhD holders and postdoctoral fellows (37.6%), and finally university professors (28.2%). Having publication records was statistically associated with reduced plagiarism (p=0.002). Notably, the ability of participants to correctly define plagiarism was also highly significantly associated with reduced plagiarism misconduct (p<0.001 for all). Unintentional plagiarism (p<0.001), time constraints on writing an original text (p<0.001), and the inability of participants to paraphrase (p<0.001) were factors associated with plagiarism among Moroccan scholars. Moreover, participants that considered plagiarism as a serious issue in academic research committed significantly less plagiarism (p<0.001). The current study showed that various actionable factors associated with plagiarism can be targeted by educational interventions, and therefore, it provided the rationale to build training programs on research integrity in Morocco.

Proceedings ArticleDOI
10 Oct 2022
TL;DR: In this paper, the authors elaborate on 8 elements that form unique posters and 6 judgement criteria for plagiarism using an exploratory study with designers and build a novel poster dataset with plagiarism annotations according to the criteria.
Abstract: The wide sharing and rapid dissemination of digital artworks has aggravated the issues of plagiarism, raising significant concerns in cultural preservation and copyright protection. Yet, modes of plagiarism are formally uncharted, causing rough plagiarism detection practices with duplicate checking. This work is thus devoted to understanding artwork plagiarism, with poster design as the running case, for building more dedicated detection techniques. As the first study of such, we elaborate on 8 elements that form unique posters and 6 judgement criteria for plagiarism using an exploratory study with designers. Second, we build a novel poster dataset with plagiarism annotations according to the criteria. Third, we propose models, leveraging the combination of primary elements and criteria of plagiarism, to find suspect instances in a retrieval process. The models are trained under the context of modern artwork and evaluated on the poster plagiarism dataset. The proposal is shown to outperform the baseline with superior Top-K accuracy (~33%) and retrieval performance (~42%).


Journal ArticleDOI
TL;DR: In this paper, the authors argue that social work programs and journals should establish clear and widely distributed policies regarding plagiarism, which may help reduce plagiarism and improve the quality of professional writing.
Abstract: Plagiarism is a continuing and growing concern in higher education and in academic publishing. Educating to avoid plagiarism requires ongoing efforts at all levels and clear policies that explain the several types of plagiarism and potential consequences when it is found. Identifying plagiarism requires complex judgments and is not a simple matter of using plagiarism detection software. Both social work programs and journals should establish clear and widely distributed policies regarding plagiarism. Ongoing education, care in course and assignment development, tracking incidents within each institution, and establishing clear policies may help reduce plagiarism and improve the quality of professional writing.

Journal ArticleDOI
TL;DR: In this paper, a serial version of the vector space model was implemented on the CPU and tested with 1,000 documents, which took 1,641 s. As processing time was a performance bottleneck, a parallel version of the model was developed on graphics processing units (GPUs) using the compute unified device architecture (CUDA) and tested on the same dataset, which took only 36 s, a 45x speed-up over the CPU.
Abstract: Plagiarism is a rapidly rising issue among students during the submission of assignments, reports, and publications in universities and educational institutions, due to the easy accessibility of abundant e-resources on the internet. Existing tools become inefficient in terms of time consumption when dealing with a prolific number of documents with large content. Therefore, we have focused on software-based acceleration for plagiarism detection using the CPU/GPU. Initially, a serial version of the vector space model was implemented on the CPU and tested with 1,000 documents, which consumed 1,641 s. As processing time was a performance bottleneck, we intended to develop a parallel version of the model on graphics processing units (GPUs) using the compute unified device architecture (CUDA); tested with the same dataset, it consumed only 36 s, a 45x speed-up compared to the CPU. The version was then optimised further and took only 4 s for the same dataset, which was 389x faster than the serial implementation.
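A rough CPU-only illustration of why the authors move the vector space model onto parallel hardware: evaluating all pairwise similarities as one bulk matrix product is far faster than a nested Python loop, and a GPU extends the same data-parallel idea (no CUDA code from the paper is reproduced; the sizes below are arbitrary):

```python
# Compare a nested-loop similarity computation with a single matrix product.
import time
import numpy as np

rng = np.random.default_rng(0)
docs = rng.random((500, 2000))                        # 500 documents as TF-IDF-like vectors
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # L2-normalise rows

t0 = time.perf_counter()
naive = [[float(docs[i] @ docs[j]) for j in range(500)] for i in range(500)]
t1 = time.perf_counter()
bulk = docs @ docs.T                                  # all cosine similarities at once
t2 = time.perf_counter()
print(f"loop: {t1 - t0:.2f} s   matrix: {t2 - t1:.4f} s")
```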

Book ChapterDOI
01 Jan 2022
TL;DR: In this article, a comparison of the performance of five word-embedding based deep learning models in the field of semantic similarity detection, such as TF-IDF, Word2Vec, Doc2Vec, FastText, and BERT, on two publicly available corpora, Quora Question Pairs and Plagiarized Short Answers (PSA), is presented.
Abstract: Most of the state-of-the-art plagiarism detection tools focus on verbatim reproduction of a document for plagiarism identification (PI), not taking into account its semantic properties. Recently, deep learning models have shown considerable performance in identifying paraphrases using the word embedding approach. This paper gives an overview and comparison of the performances of five word-embedding based deep learning models in the field of semantic similarity detection, such as TF-IDF, Word2Vec, Doc2Vec, FastText, and BERT, on two publicly available corpora: Quora Question Pairs (QQP) and Plagiarized Short Answers (PSA). After extensive literature review and experiments, the most appropriate text preprocessing approaches, distance measures, and thresholds have been settled on for detecting semantic similarity/paraphrasing. The paper concludes that FastText is the most efficient model of the five, both in terms of evaluation metrics (accuracy, precision, recall, F1-score, and the receiver operating characteristic (ROC) curve) and in terms of resource consumption. It also compares all the models with each other based on the above-mentioned metrics.

Journal ArticleDOI
TL;DR: In this article, an English-Persian cross-language plagiarism detection corpus based on parallel bilingual sentences that artificially generate passages with various degrees of paraphrasing was constructed.
Abstract: In recent years, the availability of documents through the Internet along with automatic translation systems have increased plagiarism, especially across languages. Cross-lingual plagiarism occurs when the source or original text is in one language and the plagiarized or re-used text is in another language. Various methods for automatic text re-use detection across languages have been developed whose objective is to assist human experts in analyzing documents for plagiarism cases. For evaluating the performance of these systems and algorithms, standard evaluation resources are needed. To construct cross lingual plagiarism detection corpora, the majority of earlier studies have paid attention to English and other European language pairs, and have less focused on low resource languages. In this paper, we investigate a method for constructing an English-Persian cross-language plagiarism detection corpus based on parallel bilingual sentences that artificially generate passages with various degrees of paraphrasing. The plagiarized passages are inserted into topically related English and Persian Wikipedia articles in order to have more realistic text documents. The proposed approach can be applied to other less-resourced languages. In order to evaluate the compiled corpus, both intrinsic and extrinsic evaluation methods were employed. So, the compiled corpus can be suitably included into an evaluation framework for assessing cross-language plagiarism detection systems. Our proposed corpus is free and publicly available for research purposes.

Journal ArticleDOI
TL;DR: This paper examines students' understanding of plagiarism, the influence of the Turnitin anti-plagiarism software on academic writing, and students' ability to interpret the originality reports the software generates.
Abstract: Plagiarism adversely affects contributions to knowledge and the goal of instilling the skill of academic writing. Thus, most institutions of higher learning strive to develop mechanisms to help students understand the importance of academic integrity and reasons why they should avoid plagiarism. In many of these institutions, libraries/librarians play an integral role in academic integrity instruction as it is incorporated into information literacy instruction. This study examined students’ understanding of plagiarism and the influence of Turnitin anti-plagiarism software on academic writing. The study used the survey research design. Using stratified and simple random sampling methods, questionnaires were administered to 175 students. Findings from the study revealed that efforts made by the university through seminars have influenced the perception of students on the concept of plagiarism and acts which constitute plagiarism. Though respondents agree that the use of Turnitin software influences academic writing positively, knowledge about the existence and usage of this software is low, since the total number of students who had no idea about the university’s subscription to the software was higher than those who did. Additionally, many of the respondents had difficulty understanding originality reports generated by the software and most were not aware of the existence of the university’s policy on plagiarism. The study recommends that the library intensify training on the usage and interpretation of the similarity reports generated by the software and make the university’s policy on plagiarism more visible, or otherwise risk losing the capital and effort invested in producing graduates who can contribute to knowledge and scholarship ethically.

Posted ContentDOI
08 Aug 2022-medRxiv
TL;DR: Plagiarism is prevalent in COVID-19-related publications in infection journals among various quartiles, and it is not enough to rely only on similarity reports; such reports must be accompanied by manual curation of the results with an appropriate threshold to be able to appropriately determine if plagiarism is occurring.
Abstract: Background: The COVID-19 pandemic has caused drastic changes in the publishing framework in order to quickly review and publish vital information during this public health emergency. The quality of the academic work being published may have been compromised. One area of concern is plagiarism, where the work of others is directly copied and represented as one's own. The purpose of this study is to determine the presence of plagiarism in infection journals in papers relating to the COVID-19 pandemic. Methods: Consecutively occurring original research or reviews relating to the COVID-19 pandemic, published in infection journals as ranked by the SCOPUS journal finder, were collected. Each manuscript was optimized and uploaded to the Turnitin program. Similarity reports were then manually checked for true plagiarism within the text, where any sentence with more than 80% copying was deemed plagiarised. Results: A total of 310 papers were analyzed in this cross-sectional study. Papers from a total of 23 journals among 4 quartiles were examined. Of the papers we examined, 41.6% were deemed plagiarised (n=129). Among the plagiarised papers, the average number of copied sentences was 5.42 ± 9.18. The highest recorded similarity report was 60%, and the highest number of copied sentences was 85. Plagiarism was higher in papers published in the year 2020. The most problematic area in the manuscripts was the discussion section. Self-plagiarism was identified in 31 papers. Average time to judge all manuscripts was 2.45 ± 3.09. Among all the plagiarised papers, 72% had a similarity report of less than 15% (n=93). Papers published from core Anglosphere (English-speaking) countries were not associated with higher rates of plagiarism. No significant differences were found with regard to plagiarism events among the quartiles. Conclusion: Plagiarism is prevalent in COVID-19-related publications in infection journals among various quartiles. It is not enough to rely only on similarity reports. Such reports must be accompanied by manual curation of the results with an appropriate threshold to be able to appropriately determine if plagiarism is occurring. The majority of plagiarism is occurring in reports of less than 15% similarity, and this is a blind spot. Incorporating a manual judge could save future time in avoiding retractions and improving the quality of papers in these journals.

Book ChapterDOI
01 Jan 2022
TL;DR: In this paper, a deep learning framework that leverages a Siamese BLSTM network and character-based embeddings to detect source code plagiarism is presented; the method is limited to source code.
Abstract: Source code plagiarism is a severe ongoing problem that threatens academic integrity and intellectual rights. Students from computing disciplines commit plagiarism through diverse channels, with direct in-class plagiarism being the most popular. Programming instructors struggle to manually inspect plagiarism activities in large volumes of submissions. Thus, many research works on detection approaches have been proposed to overcome prolonged manual inspection. In this article, we present a deep learning framework that leverages a Siamese BLSTM network and character-based embeddings to detect source code plagiarism. The goal of this research is to determine which character-based embedding architecture produces the most accurate plagiarism detection scores. The proposed framework uses Word2Vec and fastText models to obtain various pre-trained source code embedding sequences as input to the network. Subsequently, we utilise Manhattan distance to measure the plagiarism scores between the two outputs produced by the network. To the best of our knowledge, this is the first research work to utilise various embedding models for source code plagiarism detection. Experimental results showed that the embeddings from the Word2Vec Skip-Gram and Negative Sampling (W2V-SGNS) architecture produce the most accurate detection scores. Keywords: Source code plagiarism detection, Source code embeddings, Siamese LSTM network, Programming language processing, Deep learning, Code similarity.
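The final scoring step described above, turning the Manhattan distance between two encoder outputs into a similarity, can be sketched as follows (random vectors stand in for trained Siamese BLSTM embeddings; exp(-d) is the mapping commonly used in Siamese "MaLSTM"-style setups, assumed here rather than taken from the paper):

```python
# Manhattan-distance similarity between two (stand-in) code embeddings.
import numpy as np

def manhattan_similarity(u, v):
    """exp(-||u - v||_1): 1.0 for identical embeddings, approaching 0 as they diverge."""
    return float(np.exp(-np.sum(np.abs(u - v))))

rng = np.random.default_rng(42)
emb_a = rng.normal(size=64)                      # embedding of submission A (stand-in)
emb_b = emb_a + rng.normal(scale=0.01, size=64)  # a near-copy of A
emb_c = rng.normal(size=64)                      # an unrelated submission
print(manhattan_similarity(emb_a, emb_b), manhattan_similarity(emb_a, emb_c))
```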

Journal ArticleDOI
TL;DR: The winnowing algorithm can be used as a plagiarism check in thesis and journal documents and the application built in this system is running well because the winnowed algorithm can help check plagiarism in the thesis andJournal documents.
Abstract: The ease of accessing information results in an increase in the level of plagiarism. Plagiarism is an act of tacking other peoples’ writings or opinions and making it look as if they were by themselves without first studying and not including the source. A detection similarity is an application that is made based on a website to detect the similarity or similarity of a document/text with other documents/text. In addition, il also provides an overview of the calculation sequence of how the calculation process in detecting the text runs until it produces the percentage of similarity of the text. In a making-based application website, this methodology used is the waterfall. While the method used in calculating the similarity of the text is the Algorithm Winnowing. The winnowing algorithm is one of the document method fingerprints. This method can identify the similarity of the test, including small parts that are similar in a set of documents that are analyzed through the fingerprint generated and for calculating the percentage results using the Jaccard Coefficient. The Smaller the percentage level of similarity of a text document, the smaller the level of similarity, but if the percentage value is greater then can be ascertained that the document is plagiarized. The winnowing algorithm can be used as a plagiarism check in thesis and journal documents. The application built in this system is running well because the winnowing algorithm can help check plagiarism in the thesis and journal documents.