
Showing papers on "Plagiarism detection" published in 2020


Journal ArticleDOI
TL;DR: The sobering results show that although some web-based text-matching systems can indeed help identify some plagiarized content, they clearly do not find all plagiarism and at times also identify non-plagiarized material as problematic.
Abstract: There is a general belief that software must be able to easily do things that humans find difficult. Since finding sources for plagiarism in a text is not an easy task, there is a widespread expectation that it must be simple for software to determine if a text is plagiarized or not. Software cannot determine plagiarism, but it can work as a support tool for identifying some text similarity that may constitute plagiarism. But how well do the various systems work? This paper reports on a collaborative test of 15 web-based text-matching systems that can be used when plagiarism is suspected. It was conducted by researchers from seven countries using test material in eight different languages, evaluating the effectiveness of the systems on single-source and multi-source documents. A usability examination was also performed. The sobering results show that although some systems can indeed help identify some plagiarized content, they clearly do not find all plagiarism and at times also identify non-plagiarized material as problematic.

53 citations


Journal ArticleDOI
TL;DR: A new semantics-based plagiarism detection technique between C++ and Java source code is proposed for a multimedia-based e-Learning and smart assessment methodology, and the experimental results show improved semantic similarity scores for plagiarism detection in the comparative evaluation.
Abstract: The multimedia-based e-Learning methodology provides virtual classrooms to students. The teacher uploads learning materials, programming assignments and quizzes on the university's Learning Management System (LMS). The students learn lessons from the uploaded videos and then solve the given programming tasks and quizzes. Source code plagiarism is a serious threat to academia. However, identifying similar source code fragments between different programming languages is a challenging task. To solve this problem, this paper proposes a new semantics-based plagiarism detection technique between C++ and Java source code within a multimedia-based e-Learning and smart assessment methodology. First, it transforms source code into tokens to calculate semantic similarity in a token-by-token comparison. After that, it computes a scalar semantic similarity value for the complete source code written in C++ and Java. For the experiment, we used a dataset consisting of four case studies, covering factorial, bubble sort, binary search, and a stack data structure, each in both C++ and Java. The entire experiment was done in RStudio with R version 3.4.2. The experimental results show improved semantic similarity results for plagiarism detection in the comparative evaluation.
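To illustrate the token-level comparison described above, here is a minimal Python sketch that tokenizes C++ and Java snippets, maps language-specific tokens onto a shared vocabulary, and scores the overlap. The token map, the Jaccard scoring, and the example snippets are illustrative assumptions, not the authors' actual normalization or similarity computation.

```python
import re

# Illustrative mapping of language-specific tokens to shared semantic classes;
# the actual normalization used in the paper is not reproduced here.
SHARED_CLASSES = {
    "cout": "PRINT", "System.out.println": "PRINT",
    "cin": "READ", "Scanner": "READ",
    "int": "INT", "Integer": "INT",
}

def tokenize(source: str) -> list[str]:
    """Split source code into identifiers, numbers, and single operators."""
    return re.findall(r"[A-Za-z_][A-Za-z_0-9.]*|\d+|[^\sA-Za-z_0-9]", source)

def normalize(tokens: list[str]) -> list[str]:
    """Replace language-specific tokens with shared semantic classes."""
    return [SHARED_CLASSES.get(tok, tok) for tok in tokens]

def token_similarity(cpp_src: str, java_src: str) -> float:
    """Scalar similarity over normalized token sets (Jaccard index as a stand-in)."""
    a, b = set(normalize(tokenize(cpp_src))), set(normalize(tokenize(java_src)))
    return len(a & b) / len(a | b) if a | b else 0.0

cpp = "int f(int n){ if(n<=1) return 1; return n*f(n-1); }"
java = "static int f(int n){ if(n<=1) return 1; return n*f(n-1); }"
print(round(token_similarity(cpp, java), 2))
```

A real system would also weight matched tokens by their semantic importance rather than treating all matches equally.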

36 citations


Journal ArticleDOI
TL;DR: This work presents a data augmentation strategy and a multi-cascaded model for improved paraphrase detection in short texts and shows that it produces a comparable or state-of-the-art performance on all three benchmark datasets.
Abstract: Paraphrase detection is an important task in text analytics with numerous applications such as plagiarism detection, duplicate question identification, and enhanced customer support helpdesks. Deep models have been proposed for representing and classifying paraphrases. These models, however, require large quantities of human-labeled data, which is expensive to obtain. In this work, we present a data augmentation strategy and a multi-cascaded model for improved paraphrase detection in short texts. Our data augmentation strategy considers the notions of paraphrases and non-paraphrases as binary relations over the set of texts. Subsequently, it uses graph theoretic concepts to efficiently generate additional paraphrase and non-paraphrase pairs in a sound manner. Our multi-cascaded model employs three supervised feature learners (cascades) based on CNN and LSTM networks with and without soft-attention. The learned features, together with hand-crafted linguistic features, are then forwarded to a discriminator network for final classification. Our model is both wide and deep and provides greater robustness across clean and noisy short texts. We evaluate our approach on three benchmark datasets and show that it produces a comparable or state-of-the-art performance on all three.
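One plausible reading of the graph-based augmentation above is that known paraphrase pairs form edges, every pair within a connected component inherits the paraphrase label, and a known non-paraphrase edge between two components marks all cross-component pairs as non-paraphrases. The sketch below illustrates that reading with a small union-find; the sentence identifiers and the exact propagation rules are assumptions for illustration, not the paper's construction.

```python
from itertools import combinations

def components(pairs):
    """Union-find over paraphrase edges; returns the connected components."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())

paraphrases = [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]
non_paraphrases = [("s1", "s4")]

groups = components(paraphrases)

# Augmented paraphrases: all pairs within each component (transitivity).
aug_pos = {pair for g in groups for pair in combinations(sorted(g), 2)}

# Augmented non-paraphrases: all cross pairs of two components joined by a negative edge.
aug_neg = set()
for a, b in non_paraphrases:
    ga = next(g for g in groups if a in g)
    gb = next(g for g in groups if b in g)
    aug_neg |= {(x, y) for x in ga for y in gb}

print(sorted(aug_pos))
print(sorted(aug_neg))
```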

34 citations


Posted Content
TL;DR: This paper presents the first systematic study of the basic features used in BCSA, leveraging interpretable feature engineering on a large-scale benchmark, and shows that a simple interpretable model with a few basic features can achieve results comparable to those of recent deep learning-based approaches.
Abstract: Binary code similarity analysis (BCSA) is widely used for diverse security applications such as plagiarism detection, software license violation detection, and vulnerability discovery. Despite the surging research interest in BCSA, it is significantly challenging to perform new research in this field for several reasons. First, most existing approaches focus only on the end results, namely, increasing the success rate of BCSA, by adopting uninterpretable machine learning. Moreover, they utilize their own benchmark sharing neither the source code nor the entire dataset. Finally, researchers often use different terminologies or even use the same technique without citing the previous literature properly, which makes it difficult to reproduce or extend previous work. To address these problems, we take a step back from the mainstream and contemplate fundamental research questions for BCSA. Why does a certain technique or a feature show better results than the others? Specifically, we conduct the first systematic study on the basic features used in BCSA by leveraging interpretable feature engineering on a large-scale benchmark. Our study reveals various useful insights on BCSA. For example, we show that a simple interpretable model with a few basic features can achieve a comparable result to that of recent deep learning-based approaches. Furthermore, we show that the way we compile binaries or the correctness of underlying binary analysis tools can significantly affect the performance of BCSA. Lastly, we make all our source code and benchmark public and suggest future directions in this field to help further research.

32 citations


Journal ArticleDOI
TL;DR: The design and construction of an Arabic PD reference corpus that is dedicated to academic language and that serves as a database for the detection of plagiarism in student assignments, reports, and dissertations is discussed.
Abstract: Advancement in information technology has resulted in massive textual material that is open to appropriation. Due to researchers' misconduct, a plethora of plagiarism detection (PD) systems have been developed. However, most PD systems on the market do not support the Arabic language. In this paper, we discuss the design and construction of an Arabic PD reference corpus that is dedicated to academic language. It consists of 2,312 dissertations that were defended by postgraduate students at the University of Jordan (JU) between 2001 and 2016. This Academic Jordan University Plagiarism Detection corpus (henceforth, JUPlag) is structured according to the Dewey Decimal Classification (DDC). The goal of the corpus is twofold: first, it is a database for the detection of plagiarism in student assignments, reports, and dissertations; second, the n-gram structure of the corpus provides a knowledge base for linguistic analysis, language teaching, and the learning of plagiarism-free writing. The PD system is guided by the JU Library's metadata for the retrieval and discovery of plagiarism. To test JUPlag, we injected an unseen dissertation with multiple instances of plagiarism-simulated paragraphs and sentences. Experimentation with the system using different verbatim n-gram segments is indeed promising. The preliminary results encourage seeking permission to enrich this corpus with all the theses in the Thesis Repository of the Union of Arab Universities. The JUPlag corpus is intended to function as an indispensable resource for testing and evaluating plagiarism detection techniques. Since the University of Jordan is seeking to become a center for plagiarism detection for Arabic content, and being a non-profit organization, it will charge a nominal fee for the use of JUPlag to finance the maintenance and development of the corpus.
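The verbatim n-gram matching that a corpus such as JUPlag supports can be illustrated with a small inverted index, sketched below. The whitespace tokenizer, the 5-gram window, and the toy corpus entry are placeholder choices, not the corpus's actual indexing scheme.

```python
from collections import defaultdict

def ngrams(text: str, n: int = 5):
    """Word n-grams of a text, lowercased and whitespace-tokenized."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_index(corpus: dict[str, str], n: int = 5):
    """Map every n-gram in the reference corpus to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for gram in ngrams(text, n):
            index[gram].add(doc_id)
    return index

def matches(suspicious: str, index, n: int = 5):
    """Count how many n-grams of the suspicious text occur in each corpus document."""
    hits = defaultdict(int)
    for gram in ngrams(suspicious, n):
        for doc_id in index.get(gram, ()):
            hits[doc_id] += 1
    return dict(hits)

corpus = {"thesis_001": "the corpus provides a knowledge base for linguistic analysis and teaching"}
index = build_index(corpus)
print(matches("a knowledge base for linguistic analysis was provided", index))
```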

27 citations


Journal ArticleDOI
TL;DR: The proposed set theory-based similarity measure (STB-SM) significantly outperforms all state-of-the-art measures with regard to both effectiveness and efficiency.
Abstract: Similarity measures have long been utilized in information retrieval and machine learning for multiple purposes, including text retrieval, text clustering, text summarization, plagiarism detection, and several other text-processing applications. However, the problem with these measures is that, until recently, no single measure has been recorded to be both highly effective and efficient at the same time. Thus, the quest for an efficient and effective similarity measure is still an open-ended challenge. This study, in consequence, introduces a new highly effective and time-efficient similarity measure for text clustering and classification. Furthermore, the study provides a comprehensive scrutiny of seven of the most widely used similarity measures, mainly concerning their effectiveness and efficiency. Using the K-nearest neighbor algorithm (KNN) for classification, the K-means algorithm for clustering, and the bag-of-words (BoW) model for feature selection, all similarity measures are carefully examined in detail. The experimental evaluation was made on two of the most popular datasets, namely Reuters-21 and Web-KB. The obtained results confirm that the proposed set theory-based similarity measure (STB-SM), as a pre-eminent measure, significantly outperforms all state-of-the-art measures with regard to both effectiveness and efficiency.

23 citations


Journal ArticleDOI
TL;DR: The authors propose several new features that can be extracted from source code repositories to build a comprehensive profile of each individual developer and show that these significantly improve upon the performance of more traditional plagiarism detection tools.
Abstract: Detecting instances of plagiarism in student homework, especially programming homework, is an important issue for practitioners. In the past decades, several tools have emerged that are able to effectively compare large corpora of homework submissions and sort pairs by degree of similarity. However, those tools are available to students as well, allowing them to experiment and develop elaborate methods for evading detection. Also, such tools are unable to detect instances of “external plagiarism”, where students obtained unethical help from sources other than students of the same course. One way to battle this problem is to monitor student activity while they solve their homework using a cloud-based integrated development environment (IDE) and detect suspicious behaviours. Each editing event in the program source can be stored as a new commit to create a form of ultra-fine-grained source code repository. In this paper, the authors propose several new features that can be extracted from such repositories with the purpose of building a comprehensive profile of each individual developer. Machine learning techniques were used to detect suspicious behaviours, which allowed the authors to significantly improve upon the performance of more traditional plagiarism detection tools.
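The kind of behavioural profile described above can be sketched as a small feature extractor over per-commit metadata. The features and thresholds below (for example, treating edits larger than 200 characters as paste-like) are hypothetical illustrations; the paper's actual feature set is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class Commit:
    timestamp: float      # seconds since the start of the editing session
    chars_added: int      # size of the edit captured by this commit
    chars_deleted: int

def session_features(commits: list[Commit]) -> dict[str, float]:
    """Summarize one student's editing session into a feature vector (illustrative features)."""
    gaps = [b.timestamp - a.timestamp for a, b in zip(commits, commits[1:])]
    big_pastes = sum(1 for c in commits if c.chars_added > 200)  # paste-like bursts
    return {
        "n_commits": len(commits),
        "mean_gap_s": sum(gaps) / len(gaps) if gaps else 0.0,
        "paste_ratio": big_pastes / len(commits) if commits else 0.0,
        "net_growth": sum(c.chars_added - c.chars_deleted for c in commits),
    }

# Such vectors could then be fed to any off-the-shelf classifier trained on
# sessions labelled as suspicious or not.
session = [Commit(0, 15, 0), Commit(12, 8, 2), Commit(20, 640, 0)]
print(session_features(session))
```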

22 citations


Journal ArticleDOI
TL;DR: Developing country researchers appear to be familiar with the concept of plagiarism, but knowledge among the participants surveyed here was incomplete, and knowledge about plagiarism and awareness of its harmfulness must be improved.
Abstract: To explore the knowledge, attitudes and practices regarding plagiarism in a large, culturally diverse sample of researchers who participated in the AuthorAID MOOC on Research Writing. An online survey was designed and delivered through Google Forms to the participants in the AuthorAID MOOC on Research Writing during April to June 2017. A total of 765 participants completed the survey (response rate 47.8%), and 746 responses were included in the analysis. Almost all participants (97.6%) reported knowledge of the term “plagiarism”, and 89.1% of them understood the meaning of the term before joining the course. Most participants reported that their university does not provide access to plagiarism detection software (82.0%), and 35% of participants admitted they had been involved in plagiarism during their education. Overall attitudes toward plagiarism (65.3 ± 10.93) indicated low acceptance of plagiarism. Moreover, low scores were reported for the approval attitude (25.22 ± 5.63), the disapproval attitude (11.78 ± 3.64), and knowledge of subjective norms (20.63 ± 5.22). The most common reason for plagiarizing was lack of time (16.1%), and the most common consequence was the perception that “those who plagiarize are not respected or seen positively” (71.4%). Developing-country researchers appear to be familiar with the concept of plagiarism, but knowledge among the participants surveyed here was incomplete. Knowledge about plagiarism and awareness of its harmfulness must be improved, because there is an obvious relationship between attitudes toward plagiarism and knowledge, reasons and consequences. The use of plagiarism-detection software can raise awareness about plagiarism.

20 citations


Journal ArticleDOI
TL;DR: The experimental results show that the proposed cross-language text alignment approach significantly outperforms the state-of-the-art models and can be fed into an expert system for further improvement of cross- language plagiarism detection.
Abstract: The exponential growth of documents in various languages throughout the web, along with the availability of several editing and translation tools, has made cross-language plagiarism detection a challenging issue. Given its high importance, the present study focuses on the task of cross-language text alignment, also known as detailed analysis, which works on the outputs of the source retrieval step of cross-language plagiarism detection systems. The paper proposes a two-level matching approach that considers both syntactic and semantic information to accurately align plagiarized fragments from the source and suspicious documents. At the first level, a vector space model that employs a multilingual word-embedding-based dictionary and a local weighting technique is used to extract a minimal set of highly potential candidate fragment pairs, rather than considering all possible pairs of fragments. This step also contains a dynamic expansion technique to cover more candidate pairs, aiming at improving the system's recall. It is followed by a more precise algorithm that examines the candidate pairs at the sentence level using a graph-of-words representation of the text. As a result, by modelling both the words and their relationships, an acceptable increase in the system's precision, which is the goal of the second level, is also observed. To identify evidence of plagiarism, i.e. potential cases of unauthorized text reuse, the algorithm tries to find maximum cliques in the match graph of the source and suspicious texts. With this two-level investigation, the approach is capable of discriminating true plagiarism cases from original text. The experimental results on different datasets such as PAN-PC-11, PAN-PC-12, and SemEval-2017 show that the proposed cross-language text alignment approach significantly outperforms the state-of-the-art models and can be fed into an expert system for further improvement of cross-language plagiarism detection. The source code is publicly available on GitHub for the purposes of reproducible research.
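The clique-based alignment idea can be sketched as follows: candidate sentence pairs become nodes of a match graph, compatible pairs are connected, and a maximal clique is read off as a plagiarism case. The word-overlap similarity, the threshold, and the compatibility rule below are crude placeholders for the paper's embedding-based and graph-of-words scoring.

```python
import networkx as nx

def similarity(a: str, b: str) -> float:
    """Crude word-overlap stand-in for the paper's cross-lingual similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def align(source_sents, suspicious_sents, threshold=0.4):
    pairs = [(i, j) for i, s in enumerate(source_sents)
             for j, t in enumerate(suspicious_sents)
             if similarity(s, t) >= threshold]
    G = nx.Graph()
    G.add_nodes_from(pairs)
    # Two matched pairs are compatible if they involve different sentences on both sides.
    for (i1, j1) in pairs:
        for (i2, j2) in pairs:
            if (i1, j1) < (i2, j2) and i1 != i2 and j1 != j2:
                G.add_edge((i1, j1), (i2, j2))
    # Read the largest maximal clique as one aligned plagiarism case.
    return max(nx.find_cliques(G), key=len) if G.number_of_nodes() else []

src = ["plagiarism detection is a hard task", "embeddings capture word meaning"]
sus = ["detection of plagiarism is a hard task", "word meaning is captured by embeddings"]
print(align(src, sus))
```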

19 citations


Journal ArticleDOI
TL;DR: A comparative study based on a set of criteria (vector representation method, treatment level, similarity method, and dataset) gives an overview of different proposals for plagiarism detection based on deep learning algorithms.
Abstract: The ease of access to the various resources on the web enabled the democratization of access to information, but at the same time allowed enormous plagiarism problems to appear. Many plagiarism techniques have been identified in the literature, but plagiarism of ideas remains the most troublesome to detect, because it combines several kinds of text manipulation at the same time. Indeed, a few strategies have been proposed to perform semantic plagiarism detection, but there are still numerous challenges to overcome. Unlike existing surveys, the purpose of this study is to give an overview of different proposals for plagiarism detection based on deep learning algorithms. The main goal of these approaches is to provide high-quality vector representations of words or sentences. In this paper, we propose a comparative study based on a set of criteria: vector representation method, treatment level, similarity method, and dataset. One result of this study is that most of the research is based on word granularity and uses the word2vec method for word vector representation, which is not always suitable for preserving the meaning of whole sentences. Each technique has strengths and weaknesses; however, none is fully mature for semantic plagiarism detection.

17 citations


Journal ArticleDOI
01 Jun 2020
TL;DR: While more research is necessary to further investigate the reliability of the best performing software packages, stylometry software appears to show significant promise for the potential detection of contract cheating.
Abstract: Contract cheating, instances in which a student enlists someone other than themselves to produce coursework, has been identified as a growing problem within the academic integrity literature and in news headlines. The percentage of students who have utilized this type of cheating has been reported to range between 6% and 15.7%. Generational sentiments about cheating and the prevalent accessibility of contract cheating providers online seem only to have exacerbated the issue. The problem is that there is currently no simple means identified and verified to detect contract cheating, as available plagiarism detection software has been shown to be ineffective in these cases. One method that is commonly used for authorship authentication in nonacademic settings, stylometry, has been suggested as a potential means for detection. Stylometry uses various attributes of documents to determine if they were written by the same individual. This pilot study sought to assess the utility of three easy-to-use, readily available stylometry software systems to detect simulated cases of contract cheating on academic documents. Average accuracy ranged from 33% to 88.9%. While more research is necessary to further investigate the reliability of the best performing software packages, stylometry software appears to show significant promise for the potential detection of contract cheating.
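The underlying stylometric principle can be shown with a toy profile of function-word frequencies compared by cosine similarity, sketched below. The feature set and example texts are minimal illustrations; the evaluated software packages use far richer models (character n-grams, syntax, vocabulary richness).

```python
import re
from math import sqrt

# Ten common English function words as a tiny illustrative style fingerprint.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "is", "it", "for", "with"]

def profile(text: str) -> list[float]:
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return [words.count(w) / max(len(words), 1) for w in FUNCTION_WORDS]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

known_writing = "The method is simple. It relies on the counts of common words in the text."
submitted_essay = "It is common for the approach to rely on counting the words of the text."
print(round(cosine(profile(known_writing), profile(submitted_essay)), 3))
```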

Journal ArticleDOI
TL;DR: The results show that the proposed candidate retrieval model outperforms the state-of-the-art models and can be considered as a proper choice to be embedded in cross-language plagiarism detection systems.
Abstract: Due to the rapid growth of documents and manuscripts in various languages all over the world, plagiarism detection has become a challenging task, especially for cross-lingual cases. Because of this, today's plagiarism detection systems include a candidate retrieval process as the first step, in order to reduce the set of documents for comparison to a reasonable number. The performance of the second step of plagiarism detection, which is devoted to a detailed analysis of the candidates, is tightly dependent on the candidate retrieval phase. Given its high importance, the present study focuses on the candidate retrieval task and aims to accurately extract the minimal set of highly potential source documents. The paper proposes a fusion of concept-based and keyword-based retrieval models for this purpose. A dynamic interpolation factor is used in the proposed scheme to combine the results of the conceptual and bag-of-words models. The effectiveness of the proposed model for cross-language candidate retrieval is compared with state-of-the-art models over German-English and Spanish-English language partitions. The results show that the proposed candidate retrieval model outperforms the state-of-the-art models and can be considered a proper choice to be embedded in cross-language plagiarism detection systems.
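The fusion step described above can be sketched as a simple linear interpolation of the two retrieval scores. The fixed interpolation factor and the toy score dictionaries are assumptions; the paper computes the factor dynamically.

```python
# score(d) = alpha * concept(d) + (1 - alpha) * keyword(d); alpha is illustrative.
def fuse_scores(concept_scores: dict, keyword_scores: dict, alpha: float = 0.6):
    docs = set(concept_scores) | set(keyword_scores)
    return {d: alpha * concept_scores.get(d, 0.0) + (1 - alpha) * keyword_scores.get(d, 0.0)
            for d in docs}

concept = {"doc_a": 0.82, "doc_b": 0.40}
keyword = {"doc_a": 0.35, "doc_b": 0.77, "doc_c": 0.50}
ranked = sorted(fuse_scores(concept, keyword).items(), key=lambda kv: kv[1], reverse=True)
print(ranked[:2])   # top candidates passed on to the detailed-analysis step
```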

Journal ArticleDOI
13 Nov 2020
TL;DR: This paper presents an entirely automatic program transformation approach, MOSSAD, that defeats popular software plagiarism detection tools and is effective at defeating four plagiarism detectors, including Moss and JPlag.
Abstract: Automatic software plagiarism detection tools are widely used in educational settings to ensure that submitted work was not copied. These tools have grown in use together with the rise in enrollments in computer science programs and the widespread availability of code on-line. Educators rely on the robustness of plagiarism detection tools; the working assumption is that the effort required to evade detection is as high as that required to actually do the assigned work. This paper shows this is not the case. It presents an entirely automatic program transformation approach, MOSSAD, that defeats popular software plagiarism detection tools. MOSSAD comprises a framework that couples techniques inspired by genetic programming with domain-specific knowledge to effectively undermine plagiarism detectors. MOSSAD is effective at defeating four plagiarism detectors, including Moss and JPlag. MOSSAD is both fast and effective: it can, in minutes, generate modified versions of programs that are likely to escape detection. More insidiously, because of its non-deterministic approach, MOSSAD can, from a single program, generate dozens of variants, which are classified as no more suspicious than legitimate assignments. A detailed study of MOSSAD across a corpus of real student assignments demonstrates its efficacy at evading detection. A user study shows that graduate student assistants consistently rate MOSSAD-generated code as just as readable as authentic student code. This work motivates the need for both research on more robust plagiarism detection tools and greater integration of naturally plagiarism-resistant methodologies like code review into computer science education.

Journal ArticleDOI
TL;DR: The cases of plagiarism in non-English speaking countries have a strong message for honest researchers that they should improve their English writing skills and credit used sources by properly citing and referencing them.
Abstract: What constitutes plagiarism? What are the methods to detect plagiarism? How do “plagiarism detection tools” assist in detecting plagiarism? What is the difference between plagiarism and similarity index? These are probably the most common questions regarding plagiarism that many research experts in scientific writing are usually faced with, but a definitive answer to them is less well known. According to a report published in 2018, papers retracted for plagiarism have sharply increased over the last two decades, with higher rates in developing and non-English speaking countries.1 Several studies have reported similar findings, with Iran, China, India, Japan, Korea, Italy, Romania, Turkey, and France amongst the countries with the highest number of retractions due to plagiarism.1,2,3,4 A study reported that duplication of text, figures or tables without appropriate referencing accounted for 41.3% of post-2009 retractions of papers published from India.5 In Pakistan, the Journal of Pakistan Medical Association started a special section titled “Learning Research” and published a couple of papers on research writing skills, research integrity and scientific misconduct.6,7 However, the problem has not been adequately addressed, and specific issues about it remain unresolved and unclear. According to unpublished data based on 1,679 students from four universities of Pakistan, 85.5% did not have a clear understanding of the difference between similarity index and plagiarism. Smart et al.8 in their global survey of editors reported that around 63% experienced some plagiarized submissions, with Asian editors experiencing the highest levels of plagiarized/duplicated content. In some papers, journals from non-English speaking countries have specifically discussed cases of plagiarized submissions and have highlighted the drawbacks of relying on similarity checking programs.9,10,11 The cases of plagiarism in non-English speaking countries carry a strong message for honest researchers: they should improve their English writing skills and credit used sources by properly citing and referencing them.12 Despite the aggregating literature on plagiarism from non-Anglophone countries, the answers to the aforementioned questions remain unclear. In order to answer these questions, it is important to have a thorough understanding of plagiarism and bring clarity to the less known issues about it. Therefore, this paper aims to 1) define plagiarism and the growth in its prevalence as well as the literature on it; 2) explain the difference between similarity and plagiarism; 3) discuss the role of similarity checking tools in detecting plagiarism and the flaws of relying on them completely; and 4) discuss the phenomenon called the Trojan citation. At the end, suggestions are provided for authors and editors from developing countries so that this issue may be collectively addressed.

Book ChapterDOI
23 Mar 2020
TL;DR: It is shown that the best approach outperforms human experts and established plagiarism detection systems for these classification tasks and provides a Web application that uses the best performing classification approach to indicate whether a text underwent machine-paraphrasing.
Abstract: Research on academic integrity has identified online paraphrasing tools as a severe threat to the effectiveness of plagiarism detection systems. To enable the automated identification of machine-paraphrased text, we make three contributions. First, we evaluate the effectiveness of six prominent word embedding models in combination with five classifiers for distinguishing human-written from machine-paraphrased text. The best performing classification approach achieves an accuracy of 99.0% for documents and 83.4% for paragraphs. Second, we show that the best approach outperforms human experts and established plagiarism detection systems for these classification tasks. Third, we provide a Web application that uses the best performing classification approach to indicate whether a text underwent machine-paraphrasing. The data and code of our study are openly available.

Journal ArticleDOI
TL;DR: A performance overview of various types of corpus-based models, especially deep learning (DL) models, with the task of paraphrase detection shows that DL models are very competitive with traditional state-of-the-art approaches and have potential that should be further developed.
Abstract: Paraphrase detection is important for a number of applications, including plagiarism detection, authorship attribution, question answering, text summarization, and text mining in general. In this paper, we give a performance overview of various types of corpus-based models, especially deep learning (DL) models, on the task of paraphrase detection. We report the results of eight models (LSI, TF-IDF, Word2Vec, Doc2Vec, GloVe, FastText, ELMo, and USE) evaluated on three publicly available corpora: the Microsoft Research Paraphrase Corpus, the Clough and Stevenson corpus, and the Webis Crowd Paraphrase Corpus 2011. Through a great number of experiments, we decided on the most appropriate approaches for text pre-processing, hyper-parameters, sub-model selection where applicable (e.g., Skip-gram vs. CBOW), distance measures, and the semantic similarity/paraphrase detection threshold. Our findings and those of other researchers who have used deep learning models show that DL models are very competitive with traditional state-of-the-art approaches and have potential that should be further developed.
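As a concrete baseline from the model family compared above, the sketch below scores a sentence pair with TF-IDF cosine similarity and applies a detection threshold. The 0.5 threshold is an illustrative value, not one of the tuned thresholds reported in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def is_paraphrase(text_a: str, text_b: str, threshold: float = 0.5) -> bool:
    """Flag the pair as paraphrases if TF-IDF cosine similarity exceeds the threshold."""
    tfidf = TfidfVectorizer().fit_transform([text_a, text_b])
    score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
    return score >= threshold

print(is_paraphrase("The court dismissed the case on Monday.",
                    "On Monday the case was dismissed by the court."))
```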

Journal ArticleDOI
TL;DR: This paper employs text embedding vectors to compare similarity among documents to detect plagiarism and applies the proposed method on available datasets in English, Persian and Arabic languages on the text alignment task to evaluate the robustness of the proposed methods from the language perspective.
Abstract: The efficiency and scalability of plagiarism detection systems have become a major challenge due to the vast amount of textual data available in several languages over the Internet. Plagiarism occurs at different levels of obfuscation, ranging from exact copies of original material to text summarization. Consequently, algorithms designed to detect plagiarism should be robust to diverse languages and to the different types of obfuscation in plagiarism cases. In this paper, we employ text embedding vectors to compare the similarity among documents in order to detect plagiarism. Word vectors are combined by a simple aggregation function to represent a text document. This representation comprises semantic and syntactic information of the text and leads to efficient text alignment between suspicious and original documents. By comparing the representations of sentences in source and suspicious documents, sentence pairs with the highest similarity are considered the candidates, or seeds, of plagiarism cases. To filter and merge these seeds, a set of parameters, including a Jaccard similarity and a merging threshold, are tuned by two different approaches: offline tuning and online tuning. The offline method, which is used as the benchmark, determines a single set of parameters for all types of plagiarism through several trials on the training corpus. Experiments show improvements in performance when the obfuscation type is taken into account during threshold tuning. In this regard, our proposed online approach uses two statistical methods to filter outlier candidates automatically by their scale of obfuscation. By employing the online tuning approach, no separate training dataset is required to train the system. We applied our proposed method to available datasets in English, Persian and Arabic on the text alignment task, to evaluate the robustness of the proposed methods from the language perspective as well. As our experimental results confirm, our efficient approach achieves considerable performance on the different datasets in various languages. Our online threshold tuning approach without any training datasets works as well as, or in some cases even better than, the training-based method.
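The seed-selection step can be sketched as follows: sentences are represented by averaged word vectors, highly similar cross-document sentence pairs become seeds, and seeds with too little lexical (Jaccard) overlap are dropped. The random embedding table and both thresholds below are placeholders, not the paper's representations or tuned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = "plagiarism detection is a hard task embeddings capture word meaning of".split()
EMB = {w: rng.normal(size=50) for w in VOCAB}   # stand-in for pretrained embeddings

def sent_vec(sentence):
    """Average the word vectors of known words (simple aggregation)."""
    vecs = [EMB[w] for w in sentence.lower().split() if w in EMB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)

def seeds(src_sents, sus_sents, cos_t=0.7, jac_t=0.3):
    """Sentence pairs that pass both the embedding and the Jaccard filters."""
    return [(i, j) for i, s in enumerate(src_sents) for j, t in enumerate(sus_sents)
            if cosine(sent_vec(s), sent_vec(t)) >= cos_t and jaccard(s, t) >= jac_t]

print(seeds(["plagiarism detection is a hard task"],
            ["detection of plagiarism is a hard task"]))
```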

Journal ArticleDOI
TL;DR: The proposed system introduces an extrinsic plagiarism detection approach inspired by cognition because it utilizes semantic knowledge to detect the plagiarized part from the text without human involvement.
Abstract: Plagiarism occurs when we use the ideas, expressions, work, and words of other authors without giving them the required attribution. The major contributing factor in plagiarism is the availability of a vast amount of data and information on the internet that can be swiftly accessed. The proposed system introduces an extrinsic plagiarism detection approach inspired by cognition, because it utilizes semantic knowledge to detect plagiarized parts of a text without human involvement. A lexical database such as WordNet assists computers in perceiving the data and information. Most current plagiarism detection systems fail to detect highly complex cases of plagiarism. The proposed system uses the Dice measure as a similarity measure for finding the semantic resemblance between a pair of sentences. It also uses linguistic features such as path similarity and a depth-estimation measure to compute the resemblance between pairs of words, and these features are combined by assigning different weights to them. It is capable of identifying cases such as restructuring, paraphrasing, verbatim copying, and synonymized plagiarism. It has been evaluated on the PAN-PC-11 corpus. The results obtained from the proposed system signify that it has outperformed other existing systems on PAN-PC-11 in terms of precision, recall, F-measure, and PlagDet score. The proposed system takes an innovative approach, and its results are close to, yet reasonably better than, those of existing systems.
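The combination of a set-overlap score with WordNet-based word similarity can be illustrated with NLTK, as below. The equal weighting of the two signals and the simple best-sense matching are assumptions for illustration, not the weighting scheme tuned in the paper.

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")

def word_sim(w1: str, w2: str) -> float:
    """Best WordNet path similarity between any senses of the two words (0 if none)."""
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    return max(scores, default=0.0)

def sentence_sim(a: str, b: str, w: float = 0.5) -> float:
    """Weighted mix of a Dice overlap score and an average best-match word similarity."""
    ta, tb = a.lower().split(), b.lower().split()
    dice = 2 * len(set(ta) & set(tb)) / (len(set(ta)) + len(set(tb)))
    sem = sum(max(word_sim(x, y) for y in tb) for x in ta) / len(ta)
    return w * dice + (1 - w) * sem

print(round(sentence_sim("the vehicle stopped suddenly", "the car halted abruptly"), 2))
```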

Journal ArticleDOI
TL;DR: Paraphrase identification is a natural language processing (NLP) problem that involves the determination of whether two text segments have the same meaning.
Abstract: Paraphrase identification is a natural language processing (NLP) problem that involves the determination of whether two text segments have the same meaning. Various NLP applications rely on a solut...

Journal ArticleDOI
TL;DR: All the considered algorithms are suitable for solving the authorship identification problem, but SVM shows the best accuracy, and Transformer architecture is the most effective for anonymized texts and allows 81% accuracy to be achieved.
Abstract: The article explores approaches to determining the author of a natural language text, along with the advantages and disadvantages of these approaches. The importance of the problem considered stems from the active digitalization of society and the migration of many everyday activities online. Text authorship methods are particularly useful for information security and forensics. For example, such methods can be used to identify the authors of suicide notes and other texts subjected to forensic examination. Another area of application is plagiarism detection. Plagiarism detection is a relevant issue both for the protection of intellectual property in the digital space and for the educational process. The article describes identifying the author of a Russian-language text using a support vector machine (SVM) and deep neural network architectures (long short-term memory (LSTM), convolutional neural networks (CNN) with attention, and Transformer). The results show that all the considered algorithms are suitable for solving the authorship identification problem, but SVM shows the best accuracy. The average accuracy of SVM reaches 96%. This is due to thoroughly chosen parameters and feature space, which includes statistical and semantic features (including those extracted as a result of aspect analysis). Deep neural networks are inferior to SVM in accuracy and reach only 93%. The study also evaluates the impact of attacks on the models' accuracy. Experiments show that the SVM-based methods are not robust to deliberate text anonymization. In comparison, the loss in accuracy of deep neural networks does not exceed 20%. The Transformer architecture is the most effective for anonymized texts and allows 81% accuracy to be achieved.
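A compact version of the SVM setup can be sketched with scikit-learn using character n-gram TF-IDF features on toy data, as below. The article's model relies on carefully chosen statistical and semantic (aspect-based) features rather than this particular feature space, so the sketch only shows the general pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data: two short samples per (hypothetical) author.
texts = [
    "I went to the market early and the prices were, as always, too high.",
    "Early at the market, I found that, as always, everything cost too much.",
    "The experiment shows a clear effect; results are reported in Table 2.",
    "Results of the experiment, reported below, show a clear and strong effect.",
]
authors = ["author_a", "author_a", "author_b", "author_b"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character n-grams
    LinearSVC(),
)
model.fit(texts, authors)
print(model.predict(["As always, the market prices were far too high for me."]))
```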

Journal ArticleDOI
TL;DR: The use of digital technologies has transformed the processes of writing for academic journals and the dissemination and preservation of academic work as discussed by the authors, and it has also made the measurement of the impac...
Abstract: The use of digital technologies has transformed the processes of writing for academic journals and the dissemination and preservation of academic work. It has also made the measurement of the impac...

Journal ArticleDOI
TL;DR: Digital forensics techniques were used to investigate a known case of contract cheating in which the contract author had notified the university and the student subsequently confirmed that they had contracted the work out.
Abstract: Contract cheating is a major problem in Higher Education because it is very difficult to detect using traditional plagiarism detection tools. Digital forensics techniques are already used in law to determine ownership of documents, and also in criminal cases, where it is not uncommon to hide information and images within an ordinary-looking document using steganography techniques. These digital forensic techniques were used to investigate a known case of contract cheating in which the contract author had notified the university and the student subsequently confirmed that they had contracted the work out. Microsoft Word documents use a format known as Office Open XML, and as such it is possible to review the editing process of a document. A student submission known to have been contracted out was analysed using the revision identifiers within the document, and a tool was developed to review these identifiers. Using visualisation techniques, it is possible to see a pattern of editing that is inconsistent with the pattern seen in an authentic document.
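Because a .docx file is an Office Open XML zip archive, the revision identifiers (rsid attributes) mentioned above can be pulled out with the Python standard library, as sketched below. The file name is a placeholder, and interpreting the resulting rsid pattern (for example, an entire essay written under very few rsids) is where the forensic judgement lies; this is not the authors' tool.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML namespace used for rsid attributes (w:rsidR, w:rsidRDefault, ...).
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def rsid_histogram(docx_path: str) -> Counter:
    """Count how often each revision identifier appears in word/document.xml."""
    with zipfile.ZipFile(docx_path) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    counts = Counter()
    for elem in root.iter():
        for attr, value in elem.attrib.items():
            if attr.startswith(W_NS + "rsid"):
                counts[value] += 1
    return counts

# Example usage (placeholder path):
# print(rsid_histogram("essay.docx").most_common(10))
```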

Proceedings ArticleDOI
03 Feb 2020
TL;DR: Two case studies are presented that explore how resilient current source code plagiarism detection tools are to plagiarism-hiding transformations and an evaluation of a new advanced technique that indicates the technique is robust in its ability to identify the same program after it has been transformed.
Abstract: Source code plagiarism is a persistent problem in undergraduate computer science education. Unfortunately, it is a widespread phenomenon, with many students plagiarising because they are either unwilling or unable to complete their own work. Many source code plagiarism detection tools have been proposed to identify suspected cases of source code plagiarism. However, these tools are not resilient to pervasive plagiarism-hiding transformations that significantly change the structure of source code. In this paper, two case studies are presented that explore how resilient current source code plagiarism detection tools are to plagiarism-hiding transformations. Furthermore, an evaluation of a new advanced technique for source code plagiarism detection is presented to show that it is possible to identify pervasive cases of source code plagiarism. The results of this evaluation indicate the technique is robust in its ability to identify the same program after it has been transformed.

Proceedings ArticleDOI
12 Aug 2020
TL;DR: Examinator compares pairs of take-home exams to select which should be manually checked for plagiarism and generates a report with evidence for these cases using its metrics and those generated as a by-product of the commercial grading tool Gradescope.
Abstract: Examinator compares pairs of take-home exams to select which should be manually checked for plagiarism. Examinator also generates a report with evidence for these cases using its metrics and those generated as a by-product of the commercial grading tool Gradescope. Examinator supports degree-seeking graduate programs (both online and on-campus) at a top computer science graduate institute in the United States. Since Spring 2019, Examinator has compared over 2 million pairs of exams from a popular Artificial Intelligence course, resulting in 56 cases being referred for discipline. Iterative development has improved the percentage of referrals of suggested cases from 15% to 25%.

Journal ArticleDOI
TL;DR: Students entering postsecondary educational institutions require ongoing support and learning opportunities to improve their skills in paraphrasing and referencing to avoid plagiarism.
Abstract: Pre- and postintervention surveys of first-year nursing students were undertaken to establish the students' knowledge of plagiarism following implementation of an online library-based Academic Integrity Module and the use of plagiarism detection software. Knowledge and understanding of plagiarism improved, but students' ability to paraphrase remained poor. Students entering postsecondary educational institutions require ongoing support and learning opportunities to improve their skills in paraphrasing and referencing to avoid plagiarism.

Journal ArticleDOI
TL;DR: The obtained results show that the proposed automatic plagiarism detection system for obfuscated text based on a support vector machine classifier that exploits a set of lexical, syntactic and semantic features had the best performance in terms of the F-measure on the PAN 2012 and the PAN@FIRE2015 obfuscated sub-corpora.
Abstract: Plagiarism is a serious problem in education, research, publishing and other fields. Automatic plagiarism detection systems are crucial for ensuring the integrity and genuineness of intellectual work. There are different types of plagiarism, such as copy–paste, obfuscation and translation. In particular, obfuscated text is one of the hardest types of plagiarism to detect. In this paper, we propose an automatic plagiarism detection system for obfuscated text based on a support vector machine classifier that exploits a set of lexical, syntactic and semantic features. We evaluated the performance of the proposed system on benchmark English and Arabic corpora made available by the PAN Workshop series: PAN 2012, PAN 2013, PAN 2014 and PAN@FIRE2015. We also compared the performance of our system to the performances of other systems that participated in the PAN competitions. The obtained results show that our system had the best performance in terms of the F-measure on the PAN 2012 and on the PAN@FIRE2015 obfuscated sub-corpora, was among the top four on the PAN 2013 corpus and was among the top two on the PAN 2014 corpus.

Journal ArticleDOI
TL;DR: The designed anti-plagiarism system is compared with the success of plagiarism detection performed by the two most used anti-plagiarism tools, namely JPlag and MOSS.
Abstract: The paper deals with the issue of detecting plagiarism in source code, which we unfortunately encounter when teaching subjects dealing with programming and software development. Many students want to simplify the completion of the course and therefore submit modified source code from their classmates or even code found on the Internet. Some try to modify the source code, e.g. by changing the identifiers of classes, methods and variables, by changing the corresponding loops, by introducing new methods, by changing the order of methods in the source code, or in other ways. We focused directly on this problem and designed our own anti-plagiarism system, which we describe in this paper. The designed system consists of three parts, during which the source code is processed using six designed algorithms. The basis is the processing of the source code and its transformation into an abstract syntax tree, consisting of two types of nodes, which is then vectorized using our modified DECKARD algorithm. The vectors are then clustered and stored in a database from which similar parts of the source code can be searched. The output of the system is a final report containing a list of matches, with similarities, across all works added to the database up to that point. The designed anti-plagiarism system is finally compared, in terms of plagiarism detection success, with the two most widely used anti-plagiarism tools, namely JPlag and MOSS. It is evaluated on assignments completed by students in courses on object-oriented programming at our faculty.
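The AST-vectorization idea can be illustrated with Python's own ast module: each program becomes a vector of node-type counts, so structurally similar code stays close even after identifiers are renamed. The described system uses its own two-node-type tree and a modified DECKARD vectorizer on student submissions, so the sketch below only demonstrates the general principle.

```python
import ast
from collections import Counter
from math import sqrt

def ast_vector(source: str) -> Counter:
    """Count the AST node types of a Python program."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na, nb = sqrt(sum(v * v for v in a.values())), sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
renamed = "def suma(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"
# Renaming identifiers barely changes the node-type vector, so similarity stays near 1.
print(round(cosine(ast_vector(original), ast_vector(renamed)), 3))
```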

Journal ArticleDOI
TL;DR: This paper investigates two tasks, namely cross-lingual semantic text similarity (CL-STS) and plagiarism detection and judgement (PD), using deep neural networks which, to the best of the authors' knowledge, have not been applied before to STS and PD in a cross-lingual setting with such a combination of features.

Journal ArticleDOI
TL;DR: This work considers, for the first time, imbalanced data as a crucial parameter of the problem, experiments with various balancing techniques, and combines the proposed features and imbalanced-dataset treatment with various classification methods.
Abstract: The ever-increasing volume of information due to the widespread use of computers and the web has made effective plagiarism detection methods a necessity. Plagiarism can be found in many settings and forms: in literature, in academic papers, even in programming code. Intrinsic plagiarism detection is the task that deals with the discovery of plagiarized passages in a text document by identifying the stylistic changes and inconsistencies within the document itself, given that no reference corpus is available. The main idea consists in profiling the style of the original author and marking the passages that seem to differ significantly. In this work, we follow a supervised machine learning classification approach. We consider, for the first time, imbalanced data as a crucial parameter of the problem and experiment with various balancing techniques. Apart from this, we propose some novel stylistic features. We combine our features and imbalanced-dataset treatment with various classification methods. Our detection system is tested on the data corpora of the PAN Webis intrinsic plagiarism detection shared tasks. It is compared to the best performing detection systems on these datasets and achieves the best resulting scores.

Journal ArticleDOI
TL;DR: An efficient mechanism for the detection of plagiarism in repositories of Model-Driven Engineering (MDE) assignments is provided, based on the adaptation of Locality Sensitive Hashing (LSH), an approximate nearest-neighbor search mechanism, to the modeling technical space.
Abstract: Reports suggest plagiarism is a common occurrence in universities. While plagiarism detection mechanisms exist for textual artifacts, this is less so for non-code related ones such as software desi...
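Although the abstract above is truncated, the locality-sensitive-hashing idea it refers to can be illustrated with a hand-rolled MinHash: artifacts with similar element sets get similar signatures, so near-duplicates can be found without exhaustive pairwise comparison. The model-element strings and signature length below are made-up stand-ins, not the paper's adaptation to the modeling technical space.

```python
import hashlib
from itertools import combinations

NUM_HASHES = 64   # illustrative signature length

def minhash(elements: set[str]) -> tuple[int, ...]:
    """For each seeded hash function, keep the minimum hash over the element set."""
    return tuple(min(int(hashlib.md5(f"{seed}:{e}".encode()).hexdigest(), 16)
                     for e in elements)
                 for seed in range(NUM_HASHES))

def estimated_similarity(sig_a, sig_b) -> float:
    """Fraction of matching signature positions approximates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / NUM_HASHES

# Hypothetical sets of model elements extracted from three student assignments.
models = {
    "student1": {"Class:Order", "Class:Item", "Ref:Order->Item", "Attr:Order.date"},
    "student2": {"Class:Order", "Class:Item", "Ref:Order->Item", "Attr:Order.total"},
    "student3": {"Class:Library", "Class:Book", "Ref:Library->Book"},
}
sigs = {k: minhash(v) for k, v in models.items()}
for a, b in combinations(models, 2):
    print(a, b, round(estimated_similarity(sigs[a], sigs[b]), 2))
```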