scispace - formally typeset

Showing papers on "Plagiarism detection published in 2021"


Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a blockchain-based code copyright management system which provides better response speed and storage efficiency than the traditional code originality verification model based on Syntax Tree.
Abstract: With the increasing number of open-source software projects, code plagiarism has become one of the threats to the software industry. However, current research on code copyright protection mostly focuses on approaches for code plagiarism detection, failing to fundamentally solve the problem of copyright confirmation and protection. This paper proposes a blockchain-based code copyright management system. Firstly, a Syntax Tree-based code originality verification model is constructed. The originality of the uploaded code is determined through its similarity to other original codes. Secondly, a Peer-to-Peer blockchain network is designed to store the copyright information of the original code. The nodes in the blockchain network can verify the originality of the code based on the code originality verification model. Then, through the construction of blocks and the legitimacy validation and linking of blocks, the blockchain-based code copyright management structure is built. The whole process guarantees that the copyright information is traceable and cannot be tampered with. According to the experiments, the accuracy and processing time of the code originality verification model meet code originality verification requirements. The experiments also show that the best storage type for the code copyright information is the code fingerprint, a 256-bit hash value converted from code eigenvalues, which performs better in both response speed and storage efficiency. Moreover, because of the uniqueness and irreversibility of the SHA256 algorithm's output, code fingerprint storage yields a better level of storage security. In summary, this paper proposes a blockchain-based code copyright management system which provides better response speed and storage efficiency.
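The abstract does not specify how the code eigenvalues are serialized before hashing, so the comma-joined fixed-precision encoding below is an illustrative assumption; only the SHA-256 output format is stated in the paper.

```python
import hashlib

def code_fingerprint(eigenvalues):
    """Derive a 256-bit code fingerprint from numeric code eigenvalues.

    The serialization format (comma-joined, fixed precision) is an
    illustrative assumption; the paper specifies only the SHA256 step.
    """
    serialized = ",".join(f"{v:.6f}" for v in eigenvalues).encode("utf-8")
    return hashlib.sha256(serialized).hexdigest()

fp = code_fingerprint([0.12, 3.4, 5.6])
print(len(fp))  # 64 hex characters = 256 bits
```

Because SHA-256 is one-way and collision-resistant, the stored fingerprint does not reveal the code itself, which is the storage-security property the abstract relies on.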

43 citations


Journal ArticleDOI
TL;DR: BPlag as discussed by the authors applies symbolic execution to analyze execution behavior and represents a program in a novel graph-based format, then detects plagiarism by comparing these graphs and evaluating similarity scores.
Abstract: Source code plagiarism is a long-standing issue in tertiary computer science education. Many source code plagiarism detection tools have been proposed to aid in the detection of source code plagiarism. However, existing detection tools are not robust to pervasive plagiarism-hiding transformations and can be inaccurate in the detection of plagiarised source code. This article presents BPlag, a behavioural approach to source code plagiarism detection. BPlag is designed to be both robust to pervasive plagiarism-hiding transformations and accurate in the detection of plagiarised source code. Greater robustness and accuracy are afforded by analyzing the behavior of a program, as behavior is perceived to be the aspect of a program least affected by plagiarism-hiding transformations. BPlag applies symbolic execution to analyze execution behavior and represents a program in a novel graph-based format. Plagiarism is then detected by comparing these graphs and evaluating similarity scores. BPlag is evaluated for robustness, accuracy and efficiency against five commonly used source code plagiarism detection tools. It is shown that BPlag is more robust to plagiarism-hiding transformations and more accurate in the detection of plagiarised source code, but is less efficient than the compared tools.

21 citations


Journal ArticleDOI
TL;DR: A methodology for software plagiarism detection in multiprogramming languages based on machine learning approaches is proposed, and Principal Component Analysis (PCA) is applied for feature extraction from source codes without losing the actual information.
Abstract: Software plagiarism, which gives rise to the problem of software piracy, is a growing major concern nowadays. It is a serious risk to the software industry and causes huge economic damage every year. Customers may develop a modified version of the original software in other programming languages. Furthermore, plagiarism detection across different types of source code is a challenging task because each source code may have specific syntax rules. In this paper, we propose a methodology for software plagiarism detection in multiprogramming languages based on machine learning approaches. Principal Component Analysis (PCA) is applied for feature extraction from source codes without losing the actual information. It extracts features by factor analysis and converts the dataset into normalized linear principal components which are further useful for prediction analysis. Then, the multinomial logistic regression (MLR) model is applied to these components to classify the source code documents based on predictions. It generalizes logistic regression to handle multiclass problems. Further, the predictors' performance in MLR is evaluated by a two-tailed z-test. For the experiment, the dataset is collected in five different and popular languages, i.e., C, C++, Java, C#, and Python. Each programming language is taken in two different case studies, i.e., binary search and stack.
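The PCA step described above can be sketched with plain NumPy; the toy feature matrix below is invented for illustration, since the paper's actual source-code features are not published in the abstract.

```python
import numpy as np

def pca_components(X, k):
    """Project feature matrix X onto its first k principal components."""
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]       # sort by explained variance
    W = eigvecs[:, order[:k]]               # top-k principal directions
    return Xc @ W

# Toy "source-code feature" matrix: 6 samples, 4 features (invented values).
X = np.array([[3., 1., 0., 2.],
              [2., 1., 1., 2.],
              [9., 7., 6., 8.],
              [8., 6., 7., 9.],
              [1., 0., 0., 1.],
              [9., 8., 7., 9.]])
Z = pca_components(X, 2)
print(Z.shape)  # (6, 2)
```

In practice, the components Z would then be fed to the multinomial logistic regression classifier described in the abstract.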

20 citations


Journal ArticleDOI
TL;DR: In this article, an algorithm for document plagiarism detection using the provided incremental knowledge construction with formal concept analysis (FCA) is presented to support document matching between the source document in storage and the suspect document.
Abstract: This paper proposes an algorithm for document plagiarism detection using incremental knowledge construction with formal concept analysis (FCA). The incremental knowledge construction is presented to support document matching between the source document in storage and the suspect document. In addition, a new concept similarity measure is proposed for retrieving formal concepts in the knowledge construction. The presented concept similarity employs appearance frequencies in the obtained knowledge construction. Our approach can be applied to retrieve relevant information because the obtained structure uses FCA in concept form that is definable by a conjunction of properties. This measure is mathematically proven to be a formal similarity metric. The performance of the proposed similarity measure is demonstrated in document plagiarism detection. Moreover, this paper provides an algorithm to build the information structure for document plagiarism detection. Thai text test collections are used for performance evaluation of the implemented web application.
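The paper's frequency-based concept similarity is not fully specified in the abstract; as a simplified stand-in, the Jaccard distance on concept attribute sets is a provably valid metric and illustrates the general idea of comparing formal concepts by their properties.

```python
def jaccard_distance(a, b):
    """1 - |A∩B| / |A∪B|: a true metric on finite attribute sets.

    This is a simplified stand-in for the paper's frequency-based
    concept similarity, used only to illustrate the idea.
    """
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Two formal concepts described by their attribute (property) sets.
c1 = {"loop", "hash", "compare"}
c2 = {"loop", "compare", "sort"}
print(jaccard_distance(c1, c2))  # 0.5: two shared of four total attributes
```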

16 citations


Journal ArticleDOI
TL;DR: A deep contextual long semantic textual similarity network is proposed and detailed experimentation and results show that the proposed deep contextual model performs better than the human annotation.
Abstract: Semantic text similarity (STS) is a challenging issue for natural language processing due to linguistic expression variability and ambiguities. The degree of likeness between two sentences is calculated by sentence similarity. It plays a prominent role in many applications like information retrieval (IR), plagiarism detection (PD), question answering platforms, text paraphrasing, etc. Deep contextualised word representations have now become a better way to extract features from sentences, showing exciting experimental results in recent studies. In this paper, we propose a deep contextual long semantic textual similarity network. Deep contextual mechanisms for collecting high-level semantic knowledge are used in the LSTM network. By applying the architecture to various semantic similarity datasets, we demonstrated our model's effectiveness on both regression and classification tasks. Detailed experimentation and results show that the proposed deep contextual model performs better than the human annotation.

16 citations


Journal ArticleDOI
TL;DR: A new deep-learning based model which can generalize well despite the lack of training data for deep models is proposed which outperforms almost all the previous works in terms of f-measure and accuracy on MSRP and Quora datasets.
Abstract: Paraphrase detection is one of the fundamental tasks in the area of natural language processing. Paraphrase refers to those sentences or phrases that convey the same meaning but use different wording. It has a lot of applications such as machine translation, text summarization, QA systems, and plagiarism detection. In this research, we propose a new deep-learning based model which can generalize well despite the lack of training data for deep models. After preprocessing, our model can be divided into two separate modules. In the first one, we train a single Bi-LSTM neural network to encode the whole input by leveraging its pretrained GloVe word vectors. In the second module, three sets of handcrafted features are used to measure the similarity between each pair of sentences, some of which are introduced in this research for the first time. Our final model is formed by incorporating the handcrafted features with the output of the Bi-LSTM network. Evaluation results on MSRP and Quora datasets show that it outperforms almost all the previous works in terms of f-measure and accuracy on MSRP and achieves comparable results on Quora. On the Quora question-pair competition launched by Kaggle, our model ranked among the top 24% of solutions among more than 3,000 teams.
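The abstract does not enumerate the handcrafted features, so the three below (unigram Jaccard overlap, length ratio, bigram overlap) are illustrative assumptions of the kind of pairwise features such a module might compute.

```python
def handcrafted_features(s1, s2):
    """Three simple pairwise similarity features (illustrative only;
    not the paper's actual feature set, which is unpublished here)."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    w1, w2 = set(t1), set(t2)
    overlap = len(w1 & w2) / len(w1 | w2)                  # unigram Jaccard
    length_ratio = min(len(t1), len(t2)) / max(len(t1), len(t2))
    bi1 = {tuple(t1[i:i + 2]) for i in range(len(t1) - 1)}
    bi2 = {tuple(t2[i:i + 2]) for i in range(len(t2) - 1)}
    bigram_overlap = len(bi1 & bi2) / max(1, len(bi1 | bi2))
    return overlap, length_ratio, bigram_overlap

print(handcrafted_features("the cat sat on the mat",
                           "the cat sat on a mat"))
```

In the paper's architecture, features like these are concatenated with the Bi-LSTM encoding before the final classification layer.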

15 citations


Journal ArticleDOI
Yikun Hu1, Hui Wang1, Yuanyuan Zhang1, Bodong Li1, Dawu Gu1 
TL;DR: The experimental results show that BinMatch is resilient to the semantics-equivalent code transformation, and not only covers all target functions for similarity comparison, but also improves the accuracy comparing to the state-of-the-art solutions.
Abstract: Binary code similarity comparison is a methodology for identifying similar or identical code fragments in binary programs. It is indispensable in the fields of software engineering and security and has many important applications (e.g., plagiarism detection, bug detection). With the widespread adoption of smart and Internet of Things (IoT) devices, an increasing number of programs are ported to multiple architectures (e.g., ARM, MIPS). It becomes necessary to detect similar binary code across architectures as well. The main challenge of this topic lies in the semantics-equivalent code transformation resulting from different compilation settings, code obfuscation, and varied instruction set architectures. Another challenge is the trade-off between comparison accuracy and coverage. Unfortunately, existing methods still heavily rely on semantics-less code features which are susceptible to code transformation. Additionally, they perform the comparison merely either statically or dynamically, which cannot achieve high accuracy and coverage simultaneously. In this paper, we propose a semantics-based hybrid method to compare binary function similarity. We execute the reference function with test cases, then emulate the execution of every target function with the runtime information migrated from the reference function. Semantic signatures are extracted during the execution as well as the emulation. Lastly, similarity scores are calculated from the signatures to measure the likeness of functions. We have implemented the method in a prototype system designated as BinMatch which performs binary code similarity comparison across the x86, ARM and MIPS architectures on the Linux platform. We evaluate BinMatch with nine real-world projects compiled with different compilation settings, on variant architectures, and with commonly-used obfuscation methods, in total performing over 100 million pairs of function comparisons.
The experimental results show that BinMatch is resilient to semantics-equivalent code transformation. Besides, it not only covers all target functions for similarity comparison, but also improves accuracy compared to the state-of-the-art solutions.

14 citations


Journal ArticleDOI
TL;DR: Medical researchers and authors may improve their writing skills and avoid the same errors by consulting the list of retractions due to plagiarism which are tracked on the PubMed platform and discussed on the Retraction Watch blog.
Abstract: Plagiarism is an ethical misconduct affecting the quality, readability, and trustworthiness of scholarly publications. Improving researcher awareness of plagiarism of words, ideas, and graphics is essential for avoiding unacceptable writing practices. Global editorial associations have publicized their statements on strategies to clean literature from redundant, stolen, and misleading information. Consulting related documents is advisable for upgrading author instructions and warning plagiarists of academic and other consequences of the unethical conduct. A lack of creative thinking and poor academic English skills are believed to compound most instances of redundant and “copy-and-paste” writing. Plagiarism detection software largely relies on reporting text similarities. However, manual checks are required to reveal inappropriate referencing, copyright violations, and substandard English writing. Medical researchers and authors may improve their writing skills and avoid the same errors by consulting the list of retractions due to plagiarism which are tracked on the PubMed platform and discussed on the Retraction Watch blog.

13 citations


Journal ArticleDOI
14 Jul 2021
TL;DR: This paper uses Word2vec to transform words into word vectors that reveal the semantic relationships among different words, so that plagiarism detection can be done more effectively.
Abstract: Plagiarism is a common problem in the modern age. With the advance of the Internet, it is more and more convenient to access other people's writings and publications. When someone uses the content of a text in an undesirable way, plagiarism may occur. Plagiarism infringes intellectual property rights, so it is a serious problem nowadays. However, detecting plagiarism effectively is a challenging task. Traditional methods, like the vector space model or bag-of-words, fall short of providing a good solution because they cannot handle the semantics of words satisfactorily. In this paper, we propose a new method for plagiarism detection. We use Word2vec to transform words into word vectors which are able to reveal the semantic relationships among different words. Through word vectors, words are clustered into concepts. Then documents and their paragraphs are represented in terms of concepts, and plagiarism detection can be done more effectively. A number of experiments are conducted to demonstrate the good performance of our proposed method.
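A minimal sketch of the concept pipeline, assuming a toy set of pretrained word vectors in place of a real Word2vec model and a simple greedy clustering in place of whatever clustering method the paper actually uses:

```python
from math import sqrt

# Toy pretrained vectors (in practice these come from a Word2vec model).
vectors = {
    "car":   (0.9, 0.1), "auto":  (0.85, 0.15),
    "fruit": (0.1, 0.9), "apple": (0.15, 0.85),
}

def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def cluster_words(words, threshold=0.95):
    """Greedily group words whose vectors are nearly parallel into concepts."""
    concepts = []  # list of word lists, one per concept
    for w in words:
        for c in concepts:
            if cos(vectors[w], vectors[c[0]]) >= threshold:
                c.append(w)
                break
        else:
            concepts.append([w])
    return concepts

concepts = cluster_words(list(vectors))
word2concept = {w: i for i, c in enumerate(concepts) for w in c}

def concept_vector(doc):
    """Represent a document as concept frequencies instead of word counts."""
    vec = [0] * len(concepts)
    for w in doc.split():
        if w in word2concept:
            vec[word2concept[w]] += 1
    return vec

# "car fruit" and "auto apple" share no words but hit the same concepts.
print(cos(concept_vector("car fruit"), concept_vector("auto apple")))  # ≈ 1.0
```

Comparing concept vectors rather than raw word vectors is what lets synonym substitutions ("car" → "auto") still register as matches.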

13 citations


Posted Content
02 Jan 2021
TL;DR: This article proposes a cross-document language model (CD-LM) for multi-document NLP tasks, which pretrains with multiple related documents in a single input via cross-document masking, encouraging the model to learn cross-document and long-range relationships, and introduces a new attention pattern that uses sequence-level global attention to predict masked tokens.
Abstract: We introduce a new pretraining approach for language models that are geared to support multi-document NLP tasks. Our cross-document language model (CD-LM) improves masked language modeling for these tasks with two key ideas. First, we pretrain with multiple related documents in a single input, via cross-document masking, which encourages the model to learn cross-document and long-range relationships. Second, extending the recent Longformer model, we pretrain with long contexts of several thousand tokens and introduce a new attention pattern that uses sequence-level global attention to predict masked tokens, while retaining the familiar local attention elsewhere. We show that our CD-LM sets new state-of-the-art results for several multi-text tasks, including cross-document event and entity coreference resolution, paper citation recommendation, and document plagiarism detection, while using a significantly reduced number of training parameters relative to prior works.

11 citations


Journal ArticleDOI
TL;DR: This paper presents a low-level approach with a more efficient comparison; the comparison takes only linear time, with the help of Cosine Correlation in the Vector Space Model.
Abstract: Low-level approach is a recent way for source code plagiarism detection. Instead of relying on source code tokens, it relies on low-level structural representation resulted from compiling given sou...

Posted Content
TL;DR: Zhang et al. as discussed by the authors proposed an architecture based on a Long Short-Term Memory (LSTM) and attention mechanism called LSTM-AM-ABC boosted by a population-based approach for parameter initialization.
Abstract: Plagiarism detection, one of the leading problems in academic and industrial environments, aims to find similar items in a typical document or source code. This paper proposes an architecture based on Long Short-Term Memory (LSTM) and an attention mechanism, called LSTM-AM-ABC, boosted by a population-based approach for parameter initialization. Gradient-based optimization algorithms such as back-propagation (BP) are widely used in the literature for the learning process in LSTMs, attention mechanisms, and feed-forward neural networks, but they suffer from problems such as getting stuck in local optima. To tackle this problem, population-based metaheuristic (PBMH) algorithms can be used. To this end, this paper employs a PBMH algorithm, artificial bee colony (ABC), to mitigate the problem. Our proposed algorithm can find the initial values for model learning in the LSTM, attention mechanism, and feed-forward neural network simultaneously. In other words, the ABC algorithm finds a promising starting point for the BP algorithm. For evaluation, we compare our proposed algorithm with both conventional and population-based methods. The results clearly show that the proposed method can provide competitive performance.
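The full ABC algorithm maintains employed, onlooker, and scout bees; as a much-simplified sketch of the core idea, the snippet below just evaluates a random population on a toy loss and hands the best candidate to gradient descent as its starting point. The loss function and all parameters are invented for illustration.

```python
import random

def loss(w):
    """Toy objective standing in for the network's training loss."""
    return sum((x - 0.5) ** 2 for x in w)

def population_init(dim, pop_size=20, seed=0):
    """Pick the best of a random population as the starting point for
    gradient-based training (a much-simplified stand-in for ABC)."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1, 1) for _ in range(dim)]
                  for _ in range(pop_size)]
    return min(population, key=loss)  # best candidate seeds BP

w0 = population_init(dim=4)
print(len(w0))  # 4: one initial value per model parameter
```

The point of such an initialization is that BP then starts its descent from a promising region rather than from an arbitrary random point.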

Journal ArticleDOI
TL;DR: The authors proposed a three-staged approach that uses context matching and pretrained word embeddings for identifying synonymous substitution and word reordering in paraphrased, plagiarised sentence pairs, which can be used to complement similarity reports generated by currently available plagiarism detection systems by incorporating methods to identify paraphrase types.
Abstract: Paraphrase types have been proposed by researchers as the paraphrasing mechanisms underlying acts of plagiarism. Synonymous substitution, word reordering and insertion/deletion have been identified as some of the common paraphrasing strategies used by plagiarists. However, similarity reports generated by most plagiarism detection systems provide a similarity score and produce matching sections of text with their possible sources. In this research we propose methods to identify two important paraphrase types – synonymous substitution and word reordering – in paraphrased, plagiarised sentence pairs. We propose a three-staged approach that uses context matching and pretrained word embeddings for identifying synonymous substitution and word reordering. Our proposed approach indicates that the use of the Smith-Waterman Algorithm for Plagiarism Detection and ConceptNet Numberbatch pretrained word embeddings produces the best performance in terms of F1 scores. This research can be used to complement similarity reports generated by currently available plagiarism detection systems by incorporating methods to identify paraphrase types for plagiarism detection.
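The Smith-Waterman algorithm mentioned above computes a local-alignment score between two sequences; a word-level sketch follows, with assumed match/mismatch/gap scores, since the paper's parameters are not given here.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Word-level Smith-Waterman local-alignment score.

    The scoring parameters are illustrative assumptions."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: the score never drops below zero.
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

x = "the quick brown fox".split()
y = "a quick brown dog".split()
print(smith_waterman(x, y))  # 4: "quick brown" aligns (2 matches x 2)
```

Local alignment is useful for plagiarism analysis because it finds the best-matching sub-span of two sentences even when their surroundings were reworded.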

Journal ArticleDOI
TL;DR: In this paper, the authors give a performance overview of various types of corpus-based models, especially deep learning (DL) models, with the task of paraphrase detection, which is important for a number of applications, including plagiarism detection, authorship attribution, question answering, text summarization, text mining in general, etc.
Abstract: Paraphrase detection is important for a number of applications, including plagiarism detection, authorship attribution, question answering, text summarization, text mining in general, etc. In this paper, we give a performance overview of various types of corpus-based models, especially deep learning (DL) models, on the task of paraphrase detection. We report the results of eight models (LSI, TF-IDF, Word2Vec, Doc2Vec, GloVe, FastText, ELMO, and USE) evaluated on three different publicly available corpora: Microsoft Research Paraphrase Corpus, Clough and Stevenson, and Webis Crowd Paraphrase Corpus 2011. Through a great number of experiments, we determined the most appropriate choices for text pre-processing, hyper-parameters, sub-model selection where applicable (e.g., Skip-gram vs. CBOW), distance measures, and the semantic similarity/paraphrase detection threshold. Our findings and those of other researchers who have used deep learning models show that DL models are very competitive with traditional state-of-the-art approaches and have potential that should be further developed.

Journal ArticleDOI
08 Mar 2021
TL;DR: There is strong statistical evidence that awareness about plagiarism and anti-plagiarism software has significantly impacted researchers’ actions towards preventing plagiarism.
Abstract: This paper aims to analyse researchers' awareness about plagiarism and the impact of plagiarism detection tools on the actions that they take to prevent plagiarism. It also employs a structural model that examines whether awareness of plagiarism and anti-plagiarism tools has any significant effect on the actions taken by the researchers to avoid plagiarism. A survey questionnaire was distributed to researchers at a large public university in Bangladesh. The survey accumulated 184 valid responses. Descriptive statistics were obtained to assess researchers' awareness about plagiarism, the impact of plagiarism detection tools, and the actions taken by them. The reasons that may cause plagiarism were also identified. Awareness of the availability of the anti-plagiarism software being used by the university, and its actual use by the researchers, was gathered through the survey. Non-parametric Mann-Whitney and Kruskal-Wallis tests were conducted to investigate the differences in awareness levels and actions in terms of gender, age, discipline and current level of research. The chi-square test was carried out to examine the relationship between awareness about the availability of the anti-plagiarism software and its use by the researchers. Finally, the survey data were analysed using structural equation modeling to examine the effects of awareness of plagiarism and anti-plagiarism software on the actions taken by the researchers. The study revealed that the level of awareness regarding plagiarism and the impact of plagiarism detection software is generally high among the researchers. There are some significant differences between researchers' demographic and personal characteristics and their awareness levels and actions with regard to plagiarism. 
The findings indicate that almost three-quarters of the researchers were aware of the anti-plagiarism tool being used, whereas more than half of the researchers indicated that they used the software to assess their work. The results of the structural equation model do not show a good fit, although there is strong statistical evidence that awareness about plagiarism and anti-plagiarism software has significantly impacted researchers' actions towards preventing plagiarism. There is no reported study on researchers' awareness of plagiarism and its affiliated issues in Bangladesh. The findings of this study will not only provide useful insights regarding awareness about plagiarism but also assist university authorities to formulate relevant policy and take necessary actions against plagiarism in higher education institutions.

Journal ArticleDOI
TL;DR: HAMTA, a Persian plagiarism detection corpus, is proposed, and evaluation results indicate a high correlation between the proposed corpus and the PAN state-of-the-art English plagiarism detection corpus.
Abstract: Plagiarism detection deals with detecting plagiarized fragments among textual documents. The availability of digital documents in online libraries makes plagiarism easier but, on the other hand, also easier to detect with automatic plagiarism detection systems. Large-scale plagiarism corpora with a wide variety of plagiarism cases are needed to evaluate different detection methods in different languages. Plagiarism detection corpora play an important role in evaluating and tuning plagiarism detection systems. Despite their importance, few corpora have been developed for low-resource languages. In this paper, we propose HAMTA, a Persian plagiarism detection corpus. To simulate real cases of plagiarism, manually paraphrased texts are used to compile the corpus. For obtaining the manual plagiarism cases, a crowdsourcing platform is developed and crowd workers are asked to paraphrase fragments of text in order to simulate real cases of plagiarism. Moreover, artificial methods are used to scale up the proposed corpus by automatically generating cases of text re-use. The evaluation results indicate a high correlation between the proposed corpus and the PAN state-of-the-art English plagiarism detection corpus.

DOI
03 Oct 2021
TL;DR: The purpose of this study is to build a web-based application for early detection of plagiarism in scientific writing using the Cosine Similarity algorithm, helping students choose a final assignment title by detecting plagiarism early.
Abstract: Scientific writing is a graduation requirement for students at many universities. In producing the scientific writing/final project, final year students are expected to make a journal article that is as original as possible and free of plagiarism, in order to contribute innovations to society and deliver good, useful solutions, especially in the IT field. The purpose of this study is to build a web-based application for early detection of plagiarism in scientific writing using the Cosine Similarity algorithm, which makes it easier for students to choose a final assignment title by detecting plagiarism early. A problem that often occurs in determining the title of scientific writing is not knowing about similar previous titles, which often results in revisions at the beginning of the collection of outlines and journal titles. The method used is cosine similarity, which calculates the similarity between two objects. The calculation is based on the Vector Space Similarity Measure: to calculate the level of similarity, each object is expressed in vector form, using keywords as its dimensions.
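The cosine similarity calculation described above can be sketched directly from term-frequency vectors; the two example titles are invented for illustration.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(doc1, doc2):
    """Cosine of the angle between term-frequency vectors of two texts."""
    v1, v2 = Counter(doc1.lower().split()), Counter(doc2.lower().split())
    shared = set(v1) & set(v2)
    dot = sum(v1[t] * v2[t] for t in shared)
    norm1 = sqrt(sum(c * c for c in v1.values()))
    norm2 = sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2)

t1 = "early plagiarism detection in scientific writing"
t2 = "plagiarism detection for scientific writing"
print(cosine_similarity(t1, t2))  # high: the two titles overlap heavily
```

A score near 1.0 flags a proposed title as too close to an existing one, which is exactly the early-warning check the application performs.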

Journal ArticleDOI
TL;DR: In this paper, the plagiarism checker Grammarly has been used to facilitate students' learning and assessment of source-based writing; it provides plagiarism alerts and helps students with their source use practices.

Proceedings ArticleDOI
26 Jun 2021
TL;DR: In this article, the authors describe the process applied to develop open-book online exams for final year (undergraduate)students studying Applied Machine Learning and Applied Artificial Intelligence and Deep Learning courses as part of a four-year BSc in Computer Science.
Abstract: Like many others, our institution had to adapt our traditional proctored, written examinations to open-book online variants due to the COVID-19 pandemic. This paper describes the process applied to develop open-book online exams for final year (undergraduate) students studying Applied Machine Learning and Applied Artificial Intelligence and Deep Learning courses as part of a four-year BSc in Computer Science. We also present processes used to validate the examinations as well as plagiarism detection methods implemented. Findings from this study highlight positive effects of using open-book online exams, with ~85% of students reporting that they either prefer online open-book examinations or have no preference between traditional and open-book exams. There were no statistically significant differences reported comparing the exam results of student cohorts who took the open-book online examination, compared to previous cohorts who sat traditional exams. These results are of value to the CSEd community for three reasons. First, it outlines a methodology for developing online open-book exams (including publishing the open-book online exam papers as samples). Second, it provides approaches for deterring plagiarism and implementing plagiarism detection for open-book exams. Finally, we present feedback from students which may be used to guide future online open-book exam development.

Journal ArticleDOI
TL;DR: In this paper, a cross-sectional survey was conducted to analyze plagiarism perceptions among researchers and journal editors, particularly from non-Anglophone countries, with a large representation from India (50, 24%), Turkey (28, 13%), Kazakhstan (25, 12%), and Ukraine (24, 11%).
Abstract: Background Plagiarism is one of the most common violations of publication ethics, and it still remains an area with several misconceptions and uncertainties. Methods This online cross-sectional survey was conducted to analyze plagiarism perceptions among researchers and journal editors, particularly from non-Anglophone countries. Results Among 211 respondents (mean age 40 years; M:F, 0.85:1), 26 were scholarly journal editors and 70 were reviewers, with a large representation from India (50, 24%), Turkey (28, 13%), Kazakhstan (25, 12%) and Ukraine (24, 11%). Rigid and outdated pre- and post-graduate education was considered the origin of plagiarism by 63% of respondents. Paraphragiarism was the most commonly encountered type of plagiarism (145, 69%). Students (150, 71%), non-Anglophone researchers with poor English writing skills (117, 55%), and agents of commercial editing agencies (126, 60%) were thought to be prone to plagiarize. There was significant disagreement on the legitimacy of text copying in scholarly articles, the permitted plagiarism limit, and plagiarized text in the methods section. Most respondents (165, 78%) recommended specifically designed courses for plagiarism detection and prevention, and 94.7% (200) thought that social media platforms may be deployed to educate and notify about plagiarism. Conclusion Great variation exists in the understanding of plagiarism, potentially contributing to unethical publications and even retractions. Bridging the knowledge gap by arranging topical education and widely employing advanced anti-plagiarism software addresses this unmet need.


Journal ArticleDOI
TL;DR: This paper presents an online judging framework that is capable of automatic scoring of codes by detecting plagiarized contents and the level of accuracy of codes efficiently and is compared with the existing online judging platforms to show the superiority in terms of time efficiency, correctness, and feature availability.
Abstract: A programming contest generally involves the host presenting a set of logical and mathematical problems to the contestants. The contestants are required to write computer programs that are capable of solving these problems. An online judge system is used to automate the judging procedure of the programs that are submitted by the users. Online judges are systems designed for the reliable evaluation of the source codes submitted by the users. Traditional online judging platforms are not ideally suitable for programming labs, as they do not support partial scoring and efficient detection of plagiarized codes. Considering this fact, in this paper, we present an online judging framework that is capable of automatic scoring of codes by efficiently detecting plagiarized contents and the level of accuracy of codes. Our system performs the detection of plagiarism by extracting fingerprints of programs and using the fingerprints to compare them instead of using the whole file. We used winnowing to select fingerprints among k-gram hash values of a source code, which were generated by the Rabin-Karp Algorithm. The proposed system is compared with the existing online judging platforms to show its superiority in terms of time efficiency, correctness, and feature availability. In addition, we evaluated our system by using large data sets and comparing the run time with MOSS, a widely used plagiarism detection tool.
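A minimal sketch of the fingerprinting step described above, using Python's built-in `hash` as a stand-in for a true Rabin-Karp rolling hash; the k-gram and window sizes are illustrative assumptions.

```python
def kgram_hashes(text, k=5):
    """Hashes of all k-grams (built-in hash stands in for Rabin-Karp)."""
    return [hash(text[i:i + k]) for i in range(len(text) - k + 1)]

def winnow(hashes, window=4):
    """Winnowing: keep the minimum hash of each sliding window.
    The kept set is the document's fingerprint."""
    fingerprints = set()
    for i in range(len(hashes) - window + 1):
        fingerprints.add(min(hashes[i:i + window]))
    return fingerprints

# Two near-identical code snippets share many fingerprints.
a = winnow(kgram_hashes("int main(){return 0;}"))
b = winnow(kgram_hashes("int main( ){return 0;}"))
overlap = len(a & b) / len(a | b)
print(0.0 < overlap <= 1.0)  # fingerprint overlap as a similarity score
```

Comparing small fingerprint sets instead of whole files is what gives the system its speed advantage while still guaranteeing that sufficiently long matches are detected.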

Journal ArticleDOI
TL;DR: This paper tests numerous source code similarity detection tools on pairs of code fragments written in the data science-oriented functional programming language R and finds that program dependence graph-based methods tend to outperform those relying on normalised source code text, tokens, and names of functions invoked.
Abstract: Making correct decisions as to whether code chunks should be considered similar becomes increasingly important in software design and education and not only can improve the quality of computer programs, but also help assure the integrity of student assessments. In this paper we test numerous source code similarity detection tools on pairs of code fragments written in the data science-oriented functional programming language R. Contrary to mainstream approaches, instead of considering symmetric measures of “how much code chunks A and B are similar to each other”, we propose and study the nonsymmetric degrees of inclusion “to what extent A is a subset of B” and “to what degree B is included in A”. Overall, t-norms yield better precision (how many suspicious pairs are actually similar), t-conorms maximise recall (how many similar pairs are successfully retrieved), and custom aggregation functions fitted to training data provide a good balance between the two. Also, we find that program dependence graph-based methods tend to outperform those relying on normalised source code text, tokens, and names of functions invoked.
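The nonsymmetric inclusion degrees described above can be illustrated with a small sketch. This is not the authors' implementation: the crude tokeniser and the use of min/max as the t-norm/t-conorm pair are simplifying assumptions for illustration.

```python
import re
from collections import Counter

def tokens(code):
    """Very crude tokeniser: words (dots allowed, as in R names) or single symbols."""
    return Counter(re.findall(r"[\w.]+|\S", code))

def inclusion(a, b):
    """Degree to which token bag `a` is included in token bag `b` (0..1)."""
    if sum(a.values()) == 0:
        return 0.0
    overlap = sum(min(a[t], b[t]) for t in a)
    return overlap / sum(a.values())

def aggregate(code_a, code_b):
    """Two directed degrees, combined with a t-norm (min) and a t-conorm (max)."""
    ta, tb = tokens(code_a), tokens(code_b)
    ab, ba = inclusion(ta, tb), inclusion(tb, ta)
    return {"A_in_B": ab, "B_in_A": ba, "t_norm": min(ab, ba), "t_conorm": max(ab, ba)}
```

A high t-conorm together with a low t-norm flags the case where one fragment is embedded inside a larger one, which is exactly the asymmetry a single symmetric similarity score hides.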

Journal ArticleDOI
TL;DR: In this article, the authors define verbatim plagiarism as simple copy and paste and shed light on intelligent plagiarism, which is harder to reveal because it may involve manipulation of the original text, adoption of other researchers' ideas, and translation into other languages, making it more challenging to handle.
Abstract: Plagiarism Detection Systems play an important role in revealing instances of plagiarism, especially in the educational sector with scientific documents and papers. Plagiarism occurs when any content is copied without the permission of, or citation to, its author. To detect such activities, it is necessary to have extensive information about the forms and classes of plagiarism. Thanks to developed tools and methods, it is possible to reveal many types of plagiarism. The development of Information and Communication Technologies (ICT) and the availability of online scientific documents have made these documents easy to access. With the availability of many software text editors, plagiarism detection has become a critical issue. A large number of scientific papers have already investigated plagiarism detection, and common plagiarism detection datasets are being used for recognition systems; the WordNet and PAN datasets have been used since 2009. Researchers have defined verbatim plagiarism as a simple type of copy and paste. They have then shed light on intelligent plagiarism, where detection becomes more difficult because it may include manipulation of the original text, adoption of other researchers' ideas, and translation into other languages, all of which are more challenging to handle. Other researchers have noted that plagiarism may disguise the scientific text by replacing, removing, or inserting words, along with shuffling or modifying the original papers. This paper gives an overall definition of plagiarism and works through different papers covering the best-known plagiarism methods and detection tools.

Journal ArticleDOI
TL;DR: Different available tools and techniques used for plagiarism detection are presented, and a detailed taxonomy and various methodologies for identifying plagiarised content are explained in detail.
Abstract: With the high use of internet technology worldwide, the availability of data is also growing rapidly. Ready availability of data tempts some people to steal it and present it as their own. This is mostly found in the higher education sector, where students and teachers reuse existing information and commit plagiarism. Performing plagiarism detection at various levels is thus an important task for controlling data theft and maintaining the novelty of sources of information. Research on this problem has been under way for many years, and over time many tools and techniques have been developed to detect plagiarism at various levels. Still, many issues remain in detecting plagiarism, for reasons such as language, the availability of data sets, and the availability of highly sophisticated algorithms. Much research has addressed detection of plagiarism at the word level, also called the syntactic level because it is based only on words and their forms within the text, but less research has been done on detection of plagiarism at the semantic level [10]. Due to the unavailability of corpora, algorithms, and techniques, semantic-level plagiarism detection remains a tedious task. In this paper we present the different available tools and techniques used for plagiarism detection. A detailed taxonomy and various methodologies for identifying plagiarised content are explained in detail.

Journal ArticleDOI
TL;DR: In this article, 11 source code plagiarism detection tools are evaluated for robustness against plagiarism-hiding modifications, including fine-grained transformations to the source code structure.
Abstract: Source code plagiarism is a common occurrence in undergraduate computer science education. In order to identify such cases, many source code plagiarism detection tools have been proposed. A source code plagiarism detection tool evaluates pairs of assignment submissions to detect indications of plagiarism. However, a plagiarising student will commonly apply plagiarism-hiding modifications to source code in an attempt to evade detection. Subsequently, prior work has implied that currently available source code plagiarism detection tools are not robust to the application of pervasive plagiarism-hiding modifications. In this article, 11 source code plagiarism detection tools are evaluated for robustness against plagiarism-hiding modifications. The tools are evaluated with data sets of simulated undergraduate plagiarism, constructed with source code modifications representative of undergraduate students. The results of the performed evaluations indicate that currently available source code plagiarism detection tools are not robust against modifications that apply fine-grained transformations to the source code structure. Of the evaluated tools, JPlag and Plaggie demonstrate the greatest robustness to different types of plagiarism-hiding modifications. However, the results also indicate that graph-based tools, specifically those that compare programs as program dependence graphs, show potentially greater robustness to pervasive plagiarism-hiding modifications.
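As a toy illustration (not taken from the article) of why fine-grained structural edits hurt token-based detectors, the two functions below are semantically identical, yet a simple token-sequence comparison scores them as fairly dissimilar:

```python
import difflib
import io
import tokenize

ORIGINAL = """
def total(xs):
    s = 0
    for x in xs:
        s += x
    return s
"""

# Same behaviour after plagiarism-hiding edits: renamed identifiers,
# the for-loop rewritten as a while-loop, accumulation split in two.
DISGUISED = """
def accumulate(values):
    acc = 0
    i = 0
    while i < len(values):
        tmp = values[i]
        acc = acc + tmp
        i += 1
    return acc
"""

def token_strings(src):
    """Token strings of a Python source, ignoring layout and comments."""
    skip = (tokenize.NL, tokenize.NEWLINE, tokenize.INDENT,
            tokenize.DEDENT, tokenize.COMMENT, tokenize.ENDMARKER)
    return [tok.string
            for tok in tokenize.generate_tokens(io.StringIO(src).readline)
            if tok.type not in skip]

def token_similarity(a, b):
    """Ratio of matching tokens between two sources (0..1)."""
    return difflib.SequenceMatcher(None, token_strings(a), token_strings(b)).ratio()
```

A program dependence graph of both functions would expose the same data and control dependences (accumulate each element into one variable), which is one intuition for why graph-based tools show greater robustness here.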

Journal ArticleDOI
TL;DR: In this paper, a study aimed to examine EFL students' perceptions of plagiarism, the factors triggering students to plagiarize in completing their undergraduate theses, and their strategies for avoiding plagiarism.
Abstract: This study aimed to examine EFL students' perceptions of plagiarism, the factors triggering students to plagiarize in completing their undergraduate theses, and their strategies for avoiding plagiarism. The study, conducted at the English Language Education Department of one of the universities in Indonesia, employed a qualitative approach aimed at obtaining more information and a detailed description of social or human issues. Ten randomly selected alumni agreed to participate in this study. The selected participants were interviewed to collect the information needed to address the research questions. Besides interviewing the students, the researchers also analyzed their theses to examine the level of plagiarism using Turnitin software; the similarity indices in the students' theses ranged from 16% to 36%. The findings revealed three strategies that students implemented to avoid plagiarism: paraphrasing and quoting others' ideas, understanding the meaning of plagiarism, and using the lecturers' particular method and online plagiarism detection software. Furthermore, the study found that the factors influencing students to plagiarize were time limitations on assignments and poor time management, the ease of using online sources, and a poor understanding of plagiarism and what constitutes it.

Journal ArticleDOI
01 Jan 2021
TL;DR: This paper proposes a system trained on both fake (GAN-generated) and real images to help flag whether an image is plagiarised or real.
Abstract: Today, plagiarism detection is a very important concern because content originality is a primary client requirement. Many people on the internet use others' images and gain publicity from them, while the owner of the image or data gets nothing in return. Many users copy data or image features from other users and modify them slightly, or create an artificial replica. With sufficient computational power and volume of data, GAN models are capable of producing fake images that look very similar to real images. Such images are generally not detected by modern plagiarism systems. GAN stands for generative adversarial network; it has two neural networks working inside. The first is the generator, which generates a random image, and the second is the discriminator, which identifies whether the generated image is real or fake. In this paper, we propose a system that has been trained on both fake images (GAN-generated images) and real images and helps flag whether an image is plagiarised or real.
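A minimal sketch of such a flagging step is given below. The paper does not specify its architecture, so a tiny logistic-regression "discriminator" over raw pixels and synthetic stand-in images are assumptions made purely to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_batch(n, fake):
    """Toy 8x8 grayscale 'images' (flattened): fake ones carry a global bias."""
    imgs = rng.normal(0.5, 0.1, size=(n, 64))
    if fake:
        imgs += 0.2  # stand-in for a systematic GAN artefact
    return imgs

def train_detector(x, y, lr=0.1, epochs=800):
    """Logistic regression trained by full-batch gradient descent."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(fake)
        grad = p - y
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def flag_plagiarised(imgs, w, b):
    """Flag an image as GAN-generated (plagiarised) when P(fake) > 0.5."""
    return 1.0 / (1.0 + np.exp(-(imgs @ w + b))) > 0.5

# Train on both real and fake images, as the proposed system does.
x = np.vstack([make_batch(200, fake=False), make_batch(200, fake=True)])
y = np.concatenate([np.zeros(200), np.ones(200)])
w, b = train_detector(x, y)
```

In a realistic setting the linear model would be replaced by a convolutional network and the synthetic arrays by actual real and GAN-generated image datasets; the training loop and flagging threshold keep the same shape.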

Book ChapterDOI
01 Jan 2021
TL;DR: In this paper, the authors describe how academic plagiarism poses a challenge for the digital humanities now that sophisticated tools make it possible to discover inappropriate academic activity, and they examine the changing norms of academic integrity.
Abstract: This chapter describes how academic plagiarism poses a challenge for the digital humanities now that sophisticated tools make it possible to discover inappropriate academic activity. Focusing on dissertations defended in Russia in recent years, the authors discuss academic plagiarism and examine the changing norms of academic integrity. Section 27.1 introduces the questions under consideration. The next section describes various types of plagiarism and the computational tools used to detect them. Section 27.3 reviews available digitized resources. The activities of the Dissernet network are described in Sect. 27.4, which presents an overall picture of findings based on large-scale (more than 50%) plagiarism in dissertations. The case study described in Sect. 27.5 concerns small-scale plagiarism within the same academic genre, raising the question of academic authenticity’s shifting norms.

Journal ArticleDOI
TL;DR: The results demonstrate that the methodology is adaptable to new technologies that might arise; it also helps students avoid plagiarism and offers a personalized induction for every student in the learning process.
Abstract: The fast pace of development of the Internet and the Coronavirus Disease (COVID-19) pandemic have considerably impacted the education sector, encouraging constant transformation of teaching/learning strategies, especially in technological areas such as Educational Software Engineering. Web programming, a fundamental topic in Software Engineering and Cloud-based applications, faces various critical challenges in education, such as keeping up with continuously emerging technological tools, detecting plagiarism, and generating innovative learning environments. Continual change, accelerated by current digitization, is a challenge for teachers and students, who cannot depend on traditional educational methods. The article presents a sustainable teaching/learning methodology for web programming courses in Engineering Education using project-based learning adaptable to continuous advances in web technology. The methodology has been developed and improved over 9 years, across 15 groups and 3 different universities. Our results demonstrate that the methodology is adaptable to new technologies that might arise; it also helps students avoid plagiarism and provides a personalized induction for every student in the learning process.