
Showing papers on "Plagiarism detection published in 2019"


Proceedings ArticleDOI
01 Jan 2019
TL;DR: In this article, the authors leverage ideas and techniques from Natural Language Processing (NLP), a rich area focused on processing text of various natural languages, to address two important code similarity comparison problems: given a pair of basic blocks for different instruction set architectures (ISAs), determining whether their semantics is similar or not; given a piece of code of interest, determining if it is contained in another piece of assembly code for a different ISA.
Abstract: Binary code analysis allows analyzing binary code without having access to the corresponding source code. A binary, after disassembly, is expressed in an assembly language. This inspires us to approach binary analysis by leveraging ideas and techniques from Natural Language Processing (NLP), a rich area focused on processing text of various natural languages. We notice that binary code analysis and NLP share a lot of analogical topics, such as semantics extraction, summarization, and classification. This work utilizes these ideas to address two important code similarity comparison problems. (I) Given a pair of basic blocks for different instruction set architectures (ISAs), determining whether their semantics is similar or not; and (II) given a piece of code of interest, determining if it is contained in another piece of assembly code for a different ISA. The solutions to these two problems have many applications, such as cross-architecture vulnerability discovery and code plagiarism detection. We implement a prototype system INNEREYE and perform a comprehensive evaluation. A comparison between our approach and existing approaches to Problem I shows that our system outperforms them in terms of accuracy, efficiency and scalability. And the case studies utilizing the system demonstrate that our solution to Problem II is effective. Moreover, this research showcases how to apply ideas and techniques from NLP to large-scale binary code analysis.

92 citations
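Systems of this kind typically decide whether two basic blocks are similar by comparing the cosine of their embedding vectors against a threshold. The sketch below illustrates only that final comparison step; the four-dimensional vectors and the 0.9 threshold are invented for illustration, not INNEREYE's actual values.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

# Hypothetical embeddings for an x86 block and an ARM block.
block_x86 = [0.12, -0.40, 0.88, 0.05]
block_arm = [0.10, -0.35, 0.90, 0.07]

# Blocks whose embeddings are close are predicted semantically similar.
similar = cosine(block_x86, block_arm) > 0.9
```

In practice the embeddings come from a learned model, but the decision rule reduces to exactly this kind of vector comparison.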


Journal ArticleDOI
TL;DR: The integration of heterogeneous analysis methods for textual and non-textual content features using machine learning is seen as the most promising area for future research contributions to improve the detection of academic plagiarism further.
Abstract: This article summarizes the research on computational methods to detect academic plagiarism by systematically reviewing 239 research papers published between 2013 and 2018. To structure the presentation of the research contributions, we propose novel technically oriented typologies for plagiarism prevention and detection efforts, the forms of academic plagiarism, and computational plagiarism detection methods. We show that academic plagiarism detection is a highly active research field. Over the period we review, the field has seen major advances regarding the automated detection of strongly obfuscated and thus hard-to-identify forms of academic plagiarism. These improvements mainly originate from better semantic text analysis methods, the investigation of non-textual content features, and the application of machine learning. We identify a research gap in the lack of methodologically thorough performance evaluations of plagiarism detection systems. Concluding from our analysis, we see the integration of heterogeneous analysis methods for textual and non-textual content features using machine learning as the most promising area for future research contributions to improve the detection of academic plagiarism further.

88 citations


Journal ArticleDOI
TL;DR: This review gives an overview of definitions of plagiarism, plagiarism detection tools, comparison metrics, obfuscation methods, datasets used for comparison, and algorithm types and identifies interesting insights about metrics and datasets for quantitative tool comparison and categorisation of detection algorithms.
Abstract: Teachers deal with plagiarism on a regular basis, so they try to prevent and detect plagiarism, a task that is complicated by the large size of some classes. Students who cheat often try to hide their plagiarism (obfuscate), and many different similarity detection engines (often called plagiarism detection tools) have been built to help teachers. This article focuses only on plagiarism detection and presents a detailed systematic review of the field of source-code plagiarism detection in academia. This review gives an overview of definitions of plagiarism, plagiarism detection tools, comparison metrics, obfuscation methods, datasets used for comparison, and algorithm types. Perspectives on the meaning of source-code plagiarism detection in academia are presented, together with categorisations of the available detection tools and analyses of their effectiveness. While writing the review, some interesting insights have been found about metrics and datasets for quantitative tool comparison and categorisation of detection algorithms. Also, existing obfuscation methods classifications have been expanded together with a new definition of “source-code plagiarism detection in academia.”

75 citations


Journal ArticleDOI
TL;DR: The taxonomy of machine learning-based binary code analysis is provided, the recent advances and key findings on the topic are described, and the thoughts for future directions on this topic are presented.
Abstract: Binary code analysis is crucial in various software engineering tasks, such as malware detection, code refactoring, and plagiarism detection. With the rapid growth of software complexity and the increasing number of heterogeneous computing platforms, binary analysis is particularly critical and more important than ever. Traditionally adopted techniques for binary code analysis are facing multiple challenges, such as the need for cross-platform analysis, high scalability and speed, and improved fidelity, to name a few. To meet these challenges, machine learning-based binary code analysis frameworks attract substantial attention due to their automated feature extraction and drastically reduced efforts needed on large-scale programs. In this paper, we provide the taxonomy of machine learning-based binary code analysis, describe the recent advances and key findings on the topic, and discuss the key challenges and opportunities. Finally, we present our thoughts for future directions on this topic.

38 citations


Proceedings ArticleDOI
01 Nov 2019
TL;DR: Among the compared models, as expected, Recurrent Neural Network is best suited for the paraphrase identification task and it is proposed that Plagiarism detection is one of the areas where Paraphrase Identification can be effectively implemented.
Abstract: Paraphrase Identification or Natural Language Sentence Matching (NLSM) is one of the important and challenging tasks in Natural Language Processing, where the task is to identify whether a sentence is a paraphrase of another sentence in a given pair of sentences. A paraphrase of a sentence conveys the same meaning, but its structure and the sequence of words vary. It is a challenging task, as it is difficult to infer the proper context of a sentence given its short length, and coming up with similarity metrics for the inferred contexts of a pair of sentences is not straightforward either. Its applications, however, are numerous. This work explores various machine learning algorithms to model the task and also applies different input encoding schemes. Specifically, we created models using Logistic Regression, Support Vector Machines, and different architectures of Neural Networks. Among the compared models, as expected, the Recurrent Neural Network (RNN) is best suited for our paraphrase identification task. We also propose that plagiarism detection is one of the areas where paraphrase identification can be effectively implemented.

32 citations
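Framing paraphrase identification as supervised classification starts with turning a sentence pair into a feature vector. The sketch below extracts two simple lexical features; the feature set is invented for illustration and is not the encoding scheme used in the paper.

```python
def pair_features(s1, s2):
    # Simple lexical features a classifier (LR, SVM, ...) could be trained on.
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    jaccard = len(w1 & w2) / len(w1 | w2)          # word-set overlap
    len_ratio = min(len(w1), len(w2)) / max(len(w1), len(w2))
    return [jaccard, len_ratio]
```

A Logistic Regression or SVM model would then be fit on such vectors with paraphrase/non-paraphrase labels; RNN approaches instead consume the word sequences directly.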


Proceedings ArticleDOI
02 Jun 2019
TL;DR: Overall, it is shown that analyzing the similarity of mathematical content and academic citations is a striking supplement for conventional text-based detection approaches for academic literature in the STEM disciplines.
Abstract: Identifying academic plagiarism is a pressing task for educational and research institutions, publishers, and funding agencies. Current plagiarism detection systems reliably find instances of copied and moderately reworded text. However, reliably detecting concealed plagiarism, such as strong paraphrases, translations, and the reuse of nontextual content and ideas is an open research problem. In this paper, we extend our prior research on analyzing mathematical content and academic citations. Both are promising approaches for improving the detection of concealed academic plagiarism primarily in Science, Technology, Engineering and Mathematics (STEM). We make the following contributions: i) We present a two-stage detection process that combines similarity assessments of mathematical content, academic citations, and text. ii) We introduce new similarity measures that consider the order of mathematical features and outperform the measures in our prior research. iii) We compare the effectiveness of the math-based, citation-based, and text-based detection approaches using confirmed cases of academic plagiarism. iv) We demonstrate that the combined analysis of math-based and citation-based content features allows identifying potentially suspicious cases in a collection of 102K STEM documents. Overall, we show that analyzing the similarity of mathematical content and academic citations is a striking supplement for conventional text-based detection approaches for academic literature in the STEM disciplines. The data and code of our study are openly available at https://purl.org/hybridPD

29 citations
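One way to make citation order matter, in the spirit of the citation-based analysis described above, is to score the longest common subsequence of two documents' citation sequences. The sketch below is a generic LCS-based score; the reference identifiers are hypothetical and the normalization is one choice among several.

```python
def lcs_len(a, b):
    # Longest common subsequence of two citation sequences.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

# Hypothetical citation sequences of two documents.
doc_a = ["R1", "R2", "R3", "R4"]
doc_b = ["R1", "R5", "R3", "R4"]

# Three citations shared in the same order -> score 0.75.
score = lcs_len(doc_a, doc_b) / max(len(doc_a), len(doc_b))
```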


Journal ArticleDOI
TL;DR: ES‐Plag, a plagiarism detection tool featured with cosine‐based filtering and penalty mechanism to handle aforementioned issues, is proposed and its features are beneficial for examiners.
Abstract: Source code plagiarism detection using Running-Karp-Rabin Greedy-String-Tiling (RKRGST) is a common practice in academic environments. However, such an approach is time-inefficient (due to RKRGST's cubic time complexity) and insensitive (toward token subsequence rearrangement). This paper proposes ES-Plag, a plagiarism detection tool featuring cosine-based filtering and a penalty mechanism to handle the aforementioned issues. Cosine-based filtering mitigates time inefficiency by excluding non-potential pairs from RKRGST comparison, while the penalty mechanism mitigates insensitivity by reducing the number of matched tokens by the number of matched subsequences prior to similarity normalization. In addition to these issue-solving features, ES-Plag also offers project-based input, a colorized adjacency similarity matrix, matched-token highlighting, and various similarity algorithms (e.g., Cosine Similarity and Local Alignment). Three findings can be deduced from our evaluation. First, cosine-based filtering boosts time efficiency with a trade-off in effectiveness. Second, the penalty mechanism enhances sensitivity, even though its improvement in terms of effectiveness is quite limited. Third, ES-Plag's features are beneficial for examiners.

27 citations
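The cosine-based filtering idea can be sketched as follows: represent each submission as a token-frequency vector and keep only pairs whose cosine similarity clears a threshold, so the cubic-time RKRGST comparison runs on far fewer pairs. The threshold and token lists below are illustrative, not ES-Plag's actual configuration.

```python
from collections import Counter
import math

def token_cosine(tokens_a, tokens_b):
    # Cosine similarity over token-frequency vectors.
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def filter_pairs(programs, threshold=0.5):
    # Keep only the pairs similar enough to deserve the expensive
    # RKRGST comparison; the rest are discarded cheaply.
    names = list(programs)
    keep = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if token_cosine(programs[names[i]], programs[names[j]]) >= threshold:
                keep.append((names[i], names[j]))
    return keep
```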


Journal ArticleDOI
Mamdouh Farouk1
TL;DR: Word-to-word based, structure-based, and vector-based are the most widely used approaches to finding sentence similarity, but structure-based similarity, which measures similarity between sentence structures, needs more investigation.
Abstract: This study reviews the approaches used for measuring sentence similarity. Measuring similarity between natural language sentences is a crucial task for many Natural Language Processing applications such as text classification, information retrieval, question answering, and plagiarism detection. This survey classifies approaches to calculating sentence similarity into three categories based on the adopted methodology: word-to-word based, structure-based, and vector-based, which are the most widely used approaches. Each approach measures relatedness between short texts from a specific perspective. In addition, the datasets most often used as benchmarks for evaluating techniques in this field are introduced to provide a complete view of the issue. Approaches that combine more than one perspective give better results. Moreover, structure-based similarity, which measures similarity between sentence structures, needs more investigation.

26 citations
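A word-to-word based measure, one of the three categories above, can be sketched by aggregating the best match for each word across the two sentences. Here exact word match stands in for a real relatedness function, which in practice would be WordNet- or embedding-based.

```python
def word_sim(w1, w2):
    # Toy word relatedness: exact match only. A real system would use
    # WordNet path similarity or word-embedding cosine here.
    return 1.0 if w1 == w2 else 0.0

def word_to_word_sim(s1, s2):
    # For each word, take its best match in the other sentence,
    # then average the two directions (a common aggregation scheme).
    t1, t2 = s1.lower().split(), s2.lower().split()
    best1 = sum(max(word_sim(a, b) for b in t2) for a in t1) / len(t1)
    best2 = sum(max(word_sim(b, a) for a in t1) for b in t2) / len(t2)
    return (best1 + best2) / 2
```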




Journal ArticleDOI
TL;DR: This work proposed a deep learning-based approach to indicate how original and suspect documents expressed the same meaning in Arabic language, achieving good results enhancing an efficient contextual relationship detection between Arabic documents in terms of precision and recall.
Abstract: The continuous increase in textual sources on the web has facilitated the act of paraphrasing. Its detection has become a challenge in different natural language processing applications (e.g., plagiarism detection, information retrieval and extraction, question answering, etc.). Unlike western languages such as English, few works have addressed the problem of extrinsic paraphrase detection in the Arabic language. In this context, we propose a deep learning-based approach to indicate how original and suspect documents express the same meaning. The word2vec algorithm extracts the relevant features by predicting each word from its neighbors. Subsequently, averaging the obtained vectors proved efficient for generating sentence vector representations. A convolutional neural network then captures more contextual information and computes the degree of semantic relatedness. Faced with the lack of publicly available resources, a paraphrased corpus was developed using the skip-gram model, which performed well at replacing an original word with its most similar one of the same grammatical class from a vocabulary. Finally, the proposed system achieved good results, with more efficient contextual relationship detection between Arabic documents in terms of precision (85%) and recall (86.8%) than previous studies.

25 citations
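The sentence-representation step described above, averaging word vectors and comparing the results, can be sketched as follows. The tiny hand-made "word2vec" table is purely illustrative; a real system trains these vectors on a large corpus.

```python
import math

# Toy 3-d "word2vec" vectors, invented for illustration.
vecs = {"car": [0.9, 0.1, 0.0], "auto": [0.85, 0.15, 0.0], "fast": [0.0, 0.9, 0.2]}

def sentence_vec(tokens):
    # Average the word vectors to obtain one sentence vector
    # (the averaging step described in the abstract).
    known = [vecs[t] for t in tokens if t in vecs]
    return [sum(d) / len(known) for d in zip(*known)]

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Two near-paraphrases get highly similar sentence vectors.
sim = cos(sentence_vec(["car", "fast"]), sentence_vec(["auto", "fast"]))
```

The paper feeds such representations to a convolutional network rather than comparing them directly with cosine; the sketch covers only the representation step.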


Journal ArticleDOI
TL;DR: A substantial level of plagiarism via duplicate publications in the three analyzed predatory journals is found, further diluting credible scientific literature and risking the ability to synthesize evidence accurately to inform practice.
Abstract: PURPOSE This study compared three known predatory nursing journals to determine the percentage of content among them that was plagiarized or duplicated. A serendipitous finding of several instances of plagiarism via duplicate publications during the random analysis of articles in a study examining the quality of articles published in predatory journals prompted this investigation. DESIGN The study utilized a descriptive, comparative design. All articles in each journal (n = 296 articles) from inception (volume 1, number 1) through May 1, 2017, were analyzed. METHODS Each article was evaluated and scored electronically for similarity using an electronic plagiarism detection tool. Articles were then individually reviewed, and exact and near exact matches (90% or greater plagiarized content) were paired. Articles with less than 70% plagiarized scores were randomly sampled, and an in-depth search for matches of partial content in other journals was conducted. Descriptive statistics were used to summarize the data. FINDINGS The extent and direction of duplication from one given journal to another was established. Changes made in subsequent publications, as a potential distraction to identify plagiarism, were also identified. There were 100 (68%) exact and near exact matches in the paired articles. The time lapse between the original and duplicate publication ranged from 0 to 63 months, with a mean of 27.2 months (SD = 19.68). Authors were from 26 countries, including the United States, Turkey, Iran, and countries in Africa. Articles with similarity scores in the range of 20% to 70% included possible similarities in content or research plagiarism, but not to the extent of the exact or near exact matches. The majority of the articles (n = 94) went from Journal A or C to Journal B, although four articles were first published in Journal B and then Journal A.
CONCLUSIONS This study found a substantial level of plagiarism via duplicate publications in the three analyzed predatory journals, further diluting credible scientific literature and risking the ability to synthesize evidence accurately to inform practice. Editors should continue to use electronic plagiarism detection tools. Education about publishing misconduct for editors and authors is a high priority. CLINICAL RELEVANCE Both contributors and consumers of nursing literature rely on integrity in publication. Authors expect appropriate credit for their scholarly contributions without unethical and unauthorized duplication of their work. Readers expect current information from original authors, upon which they can make informed practice decisions.

Proceedings ArticleDOI
01 Jan 2019
TL;DR: This work regards instructions as words in NLP-inspired binary code analysis, and proposes a joint learning approach to generating instruction embeddings that capture not only the semantics of instructions within an architecture, but also their semantic relationships across architectures.
Abstract: Given a closed-source program, such as most proprietary software and viruses, binary code analysis is indispensable for many tasks, such as code plagiarism detection and malware analysis. Today, source code is very often compiled for various architectures, making cross-architecture binary code analysis increasingly important. A binary, after being disassembled, is expressed in an assembly language. Thus, recent work has started exploring Natural Language Processing (NLP)-inspired binary code analysis. In NLP, words are usually represented as high-dimensional vectors (i.e., embeddings) to facilitate further processing, which is one of the most common and critical steps in many NLP tasks. We regard instructions as words in NLP-inspired binary code analysis, and aim to represent instructions as embeddings as well. To facilitate cross-architecture binary code analysis, our goal is that similar instructions, regardless of their architectures, have embeddings close to each other. To this end, we propose a joint learning approach to generating instruction embeddings that capture not only the semantics of instructions within an architecture, but also their semantic relationships across architectures. To the best of our knowledge, this is the first work on building a cross-architecture instruction embedding model. As a showcase, we apply the model to resolving one of the most fundamental problems for binary code similarity comparison---semantics-based basic block comparison---and the solution outperforms the code-statistics-based approach. This demonstrates that it is promising to apply the model to other cross-architecture binary code analysis tasks.
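Before instructions can be fed to an embedding model as "words", they are usually normalized so that literal constants and memory addresses do not explode the vocabulary. The regex rules below are a common illustration of that preprocessing step, not the paper's exact normalization scheme.

```python
import re

def normalize(ins):
    # Treat an instruction as a "word": replace hex addresses and literal
    # constants with placeholder tokens so the instruction vocabulary
    # stays small enough for embedding training (illustrative rules).
    ins = re.sub(r"0x[0-9a-fA-F]+", "ADDR", ins)
    ins = re.sub(r"\b\d+\b", "CONST", ins)
    return ins

# Each normalized instruction becomes one token in the "sentence"
# formed by a basic block.
tokens = [normalize(i) for i in ["mov eax, 0x10", "add eax, 5"]]
```

After this step, a skip-gram-style model can be trained over instruction sequences exactly as word2vec is trained over sentences.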

Journal ArticleDOI
TL;DR: The dataset is designed for evaluation with an Information Retrieval (IR) perspective, and it is clear that most IR-based techniques are less effective than a baseline technique which relies on Running-Karp-Rabin Greedy-String-Tiling, even though some of them are far more time-efficient.
Abstract: Source code plagiarism is an emerging issue in computer science education. As a result, a number of techniques have been proposed to handle this issue. However, comparing these techniques may be challenging, since each is evaluated with its own private dataset(s). This paper contributes by providing a public dataset for comparing these techniques. Specifically, the dataset is designed for evaluation from an Information Retrieval (IR) perspective. The dataset consists of 467 source code files, covering seven introductory programming assessment tasks. Unique to this dataset, both the intention to plagiarise and advanced plagiarism attacks are considered in its construction. The dataset's characteristics were observed by comparing three IR-based detection techniques, and it is clear that most IR-based techniques are less effective than a baseline technique relying on Running-Karp-Rabin Greedy-String-Tiling, even though some of them are far more time-efficient.
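The baseline the dataset is compared against is Greedy String Tiling. A simplified tiling similarity, without the Karp-Rabin rolling hashes that give RKRGST its speed, can be sketched as repeatedly finding and marking the longest common run of unmarked tokens:

```python
def gst_similarity(a, b, min_match=2):
    # Simplified Greedy String Tiling over token lists: repeatedly find
    # the longest common run of still-unmarked tokens, mark it as a tile,
    # and sum tile lengths. (No Karp-Rabin hashing, so it is slow.)
    marked_a, marked_b = set(), set()
    total = 0
    while True:
        best = None
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and i + k not in marked_a and j + k not in marked_b):
                    k += 1
                if k >= min_match and (best is None or k > best[2]):
                    best = (i, j, k)
        if best is None:
            break
        i, j, k = best
        marked_a.update(range(i, i + k))
        marked_b.update(range(j, j + k))
        total += k
    # Normalized similarity: fraction of tokens covered by tiles.
    return 2 * total / (len(a) + len(b))
```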

Journal ArticleDOI
TL;DR: Turnitin is software that identifies matched material by checking electronically submitted documents against its database of academic publications, internet sources, and previously submitted documents; it provides a "similarity index," which does not by itself mean plagiarism.
Abstract: Institutional integrity constitutes the basis of scientific activity. Frequent incidences of similarity, plagiarism, and retraction have created space for the frequent use of similarity- and plagiarism-detection tools. Turnitin is software that identifies matched material by checking electronically submitted documents against its database of academic publications, internet sources, and previously submitted documents. Turnitin provides a "similarity index," which does not by itself mean plagiarism. The prevalence of plagiarism has not been reduced substantially despite the many paid and unpaid plagiarism-detection tools, for an assortment of reasons such as poor research and citation skills, language problems, and underdeveloped academic skills. This paper may provide adequate feedback to students, researchers, and faculty members in understanding the difference between the similarity index and plagiarism.

Proceedings ArticleDOI
01 Dec 2019
TL;DR: This survey will classify different types of semantic similarity approaches such as corpus-based, knowledge-based and string-based.
Abstract: This paper provides a survey of the semantic similarity of text documents. Semantic similarity is an important task in Natural Language Processing (NLP). It is widely used for information retrieval, text classification, question answering, and plagiarism detection. This survey classifies different types of semantic similarity approaches, such as corpus-based, knowledge-based, and string-based. Various papers are reviewed, and a performance analysis is prepared in this survey.
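A string-based measure of the kind this survey covers can be as simple as Jaccard overlap of character trigrams; the sketch below is one minimal instance of that family.

```python
def char_ngrams(text, n=3):
    # All character n-grams of the text, as a set.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def string_sim(a, b, n=3):
    # String-based similarity: Jaccard overlap of character trigram sets.
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0
```

Corpus-based and knowledge-based approaches replace the trigram sets with distributional statistics or a lexical resource such as WordNet, but the comparison skeleton stays similar.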

Proceedings ArticleDOI
06 Apr 2019
TL;DR: This paper aims to use a deep learning approach for the task of authorship identification by defining a suitable characterization of texts to capture the distinctive style of an author by using an index based word embedding for the C50 and the BBC datasets.
Abstract: Authorship identification is the process of revealing the hidden identity of authors from a corpus of literary data based on a stylometric analysis of the text. It has essential applications in various fields, such as cyber-forensics, plagiarism detection, and political socialization. This paper aims to use a deep learning approach for the task of authorship identification by defining a suitable characterization of texts to capture the distinctive style of an author. The proposed model uses an index based word embedding for the C50 and the BBC datasets, applied to the input data of article level Long Short Term Memory (LSTM) network and Gated Recurrent Unit (GRU) network models. A comparative study of this new variant of embeddings is done with the standard approach of pre-trained word embeddings.
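The "index based word embedding" input stage amounts to mapping words to integer ids and padding every article to a fixed length before it reaches the LSTM/GRU layers. A sketch of that encoding, with an assumed id scheme (0 for padding, 1 for unknown words) that the paper does not specify:

```python
def build_vocab(docs):
    # Assign each word a unique integer id; 0 and 1 are reserved
    # for padding and unknown words (scheme assumed for illustration).
    vocab = {}
    for doc in docs:
        for w in doc.split():
            vocab.setdefault(w, len(vocab) + 2)
    return vocab

def encode(doc, vocab, maxlen):
    # Truncate to maxlen, map unseen words to 1, right-pad with 0.
    ids = [vocab.get(w, 1) for w in doc.split()][:maxlen]
    return ids + [0] * (maxlen - len(ids))
```

The resulting integer sequences index an embedding layer that the network trains jointly, rather than loading pre-trained word vectors.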

Journal ArticleDOI
TL;DR: A new approach for calculating semantic similarity between two concepts is proposed, based on concepts from set theory and WordNet properties, calculating the relatedness between the synsets and glosses of the two concepts.

Journal ArticleDOI
TL;DR: In this article, the authors discuss and challenge the increasing use of plagiarism detection services such as Turnitin and Grammarly by students, arguing that the increasingly online nature of composition complicates plagiarism detection.
Abstract: This article discusses and challenges the increasing use of plagiarism detection services such as Turnitin and Grammarly by students, arguing that the increasingly online nature of composition is h...

Journal ArticleDOI
Hui-Fang Shang1
TL;DR: The results of this study show that student plagiaristic behavior has changed and that awareness of the adoption of plagiarism-detection software has significantly reduced instances of textual plagiarism, although no obvious relationship is observed between plagiarism awareness and students’ actual plagiaristic behavior.
Abstract: Plagiarism is often considered as cheating, dishonesty, copying, or a moral failing in writing because it is the act of stealing others’ language and ideas without proper citation or paraphrasing. However, this idea is not universally shared, because people from different cultural backgrounds are likely to conceptualize plagiarism differently. To fill the gap in teachers’ perceptions and investigate students’ actual plagiaristic behavior, this study aims to detect the extent of plagiarism in students’ English summary writing by adopting the plagiarism-detection software Turnitin. The present study draws on both quantitative and qualitative data to investigate whether a discrepancy exists between students’ perceptions of plagiarism and their actual plagiaristic behavior, and whether student behavior changes after awareness training on plagiarism. The results of this study show that student plagiaristic behavior has changed and that awareness of the adoption of plagiarism-detection software has significantly reduced instances of textual plagiarism. It is thus noted that students who were aware that a plagiarism detection system was in use had lower percentages of plagiarism. However, no obvious relationship is observed between plagiarism awareness and the students’ actual plagiaristic behavior. The implications of these findings for pedagogical practice and future research are discussed and presented.


Proceedings ArticleDOI
29 Jan 2019
TL;DR: This paper investigates automated code plagiarism detection in the context of an undergraduate level data structures and algorithms module and shows that the degree of agreement between these tools is relatively low.
Abstract: This paper investigates automated code plagiarism detection in the context of an undergraduate level data structures and algorithms module. We compare three software tools which aim to detect plagiarism in the students' programming source code. We evaluate the performance of these tools on an individual basis and the degree of agreement between them. Based on this evaluation we show that the degree of agreement between these tools is relatively low. We also report the challenges faced during utilization of these methods and suggest possible future improvements for tools of this kind. The discrepancies in the results obtained by these detection techniques were used to devise guidelines for effectively detecting code plagiarism.


Proceedings ArticleDOI
02 Jul 2019
TL;DR: This work adapts the optimal Smith-Waterman sequence alignment algorithm to precisely measure the similarity between programs, greatly improving detection accuracy relative to competitors.
Abstract: Software plagiarism cheats students out of their own education and leads to unfair grading, making software plagiarism detection an important problem. However, many popular plagiarism detection tools are inaccurate, language-specific, or closed source, limiting their applicability. In this work, we seek to address these problems via a novel approach. We adapt the optimal Smith-Waterman sequence alignment algorithm to precisely measure the similarity between programs, greatly improving detection accuracy relative to competitors. Our approach is applicable to any language describable by an ANTLR grammar, which includes most programming languages. We also provide a new type of evaluation based on random program generation and obfuscation. Finally, we make our approach freely available, allowing for customizations and transparent reasoning about detection behavior.
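The core of the approach is Smith-Waterman local alignment over token sequences. A minimal scoring-only version (illustrative match/mismatch/gap weights, no traceback) looks like:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    # Smith-Waterman local alignment score over two token sequences.
    # The paper applies this idea to token streams from an ANTLR lexer;
    # the scoring weights here are illustrative, not the tool's.
    rows = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = rows[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            # Local alignment: scores never drop below zero.
            rows[i][j] = max(0, diag, rows[i-1][j] + gap, rows[i][j-1] + gap)
            best = max(best, rows[i][j])
    return best
```

Because alignment is local, a copied region inside two otherwise different programs still produces a high score, which is what makes the method robust to surrounding edits.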

Journal ArticleDOI
TL;DR: The terminology on plagiarism is fluid, a bit ambiguous, and still emerging as mentioned in this paper, and it may take some time to settle the terms more clearly, concretely and exhaustively.
Abstract: The terminology on plagiarism is not hard and fast. It is fluid, a bit ambiguous, and still emerging. It may take some time to settle the terms more clearly, concretely and exhaustively. This paper aims to provide a terminological discussion of some important and current concepts related to plagiarism. It discusses key terms/concepts such as copyright, citation cartels, citing vs. quoting, compulsive thief, cryptomnesia, data fakery, ignorance of laws and codes of ethics, information literacy, lack of training, misattribution, fair use clause, paraphrasing, plagiarism, plagiarism detection software, publish or perish syndrome, PubPeer, retraction, retraction vs. correction, retraction watch, salami publication, similarity score, Society for Scientific Values, and source attribution. The explanation and definition of these terms/concepts can be useful for LIS scholars and professionals in their efforts to fight plagiarism. We expect this terminology can be referred in future discussions on the topic and also used to improve the communications between the actors involved.

Journal ArticleDOI
TL;DR: The results of this study indicate that the Nazief-Adriani stemmer is superior to the Porter stemmer for the winnowing algorithm in detection quality, lowering the similarity value by only 0.28%, while the Porter stemmer is superior in speed, making processing up to 69% faster.
Abstract: Current technological developments are turning physical paper into digital documents, and this has a major impact. The impact is positive in that paper waste is reduced; on the other hand, the ease of copying digital data has led to a growing amount of plagiarism. Experts have made many efforts to address the problem of plagiarism, one of which is using the winnowing algorithm as a tool to detect plagiarized data. In its development, the winnowing algorithm has often been optimized with stemming techniques; the most widely used stemmer algorithms include Porter and Nazief-Adriani. However, there has been no comparative study of the effect of these stemmers on the winnowing algorithm's performance in measuring the degree of plagiarism, so research on the effect of stemmer algorithms on the winnowing algorithm is needed to make plagiarism detection results more optimal. The results of this study indicate that the Nazief-Adriani stemmer is superior to the Porter stemmer for the winnowing algorithm in detection quality, lowering the similarity value by only 0.28%, while the Porter stemmer is superior in speed, making processing up to 69% faster.
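The winnowing algorithm at the heart of the study can be sketched as follows: hash every k-gram of the normalized text, slide a window of w consecutive hashes, and keep each window's minimum hash as a fingerprint. The k, w, hash function, and Jaccard comparison below are illustrative defaults, and the stemming step the paper compares would run on the text before `kgram_hashes`:

```python
import hashlib

def kgram_hashes(text, k=5):
    """Hash every character k-gram of the whitespace-stripped, lowercased text."""
    text = "".join(text.lower().split())
    return [int(hashlib.md5(text[i:i + k].encode()).hexdigest(), 16) % (1 << 32)
            for i in range(len(text) - k + 1)]

def winnow(hashes, w=4):
    """Winnowing: keep the minimum hash of each window of w consecutive hashes."""
    fingerprints = set()
    for i in range(len(hashes) - w + 1):
        fingerprints.add(min(hashes[i:i + w]))
    return fingerprints

def similarity(a, b, k=5, w=4):
    """Jaccard similarity over the two documents' winnowed fingerprints."""
    fa = winnow(kgram_hashes(a, k), w)
    fb = winnow(kgram_hashes(b, k), w)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

The window guarantees that any shared substring of length at least k + w - 1 contributes at least one common fingerprint, which is why winnowing can compare documents using only a small fraction of their k-gram hashes.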

Journal ArticleDOI
TL;DR: The corpus developed in this study will help to foster research in Urdu, an under-resourced language, and will be useful in the development, comparison, and evaluation of cross-lingual plagiarism detection systems for the Urdu-English language pair.
Abstract: Cross-lingual plagiarism occurs when the source (or original) text(s) is in one language and the plagiarized text is in another language. In recent years, cross-lingual plagiarism detection has attracted the attention of the research community because a large amount of digital text is easily accessible in many languages through online digital repositories, and machine translation systems are readily available, making it easier to perform cross-lingual plagiarism and harder to detect it. To develop and evaluate cross-lingual plagiarism detection systems, standard evaluation resources are needed. The majority of earlier studies have developed cross-lingual plagiarism corpora for English and other European language pairs. However, for the Urdu-English language pair, the problem of cross-lingual plagiarism detection has not been thoroughly explored, although a large amount of digital text is readily available in Urdu and it is spoken in many countries of the world (particularly in Pakistan, India, and Bangladesh). To fill this gap, this paper presents a large benchmark cross-lingual corpus for the Urdu-English language pair. The proposed corpus contains 2,395 source-suspicious document pairs (540 are automatic translations, 539 are artificially paraphrased, 508 are manually paraphrased, and 808 are nonplagiarized). Furthermore, our proposed corpus contains three types of cross-lingual examples, including artificial (automatic translation and artificially paraphrased), simulated (manually paraphrased), and real (nonplagiarized), which have not been previously reported in the development of cross-lingual corpora. Detailed analysis of our proposed corpus was carried out using n-gram overlap and longest common subsequence approaches. Using word unigrams, mean similarity scores of 1.00, 0.68, 0.52, and 0.22 were obtained for automatic translation, artificially paraphrased, manually paraphrased, and nonplagiarized documents, respectively.
These results show that the documents in the proposed corpus were created using different obfuscation techniques, which makes the dataset more realistic and challenging. We believe that the corpus developed in this study will help to foster research in Urdu, an under-resourced language, and will be useful in the development, comparison, and evaluation of cross-lingual plagiarism detection systems for the Urdu-English language pair. Our proposed corpus is free and publicly available for research purposes.
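The paper does not spell out its overlap formula, so as one plausible reading, a standard word n-gram containment score of the kind commonly used in such corpus analyses can be sketched as:

```python
from collections import Counter

def word_ngrams(text, n=1):
    """Multiset of word n-grams of a text (word unigrams by default)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def containment(suspicious, source, n=1):
    """Fraction of the suspicious document's n-grams also found in the source."""
    susp = word_ngrams(suspicious, n)
    src = word_ngrams(source, n)
    if not susp:
        return 0.0
    overlap = sum(min(count, src[gram]) for gram, count in susp.items())
    return overlap / sum(susp.values())
```

On such a scale, a score near 1.00 (as for the automatic translations above) means nearly every word of the suspicious document also appears in the source, while scores near 0.22 reflect only the incidental overlap expected between unrelated texts.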

Journal ArticleDOI
TL;DR: A novel method to detect source code plagiarism by using a high-level fuzzy Petri net (HLFPN) based on the abstract syntax tree (AST), which can effectively detect code plagiarism even when identifiers are renamed or program statements are reordered.
Abstract: Students majoring in computer science and/or engineering are required to write programs in a variety of programming languages. However, many students submit source code obtained from the Internet or from friends with few or no modifications. Detecting code plagiarism by students is very time-consuming and leads to unfair evaluation of learning performance. This paper proposes a novel method to detect source code plagiarism by using a high-level fuzzy Petri net (HLFPN) based on the abstract syntax tree (AST). First, the AST of each source code is generated after lexical and syntactic analyses have been performed. Second, a token sequence is generated from the AST; working from the AST makes detection robust to plagiarism that renames identifiers or reorders program statements. Finally, the generated token sequences are compared with one another using an HLFPN to determine code plagiarism. Furthermore, the experimental results indicate that the method makes better determinations in detecting code plagiarism.
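The HLFPN comparison itself is beyond a short sketch, but the AST-to-token-sequence step can be illustrated with Python's `ast` module: serializing only node types, and dropping identifier names and literal values, yields a sequence that is unchanged by variable renaming. The two programs below are made-up examples, not from the paper:

```python
import ast

def ast_token_sequence(source):
    """Serialize a program's AST as a sequence of node-type tokens.

    Identifier names and literal values are discarded, so renaming
    variables does not change the sequence. ast.walk traverses the
    tree breadth-first, which is deterministic for a given structure.
    """
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]

# Two structurally identical programs that differ only in identifier names
prog_a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
prog_b = "def acc(items):\n    r = 0\n    for it in items:\n        r += it\n    return r\n"
```

Here `ast_token_sequence(prog_a)` equals `ast_token_sequence(prog_b)`, which is exactly the identifier-renaming robustness the abstract describes; a text-level diff of the two programs would instead report them as mostly different.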

Proceedings ArticleDOI
22 Feb 2019
TL;DR: A container-based system to automatically run and evaluate networked applications that implement distributed algorithms, which has been implemented as an extension to Submitty, an open source, language-agnostic course management platform with automated testing and automated grading of student programming assignments.
Abstract: We present a container-based system to automatically run and evaluate networked applications that implement distributed algorithms. Our implementation of this design leverages lightweight, networked Docker containers to provide students with fast, accurate, and helpful feedback about the correctness of their submitted code. We provide a simple, easy-to-use interface for instructors to specify networks, deploy and run instances of student and instructor code, and log and collect statistics concerning node connection types and message content. Instructors further have the ability to control network features such as message delay, drop, and reorder. Running student programs can be interacted with via stream-controlled standard input or through additional containers running custom instructor software. Student program behavior can be automatically evaluated by analyzing console or file output and instructor-specified rules regarding network communications. Program behavior, including logs of all messages passed within the system, can optionally be displayed to the student to aid in development and debugging. We evaluate the utility of this design and implementation for managing the submission and robust, secure testing of programming projects in a large-enrollment theory of distributed systems course. This research has been implemented as an extension to Submitty, an open source, language-agnostic course management platform with automated testing and automated grading of student programming assignments. Submitty supports all levels of courses, from introductory to advanced special topics, and includes features for manual grading by TAs, version control, team submission, discussion forums, and plagiarism detection.

Journal ArticleDOI
14 Jun 2019
TL;DR: In this article, the authors argue that plagiarism is an act of misconduct and a scourge for science, and that psychology, as one of the sciences most vulnerable to plagiarism, must give more attention to this issue.
Abstract: Plagiarism is an act of misconduct and a scourge for science. Perpetrators of plagiarism steal another author's work without citing the original references. Psychology is one of the sciences most vulnerable to plagiarism and must give more attention to this issue. Several types of plagiarism can be distinguished by motivation (intentional, unintentional, and inadvertent), by method (patchwriting, inappropriate paraphrasing, and summaries), and by self-plagiarism (text recycling, redundant or duplicate publication, and salami-slicing or data fragmentation). There are several reasons for plagiarism, such as the ease of getting information via the internet, pressure from academic tasks, poor writing skills, hurried writing under pressure, a lack of understanding of how to rewrite the original reference, misconceptions about self-plagiarism, and habitual plagiarism. This article also presents steps to avoid plagiarism, such as avoiding "intellectual theft", writing well (citation and paraphrasing), and running a similarity check (plagiarism detection service).

Journal ArticleDOI
TL;DR: The proposed approach has two main steps: the first finds candidate plagiarised fragments, focusing on high recall; the second performs a more precise similarity analysis based on dynamic text alignment, filtering the results by finding alignments between the identified fragments.
Abstract: Fast and easy access to a wide range of documents in various languages, in conjunction with the wide availability of translation and editing tools, has led to the need to develop effective tools fo...