Topic

Plagiarism detection

About: Plagiarism detection is a research topic. Over its lifetime, 1,790 publications have been published within this topic, receiving 24,740 citations.


Papers
Journal ArticleDOI
Yikun Hu1, Hui Wang1, Yuanyuan Zhang1, Bodong Li1, Dawu Gu1 
TL;DR: The experimental results show that BinMatch is resilient to semantics-equivalent code transformations and not only covers all target functions for similarity comparison but also improves accuracy compared to state-of-the-art solutions.
Abstract: Binary code similarity comparison is a methodology for identifying similar or identical code fragments in binary programs. It is indispensable in the fields of software engineering and security and has many important applications (e.g., plagiarism detection, bug detection). With the widespread adoption of smart and Internet of Things (IoT) devices, an increasing number of programs are ported to multiple architectures (e.g., ARM, MIPS), so it becomes necessary to detect similar binary code across architectures as well. The main challenge of this topic lies in the semantics-equivalent code transformations resulting from different compilation settings, code obfuscation, and varied instruction set architectures. Another challenge is the trade-off between comparison accuracy and coverage. Unfortunately, existing methods still heavily rely on semantics-less code features, which are susceptible to code transformation. Additionally, they perform the comparison either purely statically or purely dynamically, which cannot achieve high accuracy and coverage simultaneously. In this paper, we propose a semantics-based hybrid method to compare binary function similarity. We execute the reference function with test cases, then emulate the execution of every target function with the runtime information migrated from the reference function. Semantic signatures are extracted during both the execution and the emulation. Lastly, similarity scores are calculated from the signatures to measure the likeness of functions. We have implemented the method in a prototype system designated BinMatch, which performs binary code similarity comparison across the x86, ARM, and MIPS architectures on the Linux platform. We evaluate BinMatch with nine real-world projects compiled with different compilation settings, on different architectures, and with commonly used obfuscation methods, performing over 100 million function-pair comparisons in total.
The experimental results show that BinMatch is resilient to semantics-equivalent code transformations. Moreover, it not only covers all target functions for similarity comparison but also improves accuracy compared to state-of-the-art solutions.
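The last step described above, scoring function likeness from the extracted semantic signatures, can be sketched as follows. This is a minimal illustration only: Jaccard similarity over signature sets is an assumed metric, not BinMatch's actual scoring method.

```python
def signature_similarity(ref_sig, tgt_sig):
    """Score how alike two functions' semantic signatures are.

    ref_sig / tgt_sig are collections of signature entries (e.g. values
    observed while executing the reference function and emulating the
    target). Jaccard similarity over the entry sets is an illustrative
    assumption, not the metric the paper uses.
    """
    ref, tgt = set(ref_sig), set(tgt_sig)
    if not ref and not tgt:
        return 1.0  # two empty signatures are trivially identical
    return len(ref & tgt) / len(ref | tgt)
```

A score near 1.0 would mark a target function as a likely match for the reference; a threshold on this score trades accuracy against coverage.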

14 citations

Journal ArticleDOI
TL;DR: A performance overview of various types of corpus-based models, especially deep learning (DL) models, on the task of paraphrase detection shows that DL models are very competitive with traditional state-of-the-art approaches and have potential that should be further developed.
Abstract: Paraphrase detection is important for a number of applications, including plagiarism detection, authorship attribution, question answering, text summarization, and text mining in general. In this paper, we give a performance overview of various types of corpus-based models, especially deep learning (DL) models, on the task of paraphrase detection. We report the results of eight models (LSI, TF-IDF, Word2Vec, Doc2Vec, GloVe, FastText, ELMo, and USE) evaluated on three publicly available corpora: the Microsoft Research Paraphrase Corpus, the Clough and Stevenson corpus, and the Webis Crowd Paraphrase Corpus 2011. Through extensive experiments, we determined the most appropriate approaches to text pre-processing, hyper-parameters, sub-model selection where one exists (e.g., Skip-gram vs. CBOW), distance measures, and the semantic similarity/paraphrase detection threshold. Our findings, and those of other researchers who have used deep learning models, show that DL models are very competitive with traditional state-of-the-art approaches and have potential that should be further developed.
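Whatever the embedding model, the paraphrase decision reduces to representing both texts as vectors and thresholding a similarity measure. A minimal sketch with bag-of-words count vectors and cosine similarity; the embedding choice and threshold value here are assumptions, since the paper's models use learned vectors such as Word2Vec or USE and a tuned threshold.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def is_paraphrase(text_a, text_b, threshold=0.7):
    """Flag a pair as a paraphrase when similarity clears the threshold."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    return cosine(va, vb) >= threshold
```

Swapping the `Counter` vectors for learned sentence embeddings changes only the representation; the thresholding step stays the same.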

14 citations

Proceedings ArticleDOI
29 Jan 2019
TL;DR: This paper investigates automated code plagiarism detection in the context of an undergraduate-level data structures and algorithms module and shows that the degree of agreement between the evaluated tools is relatively low.
Abstract: This paper investigates automated code plagiarism detection in the context of an undergraduate-level data structures and algorithms module. We compare three software tools that aim to detect plagiarism in students' programming source code. We evaluate the performance of these tools on an individual basis and the degree of agreement between them. Based on this evaluation, we show that the degree of agreement between these tools is relatively low. We also report the challenges faced while using these tools and suggest possible future improvements for tools of this kind. The discrepancies in the results obtained by these detection techniques were used to devise guidelines for effectively detecting code plagiarism.
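Agreement between two detectors can be quantified with a chance-corrected statistic such as Cohen's kappa. The paper does not name the agreement measure it uses; kappa is shown here as one common choice.

```python
def cohens_kappa(verdicts_a, verdicts_b):
    """Cohen's kappa for two tools' binary plagiarism verdicts (0/1 lists).

    1.0 means perfect agreement; 0.0 means agreement is no better than
    what the tools' positive rates would produce by chance.
    """
    n = len(verdicts_a)
    # observed agreement: fraction of cases where the tools concur
    po = sum(a == b for a, b in zip(verdicts_a, verdicts_b)) / n
    # expected chance agreement from each tool's marginal positive rate
    pa, pb = sum(verdicts_a) / n, sum(verdicts_b) / n
    pe = pa * pb + (1 - pa) * (1 - pb)
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)
```

A low kappa across tool pairs, as the paper reports, suggests the tools flag substantially different sets of submissions, which motivates combining their verdicts rather than trusting any single tool.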

14 citations

01 Jan 2009
TL;DR: The process of corpus creation and some features of the resulting resource, which is designed to represent the types of plagiarism found within higher education as closely as possible, are described.
Abstract: Plagiarism is a serious problem in higher education and is generally acknowledged to be on the increase (McCabe, 2005). Text analysis tools have the potential to be applied to work submitted by students and to assist the educator in the detection of plagiarised text. It is difficult to develop and evaluate such systems without examples of such documents. There is therefore a need for resources that contain examples of plagiarised text submitted by students. However, gathering examples of such texts presents a unique set of challenges for corpus construction. This paper discusses current work towards the creation of a corpus of documents submitted for assessment in higher education that contain examples of simulated plagiarism. The corpus is designed to represent the types of plagiarism found within higher education as closely as possible. We describe the process of corpus creation and some features of the resulting resource. It is hoped that this resource will prove useful for research into the problem of plagiarism detection.

14 citations

Journal ArticleDOI
TL;DR: This paper employs text embedding vectors to compare similarity among documents to detect plagiarism and applies the proposed method to available datasets in English, Persian, and Arabic on the text alignment task, evaluating the robustness of the proposed methods from the language perspective.
Abstract: The efficiency and scalability of plagiarism detection systems have become a major challenge due to the vast amount of textual data available in several languages over the Internet. Plagiarism occurs at different levels of obfuscation, ranging from the exact copy of original materials to text summarization. Consequently, algorithms designed to detect plagiarism should be robust to the diverse languages and different types of obfuscation in plagiarism cases. In this paper, we employ text embedding vectors to compare similarity among documents to detect plagiarism. Word vectors are combined by a simple aggregation function to represent a text document. This representation comprises semantic and syntactic information of the text and leads to efficient text alignment between suspicious and original documents. By comparing representations of sentences in source and suspicious documents, sentence pairs with the highest similarity are considered candidates, or seeds, of plagiarism cases. To filter and merge these seeds, a set of parameters, including the Jaccard similarity and merging thresholds, is tuned by two different approaches: offline tuning and online tuning. The offline method, which is used as the benchmark, settles on a single set of parameters for all types of plagiarism through several trials on the training corpus. Experiments show improvements in performance when obfuscation type is considered during threshold tuning. In this regard, our proposed online approach uses two statistical methods to filter outlier candidates automatically by their scale of obfuscation. With the online tuning approach, no distinct training dataset is required to train the system. We applied our proposed method to available datasets in English, Persian, and Arabic on the text alignment task, to evaluate the robustness of the proposed methods from the language perspective as well.
As our experimental results confirm, our approach achieves strong performance on the different datasets in various languages. Our online threshold-tuning approach, which requires no training dataset, works as well as, and in some cases better than, the training-based method.
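The seeding step described above (comparing sentence representations across the two documents and keeping pairs above a similarity threshold) can be sketched as follows. Word-overlap Jaccard stands in here for the paper's aggregated embedding similarity, and the threshold value is illustrative, not the tuned parameter from the paper.

```python
def jaccard(sent_a, sent_b):
    """Word-overlap Jaccard similarity between two sentences."""
    sa, sb = set(sent_a.lower().split()), set(sent_b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def seed_candidates(source_sents, suspicious_sents, threshold=0.5):
    """Return (source_idx, suspicious_idx, score) plagiarism seeds.

    Every cross-document sentence pair whose similarity clears the
    threshold becomes a seed; later stages would filter and merge
    adjacent seeds into plagiarism passages.
    """
    seeds = []
    for i, src in enumerate(source_sents):
        for j, susp in enumerate(suspicious_sents):
            score = jaccard(src, susp)
            if score >= threshold:
                seeds.append((i, j, score))
    return seeds
```

In the online-tuning variant, the threshold would be adjusted per obfuscation level instead of being fixed up front.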

14 citations


Network Information
Related Topics (5)
- Active learning: 42.3K papers, 1.1M citations (78% related)
- The Internet: 213.2K papers, 3.8M citations (77% related)
- Software development: 73.8K papers, 1.4M citations (77% related)
- Graph (abstract data type): 69.9K papers, 1.2M citations (76% related)
- Deep learning: 79.8K papers, 2.1M citations (76% related)
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    59
2022    126
2021    83
2020    118
2019    130
2018    125