Topic

Plagiarism detection

About: Plagiarism detection is a research topic. Over its lifetime, 1,790 publications have been published within this topic, receiving 24,740 citations.


Papers
Journal ArticleDOI
Yikun Hu1, Hui Wang1, Yuanyuan Zhang1, Bodong Li1, Dawu Gu1 
TL;DR: The experimental results show that BinMatch is resilient to semantics-equivalent code transformations and not only covers all target functions for similarity comparison but also improves accuracy compared to state-of-the-art solutions.
Abstract: Binary code similarity comparison is a methodology for identifying similar or identical code fragments in binary programs. It is indispensable in the fields of software engineering and security and has many important applications (e.g., plagiarism detection, bug detection). With the widespread adoption of smart and Internet of Things (IoT) devices, an increasing number of programs are ported to multiple architectures (e.g., ARM, MIPS), so it becomes necessary to detect similar binary code across architectures as well. The main challenge of this topic lies in the semantics-equivalent code transformations resulting from different compilation settings, code obfuscation, and varied instruction set architectures. Another challenge is the trade-off between comparison accuracy and coverage. Unfortunately, existing methods still heavily rely on semantics-less code features, which are susceptible to code transformation. Additionally, they perform the comparison either purely statically or purely dynamically, which cannot achieve high accuracy and coverage simultaneously. In this paper, we propose a semantics-based hybrid method to compare binary function similarity. We execute the reference function with test cases, then emulate the execution of every target function with the runtime information migrated from the reference function. Semantic signatures are extracted during both the execution and the emulation. Lastly, similarity scores are calculated from the signatures to measure the likeness of functions. We have implemented the method in a prototype system designated BinMatch, which performs binary code similarity comparison across the x86, ARM, and MIPS architectures on the Linux platform. We evaluate BinMatch with nine real-world projects compiled with different compilation settings, on different architectures, and with commonly used obfuscation methods, performing over 100 million function-pair comparisons in total.
The experimental results show that BinMatch is resilient to semantics-equivalent code transformations. Moreover, it not only covers all target functions for similarity comparison but also improves accuracy compared to state-of-the-art solutions.
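The last step described above, scoring function likeness from the extracted semantic signatures, can be sketched as follows. This is a minimal illustration only: Jaccard similarity over signature sets is an assumed metric, not BinMatch's actual scoring method.

```python
def signature_similarity(ref_sig, tgt_sig):
    """Score how alike two functions' semantic signatures are.

    ref_sig / tgt_sig are collections of signature entries (e.g. values
    observed while executing the reference function and emulating the
    target). Jaccard similarity over the entry sets is an illustrative
    assumption, not the metric the paper uses.
    """
    ref, tgt = set(ref_sig), set(tgt_sig)
    if not ref and not tgt:
        return 1.0  # two empty signatures are trivially identical
    return len(ref & tgt) / len(ref | tgt)
```

A score near 1.0 would mark a target function as a likely match for the reference; a threshold on this score trades accuracy against coverage.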

14 citations

Journal ArticleDOI
TL;DR: A performance overview of various types of corpus-based models, especially deep learning (DL) models, on the task of paraphrase detection shows that DL models are very competitive with traditional state-of-the-art approaches and have potential that should be further developed.
Abstract: Paraphrase detection is important for a number of applications, including plagiarism detection, authorship attribution, question answering, text summarization, and text mining in general. In this paper, we give a performance overview of various types of corpus-based models, especially deep learning (DL) models, on the task of paraphrase detection. We report the results of eight models (LSI, TF-IDF, Word2Vec, Doc2Vec, GloVe, FastText, ELMo, and USE) evaluated on three publicly available corpora: the Microsoft Research Paraphrase Corpus, the Clough and Stevenson corpus, and the Webis Crowd Paraphrase Corpus 2011. Through extensive experiments, we determined the most appropriate approaches to text pre-processing, hyper-parameters, sub-model selection where one exists (e.g., Skip-gram vs. CBOW), distance measures, and the semantic similarity/paraphrase detection threshold. Our findings, and those of other researchers who have used deep learning models, show that DL models are very competitive with traditional state-of-the-art approaches and have potential that should be further developed.
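Whatever the embedding model, the paraphrase decision reduces to representing both texts as vectors and thresholding a similarity measure. A minimal sketch with bag-of-words count vectors and cosine similarity; the embedding choice and threshold value here are assumptions, since the paper's models use learned vectors such as Word2Vec or USE and a tuned threshold.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def is_paraphrase(text_a, text_b, threshold=0.7):
    """Flag a pair as a paraphrase when similarity clears the threshold."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    return cosine(va, vb) >= threshold
```

Swapping the `Counter` vectors for learned sentence embeddings changes only the representation; the thresholding step stays the same.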

14 citations

Proceedings ArticleDOI
29 Jan 2019
TL;DR: This paper investigates automated code plagiarism detection in the context of an undergraduate-level data structures and algorithms module and shows that the degree of agreement between the evaluated tools is relatively low.
Abstract: This paper investigates automated code plagiarism detection in the context of an undergraduate-level data structures and algorithms module. We compare three software tools that aim to detect plagiarism in students' programming source code. We evaluate the performance of these tools on an individual basis and the degree of agreement between them. Based on this evaluation, we show that the degree of agreement between these tools is relatively low. We also report the challenges faced while using these tools and suggest possible future improvements for tools of this kind. The discrepancies in the results obtained by these detection techniques were used to devise guidelines for effectively detecting code plagiarism.
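Agreement between two detectors can be quantified with a chance-corrected statistic such as Cohen's kappa. The paper does not name the agreement measure it uses; kappa is shown here as one common choice.

```python
def cohens_kappa(verdicts_a, verdicts_b):
    """Cohen's kappa for two tools' binary plagiarism verdicts (0/1 lists).

    1.0 means perfect agreement; 0.0 means agreement is no better than
    what the tools' positive rates would produce by chance.
    """
    n = len(verdicts_a)
    # observed agreement: fraction of cases where the tools concur
    po = sum(a == b for a, b in zip(verdicts_a, verdicts_b)) / n
    # expected chance agreement from each tool's marginal positive rate
    pa, pb = sum(verdicts_a) / n, sum(verdicts_b) / n
    pe = pa * pb + (1 - pa) * (1 - pb)
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)
```

A low kappa across tool pairs, as the paper reports, suggests the tools flag substantially different sets of submissions, which motivates combining their verdicts rather than trusting any single tool.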

14 citations

01 Jan 2009
TL;DR: The process of corpus creation and some features of the resulting resource, which is designed to represent the types of plagiarism found within higher education as closely as possible, are described.
Abstract: Plagiarism is a serious problem in higher education and is generally acknowledged to be on the increase (McCabe, 2005). Text analysis tools have the potential to be applied to work submitted by students and to assist the educator in the detection of plagiarised text. It is difficult to develop and evaluate such systems without examples of such documents. There is therefore a need for resources that contain examples of plagiarised text submitted by students. However, gathering examples of such texts presents a unique set of challenges for corpus construction. This paper discusses current work towards the creation of a corpus of documents submitted for assessment in higher education that contain examples of simulated plagiarism. The corpus is designed to represent the types of plagiarism found within higher education as closely as possible. We describe the process of corpus creation and some features of the resulting resource. It is hoped that this resource will prove useful for research into the problem of plagiarism detection.

14 citations

Journal ArticleDOI
TL;DR: This paper employs text embedding vectors to compare similarity among documents to detect plagiarism and applies the proposed method to available datasets in English, Persian, and Arabic on the text alignment task, evaluating the robustness of the proposed methods from the language perspective.
Abstract: The efficiency and scalability of plagiarism detection systems have become a major challenge due to the vast amount of textual data available in several languages over the Internet. Plagiarism occurs at different levels of obfuscation, ranging from the exact copy of original materials to text summarization. Consequently, algorithms designed to detect plagiarism should be robust to the diverse languages and different types of obfuscation in plagiarism cases. In this paper, we employ text embedding vectors to compare similarity among documents to detect plagiarism. Word vectors are combined by a simple aggregation function to represent a text document. This representation comprises semantic and syntactic information of the text and leads to efficient text alignment between suspicious and original documents. By comparing representations of sentences in source and suspicious documents, sentence pairs with the highest similarity are considered candidates, or seeds, of plagiarism cases. To filter and merge these seeds, a set of parameters, including the Jaccard similarity and merging thresholds, is tuned by two different approaches: offline tuning and online tuning. The offline method, which is used as the benchmark, settles on a single set of parameters for all types of plagiarism through several trials on the training corpus. Experiments show improvements in performance when obfuscation type is considered during threshold tuning. In this regard, our proposed online approach uses two statistical methods to filter outlier candidates automatically by their scale of obfuscation. With the online tuning approach, no distinct training dataset is required to train the system. We applied our proposed method to available datasets in English, Persian, and Arabic on the text alignment task, to evaluate the robustness of the proposed methods from the language perspective as well.
As our experimental results confirm, our approach achieves strong performance on the different datasets in various languages. Our online threshold-tuning approach, which requires no training dataset, works as well as, and in some cases better than, the training-based method.
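The seeding step described above (comparing sentence representations across the two documents and keeping pairs above a similarity threshold) can be sketched as follows. Word-overlap Jaccard stands in here for the paper's aggregated embedding similarity, and the threshold value is illustrative, not the tuned parameter from the paper.

```python
def jaccard(sent_a, sent_b):
    """Word-overlap Jaccard similarity between two sentences."""
    sa, sb = set(sent_a.lower().split()), set(sent_b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def seed_candidates(source_sents, suspicious_sents, threshold=0.5):
    """Return (source_idx, suspicious_idx, score) plagiarism seeds.

    Every cross-document sentence pair whose similarity clears the
    threshold becomes a seed; later stages would filter and merge
    adjacent seeds into plagiarism passages.
    """
    seeds = []
    for i, src in enumerate(source_sents):
        for j, susp in enumerate(suspicious_sents):
            score = jaccard(src, susp)
            if score >= threshold:
                seeds.append((i, j, score))
    return seeds
```

In the online-tuning variant, the threshold would be adjusted per obfuscation level instead of being fixed up front.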

14 citations


Network Information
Related Topics (5)
- Active learning: 42.3K papers, 1.1M citations (78% related)
- The Internet: 213.2K papers, 3.8M citations (77% related)
- Software development: 73.8K papers, 1.4M citations (77% related)
- Graph (abstract data type): 69.9K papers, 1.2M citations (76% related)
- Deep learning: 79.8K papers, 2.1M citations (76% related)
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    59
2022    126
2021    83
2020    118
2019    130
2018    125