Showing papers on "Plagiarism detection published in 2017"


Proceedings ArticleDOI
30 Oct 2017
TL;DR: This work proposes a novel neural network-based approach that computes an embedding, i.e., a numeric vector, from the control flow graph of each binary function, and shows that the resulting prototype, Gemini, outperforms state-of-the-art approaches by large margins in similarity detection accuracy.
Abstract: The problem of cross-platform binary code similarity detection aims at detecting whether two binary functions coming from different platforms are similar or not. It has many security applications, including plagiarism detection, malware detection, vulnerability search, etc. Existing approaches rely on approximate graph-matching algorithms, which are inevitably slow, sometimes inaccurate, and hard to adapt to a new task. To address these issues, we propose a novel neural network-based approach to compute an embedding, i.e., a numeric vector, based on the control flow graph of each binary function, so that similarity detection can be done efficiently by measuring the distance between the embeddings of two functions. We implement a prototype called Gemini. Our extensive evaluation shows that Gemini outperforms the state-of-the-art approaches by large margins with respect to similarity detection accuracy. Further, Gemini can speed up prior art's embedding generation time by 3 to 4 orders of magnitude and reduce the required training time from more than 1 week down to 30 minutes to 10 hours. Our real-world case studies demonstrate that Gemini can identify significantly more vulnerable firmware images than the state-of-the-art, i.e., Genius. Our research showcases a successful application of deep learning to computer security problems.
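
The pipeline described above reduces to two steps: map each function's control flow graph (CFG) to a vector, then compare vectors by a distance measure. The sketch below is a toy stand-in for that idea, with untrained random propagation weights and made-up block features; it is not Gemini's actual trained network:

```python
import numpy as np

def embed_cfg(node_features, adjacency, rounds=2, seed=0):
    """Map a CFG to a single vector: node_features is (n, d), adjacency is (n, n)."""
    d = node_features.shape[1]
    w = np.random.default_rng(seed).standard_normal((d, d)) / np.sqrt(d)
    h = node_features.astype(float)
    for _ in range(rounds):                  # propagate features along CFG edges
        h = np.tanh(node_features + adjacency @ h @ w)
    return h.sum(axis=0)                     # aggregate nodes into a graph embedding

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two toy 3-block CFGs with 4 made-up block-level features each.
feats = np.array([[1.0, 0, 2, 1], [0, 1, 1, 0], [2, 0, 0, 1]])
adj = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
print(cosine(embed_cfg(feats, adj), embed_cfg(feats * 1.1, adj)))  # near 1.0
```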

339 citations


Journal ArticleDOI
TL;DR: The experimental results show that the proposed binary-oriented, obfuscation-resilient binary code similarity comparison method can be applied to software plagiarism and algorithm detection, and is effective and practical for analyzing real-world software.
Abstract: Existing code similarity comparison methods, whether source- or binary-code based, are mostly not resilient to obfuscations. Identifying similar or identical code fragments among programs is very important in some applications. For example, one application is to detect illegal code reuse. In code theft cases, emerging obfuscation techniques have made automated detection increasingly difficult. Another application is to identify cryptographic algorithms, which are widely employed by modern malware to circumvent detection, hide network communications, and protect payloads, among other purposes. Due to diverse coding styles and high programming flexibility, different implementations of the same algorithm may appear very distinct, making automatic detection very hard even before code obfuscations are applied. In this paper, we propose a binary-oriented, obfuscation-resilient binary code similarity comparison method based on a new concept, the longest common subsequence of semantically equivalent basic blocks, which combines rigorous program semantics with longest-common-subsequence-based fuzzy matching. We model the semantics of a basic block by a set of symbolic formulas representing the input-output relations of the block. This way, the semantic equivalence (and similarity) of two blocks can be checked by a theorem prover. We then model the semantic similarity of two paths using the longest common subsequence with basic blocks as elements. This novel combination has resulted in strong resiliency to code obfuscation. We have developed a prototype. The experimental results show that our method can be applied to software plagiarism and algorithm detection, and is effective and practical for analyzing real-world software.
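
A minimal sketch of the central construct: a longest common subsequence over basic blocks where "equal" is a pluggable semantic-equivalence predicate. In the paper that predicate is a theorem-prover check on symbolic input-output formulas; the token-equality stand-in below and the toy block lists are illustrative assumptions only:

```python
def lcs_length(path_a, path_b, equivalent):
    """Dynamic-programming LCS where element equality is a caller-supplied test."""
    m, n = len(path_a), len(path_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if equivalent(path_a[i - 1], path_b[j - 1]):
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

# Toy stand-in: blocks as opcode strings, equivalence as plain equality.
blocks_a = ["mov", "add", "jmp"]
blocks_b = ["mov", "sub", "add", "jmp"]
score = lcs_length(blocks_a, blocks_b, equivalent=lambda x, y: x == y)
print(score / max(len(blocks_a), len(blocks_b)))  # normalized path similarity
```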

87 citations


Book ChapterDOI
11 Sep 2017
TL;DR: This work states that the introduction of features derived from the Abstract Syntax Tree of source code has recently set new benchmarks in this area, significantly improving over previous work that relied on easily obfuscatable lexical and format features of program source code.
Abstract: Machine learning approaches to source code authorship attribution attempt to find statistical regularities in human-generated source code that can identify the author or authors of that code. This has applications in plagiarism detection, intellectual property infringement, and post-incident forensics in computer security. The introduction of features derived from the Abstract Syntax Tree (AST) of source code has recently set new benchmarks in this area, significantly improving over previous work that relied on easily obfuscatable lexical and format features of program source code. However, these AST-based approaches rely on hand-constructed features derived from such trees, and often include ancillary information such as function and variable names that may be obfuscated or manipulated.
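
The chapter's AST features target source code in general; Python's own ast module makes the idea concrete. A hedged sketch: represent a program by n-grams of syntax-tree node types, which survive renaming of identifiers (unlike the lexical features the chapter criticizes). The bigram choice and the toy snippets are illustrative assumptions:

```python
import ast
from collections import Counter

def ast_node_types(source: str) -> list[str]:
    """Node-type sequence from a breadth-first walk of the syntax tree."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]

def ast_bigram_features(source: str) -> Counter:
    kinds = ast_node_types(source)
    return Counter(zip(kinds, kinds[1:]))   # node-type bigrams as features

# Identifier names differ, but the node-type profile is identical.
print(ast_bigram_features("def f(x):\n    return x + 1") ==
      ast_bigram_features("def total(n):\n    return n + 1"))  # True
```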

75 citations


Proceedings ArticleDOI
20 May 2017
TL;DR: The experimental results show that CACompare is not only effective in dealing with binaries of different architectures and variant compiling configurations, but also improves the accuracy of binary code clone detection compared to state-of-the-art solutions.
Abstract: Binary code clone detection (or similarity comparison) is a fundamental technique for many important applications, such as plagiarism detection, malware analysis, software vulnerability assessment and program comprehension. With the prevalence of smart and IoT (Internet of Things) devices, more and more programs are ported from traditional desktop platforms (e.g., IA-32) to ARM and MIPS architectures. It becomes imperative to detect cloned binary code across architectures. However, because of the incomparable instruction sets of different architectures, as well as alternative compiling configurations, it is difficult to conduct binary code clone detection with traditional syntax- or structure-based methods. To address this, we propose a semantics-based approach. We recognize arguments and indirect jump targets of each binary function, and emulate executions of those functions, extracting semantic signatures to measure the similarity of functions. The approach has been implemented in a prototype system named CACompare to detect cloned binary functions across architectures and compiling configurations. It supports comparisons between mainstream architectures (IA-32, ARM and MIPS) and is able to analyse binaries on the Linux platform. The experimental results show that CACompare is not only effective in dealing with binaries of different architectures and variant compiling configurations, but also improves the accuracy of binary code clone detection compared to state-of-the-art solutions.
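
The approach emulates binary functions and compares the semantic signatures their executions produce. As a hedged stand-in (no binaries or emulator here), the sketch below treats two Python callables as the functions under test, probes them on the same random arguments, and uses output agreement as a crude input-output signature:

```python
import random

def io_signature(fn, probe_inputs):
    """Record fn's outputs (or failure) on a fixed list of probe inputs."""
    out = []
    for args in probe_inputs:
        try:
            out.append(fn(*args))
        except Exception:
            out.append(None)
    return out

random.seed(1)
probes = [(random.randint(-99, 99), random.randint(-99, 99)) for _ in range(64)]

f = lambda a, b: (a + b) * 2
g = lambda a, b: 2 * b + 2 * a        # different syntax, same semantics

sig_f, sig_g = io_signature(f, probes), io_signature(g, probes)
match = sum(x == y for x, y in zip(sig_f, sig_g)) / len(probes)
print(match)   # 1.0 -> behaviourally identical on these probes
```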

72 citations


Proceedings ArticleDOI
03 Apr 2017
TL;DR: An innovative word embedding-based system devoted to calculating the semantic similarity of Arabic sentences by exploiting vectors as word representations in a multidimensional space in order to capture the semantic and syntactic properties of words.
Abstract: Semantic textual similarity is the basis of countless applications and plays an important role in diverse areas, such as information retrieval, plagiarism detection, information extraction and machine translation. This article proposes an innovative word embedding-based system devoted to calculating the semantic similarity of Arabic sentences. The main idea is to exploit vectors as word representations in a multidimensional space in order to capture the semantic and syntactic properties of words. IDF weighting and Part-of-Speech tagging are applied to the examined sentences to support the identification of words that are highly descriptive in each sentence. The performance of our proposed system is confirmed through the Pearson correlation between our assigned semantic similarity scores and human judgments.
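
A hedged sketch of the system's overall shape: sentence vectors as IDF-weighted averages of word vectors, compared by cosine. The tiny random vectors and the IDF table are placeholders, not the paper's trained Arabic resources, and the POS-tagging step is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["student", "copied", "pupil", "duplicated", "report"]
vectors = {w: rng.standard_normal(16) for w in vocab}          # toy embeddings
idf = {"report": 0.4, "student": 1.1, "pupil": 1.1,
       "copied": 2.0, "duplicated": 2.0}                        # made-up IDF

def sentence_vec(tokens):
    """IDF-weighted average of the word vectors present in the sentence."""
    weighted = [idf.get(t, 1.0) * vectors[t] for t in tokens if t in vectors]
    return np.mean(weighted, axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(sentence_vec(["student", "copied", "report"]),
             sentence_vec(["pupil", "duplicated", "report"])))
```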

55 citations


Journal ArticleDOI
TL;DR: The future belongs to algorithms that can handle large amounts of source code and that use model-based representations, which can serve as the foundation of large-scale anti-plagiarism systems.

53 citations


Proceedings ArticleDOI
03 Apr 2017
TL;DR: This paper introduces new cross-language similarity detection methods based on distributed representation of words and combines the different methods proposed to verify their complementarity, obtaining an overall F1 score of 89.15% for English-French similarity detection at chunk level.
Abstract: This paper proposes to use distributed representation of words (word embeddings) in cross-language textual similarity detection. The main contributions of this paper are the following: (a) we introduce new cross-language similarity detection methods based on distributed representation of words; (b) we combine the different methods proposed to verify their complementarity and finally obtain an overall F1 score of 89.15% for English-French similarity detection at chunk level (88.5% at sentence level) on a very challenging corpus.

47 citations


Book ChapterDOI
11 Sep 2017
TL;DR: This work presents new code features that capture programming style at the basic block level, an approach for identifying external template library code, and a new approach to capture correlations between the authors of basic blocks in a binary.
Abstract: Knowing the authors of a binary program has significant application to forensics of malicious software (malware), software supply chain risk management, and software plagiarism detection. Existing techniques assume that a binary is written by a single author, which does not hold true in the real world because most modern software, including malware, often contains code from multiple authors. In this paper, we make the first step toward identifying multiple authors in a binary. We present new fine-grained techniques to address the tougher problem of determining the author of each basic block. The decision to attribute authors at the basic block level is based on an empirical study of three large open source software projects, in which we find that a large fraction of basic blocks can be well attributed to a single author. We present new code features that capture programming style at the basic block level, our approach for identifying external template library code, and a new approach to capture correlations between the authors of basic blocks in a binary. Our experiments show strong evidence that programming styles can be recovered at the basic block level and that it is practical to identify multiple authors in a binary.

45 citations


Journal ArticleDOI
TL;DR: The reported work explores syntax-semantic concept extractions with a genetic algorithm to detect cases of idea plagiarism, where the source ideas are plagiarized and represented in a summarized form.
Abstract: Plagiarism is increasingly becoming a major issue in the academic and educational domains. Automated and effective plagiarism detection systems are direly required to curtail this information breach, especially in tackling idea plagiarism. The proposed approach aims to detect such plagiarism cases, where the idea of a third party is adopted and presented intelligently so that, at the surface level, the plagiarism cannot be unmasked. The reported work explores syntax-semantic concept extractions with a genetic algorithm to detect cases of idea plagiarism. The work mainly focuses on idea plagiarism where the source ideas are plagiarized and represented in a summarized form. Plagiarism detection is employed at both the document and passage levels by exploiting the document concepts at various structural levels. Initially, the idea embedded within the given source document is captured using sentence-level concept extraction with a genetic algorithm. Document-level detection is facilitated with word-level concepts, where syntactic information is extracted and the non-plagiarized documents are pruned. A combined similarity metric that utilizes the semantic-level concept extraction is then employed for passage-level detection. The proposed approach is tested on the PAN 2013-14 plagiarism corpus for summary obfuscation data, which represents a challenging case of idea plagiarism. The performance of the current approach and its variations is evaluated at both the document and passage levels, using information retrieval and PAN plagiarism measures respectively. The results are also compared against the six top-ranked plagiarism detection systems submitted as part of the PAN 2013-14 competition. The results obtained exhibit significant improvement over the compared systems and hence reflect the potency of the proposed syntax-semantic based concept extractions in detecting idea plagiarism.

43 citations


Journal ArticleDOI
TL;DR: A source code plagiarism detection method, named WASTK (Weighted Abstract Syntax Tree Kernel), for computer science education, which takes aspects other than the similarity between programs into account and performs much better than other popular methods.
Abstract: In this paper, we introduce a source code plagiarism detection method, named WASTK (Weighted Abstract Syntax Tree Kernel), for computer science education. Different from other plagiarism detection methods, WASTK takes aspects other than the similarity between programs into account. WASTK first transforms the source code of a program into an abstract syntax tree and then obtains the similarity by calculating the tree kernel of the two abstract syntax trees. To avoid misjudgment caused by trivial code snippets or frameworks given by instructors, an idea similar to TF-IDF (Term Frequency-Inverse Document Frequency) from the field of information retrieval is applied: each node in an abstract syntax tree is assigned a weight by TF-IDF. WASTK is evaluated on different datasets and, as a result, performs much better than other popular methods such as Sim and JPlag.
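
WASTK's full method is a tree kernel over TF-IDF-weighted ASTs; the lighter sketch below keeps only the weighting idea, assuming Python as a stand-in language: count AST node types per submission, weight them by IDF so that scaffolding common to all submissions counts less, and compare weighted counts by cosine:

```python
import ast
import math
from collections import Counter

def node_counts(src: str) -> Counter:
    return Counter(type(n).__name__ for n in ast.walk(ast.parse(src)))

def weighted_cosine(c1: Counter, c2: Counter, idf: dict) -> float:
    keys = set(c1) | set(c2)
    v1 = [c1[k] * idf.get(k, 1.0) for k in keys]
    v2 = [c2[k] * idf.get(k, 1.0) for k in keys]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    return dot / norm if norm else 0.0

submissions = ["def f(x): return x + 1",
               "def g(y): return y + 1",
               "print('unrelated')"]
counts = [node_counts(s) for s in submissions]
df = Counter(k for c in counts for k in c)               # document frequency
idf = {k: math.log(len(counts) / df[k]) + 1.0 for k in df}
print(weighted_cosine(counts[0], counts[1], idf))        # near-identical pair
```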

Journal ArticleDOI
TL;DR: This paper proposes an unsupervised and very resource-light approach for measuring semantic similarity between texts in different languages by projecting continuous word vectors (i.e., word embeddings) from one language to the vector space of the other language via a linear translation model.
Abstract: Recognizing semantically similar sentences or paragraphs across languages is beneficial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for predicting cross-lingual semantic similarity of short texts, however, make use of tools and resources (e.g., machine translation systems, syntactic parsers or named entity recognition) that for many languages (or language pairs) do not exist. In contrast, we propose an unsupervised and very resource-light approach for measuring semantic similarity between texts in different languages. To operate in the bilingual (or multilingual) space, we project continuous word vectors (i.e., word embeddings) from one language to the vector space of the other language via a linear translation model. We then align words according to the similarity of their vectors in the bilingual embedding space and investigate different unsupervised measures of semantic similarity exploiting bilingual embeddings and word alignments. Requiring only a limited-size set of word translation pairs between the languages, the proposed approach is applicable to virtually any pair of languages for which a sufficiently large corpus exists from which to learn monolingual word embeddings. Experimental results on three different datasets for measuring semantic textual similarity show that our simple resource-light approach reaches performance close to that of supervised and resource-intensive methods, displaying stability across different language pairs. Furthermore, we evaluate the proposed method on two extrinsic tasks, namely extraction of parallel sentences from comparable corpora and cross-lingual plagiarism detection, and show that it yields performance comparable to that of complex resource-intensive state-of-the-art models for the respective tasks.
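
The projection step has a compact closed form: given a seed dictionary of translation pairs, fit a linear map from source-language vectors to target-language vectors by least squares (the linear translation model the abstract references). Random matrices stand in for real monolingual embeddings in this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
X = rng.standard_normal((500, d))        # source-language vectors of seed pairs
W_true = rng.standard_normal((d, d))     # pretend "true" cross-lingual map
Y = X @ W_true                           # target-language vectors of seed pairs

W, *_ = np.linalg.lstsq(X, Y, rcond=None)    # fit the projection matrix

x_new = rng.standard_normal(d)           # an unseen source-language word vector
projected = x_new @ W                    # now lives in the target vector space
print(np.allclose(projected, x_new @ W_true, atol=1e-6))  # True on clean data
```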

Journal ArticleDOI
01 Jan 2017-PLOS ONE
TL;DR: This paper proposes a new method to identify the programmer of Java source code samples with higher accuracy; a comparison with previous work on authorship attribution of Java source code shows that the proposed method outperforms others overall, with an acceptable overhead.
Abstract: Authorship attribution is the task of identifying the most likely author of a given sample among a set of known candidate authors. It can not only be applied to discover the original author of plain text, such as novels, blogs, emails, posts, etc., but can also be used to identify the programmer of source code. Authorship attribution of source code is required in diverse applications, ranging from malicious code tracking to resolving authorship disputes and software plagiarism detection. This paper proposes a new method to identify the programmer of Java source code samples with higher accuracy. To this end, it first introduces a back propagation (BP) neural network based on particle swarm optimization (PSO) into authorship attribution of source code. It begins by computing a set of defined feature metrics, including lexical and layout metrics and structure and syntax metrics, 19 dimensions in total. These metrics are then input to the neural network for supervised learning, the weights of which are produced by the hybrid PSO-BP algorithm. The effectiveness of the proposed method is evaluated on a collected dataset of 3,022 Java files belonging to 40 authors. Experimental results show that the proposed method achieves 91.060% accuracy, and a comparison with previous work on authorship attribution of Java source code illustrates that the proposed method outperforms others overall, also with an acceptable overhead.

Proceedings ArticleDOI
06 Nov 2017
TL;DR: Mathematical expressions are shown to be promising text-independent features for identifying academic plagiarism in large collections; the detection approach is implemented as an open-source parallel data processing pipeline built using the Apache Flink framework.
Abstract: This paper presents, to our knowledge, the first study on analyzing mathematical expressions to detect academic plagiarism. We make the following contributions. First, we investigate confirmed cases of plagiarism to categorize the similarities of mathematical content commonly found in plagiarized publications. From this investigation, we derive possible feature selection and feature comparison strategies for developing math-based detection approaches and a ground truth for our experiments. Second, we create a test collection by embedding confirmed cases of plagiarism into the NTCIR-11 MathIR Task dataset, which contains approx. 60 million mathematical expressions in 105,120 documents from arXiv.org. Third, we develop a first math-based detection approach by implementing and evaluating different feature comparison approaches using an open source parallel data processing pipeline built using the Apache Flink framework. The best performing approach identifies all but two of our real-world test cases at the top rank and achieves a mean reciprocal rank of 0.86. The results show that mathematical expressions are promising text-independent features to identify academic plagiarism in large collections. To facilitate future research on math-based plagiarism detection, we make our source code and data available.
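
One feature-comparison strategy such a system can use is a simple histogram of the identifiers occurring in each document's formulas. The sketch below extracts identifiers naively with a regex from LaTeX strings (the actual study parses real LaTeX/MathML from arXiv) and scores their overlap:

```python
import re
from collections import Counter

def identifier_histogram(formulas):
    """Count alphabetic tokens across a document's formula strings."""
    return Counter(tok for f in formulas
                   for tok in re.findall(r"[a-zA-Z]\w*", f))

def histogram_overlap(h1, h2):
    shared = sum((h1 & h2).values())     # multiset intersection size
    return shared / max(sum(h1.values()), sum(h2.values()), 1)

doc_a = [r"E = m c^2", r"\sum_i x_i^2"]
doc_b = [r"E = m c^2", r"\prod_j y_j"]
print(histogram_overlap(identifier_histogram(doc_a),
                        identifier_histogram(doc_b)))
```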

Journal ArticleDOI
TL;DR: In this article, a joint word-embedding model for long documents in the academic domain is proposed to improve the semantic representation quality of word vectors by incorporating a domain-specific semantic relation constraint into the traditional context constraint.

Journal Article
TL;DR: Plagiarism can be managed by a balance among its prevention, detection by plagiarism detection software, and institutional sanctions against proven plagiarists.
Abstract: There is a staggering upsurge in the incidence of plagiarism in the scientific literature. The literature shows divergent views about the factors that make plagiarism reprehensible. This review explores the causes of and remedies for the perennial academic problem of plagiarism. Data sources were searched for full-text English-language articles published from 2000 to 2015. Data selection was done using Medical Subject Headings (MeSH) terms: plagiarism, unethical writing, academic theft, retraction, medical field, and plagiarism detection software. Data extraction was undertaken by selecting titles from retrieved references, and data synthesis identified key factors leading to plagiarism, such as unawareness of research ethics, poor writing skills, and the pressure of the publish-or-perish mantra. Plagiarism can be managed by a balance among its prevention, detection by plagiarism detection software, and institutional sanctions against proven plagiarists. Educating researchers about the ethical principles of academic writing, together with institutional support in training writers in academic integrity and ethical publication, can curtail plagiarism.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: The similarity measures show how simple changes in text, such as changing one word or changing the position of verbs and nouns, yield similarity values of 99%, making it possible to detect plagiarism even if the text is altered by replacing words with their synonyms or changing the word order.
Abstract: Plagiarism detection is very important, especially for academicians, researchers and students. Although there are many plagiarism detection tools, the task remains challenging because of the huge number of online documents. In this research, we propose to use the word2vec model to detect the semantic similarity between words in Arabic, which can help in detecting plagiarism. Word2vec is a deep learning technique that is used to represent words as feature vectors with high precision. The quality of the vector representation depends on the quality of the corpus used in the training phase. In this paper, we used the OSAC corpus for training the word2vec model. Moreover, the cosine similarity measure is used to compute the similarity between word vectors. The similarity measures show how simple changes in text, such as changing one word or changing the position of verbs and nouns, yield similarity values of 99%, which makes it possible to detect plagiarism even if the text is altered by replacing words with their synonyms or changing the word order.
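
A hedged sketch of the pipeline's shape using gensim's Word2Vec: train on a corpus (OSAC in the paper; the toy English sentences below are placeholders) and score word pairs by cosine similarity of their vectors. All hyperparameters are illustrative:

```python
from gensim.models import Word2Vec

# Stand-in corpus; the paper trains on the (Arabic) OSAC corpus instead.
sentences = [["student", "copied", "the", "report"],
             ["pupil", "duplicated", "the", "report"],
             ["student", "wrote", "the", "essay"]]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=100)

# Cosine similarity between the learned word vectors.
print(model.wv.similarity("copied", "duplicated"))
```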

Proceedings ArticleDOI
07 Aug 2017
TL;DR: This paper attempts to replicate and reproduce the results of Severyn and Moschitti using their open-source code as well as to reproduce their results via a de novo implementation using a completely different deep learning toolkit.
Abstract: In recent years, neural networks have been applied to many text processing problems. One example is learning a similarity function between pairs of text, which has applications to paraphrase extraction, plagiarism detection, question answering, and ad hoc retrieval. Within the information retrieval community, the convolutional neural network model proposed by Severyn and Moschitti in a SIGIR 2015 paper has gained prominence. This paper focuses on the problem of answer selection for question answering: we attempt to replicate the results of Severyn and Moschitti using their open-source code as well as to reproduce their results via a de novo (i.e., from scratch) implementation using a completely different deep learning toolkit. Our de novo implementation is instructive in ascertaining whether reported results generalize across toolkits, each of which has its own idiosyncrasies. We were able to successfully replicate and reproduce the reported results of Severyn and Moschitti, albeit with minor differences in effectiveness, affirming the overall design of their model. Additional ablation experiments break down the components of the model to show their contributions to overall effectiveness. Interestingly, we find that removing one component actually increases effectiveness and that a simplified model with only four word overlap features performs surprisingly well, even better than convolution feature maps alone.

Journal ArticleDOI
TL;DR: A novel approach for plagiarism detection without reference collections is presented, which aims to model an author's “style” by revealing a set of characteristic authorship features, integrating deep latent semantic and stylometric analyses.

Journal ArticleDOI
TL;DR: The Rabin-Karp algorithm is much more effective and faster at detecting plagiarism in documents larger than 1000 KB, while the Jaro-Winkler Distance algorithm has advantages in terms of time.
Abstract: Plagiarism is an act that universities consider fraudulent: taking someone's ideas or writings without mentioning the references and claiming them as one's own. Plagiarism detection systems generally implement string matching algorithms to search for common words between text documents. There are several algorithms used for string matching; two of them are the Rabin-Karp and Jaro-Winkler Distance algorithms. The Rabin-Karp algorithm is well suited to the problem of matching multiple string patterns, while the Jaro-Winkler Distance algorithm has advantages in terms of time. A plagiarism detection application was developed and tested on different types of documents, i.e. doc, docx, pdf and txt. From the experimental results, we found that both algorithms can be used to perform plagiarism detection on those documents, but in terms of effectiveness, the Rabin-Karp algorithm is much more effective and faster at detecting documents larger than 1000 KB.
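
For concreteness, a minimal Rabin-Karp sketch: a rolling hash lets the scan advance one character at a time at constant cost per step, which is why it stays fast on large documents and extends naturally to many fingerprinted patterns at once:

```python
def rabin_karp(text: str, pattern: str, base: int = 256, mod: int = 1_000_003):
    """Return start indices of pattern in text via a rolling hash."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)             # weight of the outgoing character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        if p_hash == t_hash and text[i:i + m] == pattern:   # verify on hash hit
            hits.append(i)
        if i < n - m:                        # roll the window one character
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return hits

print(rabin_karp("plagiarism detection detects plagiarism", "plagiarism"))  # [0, 29]
```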

Proceedings ArticleDOI
15 Dec 2017
TL;DR: This work presents Semantic Concept Pattern Analysis - an approach that performs an integrated analysis of semantic text relatedness and structural text similarity and demonstrates that this approach can detect plagiarism that established text matching approaches would not identify.
Abstract: Detecting academic plagiarism is a pressing problem, e.g., for educational and research institutions, funding agencies, and academic publishers. Existing plagiarism detection systems reliably identify copied text, or near copies of text, but often fail to detect disguised forms of academic plagiarism, such as paraphrases, translations, and idea plagiarism. We present Semantic Concept Pattern Analysis - an approach that performs an integrated analysis of semantic text relatedness and structural text similarity. Using 25 officially retracted academic plagiarism cases, we demonstrate that our approach can detect plagiarism that established text matching approaches would not identify. We view our approach as a promising addition to improve the detection capabilities for strong paraphrases. We plan to further improve Semantic Concept Pattern Analysis and include the approach as part of an integrated detection process that analyzes heterogeneous similarity features to better identify the many possible forms of plagiarism in academic documents.

Journal ArticleDOI
TL;DR: A source code plagiarism detection method relying on low-level representation is proposed, which is more effective and efficient than the standard lexical-token approach.
Abstract: Even though there are various source code plagiarism detection approaches, only a few works focus on low-level representation for deducing similarity; most rely on lexical token sequences extracted from source code. In our view, low-level representation is more beneficial than lexical tokens since its form is more compact than the source code itself: it considers only semantic-preserving instructions and ignores many source code delimiter tokens. This paper proposes a source code plagiarism detection approach that relies on low-level representation. As a case study, we focus on the .NET programming languages, with the Common Intermediate Language as the low-level representation. In addition, we incorporate Adaptive Local Alignment for detecting similarity; according to Lim et al., this algorithm outperforms the state-of-the-art code similarity algorithm (i.e., Greedy String Tiling) in terms of effectiveness. In our evaluation, which involves various plagiarism attacks, our approach is more effective and efficient than the standard lexical-token approach.
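
The paper's Adaptive Local Alignment is a tuned variant of classic local alignment; the sketch below shows the plain Smith-Waterman form of the idea over token sequences, with Common Intermediate Language opcodes as toy input (the scoring constants are illustrative assumptions):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between two token sequences."""
    m, n = len(a), len(b)
    h = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best

tokens_a = ["ldarg.0", "ldarg.1", "add", "ret"]
tokens_b = ["ldarg.0", "ldarg.1", "add", "stloc.0", "ldloc.0", "ret"]
print(smith_waterman(tokens_a, tokens_b))   # high score despite inserted ops
```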

Journal ArticleDOI
TL;DR: The proposed plagiarism detection approach is a hybrid of semantic and syntactic similarity between text documents; it exploits linguistic information sources non-linearly, using a lexical database to find the relatedness between text documents.
Abstract: Plagiarism takes place when we use a person's work without giving due acknowledgment. Text similarity is involved in several fields, such as web document retrieval, information mining, and searching for related articles. Several approaches have been introduced for detecting plagiarism in text documents based on the syntactic structure of the text, string similarity, fingerprinting, the semantic meaning underlying the text, etc. The basic limitation of today's plagiarism detection systems is that they fail to detect tough cases of plagiarism. The proposed plagiarism detection approach is a hybrid of semantic and syntactic similarity between text documents. This novel approach exploits linguistic information sources non-linearly, using a lexical database to find the relatedness between text documents, and uses semantic knowledge to perform cognitive-inspired computing. The framework is capable of detecting intelligent plagiarism cases such as verbatim copying, paraphrasing, rewording within a sentence, and sentence transformation. The approach has been evaluated on the standard PAN-PC-11 dataset. The experiments show that our technique outperforms other strong baseline techniques in terms of precision, recall, F-measure, and plagiarism detection (PlagDet) score.
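
A hedged sketch of the semantic half of such a hybrid, using WordNet through NLTK as the lexical database: word-to-word relatedness from synset path similarity, aggregated greedily over two sentences. The syntactic half and the paper's non-linear combination are not reproduced here:

```python
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

def word_relatedness(w1: str, w2: str) -> float:
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    return max(scores, default=1.0 if w1 == w2 else 0.0)

def sentence_relatedness(tokens_a, tokens_b) -> float:
    """Average, over tokens_a, of each token's best match in tokens_b."""
    return sum(max(word_relatedness(a, b) for b in tokens_b)
               for a in tokens_a) / len(tokens_a)

# A paraphrase pair scores high despite sharing no surface words.
print(sentence_relatedness(["car", "speeds"], ["automobile", "races"]))
```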

Journal ArticleDOI
TL;DR: An approach that utilizes minimal and effective syntax-based linguistic features, extracted using shallow natural language processing techniques, for plagiarism classification; it improves the effectiveness of classification by selecting an appropriate set of features as the input to machine learning based classifiers.
Abstract: Highlights: an approach that utilizes minimal and effective syntax-based linguistic features for plagiarism classification, extracted using shallow natural language processing techniques; a two-phase feature selection approach that identifies the minimal and best features for plagiarism classification; and a detailed analysis of the impact and dependencies of plagiarism types and complexities on the extracted features. The proposed work models document-level text plagiarism detection as a binary classification problem, where the task is to distinguish a given suspicious-source document pair as plagiarized or non-plagiarized. The objective is to explore the potency of syntax-based linguistic features extracted using shallow natural language processing techniques for the plagiarism classification task. Shallow syntactic features, viz., part-of-speech tags and chunks, are utilized after effective pre-processing and filtration to prune the irrelevant information. The work further proposes modelling this classification phase as an intermediate stage, placed after candidate source retrieval and before exhaustive passage-level detection. A two-phase feature selection approach is proposed, which improves the effectiveness of classification by selecting an appropriate set of features as the input to machine learning based classifiers. The proposed approach is evaluated on smaller and larger test conditions using the corpus of plagiarized short answers (PSA) and plagiarism instances collected from the PAN corpus, respectively. Under both test conditions, performance is evaluated using general as well as advanced classification metrics. Another main contribution of the current work is the analysis of the dependencies and impact of the extracted features upon the type and complexity of plagiarism imposed in the documents. The proposed results are compared with two state-of-the-art approaches and outperform the baseline approaches significantly. This in turn reflects the cogency of syntactic linguistic features in document-level plagiarism classification, especially for instances close to manual or real plagiarism scenarios.

Journal ArticleDOI
TL;DR: An External Plagiarism Detection System (EPDS), which employs a combination of the Semantic Role Labeling (SRL) technique, semantic and syntactic information, and the content word expansion approach to detect different types of plagiarism.
Abstract: Plagiarism is the unauthorized use of someone else's ideas, or the presentation of someone else's words or work as one's own. This paper presents an External Plagiarism Detection System (EPDS), which employs a combination of the Semantic Role Labeling (SRL) technique and semantic and syntactic information. Most of the available methods fail to capture the meaning in the comparison between a source document sentence and a suspicious document sentence when the two sentences share the same surface text, which leads to incorrect or even unnecessary matching results. The proposed method, however, is able to avoid selecting a source text sentence whose surface similarity with the suspicious text sentence is high but whose meaning is different. On the other hand, an author may change a sentence from active to passive voice and vice versa; hence, the method also employs the SRL technique to tackle this challenge. Furthermore, the method uses the content word expansion approach to bridge lexical gaps and identify similar ideas that are expressed using different wording. The proposed method is able to detect different types of plagiarism, such as exact verbatim copying, paraphrasing, transformation of sentences, and changes of word structure. The experimental results show that the proposed method improves performance compared with the systems participating in PAN-PC-11 and other existing techniques.

Journal ArticleDOI
TL;DR: Methods for plagiarism detection that aim to identify potential sources of plagiarism from MEDLINE are investigated, particularly when the original text has been modified through the replacement of words or phrases.
Abstract: The identification of duplicated and plagiarized passages of text has become an increasingly active area of research. In this paper, we investigate methods for plagiarism detection that aim to identify potential sources of plagiarism from MEDLINE, particularly when the original text has been modified through the replacement of words or phrases. A scalable approach based on Information Retrieval is used to perform candidate document selection—the identification of a subset of potential source documents given a suspicious text—from MEDLINE. Query expansion is performed using the UMLS Metathesaurus to deal with situations in which original documents are obfuscated. Various approaches to Word Sense Disambiguation are investigated to deal with cases where there are multiple Concept Unique Identifiers (CUIs) for a given term. Results using the proposed IR-based approach outperform a state-of-the-art baseline based on Kullback-Leibler Distance.

Proceedings ArticleDOI
01 Jul 2017
TL;DR: This paper focuses on identifying whether a dataset consisting of student programming assignments is rich enough to apply coding style metrics to detect similarities between code sequences, using the BlackBox dataset as a case study.
Abstract: Plagiarism has become an increasing problem in higher education in recent years. Coding style covers how code is written and structured, choices that do not affect the logic of a program, and can therefore be used to differentiate between code authors and to detect source code plagiarism. This paper focuses on identifying whether a dataset consisting of student programming assignments is rich enough to apply coding style metrics to detect similarities between code sequences; we use the BlackBox dataset as a case study.

Journal ArticleDOI
TL;DR: This study investigated the consequences of the use of text-matching software on teachers' and students' conceptions of plagiarism and problems in academic writing, and found that teachers are inclined to think of plagiarism as part of a learning process rather than as an issue of morality, which may have consequences for how they understand the role of text matching.
Abstract: The aim of this study was to investigate the consequences of the use of text-matching software on teachers' and students' conceptions of plagiarism and problems in academic writing. An electronic questionnaire included scale items, structured questions, and open-ended questions. The respondents were 85 teachers and 506 students at a large Finnish university. Methods of analysis included exploratory factor analysis, t-tests, and inductive content analysis. Both teachers and students reported increased awareness of plagiarism and improvements in writing habits, as well as concerns and limitations related to the system. The results suggest that teachers are inclined to think of plagiarism as part of a learning process rather than as an issue of morality, which may have consequences for how they understand the role of text matching. The introduction of text-matching software has supported teachers' work, but at the same time teachers emphasized their own responsibility in detecting problems in student writing. The survey provides a limited sample of “Case Finland,” where the nationwide implementation of text-matching software has been remarkably rapid; it offers a glimpse into one institution's implementation of a newly introduced policy for mandatory plagiarism detection.

Journal ArticleDOI
01 Sep 2017
TL;DR: The experimental results show that the proposed plagiarism detection method, based on the local maximal value of the length of the longest common subsequence with weights defined by a distributed representation, is useful in applications that need strict detection of complex plagiarism.
Abstract: Accurate methods are required for plagiarism detection in documents. Generally, plagiarism detection is implemented on the basis of similarity between documents. This paper evaluates the validity of using distributed representations of words to define a document similarity. It proposes a plagiarism detection method based on the local maximal value of the length of the longest common subsequence (LCS), with weights defined by a distributed representation. The proposed method and two other straightforward methods, based on the simple length of the LCS and the local maximal value of the LCS with no weights, are applied to the dataset of a plagiarism detection competition. The experimental results show that the proposed method is useful in applications that need strict detection of complex plagiarism.
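
A sketch of the weighting idea: an LCS whose match contribution is the word-vector similarity itself rather than a hard 0/1, so near-synonyms can extend the common subsequence. The random toy vectors and the 0.2 threshold are illustrative stand-ins for a trained distributed representation:

```python
import numpy as np

def weighted_lcs(doc_a, doc_b, vec, threshold=0.2):
    """LCS over words where a match adds its embedding cosine similarity."""
    def sim(u, w):
        x, y = vec[u], vec[w]
        return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
    m, n = len(doc_a), len(doc_b)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = sim(doc_a[i - 1], doc_b[j - 1])
            if s >= threshold:               # soft match, weighted by similarity
                dp[i][j] = dp[i - 1][j - 1] + s
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

rng = np.random.default_rng(0)
words = ["the", "cat", "sat", "feline", "rested"]
vec = {w: rng.standard_normal(8) for w in words}   # toy embeddings
print(weighted_lcs(["the", "cat", "sat"], ["the", "feline", "rested"], vec))
```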