Showing papers on "Plagiarism detection published in 2017"


Proceedings ArticleDOI
30 Oct 2017
TL;DR: This work proposes a novel neural network-based approach that computes an embedding, i.e., a numeric vector, from the control flow graph of each binary function, and shows that the resulting prototype, Gemini, outperforms state-of-the-art approaches by large margins in similarity detection accuracy.
Abstract: The problem of cross-platform binary code similarity detection aims at detecting whether two binary functions coming from different platforms are similar or not. It has many security applications, including plagiarism detection, malware detection, vulnerability search, etc. Existing approaches rely on approximate graph-matching algorithms, which are inevitably slow, sometimes inaccurate, and hard to adapt to a new task. To address these issues, we propose a novel neural network-based approach to compute an embedding, i.e., a numeric vector, based on the control flow graph of each binary function, so that similarity detection can be done efficiently by measuring the distance between the embeddings of two functions. We implement a prototype called Gemini. Our extensive evaluation shows that Gemini outperforms the state-of-the-art approaches by large margins with respect to similarity detection accuracy. Further, Gemini can speed up prior art's embedding generation time by 3 to 4 orders of magnitude and reduce the required training time from more than 1 week down to 30 minutes to 10 hours. Our real-world case studies demonstrate that Gemini can identify significantly more vulnerable firmware images than the state-of-the-art, i.e., Genius. Our research showcases a successful application of deep learning to computer security problems.
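
The pipeline described above reduces to two steps: map each function's control flow graph (CFG) to a vector, then compare vectors by a distance measure. The sketch below is a toy stand-in for that idea, with untrained random propagation weights and made-up block features; it is not Gemini's actual trained network:

```python
import numpy as np

def embed_cfg(node_features, adjacency, rounds=2, seed=0):
    """Map a CFG to a single vector: node_features is (n, d), adjacency is (n, n)."""
    d = node_features.shape[1]
    w = np.random.default_rng(seed).standard_normal((d, d)) / np.sqrt(d)
    h = node_features.astype(float)
    for _ in range(rounds):                  # propagate features along CFG edges
        h = np.tanh(node_features + adjacency @ h @ w)
    return h.sum(axis=0)                     # aggregate nodes into a graph embedding

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two toy 3-block CFGs with 4 made-up block-level features each.
feats = np.array([[1.0, 0, 2, 1], [0, 1, 1, 0], [2, 0, 0, 1]])
adj = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
print(cosine(embed_cfg(feats, adj), embed_cfg(feats * 1.1, adj)))  # near 1.0
```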

339 citations


Journal ArticleDOI
TL;DR: The experimental results show that the proposed binary-oriented, obfuscation-resilient binary code similarity comparison method can be applied to software plagiarism and algorithm detection, and is effective and practical for analyzing real-world software.
Abstract: Existing code similarity comparison methods, whether source- or binary-code based, are mostly not resilient to obfuscations. Identifying similar or identical code fragments among programs is very important in some applications. For example, one application is to detect illegal code reuse. In code theft cases, emerging obfuscation techniques have made automated detection increasingly difficult. Another application is to identify cryptographic algorithms, which are widely employed by modern malware to circumvent detection, hide network communications, and protect payloads, among other purposes. Due to diverse coding styles and high programming flexibility, different implementations of the same algorithm may appear very distinct, making automatic detection very hard even before code obfuscations are applied. In this paper, we propose a binary-oriented, obfuscation-resilient binary code similarity comparison method based on a new concept, the longest common subsequence of semantically equivalent basic blocks, which combines rigorous program semantics with longest-common-subsequence-based fuzzy matching. We model the semantics of a basic block by a set of symbolic formulas representing the input-output relations of the block. This way, the semantic equivalence (and similarity) of two blocks can be checked by a theorem prover. We then model the semantic similarity of two paths using the longest common subsequence with basic blocks as elements. This novel combination has resulted in strong resiliency to code obfuscation. We have developed a prototype. The experimental results show that our method can be applied to software plagiarism and algorithm detection, and is effective and practical for analyzing real-world software.
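
A minimal sketch of the central construct: a longest common subsequence over basic blocks where "equal" is a pluggable semantic-equivalence predicate. In the paper that predicate is a theorem-prover check on symbolic input-output formulas; the token-equality stand-in below and the toy block lists are illustrative assumptions only:

```python
def lcs_length(path_a, path_b, equivalent):
    """Dynamic-programming LCS where element equality is a caller-supplied test."""
    m, n = len(path_a), len(path_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if equivalent(path_a[i - 1], path_b[j - 1]):
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

# Toy stand-in: blocks as opcode strings, equivalence as plain equality.
blocks_a = ["mov", "add", "jmp"]
blocks_b = ["mov", "sub", "add", "jmp"]
score = lcs_length(blocks_a, blocks_b, equivalent=lambda x, y: x == y)
print(score / max(len(blocks_a), len(blocks_b)))  # normalized path similarity
```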

87 citations


Book ChapterDOI
11 Sep 2017
TL;DR: This work states that the introduction of features derived from the Abstract Syntax Tree of source code has recently set new benchmarks in this area, significantly improving over previous work that relied on easily obfuscatable lexical and format features of program source code.
Abstract: Machine learning approaches to source code authorship attribution attempt to find statistical regularities in human-generated source code that can identify the author or authors of that code. This has applications in plagiarism detection, intellectual property infringement, and post-incident forensics in computer security. The introduction of features derived from the Abstract Syntax Tree (AST) of source code has recently set new benchmarks in this area, significantly improving over previous work that relied on easily obfuscatable lexical and format features of program source code. However, these AST-based approaches rely on hand-constructed features derived from such trees, and often include ancillary information such as function and variable names that may be obfuscated or manipulated.
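
The chapter's AST features target source code in general; Python's own ast module makes the idea concrete. A hedged sketch: represent a program by n-grams of syntax-tree node types, which survive renaming of identifiers (unlike the lexical features the chapter criticizes). The bigram choice and the toy snippets are illustrative assumptions:

```python
import ast
from collections import Counter

def ast_node_types(source: str) -> list[str]:
    """Node-type sequence from a breadth-first walk of the syntax tree."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]

def ast_bigram_features(source: str) -> Counter:
    kinds = ast_node_types(source)
    return Counter(zip(kinds, kinds[1:]))   # node-type bigrams as features

# Identifier names differ, but the node-type profile is identical.
print(ast_bigram_features("def f(x):\n    return x + 1") ==
      ast_bigram_features("def total(n):\n    return n + 1"))  # True
```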

75 citations


Proceedings ArticleDOI
20 May 2017
TL;DR: The experimental results show that CACompare is not only effective in dealing with binaries of different architectures and variant compiling configurations, but also improves the accuracy of binary code clone detection compared to state-of-the-art solutions.
Abstract: Binary code clone detection (or similarity comparison) is a fundamental technique for many important applications, such as plagiarism detection, malware analysis, software vulnerability assessment and program comprehension. With the prevalence of smart and IoT (Internet of Things) devices, more and more programs are ported from traditional desktop platforms (e.g., IA-32) to ARM and MIPS architectures. It becomes imperative to detect cloned binary code across architectures. However, because of the incomparable instruction sets of different architectures, as well as alternative compiling configurations, it is difficult to conduct binary code clone detection with traditional syntax- or structure-based methods. To address this, we propose a semantics-based approach. We recognize arguments and indirect jump targets of each binary function, and emulate executions of those functions, extracting semantic signatures to measure the similarity of functions. The approach has been implemented in a prototype system named CACompare to detect cloned binary functions across architectures and compiling configurations. It supports comparisons between mainstream architectures (IA-32, ARM and MIPS) and is able to analyse binaries on the Linux platform. The experimental results show that CACompare is not only effective in dealing with binaries of different architectures and variant compiling configurations, but also improves the accuracy of binary code clone detection compared to state-of-the-art solutions.
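
The approach emulates binary functions and compares the semantic signatures their executions produce. As a hedged stand-in (no binaries or emulator here), the sketch below treats two Python callables as the functions under test, probes them on the same random arguments, and uses output agreement as a crude input-output signature:

```python
import random

def io_signature(fn, probe_inputs):
    """Record fn's outputs (or failure) on a fixed list of probe inputs."""
    out = []
    for args in probe_inputs:
        try:
            out.append(fn(*args))
        except Exception:
            out.append(None)
    return out

random.seed(1)
probes = [(random.randint(-99, 99), random.randint(-99, 99)) for _ in range(64)]

f = lambda a, b: (a + b) * 2
g = lambda a, b: 2 * b + 2 * a        # different syntax, same semantics

sig_f, sig_g = io_signature(f, probes), io_signature(g, probes)
match = sum(x == y for x, y in zip(sig_f, sig_g)) / len(probes)
print(match)   # 1.0 -> behaviourally identical on these probes
```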

72 citations


Proceedings ArticleDOI
03 Apr 2017
TL;DR: An innovative word embedding-based system devoted to calculating the semantic similarity of Arabic sentences by exploiting vectors as word representations in a multidimensional space in order to capture the semantic and syntactic properties of words.
Abstract: Semantic textual similarity is the basis of countless applications and plays an important role in diverse areas, such as information retrieval, plagiarism detection, information extraction and machine translation. This article proposes an innovative word embedding-based system devoted to calculating the semantic similarity of Arabic sentences. The main idea is to exploit vectors as word representations in a multidimensional space in order to capture the semantic and syntactic properties of words. IDF weighting and Part-of-Speech tagging are applied to the examined sentences to support the identification of words that are highly descriptive in each sentence. The performance of our proposed system is confirmed through the Pearson correlation between our assigned semantic similarity scores and human judgments.
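
A hedged sketch of the system's overall shape: sentence vectors as IDF-weighted averages of word vectors, compared by cosine. The tiny random vectors and the IDF table are placeholders, not the paper's trained Arabic resources, and the POS-tagging step is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["student", "copied", "pupil", "duplicated", "report"]
vectors = {w: rng.standard_normal(16) for w in vocab}          # toy embeddings
idf = {"report": 0.4, "student": 1.1, "pupil": 1.1,
       "copied": 2.0, "duplicated": 2.0}                        # made-up IDF

def sentence_vec(tokens):
    """IDF-weighted average of the word vectors present in the sentence."""
    weighted = [idf.get(t, 1.0) * vectors[t] for t in tokens if t in vectors]
    return np.mean(weighted, axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(sentence_vec(["student", "copied", "report"]),
             sentence_vec(["pupil", "duplicated", "report"])))
```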

55 citations


Journal ArticleDOI
TL;DR: The future belongs to algorithms that can handle large amounts of source code and that use model-based representations, which can serve as the foundation of large-scale anti-plagiarism systems.

53 citations


Proceedings ArticleDOI
03 Apr 2017
TL;DR: This paper introduces new cross-language similarity detection methods based on distributed representation of words and combines the different methods proposed to verify their complementarity, obtaining an overall F1 score of 89.15% for English-French similarity detection at chunk level.
Abstract: This paper proposes to use distributed representation of words (word embeddings) in cross-language textual similarity detection. The main contributions of this paper are the following: (a) we introduce new cross-language similarity detection methods based on distributed representation of words; (b) we combine the different methods proposed to verify their complementarity and finally obtain an overall F1 score of 89.15% for English-French similarity detection at chunk level (88.5% at sentence level) on a very challenging corpus.

47 citations


Book ChapterDOI
11 Sep 2017
TL;DR: This work presents new code features that capture programming style at the basic block level, an approach for identifying external template library code, and a new approach to capture correlations between the authors of basic blocks in a binary.
Abstract: Knowing the authors of a binary program has significant application to forensics of malicious software (malware), software supply chain risk management, and software plagiarism detection. Existing techniques assume that a binary is written by a single author, which does not hold true in the real world because most modern software, including malware, often contains code from multiple authors. In this paper, we make the first step toward identifying multiple authors in a binary. We present new fine-grained techniques to address the tougher problem of determining the author of each basic block. The decision to attribute authors at the basic block level is based on an empirical study of three large open source software projects, in which we find that a large fraction of basic blocks can be well attributed to a single author. We present new code features that capture programming style at the basic block level, our approach for identifying external template library code, and a new approach to capture correlations between the authors of basic blocks in a binary. Our experiments show strong evidence that programming styles can be recovered at the basic block level and that it is practical to identify multiple authors in a binary.

45 citations


Journal ArticleDOI
TL;DR: The reported work explores syntax-semantic concept extractions with a genetic algorithm to detect cases of idea plagiarism, where the source ideas are plagiarized and represented in a summarized form.
Abstract: Plagiarism is increasingly becoming a major issue in the academic and educational domains. Automated and effective plagiarism detection systems are direly required to curtail this information breach, especially in tackling idea plagiarism. The proposed approach aims to detect such plagiarism cases, where the idea of a third party is adopted and presented intelligently so that, at the surface level, the plagiarism cannot be unmasked. The reported work explores syntax-semantic concept extractions with a genetic algorithm to detect cases of idea plagiarism. The work mainly focuses on idea plagiarism where the source ideas are plagiarized and represented in a summarized form. Plagiarism detection is employed at both the document and passage levels by exploiting the document concepts at various structural levels. Initially, the idea embedded within the given source document is captured using sentence-level concept extraction with a genetic algorithm. Document-level detection is facilitated with word-level concepts, where syntactic information is extracted and the non-plagiarized documents are pruned. A combined similarity metric that utilizes the semantic-level concept extraction is then employed for passage-level detection. The proposed approach is tested on the PAN 2013-14 plagiarism corpus for summary obfuscation data, which represents a challenging case of idea plagiarism. The performance of the current approach and its variations is evaluated at both the document and passage levels, using information retrieval and PAN plagiarism measures respectively. The results are also compared against the six top-ranked plagiarism detection systems submitted as part of the PAN 2013-14 competition. The results obtained exhibit significant improvement over the compared systems and hence reflect the potency of the proposed syntax-semantic based concept extractions in detecting idea plagiarism.

43 citations


Journal ArticleDOI
TL;DR: A source code plagiarism detection method, named WASTK (Weighted Abstract Syntax Tree Kernel), for computer science education, which takes aspects other than the similarity between programs into account and performs much better than other popular methods.
Abstract: In this paper, we introduce a source code plagiarism detection method, named WASTK (Weighted Abstract Syntax Tree Kernel), for computer science education. Different from other plagiarism detection methods, WASTK takes aspects other than the similarity between programs into account. WASTK first transforms the source code of a program into an abstract syntax tree and then obtains the similarity by calculating the tree kernel of the two abstract syntax trees. To avoid misjudgment caused by trivial code snippets or frameworks given by instructors, an idea similar to TF-IDF (Term Frequency-Inverse Document Frequency) from the field of information retrieval is applied: each node in an abstract syntax tree is assigned a weight by TF-IDF. WASTK is evaluated on different datasets and, as a result, performs much better than other popular methods such as Sim and JPlag.
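
WASTK's full method is a tree kernel over TF-IDF-weighted ASTs; the lighter sketch below keeps only the weighting idea, assuming Python as a stand-in language: count AST node types per submission, weight them by IDF so that scaffolding common to all submissions counts less, and compare weighted counts by cosine:

```python
import ast
import math
from collections import Counter

def node_counts(src: str) -> Counter:
    return Counter(type(n).__name__ for n in ast.walk(ast.parse(src)))

def weighted_cosine(c1: Counter, c2: Counter, idf: dict) -> float:
    keys = set(c1) | set(c2)
    v1 = [c1[k] * idf.get(k, 1.0) for k in keys]
    v2 = [c2[k] * idf.get(k, 1.0) for k in keys]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    return dot / norm if norm else 0.0

submissions = ["def f(x): return x + 1",
               "def g(y): return y + 1",
               "print('unrelated')"]
counts = [node_counts(s) for s in submissions]
df = Counter(k for c in counts for k in c)               # document frequency
idf = {k: math.log(len(counts) / df[k]) + 1.0 for k in df}
print(weighted_cosine(counts[0], counts[1], idf))        # near-identical pair
```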

Journal ArticleDOI
TL;DR: This paper proposes an unsupervised and very resource-light approach for measuring semantic similarity between texts in different languages by projecting continuous word vectors (i.e., word embeddings) from one language to the vector space of the other language via a linear translation model.
Abstract: Recognizing semantically similar sentences or paragraphs across languages is beneficial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for predicting cross-lingual semantic similarity of short texts, however, make use of tools and resources (e.g., machine translation systems, syntactic parsers or named entity recognition) that for many languages (or language pairs) do not exist. In contrast, we propose an unsupervised and very resource-light approach for measuring semantic similarity between texts in different languages. To operate in the bilingual (or multilingual) space, we project continuous word vectors (i.e., word embeddings) from one language to the vector space of the other language via a linear translation model. We then align words according to the similarity of their vectors in the bilingual embedding space and investigate different unsupervised measures of semantic similarity exploiting bilingual embeddings and word alignments. Requiring only a limited-size set of word translation pairs between the languages, the proposed approach is applicable to virtually any pair of languages for which a sufficiently large corpus exists from which to learn monolingual word embeddings. Experimental results on three different datasets for measuring semantic textual similarity show that our simple resource-light approach reaches performance close to that of supervised and resource-intensive methods, displaying stability across different language pairs. Furthermore, we evaluate the proposed method on two extrinsic tasks, namely extraction of parallel sentences from comparable corpora and cross-lingual plagiarism detection, and show that it yields performance comparable to that of complex resource-intensive state-of-the-art models for the respective tasks.
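
The projection step has a compact closed form: given a seed dictionary of translation pairs, fit a linear map from source-language vectors to target-language vectors by least squares (the linear translation model the abstract references). Random matrices stand in for real monolingual embeddings in this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
X = rng.standard_normal((500, d))        # source-language vectors of seed pairs
W_true = rng.standard_normal((d, d))     # pretend "true" cross-lingual map
Y = X @ W_true                           # target-language vectors of seed pairs

W, *_ = np.linalg.lstsq(X, Y, rcond=None)    # fit the projection matrix

x_new = rng.standard_normal(d)           # an unseen source-language word vector
projected = x_new @ W                    # now lives in the target vector space
print(np.allclose(projected, x_new @ W_true, atol=1e-6))  # True on clean data
```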

Journal ArticleDOI
01 Jan 2017-PLOS ONE
TL;DR: This paper proposes a new method to identify the programmer of Java source code samples with higher accuracy; a comparison with previous work on authorship attribution of Java source code shows that the proposed method outperforms others overall, with an acceptable overhead.
Abstract: Authorship attribution is the task of identifying the most likely author of a given sample among a set of known candidate authors. It can not only be applied to discover the original author of plain text, such as novels, blogs, emails, posts, etc., but can also be used to identify the programmer of source code. Authorship attribution of source code is required in diverse applications, ranging from malicious code tracking to resolving authorship disputes and software plagiarism detection. This paper proposes a new method to identify the programmer of Java source code samples with higher accuracy. To this end, it first introduces a back propagation (BP) neural network based on particle swarm optimization (PSO) into authorship attribution of source code. It begins by computing a set of defined feature metrics, including lexical and layout metrics and structure and syntax metrics, 19 dimensions in total. These metrics are then input to the neural network for supervised learning, the weights of which are produced by the hybrid PSO-BP algorithm. The effectiveness of the proposed method is evaluated on a collected dataset of 3,022 Java files belonging to 40 authors. Experimental results show that the proposed method achieves 91.060% accuracy, and a comparison with previous work on authorship attribution of Java source code illustrates that the proposed method outperforms others overall, also with an acceptable overhead.

Proceedings ArticleDOI
06 Nov 2017
TL;DR: Mathematical expressions are shown to be promising text-independent features for identifying academic plagiarism in large collections; the detection approach is implemented as an open-source parallel data processing pipeline built using the Apache Flink framework.
Abstract: This paper presents, to our knowledge, the first study on analyzing mathematical expressions to detect academic plagiarism. We make the following contributions. First, we investigate confirmed cases of plagiarism to categorize the similarities of mathematical content commonly found in plagiarized publications. From this investigation, we derive possible feature selection and feature comparison strategies for developing math-based detection approaches and a ground truth for our experiments. Second, we create a test collection by embedding confirmed cases of plagiarism into the NTCIR-11 MathIR Task dataset, which contains approx. 60 million mathematical expressions in 105,120 documents from arXiv.org. Third, we develop a first math-based detection approach by implementing and evaluating different feature comparison approaches using an open source parallel data processing pipeline built using the Apache Flink framework. The best performing approach identifies all but two of our real-world test cases at the top rank and achieves a mean reciprocal rank of 0.86. The results show that mathematical expressions are promising text-independent features to identify academic plagiarism in large collections. To facilitate future research on math-based plagiarism detection, we make our source code and data available.
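
One feature-comparison strategy such a system can use is a simple histogram of the identifiers occurring in each document's formulas. The sketch below extracts identifiers naively with a regex from LaTeX strings (the actual study parses real LaTeX/MathML from arXiv) and scores their overlap:

```python
import re
from collections import Counter

def identifier_histogram(formulas):
    """Count alphabetic tokens across a document's formula strings."""
    return Counter(tok for f in formulas
                   for tok in re.findall(r"[a-zA-Z]\w*", f))

def histogram_overlap(h1, h2):
    shared = sum((h1 & h2).values())     # multiset intersection size
    return shared / max(sum(h1.values()), sum(h2.values()), 1)

doc_a = [r"E = m c^2", r"\sum_i x_i^2"]
doc_b = [r"E = m c^2", r"\prod_j y_j"]
print(histogram_overlap(identifier_histogram(doc_a),
                        identifier_histogram(doc_b)))
```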

Journal ArticleDOI
TL;DR: In this article, a joint word-embedding model for long documents in the academic domain is proposed to improve the semantic representation quality of word vectors by incorporating a domain-specific semantic relation constraint into the traditional context constraint.

Journal Article
TL;DR: Plagiarism can be managed by a balance among its prevention, detection by plagiarism detection software, and institutional sanctions against proven plagiarists.
Abstract: There is a staggering upsurge in the incidence of plagiarism in the scientific literature. The literature shows divergent views about the factors that make plagiarism reprehensible. This review explores the causes of and remedies for the perennial academic problem of plagiarism. Data sources were searched for full-text English-language articles published from 2000 to 2015. Data selection was done using Medical Subject Headings (MeSH) terms: plagiarism, unethical writing, academic theft, retraction, medical field, and plagiarism detection software. Data extraction was undertaken by selecting titles from retrieved references, and data synthesis identified key factors leading to plagiarism, such as unawareness of research ethics, poor writing skills, and the pressure of the publish-or-perish mantra. Plagiarism can be managed by a balance among its prevention, detection by plagiarism detection software, and institutional sanctions against proven plagiarists. Educating researchers about the ethical principles of academic writing, together with institutional support in training writers in academic integrity and ethical publication, can curtail plagiarism.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: The similarity measures show how simple changes in text, such as changing one word or changing the position of verbs and nouns, yield similarity values of 99%, making it possible to detect plagiarism even if the text is altered by replacing words with their synonyms or changing the word order.
Abstract: Plagiarism detection is very important, especially for academicians, researchers and students. Although there are many plagiarism detection tools, the task remains challenging because of the huge number of online documents. In this research, we propose to use the word2vec model to detect the semantic similarity between words in Arabic, which can help in detecting plagiarism. Word2vec is a deep learning technique that is used to represent words as feature vectors with high precision. The quality of the vector representation depends on the quality of the corpus used in the training phase. In this paper, we used the OSAC corpus for training the word2vec model. Moreover, the cosine similarity measure is used to compute the similarity between word vectors. The similarity measures show how simple changes in text, such as changing one word or changing the position of verbs and nouns, yield similarity values of 99%, which makes it possible to detect plagiarism even if the text is altered by replacing words with their synonyms or changing the word order.
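
A hedged sketch of the pipeline's shape using gensim's Word2Vec: train on a corpus (OSAC in the paper; the toy English sentences below are placeholders) and score word pairs by cosine similarity of their vectors. All hyperparameters are illustrative:

```python
from gensim.models import Word2Vec

# Stand-in corpus; the paper trains on the (Arabic) OSAC corpus instead.
sentences = [["student", "copied", "the", "report"],
             ["pupil", "duplicated", "the", "report"],
             ["student", "wrote", "the", "essay"]]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=100)

# Cosine similarity between the learned word vectors.
print(model.wv.similarity("copied", "duplicated"))
```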

Proceedings ArticleDOI
07 Aug 2017
TL;DR: This paper attempts to replicate and reproduce the results of Severyn and Moschitti using their open-source code as well as to reproduce their results via a de novo implementation using a completely different deep learning toolkit.
Abstract: In recent years, neural networks have been applied to many text processing problems. One example is learning a similarity function between pairs of text, which has applications to paraphrase extraction, plagiarism detection, question answering, and ad hoc retrieval. Within the information retrieval community, the convolutional neural network model proposed by Severyn and Moschitti in a SIGIR 2015 paper has gained prominence. This paper focuses on the problem of answer selection for question answering: we attempt to replicate the results of Severyn and Moschitti using their open-source code as well as to reproduce their results via a de novo (i.e., from scratch) implementation using a completely different deep learning toolkit. Our de novo implementation is instructive in ascertaining whether reported results generalize across toolkits, each of which has its own idiosyncrasies. We were able to successfully replicate and reproduce the reported results of Severyn and Moschitti, albeit with minor differences in effectiveness, affirming the overall design of their model. Additional ablation experiments break down the components of the model to show their contributions to overall effectiveness. Interestingly, we find that removing one component actually increases effectiveness and that a simplified model with only four word overlap features performs surprisingly well, even better than convolution feature maps alone.

Journal ArticleDOI
TL;DR: A novel approach for plagiarism detection without reference collections is presented, which aims to model an author's “style” by revealing a set of characteristic authorship features, integrating deep latent semantic and stylometric analyses.

Journal ArticleDOI
TL;DR: The Rabin-Karp algorithm is much more effective and faster at detecting plagiarism in documents larger than 1000 KB, while the Jaro-Winkler Distance algorithm has advantages in terms of time.
Abstract: Plagiarism is an act that universities consider fraudulent: taking someone's ideas or writings without mentioning the references and claiming them as one's own. Plagiarism detection systems generally implement string matching algorithms to search for common words between text documents. There are several algorithms used for string matching; two of them are the Rabin-Karp and Jaro-Winkler Distance algorithms. The Rabin-Karp algorithm is well suited to the problem of matching multiple string patterns, while the Jaro-Winkler Distance algorithm has advantages in terms of time. A plagiarism detection application was developed and tested on different types of documents, i.e. doc, docx, pdf and txt. From the experimental results, we found that both algorithms can be used to perform plagiarism detection on those documents, but in terms of effectiveness, the Rabin-Karp algorithm is much more effective and faster at detecting documents larger than 1000 KB.
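
For concreteness, a minimal Rabin-Karp sketch: a rolling hash lets the scan advance one character at a time at constant cost per step, which is why it stays fast on large documents and extends naturally to many fingerprinted patterns at once:

```python
def rabin_karp(text: str, pattern: str, base: int = 256, mod: int = 1_000_003):
    """Return start indices of pattern in text via a rolling hash."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)             # weight of the outgoing character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        if p_hash == t_hash and text[i:i + m] == pattern:   # verify on hash hit
            hits.append(i)
        if i < n - m:                        # roll the window one character
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return hits

print(rabin_karp("plagiarism detection detects plagiarism", "plagiarism"))  # [0, 29]
```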

Proceedings ArticleDOI
15 Dec 2017
TL;DR: This work presents Semantic Concept Pattern Analysis - an approach that performs an integrated analysis of semantic text relatedness and structural text similarity and demonstrates that this approach can detect plagiarism that established text matching approaches would not identify.
Abstract: Detecting academic plagiarism is a pressing problem, e.g., for educational and research institutions, funding agencies, and academic publishers. Existing plagiarism detection systems reliably identify copied text, or near copies of text, but often fail to detect disguised forms of academic plagiarism, such as paraphrases, translations, and idea plagiarism. We present Semantic Concept Pattern Analysis - an approach that performs an integrated analysis of semantic text relatedness and structural text similarity. Using 25 officially retracted academic plagiarism cases, we demonstrate that our approach can detect plagiarism that established text matching approaches would not identify. We view our approach as a promising addition to improve the detection capabilities for strong paraphrases. We plan to further improve Semantic Concept Pattern Analysis and include the approach as part of an integrated detection process that analyzes heterogeneous similarity features to better identify the many possible forms of plagiarism in academic documents.

Journal ArticleDOI
TL;DR: A source code plagiarism detection method relying on low-level representation is proposed, which is more effective and efficient than the standard lexical-token approach.
Abstract: Even though there are various source code plagiarism detection approaches, only a few works focus on low-level representation for deducing similarity; most rely on lexical token sequences extracted from source code. In our view, low-level representation is more beneficial than lexical tokens since its form is more compact than the source code itself: it considers only semantic-preserving instructions and ignores many source code delimiter tokens. This paper proposes a source code plagiarism detection approach that relies on low-level representation. As a case study, we focus on the .NET programming languages, with the Common Intermediate Language as the low-level representation. In addition, we incorporate Adaptive Local Alignment for detecting similarity; according to Lim et al., this algorithm outperforms the state-of-the-art code similarity algorithm (i.e., Greedy String Tiling) in terms of effectiveness. In our evaluation, which involves various plagiarism attacks, our approach is more effective and efficient than the standard lexical-token approach.
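
The paper's Adaptive Local Alignment is a tuned variant of classic local alignment; the sketch below shows the plain Smith-Waterman form of the idea over token sequences, with Common Intermediate Language opcodes as toy input (the scoring constants are illustrative assumptions):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between two token sequences."""
    m, n = len(a), len(b)
    h = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best

tokens_a = ["ldarg.0", "ldarg.1", "add", "ret"]
tokens_b = ["ldarg.0", "ldarg.1", "add", "stloc.0", "ldloc.0", "ret"]
print(smith_waterman(tokens_a, tokens_b))   # high score despite inserted ops
```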

Journal ArticleDOI
TL;DR: The proposed plagiarism detection approach is a hybrid of semantic and syntactic similarity between text documents; it exploits linguistic information sources non-linearly, using a lexical database to find the relatedness between text documents.
Abstract: Plagiarism takes place when we use a person's work without giving due acknowledgment. Text similarity is involved in several fields, such as web document retrieval, information mining, and searching for related articles. Several approaches have been introduced for detecting plagiarism in text documents based on the syntactic structure of the text, string similarity, fingerprinting, the semantic meaning underlying the text, etc. The basic limitation of today's plagiarism detection systems is that they fail to detect tough cases of plagiarism. The proposed plagiarism detection approach is a hybrid of semantic and syntactic similarity between text documents. This novel approach exploits linguistic information sources non-linearly, using a lexical database to find the relatedness between text documents, and uses semantic knowledge to perform cognitive-inspired computing. The framework is capable of detecting intelligent plagiarism cases such as verbatim copying, paraphrasing, rewording within a sentence, and sentence transformation. The approach has been evaluated on the standard PAN-PC-11 dataset. The experiments show that our technique outperforms other strong baseline techniques in terms of precision, recall, F-measure, and plagiarism detection (PlagDet) score.
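
A hedged sketch of the semantic half of such a hybrid, using WordNet through NLTK as the lexical database: word-to-word relatedness from synset path similarity, aggregated greedily over two sentences. The syntactic half and the paper's non-linear combination are not reproduced here:

```python
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

def word_relatedness(w1: str, w2: str) -> float:
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    return max(scores, default=1.0 if w1 == w2 else 0.0)

def sentence_relatedness(tokens_a, tokens_b) -> float:
    """Average, over tokens_a, of each token's best match in tokens_b."""
    return sum(max(word_relatedness(a, b) for b in tokens_b)
               for a in tokens_a) / len(tokens_a)

# A paraphrase pair scores high despite sharing no surface words.
print(sentence_relatedness(["car", "speeds"], ["automobile", "races"]))
```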

Journal ArticleDOI
TL;DR: An approach that utilizes minimal and effective syntax-based linguistic features, extracted using shallow natural language processing techniques, for plagiarism classification; it improves the effectiveness of classification by selecting an appropriate set of features as the input to machine learning based classifiers.
Abstract: Highlights: an approach that utilizes minimal and effective syntax-based linguistic features for plagiarism classification, extracted using shallow natural language processing techniques; a two-phase feature selection approach that identifies the minimal and best features for plagiarism classification; and a detailed analysis of the impact and dependencies of plagiarism types and complexities on the extracted features. The proposed work models document-level text plagiarism detection as a binary classification problem, where the task is to distinguish a given suspicious-source document pair as plagiarized or non-plagiarized. The objective is to explore the potency of syntax-based linguistic features extracted using shallow natural language processing techniques for the plagiarism classification task. Shallow syntactic features, viz., part-of-speech tags and chunks, are utilized after effective pre-processing and filtration to prune the irrelevant information. The work further proposes modelling this classification phase as an intermediate stage, placed after candidate source retrieval and before exhaustive passage-level detection. A two-phase feature selection approach is proposed, which improves the effectiveness of classification by selecting an appropriate set of features as the input to machine learning based classifiers. The proposed approach is evaluated on smaller and larger test conditions using the corpus of plagiarized short answers (PSA) and plagiarism instances collected from the PAN corpus, respectively. Under both test conditions, performance is evaluated using general as well as advanced classification metrics. Another main contribution of the current work is the analysis of the dependencies and impact of the extracted features upon the type and complexity of plagiarism imposed in the documents. The proposed results are compared with two state-of-the-art approaches and outperform the baseline approaches significantly. This in turn reflects the cogency of syntactic linguistic features in document-level plagiarism classification, especially for instances close to manual or real plagiarism scenarios.

Journal ArticleDOI
TL;DR: An External Plagiarism Detection System (EPDS), which employs a combination of the Semantic Role Labeling (SRL) technique, semantic and syntactic information, and the content word expansion approach to detect different types of plagiarism.
Abstract: Plagiarism is the unauthorized use of someone else's ideas, or the presentation of someone else's words or work as one's own. This paper presents an External Plagiarism Detection System (EPDS), which employs a combination of the Semantic Role Labeling (SRL) technique and semantic and syntactic information. Most of the available methods fail to capture the meaning in the comparison between a source document sentence and a suspicious document sentence when the two sentences share the same surface text, which leads to incorrect or even unnecessary matching results. The proposed method, however, is able to avoid selecting a source text sentence whose surface similarity with the suspicious text sentence is high but whose meaning is different. On the other hand, an author may change a sentence from active to passive voice and vice versa; hence, the method also employs the SRL technique to tackle this challenge. Furthermore, the method uses the content word expansion approach to bridge lexical gaps and identify similar ideas that are expressed using different wording. The proposed method is able to detect different types of plagiarism, such as exact verbatim copying, paraphrasing, transformation of sentences, and changes of word structure. The experimental results show that the proposed method improves performance compared with the systems participating in PAN-PC-11 and other existing techniques.

Journal ArticleDOI
TL;DR: Methods for plagiarism detection that aim to identify potential sources of plagiarism from MEDLINE are investigated, particularly when the original text has been modified through the replacement of words or phrases.
Abstract: The identification of duplicated and plagiarized passages of text has become an increasingly active area of research. In this paper, we investigate methods for plagiarism detection that aim to identify potential sources of plagiarism from MEDLINE, particularly when the original text has been modified through the replacement of words or phrases. A scalable approach based on Information Retrieval is used to perform candidate document selection—the identification of a subset of potential source documents given a suspicious text—from MEDLINE. Query expansion is performed using the UMLS Metathesaurus to deal with situations in which original documents are obfuscated. Various approaches to Word Sense Disambiguation are investigated to deal with cases where there are multiple Concept Unique Identifiers (CUIs) for a given term. Results using the proposed IR-based approach outperform a state-of-the-art baseline based on Kullback-Leibler Distance.

Proceedings ArticleDOI
01 Jul 2017
TL;DR: This paper focuses on identifying whether a dataset consisting of student programming assignments is rich enough to apply coding style metrics to detect similarities between code sequences, using the BlackBox dataset as a case study.
Abstract: Plagiarism has become an increasing problem in higher education in recent years. Coding style covers how code is written and structured, choices that do not affect the logic of a program, and can therefore be used to differentiate between code authors and to detect source code plagiarism. This paper focuses on identifying whether a dataset consisting of student programming assignments is rich enough to apply coding style metrics to detect similarities between code sequences; we use the BlackBox dataset as a case study.

Journal ArticleDOI
TL;DR: This study investigated the consequences of the use of text-matching software on teachers' and students' conceptions of plagiarism and problems in academic writing, and found that teachers are inclined to think of plagiarism as part of a learning process rather than as an issue of morality, which may have consequences for how they understand the role of text matching.
Abstract: The aim of this study was to investigate the consequences of the use of text-matching software on teachers' and students' conceptions of plagiarism and problems in academic writing. An electronic questionnaire included scale items, structured questions, and open-ended questions. The respondents were 85 teachers and 506 students at a large Finnish university. Methods of analysis included exploratory factor analysis, t-tests, and inductive content analysis. Both teachers and students reported increased awareness of plagiarism and improvements in writing habits, as well as concerns and limitations related to the system. The results suggest that teachers are inclined to think of plagiarism as part of a learning process rather than as an issue of morality, which may have consequences for how they understand the role of text matching. The introduction of text-matching software has supported teachers' work, but at the same time teachers emphasized their own responsibility in detecting problems in student writing. The survey provides a limited sample of “Case Finland,” where the nationwide implementation of text-matching software has been remarkably rapid; it offers a glimpse into one institution's implementation of a newly introduced policy for mandatory plagiarism detection.

Journal ArticleDOI
01 Sep 2017
TL;DR: The experimental results show that the proposed plagiarism detection method, based on the local maximal value of the length of the longest common subsequence with weights defined by a distributed representation, is useful in applications that need strict detection of complex plagiarism.
Abstract: Accurate methods are required for plagiarism detection in documents. Generally, plagiarism detection is implemented on the basis of similarity between documents. This paper evaluates the validity of using distributed representations of words to define a document similarity. It proposes a plagiarism detection method based on the local maximal value of the length of the longest common subsequence (LCS), with weights defined by a distributed representation. The proposed method and two other straightforward methods, based on the simple length of the LCS and the local maximal value of the LCS with no weights, are applied to the dataset of a plagiarism detection competition. The experimental results show that the proposed method is useful in applications that need strict detection of complex plagiarism.
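
A sketch of the weighting idea: an LCS whose match contribution is the word-vector similarity itself rather than a hard 0/1, so near-synonyms can extend the common subsequence. The random toy vectors and the 0.2 threshold are illustrative stand-ins for a trained distributed representation:

```python
import numpy as np

def weighted_lcs(doc_a, doc_b, vec, threshold=0.2):
    """LCS over words where a match adds its embedding cosine similarity."""
    def sim(u, w):
        x, y = vec[u], vec[w]
        return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
    m, n = len(doc_a), len(doc_b)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = sim(doc_a[i - 1], doc_b[j - 1])
            if s >= threshold:               # soft match, weighted by similarity
                dp[i][j] = dp[i - 1][j - 1] + s
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

rng = np.random.default_rng(0)
words = ["the", "cat", "sat", "feline", "rested"]
vec = {w: rng.standard_normal(8) for w in words}   # toy embeddings
print(weighted_lcs(["the", "cat", "sat"], ["the", "feline", "rested"], vec))
```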