
Showing papers on "Plagiarism detection published in 2014"


Proceedings ArticleDOI
Lannan Luo1, Jiang Ming1, Dinghao Wu1, Peng Liu1, Sencun Zhu1 
11 Nov 2014
TL;DR: This paper proposes a binary-oriented, obfuscation-resilient method based on a new concept, the longest common subsequence of semantically equivalent basic blocks, which combines rigorous program semantics with longest-common-subsequence-based fuzzy matching.
Abstract: Existing code similarity comparison methods, whether source or binary code based, are mostly not resilient to obfuscations. In the case of software plagiarism, emerging obfuscation techniques have made automated detection increasingly difficult. In this paper, we propose a binary-oriented, obfuscation-resilient method based on a new concept, longest common subsequence of semantically equivalent basic blocks, which combines rigorous program semantics with longest common subsequence based fuzzy matching. We model the semantics of a basic block by a set of symbolic formulas representing the input-output relations of the block. This way, the semantics equivalence (and similarity) of two blocks can be checked by a theorem prover. We then model the semantics similarity of two paths using the longest common subsequence with basic blocks as elements. This novel combination has resulted in strong resiliency to code obfuscation. We have developed a prototype and our experimental results show that our method is effective and practical when applied to real-world software.

195 citations
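The core idea above — a longest common subsequence in which element "equality" means semantic equivalence of basic blocks rather than syntactic identity — can be sketched in a few lines of Python. This is an illustrative toy, not the authors' tool: blocks are modelled as plain functions, and the theorem-prover equivalence check is approximated by comparing input-output behaviour on a handful of sample inputs.

```python
def blocks_equivalent(f, g, samples=range(-5, 6)):
    """Approximate semantic equivalence of two 'basic blocks' by
    comparing input-output behaviour on sample inputs (the paper
    checks the blocks' symbolic formulas with a theorem prover)."""
    return all(f(x) == g(x) for x in samples)

def lcs_length(path_a, path_b, equiv):
    """Longest common subsequence where element 'equality' is a
    pluggable semantic-equivalence predicate."""
    m, n = len(path_a), len(path_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if equiv(path_a[i - 1], path_b[j - 1]):
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

# Toy "paths": obfuscation changed the syntax but not the semantics.
path_a = [lambda x: x * 2, lambda x: x + 1]
path_b = [lambda x: x + x, lambda x: x - 3, lambda x: x + 1]

score = lcs_length(path_a, path_b, blocks_equivalent)
similarity = score / min(len(path_a), len(path_b))
```

Normalising by the shorter path length mirrors the common practice of reporting similarity in [0, 1]; the real system derives the input-output formulas symbolically from binary code.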


Proceedings ArticleDOI
01 Oct 2014
TL;DR: An approach that uses character n-grams as features is proposed for the task of native language identification and has an important advantage in that it is language independent and linguistic theory neutral.
Abstract: A common approach in text mining tasks such as text categorization, authorship identification or plagiarism detection is to rely on features like words, part-of-speech tags, stems, or some other high-level linguistic features. In this work, an approach that uses character n-grams as features is proposed for the task of native language identification. Instead of doing standard feature selection, the proposed approach combines several string kernels using multiple kernel learning. Kernel Ridge Regression and Kernel Discriminant Analysis are independently used in the learning stage. The empirical results obtained in all the experiments conducted in this work indicate that the proposed approach achieves state of the art performance in native language identification, reaching an accuracy that is 1.7% above the top scoring system of the 2013 NLI Shared Task. Furthermore, the proposed approach has an important advantage in that it is language independent and linguistic theory neutral. In the cross-corpus experiment, the proposed approach shows that it can also be topic independent, improving the state of the art system by 32.3%.

82 citations
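A minimal illustration of one character n-gram string kernel (the classic normalised spectrum kernel; the paper combines several such kernels with multiple kernel learning, which this sketch omits):

```python
import math
from collections import Counter

def char_ngrams(text, n):
    """Count all overlapping character n-grams of a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def spectrum_kernel(a, b, n=3):
    """Normalised inner product of character n-gram count vectors,
    i.e. a cosine similarity over the n-gram 'spectrum'."""
    ca, cb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(ca[g] * cb[g] for g in ca if g in cb)
    norm = math.sqrt(sum(v * v for v in ca.values()) *
                     sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

k = spectrum_kernel("the cat sat on the mat", "the cat sat on a mat")
```

Because the features are raw characters, the measure needs no tokeniser, tagger, or lexicon — which is precisely what makes the approach language independent and linguistic-theory neutral.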


Proceedings Article
01 Jan 2014
TL;DR: The PAN 2014 evaluation lab hosted three shared tasks on plagiarism detection, author identification, and author profiling; its TIRA web service facilitates software submissions, allowing participants to submit running software instead of run output and reducing the workload for both participants and organizers.
Abstract: This paper reports on the PAN 2014 evaluation lab, which hosts three shared tasks on plagiarism detection, author identification, and author profiling. To improve the reproducibility of shared tasks in general, and PAN's tasks in particular, the Webis group developed a new web service called TIRA, which facilitates software submissions. Unlike many other labs, PAN asks participants to submit running software instead of their run output. To deal with the organizational overhead involved in handling software submissions, the TIRA experimentation platform significantly reduces the workload for both participants and organizers while keeping the submitted software in a running state. This year, we addressed the matter of responsibility for the successful execution of submitted software in order to put participants back in charge of executing their software at our site. In sum, 57 pieces of software were submitted to our lab; together with the 58 software submissions of last year, this forms the largest collection of software for our three tasks to date, all of which is readily available for further analysis. The report concludes with a brief summary of each task.

74 citations


Proceedings ArticleDOI
01 Jan 2014
TL;DR: The state of the art in software plagiarism detection tools is illustrated by comparing their features and testing them against a wide range of source codes, showing each tool's accuracy at detecting each type of plagiarism.
Abstract: We illustrate the state of the art in software plagiarism detection tools by comparing their features and testing them against a wide range of source codes. The source codes were edited according to several types of plagiarism to show the tools' accuracy at detecting each type. The decision to focus our research on plagiarism in programming languages is twofold: on one hand, it is a challenging case study, since programming languages impose a structured writing style; on the other hand, we are looking to integrate such a tool into an Automatic-Grading System (AGS) developed to support teachers in programming courses. Besides the systematic characterisation of the underlying problem domain, the tools were surveyed with the objective of identifying the most successful approach in order to design the intended plugin for our AGS.

51 citations


Book
14 Jul 2014
TL;DR: Citation-based Plagiarism Detection does not rely on text comparisons alone, but analyzes citation patterns within documents to form a language-independent "semantic fingerprint" for similarity assessment.
Abstract: Plagiarism is a problem with far-reaching consequences for the sciences. However, even today's best software-based systems can only reliably identify copy & paste plagiarism. Disguised plagiarism forms, including paraphrased text, cross-language plagiarism, as well as structural and idea plagiarism, often remain undetected. This weakness of current systems results in a large percentage of scientific plagiarism going undetected. Bela Gipp provides an overview of the state of the art in plagiarism detection and an analysis of why these approaches fail to detect disguised plagiarism forms. The author proposes Citation-based Plagiarism Detection to address this shortcoming. Unlike character-based approaches, this approach does not rely on text comparisons alone, but analyzes citation patterns within documents to form a language-independent "semantic fingerprint" for similarity assessment. The practicability of Citation-based Plagiarism Detection was proven by its capability to identify previously non-machine-detectable plagiarism in scientific publications.

50 citations
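The citation-pattern idea can be illustrated with an order-sensitive sequence comparison. The citation names below are invented, and `difflib.SequenceMatcher` is only a stand-in for the book's dedicated citation-pattern algorithms:

```python
from difflib import SequenceMatcher

# Hypothetical in-text citation sequences, in order of appearance.
# Citation order tends to survive translation and paraphrase, so
# comparing it yields a language-independent "semantic fingerprint".
original = ["Smith2001", "Lee2005", "Kim2003", "Rao2007"]
suspicious = ["Smith2001", "Kim2003", "Rao2007"]   # e.g. a translated copy

similarity = SequenceMatcher(None, original, suspicious).ratio()
```

Here three of the citations match in order, so `ratio()` returns 2·3/(4+3) ≈ 0.86 even though the surrounding text (in another language) would share no character overlap at all.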


Proceedings ArticleDOI
03 Nov 2014
TL;DR: Since LoPD is a formal program semantics-based method, it can provide a guarantee of resilience against many known obfuscation attacks and is more resilient to current automatic obfuscation techniques, compared to the existing detection mechanisms.
Abstract: Software plagiarism, an act of illegally copying others' code, has become a serious concern for honest software companies and the open source community. In this paper, we propose LoPD, a program logic based approach to software plagiarism detection. Instead of directly comparing the similarity between two programs, LoPD searches for any dissimilarity between two programs by finding an input that will cause these two programs to behave differently, either with different output states or with semantically different execution paths. As long as we can find one dissimilarity, the programs are semantically different, but if we cannot find any dissimilarity, it is likely a plagiarism case. We leverage symbolic execution and weakest precondition reasoning to capture the semantics of execution paths and to find path dissimilarities. LoPD is more resilient to current automatic obfuscation techniques, compared to the existing detection mechanisms. In addition, since LoPD is a formal program semantics-based method, it can provide a guarantee of resilience against many known obfuscation attacks. Our evaluation results indicate that LoPD is both effective and efficient in detecting software plagiarism.

46 citations
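LoPD's core logic — establish similarity by failing to find a dissimilarity — can be shown with a brute-force stand-in: instead of symbolic execution and weakest-precondition reasoning, simply enumerate a small input domain looking for a witness on which the two programs disagree. The programs below are toys for illustration only.

```python
def find_distinguishing_input(p, q, domain=range(-100, 101)):
    """Search for an input on which the two programs disagree.
    LoPD does this with symbolic execution and a solver; this sketch
    simply enumerates a small input domain."""
    for x in domain:
        if p(x) != q(x):
            return x          # witness of dissimilarity: not plagiarism
    return None               # no witness found: likely plagiarism

def original_prog(x):
    return x * x + 2 * x + 1

def obfuscated_prog(x):       # same semantics, different syntax
    return (x + 1) ** 2

def unrelated_prog(x):
    return 3 * x

assert find_distinguishing_input(original_prog, obfuscated_prog) is None
assert find_distinguishing_input(original_prog, unrelated_prog) is not None
```

The asymmetry is the point: one disagreeing input proves the programs differ, while exhaustive failure to find one (over all paths, in the real system) is strong evidence of semantic equivalence.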


Proceedings ArticleDOI
15 Oct 2014
TL;DR: An n-gram model for programming languages is proposed and a plagiarism detector with accuracy that beats previous techniques is implemented, and a bug-finding tool is demonstrated that discovered over a dozen previously unknown bugs in a collection of real deployed programs.
Abstract: Several program analysis tools - such as plagiarism detection and bug finding - rely on knowing a piece of code's relative semantic importance. For example, a plagiarism detector should not bother reporting two programs that have an identical simple loop counter test, but should report programs that share more distinctive code. Traditional program analysis techniques (e.g., finding data and control dependencies) are useful, but do not say how surprising or common a line of code is. Natural language processing researchers have encountered a similar problem and addressed it using an n-gram model of text frequency, derived from statistics computed over text corpora. We propose and compute an n-gram model for programming languages, computed over a corpus of 2.8 million JavaScript programs we downloaded from the Web. In contrast to previous techniques, we describe a code n-gram as a subgraph of the program dependence graph that contains all nodes and edges reachable in n steps from the statement. We can count n-grams in a program and count the frequency of n-grams in the corpus, enabling us to compute tf-idf-style measures that capture the differing importance of different lines of code. We demonstrate the power of this approach by implementing a plagiarism detector with accuracy that beats previous techniques, and a bug-finding tool that discovered over a dozen previously unknown bugs in a collection of real deployed programs.

40 citations
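The tf-idf-style scoring can be sketched over token n-grams. Note the simplification: the paper's n-grams are subgraphs of the program dependence graph computed over 2.8 million JavaScript programs, whereas the corpus and token streams below are toy stand-ins.

```python
import math
from collections import Counter

def ngrams(tokens, n=2):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_scores(program, corpus, n=2):
    """Score each n-gram of `program` by tf-idf against `corpus`:
    high scores mark distinctive code, low scores common boilerplate."""
    docs = [set(ngrams(p, n)) for p in corpus]
    tf = Counter(ngrams(program, n))
    scores = {}
    for g, f in tf.items():
        df = sum(1 for d in docs if g in d)          # document frequency
        idf = math.log((1 + len(docs)) / (1 + df))   # smoothed idf
        scores[g] = f * idf
    return scores

# Toy token streams standing in for PDG subgraphs.
corpus = [["for", "i", "in", "range", ":", "i", "+=", "1"],
          ["for", "j", "in", "range", ":", "print", "j"]]
prog = ["for", "i", "in", "range", ":", "decrypt", "key"]
scores = tfidf_scores(prog, corpus)
```

A plagiarism detector built on such scores would then weight shared n-grams by their rarity, so two programs sharing an idiomatic loop header score far lower than two sharing a distinctive call sequence.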


Journal ArticleDOI
TL;DR: All authors, reviewers, and editors of scientific journals should know what plagiarism is and how to avoid it by following ethical guidelines and using plagiarism detection software during scientific writing.
Abstract: Plagiarism has become more common in both the dental and medical communities. Many writers do not realize that plagiarism is a serious problem. Plagiarism can range from simple dishonesty (minor copy-paste or other discrepancy) to a more serious problem (major discrepancy or duplication of a manuscript) when authors cut, copy, and paste from the original source without giving adequate credit. Searching databases such as PubMed/MEDLINE turns up a great deal of information on plagiarism, yet how to avoid it remains a topic of current interest for researchers. Every young researcher should know the ethical guidelines for writing scientific publications. By using one's own ideas, a paper can be written entirely without looking at the original source. Specific words from a source can be included in quotations and cited, which not only supports your work and amplifies ideas but also avoids plagiarism. All authors, reviewers, and editors of scientific journals should know what plagiarism is and how to avoid it by following ethical guidelines and using plagiarism detection software during scientific writing.

36 citations


Journal ArticleDOI
TL;DR: Evaluation of CbPD in detecting plagiarism with various degrees of disguise in a collection of 185,000 biomedical articles shows that the citation‐based approach achieves superior ranking performance for heavily disguised plagiarism forms and is demonstrated to be computationally more efficient than character‐based approaches.
Abstract: The automated detection of plagiarism is an information retrieval task of increasing importance as the volume of readily accessible information on the web expands. A major shortcoming of current automated plagiarism detection approaches is their dependence on high character-based similarity. As a result, heavily disguised plagiarism forms, such as paraphrases, translated plagiarism, or structural and idea plagiarism, remain undetected. A recently proposed language-independent approach to plagiarism detection, Citation-based Plagiarism Detection (CbPD), allows the detection of semantic similarity even in the absence of text overlap by analyzing the citation placement in a document's full text to determine similarity. This article evaluates the performance of CbPD in detecting plagiarism with various degrees of disguise in a collection of 185,000 biomedical articles. We benchmark CbPD against two character-based detection approaches using a ground truth approximated in a user study. Our evaluation shows that the citation-based approach achieves superior ranking performance for heavily disguised plagiarism forms. Additionally, we demonstrate CbPD to be computationally more efficient than character-based approaches. Finally, upon combining the citation-based with the traditional character-based document similarity visualization methods in a hybrid detection prototype, we observe a reduction in the required user effort for document verification.

36 citations


Proceedings ArticleDOI
05 Dec 2014
TL;DR: The PAN@FIRE SOCO 2014 track focused on the detection of re-used source code in the C/C++ and Java programming languages; participant systems were asked to annotate several source codes as to whether or not they represent cases of source code re-use.
Abstract: This paper summarizes the goals, organization, and results of the first SOCO competitive evaluation campaign for systems that automatically detect the source code re-use phenomenon. The detection of source code re-use is an important research field for both the software industry and academia. Accordingly, the PAN@FIRE track named SOurce COde Re-use (SOCO) focused on the detection of re-used source code in the C/C++ and Java programming languages. Participant systems were asked to annotate several source codes as to whether or not they represent cases of source code re-use. In total, five teams submitted 17 runs. The training set consisted of annotations made by several experts, a feature which turns the SOCO 2014 collection into a useful data set for future evaluations and, at the same time, establishes a standard evaluation framework for future research on the posed shared task.

33 citations


Proceedings ArticleDOI
01 Dec 2014
TL;DR: The paper explores different preprocessing methods based on Natural Language Processing (NLP) techniques, further explores fuzzy-semantic similarity measures for document comparison, and compares the performance of the different methods.
Abstract: Plagiarism is one of the most serious offences in academia and research. In this modern era, where access to information has become much easier, the act of plagiarism is rapidly increasing. This paper focuses on the external plagiarism detection setting, where a source collection of documents is available against which the suspicious documents are compared. The primary focus is detecting intelligent plagiarism cases, where semantics and linguistic variations play an important role. The paper explores different preprocessing methods based on Natural Language Processing (NLP) techniques. It further explores fuzzy-semantic similarity measures for document comparison. The system is finally evaluated using the PAN 2012 data set, and the performances of the different methods are compared.

Book ChapterDOI
01 Jan 2014
TL;DR: Citations and references of scholarly publications have long been recognized as containing valuable semantic relatedness information for documents, as demonstrated in Section 3.2.
Abstract: When the author first considered the use of citation information as a method to detect plagiarism, he assumed this concept had already been explored or even integrated into today’s plagiarism detection systems (PDS). After all, citations and references of scholarly publications have long been recognized as containing valuable semantic relatedness information for documents, as demonstrated in Section 3.2.

Proceedings ArticleDOI
01 Nov 2014
TL;DR: The primary focus is to explore unsupervised document categorization/clustering methods using different variations of the K-means algorithm and compare them with the general N-gram based method and the Vector Space Model based method.
Abstract: Text document categorization is one of the rapidly emerging research fields, where documents are identified, differentiated, and classified manually or algorithmically. The paper focuses on the application of automatic text document categorization in the plagiarism detection domain. In today's world plagiarism has become a prime concern, especially in research and educational fields. This paper aims at the study and comparison of different methods of document categorization in external plagiarism detection. Here the primary focus is to explore unsupervised document categorization/clustering methods using different variations of the K-means algorithm and compare them with the general N-gram based method and the Vector Space Model based method. Finally, the analysis and evaluation is done using a data set from PAN-2013, and performance is compared based on precision, recall, and efficiency in terms of algorithm execution time.
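The clustering step can be sketched with a plain K-means over term-frequency vectors. This is a minimal stdlib sketch (random initialisation, no tf-idf weighting), not the paper's evaluated variants:

```python
import random
from collections import Counter

def vectorize(doc, vocab):
    """Term-frequency vector of a document over a fixed vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means. In external plagiarism detection, clustering
    narrows the set of source documents each suspicious document
    must be compared against."""
    centers = random.Random(seed).sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:       # recompute centroid of each cluster
                centers[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    return labels

docs = ["the cat sat on the mat", "a cat and a mat",
        "stock markets fell sharply", "markets fell on monday"]
vocab = sorted({w for d in docs for w in d.lower().split()})
labels = kmeans([vectorize(d, vocab) for d in docs], k=2)
```

A production system would use smarter initialisation (e.g. k-means++) and tf-idf weighting; the point here is only that a suspicious document then needs detailed comparison against its own cluster rather than the whole collection.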

Proceedings ArticleDOI
02 Oct 2014
TL;DR: The first methodology for evaluating CFG similarity algorithms with respect to accuracy and efficiency is proposed; at its heart is a technique to automatically generate benchmark graphs: CFGs of known edit distance.
Abstract: Control-Flow Graph (CFG) similarity is a core technique in many areas, including malware detection and software plagiarism detection. While many algorithms have been proposed in the literature, their relative strengths and weaknesses have not been previously studied. Moreover, it is not even clear how to perform such an evaluation. In this paper we therefore propose the first methodology for evaluating CFG similarity algorithms with respect to accuracy and efficiency. At the heart of our methodology is a technique to automatically generate benchmark graphs: CFGs of known edit distance. We show the result of applying our methodology to four popular algorithms. Our results show that an algorithm proposed by Hu et al. performs best in terms of both running time and accuracy.
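The benchmark-generation idea — derive graph pairs whose edit distance is bounded by construction — can be sketched as follows. This toy uses edge-level edits on random directed graphs; the paper's generator and its exact distance definition may differ.

```python
import random

def random_digraph(n, p, rng):
    """Random directed graph represented as a set of edges."""
    return {(i, j) for i in range(n) for j in range(n)
            if i != j and rng.random() < p}

def mutate(edges, n, k, rng):
    """Apply exactly k edge insertions/deletions; the resulting graph
    pair has edit distance at most k by construction."""
    edges = set(edges)
    for _ in range(k):
        if edges and rng.random() < 0.5:
            edges.remove(rng.choice(sorted(edges)))
        else:
            absent = [(i, j) for i in range(n) for j in range(n)
                      if i != j and (i, j) not in edges]
            if absent:
                edges.add(rng.choice(absent))
    return edges

rng = random.Random(42)
g = random_digraph(6, 0.3, rng)
h = mutate(g, 6, k=3, rng=rng)
changed = len(g ^ h)      # symmetric difference never exceeds k edits
assert changed <= 3
```

Because an edit can be undone by a later one, the construction yields an upper bound on the true edit distance — which is exactly what is needed to score a similarity algorithm's accuracy against known ground truth.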

Proceedings ArticleDOI
01 Oct 2014
TL;DR: A novel language-independent intrinsic plagiarism detection method, based on a new text representation called n-gram classes, is introduced; its results are comparable to those of the best state-of-the-art methods.
Abstract: When it is not possible to compare the suspicious document to the source document(s) from which plagiarism has been committed, the evidence of plagiarism has to be sought intrinsically, in the document itself. In this paper, we introduce a novel language-independent intrinsic plagiarism detection method which is based on a new text representation that we call n-gram classes. The proposed method was evaluated on three publicly available standard corpora. The results obtained are comparable to those of the best state-of-the-art methods.

Proceedings ArticleDOI
05 Mar 2014
TL;DR: This work presents an experiment in which teachers were asked to compare different code solutions to the same problem; the results suggest that comparing students' code has significant potential to be automated to help teachers in their work.
Abstract: In introductory programming courses it is common to assign students exercises based on producing code. However, it is difficult for the teacher to give fast feedback to students about the main solutions tried, the main errors, and the drawbacks and advantages of particular solutions. If we could use automatic code comparison algorithms to build visualisation tools that help the teacher analyse how similar or different the submitted solutions are, such information could be obtained rapidly. But can computers compare students' code solutions as well as teachers can? In this work we present an experiment in which we asked teachers to compare different code solutions to the same problem. We then evaluated the level of agreement between each teacher's comparison strategy and some algorithms generally used for plagiarism detection and automatic grading. For one teacher, agreement with the algorithms ranged only from 75% to 77%; for most of the teachers, however, the maximum agreement rate was over 90% for at least one of the automatic comparison strategies. We also found that the level of agreement among teachers regarding their personal strategies for comparing student solutions was between 62% and 95%, which shows that there may be more agreement between a teacher and an algorithm than between a teacher and a colleague. The results also seem to support the view that comparison of students' code has significant potential to be automated to help teachers in their work.

Book
04 Sep 2014
TL;DR: This paper proposes eight ethical techniques to avoid unconscious and accidental plagiarism in manuscripts without using online systems such as Turnitin and/or iThenticate for cross-checking and plagiarism detection.
Abstract: This paper discusses the origins of plagiarism and ethical solutions to prevent it. It also reviews some unethical approaches that may be used to decrease the plagiarism rate in academic writing. We propose eight ethical techniques to avoid unconscious and accidental plagiarism in manuscripts without using online systems such as Turnitin and/or iThenticate for cross-checking and plagiarism detection. The efficiency of the proposed techniques was evaluated by having students apply them individually to five different texts. After the techniques were applied, the texts were checked by Turnitin to produce plagiarism and similarity reports. Finally, the “effective factor” of each method was compared; the best result came from a hybrid combination of all the techniques. The hybrid of ethical methods decreased the plagiarism rate reported by Turnitin from nearly 100% to an average of 8.4% across the 5 manuscripts.

Journal ArticleDOI
TL;DR: A multi-register corpus gathered for this purpose is introduced, in which each text has been located in a similarity space based on ratings by human readers, which provides a resource for testing similarity measures derived from computational text-processing against reference levels derived from human judgement.
Abstract: Quantifying the similarity or dissimilarity between documents is an important task in authorship attribution, information retrieval, plagiarism detection, text mining, and many other areas of linguistic computing. Numerous similarity indices have been devised and used, but relatively little attention has been paid to calibrating such indices against externally imposed standards, mainly because of the difficulty of establishing agreed reference levels of inter-text similarity. The present article introduces a multi-register corpus gathered for this purpose, in which each text has been located in a similarity space based on ratings by human readers. This provides a resource for testing similarity measures derived from computational text-processing against reference levels derived from human judgement, i.e. external to the texts themselves. We describe the results of a benchmarking study in five different languages in which some widely used measures perform comparatively poorly. In particular, several alternative correlational measures (Pearson r, Spearman rho, tetrachoric correlation) consistently outperform cosine similarity on our data. A method of using what we call ‘anchor texts’ to extend this method from monolingual inter-text similarity-scoring to inter-text similarity-scoring across languages is also proposed and tested.
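The gap between cosine similarity and the correlational measures is easy to demonstrate: Pearson's r is the cosine of the mean-centred vectors, so a large shared baseline inflates cosine but not r. A toy illustration with hypothetical frequency profiles:

```python
import math
from statistics import mean

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def pearson(a, b):
    ma, mb = mean(a), mean(b)
    # Pearson r equals the cosine of the centred vectors.
    return cosine([x - ma for x in a], [y - mb for y in b])

# Two feature profiles sharing a large common baseline:
a = [100, 100, 100, 100, 0]
b = [100, 100, 100, 100, 200]

# cosine(a, b) ~ 0.71 looks moderately similar, while
# pearson(a, b) ~ -1.0: the deviations are perfectly anti-correlated.
```

This is one mechanism by which cosine can over-score text pairs that merely share common high-frequency features, consistent with the benchmarking result reported above.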

Proceedings ArticleDOI
08 Sep 2014
TL;DR: A hybrid approach that integrates detection methods using citations, semantic argument structure, and semantic word similarity with character-based methods achieves higher detection performance for disguised plagiarism forms, and makes semantic plagiarism detection feasible even on large collections for the first time.
Abstract: This paper proposes a hybrid approach to plagiarism detection in academic documents that integrates detection methods using citations, semantic argument structure, and semantic word similarity with character-based methods to achieve a higher detection performance for disguised plagiarism forms. Currently available software for plagiarism detection exclusively performs text string comparisons. These systems find copies, but fail to identify disguised plagiarism, such as paraphrases, translations, or idea plagiarism. Detection approaches that consider semantic similarity on word and sentence level exist and have consistently achieved higher detection accuracy for disguised plagiarism forms compared to character-based approaches. However, the high computational effort of these semantic approaches makes them infeasible for use in real-world plagiarism detection scenarios. The proposed hybrid approach uses citation-based methods as a preliminary heuristic to reduce the retrieval space with a relatively low loss in detection accuracy. This preliminary step can then be followed by a computationally more expensive semantic and character-based analysis. We show that such a hybrid approach allows semantic plagiarism detection to become feasible even on large collections for the first time.

01 Jan 2014
TL;DR: This work aggregates all the previously researched experience from the PAN 2012 and PAN 2013 shared tasks and further improves previously developed plagiarism detection methods, combined into a single model that marks similar sections and thus effectively detects different types of obfuscation techniques.
Abstract: This paper describes the approaches used for the Plagiarism Detection task during the PAN 2014 International Competition on Uncovering Plagiarism, Authorship, and Social Software Misuse, which scored first place (plagdet score 0.907) on test corpus no. 3 and third place (0.868) on test corpus no. 2. In this work we aggregated all the previously researched experience from our PAN 2012 and PAN 2013 work [2] and thus further improved our previously developed plagiarism detection methods [8], with the help of: contextual n-grams, surrounding-context n-grams, named-entity-based n-grams, odd-even skip n-grams, functional-word-frame-based n-grams, a TF-IDF sentence-level similarity index, a noise-sensitive clusterization algorithm, and focused summary-type detection heuristics, combined into a single model to mark similar sections and thus effectively detect different types of obfuscation techniques.

Proceedings ArticleDOI
27 Apr 2014
TL;DR: This paper presents a fully functional, web-based visualization of citation patterns for a verified cross-language plagiarism case, allowing the user to interactively experience the benefits of citation pattern analysis for plagiarism detection.
Abstract: In a previous paper, we showed that analyzing citation patterns in the well-known plagiarized thesis by K. T. zu Guttenberg clearly outperformed current detection methods in identifying cross-language plagiarism. However, the experiment was a proof of concept and we did not provide a prototype. This paper presents a fully functional, web-based visualization of citation patterns for this verified cross-language plagiarism case, allowing the user to interactively experience the benefits of citation pattern analysis for plagiarism detection. Using examples from the Guttenberg plagiarism case, we demonstrate that the citation pattern visualization reduces the required examiner effort to verify the extent of plagiarism.

Journal ArticleDOI
TL;DR: A hybrid SVM-based approach to plagiarism detection is offered, using an Internet search to ensure that the documents under detection are up to date; measurement results show that hybrid machine learning does not always yield better performance.
Abstract: Currently, most plagiarism detection uses similarity measurement techniques. Basically, a pair of similar sentences expresses the same idea; however, this is not always so, as there are also sentences that are similar but have opposite meanings. This is one problem that similarity techniques do not easily solve. Setting the similarity threshold is another problem: the plagiarism threshold is adjustable, but that adjustability means uncertainty. Moreover, although the rules of plagiarism are commonly understood, in practice people differ in their opinions on whether a given document should be classified as plagiarism. For these three problems, a statistical approach may be the most appropriate solution. Machine learning methods such as k-nearest neighbors (KNN), support vector machines (SVM), and artificial neural networks (ANN) are techniques commonly used to solve problems based on statistical data; they learn from data which, in this case, has been validated by experts. This paper offers a hybrid SVM-based approach to detecting plagiarism. Data collection in this work uses an Internet search to ensure that the documents under detection are up to date. The measurement results, based on accuracy, precision, and recall, show that hybrid machine learning does not always result in better performance. Overall testing of the four hybrid combinations concluded that the hybrid ANN-SVM method performs best in this plagiarism case.
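The statistical-classification idea — learn the plagiarism/non-plagiarism boundary from expert-labelled examples instead of a hand-set similarity threshold — can be shown with a minimal k-nearest-neighbours sketch (KNN is one of the methods the paper names; the feature names and values below are invented for illustration):

```python
import math
from collections import Counter

def knn_predict(train, x, k=3):
    """Classify a document pair by majority vote of its k nearest
    expert-labelled neighbours in feature space."""
    nearest = sorted(train, key=lambda t: math.dist(t[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical features per document pair: (text similarity,
# citation overlap); labels come from expert judgements.
train = [((0.92, 0.80), "plagiarism"), ((0.88, 0.70), "plagiarism"),
         ((0.85, 0.75), "plagiarism"), ((0.30, 0.10), "original"),
         ((0.25, 0.05), "original"), ((0.40, 0.20), "original")]

verdict = knn_predict(train, (0.90, 0.72))
```

The threshold is now implicit in the labelled data rather than chosen by hand, which is the paper's motivation for the statistical approach.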

Proceedings ArticleDOI
02 Jun 2014
TL;DR: Two dynamic-birthmark-based approaches are presented that can effectively detect plagiarism of multithreaded programs and exhibit strong resilience to various semantics-preserving code obfuscations.
Abstract: The availability of inexpensive multicore hardware presents a turning point in software development. In order to benefit from the continued exponential throughput advances in new processors, software applications must be multithreaded. As multithreaded programs become increasingly popular, plagiarism of multithreaded programs starts to plague the software industry. Although there has been tremendous progress in software plagiarism detection technology, existing dynamic approaches remain optimized for sequential programs and cannot be applied to multithreaded programs without significant redesign. This paper fills the gap by presenting two dynamic birthmark based approaches. The first approach extracts key instructions, while the second extracts system calls. Both approaches consider the effect of thread scheduling on computing software birthmarks. We have implemented a prototype based on the Pin instrumentation framework. Our empirical study shows that the proposed approaches can effectively detect plagiarism of multithreaded programs and exhibit strong resilience to various semantics-preserving code obfuscations.
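The system-call variant can be sketched as a classic k-gram birthmark with a containment score. The traces below are hypothetical, and this sketch ignores the thread-scheduling effects the paper explicitly accounts for:

```python
def birthmark(trace, k=2):
    """k-gram birthmark: the set of length-k windows of a program's
    system-call trace."""
    return {tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)}

def containment(bm_plaintiff, bm_suspect):
    """Fraction of the plaintiff's birthmark found in the suspect's;
    high containment suggests copied behaviour."""
    return len(bm_plaintiff & bm_suspect) / len(bm_plaintiff)

# Hypothetical system-call traces from two runs on the same input.
plaintiff = ["open", "read", "read", "write", "close"]
suspect = ["open", "read", "read", "write", "futex", "close"]

score = containment(birthmark(plaintiff), birthmark(suspect))
```

Because a birthmark reflects runtime behaviour rather than syntax, it tends to survive semantics-preserving obfuscation; the hard part for multithreaded programs, per the paper, is making the extracted mark stable across thread interleavings.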

05 Dec 2014
TL;DR: This paper summarizes the goals, organization and results of the first SOCO competitive evaluation campaign for systems that automatically detect the source code re-use phenomenon, and establishes a standard evaluation framework for future research works on the posed shared task.
Abstract: © Owner/Author. This is the author's version of the work. It is posted here for your personal use, not for redistribution. The definitive Version of Record was published by ACM in Proceedings of the Forum for Information Retrieval Evaluation (FIRE '14), http://dx.doi.org/10.1145/2824864.2824878

DOI
30 Jun 2014
TL;DR: A method for ‘translingual’ plagiarism detection that is grounded on translation and interlanguage theories as well as on the principle of ‘linguistic uniqueness’ is proposed and applications of the method as an investigative tool in forensic contexts are discussed.
Abstract: Plagiarism detection methods have improved significantly over the last decades, and as a result of the advanced research conducted by computational and mostly forensic linguists, simple and sophisticated textual borrowing strategies can now be identified more easily. In particular, simple text comparison algorithms developed by computational linguists allow literal, word-for-word plagiarism (i.e. where identical strings of text are reused across different documents) to be easily detected (semi-)automatically (e.g. Turnitin or SafeAssign), although these methods tend to perform less well when the borrowing is obfuscated by introducing edits to the original text. In this case, more sophisticated linguistic techniques, such as an analysis of lexical overlap (Johnson, 1997), are required to detect the borrowing. However, these have limited applicability in cases of 'translingual' plagiarism, where a text is translated and borrowed without acknowledgment from an original in another language. Considering that (a) traditionally non-professional translation (e.g. literal or free machine translation) is the method used to plagiarise; (b) the plagiarist usually edits the text for grammar and syntax, especially when machine-translated; and (c) lexical items are those that tend to be translated more correctly, and carried over to the derivative text, this paper proposes a method for 'translingual' plagiarism detection that is grounded on translation and interlanguage theories (Selinker, 1972; Bassnett and Lefevere, 1998), as well as on the principle of 'linguistic uniqueness' (Coulthard, 2004). Empirical evidence from the CorRUPT corpus (Corpus of Reused and Plagiarised Texts), a corpus of real academic and non-academic texts that were investigated and accused of plagiarising originals in other languages, is used to illustrate the applicability of the methodology proposed for 'translingual' plagiarism detection.
Finally, applications of the method as an investigative tool in forensic contexts are discussed.

Proceedings ArticleDOI
28 May 2014
TL;DR: In this paper, the authors employed latent semantic analysis (LSA) as the term-document representation to handle intelligent plagiarism; LSA was used in both the Heuristic Retrieval (HR) component and the Detailed Analysis (DA) component.
Abstract: Plagiarism detection is an important task: plagiarism is increasingly common, and its techniques are becoming harder to detect, encompassing not only literal but also intelligent plagiarism. To handle intelligent plagiarism, we employed latent semantic analysis (LSA) as the term-document representation. LSA was used in both the Heuristic Retrieval (HR) component and the Detailed Analysis (DA) component. We conducted several experiments comparing token types, text segmentations, and threshold values. The test data were prepared manually from an available Indonesian paper corpus. Experimental results showed that LSA outperformed the VSM (Vector Space Model), especially on test cases involving intelligent plagiarism.
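The LSA representation this abstract relies on can be sketched as follows: factor a term-document matrix with SVD, keep only the top singular dimensions, and compare documents by cosine similarity in that latent space. The tiny matrix below is an illustrative stand-in, not the paper's Indonesian corpus.

```python
# Sketch of LSA as a term-document representation: truncated SVD of a
# term-document count matrix, then cosine similarity in the latent space.
import numpy as np

# rows = terms, columns = documents (raw counts); docs 0 and 2 share topic A,
# docs 1 and 3 share topic B
A = np.array([
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
rank = 2
docs = (np.diag(s[:rank]) @ Vt[:rank]).T   # each row: a document in 2-D latent space

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(round(cos(docs[0], docs[2]), 2))  # same-topic documents: high
print(round(cos(docs[0], docs[1]), 2))  # different-topic documents: low
```

Because the latent dimensions capture topics rather than surface words, this representation can match paraphrased (intelligent) plagiarism that a plain VSM misses.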

Proceedings ArticleDOI
22 Dec 2014
TL;DR: An external Persian plagiarism detection method based on the vector space model (VSM) has been proposed and a Persian corpus has been developed to implement and examine this method.
Abstract: Nowadays, extremely wide and easy access to the Internet has made plagiarism and text reuse more common. Many studies have been conducted on automatic plagiarism detection, but there are few on automatic Persian plagiarism detection due to the lack of a suitable Persian corpus. In this paper, an external Persian plagiarism detection method based on the vector space model (VSM) is proposed. To implement and examine this method, a Persian corpus has been developed. Several optimizations made during the study render the algorithm very fast and accurate. The test results of the proposed method show an accuracy of 0.87 and a processing time of less than 10 minutes.

Proceedings ArticleDOI
22 Dec 2014
TL;DR: In this article, an effective tool for intra-corpal plagiarism detection in text-based assignments is presented, comparing unigram, bigram, and trigram vector space models with the cosine similarity measure.
Abstract: Plagiarism is the illegitimate use of part or all of another's work as one's own in any field, such as art, poetry, literature, cinema, research, and other creative forms of study. It is one of the important issues in academic and research fields and is receiving growing attention in academic systems; the situation is even worse given the abundance of resources on the web. This paper presents an effective tool for intra-corpal plagiarism detection in text-based assignments, comparing unigram, bigram, and trigram vector space models with the cosine similarity measure. A manually evaluated, labelled dataset was tested using unigram, bigram, and trigram vectors. Although the trigram vector consumes comparatively more time, it shows the best results on the labelled data. In addition, the selected trigram vector space model with cosine similarity is compared with a trigram sequence matching technique using the Jaccard measure. In the results, the cosine similarity score is slightly higher than the Jaccard score, because cosine similarity gives more weight to terms that occur infrequently in the dataset; the trigram technique with cosine similarity is therefore preferable. We present our new tool as an effective way to evaluate text-based electronic assignments and minimize plagiarism among students.
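The comparison this abstract reports can be sketched as follows: word-trigram count vectors scored with cosine similarity versus trigram sets scored with the Jaccard measure. The sentence pair is an invented example; on it, as in the paper's results, the cosine score comes out higher than the Jaccard score.

```python
# Sketch: trigram cosine similarity vs. trigram Jaccard similarity
# for a pair of near-duplicate sentences.
from collections import Counter
import math

def trigrams(text):
    words = text.lower().split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]

def cosine(a, b):
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[g] * cb[g] for g in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

s1 = "the quick brown fox jumps over the lazy dog"
s2 = "the quick brown fox leaps over the lazy dog"
print(round(cosine(trigrams(s1), trigrams(s2)), 2))   # → 0.57
print(round(jaccard(trigrams(s1), trigrams(s2)), 2))  # → 0.4
```

Changing one word breaks three of the seven trigrams, and the Jaccard measure penalizes this union-wide, which is why the cosine score is slightly higher on the same pair.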

Journal Article
TL;DR: Ballor et al. as mentioned in this paper reported a case of unattributed dependence that appeared in the pages of the Journal of Markets & Morality, observing that plagiarism is a serious intellectual and moral offense.
Abstract: With the advent of printing technologies, particularly those in the industrial era, the dissemination and availability of academic texts became much more widespread, and along with this dissemination arose a reference apparatus and practice that more or less rigorously expected detailed attribution for source material. In theory, at least, a reader ought to be able to reconstruct the argument being made by carefully tracing the footnotes. Often this was more ideal than actual, but the standard has persisted to this day that plagiarism is a serious intellectual (and moral) offense. What has changed even more recently is that the digital dissemination of intellectual material has made plagiarism both easier to commit and easier to detect. With a simple cut and paste maneuver, huge blocks of text can be moved from a web page to a word processor. However, the same works in reverse, and massive search engines like Google, as well as specialized services like Turnitin, have made plagiarism detection a cottage industry (and for some professors, an unavoidable component of their occupation). It is therefore with regret that I must report a case of unattributed dependence that appeared in the pages of the Journal of Markets & Morality. Jordan J. Ballor, "Editorial: Plagiarism in a Digital Age," Journal of Markets & Morality 17, no. 2 (Fall 2014): 349-352.

Journal ArticleDOI
TL;DR: The work on designing and implementing a plagiarism detection system based on pre-processing and NLP techniques is described, and the results of testing on a corpus are presented.
Abstract: Currently there are many plagiarism detection approaches, but few of them have been implemented and adapted for the Persian language. In this paper, our work on the design and implementation of a plagiarism detection system based on pre-processing and NLP techniques is described, and the results of testing on a corpus are presented.