
Showing papers on "Plagiarism detection published in 2010"


Proceedings Article
23 Aug 2010
TL;DR: Empirical evidence is given that the construction of tailored training corpora for plagiarism detection can be automated, and hence be done on a large scale.
Abstract: We present an evaluation framework for plagiarism detection. The framework provides performance measures that address the specifics of plagiarism detection, and the PAN-PC-10 corpus, which contains 64 558 artificial and 4 000 simulated plagiarism cases, the latter generated via Amazon's Mechanical Turk. We discuss the construction principles behind the measures and the corpus, and we compare the quality of our corpus to existing corpora. Our analysis gives empirical evidence that the construction of tailored training corpora for plagiarism detection can be automated, and hence be done on a large scale.

327 citations


Journal Article
TL;DR: To win the fight against plagiarism, the paper recommends that the university adopt a more comprehensive approach that addresses, among other things, the fundamental reasons why students plagiarise.
Abstract: This paper reports on a pilot project of the Turnitin plagiarism detection software, which was implemented to determine the impact of the software on the level of plagiarism among University of Botswana (UB) students. Students’ assignments were first submitted to the software without their knowledge so as to gauge their level of plagiarism. The results recorded the average level of plagiarism among UB students to be 20.5%. The software was then introduced to the students and they were warned that their second assignments would be checked through the software. The results showed a 4.3% decrease in the level of plagiarism among students. A survey was conducted to find out the reasons why students plagiarise and to get the participants’ views on the use of the software to fight plagiarism. To win the fight against plagiarism, the paper recommends that the university adopt a more comprehensive approach that addresses, among other things, the fundamental reasons why students plagiarise.

144 citations


Proceedings Article
23 Aug 2010
TL;DR: Two recently proposed cross-language plagiarism detection methods are compared to a novel approach to this problem, based on machine translation and monolingual similarity analysis (T+MA), and the effectiveness of the three approaches for less related languages is explored.
Abstract: Plagiarism, the unacknowledged reuse of text, does not end at language boundaries. Cross-language plagiarism occurs if a text is translated from a fragment written in a different language and no proper citation is provided. Regardless of the change of language, the contents and, in particular, the ideas remain the same. Whereas different methods for the detection of monolingual plagiarism have been developed, less attention has been paid to the cross-language case. In this paper we compare two recently proposed cross-language plagiarism detection methods (CL-CNG, based on character n-grams, and CL-ASA, based on statistical translation) to a novel approach to this problem, based on machine translation and monolingual similarity analysis (T+MA). We explore the effectiveness of the three approaches for less related languages. CL-CNG proves not to be appropriate for such language pairs, whereas T+MA performs better than the previously proposed models.
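As a rough illustration of the CL-CNG idea, a character n-gram model can score cross-language similarity without any translation, because related languages share many surface n-grams. This is a minimal sketch, not the authors' implementation; the n-gram size and the cosine measure are illustrative choices:

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Character n-gram profile of a text (lowercased, whitespace removed)."""
    text = "".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram frequency profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Because related languages (e.g. Spanish and English) share character sequences such as `det` and `tec`, the score is non-zero even across languages; for unrelated language pairs that overlap disappears, which is consistent with the degradation the paper reports.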

86 citations


01 Jan 2010
TL;DR: Although the fuzzy semantic-based method can detect some means of obfuscation, it might not work at all levels and future work is to improve it for more detection efficiency and less time complexity, and to advance the post-processing stage to gain more ideal granularity.
Abstract: This report explains our plagiarism detection method using a fuzzy semantic-based string similarity approach. The algorithm was developed through four main stages. First is pre-processing, which includes tokenisation, stemming and stop-word removal. Second is retrieving a list of candidate documents for each suspicious document using shingling and the Jaccard coefficient. Suspicious documents are then compared sentence-wise with the associated candidate documents. This stage entails the computation of a fuzzy degree of similarity that ranges between two edges: 0 for completely different sentences and 1 for exactly identical sentences. Two sentences are marked as similar (i.e. plagiarised) if they gain a fuzzy similarity score above a certain threshold. The last step is post-processing, whereby consecutive sentences are joined to form single paragraphs/sections. Our performance measures on the PAN’09 training corpus for the external plagiarism detection task (recall=0.3097, precision=0.5424, granularity=7.8867) indicate that about 54% of our detections are correct, while we detect only 30% of the plagiarism cases. The performance measures on the PAN’10 test collection are lower (recall=0.1259, precision=0.5761, granularity=3.5828), because our algorithm handles external plagiarism detection only, not intrinsic or cross-language detection. Although our fuzzy semantic-based method can detect some means of obfuscation, it might not work at all levels. Our future work is to improve it for more detection efficiency and less time complexity. In particular, we need to advance the post-processing stage to gain more ideal granularity.
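The candidate-retrieval stage described above (shingling plus the Jaccard coefficient) can be sketched as follows; the shingle size and the threshold are illustrative assumptions, not values taken from the report:

```python
def shingles(text, k=4):
    """Set of k-word shingles (word k-grams) for a document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| used to rank candidates."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def candidates(suspicious, corpus, threshold=0.1):
    """Keep only documents whose shingle overlap with the suspicious
    text exceeds the threshold; these go on to sentence-wise comparison."""
    s = shingles(suspicious)
    return [doc_id for doc_id, text in corpus.items()
            if jaccard(s, shingles(text)) >= threshold]
```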

84 citations


Proceedings ArticleDOI
12 Sep 2010
TL;DR: This paper addresses the problem of clone detection applied to plagiarism detection in the context of source code assignments done by computer science students, and proposes an alignment method using the parallel principle at local resolution (character level) to compute similarities between documents.
Abstract: Clone detection is usually applied in the context of detecting small- to medium-scale fragments of duplicated code in large software systems. In this paper, we address the problem of clone detection applied to plagiarism detection in the context of source code assignments done by computer science students. Plagiarism detection comes with a distinct set of constraints compared with usual clone detection approaches, which influenced the design of the approach we present in this paper. For instance, the source code can be heavily changed at a superficial level (in an attempt to look genuine), yet be functionally very similar. Since assignments turned in by computer science students can be in a variety of languages, we work at the syntactic level and do not consider the source-code semantics. Consequently, the approach we propose is endogenous and makes no assumption about the programming language being analysed. It is based on an alignment method using the parallel principle at local resolution (character level) to compute similarities between documents. We tested our framework on hundreds of real source files, involving a wide array of programming languages (Java, C, Python, PHP, Haskell, bash). Our approach allowed us to discover previously undetected frauds, and to empirically evaluate its accuracy and robustness.

60 citations


Journal ArticleDOI
TL;DR: The plagiarism detection service CrossCheck has been used since October 2008 as part of the paper reviewing process for the Journal of Zhejiang University – Science, and four types of copying or plagiarism were identified.
Abstract: The plagiarism detection service CrossCheck has been used since October 2008 as part of the paper reviewing process for the Journal of Zhejiang University – Science (A & B). Between October 2008 and May 2009, 662 papers were CrossChecked; 151 of these (around 22.8% of submitted papers) were found to contain apparently unreasonable levels of copying or self-plagiarism, and 25.8% of these cases (39 papers) gave rise to serious suspicions of plagiarism and copyright infringement. Four types of copying or plagiarism were identified, in an attempt to reach a consensus on this type of academic misconduct.

59 citations


01 Jan 2010
TL;DR: Five tools for detecting plagiarism in source code texts: JPlag, Marble, moss, Plaggie, and sim are compared with respect to their features and performance.
Abstract: In this paper we compare five tools for detecting plagiarism in source code texts: JPlag, Marble, moss, Plaggie, and sim. The tools are compared with respect to their features and performance. For the performance comparison we carried out two experiments: to compare the sensitivity of the tools to different plagiarism techniques, we applied the tools to a set of intentionally plagiarised programs; to get a picture of the precision of the tools, we ran the tools on several incarnations of a student assignment and compared the top 10s of the results.

59 citations


Journal ArticleDOI
TL;DR: Limits in automatic detection of student plagiarism are investigated and ways on how these issues could be tackled in future systems by applying various natural language processing and information retrieval technologies are proposed.
Abstract: The availability and use of computers in teaching has seen an increase in the rate of plagiarism among students because of the wide availability of electronic texts online. While computer tools that have appeared in the recent years are capable of detecting simple forms of plagiarism, such as copy-paste, a number of recent research studies devoted to evaluation and comparison of plagiarism detection tools revealed that these contain limitations in detecting complex forms of plagiarism such as extensive paraphrasing and use of technical tricks, such as replacing original characters with similar-looking characters from foreign alphabets. This article investigates limitations in automatic detection of student plagiarism and proposes ways on how these issues could be tackled in future systems by applying various natural language processing and information retrieval technologies. A classification of types of plagiarism is presented, and an analysis is provided of the most promising technologies that have the potential of dealing with the limitations of current state-of-the-art systems. Furthermore, the article concludes with a discussion on legal and ethical issues related to the use of plagiarism detection software. The article, hence, provides a "roadmap" for developing the next generation of plagiarism detection systems.

56 citations


01 Jan 2010
TL;DR: This paper describes the approach at the PAN 2010 plagiarism detection competition, and discusses the computational cost of each step of the implementation, including the performance data from two different computers.
Abstract: In this paper we describe our approach at the PAN 2010 plagiarism detection competition. We refer to the system we have used in PAN'09. We then present the improvements we have tried since the PAN'09 competition, and their impact on the results on the development corpus. We describe our experiments with intrinsic plagiarism detection and evaluate them. We then discuss the computational cost of each step of our implementation, including the performance data from two different computers.

54 citations


Journal ArticleDOI
TL;DR: While Sherlock was clearly the overall best hermetic detection system, SafeAssignment performed best in detecting web plagiarism and Turnitin was found to be the most advanced system for detecting semi-automatic forms of plagiarism.
Abstract: Plagiarism has become a serious problem in education, and several plagiarism detection systems have been developed for dealing with this problem. This study provides an empirical evaluation of eight plagiarism detection systems for student essays. We present a categorical hierarchy of the most common types of plagiarism that are encountered in student texts. Our purpose-built test set contains texts in which instances of several commonly utilized plagiaristic techniques have been embedded. While Sherlock was clearly the overall best hermetic detection system, SafeAssignment performed best in detecting web plagiarism. Turnitin was found to be the most advanced system for detecting semi-automatic forms of plagiarism such as the substitution of Cyrillic equivalents for certain characters or the insertion of fake whitespaces. The survey indicates that none of the systems are capable of reliably detecting plagiarism from both local and Internet sources while at the same time being able to identify the technical…

47 citations


Proceedings ArticleDOI
28 Sep 2010
TL;DR: In this paper, the authors proposed a new approach in detecting cross language plagiarism by considering Bahasa Melayu as an input language of the submitted query document and English as a target language of similar, possibly plagiarised documents.
Abstract: As the Internet helps us cross language and cultural borders by providing different types of translation tools, cross-language plagiarism, also known as translation plagiarism, is bound to arise. In this paper, we propose a new approach to detecting cross-language plagiarism. To limit the scale of our proposed system, we consider Bahasa Melayu as the input language of the submitted query document and English as the target language of similar, possibly plagiarised documents. Input documents are translated into English using the Google Translate API before undergoing a pre-processing phase (stemming and removal of stop words). Tokenized documents are sent to the Google AJAX Search API to detect similar documents throughout the World Wide Web. Only the top ten sources retrieved by the Google Search API are considered as candidate source documents. We integrate the use of the Stanford Parser and WordNet to determine the similarity level between the suspicious document and those candidate source documents. After that, a detailed similarity analysis is performed and a report of the results is produced.

01 Jan 2010
TL;DR: This work presents a hybrid system that performs plagiarism detection for translated and non-translated externally as well as intrinsically plagiarized document passages, using heuristic post-processing to arrive at the final detection results.
Abstract: We present our hybrid system for the PAN challenge at CLEF 2010. Our system performs plagiarism detection for translated and non-translated externally as well as intrinsically plagiarized document passages. Our external plagiarism detection approach is formulated as an information retrieval problem, using heuristic post-processing to arrive at the final detection results. For the retrieval step, source documents are split into overlapping blocks which are indexed via a Lucene instance. Suspicious documents are similarly split into consecutive overlapping boolean queries which are performed on the Lucene index to retrieve an initial set of potentially plagiarized passages. For performance reasons queries might get rejected via a heuristic before actually being executed. Candidate hits gathered via the retrieval step are further post-processed by performing sequence analysis on the passages retrieved from the index with respect to the passages used for querying the index. By applying several merge heuristics, bigger blocks are formed from matching sequences. German and Spanish source documents are first translated using word alignment on the Europarl corpus before entering the above detection process. For each word in a translated document several translations are produced. Intrinsic plagiarism detection is done by finding major changes in style, measured via word suffixes, after the documents have been partitioned by a linear text segmentation algorithm. Our approach led us to the third overall rank with an overall score of 0.6948.
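The retrieval step described above, splitting documents into overlapping blocks for indexing and querying, can be sketched independently of Lucene; the block size and step are illustrative assumptions, not values from the paper:

```python
def overlapping_blocks(words, size=50, step=25):
    """Split a token list into overlapping fixed-size blocks, as used for
    indexing source documents and for building the boolean queries."""
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - size, 0) + 1, step)]
```

With a step of half the block size, every passage is covered by at least one block, which is what makes the subsequent sequence analysis on candidate hits possible.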

Proceedings ArticleDOI
19 Jul 2010
TL;DR: The aim of this PhD thesis is to address three of the main problems in the development of better models for automatic plagiarism detection: the adequate identification of good potential sources for a given suspicious text, the detection of plagiarism despite modifications and the generation of standard collections of cases of plagiarisms and text reuse.
Abstract: Plagiarism, the unacknowledged reuse of text, has increased in recent years due to the large amount of texts readily available. For instance, recent studies claim that nowadays a high rate of student reports include plagiarism, making manual plagiarism detection practically infeasible. Automatic plagiarism detection tools assist experts to analyse documents for plagiarism. Nevertheless, the lack of standard collections with cases of plagiarism has prevented accurate comparison of models, making differences hard to appreciate. Seminal efforts on the detection of text reuse [2] have fostered the composition of standard resources for the accurate evaluation and comparison of methods. The aim of this PhD thesis is to address three of the main problems in the development of better models for automatic plagiarism detection: (i) the adequate identification of good potential sources for a given suspicious text; (ii) the detection of plagiarism despite modifications, such as word substitution and paraphrasing (special stress is given to cross-language plagiarism); and (iii) the generation of standard collections of cases of plagiarism and text reuse in order to provide a framework for accurate comparison of models. Regarding difficulties (i) and (ii), we have carried out preliminary experiments over the METER corpus [2]. Given a suspicious document dq and a collection of potential source documents D, the process is divided in two steps. First, a small subset of potential source documents D* in D is retrieved. The documents d in D* are the most related to dq and, therefore, the most likely to include the source of the plagiarised fragments in it. We performed this stage on the basis of the Kullback-Leibler distance, over a subsample of document's vocabularies. Afterwards, a detailed analysis is carried out comparing dq to every d in D* in order to identify potential cases of plagiarism and their source.
This comparison was made on the basis of word n-grams, by considering n = {2, 3}. These n-gram levels are flexible enough to properly retrieve plagiarised fragments and their sources despite modifications [1]. The result is offered to the user to take the final decision. Further experiments were done in both stages in order to compare other similarity measures, such as the cosine measure, the Jaccard coefficient and diverse fingerprinting and probabilistic models. One of the main weaknesses of currently available models is that they are unable to detect cross-language plagiarism. Approaching the detection of this kind of plagiarism is of high relevance, as most of the information published is written in English, and authors in other languages may find it attractive to make use of direct translations. Our experiments, carried out over parallel and comparable corpora, show that models of "standard" cross-language information retrieval are not enough. In fact, if the analysed source and target languages are related in some way (common linguistic ancestors or technical vocabulary), a simple comparison based on character n-grams seems to be a suitable option. However, in those cases where the relation between the implied languages is weaker, other models, such as those based on statistical machine translation, are necessary [3]. We plan to perform further experiments, mainly to approach the detection of cross-language plagiarism. In order to do that, we will use the corpora developed under the framework of the PAN competition on plagiarism detection (cf. PAN@CLEF: http://pan.webis.de). Models that consider cross-language thesauri and comparison of cognates will also be applied.
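The word n-gram comparison with n = {2, 3} described above can be approximated by a containment score; this is a sketch under the assumption that simple set overlap is used, not the thesis' exact measure:

```python
def word_ngrams(text, n):
    """Set of word n-grams of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def containment(suspicious, source, n=3):
    """Fraction of the suspicious text's word n-grams also found in the
    source; a high value flags a potential case despite light rewording."""
    grams = word_ngrams(suspicious, n)
    return len(grams & word_ngrams(source, n)) / len(grams) if grams else 0.0
```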

01 Jan 2010
TL;DR: A cluster-based plagiarism detection method is described, which has been used in the learning management system of SCUT to detect plagiarism in the network engineering related courses and was used to detect external plagiarisms in the PAN-10 competition.
Abstract: In this paper we describe a cluster-based plagiarism detection method, which we have used in the learning management system of SCUT to detect plagiarism in the network engineering related courses And we also used this method to detect external plagiarism in the PAN-10 competition The method is divided into three steps: the first step, called pre-selecting, is to narrow the scope of detection using the successive same fingerprint; the second step, called locating, is to find and merge all fragments between two documents using cluster method; the third step, called post-processing, is to deal with some merging errors Our method ran 19 hours in the PAN-10 competition, and the result ranked the second place, which met our expectation Keywords Plagiarism detection, Similar text, Locating, Cluster

01 Mar 2010
TL;DR: In this paper, the authors proposed adoption of ROUGE and WordNet to plagiarism detection, which includes n-gram co-occurrence statistics, skip-bigram, and longest common subsequence (LCS).
Abstract: With the arrival of the digital era and the Internet, the lack of information control provides an incentive for people to freely use any content available to them. Plagiarism occurs when users fail to credit the original owner for the content referred to, and such behavior leads to violation of intellectual property. Two main approaches to plagiarism detection are fingerprinting and term occurrence; however, one common weakness shared by both approaches, especially fingerprinting, is the incapability to detect modified text plagiarism. This study proposes the adoption of ROUGE and WordNet for plagiarism detection. The former includes n-gram co-occurrence statistics, skip-bigram, and longest common subsequence (LCS), while the latter acts as a thesaurus and provides semantic information. N-gram co-occurrence statistics can detect verbatim copy and certain sentence modification, skip-bigram and LCS are immune from text modification such as simple addition or deletion of words, and WordNet may handle the problem of word substitution.
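Of the three ROUGE components mentioned, LCS is the one that tolerates insertions and deletions. A minimal dynamic-programming sketch (illustrative, not the study's implementation):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists.
    LCS skips over inserted or deleted tokens, so simple additions
    or deletions of words do not hide the reused material."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def rouge_l(suspicious, source):
    """ROUGE-L style recall: LCS length over the suspicious sentence length."""
    a, b = suspicious.lower().split(), source.lower().split()
    return lcs_length(a, b) / len(a) if a else 0.0
```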

Journal ArticleDOI
TL;DR: It is demonstrated that, in its current incarnation, one can easily create a document that passes the plagiarism check regardless of how much copied material it contains; it is shown how to improve the system to avoid such attacks.
Abstract: In recent times, plagiarism detection software has become popular in universities and colleges, in an attempt to stem the tide of plagiarised student coursework. Such software attempts to detect any copied material and identify its source. The most popular such software is Turnitin, a commercial system used by thousands of institutions in more than 100 countries. Here, we expose a loophole in Turnitin's current plagiarism detection process. We demonstrate that, in its current incarnation, one can easily create a document that passes the plagiarism check regardless of how much copied material it contains; we then show how to improve the system to avoid such attacks.

Proceedings ArticleDOI
01 Oct 2010
TL;DR: A plagiarism detection tool named CCS (Code Comparison System) which is based on the Abstract Syntax Tree (AST), which performs well in the code comparison field, and is able to help with the copyright protecting of the source code.
Abstract: The code comparison technology plays a very important part in the work of plagiarism detection and software evaluation. Software plagiarism mainly appears as copy-and-paste, possibly with small modifications that do not change the function of the code, such as replacing the names of methods or variables or reordering the sequence of the statements. This paper introduces a plagiarism detection tool named CCS (Code Comparison System), which is based on the Abstract Syntax Tree (AST). According to the syntax tree's characteristics, CCS calculates their hash values, transforms their storage forms, and then compares them node by node; as a result, the efficiency improves. Moreover, CCS preprocesses a large amount of source code in its database for potential use, which also accelerates plagiarism detection. CCS also takes special measures to reduce mistakes when calculating the hash values of operations like subtraction and division. It performs well in the code comparison field, and is able to help with protecting the copyright of the source code.
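The AST idea can be illustrated with Python's own `ast` module: hashing subtrees by node types only makes the fingerprint immune to renamed identifiers, one of the disguises mentioned above. CCS itself targets other details (storage forms, operator-specific safeguards), so this is only a sketch of the principle:

```python
import ast
import hashlib

def structure_hash(node):
    """Hash a subtree by node types only, so renaming variables or
    methods does not change the fingerprint."""
    parts = [type(node).__name__]
    parts.extend(structure_hash(child) for child in ast.iter_child_nodes(node))
    return hashlib.md5("(".join(parts).encode()).hexdigest()

def same_structure(code_a, code_b):
    """True when two snippets have structurally identical ASTs."""
    return structure_hash(ast.parse(code_a)) == structure_hash(ast.parse(code_b))
```

Renaming `f`/`x` to `g`/`y` leaves the structure hash unchanged, while changing an operator (Add to Mult) changes it.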

Book ChapterDOI
21 Mar 2010
TL;DR: A model for the proper pre-selection of closely related documents in order to perform the exhaustive comparison of texts to determine how similar they are, and experimentally shows that the noise introduced by the length encoding does not decrease importantly the expressiveness of the text.
Abstract: The automatic detection of shared content in written documents –which includes text reuse and its unacknowledged commitment, plagiarism– has become an important problem in Information Retrieval. This task requires exhaustive comparison of texts in order to determine how similar they are. However, such comparison is impossible in those cases where the amount of documents is too high. Therefore, we have designed a model for the proper pre-selection of closely related documents in order to perform the exhaustive comparison afterwards. We use a similarity measure based on word-level n-grams, which proved to be quite effective in many applications. As this approach normally becomes impracticable for real-world large datasets, we propose a method based on a preliminary word-length encoding of texts, substituting a word by its length, providing three important advantages: (i) since the alphabet of the documents is reduced to nine symbols, the space needed to store n-gram lists is reduced; (ii) computation times are decreased; and (iii) length n-grams can be represented in a trie, allowing a more flexible and fast comparison. We experimentally show, on the basis of the perplexity measure, that the noise introduced by the length encoding does not significantly decrease the expressiveness of the text. The method is then tested on two large datasets of co-derivatives and simulated plagiarism.
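A minimal sketch of the word-length encoding described above; capping lengths at nine is an assumption made here to keep the alphabet at the nine symbols the chapter mentions:

```python
def length_encode(text):
    """Replace each word by its length (capped at 9, an assumption here),
    shrinking the alphabet to nine symbols."""
    return "".join(str(min(len(w), 9)) for w in text.split())

def length_ngrams(text, n=8):
    """Set of n-grams over the length-encoded text; these compact keys
    can be stored in a trie for fast pre-selection."""
    enc = length_encode(text)
    return {enc[i:i + n] for i in range(len(enc) - n + 1)}
```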

Proceedings Article
01 May 2010
TL;DR: A newly developed large-scale corpus of artificial plagiarism is presented, useful for the evaluation of intrinsic as well as external plagiarism detection.
Abstract: The simple access to texts on digital libraries and the World Wide Web has led to an increased number of plagiarism cases in recent years, which renders manual plagiarism detection infeasible at large. Various methods for automatic plagiarism detection have been developed whose objective is to assist human experts in the analysis of documents for plagiarism. The methods can be divided into two main approaches: intrinsic and external. Unlike other tasks in natural language processing and information retrieval, it is not possible to publish a collection of real plagiarism cases for evaluation purposes since they cannot be properly anonymized. Therefore, current evaluations found in the literature are incomparable and, very often not even reproducible. Our contribution in this respect is a newly developed large-scale corpus of artificial plagiarism useful for the evaluation of intrinsic as well as external plagiarism detection. Additionally, new detection performance measures tailored to the evaluation of plagiarism detection algorithms are proposed.

Proceedings ArticleDOI
13 Jun 2010
TL;DR: This approach is based on citation analysis and allows duplicate and plagiarism detection even if a document has been paraphrased or translated, since the relative position of citations remains similar.
Abstract: This paper describes a new approach towards detecting plagiarism and scientific documents that have been read but not cited. In contrast to existing approaches, which analyze documents' words but ignore their citations, this approach is based on citation analysis and allows duplicate and plagiarism detection even if a document has been paraphrased or translated, since the relative position of citations remains similar. Although this approach allows in many cases the detection of plagiarized work that could not be detected automatically with the traditional approaches, it should be considered as an extension rather than a substitute. Whereas the known text analysis methods can detect copied or, to a certain degree, modified passages, the proposed approach requires longer passages with at least two citations in order to create a digital fingerprint.
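The core of the citation-based approach, comparing the order in which shared citations appear, can be sketched with a sequence matcher; the measure below is an illustrative stand-in for the paper's actual citation fingerprinting:

```python
from difflib import SequenceMatcher

def citation_similarity(citations_a, citations_b):
    """Compare two documents by the order of their shared citations.
    Translation or paraphrase rewrites the words, but the relative
    position of citations tends to survive, which this exploits."""
    shared = set(citations_a) & set(citations_b)
    seq_a = [c for c in citations_a if c in shared]
    seq_b = [c for c in citations_b if c in shared]
    return SequenceMatcher(None, seq_a, seq_b).ratio()
```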

Book ChapterDOI
20 Sep 2010
TL;DR: A plagiarism detection method composed of five main phases: language normalization, retrieval of candidate documents, classifier training, plagiarism analysis, and post-processing, showing that the method achieved better results with medium and large plagiarized passages.
Abstract: This paper presents a new method for Cross-Language Plagiarism Analysis. Our task is to detect the plagiarized passages in the suspicious documents and their corresponding fragments in the source documents. We propose a plagiarism detection method composed of five main phases: language normalization, retrieval of candidate documents, classifier training, plagiarism analysis, and post-processing. To evaluate our method, we created a corpus containing artificial plagiarism offenses. Two different experiments were conducted; the first one considers only monolingual plagiarism cases, while the second one considers only cross-language plagiarism cases. The results showed that the cross-language experiment achieved 86% of the performance of the monolingual baseline. We also analyzed how the plagiarized text length affects the overall performance of the method. This analysis showed that our method achieved better results with medium and large plagiarized passages.

Posted Content
TL;DR: This study proposes the adoption of ROUGE and WordNet for plagiarism detection; the former includes n-gram co-occurrence statistics, skip-bigram, and longest common subsequence (LCS), while the latter acts as a thesaurus and provides semantic information.
Abstract: With the arrival of the digital era and the Internet, the lack of information control provides an incentive for people to freely use any content available to them. Plagiarism occurs when users fail to credit the original owner for the content referred to, and such behavior leads to violation of intellectual property. Two main approaches to plagiarism detection are fingerprinting and term occurrence; however, one common weakness shared by both approaches, especially fingerprinting, is the incapability to detect modified text plagiarism. This study proposes the adoption of ROUGE and WordNet for plagiarism detection. The former includes n-gram co-occurrence statistics, skip-bigram, and longest common subsequence (LCS), while the latter acts as a thesaurus and provides semantic information. N-gram co-occurrence statistics can detect verbatim copy and certain sentence modification, skip-bigram and LCS are immune from text modification such as simple addition or deletion of words, and WordNet may handle the problem of word substitution.

Journal ArticleDOI
TL;DR: The research in this context involves, first, examining various metrics used for plagiarism detection in program code and, second, selecting an appropriate statistical measure using attribute counting metrics (ATMs) for detecting plagiarism in Java programming assignments.
Abstract: Practical computing courses that involve a significant amount of programming assessment tasks suffer from e-Plagiarism. A pragmatic solution for this problem could be to discourage plagiarism, particularly among beginners in programming. One way to address this is to automate the detection of plagiarized work during the marking phase. Our research in this context involves, first, examining various metrics used for plagiarism detection in program code and, second, selecting an appropriate statistical measure using attribute counting metrics (ATMs) for detecting plagiarism in Java programming assignments. The goal of this investigation is to study the effectiveness of ATMs for detecting plagiarism among assignment submissions of introductory programming courses.

Posted Content
TL;DR: A new approach to detect plagiarism is proposed which integrates the use of fingerprint matching technique with four key features to assist in the detection process and time and space usage for the comparison process is reduced without affecting the effectiveness of the plagiarism detection.
Abstract: As the Internet helps us cross cultural borders by providing access to diverse information, the issue of plagiarism is bound to arise, and plagiarism detection becomes increasingly important. Different plagiarism detection tools have been developed based on various detection techniques, and the fingerprint matching technique now plays an important role in those tools. However, when handling large articles, fingerprint matching has weaknesses, especially in its space and time consumption. In this paper, we propose a new approach to plagiarism detection that integrates the fingerprint matching technique with four key features to assist the detection process. These features select the main points, or key sentences, of the articles to be compared. The selected sentences then undergo fingerprint matching to detect similarity between them. Hence, the time and space used in the comparison process are reduced without affecting the effectiveness of plagiarism detection.
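The fingerprint matching step the abstract builds on can be sketched as hashing character k-grams and comparing the resulting sets. A minimal sketch (the hash choice, k-gram size, and function names are my own assumptions; the paper's four key-sentence features are not reproduced here):

```python
import hashlib

def fingerprints(text, k=5):
    """Hash every character k-gram; the set of hashes is the document fingerprint."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    return {hashlib.md5(text[i:i + k].encode()).hexdigest()
            for i in range(len(text) - k + 1)}

def resemblance(a, b):
    """Jaccard similarity of the two fingerprint sets."""
    fa, fb = fingerprints(a), fingerprints(b)
    return len(fa & fb) / len(fa | fb) if fa | fb else 0.0

print(resemblance("Plagiarism detection is demanding.",
                  "Plagiarism detection is very demanding."))  # high but below 1.0
```

Because every k-gram of the full text is hashed, the fingerprint grows linearly with document length, which is exactly the space/time cost the proposed key-sentence selection aims to cut.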

Proceedings Article
23 Aug 2010
TL;DR: The authors developed software capable of simple plagiarism detection, built a corpus containing 10,100 academic papers in computer science written in English, and created two test sets of papers randomly chosen from the corpus.
Abstract: Plagiarism is the use of the language and thoughts of another work and the representation of them as one's own original work. Various levels of plagiarism exist in many domains in general and in academic papers in particular; therefore, diverse efforts are being made to identify plagiarism automatically. In this research, we developed software capable of simple plagiarism detection. We built a corpus (C) containing 10,100 academic papers in computer science written in English, and two test sets of papers randomly chosen from C. A wide variety of baseline methods was developed to identify identical or similar papers, several of which are novel. The experimental results and their analysis show interesting findings; some of the novel methods are among the best predictive methods.
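One common baseline for identifying identical or similar papers in a corpus is bag-of-words cosine similarity; a sketch of that baseline only (not one of the paper's specific methods, whose details the abstract does not give):

```python
import math
from collections import Counter

def cosine(doc_a, doc_b):
    """Cosine similarity between the term-frequency vectors of two documents."""
    ta, tb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(ta[w] * tb[w] for w in ta)
    norm = (math.sqrt(sum(c * c for c in ta.values()))
            * math.sqrt(sum(c * c for c in tb.values())))
    return dot / norm if norm else 0.0

def most_similar(query, corpus):
    """Rank corpus documents by cosine similarity to the query document."""
    return sorted(corpus, key=lambda d: cosine(query, d), reverse=True)
```

Ranking every corpus paper against a suspicious paper this way gives the candidate list that finer-grained comparison methods would then examine.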

Book ChapterDOI
21 Mar 2010
TL;DR: An approach using an extension of the method Encoplot, which won the 1st international competition on plagiarism detection in 2009, is presented, tested on a large-scale corpus of artificial plagiarism, with good results.
Abstract: Determining the direction of plagiarism (who plagiarized whom in a given pair of documents) is one of the most interesting problems in the field of automatic plagiarism detection. We present here an approach using an extension of the method Encoplot, which won the 1st international competition on plagiarism detection in 2009. We have tested it on a large-scale corpus of artificial plagiarism, with good results.

01 Jan 2010
TL;DR: This year's submission is generated by the same Encoplot method that was developed for last year's competition, with a single improvement.
Abstract: Our submission this year is generated by the same Encoplot method that we developed for last year's competition. There is a single improvement: we additionally compare each suspicious document with every other suspicious document and flag the passages most probably in correspondence as intrinsic plagiarism.

Proceedings ArticleDOI
28 Oct 2010
TL;DR: A source code plagiarism detection technology based on the AST is described that can detect plagiarism accurately even when the plagiarist changes the position of functions.
Abstract: In computer courses, some students copy others' source code and submit it as their own. Researchers have done much work to detect this plagiarism accurately. In this paper, we describe a source code plagiarism detection technology based on the abstract syntax tree (AST) that can detect plagiarism accurately even when the plagiarist changes the position of functions. First, the programs are transformed into ASTs using ANTLR; then, the function subtrees are extracted from the ASTs; finally, the function subtrees are compared using the longest common subsequence (LCS) to obtain the similarity between the programs.
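The final comparison step can be sketched by serializing each function subtree to a sequence of node names and matching every function in one program against its best counterpart in the other, so that reordering functions cannot lower the score. A sketch under those assumptions (the serialization and scoring are my own simplification, not the paper's implementation):

```python
def lcs_len(a, b):
    """Longest common subsequence length of two node-name sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def subtree_sim(f1, f2):
    """Symmetric LCS similarity in [0, 1]: 2*LCS / (|f1| + |f2|)."""
    return 2 * lcs_len(f1, f2) / (len(f1) + len(f2)) if f1 + f2 else 1.0

def program_similarity(funcs_a, funcs_b):
    """Average, over functions of A, of the best subtree match found in B."""
    if not funcs_a or not funcs_b:
        return 0.0
    return sum(max(subtree_sim(f, g) for g in funcs_b) for f in funcs_a) / len(funcs_a)

# Two programs whose identical functions appear in swapped order:
prog_a = [["MethodDecl", "Block", "For", "Assign"], ["MethodDecl", "Block", "Return"]]
prog_b = [["MethodDecl", "Block", "Return"], ["MethodDecl", "Block", "For", "Assign"]]
print(program_similarity(prog_a, prog_b))  # 1.0: function order is irrelevant
```

Comparing subtrees pairwise rather than whole files is what makes the method robust to the function-reordering disguise the abstract highlights.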

22 Sep 2010
TL;DR: In this article, the authors describe their approach at the PAN 2010 plagiarism detection competition and discuss the computational cost of each step of their implementation, including the performance data from two different computers, and present the improvements they have tried since the PAN'09 competition, and their impact on the results on the development corpus.
Abstract: In this paper we describe our approach at the PAN 2010 plagiarism detection competition. We refer to the system we have used in PAN'09. We then present the improvements we have tried since the PAN'09 competition, and their impact on the results on the development corpus. We describe our experiments with intrinsic plagiarism detection and evaluate them. We then discuss the computational cost of each step of our implementation, including the performance data from two different computers.

Journal ArticleDOI
TL;DR: The paper analyzes the current situation in plagiarism detection, examines existing methods and tools for checking plagiarized program code and natural-language text, and proposes an effective, widely usable tool with more precise results.