
Showing papers on "Plagiarism detection published in 2010"


Proceedings Article
23 Aug 2010
TL;DR: Empirical evidence is given that the construction of tailored training corpora for plagiarism detection can be automated, and hence be done on a large scale.
Abstract: We present an evaluation framework for plagiarism detection. The framework provides performance measures that address the specifics of plagiarism detection, and the PAN-PC-10 corpus, which contains 64 558 artificial and 4 000 simulated plagiarism cases, the latter generated via Amazon's Mechanical Turk. We discuss the construction principles behind the measures and the corpus, and we compare the quality of our corpus to existing corpora. Our analysis gives empirical evidence that the construction of tailored training corpora for plagiarism detection can be automated, and hence be done on a large scale.

327 citations


Journal Article
TL;DR: To win the fight against plagiarism, the paper recommends that the university adopt a more comprehensive approach that addresses, among other things, the fundamental reasons why students plagiarise.
Abstract: This paper reports on a pilot project of the Turnitin plagiarism detection software, which was implemented to determine the impact of the software on the level of plagiarism among University of Botswana (UB) students. Students’ assignments were first submitted to the software without their knowledge so as to gauge their level of plagiarism. The results recorded the average level of plagiarism among UB students to be 20.5%. The software was then introduced to the students and they were warned that their second assignments would be checked through the software. The results showed a 4.3% decrease in the level of plagiarism among students. A survey was conducted to find out the reasons why students plagiarise and to get the participants’ views on the use of the software to fight plagiarism. To win the fight against plagiarism, the paper recommends that the university adopt a more comprehensive approach that addresses, among other things, the fundamental reasons why students plagiarise.

144 citations


Proceedings Article
23 Aug 2010
TL;DR: Two recently proposed cross-language plagiarism detection methods are compared to a novel approach to this problem, based on machine translation and monolingual similarity analysis (T+MA), and the effectiveness of the three approaches for less related languages is explored.
Abstract: Plagiarism, the unacknowledged reuse of text, does not end at language boundaries. Cross-language plagiarism occurs if a text is translated from a fragment written in a different language and no proper citation is provided. Regardless of the change of language, the contents and, in particular, the ideas remain the same. Whereas different methods for the detection of monolingual plagiarism have been developed, less attention has been paid to the cross-language case. In this paper we compare two recently proposed cross-language plagiarism detection methods (CL-CNG, based on character n-grams, and CL-ASA, based on statistical translation) to a novel approach to this problem, based on machine translation and monolingual similarity analysis (T+MA). We explore the effectiveness of the three approaches for less related languages. CL-CNG proves not to be appropriate for such language pairs, whereas T+MA performs better than the previously proposed models.
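As a rough illustration of the CL-CNG idea, a character n-gram model can score cross-language similarity without any translation, because related languages share many surface n-grams. This is a minimal sketch, not the authors' implementation; the n-gram size and the cosine measure are illustrative choices:

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Character n-gram profile of a text (lowercased, whitespace removed)."""
    text = "".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram frequency profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Because related languages (e.g. Spanish and English) share character sequences such as `det` and `tec`, the score is non-zero even across languages; for unrelated language pairs that overlap disappears, which is consistent with the degradation the paper reports.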

86 citations


01 Jan 2010
TL;DR: Although the fuzzy semantic-based method can detect some means of obfuscation, it might not work at all levels and future work is to improve it for more detection efficiency and less time complexity, and to advance the post-processing stage to gain more ideal granularity.
Abstract: This report explains our plagiarism detection method using a fuzzy semantic-based string similarity approach. The algorithm was developed through four main stages. First is pre-processing, which includes tokenisation, stemming and stop-word removal. Second is retrieving a list of candidate documents for each suspicious document using shingling and the Jaccard coefficient. Suspicious documents are then compared sentence-wise with the associated candidate documents. This stage entails the computation of a fuzzy degree of similarity that ranges between two edges: 0 for completely different sentences and 1 for exactly identical sentences. Two sentences are marked as similar (i.e. plagiarised) if they gain a fuzzy similarity score above a certain threshold. The last step is post-processing, whereby consecutive sentences are joined to form single paragraphs/sections. Our performance measures on the PAN’09 training corpus for the external plagiarism detection task (recall=0.3097, precision=0.5424, granularity=7.8867) indicate that about 54% of our detections are correct, while we detect only 30% of the plagiarism cases. The performance measures on the PAN’10 test collection are lower (recall=0.1259, precision=0.5761, granularity=3.5828), because our algorithm handles external plagiarism detection only, not intrinsic or cross-language detection. Although our fuzzy semantic-based method can detect some means of obfuscation, it might not work at all levels. Our future work is to improve it for more detection efficiency and less time complexity. In particular, we need to advance the post-processing stage to gain more ideal granularity.
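The candidate-retrieval stage described above (shingling plus the Jaccard coefficient) can be sketched as follows; the shingle size and the threshold are illustrative assumptions, not values taken from the report:

```python
def shingles(text, k=4):
    """Set of k-word shingles (word k-grams) for a document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| used to rank candidates."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def candidates(suspicious, corpus, threshold=0.1):
    """Keep only documents whose shingle overlap with the suspicious
    text exceeds the threshold; these go on to sentence-wise comparison."""
    s = shingles(suspicious)
    return [doc_id for doc_id, text in corpus.items()
            if jaccard(s, shingles(text)) >= threshold]
```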

84 citations


Proceedings ArticleDOI
12 Sep 2010
TL;DR: This paper addresses the problem of clone detection applied to plagiarism detection in the context of source code assignments done by computer science students, and proposes an alignment method using the parallel principle at local resolution (character level) to compute similarities between documents.
Abstract: Clone detection is usually applied in the context of detecting small- to medium-scale fragments of duplicated code in large software systems. In this paper, we address the problem of clone detection applied to plagiarism detection in the context of source code assignments done by computer science students. Plagiarism detection comes with a distinct set of constraints compared with usual clone detection approaches, which influenced the design of the approach we present in this paper. For instance, the source code can be heavily changed at a superficial level (in an attempt to look genuine), yet be functionally very similar. Since assignments turned in by computer science students can be in a variety of languages, we work at the syntactic level and do not consider the source-code semantics. Consequently, the approach we propose is endogenous and makes no assumption about the programming language being analysed. It is based on an alignment method using the parallel principle at local resolution (character level) to compute similarities between documents. We tested our framework on hundreds of real source files, involving a wide array of programming languages (Java, C, Python, PHP, Haskell, bash). Our approach allowed us to discover previously undetected frauds, and to empirically evaluate its accuracy and robustness.

60 citations


Journal ArticleDOI
TL;DR: The plagiarism detection service CrossCheck has been used since October 2008 as part of the paper reviewing process for the Journal of Zhejiang University – Science, and four types of copying or plagiarism were identified.
Abstract: The plagiarism detection service CrossCheck has been used since October 2008 as part of the paper reviewing process for the Journal of Zhejiang University – Science (A & B). Between October 2008 and May 2009, 662 papers were CrossChecked; 151 of these (around 22.8% of submitted papers) were found to contain apparently unreasonable levels of copying or self-plagiarism, and 25.8% of these cases (39 papers) gave rise to serious suspicions of plagiarism and copyright infringement. Four types of copying or plagiarism were identified, in an attempt to reach a consensus on this type of academic misconduct.

59 citations


01 Jan 2010
TL;DR: Five tools for detecting plagiarism in source code texts: JPlag, Marble, moss, Plaggie, and sim are compared with respect to their features and performance.
Abstract: In this paper we compare five tools for detecting plagiarism in source code texts: JPlag, Marble, moss, Plaggie, and sim. The tools are compared with respect to their features and performance. For the performance comparison we carried out two experiments: to compare the sensitivity of the tools to different plagiarism techniques, we applied the tools to a set of intentionally plagiarised programs; to get a picture of the precision of the tools, we ran the tools on several incarnations of a student assignment and compared the top 10s of the results.

59 citations


Journal ArticleDOI
TL;DR: Limits in automatic detection of student plagiarism are investigated and ways on how these issues could be tackled in future systems by applying various natural language processing and information retrieval technologies are proposed.
Abstract: The availability and use of computers in teaching has seen an increase in the rate of plagiarism among students because of the wide availability of electronic texts online. While computer tools that have appeared in the recent years are capable of detecting simple forms of plagiarism, such as copy-paste, a number of recent research studies devoted to evaluation and comparison of plagiarism detection tools revealed that these contain limitations in detecting complex forms of plagiarism such as extensive paraphrasing and use of technical tricks, such as replacing original characters with similar-looking characters from foreign alphabets. This article investigates limitations in automatic detection of student plagiarism and proposes ways on how these issues could be tackled in future systems by applying various natural language processing and information retrieval technologies. A classification of types of plagiarism is presented, and an analysis is provided of the most promising technologies that have the potential of dealing with the limitations of current state-of-the-art systems. Furthermore, the article concludes with a discussion on legal and ethical issues related to the use of plagiarism detection software. The article, hence, provides a "roadmap" for developing the next generation of plagiarism detection systems.

56 citations


01 Jan 2010
TL;DR: This paper describes the approach at the PAN 2010 plagiarism detection competition, and discusses the computational cost of each step of the implementation, including the performance data from two different computers.
Abstract: In this paper we describe our approach at the PAN 2010 plagiarism detection competition. We refer to the system we have used in PAN'09. We then present the improvements we have tried since the PAN'09 competition, and their impact on the results on the development corpus. We describe our experiments with intrinsic plagiarism detection and evaluate them. We then discuss the computational cost of each step of our implementation, including the performance data from two different computers.

54 citations


Journal ArticleDOI
TL;DR: While Sherlock was clearly the overall best hermetic detection system, SafeAssignment performed best in detecting web plagiarism and Turnitin was found to be the most advanced system for detecting semi-automatic forms of plagiarism.
Abstract: Plagiarism has become a serious problem in education, and several plagiarism detection systems have been developed for dealing with this problem. This study provides an empirical evaluation of eight plagiarism detection systems for student essays. We present a categorical hierarchy of the most common types of plagiarism that are encountered in student texts. Our purpose-built test set contains texts in which instances of several commonly utilized plagiaristic techniques have been embedded. While Sherlock was clearly the overall best hermetic detection system, SafeAssignment performed best in detecting web plagiarism. Turnitin was found to be the most advanced system for detecting semi-automatic forms of plagiarism such as the substitution of Cyrillic equivalents for certain characters or the insertion of fake whitespaces. The survey indicates that none of the systems are capable of reliably detecting plagiarism from both local and Internet sources while at the same time being able to identify the technical…

47 citations


Proceedings ArticleDOI
28 Sep 2010
TL;DR: In this paper, the authors proposed a new approach in detecting cross language plagiarism by considering Bahasa Melayu as an input language of the submitted query document and English as a target language of similar, possibly plagiarised documents.
Abstract: As the Internet helps us cross language and cultural borders by providing different types of translation tools, cross-language plagiarism, also known as translation plagiarism, is bound to arise. In this paper, we propose a new approach to detecting cross-language plagiarism. To limit the scale of our proposed system, we consider Bahasa Melayu as the input language of the submitted query document and English as the target language of similar, possibly plagiarised documents. Input documents are translated into English using the Google Translate API before undergoing a pre-processing phase (stemming and removal of stop words). Tokenized documents are sent to the Google AJAX Search API to detect similar documents throughout the World Wide Web. Only the top ten sources retrieved by the Google Search API are considered as candidate source documents. We integrate the use of the Stanford Parser and WordNet to determine the similarity level between the suspicious document and those candidate source documents. After that, a detailed similarity analysis is performed and a report of the results is produced.

01 Jan 2010
TL;DR: This work presents a hybrid system that performs plagiarism detection for translated and non-translated externally as well as intrinsically plagiarized document passages, using heuristic post-processing to arrive at the final detection results.
Abstract: We present our hybrid system for the PAN challenge at CLEF 2010. Our system performs plagiarism detection for translated and non-translated externally as well as intrinsically plagiarized document passages. Our external plagiarism detection approach is formulated as an information retrieval problem, using heuristic post-processing to arrive at the final detection results. For the retrieval step, source documents are split into overlapping blocks which are indexed via a Lucene instance. Suspicious documents are similarly split into consecutive overlapping boolean queries which are performed on the Lucene index to retrieve an initial set of potentially plagiarized passages. For performance reasons queries might get rejected via a heuristic before actually being executed. Candidate hits gathered via the retrieval step are further post-processed by performing sequence analysis on the passages retrieved from the index with respect to the passages used for querying the index. By applying several merge heuristics, bigger blocks are formed from matching sequences. German and Spanish source documents are first translated using word alignment on the Europarl corpus before entering the above detection process. For each word in a translated document several translations are produced. Intrinsic plagiarism detection is done by finding major changes in style, measured via word suffixes, after the documents have been partitioned by a linear text segmentation algorithm. Our approach led us to the third overall rank with an overall score of 0.6948.
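The retrieval step described above, splitting documents into overlapping blocks for indexing and querying, can be sketched independently of Lucene; the block size and step are illustrative assumptions, not values from the paper:

```python
def overlapping_blocks(words, size=50, step=25):
    """Split a token list into overlapping fixed-size blocks, as used for
    indexing source documents and for building the boolean queries."""
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - size, 0) + 1, step)]
```

With a step of half the block size, every passage is covered by at least one block, which is what makes the subsequent sequence analysis on candidate hits possible.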

Proceedings ArticleDOI
19 Jul 2010
TL;DR: The aim of this PhD thesis is to address three of the main problems in the development of better models for automatic plagiarism detection: the adequate identification of good potential sources for a given suspicious text, the detection of plagiarism despite modifications and the generation of standard collections of cases of plagiarisms and text reuse.
Abstract: Plagiarism, the unacknowledged reuse of text, has increased in recent years due to the large amount of texts readily available. For instance, recent studies claim that nowadays a high rate of student reports include plagiarism, making manual plagiarism detection practically infeasible. Automatic plagiarism detection tools assist experts to analyse documents for plagiarism. Nevertheless, the lack of standard collections with cases of plagiarism has prevented accurate comparison of models, making differences hard to appreciate. Seminal efforts on the detection of text reuse [2] have fostered the composition of standard resources for the accurate evaluation and comparison of methods. The aim of this PhD thesis is to address three of the main problems in the development of better models for automatic plagiarism detection: (i) the adequate identification of good potential sources for a given suspicious text; (ii) the detection of plagiarism despite modifications, such as word substitution and paraphrasing (special stress is given to cross-language plagiarism); and (iii) the generation of standard collections of cases of plagiarism and text reuse in order to provide a framework for accurate comparison of models. Regarding difficulties (i) and (ii), we have carried out preliminary experiments over the METER corpus [2]. Given a suspicious document dq and a collection of potential source documents D, the process is divided in two steps. First, a small subset of potential source documents D* in D is retrieved. The documents d in D* are the most related to dq and, therefore, the most likely to include the source of the plagiarised fragments in it. We performed this stage on the basis of the Kullback-Leibler distance, over a subsample of document's vocabularies. Afterwards, a detailed analysis is carried out comparing dq to every d in D* in order to identify potential cases of plagiarism and their source.
This comparison was made on the basis of word n-grams, by considering n = {2, 3}. These n-gram levels are flexible enough to properly retrieve plagiarised fragments and their sources despite modifications [1]. The result is offered to the user to take the final decision. Further experiments were done in both stages in order to compare other similarity measures, such as the cosine measure, the Jaccard coefficient and diverse fingerprinting and probabilistic models. One of the main weaknesses of currently available models is that they are unable to detect cross-language plagiarism. Approaching the detection of this kind of plagiarism is of high relevance, as most of the information published is written in English, and authors in other languages may find it attractive to make use of direct translations. Our experiments, carried out over parallel and comparable corpora, show that models of "standard" cross-language information retrieval are not enough. In fact, if the analysed source and target languages are related in some way (common linguistic ancestors or technical vocabulary), a simple comparison based on character n-grams seems to be a suitable option. However, in those cases where the relation between the implied languages is weaker, other models, such as those based on statistical machine translation, are necessary [3]. We plan to perform further experiments, mainly to approach the detection of cross-language plagiarism. In order to do that, we will use the corpora developed under the framework of the PAN competition on plagiarism detection (cf. PAN@CLEF: http://pan.webis.de). Models that consider cross-language thesauri and comparison of cognates will also be applied.
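The word n-gram comparison with n = {2, 3} described above can be approximated by a containment score; this is a sketch under the assumption that simple set overlap is used, not the thesis' exact measure:

```python
def word_ngrams(text, n):
    """Set of word n-grams of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def containment(suspicious, source, n=3):
    """Fraction of the suspicious text's word n-grams also found in the
    source; a high value flags a potential case despite light rewording."""
    grams = word_ngrams(suspicious, n)
    return len(grams & word_ngrams(source, n)) / len(grams) if grams else 0.0
```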

01 Jan 2010
TL;DR: A cluster-based plagiarism detection method is described, which has been used in the learning management system of SCUT to detect plagiarism in the network engineering related courses and was used to detect external plagiarisms in the PAN-10 competition.
Abstract: In this paper we describe a cluster-based plagiarism detection method, which we have used in the learning management system of SCUT to detect plagiarism in the network engineering related courses And we also used this method to detect external plagiarism in the PAN-10 competition The method is divided into three steps: the first step, called pre-selecting, is to narrow the scope of detection using the successive same fingerprint; the second step, called locating, is to find and merge all fragments between two documents using cluster method; the third step, called post-processing, is to deal with some merging errors Our method ran 19 hours in the PAN-10 competition, and the result ranked the second place, which met our expectation Keywords Plagiarism detection, Similar text, Locating, Cluster

01 Mar 2010
TL;DR: In this paper, the authors proposed adoption of ROUGE and WordNet to plagiarism detection, which includes n-gram co-occurrence statistics, skip-bigram, and longest common subsequence (LCS).
Abstract: With the arrival of the digital era and the Internet, the lack of information control provides an incentive for people to freely use any content available to them. Plagiarism occurs when users fail to credit the original owner for the content referred to, and such behavior leads to violation of intellectual property. Two main approaches to plagiarism detection are fingerprinting and term occurrence; however, one common weakness shared by both approaches, especially fingerprinting, is the incapability to detect modified text plagiarism. This study proposes the adoption of ROUGE and WordNet for plagiarism detection. The former includes n-gram co-occurrence statistics, skip-bigram, and longest common subsequence (LCS), while the latter acts as a thesaurus and provides semantic information. N-gram co-occurrence statistics can detect verbatim copy and certain sentence modification, skip-bigram and LCS are immune from text modification such as simple addition or deletion of words, and WordNet may handle the problem of word substitution.
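Of the three ROUGE components mentioned, LCS is the one that tolerates insertions and deletions. A minimal dynamic-programming sketch (illustrative, not the study's implementation):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists.
    LCS skips over inserted or deleted tokens, so simple additions
    or deletions of words do not hide the reused material."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def rouge_l(suspicious, source):
    """ROUGE-L style recall: LCS length over the suspicious sentence length."""
    a, b = suspicious.lower().split(), source.lower().split()
    return lcs_length(a, b) / len(a) if a else 0.0
```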

Journal ArticleDOI
TL;DR: It is demonstrated that, in its current incarnation, one can easily create a document that passes the plagiarism check regardless of how much copied material it contains; it is shown how to improve the system to avoid such attacks.
Abstract: In recent times, plagiarism detection software has become popular in universities and colleges, in an attempt to stem the tide of plagiarised student coursework. Such software attempts to detect any copied material and identify its source. The most popular such software is Turnitin, a commercial system used by thousands of institutions in more than 100 countries. Here, we expose a loophole in Turnitin's current plagiarism detection process. We demonstrate that, in its current incarnation, one can easily create a document that passes the plagiarism check regardless of how much copied material it contains; we then show how to improve the system to avoid such attacks.

Proceedings ArticleDOI
01 Oct 2010
TL;DR: A plagiarism detection tool named CCS (Code Comparison System) which is based on the Abstract Syntax Tree (AST), which performs well in the code comparison field, and is able to help with the copyright protecting of the source code.
Abstract: The code comparison technology plays a very important part in the work of plagiarism detection and software evaluation. Software plagiarism mainly appears as copy-and-paste, possibly with small modifications that do not change the function of the code, such as replacing the names of methods or variables or reordering the sequence of the statements. This paper introduces a plagiarism detection tool named CCS (Code Comparison System), which is based on the Abstract Syntax Tree (AST). According to the syntax tree's characteristics, CCS calculates their hash values, transforms their storage forms, and then compares them node by node; as a result, the efficiency improves. Moreover, CCS preprocesses a large amount of source code in its database for potential use, which also accelerates plagiarism detection. CCS also takes special measures to reduce mistakes when calculating the hash values of operations like subtraction and division. It performs well in the code comparison field, and is able to help with protecting the copyright of the source code.
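The AST idea can be illustrated with Python's own `ast` module: hashing subtrees by node types only makes the fingerprint immune to renamed identifiers, one of the disguises mentioned above. CCS itself targets other details (storage forms, operator-specific safeguards), so this is only a sketch of the principle:

```python
import ast
import hashlib

def structure_hash(node):
    """Hash a subtree by node types only, so renaming variables or
    methods does not change the fingerprint."""
    parts = [type(node).__name__]
    parts.extend(structure_hash(child) for child in ast.iter_child_nodes(node))
    return hashlib.md5("(".join(parts).encode()).hexdigest()

def same_structure(code_a, code_b):
    """True when two snippets have structurally identical ASTs."""
    return structure_hash(ast.parse(code_a)) == structure_hash(ast.parse(code_b))
```

Renaming `f`/`x` to `g`/`y` leaves the structure hash unchanged, while changing an operator (Add to Mult) changes it.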

Book ChapterDOI
21 Mar 2010
TL;DR: A model for the proper pre-selection of closely related documents in order to perform the exhaustive comparison of texts to determine how similar they are, and experimentally shows that the noise introduced by the length encoding does not decrease importantly the expressiveness of the text.
Abstract: The automatic detection of shared content in written documents –which includes text reuse and its unacknowledged commitment, plagiarism– has become an important problem in Information Retrieval. This task requires exhaustive comparison of texts in order to determine how similar they are. However, such comparison is impossible in those cases where the amount of documents is too high. Therefore, we have designed a model for the proper pre-selection of closely related documents in order to perform the exhaustive comparison afterwards. We use a similarity measure based on word-level n-grams, which proved to be quite effective in many applications. As this approach normally becomes impracticable for real-world large datasets, we propose a method based on a preliminary word-length encoding of texts, substituting a word by its length, providing three important advantages: (i) since the alphabet of the documents is reduced to nine symbols, the space needed to store n-gram lists is reduced; (ii) computation times are decreased; and (iii) length n-grams can be represented in a trie, allowing a more flexible and fast comparison. We experimentally show, on the basis of the perplexity measure, that the noise introduced by the length encoding does not significantly decrease the expressiveness of the text. The method is then tested on two large datasets of co-derivatives and simulated plagiarism.
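A minimal sketch of the word-length encoding described above; capping lengths at nine is an assumption made here to keep the alphabet at the nine symbols the chapter mentions:

```python
def length_encode(text):
    """Replace each word by its length (capped at 9, an assumption here),
    shrinking the alphabet to nine symbols."""
    return "".join(str(min(len(w), 9)) for w in text.split())

def length_ngrams(text, n=8):
    """Set of n-grams over the length-encoded text; these compact keys
    can be stored in a trie for fast pre-selection."""
    enc = length_encode(text)
    return {enc[i:i + n] for i in range(len(enc) - n + 1)}
```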

Proceedings Article
01 May 2010
TL;DR: A newly developed large-scale corpus of artificial plagiarism is presented, useful for the evaluation of intrinsic as well as external plagiarism detection.
Abstract: The simple access to texts on digital libraries and the World Wide Web has led to an increased number of plagiarism cases in recent years, which renders manual plagiarism detection infeasible at large. Various methods for automatic plagiarism detection have been developed whose objective is to assist human experts in the analysis of documents for plagiarism. The methods can be divided into two main approaches: intrinsic and external. Unlike other tasks in natural language processing and information retrieval, it is not possible to publish a collection of real plagiarism cases for evaluation purposes since they cannot be properly anonymized. Therefore, current evaluations found in the literature are incomparable and, very often not even reproducible. Our contribution in this respect is a newly developed large-scale corpus of artificial plagiarism useful for the evaluation of intrinsic as well as external plagiarism detection. Additionally, new detection performance measures tailored to the evaluation of plagiarism detection algorithms are proposed.

Proceedings ArticleDOI
13 Jun 2010
TL;DR: This approach is based on citation analysis and allows duplicate and plagiarism detection even if a document has been paraphrased or translated, since the relative position of citations remains similar.
Abstract: This paper describes a new approach towards detecting plagiarism and scientific documents that have been read but not cited. In contrast to existing approaches, which analyze documents' words but ignore their citations, this approach is based on citation analysis and allows duplicate and plagiarism detection even if a document has been paraphrased or translated, since the relative position of citations remains similar. Although this approach allows in many cases the detection of plagiarized work that could not be detected automatically with the traditional approaches, it should be considered as an extension rather than a substitute. Whereas the known text analysis methods can detect copied or, to a certain degree, modified passages, the proposed approach requires longer passages with at least two citations in order to create a digital fingerprint.
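The core of the citation-based approach, comparing the order in which shared citations appear, can be sketched with a sequence matcher; the measure below is an illustrative stand-in for the paper's actual citation fingerprinting:

```python
from difflib import SequenceMatcher

def citation_similarity(citations_a, citations_b):
    """Compare two documents by the order of their shared citations.
    Translation or paraphrase rewrites the words, but the relative
    position of citations tends to survive, which this exploits."""
    shared = set(citations_a) & set(citations_b)
    seq_a = [c for c in citations_a if c in shared]
    seq_b = [c for c in citations_b if c in shared]
    return SequenceMatcher(None, seq_a, seq_b).ratio()
```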

Book ChapterDOI
20 Sep 2010
TL;DR: A plagiarism detection method composed of five main phases: language normalization, retrieval of candidate documents, classifier training, plagiarism analysis, and post-processing, showing that the method achieved better results with medium and large plagiarized passages.
Abstract: This paper presents a new method for Cross-Language Plagiarism Analysis. Our task is to detect the plagiarized passages in the suspicious documents and their corresponding fragments in the source documents. We propose a plagiarism detection method composed of five main phases: language normalization, retrieval of candidate documents, classifier training, plagiarism analysis, and post-processing. To evaluate our method, we created a corpus containing artificial plagiarism offenses. Two different experiments were conducted; the first one considers only monolingual plagiarism cases, while the second one considers only cross-language plagiarism cases. The results showed that the cross-language experiment achieved 86% of the performance of the monolingual baseline. We also analyzed how the plagiarized text length affects the overall performance of the method. This analysis showed that our method achieved better results with medium and large plagiarized passages.

Posted Content
TL;DR: This study proposes the adoption of ROUGE and WordNet for plagiarism detection; the former includes n-gram co-occurrence statistics, skip-bigram, and longest common subsequence (LCS), while the latter acts as a thesaurus and provides semantic information.
Abstract: With the arrival of the digital era and the Internet, the lack of information control provides an incentive for people to freely use any content available to them. Plagiarism occurs when users fail to credit the original owner for the content referred to, and such behavior leads to violation of intellectual property. Two main approaches to plagiarism detection are fingerprinting and term occurrence; however, one common weakness shared by both approaches, especially fingerprinting, is the incapability to detect modified text plagiarism. This study proposes the adoption of ROUGE and WordNet for plagiarism detection. The former includes n-gram co-occurrence statistics, skip-bigram, and longest common subsequence (LCS), while the latter acts as a thesaurus and provides semantic information. N-gram co-occurrence statistics can detect verbatim copy and certain sentence modification, skip-bigram and LCS are immune from text modification such as simple addition or deletion of words, and WordNet may handle the problem of word substitution.

Journal ArticleDOI
TL;DR: The research in this context involves, first, examining various metrics used for plagiarism detection in program code and, second, selecting an appropriate statistical measure using attribute counting metrics (ATMs) for detecting plagiarism in Java programming assignments.
Abstract: Practical computing courses that involve a significant amount of programming assessment tasks suffer from e-Plagiarism. A pragmatic solution for this problem could be to discourage plagiarism, particularly among beginners in programming. One way to address this is to automate the detection of plagiarized work during the marking phase. Our research in this context involves, first, examining various metrics used for plagiarism detection in program code and, second, selecting an appropriate statistical measure using attribute counting metrics (ATMs) for detecting plagiarism in Java programming assignments. The goal of this investigation is to study the effectiveness of ATMs for detecting plagiarism among assignment submissions of introductory programming courses.

Posted Content
TL;DR: A new approach to detect plagiarism is proposed which integrates the use of fingerprint matching technique with four key features to assist in the detection process and time and space usage for the comparison process is reduced without affecting the effectiveness of the plagiarism detection.
Abstract: As the Internet helps us cross cultural borders by providing access to diverse information, the issue of plagiarism is bound to arise, and plagiarism detection becomes increasingly important. Different plagiarism detection tools have been developed based on various detection techniques, and the fingerprint matching technique now plays an important role in those tools. However, when handling large articles, fingerprint matching has weaknesses, especially in its space and time consumption. In this paper, we propose a new approach to plagiarism detection that integrates the fingerprint matching technique with four key features to assist the detection process. These features select the main points, or key sentences, of the articles to be compared. The selected sentences then undergo fingerprint matching to detect similarity between them. Hence, the time and space used in the comparison process are reduced without affecting the effectiveness of plagiarism detection.
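The fingerprint matching step the abstract builds on can be sketched as hashing character k-grams and comparing the resulting sets. A minimal sketch (the hash choice, k-gram size, and function names are my own assumptions; the paper's four key-sentence features are not reproduced here):

```python
import hashlib

def fingerprints(text, k=5):
    """Hash every character k-gram; the set of hashes is the document fingerprint."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    return {hashlib.md5(text[i:i + k].encode()).hexdigest()
            for i in range(len(text) - k + 1)}

def resemblance(a, b):
    """Jaccard similarity of the two fingerprint sets."""
    fa, fb = fingerprints(a), fingerprints(b)
    return len(fa & fb) / len(fa | fb) if fa | fb else 0.0

print(resemblance("Plagiarism detection is demanding.",
                  "Plagiarism detection is very demanding."))  # high but below 1.0
```

Because every k-gram of the full text is hashed, the fingerprint grows linearly with document length, which is exactly the space/time cost the proposed key-sentence selection aims to cut.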

Proceedings Article
23 Aug 2010
TL;DR: The authors developed software capable of simple plagiarism detection, built a corpus containing 10,100 academic papers in computer science written in English, and created two test sets of papers randomly chosen from the corpus.
Abstract: Plagiarism is the use of the language and thoughts of another work and the representation of them as one's own original work. Various levels of plagiarism exist in many domains in general and in academic papers in particular; therefore, diverse efforts are being made to identify plagiarism automatically. In this research, we developed software capable of simple plagiarism detection. We built a corpus (C) containing 10,100 academic papers in computer science written in English, and two test sets of papers randomly chosen from C. A wide variety of baseline methods was developed to identify identical or similar papers, several of which are novel. The experimental results and their analysis show interesting findings; some of the novel methods are among the best predictive methods.
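One common baseline for identifying identical or similar papers in a corpus is bag-of-words cosine similarity; a sketch of that baseline only (not one of the paper's specific methods, whose details the abstract does not give):

```python
import math
from collections import Counter

def cosine(doc_a, doc_b):
    """Cosine similarity between the term-frequency vectors of two documents."""
    ta, tb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(ta[w] * tb[w] for w in ta)
    norm = (math.sqrt(sum(c * c for c in ta.values()))
            * math.sqrt(sum(c * c for c in tb.values())))
    return dot / norm if norm else 0.0

def most_similar(query, corpus):
    """Rank corpus documents by cosine similarity to the query document."""
    return sorted(corpus, key=lambda d: cosine(query, d), reverse=True)
```

Ranking every corpus paper against a suspicious paper this way gives the candidate list that finer-grained comparison methods would then examine.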

Book ChapterDOI
21 Mar 2010
TL;DR: An approach using an extension of the method Encoplot, which won the 1st international competition on plagiarism detection in 2009, is presented, tested on a large-scale corpus of artificial plagiarism, with good results.
Abstract: Determining the direction of plagiarism (who plagiarized whom in a given pair of documents) is one of the most interesting problems in the field of automatic plagiarism detection. We present here an approach using an extension of the method Encoplot, which won the 1st international competition on plagiarism detection in 2009. We have tested it on a large-scale corpus of artificial plagiarism, with good results.

01 Jan 2010
TL;DR: This year's submission is generated by the same Encoplot method that was developed for last year's competition, with a single improvement.
Abstract: Our submission this year is generated by the same Encoplot method that we developed for last year's competition. There is a single improvement: we additionally compare each suspicious document with every other suspicious document and flag the passages most probably in correspondence as intrinsic plagiarism.

Proceedings ArticleDOI
28 Oct 2010
TL;DR: A source code plagiarism detection technology based on the AST is described that can detect plagiarism accurately even when the plagiarist changes the position of functions.
Abstract: In computer courses, some students copy others' source code and submit it as their own. Researchers have done much work to detect this plagiarism accurately. In this paper, we describe a source code plagiarism detection technology based on the abstract syntax tree (AST) that can detect plagiarism accurately even when the plagiarist changes the position of functions. First, the programs are transformed into ASTs using ANTLR; then, the function subtrees are extracted from the ASTs; finally, the function subtrees are compared using the longest common subsequence (LCS) to obtain the similarity between the programs.
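The final comparison step can be sketched by serializing each function subtree to a sequence of node names and matching every function in one program against its best counterpart in the other, so that reordering functions cannot lower the score. A sketch under those assumptions (the serialization and scoring are my own simplification, not the paper's implementation):

```python
def lcs_len(a, b):
    """Longest common subsequence length of two node-name sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def subtree_sim(f1, f2):
    """Symmetric LCS similarity in [0, 1]: 2*LCS / (|f1| + |f2|)."""
    return 2 * lcs_len(f1, f2) / (len(f1) + len(f2)) if f1 + f2 else 1.0

def program_similarity(funcs_a, funcs_b):
    """Average, over functions of A, of the best subtree match found in B."""
    if not funcs_a or not funcs_b:
        return 0.0
    return sum(max(subtree_sim(f, g) for g in funcs_b) for f in funcs_a) / len(funcs_a)

# Two programs whose identical functions appear in swapped order:
prog_a = [["MethodDecl", "Block", "For", "Assign"], ["MethodDecl", "Block", "Return"]]
prog_b = [["MethodDecl", "Block", "Return"], ["MethodDecl", "Block", "For", "Assign"]]
print(program_similarity(prog_a, prog_b))  # 1.0: function order is irrelevant
```

Comparing subtrees pairwise rather than whole files is what makes the method robust to the function-reordering disguise the abstract highlights.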

22 Sep 2010
TL;DR: In this article, the authors describe their approach at the PAN 2010 plagiarism detection competition and discuss the computational cost of each step of their implementation, including the performance data from two different computers, and present the improvements they have tried since the PAN'09 competition, and their impact on the results on the development corpus.
Abstract: In this paper we describe our approach at the PAN 2010 plagiarism detection competition. We refer to the system we have used in PAN'09. We then present the improvements we have tried since the PAN'09 competition, and their impact on the results on the development corpus. We describe our experiments with intrinsic plagiarism detection and evaluate them. We then discuss the computational cost of each step of our implementation, including the performance data from two different computers.

Journal ArticleDOI
TL;DR: The paper analyzes the current situation in plagiarism detection, examines existing methods and tools for checking plagiarized program code and natural-language text, and proposes an effective, widely usable tool with more precise results.