
Showing papers on "Plagiarism detection published in 2013"


Journal ArticleDOI
TL;DR: The P4P corpus is created, a new resource that uses a paraphrase typology to annotate a subset of the PAN-PC-10 corpus for automatic plagiarism detection, providing critical insights for the improvement of automatic plagiarism detection systems.
Abstract: Although paraphrasing is the linguistic mechanism underlying many plagiarism cases, little attention has been paid to its analysis in the framework of automatic plagiarism detection. Therefore, state-of-the-art plagiarism detectors find it difficult to detect cases of paraphrase plagiarism. In this article, we analyze the relationship between paraphrasing and plagiarism, paying special attention to which paraphrase phenomena underlie acts of plagiarism and which of them are detected by plagiarism detection systems. With this aim in mind, we created the P4P corpus, a new resource that uses a paraphrase typology to annotate a subset of the PAN-PC-10 corpus for automatic plagiarism detection. The results of the Second International Competition on Plagiarism Detection were analyzed in the light of this annotation. The presented experiments show that (i) more complex paraphrase phenomena and a high density of paraphrase mechanisms make plagiarism detection more difficult, (ii) lexical substitutions are the paraphrase mechanisms used the most when plagiarizing, and (iii) paraphrase mechanisms tend to shorten the plagiarized text. For the first time, the paraphrase mechanisms behind plagiarism have been analyzed, providing critical insights for the improvement of automatic plagiarism detection systems.

136 citations


Journal ArticleDOI
TL;DR: In the future, plagiarism detection systems may benefit from combining traditional character-based detection methods with emerging detection approaches, including intrinsic, cross-lingual and citation-based plagiarism detection.
Abstract: The problem of academic plagiarism has been present for centuries. Yet, the widespread dissemination of information technology, including the internet, made plagiarising much easier. Consequently, methods and systems aiding in the detection of plagiarism have attracted much research within the last two decades. Researchers proposed a variety of solutions, which we will review comprehensively in this article. Available detection systems use sophisticated and highly efficient character-based text comparisons, which can reliably identify verbatim and moderately disguised copies. Automatically detecting more strongly disguised plagiarism, such as paraphrases, translations or idea plagiarism, is the focus of current research. Proposed approaches for this task include intrinsic, cross-lingual and citation-based plagiarism detection. Each method offers unique strengths and weaknesses; however, none is currently mature enough for practical use. In the future, plagiarism detection systems may benefit from combining traditional character-based detection methods with these emerging detection approaches.

99 citations


Journal ArticleDOI
TL;DR: The lessons learned at PAN 2010 are reviewed, the method used to construct the corpus is explained, and the work presented here is the first to join the paraphrasing and plagiarism communities.
Abstract: To paraphrase means to rewrite content while preserving the original meaning. Paraphrasing is important in fields such as text reuse in journalism, anonymizing work, and improving the quality of customer-written reviews. This article contributes to paraphrase acquisition and focuses on two aspects that are not addressed by current research: (1) acquisition via crowdsourcing, and (2) acquisition of passage-level samples. The challenge of the first aspect is automatic quality assurance; without such a means the crowdsourcing paradigm is not effective, and without crowdsourcing the creation of test corpora is unacceptably expensive for realistic orders of magnitude. The second aspect addresses the deficit that most of the previous work in generating and evaluating paraphrases has been conducted using sentence-level paraphrases or shorter; these short-sample analyses are limited in terms of application to plagiarism detection, for example. We present the Webis Crowd Paraphrase Corpus 2011 (Webis-CPC-11), which recently formed part of the PAN 2010 international plagiarism detection competition. This corpus comprises passage-level paraphrases with 4067 positive samples and 3792 negative samples that failed our criteria, using Amazon's Mechanical Turk for crowdsourcing. In this article, we review the lessons learned at PAN 2010, and explain in detail the method used to construct the corpus. The empirical contributions include machine learning experiments to explore whether passage-level paraphrases can be identified in a two-class classification problem using paraphrase similarity features, and we find that a k-nearest-neighbor classifier can correctly distinguish between paraphrased and nonparaphrased samples with 0.980 precision at 0.523 recall.
This result implies that just under half of our samples must be discarded (a remaining 0.477 fraction), but our cost analysis shows that the automation we introduce results in an 18% financial saving and over 100 hours of time returned to the researchers when repeating a similar corpus design. On the other hand, when building an unrelated corpus requiring, say, 25% training data for the automated component, we show that the financial outcome is cost neutral, while still returning over 70 hours of time to the researchers. The work presented here is the first to join the paraphrasing and plagiarism communities.
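The k-nearest-neighbor classification step described above can be sketched in a few lines. This is only an illustration of the technique, not the authors' implementation: the two similarity features and the toy training pairs below are invented for the example.

```python
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points (Euclidean distance on the feature vectors)."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

# Toy per-pair similarity features (hypothetical: e.g. n-gram overlap
# and length ratio) with paraphrase / non-paraphrase labels.
train = [
    ([0.82, 0.95], "paraphrase"),
    ([0.75, 0.88], "paraphrase"),
    ([0.90, 0.91], "paraphrase"),
    ([0.10, 0.40], "non-paraphrase"),
    ([0.05, 0.55], "non-paraphrase"),
    ([0.20, 0.35], "non-paraphrase"),
]
print(knn_predict(train, [0.80, 0.90]))  # lands in the paraphrase cluster
print(knn_predict(train, [0.12, 0.45]))  # lands in the non-paraphrase cluster
```

In the paper's setting, the precision/recall trade-off comes from where such a classifier draws the boundary between the two clusters of similarity features.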

97 citations


Journal ArticleDOI
TL;DR: The proposed source code similarity system for plagiarism detection showed promising results compared with the JPlag system in detecting source code similarity when various lexical or structural modifications are applied to plagiarized code.
Abstract: Source code plagiarism is an easy task to commit, but very difficult to detect without proper tool support. Various source code similarity detection systems have been developed to help detect source code plagiarism. Those systems need to recognize a number of lexical and structural source code modifications. For example, by some structural modifications (e.g. modification of control structures, modification of data structures or structural redesign of source code) the source code can be changed in such a way that it almost looks genuine. Most of the existing source code similarity detection systems can be confused when these structural modifications have been applied to the original source code. To be considered effective, a source code similarity detection system must address these issues. To address them, we designed and developed a source code similarity system for plagiarism detection. To demonstrate that the proposed system has the desired effectiveness, we performed a well-known conformism test. The proposed system showed promising results compared with the JPlag system in detecting source code similarity when various lexical or structural modifications are applied to plagiarized code. As a confirmation of these results, an independent-samples t-test revealed a statistically significant difference between the average F-measure values for the test sets used and the experiments performed, in the practically usable range of cut-off threshold values of 35–70%.

94 citations


Journal ArticleDOI
TL;DR: Text mining is performed, exploring the use of words as a linguistic feature for analyzing a document by modeling the writing style present in it; this feature shows promise, achieving reasonable results compared to benchmark models.
Abstract: Plagiarism detection is of special interest to educational institutions, and with the proliferation of digital documents on the Web the use of computational systems for such a task has become important. While traditional methods for automatic detection of plagiarism compute the similarity measures on a document-to-document basis, this is not always possible since the potential source documents are not always available. We do text mining, exploring the use of words as a linguistic feature for analyzing a document by modeling the writing style present in it. The main goal is to discover deviations in the style, looking for segments of the document that could have been written by another person. This can be considered as a classification problem using self-based information where paragraphs with significant deviations in style are treated as outliers. This so-called intrinsic plagiarism detection approach does not need comparison against possible sources at all, and our model relies only on the use of words, so it is not language specific. We demonstrate that this feature shows promise in this area, achieving reasonable results compared to benchmark models.
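A toy version of this intrinsic, self-based outlier idea can be sketched as follows. The two word-level features (mean word length and type-token ratio) and the z-score threshold are illustrative choices for the sketch, not the paper's actual model.

```python
import statistics

def style_vector(paragraph):
    words = paragraph.lower().split()
    return (sum(map(len, words)) / len(words),   # mean word length
            len(set(words)) / len(words))        # type-token ratio

def outlier_paragraphs(paragraphs, z_cut=1.5):
    """Flag paragraphs whose style features deviate strongly from the
    document's own mean: candidate segments written by another person."""
    feats = [style_vector(p) for p in paragraphs]
    flagged = set()
    for dim in range(2):                          # test each style feature
        vals = [f[dim] for f in feats]
        mu, sd = statistics.mean(vals), statistics.pstdev(vals)
        if sd == 0:
            continue
        flagged |= {i for i, v in enumerate(vals) if abs(v - mu) / sd > z_cut}
    return sorted(flagged)

paragraphs = [
    "the cat sat on the mat and the dog ran",
    "the dog ran to the big red barn and hid",
    "a boy and a girl sat by the old oak",
    "notwithstanding considerable methodological heterogeneity"
    " manuscripts demonstrate substantial interdisciplinary convergence",
]
print(outlier_paragraphs(paragraphs))   # flags the stylistically deviant paragraph
```

Note that, as in the paper, no external source documents are needed: the document serves as its own reference distribution.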

82 citations


Journal ArticleDOI
TL;DR: This paper proposes a freely available architecture for plagiarism detection across languages covering the entire process: heuristic retrieval, detailed analysis, and post-processing, and explores the suitability of three cross-language similarity estimation models.
Abstract: Three reasons explain why plagiarism across languages is on the rise: (i) speakers of under-resourced languages often consult documentation in a foreign language, (ii) people immersed in a foreign country can still consult material written in their native language, and (iii) people are often interested in writing in a language different from their native one. Most efforts for automatically detecting cross-language plagiarism depend on a preliminary translation, which is not always available. In this paper we propose a freely available architecture for plagiarism detection across languages covering the entire process: heuristic retrieval, detailed analysis, and post-processing. On top of this architecture we explore the suitability of three cross-language similarity estimation models: Cross-Language Alignment-based Similarity Analysis (CL-ASA), Cross-Language Character n-Grams (CL-CNG), and Translation plus Monolingual Analysis (T+MA); three models inherently different in nature and required resources. The three models are tested extensively under the same conditions on the different plagiarism detection sub-tasks, something never done before. The experiments show that T+MA produces the best results, closely followed by CL-ASA. Still, CL-ASA obtains higher values of precision, an important factor in plagiarism detection when less user intervention is desired.
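Of the three models, CL-CNG is simple enough to sketch: represent each document as a bag of character n-grams and compare with cosine similarity, exploiting the n-grams that cognate-rich language pairs share. The snippet below is an illustrative reduction (3-grams, accents stripped by hand), not the evaluated system.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Bag of character n-grams over the normalized text."""
    t = "".join(c for c in text.lower() if c.isalnum())   # drop spaces/punctuation
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))

def cosine(a, b):
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

en = "the plagiarism detection system"
es = "el sistema de deteccion de plagio"   # accents omitted for the sketch
unrelated = "quick brown foxes jump high"
# the cognate pair shares n-grams ("pla", "det", "tem", ...); the unrelated text does not
print(cosine(char_ngrams(en), char_ngrams(es)) >
      cosine(char_ngrams(en), char_ngrams(unrelated)))
```

This is also why CL-CNG works without any translation resources, at the cost of being weaker for typologically distant language pairs.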

80 citations


01 Jan 2013
TL;DR: A characterization of evaluation metrics to grade programming assignments is provided as a first step toward a grading model, and new paths in this research field are proposed.
Abstract: Automatic grading of programming assignments is an important topic in academic research. It aims at improving the level of feedback given to students and optimizing professor time. Several studies have reported the development of software tools to support this process. It is therefore helpful to get a quick, clear overview of their key features. This paper reviews an ample set of tools for automatic grading of programming assignments. They are divided into the most important mature tools, which have remarkable features, and those built recently, with new features. The review includes the definition and description of key features, e.g. supported languages, used technology, infrastructure, etc. The two kinds of tools allow making a temporal comparative analysis. This analysis shows good improvements in this research field; these include security, more language support, plagiarism detection, etc. On the other hand, the lack of a grading model for assignments is identified as an important gap in the reviewed tools. Thus, a characterization of evaluation metrics to grade programming assignments is provided as a first step toward such a model. Finally, new paths in this research field are proposed.

74 citations


Book ChapterDOI
23 Sep 2013
TL;DR: This paper outlines the concepts and achievements of the evaluation lab on digital text forensics, PAN '13, which called for original research and development on plagiarism detection, author identification, and author profiling, and presents a standardized evaluation framework for each of the three tasks.
Abstract: This paper outlines the concepts and achievements of our evaluation lab on digital text forensics, PAN '13, which called for original research and development on plagiarism detection, author identification, and author profiling. We present a standardized evaluation framework for each of the three tasks and discuss the evaluation results of the altogether 58 submitted contributions. For the first time, instead of accepting the output of software runs, we collected the software submissions themselves and ran them on a computer cluster at our site. As evaluation and experimentation platform we use TIRA, which is being developed at the Webis Group in Weimar. TIRA can handle large-scale software submissions by means of virtualization, sandboxed execution, tailored unit testing, and staged submission. In addition to the achieved evaluation results, a major achievement of our lab is that we now have the largest collection of state-of-the-art approaches with regard to the mentioned tasks for further analysis at our disposal.

74 citations


Proceedings ArticleDOI
Dong-Kyu Chae, Jiwoon Ha, Sang-Wook Kim, BooJoong Kang, Eul Gyu Im
27 Oct 2013
TL;DR: This paper proposes a software plagiarism detection system using an API-labeled control flow graph (A-CFG) that abstracts the functionalities of a program and demonstrates the effectiveness and the scalability of the system compared with existing methods.
Abstract: As plagiarism of software increases rapidly, there are growing needs for software plagiarism detection systems. In this paper, we propose a software plagiarism detection system using an API-labeled control flow graph (A-CFG) that abstracts the functionalities of a program. The A-CFG can reflect both the sequence and the frequency of APIs, while previous work rarely considers both of them together. To perform a scalable comparison of a pair of A-CFGs, we use random walk with restart (RWR) that computes an importance score for each node in a graph. By the RWR, we can generate a single score vector for an A-CFG and can also compare A-CFGs by comparing their score vectors. Extensive evaluations on a set of Windows applications demonstrate the effectiveness and the scalability of our proposed system compared with existing methods.
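The random-walk-with-restart scoring can be sketched on a toy API-labeled graph. Everything below (the API names, the restart probability, the cosine comparison) is illustrative; the paper's actual A-CFGs are extracted from real binaries.

```python
import math

def rwr_scores(graph, restart, c=0.15, iters=200):
    """Random walk with restart: per-node steady-state visit scores.
    Each step: follow an out-edge with prob 1-c, or jump back to
    `restart` with prob c; dangling nodes also return to `restart`."""
    nodes = sorted(graph)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = dict.fromkeys(nodes, 0.0)
        for n in nodes:
            out = graph[n]
            if out:
                share = (1 - c) * score[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                nxt[restart] += (1 - c) * score[n]
        nxt[restart] += c
        score = nxt
    return [score[n] for n in nodes]

def cos(u, v):
    return sum(a * b for a, b in zip(u, v)) / (
        math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical A-CFGs over the same API labels; g2 only reorders one
# edge (a plausible disguise), while g3 is structurally unrelated.
g1 = {"open": ["read"], "read": ["read", "write"], "write": ["close"], "close": []}
g2 = {"open": ["read"], "read": ["write"], "write": ["read", "close"], "close": []}
g3 = {"open": ["close"], "read": [], "write": [], "close": ["open"]}
v1, v2, v3 = (rwr_scores(g, "open") for g in (g1, g2, g3))
print(cos(v1, v2) > cos(v1, v3))   # similar graphs yield closer score vectors
```

Reducing each graph to a single score vector is what makes the pairwise comparison scalable: comparing vectors is linear, whereas graph matching is not.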

64 citations


Journal ArticleDOI
TL;DR: An evaluation of Turnitin's use with staff on a first-year undergraduate module within the psychology department at a UK university indicated that it has the potential to be a very valuable asset for plagiarism detection and electronic marking.
Abstract: The aim of this project was to pilot plagiarism detection software and online marking, evaluating its use with staff on a first year undergraduate module within the psychology department at a UK university. One hundred and sixty undergraduate psychology students submitted three assignments via Turnitin, and staff used the software to check for instances of academic misconduct and marked submissions using the Grade Mark feature in the software, providing online feedback to students. Eleven members of teaching staff took part in focus groups to gain insight into their experiences of using Turnitin in this manner, and this paper reports the findings. Results indicated that staff identified several strengths but also several weaknesses to the implementation of Turnitin and Grade Mark. The Originality Check feature received very positive evaluations due to its capacity to provide a clear and timely indicator of plagiarism levels in assignments and a useful formative learning tool for students from an educator perspective. Staff did however encounter some technical difficulties when using the software. In conclusion, for staff, the benefits of using Turnitin were clear, and it has the potential to be a very valuable asset for plagiarism detection and electronic marking.

60 citations


Journal ArticleDOI
TL;DR: In this article, the authors explored the use of a plagiarism detection system to deter digital plagiarism and found that when students were aware that their work would be run through a detection system, they were less inclined to plagiarize.
Abstract: Computer technology and the Internet now make plagiarism an easier enterprise. As a result, faculty must be more diligent in their efforts to deter the practice, and institutions of higher education must provide the leadership and support to ensure a context of academic integrity. This study explored the use of a plagiarism detection system to deter digital plagiarism. Findings suggest that when students were aware that their work would be run through a detection system, they were less inclined to plagiarize. These findings suggest that, regardless of class standing, gender, and college major, recognition by the instructor of the nature and extent of the plagiarism problem and acceptance of responsibility for deterring it are pivotal in reducing the problem.

Book ChapterDOI
24 Mar 2013
TL;DR: Experimental results indicate that the proposed graph-based approach is a good alternative for cross-language plagiarism detection and compared with two state-of-the-art models.
Abstract: Cross-language plagiarism refers to the type of plagiarism where the source and suspicious documents are in different languages. Plagiarism detection across languages is still in its infancy. In this article, we propose a new graph-based approach that uses a multilingual semantic network to compare document paragraphs in different languages. In order to investigate the proposed approach, we used the German-English and Spanish-English cross-language plagiarism cases of the PAN-PC'11 corpus. We compared the obtained results with two state-of-the-art models. Experimental results indicate that our graph-based approach is a good alternative for cross-language plagiarism detection.

Journal ArticleDOI
TL;DR: In this article, the authors discuss the pedagogical implications and suggest that the contextual reasons for plagiarism require focus primarily on study strategies, whereas the intentional reasons require profound discussion about attitudes and conceptions of good learning and university-level study habits.
Abstract: The focus of this article is university teachers’ and students’ views of plagiarism, plagiarism detection, and the use of plagiarism detection software as learning support. The data were collected from teachers and students who participated in a pilot project to test plagiarism detection software at a major university in Finland. The data were analysed through factor analysis, T-tests and inductive content analysis. Three distinct reasons for plagiarism were identified: intentional, unintentional and contextual. The teachers did not utilise plagiarism detection to support student learning to any great extent. We discuss the pedagogical implications and suggest that the contextual reasons for plagiarism require focus primarily on study strategies, whereas the intentional reasons require profound discussion about attitudes and conceptions of good learning and university-level study habits.

Journal ArticleDOI
TL;DR: A novel method for detecting likely portions of reused text that is able to detect common actions performed by plagiarists, such as word deletion, insertion and transposition, and that represents the identified reused text by means of a set of features denoting its degree of plagiarism, relevance and fragmentation.
Abstract: An important task in plagiarism detection is determining and measuring similar text portions between a given pair of documents. One of the main difficulties of this task resides in the fact that reused text is commonly modified with the aim of covering or camouflaging the plagiarism. Another difficulty is that not all similar text fragments are examples of plagiarism, since thematic coincidences also tend to produce portions of similar text. In order to tackle these problems, we propose a novel method for detecting likely portions of reused text. This method is able to detect common actions performed by plagiarists, such as word deletion, insertion and transposition, allowing us to obtain plausible portions of reused text. We also propose representing the identified reused text by means of a set of features that denote its degree of plagiarism, relevance and fragmentation. This new representation aims to facilitate the recognition of plagiarism by considering diverse characteristics of the reused text during the classification phase. Experimental results employing a supervised classification strategy showed that the proposed method is able to outperform traditionally used approaches.
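One cheap way to approximate this kind of edit-tolerant matching, shown purely for illustration (the paper's method is its own algorithm, and unlike this sketch it also handles transpositions), is to extract the word spans that survive insertions and deletions:

```python
import difflib

def reused_spans(source, suspicious, min_words=4):
    """Return word spans of `source` that reappear in `suspicious`,
    surviving insertion/deletion of surrounding words."""
    s, t = source.lower().split(), suspicious.lower().split()
    matcher = difflib.SequenceMatcher(a=s, b=t, autojunk=False)
    return [" ".join(s[m.a:m.a + m.size])
            for m in matcher.get_matching_blocks() if m.size >= min_words]

src = "the results of the experiment clearly show a significant improvement over the baseline"
sus = "notably the results of the experiment show a significant improvement over all prior baselines"
print(reused_spans(src, sus))
```

The deleted word ("clearly") and the inserted one ("notably") break the text into two surviving spans, which is exactly the fragmentation the paper's feature set measures.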

Journal ArticleDOI
TL;DR: The results indicate that disciplinary differences do exist in terms of the degree of matching text incidences and that the greater the number of authors an article has the more consecutive text-matching can be observed in their published works.

01 Jan 2013
TL;DR: This paper describes the approach at the PAN@CLEF2013 plagiarism detection competition, and proposes a method based on sentence similarity to extract the keywords of suspicious documents as queries to retrieve the plagiarism source document.
Abstract: In this paper, we describe our approach at the PAN@CLEF2013 plagiarism detection competition. In the Source Retrieval sub-task, a method combining TF-IDF, PatTree and Weighted TF-IDF to extract the keywords of suspicious documents as queries to retrieve the plagiarism source documents is proposed. In the Text Alignment sub-task, a method based on sentence similarity is presented. Our text alignment algorithm and similar-sentence merging algorithm, called the Bilateral Alternating Merging Algorithm, are described in detail.
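The TF-IDF leg of the query-building step can be illustrated with a toy scorer (the PatTree and weighted variants are beyond a sketch, and the document and background texts here are invented):

```python
import math
from collections import Counter

def tfidf_keywords(doc, background, k=5):
    """Pick the top-k TF-IDF-scored words of `doc` as a retrieval query,
    using `background` as a tiny document collection for IDF."""
    tokens = doc.lower().split()
    tf = Counter(tokens)
    n_docs = len(background) + 1
    def idf(word):
        df = 1 + sum(word in b.lower().split() for b in background)
        return math.log(n_docs / df)
    return sorted(tf, key=lambda w: tf[w] * idf(w), reverse=True)[:k]

doc = "the neural network model detects plagiarism in the suspicious document"
background = [
    "the quick brown fox",
    "the lazy dog sleeps in the sun",
    "a document about the weather",
]
print(tfidf_keywords(doc, background))   # distinctive content words rank first
```

Words that appear in every background document (like "the") score zero and never reach the query, which is the whole point of weighting TF by IDF before querying a search engine for candidate sources.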

Journal ArticleDOI
TL;DR: Overall, institutional policies on self-plagiarism did not exist, faculty did not clearly understand the concept and believed their students did not either, and faculty assumed students had previously been educated on plagiarism as well as self-plagiarism.
Abstract: The purpose of this research study was to evaluate faculty perceptions regarding student self-plagiarism or recycling of student papers. Although there is a plethora of information on plagiarism and faculty who self-plagiarize in publications, there is very little research on how faculty members perceive students re-using all or part of a previously completed assignment in a second assignment. With the wide use of plagiarism detection software, this issue becomes even more crucial. A population of 340 faculty members from two private universities at three different sites was surveyed in Fall 2012 semester regarding their perceptions of student self-plagiarism. A total of 89 faculty responded for a return rate of 26.2 %. Overall, institutional policies on self-plagiarism did not exist and faculty did not clearly understand the concept and believed their students did not either. Although faculty agreed students need to be educated on self-plagiarism, faculty assumed students had previously been educated on plagiarism as well as self-plagiarism; only 13 % ensured students understood this concept.

Proceedings ArticleDOI
28 Jul 2013
TL;DR: State-of-the-art plagiarism detection approaches capably identify copy & paste and to some extent slightly modified plagiarism but cannot reliably identify strongly disguised plagiarism forms, including paraphrases, translated plagiarism, and idea plagiarism.
Abstract: State-of-the-art plagiarism detection approaches capably identify copy & paste and, to some extent, slightly modified plagiarism. However, they cannot reliably identify strongly disguised plagiarism forms, including paraphrases, translated plagiarism, and idea plagiarism, which are forms of plagiarism more commonly found in scientific texts. This weakness of current systems results in a large fraction of today's scientific plagiarism going undetected.

Proceedings ArticleDOI
20 Mar 2013
TL;DR: The results showed that the best performance of fingerprint algorithm was 92.8% while Winnowing algorithm's best performance was 91.8%.
Abstract: Plagiarism detection has been widely discussed in recent years. Various approaches have been proposed, such as text-similarity calculation, structural approaches, and fingerprinting. In fingerprint approaches, small parts of a document are taken to be matched with other documents. In this paper, the fingerprint and Winnowing algorithms are applied to detect plagiarism in scientific articles in Bahasa Indonesia. Plagiarism between two documents is classified by a Dice coefficient at a certain threshold value. The results showed that the best performance of the fingerprint algorithm was 92.8%, while the Winnowing algorithm's best performance was 91.8%. The level-of-relevance-to-topic analysis showed that the Winnowing algorithm achieved a stronger term correlation of 37.1%, compared to 33.6% for the fingerprint algorithm.
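The fingerprinting pipeline such systems build on can be sketched: hash all k-grams of the normalized text, winnow them down to per-window minima, and compare fingerprint sets with the Dice coefficient. The parameter values and texts below are illustrative, not the paper's configuration.

```python
def winnow(text, k=5, w=4):
    """Winnowing fingerprint: hash every character k-gram, then keep the
    minimum hash of each window of w consecutive hashes.
    (Python's built-in hash is stable only within a single run.)"""
    t = "".join(text.lower().split())
    hashes = [hash(t[i:i + k]) for i in range(len(t) - k + 1)]
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}

def dice(a, b):
    """Dice coefficient between two fingerprint sets."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

d1 = "plagiarism detection compares document fingerprints"
d2 = "plagiarism detection compares document fingerprints efficiently"
d3 = "completely different text about gardening and cooking"
print(dice(winnow(d1), winnow(d2)) > dice(winnow(d1), winnow(d3)))
```

Winnowing's guarantee is that any sufficiently long shared substring contributes at least one shared fingerprint, while the per-window selection keeps fingerprints far smaller than the full k-gram set.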

Proceedings ArticleDOI
01 Nov 2013
TL;DR: By introducing dynamic data flow analysis into birthmark generation, DKISB is able to produce a high quality birthmark that is closely correlated to program semantics, making it resilient to various kinds of semantic-preserving code obfuscation techniques.
Abstract: With the burst of open source software, software plagiarism has become a serious threat to the healthy development of the software industry. A software birthmark, reflecting intrinsic properties of software, is an effective way to detect software theft. However, most existing software birthmarks face a series of challenges: (1) the absence of source code, (2) the diversity of operating systems and programming languages, and (3) various automated code obfuscation techniques. In this paper, a dynamic key instruction sequence based software birthmark (DKISB) is proposed. By introducing dynamic data flow analysis into birthmark generation, we are able to produce a high quality birthmark that is closely correlated to program semantics, making it resilient to various kinds of semantic-preserving code obfuscation techniques. Based on the Pin instrumentation framework, a DKISB based software plagiarism detection system is implemented, which generates birthmarks for both the plaintiff and defendant programs, and then makes the plagiarism decision according to the similarity of their birthmarks. The experimental results show that DKISB is effective against both weak obfuscation techniques like compiler optimization and strong obfuscation techniques provided by tools such as SandMark.

Book ChapterDOI
30 Aug 2013
TL;DR: The proposed solution based on simhash document fingerprints essentially reduces the problem to a secure XOR computation between two bit vectors, which improves the computational and communication costs by at least one order of magnitude compared to the current state-of-the-art protocol.
Abstract: Similar document detection is a well-studied problem with important application domains, such as plagiarism detection, document archiving, and patent/copyright protection. Recently, the research focus has shifted towards the privacy-preserving version of the problem, in which two parties want to identify similar documents within their respective datasets. These methods apply to scenarios such as patent protection or intelligence collaboration, where the contents of the documents at both parties should be kept secret. Nevertheless, existing protocols on secure similar document detection suffer from high computational and/or communication costs, which renders them impractical for large datasets. In this work, we introduce a solution based on simhash document fingerprints, which essentially reduce the problem to a secure XOR computation between two bit vectors. Our experimental results demonstrate that the proposed method improves the computational and communication costs by at least one order of magnitude compared to the current state-of-the-art protocol. Moreover, it achieves a high level of precision and recall.
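A plain (non-private) version of the underlying fingerprint arithmetic looks like this; the secure two-party computation layer is the paper's contribution and is not shown. Hashing tokens via MD5 is an arbitrary choice for the sketch.

```python
import hashlib

def simhash(text, bits=64):
    """64-bit simhash: every token votes +1/-1 on each bit position
    according to its own hash; fingerprint bit = sign of the column sum."""
    votes = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")   # the XOR at the heart of the protocol

d1 = "secure detection of similar documents across two private collections"
d2 = "secure detection of similar documents across two private archives"
d3 = "an unrelated recipe for baking sourdough bread at home"
# near-duplicates land close in Hamming space, unrelated texts far apart
print(hamming(simhash(d1), simhash(d2)), hamming(simhash(d1), simhash(d3)))
```

Because similarity reduces to the Hamming weight of an XOR of two short bit vectors, the secure version only needs to evaluate that XOR obliviously, which is what makes the protocol an order of magnitude cheaper than comparing full documents.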

01 Jan 2013
TL;DR: The process, optimized with high-performance C/C++ multi-core programming techniques, yielded the best speed, even though the tests were run on single-core machines, so even better runtimes can be expected.
Abstract: This paper describes the process and basics of the Text Alignment Module in the CoReMo 2.1 Plagiarism Detector, which won the Plagiarism Detection Text Alignment task in the PAN 2013 edition for both evaluation criteria of efficacy and efficiency, achieving the best detections and the best runtime. Its high detection efficacy is mainly due to the special features of the contextual n-grams, evolved into surrounding-context and odd-even skip n-grams. When combined, the matching opportunity increases, especially when translations or paraphrases happen, while keeping the highly discriminative feature that simplifies the accurate location of plagiarized sections. The process, optimized by high-performance C/C++ multi-core programming techniques, yielded the best speed, but the tests were run on single-core machines, so much better runtimes can be expected.
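One plausible reading of the odd-even skip n-grams, sketched below with invented token streams (the actual CoReMo definition may differ), is to collect n-grams both from the full token sequence and from the even- and odd-position subsequences:

```python
def skip_ngrams(tokens, n=3):
    """Contiguous n-grams plus n-grams over every other token
    (even and odd positions), so a single-word substitution can
    still leave n-grams intact in one of the skip streams."""
    grams = set()
    for stream in (tokens, tokens[0::2], tokens[1::2]):
        grams |= {tuple(stream[i:i + n]) for i in range(len(stream) - n + 1)}
    return grams

def contiguous(tokens, n=3):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

a = "plagiarism detection requires robust matching".split()
b = "plagiarism identification requires robust matching".split()
# the even-position stream recovers a match that the substitution destroyed
print(len(skip_ngrams(a) & skip_ngrams(b)) > len(contiguous(a) & contiguous(b)))
```

This is the intuition behind the claimed robustness to paraphrase and translation: skip streams trade a little discriminative power for extra matching opportunities.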

Journal ArticleDOI
TL;DR: This paper investigates an unsupervised feature learning technique called the sparse auto-encoder as a method of extracting features from source code files, and shows that its performance is very close to state-of-the-art techniques in the source code identification field.

Dissertation
01 Jan 2013
TL;DR: Man Yan Miranda Chong A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy in 2013.
Abstract: Man Yan Miranda Chong A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy 2013

Proceedings ArticleDOI
09 Sep 2013
TL;DR: This paper introduces a technique based on the Abstract Syntax Tree (AST) that can effectively detect plagiarism cases such as changing the names of methods and variables in the code and reordering code sequences.
Abstract: Plagiarism detection technology plays a very important role in the copyright protection of computer software. Plagiarism detection technologies mainly include text-based, token-based and syntax-based approaches. This paper introduces a technique based on the Abstract Syntax Tree (AST). The AST-based algorithm can effectively detect plagiarism cases such as changing the names of methods and variables in the code, reordering code sequences, and so on. Following this algorithm, we calculate hash values for every node in the AST, store the AST, and then compare the hash values node by node. Finally, we use experiments to illustrate the superiority of the AST algorithm.
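The name-insensitive node hashing described above can be sketched with Python's own ast module (the paper targets its own language and hashing scheme; this is only an analogy):

```python
import ast

def node_hash(node):
    """Hash an AST node from its type plus its children's hashes.
    Identifier names and literal values are not AST children, so
    renaming variables/methods leaves the hash unchanged."""
    children = tuple(node_hash(c) for c in ast.iter_child_nodes(node))
    return hash((type(node).__name__, children))

original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s"
renamed = "def acc(vals):\n    r = 0\n    for v in vals:\n        r += v\n    return r"
different = "def square(n):\n    return n * n"

print(node_hash(ast.parse(original)) == node_hash(ast.parse(renamed)))    # same tree shape
print(node_hash(ast.parse(original)) == node_hash(ast.parse(different)))  # different shape
```

Hashing each subtree bottom-up lets the node-by-node comparison skip entire matching subtrees in constant time, which is what makes the AST approach practical on whole programs.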

01 Jan 2013
TL;DR: A modified three-way search methodology for the Source Retrieval subtask is presented and snippet similarity performance is analysed; the presented approach is adaptable to real-world plagiarism situations.
Abstract: This paper describes approaches used for the Plagiarism Detection task in the PAN 2013 international competition on uncovering plagiarism, authorship, and social software misuse. We present a modified three-way search methodology for the Source Retrieval subtask and analyse snippet similarity performance. The results show that the presented approach is adaptable to real-world plagiarism situations. For the Detailed Comparison task, we discuss feature type selection and global postprocessing. The resulting performance is significantly better with the described modifications, and further improvement is still possible.

Dissertation
12 Jun 2013
TL;DR: In this article, the authors investigated whether plagiarism involves an intention to deceive, and, in this case, whether forensic linguistic evidence can provide clues to this intentionality, and also evaluated current computational approaches to plagiarism detection, and identified strategies that these systems fail to detect.
Abstract: This study investigates plagiarism detection, with an application in forensic contexts. Two types of data were collected for the purposes of this study. Data in the form of written texts were obtained from two Portuguese universities and from a Portuguese newspaper. These data are analysed linguistically to identify instances of verbatim, morpho-syntactical, lexical and discursive overlap. Survey data were obtained from two higher education institutions in Portugal, and another two in the United Kingdom. These data are analysed using a 2 by 2 between-groups univariate analysis of variance (ANOVA), to reveal cross-cultural divergences in the perceptions of plagiarism. The study discusses the legal and social circumstances that may contribute to adopting a punitive approach to plagiarism, or, conversely, rejecting punishment. The research adopts a critical approach to plagiarism detection. On the one hand, it describes the linguistic strategies adopted by plagiarists when borrowing from other sources, and, on the other hand, it discusses the relationship between these instances of plagiarism and the context in which they appear. A focus of this study is whether plagiarism involves an intention to deceive, and, if so, whether forensic linguistic evidence can provide clues to this intentionality. It also evaluates current computational approaches to plagiarism detection, and identifies strategies that these systems fail to detect. Specifically, a method is proposed for detecting translingual plagiarism. The findings indicate that, although cross-cultural aspects influence the different perceptions of plagiarism, a distinction needs to be made between intentional and unintentional plagiarism. The linguistic analysis demonstrates that linguistic elements can contribute to finding clues to the plagiarist's intentionality.
Furthermore, the findings show that translingual plagiarism can be detected by using the method proposed, and that plagiarism detection software can be improved using existing computer tools.

Journal ArticleDOI
Stephanie Vie1
TL;DR: In this article, the authors argue for a pedagogy of resistance to plagiarism detection technologies, contending that the circular logic of avoiding plagiarism, catching plagiarists, and punishing plagiarism, together with prizing singular authorship above all other forms, risks foreclosing more challenging modes of writing that rely on community.

Book ChapterDOI
23 Sep 2013
TL;DR: The first corpus for the evaluation of Arabic intrinsic plagiarism detection is introduced, consisting of 1024 artificial suspicious documents in which 2833 plagiarism cases have been inserted automatically from source documents.
Abstract: The present paper introduces the first corpus for the evaluation of Arabic intrinsic plagiarism detection. The corpus consists of 1024 artificial suspicious documents in which 2833 plagiarism cases have been inserted automatically from source documents.

Proceedings ArticleDOI
14 Jul 2013
TL;DR: A hybrid similarity measure model is proposed, based on a fitting function for the optimal dividing line between plagiarism and non-plagiarism, which integrates VSM and the Jaccard coefficient into a unified measure and can extract more reasonable heuristic seeds in plagiarism detection.
Abstract: Detailed comparison is one important sub-task of external plagiarism detection. Seed heuristics between two documents are often used in this task. The vector space model (VSM) and the Jaccard coefficient are commonly used in plagiarism detection: VSM yields high recall, while the Jaccard coefficient yields high precision. In this paper, we propose a hybrid similarity measure model based on a fitting function for the optimal dividing line between plagiarism and non-plagiarism, integrating VSM and the Jaccard coefficient into a unified measure. Our method makes full use of the advantages of both, and it can extract more reasonable heuristic seeds for plagiarism detection. The method is evaluated on the PAN corpus of CLEF (Cross-Language Evaluation Forum) and compared with methods based on VSM or the Jaccard coefficient alone. Experimental results show that our method produces better performance.
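The two base measures the abstract combines are standard and easy to state. The sketch below computes both over token lists and blends them; note that the paper fits a dividing-line function between the plagiarism and non-plagiarism classes, whereas the linear blend here (with an assumed weight `w`) is only a stand-in for that fitted combination.

```python
from collections import Counter
import math

def cosine_sim(a_tokens, b_tokens):
    """VSM similarity: cosine over raw term-frequency vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard_sim(a_tokens, b_tokens):
    """Set-overlap similarity over token types."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a or b else 0.0

def hybrid_sim(a_tokens, b_tokens, w=0.5):
    # The paper fits a function for the dividing line between plagiarism
    # and non-plagiarism; a plain linear blend stands in for it here.
    return w * cosine_sim(a_tokens, b_tokens) + (1 - w) * jaccard_sim(a_tokens, b_tokens)
```

Cosine rewards repeated shared terms (high recall), while Jaccard penalises vocabulary that appears in only one passage (high precision), which is why combining them can filter seed candidates better than either alone.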