
Showing papers on "Plagiarism detection published in 2013"


Journal ArticleDOI
TL;DR: The P4P corpus is created, a new resource that uses a paraphrase typology to annotate a subset of the PAN-PC-10 corpus for automatic plagiarism detection, providing critical insights for the improvement of automatic plagiarism detection systems.
Abstract: Although paraphrasing is the linguistic mechanism underlying many plagiarism cases, little attention has been paid to its analysis in the framework of automatic plagiarism detection. Therefore, state-of-the-art plagiarism detectors find it difficult to detect cases of paraphrase plagiarism. In this article, we analyze the relationship between paraphrasing and plagiarism, paying special attention to which paraphrase phenomena underlie acts of plagiarism and which of them are detected by plagiarism detection systems. With this aim in mind, we created the P4P corpus, a new resource that uses a paraphrase typology to annotate a subset of the PAN-PC-10 corpus for automatic plagiarism detection. The results of the Second International Competition on Plagiarism Detection were analyzed in the light of this annotation. The presented experiments show that (i) more complex paraphrase phenomena and a high density of paraphrase mechanisms make plagiarism detection more difficult, (ii) lexical substitutions are the paraphrase mechanisms used the most when plagiarizing, and (iii) paraphrase mechanisms tend to shorten the plagiarized text. For the first time, the paraphrase mechanisms behind plagiarism have been analyzed, providing critical insights for the improvement of automatic plagiarism detection systems.

136 citations


Journal ArticleDOI
TL;DR: In the future, plagiarism detection systems may benefit from combining traditional character-based detection methods with emerging detection approaches, including intrinsic, cross-lingual and citation-based plagiarism detection.
Abstract: The problem of academic plagiarism has been present for centuries. Yet, the widespread dissemination of information technology, including the internet, made plagiarising much easier. Consequently, methods and systems aiding in the detection of plagiarism have attracted much research within the last two decades. Researchers proposed a variety of solutions, which we will review comprehensively in this article. Available detection systems use sophisticated and highly efficient character-based text comparisons, which can reliably identify verbatim and moderately disguised copies. Automatically detecting more strongly disguised plagiarism, such as paraphrases, translations or idea plagiarism, is the focus of current research. Proposed approaches for this task include intrinsic, cross-lingual and citation-based plagiarism detection. Each method offers unique strengths and weaknesses; however, none is currently mature enough for practical use. In the future, plagiarism detection systems may benefit from combining traditional character-based detection methods with these emerging detection approaches.

99 citations


Journal ArticleDOI
TL;DR: The lessons learned at PAN 2010 are reviewed, the method used to construct the corpus is explained, and the work presented here is the first to join the paraphrasing and plagiarism communities.
Abstract: To paraphrase means to rewrite content while preserving the original meaning. Paraphrasing is important in fields such as text reuse in journalism, anonymizing work, and improving the quality of customer-written reviews. This article contributes to paraphrase acquisition and focuses on two aspects that are not addressed by current research: (1) acquisition via crowdsourcing, and (2) acquisition of passage-level samples. The challenge of the first aspect is automatic quality assurance; without such a means the crowdsourcing paradigm is not effective, and without crowdsourcing the creation of test corpora is unacceptably expensive for realistic orders of magnitude. The second aspect addresses the deficit that most of the previous work in generating and evaluating paraphrases has been conducted using sentence-level paraphrases or shorter; these short-sample analyses are limited in terms of application to plagiarism detection, for example. We present the Webis Crowd Paraphrase Corpus 2011 (Webis-CPC-11), which recently formed part of the PAN 2010 international plagiarism detection competition. This corpus comprises passage-level paraphrases with 4067 positive samples and 3792 negative samples that failed our criteria, using Amazon's Mechanical Turk for crowdsourcing. In this article, we review the lessons learned at PAN 2010, and explain in detail the method used to construct the corpus. The empirical contributions include machine learning experiments to explore whether passage-level paraphrases can be identified in a two-class classification problem using paraphrase similarity features, and we find that a k-nearest-neighbor classifier can correctly distinguish between paraphrased and nonparaphrased samples with 0.980 precision at 0.523 recall.
This result implies that just under half of our samples must be discarded (a remaining 0.477 fraction), but our cost analysis shows that the automation we introduce results in an 18% financial saving and over 100 hours of time returned to the researchers when repeating a similar corpus design. On the other hand, when building an unrelated corpus requiring, say, 25% training data for the automated component, we show that the financial outcome is cost neutral, while still returning over 70 hours of time to the researchers. The work presented here is the first to join the paraphrasing and plagiarism communities.
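The k-nearest-neighbor classification step described above can be sketched in a few lines. This is only an illustration of the technique, not the authors' implementation: the two similarity features and the toy training pairs below are invented for the example.

```python
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points (Euclidean distance on the feature vectors)."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

# Toy per-pair similarity features (hypothetical: e.g. n-gram overlap
# and length ratio) with paraphrase / non-paraphrase labels.
train = [
    ([0.82, 0.95], "paraphrase"),
    ([0.75, 0.88], "paraphrase"),
    ([0.90, 0.91], "paraphrase"),
    ([0.10, 0.40], "non-paraphrase"),
    ([0.05, 0.55], "non-paraphrase"),
    ([0.20, 0.35], "non-paraphrase"),
]
print(knn_predict(train, [0.80, 0.90]))  # lands in the paraphrase cluster
print(knn_predict(train, [0.12, 0.45]))  # lands in the non-paraphrase cluster
```

In the paper's setting, the precision/recall trade-off comes from where such a classifier draws the boundary between the two clusters of similarity features.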

97 citations


Journal ArticleDOI
TL;DR: The proposed source code similarity system for plagiarism detection showed promising results compared with the JPlag system in detecting source code similarity when various lexical or structural modifications are applied to plagiarized code.
Abstract: Source code plagiarism is an easy task to commit, but very difficult to detect without proper tool support. Various source code similarity detection systems have been developed to help detect source code plagiarism. Those systems need to recognize a number of lexical and structural source code modifications. For example, by some structural modifications (e.g. modification of control structures, modification of data structures or structural redesign of source code) the source code can be changed in such a way that it almost looks genuine. Most of the existing source code similarity detection systems can be confused when these structural modifications have been applied to the original source code. To be considered effective, a source code similarity detection system must address these issues. To address them, we designed and developed a source code similarity system for plagiarism detection. To demonstrate that the proposed system has the desired effectiveness, we performed a well-known conformism test. The proposed system showed promising results compared with the JPlag system in detecting source code similarity when various lexical or structural modifications are applied to plagiarized code. As a confirmation of these results, an independent-samples t-test revealed a statistically significant difference between the average F-measure values for the test sets used and the experiments performed, in the practically usable range of cut-off threshold values of 35–70%.

94 citations


Journal ArticleDOI
TL;DR: Text mining is performed, exploring the use of words as a linguistic feature for analyzing a document by modeling the writing style present in it; this feature shows promise, achieving reasonable results compared to benchmark models.
Abstract: Plagiarism detection is of special interest to educational institutions, and with the proliferation of digital documents on the Web the use of computational systems for such a task has become important. While traditional methods for automatic detection of plagiarism compute the similarity measures on a document-to-document basis, this is not always possible since the potential source documents are not always available. We do text mining, exploring the use of words as a linguistic feature for analyzing a document by modeling the writing style present in it. The main goal is to discover deviations in the style, looking for segments of the document that could have been written by another person. This can be considered as a classification problem using self-based information where paragraphs with significant deviations in style are treated as outliers. This so-called intrinsic plagiarism detection approach does not need comparison against possible sources at all, and our model relies only on the use of words, so it is not language specific. We demonstrate that this feature shows promise in this area, achieving reasonable results compared to benchmark models.
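A toy version of this intrinsic, self-based outlier idea can be sketched as follows. The two word-level features (mean word length and type-token ratio) and the z-score threshold are illustrative choices for the sketch, not the paper's actual model.

```python
import statistics

def style_vector(paragraph):
    words = paragraph.lower().split()
    return (sum(map(len, words)) / len(words),   # mean word length
            len(set(words)) / len(words))        # type-token ratio

def outlier_paragraphs(paragraphs, z_cut=1.5):
    """Flag paragraphs whose style features deviate strongly from the
    document's own mean: candidate segments written by another person."""
    feats = [style_vector(p) for p in paragraphs]
    flagged = set()
    for dim in range(2):                          # test each style feature
        vals = [f[dim] for f in feats]
        mu, sd = statistics.mean(vals), statistics.pstdev(vals)
        if sd == 0:
            continue
        flagged |= {i for i, v in enumerate(vals) if abs(v - mu) / sd > z_cut}
    return sorted(flagged)

paragraphs = [
    "the cat sat on the mat and the dog ran",
    "the dog ran to the big red barn and hid",
    "a boy and a girl sat by the old oak",
    "notwithstanding considerable methodological heterogeneity"
    " manuscripts demonstrate substantial interdisciplinary convergence",
]
print(outlier_paragraphs(paragraphs))   # flags the stylistically deviant paragraph
```

Note that, as in the paper, no external source documents are needed: the document serves as its own reference distribution.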

82 citations


Journal ArticleDOI
TL;DR: This paper proposes a freely available architecture for plagiarism detection across languages covering the entire process: heuristic retrieval, detailed analysis, and post-processing, and explores the suitability of three cross-language similarity estimation models.
Abstract: Three reasons explain why plagiarism across languages is on the rise: (i) speakers of under-resourced languages often consult documentation in a foreign language, (ii) people immersed in a foreign country can still consult material written in their native language, and (iii) people are often interested in writing in a language different from their native one. Most efforts for automatically detecting cross-language plagiarism depend on a preliminary translation, which is not always available. In this paper we propose a freely available architecture for plagiarism detection across languages covering the entire process: heuristic retrieval, detailed analysis, and post-processing. On top of this architecture we explore the suitability of three cross-language similarity estimation models: Cross-Language Alignment-based Similarity Analysis (CL-ASA), Cross-Language Character n-Grams (CL-CNG), and Translation plus Monolingual Analysis (T+MA); three models inherently different in nature and required resources. The three models are tested extensively under the same conditions on the different plagiarism detection sub-tasks, something never done before. The experiments show that T+MA produces the best results, closely followed by CL-ASA. Still, CL-ASA obtains higher values of precision, an important factor in plagiarism detection when less user intervention is desired.
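Of the three models, CL-CNG is simple enough to sketch: represent each document as a bag of character n-grams and compare with cosine similarity, exploiting the n-grams that cognate-rich language pairs share. The snippet below is an illustrative reduction (3-grams, accents stripped by hand), not the evaluated system.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Bag of character n-grams over the normalized text."""
    t = "".join(c for c in text.lower() if c.isalnum())   # drop spaces/punctuation
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))

def cosine(a, b):
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

en = "the plagiarism detection system"
es = "el sistema de deteccion de plagio"   # accents omitted for the sketch
unrelated = "quick brown foxes jump high"
# the cognate pair shares n-grams ("pla", "det", "tem", ...); the unrelated text does not
print(cosine(char_ngrams(en), char_ngrams(es)) >
      cosine(char_ngrams(en), char_ngrams(unrelated)))
```

This is also why CL-CNG works without any translation resources, at the cost of being weaker for typologically distant language pairs.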

80 citations


01 Jan 2013
TL;DR: A characterization of evaluation metrics to grade programming assignments is provided as a first step toward a grading model, and new paths in this research field are proposed.
Abstract: Automatic grading of programming assignments is an important topic in academic research. It aims at improving the level of feedback given to students and optimizing professor time. Several studies have reported the development of software tools to support this process. It is therefore helpful to get a quick, clear overview of their key features. This paper reviews an ample set of tools for automatic grading of programming assignments. They are divided into the most important mature tools, which have remarkable features, and those built recently, with new features. The review includes the definition and description of key features, e.g. supported languages, used technology, infrastructure, etc. The two kinds of tools allow making a temporal comparative analysis. This analysis shows good improvements in this research field; these include security, more language support, plagiarism detection, etc. On the other hand, the lack of a grading model for assignments is identified as an important gap in the reviewed tools. Thus, a characterization of evaluation metrics to grade programming assignments is provided as a first step toward such a model. Finally, new paths in this research field are proposed.

74 citations


Book ChapterDOI
23 Sep 2013
TL;DR: This paper outlines the concepts and achievements of the evaluation lab on digital text forensics, PAN '13, which called for original research and development on plagiarism detection, author identification, and author profiling, and presents a standardized evaluation framework for each of the three tasks.
Abstract: This paper outlines the concepts and achievements of our evaluation lab on digital text forensics, PAN '13, which called for original research and development on plagiarism detection, author identification, and author profiling. We present a standardized evaluation framework for each of the three tasks and discuss the evaluation results of the altogether 58 submitted contributions. For the first time, instead of accepting the output of software runs, we collected the software submissions themselves and ran them on a computer cluster at our site. As evaluation and experimentation platform we use TIRA, which is being developed at the Webis Group in Weimar. TIRA can handle large-scale software submissions by means of virtualization, sandboxed execution, tailored unit testing, and staged submission. In addition to the achieved evaluation results, a major achievement of our lab is that we now have the largest collection of state-of-the-art approaches with regard to the mentioned tasks for further analysis at our disposal.

74 citations


Proceedings ArticleDOI
Dong-Kyu Chae, Jiwoon Ha, Sang-Wook Kim, BooJoong Kang, Eul Gyu Im
27 Oct 2013
TL;DR: This paper proposes a software plagiarism detection system using an API-labeled control flow graph (A-CFG) that abstracts the functionalities of a program and demonstrates the effectiveness and the scalability of the system compared with existing methods.
Abstract: As plagiarism of software increases rapidly, there are growing needs for software plagiarism detection systems. In this paper, we propose a software plagiarism detection system using an API-labeled control flow graph (A-CFG) that abstracts the functionalities of a program. The A-CFG can reflect both the sequence and the frequency of APIs, while previous work rarely considers both of them together. To perform a scalable comparison of a pair of A-CFGs, we use random walk with restart (RWR) that computes an importance score for each node in a graph. By the RWR, we can generate a single score vector for an A-CFG and can also compare A-CFGs by comparing their score vectors. Extensive evaluations on a set of Windows applications demonstrate the effectiveness and the scalability of our proposed system compared with existing methods.
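The random-walk-with-restart scoring can be sketched on a toy API-labeled graph. Everything below (the API names, the restart probability, the cosine comparison) is illustrative; the paper's actual A-CFGs are extracted from real binaries.

```python
import math

def rwr_scores(graph, restart, c=0.15, iters=200):
    """Random walk with restart: per-node steady-state visit scores.
    Each step: follow an out-edge with prob 1-c, or jump back to
    `restart` with prob c; dangling nodes also return to `restart`."""
    nodes = sorted(graph)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = dict.fromkeys(nodes, 0.0)
        for n in nodes:
            out = graph[n]
            if out:
                share = (1 - c) * score[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                nxt[restart] += (1 - c) * score[n]
        nxt[restart] += c
        score = nxt
    return [score[n] for n in nodes]

def cos(u, v):
    return sum(a * b for a, b in zip(u, v)) / (
        math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical A-CFGs over the same API labels; g2 only reorders one
# edge (a plausible disguise), while g3 is structurally unrelated.
g1 = {"open": ["read"], "read": ["read", "write"], "write": ["close"], "close": []}
g2 = {"open": ["read"], "read": ["write"], "write": ["read", "close"], "close": []}
g3 = {"open": ["close"], "read": [], "write": [], "close": ["open"]}
v1, v2, v3 = (rwr_scores(g, "open") for g in (g1, g2, g3))
print(cos(v1, v2) > cos(v1, v3))   # similar graphs yield closer score vectors
```

Reducing each graph to a single score vector is what makes the pairwise comparison scalable: comparing vectors is linear, whereas graph matching is not.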

64 citations


Journal ArticleDOI
TL;DR: An evaluation of Turnitin's use with staff on a first-year undergraduate module within the psychology department at a UK university indicated that it has the potential to be a very valuable asset for plagiarism detection and electronic marking.
Abstract: The aim of this project was to pilot plagiarism detection software and online marking, evaluating its use with staff on a first year undergraduate module within the psychology department at a UK university. One hundred and sixty undergraduate psychology students submitted three assignments via Turnitin, and staff used the software to check for instances of academic misconduct and marked submissions using the Grade Mark feature in the software, providing online feedback to students. Eleven members of teaching staff took part in focus groups to gain insight into their experiences of using Turnitin in this manner, and this paper reports the findings. Results indicated that staff identified several strengths but also several weaknesses to the implementation of Turnitin and Grade Mark. The Originality Check feature received very positive evaluations due to its capacity to provide a clear and timely indicator of plagiarism levels in assignments and a useful formative learning tool for students from an educator perspective. Staff did however encounter some technical difficulties when using the software. In conclusion, for staff, the benefits of using Turnitin were clear, and it has the potential to be a very valuable asset for plagiarism detection and electronic marking.

60 citations


Journal ArticleDOI
TL;DR: In this article, the authors explored the use of a plagiarism detection system to deter digital plagiarism and found that when students were aware that their work would be run through a detection system, they were less inclined to plagiarize.
Abstract: Computer technology and the Internet now make plagiarism an easier enterprise. As a result, faculty must be more diligent in their efforts to deter the practice, and institutions of higher education must provide the leadership and support to ensure a context of academic integrity. This study explored the use of a plagiarism detection system to deter digital plagiarism. Findings suggest that when students were aware that their work would be run through a detection system, they were less inclined to plagiarize. These findings suggest that, regardless of class standing, gender, and college major, recognition by the instructor of the nature and extent of the plagiarism problem and acceptance of responsibility for deterring it are pivotal in reducing the problem.

Book ChapterDOI
24 Mar 2013
TL;DR: Experimental results indicate that the proposed graph-based approach is a good alternative for cross-language plagiarism detection and compared with two state-of-the-art models.
Abstract: Cross-language plagiarism refers to the type of plagiarism where the source and suspicious documents are in different languages. Plagiarism detection across languages is still in its infancy. In this article, we propose a new graph-based approach that uses a multilingual semantic network to compare document paragraphs in different languages. In order to investigate the proposed approach, we used the German-English and Spanish-English cross-language plagiarism cases of the PAN-PC'11 corpus. We compared the obtained results with two state-of-the-art models. Experimental results indicate that our graph-based approach is a good alternative for cross-language plagiarism detection.

Journal ArticleDOI
TL;DR: In this article, the authors discuss the pedagogical implications and suggest that the contextual reasons for plagiarism require focus primarily on study strategies, whereas the intentional reasons require profound discussion about attitudes and conceptions of good learning and university-level study habits.
Abstract: The focus of this article is university teachers’ and students’ views of plagiarism, plagiarism detection, and the use of plagiarism detection software as learning support. The data were collected from teachers and students who participated in a pilot project to test plagiarism detection software at a major university in Finland. The data were analysed through factor analysis, T-tests and inductive content analysis. Three distinct reasons for plagiarism were identified: intentional, unintentional and contextual. The teachers did not utilise plagiarism detection to support student learning to any great extent. We discuss the pedagogical implications and suggest that the contextual reasons for plagiarism require focus primarily on study strategies, whereas the intentional reasons require profound discussion about attitudes and conceptions of good learning and university-level study habits.

Journal ArticleDOI
TL;DR: A novel method for detecting likely portions of reused text that is able to detect common actions performed by plagiarists, such as word deletion, insertion and transposition, and that represents the identified reused text by means of a set of features denoting its degree of plagiarism, relevance and fragmentation.
Abstract: An important task in plagiarism detection is determining and measuring similar text portions between a given pair of documents. One of the main difficulties of this task resides in the fact that reused text is commonly modified with the aim of covering or camouflaging the plagiarism. Another difficulty is that not all similar text fragments are examples of plagiarism, since thematic coincidences also tend to produce portions of similar text. In order to tackle these problems, we propose a novel method for detecting likely portions of reused text. This method is able to detect common actions performed by plagiarists, such as word deletion, insertion and transposition, allowing us to obtain plausible portions of reused text. We also propose representing the identified reused text by means of a set of features that denote its degree of plagiarism, relevance and fragmentation. This new representation aims to facilitate the recognition of plagiarism by considering diverse characteristics of the reused text during the classification phase. Experimental results employing a supervised classification strategy showed that the proposed method is able to outperform traditionally used approaches.
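One cheap way to approximate this kind of edit-tolerant matching, shown purely for illustration (the paper's method is its own algorithm, and unlike this sketch it also handles transpositions), is to extract the word spans that survive insertions and deletions:

```python
import difflib

def reused_spans(source, suspicious, min_words=4):
    """Return word spans of `source` that reappear in `suspicious`,
    surviving insertion/deletion of surrounding words."""
    s, t = source.lower().split(), suspicious.lower().split()
    matcher = difflib.SequenceMatcher(a=s, b=t, autojunk=False)
    return [" ".join(s[m.a:m.a + m.size])
            for m in matcher.get_matching_blocks() if m.size >= min_words]

src = "the results of the experiment clearly show a significant improvement over the baseline"
sus = "notably the results of the experiment show a significant improvement over all prior baselines"
print(reused_spans(src, sus))
```

The deleted word ("clearly") and the inserted one ("notably") break the text into two surviving spans, which is exactly the fragmentation the paper's feature set measures.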

Journal ArticleDOI
TL;DR: The results indicate that disciplinary differences do exist in terms of the degree of matching text incidences and that the greater the number of authors an article has the more consecutive text-matching can be observed in their published works.

01 Jan 2013
TL;DR: This paper describes the approach at the PAN@CLEF2013 plagiarism detection competition, and proposes a method based on sentence similarity to extract the keywords of suspicious documents as queries to retrieve the plagiarism source document.
Abstract: In this paper, we describe our approach at the PAN@CLEF2013 plagiarism detection competition. In the Source Retrieval sub-task, a method combining TF-IDF, PatTree and Weighted TF-IDF to extract the keywords of suspicious documents as queries to retrieve the plagiarism source documents is proposed. In the Text Alignment sub-task, a method based on sentence similarity is presented. Our text alignment algorithm and similar-sentence merging algorithm, called the Bilateral Alternating Merging Algorithm, are described in detail.
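The TF-IDF leg of the query-building step can be illustrated with a toy scorer (the PatTree and weighted variants are beyond a sketch, and the document and background texts here are invented):

```python
import math
from collections import Counter

def tfidf_keywords(doc, background, k=5):
    """Pick the top-k TF-IDF-scored words of `doc` as a retrieval query,
    using `background` as a tiny document collection for IDF."""
    tokens = doc.lower().split()
    tf = Counter(tokens)
    n_docs = len(background) + 1
    def idf(word):
        df = 1 + sum(word in b.lower().split() for b in background)
        return math.log(n_docs / df)
    return sorted(tf, key=lambda w: tf[w] * idf(w), reverse=True)[:k]

doc = "the neural network model detects plagiarism in the suspicious document"
background = [
    "the quick brown fox",
    "the lazy dog sleeps in the sun",
    "a document about the weather",
]
print(tfidf_keywords(doc, background))   # distinctive content words rank first
```

Words that appear in every background document (like "the") score zero and never reach the query, which is the whole point of weighting TF by IDF before querying a search engine for candidate sources.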

Journal ArticleDOI
TL;DR: Overall, institutional policies on self-plagiarism did not exist, faculty did not clearly understand the concept and believed their students did not either, and faculty assumed students had previously been educated on plagiarism as well as self-plagiarism.
Abstract: The purpose of this research study was to evaluate faculty perceptions regarding student self-plagiarism or recycling of student papers. Although there is a plethora of information on plagiarism and faculty who self-plagiarize in publications, there is very little research on how faculty members perceive students re-using all or part of a previously completed assignment in a second assignment. With the wide use of plagiarism detection software, this issue becomes even more crucial. A population of 340 faculty members from two private universities at three different sites was surveyed in Fall 2012 semester regarding their perceptions of student self-plagiarism. A total of 89 faculty responded for a return rate of 26.2 %. Overall, institutional policies on self-plagiarism did not exist and faculty did not clearly understand the concept and believed their students did not either. Although faculty agreed students need to be educated on self-plagiarism, faculty assumed students had previously been educated on plagiarism as well as self-plagiarism; only 13 % ensured students understood this concept.

Proceedings ArticleDOI
28 Jul 2013
TL;DR: State-of-the-art plagiarism detection approaches capably identify copy & paste and to some extent slightly modified plagiarism but cannot reliably identify strongly disguised plagiarism forms, including paraphrases, translated plagiarism, and idea plagiarism.
Abstract: State-of-the-art plagiarism detection approaches capably identify copy & paste and, to some extent, slightly modified plagiarism. However, they cannot reliably identify strongly disguised plagiarism forms, including paraphrases, translated plagiarism, and idea plagiarism, which are forms of plagiarism more commonly found in scientific texts. This weakness of current systems results in a large fraction of today's scientific plagiarism going undetected.

Proceedings ArticleDOI
20 Mar 2013
TL;DR: The results showed that the best performance of fingerprint algorithm was 92.8% while Winnowing algorithm's best performance was 91.8%.
Abstract: Plagiarism detection has been widely discussed in recent years. Various approaches have been proposed, such as text-similarity calculation, structural approaches, and fingerprinting. In fingerprint approaches, small parts of a document are taken to be matched with other documents. In this paper, the fingerprint and Winnowing algorithms are applied to detect plagiarism in scientific articles in Bahasa Indonesia. Plagiarism between two documents is classified by a Dice coefficient at a certain threshold value. The results showed that the best performance of the fingerprint algorithm was 92.8%, while the Winnowing algorithm's best performance was 91.8%. The level-of-relevance-to-topic analysis showed that the Winnowing algorithm achieved a stronger term correlation of 37.1%, compared to 33.6% for the fingerprint algorithm.
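The fingerprinting pipeline such systems build on can be sketched: hash all k-grams of the normalized text, winnow them down to per-window minima, and compare fingerprint sets with the Dice coefficient. The parameter values and texts below are illustrative, not the paper's configuration.

```python
def winnow(text, k=5, w=4):
    """Winnowing fingerprint: hash every character k-gram, then keep the
    minimum hash of each window of w consecutive hashes.
    (Python's built-in hash is stable only within a single run.)"""
    t = "".join(text.lower().split())
    hashes = [hash(t[i:i + k]) for i in range(len(t) - k + 1)]
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}

def dice(a, b):
    """Dice coefficient between two fingerprint sets."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

d1 = "plagiarism detection compares document fingerprints"
d2 = "plagiarism detection compares document fingerprints efficiently"
d3 = "completely different text about gardening and cooking"
print(dice(winnow(d1), winnow(d2)) > dice(winnow(d1), winnow(d3)))
```

Winnowing's guarantee is that any sufficiently long shared substring contributes at least one shared fingerprint, while the per-window selection keeps fingerprints far smaller than the full k-gram set.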

Proceedings ArticleDOI
01 Nov 2013
TL;DR: By introducing dynamic data flow analysis into birthmark generation, DKISB is able to produce a high quality birthmark that is closely correlated to program semantics, making it resilient to various kinds of semantic-preserving code obfuscation techniques.
Abstract: With the burst of open source software, software plagiarism has become a serious threat to the healthy development of the software industry. A software birthmark, reflecting intrinsic properties of software, is an effective way to detect software theft. However, most existing software birthmarks face a series of challenges: (1) the absence of source code, (2) the diversity of operating systems and programming languages, and (3) various automated code obfuscation techniques. In this paper, a dynamic key instruction sequence based software birthmark (DKISB) is proposed. By introducing dynamic data flow analysis into birthmark generation, we are able to produce a high quality birthmark that is closely correlated to program semantics, making it resilient to various kinds of semantic-preserving code obfuscation techniques. Based on the Pin instrumentation framework, a DKISB based software plagiarism detection system is implemented, which generates birthmarks for both the plaintiff and defendant programs, and then makes the plagiarism decision according to the similarity of their birthmarks. The experimental results show that DKISB is effective against both weak obfuscation techniques like compiler optimization and strong obfuscation techniques provided by tools such as SandMark.

Book ChapterDOI
30 Aug 2013
TL;DR: The proposed solution based on simhash document fingerprints essentially reduces the problem to a secure XOR computation between two bit vectors, which improves the computational and communication costs by at least one order of magnitude compared to the current state-of-the-art protocol.
Abstract: Similar document detection is a well-studied problem with important application domains, such as plagiarism detection, document archiving, and patent/copyright protection. Recently, the research focus has shifted towards the privacy-preserving version of the problem, in which two parties want to identify similar documents within their respective datasets. These methods apply to scenarios such as patent protection or intelligence collaboration, where the contents of the documents at both parties should be kept secret. Nevertheless, existing protocols on secure similar document detection suffer from high computational and/or communication costs, which renders them impractical for large datasets. In this work, we introduce a solution based on simhash document fingerprints, which essentially reduce the problem to a secure XOR computation between two bit vectors. Our experimental results demonstrate that the proposed method improves the computational and communication costs by at least one order of magnitude compared to the current state-of-the-art protocol. Moreover, it achieves a high level of precision and recall.
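A plain (non-private) version of the underlying fingerprint arithmetic looks like this; the secure two-party computation layer is the paper's contribution and is not shown. Hashing tokens via MD5 is an arbitrary choice for the sketch.

```python
import hashlib

def simhash(text, bits=64):
    """64-bit simhash: every token votes +1/-1 on each bit position
    according to its own hash; fingerprint bit = sign of the column sum."""
    votes = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")   # the XOR at the heart of the protocol

d1 = "secure detection of similar documents across two private collections"
d2 = "secure detection of similar documents across two private archives"
d3 = "an unrelated recipe for baking sourdough bread at home"
# near-duplicates land close in Hamming space, unrelated texts far apart
print(hamming(simhash(d1), simhash(d2)), hamming(simhash(d1), simhash(d3)))
```

Because similarity reduces to the Hamming weight of an XOR of two short bit vectors, the secure version only needs to evaluate that XOR obliviously, which is what makes the protocol an order of magnitude cheaper than comparing full documents.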

01 Jan 2013
TL;DR: The process, optimized with high-performance C/C++ multi-core programming techniques, yielded the best speed, even though the tests were run on single-core machines, so even better runtimes can be expected.
Abstract: This paper describes the process and basics of the Text Alignment Module in the CoReMo 2.1 Plagiarism Detector, which won the Plagiarism Detection Text Alignment task in the PAN 2013 edition for both evaluation criteria of efficacy and efficiency, achieving the best detections and the best runtime. Its high detection efficacy is mainly due to the special features of the contextual n-grams, evolved into surrounding-context and odd-even skip n-grams. When combined, the matching opportunity increases, especially when translations or paraphrases happen, while keeping the highly discriminative feature that simplifies the accurate location of plagiarized sections. The process, optimized by high-performance C/C++ multi-core programming techniques, yielded the best speed, but the tests were run on single-core machines, so much better runtimes can be expected.
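One plausible reading of the odd-even skip n-grams, sketched below with invented token streams (the actual CoReMo definition may differ), is to collect n-grams both from the full token sequence and from the even- and odd-position subsequences:

```python
def skip_ngrams(tokens, n=3):
    """Contiguous n-grams plus n-grams over every other token
    (even and odd positions), so a single-word substitution can
    still leave n-grams intact in one of the skip streams."""
    grams = set()
    for stream in (tokens, tokens[0::2], tokens[1::2]):
        grams |= {tuple(stream[i:i + n]) for i in range(len(stream) - n + 1)}
    return grams

def contiguous(tokens, n=3):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

a = "plagiarism detection requires robust matching".split()
b = "plagiarism identification requires robust matching".split()
# the even-position stream recovers a match that the substitution destroyed
print(len(skip_ngrams(a) & skip_ngrams(b)) > len(contiguous(a) & contiguous(b)))
```

This is the intuition behind the claimed robustness to paraphrase and translation: skip streams trade a little discriminative power for extra matching opportunities.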

Journal ArticleDOI
TL;DR: This paper investigates an unsupervised feature learning technique called the sparse auto-encoder as a method of extracting features from source code files, and shows that its performance is very close to state-of-the-art techniques in the source code identification field.

Dissertation
01 Jan 2013
TL;DR: Man Yan Miranda Chong A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy in 2013.
Abstract: Man Yan Miranda Chong A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy 2013

Proceedings ArticleDOI
09 Sep 2013
TL;DR: This paper introduces a technique based on the Abstract Syntax Tree (AST) that can effectively detect plagiarism cases such as changing the names of methods and variables in the code and reordering code sequences.
Abstract: Plagiarism detection technology plays a very important role in the copyright protection of computer software. Plagiarism detection technologies mainly include text-based, token-based and syntax-based approaches. This paper introduces a technique based on the Abstract Syntax Tree (AST). The AST-based algorithm can effectively detect plagiarism cases such as changing the names of methods and variables in the code, reordering code sequences, and so on. Following this algorithm, we calculate hash values for every node in the AST, store the AST, and then compare the hash values node by node. Finally, we use experiments to illustrate the superiority of the AST algorithm.
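The name-insensitive node hashing described above can be sketched with Python's own ast module (the paper targets its own language and hashing scheme; this is only an analogy):

```python
import ast

def node_hash(node):
    """Hash an AST node from its type plus its children's hashes.
    Identifier names and literal values are not AST children, so
    renaming variables/methods leaves the hash unchanged."""
    children = tuple(node_hash(c) for c in ast.iter_child_nodes(node))
    return hash((type(node).__name__, children))

original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s"
renamed = "def acc(vals):\n    r = 0\n    for v in vals:\n        r += v\n    return r"
different = "def square(n):\n    return n * n"

print(node_hash(ast.parse(original)) == node_hash(ast.parse(renamed)))    # same tree shape
print(node_hash(ast.parse(original)) == node_hash(ast.parse(different)))  # different shape
```

Hashing each subtree bottom-up lets the node-by-node comparison skip entire matching subtrees in constant time, which is what makes the AST approach practical on whole programs.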

01 Jan 2013
TL;DR: A modified three-way search methodology for the Source Retrieval subtask is presented and snippet similarity performance is analysed; the presented approach is adaptable to real-world plagiarism situations.
Abstract: This paper describes approaches used for the Plagiarism Detection task in the PAN 2013 international competition on uncovering plagiarism, authorship, and social software misuse. We present a modified three-way search methodology for the Source Retrieval subtask and analyse snippet similarity performance. The results show that the presented approach is adaptable to real-world plagiarism situations. For the Detailed Comparison task, we discuss feature type selection and global postprocessing. The resulting performance is significantly better with the described modifications, and further improvement is still possible.

Dissertation
12 Jun 2013
TL;DR: In this article, the authors investigated whether plagiarism involves an intention to deceive, and, in this case, whether forensic linguistic evidence can provide clues to this intentionality, and also evaluated current computational approaches to plagiarism detection, and identified strategies that these systems fail to detect.
Abstract: This study investigates plagiarism detection, with an application in forensic contexts. Two types of data were collected for the purposes of this study. Data in the form of written texts were obtained from two Portuguese universities and from a Portuguese newspaper. These data are analysed linguistically to identify instances of verbatim, morpho-syntactical, lexical and discursive overlap. Survey data were obtained from two higher education institutions in Portugal, and another two in the United Kingdom. These data are analysed using a 2 by 2 between-groups univariate analysis of variance (ANOVA), to reveal cross-cultural divergences in the perceptions of plagiarism. The study discusses the legal and social circumstances that may contribute to adopting a punitive approach to plagiarism, or, conversely, rejecting punishment. The research adopts a critical approach to plagiarism detection. On the one hand, it describes the linguistic strategies adopted by plagiarists when borrowing from other sources, and, on the other hand, it discusses the relationship between these instances of plagiarism and the context in which they appear. A focus of this study is whether plagiarism involves an intention to deceive, and, if so, whether forensic linguistic evidence can provide clues to this intentionality. It also evaluates current computational approaches to plagiarism detection, and identifies strategies that these systems fail to detect. Specifically, a method is proposed for detecting translingual plagiarism. The findings indicate that, although cross-cultural aspects influence the different perceptions of plagiarism, a distinction needs to be made between intentional and unintentional plagiarism. The linguistic analysis demonstrates that linguistic elements can contribute to finding clues to the plagiarist's intentionality.
Furthermore, the findings show that translingual plagiarism can be detected by using the method proposed, and that plagiarism detection software can be improved using existing computer tools.

Journal ArticleDOI
Stephanie Vie1
TL;DR: In this article, the authors argue for a pedagogy of resistance to plagiarism detection technologies, contending that the circular logic of avoiding plagiarism, catching plagiarists, and punishing plagiarism, together with prizing singular authorship above all other forms, risks foreclosing more challenging modes of writing that rely on community.

Book ChapterDOI
23 Sep 2013
TL;DR: The first corpus for the evaluation of Arabic intrinsic plagiarism detection is introduced, consisting of 1024 artificial suspicious documents in which 2833 plagiarism cases have been inserted automatically from source documents.
Abstract: The present paper introduces the first corpus for the evaluation of Arabic intrinsic plagiarism detection. The corpus consists of 1024 artificial suspicious documents in which 2833 plagiarism cases have been inserted automatically from source documents.

Proceedings ArticleDOI
14 Jul 2013
TL;DR: A hybrid similarity measure model is proposed, based on a fitting function for the optimal dividing line between plagiarism and non-plagiarism, which integrates VSM and the Jaccard coefficient into a unified measure and can extract more reasonable heuristic seeds in plagiarism detection.
Abstract: Detailed comparison is one important sub-task of external plagiarism detection. Seed heuristics between two documents are often used in this task. The vector space model (VSM) and the Jaccard coefficient are commonly used in plagiarism detection: VSM yields high recall, while the Jaccard coefficient yields high precision. In this paper, we propose a hybrid similarity measure model based on a fitting function for the optimal dividing line between plagiarism and non-plagiarism, integrating VSM and the Jaccard coefficient into a unified measure. Our method makes full use of the advantages of both, and it can extract more reasonable heuristic seeds for plagiarism detection. The method is evaluated on the PAN corpus of CLEF (Cross-Language Evaluation Forum) and compared with methods based on VSM or the Jaccard coefficient alone. Experimental results show that our method produces better performance.
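The two base measures the abstract combines are standard and easy to state. The sketch below computes both over token lists and blends them; note that the paper fits a dividing-line function between the plagiarism and non-plagiarism classes, whereas the linear blend here (with an assumed weight `w`) is only a stand-in for that fitted combination.

```python
from collections import Counter
import math

def cosine_sim(a_tokens, b_tokens):
    """VSM similarity: cosine over raw term-frequency vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard_sim(a_tokens, b_tokens):
    """Set-overlap similarity over token types."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a or b else 0.0

def hybrid_sim(a_tokens, b_tokens, w=0.5):
    # The paper fits a function for the dividing line between plagiarism
    # and non-plagiarism; a plain linear blend stands in for it here.
    return w * cosine_sim(a_tokens, b_tokens) + (1 - w) * jaccard_sim(a_tokens, b_tokens)
```

Cosine rewards repeated shared terms (high recall), while Jaccard penalises vocabulary that appears in only one passage (high precision), which is why combining them can filter seed candidates better than either alone.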