
Showing papers on "Plagiarism detection" published in 2012


Journal ArticleDOI
01 Mar 2012
TL;DR: A new taxonomy of plagiarism is presented that highlights differences between literal plagiarism and intelligent plagiarism, from the plagiarist's behavioral point of view, and supports deep understanding of different linguistic patterns in committing plagiarism.
Abstract: Plagiarism can be of many different natures, ranging from copying texts to adopting ideas, without giving credit to its originator. This paper presents a new taxonomy of plagiarism that highlights differences between literal plagiarism and intelligent plagiarism, from the plagiarist's behavioral point of view. The taxonomy supports deep understanding of different linguistic patterns in committing plagiarism, for example, rewriting text into semantically equivalent forms with different words and organization, shortening texts through concept generalization and specification, and adopting the ideas and important contributions of others. Different textual features that characterize different plagiarism types are discussed. Systematic frameworks and methods of monolingual, extrinsic, intrinsic, and cross-lingual plagiarism detection are surveyed and correlated with the plagiarism types listed in the taxonomy. We conduct an extensive study of state-of-the-art techniques for plagiarism detection, including character n-gram-based (CNG), vector-based (VEC), syntax-based (SYN), semantic-based (SEM), fuzzy-based (FUZZY), structural-based (STRUC), stylometric-based (STYLE), and cross-lingual (CROSS) techniques. Our study corroborates that existing systems for plagiarism detection focus on copied text but fail to detect intelligent plagiarism when ideas are presented in different words.
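
To make the surveyed CNG family concrete, here is a minimal sketch of character n-gram overlap; it is not the survey's own code, and the n-gram length and the Jaccard-style measure are illustrative choices:

```python
from collections import Counter

def char_ngram_profile(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of a whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cng_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard-style overlap between two character n-gram profiles."""
    pa, pb = char_ngram_profile(a, n), char_ngram_profile(b, n)
    shared = sum((pa & pb).values())   # multiset intersection
    total = sum((pa | pb).values())    # multiset union
    return shared / total if total else 0.0

print(cng_similarity("plagiarism can be of many natures",
                     "plagiarism may take many forms"))
```

As the survey notes, such surface measures catch copied text well but miss idea-level plagiarism entirely.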

275 citations


Journal ArticleDOI
TL;DR: PlaGate is described, a novel tool that can be integrated with existing plagiarism detection tools to improve plagiarism detection performance; it implements a new approach for investigating the similarity between source-code files with a view to gathering evidence for proving plagiarism.
Abstract: Plagiarism is a growing problem in academia. Academics often use plagiarism detection tools to detect similar source-code files. Once similar files are detected, the academic proceeds with the investigation process which involves identifying the similar source-code fragments within them that could be used as evidence for proving plagiarism. This paper describes PlaGate, a novel tool that can be integrated with existing plagiarism detection tools to improve plagiarism detection performance. The tool also implements a new approach for investigating the similarity between source-code files with a view to gathering evidence for proving plagiarism. Graphical evidence is presented that allows for the investigation of source-code fragments with regards to their contribution toward evidence for proving plagiarism. The graphical evidence indicates the relative importance of the given source-code fragments across files in a corpus. This is done by using the Latent Semantic Analysis information retrieval technique to detect how important they are within the specific files under investigation in relation to other files in the corpus.

173 citations


Journal ArticleDOI
01 May 2012
TL;DR: The method significantly outperforms modern plagiarism detection methods in terms of recall, precision, and F-measure; a weighting for each argument generated by SRL is also introduced to study its behaviour.
Abstract: Plagiarism occurs when content is copied without permission or citation. One contributing factor is that many text documents on the internet are easily copied and accessed. This paper introduces a plagiarism detection technique based on Semantic Role Labeling (SRL), which analyses and compares text based on the semantic allocation of each term inside the sentence. SRL is well suited to generating arguments for each sentence semantically. A weighting for each argument generated by SRL is also introduced to study its behaviour; it was found that not all arguments affect the plagiarism detection process. In addition, experimental results on the PAN-PC-09 data set showed that our method significantly outperforms modern methods for plagiarism detection in terms of recall, precision, and F-measure.
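
A hedged sketch of the argument-weighting idea: it assumes SRL output is already available as role-to-text mappings (a real labeler would produce these), and the role set and weights below are hypothetical, not values from the paper:

```python
# Illustrative role weights; the paper studies these empirically.
ROLE_WEIGHTS = {"ARG0": 1.0, "ARG1": 1.0, "V": 0.8, "ARGM-LOC": 0.3}

def arg_similarity(args_a: dict, args_b: dict) -> float:
    """Weighted word overlap across matching semantic roles."""
    score = total = 0.0
    for role, weight in ROLE_WEIGHTS.items():
        wa = set(args_a.get(role, "").lower().split())
        wb = set(args_b.get(role, "").lower().split())
        if not wa and not wb:
            continue  # role absent in both sentences
        total += weight
        score += weight * len(wa & wb) / max(len(wa | wb), 1)
    return score / total if total else 0.0

a = {"ARG0": "the student", "V": "copied", "ARG1": "the essay"}
b = {"ARG0": "a student", "V": "duplicated", "ARG1": "the essay"}
print(arg_similarity(a, b))
```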

93 citations


Journal ArticleDOI
TL;DR: A mixed-methods study of Chinese university students' knowledge of and attitudes toward plagiarism in English academic writing found that only a minority of the students recognized the two forms of plagiarism, and that these students generally took a punitive attitude toward the detected cases.
Abstract: This article reports on a mixed-methods study of Chinese university students’ knowledge of and attitudes toward plagiarism in English academic writing. A sample of 270 undergraduates from two Chinese universities rated three short English passages under different conditions, provided open-ended responses to justify their ratings, and completed a written questionnaire. The rating tasks were designed to determine their ability to recognize two forms of intertextuality (i.e., unacknowledged copying and paraphrasing) generally regarded as plagiarism in Anglo-American academia. The questionnaire was administered to collect self-appraisals of competence in source use and to assess declarative knowledge of intertextual practices that match typical Anglo-American definitions of blatant and subtle plagiarism. Qualitative and quantitative analyses revealed that a minority of the students recognized the two forms of plagiarism and generally took a punitive attitude toward the detected cases of plagiarism. Further quantitative analyses found that discipline, self-reported competence in referencing, and knowledge of subtle plagiarism were consistently significant predictors of successful plagiarism detection. These findings raise questions about some culturally based interpretations of plagiarism and point to the need to take a nuanced approach to plagiarism in L2 writing.

91 citations


Proceedings ArticleDOI
15 Jul 2012
TL;DR: Two dynamic value-based approaches, namely N-version and annotation, for algorithm plagiarism detection are proposed, motivated by the observation that there exist some critical runtime values which are irreplaceable and uneliminatable for all implementations of the same algorithm.
Abstract: In this work, we address the problem of algorithm plagiarism, which occurs when a plagiarist, violating intellectual property rights, steals others' algorithms and covertly implements them. In contrast to software plagiarism, which has been extensively studied, limited attention has been paid to algorithm plagiarism. In this paper, we propose two dynamic value-based approaches, namely N-version and annotation, for algorithm plagiarism detection. Our approaches are motivated by the observation that there exist some critical runtime values which are irreplaceable and uneliminatable for all implementations of the same algorithm. The N-version approach extracts such values by filtering out non-core values. The annotation approach leverages auxiliary information to flag important variables which contain core values. We also propose a value dependence graph based similarity metric in addition to the longest common subsequence based one, in order to address the potential value reordering attack. We have implemented a prototype and evaluated the proposed schemes on various algorithms. The results show that our approaches to algorithm plagiarism detection are practical, effective and resilient to many automatic obfuscation techniques.
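
The longest-common-subsequence metric mentioned above can be sketched directly; core runtime values are modeled here as plain integer sequences, and extracting them from real execution traces is out of scope:

```python
def lcs_length(xs, ys):
    """Classic dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i, x in enumerate(xs, 1):
        for j, y in enumerate(ys, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def value_similarity(trace_a, trace_b):
    """LCS length normalized by the shorter core-value sequence."""
    if not trace_a or not trace_b:
        return 0.0
    return lcs_length(trace_a, trace_b) / min(len(trace_a), len(trace_b))

# Two implementations of "the same algorithm" sharing critical values:
print(value_similarity([5, 3, 8, 13, 21], [5, 8, 13, 21, 34]))  # 0.8
```

The value dependence graph metric exists precisely because a plagiarist can reorder computations to defeat such a sequence-based score.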

71 citations


Journal ArticleDOI
TL;DR: The prevalence of plagiarized manuscripts submitted to the CMJ, a journal dedicated to promoting research integrity, was 11% in the 2-year period 2009–2010, and three authors were identified in the Déjà vu database.
Abstract: To assess the prevalence of plagiarism in manuscripts submitted for publication in the Croatian Medical Journal (CMJ), all manuscripts submitted in 2009–2010 were analyzed using plagiarism detection software: eTBLAST, CrossCheck, and WCopyfind. Plagiarism was suspected in manuscripts with more than 10% of the text derived from other sources. These manuscripts were checked against the Déjà vu database and manually verified by investigators. Of 754 submitted manuscripts, 105 (14%) were identified by the software as suspicious of plagiarism. Manual verification confirmed that 85 (11%) manuscripts were plagiarized: 63 (8%) were true plagiarism and 22 (3%) were self-plagiarism. Plagiarized manuscripts were mostly submitted from China (21%), Turkey (19%), and Croatia (14%). There was no significant difference in the text similarity rate between plagiarized and self-plagiarized manuscripts (25% [95% CI 22–27%] vs. 28% [95% CI 20–33%]; U = 645.50; P = 0.634). Differences in text similarity rate were found between various sections of self-plagiarized manuscripts (H = 12.65, P = 0.013). The plagiarism rate in the Materials and Methods section (61% [95% CI 41–68%]) was higher than in the Results (23% [95% CI 17–36%]; U = 33.50; P = 0.009) or Discussion (25.5% [95% CI 15–35%]; U = 57.50; P < 0.001) sections. Three authors were identified in the Déjà vu database. Plagiarism detection software combined with manual verification may be used to detect plagiarized manuscripts and prevent their publication. The prevalence of plagiarized manuscripts submitted to the CMJ, a journal dedicated to promoting research integrity, was 11% in the two-year period 2009–2010.

66 citations


Journal ArticleDOI
TL;DR: A plagiarism detection tool for comparison of Arabic documents to identify potential similarities based on a new comparison algorithm that uses heuristics to compare suspect documents at different hierarchical levels to avoid unnecessary comparisons is presented.
Abstract: Many language-sensitive tools for detecting plagiarism in natural language documents have been developed, particularly for English. Language-independent tools exist as well, but are considered restrictive as they usually do not take into account specific language features. Detecting plagiarism in Arabic documents is a particularly challenging task because of the complex linguistic structure of Arabic. In this paper, we present a plagiarism detection tool for comparing Arabic documents to identify potential similarities. The tool is based on a new comparison algorithm that uses heuristics to compare suspect documents at different hierarchical levels, avoiding unnecessary comparisons. We evaluate its performance in terms of precision and recall on a large data set of Arabic documents, and show its capability in identifying direct and sophisticated copying, such as sentence reordering and synonym substitution. We also demonstrate its advantages over other plagiarism detection tools, including Turnitin, the well-known language-independent tool.
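
A minimal sketch of hierarchical comparison with early skipping, assuming plain word-set overlap at every level; the thresholds and the paragraph/sentence granularity are illustrative, not those of the published tool (which also applies Arabic-specific processing):

```python
def word_set(text: str) -> set:
    return set(text.lower().split())

def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def compare(doc_a: str, doc_b: str, doc_t=0.2, para_t=0.3, sent_t=0.6):
    """Compare documents top-down, skipping dissimilar branches early."""
    if jaccard(word_set(doc_a), word_set(doc_b)) < doc_t:
        return []  # globally dissimilar: skip all finer levels
    matches = []
    for pa in doc_a.split("\n\n"):
        for pb in doc_b.split("\n\n"):
            if jaccard(word_set(pa), word_set(pb)) < para_t:
                continue  # skip sentence comparisons for this pair
            for sa in pa.split("."):
                for sb in pb.split("."):
                    if sa.strip() and jaccard(word_set(sa), word_set(sb)) >= sent_t:
                        matches.append((sa.strip(), sb.strip()))
    return matches
```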

55 citations


Proceedings ArticleDOI
03 Sep 2012
TL;DR: The TIRA (Testbed for Information Retrieval Algorithms) web framework is presented, which is currently used as an official evaluation platform for the well-established PAN international plagiarism detection competition and possesses a unique set of compelling features in comparison to existing web-based solutions.
Abstract: With its close ties to the Web, the information retrieval community is destined to leverage the dissemination and collaboration capabilities that the Web provides today. Especially with the advent of the software as a service principle, an information retrieval community is conceivable that publishes executable experiments by anyone over the Web. A review of recent SIGIR papers shows that we are far away from this vision of collaboration. The benefits of publishing information retrieval experiments as a service are striking for the community as a whole, including potential to boost research profiles and reputation. However, the additional work must be kept to a minimum and sensitive data must be kept private for this paradigm to become an accepted practice. In order to foster experiments as a service in information retrieval, we present the TIRA (Testbed for Information Retrieval Algorithms) web framework that addresses the outlined challenges and possesses a unique set of compelling features in comparison to existing web-based solutions. To describe TIRA in a practical setting, we explain how it is currently used as an official evaluation platform for the well-established PAN international plagiarism detection competition. We also describe how it can be used in future scenarios for search result clustering of non-static collections of web query results, as well as within a simulation data mining setting to support interactive structural design in civil engineering.

55 citations


Book
05 Mar 2012
TL;DR: Software similarity and classification is an emerging topic with wide applications in malware detection, software theft detection, plagiarism detection, and software clone detection; treating these applied problems as similarity and classification problems enables techniques to be shared between areas.
Abstract: Software similarity and classification is an emerging topic with wide applications. It is applicable to the areas of malware detection, software theft detection, plagiarism detection, and software clone detection. Extracting program features, processing those features into suitable representations, and constructing distance metrics to define similarity and dissimilarity are the key methods to identify software variants, clones, derivatives, and classes of software. Software Similarity and Classification reviews the literature of those core concepts, in addition to relevant literature in each application and demonstrates that considering these applied problems as a similarity and classification problem enables techniques to be shared between areas. Additionally, the authors present in-depth case studies using the software similarity and classification techniques developed throughout the book.

54 citations


Proceedings Article
01 Dec 2012
TL;DR: The results demonstrate that the composition consistently outperforms previous approaches on three standard evaluation datasets, and that text reuse detection greatly benefits from incorporating a diverse feature set that reflects a wide variety of text characteristics.
Abstract: Detecting text reuse is a fundamental requirement for a variety of tasks and applications, ranging from journalistic text reuse to plagiarism detection. Text reuse is traditionally detected by computing similarity between a source text and a possibly reused text. However, existing text similarity measures exhibit a major limitation: They compute similarity only on features which can be derived from the content of the given texts, thereby inherently implying that any other text characteristics are negligible. In this paper, we overcome this traditional limitation and compute similarity along three characteristic dimensions inherent to texts: content, structure, and style. We explore and discuss possible combinations of measures along these dimensions, and our results demonstrate that the composition consistently outperforms previous approaches on three standard evaluation datasets, and that text reuse detection greatly benefits from incorporating a diverse feature set that reflects a wide variety of text characteristics.
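
A rough sketch of composing similarity along the three dimensions; the component measures chosen here (word cosine for content, stopword bigrams as a structural proxy, character trigrams as a stylistic proxy) and the equal weights are simplified stand-ins for the paper's much richer feature set:

```python
import string
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "that", "for", "it"}

def tokens(text: str):
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def cosine(c1: Counter, c2: Counter) -> float:
    num = sum(c1[t] * c2[t] for t in set(c1) & set(c2))
    den = (sum(v * v for v in c1.values()) ** 0.5) * \
          (sum(v * v for v in c2.values()) ** 0.5)
    return num / den if den else 0.0

def content_sim(a, b):    # what the texts say
    return cosine(Counter(tokens(a)), Counter(tokens(b)))

def structure_sim(a, b):  # how function words are sequenced
    sa = [t for t in tokens(a) if t in STOPWORDS]
    sb = [t for t in tokens(b) if t in STOPWORDS]
    return cosine(Counter(zip(sa, sa[1:])), Counter(zip(sb, sb[1:])))

def style_sim(a, b):      # low-level character habits
    return cosine(Counter(a[i:i + 3] for i in range(len(a) - 2)),
                  Counter(b[i:i + 3] for i in range(len(b) - 2)))

def reuse_score(a, b, w=(1/3, 1/3, 1/3)):
    return w[0] * content_sim(a, b) + w[1] * structure_sim(a, b) + w[2] * style_sim(a, b)
```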

50 citations


Proceedings ArticleDOI
23 Sep 2012
TL;DR: This study complements the work of McMillan et al. by leveraging another source of information aside from API usage patterns, namely software tags, and shows that collaborative tagging is a promising source of information for detecting similar software applications.
Abstract: Detecting similar applications is useful for various purposes, ranging from program comprehension and rapid prototyping to plagiarism detection. McMillan et al. have proposed a solution to detect similar applications based on common Java API usage patterns. Recently, collaborative tagging has impacted software development practices: various sites allow users to assign tags to software systems. In this study, we complement the work of McMillan et al. by leveraging another source of information aside from API usage patterns, namely software tags. We performed a user study involving several participants, and the results show that collaborative tagging is a promising source of information for detecting similar software applications.
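
The tag-based idea reduces to comparing tag sets; a minimal sketch with invented tag data (Jaccard is one plausible measure, not necessarily the study's):

```python
def tag_similarity(tags_a: set, tags_b: set) -> float:
    """Jaccard overlap between two applications' tag sets."""
    union = tags_a | tags_b
    return len(tags_a & tags_b) / len(union) if union else 0.0

apps = {
    "editor-a": {"editor", "text", "syntax-highlighting", "java"},
    "editor-b": {"editor", "text", "plugins", "java"},
    "game-c":   {"game", "2d", "arcade"},
}
query = "editor-a"
ranked = sorted((a for a in apps if a != query),
                key=lambda a: tag_similarity(apps[query], apps[a]),
                reverse=True)
print(ranked)  # ['editor-b', 'game-c']
```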

Journal ArticleDOI
TL;DR: A survey of plagiarism detection systems is presented; a summary of several plagiarism types, techniques, and algorithms is provided; and a web-enabled system to detect plagiarism in documents, code, and images is proposed.
Abstract: Being a growing problem, plagiarism is generally defined as "literary theft" and "academic dishonesty" in the literature, and one really has to be well informed on this topic to prevent the problem and adhere to ethical principles. This paper presents a survey of plagiarism detection systems; a summary of several plagiarism types, techniques, and algorithms is provided, and common features of different detection systems are described. At the end of the paper, the authors propose a web-enabled system to detect plagiarism in documents, code, and images; this system could also be used in e-learning, e-journal, and e-business settings.

Journal ArticleDOI
TL;DR: Turnitin plagiarism detection software was used to analyze 100 doctoral dissertations published by institutions granting doctorate degrees through a primarily online format; the results showed that 72% of the documents had at least one case of improper paraphrasing and citation.
Abstract: The current research literature has claimed that plagiarism is a significant problem in postsecondary education. Unfortunately, these claims are primarily supported by self-report data from students; in fact, little research has been done to quantify the prevalence of plagiarism, particularly at the advanced graduate education level. Further, few studies exist on online education even though this is a rapidly growing sector of higher education. This descriptive study quantified the amount of plagiarism among 100 doctoral dissertations published by institutions granting doctorate degrees through a primarily online format. The dissertations were submitted to Turnitin plagiarism detection software for analysis. The mean similarity index of these dissertations was 15.1 (SD = 13.02). The results were then categorized per previous research: 46% of the dissertations were classified as having a low level of plagiarism, 11% a medium level, and 3% a high level. Further analysis revealed that 72% of the dissertations had at least one case of improper paraphrasing and citation (verbatim text accompanied by a citation) and 46% had verbatim text without any citation. The results of this study should encourage faculty, dissertation committee members, university administrators, and accrediting bodies to take action to help reduce the level of plagiarism among doctoral learners. Suggestions for future research include comparing online and brick-and-mortar dissertation plagiarism rates, a larger study to investigate plagiarism trend data, and surveys of faculty about how they address plagiarism and ethics during the dissertation process.

Journal ArticleDOI
TL;DR: A survey of journal editors' use of CrossCheck found the plagiarism detection tool and its similarity report to be extremely useful and effective in helping editors screen documents suspected of plagiarism, and investigated editors' attitudes to potential plagiarism once discovered.
Abstract: The purpose of this survey was to investigate journal editors' use of CrossCheck, powered by iThenticate, to detect plagiarism, and their attitude to potential plagiarism once discovered. A 22-question survey was sent to 3,305 recipients, primarily scholarly journal editors from Anglophone countries, and a reduced 10-question version to 607 editors from non-Anglophone countries. The response rate was 5.6%, and 42% of all respondents had used CrossCheck in their work. The main findings are as follows: (1) the plagiarism detection tool and its similarity report are extremely useful and effective and can assist editors in screening documents suspected of plagiarism; (2) responses show the journal editors' attitude and level of tolerance towards different kinds of plagiarism in different disciplines; (3) the survey results underscore a clear consensus on editorial standards on plagiarism, but there were small variations between different disciplines and countries, as well as between Anglophone and non-Anglophone countries. The study also suggests that further work is needed to establish universal principles and practical approaches to prevent plagiarism and duplicate publication.

Journal ArticleDOI
TL;DR: Many of the proposed methods for plagiarism detection were found to have weaknesses and to fail to detect some types of plagiarized text.
Abstract: In this paper we review and list the advantages and limitations of the most significant techniques employed or developed for text plagiarism detection. It was found that many of the proposed methods have weaknesses and fail to detect some types of plagiarized text. The paper discusses several important issues in plagiarism detection, such as plagiarism detection tasks, the plagiarism detection process, and some of the current plagiarism detection techniques.

17 Sep 2012
TL;DR: The proposed methodology was the best-performing one for long-term operation, and also the most cost-effective, at the PAN 2012 plagiarism detection competition.
Abstract: In this paper, we describe our approach at the PAN 2012 plagiarism detection competition. Our candidate retrieval system is based on the extraction of three different types of Web queries, narrowing their execution by skipping certain passages of an input document. We have created queries based on keyword extraction, intrinsic plagiarism detection, and header extraction. We have also compared the performance of the constructed queries used during the PAN 2012 test process. The proposed methodology was the best-performing one for long-term operation and also the most cost-effective one. Our detailed comparison system is based on detecting common features of several types (in the final submission, we used two types of features: sorted word 5-grams and unsorted stop word 8-grams) in the input document pair. We propose a method of computing so-called valid intervals from those features, represented by their offset and length attributes in both the source and suspicious document. Previous works use the feature ordering as the measure of distance, which is not usable for multiple types of features that have no natural ordering. From those valid intervals we compute final detections in the post-processing phase, where we merge neighbouring valid intervals and remove some types of overlapping detections. We further discuss other approaches which we explored but did not use in our final submission. We also discuss the performance aspects of our program, parameter settings, and the relevance of the current PAN 2012 rules (including the plagdet score) to real-world plagiarism detection systems.
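
The two feature types used in the final submission can be sketched as follows; each feature keeps a character offset so that matching features can later be grown into valid intervals (the stop word list is abbreviated, and the code is a reconstruction, not the authors'):

```python
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "it",
             "on", "for", "as", "with", "was", "by", "at", "be", "this"}

def word_spans(text: str):
    """Yield (lowercased word, character offset) pairs."""
    offset = 0
    for word in text.split():
        start = text.index(word, offset)
        offset = start + len(word)
        yield word.lower(), start

def sorted_word_5grams(text: str):
    words = list(word_spans(text))
    for i in range(len(words) - 4):
        window = words[i:i + 5]
        key = tuple(sorted(w for w, _ in window))  # order-insensitive feature
        yield key, window[0][1]

def stopword_8grams(text: str):
    stops = [(w, s) for w, s in word_spans(text) if w in STOPWORDS]
    for i in range(len(stops) - 7):
        window = stops[i:i + 8]
        yield tuple(w for w, _ in window), window[0][1]  # order is kept
```

Sorting inside the 5-gram is what removes the natural ordering the abstract mentions, which is why valid intervals are computed from offsets rather than from feature positions.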

Journal ArticleDOI
TL;DR: The Croatian Medical Journal (CMJ) appointed Research Integrity Editor in 2001, which paved the way for the introduction of computer detection of plagiarism.
Abstract: Plagiarism detection software has considerably affected the quality of scientific publishing. No longer is plagiarism detection done by chance or is the sole responsibility of the reviewer and reader (1). The Croatian Medical Journal (CMJ) appointed a Research Integrity Editor in 2001, which paved the way for the introduction of computer detection of plagiarism (2,3). The story began when Mladen Petrovecki and Lidija Bilic-Zulle, members of the CMJ Editorial Board, came up with the idea to measure the prevalence of and attitudes toward plagiarism in the scientific community, as a follow-up to their investigation of plagiarism among students (4,5). Together with Matko Marusic and Ana Marusic, Editors-in-Chief, and Vedran Katavic, Research Integrity Editor, they developed a procedure for detecting and preventing plagiarism using plagiarism detection software, which later became a standard (1,6). The study of research integrity started in the early 2000s at the Rijeka University School of Medicine as part of two consecutive projects supported by the Ministry of Science, Technology, and Sports. Even outside our small scientific community, the projects were recognized as valuable and obtained a Committee on Publication Ethics (COPE) grant in 2010. Membership in the CrossRef association (http://www.crossref.org/) and the introduction of CrossCheck (http://www.crossref.org/crosscheck/index.html), a unique web service for detecting plagiarism in scientific publications, marked the beginning of a new era at the CMJ. In 2009, we started to systematically check all the submitted manuscripts. The plagiarism detection procedure consisted of automatic scanning of manuscripts using plagiarism detection software (eTBLAST and CrossCheck) and manual verification of manuscripts suspected of having been plagiarized (more than 10% text similarity). The criteria for plagiarism were set according to the prior investigations carried out by Bilic-Zulle et al (4,5) and Segal et al (7), and the definition of redundant publication used by the British Medical Journal (8). Manual verification (reading of both manuscripts) was done according to COPE's flowcharts (9) and the CMJ's Guidelines for Authors. Over two years, we detected 85 manuscripts (11%) containing plagiarized parts (8% true plagiarism and 3% self-plagiarism) (6). CrossCheck is an excellent service for detecting plagiarism, which detected almost all plagiarized manuscripts in our study. eTBLAST was less informative, possibly because at the time of the investigation it could only compare the text with abstracts from the Medline database (today eTBLAST searches abstracts in Medline, PubMed Central, Clinical Trials, Wikipedia, and other databases outside the field of medicine). If a suspected case of copy/paste activity was found, the investigator wrote a plagiarism report to the Editorial Board to assist in deciding on the manuscript's status. Editors mostly accepted the suggestions and, in case of disagreement, the final decision lay with the Research Integrity Editor. Cases of blatant plagiarism were easy to deal with because of text similarity in all sections of the manuscript, while those with less text similarity were sometimes more complicated and COPE's flowcharts were not sufficient to conclude whether the manuscript was plagiarized. Special attention was paid to plagiarism in the Results section. Also, there was zero tolerance for plagiarism in the Discussion section.
When manuscripts contained plagiarism in the Materials and Methods section or when the original article was not cited in follow-up investigations, accidentally or by ignorance, authors were given an opportunity to rewrite the text and publish their investigation. These examples once again show that it is of genuine importance for editors to become educators, ie, to teach authors about standards in publishing and research through continuing education (10). We believe that the main reasons for plagiarizing were unawareness of research integrity policies, poor English proficiency, attitudes toward plagiarism, and cultural values (6,11-13). In Croatia, the situation could be further deteriorated by a new law on science, higher education, and universities that abolishes the Committee for Ethics in Science and Higher Education, the highest national body dealing with research integrity (14). Integrity issues and education of future scientists about the responsible research conduct will now be the task of Croatian universities and schools only. Also, since in the academic community there is a considerable pressure to publish and since English is not the first language in Croatia, some authors simply decide to “borrow” a portion of text from previous papers (11). In addition, it has been shown that in post-communist countries moral and cultural values and attitudes toward plagiarism are different from those in Western countries that have a longer tradition of high research integrity standards (15). Plagiarism is not easy to define (16); there are still no criteria that are widely accepted by medical editors/journals as to what constitutes plagiarism. How much textual similarity raises the suspicion of plagiarism? Is it 5% or 10%, as stated by one source, or 100 words, as it was argued in the discussion of the COPE’s recent paper “How should editors respond to plagiarism?” (5-7,17)? Is there a difference between different types of plagiarism detection software? Plagiarism detection software offers valuable help in preventing plagiarism, but only if followed by manual verification (6). All manuscripts submitted to the journal should be checked and never rejected relying solely on the similarity report of plagiarism detection software (1,6). Therefore, medical editors are expected to reach a consensus on what constitutes plagiarism and make clear policies on how to deal with cases of plagiarism. The CMJ was the first scientific journal in Croatia to begin checking all the submitted manuscripts for plagiarism (2009) and, to the best of my knowledge, together with the Chinese Journal of Zhejiang University Science, the only journal in the world that has systematically collected data on plagiarism in the submitted manuscripts. Furthermore, the CMJ is the first medical journal to publish the standard operating procedure for scanning submitted manuscripts (study protocol), as part of the journal’s “striving for excellence” policy (1,18). Plagiarism detection software enables systematic detection and prevention of plagiarism, leading to fewer retractions. The results of our study were published (6) and we expect other medical journals to publish their results, not only a description of experiences. In order to reach high research integrity standards and journal quality, journals should perform systematic checking of all submitted manuscripts according to the widely accepted standards (protocols), as well as conduct ongoing education of authors.

Proceedings ArticleDOI
01 Dec 2012
TL;DR: This paper proposes a technique based on textual similarity for external plagiarism detection, using an approach based on the traditional Vector Space Model (VSM) for candidate selection.
Abstract: Plagiarism denotes the act of copying someone else's ideas or work and claiming them as one's own. Plagiarism detection is the procedure of detecting the texts of a given document which are plagiarized, i.e. copied from some other documents. Potential challenges arise because plagiarists often obfuscate the copied texts: they might shuffle, remove, insert, or replace words or short phrases; they might also restructure sentences, replace words with synonyms, and change the order of appearance of words in a sentence. In this paper we propose a technique based on textual similarity for external plagiarism detection: for a given suspicious document, we have to identify the set of source documents from which the suspicious document is copied. The method we propose comprises four phases. In the first phase, we process all the documents to generate tokens, lemmas, Part-of-Speech (PoS) classes, character offsets, sentence numbers, and named-entity (NE) classes. In the second phase, we select a subset of documents that may possibly be the sources of plagiarism, using an approach based on the traditional Vector Space Model (VSM) for this candidate selection. In the third phase, we use a graph-based approach to find the similar passages in the suspicious document and the selected source documents. Finally, we filter out the false detections.
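
A minimal sketch of the second phase, assuming plain tf-idf weighting with cosine ranking (the paper's exact weighting variant is not spelled out in the abstract):

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build sparse tf-idf vectors (dicts) for a small document list."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    return [{t: c * math.log(n / df[t]) for t, c in Counter(toks).items()}
            for toks in tokenized]

def cosine(u: dict, v: dict) -> float:
    num = sum(w * v.get(t, 0.0) for t, w in u.items())
    den = math.sqrt(sum(w * w for w in u.values())) * \
          math.sqrt(sum(w * w for w in v.values()))
    return num / den if den else 0.0

def candidate_sources(suspicious: str, sources: list, k: int = 2):
    """Return indices of the k sources most similar to the suspicious text."""
    vecs = tf_idf_vectors([suspicious] + sources)
    scored = sorted(((cosine(vecs[0], v), i) for i, v in enumerate(vecs[1:])),
                    reverse=True)
    return [i for _, i in scored[:k]]
```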

Proceedings ArticleDOI
03 Jul 2012
TL;DR: Student Submissions Integrity Diagnosis (SSID) is an open-source system that provides holistic plagiarism detection in an instructor-centric way; its workflow enhancements have made plagiarism detection in our faculty less tedious and more successful.
Abstract: Existing source code plagiarism systems focus on the problem of identifying plagiarism between pairs of submissions. The task of detection, while essential, is only a small part of managing plagiarism in an instructional setting. Holistic plagiarism detection and management requires coordination and sharing of assignment similarity -- elevating plagiarism detection from pairwise similarity to cluster-based similarity; from a single assignment to a sequence of assignments in the same course, and even among instructors of different courses. To address these shortcomings, we have developed Student Submissions Integrity Diagnosis (SSID), an open-source system that provides holistic plagiarism detection in an instructor-centric way. SSID's visuals show overviews of plagiarism clusters throughout all assignments in a course as well as highlighting the most-similar submissions for any specific student. SSID supports plagiarism detection workflows; e.g., allowing student assistants to flag suspicious assignments for later review and confirmation by an instructor with proper authority. Evidence is automatically entered into SSID's logs and shared among instructors. We have additionally collected a source code plagiarism corpus, which we employ to identify and correct shortcomings of previous plagiarism detection engines and to optimize parameter tuning for SSID deployment. Since its deployment, SSID's workflow enhancements have made plagiarism detection in our faculty less tedious and more successful.
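
Lifting pairwise similarity to the cluster-based view such overviews need can be sketched with a small union-find; the similarity threshold is an assumed parameter, not an SSID setting:

```python
def cluster(pairs, threshold=0.8):
    """pairs: iterable of (submission_a, submission_b, similarity)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, sim in pairs:
        if sim >= threshold:
            parent[find(a)] = find(b)  # union the two groups

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return [g for g in groups.values() if len(g) > 1]

print(cluster([("s1", "s2", 0.91), ("s2", "s3", 0.85), ("s4", "s5", 0.20)]))
# [{'s1', 's2', 's3'}]
```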

Journal ArticleDOI
TL;DR: An empirical study on the system's response shows that structural information, unlike existing plagiarism detectors, helps to flag significant plagiarism cases, improve the similarity index, and provide human-like plagiarism screening results.
Abstract: In plagiarism detection (PD) systems, two important problems should be considered: the problem of retrieving candidate documents that are globally similar to a document q under investigation, and the problem of side-by-side comparison of q and its candidates to pinpoint plagiarized fragments in detail. In this article, the authors investigate the usage of structural information of scientific publications in both problems, and the consideration of citation evidence in the second problem. Three statistical measures namely Inverse Generic Class Frequency, Spread, and Depth are introduced to assign a degree of importance (i.e., weight) to structural components in scientific articles. A term-weighting scheme is adjusted to incorporate component-weight factors, which is used to improve the retrieval of potential sources of plagiarism. A plagiarism screening process is applied based on a measure of resemblance, in which component-weight factors are exploited to ignore less or nonsignificant plagiarism cases. Using the notion of citation evidence, parts with proper citation evidence are excluded, and remaining cases are suspected and used to calculate the similarity index. The authors compare their approach to two flat-based baselines, TF-IDF weighting with a Cosine coefficient, and shingling with a Jaccard coefficient. In both baselines, they use different comparison units with overlapping measures for plagiarism screening. They conducted extensive experiments using a dataset of 15,412 documents divided into 8,657 source publications and 6,755 suspicious queries, which included 18,147 plagiarism cases inserted automatically. Component-weight factors are assessed using precision, recall, and F-measure averaged over a 10-fold cross-validation and compared using the ANOVA statistical test. Results from structural-based candidate retrieval and plagiarism detection are evaluated statistically against the flat baselines using paired-t tests on 10-fold cross-validation runs, which demonstrate the efficacy achieved by the proposed framework. An empirical study on the system's response shows that structural information, unlike existing plagiarism detectors, helps to flag significant plagiarism cases, improve the similarity index, and provide human-like plagiarism screening results. © 2012 Wiley Periodicals, Inc.

Journal Article
TL;DR: The experiments revealed that the best retrieval performance is obtained after removal of in-code comments and applying a combined weighting scheme based on term frequencies, normalized term frequencies, and cosine-based document normalization.
Abstract: Latent Semantic Analysis (LSA) is an intelligent information retrieval technique that uses mathematical algorithms for analyzing large corpora of text and revealing the underlying semantic information of documents. LSA is a highly parameterized statistical method, and its effectiveness is driven by the setting of its parameters, which are adjusted based on the task to which it is applied. This paper discusses and evaluates the importance of parameterization for LSA-based similarity detection of source-code documents, and the applicability of LSA as a technique for source-code plagiarism detection when its parameters are appropriately tuned. The parameters involve preprocessing techniques, weighting approaches, and parameter tweaking inherent to LSA processing – in particular, the choice of dimensions for the step of reducing the original post-SVD matrix. The experiments revealed that the best retrieval performance is obtained after removal of in-code comments (Java comment blocks) and applying a combined weighting scheme based on term frequencies, normalized term frequencies, and a cosine-based document normalization. Furthermore, the use of similarity thresholds (instead of mere rankings) requires the use of a higher number of dimensions. (Slovenian abstract, translated: The paper analyzes the LSA method specifically with regard to source-code plagiarism.)
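
A hedged sketch of such a pipeline with scikit-learn standing in for the paper's LSA machinery: Java comments are stripped, terms are weighted, and documents are compared after SVD dimensionality reduction. Plain tf-idf replaces the paper's combined weighting scheme, and the dimension count is illustrative:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def strip_java_comments(code: str) -> str:
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.DOTALL)  # /* ... */ blocks
    return re.sub(r"//[^\n]*", " ", code)                    # // line comments

def lsa_similarities(sources, dims=2):
    """Pairwise cosine similarities of source files in the latent space."""
    cleaned = [strip_java_comments(s) for s in sources]
    tfidf = TfidfVectorizer(token_pattern=r"[A-Za-z_]\w*").fit_transform(cleaned)
    latent = TruncatedSVD(n_components=dims).fit_transform(tfidf)
    return cosine_similarity(latent)
```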

Proceedings ArticleDOI
29 May 2012
TL;DR: Evaluation results using real student program assignments show the effectiveness of the proposed method to automatically detect plagiarisms, i.e. illegal copies, among a set of programs submitted by students in elementary programming courses.
Abstract: This paper proposes a method to automatically detect plagiarisms, i.e. illegal copies, among a set of programs submitted by students in elementary programming courses. In such courses, programming assignments are so simple that submitted programs are very short and similar to each other. Existing plagiarism detection methods, therefore, may yield many false positive results. The proposed method solves the problem by using three types of similarity: code, comment, and inconsistence. The inconsistence similarity, a unique feature of the method, improves the precision and recall ratios, and helps to find evidences of plagiarisms. Evaluation results using real student program assignments show the effectiveness of the proposed method.

Journal ArticleDOI
TL;DR: Plagiarism training was found to improve students' recognition of plagiarism in the sciences; the students' ability to identify plagiarism was not significantly different when the quoted or paraphrased text included complex sentence structure and scientific jargon versus when it included only simple sentences that mostly lacked jargon.
Abstract: Regrettably, the sciences are not untouched by the plagiarism affliction that threatens the integrity of budding professionals in classrooms around the world. My research, however, suggests that plagiarism training can improve students' recognition of plagiarism. I found that 148 undergraduate ecology students successfully identified plagiarized or unplagiarized paragraphs three-quarters of the time. The students' ability to identify plagiarism was not significantly different when the quoted or paraphrased text included complex sentence structure and scientific jargon and when it included only simple sentences that mostly lacked jargon. The students who received plagiarism training performed significantly better at plagiarism detection than did those who did not receive the training. Most of the students, independent of training, identified properly paraphrased, quoted, and attributed material but had much greater difficulty identifying paraphrases that included long strings of copied text—up to 15 words—...

01 Jan 2012
TL;DR: An open-source prototype of a citation-based plagiarism detection system called CitePlag, to evaluate the citations of academic documents as language independent markers to detect plagiarism, is presented.
Abstract: This paper presents an open-source prototype of a citation-based plagiarism detection system called CitePlag. The underlying idea of the system is to evaluate the citations of academic documents as language independent markers to detect plagiarism. CitePlag uses three different detection algorithms that analyze the citation sequence of academic documents for similar patterns that may indicate unduly used foreign text or ideas. The algorithms consider multiple citation-related factors such as proximity and order of citations within the text, or their probability of co-occurrence in order to compute document similarity scores. We present technical details of CitePlag’s detection algorithms and the acquisition of test data from the PubMed Central Open Access Subset. Future advancement of the prototype lies in increasing the reference database by enabling the system to process more document and citation formats. Improving CitePlag’s detection algorithms and scoring functions to reduce the number of false positives is another major goal. Eventually, we plan to integrate text-based detection algorithms in addition to the citation-based detection algorithms within CitePlag.
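
The citation-sequence idea can be sketched by reducing each document to its ordered list of cited works and scoring shared ordered runs; difflib is a stand-in here, whereas CitePlag's three algorithms additionally weigh citation proximity and co-occurrence probability:

```python
from difflib import SequenceMatcher

def citation_similarity(seq_a, seq_b) -> float:
    """Ratio (0..1) of matching ordered citation subsequences."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()

doc_a = ["smith2004", "jones2001", "lee2008", "wu2010", "kim2003"]
doc_b = ["jones2001", "lee2008", "wu2010", "park2009"]
print(citation_similarity(doc_a, doc_b))  # high: shared ordered run of three
```

Because citations survive translation and paraphrase, such patterns stay visible even when no literal text is shared.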

Journal ArticleDOI
TL;DR: This paper proposes an automatic system for semantic plagiarism detection based on ontology mapping; ontology is a way of describing document semantics, and ontology mapping can resolve semantic heterogeneity between documents.
Abstract: Plagiarism detection can play an important role in detecting the theft of original ideas in papers, journals, and internet web sites. Checking these manually is simply impossible nowadays due to the existence of large digital repositories. Ontology is a way of describing document semantics, and ontology mapping can resolve semantic heterogeneity between documents. Our paper proposes an automatic system for semantic plagiarism detection based on ontology mapping.

17 Feb 2012
TL;DR: This thesis presents a comprehensive overview of the different ways in which text and language are reused today and how exactly information retrieval technologies can be applied in this respect, and introduces technologies that solve three retrieval tasks based on language reuse.
Abstract: Texts from the web can be reused individually or in large quantities. The former is called text reuse and the latter language reuse. We first present a comprehensive overview of the different ways in which text and language are reused today, and how exactly information retrieval technologies can be applied in this respect. The remainder of the thesis then deals with specific retrieval tasks. In general, our contributions consist of models and algorithms, their evaluation, and for that purpose, large-scale corpus construction. The thesis divides into two parts. The first part introduces technologies for text reuse detection, and our contributions are as follows: (1) A unified view of projecting-based and embedding-based fingerprinting for near-duplicate detection and the first evaluation of fingerprint algorithms on Wikipedia revision histories as a new, large-scale corpus of near-duplicates. (2) A new retrieval model for the quantification of cross-language text similarity, which gets by without parallel corpora. We have evaluated the model in comparison to other models on many different pairs of languages. (3) An evaluation framework for text reuse and particularly plagiarism detectors, which consists of tailored detection performance measures and a large-scale corpus of automatically generated and manually written plagiarism cases. The latter have been obtained via crowdsourcing. This framework has been successfully applied to evaluate many different state-of-the-art plagiarism detection approaches within three international evaluation competitions. The second part introduces technologies that solve three retrieval tasks based on language reuse, and our contributions are as follows: (4) A new model for the comparison of textual and non-textual web items across media, which exploits web comments as a source of information about the topic of an item. In this connection, we identify web comments as a largely neglected information source and introduce the rationale of comment retrieval. (5) Two new algorithms for query segmentation, which exploit web n-grams and Wikipedia as a means of discerning the user intent of a keyword query. Moreover, we crowdsource a new corpus for the evaluation of query segmentation which surpasses existing corpora by two orders of magnitude. (6) A new writing assistance tool called Netspeak, which is a search engine for commonly used language. Netspeak indexes the web in the form of web n-grams as a source of writing examples and implements a wildcard query processor on top of it.

01 Jan 2012
TL;DR: The authors describe their plagiarism detection system, used to process the PAN plagiarism corpus for the Candidate Document Retrieval and Detailed Comparison tasks, and propose a method based on tf*idf that extracts the keywords of suspicious documents as queries.
Abstract: In this paper we report on our plagiarism detection system, which is used to process the PAN plagiarism corpus for the tasks of Candidate Document Retrieval and Detailed Comparison. To retrieve plagiarism candidate documents via the ChatNoir API, a method based on tf*idf is proposed to extract the keywords of suspicious documents as queries. A Lucene ranking method is used for plagiarism candidate document reduction, and a detailed comparison algorithm is applied to find the web pages that are actually sources of plagiarized passages. To extract all plagiarism passages from the suspicious document and their corresponding source passages from the source document, a plagiarism detection method combining semantic similarity and structure similarity is proposed. Semantic similarity is calculated with a Vector Space Model, while structure similarity is calculated by our own method. We use information retrieval to get candidate pairs of sentences from the suspicious document and a potential source document, and a method called Bilateral Alternating Sorting is applied to merge pairs of sentences; the plagiarism candidate result pairs are then screened in post-processing. In summary, for the Candidate Document Retrieval sub-task, a method based on tf*idf to extract the keywords of suspicious documents as queries was proposed, with a scoring method for plagiarism candidate document ranking; for the Detailed Comparison sub-task, a plagiarism detection method based on the Vector Space Model (VSM) and an Overlapping Measure Model at the sentence level was presented, with Bilateral Alternating Sorting designed to merge the pairs of plagiarism sentences.
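
The query-construction step can be sketched as picking the top tf*idf keywords of a suspicious document against a background collection; the background documents and the cutoff k below are placeholders:

```python
import math
from collections import Counter

def top_keywords(doc: str, background: list, k: int = 5):
    """Rank the document's terms by tf * idf against a background corpus."""
    n = len(background) + 1
    df = Counter(t for b in background for t in set(b.lower().split()))
    scores = {t: c * math.log(n / (1 + df[t]))
              for t, c in Counter(doc.lower().split()).items()}
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]

query = " ".join(top_keywords(
    "winnowing fingerprints of the suspicious text under investigation",
    ["some unrelated background document", "another plain background text"]))
print(query)
```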

20 Sep 2012
TL;DR: The paper presents approaches to three deception-related tasks in the Author Identification track at PAN 2012, representing the authors' first proper attempt at any of them, and reports the results achieved using some simple but relatively novel approaches.
Abstract: Tasks such as Authorship Attribution, Intrinsic Plagiarism Detection, and Sexual Predator Identification are representative of attempts to deceive: in the first two, authors try to convince others that the presented work is theirs, and in the third there is an attempt to convince readers to take actions based on false beliefs or ill-perceived risks. In this paper, we discuss our approaches to these tasks in the Author Identification track at PAN 2012, which represents our first proper attempt at any of them. Our initial intention was to determine whether cues of deception, documented in the literature, might be relevant to such tasks. However, it quickly became apparent that such cues would not be readily useful, and we discuss the results achieved using some simple but relatively novel approaches: for the Traditional Authorship Attribution task, we show how a mean-variance framework using just 10 stopwords detects 42.8% and could obtain 52.12% using fewer; for Intrinsic Plagiarism Detection, frequent words achieved 91.1% overall; and for Sexual Predator Identification, we used just a few features covering requests for personal information, with mixed results.
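
A crude sketch of the stopword-profile idea behind the intrinsic plagiarism figure: profile each block of a document by relative stopword frequencies, then flag blocks that deviate strongly from the document mean. The word list and the deviation threshold are guesses, not the authors' settings:

```python
STOPWORDS = ["the", "of", "and", "a", "in", "to", "is", "that", "it", "for"]

def profile(block: str):
    """Relative frequency of each stopword in a block."""
    toks = block.lower().split()
    return [toks.count(s) / max(len(toks), 1) for s in STOPWORDS]

def outlier_blocks(blocks, z=2.0):
    """Indices of blocks whose stopword profile deviates from the mean."""
    profs = [profile(b) for b in blocks]
    mean = [sum(col) / len(profs) for col in zip(*profs)]
    var = [sum((x - m) ** 2 for x in col) / len(profs)
           for col, m in zip(zip(*profs), mean)]
    flagged = []
    for i, p in enumerate(profs):
        dev = sum((x - m) ** 2 / (v + 1e-9)
                  for x, m, v in zip(p, mean, var))
        if dev > z * len(STOPWORDS):
            flagged.append(i)  # stylistically inconsistent block
    return flagged
```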

Book ChapterDOI
29 May 2012
TL;DR: Iqtebas 1.0, a novel plagiarism detection system for Arabic text-based documents, is presented; it is built on top of a search engine to reduce the cost of pairwise similarity, improving search time while keeping the detection process accurate and robust.
Abstract: This paper presents Iqtebas 1.0, a novel plagiarism detection system for Arabic text-based documents, and an early work dedicated to plagiarism detection for Arabic documents. Arabic is a rich morphological language that is among the most used languages in the world and on the Internet. Given a document and a set of suspected files, our goal is to compute the originality value of the examined document. The originality value of a text is computed from the distance between each sentence in the text and the closest sentence in the suspected files, if one exists. The proposed system structure is based on a search engine in order to reduce the cost of pairwise similarity. For the indexing process, we use the winnowing n-gram fingerprinting algorithm to reduce the index size. The fingerprints of each sentence are its n-grams, represented by hash codes; the winnowing algorithm computes fingerprints for each sentence. As a result, the search time is improved and the detection process is accurate and robust. The experimental results showed superb performance of Iqtebas 1.0, which achieved a recall of 94% and a precision of 99%. Moreover, a comparison carried out between Iqtebas and the well-known plagiarism detection system SafeAssign confirmed the high performance of Iqtebas.
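
The winnowing step can be sketched directly: hash every k-gram, then keep one minimal hash per sliding window, which shrinks the index while guaranteeing that sufficiently long matches still share fingerprints. k and w are illustrative, and a persistent index would need a stable hash (e.g., hashlib) instead of Python's per-process hash():

```python
def winnow(text: str, k: int = 5, w: int = 4):
    """Simplified winnowing: one (hash, position) fingerprint per window."""
    text = "".join(text.lower().split())  # normalize whitespace away
    hashes = [hash(text[i:i + k]) for i in range(len(text) - k + 1)]
    fingerprints = set()
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        j = min(range(w), key=lambda x: window[x])  # leftmost minimum
        fingerprints.add((window[j], i + j))
    return fingerprints

a = winnow("plagiarism detection for arabic documents")
b = winnow("detection for arabic texts and documents")
print(len({h for h, _ in a} & {h for h, _ in b}))  # shared fingerprints
```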

Journal Article
TL;DR: An intervention using the plagiarism detection software Turnitin was developed to enhance good practice within academic writing; 71 students evaluated their learning experiences, and a reduction in academic misconduct cases was found compared to the previous year.
Abstract: There is growing concern regarding plagiarism within student writing. This has prompted investigation into both the factors that predict plagiarism and potential methods of reducing plagiarism. Consequently, we developed and evaluated an intervention to enhance good practice within academic writing through the use of the plagiarism detection software Turnitin. One hundred and sixteen first-year psychology students submitted work to Turnitin, and 71 of these students evaluated their learning experiences. For the next assignment the students completed, there was a reduction in academic misconduct cases compared to the previous year, and students evaluated the session positively. The findings have implications for teaching good practice in academic writing.