
Showing papers on "Plagiarism detection published in 2003"


Proceedings ArticleDOI
09 Jun 2003
TL;DR: The class of local document fingerprinting algorithms is introduced, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies, and a novel lower bound on the performance of any local algorithm is proved.
Abstract: Digital content is for copying: quotation, revision, plagiarism, and file sharing all create copies. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents. We introduce the class of local document fingerprinting algorithms, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies. We prove a novel lower bound on the performance of any local algorithm. We also develop winnowing, an efficient local fingerprinting algorithm, and show that winnowing's performance is within 33% of the lower bound. Finally, we give experimental results on Web data, and report experience with MOSS, a widely-used plagiarism detection service.

1,220 citations
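
The winnowing selection step described above is compact enough to sketch. Below is a minimal, illustrative version, assuming MD5-based k-gram hashes truncated to 64 bits (the paper leaves the hash function open) and the rule of keeping the rightmost minimal hash per window:

```python
import hashlib

def kgram_hashes(text: str, k: int) -> list[int]:
    """64-bit hash of every overlapping k-gram of the text."""
    return [
        int.from_bytes(hashlib.md5(text[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(text) - k + 1)
    ]

def winnow(text: str, k: int = 5, w: int = 4) -> set[tuple[int, int]]:
    """In each window of w consecutive k-gram hashes, keep the minimum
    hash (rightmost occurrence on ties) as a (position, hash) fingerprint."""
    hashes = kgram_hashes(text, k)
    fingerprints = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        m = min(window)
        offset = max(i for i, h in enumerate(window) if h == m)
        fingerprints.add((start + offset, hashes[start + offset]))
    return fingerprints

# Fingerprint hashes shared between two documents witness copied k-grams:
a = winnow("the quick brown fox jumps over the lazy dog")
b = winnow("lo, the quick brown fox jumps over the lazy cat")
print({h for _, h in a} & {h for _, h in b})
```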


01 Jan 2003
TL;DR: The nature of the plagiarism problem is explored, the approaches used so far for its detection are summarized, and a number of methods used to measure text reuse are discussed.
Abstract: Automatic methods of measuring similarity between program code and natural language text pairs have been used for many years to assist humans in detecting plagiarism. For example, over the past thirty years or so, a vast number of approaches have been proposed for detecting likely plagiarism between programs written by Computer Science students. More recently, approaches to identifying similarities between natural language texts have been addressed, but given the greater ambiguity and complexity of natural languages compared with programming languages, this task is very difficult. Automatic detection is gaining further interest from both the academic and commercial worlds given the ease with which texts can now be found, copied and rewritten. Following the recent increase in the popularity of on-line plagiarism detection services and the increased publicity surrounding cases of plagiarism in academia and industry, this paper explores the nature of the plagiarism problem and summarises the approaches used so far for its detection. I focus on plagiarism detection in natural language, and discuss a number of methods I have used to measure text reuse. I end by suggesting a number of recommendations for further work in the field of automatic plagiarism detection.

111 citations
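
As a concrete example of the kind of text reuse measure discussed above, here is a word n-gram containment score, a standard measure in this literature (not necessarily the author's own method):

```python
def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Lowercased word n-grams of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def containment(suspect: str, source: str, n: int = 3) -> float:
    """Fraction of the suspect's n-grams also found in the source;
    values near 1.0 suggest heavy reuse, near 0.0 little overlap."""
    s = ngrams(suspect, n)
    return len(s & ngrams(source, n)) / len(s) if s else 0.0
```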


01 Jan 2003
TL;DR: A comprehensive survey on natural language text copy detection is given, the development of copy detection is traced, and some key detection techniques are listed and compared with each other.
Abstract: Copy detection has important applications in both intellectual property protection and information retrieval. Current work concentrates mainly on document copy detection. In the early days, document copy detection focused mainly on program plagiarism detection; now most studies address text copy detection. In this paper, a comprehensive survey of natural language text copy detection is given and the development of the field is traced. The approaches and features of a variety of existing text copy detection systems and prototypes are reviewed in detail. Then some key detection techniques are listed and compared with each other. In the end, the future trend of text copy detection is discussed.

43 citations


Proceedings ArticleDOI
26 Oct 2003
TL;DR: SSK first finds semantic sequences in documents and then uses a kernel function to calculate their similarity; experiments show that SSK is excellent on a non-reworded corpus and remains valid on a reworded corpus with some loss of performance.
Abstract: We present the semantic sequence kernel (SSK) to detect document plagiarism; it is derived from the string kernel (SK) and the word sequence kernel (WSK). SSK first finds semantic sequences in documents and then uses a kernel function to calculate their similarity. SK and WSK only calculate the gap between the first word and the last one, whereas SSK takes each common word's position information into account. We believe SSK contains both local and global information, so that it makes great progress in small partial plagiarism detection. We compare SSK with the relative frequency model and the semantic sequence model, which is a word frequency based model. The results show that SSK is excellent on a non-reworded corpus. It is also valid on a reworded corpus, with some impairment of performance.

24 citations
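
The abstract does not spell out the kernel itself, so the following is only a toy illustration of the stated idea, that every common word's position feeds the similarity (unlike SK and WSK, which use only the gap between the first and last word); it is not the authors' SSK:

```python
import math

def first_positions(words: list[str]) -> dict[str, float]:
    """Normalized position (0..1) of each word's first occurrence."""
    pos: dict[str, float] = {}
    for i, w in enumerate(words):
        pos.setdefault(w, i / len(words))
    return pos

def position_aware_similarity(doc_a: list[str], doc_b: list[str],
                              lam: float = 5.0) -> float:
    """Each shared word contributes a weight that decays as its positions
    in the two documents drift apart, so position information enters the
    score word by word rather than only via the span of the match."""
    pa, pb = first_positions(doc_a), first_positions(doc_b)
    return sum(math.exp(-lam * abs(pa[w] - pb[w])) for w in pa.keys() & pb.keys())
```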


01 Jan 2003
TL;DR: An internal plagiarism detection method that uses style markers from authorship attribution studies in order to find stylistic changes in a text; at sentence and paragraph levels the style markers generally cannot detect plagiarized sections, and at larger levels the results are strongly influenced by the sliding window approach.
Abstract: Most of the existing plagiarism detection systems compare a text to a database of other texts. These external approaches, however, are vulnerable because texts not contained in the database cannot be detected as source texts. This paper examines an internal plagiarism detection method that uses style markers from authorship attribution studies in order to find stylistic changes in a text. These changes might pinpoint plagiarized passages. Additionally, a new style marker called specific words is introduced. A pre-study tests if the style markers can fingerprint an author's style and if they are constant with sample size. It is shown that vocabulary richness measures do not fulfil these prerequisites. The other style markers - simple ratio measures, readability scores, frequency lists, and entropy measures - have these characteristics and are, together with the new specific words measure, used in a main study with an unsupervised approach for detecting stylistic changes in plagiarized texts at sentence and paragraph levels. It is shown that at these small levels the style markers generally cannot detect plagiarized sections because of intra-authorial stylistic variations (i.e. noise), and that at bigger levels the results are strongly affected by the sliding window approach. The specific words measure, however, can pinpoint single sentences written by another author.

12 citations
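
A minimal sketch of the sliding-window idea, with average word length standing in for the paper's richer marker set (ratio measures, readability scores, frequency lists, entropy, specific words):

```python
import statistics

def avg_word_length(sentence: str) -> float:
    """A deliberately simple style marker."""
    words = sentence.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0

def flag_style_outliers(sentences: list[str], window: int = 5,
                        z: float = 2.0) -> list[int]:
    """Score each sliding window of sentences and flag windows whose style
    score deviates from the document mean by more than z standard
    deviations; flagged windows are candidate foreign passages."""
    if len(sentences) < window:
        return []
    scores = [statistics.mean(avg_word_length(s) for s in sentences[i:i + window])
              for i in range(len(sentences) - window + 1)]
    mu, sigma = statistics.mean(scores), statistics.pstdev(scores)
    return [i for i, s in enumerate(scores) if sigma and abs(s - mu) > z * sigma]
```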


01 Jan 2003
TL;DR: Investigations into two plagiarism detection tools are described: the widely used commercial service Turnitin, which is more useful for detecting plagiarism from web sources, and an in-house tool, Ferret, which is better suited to detecting collusion within a group of students.
Abstract: One strategy in the prevention and detection of plagiarism and collusion is to use an automated detection tool. We argue that, for consistent treatment of students, we should be applying these tools to ALL written submissions in a given assignment rather than merely using a detection tool to confirm suspicions that a single text has been plagiarised. In this paper we describe our investigations into two plagiarism detection tools: the widely used commercial service Turnitin, and an in-house tool, Ferret. We conclude that there are technical and practical problems, first in the large scale use of electronic submission of assignments and then in the further submission of these assignments to a plagiarism detector. Nevertheless, the reporting mechanisms of both tools are fast and easy to use. Turnitin is more useful in detecting plagiarism from web sources, Ferret for detecting collusion within a group of students.

12 citations


01 Jan 2003
TL;DR: Electronic submission of student assignments certainly provides many advantages for the faculty member and graders, and paperless transactions are especially useful when the number of submissions is large and the assignments must be distributed to multiple locations.
Abstract: And so it is with grading assignments that have been submitted electronically. Electronic submission of student assignments certainly provides many advantages for the faculty member and graders. For instance, electronic submissions are easier to manage and keep track of than their paper counterparts, particularly as the number of submissions gets large. Submissions can be time-stamped automatically and archived, thus minimizing the potential for disputes over lateness and lost assignments and/or grades. Furthermore, archives can help resolve issues involving academic dishonesty and/or plagiarism. Finally, paperless transactions are especially useful when the number of submissions is large and the assignments must be distributed to multiple locations (such as to teaching assistants, graders, and plagiarism detection software).

10 citations


01 Jan 2003
TL;DR: XPDec, an XML-based model, is introduced to detect the similarities among programs that arise under plagiarism; based upon the syntax of a specific programming language, it uses an XML scheme suited to plagiarism detection.
Abstract: Plagiarism is commonplace in academia, especially in courses involving programming. In this paper, XPDec, an XML-based model, is introduced to detect similarities among programs that arise under plagiarism. Based upon the syntax of a specific programming language, XPDec uses an XML scheme that is suitable for the detection of plagiarism. XML documents are generated from given program sources, and XQuery is used to extract information relevant to the detection of plagiarism. The XML's tree-like representation of query results is exploited to ignore common forms of reordering that arise in plagiarism. The level of similarity between a pair of programs is numerically quantified and reported. The usefulness of XPDec in the detection of plagiarism is discussed. XPDec has been implemented, and its architecture is presented.

9 citations
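
The abstract gives the pipeline (program source to XML, queries over the tree, reorder-tolerant comparison) without code. A toy approximation, using Python's own syntax tree in place of the paper's language-specific XML scheme:

```python
import ast
import xml.etree.ElementTree as ET

def program_to_xml(source: str) -> ET.Element:
    """One XML element per syntax-tree node (Python's ast stands in for
    the language-specific scheme described in the paper)."""
    def build(node: ast.AST) -> ET.Element:
        elem = ET.Element(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            elem.append(build(child))
        return elem
    return build(ast.parse(source))

def structural_similarity(a: ET.Element, b: ET.Element) -> float:
    """Jaccard overlap of root-to-node tag paths; because the paths form
    a set, sibling order is ignored, loosely mimicking XPDec's tolerance
    of simple statement reordering."""
    def paths(elem: ET.Element, prefix: str = "") -> set[str]:
        p = prefix + "/" + elem.tag
        out = {p}
        for child in elem:
            out |= paths(child, p)
        return out
    pa, pb = paths(a), paths(b)
    return len(pa & pb) / len(pa | pb) if pa | pb else 1.0
```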



01 May 2003
TL;DR: Tests of the chunking methods used for plagiarism detection make it possible to decide on the best-fitting chunking method for a given application.
Abstract: This paper describes tests of the chunking methods used for plagiarism detection. The results make it possible to decide on the best-fitting chunking method for a given application. For example, overlapping word chunking is good for a grammar analyzer or for small databases; sentence chunking is best suited to finding quoted text; hashed breakpoint chunking is the fastest method and is therefore advisable for searching large sets of documents; and if more reliability is needed, overlapping hashed breakpoint chunking can be used as well.

9 citations
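
For readers unfamiliar with the methods compared above, here is a hedged sketch of three of them; chunk length and breakpoint divisor are illustrative assumptions, not the paper's values:

```python
import re
import zlib

def overlapping_word_chunks(text: str, n: int = 5) -> list[str]:
    """Every run of n consecutive words (overlapping word chunking)."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def sentence_chunks(text: str) -> list[str]:
    """Naive sentence chunking on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def hashed_breakpoint_chunks(text: str, divisor: int = 8) -> list[str]:
    """Start a new chunk after any word whose hash is 0 mod divisor, so
    boundaries depend on content rather than position (hashed breakpoints)."""
    chunks, current = [], []
    for word in text.split():
        current.append(word)
        if zlib.crc32(word.encode()) % divisor == 0:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks
```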



01 Jan 2003
TL;DR: This case study is an evaluation of generic, general-purpose plagiarism detection systems applied to a specific domain and task: detecting intra-class student copying in a corpus of Biomedical Science laboratory reports.
Abstract: This case study is an evaluation of generic, general-purpose plagiarism detection systems applied to a specific domain and task: detecting intra-class student copying in a corpus of Biomedical Science laboratory reports. From the outset, our project had the practical, pragmatic aim of finding a workable solution to a specific problem. Biomedical Science undergraduates learn experimental methods by working through a series of laboratory experiments and reporting on their results. These laboratory reports are “peer-reviewed” in large classes, following a prescribed marking scheme; as the reports are effectively marked by other students rather than by a single lecturer, there is an opportunity for an unscrupulous student to avoid having to carry out and report on an experiment by simply copying another student’s report. To reduce this temptation, the Biomedical Science director of teaching, Paul Gent, approached Eric Atwell of the School of Computing and Clive Souter of the Centre for Joint Honours in Science to look at ways to compare laboratory reports automatically and flag candidates with signs of copying. We were joined by Julia Medori, a forensic linguist from Trinity College Dublin, who developed and evaluated a range of possible solutions.

Maeve Paris
01 Aug 2003
TL;DR: If a computer programming language is considered to be similar to a natural language, computer-assisted text analysis techniques may be employed to assist the academic in detecting plagiarism in source code; computational linguistics might thus inform software metrics, and vice versa.
Abstract: Plagiarism and collusion among students may be facilitated by the preponderance of material in electronic format and the ability to submit coursework online. A distinction has generally been drawn between plagiarism of text and plagiarism of source code, and different tools and metrics have been developed for either type. However, if a computer programming language is considered to be similar to a natural language (although it has a restricted syntax and vocabulary), computer-assisted text analysis techniques may be employed to assist the academic in detecting plagiarism in source code. So computational linguistics might inform software metrics, and vice versa.
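
A minimal illustration of this transfer: tokenize source code like text, collapse identifiers so that renaming variables does not hide copying, and reuse an n-gram overlap score. The keyword list and parameters are illustrative assumptions:

```python
import re

# A deliberately partial keyword list; a real tool would use the full lexicon.
PY_KEYWORDS = {"def", "return", "if", "else", "for", "while", "in", "import"}

def normalized_tokens(source: str) -> list[str]:
    """Crude lexer: identifiers collapse to ID, a classic counter to
    variable renaming; numbers and punctuation pass through."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]", source)
    return ["ID" if re.fullmatch(r"[A-Za-z_]\w*", t) and t not in PY_KEYWORDS
            else t for t in tokens]

def token_ngram_similarity(code_a: str, code_b: str, n: int = 4) -> float:
    """Jaccard overlap of token n-grams: a text-analysis measure applied
    directly to source code."""
    def grams(tokens: list[str]) -> set[tuple[str, ...]]:
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    ga, gb = grams(normalized_tokens(code_a)), grams(normalized_tokens(code_b))
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0
```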

Book
01 Jan 2003
TL;DR: This dissertation studies the problem of searching and retrieving music based on acoustic content similarity using two types of systems, one based on exhaustive matching by dynamic programming (which is relatively accurate but not scalable), the other based on high-dimensional indexing (which is less accurate but scalable).
Abstract: With the explosive growth of music data available on the internet in recent years, there has been much interest in developing new ways to search and retrieve such data effectively. Currently, most music search engines operate on text labels or symbolic data, rather than on the underlying acoustic content. A truly content-based music retrieval system should have the ability to find similar songs based on their underlying score or melody, regardless of their metadata description or file names. Potential applications include automatic music identification, music analysis, plagiarism detection, copyright enforcement, etc. In this dissertation, we study the problem of searching and retrieving music based on acoustic content similarity. Given a query sound clip, our goal is to retrieve “similar” occurrences from a music database, where similarity is based on the intuitive notion of “same song” perceived by humans: two pieces are similar if they are fully or partially based on the same score, even if they are performed by different people, with different instruments, or at different tempo. Retrieval results are given as a list of songs ranked by computed similarity estimate. Both the input query and the underlying database are taken from actual music recordings in raw acoustic format. We study two types of systems, one based on exhaustive matching by dynamic programming (which is relatively accurate but not scalable), the other based on high-dimensional indexing (which is less accurate but scalable). For the latter index-based retrieval system, the core algorithm is parallelizable and can be placed into a peer-to-peer architecture for improved performance, with the ability to share spare CPU resources and to achieve dynamic load-balancing.
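
The dissertation's exact dynamic-programming formulation is not reproduced in the abstract; dynamic time warping is the textbook choice for this kind of tempo-tolerant matching, so here is a sketch under that assumption:

```python
def dtw_distance(query: list[float], song: list[float]) -> float:
    """Dynamic-time-warping alignment cost between two feature sequences
    (e.g., pitch contours); tempo changes stretch the warping path
    instead of breaking the match."""
    INF = float("inf")
    n, m = len(query), len(song)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(query[i - 1] - song[j - 1])
            d[i][j] = cost + min(
                d[i - 1][j],      # consume a query frame
                d[i][j - 1],      # consume a song frame
                d[i - 1][j - 1],  # consume one of each
            )
    return d[n][m]
```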


Journal ArticleDOI
TL;DR: This article describes Turnitin, an anti-plagiarism program run by the Joint Information Systems Committee and thought to be the first national plagiarism detection system of its kind.
Abstract: Plagiarism is one of the most serious offences in the academic world. It has occurred for as long as there have been teachers and students, but the recent growth of the Internet has made the problem much worse. Recent studies indicate that approximately 30% of all students may be plagiarising on every written assignment they complete. The “information technology revolution” is almost always presented as having cataclysmic consequences for education. In post‐secondary circles, perhaps the most commonly apprehended cataclysm is “Internet plagiarism”. Academics at all British universities and colleges can now test students’ work for cheating using the anti‐plagiarism program Turnitin. The program, run by the Joint Information Systems Committee and thought to be the first national system of its kind, offers free advice and a plagiarism detection service to all further education institutions in the UK. This article will try to: first, define exactly what plagiarism is; second, give examples and reports on...

Proceedings Article
01 Jan 2003
TL;DR: A knowledge-based approach to the detection of plagiarism is presented in which the documents to be tested are described as graph structures; experimental results show that the approach is workable and effective.
Abstract: With the growth of the Internet, it is much easier for plagiarists to copy materials from the Internet and put them in their own documents without the permission of the original authors. Plagiarists usually modify the copied materials to avoid being detected. Therefore, simple comparison of two documents is not sufficient for detecting plagiarism. In this paper, we present a knowledge-based approach to the detection of plagiarism. We analyze the types and behaviors of plagiarism and describe the documents to be tested in the form of graph structures. The problem of detecting plagiarism then becomes one of comparing the similarity of these structures. Experimental results show that our approach is workable and effective.
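
The abstract leaves the graph structure unspecified, so the following is only a toy stand-in: content words as nodes, adjacency as edges, and similarity as set overlap between the two graphs.

```python
def document_graph(text: str) -> tuple[set[str], set[tuple[str, str]]]:
    """Toy document graph: nodes are lowercased words, edges link words
    adjacent in the text (a stand-in for the paper's richer structures)."""
    words = [w.lower().strip(".,;:!?") for w in text.split()]
    nodes = set(words)
    edges = {tuple(sorted(pair)) for pair in zip(words, words[1:])}
    return nodes, edges

def graph_similarity(text_a: str, text_b: str) -> float:
    """Average Jaccard overlap of the two graphs' node and edge sets."""
    (na, ea), (nb, eb) = document_graph(text_a), document_graph(text_b)
    node_sim = len(na & nb) / len(na | nb) if na | nb else 1.0
    edge_sim = len(ea & eb) / len(ea | eb) if ea | eb else 1.0
    return (node_sim + edge_sim) / 2
```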