
Showing papers on "Plagiarism detection published in 2003"


Proceedings ArticleDOI
09 Jun 2003
TL;DR: The class of local document fingerprinting algorithms is introduced, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies, and a novel lower bound on the performance of any local algorithm is proved.
Abstract: Digital content is for copying: quotation, revision, plagiarism, and file sharing all create copies. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents. We introduce the class of local document fingerprinting algorithms, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies. We prove a novel lower bound on the performance of any local algorithm. We also develop winnowing, an efficient local fingerprinting algorithm, and show that winnowing's performance is within 33% of the lower bound. Finally, we give experimental results on Web data, and report experience with MOSS, a widely-used plagiarism detection service.

1,220 citations
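
The winnowing selection step described above is compact enough to sketch. Below is a minimal, illustrative version, assuming MD5-based k-gram hashes truncated to 64 bits (the paper leaves the hash function open) and the rule of keeping the rightmost minimal hash per window:

```python
import hashlib

def kgram_hashes(text: str, k: int) -> list[int]:
    """64-bit hash of every overlapping k-gram of the text."""
    return [
        int.from_bytes(hashlib.md5(text[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(text) - k + 1)
    ]

def winnow(text: str, k: int = 5, w: int = 4) -> set[tuple[int, int]]:
    """In each window of w consecutive k-gram hashes, keep the minimum
    hash (rightmost occurrence on ties) as a (position, hash) fingerprint."""
    hashes = kgram_hashes(text, k)
    fingerprints = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        m = min(window)
        offset = max(i for i, h in enumerate(window) if h == m)
        fingerprints.add((start + offset, hashes[start + offset]))
    return fingerprints

# Fingerprint hashes shared between two documents witness copied k-grams:
a = winnow("the quick brown fox jumps over the lazy dog")
b = winnow("lo, the quick brown fox jumps over the lazy cat")
print({h for _, h in a} & {h for _, h in b})
```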


01 Jan 2003
TL;DR: The nature of the plagiarism problem is explored, the approaches used so far for its detection are summarized, and a number of methods used to measure text reuse are discussed.
Abstract: Automatic methods of measuring similarity between program code and natural language text pairs have been used for many years to assist humans in detecting plagiarism. For example, over the past thirty years or so, a vast number of approaches have been proposed for detecting likely plagiarism between programs written by Computer Science students. More recently, approaches to identifying similarities between natural language texts have been addressed, but given the greater ambiguity and complexity of natural languages compared with programming languages, this task is very difficult. Automatic detection is gaining further interest from both the academic and commercial worlds given the ease with which texts can now be found, copied and rewritten. Following the recent increase in the popularity of on-line plagiarism detection services and the increased publicity surrounding cases of plagiarism in academia and industry, this paper explores the nature of the plagiarism problem and summarises the approaches used so far for its detection. I focus on plagiarism detection in natural language, and discuss a number of methods I have used to measure text reuse. I end by suggesting a number of recommendations for further work in the field of automatic plagiarism detection.

111 citations
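
As a concrete example of the kind of text reuse measure discussed above, here is a word n-gram containment score, a standard measure in this literature (not necessarily the author's own method):

```python
def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Lowercased word n-grams of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def containment(suspect: str, source: str, n: int = 3) -> float:
    """Fraction of the suspect's n-grams also found in the source;
    values near 1.0 suggest heavy reuse, near 0.0 little overlap."""
    s = ngrams(suspect, n)
    return len(s & ngrams(source, n)) / len(s) if s else 0.0
```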


01 Jan 2003
TL;DR: A comprehensive survey on natural language text copy detection is given, the development of copy detection is traced, and some key detection techniques are listed and compared with each other.
Abstract: Copy detection has important applications in both intellectual property protection and information retrieval. Current work concentrates mainly on document copy detection. In the early days, document copy detection focused mainly on program plagiarism detection; now most studies address text copy detection. In this paper, a comprehensive survey of natural language text copy detection is given and the development of the field is traced. The approaches and features of a variety of existing text copy detection systems and prototypes are reviewed in detail. Then some key detection techniques are listed and compared with each other. In the end, the future trend of text copy detection is discussed.

43 citations


Proceedings ArticleDOI
26 Oct 2003
TL;DR: SSK first finds semantic sequences in documents and then uses a kernel function to calculate their similarity; experiments show that SSK is excellent on a non-reworded corpus and remains valid on a reworded corpus with some loss of performance.
Abstract: We present the semantic sequence kernel (SSK) to detect document plagiarism; it is derived from the string kernel (SK) and the word sequence kernel (WSK). SSK first finds semantic sequences in documents and then uses a kernel function to calculate their similarity. SK and WSK only calculate the gap between the first word and the last one, whereas SSK takes each common word's position information into account. We believe SSK contains both local and global information, so that it makes great progress in small partial plagiarism detection. We compare SSK with the relative frequency model and the semantic sequence model, which is a word frequency based model. The results show that SSK is excellent on a non-reworded corpus. It is also valid on a reworded corpus, with some impairment of performance.

24 citations
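
The abstract does not spell out the kernel itself, so the following is only a toy illustration of the stated idea, that every common word's position feeds the similarity (unlike SK and WSK, which use only the gap between the first and last word); it is not the authors' SSK:

```python
import math

def first_positions(words: list[str]) -> dict[str, float]:
    """Normalized position (0..1) of each word's first occurrence."""
    pos: dict[str, float] = {}
    for i, w in enumerate(words):
        pos.setdefault(w, i / len(words))
    return pos

def position_aware_similarity(doc_a: list[str], doc_b: list[str],
                              lam: float = 5.0) -> float:
    """Each shared word contributes a weight that decays as its positions
    in the two documents drift apart, so position information enters the
    score word by word rather than only via the span of the match."""
    pa, pb = first_positions(doc_a), first_positions(doc_b)
    return sum(math.exp(-lam * abs(pa[w] - pb[w])) for w in pa.keys() & pb.keys())
```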


01 Jan 2003
TL;DR: An internal plagiarism detection method that uses style markers from authorship attribution studies in order to find stylistic changes in a text; at sentence and paragraph levels the style markers generally cannot detect plagiarized sections, and at larger levels the results are strongly influenced by the sliding window approach.
Abstract: Most of the existing plagiarism detection systems compare a text to a database of other texts. These external approaches, however, are vulnerable because texts not contained in the database cannot be detected as source texts. This paper examines an internal plagiarism detection method that uses style markers from authorship attribution studies in order to find stylistic changes in a text. These changes might pinpoint plagiarized passages. Additionally, a new style marker called specific words is introduced. A pre-study tests if the style markers can fingerprint an author's style and if they are constant with sample size. It is shown that vocabulary richness measures do not fulfil these prerequisites. The other style markers - simple ratio measures, readability scores, frequency lists, and entropy measures - have these characteristics and are, together with the new specific words measure, used in a main study with an unsupervised approach for detecting stylistic changes in plagiarized texts at sentence and paragraph levels. It is shown that at these small levels the style markers generally cannot detect plagiarized sections because of intra-authorial stylistic variations (i.e. noise), and that at bigger levels the results are strongly affected by the sliding window approach. The specific words measure, however, can pinpoint single sentences written by another author.

12 citations
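
A minimal sketch of the sliding-window idea, with average word length standing in for the paper's richer marker set (ratio measures, readability scores, frequency lists, entropy, specific words):

```python
import statistics

def avg_word_length(sentence: str) -> float:
    """A deliberately simple style marker."""
    words = sentence.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0

def flag_style_outliers(sentences: list[str], window: int = 5,
                        z: float = 2.0) -> list[int]:
    """Score each sliding window of sentences and flag windows whose style
    score deviates from the document mean by more than z standard
    deviations; flagged windows are candidate foreign passages."""
    if len(sentences) < window:
        return []
    scores = [statistics.mean(avg_word_length(s) for s in sentences[i:i + window])
              for i in range(len(sentences) - window + 1)]
    mu, sigma = statistics.mean(scores), statistics.pstdev(scores)
    return [i for i, s in enumerate(scores) if sigma and abs(s - mu) > z * sigma]
```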


01 Jan 2003
TL;DR: Investigations into two plagiarism detection tools are described: the widely used commercial service Turnitin, which is more useful for detecting plagiarism from web sources, and an in-house tool, Ferret, which is better suited to detecting collusion within a group of students.
Abstract: One strategy in the prevention and detection of plagiarism and collusion is to use an automated detection tool. We argue that, for consistent treatment of students, we should be applying these tools to ALL written submissions in a given assignment rather than merely using a detection tool to confirm suspicions that a single text has been plagiarised. In this paper we describe our investigations into two plagiarism detection tools: the widely used commercial service Turnitin, and an in-house tool, Ferret. We conclude that there are technical and practical problems, first in the large scale use of electronic submission of assignments and then in the further submission of these assignments to a plagiarism detector. Nevertheless, the reporting mechanisms of both tools are fast and easy to use. Turnitin is more useful in detecting plagiarism from web sources, Ferret for detecting collusion within a group of students.

12 citations


01 Jan 2003
TL;DR: Electronic submission of student assignments certainly provides many advantages for the faculty member and graders, and paperless transactions are especially useful when the number of submissions is large and the assignments must be distributed to multiple locations.
Abstract: And so it is with grading assignments that have been submitted electronically. Electronic submission of student assignments certainly provides many advantages for the faculty member and graders. For instance, electronic submissions are easier to manage and keep track of than their paper counterparts, particularly as the number of submissions gets large. Submissions can be time-stamped automatically and archived, thus minimizing the potential for disputes over lateness and lost assignments and/or grades. Furthermore, archives can help resolve issues involving academic dishonesty and/or plagiarism. Finally, paperless transactions are especially useful when the number of submissions is large and the assignments must be distributed to multiple locations (such as to teaching assistants, graders, and plagiarism detection software).

10 citations


01 Jan 2003
TL;DR: XPDec, an XML-based model, is introduced to detect the similarities among programs that arise under plagiarism; based upon the syntax of a specific programming language, it uses an XML scheme suited to plagiarism detection.
Abstract: Plagiarism is commonplace in academia, especially in courses involving programming. In this paper, XPDec, an XML-based model, is introduced to detect similarities among programs that arise under plagiarism. Based upon the syntax of a specific programming language, XPDec uses an XML scheme that is suitable for the detection of plagiarism. XML documents are generated from given program sources, and XQuery is used to extract information relevant to the detection of plagiarism. The XML's tree-like representation of query results is exploited to ignore common forms of reordering that arise in plagiarism. The level of similarity between a pair of programs is numerically quantified and reported. The usefulness of XPDec in the detection of plagiarism is discussed. XPDec has been implemented, and its architecture is presented.

9 citations
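
The abstract gives the pipeline (program source to XML, queries over the tree, reorder-tolerant comparison) without code. A toy approximation, using Python's own syntax tree in place of the paper's language-specific XML scheme:

```python
import ast
import xml.etree.ElementTree as ET

def program_to_xml(source: str) -> ET.Element:
    """One XML element per syntax-tree node (Python's ast stands in for
    the language-specific scheme described in the paper)."""
    def build(node: ast.AST) -> ET.Element:
        elem = ET.Element(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            elem.append(build(child))
        return elem
    return build(ast.parse(source))

def structural_similarity(a: ET.Element, b: ET.Element) -> float:
    """Jaccard overlap of root-to-node tag paths; because the paths form
    a set, sibling order is ignored, loosely mimicking XPDec's tolerance
    of simple statement reordering."""
    def paths(elem: ET.Element, prefix: str = "") -> set[str]:
        p = prefix + "/" + elem.tag
        out = {p}
        for child in elem:
            out |= paths(child, p)
        return out
    pa, pb = paths(a), paths(b)
    return len(pa & pb) / len(pa | pb) if pa | pb else 1.0
```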



01 May 2003
TL;DR: Tests of the chunking methods used for plagiarism detection make it possible to decide on the best-fitting chunking method for a given application.
Abstract: This paper describes tests of the chunking methods used for plagiarism detection. The results make it possible to decide on the best-fitting chunking method for a given application. For example, overlapping word chunking is good for a grammar analyzer or for small databases; sentence chunking is best suited to finding quoted text; hashed breakpoint chunking is the fastest method and is therefore advisable for searching large sets of documents; and if more reliability is needed, overlapping hashed breakpoint chunking can be used as well.

9 citations
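
For readers unfamiliar with the methods compared above, here is a hedged sketch of three of them; chunk length and breakpoint divisor are illustrative assumptions, not the paper's values:

```python
import re
import zlib

def overlapping_word_chunks(text: str, n: int = 5) -> list[str]:
    """Every run of n consecutive words (overlapping word chunking)."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def sentence_chunks(text: str) -> list[str]:
    """Naive sentence chunking on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def hashed_breakpoint_chunks(text: str, divisor: int = 8) -> list[str]:
    """Start a new chunk after any word whose hash is 0 mod divisor, so
    boundaries depend on content rather than position (hashed breakpoints)."""
    chunks, current = [], []
    for word in text.split():
        current.append(word)
        if zlib.crc32(word.encode()) % divisor == 0:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks
```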



01 Jan 2003
TL;DR: This case study is an evaluation of generic, general-purpose plagiarism detection systems applied to a specific domain and task: detecting intra-class student copying in a corpus of Biomedical Science laboratory reports.
Abstract: This case study is an evaluation of generic, general-purpose plagiarism detection systems applied to a specific domain and task: detecting intra-class student copying in a corpus of Biomedical Science laboratory reports. From the outset, our project had the practical, pragmatic aim of finding a workable solution to a specific problem. Biomedical Science undergraduates learn experimental methods by working through a series of laboratory experiments and reporting on their results. These laboratory reports are “peer-reviewed” in large classes, following a prescribed marking scheme; as the reports are effectively marked by other students rather than by a single lecturer, there is an opportunity for an unscrupulous student to avoid having to carry out and report on an experiment by simply copying another student’s report. To reduce this temptation, the Biomedical Science director of teaching, Paul Gent, approached Eric Atwell of the School of Computing and Clive Souter of the Centre for Joint Honours in Science to look at ways to compare laboratory reports automatically and flag candidates with signs of copying. We were joined by Julia Medori, a forensic linguist from Trinity College Dublin, who developed and evaluated a range of possible solutions.

Maeve Paris
01 Aug 2003
TL;DR: If a computer programming language is considered to be similar to a natural language, computer-assisted text analysis techniques may be employed to assist the academic in detecting plagiarism in source code; computational linguistics might thus inform software metrics, and vice versa.
Abstract: Plagiarism and collusion among students may be facilitated by the preponderance of material in electronic format and the ability to submit coursework online. A distinction has generally been drawn between plagiarism of text and plagiarism of source code, and different tools and metrics have been developed for either type. However, if a computer programming language is considered to be similar to a natural language (although it has a restricted syntax and vocabulary), computer-assisted text analysis techniques may be employed to assist the academic in detecting plagiarism in source code. So computational linguistics might inform software metrics, and vice versa.
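
A minimal illustration of this transfer: tokenize source code like text, collapse identifiers so that renaming variables does not hide copying, and reuse an n-gram overlap score. The keyword list and parameters are illustrative assumptions:

```python
import re

# A deliberately partial keyword list; a real tool would use the full lexicon.
PY_KEYWORDS = {"def", "return", "if", "else", "for", "while", "in", "import"}

def normalized_tokens(source: str) -> list[str]:
    """Crude lexer: identifiers collapse to ID, a classic counter to
    variable renaming; numbers and punctuation pass through."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]", source)
    return ["ID" if re.fullmatch(r"[A-Za-z_]\w*", t) and t not in PY_KEYWORDS
            else t for t in tokens]

def token_ngram_similarity(code_a: str, code_b: str, n: int = 4) -> float:
    """Jaccard overlap of token n-grams: a text-analysis measure applied
    directly to source code."""
    def grams(tokens: list[str]) -> set[tuple[str, ...]]:
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    ga, gb = grams(normalized_tokens(code_a)), grams(normalized_tokens(code_b))
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0
```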

Book
01 Jan 2003
TL;DR: This dissertation studies the problem of searching and retrieving music based on acoustic content similarity using two types of systems, one based on exhaustive matching by dynamic programming (which is relatively accurate but not scalable), the other based on high-dimensional indexing (which is less accurate but scalable).
Abstract: With the explosive growth of music data available on the internet in recent years, there has been much interest in developing new ways to search and retrieve such data effectively. Currently, most music search engines operate on text labels or symbolic data, rather than on the underlying acoustic content. A truly content-based music retrieval system should have the ability to find similar songs based on their underlying score or melody, regardless of their metadata description or file names. Potential applications include automatic music identification, music analysis, plagiarism detection, copyright enforcement, etc. In this dissertation, we study the problem of searching and retrieving music based on acoustic content similarity. Given a query sound clip, our goal is to retrieve “similar” occurrences from a music database, where similarity is based on the intuitive notion of “same song” perceived by humans: two pieces are similar if they are fully or partially based on the same score, even if they are performed by different people, with different instruments, or at different tempo. Retrieval results are given as a list of songs ranked by computed similarity estimate. Both the input query and the underlying database are taken from actual music recordings in raw acoustic format. We study two types of systems, one based on exhaustive matching by dynamic programming (which is relatively accurate but not scalable), the other based on high-dimensional indexing (which is less accurate but scalable). For the latter index-based retrieval system, the core algorithm is parallelizable and can be placed into a peer-to-peer architecture for improved performance, with the ability to share spare CPU resources and to achieve dynamic load-balancing.
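
The dissertation's exact dynamic-programming formulation is not reproduced in the abstract; dynamic time warping is the textbook choice for this kind of tempo-tolerant matching, so here is a sketch under that assumption:

```python
def dtw_distance(query: list[float], song: list[float]) -> float:
    """Dynamic-time-warping alignment cost between two feature sequences
    (e.g., pitch contours); tempo changes stretch the warping path
    instead of breaking the match."""
    INF = float("inf")
    n, m = len(query), len(song)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(query[i - 1] - song[j - 1])
            d[i][j] = cost + min(
                d[i - 1][j],      # consume a query frame
                d[i][j - 1],      # consume a song frame
                d[i - 1][j - 1],  # consume one of each
            )
    return d[n][m]
```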


Journal ArticleDOI
TL;DR: This article describes Turnitin, an anti-plagiarism program run by the Joint Information Systems Committee and thought to be the first national plagiarism detection system of its kind.
Abstract: Plagiarism is one of the most serious offences in the academic world. It has occurred for as long as there have been teachers and students, but the recent growth of the Internet has made the problem much worse. Recent studies indicate that approximately 30% of all students may be plagiarising on every written assignment they complete. The “information technology revolution” is almost always presented as having cataclysmic consequences for education. In post‐secondary circles, perhaps the most commonly apprehended cataclysm is “Internet plagiarism”. Academics at all British universities and colleges can now test students’ work for cheating using the anti‐plagiarism program Turnitin. The program, run by the Joint Information Systems Committee and thought to be the first national system of its kind, offers free advice and a plagiarism detection service to all further education institutions in the UK. This article will try to: first, define exactly what plagiarism is; second, give examples and reports on...

Proceedings Article
01 Jan 2003
TL;DR: A knowledge-based approach to the detection of plagiarism is presented in which the documents to be tested are described as graph structures; experimental results show that the approach is workable and effective.
Abstract: With the growth of the Internet, it is much easier for plagiarists to copy materials from the Internet and put them in their own documents without the permission of the original authors. Plagiarists usually modify the copied materials to avoid being detected. Therefore, simple comparison of two documents is not sufficient for detecting plagiarism. In this paper, we present a knowledge-based approach to the detection of plagiarism. We analyze the types and behaviors of plagiarism and describe the documents to be tested in the form of graph structures. The problem of detecting plagiarism then becomes one of comparing the similarity of these structures. Experimental results show that our approach is workable and effective.
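
The abstract leaves the graph structure unspecified, so the following is only a toy stand-in: content words as nodes, adjacency as edges, and similarity as set overlap between the two graphs.

```python
def document_graph(text: str) -> tuple[set[str], set[tuple[str, str]]]:
    """Toy document graph: nodes are lowercased words, edges link words
    adjacent in the text (a stand-in for the paper's richer structures)."""
    words = [w.lower().strip(".,;:!?") for w in text.split()]
    nodes = set(words)
    edges = {tuple(sorted(pair)) for pair in zip(words, words[1:])}
    return nodes, edges

def graph_similarity(text_a: str, text_b: str) -> float:
    """Average Jaccard overlap of the two graphs' node and edge sets."""
    (na, ea), (nb, eb) = document_graph(text_a), document_graph(text_b)
    node_sim = len(na & nb) / len(na | nb) if na | nb else 1.0
    edge_sim = len(ea & eb) / len(ea | eb) if ea | eb else 1.0
    return (node_sim + edge_sim) / 2
```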