
Showing papers on "Plagiarism detection" published in 2002


Journal Article
TL;DR: Some college lawyers and professors warn that one of the most widely used plagiarism-detection services may be trampling on students' copyrights and privacy, and many campus officials are starting to wonder whether the high-tech tools they use to detect dishonesty clash with students' legal rights.
Abstract: When electronic tools to ferret out student plagiarism hit the market a few years ago, colleges saw them as easy to use and affordable. But now some college lawyers and professors are warning that one of the most widely used plagiarism-detection services may be trampling on students' copyrights and privacy. And many campus officials are starting to wonder whether some of the high-tech tools they are using to detect dishonesty clash with students' legal rights.

65 citations


01 Jan 2002
TL;DR: A web-accessible text registry based on signature extraction that extracts a small but diagnostic signature from each registered text for permanent storage and comparison against other stored signatures is described.
Abstract: Easy access to the Web has led to increased potential for students cheating on assignments by plagiarising others' work. By the same token, Web-based tools offer the potential for instructors to check submitted assignments for signs of plagiarism. Overlap-detection tools are easy to use and accurate in plagiarism detection, so they can be an excellent deterrent to plagiarism. Documents can overlap for other reasons, too: Old documents are superseded, and authors summarize previous work identically in several papers. Overlap-detection tools can pinpoint interconnections in a corpus of documents and could be used in search engines. We describe a web-accessible text registry based on signature extraction. We extract a small but diagnostic signature from each registered text for permanent storage and comparison against other stored signatures. This comparison allows us to estimate the amount of overlap between pairs of documents, yet the total time required is only linear in the total size of the documents. We compare our algorithm with several alternatives and present both efficiency and accuracy results.
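To make the signature idea concrete, here is a minimal sketch of one plausible scheme: every n-word chunk of a document is hashed, and only hashes passing a selection rule are kept, so each registered text is reduced to a small, permanently storable fingerprint. The function names, the parameters n and p, and the "hash mod p == 0" selection rule are illustrative assumptions, not the paper's exact method.

    import hashlib
    import re

    def extract_signature(text, n=5, p=16):
        # Reduce a document to a small set of chunk hashes that can be stored
        # permanently and compared against other stored signatures.
        # The n-word chunking and the "hash % p == 0" selection rule are
        # assumptions made for this sketch, not the paper's exact scheme.
        words = re.findall(r"\w+", text.lower())
        signature = set()
        for i in range(len(words) - n + 1):
            chunk = " ".join(words[i:i + n])
            h = int(hashlib.md5(chunk.encode("utf-8")).hexdigest(), 16)
            if h % p == 0:  # keep roughly 1/p of all chunk hashes
                signature.add(h)
        return signature

    def estimated_overlap(sig_a, sig_b):
        # Estimate document overlap from the stored signatures alone.
        if not sig_a or not sig_b:
            return 0.0
        return len(sig_a & sig_b) / min(len(sig_a), len(sig_b))

Because comparisons touch only the stored signatures rather than the full texts, checking a new submission against the registry stays cheap as the registry grows, which is the property the abstract emphasises.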

46 citations




Book ChapterDOI
21 Apr 2002
TL;DR: Two common stages of plagiarism detection are studied, chunking of documents and selection of representative chunks, and a third comparison stage is proposed that uses suffix trees and suffix vectors to identify the overlapping chunks.
Abstract: Easy access to the World Wide Web has raised concerns about copyright issues and plagiarism. It is easy to copy someone else's work and submit it as one's own. This problem has been targeted by many systems, which use very similar approaches. These approaches are compared in this paper, and suggestions are made as to when different strategies are more applicable than others. Some alternative approaches are proposed that perform better than previously presented methods. These previous methods share two common stages: chunking of documents and selection of representative chunks. We study both stages and also propose alternatives that are better in terms of accuracy and space requirements. The applications of these methods are not limited to plagiarism detection but may target other copy-detection problems. We also propose a third stage to be applied in the comparison that uses suffix trees and suffix vectors to identify the overlapping chunks.
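As a rough illustration of those two common stages, the sketch below chunks a document into overlapping k-word units and then keeps one representative hash per sliding window of chunks. The names chunk_document and select_representatives, the parameters k and window, and the minimum-hash-per-window rule are assumptions made for this sketch; the systems compared in the paper use their own chunking and selection strategies.

    import hashlib

    def chunk_document(text, k=4):
        # Stage 1: split the token stream into overlapping k-word chunks.
        words = text.lower().split()
        return [" ".join(words[i:i + k])
                for i in range(max(len(words) - k + 1, 1))]

    def select_representatives(chunks, window=5):
        # Stage 2: keep one representative hash per window of consecutive
        # chunks (a minimum-hash rule, purely illustrative).
        hashes = [int(hashlib.md5(c.encode("utf-8")).hexdigest(), 16)
                  for c in chunks]
        if not hashes:
            return set()
        return {min(hashes[i:i + window])
                for i in range(max(len(hashes) - window + 1, 1))}

Two documents become candidate matches when their representative sets share hashes; the third stage proposed in the paper would then confirm and delimit the actual overlapping passages using suffix trees or suffix vectors.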

41 citations





01 Jan 2002
TL;DR: A new data structure, the suffix vector, is proposed; it has the same versatility as a suffix tree, requires less space than any other representation known to date, and is named for the way it is organised in memory.
Abstract: With the rapid growth of the Internet, large collections of digital documents have become available. These documents may be used for various purposes including education, research, entertainment, and many others. Given this diversity of objectives in using these documents, we need tools that are capable of identifying overlap and similarity within a potentially very large set of documents. Applications of such tools include plagiarism detection, search engines, comparative analysis of literary works, and clustering collections of documents. This thesis studies different approaches to identifying overlap between electronic documents. Existing approaches are compared based on different attributes including accuracy, performance, and the degree of protection. A novel two-stage approach is proposed in this thesis. It selects a set of candidate documents in the first phase by applying chunking and indexing methods. The second phase uses exact comparison methods based on suffix trees. Suffix trees have been identified as an efficient data structure for exact string-matching problems. One of the main arguments against the widespread use of suffix trees is their extensive space-consumption requirements. We propose a new data structure, which has the same versatility as a suffix tree but requires less space than any other representation known to date. This new structure is called a suffix vector because of the way it is organised in memory. We show how this structure can be constructed in linear time, and we also prove that this data structure requires the least space among those representations that have the same versatility. Not only does the new representation save space, but it is also more time-efficient because it eliminates certain redundancies of a suffix tree. Therefore, some phases of algorithms running on the structure can be eliminated, too. This thesis also analyses how existing data structures can be used for document comparison. Sparse suffix trees and directed acyclic graphs generated from suffix trees are discussed in the context of document comparison applications. Some algorithms have been modified to suit the above-mentioned data structures. We have built a prototype system, called MatchDetectReveal (MDR), that implements the algorithms we proposed. The MDR system is capable of efficiently identifying overlapping documents in a large set and uses suffix trees and suffix vectors in its core component, the matching engine.
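The exact-comparison phase is the part the thesis implements with suffix trees and the proposed suffix vector. As a self-contained stand-in, the sketch below uses a suffix automaton, a related linear-size suffix structure, to find the length of the longest exact overlap between two documents; it illustrates the kind of query the matching engine answers, and is not the MDR implementation or the suffix vector itself.

    class SuffixAutomaton:
        # A compact suffix structure built in linear time; like a suffix tree,
        # it can answer substring queries against the indexed document.
        def __init__(self, text):
            self.length = [0]   # longest substring represented by each state
            self.link = [-1]    # suffix links
            self.next = [{}]    # outgoing transitions
            last = 0
            for ch in text:
                cur = self._new_state(self.length[last] + 1, -1, {})
                p = last
                while p != -1 and ch not in self.next[p]:
                    self.next[p][ch] = cur
                    p = self.link[p]
                if p == -1:
                    self.link[cur] = 0
                else:
                    q = self.next[p][ch]
                    if self.length[p] + 1 == self.length[q]:
                        self.link[cur] = q
                    else:
                        clone = self._new_state(self.length[p] + 1,
                                                self.link[q], dict(self.next[q]))
                        while p != -1 and self.next[p].get(ch) == q:
                            self.next[p][ch] = clone
                            p = self.link[p]
                        self.link[q] = clone
                        self.link[cur] = clone
                last = cur

        def _new_state(self, length, link, transitions):
            self.length.append(length)
            self.link.append(link)
            self.next.append(transitions)
            return len(self.length) - 1

    def longest_exact_overlap(doc_a, doc_b):
        # Length of the longest substring shared by doc_a and doc_b.
        sa = SuffixAutomaton(doc_a)
        state, matched, best = 0, 0, 0
        for ch in doc_b:
            while state and ch not in sa.next[state]:
                state = sa.link[state]
                matched = sa.length[state]
            if ch in sa.next[state]:
                state = sa.next[state][ch]
                matched += 1
            best = max(best, matched)
        return best

For example, longest_exact_overlap("easy access to the web", "access to the web has led") returns the length of the shared phrase "access to the web". In a system like MDR, queries of this kind run only against the candidate documents selected in the first phase.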

4 citations