Topic
Plagiarism detection
About: Plagiarism detection is a research topic. Over its lifetime, 1790 publications have been published within this topic, receiving 24740 citations.
Papers
TL;DR: A plagiarism detection tool for comparison of Arabic documents to identify potential similarities based on a new comparison algorithm that uses heuristics to compare suspect documents at different hierarchical levels to avoid unnecessary comparisons is presented.
Abstract: Many language-sensitive tools for detecting plagiarism in natural language documents have been developed, particularly for English. Language-independent tools exist as well, but are considered restrictive as they usually do not take into account specific language features. Detecting plagiarism in Arabic documents is a particularly challenging task because of the complex linguistic structure of Arabic. In this paper, we present a plagiarism detection tool for comparison of Arabic documents to identify potential similarities. The tool is based on a new comparison algorithm that uses heuristics to compare suspect documents at different hierarchical levels to avoid unnecessary comparisons. We evaluate its performance in terms of precision and recall on a large data set of Arabic documents, and show its capability in identifying direct and sophisticated copying, such as sentence reordering and synonym substitution. We also demonstrate its advantages over other plagiarism detection tools, including Turnitin, the well-known language-independent tool.
55 citations
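The abstract above does not spell out its algorithm, so the following is only a minimal sketch of the general idea of comparing at hierarchical levels and pruning early. Everything in it is an assumption made for illustration: the three levels (document, paragraph, sentence), the word-set Jaccard score, the thresholds, and all function names. It is not the authors' method, and it does not handle synonym substitution.

```python
import re

def words(text):
    """Lowercased word set; a stand-in for real Arabic normalization or stemming."""
    return set(re.findall(r"\w+", text.lower()))

def jaccard(a, b):
    """Jaccard overlap of two word sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def split_paragraphs(doc):
    """Split on blank lines (hypothetical paragraph convention)."""
    return [p for p in re.split(r"\n\s*\n", doc) if p.strip()]

def split_sentences(par):
    """Naive split on Latin and Arabic sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"[.!?؟]+", par) if s.strip()]

def compare(suspect, source, doc_t=0.1, par_t=0.2, sent_t=0.6):
    """Return (suspect_sentence, source_sentence, score) triples.

    The thresholds are hypothetical. A finer level is only examined
    when the coarser level already shows enough overlap.
    """
    matches = []
    if jaccard(words(suspect), words(source)) < doc_t:
        return matches  # documents too dissimilar: skip all finer comparisons
    for sp in split_paragraphs(suspect):
        for op in split_paragraphs(source):
            if jaccard(words(sp), words(op)) < par_t:
                continue  # paragraph pair pruned before any sentence comparison
            for ss in split_sentences(sp):
                for os_ in split_sentences(op):
                    score = jaccard(words(ss), words(os_))
                    if score >= sent_t:
                        matches.append((ss, os_, score))
    return matches
```

Because sentences are matched pairwise, reordered sentences are still found. The document-level cutoff trades recall for speed: a short copied passage inside a long, otherwise unrelated document can fall below it, so a real tool would need a more careful pruning rule.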
03 Apr 2017
TL;DR: An innovative word embedding-based system for calculating the semantic similarity of Arabic sentences, which exploits vectors as word representations in a multidimensional space to capture the semantic and syntactic properties of words.
Abstract: Semantic textual similarity is the basis of countless applications and plays an important role in diverse areas, such as information retrieval, plagiarism detection, information extraction and machine translation. This article proposes an innovative word embedding-based system for calculating the semantic similarity of Arabic sentences. The main idea is to exploit vectors as word representations in a multidimensional space in order to capture the semantic and syntactic properties of words. IDF weighting and Part-of-Speech tagging are applied to the examined sentences to support the identification of words that are highly descriptive in each sentence. The performance of our proposed system is confirmed through the Pearson correlation between our assigned semantic similarity scores and human judgments.
55 citations
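The weighting scheme described in the abstract above can be illustrated with a small sketch: each sentence becomes a weighted sum of its word vectors, and two sentences are scored by cosine similarity. The toy two-dimensional embeddings, the smoothed IDF formula, the optional per-tag weights, and all function names are assumptions for illustration; the paper's actual embeddings, tagger, and weights are not given here.

```python
import math

def idf(corpus):
    """Smoothed inverse document frequency over tokenized sentences."""
    n = len(corpus)
    df = {}
    for sent in corpus:
        for w in set(sent):
            df[w] = df.get(w, 0) + 1
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}

def sentence_vector(tokens, emb, idf_w, pos_w=None, tags=None):
    """Weighted sum of word vectors.

    emb maps word -> vector; idf_w maps word -> IDF weight.
    pos_w (tag -> weight) and tags are optional, hypothetical POS weights.
    Words without an embedding are skipped.
    """
    dim = len(next(iter(emb.values())))
    vec = [0.0] * dim
    for i, w in enumerate(tokens):
        if w not in emb:
            continue
        weight = idf_w.get(w, 1.0)
        if pos_w and tags:
            weight *= pos_w.get(tags[i], 1.0)
        for k in range(dim):
            vec[k] += weight * emb[w][k]
    return vec

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy data: made-up 2-d vectors standing in for pretrained embeddings.
emb = {"cat": [1.0, 0.0], "kitten": [0.9, 0.1], "car": [0.0, 1.0], "the": [0.5, 0.5]}
corpus = [["the", "cat"], ["the", "kitten"], ["the", "car"]]
weights = idf(corpus)
v_cat, v_kitten, v_car = (sentence_vector(s, emb, weights) for s in corpus)
sim_close = cosine(v_cat, v_kitten)
sim_far = cosine(v_cat, v_car)
```

In this toy corpus the word "the" occurs in every sentence and so receives the lowest IDF weight, which lets the content words dominate each sentence vector; `sim_close` comes out higher than `sim_far`.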
03 Sep 2012
TL;DR: The TIRA (Testbed for Information Retrieval Algorithms) web framework is presented, which is currently used as an official evaluation platform for the well-established PAN international plagiarism detection competition and possesses a unique set of compelling features in comparison to existing web-based solutions.
Abstract: With its close ties to the Web, the information retrieval community is destined to leverage the dissemination and collaboration capabilities that the Web provides today. Especially with the advent of the software as a service principle, an information retrieval community is conceivable that publishes executable experiments by anyone over the Web. A review of recent SIGIR papers shows that we are far away from this vision of collaboration. The benefits of publishing information retrieval experiments as a service are striking for the community as a whole, including potential to boost research profiles and reputation. However, the additional work must be kept to a minimum and sensitive data must be kept private for this paradigm to become an accepted practice. In order to foster experiments as a service in information retrieval, we present the TIRA (Testbed for Information Retrieval Algorithms) web framework that addresses the outlined challenges and possesses a unique set of compelling features in comparison to existing web-based solutions. To describe TIRA in a practical setting, we explain how it is currently used as an official evaluation platform for the well-established PAN international plagiarism detection competition. We also describe how it can be used in future scenarios for search result clustering of non-static collections of web query results, as well as within a simulation data mining setting to support interactive structural design in civil engineering.
55 citations
02 Nov 2005
TL;DR: There is a minimum level of acceptable performance for detecting student plagiarism: the detector should operate at a level where fooling the algorithm would require the student to spend a large amount of time on the assignment.
Abstract: The large class sizes typical for an undergraduate programming course mean that it is nearly impossible for a human marker to accurately detect plagiarism, particularly if some attempt has been made to hide the copying. While it would be desirable to be able to detect all possible code transformations, we believe that there is a minimum level of acceptable performance for the application of detecting student plagiarism. It would be useful if the detector operated at a level such that fooling the algorithm would require the student to spend a large amount of time on the assignment and to have a good enough understanding to do the work without plagiarising.
54 citations
01 Jan 2010
TL;DR: This paper describes the approach at the PAN 2010 plagiarism detection competition, and discusses the computational cost of each step of the implementation, including the performance data from two different computers.
Abstract: In this paper we describe our approach at the PAN 2010 plagiarism detection competition. We refer to the system we have used in PAN'09. We then present the improvements we have tried since the PAN'09 competition, and their impact on the results on the development corpus. We describe our experiments with intrinsic plagiarism detection and evaluate them. We then discuss the computational cost of each step of our implementation, including the performance data from two different computers.
54 citations