Journal ArticleDOI

Paraphrase plagiarism identification with character-level features

TLDR
It is established that the original author’s writing style fingerprint prevails in the plagiarized text even when paraphrases occur, and a novel text representation scheme is proposed that gathers both content and style characteristics of texts, represented by means of character-level features.
Abstract
Several methods have been proposed for determining plagiarism between pairs of sentences, passages or even full documents. However, the majority of these methods fail to reliably detect paraphrase plagiarism due to the high complexity of the task, even for human beings. Paraphrase plagiarism identification consists of automatically recognizing document fragments that contain reused text, which is intentionally hidden by means of rewording practices such as semantic equivalences, discursive changes and morphological or lexical substitutions. Our main hypothesis is that the original author’s writing style fingerprint prevails in the plagiarized text even when paraphrases occur. Thus, in this paper we propose a novel text representation scheme that gathers both content and style characteristics of texts, represented by means of character-level features. As an additional contribution, we describe the methodology followed for the construction of an appropriate corpus for the task of paraphrase plagiarism identification, which represents a valuable new resource for the NLP community for future research work in this field.
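The abstract says only that texts are represented by "character-level features" and does not give the exact scheme. As a rough illustration of the general idea, the sketch below assumes character trigrams as the features and cosine similarity as the comparison; both choices, and the names `char_ngrams` and `cosine_similarity`, are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after lowercasing and collapsing whitespace.

    Assumption: n = 3 is an illustrative default, not a value from the paper.
    """
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text is shorter than n characters
    return dot / (norm_a * norm_b)


# Hypothetical example sentences, written for this sketch only.
source = "The author's writing style remains visible in the text."
paraphrase = "The writing style of the author stays visible in the text."
unrelated = "Quarterly revenue grew faster than analysts expected."

sim_paraphrase = cosine_similarity(char_ngrams(source), char_ngrams(paraphrase))
sim_unrelated = cosine_similarity(char_ngrams(source), char_ngrams(unrelated))

print(f"source vs paraphrase: {sim_paraphrase:.3f}")
print(f"source vs unrelated:  {sim_unrelated:.3f}")
```

Character n-grams capture word fragments, punctuation and function-word patterns together, which is one plausible way a single representation could carry both content and style signals. A real detector would still need a decision rule, such as a tuned threshold or a trained classifier, on top of scores like these.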


Citations

Plagiarism detection using Rouge and WordNet

柯 皓仁
TL;DR: In this paper, the authors propose applying ROUGE and WordNet to plagiarism detection, where ROUGE includes n-gram co-occurrence statistics, skip-bigram, and longest common subsequence (LCS).
Journal ArticleDOI

Cross-language text alignment: A proposed two-level matching scheme for plagiarism detection

TL;DR: The experimental results show that the proposed cross-language text alignment approach significantly outperforms the state-of-the-art models and can be fed into an expert system for further improvement of cross-language plagiarism detection.
Journal ArticleDOI

An effective approach to candidate retrieval for cross-language plagiarism detection: A fusion of conceptual and keyword-based schemes

TL;DR: The results show that the proposed candidate retrieval model outperforms the state-of-the-art models and can be considered as a proper choice to be embedded in cross-language plagiarism detection systems.
Journal ArticleDOI

Scalable and language-independent embedding-based approach for plagiarism detection considering obfuscation type: no training phase

TL;DR: This paper employs text embedding vectors to compare similarity among documents to detect plagiarism and applies the proposed method on available datasets in English, Persian and Arabic languages on the text alignment task to evaluate the robustness of the proposed methods from the language perspective.
Journal ArticleDOI

Using word semantic concepts for plagiarism detection in text documents

TL;DR: This paper uses Word2vec to transform words into word vectors that reveal the semantic relationships among different words, making plagiarism detection more effective.
References
Proceedings ArticleDOI

Winnowing: local algorithms for document fingerprinting

TL;DR: The class of local document fingerprinting algorithms is introduced, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies, and a novel lower bound on the performance of any local algorithm is proved.
Journal IssueDOI

Computational methods in authorship attribution

TL;DR: Three scenarios are considered here for which solutions to the basic attribution problem are inadequate; it is shown how machine learning methods can be adapted to handle the special challenges of that variant.
Proceedings ArticleDOI

Measuring the Semantic Similarity of Texts

TL;DR: A method that combines word-to-word similarity metrics into a text-to-text metric is introduced, and it is shown that this method outperforms the traditional text similarity metrics based on lexical matching.
Journal ArticleDOI

Methods for identifying versioned and plagiarized documents

TL;DR: The identity measure and the best fingerprinting technique are both able to accurately identify coderivative documents, and it is demonstrated that the identity measure is clearly superior for fingerprinting parameters.