
Plagiarism detection

About: Plagiarism detection is a research topic. Over its lifetime, 1,790 publications have been published within this topic, receiving 24,740 citations.


Papers
01 Jan 2014
TL;DR: In this paper, a ranking model based on Ranking SVM is proposed to rank groups of query keywords by how much they contribute to a higher evaluation measure F.
Affiliations: Heilongjiang Institute of Technology, China; Harbin Engineering University, China; Harbin Institute of Technology, China. Contact: kongleilei1979@gmail.com
Abstract: For the task of source retrieval, the goal is to retrieve all plagiarized sources while minimizing retrieval costs. It has become standard in plagiarism detection to retrieve plagiarism sources using query keywords selected from the suspicious document. This paper treats keyword selection as the problem of learning a ranking model that chooses the keyword extraction method over suspicious document segments. Four basic methods are used in our ranking function: BM25, TFIDF, TF and EW. A ranking model based on Ranking SVM is then proposed to rank the query keyword groups by their contribution to a higher evaluation measure F. In our ranking model, achieving the best source retrieval performance measure F is the learning-to-rank target, and a variety of statistical features are fused to search for better query keyword groups.
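As a rough illustration of two of the keyword extraction methods named in the abstract (TF and TFIDF), the sketch below picks query keywords from one segment of a suspicious document. It does not reproduce the authors' system: the function names, the smoothed IDF formula, and the tie-breaking rule are assumptions made for this example, and BM25, EW, and the Ranking SVM step are not shown.

```python
import math
from collections import Counter


def tf_keywords(segment_tokens, top_k=5):
    """Return the top_k most frequent tokens of a segment (plain TF)."""
    counts = Counter(segment_tokens)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in ranked[:top_k]]


def tfidf_keywords(segment_tokens, all_segments, top_k=5):
    """Return the top_k tokens of a segment ranked by a simple TF-IDF score.

    Assumption: IDF is log((1 + N) / (1 + df)), where N is the number of
    segments and df is the number of segments containing the token.
    """
    n_segments = len(all_segments)
    doc_freq = Counter()
    for segment in all_segments:
        doc_freq.update(set(segment))
    term_freq = Counter(segment_tokens)
    scores = {
        word: count * math.log((1 + n_segments) / (1 + doc_freq[word]))
        for word, count in term_freq.items()
    }
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in ranked[:top_k]]


# Hypothetical toy segments of a suspicious document.
segments = [
    ["plagiarism", "source", "the", "the"],
    ["the", "retrieval", "the"],
    ["the", "query"],
]
tf_query = tf_keywords(segments[0], top_k=1)
tfidf_query = tfidf_keywords(segments[0], segments, top_k=2)
```

On this toy input, plain TF favors the frequent word "the", while TF-IDF prefers words specific to the segment. A learned ranker, as the abstract describes, would then choose among such candidate keyword groups.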

10 citations

Proceedings ArticleDOI
01 Jan 2017
TL;DR: A novel text similarity measure inspired by a common representation in DNA sequence alignment algorithms is presented, called TextFlow, which represents input text pairs as continuous curves and uses both the actual position of the words and sequence matching to compute the similarity value.
Abstract: Text similarity measures are used in multiple tasks such as plagiarism detection, information ranking and recognition of paraphrases and textual entailment. While recent advances in deep learning highlighted the relevance of sequential models in natural language generation, existing similarity measures do not fully exploit the sequential nature of language. Examples of such similarity measures include n-gram and skip-gram overlap, which rely on distinct slices of the input texts. In this paper we present a novel text similarity measure inspired by a common representation in DNA sequence alignment algorithms. The new measure, called TextFlow, represents input text pairs as continuous curves and uses both the actual position of the words and sequence matching to compute the similarity value. Our experiments on 8 different datasets show very encouraging results in paraphrase detection, textual entailment recognition and ranking relevance.

10 citations

Proceedings Article
Salha Alzahrani
01 Jan 2015
TL;DR: This system can detect some means of obfuscation, such as restructuring or rewording of a few phrases, but it might not work with handmade paraphrases; future work is to improve the candidate retrieval stage and incorporate semantic-based metrics in the detection stage.
Abstract: This report explains our Arabic plagiarism detection system, which we used to submit our run to the AraPlagDetect competition at FIRE 2015. The system was constructed in four main stages. The first is pre-processing, which includes tokenisation and stop-word removal. The second is retrieving a list of candidate documents for each suspicious document using K-gram fingerprinting and the Jaccard coefficient. Suspicious documents are then compared in depth with the associated candidate documents. This stage entails computing the similarity between constructed N-grams with K-overlapping, where N and K were experimentally set to 8 and 3, respectively. The similarity between N-gram pairs was computed based on word correlations. Each word was compared with the words in the candidate N-gram and assigned a correlation of 1 if they matched. Correlation values were averaged and then compared to a threshold. The last step is post-processing, whereby consecutive N-grams were joined to form united plagiarised segments. Our performance measures on the training corpus were encouraging (recall=0.829, precision=0.843, granularity=1.11). The recall on the test collection was unfortunately lower (recall=0.530), but precision and granularity remained consistent with the training set (precision=0.831, granularity=1.18). This drop in recall may be due to the fact that our candidate retrieval stage retrieves only documents which share copied fragments, whereas there exist plagiarised documents which have no exact-copied cases. Although this system can detect some means of obfuscation, such as restructuring or rewording of a few phrases, it might not work with handmade paraphrases. Our future work is to improve the candidate retrieval stage and incorporate semantic-based metrics in the detection stage.
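The candidate retrieval and in-depth comparison stages described in the abstract can be sketched roughly as follows. This is one possible reading, not the authors' code: whitespace tokenisation stands in for their Arabic pre-processing, all function names and the `min_score` cutoff are hypothetical, and the abstract does not state the threshold used in the comparison stage.

```python
def word_kgrams(text, k=3):
    """Set of word K-grams used as a simple document fingerprint."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient of two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_documents(suspicious, sources, k=3, min_score=0.1):
    """Rank source documents by fingerprint overlap with the suspicious text.

    min_score is a hypothetical cutoff chosen for this sketch.
    """
    fingerprint = word_kgrams(suspicious, k)
    scored = [(name, jaccard(fingerprint, word_kgrams(text, k)))
              for name, text in sources.items()]
    kept = [(name, score) for name, score in scored if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def ngram_windows(tokens, n=8, overlap=3):
    """Split tokens into N-word windows that share `overlap` words."""
    step = n - overlap
    last_start = max(len(tokens) - n + 1, 1)
    return [tokens[i:i + n] for i in range(0, last_start, step)]


def window_similarity(window, candidate):
    """Average word correlation: 1 per word found in the candidate window."""
    if not window:
        return 0.0
    candidate_words = set(candidate)
    return sum(1 for word in window if word in candidate_words) / len(window)


# Hypothetical toy data.
sources = {
    "a": "the quick brown fox sleeps",
    "b": "completely different text here now",
}
candidates = candidate_documents("the quick brown fox jumps", sources)
```

A window pair would be flagged when `window_similarity` exceeds a chosen threshold, and consecutive flagged windows would then be merged in post-processing. Because the fingerprints require shared word K-grams, a source that was fully paraphrased gets a Jaccard score of zero and is never retrieved, which matches the abstract's explanation for the drop in recall.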

10 citations

Journal ArticleDOI
TL;DR: This paper presents an application programming interface for several Semantic Relatedness/Similarity metrics that measure semantic similarity/distance between multilingual words and concepts, intended for later use on sentences and paragraphs in Cross Language Plagiarism Detection (CLPD).
Abstract: Utterances in natural language are generally highly ambiguous, and a unique interpretation can usually be determined only by taking into account the context in which the utterance occurred. Automatically determining the correct sense of a polysemous word is a complicated problem, especially in multilingual corpora. This paper presents an application programming interface for several Semantic Relatedness/Similarity metrics that measure semantic similarity/distance between multilingual words and concepts, intended for later use on sentences and paragraphs in Cross Language Plagiarism Detection (CLPD), using WordNet for the English-French and English-Arabic multilingual plagiarism cases.

10 citations

Proceedings ArticleDOI
06 Mar 2015
TL;DR: This paper makes a first step toward gathering clone information needs from the description of user goals; the results are useful for various stakeholders such as programmers, managers, tool developers, and researchers.
Abstract: Clone detection can be used to achieve diverse objectives such as refactoring, program understanding, bug localization, and plagiarism detection. Each goal takes a different perspective on clone information needs. Different clone detection tools report different information about clones. To gauge the suitability of a given clone detector for a particular user objective, we need to determine which of the information needs implied by the objective the clone detector addresses. In this paper, we make a first step toward gathering clone information needs from the description of user goals. The results of our analysis are useful for various stakeholders such as programmers, managers, tool developers, and researchers.

10 citations


Network Information
Related Topics (5)
Active learning
42.3K papers, 1.1M citations
78% related
The Internet
213.2K papers, 3.8M citations
77% related
Software development
73.8K papers, 1.4M citations
77% related
Graph (abstract data type)
69.9K papers, 1.2M citations
76% related
Deep learning
79.8K papers, 2.1M citations
76% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    59
2022    126
2021    83
2020    118
2019    130
2018    125