
Showing papers on "Plagiarism detection published in 2011"


Proceedings Article
01 Jan 2011
TL;DR: In PAN'10, 18 plagiarism detectors were evaluated in detail, highlighting several important aspects of plagiarism detection, such as obfuscation, intrinsic vs. external plagiarism, and plagiarism case length.
Abstract: This paper overviews 18 plagiarism detectors that have been developed and evaluated within PAN'10. We start with a unified retrieval process that summarizes the best practices employed this year. Then, the detectors' performances are evaluated in detail, highlighting several important aspects of plagiarism detection, such as obfuscation, intrinsic vs. external plagiarism, and plagiarism case length. Finally, all results are compared to those of last year's competition.

419 citations


Journal ArticleDOI
01 Mar 2011
TL;DR: The results of the evaluation indicate that CL-CNG, despite its simple approach, is the best choice to rank and compare texts across languages if they are syntactically related.
Abstract: Cross-language plagiarism detection deals with the automatic identification and extraction of plagiarism in a multilingual setting. In this setting, a suspicious document is given, and the task is to retrieve all sections from the document that originate from a large, multilingual document collection. Our contributions in this field are as follows: (1) a comprehensive retrieval process for cross-language plagiarism detection is introduced, highlighting the differences to monolingual plagiarism detection, (2) state-of-the-art solutions for two important subtasks are reviewed, (3) retrieval models for the assessment of cross-language similarity are surveyed, and, (4) the three models CL-CNG, CL-ESA and CL-ASA are compared. Our evaluation is of realistic scale: it relies on 120,000 test documents which are selected from the corpora JRC-Acquis and Wikipedia, so that for each test document highly similar documents are available in all of the six languages English, German, Spanish, French, Dutch, and Polish. The models are employed in a series of ranking tasks, and more than 100 million similarities are computed with each model. The results of our evaluation indicate that CL-CNG, despite its simple approach, is the best choice to rank and compare texts across languages if they are syntactically related. CL-ESA almost matches the performance of CL-CNG, but on arbitrary pairs of languages. CL-ASA works best on "exact" translations but does not generalize well.

232 citations
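The CL-CNG model that tops these rankings is, at heart, character n-gram overlap measured across languages. A minimal sketch of the idea follows; the paper's exact n, preprocessing, and similarity function are not given here, so these choices are illustrative assumptions:

```python
import math
import re
from collections import Counter

def char_ngrams(text, n=3):
    """Lowercase, strip non-letters, and count overlapping character n-grams."""
    text = re.sub(r"[^a-z]", "", text.lower())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Syntactically related languages share many character n-grams, so the
# overlap is measurable even without any translation step.
en = char_ngrams("plagiarism detection across related languages")
de = char_ngrams("plagiatserkennung ueber verwandte sprachen")
score = cosine_similarity(en, de)
```

CL-ESA and CL-ASA need richer resources (a comparable corpus and a translation model, respectively), which is why the abstract singles out CL-CNG as the cheap but syntax-dependent option.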


Journal ArticleDOI
01 Mar 2011
TL;DR: This paper investigates whether plagiarism can be detected by a computer program if no reference can be provided, e.g., if the foreign sections stem from a book that is not available in digital form.
Abstract: Research in automatic text plagiarism detection focuses on algorithms that compare suspicious documents against a collection of reference documents. Recent approaches perform well in identifying copied or modified foreign sections, but they assume a closed world where a reference collection is given. This article investigates the question whether plagiarism can be detected by a computer program if no reference can be provided, e.g., if the foreign sections stem from a book that is not available in digital form. We call this problem class intrinsic plagiarism analysis; it is closely related to the problem of authorship verification. Our contributions are threefold. (1) We organize the algorithmic building blocks for intrinsic plagiarism analysis and authorship verification and survey the state of the art. (2) We show how the meta learning approach of Koppel and Schler, termed "unmasking", can be employed to post-process unreliable stylometric analysis results. (3) We operationalize and evaluate an analysis chain that combines document chunking, style model computation, one-class classification, and meta learning.

159 citations


Journal ArticleDOI
TL;DR: It is shown that stopword n-grams reveal important information for plagiarism detection since they are able to capture syntactic similarities between suspicious and original documents and they can be used to detect the exact plagiarized passage boundaries.
Abstract: In this paper a novel method for detecting plagiarized passages in document collections is presented. In contrast to previous work in this field that uses content terms to represent documents, the proposed method is based on a small list of stopwords (i.e., very frequent words). We show that stopword n-grams reveal important information for plagiarism detection since they are able to capture syntactic similarities between suspicious and original documents and they can be used to detect the exact plagiarized passage boundaries. Experimental results on a publicly available corpus demonstrate that the performance of the proposed approach is competitive when compared with the best reported results. More importantly, it achieves significantly better results when dealing with difficult plagiarism cases where the plagiarized passages are highly modified and most of the words or phrases have been replaced with synonyms. © 2011 Wiley Periodicals, Inc.

138 citations
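The core idea above — keep only the stopwords and compare their n-grams — can be sketched as follows. The stopword list and n-gram length here are illustrative assumptions, not the paper's exact configuration:

```python
from collections import Counter

# A small list of very frequent function words (illustrative, not the
# paper's exact list).
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "that", "it",
             "on", "was", "with", "as", "for", "by", "this", "be", "at"}

def stopword_ngrams(text, n=4):
    """Drop everything except stopwords, keep their order, count n-grams."""
    seq = [w for w in text.lower().split() if w in STOPWORDS]
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def shared_ngrams(doc_a, doc_b, n=4):
    """Stopword n-grams present in both documents: candidate plagiarized regions."""
    return set(stopword_ngrams(doc_a, n)) & set(stopword_ngrams(doc_b, n))
```

Because synonym substitution rewrites content words but rarely touches function words, these n-grams tend to survive exactly the heavy paraphrasing the abstract highlights.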


Journal ArticleDOI
01 Mar 2011
TL;DR: Initial experiences with constructing a corpus of answers to short questions in which plagiarism has been simulated are described; the corpus is designed to represent types of plagiarism that are not included in existing corpora and will be a useful addition to the set of resources available for the evaluation of plagiarism detection systems.
Abstract: Plagiarism is widely acknowledged to be a significant and increasing problem for higher education institutions (McCabe 2005; Judge 2008). A wide range of solutions, including several commercial systems, have been proposed to assist the educator in the task of identifying plagiarised work, or even to detect it automatically. Direct comparison of these systems is made difficult by the problems in obtaining genuine examples of plagiarised student work. We describe our initial experiences with constructing a corpus consisting of answers to short questions in which plagiarism has been simulated. This corpus is designed to represent types of plagiarism that are not included in existing corpora and will be a useful addition to the set of resources available for the evaluation of plagiarism detection systems.

133 citations


Journal ArticleDOI
TL;DR: This paper focuses on how the delegation of plagiarism detection to a technical actor produces a particular set of agencies and intentionalities which unintentionally and unexpectedly conspire to constitute some students as plagiarists and others as not.

107 citations


Journal ArticleDOI
TL;DR: In two studies, students at California State University, Northridge wrote papers that were checked for plagiarism using plagiarism-detection software. No correlation was found between knowledge of the software and plagiarism on the second paper; instead, participants were discovered to draw repeatedly from the same sources of plagiarized material across papers.
Abstract: In two studies, students at California State University, Northridge wrote papers that were checked for plagiarism using plagiarism-detection software. In the first study, half of the students in two classes were randomly selected and told by the professor that their term papers would be scanned for plagiarism using the software. Students in the remainder of each class were not informed that the software would be used. The researcher predicted that students who were explicitly warned about the use of the software would plagiarize less than students who were not, but the warning had no effect. In a second study, students wrote two papers in a series. Their knowledge about plagiarism-detection software was inversely correlated with plagiarism rates on the first paper, but no correlation was found between knowledge and plagiarism on the second paper. Instead, participants were discovered to draw repeatedly from the same sources of plagiarized material across papers.

94 citations


Proceedings ArticleDOI
21 May 2011
TL;DR: Based on the observation that some critical runtime values are hard to replace or eliminate with semantics-preserving transformation techniques, this work introduces a novel approach to dynamic characterization of executable programs and shows how this runtime property can help solve problems in software plagiarism detection.
Abstract: Identifying similar or identical code fragments becomes much more challenging in code theft cases where plagiarizers can use various automated code transformation techniques to hide stolen code from being detected. Previous works in this field are largely limited in that (1) most of them cannot handle advanced obfuscation techniques; (2) methods based on source code analysis are less practical since the source code of suspicious programs is typically not available until strong evidence has been collected; and (3) those depending on the features of specific operating systems or programming languages have limited applicability. Based on the observation that some critical runtime values are hard to replace or eliminate by semantics-preserving transformation techniques, we introduce a novel approach to dynamic characterization of executable programs. Leveraging such invariant values, our technique is resilient to various control and data obfuscation techniques. We show how the values can be extracted and refined to expose the critical values, and how we can apply this runtime property to help solve problems in software plagiarism detection. We have implemented a prototype with a dynamic taint analyzer atop a generic processor emulator. Our experimental results show that the value-based method successfully discriminates 34 plagiarisms obfuscated by SandMark, plagiarisms heavily obfuscated by KlassMaster, programs obfuscated by Thicket, and executables obfuscated by Loco/Diablo.

89 citations


Journal ArticleDOI
TL;DR: Using Turnitin formatively was viewed positively by staff and students, and although the incidence of plagiarism did not reduce because of a worsening of referencing and citation skills, the approach encouraged students to develop their writing.
Abstract: New students face the challenge of making a smooth transition between school and university, and with regards to academic practice, there are often gaps between student expectations and university requirements. This study supports the use of the plagiarism detection service Turnitin to give students instant feedback on essays to help improve academic literacy. A student cohort (n = 76) submitted draft essays to Turnitin and received instruction on how to interpret the 'originality report' themselves for feedback. The impact of this self-service approach was analysed by comparing the writing quality and incidence of plagiarism in draft and final essays, and comparing the results to a previous cohort (n = 80) who had not used Turnitin formatively. Student and staff perceptions were explored by interview and questionnaire. Using Turnitin formatively was viewed positively by staff and students, and although the incidence of plagiarism did not reduce because of a worsening of referencing and citation skills, the approach encouraged students to develop their writing. To conclude, students were positive about their experience of using Turnitin. Further work is required to understand how to use the self-service approach more effectively to improve referencing and citation, and narrow the gap between student expectations and university standards.

85 citations


Proceedings ArticleDOI
19 Sep 2011
TL;DR: Three algorithms, Greedy Citation Tiling, Citation Chunking and Longest Common Citation Sequence, are introduced, and it is shown that if these algorithms are combined, common forms of plagiarism can be detected reliably.
Abstract: Plagiarism Detection Systems have been developed to locate instances of plagiarism e.g. within scientific papers. Studies have shown that the existing approaches deliver reasonable results in identifying copy&paste plagiarism, but fail to detect more sophisticated forms such as paraphrased plagiarism, translation plagiarism or idea plagiarism. The authors of this paper demonstrated in recent studies that the detection rate can be significantly improved by not only relying on text analysis, but by additionally analyzing the citations of a document. Citations are valuable language independent markers that are similar to a fingerprint. In fact, our examinations of real world cases have shown that the order of citations in a document often remains similar even if the text has been strongly paraphrased or translated in order to disguise plagiarism. This paper introduces three algorithms and discusses their suitability for the purpose of citation-based plagiarism detection. Due to the numerous ways in which plagiarism can occur, these algorithms need to be versatile. They must be capable of detecting transpositions, scaling and combinations in a local and global form. The algorithms are coined Greedy Citation Tiling, Citation Chunking and Longest Common Citation Sequence. The evaluation showed that if these algorithms are combined, common forms of plagiarism can be detected reliably.

71 citations
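Of the three algorithms, Longest Common Citation Sequence is the most direct to sketch: it amounts to the classic LCS dynamic program applied to the ordered citation identifiers of two documents. The paper's actual variant also handles transpositions and scaling, which this sketch does not:

```python
def longest_common_citation_sequence(a, b):
    """Length of the longest common subsequence of two citation sequences.

    a, b: lists of citation identifiers in order of appearance in each
    document. A long common subsequence that survives paraphrasing or
    translation is a strong, language-independent plagiarism signal.
    """
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]
```

Because the comparison runs over citation identifiers rather than words, the same code applies unchanged to translated or heavily paraphrased documents.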


01 Jan 2011
TL;DR: An overview is given of effective plagiarism detection methods that have been used for natural language text plagiarism detection, external plagiarism detection and clustering-based plagiarism detection, along with some methods used in source code plagiarism detection.
Abstract: In this paper we give an overview of effective plagiarism detection methods that have been used for natural language text plagiarism detection, external plagiarism detection, clustering-based plagiarism detection, and some methods used in source code plagiarism detection. We also compare five software packages used for textual plagiarism detection (PlagAware, PlagScan, Check for Plagiarism, iThenticate and PlagiarismDetection.org) with respect to their features and performance.

Proceedings ArticleDOI
13 Jun 2011
TL;DR: It is shown that citation-based plagiarism detection performs significantly better than text-based procedures in identifying strong paraphrasing, translation and some idea plagiarism.
Abstract: Various approaches for plagiarism detection exist. All are based on more or less sophisticated text analysis methods such as string matching, fingerprinting or style comparison. In this paper a new approach called Citation-based Plagiarism Detection is evaluated using a doctoral thesis, in which a volunteer crowd-sourcing project called GuttenPlag identified substantial amounts of plagiarism through careful manual inspection. This new approach is able to identify similar and plagiarized documents based on the citations used in the text. It is shown that citation-based plagiarism detection performs significantly better than text-based procedures in identifying strong paraphrasing, translation and some idea plagiarism. Detection rates can be improved by combining citation-based with text-based plagiarism detection.

01 Jan 2011
TL;DR: The main contribution of the proposed external plagiarism detection is the space reduction technique used to reduce the complexity of this plagiarism detection task.
Abstract: Plagiarism detection has been considered as a classification problem which can be approximated with intrinsic strategies, considering self-based information from a given document, and external strategies, considering comparison techniques between a suspicious document and different sources. In this work, both intrinsic and external approaches for plagiarism detection are presented. First, the main contribution for intrinsic plagiarism detection is associated with the outlier detection approach for detecting changes in the author's style. Then, the main contribution for the proposed external plagiarism detection is the space reduction technique to reduce the complexity of this plagiarism detection task. Results show that our approach is highly competitive with respect to the leading research teams in plagiarism detection.

Journal ArticleDOI
Harold R. Garner1
TL;DR: The paper highlights the importance of editors and reviewers having a detection system to identify highly similar text in submitted manuscripts so that they can then review them for novelty.
Abstract: About 3,000 new citations that are highly similar to citations in previously published manuscripts appear each year in the biomedical literature (Medline) alone. This underscores the opportunity for editors and reviewers to have a detection system that identifies highly similar text in submitted manuscripts so that they can then review them for novelty. New software-based services, both commercial and free, provide this capability. The availability of such tools provides both a way to intercept suspect manuscripts and a deterrent. Unfortunately, the capabilities of these services vary considerably, mainly as a consequence of the availability and completeness of the literature bases to which new queries are compared. Most of the commercial software has been designed for detection of plagiarism in high school and college papers; however, there is at least one fee-based service (CrossRef) and one free service (etblast.org) designed to target the needs of the biomedical publication industry. Information on these various services, examples of their operability and output, and things that need to be considered by publishers, editors, and reviewers before selecting and using these services are provided.

Proceedings ArticleDOI
24 May 2011
TL;DR: The advantages and disadvantages of the latest and most important effective methods used or developed in automatic plagiarism detection are surveyed, according to their results.
Abstract: Plagiarism has become one area of interest for researchers due to its importance and its fast-growing rates. In this paper we survey and list the advantages and disadvantages of the latest and most important effective methods used or developed in automatic plagiarism detection, according to their results: mainly methods used in natural language text detection, index structures, external plagiarism detection, and clustering-based detection.

01 Jan 2011
TL;DR: In this article, each suspicious document is divided into a series of consecutive, potentially overlapping "windows" of equal size, represented by vectors containing the relative frequencies of a predetermined set of high-frequency character trigrams.
Abstract: In this paper, we describe a novel approach to intrinsic plagiarism detection. Each suspicious document is divided into a series of consecutive, potentially overlapping 'windows' of equal size. These are represented by vectors containing the relative frequencies of a predetermined set of high-frequency character trigrams. Subsequently, a distance matrix is set up in which each of the document's windows is compared to each other window. The distance measure used is a symmetric adaptation of the normalized distance (nd1) proposed by Stamatatos (17). Finally, an algorithm for outlier detection in multivariate data (based on Principal Components Analysis) is applied to the distance matrix in order to detect plagiarized sections. In the PAN-PC-2011 competition, this system (second place) achieved a competitive recall (.4279) but only reached a plagdet of .1679 due to a disappointing precision (.1075).
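The window-and-distance-matrix pipeline can be sketched as follows. The nd1_like measure is a stand-in for the adapted nd1 of Stamatatos, and the window sizes are illustrative, not the system's actual parameters:

```python
from collections import Counter

def trigram_profile(text):
    """Counts of overlapping character trigrams in a window of text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def nd1_like(p, q):
    """Symmetric relative-difference distance over the combined trigram vocabulary.

    A stand-in for the adapted nd1 measure; the paper's exact normalization
    may differ.
    """
    grams = set(p) | set(q)
    total_p, total_q = sum(p.values()), sum(q.values())
    d = 0.0
    for g in grams:
        fp, fq = p[g] / total_p, q[g] / total_q
        d += (2.0 * (fp - fq) / (fp + fq)) ** 2
    return d / (4.0 * len(grams))

def distance_matrix(text, window=5000, step=2500):
    """Profile overlapping windows of the document and compare every pair."""
    profiles = [trigram_profile(text[i:i + window])
                for i in range(0, max(1, len(text) - window + 1), step)]
    return [[nd1_like(a, b) for b in profiles] for a in profiles]
```

Rows of the matrix whose distances to the remaining windows stand out as multivariate outliers flag the stylistically deviating, i.e. potentially plagiarized, sections; the PCA-based outlier step itself is omitted here.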

Proceedings ArticleDOI
09 Feb 2011
TL;DR: The proposed method for finding the repeated n-grams in text corpora provides an alternative to previous approaches that scales almost linearly in the length of the sequence, is largely independent of n, and provides a uniform workload balance across the set of available processors.
Abstract: The identification of repeated n-gram phrases in text has many practical applications, including authorship attribution, text reuse identification, and plagiarism detection. We consider methods for finding the repeated n-grams in text corpora, with emphasis on techniques that can be effectively scaled across a cluster of processors to handle very large amounts of text. We compare our proposed method to existing techniques using the 1.5 TB TREC ClueWeb-B text collection, using both single-processor and multi-processor approaches. The experiments show that our method offers an important tradeoff between speed and temporary storage space, and provides an alternative to previous approaches that scales almost linearly in the length of the sequence, is largely independent of n, and provides a uniform workload balance across the set of available processors.
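A single-process baseline for the task makes the problem concrete; the paper's actual contribution, scaling this across a cluster of processors with a balanced workload, is not reproduced here:

```python
from collections import Counter

def repeated_ngrams(tokens, n, min_count=2):
    """Count word n-grams and keep those occurring at least min_count times.

    tokens: the corpus as a flat list of words. In-memory counting like
    this is exactly what stops scaling at terabyte sizes, motivating the
    distributed, storage-aware techniques the paper compares.
    """
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: c for gram, c in counts.items() if c >= min_count}
```

The repeated phrases found this way feed directly into text-reuse and plagiarism detection: a long n-gram shared by two documents is a candidate copied passage.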

Journal ArticleDOI
TL;DR: The Earth Mover's Distance (EMD) is employed to retrieve relevant documents, which markedly shrinks the search domain; experiments corroborate that the proposed approach is accurate and computationally efficient for performing plagiarism detection (PD).
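For one-dimensional feature histograms, the EMD reduces to the area between the two cumulative distributions, which is what makes it cheap enough to use as a retrieval filter. A sketch under that simplifying assumption (the paper's actual feature space may be richer):

```python
def emd_1d(p, q):
    """Earth Mover's Distance between two 1-D histograms over the same bins.

    Normalizes both to unit mass, then sums |CDF_p - CDF_q| over the bins.
    For 1-D histograms this equals the optimal-transport cost with unit
    ground distance between adjacent bins.
    """
    sp, sq = float(sum(p)), float(sum(q))
    cp = cq = total = 0.0
    for a, b in zip(p, q):
        cp += a / sp
        cq += b / sq
        total += abs(cp - cq)
    return total
```

Documents whose feature histograms are close in EMD are kept as candidates; distant ones are pruned before any expensive passage-level comparison.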


01 Jan 2011
TL;DR: This paper explains the performance of a plagiarism detection system which can detect external as well as intrinsic plagiarism in text, and reports results on the PAN-PC-2011 test corpus.
Abstract: This paper aims to explain the performance of a plagiarism detection system which can detect external as well as intrinsic plagiarism in text. It reports the results on the PAN-PC-2011 test corpus. We investigated Vector Space Model based techniques for detecting external plagiarism cases and discourse-marker based features to detect intrinsic plagiarism cases.
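The external component's Vector Space Model can be sketched as a plain tf-idf/cosine baseline; the tokenization and weighting choices here are generic assumptions, not the system's exact configuration:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse tf-idf vectors for a small document collection."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                      # document frequency per term
    for tokens in tokenized:
        df.update(set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: count * math.log(n / df[t]) for t, count in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse tf-idf vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Suspicious passages are scored against source candidates this way, and high-cosine pairs go on to detailed comparison.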

Proceedings ArticleDOI
07 Apr 2011
TL;DR: Five tools for detecting plagiarism in Java source code texts: JPlag, Marble, moss, Plaggie, and sim are compared with respect to their features and performance.
Abstract: In this paper we compare five tools for detecting plagiarism in Java source code texts: JPlag, Marble, moss, Plaggie, and sim. The tools are compared with respect to their features and performance. For the performance comparison we carried out two experiments: to compare the sensitivity of the tools to different plagiarism techniques, we applied the tools to a set of intentionally plagiarised programs. To get a picture of the precision of the tools, we ran the tools on several incarnations of a student assignment and compared the top 10s of the results.

Book ChapterDOI
12 Sep 2011
TL;DR: Text outlier detection methodologies are included to enhance both intrinsic and external plagiarism detection, and the approach is shown to be highly competitive with respect to the leading research teams in plagiarism detection.
Abstract: Plagiarism detection, one of the main problems that educational institutions have been dealing with since the massification of the Internet, can be considered as a classification problem using both self-based information and text processing algorithms whose computational complexity is intractable without using space search reduction algorithms. First, self-based information algorithms treat plagiarism detection as an outlier detection problem for which the classifier must decide plagiarism using only the text in a given document. Then, external plagiarism detection uses text matching algorithms where it is fundamental to reduce the matching space with text search space reduction techniques, which can be represented as another outlier detection problem. The main contribution of this work is the inclusion of text outlier detection methodologies to enhance both intrinsic and external plagiarism detection. Results show that our approach is highly competitive with respect to the leading research teams in plagiarism detection.

Journal ArticleDOI
TL;DR: The research shows that most of the anti‐plagiarism services can be cracked through different methods and artificial intelligence techniques can help to improve the performance of the detection procedure.
Abstract: Purpose – This paper aims to focus on plagiarism and the consequences of anti‐plagiarism services such as Turnitin.com, iThenticate, and PlagiarismDetect.com in detecting the most recent cheatings in academic and other writings.Design/methodology/approach – The most important approach is plagiarism prevention and finding proper solutions for detecting more complex kinds of plagiarism through natural language processing and artificial intelligence self‐learning techniques.Findings – The research shows that most of the anti‐plagiarism services can be cracked through different methods and artificial intelligence techniques can help to improve the performance of the detection procedure.Research limitations/implications – Accessing entire data and plagiarism algorithms is not possible completely, so comparing is just based on the outputs from detection services. They may produce different results on the same inputs.Practical implications – Academic papers and web pages are increasing over time, and it is very ...

Journal ArticleDOI
TL;DR: Advice and warnings against plagiarism were ineffective but a subsequent interactive seminar was effective at reducing plagiarism.
Abstract: Background: Plagiarism is a common issue in education. Software can detect plagiarism but little is known about prevention. Aims: To identify ways to reduce the incidence of plagiarism in a postgraduate programme. Methods: From 2006, all student assignments were monitored using plagiarism detection software (Turn It In) to produce percentage text matches for each assignment. In 2007, students were advised software was being used, and that plagiarism would result in penalties. In 2008, students attending a key module took part in an additional interactive seminar on plagiarism. A separate cohort of students did not attend the seminar, allowing comparison between attendees and non-attendees. Results: Between 2006 and 2007, mean percentage text match values were consistent with a stable process, indicating advice and warnings were ineffective. Control chart analysis revealed that between 2007 and 2008, mean percentage text match changes showed a reduced text match in all nine modules, where students attended the interactive seminar, but none where students did not. This indicated that the interactive seminar had an effect. In 2008, there were no occurrences of plagiarism. Improvements were maintained in 2009. Conclusions: Advice and warnings against plagiarism were ineffective but a subsequent interactive seminar was effective at reducing plagiarism.

01 Jan 2011
TL;DR: The main novelties are the introduction of a new similarity measure and a new ranking method, which cooperate to rank the source–suspicious document pairs much better when selecting the candidates for the detailed analysis phase.
Abstract: This paper describes the evolution of our method Encoplot for automatic plagiarism detection and the results of our participation in the PAN'11 competition. The main novelties are the introduction of a new similarity measure and of a new ranking method, which cooperate to rank the source–suspicious document pairs much better when selecting the candidates for the detailed analysis phase. We obtained excellent results in the competition, ranking 1st on the manually paraphrased cases and 2nd overall in the external plagiarism detection task, and achieving the best recall on the non-translated corpus.

Proceedings ArticleDOI
24 Oct 2011
TL;DR: It is shown that stopword n-grams are able to capture local syntactic similarities between suspicious and original documents and an algorithm for detecting the exact boundaries of plagiarized and source passages is proposed.
Abstract: In this paper a novel method for detecting plagiarized passages in document collections is presented. In contrast to previous work in this field that uses mainly content terms to represent documents, the proposed method is based on structural information provided by occurrences of a small list of stopwords (i.e., very frequent words). We show that stopword n-grams are able to capture local syntactic similarities between suspicious and original documents. Moreover, an algorithm for detecting the exact boundaries of plagiarized and source passages is proposed. Experimental results on a publicly-available corpus demonstrate that the performance of the proposed approach is competitive when compared with the best reported results. More importantly, it achieves significantly better results when dealing with difficult plagiarism cases where the plagiarized passages are highly modified by replacing most of the words or phrases with synonyms to hide the similarity with the source documents.

Proceedings ArticleDOI
28 Oct 2011
TL;DR: This work investigated whether plagiarism detection tools could be used as filters for spam text messages, and solved the near-duplicate detection problem on the basis of a clustering approach using the CLUTO framework.
Abstract: Today, the number of spam text messages has grown, mainly because companies are looking for free advertising. For users it is very important to filter these kinds of spam messages, which can be viewed as near-duplicate texts because they are mostly created from templates. The identification of spam text messages is a very hard and time-consuming task, and it involves carefully scanning hundreds of text messages. Therefore, since the task of near-duplicate detection can be seen as a specific case of plagiarism detection, we investigated whether plagiarism detection tools could be used as filters for spam text messages. Moreover, we solve the near-duplicate detection problem on the basis of a clustering approach using the CLUTO framework. We carried out some preliminary experiments on the SMS Spam Collection that was recently made available for research purposes. The results were compared with the ones obtained with CLUTO. Although plagiarism detection tools detect a good number of near-duplicate SMS spam messages, even better results are obtained with the CLUTO clustering tool.

Proceedings ArticleDOI
26 Sep 2011
TL;DR: APlag, a new plagiarism detection tool for Arabic texts with new heuristics for text comparison, is introduced; preliminary results show that it significantly improves the results obtained by APD in terms of recall and precision metrics.
Abstract: Plagiarism is a serious problem, especially in academia and education. Detecting it is a challenging task, particularly in natural language texts. Many plagiarism detection tools have been developed for diverse natural languages, mainly English. Language-independent tools exist as well, but are considered as too restrictive as they usually do not consider specific language features. In this paper, we introduce APlag, a new plagiarism detection tool for Arabic texts, based on a logical representation of a document as paragraphs, sentences, and words, and new heuristics for text comparison. We describe its main attributes and present the results of some experiments conducted on a dummy test set. We demonstrate its effectiveness by comparing its performance to that of APD, a plagiarism detection tool for Arabic. Overall, preliminary results show that APlag significantly improves the results obtained by APD in terms of recall and precision metrics.

01 Jan 2011
TL;DR: It is shown that there exists a direct correlation between the obfuscation degree (method) and the achieved performance, thus defining the baseline for further studies.
Abstract: Continuing our previous work started at PAN 2009 and PAN 2010 [7], we considered further research options based on the achieved baseline of the best performing algorithms. The research done by Potthast et al. [4] presented a sliced view of the presented approaches, showing their performance on specific corpus metrics: external/intrinsic, obfuscation strategies (none, artificial high/low, simulated, translated), topic match, case length and document length, thus defining the baseline for further studies. A brief analysis of the above named results [1,3] shows that there exists a direct correlation between the obfuscation degree (method) and the achieved performance.

Journal ArticleDOI
TL;DR: A novel plagiarism-detection method, called SimPaD, establishes the degree of resemblance between any two documents D1 and D2 based on their sentence-to-sentence similarity computed using pre-defined word-correlation factors, and generates a graphical view of sentences that are similar (or the same) in D1 and D2.
Abstract: Plagiarism is a serious problem that infringes copyrighted documents/materials, which is an unethical practice and decreases the economic incentive received by their legal owners. Unfortunately, plagiarism is getting worse due to the increasing number of on-line publications and easy access on the Web, which facilitates locating and paraphrasing information. In solving this problem, we propose a novel plagiarism-detection method, called SimPaD, which (i) establishes the degree of resemblance between any two documents D1 and D2 based on their sentence-to-sentence similarity computed by using pre-defined word-correlation factors, and (ii) generates a graphical view of sentences that are similar (or the same) in D1 and D2. Experimental results verify that SimPaD is highly accurate in detecting (non-)plagiarized documents and outperforms existing plagiarism-detection approaches.
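SimPaD's sentence-to-sentence comparison can be sketched as follows. The word-correlation factors are approximated here by exact word match, a deliberately crude stand-in for the pre-computed factors the paper uses:

```python
import re

def sentence_similarity(s1, s2, corr):
    """Average, over the words of s1, of the best correlation to any word of s2.

    corr(w1, w2) -> [0, 1]; SimPaD uses pre-defined word-correlation
    factors, which would also score synonyms above zero.
    """
    words1 = s1.lower().split()
    words2 = s2.lower().split()
    if not words1 or not words2:
        return 0.0
    best = [max(corr(a, b) for b in words2) for a in words1]
    return sum(best) / len(best)

def exact_match(a, b):
    """Degenerate correlation factor: 1 for identical words, else 0."""
    return 1.0 if a == b else 0.0

def resemblance(doc1, doc2, corr=exact_match, threshold=0.5):
    """Fraction of doc1's sentences with a highly similar sentence in doc2."""
    sents1 = [s for s in re.split(r"[.!?]", doc1) if s.strip()]
    sents2 = [s for s in re.split(r"[.!?]", doc2) if s.strip()]
    if not sents1:
        return 0.0
    hits = sum(1 for s1 in sents1
               if any(sentence_similarity(s1, s2, corr) >= threshold for s2 in sents2))
    return hits / len(sents1)
```

With real word-correlation factors in place of exact_match, paraphrased sentences that swap words for synonyms still score highly, which is the property the abstract's accuracy claims rest on.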