Journal ArticleDOI

Does Size Matter? Authorship Attribution, Small Samples, Big Problem

01 Jun 2015-Digital Scholarship in the Humanities (The Oxford University Press)-Vol. 30, Iss: 2, pp 167-182
TL;DR: In this article, the authors aim to determine the minimal size of text sample for authorship attribution that provides stable results independent of random noise; a few controlled tests for different sample lengths, languages, and genres are discussed and compared.
Abstract: The aim of this study is to find such a minimal size of text samples for authorship attribution that would provide stable results independent of random noise. A few controlled tests for different sample lengths, languages, and genres are discussed and compared. Depending on the corpus used, the minimal sample length varied from 2,500 words (Latin prose) to 5,000 or so words (in most cases, including English, German, Polish, and Hungarian novels). Another observation is connected with the method of sampling: contrary to common sense, randomly excerpted ‘bags of words’ turned out to be much more effective than the classical solution, i.e. using original sequences of words (‘passages’) of desired size. Although the tests have been performed using the Delta method (Burrows, J.F. (2002). ‘Delta’: a measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3): 267–87) applied to the most frequent words, some additional experiments have been conducted for support vector machines and k-NN applied to most frequent words, character 3-grams, character 4-grams, and parts-of-speech-tag 3-grams. Despite significant differences in overall attributive success rate between particular methods and/or style markers, the minimal amount of textual data needed for reliable authorship attribution turned out to be method-independent.
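The sampling contrast described above (consecutive ‘passages’ versus randomly excerpted ‘bags of words’) can be illustrated with a short Python sketch; the function names and the toy text are illustrative assumptions, not the study's code.

```python
import random

def passage_sample(tokens, size, start=0):
    """Classical sampling: a consecutive run of `size` words (a 'passage')."""
    return tokens[start:start + size]

def bag_of_words_sample(tokens, size, seed=None):
    """Randomly excerpted 'bag of words': `size` words drawn without
    replacement from anywhere in the text, original order discarded."""
    rng = random.Random(seed)
    return rng.sample(tokens, size)

# toy usage (illustrative only): compare the two sampling strategies
text = "call me ishmael some years ago never mind how long precisely".split() * 500
print(passage_sample(text, 5000)[:10])
print(bag_of_words_sample(text, 5000, seed=1)[:10])
```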
Citations
Journal ArticleDOI
TL;DR: It is argued that there is a significant need in forensics for new authorship attribution algorithms that can exploit context, can process multi-modal data, and are tolerant to incomplete knowledge of the space of all possible authors at training time.
Abstract: The veil of anonymity provided by smartphones with pre-paid SIM cards, public Wi-Fi hotspots, and distributed networks like Tor has drastically complicated the task of identifying users of social media during forensic investigations. In some cases, the text of a single posted message will be the only clue to an author’s identity. How can we accurately predict who that author might be when the message may never exceed 140 characters on a service like Twitter? For the past 50 years, linguists, computer scientists, and scholars of the humanities have been jointly developing automated methods to identify authors based on the style of their writing. All authors possess peculiarities of habit that influence the form and content of their written works. These characteristics can often be quantified and measured using machine learning algorithms. In this paper, we provide a comprehensive review of the methods of authorship attribution that can be applied to the problem of social media forensics. Furthermore, we examine emerging supervised learning-based methods that are effective for small sample sizes, and provide step-by-step explanations for several scalable approaches as instructional case studies for newcomers to the field. We argue that there is a significant need in forensics for new authorship attribution algorithms that can exploit context, can process multi-modal data, and are tolerant to incomplete knowledge of the space of all possible authors at training time.

189 citations

Journal ArticleDOI
TL;DR: The impact optical character recognition (OCR) has on the quantitative analysis of historical documents is quantified and a series of specific analyses common to the digital humanities are conducted: topic modelling, authorship attribution, collocation analysis, and vector space modelling.
Abstract: This article aims to quantify the impact optical character recognition (OCR) has on the quantitative analysis of historical documents. Using Eighteenth Century Collections Online as a case study, we first explore and explain the differences between the OCR corpus and its keyed-in counterpart, created by the Text Creation Partnership. We then conduct a series of specific analyses common to the digital humanities: topic modelling, authorship attribution, collocation analysis, and vector space modelling. The article concludes by offering some preliminary thoughts on how these conclusions can be applied to other datasets, by reflecting on the potential for predicting the quality of OCR where no ground-truth exists.

70 citations

Journal ArticleDOI
TL;DR: Using a corpus linguistic approach and the 176-author 2.5 million-word Enron Email Corpus, the accuracy of word n-grams in identifying the authors of anonymised email samples is tested and the usage-based concept of entrenchment is offered as a means by which to account for the recurring and distinctive production of idiolectal word n-grams.
Abstract: Forensic authorship attribution is concerned with identifying the writers of anonymous criminal documents. Over the last twenty years, computer scientists have developed a wide range of statistical procedures using a number of different linguistic features to measure similarity between texts. However, much of this work is not of practical use to forensic linguists who need to explain in reports or in court why a particular method of identifying potential authors works. This paper sets out to address this problem using a corpus linguistic approach and the 176-author 2.5 million-word Enron Email Corpus. Drawing on literature positing the idiolectal nature of collocations, phrases and word sequences, this paper tests the accuracy of word n-grams in identifying the authors of anonymised email samples. Moving beyond the statistical analysis, the usage-based concept of entrenchment is offered as a means by which to account for the recurring and distinctive production of idiolectal word n-grams.
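A minimal sketch of the kind of word n-gram comparison the study describes, assuming simple whitespace tokenisation and only Python's standard library; the snippet is illustrative, not the paper's pipeline.

```python
from collections import Counter

def word_ngrams(text, n=2):
    """Count word n-grams, using simple whitespace tokenisation."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# crude similarity signal: shared bigram counts between a disputed
# sample and two candidates' known writing (illustrative strings only)
known_a = word_ngrams("please see the attached file for the revised figures")
known_b = word_ngrams("let me know if you have any questions about this")
query   = word_ngrams("see the attached file and let me know")

print(sum((query & known_a).values()), sum((query & known_b).values()))
```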

69 citations

Journal ArticleDOI
TL;DR: In this paper, the authors discuss reliability issues of a few visual techniques used in stylometry, and introduce a new method that enhances the explanatory power of visualization with a procedure of validation inspired by advanced statistical methods.
Abstract: The aim of this article is to discuss reliability issues of a few visual techniques used in stylometry, and to introduce a new method that enhances the explanatory power of visualization with a procedure of validation inspired by advanced statistical methods. A promising way of extending cluster analysis dendrograms with a self-validating procedure involves producing numerous particular ‘snapshots’, or dendrograms produced using different input parameters, and combining them all into the form of a consensus tree. Significantly better results, however, can be obtained using a new visualization technique, which combines the idea of nearest neighborhood derived from cluster analysis and the idea of hammering out a clustering consensus from bootstrap consensus trees with the idea of mapping textual similarities onto a form of a network. Additionally, network analysis seems to be a good solution for large data sets.
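One way to picture the consensus idea described above is to link each text to its nearest neighbours under several parameter settings and to weight a pair of texts by how often it ends up linked. The sketch below assumes precomputed distance matrices and is a simplified illustration, not the published procedure.

```python
import numpy as np

def consensus_network(distance_matrices, k=2):
    """Combine nearest-neighbour links from several 'snapshots'
    (distance matrices computed under different settings) into one
    weighted edge list: (text_i, text_j) -> number of snapshots in
    which the pair were nearest neighbours."""
    edges = {}
    for dist in distance_matrices:
        for i in range(dist.shape[0]):
            for j in np.argsort(dist[i])[1:k + 1]:   # skip self at position 0
                key = tuple(sorted((i, int(j))))
                edges[key] = edges.get(key, 0) + 1
    return edges

# toy usage: three random symmetric matrices standing in for distances
# computed with, e.g., different most-frequent-word cutoffs
rng = np.random.default_rng(0)
snapshots = []
for _ in range(3):
    m = rng.random((5, 5))
    m = (m + m.T) / 2
    np.fill_diagonal(m, 0.0)
    snapshots.append(m)
print(consensus_network(snapshots, k=2))
```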

67 citations

Journal ArticleDOI
01 Jan 2017
TL;DR: An unsupervised and effective authorship verification model called Spatium-L1 is described and evaluated, using the 200 most frequent terms of the disputed text as features and applying a simple distance measure and a set of impostors to determine whether or not the disputed text was written by the proposed author.
Abstract: This paper describes and evaluates an unsupervised and effective authorship verification model called Spatium-L1. As features, we suggest using the 200 most frequent terms of the disputed text (isolated words and punctuation symbols). Applying a simple distance measure and a set of impostors, we can determine whether or not the disputed text was written by the proposed author. Moreover, based on a simple rule we can define when there is enough evidence to propose an answer or when the attribution scheme is unable to make a decision with a high degree of certainty. Evaluations based on 6 test collections (PAN CLEF 2014 evaluation campaign) indicate that Spatium-L1 usually appears in the top 3 best verification systems, and on an aggregate measure, presents the best performance. The suggested strategy can be adapted without any problem to different Indo-European languages (such as English, Dutch, Spanish, and Greek) or genres (essay, novel, review, and newspaper article).
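A rough sketch of the verification scheme described above: relative frequencies of the most frequent terms of the disputed text, an L1 distance, and a set of impostors. The decision rule below is a simplified assumption, not the published Spatium-L1 model.

```python
from collections import Counter

def profile(text, vocab):
    """Relative frequencies of the chosen terms in one text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in vocab]

def l1(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def verify(disputed, candidate_text, impostor_texts, n_terms=200):
    """Attribute the disputed text to the candidate only if the candidate's
    profile is closer (L1) than every impostor's; assumes at least one
    impostor and uses a deliberately simplified decision rule."""
    vocab = [w for w, _ in Counter(disputed.lower().split()).most_common(n_terms)]
    q = profile(disputed, vocab)
    d_candidate = l1(profile(candidate_text, vocab), q)
    d_impostors = [l1(profile(t, vocab), q) for t in impostor_texts]
    return d_candidate < min(d_impostors)
```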

47 citations

References
Journal IssueDOI
TL;DR: A survey of recent advances of the automated approaches to attributing authorship is presented, examining their characteristics for both text representation and text classification.
Abstract: Authorship attribution supported by statistical or computational methods has a long history starting from the 19th century and is marked by the seminal study of Mosteller and Wallace (1964) on the authorship of the disputed “Federalist Papers.” During the last decade, this scientific field has been developed substantially, taking advantage of research advances in areas such as machine learning, information retrieval, and natural language processing. The plethora of available electronic texts (e.g., e-mail messages, online forum messages, blogs, source code, etc.) indicates a wide variety of applications of this technology, provided it is able to handle short and noisy text from multiple candidate authors. In this article, a survey of recent advances of the automated approaches to attributing authorship is presented, examining their characteristics for both text representation and text classification. The focus of this survey is on computational requirements and settings rather than on linguistic or literary issues. We also discuss evaluation methodologies and criteria for authorship attribution studies and list open questions that will attract future work in this area. © 2009 Wiley Periodicals, Inc.

1,186 citations

Journal IssueDOI
TL;DR: Three scenarios are considered here for which solutions to the basic attribution problem are inadequate; it is shown how machine learning methods can be adapted to handle the special challenges of each variant.
Abstract: Statistical authorship attribution has a long history, culminating in the use of modern machine learning classification methods. Nevertheless, most of this work suffers from the limitation of assuming a small closed set of candidate authors and essentially unlimited training text for each. Real-life authorship attribution problems, however, typically fall short of this ideal. Thus, following detailed discussion of previous work, three scenarios are considered here for which solutions to the basic attribution problem are inadequate. In the first variant, the profiling problem, there is no candidate set at all; in this case, the challenge is to provide as much demographic or psychological information as possible about the author. In the second variant, the needle-in-a-haystack problem, there are many thousands of candidates for each of whom we might have a very limited writing sample. In the third variant, the verification problem, there is no closed candidate set but there is one suspect; in this case, the challenge is to determine if the suspect is or is not the author. For each variant, it is shown how machine learning methods can be adapted to handle the special challenges of that variant. © 2009 Wiley Periodicals, Inc.

523 citations

Journal ArticleDOI
TL;DR: A new way of using the relative frequencies of the very common words for comparing written texts and testing their likely authorship, which offers a simple but comparatively accurate addition to current methods of distinguishing the most likely author of texts exceeding about 1,500 words in length.
Abstract: This paper is a companion to my 'Questions of authorship: attribution and beyond', in which I sketched a new way of using the relative frequencies of the very common words for comparing written texts and testing their likely authorship. The main emphasis of that paper was not on the new procedure but on the broader consequences of our increasing sophistication in making such comparisons and the increasing (although never absolute) reliability of our inferences about authorship. My present objects, accordingly, are to give a more complete account of the procedure itself; to report the outcome of an extensive set of trials; and to consider the strengths and limitations of the new procedure. The procedure offers a simple but comparatively accurate addition to our current methods of distinguishing the most likely author of texts exceeding about 1,500 words in length. It is of even greater value as a method of reducing the field of likely candidates for texts of as little as 100 words in length. Not unexpectedly, it works least well with texts of a genre uncharacteristic of their author and, in one case, with texts far separated in time across a long literary career. Its possible use for other classificatory tasks has not yet been investigated.
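The procedure builds on Burrows's Delta: z-score the relative frequencies of the very common words across a reference set and average the absolute differences between texts. A compact sketch with made-up frequencies:

```python
import numpy as np

def delta(freqs, query_index):
    """Burrows's Delta: mean absolute difference of z-scored relative
    frequencies of the most frequent words, from the query text to
    every text in the `freqs` matrix (rows = texts, columns = words)."""
    mu = freqs.mean(axis=0)
    sigma = freqs.std(axis=0)
    sigma[sigma == 0] = 1.0            # guard against zero variance
    z = (freqs - mu) / sigma
    return np.abs(z - z[query_index]).mean(axis=1)

# toy usage: 4 texts, 5 most-frequent-word frequencies (made-up numbers)
freqs = np.array([
    [0.060, 0.035, 0.028, 0.020, 0.015],
    [0.058, 0.036, 0.027, 0.021, 0.014],
    [0.045, 0.041, 0.033, 0.018, 0.019],
    [0.047, 0.040, 0.032, 0.017, 0.020],
])
print(delta(freqs, query_index=0))     # text 1 should come out closest to text 0
```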

457 citations

Journal ArticleDOI
TL;DR: The results suggest that the empirical trajectories tap into a considerable amount of authorial structure without, however, guaranteeing that spatial separation implies a difference in authorship.
Abstract: A well-known problem in the domain of quantitative linguistics and stylistics concerns the evaluation of the lexical richness of texts. Since the most obvious measure of lexical richness, the vocabulary size (the number of different word types), depends heavily on the text length (measured in word tokens), a variety of alternative measures has been proposed which are claimed to be independent of the text length. This paper has a threefold aim. Firstly, we have investigated to what extent these alternative measures are truly textual constants. We have observed that in practice all measures vary substantially and systematically with the text length. We also show that in theory, only three of these measures are truly constant or nearly constant. Secondly, we have studied the extent to which these measures tap into different aspects of lexical structure. We have found that there are two main families of constants, one measuring lexical richness and one measuring lexical repetition. Thirdly, we have considered to what extent these measures can be used to investigate questions of textual similarity between and within authors. We propose to carry out such comparisons by means of the empirical trajectories of texts in the plane spanned by the dimensions of lexical richness and lexical repetition, and we provide a statistical technique for constructing confidence intervals around the empirical trajectories of texts. Our results suggest that the trajectories tap into a considerable amount of authorial structure without, however, guaranteeing that spatial separation implies a difference in authorship.
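The length dependence that motivates the paper (vocabulary size, and hence the raw type-token ratio, varies with the number of tokens) is easy to demonstrate; the toy text below is an illustrative assumption.

```python
def type_token_ratio(tokens):
    """Distinct word types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)

# TTR falls as the sample grows, which is why length-independent
# richness measures are sought in the first place
text = ("the quick brown fox jumps over the lazy dog and then the dog "
        "sleeps while the fox runs away over the hill ").split() * 200
for n in (100, 500, 1000, 4000):
    print(n, round(type_token_ratio(text[:n]), 3))
```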

391 citations

Journal ArticleDOI
TL;DR: The statement, “Results of most non-traditional authorship attribution studies are not universally accepted as definitive,” is explicated.
Abstract: The statement, “Results of most non-traditional authorship attribution studies are not universally accepted as definitive,” is explicated. A variety of problems in these studies are listed and discussed: studies governed by expediency; a lack of competent research; flawed statistical techniques; corrupted primary data; lack of expertise in allied fields; a dilettantish approach; inadequate treatment of errors. Various solutions are suggested: construct a correct and complete experimental design; educate the practitioners; study style in its totality; identify and educate the gatekeepers; develop a complete theoretical framework; form an association of practitioners.

263 citations