Journal Article
Does Size Matter? Authorship Attribution, Small Samples, Big Problem
TL;DR: In this article, the authors aim to find the minimal size of text samples for authorship attribution that provides stable results independent of random noise; a few controlled tests for different sample lengths, languages, and genres are discussed and compared.
Abstract:
The aim of this study is to find the minimal size of text samples for authorship attribution that would provide stable results independent of random noise. A few controlled tests for different sample lengths, languages, and genres are discussed and compared. Depending on the corpus used, the minimal sample length varied from 2,500 words (Latin prose) to 5,000 or so words (in most cases, including English, German, Polish, and Hungarian novels). Another observation is connected with the method of sampling: contrary to common sense, randomly excerpted ‘bags of words’ turned out to be much more effective than the classical solution, i.e. using original sequences of words (‘passages’) of desired size. Although the tests have been performed using the Delta method (Burrows, J.F. (2002). ‘Delta’: a measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3): 267–87) applied to the most frequent words, some additional experiments have been conducted for support vector machines and k-NN applied to most frequent words, character 3-grams, character 4-grams, and parts-of-speech-tag 3-grams. Despite significant differences in overall attributive success rate between particular methods and/or style markers, the minimal amount of textual data needed for reliable authorship attribution turned out to be method-independent.
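The two sampling strategies and the Delta comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the study's own code. The function names, the whitespace-tokenized input, sampling without replacement, and the use of the population standard deviation are all assumptions made for illustration. Delta is computed here as the mean absolute difference of z-scored relative frequencies of the most frequent words, which is the usual reading of Burrows (2002).

```python
import random
from collections import Counter
from statistics import mean, pstdev


def sample_passage(tokens, size, rng):
    """'Passage' sampling: a contiguous run of `size` tokens from a random start."""
    start = rng.randrange(len(tokens) - size + 1)
    return tokens[start:start + size]


def sample_bag(tokens, size, rng):
    """'Bag of words' sampling: `size` tokens drawn at random, word order discarded.

    Drawing without replacement is an assumption of this sketch.
    """
    return rng.sample(tokens, size)


def rel_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in `tokens`."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]


def burrows_delta(test_tokens, candidates, n_mfw=100):
    """Return {author: Delta} for a test sample against candidate token lists.

    `candidates` maps a (hypothetical) author label to that author's tokens.
    """
    # The most frequent words are taken from the pooled candidate texts.
    pooled = Counter()
    for toks in candidates.values():
        pooled.update(toks)
    vocab = [w for w, _ in pooled.most_common(n_mfw)]

    profiles = {a: rel_freqs(t, vocab) for a, t in candidates.items()}
    columns = list(zip(*profiles.values()))
    mus = [mean(c) for c in columns]
    # A word with zero spread gets sigma = 1.0 to avoid division by zero.
    sigmas = [pstdev(c) or 1.0 for c in columns]

    def z(freqs):
        return [(f - m) / s for f, m, s in zip(freqs, mus, sigmas)]

    z_test = z(rel_freqs(test_tokens, vocab))
    return {
        a: mean(abs(x - y) for x, y in zip(z_test, z(p)))
        for a, p in profiles.items()
    }
```

To mimic the kind of experiment the abstract reports, one could draw samples of increasing size with `sample_passage` and `sample_bag`, score each with `burrows_delta`, and record how often the lowest Delta points to the true author.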
Citations
Journal Article
Authorship Attribution for Social Media Forensics
Anderson Rocha, Walter J. Scheirer, Christopher W. Forstall, Thiago Cavalcante, Antonio Theophilo, Bingyu Shen, Ariadne R. B. Carvalho, Efstathios Stamatatos
TL;DR: It is argued that there is a significant need in forensics for new authorship attribution algorithms that can exploit context, can process multi-modal data, and are tolerant to incomplete knowledge of the space of all possible authors at training time.
Journal Article
Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study
Mark J. Hill, Simon Hengchen
TL;DR: The impact optical character recognition (OCR) has on the quantitative analysis of historical documents is quantified and a series of specific analyses common to the digital humanities are conducted: topic modelling, authorship attribution, collocation analysis, and vector space modelling.
Journal Article
Using word n-grams to identify authors and idiolects: a corpus approach to a forensic linguistic problem
TL;DR: Using a corpus linguistic approach and the 176-author 2.5 million-word Enron Email Corpus, the accuracy of word n-grams in identifying the authors of anonymised email samples is tested and the usage-based concept of entrenchment is offered as a means by which to account for the recurring and distinctive production of idiolectal word n-grams.
Journal Article
Visualization in Stylometry: Cluster Analysis Using Networks
TL;DR: In this paper, the authors discuss reliability issues of a few visual techniques used in stylometry, and introduce a new method that enhances the explanatory power of visualization with a procedure of validation inspired by advanced statistical methods.
Journal Article
A simple and efficient algorithm for authorship verification
Mirco Kocher, Jacques Savoy
TL;DR: An unsupervised and effective authorship verification model called Spatium-L1 is described and evaluated, using the 200 most frequent terms of the disputed text as features and applying a simple distance measure and a set of impostors to determine whether or not the disputed text was written by the proposed author.
References
Journal Article
Non-traditional Authorship Attribution Studies in the Historia Augusta: Some Caveats
TL;DR: The author discusses some of the problems inherent in the non-traditional authorship attribution studies of the Historia Augusta (those using statistics, stylistics, and the computer), some of which are due to practitioner error in these studies.
Journal Article
Burrowing into Translation: Character Idiolects in Henryk Sienkiewicz's Trilogy and its Two English Translations
TL;DR: The aim of the study was to verify the intuitions of traditional interpretations, acquire a more comprehensive view of the phenomenon, and obtain new insights into the nature of idiolect differentiation in Sienkiewicz.
Journal Article
Cherry Picking in Nontraditional Authorship Attribution Studies
TL;DR: This paper discusses cherry picking in nontraditional authorship attribution studies; it does not address the problem of non-traditional authorship attribution in non-canonical documents.
Journal Article
Who wrote Bacon? Assessing the respective roles of Francis Bacon and his secretaries in the production of his English works
TL;DR: A follow-up study in which two independent statistical analyses of Bacon's English works both conclude that, whereas Bacon's autographic writings clearly show that they were authored by the same person, almost none of his published works can be matched statistically with the autographs.
Journal Article
Goldsmith's contributions to the ‘Critical Review’: a supplement
Peter Dixon, David Mannion
TL;DR: Test scores for linguistic features, word-patterns and word-lengths, in conjunction with parallels of word and thought, can provide corroborative internal evidence both for and against the attribution to Goldsmith of items in the Critical Review.