Journal Article

Does Size Matter? Authorship Attribution, Small Samples, Big Problem

TLDR
In this article, the authors aim to find the minimal size of text samples for authorship attribution that provides stable results independent of random noise; controlled tests for different sample lengths, languages, and genres are discussed and compared.
Abstract
The aim of this study is to find such a minimal size of text samples for authorship attribution that would provide stable results independent of random noise. A few controlled tests for different sample lengths, languages, and genres are discussed and compared. Depending on the corpus used, the minimal sample length varied from 2,500 words (Latin prose) to 5,000 or so words (in most cases, including English, German, Polish, and Hungarian novels). Another observation is connected with the method of sampling: contrary to common sense, randomly excerpted ‘bags of words’ turned out to be much more effective than the classical solution, i.e. using original sequences of words (‘passages’) of desired size. Although the tests have been performed using the Delta method (Burrows, J.F. (2002). ‘Delta’: a measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3): 267–87) applied to the most frequent words, some additional experiments have been conducted for support vector machines and k-NN applied to most frequent words, character 3-grams, character 4-grams, and parts-of-speech-tag 3-grams. Despite significant differences in overall attributive success rate between particular methods and/or style markers, the minimal amount of textual data needed for reliable authorship attribution turned out to be method-independent.
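As a rough illustration of the procedures the abstract refers to, the sketch below (not the authors' code; the tokenization, function names, and reference statistics are assumptions made for the example) contrasts a randomly excerpted ‘bag of words’ sample with a consecutive passage, and computes Burrows's Delta over the most frequent words.

```python
import random
from collections import Counter

def random_bag_sample(tokens, size, seed=0):
    """Randomly excerpted 'bag of words' of `size` tokens (vs. a consecutive passage)."""
    return random.Random(seed).sample(tokens, size)

def consecutive_passage(tokens, size, start=0):
    """Classical sampling: an original sequence of `size` consecutive words."""
    return tokens[start:start + size]

def relative_freqs(tokens, vocabulary):
    """Relative frequencies of the vocabulary words within one sample."""
    counts, total = Counter(tokens), len(tokens)
    return {w: counts[w] / total for w in vocabulary}

def burrows_delta(freqs_a, freqs_b, means, stds):
    """Mean absolute difference of z-scored word frequencies (Burrows 2002).
    `means` and `stds` are per-word statistics estimated over a reference corpus."""
    def z(freqs, w):
        return (freqs[w] - means[w]) / stds[w] if stds[w] > 0 else 0.0
    return sum(abs(z(freqs_a, w) - z(freqs_b, w)) for w in means) / len(means)
```

In an attribution test, a disputed sample would be assigned to the candidate author whose sample yields the smallest Delta; repeating such tests for increasing sample sizes is what allows the study to locate the point at which the results stabilize.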


Citations
Journal Article

Authorship Attribution for Social Media Forensics

TL;DR: It is argued that there is a significant need in forensics for new authorship attribution algorithms that can exploit context, can process multi-modal data, and are tolerant to incomplete knowledge of the space of all possible authors at training time.
Journal Article

Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study

TL;DR: The impact optical character recognition (OCR) has on the quantitative analysis of historical documents is quantified and a series of specific analyses common to the digital humanities are conducted: topic modelling, authorship attribution, collocation analysis, and vector space modelling.
Journal Article

Using word n-grams to identify authors and idiolects: a corpus approach to a forensic linguistic problem

TL;DR: Using a corpus linguistic approach and the 176-author, 2.5-million-word Enron Email Corpus, the accuracy of word n-grams in identifying the authors of anonymised email samples is tested, and the usage-based concept of entrenchment is offered as a means by which to account for the recurring and distinctive production of idiolectal word n-grams.
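For readers unfamiliar with the feature type, the fragment below shows what word n-grams look like when extracted from a short email-like string; the whitespace tokenizer and the choice of bigrams are illustrative assumptions, not the study's exact pipeline.

```python
from collections import Counter

def word_ngrams(text, n):
    """Return all contiguous word n-grams from a text, lower-cased."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

email = "please review the attached report and let me know"
print(Counter(word_ngrams(email, 2)).most_common(3))
# e.g. [('please review', 1), ('review the', 1), ('the attached', 1)]
```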
Journal Article

Visualization in Stylometry: Cluster Analysis Using Networks

TL;DR: In this paper, the authors discuss reliability issues of a few visual techniques used in stylometry, and introduce a new method that enhances the explanatory power of visualization with a procedure of validation inspired by advanced statistical methods.
Journal Article

A simple and efficient algorithm for authorship verification

TL;DR: An unsupervised and effective authorship verification model called Spatium-L1 is described and evaluated, using the 200 most frequent terms of the disputed text as features and applying a simple distance measure and a set of impostors to determine whether or not the disputed text was written by the proposed author.
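Based only on the summary above, a Spatium-L1-style decision can be sketched as follows; the preprocessing, vocabulary size, and decision rule here are assumptions for illustration rather than the published implementation.

```python
from collections import Counter

def rel_freqs(tokens, vocab):
    """Relative frequencies of the vocabulary words in one token list."""
    counts, total = Counter(tokens), len(tokens)
    return [counts[w] / total for w in vocab]

def l1(a, b):
    """Simple L1 (Manhattan) distance between two frequency vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def verify(disputed, candidate, impostors, top_n=200):
    """Accept the candidate author only if the candidate text is closer to the
    disputed text than every impostor text (impostors must be non-empty)."""
    vocab = [w for w, _ in Counter(disputed).most_common(top_n)]
    d = rel_freqs(disputed, vocab)
    cand_dist = l1(d, rel_freqs(candidate, vocab))
    imp_dists = [l1(d, rel_freqs(i, vocab)) for i in impostors]
    return cand_dist < min(imp_dists)
```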
References
Journal Article

Non-traditional Authorship Attribution Studies in the Historia Augusta: Some Caveats

TL;DR: The author discusses some of the problems inherent in non-traditional authorship attribution studies of the Historia Augusta (those using statistics, stylistics, and the computer), some of which are due to practitioner error in these studies.
Journal Article

Burrowing into Translation: Character Idiolects in Henryk Sienkiewicz's Trilogy and its Two English Translations

TL;DR: The aim of the study was to verify the intuitions of traditional interpretations, acquire a more comprehensive view of the phenomenon, and obtain new insights into the nature of idiolect differentiation in Sienkiewicz.
Journal Article

Cherry Picking in Nontraditional Authorship Attribution Studies

Joseph Rudman
01 Mar 2003
TL;DR: In this paper, the problem of cherry picking in nontraditional authorship attribution studies is examined.
Journal Article

Who wrote Bacon? Assessing the respective roles of Francis Bacon and his secretaries in the production of his English works

TL;DR: A follow-up study in which two independent statistical analyses of Bacon's English works both conclude that, whereas Bacon's autographic writings clearly appear to be authored by the same person, almost none of his published works can be matched statistically with the autographs.
Journal Article

Goldsmith's contributions to the ‘Critical Review’: a supplement

TL;DR: Test scores for linguistic features, word-patterns and word-lengths, in conjunction with parallels of word and thought, can provide corroborative internal evidence both for and against the attribution to Goldsmith of items in the Critical Review.