Does Size Matter? Authorship Attribution, Small Samples, Big Problem

doi:10.1093/LLC/FQT066

Journal Article

Maciej Eder

- 01 Jun 2015 -

Digital Scholarship in the Humanities

- Vol. 30, Iss: 2, pp 167-182

TLDR

This article seeks the minimal size of text samples for authorship attribution that yields stable results independent of random noise; it discusses and compares a few controlled tests across different sample lengths, languages, and genres.

Abstract:

The aim of this study is to find such a minimal size of text samples for authorship attribution that would provide stable results independent of random noise. A few controlled tests for different sample lengths, languages, and genres are discussed and compared. Depending on the corpus used, the minimal sample length varied from 2,500 words (Latin prose) to 5,000 or so words (in most cases, including English, German, Polish, and Hungarian novels). Another observation is connected with the method of sampling: contrary to common sense, randomly excerpted ‘bags of words’ turned out to be much more effective than the classical solution, i.e. using original sequences of words (‘passages’) of desired size. Although the tests have been performed using the Delta method (Burrows, J.F. (2002). ‘Delta’: a measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3): 267–87) applied to the most frequent words, some additional experiments have been conducted for support vector machines and k-NN applied to most frequent words, character 3-grams, character 4-grams, and parts-of-speech-tag 3-grams. Despite significant differences in overall attributive success rate between particular methods and/or style markers, the minimal amount of textual data needed for reliable authorship attribution turned out to be method-independent.
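
The two sampling strategies and the Delta measure named in the abstract can be illustrated with a minimal Python sketch. This is not the article's code: all function names are hypothetical, drawing the 'bag of words' without replacement is an assumption, and so is the use of the sample standard deviation for the z-scores. Delta is computed in its commonly described form, the mean absolute difference between the z-scored relative frequencies of the most frequent words.

```python
import random
from collections import Counter
from statistics import mean, stdev


def passage_sample(tokens, size, start=0):
    """Classical sampling: a contiguous 'passage' of `size` tokens."""
    return tokens[start:start + size]


def bag_of_words_sample(tokens, size, seed=0):
    """Random 'bag of words': `size` tokens drawn from anywhere in the text.

    Assumption: drawn without replacement; the abstract does not say.
    """
    rng = random.Random(seed)
    return rng.sample(tokens, size)


def most_frequent_words(samples, n):
    """The n most frequent words across all samples (dict: name -> tokens)."""
    counts = Counter()
    for tokens in samples.values():
        counts.update(tokens)
    return [word for word, _ in counts.most_common(n)]


def z_scores(samples, vocab):
    """Z-score each word's relative frequency across the samples."""
    freqs = {}
    for name, tokens in samples.items():
        counts = Counter(tokens)
        freqs[name] = [counts[w] / len(tokens) for w in vocab]
    columns = list(zip(*freqs.values()))  # one column per word
    mus = [mean(col) for col in columns]
    # Guard against zero spread: such a word then contributes z = 0.
    sigmas = [stdev(col) or 1.0 for col in columns]
    return {
        name: [(f - m) / s for f, m, s in zip(row, mus, sigmas)]
        for name, row in freqs.items()
    }


def delta(z_a, z_b):
    """Mean absolute difference of z-scores between two samples."""
    return mean(abs(a - b) for a, b in zip(z_a, z_b))


if __name__ == "__main__":
    # Toy data only; real tests would use samples of thousands of words.
    text = ("the cat sat on the mat and the dog sat on the log " * 50).split()
    samples = {
        "passage": passage_sample(text, 200),
        "bag": bag_of_words_sample(text, 200, seed=1),
        "other": ["the", "dog"] * 100,
    }
    z = z_scores(samples, most_frequent_words(samples, 5))
    print(delta(z["passage"], z["bag"]), delta(z["passage"], z["other"]))
```

In an attribution setting, the sample of unknown authorship would be assigned to the candidate whose sample gives the smallest Delta. Repeating the comparison at decreasing sample sizes is one way to observe where results stop being stable.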

Citations

A Computational Approach to Source Adaptation in Thomas Malory’s Morte Darthur

Exploring the Distinctiveness of Emoji Use for Digital Authorship Analysis

Exploring the Role of Emojis in Tweets for Authorship Attribution

Beyond Idiolectometry? On Racine's Stylometric Signature.

Layer on layer. ‘Computational archaeology’ in 15th-century Middle Dutch historiography

References

A survey of modern authorship attribution methods

Computational methods in authorship attribution

‘Delta’: a Measure of Stylistic Difference and a Guide to Likely Authorship

How variable may a constant be? Measures of lexical richness in perspective

The State of Authorship Attribution Studies: Some Problems and Solutions
