Journal Article

Does Size Matter? Authorship Attribution, Small Samples, Big Problem

TLDR
In this article, the authors aim to find the minimal size of text samples for authorship attribution that provides stable results independent of random noise; controlled tests for different sample lengths, languages, and genres are discussed and compared.
Abstract
The aim of this study is to find such a minimal size of text samples for authorship attribution that would provide stable results independent of random noise. A few controlled tests for different sample lengths, languages, and genres are discussed and compared. Depending on the corpus used, the minimal sample length varied from 2,500 words (Latin prose) to 5,000 or so words (in most cases, including English, German, Polish, and Hungarian novels). Another observation is connected with the method of sampling: contrary to common sense, randomly excerpted ‘bags of words’ turned out to be much more effective than the classical solution, i.e. using original sequences of words (‘passages’) of desired size. Although the tests have been performed using the Delta method (Burrows, J.F. (2002). ‘Delta’: a measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3): 267–87) applied to the most frequent words, some additional experiments have been conducted for support vector machines and k-NN applied to most frequent words, character 3-grams, character 4-grams, and parts-of-speech-tag 3-grams. Despite significant differences in overall attributive success rate between particular methods and/or style markers, the minimal amount of textual data needed for reliable authorship attribution turned out to be method-independent.
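As a rough illustration of the procedures the abstract refers to, the sketch below (not the authors' code; the tokenization, function names, and reference statistics are assumptions made for the example) contrasts a randomly excerpted ‘bag of words’ sample with a consecutive passage, and computes Burrows's Delta over the most frequent words.

```python
import random
from collections import Counter

def random_bag_sample(tokens, size, seed=0):
    """Randomly excerpted 'bag of words' of `size` tokens (vs. a consecutive passage)."""
    return random.Random(seed).sample(tokens, size)

def consecutive_passage(tokens, size, start=0):
    """Classical sampling: an original sequence of `size` consecutive words."""
    return tokens[start:start + size]

def relative_freqs(tokens, vocabulary):
    """Relative frequencies of the vocabulary words within one sample."""
    counts, total = Counter(tokens), len(tokens)
    return {w: counts[w] / total for w in vocabulary}

def burrows_delta(freqs_a, freqs_b, means, stds):
    """Mean absolute difference of z-scored word frequencies (Burrows 2002).
    `means` and `stds` are per-word statistics estimated over a reference corpus."""
    def z(freqs, w):
        return (freqs[w] - means[w]) / stds[w] if stds[w] > 0 else 0.0
    return sum(abs(z(freqs_a, w) - z(freqs_b, w)) for w in means) / len(means)
```

In an attribution test, a disputed sample would be assigned to the candidate author whose sample yields the smallest Delta; repeating such tests for increasing sample sizes is what allows the study to locate the point at which the results stabilize.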


Citations
Journal Article

Authorship Attribution for Social Media Forensics

TL;DR: It is argued that there is a significant need in forensics for new authorship attribution algorithms that can exploit context, can process multi-modal data, and are tolerant to incomplete knowledge of the space of all possible authors at training time.
Journal Article

Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study

TL;DR: The impact optical character recognition (OCR) has on the quantitative analysis of historical documents is quantified and a series of specific analyses common to the digital humanities are conducted: topic modelling, authorship attribution, collocation analysis, and vector space modelling.
Journal Article

Using word n-grams to identify authors and idiolects: a corpus approach to a forensic linguistic problem

TL;DR: Using a corpus linguistic approach and the 176-author, 2.5-million-word Enron Email Corpus, the accuracy of word n-grams in identifying the authors of anonymised email samples is tested, and the usage-based concept of entrenchment is offered as a means by which to account for the recurring and distinctive production of idiolectal word n-grams.
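For readers unfamiliar with the feature type, the fragment below shows what word n-grams look like when extracted from a short email-like string; the whitespace tokenizer and the choice of bigrams are illustrative assumptions, not the study's exact pipeline.

```python
from collections import Counter

def word_ngrams(text, n):
    """Return all contiguous word n-grams from a text, lower-cased."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

email = "please review the attached report and let me know"
print(Counter(word_ngrams(email, 2)).most_common(3))
# e.g. [('please review', 1), ('review the', 1), ('the attached', 1)]
```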
Journal Article

Visualization in Stylometry: Cluster Analysis Using Networks

TL;DR: In this paper, the authors discuss reliability issues of a few visual techniques used in stylometry, and introduce a new method that enhances the explanatory power of visualization with a procedure of validation inspired by advanced statistical methods.
Journal Article

A simple and efficient algorithm for authorship verification

TL;DR: An unsupervised and effective authorship verification model called Spatium-L1 is described and evaluated, using the 200 most frequent terms of the disputed text as features and applying a simple distance measure and a set of impostors to determine whether or not the disputed text was written by the proposed author.
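Based only on the summary above, a Spatium-L1-style decision can be sketched as follows; the preprocessing, vocabulary size, and decision rule here are assumptions for illustration rather than the published implementation.

```python
from collections import Counter

def rel_freqs(tokens, vocab):
    """Relative frequencies of the vocabulary words in one token list."""
    counts, total = Counter(tokens), len(tokens)
    return [counts[w] / total for w in vocab]

def l1(a, b):
    """Simple L1 (Manhattan) distance between two frequency vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def verify(disputed, candidate, impostors, top_n=200):
    """Accept the candidate author only if the candidate text is closer to the
    disputed text than every impostor text (impostors must be non-empty)."""
    vocab = [w for w, _ in Counter(disputed).most_common(top_n)]
    d = rel_freqs(disputed, vocab)
    cand_dist = l1(d, rel_freqs(candidate, vocab))
    imp_dists = [l1(d, rel_freqs(i, vocab)) for i in impostors]
    return cand_dist < min(imp_dists)
```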
References
Journal Article

Non-traditional Authorship Attribution Studies in the Historia Augusta: Some Caveats

TL;DR: The author discusses some of the problems inherent in non-traditional authorship attribution studies of the Historia Augusta (those using statistics, stylistics, and the computer), some of which are due to practitioner error in these studies.
Journal Article

Burrowing into Translation: Character Idiolects in Henryk Sienkiewicz's Trilogy and its Two English Translations

TL;DR: The aim of the study was to verify the intuitions of traditional interpretations, acquire a more comprehensive view of the phenomenon, and obtain new insights into the nature of idiolect differentiation in Sienkiewicz.
Journal Article

Cherry Picking in Nontraditional Authorship Attribution Studies

Joseph Rudman
01 Mar 2003
TL;DR: In this paper, the problem of cherry picking in nontraditional authorship attribution studies is examined.
Journal Article

Who wrote Bacon? Assessing the respective roles of Francis Bacon and his secretaries in the production of his English works

TL;DR: A follow-up study in which two independent statistical analyses of Bacon's English works both conclude that, whereas Bacon's autographic writings clearly appear to be authored by the same person, almost none of his published works can be matched statistically with the autographs.
Journal Article

Goldsmith's contributions to the ‘Critical Review’: a supplement

TL;DR: Test scores for linguistic features, word-patterns and word-lengths, in conjunction with parallels of word and thought, can provide corroborative internal evidence both for and against the attribution to Goldsmith of items in the Critical Review.