Journal Article

Does Size Matter? Authorship Attribution, Small Samples, Big Problem

TLDR
In this article, the authors seek the minimal size of text samples for authorship attribution that provides stable results independent of random noise; controlled tests for different sample lengths, languages, and genres are discussed and compared.
Abstract
The aim of this study is to find such a minimal size of text samples for authorship attribution that would provide stable results independent of random noise. A few controlled tests for different sample lengths, languages, and genres are discussed and compared. Depending on the corpus used, the minimal sample length varied from 2,500 words (Latin prose) to 5,000 or so words (in most cases, including English, German, Polish, and Hungarian novels). Another observation is connected with the method of sampling: contrary to common sense, randomly excerpted ‘bags of words’ turned out to be much more effective than the classical solution, i.e. using original sequences of words (‘passages’) of desired size. Although the tests have been performed using the Delta method (Burrows, J.F. (2002). ‘Delta’: a measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3): 267–87) applied to the most frequent words, some additional experiments have been conducted for support vector machines and k-NN applied to most frequent words, character 3-grams, character 4-grams, and parts-of-speech-tag 3-grams. Despite significant differences in overall attributive success rate between particular methods and/or style markers, the minimal amount of textual data needed for reliable authorship attribution turned out to be method-independent.
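The Delta measure named in the abstract is straightforward: relative frequencies of the most frequent words are standardized into z-scores across the candidate texts, and Delta is the mean absolute difference between the z-score profiles of two texts. Below is a minimal sketch in Python, assuming tokenized lowercase input; the `bag_of_words_sample` helper illustrates the random ‘bag of words’ sampling contrasted with contiguous passages. Function names and defaults are illustrative, not taken from the paper.

```python
# Minimal sketch of Burrows's Delta on most frequent words (MFW),
# plus a random 'bag of words' sampler. Assumes texts arrive as
# lowercase token lists; names and defaults are illustrative.
import math
import random
from collections import Counter

def relative_frequencies(tokens, vocab):
    counts = Counter(tokens)
    n = len(tokens)
    return {w: counts[w] / n for w in vocab}

def delta_distances(corpus, disputed, n_mfw=100):
    """corpus: dict author -> token list; disputed: token list.
    Returns dict author -> Delta distance (lower = more similar)."""
    # 1. The n_mfw most frequent words across all candidate texts.
    all_tokens = [t for toks in corpus.values() for t in toks]
    vocab = [w for w, _ in Counter(all_tokens).most_common(n_mfw)]
    # 2. Relative frequency profile of every candidate text.
    freqs = {a: relative_frequencies(t, vocab) for a, t in corpus.items()}
    # 3. Mean and standard deviation of each word across candidates.
    mean = {w: sum(f[w] for f in freqs.values()) / len(freqs) for w in vocab}
    std = {w: (math.sqrt(sum((f[w] - mean[w]) ** 2
                             for f in freqs.values()) / len(freqs)) or 1e-9)
           for w in vocab}
    # 4. Delta = mean absolute difference between z-score profiles.
    disp = relative_frequencies(disputed, vocab)
    dz = {w: (disp[w] - mean[w]) / std[w] for w in vocab}
    return {a: sum(abs((freqs[a][w] - mean[w]) / std[w] - dz[w])
                   for w in vocab) / len(vocab)
            for a in corpus}

def bag_of_words_sample(tokens, size):
    """Randomly excerpted 'bag of words' of a given length, as opposed
    to a contiguous passage tokens[i:i + size]."""
    return random.sample(tokens, size)
```

The candidate with the smallest Delta to the disputed sample is the attributed author; the study's finding is that this answer stabilizes only once samples reach roughly 2,500 to 5,000 words, regardless of the method used.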


Citations
Journal Article

Authorship Attribution for Social Media Forensics

TL;DR: It is argued that there is a significant need in forensics for new authorship attribution algorithms that can exploit context, can process multi-modal data, and are tolerant to incomplete knowledge of the space of all possible authors at training time.
Journal Article

Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study

TL;DR: The impact that optical character recognition (OCR) errors have on the quantitative analysis of historical documents is quantified, and a series of analyses common to the digital humanities is conducted: topic modelling, authorship attribution, collocation analysis, and vector space modelling.
Journal Article

Using word n-grams to identify authors and idiolects: a corpus approach to a forensic linguistic problem

TL;DR: Using a corpus-linguistic approach and the 176-author, 2.5-million-word Enron Email Corpus, the accuracy of word n-grams in identifying the authors of anonymised email samples is tested, and the usage-based concept of entrenchment is offered as a means of accounting for the recurring and distinctive production of idiolectal word n-grams.
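As a rough illustration of the word n-gram approach (not the Enron study's exact setup; the toy emails, the bigram order, and the Jaccard overlap below are assumptions), distinct word bigrams can be extracted with scikit-learn and compared by set overlap:

```python
# Hedged sketch: word bigram features for authorship comparison.
# Toy texts and the similarity measure are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer

docs = {
    "author_A": "please find attached the report for this week",
    "author_B": "let me know if you have any questions about this",
    "disputed": "please find attached the figures for this week",
}

vec = CountVectorizer(analyzer="word", ngram_range=(2, 2), binary=True)
X = vec.fit_transform(docs.values()).toarray()
names = list(docs)

def jaccard(a, b):
    # Overlap of distinct word bigrams between two texts.
    inter = ((a > 0) & (b > 0)).sum()
    union = ((a > 0) | (b > 0)).sum()
    return inter / union if union else 0.0

q = X[names.index("disputed")]
for name in ("author_A", "author_B"):
    print(name, round(jaccard(X[names.index(name)], q), 3))
```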
Journal Article

Visualization in Stylometry: Cluster Analysis Using Networks

TL;DR: In this paper, the authors discuss reliability issues of a few visual techniques used in stylometry, and introduce a new method that enhances the explanatory power of visualization with a procedure of validation inspired by advanced statistical methods.
Journal Article

A simple and efficient algorithm for authorship verification

TL;DR: An unsupervised and effective authorship verification model called Spatium-L1 is described and evaluated, using the 200 most frequent terms of the disputed text as features and applying a simple distance measure and a set of impostors to determine whether or not the disputed text was written by the proposed author.
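From that description, a Spatium-L1-style distance can be sketched as the L1 distance between relative-frequency profiles over the disputed text's most frequent terms; the impostor-based decision step is only outlined in a comment, and the function signature is an assumption:

```python
# Hedged sketch of a Spatium-L1-style distance: L1 distance between
# relative frequencies of the disputed text's most frequent terms.
from collections import Counter

def spatium_l1(disputed_tokens, candidate_tokens, n_terms=200):
    vocab = [w for w, _ in Counter(disputed_tokens).most_common(n_terms)]
    def rel(tokens):
        c, n = Counter(tokens), len(tokens)
        return [c[w] / n for w in vocab]
    d, a = rel(disputed_tokens), rel(candidate_tokens)
    return sum(abs(x - y) for x, y in zip(d, a))

# Verification idea: compare the candidate's distance against distances
# to a set of impostor authors, and accept authorship only when the
# candidate is clearly closer than the impostors.
```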
References
Book

Attributing Authorship: An Introduction

TL;DR: An introduction to the field of authorship attribution, covering topics such as forgery, the attribution of works by Shakespeare and his co-authors, arguing attribution, and bibliographical evidence for authorship.
Journal Article

A comparative study of machine learning methods for authorship attribution

TL;DR: Each of the methods tested performed well, but nearest shrunken centroids and regularized discriminant analysis had the best overall performance, with 0/70 cross-validation errors.
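Nearest shrunken centroids is available off the shelf in scikit-learn as NearestCentroid with a shrink_threshold; the sketch below runs it on synthetic "word frequency" data rather than the cited study's corpus, so the numbers are not comparable to the paper's:

```python
# Hedged sketch: nearest shrunken centroids via scikit-learn's
# NearestCentroid. The data is synthetic and purely illustrative.
import numpy as np
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# 60 "texts" x 50 "word frequencies", three synthetic authors.
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(20, 50))
               for m in (0.0, 0.5, 1.0)])
y = np.repeat(["A", "B", "C"], 20)

clf = NearestCentroid(shrink_threshold=0.2)
print(cross_val_score(clf, X, y, cv=5).mean())
```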
Journal Article

Bigrams of Syntactic Labels for Authorship Discrimination of Short Texts

TL;DR: A method based on the frequency of bigrams of syntactic labels, derived from partial parsing of the text, achieves high accuracy in discriminating the work of Anne and Charlotte Brontë, a task that is very difficult for traditional methods.
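A stripped-down version of that feature extraction: the paper derives bigrams over syntactic labels from partial parses, for which plain POS tags stand in here; the hand-tagged sentence is an illustrative assumption, not the paper's parser output:

```python
# Hedged sketch: frequencies of bigrams of syntactic (POS) labels.
from collections import Counter

# (word, tag) pairs as a tagger or partial parser might produce them.
tagged = [("the", "DT"), ("reader", "NN"), ("wrote", "VBD"),
          ("a", "DT"), ("long", "JJ"), ("letter", "NN")]

tags = [t for _, t in tagged]
bigrams = Counter(zip(tags, tags[1:]))
total = sum(bigrams.values())
# Relative frequencies of label bigrams form the stylistic profile.
profile = {bg: n / total for bg, n in bigrams.items()}
print(profile)
```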
Book Chapter

Effective and scalable authorship attribution using function words

TL;DR: This paper examines a large publicly available collection of newswire articles as a benchmark for comparing authorship attribution methods, showing that the benchmark clearly distinguishes between approaches and that the best methods, based on function-word features, scale acceptably.
Journal Article

Author identification: Using text sampling to handle the class imbalance problem

TL;DR: This paper presents methods for handling imbalanced multi-class textual datasets, based on two corpora in two languages (newswire stories in English and newspaper reportage in Arabic), and explores text sampling methods for constructing a training set with a desirable distribution over the classes.
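One way to realize the text-sampling idea (a sketch under assumed segment sizes, not the paper's procedure) is to cut each author's material into fixed-length segments and keep the same number of segments per class, so the training distribution over authors is uniform:

```python
# Hedged sketch of text sampling against class imbalance. The segment
# length, per-class count, and helper name are illustrative assumptions.
import random

def balanced_segments(corpus, seg_len=500, per_class=10, seed=0):
    """corpus: dict author -> token list. Returns (segments, labels)."""
    rng = random.Random(seed)
    segments, labels = [], []
    for author, tokens in corpus.items():
        # Non-overlapping fixed-length chunks of this author's text.
        starts = range(0, max(len(tokens) - seg_len, 0) + 1, seg_len)
        chunks = [tokens[s:s + seg_len] for s in starts]
        rng.shuffle(chunks)
        # Keep the same number of chunks for every author.
        for chunk in chunks[:per_class]:
            segments.append(chunk)
            labels.append(author)
    return segments, labels
```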