
Does Size Matter? Authorship Attribution, Small Samples, Big Problem

Abstract
The aim of this study is to find the minimal size of text sample that yields stable authorship attribution results, independent of random noise. A few controlled tests for different sample lengths, languages, and genres are discussed and compared. Depending on the corpus used, the minimal sample length varied from 2,500 words (Latin prose) to about 5,000 words (in most cases, including English, German, Polish, and Hungarian novels). Another observation concerns the method of sampling: contrary to common sense, randomly excerpted ‘bags of words’ turned out to be much more effective than the classical solution, i.e. using original sequences of words (‘passages’) of the desired size. Although the tests were performed using the Delta method (Burrows, J. F. (2002). ‘Delta’: a measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3): 267–87) applied to the most frequent words, some additional experiments were conducted for support vector machines and k-NN applied to the most frequent words, character 3-grams, character 4-grams, and part-of-speech-tag 3-grams. Despite significant differences in overall attributive success rate between particular methods and/or style markers, the minimal amount of textual data needed for reliable authorship attribution turned out to be method-independent.
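The two sampling strategies and the Delta comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`sample_passage`, `sample_bag`, `burrows_delta`) are hypothetical, the most-frequent-word list is taken from the candidate texts only, the population standard deviation is used, and words with zero variance across candidates are skipped. All of these are assumptions made for illustration.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def sample_passage(tokens, size, rng):
    """Classical sampling: a contiguous 'passage' of `size` tokens."""
    start = rng.randrange(len(tokens) - size + 1)
    return tokens[start:start + size]


def sample_bag(tokens, size, rng):
    """Random 'bag of words': `size` tokens drawn without replacement."""
    return rng.sample(tokens, size)


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in `tokens`."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]


def burrows_delta(test_tokens, candidates, n_mfw=100):
    """Delta between a test sample and each candidate author.

    `candidates` maps an author name to a token list. Delta is the mean
    absolute difference of z-scored frequencies of the most frequent words.
    """
    # Most frequent words across the candidate texts (assumption: the test
    # sample does not contribute to the word list).
    corpus = Counter()
    for toks in candidates.values():
        corpus.update(toks)
    vocab = [w for w, _ in corpus.most_common(n_mfw)]

    profiles = {a: relative_freqs(t, vocab) for a, t in candidates.items()}
    columns = list(zip(*profiles.values()))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) for c in columns]
    # Assumption: words with zero variance carry no signal and are skipped.
    keep = [i for i, s in enumerate(sigmas) if s > 0]

    def zscores(profile):
        return [(profile[i] - mus[i]) / sigmas[i] for i in keep]

    z_test = zscores(relative_freqs(test_tokens, vocab))
    return {
        author: mean(abs(x - y) for x, y in zip(z_test, zscores(profile)))
        for author, profile in profiles.items()
    }
```

A typical use would draw a sample of a given length from a disputed text with either `sample_passage` or `sample_bag`, call `burrows_delta` against the candidate authors, and attribute the sample to the author with the smallest Delta. Repeating this for decreasing sample sizes gives a rough picture of where attribution stops being stable.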


Citations
Journal Article

Do birds of a feather really flock together, or how to choose training samples for authorship attribution

TL;DR: A bootstrap-like approach is used to choose randomly, in 500 iterations, the samples for the training and the test sets, inspired by k-fold cross-validation procedures, and shows considerable resistance of the English corpus to permutations, while the other corpora turned out to be more dependent on the choice of the samples.
Journal Article

Text Classification for Authorship Attribution Using Naive Bayes Classifier with Limited Training Data

TL;DR: The significance of punctuation marks for distinguishing between authors is explored, showing that performance can be improved and that the Naive Bayes (NB) classifier is more robust than Support Vector Machines (SVMs) for authorship attribution (AA) on very short texts.

The Object of Platform Studies: Relational Materialities and the Social Platform (the case of the Nintendo Wii)

TL;DR: Values in the 19th Century British Novel: Decline and Transformation of a Semantic Field and the Cultural Impact of New Media on American Literary Writing.
Journal Article

Mind your corpus: systematic errors in authorship attribution

TL;DR: The authors conducted a series of experiments on several corpora of English, German, Polish, Ancient Greek, and Latin prose texts to verify the impact of unwanted noise on the attribution abilities of particular corpora.

Poetics of the Sufi Carnival: The ‘Rogue Lyrics’ (Qalandariyât) of Sanâ’i, ‘Attâr, and ‘Erâqi

TL;DR: Keshavarz et al. present a detailed study of the poetics and cultural politics of the "rogue lyrics" (qalandariyât) of medieval Persian Sufi literature.
References
Journal Issue

A survey of modern authorship attribution methods

TL;DR: A survey of recent advances of the automated approaches to attributing authorship is presented, examining their characteristics for both text representation and text classification.
Journal Issue

Computational methods in authorship attribution

TL;DR: Three scenarios are considered here for which solutions to the basic attribution problem are inadequate; it is shown how machine learning methods can be adapted to handle the special challenges of that variant.
Journal Article

‘Delta’: a Measure of Stylistic Difference and a Guide to Likely Authorship

TL;DR: A new way of using the relative frequencies of the very common words for comparing written texts and testing their likely authorship, which offers a simple but comparatively accurate addition to current methods of distinguishing the most likely author of texts exceeding about 1,500 words in length.
Journal Article

How variable may a constant be? Measures of lexical richness in perspective

TL;DR: The results suggest that the empirical trajectories tap into a considerable amount of authorial structure without, however, guaranteeing that spatial separation implies a difference in authorship.
Journal Article

The State of Authorship Attribution Studies: Some Problems and Solutions

TL;DR: The statement, “Results of most non-traditional authorship attribution studies are not universally accepted as definitive,” is explicated.