
This is a post-review, pre-publication (post-print) version of the paper: Wright, D. (2017) Using
word n-grams to identify authors and idiolects: A corpus approach to a forensic linguistic
problem. To appear in the International Journal of Corpus Linguistics 22(2).
https://benjamins.com/#catalog/journals/ijcl.22.2.03wri/details
Using word n-grams to identify authors and idiolects
A corpus approach to a forensic linguistic problem
David Wright
Nottingham Trent University
Forensic authorship attribution is concerned with identifying the writers of
anonymous criminal documents. Over the last twenty years, computer scientists
have developed a wide range of statistical procedures using a number of
different linguistic features to measure similarity between texts. However, much
of this work is not of practical use to forensic linguists who need to explain in
reports or in court why a particular method of identifying potential authors
works. This paper sets out to address this problem using a corpus linguistic
approach and the 176-author 2.5 million-word Enron Email Corpus. Drawing
on literature positing the idiolectal nature of collocations, phrases and word
sequences, this paper tests the accuracy of word n-grams in identifying the
authors of anonymised email samples. Moving beyond the statistical analysis,
the usage-based concept of entrenchment is offered as a means by which to
account for the recurring and distinctive production of idiolectal word n-grams.
Keywords: forensic linguistics, idiolect, authorship attribution, entrenchment,
Enron
1. The linguistic individual, corpora and forensic linguistics
‘Idiolect’ is a well-established concept in linguistics, yet the individual is rarely the
focus of linguistic enquiry. There are many possible reasons for this, but perhaps the
main deterrent to the study of idiolect is the practical difficulty of doing so. Bloch
(1948: 7) coined the term ‘idiolect’ to refer to “not merely what a speaker says at one
time: it is everything that he could say in a given language” (original emphasis).
Clearly, the task of collecting everything that a person could say is an impossible one.
However, recent work in corpus linguistics that has put the individual at the centre of
its investigations has narrowed the goalposts set out by Bloch (1948) by analysing
the linguistic output that individual speakers or writers actually produce (e.g. Coniam
2004, Mollin 2009, Barlow 2013). These studies use smaller, specialised corpora to
systematically examine idiolectal variation that is masked or buried in traditional large-
scale reference corpora.
The field which stands to benefit the most from the empirical investigation of
idiolect is forensic linguistics, and in particular forensic authorship attribution.
Authorship attribution is the process in which linguists set out to identify the author(s)
of disputed texts using identifiable features of linguistic style, ranging from word
frequencies to preferred syntactic structures. In a forensic context, the disputed texts
under analysis are potentially evidential in alleged infringements of the law or threats to
security. Such texts can include abusive emails, ransom notes, extortion letters, falsified
suicide notes, or text messages sent by a person acting as someone else. In the most
straightforward case, the analysis requires the linguist to analyse the style(s) exhibited
in the known writings of the suspect or candidate authors involved in the case.
Attention then turns to the disputed document(s), as the linguist compares the writing
style of the text(s) in question and examines the extent to which it is similar to, or
consistent with, the known writing style of one (or more) of the suspects. The linguist
may then express an opinion as to how likely it is that the disputed text is or is not
written by one of the suspects. Such an analysis relies on a theory of idiolect (Coulthard
2004: 431), or at least depends on the consistency and distinctiveness of the styles of the
individuals involved (Grant 2013: 473).
There are a small number of studies and cases in which corpora or corpus
methods have been used to attribute forensic texts to their authors. Svartvik (1968) uses
a corpus approach to analyse a set of disputed witness statements in a murder case.
Coulthard (1994) uses specialised corpora of ordinary witness statements and police
statements, along with the much larger spoken element of the COBUILD corpus, in his
seminal analysis of the disputed Derek Bentley statement. Coulthard (2004) reports
another case in which the internet was used to investigate the author-distinctiveness of
twelve lexical items co-selected in one text, contributing to the capture of the
Unabomber. Despite the success of corpus approaches in these cases, few have pursued
the utility of corpus linguistics in forensic research. Kredens (2002) is the earliest
exception, using a corpus approach to compare the idiolects of two English musicians,
Robert Smith (The Cure) and Steven Morrissey (The Smiths). Larner (2014) is another
exception, with his work on
identifying idiolectal preferences for formulaic sequences in personal narratives, while
Grant (2013) uses a corpus method to identify lexical variation in text messages central
to a murder investigation, and Wright (2013) and Johnson and Wright (2014) employ
corpus techniques in the analyses of author-distinctive language use in a corpus of
business emails. This study continues to develop the use of corpus methodologies in the
investigation of idiolect and the attribution of disputed texts in a forensic context. There
are two parts to the analysis in this paper. The first part reports the results of an
authorship attribution experiment using ‘word n-grams’ as style markers. The second
part focuses on one author as a case study and examines the n-grams which were most
useful in identifying his disputed texts, discussing their nature and their implications for
the theory of idiolect and forensic authorship analysis.
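
To make the feature concrete before turning to the literature, the sketch below shows
how word n-grams might be extracted from an email and collected as style markers. It is
a minimal illustration in Python, not the study’s actual implementation; the
tokenisation rule and the function names are assumptions introduced here.

    import re

    def tokenize(text):
        # Crude tokenisation: lowercase and split on anything that is not
        # a letter or an apostrophe. A stand-in for whatever preprocessing
        # was actually applied to the Enron emails.
        return [t for t in re.split(r"[^a-z']+", text.lower()) if t]

    def word_ngrams(tokens, n):
        # Collect the set of word n-gram types in a text. Sets (types
        # rather than token frequencies) suit the overlap-based comparison
        # of samples used in the attribution experiment.
        return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    tokens = tokenize("Please review and let me know if you have any comments.")
    print(word_ngrams(tokens, 2))  # bigrams, e.g. 'please review'
    print(word_ngrams(tokens, 3))  # trigrams, e.g. 'please review and'
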
2. Word strings as features in authorship analysis
Most of the work in authorship attribution is from computer science and computational
linguistics. The last two decades have seen an explosion in the number of different
linguistic features that have been used to discriminate between authors and attribute
samples of writing to their correct author. These range from average word/sentence
length, vocabulary richness measures and function word frequencies, to word, character
and part-of-speech sequences (Stamatatos 2009). This research is unquestionably
valuable; there is now little doubt that by using a combination of linguistic features and
a sophisticated machine learning technique or algorithm we are able to successfully
identify the most likely author of a text. What we cannot do with the same confidence,
however, is explain why these methods work. As Argamon & Koppel (2013: 299)
comment, “in almost no case is there strong theoretical motivation behind the input
feature sets, such that the features have clear interpretations in stylistic terms”. Herein
lies the problem for forensic linguists, who must be able to say why the features they
describe might distinguish between authors (Grant 2008: 226). We cannot expect lay
decision makers such as judges and jurors to understand methods and results which we
cannot explain ourselves.
Word strings offer one possible remedy. Sinclair’s (1991: 109) ‘idiom principle’
holds that a language user “has available to him or her a large number of semi-
preconstructed phrases that constitute single choices”. In the twenty-five years since the
idiom principle was first introduced, there has been considerable research attention paid
to word strings, with different studies naming, identifying and characterising them in
different ways depending on the research goals at hand (Biber et al. 2004: 372; Wray
2002: 9). Despite using different terminology, originating from different theoretical
positions, and developing from different disciplines of linguistics, it is possible to
identify a common feature in previous work on word strings: their individual nature.
The following sections give an overview of some of the prominent theories regarding
the individuality of word strings, their relationship with routine communicative events,
and the existing empirical evidence of their individual nature. Finally, focus shifts to
how the present study builds upon this previous work by utilising word n-grams as a
means of attributing disputed texts and identifying idiolectal variation.
2.1. Word strings, routine and the individual
Hoey (2005: 8) argues that “we can only account for collocation if we assume that every
word is mentally primed for collocational use”. Hoey (2005: 15) draws on Firth’s
(1957) notion of ‘personal collocations’, emphasising that “an inherent quality of lexical
priming is that it is personal and that words are never primed per se; they are only
primed for someone”. He argues that everyone’s primings are different and that
everyone’s language is unique as a result of different linguistic encounters, different
parents, friends and colleagues (Hoey 2005: 181). This is a premise shared by Barlow
(2013: 444), as he points out that from a usage-based perspective, an individual’s
cognitive representation of language is influenced by the frequency of the different
expressions and constructions encountered by the speaker. This idea that differing
socio-historical linguistic backgrounds lead to differences in repertoires of choice
appears to be acceptable to forensic linguists as a means by which to account for inter-
author variation (Nini & Grant 2013: 175).
Wray (2002: 9) introduces ‘formulaic sequences’ as sequences of words (or
other elements) which appear to be pre-fabricated and retrieved whole from memory at
the time of use. The term was coined as a coverall, to consolidate “any kind of linguistic
unit that has been considered formulaic in any research field” (Wray 2002: 9). Although
Wray (2008: 67) marks a clear distinction between formulaic sequences and lexical
priming insofar as “what constitutes the fundamental currency of processing”, she too
emphasises individual variation. While particular sequences are formulaic “in the
language” and are shared across the speech community, she argues that “what is
formulaic for one person need not be formulaic for another” (Wray 2008: 11). Schmitt
et al. (2004) argue something similar. They ran oral-response dictation tasks to test
whether corpus-derived recurrent word clusters are stored holistically as
psychologically “real” formulaic sequences for native and non-native speakers of
English. Results varied, with native speakers performing better than non-natives. While
the authors emphasise that the dictation task is an indirect measure of holistic storage
(Schmitt et al. 2004: 147), they did report that some recurrent clusters are “highly
likely” to be formulaic sequences (such as go away and I don’t know what to do), while
others are “quite unlikely” to be (such as in the same way as and aim of this study)
(Schmitt et al. 2004: 138). Between these, they state, are “clusters that will be formulaic
for some people and not others; it is idiosyncratic to the individual speaker whether
they have stored these clusters or not” (Schmitt et al. 2004: 138). Furthermore, they
offer an argument that echoes Hoey’s (2005: 181) and Barlow’s (2013: 444)
explanations for idiolectal collocational preferences. They propose that “as part of their
idiolect, it is reasonable to assume that individuals have their own unique store of
formulaic sequences based on their own experience and language exposure” (Schmitt et
al. 2004: 138).
There exists a relationship between such recurring word sequences and the
specific communicative purposes they fulfil. Some argue that this relationship is
pervasive through language, such that “we start with the information we wish to
convey” in a given situation, and then we “haul out of our phrasal lexicon some patterns
that can provide the major elements of this expression” (Becker 1975: 62). Others (e.g.

References

Nattinger, J. R. & DeCarrico, J. S. (1992). Lexical Phrases and Language Teaching. Oxford: Oxford University Press.
Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3), 538-556.
Wray, A. (2002). Formulaic Language and the Lexicon. Cambridge: Cambridge University Press.
Frequently Asked Questions (13)
Q1. What have the authors contributed in "Using word n-grams to identify authors and idiolects: a corpus approach to a forensic linguistic problem"?

However, much of this work is not of practical use to forensic linguists who need to explain in reports or in court why a particular method of identifying potential authors works. This paper sets out to address this problem using a corpus linguistic approach and the 176-author 2. 5 million-word Enron Email Corpus. Drawing on literature positing the idiolectal nature of collocations, phrases and word sequences, this paper tests the accuracy of word n-grams in identifying the authors of anonymised email samples. 

This may motivate future authorship studies to more closely consider the success rate of methods on individuals, rather than making generalised claims about reliability and accuracy across whole corpora. It was suggested that these word n-grams were tied to recurrent and routine communicative situations that he encounters during his daily work. As style markers, word n-grams require further testing on a wider range of less explicitly routine or formulaic text-types to evaluate their usefulness.

In the attribution experiment, there were a total of 3,000 tests in which the author of a disputed sample was to be identified (twelve authors, ten samples of five different sizes, five n-gram lengths).

For each of the twelve authors, ten random samples of 20%, 15%, 10%, 5% and 2% of their emails were extracted and anonymised, giving a total of 50 samples per author. 
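
A minimal sketch of that sampling scheme, assuming each author’s emails are held as a
list of strings; the five proportions and ten repetitions follow the description above
(with twelve authors and five n-gram lengths this yields the 3,000 tests), while the
function name and everything else here is illustrative.

    import random

    def disputed_samples(emails, proportions=(0.20, 0.15, 0.10, 0.05, 0.02),
                         repeats=10):
        # Draw ten random samples at each of the five sizes, giving the
        # 50 anonymised "disputed" samples per author described above.
        samples = []
        for p in proportions:
            k = max(1, round(len(emails) * p))
            for _ in range(repeats):
                samples.append(random.sample(emails, k))
        return samples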

The similarity between the disputed samples and known sets in this experiment is measured using Jaccard’s similarity coefficient, a statistic which has its origins in ecology but has recently been adopted by forensic linguists (Grant 2013, Juola 2013, Johnson & Wright 2014, Larner 2014).
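
Jaccard’s coefficient measures the overlap between two sets as the size of their
intersection divided by the size of their union: J(A, B) = |A ∩ B| / |A ∪ B|. The sketch
below shows how it could drive the attribution, assuming the disputed sample and each
candidate author’s known writings have already been reduced to sets of word n-grams;
the function names are illustrative, not the paper’s own code.

    def jaccard(a, b):
        # |A ∩ B| / |A ∪ B|: 0 means no shared n-grams, 1 means identical sets.
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    def attribute(disputed, known_sets):
        # Rank candidate authors by similarity to the disputed sample and
        # attribute it to the author with the most similar known set.
        return max(known_sets, key=lambda author: jaccard(disputed, known_sets[author]))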

Creating samples of an author’s writing by combining a number of texts they have written, rather than individual texts, is a common practice in authorship studies (e.g. Luyckx & Daelemans 2011, Juola 2013, Stamatatos 2013). 

There are three points which need to be addressed with regard to the performance of this method: (i) the effect of sample size on accuracy, (ii) the performance of the different n-gram lengths, and (iii) differences in performance across authors.

Some refer to redlined versions but not clean ones (Dickson), some refer to draft instead of version (Jones), some use redline as a noun rather than the adjectival redlined (Mann, Perlingerie, Shackleton), and most use other ways of saying attached is, such as I am attaching, here is and this is.

Using the variant/variable paradigm (Mollin 2009: 382), we can observe the different ways in which authors express semantically and pragmatically the same thing.

By extension, the usage-based process of entrenchment (Langacker 1988, 2000; Schmid 2016) holds that, on the basis of authors' unique socio-historical linguistic characteristics, experiences and encounters, which word strings become entrenched inherently varies from author to author.

The main pattern that emerges across these distinctive five-grams is that many of them are related to his job as a lawyer in the company and, in particular, are reflective of his collaborative practice of drafting and revising legal documents and agreements with colleagues.

Of these 13,412 “formulaic sequences”, 301 were found in the 100 personal narratives he had collected from twenty different authors, including phrases such as in the end, at least, go back and in fact. 

Asking someone to review something is a very common practice in Enron emails, and there are twenty other authors in the corpus who, like Nemec, write please review and, and then request some further communication.