
This is a post-review, pre-publication (post-print) version of the paper: Wright, D. (2017) Using
word n-grams to identify authors and idiolects: A corpus approach to a forensic linguistic
problem. To appear in the International Journal of Corpus Linguistics 22(2).
https://benjamins.com/#catalog/journals/ijcl.22.2.03wri/details
Using word n-grams to identify authors and idiolects
A corpus approach to a forensic linguistic problem
David Wright
Nottingham Trent University
Forensic authorship attribution is concerned with identifying the writers of
anonymous criminal documents. Over the last twenty years, computer scientists
have developed a wide range of statistical procedures using a number of
different linguistic features to measure similarity between texts. However, much
of this work is not of practical use to forensic linguists who need to explain in
reports or in court why a particular method of identifying potential authors
works. This paper sets out to address this problem using a corpus linguistic
approach and the 176-author 2.5 million-word Enron Email Corpus. Drawing
on literature positing the idiolectal nature of collocations, phrases and word
sequences, this paper tests the accuracy of word n-grams in identifying the
authors of anonymised email samples. Moving beyond the statistical analysis,
the usage-based concept of entrenchment is offered as a means by which to
account for the recurring and distinctive production of idiolectal word n-grams.
Keywords: forensic linguistics, idiolect, authorship attribution, entrenchment,
Enron
1. The linguistic individual, corpora and forensic linguistics
‘Idiolect’ is a well-established concept in linguistics, yet the individual is rarely the
focus of linguistic enquiry. There are many possible reasons for this, but perhaps the
main deterrent to the study of idiolect is the practical difficulty of doing so. Bloch
(1948: 7) coined the term ‘idiolect’ to refer to “not merely what a speaker says at one
time: it is everything that he could say in a given language” (original emphasis).
Clearly, the task of collecting everything that a person could say is an impossible one.
However, recent work in corpus linguistics that has put the individual at the centre of
its investigations has narrowed the goalposts set out by Bloch (1948) by analysing
the linguistic output that individual speakers or writers actually produce (e.g. Coniam
2004, Mollin 2009, Barlow 2013). These studies use smaller, specialised corpora to
systematically examine idiolectal variation that is masked or buried in traditional large-
scale reference corpora.
The field which stands to benefit the most from the empirical investigation of
idiolect is forensic linguistics, and in particular forensic authorship attribution.
Authorship attribution is the process in which linguists set out to identify the author(s)
of disputed texts using identifiable features of linguistic style, ranging from word
frequencies to preferred syntactic structures. In a forensic context, the disputed texts
under analysis are potentially evidential in alleged infringements of the law or threats to
security. Such texts can include abusive emails, ransom notes, extortion letters, falsified
suicide notes, or text messages sent by a person acting as someone else. In the most
straightforward case, the analysis requires the linguist to analyse the style(s) exhibited
in the known writings of the suspect or candidate authors involved in the case.
Attention then turns to the disputed document(s), as the linguist compares the writing
style of the text(s) in question and examines the extent to which it is similar to, or
consistent with, the known writing style of one (or more) of the suspects. The linguist
may then express an opinion as to how likely it is that the disputed text is or is not
written by one of the suspects. Such an analysis relies on a theory of idiolect (Coulthard
2004: 431), or at least depends on the consistency and distinctiveness of the styles of the
individuals involved (Grant 2013: 473).
There are a small number of studies and cases in which corpora or corpus
methods have been used to attribute forensic texts to their authors. Svartvik (1968) uses
a corpus approach to analyse a set of disputed witness statements in a murder case.
Coulthard (1994) uses specialised corpora of ordinary witness statements and police
statements, along with the much larger spoken element of the COBUILD corpus, in his
seminal analysis of the disputed Derek Bentley statement. Coulthard (2004) reports
another case in which the internet was used to investigate the author-distinctiveness of
twelve lexical items co-selected in one text, contributing to the capture of the
Unabomber. Despite the success of corpus approaches in these cases, few have pursued
the utility of corpus linguistics in forensic research. Kredens (2002) is the earliest
exception, using a corpus approach to compare the idiolects of two English musicians,
Robert Smith (The Cure) and Steven Morrissey (The Smiths). Larner (2014) is another
exception, with his work on
identifying idiolectal preferences for formulaic sequences in personal narratives, while
Grant (2013) uses a corpus method to identify lexical variation in text messages central
to a murder investigation, and Wright (2013) and Johnson and Wright (2014) employ
corpus techniques in the analyses of author-distinctive language use in a corpus of
business emails. This study continues to develop the use of corpus methodologies in the
investigation of idiolect and the attribution of disputed texts in a forensic context. There
are two parts to the analysis in this paper. The first part reports the results of an
authorship attribution experiment using ‘word n-grams’ as style markers. The second
part focuses on one author as a case study and examines the n-grams which were most
useful in identifying his disputed texts, discussing their nature and their implications for
the theory of idiolect and forensic authorship analysis.
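
To make the feature concrete before turning to the literature, the sketch below shows
how word n-grams might be extracted from an email and collected as style markers. It is
a minimal illustration in Python, not the study’s actual implementation; the
tokenisation rule and the function names are assumptions introduced here.

    import re

    def tokenize(text):
        # Crude tokenisation: lowercase and split on anything that is not
        # a letter or an apostrophe. A stand-in for whatever preprocessing
        # was actually applied to the Enron emails.
        return [t for t in re.split(r"[^a-z']+", text.lower()) if t]

    def word_ngrams(tokens, n):
        # Collect the set of word n-gram types in a text. Sets (types
        # rather than token frequencies) suit the overlap-based comparison
        # of samples used in the attribution experiment.
        return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    tokens = tokenize("Please review and let me know if you have any comments.")
    print(word_ngrams(tokens, 2))  # bigrams, e.g. 'please review'
    print(word_ngrams(tokens, 3))  # trigrams, e.g. 'please review and'
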
2. Word strings as features in authorship analysis
Most of the work in authorship attribution is from computer science and computational
linguistics. The last two decades have seen an explosion in the number of different
linguistic features that have been used to discriminate between authors and attribute
samples of writing to their correct author. These range from average word/sentence
length, vocabulary richness measures and function word frequencies, to word, character
and part-of-speech sequences (Stamatatos 2009). This research is unquestionably
valuable; there is now little doubt that by using a combination of linguistic features and
a sophisticated machine learning technique or algorithm we are able to successfully
identify the most likely author of a text. What we cannot do with the same confidence,
however, is explain why these methods work. As Argamon & Koppel (2013: 299)
comment, “in almost no case is there strong theoretical motivation behind the input
feature sets, such that the features have clear interpretations in stylistic terms”. Herein
lies the problem for forensic linguists, who must be able to say why the features they
describe might distinguish between authors (Grant 2008: 226). We cannot expect lay
decision makers such as judges and jurors to understand methods and results which we
cannot explain ourselves.
Word strings offer one possible remedy. Sinclair’s (1991: 109) ‘idiom principle’
holds that a language user “has available to him or her a large number of semi-
preconstructed phrases that constitute single choices”. In the twenty-five years since the
idiom principle was first introduced, there has been considerable research attention paid
to word strings, with different studies naming, identifying and characterising them in
different ways depending on the research goals at hand (Biber et al. 2004: 372; Wray
2002: 9). Despite using different terminology, originating from different theoretical
positions, and developing from different disciplines of linguistics, it is possible to
identify a common feature in previous work on word strings: their individual nature.
The following sections give an overview of some of the prominent theories regarding
the individuality of word strings, their relationship with routine communicative events,
and the existing empirical evidence of their individual nature. Finally, focus shifts to
how the present study builds upon this previous work by utilising word n-grams as a
means of attributing disputed texts and identifying idiolectal variation.
2.1. Word strings, routine and the individual
Hoey (2005: 8) argues that “we can only account for collocation if we assume that every
word is mentally primed for collocational use”. Hoey (2005: 15) draws on Firth’s
(1957) notion of ‘personal collocations’, emphasising that “an inherent quality of lexical
priming is that it is personal and that words are never primed per se; they are only
primed for someone”. He argues that everyone’s primings are different and that
everyone’s language is unique as a result of different linguistic encounters, different
parents, friends and colleagues (Hoey 2005: 181). This is a premise shared by Barlow
(2013: 444), as he points out that from a usage-based perspective, an individual’s
cognitive representation of language is influenced by the frequency of the different
expressions and constructions encountered by the speaker. This idea that differing
socio-historical linguistic backgrounds lead to differences in repertoires of choice
appears to be acceptable to forensic linguists as a means by which to account for inter-
author variation (Nini & Grant 2013: 175).
Wray (2002: 9) introduces ‘formulaic sequences’ as sequences of words (or
other elements) which appear to be pre-fabricated and retrieved whole from memory at
the time of use. The term was coined as a coverall, to consolidate “any kind of linguistic
unit that has been considered formulaic in any research field” (Wray 2002: 9). Although
Wray (2008: 67) marks a clear distinction between formulaic sequences and lexical
priming insofar as “what constitutes the fundamental currency of processing”, she too
emphasises individual variation. While particular sequences are formulaic “in the
language” and are shared across the speech community, she argues that “what is
formulaic for one person need not be formulaic for another” (Wray 2008: 11). Schmitt
et al. (2004) argue something similar. They ran oral-response dictation tasks to test
whether corpus-derived recurrent word clusters are stored holistically as
psychologically “real” formulaic sequences for native and non-native speakers of
English. Results varied, with native speakers performing better than non-natives. While
the authors emphasise that the dictation task is an indirect measure of holistic storage
(Schmitt et al. 2004: 147), they did report that some recurrent clusters are “highly
likely” to be formulaic sequences (such as go away and I don’t know what to do), while
others are “quite unlikely” to be (such as in the same way as and aim of this study)
(Schmitt et al. 2004: 138). Between these, they state, are “clusters that will be formulaic
for some people and not others; it is idiosyncratic to the individual speaker whether
they have stored these clusters or not” (Schmitt et al. 2004: 138). Furthermore, they
offer an argument that echoes Hoey’s (2005: 181) and Barlow’s (2013: 444)
explanations for idiolectal collocational preferences. They propose that “as part of their
idiolect, it is reasonable to assume that individuals have their own unique store of
formulaic sequences based on their own experience and language exposure” (Schmitt et
al. 2004: 138).
There exists a relationship between such recurring word sequences and the
specific communicative purposes they fulfil. Some argue that this relationship is
pervasive through language, such that “we start with the information we wish to
convey” in a given situation, and then we “haul out of our phrasal lexicon some patterns
that can provide the major elements of this expression” (Becker 1975: 62). Others (e.g.

References

Nattinger, J. R. & DeCarrico, J. S. (1992). Lexical Phrases and Language Teaching. Oxford: Oxford University Press.
Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3), 538-556.
Wray, A. (2002). Formulaic Language and the Lexicon. Cambridge: Cambridge University Press.
Frequently Asked Questions (13)
Q1. What have the authors contributed in "Using word n-grams to identify authors and idiolects: a corpus approach to a forensic linguistic problem"?

However, much of this work is not of practical use to forensic linguists who need to explain in reports or in court why a particular method of identifying potential authors works. This paper sets out to address this problem using a corpus linguistic approach and the 176-author 2. 5 million-word Enron Email Corpus. Drawing on literature positing the idiolectal nature of collocations, phrases and word sequences, this paper tests the accuracy of word n-grams in identifying the authors of anonymised email samples. 

This may motivate future authorship studies to more closely consider the success rate of methods on individuals, rather than making generalised claims about reliability and accuracy across whole corpora. It was suggested that these word n-grams were tied to recurrent and routine communicative situations that he encounters during his daily work. As style markers, word n-grams require further testing on a wider range of less explicitly routine or formulaic text-types to evaluate their usefulness.

In the attribution experiment, there were a total of 3,000 tests in which the author of a disputed sample was to be identified (twelve authors, ten samples of five different sizes, five n-gram lengths).

For each of the twelve authors, ten random samples of 20%, 15%, 10%, 5% and 2% of their emails were extracted and anonymised, giving a total of 50 samples per author. 
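
A minimal sketch of that sampling scheme, assuming each author’s emails are held as a
list of strings; the five proportions and ten repetitions follow the description above
(with twelve authors and five n-gram lengths this yields the 3,000 tests), while the
function name and everything else here is illustrative.

    import random

    def disputed_samples(emails, proportions=(0.20, 0.15, 0.10, 0.05, 0.02),
                         repeats=10):
        # Draw ten random samples at each of the five sizes, giving the
        # 50 anonymised "disputed" samples per author described above.
        samples = []
        for p in proportions:
            k = max(1, round(len(emails) * p))
            for _ in range(repeats):
                samples.append(random.sample(emails, k))
        return samples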

The similarity between the disputed samples and known sets in this experiment is measured using Jaccard’s similarity coefficient, a statistic which has its origins in ecology but has recently been adopted by forensic linguists (Grant 2013, Juola 2013, Johnson & Wright 2014, Larner 2014).
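
Jaccard’s coefficient measures the overlap between two sets as the size of their
intersection divided by the size of their union: J(A, B) = |A ∩ B| / |A ∪ B|. The sketch
below shows how it could drive the attribution, assuming the disputed sample and each
candidate author’s known writings have already been reduced to sets of word n-grams;
the function names are illustrative, not the paper’s own code.

    def jaccard(a, b):
        # |A ∩ B| / |A ∪ B|: 0 means no shared n-grams, 1 means identical sets.
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    def attribute(disputed, known_sets):
        # Rank candidate authors by similarity to the disputed sample and
        # attribute it to the author with the most similar known set.
        return max(known_sets, key=lambda author: jaccard(disputed, known_sets[author]))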

Creating samples of an author’s writing by combining a number of texts they have written, rather than individual texts, is a common practice in authorship studies (e.g. Luyckx & Daelemans 2011, Juola 2013, Stamatatos 2013). 

There are three points which need to be addressed with regard to the performance of this method: (i) the effect of sample size on accuracy, (ii) the performance of the different n-gram lengths, and (iii) differences in performance across authors.

Some refer to redlined versions but not clean ones (Dickson), some refer to draft instead of version (Jones), some use redline as a noun rather than the adjectival redlined (Mann, Perlingerie, Shackleton), and most use other ways of saying attached is, such as I am attaching, here is and this is.

Using the variant/variable paradigm (Mollin 2009: 382), we can observe the different ways in which authors express semantically and pragmatically the same thing.

By extension, the usage-based process of entrenchment (Langacker 1988, 2000; Schmid 2016) holds that, on the basis of authors' unique socio-historical linguistic characteristics, experiences and encounters, which word strings become entrenched inherently varies from author to author.

The main pattern that emerges across these distinctive five-grams is that many of them are related to his job as a lawyer in the company and, in particular, are reflective of his collaborative practice of drafting and revising legal documents and agreements with colleagues.

Of these 13,412 “formulaic sequences”, 301 were found in the 100 personal narratives he had collected from twenty different authors, including phrases such as in the end, at least, go back and in fact. 

Asking someone to review something is a very common practice in Enron emails, and there are twenty other authors in the corpus who, like Nemec, write please review and, and then request some further communication.