Using word n-grams to identify authors and idiolects: a corpus approach to a forensic linguistic problem
Citations
Estimating the deep replicability of scientific findings using human and artificial intelligence
Winning Is Not Everything: Enhancing Game Development With Intelligent Agents
Attributing the Bixby Letter using n-gram tracing
Winning Isn't Everything: Enhancing Game Development with Intelligent Agents
Language and Online Identities: The Undercover Policing of Internet Sexual Crime
Frequently Asked Questions (13)
Q2. What future work is mentioned in the paper "Using word n-grams to identify authors and idiolects: a corpus approach to a forensic linguistic problem"?
This may motivate future authorship studies to more closely consider the success rate of methods on individuals, rather than making generalised claims about reliability and accuracy across whole corpora. It was suggested that these word n-grams were tied to recurrent and routine communicative situations that he encounters during his daily work. As style markers, word n-grams require further testing on a wider range of less explicitly routine or formulaic text-types to evaluate their usefulness.
Q3. How many samples were used in the attribution experiment?
In the attribution experiment, there were a total of 3,000 tests in which the author of a disputed sample was to be identified (12 authors, ten samples of five different sizes, five n-gram lengths).
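The total follows directly from the design. As a quick check, the sketch below enumerates every combination of author, sample size, sample index and n-gram length. The sample sizes are those reported under Q4; the specific n-gram lengths (1 to 5) are an assumption made only for illustration, since this page states only that there were five of them.

```python
from itertools import product

N_AUTHORS = 12
SAMPLES_PER_SIZE = 10
SAMPLE_SIZES_PCT = [20, 15, 10, 5, 2]  # percentages of each author's emails
NGRAM_LENGTHS = [1, 2, 3, 4, 5]        # assumption: the page only says "five n-gram lengths"

# One test per (author, sample size, sample index, n-gram length) combination.
tests = list(product(range(N_AUTHORS),
                     SAMPLE_SIZES_PCT,
                     range(SAMPLES_PER_SIZE),
                     NGRAM_LENGTHS))

total_tests = len(tests)
samples_per_author = len(SAMPLE_SIZES_PCT) * SAMPLES_PER_SIZE
print(total_tests, samples_per_author)
```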
Q4. How many samples were taken from each author?
For each of the twelve authors, ten random samples of 20%, 15%, 10%, 5% and 2% of their emails were extracted and anonymised, giving a total of 50 samples per author.
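A minimal sketch of this sampling scheme, under stated assumptions: `draw_samples` is a hypothetical helper, each sample is drawn independently and without replacement within the sample, and sample size is rounded to the nearest whole email. The page does not describe these details, nor the anonymisation step, so they should not be read as the paper's exact procedure.

```python
import random

SAMPLE_SIZES_PCT = (20, 15, 10, 5, 2)  # sample sizes as percentages of an author's emails
SAMPLES_PER_SIZE = 10

def draw_samples(emails, seed=0):
    """Return a list of (percentage, sampled_emails) pairs for one author.

    Assumption: samples are drawn independently of one another, so the same
    email may appear in more than one sample.
    """
    rng = random.Random(seed)
    samples = []
    for pct in SAMPLE_SIZES_PCT:
        k = max(1, round(len(emails) * pct / 100))  # at least one email per sample
        for _ in range(SAMPLES_PER_SIZE):
            samples.append((pct, rng.sample(emails, k)))
    return samples

# Usage with a made-up author who has written 100 emails.
emails = [f"email {i}" for i in range(100)]
samples = draw_samples(emails)
sample_lengths = sorted({len(sample) for _, sample in samples})
print(len(samples), sample_lengths)
```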
Q5. What statistic is used to measure similarity between the disputed samples and known sets?
The similarity between the disputed samples and known sets in this experiment is measured using Jaccard’s similarity coefficient, a statistic which has its origins in ecology but has recently been adopted by forensic linguists (Grant 2013, Juola 2013, Johnson & Wright 2014, Larner 2014).
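Jaccard's coefficient is the size of the intersection of two sets divided by the size of their union, so it depends only on which word n-grams occur in both texts, not on how often. The sketch below illustrates the idea on toy data. The whitespace tokenisation, the lowercasing, the helper names and the rule of attributing the sample to the highest-scoring candidate are all assumptions for illustration, not details confirmed by this page.

```python
def word_ngrams(text, n):
    """Set of word n-grams (as tuples) in a text, using naive whitespace tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def attribute(disputed, known_sets, n):
    """Score a disputed text against each candidate's known writing."""
    disputed_ngrams = word_ngrams(disputed, n)
    scores = {author: jaccard(disputed_ngrams, word_ngrams(text, n))
              for author, text in known_sets.items()}
    best = max(scores, key=scores.get)  # assumption: highest similarity wins
    return best, scores

# Toy data: the author labels and texts are invented for this example.
known_sets = {
    "author_a": "please review the attached draft and let me know",
    "author_b": "here is the redline of the agreement for your files",
}
disputed = "please review the attached redline and let me know"
best, scores = attribute(disputed, known_sets, n=2)
print(best, scores)
```

With bigrams, the disputed text shares six of the ten distinct bigrams in its union with `author_a` and none with `author_b`, so the scores are 0.6 and 0.0.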
Q6. What is the common practice in authorship studies?
Creating samples of an author’s writing by combining a number of texts they have written, rather than individual texts, is a common practice in authorship studies (e.g. Luyckx & Daelemans 2011, Juola 2013, Stamatatos 2013).
Q7. What are the three points which need to be addressed with regards to the performance of this method?
There are three points which need to be addressed with regards to the performance of this method: (i) the effect of sample size on accuracy, (ii) the performance of the different n-gram lengths, and (iii) difference in performance across authors.
Q8. What is the common way to say attached is?
Some refer to “redlined” versions but not “clean” (Dickson), some refer to “draft” instead of “version” (Jones), some use “redline” as a noun rather than the adjectival “redlined” (Mann, Perlingerie, Shackleton), and most use other ways of saying “attached is”, such as “I am attaching”, “here is” and “this is”.
Q9. How can the authors observe the different ways in which authors express semantically and pragmatically the same thing?
Using the variant/variable paradigm (Mollin 2009: 382), the authors can observe the different ways in which authors express semantically and pragmatically the same thing.
Q10. What does the usage-based process of entrenchment hold?
By extension, the usage-based process of entrenchment (Langacker 1988, 2000; Schmid 2016) holds that, on the basis of their unique socio-historical linguistic characteristics, experiences and encounters, which word strings become entrenched inherently varies from author to author.
Q11. What is the main pattern that emerges from across Nemec’s distinctive five-grams?
The main pattern that emerges from across these distinctive five-grams is that many of them are related to his job as a lawyer in the company and, in particular, are reflective of his collaborative practice with colleagues of drafting and revising legal documents and agreements.
Q12. How many of these were found in the personal narratives?
Of these 13,412 “formulaic sequences”, 301 were found in the 100 personal narratives he had collected from twenty different authors, including phrases such as “in the end”, “at least”, “go back” and “in fact”.
Q13. How many other authors in the corpus write please review?
Asking someone to review something is a very common practice in Enron emails, and there are twenty other authors in the corpus who, like Nemec, write “please review and” and then request some further communication.