Journal ArticleDOI

Understanding and explaining Delta measures for authorship attribution

TL;DR: It is shown that feature vector normalization, that is, the transformation of the feature vectors to a uniform length of 1 (implicit in the cosine measure), is the decisive factor for the improvement of Delta proposed recently.
Abstract: This article builds on a mathematical explanation of one of the most prominent stylometric measures, Burrows’s Delta (and its variants), to understand and explain how it works. Starting with the conceptual separation between feature selection, feature scaling, and distance measures, we have designed a series of controlled experiments in which we used the kind of feature scaling (various types of standardization and normalization) and the type of distance measure (notably Manhattan, Euclidean, and Cosine) as independent variables and the correct authorship attributions as the dependent variable indicative of the performance of each of the methods proposed. In this way, we are able to describe in some detail how these two variables interact with each other and how they influence the results. Thus we can show that feature vector normalization, that is, the transformation of the feature vectors to a uniform length of 1 (implicit in the cosine measure), is the decisive factor for the improvement of Delta proposed recently. We are also able to show that the information particularly relevant to the identification of the author of a text lies in the profile of deviation across the most frequent words rather than in the extent of the deviation or in the deviation of specific words only.
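The conceptual separation the abstract describes (feature scaling versus distance measure) can be made concrete in a few lines of code. The following is a minimal sketch, not the authors' implementation: it assumes a matrix of relative frequencies of the most frequent words, z-standardizes each word column, and then compares the Manhattan (classic Burrows's Delta), Euclidean, and Cosine variants; all names are illustrative.

```python
import numpy as np

def delta_distances(freqs):
    """freqs: (n_texts, n_mfw) array of relative frequencies of the most
    frequent words. Returns Manhattan (classic Delta), Euclidean, and
    Cosine distance matrices over the z-standardized feature vectors."""
    # Feature scaling: standardize each word column over the whole corpus.
    z = (freqs - freqs.mean(axis=0)) / freqs.std(axis=0)
    # Feature vector normalization to length 1 (implicit in the cosine measure).
    unit = z / np.linalg.norm(z, axis=1, keepdims=True)

    n, m = z.shape
    manhattan = np.zeros((n, n))
    euclidean = np.zeros((n, n))
    cosine = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            manhattan[i, j] = np.abs(z[i] - z[j]).sum() / m   # Burrows's Delta
            euclidean[i, j] = np.linalg.norm(z[i] - z[j])
            cosine[i, j] = 1.0 - unit[i] @ unit[j]            # Cosine Delta
    return manhattan, euclidean, cosine
```

Attribution then assigns a disputed text to the candidate whose texts lie closest under the chosen measure; the article's finding is that normalizing the feature vectors to unit length is the decisive improvement, rather than the particular choice of distance.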


Citations
Journal ArticleDOI
TL;DR: A hybrid deep learning model for fine-grained sentiment prediction in real-time multimodal data that reinforces the strengths of deep learning nets in combination with machine learning to deal with two specific semiotic systems, namely the textual and visual systems.
Abstract: Detecting sentiments in natural language is tricky even for humans, making its automated detection more complicated. This research proffers a hybrid deep learning model for fine-grained sentiment prediction in real-time multimodal data. It reinforces the strengths of deep learning nets in combination with machine learning to deal with two specific semiotic systems, namely the textual (written text) and visual (still images), and their combination within online content using decision-level multimodal fusion. The proposed contextual ConvNet-SVMBoVW model has four modules, namely the discretization, text analytics, image analytics, and decision modules. The input to the model is multimodal text, m ∈ {text, image, info-graphic}. The discretization module uses Google Lens to separate the text from the image, which is then processed as discrete entities and sent to the respective text analytics and image analytics modules. The text analytics module determines the sentiment using a hybrid of a convolutional neural network (ConvNet) enriched with the contextual semantics of SentiCircle. An aggregation scheme is introduced to compute the hybrid polarity. A support vector machine (SVM) classifier trained using bag-of-visual-words (BoVW) predicts the sentiment of the visual content. A Boolean decision module with a logical OR operation is augmented to the architecture, which validates and categorizes the output on the basis of five fine-grained sentiment categories (truth values), namely ‘highly positive,’ ‘positive,’ ‘neutral,’ ‘negative’ and ‘highly negative.’ The accuracy achieved by the proposed model is nearly 91%, which is an improvement over the accuracy obtained by the text and image modules individually.
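As a rough illustration of the decision-level fusion described above, the sketch below combines a text polarity score and an image polarity score into the five fine-grained categories. The two score arguments stand in for the ConvNet-SentiCircle and SVM-BoVW modules, which are not reproduced here, and the thresholds and averaging rule are assumptions for illustration, not the paper's exact aggregation scheme.

```python
def to_label(score):
    # Map a polarity score in [-1, 1] onto five fine-grained categories
    # (thresholds are illustrative assumptions).
    if score <= -0.6:
        return 'highly negative'
    if score <= -0.2:
        return 'negative'
    if score < 0.2:
        return 'neutral'
    if score < 0.6:
        return 'positive'
    return 'highly positive'

def fuse(text_score=None, image_score=None):
    # OR-style decision: use whichever modality is present; when both are
    # present, average their polarities (a simplification of the model).
    present = [s for s in (text_score, image_score) if s is not None]
    if not present:
        return 'neutral'
    return to_label(sum(present) / len(present))
```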

96 citations

BookDOI
01 Jan 2018
TL;DR: In this paper, the authors give some illustrative insights into the spectrum of methods and model types from Computational Linguistics that one could in principle apply in the analysis of literary texts.
Abstract: In its first part, this article gives some illustrative insights into the spectrum of methods and model types from Computational Linguistics that one could in principle apply in the analysis of literary texts. The idea is to indicate the considerable potential that lies in a targeted refinement and extension of the analysis procedures, as they have been typically developed for newspaper texts and other everyday texts. The second part is a personal assessment of some key challenges for the integration of working practices from Computational Linguistics and Literary Studies, which ultimately leads to a plea for an approach that derives the validity of model-based empirical text analysis from the annotation of reference corpus data. This approach should make it possible, in perspective, to refine modeling techniques from Computational Linguistics in such a way that even complex hypotheses from Literary Theory can be addressed with differential, data-based experiments, which one should ideally be able to integrate into a hermeneutic argumentation.

39 citations

Journal ArticleDOI
TL;DR: This article proposes to revisit this authorship attribution problem by considering two effective methods (Burrows' Delta, Labbé's intertextual distance); a hierarchical clustering is applied showing that four clusters can be derived.
Abstract: The name Paul appears in 13 epistles, but is he the real author? According to different biblical scholars, the number of letters really attributed to Paul varies from 4 to 13, with a majority agreeing on seven. This article proposes to revisit this authorship attribution problem by considering two effective methods (Burrows' Delta, Labbé's intertextual distance). Based on these results, a hierarchical clustering is then applied showing that four clusters can be derived, namely: {Colossians‐Ephesians}, {1 and 2 Thessalonians}, {Titus, 1 and 2 Timothy}, and {Romans, Galatians, 1 and 2 Corinthians}. Moreover, a verification method based on the impostors' strategy indicates clearly that the group {Colossians‐Ephesians} is written by the same author who seems not to be Paul. The same conclusion can be found for the cluster {Titus, 1 and 2 Timothy}. The Letter to Philemon stays as a singleton, without any close stylistic relationship with the other epistles. Finally, a group of four letters {Romans, Galatians, 1 and 2 Corinthians} is certainly written by the same author (Paul), but the verification protocol also indicates that 2 Corinthians is related to 1 Thessalonians, rendering a clear and simple interpretation difficult.
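The clustering step of such a study can be sketched as follows, assuming a precomputed matrix of pairwise stylistic distances (e.g. Burrows's Delta or Labbé's intertextual distance between the epistles); the function and parameter names are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_texts(distance_matrix, labels, n_clusters=4):
    """Agglomerative clustering of texts from a symmetric (n, n) matrix of
    pairwise stylistic distances; returns {cluster id: [text labels]}."""
    condensed = squareform(np.asarray(distance_matrix), checks=False)
    tree = linkage(condensed, method='average')
    assignments = fcluster(tree, t=n_clusters, criterion='maxclust')
    clusters = {}
    for label, cluster_id in zip(labels, assignments):
        clusters.setdefault(cluster_id, []).append(label)
    return clusters
```

Cutting the resulting dendrogram at four clusters would yield a partition comparable to the four groups of epistles reported above.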

17 citations


Cites methods from "Understanding and explaining Delta ..."

  • ...\( \sum_{i=1}^{m} \max(rtf_{iA}, rtf_{iB}) \) (1) With the Burrows’ Delta model, the relative term frequency \( rtf_{iA} \) of each term \( t_i \) in Text A is computed, as well as the mean (\( mean_i \)) and standard deviation (\( s_i \)) of that term over all texts belonging to the corpus.... (A code sketch of this computation follows this list.)

    [...]

  • ...As well-known strategies, one can mention Burrows’ Delta (Burrows, 2002; Evert et al., 2017) using the top m most frequent word-tokens (with m = 40 to 1,000), the Kullback–Leibler divergence (Zhao & Zobel, 2007) using a predefined set of 363 English words, or Labbé’s method (Labbé, 2014) based on…...

    [...]

  • ...This article proposes to revisit this authorship attribution problem by considering two effective methods (Burrows’ Delta, Labbé’s intertextual distance)....

    [...]

  • ...In this article, two computer-based authorship methods (Burrows’ Delta, Burrows, 2002), and intertextual distance (Labbé, 2014), have been applied....

    [...]

  • ...As well-known strategies, one can mention Burrows’ Delta (Burrows, 2002; Evert et al., 2017) using the top m most frequent word-tokens (with m = 40 to 1,000), the Kullback–Leibler divergence (Zhao & Zobel, 2007) using a predefined set of 363 English words, or Labbé’s method (Labbé, 2014) based on the entire vocabulary and opting for a variant of the Tanimoto distance, an approach found effective for Authorship Attribution (AA; Kocher & Savoy, 2017b)....

    [...]
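The equation quoted in the first snippet above, labelled (1), is the denominator of the distance used there: a sum of per-term maxima of relative term frequencies, which the last snippet describes as a variant of the Tanimoto distance. The sketch below is one plausible reading of that formula under the Tanimoto/Soergel interpretation; it is not the cited paper's code, and the function names are illustrative.

```python
from collections import Counter

def relative_term_frequencies(tokens):
    # rtf_i: occurrences of term t_i divided by the total text length.
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

def tanimoto_style_distance(tokens_a, tokens_b):
    """Sum of absolute rtf differences divided by the sum of per-term
    maxima, i.e. the denominator quoted in Eq. (1); an interpretation of
    the snippet, not the cited paper's exact implementation."""
    rtf_a = relative_term_frequencies(tokens_a)
    rtf_b = relative_term_frequencies(tokens_b)
    vocabulary = set(rtf_a) | set(rtf_b)
    numerator = sum(abs(rtf_a.get(t, 0.0) - rtf_b.get(t, 0.0)) for t in vocabulary)
    denominator = sum(max(rtf_a.get(t, 0.0), rtf_b.get(t, 0.0)) for t in vocabulary)
    return numerator / denominator if denominator else 0.0
```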

References
Journal ArticleDOI
TL;DR: In this article, the authors discuss the problem of estimating the sampling distribution of a pre-specified random variable R(X, F) on the basis of the observed data x.
Abstract: We discuss the following problem: given a random sample \( X = (X_1, X_2, \ldots, X_n) \) from an unknown probability distribution F, estimate the sampling distribution of some prespecified random variable R(X, F) on the basis of the observed data x. (Standard jackknife theory gives an approximate mean and variance in the case \( R(X, F) = \theta(\hat{F}) - \theta(F) \), θ some parameter of interest.) A general method, called the “bootstrap”, is introduced, and shown to work satisfactorily on a variety of estimation problems. The jackknife is shown to be a linear approximation method for the bootstrap. The exposition proceeds by a series of examples: variance of the sample median, error rates in a linear discriminant analysis, ratio estimation, estimating regression parameters, etc.
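As a concrete illustration of the procedure summarized in this abstract, a minimal bootstrap estimate of the variability of the sample median (one of the examples the paper works through) might look as follows; the data and number of resamples are arbitrary.

```python
import numpy as np

def bootstrap_median_se(x, n_resamples=2000, seed=0):
    """Estimate the standard error of the sample median by repeatedly
    resampling the observed data x with replacement (Efron's bootstrap)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    medians = np.array([
        np.median(rng.choice(x, size=len(x), replace=True))
        for _ in range(n_resamples)
    ])
    return medians.std(ddof=1)

# Example: standard error of the median of a small sample.
print(bootstrap_median_se([2.1, 3.4, 1.8, 5.0, 4.2, 2.7, 3.9]))
```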

14,483 citations

Book
01 Jan 1974
TL;DR: This fourth edition of the highly successful Cluster Analysis represents a thorough revision of the third edition and covers new and developing areas such as classification likelihood and neural networks for clustering.
Abstract: Cluster analysis comprises a range of methods for classifying multivariate data into subgroups. By organising multivariate data into such subgroups, clustering can help reveal the characteristics of any structure or patterns present. These techniques are applicable in a wide range of areas such as medicine, psychology and market research. This fourth edition of the highly successful Cluster Analysis represents a thorough revision of the third edition and covers new and developing areas such as classification likelihood and neural networks for clustering. Real life examples are used throughout to demonstrate the application of the theory, and figures are used extensively to illustrate graphical techniques. The book is comprehensive yet relatively non-mathematical, focusing on the practical aspects of cluster analysis.

9,857 citations

Book ChapterDOI
01 Jan 2008

6,615 citations


"Understanding and explaining Delta ..." refers methods in this paper

  • ...For example, bootstrapping approaches (Efron 1979) cannot easily be applied because the clustering quality is not based on individual measurements for the texts in the sample but rather on the sample as a whole; permutation tests (Hunter & McCoy 2004) can only be used to show that a clustering is significantly better than chance, which is entirely obvious given the excellent ARI in our experiments; and calculating p-values for clusters (Suzuki & Shimodaira 2006) assumes that features are independent and identically distributed, which is clearly not the case for language data due to Zipf’s law....

    [...]


Journal ArticleDOI
TL;DR: Pvclust is an add-on package for a statistical software R to assess the uncertainty in hierarchical cluster analysis to perform the bootstrap analysis of clustering, which has been popular in phylogenetic analysis.
Abstract: Pvclust is an add-on package for the statistical software R that assesses the uncertainty in hierarchical cluster analysis. Pvclust can be used easily for general statistical problems, such as DNA microarray analysis, to perform the bootstrap analysis of clustering, which has been popular in phylogenetic analysis. Pvclust calculates probability values (p-values) for each cluster using bootstrap resampling techniques. Two types of p-values are available: the approximately unbiased (AU) p-value and the bootstrap probability (BP) value. Multiscale bootstrap resampling is used for the calculation of the AU p-value, which has less bias than the BP value calculated by ordinary bootstrap resampling. In addition, the computation time can be greatly decreased with the parallel computing option. Availability: The program is freely distributed under the GNU General Public License (GPL) and can be installed directly from CRAN (http://cran.r-project.org/), the official R package archive. The instructions and program source code are available at http://www.is.titech.ac.jp/~shimo/prog/pvclust Contact: ryota.suzuki@is.titech.ac.jp

2,155 citations


"Understanding and explaining Delta ..." refers methods in this paper

  • ...…better than chance, which is entirely obvious given the excellent ARI in our experiments; and calculating p-values for clusters (Suzuki & Shimodaira 2006) assumes that features are independent and identically distributed, which is clearly not the case for language data due to…...

    [...]

Journal IssueDOI
TL;DR: A survey of recent advances of the automated approaches to attributing authorship is presented, examining their characteristics for both text representation and text classification.
Abstract: Authorship attribution supported by statistical or computational methods has a long history starting from the 19th century and is marked by the seminal study of Mosteller and Wallace (1964) on the authorship of the disputed “Federalist Papers.” During the last decade, this scientific field has been developed substantially, taking advantage of research advances in areas such as machine learning, information retrieval, and natural language processing. The plethora of available electronic texts (e.g., e-mail messages, online forum messages, blogs, source code, etc.) indicates a wide variety of applications of this technology, provided it is able to handle short and noisy text from multiple candidate authors. In this article, a survey of recent advances of the automated approaches to attributing authorship is presented, examining their characteristics for both text representation and text classification. The focus of this survey is on computational requirements and settings rather than on linguistic or literary issues. We also discuss evaluation methodologies and criteria for authorship attribution studies and list open questions that will attract future work in this area. © 2009 Wiley Periodicals, Inc.

1,186 citations

Trending Questions (1)
What should be measured in authorship attribution?

The article discusses the importance of feature vector normalization and the profile of deviation across frequent words in authorship attribution.