Open Access Journal Article (DOI)

Learning Stylometric Representations for Authorship Analysis

TL;DR
This article proposes incorporating different categories of linguistic features into distributed word representations in order to jointly learn writing-style representations from unlabeled texts for AA, allowing topical, lexical, syntactic, and character-level feature vectors of each document to be extracted as stylometric features.
Abstract
Authorship analysis (AA) is the study of unveiling the hidden properties of authors from textual data. It extracts an author’s identity and sociolinguistic characteristics based on the reflected writing styles in the text. The process is essential for various areas, such as cybercrime investigation, psycholinguistics, political socialization, etc. However, most of the previous techniques critically depend on the manual feature engineering process. Consequently, the choice of feature set has been shown to be scenario- or dataset-dependent. In this paper, to mimic the human sentence composition process using a neural network approach, we propose to incorporate different categories of linguistic features into distributed representation of words in order to learn simultaneously the writing style representations based on unlabeled texts for AA. In particular, the proposed models allow topical, lexical, syntactical, and character-level feature vectors of each document to be extracted as stylometrics. We evaluate the performance of our approach on the problems of authorship characterization, authorship identification, and authorship verification with the Twitter, blog, review, novel, and essay datasets. The experiments suggest that our proposed text representation outperforms the static stylometrics, dynamic n-grams, latent Dirichlet allocation, latent semantic analysis, the distributed memory model of paragraph vectors, the distributed bag-of-words version of paragraph vectors, word2vec representations, and other baselines.
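The paper's models learn the stylometric views jointly with a neural network; as a rough, hand-crafted illustration of the kinds of feature categories involved (fixed stand-in features chosen only for clarity, not the paper's learned representations), a minimal sketch:

```python
from collections import Counter

# Hand-crafted sketch of multi-view stylometric features. The paper's
# models *learn* these views jointly from unlabeled text; the features
# below are fixed stand-ins illustrating the lexical and character-level
# categories of writing-style markers.

FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is"]
CHAR_BIGRAMS = ["th", "he", "in", "er"]  # illustrative subset

def lexical_view(tokens):
    """Relative frequencies of common function words (lexical style)."""
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def character_view(text):
    """Frequencies of selected character bigrams (sub-word habits)."""
    grams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = max(sum(grams.values()), 1)
    return [grams[g] / total for g in CHAR_BIGRAMS]

def stylometric_vector(text):
    """Concatenate the views into one document representation."""
    lowered = text.lower()
    return lexical_view(lowered.split()) + character_view(lowered)

vec = stylometric_vector("The cat sat on the mat and the dog barked.")
```

A syntactic view (e.g. part-of-speech n-gram frequencies) would be concatenated the same way; the paper instead learns all views simultaneously from the word representations.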


Citations
Book Chapter (DOI)

Cross-Domain Authorship Attribution Using Pre-trained Language Models

TL;DR: This paper modifies a successful authorship verification approach based on a multi-headed neural network language model, combines it with pre-trained language models, and demonstrates the crucial effect of the normalization corpus in cross-domain attribution.
Proceedings Article (DOI)

Explainable Authorship Verification in Social Media via Attention-based Similarity Learning

TL;DR: In this article, a hierarchical Siamese neural network is proposed to learn neural features and to visualize the decision-making process of authorship verification for short and topically varied social media texts; it outperforms state-of-the-art approaches built on stylometric features.
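The attention-based architecture itself is beyond a short sketch, but the core Siamese idea, namely one shared encoder applied to both texts followed by a similarity score and a threshold decision, can be illustrated with a stand-in character-trigram encoder (an assumption for illustration, not the paper's network):

```python
import math
from collections import Counter

# Stand-in for the paper's hierarchical attention encoder: character
# trigram counts. Only the *Siamese* structure is illustrated here,
# i.e. the same encoder applied to both sides, then a similarity score.

def encode(text):
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a, b):
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def same_author(text_a, text_b, threshold=0.5):
    """Verification decision: similarity above a tuned threshold."""
    return cosine(encode(text_a), encode(text_b)) >= threshold

score = cosine(encode("cant even rn lol"), encode("lol cant even rn"))
```

In the neural version, `encode` is a learned network with weights shared between the two branches, which is what makes the comparison explainable via its attention weights.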
Journal Article (DOI)

Can software improve marker accuracy at detecting contract cheating? A pilot study of the Turnitin authorship investigate alpha

TL;DR: This pilot study trials an early alpha version of Turnitin’s Authorship Investigate tool, which compares students’ assignment submissions against their previous work, and suggests that software may be an effective component of institutional strategies to address contract cheating.
Journal Article (DOI)

Language models and fusion for authorship attribution

TL;DR: It is found that although language models based on POS tags are competitive in only one of the corpora (movie reviews), they generally provide efficiency benefits and robustness against data sparsity, and they prove to be essential, effective components in fusion.
Journal Article (DOI)

Improving author verification based on topic modeling

TL;DR: The comparison to state-of-the-art methods demonstrates the great potential of the approaches presented in this study and shows that, even when genre-agnostic external documents are used, the proposed extrinsic models are very competitive.
References
Journal Article (DOI)

Deep learning

TL;DR: Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.
Proceedings Article (DOI)

Glove: Global Vectors for Word Representation

TL;DR: A new global log-bilinear regression model is proposed that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
Posted Content

Efficient Estimation of Word Representations in Vector Space

TL;DR: This paper proposes two novel model architectures for computing continuous vector representations of words from very large data sets; the quality of these representations is measured in a word similarity task, and the results are compared to the previously best-performing techniques based on different types of neural networks.
Journal Article (DOI)

An introduction to ROC analysis

TL;DR: The purpose of this article is to serve as an introduction to ROC graphs and as a guide for using them in research.
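The central construction the article introduces is simple enough to sketch: sort instances by classifier score, sweep a decision threshold from high to low, and trace the resulting (false positive rate, true positive rate) pairs; AUC is then the trapezoidal area under that curve. A minimal version (score ties get no special handling here):

```python
def roc_points(labels, scores):
    """ROC curve as (FPR, TPR) pairs, sweeping the threshold from the
    highest score down. labels are 1 (positive) / 0 (negative)."""
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1  # lowering the threshold admitted a true positive
        else:
            fp += 1  # ... or a false positive
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve via the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

pts = roc_points([1, 1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6, 0.2])
```

AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative, which is why it is a useful threshold-free metric for verification tasks like those in the main paper.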