Open Access Journal Article (DOI)

Learning Stylometric Representations for Authorship Analysis

TL;DR
This article proposes incorporating different categories of linguistic features into distributed word representations in order to jointly learn writing-style representations from unlabeled texts for AA, allowing topical, lexical, syntactic, and character-level feature vectors of each document to be extracted as stylometric features.
Abstract
Authorship analysis (AA) is the study of unveiling the hidden properties of authors from textual data. It extracts an author’s identity and sociolinguistic characteristics based on the reflected writing styles in the text. The process is essential for various areas, such as cybercrime investigation, psycholinguistics, political socialization, etc. However, most of the previous techniques critically depend on the manual feature engineering process. Consequently, the choice of feature set has been shown to be scenario- or dataset-dependent. In this paper, to mimic the human sentence composition process using a neural network approach, we propose to incorporate different categories of linguistic features into distributed representation of words in order to learn simultaneously the writing style representations based on unlabeled texts for AA. In particular, the proposed models allow topical, lexical, syntactical, and character-level feature vectors of each document to be extracted as stylometrics. We evaluate the performance of our approach on the problems of authorship characterization, authorship identification, and authorship verification with the Twitter, blog, review, novel, and essay datasets. The experiments suggest that our proposed text representation outperforms the static stylometrics, dynamic n-grams, latent Dirichlet allocation, latent semantic analysis, the distributed memory model of paragraph vectors, the distributed bag-of-words version of paragraph vectors, word2vec representations, and other baselines.
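The paper's models learn the stylometric views jointly with a neural network; as a rough, hand-crafted illustration of the kinds of feature categories involved (fixed stand-in features chosen only for clarity, not the paper's learned representations), a minimal sketch:

```python
from collections import Counter

# Hand-crafted sketch of multi-view stylometric features. The paper's
# models *learn* these views jointly from unlabeled text; the features
# below are fixed stand-ins illustrating the lexical and character-level
# categories of writing-style markers.

FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is"]
CHAR_BIGRAMS = ["th", "he", "in", "er"]  # illustrative subset

def lexical_view(tokens):
    """Relative frequencies of common function words (lexical style)."""
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def character_view(text):
    """Frequencies of selected character bigrams (sub-word habits)."""
    grams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = max(sum(grams.values()), 1)
    return [grams[g] / total for g in CHAR_BIGRAMS]

def stylometric_vector(text):
    """Concatenate the views into one document representation."""
    lowered = text.lower()
    return lexical_view(lowered.split()) + character_view(lowered)

vec = stylometric_vector("The cat sat on the mat and the dog barked.")
```

A syntactic view (e.g. part-of-speech n-gram frequencies) would be concatenated the same way; the paper instead learns all views simultaneously from the word representations.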


Citations
Book Chapter (DOI)

Cross-Domain Authorship Attribution Using Pre-trained Language Models

TL;DR: This paper modifies a successful authorship verification approach based on a multi-headed neural network language model, combines it with pre-trained language models, and demonstrates the crucial effect of the normalization corpus in cross-domain attribution.
Proceedings Article (DOI)

Explainable Authorship Verification in Social Media via Attention-based Similarity Learning

TL;DR: In this article, a hierarchical Siamese neural network is proposed to learn neural features and to visualize the decision-making process of authorship verification for short and topically varied social media texts; it outperforms state-of-the-art approaches built on stylometric features.
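The attention-based architecture itself is beyond a short sketch, but the core Siamese idea, namely one shared encoder applied to both texts followed by a similarity score and a threshold decision, can be illustrated with a stand-in character-trigram encoder (an assumption for illustration, not the paper's network):

```python
import math
from collections import Counter

# Stand-in for the paper's hierarchical attention encoder: character
# trigram counts. Only the *Siamese* structure is illustrated here,
# i.e. the same encoder applied to both sides, then a similarity score.

def encode(text):
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a, b):
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def same_author(text_a, text_b, threshold=0.5):
    """Verification decision: similarity above a tuned threshold."""
    return cosine(encode(text_a), encode(text_b)) >= threshold

score = cosine(encode("cant even rn lol"), encode("lol cant even rn"))
```

In the neural version, `encode` is a learned network with weights shared between the two branches, which is what makes the comparison explainable via its attention weights.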
Journal Article (DOI)

Can software improve marker accuracy at detecting contract cheating? A pilot study of the Turnitin authorship investigate alpha

TL;DR: This pilot study trials an early alpha version of Turnitin’s Authorship Investigate tool, which compares students’ assignment submissions against their previous work, and suggests that software may be an effective component of institutional strategies to address contract cheating.
Journal Article (DOI)

Language models and fusion for authorship attribution

TL;DR: It is found that although language models based on POS tags are competitive in only one of the corpora (movie reviews), they generally provide efficiency benefits and robustness against data sparsity, and they prove to be essential, effective components in fusion.
Journal Article (DOI)

Improving author verification based on topic modeling

TL;DR: The comparison to state-of-the-art methods demonstrates the great potential of the approaches presented in this study and shows that, even when genre-agnostic external documents are used, the proposed extrinsic models are very competitive.
References
Journal Article (DOI)

Deep learning

TL;DR: Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.
Proceedings Article (DOI)

Glove: Global Vectors for Word Representation

TL;DR: A new global log-bilinear regression model is proposed that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
Posted Content

Efficient Estimation of Word Representations in Vector Space

TL;DR: This paper proposes two novel model architectures for computing continuous vector representations of words from very large data sets; the quality of these representations is measured in a word similarity task, and the results are compared to the previously best-performing techniques based on different types of neural networks.
Journal Article (DOI)

An introduction to ROC analysis

TL;DR: The purpose of this article is to serve as an introduction to ROC graphs and as a guide for using them in research.
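The central construction the article introduces is simple enough to sketch: sort instances by classifier score, sweep a decision threshold from high to low, and trace the resulting (false positive rate, true positive rate) pairs; AUC is then the trapezoidal area under that curve. A minimal version (score ties get no special handling here):

```python
def roc_points(labels, scores):
    """ROC curve as (FPR, TPR) pairs, sweeping the threshold from the
    highest score down. labels are 1 (positive) / 0 (negative)."""
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1  # lowering the threshold admitted a true positive
        else:
            fp += 1  # ... or a false positive
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve via the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

pts = roc_points([1, 1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6, 0.2])
```

AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative, which is why it is a useful threshold-free metric for verification tasks like those in the main paper.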