N-Gram feature selection for authorship identification

doi:10.1007/11861461_10

Book Chapter

John Houvardas, +1 more

- pp 77-86

TLDR

The authors propose a variable-length character n-gram approach to authorship identification, inspired by previous work on selecting variable-length word sequences, and evaluate it on a subset of the new Reuters corpus consisting of texts on the same topic by 50 different authors.

Abstract:

Automatic authorship identification offers a valuable tool for supporting crime investigation and security. It can be seen as a multi-class, single-label text categorization task. Character n-grams are a very successful approach to representing text for stylistic purposes, since they are able to capture nuances at the lexical, syntactic, and structural levels. So far, character n-grams of fixed length have been used for authorship identification. In this paper, we propose a variable-length n-gram approach inspired by previous work for selecting variable-length word sequences. Using a subset of the new Reuters corpus, consisting of texts on the same topic by 50 different authors, we show that the proposed approach is at least as effective as information gain for selecting the most significant n-grams, although the feature sets produced by the two methods have few common members. Moreover, we explore the significance of digits for distinguishing between authors, showing that an increase in performance can be achieved using simple text pre-processing.
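The ingredients named in the abstract (character n-grams, information gain as a selection criterion, and a simple pre-processing step for digits) can be illustrated with a minimal Python sketch. This is not the authors' variable-length selection method; it only shows the information-gain baseline they compare against. The n-gram lengths (3, 4, 5), the `@` placeholder for digits, the sample texts, and all function names are assumptions made for illustration, since the abstract does not specify them.

```python
import math
import re
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def mask_digits(text, symbol="@"):
    """Replace every digit with one placeholder symbol (assumed pre-processing)."""
    return re.sub(r"\d", symbol, text)


def document_features(text, lengths=(3, 4, 5)):
    """Set of character n-grams of several lengths (lengths are an assumption)."""
    masked = mask_digits(text)
    return {g for n in lengths for g in char_ngrams(masked, n)}


def entropy(counts):
    """Shannon entropy (in bits) of a collection of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(docs, labels, feature):
    """IG of the binary event 'feature occurs in the document' for the author label."""
    with_f = Counter(l for d, l in zip(docs, labels) if feature in d)
    without_f = Counter(l for d, l in zip(docs, labels) if feature not in d)
    h_cond = 0.0
    for part in (with_f, without_f):
        size = sum(part.values())
        if size:  # skip an empty partition
            h_cond += size / len(docs) * entropy(part.values())
    return entropy(Counter(labels).values()) - h_cond


def select_top_k(docs, labels, k):
    """Rank every n-gram by information gain and keep the k best (ties by name)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda f: (-information_gain(docs, labels, f), f))
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical toy texts; the paper itself uses a Reuters corpus subset.
    texts = ["Shares rose 12% in 2005.", "Shares fell 7% in 1999.",
             "Profits, however, were flat.", "Profits, however, grew."]
    authors = ["A", "A", "B", "B"]
    docs = [document_features(t) for t in texts]
    print(select_top_k(docs, authors, 5))
```

Masking digits maps strings such as `2005` and `1999` onto the same pattern, so n-grams reflect how an author formats numbers rather than which numbers appear in a particular text.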

Citations

Journal Article

Fake News Detection on Social Media: A Data Mining Perspective

Kai Shu, +4 more

- 01 Sep 2017 -

Sigkdd Explorations

TL;DR: This survey presents a comprehensive review of detecting fake news on social media, including fake news characterizations based on psychology and social theories, existing algorithms from a data mining perspective, evaluation metrics, and representative datasets.

Journal Issue

A survey of modern authorship attribution methods

Efstathios Stamatatos

- 01 Mar 2009 -

Journal of the Association for Informati...

TL;DR: A survey of recent advances of the automated approaches to attributing authorship is presented, examining their characteristics for both text representation and text classification.

Posted Content

Fake News Detection on Social Media: A Data Mining Perspective

Kai Shu, +4 more

- 07 Aug 2017 -

arXiv: Social and Information Networks

TL;DR: This survey presents a comprehensive review of detecting fake news on social media, including fake news characterizations based on psychology and social theories, existing algorithms from a data mining perspective, evaluation metrics and representative datasets, and future research directions for fake news detection on social media.

Journal Issue

Computational methods in authorship attribution

Moshe Koppel, +2 more

- 01 Jan 2009 -

Journal of the Association for Informati...

TL;DR: Three scenarios are considered for which solutions to the basic attribution problem are inadequate; for each one, it is shown how machine learning methods can be adapted to handle the special challenges of that variant.

Proceedings Article

Improving Gender Classification of Blog Authors

Arjun Mukherjee, +1 more

TL;DR: Empirical evaluation using a real-life blog data set shows that these two techniques improve the classification accuracy of the current state-of-the-art methods significantly.

References

Book

Foundations of Statistical Natural Language Processing

Christopher D. Manning, +1 more

TL;DR: This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear and provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations.

Book Chapter

Text Categorization with Support Vector Machines: Learning with Many Relevant Features

Thorsten Joachims

TL;DR: This paper explores the use of Support Vector Machines for learning text classifiers from examples and analyzes the particular properties of learning with text data and identifies why SVMs are appropriate for this task.

Journal Article

Machine learning in automated text categorization

Fabrizio Sebastiani

- 01 Mar 2002 -

ACM Computing Surveys

TL;DR: This survey discusses the main approaches to text categorization that fall within the machine learning paradigm and discusses in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.

Proceedings Article

A Comparative Study on Feature Selection in Text Categorization

Yiming Yang, +1 more

TL;DR: This paper finds strong correlations between the DF, IG, and CHI values of a term and suggests that DF thresholding, the simplest method with the lowest computational cost, can be reliably used instead of IG or CHI when computing these measures is too expensive.

Journal Article

Word association norms, mutual information, and lexicography

Kenneth Church, +1 more

- 01 Mar 1990 -

Computational Linguistics

TL;DR: The proposed measure, the association ratio, estimates word association norms directly from computer readable corpora, making it possible to estimate norms for tens of thousands of words.

Related Papers (5)

A survey of modern authorship attribution methods

N-gram-based author profiles for authorship attribution

A framework for authorship identification of online messages: Writing-style features and classification techniques

Authorship Attribution

Applying authorship analysis to extremist-group Web forum messages