Book ChapterDOI
N-Gram feature selection for authorship identification
John Houvardas,Efstathios Stamatatos +1 more
- pp 77-86
Reads0
Chats0
TLDR
The authors proposed a variable-length n-gram approach inspired by previous work for selecting variable length word sequences for authorship identification, using a subset of the new Reuters corpus, consisting of texts on the same topic by 50 different authors.Abstract:
Automatic authorship identification offers a valuable tool for supporting crime investigation and security. It can be seen as a multi-class, single-label text categorization task. Character n-grams are a very successful approach to represent text for stylistic purposes since they are able to capture nuances in lexical, syntactical, and structural level. So far, character n-grams of fixed length have been used for authorship identification. In this paper, we propose a variable-length n-gram approach inspired by previous work for selecting variable-length word sequences. Using a subset of the new Reuters corpus, consisting of texts on the same topic by 50 different authors, we show that the proposed approach is at least as effective as information gain for selecting the most significant n-grams although the feature sets produced by the two methods have few common members. Moreover, we explore the significance of digits for distinguishing between authors showing that an increase in performance can be achieved using simple text pre-processing.read more
Citations
More filters
Journal ArticleDOI
Fake News Detection on Social Media: A Data Mining Perspective
TL;DR: Wang et al. as discussed by the authors presented a comprehensive review of detecting fake news on social media, including fake news characterizations on psychology and social theories, existing algorithms from a data mining perspective, evaluation metrics and representative datasets.
Journal IssueDOI
A survey of modern authorship attribution methods
TL;DR: A survey of recent advances of the automated approaches to attributing authorship is presented, examining their characteristics for both text representation and text classification.
Posted Content
Fake News Detection on Social Media: A Data Mining Perspective
TL;DR: This survey presents a comprehensive review of detecting fake news on social media, including fake news characterizations on psychology and social theories, existing algorithms from a data mining perspective, evaluation metrics and representative datasets, and future research directions for fake news detection on socialMedia.
Journal IssueDOI
Computational methods in authorship attribution
TL;DR: Three scenarios are considered here for which solutions to the basic attribution problem are inadequate; it is shown how machine learning methods can be adapted to handle the special challenges of that variant.
Proceedings Article
Improving Gender Classification of Blog Authors
Arjun Mukherjee,Bing Liu +1 more
TL;DR: Empirical evaluation using a real-life blog data set shows that these two techniques improve the classification accuracy of the current state-of-the-art methods significantly.
References
More filters
Book
Foundations of Statistical Natural Language Processing
TL;DR: This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear and provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations.
Book ChapterDOI
Text Categorization with Suport Vector Machines: Learning with Many Relevant Features
TL;DR: This paper explores the use of Support Vector Machines for learning text classifiers from examples and analyzes the particular properties of learning with text data and identifies why SVMs are appropriate for this task.
Journal ArticleDOI
Machine learning in automated text categorization
TL;DR: This survey discusses the main approaches to text categorization that fall within the machine learning paradigm and discusses in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
Proceedings Article
A Comparative Study on Feature Selection in Text Categorization
Yiming Yang,Jan O. Pedersen +1 more
TL;DR: This paper finds strong correlations between the DF IG and CHI values of a term and suggests that DF thresholding the simplest method with the lowest cost in computation can be reliably used instead of IG or CHI when the computation of these measures are too expensive.
Journal ArticleDOI
Word association norms, mutual information, and lexicography
Kenneth Church,Patrick Hanks +1 more
TL;DR: The proposed measure, the association ratio, estimates word association norms directly from computer readable corpora, making it possible to estimate norms for tens of thousands of words.