Book Chapter

Syntactic dependency-based n-grams as classification features

TLDR
Describes how sn-grams were applied to authorship attribution using an SVM classifier with several profile sizes, which gave better results than traditional n-gram baselines.
Abstract
In this paper we introduce the concept of syntactic n-grams (sn-grams). Sn-grams differ from traditional n-grams in which elements are considered neighbors. For sn-grams, neighbors are determined by following syntactic relations in syntactic trees rather than by taking words in the order in which they appear in the text. Dependency trees fit this idea directly, while constituency trees require some simple additional steps. Sn-grams can be applied in any NLP task where traditional n-grams are used. We describe how sn-grams were applied to authorship attribution, using an SVM classifier with several profile sizes. As baselines we used traditional n-grams of words, POS tags, and characters. The results obtained with sn-grams are better than those of the baselines.
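To make the difference in neighborhood concrete, here is a minimal sketch in Python. It is not the authors' implementation: it assumes that an sn-gram is a continuous head-to-dependent path of length n in a dependency tree, and it uses a hand-written parse rather than a real parser. The function names `linear_ngrams` and `sn_grams` and the `heads` encoding (index of each token's head, -1 for the root) are illustrative assumptions.

```python
from collections import defaultdict


def linear_ngrams(tokens, n):
    """Traditional n-grams: neighbors are adjacent words in the text."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sn_grams(tokens, heads, n):
    """Syntactic n-grams (assumed form): head-to-dependent paths of length n.

    heads[i] is the index of the head of token i, or -1 for the root.
    """
    # Map each head index to the indices of its direct dependents.
    children = defaultdict(list)
    for dep, head in enumerate(heads):
        if head >= 0:
            children[head].append(dep)

    def paths_from(node, length):
        # All downward paths that start at `node` and contain `length` nodes.
        if length == 1:
            return [(node,)]
        result = []
        for child in children[node]:
            for tail in paths_from(child, length - 1):
                result.append((node,) + tail)
        return result

    grams = []
    for start in range(len(tokens)):
        for path in paths_from(start, n):
            grams.append(tuple(tokens[i] for i in path))
    return grams


# Hand-written (hypothetical) dependency parse of "I eat red apples":
# "eat" is the root, "I" and "apples" depend on "eat", "red" on "apples".
tokens = ["I", "eat", "red", "apples"]
heads = [1, -1, 3, 1]

print(linear_ngrams(tokens, 2))
print(sn_grams(tokens, heads, 2))
print(sn_grams(tokens, heads, 3))
```

In this toy sentence the traditional bigrams include ("eat", "red"), which are adjacent in the text but not syntactically related, while the syntactic bigrams include ("eat", "apples"), which are related in the tree but not adjacent. Either kind of n-gram can then be counted per document and the most frequent ones kept as a feature profile for a classifier.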