Open Access Proceedings Article
What to do about bad language on the internet
Jacob Eisenstein
pp. 359–369
TL;DR: A critical review of the NLP community's response to the landscape of bad language is offered, along with a quantitative analysis of the lexical diversity of social media text and its relationship to other corpora.

Abstract: The rise of social media has brought computational linguistics into ever-closer contact with bad language: text that defies our expectations about vocabulary, spelling, and syntax. This paper surveys the landscape of bad language and offers a critical review of the NLP community's response, which has largely followed two paths: normalization and domain adaptation. Each approach is evaluated in the context of theoretical and empirical work on computer-mediated communication. In addition, the paper presents a quantitative analysis of the lexical diversity of social media text and its relationship to other corpora.
Citations
Proceedings Article
Synthetic Data for English Lexical Normalization: How Close Can We Get to Manually Annotated Data?
Kelly Dekker, Rob van der Goot
TL;DR: This work attempts to overcome the dependence on manually annotated data by automatically generating training data for lexical normalization: non-standardness (noise) is inserted into clean text, which is then normalized in an unsupervised setting.
Posted Content
Annotating for Hate Speech: The MaNeCo Corpus and Some Input from Critical Discourse Analysis
TL;DR: A multi-layer annotation scheme for the annotation of hate speech in corpora of Web 2.0 commentary, which is pilot-tested against a binary ±hate speech classification and appears to yield higher inter-annotator agreement.
Noise or music? Investigating the usefulness of normalisation for robust sentiment analysis on social media data
Cynthia Van Hee, Marjan Van de Kauter, Orphée De Clercq, Els Lefever, Bart Desmet, Veronique Hoste
TL;DR: This work presents an optimised sentiment classifier and investigates to what extent its performance can be enhanced by integrating SMT-based normalisation as a preprocessing step, showing the model's ability to generalise to other data genres.
Posted Content
Char2Subword: Extending the Subword Embedding Space from Pre-trained Models Using Robust Character Compositionality
Gustavo Aguilar, Bryan McCann, Tong Niu, Nazneen Fatema Rajani, Nitish Shirish Keskar, Thamar Solorio
TL;DR: This work proposes a character-based subword transformer module (char2subword) that learns the subword embedding table of pre-trained models like BERT and is robust to character-level alterations such as misspellings, word inflection, casing, and punctuation.
Posted Content
WASSUP? LOL : Characterizing Out-of-Vocabulary Words in Twitter
Suman Kalyan Maity, Chaitanya Sarda, Anshit Chaudhary, Abhijeet Patil, Shraman Kumar, Akash Mondal, Animesh Mukherjee
TL;DR: The authors propose a model for classifying out-of-vocabulary (OOV) words into at least six categories; content features prove the most discriminative, followed by lexical and context features.
References
Proceedings Article
Earthquake shakes Twitter users: real-time event detection by social sensors
TL;DR: This paper investigates the real-time interaction around events such as earthquakes on Twitter and proposes an algorithm to monitor tweets and detect a target event, producing a probabilistic spatiotemporal model that can find the center and trajectory of the event location.
Journal Article
Critical questions for big data
danah boyd, Kate Crawford
TL;DR: The authors argue that the era of Big Data has begun, with diverse groups debating the potential benefits and costs of analyzing genetic sequences, social media interactions, health records, phone logs, government records, and other digital traces left by people.
Proceedings Article
Feature-rich part-of-speech tagging with a cyclic dependency network
TL;DR: A new part-of-speech tagger is presented that demonstrates the following ideas: explicit use of both preceding and following tag contexts via a dependency network representation, broad use of lexical features, and effective use of priors in conditional log-linear models.
Book
Natural Language Processing with Python
TL;DR: This book offers a highly accessible introduction to natural language processing, the field that supports a variety of language technologies, from predictive text and email filtering to automatic summarization and translation.
Proceedings Article
Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling
TL;DR: By using simulated annealing in place of Viterbi decoding in sequence models such as HMMs, CMMs, and CRFs, it is possible to incorporate non-local structure while preserving tractable inference.