Open Access Proceedings Article

What to do about bad language on the internet

TLDR
The paper offers a critical review of the NLP community's response to the landscape of bad language and presents a quantitative analysis of the lexical diversity of social media text and its relationship to other corpora.
Abstract
The rise of social media has brought computational linguistics in ever-closer contact with bad language: text that defies our expectations about vocabulary, spelling, and syntax. This paper surveys the landscape of bad language, and offers a critical review of the NLP community’s response, which has largely followed two paths: normalization and domain adaptation. Each approach is evaluated in the context of theoretical and empirical work on computer-mediated communication. In addition, the paper presents a quantitative analysis of the lexical diversity of social media text, and its relationship to other corpora.


Citations
Proceedings Article

Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters

TL;DR: This work systematically evaluates the use of large-scale unsupervised word clustering and new lexical features to improve tagging accuracy on Twitter and achieves state-of-the-art tagging results on both Twitter and IRC POS tagging tasks.
Journal Article

Survey of the state of the art in natural language generation: core tasks, applications and evaluation

TL;DR: This article surveys the state of the art in natural language generation, offering an up-to-date synthesis of research on the core tasks in NLG and the architectures in which such tasks are organized.
Proceedings Article

BERTweet: A pre-trained language model for English Tweets

TL;DR: BERTweet is the first large-scale pre-trained language model for English Tweets; it has the same architecture as BERT-base and is trained using the RoBERTa pre-training procedure.
Journal Article

Predicting crime using Twitter and kernel density estimation

TL;DR: This article uses Twitter-specific linguistic analysis and statistical topic modeling to automatically identify discussion topics across a major city in the United States and shows that the addition of Twitter data improves crime prediction performance versus a standard approach based on kernel density estimation.
Posted Content

Language (Technology) is Power: A Critical Survey of "Bias" in NLP

TL;DR: The authors survey 146 papers analyzing "bias" in NLP systems, finding that their motivations are often vague, inconsistent, and lacking in normative reasoning, despite the fact that analyzing bias is an inherently normative process.