What to do about bad language on the internet

Open Access Proceedings Article

Jacob Eisenstein

- pp 359-369

TLDR

The paper offers a critical review of the NLP community's response to the landscape of bad language, and presents a quantitative analysis of the lexical diversity of social media text and its relationship to other corpora.

Abstract:

The rise of social media has brought computational linguistics in ever-closer contact with bad language: text that defies our expectations about vocabulary, spelling, and syntax. This paper surveys the landscape of bad language, and offers a critical review of the NLP community’s response, which has largely followed two paths: normalization and domain adaptation. Each approach is evaluated in the context of theoretical and empirical work on computer-mediated communication. In addition, the paper presents a quantitative analysis of the lexical diversity of social media text, and its relationship to other corpora.

Citations

Proceedings Article

Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters

Olutobi Owoputi, +5 more

TL;DR: This work systematically evaluates the use of large-scale unsupervised word clustering and new lexical features to improve tagging accuracy on Twitter and achieves state-of-the-art tagging results on both Twitter and IRC POS tagging tasks.

Journal Article

Survey of the state of the art in natural language generation: core tasks, applications and evaluation

Albert Gatt, +1 more

- 01 Jan 2018 -

Journal of Artificial Intelligence Resea...

TL;DR: This article surveys the state of the art in natural language generation, offering an up-to-date synthesis of research on the core tasks in NLG and the architectures in which such tasks are organized.

Proceedings Article

BERTweet: A pre-trained language model for English Tweets

Dat Quoc Nguyen, +2 more

TL;DR: BERTweet is the first large-scale pre-trained language model for English Tweets. It has the same architecture as BERT-base and is trained using the RoBERTa pre-training procedure.

Journal Article

Predicting crime using Twitter and kernel density estimation

Matthew S. Gerber

TL;DR: This article uses Twitter-specific linguistic analysis and statistical topic modeling to automatically identify discussion topics across a major city in the United States and shows that the addition of Twitter data improves crime prediction performance versus a standard approach based on kernel density estimation.

Posted Content

Language (Technology) is Power: A Critical Survey of "Bias" in NLP

Su Lin Blodgett, +3 more

- 28 May 2020 -

arXiv: Computation and Language

TL;DR: The authors survey 146 papers analyzing "bias" in NLP systems, finding that their motivations are often vague, inconsistent, and lacking in normative reasoning, despite the fact that analyzing bias is an inherently normative process.

References

Proceedings Article

Automatically Constructing a Normalisation Dictionary for Microblogs

Bo Han, +2 more

TL;DR: This paper proposes a method for constructing a dictionary of lexical variants of known words that facilitates lexical normalisation via simple string substitution and shows that a dictionary-based approach achieves state-of-the-art performance for both F-score and word error rate on a standard dataset.

Overview of the 2012 Shared Task on Parsing the Web

Slav Petrov, +1 more

TL;DR: This overview describes a shared task on parsing web text from the Google Web Treebank, in which the goal is to build a single parsing system that is robust to domain changes and can handle the noisy text commonly encountered on the web.

Proceedings Article

Automatic Domain Adaptation for Parsing

David McClosky, +2 more

TL;DR: The resulting system proposes linear combinations of parsing models trained on the source corpora and outperforms all non-oracle baselines, including the best domain-independent parsing model.

Journal Issue

Homophily in MySpace

Mike Thelwall

- 01 Feb 2009 -

Journal of the Association for Informati...

TL;DR: The authors report an exploratory study of the similarity between the reported attributes of pairs of active MySpace Friends, based upon a systematic sample of 2,567 members joining on June 18, 2007 and Friends who commented on their profile.

Proceedings Article

Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations

Sara Rosenthal, +1 more

TL;DR: The study finds that the birth dates of students who were in college when social media such as AIM, SMS text messaging, MySpace and Facebook first became popular enable accurate age prediction.

Journal of Machine Learning Research

Related Papers (5)

Named Entity Recognition in Tweets: An Experimental Study

Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments

Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Latent Dirichlet Allocation