What to do about bad language on the internet

Open Access Proceedings Article


Jacob Eisenstein

- pp 359-369


TLDR

The paper offers a critical review of the NLP community's response to bad language and presents a quantitative analysis of the lexical diversity of social media text and its relationship to other corpora.

Abstract:

The rise of social media has brought computational linguistics in ever-closer contact with bad language: text that defies our expectations about vocabulary, spelling, and syntax. This paper surveys the landscape of bad language, and offers a critical review of the NLP community’s response, which has largely followed two paths: normalization and domain adaptation. Each approach is evaluated in the context of theoretical and empirical work on computer-mediated communication. In addition, the paper presents a quantitative analysis of the lexical diversity of social media text, and its relationship to other corpora.
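As a rough illustration of what a lexical diversity comparison between corpora can look like, the sketch below computes two simple proxies: the type-token ratio of a text and the share of its tokens that are absent from a reference vocabulary. These measures, the whitespace tokenizer, the function names, and the two toy sentences are all assumptions made for illustration; the abstract does not specify the paper's own metrics or data.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def oov_rate(tokens, reference_vocab):
    """Fraction of tokens that do not appear in the reference vocabulary."""
    if not tokens:
        return 0.0
    unseen = sum(1 for token in tokens if token not in reference_vocab)
    return unseen / len(tokens)


# Toy stand-ins for a standard corpus and a social media corpus (hypothetical data).
reference_text = "the cat sat on the mat and the dog sat on the rug"
social_text = "teh cat sat on da mat lol and the dog sat 2"

reference_tokens = tokenize(reference_text)
social_tokens = tokenize(social_text)
reference_vocab = set(reference_tokens)

ttr_reference = type_token_ratio(reference_tokens)
ttr_social = type_token_ratio(social_tokens)
social_oov = oov_rate(social_tokens, reference_vocab)

print(f"type-token ratio (reference): {ttr_reference:.3f}")
print(f"type-token ratio (social):    {ttr_social:.3f}")
print(f"out-of-vocabulary rate:       {social_oov:.3f}")
```

Type-token ratio is sensitive to corpus size, so a real comparison would need equally sized samples or a length-normalized measure.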

Citations


Posted Content

When silver glitters more than gold: Bootstrapping an Italian part-of-speech tagger for Twitter

Barbara Plank, +1 more

- 09 Nov 2016 -

arXiv: Computation and Language

TL;DR: This paper describes a part-of-speech tagger for Italian Twitter data, built for the Evalita 2016 PoSTWITA shared task. The authors show that training the tagger on native Twitter data, enriched with small amounts of specifically selected gold data and additional silver-labelled data scraped from Facebook, yields better results than using large amounts of manually annotated data from a mix of genres.


Journal Article

MDLText e Indexação Semântica aplicados na Detecção de Spam nos Comentários do YouTube

Renato Moraes Silva, +3 more

- 30 Sep 2017 -

iSys

TL;DR: This paper evaluates a classification method based on the principle of the simplest description and compares its results with those of traditional online learning methods.


Proceedings Article

Efficient Named Entity Annotation through Pre-empting

Leon Derczynski, +1 more

TL;DR: A technique for reducing the amount of entity-less text examined by annotators, which is called "preempting", is demonstrated and evaluated in a crowdsourcing scenario, where it provides downstream performance improvements for the same size corpus.


Posted Content

Named Entity Recognition and Classification on Historical Documents: A Survey.

Maud Ehrmann, +4 more

- 23 Sep 2021 -

arXiv: Computation and Language

TL;DR: In this article, the authors present the array of challenges posed by historical documents to NER, inventory existing resources, describe the main approaches deployed so far, and identify key priorities for future developments.


Posted Content

Ontology Driven Disease Incidence Detection on Twitter

Mark Abraham Magumba, +1 more

- 21 Nov 2016 -

arXiv: Computation and Language

TL;DR: This work employs an ontology of disease related concepts and uses it to obtain a conceptual representation of tweets and shows that word vectors can be learned directly from the authors' concepts to achieve even better results.


References


Proceedings Article

Earthquake shakes Twitter users: real-time event detection by social sensors

Takeshi Sakaki, +2 more

TL;DR: This paper investigates the real-time interaction of events such as earthquakes on Twitter and proposes an algorithm that monitors tweets to detect a target event. It also produces a probabilistic spatiotemporal model for the target event that can find the center and the trajectory of the event location.


Journal Article

Critical questions for big data

danah boyd, +1 more

- 25 May 2012 -

Information, Communication & Society

TL;DR: The authors discuss the arrival of the era of Big Data, in which diverse groups argue about the potential benefits and costs of analyzing genetic sequences, social media interactions, health records, phone logs, government records, and other digital traces left by people.


Proceedings Article

Feature-rich part-of-speech tagging with a cyclic dependency network

Kristina Toutanova, +3 more

TL;DR: A new part-of-speech tagger is presented that demonstrates the following ideas: explicit use of both preceding and following tag contexts via a dependency network representation, broad use of lexical features, and effective use of priors in conditional loglinear models.


Book

Natural Language Processing with Python

Steven Bird, +3 more

TL;DR: This book offers a highly accessible introduction to natural language processing, the field that supports a variety of language technologies, from predictive text and email filtering to automatic summarization and translation.


Proceedings Article

Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling

Jenny Rose Finkel, +2 more

TL;DR: By using simulated annealing in place of Viterbi decoding in sequence models such as HMMs, CMMs, and CRFs, it is possible to incorporate non-local structure while preserving tractable inference.


Journal of Machine Learning Research


Related Papers (5)

Named Entity Recognition in Tweets: An Experimental Study

Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments

Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Latent Dirichlet Allocation