Open Access
Proceedings Article
What to do about bad language on the internet
Jacob Eisenstein
pp. 359–369
TL;DR: A critical review of the NLP community's response to the landscape of bad language is offered, along with a quantitative analysis of the lexical diversity of social media text and its relationship to other corpora.
Abstract:
The rise of social media has brought computational linguistics in ever-closer contact with bad language: text that defies our expectations about vocabulary, spelling, and syntax. This paper surveys the landscape of bad language, and offers a critical review of the NLP community’s response, which has largely followed two paths: normalization and domain adaptation. Each approach is evaluated in the context of theoretical and empirical work on computer-mediated communication. In addition, the paper presents a quantitative analysis of the lexical diversity of social media text, and its relationship to other corpora.
Citations
Proceedings Article (DOI)
Competition and Selection Among Conventions
TL;DR: The authors track the spread of low-level authoring conventions in the arXiv over 24 years and roughly a million posted papers. They find that interaction among co-authors over time plays a crucial role in the selection of conventions; the distinction between more and less experienced community members, and the distinction between conventions with visible versus invisible effects, are both central to the underlying processes.
Proceedings Article (DOI)
Lexical Normalization for Code-switched Data and its Effect on POS Tagging
Rob van der Goot, Özlem Çetinoğlu +1 more
TL;DR: This article proposes three normalization models specifically designed to handle code-switched data, evaluated on two language pairs, Indonesian-English and Turkish-German, along with the downstream effect of normalization on POS tagging.
Journal Article (DOI)
On the performance of phonetic algorithms in microtext normalization
TL;DR: The aim of this study is to determine the best phonetic algorithms for candidate generation in microtext normalization: those that, given a non-standard term as input, yield the smallest possible set of normalization candidates that still contains the corresponding target standard word.
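The candidate-generation idea summarized above can be sketched with a toy example. The simplified Soundex coding below and the tiny lexicon are illustrative assumptions, not the algorithms or data evaluated in the study: each lexicon word is indexed by its phonetic key, and a non-standard term retrieves as candidates every word sharing its key.

```python
from collections import defaultdict

def soundex(word):
    """Simplified Soundex: first letter kept, consonants mapped to digit
    classes, adjacent duplicate codes collapsed, padded/truncated to 4 chars."""
    codes = {"b": "1", "f": "1", "p": "1", "v": "1",
             "c": "2", "g": "2", "j": "2", "k": "2", "q": "2",
             "s": "2", "x": "2", "z": "2",
             "d": "3", "t": "3", "l": "4", "m": "5", "n": "5", "r": "6"}
    word = word.lower()
    head = word[0].upper()
    out, prev = [], codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out.append(code)
        if ch not in "hw":  # h and w do not reset the previous code
            prev = code
    return (head + "".join(out) + "000")[:4]

def build_index(lexicon):
    """Index standard words by their phonetic key."""
    index = defaultdict(set)
    for w in lexicon:
        index[soundex(w)].add(w)
    return index

def candidates(term, index):
    """All lexicon words sharing the non-standard term's phonetic key."""
    return sorted(index.get(soundex(term), set()))

lexicon = ["tomorrow", "today", "grate", "great", "you"]  # illustrative
index = build_index(lexicon)
print(candidates("tmrw", index))   # ['tomorrow']
print(candidates("grate", index))  # ['grate', 'great']
```

The study's framing corresponds to comparing such phonetic keys by how small the retrieved candidate sets are while still containing the target word.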
Proceedings Article (DOI)
USFD: Twitter NER with Drift Compensation and Linked Data
TL;DR: A pilot NER system for Twitter is described, comprising the USFD entry to the W-NUT 2015 NER shared task; the goal is to correctly label entities in a tweet dataset using an inventory of ten types.
Posted Content
Treebanking User-Generated Content: a UD Based Overview of Guidelines, Corpora and Unified Recommendations.
Manuela Sanguinetti, Lauren Cassidy, Cristina Bosco, Özlem Çetinoğlu, Alessandra Teresa Cignarella, Teresa Lynn, Ines Rehbein, Josef Ruppenhofer, Djamé Seddah, Amir Zeldes +9 more
TL;DR: The overarching goal of this article is to provide a common framework for researchers interested in developing similar resources in UD, thus promoting cross-linguistic consistency, a principle that has always been central to the spirit of UD.
References
Proceedings Article (DOI)
Earthquake shakes Twitter users: real-time event detection by social sensors
TL;DR: This paper investigates the real-time signature of events such as earthquakes on Twitter and proposes an algorithm that monitors tweets to detect a target event, together with a probabilistic spatiotemporal model that can estimate the center and trajectory of the event location.
Journal Article (DOI)
Critical questions for big data
danah boyd, Kate Crawford +1 more
TL;DR: The era of Big Data has begun: diverse groups argue about the potential benefits and costs of analyzing genetic sequences, social media interactions, health records, phone logs, government records, and other digital traces left by people.
Proceedings Article (DOI)
Feature-rich part-of-speech tagging with a cyclic dependency network
TL;DR: A new part-of-speech tagger is presented that demonstrates the following ideas: explicit use of both preceding and following tag contexts via a dependency network representation, broad use of lexical features, and effective use of priors in conditional log-linear models.
Book
Natural Language Processing with Python
TL;DR: This book offers a highly accessible introduction to natural language processing, the field that supports a variety of language technologies, from predictive text and email filtering to automatic summarization and translation.
Proceedings Article (DOI)
Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling
TL;DR: By using simulated annealing in place of Viterbi decoding in sequence models such as HMMs, CMMs, and CRFs, it is possible to incorporate non-local structure while preserving tractable inference.
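The decoding idea in that summary can be illustrated on a toy sequence-labeling problem. Everything below is an illustrative assumption rather than the paper's system: a hand-written score with one local term (a tiny lexicon preference) and one non-local term (rewarding identical labels for repeated words), optimized by simulated annealing over single-position tag flips. Exact Viterbi decoding cannot use the non-local term, because it couples arbitrarily distant positions.

```python
import math
import random

random.seed(0)

TAGS = ["PER", "ORG", "O"]

def score(words, tags):
    """Toy objective: local lexicon preference plus a non-local
    label-consistency bonus for repeated words."""
    lexicon = {"smith": "PER", "acme": "ORG"}  # illustrative
    s = 0.0
    for w, t in zip(words, tags):
        if lexicon.get(w) == t:
            s += 1.0
        elif t == "O":
            s += 0.2
    # Non-local term: repeated words should share a tag.
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if words[i] == words[j] and tags[i] == tags[j]:
                s += 0.5
    return s

def anneal(words, steps=2000, t0=2.0, cooling=0.995):
    """Simulated annealing over tag sequences: propose a single-position
    tag flip; accept uphill moves always, downhill moves with
    Boltzmann probability; track the best state seen."""
    tags = ["O"] * len(words)
    cur_s = score(words, tags)
    best, best_s = list(tags), cur_s
    temp = t0
    for _ in range(steps):
        i = random.randrange(len(words))
        proposal = list(tags)
        proposal[i] = random.choice(TAGS)
        new_s = score(words, proposal)
        if new_s >= cur_s or random.random() < math.exp((new_s - cur_s) / temp):
            tags, cur_s = proposal, new_s
            if cur_s > best_s:
                best, best_s = list(tags), cur_s
        temp *= cooling
    return best

words = ["smith", "visited", "acme", "and", "smith", "returned"]
print(anneal(words))
```

With the consistency bonus active, high-scoring solutions tag both occurrences of "smith" identically, which is the kind of non-local structure a left-to-right Viterbi pass cannot express.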