What to do about bad language on the internet
Citations
Cites background from "What to do about bad language on the internet"
...These are the result of unintentional errors, dialectal variation, conversational ellipsis, topic diversity, and creative use of language and orthography (Eisenstein, 2013)....
[...]
...of authors that heavily skews towards younger ages and minorities, with heavy usage of dialects that are different than the standard American English most often seen in NLP datasets (Eisenstein, 2013; Eisenstein et al., 2011)....
References
"What to do about bad language on th..." refers background in this paper
...This makes Twitter data less problematic from a privacy standpoint,1 far easier to obtain, and more amenable to target applications such as large-scale mining of events (Sakaki et al., 2010; Benson et al., 2011) and opinions (Sauper et al., 2011)....
[...]
...makes Twitter data less problematic from a privacy standpoint,1 far easier to obtain, and more amenable to target applications such as large-scale mining of events (Sakaki et al., 2010; Benson et al., 2011) and opinions (Sauper et al....
[...]
"What to do about bad language on th..." refers background in this paper
...In part-of-speech tagging, the accuracy of the Stanford tagger (Toutanova et al., 2003) falls from 97% on Wall Street Journal text to 85% accuracy on Twitter (Gimpel et al., 2011)....
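The 97% and 85% figures in the excerpt above are token-level tagging accuracies. A minimal sketch of how that metric is computed (the tag sequences below are hypothetical toy data, not the paper's evaluation sets):

```python
def tag_accuracy(gold, predicted):
    """Token-level accuracy: fraction of tokens whose predicted tag matches the gold tag."""
    assert len(gold) == len(predicted), "sequences must be aligned token-for-token"
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

# Hypothetical toy tag sequences (not from the paper's evaluation):
gold = ["DT", "NN", "VBZ", "JJ"]
pred = ["DT", "NN", "VBZ", "NN"]
print(tag_accuracy(gold, pred))  # 0.75
```

A single mistagged token out of four drops accuracy to 75%, which is why the twelve-point gap between newswire and Twitter text reported above is substantial.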
"What to do about bad language on th..." refers methods in this paper
...motif; the Penn Treebank data uses the gold standard tokenization; Infinite Jest and the blog data are tokenized using NLTK (Bird et al., 2009)....
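NLTK's default English tokenizer follows Penn Treebank conventions, splitting contractions and punctuation into separate tokens. A rough standard-library approximation of that behavior (a sketch only; the paper's pipeline uses NLTK itself, which handles many more cases):

```python
import re

def simple_tokenize(text):
    """Rough approximation of Treebank-style tokenization: "n't" contractions
    split off their host word, and punctuation becomes separate tokens."""
    # Split "n't" contractions off their host word (e.g. "don't" -> "do n't").
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text)
    # A token is an "n't" clitic, a run of word characters, or one punctuation mark.
    return re.findall(r"n't|\w+|[^\w\s]", text)

print(simple_tokenize("Don't stop, please."))  # ['Do', "n't", 'stop', ',', 'please', '.']
```

Gold-standard tokenization (as in the Penn Treebank data mentioned above) sidesteps this step entirely, which is one reason tokenizer choice matters when comparing corpora.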
"What to do about bad language on th..." refers background or methods in this paper
...In named entity recognition, the CoNLL-trained Stanford recognizer achieves 44% F-measure (Ritter et al., 2011), down from 86% on the CoNLL test set (Finkel et al., 2005)....
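The 44% and 86% figures in the excerpt above are balanced F-measures. As a quick reminder of the formula, with illustrative precision/recall values rather than the scores the paper reports:

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative values only; the excerpt reports the F scores directly.
print(f_measure(0.5, 0.5))  # 0.5
```

Because F1 is a harmonic mean, a recognizer must keep both precision and recall high to score well, so a 42-point drop reflects a severe degradation on Twitter text.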