
Language Identification for Creating Language-Specific Twitter Collections

TLDR
This work annotates and releases a large collection of tweets in nine languages, focusing on confusable languages that use the Cyrillic, Arabic, and Devanagari scripts. It is the first publicly available collection of LID-annotated tweets in non-Latin scripts and should become a standard evaluation set for LID systems.
Abstract
Social media services such as Twitter offer an immense volume of real-world linguistic data. We explore the use of Twitter to obtain authentic user-generated text in low-resource languages such as Nepali, Urdu, and Ukrainian. Automatic language identification (LID) can be used to extract language-specific data from Twitter, but it is unclear how well LID performs on short, informal texts in low-resource languages. We address this question by annotating and releasing a large collection of tweets in nine languages, focusing on confusable languages using the Cyrillic, Arabic, and Devanagari scripts. This is the first publicly-available collection of LID-annotated tweets in non-Latin scripts, and should become a standard evaluation set for LID systems. We also advance the state-of-the-art by evaluating new, highly-accurate LID systems, trained both on our new corpus and on standard materials only. Both types of systems achieve a huge performance improvement over the existing state-of-the-art, correctly classifying around 98% of our gold standard tweets. We provide a detailed analysis showing how the accuracy of our systems varies along certain dimensions, such as tweet length and the amount of in- and out-of-domain training data.
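The paper's own implementation is not reproduced here, but a character n-gram classifier of the general kind the abstract describes can be sketched as follows; the toy training tweets, n-gram range, and scikit-learn pipeline are illustrative assumptions, not the authors' actual system or data.

```python
# Minimal character n-gram LID sketch (assumed setup, not the paper's system).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training data: (tweet text, language code) pairs.
train_texts = [
    "नमस्ते, तपाईंलाई कस्तो छ?",   # Nepali (Devanagari script)
    "یہ ایک مثال ہے",              # Urdu (Arabic script)
    "Привіт, як справи?",          # Ukrainian (Cyrillic script)
]
train_labels = ["ne", "ur", "uk"]

# Character n-grams cope well with short, informal tweets.
lid = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 4)),
    LinearSVC(),
)
lid.fit(train_texts, train_labels)
print(lid.predict(["Доброго ранку всім"]))  # expected: ['uk']
```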

Citations
Proceedings Article

What to do about bad language on the internet

TL;DR: A critical review of the NLP community's response to the landscape of bad language is offered, along with a quantitative analysis of the lexical diversity of social media text and its relationship to other corpora.
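For readers unfamiliar with lexical-diversity measures of the kind analysed there, the type-token ratio is one simple example; the tokenizer and sample strings below are illustrative assumptions, not that paper's methodology.

```python
# Type-token ratio: distinct word types divided by total tokens.
import re

def type_token_ratio(text: str) -> float:
    tokens = re.findall(r"\w+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

print(type_token_ratio("omg this movie is sooo good lol sooo good lol"))   # 0.7
print(type_token_ratio("The film's pacing is deliberate but rewarding."))  # 1.0
```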
Proceedings Article

Measuring Post Traumatic Stress Disorder in Twitter

TL;DR: This work considers PTSD, a serious condition that affects millions worldwide, with especially high rates in military veterans. Its utility is demonstrated by examining differences in language use between PTSD and random individuals, by building classifiers to separate these two groups, and by using those classifiers to detect elevated rates of PTSD at and around U.S. military bases.
Proceedings ArticleDOI

Code Mixing: A Challenge for Language Identification in the Language of Social Media

TL;DR: A new dataset is described, which contains Facebook posts and comments that exhibit code mixing between Bengali, English and Hindi, and it is found that the dictionary-based approach is surpassed by supervised classification and sequence labelling, and that it is important to take contextual clues into consideration.
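A dictionary-based word-level baseline of the kind that work reports being surpassed can be sketched as follows; the tiny word lists are illustrative assumptions, and the deliberate absence of contextual clues is exactly the weakness the cited paper points out.

```python
# Dictionary-based word-level language tagging for code-mixed text (sketch).
ENGLISH = {"the", "is", "movie", "awesome"}
HINDI = {"bahut", "accha", "hai", "yaar"}
BENGALI = {"khub", "bhalo", "chilo"}

def tag_tokens(tokens):
    tags = []
    for token in tokens:
        word = token.lower()
        if word in ENGLISH:
            tags.append("en")
        elif word in HINDI:
            tags.append("hi")
        elif word in BENGALI:
            tags.append("bn")
        else:
            tags.append("unk")  # no context is consulted for unknown words
    return tags

print(tag_tokens("movie bahut accha hai yaar".split()))
# ['en', 'hi', 'hi', 'hi', 'hi']
```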
Proceedings Article

Exploring Demographic Language Variations to Improve Multilingual Sentiment Analysis in Social Media

TL;DR: It is shown that gender differences in subjective language can effectively be used to improve sentiment analysis, and in particular, polarity classification for Spanish and Russian.
Patent

Systems and methods for multi-user multi-lingual communications

TL;DR: In this article, the authors describe a system and a method for assessing the accuracy of translations between two or more languages, in which users are rewarded for submitting corrections for inaccurate or erroneous translations.
References
Journal Article

LIBLINEAR: A Library for Large Linear Classification

TL;DR: LIBLINEAR is an open source library for large-scale linear classification that supports logistic regression and linear support vector machines and provides easy-to-use command-line tools and library calls for users and developers.
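LIBLINEAR can be used from its command-line tools or through wrappers; for instance, scikit-learn's LogisticRegression with solver="liblinear" is backed by the library. The synthetic data below is purely illustrative.

```python
# Calling LIBLINEAR via scikit-learn's liblinear solver (illustrative data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Generate a toy linear classification problem.
X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

# solver="liblinear" dispatches the optimization to LIBLINEAR.
clf = LogisticRegression(solver="liblinear", C=1.0)
clf.fit(X, y)
print(f"training accuracy: {clf.score(X, y):.3f}")
```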
Proceedings ArticleDOI

Cheap and Fast -- But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks

TL;DR: This work explores the use of Amazon's Mechanical Turk system, a significantly cheaper and faster method for collecting annotations from a broad base of paid non-expert contributors over the Web, and proposes a technique for bias correction that significantly improves annotation quality on two tasks.
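The paper's bias-correction technique is more involved, but the general idea of combining redundant non-expert labels while weighting annotators by an estimate of their quality can be sketched as follows; the annotations and per-annotator accuracies are hypothetical, and this is only loosely inspired by the cited method.

```python
# Simplified aggregation of redundant non-expert labels with annotator weights.
from collections import defaultdict

# Hypothetical annotations: item -> list of (annotator, label) pairs.
annotations = {
    "item1": [("a1", "pos"), ("a2", "neg"), ("a3", "pos")],
    "item2": [("a1", "neg"), ("a2", "neg"), ("a3", "pos")],
}
# Hypothetical per-annotator accuracies measured on a small gold set.
annotator_weight = {"a1": 0.9, "a2": 0.6, "a3": 0.7}

def aggregate(labels):
    scores = defaultdict(float)
    for annotator, label in labels:
        scores[label] += annotator_weight.get(annotator, 0.5)
    return max(scores, key=scores.get)

print({item: aggregate(labels) for item, labels in annotations.items()})
# {'item1': 'pos', 'item2': 'neg'}
```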

N-gram-based text categorization

TL;DR: An N-gram-based approach to text categorization that is tolerant of textual errors is described, which worked very well for language classification and worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject.
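The rank-ordered n-gram profiles and "out-of-place" distance described in that paper can be sketched roughly as follows; the profile size and toy training sentences are illustrative assumptions.

```python
# Rank-ordered character n-gram profiles compared with the out-of-place measure.
from collections import Counter

def profile(text, n_max=3, top_k=300):
    grams = Counter()
    padded = f" {text.lower()} "
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    # Map each of the top-k n-grams to its frequency rank.
    return {g: rank for rank, (g, _) in enumerate(grams.most_common(top_k))}

def out_of_place(doc_profile, lang_profile):
    max_penalty = len(lang_profile)  # charge the maximum for unseen n-grams
    return sum(abs(rank - lang_profile.get(g, max_penalty))
               for g, rank in doc_profile.items())

lang_profiles = {
    "en": profile("the quick brown fox jumps over the lazy dog"),
    "de": profile("der schnelle braune fuchs springt über den faulen hund"),
}
doc = profile("the dog is lazy")
print(min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang])))
# expected: en
```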
Journal ArticleDOI

Data Compression Using Adaptive Coding and Partial String Matching

TL;DR: This paper describes how the conflict can be resolved with partial string matching, and reports experimental results which show that mixed-case English text can be coded in as little as 2.2 bits/ character with no prior knowledge of the source.
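A heavily simplified sketch of the partial-string-matching idea is given below: each character is predicted from an order-1 context, escaping to an order-0 model and then to a uniform model when the context is unseen. Unlike real PPM, the cost of the escape symbols themselves is not charged and only one context length is used; both are simplifying assumptions for brevity.

```python
# Simplified partial-matching character model reporting bits per character.
import math
from collections import Counter, defaultdict

def bits_per_char(text, alphabet_size=256):
    order1 = defaultdict(Counter)  # previous character -> next-character counts
    order0 = Counter()
    bits, prev = 0.0, None
    for ch in text:
        ctx = order1[prev] if prev is not None else Counter()
        if ctx[ch] > 0:
            bits -= math.log2(ctx[ch] / sum(ctx.values()))
        elif order0[ch] > 0:
            bits -= math.log2(order0[ch] / sum(order0.values()))
        else:
            bits += math.log2(alphabet_size)  # fall back to a uniform model
        # Update the adaptive models after coding the character.
        if prev is not None:
            order1[prev][ch] += 1
        order0[ch] += 1
        prev = ch
    return bits / len(text)

print(round(bits_per_char("the theme of the thesis is the theory " * 20), 2))
```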
Journal ArticleDOI

Comparison of four approaches to automatic language identification of telephone speech

TL;DR: Four approaches for automatic language identification of speech utterances are compared: Gaussian mixture model (GMM) classification; single-language phone recognition followed by language-dependent, interpolated n-gram language modeling (PRLM); parallel PRLM, which uses multiple single-language phone recognizers, each trained in a different language; and language-dependent parallel phone recognition (PPR).
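The GMM approach compared in that paper can be sketched with scikit-learn: fit one Gaussian mixture per language on acoustic feature vectors and pick the language whose model assigns the utterance the highest average frame log-likelihood. The random arrays below stand in for real cepstral features and are purely illustrative.

```python
# One GMM per language; classify by highest average frame log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train_frames = {
    "en": rng.normal(0.0, 1.0, size=(500, 13)),  # stand-in for MFCC frames
    "es": rng.normal(1.5, 1.0, size=(500, 13)),
}
models = {
    lang: GaussianMixture(n_components=4, random_state=0).fit(frames)
    for lang, frames in train_frames.items()
}

utterance = rng.normal(1.5, 1.0, size=(200, 13))  # frames from one test utterance
scores = {lang: gmm.score(utterance) for lang, gmm in models.items()}
print(max(scores, key=scores.get))  # expected: es
```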