
Language Identification for Creating Language-Specific Twitter Collections

TLDR
This work annotates and releases a large collection of tweets in nine languages, focusing on confusable languages that use the Cyrillic, Arabic, and Devanagari scripts. It is the first publicly available collection of LID-annotated tweets in non-Latin scripts and should become a standard evaluation set for LID systems.
Abstract
Social media services such as Twitter offer an immense volume of real-world linguistic data. We explore the use of Twitter to obtain authentic user-generated text in low-resource languages such as Nepali, Urdu, and Ukrainian. Automatic language identification (LID) can be used to extract language-specific data from Twitter, but it is unclear how well LID performs on short, informal texts in low-resource languages. We address this question by annotating and releasing a large collection of tweets in nine languages, focusing on confusable languages using the Cyrillic, Arabic, and Devanagari scripts. This is the first publicly-available collection of LID-annotated tweets in non-Latin scripts, and should become a standard evaluation set for LID systems. We also advance the state-of-the-art by evaluating new, highly-accurate LID systems, trained both on our new corpus and on standard materials only. Both types of systems achieve a huge performance improvement over the existing state-of-the-art, correctly classifying around 98% of our gold standard tweets. We provide a detailed analysis showing how the accuracy of our systems varies along certain dimensions, such as tweet length and the amount of in- and out-of-domain training data.
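The paper's own implementation is not reproduced here, but a character n-gram classifier of the general kind the abstract describes can be sketched as follows; the toy training tweets, n-gram range, and scikit-learn pipeline are illustrative assumptions, not the authors' actual system or data.

```python
# Minimal character n-gram LID sketch (assumed setup, not the paper's system).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training data: (tweet text, language code) pairs.
train_texts = [
    "नमस्ते, तपाईंलाई कस्तो छ?",   # Nepali (Devanagari script)
    "یہ ایک مثال ہے",              # Urdu (Arabic script)
    "Привіт, як справи?",          # Ukrainian (Cyrillic script)
]
train_labels = ["ne", "ur", "uk"]

# Character n-grams cope well with short, informal tweets.
lid = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 4)),
    LinearSVC(),
)
lid.fit(train_texts, train_labels)
print(lid.predict(["Доброго ранку всім"]))  # expected: ['uk']
```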

Citations
Proceedings Article

What to do about bad language on the internet

TL;DR: A critical review of the NLP community's response to the landscape of bad language is offered, along with a quantitative analysis of the lexical diversity of social media text and its relationship to other corpora.
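For readers unfamiliar with lexical-diversity measures of the kind analysed there, the type-token ratio is one simple example; the tokenizer and sample strings below are illustrative assumptions, not that paper's methodology.

```python
# Type-token ratio: distinct word types divided by total tokens.
import re

def type_token_ratio(text: str) -> float:
    tokens = re.findall(r"\w+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

print(type_token_ratio("omg this movie is sooo good lol sooo good lol"))   # 0.7
print(type_token_ratio("The film's pacing is deliberate but rewarding."))  # 1.0
```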
Proceedings Article

Measuring Post Traumatic Stress Disorder in Twitter

TL;DR: This work considers PTSD, a serious condition that affects millions worldwide, with especially high rates in military veterans. Its utility is demonstrated by examining differences in language use between PTSD and random individuals, by building classifiers to separate these two groups, and by using those classifiers to detect elevated rates of PTSD at and around U.S. military bases.
Proceedings ArticleDOI

Code Mixing: A Challenge for Language Identification in the Language of Social Media

TL;DR: A new dataset is described, which contains Facebook posts and comments that exhibit code mixing between Bengali, English and Hindi, and it is found that the dictionary-based approach is surpassed by supervised classification and sequence labelling, and that it is important to take contextual clues into consideration.
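A dictionary-based word-level baseline of the kind that work reports being surpassed can be sketched as follows; the tiny word lists are illustrative assumptions, and the deliberate absence of contextual clues is exactly the weakness the cited paper points out.

```python
# Dictionary-based word-level language tagging for code-mixed text (sketch).
ENGLISH = {"the", "is", "movie", "awesome"}
HINDI = {"bahut", "accha", "hai", "yaar"}
BENGALI = {"khub", "bhalo", "chilo"}

def tag_tokens(tokens):
    tags = []
    for token in tokens:
        word = token.lower()
        if word in ENGLISH:
            tags.append("en")
        elif word in HINDI:
            tags.append("hi")
        elif word in BENGALI:
            tags.append("bn")
        else:
            tags.append("unk")  # no context is consulted for unknown words
    return tags

print(tag_tokens("movie bahut accha hai yaar".split()))
# ['en', 'hi', 'hi', 'hi', 'hi']
```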
Proceedings Article

Exploring Demographic Language Variations to Improve Multilingual Sentiment Analysis in Social Media

TL;DR: It is shown that gender differences in subjective language can effectively be used to improve sentiment analysis, and in particular, polarity classification for Spanish and Russian.
Patent

Systems and methods for multi-user multi-lingual communications

TL;DR: In this article, the authors describe a system and a method for assessing the accuracy of translations between two or more languages, in which users are rewarded for submitting corrections for inaccurate or erroneous translations.
References
Journal Article

LIBLINEAR: A Library for Large Linear Classification

TL;DR: LIBLINEAR is an open source library for large-scale linear classification that supports logistic regression and linear support vector machines and provides easy-to-use command-line tools and library calls for users and developers.
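LIBLINEAR can be used from its command-line tools or through wrappers; for instance, scikit-learn's LogisticRegression with solver="liblinear" is backed by the library. The synthetic data below is purely illustrative.

```python
# Calling LIBLINEAR via scikit-learn's liblinear solver (illustrative data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Generate a toy linear classification problem.
X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

# solver="liblinear" dispatches the optimization to LIBLINEAR.
clf = LogisticRegression(solver="liblinear", C=1.0)
clf.fit(X, y)
print(f"training accuracy: {clf.score(X, y):.3f}")
```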
Proceedings ArticleDOI

Cheap and Fast -- But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks

TL;DR: This work explores the use of Amazon's Mechanical Turk system, a significantly cheaper and faster method for collecting annotations from a broad base of paid non-expert contributors over the Web, and proposes a technique for bias correction that significantly improves annotation quality on two tasks.
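The paper's bias-correction technique is more involved, but the general idea of combining redundant non-expert labels while weighting annotators by an estimate of their quality can be sketched as follows; the annotations and per-annotator accuracies are hypothetical, and this is only loosely inspired by the cited method.

```python
# Simplified aggregation of redundant non-expert labels with annotator weights.
from collections import defaultdict

# Hypothetical annotations: item -> list of (annotator, label) pairs.
annotations = {
    "item1": [("a1", "pos"), ("a2", "neg"), ("a3", "pos")],
    "item2": [("a1", "neg"), ("a2", "neg"), ("a3", "pos")],
}
# Hypothetical per-annotator accuracies measured on a small gold set.
annotator_weight = {"a1": 0.9, "a2": 0.6, "a3": 0.7}

def aggregate(labels):
    scores = defaultdict(float)
    for annotator, label in labels:
        scores[label] += annotator_weight.get(annotator, 0.5)
    return max(scores, key=scores.get)

print({item: aggregate(labels) for item, labels in annotations.items()})
# {'item1': 'pos', 'item2': 'neg'}
```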

N-gram-based text categorization

TL;DR: An N-gram-based approach to text categorization that is tolerant of textual errors is described, which worked very well for language classification and worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject.
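The rank-ordered n-gram profiles and "out-of-place" distance described in that paper can be sketched roughly as follows; the profile size and toy training sentences are illustrative assumptions.

```python
# Rank-ordered character n-gram profiles compared with the out-of-place measure.
from collections import Counter

def profile(text, n_max=3, top_k=300):
    grams = Counter()
    padded = f" {text.lower()} "
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    # Map each of the top-k n-grams to its frequency rank.
    return {g: rank for rank, (g, _) in enumerate(grams.most_common(top_k))}

def out_of_place(doc_profile, lang_profile):
    max_penalty = len(lang_profile)  # charge the maximum for unseen n-grams
    return sum(abs(rank - lang_profile.get(g, max_penalty))
               for g, rank in doc_profile.items())

lang_profiles = {
    "en": profile("the quick brown fox jumps over the lazy dog"),
    "de": profile("der schnelle braune fuchs springt über den faulen hund"),
}
doc = profile("the dog is lazy")
print(min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang])))
# expected: en
```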
Journal ArticleDOI

Data Compression Using Adaptive Coding and Partial String Matching

TL;DR: This paper describes how the conflict can be resolved with partial string matching, and reports experimental results which show that mixed-case English text can be coded in as little as 2.2 bits/ character with no prior knowledge of the source.
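A heavily simplified sketch of the partial-string-matching idea is given below: each character is predicted from an order-1 context, escaping to an order-0 model and then to a uniform model when the context is unseen. Unlike real PPM, the cost of the escape symbols themselves is not charged and only one context length is used; both are simplifying assumptions for brevity.

```python
# Simplified partial-matching character model reporting bits per character.
import math
from collections import Counter, defaultdict

def bits_per_char(text, alphabet_size=256):
    order1 = defaultdict(Counter)  # previous character -> next-character counts
    order0 = Counter()
    bits, prev = 0.0, None
    for ch in text:
        ctx = order1[prev] if prev is not None else Counter()
        if ctx[ch] > 0:
            bits -= math.log2(ctx[ch] / sum(ctx.values()))
        elif order0[ch] > 0:
            bits -= math.log2(order0[ch] / sum(order0.values()))
        else:
            bits += math.log2(alphabet_size)  # fall back to a uniform model
        # Update the adaptive models after coding the character.
        if prev is not None:
            order1[prev][ch] += 1
        order0[ch] += 1
        prev = ch
    return bits / len(text)

print(round(bits_per_char("the theme of the thesis is the theory " * 20), 2))
```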
Journal ArticleDOI

Comparison of four approaches to automatic language identification of telephone speech

TL;DR: Four approaches for automatic language identification of speech utterances are compared: Gaussian mixture model (GMM) classification; single-language phone recognition followed by language-dependent, interpolated n-gram language modeling (PRLM); parallel PRLM, which uses multiple single-language phone recognizers, each trained in a different language; and language-dependent parallel phone recognition (PPR).
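The GMM approach compared in that paper can be sketched with scikit-learn: fit one Gaussian mixture per language on acoustic feature vectors and pick the language whose model assigns the utterance the highest average frame log-likelihood. The random arrays below stand in for real cepstral features and are purely illustrative.

```python
# One GMM per language; classify by highest average frame log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train_frames = {
    "en": rng.normal(0.0, 1.0, size=(500, 13)),  # stand-in for MFCC frames
    "es": rng.normal(1.5, 1.0, size=(500, 13)),
}
models = {
    lang: GaussianMixture(n_components=4, random_state=0).fit(frames)
    for lang, frames in train_frames.items()
}

utterance = rng.normal(1.5, 1.0, size=(200, 13))  # frames from one test utterance
scores = {lang: gmm.score(utterance) for lang, gmm in models.items()}
print(max(scores, key=scores.get))  # expected: es
```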