Proceedings of The First Workshop on Computational Approaches to Code Switching, pages 13–23,
October 25, 2014, Doha, Qatar. © 2014 Association for Computational Linguistics
Code Mixing: A Challenge for Language Identification in the Language of Social Media

Utsab Barman, Amitava Das, Joachim Wagner and Jennifer Foster
CNGL Centre for Global Intelligent Content, National Centre for Language Technology
School of Computing, Dublin City University, Dublin, Ireland
Department of Computer Science and Engineering, University of North Texas, Denton, Texas, USA
{ubarman,jwagner,jfoster}@computing.dcu.ie
amitava.das@unt.edu
Abstract

In social media communication, multilingual speakers often switch between languages, and, in such an environment, automatic language identification becomes both a necessary and challenging task. In this paper, we describe our work in progress on the problem of automatic language identification for the language of social media. We describe a new dataset that we are in the process of creating, which contains Facebook posts and comments that exhibit code mixing between Bengali, English and Hindi. We also present some preliminary word-level language identification experiments using this dataset. Different techniques are employed, including a simple unsupervised dictionary-based approach, supervised word-level classification with and without contextual clues, and sequence labelling using Conditional Random Fields. We find that the dictionary-based approach is surpassed by supervised classification and sequence labelling, and that it is important to take contextual clues into consideration.
1 Introduction

Automatic processing and understanding of Social Media Content (SMC) is currently attracting much attention from the Natural Language Processing research community. Although English is still by far the most popular language in SMC, its dominance is receding. Hong et al. (2011), for example, applied an automatic language detection algorithm to over 62 million tweets to identify the top 10 most popular languages on Twitter. They found that only half of the tweets were in English. Moreover, mixing multiple languages together (code mixing) is a popular trend among social media users from language-dense areas (Cárdenas-Claros and Isharyanti, 2009; Shafie and Nayan, 2013). In a scenario where speakers switch between languages within a conversation, sentence or even word, the task of automatic language identification becomes increasingly important to facilitate further processing.

Speakers whose first language uses a non-Roman alphabet often write using the Roman alphabet for convenience (phonetic typing), which increases the likelihood of code mixing with a Roman-alphabet language. This can be especially observed in South-East Asia and in the Indian subcontinent. The following is a code-mixed comment taken from a Facebook group of Indian university students:

Original: Yaar tu to, GOD hain. tui JU te ki korchis? Hail u man!
Translation: Buddy you are GOD. What are you doing in JU? Hail u man!

This comment is written in three languages: English, Hindi (italics), and Bengali (boldface). For Bengali and Hindi, phonetic typing has been used.

We follow in the footsteps of recent work on language identification for SMC (Hughes et al., 2006; Baldwin and Lui, 2010; Bergsma et al., 2012), focusing specifically on the problem of word-level language identification for code-mixed SMC. Our corpus for this task is collected from Facebook and contains instances of Bengali(BN)-English(EN)-Hindi(HI) code mixing.

The paper is organized as follows: in Section 2, we review related research in the area of code mixing and language identification; in Section 3, we describe our code mixing corpus, the data itself and the annotation process; in Section 4, we list the tools and resources which we use in our language identification experiments, described in Section 5. Finally, in Section 6, we conclude and provide suggestions for future research on this topic.
2 Background and Related Work

The problem of language identification has been investigated for half a century (Gold, 1967) and that of computational analysis of code switching for several decades (Joshi, 1982), but there has been less work on automatic language identification for multilingual code-mixed texts. Before turning to that topic, we first briefly survey studies on the general characteristics of code mixing.

Code mixing is a normal, natural product of bilingual and multilingual language use. Significant studies of the phenomenon can be found in the linguistics literature (Milroy and Muysken, 1995; Alex, 2008; Auer, 2013). These works mainly discuss the sociological and conversational necessities behind code mixing as well as its linguistic nature. Scholars distinguish between inter-sentence, intra-sentence and intra-word code mixing.

Several researchers have investigated the reasons for and the types of code mixing. Initial studies on Chinese-English code mixing in Hong Kong (Li, 2000) and Macao (San, 2009) indicated that mainly linguistic motivations were triggering the code mixing in those highly bilingual societies. Hidayat (2012) showed that Facebook users tend mainly to use inter-sentential switching over intra-sentential switching, reporting that 45% of the switching was instigated by real lexical needs, 40% was used for talking about a particular topic, and 5% for content clarification. The predominance of inter-sentential code mixing in social media text was also noted in the study by San (2009), which compared the mixing in blog posts to that in the spoken language in Macao. Dewaele (2010) claims that 'strong emotional arousal' increases the frequency of code mixing. Dey and Fung (2014) present a speech corpus of English-Hindi code mixing in student interviews and analyse the motivations for code mixing and the grammatical contexts in which it occurs.

Turning to the work on automatic analysis of code mixing, there have been some studies on detecting code mixing in speech (Solorio and Liu, 2008a; Weiner et al., 2012). Solorio and Liu (2008b) try to predict the points inside a set of spoken Spanish-English sentences where the speakers switch between the two languages. Other studies have looked at code mixing in different types of short texts, such as information retrieval queries (Gottron and Lipka, 2010) and SMS messages (Farrugia, 2004; Rosner and Farrugia, 2007). Yamaguchi and Tanaka-Ishii (2012) perform language identification using artificial multilingual data, created by randomly sampling text segments from monolingual documents. King and Abney (2013) used weakly semi-supervised methods to perform word-level language identification, using a dataset of 30 languages. They explore several language identification approaches, including a Naive Bayes classifier for individual word-level classification and sequence labelling with Conditional Random Fields trained with Generalized Expectation criteria (Mann and McCallum, 2008; Mann and McCallum, 2010), which achieved the highest scores. Another very recent work on this topic is that of Nguyen and Doğruöz (2013), who report on language identification experiments performed on Turkish and Dutch forum data. Experiments were carried out using language models, dictionaries, logistic regression classification and Conditional Random Fields. They find that language models are more robust than dictionaries and that contextual information is helpful for the task.
3 Corpus Acquisition

Taking into account the claim that code mixing is frequent among speakers who are multilingual and younger in age (Cárdenas-Claros and Isharyanti, 2009), we choose an Indian student community in the 20-30 year age group as our data source. India is a country with 30 spoken languages, among which 22 are official. Code mixing is very frequent in the Indian sub-continent because languages change within very short geo-distances and people generally have a basic knowledge of their neighboring languages.

A Facebook group¹ and 11 Facebook users (known to the authors) were selected to obtain publicly available posts and comments. The Facebook Graph API explorer was used for data collection. Since these Facebook users are from West Bengal, the most dominant language is Bengali (the native language), followed by English and then Hindi (the national language of India). The posts and comments in Bengali and Hindi script were discarded during data collection, resulting in 2335 posts and 9813 comments.

¹ https://www.facebook.com/jumatrimonial
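For concreteness, the shape of such a collection script is sketched below. This is a hypothetical illustration rather than our actual collection code: the endpoint layout follows the public Graph API documentation of that era, and the group id, access token, field list and pagination handling are placeholders.

    import requests

    # Hypothetical sketch of collecting posts and comments from a group
    # via the Facebook Graph API. GROUP_ID and ACCESS_TOKEN are placeholders.
    GRAPH = 'https://graph.facebook.com'
    GROUP_ID = '<group-id>'
    ACCESS_TOKEN = '<token>'

    def fetch_texts():
        """Yield post and comment texts, following pagination links."""
        url = f'{GRAPH}/{GROUP_ID}/feed'
        params = {'access_token': ACCESS_TOKEN,
                  'fields': 'message,comments{message}'}
        while url:
            page = requests.get(url, params=params).json()
            for post in page.get('data', []):
                if 'message' in post:
                    yield post['message']
                for c in post.get('comments', {}).get('data', []):
                    yield c['message']
            url = page.get('paging', {}).get('next')
            params = {}  # 'next' URLs already carry the query parameters

    for text in fetch_texts():
        print(text)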
3.1 Annotation

Four annotators took part in the annotation task. Three were computer science students and the other was one of the authors. The annotators are proficient in all three languages of our corpus. A simple annotation tool was developed which enabled these annotators to identify and distinguish the different languages present in the content by tagging them. Annotators were supplied with 4 basic tags (viz. sentence, fragment, inclusion and wlcm (word-level code mixing)) to annotate different levels of code mixing. Under each tag, six attributes were provided, viz. English (en), Bengali (bn), Hindi (hi), Mixed (mixd), Universal (univ) and Undefined (undef). The attribute univ is associated with symbols, numbers, emoticons and universal expressions (e.g. hahaha, lol). The attribute undef is specified for a sentence or a word to which no language tag can be attributed and which cannot be categorized as univ. In addition, annotators were instructed to annotate named entities separately. What follows are descriptions of each of the annotation tags.
Sentence (sent): This tag refers to a sentence and can be used to mark inter-sentential code mixing. Annotators were instructed to identify a sentence with its base language (e.g. en, bn, hi and mixd) or with other types (e.g. univ, undef) as the first task of annotation. Only the attribute mixd is used to refer to a sentence which contains multiple languages in the same proportion. A sentence may contain any number of inclusions, fragments and word-level code mixing. A sentence can be attributed as univ if and only if it contains symbols, numbers, emoticons, chat acronyms and no other words (Hindi, English or Bengali). A sentence can be attributed as undef if it is not marked as univ and has words/tokens that cannot be categorized as Hindi, English or Bengali. Some examples of sentence-level annotations are the following:

1. English-Sentence:
[sent-lang="en"] what a.....6 hrs long...but really nice tennis.... [/sent]

2. Bengali-Sentence:
[sent-lang="bn"] shubho nabo borsho.. :) [/sent]

3. Hindi-Sentence:
[sent-lang="hi"] karwa sachh ..... :( [/sent]

4. Mixed-Sentence:
[sent-lang="mixd"] [frag-lang="hi"] oye hoye ..... angreji me kahte hai ke [/frag] [frag-lang="en"] I love u.. !!! [/frag] [/sent]

5. Univ-Sentence:
[sent-lang="univ"] hahahahahahah....!!!!! [/sent]

6. Undef-Sentence:
[sent-lang="undef"] Hablando de una triple amenaza. [/sent]
Fragment (frag): This refers to a group of grammatically related foreign words in a sentence. The presence of this tag in a sentence conveys that intra-sentential code mixing has occurred within the sentence boundary. Identification of fragments (if present) in a sentence was the second task of annotation. A sentence (sent) with attribute mixd must contain multiple fragments (frag), each with a specific language attribute. In the fourth example above, the sentence contains a Hindi fragment oye hoye ..... angreji me kahte hai ke and an English fragment I love u.. !!!, hence it is considered a mixd sentence. A fragment can have any number of inclusions and word-level code mixing. In the first example below, Jio is a popular Bengali word appearing in the English fragment Jio.. good joke, hence tagged as a Bengali inclusion. One can argue that the word Jio could be a separate Bengali inclusion (i.e. tagged as a Bengali inclusion outside the English fragment), but looking at the syntactic pattern and the sense expressed by the comment, the annotator kept it as a single unit. In the second example below, an instance of word-level code mixing, typer, has been found in an English fragment (where the root English word type has the Bengali suffix r).

1. Fragment with Inclusion:
[sent-lang="mixd"] [frag-lang="en"] [incl-lang="bn"] Jio.. [/incl] good joke [/frag] [frag-lang="bn"] "amar Babin" [/frag] [/sent]

2. Fragment with Word-Level Code Mixing:
[sent-lang="mixd"] [frag-lang="en"] I will find u and marry you [/frag] [frag-lang="bn"] [wlcm-type="en-and-bn-suffix"] typer [/wlcm] hoe glo to! :D [/frag] [/sent]

Inclusion (incl): An inclusion is a foreign word or phrase in a sentence or in a fragment which is assimilated into, or used very frequently in, the native language. Identification of inclusions can be performed after annotating a sentence and fragment (if present in that sentence). An inclusion within a sentence or fragment also denotes intra-sentential code mixing. In the example below, seriously is an English inclusion which is assimilated in today's colloquial Bengali and Hindi. The only tag that an inclusion may contain is word-level code mixing.

1. Sentence with Inclusion:
[sent-lang="bn"] Na re [incl-lang="en"] seriously [/incl] ami khub kharap achi. [/sent]
Word-Level Code Mixing (wlcm): This is the smallest unit of code mixing. This tag was introduced to capture intra-word code mixing and denotes cases where code mixing has occurred within a single word. Identifying word-level code mixing is the last task of annotation. Annotators were told to record the type of word-level code mixing as an attribute in (Base Language + Second Language) format. Some examples are provided below. In the first example, the root word class is English and e is a Bengali suffix that has been added. In the third example, the opposite can be observed: the root word Kando is Bengali, and an English suffix z has been added. In the second example, a named entity suman is present with a Bengali suffix er.

1. Word-Level Code Mixing (EN-BN):
[wlcm-type="en-and-bn-suffix"] classe [/wlcm]

2. Word-Level Code Mixing (NE-BN):
[wlcm-type="NE-and-bn-suffix"] sumaner [/wlcm]

3. Word-Level Code Mixing (BN-EN):
[wlcm-type="bn-and-en-suffix"] kandoz [/wlcm]
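For illustration, bracket-style annotations of this kind can be flattened into word-level language labels by tracking the innermost open tag. The following is a minimal sketch, not our actual annotation tool: the tag and attribute names follow the examples above, while the function name and parsing details are illustrative only.

    import re

    # Illustrative only: flatten the bracket-style annotation into
    # (token, label) pairs using the innermost enclosing tag. For wlcm
    # spans, the label is the wlcm type string (e.g. "en-and-bn-suffix").
    OPEN = re.compile(r'\[(?:sent|frag|incl)-lang="([^"]+)"\]')
    WLCM = re.compile(r'\[wlcm-type="([^"]+)"\]')
    CLOSE = re.compile(r'\[/(?:sent|frag|incl|wlcm)\]')

    def word_labels(annotated):
        labels, stack = [], []
        # Split the string into tag and non-tag stretches.
        for part in re.split(r'(\[[^\]]+\])', annotated):
            m = OPEN.fullmatch(part) or WLCM.fullmatch(part)
            if m:
                stack.append(m.group(1))
            elif CLOSE.fullmatch(part):
                stack.pop()
            else:
                for tok in part.split():
                    labels.append((tok, stack[-1] if stack else 'undef'))
        return labels

    example = ('[sent-lang="bn"] Na re [incl-lang="en"] seriously [/incl] '
               'ami khub kharap achi. [/sent]')
    print(word_labels(example))
    # [('Na', 'bn'), ('re', 'bn'), ('seriously', 'en'),
    #  ('ami', 'bn'), ('khub', 'bn'), ('kharap', 'bn'), ('achi.', 'bn')]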
3.1.1 Inter-Annotator Agreement

We calculate word-level inter-annotator agreement (Cohen's Kappa) on a randomly selected subset of 100 comments between two annotators. Two annotators are in agreement about a word if they both annotate the word with the same attribute (en, bn, hi, univ, undef), regardless of whether the word is inside an inclusion, fragment or sentence. Our observations that the word-level annotation process is not a very ambiguous task and that the annotation instructions are straightforward are confirmed by a high inter-annotator agreement (IAA), with a Kappa value of 0.884.
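For reference, Cohen's Kappa is Kappa = (Po - Pe) / (1 - Pe), where Po is the observed word-level agreement and Pe is the agreement expected by chance given each annotator's label distribution. A minimal sketch of the computation follows; the toy label sequences are invented, only the formula is standard.

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        """Cohen's Kappa for two annotators' word-level labels."""
        assert len(labels_a) == len(labels_b)
        n = len(labels_a)
        # Observed agreement: fraction of words with identical labels.
        p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Chance agreement from the two marginal label distributions.
        count_a, count_b = Counter(labels_a), Counter(labels_b)
        p_e = sum(count_a[l] * count_b[l] for l in count_a) / (n * n)
        return (p_o - p_e) / (1 - p_e)

    # Toy example using the five word-level attributes described above.
    a = ['bn', 'bn', 'en', 'univ', 'bn', 'hi', 'en', 'bn']
    b = ['bn', 'bn', 'en', 'univ', 'bn', 'bn', 'en', 'bn']
    print(round(cohens_kappa(a, b), 3))  # 0.795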
3.2 Data Characteristics

Tag-level and word-level statistics of the annotated data, which reveal the characteristics of our data set, are given in Table 1 and Table 2 respectively. More than 56% of all sentences and almost 40% of all tokens are in Bengali, which is the dominant language of this corpus. English is the second most dominant language, covering almost 33% of all tokens and 35% of all sentences. The amount of Hindi data is substantially lower: nearly 1.75% of all tokens and 2% of all sentences. However, English inclusions (84% of all inclusions) are more prominent than Hindi or Bengali inclusions, and there are a substantial number of English fragments (almost 52% of all fragments) in our corpus. This means that English is the main language involved in the code mixing.

  Statistics of Different Tags
  Tag           En     Bn     Hi   Mixd  Univ   Undef
  sent          5,370  8,523  354  204   746    15
  frag          288    213    40   0     6      0
  incl          7,377  262    94   0     1,032  1
  wlcm          477
  Named Entity  3,602
  Acronym       691

Table 1: Tag-level statistics

  Word-Level Tag  Count
  EN              66,298
  BN              79,899
  HI              3,440
  WLCM            633
  NE              5,233
  ACRO            715
  UNIV            39,291
  UNDEF           61

Table 2: Word-level statistics
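As a quick check on the token shares quoted above, the percentages follow directly from Table 2. A small worked computation over the table's own counts:

    # Token shares computed from Table 2 (counts are the table's).
    counts = {'EN': 66298, 'BN': 79899, 'HI': 3440, 'WLCM': 633,
              'NE': 5233, 'ACRO': 715, 'UNIV': 39291, 'UNDEF': 61}
    total = sum(counts.values())  # 195,570 tokens
    for tag in ('BN', 'EN', 'HI'):
        print(f'{tag}: {100 * counts[tag] / total:.2f}%')
    # BN: 40.85%, EN: 33.90%, HI: 1.76% -- in line with the figures above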
3.2.1 Code Mixing Types

In our corpus, inter- and intra-sentential code mixing are more prominent than word-level code mixing, which is similar to the findings of Hidayat (2012). Our corpus contains every type of code mixing in English, Hindi and Bengali, viz. inter-sentential, intra-sentential and word-level, as described in the previous section. Some examples of the different types of code mixing in our corpus are presented below.

1. Inter-Sentential:
[sent-lang="hi"] Itna izzat diye aapne mujhe !!! [/sent]
[sent-lang="en"] Tears of joy. :'( :'( [/sent]

2. Intra-Sentential:
[sent-lang="bn"] [incl-lang="en"] by d way [/incl] ei [frag-lang="en"] my craving arms shall forever remain empty .. never hold u close .. [/frag] line ta baddo [incl-lang="en"] cheezy [/incl] :P ;) [/sent]

3. Word-Level:
[sent-lang="bn"] [incl-lang="en"] 1st yr [/incl] eo to ei [wlcm-type="en+bnSuffix"] tymer [/wlcm] modhye sobar jute jay .. [/sent]
3.2.2 Ambiguous Words

Annotators were instructed to tag an English word as English irrespective of any influence of word borrowing or foreign inclusion, but an inspection of the annotations revealed that English words were sometimes annotated as Bengali or Hindi. To understand this phenomenon, we processed the list of language (EN, BN and HI) word types (26,475 in total) and measured the percentage of types that were not always annotated with the same language throughout the corpus. The results are presented in Table 3. Almost 7% of all types are ambiguous (i.e. tagged with different languages during annotation). Among them, a substantial amount (5.58%) are English/Bengali.

  Label(s)         Count   Percentage
  EN               9,109   34.40
  BN               14,345  54.18
  HI               1,039   3.92
  EN or BN         1,479   5.58
  EN or HI         61      0.23
  BN or HI         277     1.04
  EN or BN or HI   165     0.62

Table 3: Statistics of ambiguous and monolingual word types
There are two reasons why this happens:

Same Words Across Languages: Some words are the same (e.g. baba, maa, na, khali) in Hindi and Bengali because both languages originated from a single language, Sanskrit, and share a good amount of common vocabulary. The same happens between English and Hindi and between English and Bengali as a result of word borrowing. Most of these are commonly used inclusions like clg, dept, question, cigarette, and topic. Sometimes the annotators were careful enough to tag such words as English, and sometimes these words were tagged in the annotators' native languages. During cross-checking of the annotated data, the same error patterns were observed for multiple annotators, i.e. tagging commonly used foreign words as native language. This demonstrates that these English words are highly assimilated into the conversational vocabulary of Bengali and Hindi.

Phonetic Similarity of Spellings: Due to phonetic typing, some words share the same surface form across two and sometimes three languages. As an example, to is a word in all three languages: it occurs 1209 times as English, 715 times as Bengali and 55 times as Hindi in our data. The meanings of such words (e.g. to, bolo, die) differ across languages. This phenomenon is perhaps exacerbated by the trend towards short and noisy spelling in SMC.
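Ambiguity statistics of the kind shown in Table 3 can be computed by collecting, for each word type, the set of language tags it receives across the corpus. A minimal sketch follows; the helper name and toy token stream are ours, not our actual pipeline.

    from collections import defaultdict

    def ambiguity_table(tokens):
        """tokens: iterable of (word, language_tag) pairs.
        Returns counts of word types per observed tag combination."""
        tags_per_type = defaultdict(set)
        for word, tag in tokens:
            tags_per_type[word.lower()].add(tag)
        table = defaultdict(int)
        for tags in tags_per_type.values():
            table[' or '.join(sorted(tags))] += 1
        return table

    # Toy token stream; the real corpus has 26,475 language word types.
    tokens = [('to', 'EN'), ('to', 'BN'), ('to', 'HI'), ('baddo', 'BN'),
              ('joke', 'EN'), ('khali', 'BN'), ('khali', 'HI')]
    total = len({w for w, _ in tokens})
    for label, count in sorted(ambiguity_table(tokens).items()):
        print(f'{label:15s} {count:5d} {100 * count / total:6.2f}%')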
4 Tools and Resources

We use the following resources and tools in our experiments.

Dictionaries

1. British National Corpus (BNC): We compile a word frequency list from the BNC (Aston and Burnard, 1998).

2. SEMEVAL 2013 Twitter Corpus (SemevalTwitter): To cope with the language of social media, we use the SEMEVAL 2013 (Nakov et al., 2013) training data for the Twitter sentiment analysis task. This data comes from a popular social media site and hence is likely to reflect the linguistic properties of SMC.

3. Lexical Normalization List (LexNormList): Spelling variation is a well-known phenomenon in SMC. We use a lexical normalization dictionary created by Han et al. (2012) to handle the different spelling variations in our data.
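In the dictionary-based detection of Section 5.1, a word's language is predicted from its frequency in the language dictionaries, backing off to the dominant language(s) of the corpus when the word is absent from every dictionary or tied between several. The following minimal sketch illustrates that decision rule; the tiny frequency lists are placeholders, not the BNC or our Bengali/Hindi lists.

    # Sketch of a dictionary-based word-level language identifier:
    # a word is assigned the language of the dictionary in which it is
    # most frequent; ties and out-of-vocabulary words back off to the
    # corpus-dominant language. Frequency lists below are placeholders.
    DICTS = {
        'en': {'what': 4521, 'nice': 812, 'to': 1209, 'love': 950},
        'bn': {'ami': 300, 'khub': 210, 'to': 715, 'borsho': 12},
        'hi': {'kahte': 45, 'izzat': 17, 'to': 55},
    }
    DOMINANCE_ORDER = ['bn', 'en', 'hi']  # corpus-dominant language first

    def predict_language(word):
        word = word.lower()
        freqs = {lang: d.get(word, 0) for lang, d in DICTS.items()}
        best = max(freqs.values())
        if best == 0:
            return DOMINANCE_ORDER[0]  # OOV: back off to dominant language
        candidates = [lang for lang, f in freqs.items() if f == best]
        # Tie: prefer the more dominant language of the corpus.
        return min(candidates, key=DOMINANCE_ORDER.index)

    print([predict_language(w) for w in 'ami khub love to xyz'.split()])
    # ['bn', 'bn', 'en', 'en', 'bn']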
Machine Learning Toolkits

1. WEKA: We use the Weka toolkit (Hall et al., 2009) for our experiments in decision tree training.

2. MALLET: CRF learning is applied using the MALLET toolkit (McCallum, 2002).
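As the abstract notes, contextual clues matter for the supervised approaches. The sketch below shows one plausible shape for such word-level features: character n-grams and length for the focus word, plus neighbouring words as context. The feature templates here are our illustration, not necessarily the exact set used in Section 5.

    def char_ngrams(word, n_max=3):
        # All character n-grams of length 1..n_max of the lowercased word.
        word = word.lower()
        return [word[i:i + n]
                for n in range(1, n_max + 1)
                for i in range(len(word) - n + 1)]

    def word_features(tokens, i, context=1):
        # Feature dict for token i; neighbouring words are contextual clues.
        feats = {'ng=' + g: 1 for g in char_ngrams(tokens[i])}
        feats['len=' + str(len(tokens[i]))] = 1
        for d in range(1, context + 1):
            prev = tokens[i - d] if i - d >= 0 else '<s>'
            nxt = tokens[i + d] if i + d < len(tokens) else '</s>'
            feats['prev%d=%s' % (d, prev.lower())] = 1
            feats['next%d=%s' % (d, nxt.lower())] = 1
        return feats

    tokens = 'Na re seriously ami khub kharap achi'.split()
    print(word_features(tokens, 2))  # features for 'seriously'

Feature dictionaries of this form can be vectorized and fed to a linear classifier (e.g. scikit-learn's DictVectorizer followed by LinearSVC), or used as observations in a CRF for sequence labelling.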
References

William B. Cavnar and John M. Trenkle. 1994. N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874.

E. Mark Gold. 1967. Language identification in the limit. Information and Control, 10(5):447-474.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10-18.

Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. 2003. A practical guide to support vector classification. Technical report, National Taiwan University.