Proceedings of The First Workshop on Computational Approaches to Code Switching, pages 13–23,
October 25, 2014, Doha, Qatar. © 2014 Association for Computational Linguistics
Code Mixing: A Challenge for Language Identification in the Language of Social Media

Utsab Barman, Amitava Das, Joachim Wagner and Jennifer Foster
CNGL Centre for Global Intelligent Content, National Centre for Language Technology
School of Computing, Dublin City University, Dublin, Ireland
Department of Computer Science and Engineering, University of North Texas, Denton, Texas, USA
{ubarman,jwagner,jfoster}@computing.dcu.ie
amitava.das@unt.edu
Abstract

In social media communication, multilingual speakers often switch between languages, and, in such an environment, automatic language identification becomes both a necessary and challenging task. In this paper, we describe our work in progress on the problem of automatic language identification for the language of social media. We describe a new dataset that we are in the process of creating, which contains Facebook posts and comments that exhibit code mixing between Bengali, English and Hindi. We also present some preliminary word-level language identification experiments using this dataset. Different techniques are employed, including a simple unsupervised dictionary-based approach, supervised word-level classification with and without contextual clues, and sequence labelling using Conditional Random Fields. We find that the dictionary-based approach is surpassed by supervised classification and sequence labelling, and that it is important to take contextual clues into consideration.
1 Introduction

Automatic processing and understanding of Social Media Content (SMC) is currently attracting much attention from the Natural Language Processing research community. Although English is still by far the most popular language in SMC, its dominance is receding. Hong et al. (2011), for example, applied an automatic language detection algorithm to over 62 million tweets to identify the top 10 most popular languages on Twitter. They found that only half of the tweets were in English. Moreover, mixing multiple languages together (code mixing) is a popular trend among social media users from language-dense areas (Cárdenas-Claros and Isharyanti, 2009; Shafie and Nayan, 2013). In a scenario where speakers switch between languages within a conversation, sentence or even word, the task of automatic language identification becomes increasingly important to facilitate further processing.

Speakers whose first language uses a non-Roman alphabet often write using the Roman alphabet for convenience (phonetic typing), which increases the likelihood of code mixing with a Roman-alphabet language. This can be especially observed in South-East Asia and in the Indian subcontinent. The following is a code-mixed comment taken from a Facebook group of Indian university students:

Original: Yaar tu to, GOD hain. tui JU te ki korchis? Hail u man!
Translation: Buddy you are GOD. What are you doing in JU? Hail u man!

This comment is written in three languages: English, Hindi (italics), and Bengali (boldface). For Bengali and Hindi, phonetic typing has been used.

We follow in the footsteps of recent work on language identification for SMC (Hughes et al., 2006; Baldwin and Lui, 2010; Bergsma et al., 2012), focusing specifically on the problem of word-level language identification for code-mixed SMC. Our corpus for this task is collected from Facebook and contains instances of Bengali(BN)-English(EN)-Hindi(HI) code mixing.

The paper is organized as follows: in Section 2, we review related research in the area of code mixing and language identification; in Section 3, we describe our code mixing corpus, the data itself and the annotation process; in Section 4, we list the tools and resources which we use in our language identification experiments, described in Section 5. Finally, in Section 6, we conclude and provide suggestions for future research on this topic.
2 Background and Related Work

The problem of language identification has been investigated for half a century (Gold, 1967) and that of computational analysis of code switching for several decades (Joshi, 1982), but there has been less work on automatic language identification for multilingual code-mixed texts. Before turning to that topic, we first briefly survey studies on the general characteristics of code mixing.

Code mixing is a normal, natural product of bilingual and multilingual language use. Significant studies of the phenomenon can be found in the linguistics literature (Milroy and Muysken, 1995; Alex, 2008; Auer, 2013). These works mainly discuss the sociological and conversational necessities behind code mixing as well as its linguistic nature. Scholars distinguish between inter-sentence, intra-sentence and intra-word code mixing.

Several researchers have investigated the reasons for and the types of code mixing. Initial studies on Chinese-English code mixing in Hong Kong (Li, 2000) and Macao (San, 2009) indicated that mainly linguistic motivations were triggering the code mixing in those highly bilingual societies. Hidayat (2012) showed that Facebook users tend mainly to use inter-sentential switching over intra-sentential switching, reporting that 45% of the switching was instigated by real lexical needs, 40% was used for talking about a particular topic, and 5% for content clarification. The predominance of inter-sentential code mixing in social media text was also noted in the study by San (2009), which compared the mixing in blog posts to that in the spoken language in Macao. Dewaele (2010) claims that 'strong emotional arousal' increases the frequency of code mixing. Dey and Fung (2014) present a speech corpus of English-Hindi code mixing in student interviews and analyse the motivations for code mixing and the grammatical contexts in which it occurs.

Turning to the work on automatic analysis of code mixing, there have been some studies on detecting code mixing in speech (Solorio and Liu, 2008a; Weiner et al., 2012). Solorio and Liu (2008b) try to predict the points inside a set of spoken Spanish-English sentences where the speakers switch between the two languages. Other studies have looked at code mixing in different types of short texts, such as information retrieval queries (Gottron and Lipka, 2010) and SMS messages (Farrugia, 2004; Rosner and Farrugia, 2007). Yamaguchi and Tanaka-Ishii (2012) perform language identification using artificial multilingual data, created by randomly sampling text segments from monolingual documents. King and Abney (2013) used weakly semi-supervised methods to perform word-level language identification, using a dataset of 30 languages. They explore several language identification approaches, including a Naive Bayes classifier for individual word-level classification and sequence labelling with Conditional Random Fields trained with Generalized Expectation criteria (Mann and McCallum, 2008; Mann and McCallum, 2010), which achieved the highest scores. Another very recent work on this topic is that of Nguyen and Doğruöz (2013), who report on language identification experiments performed on Turkish and Dutch forum data. Experiments were carried out using language models, dictionaries, logistic regression classification and Conditional Random Fields. They find that language models are more robust than dictionaries and that contextual information is helpful for the task.
3 Corpus Acquisition

Taking into account the claim that code mixing is frequent among speakers who are multilingual and younger in age (Cárdenas-Claros and Isharyanti, 2009), we choose an Indian student community in the 20-30 year age group as our data source. India is a country with 30 spoken languages, among which 22 are official. Code mixing is very frequent in the Indian sub-continent because languages change within very short geo-distances and people generally have a basic knowledge of their neighboring languages.

A Facebook group¹ and 11 Facebook users (known to the authors) were selected to obtain publicly available posts and comments. The Facebook Graph API explorer was used for data collection. Since these Facebook users are from West Bengal, the most dominant language is Bengali (the native language), followed by English and then Hindi (the national language of India). The posts and comments in Bengali and Hindi script were discarded during data collection, resulting in 2335 posts and 9813 comments.

¹ https://www.facebook.com/jumatrimonial
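For concreteness, the shape of such a collection script is sketched below. This is a hypothetical illustration rather than our actual collection code: the endpoint layout follows the public Graph API documentation of that era, and the group id, access token, field list and pagination handling are placeholders.

    import requests

    # Hypothetical sketch of collecting posts and comments from a group
    # via the Facebook Graph API. GROUP_ID and ACCESS_TOKEN are placeholders.
    GRAPH = 'https://graph.facebook.com'
    GROUP_ID = '<group-id>'
    ACCESS_TOKEN = '<token>'

    def fetch_texts():
        """Yield post and comment texts, following pagination links."""
        url = f'{GRAPH}/{GROUP_ID}/feed'
        params = {'access_token': ACCESS_TOKEN,
                  'fields': 'message,comments{message}'}
        while url:
            page = requests.get(url, params=params).json()
            for post in page.get('data', []):
                if 'message' in post:
                    yield post['message']
                for c in post.get('comments', {}).get('data', []):
                    yield c['message']
            url = page.get('paging', {}).get('next')
            params = {}  # 'next' URLs already carry the query parameters

    for text in fetch_texts():
        print(text)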
3.1 Annotation

Four annotators took part in the annotation task. Three were computer science students and the other was one of the authors. The annotators are proficient in all three languages of our corpus. A simple annotation tool was developed which enabled these annotators to identify and distinguish the different languages present in the content by tagging them. Annotators were supplied with 4 basic tags (viz. sentence, fragment, inclusion and wlcm (word-level code mixing)) to annotate different levels of code mixing. Under each tag, six attributes were provided, viz. English (en), Bengali (bn), Hindi (hi), Mixed (mixd), Universal (univ) and Undefined (undef). The attribute univ is associated with symbols, numbers, emoticons and universal expressions (e.g. hahaha, lol). The attribute undef is specified for a sentence or a word to which no language tag can be attributed and which cannot be categorized as univ. In addition, annotators were instructed to annotate named entities separately. What follows are descriptions of each of the annotation tags.
Sentence (sent): This tag refers to a sentence and can be used to mark inter-sentential code mixing. Annotators were instructed to identify a sentence with its base language (e.g. en, bn, hi and mixd) or with other types (e.g. univ, undef) as the first task of annotation. Only the attribute mixd is used to refer to a sentence which contains multiple languages in the same proportion. A sentence may contain any number of inclusions, fragments and word-level code mixing. A sentence can be attributed as univ if and only if it contains symbols, numbers, emoticons, chat acronyms and no other words (Hindi, English or Bengali). A sentence can be attributed as undef if it is not marked as univ and has words/tokens that cannot be categorized as Hindi, English or Bengali. Some examples of sentence-level annotations are the following:

1. English-Sentence:
[sent-lang="en"] what a.....6 hrs long...but really nice tennis.... [/sent]

2. Bengali-Sentence:
[sent-lang="bn"] shubho nabo borsho.. :) [/sent]

3. Hindi-Sentence:
[sent-lang="hi"] karwa sachh ..... :( [/sent]

4. Mixed-Sentence:
[sent-lang="mixd"] [frag-lang="hi"] oye hoye ..... angreji me kahte hai ke [/frag] [frag-lang="en"] I love u.. !!! [/frag] [/sent]

5. Univ-Sentence:
[sent-lang="univ"] hahahahahahah....!!!!! [/sent]

6. Undef-Sentence:
[sent-lang="undef"] Hablando de una triple amenaza. [/sent]
Fragment (frag): This refers to a group of grammatically related foreign words in a sentence. The presence of this tag in a sentence conveys that intra-sentential code mixing has occurred within the sentence boundary. Identification of fragments (if present) in a sentence was the second task of annotation. A sentence (sent) with attribute mixd must contain multiple fragments (frag), each with a specific language attribute. In the fourth example above, the sentence contains a Hindi fragment oye hoye ..... angreji me kahte hai ke and an English fragment I love u.. !!!, hence it is considered a mixd sentence. A fragment can have any number of inclusions and word-level code mixing. In the first example below, Jio is a popular Bengali word appearing in the English fragment Jio.. good joke, hence tagged as a Bengali inclusion. One can argue that the word Jio could be a separate Bengali inclusion (i.e. tagged as a Bengali inclusion outside the English fragment), but looking at the syntactic pattern and the sense expressed by the comment, the annotator kept it as a single unit. In the second example below, an instance of word-level code mixing, typer, has been found in an English fragment (where the root English word type has the Bengali suffix r).

1. Fragment with Inclusion:
[sent-lang="mixd"] [frag-lang="en"] [incl-lang="bn"] Jio.. [/incl] good joke [/frag] [frag-lang="bn"] "amar Babin" [/frag] [/sent]

2. Fragment with Word-Level Code Mixing:
[sent-lang="mixd"] [frag-lang="en"] I will find u and marry you [/frag] [frag-lang="bn"] [wlcm-type="en-and-bn-suffix"] typer [/wlcm] hoe glo to! :D [/frag] [/sent]

Inclusion (incl): An inclusion is a foreign word or phrase in a sentence or in a fragment which is assimilated into, or used very frequently in, the native language. Identification of inclusions can be performed after annotating a sentence and fragment (if present in that sentence). An inclusion within a sentence or fragment also denotes intra-sentential code mixing. In the example below, seriously is an English inclusion which is assimilated in today's colloquial Bengali and Hindi. The only tag that an inclusion may contain is word-level code mixing.

1. Sentence with Inclusion:
[sent-lang="bn"] Na re [incl-lang="en"] seriously [/incl] ami khub kharap achi. [/sent]
Word-Level Code Mixing (wlcm): This is the smallest unit of code mixing. This tag was introduced to capture intra-word code mixing and denotes cases where code mixing has occurred within a single word. Identifying word-level code mixing is the last task of annotation. Annotators were told to record the type of word-level code mixing as an attribute in (Base Language + Second Language) format. Some examples are provided below. In the first example, the root word class is English and e is a Bengali suffix that has been added. In the third example, the opposite can be observed: the root word Kando is Bengali, and an English suffix z has been added. In the second example, a named entity suman is present with a Bengali suffix er.

1. Word-Level Code Mixing (EN-BN):
[wlcm-type="en-and-bn-suffix"] classe [/wlcm]

2. Word-Level Code Mixing (NE-BN):
[wlcm-type="NE-and-bn-suffix"] sumaner [/wlcm]

3. Word-Level Code Mixing (BN-EN):
[wlcm-type="bn-and-en-suffix"] kandoz [/wlcm]
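For illustration, bracket-style annotations of this kind can be flattened into word-level language labels by tracking the innermost open tag. The following is a minimal sketch, not our actual annotation tool: the tag and attribute names follow the examples above, while the function name and parsing details are illustrative only.

    import re

    # Illustrative only: flatten the bracket-style annotation into
    # (token, label) pairs using the innermost enclosing tag. For wlcm
    # spans, the label is the wlcm type string (e.g. "en-and-bn-suffix").
    OPEN = re.compile(r'\[(?:sent|frag|incl)-lang="([^"]+)"\]')
    WLCM = re.compile(r'\[wlcm-type="([^"]+)"\]')
    CLOSE = re.compile(r'\[/(?:sent|frag|incl|wlcm)\]')

    def word_labels(annotated):
        labels, stack = [], []
        # Split the string into tag and non-tag stretches.
        for part in re.split(r'(\[[^\]]+\])', annotated):
            m = OPEN.fullmatch(part) or WLCM.fullmatch(part)
            if m:
                stack.append(m.group(1))
            elif CLOSE.fullmatch(part):
                stack.pop()
            else:
                for tok in part.split():
                    labels.append((tok, stack[-1] if stack else 'undef'))
        return labels

    example = ('[sent-lang="bn"] Na re [incl-lang="en"] seriously [/incl] '
               'ami khub kharap achi. [/sent]')
    print(word_labels(example))
    # [('Na', 'bn'), ('re', 'bn'), ('seriously', 'en'),
    #  ('ami', 'bn'), ('khub', 'bn'), ('kharap', 'bn'), ('achi.', 'bn')]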
3.1.1 Inter-Annotator Agreement

We calculate word-level inter-annotator agreement (Cohen's Kappa) on a randomly selected subset of 100 comments between two annotators. Two annotators are in agreement about a word if they both annotate the word with the same attribute (en, bn, hi, univ, undef), regardless of whether the word is inside an inclusion, fragment or sentence. Our observations that the word-level annotation process is not a very ambiguous task and that the annotation instructions are straightforward are confirmed by a high inter-annotator agreement (IAA), with a Kappa value of 0.884.
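For reference, Cohen's Kappa is Kappa = (Po - Pe) / (1 - Pe), where Po is the observed word-level agreement and Pe is the agreement expected by chance given each annotator's label distribution. A minimal sketch of the computation follows; the toy label sequences are invented, only the formula is standard.

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        """Cohen's Kappa for two annotators' word-level labels."""
        assert len(labels_a) == len(labels_b)
        n = len(labels_a)
        # Observed agreement: fraction of words with identical labels.
        p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Chance agreement from the two marginal label distributions.
        count_a, count_b = Counter(labels_a), Counter(labels_b)
        p_e = sum(count_a[l] * count_b[l] for l in count_a) / (n * n)
        return (p_o - p_e) / (1 - p_e)

    # Toy example using the five word-level attributes described above.
    a = ['bn', 'bn', 'en', 'univ', 'bn', 'hi', 'en', 'bn']
    b = ['bn', 'bn', 'en', 'univ', 'bn', 'bn', 'en', 'bn']
    print(round(cohens_kappa(a, b), 3))  # 0.795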
3.2 Data Characteristics

Tag-level and word-level statistics of the annotated data, which reveal the characteristics of our data set, are given in Table 1 and Table 2 respectively. More than 56% of all sentences and almost 40% of all tokens are in Bengali, which is the dominant language of this corpus. English is the second most dominant language, covering almost 33% of all tokens and 35% of all sentences. The amount of Hindi data is substantially lower: nearly 1.75% of all tokens and 2% of all sentences. However, English inclusions (84% of all inclusions) are more prominent than Hindi or Bengali inclusions, and there are a substantial number of English fragments (almost 52% of all fragments) in our corpus. This means that English is the main language involved in the code mixing.

  Statistics of Different Tags
  Tag           En     Bn     Hi   Mixd  Univ   Undef
  sent          5,370  8,523  354  204   746    15
  frag          288    213    40   0     6      0
  incl          7,377  262    94   0     1,032  1
  wlcm          477
  Named Entity  3,602
  Acronym       691

Table 1: Tag-level statistics

  Word-Level Tag  Count
  EN              66,298
  BN              79,899
  HI              3,440
  WLCM            633
  NE              5,233
  ACRO            715
  UNIV            39,291
  UNDEF           61

Table 2: Word-level statistics
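As a quick check on the token shares quoted above, the percentages follow directly from Table 2. A small worked computation over the table's own counts:

    # Token shares computed from Table 2 (counts are the table's).
    counts = {'EN': 66298, 'BN': 79899, 'HI': 3440, 'WLCM': 633,
              'NE': 5233, 'ACRO': 715, 'UNIV': 39291, 'UNDEF': 61}
    total = sum(counts.values())  # 195,570 tokens
    for tag in ('BN', 'EN', 'HI'):
        print(f'{tag}: {100 * counts[tag] / total:.2f}%')
    # BN: 40.85%, EN: 33.90%, HI: 1.76% -- in line with the figures above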
3.2.1 Code Mixing Types

In our corpus, inter- and intra-sentential code mixing are more prominent than word-level code mixing, which is similar to the findings of Hidayat (2012). Our corpus contains every type of code mixing in English, Hindi and Bengali, viz. inter-sentential, intra-sentential and word-level, as described in the previous section. Some examples of the different types of code mixing in our corpus are presented below.

1. Inter-Sentential:
[sent-lang="hi"] Itna izzat diye aapne mujhe !!! [/sent]
[sent-lang="en"] Tears of joy. :'( :'( [/sent]

2. Intra-Sentential:
[sent-lang="bn"] [incl-lang="en"] by d way [/incl] ei [frag-lang="en"] my craving arms shall forever remain empty .. never hold u close .. [/frag] line ta baddo [incl-lang="en"] cheezy [/incl] :P ;) [/sent]

3. Word-Level:
[sent-lang="bn"] [incl-lang="en"] 1st yr [/incl] eo to ei [wlcm-type="en+bnSuffix"] tymer [/wlcm] modhye sobar jute jay .. [/sent]
3.2.2 Ambiguous Words

Annotators were instructed to tag an English word as English irrespective of any influence of word borrowing or foreign inclusion, but an inspection of the annotations revealed that English words were sometimes annotated as Bengali or Hindi. To understand this phenomenon, we processed the list of language (EN, BN and HI) word types (26,475 in total) and measured the percentage of types that were not always annotated with the same language throughout the corpus. The results are presented in Table 3. Almost 7% of all types are ambiguous (i.e. tagged with different languages during annotation). Among them, a substantial amount (5.58%) are English/Bengali.

  Label(s)         Count   Percentage
  EN               9,109   34.40
  BN               14,345  54.18
  HI               1,039   3.92
  EN or BN         1,479   5.58
  EN or HI         61      0.23
  BN or HI         277     1.04
  EN or BN or HI   165     0.62

Table 3: Statistics of ambiguous and monolingual word types
There are two reasons why this happens:

Same Words Across Languages: Some words are the same (e.g. baba, maa, na, khali) in Hindi and Bengali because both languages originated from a single language, Sanskrit, and share a good amount of common vocabulary. The same happens between English and Hindi and between English and Bengali as a result of word borrowing. Most of these are commonly used inclusions like clg, dept, question, cigarette, and topic. Sometimes the annotators were careful enough to tag such words as English, and sometimes these words were tagged in the annotators' native languages. During cross-checking of the annotated data, the same error patterns were observed for multiple annotators, i.e. tagging commonly used foreign words as native language. This demonstrates that these English words are highly assimilated into the conversational vocabulary of Bengali and Hindi.

Phonetic Similarity of Spellings: Due to phonetic typing, some words share the same surface form across two and sometimes three languages. As an example, to is a word in all three languages: it occurs 1209 times as English, 715 times as Bengali and 55 times as Hindi in our data. The meanings of such words (e.g. to, bolo, die) differ across languages. This phenomenon is perhaps exacerbated by the trend towards short and noisy spelling in SMC.
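Ambiguity statistics of the kind shown in Table 3 can be computed by collecting, for each word type, the set of language tags it receives across the corpus. A minimal sketch follows; the helper name and toy token stream are ours, not our actual pipeline.

    from collections import defaultdict

    def ambiguity_table(tokens):
        """tokens: iterable of (word, language_tag) pairs.
        Returns counts of word types per observed tag combination."""
        tags_per_type = defaultdict(set)
        for word, tag in tokens:
            tags_per_type[word.lower()].add(tag)
        table = defaultdict(int)
        for tags in tags_per_type.values():
            table[' or '.join(sorted(tags))] += 1
        return table

    # Toy token stream; the real corpus has 26,475 language word types.
    tokens = [('to', 'EN'), ('to', 'BN'), ('to', 'HI'), ('baddo', 'BN'),
              ('joke', 'EN'), ('khali', 'BN'), ('khali', 'HI')]
    total = len({w for w, _ in tokens})
    for label, count in sorted(ambiguity_table(tokens).items()):
        print(f'{label:15s} {count:5d} {100 * count / total:6.2f}%')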
4 Tools and Resources

We use the following resources and tools in our experiments.

Dictionaries

1. British National Corpus (BNC): We compile a word frequency list from the BNC (Aston and Burnard, 1998).

2. SEMEVAL 2013 Twitter Corpus (SemevalTwitter): To cope with the language of social media, we use the SEMEVAL 2013 (Nakov et al., 2013) training data for the Twitter sentiment analysis task. This data comes from a popular social media site and hence is likely to reflect the linguistic properties of SMC.

3. Lexical Normalization List (LexNormList): Spelling variation is a well-known phenomenon in SMC. We use a lexical normalization dictionary created by Han et al. (2012) to handle the different spelling variations in our data.
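In the dictionary-based detection of Section 5.1, a word's language is predicted from its frequency in the language dictionaries, backing off to the dominant language(s) of the corpus when the word is absent from every dictionary or tied between several. The following minimal sketch illustrates that decision rule; the tiny frequency lists are placeholders, not the BNC or our Bengali/Hindi lists.

    # Sketch of a dictionary-based word-level language identifier:
    # a word is assigned the language of the dictionary in which it is
    # most frequent; ties and out-of-vocabulary words back off to the
    # corpus-dominant language. Frequency lists below are placeholders.
    DICTS = {
        'en': {'what': 4521, 'nice': 812, 'to': 1209, 'love': 950},
        'bn': {'ami': 300, 'khub': 210, 'to': 715, 'borsho': 12},
        'hi': {'kahte': 45, 'izzat': 17, 'to': 55},
    }
    DOMINANCE_ORDER = ['bn', 'en', 'hi']  # corpus-dominant language first

    def predict_language(word):
        word = word.lower()
        freqs = {lang: d.get(word, 0) for lang, d in DICTS.items()}
        best = max(freqs.values())
        if best == 0:
            return DOMINANCE_ORDER[0]  # OOV: back off to dominant language
        candidates = [lang for lang, f in freqs.items() if f == best]
        # Tie: prefer the more dominant language of the corpus.
        return min(candidates, key=DOMINANCE_ORDER.index)

    print([predict_language(w) for w in 'ami khub love to xyz'.split()])
    # ['bn', 'bn', 'en', 'en', 'bn']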
Machine Learning Toolkits

1. WEKA: We use the Weka toolkit (Hall et al., 2009) for our experiments in decision tree training.

2. MALLET: CRF learning is applied using the MALLET toolkit (McCallum, 2002).
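As the abstract notes, contextual clues matter for the supervised approaches. The sketch below shows one plausible shape for such word-level features: character n-grams and length for the focus word, plus neighbouring words as context. The feature templates here are our illustration, not necessarily the exact set used in Section 5.

    def char_ngrams(word, n_max=3):
        # All character n-grams of length 1..n_max of the lowercased word.
        word = word.lower()
        return [word[i:i + n]
                for n in range(1, n_max + 1)
                for i in range(len(word) - n + 1)]

    def word_features(tokens, i, context=1):
        # Feature dict for token i; neighbouring words are contextual clues.
        feats = {'ng=' + g: 1 for g in char_ngrams(tokens[i])}
        feats['len=' + str(len(tokens[i]))] = 1
        for d in range(1, context + 1):
            prev = tokens[i - d] if i - d >= 0 else '<s>'
            nxt = tokens[i + d] if i + d < len(tokens) else '</s>'
            feats['prev%d=%s' % (d, prev.lower())] = 1
            feats['next%d=%s' % (d, nxt.lower())] = 1
        return feats

    tokens = 'Na re seriously ami khub kharap achi'.split()
    print(word_features(tokens, 2))  # features for 'seriously'

Feature dictionaries of this form can be vectorized and fed to a linear classifier (e.g. scikit-learn's DictVectorizer followed by LinearSVC), or used as observations in a CRF for sequence labelling.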
References

William B. Cavnar and John M. Trenkle. 1994. N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874.

E. Mark Gold. 1967. Language identification in the limit. Information and Control, 10(5):447-474.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10-18.

Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. 2003. A practical guide to support vector classification. Technical report, National Taiwan University.