Code Mixing: A Challenge for Language Identification in the Language of Social Media
Citations
TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification.
Many Languages, One Parser.
A Dataset of Hindi-English Code-Mixed Social Media Text for Hate Speech Detection.
Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text.
References
The WEKA data mining software: an update
LIBLINEAR: A Library for Large Linear Classification
A Practical Guide to Support Vector Classification
Language identification in the limit
N-gram-based text categorization
Frequently Asked Questions (15)
Q2. What have the authors stated about future work in "Code Mixing: A Challenge for Language Identification in the Language of Social Media"?
In the future the authors plan to apply the techniques and feature sets that they used in these experiments to other datasets. The authors did not include word-level code mixing in their experiments; in future experiments they will explore ways to identify and segment this type of code mixing. It will also be important to find the best way to handle inclusions, since there is a fine line between word borrowing and code mixing.
Q3. How many Hindi tokens are in the training data?
Only 2.4% (4,658) of the total tokens in the training data are Hindi, of which 55.36% are bilingually ambiguous and 29.51% are trilingually ambiguous.
Q4. What is the most dominant language among the users?
Since these Facebook users are from West Bengal (https://www.facebook.com/jumatrimonial), the most dominant language is Bengali (the native language), followed by English and then Hindi (the national language of India).
Q5. How many iterations are used for different feature combinations?
The authors employ a linear-chain CRF of increasing order (Order-0, Order-1 and Order-2) with 200 iterations for different feature combinations (as used in the SVM-based runs).
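The answer does not name the CRF toolkit; below is a minimal Order-1 linear-chain sketch using the third-party sklearn-crfsuite package, trained with 200 L-BFGS iterations as reported. The token features are illustrative stand-ins, not the paper's exact feature combinations.

```python
# Minimal Order-1 linear-chain CRF sketch (pip install sklearn-crfsuite).
# Feature set and data are illustrative, not the paper's exact setup.
import sklearn_crfsuite

def token_features(sentence, i):
    word = sentence[i]
    feats = {
        "lower": word.lower(),   # word identity
        "length": len(word),     # word length
        "suffix3": word[-3:],    # crude character n-gram cue
    }
    if i > 0:
        feats["prev_lower"] = sentence[i - 1].lower()
    if i < len(sentence) - 1:
        feats["next_lower"] = sentence[i + 1].lower()
    return feats

def featurize(sentence):
    return [token_features(sentence, i) for i in range(len(sentence))]

# Toy code-mixed comment with word-level language tags.
sentences = [["ami", "khub", "happy", "aaj"]]
labels = [["bn", "bn", "en", "hi"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=200)
crf.fit([featurize(s) for s in sentences], labels)
print(crf.predict([featurize(["tumi", "khub", "happy"])]))
```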
Q6. How is a decision tree trained on word length?
The authors use length as the only feature to train a decision tree for each fold, and use the nodes obtained from the tree to create Boolean features.
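A minimal sketch of this idea with scikit-learn, on toy words and labels: fit a decision tree on word length alone, read the split thresholds off the fitted tree's nodes, and turn each threshold into a Boolean feature.

```python
# Sketch: train a decision tree on word length only, then convert each
# split threshold found by the tree into a Boolean feature. Data, labels
# and the depth limit are illustrative, not the paper's settings.
from sklearn.tree import DecisionTreeClassifier

words = ["ami", "khub", "happy", "aaj", "congratulations", "na"]
langs = ["bn", "bn", "en", "hi", "en", "bn"]

X = [[len(w)] for w in words]  # word length is the only feature
tree = DecisionTreeClassifier(max_depth=2).fit(X, langs)

# Internal nodes store split thresholds; leaves are marked with -2.
thresholds = sorted(t for t in tree.tree_.threshold if t != -2)

def boolean_length_features(word):
    # One Boolean feature per split node: is len(word) <= threshold?
    return {f"len<={t:.1f}": len(word) <= t for t in thresholds}

print(boolean_length_features("happy"))
```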
Q7. Why is code mixing common in India?
Code mixing is very frequent in the Indian subcontinent because languages change within very short geographical distances and people generally have a basic knowledge of their neighboring languages.
Q8. Which systems are applied to the held-out test set?
The authors apply their best dictionary-based system, their best SVM system (with and without context) and their best CRF system to the held-out test set.
Q9. Which language is most prominent among the inclusions in the corpus?
English inclusions (84% of all inclusions) are more prominent than Hindi or Bengali inclusions, and there are a substantial number of English fragments (almost 52% of all fragments) in their corpus.
Q10. What is the Kappa value of the word-level annotation process?
Their observations that the word-level annotation process is not a very ambiguous task and that the annotation instructions are straightforward are confirmed by a high inter-annotator agreement (IAA), with a Kappa value of 0.884.
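For two annotators, such agreement can be computed with Cohen's kappa; a minimal sketch with scikit-learn on toy tag sequences (the answer does not state which kappa variant the paper uses):

```python
# Sketch: inter-annotator agreement on word-level language tags via
# Cohen's kappa (scikit-learn); the two toy sequences below disagree on
# one of eight tokens, giving kappa = 0.8.
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["bn", "bn", "en", "hi", "en", "bn", "en", "bn"]
annotator_2 = ["bn", "bn", "en", "hi", "en", "bn", "hi", "bn"]

print(cohen_kappa_score(annotator_1, annotator_2))  # 0.8
```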
Q11. What is the cross-validation accuracy for the P1N1?
After C parameter optimization, the best cross-validation accuracy is found for the P1N1 (one word previous and one word next) run with C=0.125 (95.14%).
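A hedged sketch of a P1N1-style run: each token's feature dictionary includes the previous and next word, vectorized and fed to LinearSVC (scikit-learn's wrapper around LIBLINEAR, which the paper references) with the reported C=0.125. The data and feature set are toy placeholders.

```python
# Sketch of a P1N1-style word-level SVM: each token's features include
# the previous (P1) and next (N1) word. Data are toy placeholders.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def p1n1_features(sentence, i):
    return {
        "word": sentence[i].lower(),
        "prev": sentence[i - 1].lower() if i > 0 else "<S>",
        "next": sentence[i + 1].lower() if i < len(sentence) - 1 else "</S>",
    }

sentences = [["ami", "khub", "happy", "aaj"], ["tumi", "kemon", "acho"]]
labels = [["bn", "bn", "en", "hi"], ["bn", "bn", "bn"]]

X_dicts = [p1n1_features(s, i) for s in sentences for i in range(len(s))]
y = [tag for tags in labels for tag in tags]

vec = DictVectorizer()
clf = LinearSVC(C=0.125)  # the best C found by cross-validation in the paper
clf.fit(vec.fit_transform(X_dicts), y)
print(clf.predict(vec.transform([p1n1_features(["khub", "happy"], 1)])))
```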
Q12. What is the definition of a dictionary-based language detector?
Generally a dictionary-based language detector predicts the language of a word based on its frequency in multiple language dictionaries.
Q13. How is the predicted language of a word chosen?
The predicted language is chosen based on the dominant language(s) of the corpus if the word appears in multiple dictionaries with the same frequency, or if the word does not appear in any dictionary or list.
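Q12 and Q13 together describe a frequency lookup with a dominance-based fallback; a minimal sketch, using toy dictionaries and the Bengali > English > Hindi dominance order from Q4:

```python
# Sketch combining Q12 and Q13: predict the language whose dictionary
# gives the word its highest frequency; fall back to the corpus-dominant
# language order on ties or unknown words. Dictionaries are toy data.
DICTS = {
    "bn": {"ami": 120, "khub": 80, "na": 60},
    "en": {"happy": 200, "na": 60, "die": 30},
    "hi": {"aaj": 90, "na": 60},
}
DOMINANCE = ["bn", "en", "hi"]  # most to least dominant in the corpus

def predict_language(word):
    freqs = {lang: d[word] for lang, d in DICTS.items() if word in d}
    if not freqs:
        return DOMINANCE[0]  # unknown word: back off to dominant language
    best = max(freqs.values())
    tied = [lang for lang, f in freqs.items() if f == best]
    return min(tied, key=DOMINANCE.index)  # break ties by dominance

print(predict_language("na"))     # three-way tie -> 'bn'
print(predict_language("happy"))  # -> 'en'
print(predict_language("xyz"))    # unknown -> 'bn'
```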
Q14. What is an example of a word-level classification error?
As an example, part of a comment from cross-validation fold 1 is presented that contains the word "die", which is wrongly classified by the SVM classifier.
Q15. How can named entities and word-level code mixing be handled?
For systems that do not take the context of a word into account, i.e. the dictionary-based approach (Section 5.1) and the SVM approach without contextual clues (Section 5.2), named entities and instances of word-level code mixing can be safely excluded from training.
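A minimal sketch of this exclusion step; the tag names "ne" and "mixed" are hypothetical placeholders for the corpus annotation scheme:

```python
# Sketch: drop named entities and word-level code-mixed tokens before
# training a context-free classifier. Tags are hypothetical placeholders.
EXCLUDED_TAGS = {"ne", "mixed"}

def filter_training_tokens(tokens, tags):
    kept = [(w, t) for w, t in zip(tokens, tags) if t not in EXCLUDED_TAGS]
    return [w for w, _ in kept], [t for _, t in kept]

words, tags = filter_training_tokens(
    ["ami", "Kolkata", "happy", "collegelife"],
    ["bn", "ne", "en", "mixed"],
)
print(words, tags)  # ['ami', 'happy'] ['bn', 'en']
```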