Corpus creation for sentiment analysis in code-mixed Tamil-English text

doi:10.5281/ZENODO.4015253

Open Access Proceedings Article

- pp 202-210

TLDR

This paper creates a gold standard Tamil-English code-switched, sentiment-annotated corpus of 15,744 comment posts from YouTube, presents inter-annotator agreement, and shows the results of sentiment analysis trained on the corpus.

Abstract:

Understanding the sentiment of a comment from a video or an image is an essential task in many applications. Sentiment analysis of a text can be useful for various decision-making processes. One such application is to analyse the popular sentiments of videos on social media based on viewer comments. However, comments from social media do not follow strict rules of grammar, and they contain mixing of more than one language, often written in non-native scripts. Non-availability of annotated code-mixed data for a low-resourced language like Tamil also adds difficulty to this problem. To overcome this, we created a gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube. In this paper, we describe the process of creating the corpus and assigning polarities. We present inter-annotator agreement and show the results of sentiment analysis trained on this corpus as a benchmark.
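The abstract mentions inter-annotator agreement without saying how it is computed. As an illustration only, the sketch below computes Krippendorff's alpha for nominal labels, a coefficient commonly used when several annotators label the same comments. Whether the paper reports this exact coefficient is an assumption here; the function name, the label strings, and the toy annotations are hypothetical and are not taken from the corpus.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: one inner list per comment, holding the label given by each
    annotator; None marks a missing annotation. (Hypothetical input
    layout chosen for this sketch.)
    """
    # Coincidence matrix: every ordered pair of labels within a comment
    # contributes 1 / (m - 1), where m is the number of labels it received.
    coincidences = Counter()
    for ratings in units:
        labels = [r for r in ratings if r is not None]
        m = len(labels)
        if m < 2:
            continue  # a single label cannot be paired, so it is skipped
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of pairable labels.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least two pairable labels")

    # Observed and expected disagreement (common factor 1/n cancelled).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label occurs")
    return 1.0 - observed / expected


# Toy annotations from two hypothetical annotators (not corpus data).
toy_units = [
    ["positive", "positive"],
    ["positive", "negative"],
    ["negative", "negative"],
    ["negative", "negative"],
]
alpha = krippendorff_alpha_nominal(toy_units)
```

A value of 1 means perfect agreement and a value near 0 means agreement no better than chance. The nominal variant treats every pair of distinct labels as equally different, which suits unordered classes such as positive, negative, neutral and mixed feelings.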

Citations

Proceedings Article

Overview of the HASOC Track at FIRE 2020: Hate Speech and Offensive Language Identification in Tamil, Malayalam, Hindi, English and German

Thomas Mandl, +3 more

TL;DR: The HASOC track is dedicated to evaluating technology for finding offensive language and hate speech. The track attracted much interest: over 40 research groups participated and described their approaches in papers.

Proceedings Article

Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text

Bharathi Raja Chakravarthi, +6 more

TL;DR: The Dravidian-CodeMix-FIRE 2020 track focused on sentiment analysis of code-mixed text for Tamil and Malayalam. Participants were given a dataset of YouTube comments, and the goal of the shared task submissions was to recognise the sentiment of each comment by classifying it as positive, negative, neutral, or mixed-feeling, or by recognising that the comment is not in the intended language.

Multimodal Meme Dataset (MultiOFF) for Identifying Offensive Content in Image and Text

Shardul Suryawanshi, +3 more

TL;DR: An early fusion technique is used to combine the image and text modalities, and it is compared with a text-only and an image-only baseline to investigate its effectiveness; results show improvements in terms of Precision, Recall, and F-Score.

Proceedings Article

A Sentiment Analysis Dataset for Code-Mixed Malayalam-English

Bharathi Raja Chakravarthi, +4 more

TL;DR: In this article, a model of language models for minority and historical languages was developed using a research grant from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289 P2 (Insight 2), co-funded by the European Regional Development Fund, as well as the EU H2020 programme under grant agreements 731015 (ELEXIS-European Lexical Infrastructure), 825182 (Prêt-à-LLOD), and IRCLA/2017/129 (CARDAMOM-Comparative Deep Models of Language

HopeEDI: A Multilingual Hope Speech Detection Dataset for Equality, Diversity, and Inclusion

Bharathi Raja Chakravarthi

TL;DR: This paper annotates hope speech for equality, diversity and inclusion in a multilingual setting and determines the inter-annotator agreement of the dataset using Krippendorff's alpha.

References

Proceedings Article

Mining and summarizing customer reviews

Minqing Hu, +1 more

TL;DR: This research aims to mine and to summarize all the customer reviews of a product, and proposes several novel techniques to perform these tasks.

Journal Article

Annotating Expressions of Opinions and Emotions in Language

Janyce Wiebe, +2 more

TL;DR: The manual annotation process and the results of an inter-annotator agreement study on a 10,000-sentence corpus of articles drawn from the world press are presented.

Journal Article

Estimating the Reliability, Systematic Error and Random Error of Interval Data

Klaus Krippendorff

- 01 Apr 1970 -

Educational and Psychological Measuremen...

TL;DR: In content analysis, when developing recording instructions, defining units of analysis and operationalizing scales, the researcher requires detailed information about the sources and kinds of unreliability, and overall measures of agreement do not readily provide such information.

Proceedings Article

Code Mixing: A Challenge for Language Identification in the Language of Social Media

Utsab Barman, +3 more

TL;DR: A new dataset is described, which contains Facebook posts and comments that exhibit code mixing between Bengali, English and Hindi, and it is found that the dictionary-based approach is surpassed by supervised classification and sequence labelling, and that it is important to take contextual clues into consideration.

Proceedings Article

Overview for the First Shared Task on Language Identification in Code-Switched Data

Thamar Solorio, +10 more

TL;DR: The evaluation showed that language identification at the token level is more difficult when the languages present are closely related, as in the case of MSA-DA, where the prediction performance was the lowest among all language pairs.

Related Papers (5)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Named Entity Recognition for Code-Mixed Indian Corpus using Meta Embedding

Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text

Overview of the HASOC Track at FIRE 2020: Hate Speech and Offensive Language Identification in Tamil, Malayalam, Hindi, English and German

Sentiment Analysis in Tamil Texts: A Study on Machine Learning Techniques and Feature Representation