Open Access · Proceedings Article

Corpus creation for sentiment analysis in code-mixed Tamil-English text

TLDR
The authors create a gold standard Tamil-English code-switched, sentiment-annotated corpus of 15,744 comment posts from YouTube, report inter-annotator agreement, and show the results of sentiment analysis models trained on this corpus.
Abstract
Understanding the sentiment of a comment from a video or an image is an essential task in many applications. Sentiment analysis of a text can be useful for various decision-making processes. One such application is to analyse the popular sentiments of videos on social media based on viewer comments. However, comments from social media do not follow strict rules of grammar, and they contain mixing of more than one language, often written in non-native scripts. Non-availability of annotated code-mixed data for a low-resourced language like Tamil also adds difficulty to this problem. To overcome this, we created a gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube. In this paper, we describe the process of creating the corpus and assigning polarities. We present inter-annotator agreement and show the results of sentiment analysis trained on this corpus as a benchmark.
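
The abstract does not say which agreement measure was used. As one common choice for nominal labels assigned by several annotators, Krippendorff's alpha could be computed roughly as in the minimal sketch below. The function name, the input layout (one list of labels per comment), and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists (hypothetical layout): each inner list holds
    the labels that the annotators assigned to one comment. Annotators who
    skipped a comment are simply absent from that inner list.
    """
    # Coincidence matrix: every ordered pair of labels given to the same
    # comment by two different annotators, weighted by 1 / (m - 1).
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single label carries no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable labels.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one comment with two or more labels")

    # Observed and expected disagreement (nominal metric: 1 if labels differ).
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")

    return 1.0 - observed / expected


if __name__ == "__main__":
    # Toy data: three comments, two annotators each (made up for illustration).
    toy = [["positive", "positive"],
           ["positive", "negative"],
           ["negative", "negative"]]
    print(krippendorff_alpha_nominal(toy))
```

A value of 1 indicates perfect agreement, 0 indicates agreement at chance level, and negative values indicate systematic disagreement.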

Citations
Proceedings Article

Overview of the HASOC Track at FIRE 2020: Hate Speech and Offensive Language Identification in Tamil, Malayalam, Hindi, English and German

TL;DR: The HASOC track is dedicated to evaluating technology for identifying offensive language and hate speech. It attracted considerable interest: over 40 research groups participated and described their approaches in papers.
Proceedings Article

Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text

TL;DR: The Dravidian-CodeMix-FIRE 2020 track focused on sentiment analysis of code-mixed text in Tamil and Malayalam. Participants were given a dataset of YouTube comments, and the goal of the shared task was to recognise the sentiment of each comment by classifying it as positive, negative, neutral, or mixed-feeling, or by recognising that the comment is not in the intended language.

Multimodal Meme Dataset (MultiOFF) for Identifying Offensive Content in Image and Text

TL;DR: An early fusion technique combines the image and text modalities and is compared with text-only and image-only baselines to investigate its effectiveness; the results show improvements in precision, recall, and F-score.
Proceedings Article

A Sentiment Analysis Dataset for Code-Mixed Malayalam-English

TL;DR: In this article, a model of language models for minority and historical languages was developed using a research grant from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289 P2 (Insight 2), co-funded by the European Regional Development Fund as well as the EU H2020 programme under grant agreements 731015 (ELEXIS-European Lexical Infrastructure), 825182 (Prêt-à-LLOD), and IRCLA/2017/129 (CARDAMOM-Comparative Deep Models of Language

HopeEDI: A Multilingual Hope Speech Detection Dataset for Equality, Diversity, and Inclusion

TL;DR: This paper annotates hope speech for equality, diversity, and inclusion in a multilingual setting and determines the inter-annotator agreement of the dataset using Krippendorff's alpha.
References
Proceedings Article

Mining and summarizing customer reviews

TL;DR: This research aims to mine and to summarize all the customer reviews of a product, and proposes several novel techniques to perform these tasks.
Journal Article

Annotating Expressions of Opinions and Emotions in Language

TL;DR: The manual annotation process and the results of an inter-annotator agreement study on a 10,000-sentence corpus of articles drawn from the world press are presented.
Journal Article

Estimating the Reliability, Systematic Error and Random Error of Interval Data

TL;DR: In content analysis, when developing recording instructions, defining units of analysis, and operationalizing scales, the researcher requires detailed information about the sources and kinds of unreliability; overall measures of agreement do not readily provide such information.
Proceedings Article

Code Mixing: A Challenge for Language Identification in the Language of Social Media

TL;DR: A new dataset is described, which contains Facebook posts and comments that exhibit code mixing between Bengali, English and Hindi, and it is found that the dictionary-based approach is surpassed by supervised classification and sequence labelling, and that it is important to take contextual clues into consideration.
Proceedings Article

Overview for the First Shared Task on Language Identification in Code-Switched Data

TL;DR: The evaluation showed that language identification at the token level is more difficult when the languages present are closely related, as in the case of MSA-DA, where the prediction performance was the lowest among all language pairs.