Proceedings Article

A Survey of Current Datasets for Code-Switching Research

TLDR
The paper proposes a set of quality metrics to evaluate code-switching datasets and categorize them accordingly, in support of natural language processing tasks such as part-of-speech tagging, named entity recognition, sentiment analysis, conversational systems, and machine translation.
Abstract
Code-switching is a prevalent phenomenon in multilingual communities and in social media interaction. In the past ten years, we have witnessed an explosion of code-switched data on social media that brings together languages, from low-resourced to high-resourced, in the same text, sometimes written in a non-native script. This increases the demand for processing code-switched data to assist users in various natural language processing tasks such as part-of-speech tagging, named entity recognition, sentiment analysis, conversational systems, and machine translation. The available corpora for code-switching research have played a major role in advancing this area of research. In this paper, we propose a set of quality metrics to evaluate these datasets and categorize them accordingly.

