scispace - formally typeset
Search or ask a question
Author

Sajeetha Thavareesan

Bio: Sajeetha Thavareesan is an academic researcher from Eastern University (United States). The author has contributed to research in topics: Tamil & Dravidian languages. The author has an hindex of 6, co-authored 21 publications receiving 194 citations.

Papers
More filters
Proceedings ArticleDOI
01 Dec 2019
TL;DR: Basic features such as word count and punctuation count are used in addition to traditional features including Bag of Words and Term Frequency-Inverse Document Frequency included to check their influence in the prediction.
Abstract: Sentiment Analysis (SA) is an application of Natural Language Processing (NLP) to extract the sentiments expressed in the text. In this paper, we experimented five approaches to perform SA, namely, Lexicon based approach, Supervised Machine learning based approach, Hybrid approach, K-means with Bag of Word (BoW) approach and K-modes with BoW approach. We have experimented these approaches using five corpora with different feature representation techniques to predict the best approach to perform SA in Tamil texts. In this research we used Basic features such as word count and punctuation count in addition to traditional features such as Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) included to check their influence in the prediction. We have compared these approaches, features and the corpora. From the evaluation the highest accuracy of 79% is obtained for UJ_Corpus_Opinions_Nouns corpus with fastText for supervised Machine learning based approach.

91 citations

Proceedings ArticleDOI
28 Jul 2020
TL;DR: A sentiment lexicon expansion method using Word2vec and fastText word embeddings along with rule-based Sentiment Analysis method, which uses expanded lexicons, lists of conjunctions and negational words to predict the sentiments expressed in Tamil texts is proposed.
Abstract: Sentiment Analysis is the process of identifying and categorising the sentiments expressed in a text into positive or negative. The words which carry the sentiments are the keys in sentiment prediction. The SentiWordNet is the sentiment lexicon used to determine the sentiment of texts. There are huge number of sentiment terms that are not in the SentiWordNet limit the performance of Sentiment Analysis. Gathering and grouping such sentiment words manually is a tedious task. In this paper we propose a sentiment lexicon expansion method using Word2vec and fastText word embeddings along with rule-based Sentiment Analysis method. We expand the sentiment lexicon from the initial seed list of 2951 positive and 5598 negative words in two steps: (i) Gathering related words using Word2vec word embedding and (ii) Gathering lexically similar words using fastText word embedding. Our final lexicons UJ_Lex_Pos and UJ_Lex_Neg ended up with 10537 positive and 12664 negative words respectively which are labelled using Word2vec word embedding. Furthermore the rule-based Sentiment Analysis method uses expanded lexicons (UJ_Lex_Pos and UJ_Lex_Neg), lists of conjunctions and negational words to predict the sentiments expressed in Tamil texts. The method is evaluated on UJ_MovieReviews and an accuracy of 88 0.14% is obtained.

88 citations

Proceedings ArticleDOI
26 Nov 2020
TL;DR: In this article, a word embedding-based POS tagger for Tamil language is proposed, where the experiments are conducted with different word embeddings BoW, TF-IDF, Word2vec, fastText and GloVe.
Abstract: This paper proposes a word embedding-based Part of Speech (POS) tagger for Tamil language The experiments are conducted with different word embeddings BoW, TF-IDF, Word2vec, fastText and GloVe that are created using UJ-Tamil corpus Different combinations of eight features with three classifiers linear SVM, Extreme Gradient Boosting and k-Nearest Neighbor are used to build the POS tagger The results are compared against Viterbi algorithm-based POS tagger The results show that word embedding can be used for POS tagging with good performance BoW, TF-IDF and fastText give an impressive performance compared with Word2vec and GloVe The accuracy of 99% is obtained with word embedding of BoW and TF-IDF with unigrams as well as bigrams and with linear SVM classifier POS tag of a given word can be identified with 99% of accuracy using word embeddings based POS tagger in Tamil

79 citations

Proceedings ArticleDOI
01 Jan 2022
TL;DR: The dataset used in the shared task, task description, and the methodology used by the participants and the evaluation results of the submission are presented.
Abstract: This paper presents the overview of the shared task on emotional analysis in Tamil. The result of the shared task is presented at the workshop. This paper presents the dataset used in the shared task, task description, and the methodology used by the participants and the evaluation results of the submission. This task is organized as two Tasks. Task A is carried with 11 emotions annotated data for social media comments in Tamil and Task B is organized with 31 fine-grained emotion annotated data for social media comments in Tamil. For conducting experiments, training and development datasets were provided to the participants and results are evaluated for the unseen data. Totally we have received around 24 submissions from 13 teams. For evaluating the models, Precision, Recall, micro average metrics are used.

45 citations

Journal ArticleDOI
12 May 2022
TL;DR: This paper outlines the dataset released, methods, and results of the submitted systems, and provides Tamil-English code-mixed social comments with offensive spans with annotated data for offensive spans.
Abstract: Offensive content moderation is vital in social media platforms to support healthy online discussions. However, their prevalence in code-mixed Dravidian languages is limited to classifying whole comments without identifying part of it contributing to offensiveness. Such limitation is primarily due to the lack of annotated data for offensive spans. Accordingly, in this shared task, we provide Tamil-English code-mixed social comments with offensive spans. This paper outlines the dataset so released, methods, and results of the submitted systems.

37 citations


Cited by
More filters
01 Jan 2013
TL;DR: This book gives a comprehensive view of state-of-the-art techniques that are used to build spoken dialogue systems and presents dialogue modelling and system development issues relevant in both academic and industrial environments and also discusses requirements and challenges for advanced interaction management and future research.
Abstract: Considerable progress has been made in recent years in the development of dialogue systems that support robust and efficient human–machine interaction using spoken language. Spoken dialogue technology allows various interactive applications to be built and used for practical purposes, and research focuses on issues that aim to increase the system’s communicative competence by including aspects of error correction, cooperation, multimodality, and adaptation in context. This book gives a comprehensive view of state-of-the-art techniques that are used to build spoken dialogue systems. It provides an overview of the basic issues such as system architectures, various dialogue management methods, system evaluation, and also surveys advanced topics concerning extensions of the basic model to more conversational setups. The goal of the book is to provide an introduction to the methods, problems, and solutions that are used in dialogue system development and evaluation. It presents dialogue modelling and system development issues relevant in both academic and industrial environments and also discusses requirements and challenges for advanced interaction management and future research. vi KEywoRDS Spoken dialogue systems, multimodality, evaluation, error-handling, dialogue management, statistical method v MC_Jok nen_FM. ndd Achorn Internat onal 10/10/2009 04:18AM

304 citations

01 Apr 2021
TL;DR: The shared task of hope speech detection for Tamil, English, and Malayalam languages was conducted as a part of the EACL 2021 workshop on Language Technology for Equality, Diversity, and Inclusion.
Abstract: Hope is considered significant for the well-being, recuperation and restoration of human life by health professionals. Hope speech reflects the belief that one can discover pathways to their desired objectives and become roused to utilise those pathways. To encourage research in natural language processing towards positive reinforcement approach, we created a hope speech detection dataset. This paper reports on the shared task of hope speech detection for Tamil, English, and Malayalam languages. The shared task was conducted as a part of the EACL 2021 workshop on Language Technology for Equality, Diversity, and Inclusion (LT-EDI-2021). We summarize here the datasets for this challenge which are openly available at https://competitions.codalab.org/competitions/27653, and present an overview of the methods and the results of the competing systems. To the best of our knowledge, this is the first shared task to conduct hope speech detection.

107 citations

Proceedings ArticleDOI
28 Jul 2020
TL;DR: A sentiment lexicon expansion method using Word2vec and fastText word embeddings along with rule-based Sentiment Analysis method, which uses expanded lexicons, lists of conjunctions and negational words to predict the sentiments expressed in Tamil texts is proposed.
Abstract: Sentiment Analysis is the process of identifying and categorising the sentiments expressed in a text into positive or negative. The words which carry the sentiments are the keys in sentiment prediction. The SentiWordNet is the sentiment lexicon used to determine the sentiment of texts. There are huge number of sentiment terms that are not in the SentiWordNet limit the performance of Sentiment Analysis. Gathering and grouping such sentiment words manually is a tedious task. In this paper we propose a sentiment lexicon expansion method using Word2vec and fastText word embeddings along with rule-based Sentiment Analysis method. We expand the sentiment lexicon from the initial seed list of 2951 positive and 5598 negative words in two steps: (i) Gathering related words using Word2vec word embedding and (ii) Gathering lexically similar words using fastText word embedding. Our final lexicons UJ_Lex_Pos and UJ_Lex_Neg ended up with 10537 positive and 12664 negative words respectively which are labelled using Word2vec word embedding. Furthermore the rule-based Sentiment Analysis method uses expanded lexicons (UJ_Lex_Pos and UJ_Lex_Neg), lists of conjunctions and negational words to predict the sentiments expressed in Tamil texts. The method is evaluated on UJ_MovieReviews and an accuracy of 88 0.14% is obtained.

88 citations

Proceedings ArticleDOI
16 Dec 2020
TL;DR: The Dravidian-CodeMix-FIRE 2020 Track as discussed by the authors focused on sentiment analysis of code-mixed text in code mixed text for Tamil and Malayalam, and participants were given a dataset of YouTube comments and the goal of the shared task submissions was to recognise the sentiment of each comment by classifying them into positive, negative, neutral, mixed-feeling classes or by recognizing whether the comment is not in the intended language.
Abstract: Sentiment analysis of Dravidian languages has received attention in recent years However, most social media text is code-mixed and there is no research available on sentiment analysis of code-mixed Dravidian languages The Dravidian-CodeMix-FIRE 2020, a track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text, focused on creating a platform for researchers to come together and investigate the problem There were two languages for this track: (i) Tamil, and (ii) Malayalam The participants were given a dataset of YouTube comments and the goal of the shared task submissions was to recognise the sentiment of each comment by classifying them into positive, negative, neutral, mixed-feeling classes or by recognising whether the comment is not in the intended language The performance of the systems was evaluated by weighted-F1 score

87 citations

Proceedings ArticleDOI
01 Jan 2022
TL;DR: The dataset used in the shared task, task description, and the methodology used by the participants and the evaluation results of the submission are presented.
Abstract: This paper presents the overview of the shared task on emotional analysis in Tamil. The result of the shared task is presented at the workshop. This paper presents the dataset used in the shared task, task description, and the methodology used by the participants and the evaluation results of the submission. This task is organized as two Tasks. Task A is carried with 11 emotions annotated data for social media comments in Tamil and Task B is organized with 31 fine-grained emotion annotated data for social media comments in Tamil. For conducting experiments, training and development datasets were provided to the participants and results are evaluated for the unseen data. Totally we have received around 24 submissions from 13 teams. For evaluating the models, Precision, Recall, micro average metrics are used.

45 citations