Author

Anirudh Dahiya

Bio: Anirudh Dahiya is an academic researcher from the International Institute of Information Technology, Hyderabad. The author has contributed to research on the topics of Hindi and sentiment analysis, has an h-index of 2, and has co-authored 3 publications receiving 6 citations.

Papers
Posted Content
TL;DR: This paper introduces curriculum learning strategies for semantic tasks in code-mixed Hindi-English (Hi-En) texts and investigates various training strategies for enhancing model performance; the proposed method outperforms state-of-the-art methods for Hi-En code-mixed sentiment analysis.
Abstract: Sentiment analysis and other semantic tasks are commonly used in social media text analysis to gauge public opinion and make sense of the noise on social media. The language used on social media not only diverges from formal language but is further compounded by code-mixing between languages, especially in large multilingual societies like India. Traditional methods for learning semantic NLP tasks have long relied on end-to-end, task-specific training, which requires an expensive data-creation process, even more so for deep learning methods. The challenge is even more severe for resource-scarce texts such as code-mixed language pairs, which lack well-learnt representations to serve as model priors and whose task-specific datasets are often too few and too small to efficiently exploit recent deep learning approaches. To address these challenges, we introduce curriculum learning strategies for semantic tasks in code-mixed Hindi-English (Hi-En) texts and investigate various training strategies for enhancing model performance. Our method outperforms state-of-the-art methods for Hi-En code-mixed sentiment analysis by 3.31% accuracy, and also shows better model robustness in terms of convergence and variance in test performance.
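As a rough illustration of the curriculum learning idea described in the abstract (not the authors' actual implementation), the sketch below sorts training examples from easy to hard with a hypothetical length-based difficulty score and trains an incremental classifier on progressively larger easy-to-hard slices of the data; the toy texts, labels, and choice of classifier are all assumptions.

    # Minimal curriculum-learning sketch (illustrative only; not the paper's code).
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    # Toy code-mixed sentiment examples (invented for illustration).
    texts = ["movie acchi thi", "bohot hi bakwaas film thi yaar", "loved it", "worst acting ever dekha maine"]
    labels = [1, 0, 1, 0]

    def difficulty(text):
        # Hypothetical proxy: longer sentences are treated as harder.
        return len(text.split())

    vectorizer = HashingVectorizer(n_features=2**12)
    clf = SGDClassifier(loss="log_loss")

    # Sort the whole training set once, then expose a larger easy-to-hard prefix each stage.
    ordered = sorted(zip(texts, labels), key=lambda ex: difficulty(ex[0]))
    num_stages = 2
    for stage in range(1, num_stages + 1):
        subset = ordered[: int(len(ordered) * stage / num_stages)]
        X = vectorizer.transform([t for t, _ in subset])
        y = [label for _, label in subset]
        clf.partial_fit(X, y, classes=[0, 1])  # one incremental pass per curriculum stage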

4 citations

Book Chapter
10 Aug 2019
TL;DR: This work introduces curriculum learning strategies for semantic tasks in code-mixed Hindi-English (Hi-En) texts, and investigates various training strategies for enhancing model performance.
Abstract: Sentiment analysis and other semantic tasks are commonly used in social media text analysis to gauge public opinion and make sense of the noise on social media. The language used on social media not only diverges from formal language but is further compounded by code-mixing between languages, especially in large multilingual societies like India.

3 citations

Book Chapter
08 Sep 2020
TL;DR: This work explores various cross-lingual transfer techniques on the Hindi Discourse Relation Bank (HDRB), a Penn Discourse Treebank-style dataset for discourse analysis in Hindi, and observes performance gains in both zero-shot and fine-tuning settings on the Hindi discourse relation classification task.
Abstract: Discourse relations between two textual spans in a document attempt to capture the coherent structure that emerges in language use. Automatic classification of these relations remains a challenging task, especially in the case of implicit discourse relations, where there is no explicit textual cue marking the relation. In low-resource languages, this motivates the exploration of transfer learning approaches, particularly cross-lingual techniques, for discourse relation classification. In this work, we explore various cross-lingual transfer techniques on the Hindi Discourse Relation Bank (HDRB), a Penn Discourse Treebank-style dataset for discourse analysis in Hindi, and observe performance gains in both zero-shot and fine-tuning settings on the Hindi discourse relation classification task. To the best of our knowledge, this is the first effort towards exploring transfer learning for Hindi discourse relation classification.
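A minimal sketch of the cross-lingual transfer setting described above, assuming a multilingual encoder (XLM-RoBERTa) from the Hugging Face transformers library; the coarse label set, the Hindi argument pair, and the model choice are assumptions for illustration, not the paper's actual pipeline.

    # Cross-lingual transfer sketch (assumed setup; not the authors' exact code).
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Four coarse PDTB-style senses are assumed as the label set.
    labels = ["Comparison", "Contingency", "Expansion", "Temporal"]

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "xlm-roberta-base", num_labels=len(labels)
    )

    # The two discourse arguments are encoded as a sentence pair.
    arg1 = "वह बहुत थका हुआ था"        # "He was very tired"
    arg2 = "फिर भी वह काम करता रहा"    # "Still, he kept working"
    inputs = tokenizer(arg1, arg2, return_tensors="pt", truncation=True)

    # Zero-shot transfer would apply a model fine-tuned on English discourse data
    # directly to Hindi input; the fine-tuning setting would instead continue
    # training this classification head on HDRB examples.
    with torch.no_grad():
        logits = model(**inputs).logits
    print(labels[logits.argmax(dim=-1).item()])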

Cited by
More filters
Proceedings Article
10 Dec 2020
TL;DR: The authors propose an ensemble approach that hybridizes Naive Bayes, SVM, Linear Regression, and SGD classifiers for sentiment classification of code-mixed Hindi-English text.
Abstract: India is a multilingual and multi-script country, and a large part of its population speaks more than one language. It has been noted that such multilingual speakers switch between languages while communicating informally. Code-mixed language is very common in informal communication and social media, and extracting sentiment from these code-mixed sentences is a challenging task. In this work, we address sentiment classification for one of the most common code-mixed language pairs in India, i.e., Hindi-English. Conventional sentiment analysis techniques designed for a single language do not provide satisfactory results for such texts. We propose two approaches for better sentiment classification: an ensembling-based approach that hybridizes Naive Bayes, SVM, Linear Regression, and SGD classifiers, and a novel bidirectional-LSTM-based approach. Both approaches provide quite satisfactory results for code-mixed Hindi-English text.
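For illustration, the sketch below hybridizes the classifiers named in the abstract with scikit-learn's hard-voting ensemble over TF-IDF features; "Linear Regression" is interpreted here as logistic regression for the classification setting, and the toy data and hyperparameters are invented, so this is only an assumed approximation of the paper's approach.

    # Voting-ensemble sketch (illustrative; not the authors' implementation).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    from sklearn.linear_model import LogisticRegression, SGDClassifier
    from sklearn.ensemble import VotingClassifier
    from sklearn.pipeline import make_pipeline

    # Toy code-mixed Hindi-English examples (invented for illustration).
    texts = ["movie bahut acchi thi", "total bakwaas film", "kya mast gaana hai", "boring story yaar"]
    labels = [1, 0, 1, 0]

    ensemble = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        VotingClassifier(
            estimators=[
                ("nb", MultinomialNB()),
                ("svm", LinearSVC()),
                ("lr", LogisticRegression(max_iter=1000)),  # stands in for "Linear Regression"
                ("sgd", SGDClassifier()),
            ],
            voting="hard",  # majority vote over the four base classifiers
        ),
    )
    ensemble.fit(texts, labels)
    print(ensemble.predict(["film ekdum mast thi"]))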

9 citations

Proceedings Article
01 Dec 2019
TL;DR: A large-scale code-mixed corpus is generated to aid further research on code-mixed social media text, and machine learning models are trained that improve upon the previous state of the art using a much lighter and more explainable architecture.
Abstract: As an increasing number of people embrace social media, mining the data they generate has become an important task. Possible applications range from opinion mining and sentiment analysis to hate speech detection. More importantly, analyzing code-mixed multilingual text has gained popularity because it holds important socio-cultural clues that may be lost in translation. This paper explores methods to effectively analyse code-mixed Hindi/English (Hinglish) text. First, we generate a large-scale code-mixed corpus to aid further research on code-mixed social media text. High-quality word embeddings are then trained on this code-mixed text. Finally, we demonstrate the efficacy of our proposed method by training machine learning models that improve upon the previous state of the art using a much lighter and more explainable architecture. Our main intention behind training the classifier model was not only high performance but also good model explainability and speed.
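As a rough sketch of the corpus-to-embeddings step described above, assuming gensim's subword-aware FastText (the corpus lines and hyperparameters are placeholders, not the paper's settings):

    # Training word embeddings on a code-mixed corpus (illustrative sketch).
    from gensim.models import FastText

    # Placeholder corpus: each line of the (much larger) scraped code-mixed
    # corpus would be tokenized into a list of tokens.
    corpus = [
        "yeh movie bahut acchi thi".split(),
        "traffic was so bad aaj".split(),
        "kal ka match dekha kya".split(),
    ]

    # Subword-aware embeddings help with spelling variation in romanized Hindi.
    model = FastText(sentences=corpus, vector_size=100, window=5, min_count=1, epochs=10)
    print(model.wv.most_similar("movie", topn=3))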

1 citation

Journal Article
TL;DR: The authors propose a novel approach that calculates feature values using the Kullback-Leibler (KL) divergence for Neuro-Fuzzy sentiment analysis of low-resource languages like Hindi.
Abstract: This work proposes sentiment analysis for low-resource languages like Hindi using a Neuro-Fuzzy technique. Low-resource languages suffer from a scarcity of resources; consequently, we propose a method that can be implemented for any language. We use information theory to establish a relation between the terms that exist in a sentence. This work proposes a novel approach for calculating feature values using the Kullback-Leibler (KL) divergence. The feature values are employed to calculate the membership values associated with the fuzzy logic in the Neuro-Fuzzy technique. The novelty of this method lies in its predictive nature, which can mitigate the impact of unlabeled, unknown, or multi-domain data. We report results for multi-domain data in our experiments. We evaluate our results using accuracy, precision, recall, and F1-score. Our experiments show the efficacy of the proposed approach: it achieved 93.01% accuracy on the English dataset and 91.18% accuracy on the Hindi dataset, which is higher than other state-of-the-art techniques such as Naive Bayes and SVM. Additionally, we found that our approach provides satisfactory results with multi-domain data, as the two datasets come from different domains.
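The abstract does not spell out the exact KL-divergence formulation, so the sketch below only illustrates the general idea of scoring a term by the divergence D_KL(P || Q) = Σ_i P(i) · log(P(i) / Q(i)) between the term's distribution over sentiment classes and an assumed overall class distribution; the counts, the prior, and the scoring choice are all assumptions, not the paper's feature definition.

    # KL-divergence term-scoring sketch (assumed interpretation; illustrative only).
    from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

    # Hypothetical counts of each term's occurrences in [positive, negative] documents.
    term_class_counts = {"accha": [40, 5], "bakwaas": [3, 30], "movie": [25, 22]}
    class_prior = [0.5, 0.5]  # assumed overall class distribution

    for term, counts in term_class_counts.items():
        total = sum(counts)
        p_class_given_term = [c / total for c in counts]
        score = entropy(p_class_given_term, class_prior)   # KL(P(class|term) || P(class))
        print(f"{term}: feature value = {score:.3f}")       # higher = more class-indicative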

1 citation

Proceedings Article
TL;DR: In this article, the authors propose an AWD-LSTM model for the code-mixed (Tamil-English) dataset and Logistic Regression for the Tamil, Malayalam, and English languages.
Abstract: This paper presents our submission to the shared task “Homophobia, Transphobia Detection of YouTube Comments” organized by DravidianLangTech. Our team participated in Task B, which aims to identify whether YouTube comments are non-anti-LGBTQ+ content, homophobic, or transphobic in code-mixed (Tamil-English), Tamil, Malayalam, and English. We propose an AWD-LSTM model for the code-mixed (Tamil-English) dataset and Logistic Regression for the Tamil, Malayalam, and English languages. Our AWD-LSTM model achieved a 0.33 macro-average F1 score for the code-mixed (Tamil-English) data, while Logistic Regression achieved macro-average F1 scores of 0.55 for Tamil, 0.98 for Malayalam, and 0.91 for English.
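A minimal sketch of the Logistic Regression track mentioned above, assuming a TF-IDF character n-gram representation and scikit-learn with macro-F1 evaluation; the example comments, labels, and hyperparameters are placeholders rather than the shared-task data or the team's exact configuration.

    # Logistic Regression baseline sketch with macro-F1 evaluation (illustrative).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import f1_score

    # Placeholder three-class data: 0 = non-anti-LGBTQ+, 1 = homophobic, 2 = transphobic.
    train_texts = ["great video, very supportive", "hateful comment example one",
                   "hateful comment example two", "nice and informative content"]
    train_labels = [0, 1, 2, 0]
    test_texts = ["really supportive comment", "hateful comment example one"]
    test_labels = [0, 1]

    clf = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # char n-grams cope with code-mixing
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    )
    clf.fit(train_texts, train_labels)
    predictions = clf.predict(test_texts)
    print("macro F1:", f1_score(test_labels, predictions, average="macro"))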