Proceedings Article

Sentiment mining: An approach for Bengali and Tamil tweets

TL;DR: The aim is to classify a given Bengali or Tamil tweet into one of three sentiment classes (positive, negative, or neutral) using unigram and bigram models along with different supervised machine learning techniques.
Abstract: This paper presents work on extracting sentiment from tweets in Indian languages. We propose a system that extracts sentiment from Bengali and Tamil tweets. Our aim is to classify a given Bengali or Tamil tweet into one of three sentiment classes: positive, negative, or neutral. Twitter has recently gained much attention from NLP researchers, as it is one of the most widely used platforms that allow users to share their opinions in the form of tweets. The proposed methodology uses unigram and bigram models along with different supervised machine learning techniques. We also consider features generated from lexical resources such as WordNets and an emoticon tagger.
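The abstract does not name the classifiers or the tokenizer used, so the following is only a minimal sketch of the general idea: unigram and bigram counts feeding a supervised classifier. Multinomial Naive Bayes with add-one smoothing, whitespace tokenization, and the English toy examples are all assumptions chosen for illustration; they are not the authors' setup.

```python
import math
from collections import Counter, defaultdict


def ngram_features(text):
    """Return the unigrams followed by the bigrams of a whitespace-tokenized text."""
    tokens = text.lower().split()
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


class NaiveBayes:
    """Multinomial Naive Bayes over unigram and bigram counts (assumed classifier)."""

    def __init__(self):
        self.class_counts = Counter()               # documents per class
        self.feature_counts = defaultdict(Counter)  # feature counts per class
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for feat in ngram_features(text):
                self.feature_counts[label][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            # Log prior plus add-one smoothed log likelihood of each feature.
            score = math.log(n_docs / total_docs)
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for feat in ngram_features(text):
                score += math.log((self.feature_counts[label][feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data; real input would be labelled Bengali or Tamil tweets.
train_texts = [
    "good movie", "very good day",
    "bad movie", "very bad day",
    "the movie today", "the day today",
]
train_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("good day"))
print(model.predict("bad movie"))
```

Features from the lexical resources mentioned in the abstract (WordNet entries, emoticon tags) could be appended to the list returned by `ngram_features` in the same way, though how the authors combined them is not stated here.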
Citations
Proceedings Article
11 May 2020
TL;DR: A gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube is created and inter-annotator agreement is presented, and the results of sentiment analysis trained on this corpus are shown.
Abstract: Understanding the sentiment of a comment from a video or an image is an essential task in many applications. Sentiment analysis of a text can be useful for various decision-making processes. One such application is to analyse the popular sentiments of videos on social media based on viewer comments. However, comments from social media do not follow strict rules of grammar, and they contain mixing of more than one language, often written in non-native scripts. Non-availability of annotated code-mixed data for a low-resourced language like Tamil also adds difficulty to this problem. To overcome this, we created a gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube. In this paper, we describe the process of creating the corpus and assigning polarities. We present inter-annotator agreement and show the results of sentiment analysis trained on this corpus as a benchmark.

168 citations


Cites background from "Sentiment mining: An approach for B..."

  • ...Several research activities on sentiment analysis in Tamil (Padmamala and Prema, 2017) and other Indian languages (Ranjan et al., 2016; Das and Bandyopadhyay, 2010; A.R. et al., 2012; Phani et al., 2016; Prasad et al., 2016; Priyadharshini et al., 2020; Chakravarthi et al., 2020) are happening because the sheer number of native speakers are a potential market for commercial NLP applications....


Posted Content
TL;DR: In this article, the authors created a gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube, presented inter-annotator agreement, and showed the results of sentiment analysis trained on this corpus as a benchmark.
Abstract: Understanding the sentiment of a comment from a video or an image is an essential task in many applications. Sentiment analysis of a text can be useful for various decision-making processes. One such application is to analyse the popular sentiments of videos on social media based on viewer comments. However, comments from social media do not follow strict rules of grammar, and they contain mixing of more than one language, often written in non-native scripts. Non-availability of annotated code-mixed data for a low-resourced language like Tamil also adds difficulty to this problem. To overcome this, we created a gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube. In this paper, we describe the process of creating the corpus and assigning polarities. We present inter-annotator agreement and show the results of sentiment analysis trained on this corpus as a benchmark.

29 citations

Book Chapter
17 Dec 2020
TL;DR: In this article, the authors presented an emotional dataset (hereafter called "BEmoD") for analysis of emotion in Bengali texts and described its development process, including data crawling, pre-processing, labeling, and verification.
Abstract: Recently, emotion detection in language has attracted increasing attention from NLP researchers due to the massive availability of people's expressions, opinions, and emotions through comments on Web 2.0 platforms. It is a very challenging task to develop an automatic sentiment analysis system in Bengali due to the scarcity of resources and the unavailability of standard corpora. Therefore, the development of a standard dataset is a prerequisite for analyzing emotional expressions in Bengali texts. This paper presents an emotional dataset (hereafter called 'BEmoD') for analysis of emotion in Bengali texts and describes its development process, including data crawling, pre-processing, labeling, and verification. BEmoD contains 5200 texts, each labeled with one of six basic emotion categories: anger, fear, surprise, sadness, joy, and disgust. Dataset evaluation yields a Cohen's κ score of 0.920, which indicates close agreement among annotators. The evaluation analysis also shows that the distribution of emotion words follows Zipf's law.
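For readers unfamiliar with the agreement statistic quoted above, Cohen's κ compares the observed agreement between two annotators with the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below assumes exactly two annotators and uses made-up labels; it does not reproduce the BEmoD annotation setup, which the abstract does not describe in detail.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four texts.
annotator_1 = ["joy", "joy", "anger", "anger"]
annotator_2 = ["joy", "joy", "anger", "joy"]
print(cohens_kappa(annotator_1, annotator_2))
```

Here the annotators agree on three of four items (p_o = 0.75) while chance agreement is p_e = 0.5, so κ = 0.5.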

17 citations

Journal Article
TL;DR: In this article, the authors describe the development of an emotional corpus (hereafter called "BEmoC") for classifying six emotions in Bengali texts, i.e., anger, fear, surprise, sadness, joy, and disgust.
Abstract: Emotion classification in text is attracting growing interest among NLP experts due to the enormous availability of people's emotions and their emergence on various Web 2.0 applications and services. Emotion classification in Bengali texts is also gradually being considered an important task for sports, e-commerce, entertainment, and security applications. However, developing an automatic emotion classification system for low-resource languages such as Bengali is a very challenging task. The scarcity of resources and the lack of benchmark corpora make the task more complicated. Thus, the development of a benchmark corpus is a prerequisite for developing an emotion classifier for Bengali texts. This paper describes the development of an emotional corpus (hereafter called 'BEmoC') for classifying six emotions in Bengali texts. The corpus development process consists of four key steps: data crawling, pre-processing, labelling, and verification. A total of 7000 texts are each labelled with one of six basic emotion categories: anger, fear, surprise, sadness, joy, and disgust. Dataset evaluation with a Cohen's κ score of 0.969 indicates close agreement between the corpus annotators and the expert. The evaluation analysis also shows that the distribution of emotion words obeys Zipf's law. Moreover, the results of the BEmoC analysis are presented in terms of coding reliability, emotion density, and most frequent emotion words.

11 citations

Book Chapter
TL;DR: This work classifies each line of text as a particular language, focusing on short phrases of 2–6 words in 15 Indian languages, in order to detect whether a given document is multilingual and to identify the Indian languages it contains.
Abstract: Language identification is used to categorize the language of a given document. Language identification categorizes the contents and can yield better search results for a multilingual document. In this work, we classify each line of text as a particular language, focusing on short phrases of 2–6 words in 15 Indian languages. The approach detects whether a given document is multilingual and identifies the Indian languages it contains. The approach combines an n-gram technique with a list of short distinctive words. The n-gram model is language independent, whereas the short-word method requires less computation. The results show the effectiveness of our approach on synthetic data.
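One way the two techniques named in the abstract could be combined is sketched below: a cheap lookup against short distinctive words, with a character n-gram profile as fallback. The order of the two steps, the trigram size, the overlap score, and the tiny Hindi and Marathi samples and word lists are all assumptions made for illustration; the abstract does not specify how the authors combined the methods.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams of a line, padded with spaces at both ends."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def build_profiles(samples, n=3):
    """Build one n-gram frequency profile per language from sample lines."""
    profiles = {}
    for lang, lines in samples.items():
        profile = Counter()
        for line in lines:
            profile.update(char_ngrams(line, n))
        profiles[lang] = profile
    return profiles


def identify(line, profiles, short_words, n=3):
    """Label one line: short-word lookup first, n-gram overlap as fallback."""
    tokens = set(line.lower().split())
    hits = {lang: len(tokens & words) for lang, words in short_words.items()}
    best = max(hits, key=hits.get)
    # Accept the short-word result only if one language clearly wins.
    if hits[best] > 0 and list(hits.values()).count(hits[best]) == 1:
        return best

    grams = char_ngrams(line, n)

    def overlap(profile):
        total = sum(profile.values()) or 1
        return sum(count * profile[g] / total for g, count in grams.items())

    return max(profiles, key=lambda lang: overlap(profiles[lang]))


# Hypothetical toy resources for two languages that share the Devanagari script.
samples = {
    "hindi": ["यह मेरा घर है", "मैं स्कूल जाता हूँ"],
    "marathi": ["हे माझे घर आहे", "मी शाळेत जातो"],
}
short_words = {
    "hindi": {"है", "और", "में"},
    "marathi": {"आहे", "आणि"},
}
profiles = build_profiles(samples)

print(identify("यह घर है", profiles, short_words))   # decided by the short-word list
print(identify("माझे घर", profiles, short_words))    # decided by the n-gram fallback
```

Running `identify` on every line of a document and collecting the distinct labels gives a simple check for whether the document is multilingual.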

5 citations

References
01 Jan 2002
TL;DR: In this paper, the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative, was considered and three machine learning methods (Naive Bayes, maximum entropy classification, and support vector machines) were employed.
Abstract: We consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. Using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. However, the three machine learning methods we employed (Naive Bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. We conclude by examining factors that make the sentiment classification problem more challenging.

6,980 citations

Proceedings Article
06 Jul 2002
TL;DR: This work considers the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative, and concludes by examining factors that make the sentiment classification problem more challenging.
Abstract: We consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. Using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. However, the three machine learning methods we employed (Naive Bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. We conclude by examining factors that make the sentiment classification problem more challenging.

6,626 citations

Posted Content
TL;DR: A simple unsupervised learning algorithm that classifies reviews as recommended (thumbs up) or not recommended (thumbs down); a review is classified as recommended if the average semantic orientation of its phrases is positive.
Abstract: This paper presents a simple unsupervised learning algorithm for classifying reviews as recommended (thumbs up) or not recommended (thumbs down). The classification of a review is predicted by the average semantic orientation of the phrases in the review that contain adjectives or adverbs. A phrase has a positive semantic orientation when it has good associations (e.g., "subtle nuances") and a negative semantic orientation when it has bad associations (e.g., "very cavalier"). In this paper, the semantic orientation of a phrase is calculated as the mutual information between the given phrase and the word "excellent" minus the mutual information between the given phrase and the word "poor". A review is classified as recommended if the average semantic orientation of its phrases is positive. The algorithm achieves an average accuracy of 74% when evaluated on 410 reviews from Epinions, sampled from four different domains (reviews of automobiles, banks, movies, and travel destinations). The accuracy ranges from 84% for automobile reviews to 66% for movie reviews.
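The semantic-orientation rule described in this abstract can be written out as a short sketch. Taking pointwise mutual information as PMI(a, b) = log2(p(a, b) / (p(a) p(b))), the difference PMI(phrase, "excellent") − PMI(phrase, "poor") simplifies because the phrase count and corpus size cancel. The co-occurrence counts below are invented, and the default smoothing constant is an assumption added to avoid division by zero; neither comes from the abstract.

```python
import math


def semantic_orientation(near_excellent, near_poor, excellent, poor, smoothing=0.01):
    """PMI(phrase, "excellent") - PMI(phrase, "poor"), from co-occurrence counts.

    near_excellent / near_poor: how often the phrase co-occurs with each word.
    excellent / poor: how often each reference word occurs overall.
    smoothing: small constant added to the joint counts (an assumed value).
    """
    return math.log2(
        ((near_excellent + smoothing) * poor)
        / ((near_poor + smoothing) * excellent)
    )


def classify_review(orientations):
    """Recommend a review when the average orientation of its phrases is positive."""
    if not orientations:
        raise ValueError("review has no scored phrases")
    average = sum(orientations) / len(orientations)
    return "recommended" if average > 0 else "not recommended"


# Hypothetical counts: the phrase appears near "excellent" 20 times and near
# "poor" 5 times, and both reference words occur 1000 times in the corpus.
so = semantic_orientation(20, 5, 1000, 1000, smoothing=0)
print(so)
print(classify_review([so, -0.5]))
```

With these counts the phrase co-occurs with "excellent" four times as often as with "poor", which gives an orientation of log2(4) = 2.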

4,526 citations


"Sentiment mining: An approach for B..." refers methods in this paper

  • ...Early work in this area includes work done by Turney [2] and Pang [3] for detecting the polarity of product reviews....


Proceedings Article
01 Jan 2002
TL;DR: This article proposed an unsupervised learning algorithm for classifying reviews as recommended (thumbs up) or not recommended (thumbs down) based on the average semantic orientation of phrases in the review that contain adjectives or adverbs.
Abstract: This paper presents a simple unsupervised learning algorithm for classifying reviews as recommended (thumbs up) or not recommended (thumbs down). The classification of a review is predicted by the average semantic orientation of the phrases in the review that contain adjectives or adverbs. A phrase has a positive semantic orientation when it has good associations (e.g., "subtle nuances") and a negative semantic orientation when it has bad associations (e.g., "very cavalier"). In this paper, the semantic orientation of a phrase is calculated as the mutual information between the given phrase and the word "excellent" minus the mutual information between the given phrase and the word "poor". A review is classified as recommended if the average semantic orientation of its phrases is positive. The algorithm achieves an average accuracy of 74% when evaluated on 410 reviews from Epinions, sampled from four different domains (reviews of automobiles, banks, movies, and travel destinations). The accuracy ranges from 84% for automobile reviews to 66% for movie reviews.

3,814 citations

Proceedings Article
25 Jun 2005
TL;DR: A meta-algorithm is applied, based on a metric labeling formulation of the rating-inference problem, that alters a given n-ary classifier's output in an explicit attempt to ensure that similar items receive similar labels.
Abstract: We address the rating-inference problem, wherein rather than simply decide whether a review is "thumbs up" or "thumbs down", as in previous sentiment analysis work, one must determine an author's evaluation with respect to a multi-point scale (e.g., one to five "stars"). This task represents an interesting twist on standard multi-class text categorization because there are several different degrees of similarity between class labels; for example, "three stars" is intuitively closer to "four stars" than to "one star". We first evaluate human performance at the task. Then, we apply a meta-algorithm, based on a metric labeling formulation of the problem, that alters a given n-ary classifier's output in an explicit attempt to ensure that similar items receive similar labels. We show that the meta-algorithm can provide significant improvements over both multi-class and regression versions of SVMs when we employ a novel similarity measure appropriate to the problem.

2,544 citations


"Sentiment mining: An approach for B..." refers methods in this paper

  • ...Early work in this area includes work done by Turney [2] and Pang [3] for detecting the polarity of product reviews....


  • ...A multiway document classification on polarity basis is attempted by Pang [4] and Synder [5]....
