
A comparative study of machine translation for multilingual sentence-level sentiment analysis

TL;DR: This work evaluates existing language-specific sentiment analysis efforts against a simple yet effective baseline approach, and suggests that simply translating the input text in a specific language to English and then applying one of the best existing English methods can be better than the language-specific approaches evaluated.
About: This article was published in Information Sciences on 2020-02-01 and is currently open access. It has received 72 citations to date. The article focuses on the topics: Sentiment analysis and Machine translation.

Summary



  • Sentiment analysis has become a popular tool for data analysts, especially those who deal with social media data.
  • Thus, sentiment analysis became a hot topic in Web applications, with high demand from industry and academia motivating the proposal of new methods for the problem.
  • The potential market for sentiment analysis in different languages is vast.
  • Consider, for example, a mobile application that relies on sentiment analysis.


  • Additionally, it argues for the use of translation-based techniques as a baseline for new multilingual sentiment analysis methods.
  • The authors emphasize that their work focuses on comparing "off-the-shelf" methods as they are used in practice.
  • This excludes most of the supervised methods, which require labeled sets for training, as these are usually not available to practitioners.
  • Moreover, most of the supervised solutions do not share their source code or a trained model. (Footnote 1: https://translate.google.com.)


  • According to Internet World Stats, seven of those languages appear among the top ten languages used on the Web and represent more than 61% of non-English-speaking users.
  • In line with a recent benchmark study [Ribeiro et al., 2016], their findings suggest that machine translation systems are mature enough to produce reliable English translations that can be used for sentence-level sentiment analysis with competitive prediction performance.
  • Additionally, the authors show that some popular language-specific methods do not have a significant advantage over a machine translation approach.

1.1 Objectives

  • The main objective of this work is to provide a quantitative comparison between several existing English sentiment analysis methods combined with machine translation in the multilingual context.
  • The authors also compare the results with current language-specific methods to determine whether those methods are better than the machine translation approach.


  • The authors' hypothesis rests on the assumption that machine translation of datasets to English, followed by analysis with English-specific methods, can be as good as the sentiment analysis methods created for particular languages.
  • They support this because, even when words change between two paired sentences in different languages, an accurate machine translation should preserve both the meaning and the sentiment polarity (a minimal sketch of this translate-then-classify baseline follows).
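To make the hypothesis concrete, here is a minimal Python sketch of the translate-then-classify baseline. Both helper functions are placeholders invented for illustration; the study itself relies on commercial translators (Google, Microsoft, Yandex) and off-the-shelf English methods.

```python
def translate_to_english(sentence: str, source_lang: str) -> str:
    # Placeholder: in the study this call would go to a commercial machine
    # translation system (Google, Microsoft or Yandex Translate).
    raise NotImplementedError

def english_sentiment(sentence: str) -> str:
    # Placeholder: any off-the-shelf English method (e.g. VADER, SOCAL),
    # returning "positive", "negative" or "neutral".
    raise NotImplementedError

def multilingual_polarity(sentence: str, source_lang: str) -> str:
    """Translate-then-classify baseline: translate first, then reuse an
    existing English sentiment analysis method unchanged."""
    return english_sentiment(translate_to_english(sentence, source_lang))
```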

1.2 Results and Contributions

  • To address the problem of multilingual sentiment analysis, the authors perform several experiments applying methods created for English to multilingual datasets with the help of automatic machine translators.
  • As the main result, the proposed hypothesis was confirmed: in no evaluation did the methods designed for specific languages outperform the machine translation approach.
  • This work highlights that current commercial and non-commercial sentiment analysis methods for non-English datasets are not powerful enough to beat the machine translation approach combined with state-of-the-art sentiment analysis methods published for English.
  • This work makes two main contributions.
  • Therefore, machine translation should be used as a baseline when new methods are proposed by the scientific community.


  • An evaluation of machine translation for multilingual sentence-level sentiment analysis.
  • In IV Brazilian Workshop on Social Network Analysis and Mining (BraSNAM 2015).
  • A multilingual benchmarking system for sentence-level sentiment analysis, also known as iFeel 2.0.
  • In these publications, the authors discuss the same approach described here, including a previous version of iFeel.

1.4 Organization

  • This chapter presents an overview of the main concepts and terminologies related to sentence-level sentiment analysis and the current state-of-the-art methodologies.
  • Furthermore, the authors describe existing approaches for non-English sentiment analysis, including previous direct machine translation approaches like the one focused on in this work, and how this work differs from them.
  • This chapter presents the resources used throughout this work to evaluate the hypothesis.
  • It describes the effort to gather representative human-labeled datasets in multiple languages, the machine translation systems used to translate these datasets to English, and all the English and non-English sentiment analysis methods.


  • This chapter presents the results and discussions that the authors use to validate their hypothesis.
  • It describes the performance comparison between the machine translation approach and language-specific methods, including an evaluation of machine translation systems and a ranking of the best non-English and English methods.
  • This chapter presents iFeel, a web-based framework for multilingual sentiment analysis developed to facilitate the study of sentiment analysis by the community and to share the code and datasets of this work.
  • It reports F1-Score and Macro-F1 for every language dataset and every supported method.

2.1 Definitions and Terminologies

  • Given the recent popularity of the term sentiment analysis, it has been used to describe a wide variety of tasks by the community.
  • A variety of conferences cover these topics, particularly in natural language processing, for example, the Annual Meeting of the Association for Computational Linguistics (ACL) and the Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • The SemEval workshop stands out as one that annually evaluates the current state-of-the-art techniques and proposes several new challenging tasks for the field.
  • Each of these tasks has many subtasks, ranging from three-class polarity detection of tweets to veracity prediction given a rumor.
  • Since there are many definitions related to the sentiment analysis field, the authors list and describe the concepts they use in the context of this work.


  • Given a message and a topic, the goal is to identify whether the message expresses a positive, negative, or neutral sentiment regarding that topic.
  • In any case, the higher the score, the stronger the positive sentiment in the sentence.
  • Since some words carry emotional meaning, such as surprise, anger, or happiness, the methods should be able to correctly identify the "affective text" that best matches the sentence.
  • The most common use of the term multilingual sentiment analysis is when authors propose a generic methodology to perform sentiment analysis on datasets written in just one language, usually other than English.
  • This term was coined in their recent work [Araújo et al., 2016].


  • For example, a method may have been originally created for English but given a new training dataset from a different language to perform multilingual analysis.
  • These methods are used in practice and exclude most of the supervised methods, which require labeled sets for training.
  • The granularity level indicates that the classification given by a method may be attached to whole documents (document-based sentiment), to individual sentences (sentence-based sentiment), or to specific aspects of entities (aspect-based sentiment) [Feldman, 2013].
  • Most approaches using this granularity in sentiment analysis are based either on supervised learning [Kim and Hovy, 2007] or on unsupervised learning [Yu and Hatzivassiloglou, 2003].
  • The sentence "This hotel, despite the great room, has terrible customer service" has two different polarities, associated with "room" and "customer service", for the same hotel.


  • At this granularity level, polarity classification occurs at the document level, detecting the polarity of a whole text at once.
  • [Pang et al., 2002] show that even at this simple granularity level, good accuracy can be achieved.
  • Given a strength score associated with the intensity of the sentiment, the authors map these outputs to the 2-class detection problem.
  • Many of the methods also produce a neutral output.
  • This extra class would turn the problem into a 3-class task, as discussed in depth in [Ribeiro et al., 2016] (a sketch of this output mapping follows).
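As an illustration of this output normalization, the sketch below maps a signed strength score onto the 2-class problem plus a neutral class. This is not the authors' code, and the threshold is an arbitrary illustrative choice.

```python
def to_polarity(score: float, neutral_band: float = 0.05) -> str:
    """Map a signed strength score to positive/negative, keeping a
    neutral class for scores inside a small band around zero."""
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"

assert to_polarity(0.8) == "positive"
assert to_polarity(-0.3) == "negative"
assert to_polarity(0.01) == "neutral"
```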

2.2 English Methods

  • Due to the enormous interest and applicability, there has been a corresponding increase in the number of proposed sentiment analysis methods in recent years.
  • These methods rely on many different techniques from various computer science fields.
  • For machine learning, one example is [Pannala et al., 2016], which discusses the use of Support Vector Machines (SVMs) and Maximum Entropy (ME) for polarity detection at the aspect level.
  • Lexicon-based methods use predefined lists of words in which each word is associated with a specific sentiment (a minimal example follows this list).
  • The lexical methods vary according to the context in which they were created.
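A minimal example of the lexicon-based idea described above; the tiny word list is invented for illustration and is not one of the lexicons used in this work.

```python
# Toy lexicon: word -> sentiment weight.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2}

def lexicon_score(sentence: str) -> int:
    """Sum the weights of known words; the sign gives the polarity."""
    return sum(LEXICON.get(tok, 0) for tok in sentence.lower().split())

print(lexicon_score("the room was great"))                         # 2 -> positive
print(lexicon_score("great room but terrible customer service"))   # 0 -> undecided
```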


  • Overall, the above techniques are accepted by the research community, and it is common to see contemporaneous important papers, sometimes published at the same computer science conference, using completely different methods.
  • In a recent study, [Ribeiro et al., 2016] compared many of the current "off-the-shelf" sentiment analysis methods for English over several English datasets; although they note that some methods usually perform better than others, they also conclude that no single method performs best in all cases.
  • Their study highlights the importance of both main techniques: machine learning and lexicons.
  • Finally, [Gonçalves et al., 2013] and [Gonçalves et al., 2016] also highlight the potential performance improvements from combining the outputs of multiple methods with weighting techniques.

2.3 Multilingual Sentiment Analysis

  • Most approaches for sentiment analysis available today were developed only for English, and few efforts explore the problem for other languages.
  • Despite this disadvantage compared to English, the following subsections list several attempts that move toward a multilingual sentiment analysis context.
  • In general, these previous efforts focus on adapting strategies that previously succeeded for English to other languages.
  • Overall, they provide limited baseline comparisons and validations.
  • It is unclear whether currently available language-specific strategies can surpass existing English sentiment analysis methods applied to text translated into English.

2.3.1 Machine translation-based methods

  • In an approach similar to this work, [Refaee and Rieser, 2015] machine-translated Arabic tweets into English.
  • The results on the translated text were then compared with native methods and, in the worst case, were only 5% inferior.
  • According to the authors, such a setup may be advantageous when the appropriate resources for a particular language are lacking and fast deployment is crucial.
  • Considering automatic translation to Romanian and Spanish, another effort investigates the performance of polarity classification trained from a labeled English corpus.
  • Similarly, to build a standalone multilingual sentiment analysis system, [Balahur and Turchi, 2013] built a simple method for English using a gold-standard dataset and translated this dataset from English into four other languages (Italian, Spanish, French, and German) to turn the method into a multilingual system.


  • They report that the resulting system can perform multilingual classification with 70% accuracy.
  • Nevertheless, this work is the first to test the technique at such a wide range, covering 14 different languages and comparing the results of 15 English sentiment analysis methods against 3 language-specific methods, which increases confidence in the hypothesis.
  • Besides, all the resources, including the iFeel system, were developed throughout this work to give the community easy access to the methods and techniques; this extra effort is unique and helps maintain reproducibility in the field.

2.3.2 Lexicon and corpus-based methods

  • These features, or rules, imply that if a word from a sentence appears in a previously defined rule, there is a high probability the sentence carries the same opinion as that rule.
  • Moreover, these rules are built on a combination of lexicons and several linguistic tools such as part-of-speech (POS) tagging.
  • In [Wan, 2008], the authors propose an approach that uses an English dataset to improve the results of a Chinese sentiment analysis.
  • These values were combined to calculate the sentiment value of the sentence.
  • The overall accuracy of this approach was 86%.

2.3.3 Machine Learning-based methods

  • Many of the proposed methods, not limited to this subsection, use machine learning techniques at least in part.
  • Usually, the most frequent models for classification tasks are Naive Bayes, Maximum Entropy, and Support Vector Machines.
  • While lexical resources are still used to detect polarity in text, machine learning approaches are more common in this type of analysis.
  • Performance is also highly dependent on the training dataset, including the context of the source from which the data was collected.
  • Instead of using a machine translation technique, the authors manually annotated three datasets (English, Dutch, and French) to train different machine learning algorithms.

2.3.4 Parallel corpus-based methods

  • A different approach to multilingual sentiment analysis is the use of a parallel corpus, which does not depend on machine translation.
  • The authors acquire some amount of sentiment-labeled data and a parallel dataset with the same semantic information in different languages.
  • In [Meng et al., 2012], the authors propose a technique named cross-lingual mixture model (CLMM), which maximizes the likelihood of bilingual parallel data in order to expand the vocabulary of the target language.
  • The CLMM proves effective when labeled data in the target language is scarce.
  • The authors also show that this methodology can boost the machine translation approach where the vocabulary is limited.


  • Their results show a 12% improvement in accuracy with this approach when combining corpora from English and Chinese.
  • First, they used sentiment-tagged Bible chapters in English to build the sentiment prediction model and the parallel foreign-language labels.
  • The authors then used 54 other versions of the Bible in different languages and Latent Semantic Indexing (LSI) to convert that multilingual corpus into a multilingual "concept space" (an illustrative sketch follows this list).
  • Their accuracy ranges from 72% to 75%.
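The "concept space" idea can be illustrated with off-the-shelf LSI components. The sketch below shows the general cross-language LSI technique on invented toy verse pairs; it is not the cited authors' implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Each training document concatenates a verse with its translation, so the
# SVD learns dimensions shared by both vocabularies (toy examples only).
parallel_docs = [
    "in the beginning god created the heavens and the earth "
    "no principio criou deus os ceus e a terra",
    "and god said let there be light e disse deus haja luz",
    "and there was light e houve luz",
]
tfidf = TfidfVectorizer().fit_transform(parallel_docs)
svd = TruncatedSVD(n_components=2)
concepts = svd.fit_transform(tfidf)  # documents in the shared "concept space"
print(concepts.shape)                # (3, 2)
```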

2.3.5 Hybrid cross-lingual and lexicon-based methods

  • Many techniques combine corpus-based and lexicon-based approaches, focusing on domain adaptation of sentiment analysis for resource-poor languages or special domains.
  • These techniques mostly use both annotated corpora and lexicon resources to learn labels and expand vocabulary.
  • Most of these models are also developed with machine learning algorithms.
  • They include parsing and pattern-matching techniques, using transfer-based machine translation technology to develop a high-precision model.
  • To improve classification, they extracted word semantic orientation from the lexical resource SentiWordNet.


  • On the other hand, [Demirtas and Pechenizkiy, 2013] do not achieve good results using a cross-lingual framework to analyze movie and product review datasets in English and Turkish.
  • The authors show that expanding the training size with new instances taken from another corpus does not necessarily increase classification accuracy.
  • Co-training classification with machine translation improved the results when used in semi-supervised learning with unlabeled data from the same domain.

2.3.6 Neural Networks-based methods

  • Neural networks, also called deep learning-based methods, have recently shown promise for text classification and sentiment analysis [Kim, 2014].
  • Cascaded layers with non-linearities allow these models to build complex functions such as sentiment compositionality, while their ability to process raw signals gives them language and domain independence.
  • One such work proposed a convolutional neural network (CNN) for both tasks: aspect extraction and aspect-based sentiment analysis (a minimal CNN sketch follows this list).
  • Their methodology ranked in the top two in 7 out of 11 language-domain pairs for polarity classification, and in the top two in 5 out of 11 language-domain pairs for the aspect-based task.
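For reference, a minimal Kim (2014)-style CNN for sentence polarity, sketched in PyTorch. This illustrates the architecture family, not the specific model from the cited aspect-based work; all sizes are illustrative defaults.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Convolutions over word embeddings, max-pooled per filter size."""
    def __init__(self, vocab_size, embed_dim=128, num_classes=2,
                 kernel_sizes=(3, 4, 5), num_filters=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                       # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)   # (batch, embed, seq)
        pooled = [F.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.fc(torch.cat(pooled, dim=1))        # class logits

logits = TextCNN(vocab_size=10_000)(torch.randint(0, 10_000, (8, 40)))
print(logits.shape)  # torch.Size([8, 2])
```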

2.3.7 Research Gap

  • This brief literature overview shows how complex sentiment analysis is, with a variety of tasks and subtasks.
  • Methods usually mix several of these techniques to improve their results.
  • Multilingual approaches, however, have not yet seen successful adoption.
  • Therefore, the authors believe that a comparison of "off-the-shelf" methods applied to a wide range of languages has significant value for the community.
  • The authors' methodology to evaluate sentiment analysis in multiple languages involves three key elements.

3.1 English Sentiment Analysis Methods

  • The term sentiment analysis has been used to describe different tasks and problems.
  • It is common to see sentiment analysis used to describe efforts to extract opinions from reviews [Hu and Liu, 2004], gauge news polarity [Reis et al., 2015a], and measure mood fluctuations [Hannak et al., 2012].
  • Hence, the authors restrict their focus to efforts related to detecting the polarity (i.e., positivity or negativity) of a given text, which can be done with small adaptations to the output of some existing methods, a methodology previously described by [Gonçalves et al., 2013; Araújo et al., 2014].
  • The authors' effort to identify a large number of sentiment analysis methods consisted of systematically searching for them in the main conferences in the field and then checking those methods' citations and the papers that cited them.
  • It is important to note that some methods are available for download on the Web, others were kindly shared by their authors upon request, and a small number were reproduced from the paper that describes the method.
  • The latter usually happened when authors shared only the lexical dictionaries they created, leaving the implementation of the method that uses the lexical resource to the authors of this work.
  • Table 3.1 presents an overview of the methods used in this work, the reference paper in which each was published, and the main technique each is based on (machine learning or lexicon).
  • The original outputs of these methods are listed in the table; outputs the authors consider positive are colored blue, negative outputs red, and neutral outputs black.
  • The methods used in this work were discussed in depth, and had their performance compared across different English datasets, in [Ribeiro et al., 2016].
  • Following that methodology, the authors chose 15 methods from that study.
  • Finally, the authors also chose to add the Google Prediction API, a commercial sentiment analysis tool created by Google, to examine discrepancies between paid and free methods.
  • All of the methods, excluding the Google Prediction API, can be used in the iFeel system developed in this work and described in Chapter 5.

3.2 Human Labeled Datasets

  • The authors present an overview of the datasets used in this work to compare the performance of their approach against traditional methods.
  • These workloads consist of 14 gold-standard datasets of sentences, labeled by humans as positive, negative, or neutral according to their sentiment polarity.
  • Using the human labels, the authors can compare the quality of the sentiment analysis methods and judge their performance.
  • Table 3.3 summarizes the relevant information about these datasets, showing in each row the language, its ISO 639-1 two-letter code, where it was first published, the type of data collected, and the number of positive (Pos) and negative (Neg) sentences labeled by humans.
  • The authors contacted many others who published work related to non-English sentence-level sentiment analysis; the result of this extensive manual effort is a unique and rich source of human-labeled sentences in many languages. (Footnote 1: The datasets used in this paper are available upon request.)
  • Producing human-labeled sentiment datasets is very challenging for two main reasons: the subjectivity intrinsic to each sentence (context dependence) and the amount of time humans need to label thousands of sentences.
  • The way the authors found to proceed with this work was to contact different, independent authors in the field who had already done this labeling work for a specific language.
  • After obtaining these 14 independent datasets, the authors post-processed them to make sure the labels are uniform and comparable to the sentiment analysis methods' output (a sketch of this step follows this list).
  • Some datasets were labeled by three annotators, others by two.
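A sketch of what this label post-processing can look like. The raw label spellings below are invented, since each source dataset uses its own scheme; the point is mapping everything onto one canonical scale.

```python
# Map heterogeneous raw labels onto one canonical scale: -1, 0, +1.
RAW_TO_CANONICAL = {
    "positive": 1, "pos": 1, "+": 1, "1": 1,
    "negative": -1, "neg": -1, "-": -1, "-1": -1,
    "neutral": 0, "neu": 0, "0": 0,
}

def normalize_label(raw) -> int:
    key = str(raw).strip().lower()
    if key not in RAW_TO_CANONICAL:
        raise ValueError(f"unmapped label: {raw!r}")
    return RAW_TO_CANONICAL[key]

print([normalize_label(x) for x in ("POS", "-1", "Neutral")])  # [1, -1, 0]
```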

3.3 Language-Specific Sentiment Analysis Methods

  • Ideally, the authors would like to compare machine translation, using all the English methods described in Section 3.1, against a large number of methods proposed for specific languages.
  • While the authors succeeded in obtaining a large number of datasets, most of these methods are not available even upon request to their authors, making reproducibility almost impossible in most cases.
  • The authors were able to assess 3 native methods created or trained specifically for certain languages.
  • The authors of one of them released an adaptation of the original SentiStrength that consists of swapping the lexicon files for those of the language in which one wants to perform sentiment analysis.
  • The authors used the trial version of the Microsoft Excel plugin available on the method's website.

3.4 Machine Translation Systems

  • Machine translation, or automated translation, has been a research field since the 1950s [V. Le and Schuster, 2016].
  • Its main goal is to provide text translation by a computer without human interaction.
  • There are three main approaches to automatic translation: rule-based/phrase-based systems, statistical methods, and neural networks (neural machine translation).
  • Rule-based systems use lexicons combined with grammar definitions to translate sentences in a meaningful way.
  • For the purposes of this work, the authors address two main potential questions related to the use of machine translators.
  • Such tools, even when based on a pre-trained statistical system, are static and do not follow the evolving language of the Web.
  • So the authors chose well-known commercial tools that periodically retrain their models, as explained by [Microsoft, 2017], [Yandex, 2017], and [V. Le and Schuster, 2016].


  • Figure 3.1 shows a performance comparison between three translation candidates: a neural network system, a phrase-based system, and human translators.
  • The Microsoft Translator Text API can be used through the Microsoft Azure platform; it allows processing the first 2 million characters for free, and each additional million characters costs US$10.
  • The authors present all the experiments performed in this work to sustain their hypothesis.
  • The authors believe that current sentiment analysis methods created for English, combined with current state-of-the-art machine translation systems, can be as good as or even better than native sentiment analysis methods in multiple languages.

4.1 Metrics

  • The F1-Score is a metric used to compare the quality of a prediction against a given ground truth.
  • The F1-Score considers the correct classification of each sentence equally important, independently of the class, and basically measures the capability of the method to predict the correct output.
  • This metric can be easily computed for 2-class experiments using Table 4.1.
  • The precision of the positive class is computed as Prec(+) = TP / (TP + FP); the full set of definitions is written out below.
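Written out from the standard confusion-matrix definitions (with TP, FP, and FN taken with respect to the positive class), the quantities behind Table 4.1 are:

```latex
\mathrm{Prec}(+) = \frac{TP}{TP + FP}, \qquad
\mathrm{Rec}(+)  = \frac{TP}{TP + FN}, \qquad
F1(+) = \frac{2\,\mathrm{Prec}(+)\,\mathrm{Rec}(+)}{\mathrm{Prec}(+) + \mathrm{Rec}(+)}
```

with F1(−) defined symmetrically for the negative class, and Macro-F1 = (F1(+) + F1(−)) / 2.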


  • This metric considers the effectiveness in each class equally important, independently of the relative size of the class.
  • Therefore, the reported Macro-F1 represents how effective the method is when it indicates a polarity.
  • The methods may still give the neutral classification for some of the sentences.
  • Suppose, for instance, that the Emoticons method can classify only 10% of the sentences in a dataset, corresponding to the actual percentage of sentences containing emoticons.


  • Applicability is calculated as the total number of sentences minus the number of undefined sentences, divided by the total number of sentences: Applicability = (N − N_undefined) / N.
  • Throughout the analysis of the results, the authors mainly discuss the trade-off between these two metrics: Macro-F1 and Applicability.
  • The authors understand that Macro-F1 may not carry the same weight as Applicability depending on the task; hence, in their analysis, they show and discuss these metrics separately (a computational sketch of both follows this list).
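A sketch of how the two metrics can be computed, assuming predictions are "positive", "negative", or None (None meaning the method left the sentence undefined). This mirrors the definitions above rather than the authors' exact evaluation code.

```python
def applicability(predictions) -> float:
    """Fraction of sentences for which the method emitted a polarity."""
    return sum(p is not None for p in predictions) / len(predictions)

def f1(preds, golds, cls) -> float:
    tp = sum(p == cls and g == cls for p, g in zip(preds, golds))
    fp = sum(p == cls and g != cls for p, g in zip(preds, golds))
    fn = sum(p != cls and g == cls for p, g in zip(preds, golds))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_f1(predictions, labels) -> float:
    """Average of positive and negative F1 over classified sentences."""
    pairs = [(p, g) for p, g in zip(predictions, labels) if p is not None]
    preds, golds = zip(*pairs) if pairs else ((), ())
    return (f1(preds, golds, "positive") + f1(preds, golds, "negative")) / 2
```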

4.2 Comparison Between Machine Translators

  • Using the 3 machine translators the authors selected to test their hypothesis, all the language datasets were translated from their original texts to English.
  • In Figure 4.1, the authors present the performance distribution as a boxplot, with the results for all datasets for each machine translation system.
  • The distributions are very similar, especially between the 25th and 75th percentiles, with Google Translate slightly better than the others.
  • According to the results, when averaging the Macro-F1 over all methods and all datasets, the systems from Yandex and Google both score 0.72.


  • Microsoft Translator has marginally inferior performance, with a Macro-F1 average of 0.69 and a standard deviation of 0.20.
  • The confidence intervals of the results overlap for α = 0.95, and the coefficient of variation is 0.02.
  • This does not mean that the sentences keep their sentiment polarity from the original language, but it gives confidence that the choice of machine translation system is unlikely to change the results abruptly.
  • It is important to explain why the boxplot has such long tails, with Macro-F1 outliers close to 0 and 1.
  • These are cases in which methods such as Emoticons or Panas-t have poor Applicability.

4.3 Overall Performance

  • First, the authors present Figure 4.2, which plots the distribution of Macro-F1 scores for non-native methods on each language dataset.
  • To complement this figure, Tables A.1 to A.14 show the results for Applicability, F1-scores (positive and negative classes), and Macro-F1 for each language dataset.
  • Additionally, Figures 4.3, 4.4, and 4.5 visualize the behavior of the methods regarding Applicability and Macro-F1 simultaneously.
  • Now, the authors discuss the main findings regarding these results.
  • If the labels on the x-axis were removed, it would be very hard to tell accurately which bar corresponds to the English language.


  • A potential lack of efficiency of the machine translation approach does not seem to influence the overall results.
  • If the contrary were true, the authors would expect the corresponding English boxplot to be an outlier.
  • In these figures, the authors plot the position of each method in a chart, for every language dataset, according to its Applicability (x-axis) and Macro-F1 (y-axis).
  • The authors also highlight the native methods with a red circle.
  • In these charts, Emoticons (2) usually appears in the upper-left positions, demonstrating its good Macro-F1 and poor Applicability.


  • Both are above 0.8, but Vader has much better Applicability.
  • As discussed before, the Haitian Creole chart has the most heterogeneous shape, with many of the methods toward the bottom-left corner.
  • Regarding the performance of the native methods, IBM Watson (16) stands out for English in Figure 4.4, with outstanding performance in Applicability and Macro-F1; on the other hand, it sits in the bottom-left corner for French.
  • The Emoticons method obtained a Macro-F1 of 1 for the translated Russian dataset, much better than the 0.52 obtained for the Spanish dataset.
  • Since these tables show the F1-Score per class, the authors can analyze the performance of the methods separately and understand whether one is better at analyzing positive or negative sentences.


  • The authors summarize the results, separating the two groups of methods.
  • In Table 4.2, the authors present the average Macro-F1 and Applicability for each language dataset and a final average performance for each group of methods.
  • One can observe that native methods have a higher Macro-F1 score on average, but lower Applicability.
  • Some details of these results are important to discuss.


  • In the Russian dataset, for example, the high Macro-F1 for native methods comes at the cost of an Applicability of only 0.08.
  • Also, the main problem with this evaluation is that it considers 15 translation-based methods, many of which push down the Macro-F1 average for the whole group.
  • Therefore, the authors check whether there is a subgroup of these methods that is consistently better than the native methods.
  • In the next section, the authors provide a different perspective on their results, presenting the methods according to their average rank in each dataset.
  • This approach allows some interesting conclusions to be drawn from the research.

4.4 Ranking the methods

  • In the previous section, the authors presented the detailed results of this work, comparing the Macro-F1 and Applicability metrics between the machine translation approach and native methods.
  • Now, the authors present another perspective on their results: a ranking of the methods based on their average position in each dataset (a small sketch of this computation appears at the end of this section).
  • Table 4.3 shows these results considering Macro-F1, and Table 4.5 shows the results for Applicability.
  • Semantria has a relatively good Macro-F1 average, only 0.01 behind Emoticons and Vader, but its average position is 5th in the rankings.
  • First, the Google Sentiment Analysis API and NRCHashtag appear at the top of the average ranking using Macro-F1 as the positional metric.
  • Considering both metrics, the Google Sentiment Analysis API has a great advantage: its Macro-F1 is only 0.07 behind the best method, and its Applicability is almost perfect.
  • Second, ten of the 15 translation-based methods show better results than the best native method (ML-SentiStrength) for Applicability.
  • In summary, the results show that native methods do not handle the trade-off between Macro-F1 and Applicability well.
  • This result is a warning to the authors of these methods, who should compare them not only with other native methods but also with the baseline proposed in this work.
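The average-rank computation can be reproduced in a few lines of pandas. The scores below are invented placeholders, not values from Table 4.3.

```python
import pandas as pd

# Hypothetical Macro-F1 table: rows = methods, columns = language datasets.
scores = pd.DataFrame(
    {"de": [0.82, 0.76, 0.71], "fr": [0.79, 0.80, 0.68], "pt": [0.84, 0.77, 0.70]},
    index=["SOCAL", "Vader", "NRCHashtag"],
)

ranks = scores.rank(axis=0, ascending=False)  # rank 1 = best per dataset
avg_rank = ranks.mean(axis=1).sort_values()   # average position across datasets
print(avg_rank)
```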


  • The authors propose iFeel 3.0, a benchmark system for sentence-level multilingual sentiment analysis.
  • First published in [Araújo et al., 2014], iFeel initially implemented only eight methods, without multilingual support.
  • The main reason for developing iFeel's third version was the lack of scalability and stability in both previous versions.
  • The system reached a peak of 100 registered users and, due to its high computational resource demands, it used to crash when a few users uploaded files to be analyzed in parallel.

5.1 iFeel Architecture and Functionalities

  • The local server runs the iFeel system, implemented on the Spring Framework; it is responsible for the security layer and the view layer, where the user interacts with the system.
  • The machine translation back end was chosen because it has the largest free tier among the top commercial machine translation systems.
  • The authors provide two fields to be filled in by the user: a language option and a free-text field; when the form is submitted, iFeel performs polarity analysis with all implemented methods, as shown in Figure 5.2.
  • In the example, the authors submitted the text "Brazilian president is going to have a fair judgment :)" with the "English" language selected.
  • Most of the methods labeled the sentence as "positive"; only Stanford and Happiness Index classified it as "neutral".
  • For batch analysis, the user uploads sentences in a plain-text file; iFeel then performs sentiment analysis on each line of the file, up to a maximum of 5,000 sentences.
  • The result is an .xml or .xlsx file containing the output of all implemented methods, which the user can download.
  • A future step for iFeel is to provide a REST API for its users.
  • This would meet the needs of the current state of the Internet, where microservices built for machine-to-machine communication provide specialized functionality as part of larger solutions.


  • The sentiment analysis field is currently popular and important for understanding social interactions across the Internet.
  • The field has clear value for academia and for commercial applications.
  • Specifically, the authors analyzed how current state-of-the-art English methods, with the help of machine translators, can solve this problem compared to previously published native methods.
  • Using the average position across the language datasets to rank these methods, the findings suggest that automatic translation of the input from a non-English language to English, followed by analysis with English methods, can be a competitive strategy if a suitable sentiment analysis method is chosen.
  • Moreover, the authors recommend using the SOCAL or SentiStrength methods with the machine translation approach when analyzing multilingual texts.


  • Throughout this work, the authors reviewed many attempts from the literature to implement multilingual sentiment analysis.
  • Their approach is distinguished from others in several ways.
  • It is the first to analyze such a wide variety of languages with gold-standard datasets.
  • Additionally, the results show that the machine translation approach is a generic methodology that can be used with all languages supported by any adequate machine translator.
  • The authors also release to the scientific community all the method code and labeled datasets used in this paper, hoping to help sentiment analysis become English-independent.



References
Watson, D., Clark, L. A., and Tellegen, A. (1988). Development and validation of brief measures of positive and negative affect: the PANAS scales. Journal of Personality and Social Psychology.
TL;DR: Develops the two 10-item mood scales that comprise the Positive and Negative Affect Schedule (PANAS), shown to be highly internally consistent, largely uncorrelated, and stable at appropriate levels over a 2-month period. Cited in this work as the basis of the Panas-t method.
Kim, Y. (2014). Convolutional neural networks for sentence classification.
TL;DR: A simple CNN trained on top of pre-trained word vectors, with little hyperparameter tuning and static vectors, achieves excellent results on multiple sentence-level classification benchmarks, including sentiment analysis.

Hu, M. and Liu, B. (2004). Mining and summarizing customer reviews. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2004).
TL;DR: Proposes several novel techniques to mine and summarize all the customer reviews of a product, based on the product features on which customers have expressed positive or negative opinions.


"A comparative study of machine tran..." refers background in this paper

  • ...Applicability F1( + ) F1( −) Macro-F1 Method Name 0.84 0.93 0.91 0.92 SOCAL 0.76 0.91 0.86 0.89 Semantria 0.94 0.87 0.87 0.87 Stanford 0.64 0.88 0.79 0.84 Vader 0.99 0.86 0.83 0.84 Google SA 0.79 0.84 0.81 0.82 Sentistrength 0.69 0.8 0.79 0.79 MPQA 0.87 0.82 0.7 0.76 AFINN 0.89 0.81 0.71 0.76 Emolex 0.80 0.76 0.76 0.76 Umigon 0.13 0.77 0.64 0.71 Panas-t 0.85 0.79 0.57 0.68 OpinionLexicon 0.97 0.58 0.72 0.65 NRCHashtag 0.82 0.7 0.45 0.57 SASA 0.86 0.73 0.33 0.53 Happiness Index 0.00 0 0 0.00 Emoticons - - - - IBM Watson - - - - ML-Sentistrength 16 The datasets are available at https://homepages.dcc.ufmg.br/ ∼fabricio/ifeel _ resources.htm ....

    [...]

  • ...Applicability F1( + ) F1( −) Macro-F1 Method Name 0.04 0.92 0.9 0.91 Emoticons 0.99 0.58 0.87 0.73 Google SA 0.29 0.63 0.78 0.71 Vader 0.58 0.5 0.76 0.63 AFINN 0.23 0.65 0.62 0.63 Umigon 0.35 0.54 0.64 0.59 OpinionLexicon 0.93 0.35 0.75 0.55 NRCHashtag 1.00 0.12 0.93 0.53 SASA 0.95 0.12 0.92 0.52 Stanford 0.00 0 1 0.50 Panas-t 0.42 0.44 0.56 0.50 SOCAL 0.27 0.4 0.54 0.47 MPQA 0.62 0.4 0.54 0.47 Sentistrength 0.37 0.3 0.58 0.44 Emolex 0.32 0.28 0.44 0.36 Happiness Index - - - - Semantria - - - - IBM Watson - - - - ML-Sentistrength Supplementary material Supplementary material associated with this article can be found, in the online version, at doi: 10.1016/j.ins.2019.10.031 ....

    [...]

  • ...Applicability F1( + ) F1( −) Macro-F1 Method Name 0.78 0.92 0.88 0.90 Sentistrength 0.65 0.92 0.85 0.89 Semantria 0.68 0.93 0.83 0.88 Vader 0.82 0.89 0.83 0.86 AFINN 0.12 0.86 0.86 0.86 Panas-t 0.82 0.88 0.77 0.83 OpinionLexicon 0.77 0.84 0.78 0.81 Umigon 0.81 0.86 0.76 0.81 SOCAL 0.88 0.87 0.72 0.80 Emolex 0.61 0.83 0.77 0.80 MPQA 0.98 0.88 0.73 0.80 Google SA 0.98 0.75 0.7 0.73 NRCHashtag 0.84 0.81 0.53 0.67 Happiness Index 0.79 0.76 0.52 0.64 SASA 0.95 0.53 0.67 0.60 Stanford 0.00 0 0 0.00 Emoticons - - - - IBM Watson - - - - ML-Sentistrength Table A9 Czech....

    [...]

  • ...Applicability F1( + ) F1( −) Macro-F1 Method Name 0.05 0.96 0.98 0.97 Panas-t 0.59 0.89 0.83 0.86 Vader 0.79 0.85 0.81 0.83 Sentistrength 0.12 0.90 0.76 0.83 Emoticons 0.72 0.82 0.81 0.82 SOCAL 0.68 0.83 0.79 0.81 Umigon 0.75 0.83 0.78 0.80 AFINN 0.97 0.82 0.77 0.79 Google SA 0.54 0.79 0.75 0.77 Semantria 0.74 0.79 0.73 0.76 Emolex 0.72 0.79 0.72 0.75 OpinionLexicon 0.51 0.72 0.73 0.73 MPQA 0.74 0.68 0.68 0.68 ML-Sentistrength 0.98 0.62 0.72 0.67 NRCHashtag 0.71 0.73 0.56 0.65 SASA 0.66 0.75 0.55 0.65 Happiness Index 0.93 0.52 0.71 0.62 Stanford 0.27 0.00 0.87 0.43 IBM Watson Table A6 Croatian....

    [...]

  • ...Applicability F1( + ) F1( −) Macro-F1 Method Name 0.35 0.91 0.83 0.87 Vader 0.78 0.87 0.82 0.84 SOCAL 0.38 0.83 0.8 0.82 MPQA 0.38 0.79 0.76 0.78 Umigon 0.71 0.82 0.72 0.77 OpinionLexicon 0.58 0.78 0.75 0.77 Sentistrength 0.63 0.81 0.7 0.75 AFINN 1.00 0.76 0.59 0.67 Google SA 0.91 0.62 0.68 0.65 Stanford 0.79 0.74 0.54 0.64 Emolex 0.05 0.87 0.4 0.63 Panas-t 0.79 0.71 0.54 0.62 SASA 0.63 0.74 0.31 0.53 Happiness Index 0.99 0.45 0.56 0.50 NRCHashtag 0.00 0 0 0.00 Emoticons - - - - Semantria - - - - IBM Watson - - - - ML-Sentistrength Table A8 Dutch....

    [...]

Pang, B., Lee, L., and Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2002).
TL;DR: Considers classifying documents not by topic but by overall sentiment; three machine learning methods (Naive Bayes, maximum entropy classification, and support vector machines) outperform human-produced baselines, though they perform worse on sentiment classification than on traditional topic-based categorization.

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2013).
TL;DR: Introduces the Stanford Sentiment Treebank, with fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences, and the Recursive Neural Tensor Network, which pushes single-sentence positive/negative classification from 80% to 85.4%.

Frequently Asked Questions
Q1. What have the authors contributed in "A comparative study of machine translation for multilingual sentence-level sentiment analysis" ?

Despite the significant interest in this theme and the amount of research effort in the field, almost all existing methods are designed to work only with English content. In this work, the authors take a different step in this field: they focus on evaluating existing language-specific sentiment analysis efforts against a simple yet effective baseline approach. To do so, they evaluated sixteen sentence-level sentiment analysis methods proposed for English, comparing them with three language-specific methods. Based on fourteen human-labeled language-specific datasets, the authors provide an extensive quantitative analysis of existing multi-language approaches. They also rank methods according to their prediction performance and identify the methods that achieved the best results using machine translation across different languages. Their primary results suggest that simply translating the input text in a specific language to English and then using one of the best existing English methods can be better than the language-specific efforts evaluated.

Most current strategies in many languages consist of adapting existing lexical resources, without presenting proper validations and basic baseline comparisons. 

As a final contribution to the research community, the authors release their codes, datasets, and the iFeel 3.0 system, a web framework for multilingual sentence-level sentiment analysis. 

Keywords: sentiment analysis, multilingual, machine translation, online social networks, opinion mining.

Sentiment analysis has become a key tool for several social media applications, including analysis of users' opinions about products and services, support for politics during campaigns, and even market trend analysis.
