
A comparative study of machine translation for multilingual sentence-level sentiment analysis

TL;DR: This work evaluates existing language-specific sentiment analysis efforts against a simple yet effective baseline approach, and suggests that simply translating the input text in a specific language to English and then applying one of the best existing English methods can be better than the language-specific approaches evaluated.
About: This article was published in Information Sciences on 2020-02-01 and is currently open access. It has received 72 citations to date. The article focuses on the topics: Sentiment analysis and Machine translation.

Summary



  • Sentiment analysis has become a popular tool for data analysts, especially those who deal with social media data.
  • Thus, sentiment analysis became a hot topic in Web applications, with high demand from industry and academia motivating the proposal of new methods for the problem.
  • The potential market for sentiment analysis in different languages is vast.
  • Consider, for example, a mobile application that relies on sentiment analysis.


  • Additionally, it argues for the use of translation-based techniques as a baseline for new multilingual sentiment analysis methods.
  • The authors emphasize that their work focuses on comparing "off-the-shelf" methods as they are used in practice.
  • This excludes most of the supervised methods, which require labeled sets for training, as these are usually not available to practitioners.
  • Moreover, most of the supervised solutions do not share their source code or a trained model. (Footnote 1: https://translate.google.com.)


  • According to Internet World Stats, seven of those languages appear among the top ten languages used on the Web and represent more than 61% of non-English-speaking users.
  • In line with a recent benchmark study [Ribeiro et al., 2016], their findings suggest that machine translation systems are mature enough to produce reliable English translations that can be used for sentence-level sentiment analysis with competitive prediction performance.
  • Additionally, the authors show that some popular language-specific methods do not have a significant advantage over a machine translation approach.

1.1 Objectives

  • The main objective of this work is to provide a quantitative comparison between several existing English sentiment analysis methods combined with machine translation in the multilingual context.
  • The authors also compare the results with current language-specific methods to determine whether those methods are better than the machine translation approach.


  • The authors' hypothesis rests on the assumption that machine translation of datasets to English, followed by analysis with English-specific methods, can be as good as the sentiment analysis methods created for particular languages.
  • They support this because, even when words change between two paired sentences in different languages, an accurate machine translation should preserve both the meaning and the sentiment polarity (a minimal sketch of this translate-then-classify baseline follows).
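To make the hypothesis concrete, here is a minimal Python sketch of the translate-then-classify baseline. Both helper functions are placeholders invented for illustration; the study itself relies on commercial translators (Google, Microsoft, Yandex) and off-the-shelf English methods.

```python
def translate_to_english(sentence: str, source_lang: str) -> str:
    # Placeholder: in the study this call would go to a commercial machine
    # translation system (Google, Microsoft or Yandex Translate).
    raise NotImplementedError

def english_sentiment(sentence: str) -> str:
    # Placeholder: any off-the-shelf English method (e.g. VADER, SOCAL),
    # returning "positive", "negative" or "neutral".
    raise NotImplementedError

def multilingual_polarity(sentence: str, source_lang: str) -> str:
    """Translate-then-classify baseline: translate first, then reuse an
    existing English sentiment analysis method unchanged."""
    return english_sentiment(translate_to_english(sentence, source_lang))
```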

1.2 Results and Contributions

  • To address the problem of multilingual sentiment analysis, the authors perform several experiments applying methods created for English to multilingual datasets with the help of automatic machine translators.
  • As the main result, the proposed hypothesis was confirmed: in no evaluation did the methods designed for specific languages outperform the machine translation approach.
  • This work highlights that current commercial and non-commercial sentiment analysis methods for non-English datasets are not powerful enough to beat the machine translation approach combined with state-of-the-art sentiment analysis methods published for English.
  • This work makes two main contributions.
  • Therefore, machine translation should be used as a baseline when new methods are proposed by the scientific community.


  • An evaluation of machine translation for multilingual sentence-level sentiment analysis.
  • In IV Brazilian Workshop on Social Network Analysis and Mining (BraSNAM 2015).
  • A multilingual benchmarking system for sentence-level sentiment analysis, also known as iFeel 2.0.
  • In these publications, the authors discuss the same approach described here, including a previous version of iFeel.

1.4 Organization

  • This chapter presents an overview of the main concepts and terminologies related to sentence-level sentiment analysis and the current state-of-the-art methodologies.
  • Furthermore, the authors describe existing approaches for non-English sentiment analysis, including previous direct machine translation approaches like the one focused on in this work, and how this work differs from them.
  • This chapter presents the resources used throughout this work to evaluate the hypothesis.
  • It describes the effort to gather representative human-labeled datasets in multiple languages, the machine translation systems used to translate these datasets to English, and all the English and non-English sentiment analysis methods.


  • This chapter presents the results and discussions that the authors use to validate their hypothesis.
  • It describes the performance comparison between the machine translation approach and language-specific methods, including an evaluation of machine translation systems and a ranking of the best non-English and English methods.
  • This chapter presents iFeel, a web-based framework for multilingual sentiment analysis developed to facilitate the study of sentiment analysis by the community and to share the code and datasets of this work.
  • It reports F1-Score and Macro-F1 for every language dataset and every supported method.

2.1 Definitions and Terminologies

  • Given the recent popularity of the term sentiment analysis, it has been used to describe a wide variety of tasks by the community.
  • A variety of conferences cover these topics, particularly in natural language processing, for example, the Annual Meeting of the Association for Computational Linguistics (ACL) and the Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • The SemEval workshop stands out as one that annually evaluates the current state-of-the-art techniques and proposes several new challenging tasks for the field.
  • Each of these tasks has many subtasks, ranging from three-class polarity detection of tweets to veracity prediction given a rumor.
  • Since there are many definitions related to the sentiment analysis field, the authors list and describe the concepts they use in the context of this work.


  • Given a message and a topic, the goal is to identify whether the message expresses a positive, negative, or neutral sentiment regarding that topic.
  • In any case, the higher the score, the stronger the positive sentiment in the sentence.
  • Since some words carry emotional meaning, such as surprise, anger, or happiness, the methods should be able to correctly identify the "affective text" that best matches the sentence.
  • The most common use of the term multilingual sentiment analysis is when authors propose a generic methodology to perform sentiment analysis on datasets written in just one language, usually other than English.
  • This term was coined in their recent work [Araújo et al., 2016].


  • For example, a method may have been originally created for English but given a new training dataset from a different language to perform multilingual analysis.
  • These methods are used in practice and exclude most of the supervised methods, which require labeled sets for training.
  • The granularity level indicates that the classification given by a method may be attached to whole documents (document-based sentiment), to individual sentences (sentence-based sentiment), or to specific aspects of entities (aspect-based sentiment) [Feldman, 2013].
  • Most approaches using this granularity in sentiment analysis are based either on supervised learning [Kim and Hovy, 2007] or on unsupervised learning [Yu and Hatzivassiloglou, 2003].
  • The sentence "This hotel, despite the great room, has terrible customer service" has two different polarities, associated with "room" and "customer service", for the same hotel.


  • At this granularity level, polarity classification occurs at the document level, detecting the polarity of a whole text at once.
  • [Pang et al., 2002] show that even at this simple granularity level, good accuracy can be achieved.
  • Given a strength score associated with the intensity of the sentiment, the authors map these outputs to the 2-class detection problem.
  • Many of the methods also produce a neutral output.
  • This extra class would turn the problem into a 3-class task, as discussed in depth in [Ribeiro et al., 2016] (a sketch of this output mapping follows).
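As an illustration of this output normalization, the sketch below maps a signed strength score onto the 2-class problem plus a neutral class. This is not the authors' code, and the threshold is an arbitrary illustrative choice.

```python
def to_polarity(score: float, neutral_band: float = 0.05) -> str:
    """Map a signed strength score to positive/negative, keeping a
    neutral class for scores inside a small band around zero."""
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"

assert to_polarity(0.8) == "positive"
assert to_polarity(-0.3) == "negative"
assert to_polarity(0.01) == "neutral"
```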

2.2 English Methods

  • Due to the enormous interest and applicability, there has been a corresponding increase in the number of proposed sentiment analysis methods in recent years.
  • These methods rely on many different techniques from various computer science fields.
  • For machine learning, one example is [Pannala et al., 2016], which discusses the use of Support Vector Machines (SVMs) and Maximum Entropy (ME) for polarity detection at the aspect level.
  • Lexicon-based methods use predefined lists of words in which each word is associated with a specific sentiment (a minimal example follows this list).
  • The lexical methods vary according to the context in which they were created.
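A minimal example of the lexicon-based idea described above; the tiny word list is invented for illustration and is not one of the lexicons used in this work.

```python
# Toy lexicon: word -> sentiment weight.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2}

def lexicon_score(sentence: str) -> int:
    """Sum the weights of known words; the sign gives the polarity."""
    return sum(LEXICON.get(tok, 0) for tok in sentence.lower().split())

print(lexicon_score("the room was great"))                         # 2 -> positive
print(lexicon_score("great room but terrible customer service"))   # 0 -> undecided
```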


  • Overall, the above techniques are accepted by the research community, and it is common to see contemporaneous important papers, sometimes published at the same computer science conference, using completely different methods.
  • In a recent study, [Ribeiro et al., 2016] compared many of the current "off-the-shelf" sentiment analysis methods for English over several English datasets; although they note that some methods usually perform better than others, they also conclude that no single method performs best in all cases.
  • Their study highlights the importance of both main techniques: machine learning and lexicons.
  • Finally, [Gonçalves et al., 2013] and [Gonçalves et al., 2016] also highlight the potential performance improvements from combining the outputs of multiple methods with weighting techniques.

2.3 Multilingual Sentiment Analysis

  • Most approaches for sentiment analysis available today were developed only for English, and few efforts explore the problem for other languages.
  • Despite this disadvantage compared to English, the following subsections list several attempts that move toward a multilingual sentiment analysis context.
  • In general, these previous efforts focus on adapting strategies that previously succeeded for English to other languages.
  • Overall, they provide limited baseline comparisons and validations.
  • It is unclear whether currently available language-specific strategies can surpass existing English sentiment analysis methods applied to text translated into English.

2.3.1 Machine translation-based methods

  • In an approach similar to this work, [Refaee and Rieser, 2015] machine-translated Arabic tweets into English.
  • The results on the translated text were then compared with native methods and, in the worst case, were only 5% inferior.
  • According to the authors, such a setup may be advantageous when the appropriate resources for a particular language are lacking and fast deployment is crucial.
  • Considering automatic translation to Romanian and Spanish, another effort investigates the performance of polarity classification trained from a labeled English corpus.
  • Similarly, to build a standalone multilingual sentiment analysis system, [Balahur and Turchi, 2013] built a simple method for English using a gold-standard dataset and translated this dataset from English into four other languages (Italian, Spanish, French, and German) to turn the method into a multilingual system.


  • They report that the resulting system can perform multilingual classification with 70% accuracy.
  • Nevertheless, this work is the first to test the technique at such a wide range, covering 14 different languages and comparing the results of 15 English sentiment analysis methods against 3 language-specific methods, which increases confidence in the hypothesis.
  • Besides, all the resources, including the iFeel system, were developed throughout this work to give the community easy access to the methods and techniques; this extra effort is unique and helps maintain reproducibility in the field.

2.3.2 Lexicon and corpus-based methods

  • These features, or rules, imply that if a word from a sentence appears in a previously defined rule, there is a high probability the sentence carries the same opinion as that rule.
  • Moreover, these rules are built on a combination of lexicons and several linguistic tools such as part-of-speech (POS) tagging.
  • In [Wan, 2008], the authors propose an approach that uses an English dataset to improve the results of a Chinese sentiment analysis.
  • These values were combined to calculate the sentiment value of the sentence.
  • The overall accuracy of this approach was 86%.

2.3.3 Machine Learning-based methods

  • Many of the proposed methods, not limited to this subsection, use machine learning techniques at least in part.
  • Usually, the most frequent models for classification tasks are Naive Bayes, Maximum Entropy, and Support Vector Machines.
  • While lexical resources are still used to detect polarity in text, machine learning approaches are more common in this type of analysis.
  • Performance is also highly dependent on the training dataset, including the context of the source from which the data was collected.
  • Instead of using a machine translation technique, the authors manually annotated three datasets (English, Dutch, and French) to train different machine learning algorithms.

2.3.4 Parallel corpus-based methods

  • A different approach to multilingual sentiment analysis is the use of a parallel corpus, which does not depend on machine translation.
  • The authors acquire some amount of sentiment-labeled data and a parallel dataset with the same semantic information in different languages.
  • In [Meng et al., 2012], the authors propose a technique named cross-lingual mixture model (CLMM), which maximizes the likelihood of bilingual parallel data in order to expand the vocabulary of the target language.
  • The CLMM proves effective when labeled data in the target language is scarce.
  • The authors also show that this methodology can boost the machine translation approach where the vocabulary is limited.


  • Their results show a 12% improvement in accuracy with this approach when combining corpora from English and Chinese.
  • First, they used sentiment-tagged Bible chapters in English to build the sentiment prediction model and the parallel foreign-language labels.
  • The authors then used 54 other versions of the Bible in different languages and Latent Semantic Indexing (LSI) to convert that multilingual corpus into a multilingual "concept space" (an illustrative sketch follows this list).
  • Their accuracy ranges from 72% to 75%.
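The "concept space" idea can be illustrated with off-the-shelf LSI components. The sketch below shows the general cross-language LSI technique on invented toy verse pairs; it is not the cited authors' implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Each training document concatenates a verse with its translation, so the
# SVD learns dimensions shared by both vocabularies (toy examples only).
parallel_docs = [
    "in the beginning god created the heavens and the earth "
    "no principio criou deus os ceus e a terra",
    "and god said let there be light e disse deus haja luz",
    "and there was light e houve luz",
]
tfidf = TfidfVectorizer().fit_transform(parallel_docs)
svd = TruncatedSVD(n_components=2)
concepts = svd.fit_transform(tfidf)  # documents in the shared "concept space"
print(concepts.shape)                # (3, 2)
```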

2.3.5 Hybrid cross-lingual and lexicon-based methods

  • Many techniques combine corpus-based and lexicon-based approaches, focusing on domain adaptation of sentiment analysis for resource-poor languages or special domains.
  • These techniques mostly use both annotated corpora and lexicon resources to learn labels and expand vocabulary.
  • Most of these models are also developed with machine learning algorithms.
  • They include parsing and pattern-matching techniques, using transfer-based machine translation technology to develop a high-precision model.
  • To improve classification, they extracted word semantic orientation from the lexical resource SentiWordNet.


  • On the other hand, [Demirtas and Pechenizkiy, 2013] do not achieve good results using a cross-lingual framework to analyze movie and product review datasets in English and Turkish.
  • The authors show that expanding the training size with new instances taken from another corpus does not necessarily increase classification accuracy.
  • Co-training classification with machine translation improved the results when used in semi-supervised learning with unlabeled data from the same domain.

2.3.6 Neural Networks-based methods

  • Neural networks, also called deep learning-based methods, have recently shown promise for text classification and sentiment analysis [Kim, 2014].
  • Cascaded layers with non-linearities allow these models to build complex functions such as sentiment compositionality, while their ability to process raw signals gives them language and domain independence.
  • One such work proposed a convolutional neural network (CNN) for both tasks: aspect extraction and aspect-based sentiment analysis (a minimal CNN sketch follows this list).
  • Their methodology ranked in the top two in 7 out of 11 language-domain pairs for polarity classification, and in the top two in 5 out of 11 language-domain pairs for the aspect-based task.
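For reference, a minimal Kim (2014)-style CNN for sentence polarity, sketched in PyTorch. This illustrates the architecture family, not the specific model from the cited aspect-based work; all sizes are illustrative defaults.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Convolutions over word embeddings, max-pooled per filter size."""
    def __init__(self, vocab_size, embed_dim=128, num_classes=2,
                 kernel_sizes=(3, 4, 5), num_filters=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                       # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)   # (batch, embed, seq)
        pooled = [F.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.fc(torch.cat(pooled, dim=1))        # class logits

logits = TextCNN(vocab_size=10_000)(torch.randint(0, 10_000, (8, 40)))
print(logits.shape)  # torch.Size([8, 2])
```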

2.3.7 Research Gap

  • This brief literature overview shows how complex sentiment analysis is, with a variety of tasks and subtasks.
  • Methods usually mix several of these techniques to improve their results.
  • Multilingual approaches, however, have not yet seen successful adoption.
  • Therefore, the authors believe that a comparison of "off-the-shelf" methods applied to a wide range of languages has significant value for the community.
  • The authors' methodology to evaluate sentiment analysis in multiple languages involves three key elements.

3.1 English Sentiment Analysis Methods

  • The term sentiment analysis has been used to describe different tasks and problems.
  • It is common to see sentiment analysis used to describe efforts to extract opinions from reviews [Hu and Liu, 2004], gauge news polarity [Reis et al., 2015a], and measure mood fluctuations [Hannak et al., 2012].
  • Hence, the authors restrict their focus to efforts related to detecting the polarity (i.e., positivity or negativity) of a given text, which can be done with small adaptations to the output of some existing methods, a methodology previously described by [Gonçalves et al., 2013; Araújo et al., 2014].
  • The authors' effort to identify a large number of sentiment analysis methods consisted of systematically searching for them in the main conferences in the field and then checking those methods' citations and the papers that cited them.
  • It is important to note that some methods are available for download on the Web, others were kindly shared by their authors upon request, and a small number were reproduced from the paper that describes the method.
  • The latter usually happened when authors shared only the lexical dictionaries they created, leaving the implementation of the method that uses the lexical resource to the authors of this work.
  • Table 3.1 presents an overview of the methods used in this work, the reference paper in which each was published, and the main technique each is based on (machine learning or lexicon).
  • The original outputs of these methods are listed in the table; outputs the authors consider positive are colored blue, negative outputs red, and neutral outputs black.
  • The methods used in this work were discussed in depth, and had their performance compared across different English datasets, in [Ribeiro et al., 2016].
  • Following that methodology, the authors chose 15 methods from that study.
  • Finally, the authors also chose to add the Google Prediction API, a commercial sentiment analysis tool created by Google, to examine discrepancies between paid and free methods.
  • All of the methods, excluding the Google Prediction API, can be used in the iFeel system developed in this work and described in Chapter 5.

3.2 Human Labeled Datasets

  • The authors present an overview of the datasets used in this work to compare the performance of their approach against traditional methods.
  • These workloads consist of 14 gold-standard datasets of sentences, labeled by humans as positive, negative, or neutral according to their sentiment polarity.
  • Using the human labels, the authors can compare the quality of the sentiment analysis methods and judge their performance.
  • Table 3.3 summarizes the relevant information about these datasets, showing in each row the language, its ISO 639-1 two-letter code, where it was first published, the type of data collected, and the number of positive (Pos) and negative (Neg) sentences labeled by humans.
  • The authors contacted many others who published work related to non-English sentence-level sentiment analysis; the result of this extensive manual effort is a unique and rich source of human-labeled sentences in many languages. (Footnote 1: The datasets used in this paper are available upon request.)
  • Producing human-labeled sentiment datasets is very challenging for two main reasons: the subjectivity intrinsic to each sentence (context dependence) and the amount of time humans need to label thousands of sentences.
  • The way the authors found to proceed with this work was to contact different, independent authors in the field who had already done this labeling work for a specific language.
  • After obtaining these 14 independent datasets, the authors post-processed them to make sure the labels are uniform and comparable to the sentiment analysis methods' output (a sketch of this step follows this list).
  • Some datasets were labeled by three annotators, others by two.
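A sketch of what this label post-processing can look like. The raw label spellings below are invented, since each source dataset uses its own scheme; the point is mapping everything onto one canonical scale.

```python
# Map heterogeneous raw labels onto one canonical scale: -1, 0, +1.
RAW_TO_CANONICAL = {
    "positive": 1, "pos": 1, "+": 1, "1": 1,
    "negative": -1, "neg": -1, "-": -1, "-1": -1,
    "neutral": 0, "neu": 0, "0": 0,
}

def normalize_label(raw) -> int:
    key = str(raw).strip().lower()
    if key not in RAW_TO_CANONICAL:
        raise ValueError(f"unmapped label: {raw!r}")
    return RAW_TO_CANONICAL[key]

print([normalize_label(x) for x in ("POS", "-1", "Neutral")])  # [1, -1, 0]
```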

3.3 Language-Specific Sentiment Analysis Methods

  • Ideally, the authors would like to compare machine translation, using all the English methods described in Section 3.1, against a large number of methods proposed for specific languages.
  • While the authors succeeded in obtaining a large number of datasets, most of these methods are not available even upon request to their authors, making reproducibility almost impossible in most cases.
  • The authors were able to assess 3 native methods created or trained specifically for certain languages.
  • The authors of one of them released an adaptation of the original SentiStrength that consists of swapping the lexicon files for those of the language in which one wants to perform sentiment analysis.
  • The authors used the trial version of the Microsoft Excel plugin available on the method's website.

3.4 Machine Translation Systems

  • Machine translation, or automated translation, has been a research field since the 1950s [V. Le and Schuster, 2016].
  • Its main goal is to provide text translation by a computer without human interaction.
  • There are three main approaches to automatic translation: rule-based/phrase-based systems, statistical methods, and neural networks (neural machine translation).
  • Rule-based systems use lexicons combined with grammar definitions to translate sentences in a meaningful way.
  • For the purposes of this work, the authors address two main potential questions related to the use of machine translators.
  • Such tools, even when based on a pre-trained statistical system, are static and do not follow the evolving language of the Web.
  • So the authors chose well-known commercial tools that periodically retrain their models, as explained by [Microsoft, 2017], [Yandex, 2017], and [V. Le and Schuster, 2016].


  • Figure 3.1 shows a performance comparison between three translation candidates: a neural network system, a phrase-based system, and human translators.
  • The Microsoft Translator Text API can be used through the Microsoft Azure platform; it allows processing the first 2 million characters for free, and each additional million characters costs US$10.
  • The authors present all the experiments performed in this work to sustain their hypothesis.
  • The authors believe that current sentiment analysis methods created for English, combined with current state-of-the-art machine translation systems, can be as good as or even better than native sentiment analysis methods in multiple languages.

4.1 Metrics

  • The F1-Score is a metric used to compare the quality of a prediction against a given ground truth.
  • The F1-Score considers the correct classification of each sentence equally important, independently of the class, and basically measures the capability of the method to predict the correct output.
  • This metric can be easily computed for 2-class experiments using Table 4.1.
  • The precision of the positive class is computed as Prec(+) = TP / (TP + FP); the full set of definitions is written out below.
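Written out from the standard confusion-matrix definitions (with TP, FP, and FN taken with respect to the positive class), the quantities behind Table 4.1 are:

```latex
\mathrm{Prec}(+) = \frac{TP}{TP + FP}, \qquad
\mathrm{Rec}(+)  = \frac{TP}{TP + FN}, \qquad
F1(+) = \frac{2\,\mathrm{Prec}(+)\,\mathrm{Rec}(+)}{\mathrm{Prec}(+) + \mathrm{Rec}(+)}
```

with F1(−) defined symmetrically for the negative class, and Macro-F1 = (F1(+) + F1(−)) / 2.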


  • This metric considers the effectiveness in each class equally important, independently of the relative size of the class.
  • Therefore, the reported Macro-F1 represents how effective the method is when it indicates a polarity.
  • The methods may still give the neutral classification for some of the sentences.
  • Suppose, for instance, that the Emoticons method can classify only 10% of the sentences in a dataset, corresponding to the actual percentage of sentences containing emoticons.


  • Applicability is calculated as the total number of sentences minus the number of undefined sentences, divided by the total number of sentences: Applicability = (N − N_undefined) / N.
  • Throughout the analysis of the results, the authors mainly discuss the trade-off between these two metrics: Macro-F1 and Applicability.
  • The authors understand that Macro-F1 may not carry the same weight as Applicability depending on the task; hence, in their analysis, they show and discuss these metrics separately (a computational sketch of both follows this list).
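A sketch of how the two metrics can be computed, assuming predictions are "positive", "negative", or None (None meaning the method left the sentence undefined). This mirrors the definitions above rather than the authors' exact evaluation code.

```python
def applicability(predictions) -> float:
    """Fraction of sentences for which the method emitted a polarity."""
    return sum(p is not None for p in predictions) / len(predictions)

def f1(preds, golds, cls) -> float:
    tp = sum(p == cls and g == cls for p, g in zip(preds, golds))
    fp = sum(p == cls and g != cls for p, g in zip(preds, golds))
    fn = sum(p != cls and g == cls for p, g in zip(preds, golds))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_f1(predictions, labels) -> float:
    """Average of positive and negative F1 over classified sentences."""
    pairs = [(p, g) for p, g in zip(predictions, labels) if p is not None]
    preds, golds = zip(*pairs) if pairs else ((), ())
    return (f1(preds, golds, "positive") + f1(preds, golds, "negative")) / 2
```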

4.2 Comparison Between Machine Translators

  • Using the 3 machine translators the authors selected to test their hypothesis, all the language datasets were translated from their original texts to English.
  • In Figure 4.1, the authors present the performance distribution as a boxplot, with the results for all datasets for each machine translation system.
  • The distributions are very similar, especially between the 25th and 75th percentiles, with Google Translate slightly better than the others.
  • According to the results, when averaging the Macro-F1 over all methods and all datasets, the systems from Yandex and Google both score 0.72.


  • Microsoft Translator has marginally inferior performance, with a Macro-F1 average of 0.69 and a standard deviation of 0.20.
  • The confidence intervals of the results overlap for α = 0.95, and the coefficient of variation is 0.02.
  • This does not mean that the sentences keep their sentiment polarity from the original language, but it gives confidence that the choice of machine translation system is unlikely to change the results abruptly.
  • It is important to explain why the boxplot has such long tails, with Macro-F1 outliers close to 0 and 1.
  • These are cases in which methods such as Emoticons or Panas-t have poor Applicability.

4.3 Overall Performance

  • First, the authors present Figure 4.2, which plots the distribution of Macro-F1 scores for non-native methods on each language dataset.
  • To complement this figure, Tables A.1 to A.14 show the results for Applicability, F1-scores (positive and negative classes), and Macro-F1 for each language dataset.
  • Additionally, Figures 4.3, 4.4, and 4.5 visualize the behavior of the methods regarding Applicability and Macro-F1 simultaneously.
  • Now, the authors discuss the main findings regarding these results.
  • If the labels on the x-axis were removed, it would be very hard to tell accurately which bar corresponds to the English language.


  • A potential lack of efficiency of the machine translation approach does not seem to influence the overall results.
  • If the contrary were true, the authors would expect the corresponding English boxplot to be an outlier.
  • In these figures, the authors plot the position of each method in a chart, for every language dataset, according to its Applicability (x-axis) and Macro-F1 (y-axis).
  • The authors also highlight the native methods with a red circle.
  • In these charts, Emoticons (2) usually appears in the upper-left positions, demonstrating its good Macro-F1 and poor Applicability.


  • Both are above 0.8, but Vader has much better Applicability.
  • As discussed before, the Haitian Creole chart has the most heterogeneous shape, with many of the methods toward the bottom-left corner.
  • Regarding the performance of the native methods, IBM Watson (16) stands out for English in Figure 4.4, with outstanding performance in Applicability and Macro-F1; on the other hand, it sits in the bottom-left corner for French.
  • The Emoticons method obtained a Macro-F1 of 1 for the translated Russian dataset, much better than the 0.52 obtained for the Spanish dataset.
  • Since these tables show the F1-Score per class, the authors can analyze the performance of the methods separately and understand whether one is better at analyzing positive or negative sentences.


  • The authors summarize the results, separating the two groups of methods.
  • In Table 4.2, the authors present the average Macro-F1 and Applicability for each language dataset and a final average performance for each group of methods.
  • One can observe that native methods have a higher Macro-F1 score on average, but lower Applicability.
  • Some details of these results are important to discuss.


  • In the Russian dataset, for example, the high Macro-F1 for native methods comes at the cost of an Applicability of only 0.08.
  • Also, the main problem with this evaluation is that it considers 15 translation-based methods, many of which push down the Macro-F1 average for the whole group.
  • Therefore, the authors check whether there is a subgroup of these methods that is consistently better than the native methods.
  • In the next section, the authors provide a different perspective on their results, presenting the methods according to their average rank in each dataset.
  • This approach allows some interesting conclusions to be drawn from the research.

4.4 Ranking the methods

  • In the previous section, the authors presented the detailed results of this work, comparing the Macro-F1 and Applicability metrics between the machine translation approach and native methods.
  • Now, the authors present another perspective on their results: a ranking of the methods based on their average position in each dataset (a small sketch of this computation appears at the end of this section).
  • Table 4.3 shows these results considering Macro-F1, and Table 4.5 shows the results for Applicability.
  • Semantria has a relatively good Macro-F1 average, only 0.01 behind Emoticons and Vader, but its average position is 5th in the rankings.
  • First, the Google Sentiment Analysis API and NRCHashtag appear at the top of the average ranking using Macro-F1 as the positional metric.
  • Considering both metrics, the Google Sentiment Analysis API has a great advantage: its Macro-F1 is only 0.07 behind the best method, and its Applicability is almost perfect.
  • Second, ten of the 15 translation-based methods show better results than the best native method (ML-SentiStrength) for Applicability.
  • In summary, the results show that native methods do not handle the trade-off between Macro-F1 and Applicability well.
  • This result is a warning to the authors of these methods, who should compare them not only with other native methods but also with the baseline proposed in this work.
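The average-rank computation can be reproduced in a few lines of pandas. The scores below are invented placeholders, not values from Table 4.3.

```python
import pandas as pd

# Hypothetical Macro-F1 table: rows = methods, columns = language datasets.
scores = pd.DataFrame(
    {"de": [0.82, 0.76, 0.71], "fr": [0.79, 0.80, 0.68], "pt": [0.84, 0.77, 0.70]},
    index=["SOCAL", "Vader", "NRCHashtag"],
)

ranks = scores.rank(axis=0, ascending=False)  # rank 1 = best per dataset
avg_rank = ranks.mean(axis=1).sort_values()   # average position across datasets
print(avg_rank)
```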


  • The authors propose iFeel 3.0, a benchmark system for sentence-level multilingual sentiment analysis.
  • First published in [Araújo et al., 2014], iFeel initially implemented only eight methods, without multilingual support.
  • The main reason for developing iFeel's third version was the lack of scalability and stability in both previous versions.
  • The system reached a peak of 100 registered users and, due to its high computational resource demands, it used to crash when a few users uploaded files to be analyzed in parallel.

5.1 iFeel Architecture and Functionalities

  • The local server runs the iFeel system, implemented on the Spring Framework; it is responsible for the security layer and the view layer, where the user interacts with the system.
  • The machine translation back end was chosen because it has the largest free tier among the top commercial machine translation systems.
  • The authors provide two fields to be filled in by the user: a language option and a free-text field; when the form is submitted, iFeel performs polarity analysis with all implemented methods, as shown in Figure 5.2.
  • In the example, the authors submitted the text "Brazilian president is going to have a fair judgment :)" with the "English" language selected.
  • Most of the methods labeled the sentence as "positive"; only Stanford and Happiness Index classified it as "neutral".
  • For batch analysis, the user uploads sentences in a plain-text file; iFeel then performs sentiment analysis on each line of the file, up to a maximum of 5,000 sentences.
  • The result is an .xml or .xlsx file containing the output of all implemented methods, which the user can download.
  • A future step for iFeel is to provide a REST API for its users.
  • This would meet the needs of the current state of the Internet, where microservices built for machine-to-machine communication provide specialized functionality as part of larger solutions.


  • The sentiment analysis field is currently popular and important for understanding social interactions across the Internet.
  • The field has clear value for academia and for commercial applications.
  • Specifically, the authors analyzed how current state-of-the-art English methods, with the help of machine translators, can solve this problem compared to previously published native methods.
  • Using the average position across the language datasets to rank these methods, the findings suggest that automatic translation of the input from a non-English language to English, followed by analysis with English methods, can be a competitive strategy if a suitable sentiment analysis method is chosen.
  • Moreover, the authors recommend using the SOCAL or SentiStrength methods with the machine translation approach when analyzing multilingual texts.


  • Throughout this work, the authors reviewed many attempts from the literature to implement multilingual sentiment analysis.
  • Their approach is distinguished from others in several ways.
  • It is the first to analyze such a wide variety of languages with gold-standard datasets.
  • Additionally, the results show that the machine translation approach is a generic methodology that can be used with all languages supported by any adequate machine translator.
  • The authors also release to the scientific community all the method code and labeled datasets used in this paper, hoping to help sentiment analysis become English-independent.



References
Watson, D., Clark, L. A., and Tellegen, A. (1988). Development and validation of brief measures of positive and negative affect: the PANAS scales. Journal of Personality and Social Psychology.
TL;DR: Develops the two 10-item mood scales that comprise the Positive and Negative Affect Schedule (PANAS), shown to be highly internally consistent, largely uncorrelated, and stable at appropriate levels over a 2-month period. Cited in this work as the basis of the Panas-t method.
Kim, Y. (2014). Convolutional neural networks for sentence classification.
TL;DR: A simple CNN trained on top of pre-trained word vectors, with little hyperparameter tuning and static vectors, achieves excellent results on multiple sentence-level classification benchmarks, including sentiment analysis.

Hu, M. and Liu, B. (2004). Mining and summarizing customer reviews. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2004).
TL;DR: Proposes several novel techniques to mine and summarize all the customer reviews of a product, based on the product features on which customers have expressed positive or negative opinions.


"A comparative study of machine tran..." refers background in this paper

  • ...Applicability F1( + ) F1( −) Macro-F1 Method Name 0.84 0.93 0.91 0.92 SOCAL 0.76 0.91 0.86 0.89 Semantria 0.94 0.87 0.87 0.87 Stanford 0.64 0.88 0.79 0.84 Vader 0.99 0.86 0.83 0.84 Google SA 0.79 0.84 0.81 0.82 Sentistrength 0.69 0.8 0.79 0.79 MPQA 0.87 0.82 0.7 0.76 AFINN 0.89 0.81 0.71 0.76 Emolex 0.80 0.76 0.76 0.76 Umigon 0.13 0.77 0.64 0.71 Panas-t 0.85 0.79 0.57 0.68 OpinionLexicon 0.97 0.58 0.72 0.65 NRCHashtag 0.82 0.7 0.45 0.57 SASA 0.86 0.73 0.33 0.53 Happiness Index 0.00 0 0 0.00 Emoticons - - - - IBM Watson - - - - ML-Sentistrength 16 The datasets are available at https://homepages.dcc.ufmg.br/ ∼fabricio/ifeel _ resources.htm ....

    [...]

  • ...Applicability F1( + ) F1( −) Macro-F1 Method Name 0.04 0.92 0.9 0.91 Emoticons 0.99 0.58 0.87 0.73 Google SA 0.29 0.63 0.78 0.71 Vader 0.58 0.5 0.76 0.63 AFINN 0.23 0.65 0.62 0.63 Umigon 0.35 0.54 0.64 0.59 OpinionLexicon 0.93 0.35 0.75 0.55 NRCHashtag 1.00 0.12 0.93 0.53 SASA 0.95 0.12 0.92 0.52 Stanford 0.00 0 1 0.50 Panas-t 0.42 0.44 0.56 0.50 SOCAL 0.27 0.4 0.54 0.47 MPQA 0.62 0.4 0.54 0.47 Sentistrength 0.37 0.3 0.58 0.44 Emolex 0.32 0.28 0.44 0.36 Happiness Index - - - - Semantria - - - - IBM Watson - - - - ML-Sentistrength Supplementary material Supplementary material associated with this article can be found, in the online version, at doi: 10.1016/j.ins.2019.10.031 ....

    [...]

  • ...Applicability F1( + ) F1( −) Macro-F1 Method Name 0.78 0.92 0.88 0.90 Sentistrength 0.65 0.92 0.85 0.89 Semantria 0.68 0.93 0.83 0.88 Vader 0.82 0.89 0.83 0.86 AFINN 0.12 0.86 0.86 0.86 Panas-t 0.82 0.88 0.77 0.83 OpinionLexicon 0.77 0.84 0.78 0.81 Umigon 0.81 0.86 0.76 0.81 SOCAL 0.88 0.87 0.72 0.80 Emolex 0.61 0.83 0.77 0.80 MPQA 0.98 0.88 0.73 0.80 Google SA 0.98 0.75 0.7 0.73 NRCHashtag 0.84 0.81 0.53 0.67 Happiness Index 0.79 0.76 0.52 0.64 SASA 0.95 0.53 0.67 0.60 Stanford 0.00 0 0 0.00 Emoticons - - - - IBM Watson - - - - ML-Sentistrength Table A9 Czech....

    [...]

  • ...Applicability F1( + ) F1( −) Macro-F1 Method Name 0.05 0.96 0.98 0.97 Panas-t 0.59 0.89 0.83 0.86 Vader 0.79 0.85 0.81 0.83 Sentistrength 0.12 0.90 0.76 0.83 Emoticons 0.72 0.82 0.81 0.82 SOCAL 0.68 0.83 0.79 0.81 Umigon 0.75 0.83 0.78 0.80 AFINN 0.97 0.82 0.77 0.79 Google SA 0.54 0.79 0.75 0.77 Semantria 0.74 0.79 0.73 0.76 Emolex 0.72 0.79 0.72 0.75 OpinionLexicon 0.51 0.72 0.73 0.73 MPQA 0.74 0.68 0.68 0.68 ML-Sentistrength 0.98 0.62 0.72 0.67 NRCHashtag 0.71 0.73 0.56 0.65 SASA 0.66 0.75 0.55 0.65 Happiness Index 0.93 0.52 0.71 0.62 Stanford 0.27 0.00 0.87 0.43 IBM Watson Table A6 Croatian....

    [...]

  • ...Applicability F1( + ) F1( −) Macro-F1 Method Name 0.35 0.91 0.83 0.87 Vader 0.78 0.87 0.82 0.84 SOCAL 0.38 0.83 0.8 0.82 MPQA 0.38 0.79 0.76 0.78 Umigon 0.71 0.82 0.72 0.77 OpinionLexicon 0.58 0.78 0.75 0.77 Sentistrength 0.63 0.81 0.7 0.75 AFINN 1.00 0.76 0.59 0.67 Google SA 0.91 0.62 0.68 0.65 Stanford 0.79 0.74 0.54 0.64 Emolex 0.05 0.87 0.4 0.63 Panas-t 0.79 0.71 0.54 0.62 SASA 0.63 0.74 0.31 0.53 Happiness Index 0.99 0.45 0.56 0.50 NRCHashtag 0.00 0 0 0.00 Emoticons - - - - Semantria - - - - IBM Watson - - - - ML-Sentistrength Table A8 Dutch....

    [...]

Pang, B., Lee, L., and Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2002).
TL;DR: Considers classifying documents not by topic but by overall sentiment; three machine learning methods (Naive Bayes, maximum entropy classification, and support vector machines) outperform human-produced baselines, though they perform worse on sentiment classification than on traditional topic-based categorization.

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2013).
TL;DR: Introduces the Stanford Sentiment Treebank, with fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences, and the Recursive Neural Tensor Network, which pushes single-sentence positive/negative classification from 80% to 85.4%.

Frequently Asked Questions
Q1. What have the authors contributed in "A comparative study of machine translation for multilingual sentence-level sentiment analysis" ?

Despite the significant interest in this theme and the amount of research effort in the field, almost all existing methods are designed to work only with English content. In this work, the authors take a different step in this field: they focus on evaluating existing language-specific sentiment analysis efforts against a simple yet effective baseline approach. To do so, they evaluated sixteen sentence-level sentiment analysis methods proposed for English, comparing them with three language-specific methods. Based on fourteen human-labeled language-specific datasets, the authors provide an extensive quantitative analysis of existing multi-language approaches. They also rank methods according to their prediction performance and identify the methods that achieved the best results using machine translation across different languages. Their primary results suggest that simply translating the input text in a specific language to English and then using one of the best existing English methods can be better than the language-specific efforts evaluated.

Most current strategies in many languages consist of adapting existing lexical resources, without presenting proper validations and basic baseline comparisons. 

As a final contribution to the research community, the authors release their codes, datasets, and the iFeel 3.0 system, a web framework for multilingual sentence-level sentiment analysis. 

Keywords: sentiment analysis, multilingual, machine translation, online social networks, opinion mining.

Sentiment analysis has become a key tool for several social media applications, including analysis of users' opinions about products and services, support for politics during campaigns, and even market trend analysis.
