Empir Software Eng (2017) 22:2543–2584
DOI 10.1007/s10664-016-9493-x
On negative results when using sentiment analysis tools
for software engineering research
Robbert Jongeling¹ · Proshanta Sarkar² · Subhajit Datta³ · Alexander Serebrenik¹
Published online: 10 January 2017
© The Author(s) 2017. This article is published with open access at Springerlink.com
Abstract Recent years have seen increasing attention to social aspects of software engineering, including studies of the emotions and sentiments experienced and expressed by software developers. Most of these studies reuse existing sentiment analysis tools such as SENTISTRENGTH and NLTK. However, these tools have been trained on product reviews and movie reviews and, therefore, their results might not be applicable in the software engineering domain. In this paper we study whether the sentiment analysis tools agree with the sentiment recognized by human evaluators (as reported in an earlier study) as well as with each other. Furthermore, we evaluate the impact of the choice of a sentiment analysis tool on software engineering studies by conducting a simple study of differences in issue resolution times for positive, negative and neutral texts. We repeat the study for seven datasets (issue trackers and STACK OVERFLOW questions) and different sentiment analysis tools and observe that the disagreement between the tools can lead to diverging conclusions. Finally, we perform two replications of previously published studies and observe that the results of those studies cannot be confirmed when a different sentiment analysis tool is used.
Communicated by: Richard Paige, Jordi Cabot and Neil Ernst
Alexander Serebrenik
a.serebrenik@tue.nl
Robbert Jongeling
r.m.jongeling@alumnus.tue.nl
Proshanta Sarkar
proshant.cse@gmail.com
Subhajit Datta
subhajit.datta@acm.org
1 Eindhoven University of Technology, Eindhoven, The Netherlands
2 IBM India Private Limited, Kolkata, India
3 Singapore University of Technology and Design, Singapore, Singapore

Keywords Sentiment analysis tools · Replication study · Negative results
1 Introduction
Sentiment analysis is "the task of identifying positive and negative opinions, emotions, and evaluations" (Wilson et al. 2005). Since its inception sentiment analysis has been the subject of intensive research effort and has been successfully applied, e.g., to assist users in their development by providing them with interesting and supportive content (Honkela et al. 2012), to predict the outcome of an election (Tumasjan et al. 2010) or movie sales (Mishne and Glance 2006). The spectrum of sentiment analysis techniques ranges from identifying polarity (positive or negative) to a complex computational treatment of subjectivity, opinion and sentiment (Pang and Lee 2007). In particular, the research on sentiment polarity analysis has resulted in a number of mature and publicly available tools such as SENTISTRENGTH (Thelwall et al. 2010), Alchemy (http://www.alchemyapi.com/products/alchemylanguage/sentiment-analysis/), the Stanford NLP sentiment analyser (Socher et al. 2013) and NLTK (Bird et al. 2009).
In recent times, large-scale software development has become increasingly social. With the proliferation of collaborative development environments, discussions between developers are recorded and archived to an extent that could not be conceived before. The availability of such discussion materials makes it easy to study whether and how the sentiments expressed by software developers influence the outcome of development activities. With this background, we apply sentiment polarity analysis to several software development ecosystems in this study.
Sentiment polarity analysis has recently been applied in the software engineering context to study commit comments in GitHub (Guzman et al. 2014), GitHub discussions related to security (Pletea et al. 2014), productivity in Jira issue resolution (Ortu et al. 2015), activity of contributors in Gentoo (Garcia et al. 2013), classification of user reviews for maintenance and evolution (Panichella et al. 2015) and evolution of developers' sentiments in the openSUSE Factory (Rousinopoulos et al. 2014). It has also been suggested for assessing technical candidates on the social web (Capiluppi et al. 2013). Not surprisingly, all the aforementioned software engineering studies, with the notable exception of the work by Panichella et al. (2015), reuse existing sentiment polarity tools: e.g., Pletea et al. (2014) and Rousinopoulos et al. (2014) use NLTK, while Garcia et al. (2013), Guzman and Bruegge (2013), Guzman et al. (2014), Novielli et al. (2015) and Ortu et al. (2015) opted for SENTISTRENGTH. While the reuse of existing tools facilitated the application of sentiment polarity analysis techniques in the software engineering domain, it also introduced a commonly recognized threat to the validity of the results obtained: those tools have been trained on non-software engineering related texts such as movie reviews or product reviews and might misidentify (or fail to identify) the polarity of a sentiment in a software engineering artefact such as a commit comment (Guzman et al. 2014; Pletea et al. 2014).
Therefore, in this paper we focus on sentiment polarity analysis (Wilson et al. 2005) and investigate to what extent software engineering results obtained from sentiment analysis depend on the choice of the sentiment analysis tool. We recognize that there are multiple ways to measure outcomes in software engineering. Among them, the time to resolve a particular defect and/or to respond to a particular query is relevant for end users. Accordingly, in the different datasets studied in this paper, we have taken such resolution or response times to reflect the outcomes of our interest.
For the sake of simplicity, from here on, instead of “existing sentiment polarity analysis
tools” we talk about the “sentiment analysis tools”. Specifically, we aim at answering the
following questions:
RQ1: To what extent do different sentiment analysis tools agree with emotions of
software developers?
RQ2: To what extent do results from different sentiment analysis tools agree with each
other?
We have observed disagreement not only between sentiment analysis tools and the emotions of software developers, but also between different sentiment analysis tools themselves. However, disagreement between the tools does not a priori mean that sentiment analysis tools might lead to contradictory results in software engineering studies making use of these tools. Thus, we ask
RQ3: Do different sentiment analysis tools lead to contradictory results in a software
engineering study?
We have observed that disagreement between the tools might lead to contradictory results
in software engineering studies. Therefore, we finally conduct replication studies in order
to understand:
RQ4: How does the choice of a sentiment analysis tool affect validity of the previously
published results?
The remainder of this paper is organized as follows. The next section outlines the sentiment analysis tools we have considered in this study. In Section 3 we study agreement between the tools and the results of manual labeling, as well as between the tools themselves, i.e., RQ1 and RQ2. In Section 4 we conduct a series of studies based on the results of different sentiment analysis tools. We observe that conclusions one might derive using different tools diverge, casting doubt on their validity (RQ3). While our answer to RQ3 indicates that the choice of a sentiment analysis tool might affect the validity of software engineering results, in Section 5 we perform replications of two published studies, answering RQ4 and establishing that the conclusions of previously published works cannot be reproduced when a different sentiment analysis tool is used. Finally, in Section 6 we discuss related work, and we conclude in Section 7.
Source code and data used to obtain the results of this paper have been made available at http://ow.ly/HvC5302N4oK.
2 Sentiment Analysis Tools
2.1 Tool Selection
To perform the tool evaluation we have decided to focus on open-source tools. This requirement excludes such commercial tools as Lymbix (http://www.lymbix.com/supportcenter/docs), the Sentiment API of MeaningCloud (https://www.meaningcloud.com/developer/sentiment-analysis) and GetSentiment (https://getsentiment.3scale.net/). Furthermore, we exclude tools that require training before they can be applied, such as LibShortText (Yu et al. 2013) or the sentiment analysis libraries of popular machine learning tools such as RapidMiner or Weka. Finally, since the software engineering texts that have been analyzed in the past can be quite short (JIRA issues, STACK OVERFLOW questions), we have chosen tools that have already been applied either to software engineering texts (SENTISTRENGTH and NLTK) or to short texts such as tweets (Alchemy or the Stanford NLP sentiment analyser).
2.2 Description of Tools
2.2.1 SENTISTRENGTH
SENTISTRENGTH is the sentiment analysis tool most frequently used in software engineering studies (Garcia et al. 2013; Guzman et al. 2014; Novielli et al. 2015; Ortu et al. 2015). Moreover, SENTISTRENGTH had the highest average accuracy among fifteen Twitter sentiment analysis tools (Abbasi et al. 2014). SENTISTRENGTH assigns an integer value between 1 and 5 for the positivity of a text, p, and similarly a value between −1 and −5 for the negativity, n.
Interpretation In order to map the separate positivity and negativity scores to a sentiment (positive, neutral or negative) for an entire text fragment, we follow the approach of Thelwall et al. (2012). A text is considered positive when p + n > 0, negative when p + n < 0, and neutral if p = −n and p < 4. Texts with a score of p = −n and p ≥ 4 are considered to have an undetermined sentiment and are removed from the datasets.
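This mapping is mechanical, so we sketch it below in Python. The sketch is our illustration of the interpretation above, not the original study's code; it assumes the two SENTISTRENGTH scores for a text have already been obtained.

```python
def sentistrength_label(p, n):
    """Map SentiStrength scores to a document-level sentiment.

    p: positivity score in 1..5; n: negativity score in -5..-1.
    Follows the interpretation of Thelwall et al. (2012) described
    above; returns None for undetermined texts, which are dropped.
    """
    if p + n > 0:
        return "positive"
    if p + n < 0:
        return "negative"
    # Here p == -n: weak balanced signals count as neutral,
    # strong balanced signals (p >= 4) are undetermined.
    return "neutral" if p < 4 else None
```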
2.2.2 Alchemy
Alchemy provides several text processing APIs, including a sentiment analysis API which promises to work on very short texts (e.g., tweets) as well as relatively long texts (e.g., news articles). The sentiment analysis API returns for a text fragment a status, a language, a score and a type. The score is in the range [−1, 1]; the type is the sentiment of the text and is based on the score. For negative scores the type is negative; conversely, for positive scores the type is positive. For a score of 0 the type is neutral. The status reflects the analysis success and is either "OK" or "ERROR".
Interpretation We ignore texts with status "ERROR" or a non-English language. We consider the remaining texts to be negative, neutral or positive, as indicated by the returned type.
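The filtering is equally simple to express. Below is a minimal sketch, assuming a parsed API response carrying the fields named above; the exact key names and the representation of the detected language are our assumptions, not documented parts of the Alchemy API.

```python
def alchemy_label(response):
    """Keep the API-reported sentiment; drop failed or non-English texts.

    `response` is assumed to be a dict with keys "status" ("OK"/"ERROR"),
    "language" and "type" ("positive"/"neutral"/"negative"); the value
    "english" is an assumed representation of the detected language.
    """
    if response.get("status") != "OK" or response.get("language") != "english":
        return None  # excluded from the dataset
    return response["type"]
```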
2.2.3 NLTK
NLTK has been applied in earlier software engineering studies (Pletea et al. 2014; Rousinopoulos et al. 2014). NLTK uses a simple bag-of-words model and returns three probabilities for each text: a probability of the text being negative, one of it being neutral and one of it being positive. To call NLTK, we use the API provided at text-processing.com (API docs: http://text-processing.com/docs/sentiment.html).
Interpretation If the probability score for neutral is greater than 0.5, the text is considered neutral. Otherwise, it is considered to carry the polar sentiment with the highest probability (Pletea et al. 2014).
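A minimal sketch of this decision rule (ours, not the original study's code): we assume the three probabilities have been retrieved from the text-processing.com API and are keyed "neg", "neutral" and "pos", which is an assumption about the response format.

```python
def nltk_label(prob):
    """Decision rule of Pletea et al. (2014): neutral wins when its
    probability exceeds 0.5; otherwise the likelier polar class wins.

    `prob` maps "neg", "neutral" and "pos" to probabilities. Breaking
    a tie between "pos" and "neg" as positive is our assumption.
    """
    if prob["neutral"] > 0.5:
        return "neutral"
    return "positive" if prob["pos"] >= prob["neg"] else "negative"
```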
2.2.4 Stanford NLP
The Stanford NLP parses the text into sentences and performs a more advanced grammatical analysis, as opposed to the simpler bag-of-words model used, e.g., in NLTK. Indeed, Socher et al. (2013) argue that such an analysis should outperform the bag-of-words model on short texts. The Stanford NLP breaks down the text into sentences and assigns each a sentiment score in the range [0, 4], where 0 is very negative, 2 is neutral and 4 is very positive. We note that the tool may have difficulty breaking the text into sentences, as comments sometimes include pieces of code or, e.g., URLs. The tool does not provide a document-level score.
Interpretation To determine a document-level sentiment we compute −2 · #0 − #1 + #3 + 2 · #4, where #0 denotes the number of sentences with score 0, etc. If this score is negative, zero or positive, we consider the text to be negative, neutral or positive, respectively.
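A minimal sketch of this aggregation (our illustration), assuming the per-sentence scores have already been produced by the Stanford pipeline:

```python
def stanford_label(sentence_scores):
    """Aggregate per-sentence scores (0 = very negative .. 4 = very
    positive) into a document-level sentiment via the weighted sum
    -2*#0 - #1 + #3 + 2*#4 described above."""
    weights = {0: -2, 1: -1, 2: 0, 3: 1, 4: 2}
    score = sum(weights[s] for s in sentence_scores)
    if score < 0:
        return "negative"
    return "positive" if score > 0 else "neutral"
```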
3 Agreement Between Sentiment Analysis Tools
In this section we address RQ1 and RQ2, i.e., to what extent do the different sentiment analysis tools described earlier agree with the emotions of software developers, and to what extent do different sentiment analysis tools agree with each other. To perform the evaluation we use the manually labeled emotions dataset (Murgia et al. 2014).
3.1 Methodology
3.1.1 Manually-Labeled Software Engineering Data
As the "golden set" we use the data from the developer emotions study by Murgia et al. (2014). In this study, four evaluators manually labeled 392 comments with the emotions "joy", "love", "surprise", "anger", "sadness" or "fear". The emotions "joy" and "love" are taken as indicators of positive sentiment, and "anger", "sadness" and "fear" as indicators of negative sentiment. We exclude information about the "surprise" emotion, since surprises can in general be both positive and negative, depending on the expectations of the speaker.
We focus on consistently labeled comments. We consider the comment as positive if at
least three evaluators have indicated a positive sentiment and no evaluator has indicated
negative sentiments. Similarly, we consider the comment as negative if at least three evalua-
tors have indicated a negative sentiment and no evaluator has indicated positive sentiments.
Finally, a text is considered as neutral when three or more evaluators have neither indicated
a positive sentiment nor a negative sentiment.
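This consensus rule can be made precise with a small sketch (ours, for illustration). It assumes each evaluator's emotion labels have already been collapsed into a per-evaluator sentiment as described above.

```python
def golden_label(labels):
    """Derive a consensus sentiment from four evaluators' per-evaluator
    labels ("positive", "negative" or "neutral"), following the rule
    described above. Returns None when there is no consensus, in which
    case the comment is excluded from the golden set.
    """
    pos = labels.count("positive")
    neg = labels.count("negative")
    if pos >= 3 and neg == 0:
        return "positive"
    if neg >= 3 and pos == 0:
        return "negative"
    if labels.count("neutral") >= 3:
        return "neutral"
    return None
```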
