Empir Software Eng (2017) 22:2543–2584
DOI 10.1007/s10664-016-9493-x
On negative results when using sentiment analysis tools
for software engineering research
Robbert Jongeling¹ · Proshanta Sarkar² · Subhajit Datta³ · Alexander Serebrenik¹
Published online: 10 January 2017
© The Author(s) 2017. This article is published with open access at Springerlink.com
Abstract Recent years have seen increasing attention to social aspects of software engineering, including studies of the emotions and sentiments experienced and expressed by software developers. Most of these studies reuse existing sentiment analysis tools such as SENTISTRENGTH and NLTK. However, these tools have been trained on product reviews and movie reviews and, therefore, their results might not be applicable in the software engineering domain. In this paper we study whether the sentiment analysis tools agree with the sentiment recognized by human evaluators (as reported in an earlier study) as well as with each other. Furthermore, we evaluate the impact of the choice of a sentiment analysis tool on software engineering studies by conducting a simple study of differences in issue resolution times for positive, negative and neutral texts. We repeat the study for seven datasets (issue trackers and STACK OVERFLOW questions) and different sentiment analysis tools and observe that the disagreement between the tools can lead to diverging conclusions. Finally, we perform two replications of previously published studies and observe that the results of those studies cannot be confirmed when a different sentiment analysis tool is used.
Communicated by: Richard Paige, Jordi Cabot and Neil Ernst
Alexander Serebrenik
a.serebrenik@tue.nl
Robbert Jongeling
r.m.jongeling@alumnus.tue.nl
Proshanta Sarkar
proshant.cse@gmail.com
Subhajit Datta
subhajit.datta@acm.org
1 Eindhoven University of Technology, Eindhoven, The Netherlands
2 IBM India Private Limited, Kolkata, India
3 Singapore University of Technology and Design, Singapore, Singapore

Keywords Sentiment analysis tools · Replication study · Negative results
1 Introduction
Sentiment analysis is "the task of identifying positive and negative opinions, emotions, and evaluations" (Wilson et al. 2005). Since its inception sentiment analysis has been the subject of intensive research effort and has been successfully applied, e.g., to assist users in their development by providing them with interesting and supportive content (Honkela et al. 2012), to predict the outcome of an election (Tumasjan et al. 2010) or movie sales (Mishne and Glance 2006). The spectrum of sentiment analysis techniques ranges from identifying polarity (positive or negative) to a complex computational treatment of subjectivity, opinion and sentiment (Pang and Lee 2007). In particular, the research on sentiment polarity analysis has resulted in a number of mature and publicly available tools such as SENTISTRENGTH (Thelwall et al. 2010), Alchemy (http://www.alchemyapi.com/products/alchemylanguage/sentiment-analysis/), the Stanford NLP sentiment analyser (Socher et al. 2013) and NLTK (Bird et al. 2009).
In recent times, large-scale software development has become increasingly social. With the proliferation of collaborative development environments, discussions between developers are recorded and archived to an extent that could not be conceived before. The availability of such discussion materials makes it easy to study whether and how the sentiments expressed by software developers influence the outcome of development activities. With this background, we apply sentiment polarity analysis to several software development ecosystems in this study.
Sentiment polarity analysis has recently been applied in the software engineering context to study commit comments in GitHub (Guzman et al. 2014), GitHub discussions related to security (Pletea et al. 2014), productivity in Jira issue resolution (Ortu et al. 2015), activity of contributors in Gentoo (Garcia et al. 2013), classification of user reviews for maintenance and evolution (Panichella et al. 2015) and evolution of developers' sentiments in the openSUSE Factory (Rousinopoulos et al. 2014). It has also been suggested for assessing technical candidates on the social web (Capiluppi et al. 2013). Not surprisingly, all the aforementioned software engineering studies, with the notable exception of the work by Panichella et al. (2015), reuse existing sentiment polarity tools: e.g., Pletea et al. (2014) and Rousinopoulos et al. (2014) use NLTK, while Garcia et al. (2013), Guzman and Bruegge (2013), Guzman et al. (2014), Novielli et al. (2015) and Ortu et al. (2015) opted for SENTISTRENGTH. While the reuse of existing tools facilitated the application of sentiment polarity analysis techniques in the software engineering domain, it also introduced a commonly recognized threat to the validity of the results obtained: those tools have been trained on non-software engineering related texts such as movie reviews or product reviews and might misidentify (or fail to identify) the polarity of a sentiment in a software engineering artefact such as a commit comment (Guzman et al. 2014; Pletea et al. 2014).
Therefore, in this paper we focus on sentiment polarity analysis (Wilson et al. 2005) and investigate to what extent software engineering results obtained from sentiment analysis depend on the choice of the sentiment analysis tool. We recognize that there are multiple ways to measure outcomes in software engineering. Among them, the time to resolve a particular defect and/or to respond to a particular query is relevant for end users. Accordingly, in the different datasets studied in this paper, we have taken such resolution or response times to reflect the outcomes of our interest.
For the sake of simplicity, from here on, instead of “existing sentiment polarity analysis
tools” we talk about the “sentiment analysis tools”. Specifically, we aim at answering the
following questions:
RQ1: To what extent do different sentiment analysis tools agree with emotions of
software developers?
RQ2: To what extent do results from different sentiment analysis tools agree with each
other?
We have observed disagreement not only between sentiment analysis tools and the emotions of software developers, but also between different sentiment analysis tools themselves. However, disagreement between the tools does not a priori mean that sentiment analysis tools might lead to contradictory results in software engineering studies making use of these tools. Thus, we ask
RQ3: Do different sentiment analysis tools lead to contradictory results in a software
engineering study?
We have observed that disagreement between the tools might lead to contradictory results
in software engineering studies. Therefore, we finally conduct replication studies in order
to understand:
RQ4: How does the choice of a sentiment analysis tool affect validity of the previously
published results?
The remainder of this paper is organized as follows. The next section outlines the sentiment analysis tools we have considered in this study. In Section 3 we study agreement between the tools and the results of manual labeling, as well as between the tools themselves, i.e., RQ1 and RQ2. In Section 4 we conduct a series of studies based on the results of different sentiment analysis tools. We observe that conclusions one might derive using different tools diverge, casting doubt on their validity (RQ3). While our answer to RQ3 indicates that the choice of a sentiment analysis tool might affect the validity of software engineering results, in Section 5 we perform replications of two published studies, answering RQ4 and establishing that the conclusions of previously published works cannot be reproduced when a different sentiment analysis tool is used. Finally, in Section 6 we discuss related work, and we conclude in Section 7.
Source code and data used to obtain the results of this paper have been made available at http://ow.ly/HvC5302N4oK.
2 Sentiment Analysis Tools
2.1 Tool Selection
To perform the tool evaluation we have decided to focus on open-source tools. This requirement excludes such commercial tools as Lymbix (http://www.lymbix.com/supportcenter/docs), the Sentiment API of MeaningCloud (https://www.meaningcloud.com/developer/sentiment-analysis) and GetSentiment (https://getsentiment.3scale.net/). Furthermore, we exclude tools that require training before they can be applied, such as LibShortText (Yu et al. 2013) or the sentiment analysis libraries of popular machine learning tools such as RapidMiner or Weka. Finally, since the software engineering texts that have been analyzed in the past can be quite short (JIRA issues, STACK OVERFLOW questions), we have chosen tools that have already been applied either to software engineering texts (SENTISTRENGTH and NLTK) or to short texts such as tweets (Alchemy or the Stanford NLP sentiment analyser).
2.2 Description of Tools
2.2.1 SENTISTRENGTH
SENTISTRENGTH is the sentiment analysis tool most frequently used in software engineering studies (Garcia et al. 2013; Guzman et al. 2014; Novielli et al. 2015; Ortu et al. 2015). Moreover, SENTISTRENGTH had the highest average accuracy among fifteen Twitter sentiment analysis tools (Abbasi et al. 2014). SENTISTRENGTH assigns an integer value between 1 and 5 for the positivity of a text, p, and similarly a value between −1 and −5 for the negativity, n.
Interpretation In order to map the separate positivity and negativity scores to a sentiment (positive, neutral or negative) for an entire text fragment, we follow the approach of Thelwall et al. (2012). A text is considered positive when p + n > 0, negative when p + n < 0, and neutral if p = −n and p < 4. Texts with a score of p = −n and p ≥ 4 are considered to have an undetermined sentiment and are removed from the datasets.
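This mapping is mechanical, so we sketch it below in Python. The sketch is our illustration of the interpretation above, not the original study's code; it assumes the two SENTISTRENGTH scores for a text have already been obtained.

```python
def sentistrength_label(p, n):
    """Map SentiStrength scores to a document-level sentiment.

    p: positivity score in 1..5; n: negativity score in -5..-1.
    Follows the interpretation of Thelwall et al. (2012) described
    above; returns None for undetermined texts, which are dropped.
    """
    if p + n > 0:
        return "positive"
    if p + n < 0:
        return "negative"
    # Here p == -n: weak balanced signals count as neutral,
    # strong balanced signals (p >= 4) are undetermined.
    return "neutral" if p < 4 else None
```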
2.2.2 Alchemy
Alchemy provides several text processing APIs, including a sentiment analysis API which promises to work on very short texts (e.g., tweets) as well as relatively long texts (e.g., news articles). The sentiment analysis API returns for a text fragment a status, a language, a score and a type. The score is in the range [−1, 1]; the type is the sentiment of the text and is based on the score. For negative scores the type is negative; conversely, for positive scores the type is positive. For a score of 0 the type is neutral. The status reflects the analysis success and is either "OK" or "ERROR".
Interpretation We ignore texts with status "ERROR" or a non-English language. We consider the remaining texts to be negative, neutral or positive, as indicated by the returned type.
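The filtering is equally simple to express. Below is a minimal sketch, assuming a parsed API response carrying the fields named above; the exact key names and the representation of the detected language are our assumptions, not documented parts of the Alchemy API.

```python
def alchemy_label(response):
    """Keep the API-reported sentiment; drop failed or non-English texts.

    `response` is assumed to be a dict with keys "status" ("OK"/"ERROR"),
    "language" and "type" ("positive"/"neutral"/"negative"); the value
    "english" is an assumed representation of the detected language.
    """
    if response.get("status") != "OK" or response.get("language") != "english":
        return None  # excluded from the dataset
    return response["type"]
```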
2.2.3 NLTK
NLTK has been applied in earlier software engineering studies (Pletea et al. 2014; Rousinopoulos et al. 2014). NLTK uses a simple bag-of-words model and returns three probabilities for each text: a probability of the text being negative, one of it being neutral and one of it being positive. To call NLTK, we use the API provided at text-processing.com (API docs: http://text-processing.com/docs/sentiment.html).
Interpretation If the probability score for neutral is greater than 0.5, the text is considered neutral. Otherwise, it is considered to carry the polar sentiment with the highest probability (Pletea et al. 2014).
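A minimal sketch of this decision rule (ours, not the original study's code): we assume the three probabilities have been retrieved from the text-processing.com API and are keyed "neg", "neutral" and "pos", which is an assumption about the response format.

```python
def nltk_label(prob):
    """Decision rule of Pletea et al. (2014): neutral wins when its
    probability exceeds 0.5; otherwise the likelier polar class wins.

    `prob` maps "neg", "neutral" and "pos" to probabilities. Breaking
    a tie between "pos" and "neg" as positive is our assumption.
    """
    if prob["neutral"] > 0.5:
        return "neutral"
    return "positive" if prob["pos"] >= prob["neg"] else "negative"
```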
2.2.4 Stanford NLP
The Stanford NLP parses the text into sentences and performs a more advanced grammatical analysis, as opposed to the simpler bag-of-words model used, e.g., in NLTK. Indeed, Socher et al. (2013) argue that such an analysis should outperform the bag-of-words model on short texts. The Stanford NLP breaks down the text into sentences and assigns each a sentiment score in the range [0, 4], where 0 is very negative, 2 is neutral and 4 is very positive. We note that the tool may have difficulty breaking the text into sentences, as comments sometimes include pieces of code or, e.g., URLs. The tool does not provide a document-level score.
Interpretation To determine a document-level sentiment we compute −2 · #0 − #1 + #3 + 2 · #4, where #0 denotes the number of sentences with score 0, etc. If this score is negative, zero or positive, we consider the text to be negative, neutral or positive, respectively.
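A minimal sketch of this aggregation (our illustration), assuming the per-sentence scores have already been produced by the Stanford pipeline:

```python
def stanford_label(sentence_scores):
    """Aggregate per-sentence scores (0 = very negative .. 4 = very
    positive) into a document-level sentiment via the weighted sum
    -2*#0 - #1 + #3 + 2*#4 described above."""
    weights = {0: -2, 1: -1, 2: 0, 3: 1, 4: 2}
    score = sum(weights[s] for s in sentence_scores)
    if score < 0:
        return "negative"
    return "positive" if score > 0 else "neutral"
```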
3 Agreement Between Sentiment Analysis Tools
In this section we address RQ1 and RQ2, i.e., to what extent do the different sentiment analysis tools described earlier agree with the emotions of software developers, and to what extent do different sentiment analysis tools agree with each other. To perform the evaluation we use the manually labeled emotions dataset (Murgia et al. 2014).
3.1 Methodology
3.1.1 Manually-Labeled Software Engineering Data
As the "golden set" we use the data from the developer emotions study by Murgia et al. (2014). In this study, four evaluators manually labeled 392 comments with the emotions "joy", "love", "surprise", "anger", "sadness" or "fear". The emotions "joy" and "love" are taken as indicators of positive sentiment, and "anger", "sadness" and "fear" as indicators of negative sentiment. We exclude information about the "surprise" emotion, since surprises can in general be both positive and negative, depending on the expectations of the speaker.
We focus on consistently labeled comments. We consider the comment as positive if at
least three evaluators have indicated a positive sentiment and no evaluator has indicated
negative sentiments. Similarly, we consider the comment as negative if at least three evalua-
tors have indicated a negative sentiment and no evaluator has indicated positive sentiments.
Finally, a text is considered as neutral when three or more evaluators have neither indicated
a positive sentiment nor a negative sentiment.
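This consensus rule can be made precise with a small sketch (ours, for illustration). It assumes each evaluator's emotion labels have already been collapsed into a per-evaluator sentiment as described above.

```python
def golden_label(labels):
    """Derive a consensus sentiment from four evaluators' per-evaluator
    labels ("positive", "negative" or "neutral"), following the rule
    described above. Returns None when there is no consensus, in which
    case the comment is excluded from the golden set.
    """
    pos = labels.count("positive")
    neg = labels.count("negative")
    if pos >= 3 and neg == 0:
        return "positive"
    if neg >= 3 and pos == 0:
        return "negative"
    if labels.count("neutral") >= 3:
        return "neutral"
    return None
```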
