
Empir Software Eng (2017) 22:2543–2584
DOI 10.1007/s10664-016-9493-x
On negative results when using sentiment analysis tools
for software engineering research
Robbert Jongeling¹ · Proshanta Sarkar² · Subhajit Datta³ · Alexander Serebrenik¹
Published online: 10 January 2017
© The Author(s) 2017. This article is published with open access at Springerlink.com
Abstract Recent years have seen increasing attention to social aspects of software engi-
neering, including studies of emotions and sentiments experienced and expressed by the
software developers. Most of these studies reuse existing sentiment analysis tools such as
SENTISTRENGTH and NLTK. However, these tools have been trained on product reviews
and movie reviews and, therefore, their results might not be applicable in the software engi-
neering domain. In this paper we study whether the sentiment analysis tools agree with the
sentiment recognized by human evaluators (as reported in an earlier study) as well as with
each other. Furthermore, we evaluate the impact of the choice of a sentiment analysis tool
on software engineering studies by conducting a simple study of differences in issue reso-
lution times for positive, negative and neutral texts. We repeat the study for seven datasets
(issue trackers and STACK OVERFLOW questions) and different sentiment analysis tools and
observe that the disagreement between the tools can lead to diverging conclusions. Finally,
we perform two replications of previously published studies and observe that the results of
those studies cannot be confirmed when a different sentiment analysis tool is used.
Communicated by: Richard Paige, Jordi Cabot and Neil Ernst
Alexander Serebrenik
a.serebrenik@tue.nl
Robbert Jongeling
r.m.jongeling@alumnus.tue.nl
Proshanta Sarkar
proshant.cse@gmail.com
Subhajit Datta
subhajit.datta@acm.org
1 Eindhoven University of Technology, Eindhoven, The Netherlands
2 IBM India Private Limited, Kolkata, India
3 Singapore University of Technology and Design, Singapore, Singapore

Keywords Sentiment analysis tools · Replication study · Negative results
1 Introduction
Sentiment analysis is “the task of identifying positive and negative opinions, emotions, and
evaluations” (Wilson et al. 2005). Since its inception, sentiment analysis has been the subject
of intensive research effort and has been successfully applied, e.g., to assist users in
their development by providing them with interesting and supportive content (Honkela et al.
2012), predict the outcome of an election (Tumasjan et al. 2010) or movie sales (Mishne
and Glance 2006). The spectrum of sentiment analysis techniques ranges from identifying
polarity (positive or negative) to a complex computational treatment of subjectivity, opinion
and sentiment (Pang and Lee 2007). In particular, the research on sentiment polarity analysis
has resulted in a number of mature and publicly available tools such as SENTISTRENGTH
(Thelwall et al. 2010), Alchemy,¹ Stanford NLP sentiment analyser (Socher et al. 2013) and
NLTK (Bird et al. 2009).
In recent times, large scale software development has become increasingly social. With
the proliferation of collaborative development environments, discussions between developers
are recorded and archived to an extent that could not be conceived before. The availability of
such discussion materials makes it easy to study whether and how the sentiments expressed
by software developers influence the outcome of development activities. With this back-
ground, we apply sentiment polarity analysis to several software development ecosystems
in this study.
Sentiment polarity analysis has been recently applied in the software engineering context
to study commit comments in GitHub (Guzman et al. 2014), GitHub discussions related to
security (Pletea et al. 2014), productivity in Jira issue resolution (Ortu et al. 2015), activity
of contributors in Gentoo (Garcia et al. 2013), classification of user reviews for mainte-
nance and evolution (Panichella et al. 2015) and evolution of developers’ sentiments in the
openSUSE Factory (Rousinopoulos et al. 2014). It has also been suggested when assess-
ing technical candidates on the social web (Capiluppi et al. 2013). Not surprisingly, all
the aforementioned software engineering studies with the notable exception of the work
by Panichella et al. (2015), reuse the existing sentiment polarity tools, e.g., (Pletea et al.
2014) and (Rousinopoulos et al. 2014) use NLTK, while (Garcia et al. 2013; Guzman and
Bruegge 2013; Guzman et al. 2014; Novielli et al. 2015) and (Ortu et al. 2015) opted for
SENTISTRENGTH. While the reuse of the existing tools facilitated the application of the sen-
timent polarity analysis techniques in the software engineering domain, it also introduced
a commonly recognized threat to validity of the results obtained: those tools have been
trained on non-software engineering related texts such as movie reviews or product reviews
and might misidentify (or fail to identify) polarity of a sentiment in a software engineering
artefact such as a commit comment (Guzman et al. 2014; Pletea et al. 2014).
Therefore, in this paper we focus on sentiment polarity analysis (Wilson et al. 2005) and
investigate to what extent the software engineering results obtained from sentiment anal-
ysis depend on the choice of the sentiment analysis tool. We recognize that there are multiple
ways to measure outcomes in software engineering. Among them, time to resolve a partic-
ular defect, and/or respond to a particular query are relevant for end users. Accordingly, in
¹ http://www.alchemyapi.com/products/alchemylanguage/sentiment-analysis/

the different datasets studied in this paper, we have taken such resolution or response times
to reflect the outcomes of our interest.
For the sake of simplicity, from here on, instead of “existing sentiment polarity analysis
tools” we talk about the “sentiment analysis tools”. Specifically, we aim at answering the
following questions:
RQ1: To what extent do different sentiment analysis tools agree with emotions of
software developers?
RQ2: To what extent do results from different sentiment analysis tools agree with each
other?
We have observed disagreement not only between sentiment analysis tools and the emotions
of software developers, but also between different sentiment analysis tools themselves. However,
disagreement between the tools does not a priori mean that sentiment analysis tools might
lead to contradictory results in software engineering studies making use of these tools. Thus,
we ask
RQ3: Do different sentiment analysis tools lead to contradictory results in a software
engineering study?
We have observed that disagreement between the tools might lead to contradictory results
in software engineering studies. Therefore, we finally conduct replication studies in order
to understand:
RQ4: How does the choice of a sentiment analysis tool affect validity of the previously
published results?
The remainder of this paper is organized as follows. The next section outlines the sen-
timent analysis tools we have considered in this study. In Section 3 we study agreement
between the tools and the results of manual labeling, and between the tools themselves, i.e.,
RQ1 and RQ2. In Section 4 we conduct a series of studies based on the results of different
sentiment analysis tools. We observe that conclusions one might derive using different tools
diverge, casting doubt on their validity (RQ3). While our answer to RQ3 indicates that the
choice of a sentiment analysis tool might affect validity of software engineering results, in
Section 5 we perform replication of two published studies answering RQ4 and establishing
that conclusions of previously published works cannot be reproduced when a different sen-
timent analysis tool is used. Finally, in Section 6 we discuss related work and conclude in
Section 7.
Source code and data used to obtain the results of this paper have been made available.²
2 Sentiment Analysis Tools
2.1 Tool Selection
To perform the tool evaluation we have decided to focus on open-source tools. This require-
ment excludes such commercial tools as Lymbix,³ the Sentiment API of MeaningCloud,⁴ or
² http://ow.ly/HvC5302N4oK
³ http://www.lymbix.com/supportcenter/docs
⁴ https://www.meaningcloud.com/developer/sentiment-analysis

GetSentiment.⁵
Furthermore, we exclude tools that require training before they can be
applied, such as LibShortText (Yu et al. 2013), or sentiment analysis libraries of popular
machine learning tools such as RapidMiner or Weka. Finally, since the software engineering
texts that have been analyzed in the past can be quite short (JIRA issues, STACK OVERFLOW
questions), we have chosen tools that have already been applied either to software
engineering texts (SENTISTRENGTH and NLTK) or to short texts such as tweets (Alchemy
or Stanford NLP sentiment analyser).
2.2 Description of Tools
2.2.1 SENTISTRENGTH
SENTISTRENGTH is the sentiment analysis tool most frequently used in software engineer-
ing studies (Garcia et al. 2013; Guzman et al. 2014; Novielli et al. 2015; Ortu et al. 2015).
Moreover, SENTISTRENGTH had the highest average accuracy among fifteen Twitter senti-
ment analysis tools (Abbasi et al. 2014). SENTISTRENGTH assigns an integer value between
1 and 5 for the positivity of a text, p, and an integer value between −1 and −5 for the
negativity, n.
Interpretation In order to map the separate positivity and negativity scores to a senti-
ment (positive, neutral or negative) for an entire text fragment, we follow the approach by
Thelwall et al. (2012). A text is considered positive when p + n > 0, negative when
p + n < 0, and neutral if p = −n and p < 4. Texts with a score of p = −n and p ≥ 4 are
considered as having an undetermined sentiment and are removed from the datasets.
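As an illustration, the mapping above can be sketched as a small Python helper (the function name is ours, not from the paper):

```python
def sentistrength_label(p, n):
    """Map SentiStrength scores to a document-level sentiment.

    p is the positivity score in [1, 5]; n is the negativity score
    in [-5, -1]. Returns 'positive', 'negative', 'neutral', or None
    for the undetermined case (p == -n and p >= 4), which is removed
    from the datasets.
    """
    if p + n > 0:
        return "positive"
    if p + n < 0:
        return "negative"
    # here p + n == 0, i.e. p == -n
    if p < 4:
        return "neutral"
    return None  # undetermined sentiment
```

For example, a text scored p = 3, n = −1 is labeled positive, while p = 4, n = −4 is dropped.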
2.2.2 Alchemy
Alchemy provides several text processing APIs, including a sentiment analysis API which
promises to work on very short texts (e.g., tweets) as well as relatively long texts (e.g., news
articles).⁶
The sentiment analysis API returns for a text fragment a status, a language, a
score and a type. The score is in the range [−1, 1], the type is the sentiment of the text and is
based on the score. For negative scores, the type is negative, conversely for positive scores,
the type is positive. For a score of 0, the type is neutral. The status reflects the analysis
success and it is either “OK” or “ERROR”.
Interpretation We ignore texts with status “ERROR” or a non-English language. For the
remaining texts we consider them as being negative, neutral or positive as indicated by the
returned type.
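A minimal sketch of this filtering, assuming the API response has been parsed into a Python dict with the four fields described above (the exact field names and the value "english" for the language field are assumptions, not taken from the paper):

```python
def alchemy_label(response):
    """Interpret a parsed Alchemy sentiment API response.

    Returns the reported type ('positive', 'neutral' or 'negative'),
    or None for failed calls and non-English texts, which are ignored.
    """
    if response.get("status") != "OK":
        return None  # analysis failed
    if response.get("language") != "english":
        return None  # non-English text
    return response["type"]  # type is derived from the sign of the score
```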
2.2.3 NLTK
NLTK has been applied in earlier software engineering studies (Pletea et al. 2014;
Rousinopoulos et al. 2014). NLTK uses a simple bag of words model and returns for each
⁵ https://getsentiment.3scale.net/
⁶ http://www.alchemyapi.com/products/alchemylanguage/sentiment-analysis

text three probabilities: a probability of the text being negative, one of it being neutral and
one of it being positive. To call NLTK, we use the API provided at text-processing.com.⁷
Interpretation If the probability score for neutral is greater than 0.5, the text is considered
neutral. Otherwise, it is considered to be the other sentiment with the highest probability
(Pletea et al. 2014).
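A sketch of this rule in Python (how ties between the positive and negative probabilities are broken is our choice; the paper does not specify it):

```python
def nltk_label(p_neg, p_neutral, p_pos):
    """Map NLTK's three class probabilities to a sentiment label,
    following the interpretation of Pletea et al. (2014)."""
    if p_neutral > 0.5:
        return "neutral"
    # otherwise pick the more probable of the two polar classes
    return "positive" if p_pos >= p_neg else "negative"
```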
2.2.4 Stanford NLP
The Stanford NLP parses the text into sentences and performs a more advanced grammatical
analysis as opposed to a simpler bag of words model used, e.g., in NLTK. Indeed, Socher
et al. argue that such an analysis should outperform the bag of words model on short texts
(Socher et al. 2013). The Stanford NLP breaks down the text into sentences and assigns
each a sentiment score in the range [0, 4], where 0 is very negative, 2 is neutral and 4 is
very positive. We note that the tool may have difficulty breaking the text into sentences
as comments sometimes include pieces of code or e.g. URLs. The tool does not provide a
document-level score.
Interpretation To determine a document-level sentiment we compute −2 · #0 − #1 + #3 +
2 · #4, where #0 denotes the number of sentences with score 0, etc. If this score is negative,
neutral or positive, we consider the text to be negative, neutral or positive, respectively.
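As a sketch, the per-sentence scores can be aggregated into a document label as follows (the weights implement the sum above: very negative and very positive sentences count twice, neutral sentences not at all):

```python
def stanford_label(sentence_scores):
    """Aggregate Stanford NLP sentence scores (each in 0..4, where
    0 is very negative, 2 is neutral and 4 is very positive) into a
    document-level sentiment via the weighted sum
    -2*#0 - #1 + #3 + 2*#4."""
    weights = {0: -2, 1: -1, 2: 0, 3: 1, 4: 2}
    total = sum(weights[s] for s in sentence_scores)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"
```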
3 Agreement Between Sentiment Analysis Tools
In this section we address RQ1 and RQ2, i.e., to what extent do the different sentiment
analysis tools described earlier, agree with emotions of software developers and to what
extent do different sentiment analysis tools agree with each other. To perform the evaluation
we use the manually labeled emotions dataset (Murgia et al. 2014).
3.1 Methodology
3.1.1 Manually-Labeled Software Engineering Data
As the “golden set” we use the data from a developer emotions study by Murgia et al.
(2014). In this study, four evaluators manually labeled 392 comments with the emotions “joy”,
“love”, “surprise”, “anger”, “sadness” or “fear”. Emotions “joy” and “love” are taken as
indicators of positive sentiment, while “anger”, “sadness” and “fear” are taken as indicators
of negative sentiment.
We exclude information about the “surprise” sentiment, since surprises can be, in general,
both positive and negative depending on the expectations of the speaker.
We focus on consistently labeled comments. We consider the comment as positive if at
least three evaluators have indicated a positive sentiment and no evaluator has indicated
negative sentiments. Similarly, we consider the comment as negative if at least three evalua-
tors have indicated a negative sentiment and no evaluator has indicated positive sentiments.
Finally, a text is considered as neutral when three or more evaluators have neither indicated
a positive sentiment nor a negative sentiment.
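These consolidation rules can be sketched as a small Python helper (the function name and input encoding are ours; each evaluator's judgment is assumed to already be mapped to 'positive', 'negative' or 'neutral'):

```python
def consolidate(labels):
    """Consolidate four evaluators' labels into one sentiment.

    Returns 'positive', 'negative' or 'neutral' for consistently
    labeled comments, and None otherwise (such comments are excluded).
    """
    pos = labels.count("positive")
    neg = labels.count("negative")
    neu = labels.count("neutral")
    if pos >= 3 and neg == 0:
        return "positive"
    if neg >= 3 and pos == 0:
        return "negative"
    if neu >= 3:
        return "neutral"
    return None  # not consistently labeled
```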
⁷ API docs for NLTK sentiment analysis: http://text-processing.com/docs/sentiment.html

References
Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. O'Reilly Media.
Baccianella, S., Esuli, A., & Sebastiani, F. (2010). SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining. In Proceedings of LREC.
Tumasjan, A., Sprenger, T. O., Sandner, P. G., & Welpe, I. M. (2010). Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment. In Proceedings of ICWSM.
On comparing partitions.
Bakeman, R., & Gottman, J. M. Observing Interaction: An Introduction to Sequential Analysis. Cambridge University Press.