Sentiment Polarity Detection for Software Development

doi:10.1007/S10664-017-9546-9

Home
/
Papers
/
Sentiment Polarity Detection for Software Development

Journal Article•DOI•

Sentiment Polarity Detection for Software Development

Fabio Calefato¹, Filippo Lanubile¹, Federico Maiorano¹, Nicole Novielli¹•Institutions (1)

University of Bari¹

01 Jun 2018-Empirical Software Engineering (Springer US)-Vol. 23, Iss: 3, pp 1352-1382

TL;DR: Senti4SD as mentioned in this paper is a classifier specifically trained to support sentiment analysis in developers' communication channels, which is trained and validated using a gold standard of Stack Overflow questions, answers, and comments manually annotated for sentiment polarity.

read less

Abstract: The role of sentiment analysis is increasingly emerging to study software developers' emotions by mining crowd-generated content within social software engineering tools. However, off-the-shelf sentiment analysis tools have been trained on non-technical domains and general-purpose social media, thus resulting in misclassifications of technical jargon and problem reports. Here, we present Senti4SD, a classifier specifically trained to support sentiment analysis in developers' communication channels. Senti4SD is trained and validated using a gold standard of Stack Overflow questions, answers, and comments manually annotated for sentiment polarity. It exploits a suite of both lexicon- and keyword-based features, as well as semantic features based on word embedding. With respect to a mainstream off-the-shelf tool, which we use as a baseline, Senti4SD reduces the misclassifications of neutral and positive posts as emotionally negative. To encourage replications, we release a lab package including the classifier, the word embedding space, and the gold standard with annotation guidelines.

...read moreread less

Citations

PDF

Open Access

More filters

Proceedings Article•DOI•

A benchmark study on sentiment analysis for software engineering research

[...]

Nicole Novielli¹, Daniela Girardi¹, Filippo Lanubile¹•Institutions (1)

University of Bari¹

28 May 2018

TL;DR: In this paper, the authors report a benchmark study to assess the performance and reliability of three sentiment analysis tools specifically customized for software engineering and offer a reflection on the open challenges, as they emerge from a qualitative analysis of misclassified texts.

...read moreread less

Abstract: A recent research trend has emerged to identify developers' emotions, by applying sentiment analysis to the content of communication traces left in collaborative development environments. Trying to overcome the limitations posed by using off-the-shelf sentiment analysis tools, researchers recently started to develop their own tools for the software engineering domain. In this paper, we report a benchmark study to assess the performance and reliability of three sentiment analysis tools specifically customized for software engineering. Furthermore, we offer a reflection on the open challenges, as they emerge from a qualitative analysis of misclassified texts.1

...read moreread less

102 citations

Proceedings Article•DOI•

Investigating the effects of gender bias on GitHub

[...]

Nasif Imtiaz¹, Justin Middleton¹, Joymallya Chakraborty¹, Neill Robson¹, Gina R. Bai¹, Emerson Murphy-Hill² - Show less +2 more•Institutions (2)

North Carolina State University¹, Google²

25 May 2019

TL;DR: The effects of gender bias are largely invisible on the GitHub platform itself, but there are still signals of women concentrating their work in fewer places and being more restrained in communication than men.

...read moreread less

Abstract: Diversity, including gender diversity, is valued by many software development organizations, yet the field remains dominated by men. One reason for this lack of diversity is gender bias. In this paper, we study the effects of that bias by using an existing framework derived from the gender studies literature. We adapt the four main effects proposed in the framework by posing hypotheses about how they might manifest on GitHub, then evaluate those hypotheses quantitatively. While our results show that effects of gender bias are largely invisible on the GitHub platform itself, there are still signals of women concentrating their work in fewer places and being more restrained in communication than men.

...read moreread less

81 citations

Journal Article•DOI•

SentiStrength-SE: Exploiting domain specificity for improved sentiment analysis in software engineering text

[...]

Rakibul Islam¹, Minhaz F. Zibran¹•Institutions (1)

University of New Orleans¹

01 Nov 2018-Journal of Systems and Software

TL;DR: The empirical evaluations confirm that the domain specificity exploited in the SentiStrength-SE enables it to substantially outperform the existing domain-independent tools/toolkits in detecting sentiments in software engineering text.

...read moreread less

74 citations

Proceedings Article•DOI•

Sentiment Analysis for Software Engineering: How Far Can Pre-trained Transformer Models Go?

[...]

Ting Zhang¹, Bowen Xu¹, Ferdian Thung¹, Stefanus Agus Haryono¹, David Lo¹, Lingxiao Jiang¹ - Show less +2 more•Institutions (1)

Singapore Management University¹

01 Sep 2020

TL;DR: This work is the first to fine-tune pre-trained Transformer-based models for the SA4SE task, and outperform the existing SA4 SE tools by 6.5-35.6% in terms of macro/micro-averaged F1 scores.

...read moreread less

Abstract: Extensive research has been conducted on sentiment analysis for software engineering (SA4SE). Researchers have invested much effort in developing customized tools (e.g., SentiStrength-SE, SentiCR) to classify the sentiment polarity for Software Engineering (SE) specific contents (e.g., discussions in Stack Overflow and code review comments). Even so, there is still much room for improvement. Recently, pre-trained Transformer-based models (e.g., BERT, XLNet) have brought considerable breakthroughs in the field of natural language processing (NLP). In this work, we conducted a systematic evaluation of five existing SA4SE tools and variants of four state-of-the-art pre-trained Transformer-based models on six SE datasets. Our work is the first to fine-tune pre-trained Transformer-based models for the SA4SE task. Empirically, across all six datasets, our fine-tuned pre-trained Transformer-based models outperform the existing SA4SE tools by 6.5-35.6% in terms of macro/micro-averaged F1 scores.

...read moreread less

73 citations

Proceedings Article•DOI•

Entity-level sentiment analysis of issue comments

[...]

Ding Jin¹, Hailong Sun¹, Xu Wang¹, Xudong Liu¹•Institutions (1)

Beihang University¹

02 Jun 2018

TL;DR: SentiSW is designed and developed, an entitylevel sentiment analysis tool consisting of sentiment classifcation and entity recognition, which can classify issue comments into tuples.

...read moreread less

Abstract: Emotions and sentiment of software developers can largely influence the software productivity and quality. However, existing work on emotion mining and sentiment analysis is still in the early stage in software engineering in terms of accuracy, the size of datasets used and the specificity of the analysis. In this work, we are concerned with conducting entity-level sentiment analysis. We first build a manually labeled dataset containing 3,000 issue comments selected from 231,732 issue comments collected from 10 open source projects in GitHub. Then we design and develop SentiSW, an entity-level sentiment analysis tool consisting of sentiment classification and entity recognition, which can classify issue comments into tuples. We evaluate the sentiment classification using ten-fold cross validation, and it achieves 68.71% mean precision, 63.98% mean recall and 77.19% accuracy, which is significantly higher than existing tools. We evaluate the entity recognition by manually annotation and it achieves a 75.15% accuracy.

...read moreread less

70 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36

Collapse

References

PDF

Open Access

More filters

Journal Article•

R: A language and environment for statistical computing.

[...]

R Core Team

01 Jan 2014-MSOR connections

TL;DR: Copyright (©) 1999–2012 R Foundation for Statistical Computing; permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and permission notice are preserved on all copies.

...read moreread less

Abstract: Copyright (©) 1999–2012 R Foundation for Statistical Computing. Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies. Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one. Permission is granted to copy and distribute translations of this manual into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the R Core Team.

...read moreread less

272,030 citations

Posted Content•

Efficient Estimation of Word Representations in Vector Space

[...]

Tomas Mikolov¹, Kai Chen², Greg S. Corrado³, Jeffrey Dean³•Institutions (3)

Brno University of Technology¹, Beijing University of Posts and Telecommunications², Google³

16 Jan 2013-arXiv: Computation and Language

TL;DR: This paper proposed two novel model architectures for computing continuous vector representations of words from very large data sets, and the quality of these representations is measured in a word similarity task and the results are compared to the previously best performing techniques based on different types of neural networks.

...read moreread less

Abstract: We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.

...read moreread less

20,077 citations

Journal Article•DOI•

A circumplex model of affect

[...]

James A. Russell¹•Institutions (1)

University of British Columbia¹

01 Dec 1980-Journal of Personality and Social Psychology

12,519 citations

Posted Content•

Distributed Representations of Words and Phrases and their Compositionality

[...]

Tomas Mikolov¹, Ilya Sutskever¹, Kai Chen¹, Greg S. Corrado¹, Jeffrey Dean¹ - Show less +1 more•Institutions (1)

Google¹

16 Oct 2013-arXiv: Computation and Language

TL;DR: In this paper, the Skip-gram model is used to learn high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships and improve both the quality of the vectors and the training speed.

...read moreread less

Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.

...read moreread less

11,343 citations

Book Chapter•DOI•

Text Categorization with Suport Vector Machines: Learning with Many Relevant Features

[...]

Thorsten Joachims¹•Institutions (1)

Technical University of Dortmund¹

21 Apr 1998

TL;DR: This paper explores the use of Support Vector Machines for learning text classifiers from examples and analyzes the particular properties of learning with text data and identifies why SVMs are appropriate for this task.

...read moreread less

Abstract: This paper explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples. It analyzes the particular properties of learning with text data and identifies why SVMs are appropriate for this task. Empirical results support the theoretical findings. SVMs achieve substantial improvements over the currently best performing methods and behave robustly over a variety of different learning tasks. Furthermore they are fully automatic, eliminating the need for manual parameter tuning.

...read moreread less

8,658 citations