
Empir Software Eng (2017) 22:2543–2584
DOI 10.1007/s10664-016-9493-x
On negative results when using sentiment analysis tools
for software engineering research
Robbert Jongeling¹ · Proshanta Sarkar² · Subhajit Datta³ · Alexander Serebrenik¹
Published online: 10 January 2017
© The Author(s) 2017. This article is published with open access at Springerlink.com
Abstract Recent years have seen increasing attention to social aspects of software engi-
neering, including studies of emotions and sentiments experienced and expressed by the
software developers. Most of these studies reuse existing sentiment analysis tools such as
SENTISTRENGTH and NLTK. However, these tools have been trained on product reviews
and movie reviews and, therefore, their results might not be applicable in the software engi-
neering domain. In this paper we study whether the sentiment analysis tools agree with the
sentiment recognized by human evaluators (as reported in an earlier study) as well as with
each other. Furthermore, we evaluate the impact of the choice of a sentiment analysis tool
on software engineering studies by conducting a simple study of differences in issue reso-
lution times for positive, negative and neutral texts. We repeat the study for seven datasets
(issue trackers and STACK OVERFLOW questions) and different sentiment analysis tools and
observe that the disagreement between the tools can lead to diverging conclusions. Finally,
we perform two replications of previously published studies and observe that the results of
those studies cannot be confirmed when a different sentiment analysis tool is used.
Communicated by: Richard Paige, Jordi Cabot and Neil Ernst
Alexander Serebrenik
a.serebrenik@tue.nl
Robbert Jongeling
r.m.jongeling@alumnus.tue.nl
Proshanta Sarkar
proshant.cse@gmail.com
Subhajit Datta
subhajit.datta@acm.org
¹ Eindhoven University of Technology, Eindhoven, The Netherlands
² IBM India Private Limited, Kolkata, India
³ Singapore University of Technology and Design, Singapore, Singapore

Keywords Sentiment analysis tools · Replication study · Negative results
1 Introduction
Sentiment analysis is “the task of identifying positive and negative opinions, emotions, and
evaluations” (Wilson et al. 2005). Since its inception, sentiment analysis has been the subject
of intensive research and has been successfully applied, e.g., to assist users in
their development by providing them with interesting and supportive content (Honkela et al.
2012), to predict the outcome of an election (Tumasjan et al. 2010) or movie sales (Mishne
and Glance 2006). The spectrum of sentiment analysis techniques ranges from identifying
polarity (positive or negative) to a complex computational treatment of subjectivity, opinion
and sentiment (Pang and Lee 2007). In particular, the research on sentiment polarity analysis
has resulted in a number of mature and publicly available tools such as SENTISTRENGTH
(Thelwall et al. 2010), Alchemy,¹ the Stanford NLP sentiment analyser (Socher et al. 2013) and
NLTK (Bird et al. 2009).
In recent times, large scale software development has become increasingly social. With
the proliferation of collaborative development environments, discussions between developers
are recorded and archived to an extent that could not be conceived before. The availability of
such discussion materials makes it easy to study whether and how the sentiments expressed
by software developers influence the outcome of development activities. With this back-
ground, we apply sentiment polarity analysis to several software development ecosystems
in this study.
Sentiment polarity analysis has been recently applied in the software engineering context
to study commit comments in GitHub (Guzman et al. 2014), GitHub discussions related to
security (Pletea et al. 2014), productivity in Jira issue resolution (Ortu et al. 2015), activity
of contributors in Gentoo (Garcia et al.
2013), classification of user reviews for mainte-
nance and evolution (Panichella et al. 2015) and evolution of developers’ sentiments in the
openSUSE Factory (Rousinopoulos et al. 2014). It has also been suggested when assess-
ing technical candidates on the social web (Capiluppi et al. 2013). Not surprisingly, all
the aforementioned software engineering studies with the notable exception of the work
by Panichella et al. (2015), reuse the existing sentiment polarity tools, e.g., (Pletea et al.
2014) and (Rousinopoulos et al. 2014) use NLTK, while (Garcia et al. 2013; Guzman and
Bruegge 2013; Guzman et al. 2014; Novielli et al. 2015) and (Ortu et al. 2015) opted for
SENTISTRENGTH. While the reuse of the existing tools facilitated the application of the sentiment
polarity analysis techniques in the software engineering domain, it also introduced
a commonly recognized threat to validity of the results obtained: those tools have been
trained on non-software engineering related texts such as movie reviews or product reviews
and might misidentify (or fail to identify) polarity of a sentiment in a software engineering
artefact such as a commit comment (Guzman et al. 2014; Pletea et al. 2014).
Therefore, in this paper we focus on sentiment polarity analysis (Wilson et al. 2005) and
investigate to what extent the software engineering results obtained from sentiment analysis
depend on the choice of the sentiment analysis tool. We recognize that there are multiple
ways to measure outcomes in software engineering. Among them, time to resolve a partic-
ular defect, and/or respond to a particular query are relevant for end users. Accordingly, in
¹ http://www.alchemyapi.com/products/alchemylanguage/sentiment-analysis/

the different data-sets studied in this paper, we have taken such resolution or response times
to reflect the outcomes of our interest.
For the sake of simplicity, from here on, instead of “existing sentiment polarity analysis
tools” we talk about the “sentiment analysis tools”. Specifically, we aim at answering the
following questions:
RQ1: To what extent do different sentiment analysis tools agree with emotions of
software developers?
RQ2: To what extent do results from different sentiment analysis tools agree with each
other?
We have observed disagreement between sentiment analysis tools and the emotions of soft-
ware developers but also between different sentiment analysis tools themselves. However,
disagreement between the tools does not a priori mean that sentiment analysis tools might
lead to contradictory results in software engineering studies making use of these tools. Thus,
we ask
RQ3: Do different sentiment analysis tools lead to contradictory results in a software
engineering study?
We have observed that disagreement between the tools might lead to contradictory results
in software engineering studies. Therefore, we finally conduct replication studies in order
to understand:
RQ4: How does the choice of a sentiment analysis tool affect validity of the previously
published results?
The remainder of this paper is organized as follows. The next section outlines the sen-
timent analysis tools we have considered in this study. In Section 3 we study agreement
between the tools and the results of manual labeling, and between the tools themselves, i.e.,
RQ1 and RQ2. In Section 4 we conduct a series of studies based on the results of different
sentiment analysis tools. We observe that conclusions one might derive using different tools
diverge, casting doubt on their validity (RQ3). While our answer to RQ3 indicates that the
choice of a sentiment analysis tool might affect validity of software engineering results, in
Section 5 we perform replication of two published studies answering RQ4 and establishing
that conclusions of previously published works cannot be reproduced when a different sen-
timent analysis tool is used. Finally, in Section 6 we discuss related work and conclude in
Section 7.
Source code and data used to obtain the results of this paper have been made available.²
2 Sentiment Analysis Tools
2.1 Tool Selection
To perform the tool evaluation we have decided to focus on open-source tools. This requirement
excludes commercial tools such as Lymbix,³ the Sentiment API of MeaningCloud⁴ or

² http://ow.ly/HvC5302N4oK
³ http://www.lymbix.com/supportcenter/docs
⁴ https://www.meaningcloud.com/developer/sentiment-analysis

GetSentiment.⁵ Furthermore, we exclude tools that require training before they can be
applied, such as LibShortText (Yu et al. 2013) or sentiment analysis libraries of popular
machine learning tools such as RapidMiner or Weka. Finally, since the software engineering
texts that have been analyzed in the past can be quite short (JIRA issues, STACK OVER-
FLOW questions), we have chosen tools that have already been applied either to software
engineering texts (SENTISTRENGTH and NLTK) or to short texts such as tweets (Alchemy
or Stanford NLP sentiment analyser).
2.2 Description of Tools
2.2.1 SENTISTRENGTH
SENTISTRENGTH is the sentiment analysis tool most frequently used in software engineer-
ing studies (Garcia et al. 2013; Guzman et al. 2014; Novielli et al. 2015; Ortu et al. 2015).
Moreover, SENTISTRENGTH had the highest average accuracy among fifteen Twitter senti-
ment analysis tools (Abbasi et al. 2014). SENTISTRENGTH assigns an integer value between
1 and 5 for the positivity of a text, p, and similarly, an integer value between −1 and −5
for the negativity, n.
Interpretation In order to map the separate positivity and negativity scores to a senti-
ment (positive, neutral or negative) for an entire text fragment, we follow the approach by
Thelwall et al. (2012). A text is considered positive when p + n > 0, negative when
p + n < 0, and neutral if p = −n and p < 4. Texts with p = −n and p ≥ 4 are
considered to have an undetermined sentiment and are removed from the datasets.
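Expressed as code, this decision rule might look like the following sketch (the function name and string labels are ours; p and n follow the score ranges above):

```python
def sentistrength_label(p, n):
    """Map SentiStrength scores to a document-level sentiment.

    p: positivity score in 1..5; n: negativity score in -5..-1,
    combined with the rule of Thelwall et al. (2012) described above.
    Returns 'positive', 'negative', 'neutral', or None when the
    sentiment is undetermined (such texts are removed from the dataset).
    """
    if p + n > 0:
        return "positive"
    if p + n < 0:
        return "negative"
    # Here p == -n: weak ties are neutral, strong ties are undetermined.
    if p < 4:
        return "neutral"
    return None
```

For example, a text scored p = 2, n = −2 is neutral, while p = 4, n = −4 is dropped as undetermined.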
2.2.2 Alchemy
Alchemy provides several text processing APIs, including a sentiment analysis API which
promises to work on very short texts (e.g., tweets) as well as relatively long texts (e.g., news
articles).⁶ The sentiment analysis API returns, for a text fragment, a status, a language, a
score and a type. The score is in the range [−1, 1]; the type is the sentiment of the text and is
based on the score. For negative scores, the type is negative, conversely for positive scores,
the type is positive. For a score of 0, the type is neutral. The status reflects the analysis
success and it is either “OK” or “ERROR”.
Interpretation We ignore texts with status “ERROR” or a non-English language. For the
remaining texts we consider them as being negative, neutral or positive as indicated by the
returned type.
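A minimal sketch of this filtering (field names follow the API description above; the exact language value returned by the service is an assumption for illustration):

```python
def alchemy_label(response):
    """Interpret an Alchemy sentiment API response dictionary.

    Texts with status 'ERROR' or a non-English language are discarded
    (None); otherwise the returned type is used as the sentiment.
    The 'english' language value is an assumption for illustration.
    """
    if response.get("status") != "OK":
        return None
    if response.get("language") != "english":
        return None
    return response.get("type")  # 'negative', 'neutral' or 'positive'
```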
2.2.3 NLTK
NLTK has been applied in earlier software engineering studies (Pletea et al. 2014;
Rousinopoulos et al. 2014). NLTK uses a simple bag of words model and returns for each
⁵ https://getsentiment.3scale.net/
⁶ http://www.alchemyapi.com/products/alchemylanguage/sentiment-analysis

Empir Software Eng (2017) 22:2543–2584 2547
text three probabilities: a probability of the text being negative, one of it being neutral and
one of it being positive. To call NLTK, we use the API provided at text-processing.com.⁷
Interpretation If the probability score for neutral is greater than 0.5, the text is considered
neutral. Otherwise, it is considered to be the other sentiment with the highest probability
(Pletea et al. 2014).
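This rule can be sketched as follows (function name ours; the three probabilities are those returned by the API):

```python
def nltk_label(p_neg, p_neutral, p_pos):
    """Interpret NLTK's three class probabilities (Pletea et al. 2014).

    Neutral wins whenever its probability exceeds 0.5; otherwise the
    text takes the non-neutral sentiment with the higher probability.
    """
    if p_neutral > 0.5:
        return "neutral"
    return "negative" if p_neg > p_pos else "positive"
```

Note that with this rule a text with, say, probabilities (0.45, 0.45, 0.10) is labeled negative even though neutral is equally likely, because neutral must strictly exceed 0.5 to win.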
2.2.4 Stanford NLP
The Stanford NLP parses the text into sentences and performs a more advanced grammatical
analysis, as opposed to the simpler bag of words model used, e.g., in NLTK. Indeed, Socher
et al. argue that such an analysis should outperform the bag of words model on short texts
(Socher et al. 2013). The Stanford NLP breaks down the text into sentences and assigns
each a sentiment score in the range [0, 4], where 0 is very negative, 2 is neutral and 4 is
each a sentiment score in the range [0, 4], where 0 is very negative, 2 is neutral and 4 is
very positive. We note that the tool may have difficulty breaking the text into sentences
as comments sometimes include pieces of code or e.g. URLs. The tool does not provide a
document-level score.
Interpretation To determine a document-level sentiment we compute −2·#0 − #1 + #3 +
2·#4, where #0 denotes the number of sentences with score 0, etc. If this score is negative,
zero or positive, we consider the text to be negative, neutral or positive, respectively.
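The aggregation can be sketched as follows (function name ours; the per-sentence scores are those produced by the tool):

```python
from collections import Counter

def stanford_doc_label(sentence_scores):
    """Aggregate Stanford NLP per-sentence scores (0..4) into one label.

    Computes -2*#0 - #1 + #3 + 2*#4; equivalently, each sentence
    contributes its signed distance from the neutral score 2, and the
    sign of the sum decides the document-level sentiment.
    """
    counts = Counter(sentence_scores)
    score = -2 * counts[0] - counts[1] + counts[3] + 2 * counts[4]
    if score < 0:
        return "negative"
    if score > 0:
        return "positive"
    return "neutral"
```

For example, a comment with one very negative sentence (0) and one very positive sentence (4) sums to zero and is labeled neutral.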
3 Agreement Between Sentiment Analysis Tools
In this section we address RQ1 and RQ2, i.e., to what extent the different sentiment
analysis tools described earlier agree with emotions of software developers and to what
extent different sentiment analysis tools agree with each other. To perform the evaluation
we use the manually labeled emotions dataset (Murgia et al. 2014).
3.1 Methodology
3.1.1 Manually-Labeled Software Engineering Data
As the “golden set” we use the data from a developer emotions study by Murgia et al.
(2014). In this study, four evaluators manually labeled 392 comments with the emotions “joy”,
“love”, “surprise”, “anger”, “sadness” or “fear”. The emotions “joy” and “love” are taken as
indicators of positive sentiment, and “anger”, “sadness” and “fear” as indicators of negative
sentiment. We exclude information about the “surprise” emotion, since surprises can be, in
general, both positive and negative depending on the expectations of the speaker.
We focus on consistently labeled comments. We consider the comment as positive if at
least three evaluators have indicated a positive sentiment and no evaluator has indicated
negative sentiments. Similarly, we consider the comment as negative if at least three evalua-
tors have indicated a negative sentiment and no evaluator has indicated positive sentiments.
Finally, a text is considered as neutral when three or more evaluators have neither indicated
a positive sentiment nor a negative sentiment.
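The consensus rule above can be sketched as follows (function name, emotion-to-polarity sets and the per-evaluator set representation are ours):

```python
POSITIVE = {"joy", "love"}
NEGATIVE = {"anger", "sadness", "fear"}

def consensus_label(evaluations):
    """Derive a consensus sentiment from per-evaluator emotion labels.

    evaluations: one set of emotion labels per evaluator (four in the
    Murgia et al. dataset), with "surprise" already excluded.
    Returns 'positive', 'negative', 'neutral', or None when the
    comment is not consistently labeled.
    """
    pos = sum(1 for labels in evaluations if labels & POSITIVE)
    neg = sum(1 for labels in evaluations if labels & NEGATIVE)
    neutral = sum(1 for labels in evaluations
                  if not (labels & POSITIVE) and not (labels & NEGATIVE))
    if pos >= 3 and neg == 0:
        return "positive"
    if neg >= 3 and pos == 0:
        return "negative"
    if neutral >= 3:
        return "neutral"
    return None  # no consensus; excluded from the golden set
```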
⁷ API docs for NLTK sentiment analysis: http://text-processing.com/docs/sentiment.html
