Open Access Journal Article (DOI)

Interrater reliability: the kappa statistic

Mary L. McHugh
- Biochemia Medica, 15 Oct 2012
- Vol. 22, Iss. 3, pp. 276-282
TL;DR
While kappa is one of the most commonly used statistics for testing interrater reliability, it has limitations; levels of both kappa and percent agreement that should be demanded in healthcare studies are suggested.
Abstract
The kappa statistic is frequently used to test interrater reliability. The importance of rater reliability lies in the fact that it represents the extent to which the data collected in the study are correct representations of the variables measured. Measurement of the extent to which data collectors (raters) assign the same score to the same variable is called interrater reliability. While a variety of methods to measure interrater reliability exist, it was traditionally measured as percent agreement, calculated as the number of agreement scores divided by the total number of scores. In 1960, Jacob Cohen critiqued the use of percent agreement due to its inability to account for chance agreement. He introduced Cohen's kappa, developed to account for the possibility that raters actually guess on at least some variables due to uncertainty. Like most correlation statistics, kappa can range from -1 to +1. While kappa is one of the most commonly used statistics to test interrater reliability, it has limitations. Judgments about what level of kappa should be acceptable for health research are questioned. Cohen's suggested interpretation may be too lenient for health-related studies because it implies that a score as low as 0.41 might be acceptable. Kappa and percent agreement are compared, and levels for both kappa and percent agreement that should be demanded in healthcare studies are suggested.
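For readers who want to compute the two measures the abstract contrasts, the following minimal Python sketch (an illustration, not part of the original article; the function names and toy data are hypothetical) implements percent agreement and Cohen's kappa, defined as kappa = (Po - Pe) / (1 - Pe), where Po is the observed agreement and Pe is the agreement expected by chance from the raters' marginal category frequencies:

from collections import Counter

def percent_agreement(rater_a, rater_b):
    # Number of agreement scores divided by the total number of scores.
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    # Kappa corrects the observed agreement (Po) for chance agreement (Pe).
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Pe: probability that both raters pick the same category by chance,
    # estimated from each rater's marginal category frequencies.
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Toy example: two raters scoring ten items as "yes"/"no".
a = ["yes", "yes", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]
b = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes"]
print(percent_agreement(a, b))  # 0.8
print(cohens_kappa(a, b))       # ~0.47

In this toy example percent agreement is 0.80 while kappa is only about 0.47: once chance agreement is discounted, a seemingly comfortable percent agreement falls near the 0.41 level that Cohen's interpretation would already deem acceptable, which is the article's central concern.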



Citations
Journal ArticleDOI

The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation

TL;DR: This article shows how MCC produces a more informative and truthful score than accuracy and F1 score when evaluating binary classifications, first by explaining its mathematical properties and then by demonstrating its advantages in six synthetic use cases and in a real genomics scenario.
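As a brief illustration of that claim (a sketch in the same spirit, not taken from the cited article; the confusion-matrix counts are hypothetical), MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)) can be compared with accuracy and F1 on an imbalanced dataset:

import math

def mcc(tp, fp, fn, tn):
    # Matthews correlation coefficient; 0.0 by convention when a marginal is empty.
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den else 0.0

# Hypothetical degenerate classifier: predicts "positive" for all 100 samples
# of a 90/10 imbalanced dataset.
tp, fp, fn, tn = 90, 10, 0, 0
accuracy = (tp + tn) / 100           # 0.90 -- looks strong
f1 = 2 * tp / (2 * tp + fp + fn)     # ~0.95 -- looks even stronger
print(accuracy, f1, mcc(tp, fp, fn, tn))  # MCC is 0.0: no better than chance

Accuracy and F1 reward the majority-class guesser, while MCC, which requires good performance on both classes, stays at zero.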
Journal ArticleDOI

Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning.

TL;DR: A deep convolutional neural network is trained on whole-slide images from The Cancer Genome Atlas to accurately and automatically classify them as LUAD, LUSC, or normal lung tissue; the model also predicts the ten most commonly mutated genes in LUAD.
Journal ArticleDOI

Intercoder Reliability in Qualitative Research: Debates and Practical Guidelines

TL;DR: This paper discusses assessing the intercoder reliability (ICR) of a coding frame as a good practice in qualitative analysis, notes that ICR remains a somewhat controversial topic in the qualitative research community, and offers practical guidelines.
Journal ArticleDOI

PD-L1 Immunohistochemistry Comparability Study in Real-Life Clinical Samples: Results of Blueprint Phase 2 Project.

TL;DR: The Blueprint (BP) Programmed Death Ligand 1 (PD-L1) Immunohistochemistry Comparability Project is a pivotal academic/professional society and industrial collaboration to assess the feasibility of harmonizing the clinical use of five independently developed commercial PD-L1 immunohistochemical assays.
Journal ArticleDOI

Global threat of arsenic in groundwater

TL;DR: A global model for predicting groundwater arsenic levels suggests that 94 million to 220 million people are potentially exposed to high arsenic concentrations in groundwater, the vast majority of them in Asia.
References
Journal ArticleDOI

A Coefficient of Agreement for Nominal Scales

TL;DR: In this article, the authors present a procedure for having two or more judges independently categorize a sample of units and for determining the degree and significance of their agreement, i.e., the extent to which the judgments are reproducible (reliable).
Journal ArticleDOI

A Comparison of Consensus, Consistency, and Measurement Approaches to Estimating Interrater Reliability

TL;DR: Researchers and practitioners should be aware that different approaches to estimating interrater reliability carry with them different implications for how ratings across multiple judges should be summarized, which may impact the validity of subsequent study results.
Journal ArticleDOI

Meta-analysis of Pap test accuracy

TL;DR: The summary receiver operating characteristic curve suggests that the Pap test may be unable to achieve concurrently high sensitivity and specificity and future primary studies should pay more attention to methodologic standards for the conduct and reporting of diagnostic test evaluations.
Journal Article

Pressure ulcers: prevention, evaluation, and management.

TL;DR: Treatment involves management of local and distant infections, removal of necrotic tissue, maintenance of a moist environment for wound healing, and possibly surgery, and systemic antibiotics are used in patients with advancing cellulitis, osteomyelitis, or systemic infection.