Open Access Journal Article (DOI)

Interrater reliability: the kappa statistic

Mary L. McHugh
- Biochemia Medica, 15 Oct 2012
- Vol. 22, Iss. 3, pp. 276-282
TL;DR
While kappa is one of the most commonly used statistics for testing interrater reliability, it has limitations; levels of both kappa and percent agreement that should be demanded in healthcare studies are suggested.
Abstract
The kappa statistic is frequently used to test interrater reliability. The importance of rater reliability lies in the fact that it represents the extent to which the data collected in the study are correct representations of the variables measured. Measurement of the extent to which data collectors (raters) assign the same score to the same variable is called interrater reliability. While a variety of methods to measure interrater reliability exist, it was traditionally measured as percent agreement, calculated as the number of agreement scores divided by the total number of scores. In 1960, Jacob Cohen critiqued the use of percent agreement due to its inability to account for chance agreement. He introduced Cohen's kappa, developed to account for the possibility that raters actually guess on at least some variables due to uncertainty. Like most correlation statistics, kappa can range from -1 to +1. While kappa is one of the most commonly used statistics to test interrater reliability, it has limitations. Judgments about what level of kappa should be acceptable for health research are questioned. Cohen's suggested interpretation may be too lenient for health-related studies because it implies that a score as low as 0.41 might be acceptable. Kappa and percent agreement are compared, and levels for both kappa and percent agreement that should be demanded in healthcare studies are suggested.
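For readers who want to compute the two measures the abstract contrasts, the following minimal Python sketch (an illustration, not part of the original article; the function names and toy data are hypothetical) implements percent agreement and Cohen's kappa, defined as kappa = (Po - Pe) / (1 - Pe), where Po is the observed agreement and Pe is the agreement expected by chance from the raters' marginal category frequencies:

from collections import Counter

def percent_agreement(rater_a, rater_b):
    # Number of agreement scores divided by the total number of scores.
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    # Kappa corrects the observed agreement (Po) for chance agreement (Pe).
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Pe: probability that both raters pick the same category by chance,
    # estimated from each rater's marginal category frequencies.
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Toy example: two raters scoring ten items as "yes"/"no".
a = ["yes", "yes", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]
b = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes"]
print(percent_agreement(a, b))  # 0.8
print(cohens_kappa(a, b))       # ~0.47

In this toy example percent agreement is 0.80 while kappa is only about 0.47: once chance agreement is discounted, a seemingly comfortable percent agreement falls near the 0.41 level that Cohen's interpretation would already deem acceptable, which is the article's central concern.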



Citations
Journal ArticleDOI

The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation

TL;DR: This article shows how MCC produces a more informative and truthful score than accuracy and F1 score when evaluating binary classifications, first by explaining its mathematical properties and then by demonstrating its advantages in six synthetic use cases and in a real genomics scenario.
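As a brief illustration of that claim (a sketch in the same spirit, not taken from the cited article; the confusion-matrix counts are hypothetical), MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)) can be compared with accuracy and F1 on an imbalanced dataset:

import math

def mcc(tp, fp, fn, tn):
    # Matthews correlation coefficient; 0.0 by convention when a marginal is empty.
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den else 0.0

# Hypothetical degenerate classifier: predicts "positive" for all 100 samples
# of a 90/10 imbalanced dataset.
tp, fp, fn, tn = 90, 10, 0, 0
accuracy = (tp + tn) / 100           # 0.90 -- looks strong
f1 = 2 * tp / (2 * tp + fp + fn)     # ~0.95 -- looks even stronger
print(accuracy, f1, mcc(tp, fp, fn, tn))  # MCC is 0.0: no better than chance

Accuracy and F1 reward the majority-class guesser, while MCC, which requires good performance on both classes, stays at zero.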
Journal ArticleDOI

Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning.

TL;DR: A deep convolutional neural network is trained on whole-slide images from The Cancer Genome Atlas to accurately and automatically classify them as LUAD, LUSC, or normal lung tissue; the model also predicts the ten most commonly mutated genes in LUAD.
Journal ArticleDOI

Intercoder Reliability in Qualitative Research: Debates and Practical Guidelines

TL;DR: This paper discusses assessing the intercoder reliability (ICR) of a coding frame as a good practice in qualitative analysis, notes that ICR remains a somewhat controversial topic in the qualitative research community, and offers practical guidelines.
Journal ArticleDOI

PD-L1 Immunohistochemistry Comparability Study in Real-Life Clinical Samples: Results of Blueprint Phase 2 Project.

TL;DR: The Blueprint (BP) Programmed Death Ligand 1 (PD-L1) Immunohistochemistry Comparability Project is a pivotal academic/professional society and industrial collaboration to assess the feasibility of harmonizing the clinical use of five independently developed commercial PD-L1 immunohistochemical assays.
Journal ArticleDOI

Global threat of arsenic in groundwater

TL;DR: A global model for predicting groundwater arsenic levels suggests that 94 million to 220 million people are potentially exposed to high arsenic concentrations in groundwater, the vast majority of them in Asia.
References
Journal ArticleDOI

A Coefficient of Agreement for Nominal Scales

TL;DR: In this article, the authors present a procedure for having two or more judges independently categorize a sample of units and for determining the degree and significance of their agreement, i.e., the extent to which the judgments are reproducible (reliable).
Journal ArticleDOI

A Comparison of Consensus, Consistency, and Measurement Approaches to Estimating Interrater Reliability

TL;DR: Researchers and practitioners should be aware that different approaches to estimating interrater reliability carry with them different implications for how ratings across multiple judges should be summarized, which may impact the validity of subsequent study results.
Journal ArticleDOI

Meta-analysis of Pap test accuracy

TL;DR: The summary receiver operating characteristic curve suggests that the Pap test may be unable to achieve concurrently high sensitivity and specificity and future primary studies should pay more attention to methodologic standards for the conduct and reporting of diagnostic test evaluations.
Journal Article

Pressure ulcers: prevention, evaluation, and management.

TL;DR: Treatment involves management of local and distant infections, removal of necrotic tissue, maintenance of a moist environment for wound healing, and possibly surgery, and systemic antibiotics are used in patients with advancing cellulitis, osteomyelitis, or systemic infection.