Journal ArticleDOI

A Coefficient of Agreement for Nominal Scales

Jacob Cohen1
01 Apr 1960-Educational and Psychological Measurement (SAGE Publications)-Vol. 20, Iss: 1, pp 37-46
TL;DR: In this article, the author presents a procedure for having two or more judges independently categorize a sample of units and for determining the degree and significance of their agreement, as a way of assessing the extent to which such nominal-scale judgments are reproducible, i.e., reliable.
Abstract: CONSIDER Table 1. It represents in its formal characteristics a situation which arises in the clinical-social-personality areas of psychology, where it frequently occurs that the only useful level of measurement obtainable is nominal scaling (Stevens, 1951, pp. 25-26), i.e. placement in a set of k unordered categories. Because the categorizing of the units is a consequence of some complex judgment process performed by a "two-legged meter" (Stevens, 1958), it becomes important to determine the extent to which these judgments are reproducible, i.e., reliable. The procedure which suggests itself is that of having two (or more) judges independently categorize a sample of units and determine the degree, significance, and ...
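The coefficient the paper goes on to define, kappa, compares the observed proportion of agreement between the two judges with the proportion expected if each judge simply categorized at random according to their own marginal rates: κ = (p_o − p_e) / (1 − p_e). A minimal sketch of that calculation in Python, using invented ratings purely for illustration:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters assigning nominal categories.

    ratings_a, ratings_b: equal-length sequences of category labels
    for the same units.
    """
    if len(ratings_a) != len(ratings_b):
        raise ValueError("both raters must rate the same units")
    n = len(ratings_a)

    # Observed proportion of agreement: fraction of units on which the
    # two raters chose the same category.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance-expected agreement: product of the raters' marginal
    # proportions, summed over categories.
    marg_a = Counter(ratings_a)
    marg_b = Counter(ratings_b)
    p_e = sum(marg_a[c] * marg_b[c] for c in marg_a) / (n * n)

    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two judges sorting 10 units into 3 categories.
judge_1 = ["x", "x", "y", "y", "z", "z", "x", "y", "z", "x"]
judge_2 = ["x", "x", "y", "z", "z", "z", "x", "y", "y", "x"]
print(round(cohens_kappa(judge_1, judge_2), 3))
```

With these hypothetical data, observed agreement is 0.80 while chance agreement is 0.34, giving κ ≈ 0.70.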


Citations
Journal ArticleDOI
TL;DR: A general statistical methodology for the analysis of multivariate categorical data arising from observer reliability studies is presented; tests for interobserver bias are framed in terms of first-order marginal homogeneity, and measures of interobserver agreement are developed as generalized kappa-type statistics.
Abstract: This paper presents a general statistical methodology for the analysis of multivariate categorical data arising from observer reliability studies. The procedure essentially involves the construction of functions of the observed proportions which are directed at the extent to which the observers agree among themselves and the construction of test statistics for hypotheses involving these functions. Tests for interobserver bias are presented in terms of first-order marginal homogeneity and measures of interobserver agreement are developed as generalized kappa-type statistics. These procedures are illustrated with a clinical diagnosis example from the epidemiological literature.
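Generalized kappa-type statistics of the kind referred to here take the form (observed weighted agreement minus chance-expected weighted agreement) divided by (one minus chance-expected weighted agreement), evaluated on the table of observed proportions with some chosen weight matrix. The sketch below is a generic illustration of that form, not the authors' own formulation or worked example; the contingency table and weights are invented:

```python
import numpy as np

def weighted_kappa(counts, weights):
    """Kappa-type agreement statistic from a k x k contingency table.

    counts  : counts[i, j] = number of units rater A put in category i
              and rater B put in category j.
    weights : weights[i, j] = credit given to an (i, j) pair of ratings,
              with 1.0 on the diagonal (full agreement).
    """
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()                 # observed proportions
    row_marg = p.sum(axis=1)                  # rater A's marginal proportions
    col_marg = p.sum(axis=0)                  # rater B's marginal proportions

    p_o = (weights * p).sum()                              # observed weighted agreement
    p_e = (weights * np.outer(row_marg, col_marg)).sum()   # chance-expected agreement
    return (p_o - p_e) / (1 - p_e)

# Invented 3 x 3 cross-classification of the two raters' assignments.
table = np.array([[20,  5,  0],
                  [ 3, 15,  2],
                  [ 1,  4, 10]])
identity = np.eye(3)
print(round(weighted_kappa(table, identity), 3))

# First-order marginal homogeneity (no interobserver bias) means the two
# raters' marginal proportions are equal; a quick descriptive check:
p = table / table.sum()
print(np.round(p.sum(axis=1) - p.sum(axis=0), 3))
```

With identity weights the statistic reduces to ordinary unweighted kappa; off-diagonal weights between 0 and 1 give partial credit for near-misses, which is how weighted kappa arises as a special case.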

64,109 citations


Cites background or methods from "A Coefficient of agreement for nomi..."

  • ...In particular, w1j represents the set of weights which generate the kappa measure of perfect agreement proposed in Cohen [1960]. The sequence of hierarchical kappa-type statistics within each of the two patient populations associated with the weights given in Table 2 can be expressed in the formulation (A....

    [...]

  • ...example, the weights w2j in Table 8 are directly analogous to those discussed in Cohen [1968], Fleiss, Cohen and Everitt [1969] and Cicchetti [1972], which were used to generate weighted kappa and C statistics....

    [...]

  • ..., Goodman and Kruskal [1954], Cohen [1960, 1968], Fleiss [1971], Light [1971], and Cicchetti [1972]....

    [...]

  • ...Furthermore, as shown in Fleiss and Cohen [1973] and Fleiss [1975], K is directly analogous to the intraclass correlation coefficient obtained from ANOVA models for quantitative measurements and can be used as a measure of the reliability of multiple determinations on the same subjects....

    [...]

Book
25 Oct 1999
TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research.

  • Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects
  • Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods
  • Includes the downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks, in an updated, interactive interface; algorithms in the toolkit cover data pre-processing, classification, regression, clustering, association rules, and visualization

20,196 citations

Journal ArticleDOI
TL;DR: Abnormal lipids, smoking, hypertension, diabetes, abdominal obesity, psychosocial factors, consumption of fruits, vegetables, and alcohol, and regular physical activity account for most of the risk of myocardial infarction worldwide in both sexes and at all ages in all regions.

10,387 citations

Journal ArticleDOI
TL;DR: While kappa is one of the most commonly used statistics for testing interrater reliability, it has limitations; levels of both kappa and percent agreement that should be demanded in healthcare studies are suggested.
Abstract: The kappa statistic is frequently used to test interrater reliability. The importance of rater reliability lies in the fact that it represents the extent to which the data collected in the study are correct representations of the variables measured. Measurement of the extent to which data collectors (raters) assign the same score to the same variable is called interrater reliability. While there have been a variety of methods to measure interrater reliability, traditionally it was measured as percent agreement, calculated as the number of agreement scores divided by the total number of scores. In 1960, Jacob Cohen critiqued use of percent agreement due to its inability to account for chance agreement. He introduced Cohen's kappa, developed to account for the possibility that raters actually guess on at least some variables due to uncertainty. Like most correlation statistics, the kappa can range from -1 to +1. While the kappa is one of the most commonly used statistics to test interrater reliability, it has limitations. Judgments about what level of kappa should be acceptable for health research are questioned. Cohen's suggested interpretation may be too lenient for health related studies because it implies that a score as low as 0.41 might be acceptable. Kappa and percent agreement are compared, and levels for both kappa and percent agreement that should be demanded in healthcare studies are suggested.
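The chance-correction issue described above is easiest to see with skewed categories: two raters can agree on most units simply because one category dominates, so percent agreement looks strong while kappa stays modest. A small illustration with invented counts:

```python
# Invented ratings: with a rare "yes" category, percent agreement looks
# excellent even though the raters add little beyond chance, which is
# exactly the problem kappa was introduced to expose.
cases = ([("yes", "yes")] * 2 + [("yes", "no")] * 4 +
         [("no", "yes")] * 4 + [("no", "no")] * 90)

n = len(cases)
percent_agreement = sum(a == b for a, b in cases) / n          # 0.92

p_yes_a = sum(a == "yes" for a, _ in cases) / n                # rater A's "yes" rate
p_yes_b = sum(b == "yes" for _, b in cases) / n                # rater B's "yes" rate
p_e = p_yes_a * p_yes_b + (1 - p_yes_a) * (1 - p_yes_b)        # chance agreement
kappa = (percent_agreement - p_e) / (1 - p_e)

print(round(percent_agreement, 2), round(p_e, 3), round(kappa, 2))
```

Here percent agreement is 92% yet kappa is only about 0.29, below even the 0.41 level the abstract describes as possibly too lenient.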

9,097 citations

Journal ArticleDOI
TL;DR: Results suggest the K-SADS-PL generates reliable and valid child psychiatric diagnoses.
Abstract: Objective To describe the psychometric properties of the Schedule for Affective Disorders and Schizophrenia for School-Age Children-Present and Lifetime version (K-SADS-PL) interview, which surveys additional disorders not assessed in prior K-SADS, contains improved probes and anchor points, includes diagnosis-specific impairment ratings, generates DSM-III-R and DSM-IV diagnoses, and divides symptoms surveyed into a screening interview and five diagnostic supplements. Method Subjects were 55 psychiatric outpatients and 11 normal controls (aged 7 through 17 years). Both parents and children were used as informants. Concurrent validity of the screen criteria and the K-SADS-PL diagnoses was assessed against standard self-report scales. Interrater (n = 15) and test-retest (n = 20) reliability data were also collected (mean retest interval: 18 days; range: 2 to 38 days). Results Rating scale data support the concurrent validity of screens and K-SADS-PL diagnoses. Interrater agreement in scoring screens and diagnoses was high (range: 93% to 100%). Test-retest reliability κ coefficients were in the excellent range for present and/or lifetime diagnoses of major depression, any bipolar, generalized anxiety, conduct, and oppositional defiant disorder (.77 to 1.00) and in the good range for present diagnoses of posttraumatic stress disorder and attention-deficit hyperactivity disorder (.63 to .67). Conclusion Results suggest the K-SADS-PL generates reliable and valid child psychiatric diagnoses. J. Am. Acad. Child Adolesc. Psychiatry, 1997, 36(7): 980–988.

8,742 citations


Cites methods from "A Coefficient of agreement for nomi..."

  • ...Percent agreement was used to generate interrater reliability estimates, as there were an insufficient number of cases (n < 5) to justify calculation of a κ statistic (Cohen, 1960) in most diagnostic categories....

    [...]

References
Book
01 Jan 1942

3,601 citations

Book
01 Jan 1979
TL;DR: In this article, a number of alternative measures are considered, almost all based upon a probabilistic model for activity to which the cross-classification may typically lead, and only the case in which the population is completely known is considered, so no question of sampling or measurement error appears.
Abstract: When populations are cross-classified with respect to two or more classifications or polytomies, questions often arise about the degree of association existing between the several polytomies. Most of the traditional measures or indices of association are based upon the standard chi-square statistic or on an assumption of underlying joint normality. In this paper a number of alternative measures are considered, almost all based upon a probabilistic model for activity to which the cross-classification may typically lead. Only the case in which the population is completely known is considered, so no question of sampling or measurement error appears. We hope, however, to publish before long some approximate distributions for sample estimators of the measures we propose, and approximate tests of hypotheses. Our major theme is that the measures of association used by an empirical investigator should not be blindly chosen because of tradition and convention only, although these factors may properly be g...
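One way to see the contrast the abstract draws is to compute, on the same cross-classification, one index derived from the chi-square statistic and one with a direct probabilistic interpretation. The sketch below uses Cramér's V and the Goodman-Kruskal lambda as stand-ins for the two families; the table is invented and the choice of indices is illustrative rather than the paper's own worked example:

```python
import numpy as np

def cramers_v(table):
    """Chi-square-based association index, scaled to lie in [0, 1]."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))

def goodman_kruskal_lambda(table):
    """Probabilistic association index: proportional reduction in the
    error of guessing the column category once the row category is known."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    errors_without_rows = n - table.sum(axis=0).max()   # guess the modal column
    errors_with_rows = n - table.max(axis=1).sum()      # guess the modal column per row
    return (errors_without_rows - errors_with_rows) / errors_without_rows

# Invented 3 x 3 cross-classification of two polytomies.
table = [[30, 10, 5],
         [ 8, 25, 7],
         [ 4,  6, 20]]
print(round(cramers_v(table), 3), round(goodman_kruskal_lambda(table), 3))
```

The first index inherits its meaning from the chi-square test of independence, while the second answers a concrete predictive question, which is the kind of operational interpretation the paper argues measures of association should have.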

2,672 citations