Journal ArticleDOI

The measurement of observer agreement for categorical data

01 Mar 1977 - Biometrics - Vol. 33, Iss. 1, pp. 159-174
TL;DR: A general statistical methodology for the analysis of multivariate categorical data arising from observer reliability studies is presented: tests for interobserver bias are given in terms of first-order marginal homogeneity, and measures of interobserver agreement are developed as generalized kappa-type statistics.
Abstract: This paper presents a general statistical methodology for the analysis of multivariate categorical data arising from observer reliability studies. The procedure essentially involves the construction of functions of the observed proportions which are directed at the extent to which the observers agree among themselves and the construction of test statistics for hypotheses involving these functions. Tests for interobserver bias are presented in terms of first-order marginal homogeneity and measures of interobserver agreement are developed as generalized kappa-type statistics. These procedures are illustrated with a clinical diagnosis example from the epidemiological literature.
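For orientation (standard background, not a claim quoted from the abstract), kappa-type agreement measures share the chance-corrected form

```latex
\[
\hat{\kappa} \;=\; \frac{\hat{\theta}_o - \hat{\theta}_e}{1 - \hat{\theta}_e}
\]
% \hat{\theta}_o : observed proportion of agreement among the observers
% \hat{\theta}_e : proportion of agreement expected by chance alone
```

so perfect agreement yields kappa = 1 and chance-level agreement yields kappa = 0; the generalized kappa-type statistics developed in the paper build on this form.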
Citations
Journal ArticleDOI
TL;DR: Results suggest the K-SADS-PL generates reliable and valid child psychiatric diagnoses.
Abstract: Objective To describe the psychometric properties of the Schedule for Affective Disorders and Schizophrenia for School-Age Children-Present and Lifetime version (K-SADS-PL) interview, which surveys additional disorders not assessed in prior K-SADS, contains improved probes and anchor points, includes diagnosis-specific impairment ratings, generates DSM-III-R and DSM-IV diagnoses, and divides symptoms surveyed into a screening interview and five diagnostic supplements. Method Subjects were 55 psychiatric outpatients and 11 normal controls (aged 7 through 17 years). Both parents and children were used as informants. Concurrent validity of the screen criteria and the K-SADS-PL diagnoses was assessed against standard self-report scales. Interrater (n = 15) and test-retest (n = 20) reliability data were also collected (mean retest interval: 18 days; range: 2 to 38 days). Results Rating scale data support the concurrent validity of screens and K-SADS-PL diagnoses. Interrater agreement in scoring screens and diagnoses was high (range: 93% to 100%). Test-retest reliability κ coefficients were in the excellent range for present and/or lifetime diagnoses of major depression, any bipolar, generalized anxiety, conduct, and oppositional defiant disorder (.77 to 1.00) and in the good range for present diagnoses of posttraumatic stress disorder and attention-deficit hyperactivity disorder (.63 to .67). Conclusion Results suggest the K-SADS-PL generates reliable and valid child psychiatric diagnoses. J. Am. Acad. Child Adolesc. Psychiatry, 1997, 36(7): 980–988.

8,742 citations

Journal ArticleDOI
TL;DR: In this paper, the authors provide criteria, guidelines, and simple rules of thumb to assist the clinician faced with the challenge of choosing an appropriate test instrument for a given psychological assessment.
Abstract: In the context of the development of prototypic assessment instruments in the areas of cognition, personality, and adaptive functioning, the issues of standardization, norming procedures, and the important psychometrics of test reliability and validity are evaluated critically. Criteria, guidelines, and simple rules of thumb are provided to assist the clinician faced with the challenge of choosing an appropriate test instrument for a given psychological assessment. Clinicians are often faced with the critical challenge of choosing the most appropriate available test instrument for a given psychological assessment of a child, adolescent, or adult of a particular age, gender, and class of disability. It is the purpose of this report to provide some criteria, guidelines, or simple rules of thumb to aid in this complex scientific decision. As such, it draws upon my experience with issues of test development, standardization, norming procedures, and important psychometrics, namely, test reliability and validity. As I and my colleagues noted in an earlier publication, the major areas of psychological functioning, in the normal development of infants, children, adolescents, adults, and elderly people, include cognitive, academic, personality, and adaptive behaviors (Sparrow, Fletcher, & Cicchetti, 1985). As such, the major examples or applications discussed in this article derive primarily, although not exclusively, from these several areas of human functioning.

7,254 citations

Journal ArticleDOI
TL;DR: Algorithm sensitivities and specificities for autism and PDDNOS relative to nonspectrum disorders were excellent, with moderate differentiation of autism from PDDNOS.
Abstract: The Autism Diagnostic Observation Schedule-Generic (ADOS-G) is a semistructured, standardized assessment of social interaction, communication, play, and imaginative use of materials for individuals suspected of having autism spectrum disorders. The observational schedule consists of four 30-minute modules, each designed to be administered to different individuals according to their level of expressive language. Psychometric data are presented for 223 children and adults with Autistic Disorder (autism), Pervasive Developmental Disorder Not Otherwise Specified (PDDNOS) or nonspectrum diagnoses. Within each module, diagnostic groups were equivalent on expressive language level. Results indicate substantial interrater and test-retest reliability for individual items, excellent interrater reliability within domains and excellent internal consistency. Comparisons of means indicated consistent differentiation of autism and PDDNOS from nonspectrum individuals, with some, but less consistent, differentiation of autism from PDDNOS. A priori operationalization of DSM-IV/ICD-10 criteria, factor analyses, and ROC curves were used to generate diagnostic algorithms with thresholds set for autism and broader autism spectrum/PDD. Algorithm sensitivities and specificities for autism and PDDNOS relative to nonspectrum disorders were excellent, with moderate differentiation of autism from PDDNOS.

7,012 citations

Journal ArticleDOI
TL;DR: It is shown that it is feasible to develop a checklist that can be used to assess the methodological quality not only of randomised controlled trials but also of non-randomised studies, and that it is possible to produce a checklist that provides a profile of the paper, alerting reviewers to its particular methodological strengths and weaknesses.
Abstract: OBJECTIVE: To test the feasibility of creating a valid and reliable checklist with the following features: appropriate for assessing both randomised and non-randomised studies; provision of both an overall score for study quality and a profile of scores not only for the quality of reporting, internal validity (bias and confounding) and power, but also for external validity. DESIGN: A pilot version was first developed, based on epidemiological principles, reviews, and existing checklists for randomised studies. Face and content validity were assessed by three experienced reviewers and reliability was determined using two raters assessing 10 randomised and 10 non-randomised studies. Using different raters, the checklist was revised and tested for internal consistency (Kuder-Richardson 20), test-retest and inter-rater reliability (Spearman correlation coefficient and sign rank test; kappa statistics), criterion validity, and respondent burden. MAIN RESULTS: The performance of the checklist improved considerably after revision of a pilot version. The Quality Index had high internal consistency (KR-20: 0.89) as did the subscales apart from external validity (KR-20: 0.54). Test-retest (r = 0.88) and inter-rater (r = 0.75) reliability of the Quality Index were good. Reliability of the subscales varied from good (bias) to poor (external validity). The Quality Index correlated highly with an existing, established instrument for assessing randomised studies (r = 0.90). There was little difference between its performance with non-randomised and with randomised studies. Raters took about 20 minutes to assess each paper (range 10 to 45 minutes). CONCLUSIONS: This study has shown that it is feasible to develop a checklist that can be used to assess the methodological quality not only of randomised controlled trials but also non-randomised studies. It has also shown that it is possible to produce a checklist that provides a profile of the paper, alerting reviewers to its particular methodological strengths and weaknesses. Further work is required to improve the checklist and the training of raters in the assessment of external validity.
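Since the abstract reports internal consistency via Kuder-Richardson 20 (KR-20), here is a minimal Python sketch of that computation for dichotomously scored checklist items; the data and names are illustrative, not taken from the study.

```python
import numpy as np

def kr20(items: np.ndarray) -> float:
    """Kuder-Richardson 20 for a subjects-by-items matrix of 0/1 scores.

    KR-20 = k/(k-1) * (1 - sum(p_j * q_j) / var(total)),
    where p_j is the proportion scoring 1 on item j and q_j = 1 - p_j.
    """
    _, k = items.shape
    p = items.mean(axis=0)                      # per-item proportion of 1s
    item_var_sum = float(np.sum(p * (1 - p)))   # sum of item variances p_j * q_j
    total_var = float(items.sum(axis=1).var())  # variance of subjects' total scores
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Illustrative data: 20 "papers" scored on 10 dichotomous checklist items,
# correlated through a latent quality score so the items hang together.
rng = np.random.default_rng(0)
quality = rng.normal(size=(20, 1))
scores = (quality + rng.normal(size=(20, 10)) > 0).astype(int)
print(f"KR-20 = {kr20(scores):.2f}")
```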

6,849 citations

Journal Article
TL;DR: Items such as physical exam findings, radiographic interpretations, or other diagnostic tests often rely on some degree of subjective interpretation by observers; studies that measure the agreement between two or more observers should therefore include a statistic that takes into account the fact that observers will sometimes agree or disagree simply by chance.
Abstract: Items such as physical exam findings, radiographic interpretations, or other diagnostic tests often rely on some degree of subjective interpretation by observers. Studies that measure the agreement between two or more observers should include a statistic that takes into account the fact that observers will sometimes agree or disagree simply by chance. The kappa statistic (or kappa coefficient) is the most commonly used statistic for this purpose. A kappa of 1 indicates perfect agreement, whereas a kappa of 0 indicates agreement equivalent to chance. A limitation of kappa is that it is affected by the prevalence of the finding under observation. Methods to overcome this limitation have been described.
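A minimal Python sketch of the two points above, chance correction and the prevalence effect, using made-up 2x2 tables in which both observer pairs agree on 90% of cases:

```python
import numpy as np

def kappa(table: np.ndarray) -> float:
    """Cohen's kappa from a square observer-by-observer contingency table:
    kappa = (p_o - p_e) / (1 - p_e)."""
    p = table / table.sum()
    p_o = np.trace(p)                       # observed agreement (diagonal)
    p_e = p.sum(axis=1) @ p.sum(axis=0)     # chance agreement from the marginals
    return (p_o - p_e) / (1 - p_e)

# Same 90% raw agreement, different prevalence of the positive finding.
balanced = np.array([[45,  5], [ 5, 45]])
skewed   = np.array([[85,  5], [ 5,  5]])
print(f"balanced prevalence: kappa = {kappa(balanced):.2f}")  # 0.80
print(f"skewed prevalence:   kappa = {kappa(skewed):.2f}")    # ~0.44
```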

6,539 citations

References
Journal ArticleDOI
Jacob Cohen
TL;DR: In this article, the author presents a procedure for having two or more judges independently categorize a sample of units and determining the degree and significance of their agreement, i.e., the extent to which such judgments are reproducible (reliable).
Abstract: CONSIDER Table 1. It represents in its formal characteristics a situation which arises in the clinical-social-personality areas of psychology, where it frequently occurs that the only useful level of measurement obtainable is nominal scaling (Stevens, 1951, pp. 25-26), i.e. placement in a set of k unordered categories. Because the categorizing of the units is a consequence of some complex judgment process performed by a "two-legged meter" (Stevens, 1958), it becomes important to determine the extent to which these judgments are reproducible, i.e., reliable. The procedure which suggests itself is that of having two (or more) judges independently categorize a sample of units and determine the degree, significance, and ...

34,965 citations


"The measurement of observer agreeme..." refers background or methods in this paper

  • ...In particular, w1j represents the set of weights which generate the kappa measure of perfect agreement proposed in Cohen [1960]. The sequence of hierarchical kappa-type statistics within each of the two patient populations associated with the weights given in Table 2 can be expressed in the formulation (A....


  • ...example, the weights w2j in Table 8 are directly analogous to those discussed in Cohen [1968], Fleiss, Cohen and Everitt [1969] and Cicchetti [1972], which were used to generate weighted kappa and C statistics....


  • ..., Goodman and Kruskal [1954], Cohen [1960, 1968], Fleiss [1971], Light [1971], and Cicchetti [1972]....


  • ...Furthermore, as shown in Fleiss and Cohen [1973] and Fleiss [1975], K is directly analogous to the intraclass correlation coefficient obtained from ANOVA models for quantitative measurements and can be used as a measure of the reliability of multiple determinations on the same subjects....


Journal ArticleDOI
Jacob Cohen
TL;DR: The Kw provides for the incorporation of ratio-scaled degrees of disagreement (or agreement) to each of the cells of the k × k table of joint assignments.
Abstract: A previously described coefficient of agreement for nominal scales, kappa, treats all disagreements equally. A generalization to weighted kappa (Kw) is presented. The Kw provides for the incorporation of ratio-scaled degrees of disagreement (or agreement) to each of the cells of the k × k table of joint nominal scale assignments.
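A minimal Python sketch of a weighted kappa computation (the table and the linear/quadratic weighting schemes are illustrative; Cohen's formulation admits any ratio-scaled weights):

```python
import numpy as np

def weighted_kappa(table: np.ndarray, kind: str = "linear") -> float:
    """Weighted kappa: disagreement cells are penalized by |i - j| (linear)
    or (i - j)^2 (quadratic), rescaled so the diagonal gets weight 1."""
    k = table.shape[0]
    i, j = np.indices((k, k))
    d = np.abs(i - j) if kind == "linear" else (i - j) ** 2
    w = 1 - d / d.max()                 # agreement weights in [0, 1]
    p = table / table.sum()
    p_o = np.sum(w * p)                                       # weighted observed agreement
    p_e = np.sum(w * np.outer(p.sum(axis=1), p.sum(axis=0)))  # weighted chance agreement
    return (p_o - p_e) / (1 - p_e)

# Illustrative 3x3 table of joint assignments by two observers.
ratings = np.array([[20,  5,  1],
                    [ 4, 15,  6],
                    [ 1,  5, 13]])
print(f"linear-weighted kappa = {weighted_kappa(ratings):.2f}")
```

With quadratic weights, this is the variant that Fleiss and Cohen [1973] showed to be analogous to the intraclass correlation coefficient, as noted in the citation contexts above.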

7,604 citations

Journal ArticleDOI
TL;DR: In this paper, the basic theory of the analysis of variance is examined by considering several different mathematical models, including fixed-effects models with independent observations of equal variance as well as other models.
Abstract: Originally published in 1959, this classic volume has had a major impact on generations of statisticians. Newly issued in the Wiley Classics Series, the book examines the basic theory of analysis of variance by considering several different mathematical models. Part I looks at the theory of fixed-effects models with independent observations of equal variance, while Part II begins to explore the analysis of variance in the case of other models.

5,728 citations

Book
01 Jan 1971

3,429 citations