Journal ArticleDOI

The measurement of observer agreement for categorical data

01 Mar 1977 - Biometrics - Vol. 33, Iss. 1, pp. 159-174
TL;DR: A general statistical methodology for the analysis of multivariate categorical data arising from observer reliability studies is presented: tests for interobserver bias are given in terms of first-order marginal homogeneity, and measures of interobserver agreement are developed as generalized kappa-type statistics.
Abstract: This paper presents a general statistical methodology for the analysis of multivariate categorical data arising from observer reliability studies. The procedure essentially involves the construction of functions of the observed proportions which are directed at the extent to which the observers agree among themselves and the construction of test statistics for hypotheses involving these functions. Tests for interobserver bias are presented in terms of first-order marginal homogeneity and measures of interobserver agreement are developed as generalized kappa-type statistics. These procedures are illustrated with a clinical diagnosis example from the epidemiological literature.
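For orientation (standard background, not a claim quoted from the abstract), kappa-type agreement measures share the chance-corrected form

```latex
\[
\hat{\kappa} \;=\; \frac{\hat{\theta}_o - \hat{\theta}_e}{1 - \hat{\theta}_e}
\]
% \hat{\theta}_o : observed proportion of agreement among the observers
% \hat{\theta}_e : proportion of agreement expected by chance alone
```

so perfect agreement yields kappa = 1 and chance-level agreement yields kappa = 0; the generalized kappa-type statistics developed in the paper build on this form.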
Citations
Journal ArticleDOI
TL;DR: Results suggest the K-SADS-PL generates reliable and valid child psychiatric diagnoses.
Abstract: Objective To describe the psychometric properties of the Schedule for Affective Disorders and Schizophrenia for School-Age Children-Present and Lifetime version (K-SADS-PL) interview, which surveys additional disorders not assessed in prior K-SADS, contains improved probes and anchor points, includes diagnosis-specific impairment ratings, generates DSM-III-R and DSM-IV diagnoses, and divides symptoms surveyed into a screening interview and five diagnostic supplements. Method Subjects were 55 psychiatric outpatients and 11 normal controls (aged 7 through 17 years). Both parents and children were used as informants. Concurrent validity of the screen criteria and the K-SADS-PL diagnoses was assessed against standard self-report scales. Interrater (n = 15) and test-retest (n = 20) reliability data were also collected (mean retest interval: 18 days; range: 2 to 38 days). Results Rating scale data support the concurrent validity of screens and K-SADS-PL diagnoses. Interrater agreement in scoring screens and diagnoses was high (range: 93% to 100%). Test-retest reliability κ coefficients were in the excellent range for present and/or lifetime diagnoses of major depression, any bipolar, generalized anxiety, conduct, and oppositional defiant disorder (.77 to 1.00) and in the good range for present diagnoses of posttraumatic stress disorder and attention-deficit hyperactivity disorder (.63 to .67). Conclusion Results suggest the K-SADS-PL generates reliable and valid child psychiatric diagnoses. J. Am. Acad. Child Adolesc. Psychiatry, 1997, 36(7): 980–988.

8,742 citations

Journal ArticleDOI
TL;DR: In this paper, the authors provide criteria, guidelines, and simple rules of thumb to assist the clinician faced with the challenge of choosing an appropriate test instrument for a given psychological assessment.
Abstract: In the context of the development of prototypic assessment instruments in the areas of cognition, personality, and adaptive functioning, the issues of standardization, norming procedures, and the important psychometrics of test reliability and validity are evaluated critically. Criteria, guidelines, and simple rules of thumb are provided to assist the clinician faced with the challenge of choosing an appropriate test instrument for a given psychological assessment. Clinicians are often faced with the critical challenge of choosing the most appropriate available test instrument for a given psychological assessment of a child, adolescent, or adult of a particular age, gender, and class of disability. It is the purpose of this report to provide some criteria, guidelines, or simple rules of thumb to aid in this complex scientific decision. As such, it draws upon my experience with issues of test development, standardization, norming procedures, and important psychometrics, namely, test reliability and validity. As I and my colleagues noted in an earlier publication, the major areas of psychological functioning, in the normal development of infants, children, adolescents, adults, and elderly people, include cognitive, academic, personality, and adaptive behaviors (Sparrow, Fletcher, & Cicchetti, 1985). As such, the major examples or applications discussed in this article derive primarily, although not exclusively, from these several areas of human functioning.

7,254 citations

Journal ArticleDOI
TL;DR: Algorithm sensitivities and specificities for autism and PDDNOS relative to nonspectrum disorders were excellent, with moderate differentiation of autism from PDDNOS.
Abstract: The Autism Diagnostic Observation Schedule-Generic (ADOS-G) is a semistructured, standardized assessment of social interaction, communication, play, and imaginative use of materials for individuals suspected of having autism spectrum disorders. The observational schedule consists of four 30-minute modules, each designed to be administered to different individuals according to their level of expressive language. Psychometric data are presented for 223 children and adults with Autistic Disorder (autism), Pervasive Developmental Disorder Not Otherwise Specified (PDDNOS) or nonspectrum diagnoses. Within each module, diagnostic groups were equivalent on expressive language level. Results indicate substantial interrater and test-retest reliability for individual items, excellent interrater reliability within domains and excellent internal consistency. Comparisons of means indicated consistent differentiation of autism and PDDNOS from nonspectrum individuals, with some, but less consistent, differentiation of autism from PDDNOS. A priori operationalization of DSM-IV/ICD-10 criteria, factor analyses, and ROC curves were used to generate diagnostic algorithms with thresholds set for autism and broader autism spectrum/PDD. Algorithm sensitivities and specificities for autism and PDDNOS relative to nonspectrum disorders were excellent, with moderate differentiation of autism from PDDNOS.

7,012 citations

Journal ArticleDOI
TL;DR: It is shown that it is feasible to develop a checklist that can be used to assess the methodological quality not only of randomised controlled trials but also of non-randomised studies, and that it is possible to produce a checklist that provides a profile of the paper, alerting reviewers to its particular methodological strengths and weaknesses.
Abstract: OBJECTIVE: To test the feasibility of creating a valid and reliable checklist with the following features: appropriate for assessing both randomised and non-randomised studies; provision of both an overall score for study quality and a profile of scores not only for the quality of reporting, internal validity (bias and confounding) and power, but also for external validity. DESIGN: A pilot version was first developed, based on epidemiological principles, reviews, and existing checklists for randomised studies. Face and content validity were assessed by three experienced reviewers and reliability was determined using two raters assessing 10 randomised and 10 non-randomised studies. Using different raters, the checklist was revised and tested for internal consistency (Kuder-Richardson 20), test-retest and inter-rater reliability (Spearman correlation coefficient and sign rank test; kappa statistics), criterion validity, and respondent burden. MAIN RESULTS: The performance of the checklist improved considerably after revision of a pilot version. The Quality Index had high internal consistency (KR-20: 0.89) as did the subscales apart from external validity (KR-20: 0.54). Test-retest (r = 0.88) and inter-rater (r = 0.75) reliability of the Quality Index were good. Reliability of the subscales varied from good (bias) to poor (external validity). The Quality Index correlated highly with an existing, established instrument for assessing randomised studies (r = 0.90). There was little difference between its performance with non-randomised and with randomised studies. Raters took about 20 minutes to assess each paper (range 10 to 45 minutes). CONCLUSIONS: This study has shown that it is feasible to develop a checklist that can be used to assess the methodological quality not only of randomised controlled trials but also non-randomised studies. It has also shown that it is possible to produce a checklist that provides a profile of the paper, alerting reviewers to its particular methodological strengths and weaknesses. Further work is required to improve the checklist and the training of raters in the assessment of external validity.
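Since the abstract reports internal consistency via Kuder-Richardson 20 (KR-20), here is a minimal Python sketch of that computation for dichotomously scored checklist items; the data and names are illustrative, not taken from the study.

```python
import numpy as np

def kr20(items: np.ndarray) -> float:
    """Kuder-Richardson 20 for a subjects-by-items matrix of 0/1 scores.

    KR-20 = k/(k-1) * (1 - sum(p_j * q_j) / var(total)),
    where p_j is the proportion scoring 1 on item j and q_j = 1 - p_j.
    """
    _, k = items.shape
    p = items.mean(axis=0)                      # per-item proportion of 1s
    item_var_sum = float(np.sum(p * (1 - p)))   # sum of item variances p_j * q_j
    total_var = float(items.sum(axis=1).var())  # variance of subjects' total scores
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Illustrative data: 20 "papers" scored on 10 dichotomous checklist items,
# correlated through a latent quality score so the items hang together.
rng = np.random.default_rng(0)
quality = rng.normal(size=(20, 1))
scores = (quality + rng.normal(size=(20, 10)) > 0).astype(int)
print(f"KR-20 = {kr20(scores):.2f}")
```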

6,849 citations

Journal Article
TL;DR: Items such as physical exam findings, radiographic interpretations, or other diagnostic tests often rely on some degree of subjective interpretation by observers; studies that measure the agreement between two or more observers should therefore include a statistic that takes into account the fact that observers will sometimes agree or disagree simply by chance.
Abstract: Items such as physical exam findings, radiographic interpretations, or other diagnostic tests often rely on some degree of subjective interpretation by observers. Studies that measure the agreement between two or more observers should include a statistic that takes into account the fact that observers will sometimes agree or disagree simply by chance. The kappa statistic (or kappa coefficient) is the most commonly used statistic for this purpose. A kappa of 1 indicates perfect agreement, whereas a kappa of 0 indicates agreement equivalent to chance. A limitation of kappa is that it is affected by the prevalence of the finding under observation. Methods to overcome this limitation have been described.
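A minimal Python sketch of the two points above, chance correction and the prevalence effect, using made-up 2x2 tables in which both observer pairs agree on 90% of cases:

```python
import numpy as np

def kappa(table: np.ndarray) -> float:
    """Cohen's kappa from a square observer-by-observer contingency table:
    kappa = (p_o - p_e) / (1 - p_e)."""
    p = table / table.sum()
    p_o = np.trace(p)                       # observed agreement (diagonal)
    p_e = p.sum(axis=1) @ p.sum(axis=0)     # chance agreement from the marginals
    return (p_o - p_e) / (1 - p_e)

# Same 90% raw agreement, different prevalence of the positive finding.
balanced = np.array([[45,  5], [ 5, 45]])
skewed   = np.array([[85,  5], [ 5,  5]])
print(f"balanced prevalence: kappa = {kappa(balanced):.2f}")  # 0.80
print(f"skewed prevalence:   kappa = {kappa(skewed):.2f}")    # ~0.44
```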

6,539 citations

References
Journal ArticleDOI
Jacob Cohen
TL;DR: In this article, the author presents a procedure for having two or more judges independently categorize a sample of units and determining the degree and significance of their agreement, i.e., the extent to which such judgments are reproducible (reliable).
Abstract: CONSIDER Table 1. It represents in its formal characteristics a situation which arises in the clinical-social-personality areas of psychology, where it frequently occurs that the only useful level of measurement obtainable is nominal scaling (Stevens, 1951, pp. 25-26), i.e. placement in a set of k unordered categories. Because the categorizing of the units is a consequence of some complex judgment process performed by a "two-legged meter" (Stevens, 1958), it becomes important to determine the extent to which these judgments are reproducible, i.e., reliable. The procedure which suggests itself is that of having two (or more) judges independently categorize a sample of units and determine the degree, significance, and ...

34,965 citations


"The measurement of observer agreeme..." refers background or methods in this paper

  • ...In particular, w1j represents the set of weights which generate the kappa measure of perfect agreement proposed in Cohen [1960]. The sequence of hierarchical kappa-type statistics within each of the two patient populations associated with the weights given in Table 2 can be expressed in the formulation (A....


  • ...example, the weights w2j in Table 8 are directly analogous to those discussed in Cohen [1968], Fleiss, Cohen and Everitt [1969] and Cicchetti [1972], which were used to generate weighted kappa and C statistics....


  • ..., Goodman and Kruskal [1954], Cohen [1960, 1968], Fleiss [1971], Light [1971], and Cicchetti [1972]....


  • ...Furthermore, as shown in Fleiss and Cohen [1973] and Fleiss [1975], K is directly analogous to the intraclass correlation coefficient obtained from ANOVA models for quantitative measurements and can be used as a measure of the reliability of multiple determinations on the same subjects....


Journal ArticleDOI
Jacob Cohen
TL;DR: The Kw provides for the incorporation of ratio-scaled degrees of disagreement (or agreement) to each of the cells of the k × k table of joint assignments.
Abstract: A previously described coefficient of agreement for nominal scales, kappa, treats all disagreements equally. A generalization to weighted kappa (Kw) is presented. The Kw provides for the incorporation of ratio-scaled degrees of disagreement (or agreement) to each of the cells of the k × k table of joint nominal scale assignments.
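A minimal Python sketch of a weighted kappa computation (the table and the linear/quadratic weighting schemes are illustrative; Cohen's formulation admits any ratio-scaled weights):

```python
import numpy as np

def weighted_kappa(table: np.ndarray, kind: str = "linear") -> float:
    """Weighted kappa: disagreement cells are penalized by |i - j| (linear)
    or (i - j)^2 (quadratic), rescaled so the diagonal gets weight 1."""
    k = table.shape[0]
    i, j = np.indices((k, k))
    d = np.abs(i - j) if kind == "linear" else (i - j) ** 2
    w = 1 - d / d.max()                 # agreement weights in [0, 1]
    p = table / table.sum()
    p_o = np.sum(w * p)                                       # weighted observed agreement
    p_e = np.sum(w * np.outer(p.sum(axis=1), p.sum(axis=0)))  # weighted chance agreement
    return (p_o - p_e) / (1 - p_e)

# Illustrative 3x3 table of joint assignments by two observers.
ratings = np.array([[20,  5,  1],
                    [ 4, 15,  6],
                    [ 1,  5, 13]])
print(f"linear-weighted kappa = {weighted_kappa(ratings):.2f}")
```

With quadratic weights, this is the variant that Fleiss and Cohen [1973] showed to be analogous to the intraclass correlation coefficient, as noted in the citation contexts above.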

7,604 citations

Journal ArticleDOI
TL;DR: In this paper, the basic theory of the analysis of variance is examined by considering several different mathematical models, including fixed-effects models with independent observations of equal variance as well as other models.
Abstract: Originally published in 1959, this classic volume has had a major impact on generations of statisticians. Newly issued in the Wiley Classics Series, the book examines the basic theory of analysis of variance by considering several different mathematical models. Part I looks at the theory of fixed-effects models with independent observations of equal variance, while Part II begins to explore the analysis of variance in the case of other models.

5,728 citations

Book
01 Jan 1971

3,429 citations