Sources of unreliability in depression ratings.

doi:10.1097/JCP.0B013E318192E4D7

Journal Article

Kenneth A. Kobak, +6 more

- 01 Feb 2009 -

Journal of Clinical Psychopharmacology

- Vol. 29, Iss: 1, pp 82-85

TLDR

Experienced but uncalibrated raters should focus on establishing common conventions, whereas experienced and calibrated raters should focus on fine-tuning judgment calls on different symptom thresholds. Calibration training seems to improve reliability over experience alone.

Abstract:

Background: Good interrater reliability is essential to minimize error variance and improve study power. Reasons why raters differ in scoring the same patient include information variance (different information obtained because of asking different questions), observation variance (the same information is obtained, but raters differ in what they notice and remember), interpretation variance (differences in the significance attached to what is observed), criterion variance (different criteria used to score items), and subject variance (true differences in the subject). We videotaped and transcribed 30 pairs of interviews to examine the most common sources of rater unreliability.

Method: Thirty patients who experienced depression were independently interviewed by 2 different raters on the same day. Raters provided rationales for their scoring, and independent assessors reviewed the rationales, the interview transcripts, and the videotapes to code the main reason for each discrepancy. One third of the interviews were conducted by raters who had not administered the Hamilton Depression Rating Scale before; one third, by raters who were experienced but not calibrated; and one third, by experienced and calibrated raters.

Results: Experienced and calibrated raters had the highest interrater reliability (intraclass correlation [ICC]; r = 0.93) followed by inexperienced raters (r = 0.77) and experienced but uncalibrated raters (r = 0.55). The most common reason for disagreement was interpretation variance (39%), followed by information variance (30%), criterion variance (27%), and observation variance (4%). Experienced and calibrated raters had significantly less criterion variance than the other cohorts (P = 0.001).

Conclusions: Reasons for disagreement varied by level of experience and calibration. Experienced and uncalibrated raters should focus on establishing common conventions, whereas experienced and calibrated raters should focus on fine tuning judgment calls on different thresholds of symptoms. Calibration training seems to improve reliability over experience alone. Experienced raters without cohort calibration had lower reliability than inexperienced raters.
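The abstract reports intraclass correlations but does not say which ICC form was used. As a rough illustration only, the sketch below computes a one-way random-effects ICC, often written ICC(1,1), for paired total scores. The choice of form, the function name `icc_oneway`, and the example scores are all assumptions made for illustration; none of them come from the study.

```python
from statistics import mean


def icc_oneway(ratings):
    """One-way random-effects ICC(1,1).

    `ratings` is a list with one entry per subject; each entry holds the
    k scores that different raters gave that subject. This form is an
    assumption: the abstract does not state which ICC was computed.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need >= 2 subjects, each with the same k >= 2 ratings")

    grand_mean = mean(x for row in ratings for x in row)
    subject_means = [mean(row) for row in ratings]

    # Between-subject and within-subject sums of squares (one-way ANOVA).
    ss_between = k * sum((m - grand_mean) ** 2 for m in subject_means)
    ss_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, subject_means) for x in row
    )

    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (ms_between - ms_within) / denominator


# Hypothetical total scores from two raters for five patients.
example_pairs = [(18, 20), (25, 24), (12, 15), (30, 28), (22, 22)]
print(round(icc_oneway(example_pairs), 2))
```

In this form, disagreement between the two raters of a patient inflates the within-subject mean square and pulls the coefficient down, which is the sense in which the sources of variance listed in the abstract reduce reliability.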

Citations

Is it valid to measure suicidal ideation by depression rating scales

Feasibility and Validation of a Computer-Automated Columbia-Suicide Severity Rating Scale Using Interactive Voice Response Technology

Placebo-related effects in clinical trials in schizophrenia: what is driving this phenomenon and what can be done to minimize it?

The Computerized Adaptive Diagnostic Test for Major Depressive Disorder (CAD-MDD): A Screening Tool for Depression

Inter-rater agreement in evaluation of disability: systematic review of reproducibility studies
