Journal ArticleDOI

Quality criteria were proposed for measurement properties of health status questionnaires

TL;DR: The criteria can be used in systematic reviews of health status questionnaires, to detect shortcomings and gaps in knowledge of measurement properties, and to design validation studies.
Abstract: Objectives: Recently, an increasing number of systematic reviews have been published in which the measurement properties of health status questionnaires are compared. For a meaningful comparison, quality criteria for measurement properties are needed. Our aim was to develop quality criteria for design, methods, and outcomes of studies on the development and evaluation of health status questionnaires. Study Design and Setting: Quality criteria for content validity, internal consistency, criterion validity, construct validity, reproducibility, longitudinal validity, responsiveness, floor and ceiling effects, and interpretability were derived from existing guidelines and consensus within our research group. Results: For each measurement property a criterion was defined for a positive, negative, or indeterminate rating, depending on the design, methods, and outcomes of the validation study. Conclusion: Our criteria make a substantial contribution toward defining explicit quality criteria for measurement properties of health status questionnaires. Our criteria can be used in systematic reviews of health status questionnaires, to detect shortcomings and gaps in knowledge of measurement properties, and to design validation studies. The future challenge will be to refine and complete the criteria and to reach broad consensus, especially on quality criteria for good measurement properties. 2006 Elsevier Inc. All rights reserved.

Summary (3 min read)

1. Introduction

  • The number of available health status questionnaires has increased dramatically over the past decades.
  • Within each of these attributes, specific criteria were defined by which instruments should be reviewed.
  • The authors' criteria can also be used to detect shortcomings and gaps in knowledge of measurement properties, and to design validation studies.
  • The authors' aim is to contribute to the development of explicit quality criteria for the design, methods, and outcomes of studies on the development and evaluation of health status questionnaires.

2. Content validity

  • Content validity examines the extent to which the concepts of interest are comprehensively represented by the items in the questionnaire [16].
  • The relevance of items may also depend on disease severity.
  • Items in the questionnaire must reflect areas that are important to the target population that is being studied.
  • This strategy, however, does not guarantee a better content validity, because a comprehensive set of items can also be achieved without item reduction.
  • The authors give a positive rating for content validity if a clear description is provided of the measurement aim, the target population, the concepts that are being measured, and the item selection.

3. Internal consistency

  • Internal consistency is a measure of the extent to which items in a questionnaire (sub)scale are correlated, thus measuring the same concept.
  • Internal consistency is an important measurement property for questionnaires that intend to measure a single underlying concept by using multiple items.
  • If there is no prior hypothesis regarding the dimensionality of a questionnaire, exploratory principal component analysis or factor analysis can be applied.
  • Rules of thumb vary from 4 to 10 subjects per variable, with a minimum of 100 subjects to ensure stability of the variance–covariance matrix [27].
  • Furthermore, a very high Cronbach’s alpha is usually found for scales with a large number of items, because Cronbach’s alpha is dependent upon the number of items in a scale.
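
The points above can be illustrated with a minimal sketch of Cronbach's alpha for one (sub)scale, using the standard formula based on item variances and the variance of the total score. The function names and the input layout (rows are respondents, columns are items) are assumptions made for illustration; the 0.70 to 0.95 range is the one quoted elsewhere on this page for a positive rating, and this sketch does not check whether a factor analysis was performed.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for one (sub)scale.

    scores: 2-D array-like, rows = respondents, columns = items
    (hypothetical layout chosen for this sketch).
    """
    x = np.asarray(scores, dtype=float)
    n_items = x.shape[1]
    item_variances = x.var(axis=0, ddof=1)      # sample variance of each item
    total_variance = x.sum(axis=1).var(ddof=1)  # sample variance of the sum score
    return n_items / (n_items - 1) * (1.0 - item_variances.sum() / total_variance)

def alpha_in_positive_range(alpha, low=0.70, high=0.95):
    """True if alpha lies in the range quoted for a positive rating."""
    return low <= alpha <= high
```

Because alpha grows with the number of items, a value above the upper bound on a long scale may point to redundant items rather than to better measurement.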

5. Construct validity

  • Construct validity refers to the extent to which scores on a particular instrument relate to other measures in a manner that is consistent with theoretically derived hypotheses concerning the concepts that are being measured [17,19].
  • Construct validity should be assessed by testing predefined hypotheses (e.g., about expected correlations between measures or expected differences in scores between “known” groups).
  • Without specific hypotheses, the risk of bias is high because retrospectively it is tempting to think up alternative explanations for low correlations instead of concluding that the questionnaire may not be valid.
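
Counting confirmed hypotheses, rather than judging them after the fact, can be sketched as follows. The thresholds (at least 75% of results in line with the hypotheses, in groups of at least 50 patients) are the ones quoted in the FAQ section of this page; the function name, argument names, and the decision to return a plain boolean are assumptions for illustration, not the authors' procedure.

```python
def construct_validity_supported(hypotheses_confirmed, n_patients,
                                 predefined=True,
                                 min_fraction=0.75, min_patients=50):
    """Check the quoted rule for a positive construct-validity rating.

    hypotheses_confirmed: list of booleans, one per predefined hypothesis
    (hypothetical input format for this sketch).
    """
    if not hypotheses_confirmed:
        return False  # nothing was tested, so nothing can be confirmed
    fraction = sum(hypotheses_confirmed) / len(hypotheses_confirmed)
    return (predefined
            and n_patients >= min_patients
            and fraction >= min_fraction)
```

For example, three confirmed hypotheses out of four in a sample of 60 patients meets the rule, whereas the same result in 40 patients does not.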

6. Reproducibility

  • Reproducibility concerns the degree to which repeated measurements in stable persons provide similar answers.
  • Reliability concerns the degree to which patients can be distinguished from each other, despite measurement error [19].
  • Reliability coefficients (intraclass correlation coefficients (ICC)) concern the variation in the population (interindividual variation) divided by the total variation, which is the interindividual variation plus the intraindividual variation (measurement error), expressed as a ratio between 0 and 1.
  • The time period between the repeated administrations should be long enough to prevent recall, though short enough to ensure that clinical change has not occurred.
  • Therefore, the authors do not rate the appropriateness of the time period, but only require that this time period is described and justified.
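
The verbal definition of the reliability coefficient given above can be written compactly. The symbols are assumptions introduced here: the first variance denotes the interindividual (between-person) variation and the second the intraindividual variation (measurement error).

```latex
\mathrm{ICC}
  = \frac{\sigma^{2}_{\text{inter}}}
         {\sigma^{2}_{\text{inter}} + \sigma^{2}_{\text{intra}}},
\qquad 0 \le \mathrm{ICC} \le 1
```

With a fixed measurement error, the ratio rises as the population becomes more heterogeneous, which is why reliability depends on the population studied.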

6.1. Agreement

  • The measurement error can be adequately expressed as the standard error of measurement (SEM) [30].
  • The SEM equals the square root of the error variance of an ANOVA analysis, either including systematic differences or excluding them.
  • Another adequate parameter of agreement is described by Bland and Altman [34].
  • For evaluative purposes, the absolute measurement error should be smaller than the minimal amount of change in the (sub)scale that is considered to be important (minimal important change (MIC)).
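
A rough sketch of the agreement parameters follows. The Bland and Altman limits of agreement are computed in the usual way, as the mean difference plus or minus 1.96 standard deviations of the differences. The conversion from SEM to the smallest detectable change (SDC = 1.96 × √2 × SEM for individuals, divided by √n for groups) is a commonly used convention that the summary above does not spell out, so it is an assumption here, as are all function names.

```python
import math
import numpy as np

def smallest_detectable_change(sem, n=1):
    """SDC from the SEM; n=1 gives the individual-level value.

    Assumes SDC_ind = 1.96 * sqrt(2) * SEM and SDC_group = SDC_ind / sqrt(n).
    """
    sdc_individual = 1.96 * math.sqrt(2) * sem
    return sdc_individual / math.sqrt(n)

def limits_of_agreement(test, retest):
    """Bland-Altman 95% limits of agreement for paired measurements."""
    d = np.asarray(retest, dtype=float) - np.asarray(test, dtype=float)
    mean_d = d.mean()
    sd_d = d.std(ddof=1)  # sample SD of the paired differences
    return mean_d - 1.96 * sd_d, mean_d + 1.96 * sd_d

def agreement_is_adequate(measurement_error, mic):
    """True if the measurement error (SDC or a limit of agreement)
    is smaller than the minimal important change (MIC)."""
    return measurement_error < mic
```

With an SEM of 1.0 scale points, the individual-level SDC is about 2.77 points, so an MIC of 3 points would be detectable in individuals while an MIC of 2 points would not.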

6.2. Reliability

  • The ICC is the most suitable and most commonly used reliability parameter for continuous measures.
  • Because systematic differences are considered to be part of the measurement error, ICCagreement (two-way random effects model, or ICC (A,1) according to McGraw and Wong [36]) is preferred.
  • Floor or ceiling effects are considered to be present if more than 15% of respondents achieved the lowest or highest possible score, respectively [41].
  • Investigators should provide information about what (change in) score would be clinically meaningful.
  • Various distribution-based and anchor-based methods have been proposed.
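
Two of the checks in this list can be sketched in code. The first function computes the single-measure, absolute-agreement ICC from two-way ANOVA mean squares, following the ICC(A,1) formula of McGraw and Wong; the second applies the 15% rule for floor and ceiling effects. The function names and the input layout (rows are persons, columns are repeated measurements) are assumptions for illustration.

```python
import numpy as np

def icc_agreement(scores):
    """ICC(A,1): two-way model, absolute agreement, single measurement.

    scores: 2-D array-like, rows = persons, columns = repeated measurements.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand_mean = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand_mean) ** 2).sum()  # between persons
    ss_cols = n * ((x.mean(axis=0) - grand_mean) ** 2).sum()  # between occasions
    ss_error = ((x - grand_mean) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    # Systematic differences between occasions (msc) enter the denominator,
    # so they lower the coefficient.
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def floor_ceiling_effects(total_scores, min_score, max_score, threshold=0.15):
    """Return (floor_present, ceiling_present) using the 15% rule."""
    s = np.asarray(total_scores, dtype=float)
    floor_fraction = np.mean(s == min_score)
    ceiling_fraction = np.mean(s == max_score)
    return bool(floor_fraction > threshold), bool(ceiling_fraction > threshold)
```

If every person scores exactly one point higher at retest, the ranking of persons is perfectly preserved, yet this coefficient drops below 1 because the systematic shift is counted as measurement error.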

10. Population-specific ratings of measurement properties

  • Each property is rated as positive, negative, or indeterminate, depending on the design, methods, and outcomes of the study.
  • Measurement properties differ between populations and settings.
  • The setting refers to the testing conditions, e.g., self-completed or interview, and language.
  • A clear description of the design of each individual study has to be provided, including population (diagnosis and clinical features, age, and gender); design (e.g., language version, time between the measurements, completion before or after treatment); testing conditions (e.g., questionnaires completed at home or in a waiting room, self-completed or in interview); and analyses of the data.
  • In addition, if any important methodological weakness in the design or execution of the study is found, e.g., selection bias or an extremely heterogeneous study population, the evaluated measurement properties are also rated as indeterminate.

11. Overview table

  • In the final comparison of the measurement properties of different questionnaires, one has to consider all ratings together when choosing between different questionnaires.
  • In Table 2 the results are presented from their systematic review of all questionnaires measuring disability in patients with shoulder complaints (because there is no gold standard for disability, criterion validity was not assessed) [1].
  • In Table 2 all ratings for each questionnaire are presented separately for each specific population or setting.
  • Construct validity and responsiveness were rated positively for outpatients (c), but rated as indeterminate for primary care patients (b) and hospital patients (d).
  • With this table one can make an evidence-based choice for the questionnaire with the best measurement properties, taking into account those measurement properties that are most important for a specific application (e.g., reliability when using a questionnaire for discrimination and responsiveness when using it for evaluation of a treatment effect) and the population and setting in which the questionnaire is going to be used.

12. Discussion

  • The authors developed quality criteria for the design, methods, and outcomes of studies on the development and evaluation of health status questionnaires.
  • Firstly, with their approach, poor quality questionnaires can be given positive ratings for some measurement properties.
  • Furthermore, poorly reported validation studies will lead to low ratings for questionnaires that are not necessarily poor in design or performance.
  • Practically all authors of instrument evaluation studies conclude that their instrument is valid, whereas objective counting of the number of hypotheses that were confirmed frequently indicates otherwise [1,4].
  • Different authors, however, have pointed to the conceptual and algebraic differences between different indices of responsiveness, showing that different indices may lead to different conclusions [53,54].

13. Future challenges

  • One might argue that their criteria are not discriminative enough to distinguish between good and very high-quality questionnaires.


Citations
Journal ArticleDOI
TL;DR: The aim was to clarify and standardize terminology and definitions of measurement properties by reaching consensus among a group of experts and to develop a taxonomy of measurement properties relevant for evaluating health instruments.
Abstract: Objective: Lack of consensus on taxonomy, terminology, and definitions has led to confusion about which measurement properties are relevant and which concepts they represent. The aim was to clarify and standardize terminology and definitions of measurement properties by reaching consensus among a group of experts and to develop a taxonomy of measurement properties relevant for evaluating health instruments. Study Design and Setting: An international Delphi study with four written rounds was performed. Participating experts had a background in epidemiology, statistics, psychology, and clinical medicine. The panel was asked to rate their (dis)agreement about proposals on a five-point scale. Consensus was considered to be reached when at least 67% of the panel agreed. Results: Of 91 invited experts, 57 agreed to participate and 43 actually participated. Consensus was reached on positions of measurement properties in the taxonomy (68–84%), terminology (74–88%, except for structural validity [56%]), and definitions of measurement properties (68–88%). The panel extensively discussed the positions of internal consistency and responsiveness in the taxonomy, the terms “reliability” and “structural validity,” and the definitions of internal consistency and reliability. Conclusions: Consensus on taxonomy, terminology, and definitions of measurement properties was reached. Hopefully, this will lead to a more uniform use of terms and definitions in the literature on measurement properties. 2010 Elsevier Inc. All rights reserved.

2,862 citations


Cites background from "Quality criteria were proposed for ..."

  • ...[15] consider internal consistency not as a subcategory of the domain reliability and defined it as ‘‘the extent to which items in a (sub)scale are intercorrelated, thus measuring the same construct....

    [...]

  • ...[15] and COSMIN focus on health status measurement....

    [...]

  • ...The SAC-MOS standards and the Terwee criteria, however, are not based on consensus among a large group of experts [13,15]....

    [...]

Journal ArticleDOI
TL;DR: The resulting COSMIN checklist could be useful when selecting a measurement instrument, peer-reviewing a manuscript, designing or reporting a study on measurement properties, or for educational purposes.
Abstract: Aim of the COSMIN study (COnsensus-based Standards for the selection of health status Measurement INstruments) was to develop a consensus-based checklist to evaluate the methodological quality of studies on measurement properties. We present the COSMIN checklist and the agreement of the panel on the items of the checklist. A four-round Delphi study was performed with international experts (psychologists, epidemiologists, statisticians and clinicians). Of the 91 invited experts, 57 agreed to participate (63%). Panel members were asked to rate their (dis)agreement with each proposal on a five-point scale. Consensus was considered to be reached when at least 67% of the panel members indicated ‘agree’ or ‘strongly agree’. Consensus was reached on the inclusion of the following measurement properties: internal consistency, reliability, measurement error, content validity (including face validity), construct validity (including structural validity, hypotheses testing and cross-cultural validity), criterion validity, responsiveness, and interpretability. The latter was not considered a measurement property. The panel also reached consensus on how these properties should be assessed. The resulting COSMIN checklist could be useful when selecting a measurement instrument, peer-reviewing a manuscript, designing or reporting a study on measurement properties, or for educational purposes.

2,772 citations


Cites background from "Quality criteria were proposed for ..."

  • ...Examples of such criteria were previously published by members of our group [6]....

    [...]

Journal ArticleDOI
Nichole D. Palmer, Caitrin W. McDonough, Pamela J. Hicks, B H Roh, and 381 more authors
04 Jan 2012-PLOS ONE
TL;DR: It is suggested that multiple loci underlie T2DM susceptibility in the African-American population and that these loci are distinct from those identified in other ethnic populations.
Abstract: African Americans are disproportionately affected by type 2 diabetes (T2DM) yet few studies have examined T2DM using genome-wide association approaches in this ethnicity. The aim of this study was to identify genes associated with T2DM in the African American population. We performed a Genome Wide Association Study (GWAS) using the Affymetrix 6.0 array in 965 African-American cases with T2DM and end-stage renal disease (T2DM-ESRD) and 1029 population-based controls. The most significant SNPs (n = 550 independent loci) were genotyped in a replication cohort and 122 SNPs (n = 98 independent loci) were further tested through genotyping three additional validation cohorts followed by meta-analysis in all five cohorts totaling 3,132 cases and 3,317 controls. Twelve SNPs had evidence of association in the GWAS (P<0.0071), were directionally consistent in the Replication cohort and were associated with T2DM in subjects without nephropathy (P<0.05). Meta-analysis in all cases and controls revealed a single SNP reaching genome-wide significance (P<2.5×10^-8). SNP rs7560163 (P = 7.0×10^-9, OR (95% CI) = 0.75 (0.67-0.84)) is located intergenically between RND3 and RBM43. Four additional loci (rs7542900, rs4659485, rs2722769 and rs7107217) were associated with T2DM (P<0.05) and reached more nominal levels of significance (P<2.5×10^-5) in the overall analysis and may represent novel loci that contribute to T2DM. We have identified novel T2DM-susceptibility variants in the African-American population. Notably, T2DM risk was associated with the major allele and implies an interesting genetic architecture in this population. These results suggest that multiple loci underlie T2DM susceptibility in the African-American population and that these loci are distinct from those identified in other ethnic populations.

1,957 citations


Cites background or methods from "Quality criteria were proposed for ..."

  • ...Responsiveness was measured as the area under the receiver operating characteristic (ROC) curve which indicates the probability of correctly identifying subjects who report improvement [27,30]....

    [...]

  • ...Reproducibility can be divided in agreement and reliability [27]....

    [...]

  • ...Absence of floor or ceiling effects indicates a good content validity [17,27]....

    [...]

Journal ArticleDOI
TL;DR: There is no current 'gold standard' amongst 15 measures of resilience, and a number of the scales are in the early stages of development, and all require further validation work.
Abstract: The evaluation of interventions and policies designed to promote resilience, and research to understand the determinants and associations, require reliable and valid measures to ensure data quality. This paper systematically reviews the psychometric rigour of resilience measurement scales developed for use in general and clinical populations. Eight electronic abstract databases and the internet were searched and reference lists of all identified papers were hand searched. The focus was to identify peer reviewed journal articles where resilience was a key focus and/or is assessed. Two authors independently extracted data and performed a quality assessment of the scale psychometric properties. Nineteen resilience measures were reviewed; four of these were refinements of the original measure. All the measures had some missing information regarding the psychometric properties. Overall, the Connor-Davidson Resilience Scale, the Resilience Scale for Adults and the Brief Resilience Scale received the best psychometric ratings. The conceptual and theoretical adequacy of a number of the scales was questionable. We found no current 'gold standard' amongst 15 measures of resilience. A number of the scales are in the early stages of development, and all require further validation work. Given increasing interest in resilience from major international funders, key policy makers and practice, researchers are urged to report relevant validation statistics when using the measures.

1,625 citations


Cites background or methods from "Quality criteria were proposed for ..."

  • ...Fundamental to the robustness of a methodological review are the quality criteria used to distinguish the measurement properties of a scale to enable a meaningful comparison [15]....

    [...]

  • ...content validity advocate that the target group should be involved with the item selection when measures are being developed[11,15]....

    [...]

  • ...In order to address known methodological weaknesses in the current evidence informing practice, this paper reports a methodological systematic review of resilience measurement scales, using published quality assessment criteria to evaluate psychometric properties[15]....

    [...]

Journal ArticleDOI
TL;DR: The authors developed guidelines for reporting interrater and intrarater reliability and agreement studies, proposing 15 issues that should be addressed when reporting such studies.
Abstract: Objective: Results of reliability and agreement studies are intended to provide information about the amount of error inherent in any diagnosis, score, or measurement. The level of reliability and agreement among users of scales, instruments, or classifications is widely unknown. Therefore, there is a need for rigorously conducted interrater and intrarater reliability and agreement studies. Information about sample selection, study design, and statistical analysis is often incomplete. Because of inadequate reporting, interpretation and synthesis of study results are often difficult. Widely accepted criteria, standards, or guidelines for reporting reliability and agreement in the health care and medical field are lacking. The objective was to develop guidelines for reporting reliability and agreement studies. Study Design and Setting: Eight experts in reliability and agreement investigation developed guidelines for reporting. Results: Fifteen issues that should be addressed when reliability and agreement are reported are proposed. The issues correspond to the headings usually used in publications. Conclusion: The proposed guidelines intend to improve the quality of reporting. 2011 Elsevier Inc. All rights reserved.

1,605 citations

References
Journal ArticleDOI
TL;DR: An alternative approach, based on graphical techniques and simple calculations, is described, together with the relation between this analysis and the assessment of repeatability.
Abstract: In clinical measurement comparison of a new measurement technique with an established one is often needed to see whether they agree sufficiently for the new to replace the old. Such investigations are often analysed inappropriately, notably by using correlation coefficients. The use of correlation is misleading. An alternative approach, based on graphical techniques and simple calculations, is described, together with the relation between this analysis and the assessment of repeatability.

43,884 citations


"Quality criteria were proposed for ..." refers methods in this paper

  • ...Another adequate parameter of agreement is described by Bland and Altman [34]....

    [...]

  • ...[34] Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement....

    [...]

  • ...[35] Altman DG. Practical statistics for medical research....

    [...]

  • ...In both cases, we consider a sample size of at least 50 patients adequate for the assessment of the agreement parameter, based on a general guideline by Altman [35]....

    [...]

Book
28 Apr 1989
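TL;DR: The book covers model notation, covariances, and path analysis; causality and causal models; structural equation models with observed variables; the consequences of measurement error; measurement models and confirmatory factor analysis; and the general model combining latent variable and measurement models, with extensions.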
Abstract: Model Notation, Covariances, and Path Analysis. Causality and Causal Models. Structural Equation Models with Observed Variables. The Consequences of Measurement Error. Measurement Models: The Relation Between Latent and Observed Variables. Confirmatory Factor Analysis. The General Model, Part I: Latent Variable and Measurement Models Combined. The General Model, Part II: Extensions. Appendices. Distribution Theory. References. Index.

19,019 citations

Book
15 Jun 2006
TL;DR: Practical Statistics for Medical Research is a problem-based text for medical researchers, medical students, and others in the medical arena who need to use statistics but have no specialized mathematics background.
Abstract: Most medical researchers, whether clinical or non-clinical, receive some background in statistics as undergraduates. However, it is most often brief, a long time ago, and largely forgotten by the time it is needed. Furthermore, many introductory texts fall short of adequately explaining the underlying concepts of statistics, and often are divorced from the reality of conducting and assessing medical research. Practical Statistics for Medical Research is a problem-based text for medical researchers, medical students, and others in the medical arena who need to use statistics but have no specialized mathematics background. The author draws on twenty years of experience as a consulting medical statistician to provide clear explanations to key statistical concepts, with a firm emphasis on practical aspects of designing and analyzing medical research. The text gives special attention to the presentation and interpretation of results and the many real problems that arise in medical research.

17,322 citations


"Quality criteria were proposed for ..." refers methods in this paper

  • ...Another adequate parameter of agreement is described by Bland and Altman [34]....

    [...]

  • ...[34] Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement....

    [...]

  • ...[35] Altman DG. Practical statistics for medical research....

    [...]

  • ...In both cases, we consider a sample size of at least 50 patients adequate for the assessment of the agreement parameter, based on a general guideline by Altman [35]....

    [...]

Journal ArticleDOI
TL;DR: An instrument to assess the quality of reports of randomized clinical trials (RCTs) in pain research is described, together with its use to determine the effect of rater blinding on the assessments of quality.
Abstract: It has been suggested that the quality of clinical trials should be assessed by blinded raters to limit the risk of introducing bias into meta-analyses and systematic reviews, and into the peer-review process. There is very little evidence in the literature to substantiate this. This study describes the development of an instrument to assess the quality of reports of randomized clinical trials (RCTs) in pain research and its use to determine the effect of rater blinding on the assessments of quality. A multidisciplinary panel of six judges produced an initial version of the instrument. Fourteen raters from three different backgrounds assessed the quality of 36 research reports in pain research, selected from three different samples. Seven were allocated randomly to perform the assessments under blind conditions. The final version of the instrument included three items. These items were scored consistently by all the raters regardless of background and could discriminate between reports from the different samples. Blind assessments produced significantly lower and more consistent scores than open assessments. The implications of this finding for systematic reviews, meta-analytic research and the peer-review process are discussed.

15,740 citations


"Quality criteria were proposed for ..." refers methods in this paper

  • ...We did not summarize the quality criteria into one overall quality score, as is often done in systematic reviews of randomized clinical trials [46]....

    [...]

Frequently Asked Questions (8)
Q1. What are the criteria needed to legitimize what the questionnaire is?

Explicit quality criteria for studies on the development and evaluation of health status questionnaires are needed to legitimize what the best questionnaire is. 

In both cases, the authors consider a sample size of at least 50 patients adequate for the assessment of the agreement parameter, based on a general guideline by Altman [35]. 

The authors believe that systematic differences should be considered part of the measurement error, because the authors want to distinguish them from “real” changes, e.g., due to treatment. 

Because the number of health status questionnaires is rapidly growing, choosing the right questionnaire for a specific purpose becomes a time-consuming and difficult task. 

The authors therefore give a positive rating for construct validity if hypotheses are specified in advance and at least 75% of the results are in correspondence with these hypotheses, in (sub)groups of at least 50 patients. 

With this table one can make an evidence-based choice for the questionnaire with the best measurement properties, taking into account those measurement properties that are most important for a specific application (e.g., reliability when using a questionnaire for discrimination and responsiveness when using it for evaluation of a treatment effect) and the population and setting in which the questionnaire is going to be used. 

The authors give a positive rating for internal consistency when factor analysis was applied and Cronbach’s alpha is between 0.70 and 0.95. 

The authors give a positive rating for agreement if the SDC (SDCind for application in individuals and SDCgroup for use in groups) or the limits of agreement (upper or lower limit, depending on whether the interest is in improvement or deterioration) are smaller than the MIC. 
