Journal ArticleDOI

Quantifying test-retest reliability using the intraclass correlation coefficient and the SEM.

01 Feb 2005-Journal of Strength and Conditioning Research (J Strength Cond Res)-Vol. 19, Iss: 1, pp 231-240
TL;DR: In this review, the basics of classic reliability theory are addressed in the context of choosing and interpreting an ICC, and it is shown how the SEM and its variants can be used to construct confidence intervals for individual scores and to determine the minimal difference an individual must exhibit before one can be confident that a true change in performance has occurred.
Abstract: Reliability, the consistency of a test or measurement, is frequently quantified in the movement sciences literature. A common metric is the intraclass correlation coefficient (ICC). In addition, the SEM, which can be calculated from the ICC, is also frequently reported in reliability studies. However, there are several versions of the ICC, and confusion exists in the movement sciences regarding which ICC to use. Further, the utility of the SEM is not fully appreciated. In this review, the basics of classic reliability theory are addressed in the context of choosing and interpreting an ICC. The primary distinction between ICC equations is argued to be one concerning the inclusion (equations 2,1 and 2,k) or exclusion (equations 3,1 and 3,k) of systematic error in the denominator of the ICC equation. Inferential tests of mean differences, which are performed in the process of deriving the necessary variance components for the calculation of ICC values, are useful to determine if systematic error is present. If so, the measurement schedule should be modified (removing trials where learning and/or fatigue effects are present) to remove systematic error, and ICC equations that only consider random error may be safely used. The use of ICC values is discussed in the context of estimating the effects of measurement error on sample size, statistical power, and correlation attenuation. Finally, calculation and application of the SEM are discussed. It is shown how the SEM and its variants can be used to construct confidence intervals for individual scores and to determine the minimal difference needed to be exhibited for one to be confident that a true change in performance of an individual has occurred.
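The quantities named in the abstract can be sketched numerically. The snippet below (the data and function names are invented for illustration, not taken from the paper) derives ICC(2,1) from two-way ANOVA mean squares, takes SEM = √MS_error as the review recommends, and builds the minimal difference MD = SEM × 1.96 × √2:

```python
import numpy as np

def icc_2_1(data):
    """ICC(2,1) from a two-way random-effects ANOVA table.

    data: n_subjects x k_trials array of scores.
    Returns (ICC, MS_error)."""
    n, k = data.shape
    grand = data.mean()
    ss_rows = k * ((data.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((data.mean(axis=0) - grand) ** 2).sum()   # between trials
    ss_err = ((data - grand) ** 2).sum() - ss_rows - ss_cols
    ms_r = ss_rows / (n - 1)
    ms_c = ss_cols / (k - 1)
    ms_e = ss_err / ((n - 1) * (k - 1))
    # Systematic (trial) variance stays in the denominator for ICC(2,1).
    icc = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)
    return icc, ms_e

# Hypothetical strength scores: 5 subjects x 3 trials.
scores = np.array([[146, 148, 170],
                   [148, 152, 156],
                   [170, 166, 168],
                   [ 90,  93,  94],
                   [157, 154, 160]], dtype=float)

icc, ms_e = icc_2_1(scores)
sem = np.sqrt(ms_e)            # SEM = sqrt(MS_error)
md = sem * 1.96 * np.sqrt(2)   # minimal difference for a real change (95%)
```

A subject would need to change by more than `md` between tests before one could be 95% confident the change exceeds measurement error.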
Citations
Journal ArticleDOI
TL;DR: This primer will equip both scientists and practitioners to understand the ontology and methodology of scale development and validation, thereby facilitating the advancement of the understanding of a range of health, social, and behavioral outcomes.
Abstract: Scale development and validation are critical to much of the work in the health, social, and behavioral sciences. However, the constellation of techniques required for scale development and evaluation can be onerous, jargon-filled, unfamiliar, and resource-intensive. Further, it is often not a part of graduate training. Therefore, our goal was to concisely review the process of scale development in as straightforward a manner as possible, both to facilitate the development of new, valid, and reliable scales, and to help improve existing ones. To do this, we have created a primer for best practices for scale development in measuring complex phenomena. This is not a systematic review, but rather the amalgamation of technical literature and lessons learned from our experiences spent creating or adapting a number of scales over the past several decades. We identified three phases that span nine steps. In the first phase, items are generated and the validity of their content is assessed. In the second phase, the scale is constructed. Steps in scale construction include pre-testing the questions, administering the survey, reducing the number of items, and understanding how many factors the scale captures. In the third phase, scale evaluation, the number of dimensions is tested, reliability is tested, and validity is assessed. We have also added examples of best practices to each step. In sum, this primer will equip both scientists and practitioners to understand the ontology and methodology of scale development and validation, thereby facilitating the advancement of our understanding of a range of health, social, and behavioral outcomes.

1,523 citations

Journal ArticleDOI
TL;DR: The present work systematically evaluated the test-retest reliability of TC-GICA derived RSFC measures over the short term (<45 min) and long term (5-16 months) and found moderate-to-high short- and long-term test-retest reliability.

741 citations

Journal ArticleDOI
TL;DR: The perceived stress scale (PSS-10) showed adequate reliability and validity in a population of Brazilian adults; exploratory factor analysis revealed two factors with eigenvalues greater than 1.0, supporting its use.
Abstract: The perceived stress scale (PSS-10) reliability and validity were evaluated in Brazilian adults. A two-stage translation procedure was employed to achieve a Portuguese version. Participants were 793 Brazilian university teachers. The exploratory factor analysis showed two factors with eigenvalues greater than 1.0 (56.8% of variance). The Cronbach's alpha coefficients were 0.83 (Factor 1), 0.77 (Factor 2) and 0.87 (Total Score). The test-retest reliability scores were 0.83 (Factor 1), 0.68 (Factor 2) and 0.86 (Total Score). PSS-10 and perceived health correlations ranged from -0.22 to -0.35. The PSS-10 showed an adequate reliability and validity supporting its use in this population.
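The internal-consistency figures reported above follow Cronbach's formula α = k/(k−1) · (1 − Σ s²_item / s²_total). A minimal sketch, with invented Likert-style responses rather than the study's data:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an n_respondents x k_items score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the total score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-point Likert responses: 5 respondents x 3 items.
responses = np.array([[4, 4, 5],
                      [2, 2, 3],
                      [5, 4, 5],
                      [1, 2, 1],
                      [3, 3, 4]], dtype=float)

alpha = cronbach_alpha(responses)
```

Because the invented respondents answer the three items consistently, alpha comes out high; uncorrelated items would drive it toward zero.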

509 citations


Cites methods from "Quantifying test-retest reliability..."


  • ...The test–retest reliability for Perceived Stress was good, according the recommendation elsewhere (Weir, 2005); however, as other studies have reported only Pearson-product correlation (Cohen et al., 1983; Cole, 1999), comparisons are limited....


Journal ArticleDOI
TL;DR: The studies reviewed show that bipedal static COP measures may be used as a reliable tool for investigating general postural stability and balance performance under specific conditions and recommendations for maximizing the reliability of COP data are provided.

500 citations


Cites background from "Quantifying test-retest reliability..."

  • ...The issue with the described heterogeneity of the chosen ICC models is that, depending on the data, different models are likely to yield varying results [20]....


  • ...application in test–retest reliability studies is often discouraged for its inability to detect systematic error [20]....


  • ...Conversely, even in the presence of low inter-participant variability, small test–retest variations may cause low ICC values [20,21]....


Journal ArticleDOI
TL;DR: Investigations of feature repeatability and reproducibility are currently limited to a small number of cancer types and there was no emergent consensus regarding either shape metrics or textural features; however, coarseness and contrast appeared among the least reproducible features.
Abstract: Purpose An ever-growing number of predictive models used to inform clinical decision making have included quantitative, computer-extracted imaging biomarkers, or “radiomic features.” Broadly generalizable validity of radiomics-assisted models may be impeded by concerns about reproducibility. We offer a qualitative synthesis of 41 studies that specifically investigated the repeatability and reproducibility of radiomic features, derived from a systematic review of published peer-reviewed literature. Methods and Materials The PubMed electronic database was searched using combinations of the broad Haynes and Ingui filters along with a set of text words specific to cancer, radiomics (including texture analyses), reproducibility, and repeatability. This review has been reported in compliance with Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines. From each full-text article, information was extracted regarding cancer type, class of radiomic feature examined, reporting quality of key processing steps, and statistical metric used to segregate stable features. Results Among 624 unique records, 41 full-text articles were subjected to review. The studies primarily addressed non-small cell lung cancer and oropharyngeal cancer. Only 7 studies addressed in detail every methodologic aspect related to image acquisition, preprocessing, and feature extraction. The repeatability and reproducibility of radiomic features are sensitive at various degrees to processing details such as image acquisition settings, image reconstruction algorithm, digital image preprocessing, and software used to extract radiomic features. First-order features were overall more reproducible than shape metrics and textural features. Entropy was consistently reported as one of the most stable first-order features. There was no emergent consensus regarding either shape metrics or textural features; however, coarseness and contrast appeared among the least reproducible. 
Conclusions Investigations of feature repeatability and reproducibility are currently limited to a small number of cancer types. Reporting quality could be improved regarding details of feature extraction software, digital image manipulation (preprocessing), and the cutoff value used to distinguish stable features.

493 citations


Cites background from "Quantifying test-retest reliability..."

  • ...The ICC metric (69) is appropriate where one expects strong correlation within a given class but weak correlation between classes, and it was most commonly reported in reproducibility experiments....


References
Journal ArticleDOI
TL;DR: An alternative approach, based on graphical techniques and simple calculations, is described, together with the relation between this analysis and the assessment of repeatability.

43,884 citations

Journal ArticleDOI
TL;DR: In this article, the authors present guidelines for choosing among six different forms of the intraclass correlation for reliability studies in which n targets are rated by k judges, and the confidence intervals for each of the forms are reviewed.
Abstract: Reliability coefficients often take the form of intraclass correlation coefficients. In this article, guidelines are given for choosing among six different forms of the intraclass correlation for reliability studies in which n targets are rated by k judges. Relevant to the choice of the coefficient are the appropriate statistical model for the reliability study and the application to be made of the reliability results. Confidence intervals for each of the forms are reviewed.
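The practical distinction among these forms, the one emphasized in the main article above, is whether systematic (trial-to-trial) variance stays in the ICC denominator. A sketch with invented data containing a deliberate learning effect, so the two single-measure forms diverge (function and variable names are my own):

```python
import numpy as np

def anova_ms(data):
    """Two-way ANOVA mean squares for an n_subjects x k_trials table."""
    n, k = data.shape
    grand = data.mean()
    ss_rows = k * ((data.mean(axis=1) - grand) ** 2).sum()  # subjects
    ss_cols = n * ((data.mean(axis=0) - grand) ** 2).sum()  # trials
    ss_err = ((data - grand) ** 2).sum() - ss_rows - ss_cols
    return (ss_rows / (n - 1),
            ss_cols / (k - 1),
            ss_err / ((n - 1) * (k - 1)))

# Hypothetical scores: every subject improves ~2 units on trial 2
# (a systematic learning effect).
data = np.array([[10, 12],
                 [14, 15],
                 [ 9, 11],
                 [16, 18],
                 [12, 13]], dtype=float)
n, k = data.shape
ms_r, ms_c, ms_e = anova_ms(data)

# ICC(2,1): trial (systematic) variance kept in the denominator.
icc_2_1 = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)
# ICC(3,1): trial variance excluded from the denominator.
icc_3_1 = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e)
```

With the systematic shift present, ICC(3,1) exceeds ICC(2,1), which is why the main article argues that ICC(3,1)-type equations are safe only after systematic error has been ruled out or removed.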

21,185 citations

Book
07 Dec 1989
TL;DR: In this book, the authors cover the basics of health measurement scales: devising and selecting items, scaling responses, moving from items to scales, and assessing reliability and validity.
Abstract: 1. Introduction 2. Basic concepts 3. Devising the items 4. Scaling responses 5. Selecting the items 6. Biases in responding 7. From items to scales 8. Reliability 9. Generalizability theory 10. Validity 11. Measuring change 12. Item response theory 13. Methods of administration 14. Ethical considerations 15. Reporting test results Appendices

9,316 citations

Journal ArticleDOI
TL;DR: This article reviews the distinction between the various forms of the intraclass correlation coefficient (ICC) and discusses the procedures available for forming inferences about them.
Abstract: Although intraclass correlation coefficients (ICCs) are commonly used in behavioral measurement, psychometrics, and behavioral genetics, procedures available for forming inferences about ICCs are not widely known. Following a review of the distinction between various forms of the ICC, this article […]

5,858 citations