
Showing papers on "Intraclass correlation published in 2020"


Journal ArticleDOI
TL;DR: When the measures are quantitative, the intraclass correlation coefficient (ICC) should be used to assess agreement but this should be done with care because there are different ICCs so that it is important to describe the model and type of ICC being used.
Abstract: Agreement between observers (i.e., inter-rater agreement) can be quantified with various criteria, but their appropriate selection is critical. When the measure is qualitative (nominal or ordinal), the proportion of agreement or the kappa coefficient should be used to evaluate inter-rater consistency (i.e., inter-rater reliability). The kappa coefficient is more meaningful than the raw percentage of agreement, because the latter does not account for agreements due to chance alone. When the measures are quantitative, the intraclass correlation coefficient (ICC) should be used to assess agreement, but this should be done with care because there are different ICCs, so it is important to describe the model and type of ICC being used. The Bland-Altman method can be used to assess consistency and conformity, but its use should be restricted to comparison of two raters.
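The difference between raw agreement and chance-corrected agreement can be made concrete with a short calculation. The sketch below (illustrative data, plain Python) computes both the observed proportion of agreement and Cohen's kappa for two raters assigning binary labels; the ratings are hypothetical, chosen only to show how the marginal frequencies drive the chance correction:

```python
from collections import Counter

def percent_agreement(r1, r2):
    """Proportion of subjects on which the two raters agree."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Cohen's kappa: observed agreement corrected for chance agreement,
    estimated from each rater's marginal category frequencies."""
    n = len(r1)
    po = percent_agreement(r1, r2)
    m1, m2 = Counter(r1), Counter(r2)
    # Expected chance agreement: product of marginal proportions per category
    pe = sum(m1[c] * m2[c] for c in set(r1) | set(r2)) / n**2
    return (po - pe) / (1 - pe)

# Hypothetical binary ratings (1 = present, 0 = absent) for 10 subjects
rater1 = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
rater2 = [1, 1, 0, 1, 1, 0, 0, 0, 1, 1]

print(percent_agreement(rater1, rater2))          # 0.8
print(round(cohens_kappa(rater1, rater2), 3))     # 0.583
```

The raw agreement of 80% shrinks to a kappa of about 0.58 once chance agreement (here 52%, driven by both raters marking "present" most of the time) is removed.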

129 citations


Journal ArticleDOI
Yue Sun, Zhaoyan Fu, Qijing Bo, Zhen Mao, Xin Ma, Chuanyue Wang
TL;DR: PHQ-9 showed good reliability and validity, and high adaptability for patients with MDD in psychiatric hospital, and is a simple, rapid, effective, and reliable tool for screening and evaluation of the severity of depression.
Abstract: To assess the reliability and validity of the Patient Health Questionnaire-9 (PHQ-9) for patients with major depressive disorder (MDD) and to assess the feasibility of its use in psychiatric hospitals in China. One hundred nine outpatients or inpatients with MDD who met the Diagnostic and Statistical Manual of Mental Disorders, fourth edition (DSM-IV) criteria completed the PHQ-9 and the Hamilton Depression Scale (HAMD-17). Two weeks after the initial evaluation, 54 randomly selected patients underwent repeat assessment using the PHQ-9. For validity analysis, the construct validity and criterion validity were assessed. The internal concordance coefficient and the test-retest correlation coefficients were used for reliability analysis. The correlation between the total score and scores for each item, and the correlation between scores for various items, were evaluated using the Pearson correlation coefficient. Principal components factor analysis showed good construct validity of the PHQ-9. The PHQ-9 total score showed a positive correlation with the HAMD-17 total score (r = 0.610, P < 0.001). With the HAMD as the standard, PHQ-9 depression scores of 7, 15, and 21 points were used as cut-offs for mild, moderate, and severe depression, respectively. Consistency assessment was conducted between the depression severity as assessed by the PHQ-9 and the HAMD (Kappa = 0.229, P < 0.001). The intraclass correlation coefficient between the PHQ-9 total score and the HAMD total score was 0.594 (95% confidence interval, 0.456–0.704, P < 0.001). The Cronbach’s α coefficient of the PHQ-9 was 0.892. Correlation coefficients between each item score and the total score ranged from 0.567 to 0.789 (P < 0.01); correlation coefficients between various item scores ranged from 0.233 to 0.747. The test-retest correlation coefficient for the total score was 0.737. The PHQ-9 showed good reliability and validity, and high adaptability, for patients with MDD in psychiatric hospitals.
It is a simple, rapid, effective, and reliable tool for screening and evaluation of the severity of depression.

83 citations


Journal ArticleDOI
TL;DR: The New Freezing of Gait Questionnaire (NFOG-Q) is a widely used and valid tool to quantify freezing of gait severity; however, its test-retest reliability and minimal detectable change remain unknown.
Abstract: Background: Freezing of gait (FOG) is a common gait deficit in Parkinson's disease. The New Freezing of Gait Questionnaire (NFOG-Q) is a widely used and valid tool to quantify freezing of gait severity. However, its test-retest reliability and minimal detectable change remain unknown. Objective: To determine the test-retest reliability and responsiveness of the NFOG-Q. Methods: Two groups of freezers, involved in 2 previous rehabilitation trials, completed the NFOG-Q at 2 time points (T1 and T2), separated by a 6-week control period without active intervention. Sample 1 (N = 57) was measured in the ON medication state and sample 2 (N = 14) in the OFF state. We calculated various reliability statistics for the NFOG-Q scores between T1 and T2 as well as correlation coefficients with clinical descriptors to explain the variability between time points. Results: In sample 1 the NFOG-Q showed modest reliability (intraclass correlation coefficient = 0.68 [0.52-0.80]) without differences between T1 and T2. However, a minimal detectable change of 9.95 (7.90-12.27) points emerged for the total score (range 28 points, relative minimal detectable change of 35.5%). Sample 2 showed largely similar results. We found no associations between cognitive-related or disease severity-related outcomes and variability in NFOG-Q scores. Conclusions: We conclude that the NFOG-Q is insufficiently reliable or responsive to detect small effect sizes, as changes need to go beyond 35% to surpass measurement error. Therefore, we urge caution in using the NFOG-Q as a primary outcome in clinical trials. These results emphasize the need for robust and objective freezing of gait outcome measures.
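The minimal detectable change reported above follows from two quantities: the test-retest ICC and the between-subject standard deviation, via the standard error of measurement. A minimal sketch of the usual formulas is below; the ICC of 0.68 is taken from the abstract, but the SD of 6.3 points is a hypothetical value chosen only to illustrate the arithmetic, not a figure from the paper:

```python
import math

def sem(sd, icc):
    """Standard error of measurement: SD * sqrt(1 - ICC)."""
    return sd * math.sqrt(1 - icc)

def mdc95(sd, icc):
    """Minimal detectable change at 95% confidence for a test-retest
    difference (two measurements, hence the sqrt(2))."""
    return 1.96 * math.sqrt(2) * sem(sd, icc)

icc = 0.68   # test-retest ICC reported for sample 1
sd = 6.3     # hypothetical between-subject SD, for illustration only
print(round(mdc95(sd, icc), 2))  # 9.88
```

Even a "modest" ICC of 0.68 translates into a large MDC relative to the 28-point scale range, which is the paper's central argument against using the NFOG-Q to detect small effects.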

51 citations


Journal ArticleDOI
TL;DR: Finger strength (in both flexion and extension) and muscle tone, as provided by a robotic device for hand rehabilitation, are reliable and sensitive measures that could be considered as clinically relevant and used to assess the effect of a rehabilitation treatment in patients with subacute stroke.
Abstract: The majority of stroke survivors experience significant hand impairments, such as weakness and spasticity, with a severe impact on activities of daily living. To objectively evaluate hand deficits, quantitative measures are needed. The aim of this study is to assess the reliability, validity, and discriminant ability of the instrumental measures provided by a robotic device for hand rehabilitation, in a sample of patients with subacute stroke. In this study, 120 patients with stroke and 40 controls were enrolled. Clinical evaluation included finger flexion and extension strength (using the Medical Research Council scale, MRC), finger spasticity (using the Modified Ashworth Scale, MAS), and motor control and dexterity during ADL performance (by means of the Frenchay Arm Test, FAT). Robotic evaluations included finger flexion and extension strength, muscle tone at rest, and instrumented MAS and Modified Tardieu Scale. Subjects were evaluated twice, one day apart, to assess the test-retest reliability of the robotic measures, using the intraclass correlation coefficient (ICC). To estimate response stability, the standard errors of measurement and the minimum detectable change (MDC) were also calculated. Validity was assessed by analyzing the correlations between the robotic metrics and the clinical scales, using Spearman's correlation coefficient (r). Finally, we investigated the ability of the robotic measures to distinguish between patients with stroke and healthy subjects, by means of Mann-Whitney U tests. All the investigated measures were able to discriminate patients with stroke from healthy subjects (p < 0.001). Test-retest reliability was found to be excellent for finger strength (in both flexion and extension) and muscle tone, with ICCs higher than 0.9. MDCs were equal to 10.6 N for finger flexion, 3.4 N for finger extension, and 14.3 N for muscle tone. Conversely, test-retest reliability of the spasticity measures was poor. 
Finally, finger strength (in both flexion and extension) was correlated with the clinical scales (r of about 0.7 with MRC, and about 0.5 with FAT). Finger strength (in both flexion and extension) and muscle tone, as provided by a robotic device for hand rehabilitation, are reliable and sensitive measures. Moreover, finger strength is strongly correlated with clinical scales. Changes higher than the obtained MDC in these robotic measures could be considered as clinically relevant and used to assess the effect of a rehabilitation treatment in patients with subacute stroke.

51 citations


Journal ArticleDOI
TL;DR: The present study shows that the PDS is reliable and generally tracks with Tanner staging (for both self and parent report), and indicates that PDS categories do not map directly to specific Tanner stages, partly because a premature adrenarche is often misinterpreted by parents and pediatricians alike.

50 citations


Journal ArticleDOI
TL;DR: The authors emphasize the need to validate the assumptions of these statistical approaches, explain how to deal with violations, and provide formulae for calculating confidence intervals for estimated values of agreement and intraclass correlation.
Abstract: The rapid emergence of new measurement instruments and methods requires personnel and researchers of different disciplines to know the correct statistical methods for comparing their performance with reference methods and for properly interpreting the findings. We discuss the often-made mistake of applying inappropriate correlation and regression statistical approaches to compare methods, and then explain the concepts of agreement and reliability. We then introduce the intraclass correlation as a measure of inter-rater reliability and the Bland-Altman plot as a measure of agreement, and we provide formulae to calculate them, along with illustrative examples for different types of study designs: a single measurement per subject, repeated measurement while the true value is constant, and repeated measurement when the true value is not constant. We emphasize the requirement to validate the assumptions of these statistical approaches, explain how to deal with violations, and provide formulae for calculating the confidence interval for estimated values of agreement and intraclass correlation. Finally, we explain how to interpret and report the findings of these statistical analyses.
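Why the ICC "model and type" matters can be seen by computing two Shrout-Fleiss forms on the same data. The sketch below derives the two-way ANOVA mean squares directly and evaluates absolute-agreement ICC(2,1) and consistency ICC(3,1) on the classic six-subject, four-rater example from Shrout and Fleiss (1979); the two coefficients disagree sharply because the raters differ systematically in their means:

```python
def icc_from_table(data):
    """Two-way ANOVA mean squares and Shrout-Fleiss ICC(2,1) / ICC(3,1).

    data: one row per subject, one rating per rater in each row.
    """
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data)
    correction = grand ** 2 / (n * k)

    ss_total = sum(x ** 2 for row in data for x in row) - correction
    ss_rows = sum(sum(row) ** 2 for row in data) / k - correction
    col_sums = [sum(row[j] for row in data) for j in range(k)]
    ss_cols = sum(c ** 2 for c in col_sums) / n - correction
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-subjects mean square
    msc = ss_cols / (k - 1)              # between-raters mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square

    icc21 = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc31 = (msr - mse) / (msr + (k - 1) * mse)
    return icc21, icc31

# Classic Shrout & Fleiss (1979) example: 6 subjects rated by 4 raters
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc21, icc31 = icc_from_table(ratings)
print(round(icc21, 2), round(icc31, 2))  # 0.29 0.71
```

Reporting "the ICC is 0.71" versus "the ICC is 0.29" for these data is purely a choice of model and type, which is exactly why the abstract insists that both be described.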

38 citations


Journal ArticleDOI
TL;DR: The findings support the comparability of CANTAB performance indices (errors, correct trials, and response sensitivity) in unsupervised, web-based assessments with in-person and laboratory tests, and underline the importance of examining more than one index to ascertain comparability.
Abstract: Background: Computerized assessments are already used to derive accurate and reliable measures of cognitive function. Web-based cognitive assessment could improve the accessibility and flexibility of research and clinical assessment, widen participation, and promote research recruitment while simultaneously reducing costs. However, differences in context may influence task performance. Objective: This study aims to determine the comparability of an unsupervised, web-based administration of the Cambridge Neuropsychological Test Automated Battery (CANTAB) against a typical in-person lab-based assessment, using a within-subjects counterbalanced design. The study aims to test (1) reliability, quantifying the relationship between measurements across settings using correlational approaches; (2) equivalence, the extent to which test results in different settings produce similar overall results; and (3) agreement, by quantifying acceptable limits to bias and differences between measurement environments. Methods: A total of 51 healthy adults (32 women and 19 men; mean age 36.8, SD 15.6 years) completed 2 testing sessions, which were completed on average 1 week apart (SD 4.5 days). Assessments included equivalent tests of emotion recognition (emotion recognition task [ERT]), visual recognition (pattern recognition memory [PRM]), episodic memory (paired associate learning [PAL]), working memory and spatial planning (spatial working memory [SWM] and One Touch Stockings of Cambridge), and sustained attention (rapid visual information processing [RVP]). Participants were randomly allocated to one of the two groups, either assessed in-person in the laboratory first (n=33) or with unsupervised web-based assessments on their personal computing systems first (n=18). Performance indices (errors, correct trials, and response sensitivity) and median reaction times were extracted. 
Intraclass and bivariate correlations examined intersetting reliability, linear mixed models and Bayesian paired sample t tests tested for equivalence, and Bland-Altman plots examined agreement. Results: Intraclass correlation (ICC) coefficients ranged from ρ=0.23-0.67, with high correlations in 3 performance indices (from PAL, SWM, and RVP tasks; ρ≥0.60). High ICC values were also seen for reaction time measures from 2 tasks (PRM and ERT tasks; ρ≥0.60). However, reaction times were slower during web-based assessments, which undermined both equivalence and agreement for reaction time measures. Performance indices did not differ between assessment settings and generally showed satisfactory agreement. Conclusions: Our findings support the comparability of CANTAB performance indices (errors, correct trials, and response sensitivity) in unsupervised, web-based assessments with in-person and laboratory tests. Reaction times are not as easily translatable from in-person to web-based testing, likely due to variations in computer hardware. The results underline the importance of examining more than one index to ascertain comparability, as high correlations can present in the context of systematic differences, which are a product of differences between measurement environments. Further work is now needed to examine web-based assessments in clinical populations and in larger samples to improve sensitivity for detecting subtler differences between test settings.
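The agreement analysis in this study rests on Bland-Altman limits: the mean difference between settings (bias) plus or minus 1.96 standard deviations of the differences. A minimal sketch is below, using hypothetical paired scores (not data from the study) to show how the limits are obtained:

```python
import statistics

def bland_altman_limits(x, y):
    """Mean difference (bias) and 95% limits of agreement
    for paired measurements from two settings or methods."""
    diffs = [a - b for a, b in zip(x, y)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical scores from the same subjects in the lab (x) and on the web (y)
lab = [10, 12, 11, 13, 10, 12]
web = [9, 11, 11, 12, 10, 11]
bias, lower, upper = bland_altman_limits(lab, web)
print(round(bias, 2), round(lower, 2), round(upper, 2))  # 0.67 -0.35 1.68
```

A systematic difference between settings (such as the slower web-based reaction times reported above) shows up as a bias well away from zero, even when correlations between settings are high.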

37 citations


Journal ArticleDOI
Fan Li
TL;DR: In this paper, a matrix-adjusted quasi-least squares approach is proposed to estimate the correlation parameters along with the marginal intervention effect for a cohort stepped wedge randomized trial, and the empirical power agrees well with the prediction even with as few as nine clusters.
Abstract: A stepped wedge cluster randomized trial is a type of longitudinal cluster design that sequentially switches clusters to intervention over time until all clusters are treated. While the traditional posttest-only parallel design requires adjustment for a single intraclass correlation coefficient, the stepped wedge design allows multiple outcome measurements from the same cluster and so additional correlation parameters are necessary to characterize the within-cluster correlation structure. Although a number of studies have differentiated between the concepts of within-period and between-period correlations, only a few studies have allowed the between-period correlation to decay over time. In this article, we consider the proportional decay correlation structure for a cohort stepped wedge design, and provide a matrix-adjusted quasi-least squares approach to accurately estimate the correlation parameters along with the marginal intervention effect. We further develop the sample size and power procedures accounting for the correlation decay, and investigate the accuracy of the power procedure with continuous outcomes in a simulation study. We show that the empirical power agrees well with the prediction even with as few as nine clusters, when data are analyzed with matrix-adjusted quasi-least squares concurrently with a suitable bias-corrected sandwich variance. Two trial examples are provided to illustrate the new sample size procedure.
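The role the single ICC plays in the simpler parallel design mentioned above can be sketched in a few lines: with clusters of size m and intraclass correlation rho, the variance inflation ("design effect") is 1 + (m - 1) * rho, which scales up the individually randomized sample size. The values below are hypothetical planning numbers for illustration only; the stepped wedge calculation in the paper additionally involves within-period and between-period correlations and their decay, which this sketch does not cover:

```python
import math

def design_effect(m, icc):
    """Variance inflation for a parallel cluster randomized design
    with equal cluster size m and a single ICC."""
    return 1 + (m - 1) * icc

# Hypothetical planning values
n_individual = 200   # sample size required ignoring clustering
m = 25               # subjects per cluster
icc = 0.05           # assumed intraclass correlation

n_cluster_design = n_individual * design_effect(m, icc)
clusters_needed = math.ceil(n_cluster_design / m)
print(round(n_cluster_design), clusters_needed)  # 440 18
```

Even a small ICC more than doubles the required sample size here, which is why correctly specifying the correlation structure matters so much for the more complex stepped wedge case.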

35 citations


Journal ArticleDOI
TL;DR: This study developed two instruments, the Self-Care in Chronic Obstructive Pulmonary Disease (COPD) Inventory and the COPD-Self-Care Self-Efficacy Scale (SCES), and tested their psychometric properties on a convenience sample of 498 patients from Northern, Central, and Southern Italy.
Abstract: This study developed two instruments, the Self-Care in Chronic Obstructive Pulmonary Disease (COPD) Inventory (SC-COPDI) and the COPD-Self-Care Self-Efficacy Scale (SCES), and tested their psychometric properties on a convenience sample of 498 patients from Northern, Central, and Southern Italy. First, the domains and the items of the SC-COPDI were generated based on the middle-range theory of self-care of chronic illness, comprising the dimensions of self-care maintenance, self-care monitoring, and self-care management, and the SCES was developed accordingly. Second, we assessed the content validity of each scale. Third, we conducted a multicenter cross-sectional study to test their structural validity, convergent and discriminative validity, internal consistency, and test-retest reliability. The theoretical dimensions of the two instruments were confirmed through confirmatory factor analysis. Convergent validity was demonstrated by the correlation among the three self-care scales and the Self-Efficacy Scale, and discriminative validity by higher self-care scale scores in individuals with greater COPD severity and poorer health status. The global reliability index ranged from .78 to .92 for all scales. The intraclass correlation coefficients were higher than .70. Further studies are needed to confirm the psychometric properties of the two instruments in different COPD populations and countries to extend their use in clinical practice.

33 citations


Journal ArticleDOI
TL;DR: The self-administration version of the ALSFRS-R demonstrates high reproducibility and can be used in apps and online portals for both individual comparisons, facilitating the management of clinical care and group comparisons in clinical trials.
Abstract: Objective The Amyotrophic Lateral Sclerosis Functional Rating Scale-Revised (ALSFRS-R) is widely applied to assess disease severity and progression in patients with motor neuron disease (MND). The objective of the study is to assess the inter-rater and intra-rater reproducibility, i.e., the inter-rater and intra-rater reliability and agreement, of a self-administration version of the ALSFRS-R for use in apps, online platforms, clinical care and trials. Methods The self-administration version of the ALSFRS-R was developed based on both patient and expert feedback. To assess the inter-rater reproducibility, 59 patients with MND filled out the ALSFRS-R online and were subsequently assessed on the ALSFRS-R by three raters. To assess the intra-rater reproducibility, patients were invited on two occasions to complete the ALSFRS-R online. Reliability was assessed with intraclass correlation coefficients, agreement was assessed with Bland-Altman plots and paired samples t-tests, and internal consistency was examined with Cronbach’s coefficient alpha. Results The self-administration version of the ALSFRS-R demonstrated excellent inter-rater and intra-rater reliability. The assessment of inter-rater agreement demonstrated small systematic differences between patients and raters and acceptable limits of agreement. The assessment of intra-rater agreement demonstrated no systematic changes between time points; limits of agreement were 4.3 points for the total score and ranged from 1.6 to 2.4 points for the domain scores. Coefficient alpha values were acceptable. Discussion The self-administration version of the ALSFRS-R demonstrates high reproducibility and can be used in apps and online portals for both individual comparisons, facilitating the management of clinical care and group comparisons in clinical trials.

32 citations


Journal ArticleDOI
TL;DR: PGA is a valid, responsive and feasible instrument, though its reliability was impacted by the scale adopted, suggesting the major need for standardization of its scoring.
Abstract: Objective The Physician Global Assessment (PGA) is a visual analogue score that reflects the clinician's judgement of overall SLE disease activity. The aim of this systematic literature review is to describe and analyse the psychometric properties of the PGA. Methods This systematic literature review was conducted by two independent reviewers in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement. All articles published through 1 July 2019 in PubMed were screened, with no limitation on year of publication, language or patients' age. Psychometric properties data were analysed according to the OMERACT Filter methodology version 2.1. Results The literature search identified 91 studies. Face validity was reported in all the articles retrieved in which the PGA was used alone or as part of composite indices (Systemic Responder Index, Safety of Estrogen in Lupus Erythematosus National Assessment Flare Index, Lupus Low Disease Activity State, Definitions of Remission in Systemic Lupus Erythematosus criteria). Content validity was reported in 89 studies. Construct validity was demonstrated by a good correlation (r ≥ 0.50) between the PGA with the SLEDAI (12 studies), SLAM (4 studies), LAI, BILAG and ECLAM (2 studies each). Criterion validity was assessed exploring the PGA correlation with quality of life measurements, biomarker levels and treatment changes in 28 studies, while no study has evaluated correlation with damage. A good responsiveness for PGA was shown in eight studies. A high variability in scales was found, causing a wide range of reliability (intraclass correlation coefficient 0.67-0.98). Conclusion PGA is a valid, responsive and feasible instrument, though its reliability was impacted by the scale adopted, suggesting the major need for standardization of its scoring.

Journal ArticleDOI
TL;DR: The m30STS is a reliable, feasible tool for use in a general geriatric population with a lower level of function and demonstrated concurrent validity with the Berg Balance Scale and modified Barthel Index but not with knee extensor strength to body weight ratio.
Abstract: Background and purpose Sit-to-stand tests measure a clinically relevant function and are widely used in older adult populations. The modified 30-second sit-to-stand test (m30STS) overcomes the floor effect of other sit-to-stand tests observed in physically challenged older adults. The purpose of this study was to examine interrater and test-retest intrarater reliability for the m30STS for older adults. In addition, convergent validity of the m30STS, as well as responsiveness to change, was examined in older adults undergoing rehabilitation. Methods In phase I, 7 older adult participants were filmed performing the m30STS. The m30STS was standardized to allow hand support during the rise to and descent from standing but required participants to let go of the armrests with each stand. Ten physical therapists and physical therapist assistants independently scored the filmed m30STS twice, with 21 days separating the scoring sessions. In phase II, 33 older adults with comorbidities admitted to physical therapy services at a skilled nursing facility were administered the m30STS, Berg Balance Scale, handheld dynamometry of knee extensors, and the modified Barthel Index at initial examination and discharge. Results In phase I, the m30STS was found to be reliable. Interrater reliability using absolute agreement was calculated as an intraclass correlation coefficient ICC(2,1) = 0.737 (P ≤ .001). Test-retest intrarater reliability using absolute agreement was calculated as ICC(2,k) = 0.987 (P ≤ .001). In phase II, concurrent validity was established for the m30STS for the initial (Spearman ρ = 0.737, P = .01) and discharge (Spearman ρ = 0.727, P = .01) Berg Balance Scale as well as total scores of the modified Barthel Index (initial total score Spearman ρ = 0.711, P = .01; discharge total score Spearman ρ = 0.824, P = .01). The initial m30STS predicted 31.5% of the variability in the discharge Berg Balance Scale. 
The m30STS did not demonstrate significant correlation with body weight-adjusted strength measures of knee extensors measured by handheld dynamometry. The minimal detectable change (MDC90) was calculated to be 0.70, meaning that an increase of 1 additional repetition in the m30STS is a change beyond error. Conclusion The m30STS is a reliable, feasible tool for use in a general geriatric population with a lower level of function. The m30STS demonstrated concurrent validity with the Berg Balance Scale and modified Barthel Index but not with knee extensor strength to body weight ratio. One repetition of the m30STS was established as the MDC90 as change beyond error.

Journal ArticleDOI
TL;DR: SONG-HD Fatigue seems to be a reliable and valid measure to be used in trials involving patients receiving hemodialysis, and demonstrated convergence with Functional Assessment of Chronic Illness Therapy-Fatigue and had moderate correlations with other measures that assessed related but not the same concept.
Abstract: Background and objectives Fatigue is a very common and debilitating symptom and identified by patients as a critically important core outcome to be included in all trials involving patients receiving hemodialysis. A valid, standardized measure for fatigue is needed to yield meaningful and relevant evidence about this outcome. This study validated a core patient-reported outcome measure for fatigue in hemodialysis. Design, setting, participants, & measurements A longitudinal cohort study was conducted to assess the validity and reliability of a new fatigue measure (Standardized Outcomes in Nephrology-Hemodialysis Fatigue [SONG-HD Fatigue]). Eligible and consenting patients completed the measure at three time points: baseline, a week later, and 12 days following the second time point. Cronbach α and intraclass correlation coefficient were calculated to assess internal consistency, and Spearman rho was used to assess convergent validity. Confirmatory factor analysis was also conducted. Hemodialysis units in the United Kingdom, Australia, and Romania participated in this study. Adult patients aged 18 years and over who were English speaking and receiving maintenance hemodialysis were eligible to participate. Standardized Outcomes in Nephrology-Hemodialysis, the Visual Analog Scale for fatigue, the 12-Item Short Form Survey, and Functional Assessment of Chronic Illness Therapy–Fatigue were used. Results In total, 485 participants completed the study across the United Kingdom, Australia, and Romania. Psychometric assessment demonstrated that Standardized Outcomes in Nephrology-Hemodialysis is internally consistent (Cronbach α =0.81–0.86) and stable over a 1-week period (intraclass correlation coefficient =0.68–0.74). 
The measure demonstrated convergence with Functional Assessment of Chronic Illness Therapy–Fatigue and had moderate correlations with other measures that assessed related but not the same concept (the 12-Item Short Form Survey and the Visual Analog Scale). Confirmatory factor analysis supported the one-factor model. Conclusions SONG-HD Fatigue seems to be a reliable and valid measure to be used in trials involving patients receiving hemodialysis.
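The internal-consistency figures reported above come from Cronbach's alpha, which relates the sum of the item variances to the variance of the total score. A minimal sketch is below; the item responses are hypothetical and serve only to show the calculation:

```python
import statistics

def cronbach_alpha(items):
    """Cronbach's alpha from item score columns (one list per item).

    alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(items)
    item_vars = sum(statistics.variance(item) for item in items)
    totals = [sum(scores) for scores in zip(*items)]
    total_var = statistics.variance(totals)
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical responses: 3 items answered by 4 subjects
item1 = [2, 4, 3, 5]
item2 = [3, 4, 3, 5]
item3 = [3, 5, 4, 5]
print(round(cronbach_alpha([item1, item2, item3]), 2))  # 0.96
```

When items move together across subjects, the total-score variance dwarfs the summed item variances and alpha approaches 1; values in the 0.81-0.86 range reported for SONG-HD Fatigue indicate strong but not redundant item covariation.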

Journal ArticleDOI
02 Dec 2020
TL;DR: The results demonstrate that the repeatability of speech features extracted using open-source tool kits is low, and researchers should exercise caution when developing digital health models with open-source speech features.
Abstract: Introduction Changes in speech have the potential to provide important information on the diagnosis and progression of various neurological diseases. Many researchers have relied on open-source speech features to develop algorithms for measuring speech changes in clinical populations as they are convenient and easy to use. However, the repeatability of open-source features in the context of neurological diseases has not been studied. Methods We used a longitudinal sample of healthy controls, individuals with amyotrophic lateral sclerosis, and individuals with suspected frontotemporal dementia, and we evaluated the repeatability of acoustic and language features separately on these 3 data sets. Results Repeatability was evaluated using intraclass correlation (ICC) and the within-subjects coefficient of variation (WSCV). In 3 sets of tasks, the median ICCs ranged from 0.02 to 0.55, and the median WSCVs ranged from 29% to 79%. Conclusion Our results demonstrate that the repeatability of speech features extracted using open-source tool kits is low. Researchers should exercise caution when developing digital health models with open-source speech features. We provide a detailed summary of feature-by-feature repeatability results (ICC, WSCV, SE of measurement, limits of agreement for WSCV, and minimal detectable change) in the online supplementary material so that researchers may incorporate repeatability information into the models they develop.

Journal ArticleDOI
TL;DR: In this paper, the authors derived confidence intervals and thresholds of significant change for Kinarm Standard Tests (KST) and determined the 5-95% confidence intervals for each task parameter, and derived thresholds for significant change.
Abstract: Traditional clinical assessments are used extensively in neurology; however, they can be coarse, which can also make them insensitive to change. Kinarm is a robotic assessment system that has been used for precise assessment of individuals with neurological impairments. However, this precision also leads to the challenge of identifying whether a given change in performance reflects a significant change in an individual’s ability or is simply natural variation. Our objective here is to derive confidence intervals and thresholds of significant change for Kinarm Standard Tests™ (KST). We assessed participants twice within 15 days on all tasks presently available in KST. We determined the 5–95% confidence intervals for each task parameter, and derived thresholds for significant change. We tested for learning effects and corrected for the false discovery rate (FDR) to identify task parameters with significant learning effects. Finally, we calculated consistency-type intraclass correlations (ICC-C) to quantify consistency across assessments. We recruited an average of 56 participants per task. Confidence intervals for Z-Task Scores ranged between 0.61 and 1.55, and the threshold for significant change ranged between 0.87 and 2.19. We determined that 4/11 tasks displayed learning effects that were significant after FDR correction; these 4 tasks primarily tested cognition or cognitive-motor integration. ICC-C values for Z-Task Scores ranged from 0.26 to 0.76. The present results provide statistical bounds on individual performance for KST as well as significant changes across repeated testing. Most measures of performance had good inter-rater reliability. Tasks with a higher cognitive burden seemed to be more susceptible to learning effects, which should be taken into account when interpreting longitudinal assessments of these tasks.

Journal ArticleDOI
TL;DR: The CIQ-R is an important tool for Italian professionals and can be useful in both clinical practice and research for measuring the level of community integration among the healthy population.
Abstract: Objective. The aims of this study are the translation, cultural adaptation, and validation of the Community Integration Questionnaire–Revised (CIQ-R) in Italian in a group of individuals with no clinical evidence of disability. Methods. The test’s internal consistency and validity were assessed by following international guidelines. The test’s internal consistency was examined using Cronbach’s alpha (α) coefficient. Pearson’s correlation coefficient was calculated to assess the test’s concurrent validity compared with the Short Form-12 (SF-12) health survey. Results. The CIQ-R was administered to 400 people with no clinical evidence of disease, impairment, or disability, aged between 18 and 64. Cronbach’s α was 0.82 for the home integration subscale. The test also showed good test-retest reliability, with an intraclass correlation coefficient of 0.78, and significant correlations between the total score of the CIQ-R and the Physical Component Summary (PCS) of the SF-12, between the “social integration” subscale’s score and the PCS-12, and between the “Electronic Social Networking integration” subscale’s score and the PCS-12. Conclusion. This is the first study to report the results of the translation and validation of the CIQ-R in Italian. The CIQ-R is an important tool for Italian professionals and can be useful in both clinical practice and research for measuring the level of community integration among the healthy population.

Journal ArticleDOI
01 Sep 2020-Cancer
TL;DR: The 9-item Financial Index of Toxicity demonstrated internal and test-retest reliability as well as concurrent and construct validity and Prospective testing in patients with HNC who were treated at other facilities is needed to further establish its responsiveness and generalizability.
Abstract: BACKGROUND The treatment of head and neck cancer (HNC) may cause significant financial toxicity to patients. Herein, the authors have presented the development and validation of the Financial Index of Toxicity (FIT) instrument. METHODS Items were generated using literature review and were based on expert opinion. In item reduction, items with factor loadings of a magnitude <0.3 in exploratory factor analysis and inverse correlations (r < 0) in test-retest analysis were eliminated. Retained items constituted the FIT. Reliability tests included internal consistency (Cronbach α) and test-retest reliability (intraclass correlation). Validity was tested using the Spearman rho by comparing FIT scores with baseline income, posttreatment lost income, and the Financial Concerns subscale of the Social Difficulties Inventory. Responsiveness analysis compared change in income and change in FIT between 12 and 24 months. RESULTS A total of 14 items were generated and subsequently reduced to 9 items comprising 3 domains identified on exploratory factor analysis: financial stress, financial strain, and lost productivity. The FIT was administered to 430 patients with HNC at 12 to 24 months after treatment. Internal consistency was good (α = .77). Test-retest reliability was satisfactory (intraclass correlation, 0.70). Concurrent validation demonstrated mild to strong correlations between the FIT and Social Difficulties Inventory Money Matters subscale (Spearman rho, 0.26-0.61; P < .05). FIT scores were found to be inversely correlated with baseline household income (Spearman rho, -0.34; P < .001) and positively correlated with lost income (Spearman rho, 0.24; P < .001). Change in income was negatively correlated with change in FIT over time (Spearman rho, -0.25; P = .04). CONCLUSIONS The 9-item FIT demonstrated internal and test-retest reliability as well as concurrent and construct validity. 
Prospective testing in patients with HNC who were treated at other facilities is needed to further establish its responsiveness and generalizability.
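The Spearman rho used throughout these validity analyses is Pearson's correlation computed on ranks, with tied values receiving their average rank. An illustrative self-contained sketch (not the authors' code):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation; ties get average 1-based ranks."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        i = 0
        while i < len(order):
            j = i
            # extend j over the block of values tied with vals[order[i]]
            while j + 1 < len(order) and vals[order[j + 1]] == vals[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1       # average 1-based rank for the block
            for idx in order[i:j + 1]:
                r[idx] = avg
            i = j + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Because only ranks enter the formula, rho captures any monotonic association, which is why it suits skewed quantities like income and financial-toxicity scores.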

Journal ArticleDOI
TL;DR: Reliability is paramount to establishing the validity of a tool; because the constructs assessed by the NIH-TB may vary over time in youth, further refinement of the NIH-TB Cognitive Battery and its norming procedures for children is recommended.
Abstract: Background The Cognitive Battery of the National Institutes of Health Toolbox (NIH-TB) is a collection of assessments that have been adapted and normed for administration across the lifespan and is increasingly used in large-scale population-level research. However, despite increasing adoption in longitudinal investigations of neurocognitive development, and growing recommendations that the Toolbox be used in clinical applications, little is known about the long-term temporal stability of the NIH-TB, particularly in youth. Methods The present study examined the long-term temporal reliability of the NIH-TB in a large cohort of youth (9–15 years old) recruited across two data collection sites. Participants were invited to complete testing annually for 3 years. Results Reliability was generally low-to-moderate, with intraclass correlation coefficients ranging between 0.31 and 0.76 for the full sample. There were multiple significant differences between sites, with one site generally exhibiting stronger temporal stability than the other. Conclusions Reliability of the NIH-TB Cognitive Battery was lower than expected given early work examining shorter test-retest intervals. Moreover, there were very few instances of tests meeting stability requirements for use in research; none of the tests exhibited adequate reliability for use in clinical applications. Reliability is paramount to establishing the validity of the tool, thus the constructs assessed by the NIH-TB may vary over time in youth. We recommend further refinement of the NIH-TB Cognitive Battery and its norming procedures for children before further adoption as a neuropsychological assessment. We also urge researchers who have already employed the NIH-TB in their studies to interpret their results with caution.

Journal ArticleDOI
TL;DR: It is concluded that ICA and NASA-TLX are reliable measures of cognitive workload in older adults, and the convergent validity of these measures against event-related potentials (ERPs) was investigated.
Abstract: Cognitive workload is increasingly recognized as an important determinant of performance in cognitive tests and daily life activities. Cognitive workload is a measure of physical and mental effort allocation to a task, which can be determined through self-report or physiological measures. However, the reliability and validity of these measures have not been established in older adults with a wide range of cognitive ability. The aim of this study was to establish the test-retest reliability of the National Aeronautics and Space Administration Task Load Index (NASA-TLX) and Index of Cognitive Activity (ICA), extracted from pupillary size. The convergent validity of these measures against event-related potentials (ERPs) was also investigated. A total of 38 individuals with scores on the Montreal Cognitive Assessment ranging between 17 and 30 completed a working memory test (n-back) with three levels of difficulty at baseline and at a two-week follow-up. The intraclass correlation coefficient (ICC) values of the NASA-TLX ranged between 0.71 and 0.81, demonstrating good to excellent reliability. The mean ICA scores showed fair to good reliability, with ICCs ranging between 0.56 and 0.73. The mean ICA and NASA-TLX scores showed significant and moderate correlations (Pearson's r ranging between 0.30 and 0.33) with the third positive peak of the ERP at the midline channels. We conclude that ICA and NASA-TLX are reliable measures of cognitive workload in older adults. Further research is needed in dissecting the subjective and objective constructs of cognitive workload.

Journal ArticleDOI
TL;DR: The LOCFAS-I was found to be reliable and a valid measurement tool for the assessment of cognitive functioning post-coma in the Italian population.
Abstract: STUDY DESIGN Cross-sectional study. OBJECTIVE To develop an Italian version of the Levels of Cognitive Functioning Assessment Scale (LOCFAS) and examine its reliability and validity. SUBJECT Patients with acquired brain injury in an early post-coma state. METHODS The original scale was translated from English to Italian using the guidelines set forth in the Translation and Cultural Adaptation of Patient Reported Outcomes Measures-Principles of Good Practice. Intra-rater reliability was examined using the intraclass correlation coefficient (ICC). Concurrent validity was evaluated using Pearson's correlation coefficients with some of the functional and disability components of the International Classification of Functioning, Disability and Health (ICF), excluding environmental factors. SETTING The highly specialized neurorehabilitation department of "San Raffaele" Hospital, Cassino. RESULTS The Italian version of the LOCFAS (LOCFAS-I) was administered to 38 subjects from May 9, 2017 to August 31, 2017. The mean ± SD of the LOCFAS-I score was 3.05 ± 1.88. All LOCFAS-I items were either identical or similar in meaning to the original version's items. Test-retest reliability (ICC) was 0.996, and Pearson's correlation coefficients with the ICF components were all >0.536 (p<0.01). CONCLUSIONS The LOCFAS-I was found to be a reliable and valid measurement tool for the assessment of cognitive functioning post-coma in the Italian population.

Journal ArticleDOI
TL;DR: A single IMU, placed on the lower trunk, is considered a reliable tool for the detection of some spatial-temporal gait parameters in patients after recent THA or TKA, where crutches seem to interfere with the detection of the gait cycle phases.

Journal ArticleDOI
TL;DR: The results show that a 12-week circuit strength training program is an effective method to increase handball-related performance characteristics.
Abstract: The purpose of this study was to determine the effects of 12 weeks of circuit training on physical fitness in handball players. Subjects were randomly divided into a circuit strength training group (CT, n = 10) and a control group (CG, n = 9). Training sessions and matches were performed together, but during the 12-week intervention, the experimental group replaced part of the regular regimen with circuit strength training. Measures assessed in both groups before and after the intervention included: the agility T-half Test, the Yo-Yo intermittent recovery test, squat and counter-movement jumps, 15 m and 30 m sprints, and strength tests for the bench press, pull over, and the half squat. The upper limb bench press and pull-over tests along with the lower limb back half squat were performed using a 1-repetition maximum protocol. Based on the intraclass correlation coefficient and excluding the agility T-test (ICC = 0.72), we found excellent relative reliability for all variables (intraclass correlation coefficient range: 0.85-0.96, SEM range: 0.03-3.00). For absolute reliability or coefficients of variation, 71% (5/7) of the variables were excellent (CV < 5%). The circuit strength training group showed significant interaction effects and relevant effect sizes for the 12-week training period (8/9, 89%), and the mean effect size for the CT was markedly higher (d = 1.3, range: 0.41 to 2.76) than in the CG (d = -1.0, range: -0.73 to 0.29). The largest improvements were in the Yo-Yo test (d = 2.76) and the squat jump (d = 2.05). These results show that a 12-week circuit strength training program is an effective method to increase handball-related performance characteristics.
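The reliability and effect-size statistics quoted above follow standard formulas: the coefficient of variation is the SD expressed as a percentage of the mean, and Cohen's d is sketched here with a pooled SD (an assumption for the example; the paper may have used a different denominator, such as the pre-test SD):

```python
def mean(xs):
    return sum(xs) / len(xs)

def sd(xs):
    """Sample standard deviation (n-1 denominator)."""
    m = mean(xs)
    return (sum((x - m) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5

def cohens_d(pre, post):
    """Cohen's d for two score sets, using the pooled SD."""
    n1, n2 = len(pre), len(post)
    pooled = (((n1 - 1) * sd(pre) ** 2 + (n2 - 1) * sd(post) ** 2)
              / (n1 + n2 - 2)) ** 0.5
    return (mean(post) - mean(pre)) / pooled

def cv_percent(xs):
    """Coefficient of variation in %; CV < 5% is often read as excellent."""
    return 100 * sd(xs) / mean(xs)
```

Expressing change in SD units is what lets the abstract compare improvements across tests with very different raw scales (jump height, sprint time, Yo-Yo distance).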

Journal ArticleDOI
TL;DR: The GSDS-IT was found to have good internal consistency and test-retest reliability, and it showed positive and significant values for all the PSQI domains, and is a valid, reliable, and time-efficient tool for measuring sleep disturbances over the past week in a population with SCI.
Abstract: Psychometric study. This study sought to analyze the psychometric properties of the Italian version of the General Sleep Disturbance Scale (GSDS-IT) in a population of individuals with spinal cord injury (SCI). Italy. Its reliability was assessed using Cronbach’s alpha and the intraclass correlation coefficient (ICC), while its concurrent validity was assessed using Pearson’s correlation coefficient in relation to the Pittsburgh Sleep Quality Index (PSQI). The obtained scores were compared with the cut-off score for the GSDS-IT among a healthy Italian population (38.5). The GSDS-IT was administered to 57 participants with SCI who were recruited from all over Italy. The GSDS-IT was found to have good internal consistency (Cronbach’s α of 0.76) and good test-retest reliability (ICC of 0.7), and it showed positive and significant correlations with all the PSQI domains. Based on the cut-off score of 38.5, 56% of participants tested positive for sleep disturbances upon admission (t0), while among the randomly selected participants who repeated the scale after 24 h (t1), 75% tested positive for sleep disturbances. The GSDS-IT is a valid, reliable, and time-efficient tool for measuring sleep disturbances over the past week in a population with SCI.


Journal ArticleDOI
TL;DR: The authors' results demonstrated acceptable EARS-Br reliability, validity, and responsiveness for patients with non-specific Chronic Low Back Pain.
Abstract: This study aimed to adapt the Exercise Adherence Rating Scale (EARS) into Brazilian Portuguese and evaluate its measurement properties, namely reliability, validity, and responsiveness, in patients with non-specific Chronic Low Back Pain (CLBP). A total of 108 patients with a mean age of 46.62 years (SD = 9.98) and CLBP participated in this longitudinal study. Participants were instructed in the prescribed exercises in the first session, and adherence behavior was assessed after 1 week, and finally reassessed after 2 weeks (test-retest reliability). Three weeks after the first assessment, they were invited to complete the EARS again (responsiveness). The intraclass correlation coefficient (ICC2,1) and Cronbach’s α were used to assess test-retest reliability and internal consistency, respectively. Spearman’s correlation and confirmatory factor analysis (CFA) were used to assess construct validity, and the receiver operating characteristic (ROC) curve and area under the curve (AUC) were used to analyze responsiveness. The one-factor EARS-Br (adherence behavior) structure with 6 items showed acceptable fit indexes (comparative fit index and goodness-of-fit index > 0.90; root-mean-square error of approximation < 0.08). The EARS-Br scale showed acceptable internal consistency (α = 0.88) and excellent reliability (ICC = 0.91 [95% CI 0.86–0.94]). Mild to moderate correlations were observed between the EARS-Br total score and disability, pain catastrophizing, depression/anxiety, fear-avoidance, and pain intensity. A Minimally Important Change (MIC) of 5.5 in the EARS-Br total score was considered as a meaningful change in the adherence behavior (AUC = 0.82). Moderate accuracy (AUC = 0.89) was obtained for a 17/24 total EARS cutoff score after home exercise was prescribed. The sensitivity and specificity were also acceptable (greater than 80%). Our results demonstrated acceptable EARS-Br reliability, validity, and responsiveness for patients with CLBP.
A final score of 17/24 on EARS after the prescription of home-exercise could be used as a cut-off for an acceptable adherence behavior associated with improvement in patient outcomes.
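The AUC figures above can be computed without tracing a ROC curve, using the Mann-Whitney identity: AUC is the probability that a randomly chosen responder scores higher than a randomly chosen non-responder, with ties counting one half. An illustrative sketch (not the authors' code):

```python
def auc(pos_scores, neg_scores):
    """Rank-based AUC: P(score_pos > score_neg), ties counted as 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

An AUC of 0.5 means the score cannot separate the groups at all, while values in the 0.8-0.9 range, as reported here, indicate good discriminative accuracy for a cutoff such as 17/24.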

Journal ArticleDOI
TL;DR: S-Shearwave Imaging demonstrated excellent intra- and inter-observer repeatability, and a strong correlation with pSWE measurements of liver stiffness. However, because of the significant difference between LSMs obtained using 2D-SWE and pSWE, these methods should not be used interchangeably.
Abstract: PURPOSE The purpose of this study was to prospectively investigate the intra- and interobserver repeatability of a new 2-dimensional (2D) shear wave elastography (SWE) technique (S-Shearwave Imaging) for assessing liver fibrosis in chronic liver disease patients, and to compare liver stiffness measurements (LSMs) made using 2D-SWE with those made using point SWE (pSWE). METHODS This prospective study received institutional review board approval and informed consent was obtained from all patients. Fifty-three chronic liver disease patients were randomly allocated to group 1 (for intra-observer repeatability [n=33]) or group 2 (for inter-observer repeatability [n=20]). In group 1, two 2D-SWE sessions and one pSWE session were performed by one radiologist. In group 2, one 2D-SWE session and one pSWE session were performed by the aforementioned radiologist, and a second 2D-SWE session was performed by another radiologist. The intraclass correlation coefficient (ICC) was used to assess intra- and interobserver reliability. LSMs obtained using 2D-SWE and pSWE were compared and correlated using the paired t test and Pearson correlation coefficient, respectively. RESULTS LSMs made using 2D-SWE demonstrated excellent intra- and inter-observer repeatability (ICC, 0.997 [95% confidence interval, 0.994 to 0.999] and 0.995 [0.988 to 0.998], respectively). LSMs made using 2D-SWE were significantly different from those made using pSWE (2.1±0.6 m/sec vs. 1.9±0.6 m/sec, P<0.001), although a significant correlation existed between the 2D-SWE and pSWE LSMs (rho=0.836, P<0.001). CONCLUSION S-Shearwave Imaging demonstrated excellent intra- and inter-observer repeatability, and a strong correlation with pSWE measurements of liver stiffness. However, because of the significant difference between LSMs obtained using 2D-SWE and pSWE, these methods should not be used interchangeably.

Journal ArticleDOI
TL;DR: In this article, a modified WLAQ (m-WLAQ) was developed to assess workers' sitting time and cardiorespiratory fitness (CRF), and multiple regression analyses were performed to develop prediction equations for VO2max.
Abstract: Sedentary behavior (SB) and cardiorespiratory fitness (CRF) are important issues in occupational health. Developing a questionnaire to concurrently assess workers’ SB and CRF could fundamentally improve epidemiological research. The Worker’s Living Activity-time Questionnaire (WLAQ) was developed previously to assess workers’ sitting time. WLAQ can be modified to evaluate workers’ CRF if additional physical activity (PA) data such as PA frequency, duration, and intensity are collected. A total of 198 working adults (93 women and 105 men; age, 30–60 years) completed anthropometric measurements, a treadmill exercise test for measuring maximal oxygen consumption (VO2max), and the modified WLAQ (m-WLAQ, which included questions about PA data in addition to the original questions). Multiple regression analyses were performed to develop prediction equations for VO2max. The generated models were cross-validated using the predicted residual error sum of squares method. Among the participants, the data of 97 participants who completed m-WLAQ twice after a 1-week interval were used to calculate the intraclass correlation coefficient (ICC) for the test–retest reliability analyses. Age (r = −0.29), sex (r = 0.48), body mass index (BMI, r = −0.20), total sitting time (r = −0.15), and PA score (total points for PA data, r = 0.47) were significantly correlated with VO2max. The models that included age, sex, and BMI accounted for 43% of the variance in measured VO2max [standard error of the estimate (SEE) = 5.04 ml·kg−1·min−1]. These percentages increased to 59% when the PA score was included in the models (SEE = 4.29 ml·kg−1·min−1). Cross-validation analyses demonstrated good stability of the VO2max prediction models, while systematic underestimation and overestimation of VO2max were observed in individuals with high and low fitness, respectively. The ICC of the PA score was 0.87 (0.82–0.91), indicating excellent reliability.
The PA score obtained using m-WLAQ, rather than sitting time, correlated well with measured VO2max. The equation model that included the PA score as well as age, sex, and BMI had a favorable validity for estimating VO2max. Thus, m-WLAQ can be a useful questionnaire to concurrently assess workers’ SB and CRF, which makes it a reasonable resource for future epidemiological surveys on occupational health.
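The predicted residual error sum of squares (PRESS) method mentioned above refits the model with each observation held out and sums the squared prediction errors for the held-out points. The sketch below illustrates the idea for a single-predictor OLS model; it is a simplified stand-in for the authors' multi-predictor models (age, sex, BMI, PA score):

```python
def press_simple(xs, ys):
    """Leave-one-out PRESS for the simple regression y = a + b*x."""
    def fit(x, y):
        # Ordinary least squares via the closed-form normal equations.
        mx, my = sum(x) / len(x), sum(y) / len(y)
        b = (sum((a - mx) * (c - my) for a, c in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
        return my - b * mx, b            # intercept, slope

    press = 0.0
    for i in range(len(xs)):
        x_tr = xs[:i] + xs[i + 1:]       # drop observation i
        y_tr = ys[:i] + ys[i + 1:]
        a, b = fit(x_tr, y_tr)
        press += (ys[i] - (a + b * xs[i])) ** 2
    return press
```

Because every prediction is made for a point the model never saw, PRESS gives a more honest picture of out-of-sample stability than the in-sample SEE, which is why the study uses it for cross-validation.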

Journal ArticleDOI
TL;DR: The dimension of Looking after Myself is problematic for these young children, most notably in the 3-year-old group; however, the summary scores of the EQ-5D-Y Proxy version 1 appear to work well.
Abstract: The EQ-5D-Y Proxy is currently recommended for Health Related Quality of Life (HRQoL) measurement in children aged 4–8 years. However, it has only been validated in children over six years of age. The aim of this study was to investigate the performance of the EQ-5D-Y proxy version 1 in children between the ages of 3–6 years. A sample of 328 children between 3 and 6 years of age was recruited, which included children who were either acutely ill (AI), chronically ill (CI), or from the general school-going population (GP). The EQ-5D-Y Proxy Version 1 and the PedsQL questionnaires were administered at baseline. The EQ-5D-Y Proxy was administered telephonically 24 h later to children with chronic illnesses to establish test-retest reliability. The distribution of dimensions and summary scores, Cohen’s kappa, the intraclass correlation coefficient, Pearson’s correlation, and analysis of variance were used to explore the reliability and validity of the EQ-5D-Y for each age group. A single index score was estimated using latent scores and adult EQ-5D-3L values (Dolan). The groups included 3-year-olds (n = 105), 4-year-olds (n = 98) and 5-year-olds (n = 118). The dimension Looking after Myself had the greatest variability between age groups and had the highest rate of problems reported. Worried, Sad or Unhappy and Pain or Discomfort were not stable across time in test-retest analysis. The Visual Analogue Scale (VAS) and the single index scores estimated using the latent values and the Dolan tariff had good test-retest reliability (except for the latent value scores in a small number of 4-year-olds). EQ-5D-Y scores for all ages had small to moderate correlations with the PedsQL total score. The EQ-5D-Y discriminated well between children with a health condition and the general population for all age groups. Caregivers reported difficulty completing the Looking after Myself dimension due to age-related difficulties with washing and dressing.
The dimension of Looking after Myself is problematic for these young children, most notably so in the 3-year-old group. If one considers the summary scores of the EQ-5D-Y Proxy version 1, it appears to work well. Known group validity was demonstrated. Concurrent validity was demonstrated on a composite level but not for the individual dimensions of Usual Activities or Worried, Sad or Unhappy. The observable dimensions demonstrated stability over time, with the inferred dimensions (Pain or Discomfort and Worried, Sad or Unhappy) less so, which is to be expected. Further work is needed to explore the adaptation of these dimensions for the younger age groups.

Journal ArticleDOI
12 Mar 2020
TL;DR: The application of CAT to the VR-12 survey demonstrated an ability to lessen the response burden for patients with a negligible effect on score integrity.
Abstract: Background Patient-reported outcome measures (PROMs) are essential tools that are used to assess health status and treatment outcomes in orthopaedic care. Use of PROMs can burden patients with lengthy and cumbersome questionnaires. Predictive models using machine learning known as computerized adaptive testing (CAT) offer a potential solution. The purpose of this study was to evaluate the ability of CAT to improve efficiency of the Veterans RAND 12 Item Health Survey (VR-12) by decreasing the question burden while maintaining the accuracy of the outcome score. Methods A previously developed CAT model was applied to the responses of 19,523 patients who had completed a full VR-12 survey while presenting to 1 of 5 subspecialty orthopaedic clinics. This resulted in the calculation of both a full-survey and CAT-model physical component summary score (PCS) and mental component summary score (MCS). Several analyses compared the accuracy of the CAT model scores with that of the full scores by comparing the means and standard deviations, calculating a Pearson correlation coefficient and intraclass correlation coefficient, plotting the frequency distributions of the 2 score sets and the score differences, and performing a Bland-Altman assessment of scoring patterns. Results The CAT model required 4 fewer questions to be answered by each subject (33% decrease in question burden). The mean PCS was 1.3 points lower in the CAT model than with the full VR-12 (41.5 ± 11.0 versus 42.8 ± 10.4), and the mean MCS was 0.3 point higher (57.3 ± 9.4 versus 57.0 ± 9.6). The Pearson correlation coefficients were 0.97 for PCS and 0.98 for MCS, and the intraclass correlation coefficients were 0.96 and 0.97, respectively. The frequency distribution of the CAT and full scores showed significant overlap for both the PCS and the MCS. The difference between the CAT and full scores was less than the minimum clinically important difference (MCID) in >95% of cases for the PCS and MCS. 
Conclusions The application of CAT to the VR-12 survey demonstrated an ability to lessen the response burden for patients with a negligible effect on score integrity.
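The Bland-Altman assessment referenced above summarizes paired scores by the mean difference (bias) and the 95% limits of agreement, conventionally bias ± 1.96 SD of the differences. An illustrative sketch (not the authors' code):

```python
def bland_altman(a, b):
    """Return (bias, lower_loa, upper_loa) for paired measurements.

    a, b: equal-length lists of scores from the two methods
    (here, e.g., full-survey vs. CAT-model scores per patient).
    """
    diffs = [x - y for x, y in zip(a, b)]
    m = sum(diffs) / len(diffs)                       # bias
    s = (sum((d - m) ** 2 for d in diffs) / (len(diffs) - 1)) ** 0.5
    return m, m - 1.96 * s, m + 1.96 * s
```

If the limits of agreement fall inside the minimum clinically important difference, as reported for >95% of cases here, the two scoring methods can be treated as clinically equivalent even when they differ statistically.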

Journal ArticleDOI
TL;DR: The modified Italian version of the SCI-SCS may represent a valuable instrument for the longitudinal assessment of the impact of secondary conditions in people with SCI.
Abstract: STUDY DESIGN Validation cross-sectional study. OBJECTIVE To adapt the Spinal Cord Injury Secondary Conditions Scale (SCI-SCS) to Italian and to assess the validity and reliability of this instrument. SETTING Multicentre study in outpatient clinics of three urban spinal units across Italy. METHODS After a five-step translation/validation process, the Italian SCI-SCS was administered in a toolset composed of a sociodemographic questionnaire, the Modified Barthel Index, the Short-Form 8, the Patient Health Questionnaire 9, and the General Anxiety Disorder 7. The Italian SCI-SCS construct validity was assessed through exploratory factor analysis (EFA). The internal consistency and test-retest reliability of the instrument were evaluated using Cronbach's α and the intraclass correlation coefficient (ICC) for the total scale and its subscales. Pearson's correlation coefficient with all administered instruments was calculated to evaluate the concurrent validity. RESULTS One-hundred fifty-six participants were recruited from February to October 2018. EFA suggested a three-factor structure explaining 45% of the total variance. After experts' consideration about the clinical relevance of its components, a final version of the Italian SCI-SCS with four different subscales and 15 items was proposed. The total scale Cronbach's α was 0.73. The ICC agreement for test-retest reliability was 0.91. Correlations of the Italian SCI-SCS with the administered instruments were statistically significant (p < 0.05), highlighting congruent hypothesized relations. CONCLUSION Findings of this study provided a first psychometric evaluation of the SCI-SCS. The modified Italian version of this tool may represent a valuable instrument for the longitudinal assessment of the impact of secondary conditions in people with SCI.