Showing papers on "Differential item functioning published in 2006"


Journal ArticleDOI
TL;DR: The analyses indicate that in African American, Chinese American, Latino, and non-Hispanic white patient groups the PHQ-9 measures a common concept of depression and can be effective for the detection and monitoring of depression in these diverse populations.
Abstract: OBJECTIVE: The Patient Health Questionnaire depression scale (PHQ-9) is a well-validated, Diagnostic and Statistical Manual of Mental Disorders—Fourth Edition (DSM-IV) criterion-based measure for diagnosing depression, assessing severity and monitoring treatment response. The performance of most depression scales including the PHQ-9, however, has not been rigorously evaluated in different racial/ethnic populations. Therefore, we compared the factor structure of the PHQ-9 between different racial/ethnic groups as well as the rates of endorsement and differential item functioning (DIF) of the 9 items of the PHQ-9. The presence of DIF would indicate that responses to an individual item differ significantly between groups, controlling for the level of depression.
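
The item-level DIF check the abstract describes can be sketched in code. A minimal illustration, assuming a binarized item response, the questionnaire total as the depression-level anchor, and a binary group indicator (all data and variable names below are simulated stand-ins, not the study's own analysis):

```python
# Minimal DIF screen for one (hypothetical, simulated) PHQ-9-style item.
# 'total' anchors depression level; 'group' is a binary group indicator.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"total": rng.integers(0, 28, n), "group": rng.integers(0, 2, n)})
# Simulated binarized item response driven by depression level only (no DIF).
p = 1 / (1 + np.exp(-(0.3 * df["total"] - 4)))
df["item"] = rng.binomial(1, p)

# Null model: response depends on depression level only.
m0 = sm.Logit(df["item"], sm.add_constant(df[["total"]])).fit(disp=0)
# DIF model: add group (uniform DIF) and group x total (nonuniform DIF).
X = df[["total", "group"]].assign(gxt=df["group"] * df["total"])
m1 = sm.Logit(df["item"], sm.add_constant(X)).fit(disp=0)

lr = 2 * (m1.llf - m0.llf)                            # likelihood-ratio statistic
print(f"LR = {lr:.2f}, p = {chi2.sf(lr, df=2):.3f}")  # df = 2 added parameters
```

A significant likelihood-ratio statistic flags the item as responding differently by group at a given depression level, which is exactly the DIF definition the abstract gives.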

619 citations


Journal ArticleDOI
TL;DR: The purpose of this essay is to review the definitions and assumptions associated with factorial invariance, placing this formulation in the context of bias, fairness, and equity.
Abstract: Background: Analysis of subgroups such as different ethnic, language, or education groups selected from among a parent population is common in health disparities research. One goal of such analyses is to examine measurement equivalence, which includes both qualitative review of the meaning of items...

572 citations


Journal ArticleDOI
TL;DR: The short EUROHIS-QOL 8-item index showed good cross-cultural field study performance and a satisfactory convergent and discriminant validity, and can therefore be recommended for use in public health research.
Abstract: Background: Survey research including multiple health indicators requires brief indices for use in cross-cultural studies, which have, however, rarely been tested in terms of their psychometric quality. Recently, the EUROHIS-QOL 8-item index was developed as an adaptation of the WHOQOL-100 and the WHOQOL-BREF. The aim of the current study was to test the psychometric properties of the EUROHIS-QOL 8-item index. Methods: In a survey of 4849 European adults, the EUROHIS-QOL 8-item index was assessed across 10 countries, with equal samples adjusted for selected sociodemographic data. Participants were also investigated with a chronic condition checklist and measures of general health perception, mental health, health-care utilization and social support. Results: Findings indicated good internal consistencies across a range of countries, showing acceptable convergent validity with physical and mental health measures, and the measure discriminated well between individuals who reported having a longstanding condition and healthy individuals across all countries. Differential item functioning was less frequently observed in those countries that were geographically and culturally closer to the UK, but was acceptable across all countries. A universal one-factor structure with a good fit in structural equation modelling (SEM) analyses was identified, with, however, limitations in model fit for specific countries. Conclusions: The short EUROHIS-QOL 8-item index showed good cross-cultural field study performance and satisfactory convergent and discriminant validity, and can therefore be recommended for use in public health research. In future studies the measure should also be tested in multinational clinical studies, particularly in order to test its sensitivity.

414 citations


Journal ArticleDOI
TL;DR: The authors propose a common strategy for identifying differential item functioning (DIF) items that can be implemented in both the mean and covariance structures method (MACS) and item response theory (IRT); results indicated that this strategy was considerably more effective than an alternative approach involving a constrained-baseline model.
Abstract: In this article, the authors developed a common strategy for identifying differential item functioning (DIF) items that can be implemented in both the mean and covariance structures method (MACS) and item response theory (IRT). They proposed examining the loadings (discrimination) and the intercept (location) parameters simultaneously using the likelihood ratio test with a free-baseline model and Bonferroni corrected critical p values. They compared the relative efficacy of this approach with alternative implementations for various types and amounts of DIF, sample sizes, numbers of response categories, and amounts of impact (latent mean differences). Results indicated that the proposed strategy was considerably more effective than an alternative approach involving a constrained-baseline model. Both MACS and IRT performed similarly well in the majority of experimental conditions. As expected, MACS performed slightly worse in dichotomous conditions but better than IRT in polytomous cases where sample sizes were small. Also, contrary to popular belief, MACS performed well in conditions where DIF was simulated on item thresholds (item means), and its accuracy was not affected by impact.
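
The free-baseline likelihood-ratio procedure with a Bonferroni correction can be sketched generically. The per-item log-likelihoods would come from refitting an IRT or MACS model with the studied item's loading and intercept constrained equal across groups; the numbers below are invented for illustration:

```python
# Free-baseline likelihood-ratio DIF test with Bonferroni-corrected alpha.
# Log-likelihoods would come from an IRT/MACS fitting package; these are made up.
from scipy.stats import chi2

n_items = 20
alpha = 0.05 / n_items                      # Bonferroni correction across items

# item -> (loglik of free-baseline model,
#          loglik with the item's loading and intercept constrained equal)
loglik = {"item07": (-10412.3, -10419.8), "item13": (-10412.3, -10413.1)}

for item, (ll_free, ll_constrained) in loglik.items():
    lr = 2 * (ll_free - ll_constrained)     # constrained model can't fit better
    p = chi2.sf(lr, df=2)                   # loading + intercept = 2 constraints
    print(f"{item}: LR = {lr:.1f}, p = {p:.4g}, flagged = {p < alpha}")
```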

358 citations


Journal ArticleDOI
TL;DR: The ordinal logistic regression approach to DIF detection, when combined with IRT ability estimates, provides a reasonable alternative for DIF detection.
Abstract: Introduction: We present an ordinal logistic regression model for identification of items with differential item functioning (DIF) and apply this model to a Mini-Mental State Examination (MMSE) dataset. We employ item response theory ability estimation in our models. Three nested ordinal logistic reg...
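
The three nested models can be sketched with statsmodels' OrderedModel; everything below (data, cut points, column names) is simulated, with an IRT ability estimate standing in for the MMSE theta:

```python
# Three nested ordinal logistic DIF models on simulated data:
#   M1: item ~ ability;  M2: + group (uniform DIF);  M3: + group x ability.
import numpy as np
import pandas as pd
from scipy.stats import chi2
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(1)
n = 800
theta = rng.normal(size=n)                   # stands in for IRT ability estimates
group = rng.integers(0, 2, n)
latent = 1.2 * theta + rng.logistic(size=n)  # item driven by ability only here
y = np.digitize(latent, [-1.0, 1.0])         # three ordered response categories

df = pd.DataFrame({"y": y, "theta": theta, "group": group})
df["gxt"] = df["group"] * df["theta"]

def fit(cols):
    return OrderedModel(df["y"], df[cols], distr="logit").fit(method="bfgs", disp=False)

m1, m2, m3 = fit(["theta"]), fit(["theta", "group"]), fit(["theta", "group", "gxt"])
print("uniform DIF    p =", chi2.sf(2 * (m2.llf - m1.llf), df=1))
print("nonuniform DIF p =", chi2.sf(2 * (m3.llf - m2.llf), df=1))
```

Comparing model 2 to model 1 tests uniform DIF (a group main effect), and model 3 to model 2 tests nonuniform DIF (a group-by-ability interaction).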

261 citations


Journal ArticleDOI
TL;DR: Through this work, it has become clear that differences in raw scores of different groups cannot be used to infer group differences in theoretical attributes unless the test scores accord with a particular set of model invariance restrictions.
Abstract: The question whether observed differences in psychometric test scores can be attributed to differences in the properties that such tests measure is relevant in many research domains; examples include the proper interpretation of differences in intelligence test scores across different generations of people [1], gender differences in affectivity [2], and cross-cultural differences in personality. This question has also generated some of the most conspicuous controversies in the social and life sciences, where the highest temperature in the many heated discussions around the topic has, without a doubt, been reached in the debate on IQ-score differences between ethnic groups in the United States [4,5]. Such debates are often unproductive because of a lack of unambiguous characterizations of concepts like "biased," "incomparable," and "culture-fair." Terms are easily coined, as is illustrated by Johnson's [6] count of no less than 55 types of measurement equivalence; however, it is often less easy to spell out their meaning in terms of their empirical consequences. Without at least some degree of precision in one's conception of a term like "equivalence," it is difficult to have a scientifically productive debate, or even to agree on what aspects of empirical data are relevant for answering the questions involved. It is for this reason that the establishment of concepts like measurement invariance and bias in an unambiguous, formal framework with testable consequences [7-9] represents a theoretical development of great importance. Through this work, it has become clear that differences in raw scores (e.g., IQ scores) of different groups (e.g., blacks and whites) cannot be used to infer group differences in theoretical attributes (e.g., general intelligence) unless the test scores accord with a particular set of model invariance restrictions. Namely, the same attribute must relate to the same set of observations in the same way in each group. Statistically, this means that the mathematical function that relates latent variables to the observations must be the same in each of the groups involved in the comparison [7,8]. This idea has become known as the requirement of measurement invariance. The theoretical definitions of measurement invariance and bias are very general, and apply to different models, such as item response theory (IRT) and factor models, in roughly the same way [10,11]. This does not hold for the empirical methods available for testing measurement invariance. In the past decades, psychometricians working on measurement invariance have produced many different statistical techniques to assess differential item functioning (DIF). These techniques usually employ different statistical assumptions, for instance, regarding the form of the relation between latent and observed variables and the shape of the population distribution on the latent variable, and employ different modeling strategies as well as selection criteria for flagging items as biased. For this reason, it is difficult to assess the consequences of choosing a particular technique; moreover, it is not always clear to what extent the choice of technique makes a difference with respect to the diagnosis of measurement invariance and bias in applied situations. For this reason, the articles on DIF collected here (by Crane et al [12]; Dorans and Kulick [13]; Jones [14]; Morales, Flowers, Gutierrez, Kleinman, and Teresi [15]; and Edelen Orlando et al [16]) represent a useful project in the application of bias detection methods.
Each set of authors analyzes the Mini-Mental State Examination (MMSE) for measurement invariance using the same data, albeit with different methods. Together, the articles provide...
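
The invariance requirement sketched in this passage has a compact standard formalization (Mellenbergh-style notation, not the editorial's own):

```latex
% An item (or test) is measurement invariant / unbiased with respect to
% group membership g when the observed-response distribution given the
% latent attribute \theta does not depend on g:
f(X \mid \theta, g) \;=\; f(X \mid \theta) \qquad \text{for all groups } g .
```

DIF is exactly the item-level failure of this equality: two respondents with the same \theta but different g have different response distributions on the item.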

249 citations


Book ChapterDOI
TL;DR: This chapter presents a description of many of the commonly employed methods in the detection of item bias, and outlines a set of six steps that practitioners can use in conducting DIF analyses.
Abstract: Publisher Summary This chapter presents a description of many of the commonly employed methods in the detection of item bias. Because much of the statistical detection of item bias makes use of differential item functioning (DIF) procedures, the majority of this chapter focuses on the description of statistical methods for the analysis of DIF. DIF detection procedures for dichotomous and polytomous items are presented in the chapter, along with methods for the categorization of DIF effect in dichotomous items. It also presents several recent innovations in DIF detection, including Bayesian applications, the detection of differential test functioning, and studies examining sources or explanations of DIF. While much of this chapter focuses on the statistical approaches to measuring DIF, conducting a comprehensive DIF analysis requires a series of steps aimed at measuring DIF and ensuring that the obtained DIF statistics are interpreted appropriately. The chapter also outlines a set of six steps that practitioners can use in conducting DIF analyses. The steps are demonstrated using a real dataset.
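
Among the dichotomous-item procedures such chapters conventionally cover is the Mantel-Haenszel statistic with the ETS delta metric and A/B/C effect-size categories. A simplified sketch (the full ETS rules also involve significance tests, omitted here; all counts are invented):

```python
# Mantel-Haenszel DIF for one dichotomous item, with the ETS delta metric and
# simplified A/B/C size categories. Per score stratum k, counts are
# (ref_correct, ref_wrong, focal_correct, focal_wrong).
import math

strata = [(40, 10, 35, 15), (30, 20, 25, 25), (15, 35, 10, 40)]

num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
alpha_mh = num / den                     # MH common odds ratio
d_dif = -2.35 * math.log(alpha_mh)       # ETS delta metric (MH D-DIF)

category = "A" if abs(d_dif) < 1.0 else "B" if abs(d_dif) < 1.5 else "C"
print(f"alpha_MH = {alpha_mh:.3f}, MH D-DIF = {d_dif:.2f}, ETS category {category}")
```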

204 citations


Journal ArticleDOI
TL;DR: An integrated approach to the examination of measurement equivalence, invariance, and DIF is necessary for measurement in an increasingly multi-ethnic society.
Abstract: Background and objectives Reviewed in this article are topics related to the study of invariance and differential item functioning (DIF) that have received relatively little attention in the literature. Several factors influence DIF detection; these include (1) model fit, (2) model assumptions, (3) disability distributions, (4) purification, (5) cutoff values for magnitude measures, and (6) sample and scale size. Methods Approaches to DIF detection are discussed in terms of model assumptions, purification, magnitude and impact, and possible advantages and disadvantages of each method. Conclusions An integrated approach to the examination of measurement equivalence, invariance, and DIF is necessary for measurement in an increasingly multi-ethnic society. Ideally, qualitative analyses should be performed in an iterative fashion to inform about findings of DIF. However, if an already-developed measure is being evaluated, then the steps might be to focus first on dimensional invariance using factor analytic methods, followed by DIF analyses examining both significance and magnitude of DIF, accompanied by formal tests of the impact of DIF. The DIF analytic method selected in the second step might be determined based on the findings summarized in the table presented within this paper.

173 citations


Journal ArticleDOI
TL;DR: This paper examined measurement equivalence of the Satisfaction with Life Scale between American and Chinese samples using multigroup Structural Equation Modeling (SEM), Multiple Indicator Multiple Cause Model (MIMIC), and Item Response Theory (IRT).

153 citations


Journal ArticleDOI
TL;DR: In this article, the authors investigated the use of response time to assess the amount of examinee effort received by individual test items and found that the strongest predictors of the effort received by items were item length (i.e., how much reading or scanning was required) and item position.
Abstract: In low-stakes testing, the motivation levels of examinees are often a matter of concern to test givers because a lack of examinee effort represents a direct threat to the validity of the test data. This study investigated the use of response time to assess the amount of examinee effort received by individual test items. In 2 studies, it was found that the strongest predictors of the effort received by items were item length (i.e., how much reading or scanning was required) and item position. In addition, it was found that by treating item responses resulting from rapid guesses as missing, item means and item-total correlations were differentially affected and test score reliability decreased, whereas validity increased. Several implications of these results for low-stakes testing are discussed.
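
The missing-data treatment the abstract describes can be sketched directly. Assuming matrices of response times and scored responses, and a crude fixed time threshold (the study's findings imply realistic thresholds should vary with item length; all names and data below are invented):

```python
# Treat rapid guesses (responses faster than a threshold) as missing, then
# recompute item statistics. Data and the 3-second threshold are invented.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n, k = 300, 5
rt = rng.gamma(shape=2.0, scale=6.0, size=(n, k))   # response times in seconds
resp = rng.binomial(1, 0.7, size=(n, k)).astype(float)

threshold = 3.0                          # real thresholds vary with item length
filtered = np.where(rt < threshold, np.nan, resp)   # rapid guesses -> missing

items = pd.DataFrame(filtered, columns=[f"item{i+1}" for i in range(k)])
total = items.sum(axis=1)                            # NaNs skipped by default
print("item means after filtering:", items.mean().round(2).to_dict())
print("item-total correlations:   ", items.corrwith(total).round(2).to_dict())
```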

151 citations


Journal ArticleDOI
01 Jan 2006-Brain
TL;DR: The 88-item Multiple Sclerosis Spasticity Scale (MSSS-88) is a reliable and valid, patient-based, interval-level measure of the impact of spasticity in multiple sclerosis that has the potential to advance outcomes measurement in clinical trials and clinical practice, and provides a new perspective in the clinical evaluation ofSpasticity.
Abstract: Spasticity is most commonly defined as an inappropriate, velocity-dependent increase in muscle tonic stretch reflexes, due to the amplified reactivity of motor segments to sensory input. It forms one component of the upper motor neuron syndrome and often leads to muscle stiffness and disability. Spasticity can, therefore, be measured through electrophysiological, biomechanical and clinical evaluation, the last most commonly using the Ashworth scale. None of these techniques incorporates the patient experience of spasticity, nor how it affects people's daily lives. Consequently, we set out to construct a rating scale to quantify patients' perspectives on the impact of spasticity in multiple sclerosis. Qualitative methods (in-depth patient interviews and focus groups, expert opinion and literature review) were used to develop a conceptual framework of spasticity impact, and to generate a pool of items with the potential to convert this framework into a rating scale with multiple dimensions. This item pool was administered, in the form of a questionnaire, to a sample of people with multiple sclerosis and spasticity. Guided by Rasch analysis, we constructed and validated a rating scale for each component of the conceptual framework. Decisions regarding item selection were based on the integration and assimilation of seven specific analyses including clinical meaning, ordering of thresholds, fit statistics and differential item functioning. The qualitative phase (17 patient interviews, 3 focus groups) generated 144 potential scale items and a conceptual model with eight components addressing symptoms (muscle stiffness, pain and discomfort, and muscle spasms), physical impact (activities of daily living, walking and body movements) and psychosocial impact (emotional health, social functioning). The first postal survey was sent to 272 people with multiple sclerosis and had a response rate of 88%. Findings supported the development of scales for each component but demonstrated that five response options per item were too many. The 144-item questionnaire, reformatted with four response options per item, was administered with four validating instruments to an independent sample of 259 people with multiple sclerosis (response rate 78%). From the responses, an 88-item instrument with eight subscales was developed that satisfied criteria for reliable and valid measurement. Correlations with other measures were consistent with predictions. The 88-item Multiple Sclerosis Spasticity Scale (MSSS-88) is a reliable and valid, patient-based, interval-level measure of the impact of spasticity in multiple sclerosis. It has the potential to advance outcomes measurement in clinical trials and clinical practice, and provides a new perspective in the clinical evaluation of spasticity.

Journal ArticleDOI
TL;DR: The authors show that the size of the effect can be expressed by a presentation of the values of the parameter estimates derived from the fitted model, and develop a case study of the description of effect size for research reporting in the context of item response theory.
Abstract: The psychological literature currently emphasizes reporting the "effect size" of research findings in addition to the outcome of any tests of significance. However, some confusion may result from the fact that there are three distinct uses of effect sizes in the psychological literature, namely, power analysis, research synthesis, and research reporting. The authors review these uses of effect sizes and develop a case study of the description of effect size for research reporting in the context of item response theory. For many parametric models, hypotheses are tested by comparing the values of directly interpretable parameters. The authors show that the size of the effect can be expressed by a presentation of the values of the parameter estimates derived from the fitted model. Studies that use item response theory to detect differential item functioning provide illustrations.

Journal ArticleDOI
TL;DR: The articles addressing differential item functioning (DIF) and factorial invariance in this special issue of Medical Care [1-9] are uniformly excellent, and readers will find that each article makes an important contribution to the measurement literature.
Abstract: The articles addressing differential item functioning (DIF) and factorial invariance in this special issue of Medical Care [1-9] are uniformly excellent and readers will find that each article makes an important contribution to the measurement literature. The suggestion to have researchers apply various...

Journal ArticleDOI
TL;DR: A reduced 18-item measure demonstrating strong clinical utility is proposed, with scores of 8 or greater implying greater need for treatment.
Abstract: The Rutgers Alcohol Problem Index (RAPI; H. R. White & E. W. Labouvie, 1989) is a frequently used measure of alcohol-related consequences in adolescents and college students, but psychometric evaluations of the RAPI are limited and it has not been validated with college students. This study used item response theory (IRT) to examine the RAPI on students (N = 895; 65% female, 35% male) assessed in both high school and college. A series of 2-parameter IRT models were computed, examining differential item functioning across gender and time points. A reduced 18-item measure demonstrating strong clinical utility is proposed, with scores of 8 or greater implying greater need for treatment.

Journal ArticleDOI
TL;DR: This article used differential item functioning to evaluate the comparability of translated items at two different points in time, after the initial translation and 4 years later after the translations were revisited using a more rigorous translation model.
Abstract: Guidelines for translating educational and psychological assessments for use across different languages and cultures have been developed by the International Test Commission and the Joint Committee on Standards for Educational and Psychological Testing. Common themes in these guidelines and standards are that, when translating items, both judgmental and statistical techniques should be used to ensure item comparability across languages, and that rigorous quality-control steps should be included in the translation process. In this study, the authors use differential item functioning methodology to evaluate the comparability of translated items at two different points in time—after the initial translation and 4 years later, after the translations were revisited using a more rigorous translation model. The results indicated that the revised translations led to improvements in some but not all items. Improvements in the process of translating survey items, even when based on accepted professional standards, should be stat...

Journal ArticleDOI
TL;DR: IRT and the likelihood-based model comparison approach comprise a powerful tool for DIF detection that can aid in the development, refinement, and evaluation of measures for use in ethnically diverse populations.
Abstract: Background: An important part of examining the adequacy of measures for use in ethnically diverse populations is the evaluation of differential item functioning (DIF) among subpopulations such as those administered the measure in different languages. A number of methods exist for this purpose. Objective: The objective of this study was to introduce and demonstrate the identification of DIF using item response theory (IRT) and the likelihood-based model comparison approach. Methods: Data come from a sample of community-residing elderly who were part of a dementia case registry. A total of 1578 participants were administered either an English (n = 913) or Spanish (n = 665) version of the 21-item Mini-Mental State Examination. IRT was used to identify language DIF in these items with the likelihood-based model comparison approach. Results: Fourteen of the 21 items exhibited significant DIF according to language of administration. However, because the direction of the identified DIF was not consistent for one language version over the other, the impact at the scale level was negligible. Conclusions: IRT and the likelihood-based model comparison approach comprise a powerful tool for DIF detection that can aid in the development, refinement, and evaluation of measures for use in ethnically diverse populations.
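
The scale-level conclusion (DIF in both directions canceling in the aggregate) can be illustrated by comparing expected total scores, i.e., test characteristic curves, under the two languages' item parameters. The 2PL parameters below are invented and merely mimic offsetting difficulty shifts:

```python
# Compare expected total scores (test characteristic curves) under the two
# languages' 2PL item parameters; offsetting DIF yields a near-zero difference.
import numpy as np

theta = np.linspace(-3, 3, 61)

def tcc(a, b):
    # expected number-correct score at each theta under a 2PL model
    return (1 / (1 + np.exp(-a * (theta[:, None] - b)))).sum(axis=1)

a = np.array([1.0, 1.2, 0.8, 1.5, 1.1])             # shared discriminations
b_english = np.array([-1.0, 0.0, 0.5, 1.0, -0.5])
b_spanish = b_english + np.array([0.4, -0.3, 0.3, -0.4, 0.0])  # offsetting DIF

diff = np.abs(tcc(a, b_english) - tcc(a, b_spanish))
print("max |TCC difference| across theta:", float(diff.max().round(3)))
```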

Journal ArticleDOI
TL;DR: Skindex-29 data from 454 Italian dermatological patients were subjected to Rasch analysis to investigate threshold ordering, differential item functioning (DIF), and item and overall fit to the model.

Journal ArticleDOI
TL;DR: In this article, the strengths of the Rasch model as a psychometric tool and analysis technique are discussed, referring to person-item maps, anchoring, differential item functioning, and person-item fit.
Abstract: Recent international studies note that countries whose students perform well on international science assessments report the need to change science education. Some countries use assessments for diagnostic purposes to assist teachers in addressing their students' needs. However, in the United States, standards-based reform has focused the national discussion on documenting students' attainment of high educational standards. Students' science achievement is one of those standards, and in many states, “high-stakes” tests determine the resultant achievement measures. Policymakers and administrators use those tests to rank school performance, to prevent students' graduation, and to evaluate teachers. With science test measures used in different ways, statistical confidence in the measures' validity and reliability is essential. Using a science achievement test from one state's systemic reform project as an example, this paper discusses the strengths of the Rasch model as a psychometric tool and analysis technique, referring to person-item maps, anchoring, differential item functioning, and person-item fit. Furthermore, the paper proposes that science educators should carefully inspect the tools they use to measure and document changes in educational systems.

Journal ArticleDOI
TL;DR: An integrated approach to the quantitative methods used in this special issue to examine measurement equivalence is provided; factor analytic and DIF detection methods provide unique information and can be viewed as complementary in informing about measurement equivalence.
Abstract: Background: Reviewed in this article are issues relating to the study of invariance and differential item functioning (DIF). The aim of factor analyses and DIF, in the context of invariance testing, is the examination of group differences in item response conditional on an estimate of disability...

Journal ArticleDOI
TL;DR: Failing to account for measurement differences may lead to spurious inferences regarding language group differences in the underlying level of cognitive functioning, and the MIMIC model can be used to detect and adjust for such measurement differences in substantive research.
Abstract: Background: Knowledge of the extent to which measurement of adult cognitive functioning differs between Spanish and English language administrations of the Mini-Mental State Examination (MMSE) is critical for inclusive, representative, and valid research of older adults in the United States. Objectives: We sought to demonstrate the use of an item response theory (IRT) based structural equation model, that is, the MIMIC model (multiple indicators, multiple causes), to evaluate MMSE responses for evidence of differential item functioning (DIF) attributable to language of administration. Subjects: We studied participants in a dementia case registry study (n = 1546), 42% of whom were examined with the Spanish language MMSE. Results: Twelve of 21 items were identified as having significant uniform DIF. The 4 most discrepant included orientation to season, orientation to state, repeat phrase, and follow command. DIF accounted for two-thirds of the observed difference in underlying level of cognitive functioning between Spanish- and English-language administration groups. Conclusions: Failing to account for measurement differences may lead to spurious inferences regarding language group differences in the underlying level of cognitive functioning. The MIMIC model can be used to detect and adjust for such measurement differences in substantive research.
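
In the usual MIMIC formulation (standard notation, not necessarily the article's), uniform DIF appears as a direct effect of the grouping covariate on an item, over and above the covariate's effect on the latent variable:

```latex
% MIMIC model: item x_i loads on latent cognitive level \eta, and group g
% (e.g., language of administration) may affect x_i directly.
x_i = \lambda_i \eta + \beta_i g + \varepsilon_i, \qquad
\eta = \gamma g + \zeta
```

A nonzero \beta_i flags uniform DIF in item i, while \gamma carries the adjusted group difference in the latent cognitive level; omitting the \beta_i paths forces any item-level DIF into \gamma, which is the spurious inference the Conclusions warn about.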

Journal ArticleDOI
TL;DR: The simulated SFS CAT was efficient and produced precise, clinically relevant measures of functional status with good discriminating ability.

Journal ArticleDOI
TL;DR: To further evaluate these models and characterize item and test functioning, multidimensional representations of statistics such as information, difficulty, and discrimination for the M-3PL and M-2PPC are presented.
Abstract: Multidimensional item response theory (IRT) models have been proposed for better understanding the dimensional structure of data or to define diagnostic profiles of student learning. A compensatory multidimensional two-parameter partial credit model (M-2PPC) for constructed-response items is presented that is a generalization of those proposed to date, along with a compensatory multidimensional three-parameter logistic model for multiple-choice data (M-3PL). Estimation of these models using Markov chain Monte Carlo methods is discussed. To further evaluate these models and characterize item and test functioning, multidimensional representations of statistics such as information, difficulty, and discrimination for the M-3PL and M-2PPC are presented. Many assessment programs have a mixture of item types in which multiple choice and constructed response are administered together. An example is presented in which the dimensional structure of a test containing mixed item types is examined. Goodness-of-fit testing under various model formulations and derived statistics are discussed. Index terms: item response theory, partial credit model, MIRT, item information, item statistics, MCMC
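
Reckase-style multidimensional analogues of discrimination and difficulty give a flavor of such statistics; whether these match the article's exact definitions is an assumption, and the parameters below are invented:

```python
# Reckase-style multidimensional item statistics for invented 2-dimensional
# compensatory items: slopes a (item x dimension) and intercepts d.
import numpy as np

a = np.array([[1.2, 0.3], [0.4, 1.1], [0.9, 0.9]])
d = np.array([-0.5, 0.8, 0.0])

mdisc = np.sqrt((a ** 2).sum(axis=1))     # overall multidimensional discrimination
mdiff = -d / mdisc                        # signed distance from the origin
angles = np.degrees(np.arccos(a / mdisc[:, None]))  # direction of best measurement

print("MDISC:", mdisc.round(2))
print("MDIFF:", mdiff.round(2))
print("angles with axes (deg):\n", angles.round(1))
```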

Journal ArticleDOI
TL;DR: A simulation study shows that the new procedure is feasible in practice, and that when the latent distribution is not well approximated as normal, two-parameter logistic (2PL) item parameter estimates and expected a posteriori scores (EAPs) can be improved over what they would be with the normal model.
Abstract: The purpose of this paper is to introduce a new method for fitting item response theory models with the latent population distribution estimated from the data using splines. A spline-based density estimation system provides a flexible alternative to existing procedures that use a normal distribution, or a different functional form, for the population distribution. A simulation study shows that the new procedure is feasible in practice, and that when the latent distribution is not well approximated as normal, two-parameter logistic (2PL) item parameter estimates and expected a posteriori scores (EAPs) can be improved over what they would be with the normal model. An example with real data compares the new method and the extant empirical histogram approach.
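
The role the estimated latent density plays in EAP scoring can be sketched with quadrature. Here an arbitrary bimodal grid density stands in for the article's spline-based estimate, and the 2PL item parameters and response pattern are invented:

```python
# EAP scoring with a flexible latent density: quadrature weights come from the
# density estimate. A bimodal grid density stands in for the spline estimate.
import numpy as np

nodes = np.linspace(-4, 4, 81)
dens = (np.exp(-0.5 * ((nodes - 0.5) / 0.8) ** 2)
        + 0.4 * np.exp(-0.5 * ((nodes + 1.5) / 0.6) ** 2))
weights = dens / dens.sum()              # discretized latent density

a = np.array([1.0, 1.4, 0.7])            # invented 2PL discriminations
b = np.array([-0.5, 0.3, 1.0])           # invented 2PL difficulties
x = np.array([1, 1, 0])                  # one observed response pattern

p = 1 / (1 + np.exp(-a * (nodes[:, None] - b)))     # P(correct | theta node)
like = (p ** x * (1 - p) ** (1 - x)).prod(axis=1)   # pattern likelihood
post = like * weights                                # unnormalized posterior
print("EAP estimate:", round(float((nodes * post).sum() / post.sum()), 3))
```

Replacing `weights` with normal-density weights recovers the conventional EAP; the paper's point is that when the true latent distribution is nonnormal, the flexible density yields better item parameter estimates and EAPs.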

Journal ArticleDOI
TL;DR: These results support the measurement validity of both questionnaires in PD, although the FACIT-F displayed better measurement precision and modest psychometric advantages over the FSS.

Journal ArticleDOI
TL;DR: This paper compares a qualitative and a quantitative method of item assessment for developing the content of a food insecurity scale for Bangladesh and lends added confidence to the use of either scale for identifying food-insecure households in different regions of Bangladesh.
Abstract: This paper compares a qualitative and a quantitative (Rasch) method of item assessment for developing the content of a food insecurity scale for Bangladesh. Data are derived from the Bangladesh Food Insecurity Measurement and Validation Study, in which researchers collected 2 rounds of ethnographic information and 3 rounds of conventional household survey data between 2001 and 2003. The qualitative method of scale development relied on content experts and respondents themselves to evaluate household food insecurity items generated through ethnographic research. The quantitative method applied the Rasch model to assess the fit of the same items using representative survey data. The Rasch model was then used to test for differential item functioning (DIF) across diverse demographic and geographic subgroups. The qualitative assessment flagged and discarded 10 items, leaving 13. The Rasch assessment of infit and outfit flagged 3 items, and the Rasch DIF test discarded another 10 items, leaving a total of 10 items in the Rasch-derived scale. The 2 scales contained 8 of the same items. The qualitatively and quantitatively derived scales were highly correlated (r = 0.96, P < 0.01), and the 2 methods located 90% of households in the same food insecurity tercile. This convergence lends added confidence to the use of either scale for identifying food-insecure households in different regions of Bangladesh. Multiple methods should continue to be applied in a systematic and transparent way to lend additional credence to the results when they converge and to pinpoint directions for further clarification where they do not.
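
The infit and outfit mean-square statistics used in the Rasch assessment can be computed directly from model residuals. A sketch on simulated dichotomous responses (illustrative only; flagging thresholds vary by application):

```python
# Rasch infit/outfit mean squares for dichotomous items on simulated data.
import numpy as np

rng = np.random.default_rng(3)
n, k = 200, 6
theta = rng.normal(size=n)
b = np.linspace(-1.5, 1.5, k)
p = 1 / (1 + np.exp(-(theta[:, None] - b)))   # Rasch P(x = 1)
x = rng.binomial(1, p)

w = p * (1 - p)                               # model variance of each response
z2 = (x - p) ** 2 / w                         # squared standardized residuals
outfit = z2.mean(axis=0)                      # unweighted (outlier-sensitive)
infit = ((x - p) ** 2).sum(axis=0) / w.sum(axis=0)  # information-weighted

print("outfit:", outfit.round(2))
print("infit: ", infit.round(2))
# A common screening heuristic flags mean squares far outside roughly 0.7-1.3.
```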

Journal ArticleDOI
TL;DR: The 19-item WHOQOL-BREF measures more succinct latent traits than the original design, and yields not only more accurate estimates of the correlation between domains but also substantially higher reliabilities than the standard unidimensional approach.
Abstract: Objective: This study examined the construct validity, and improved the test reliability and the estimation accuracy for the correlation between domains of the WHOQOL-BREF using multidimensional Rasch analysis. Method: A total of 13,083 adults were administered the 28-item WHOQOL-BREF Taiwan version, which consists of 4 subscales (domains). The multidimensional form of the partial credit model was used to examine the fit of the 4 subscales. For comparison, each subscale individually was also fitted to the unidimensional partial credit model. Standard item fit statistics and analysis of differential item functioning (DIF) were used to check model-data fit. Results: After excluding 2 overall items and deleting 7 DIF items, the remaining items of each subscale in the WHOQOL-BREF constituted a single construct. The test reliabilities and correlations between domains obtained from the multidimensional approach, (0.82–0.86) and (0.79–0.89), respectively, were much higher than those obtained from the unidimensional approach, (0.67–0.75) and (0.53–0.65), respectively. Conclusion: The 19-item WHOQOL-BREF measures more succinct latent traits than the original design. The multidimensional approach yields not only more accurate estimates for the correlation between domains but also substantially higher reliabilities, than the standard unidimensional approach.

Journal ArticleDOI
TL;DR: The use of different methods to examine DIF in relation to English and Spanish language administration of the Mini-Mental State Examination has practical and theoretical implications in the context of health disparities.
Abstract: Background: Various forms of differential item functioning (DIF) in the Mini-Mental State Examination (MMSE) have been identified. Items have been found to perform differently for individuals of different educational levels, racial/ethnic groups, and/or groups whose first language is not English...

Journal ArticleDOI
TL;DR: The results of this study suggest that the EPDS, in its original 10-item form, is not a viable scale for the unidimensional measurement of depression; Rasch analysis suggests that a revised eight-item version (EPDS-8) would provide a more psychometrically robust scale.
Abstract: Background: The Edinburgh Postnatal Depression Scale (EPDS) is a 10-item self-rating post-natal depression scale which has seen widespread use in epidemiological and clinical studies. Concern has been raised over the validity of the EPDS as a single summed scale, with suggestions that it measures two separate aspects, one of depressive feelings, the other of anxiety. Methods: As part of a larger cross-sectional study conducted in Melbourne, Australia, a community sample (324 women, ranging in age from 18 to 44 years: mean = 32 yrs, SD = 4.6) was obtained by inviting primiparous women to participate voluntarily in this study. Data from the EPDS were fitted to the Rasch measurement model and tested for appropriate category ordering, for item bias through differential item functioning (DIF) analysis, and for unidimensionality through tests of the assumption of local independence. Results: Rasch analysis of the data from the ten-item scale initially demonstrated a lack of fit to the model, with a significant item-trait interaction total chi-square (chi-square = 82.8, df = 40; p < .001). Removal of two items (items 7 and 8) resulted in a non-significant item-trait interaction total chi-square, with a residual mean value for items of -0.467 and a standard deviation of 0.850, showing fit to the model. No DIF existed in the final 8-item scale (EPDS-8) and all items showed fit to model expectations. Principal Components Analysis of the residuals supported the local independence assumption and unidimensionality of the revised EPDS-8 scale. Revised cut points were identified for EPDS-8 to maintain the case identification of the original scale. Conclusion: The results of this study suggest that the EPDS, in its original 10-item form, is not a viable scale for the unidimensional measurement of depression. Rasch analysis suggests that a revised eight-item version (EPDS-8) would provide a more psychometrically robust scale. The revised cut points of 7/8 and 9/10 for the EPDS-8 show high levels of agreement with the original case identification for the EPDS-10.

Journal ArticleDOI
TL;DR: The VA LV VFQ-48 is a sensitive measure of changes that occur in visual ability as a result of vision rehabilitation and can be used to compare programs that offer different levels of intervention and serve patients across the continuum of vision loss.
Abstract: PURPOSE. To evaluate the sensitivity to change, in patients who undergo vision rehabilitation, of the Veterans Affairs (VA) Low Vision Visual Functioning Questionnaire (LV VFQ-48), which was designed to measure the difficulty visually impaired persons have in performing daily activities and to evaluate vision rehabilitation outcomes. METHODS. Before and after rehabilitation, the VA LV VFQ-48 was administered by telephone interview to subjects from five sites in the VA and private sector. Visual acuity of these subjects ranged from near normal to total blindness. RESULTS. The VA LV VFQ-48 exhibited significant differential item functioning (DIF) for 7 of 48 items (two mobility tasks, four reading tasks, and one distance-vision task). However, the DIF was small relative to baseline changes in item difficulty for all items. Therefore, the data were reanalyzed with the constraint that item difficulties do not change with rehabilitation, which assigns all changes to the person measure. Subjects in the inpatient Blind Rehabilitation Center (BRC) program showed the largest changes in person measures after vision rehabilitation (effect size = 1.9; t-test P < 0.0001). The subjects in the outpatient programs exhibited smaller changes in person measures after rehabilitation (effect size = 0.29; t-test P < 0.01). There was no significant change in person measures for the control group (test-retest before rehabilitation). CONCLUSIONS. In addition to being a valid and reliable measure of visual ability, the VA LV VFQ-48 is a sensitive measure of changes that occur in visual ability as a result of vision rehabilitation. Patients' self-reports of the difficulty they experience performing daily activities, measured with this instrument, can be used to compute a single number, the person measure, which can serve as an outcome measure in clinical studies. The VA LV VFQ-48 can be used to compare programs that offer different levels of intervention and serve patients across the continuum of vision loss.

Journal ArticleDOI
TL;DR: Extending the noniterative estimators of Camilli and Penfield (1997), the authors propose two estimators of the DIF effect variance for tests containing dichotomous and polytomous items, and a small simulation study assesses their statistical properties.
Abstract: One approach to measuring unsigned differential test functioning is to estimate the variance of the differential item functioning (DIF) effect across the items of the test. This article proposes two estimators of the DIF effect variance for tests containing dichotomous and polytomous items. The proposed estimators are direct extensions of the noniterative estimators developed by Camilli and Penfield (1997) for tests composed of dichotomous items. A small simulation study is reported in which the statistical properties of the generalized variance estimators are assessed, and guidelines are proposed for interpreting values of DIF effect variance estimators.
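
The underlying idea admits a very simple unweighted method-of-moments sketch: the variance of the per-item DIF estimates minus their average sampling variance. The article's generalized estimators are weighted refinements of this logic, so the formula below is illustrative, not the paper's; all numbers are invented:

```python
# Unweighted method-of-moments estimate of DIF effect variance: the observed
# variance of per-item DIF estimates minus their mean sampling variance.
import numpy as np

d = np.array([0.12, -0.35, 0.08, 0.51, -0.22, 0.02])  # per-item DIF estimates
se = np.array([0.15, 0.18, 0.14, 0.20, 0.16, 0.15])   # their standard errors

tau2 = d.var(ddof=1) - (se ** 2).mean()   # truncate at zero: a variance can't be < 0
print("estimated DIF effect variance:", round(max(float(tau2), 0.0), 4))
```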