
Showing papers on "Differential item functioning published in 2011"


Journal ArticleDOI
TL;DR: The current investigation advances the technique by developing a computational platform integrating both statistical and IRT procedures into a single program, and a Monte Carlo simulation approach was incorporated to derive empirical criteria for various DIF statistics and effect size measures.
Abstract: Logistic regression provides a flexible framework for detecting various types of differential item functioning (DIF). Previous efforts extended the framework by using item response theory (IRT) based trait scores, and by employing an iterative process using group-specific item parameters to account for DIF in the trait scores, analogous to purification approaches used in other DIF detection frameworks. The current investigation advances the technique by developing a computational platform integrating both statistical and IRT procedures into a single program. Furthermore, a Monte Carlo simulation approach was incorporated to derive empirical criteria for various DIF statistics and effect size measures. For purposes of illustration, the procedure was applied to data from a questionnaire of anxiety symptoms for detecting DIF associated with age from the Patient-Reported Outcomes Measurement Information System.
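For readers who want a concrete picture of the logistic regression DIF framework described above, the following is a minimal sketch (not the authors' program): it fits the usual nested logistic models (item ~ matching score, + group, + group-by-score interaction) and forms likelihood-ratio statistics for uniform and nonuniform DIF. The simulated data, the variable names, and the use of an observed matching score in place of an IRT trait score are assumptions made for illustration.

```python
# Minimal logistic-regression DIF sketch (illustrative only; not the paper's software).
# Uniform DIF: a group effect on the item after conditioning on the matching score.
# Nonuniform DIF: a group-by-score interaction.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, n)                       # 0 = reference, 1 = focal (hypothetical groups)
theta = rng.normal(0, 1, n)                         # latent trait
# Simulate one studied item with uniform DIF against the focal group
item = rng.binomial(1, 1 / (1 + np.exp(-(theta - 0.5 * group))))
score = theta + rng.normal(0, 0.5, n)               # observed matching score (stand-in for an IRT trait score)

def fit(design):
    return sm.Logit(item, sm.add_constant(design)).fit(disp=0)

m1 = fit(np.column_stack([score]))                        # baseline: matching score only
m2 = fit(np.column_stack([score, group]))                 # + group (uniform DIF)
m3 = fit(np.column_stack([score, group, score * group]))  # + interaction (nonuniform DIF)

lr_uniform = 2 * (m2.llf - m1.llf)      # chi-square with 1 df
lr_nonuniform = 2 * (m3.llf - m2.llf)   # chi-square with 1 df
print(f"uniform DIF chi2 = {lr_uniform:.2f}, nonuniform DIF chi2 = {lr_nonuniform:.2f}")
```

In the purified procedure the abstract describes, the matching score would be an IRT trait estimate recomputed with group-specific parameters for flagged items, and the critical values for these statistics and the accompanying effect size measures would be derived empirically by Monte Carlo simulation under no-DIF conditions.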

512 citations


Journal ArticleDOI
TL;DR: This example shows that CAT and short forms derived from the PROMIS FIB can reliably estimate fatigue reported by the U.S. general population.

269 citations


Journal ArticleDOI
TL;DR: In this article, the potential of the lmer function from the lme4 package in R for item response theory (IRT) modeling is discussed, and three broad categories of models are described: item covariate models, person covariate models, and person-by-item covariate models.
Abstract: In this paper we elaborate on the potential of the lmer function from the lme4 package in R for item response theory (IRT) modeling. In line with the package, an IRT framework is described based on generalized linear mixed modeling. The aspects of the framework refer to (a) the kind of covariates: their mode (person, item, person-by-item) and whether they are external vs. internal to responses; and (b) the kind of effects the covariates have: fixed vs. random, and if random, the mode across which the effects are random (persons, items). Based on this framework, three broad categories of models are described: item covariate models, person covariate models, and person-by-item covariate models, and within each category three types of more specific models are discussed. The models in question are explained and the associated lmer code is given. Examples of models are the linear logistic test model with an error term, differential item functioning models, and local item dependency models. Because the lme4 package is for univariate generalized linear mixed models, neither the two-parameter and three-parameter models nor the item response models for polytomous response data can be estimated with the lmer function.
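As a rough companion to the GLMM formulation described above (the paper's own examples use lmer formulas in R, roughly of the form resp ~ 0 + item + (1 | person) with a binomial link; the exact calls are given in the paper and not reproduced here), the sketch below fits the same "items fixed, persons random" Rasch model in Python by marginal maximum likelihood with Gauss-Hermite quadrature. The simulated data and variable names are assumptions made for illustration.

```python
# Minimal Rasch-as-GLMM sketch: item easiness as fixed effects, person ability as a
# random effect integrated out by Gauss-Hermite quadrature (illustrative only).
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic function

rng = np.random.default_rng(0)
n_persons, n_items = 500, 10
true_beta = np.linspace(-1.5, 1.5, n_items)           # item easiness (fixed effects)
theta = rng.normal(0.0, 1.0, n_persons)               # person effects (random, N(0, 1))
X = (rng.uniform(size=(n_persons, n_items)) < expit(theta[:, None] + true_beta)).astype(float)

# Probabilists' Gauss-Hermite nodes/weights, normalized to integrate against a standard normal
nodes, weights = np.polynomial.hermite_e.hermegauss(21)
weights = weights / weights.sum()

def neg_marginal_loglik(beta):
    # response probabilities at each quadrature node: shape (n_nodes, n_items)
    p = expit(nodes[:, None] + beta[None, :])
    # log-likelihood of each person's response pattern at each node: (n_persons, n_nodes)
    logf = X @ np.log(p).T + (1 - X) @ np.log(1 - p).T
    # integrate the random person effect out (person SD fixed at 1 here; lme4 would estimate it)
    lik = np.exp(logf) @ weights
    return -np.log(lik).sum()

fit = minimize(neg_marginal_loglik, x0=np.zeros(n_items), method="BFGS")
print(np.round(fit.x, 2))   # estimated item easiness parameters, close to true_beta
```

DIF models in this framework simply add group and group-by-item covariates to the fixed part of the formula, which is the route the paper takes with lmer.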

237 citations


Journal ArticleDOI
TL;DR: The authors used exploratory structural equation modeling and exploratory factor analysis to identify well-differentiated dimensions of bullying and victimization that meet standards of good measurement: goodness of fit, measurement invariance, lack of differential item functioning, and well differentiated factors that are not so highly correlated as to detract from their discriminant validity.
Abstract: Existing research posits multiple dimensions of bullying and victimization but has not identified well-differentiated facets of these constructs that meet standards of good measurement: goodness of fit, measurement invariance, lack of differential item functioning, and well-differentiated factors that are not so highly correlated as to detract from their discriminant validity and substantive usefulness in school settings. Here we demonstrate exploratory structural equation modeling, an integration of confirmatory factor analysis and exploratory factor analysis. On the basis of responses to the 6-factor Adolescent Peer Relations Instrument (verbal, social, physical facets of bullying and victimization), we tested invariance of factor loadings, factor variances and covariances, item uniquenesses, item intercepts (a lack of differential item functioning), and latent means across gender, year in school, and time. Using a combination of relations with student characteristics and a multitrait-multimethod analysis, we showed that the 6 bully/victim factors have discriminant validity over time and in relation to gender, year in school, and relevant psychosocial correlates (e.g., depression, 11 components of academic and nonacademic self-concept, locus of control, attitudes toward bullies and victims). However, bullies and victims are similar in many ways, and longitudinal panel models of the positive correlations between bully and victim factors suggest reciprocal effects such that each is a cause and an effect of the other.

205 citations


Journal ArticleDOI
TL;DR: An efficient full-information maximum marginal likelihood estimator is derived by extending Gibbons and Hedeker's bifactor dimension reduction method so that the optimization of the marginal log-likelihood requires only 2-dimensional integration regardless of the dimensionality of the latent variables.
Abstract: Full-information item bifactor analysis is an important statistical method in psychological and educational measurement. Current methods are limited to single group analysis and inflexible in the types of item response models supported. We propose a flexible multiple-group item bifactor analysis framework that supports a variety of multidimensional item response theory models for an arbitrary mixing of dichotomous, ordinal, and nominal items. The extended item bifactor model also enables the estimation of latent variable means and variances when data from more than one group are present. Generalized user-defined parameter restrictions are permitted within or across groups. We derive an efficient full-information maximum marginal likelihood estimator. Our estimation method achieves substantial computational savings by extending Gibbons and Hedeker’s (1992) bifactor dimension reduction method so that the optimization of the marginal log-likelihood only requires two-dimensional integration regardless of the dimensionality of the latent variables. We use simulation studies to demonstrate the flexibility and accuracy of the proposed methods. We apply the model to study cross-country differences, including differential item functioning, using data from a large international education survey on mathematics literacy.
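The computational payoff mentioned in the abstract comes from the bifactor structure: conditional on the general factor, items belonging to different specific-factor clusters are independent, so the high-dimensional marginal likelihood integral factors into nested one-dimensional integrals. A sketch of that identity (notation introduced here for illustration; see Gibbons & Hedeker, 1992, for the original result):

```latex
P(\mathbf{X} = \mathbf{x})
  = \int_{\theta_0} g(\theta_0)
    \prod_{s=1}^{S} \left[ \int_{\theta_s} g(\theta_s)
      \prod_{i \in I_s} P_i\!\left(x_i \mid \theta_0, \theta_s\right) d\theta_s \right] d\theta_0 ,
```

where θ0 is the general factor, θ1, ..., θS are the specific factors, and I_s is the set of items loading on θ_s. Each evaluation of the marginal likelihood therefore requires only two-dimensional quadrature (over θ0 and, within each cluster, over θ_s), no matter how many specific factors the model contains, which is the property the multiple-group extension preserves.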

198 citations


Journal Article
TL;DR: In this article, the authors reviewed the current psychometric properties of the State Trait Anxiety Inventory (STAI) and compared the original and current values with a t test.
Abstract: Psychometric revision and differential item functioning in the State Trait Anxiety Inventory (STAI). One of the psychological problems with the highest prevalence is anxiety. The State Trait Anxiety Inventory is one of the instruments used to measure it. This questionnaire assesses Trait Anxiety (understood as a personality factor that predisposes one to suffer from anxiety) and State Anxiety (which refers to environmental factors that protect from or generate anxiety). The questionnaire was adapted in Spain in 1982. Therefore, the goal of the study was to review the current psychometric properties of the STAI. A total of 1036 adults took part in the study. Cronbach's alpha reliability was .90 for Trait Anxiety and .94 for State Anxiety. Factor analysis showed similar results compared with the original data. Moreover, a differential item functioning (DIF) analysis was carried out to explore sex bias. Only one of the 40 items showed DIF problems. Lastly, a t test was run comparing the original and current values; whereas Trait Anxiety varied by 1 point, State Anxiety differed by up to 6 points. In general, these results show that the STAI has maintained adequate psychometric properties and has also been sensitive to increased environmental stimuli that produce stress.

190 citations


Journal ArticleDOI
TL;DR: In this article, two major approaches to testing measurement invariance for ordinal measures were investigated: multiple-group categorical confirmatory factor analysis (MCCFA) and item response theory (IRT).
Abstract: This study investigated two major approaches in testing measurement invariance for ordinal measures: multiple-group categorical confirmatory factor analysis (MCCFA) and item response theory (IRT). Unlike the ordinary linear factor analysis, MCCFA can appropriately model the ordered-categorical measures with a threshold structure. A simulation study under various conditions was conducted for the comparison of MCCFA and IRT with respect to the power to detect the lack of invariance across groups. Both MCCFA and IRT showed reasonable power to identify the noninvariant item when differential item functioning (DIF) was large. The false positive rates were relatively high in both methods, however. The adjustment of critical values improved the performance of MCCFA by reducing false positive rates substantially and yet yielding adequate power. Alternative model fit indexes of MCCFA were also examined and they were found to be reliable to detect DIF, in general.

176 citations


Journal ArticleDOI
TL;DR: This article proposes exploratory structural equation modeling (ESEM), an integration of the best aspects of CFA and traditional exploratory factor analyses (EFA), and shows that ESEM fits the data much better and yields substantially more differentiated (less correlated) factors than corresponding CFA models.
Abstract: The most popular measures of multidimensional constructs typically fail to meet standards of good measurement: goodness of fit, measurement invariance, lack of differential item functioning, and well-differentiated factors that are not so highly correlated as to detract from their discriminant validity. Part of the problem, the authors argue, is undue reliance on overly restrictive independent cluster models of confirmatory factor analysis (ICM-CFA) in which each item loads on one, and only one, factor. Here the authors demonstrate exploratory structural equation modeling (ESEM), an integration of the best aspects of CFA and traditional exploratory factor analyses (EFA). On the basis of responses to the 11-factor Motivation and Engagement Scale (n = 7,420, Mage = 14.22), we demonstrate that ESEM fits the data much better and results in substantially more differentiated (less correlated) factors than corresponding CFA models. Guided by a 13-model taxonomy of ESEM full-measurement (mean structure) invarianc...

169 citations


Journal ArticleDOI
TL;DR: A single physical function dimension accounts for most of the item variance in the PPFIB, suggesting that the items predominantly measure a single construct.

161 citations


Journal ArticleDOI
TL;DR: In this article, the authors developed two PF item pools that comprised 32 mobility and 38 upper extremity items and evaluated the scale dimensionality and sources of local dependence (LD) with factor analysis.

152 citations


Journal ArticleDOI
TL;DR: The EQ is an appropriate measure of the construct of empathy; empathy can be measured along a single dimension, and the results suggest that a hierarchical factor of empathy underlies these sub-factors.

Journal ArticleDOI
TL;DR: An effect size index is proposed for confirmatory factor analytic studies of measurement equivalence to address limitations of commonly recommended criteria for evaluating results from these analyses.
Abstract: Because of the practical, theoretical, and legal implications of differential item functioning (DIF) for organizational assessments, studies of measurement equivalence are a necessary first step before scores can be compared across individuals from different groups. However, commonly recommended criteria for evaluating results from these analyses have several important limitations. The present study proposes an effect size index for confirmatory factor analytic (CFA) studies of measurement equivalence to address 1 of these limitations. The application of this index is illustrated with personality data from American English, Greek, and Chinese samples. Results showed a range of nonequivalence across these samples, and these differences were linked to the observed effects of DIF on the outcomes of the assessment (i.e., group-level mean differences and adverse impact). Practitioners and organizational researchers confront a vast number of questions that involve comparing scores on assessment instruments across groups. Are workers more satisfied in organizations with empowerment programs? Are successful salespersons more extraverted? Are employees in a multinational organization more satisfied in one country than employees in another? Moreover, because of the legal and practical implications of using selection assessments that advantage one group over another, group comparisons may be particularly salient during the hiring process. For all of these comparisons to be meaningful, it is essential that the tests and scales provide equivalent measurement across groups. Equivalent measurement is obtained when individuals with the same standing on the trait assessed by the test or scale, but sampled from different groups, have equal expected observed scores (Drasgow, 1984). As such, measurement invariance can be examined by a differential item functioning (DIF) analysis using item-response theory (IRT) or with confirmatory factor analytic (CFA) mean and covariance structure (MACS) analysis. The latter method is the focus of this article. Although several articles have proposed various decision rules for determining if measurement nonequivalence exists with MACS analysis (Cheung & Rensvold, 2002; Hu & Bentler, 1999; Meade, Johnson, & Braddy, 2008), these rules generally involve empirically derived cutoffs or statistical significance tests. As such, the analysis does not address the practical importance of observed differences between groups and does not provide users with information about the effects of nonequivalence on the organizational outcomes of an assessment. In the broader psychological literature, effect size statistics have been proposed to overcome this limitation (Cohen, 1990, 1994; Kirk, 2006; Schmidt, 1996). However, effect size indices for CFA evaluations of measurement equivalence have not yet been developed. In the present study, we propose such an index and examine its application to real-world data. To illustrate its practical importance, we also demonstrate the effects of measurement nonequivalence on the observed outcomes (e.g., means, adverse impact) of group-level comparisons. This information will enable researchers and practitioners to further evaluate the theoretical and practical importance of observed differences.

Journal ArticleDOI
TL;DR: In patients with stroke, the FSS-7 showed better psychometric properties and better potential to detect changes in fatigue over time than the FSS-9 version, suggesting satisfactory grounds for removing items #1 and #2 for its application.

Journal ArticleDOI
TL;DR: The National Institutes of Health's Patient-Reported Outcomes Measurement Information System (PROMIS) Roadmap initiative is a cooperative research program designed to develop, evaluate, and standardize item banks to measure patient-reported outcomes (PROs) across different medical conditions as well as the US population.
Abstract: The National Institutes of Health (NIH) Patient-Reported Outcomes Measurement Information System (PROMIS®) Roadmap initiative (www.nihpromis.org) is a cooperative research program designed to develop, evaluate, and standardize item banks to measure patient-reported outcomes (PROs) across different medical conditions as well as the US population (1). The goal of PROMIS is to develop reliable and valid item banks using item response theory (IRT) that can be administered in a variety of formats including short forms and computerized adaptive tests (CAT)(1-3). IRT is often referred to as “modern psychometric theory,” in contrast to “classic test theory,” or CTT. The basic idea behind both IRT and CTT is that there is some latent construct, or “trait,” underlying an illness experience. This construct cannot be directly measured, but can be indirectly measured by creating items that are scaled and scored. For example, “fatigue,” “pain,” “disability,” or even “happiness” are latent constructs, i.e. subjective feelings – we cannot take a picture, snap an X-Ray to view them, or run a blood test to check for them. However, we know they exist. People can experience more or less of these constructs, thus it is helpful to try to translate that experience into several levels represented by scores. IRT models the associations between items and the latent construct. Specifically, IRT models describe relationships between a respondent's underlying level on a construct and the probability of particular item responses. Tests developed with CTT (such as the Health Assessment Questionnaire-Disability Index(4), the Scleroderma Gastrointestinal Tract instrument(5)) require administering all items, even though only some are appropriate for the persons' trait level. Some items are too high for those with low trait levels (e.g., “can you walk 100 yards” to a patient in a wheelchair) or too low for those with high trait levels (e.g., “can you get up from the chair?” to a runner). In contrast, IRT methods make it possible to estimate person trait levels with any subset of items appropriate for the persons' trait levels in an item pool. As such, any set of items from the pool could be administered as a fixed form or, for greatest efficiency, administered as a CAT. CAT is an approach to administering the subset of items in an item bank that are most informative for measuring the health construct in order to achieve a target standard error of measurement. A good item bank will have items that represent a range of content and difficulty, provide high level of information, and have items that perform equivalently in different subgroups of the target population. How does CAT work? Without prior information, the first item administered in a CAT is typically one of medium trait level. For example, “In the past 7 days I was grouchy” with multi-level response from “never” to “always.” After each response, the person's trait level and associated standard error are estimated. The next item administered to someone not endorsing the first item, is an “easier” item. If the person endorses the first item, the next item administered is a “harder” item. CAT is terminated when the standard error falls below an acceptable value. This provides an estimate of one's score with the minimal number of questions and no loss of measurement precision. In addition, scores from different studies using different items can be compared using a common scale. IRT models estimate the underlying scale score (theta) from the items. 
All items are calibrated on the same metric and independently and collectively provide an estimate of theta. Hence, it is possible to estimate the score using any subset of items and to estimate the standard error of the estimated score. This allows assessment of health outcomes across patients with differing medical conditions (for example, comparing scores of someone with arthritis with those of someone with heart disease) at various degrees of physical and other impairments, both at the lowest and highest ends of trait levels.
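The "How does CAT work?" passage above maps directly onto a short simulation loop: pick the most informative unadministered item at the current trait estimate, update the estimate after each response, and stop once the standard error falls below a target. The sketch below is a generic illustration under a 2PL model with EAP scoring, not PROMIS software; the item parameters, the target SE of 0.3, and the 12-item cap are made-up assumptions.

```python
# Minimal CAT loop: maximum-information item selection, EAP scoring on a grid,
# stop when the standard error reaches a target (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
a = rng.uniform(1.0, 2.5, 50)              # hypothetical item discriminations
b = rng.normal(0.0, 1.0, 50)               # hypothetical item difficulties
true_theta = 1.2                            # simulated respondent's trait level

grid = np.linspace(-4, 4, 81)               # quadrature grid for the posterior
prior = np.exp(-0.5 * grid**2); prior /= prior.sum()

def p(theta, i):                             # 2PL probability of endorsing item i
    return 1.0 / (1.0 + np.exp(-a[i] * (theta - b[i])))

administered, posterior = [], prior.copy()
theta_hat, se = 0.0, np.inf
while se > 0.3 and len(administered) < 12:
    probs = p(theta_hat, np.arange(len(a)))
    info = a**2 * probs * (1 - probs)        # Fisher information at the current estimate
    info[administered] = -np.inf             # never readminister an item
    item = int(np.argmax(info))              # maximum-information selection
    x = int(rng.uniform() < p(true_theta, item))     # simulated response
    administered.append(item)
    like = p(grid, item) if x == 1 else 1 - p(grid, item)
    posterior = posterior * like; posterior /= posterior.sum()
    theta_hat = float(np.sum(grid * posterior))                        # EAP estimate
    se = float(np.sqrt(np.sum((grid - theta_hat)**2 * posterior)))     # posterior SD
print(len(administered), round(theta_hat, 2), round(se, 2))
```

With maximum-information selection, endorsing an item pushes the estimate up so the next item selected is "harder," and denying it does the opposite, which matches the behavior described in the abstract.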

Journal ArticleDOI
TL;DR: The PGQ had acceptably high reliability and validity in people with PGP both during pregnancy and postpartum, it is simple to administer, and it is feasible for use in clinical practice.
Abstract: Background: No appropriate measures have been specifically developed for pelvic girdle pain (PGP). There is a need for suitable outcome measures that are reliable and valid for people with PGP for use in research and clinical practice. Objective: The objective of this study was to develop a condition-specific measure, the Pelvic Girdle Questionnaire (PGQ), for use during pregnancy and postpartum. Design: This was a methodology study. Methods: Items were developed from a literature review and information from a focus group of people who consulted physical therapists for PGP. Face validity and content validity were assessed by classifying the items according to the World Health Organization's International Classification of Functioning, Disability and Health. After a pilot study, the PGQ was administered to participants with clinically verified PGP by means of a postal questionnaire in 2 surveys. The first survey included 94 participants (52 pregnant), and the second survey included 87 participants (43 pregnant). Rasch analysis was used for item reduction, and the PGQ was assessed for unidimensionality, item fit, redundancy, and differential item functioning. Test-retest reliability was assessed with a random sample of 42 participants. Results: The analysis resulted in a questionnaire consisting of 20 activity items and 5 symptom items on a 4-point response scale. The items in both subscales showed a good fit to the Rasch model, with acceptable internal consistency, satisfactory fit residuals, and no disordered threshold. Test-retest reliability showed high intraclass correlation coefficient estimates: .93 (95% confidence interval = 0.86–0.96) for the PGQ activity subscale and .91 (95% confidence interval = 0.84–0.95) for the PGQ symptom subscale. Limitations: The PGQ should be compared with low back pain questionnaires as part of a concurrent evaluation of measurement properties, including validity and responsiveness to change. Conclusions: The PGQ is the first condition-specific measure developed for people with PGP. The PGQ had acceptably high reliability and validity in people with PGP both during pregnancy and postpartum, it is simple to administer, and it is feasible for use in clinical practice.

Journal ArticleDOI
TL;DR: Multigroup confirmatory factor analysis was used to detect differential item functioning (DIF) in factor loadings and intercepts for the Revised NEO Personality Inventory and the results indicate that considerable caution is warranted in cross-cultural comparisons of personality profiles.
Abstract: Measurement invariance is a prerequisite for confident cross-cultural comparisons of personality profiles. Multigroup confirmatory factor analysis was used to detect differential item functioning (DIF) in factor loadings and intercepts for the Revised NEO Personality Inventory (P. T. Costa, Jr., & R. R. McCrae, 1992) in comparisons of college students in the United States (N = 261), Philippines (N = 268), and Mexico (N = 775). About 40%-50% of the items exhibited some form of DIF and item-level noninvariance often carried forward to the facet level at which scores are compared. After excluding DIF items, some facet scales were too short or unreliable for cross-cultural comparisons, and for some other facets, cultural mean differences were reduced or eliminated. The results indicate that considerable caution is warranted in cross-cultural comparisons of personality profiles.

Journal ArticleDOI
TL;DR: A modified HADS-A and HADS-D are unidimensional, free of DIF and have good fit to the Rasch model in this population of patients with MND, suggesting they are suitable for use in MND clinics or research.
Abstract: Background: The Hospital Anxiety and Depression Scale (HADS) is commonly used to assess symptoms of anxiety and depression in motor neurone disease (MND). The measure has never been specifically validated for use within this population, despite questions raised about the scale’s validity. This study seeks to analyse the construct validity of the HADS in MND by fitting its data to the Rasch model. Methods: The scale was administered to 298 patients with MND. Scale assessment included model fit, differential item functioning (DIF), unidimensionality, local dependency and category threshold analysis. Results: Rasch analyses were carried out on the HADS total score as well as depression and anxiety subscales (HADS-T, D and A respectively). After removing one item from both of the seven item scales, it was possible to produce modified HADS-A and HADS-D scales which fit the Rasch model. An 11-item higher-order HADS-T total scale was found to fit the Rasch model following the removal of one further item. Conclusion: Our results suggest that a modified HADS-A and HADS-D are unidimensional, free of DIF and have good fit to the Rasch model in this population. As such they are suitable for use in MND clinics or research. The use of the modified HADS-T as a higher-order measure of psychological distress was supported by our data. Revised cut-off points are given for the modified HADS-A and HADS-D subscales.

Journal ArticleDOI
TL;DR: In this paper, a latent variable interaction is added to the MIMIC model to test for non-uniform DIF, and the approach is tested in simulations with small focal-group N and illustrated with an empirical example using a scale about agoraphobic cognitions.
Abstract: In extant literature, multiple indicator multiple cause (MIMIC) models have been presented for identifying items that display uniform differential item functioning (DIF) only, not nonuniform DIF. This article addresses, for apparently the first time, the use of MIMIC models for testing both uniform and nonuniform DIF with categorical indicators. A latent variable interaction is added to the MIMIC model to test for nonuniform DIF. The approach is tested in simulations with small focal-group N and illustrated with an empirical example using a scale about agoraphobic cognitions. MIMIC-interaction models are compared with MIMIC models without the interaction as well as likelihood ratio DIF testing using item response theory (IRT-LR-DIF). The most important finding is that when the latent moderated structural equations approach is used to estimate the interaction, the Type I error in MIMIC-interaction DIF models is severely inflated.
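A compact way to see the distinction the abstract draws: in a MIMIC DIF model for a categorical item j with latent response y*_j, latent trait η, and grouping covariate z, uniform DIF is a direct effect of z on the item, while nonuniform DIF is carried by the latent interaction term. A schematic of the model (notation introduced here for illustration, following the general MIMIC-interaction idea rather than the article's exact parameterization):

```latex
y^{*}_{j} = \lambda_{j}\,\eta + \beta_{j}\,z + \omega_{j}\,(\eta \times z) + \varepsilon_{j},
\qquad \eta = \gamma\,z + \zeta ,
```

where β_j ≠ 0 indicates uniform DIF, ω_j ≠ 0 indicates nonuniform DIF (a group difference in the item's slope on η), γ captures the group difference on the trait itself, and η × z is the latent variable interaction that the latent moderated structural equations approach is used to estimate.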

Journal ArticleDOI
TL;DR: In this paper, the authors present 15-item short forms of the Wisconsin Schizotypy scales, which are based on psychometric analyses using item response theory, and the items are listed in an Appendix A. Based on data from a sample of young adults (n = 1144).

Journal ArticleDOI
TL;DR: Rasch analysis supports the interpretation that a student's APP score is an indication of their underlying level of professional competence in workplace practice.

Journal ArticleDOI
TL;DR: In this article, the authors discuss the importance of conducting differential item functioning (DIF) analyses using a priori hypotheses whenever possible, and demonstrate how to test for DIF using logistic regression and DIFPACK.
Abstract: The purpose of this manuscript was to help researchers better understand the causes and implications of differential item functioning (DIF), as well as the importance of testing for DIF in the process of test development and validation. The underlying theoretical reason for the presence of DIF is explicated, followed by a discussion of how to test for the presence of DIF using logistic regression and DIFPACK, which includes SIBTEST, PSIBTEST and Crossing SIBTEST. This manuscript stresses the importance of conducting DIF analyses using a priori hypotheses whenever possible. However, the example that is provided, to show researchers and practitioners how to conduct a DIF analysis, utilizes an exploratory DIF analyses paradigm which may often be needed in practical DIF applications. This example uses PSIBTEST to test for DIF, using data from an international assessment that includes a mixture of polytomous and dichotomous items. In addition to demonstrating how to test for DIF, this manuscript demonstrates h...

Journal ArticleDOI
TL;DR: The results suggest that, at the same level of syndrome severity, the severity of psychotic symptoms, including the negative ones, observed in MA psychotic and schizophrenic patients is almost the same.
Abstract: The concept of negative symptoms in methamphetamine (MA) psychosis (e.g., poverty of speech, flattened affect, and loss of drive) is still uncertain. This study aimed to use differential item functioning (DIF) statistical techniques to differentiate the severity of psychotic symptoms between MA psychotic and schizophrenic patients. Data of MA psychotic and schizophrenic patients were those of the participants in the WHO Multi-Site Project on Methamphetamine-Induced Psychosis (or WHO-MAIP study) and the Risperidone Long-Acting Injection in Thai Schizophrenic Patients (or RLAI-Thai study), respectively. To confirm the unidimensionality of psychotic syndromes, we applied exploratory and confirmatory factor analyses (EFA and CFA) to the eight items of the Manchester scale. We conducted the DIF analysis of psychotic symptoms observed in both groups by using nonparametric kernel-smoothing techniques of item response theory. A DIF composite index of 0.30 or greater indicated a difference in symptom severity. The analyses included the data of 168 MA psychotic participants and the baseline data of 169 schizophrenic patients. For both data sets, the EFA and CFA suggested a three-factor model of the psychotic symptoms, including a negative syndrome (poverty of speech, psychomotor retardation and flattened/incongruous affect), a positive syndrome (delusions, hallucinations and incoherent speech) and an anxiety/depression syndrome (anxiety and depression). The DIF composite indexes comparing the severity differences of all eight psychotic symptoms were lower than 0.3. The results suggest that, at the same level of syndrome severity (i.e., negative, positive, and anxiety/depression syndromes), the severity of psychotic symptoms, including the negative ones, observed in MA psychotic and schizophrenic patients is almost the same.

Journal ArticleDOI
TL;DR: Findings suggest that DIF based on items’ scoring direction is not problematic when the Five Facet Mindfulness Questionnaire is used to compare demographically similar meditators and nonmeditators.
Abstract: A recent study of the Five Facet Mindfulness Questionnaire reported high levels of differential item functioning (DIF) for 18 of its 39 items in meditating and nonmeditating samples that were not demographically matched. In particular, meditators were more likely to endorse positively worded items whereas nonmeditators were more likely to deny negatively worded (reverse-scored) items. The present study replicated these analyses in demographically matched samples of meditators and nonmeditators (n = 115 each) and found that evidence for DIF was minimal. There was little or no evidence for differential relationships between positively and negatively worded items for meditators and nonmeditators. Findings suggest that DIF based on items’ scoring direction is not problematic when the Five Facet Mindfulness Questionnaire is used to compare demographically similar meditators and nonmeditators.

Journal ArticleDOI
TL;DR: The results do not support the hypothesis that cumulative DIF for PHQ-9 items spuriously inflates the number of persons with TBI screened as potentially having major depressive disorder; all symptoms can be counted toward the diagnosis of major depressive disorder without special concern about overdiagnosis or unnecessary treatment.

Journal ArticleDOI
TL;DR: In this paper, the authors applied modern statistical approaches in the adaptation and assessment of the psychometric properties of the Peabody Picture Vocabulary Test-Revised (PPVT-R) Greek.
Abstract: Assessment of lexical/semantic knowledge is performed with a variety of tests varying in response requirements. The present study exemplifies the application of modern statistical approaches in the adaptation and assessment of the psychometric properties of the Peabody Picture Vocabulary Test‐Revised (PPVT-R) Greek. Confirmatory factor analyses applied to data from a large sample of elementary school students (N = 585) indicated the existence of a single vocabulary dimension and differential item functioning procedures pointed to minimal bias due to gender or ethnic group. Rasch model‐derived indices of item difficulty and discrimination were used to develop a short form of the test, which was administered to a second sample of 900 students. Convergent and discriminant validity were assessed through comparisons with the Wechsler Intelligence Scales for Children‐III Vocabulary and Block design subtests. Short- and long-term stability of individual scores over a 6-month period were very high, and the utility of the test as part of routine educational assessment is attested by its strong longitudinal predictive value with reading comprehension measures. It is concluded that the Greek version of the PPVT-R constitutes a reliable and valid assessment of vocabulary for Greek students and immigrants who speak Greek.

Journal ArticleDOI
TL;DR: In this article, three distinctive methods of assessing measurement equivalence of ordinal items, namely, confirmatory factor analysis, differential item functioning using item response theory, and latent class factor analysis make different modeling assumptions and adopt different procedures.
Abstract: Three distinctive methods of assessing measurement equivalence of ordinal items, namely, confirmatory factor analysis, differential item functioning using item response theory, and latent class factor analysis, make different modeling assumptions and adopt different procedures. Simulation data are used to compare the performance of these three approaches in detecting the sources of measurement inequivalence. For this purpose, the authors simulated Likert-type data using two nonlinear models, one with categorical and one with continuous latent variables. Inequivalence was set up in the slope parameters (loadings) as well as in the item intercept parameters in a form resembling agreement and extreme response styles. Results indicate that the item response theory and latent class factor models can relatively accurately detect and locate inequivalence in the intercept and slope parameters both at the scale and the item levels. Confirmatory factor analysis performs well when inequivalence is located in the slo...

Journal ArticleDOI
TL;DR: As discussed in this paper, differential item functioning (DIF) analysis is a way of determining whether test items function differently across subgroups of test takers after controlling for ability level, and its results are used to evaluate tests' validity arguments.
Abstract: Differential item functioning (DIF) analysis is a way of determining whether test items function differently across subgroups of test takers after controlling for ability level. DIF results are used to evaluate tests' validity arguments. This study uses Rasch measurement to examine the Michigan English Language Assessment Battery listening test for DIF across gender subgroups. After establishing the unidimensionality and local independence of the data, the authors used two methods to test for DIF: (a) a t-test uniform DIF analysis, which showed that two test items displayed substantive DIF, and favored different gender subgroups; and (b) nonuniform DIF analysis, which revealed several test items with significant DIF, many of which favored low-ability male test takers. A possible explanation for gender-ability DIF is that lower ability male test takers are more likely to attempt lucky guesses, particularly on multiple-choice items with unattractive distracters, and that having only two distracters makes th...

Journal ArticleDOI
TL;DR: This study tested for the presence of differential item functioning (DIF) in DSM-IV Pathological Gambling Disorder (PGD) criteria based on gender, race/ethnicity and age using a nationally representative sample of adults from the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC).
Abstract: This study tested for the presence of differential item functioning (DIF) in DSM-IV Pathological Gambling Disorder (PGD) criteria based on gender, race/ethnicity and age. Using a nationally representative sample of adults from the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), indicating current gambling (n = 10,899), Multiple Indicator-Multiple Cause (MIMIC) models tested for DIF, controlling for income, education, and marital status. Compared to the reference groups (i.e., Male, Caucasian, and ages 25–59 years), women (OR = 0.62; P < .001) and Asian Americans (OR = 0.33; P < .001) were less likely to endorse preoccupation (Criterion 1). Women were more likely to endorse gambling to escape (Criterion 5) (OR = 2.22; P < .001) but young adults (OR = 0.62; P < .05) were less likely to endorse it. African Americans (OR = 2.50; P < .001) and Hispanics were more likely to endorse trying to cut back (Criterion 3) (OR = 2.01; P < .01). African Americans were more likely to endorse the suffering losses (OR = 2.27; P < .01) criterion. Young adults were more likely to endorse chasing losses (Criterion 9) (OR = 1.81; P < .01) while older adults were less likely to endorse this criterion (OR = 0.76; P < .05). Further research is needed to identify factors contributing to DIF, address criteria level bias, and examine differential test functioning.

Journal ArticleDOI
TL;DR: The results indicate that the revised-EDS is unidimensional, with minimal differential item functioning, and retains predictive validity consistent with the original scale.
Abstract: The Everyday Discrimination Scale (EDS), a widely used measure of daily perceived discrimination, is purported to be unidimensional, to function well among African Americans, and to have adequate construct validity. Two separate studies and data sources were used to examine and cross-validate the psychometric properties of the EDS. In Study 1, an exploratory factor analysis was conducted on a sample of African American law students (N = 589), providing strong evidence of local dependence, or nuisance multidimensionality within the EDS. In Study 2, a separate nationally representative community sample (N = 3,527) was used to model the identified local dependence in an item factor analysis (i.e., bifactor model). Next, item response theory (IRT) calibrations were conducted to obtain item parameters. A five-item, revised-EDS was then tested for gender differential item functioning (in an IRT framework). Based on these analyses, a summed score to IRT-scaled score translation table is provided for the revised-EDS. Our results indicate that the revised-EDS is unidimensional, with minimal differential item functioning, and retains predictive validity consistent with the original scale.