
Showing papers on "Item response theory" published in 2006


Journal Article
TL;DR: The R package ltm, as discussed by the authors, was developed for the analysis of multivariate dichotomous and polytomous data using latent variable models under the Item Response Theory approach.
Abstract: The R package ltm has been developed for the analysis of multivariate dichotomous and polytomous data using latent variable models, under the Item Response Theory approach. For dichotomous data the Rasch, the Two-Parameter Logistic, and Birnbaum’s Three-Parameter models have been implemented, whereas for polytomous data Samejima’s Graded Response model is available. Parameter estimates are obtained under marginal maximum likelihood using the Gauss-Hermite quadrature rule. The capabilities and features of the package are illustrated using two real data examples.

835 citations
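
For readers unfamiliar with the models named in this entry, their standard forms are as follows (a brief recap in conventional notation, not reproduced from the paper):

```latex
% Three-Parameter Logistic (3PL) model for dichotomous item i and
% person j; the 2PL sets c_i = 0, and the Rasch model additionally
% fixes all a_i to a common value (often 1):
P(x_{ij} = 1 \mid \theta_j) = c_i + (1 - c_i)\,
  \frac{\exp\{a_i(\theta_j - b_i)\}}{1 + \exp\{a_i(\theta_j - b_i)\}}

% Samejima's Graded Response Model for ordinal scores k = 1, ..., K_i,
% defined through cumulative category probabilities:
P(x_{ij} \ge k \mid \theta_j) =
  \frac{1}{1 + \exp\{-a_i(\theta_j - b_{ik})\}}
```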


Journal ArticleDOI
TL;DR: Recommendations are provided for physical therapists interpreting changes in the context of clinical practice, case reports, and intervention research, including greater application of indexes that help convey the meaning of clinically significant change to clinical, research, consumer, and payer communities.
Abstract: Over the past decade, the methods and science used to describe changes in outcomes of physical therapy services have become more refined. Recently, emphasis has been placed not only on changes beyond expected measurement error, but also on the identification of changes that make a real difference in the lives of patients and families. This article will highlight a case example of how to determine and interpret "clinically significant change" from both of these perspectives. The authors also examine how to use item maps within an item response theory model to enhance the interpretation of change at a content level. Recommendations are provided for physical therapists who are interpreting changes in the context of clinical practice, case reports, and intervention research. These recommendations include a greater application of indexes that help interpret the meaning of clinically significant change to multiple clinical, research, consumer, and payer communities.

766 citations


Journal ArticleDOI
TL;DR: The authors exemplify strategies for testing measurement invariance that are uncommonly applied and reported in the extant literature, including use of the means and covariance structures (MACS) approach to test for an invariant higher-order factor structure and for latent mean differences relative to both levels of that structure.
Abstract: The overarching intent of this article is to exemplify strategies associated with tests for measurement invariance that are uncommonly applied and reported in the extant literature. Designed within a pedagogical framework, the primary purposes are 3-fold and illustrate (a) tests for measurement invariance based on the analysis of means and covariance structures (MACS), (b) use of the MACS approach in testing for an invariant higher order factor structure, and (c) tests for latent mean differences relative to both levels of a higher order factor structure. Addressing additional application limitations, the secondary purposes are 2-fold and illustrate (a) determination of invariance based on two substantially different sets of criteria and (b) interpretation of noninvariant measurement items within the context of an item response theory perspective. We are hopeful that readers will find the didactic approach to be helpful in gaining a better understanding of the invariance-testing process.

416 citations
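
A compact recap of the MACS invariance hierarchy this article walks through (standard notation, ours rather than the authors'):

```latex
% MACS measurement model in group g, with intercepts tau, loadings
% Lambda, and latent factors eta:
y^{(g)} = \tau^{(g)} + \Lambda^{(g)} \eta^{(g)} + \varepsilon^{(g)}

% Nested invariance hypotheses, tested in sequence:
%   configural: same pattern of free/fixed loadings in each group
%   metric:     \Lambda^{(1)} = \Lambda^{(2)}
%   scalar:     \Lambda^{(1)} = \Lambda^{(2)} \text{ and } \tau^{(1)} = \tau^{(2)}
% Latent mean differences \kappa^{(2)} - \kappa^{(1)} are interpretable
% only once (partial) scalar invariance holds.
```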


Journal ArticleDOI
TL;DR: The authors develop a common strategy for identifying differential item functioning (DIF) items that can be implemented in both the mean and covariance structures (MACS) method and item response theory; results indicated that this strategy was considerably more effective than an alternative approach involving a constrained-baseline model.
Abstract: In this article, the authors developed a common strategy for identifying differential item functioning (DIF) items that can be implemented in both the mean and covariance structures method (MACS) and item response theory (IRT). They proposed examining the loadings (discrimination) and the intercept (location) parameters simultaneously using the likelihood ratio test with a free-baseline model and Bonferroni corrected critical p values. They compared the relative efficacy of this approach with alternative implementations for various types and amounts of DIF, sample sizes, numbers of response categories, and amounts of impact (latent mean differences). Results indicated that the proposed strategy was considerably more effective than an alternative approach involving a constrained-baseline model. Both MACS and IRT performed similarly well in the majority of experimental conditions. As expected, MACS performed slightly worse in dichotomous conditions but better than IRT in polytomous cases where sample sizes were small. Also, contrary to popular belief, MACS performed well in conditions where DIF was simulated on item thresholds (item means), and its accuracy was not affected by impact.

358 citations
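
The free-baseline likelihood ratio test proposed here has the familiar form below (a hedged recap in generic notation, not taken from the article):

```latex
% Compare a free-baseline model (only anchor items constrained equal
% across groups) against a model that additionally constrains the
% studied item's loading/discrimination and intercept/location:
G^2 = -2\left[\ln L_{\mathrm{constrained}} - \ln L_{\mathrm{free}}\right]
      \;\sim\; \chi^2_{\Delta\mathrm{df}}

% Bonferroni-corrected critical level when m items are tested:
\alpha^{*} = \alpha / m
```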


Journal ArticleDOI
TL;DR: In this article, the reliability of responses to the items, as well as the item parameters of three GSE measures using item response theory, were examined, and the results indicate that the New General Self-Efficacy Scale has a slight advantage over the other measures examined in this study in terms of the item discrimination, item information, and relative efficiency of the test information function.
Abstract: General self-efficacy (GSE), individuals’ belief in their ability to perform well in a variety of situations, has been the subject of increasing research attention. However, the psychometric properties (e.g., reliability, validity) associated with the scores on GSE measures have been criticized, which has hindered efforts to further establish the construct of GSE. This study examines the reliability of responses to the items, as well as the item parameters, of three GSE measures using item response theory. Contrary to the criticisms, the responses to the items on all three measures of GSE demonstrate acceptable psychometric properties, especially at lower levels of GSE. The results indicate that the New General Self-Efficacy Scale has a slight advantage over the other measures examined in this study in terms of item discrimination, item information, and relative efficiency of the test information function. Implications for GSE research are discussed.

336 citations


Journal ArticleDOI
TL;DR: The results of this study found that the HOS ADL and sports subscales were unidimensional, had adequate internal consistency, were potentially responsive across the spectrum of ability, and contributed information across the range of ability.
Abstract: Purpose: The purpose of this study was to offer evidence of validity for the Hip Outcome Score (HOS) based on internal structure, test content, and relation to other variables. Methods: The study population consisted of 507 subjects with a labral tear. Internal structure was evaluated by use of factor analysis and coefficient α. Test content was evaluated by use of item response theory. Pearson correlation coefficients were used to assess relations between the Short Form 36 and the HOS. Results: The mean subject age was 38 years (range, 13 to 66 years), with 232 male and 273 female subjects. Of the subjects, 263 (52%) underwent arthroscopic surgery. Factor analysis found that 17 of 19 items on the activities-of-daily-living (ADL) subscale loaded on 1 factor. The 2 items that did not fit the 1-factor model were omitted from further testing. All 9 items on the sports subscale loaded on 1 factor. The coefficient α values were .96 and .95 for the ADL and sports subscales, respectively. The errors associated with a single measure were ±4.6 and ±3.8 points for the ADL and sports subscales, respectively. Item response theory found that all items contributed to their test information curves and were potentially responsive. The correlations between the HOS and Short Form 36 measures of physical function were significantly different from their correlations with measures of mental functioning. Conclusions: The results of this study provide evidence of validity to support the use of the HOS ADL and sports subscales for individuals with labral tears. This includes individuals who underwent arthroscopic surgery, as well as those who did not. Specifically, the results of this study found that the HOS ADL and sports subscales were unidimensional, had adequate internal consistency, were potentially responsive across the spectrum of ability, and contributed information across the spectrum of ability. In addition, scores obtained by the HOS related to measures of function and did not relate to measures of mental health. Level of Evidence: Level III, development of diagnostic criteria with nonconsecutive patients.

311 citations


Journal ArticleDOI
TL;DR: A lognormal model for the response times of a person on a set of test items is investigated; it showed an excellent fit to the data, whereas a normal model was unable to accommodate the characteristic skewness of the response-time distributions.
Abstract: A lognormal model for the response times of a person on a set of test items is investigated. The model has a parameter structure analogous to the two-parameter logistic response models in item response theory, with a parameter for the speed of each person as well as parameters for the time intensity and discriminating power of each item. It is shown how these parameters can be estimated by a Markov chain Monte Carlo method (Gibbs sampler). The method was used to analyze response times for the adaptive version of a test from the Armed Services Vocational Aptitude Battery. The same data set was used to test the validity of the model against a normal model using posterior predictive checks on the response times. The lognormal model showed an excellent fit to the data, whereas the normal model seemed unable to allow for a characteristic skewness of the response time distributions. The addition of an equality constraint on the discrimination parameters led only to a slight loss of fit. The potential use of the model for improving the daily practice of testing is indicated.

294 citations
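
The lognormal response-time model described here can be written as follows, with τ_j the speed of person j and β_i, α_i the time intensity and discriminating power of item i (our transcription of the standard form):

```latex
% Density of response time t_ij of person j on item i:
f(t_{ij};\, \tau_j, \alpha_i, \beta_i) =
  \frac{\alpha_i}{t_{ij}\sqrt{2\pi}}
  \exp\!\left\{-\tfrac{1}{2}\bigl[\alpha_i(\ln t_{ij} - (\beta_i - \tau_j))\bigr]^2\right\}

% Equivalently: \ln t_{ij} = \beta_i - \tau_j + \varepsilon_{ij},
% with \varepsilon_{ij} \sim N(0, \alpha_i^{-2}).
```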


Journal ArticleDOI
TL;DR: This work provides the first combined IRT and CDFA analysis of a clinical measure (the SMFQ) in a community sample of 7- through 11-year-old children, and confirms its scaling properties as a potential dimensional measure of symptom severity of childhood depression in community samples.
Abstract: Item response theory (IRT) and categorical data factor analysis (CDFA) are complementary methods for the analysis of the psychometric properties of psychiatric measures that purport to measure latent constructs. These methods have been applied to relatively few child and adolescent measures. We provide the first combined IRT and CDFA analysis of a clinical measure (the Short Mood and Feelings Questionnaire—SMFQ) in a community sample of 7- through 11-year-old children. Both latent variable models supported the internal construct validity of a single underlying continuum of severity of depressive symptoms. SMFQ items discriminated well at the more severe end of the depressive latent trait. Item performance was not affected by age, although age correlated significantly with latent SMFQ scores, suggesting that symptom severity increased within the age period of 7–11. These results extend existing psychometric studies of the SMFQ and confirm its scaling properties as a potential dimensional measure of symptom severity of childhood depression in community samples.

263 citations


Journal ArticleDOI
TL;DR: The ordinal logistic regression approach, when combined with IRT ability estimates, provides a reasonable alternative for DIF detection.
Abstract: Introduction: We present an ordinal logistic regression model for identification of items with differential item functioning (DIF) and apply this model to a Mini-Mental State Examination (MMSE) dataset. We employ item response theory ability estimation in our models. Three nested ordinal logistic regression ...

261 citations
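
The abstract is cut off before the three nested models are listed, but the standard ordinal logistic regression DIF framework it refers to is as follows (a hedged reconstruction, with θ̂ an IRT ability estimate and g a group indicator):

```latex
% Proportional-odds models for item response Y with categories k:
\mathrm{Model\ 1\ (no\ DIF):}\quad
  \operatorname{logit} P(Y \le k) = \alpha_k - \beta_1\hat\theta
\mathrm{Model\ 2\ (uniform\ DIF):}\quad
  \operatorname{logit} P(Y \le k) = \alpha_k - \beta_1\hat\theta - \beta_2 g
\mathrm{Model\ 3\ (nonuniform\ DIF):}\quad
  \operatorname{logit} P(Y \le k) = \alpha_k - \beta_1\hat\theta - \beta_2 g - \beta_3\,\hat\theta g

% Likelihood ratio tests: Model 3 vs. 2 flags nonuniform DIF;
% Model 2 vs. 1 flags uniform DIF.
```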


Journal ArticleDOI
TL;DR: In this article, the authors describe the development, analysis, and interpretation of a novel item format called Ordered Multiple-Choice (OMC), which is linked to a model of student cognitive development for the construct being measured.
Abstract: In this article we describe the development, analysis, and interpretation of a novel item format we call Ordered Multiple-Choice (OMC). A unique feature of OMC items is that they are linked to a model of student cognitive development for the construct being measured. Each of the possible answer choices in an OMC item is linked to developmental levels of student understanding, facilitating the diagnostic interpretation of student item responses. OMC items seek to provide greater diagnostic utility than typical multiple-choice items, while retaining their efficiency advantages. On the one hand, sets of OMC items provide information about the developmental understanding of students that is not available with traditional multiple-choice items; on the other hand, this information can be provided to schools, teachers, and students quickly and reliably, unlike traditional open-ended test items.

258 citations


Journal ArticleDOI
TL;DR: Through this work, it has become clear that differences in raw scores of different groups cannot be used to infer group differences in theoretical attributes unless the test scores accord with a particular set of model invariance restrictions.
Abstract: The question whether observed differences in psychometric test scores can be attributed to differences in the properties that such tests measure is relevant in many research domains; examples include the proper interpretation of differences in intelligence test scores across different generations of people,1 gender differences in affectivity,2 and cross-cultural differences in personality.3 This question also has generated some of the most conspicuous controversies in the social and life sciences, where the highest temperature in the many heated discussions around the topic has, without a doubt, been reached in the debate on IQ-score differences between ethnic groups in the United States.4,5 Such debates are often unproductive because of a lack of unambiguous characterizations of concepts like "biased," "incomparable," and "culture-fair." Terms are easily coined, as is illustrated by Johnson's6 count of no less than 55 types of measurement equivalence; however, it is often less easy to spell out their meaning in terms of their empirical consequences. Without at least some degree of precision in one's conception of a term like "equivalence," it is difficult to have a scientifically productive debate, or even to agree on what aspects of empirical data are relevant for answering the questions involved. It is for this reason that the establishment of concepts like measurement invariance and bias in an unambiguous, formal framework with testable consequences7-9 represents a theoretical development of great importance.

Through this work, it has become clear that differences in raw scores (e.g., IQ scores) of different groups (e.g., blacks and whites) cannot be used to infer group differences in theoretical attributes (e.g., general intelligence) unless the test scores accord with a particular set of model invariance restrictions. Namely, the same attribute must relate to the same set of observations in the same way in each group. Statistically, this means that the mathematical function that relates latent variables to the observations must be the same in each of the groups involved in the comparison.7,8 This idea has become known as the requirement of measurement invariance.

The theoretical definitions of measurement invariance and bias are very general and apply to different models, such as item response theory (IRT) and factor models, in roughly the same way.10,11 This does not hold for the empirical methods available for testing measurement invariance. In the past decades, psychometricians working on measurement invariance have produced many different statistical techniques to assess differential item functioning (DIF). These techniques usually employ different statistical assumptions, for instance regarding the form of the relation between latent and observed variables and the shape of the population distribution on the latent variable, and employ different modeling strategies as well as selection criteria for flagging items as biased. For this reason, it is difficult to assess the consequences of choosing a particular technique; moreover, it is not always clear to what extent the choice of technique makes a difference with respect to the diagnosis of measurement invariance and bias in applied situations. For this reason, the articles on DIF collected here (by Crane et al;12 Dorans and Kulick;13 Jones;14 Morales, Flowers, Gutierrez, Kleinman, and Teresi;15 Edelen Orlando et al16) represent a useful project in the application of bias detection methods. Each set of authors analyzes the Mini-Mental State Examination (MMSE) for measurement invariance using the same data, albeit with different methods. Together, the articles provide a ...

Journal ArticleDOI
TL;DR: This paper provided renewed converging empirical evidence for the hypothesis that asking test-takers to respond to text passages with multiple-choice questions induces response processes that are strikingly different from those that respondents would draw on when reading in non-testing contexts.
Abstract: This article provides renewed converging empirical evidence for the hypothesis that asking test-takers to respond to text passages with multiple-choice questions induces response processes that are strikingly different from those that respondents would draw on when reading in non-testing contexts. Moreover, the article shows that the construct of reading comprehension is assessment specific and is fundamentally determined through item design and text selection. The data come from qualitative analyses of 10 cognitive interviews conducted with non-native adult English readers who were given three passages with several multiple-choice questions from the CanTEST, a large-scale language test used for admission and placement purposes in Canada, in a partially counter-balanced design. The analyses show that:
• There exist multiple different representations of the construct of ‘reading comprehension’ that are revealed through the characteristics of the items.
• Learners view responding to multiple-choice questions ...

Journal ArticleDOI
TL;DR: The authors introduced an effort-moderated IRT model, which incorporates item response time into proficiency estimation and item parameter estimation, and found that the effort-moderated model showed better model fit, yielded more accurate item parameter estimates, more accurately estimated test information, and yielded proficiency estimates with higher convergent validity.
Abstract: The validity of inferences based on achievement test scores is dependent on the amount of effort that examinees put forth while taking the test. With low-stakes tests, for which this problem is particularly prevalent, there is a consequent need for psychometric models that can take into account differing levels of examinee effort. This article introduces the effort-moderated IRT model, which incorporates item response time into proficiency estimation and item parameter estimation. In two studies of the effort-moderated model when rapid guessing (i.e., reflecting low examinee effort) was present, one based on real data and the other on simulated data, the effort-moderated model performed better than the standard 3PL model. Specifically, it was found that the effort-moderated model (a) showed better model fit, (b) yielded more accurate item parameter estimates, (c) more accurately estimated test information, and (d) yielded proficiency estimates with higher convergent validity.
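
A hedged sketch of the model's core structure (our notation; the exact response-time thresholding rule is the authors'): each response is classified as solution behavior or a rapid guess, and rapid guesses are modeled as uninformative random responses.

```latex
% SB_ij = 1 if person j's response time on item i meets the item's
% threshold (solution behavior), 0 otherwise (rapid guess):
P(x_{ij} = 1 \mid \theta_j) =
  SB_{ij}\, P_{\mathrm{3PL}}(\theta_j;\, a_i, b_i, c_i)
  + (1 - SB_{ij})\,\frac{1}{m_i}

% m_i = number of response options of item i; rapid guesses thus
% succeed at the chance rate and carry no information about theta.
```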

Journal ArticleDOI
TL;DR: This article examines the performance of a number of discrepancy measures for assessing different aspects of fit of the common IRT models and makes specific recommendations about what measures are most useful in assessing model fit.
Abstract: Model checking in item response theory (IRT) is an underdeveloped area. There is no universally accepted tool for checking IRT models. The posterior predictive model-checking method is a popular Bayesian model-checking tool because it has intuitive appeal, is simple to apply, has a strong theoretical basis, and can provide graphical or numerical evidence about model misfit. An important issue with the application of the posterior predictive model-checking method is the choice of a discrepancy measure (which plays a role like that of a test statistic in traditional hypothesis tests). This article examines the performance of a number of discrepancy measures for assessing different aspects of fit of the common IRT models and makes specific recommendations about what measures are most useful in assessing model fit. Graphical summaries of model-checking results are demonstrated to provide useful insights about model fit.
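
A minimal Python sketch of the posterior predictive check described here, assuming a fitted 2PL and an external source of posterior draws (the draw format and the residual-based discrepancy are illustrative assumptions, not the article's choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def irf_2pl(theta, a, b):
    """2PL response probabilities; theta: (J,), a, b: (I,) -> (J, I)."""
    return 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))

def discrepancy(x, p):
    """Example discrepancy D(y, omega): sum of squared standardized
    residuals under the model's probabilities p."""
    return np.sum((x - p) ** 2 / (p * (1.0 - p)))

def ppp_value(x_obs, draws):
    """Posterior predictive p-value for a (J, I) 0/1 response matrix.

    draws: iterable of (theta, a, b) posterior samples, e.g. from an
    MCMC run (hypothetical input format).
    """
    exceed = []
    for theta, a, b in draws:
        p = irf_2pl(theta, a, b)
        x_rep = rng.binomial(1, p)        # replicated data set
        exceed.append(discrepancy(x_rep, p) >= discrepancy(x_obs, p))
    return float(np.mean(exceed))         # near 0 or 1 => misfit
```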

Journal ArticleDOI
TL;DR: The authors' results indicate that ideal point models can provide as good or better fit to personality items than do dominance models because they can fit monotonically increasing item response functions but do not require this property.
Abstract: The present study investigated whether the assumptions of an ideal point response process, similar in spirit to Thurstone's work in the context of attitude measurement, can provide viable alternatives to the traditionally used dominance assumptions for personality item calibration and scoring. Item response theory methods were used to compare the fit of 2 ideal point and 2 dominance models with data from the 5th edition of the Sixteen Personality Factor Questionnaire (S. Conn & M. L. Rieke, 1994). The authors' results indicate that ideal point models can provide as good or better fit to personality items than do dominance models because they can fit monotonically increasing item response functions but do not require this property. Several implications of these findings for personality measurement and personnel selection are described.

Journal ArticleDOI
TL;DR: The new SAHLSA-50, a health literacy test for Spanish-speaking adults, has good reliability and validity and could be used in the clinical or community setting to screen for low health literacy among Spanish speakers.
Abstract: Objective. The study was intended to develop and validate a health literacy test, termed the Short Assessment of Health Literacy for Spanish-speaking Adults (SAHLSA), for the Spanish-speaking population. Study Design. The design of SAHLSA was based on the Rapid Estimate of Adult Literacy in Medicine (REALM), known as the most easily administered tool for assessing health literacy in English. In addition to the word recognition test in REALM, SAHLSA incorporates a comprehension test using multiple-choice questions designed by an expert panel. Data Collection. Validation of SAHLSA involved testing and comparing the tool with other health literacy instruments in a sample of 201 Spanish-speaking and 202 English-speaking subjects recruited from the Ambulatory Care Center at UNC Health Care. Principal Findings. With only the word recognition test, REALM could not differentiate the level of health literacy in Spanish. The SAHLSA significantly improved the differentiation. Item response theory analysis was performed to calibrate the SAHLSA and reduce the instrument to 50 items. The resulting instrument, SAHLSA-50, was correlated with the Test of Functional Health Literacy in Adults, another health literacy instrument, at r = 0.65. The SAHLSA-50 score was significantly and positively associated with the physical health status of Spanish-speaking subjects (p < .05), holding constant age and years of education. The instrument displayed good internal reliability (Cronbach’s α = 0.92) and test–retest reliability (Pearson’s r = 0.86). Conclusions. The new instrument, SAHLSA-50, has good reliability and validity. It could be used in the clinical or community setting to screen for low health literacy among Spanish speakers.
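
For reference, the internal-reliability figure quoted above is Cronbach's coefficient, computed in the standard way (a general formula, not specific to this study):

```latex
% k items, sigma_i^2 the variance of item i, sigma_X^2 the variance
% of the total score X = sum over the k items:
\alpha = \frac{k}{k - 1}
  \left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right)
```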

Journal ArticleDOI
TL;DR: The results showed the Preschool Word and Print Awareness to be suitable for measuring preschoolers' PCK and to be sensitive to differences among children as a function of risk status.
Abstract: Purpose This research determined the psychometric quality of a criterion-referenced measure that was thought to measure preschoolers' print-concept knowledge (PCK). Method This measure, titled the ...

Journal ArticleDOI
TL;DR: There is a developmental trend during middle childhood for grammatical abilities and vocabulary abilities to become differentiated, but standardized measures do not provide differential information concerning receptive and expressive abilities.
Abstract: Purpose This study asked if children’s performance on language tests reflects different dimensions of language and if this dimensionality changes with development. Method Children were given standardized language batteries at kindergarten and at second, fourth, and eighth grades. A revised modified parallel analysis was used to determine the dimensionality of these items at each grade level. A confirmatory factor analysis was also performed on the subtest scores to evaluate alternate models of dimensionality. Results The revised modified parallel analysis revealed a single dimension across items with evidence of either test-specific or language-area-specific minor dimensions at different ages. The confirmatory factor analysis tested models involving modality (receptive or expressive) and domain (vocabulary or sentence use) against a single-dimension model. The 2-dimensional model involving domains of vocabulary and sentence use fit the data better than the single-dimensional model; however, the single-dimensional ...

Journal ArticleDOI
TL;DR: Overall, this study provides support for the excellent properties of the SIAS's straightforwardly worded items, although questions remain regarding its reverse-scored items.
Abstract: The widely used Social Interaction Anxiety Scale (SIAS; R. P. Mattick & J. C. Clarke, 1998) possesses favorable psychometric properties, but questions remain concerning its factor structure and item properties. Analyses included 445 people with social anxiety disorder and 1,689 undergraduates. Simple unifactorial models fit poorly, and models that accounted for differences due to item wording (i.e., reverse scoring) provided superior fit. It was further found that clients and undergraduates approached some items differently, and the SIAS may be somewhat overly conservative in selecting analogue participants from an undergraduate sample. Overall, this study provides support for the excellent properties of the SIAS's straightforwardly worded items, although questions remain regarding its reverse-scored items.

Journal ArticleDOI
TL;DR: This paper examined measurement equivalence of the Satisfaction with Life Scale between American and Chinese samples using multigroup Structural Equation Modeling (SEM), Multiple Indicator Multiple Cause Model (MIMIC), and Item Response Theory (IRT).

Journal ArticleDOI
TL;DR: In this article, the authors investigated the use of response time to assess the amount of examinee effort received by individual test items and found that the strongest predictors of the effort required by items were item length (i.e., how much reading or scanning was required).
Abstract: In low-stakes testing, the motivation levels of examinees are often a matter of concern to test givers because a lack of examinee effort represents a direct threat to the validity of the test data. This study investigated the use of response time to assess the amount of examinee effort received by individual test items. In 2 studies, it was found that the strongest predictors of the effort received by items were item length (i.e., how much reading or scanning was required) and item position. In addition, it was found that by treating item responses resulting from rapid guesses as missing, item means and item-total correlations were differentially affected and test score reliability decreased, whereas validity increased. Several implications of these results for low-stakes testing are discussed.
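
A minimal sketch of the cleansing step this study motivates: treat responses faster than an item's rapid-guess threshold as missing, then recompute item statistics. The array layout and per-item thresholds are illustrative assumptions.

```python
import numpy as np

def cleanse(responses, times, thresholds):
    """Mask rapid guesses as missing.

    responses: (J, I) 0/1 matrix; times: (J, I) response times in
    seconds; thresholds: (I,) per-item rapid-guess cutoffs (assumed
    to be supplied, e.g. from response-time distributions).
    """
    x = responses.astype(float)
    x[times < thresholds] = np.nan      # rapid guess -> missing
    return x

def item_stats(x):
    """Item means and item-total correlations, ignoring missing cells."""
    means = np.nanmean(x, axis=0)
    total = np.nansum(x, axis=1)
    itc = []
    for i in range(x.shape[1]):
        keep = ~np.isnan(x[:, i])
        itc.append(np.corrcoef(x[keep, i], total[keep])[0, 1])
    return means, np.array(itc)
```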

Journal ArticleDOI
TL;DR: The results illustrate limitations of DSM-IV criteria for alcohol and cannabis use disorders when applied to adolescents; the development process for the fifth edition (DSM-V) should be informed by statistical models such as those used in this study.
Abstract: Item response theory (IRT) has advantages over classical test theory in evaluating diagnostic criteria. In this study, the authors used IRT to characterize the psychometric properties of Diagnostic and Statistical Manual of Mental Disorders, 4th edition (DSM-IV; American Psychiatric Association, 1994) alcohol and cannabis use disorder symptoms among 472 clinical adolescents. For both substances, DSM-IV symptoms fit a model specifying a unidimensional latent trait of problem severity. Threshold (severity) parameters did not distinguish abuse and dependence symptoms. Abuse symptoms of legal problems and hazardous use, and dependence symptoms of tolerance, unsuccessful attempts to quit, and physical-psychological problems, showed relatively poor discrimination of problem severity. There were gender differences in thresholds for hazardous use, legal problems, and physical-psychological problems. The results illustrate limitations of DSM-IV criteria for alcohol and cannabis use disorders when applied to adolescents. The development process for the fifth edition (DSM-V) should be informed by statistical models such as those used in this study.

Journal ArticleDOI
TL;DR: In this article, a testlet-based item response theory (IRT) model is proposed to deal with the local dependence present among items within a common testlet when tests are made up of testlets.
Abstract: When tests are made up of testlets, standard item response theory (IRT) models are often not appropriate due to the local dependence present among items within a common testlet. A testlet-based IRT...

Journal ArticleDOI
TL;DR: The authors show that the size of the effect can be expressed by a presentation of the values of the parameter estimates derived from the fitted model, and develop a case study of the description of effect size for research reporting in the context of item response theory.
Abstract: The psychological literature currently emphasizes reporting the "effect size" of research findings in addition to the outcome of any tests of significance. However, some confusion may result from the fact that there are three distinct uses of effect sizes in the psychological literature, namely, power analysis, research synthesis, and research reporting. The authors review these uses of effect sizes and develop a case study of the description of effect size for research reporting in the context of item response theory. For many parametric models, hypotheses are tested by comparing the values of directly interpretable parameters. The authors show that the size of the effect can be expressed by a presentation of the values of the parameter estimates derived from the fitted model. Studies that use item response theory to detect differential item functioning provide illustrations.

Journal ArticleDOI
TL;DR: A reduced 18-item measure demonstrating strong clinical utility is proposed, with scores of 8 or greater implying greater need for treatment.
Abstract: The Rutgers Alcohol Problem Index (RAPI; H. R. White & E. W. Labouvie, 1989) is a frequently used measure of alcohol-related consequences in adolescents and college students, but psychometric evaluations of the RAPI are limited and it has not been validated with college students. This study used item response theory (IRT) to examine the RAPI on students (N = 895; 65% female, 35% male) assessed in both high school and college. A series of 2-parameter IRT models were computed, examining differential item functioning across gender and time points. A reduced 18-item measure demonstrating strong clinical utility is proposed, with scores of 8 or greater implying greater need for treatment.

Journal ArticleDOI
TL;DR: IRT and the likelihood-based model comparison approach comprise a powerful tool for DIF detection that can aid in the development, refinement, and evaluation of measures for use in ethnically diverse populations.
Abstract: Background An important part of examining the adequacy of measures for use in ethnically diverse populations is the evaluation of differential item functioning (DIF) among subpopulations such as those administered the measure in different languages. A number of methods exist for this purpose. Objective The objective of this study was to introduce and demonstrate the identification of DIF using item response theory (IRT) and the likelihood-based model comparison approach. Methods Data come from a sample of community-residing elderly who were part of a dementia case registry. A total of 1578 participants were administered either an English (n = 913) or Spanish (n = 665) version of the 21-item Mini-Mental State Examination. IRT was used to identify language DIF in these items with the likelihood-based model comparison approach. Results Fourteen of the 21 items exhibited significant DIF according to language of administration. However, because the direction of the identified DIF was not consistent for one language version over the other, the impact at the scale level was negligible. Conclusions IRT and the likelihood-based model comparison approach comprise a powerful tool for DIF detection that can aid in the development, refinement, and evaluation of measures for use in ethnically diverse populations.

Journal ArticleDOI
TL;DR: In this article, the authors compared four item response theory (IRT) models using data from tests where multiple items were grouped into testlets focused on a common stimulus, and found that when items were not independent within testlets, the independent-items model yielded greater root mean square error (RMSE) for item difficulty and underestimated the item slopes.
Abstract: Four item response theory (IRT) models were compared using data from tests where multiple items were grouped into testlets focused on a common stimulus. In the bi-factor model each item was treated as a function of a primary trait plus a nuisance trait due to the testlet; in the testlet-effects model the slopes in the direction of the testlet traits were constrained within each testlet to be proportional to the slope in the direction of the primary trait; in the polytomous model the item scores were summed into a single score for each testlet; and in the independent-items model the testlet structure was ignored. Using the simulated data, reliability was overestimated somewhat by the independent-items model when the items were not independent within testlets. Under these nonindependent conditions, the independent-items model also yielded greater root mean square error (RMSE) for item difficulty and underestimated the item slopes. When the items within testlets were instead generated to be independent, the bi-factor model yielded somewhat higher RMSE in difficulty and slope. Similar differences between the models were illustrated with real data.
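
In conventional notation (consistent with the abstract, though the symbols are ours), the first two models being compared are:

```latex
% Bi-factor model: a primary trait theta_0 plus an orthogonal nuisance
% trait theta_d for the testlet d(i) containing item i:
\operatorname{logit} P(x_{ij} = 1) =
  a_{i0}\,\theta_{0j} + a_{id}\,\theta_{d(i)j} - b_i

% Testlet-effects model: the same structure, with the testlet slopes
% constrained proportional to the primary slope within each testlet:
a_{id} = \lambda_{d(i)}\, a_{i0}
```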

Journal ArticleDOI
TL;DR: In this article, the authors compare two commonly used methods of rotation, Varimax and Promax, in terms of their ability to correctly link items to factors and to identify the presence of simple structure.
Abstract: Nonlinear factor analysis is a tool commonly used by measurement specialists to identify both the presence and nature of multidimensionality in a set of test items, an important issue given that standard Item Response Theory models assume a unidimensional latent structure. Results from most factor-analytic algorithms include loading matrices, which are used to link items with factors. Interpretation of the loadings typically occurs after they have been rotated in order to amplify the presence of simple structure. The purpose of this simulation study is to compare the ability of two commonly used methods of rotation, Varimax and Promax, in terms of their ability to correctly link items to factors and to identify the presence of simple structure. Results suggest that the two approaches are equally able to recover the underlying factor structure, regardless of the correlations among the factors, though the oblique method is better able to identify the presence of a “simple structure.” These results suggest that for identifying which items are associated with which factors, either approach is effective, but that for identifying simple structure when it is present, the oblique method is preferable.
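
For concreteness, the sketch below implements the classic varimax rotation by iterated SVD; it is a textbook algorithm written for this summary, not the code used in the study (Promax would additionally raise these loadings to a power and fit an oblique target):

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation of a (p items x k factors) matrix."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)                       # accumulated rotation
    crit = 0.0
    for _ in range(max_iter):
        A = L @ R
        # Gradient of the varimax criterion, expressed in L's frame
        u, s, vt = np.linalg.svd(
            L.T @ (A ** 3 - (gamma / p) * A @ np.diag((A ** 2).sum(axis=0)))
        )
        R = u @ vt
        new_crit = s.sum()
        if new_crit - crit < tol:       # criterion stopped improving
            break
        crit = new_crit
    return L @ R                        # rotated loading matrix
```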

Journal ArticleDOI
TL;DR: Skindex-29 responses from 454 Italian dermatological patients were subjected to Rasch analysis to investigate threshold order, differential item functioning (DIF), and item and overall fit to the model.

Journal ArticleDOI
TL;DR: An integrated overview of the quantitative methods used in this special issue to examine measurement equivalence is provided; factor analytic and DIF detection methods yield unique information and can be viewed as complementary in informing about measurement equivalence.
Abstract: Background: Reviewed in this article are issues relating to the study of invariance and differential item functioning (DIF). The aim of factor analyses and DIF, in the context of invariance testing, is the examination of group differences in item response conditional on an estimate of disability. ...