
Showing papers on "Differential item functioning" published in 2002


Journal ArticleDOI
TL;DR: The authors offer a comparison of a linear method (confirmatory factor analysis) and a nonlinear method (differential item and test functioning using item response theory) with an emphasis on their methodological similarities and differences.
Abstract: Current interest in the assessment of measurement equivalence emphasizes 2 major methods of analysis. The authors offer a comparison of a linear method (confirmatory factor analysis) and a nonlinear method (differential item and test functioning using item response theory) with an emphasis on their methodological similarities and differences. The 2 approaches test for the equality of true scores (or expected raw scores) across 2 populations when the latent (or factor) score is held constant. Both approaches can provide information about when measurement nonequivalence exists and the extent to which it is a problem. An empirical example is used to illustrate the 2 approaches.

436 citations
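
The shared logic of the two approaches, comparing expected scores across groups while the latent score is held constant, can be illustrated with a small numerical sketch. The Python fragment below is illustrative only and is not taken from the article; the model forms are standard (a linear factor-analytic item model and a 2PL item response function) and all parameter values are hypothetical.

    # Illustrative sketch only (not the authors' code): expected item scores for two
    # groups are compared while the latent (factor) score is held constant, once under
    # a linear factor-analytic item model and once under a nonlinear 2PL IRT model.
    # All parameter values are hypothetical.
    import numpy as np

    theta = np.linspace(-3, 3, 121)                 # common latent score grid

    def linear_expected(intercept, loading, theta):
        return intercept + loading * theta          # linear (CFA-style) item model

    def irt_expected(a, b, theta):
        return 1.0 / (1.0 + np.exp(-a * (theta - b)))   # 2PL item response function

    ref_lin = linear_expected(0.50, 0.30, theta)
    foc_lin = linear_expected(0.60, 0.30, theta)    # intercept difference -> nonequivalence
    ref_irt = irt_expected(1.2, 0.0, theta)
    foc_irt = irt_expected(1.2, -0.4, theta)        # location difference -> uniform DIF

    print("max |group difference|, linear model:", round(float(np.max(np.abs(ref_lin - foc_lin))), 3))
    print("max |group difference|, 2PL model:   ", round(float(np.max(np.abs(ref_irt - foc_irt))), 3))

In the linear case the conditional group difference is constant across the latent score, whereas in the 2PL case it changes with the latent score, which is one reason the two frameworks can portray the same nonequivalence differently.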


Journal ArticleDOI
TL;DR: The study examines the extent to which DIF affects estimates of age and gender group differences in disability severity among adults with some functional disability; the findings suggest that comparisons of disability across sociodemographic groups need to take DIF into account.
Abstract: Objectives Estimates of group differences in functional disability may be biased if items exhibit differential item functioning (DIF). For a given item, DIF exists if persons in different groups do not have the same probability of responding, given their level of disability. This study examines the extent to which DIF affects estimates of age and gender group differences in disability severity among adults with some functional disability. Methods Data came from the 1994/1995 National Health Interview Survey Disability Supplement. Analyses focused on 5,750 adult respondents who received help or supervision with at least one of 11 activities of daily living/instrumental activities of daily living tasks. We estimated gender and age group (18-39, 40-69, and 70+) differences in disability, using multiple-indicator/multiple-cause models, which treat functional disability as a latent trait. Results Nine items manifested significant DIF by age or gender; DIF was especially large for "shopping" and "money management." Without adjusting for DIF, middle-aged persons were less disabled than elderly persons, and women were less disabled than men among nonelderly persons. After adjusting for DIF, middle-aged persons did not differ from elderly persons, and gender differences within age groups were not significant. Discussion Comparisons of disability across sociodemographic groups need to take DIF into account. Future research should examine the causes of DIF and develop alternative question wordings that reduce DIF effects.

140 citations


Journal ArticleDOI
TL;DR: In this article, two parametric procedures for detecting differential item functioning (DIF) using the Graded Response Model (GRM; Samejima, 1969), the GRM-Likelihood Ratio test and the GRM-Differential Functioning of Items and Tests (GRM-DFIT; Flowers, Oshima, & Raju, 1999), were compared with a nonparametric DIF detection procedure, Poly-SIBTEST (Chang, Mazzeo, & Roussos, 1996), in a simulation study.
Abstract: Two parametric procedures for detecting differential item functioning (DIF) using the Graded Response Model (GRM; Samejima, 1969)-the GRM-Likelihood Ratio test (GRM-LR; Thissen, Steinberg, & Gerard, 1986) and the GRM-Differential Functioning of Items and Tests (GRM-DFIT; Flowers, Oshima, & Raju, 1999)-were compared with a nonparametric DIF detection procedure, Poly-SIBTEST (Chang, Mazzeo, & Roussos, 1996), in a simulation study. The 3 DIF procedures were examined (a) under conditions in which the GRM provided an exact fit to the data and (b) under conditions of slight model misfit. Small amounts of model misfit were simulated by applying the GRM-based DIF procedures to data generated from alternative item response models. Although all 3 DIF procedures adhered to nominal Type I error rates when data were generated from the GRM, the GRM-LR demonstrated large Type I error inflation under certain conditions when the generating model was not the GRM. GRM-DFIT showed much less Type I error inflation under model...

131 citations
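
As background for the procedures compared above, the sketch below shows how Samejima's graded response model assigns category probabilities to a polytomous item. It is a minimal illustration with hypothetical parameter values, not code from the simulation study.

    # Minimal sketch of Samejima's (1969) graded response model, which underlies the
    # GRM-LR and GRM-DFIT procedures compared above. Parameter values are hypothetical;
    # this is not code from the simulation study.
    import numpy as np

    def grm_category_probs(theta, a, thresholds):
        """P(X = k | theta) for k = 0..K for one polytomous item.

        a          : item discrimination
        thresholds : ordered between-category location parameters b_1 < ... < b_K
        """
        thresholds = np.asarray(thresholds, dtype=float)
        p_star = 1.0 / (1.0 + np.exp(-a * (theta - thresholds)))   # P(X >= k | theta)
        upper = np.concatenate(([1.0], p_star))
        lower = np.concatenate((p_star, [0.0]))
        return upper - lower                        # category probabilities, sum to 1

    probs = grm_category_probs(theta=0.5, a=1.4, thresholds=[-1.0, 0.2, 1.3])
    print(np.round(probs, 3), probs.sum())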


Journal ArticleDOI
TL;DR: The authors' analyses show that failing to account for DIF results in an approximately 1.6% overestimation of the magnitude of difference in assessed cognition between high- and low-education groups, and item bias does not appear to be a major source of observed differences in cognitive status by educational attainment.
Abstract: Years of completed education is a powerful correlate of performance on mental status assessment. This analysis evaluates differences in cognitive performance attributable to level of education and sex. We analyzed Mini-Mental State Examination responses from a large community sample (Epidemiologic Catchment Area study, N = 8,556), using a structural equation analytic framework grounded in item response theory. Significant sex and education group differential item functioning (DIF) were detected. Those with low education were more likely to err on the first serial subtraction, spell world backwards, repeat phrase, write, name season, and copy design tasks. Women were more likely to err on all serial subtractions, men on spelling and other language tasks. The magnitude of detected DIF was small. Our analyses show that failing to account for DIF results in an approximately 1.6% overestimation of the magnitude of difference in assessed cognition between high- and low-education groups. In contrast, nearly all (95%) of apparent sex differences underlying cognitive impairment are due to DIF. Therefore, item bias does not appear to be a major source of observed differences in cognitive status by educational attainment. Adjustments of total scores that eliminate education group differences are not supported by these results. Our results have implications for future research concerning education and risk for dementia.

128 citations


Journal ArticleDOI
TL;DR: To examine variations in the manifestation of depressive symptomatology across racial/ethnic groups, analyses of differential item functioning (DIF) on the Center for Epidemiologic Studies Depression Scale were conducted separately for representative samples of young adult African-Americans, US-born Hispanics, and foreign-born Hispanics, with non-Hispanic whites as the reference group.
Abstract: To examine variations in the manifestation of depressive symptomatology across racial/ethnic groups, analyses of differential item functioning (DIF) on the Center for Epidemiologic Studies Depression Scale (CES-D) were separately conducted for representative samples of young adults in the following groups: African-Americans (n = 434), Hispanics born in the US (n = 493), and Hispanics born outside the US (n = 395). Non-Hispanic whites (n = 463) were employed as the reference group in all analyses. The effects of gender and age were controlled. DIF analyses indicated that: (1) about half of the CES-D items functioned differently among non-Hispanic whites compared to each of the other racial/ethnic groups; (2) the manifestation of symptoms seemed to be similar for both Hispanic groups, except for low positive affect; (3) African-Americans tended to favor somatic symptoms over affective (depressive) symptoms; (4) Immigrant Hispanics appeared to inhibit the expression of positive affect, and thus more high scorers on the total CES-D were observed within this subgroup. In contrast, no differences were observed when only negative items were considered. The use of positive affect items might artifactually induce spurious differences among people who were born outside the United States or North America.

121 citations


Journal ArticleDOI
TL;DR: In this article, the behavior of item and person statistics obtained from two measurement frameworks, item response theory (IRT) and classical test theory (CTT), was examined using Monte Carlo techniques with simulated test data.
Abstract: Despite the well-known theoretical advantages of item response theory (IRT) over classical test theory (CTT), research examining their empirical properties has failed to reveal consistent, demonstrable differences. Using Monte Carlo techniques with simulated test data, this study examined the behavior of item and person statistics obtained from these two measurement frameworks. The findings suggest IRT- and CTT-based item difficulty and person ability estimates were highly comparable, invariant, and accurate in the test conditions simulated. However, whereas item discrimination estimates based on IRT were accurate across most of the experimental conditions, CTT-based item discrimination estimates proved accurate under some conditions only. Implications of the results of this study for psychometric item analysis and item selection are discussed.

117 citations
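
A minimal sketch of the classical test theory side of such a comparison is given below: item difficulty as the proportion correct and item discrimination as the corrected item-total correlation, computed on a small simulated data set. The data-generating values are hypothetical, and the IRT calibration step (which would require an estimation routine) is omitted.

    # Hypothetical-data sketch (not from the study): classical test theory item
    # statistics computed from a simulated 0/1 response matrix. The corresponding IRT
    # calibration would require an estimation package and is omitted here.
    import numpy as np

    rng = np.random.default_rng(0)
    n_persons, n_items = 1000, 20
    theta = rng.normal(size=(n_persons, 1))
    a = rng.uniform(0.6, 1.8, size=n_items)         # generating discriminations
    b = rng.normal(size=n_items)                    # generating difficulties
    p_correct = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    responses = (rng.uniform(size=p_correct.shape) < p_correct).astype(int)

    total = responses.sum(axis=1)
    p_values = responses.mean(axis=0)               # CTT item difficulty (p-value)
    item_total_r = np.array([                       # CTT discrimination: corrected
        np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]   # item-total r
        for j in range(n_items)
    ])
    print("first five p-values:      ", np.round(p_values[:5], 2))
    print("first five item-total r's:", np.round(item_total_r[:5], 2))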


Journal ArticleDOI
TL;DR: The authors demonstrated the application of an innovative item response theory (IRT) based approach to evaluate measurement equivalence, comparing a newly developed Spanish version of the Posttraumatic Stress Disorder Checklist-Civilian Version (PCL-C) with the established English version.
Abstract: This study demonstrated the application of an innovative item response theory (IRT) based approach to evaluating measurement equivalence, comparing a newly developed Spanish version of the Posttraumatic Stress Disorder Checklist–Civilian Version (PCL–C) with the established English version. Basic principles and practical issues faced in the application of IRT methods for instrument evaluation are discussed. Data were derived from a study of the mental health consequences of community violence in both Spanish speakers (n = 102) and English speakers (n = 284). Results of differential item functioning (DIF) analyses revealed that the 2 versions were not fully equivalent on an item-by-item basis in that 6 of the 17 items displayed uniform DIF. No bias was observed, however, at the level of the composite PCL–C scale score, indicating that the 2 language versions can be combined for scale-level analyses.

114 citations


Journal ArticleDOI
TL;DR: In this paper, the effect of item parameter drift on ability estimates under item response theory was investigated; under the conditions simulated, drift had only a small effect on ability estimation under the two-parameter logistic model.
Abstract: This study investigated the effect of item parameter drift on ability estimates under item response theory. Item response data for two testing occasions were simulated for the two-parameter logistic model under the following crossed conditions: test length, sample size, percentage of drifting items, and type of drift. Results indicated that item parameter drift, under the conditions simulated, had a small effect on ability estimates.

82 citations
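
The following sketch illustrates the general kind of design described above in a greatly simplified form: responses are generated under a two-parameter logistic model whose difficulties have drifted on a subset of items, and abilities are then scored with the original parameters. All settings are hypothetical, and the scoring method (a simple EAP over a quadrature grid) is chosen for brevity rather than to match the study.

    # Greatly simplified sketch of the kind of design described above: responses are
    # generated under a 2PL model whose difficulties have drifted on 10% of items, and
    # abilities are then scored with the original (undrifted) parameters. All settings
    # are hypothetical; EAP scoring over a quadrature grid is used only for brevity.
    import numpy as np

    rng = np.random.default_rng(1)
    n_items, n_persons, drift = 40, 2000, 0.5

    a = rng.uniform(0.8, 1.6, n_items)
    b = rng.normal(size=n_items)
    b_drifted = b.copy()
    drift_idx = rng.choice(n_items, n_items // 10, replace=False)
    b_drifted[drift_idx] += drift                   # uniform b-drift on 10% of items

    theta = rng.normal(size=(n_persons, 1))
    p = 1.0 / (1.0 + np.exp(-a * (theta - b_drifted)))
    x = (rng.uniform(size=p.shape) < p).astype(int)

    # EAP ability estimates computed with the ORIGINAL item parameters
    grid = np.linspace(-4, 4, 81)
    prior = np.exp(-0.5 * grid**2)
    pg = 1.0 / (1.0 + np.exp(-a[None, :] * (grid[:, None] - b[None, :])))
    loglik = x @ np.log(pg).T + (1 - x) @ np.log(1 - pg).T
    post = np.exp(loglik - loglik.max(axis=1, keepdims=True)) * prior
    eap = (post * grid).sum(axis=1) / post.sum(axis=1)

    print("mean bias in ability estimates:", round(float(np.mean(eap - theta.ravel())), 3))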


Journal ArticleDOI
TL;DR: Differential item functioning (DIF) is present when an item displays different statistical properties for different groups after matching the groups on an ability measure; as discussed by the authors, its occurrence can be explained by recognizing that the observed data do not reflect a homogeneous population of individuals, but are a mixture of data from multiple latent populations or classes.
Abstract: Differential item functioning (DIF) may be defined as an item that displays different statistical properties for different groups after matching the groups on an ability measure. For instance, with binary data, DIF exists when there is a difference in the conditional probabilities of a correct response for two manifest groups. This article argues that the occurrence of DIF can be explained by recognizing that the observed data do not reflect a homogeneous population of individuals, but are a mixture of data from multiple latent populations or classes. This conceptualization of DIF hypothesizes that when one observes DIF using the current conceptualization of DIF it is only to the degree that the manifest groups are represented in the latent classes in different proportions. A Monte Carlo study was conducted to compare various approaches to detecting DIF under this formulation of DIF. Results showed that as the latent class proportions became more equal the DIF detection methods identification rates approa...

77 citations
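
The latent-class account of DIF sketched in the abstract can be illustrated with a small simulation, shown below. Two latent classes answer a studied item with different difficulty; two manifest groups are mixtures of those classes, and the conditional group difference on the item grows as the mixing proportions diverge. The code is illustrative only, uses hypothetical values, and applies a crude conditioning band rather than any of the DIF detection methods examined in the article.

    # Small simulation illustrating the latent-class account of DIF described above
    # (illustrative values only, and a crude conditioning band rather than any of the
    # detection methods examined in the article). Two latent classes answer a studied
    # item with different difficulty; two manifest groups mix those classes in
    # different proportions, and the conditional group difference grows accordingly.
    import numpy as np

    rng = np.random.default_rng(2)

    def simulate_group(n, prop_class2):
        theta = rng.normal(size=n)
        in_class2 = rng.uniform(size=n) < prop_class2
        b_studied = np.where(in_class2, 1.0, 0.0)   # studied item is harder for class 2
        p = 1.0 / (1.0 + np.exp(-(theta - b_studied)))
        return theta, (rng.uniform(size=n) < p).astype(int)

    for prop_focal in (0.5, 0.7, 0.9):              # reference group fixed at 0.5
        th_r, x_r = simulate_group(20000, 0.5)
        th_f, x_f = simulate_group(20000, prop_focal)
        band_r = x_r[np.abs(th_r) < 0.25].mean()    # proportion correct near theta = 0
        band_f = x_f[np.abs(th_f) < 0.25].mean()
        print(f"focal class-2 proportion {prop_focal:.1f}: "
              f"conditional difference = {band_r - band_f:+.3f}")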


Journal ArticleDOI
TL;DR: Age-related measurement bias in responses to items on the revised Beck Depression Inventory (BDI) in depressed late-life patients versus midlife patients was examined, and IRT results indicated that late-life patients tended to report fewer cognitive symptoms, especially at low to average levels of depression.
Abstract: The present analyses examined age-related measurement bias in responses to items on the revised Beck Depression Inventory (BDI) in depressed late-life patients versus midlife patients. Item response theory (IRT) models were used to equate the scale and to differentiate true-group differences from bias in measurement in the 2 samples. Baseline BDI data (218 late-life and 613 midlife) were used for the present analysis. IRT results indicated that late-life patients tended to report fewer cognitive symptoms, especially at low to average levels of depression. Conversely, they tended to report more somatic symptoms, especially at higher levels of depression. Adjusted cutoff scores in the late-life group are provided, and possible reasons for age-related differences in the performance of the BDI are discussed.

72 citations


Book ChapterDOI
01 Jan 2002
TL;DR: A class of multivariate models combining features of Rasch type models with features of graphical interaction models into a common framework for analysis of criterion related construct validity and differential item functioning is defined.
Abstract: This paper defines a class of multivariate models combining features of Rasch type models with features of graphical interaction models into a common framework for analysis of criterion-related construct validity and differential item functioning. Item analysis by graphical Rasch models is illustrated with a reanalysis of a summary health scale counting the number of symptoms experienced within the last six months.

Journal ArticleDOI
TL;DR: In this article, a hierarchical logistic regression approach is proposed to identify consistent sources of DIF, to quantify the proportion of explained variation in DIF coefficients, and to compare the predictive accuracy of alternate explanations for DIF.
Abstract: Over the past 25 years a range of parametric and nonparametric methods have been developed for analyzing Differential Item Functioning (DIF). These procedures are typically performed for each item individually or for small numbers of related items. Because the analytic procedures focus on individual items, it has been difficult to pool information across items to identify potential sources of DIF analytically. In this article, we outline an approach to DIF analysis using hierarchical logistic regression that makes it possible to combine results of logistic regression analyses across items to identify consistent sources of DIF, to quantify the proportion of explained variation in DIF coefficients, and to compare the predictive accuracy of alternate explanations for DIF. The approach can also be used to improve the accuracy of DIF estimates for individual items by applying empirical Bayes techniques, with DIF-related item characteristics serving as collateral information. To illustrate the hierarchical logi...
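
The item-level building block that such a hierarchical analysis pools across items is ordinary logistic regression DIF screening. The sketch below shows the usual three nested models (matching score only, plus group, plus group-by-score interaction) on simulated data; it assumes the statsmodels package is available and is not the article's hierarchical or empirical Bayes machinery.

    # Sketch of the item-level building block that the hierarchical approach pools
    # across items: logistic regression DIF screening with the usual three nested
    # models. Data are simulated, the matching score is crude, and the statsmodels
    # package is assumed to be installed; this is not the article's hierarchical or
    # empirical Bayes machinery.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n = 4000
    group = rng.integers(0, 2, n)                   # 0 = reference, 1 = focal
    theta = rng.normal(size=n)
    p_item = 1.0 / (1.0 + np.exp(-(theta - 0.4 * group)))   # built-in uniform DIF
    y = (rng.uniform(size=n) < p_item).astype(int)
    total = y + rng.binomial(20, 1.0 / (1.0 + np.exp(-theta)))   # crude matching score

    def fit(columns):
        X = sm.add_constant(np.column_stack(columns))
        return sm.Logit(y, X).fit(disp=0)

    m_base = fit([total])                           # matching score only
    m_unif = fit([total, group])                    # + group (uniform DIF)
    m_nonu = fit([total, group, total * group])     # + interaction (nonuniform DIF)

    print("uniform-DIF LR chi-square (1 df):   ", round(2 * (m_unif.llf - m_base.llf), 2))
    print("nonuniform-DIF LR chi-square (1 df):", round(2 * (m_nonu.llf - m_unif.llf), 2))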

Journal ArticleDOI
TL;DR: In this article, three strategies are used for identifying adaptation and curricular differences as sources of DIF: judgmental reviews by multiple bilingual translators of all items, cross-validation of differential item functioning in multiple groups, and examination of the distribution of the DIF items by topic.
Abstract: This article describes and discusses strategies used in disentangling sources of differential item functioning (DIF) in multilanguage assessments where multiple factors are expected to be causing DIF. Three strategies are used for identifying adaptation and curricular differences as sources of DIF: (a) judgmental reviews by multiple bilingual translators of all items, (b) cross-validation of DIF in multiple groups, and (c) examination of the distribution of DIF items by topic. Twenty-seven percent of the mathematics DIF items and 37% of the science DIF items were interpreted to be due to adaptation-related differences based on judgmental reviews. Most of these interpretations were also supported by the cross-validation analyses. Clustering of DIF items by topic provided curricular differences as interpretation for DIF only for small portions of the DIF items, approximately 23% of the mathematics DIF items and 13% of the science DIF items.

Journal ArticleDOI
TL;DR: Research related to the detection of item bias or differential item functioning (DIF) has proliferated in the psychometric and applied psychological literature over the last 25 years.
Abstract: Summary: Research related to the detection of item bias or differential item functioning (dif) has proliferated in psychometric and applied psychological literature over the last 25 years. In fact,...

Journal ArticleDOI
TL;DR: Two methods for developing a common metric for the graded response model under item response theory are compared: linking separate calibration runs using equating coefficients from the characteristic curve method and concurrent calibration using the combined data of the base and target groups.
Abstract: Developing a common metric is essential to successful applications of item response theory to practical testing problems, such as equating, differential item functioning, and computerized adaptive testing. In this study, the authors compared two methods for developing a common metric for the graded response model under item response theory: (a) linking separate calibration runs using equating coefficients from the characteristic curve method and (b) concurrent calibration using the combined data of the base and target groups. Concurrent calibration yielded consistently albeit only slightly smaller root mean square differences for both item discrimination and location parameters. Similar results were observed for distance measures between item parameter estimates and item parameters. Concurrent calibration also yielded consistently though only slightly smaller root mean square differences for ability than linking.
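
When separate calibrations are linked, the equating coefficients are used to place the target group's parameter estimates on the base group's metric. The sketch below shows that transformation for graded response model parameters; the coefficients and parameter values are hypothetical, and obtaining the coefficients themselves (for example by the characteristic curve method) is not shown.

    # Minimal sketch of the metric transformation used after linking separate
    # calibrations: once equating coefficients A and B are in hand (for example from
    # the characteristic curve method), the target group's graded response model
    # estimates are placed on the base group's scale. All numbers are hypothetical.
    import numpy as np

    A, B = 0.92, 0.15                               # assumed equating coefficients

    a_target = np.array([1.30, 0.95, 1.70])         # target-group discriminations
    b_target = np.array([[-1.2, 0.1, 1.0],          # target-group location parameters,
                         [-0.8, 0.4, 1.5],          # one row of ordered thresholds
                         [-1.5, -0.2, 0.9]])        # per item

    a_on_base_scale = a_target / A                  # a* = a / A
    b_on_base_scale = A * b_target + B              # b* = A * b + B

    print(np.round(a_on_base_scale, 3))
    print(np.round(b_on_base_scale, 3))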

Journal ArticleDOI
TL;DR: In this paper, data from a large-scale admissions test were used to evaluate the effect of local item dependence (LID) on test score reliability and examinee proficiency estimation; the results showed that unaccounted-for LID can bias these estimates.
Abstract: Measurement specialists routinely assume examinee responses to test items are independent of one another. However, previous research has shown that many contemporary tests contain item dependencies and not accounting for these dependencies leads to misleading estimates of item, test, and ability parameters. The goals of the study were (a) to review methods for detecting local item dependence (LID), (b) to discuss the use of testlets to account for LID in context-dependent item sets, (c) to apply LID detection methods and testlet-based item calibrations to data from a large-scale, high-stakes admissions test, and (d) to evaluate the results with respect to test score reliability and examinee proficiency estimation. Item dependencies were found in the test and these were due to test speededness or context dependence (related to passage structure). Also, the results highlight that steps taken to correct for the presence of LID and obtain less biased reliability estimates may impact on the estimation of examinee proficiency. The practical effects of the presence of LID on passage-based tests are discussed, as are issues regarding how to calibrate context-dependent item sets using item response theory.

The most basic unit of a test is the test item. Test development organizations spend more time and money developing and selecting items for inclusion on a test than any other aspect of the test construction process. Numerous test items are needed to (a) adequately span the content or construct domain tested, and (b) provide reliable estimates of examinee proficiencies. It has long been known that one way to increase test score reliability is to increase the number of items on a test. However, merely duplicating the same items will not accomplish the goal of reliable and valid measurement. Thus, test developers strive to develop items that provide unique information regarding examinee knowledge, skills, and abilities. Redundancy among items is not desirable. Items that do not make a unique contribution to an assessment do not increase construct representation and exacerbate any construct-irrelevant factors that may be associated with an item, such as prior familiarity with the item context. For this reason, what is now known as local item dependence (LID) must be considered in the development and scoring of achievement and aptitude tests. The concept of LID is best understood within the framework of item response theory (IRT). The most popular IRT models specify a single ability to account for all statistical relationships among test items as well as all differences among examinees. It is this underlying ability, typically denoted theta (θ), that distinguishes items with respect to difficulty and distinguishes examinees with respect to
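
A common way to screen for local item dependence is to examine correlations among item residuals after a single latent trait has been accounted for, as in Yen's Q3 statistic. The sketch below is only a rough approximation of that idea: it uses the standardized total score as a stand-in for an IRT ability estimate and simulated data with one deliberately dependent item pair.

    # Rough sketch of a Q3-style check for local item dependence: after removing what a
    # single trait explains, residuals for locally dependent items remain correlated.
    # The standardized total score stands in for an IRT ability estimate here, so this
    # only approximates Yen's Q3; data and parameter values are simulated.
    import numpy as np

    rng = np.random.default_rng(4)
    n_persons, n_items = 3000, 12
    theta = rng.normal(size=(n_persons, 1))
    b = np.linspace(-1.5, 1.5, n_items)
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    x = (rng.uniform(size=p.shape) < p).astype(int)

    # Inject a passage-style dependence between the last two items
    shared = rng.normal(size=(n_persons, 1))
    p_dep = 1.0 / (1.0 + np.exp(-(theta + 0.8 * shared - b[-2:])))
    x[:, -2:] = (rng.uniform(size=p_dep.shape) < p_dep).astype(int)

    z = (x.sum(axis=1) - x.sum(axis=1).mean()) / x.sum(axis=1).std()
    fitted = 1.0 / (1.0 + np.exp(-(z[:, None] - b)))       # crude expected scores
    resid = x - fitted
    q3 = np.corrcoef(resid, rowvar=False)
    print("residual correlation, items 11 and 12:", round(float(q3[-1, -2]), 3))
    print("median off-diagonal residual corr.:   ",
          round(float(np.median(q3[np.triu_indices(n_items, k=1)])), 3))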

Journal ArticleDOI
TL;DR: In this article, a distinction is made between absolute and relative measurement, where the scale of measurement is expressed in terms of the within-group position on a trait, and it is shown that items for relative measurement will produce bias as classically defined if the mean and/or variance of the trait distribution differ between groups.
Abstract: In this article, a distinction is made between absolute and relative measurement. Absolute measurement refers to the measurement of traits on a group-invariant scale, and relative measurement refers to the within-group measurement of traits, where the scale of measurement is expressed in terms of the within-group position on a trait. Relative measurement occurs, for example, if an item induces a within-group comparison in respondents. These distinctions are discussed within the framework of measurement invariance, differentiating between absolute and relative forms of measurement invariance and bias. It is shown that items for relative measurement will produce bias as classically defined if the mean and/or variance of the trait distribution differ between groups. This form of bias, however, does not result from multidimensionality but from the fact that measurement is on a relative scale. A logistic regression procedure for the detection of relative measurement invariance and bias is proposed, as well as ...

Journal ArticleDOI
TL;DR: A class of locally dependent latent trait models based on a family of conditional distributions that describes joint multiple item responses as a function of student latent trait, not assuming conditional independence is proposed.
Abstract: In this paper, we propose a class of locally dependent latent trait models for responses to psychological and educational tests. Typically, item response models treat an individual's multiple responses to stimuli as conditionally independent given the individual's latent trait. In this paper, the focus is instead on models based on a family of conditional distributions, or kernel, that describes joint multiple item responses as a function of student latent trait, not assuming conditional independence. Specifically, we examine a hybrid kernel which comprises a component for one-way item response functions and a component for conditional associations between items given latent traits. The class of models allows the extension of item response theory to cover some new and innovative applications in psychological and educational research. An EM algorithm for marginal maximum likelihood estimation of the hybrid kernel model is proposed. Furthermore, we delineate the relationship of the class of locally dependent models and the log-linear model by revisiting the Dutch identity (Holland, 1990).

Journal ArticleDOI
TL;DR: Using the Rasch version of Thalbourne's Manic-Depressiveness Scale (MDS), this paper found, for example, that women are more likely than equally depressive men to worry about "being poor".

Journal ArticleDOI
TL;DR: Data analytic steps for IRT modeling are reviewed for evaluating item quality and differential item functioning across subgroups of gender, age, and smoking status and implications and challenges in the use of these methods for tobacco onset research and for assessing the developmental trajectories of smoking among youth are discussed.

Journal ArticleDOI
TL;DR: The detection of item bias using Item Response Theory (IRT) arose in the cognitive testing domain, so the phenomenon is almost invariably perceived as undesirable; the authors argue, however, that DIF-prone items can afford valuable insights into the nature of the construct under study, especially where group differences are important.
Abstract: A form of item bias known as Differential Item Functioning (DIF) occurs when two individuals with the same trait levels but different group membership do not have the same probability of endorsing an item in the keyed direction. The detection of DIF using Item Response Theory (IRT) arose in the cognitive testing domain, so the phenomenon is almost invariably perceived as undesirable. With the extension of IRT procedures to substantive areas of psychology, it is argued that DIF-prone items can afford valuable insights into the nature of the construct under study, especially where group differences are important. An example is presented using responses from 568 students completing a popular measure of Openness to Experience.

Journal ArticleDOI
TL;DR: The authors investigated the impact of item feature variation on item statistical characteristics and the degree to which such information could be used as collateral information to supplement examinee performance data and reduce pretest sample size.
Abstract: In this study we investigated the impact of systematic item feature variation on item statistical characteristics and the degree to which such information could be used as collateral information to supplement examinee performance data and reduce pretest sample size. Two families of word problem variants for the quantitative section of the Graduate Record Examinations General Test were generated by systematically manipulating item features. For rate problems, the item design features affected item difficulty (adjusted R2 = .90), item discrimination (adjusted R2 = .50), and guessing (adjusted R2 = .41). For probability problems the item design features affected difficulty (adjusted R2 = .61) but not discrimination or guessing. The results demonstrate the enormous potential of systematically creating item variants. The issue of how to develop a knowledge base that would support the systematic generation of a wider variety of quantitative problems is discussed.


Journal ArticleDOI
TL;DR: In this article, the effects of test purification in detecting differential item functioning (DIF) by means of polytomous extensions of the Raju area measures and the Lord statistic were examined.
Abstract: The detection of differential item functioning (DIF) in polytomous response items is an area of recent research interest. Cohen and colleagues proposed a polytomous extension of the Lord statistic and the Raju exact area measures for items that fit the graded response model. This study examined the effects of test purification in detecting DIF by means of polytomous extensions of the Raju area measures and the Lord statistic. The factors manipulated were percentage of DIF items in the test (5%, 10%, and 20%), amount of DIF (0.2, 0.4, and 0.8), sample size (250, 500, and 1,000), and test purification (noniterative versus two-stage). The results of this study suggest that the use of the two-stage equating procedure with Z(SA) and χ2-LORD reduces the percentage of false positives and improves the detection of DIF.
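
The two-stage (purification) logic can be summarized in a short sketch: items flagged in a first pass are dropped from the matching criterion before the final pass. The code below is schematic; the dif_statistic function is a simple placeholder rather than the Raju area measure or Lord chi-square used in the study, and the flagging threshold is arbitrary.

    # Schematic sketch of two-stage (purified) DIF screening: items flagged in a first
    # pass are dropped from the matching criterion before the final pass. The
    # dif_statistic function is a simple placeholder, not the Raju area measure or the
    # Lord chi-square used in the study, and the flagging threshold is arbitrary.
    import numpy as np

    def dif_statistic(resp_ref, resp_foc, match_ref, match_foc, item):
        """Placeholder DIF index: mean group difference within matching-score strata."""
        cuts = np.quantile(np.concatenate([match_ref, match_foc]), [0.25, 0.5, 0.75])
        diffs = []
        for lo, hi in zip(np.r_[-np.inf, cuts], np.r_[cuts, np.inf]):
            in_r = (match_ref >= lo) & (match_ref < hi)
            in_f = (match_foc >= lo) & (match_foc < hi)
            if in_r.any() and in_f.any():
                diffs.append(resp_ref[in_r, item].mean() - resp_foc[in_f, item].mean())
        return abs(np.mean(diffs))

    def two_stage_dif(resp_ref, resp_foc, threshold=0.05):
        n_items = resp_ref.shape[1]
        # Stage 1: match on the full total score
        stage1 = [dif_statistic(resp_ref, resp_foc, resp_ref.sum(1), resp_foc.sum(1), j)
                  for j in range(n_items)]
        flagged = [j for j, s in enumerate(stage1) if s > threshold]
        # Stage 2: re-match on a purified score that excludes the flagged items
        keep = [j for j in range(n_items) if j not in flagged]
        pur_ref, pur_foc = resp_ref[:, keep].sum(1), resp_foc[:, keep].sum(1)
        stage2 = [dif_statistic(resp_ref, resp_foc, pur_ref, pur_foc, j)
                  for j in range(n_items)]
        return [j for j, s in enumerate(stage2) if s > threshold]

    rng = np.random.default_rng(5)
    demo_ref = (rng.uniform(size=(800, 10)) < 0.6).astype(int)
    demo_foc = (rng.uniform(size=(800, 10)) < 0.6).astype(int)
    print("items flagged after purification:", two_stage_dif(demo_ref, demo_foc))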

Journal ArticleDOI
TL;DR: In this paper, an empirical Bayes (EB) enhancement of the popular Mantel-Haenszel (MH) DIF analysis method was used to investigate the applicability to computerized adaptive test data of a differential item functioning (DIF) analysis method developed by Zwick, Thayer, and Lewis.
Abstract: This study used a simulation to investigate the applicability to computerized adaptive test data of a differential item functioning (DIF) analysis method developed by Zwick, Thayer, and Lewis. The approach involves an empirical Bayes (EB) enhancement of the popular Mantel-Haenszel (MH) DIF analysis method. Results showed the performance of the EB DIF approach to be quite promising, even in extremely small samples. In particular, the EB procedure was found to achieve roughly the same degree of stability for samples averaging 117 and 40 members in the two examinee groups as did the ordinary MH for samples averaging 240 in each of the two groups. Overall, the EB estimates tended to be closer to their target values than did the ordinary MH statistics in terms of root mean square residuals; the EB statistics were also more highly correlated with the target values than were the MH statistics. When combined with a loss-function-based decision rule, the EB method is better at detecting DIF than conventional appro...
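
For readers unfamiliar with the underlying statistic, the sketch below computes the Mantel-Haenszel common odds ratio for a studied item, expresses it on the ETS delta metric (MH D-DIF), and applies a toy empirical Bayes shrinkage of the item-level values toward their common mean. It uses simulated data and simplified formulas; the Zwick, Thayer, and Lewis procedure is considerably more elaborate.

    # Simplified, simulated sketch: the Mantel-Haenszel common odds ratio for a studied
    # item, re-expressed on the ETS delta metric (MH D-DIF = -2.35 * ln(alpha_MH)),
    # followed by a toy empirical Bayes shrinkage of the item-level values toward their
    # common mean. The Zwick, Thayer, and Lewis procedure is considerably more
    # elaborate than this; the assumed sampling variance of 0.05 is arbitrary.
    import numpy as np

    rng = np.random.default_rng(6)

    def mh_d_dif(x_item, group, matching):
        """MH D-DIF for one item, stratifying on the matching score."""
        num, den = 0.0, 0.0
        for s in np.unique(matching):
            m = matching == s
            a = np.sum((x_item == 1) & (group == 0) & m)    # reference correct
            b = np.sum((x_item == 0) & (group == 0) & m)    # reference incorrect
            c = np.sum((x_item == 1) & (group == 1) & m)    # focal correct
            d = np.sum((x_item == 0) & (group == 1) & m)    # focal incorrect
            n_s = a + b + c + d
            if n_s > 0:
                num += a * d / n_s
                den += b * c / n_s
        return -2.35 * np.log(num / den)

    # Simulate a short test in which only item 0 disadvantages the focal group
    n = 3000
    group = rng.integers(0, 2, n)
    theta = rng.normal(size=(n, 1))
    b_items = np.linspace(-1, 1, 10)
    dif_shift = np.outer(group, np.r_[0.5, np.zeros(9)])
    p = 1.0 / (1.0 + np.exp(-(theta - b_items - dif_shift)))
    x = (rng.uniform(size=p.shape) < p).astype(int)
    matching = x.sum(axis=1)

    d_dif = np.array([mh_d_dif(x[:, j], group, matching) for j in range(10)])

    # Toy empirical Bayes step: pull each noisy estimate toward the across-item mean
    prior_var = max(float(d_dif.var()) - 0.05, 1e-6)
    weight = prior_var / (prior_var + 0.05)
    eb = weight * d_dif + (1 - weight) * d_dif.mean()
    print("MH D-DIF:", np.round(d_dif, 2))
    print("EB D-DIF:", np.round(eb, 2))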

Journal ArticleDOI
TL;DR: This article evaluated the effects of calculator use on performance on the SAT I: Reasoning Test in Mathematics; questions about use of the calculator were inserted into the answer sheets for the November 1996 and November 1997 administrations of the examination.
Abstract: To evaluate the effects of calculator use on performance on the SAT I: Reasoning Test in Mathematics, questions about use of the calculator on the test were inserted into the answer sheets for the November 1996 and November 1997 administrations of the examination. Overall, nearly all examinees indicated that they brought a calculator to the test, and about two thirds reported using them on one third or more of the math items. Some group differences in the use of calculators were observed, with girls using them more frequently than boys, and Whites and Asian Americans using them more often than other racial or ethnic groups. Use of calculators was associated with higher test performance, but the more able students were more likely to have calculators and used them more often. The results were analyzed further using multiple regression and differential item functioning procedures. The degree of speededness under different degrees of calculator use was also examined. Overall, the effects of calculator use were ...

Journal ArticleDOI
TL;DR: In this paper, the hypothesized superiority of the item response model (IRM) is tested against structural equation modeling (SEM) for responses to the Center for Epidemiologic Studies-Depression (CES-D) scale.
Abstract: The sample invariance of item discrimination statistics is evaluated in this case study using real data. The hypothesized superiority of the item response model (IRM) is tested against structural equation modeling (SEM) for responses to the Center for Epidemiologic Studies-Depression (CES-D) scale. Responses from 10 random samples of 500 people were drawn from a base sample of 6,621 participants across gender, age, and different health groups. Hierarchical tests of multiple-group structural equation models indicated statistically significant differences exist in item regressions across contrast groups. Although the IRM item discrimination estimates were most stable in all conditions of this case study, additional research on the precision of individual scores and possible item bias is required to support the validity of either model for scoring the CES-D. The SEM approach to examining between-group differences holds promise for any field where heterogeneous populations are assessed and important consequen...

Journal ArticleDOI
TL;DR: In this paper, the asymptotic standard errors of item/test response function estimates are derived by the delta method for the three-parameter logistic model.
Abstract: The asymptotic standard errors of item/test response function estimates are derived by the delta method for the three-parameter logistic model. Using a similar method, the asymptotic standard error...
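
The delta-method idea can be sketched directly: the variance of the estimated item response function at a given theta is approximately g'Vg, where g is the gradient of the 3PL response function with respect to the item parameters and V is their asymptotic covariance matrix. In the illustration below both the parameter estimates and V are hypothetical.

    # Numerical sketch of a delta-method standard error for the three-parameter
    # logistic item response function: Var[P(theta)] is approximated by g'Vg, where g
    # is the gradient of P with respect to (a, b, c) and V is the asymptotic covariance
    # matrix of the parameter estimates. Both the estimates and V are hypothetical.
    import numpy as np

    D = 1.7                                         # logistic scaling constant

    def p3pl(theta, a, b, c):
        return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

    def gradient(theta, a, b, c):
        L = 1 / (1 + np.exp(-D * a * (theta - b)))
        return np.array([(1 - c) * D * (theta - b) * L * (1 - L),   # dP/da
                         -(1 - c) * D * a * L * (1 - L),            # dP/db
                         1 - L])                                    # dP/dc

    a_hat, b_hat, c_hat = 1.2, 0.3, 0.18            # hypothetical item estimates
    V = np.array([[0.040, 0.004, 0.001],            # hypothetical covariance matrix of
                  [0.004, 0.025, 0.002],            # (a_hat, b_hat, c_hat)
                  [0.001, 0.002, 0.003]])

    for theta in (-1.0, 0.0, 1.0):
        g = gradient(theta, a_hat, b_hat, c_hat)
        se = float(np.sqrt(g @ V @ g))
        print(f"theta = {theta:+.1f}:  P = {p3pl(theta, a_hat, b_hat, c_hat):.3f},  "
              f"delta-method SE = {se:.3f}")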

Book ChapterDOI
01 Jan 2002
TL;DR: This article discusses basic concepts of quality of life (QOL) measurement and item response theory (IRT); in a way complementary to traditional methods, IRT models can be applied to analyze and interpret QOL data collected in various settings.
Abstract: This article discusses basic concepts of quality of life (QOL) measurement and item response theory (IRT). In a way complementary to that of traditional methods, IRT models can be applied to analyze and interpret QOL data collected in various settings. Growing interest in precise QOL measurement in research and clinical settings demands the development of psychometrically sound and clinically meaningful measurement tools. This in turn contributes to the appropriate use of QOL data that are collected. Advances in IRT, also referred to as modern test theory, make it possible for one to more critically evaluate questionnaire performance at its initial development and subsequent refinement and validation. It also offers better methodology to make interpretation of QOL data and comparisons between different populations or occasions more meaningful by converting ordinal observations into linear measures. Empirical results from different studies are provided to assist in the understanding of different IRT models and their applications. It is feasible and promising to integrate IRT models and advanced computer technology to develop a computerized adaptive testing (CAT) platform to deliver tailored tests and arrive at more precise QOL measurement. Administration of more targeted test items according to a patient's level of health via CAT, with real-time scoring and reporting, is not just a possibility but a reality. This can facilitate better use of QOL information between patients and physicians, and ultimately improve patient care.
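
The adaptive-testing idea mentioned in the chapter can be sketched in a few lines: after each response, the unused item with maximum Fisher information at the current ability estimate is administered next. The code below is a toy 2PL example with a simulated item bank and a simple EAP update; operational CAT systems add content balancing, exposure control, and stopping rules.

    # Toy sketch of the adaptive-testing idea mentioned above: after each response, the
    # unused item with maximum Fisher information at the current ability estimate is
    # administered next. The 2PL item bank, the examinee, and the simple EAP update are
    # all simulated and hypothetical; operational CAT systems add content balancing,
    # exposure control, and stopping rules.
    import numpy as np

    rng = np.random.default_rng(7)
    bank_a = rng.uniform(0.8, 2.0, 50)              # simulated item bank
    bank_b = rng.normal(size=50)
    true_theta = 0.7

    def prob(theta, a, b):
        return 1.0 / (1.0 + np.exp(-a * (theta - b)))

    def info(theta, a, b):
        p = prob(theta, a, b)
        return a**2 * p * (1 - p)                   # 2PL Fisher information

    grid = np.linspace(-4, 4, 161)
    theta_hat, administered, responses = 0.0, [], []
    for _ in range(10):                             # 10-item adaptive test
        unused = [j for j in range(50) if j not in administered]
        nxt = max(unused, key=lambda k: info(theta_hat, bank_a[k], bank_b[k]))
        administered.append(nxt)
        responses.append(int(rng.uniform() < prob(true_theta, bank_a[nxt], bank_b[nxt])))
        like = np.exp(-0.5 * grid**2)               # normal prior
        for j, r in zip(administered, responses):
            pj = prob(grid, bank_a[j], bank_b[j])
            like = like * (pj if r else 1 - pj)
        theta_hat = float(np.sum(grid * like) / np.sum(like))

    print("administered items:", administered)
    print("final ability estimate:", round(theta_hat, 2), "(true value 0.7)")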

Journal ArticleDOI
TL;DR: In this paper, a transposition of the usual person-item matrices is used to enhance diagnostic assessment in which individual differences in scores between content domains are clarified by conditioning the scores on item difficulty.
Abstract: The definitions, methods, and interpretations of differential item functioning are extended to the transpose of the usual person-item matrices. The primary purpose is to enhance diagnostic assessment in which individual differences in scores between content domains are clarified by conditioning the scores on item difficulty. Three examples are used to illustrate this approach with data from the mathematics section of the California Achievement Test using the Mantel-Haenszel procedure. The term differential person functioning is suggested.