
Showing papers on "Differential item functioning" published in 2002


Journal ArticleDOI
TL;DR: The authors offer a comparison of a linear method (confirmatory factor analysis) and a nonlinear method (differential item and test functioning using item response theory) with an emphasis on their methodological similarities and differences.
Abstract: Current interest in the assessment of measurement equivalence emphasizes 2 major methods of analysis. The authors offer a comparison of a linear method (confirmatory factor analysis) and a nonlinear method (differential item and test functioning using item response theory) with an emphasis on their methodological similarities and differences. The 2 approaches test for the equality of true scores (or expected raw scores) across 2 populations when the latent (or factor) score is held constant. Both approaches can provide information about when measurement nonequivalence exists and the extent to which it is a problem. An empirical example is used to illustrate the 2 approaches.

436 citations
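
The shared logic of the two approaches, comparing expected scores across groups while the latent score is held constant, can be illustrated with a small numerical sketch. The Python fragment below is illustrative only and is not taken from the article; the model forms are standard (a linear factor-analytic item model and a 2PL item response function) and all parameter values are hypothetical.

    # Illustrative sketch only (not the authors' code): expected item scores for two
    # groups are compared while the latent (factor) score is held constant, once under
    # a linear factor-analytic item model and once under a nonlinear 2PL IRT model.
    # All parameter values are hypothetical.
    import numpy as np

    theta = np.linspace(-3, 3, 121)                 # common latent score grid

    def linear_expected(intercept, loading, theta):
        return intercept + loading * theta          # linear (CFA-style) item model

    def irt_expected(a, b, theta):
        return 1.0 / (1.0 + np.exp(-a * (theta - b)))   # 2PL item response function

    ref_lin = linear_expected(0.50, 0.30, theta)
    foc_lin = linear_expected(0.60, 0.30, theta)    # intercept difference -> nonequivalence
    ref_irt = irt_expected(1.2, 0.0, theta)
    foc_irt = irt_expected(1.2, -0.4, theta)        # location difference -> uniform DIF

    print("max |group difference|, linear model:", round(float(np.max(np.abs(ref_lin - foc_lin))), 3))
    print("max |group difference|, 2PL model:   ", round(float(np.max(np.abs(ref_irt - foc_irt))), 3))

In the linear case the conditional group difference is constant across the latent score, whereas in the 2PL case it changes with the latent score, which is one reason the two frameworks can portray the same nonequivalence differently.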


Journal ArticleDOI
TL;DR: The study examines the extent to which DIF affects estimates of age and gender group differences in disability severity among adults with some functional disability; the findings suggest that comparisons of disability across sociodemographic groups need to take DIF into account.
Abstract: Objectives Estimates of group differences in functional disability may be biased if items exhibit differential item functioning (DIF). For a given item, DIF exists if persons in different groups do not have the same probability of responding, given their level of disability. This study examines the extent to which DIF affects estimates of age and gender group differences in disability severity among adults with some functional disability. Methods Data came from the 1994/1995 National Health Interview Survey Disability Supplement. Analyses focused on 5,750 adult respondents who received help or supervision with at least one of 11 activities of daily living/instrumental activities of daily living tasks. We estimated gender and age group (18-39, 40-69, and 70+) differences in disability, using multiple-indicator/multiple-cause models, which treat functional disability as a latent trait. Results Nine items manifested significant DIF by age or gender; DIF was especially large for "shopping" and "money management." Without adjusting for DIF, middle-aged persons were less disabled than elderly persons, and women were less disabled than men among nonelderly persons. After adjusting for DIF, middle-aged persons did not differ from elderly persons, and gender differences within age groups were not significant. Discussion Comparisons of disability across sociodemographic groups need to take DIF into account. Future research should examine the causes of DIF and develop alternative question wordings that reduce DIF effects.

140 citations


Journal ArticleDOI
TL;DR: In this article, two parametric procedures for detecting differential item functioning (DIF) using the Graded Response Model (GRM; Samejima, 1969), the GRM-Likelihood Ratio test and the GRM-Differential Functioning of Items and Tests (GRM-DFIT; Flowers, Oshima, & Raju, 1999), were compared with a nonparametric DIF detection procedure, Poly-SIBTEST (Chang, Mazzeo, & Roussos, 1996), in a simulation study.
Abstract: Two parametric procedures for detecting differential item functioning (DIF) using the Graded Response Model (GRM; Samejima, 1969)-the GRM-Likelihood Ratio test (GRM-LR; Thissen, Steinberg, & Gerard, 1986) and the GRM-Differential Functioning of Items and Tests (GRM-DFIT; Flowers, Oshima, & Raju, 1999)-were compared with a nonparametric DIF detection procedure, Poly-SIBTEST (Chang, Mazzeo, & Roussos, 1996), in a simulation study. The 3 DIF procedures were examined (a) under conditions in which the GRM provided an exact fit to the data and (b) under conditions of slight model misfit. Small amounts of model misfit were simulated by applying the GRM-based DIF procedures to data generated from alternative item response models. Although all 3 DIF procedures adhered to nominal Type I error rates when data were generated from the GRM, the GRM-LR demonstrated large Type I error inflation under certain conditions when the generating model was not the GRM. GRM-DFIT showed much less Type I error inflation under model...

131 citations
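
As background for the procedures compared above, the sketch below shows how Samejima's graded response model assigns category probabilities to a polytomous item. It is a minimal illustration with hypothetical parameter values, not code from the simulation study.

    # Minimal sketch of Samejima's (1969) graded response model, which underlies the
    # GRM-LR and GRM-DFIT procedures compared above. Parameter values are hypothetical;
    # this is not code from the simulation study.
    import numpy as np

    def grm_category_probs(theta, a, thresholds):
        """P(X = k | theta) for k = 0..K for one polytomous item.

        a          : item discrimination
        thresholds : ordered between-category location parameters b_1 < ... < b_K
        """
        thresholds = np.asarray(thresholds, dtype=float)
        p_star = 1.0 / (1.0 + np.exp(-a * (theta - thresholds)))   # P(X >= k | theta)
        upper = np.concatenate(([1.0], p_star))
        lower = np.concatenate((p_star, [0.0]))
        return upper - lower                        # category probabilities, sum to 1

    probs = grm_category_probs(theta=0.5, a=1.4, thresholds=[-1.0, 0.2, 1.3])
    print(np.round(probs, 3), probs.sum())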


Journal ArticleDOI
TL;DR: The authors' analyses show that failing to account for DIF results in an approximately 1.6% overestimation of the magnitude of difference in assessed cognition between high- and low-education groups, and item bias does not appear to be a major source of observed differences in cognitive status by educational attainment.
Abstract: Years of completed education is a powerful correlate of performance on mental status assessment. This analysis evaluates differences in cognitive performance attributable to level of education and sex. We analyzed Mini-Mental State Examination responses from a large community sample (Epidemiologic Catchment Area study, N = 8,556), using a structural equation analytic framework grounded in item response theory. Significant sex and education group differential item functioning (DIF) were detected. Those with low education were more likely to err on the first serial subtraction, spell world backwards, repeat phrase, write, name season, and copy design tasks. Women were more likely to err on all serial subtractions, men on spelling and other language tasks. The magnitude of detected DIF was small. Our analyses show that failing to account for DIF results in an approximately 1.6% overestimation of the magnitude of difference in assessed cognition between high- and low-education groups. In contrast, nearly all (95%) of apparent sex differences underlying cognitive impairment are due to DIF. Therefore, item bias does not appear to be a major source of observed differences in cognitive status by educational attainment. Adjustments of total scores that eliminate education group differences are not supported by these results. Our results have implications for future research concerning education and risk for dementia.

128 citations


Journal ArticleDOI
TL;DR: To examine variations in the manifestation of depressive symptomatology across racial/ethnic groups, analyses of differential item functioning (DIF) on the Center for Epidemiologic Studies Depression Scale were conducted separately for representative samples of young adult African-Americans, US-born Hispanics, and foreign-born Hispanics, with non-Hispanic whites as the reference group.
Abstract: To examine variations in the manifestation of depressive symptomatology across racial/ethnic groups, analyses of differential item functioning (DIF) on the Center for Epidemiologic Studies Depression Scale (CES-D) were separately conducted for representative samples of young adults in the following groups: African-Americans (n = 434), Hispanics born in the US (n = 493), and Hispanics born outside the US (n = 395). Non-Hispanic whites (n = 463) were employed as the reference group in all analyses. The effects of gender and age were controlled. DIF analyses indicated that: (1) about half of the CES-D items functioned differently among non-Hispanic whites compared to each of the other racial/ethnic groups; (2) the manifestation of symptoms seemed to be similar for both Hispanic groups, except for low positive affect; (3) African-Americans tended to favor somatic symptoms over affective (depressive) symptoms; (4) Immigrant Hispanics appeared to inhibit the expression of positive affect, and thus more high scorers on the total CES-D were observed within this subgroup. In contrast, no differences were observed when only negative items were considered. The use of positive affect items might artifactually induce spurious differences among people who were born outside the United States or North America.

121 citations


Journal ArticleDOI
TL;DR: In this article, the behavior of item and person statistics obtained from two measurement frameworks, item response theory (IRT) and classical test theory (CTT), was examined using Monte Carlo techniques with simulated test data.
Abstract: Despite the well-known theoretical advantages of item response theory (IRT) over classical test theory (CTT), research examining their empirical properties has failed to reveal consistent, demonstrable differences. Using Monte Carlo techniques with simulated test data, this study examined the behavior of item and person statistics obtained from these two measurement frameworks. The findings suggest IRT- and CTT-based item difficulty and person ability estimates were highly comparable, invariant, and accurate in the test conditions simulated. However, whereas item discrimination estimates based on IRT were accurate across most of the experimental conditions, CTT-based item discrimination estimates proved accurate under some conditions only. Implications of the results of this study for psychometric item analysis and item selection are discussed.

117 citations
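
A minimal sketch of the classical test theory side of such a comparison is given below: item difficulty as the proportion correct and item discrimination as the corrected item-total correlation, computed on a small simulated data set. The data-generating values are hypothetical, and the IRT calibration step (which would require an estimation routine) is omitted.

    # Hypothetical-data sketch (not from the study): classical test theory item
    # statistics computed from a simulated 0/1 response matrix. The corresponding IRT
    # calibration would require an estimation package and is omitted here.
    import numpy as np

    rng = np.random.default_rng(0)
    n_persons, n_items = 1000, 20
    theta = rng.normal(size=(n_persons, 1))
    a = rng.uniform(0.6, 1.8, size=n_items)         # generating discriminations
    b = rng.normal(size=n_items)                    # generating difficulties
    p_correct = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    responses = (rng.uniform(size=p_correct.shape) < p_correct).astype(int)

    total = responses.sum(axis=1)
    p_values = responses.mean(axis=0)               # CTT item difficulty (p-value)
    item_total_r = np.array([                       # CTT discrimination: corrected
        np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]   # item-total r
        for j in range(n_items)
    ])
    print("first five p-values:      ", np.round(p_values[:5], 2))
    print("first five item-total r's:", np.round(item_total_r[:5], 2))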


Journal ArticleDOI
TL;DR: The authors demonstrated the application of an innovative item response theory (IRT) based approach to evaluate measurement equivalence, comparing a newly developed Spanish version of the Posttraumatic Stress Disorder Checklist-Civilian Version (PCL-C) with the established English version.
Abstract: This study demonstrated the application of an innovative item response theory (IRT) based approach to evaluating measurement equivalence, comparing a newly developed Spanish version of the Posttraumatic Stress Disorder Checklist–Civilian Version (PCL–C) with the established English version. Basic principles and practical issues faced in the application of IRT methods for instrument evaluation are discussed. Data were derived from a study of the mental health consequences of community violence in both Spanish speakers (n = 102) and English speakers (n = 284). Results of differential item functioning (DIF) analyses revealed that the 2 versions were not fully equivalent on an item-by-item basis in that 6 of the 17 items displayed uniform DIF. No bias was observed, however, at the level of the composite PCL–C scale score, indicating that the 2 language versions can be combined for scale-level analyses.

114 citations


Journal ArticleDOI
TL;DR: In this paper, the effect of item parameter drift on ability estimates under item response theory was investigated; under the conditions simulated, drift had only a small effect on ability estimation under the two-parameter logistic model.
Abstract: This study investigated the effect of item parameter drift on ability estimates under item response theory. Item response data for two testing occasions were simulated for the two-parameter logistic model under the following crossed conditions: test length, sample size, percentage of drifting items, and type of drift. Results indicated that item parameter drift, under the conditions simulated, had a small effect on ability estimates.

82 citations
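
The following sketch illustrates the general kind of design described above in a greatly simplified form: responses are generated under a two-parameter logistic model whose difficulties have drifted on a subset of items, and abilities are then scored with the original parameters. All settings are hypothetical, and the scoring method (a simple EAP over a quadrature grid) is chosen for brevity rather than to match the study.

    # Greatly simplified sketch of the kind of design described above: responses are
    # generated under a 2PL model whose difficulties have drifted on 10% of items, and
    # abilities are then scored with the original (undrifted) parameters. All settings
    # are hypothetical; EAP scoring over a quadrature grid is used only for brevity.
    import numpy as np

    rng = np.random.default_rng(1)
    n_items, n_persons, drift = 40, 2000, 0.5

    a = rng.uniform(0.8, 1.6, n_items)
    b = rng.normal(size=n_items)
    b_drifted = b.copy()
    drift_idx = rng.choice(n_items, n_items // 10, replace=False)
    b_drifted[drift_idx] += drift                   # uniform b-drift on 10% of items

    theta = rng.normal(size=(n_persons, 1))
    p = 1.0 / (1.0 + np.exp(-a * (theta - b_drifted)))
    x = (rng.uniform(size=p.shape) < p).astype(int)

    # EAP ability estimates computed with the ORIGINAL item parameters
    grid = np.linspace(-4, 4, 81)
    prior = np.exp(-0.5 * grid**2)
    pg = 1.0 / (1.0 + np.exp(-a[None, :] * (grid[:, None] - b[None, :])))
    loglik = x @ np.log(pg).T + (1 - x) @ np.log(1 - pg).T
    post = np.exp(loglik - loglik.max(axis=1, keepdims=True)) * prior
    eap = (post * grid).sum(axis=1) / post.sum(axis=1)

    print("mean bias in ability estimates:", round(float(np.mean(eap - theta.ravel())), 3))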


Journal ArticleDOI
TL;DR: Differential item functioning (DIF) is present when an item displays different statistical properties for different groups after matching the groups on an ability measure; as discussed by the authors, its occurrence can be explained by recognizing that the observed data do not reflect a homogeneous population of individuals, but are a mixture of data from multiple latent populations or classes.
Abstract: Differential item functioning (DIF) may be defined as an item that displays different statistical properties for different groups after matching the groups on an ability measure. For instance, with binary data, DIF exists when there is a difference in the conditional probabilities of a correct response for two manifest groups. This article argues that the occurrence of DIF can be explained by recognizing that the observed data do not reflect a homogeneous population of individuals, but are a mixture of data from multiple latent populations or classes. This conceptualization of DIF hypothesizes that when one observes DIF using the current conceptualization of DIF it is only to the degree that the manifest groups are represented in the latent classes in different proportions. A Monte Carlo study was conducted to compare various approaches to detecting DIF under this formulation of DIF. Results showed that as the latent class proportions became more equal the DIF detection methods identification rates approa...

77 citations
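
The latent-class account of DIF sketched in the abstract can be illustrated with a small simulation, shown below. Two latent classes answer a studied item with different difficulty; two manifest groups are mixtures of those classes, and the conditional group difference on the item grows as the mixing proportions diverge. The code is illustrative only, uses hypothetical values, and applies a crude conditioning band rather than any of the DIF detection methods examined in the article.

    # Small simulation illustrating the latent-class account of DIF described above
    # (illustrative values only, and a crude conditioning band rather than any of the
    # detection methods examined in the article). Two latent classes answer a studied
    # item with different difficulty; two manifest groups mix those classes in
    # different proportions, and the conditional group difference grows accordingly.
    import numpy as np

    rng = np.random.default_rng(2)

    def simulate_group(n, prop_class2):
        theta = rng.normal(size=n)
        in_class2 = rng.uniform(size=n) < prop_class2
        b_studied = np.where(in_class2, 1.0, 0.0)   # studied item is harder for class 2
        p = 1.0 / (1.0 + np.exp(-(theta - b_studied)))
        return theta, (rng.uniform(size=n) < p).astype(int)

    for prop_focal in (0.5, 0.7, 0.9):              # reference group fixed at 0.5
        th_r, x_r = simulate_group(20000, 0.5)
        th_f, x_f = simulate_group(20000, prop_focal)
        band_r = x_r[np.abs(th_r) < 0.25].mean()    # proportion correct near theta = 0
        band_f = x_f[np.abs(th_f) < 0.25].mean()
        print(f"focal class-2 proportion {prop_focal:.1f}: "
              f"conditional difference = {band_r - band_f:+.3f}")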


Journal ArticleDOI
TL;DR: Age-related measurement bias in responses to items on the revised Beck Depression Inventory (BDI) in depressed late-life patients versus midlife patients was examined, and IRT results indicated that late-life patients tended to report fewer cognitive symptoms, especially at low to average levels of depression.
Abstract: The present analyses examined age-related measurement bias in responses to items on the revised Beck Depression Inventory (BDI) in depressed late-life patients versus midlife patients. Item response theory (IRT) models were used to equate the scale and to differentiate true-group differences from bias in measurement in the 2 samples. Baseline BDI data (218 late-life and 613 midlife) were used for the present analysis. IRT results indicated that late-life patients tended to report fewer cognitive symptoms, especially at low to average levels of depression. Conversely, they tended to report more somatic symptoms, especially at higher levels of depression. Adjusted cutoff scores in the late-life group are provided, and possible reasons for age-related differences in the performance of the BDI are discussed.

72 citations


Book ChapterDOI
01 Jan 2002
TL;DR: A class of multivariate models combining features of Rasch type models with features of graphical interaction models into a common framework for analysis of criterion related construct validity and differential item functioning is defined.
Abstract: This paper defines a class of multivariate models combining features of Rasch type models with features of graphical interaction models into a common framework for analysis of criterion-related construct validity and differential item functioning. Item analysis by graphical Rasch models is illustrated with a reanalysis of a summary health scale counting the number of symptoms experienced within the last six months.

Journal ArticleDOI
TL;DR: In this article, a hierarchical logistic regression approach is proposed to identify consistent sources of DIF, to quantify the proportion of explained variation in DIF coefficients, and to compare the predictive accuracy of alternate explanations for DIF.
Abstract: Over the past 25 years a range of parametric and nonparametric methods have been developed for analyzing Differential Item Functioning (DIF). These procedures are typically performed for each item individually or for small numbers of related items. Because the analytic procedures focus on individual items, it has been difficult to pool information across items to identify potential sources of DIF analytically. In this article, we outline an approach to DIF analysis using hierarchical logistic regression that makes it possible to combine results of logistic regression analyses across items to identify consistent sources of DIF, to quantify the proportion of explained variation in DIF coefficients, and to compare the predictive accuracy of alternate explanations for DIF. The approach can also be used to improve the accuracy of DIF estimates for individual items by applying empirical Bayes techniques, with DIF-related item characteristics serving as collateral information. To illustrate the hierarchical logi...
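
The item-level building block that such a hierarchical analysis pools across items is ordinary logistic regression DIF screening. The sketch below shows the usual three nested models (matching score only, plus group, plus group-by-score interaction) on simulated data; it assumes the statsmodels package is available and is not the article's hierarchical or empirical Bayes machinery.

    # Sketch of the item-level building block that the hierarchical approach pools
    # across items: logistic regression DIF screening with the usual three nested
    # models. Data are simulated, the matching score is crude, and the statsmodels
    # package is assumed to be installed; this is not the article's hierarchical or
    # empirical Bayes machinery.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n = 4000
    group = rng.integers(0, 2, n)                   # 0 = reference, 1 = focal
    theta = rng.normal(size=n)
    p_item = 1.0 / (1.0 + np.exp(-(theta - 0.4 * group)))   # built-in uniform DIF
    y = (rng.uniform(size=n) < p_item).astype(int)
    total = y + rng.binomial(20, 1.0 / (1.0 + np.exp(-theta)))   # crude matching score

    def fit(columns):
        X = sm.add_constant(np.column_stack(columns))
        return sm.Logit(y, X).fit(disp=0)

    m_base = fit([total])                           # matching score only
    m_unif = fit([total, group])                    # + group (uniform DIF)
    m_nonu = fit([total, group, total * group])     # + interaction (nonuniform DIF)

    print("uniform-DIF LR chi-square (1 df):   ", round(2 * (m_unif.llf - m_base.llf), 2))
    print("nonuniform-DIF LR chi-square (1 df):", round(2 * (m_nonu.llf - m_unif.llf), 2))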

Journal ArticleDOI
TL;DR: In this article, three strategies are used for identifying adaptation and curricular differences as sources of DIF: judgmental reviews by multiple bilingual translators of all items, cross-validation of differential item functioning in multiple groups, and examination of the distribution of the DIF items by topic.
Abstract: This article describes and discusses strategies used in disentangling sources of differential item functioning (DIF) in multilanguage assessments where multiple factors are expected to be causing DIF. Three strategies are used for identifying adaptation and curricular differences as sources of DIF: (a) judgmental reviews by multiple bilingual translators of all items, (b) cross-validation of DIF in multiple groups, and (c) examination of the distribution of DIF items by topic. Twenty-seven percent of the mathematics DIF items and 37% of the science DIF items were interpreted to be due to adaptation-related differences based on judgmental reviews. Most of these interpretations were also supported by the cross-validation analyses. Clustering of DIF items by topic provided curricular differences as interpretation for DIF only for small portions of the DIF items, approximately 23% of the mathematics DIF items and 13% of the science DIF items.

Journal ArticleDOI
TL;DR: Research related to the detection of item bias or differential item functioning (DIF) has proliferated in the psychometric and applied psychological literature over the last 25 years.
Abstract: Summary: Research related to the detection of item bias or differential item functioning (dif) has proliferated in psychometric and applied psychological literature over the last 25 years. In fact,...

Journal ArticleDOI
TL;DR: Two methods for developing a common metric for the graded response model under item response theory are compared: linking separate calibration runs using equating coefficients from the characteristic curve method and concurrent calibration using the combined data of the base and target groups.
Abstract: Developing a common metric is essential to successful applications of item response theory to practical testing problems, such as equating, differential item functioning, and computerized adaptive testing. In this study, the authors compared two methods for developing a common metric for the graded response model under item response theory: (a) linking separate calibration runs using equating coefficients from the characteristic curve method and (b) concurrent calibration using the combined data of the base and target groups. Concurrent calibration yielded consistently albeit only slightly smaller root mean square differences for both item discrimination and location parameters. Similar results were observed for distance measures between item parameter estimates and item parameters. Concurrent calibration also yielded consistently though only slightly smaller root mean square differences for ability than linking.
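
When separate calibrations are linked, the equating coefficients are used to place the target group's parameter estimates on the base group's metric. The sketch below shows that transformation for graded response model parameters; the coefficients and parameter values are hypothetical, and obtaining the coefficients themselves (for example by the characteristic curve method) is not shown.

    # Minimal sketch of the metric transformation used after linking separate
    # calibrations: once equating coefficients A and B are in hand (for example from
    # the characteristic curve method), the target group's graded response model
    # estimates are placed on the base group's scale. All numbers are hypothetical.
    import numpy as np

    A, B = 0.92, 0.15                               # assumed equating coefficients

    a_target = np.array([1.30, 0.95, 1.70])         # target-group discriminations
    b_target = np.array([[-1.2, 0.1, 1.0],          # target-group location parameters,
                         [-0.8, 0.4, 1.5],          # one row of ordered thresholds
                         [-1.5, -0.2, 0.9]])        # per item

    a_on_base_scale = a_target / A                  # a* = a / A
    b_on_base_scale = A * b_target + B              # b* = A * b + B

    print(np.round(a_on_base_scale, 3))
    print(np.round(b_on_base_scale, 3))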

Journal ArticleDOI
TL;DR: In this paper, data from a large-scale admissions test were used to evaluate the effect of local item dependence (LID) on test score reliability and examinee proficiency estimation; the results showed that unaccounted-for LID can bias these estimates.
Abstract: Measurement specialists routinely assume examinee responses to test items are independent of one another. However, previous research has shown that many contemporary tests contain item dependencies and not accounting for these dependencies leads to misleading estimates of item, test, and ability parameters. The goals of the study were (a) to review methods for detecting local item dependence (LID), (b) to discuss the use of testlets to account for LID in context-dependent item sets, (c) to apply LID detection methods and testlet-based item calibrations to data from a large-scale, high-stakes admissions test, and (d) to evaluate the results with respect to test score reliability and examinee proficiency estimation. Item dependencies were found in the test and these were due to test speededness or context dependence (related to passage structure). Also, the results highlight that steps taken to correct for the presence of LID and obtain less biased reliability estimates may impact on the estimation of examinee proficiency. The practical effects of the presence of LID on passage-based tests are discussed, as are issues regarding how to calibrate context-dependent item sets using item response theory.

The most basic unit of a test is the test item. Test development organizations spend more time and money developing and selecting items for inclusion on a test than any other aspect of the test construction process. Numerous test items are needed to (a) adequately span the content or construct domain tested, and (b) provide reliable estimates of examinee proficiencies. It has long been known that one way to increase test score reliability is to increase the number of items on a test. However, merely duplicating the same items will not accomplish the goal of reliable and valid measurement. Thus, test developers strive to develop items that provide unique information regarding examinee knowledge, skills, and abilities. Redundancy among items is not desirable. Items that do not make a unique contribution to an assessment do not increase construct representation and exacerbate any construct-irrelevant factors that may be associated with an item, such as prior familiarity with the item context. For this reason, what is now known as local item dependence (LID) must be considered in the development and scoring of achievement and aptitude tests. The concept of LID is best understood within the framework of item response theory (IRT). The most popular IRT models specify a single ability to account for all statistical relationships among test items as well as all differences among examinees. It is this underlying ability, typically denoted theta (θ), that distinguishes items with respect to difficulty and distinguishes examinees with respect to
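
A common way to screen for local item dependence is to examine correlations among item residuals after a single latent trait has been accounted for, as in Yen's Q3 statistic. The sketch below is only a rough approximation of that idea: it uses the standardized total score as a stand-in for an IRT ability estimate and simulated data with one deliberately dependent item pair.

    # Rough sketch of a Q3-style check for local item dependence: after removing what a
    # single trait explains, residuals for locally dependent items remain correlated.
    # The standardized total score stands in for an IRT ability estimate here, so this
    # only approximates Yen's Q3; data and parameter values are simulated.
    import numpy as np

    rng = np.random.default_rng(4)
    n_persons, n_items = 3000, 12
    theta = rng.normal(size=(n_persons, 1))
    b = np.linspace(-1.5, 1.5, n_items)
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    x = (rng.uniform(size=p.shape) < p).astype(int)

    # Inject a passage-style dependence between the last two items
    shared = rng.normal(size=(n_persons, 1))
    p_dep = 1.0 / (1.0 + np.exp(-(theta + 0.8 * shared - b[-2:])))
    x[:, -2:] = (rng.uniform(size=p_dep.shape) < p_dep).astype(int)

    z = (x.sum(axis=1) - x.sum(axis=1).mean()) / x.sum(axis=1).std()
    fitted = 1.0 / (1.0 + np.exp(-(z[:, None] - b)))       # crude expected scores
    resid = x - fitted
    q3 = np.corrcoef(resid, rowvar=False)
    print("residual correlation, items 11 and 12:", round(float(q3[-1, -2]), 3))
    print("median off-diagonal residual corr.:   ",
          round(float(np.median(q3[np.triu_indices(n_items, k=1)])), 3))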

Journal ArticleDOI
TL;DR: In this article, a distinction is made between absolute and relative measurement, where the scale of measurement is expressed in terms of the within-group position on a trait, and it is shown that items for relative measurement will produce bias as classically defined if the mean and/or variance of the trait distribution differ between groups.
Abstract: In this article, a distinction is made between absolute and relative measurement. Absolute measurement refers to the measurement of traits on a group-invariant scale, and relative measurement refers to the within-group measurement of traits, where the scale of measurement is expressed in terms of the within-group position on a trait. Relative measurement occurs, for example, if an item induces a within-group comparison in respondents. These distinctions are discussed within the framework of measurement invariance, differentiating between absolute and relative forms of measurement invariance and bias. It is shown that items for relative measurement will produce bias as classically defined if the mean and/or variance of the trait distribution differ between groups. This form of bias, however, does not result from multidimensionality but from the fact that measurement is on a relative scale. A logistic regression procedure for the detection of relative measurement invariance and bias is proposed, as well as ...

Journal ArticleDOI
TL;DR: A class of locally dependent latent trait models based on a family of conditional distributions that describes joint multiple item responses as a function of student latent trait, not assuming conditional independence is proposed.
Abstract: In this paper, we propose a class of locally dependent latent trait models for responses to psychological and educational tests. Typically, item response models treat an individual's multiple responses to stimuli as conditionally independent given the individual's latent trait. In this paper, the focus is instead on models based on a family of conditional distributions, or kernel, that describes joint multiple item responses as a function of student latent trait, not assuming conditional independence. Specifically, we examine a hybrid kernel which comprises a component for one-way item response functions and a component for conditional associations between items given latent traits. The class of models allows the extension of item response theory to cover some new and innovative applications in psychological and educational research. An EM algorithm for marginal maximum likelihood estimation of the hybrid kernel model is proposed. Furthermore, we delineate the relationship of the class of locally dependent models and the log-linear model by revisiting the Dutch identity (Holland, 1990).

Journal ArticleDOI
TL;DR: Using the Rasch version of Thalbourne's Manic-Depressiveness Scale (MDS), this paper found, for example, that women are more likely than equally depressive men to worry about "being poor".

Journal ArticleDOI
TL;DR: Data analytic steps for IRT modeling are reviewed for evaluating item quality and differential item functioning across subgroups of gender, age, and smoking status and implications and challenges in the use of these methods for tobacco onset research and for assessing the developmental trajectories of smoking among youth are discussed.

Journal ArticleDOI
TL;DR: The detection of item bias using Item Response Theory (IRT) arose in the cognitive testing domain, so the phenomenon is almost invariably perceived as undesirable; the authors argue, however, that DIF-prone items can afford valuable insights into the nature of the construct under study, especially where group differences are important.
Abstract: A form of item bias known as Differential Item Functioning (DIF) occurs when two individuals with the same trait levels but different group membership do not have the same probability of endorsing an item in the keyed direction. The detection of DIF using Item Response Theory (IRT) arose in the cognitive testing domain, so the phenomenon is almost invariably perceived as undesirable. With the extension of IRT procedures to substantive areas of psychology, it is argued that DIF-prone items can afford valuable insights into the nature of the construct under study, especially where group differences are important. An example is presented using responses from 568 students completing a popular measure of Openness to Experience.

Journal ArticleDOI
TL;DR: The authors investigated the impact of item feature variation on item statistical characteristics and the degree to which such information could be used as collateral information to supplement examinee performance data and reduce pretest sample size.
Abstract: In this study we investigated the impact of systematic item feature variation on item statistical characteristics and the degree to which such information could be used as collateral information to supplement examinee performance data and reduce pretest sample size. Two families of word problem variants for the quantitative section of the Graduate Record Examinations General Test were generated by systematically manipulating item features. For rate problems, the item design features affected item difficulty (adjusted R2 = .90), item discrimination (adjusted R2 = .50), and guessing (adjusted R2 = .41). For probability problems the item design features affected difficulty (adjusted R2 = .61) but not discrimination or guessing. The results demonstrate the enormous potential of systematically creating item variants. The issue of how to develop a knowledge base that would support the systematic generation of a wider variety of quantitative problems is discussed.


Journal ArticleDOI
TL;DR: In this article, the effects of test purification in detecting differential item functioning (DIF) by means of polytomous extensions of the Raju area measures and the Lord statistic were examined.
Abstract: The detection of differential item functioning (DIF) in polytomous response items is an area of recent research interest. Cohen and colleagues proposed a polytomous extension of the Lord statistic and the Raju exact area measures for items that fit the graded response model. This study examined the effects of test purification in detecting DIF by means of polytomous extensions of the Raju area measures and the Lord statistic. The factors manipulated were percentage of DIF items in the test (5%, 10%, and 20%), amount of DIF (0.2, 0.4, and 0.8), sample size (250, 500, and 1,000), and test purification (noniterative versus two-stage). The results of this study suggest that the use of the two-stage equating procedure with Z(SA) and χ2-LORD reduces the percentage of false positives and improves the detection of DIF.
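
The two-stage (purification) logic can be summarized in a short sketch: items flagged in a first pass are dropped from the matching criterion before the final pass. The code below is schematic; the dif_statistic function is a simple placeholder rather than the Raju area measure or Lord chi-square used in the study, and the flagging threshold is arbitrary.

    # Schematic sketch of two-stage (purified) DIF screening: items flagged in a first
    # pass are dropped from the matching criterion before the final pass. The
    # dif_statistic function is a simple placeholder, not the Raju area measure or the
    # Lord chi-square used in the study, and the flagging threshold is arbitrary.
    import numpy as np

    def dif_statistic(resp_ref, resp_foc, match_ref, match_foc, item):
        """Placeholder DIF index: mean group difference within matching-score strata."""
        cuts = np.quantile(np.concatenate([match_ref, match_foc]), [0.25, 0.5, 0.75])
        diffs = []
        for lo, hi in zip(np.r_[-np.inf, cuts], np.r_[cuts, np.inf]):
            in_r = (match_ref >= lo) & (match_ref < hi)
            in_f = (match_foc >= lo) & (match_foc < hi)
            if in_r.any() and in_f.any():
                diffs.append(resp_ref[in_r, item].mean() - resp_foc[in_f, item].mean())
        return abs(np.mean(diffs))

    def two_stage_dif(resp_ref, resp_foc, threshold=0.05):
        n_items = resp_ref.shape[1]
        # Stage 1: match on the full total score
        stage1 = [dif_statistic(resp_ref, resp_foc, resp_ref.sum(1), resp_foc.sum(1), j)
                  for j in range(n_items)]
        flagged = [j for j, s in enumerate(stage1) if s > threshold]
        # Stage 2: re-match on a purified score that excludes the flagged items
        keep = [j for j in range(n_items) if j not in flagged]
        pur_ref, pur_foc = resp_ref[:, keep].sum(1), resp_foc[:, keep].sum(1)
        stage2 = [dif_statistic(resp_ref, resp_foc, pur_ref, pur_foc, j)
                  for j in range(n_items)]
        return [j for j, s in enumerate(stage2) if s > threshold]

    rng = np.random.default_rng(5)
    demo_ref = (rng.uniform(size=(800, 10)) < 0.6).astype(int)
    demo_foc = (rng.uniform(size=(800, 10)) < 0.6).astype(int)
    print("items flagged after purification:", two_stage_dif(demo_ref, demo_foc))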

Journal ArticleDOI
TL;DR: In this paper, an empirical Bayes (EB) enhancement of the popular Mantel-Haenszel (MH) DIF analysis method was used to investigate the applicability to computerized adaptive test data of a differential item functioning (DIF) analysis method developed by Zwick, Thayer, and Lewis.
Abstract: This study used a simulation to investigate the applicability to computerized adaptive test data of a differential item functioning (DIF) analysis method developed by Zwick, Thayer, and Lewis. The approach involves an empirical Bayes (EB) enhancement of the popular Mantel-Haenszel (MH) DIF analysis method. Results showed the performance of the EB DIF approach to be quite promising, even in extremely small samples. In particular, the EB procedure was found to achieve roughly the same degree of stability for samples averaging 117 and 40 members in the two examinee groups as did the ordinary MH for samples averaging 240 in each of the two groups. Overall, the EB estimates tended to be closer to their target values than did the ordinary MH statistics in terms of root mean square residuals; the EB statistics were also more highly correlated with the target values than were the MH statistics. When combined with a loss-function-based decision rule, the EB method is better at detecting DIF than conventional appro...
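
For readers unfamiliar with the underlying statistic, the sketch below computes the Mantel-Haenszel common odds ratio for a studied item, expresses it on the ETS delta metric (MH D-DIF), and applies a toy empirical Bayes shrinkage of the item-level values toward their common mean. It uses simulated data and simplified formulas; the Zwick, Thayer, and Lewis procedure is considerably more elaborate.

    # Simplified, simulated sketch: the Mantel-Haenszel common odds ratio for a studied
    # item, re-expressed on the ETS delta metric (MH D-DIF = -2.35 * ln(alpha_MH)),
    # followed by a toy empirical Bayes shrinkage of the item-level values toward their
    # common mean. The Zwick, Thayer, and Lewis procedure is considerably more
    # elaborate than this; the assumed sampling variance of 0.05 is arbitrary.
    import numpy as np

    rng = np.random.default_rng(6)

    def mh_d_dif(x_item, group, matching):
        """MH D-DIF for one item, stratifying on the matching score."""
        num, den = 0.0, 0.0
        for s in np.unique(matching):
            m = matching == s
            a = np.sum((x_item == 1) & (group == 0) & m)    # reference correct
            b = np.sum((x_item == 0) & (group == 0) & m)    # reference incorrect
            c = np.sum((x_item == 1) & (group == 1) & m)    # focal correct
            d = np.sum((x_item == 0) & (group == 1) & m)    # focal incorrect
            n_s = a + b + c + d
            if n_s > 0:
                num += a * d / n_s
                den += b * c / n_s
        return -2.35 * np.log(num / den)

    # Simulate a short test in which only item 0 disadvantages the focal group
    n = 3000
    group = rng.integers(0, 2, n)
    theta = rng.normal(size=(n, 1))
    b_items = np.linspace(-1, 1, 10)
    dif_shift = np.outer(group, np.r_[0.5, np.zeros(9)])
    p = 1.0 / (1.0 + np.exp(-(theta - b_items - dif_shift)))
    x = (rng.uniform(size=p.shape) < p).astype(int)
    matching = x.sum(axis=1)

    d_dif = np.array([mh_d_dif(x[:, j], group, matching) for j in range(10)])

    # Toy empirical Bayes step: pull each noisy estimate toward the across-item mean
    prior_var = max(float(d_dif.var()) - 0.05, 1e-6)
    weight = prior_var / (prior_var + 0.05)
    eb = weight * d_dif + (1 - weight) * d_dif.mean()
    print("MH D-DIF:", np.round(d_dif, 2))
    print("EB D-DIF:", np.round(eb, 2))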

Journal ArticleDOI
TL;DR: This article evaluated the effects of calculator use on performance on the SAT I: Reasoning Test in Mathematics; questions about use of the calculator were inserted into the answer sheets for the November 1996 and November 1997 administrations of the examination.
Abstract: To evaluate the effects of calculator use on performance on the SAT I: Reasoning Test in Mathematics, questions about use of the calculator on the test were inserted into the answer sheets for the November 1996 and November 1997 administrations of the examination. Overall, nearly all examinees indicated that they brought a calculator to the test, and about two thirds reported using them on one third or more of the math items. Some group differences in the use of calculators were observed, with girls using them more frequently than boys, and Whites and Asian Americans using them more often than other racial or ethnic groups. Use of calculators was associated with higher test performance, but the more able students were more likely to have calculators and used them more often. The results were analyzed further using multiple regression and differential item functioning procedures. The degree of speededness under different degrees of calculator use was also examined. Overall, the effects of calculator use were ...

Journal ArticleDOI
TL;DR: In this paper, the hypothesized superiority of the item response model (IRM) is tested against structural equation modeling (SEM) for responses to the Center for Epidemiologic Studies-Depression (CES-D) scale.
Abstract: The sample invariance of item discrimination statistics is evaluated in this case study using real data. The hypothesized superiority of the item response model (IRM) is tested against structural equation modeling (SEM) for responses to the Center for Epidemiologic Studies-Depression (CES-D) scale. Responses from 10 random samples of 500 people were drawn from a base sample of 6,621 participants across gender, age, and different health groups. Hierarchical tests of multiple-group structural equation models indicated statistically significant differences exist in item regressions across contrast groups. Although the IRM item discrimination estimates were most stable in all conditions of this case study, additional research on the precision of individual scores and possible item bias is required to support the validity of either model for scoring the CES-D. The SEM approach to examining between-group differences holds promise for any field where heterogeneous populations are assessed and important consequen...

Journal ArticleDOI
TL;DR: In this paper, the asymptotic standard errors of item/test response function estimates are derived by the delta method for the three-parameter logistic model.
Abstract: The asymptotic standard errors of item/test response function estimates are derived by the delta method for the three-parameter logistic model. Using a similar method, the asymptotic standard error...
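
The delta-method idea can be sketched directly: the variance of the estimated item response function at a given theta is approximately g'Vg, where g is the gradient of the 3PL response function with respect to the item parameters and V is their asymptotic covariance matrix. In the illustration below both the parameter estimates and V are hypothetical.

    # Numerical sketch of a delta-method standard error for the three-parameter
    # logistic item response function: Var[P(theta)] is approximated by g'Vg, where g
    # is the gradient of P with respect to (a, b, c) and V is the asymptotic covariance
    # matrix of the parameter estimates. Both the estimates and V are hypothetical.
    import numpy as np

    D = 1.7                                         # logistic scaling constant

    def p3pl(theta, a, b, c):
        return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

    def gradient(theta, a, b, c):
        L = 1 / (1 + np.exp(-D * a * (theta - b)))
        return np.array([(1 - c) * D * (theta - b) * L * (1 - L),   # dP/da
                         -(1 - c) * D * a * L * (1 - L),            # dP/db
                         1 - L])                                    # dP/dc

    a_hat, b_hat, c_hat = 1.2, 0.3, 0.18            # hypothetical item estimates
    V = np.array([[0.040, 0.004, 0.001],            # hypothetical covariance matrix of
                  [0.004, 0.025, 0.002],            # (a_hat, b_hat, c_hat)
                  [0.001, 0.002, 0.003]])

    for theta in (-1.0, 0.0, 1.0):
        g = gradient(theta, a_hat, b_hat, c_hat)
        se = float(np.sqrt(g @ V @ g))
        print(f"theta = {theta:+.1f}:  P = {p3pl(theta, a_hat, b_hat, c_hat):.3f},  "
              f"delta-method SE = {se:.3f}")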

Book ChapterDOI
01 Jan 2002
TL;DR: This article discusses basic concepts of quality of life (QOL) measurement and item response theory (IRT); in a way complementary to traditional methods, IRT models can be applied to analyze and interpret QOL data collected in various settings.
Abstract: This article discusses basic concepts of quality of life (QOL) measurement and item response theory (IRT). In a way complementary to that of traditional methods, IRT models can be applied to analyze and interpret QOL data collected in various settings. Growing interest in precise QOL measurement in research and clinical settings demands the development of psychometrically sound and clinically meaningful measurement tools. This in turn contributes to the appropriate use of QOL data that are collected. Advances in IRT, also referred to as modern test theory, make it possible for one to more critically evaluate questionnaire performance at its initial development and subsequent refinement and validation. It also offers better methodology to make interpretation of QOL data and comparisons between different populations or occasions more meaningful by converting ordinal observations into linear measures. Empirical results from different studies are provided to assist in the understanding of different IRT models and their applications. It is feasible and promising to integrate IRT models and advanced computer technology to develop a computerized adaptive testing (CAT) platform to deliver tailored tests and arrive at more precise QOL measurement. Administration of more targeted test items according to a patient's level of health via CAT, with real-time scoring and reporting, is not just a possibility but a reality. This can facilitate better use of QOL information between patients and physicians, and ultimately improve patient care.
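
The adaptive-testing idea mentioned in the chapter can be sketched in a few lines: after each response, the unused item with maximum Fisher information at the current ability estimate is administered next. The code below is a toy 2PL example with a simulated item bank and a simple EAP update; operational CAT systems add content balancing, exposure control, and stopping rules.

    # Toy sketch of the adaptive-testing idea mentioned above: after each response, the
    # unused item with maximum Fisher information at the current ability estimate is
    # administered next. The 2PL item bank, the examinee, and the simple EAP update are
    # all simulated and hypothetical; operational CAT systems add content balancing,
    # exposure control, and stopping rules.
    import numpy as np

    rng = np.random.default_rng(7)
    bank_a = rng.uniform(0.8, 2.0, 50)              # simulated item bank
    bank_b = rng.normal(size=50)
    true_theta = 0.7

    def prob(theta, a, b):
        return 1.0 / (1.0 + np.exp(-a * (theta - b)))

    def info(theta, a, b):
        p = prob(theta, a, b)
        return a**2 * p * (1 - p)                   # 2PL Fisher information

    grid = np.linspace(-4, 4, 161)
    theta_hat, administered, responses = 0.0, [], []
    for _ in range(10):                             # 10-item adaptive test
        unused = [j for j in range(50) if j not in administered]
        nxt = max(unused, key=lambda k: info(theta_hat, bank_a[k], bank_b[k]))
        administered.append(nxt)
        responses.append(int(rng.uniform() < prob(true_theta, bank_a[nxt], bank_b[nxt])))
        like = np.exp(-0.5 * grid**2)               # normal prior
        for j, r in zip(administered, responses):
            pj = prob(grid, bank_a[j], bank_b[j])
            like = like * (pj if r else 1 - pj)
        theta_hat = float(np.sum(grid * like) / np.sum(like))

    print("administered items:", administered)
    print("final ability estimate:", round(theta_hat, 2), "(true value 0.7)")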

Journal ArticleDOI
TL;DR: In this paper, a transposition of the usual person-item matrices is used to enhance diagnostic assessment in which individual differences in scores between content domains are clarified by conditioning the scores on item difficulty.
Abstract: The definitions, methods, and interpretations of differential item functioning are extended to the transpose of the usual person-item matrices. The primary purpose is to enhance diagnostic assessment in which individual differences in scores between content domains are clarified by conditioning the scores on item difficulty. Three examples are used to illustrate this approach with data from the mathematics section of the California Achievement Test using the Mantel-Haenszel procedure. The term differential person functioning is suggested.