
Showing papers on "Differential item functioning published in 1991"


Book
23 Jul 1991
TL;DR: This book introduces the fundamentals of item response theory, covering basic concepts and models, ability and item parameter estimation, assessment of model-data fit, the ability scale, item and test information and efficiency functions, test construction, identification of potentially biased test items, test score equating, computerized adaptive testing, and future directions of the field.
Abstract: Contents: Background Concepts, Models, and Features; Ability and Item Parameter Estimation; Assessment of Model-Data Fit; The Ability Scale; Item and Test Information and Efficiency Functions; Test Construction; Identification of Potentially Biased Test Items; Test Score Equating; Computerized Adaptive Testing; Future Directions of Item Response Theory.

2,583 citations


Journal ArticleDOI
TL;DR: It is concluded that for multidimensional data a common factor analysis on the matrix of tetrachoric correlations performs at least as well as the theoretically appropriate multidimensional item response models.
Abstract: Many factor analysis and multidimensional item response models for dichotomous variables have been proposed in the literature. The models and various methods for estimating the item parameters are reviewed briefly. In a simulation study these methods are compared with respect to their estimates of the item parameters, both in terms of an item response theory formulation and in terms of a factor analysis formulation. It is concluded that for multidimensional data a common factor analysis on the matrix of tetrachoric correlations performs at least as well as the theoretically appropriate multidimensional item response models.
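
As background for the factor-analytic approach compared above, the sketch below estimates a single tetrachoric correlation from the 2x2 table of two dichotomous items by maximum likelihood; a full analysis would assemble such correlations into a matrix and factor-analyze it. This is a minimal illustration only, not the authors' code, and the function name and interface are assumptions.

# Minimal maximum-likelihood tetrachoric correlation (illustrative sketch).
# Each item is assumed to dichotomize a standard bivariate normal latent
# variable at a threshold; rho is the latent correlation to be estimated.
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import minimize_scalar

def tetrachoric(table):
    """table[i][j] = number of examinees scoring i on item 1 and j on item 2 (0/1)."""
    n = np.asarray(table, dtype=float)
    p1 = n[1, :].sum() / n.sum()                   # proportion correct on item 1
    p2 = n[:, 1].sum() / n.sum()                   # proportion correct on item 2
    t1, t2 = norm.ppf(1 - p1), norm.ppf(1 - p2)    # dichotomization thresholds

    def negloglik(rho):
        bvn = multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]])
        p00 = bvn.cdf([t1, t2])                    # both items incorrect
        p01 = norm.cdf(t1) - p00                   # item 1 incorrect, item 2 correct
        p10 = norm.cdf(t2) - p00                   # item 1 correct, item 2 incorrect
        p11 = 1.0 - p00 - p01 - p10                # both items correct
        probs = np.clip([p00, p01, p10, p11], 1e-12, 1.0)
        counts = [n[0, 0], n[0, 1], n[1, 0], n[1, 1]]
        return -sum(c * np.log(p) for c, p in zip(counts, probs))

    return minimize_scalar(negloglik, bounds=(-0.99, 0.99), method='bounded').x

# Example: tetrachoric([[40, 10], [15, 35]]) returns a clearly positive correlation.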

189 citations


Journal ArticleDOI
TL;DR: Differential item functioning (DIF) assessment attempts to identify items or item types for which subpopulations of examinees exhibit performance differentials that are not consistent with the performance differentials typically seen for those subpopulations on collections of items that purport to measure a common construct.
Abstract: Differential item functioning (DIF) assessment attempts to identify items or item types for which subpopulations of examinees exhibit performance differentials that are not consistent with the performance differentials typically seen for those subpopulations on collections of items that purport to measure a common construct. DIF assessment requires a rule for scoring items and a matching variable on which different subpopulations can be viewed as comparable for purposes of assessing their performance on items. Typically, DIF is operationally defined as a difference in item performance between subpopulations, e.g., Blacks and Whites, that exists after members of the different subpopulations have been matched on some total score. Constructed-response items move beyond traditional multiple-choice items, for which DIF methodology is well-defined, towards item types involving selection or identification, reordering or rearrangement, substitution or correction, completion, construction, and performance or presentation. This paper defines DIF, describes two standard procedures for measuring DIF and indicates how DIF might be assessed for certain constructed-response item types. The description of DIF assessment presented in this paper is applicable to computer-delivered constructed-response items as well as paper and pencil delivered items.
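
One standard procedure of the kind described above is the Mantel-Haenszel approach, in which examinees are matched on a total score and a common odds ratio is pooled across score levels. The sketch below is illustrative only and is not taken from the paper; the function name mh_dif, the data layout, and the group codes are assumptions.

# Minimal Mantel-Haenszel DIF sketch (illustrative only).
# For each matching-score level k a 2x2 table is formed:
#   reference group: A_k correct, B_k incorrect
#   focal group:     C_k correct, D_k incorrect
import math
from collections import defaultdict

def mh_dif(scores, groups, responses):
    """scores: matching variable (e.g., total score) per examinee
    groups: 'R' (reference) or 'F' (focal) per examinee
    responses: 1 if the studied item was answered correctly, else 0
    Returns the MH common odds ratio and the ETS MH D-DIF effect size."""
    tables = defaultdict(lambda: [0, 0, 0, 0])     # [A, B, C, D] per score level
    for s, g, r in zip(scores, groups, responses):
        cell = (0 if r else 1) if g == 'R' else (2 if r else 3)
        tables[s][cell] += 1
    num = den = 0.0
    for A, B, C, D in tables.values():
        N = A + B + C + D
        if N > 0:
            num += A * D / N
            den += B * C / N
    alpha = num / den                    # MH common odds ratio
    mh_d_dif = -2.35 * math.log(alpha)   # negative values indicate DIF against the focal group
    return alpha, mh_d_dif

# Toy usage (placeholder data, not from the paper):
# alpha, d = mh_dif([10, 10, 10, 10, 12, 12, 12, 12],
#                   ['R', 'R', 'F', 'F', 'R', 'R', 'F', 'F'],
#                   [1, 0, 1, 0, 1, 1, 1, 0])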

99 citations


Journal ArticleDOI
TL;DR: The area between two item response functions is often used as a measure of differential item functioning under item response theory as mentioned in this paper, and this area can be measured over either an open interval (i.e., ex...
Abstract: The area between two item response functions is often used as a measure of differential item functioning under item response theory. This area can be measured over either an open interval (i.e., ex...
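
For reference, with item response functions P_R(θ) and P_F(θ) for the reference and focal groups, the area measures used in this literature are usually written as

\text{signed area} = \int_{-\infty}^{\infty} \bigl[ P_R(\theta) - P_F(\theta) \bigr]\, d\theta,
\qquad
\text{unsigned area} = \int_{-\infty}^{\infty} \bigl| P_R(\theta) - P_F(\theta) \bigr|\, d\theta,

with a closed-interval version obtained by replacing the infinite limits with finite bounds on θ. The paper's specific discussion of the two interval choices is truncated in the abstract above.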

53 citations


Journal ArticleDOI
TL;DR: The authors compared two approximation techniques for detecting differential item functioning (DIF) in an English as a second language (ESL) placement test when the group sizes are too small to use other possible methods (e.g., the three parameter item response theory method).
Abstract: This paper compares two approximation techniques for detecting differential item functioning (DIF) in an English as a second language (ESL) placement test when the group sizes are too small to use other possible methods (e.g., the three parameter item response theory method). An application of the Angoff delta-plot method (Angoff and Ford, 1973) utilizing the one parameter Rasch model adopted in Chen and Henning (1985), and Scheuneman's chi-square method (Scheuneman, 1979) were chosen because they are among the few methods appropriate for a sample size smaller than 100. Two linguistically and culturally diverse groups (Chinese and Spanish speaking) served as the subjects of this study. The results reveal that there was only marginal overlap between DIF items detected by Chen and Henning's method and Scheuneman's method; the former detected fewer DIF items with less variety than the latter. Moreover, Chen and Henning's method tended to detect easier items with smaller differences in p-value between the t...
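
A sketch of the delta-plot idea referred to above, assuming the usual ETS delta transformation of item p-values; the helper names and the principal-axis distance rule are illustrative and are not a reproduction of Chen and Henning's procedure.

# Illustrative delta-plot sketch (after Angoff & Ford, 1973), not the study's code.
# Each item's proportion correct in each group is mapped to the ETS delta scale,
# the two sets of deltas are compared, and items lying far from the principal
# axis of the scatter are flagged as potential DIF.
import numpy as np
from scipy.stats import norm

def deltas(p_values):
    # ETS delta: 13 + 4*z, where z is the normal deviate of the proportion incorrect
    p = np.clip(np.asarray(p_values, dtype=float), 0.001, 0.999)
    return 13.0 + 4.0 * norm.ppf(1.0 - p)

def delta_plot_distances(p_group1, p_group2):
    x, y = deltas(p_group1), deltas(p_group2)
    vx, vy, cxy = x.var(ddof=1), y.var(ddof=1), np.cov(x, y)[0, 1]
    # Slope and intercept of the principal (major) axis of the delta scatter
    b = (vy - vx + np.sqrt((vy - vx) ** 2 + 4.0 * cxy ** 2)) / (2.0 * cxy)
    a = y.mean() - b * x.mean()
    # Signed perpendicular distance of each item from the principal axis
    return (b * x - y + a) / np.sqrt(b ** 2 + 1.0)

# Items whose |distance| exceeds some cutoff (e.g., 1.5 delta units) would be flagged.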

49 citations


Journal ArticleDOI
TL;DR: In this article, the item characteristic curve (ICC) is estimated by using polynomial regression splines, which provide a more flexible family of functions than is given by the three-parameter logistic family.
Abstract: The item characteristic curve (ICC), defining the relation between ability and the probability of choosing a particular option for a test item, can be estimated by using polynomial regression splines. These provide a more flexible family of functions than is given by the three-parameter logistic family. The estimation of spline ICCs is described by maximizing the marginal likelihood formed by integrating ability over a beta prior distribution. Some simulation results compare this approach with the joint estimation of ability and item parameters.
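
In notation that is assumed here rather than taken from the paper, the setup can be written as follows: the ICC of item j is a logistic transform of a regression-spline expansion of ability, and the spline coefficients are chosen to maximize the marginal likelihood obtained by integrating ability over a beta prior on (0, 1),

P_j(\theta) = \frac{1}{1 + \exp\!\bigl(-\sum_{k} c_{jk}\, B_k(\theta)\bigr)},
\qquad
L(\mathbf{c}) = \prod_{i=1}^{N} \int_{0}^{1} \prod_{j=1}^{n} P_j(\theta)^{u_{ij}} \bigl(1 - P_j(\theta)\bigr)^{1 - u_{ij}} \,\mathrm{Beta}(\theta; \alpha, \beta)\, d\theta,

where the B_k are polynomial spline basis functions, u_ij is examinee i's score on item j, and the integral is evaluated by quadrature in practice.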

34 citations


Journal ArticleDOI
TL;DR: In this article, a model-testing approach for evaluating the stability of IRT item parameter estimates in a pretest-posttest design is illustrated using a random sample of pretest and posttest responses to a 19-item math test from a group of children assessed on two different occasions.
Abstract: Item parameter instability can threaten the validity of inferences about changes in student achievement when using Item Response Theory- (IRT) based test scores obtained on different occasions. This article illustrates a model-testing approach for evaluating the stability of IRT item parameter estimates in a pretest-posttest design. Stability of item parameter estimates was assessed for a random sample of pretest and posttest responses to a 19-item math test. Using MULTILOG (Thissen, 1986), IRT models were estimated in which item parameter estimates were constrained to be equal across samples (reflecting stability) and item parameter estimates were free to vary across samples (reflecting instability). These competing models were then compared statistically in order to test the invariance assumption. The results indicated a moderately high degree of stability in the item parameter estimates for a group of children assessed on two different occasions.
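
The model comparison described above amounts to a likelihood-ratio test of the constrained (invariant) model against the unconstrained one. The sketch below is a minimal illustration; the log-likelihood values and parameter count in the usage comment are placeholders, not results from the study.

# Likelihood-ratio test of item-parameter invariance (illustrative sketch).
from scipy.stats import chi2

def invariance_lr_test(loglik_constrained, loglik_free, n_extra_params):
    """Compare a model with item parameters constrained equal across occasions
    against one in which they are free to vary. G2 = 2*(logL_free - logL_constrained)
    is referred to a chi-square distribution with df equal to the number of
    additional free parameters in the unconstrained model."""
    g2 = 2.0 * (loglik_free - loglik_constrained)
    p_value = chi2.sf(g2, n_extra_params)
    return g2, n_extra_params, p_value

# Hypothetical usage: 19 items with 2 parameters each freed on the second occasion
# g2, df, p = invariance_lr_test(-5130.4, -5112.7, 38)
# A non-significant p-value would support invariance (stability) of the estimates.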

17 citations


Journal ArticleDOI
TL;DR: In this paper, the effectiveness of the Mantel-Haenszel (MH) statistic in detecting differentially functioning test items when the internal criterion was varied was investigated, and the results revealed that the choice of criterion, total test score versus subtest score, had a substantial influence on the classification of items as to whether or not they were differentially functioning in the Anglo-American and Native American groups.
Abstract: This study investigated the effectiveness of the Mantel-Haenszel (MH) statistic in detecting differentially functioning (DIF) test items when the internal criterion was varied. Using a dataset from a statewide administration of a life skills examination, a sample of 1,000 Anglo-American and 1,000 Native American examinee item response sets were analyzed. The MH procedure was first applied to all the items involved. The items were then categorized as belonging to one or more of four subtests based on the skills or knowledge needed to select the correct response. Each subtest was then analyzed as a separate test, using the MH procedure. Three control subtests were also established using random assignment of test items and were analyzed using the MH procedure. The results revealed that the choice of criterion, total test score versus subtest score, had a substantial influence on the classification of items as to whether or not they were differentially functioning in the Anglo-American and Native American group...

17 citations


01 Jan 1991
TL;DR: In this paper, items in the verbal (Hebrew and English) sections of the Psychometric Entrance Test (PET) administered for university admission in Israel were studied for differential item functioning (DIF) between the sexes.
Abstract: Items in the verbal (Hebrew and English) sections of the Psychometric Entrance Test (PET) administered for university admission in Israel were studied for differential item functioning (DIF) between the sexes. Analyses were conducted for 4,354 males and 4,901 females taking Form 3 of the PET in April 1984, and 3,785 males and 3,615 females taking Form 17 of the PET in April 1987. Three subtests were examined: (1) verbal reasoning; (2) English; and (3) mathematical reasoning (a control non-verbal test). DIF was determined for the 1984 population through: the weighted sum of the differences between the two groups across all ability groups; and the root of the mean squared differences as defined above. These two indices and a Mantel-Haenszel chi-square test examined DIF for the 1987 group. About one-third of the items in the verbal and mathematical reasoning parts were found to have DIF, but few English subtest items did so. The content of some of the items exhibiting DIF was clearly related to stereotypical perceptions of feminine and masculine areas of interest. Implications for test content are discussed.
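
A hedged formalization of the two 1984 indices described above, with the notation and weights assumed rather than quoted from the report (p_Mk and p_Fk are the proportions correct for males and females in ability group k, and w_k is the weight given to that group):

D_1 = \sum_{k} w_k \bigl( p_{Mk} - p_{Fk} \bigr),
\qquad
D_2 = \sqrt{\sum_{k} w_k \bigl( p_{Mk} - p_{Fk} \bigr)^2 }.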

13 citations


Journal ArticleDOI
TL;DR: In this paper, two editions of the Verbal and Mathematical portions of the Scholastic Aptitude Test (SAT) were used to study differential speededness and differential omission and the relationships among differential item functioning (DIF), differential omission, and item difficulty for Asian-Americans, Blacks, Hispanics, and females.
Abstract: Two editions of the Verbal and Mathematical portions of the Scholastic Aptitude Test (SAT) were used to study differential speededness and differential omission and the relationships among differential item functioning (DIF), differential omission, and item difficulty for Asian-Americans, Blacks, Hispanics, and females. Consistent and replicable evidence of differential speededness was found for Blacks and Hispanics. Use of an unspeeded criterion for matching in place of the traditional total score, which contains speeded items, does not affect the DIF analyses of the speeded items. A strong artifactual negative relationship between DIF and differential omission was found. The relationship between differential omission and difficulty was consistently positive on the Verbal sections for all comparison groups except the Asian-American group, for whom it was consistently negative. On the Mathematical sections, this relationship was only consistently found for the female/male comparison, for whom it was negative. Finally, the relationship between difficulty and DIF was negative but smaller than previously observed.

12 citations


Journal ArticleDOI
TL;DR: In this paper, the authors investigated the detection of differential item functioning (DIF) on items intentionally constructed to favor one group over another in two item response theory-based computer programs, LOGIST and BILOG.
Abstract: Detection of differential item functioning (DIF) on items intentionally constructed to favor one group over another was investigated on item parameter estimates obtained from two item response theory-based computer programs, LOGIST and BILOG. Signed- and unsigned-area measures based on joint maximum likelihood estimation, marginal maximum likelihood estimation, and two marginal maximum a posteriori estimation procedures were compared with each other to determine whether detection of DIF could be improved using prior distributions. Results indicated that item parameter estimates obtained using either prior condition were less deviant than when priors were not used. Differences in detection of DIF appeared to be related to item parameter estimation condition and to some extent to sample size.

Journal ArticleDOI
TL;DR: In this paper, Algebra Placement (AP) and Student-Produced Response (SPR) prototype items were evaluated for DIF and differential speededness and contrasted with the current SAT-Math item types, and both prototypes showed slightly higher levels of differential speededness than the SAT-M.
Abstract: Alternative mathematical items administered as prototypes at the Spring 1989 Field Trials are evaluated for differential item functioning (DIF) and differential speededness. Results for Algebra Placement (AP) and Student-Produced Response (SPR) items are presented and contrasted with results obtained on the two current SAT-Math item types: Regular Math and Quantitative Comparison. Analyses comparing female examinees with comparable male examinees, and Asian-American, Black, and Hispanic examinees with comparable White examinees, indicate that both of these alternative item types appear to have DIF. Additional DIF analyses comparing the use of an internal versus an external matching criterion for the SPR items show evidence of negative DIF with either criterion. Results using the MH D-DIF statistic are more extreme than DIF results using the STD P-DIF index. The metric used to calculate the DIF indices may account for the differences observed. Differential speededness results indicate that the two Math prototypes have slightly higher levels of differential speededness than the SAT-M. The SPR items pose an interesting problem for DIF. The definition of an appropriate DIF matching criterion for constructed-response item types needs more study. Metric differences between methods and their effect on difficult or easy items also need further exploration. Until these methodological issues are resolved, results of DIF studies on constructed-response items should be interpreted with caution.
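
The metric issue raised above reflects the fact that the two indices are reported on different scales: MH D-DIF is expressed on the ETS delta scale, while STD P-DIF is expressed on the proportion-correct scale. In their commonly used forms (assumed here, not quoted from the paper),

\text{MH D-DIF} = -2.35 \,\ln \hat{\alpha}_{MH},
\qquad
\text{STD P-DIF} = \frac{\sum_k N_{Fk} \bigl( p_{Fk} - p_{Rk} \bigr)}{\sum_k N_{Fk}},

where α̂_MH is the Mantel-Haenszel common odds ratio and N_Fk, p_Fk, and p_Rk are the focal-group count and the focal- and reference-group proportions correct at matching-score level k.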


Journal ArticleDOI
TL;DR: A FORTRAN 77 program is presented in this paper, which performs analyses of differential item performance in psychometric tests and computes various additional classical indices of differential item functioning (including discrimination indices) as well as associated effect size measures.
Abstract: A FORTRAN 77 program is presented which performs analyses of differential item performance in psychometric tests. The program performs the Mantel-Haenszel procedure and computes various additional classical indices of differential item functioning (including discrimination indices) as well as associated effect size measures.


Journal ArticleDOI
TL;DR: In this article, an alternative three-parameter logistic model was proposed, in which the asymptote parameter is a linear component within the logit of the function.
Abstract: Birnbaum's three-parameter logistic function has become a common basis for item response theory modeling, especially within situations where significant guessing behavior is evident. This model is formed through a linear transformation of the two-parameter logistic function in order to facilitate a lower asymptote. This paper discusses an alternative three-parameter logistic model in which the asymptote parameter is a linear component within the logit of the function. This alternative is derived from a more general four-parameter model based on a transformed hyperbola.
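
For reference, Birnbaum's three-parameter logistic model referred to above is

P(\theta) = c + (1 - c)\, \frac{1}{1 + \exp\bigl(-a(\theta - b)\bigr)},

that is, a two-parameter logistic rescaled to have lower asymptote c. The alternative model discussed in the paper instead enters the asymptote-related parameter as a linear term inside the logit; its exact parameterization is not reproduced here.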

01 Apr 1991
TL;DR: Dancer et al. examined differential item functioning in Likert-type items using log-linear models, in a paper presented at the 1991 annual meeting of the American Educational Research Association (AERA).
Abstract: Dancer, L. Suzanne, and others. "Examination of Differential Item Functioning in Likert-Type Items Using Log-Linear Models." Sponsoring agency: University of Wisconsin, Milwaukee. Paper presented at the Annual Meeting of the American Educational Research Association (Chicago, IL, April 3-7, 1991). 20 pp.



Journal ArticleDOI
TL;DR: In this article, the authors proposed IRT for item response time and showed the utility of this theory by applying it to practical data and showed that it can be used to evaluate examinee's ability for response time.
Abstract: Item response time data can now be obtained easily through computerized testing, so examinees can be evaluated not only on their test scores but also on their response times. Item Response Theory (IRT) is well known to be useful for item analysis based on test scores, and the same idea can be applied to the analysis of examinees' response times. The authors propose an IRT for item response time. In this paper, the authors demonstrate (1) the validity of the theory, (2) item analysis using the theory, and (3) estimation of examinees' ability with respect to response time, and they show the utility of the theory by applying it to practical data.