
Showing papers on "Differential item functioning published in 1989"


Journal ArticleDOI
TL;DR: In this paper, the authors discuss the definition, detection, and explanation of item bias, and four strategies are described: qualitative, correlational, quasi-experimental, and experimental research.

474 citations


Journal ArticleDOI
TL;DR: In this paper, the results of a standard Mantel-Haenszel DIF analysis are compared to results obtained from supplementary analyses in which history course background, as well as score, is used as a conditioning variable.
Abstract: The Mantel-Haenszel approach for investigating differential item functioning was applied to U.S. history items that were administered as part of the National Assessment of Educational Progress. On some items, blacks, Hispanics, and females performed more poorly than other students, conditional on number-right score. It was hypothesized that this resulted, in part, from the fact that ethnic and gender groups differed in their exposure to the material included in the assessment. Supplementary Mantel-Haenszel analyses were undertaken in which the number of historical periods studied, as well as score, was used as a conditioning variable. Contrary to expectation, the additional conditioning did not lead to a reduction in the number of DIF items. Both methodological and substantive explanations for this unexpected result were explored. The National Assessment of Educational Progress (NAEP) is a survey of the academic achievements of American students that began in 1969. The Mantel-Haenszel (MH; Mantel & Haenszel, 1959) approach to differential item functioning (DIF) analysis developed by Holland and Thayer (1988) was applied to U.S. history items that were administered in 1986 as part of a project supported by NAEP and the National Endowment for the Humanities (see Applebee, Langer, & Mullis, 1987). On about 30 percent of the items, there was some evidence that either blacks, Hispanics, or females performed more poorly than other students, conditional on number-right score. It was hypothesized that this could have resulted, in part, from the fact that ethnic and gender groups differed in their exposure to the material included in the history assessment. In this study, the results of a standard Mantel-Haenszel DIF analysis are compared to results obtained from supplementary analyses in which history course background, as well as score, is used as a conditioning variable. The purpose of this more refined matching procedure is to achieve a situation in which item performance is compared for groups of students who are of similar overall proficiency and have been exposed to similar curricula. If the original findings were indeed a reflection of differences in curriculum exposure, the new analyses should produce fewer DIF items.
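The analysis described above rests on the Mantel-Haenszel common odds ratio computed over strata of matched examinees; the refined matching simply makes each stratum a (score, course-background) pair rather than a score alone. The following is a minimal sketch of that idea, not the study's own code; function and variable names are illustrative.

```python
import numpy as np

def mh_common_odds_ratio(correct, group, strata):
    """Mantel-Haenszel common odds ratio for one studied item.

    correct: 1 if the examinee answered the studied item correctly, else 0
    group:   1 for the reference group, 0 for the focal group
    strata:  one matching label per examinee -- e.g. number-right score for
             the standard analysis, or a (score, periods_studied) tuple for
             the refined matching described above
    """
    correct, group = np.asarray(correct), np.asarray(group)
    num = den = 0.0
    for s in set(strata):
        m = np.array([st == s for st in strata])
        a = np.sum(correct[m] * group[m])              # reference, correct
        b = np.sum((1 - correct[m]) * group[m])        # reference, incorrect
        c = np.sum(correct[m] * (1 - group[m]))        # focal, correct
        d = np.sum((1 - correct[m]) * (1 - group[m]))  # focal, incorrect
        t = a + b + c + d
        num += a * d / t
        den += b * c / t
    return num / den
```

With `strata = list(zip(number_right, periods_studied))` the same routine carries out the supplementary conditioning; a ratio near 1 indicates little DIF against the matched focal group.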

172 citations


Journal ArticleDOI
TL;DR: In this paper, the authors compared the IRT-based area method and the Mantel-Haenszel method for investigating differential item functioning (DIF), to determine the degree of agreement between the methods in identifying potentially biased items, and, when the two methods led to different results, to identify possible reasons for the discrepancies.
Abstract: The purpose of this study was to compare the IRT-based area method and the Mantel-Haenszel method for investigating differential item functioning (DIF), to determine the degree of agreement between the methods in identifying potentially biased items, and, when the two methods led to different results, to identify possible reasons for the discrepancies. Data for the study were the item responses of Anglo American and Native American students who took the 1982 New Mexico High School Proficiency Exam. Two samples of 1,000 students from each group were studied. The major findings were that (a) the consistency of classifications of items into "biased" and "not-biased" categories across replications was 75% to 80% for both methods and (b) when the unreliability of the statistics was taken into account, the two methods led to very similar results. Discrepancies between methods were due to the presence of nonuniform DIF (the Mantel-Haenszel method could not identify these items) and the choice of interval over wh...
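The IRT-based area index referred to above summarizes DIF as the area between the item characteristic curves estimated separately in the two groups. Below is a rough numerical sketch under the three-parameter logistic model; the integration interval and grid are arbitrary choices here, and the study's exact variant (signed versus unsigned area, weighting, interval) may differ.

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def unsigned_area(params_ref, params_focal, lo=-4.0, hi=4.0, n=801):
    """Unsigned area between reference- and focal-group ICCs, approximated
    by trapezoidal integration on an evenly spaced theta grid."""
    theta = np.linspace(lo, hi, n)
    diff = np.abs(icc_3pl(theta, *params_ref) - icc_3pl(theta, *params_focal))
    return np.sum((diff[1:] + diff[:-1]) / 2.0 * np.diff(theta))
```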

139 citations


Journal ArticleDOI
TL;DR: In this article, a method is proposed to evaluate l'equivalence de mesure des traductions des tests d'intelligence americains and allemands dans les deux sens.
Abstract: Utilisation des methodes basees sur la theorie de la reponse par item pour evaluer l'equivalence de mesure des traductions des tests d'intelligence americains et allemands dans les deux sens. Identification des items jouant un role differentiel et analyse de contenu pour en determiner la cause (culturelle ou linguistique)

128 citations


Journal ArticleDOI
TL;DR: The standardization and Mantel-Haenszel approaches to the assessment of differential item functioning (DIF) are described and compared in this paper, which emphasize the importance of comparing comparable groups of examinees, use the same data base for analysis, namely, a 2 (Group) x 2 (Item Score: Correct or Incorrect) x S (Score Level) contingency table for each item studied.
Abstract: The standardization and Mantel-Haenszel approaches to the assessment of differential item functioning (DIF) are described and compared. For right-wrong scoring of items, these two approaches, which emphasize the importance of comparing comparable groups of examinees, use the same data base for analysis, namely, a 2 (Group) x 2 (Item Score: Correct or Incorrect) x S (Score Level) contingency table for each item studied. The two procedures differ with respect to how they operate on these basic data tables to compare the performance of the two groups of examinees. Whereas the operations employed by Mantel-Haenszel are motivated by statistical power considerations, the operations employed by standardization are motivated by data-interpretation considerations. These differences in operation culminate in different measures of DIF effect-size that are very highly related indicators of degree of departure from the null hypothesis of no DIF.
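Both indices can be computed from the same 2 (Group) x 2 (Item Score) x S (Score Level) table. The sketch below assumes the table has been collapsed into four count vectors indexed by score level (the array names are assumptions, not the paper's notation): the standardization index weights the per-level differences in proportion correct by the focal-group counts, while Mantel-Haenszel pools the per-level odds ratios.

```python
import numpy as np

def std_p_dif(ref_correct, ref_total, foc_correct, foc_total):
    """Standardized p-difference: focal-group-weighted average of the
    per-score-level differences in proportion correct."""
    p_ref = ref_correct / ref_total
    p_foc = foc_correct / foc_total
    w = foc_total / foc_total.sum()          # standardization weights
    return np.sum(w * (p_foc - p_ref))

def mh_odds_ratio(ref_correct, ref_total, foc_correct, foc_total):
    """Mantel-Haenszel common odds ratio from the same 2 x 2 x S tables."""
    ref_wrong = ref_total - ref_correct
    foc_wrong = foc_total - foc_correct
    t = ref_total + foc_total
    return np.sum(ref_correct * foc_wrong / t) / np.sum(ref_wrong * foc_correct / t)
```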

100 citations


Journal ArticleDOI
TL;DR: In this paper, a method of analyzing test item responses is advocated to examine differential item functioning through distractor choices of those who answer an item incorrectly, using log-linear models of a three-way contingency table to examine whether there is an interaction of population subgroup and option choice when ability is held constant.
Abstract: A method of analyzing test item responses is advocated to examine differential item functioning through distractor choices of those who answer an item incorrectly. The analysis, called Differential Distractor Functioning, uses log-linear models of a three-way contingency table to examine whether there is an interaction of population subgroup and option choice when ability is held constant. The analysis is explained and is exemplified in an analysis of the Verbal portion of a recent Scholastic Aptitude Test.
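A log-linear test of the kind described can be run as a Poisson regression on the Group x Option x Score-Level table of incorrect responses: a model with all two-way interactions except Group x Option is compared with the model that adds it. This is a hedged sketch using statsmodels with assumed column names; the paper's exact model hierarchy may differ.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

def ddf_loglinear_test(df):
    """df: a pandas DataFrame with one row per cell of the Group x Option x
    ScoreLevel table, counting only examinees who answered the item
    incorrectly, with columns 'group', 'option', 'score', 'count'
    (column names are assumptions, not the paper's).

    Tests the group-by-option interaction, conditional on score level."""
    base = smf.glm("count ~ C(group)*C(score) + C(option)*C(score)",
                   data=df, family=sm.families.Poisson()).fit()
    full = smf.glm("count ~ C(group)*C(score) + C(option)*C(score) + C(group):C(option)",
                   data=df, family=sm.families.Poisson()).fit()
    lr = base.deviance - full.deviance            # likelihood-ratio statistic
    ddf = base.df_resid - full.df_resid           # degrees of freedom
    return lr, ddf, chi2.sf(lr, ddf)              # small p-value suggests DDF
```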

64 citations


Journal ArticleDOI
TL;DR: A survey of procedures for the detection of differential item functioning is presented in this article, divided into those based on classical measurement methods and those based on item response theory, and the advantages and constraints of each are discussed.
Abstract: A survey of procedures for the detection of differential item functioning is divided according to those based on classical measurement methods and those based on item response theory. Each of the procedures is described, and its advantages and constraints are discussed. Evaluation of each method, based on the existent research, is included in these discussions. The article concludes with recommendations for use under varying circumstances.

40 citations


Journal ArticleDOI
TL;DR: The fidelity of a translated survey instrument used to measure attitudes toward mental health was evaluated using statistical methods based on item response theory and item characteristic curves to determine the source of dif.
Abstract: The fidelity of a translated survey instrument used to measure attitudes toward mental health was evaluated using statistical methods based on item response theory. Data from French and German versions of the attitude survey were analyzed, and items that displayed differential item functioning (dif) were identified. Item characteristic curves (ICCs) were examined to determine whether the source of dif could be attributed to errors in translation or differences in cultural experiences or knowledge. The proposal by Humphreys and Hulin for using ICCs to determine the source of dif is evaluated.

38 citations


Journal ArticleDOI
TL;DR: In this article, the authors examined the relationship of differential item functioning (DIF) to item difficulty on the Scholastic Aptitude Test (SAT) and found that more difficult items tended to exhibit positive DIF (dIF favored the focal group over the white reference group).
Abstract: This study examined the relationship of differential item functioning (DIF) to item difficulty on the Scholastic Aptitude Test (SAT). The data comprise verbal and mathematical item statistics from nine recent administrations of the SAT. In general, item difficulty is related to DIF. The nature of that relationship appears to be independent of the choice of DIF index (either the Mantel-Haenszel or the standardization approach) as well as of test form. However, the relationship was dependent on the particular group comparison and on both the test sections and the item type being analyzed. The relationship was strong for each of the racial and ethnic group contrasts—in which black, Hispanic, and Asian American examinees were compared in turn with white examinees—but was weak for the female and male examinee contrast. The relationship also appeared stronger on the verbal sections than on the mathematical sections. The relationship is such that more difficult items tended to exhibit positive DIF (DIF favored the focal group over the white reference group). On the verbal sections, only the reading comprehension item type (with the smallest observed range in item difficulty) failed to exhibit a strong relationship. Another index, the standardized difference in percentage omit (DIFPOM), correlated very highly (negatively) with DIF. Differential omission refers to a relative difference in omit rates between groups matched in ability. In fact, DIFPOM was consistently a better predictor of DIF in most models than was item difficulty. The relationship between DIF and DIFPOM held up across all four comparisons, including gender. It was also present in the mathematical sections with nearly the same magnitude exhibited in the verbal sections. Although DIF and DIFPOM are mathematically dependent measures, it was proposed that DIFPOM may be partly responsible for the relationship between DIF and item difficulty. To what extent DIF is a consequence of differential omission and to what extent differential omission is a manifestation of DIF is problematic. Nonetheless, the presence of differential omission on a test has the potential to influence DIF indices and therefore should be an important concern especially for formula-scored tests, where omission occurs often on difficult items. Among other findings is that Hispanic and black focal groups tended to omit differentially less than did the white reference groups. For Asian American examinees, the reverse holds. For females and males, the direction depends on the test sections. In general, groups that omitted differentially less experienced a relative advantage (high-positive DIF values) on the more difficult items, as measured by the DIF indices studied here (which treat omits as wrong in their calculation).
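The DIFPOM index described above parallels the standardization DIF index but operates on omit rates rather than proportions correct. A small sketch, with assumed array names, of how such a standardized omit difference could be computed per item (this is an illustration of the idea, not the study's exact definition):

```python
import numpy as np

def standardized_omit_difference(ref_omits, ref_total, foc_omits, foc_total):
    """Focal-group-weighted average of per-score-level differences in the
    proportion of examinees who omitted the item (a DIFPOM-style index)."""
    w = foc_total / foc_total.sum()
    return np.sum(w * (foc_omits / foc_total - ref_omits / ref_total))
```

Applied across items, `np.corrcoef` between this index and a DIF index would give the kind of DIF/DIFPOM relationship the study reports.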

30 citations


Journal ArticleDOI
TL;DR: In this paper, a test item is typically considered free of differential item functioning (DIF) if its item response function is the same across demographic groups, and a popular means of testing for DIF is the Mantel-Haenszel (MH) approach.
Abstract: A test item is typically considered free of differential item functioning (DIF) if its item response function is the same across demographic groups. A popular means of testing for DIF is the Mantel-Haenszel (MH) approach. Holland and Thayer showed that under the Rasch model, identity of item response functions across demographic groups implies that the MH null hypothesis will be satisfied when the MH matching variable is test score, including the studied item. This result, however, cannot be generalized to the class of items for which item response functions are monotonic and local independence holds. Suppose that all item response functions are identical across groups, but the ability distributions for the two groups are ordered. In general, the population MH result will show DIF favoring the higher group on some items and the lower group on others. If the studied item is excluded from the matching criterion under these conditions, the population MH result will always show DIF favoring the higher group.
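The result is easy to see in simulation: generate Rasch data with identical item parameters in both groups but ordered ability distributions, then compute the MH odds ratio with the studied item first included in, and then excluded from, the matching score. The sketch below uses arbitrary parameter values of my own choosing; it should give a ratio near 1 in the first case and above 1 (spurious DIF favouring the higher-ability group) in the second.

```python
import numpy as np

rng = np.random.default_rng(0)

def rasch_responses(theta, b):
    """0/1 responses under the Rasch model; same item parameters for everyone."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    return (rng.random(p.shape) < p).astype(int)

def mh_alpha(item, matched_score, group):
    """Mantel-Haenszel common odds ratio for one studied item,
    matching on the supplied score."""
    num = den = 0.0
    for s in np.unique(matched_score):
        m = matched_score == s
        a = np.sum(item[m] * group[m]);       b = np.sum((1 - item[m]) * group[m])
        c = np.sum(item[m] * (1 - group[m])); d = np.sum((1 - item[m]) * (1 - group[m]))
        t = a + b + c + d
        if t:
            num += a * d / t
            den += b * c / t
    return num / den if den else np.nan

# identical item response functions in both groups, ordered ability distributions
b = np.linspace(-2, 2, 20)
theta = np.concatenate([rng.normal(0.0, 1.0, 5000),    # reference (higher) group
                        rng.normal(-0.5, 1.0, 5000)])  # focal (lower) group
group = np.concatenate([np.ones(5000), np.zeros(5000)]).astype(int)
x = rasch_responses(theta, b)

studied = x[:, 0]
total = x.sum(axis=1)
print(mh_alpha(studied, total, group))            # studied item included: near 1
print(mh_alpha(studied, total - studied, group))  # excluded: > 1, spurious DIF
```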

29 citations


Journal ArticleDOI
TL;DR: This article used the partial correlation technique devised by Stricker and Reynolds and Willson (1984) to detect race (black, white) and gender differences on item means that are independent of overall ability differences.

Journal ArticleDOI
TL;DR: In this article, the authors attempted to pinpoint the causes of differential item difficulty for blind students taking the braille edition of the Scholastic Aptitude Test's Mathematical section (SAT-M).
Abstract: This study attempted to pinpoint the causes of differential item difficulty for blind students taking the braille edition of the Scholastic Aptitude Test's Mathematical section (SAT-M). The study method involved reviewing the literature to identify factors that might cause differential item functioning for these examinees, forming item categories based on these factors, identifying categories that functioned differentially, and assessing the functioning of the items comprising deviant categories to determine if the differential effect was pervasive. Results showed an association between selected item categories and differential functioning, particularly for items that included figures in the stimulus, items for which spatial estimation was helpful in eliminating at least two of the options, and items that presented figures that were small or medium in size. The precise meaning of this association was unclear, however, because some items from the suspected categories functioned normally, factors other than the hypothesized ones might have caused the observed aberrant item behavior, and the differential difficulty might reflect real population differences in relevant content knowledge.

Journal ArticleDOI
TL;DR: In this article, a study was conducted to verify findings from an earlier study by Lawrence, Curley, and McHale (1988) in which DIF was examined for females on reading subscore items in four forms of SAT-Verbal.
Abstract: This study was conducted to attempt to verify findings from an earlier study by Lawrence, Curley, and McHale (1988) in which DIF was examined for females on Reading subscore items in four forms of SAT-Verbal. Specifically, the focus of this research was on verifying the technical science hypothesis as a contributing factor to DIF for females on science reading comprehension questions, and the true-science classification as a contributing factor to DIF for females on sentence completion items. Confirmatory evidence for the factors identified in the earlier study is not clear-cut in this study. For reading comprehension items, the presence of technical science material in a reading passage tends to make the corresponding items more difficult for females. However, some items associated with highly technical passages do not function differently for males and females. With respect to sentence completion items, there was partial support for the hypothesis that true-science items are more difficult for females. However, the limited number of items studied does not warrant statements about the differential functioning of sentence completion items.

Journal ArticleDOI
TL;DR: In this article, the authors discuss the potential benefits of using item response theory in test construction and evaluate the experience and evidence accumulated during 9 years of using a three-parameter model in the construction of major achievement batteries.
Abstract: Certain potential benefits of using item response theory in test construction are discussed and evaluated using the experience and evidence accumulated during 9 years of using a three-parameter model in the construction of major achievement batteries. We also discuss several cautions and limitations in realizing these benefits as well as issues in need of further research. The potential benefits considered are those of getting "sample-free" item calibrations and "item-free" person measurement, automatically equating various tests, decreasing the standard errors of scores without increasing the number of items used by using item pattern scoring, assessing item bias (or differential item functioning) independently of difficulty in a manner consistent with item selection, being able to determine just how adequate a tryout pool of items may be, setting up computer-generated "ideal" tests drawn from pools as targets for test developers, and controlling the standard error of a selected test at any desired set o...

Journal ArticleDOI
TL;DR: In this paper, the authors developed and evaluated a confirmatory approach to assess test structure using multidimensional item response theory (MIRT), which involves adding to the exponent of the MIRT model an item structure matrix that allows the user to specify the ability dimensions measured by an item.
Abstract: The purpose of this research was to develop and evaluate a confirmatory approach to assessing test structure using multidimensional item response theory (MIRT). The approach investigated involves adding to the exponent of the MIRT model an item structure matrix that allows the user to specify the ability dimensions measured by an item. Various combinations of item structures were fit to two sets of simulation data with known true structures, and the results were evaluated using a likelihood ratio chi-square statistic and two information-based model selection criteria. The results of these analyses support the use of the confirmatory MIRT approach, since it was found that the procedures could recover the true item structures. It was also found that adding an additional ability dimension that forces together items that ought not to be together noticeably deteriorates the quality of the solution. On the other hand, imposing structures different from, but not inconsistent with, the true structures does not necessarily yield worse fit. Finally, in terms of model fit statistics, the consistent Akaike information criterion performed better than the simple Akaike information criterion, while the likelihood ratio chi-square was clearly inadequate.
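In a compensatory MIRT model, the item structure matrix acts as a 0/1 mask on the discrimination parameters in the exponent, so each item is allowed to load only on the dimensions the analyst specifies. A minimal sketch of that parameterization follows; the names and the exact form of the intercept are assumptions, not the paper's notation.

```python
import numpy as np

def mirt_prob(theta, a, d, structure):
    """Compensatory multidimensional IRT response probabilities with a
    confirmatory 0/1 structure matrix applied to the discriminations.

    theta:     (n_persons, n_dims) abilities
    a:         (n_items, n_dims) discriminations
    d:         (n_items,) intercepts
    structure: (n_items, n_dims) user-specified 0/1 loading pattern
    returns:   (n_persons, n_items) probabilities of a correct response
    """
    z = theta @ (a * structure).T + d   # masked exponent
    return 1.0 / (1.0 + np.exp(-z))
```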


01 Jan 1989
TL;DR: In this article, computer simulations were conducted to study the behavior of three conditional differential item functioning (DIF) statistics in the detection of true or asymptotic DIF.
Abstract: Computer simulations were conducted to study the behavior of three conditional differential item functioning (DIF) statistics in the detection of true or asymptotic DIF. The statistics included the standardized difference in proportion-correct (STD), the Mantel-Haenszel common odds-ratio (MH), and the root mean weighted squared difference in proportion-correct (RMWSD). The simulated tests were based on actual administrations of the ACT Assessment to certain focal and base examinee populations. Sample sizes of examinees were varied while true DIF and test length remained fixed. Results of these simulations showed that the MH and STD statistics were preferred as DIF indicators for sample sizes greater than 250. In the fall of 1988, several members of the American College Testing Program's Test Development Division conducted computer simulations to study the behavior of three conditional differential item functioning (DIF) statistics, in terms of DIF or item bias detection. The statistics selected for inclusion in this study were the standardized difference in proportion-correct (Dorans & Kulick, 1986), the Mantel-Haenszel common odds-ratio (Holland & Thayer, 1986; Mantel & Haenszel, 1959), and the root mean weighted squared difference in proportion-correct (Dorans & Kulick, 1986). Item bias statistics which condition on some examinee ability measure are thought to be better measures of DIF than those statistics that use the simple unconditional difference in proportion-correct values, sometimes referred to as impact. The unconditional impact does not take into account underlying differences in ability distributions between populations or groups of interest. The conditional procedures, on the other hand, reflect proportion-correct differences only between examinees with comparable ability in each population or group. These DIF statistics have been used by other testing programs and services to detect or flag test items on tests where DIF might be problematic. The statistics were defined as follows: the populations or groups of interest were referred to as the focal (F) group and the base (B) group; s indexed each observed score category of a k-item test (s = 0, 1, ..., k); N_Fs denoted the number of examinees in the F group at score s; and N_Bs denoted the number of examinees in the B group at score s.
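The STD and MH indices are sketched under earlier entries above; the remaining statistic, RMWSD, aggregates squared per-score-level differences in proportion correct. A hedged sketch with assumed array names, weighting each score level by the focal-group count:

```python
import numpy as np

def rmwsd(base_correct, base_total, focal_correct, focal_total):
    """Root mean weighted squared difference in proportion correct across
    score levels, with focal-group counts as weights (a sketch of the
    RMWSD index; the study's exact weighting may differ)."""
    p_base = base_correct / base_total
    p_focal = focal_correct / focal_total
    w = focal_total / focal_total.sum()
    return np.sqrt(np.sum(w * (p_focal - p_base) ** 2))
```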

Journal ArticleDOI
TL;DR: In this article, the authors compared pseudo-Bayesian and joint maximum likelihood procedures for estimating item parameters for the three-parameter logistic model in item response theory, and found that the item characteristic curves estimated by the two methods were more similar to each other than to the generated item characteristic curve.
Abstract: This study compared pseudo-Bayesian and joint maximum likelihood procedures for estimating item parameters for the three-parameter logistic model in item response theory. Two programs, ASCAL and LOGIST, which employ the two methods were compared using data simulated from a three-parameter model. Item responses were generated for sample sizes of 2,000 and 500, test lengths of 35 and 15, and examinees of high, medium, and low ability. The results showed that the item characteristic curves estimated by the two methods were more similar to each other than to the generated item characteristic curves. Pseudo-Bayesian estimation consistently produced more accurate item parameter estimates for the smaller sample size, whereas joint maximum likelihood was more accurate as test length was reduced. Index terms: ASCAL, item response theory, joint maximum likelihood estimation, LOGIST, parameter estimation, pseudo-Bayesian estimation, three-parameter model.
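The simulation design described above amounts to generating 0/1 responses from a three-parameter logistic model for chosen sample sizes, test lengths, and ability levels. A minimal sketch of that generation step; the D = 1.7 scaling constant and parameter layout are common conventions, not details taken from the study.

```python
import numpy as np

def simulate_3pl(theta, a, b, c, rng=None):
    """Generate dichotomous responses under the three-parameter logistic model.

    theta: (n_persons,) abilities; a, b, c: (n_items,) discrimination,
    difficulty, and guessing parameters."""
    if rng is None:
        rng = np.random.default_rng()
    p = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta[:, None] - b[None, :])))
    return (rng.random(p.shape) < p).astype(int)
```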


22 Mar 1989
TL;DR: This article used item response theory (IRT), the delta plot method, and Mantel-Haenszel techniques to assess differential item functioning across racial and gender groups associated with the Maryland Test of Citizenship Skills (MTCS).
Abstract: Use of item response theory (IRT), the delta plot method, and Mantel-Haenszel techniques to assess differential item functioning (DIF) across racial and gender groups associated with the Maryland Test of Citizenship Skills (MTCS) is described. The objective of this research was to determine: the effect of sample size on results from these three DIF techniques; the degree of relationship among these DIF statistics; and the degree to which they identify the same items as biased. The data for the study include item responses from one form of the 1989 edition of the MTCS. The MTCS consists of 45 multiple-choice items that assess students' knowledge and skills in 3 domains: constitutional government; politics and political behavior; and principles, rights, and responsibilities. The MTCS was administered to 50,000 ninth graders during January and February of 1988. The analyses were performed on representative samples of 1,000, 750, 500, and 200 first-time test takers. It is concluded that no MTCS items are functioning differentially in either black/white or male/female comparisons. Plots of item difficulty estimates for black/white and male/female comparisons show nearly perfect linear relationships in both groups. Agreement, as indicated by rank order correlations across DIF techniques, is very high between Rasch and Delta Plot DIF indices for all sample sizes in both black/white and male/female comparisons. In terms of agreement regarding biased and unbiased items, agreement with the three-parameter DIF index is highest for the Delta Plot and Rasch techniques. A 30-item list of references, 19 data tables, and 30 figures are included.
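Of the three techniques, the delta plot is the most self-contained to illustrate: classical difficulties are transformed to ETS deltas (13 plus 4 times the normal deviate of the proportion incorrect), the deltas for the two groups are plotted against each other, and items far from the principal axis of the scatter are flagged. A rough sketch follows; the flagging threshold and the study's exact procedure are not specified here.

```python
import numpy as np
from scipy.stats import norm

def delta_values(p_correct):
    """ETS delta transformation: harder items get larger deltas."""
    return 13.0 + 4.0 * norm.ppf(1.0 - np.asarray(p_correct))

def delta_plot_distances(p_group1, p_group2):
    """Perpendicular distance of each item from the principal (major) axis
    of the delta-plot scatter; large distances suggest differential
    functioning."""
    d1, d2 = delta_values(p_group1), delta_values(p_group2)
    s1, s2 = d1.std(), d2.std()
    cov = np.cov(d1, d2)[0, 1]
    # slope and intercept of the major axis of the bivariate scatter
    slope = (s2**2 - s1**2 + np.sqrt((s2**2 - s1**2) ** 2 + 4 * cov**2)) / (2 * cov)
    intercept = d2.mean() - slope * d1.mean()
    return np.abs(slope * d1 - d2 + intercept) / np.sqrt(slope**2 + 1)
```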

01 Dec 1989
TL;DR: In this paper, a portion of the theory is developed from a few principles and applied to the problems of deciding whether ability has the same distribution in two demographic groups, to finding latent class models that are equivalent to item response models, and to controlling drift in adaptive testing programs.
Abstract: Formula scoring is a systematic study of measurement statistics expressed as sums of products of item scores. The theory is currently being used to compute non-parametric estimates of ability distributions, item response functions, and option response functions. The theory has been used to design algorithms for estimating item response functions from adaptive test data (on-line calibration), monitoring and correcting drift in observed score distributions for adaptive tests (on-line equating), computing optimal tests for cheating, and combining appropriateness measurement information from several subtests. In this paper a portion of the theory is developed from a few principles. Applications are considered to the problems of deciding whether ability has the same distribution in two demographic groups, to finding latent class models that are equivalent to item response models, and to controlling drift in adaptive testing programs. Keywords: Latent trait theory, Item response theory, Formula score, Rasch model, Equating, Foundations, Quasidensities, Densities, Non-parametric density estimation, Ability distributions, Identifiability.

01 Jan 1989
TL;DR: In this paper, responses to the ACT-COMP test were analyzed using F. Samejima's graded model to determine the level of differential item functioning (DIF) for 60 multiple-choice items.
Abstract: Responses to American College Test College Outcome Measures Program (ACT-COMP) items by 481 black and 9,237 white students at the University of Tennessee (Knoxville) were analyzed using F. Samejima's graded model to determine the level of differential item functioning (DIF). Students had been tested using Form 8 of the ACT-COMP objective test either as freshmen or as seniors. The test contains 60 multiple-choice items, each of which has two correct answers. The model developed by Samejima (1969) for graded responses, which uses a series of binary models to describe polychotomous data, was used to assess the data. Student response patterns were fitted to the graded model and five items that did not fit the model were dropped. The remaining items were analyzed using threshold parameters and their standard errors to calculate difficulty-shift coefficients. Results indicate that: (1) for 32 of the 55 remaining items, significant instances of DIF are present; (2) instances of DIF are not evenly distributed among the six subscales of the ACT-COMP test; (3) questions designed to assess explanation skills produce higher rates of DIF than do questions designed to assess skills related to identification and description; and (4) activities that rely on blueprints, require interpretation of satire, or use a radio news format produce high levels of DIF. Four data tables and nine graphs are provided.
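Samejima's graded model expresses the probability of each ordered response category as the difference between adjacent cumulative ("at or above") curves; threshold shifts between groups can then be compared, as the study does. A brief sketch of the category probabilities for a single item; parameter names are illustrative, not the study's notation.

```python
import numpy as np

def graded_response_probs(theta, a, thresholds):
    """Samejima (1969) graded response model for one item.

    theta:      (n_persons,) abilities
    a:          item discrimination
    thresholds: (m,) ordered threshold parameters
    returns:    (n_persons, m + 1) category probabilities
    """
    theta = np.asarray(theta, dtype=float)
    # cumulative probability of responding in category k or higher
    cum = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - np.asarray(thresholds)[None, :])))
    cum = np.hstack([np.ones((len(theta), 1)), cum, np.zeros((len(theta), 1))])
    # category probabilities are differences of adjacent cumulative curves
    return cum[:, :-1] - cum[:, 1:]
```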

01 Dec 1989
TL;DR: In this article, it is shown that the maximum likelihood quasidensity estimate (mle) is strongly consistent and the asymptotic distribution of the mle is derived.
Abstract: The quasidensity is a useful surrogate for the ability density in item response theory analyses of multiple choice tests. Like the ability density, it can be used to calculate the probability of sampling an examinee with a specified pattern of responses. Sometimes it is preferable to the density because it is continuous (densities need not be continuous), it is unique (two very different densities can give exactly the same pattern probabilities and all other expected values of random variables that are functions of item responses), and it always exists (a discontinuous ability distribution has a quasidensity, but it does not have a density). Some large sample results are proven for quasidensity estimation. It is shown that the maximum likelihood quasidensity estimate (mle) is strongly consistent. The asymptotic distribution of the mle is derived. Some new results on the relation between latent class and latent trait models are also presented. It is shown that every item response model with a smooth density and smooth item response functions is isomorphic to the latent class model obtained with the same item response functions and a discrete ability distribution. An upper bound for the number of points is derived.


01 Jan 1989
TL;DR: In this article, the authors link empirical Bayes methods with two specific topics in item response theory, and test the goodness of fit of the Rasch model under the assumptions of local independence and sufficiency.
Abstract: The purpose of this paper is to link empirical Bayes methods with two specific topics in item response theory--item/subtest regression, and testing the goodness of fit of the Rasch model--under the assumptions of local independence and sufficiency. It is shown that item/subtest regression results in empirical Bayes estimates only if the Rasch model holds. Additionally, it is shown that a newly-derived exploratory goodness-of-fit test for the Rasch model, which does not need item and person parameter estimates, can be seen as an empirical Bayes test. This test compares the observed proportions of correct answers to one specific item, given any pattern that leads to a number-right score.