
Showing papers on "Differential item functioning published in 1982"


Journal ArticleDOI
TL;DR: In this paper, an item response theory is discussed that is based on purely ordinal assumptions about the probabilities that people respond positively to items; it is considered a natural generalization of both Guttman scaling and classical test theory.
Abstract: An item response theory is discussed which is based on purely ordinal assumptions about the probabilities that people respond positively to items. It is considered as a natural generalization of both Guttman scaling and classical test theory. A distinction is drawn between construction and evaluation of a test (or scale) on the one hand and the use of a test to measure and make decisions about persons' abilities on the other. Techniques to deal with each of these aspects are described and illustrated with examples.
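The abstract gives no formulas, but the ordinal, Guttman-style scaling tradition it describes is commonly operationalized through a scalability coefficient such as Loevinger's H (observed versus expected Guttman errors). The sketch below is only an illustration of that idea; the simulated data, the function name, and the use of H itself are assumptions made for the example, not taken from the paper.

```python
# Hedged illustration: Loevinger's H as an ordinal, Guttman-style scalability check.
# The data and the function are assumptions made purely for the example.
import numpy as np

def scalability_H(X):
    """X: n_persons x n_items binary response matrix. Returns Loevinger's H."""
    n, k = X.shape
    p = X.mean(axis=0)                      # item popularities (proportion positive)
    order = np.argsort(-p)                  # easiest item first
    X = X[:, order]
    p = p[order]
    observed = expected = 0.0
    for i in range(k):
        for j in range(i + 1, k):           # item i is easier than item j
            observed += np.sum((X[:, i] == 0) & (X[:, j] == 1))   # Guttman errors
            expected += n * (1 - p[i]) * p[j]                     # errors expected under independence
    return 1.0 - observed / expected

rng = np.random.default_rng(0)
theta = rng.normal(size=500)
difficulty = np.linspace(-1.5, 1.5, 6)
X = (rng.random((500, 6)) < 1 / (1 + np.exp(-(theta[:, None] - difficulty)))).astype(int)
print("H =", round(scalability_H(X), 3))    # roughly 0.3-0.5 here; higher values indicate a stronger scale
```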

382 citations


Journal ArticleDOI
TL;DR: In this paper, two strategies for assessing item bias are discussed: methods based on Item × Group interaction and methods that compare the probabilities of a correct response for different groups conditional on ability level.
Abstract: Two strategies for assessing item bias are discussed: methods based on Item × Group interaction and methods that compare the probabilities of a correct response for different groups conditional on ability level. In latent trait models, correct response probabilities are compared conditional on latent ability; in Scheuneman's (1979) method these probabilities are compared conditional on the observed test score. Scheuneman's method is modified to fit the general theory of loglinear and logit models for contingency tables. A distinction is made between uniform and nonuniform item bias, and a method to assess item bias and distinguish between uniform and nonuniform bias is described. Data reported by Scheuneman are used to demonstrate the method in detail; the differences obtained with Scheuneman's method are discussed. In addition, the method is applied to two tests administered by van der Flier (1980) in Kenya and Tanzania, and the results are compared to those obtained with Scheuneman's approach.
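The paper works with loglinear and logit models for contingency tables; as a rough illustration of the uniform/nonuniform distinction, the sketch below uses a closely related logistic-regression formulation (not the authors' exact procedure), conditioning on the observed total score and testing a group main effect against a group-by-score interaction. The simulated data, variable names, and reliance on statsmodels and scipy are assumptions made for the example.

```python
# Illustrative sketch (not the paper's exact loglinear procedure): logit-model checks
# for uniform vs. nonuniform item bias, conditioning on the observed total score.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(1)
n = 2000
group = rng.integers(0, 2, n)                  # 0 = reference, 1 = focal (assumed coding)
score = rng.integers(0, 21, n)                 # observed total score on a 20-item test
# Simulate a studied item with uniform bias against the focal group.
logit = -2.0 + 0.25 * score - 0.6 * group
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)

def fit(X):
    return sm.Logit(y, sm.add_constant(X)).fit(disp=0)

m0 = fit(np.column_stack([score]))                          # no bias
m1 = fit(np.column_stack([score, group]))                   # uniform bias
m2 = fit(np.column_stack([score, group, score * group]))    # nonuniform bias

# Likelihood-ratio tests between nested models.
lr_uniform    = 2 * (m1.llf - m0.llf)   # evidence for uniform bias
lr_nonuniform = 2 * (m2.llf - m1.llf)   # evidence for nonuniform bias
print("uniform bias:    LR = %.2f, p = %.4f" % (lr_uniform, chi2.sf(lr_uniform, 1)))
print("nonuniform bias: LR = %.2f, p = %.4f" % (lr_nonuniform, chi2.sf(lr_nonuniform, 1)))
```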

257 citations


Journal ArticleDOI
TL;DR: In this article, the authors assessed the accuracy of simultaneous estimation of item and person parameters in item response theory in a Monte Carlo study; the root mean squared error between recovered and actual item characteristic curves served as the principal measure of estimation accuracy for items.
Abstract: This Monte Carlo study assessed the accuracy of simultaneous estimation of item and person parameters in item response theory. Item responses were simulated using the two- and three-parameter logistic models. Samples of 200, 500, 1,000, and 2,000 simulated examinees and tests of 15, 30, and 60 items were generated. Item and person parameters were then estimated using the appropriate model. The root mean squared error between recovered and actual item characteristic curves served as the principal measure of estimation accuracy for items. The accuracy of estimates of ability was assessed by both correlation and root mean squared error. The results indicate that minimum sample sizes and test lengths depend upon the response model and the purposes of an investigation. With item responses generated by the two-parameter model, tests of 30 items and samples of 500 appear adequate for some purposes. Estimates of ability and item parameters were less accurate in small sample sizes when item responses were generated with the three-parameter model.
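The study's principal accuracy index for items is the root mean squared difference between recovered and actual item characteristic curves. A minimal sketch of that index under the two-parameter logistic model follows; the parameter values are invented, and in the study the estimates would come from a calibration program rather than being supplied by hand.

```python
# Minimal sketch of the accuracy index described above: root mean squared difference
# between the true and the recovered item characteristic curve over an ability grid.
import numpy as np

def icc_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1 / (1 + np.exp(-1.7 * a * (theta - b)))   # 1.7 = conventional scaling constant

def icc_rmse(true_params, est_params, theta=np.linspace(-3, 3, 61)):
    a, b = true_params
    a_hat, b_hat = est_params
    return np.sqrt(np.mean((icc_2pl(theta, a, b) - icc_2pl(theta, a_hat, b_hat)) ** 2))

print(icc_rmse((1.2, 0.5), (1.05, 0.62)))   # a small value: the two curves nearly coincide
```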

206 citations


Journal ArticleDOI
TL;DR: In this paper, the mathematics required to calculate the asymptotic standard errors of the parameters of three commonly used logistic item response models is described and used to generate values for some common situations.
Abstract: The mathematics required to calculate the asymptotic standard errors of the parameters of three commonly used logistic item response models is described and used to generate values for some common situations. It is shown that the maximum likelihood estimation of a lower asymptote can wreak havoc with the accuracy of estimation of a location parameter, indicating that if one needs to have accurate estimates of location parameters (say for purposes of test linking/equating or computerized adaptive testing) the sample sizes required for acceptable accuracy may be unattainable in most applications. It is suggested that other estimation methods be used if the three parameter model is applied in these situations.
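As a rough sketch of the kind of calculation described above, the code below builds the per-examinee Fisher information matrix for the (a, b, c) parameters of a single three-parameter logistic item, averages it over a normal ability population, and inverts it to obtain asymptotic standard errors at a chosen sample size. The quadrature grid, the parameter values, and the assumption of known abilities are simplifications for illustration, not the paper's exact derivation.

```python
# Hedged sketch: asymptotic standard errors for one 3PL item from the averaged
# Fisher information matrix. All numerical choices are illustrative assumptions.
import numpy as np

D = 1.7                                       # conventional logistic scaling constant

def item_info_matrix(a, b, c, thetas, weights):
    """Average 3x3 information matrix for one 3PL item over an ability grid."""
    L = 1 / (1 + np.exp(-D * a * (thetas - b)))
    P = c + (1 - c) * L
    dP = np.stack([(1 - c) * L * (1 - L) * D * (thetas - b),   # dP/da
                   (1 - c) * L * (1 - L) * (-D * a),           # dP/db
                   1 - L])                                     # dP/dc
    info = (dP[:, None, :] * dP[None, :, :]) / (P * (1 - P))   # 3 x 3 x grid
    return np.sum(info * weights, axis=2)

thetas = np.linspace(-4, 4, 161)
weights = np.exp(-thetas**2 / 2); weights /= weights.sum()     # N(0, 1) quadrature weights
N = 2000                                                       # assumed calibration sample size
cov = np.linalg.inv(N * item_info_matrix(1.0, 0.0, 0.2, thetas, weights))
print("asymptotic SE(a), SE(b), SE(c):", np.sqrt(np.diag(cov)).round(3))
```

The off-diagonal terms of the inverted matrix show how strongly the lower-asymptote and location parameters are entangled, which is the mechanism behind the inflated location standard errors the abstract warns about.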

154 citations


Journal ArticleDOI
TL;DR: The linear logistic test model (LLTM), described in this paper, is a Rasch model with linear constraints on the item parameters; a special case, the linear logistic model with relaxed assumptions (LLRA), allows the characterization of individuals in a multidimensional latent space and the testing of hypotheses regarding effects of treatments.
Abstract: The linear logistic test model (LLTM), a Rasch model with linear constraints on the item parameters, is described. Three methods of parameter estimation are dealt with, giving special consideration to the conditional maximum likelihood approach, which provides a basis for the testing of structural hypotheses regarding item difficulty. Standard areas of application of the LLTM are surveyed, including many references to empirical studies in item analysis, item bias, and test construction; and a novel type of application to response-contingent dynamic processes is presented. Finally, the linear logistic model with relaxed assumptions (LLRA) for measuring change is introduced as a special case of an LLTM; it allows the characterization of individuals in a multidimensional latent space and the testing of hypotheses regarding effects of treatments.
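The defining feature of the LLTM is that item difficulties are not free parameters but linear combinations of a smaller set of basic parameters through a weight matrix, often written beta = Q eta. The sketch below illustrates only that decomposition under invented weights; actual LLTM estimation would use conditional maximum likelihood, as the abstract notes.

```python
# Minimal sketch of the LLTM's core idea: item difficulties as linear combinations
# of basic parameters, beta = Q @ eta. The Q matrix, the eta values, and the
# "cognitive operations" interpretation are invented for illustration.
import numpy as np

# Each row: how often a (hypothetical) item requires each of three operations.
Q = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 1],
              [2, 0, 1],
              [1, 1, 1]])
eta = np.array([0.4, 0.9, -0.3])        # difficulty contributed by each operation
beta = Q @ eta                          # implied Rasch item difficulties

def rasch_prob(theta, beta):
    """Probability of a correct response under the Rasch model."""
    return 1 / (1 + np.exp(-(theta - beta)))

print("item difficulties:", beta)
print("P(correct | theta = 0):", rasch_prob(0.0, beta).round(3))
```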

78 citations


Journal ArticleDOI
TL;DR: As described in this paper, the feasibility of using item response theory as a psychometric model for the GRE Aptitude Test was addressed by assessing the reasonableness of the assumptions of item response theory for GRE item types and examinee populations.
Abstract: The feasibility of using item response theory as a psychometric model for the GRE Aptitude Test was addressed by assessing the reasonableness of the assumptions of item response theory for GRE item types and examinee populations. Items from four forms and four administrations of the GRE Aptitude Test were calibrated using the three-parameter logistic item response model (one form was given at two administrations and one administration used two forms; the exact relationships between forms and administrations are given in the Test Forms and Populations section of this report). The unidimensionality assumption of item response theory was addressed in a variety of ways. Previous factor analytic research on the GRE Aptitude Test was reviewed to assess the dimensionality of the test and to extract information pertinent to the construction of sets of homogeneous items. On the basis of this review, separate calibrations of discrete verbal items and reading comprehension items were run, in addition to calibrations on all verbal items, because two strong dimensions on the verbal scale were identified in the factor analytic research. Local independence of item responses is a consequence of the unidimensionality assumption. To test the weak form of the local independence condition, partial correlations among items, both with and without a correction for guessing, with ability partialled out were computed and factor analyzed. Violations of local independence were observed in both verbal item types and quantitative item types. These violations were basically consistent with expectations based on the factor analytic review. Fit of the three-parameter logistic model to GRE Aptitude Test data was assessed by comparing estimated item-ability regressions, i.e., item response functions, with empirical item-ability regressions. The three-parameter model fit all verbal item types reasonably well. The fit to data interpretation items, regular math items, analytical reasoning items, and logical diagrams items also seemed acceptable. The model fit quantitative comparison items least well. The analysis of explanations item type was also not fit well by the three-parameter logistic model. The stability of item parameter estimates for different samples was assessed. Item difficulty estimates exhibited a large degree of stability, followed by item discrimination parameter estimates. The hard-to-estimate lower asymptote or pseudoguessing parameter exhibited the least temporal stability. The sensitivity of item parameter estimates to the lack of unidimensionality that produced the local independence violations was examined. The discrete verbal and all-verbal calibrations of discrete verbal items produced more similar estimates of item discrimination than the reading comprehension and all-verbal calibrations of reading comprehension items, reflecting the larger correlations that overall verbal ability estimates had with discrete verbal ability estimates. As compared to item discrimination estimates, item difficulty estimates exhibited much less sensitivity to homogeneity of item sets. The estimates of the lower asymptote were, for the most part, fairly robust to homogeneity of the item calibration set. The comparability of ability estimates based on homogeneous item sets (reading comprehension items or discrete verbal items) with estimates based on all verbal items was examined.
Correlations among overall verbal ability estimates, discrete verbal ability estimates, and reading comprehension ability estimates provided evidence for the existence of two distinct, highly correlated verbal abilities that can be combined to produce a composite ability that resembles the overall verbal ability defined by the calibration of all verbal items together. Three equating methods were compared in this research: equipercentile equating, linear equating, and item response theory true score equating. Various data collection designs (for both IRT and non-IRT methods) and several item parameter linking procedures (for the IRT equatings) were employed. The equipercentile and linear equatings of the verbal scales were more similar to each other than they were to the IRT equatings. The degree of similarity among the scaled score distributions produced by the various equating methods, data collection designs, and linking procedures was greater for the verbal equatings than for either the quantitative or analytical equatings. In almost every comparison, the IRT methods produced quantitative scaled score means and standard deviations that were higher and lower, respectively, than those produced by the linear and equipercentile methods. The most notable finding in the analytical equatings was the sensitivity of the precalibration design (in this study, used only for the IRT equating method) to practice effects on analytical items, particularly for the analysis of explanations item type. Since the precalibration design is the data collection method most appealing (for administrative reasons) for equating the GRE Aptitude Test in a test disclosure environment, this sensitivity might present a problem for any equating method. In sum, the item response theory model and IRT true score equating, using the precalibration data collection design, appear most applicable to the verbal section, less applicable to the quantitative section because of possible dimensionality problems with data interpretation items and instances of nonmonotonicity for the quantitative comparison items, and least applicable to the analytical section because of severe practice effects associated with the analysis of explanations item type. Expected revisions of the analytical section, particularly the removal of the troublesome analysis of explanations item type, should enhance the fit and applicability of the three-parameter model to the analytical section. Planned revisions of the verbal section should not substantially affect the satisfactory fit of the model to verbal item types. The heterogeneous quantitative section might present problems for item response theory. It must be remembered, however, that these same (and other) factors that affect IRT-based equatings may also affect other equating methods.
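One of the fit checks described above compares estimated item response functions with empirical item-ability regressions. The sketch below illustrates that comparison for a single three-parameter logistic item using simulated data; the parameter values, the binning of ability, and the sample size are assumptions made for the example only.

```python
# Hedged illustration of the fit check described above: fitted 3PL item response
# function vs. empirical proportions correct within ability groups (simulated data).
import numpy as np

rng = np.random.default_rng(7)
a, b, c = 0.9, 0.2, 0.2                         # assumed fitted 3PL parameters
theta = rng.normal(size=5000)                   # ability estimates (treated as known here)
p_model = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))
y = (rng.random(5000) < p_model).astype(int)    # simulated item responses

bins = np.linspace(-3, 3, 13)
mid = (bins[:-1] + bins[1:]) / 2
idx = np.digitize(theta, bins) - 1
for k in range(len(mid)):
    in_bin = idx == k
    if in_bin.sum() < 30:                        # skip sparse ability groups
        continue
    empirical = y[in_bin].mean()
    predicted = c + (1 - c) / (1 + np.exp(-1.7 * a * (mid[k] - b)))
    print(f"theta ~ {mid[k]:+.2f}: empirical {empirical:.2f} vs model {predicted:.2f}")
```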

32 citations


Journal ArticleDOI
TL;DR: In this paper, the effect of the position of an item within a test on examinees' responding behavior at the item level was investigated; item response theory item statistics were used to assess position effects.
Abstract: The research described in this paper deals solely with the effect of the position of an item within a test on examinees' responding behavior at the item level. For simplicity's sake, this effect will be referred to as a practice effect when the result is improved examinee performance and as a fatigue effect when the result is poorer examinee performance. Item response theory item statistics were used to assess position effects because, unlike traditional item statistics, they are sample invariant. In addition, the use of item response theory statistics allows one to make a reasonable adjustment for speededness, which is important when, as in this research, the same item administered in different positions is likely to be affected differently by speededness, depending upon its location in the test. Five types of analyses were performed as part of this research. The first three types involved analyses of differences between the two estimations of item difficulty (b), item discrimination (a), and pseudoguessing (c) parameters. The fourth type was an analysis of the differences between equatings based on items calibrated when administered in the operational section and equatings based on items calibrated when administered in section V. Finally, an analysis of the regression of the difference between b's on item position within the operational section was conducted. The analysis of estimated item difficulty parameters showed a strong practice effect for analysis of explanations and logical diagrams items and a moderate fatigue effect for reading comprehension items. Analysis of other estimated item parameters, a and c, produced no consistent results for the two test forms analyzed. Analysis of the difference between equatings for Form 3CGR1 reflected the differences between estimated b's found for the verbal, quantitative, and analytical item types. A large practice effect was evident for the analytical section, a small practice effect, probably due to capitalization on chance, was found for the quantitative section, and no effect was found for the verbal section. Analysis of the regression of the difference between b's on item position within the operational section for analysis of explanations items showed a rather consistent relationship for Form ZGR1 and a weaker but still definite relationship for Form 3CGR1. The results of this research strongly suggest one particularly important implication for equating. If an item type exhibits a within-test context effect, any equating method, e.g., IRT-based equating, that uses item data either directly or as part of an equating section score should provide for administration of the items in the same position in the old and new forms. Although a within-test context effect might have a negligible influence on a single equating, a chain of such equatings might drift because of the systematic bias.
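The final analysis described above regresses the difference between the two difficulty estimates for each item on the item's position in the operational section. A minimal sketch of that regression follows; the positions and difficulty differences are fabricated solely to show the computation.

```python
# Sketch of the regression of delta-b (difference between the two difficulty
# estimates for an item) on item position. All values below are fabricated.
import numpy as np

position = np.arange(1, 26)                              # item position in the operational section
rng = np.random.default_rng(3)
delta_b = -0.01 * position + rng.normal(0, 0.05, 25)     # difference between the two b estimates, say

slope, intercept = np.polyfit(position, delta_b, 1)
r = np.corrcoef(position, delta_b)[0, 1]
print(f"slope = {slope:.4f} per position, r = {r:.2f}")
# A systematic slope would indicate a position-dependent context effect,
# the pattern reported for analysis of explanations items.
```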

28 citations


Journal ArticleDOI
TL;DR: In this paper, the degree of overlap among four item bias methods in identifying biased items was found to depend on the extent of bias in the items comprising the initial pool of items.
Abstract: Four methods of item bias detection (transformed item difficulty, item discrimination expressed as Clemans' lambda, chi-square, and the three-parameter item characteristic curve) were studied to determine the degree of correspondence among them in identifying biased and unbiased items in reading and math subtests of the 1978 SRA Achievement Series. Intercorrelations among the four methods were moderate at best, confirming previous research involving different item bias analysis techniques. The item discrimination method showed the least correspondence with the other three methods. The extent of overlap between the four item bias methods in identifying biased items depends on the extent of bias in the items comprising the initial pool of items. That is, except for the item discrimination method, the item bias procedures identify similar sets of most biased items.
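Of the four methods compared, the transformed item difficulty (delta-plot) approach is the simplest to sketch: group-level proportions correct are mapped to the delta metric, and items lying far from the major axis of the resulting scatter are flagged. The proportions below and the flagging logic are invented for illustration, not taken from the study.

```python
# Hedged sketch of the transformed item difficulty (delta-plot) method.
import numpy as np
from scipy.stats import norm

p_ref   = np.array([0.85, 0.72, 0.64, 0.55, 0.40, 0.78, 0.30])   # proportion correct, reference group
p_focal = np.array([0.80, 0.60, 0.58, 0.52, 0.20, 0.73, 0.27])   # proportion correct, focal group

def to_delta(p):
    """Angoff delta metric: mean 13, standard deviation 4, larger = harder."""
    return 13 + 4 * norm.ppf(1 - p)

x, y = to_delta(p_ref), to_delta(p_focal)
sxx, syy, sxy = x.var(), y.var(), np.cov(x, y, bias=True)[0, 1]
A = (syy - sxx + np.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)   # major-axis slope
B = y.mean() - A * x.mean()
distance = (A * x - y + B) / np.sqrt(A ** 2 + 1)    # signed perpendicular distance from the major axis
print(np.round(distance, 2))                        # large |distance| flags a potentially biased item
```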

11 citations



Journal ArticleDOI
TL;DR: It is shown how the utility of the decision outcomes and the distribution of the true scores can be taken into consideration when selecting items for mastery decisions by approaching item selection from a decision-theoretic point of view.
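As a rough illustration of the decision-theoretic idea in the TL;DR, the sketch below computes the expected utility of a pass/fail mastery decision from an assumed true-score distribution, a binomial error model, and a simple utility table. Every numerical choice here is an assumption made for the example, not the paper's specification.

```python
# Hedged illustration: expected utility of a mastery decision, combining an assumed
# true-score distribution with utilities for each decision outcome.
import numpy as np
from scipy.stats import beta, binom

n_items, cut_obs, cut_true = 20, 14, 0.70         # test length, observed-score cutoff, mastery threshold
utility = {("master", "pass"): 1.0, ("master", "fail"): -0.5,
           ("nonmaster", "pass"): -1.0, ("nonmaster", "fail"): 0.5}

tau = np.linspace(0.001, 0.999, 999)              # grid over true proportion-correct scores
w = beta.pdf(tau, 8, 4); w /= w.sum()             # assumed true-score distribution
p_pass = 1 - binom.cdf(cut_obs - 1, n_items, tau) # P(observed score >= cutoff | tau), binomial error model

is_master = tau >= cut_true
expected_utility = np.sum(w * (
    p_pass * np.where(is_master, utility[("master", "pass")], utility[("nonmaster", "pass")]) +
    (1 - p_pass) * np.where(is_master, utility[("master", "fail")], utility[("nonmaster", "fail")])
))
print(f"expected utility of this 20-item test at cut score {cut_obs}: {expected_utility:.3f}")
```

Candidate item sets or cut scores could be compared on this criterion, which is the sense in which item selection becomes a decision-theoretic problem.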

4 citations



Journal ArticleDOI
TL;DR: This paper examined the effect of nonresponse to questions of ethnic identity on the measurement of DIF for SAT verbal items using a commonly used modern method, the Mantel-Haenszel procedure (Holland and Thayer, 1988).
Abstract: The accuracy of procedures that are used to compare the performance of different groups of examinees on test items obviously depends upon the correct classification of members in each examinee group. The significance of this dependence is determined by the sensitivity of the statistical procedure and the proportion of examinees who are unidentified. Since the number of nonrespondents to questions of ethnicity is often of the same order of magnitude as the number of identified members of most minority groups, understanding the effect of nonresponse is crucial to evaluating the validity of a procedure which is used to study differential item functioning (DIF). In this study, we examined the effect of nonresponse to questions of ethnic identity on the measurement of DIF for SAT verbal items using a commonly used modern method, the Mantel-Haenszel procedure (Holland and Thayer, 1988). We found that efforts to obtain more complete ethnic identifications from the examinees would be rewarded with more accurate DIF analyses.
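The Mantel-Haenszel procedure named above pools 2 x 2 tables of correct/incorrect by group across levels of a matching score and summarizes DIF through a common odds ratio, often re-expressed on the ETS delta scale. The sketch below shows that computation on invented stratified counts; it does not reproduce the study's handling of nonrespondents.

```python
# Hedged sketch of the Mantel-Haenszel DIF statistic: common odds ratio pooled over
# matching-score strata, re-expressed as MH D-DIF = -2.35 * ln(alpha_MH).
import numpy as np

# Per stratum: [reference correct, reference incorrect, focal correct, focal incorrect]
strata = np.array([
    [ 50,  70,  20,  45],
    [ 90,  60,  40,  45],
    [130,  40,  65,  35],
    [150,  20,  80,  18],
], dtype=float)

A, B, C, D = strata.T
N = A + B + C + D
alpha_mh = np.sum(A * D / N) / np.sum(B * C / N)   # Mantel-Haenszel common odds ratio
mh_d_dif = -2.35 * np.log(alpha_mh)                # negative values indicate DIF against the focal group
print(f"alpha_MH = {alpha_mh:.2f}, MH D-DIF = {mh_d_dif:.2f}")
```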