
Showing papers on "Differential item functioning" published in 1987


Journal ArticleDOI
TL;DR: This article identified item factors that may contribute to differential item functioning (DIF) for Black examinees on Scholastic Aptitude Test (SAT) analogy items and found that Black students need more time to complete the SAT verbal sections than White students with comparable total SAT verbal scores.
Abstract: The purpose of the present investigation was to identify item factors that may contribute to differential item functioning (DIF) for Black examinees on Scholastic Aptitude Test (SAT) analogy items. This research was considered necessary because analogy items have repeatedly been identified as differentially more difficult for Black examinees. The research was performed in two steps. Initially, items in three forms were classified according to several possible explanatory factors. Preliminary analyses identified several factors that seemed to affect DIF for Black examinees. In order to confirm these hypothesized factors, a second step involved classifying and analyzing analogy items from two additional SAT forms. The most significant finding is that Black students appear to need more time to complete the SAT verbal sections than White students with comparable total SAT verbal scores. This differential speededness effect makes analogy items appear differentially more difficult for Black examinees. Once differential item functioning statistics were corrected for speededness, fewer analogy items were identified as differentially more difficult. In addition, evaluation of the hypothesized factors showed that some of the factors are interdependent, so their individual effects could not be clearly distinguished. These item factors are: item position within each analogy set, difficulty, subject matter content, and level of abstractness. The effects of homographs and semantic relationship types are also confounded with the previous factors. The only factor that seemed to be independent was “vertical relationships”. In general, a vertical, or word-associative, answering strategy seems to be more consistently used by Black examinees on those items with negative DIF. Generalizations to be drawn from these results are limited because the analyses are based on regular-administration items, which tend to have several factors operating in a single item and in which some factors occur infrequently. At this stage, these results provide a clearer picture of how comparable Black and White students respond on analogy items and the factors that might influence their performance.
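The abstract does not spell out which DIF statistic was used or how it was corrected for speededness. As a rough illustration of the kind of matched-group comparison involved, the sketch below computes a Mantel-Haenszel DIF index (a standard choice, not necessarily the authors' method) for a single item, stratifying reference- and focal-group examinees on a matching total score; the speededness correction itself is not reproduced.

    # Illustrative Mantel-Haenszel DIF sketch; data, grouping, and stratification
    # are hypothetical and stand in for the study's own DIF procedure.
    import numpy as np

    def mantel_haenszel_dif(item, group, matching_score):
        """item: 0/1 responses to one item; group: 0 = reference, 1 = focal;
        matching_score: total score used to stratify comparable examinees."""
        num, den = 0.0, 0.0
        for s in np.unique(matching_score):
            idx = matching_score == s
            ref, foc = idx & (group == 0), idx & (group == 1)
            a = np.sum(item[ref] == 1)   # reference group, correct
            b = np.sum(item[ref] == 0)   # reference group, incorrect
            c = np.sum(item[foc] == 1)   # focal group, correct
            d = np.sum(item[foc] == 0)   # focal group, incorrect
            n = a + b + c + d
            if n == 0:
                continue
            num += a * d / n
            den += b * c / n
        alpha_mh = num / den             # common odds ratio across score strata
        return -2.35 * np.log(alpha_mh)  # ETS delta-scale index (MH D-DIF)

A negative MH D-DIF value flags the item as differentially harder for the focal group after matching on the total score.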

56 citations



Journal ArticleDOI
TL;DR: In this article, an error diagnostic program was developed to carry out an error analysis of test performance by three samples of students, and the items were characterized by their underlying subtask patterns.
Abstract: This study sought a scientific way to examine whether item response curves are influenced systematically by the cognitive processes underlying solution of the items in a procedural domain (addition of fractions). Starting from an expert teacher's logical task analysis and prediction of various erroneous rules and sources of misconceptions, an error diagnostic program was developed. This program was used to carry out an error analysis of test performance by three samples of students. After the cognitive structure of the subtasks was validated by a majority of the students, the items were characterized by their underlying subtask patterns. It was found that item response curves for items in the same categories were significantly more homogeneous than those in different categories. In other words, underlying cognitive subtasks appeared to systematically influence the slopes and difficulties of item response curves.
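To make concrete what "more homogeneous item response curves" means here, the sketch below compares the spread of slope and difficulty estimates (the a and b parameters of a logistic item response curve) within versus across subtask categories; the parameter values are invented for illustration, not taken from the study.

    # Illustrative sketch with made-up (slope a, difficulty b) estimates keyed by
    # hypothetical subtask patterns; the study's claim is that curves are more
    # alike within a category than between categories.
    import numpy as np

    params_by_category = {
        "pattern_A": [(1.1, -0.4), (1.0, -0.3), (1.2, -0.5)],
        "pattern_B": [(0.7, 0.9), (0.8, 1.1), (0.6, 1.0)],
    }

    def within_category_spread(groups):
        """Mean standard deviation of (a, b) within each category."""
        return np.mean([np.std(np.array(v), axis=0) for v in groups.values()], axis=0)

    def overall_spread(groups):
        """Standard deviation of (a, b) pooled across all categories."""
        allp = np.array([p for v in groups.values() for p in v])
        return np.std(allp, axis=0)

    print(within_category_spread(params_by_category))  # small within-category spread
    print(overall_spread(params_by_category))          # larger spread overall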

24 citations


Journal ArticleDOI
TL;DR: As discussed in this paper, the SAT-Verbal test is composed of two distinct, albeit highly related, dimensions, a reading dimension and a vocabulary dimension, and the LISREL analyses also provide some evidence for separating the vocabulary dimension into two dimensions, Antonyms and Analogies.
Abstract: The primary goal of this research was to obtain a fuller understanding of what is being measured by the SAT. In addition, this project had two secondary goals: (1) the identification of a set of analytic techniques that can satisfactorily assess the dimensionality of the SAT; and (2) the identification of components of a cost-effective and informative system for monitoring the internal construct validity of the SAT. The principal project findings are: The SAT-Mathematical test is effectively unidimensional. There is little empirical justification for dividing the total SAT-Mathematical score into subscores on the basis of item content related to subject area. For purposes of differential item functioning analysis, the total SAT-Mathematical score should be an adequate matching variable. The SAT-Verbal test is composed of two distinct, albeit highly related, dimensions, a reading dimension and a vocabulary dimension. Hence, there is an empirical justification for reporting two separate subscores. Currently, antonym and analogy items count toward a vocabulary subscore; sentence completion and reading comprehension items count toward a reading subscore. The empirical data indicate that the sentence completion items appear to belong more with the analogy and antonym items than with the reading comprehension items. The LISREL analyses also provide some evidence for separating the vocabulary dimension into two dimensions, Antonyms and Analogies. The evidence is inconsistent, however. For the purposes of differential item functioning analysis, it may be necessary to match on the current Reading subscore (reading comprehension and sentence completion) for reading comprehension items, an analogy/sentence completion criterion for analogies, an antonym/sentence completion criterion for antonyms, and any of these three criteria or SAT-Verbal for sentence completion items. Several findings have implications for redesigning the SAT-Verbal test. First, the sentence completion item type is most like the total SAT-Verbal score in what it measures, while the reading passage items measure the construct least like the total SAT-Verbal score. Second, speededness effects on SAT-Verbal may be large enough to produce a factor in item level data that is more salient than a general reading passage factor. Third, the analogy item seems to be the least reliable item type, while the sentence completion item type appears to be the most reliable. This array of results points in the direction of using more reading passage items to achieve a more distinct and reliable reading subscore, using more sentence completion items because they are more reliable and may take less time than other items, and using fewer analogy items because they are less reliable and probably more time-consuming than sentence completions or antonyms. The analyses conducted in this study have served to underscore the value of using confirmatory factor analysis of item parcel data to study the dimensional structure of test data. This approach is computationally inexpensive and appears to provide meaningful and consistent results. The use of parallel parcels makes item data amenable to linear factor analysis. The method seems to circumvent the problems associated with directly factor analyzing item data, namely, the propagation of artifactual “difficulty” factors.
The method can also avoid the problem of observing a “speed” factor, as items from later positions in the test can be balanced across parcels. In sum, the parallel parcel approach can be used to dispense with difficulty and speed factors and, hence, obtain a clearer look at the substantive factor structure of the test. Full information factor analysis is not ready for routine use with ATP data. The TESTFACT program is very expensive and sometimes yields “methodological” factors under the full information mode. The TESTFACT program can also produce a least squares factor analysis of a smoothed positive definite matrix of tetrachorics corrected for guessing at a fraction of the cost of the full information solution. Unfortunately, this solution appeared to yield difficulty factors for both SAT-Verbal and SAT-Mathematical.
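As an illustration of the parceling idea (a rough numpy sketch, not the LISREL or TESTFACT procedures used in the study), items can be dealt into roughly parallel parcels by difficulty, and the parcel sum scores then factor-analyzed with ordinary linear methods; balancing late-position items across parcels in the same way is what keeps a speed factor from emerging.

    # Hypothetical item-parceling sketch: the data, parcel count, and the
    # principal-component one-factor shortcut are illustrative only.
    import numpy as np

    def make_parcels(responses, difficulty, n_parcels=6):
        """responses: examinees x items 0/1 matrix.
        Items are ranked by difficulty and dealt out round-robin so the parcels
        stay roughly parallel; a fuller version would also balance item position."""
        order = np.argsort(difficulty)
        groups = [order[i::n_parcels] for i in range(n_parcels)]
        return np.column_stack([responses[:, g].sum(axis=1) for g in groups])

    def first_factor_loadings(parcel_scores):
        """First principal component of the parcel correlations, a crude stand-in
        for the confirmatory factor analysis of parcel data."""
        r = np.corrcoef(parcel_scores, rowvar=False)
        vals, vecs = np.linalg.eigh(r)
        return vecs[:, -1] * np.sqrt(vals[-1])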

23 citations



Journal ArticleDOI
TL;DR: In this article, rank-order correlations were computed between IRT-derived item information functions (IIFs) and four conventional discrimination indices: the phi-coefficient, the B-index, phi/phi-max, and the agreement statistic.
Abstract: Several recent papers have argued for the usefulness of item response theory (IRT) methods of assessing item discrimination power for criterion-referenced tests (CRTs). Conventional methods continue to be used more widely, however, for reasons that include some practical constraints associated with the use of IRT methods. To provide users with information that may help them decide which conventional indices to employ in evaluating CRT items, Spearman rank-order correlations were computed between IRT-derived item information functions (IIFs) and four conventional discrimination indices: the phi-coefficient, the B-index, phi/phi-max, and the agreement statistic. The rank-order correlations between the phi-coefficient and the IIFs were very high, with a median of .96. The remaining conventional indices, with the exception of phi/phi-max, also correlated well with the IIFs. Theoretical explanations for these relationships are offered.
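For readers unfamiliar with the four conventional indices, the sketch below gives their usual definitions for a mastery cut score on the criterion; the data, cut score, and item information values are hypothetical, and scipy's spearmanr stands in for the rank-order correlation reported in the study.

    # Illustrative CRT item-discrimination indices; not the authors' code.
    import numpy as np
    from scipy.stats import spearmanr

    def crt_indices(item, total, cut):
        """item: 0/1 responses to one item; total: criterion score; cut: mastery cut."""
        master = (total >= cut).astype(float)
        p_item, p_mast = item.mean(), master.mean()
        # phi-coefficient: Pearson r between item score and mastery status
        phi = np.corrcoef(item, master)[0, 1]
        # phi/phi-max: phi relative to its maximum given the two marginals
        # (maximum occurs when the joint "both = 1" cell equals min of the marginals)
        p11_max = min(p_item, p_mast)
        phi_max = (p11_max - p_item * p_mast) / np.sqrt(
            p_item * (1 - p_item) * p_mast * (1 - p_mast))
        # B-index: item proportion correct for masters minus nonmasters
        b_index = item[master == 1].mean() - item[master == 0].mean()
        # agreement statistic: proportion classified consistently by item and test
        agreement = np.mean(item == master)
        return phi, phi / phi_max, b_index, agreement

    # Rank-order agreement of any one index with item information values (IIFs):
    # rho, _ = spearmanr(index_values, iif_values)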

7 citations


01 Apr 1987
TL;DR: In this article, the authors compare the characteristics of unidimensional ability estimates obtained from data generated from the multidimensional IRT (MIRT) compensatory and noncompensatory models.
Abstract: The purpose of this study was to compare the characteristics of unidimensional ability estimates obtained from data generated from the multidimensional IRT (MIRT) compensatory and noncompensatory models. Reckase, Carlson, Ackerman, and Spray (1986) reported that when the compensatory model is used and item difficulty is confounded with dimensionality, the composition of the unidimensional ability estimates differs at different points along the unidimensional ability scale. Eight data sets (four compensatory, four noncompensatory) were generated for four different levels of correlated two-dimensional abilities: ρ = 0, .3, .6, .9. In each set, difficulty was confounded with dimensionality. Each set was then calibrated using the IRT calibration programs LOGIST and BILOG. BILOG calibration of response vectors generated from the matched MIRT item parameters appeared to be more affected than LOGIST by the confounding of difficulty and dimensionality. As the correlation between the generated two-dimensional abilities increased, the response data appeared to become more unidimensional, as evidenced in bivariate plots of θ1 vs. θ2 for specified θ quantiles. A Comparison Study of the Unidimensional IRT Estimation of Compensatory and Noncompensatory Multidimensional Item Response Data. One of the underlying assumptions of unidimensional item response theory (IRT) models is that a person's ability can be estimated in a unidimensional latent space. However, researchers and educators have expressed concern about whether the response process for any one item requires only a single latent ability. Traub (1983) suggests that many cognitive variables are brought to the testing task and that the number used varies from person to person. Likewise, the combination of latent abilities required by individuals to obtain a correct response may vary from item to item. Caution over the application of unidimensional IRT estimation to multidimensional response data has been expressed by several researchers, including Ansley and Forsyth (1985); Reckase, Carlson, Ackerman, and Spray (1986); and Yen (1984). Using a compensatory multidimensional IRT (MIRT) model, Reckase et al. (1986) demonstrated that when dimensionality and difficulty are confounded (i.e., easy items discriminate only on θ1, difficult items discriminate only on θ2), the unidimensional ability scale has a different meaning at different points on the scale. Specifically, for their two-dimensional generated data set, upper ability deciles differed mainly on θ2 while the lower deciles differed mostly on θ1. These results led the authors to suggest that the univariate calibration of two-dimensional response data can be explained in terms of the interaction between the multidimensional test information and the distribution of the two-dimensional abilities. Reckase et al. (1986) examined the condition in which ability estimates were uncorrelated. Such an approach may not be very realistic, however, since most cognitive abilities tend to be
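The two generating models differ in how the two abilities combine. A minimal sketch of the usual compensatory and noncompensatory MIRT item response functions (guessing omitted, parameter values invented, not the study's generating code):

    # Two-dimensional compensatory vs. noncompensatory item response probabilities.
    import numpy as np

    def compensatory(theta, a, d):
        """High ability on one dimension can offset low ability on the other:
        the dimensions combine inside a single logistic term."""
        return 1.0 / (1.0 + np.exp(-(np.dot(a, theta) + d)))

    def noncompensatory(theta, a, b):
        """Each dimension must be adequate on its own: the probability is a
        product of per-dimension logistic terms, so a deficit cannot be offset."""
        terms = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        return np.prod(terms)

    theta = np.array([1.5, -1.0])  # strong on dimension 1, weak on dimension 2
    print(compensatory(theta, np.array([1.0, 1.0]), 0.0))                # about .62
    print(noncompensatory(theta, np.array([1.0, 1.0]), np.zeros(2)))     # about .22

With one strong and one weak ability, the compensatory item still yields a moderate success probability, while the noncompensatory product term pulls it well below that.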

6 citations


16 Jun 1987
TL;DR: In this article, the bias of the maximum likelihood estimate is examined as a function of the latent trait, or ability, for the general case in which item responses are discrete.
Abstract: The paper is concerned with the bias of the maximum likelihood estimate as a function of the latent trait, or ability, for the general case in which item responses are discrete. The rationale is presented, and observations are made with respect to the effects of the test information function, the item parameters, the number of items, the transformation of the latent variable, etc., on the amount of bias. Keywords: Mathematical models, Scale transformation.
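The report's own bias expressions are not reproduced in the abstract; for orientation, the standard test information function for dichotomous items and its link to the precision of the maximum likelihood ability estimate are:

    % Standard definitions only, not the report's derivations.
    \[
      I(\theta) = \sum_{i=1}^{n}
        \frac{\left[P_i'(\theta)\right]^2}{P_i(\theta)\,\bigl(1 - P_i(\theta)\bigr)},
      \qquad
      \operatorname{SE}(\hat\theta) \approx \frac{1}{\sqrt{I(\theta)}} .
    \]

Regions of the ability scale where the test information is small are where the maximum likelihood estimate is least precise, and they are also where its bias is of most practical concern.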

5 citations



Journal ArticleDOI
TL;DR: In this paper, the authors investigated whether ethnic, racial, or gender groups differ from the rest of the population taking the GRE General Test in terms of patterns of item difficulty and found that the latent groups that emerged were differentiated by bends in the item characteristic curves rather than by demographic effects.
Abstract: The impetus for the present study comes from the rather inconclusive results from “test bias” studies, which can be characterized as studies to determine whether ethnic, racial, or gender groups differ from the rest of the population taking the GRE General Test in terms of patterns of item difficulty. The results of such studies have been sufficiently weak as to suggest that test items are not strongly “biased” for any particular subgroups of test takers as defined by race, gender, or ethnicity. It is possible, however, that (a) there are subgroups of test takers whose patterns of item difficulty differ but whose members are demographically mixed, or (b) there are no groups for whom item difficulty differences exist (except for overall level of performance on the test). The potential groupings that underlie test performance are called latent groups because they are not defined beforehand, but are a product of the study. Latent groups are defined as groups that differ in the patterns of their item difficulties. To clarify the results of bias studies, the present study sought to discover latent groups of examinees by utilizing only test performance data. Factor analyses were performed to determine lower bounds for the numbers of latent groups. Two types of coefficients were analyzed: the proportion of joint successes on pairs of items, and phi-coefficients. Both types of coefficients were well accounted for by a single factor. This result implies the existence of no more than one group. To be sure that the effects of group heterogeneity were not being masked, test takers who described themselves as White were sampled down so that the responses of other demographic groups could more strongly affect the results. However, using either of the coefficients analyzed, a single factor still accounted very well for the relationships among test items. Clearly, if any latent group structure were to be detected, a more sensitive approach was required. Thus, the subsequent approach adopted examined the value of each residual after a given number of factors was extracted. This approach led to a three-group solution for the verbal subtest, and a two-group solution for the quantitative subtest. This pattern of item difficulties suggested that these groups were differentiated by bends in the item characteristic curves rather than by demographic effects. In addition, a large number of groups were identified for the analytic subtest, but the vectors of item difficulties proved to be essentially proportional. Also, demographic and item content information proved uninterpretable. It was concluded that the tests were essentially unitary as intended.
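As an illustration of the single-factor check described above (a numpy sketch with hypothetical data, not the study's exact procedure), one can compute the inter-item phi-coefficients, fit one factor, and examine the residual correlations; nontrivial residual structure after extracting a given number of factors is the kind of signal used here to posit additional latent groups.

    # Illustrative single-factor residual check on 0/1 item responses.
    import numpy as np

    def one_factor_residuals(responses):
        """responses: examinees x items matrix of 0/1 scores."""
        phi = np.corrcoef(responses, rowvar=False)   # phi = Pearson r on 0/1 items
        vals, vecs = np.linalg.eigh(phi)
        loadings = vecs[:, -1] * np.sqrt(vals[-1])   # loadings on the first factor
        residuals = phi - np.outer(loadings, loadings)
        np.fill_diagonal(residuals, 0.0)
        return residuals                             # small residuals -> one group

    # Example with random (hence essentially unidimensional-noise) data:
    # rng = np.random.default_rng(0)
    # resid = one_factor_residuals(rng.integers(0, 2, size=(500, 40)))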

1 citation