
Showing papers on "Item response theory published in 1994"


Book
06 Apr 1994
TL;DR: Test Bias, Item Bias, and Test Validity; Early Item Bias Indices Based on Classical Test Theory and Analysis of Variance; Item Response Theory as Applied to Differential Item Functioning; Contingency Table Approaches; Interpretations of Bias from DIF Statistics; Conclusions and Caveats.
Abstract: Introduction; Test Bias, Item Bias, and Test Validity; Early Item Bias Indices Based on Classical Test Theory and Analysis of Variance; Item Response Theory as Applied to Differential Item Functioning; Contingency Table Approaches; Interpretations of Bias from DIF Statistics; Conclusions and Caveats.

776 citations


Journal ArticleDOI
TL;DR: Findings support the content validity of the PF-10 as a measure of physical functioning and suggest that valid Rasch-IRT summary scores could be generated as an alternative to the current Likert summative scores.

284 citations



Journal ArticleDOI
TL;DR: In this paper, a graphical MIRT analysis designed to provide better insight into what individual items are measuring as well as what the test as a whole is assessing is presented and discussed, and the goal of the article is to encourage testing practitioners to use MIRT as a means to statistically validate the test specifications.
Abstract: Item response theory (IRT) describes the interaction between examinees and items using probabilistic models. One of the underlying assumptions of IRT is that examinees are all using the same skill or same composite of multiple skills to respond to each of the test items. When item response data do not satisfy the unidimensionality assumption, multidimensional item response theory (MIRT) should be used to model the item-examinee interaction. MIRT enables one to model the interaction of items that are capable of discriminating between levels of several different abilities and examinees that vary in their proficiencies on these abilities. In this article graphical MIRT analyses designed to provide better insight into what individual items are measuring as well as what the test as a whole is assessing are presented and discussed. The goal of the article is to encourage testing practitioners to use MIRT as a means to statistically validate the test specifications.

179 citations


Journal ArticleDOI
TL;DR: In this paper, generalized linear item response theory is discussed, which is based on the following assumptions: (a) a distribution of the responses occurs according to a given item format; (b) the item responses are explained by one continuous or nominal latent variable and p latent as well as observed variables that are continuous or nominal; (c) the responses to the different items of a test are independently distributed given the values of the explanatory variables; and (d) a monotone differentiable function g of the expected item response τ is needed such that a linear combination of the explanatory variables is a predictor of g(τ).
Abstract: In this article generalized linear item response theory is discussed, which is based on the following assumptions: (a) A distribution of the responses occurs according to a given item format; (b) the item responses are explained by one continuous or nominal latent variable and p latent as well as observed variables that are continuous or nominal; (c) the responses to the different items of a test are independently distributed given the values of the explanatory variables; and (d) a monotone differentiable function g of the expected item response τ is needed such that a linear combination of the explanatory variables is a predictor of g(τ).

164 citations
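The link-function idea in the abstract above can be illustrated with a small sketch. The logit link, coefficients, and covariate below are hypothetical choices used only to show how a linear combination of a latent variable and an observed variable predicts g(τ); the framework in the article is more general.

```python
import numpy as np

def glirt_expected_response(theta, x, beta_theta, beta_x, intercept):
    """Sketch of a generalized linear IRT item: a linear combination of a
    latent variable (theta) and an observed covariate (x) is the predictor
    of g(tau), with g chosen here as the logit link."""
    eta = intercept + beta_theta * theta + beta_x * x   # linear predictor of g(tau)
    return 1.0 / (1.0 + np.exp(-eta))                   # tau = inverse link of eta

# Hypothetical values: latent ability 0.8, observed group indicator 1
print(glirt_expected_response(theta=0.8, x=1, beta_theta=1.2, beta_x=-0.3, intercept=0.1))
```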


Journal ArticleDOI
TL;DR: In nursing research many concepts are measured by questionnaires, and unidimensional scaling procedures and their rationales are discussed.
Abstract: In nursing research many concepts are measured by questionnaires. Respondents are asked to respond to a set of related statements or questions. In unidimensional scaling these statements or questions are indicants of the same concept. Scaling means to assign numbers to respondents, according to their position on the continuum underlying the concept. It is very common to use the summative Likert scaling procedure. The sumscore of the responses to the items is the estimator of the position of the patient on the continuum. The rationale behind this procedure is classical test theory. The main assumption in this theory is that all items are parallel instruments. The Rasch model offers an alternative scaling procedure. With Rasch both respondents and items are scaled on the same continuum. Whereas in Likert scaling all items have the same weight in the summating procedure, in the Rasch model items are differentiated from each other by 'difficulty'. The model holds that the probability of a positive response to an item is dependent on the difference between the difficulty of the item and the value of the person on the latent trait. The rationale behind this procedure is item response theory. In this paper both scaling procedures and their rationales are discussed.

147 citations
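The contrast drawn above between Likert sum scoring and Rasch scaling can be made concrete with a minimal sketch. The item difficulties and response pattern below are invented for illustration, and the ability estimate is obtained by a simple grid search rather than the estimation routines a Rasch program would use.

```python
import numpy as np

def rasch_prob(theta, b):
    """Probability of a positive response under the Rasch model:
    depends only on the difference between person location theta
    and item difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# Hypothetical dichotomous responses of one respondent to five items
responses = np.array([1, 1, 1, 0, 0])
difficulties = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])  # assumed item difficulties

# Likert-style (classical) scaling: every item carries the same weight
sum_score = responses.sum()

# Rasch scaling: maximum-likelihood estimate of theta given the difficulties,
# found here by a coarse grid search over the latent continuum
grid = np.linspace(-4, 4, 801)
loglik = np.array([
    np.sum(responses * np.log(rasch_prob(t, difficulties)) +
           (1 - responses) * np.log(1 - rasch_prob(t, difficulties)))
    for t in grid
])
theta_hat = grid[np.argmax(loglik)]

print(f"Sum score: {sum_score}, Rasch theta estimate: {theta_hat:.2f}")
```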


Journal ArticleDOI
TL;DR: This paper used item response theory to evaluate item performance as a function of severity of depression as well as examine gender item bias and the appropriateness of the weights assigned to response options in both depressed outpatient and non-patient college samples.
Abstract: Since the introduction of the Beck Depression Inventory (BDI; Beck, Ward, Mendelson, Mock, & Erbaugh, 1961), studies have documented its effectiveness as a self-report measure of depression (for a review, see Beck, Steer, & Garbin, 1988). These studies report on the effectiveness of the BDI within and between groups but do not examine the psychometric properties of the BDI as a function of the severity of depression within groups. Previous research reporting gender differences on the BDI has neither distinguished group mean differences from differences due to gender item bias (Thissen, Steinberg, & Gerrard, 1986) nor examined how such differences between men and women may vary as a function of depression. Furthermore, despite the large number of studies investigating the BDI, few studies have directly assessed the appropriateness of the weights assigned to response options. In this article, we use techniques based on item response theory to evaluate item performance as a function of the severity of depression as well as to examine gender item bias and the appropriateness of the weights assigned to response options in both depressed outpatient and nonpatient college samples. Items on the BDI consist of groups of graded statements, or options, reflecting different degrees of severity for the symptom domain assessed by that item. For each item there are four options.

133 citations



Journal ArticleDOI
TL;DR: In this paper, two nonparametric procedures, the Mantel-Haenszel (MH) procedure and the simultaneous item bias (SIB) procedure, were compared with respect to their Type I error rates and power.
Abstract: Two nonparametric procedures for detecting differential item functioning (DIF)—the Mantel-Haenszel (MH) procedure and the simultaneous item bias (SIB) procedure—were compared with respect to their Type I error rates and power. Data were simulated to reflect conditions varying in sample size, ability distribution differences between the focal and reference groups, proportion of DIF items in the test, DIF effect sizes, and type of item. 1,296 conditions were studied. The SIB and MH procedures were equally powerful in detecting uniform DIF for equal ability distributions. The SIB procedure was more powerful than the MH procedure in detecting DIF for unequal ability distributions. Both procedures had sufficient power to detect DIF for a sample size of 300 in each group. Ability distribution did not have a significant effect on the SIB procedure but did affect the MH procedure. This is important because ability distribution differences between two groups often are found in practice. The Type I error rates ...

122 citations
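A minimal sketch of the Mantel-Haenszel common odds-ratio estimator compared in this study is given below. The stratum counts are invented, and the conversion to the ETS delta (MH D-DIF) scale is included only as one common way of reporting the result.

```python
import numpy as np

def mantel_haenszel_odds_ratio(tables):
    """Mantel-Haenszel common odds-ratio estimate across score strata.
    Each table is (A, B, C, D): reference-correct, reference-incorrect,
    focal-correct, focal-incorrect counts within one total-score stratum."""
    num = sum(A * D / (A + B + C + D) for A, B, C, D in tables)
    den = sum(B * C / (A + B + C + D) for A, B, C, D in tables)
    return num / den

# Hypothetical counts for one studied item, stratified by total test score
strata = [
    (40, 10, 30, 20),
    (55, 15, 45, 25),
    (70, 10, 60, 20),
]
alpha_mh = mantel_haenszel_odds_ratio(strata)
# ETS delta metric often used to express the amount of DIF
mh_d_dif = -2.35 * np.log(alpha_mh)
print(f"alpha_MH = {alpha_mh:.2f}, MH D-DIF = {mh_d_dif:.2f}")
```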


Journal ArticleDOI
TL;DR: This paper proposed a loglinear IRT model that relates polytomously scored item responses to a multidimensional latent space, where the analyst may specify a response function for each response, indicating which latent abilities are necessary to arrive at that response.
Abstract: A loglinear IRT model is proposed that relates polytomously scored item responses to a multidimensional latent space. The analyst may specify a response function for each response, indicating which latent abilities are necessary to arrive at that response. Each item may have a different number of response categories, so that free response items are more easily analyzed. Conditional maximum likelihood estimates are derived and the models may be tested generally or against alternative loglinear IRT models.

120 citations


Journal ArticleDOI
TL;DR: In the context of the unidimensional case model for continuous item responses the concepts of item and test information functions, specific objectivity, item bias, and reliability are discussed; also the application of the model to test construction is shown.
Abstract: A general linear latent trait model for continuous item responses is described. The special unidimensional case for continuous item responses is Joreskog's (1971) model of congeneric item responses. In the context of the unidimensional case model for continuous item responses the concepts of item and test information functions, specific objectivity, item bias, and reliability are discussed; also the application of the model to test construction is shown. Finally, the correspondence with latent trait theory for dichotomous item responses is discussed.

01 Jan 1994
TL;DR: In this paper, a general weighted information criterion is suggested of which the usual maximum information criterion and the suggested alternative criteria are special cases, and a simulation study is conducted to compare the different criteria.
Abstract: In this study some alternative item selection criteria for adaptive testing are proposed. These criteria take into account the uncertainty of the ability estimates. A general weighted information criterion is suggested of which the usual maximum information criterion and the suggested alternative criteria are special cases. A simulation study was conducted to compare the different criteria. The results showed that the likelihood weighted mean information criterion was a good alternative to the maximum information criterion. Another good alternative was a maximum information criterion with the maximum likelihood estimate of ability replaced by the Bayesian EAP estimate. An appendix discusses the interval information criterion for the two- and three-parameter logistic item response theory model.
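The difference between the usual maximum-information rule and a weighted-information alternative can be sketched as follows. The 2PL item bank, the current ability estimate, and the normal weighting function are hypothetical stand-ins; the likelihood-weighted criterion in the study uses the actual likelihood of θ given the responses observed so far.

```python
import numpy as np

def p2pl(theta, a, b):
    """2PL response probability."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def info2pl(theta, a, b):
    """Fisher information of a 2PL item at theta."""
    p = p2pl(theta, a, b)
    return a**2 * p * (1 - p)

# Hypothetical item bank: (discrimination a, difficulty b)
bank = np.array([[1.2, -1.0], [0.8, 0.0], [1.5, 0.5], [1.0, 1.2]])

theta_hat = 0.3          # current point estimate of ability
grid = np.linspace(-4, 4, 161)
# Stand-in for the likelihood of theta given the responses so far
# (here simply a normal density centred at theta_hat)
weights = np.exp(-0.5 * ((grid - theta_hat) / 0.6) ** 2)
weights /= weights.sum()

# Usual rule: maximize information at the point estimate
max_info_pick = np.argmax([info2pl(theta_hat, a, b) for a, b in bank])

# Alternative: maximize the weighted mean information over the grid
weighted_pick = np.argmax([np.sum(weights * info2pl(grid, a, b)) for a, b in bank])

print(f"Maximum-information pick: item {max_info_pick}, "
      f"weighted-information pick: item {weighted_pick}")
```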

Journal ArticleDOI
TL;DR: In this paper, the effects of speededness on parameter estimation were examined by comparing the item and ability parameter estimates with the known true parameters, and it was found that the ability estimation was least affected by speededness in terms of the correlation between true and estimated ability parameters.
Abstract: There is a paucity of research in item response theory (IRT) examining the consequences of violating the implicit assumption of nonspeededness. In this study, test data were simulated systematically under various speeded conditions. The three factors considered in relation to speededness were proportion of test not reached (5%, 10%, and 15%), response to not reached (blank vs. random response), and item ordering (random vs. easy to hard). The effects of these factors on parameter estimation were then examined by comparing the item and ability parameter estimates with the known true parameters. Results indicated that the ability estimation was least affected by speededness in terms of the correlation between true and estimated ability parameters. On the other hand, substantial effects of speededness were observed among item parameter estimates. Recommendations for minimizing the effects of speededness are discussed.

Journal ArticleDOI
TL;DR: The evolution of test theory has been shaped by the nature of users' inferences, which, until recently, have been framed almost exclusively in terms of trait and behavioral psychology as mentioned in this paper.
Abstract: Educational test theory consists of statistical and methodological tools to support inference about examinees' knowledge, skills, and accomplishments. The evolution of test theory has been shaped by the nature of users' inferences, which, until recently, have been framed almost exclusively in terms of trait and behavioral psychology. Progress in the methodology of test theory enabled users to extend the range of inference, sharpen the logic, and ground their interpretations more solidly within these psychological paradigms. In particular, the focus remained on students' overall tendency to perform in prespecified ways in prespecified domains of tasks; for example, to make correct answers to mixed-number subtraction problems. Developments in cognitive and developmental psychology broaden the range of desired inferences, especially to conjectures about the nature and acquisition of students' knowledge. Commensurately broader ranges of data types and student models are entertained. The same underlying principles of inference that led to standard test theory can be applied to support inference in this broader universe of discourse. Familiar models and methods—sometimes extended, sometimes reinterpreted, sometimes applied to problems wholly different from those for which they were first devised—can play a useful role to this end.

01 Nov 1994
TL;DR: In this paper, the use of person-fit statistics in empirical data analysis is briefly discussed, and it is shown that in some situations, the analysis of item score patterns might reveal more information about examinees than the analysis on test scores.
Abstract: Methods for detecting item score patterns that are unlikely (aberrant) given that a parametric item response theory (IRT) model gives an adequate description of the data or given the responses of the other persons in the group are discussed. The emphasis here is on the latter group of statistics. These statistics can be applied when a nonparametric model is used to fit the data or when the data are described in the absence of an IRT model. After discussion of the literature on person-fit methods, the use of person-fit statistics in empirical data analysis is briefly discussed. In some situations, the analysis of item score patterns might reveal more information about examinees than the analysis of test scores. Finding an aberrant pattern does not explain the reason for the aberrance. A full person-fit analysis requires additional research into the motives, strategies, and backgrounds of the examinees who deviate from the statistical norm set by the model or group.

Journal ArticleDOI
TL;DR: In this article, the authors compared the power of the U3 statistic with a simple person-fit statistic, the sum of the number of Guttman errors in most cases studied, (a weighted version of) the latter statistic performed as well as the U 3 statistic.
Abstract: A number of studies have examined the power of several statistics that can be used to detect examinees with unexpected (nonfitting) item score patterns, or to determine person fit. This study compared the power of the U3 statistic with the power of one of the simplest person-fit statistics, the sum of the number of Guttman errors. In most cases studied, (a weighted version of) the latter statistic performed as well as the U3 statistic. Counting the number of Guttman errors seems to be a useful and simple alternative to more complex statistics for determining person fit. Index terms: aberrance detection, appropriateness measurement, Guttman errors, nonparametric item response theory, person fit.
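Counting Guttman errors, the simple person-fit statistic compared with U3 above, can be sketched in a few lines; the item ordering and response patterns below are hypothetical.

```python
import numpy as np

def guttman_errors(pattern, difficulty_order):
    """Count Guttman errors in a dichotomous item-score pattern.
    difficulty_order gives item indices from easiest to hardest;
    an error is any pair in which a harder item is answered correctly
    while an easier item is answered incorrectly."""
    ordered = np.asarray(pattern)[difficulty_order]
    errors = 0
    for i in range(len(ordered)):
        for j in range(i + 1, len(ordered)):
            if ordered[i] == 0 and ordered[j] == 1:
                errors += 1
    return errors

# Hypothetical data: items already ordered from easiest to hardest
order = np.arange(6)
fitting_pattern  = [1, 1, 1, 1, 0, 0]   # Guttman-consistent
aberrant_pattern = [0, 0, 1, 0, 1, 1]   # unexpected successes on hard items

print(guttman_errors(fitting_pattern, order))   # 0
print(guttman_errors(aberrant_pattern, order))  # 8
```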

Journal ArticleDOI
TL;DR: The performance of the Mantel-Haenszel odds-ratio estimator and χ² significance test were investigated using simulated data in this paper, where multiparameter logistic item response theory models were used to generate item scores for 20- and 40-item tests for 500 reference group and 500 focal group examinees.
Abstract: The performance of the Mantel-Haenszel odds-ratio estimator and χ² significance test were investigated using simulated data. Multiparameter logistic item response theory models were used to generate item scores for 20- and 40-item tests for 500 reference group and 500 focal group examinees. The difficulty, discrimination, and guessing parameters, and the difference in the group trait level averages were varied and combined factorially. Within each cell of the design, 200 replications were completed under both differential item functioning (DIF) and no-DIF conditions. The empirical χ² Type I and Type II error rates, and the average of the odds-ratio estimates, were analyzed over the 200 replications.

Journal ArticleDOI
TL;DR: In this article, a new item-fit index is proposed that is both a descriptive measure of deviance of single items and an index for statistical inference, based on the assumptions of the dichotomous and polytomous Rasch models for items with ordered categories.
Abstract: A new item-fit index is proposed that is both a descriptive measure of deviance of single items and an index for statistical inference. This index is based on the assumptions of the dichotomous and polytomous Rasch models for items with ordered categories and, in particular, is a standardization of the conditional likelihood of the item pattern that does not depend on the item parameters. This approach is compared with other methods for determining item fit. In contrast to many other item-fit indexes, this index is not based on response-score residuals. Results of a simulation study illustrating the performance of the index are provided. An asymptotically normally distributed Z statistic is derived and an empirical example demonstrates the sensitivity of the index with respect to item and person heterogeneity. Index terms: appropriateness measurement, item discrimination, item fit, partial credit model, Rasch model.

Journal ArticleDOI
TL;DR: In this paper, the authors established the correspondence between an IRF and a unique set of ICRFs for two of the most commonly used polytomous IRT models (the partial credit models and the graded response model).
Abstract: The item response function (IRF) for a polytomously scored item is defined as a weighted sum of the item category response functions (ICRF, the probability of getting a particular score for a randomly sampled examinee of ability θ). This paper establishes the correspondence between an IRF and a unique set of ICRFs for two of the most commonly used polytomous IRT models (the partial credit models and the graded response model). Specifically, a proof of the following assertion is provided for these models: If two items have the same IRF, then they must have the same number of categories; moreover, they must consist of the same ICRFs. As a corollary, for the Rasch dichotomous model, if two tests have the same test characteristic function (TCF), then they must have the same number of items. Moreover, for each item in one of the tests, an item in the other test with an identical IRF must exist. Theoretical as well as practical implications of these results are discussed.
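The relation between ICRFs and the IRF can be illustrated for the graded response model, one of the two models treated in the paper. The slope and threshold values below are arbitrary, and the IRF is computed as the weighted sum of the ICRFs with the category scores as weights.

```python
import numpy as np

def grm_icrfs(theta, a, thresholds):
    """Item category response functions (ICRFs) under Samejima's
    graded response model. thresholds are the ordered category
    boundary locations b_1 < ... < b_{m-1}."""
    # Cumulative probabilities P*(score >= k) for k = 1..m-1
    cum = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(thresholds))))
    cum = np.concatenate(([1.0], cum, [0.0]))
    return cum[:-1] - cum[1:]          # category probabilities P(score = k)

def grm_irf(theta, a, thresholds):
    """Item response function: the expected score, i.e. the weighted sum
    of the ICRFs with the category scores as weights."""
    probs = grm_icrfs(theta, a, thresholds)
    scores = np.arange(len(probs))
    return np.dot(scores, probs)

theta = 0.5
print(grm_icrfs(theta, a=1.3, thresholds=[-1.0, 0.0, 1.0]))  # four ICRFs
print(grm_irf(theta, a=1.3, thresholds=[-1.0, 0.0, 1.0]))    # expected score
```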

Journal ArticleDOI
TL;DR: In this paper, the authors investigated the effect of item context on the consistency of responses to items and found that items appearing later in a questionnaire are more related to total score than items appearing earlier.
Abstract: It has recently been argued that the process of measuring personality constructs changes the consistency of responses to items. E. S. Knowles (1988) showed that items appearing later in a questionnaire are more related to total score than items appearing earlier. J. C. Hamilton and T. R. Shuminsky (1990) offered empirical support for the hypothesis that level of self-awareness is responsible for this serial-order effect. The present study investigated the generality of the proposition that measuring personality constructs using a self-report questionnaire changes the construct measured. With techniques of item response theory (IRT), it was found that the findings of previous investigations may be explained by more specific item-context effects due to both the item's content and serial position. These findings are discussed within the framework that uses IRT to test hypotheses about item-context effects and personality measurement.

Journal ArticleDOI
TL;DR: The reliability coefficient and the standard error of measurement in classical test theory are not properties of a specific test, but are attributed to both the specific test and a specific trait as discussed by the authors.
Abstract: The reliability coefficient and the standard error of measurement in classical test theory are not properties of a specific test, but are attributed to both a specific test and a specific trait dis...

Journal ArticleDOI
TL;DR: In this paper, the performance of three methodologies for assessing unidimensionality (DIMTEST, Holland and Rosenbaum's approach, and nonlinear factor analysis) was compared on simulated and real data sets.
Abstract: This study compares the performance of three methodologies for assessing unidimensionality: DIMTEST, Holland and Rosenbaum's approach, and nonlinear factor analysis. Each method is examined and compared with the other methods on simulated and real data sets. Seven data sets, all with 2,000 examinees, were generated: three unidimensional and four two-dimensional data sets. Two levels of correlation between abilities were considered: ρ = .3 and ρ = .7. Eight different real data sets were used: four of them were expected to be unidimensional, and the other four were expected to be two-dimensional. Findings suggest that all three methods correctly confirmed unidimensionality but differed in their ability to detect lack of unidimensionality. DIMTEST showed excellent power in detecting lack of unidimensionality; Holland and Rosenbaum's and nonlinear factor analysis approaches showed good power, provided the correlation between abilities was low.

Journal ArticleDOI
TL;DR: In this paper, the derivation of d = 1.702 was discussed and it was shown that the scaling could be improved by the factor (π/√3)(15/16) = 1.70044.
Abstract: Cox (1970) observed that the most apparent method of scaling the logistic function to coincide with the normal ogive is to standardize the logistic variable, which is done by multiplying x by π/√3 = 1.81380. Johnson and Kotz (1970) graphically showed that the scaling could be improved by the factor (π/√3)(15/16) = 1.70044. However, Haley (1952) outlined the theoretical derivation of d = 1.702. Because the use of d is widespread and Haley's (1952) unpublished work is not easily accessible, the derivation of d re-presented in this brief note provides a ...
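A quick numerical check of the three scalings mentioned in this abstract is sketched below (it assumes scipy is available). It evaluates the maximum absolute difference between the standard normal CDF and the logistic CDF under each scaling and then minimizes that difference directly.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def max_abs_diff(d):
    """Maximum absolute difference between the standard normal CDF
    and the logistic CDF with scaling constant d."""
    x = np.linspace(-8, 8, 20001)
    logistic = 1.0 / (1.0 + np.exp(-d * x))
    return np.max(np.abs(norm.cdf(x) - logistic))

# Candidate scalings mentioned in the abstract
for d in (1.81380, 1.70044, 1.702):
    print(f"d = {d:.5f}: max |Phi(x) - L(dx)| = {max_abs_diff(d):.5f}")

# Direct minimization of the maximum difference
res = minimize_scalar(max_abs_diff, bounds=(1.5, 2.0), method="bounded")
print(f"minimizing d ~ {res.x:.4f}")
```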


Journal ArticleDOI
TL;DR: This article found that polytomous items provided the most information about examinees of moderately high proficiency; the information function peaked at 1.0 to 1.5, while the population distribution mean was 0.
Abstract: Using Muraki's (1992) generalized partial credit IRT model, polytomous items (responses to which can be scored as ordered categories) from the 1991 field test of the NAEP Reading Assessment were calibrated simultaneously with multiple-choice and short open-ended items. Expected information of each type of item was computed. On average, four-category polytomous items yielded 2.1 to 3.1 times as much IRT information as dichotomous items. These results provide limited support for the ad hoc rule of weighting k-category polytomous items the same as k - 1 dichotomous items for computing total scores. Polytomous items provided the most information about examinees of moderately high proficiency; the information function peaked at 1.0 to 1.5, and the population distribution mean was 0. When scored dichotomously, information in polytomous items sharply decreased, but they still provided more expected information than did the other response formats. For reference, a derivation of the information function for the generalized partial credit model is included.
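A minimal sketch of the generalized partial credit model used in this calibration is given below. The slope and step parameters are invented, and the information is computed here as the slope squared times the conditional variance of the category score, which follows from the exponential-family form of this parameterization (the D = 1.7 scaling constant is omitted for simplicity).

```python
import numpy as np

def gpcm_probs(theta, a, steps):
    """Category probabilities under the generalized partial credit model
    with common slope a and step parameters b_1..b_m."""
    steps = np.asarray(steps)
    # Cumulative sums a*(theta - b_v) for categories 1..m; category 0 gets 0
    z = np.concatenate(([0.0], np.cumsum(a * (theta - steps))))
    ez = np.exp(z - z.max())          # numerically stabilized softmax
    return ez / ez.sum()

def gpcm_information(theta, a, steps):
    """Item information: a^2 times the conditional variance of the
    category score at theta."""
    p = gpcm_probs(theta, a, steps)
    k = np.arange(len(p))
    mean = np.dot(k, p)
    return a**2 * (np.dot(k**2, p) - mean**2)

# Hypothetical four-category item (three step parameters)
thetas = np.linspace(-3, 3, 7)
info = [gpcm_information(t, a=1.0, steps=[-0.5, 0.2, 1.0]) for t in thetas]
print(np.round(info, 3))
```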

Journal ArticleDOI
TL;DR: In this article, the authors extended previous work on the effects of dimensionality on parameter estimation for dichotomous models to the polytomous graded response model, which was developed to generate data in one, two, and three dimensions.
Abstract: Most item response theory models assume a unidimensional latent space. This study extended previous work on the effects of dimensionality on parameter estimation for dichotomous models to the polytomous graded response model. A multidimensional graded response model was developed to generate data in one, two, and three dimensions. The two- and three-dimension conditions contained datasets that varied in their interdimensional association. The effects of test length and the ratio of sample size to the number of item parameters estimated also were investigated. For the unidimensional data, a sample size ratio of 5:1 provided reasonably accurate estimation; increasing test length from 15 to 30 items did not have a significant impact on the accuracy of item parameter estimation. Regardless of the dimensionality of the data, the difficulty parameters were well estimated. For the multidimensional data, the correlations between the estimated item discrimination and the average (as well as the sum of the) dimens...

Journal ArticleDOI
TL;DR: In this paper, the authors investigated appropriateness measurement in nonparametric item response theory modeling and found that the reliability of the items, the test length, the type of aberrant response behavior, and the percentage of aberrant persons in the group were all important factors for detecting aberrant responses.
Abstract: Appropriateness measurement in nonparametric item response theory modeling is affected by the reliability of the items, the test length, the type of aberrant response behavior, and the percentage of aberrant persons in the group. The percentage of simulees defined a priori as aberrant responders that were detected increased when the mean item reliability, the test length, and the ratio of aberrant to nonaberrant simulees in the group increased. Also, simulees "cheating" on the most difficult items in a test were more easily detected than those "guessing" on all items. Results were less stable across replications as item reliability or test length decreased. Results suggest that relatively short tests of at least 17 items can be used for person-fit analysis if the items are sufficiently reliable. Index terms: aberrance detection, appropriateness measurement, nonparametric item response theory, person-fit, person-fit statistic U3.

Journal ArticleDOI
TL;DR: The suitability of fitting a two-parameter logistic item response model to the Drug Use Screening Inventory (DUSI) was assessed and it is indicated that the DUSI has sound psychometric properties.
Abstract: The suitability of fitting a two-parameter logistic item response model to the Drug Use Screening Inventory (DUSI) was assessed. In a sample of 846 adolescents, each of the 10 domains was found to be unidimensional. Invariance of the item parameters across different groups was also observed. The reliability coefficient, based on item response theory, was found to be superior. The results of these analyses indicate that the DUSI has sound psychometric properties.

Journal ArticleDOI
TL;DR: In this paper, the ideal observer method was used to compare two models with the same number of parameters (Samejima's (1969) graded response model and Thissen & Steinberg's (1986) extension of Masters' (1982) partial credit model) to investigate whether difference models or divide-by-total models should be preferred for fitting Likert type data.
Abstract: Several item response models have been proposed for fitting Likert-type data. Thissen & Steinberg (1986) classified most of these models into difference models and divide-by-total models. Although they have different mathematical forms, divide-by-total and difference models with the same number of parameters seem to provide very similar fit to the data. The ideal observer method was used to compare two models with the same number of parameters—Samejima's (1969) graded response model (a difference model) and Thissen & Steinberg's (1986) extension of Masters' (1982) partial credit model (a divide-by-total model)—to investigate whether difference models or divide-by-total models should be preferred for fitting Likert-type data. The models were found to be very similar under the conditions investigated, which included scale lengths from 5 to 25 items (five-option items were used) and calibration samples of 250 to 3,000. The results suggest that both models fit approximately equally well in most practical ...

Journal ArticleDOI
TL;DR: The scaling constant d = 1.702 used in item response theory minimizes the maximum difference between the normal and logistic distribution functions as discussed by the authors, and the theoretical and numerical derivation of d gi...
Abstract: The scaling constant d = 1.702 used in item response theory minimizes the maximum difference between the normal and logistic distribution functions. The theoretical and numerical derivation of d gi...