
Showing papers on "Item response theory published in 1983"




Journal ArticleDOI
TL;DR: It appears that item response theory models can be applied to moderately heterogeneous item pools under the conditions simulated here.
Abstract: A simulation model was developed for generating item responses from a multidimensional latent trait space. The model permits the prepotency of a general latent trait underlying responses to all simu...
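The abstract is truncated here. Purely as an illustrative sketch (our own construction, not the authors' simulation model), one simple way to give a general latent trait a controllable degree of prepotency over multidimensional responses is to mix a general factor into each specific dimension:

```python
import math
import random

rng = random.Random(42)

def simulate_response(w: float, b: float, n_dims: int = 5) -> int:
    """Illustrative only: one Rasch-type response on dimension 0, where the
    general trait's prepotency is w (0 = none, 1 = fully general)."""
    g = rng.gauss(0.0, 1.0)                             # general latent trait
    s = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]    # specific traits
    theta = [math.sqrt(w) * g + math.sqrt(1.0 - w) * si for si in s]
    p = 1.0 / (1.0 + math.exp(-(theta[0] - b)))
    return int(rng.random() < p)

responses = [simulate_response(w=0.7, b=0.0) for _ in range(1000)]
```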

256 citations


Journal ArticleDOI
TL;DR: Two linearly constrained logistic models based on the well-known dichotomous Rasch model, the linear logistic test model (LLTM) and the linear logistic model with relaxed assumptions (LLRA), are discussed in this paper.
Abstract: Two linearly constrained logistic models which are based on the well-known dichotomous Rasch model, the ‘linear logistic test model’ (LLTM) and the ‘linear logistic model with relaxed assumptions’ (LLRA), are discussed. Necessary and sufficient conditions for the existence of unique conditional maximum likelihood estimates of the structural model parameters are derived. Methods for testing composite hypotheses within the framework of these models and a number of typical applications to real data are mentioned.
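For readers unfamiliar with these models, the LLTM keeps the dichotomous Rasch response function but constrains each item difficulty to a linear combination of a small set of basic parameters (a standard formulation; notation follows common usage rather than this particular paper):

```latex
P(X_{vi} = 1 \mid \theta_v)
  = \frac{\exp(\theta_v - \beta_i)}{1 + \exp(\theta_v - \beta_i)},
\qquad
\beta_i = \sum_{j=1}^{p} q_{ij}\,\eta_j + c ,
```

where θ_v is the ability of person v, the weights q_{ij} are fixed in advance (for instance, how often cognitive operation j is required by item i), the η_j are the basic structural parameters, and c is a normalization constant.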

171 citations


Journal ArticleDOI
TL;DR: In this paper, the problem of characterizing the manifest probabilities of a latent trait model is considered, where the item characteristic curve is transformed to the item passing-odds curve and a corresponding transformation is made on the distribution of ability.
Abstract: The problem of characterizing the manifest probabilities of a latent trait model is considered. The item characteristic curve is transformed to the item passing-odds curve and a corresponding transformation is made on the distribution of ability. This results in a useful expression for the manifest probabilities of any latent trait model. The result is then applied to give a characterization of the Rasch model as a log-linear model for a 2^J contingency table. Partial results are also obtained for other models. The question of the identifiability of “guessing” parameters is also discussed.
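A sketch of the resulting characterization, in commonly used notation (ours, not necessarily the paper's): for J dichotomous Rasch items, the manifest probability of a response pattern x = (x_1, ..., x_J) follows a log-linear model for the 2^J contingency table with item main effects plus one parameter per total score,

```latex
\log P(X = x) = \mu + \sum_{j=1}^{J} x_j \lambda_j + \gamma_{r(x)},
\qquad
r(x) = \sum_{j=1}^{J} x_j ,
```

where the λ_j reflect item easiness and the score parameters γ_r absorb the ability distribution (subject to moment-type constraints).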

165 citations


Journal ArticleDOI
TL;DR: In this article, scale drift for the verbal and mathematical portions of the SAT was investigated using linear, equipercentile, and item response theory (IRT) equating methods.
Abstract: Scale drift for the verbal and mathematical portions of the Scholastic Aptitude Test (SAT) was investigated using linear, equipercentile and item response theory (IRT) equating methods. The linear ...

148 citations


Book ChapterDOI
01 Jan 1983
TL;DR: In this article, the authors discuss the reliability and validity of adaptive ability tests in a military setting and report the results of the first in a series of live-testing studies investigating the psychometric characteristics of computer-administered adaptive testing in comparison with a conventional test.
Abstract: Publisher Summary This chapter presents research on the reliability and validity of adaptive ability tests in a military setting. Its purpose is to report the results of the first in a series of live-testing studies investigating the psychometric characteristics of computer-administered adaptive testing in comparison with a conventional test. A pilot study based on a small sample of Marine recruits was followed by a larger study based on the same research design. The two studies were designed in part to address two psychometric research questions: (1) whether a computer-administered adaptive test is more reliable than a conventional test, holding test length constant, and (2) whether a computer-administered adaptive test is more valid than a comparable conventional test, again holding test length constant. These questions were motivated by the results of previous research. The question of the advantages of adaptive tests over conventional ones in terms of reliability has a clear and positive theoretical answer: Holding test length and all else constant, a good adaptive test is superior, provided that highly discriminating test items are available. The general method used in both studies was that of equivalent tests administered to independent examinee groups. One group took two equivalent computer-administered adaptive tests. The other group took two equivalent conventional tests, also administered by computer. To control for item quality, both test types were made up of items from the same source: a common pool of 150 verbal ability items, which had previously been calibrated using item response theory methods in large samples of Marine recruits.

145 citations


Book ChapterDOI
01 Jan 1983
TL;DR: This chapter describes the use of an item response model for item analysis and test scoring in the context of timed testing by using a slightly different model and an integrated parameter estimation scheme.
Abstract: Publisher Summary This chapter describes the use of an item response model for item analysis and test scoring in the context of timed testing. The model is a revision of one proposed by Furneaux, who outlined the model and provided some illustration of the performance of parts of the system with a small set of test data, using parameters estimated by various heuristic procedures. The chapter reports the outcome of timed testing with several different kinds of test items, using a slightly different model and an integrated parameter estimation scheme. A complete system for scoring timed tests requires an item response model for the data of timed testing: the responses and the associated latencies.

144 citations


Book ChapterDOI
01 Jan 1983
TL;DR: This chapter presents a decision procedure that operates sequentially and can easily be applied to tailored testing without loss of any of the elegance and mathematical sophistication of the examination procedures.
Abstract: Publisher Summary This chapter presents a decision procedure that operates sequentially and can easily be applied to tailored testing without loss of any of the elegance and mathematical sophistication of the examination procedures. In applying the decision procedure, two specific item response theory (IRT) models are used: the one- and three-parameter logistic models. Although any other IRT model could just as easily have been used, these models were selected because of their frequent appearance in the research literature and because of the existence of readily available calibration programs and tailored testing programs. The purposes of this research were (1) to obtain information on how the sequential probability ratio test (SPRT) procedure functioned when items were not randomly sampled from the item pool; (2) to gain experience in selecting the bounds of the indifference region; and (3) to obtain information on the effects of guessing on the accuracy of classification when the one-parameter logistic model was used. To determine the effects of these variables, the computation of the SPRT was programmed into both the one- and three-parameter logistic tailored testing procedures that were operational at the University of Missouri-Columbia.
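As a concrete illustration of how such a sequential rule operates with the one-parameter logistic model, here is a minimal sketch (the function names, default error rates, and indifference-region bounds are illustrative assumptions, not the chapter's):

```python
import math

def p_correct(theta: float, b: float) -> float:
    """One-parameter logistic (Rasch) probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def sprt_classify(responses, difficulties, theta_low, theta_high,
                  alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test for mastery classification.

    theta_low and theta_high bound the indifference region; alpha and beta
    are the nominal error rates. Returns 'master', 'nonmaster', or
    'continue' given the responses seen so far."""
    upper = math.log((1.0 - beta) / alpha)   # crossing it accepts mastery
    lower = math.log(beta / (1.0 - alpha))   # crossing it accepts nonmastery
    llr = 0.0
    for x, b in zip(responses, difficulties):
        p1, p0 = p_correct(theta_high, b), p_correct(theta_low, b)
        llr += math.log(p1 / p0) if x else math.log((1.0 - p1) / (1.0 - p0))
        if llr >= upper:
            return "master"
        if llr <= lower:
            return "nonmaster"
    return "continue"

# Ten items of middling difficulty, mostly answered correctly.
print(sprt_classify([1, 1, 0, 1, 1, 1, 0, 1, 1, 1], [0.0] * 10,
                    theta_low=-0.5, theta_high=0.5))   # -> 'master'
```

Items are processed one at a time, so testing stops as soon as the accumulated log-likelihood ratio leaves the continuation region, which is what makes the procedure attractive for tailored testing.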

134 citations




Book
01 Jan 1983
TL;DR: Osterlind discusses five strategies for detecting item bias: analysis of variance, transformed item difficulties, chi-square, item characteristic curve, and distractor response, covering the specific hypotheses under test for each technique as well as the capabilities and limitations of each strategy.
Abstract: A unique, practical manual for identifying and analyzing item bias in standardized tests. Osterlind discusses five strategies for detecting bias: analysis of variance, transformed item difficulties, chi-square, item characteristic curve, and distractor response. He covers specific hypotheses under test for each technique, as well as the capabilities and limitations of each strategy.

Journal ArticleDOI
TL;DR: In this article, building on ideas from a recent thesis by Van den Wollenberg, several local, small-sample, exploratory goodness-of-fit techniques for the Rasch model for dichotomous items are suggested: difficulty plots for person groups scoring right and wrong on a specific item, a slope test per item based on a binomial distribution per score group, and a unidimensionality check based on an extended hypergeometric distribution per score group.
Abstract: Although several goodness of fit tests have been developed for the Rasch model for dichotomous items, most of them are of a global, asymptotic, and confirmatory type. This paper, based on ideas from a recent thesis by Van den Wollenberg, offers some suggestions for local, small sample, and exploratory techniques: difficulty plots for person groups scoring right and wrong on a specific item, a slope test per item based on a binomial distribution per score group, and a unidimensionality check based on an extended hypergeometric distribution per score group.

Book ChapterDOI
01 Jan 1983
TL;DR: This chapter compares two variable-length mastery testing procedures, item response theory-based adaptive mastery testing (AMT) and the sequential probability ratio test (SPRT), to one another and to a conventional testing procedure by Monte Carlo simulation.
Abstract: Publisher Summary This chapter presents a study for the comparison of item response theory-based adaptive mastery testing (AMT) and a sequential mastery testing procedure. Monte Carlo simulation was used to delineate circumstances in which one of the mastery testing procedures might have an advantage over the other. The method used to compare the two variable-length mastery testing procedures, which were AMT and sequential probability ratio test (SPRT), to one another, as well as to a conventional testing procedure, consisted of the following steps. (1) Four item pools were generated in which the items differed from one another to different degrees. (2) The desired mastery level on the proportion-correct metric was converted to the θ metric by means of the test response function from each item pool, as required by the AMT procedure. (3) Item responses were generated for 500 simulated subjects for each of the items in the four item pools. (4) Conventional tests of three different lengths were drawn from the larger item pools; these conventional tests served as item pools from which the SPRT and AMT procedures drew items. (5) The AMT and SPRT procedures were simulated for each of the four different item pool types and the three conventional test lengths. (6) Comparisons were made among the three types of tests, AMT, SPRT, and conventional concerning the degree of correspondence between the decisions made by the three test types and the true mastery status. Further comparisons were made based on the average test length that each test type required to reach its decisions.
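Step (3), generating binary responses from an item response model, is the core of any such Monte Carlo comparison. A minimal sketch under the three-parameter logistic model (pool sizes and item parameters below are hypothetical, not those of the study):

```python
import math
import random

rng = random.Random(0)

def p3pl(theta: float, a: float, b: float, c: float) -> float:
    """Three-parameter logistic item response function (D = 1.7 scaling)."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

# 500 simulated examinees responding to a hypothetical 50-item pool.
thetas = [rng.gauss(0.0, 1.0) for _ in range(500)]
items = [(1.0, rng.uniform(-2.0, 2.0), 0.2) for _ in range(50)]  # (a, b, c)
data = [[int(rng.random() < p3pl(t, a, b, c)) for (a, b, c) in items]
        for t in thetas]
```

Each simulee's true θ is retained, so the decisions reached by the AMT, SPRT, and conventional procedures can be scored against true mastery status.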

Journal ArticleDOI
TL;DR: In this article, the sampling errors of maximum likelihood estimation of item response theory parameters are studied in the case when both people and item parameters are estimated simultaneously, and a check on the validity of the standard error formulas is carried out.
Abstract: The sampling errors of maximum likelihood estimates of item response theory parameters are studied in the case when both people and item parameters are estimated simultaneously. A check on the validity of the standard error formulas is carried out. The effect of varying sample size, test length, and the shape of the ability distribution is investigated. Finally, the effect of anchor-test length on the standard error of item parameters is studied numerically for the situation, common in equating studies, when two groups of examinees each take a different test form together with the same anchor test. The results encourage the use of rectangular or bimodal ability distributions, and also the use of very short anchor tests.
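The standard error formulas being checked are, in the usual large-sample theory, inverse square roots of the Fisher information; for example, for an ability estimate under the Rasch model with k items,

```latex
\operatorname{SE}(\hat{\theta}) \approx
\left[ \sum_{i=1}^{k} P_i(\theta)\bigl(1 - P_i(\theta)\bigr) \right]^{-1/2},
\qquad
P_i(\theta) = \frac{\exp(\theta - b_i)}{1 + \exp(\theta - b_i)} ,
```

which makes clear why ability distributions that place many examinees where they are informative about the items (e.g., rectangular or bimodal ones, for item-parameter estimation) can reduce standard errors.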

Book ChapterDOI
01 Jan 1983
TL;DR: In this paper, the authors investigated the efficiency of the Urry procedure and the maximum likelihood procedure to estimate parameters in the three-parameter model, to study the properties of the estimators, and to provide some guidelines regarding the conditions under which they should be employed.
Abstract: Publisher Summary This chapter discusses a study to investigate the efficiency of the Urry procedure and the maximum likelihood procedure to estimate parameters in the three-parameter model, to study the properties of the estimators, and to provide some guidelines regarding the conditions under which they should be employed. In particular, the issues investigated were (1) the accuracy of the two estimation procedures; (2) the relations among the number of items, examinees, and the accuracy of estimation; (3) the effect of the distribution of ability on the estimates of item and ability parameters; and (4) the statistical properties, such as bias and consistency, of the estimators. To investigate the issues mentioned above, artificial data were generated according to the three-parameter logistic model using the DATGEN program of Hambleton and Rovinelli. Data were generated to simulate various testing situations by varying the test length, the number of examinees, and the ability distribution of the examinees. In the Urry estimation procedure, the relationships that exist for item discrimination and item difficulty between the latent trait theory parameters and the classical item parameters are exploited. These relationships are derived under the assumption that ability is normally distributed and that the item characteristic curve is the normal ogive. To study how the departures from the assumption of normally distributed abilities affect the Urry procedure, three ability distributions were considered: normal, uniform, and negatively skewed.
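The exploited relationships are the classical normal-ogive identities: with ability distributed N(0, 1), the item biserial correlation ρ_i and the classical difficulty p_i (proportion correct) determine the latent trait parameters,

```latex
\rho_i = \frac{a_i}{\sqrt{1 + a_i^2}}
\;\Longrightarrow\;
a_i = \frac{\rho_i}{\sqrt{1 - \rho_i^2}},
\qquad
b_i = \frac{\Phi^{-1}(1 - p_i)}{\rho_i},
```

where Φ⁻¹ is the inverse standard normal distribution function. When the ability distribution is uniform or skewed rather than normal, these identities no longer hold exactly, which is precisely the sensitivity the study probes.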

Journal ArticleDOI
TL;DR: In this article, a new approach to assessing unexpected differential item performance (item bias or item fairness) is developed and applied to the item responses of males and females to SAT/TSWE items administered operationally in December 1977.
Abstract: A new approach to assessing unexpected differential item performance (item bias or item fairness) is developed and applied to the item responses of males and females to SAT/TSWE items administered operationally in December 1977. While the main body of the report describes the particulars of the present application and delineates the essential features of the approach, a technical appendix describes the standardization approach in detail. The primary goal of the standardization approach is to control for differences in subpopulation ability before making comparisons between subpopulation performance on test items. By so doing, it removes the contaminating effects of ability differences from the assessment of item fairness. Of the total of 195 items studied, the standardization approach identified only a handful as meriting careful review for possible content bias. Of these few, only one item exhibited a clearly unacceptable degree of unexpected differential item performance between males and females that could be attributed to content bias.
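The core of the standardization approach can be sketched as a weighted difference in item performance after conditioning on a matching criterion such as total score (notation ours):

```latex
D_{\mathrm{STD}} =
\frac{\sum_{s} w_s \bigl( P_{fs} - P_{rs} \bigr)}{\sum_{s} w_s},
```

where s indexes levels of the matching criterion, P_{fs} and P_{rs} are the proportions answering the item correctly in the focal and reference groups at level s, and the weights w_s (often the focal-group frequencies) define the standardization population.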

Journal ArticleDOI
TL;DR: Two distinct approaches, one based on item response theory and the other based on observed item responses and standard summary statistics, have been proposed to identify unusual patterns of responses to test items.
Abstract: Two distinct approaches, one based on item response theory and the other based on observed item responses and standard summary statistics, have been proposed to identify unusual patterns of responses to test items. A link between these two approaches is provided by showing certain correspondences between Sato's S-P curve theory and item response theory. This link makes possible several extensions of Sato's caution index that take advantage of the results of item response theory. Several such indices are introduced and their use illustrated by application to a set of achievement test data. Two of the newly introduced extended indices were found to be very effective for purposes of identifying persons who consistently use an erroneous rule in attempting to solve signed-number arithmetic problems. The potential importance of this result is briefly discussed.
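For orientation, Sato's caution index for examinee i is commonly written as a covariance ratio against the Guttman (perfect) response pattern; the extended indices replace the observed item proportions with item response theory model probabilities (a sketch in our notation, not necessarily the paper's):

```latex
C_i = 1 - \frac{\operatorname{cov}(x_i,\, p)}{\operatorname{cov}(x_i^{*},\, p)},
```

where x_i is examinee i's response vector, p is the vector of item proportions correct, and x_i^{*} is the Guttman pattern with the same number-correct score.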

Journal ArticleDOI
TL;DR: Latent trait models for binary responses to a set of test items are considered from the point of view of estimating latent trait parameters θ = (θ1, …, θn) and item parameters β = (β1, …, βk), where βj may be vector valued.
Abstract: Latent trait models for binary responses to a set of test items are considered from the point of view of estimating latent trait parameters θ = (θ1, …, θn) and item parameters β = (β1, …, βk), where βj may be vector valued. With θ considered a random sample from a prior distribution with parameter φ, the estimation of (θ, β) is studied under the theory of the EM algorithm. An example and computational details are presented for the Rasch model.
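A sketch of the EM setup described here: with each θ_v treated as a draw from a prior g(· | φ), iteration t + 1 maximizes the expected complete-data log-likelihood

```latex
Q\bigl(\beta, \phi \mid \beta^{(t)}, \phi^{(t)}\bigr)
= \sum_{v=1}^{n}
\operatorname{E}\!\left[
  \log f(x_v \mid \theta_v, \beta) + \log g(\theta_v \mid \phi)
  \;\middle|\; x_v, \beta^{(t)}, \phi^{(t)}
\right],
```

where the E-step computes each posterior over θ_v under the current parameter values and the M-step updates (β, φ).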

Book ChapterDOI
01 Jan 1983
TL;DR: In this paper, the feasibility of the person response curve (PRC) approach for investigating the fit of persons to the three-parameter item response theory (IRT) model is discussed.
Abstract: Publisher Summary This chapter discusses the feasibility of the person response curve (PRC) approach for investigating the fit of persons to the three-parameter item response theory (IRT) model. To operationalize the PRC, it subdivides ability test items into separate strata of varying difficulty levels. The limited literature on person variability within a test thus seems to have three major trends: (1) the direct analysis of person variability as originally suggested by Mosier, later called the testee's trace line by Weiss, the subject characteristic curve by Vale and Weiss, and the person characteristic curve by Lumsden; (2) the designation of highly variable persons as aberrant by Levine and Rubin; and (3) the elimination of aberrant person-item interactions by Wright. A careful analysis of these three approaches indicates that the first approach is the most general of the three, subsuming the other two as special cases: If the entire pattern of a testee's responses is studied as a function of the difficulty levels of the items, the identification of aberrant response patterns or person-item interactions follows directly. In addition, postulating a person characteristic curve in conjunction with IRT provides a means of testing whether the response patterns of single individuals fit the theory, regardless of the number of parameters assumed.

Journal ArticleDOI
TL;DR: In this article, a theoretical model for dealing with omitted responses is presented, and two special cases are investigated.
Abstract: A theoretical model is given for dealing with omitted responses. Two special cases are investigated.

Book ChapterDOI
01 Jan 1983
TL;DR: In this article, the adequacy of score-equating models when certain sample and test characteristics are systematically varied is examined, with the emphasis on curvilinear rather than linear models.
Abstract: Publisher Summary This chapter presents a study to determine the adequacy of curvilinear score equating models. In an ideal psychometric world, tests on which scores need to be equated would be parallel in all important respects: An anchor test, if used, would be parallel to the total tests and random samples on which to base the equating would always be available. In actual testing practice, however, scores must sometimes be equated under less-than-optimum conditions. This study is the first part of a larger study, the purpose of which is to examine the adequacy of score-equating models when certain sample and test characteristics are systematically varied. The emphasis in this part of the study is on curvilinear models, whereas the second part focuses on linear models. This study is more comprehensive than previous studies of equating models in that it includes a greater variety of equipercentile, linear, and item response theory models and investigates equating based on dissimilar samples as well as on random samples.
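The curvilinear (equipercentile) family rests on matching percentile ranks across forms, e_Y(x) = G_Y^{-1}(F_X(x)). A minimal empirical sketch, with hypothetical raw-score samples standing in for real equating data:

```python
import numpy as np

def equipercentile(x_scores, y_scores, x):
    """Equate score x on form X to the form-Y scale: e_Y(x) = G^{-1}(F(x))."""
    p = np.mean(np.asarray(x_scores) <= x)       # percentile rank of x on form X
    return np.quantile(np.asarray(y_scores), p)  # matching quantile on form Y

# Hypothetical raw-score samples from two test forms.
rng = np.random.default_rng(0)
form_x = rng.normal(50.0, 10.0, 2000).round()
form_y = rng.normal(53.0, 9.0, 2000).round()
print(equipercentile(form_x, form_y, 60.0))
```

Operational methods add smoothing and handle the discreteness of raw scores, but the percentile-matching step above is the common core.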

Journal ArticleDOI
TL;DR: In this paper, the applicability of item response theory to attitude scale development is discussed and an illustration derived from a study of the propensity toward jealousy in romantic relationships is presented.
Abstract: This paper describes the applicability of item response theory to attitude scale development and provides an illustration derived from a study of the propensity toward jealousy in romantic relationships. The item analysis model used is identical to the factor analysis model, so factor analytic criteria are used to evaluate the scale. These criteria may be used to decide whether the scale may be scored as a measure of a single variable and whether a simple sum or a weighted sum of the item responses serves as an optimal test score. Estimates of the reliability of the scale based on the item response model are also described.

Journal ArticleDOI
TL;DR: Group-level item response models extend the machinery of item response theory to data gathered in the maximally efficient multiple-matrix sampling design, under which each sampled subject is administered only one item from a scale.
Abstract: The most familiar models of item response theory (IRT) are defined at the level of individual subjects; the form and the parameters of a model specify the probability of a correct response to a particular item from a particular subject. It is possible, however, to define an item response model at the level of salient groups of subjects; the form and the parameters of such a model would specify the probability of a correct response to a particular item from a subject selected at random from a particular group of subjects. So-called "group level" models extend the machinery of IRT to data gathered in the maximally efficient multiple-matrix sampling design, under which each sampled subject is administered only one item from a scale. This paper discusses group-level item response models, their uses, their relationships to subject-level item response models, and the estimation of model parameters. Two salient advances in educational testing over the past two decades have been the development of item response theory (IRT) and the introduction of multiple-matrix sampling (MMS) methods. It is ironic that until recently researchers interested in MMS methods have not been able to enjoy the full benefits of the former. At first blush, the two concepts seem at odds. MMS designs provide the most efficient estimates of population parameters by bypassing the estimation of parameters for sampled individuals (the most efficient designs solicit only one response from a given individual (Lord,
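A sketch of the group-level idea: if F_g is the ability distribution in group g, the probability that a randomly sampled member of g answers item j correctly is the subject-level response function averaged over the group,

```latex
\pi_{jg} = \int P_j(\theta)\, dF_g(\theta),
```

which ties group-level parameters to subject-level item parameters through the assumed form of F_g.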


Book
01 Sep 1983
TL;DR: A book which summarizes many of the recent advances in the theory and practice of achievement testing, in the light of technological developments, and developments in psychometric and psychological theory is presented in this paper.
Abstract: A book which summarizes many of the recent advances in the theory and practice of achievement testing, in the light of technological developments, and developments in psychometric and psychological theory. It provides an introduction to the two major psychometric models, item response theory and generalizability theory, and assesses their strengths for different applications. The book closes with some speculations about the future of achievement tests for the assessment of individuals, as well as monitoring of educational progress. '...the book contains valuable information for both beginners and for advanced workers who want an overview of recent work in achievement testing.' -- The Journal of the American Statistical Association, June 1985

Journal ArticleDOI
TL;DR: In this paper, the well-known Rasch model is generalized to a multicomponent model, so that observations of component events are not needed to apply the model, and the results of an application to a mathematics test involving six components are described.
Abstract: The well-known Rasch model is generalized to a multicomponent model, so that observations of component events are not needed to apply the model. It is shown that the generalized model has retained the property of the specific objectivity of the Rasch model. For a restricted variant of the model, maximum likelihood estimates of its parameters and a statistical test of the model are given. The results of an application to a mathematics test involving six components are described.
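For context, one widely cited multicomponent Rasch-type form (Whitely's multicomponent latent trait model; shown only as a representative of this family, not necessarily the exact variant of this paper) models a correct response as the conjunction of successes on K components:

```latex
P(X_{vi} = 1) = \prod_{k=1}^{K}
\frac{\exp(\theta_{vk} - \beta_{ik})}{1 + \exp(\theta_{vk} - \beta_{ik})} .
```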



Journal ArticleDOI
TL;DR: When the test items in a pool are statistically similar, the particular choice of test items will have only a minimal impact on the decision validity of the test scores; when item pools are statistically heterogeneous, the most valid mastery tests consist of items that discriminate effectively near the cutoff score.
Abstract: The common method for selecting test items for criterion-referenced tests (CRTs) is straightforward: First, a test length is determined and then a random (or stratified random) sample of test items is drawn from the pool of acceptable (valid) test items measuring the content domain of interest (Hambleton, 1982; Popham, 1978). Readers are referred to Wilcox (1980) for a review of methods for determining test length and to Hambleton (1980, 1982) and Popham (1978, 1980) for reviews of methods for preparing and validating pools of criterion-referenced test items. Random (or stratified random) selection of test items is a satisfactory method when an item pool is statistically homogeneous because for all practical purposes, the test items are interchangeable. When items are similar statistically, the particular choice of test items will have only a minimal impact on the decision validity of the test scores. But when item pools are statistically heterogeneous (as they often are), a random selection of test items may be far from optimal for separating examinees into mastery states. For a fixed test length, the most valid test for separating examinees into mastery states will consist of test items that are discriminating effectively near the cutoff score on the domain score scale (Lord & Novick, 1968). With a randomly selected set of test items from a heterogeneous pool, the validity of the resulting classifications will generally be lower because not all of the test items will be optimally functioning near the cutoff score. For example, test items that may be very easy or very difficult or have low discriminating power are as likely to be selected with the random-selection method as are other more suitable items in the pool. It is important that these less-than-ideal items remain in the pool so that they might be sampled when tests to provide nonbiased domain score estimates are needed or when deriving the relationship between latent ability scores and domain scores (Hambleton, 1982). Also, these "poor" items may be more suitable when different cutoff scores are set. But when a test is used to make mastery/non-mastery decisions, items that do not contribute substantially to the discriminating power of the test are of limited value. Unfortunately, with a random selection of items, the probability associated with selection is the same for all items in the pool. When constructing tests to separate examinees into two or more mastery states in relation to a content domain of interest, it is desirable to select items that are most discriminating within the region of the cutoff score (Birnbaum, 1968). But criterion-referenced measurement specialists have not usually taken advantage of optimal items for
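The selection principle described here, choosing items that discriminate effectively near the cutoff, is usually operationalized by maximizing Fisher information at the cutoff ability. A minimal sketch with the two-parameter logistic model (function names and the example pool are ours):

```python
import math

def info_2pl(theta: float, a: float, b: float, D: float = 1.7) -> float:
    """Fisher information of a two-parameter logistic item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * p * (1.0 - p)

def select_items(pool, cutoff, n):
    """Return the n items (a, b) in the pool most informative at the cutoff."""
    return sorted(pool, key=lambda item: info_2pl(cutoff, *item), reverse=True)[:n]

# Hypothetical pool: rising discriminations, difficulties spread over [-2, 2].
pool = [(0.5 + 0.05 * i, -2.0 + 0.1 * i) for i in range(40)]
best = select_items(pool, cutoff=0.0, n=10)
```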

Journal ArticleDOI
TL;DR: This article derived asymptotic formulas for the bias in the maximum likelihood estimators of the item parameters in the logistic item response model when examinee abilities are known, and provided numerical results for a typical verbal test for college admission.
Abstract: Asymptotic formulas are derived for the bias in the maximum likelihood estimators of the item parameters in the logistic item response model when examinee abilities are known. Numerical results are given for a typical verbal test for college admission.