
Showing papers on "Item response theory" published in 1977


Journal ArticleDOI
TL;DR: When a person tries to answer a test item, many forces might influence the outcome, too many to be named in a workable theory of the person's response.
Abstract: Science conquers experience by finding the most succinct explanations to which experience can be forced to yield. Progress marches on the invention of simple ways to handle complicated situations. When a person tries to answer a test item the situation is potentially complicated. Many forces might influence the outcome, too many to be named in a workable theory of the person's response. To arrive at a workable position, we must invent a simple conception of what we are willing to suppose happens, do our best to write items and test persons so that their interaction is governed by this conception, and then impose its statistical consequences upon the data to see if the invention can be made useful.

583 citations


Journal ArticleDOI
TL;DR: Two related probabilistic models that can be used for making classification decisions with respect to mastery of specific concepts or skills are presented, along with procedures for assessing the adequacy of the models, identifying optimal decision rules for mastery classification, and identifying the minimum numbers of items needed to hold misclassification to acceptable levels.
Abstract: Descriptions are presented of two related probabilistic models that can be used for making classification decisions with respect to mastery of specific concepts or skills. Included are the development of procedures for: (a) assessing the adequacy of “fit” provided by the models; (b) identifying optimal decision rules for mastery classification; and (c) identifying minimally sufficient numbers of items necessary to obtain acceptable levels of misclassification.

254 citations
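
The abstract does not give the models' functional form. As a hedged illustration only, the sketch below implements a generic two-state latent class model (masters answer correctly except for a slip rate s; nonmasters succeed only at a guess rate g) and searches for the shortest test that keeps misclassification acceptable. The base rate pi, the rates s and g, and the 5% tolerance are hypothetical, not taken from the paper.

    # Generic two-state latent class sketch for mastery classification.
    # All parameter values are illustrative.
    from math import comb

    def posterior_master(x, n, pi=0.5, s=0.1, g=0.2):
        """Posterior probability of mastery given x correct out of n items,
        assuming locally independent items with constant slip/guess rates."""
        p_m = comb(n, x) * (1 - s) ** x * s ** (n - x)    # P(x | master)
        p_n = comb(n, x) * g ** x * (1 - g) ** (n - x)    # P(x | nonmaster)
        return pi * p_m / (pi * p_m + (1 - pi) * p_n)

    def misclassification_rate(n, cut, pi=0.5, s=0.1, g=0.2):
        """P(wrong decision) for the rule 'classify as master iff x >= cut'."""
        p_fn = sum(comb(n, x) * (1 - s) ** x * s ** (n - x) for x in range(cut))
        p_fp = sum(comb(n, x) * g ** x * (1 - g) ** (n - x) for x in range(cut, n + 1))
        return pi * p_fn + (1 - pi) * p_fp

    # Smallest test length whose best cutting score keeps error below 5%:
    for n in range(1, 31):
        best = min(misclassification_rate(n, c) for c in range(n + 1))
        if best <= 0.05:
            print(n, round(best, 4), round(posterior_master(n - 2, n), 3))
            break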



Journal ArticleDOI
TL;DR: A theory of latent traits as discussed by the authors supposes that in testing situations, examinee performance on a test can be predicted (or explained) by defining characteristics of examinees, referred to as traits, estimating scores for examinees on these traits, and using the scores to predict or explain test performance.
Abstract: A theory of latent traits supposes that in testing situations, examinee performance on a test can be predicted (or explained) by defining characteristics of examinees, referred to as traits, estimating scores for examinees on these traits, and using the scores to predict or explain test performance (Lord and Novick, 1968). Since the traits are not directly measurable and therefore "unobservable," they are often referred to as latent traits or abilities. A latent trait model specifies a relationship between observable examinee test performance and the unobservable traits or abilities assumed to underlie performance on the test. The relationship between the "observable" and the "unobservable" quantities is described by a mathematical function. For this reason, latent trait models are mathematical models. Also, latent trait models are based on assumptions about the test data. When selecting a particular latent trait model to apply to one's test data, it is necessary to consider whether the test data satisfy the assumptions of the model. If they do not, different test models should be considered. Alternatively, some psychometricians (for example, Wright, 1968) have recommended that test developers design their tests so as to satisfy the assumptions of the particular latent trait model they are interested in using. Recent work by Lord (1968, 1974a), Lord and Novick (1968), Wright (1968), Wright and Panchapakesan (1969), Samejima (1969, 1972), Bock and Wood (1971), and Whitely and Dawis (1974) has been helpful in introducing educational measurement specialists to the topic of latent trait models. Also, the work of these and other individuals has contributed substantially to the current interest among test practitioners in applying the models to a wide variety of educational and psychological testing problems. Latent trait models are now being used to "explain" examinee test performance as well as to provide a framework for solving test design problems and other important testing questions that have, to date, gone unresolved (Lord, 1977; Wright, 1977a, 1977b). Why has the use of latent trait models in practical testing situations been low? There are at least five reasons. For one, the topic of latent trait theory represents a complex branch of the field of test theory. The advanced mathematical skills required to study many of the papers published on the topic have probably discouraged many potential users.

184 citations
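
To make "mathematical function" concrete, one widely used latent trait model (a representative example, not necessarily the model of any particular paper above) is the two-parameter logistic, in which the probability of a correct response to item i depends on ability θ through a discrimination parameter a_i and a difficulty parameter b_i:

    P_i(\theta) = \frac{1}{1 + \exp[-D\,a_i(\theta - b_i)]},

with D ≈ 1.7 a scaling constant chosen so that the logistic curve closely approximates the normal ogive.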


Journal ArticleDOI
TL;DR: Much of classical test theory deals with an entire test and is applicable even if the test is not composed of items; when a test does consist of separate items and the test score is a (possibly weighted) sum of item scores, statistics describing the test scores of a group of examinees can be expressed algebraically in terms of statistics describing the individual item scores for the same group.
Abstract: Much of classical test theory deals with an entire test; this theory is applicable even if the test is not composed of items. If a test consists of separate items and if test score is a (possibly weighted) sum of item scores, then statistics describing the test scores of a certain group of examinees can be expressed algebraically in terms of statistics describing the individual item scores for the same group of examinees. Insofar as it relates to tests, classical item theory (this is only a part of classical test theory) consists of such algebraic tautologies. Such a theory makes no assumptions about matters that are beyond the control of the psychometrician. This is actuarial science. It cannot predict how individuals will respond to items unless the items have previously been administered to similar individuals.

165 citations
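
An example of the algebraic tautologies meant here, assuming only that the test score X is the sum of item scores x_i: for any group of examinees, the test score variance decomposes exactly into item variances and pairwise item covariances,

    \sigma_X^2 = \sum_i \sigma_i^2 + \sum_{i \neq j} \sigma_{ij}.

The identity involves no behavioral assumptions, which is precisely why, as the abstract says, it describes the group at hand but cannot by itself predict the responses of new individuals.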


Journal ArticleDOI
TL;DR: In this article, a test battery of binary items is assumed to follow a Rasch model and the latent individual parameters are distributed within a given population in accordance with a normal distribution, and methods are then considered for estimating the mean and variance of this latent population distribution.
Abstract: Under consideration is a test battery of binary items. The responses of n individuals are assumed to follow a Rasch model. It is further assumed that the latent individual parameters are distributed within a given population in accordance with a normal distribution. Methods are then considered for estimating the mean and variance of this latent population distribution. Also considered are methods for checking whether a normal population distribution fits the data. The developed methods are applied to data from an achievement test and from an attitude test.

109 citations
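
In outline (standard Rasch notation, not the paper's own symbols): with person parameter θ and item difficulty b_i, the model and the marginal probability of a response pattern x = (x_1, ..., x_k) under a normal population N(μ, σ²) are

    P(x_i = 1 \mid \theta) = \frac{\exp(\theta - b_i)}{1 + \exp(\theta - b_i)},
    \qquad
    P(x \mid \mu, \sigma) = \int \prod_{i=1}^{k} P_i(\theta)^{x_i}\,[1 - P_i(\theta)]^{1 - x_i}\,
    \frac{1}{\sigma}\,\varphi\!\left(\frac{\theta - \mu}{\sigma}\right) d\theta.

Maximizing the product of these marginal probabilities over persons yields estimates of μ and σ, and the fitted normal can then be checked against the observed score distribution.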


Journal ArticleDOI
TL;DR: In this article, a correction factor that makes the bias of Wright's "unconditional" calibration procedure negligible is identified, a simple approximation that produces comparable estimates in a few seconds is developed, and an editing algorithm for preparing item response data for calibration is appended.
Abstract: Wright's (1969) widely used "unconditional" procedure for Rasch sample-free item calibration is biased. A correction factor which makes the bias negligible is identified and demonstrated. Since this procedure, in spite of its superiority over "conditional" procedures, is nevertheless slow at calibrating 60 or more items, a simple approximation which produces comparable estimates in a few seconds is developed. Since no procedure works on data containing persons or items with infinite parameter estimates, an editing algorithm for preparing item response data for calibration is appended.

105 citations
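
For reference, the correction usually cited from this line of work (Wright and Douglas, 1977) is the simple multiplicative factor (k − 1)/k for a test of k items; a sketch as commonly reported, not a quotation from the abstract:

    \hat{b}_i^{\,\text{corrected}} = \frac{k-1}{k}\,\hat{b}_i^{\,\text{UCON}},

applied to the centered "unconditional" (UCON) difficulty estimates, which are biased outward by roughly the reciprocal factor k/(k − 1).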


Journal ArticleDOI
TL;DR: It is emphasized that the standard error of estimation should be considered as the major index of dependability, as opposed to the reliability of a test.
Abstract: Several important and useful results in latent trait theory, with direct implications for individualized adaptive or tailored testing, are pointed out. A way of using the information function in tailored testing in connection with the standard error of estimation of the ability level under maximum likelihood estimation is suggested. It is emphasized that the standard error of estimation should be considered the major index of dependability, as opposed to the reliability of a test. The concept of weak parallel forms is expanded to test...

85 citations
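
The link between the information function and the standard error that the abstract describes can be written compactly. For maximum likelihood estimation of ability, the asymptotic standard error is the reciprocal square root of the test information, which is the sum of item informations; under a two-parameter logistic model, for example,

    I(\theta) = \sum_i I_i(\theta), \qquad
    SE(\hat{\theta}) \approx \frac{1}{\sqrt{I(\hat{\theta})}}, \qquad
    I_i(\theta) = D^2 a_i^2\, P_i(\theta)\,[1 - P_i(\theta)].

A tailored test can therefore stop as soon as I(θ̂) exceeds the value implied by a target standard error.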


Journal ArticleDOI
TL;DR: In this paper, the authors compared two approaches given in the literature to determine the optimal number of choices per item for a set of N items, assuming that the total number of alternatives is fixed.
Abstract: Typical multiple choice tests have four or five alternative choices per item. What is the optimal number? Here two approaches given in the literature are compared with two new approaches. From some points of view, the contrasts between the different approaches are even more interesting and instructive than the actual answers given to the question asked. Each approach makes the assumption that the total number of alternatives is fixed. This will make sense if total testing time for a set of N items is proportional to the number A of choices per item. It seems likely that many or most item types do not satisfy this condition, but doubtless some item types will be found for which the condition can be shown to hold approximately. Here, the condition is treated as given, along with the problem. The real relation of N to A for fixed testing time should be determined experimentally for any given item type. When testing time is not proportional to A, the theoretical approaches given here may be modified in obvious ways to determine the optimal number of choices.

79 citations
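
One classical argument of this kind (Tversky, 1964), reproduced here as a worked example rather than as one of the paper's new approaches: fix the total number of alternatives T = N·A and choose A to maximize the number of distinguishable response patterns A^N. Taking logarithms,

    \ln A^{N} = \frac{T}{A}\,\ln A,
    \qquad
    \frac{d}{dA}\!\left(\frac{\ln A}{A}\right) = \frac{1 - \ln A}{A^{2}} = 0
    \;\Longrightarrow\; A = e \approx 2.72,

so among integer values, three choices per item are optimal under this criterion.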


Journal ArticleDOI
TL;DR: In this article, a linear loss function is used for computing a cutting score that minimizes the risk for the decision rule, which is demonstrated with a criterion-referenced achievement test of elementary statistics administered to 167 students.
Abstract: The situation is considered in which a total score on a test is used for classifying examinees into two categories: "accepted" (with scores above a cutting score on the test) and "not accepted" (with scores below the cutting score). A value on the latent variable is fixed in advance; examinees above this value are "suitable" and those below are "not suitable." Using a linear loss function, a procedure is described for computing a cutting score that minimizes the risk for the decision rule. The procedure is demonstrated with a criterion-referenced achievement test of elementary statistics administered to 167 students.

69 citations
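
A minimal empirical sketch of the procedure's logic, with a simulated latent variable standing in for the unknown truth; the loss slopes a and b, the threshold theta0, and all distributions are illustrative assumptions, not values from the paper.

    # Hedged sketch: choosing a cutting score under linear loss.
    import numpy as np

    rng = np.random.default_rng(0)
    theta = rng.normal(0.0, 1.0, 5000)           # latent suitability
    score = theta + rng.normal(0.0, 0.5, 5000)   # observed test score
    theta0 = 0.5                                  # suitability threshold, fixed in advance

    def risk(cut, a=1.0, b=1.0):
        """Mean linear loss: a*(theta0 - theta) for false accepts,
        b*(theta - theta0) for false rejects."""
        accept = score >= cut
        fa = accept & (theta < theta0)            # accepted but not suitable
        fr = ~accept & (theta >= theta0)          # rejected but suitable
        return (a * (theta0 - theta[fa]).sum()
                + b * (theta[fr] - theta0).sum()) / len(theta)

    cuts = np.linspace(-2, 3, 101)
    best = min(cuts, key=risk)
    print("cutting score minimizing empirical risk:", round(float(best), 2))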


Journal ArticleDOI
TL;DR: In this paper, a simplified alternative to the "conditional" procedure for Rasch item calibration, practical for 20 or 30 items and producing equivalent estimates, is developed, and a correction factor that makes the bias of the "unconditional" procedure negligible is identified and demonstrated.
Abstract: Two procedures for Rasch, sample-free, item calibration are reviewed and compared for accuracy. Andersen's (1972) theoretically ideal "conditional" procedure is impractical for calibrating more than 10 or 15 items. A simplified alternative procedure for conditional estimation practical for 20 or 30 items which produces equivalent estimates is developed. When more than 30 items are analyzed recourse to Wright's (1969) widely used "unconditional" procedure is inevitable but that procedure is biased. A correction factor which makes the bias negligible is identified and demonstrated.
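
The computational burden mentioned here comes from the conditional likelihood itself; conditioning on a person's raw score r removes the person parameter exactly, which is what makes the procedure sample-free (a sketch in standard Rasch notation, not the paper's own symbols):

    P(x_1,\dots,x_k \mid r)
    = \frac{\exp\!\left(-\sum_i x_i b_i\right)}{\gamma_r(\varepsilon_1,\dots,\varepsilon_k)},
    \qquad \varepsilon_i = e^{-b_i},

where γ_r is the elementary symmetric function of order r in the ε_i. Computing these symmetric functions for long tests is what makes fully conditional estimation slow and numerically delicate, and what the simplified procedure approximates.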

Journal ArticleDOI
TL;DR: In this article, a weakly parallel test is proposed, in contrast to strongly parallel tests in latent trait theory, and some criticisms of the fundamental concepts in classical test theory such as the reliability of a test and the standard error of estimation are given.
Abstract: A new concept of weakly parallel tests, in contrast to strongly parallel tests in latent trait theory, is proposed. Some criticisms of the fundamental concepts in classical test theory, such as the reliability of a test and the standard error of estimation, are given.
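
As this line of work defines the notion (a paraphrase rather than a quotation), two tests are weakly parallel when they measure the same latent trait and have identical test information functions,

    I^{(1)}(\theta) = I^{(2)}(\theta) \quad \text{for all } \theta,

whereas strong parallelism requires equivalence item by item. The standard error of ability estimation then depends only on I(θ), which is why the information function, rather than a single reliability coefficient, carries the dependability argument.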

Journal ArticleDOI
TL;DR: The present research simulated the responses of 75 subjects responding to 30 items under the Birnbaum and Rasch models and attempted a fit to the data using the Rasch model, finding the poorest overall fit appeared within the uniform distribution.
Abstract: Among the varieties of logistic models, those attributed to Birnbaum (involving the parameters of item discrimination, item difficulty, and person ability) and Rasch (involving only item difficulty and person ability) have received attention. The present research simulated the responses of 75 subjects responding to 30 items under the Birnbaum model and then attempted a fit to the data using the Rasch model. When item discrimination variances ranged from .05 to .25 within distributions of different form (uniform, normal, and positively skewed), the poorest overall fit appeared within the uniform distribution. For each distribution there was only a slight increase in the lack of fit as the variances increased.
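
A sketch of the simulation design under stated assumptions (the normal-discriminations case with variance .25; the study's actual fit statistic is not reproduced, so a crude item-total diagnostic stands in for it):

    # Generate 2PL ("Birnbaum") responses with varying discriminations,
    # then gauge departure from the Rasch assumption of equal discrimination.
    import numpy as np

    rng = np.random.default_rng(1)
    n_persons, n_items = 75, 30
    theta = rng.normal(0, 1, n_persons)                   # person abilities
    b = rng.normal(0, 1, n_items)                         # item difficulties
    a = rng.normal(1.0, np.sqrt(0.25), n_items).clip(0.2) # discriminations, var .25

    p = 1 / (1 + np.exp(-a * (theta[:, None] - b[None, :])))
    x = (rng.random((n_persons, n_items)) < p).astype(int)

    # Under the Rasch model all items should show similar item-total
    # discrimination; spread in these correlations signals misfit.
    total = x.sum(axis=1)
    disc = [np.corrcoef(x[:, i], total - x[:, i])[0, 1] for i in range(n_items)]
    print("item-total r spread:", round(float(np.std(disc)), 3))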


Journal ArticleDOI
TL;DR: In this paper, an application of Samejima's latent trait model for continuous responses is reported, and a brief review of latent trait theory is presented, including an elaboration of the theory for test responses.
Abstract: This paper reports an application of Samejima's latent trait model for continuous responses. A brief review of latent trait theory is presented, including an elaboration of the theory for test responses...

Journal ArticleDOI
TL;DR: Item analysis has long been an important component of educational measurement; two basic theoretical models, the classical psychometric and the item characteristic curve, have been developed, and a wide variety of item analysis procedures based on them have been created.
Abstract: Ever since Binet and Simon (1916) plotted the proportion of correct response to an item as a function of age, item analysis has been an important component in the field of educational measurement. Two basic theoretical models, the classical psychometric and the item characteristic curve, have been developed, and a wide variety of item analysis procedures have been created. At the present time, it is a rare measurement textbook that does not devote some space to item analysis. Following the example set in these textbooks, most practitioners of item analysis use the classical psychometric model in which item difficulty and the item-criterion correlation are used to describe an item. In recent years, however, the primary advances in the theory of item analysis have been under the item characteristic curve model. This model has a strong statistical orientation and has been subsumed under what is now called "Latent Trait Theory" (see Lord & Novick, 1968). While latent trait theory has the estimation of ability as a primary goal, the item characteristic curve model is at the core of the theory. Since about 1960, there have been significant developments and refinements in item analysis: new item characteristic curve models have been employed, more sophisticated estimation procedures have been used, and a richer conceptualization of the role of item statistics in test construction has evolved. While some lag between developments in theories and their use in practice is to be expected, this lag in the field of item analysis appears to be unusually large. In their survey of test theory, Bock and Wood (1971) found that few measurement books included the item characteristic curve model, and the one extended...
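
The two classical indices the passage refers to are easy to state exactly; the toy data below are illustrative:

    # Classical item analysis: item difficulty (proportion correct) and the
    # item-criterion correlation (here, a corrected item-total correlation).
    import numpy as np

    x = np.array([[1, 1, 0, 1],
                  [1, 0, 0, 1],
                  [0, 1, 1, 1],
                  [1, 1, 1, 0],
                  [0, 0, 0, 1]])       # persons x items (toy data)

    difficulty = x.mean(axis=0)         # p-values per item
    total = x.sum(axis=1)
    item_criterion = np.array([
        np.corrcoef(x[:, i], total - x[:, i])[0, 1]
        for i in range(x.shape[1])
    ])
    print(difficulty, item_criterion.round(2))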

01 Apr 1977
TL;DR: In this paper, an approach to biased item identification using item characteristic curve (ICC) theory is described and applied; the approach is applicable to items of sufficiently varying degrees of difficulty.
Abstract: Because it is a true score model employing item parameters which are independent of the examined sample, item characteristic curve (ICC) theory offers several advantages over classical measurement theory. In this paper an approach to biased item identification using ICC theory is described and applied. The ICC theory approach is attractive in that it: (1) appears to be sensitive largely to cultural variations in the trait gauged by test items; (2) does not assume total scores to be valid indicators of true ability; (3) places the identified degree of item bias on a quantified metric; and (4) is applicable to items of sufficiently varying degrees of difficulty. While sensitive to some factors other than item bias, namely local independence, item inappropriateness, and poor parameter estimates, the approach may prove useful to the measurement field. (Author/RC)
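
One common ICC-based index of the kind described, sketched under hypothetical parameter values for the same item in two cultural groups (this is a generic area-between-curves measure, not necessarily the report's exact quantified metric):

    import numpy as np

    def icc(theta, a, b, c=0.0):
        """Three-parameter logistic approximation to the normal ogive ICC."""
        return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

    theta = np.linspace(-3, 3, 601)
    # Hypothetical estimates for the same item in two groups:
    p1 = icc(theta, a=1.2, b=0.0)
    p2 = icc(theta, a=1.2, b=0.6)

    step = theta[1] - theta[0]
    bias_area = float(np.abs(p1 - p2).sum() * step)   # area between the curves
    print("area between ICCs:", round(bias_area, 3))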


Journal ArticleDOI
TL;DR: In this article, four Monte Carlo simulation studies of Owen's Bayesian sequential procedure for adaptive mental testing were conducted, where the authors explored a number of additional properties, both in a normally distributed population and in a distribution-free context.
Abstract: Four Monte Carlo simulation studies of Owen's Bayesian sequential procedure for adaptive mental testing were conducted. In contrast to previous simulation studies of this procedure, which have concentrated on evaluating it in terms of the correlation of its test scores with simulated ability in a normal population, these four studies explored a number of additional properties, both in a normally distributed population and in a distribution-free context. Study 1 replicated previous studies with finite item pools, but examined such properties as the bias of estimate, mean absolute error, and correlation of test length with ability. Studies 2 and 3 examined the same variables in a number of hypothetical infinite item pools, investigating the effects of item discriminating power, guessing, and variable vs. fixed test length. Study 4 investigated some properties of the Bayesian test scores as latent trait estimators. The properties of interest included the conditional bias of the ability estimates, the info...
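
Owen's procedure updates a normal prior in closed form after each response; the grid-based analogue below conveys the same select-observe-update loop without reproducing his exact formulas. Pool size, test length, and the Rasch response model are illustrative assumptions.

    # Simplified sketch of Bayesian sequential (adaptive) testing.
    import numpy as np

    rng = np.random.default_rng(2)
    grid = np.linspace(-4, 4, 161)
    post = np.exp(-grid**2 / 2); post /= post.sum()   # N(0,1) prior on ability

    true_theta = 1.0
    pool_b = rng.normal(0, 1, 200)        # Rasch item pool (a = 1, no guessing)

    def p_correct(theta, b):
        return 1 / (1 + np.exp(-(theta - b)))

    for step in range(15):
        est = float((grid * post).sum())              # current posterior mean
        i = int(np.argmin(np.abs(pool_b - est)))      # pick item nearest estimate
        b = pool_b[i]; pool_b = np.delete(pool_b, i)
        x = rng.random() < p_correct(true_theta, b)   # simulate the response
        like = p_correct(grid, b) if x else 1 - p_correct(grid, b)
        post *= like; post /= post.sum()              # Bayes update on the grid

    est = float((grid * post).sum())
    sd = float(np.sqrt(((grid - est) ** 2 * post).sum()))
    print("ability estimate:", round(est, 2), "posterior sd:", round(sd, 2))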

Journal ArticleDOI
TL;DR: In this paper, a method of estimating item characteristic functions is proposed, in which a set of test items, whose operating characteristics are known and which give a constant test information function for a substantially wide range of ability, are used.
Abstract: A method of estimating item characteristic functions is proposed, in which a set of test items, whose operating characteristics are known and which give a constant test information function for a substantially wide range of ability, are used. The method is based on the maximum likelihood estimates of ability for a group of several hundred examinees. Throughout the present study the Monte Carlo method is used.
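
A sketch of the underlying idea, under assumed values: once maximum likelihood ability estimates are available from a calibrated anchor test, a new item's characteristic function can be traced as the proportion correct within ability-estimate bins.

    # Trace an empirical ICC for a new item from anchor-test ability estimates.
    import numpy as np

    rng = np.random.default_rng(3)
    theta_hat = rng.normal(0, 1, 800)     # ML ability estimates (anchor test)
    p_true = 1 / (1 + np.exp(-1.5 * (theta_hat - 0.3)))   # hidden "true" ICC
    x_new = rng.random(800) < p_true       # responses to the new item

    edges = np.linspace(-2.5, 2.5, 11)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (theta_hat >= lo) & (theta_hat < hi)
        if mask.sum() >= 20:               # skip sparsely populated bins
            print(f"[{lo:+.1f},{hi:+.1f}) p-hat = {x_new[mask].mean():.2f}")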

Journal ArticleDOI
TL;DR: Responding to Wright (1977), the authors argue that using least squares estimators as a first step in parameter estimation is in practice neither awkward nor unnecessary; that correct interpretations and expressions for least squares standard errors were given in the Whitely and Dawis (1974) article, except for a minor typographical error; and that some advantages of the Rasch model may be nullified in the process of meeting traditional goals in testing.
Abstract: Wright (1977) shows that a debate is developing between those who strongly advocate use of the Rasch model and those who have certain reservations about the extent to which the model meets some traditional concerns in trait measurement. In an earlier article, Whitely and Dawis (1974) presented the Rasch model in the context of least squares estimation, and noted some features that may limit the utility of the model in test development. Wright (1977) questions several of the specific interpretations and conclusions that were given in the earlier article. The current article is a response to those questions. Although two areas of disagreement between Wright's (1977) and Whitely and Dawis' (1974) articles could be termed "pseudo-issues" (equivalent forms and technological sophistication), several areas represent real issues. The current article shows that 1) in practice, using least squares estimators as a first step in parameter estimation is neither awkward nor unnecessary; 2) correct interpretations and expressions for least squares standard errors were given in the Whitely and Dawis (1974) article, except for a minor typographical error; 3) large sample sizes are required for successful application of the model; and 4) some advantages of the model may be nullified in the process of meeting traditional goals in testing, namely validity and score interpretability.

Journal ArticleDOI
TL;DR: In this paper, some predicted relationships among certain item characteristics, response processes, instability of response, and nearness of subject and item on a trait continuum are used to compare the traditional and proposed approaches.
Abstract: In their investigation of the relationship between item properties and test quality, itemmetricians rely too heavily upon inference about the nature of the subject-item interaction. Itemmetric research can be improved by employing data, specifying variables, and selecting an order of data analysis which are more directly related to this interaction. Some predicted relationships among certain item characteristics, response processes, instability of response, and nearness of subject and item on a trait continuum are used to compare the traditional and proposed approaches. Stronger relationships and clearer interpretations arise from the proposed approach.

Journal ArticleDOI
TL;DR: The authors showed that the convenient and inexpensive procedure suggested by Urry (1974a) for approximating test models (i.e., the normal ogive and logistic models) tends to systematically underestimate ai (item discriminating power) and overestimate |bi| (item difficulty).
Abstract: This note demonstrates that the convenient and inexpensive procedure suggested by Urry (1974a) for approximating test models (i.e., the normal ogive and logistic models) tends to systematically underestimate ai (item discriminating power) and overestimate |bi| (item difficulty). A simple correction for error in estimated ability (θ̂) is presented which serves to eliminate these biases. Implications for item screening and for item parameterization via maximum likelihood methods and via Urry's more recently developed estimation procedure (1974b) are discussed.
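
For context, the approximation in question converts classical indices to normal-ogive parameters. Under θ ~ N(0,1), with item-ability biserial correlation ρ_i and proportion correct p_i, the standard conversions are

    a_i = \frac{\rho_i}{\sqrt{1 - \rho_i^2}}, \qquad
    b_i = \frac{\Phi^{-1}(1 - p_i)}{\rho_i}.

Because the biserial must in practice be computed against an error-laden stand-in for θ (total score or θ̂), it understates ρ_i, which deflates a_i and inflates |b_i|; correcting for the error variance in the ability estimate removes both biases. This is a reconstruction of the standard formulas, not a quotation from the note.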

01 Sep 1977
TL;DR: In this paper, the applicability of item characteristic curve (ICC) theory to a multiple-choice test item pool used to measure achievement is described and the rationale for attempting to use ICC theory in an achievement framework is summarized.
Abstract: The applicability of item characteristic curve (ICC) theory to a multiple-choice test item pool used to measure achievement is described. The rationale for attempting to use ICC theory in an achievement framework is summarized, and the adequacy for adaptive testing of a classroom achievement test item pool in a college biology class is studied. Using criteria usually applied to ability measurement item pools, the item difficulties and discriminations in this achievement test pool were found to be similar to those used in adaptive testing pools for ability testing. Studies of the dimensionality of the pool indicate that it is primarily unidimensional. Analysis of the item parameters of items administered to two different samples reveals the possibility of a deviation from invariance in the discrimination parameter but a high degree of invariance for the difficulty parameter. The pool as a whole, as well as two subpools, is judged to be adequate for use in adaptive testing. It is also concluded that the ICC model is not inappropriate for application to typical college classroom achievement tests similar to the one studied. (Author)

Journal ArticleDOI
TL;DR: Results indicated that tailored testing is promising especially when the number of items is not too small, and that a graded item can effectively be used as the initial item in tailored testing because of its branching effect.
Abstract: Applying the normal ogive model of latent trait theory, two sets of data, simulated and empirical, were analyzed. The objective was to determine how much accuracy of estimation of the subjects' latent ability can be maintained by tailoring for each testee the order of presentation of the items and the border of dichotomization for each item. This was compared to the information provided by the original graded test items. Results indicated that tailored testing is promising, especially when the number of items is not too small, and that a graded item can effectively be used as the initial item in tailored testing because of its branching effect.

Journal ArticleDOI
TL;DR: In this paper, the validity and utility of the stratified adaptive computerized testing model (stradaptive) developed by Weiss (1973) were investigated empirically and the results showed significantly higher reliability for the stradaptive group, and equivalent validity indices for the conventional test groups.
Abstract: This study empirically investigated the validity and utility of the stratified adaptive computerized testing model (stradaptive) developed by Weiss (1973). The model presents a tailored testing strategy based upon Binet IQ measurement theory and Lord's (1972) modern test theory. Nationally normed School and College Ability Test Verbal analogy items (SCAT-V) were used to construct an item pool. Item difficulty and discrimination indices were re-scaled to normal ogive parameters on 244 items. One hundred and two freshmen volunteers at Florida State University were randomly assigned to stradaptive or conventional test groups. Both groups were tested via cathode-ray-tube (CRT) terminals coupled to a Control Data Corporation 6500 computer. The conventional subjects took a SCAT-V test essentially as published, while the stradaptive group took individually tailored tests using the same item pool. Results showed significantly higher reliability for the stradaptive group, and equivalent validity indices between the two groups.


Journal ArticleDOI
TL;DR: In this article, students rated the quality of the items on a classroom test that had been taken previously, and psychometric item indices were calculated on the same test, finding that the student ratings were related to the item difficulty, but not to item-test correlation.
Abstract: Students rated the quality of the items on a classroom test that had been taken previously. On the same test, psychometric item indices were calculated. The results showed that the student ratings were related to the item difficulty, but not to the item-test correlation. In addition, the better-achieving students tended to rate the items as less ambiguous. Finally, the ambiguity ratings were more highly related to the item-test correlations for the better-achieving students. These findings support opinions held by many instructors about students' judgments of item quality.