
Showing papers on "Item response theory published in 1989"


Book
07 Dec 1989
TL;DR: In this book, the authors cover the basic concepts of scale development: devising the items, scaling responses, selecting the items, moving from items to scales, and establishing reliability and validity.
Abstract: 1. Introduction 2. Basic concepts 3. Devising the items 4. Scaling responses 5. Selecting the items 6. Biases in responding 7. From items to scales 8. Reliability 9. Generalizability theory 10. Validity 11. Measuring change 12. Item response theory 13. Methods of administration 14. Ethical considerations 15. Reporting test results Appendices

9,316 citations


01 Jan 1989

3,037 citations


Journal ArticleDOI
TL;DR: Weighted likelihood estimation (WLE), as discussed by the authors, removes the first-order bias term from MLE and is proved to be less biased than MLE while having the same asymptotic variance and normal distribution.
Abstract: Applications of item response theory, which depend upon its parameter invariance property, require that parameter estimates be unbiased. A new method, weighted likelihood estimation (WLE), is derived and proved to be less biased than maximum likelihood estimation (MLE) with the same asymptotic variance and normal distribution. WLE removes the first-order bias term from MLE. Two Monte Carlo studies compare WLE with MLE and Bayesian modal estimation (BME) of ability in conventional tests and tailored tests, assuming the item parameters are known constants. The Monte Carlo studies favor WLE over MLE and BME on several criteria over a wide range of the ability scale.

965 citations
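For readers who want to see the mechanics of the WLE paper above, the correction can be sketched for the simplest case. Under the Rasch model with known item difficulties, the MLE solves sum(x_i - P_i(theta)) = 0, and the WLE adds the first-order bias-correction term J(theta)/(2 I(theta)). A minimal Python sketch under those assumptions; the function name and the five-item example are invented, not taken from the paper.

import numpy as np
from scipy.optimize import brentq

def wle_theta(x, b, lo=-6.0, hi=6.0):
    # x: 0/1 response vector; b: known Rasch item difficulties.
    x, b = np.asarray(x, float), np.asarray(b, float)
    def estimating_eq(theta):
        p = 1.0 / (1.0 + np.exp(-(theta - b)))       # Rasch P_i(theta)
        info = np.sum(p * (1.0 - p))                 # test information I(theta)
        j = np.sum(p * (1.0 - p) * (1.0 - 2.0 * p))  # J(theta), Rasch case
        # MLE solves sum(x - p) = 0; WLE adds the bias-correction term.
        return np.sum(x - p) + j / (2.0 * info)
    return brentq(estimating_eq, lo, hi)

# Hypothetical example: five items, three correct responses.
print(wle_theta([1, 1, 0, 1, 0], [-1.0, -0.5, 0.0, 0.5, 1.0]))

Unlike the MLE, the weighted estimate stays finite even for all-correct or all-incorrect response patterns, which is one practical reason it is attractive in tailored testing.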


Journal ArticleDOI
TL;DR: In this article, the authors describe the construction of a Job in General (JIG) scale, a global scale to accompany the facet scales of the Job Descriptive Index.
Abstract: We describe the construction of a Job in General (JIG) scale, a global scale to accompany the facet scales of the Job Descriptive Index. We applied both traditional and item response theory procedures for item analysis to data from three large heterogeneous samples (N = 1,149, 3,566, and 4,490). Alpha was .91 and above for the resulting 18-item scale in successive samples. Convergent and discriminant validity and differential response to treatments were demonstrated. Global scales are contrasted with composite and with facet scales in psychological measurement. We show that global scales are not equivalent to summated facet scales. Both facet and global scales were useful in another organization (N = 648). Some principles are suggested for choosing specific (facet), composite, or global measures for practical and theoretical problems. The correlations between global and facet scales suggest that work may be the most important facet in relation to general job satisfaction.

740 citations


Journal ArticleDOI
TL;DR: In this paper, the authors discuss the definition, detection, and explanation of item bias; four research strategies are described: qualitative, correlational, quasi-experimental, and experimental.

474 citations


Book
17 Nov 1989
TL;DR: In this book, the authors present a step-by-step guide to constructing a psychometric questionnaire, progressing through all the stages of test construction from definition of the original purpose to eventual validation; both knowledge-based tests of ability, aptitude, and achievement and person-based tests of personality, clinical symptoms, mood, and attitude are covered.
Abstract: John Rust and Susan Golombok provide a readable introduction to modern psychometrics. The first part deals with theoretical and more general issues in psychometrics and acknowledges that if psychometrics is to fulfil its function of fair assessment and selection it must take a stand on issues of racism and injustice. The second part is a step-by-step guide on how to construct a psychometric questionnaire, progressing through all the stages of test construction from definition of the original purpose to eventual validation. Item response theory, criterion-referenced testing, profiling, and minimum competency testing are included, as are knowledge-based tests of ability, aptitude, and achievement and person-based tests of personality, clinical symptoms, mood, and attitude.

450 citations


Journal ArticleDOI
TL;DR: In this article, a new method for using certain restricted latent class models, referred to as binary skills models, to determine the skills required by a set of test items is presented.
Abstract: This paper presents a new method for using certain restricted latent class models, referred to as binary skills models, to determine the skills required by a set of test items. The method is applied to reading achievement data from a nationally representative sample of fourth-grade students and offers useful perspectives on test structure and examinee ability, distinct from those provided by other methods of analysis. Models fitted to small, overlapping sets of items are integrated into a common skill map, and the nature of each skill is then inferred from the characteristics of the items for which it is required. The reading comprehension items examined conform closely to a unidimensional scale with six discrete skill levels that range from an inability to comprehend or match isolated words in a reading passage to the abilities required to integrate passage content with general knowledge and to recognize the main ideas of the most difficult passages on the test.

389 citations



Journal ArticleDOI
TL;DR: It is shown that local independence fails at the level of the individual questions of a test of reading comprehension, and the application to testlet scoring of some multiple-category models originally developed for individual items is discussed.
Abstract: It is not always convenient or appropriate to construct tests in which individual items are fungible. There are situations in which small clusters of items (testlets) are the units that are assembled to create a test. Using data from a test of reading comprehension constructed of four passages with several questions following each passage, we show that local independence fails at the level of the individual questions. The questions following each passage, however, constitute a testlet. We discuss the application to testlet scoring of some multiple-category models originally developed for individual items. In the example examined, the concurrent validity of the testlet scoring equaled or exceeded that of individual-item-level scoring.

195 citations


Book
01 Jan 1989
TL;DR: In this paper, the authors focus on the problem of constructing test items for standardized tests of achievement, ability, and aptitude, which is a task of enormous importance and one fraught with difficulty.
Abstract: Constructing test items for standardized tests of achievement, ability, and aptitude is a task of enormous importance—and one fraught with difficulty. The task is important because test items are the foundation of written tests of mental attributes, and the ideas they express must be articulated precisely and succinctly. Being able to draw valid and reliable inferences from a test’s scores rests in great measure upon attention to the construction of test items. If a test’s scores are to yield valid inferences about an examinee’s mental attributes, its items must reflect a specific psychological construct or domain of content. Without a strong association between a test item and a psychological construct or domain of content, the test item lacks meaning and purpose, like a mere free-floating thought on a page with no rhyme or reason for being there at all.

176 citations


Patent
26 Oct 1989
TL;DR: A computerized mastery testing system providing for the computerized implementation of sequential testing in order to reduce test length without sacrificing mastery classification accuracy is described; test item units are randomly and sequentially presented to the examinee by a computer test administrator.
Abstract: A computerized mastery testing system providing for the computerized implementation of sequential testing in order to reduce test length without sacrificing mastery classification accuracy. The mastery testing system is based on Item Response Theory and Bayesian Decision Theory, which are used to qualify collections of test items, administered as a unit, and determine the decision rules regarding examinees' responses thereto. The test item units are randomly and sequentially presented to the examinee by a computer test administrator. The administrator periodically determines, based on previous responses, whether the examinee may be classified as a nonmaster or master or whether more responses are necessary. If more responses are necessary it will present as many additional test item units as required for classification. The method provides for determining the test specifications, creating an item pool, obtaining IRT statistics for each item, determining ability values, assembling items into testlets, verifying the testlets, selecting loss functions and prior probability of mastery, estimating cutscores, packaging the test for administration, and randomly and sequentially administering testlets to the examinee until a pass/fail decision can be made.
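The sequential decision logic the patent describes can be illustrated in miniature: administer testlets one at a time, update the posterior probability of mastery from IRT likelihoods at two hypothesized ability levels, and stop once the posterior crosses a decision threshold. In the sketch below every number (ability points, prior, thresholds, item difficulties) is an illustrative assumption, and the simple posterior-threshold rule stands in for the patent's loss-function and cutscore machinery.

import numpy as np

def testlet_likelihood(responses, difficulties, theta):
    # Rasch likelihood of a testlet's 0/1 responses at ability theta.
    p = 1.0 / (1.0 + np.exp(-(theta - np.asarray(difficulties, float))))
    r = np.asarray(responses)
    return float(np.prod(np.where(r == 1, p, 1.0 - p)))

def classify(testlets, theta_non=-0.5, theta_mas=0.5,
             prior_mastery=0.5, lower=0.05, upper=0.95):
    post = prior_mastery
    for responses, difficulties in testlets:      # administered one at a time
        lm = testlet_likelihood(responses, difficulties, theta_mas)
        ln = testlet_likelihood(responses, difficulties, theta_non)
        post = post * lm / (post * lm + (1.0 - post) * ln)  # Bayes update
        if post >= upper:
            return "master", post
        if post <= lower:
            return "nonmaster", post
    return "undecided", post                      # testlet pool exhausted

testlets = [([1, 1, 0], [-0.2, 0.1, 0.4]), ([1, 1, 1], [0.0, 0.3, 0.6])]
print(classify(testlets))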

Journal ArticleDOI
TL;DR: A maximin model for IRT-based test design is proposed that serves as a constraint subject to which a linear programming algorithm maximizes the information in the test.
Abstract: A maximin model for IRT-based test design is proposed. In the model only the relative shape of the target test information function is specified. It serves as a constraint subject to which a linear programming algorithm maximizes the information in the test. In the practice of test construction, several demands can be formulated as linear constraints in the model. A worked example of a test construction problem with practical constraints is presented. The paper concludes with a discussion of some alternative models of test construction.
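The maximin formulation translates directly into a mixed-integer program: with 0-1 item indicators x_i, maximize y subject to the test information at each ability point theta_k being at least r_k times y, plus practical constraints such as test length. A sketch using scipy.optimize.milp (SciPy 1.9+); the item information matrix and target shape below are invented toy data, and the paper's own algorithm and constraint set may differ.

import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

rng = np.random.default_rng(0)
n_items, thetas = 30, [-1.0, 0.0, 1.0]
info = rng.uniform(0.05, 0.6, size=(len(thetas), n_items))  # item information at each theta_k
r = np.array([1.0, 2.0, 1.0])   # relative target shape of the information function
length = 10                     # required number of items

# Decision vector [x_1 ... x_n, y]; maximizing y == minimizing -y.
c = np.zeros(n_items + 1)
c[-1] = -1.0
# Shape constraints: info @ x - r_k * y >= 0 at every ability point.
shape = LinearConstraint(np.hstack([info, -r.reshape(-1, 1)]), lb=0.0)
# Length constraint: sum(x) == length.
size = LinearConstraint(np.hstack([np.ones((1, n_items)), [[0.0]]]), lb=length, ub=length)

res = milp(c, constraints=[shape, size],
           integrality=np.r_[np.ones(n_items, dtype=int), 0],  # x binary, y continuous
           bounds=Bounds(lb=0.0, ub=np.r_[np.ones(n_items), np.inf]))
picked = np.flatnonzero(res.x[:n_items] > 0.5)
print("selected items:", picked, "achieved y:", res.x[-1])

Because only the relative shape r_k is fixed, the solver is free to push the whole information function as high as the item pool allows, which is exactly the maximin idea.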

Journal ArticleDOI
TL;DR: In this paper, the 1-, 2-, and 3-parameter logistic item response theory models are discussed, and the effects of changing the a, b, or c parameters are compared.
Abstract: This module discusses the 1-, 2-, and 3-parameter logistic item response theory models. Mathematical formulas are given for each model, and comparisons among the three models are made. Figures are included to illustrate the effects of changing the a, b, or c parameter, and a single data set is used to illustrate the effects of estimating parameter values (as opposed to the true parameter values) and to compare parameter estimates achieved through applying the different models. The estimation procedure itself is discussed briefly. Discussions of model assumptions, such as dimensionality and local independence, can be found in many of the annotated references (e.g., Hambleton, 1988).
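For reference, the three models covered by this module have the following standard forms; this is the conventional textbook parameterization (with D the usual scaling constant), not notation reproduced from the module itself:

$$P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-D a_i (\theta - b_i)}} \qquad \text{(3PL)},$$

where $b_i$ is the difficulty, $a_i$ the discrimination, and $c_i$ the lower asymptote ("guessing") parameter. Setting $c_i = 0$ gives the 2PL, and additionally constraining all $a_i$ to a common value gives the 1PL.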

Journal ArticleDOI
TL;DR: In this article, a method is proposed to evaluate l'equivalence de mesure des traductions des tests d'intelligence americains and allemands dans les deux sens.
Abstract: Utilisation des methodes basees sur la theorie de la reponse par item pour evaluer l'equivalence de mesure des traductions des tests d'intelligence americains et allemands dans les deux sens. Identification des items jouant un role differentiel et analyse de contenu pour en determiner la cause (culturelle ou linguistique)

Journal ArticleDOI
TL;DR: Results from 1,967 teachers in Western Australia who completed the 30-item form of the GHQ show that the items conform reasonably well to the model at a general or macro level of analysis, and the original ordering of categories is supported.
Abstract: This study examines the Likert-style successive integer scoring of Goldberg's (1972, 1978) General Health Questionnaire (GHQ) with a psychometric model in which the thresholds between successive categories within each item can be estimated. The model is particularly appropriate because the scoring by successive integers of the successive categories, which are not named in the same way across items, has received substantial discussion in the literature. Results from 1,967 teachers in Western Australia who completed the 30-item form of the GHQ show that the items conform reasonably well to the model at a general or macro level of analysis. In particular, the original ordering of categories is supported. However, as expected, there are systematic differences between distances among thresholds within items and systematic differences among thresholds between items. The differences between positively and negatively orientated items confirm a suggestion in the literature that these two classes of items form sufficiently different scales, so that they could be treated as separate, though reasonably correlated, scales.
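The threshold structure being tested here is the Rasch-family adjacent-categories form; as a standard statement (not the paper's own notation), the threshold $\tau_{ix}$ between categories $x-1$ and $x$ of item $i$ enters as

$$\ln \frac{P(X_i = x \mid \theta)}{P(X_i = x - 1 \mid \theta)} = \theta - \tau_{ix},$$

so the threshold is the ability at which two adjacent categories are equally probable, and ordered category functioning corresponds to ordered threshold estimates, which is the property the study confirms at the macro level.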

Journal ArticleDOI
TL;DR: This paper looks at 50 years of IRM and finds a disappointing lack of advance; it shows how a linear model framework, involving different response transformations, unifies separate approaches to the study of test item responses.
Abstract: An historical and theoretical review is provided of so-called item response theory (IRT), more accurately described as item response modelling (IRM). This paper looks at 50 years of IRM and finds a disappointing lack of advance. It is shown how a linear model framework, involving different response transformations, unifies separate approaches to the study of test item responses.
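The unifying framework the review argues for can be stated in one line: model a transformed response probability as linear in the latent trait,

$$g\bigl(P_i(\theta)\bigr) = a_i(\theta - b_i),$$

where choosing $g$ as the logit yields the two-parameter logistic model and choosing $g = \Phi^{-1}$ yields the normal-ogive model. This is a plausible reading of the abstract's "linear model framework, involving different response transformations", not a formula quoted from the paper.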

Journal ArticleDOI
TL;DR: In this article, an item response model for multiple-choice items and its application in item analysis is described. Butler et al. used the model for the detection of flawed items, for item design and development, and for test construction.
Abstract: This paper describes an item response model for multiple-choice items and illustrates its application in item analysis. The model provides parametric and graphical summaries of the performance of each alternative associated with a multiple-choice item; the summaries describe each alternative's relationship to the proficiency being measured. The interpretation of the parameters of the multiple-choice model and the use of the model in item analysis are illustrated using data obtained from a pilot test of mathematics achievement items. The use of such item analysis for the detection of flawed items, for item design and development, and for test construction is discussed.
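Models of this type build on the nominal response framework, which gives every alternative $k$ of item $i$ its own trace line; a standard statement of that backbone (the full multiple-choice model adds machinery for a latent "don't know" category) is

$$P(X_i = k \mid \theta) = \frac{\exp(a_{ik}\theta + c_{ik})}{\sum_{h} \exp(a_{ih}\theta + c_{ih})},$$

so each distractor, not just the keyed answer, receives a parametric summary of its relationship to the proficiency being measured.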

Journal ArticleDOI
TL;DR: This article introduces the theory behind and applications of adaptive personality assessment based on item response theory; two adaptive testing strategies are compared: fixed test length and clinical decision.
Abstract: This article introduces the theory behind and applications of adaptive personality assessment based on item response theory. Two adaptive testing strategies were compared: (a) fixed test length and (b) clinical decision. Real-data simulations, based on the item responses from 1,000 subjects who had previously taken the 34-item Absorption scale (Tellegen, 1982) by means of paper-and-pencil format, were used to illustrate these strategies. Results suggest that computerized adaptive personality assessment works impressively well. With the fixed-test-length strategy, a 50% savings in administered items was achieved with little loss of measurement precision. In the clinical-decision testing strategy, individuals who were extreme on the Absorption trait were identified with perfect accuracy using, on average, 25% of the available items. The implications of these results for personality research and assessment are discussed.
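The fixed-test-length strategy follows the generic adaptive-testing loop: score the examinee provisionally, administer the most informative remaining item, and stop at the chosen length. A minimal simulation sketch under a 2PL model; the item parameters, grid-based scoring, and stopping length are illustrative assumptions rather than the authors' procedure.

import numpy as np

rng = np.random.default_rng(1)
a = rng.uniform(0.8, 2.0, 34)      # discriminations (34-item pool, as in the article)
b = rng.normal(0.0, 1.0, 34)       # difficulties
grid = np.linspace(-4.0, 4.0, 161) # theta grid for posterior-mean scoring

def p2pl(theta, a_i, b_i):
    return 1.0 / (1.0 + np.exp(-a_i * (theta - b_i)))

def cat(true_theta, length=17):    # ~50% of the pool, mirroring the reported savings
    remaining, administered = list(range(34)), []
    post = np.ones_like(grid)      # flat prior over the grid
    for _ in range(length):
        theta_hat = np.sum(grid * post) / np.sum(post)
        item_info = [a[i] ** 2 * p2pl(theta_hat, a[i], b[i])
                     * (1.0 - p2pl(theta_hat, a[i], b[i])) for i in remaining]
        i = remaining.pop(int(np.argmax(item_info)))     # most informative item
        x = rng.random() < p2pl(true_theta, a[i], b[i])  # simulated response
        post = post * (p2pl(grid, a[i], b[i]) if x else 1.0 - p2pl(grid, a[i], b[i]))
        administered.append(i)
    return np.sum(grid * post) / np.sum(post), administered

theta_hat, items_used = cat(true_theta=1.0)
print(round(theta_hat, 2), items_used[:5])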

26 Apr 1989
Abstract: Combinations of five methods of equating test forms and two methods of selecting samples of students for equating were compared for accuracy. The two sampling methods were representative sampling from the population and matching samples on the anchor test score. The equating methods were the Tucker, Levine equally reliable, chained equipercentile, frequency estimation, and item response theory (IRT) 3PL methods. The tests were the Verbal and Mathematical sections of the Scholastic Aptitude Test. The criteria for accuracy were measures of agreement with an equivalent-groups equating based on more than 115,000 students taking each form. Much of the inaccuracy in the equatings could be attributed to overall bias. The results for all equating methods in the matched samples were similar to those for the Tucker and frequency estimation methods in the representative samples; these equatings made too small an adjustment for the difference in the difficulty of the test forms. In the representative samples, the cha...
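Among the methods this study compares, the equipercentile idea is the easiest to state: map each form-X score to the form-Y score with the same percentile rank. A schematic Python sketch on synthetic score distributions; the data are invented, and operational equating (including the chained, anchor-test version evaluated here) adds smoothing and further steps.

import numpy as np

rng = np.random.default_rng(2)
scores_x = rng.binomial(60, 0.55, 5000)  # form X: slightly easier form
scores_y = rng.binomial(60, 0.50, 5000)  # form Y

def equipercentile(x, sx, sy):
    prank = np.mean(sx <= x)             # percentile rank of score x on form X
    return np.quantile(sy, prank)        # form Y score at the same rank

print(equipercentile(35, scores_x, scores_y))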

Journal ArticleDOI
TL;DR: In this article, a marginal maximum likelihood estimation procedure is developed which allows for incomplete data and linear restrictions on both the item and the population parameters, and two statistical tests for evaluating model fit are presented: the former has power against violation of the assumption about the ability distribution; the latter offers the possibility of identifying specific items that do not fit the model.
Abstract: The partial credit model, developed by Masters (1982), is a unidimensional latent trait model for responses scored in two or more ordered categories. In the present paper some extensions of the model are presented. First, a marginal maximum likelihood estimation procedure is developed which allows for incomplete data and linear restrictions on both the item and the population parameters. Secondly, two statistical tests for evaluating model fit are presented: the former test has power against violation of the assumption about the ability distribution; the latter test offers the possibility of identifying specific items that do not fit the model.
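For context, Masters's partial credit model, which these extensions build on, has the standard form

$$P(X_i = x \mid \theta) = \frac{\exp \sum_{j=1}^{x} (\theta - \delta_{ij})}{\sum_{h=0}^{m_i} \exp \sum_{j=1}^{h} (\theta - \delta_{ij})}, \qquad x = 0, 1, \dots, m_i,$$

with step parameters $\delta_{ij}$ and the empty sum for $h = 0$ defined as zero; the paper's contribution is marginal (rather than conditional or joint) maximum likelihood estimation of such models under incomplete designs and linear restrictions.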

01 Mar 1989
TL;DR: Simultaneous and sequential parallel test construction methods based on the use of 0–1 programming are examined for the Rasch and 3-parameter logistic model.

Journal ArticleDOI
TL;DR: In this paper, the authors investigated the accuracy of marginal maximum likelihood estimation of the item parameters of the two-parameter logistic model and found that marginal estimation was substantially better than joint maximum likelihood estimation for items with extreme difficulty or discrimination parameters.
Abstract: The accuracy of marginal maximum likelihood estimates of the item parameters of the two-parameter logistic model was investigated. Estimates were obtained for four sample sizes and four test lengths; joint maximum likelihood estimates were also computed for the two longer test lengths. Each condition was replicated 10 times, which allowed evaluation of the accuracy of estimated item characteristic curves, item parameter estimates, and estimated standard errors of item parameter estimates for individual items. Items that are typical of a widely used job satisfaction scale and moderately easy tests had satisfactory marginal estimates for all sample sizes and test lengths. Larger samples were required for items with extreme difficulty or discrimination parameters. Marginal estimation was substantially better than joint maximum likelihood estimation. Index terms: Fletcher-Powell algorithm, item parameter estimation, item response theory, joint maximum likelihood estimation, marginal maximum likelihood...

Journal ArticleDOI
TL;DR: In this article, latent variable models are applied to the study of the validity of a psychological test: when the test predicts a criterion by measuring a unidimensional latent construct, not only must the total score predict the criterion, but the joint distribution of criterion scores and item responses must exhibit a certain pattern.
Abstract: Established results on latent variable models are applied to the study of the validity of a psychological test. When the test predicts a criterion by measuring a unidimensional latent construct, not only must the total score predict the criterion, but the joint distribution of criterion scores and item responses must exhibit a certain pattern. The presence of this population pattern may be tested with sample data using the stratified Wilcoxon rank sum test. Often, criterion information is available only for selected examinees, for instance, those who are admitted or hired. Three cases are discussed: (i) selection at random, (ii) selection based on the current test, and (iii) selection based on other measures of the latent construct. Discriminant validity is also discussed.

Journal ArticleDOI
Abstract: The purpose of the present research was to develop general guidelines to assist practitioners in setting up operational computerized adaptive testing (CAT) systems based on the graded response model...

Journal ArticleDOI
TL;DR: In this paper, the authors used monotone regression splines to define p(x, θ) and applied them to the representation of test items as functions of examinee ability.
Abstract: A binomial regression function p(x, θ) models the probability of r_j successes in n_j trials as a function of the values of an observed covariate x_j and/or a latent variable θ_j (j = 1, ..., J). This article explores the use of monotone regression splines to define p, and applies them to the representation of test items as functions of examinee ability. Some illustrative data suggest that the flexibility of monotone splines permits the detection of item characteristics not observable using logistic-based or log-linear approaches. A simulation study indicates that estimates of both item-characteristic curves and ability are reasonably precise for numbers of items and examinees typical of large university lectures. Given a set of such binomial regression functions, it can be useful to study the principal components of functional variation. The extension of multivariate principal-components analysis to permit the analysis of many item-characteristic curves is described.
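The monotonicity device in this line of work is to expand the curve in I-spline (integrated M-spline) basis functions with nonnegative coefficients; schematically, and with the constraints paraphrased rather than quoted from the article,

$$p(\theta) = \beta_0 + \sum_{k=1}^{K} \beta_k I_k(\theta), \qquad \beta_k \ge 0 \;(k \ge 1).$$

Since each $I_k$ is nondecreasing, any such combination is itself nondecreasing, and additional constraints keep $p$ within $[0, 1]$; this is what lets the curve flex freely while remaining a legitimate item characteristic curve.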

Journal ArticleDOI
TL;DR: In this article, the authors extend the generalized Rasch model to designs with any number of time points and even with different sets of items presented on different occasions, provided that one unidimensional subscale is available per latent trait.
Abstract: The LLRA (linear logistic model with relaxed assumptions; Fischer, 1974, 1977a, 1977b, 1983a) was developed, within the framework of generalized Rasch models, for assessing change in dichotomous item score matrices between two points in time; it makes it possible to quantify change on latent trait dimensions and to explain change in terms of treatment effects, treatment interactions, and a trend effect. A remarkable feature of the model is that unidimensionality of the item set is not required. The present paper extends this model to designs with any number of time points and even with different sets of items presented on different occasions, provided that one unidimensional subscale is available per latent trait. Thus unidimensionality assumptions within subscales are combined with multidimensionality of the item set. Conditional maximum likelihood methods for parameter estimation and hypothesis testing are developed, and a necessary and sufficient condition for unique identification of the model, given the data, is derived. Finally, a sample application is presented.

Journal ArticleDOI
TL;DR: In this article, a method for the detection of item bias with respect to observed or unobserved subgroups is proposed, which uses quasi-loglinear models for the incomplete subgroup × test score × item 1 × ... × item k contingency table.
Abstract: A method is proposed for the detection of item bias with respect to observed or unobserved subgroups. The method uses quasi-loglinear models for the incomplete subgroup × test score × item 1 × ... × item k contingency table. If subgroup membership is unknown, the models are Haberman's incomplete-latent-class models. The (conditional) Rasch model is formulated as a quasi-loglinear model. The parameters in this loglinear model that correspond to the main effects of the item responses are the conditional estimates of the parameters in the Rasch model. Item bias can then be tested by comparing the quasi-loglinear Rasch model with models that contain parameters for the interaction of item responses and the subgroups.

Journal ArticleDOI
TL;DR: In this paper, the psychometric requirements for adaptive testing are reviewed and the historical antecedents are considered and an analysis of these two factors reveals the importance of the concept of the item/person interaction.
Abstract: The psychometric requirements for adaptive testing are reviewed and the historical antecedents are considered. An analysis of these two factors reveals the importance of the concept of the item/person interaction. Future areas for advancement of adaptive testing are discussed.

Journal ArticleDOI
TL;DR: In this paper, the effects of using a unidimensional IRT model when the assumption of unidimensionality was violated were examined; adaptive and nonadaptive tests were formed from two-dimensional item sets.
Abstract: This study examined some effects of using a unidimensional IRT model when the assumption of unidimensionality was violated. Adaptive and nonadaptive tests were formed from two-dimensional item sets. The tests were administered to simulated examinee populations with different correlations of the two underlying abilities. Scores from the adaptive tests tended to be related to one or the other ability rather than to a composite. Similar but less disparate results were obtained with IRT scoring of nonadaptive tests, whereas the conventional standardized number-correct score was equally related to both abilities. Differences in item selection from the adaptive administration and in item parameter estimation were also examined and related to differences in ability estimation. Index terms: ability estimation, adaptive testing, item parameter estimation, item response theory, multidimensionality.

Journal ArticleDOI
TL;DR: An overview of item response theory is presented; basic models are described, and applications including instrument construction and computerized testing are illustrated.
Abstract: An overview of item response theory is presented. Basic models are described, and various applications are illustrated. Applications discussed include instrument construction and computerized testing.