
Showing papers on "Item response theory published in 1979"



Journal ArticleDOI
TL;DR: A student may be so atypical and unlike other students that his or her aptitude test score fails to be a completely appropriate measure of his or her relative ability, as mentioned in this paper, and we consider the problem of us...
Abstract: A student may be so atypical and unlike other students that his or her aptitude test score fails to be a completely appropriate measure of his or her relative ability. We consider the problem of us...

257 citations


Journal ArticleDOI
TL;DR: A number of methods have been proposed for assessing bias in items based on latent trait theory and item characteristic curves, which have some interesting advantages over those discussed above (Scheuneman, in press), as discussed by the authors.
Abstract: In the past few years, the issue of test bias with its far-reaching political and social implications has been the subject of much controversy. The majority of the research on bias has been concerned with the fair use of testing instruments in decision-making situations such as employment and college admissions. In general, the models employed in these situations have involved the test or subtest as a whole and have measured the test scores against some criterion of performance external to the test. (See Petersen & Novick, 1976, for a discussion of the principal models which have been proposed.) More recently, test publishers and others charged with the construction of testing instruments have become interested in the more specific question of identifying items which appear to be biased prior to the construction of final forms of the test. Once such items are identified, the modification or removal of these items should increase the likelihood that fairer use of the final instrument will follow. Consequently, the past few years have seen a rapid proliferation of new methods for assessing bias in test items. Recent reviews by Merz (1978) and Rudner (1977a) discuss a number of these techniques. The methods use a variety of models, but all approach the issue at the item level, and all are based on information internal to the testing instrument itself. Among the various procedures for assessing bias in the absence of an outside criterion, the most well known methods are the transformed item difficulty approaches. When examining differences in performance on items for members of different groups, the assumption of equal ability in the respective samples is usually not warranted. Hence, simply comparing item difficulties and assuming that any significant difference constitutes bias is not an appropriate procedure. Therefore, the assumption is made instead that mean performance differences between samples reflect true differences in ability, and that items deviating significantly from these mean differences might be considered biased. (These procedures rely on an item-by-group interaction definition of bias.) Item difficulties are first transformed so that the relationship between obtained difficulties for the two groups is linear, and a line is fitted to points representing the difficulty of the items for each group. Then confidence bands are drawn on either side of the line, or distances from it are computed for each point. This approach is typified by the delta plot procedure introduced by Angoff (for example, Angoff and Sharon, 1974). Other approaches which have been proposed, but not widely used, are methods comparing responses to incorrect options of multiple-choice items such as those proposed by Veale (1976), and various factor analytic procedures such as those presented by Merz (1976) and Green and Draper (1972). More recently, methods have been proposed for assessing bias in items based on latent trait theory and item characteristic curves, which have some interesting advantages over those discussed above (Scheuneman, in press). In the models based on this theory, the...
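To make the transformed item difficulty (delta plot) approach concrete, the following Python sketch illustrates the general idea under common assumptions: the ETS delta transform (delta = 13 + 4·Φ⁻¹(1 − p)), a major-axis line fitted to the two groups' deltas, and a purely hypothetical flagging threshold. It is an illustration of the technique, not Angoff's exact procedure.

```python
# Minimal sketch of a delta-plot item bias screen (illustrative only).
# Assumes the ETS delta transform (mean 13, SD 4) and a major-axis line;
# variable names and the flagging threshold are hypothetical.
import numpy as np
from scipy.stats import norm

def delta(p):
    """Transform proportion-correct p to the delta scale (harder -> larger delta)."""
    return 13.0 + 4.0 * norm.ppf(1.0 - np.asarray(p, dtype=float))

def delta_plot_distances(p_group1, p_group2):
    d1, d2 = delta(p_group1), delta(p_group2)
    # Fit the major (principal) axis of the bivariate scatter rather than an OLS
    # line, so neither group is treated as the predictor.
    sx2, sy2 = d1.var(ddof=1), d2.var(ddof=1)
    sxy = np.cov(d1, d2)[0, 1]
    slope = (sy2 - sx2 + np.sqrt((sy2 - sx2) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    intercept = d2.mean() - slope * d1.mean()
    # Signed perpendicular distance of each item's point from the line.
    return (slope * d1 - d2 + intercept) / np.sqrt(slope ** 2 + 1)

p1 = [0.85, 0.70, 0.55, 0.40, 0.62]   # proportion correct, group 1
p2 = [0.80, 0.60, 0.50, 0.20, 0.58]   # proportion correct, group 2
dist = delta_plot_distances(p1, p2)
flagged = np.abs(dist) > 1.5          # hypothetical screening threshold
print(dist.round(2), flagged)
```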

204 citations


Journal ArticleDOI
TL;DR: It was concluded that the selection of an item calibration procedure should be dependent on the distribution of ability in the calibration sample, the later uses of the item parameters, and the computer resources available.
Abstract: A simulation study of the effectiveness of four item characteristic curve estimation programs was conducted. Using the three-parameter logistic model, three groups of 2,000 simulated subjects were administered 80-item tests. These simulated test responses were then calibrated using the four programs. The estimated item parameters were compared to the known item parameters in four analyses for each program in all three data sets. It was concluded that the selection of an item calibration procedure should be dependent on the distribution of ability in the calibration sample, the later uses of the item parameters, and the computer resources available.
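For reference, the three-parameter logistic model used to generate the simulated responses, and a minimal way to simulate such data, are sketched below in Python; the parameter ranges and the scaling constant D = 1.7 are illustrative assumptions rather than the study's actual settings.

```python
# Minimal sketch of generating 3PL response data like that used in the calibration
# study; all parameter ranges here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def p_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic: P = c + (1 - c) / (1 + exp(-D a (theta - b)))."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta[:, None] - b)))

n_examinees, n_items = 2000, 80
theta = rng.normal(0.0, 1.0, n_examinees)      # simulated abilities
a = rng.uniform(0.5, 2.0, n_items)             # discriminations
b = rng.uniform(-2.0, 2.0, n_items)            # difficulties
c = rng.uniform(0.0, 0.25, n_items)            # pseudo-guessing parameters

responses = (rng.random((n_examinees, n_items)) < p_3pl(theta, a, b, c)).astype(int)
print(responses.shape, responses.mean(axis=0)[:5].round(2))  # classical p-values
```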

45 citations


Journal ArticleDOI
TL;DR: In this paper, the authors focus on the validity of the response alternatives of questionnaire items and their effect on the mean score for a group, arguing that the validity question cannot be resolved without considering the response choices as a component of the whole item.
Abstract: Questionnaire items used in survey and evaluation research are most frequently written so that the item stems have content or face validity; little attention is directed to the validity of the response alternatives. Yet numerous response alternative formats are available to the item constructor and the item validity question cannot be resolved without considering the response choices as a component of the whole item. When item scale points (response alternatives) are labeled (e.g., excellent to poor) and numbered (e.g., 1 to 5), do respondents choose by focusing on label meanings or by referencing the numeric scale? Do responses vary with the frame of reference used? If so, how does one interpret the mean score for a group? This uncertainty of interpretation is a validity problem, the importance of which is highlighted further if one views items in which only the endpoints are defined as the typical bipolar scale used with the semantic differential technique. If the responses to such items differ from responses to content-parallel items in which all response positions are defined, further research would be warranted to determine whether one format is more susceptible to rating errors of leniency or other sources of invalidity. This study was designed to provide answers to two questions regarding the response scales of questionnaire items: (a) Do random groups respond to the same item stem differently if all scale points are defined rather than just the endpoints? and (b) Do random groups respond to the same item stem differently if all scale points are labeled numerically rather than alphabetically? If subjects do respond differently to the same item stem when the response alternative format varies, then the items could be regarded as nonequivalent in the same sense as content-parallel achievement items which vary in difficulty level. The general validity problem and specific questions stated above have not been addressed directly in the studies of response schemes which have appeared in the literature. Follman (1974), for example, studied the polarity of sets of response alternatives but only defined scale endpoints with verbal descriptors on a 1-5 scale. A comparative analysis of standard and reworded questionnaire items was published by Jaeger and Freijo (1974), but they used verbal descriptors at each point of their five-choice numeric scale. Spector (1976) focused on the problem of ordinal data being interpreted as interval data with Likert-type scale items. The subjects in Spector's study merely ranked response alternatives; they did not use response categories in the context of responding to actual questionnaire items. Because the meanings of such categories as "satisfactory" or "excellent" probably vary with the item context with which they are associated, Spector's work appears to have questionable generalizability. Finn (1972) reported on two separate studies which involved varying scale point definitions for subjects who rated job complexity. Though he reported no significant effect on mean

38 citations


Journal ArticleDOI
TL;DR: This paper investigated the adequacy of the Rasch model in providing objective measurement when equating existing standardized reading achievement tests with groups of examinees not widely separated, and found that the model was adequate for measuring the performance of these tests.
Abstract: This study investigated the adequacy of the Rasch model in providing objective measurement when equating existing standardized reading achievement tests with groups of examinees not widely separate...
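As background, the Rasch model and the property that equating under it relies on can be written as follows; the notation is generic and not taken from the study.

```latex
% Rasch model, generic notation (person v, item i):
P(x_{vi} = 1 \mid \theta_v, b_i) = \frac{\exp(\theta_v - b_i)}{1 + \exp(\theta_v - b_i)}
% If the model holds, item difficulties estimated in two samples A and B satisfy
% \hat{b}_i^{(A)} \approx \hat{b}_i^{(B)} + k  for all common items i,
% so the two tests can be placed on one scale by estimating the single shift
% constant, e.g. k = \mathrm{mean}_i\bigl(\hat{b}_i^{(A)} - \hat{b}_i^{(B)}\bigr).
```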

35 citations


01 Jul 1979
TL;DR: The differences in type of information-processing skill developed by different instructional backgrounds affect, negatively or positively, the learning of further advanced instructional materials; this poses a serious problem for routing students to an instructional level on the sole basis of performance on a diagnostic adaptive test.
Abstract: The differences in type of information-processing skill developed by different instructional backgrounds affect, negatively or positively, the learning of further advanced instructional materials. That is, if prior and subsequent instructional methods are different, a proactive inhibition effect produces low achievement scores on a posttest. This fact poses a serious problem for routing of students to an instructional level on the sole basis of performance on a diagnostic adaptive test. It is essential that we somehow unravel what information-processing strategy was used and consider this knowledge simultaneously.

17 citations


Journal ArticleDOI
TL;DR: In this paper, the authors examined the conditions on item difficulty for both a deterministic and a stochastic conception of item responses and found that the conditions are more restrictive than is generally understood and differ between the two conceptions.
Abstract: In choosing a binomial test model, it is important to know exactly what conditions are imposed on item difficulty. In this paper these conditions are examined for both a deterministic and a stochastic conception of item responses. It appears that they are more restrictive than is generally understood and differ for both conceptions. When the binomial model is applied to a fixed examinee, the deterministic conception imposes no conditions on item difficulty but requires instead that all items have characteristic functions of the Guttman type. In contrast, the stochastic conception allows non-Guttman items but requires that all characteristic functions must intersect at the same point, which implies equal classically defined difficulty. The beta-binomial model assumes identical characteristic functions for both conceptions, and this also implies equal difficulty. Finally, the compound binomial model entails no restrictions on item difficulty.
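The three models contrasted in the abstract can be written compactly; the sketch below uses generic notation (n items, number-correct score X, examinee domain score ζ) and states the standard forms of the binomial, beta-binomial, and compound binomial models rather than the paper's own derivation.

```latex
% Binomial error model for a fixed examinee with domain score zeta:
P(X = x \mid \zeta) = \binom{n}{x}\,\zeta^{x}\,(1-\zeta)^{n-x}
% Beta-binomial model: the same conditional distribution with \zeta \sim
% \mathrm{Beta}(\alpha,\beta) across examinees, which (as the abstract notes)
% presumes items of equal difficulty.
% Compound binomial model: items may have different probabilities \pi_i(\zeta)
% for the same examinee, giving
P(X = x \mid \zeta) = \sum_{\substack{S \subseteq \{1,\dots,n\} \\ |S| = x}}
  \;\prod_{i \in S} \pi_i(\zeta) \prod_{j \notin S} \bigl(1 - \pi_j(\zeta)\bigr)
```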

14 citations



Journal ArticleDOI
TL;DR: In this article, a technique is proposed for differentially weighting options of a multiple choice test in a fashion that maximizes the item predictive validity; the rule can be applied with different numbers of categories, and the "optimal" number of categories can be determined by significance tests and/or the R2 criterion.
Abstract: This paper outlines a technique for differentially weighting options of a multiple choice test in a fashion that maximizes the item predictive validity. The rule can be applied with different numbers of categories, and the “optimal” number of categories can be determined by significance tests and/or through the R2 criterion. Our theoretical analysis indicates that more complex scoring rules have higher item validities, higher item variances, and higher score variances, and are also likely to increase the interitem correlations and the test reliability. A plausible explanation for the apparent paradox of lack of improvement in the test validity, based on the relation between interitem correlations and item validities, is offered.
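One common way to realize validity-maximizing option weights is to set each option's weight to the mean criterion score of the examinees who chose it (equivalently, regressing the criterion on option indicators); the Python sketch below illustrates that idea. It is an illustration under that assumption, not the authors' exact rule, and the R² check is schematic.

```python
# Minimal sketch of criterion-based option weighting (illustrative, not the
# authors' exact procedure): each option's weight is the mean criterion score
# of the examinees who chose it, which maximizes item-criterion correlation.
import numpy as np

def option_weights(choices, criterion):
    """choices: chosen option index per examinee for one item; criterion: external scores."""
    choices = np.asarray(choices)
    criterion = np.asarray(criterion, dtype=float)
    weights = {opt: criterion[choices == opt].mean() for opt in np.unique(choices)}
    scored = np.array([weights[c] for c in choices])
    # Schematic R^2: squared correlation between weighted item score and criterion.
    r2 = np.corrcoef(scored, criterion)[0, 1] ** 2
    return weights, r2

choices   = [0, 1, 2, 1, 3, 0, 2, 1, 0, 3]            # hypothetical option picks
criterion = [55, 70, 62, 74, 48, 58, 65, 71, 52, 45]  # hypothetical criterion scores
weights, r2 = option_weights(choices, criterion)
print(weights, round(r2, 3))
```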

8 citations


13 Feb 1979
TL;DR: It is observed that the subset of equivalent binary items can be used as a substitute for the Old Test, those methods and approaches are generalized to the present situation, and it is discovered that, for once, items with low discrimination power have a significant role.
Abstract: The methods and approaches introduced so far for estimating the operating characteristics of item response categories require the Old Test, or a set of items, whose operating characteristics are known. To generalize these methods to apply to the situation where we start to develop a new item pool, i.e., there is no 'Old Test,' an approach is made by assuming that the tentative item pool has a substantial number of equivalent items, even though their common item characteristic function is not known yet. It is observed that, within the type of item characteristic function which is strictly increasing in the latent trait, Theta, with zero and unity as its two asymptotes, the area under the square root of the item information function is a constant value, pi. The item characteristic function which provides a constant item information is searched and discovered, and is named the constant information model. Using this model, it is observed that the subset of equivalent binary items can be used as a substitute for the Old Test, and those methods and approaches are generalized to the present situation. It is discovered that, for once, items with low discrimination power have a significant role. (Author)
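The invariance the abstract refers to can be sketched briefly. The block below uses the standard definition of item information and one functional form consistent with the description of a constant information model; the notation is generic, not the report's own.

```latex
% Item information for a binary item with characteristic function P(theta):
I(\theta) = \frac{\bigl[P'(\theta)\bigr]^{2}}{P(\theta)\bigl(1 - P(\theta)\bigr)}
% For any P strictly increasing from 0 to 1, the area under \sqrt{I} is fixed:
\int \sqrt{I(\theta)}\,d\theta
  = \int \frac{P'(\theta)}{\sqrt{P(\theta)\bigl(1-P(\theta)\bigr)}}\,d\theta
  = 2\Bigl[\arcsin\sqrt{P}\Bigr]_{P=0}^{P=1} = \pi .
% One form yielding constant information is
% P(\theta) = \sin^{2}\!\bigl(a(\theta - b) + \tfrac{\pi}{4}\bigr)
% on the interval where it rises from 0 to 1, for which I(\theta) = 4a^{2}.
```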

Journal ArticleDOI
TL;DR: In this article, two short forms of the recently published WISC-R were developed, one employing a design determined by empirical item analysis results of the standard test battery and the other employing the well-known Yudin scheme determined by systematic random selection of test items.
Abstract: This study demonstrated that the design of current intelligence test short forms could be improved by employing a more effective method of item selection based on psychometric theory. Two short forms of the recently published WISC-R were developed, one employing a design determined by empirical item analysis results of the standard test battery and the other employing the well-known Yudin scheme determined by systematic random selection of test items. In all analyses the item analysis method of item selection was shown to yield more accurate results than the Yudin procedure. Practical usefulness as well as limitations of the present WISC-R short form are discussed.

01 Apr 1979
TL;DR: In this article, the appropriateness of the use of the standardized residual to assess congruence between sample test item responses and the one-parameter latent-trait (Rasch) item characteristic curve is investigated.
Abstract: The appropriateness of the use of the standardized residual (SR) to assess congruence between sample test item responses and the one-parameter latent-trait (Rasch) item characteristic curve is investigated. Latent-trait theory is reviewed, as well as theory of the SR, the apparent error in calculating the expected distribution of the SR, and implications of using the SR for item analysis. Empirical results using actual data are presented to support the theoretical analysis, as well as a demonstration of the practical implications of the failure to reject items which do not fit the model. Conclusions based on the findings include: (1) discriminations of all the items in a test must be very similar in order for Rasch model analyses to work in practice; (2) the SR mean square fit statistic does not detect unacceptable variation in discrimination; and (3) item discrimination needs to be monitored and controlled using more exact tests of fit than the residual mean square. Finally, an alternate linear model is described which may provide a practical solution to problems encountered in the construction of item banks and tailored testing. (RI)
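For orientation, the standardized residual and an unweighted mean square fit statistic for the Rasch model can be sketched as follows in Python; the ability and difficulty estimates are assumed to come from elsewhere, and the example data are simulated purely for illustration.

```python
# Minimal sketch of standardized residuals and a mean-square fit statistic for
# the Rasch model; theta and b are assumed to be estimated elsewhere.
import numpy as np

def rasch_p(theta, b):
    """Rasch probability of a correct response for every person-item pair."""
    return 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))

def item_fit(x, theta, b):
    """x: 0/1 response matrix (persons x items). Returns per-item mean-square fit."""
    p = rasch_p(theta, b)
    z = (x - p) / np.sqrt(p * (1.0 - p))   # standardized residuals
    return (z ** 2).mean(axis=0)           # unweighted ("outfit"-style) mean square

rng = np.random.default_rng(1)
theta = rng.normal(0.0, 1.0, 500)
b = np.linspace(-2.0, 2.0, 20)
x = (rng.random((500, 20)) < rasch_p(theta, b)).astype(int)
print(np.round(item_fit(x, theta, b), 2))  # values near 1.0 indicate adequate fit
```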

01 Mar 1979
TL;DR: In this article, a computerized diagnostic adaptive test for a series of pre-algebra signed-number lessons (which are also on the computer system) was programmed along with a computer-managed routing system by which each examinee was sent to the instructional unit corresponding to the level of skill at which she/he stopped in the initial test.
Abstract: A computerized diagnostic adaptive test for a series of pre-algebra signed-number lessons (which are also on the computer system) was programmed along with a computer-managed routing system by which each examinee was sent to the instructional unit corresponding to the level of skill at which she/he stopped in the initial test. Upon completion of the course a computerized conventional posttest was given to the examinees. The posttest scores were far from being unidimensional, while the pretest and posttest data obtained from a previous study, in which the pretest was a computerized conventional test and students were forced to go through all instructional units regardless of their achievement in the pretest, indicated a strong tendency to be unidimensional. The response patterns of the posttest in the present study showed a high error rate for the skills prior to stopping levels for a subgroup of examinees. A cluster analysis was performed on the response patterns and four different groups were found. A discriminant analysis indicated significant differences among the four groups in response patterns of the skills in signed number operations. After interviewing the teachers and several children, we came to the conclusion that it was the difference between prior and current instructional methods that confused students and caused a mess in the posttest data, i.e., there was a proactive inhibition effect.




Journal ArticleDOI
TL;DR: In this paper, a study was designed to compare classical item analysis procedure with the Rasch latent trait analysis in terms of several applied concerns, and the results showed substantial differences between the two procedures in terms of selected items, relationship to the full scale, and assessment of actual alcohol consumption.
Abstract: Several methods of psychometric instrument construction have been suggested in the literature. Although there is considerable information describing theoretical and mathematical differences among these methods, there is a paucity of data regarding actual differences resulting from practical applications. Therefore, the present study was designed to compare classical item analysis procedure with the Rasch latent trait analysis in terms of several applied concerns. Respondents were 373 predominately white university undergraduates who completed the MacAndrew alcoholism scale and the Khavari Alcohol Test, an instrument for assessing quantity-frequency of alcohol consumption. Responses to the MacAndrew scale were analyzed using both classical and Rasch procedures and the resulting sub-scales were then compared in terms of specific items selected, relationship to the full scale, and assessment of actual alcohol consumption. Results showed substantial differences between the two procedures in terms of selected ...

Journal ArticleDOI
Jan Vegelius
TL;DR: In this paper, a new measure of similarity between persons applicable in Q-analysis is proposed, which allows assumptions of non-orthogonality (dependence) between the items, across which the similarity is computed.
Abstract: A new measure of similarity between persons applicable in Q-analysis is proposed. It allows assumptions of non-orthogonality (dependence) between the items, across which the similarity is computed. The similarity measure may also be applied in an R-analysis, when the person orthogonality assumption may be considered as doubtful. Finally an artificial example is given.


ReportDOI
01 Apr 1979
TL;DR: In this paper, the authors discuss the importance of latent trait theory and generalizability theory in terms of content-referenced testing for work samples, and suggest methods for increasing the objectivity of measurement in programs of personnel testing.
Abstract: Suggestions are offered for increasing the objectivity of measurement in programs of personnel testing. Classical concepts of reliability and validity are reviewed. Construct validity is seen as the basic evaluation of a measuring instrument in psychology; criterion-related validity actually refers to hypotheses rather than to measurements, and content validity refers to test development. The major evaluation for personnel tests is less a matter of validity than of job relevance and generalizability. Implications of latent trait theory and generalizability theory are discussed in terms of content-referenced testing for work samples.

01 Apr 1979
TL;DR: In this article, three classifications of item specificity--global, general concept, and specific--were chosen to represent this continuum, and thirty-nine Likert-type items empirically identified as members of these categories, and as members of a content domain labeled influence and security, were then compared against six statistical properties: (1) skewness in distribution of class section means; (2) between-class variance of means; (3) within-class variance among student responses; (4) ceiling effect; (5) item reliability; and (6) interquartile range.
Abstract: Prior research has indicated that items administered to college students for rating their instructors can be empirically as well as logically classified on a continuum from very general to specific. Three of these hypothesized classifications of item specificity--global, general concept, and specific--were chosen to represent this continuum. Thirty-nine Likert-type items empirically identified as members of these categories, and as members of a content domain labeled influence and security, were then compared against six statistical properties: (1) skewness in distribution of class section means; (2) between-class variance of means; (3) within-class variance among student responses; (4) ceiling effect; (5) item reliability; and (6) interquartile range. Results indicated that most items met the criteria hypothesized, although some discrepancies for the most specific items were pronounced. The differentiation among specificity levels offered an essentially content-free classification scheme. Implications were drawn for questionnaire item writing, use of results, and the evaluation of overall item quality. (Author/CP)
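To make the six properties concrete, the Python sketch below computes plausible versions of them for a single Likert item; the operational definitions used here (e.g., item reliability as an item-total correlation, ceiling effect as the proportion of responses in the top category) are assumptions for illustration, not the report's own definitions.

```python
# Illustrative computation of the six item properties named above for one Likert
# item; the operational definitions used here are assumptions, not the report's.
import numpy as np
from scipy.stats import skew

def item_properties(item, total, section, top_category=5):
    item, total, section = map(np.asarray, (item, total, section))
    sections = np.unique(section)
    means = np.array([item[section == s].mean() for s in sections])
    within = np.array([item[section == s].var(ddof=1) for s in sections])
    q75, q25 = np.percentile(item, [75, 25])
    return {
        "skewness_of_section_means": skew(means),
        "between_section_variance": means.var(ddof=1),
        "within_section_variance": within.mean(),
        "ceiling_effect": (item == top_category).mean(),   # assumed definition
        "item_reliability": np.corrcoef(item, total)[0, 1],  # item-total r, assumed
        "interquartile_range": q75 - q25,
    }

rng = np.random.default_rng(2)
section = np.repeat(np.arange(8), 25)                  # 8 class sections, 25 students each
item = rng.integers(1, 6, section.size)                # hypothetical 1-5 Likert responses
total = item + rng.integers(30, 150, section.size)     # hypothetical scale total
print(item_properties(item, total, section))
```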


01 Mar 1979
TL;DR: In this paper, the authors address three practical questions of importance and interest to test developers: 1) What are the effects of examinee sample size and test length on the precision of SEE curves? 2) What effects do the statistical characteristics of an item pool have on the precision of SEE curves? and 3) What is the relationship between test length and SEE curves in typical item pools?
Abstract: One of the most important advantages that accrue from the application of latent trait models is the possibility of specifying a target information curve and then selecting items from an item pool to produce a test with the features characterized by this curve. By proceeding in this manner, it is possible to develop a test that provides a pre-specified level of precision (Standard Error of Ability Estimate) at selected ability levels. One problem with this paradigm is that little is known about the precision of the standard error of ability estimates (SEE) under varying circumstances. The purpose of the research reported in this paper was to address three practical questions of importance and interest to test developers: 1) What are the effects of examinee sample size and test length on the precision of SEE curves? 2) What effects do the statistical characteristics of an item pool have on the precision of SEE curves? and 3) What is the relationship between test length and SEE curves in typical item pools? Keywords: Latent trait theory, Psychological tests, Aptitude tests, Mathematical models.
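The quantity behind all three questions is the standard error of the ability estimate, which is the reciprocal square root of the test information (the sum of the item informations). The Python sketch below illustrates, for assumed 3PL item parameters rather than the study's pools, how lengthening a test lowers the SEE curve.

```python
# Minimal sketch of SEE curves for a 3PL item pool; the parameters are assumed
# for illustration and are not the study's item pools.
import numpy as np

def item_information(theta, a, b, c, D=1.7):
    """Standard 3PL item information, evaluated at each theta for every item."""
    p = c + (1 - c) / (1 + np.exp(-D * a * (theta[:, None] - b)))
    q = 1 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1 - c)) ** 2

theta = np.linspace(-3, 3, 7)
rng = np.random.default_rng(3)
for n_items in (10, 20, 40):                       # effect of test length on SEE
    a = rng.uniform(0.8, 1.8, n_items)
    b = rng.uniform(-2.0, 2.0, n_items)
    c = rng.uniform(0.1, 0.2, n_items)
    test_info = item_information(theta, a, b, c).sum(axis=1)
    see = 1 / np.sqrt(test_info)                   # SEE(theta) = 1 / sqrt(I(theta))
    print(n_items, np.round(see, 2))
```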