
Showing papers in "Applied Psychological Measurement in 1992"


Journal ArticleDOI
TL;DR: The generalized partial credit model (GPCM), as discussed by the authors, is a partial credit model with a varying slope parameter whose item step parameter is decomposed into location and threshold parameters, following Andrich's (1978) rating scale formulation.
Abstract: The partial credit model (PCM) with a varying slope parameter is developed and called the generalized partial credit model (GPCM). The item step parameter of this model is decomposed into a location and a threshold parameter, following Andrich's (1978) rating scale formulation. The EM algorithm for estimating the model parameters is derived. The performance of this generalized model is compared on both simulated and real data to a Rasch family of polytomous item response models. Simulated data were generated and then analyzed by the various polytomous item response models. The results demonstrate that the rating formulation of the GPCM is quite adaptable to the analysis of polytomous item responses. The real data used in this study consisted of the National Assessment of Educational Progress (Johnson & Allen, 1992) mathematics data that used both dichotomous and polytomous items. The PCM was applied to these data using both constant and varying slope parameters. The GPCM, which provides for varying slope pa...

1,219 citations
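For readers who want the functional form behind the abstract, here is a minimal sketch of the GPCM category response probabilities, with the item step parameter written as a location minus a threshold. The D = 1.7 scaling constant is omitted and the parameter names are illustrative, not taken from the paper.

```python
import numpy as np

def gpcm_probs(theta, a, b, d):
    """Category response probabilities for one item under the GPCM.

    theta : examinee ability
    a     : item slope (a common slope for all items reduces this to the PCM)
    b     : item location
    d     : thresholds d_1..d_m, so the item step parameters are b - d_k
    Categories are scored 0..m.
    """
    steps = a * (theta - (b - np.asarray(d)))        # one term per step
    cum = np.concatenate(([0.0], np.cumsum(steps)))  # category 0 contributes 0
    num = np.exp(cum - cum.max())                    # stabilized exponentials
    return num / num.sum()

# Example: a four-category item
print(gpcm_probs(theta=0.5, a=1.2, b=0.0, d=[0.8, 0.0, -0.8]))
```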


Journal ArticleDOI
TL;DR: A new method is presented for incorporating a large number of constraints on adaptive item selection. The methodology emulates the test construction practices of expert test specialists, which is a necessity if computerized adaptive testing is to compete with conventional tests.
Abstract: Previous attempts at incorporating expert test construction practices into computerized adaptive testing paradigms are described. A new method is presented for incorporating a large number of constraints on adaptive item selection. The methodology emulates the test construction practices of expert test specialists, which is a necessity if computerized adaptive testing is to compete with conventional tests. Two examples—one for a verbal measure and the other for a quantitative measure—are provided of the successful use of the proposed method in designing adaptive tests.

220 citations
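The abstract does not spell out the selection rule, so the sketch below only illustrates the general idea of constraint-aware adaptive item selection: score each candidate item by its information plus a bonus for content areas that are still below their target counts. The scoring rule, weights, and data layout are assumptions for illustration, not the authors' method.

```python
import numpy as np

def pick_next_item(info, areas, counts, targets, weights, administered):
    """Choose the next item: maximize information plus a bonus for content
    areas still under their target count (illustrative rule only)."""
    best, best_score = None, -np.inf
    for j, item_info in enumerate(info):
        if j in administered:
            continue
        area = areas[j]
        shortfall = targets[area] - counts.get(area, 0)
        score = item_info + weights.get(area, 1.0) * max(shortfall, 0)
        if score > best_score:
            best, best_score = j, score
    return best

# Tiny pool: 4 items, two content areas
info = np.array([0.9, 1.1, 0.6, 0.8])
areas = ["algebra", "algebra", "geometry", "geometry"]
next_item = pick_next_item(info, areas, counts={"algebra": 1},
                           targets={"algebra": 2, "geometry": 2},
                           weights={"algebra": 0.5, "geometry": 0.5},
                           administered={0})
print(next_item)
```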


Journal ArticleDOI
TL;DR: In this paper, Monte Carlo methods were used to evaluate MML estimation of item parameters and maximum likelihood (ML) estimates of θ in the two-parameter logistic model for varying test lengths, sample sizes, and assumed θ distribution.
Abstract: Marginal maximum likelihood (MML) estimation of the logistic response model assumes a structure for the distribution of ability (θ). If this assumption is incorrect, the statistical properties of MML estimates may not hold. Monte Carlo methods were used to evaluate MML estimation of item parameters and maximum likelihood (ML) estimates of θ in the two-parameter logistic model for varying test lengths, sample sizes, and assumed θ distribution. 100 datasets were generated for each of the combinations of factors, allowing for item-level analyses based on means across replications. MML estimates of item difficulty were generally precise and stable in small samples, short tests, and under varying distributional assumptions of θ. When the true distribution of θ was normal, MML estimates of item discrimination were also generally precise and stable. ML estimates of θ were generally precise and stable, although the distribution of θ estimates was platykurtic and truncated at the high and low ends of the scor...

101 citations
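A small sketch of the data-generation side of such a Monte Carlo study under the two-parameter logistic model. The scaling constant D is omitted, and the "skewed" ability distribution is an arbitrary stand-in for whatever non-normal shapes were actually studied; item and ability estimation itself would be done with standard MML/ML software.

```python
import numpy as np

rng = np.random.default_rng(1992)

def simulate_2pl(n_persons, a, b, theta_dist="normal"):
    """Generate dichotomous responses under the 2PL model,
    P(correct) = 1 / (1 + exp(-a_j (theta_i - b_j))),
    with theta drawn from the requested true ability distribution."""
    if theta_dist == "normal":
        theta = rng.standard_normal(n_persons)
    elif theta_dist == "skewed":
        theta = rng.gamma(shape=2.0, scale=1.0, size=n_persons) - 2.0  # illustrative skew
    else:
        raise ValueError(theta_dist)
    logits = a[None, :] * (theta[:, None] - b[None, :])
    p = 1.0 / (1.0 + np.exp(-logits))
    return (rng.random(p.shape) < p).astype(int), theta

# 1,000 examinees, 20 items
a = rng.uniform(0.5, 2.0, 20)
b = rng.uniform(-2.0, 2.0, 20)
responses, theta = simulate_2pl(1000, a, b, theta_dist="normal")
print(responses.shape, responses.mean())
```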


Journal ArticleDOI
TL;DR: For a set of k items having nonintersecting item response functions (IRFs), the H coefficient (Loevinger, 1948; Mokken, 1971) applied to a transposed persons-by-items binary matrix, HT, has a non-negative value, as discussed by the authors.
Abstract: For a set of k items having nonintersecting item response functions (IRFs), the H coefficient (Loevinger, 1948; Mokken, 1971) applied to a transposed persons-by-items binary matrix, HT, has a non-negative value. Based on this result, a method is proposed for using HT to investigate whether a set of IRFs intersect. Results from a Monte Carlo study support the proposed use of HT. These results support the use of HT as an extension to Mokken's nonparametric item response theory approach.

90 citations
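Loevinger's H can be written as the ratio of summed item covariances to the maximum covariances the item marginals allow, and HT is the same computation applied to the transposed matrix. A minimal sketch under that covariance formulation (notation is assumed, not taken from the paper):

```python
import numpy as np

def loevinger_h(X):
    """Scalability coefficient H for a persons-by-items 0/1 matrix X, using
    H = sum_{i<j} cov(X_i, X_j) / sum_{i<j} cov_max(X_i, X_j),
    where cov_max = min(p_i, p_j) - p_i * p_j given the item proportions."""
    X = np.asarray(X, dtype=float)
    p = X.mean(axis=0)                        # item proportions correct
    cov = np.cov(X, rowvar=False, bias=True)  # observed covariances
    num = den = 0.0
    k = X.shape[1]
    for i in range(k):
        for j in range(i + 1, k):
            num += cov[i, j]
            den += min(p[i], p[j]) - p[i] * p[j]
    return num / den

def h_transposed(X):
    """H applied to the transposed matrix (persons play the role of items);
    a non-negative value is consistent with nonintersecting IRFs."""
    return loevinger_h(np.asarray(X).T)
```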


Journal ArticleDOI
TL;DR: In this article, the Stocking and Lord (1983) procedure for computing equating coefficients for tests having dichotomously scored items is extended to the case of graded response items.
Abstract: The Stocking and Lord (1983) procedure for computing equating coefficients for tests having dichotomously scored items is extended to the case of graded response items. A system of equations for obtaining the equating coefficients under Samejima's (1969, 1972) graded response model is derived. These equations are used to compute equating coefficients in two related situations. Under the first, the equating coefficients are obtained by matching, on an examinee-by-examinee basis, the true scores on two tests. In the second case, the equating coefficients are obtained by matching the test characteristic curves (TCCs) of the two tests. Several examples of computing equating coefficients in these two situations are provided. The TCC matching approach was much less demanding computationally and yielded equating coefficients that differed little from those obtained through the true score distribution matching approach.

89 citations
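The TCC-matching idea can be illustrated numerically: compute each form's test characteristic curve from graded response items and choose equating coefficients A and B that make the rescaled curves agree. The paper derives estimating equations analytically; the generic optimizer and the parameterization below are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

def grm_expected_score(theta, a, b):
    """Expected item score under Samejima's graded response model.
    a: slope; b: ordered boundary parameters b_1 < ... < b_{m-1}.
    E[score] is the sum of the boundary curves P(score >= k)."""
    theta = np.asarray(theta)[:, None]
    b = np.asarray(b)[None, :]
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return p_star.sum(axis=1)

def tcc(theta, items):
    return sum(grm_expected_score(theta, a, b) for a, b in items)

def tcc_matching_coeffs(items_old, items_new, theta=np.linspace(-4, 4, 41)):
    """Find (A, B) so that the new form's TCC, after rescaling a -> a/A and
    b -> A*b + B, matches the old form's TCC over a grid of theta values."""
    target = tcc(theta, items_old)
    def loss(x):
        A, B = x
        rescaled = [(a / A, A * np.asarray(b) + B) for a, b in items_new]
        return np.mean((tcc(theta, rescaled) - target) ** 2)
    return minimize(loss, x0=[1.0, 0.0], method="Nelder-Mead").x
```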


Journal ArticleDOI
TL;DR: Differential item functioning (DIF) has been informally conceptualized as multidimensionality as mentioned in this paper, which assumes that DIF is not a difference in the item parameters of two groups; rather, it is a shift in the distribution of ability along a secondary trait that influences the probability of a correct item response.
Abstract: Differential item functioning (DIF) has been informally conceptualized as multidimensionality. Recently, more formal descriptions of DIF as multidimensionality have become available in the item response theory literature. This approach assumes that DIF is not a difference in the item parameters of two groups; rather, it is a shift in the distribution of ability along a secondary trait that influences the probability of a correct item response. That is, one group is relatively more able on an ability such as test-wiseness. The parameters of the secondary distribution are confounded with item parameters by unidimensional DIF detection models, and this manifests as differences between estimated item parameters. However, DIF is confounded with impact in multidimensional tests, which may be a serious limitation of unidimensional detection methods in some situations. In the multidimensional approach, DIF is considered to be a function of the educational histories of the examinees. Thus, a better tool for unde...

88 citations


Journal ArticleDOI
TL;DR: The ordered partition model as mentioned in this paper is designed for a measurement context in which the categories of response to an item cannot be completely ordered, so that an examiner may want to maintain the distinction between the strategies.
Abstract: An item response model, called the ordered partition model, is designed for a measurement context in which the categories of response to an item cannot be completely ordered. For example, two different solution strategies may lead to an equivalent degree of success because both strategies may result in the same score, but an examiner may want to maintain the distinction between the strategies. Thus, the data would be neither nominal nor completely ordered, so they may not be suitable for other polytomous item response models such as the partial credit or the graded response models. The ordered partition model is described as an extension of the partial credit model, its relationship to other models is discussed, and two examples are presented.

75 citations


Journal ArticleDOI
TL;DR: In this article, a two-stage procedure for estimating item bias was examined with six indexes of item bias and with the Mantel-Haenszel (MH) statistic; the sample size, the number of biased items, and the magnitude of the bias were varied.
Abstract: A two-stage procedure for estimating item bias was examined with six indexes of item bias and with the Mantel-Haenszel (MH) statistic; the sample size, the number of biased items, and the magnitude of the bias were varied. The second stage of the procedure did not identify substantial numbers of false positives (unbiased items identified as biased). However, the identification of true positives in the second stage was useful only when the magnitude of the bias was not small and the number of biased items was large (20% or 40% of the test). The weighted indexes tended to identify more true and false positives than their unweighted item response theory counterparts. Finally, the MH statistic identified fewer false positives, but did not identify small bias as well as the item response theory indexes

73 citations
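For reference, the Mantel-Haenszel procedure used as the comparison method pools 2 x 2 tables across score strata. A sketch of the common odds ratio and the continuity-corrected chi-square; variable names and the stratification scheme are illustrative, not the study's implementation.

```python
import numpy as np

def mantel_haenszel_dif(correct, group, strata):
    """Mantel-Haenszel DIF statistics for one item.

    correct : 0/1 item responses
    group   : 0 = reference group, 1 = focal group
    strata  : matching variable (e.g., total test score)

    Returns the common odds-ratio estimate alpha_MH and the MH chi-square
    with the usual continuity correction."""
    correct, group, strata = map(np.asarray, (correct, group, strata))
    num = den = a_sum = ea_sum = va_sum = 0.0
    for s in np.unique(strata):
        m = strata == s
        A = np.sum((group[m] == 0) & (correct[m] == 1))  # reference correct
        B = np.sum((group[m] == 0) & (correct[m] == 0))  # reference incorrect
        C = np.sum((group[m] == 1) & (correct[m] == 1))  # focal correct
        D = np.sum((group[m] == 1) & (correct[m] == 0))  # focal incorrect
        T = A + B + C + D
        if T < 2:
            continue
        num += A * D / T
        den += B * C / T
        n_ref, n_foc = A + B, C + D
        n_right, n_wrong = A + C, B + D
        a_sum += A
        ea_sum += n_ref * n_right / T
        va_sum += n_ref * n_foc * n_right * n_wrong / (T ** 2 * (T - 1))
    alpha_mh = num / den
    chi2_mh = (abs(a_sum - ea_sum) - 0.5) ** 2 / va_sum
    return alpha_mh, chi2_mh
```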


Journal ArticleDOI
TL;DR: Bias in an observed variable Y as a measure of an unobserved variable W exists when the relationship of Y to W varies among populations of interest, as mentioned in this paper, and bias is often studied by examining...
Abstract: Measurement bias in an observed variable Y as a measure of an unobserved variable W exists when the relationship of Y to W varies among populations of interest. Bias is often studied by examining...

64 citations


Journal ArticleDOI
TL;DR: In this article, the authors demonstrate empirically how item bias indexes based on item response theory (IRT) identify bias that results from multidimensionality, when a test is multidimensional (MD) with a primary trait and a nuisance trait that affects a small portion of the test.
Abstract: This paper demonstrates empirically how item bias indexes based on item response theory (IRT) identify bias that results from multidimensionality. When a test is multidimensional (MD) with a primary trait and a nuisance trait that affects a small portion of the test, item bias is defined as a mean difference on the nuisance trait between two groups. Results from a simulation study showed that although IRT-based bias indexes clearly distinguished multidimensionality from item bias, even with the presence of a between-group difference on the primary trait, the bias detection rate depended on the degree to which the item measured the nuisance trait, the values of MD discrimination, and the number of MD items. It was speculated that bias defined from the MD perspective was more likely to be detected when the test data met the essential unidimensionality assumption.

62 citations


Journal ArticleDOI
TL;DR: In this article, the importance of regression diagnostics in detecting influential points is discussed, five statistics are recommended for the applied researcher, and the suggested diagnostics are used on a small dataset to detect an influential data point and analyze its effects.
Abstract: Influential data points can affect the results of a regression analysis; for example, the usual summary statistics and tests of significance may be misleading. The importance of regression diagnostics in detecting influential points is discussed, and five statistics are recommended for the applied researcher. The suggested diagnostics were used on a small dataset to detect an influential data point, and the effects were analyzed. Collinearity-based diagnostics also are discussed and illustrated on the same dataset. The non-robustness of the least squares estimates in the presence of influential points is emphasized. Diagnostics for multiple influential points, multivariate regression, multicollinearity, nonlinear regression, and other multivariate procedures also are discussed.
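The abstract recommends case-level diagnostics without listing formulas; the sketch below computes several standard ones (leverage, externally studentized residuals, Cook's distance, DFFITS) for ordinary least squares. These are common choices, not necessarily the five statistics recommended in the article.

```python
import numpy as np

def regression_diagnostics(X, y):
    """Common influence diagnostics for ordinary least squares.

    X : n x p design matrix (include a column of ones for the intercept)
    y : response vector
    Returns leverages, externally studentized residuals, Cook's distances,
    and DFFITS."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
    h = np.diag(H)                                  # leverages
    mse = resid @ resid / (n - p)
    # deleted (externally studentized) residuals
    s2_i = ((n - p) * mse - resid**2 / (1 - h)) / (n - p - 1)
    t = resid / np.sqrt(s2_i * (1 - h))
    cooks = (resid**2 / (p * mse)) * (h / (1 - h)**2)
    dffits = t * np.sqrt(h / (1 - h))
    return h, t, cooks, dffits
```

Large leverages, |t| values above roughly 2, or Cook's distances that stand out from the rest flag cases worth inspecting before trusting the usual summary statistics.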

Journal ArticleDOI
TL;DR: In this article, the effect of response format on diagnostic assessment of students' performance on an algebra test was investigated using two diagnostic approaches: a "bug" analysis and a rule-space analysis.
Abstract: The effect of response format on diagnostic assessment of students' performance on an algebra test was investigated. Two sets of parallel, open-ended (OE) items and a set of multiple-choice (MC) items―which were stem-equivalent to one of the OE item sets―were compared using two diagnostic approaches: a "bug" analysis and a rule-space analysis. Items with identical format (parallel OE items) were more similar than items with different formats (OE vs. MC).

Journal ArticleDOI
TL;DR: In this paper, the similarity data were analyzed using a multidimensional scaling (MDS) procedure followed by a hierarchical cluster analysis of the MDS stimulus coordinates, and the results indicated a strong correspondence between similarity data and the arrangement of items as prescribed in the test blueprint.
Abstract: A new method for evaluating the content representation of a test is illustrated. Item similarity ratings were obtained from content domain experts in order to assess whether their ratings corresponded to item groupings specified in the test blueprint. Three expert judges rated the similarity of items on a 30-item multiple-choice test of study skills. The similarity data were analyzed using a multidimensional scaling (MDS) procedure followed by a hierarchical cluster analysis of the MDS stimulus coordinates. The results indicated a strong correspondence between the similarity data and the arrangement of items as prescribed in the test blueprint. The findings suggest that analyzing item similarity data with MDS and cluster analysis can provide substantive information pertaining to the content representation of a test. The advantages and disadvantages of using MDS and cluster analysis with item similarity data are discussed.
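A minimal sketch of the two-step analysis described: convert averaged similarity ratings to dissimilarities, scale them with nonmetric MDS, then cluster the resulting stimulus coordinates hierarchically. The library choices, dimensionality, and number of clusters are illustrative assumptions, not the study's settings.

```python
import numpy as np
from sklearn.manifold import MDS
from scipy.cluster.hierarchy import linkage, fcluster

def mds_then_cluster(similarity, n_dims=2, n_clusters=4):
    """similarity: items-x-items array of averaged judge ratings (higher =
    more similar). Returns MDS coordinates and hierarchical cluster labels."""
    dissimilarity = similarity.max() - similarity     # convert to distances
    np.fill_diagonal(dissimilarity, 0.0)
    mds = MDS(n_components=n_dims, dissimilarity="precomputed",
              metric=False, random_state=0)
    coords = mds.fit_transform(dissimilarity)
    tree = linkage(coords, method="ward")             # cluster the coordinates
    clusters = fcluster(tree, t=n_clusters, criterion="maxclust")
    return coords, clusters
```

The cluster labels can then be compared with the item groupings specified in the test blueprint to judge content representation.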

Journal ArticleDOI
TL;DR: In this paper, a single higher-order cluster analysis is used to group cluster mean profiles derived from several preliminary analyses, and the results are confirmed when each higher order cluster contains one clu...
Abstract: A single higher-order cluster analysis can be used to group cluster mean profiles derived from several preliminary analyses. Replication is confirmed when each higher-order cluster contains one clu...

Journal ArticleDOI
TL;DR: In this article, the effect of reviewing items and altering responses on the efficiency of computerized adaptive tests and the resultant ability estimates of examinees was explored, and the average efficiency of the test was decreased by 1% after review.
Abstract: The effect of reviewing items and altering responses on the efficiency of computerized adaptive tests and the resultant ability estimates of examinees were explored. 220 students were randomly assigned to a review condition; their test instructions indicated that each item must be answered when presented, but that the responses could be reviewed and altered at the end of the test. A sample of 492 students did not have the opportunity to review and alter responses. Within the review condition, examinee ability estimates before and after review were correlated .98. The average efficiency of the test was decreased by 1% after review. Approximately 32% of the examinees improved their ability estimates after review, but did not change their pass/fail status. Disallowing review on adaptive tests administered under these rules is not supported by these data.

Journal ArticleDOI
TL;DR: The derivations of several item selection algorithms for use in fitting test items to target information functions (IFs) are described, indicating that the algorithms provided reliable fit to the target in terms of item parameters, test information functions, and expected score distributions.
Abstract: The derivations of several item selection algorithms for use in fitting test items to target information functions (IFs) are described. These algorithms circumvent iterative solutions by using the criteria of moving averages of the distance to a target IF and by simultaneously considering an entire range of ability points used to condition the IFs. The algorithms were tested by generating six forms of an ACT math test, each fit to an existing target test, including content-designated item subsets. The results indicate that the algorithms provided reliable fit to the target in terms of item parameters, test information functions, and expected score distributions.
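The algorithms in the paper avoid iterative solutions with moving-average criteria; the simpler greedy sketch below conveys only the basic fit-to-target idea over a grid of ability points, using 3PL item information. It is not the authors' algorithm and ignores content constraints.

```python
import numpy as np

def item_information(theta, a, b, c):
    """Fisher information of a 3PL item at each theta value."""
    p = c + (1 - c) / (1 + np.exp(-a * (theta - b)))
    return (a**2) * ((p - c) / (1 - c))**2 * (1 - p) / p

def fit_to_target(pool, target, theta=np.linspace(-3, 3, 13), n_items=30):
    """Greedy selection: at each step add the pool item that brings the
    summed test information closest to the target IF on the theta grid.
    pool is a list of (a, b, c) tuples."""
    selected, test_info = [], np.zeros_like(theta)
    infos = [item_information(theta, *item) for item in pool]
    for _ in range(n_items):
        best, best_dist = None, np.inf
        for j, info in enumerate(infos):
            if j in selected:
                continue
            dist = np.mean(np.abs(target - (test_info + info)))
            if dist < best_dist:
                best, best_dist = j, dist
        selected.append(best)
        test_info += infos[best]
    return selected, test_info
```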

Journal ArticleDOI
TL;DR: In this article, a mathematical programming model for constructing tests with a prespecified test information function and a heuristic for assigning items to tests such that their information functions are equal play an important role in the methods.
Abstract: Methods are proposed for the construction of weakly parallel tests, that is, tests with the same test information function. A mathematical programming model for constructing tests with a prespecified test information function and a heuristic for assigning items to tests such that their information functions are equal play an important role in the methods. The MI and MIDI methods are proposed for constructing tests with a prespecified test information function applying the Minimax model. Similar methods, MAMI and MADI, are provided for construction of a weakly parallel test approximately equal with respect to the Maximin criterion. The four methods were applied to a real item bank of 600 items from college placement mathematics tests (520 items were from 13 previously administered American College Testing Assessment Program tests, and 80 were from the Collegiate Mathematics Placement Program). The numerical examples indicated that the tests were constructed quickly and that the heuristic gave good results. However, the heuristic was not applicable for every set of practical constraints (i.e., constraints with respect to test administration time, test composition, or dependencies between items). Four tables and four graphs present information about the constructed tests.

Journal ArticleDOI
TL;DR: In this article, a two-stage process that considers the multidimensionality of tests under the framework of unidimensional item response theory (IRT) is described and evaluated, and items are clustered in the first stage.
Abstract: A two-stage process that considers the multidimensionality of tests under the framework of unidimensional item response theory (IRT) is described and evaluated. In the first stage, items are clust...

Journal ArticleDOI
TL;DR: In this article, a computerized adaptive test based on the nominal response model (NR CAT) was implemented, its item pool requirements were examined, and its performance was compared with that of a CAT based on the three-parameter logistic (3PL) model.
Abstract: Although most computerized adaptive tests (CATs) use dichotomous item response theory (IRT) models, research on the use of polytomous IRT models in CAT has shown promising results. This study implemented a CAT based on the nominal response model (NR CAT). Item pool requirements for the NR CAT were examined. The performance of the NR CAT and a CAT based on the three-parameter logistic (3PL) model was compared. For two-, three-, and four-category items, items with maximum information of at least .16 produced reasonably accurate trait estimation for tests with a minimum test length of approximately 15 to 20 items. The NR CAT was able to produce trait estimates comparable to those of the 3PL CAT. Implications of these results are discussed.
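Because the item pool requirement is stated in terms of maximum item information, it may help to see the nominal response model and its information function, which equals the variance of the category slopes under the category probabilities. This is a sketch of Bock's nominal model in general form; the parameter values are made up, not taken from the study's pool.

```python
import numpy as np

def nominal_probs(theta, a, c):
    """Category probabilities under the nominal response model:
    P_k(theta) = exp(a_k * theta + c_k) / sum_v exp(a_v * theta + c_v)."""
    z = np.asarray(a) * theta + np.asarray(c)
    z -= z.max()                      # numerical stability
    ez = np.exp(z)
    return ez / ez.sum()

def nominal_information(theta, a, c):
    """Item information: I(theta) = sum_k P_k a_k^2 - (sum_k P_k a_k)^2."""
    p = nominal_probs(theta, a, c)
    a = np.asarray(a, float)
    return p @ a**2 - (p @ a) ** 2

# A hypothetical three-category item, evaluated at theta = 0
print(nominal_information(0.0, a=[-0.7, 0.0, 0.7], c=[0.0, 0.5, -0.5]))
```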

Journal ArticleDOI
TL;DR: In this article, the capability of DIMTEST in assessing essential unidimensionality of item responses to real tests was investigated; some test data were found to fit an essentially unidimensional model and others were not.
Abstract: The capability of DIMTEST in assessing essential unidimensionality of item responses to real tests was investigated. DIMTEST found that some test data fit an essentially unidimensional model and ot...

Journal ArticleDOI
TL;DR: In this article, an approximate statistical test is derived for the hypothesis that the intraclass reliability coefficients associated with two measurement procedures are equal, and control of Type 1 error is investigated by comparing empirical sampling distributions of the test statistic with its derived theoretical distribution.
Abstract: An approximate statistical test is derived for the hypothesis that the intraclass reliability coefficients associated with two measurement procedures are equal. Control of Type 1 error is investigated by comparing empirical sampling distributions of the test statistic with its derived theoretical distribution. A numerical illustration of the procedure is also presented.

Journal ArticleDOI
TL;DR: The direct-product model has been suggested as a procedure for estimating multiplicative effects of traits and methods in multitrait-multimethod matrices as discussed by the authors, which has been extended in two ways: first, hierarchically nested models are derived for explicitly testing the overall and specific patterns of method and trait factors.
Abstract: The direct-product model has been suggested as a procedure for estimating multiplicative effects of traits and methods in multitrait-multimethod matrices. Research on the direct-product model is extended in two ways. First, hierarchically nested models are derived for explicitly testing the overall and specific patterns of method and trait factors. Second, formal tests are developed for the pattern of communalities. These procedures are illustrated with data from Lawler (1967)

Journal ArticleDOI
TL;DR: The IRTDIF program was written in IBM Professional FORTRAN for IBM and compatible personal computers and uses subroutines taken from Numerical Recipes to compute the percentage points of the incomplete gamma functions.
Abstract: IRT models. To compute the DIF measures and the statistics to test the significance of the DIF measures, IRTDIF uses two files. One file contains sets of item parameter estimates; the other contains the sampling variance-covariance matrices. Significance levels (p values) are provided for Lord's χ² and the exact area measures. When the sampling variance-covariance matrices are not available, the exact and closed-interval area measures are provided without statistical significance tests. The program was written in IBM Professional FORTRAN for IBM and compatible personal computers and uses subroutines taken from Numerical Recipes (Press, Flannery, Teukolsky, & Vetterling, 1986) to compute the percentage points of the incomplete gamma functions. Execution of the program requires a numerical coprocessor.

Journal ArticleDOI
TL;DR: DIMTEST tests the hypothesis that the model underlying a matrix of binary item responses, generated by administering a test to a specific examinee population, is essentially unidimensional.
Abstract: DIMTEST is a statistical test developed by Stout (1987), and refined by Nandakumar & Stout (in press; see also Nandakumar, in press). DIMTEST tests the hypothesis that the model underlying a matrix of binary item responses, generated by administering a test to a specific examinee population, is essentially unidimensional (essential dimensionality is a mathematical formulation of the existence of one dominant latent dimension).

Journal ArticleDOI
TL;DR: To estimate test reliability and to create parallel tests, test items frequently are matched and algorithms are presented based on optimization theory in networks (graphs) and have polynomial complexity.
Abstract: To estimate test reliability and to create parallel tests, test items frequently are matched. Items can be matched by splitting tests into parallel test halves, by creating T splits, or by matching a desired test form. Problems often occur. Algorithms are presented to solve these problems. The algorithms are based on optimization theory in networks (graphs) and have polynomial complexity. Computational results from solving sample problems with several hundred decision variables are reported

Journal ArticleDOI
TL;DR: A flexible data analysis approach is proposed that combines two psychometric procedures— seriation and multidimensional scaling (MDS) that is particularly appropriate for the analysis of proximities containing temporal information.
Abstract: A number of model-based scaling methods have been developed that apply to asymmetric proximity matrices. A flexible data analysis approach is proposed that combines two psychometric procedures—seriation and multidimensional scaling (MDS). The method uses seriation to define an empirical ordering of the stimuli, and then uses MDS to scale the two separate triangles of the proximity matrix defined by this ordering. The MDS solution contains directed distances, which define an "extra" dimension that would not otherwise be portrayed, because the dimension comes from relations between the two triangles rather than within triangles. The method is particularly appropriate for the analysis of proximities containing temporal information. A major difficulty is the computational intensity of existing seriation algorithms, which is handled by defining a nonmetric seriation algorithm that requires only one complete iteration. The procedure is illustrated using a matrix of co-citations between recent presidents o...

Journal ArticleDOI
TL;DR: In this article, the authors investigated the effects of the item response theory (IRT) model and test length on the distribution of three appropriateness indexes and their cutoff values at three false positive rates.
Abstract: The extent to which three appropriateness indexes - Z3, ECIZ4, and W (a variation of Wright's person-fit statistic) - are well-standardized was investigated in a Monte Carlo study. To assess the effects of the item response theory (IRT) model and test length on the distribution of the indexes and their cutoff values at three false positive rates, nonaberrant response patterns were generated. ECIZ4 most closely approximated a normal distribution, showing less skewness and kurtosis than Z3 and W. The ECIZ4 cutoff values were affected less by test length and the IRT model than were those of Z3 and W. In contrast, the distribution of W was the least stable over replications, and its cutoff values varied greatly depending on the IRT model and test length.
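Indexes such as Z3 are closely related to the standardized log-likelihood person-fit statistic; a sketch of that general form is given below. It is not the exact definition of Z3, ECIZ4, or W as compared in the study.

```python
import numpy as np

def lz_person_fit(responses, p):
    """Standardized log-likelihood person-fit index for one examinee.

    responses : 0/1 vector of item responses
    p         : model probabilities of a correct response at the examinee's
                estimated theta
    Large negative values flag aberrant (unexpected) response patterns."""
    u, p = np.asarray(responses, float), np.asarray(p, float)
    l0 = np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))          # observed log-likelihood
    expected = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))    # its expectation
    variance = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)     # its variance
    return (l0 - expected) / np.sqrt(variance)
```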

Journal ArticleDOI
TL;DR: In this article, an external measure can replace one of the raters so that individual reliabilities of two independent raters can be estimated; in a somewhat similar fashion, estimates of treatment effects present in ratings by two independent raters can provide the external frame of reference against which differences in their individual reliabilities can be evaluated.
Abstract: Rating scales have no inherent reliability that is independent of the observers who use them. The often reported interrater reliability is an average of perhaps quite different individual rater reliabilities. It is possible to separate out the individual rater reliabilities given a number of independent raters who observe the same sample of ratees. Under certain assumptions, an external measure can replace one of the raters, and individual reliabilities of two independent raters can be estimated. In a somewhat similar fashion, estimates of treatment effects present in ratings by two independent raters can provide the external frame of reference against which differences in their individual reliabilities can be evaluated. Models for estimating individual rater reliabilities are provided for use in selecting, evaluating, and training participants in clinical research.

Journal ArticleDOI
TL;DR: The extreme groups research strategy is a two-stage measurement procedure that may be employed when it is relatively simple and inexpensive to obtain data on a psychological variable (X) in the first stage of investigation, but it is quite complex and expensive to measure subsequently a second variable (Y), as discussed by the authors.
Abstract: The extreme groups research strategy is a two-stage measurement procedure that may be employed when it is relatively simple and inexpensive to obtain data on a psychological variable (X) in the first stage of investigation, but it is quite complex and expensive to measure subsequently a second variable (Y). This strategy is related to the selection of upper and lower groups for item discrimination analysis (Kelley, 1939) and to the treatments x blocks design in which participants are first "blocked" on the X variable and then only the extreme (highest and lowest means) blocks are compared on the Y variable, usually by a t test or an analysis of variance. Feldt (1961) showed analytically that if the population correlation coefficient between X and Y is ρ = .10, the power of the t test is maximized if each extreme group consists of 27% of the population tested on the X variable. However, Feldt's derivation assumes that the X and Y variables are normally distributed. The present study employed a Monte Car...
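Feldt's 27% result is easy to probe with a small Monte Carlo sketch like the one below: draw correlated (X, Y) pairs, keep the extreme groups on X, and record how often the t test on Y rejects. The sample size, replication count, and distributions here are illustrative assumptions, not those of the study.

```python
import numpy as np
from scipy import stats

def extreme_groups_power(rho=0.10, prop=0.27, n=500, reps=2000, alpha=0.05, seed=0):
    """Estimate the power of the extreme-groups t test for a given
    selection proportion; with bivariate normal data, power should peak
    near prop = .27, as in Feldt (1961)."""
    rng = np.random.default_rng(seed)
    k = int(round(prop * n))
    hits = 0
    for _ in range(reps):
        x = rng.standard_normal(n)
        y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)  # corr(X, Y) = rho
        order = np.argsort(x)
        low, high = y[order[:k]], y[order[-k:]]       # bottom and top groups on X
        if stats.ttest_ind(high, low).pvalue < alpha:
            hits += 1
    return hits / reps

print(extreme_groups_power(prop=0.27), extreme_groups_power(prop=0.10))
```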

Journal ArticleDOI
TL;DR: In these studies, two joint maximum likelihood estimation methods (LOGIST 2B and LOGIST 5) and two marginal maximum likelihood estimations methods (BILOG and ForScore) were contrasted by measuring the difference between a simulation model and a model obtained by applying an estimation method to simulation data.
Abstract: Two psychometric models with very different parametric formulas and item response functions can make virtually the same predictions in all applications. By applying some basic results from the theory of hypothesis testing and from signal detection theory, the power of the most powerful test for distinguishing the models can be computed. Measuring model misspecification by computing the power of the most powerful test is proposed. If the power of the most powerful test is low, then the two models will make nearly the same prediction in every application. If the power is high, there will be applications in which the models will make different predictions. This measure, that is, the power of the most powerful test, places various types of model misspecification—item parameter estimation error, multidimensionality, local independence failure, learning and/or fatigue during testing—on a common scale. The theory supporting the method is presented and illustrated with a systematic study of misspecifica tion...