
Showing papers in "Applied Psychological Measurement in 1999"


Journal ArticleDOI
TL;DR: In this article, a multistage adaptive testing approach that factors a into the item selection process is proposed, where the items in the item bank are stratified into a number of levels based on their a values.
Abstract: Computerized adaptive tests (CAT) commonly use item selection methods that select the item which provides maximum information at an examinee's estimated trait level. However, these methods can yield extremely skewed item exposure distributions. For tests based on the three-parameter logistic model, it was found that administering items with low discrimination parameter (a) values early in the test and administering those with high a values later was advantageous; the skewness of item exposure distributions was reduced while efficiency was maintained in trait level estimation. Thus, a new multistage adaptive testing approach is proposed that factors a into the item selection process. In this approach, the items in the item bank are stratified into a number of levels based on their a values. The early stages of a test use items with lower a values and later stages use items with higher a values. At each stage, items are selected according to an optimization criterion from the corresponding level. Simulation studies were ...
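A minimal sketch of the stratified selection step, in Python. The within-stratum criterion shown here (matching item difficulty b to the current trait estimate) is an assumption; the abstract only says an optimization criterion is applied within each level.

```python
import numpy as np

def a_stratified_select(bank, theta_hat, stage, n_stages, administered):
    """Pick the next item under an a-stratified design (sketch).

    bank: dict with numpy arrays 'a' (discrimination) and 'b' (difficulty).
    Items are ranked by a and split into n_stages strata; stage k draws
    only from stratum k, so low-a items are used early, high-a items late.
    """
    order = np.argsort(bank['a'])                 # rank items by a
    strata = np.array_split(order, n_stages)      # equal-sized a strata
    candidates = [i for i in strata[stage] if i not in administered]
    # assumed criterion: unused item whose difficulty is closest to theta_hat
    return min(candidates, key=lambda i: abs(bank['b'][i] - theta_hat))
```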

277 citations


Journal ArticleDOI
TL;DR: In this paper, an alternative index, r*wg(J), is recommended, which is an inverse linear function of the ratio of the average obtained variance to the variance of uniformly distributed random error.
Abstract: The commonly used form of rwg(J) can display irregular behavior, so four variants of this index were examined. An alternative index, r*wg(J), is recommended. This index is an inverse linear function of the ratio of the average obtained variance to the variance of uniformly distributed random error. r*wg(J) is superficially similar to Cronbach's α, but careful examination confirms that r*wg(J) is an index of agreement, not reliability. Based on an examination of the small-sample behavior of rwg and r*wg(J), sample sizes of 10 or more raters are recommended.
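Following the abstract's description, r*wg(J) is one minus the ratio of the average obtained variance to the uniform-error variance. A minimal sketch, assuming the usual discrete-uniform null variance (A² − 1)/12 for an A-option scale:

```python
import numpy as np

def r_wg_star(ratings, n_options):
    """r*wg(J) for one group: 1 - (mean item variance / uniform variance).

    ratings: (n_raters, J) array of ratings on an n_options-point scale.
    """
    s2_mean = ratings.var(axis=0, ddof=1).mean()   # average obtained variance
    sigma2_eu = (n_options ** 2 - 1) / 12          # uniform random-error variance
    return 1 - s2_mean / sigma2_eu
```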

256 citations


Journal ArticleDOI
TL;DR: The elements of CAT discussed here include item selection procedures, estimation of the latent trait, item exposure, measurement precision, and item bank development.
Abstract: Use of computerized adaptive testing (CAT) has increased substantially since it was first formulated in the 1970s. This paper provides an overview of CAT and introduces the contributions to this Special Issue. The elements of CAT discussed here include item selection procedures, estimation of the latent trait, item exposure, measurement precision, and item bank development. Some topics for future research are also presented.

145 citations


Journal ArticleDOI
TL;DR: An item-selection algorithm is proposed for neutralizing the differential effects of time limits on computerized adaptive test scores based on a statistical model for distributions of examinees’ response times on items in a bank that is updated each time an item is administered.
Abstract: An item-selection algorithm is proposed for neutralizing the differential effects of time limits on computerized adaptive test scores. The method is based on a statistical model for distributions of examinees' response times on items in a bank that is updated each time an item is administered. Predictions from the model are used as constraints in a 0-1 linear programming model for constrained adaptive testing that maximizes the accuracy of the trait estimator. The method is demonstrated empirically using an item bank from the Armed Services Vocational Aptitude Battery.
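A plausible core of such a 0-1 linear programming model, written out under assumed notation: x_i selects item i, I_i is Fisher information at the current trait estimate, Ê[T_i] is the predicted response time from the updated model, and t_rem and n_rem are the remaining time and items.

```latex
\max_{x}\; \sum_{i=1}^{N} I_i(\hat\theta)\, x_i
\quad \text{s.t.} \quad
\sum_{i=1}^{N} \widehat{E}[T_i]\, x_i \le t_{\mathrm{rem}}, \qquad
\sum_{i=1}^{N} x_i = n_{\mathrm{rem}}, \qquad x_i \in \{0,1\}.
```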

117 citations


Journal ArticleDOI
TL;DR: In this article, the authors examined the polytomous-DFIT framework and found that it was effective in identifying DTF and DIF for the simulated conditions, but the DTF index did not perform as consistently as the DIF index.
Abstract: Raju, van der Linden, & Fleer (1995) proposed an item response theory based, parametric differential item functioning (DIF) and differential test functioning (DTF) procedure known as differential functioning of items and tests (DFIT). According to Raju et al., the DFIT framework can be used with unidimensional and multidimensional data that are scored dichotomously and/or polytomously. This study examined the polytomous-DFIT framework. Factors manipulated in the simulation were: (1) length of test (20 and 40 items), (2) focal group distribution, (3) number of DIF items, (4) direction of DIF, and (5) type of DIF. The findings provided promising results and indicated directions for future research. The polytomous DFIT framework was effective in identifying DTF and DIF for the simulated conditions. The DTF index did not perform as consistently as the DIF index. The findings are similar to those of unidimensional and multidimensional DFIT studies.
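The DFIT indices lend themselves to a compact Monte-Carlo-style computation. A sketch under the standard DFIT definitions (NCDIF for item i is the focal-group expectation of the squared difference in expected item scores; DTF is the analogous test-level quantity); the array layout is an assumption:

```python
import numpy as np

def ncdif_dtf(es_focal, es_ref):
    """NCDIF per item and DTF for the test (sketch).

    es_focal, es_ref: (n_examinees, n_items) expected item scores computed
    at focal-group examinees' trait values from focal- and reference-group
    item parameters, respectively.
    """
    d = es_focal - es_ref                  # item-level differences d_i
    ncdif = (d ** 2).mean(axis=0)          # E[d_i^2], one value per item
    dtf = (d.sum(axis=1) ** 2).mean()      # E[(sum_i d_i)^2] at test level
    return ncdif, dtf
```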

115 citations


Journal ArticleDOI
TL;DR: In this study, a method based on Kullback-Leibler information (KLI) was evaluated and showed that testing algorithms using KLI-based item selection performed better than or as well as those using Fisher information (FI) based item selection.
Abstract: Wald’s (1947) sequential probability ratio test can be implemented as an adaptive test for classifying examinees into categories. However, current implementations use an item selection method that ...
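For a dichotomous item, the KL information between two trait values (say, points on either side of the SPRT cutscore) has a closed form. A sketch assuming a 3PL response function and assumed evaluation points theta1 and theta2; the candidate item maximizing this quantity would be selected:

```python
import numpy as np

def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def kl_info(a, b, c, theta1, theta2):
    """KL information of one item for discriminating theta2 from theta1."""
    p1, p2 = p3pl(theta1, a, b, c), p3pl(theta2, a, b, c)
    return p2 * np.log(p2 / p1) + (1 - p2) * np.log((1 - p2) / (1 - p1))
```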

112 citations


Journal ArticleDOI
TL;DR: In this article, flexible methods that relax restrictive conditional independence assumptions of latent class analysis (LCA) are described, and the relationship between the multivariate probit mixture model proposed here and Rost's mixed Rasch (1990, 1991) model is discussed.
Abstract: Flexible methods that relax restrictive conditional independence assumptions of latent class analysis (LCA) are described. Dichotomous and ordered category manifest variables are viewed as discretized latent continuous variables. The latent continuous variables are assumed to have a mixture-of-multivariate-normals distribution. Within a latent class, conditional dependence is modeled as the mutual association of all or some latent continuous variables with a continuous latent trait (or in special cases, multiple latent traits). The relaxation of conditional independence assumptions allows LCA to better model natural taxa. Comparisons of specific restricted and unrestricted models permit statistical tests of specific aspects of latent taxonic structure. Latent class, latent trait, and latent distribution analysis can be viewed as special cases of the mixed latent trait model. The relationship between the multivariate probit mixture model proposed here and Rost's mixed Rasch (1990, 1991) model is discussed. Two...

96 citations


Journal ArticleDOI
TL;DR: In this article, a three-part simulation study was conducted to investigate the theoretical distribution of the lz and lz across trait across trait for CAT and P&P tests.
Abstract: Several person-fit statistics have been proposed to detect item score patterns that do not fit an item response theory model. To classify response patterns as misfitting, the distribution of a person-fit statistic is needed. The theoretical null distributions of several fit statistics have been derived for paper-and-pencil (P&P) tests. However, it is unknown whether these distributions also hold for computerized adaptive tests (CAT). A three-part simulation study was conducted. In the first study, the theoretical distribution of the lz statistic across trait. 0levels for CAT and P&P tests was investigated. The distribution of the l*z statistic proposed by Snijders (in press) was also investigated. Results indicated that the distribution of both lz and l*z differed from the theoretical distribution in CAT. The second study examined the distributions of lzand l*z using simulation. These simulated distributions, when based on O [UNKNOWN], were found to be problematic in CAT. In the third study, the detection rates of l*z and lz were compared. The rates for both statistics were found to be similar in most cases
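For reference, the lz statistic standardizes the log-likelihood of a response pattern. A minimal sketch of the usual formula (Drasgow, Levine, & Williams, 1985); in CAT the probabilities would be evaluated at the trait estimate, which is exactly the complication the study examines:

```python
import numpy as np

def lz(u, p):
    """Standardized log-likelihood person-fit statistic lz.

    u: 0/1 response vector; p: model probabilities of a correct response
    evaluated at the (estimated) trait level.
    """
    q = 1 - p
    l0 = np.sum(u * np.log(p) + (1 - u) * np.log(q))   # observed log-likelihood
    e = np.sum(p * np.log(p) + q * np.log(q))          # expectation of l0
    v = np.sum(p * q * np.log(p / q) ** 2)             # variance of l0
    return (l0 - e) / np.sqrt(v)
```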

73 citations


Journal ArticleDOI
TL;DR: In this paper, the sample size ratio (SSR), the latent trait distribution (LD), and the amount of item information were used to estimate the item parameters in the nominal response model.
Abstract: Establishing guidelines for reasonable item parameter estimation is fundamental to use of the nominal response model. Factors studied were the sample size ratio (SSR), latent trait distribution (LD), and amount of item information. Results showed that the LD accounted for 42.5% of the variability in the accuracy of estimating the slope parameter; the SSR and the maximum item information factors accounted for 29.5% and 3.5% of the accuracy, respectively. In general, as the LD departed from a normal distribution, a larger number of examinees was required to accurately estimate the slope and intercept parameters. Results indicated that an SSR of 10:1 can produce reasonably accurate item parameter estimates when the LD is normal.
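For context, Bock's nominal response model gives category probabilities as a multinomial logit of the latent trait, with one slope and one intercept per category. A small sketch:

```python
import numpy as np

def nrm_probs(theta, a, c):
    """Nominal response model category probabilities:
    P_k = exp(a_k*theta + c_k) / sum_j exp(a_j*theta + c_j)."""
    z = a * theta + c
    z -= z.max()            # subtract max for numerical stability
    ez = np.exp(z)
    return ez / ez.sum()
```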

71 citations


Journal ArticleDOI
TL;DR: Person-fit indices (lz and multitest lzm) derived from item response theory and used to identify misfitting examinees were computed based on responses to cognitive ability and personality tests as discussed by the authors.
Abstract: Person-fit indices (lz and multitest lzm) derived from item response theory and used to identify misfitting examinees were computed based on responses to cognitive ability and personality tests. lz indices from different ability domains within the cognitive tests were uncorrelated with each other; lz indices from different tests within the personality domain were moderately intercorrelated. Cross-domain correlations were near 0. Test-taking motivation and conscientiousness were correlated moderately with multitest lzm for personality tests and to a lesser extent for cognitive tests. Test reactions were uncorrelated with any of the lz measures. Males had higher mean lz values than females. This difference could be partly attributed to differences in conscientiousness. African-Americans had higher mean lz values than Whites. This effect could not be accounted for by test-taking motivation or conscientiousness. High values of lz affected the criterion-related validity of the set of cognitive tests such that the validity...

65 citations


Journal ArticleDOI
Tenko Raykov1
TL;DR: In this paper, a latent variable modeling approach is discussed that focuses on ability change scores and allows estimation of both individual latent change scores and the relationship of ability change scores to other variables.
Abstract: This paper complements recent discussions about the reliability of observed change scores (Collins, 1996a; Humphreys, 1996; Williams & Zimmerman, 1996a, 1996b). It is argued that modeling change on the latent dimension of interest is a better approach to measuring change than focusing on observed change scores and their properties. It is proposed that research be directed toward correlates and predictors of ability change (Rogosa & Willett, 1985b) and away from recorded change scores and their reliability. A latent variable modeling approach is discussed that focuses on ability change scores. It permits estimation of both individual latent change scores and the relationship of ability change scores to other variables.

Journal ArticleDOI
TL;DR: In this paper, an empirical Monte Carlo study was performed using predictor and criterion data from 84,808 U.S. Air Force enlistees, and 500 estimates for each of 9 validity and 11 cross-validity estimation procedures were generated for each sample size condition.
Abstract: An empirical Monte Carlo study was performed using predictor and criterion data from 84,808 U.S. Air Force enlistees. 501 samples were drawn for each of seven sample size conditions: 25, 40, 60, 80, 100, 150, and 200. Using an eight-predictor model, 500 estimates for each of 9 validity and 11 cross-validity estimation procedures were generated for each sample size condition. These estimates were then compared to the actual squared population validity and cross-validity in terms of mean bias and mean squared bias. For the regression models determined using ordinary least squares, the Ezekiel procedure produced the most accurate estimates of squared population validity (followed by the Smith and the Wherry procedures), and Burket's formula resulted in the best estimates of squared population cross-validity. Other analyses compared the coefficients determined by traditional empirical cross-validation and equal weights; equal weights resulted in no loss of predictive accuracy and less shrinkage. Numerous issu...
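For reference, the Ezekiel procedure is commonly given as the familiar adjusted-R² shrinkage formula; a one-line sketch (the cross-validity formulas such as Burket's take different forms):

```python
def ezekiel_r2(r2, n, p):
    """Ezekiel shrinkage estimate of squared population validity:
    1 - (1 - R^2) * (n - 1) / (n - p - 1), with n cases and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
```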

Journal ArticleDOI
TL;DR: In this article, a procedure for empirical initialization of the trait (θ) estimator is proposed, based on the statistical relation between θ and background variables known prior to test administration, modeled using a two-parameter version of a logistic item response theory model with manifest predictors discussed in Zwinderman (1991).
Abstract: A procedure for empirical initialization of the trait (θ) estimator in adaptive testing is proposed that is based on the statistical relation between θ and background variables known prior to test administration. The relation is modeled using a two-parameter version of a logistic item response theory model with manifest predictors discussed in Zwinderman (1991). Equations are provided that are necessary for estimating the parameters from an incomplete sample of response data and data on background variables. The procedure is illustrated for an adaptive version of a test from the Dutch General Aptitude Test Battery, with response time on a prior test as a background variable.
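A rough sketch of the idea: estimate the regression of θ on the background variable in a calibration sample, then use the predicted value as the CAT's starting θ. Ordinary least squares stands in here for the logistic IRT model with manifest predictors that the paper actually uses (an assumption):

```python
import numpy as np

def initial_theta(x_new, x_cal, theta_cal):
    """Empirical starting value for theta from one background variable.

    x_cal, theta_cal: background values and trait estimates for a
    calibration sample; x_new: background value for the new examinee.
    """
    X = np.column_stack([np.ones(len(x_cal)), x_cal])
    beta, *_ = np.linalg.lstsq(X, theta_cal, rcond=None)
    return beta[0] + beta[1] * x_new
```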

Journal ArticleDOI
TL;DR: In this article, the use of a beta prior in trait estimation was extended to the maximum a posteriori (MAP) method of Bayesian estimation, called essentially unbiased MAP (EU-MAP).
Abstract: The use of a beta prior in trait estimation was extended to the maximum a posteriori (MAP) method of Bayesian estimation. This new method, called essentially unbiased MAP (EU-MAP), was compared with MAP (using a standard normal prior), essentially unbiased expected a posteriori, weighted likelihood, and maximum likelihood estimation methods. Comparisons were made based on the effects that the shape of prior distributions, different item bank characteristics, and practical constraints had on bias, standard error, and root-mean-square error (RMSE). Overall, EU-MAP performed best. This new method significantly reduced bias in fixed-length tests (though with a slight increase in RMSE) and performed reasonably well when a fixed posterior variance termination rule was used. Practical constraints had little effect on the bias of this method.
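A minimal sketch of MAP trait estimation with a switchable prior, assuming a 2PL likelihood for brevity; the beta-prior branch only illustrates the kind of prior studied here, with placeholder shape parameters and support:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import beta as beta_dist, norm

def map_estimate(u, a, b, prior="normal", lo=-4.0, hi=4.0, s1=1.2, s2=1.2):
    """MAP estimate of theta: maximize log-likelihood + log-prior.

    u: 0/1 responses; a, b: 2PL item parameters. prior="beta" rescales
    theta to [lo, hi] and applies a beta(s1, s2) density (placeholders).
    """
    def neg_log_post(theta):
        p = 1 / (1 + np.exp(-a * (theta - b)))
        ll = np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))
        lp = (norm.logpdf(theta) if prior == "normal"
              else beta_dist.logpdf((theta - lo) / (hi - lo), s1, s2))
        return -(ll + lp)
    return minimize_scalar(neg_log_post, bounds=(lo, hi), method="bounded").x
```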

Journal ArticleDOI
TL;DR: In this paper, the authors evaluate procedures that could identify these individuals by examining the application of person-fit indices in the adaptive test environment, using information from these indices, a new method was developed.
Abstract: The purpose of appropriateness/person-fit indices is to identify response patterns for which a given item response theory model is inappropriate for an examinee even though that model is appropriate for a group. This study was concerned with those cases in which examinees had prior knowledge of items from an item bank used to generate a computerized adaptive test (CAT) and used the memorized information to inflate their test scores. The objective was to evaluate procedures that could identify these individuals by examining the application of person-fit indices in the CAT environment. The lz and ECI4z indices were selected for comparison. Using information from these indices, a new method was developed. All three indices showed little power to detect the use of memorization. Some possibilities for altering a test when the model becomes inappropriate for an examinee are also discussed.

Journal ArticleDOI
TL;DR: In this article, a distinction is made between two concepts of measurement precision: reliability and information dependence, and it is shown that reliability is population dependent and information is examinee dependent.
Abstract: A distinction is necessary between two concepts of measurement precision. Reliability is population dependent and information is examinee dependent. Both concepts also apply to the simple gain scor...

Journal ArticleDOI
TL;DR: In this article, a new procedure for defining achievement levels on continuous scales was developed using aspects of Guttman scaling and item response theory, which assigns examinees to levels of achievement when the levels are represented by separate pools of multiple-choice items.
Abstract: A new procedure for defining achievement levels on continuous scales was developed using aspects of Guttman scaling and item response theory. This procedure assigns examinees to levels of achievement when the levels are represented by separate pools of multiple-choice items. Items were assigned to levels on the basis of their content and hierarchically defined level descriptions. The resulting level response functions were well-spaced and noncrossing. This result allowed well-spaced levels of achievement to be defined by a common percent-correct standard of mastery on the level pools. Guttman patterns of mastery could be inferred from level scores. The new scoring procedure was found to have higher reliability, higher classification consistency, and lower classification error, when compared to two Guttman scoring procedures.
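The level-assignment rule can be sketched in a few lines: an examinee holds the highest level whose item pool is mastered at the common percent-correct standard, with a Guttman pattern assumed once mastery first fails. The 70% standard below is a placeholder, not the paper's value:

```python
def assign_level(pct_correct_by_level, standard=0.70):
    """Assign the highest mastered achievement level (sketch).

    pct_correct_by_level: proportions correct on each level's item pool,
    ordered from lowest to highest level; returns 0 if none is mastered.
    """
    level = 0
    for k, pct in enumerate(pct_correct_by_level, start=1):
        if pct >= standard:
            level = k
        else:
            break   # Guttman pattern: stop at the first non-mastered pool
    return level
```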

Journal ArticleDOI
TL;DR: This article examined how three item response models performed when they were applied to data collected from a conventionally developed Likert-type personality scale, each model examined is based on a...
Abstract: This study examined how three item response models performed when they were applied to data collected from a conventionally developed Likert-type personality scale. Each model examined is based on a...

Journal ArticleDOI
TL;DR: In this paper, the authors describe procedures and computer programs for solving these problems using the methods described by Olkin and Finn, and extend these methods for any number of predictors or for partialing out any variable.
Abstract: Olkin & Finn (1995) developed expressions for confidence intervals for functions of simple, partial, and multiple correlations. This paper describes procedures and computer programs for solving these problems using the methods described by Olkin and Finn. The programs extend the methods for any number of predictors or for partialing out any number of variables.
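As a simple illustration of the kind of interval involved, here is the Fisher-z confidence interval for a single correlation; Olkin & Finn's methods generalize to functions of partial and multiple correlations, which this sketch does not cover:

```python
import numpy as np
from scipy.stats import norm

def corr_ci(r, n, conf=0.95):
    """Confidence interval for a simple correlation via Fisher's z."""
    z = np.arctanh(r)                      # Fisher z transform
    se = 1 / np.sqrt(n - 3)                # asymptotic standard error
    zcrit = norm.ppf(0.5 + conf / 2)
    return np.tanh(z - zcrit * se), np.tanh(z + zcrit * se)
```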

Journal ArticleDOI
TL;DR: In this article, three reliability estimates are derived for the Bayes modal estimate (BME) and the maximum likelihood estimate (MLE) of θ in computerized adaptive tests (CAT).
Abstract: Three reliability estimates are derived for the Bayes modal estimate (BME) and the maximum likelihood estimate (MLE) of θ in computerized adaptive tests (CAT). Each reliability estimate is a functio...

Journal ArticleDOI
TL;DR: The logistic versions of Samejima's (1969) graded response model and Muraki's (1992) generalized partial-credit model are parameterized differently by MULTILOG (Thissen, 1991) and PARSCALE (Muraki, 1992) as mentioned in this paper.
Abstract: The logistic versions of Samejima’s (1969) graded response model and Muraki’s (1992) generalized partial-credit model are parameterized differently by MULTILOG (Thissen, 1991) and PARSCALE (Muraki ...

Journal ArticleDOI
TL;DR: In this paper, a model-oriented approach to studying processes and strategies underlying the incorrect/correct responses to cognitive test tasks is presented, which is contrasted with a dataoriented approach in which verbal explanations for incorrect or correct responses are collected during the test phase and incorporated in the scoring.
Abstract: Componential item response theory (CIRT) is presented as a model-oriented approach to studying processes and strategies underlying the incorrect/correct responses to cognitive test tasks. CIRT is contrasted with a data-oriented approach in which verbal explanations for incorrect/correct responses are collected during the test phase and incorporated in the scoring. Alternatively, the psychologically meaningful data are modeled by unidimensional item response theory models. Verbal explanations for each examinee and task were collected from transitive reasoning tasks in addition to the incorrect/correct responses. Two datasets were compiled, one reflecting the common incorrect/correct scoring and one showing whether a deductive strategy had been used to produce a correct response. The Mokken model of monotone homogeneity, the partial-credit model, and the generalized one-parameter logistic model were used to analyze both polytomous datasets. Results showed that combining knowledge of solution strategies with...

Journal ArticleDOI
TL;DR: In this paper, an approximate statistical test is developed for the hypothesis of equality between the Spearman-Brown extrapolations of two independent values of Cronbach's alpha reliability coefficient (α), assuming that the units added to or deleted from each instrument are classically parallel to the units included in the original version of each instrument.
Abstract: An approximate statistical test is developed for the hypothesis of equality between the Spearman-Brown extrapolations of two independent values of Cronbach's alpha reliability coefficient (α). This test assumes that the units added to or deleted from each instrument are classically parallel to the units included in the original version of each instrument. The projections for Tests 1 and 2 are based on lengthening or shortening factors of K1 and K2, which may or may not be equal. Special cases of this test include applications in which the projected values are intraclass coefficients or only one of the instruments is presumed to be altered in length. Monte Carlo simulations demonstrated that the procedure effectively controls Type I error even when the original αs are based on as few as two test parts or two raters.
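The Spearman-Brown extrapolation at the heart of the test is the standard projection of a reliability coefficient under lengthening or shortening; a one-line sketch:

```python
def spearman_brown(alpha, k):
    """Projected reliability when a test is lengthened (k > 1) or
    shortened (k < 1) by factor k: k*alpha / (1 + (k - 1)*alpha)."""
    return k * alpha / (1 + (k - 1) * alpha)

# e.g., doubling a test with alpha = .80:
# spearman_brown(0.80, 2) -> approximately 0.889
```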

Journal ArticleDOI
TL;DR: In this article, a simulation study was conducted to determine how well two models for local item dependency (LID), called interaction models, could be distinguished, and the results indicated that if the interaction parameter is not too extreme, the COIM will be rejected in favor of the true model, while finding the true weight required a large sample size.
Abstract: A simulation study was conducted to determine how well two models for local item dependency (LID), called interaction models, could be distinguished. The models examined were the constant-order interaction model (COIM) and the dimension dependent interaction model (DDIM). Data were simulated according to the latter model. Three factors were manipulated: sample size, the weight of the difference between the latent trait value of the examinee and the interaction parameter, and the value of the interaction parameter. Results indicated that (1) if the interaction parameter is not too extreme, the COIM will be rejected in favor of the true model (the Rasch model fit poorly for all levels of the interaction parameter); (2) a larger weight of the difference between the latent trait value and the interaction parameter facilitated the rejection of the COIM, although finding the true weight required a large sample size; and (3) the value for the interaction parameter with an optimal discrimination between the COIM a...

Journal ArticleDOI
TL;DR: This paper derives discrimination parameter values, as functions of the guessing parameter and distances between person parameters and item difficulty, that yield maximum information for the three-parameter logistic item response theory model.
Abstract: Items with the highest discrimination parameter values in a logistic item response theory model do not necessarily give maximum information. This paper derives discrimination parameter values, as functions of the guessing parameter and distances between person parameters and item difficulty, that yield maximum information for the three-parameter logistic item response theory model. An upper bound for information as a function of these parameters is also derived. An algorithm is suggested for the maximum information item selection criterion for adaptive testing and is compared with a full bank search algorithm.
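For reference, the 3PL item information function, written in the logistic metric without the D = 1.7 scaling; it makes visible why a larger a does not always mean more information once c > 0 and θ is far from b:

```python
import numpy as np

def info_3pl(theta, a, b, c):
    """3PL Fisher information: a^2 * (Q/P) * ((P - c)/(1 - c))^2."""
    p = c + (1 - c) / (1 + np.exp(-a * (theta - b)))
    return a ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2
```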

Journal ArticleDOI
TL;DR: NNORMULT implements the multivariate extension of the Fleishman power method for generating simulated multivariate nonnormal data; its companion program, PWRCOEFF, derives the required power transformation constants from the desired population skew, kurtosis, and start values for the constants.
Abstract: Many of the data-analytic methods employed in the social sciences assume that the data are normally distributed. It has been recognized that this assumption is often unrealistic, so there has been much research into how particular data-analytic methods behave when the data are not normally distributed. Much of this research has relied heavily on Monte Carlo simulations to characterize the behavior of a particular test statistic. A cornerstone of this type of investigation is the generation of data that conform to prescribed population characteristics. Fleishman (1978) developed the power transformation method for generating simulated univariate nonnormal data. Vale & Maurelli (1983) extended this method for generating simulated multivariate nonnormal data. PWRCOEFF derives power transformation constants for any possible combination of skew and kurtosis. It requires the user to enter the desired population skew, kurtosis, and start values for the constants. The program outputs a text file containing the specified skew and kurtosis, the constants for the power transformation, the start values, and the number of iterations to convergence. NNORMULT generates multivariate nonnormal data using the multivariate extension of the Fleishman power method as developed by Vale & Maurelli (1983). The program requires the user to enter the sample size of the dataset, the population covariance matrix (or correlation matrix) for the data, and the Fleishman power constants [which can be found in Fleishman's (1978) table or can be derived using PWRCOEFF]. NNORMULT outputs a text file containing a matrix of transformed raw data. The sample covariance structure S of this matrix represents a random sample drawn from the desired population with covariance structure Σ.
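A compact sketch of the two steps. The Fleishman transform is Y = −c + bZ + cZ² + dZ³; the multivariate step below draws correlated normals and transforms each margin, omitting (as noted in the comment) the intermediate-correlation adjustment that a full Vale-Maurelli implementation performs:

```python
import numpy as np

def fleishman(z, b, c, d):
    """Fleishman power transform of standard normal z (with a = -c)."""
    return -c + b * z + c * z ** 2 + d * z ** 3

def vale_maurelli(n, corr, consts, seed=None):
    """Generate multivariate nonnormal data (simplified sketch).

    consts: one (b, c, d) tuple per variable. NOTE: a full implementation
    first solves for an intermediate correlation matrix so the transformed
    data reproduce the target correlations; that step is omitted here.
    """
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(len(corr)), corr, size=n)
    return np.column_stack([fleishman(z[:, j], *c3)
                            for j, c3 in enumerate(consts)])
```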

Journal ArticleDOI
TL;DR: The item search algorithm in these tests can be based on a golden section search, a Z-score, or an EAP-based search; these methods result, respectively, in the golden search grading test (GGT), the Z-score grading test (ZGT), and the EAP grading test (EGT).
Abstract: IRT-based adaptive grading tests are designed to assign examinees to one of several grading categories. The item search algorithm in these tests can be based on a golden section search, a Z-score, or an EAP-based search; these methods result, respectively, in the golden search grading test (GGT), the Z-score grading test (ZGT), and the EAP grading test (EGT). Grade assignments are evaluated after each item is administered and after the current trait estimate (θ̂) has been determined. A test is terminated based on one of three conditions: (1) θ̂ is between two cutoff scores; (2) θ̂ is above the highest or below the lowest cutoff score; or (3) a prespecified maximum number of items has been administered. Monte Carlo studies using actual ACT Mathematics test item parameters showed that all three strategies effectively assigned examinees into multiple achievement grade levels. EGT had more correct classifications in the middle range of grade levels and more classifica...
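A sketch of the three termination rules, using a confidence half-width around the trait estimate as an assumed operationalization of being "between" or "beyond" the cutoff scores:

```python
def should_stop(theta_hat, half_width, cutoffs, items_given, max_items):
    """True if the grading test should terminate (sketch of rules 1-3)."""
    if items_given >= max_items:                    # rule 3: length cap
        return True
    lo, hi = theta_hat - half_width, theta_hat + half_width
    if hi < min(cutoffs) or lo > max(cutoffs):      # rule 2: beyond extremes
        return True
    bands = [float("-inf"), *sorted(cutoffs), float("inf")]
    # rule 1: interval falls confidently inside a single grade band
    return any(a < lo and hi < b for a, b in zip(bands, bands[1:]))
```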

Journal ArticleDOI
TL;DR: This paper presents latent-class models that fall within the purview of the general model presented by Clogg & Goodman (1984, 1985) and Walter & Irwig (1988); variations on the general latent-class model allow the investigator to determine whether the criterion measure and/or the diagnostic or screening procedure for multiple groups can be considered error-free.
Abstract: Classification analysis is used widely to detect classification errors determined by evaluating a screening or diagnostic instrument against a criterion measure. The usefulness of classification analysis is limited because it assumes an error-free criterion and provides no statistical test of the validity of that assumption. The classification-analysis model is a special case of a general latent-class model. This paper presents latent-class models that fall within the purview of the general model presented by Clogg & Goodman (1984, 1985) and Walter & Irwig (1988). Variations on the general latent-class model allow the investigator to determine whether the criterion measure and/or the diagnostic or screening procedure for multiple groups can be considered error-free. Analogous to the problem of differential item functioning, the general model makes it possible to test assumptions regarding classification errors that could occur across groups. The proportion of individuals who may be misclassified by a scree...