
Showing papers in "Applied Psychological Measurement in 2013"



Journal ArticleDOI
Chia Yi Chiu1
TL;DR: Results show that the proposed method, which compares the residual sum of squares (RSS) computed from the observed and the ideal item responses, can identify and correct misspecified entries in the Q-matrix, thereby improving its accuracy.
Abstract: Most methods for fitting cognitive diagnosis models to educational test data and assigning examinees to proficiency classes require the Q-matrix that associates each item in a test with the cognitive ...

106 citations
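The mechanics behind the TL;DR can be made concrete. Below is a minimal sketch, assuming deterministic DINA-style ideal responses and known attribute profiles; the function names and the exhaustive per-item search are illustrative, not the author's exact algorithm:

```python
import numpy as np
from itertools import product

def ideal_responses(alphas, q):
    """Deterministic DINA-style ideal responses: 1 iff an examinee
    masters every attribute the candidate q-vector requires."""
    return (alphas @ q == q.sum()).astype(float)

def rss_for_q(y_item, alphas, q):
    """Residual sum of squares between the observed responses to one
    item and the ideal responses implied by a candidate q-vector."""
    return np.sum((y_item - ideal_responses(alphas, q)) ** 2)

def refine_item_q(y_item, alphas, K):
    """Return the nonzero q-vector minimizing the RSS for this item."""
    candidates = [np.array(q) for q in product([0, 1], repeat=K) if any(q)]
    return min(candidates, key=lambda q: rss_for_q(y_item, alphas, q))
```

In the actual method the attribute profiles are themselves estimated from the data and entries are updated iteratively; the sketch takes the profiles as given.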


Journal ArticleDOI
TL;DR: The authors propose a method that evaluates the degree of item-level dimensionality and allows for the selection of subsets of items (i.e., short form) that result in scaled scores and standard errors that are equivalent to other multidimensional IRT-based scoring procedures.
Abstract: Test developers often need to create unidimensional scales from multidimensional data. For item analysis, marginal trace lines capture the relation with the general dimension while accounting for nuisance ...

98 citations


Journal ArticleDOI
TL;DR: The authors evaluate the viability of the proposed polytomous generalized deterministic inputs, noisy, “and” gate (pG-DINA) model by examining how well the model parameters can be estimated under various conditions, and compare its classification accuracy against that of the conventional G-DINA model with a modified classification rule.
Abstract: Polytomous attributes, particularly those defined as part of the test development process, can provide additional diagnostic information. The present research proposes the polytomous generalized deterministic inputs, noisy, “and” gate (pG-DINA) model ...

76 citations


Journal ArticleDOI
TL;DR: The two stopping rules are implemented successfully; the Kullback–Leibler (KL) procedure uses the fewest items to reach a given precision level, followed by Vm, and overall KL is recommended for varying-length MCAT.
Abstract: Through simulated data, five multidimensional computerized adaptive testing (MCAT) selection procedures with varying test lengths are examined and compared using different stopping rules. Fixed item ...

46 citations


Journal ArticleDOI
TL;DR: IRTPRO 2.1 is a new item response theory (IRT) model estimation program for Windows capable of unidimensional and multidimensional IRT model estimation for existing and user-specified constrained IRT models for dichotomously and polytomously scored item response data.
Abstract: This article reviews a new item response theory (IRT) model estimation program, IRTPRO 2.1, for Windows that is capable of unidimensional and multidimensional IRT model estimation for existing and user-specified constrained IRT models for dichotomously and polytomously scored item response data.

41 citations


Journal ArticleDOI
TL;DR: For the three-parameter logistic (3PL) model, Lord derived the value of the latent ability level that maximizes the item information function; this article extends that result to the four-parameter logistic (4PL) model.
Abstract: This article focuses on the four-parameter logistic (4PL) model as an extension of the usual three-parameter logistic (3PL) model with an upper asymptote possibly different from 1. For a given item with fixed item parameters, Lord derived the value of the latent ability level that maximizes the item information function under the 3PL model. The purpose of this article is to extend this result to the 4PL model. A generic and algebraic method is developed for that purpose. The result is practically illustrated by an example, and several potential applications of this result are outlined.

40 citations
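Lord's 3PL result has a well-known closed form, and the 4PL maximum can be located numerically and checked against it. A minimal sketch, assuming the D = 1.702 scaling convention and a bounded search interval; the article's algebraic 4PL solution is not reproduced here:

```python
import numpy as np
from scipy.optimize import minimize_scalar

D = 1.702  # logistic-to-normal scaling convention (an assumption here)

def info_4pl(theta, a, b, c, d):
    """Item information I(theta) = P'(theta)^2 / (P (1 - P)) for the
    4PL model P(theta) = c + (d - c) / (1 + exp(-D a (theta - b)))."""
    L = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    P = c + (d - c) * L
    dP = D * a * (d - c) * L * (1.0 - L)
    return dP**2 / (P * (1.0 - P))

def theta_max_info(a, b, c, d):
    """Ability level maximizing item information, found numerically."""
    res = minimize_scalar(lambda t: -info_4pl(t, a, b, c, d),
                          bounds=(b - 6, b + 6), method="bounded")
    return res.x

# Sanity check against Lord's closed form for the 3PL (d = 1):
a, b, c = 1.2, 0.5, 0.2
lord = b + np.log((1 + np.sqrt(1 + 8 * c)) / 2) / (D * a)
print(theta_max_info(a, b, c, 1.0), lord)  # should agree closely
print(theta_max_info(a, b, c, 0.95))       # d < 1 shifts the maximum
```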


Journal ArticleDOI
TL;DR: This study further articulates the relationship between item response theory (IRT) and classical test theory (CTT); equations are presented for comparing the reliability and precision of scores within the two frameworks.
Abstract: A classic topic in the fields of psychometrics and measurement has been the impact of the number of scale categories on test score reliability. This study builds on previous research by further articulating the relationship between item response theory (IRT) and classical test theory (CTT). Equations are presented for comparing the reliability and precision of scores within the CTT and IRT frameworks. This study presented new results pertaining to the relative precision (i.e., the test score conditional standard error of measurement for a given trait value) of CTT and IRT, and the new results shed light on the conditions where total scores and IRT estimates are more or less precisely measured. The relative reliability of CTT and IRT scores is examined as a function of item characteristics (e.g., locations, category thresholds, and discriminations) and subject characteristics (e.g., the skewness and kurtosis of the latent distribution). CTT total scores were more reliable when the latent distribution was m...

36 citations
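The quantities being compared are standard and easy to compute side by side: the conditional SEM of the number-correct score given the trait, and the delta-method SEM of the IRT estimate mapped onto the score metric. A sketch under 2PL assumptions with made-up item parameters; these are not the article's exact equations, only textbook versions of the quantities it studies:

```python
import numpy as np

D = 1.702  # scaling convention

def p_2pl(theta, a, b):
    """2PL response probabilities at ability theta."""
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

def csem_total(theta, a, b):
    """Conditional SEM of the number-correct score given theta:
    sqrt(sum_i P_i (1 - P_i)) under local independence."""
    P = p_2pl(theta, a, b)
    return np.sqrt(np.sum(P * (1.0 - P)))

def csem_irt_score_metric(theta, a, b):
    """Delta-method SEM of the IRT estimate mapped to the score metric:
    TCC'(theta) / sqrt(I(theta)), which makes the two comparable."""
    P = p_2pl(theta, a, b)
    dP = D * a * P * (1.0 - P)
    info = np.sum(dP**2 / (P * (1.0 - P)))
    return np.sum(dP) / np.sqrt(info)

rng = np.random.default_rng(0)
a = rng.uniform(0.8, 2.0, 30)   # illustrative discriminations
b = rng.normal(0.0, 1.0, 30)    # illustrative difficulties
for t in (-2.0, 0.0, 2.0):
    print(t, csem_total(t, a, b), csem_irt_score_metric(t, a, b))
```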


Journal ArticleDOI
TL;DR: A control procedure was developed to manage item exposure and test overlap simultaneously among examinees; results showed that using the two criteria with the posterior-weighted Kullback–Leibler information procedure for selecting items could achieve the prespecified measurement precision.
Abstract: Interest in developing computerized adaptive testing (CAT) under cognitive diagnosis models (CDMs) has increased recently. CAT algorithms that use a fixed-length termination rule frequently lead to...

35 citations
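The posterior-weighted Kullback–Leibler (PWKL) selection index mentioned in the TL;DR can be sketched under DINA assumptions. The exposure and overlap control that is the article's actual contribution is omitted, and the slip/guess values and helper names below are hypothetical:

```python
import numpy as np
from itertools import product

def p_dina(alpha, q, slip, guess):
    """DINA success probability: 1 - slip if the examinee masters every
    required attribute, guess otherwise."""
    return 1.0 - slip if np.all(alpha >= q) else guess

def pwkl(item, alpha_hat, posterior, patterns):
    """Posterior-weighted KL divergence between the response distribution
    at the current estimate alpha_hat and at each candidate pattern."""
    q, slip, guess = item
    p0 = p_dina(alpha_hat, q, slip, guess)
    total = 0.0
    for w, alpha_c in zip(posterior, patterns):
        pc = p_dina(alpha_c, q, slip, guess)
        total += w * (p0 * np.log(p0 / pc)
                      + (1 - p0) * np.log((1 - p0) / (1 - pc)))
    return total

K = 3
patterns = [np.array(p) for p in product([0, 1], repeat=K)]
posterior = np.full(len(patterns), 1.0 / len(patterns))  # flat start
bank = [(np.array(q), 0.1, 0.2) for q in product([0, 1], repeat=K) if any(q)]
alpha_hat = patterns[int(np.argmax(posterior))]
scores = [pwkl(it, alpha_hat, posterior, patterns) for it in bank]
print(bank[int(np.argmax(scores))][0])  # q-vector of the selected item
```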


Journal ArticleDOI
TL;DR: Multidimensional computerized adaptive testing (MCAT) is able to provide a vector of ability estimates for each examinee, which could be used to provide a more informative profile of an examinee's abilities.
Abstract: Multidimensional computerized adaptive testing (MCAT) is able to provide a vector of ability estimates for each examinee, which could be used to provide a more informative profile of an examinee’s ...

32 citations


Journal ArticleDOI
TL;DR: This study develops a new class of higher order item response theory models for hierarchical latent traits that are flexible in accommodating both dichotomous and polytomous items, estimate both item and person parameters jointly, allow users to specify customized item response functions, and go beyond two orders of latent traits and the linear relationship between latent traits.
Abstract: Many latent traits in the human sciences have a hierarchical structure. This study aimed to develop a new class of higher order item response theory models for hierarchical latent traits that are flexible in accommodating both dichotomous and polytomous items, to estimate both item and person parameters jointly, to allow users to specify customized item response functions, and to go beyond two orders of latent traits and the linear relationship between latent traits. Parameters of the new class of models can be estimated using the Bayesian approach with Markov chain Monte Carlo methods. Through a series of simulations, the authors demonstrated that the parameters in the new class of models can be well recovered with the computer software WinBUGS, and the joint estimation approach was more efficient than multistaged or consecutive approaches. Two empirical examples of achievement and personality assessments were given to demonstrate applications and implications of the new models.
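The hierarchical structure the article generalizes can be illustrated with a small generative sketch: a second-order trait drives first-order traits through a linear relationship, and each first-order trait drives 2PL responses. This shows only the data structure, not the article's Bayesian estimation, and every number below is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, J = 1000, 3, 10   # examinees, first-order traits, items per trait

theta2 = rng.normal(0, 1, N)              # second-order (overall) trait
lam = np.array([0.8, 0.6, 0.7])           # first-order loadings
eps = rng.normal(0, 1, (N, K)) * np.sqrt(1 - lam**2)
theta1 = theta2[:, None] * lam + eps      # first-order traits, unit variance

a = rng.uniform(0.8, 2.0, (K, J))         # 2PL discriminations
b = rng.normal(0, 1, (K, J))              # 2PL difficulties
P = 1 / (1 + np.exp(-a * (theta1[:, :, None] - b)))
Y = (rng.uniform(size=P.shape) < P).astype(int)  # N x K x J responses
```

Joint estimation then recovers theta2, theta1, and the item parameters in one model rather than in consecutive stages.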

Journal ArticleDOI
TL;DR: In this article, the authors examined several methods for assessing differential item functioning (DIF) on polytomous items generated by an ideal point process; the results revealed that DIF effect sizes were moderately large for the .50 uniform DIF conditions and small for nonuniform DIF, and the LR test in general yielded the best results.
Abstract: There has been growing use of ideal point models to develop scales measuring important psychological constructs. For meaningful comparisons across groups, it is important to identify items on such scales that exhibit differential item functioning (DIF). In this study, the authors examined several methods for assessing DIF on polytomous items generated by an ideal point process. Two paradigms (i.e., null hypothesis significance testing [NHST] and effect size quantification) were utilized, and three test statistics (i.e., the log-likelihood ratio [LR], the Akaike information criterion [AIC], and Lord’s chi-square) and two approaches to DIF testing (i.e., the constrained and free baseline methods) were evaluated. In addition, the authors investigated three levels of impact. The results revealed that DIF effect sizes were moderately large for the .50 uniform DIF conditions and small for nonuniform DIF; moreover, the LR test in general yielded the best results. When there was small to moderate impact, the free...

Journal ArticleDOI
TL;DR: The majority of large-scale assessments develop various score scales, either linear or nonlinear transformations of raw scores, for better interpretation and use of assessment results.
Abstract: The majority of large-scale assessments develop various score scales that are either linear or nonlinear transformations of raw scores for better interpretations and uses of assessment results. The...

Journal ArticleDOI
TL;DR: Guessing behavior is an issue discussed widely with regard to multiple choice tests; its primary effect is on number-correct scores for examinees at lower levels of proficiency.
Abstract: Guessing behavior is an issue discussed widely with regard to multiple choice tests. Its primary effect is on number-correct scores for examinees at lower levels of proficiency. This is a systematic ...

Journal ArticleDOI
TL;DR: The violation of the assumption of local independence when applying item response theory (IRT) models has been shown to have a negative impact on all estimates obtained from the given model.
Abstract: The violation of the assumption of local independence when applying item response theory (IRT) models has been shown to have a negative impact on all estimates obtained from the given model. Numerous ...

Journal ArticleDOI
TL;DR: In this article, the authors developed observed score and true score equating procedures to be used in conjunction with the multidimensional item response theory (MIRT) framework, and three equating ...
Abstract: The purpose of this research was to develop observed score and true score equating procedures to be used in conjunction with the multidimensional item response theory (MIRT) framework. Three equating ...

Journal ArticleDOI
TL;DR: The results confirm that capitalization on chance occurs in VL-CAT and has complex effects on test length, ability estimation, and classification accuracy; these results have important implications for the design and implementation of VL-CATs.
Abstract: Variable-length computerized adaptive testing (VL-CAT) allows both items and test length to be “tailored” to examinees, thereby achieving the measurement goal (e.g., scoring precision or classification ...

Journal ArticleDOI
TL;DR: Within the framework of item response theory (IRT), there are two recent lines of work on the estimation of the classification accuracy (CA) rate; one approach estimates CA when decisions are made based ...
Abstract: Within the framework of item response theory (IRT), there are two recent lines of work on the estimation of the classification accuracy (CA) rate. One approach estimates CA when decisions are made based ...

Journal ArticleDOI
TL;DR: The process of parameter estimation is described to provide insight into the causes of uncertainty in the item parameters, and an alternative automated test assembly algorithm is presented that is robust against uncertainty in the data.
Abstract: Item response theory parameters have to be estimated, and because of the estimation process, they contain uncertainty. In most large-scale testing programs, the parameters are stored in item banks, and automated test assembly algorithms are applied to assemble operational test forms. These algorithms treat item parameters as fixed values, and uncertainty is not taken into account. As a consequence, resulting tests might be off target or less informative than expected. In this article, the process of parameter estimation is described to provide insight into the causes of uncertainty in the item parameters. The consequences of uncertainty are studied. In addition, an alternative automated test assembly algorithm is presented that is robust against uncertainty in the data. Several numerical examples demonstrate the performance of the robust test assembly algorithm and illustrate the consequences of not taking this uncertainty into account. Finally, some recommendations about the use of robust test assembly and some directions for further research are given.
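The article's robust algorithm itself is not reproduced here, but the underlying idea can be sketched: instead of assembling on point estimates of item information, assemble on a lower confidence bound so that poorly calibrated items are down-weighted. A greedy toy version with an invented item pool:

```python
import numpy as np

def robust_greedy_assembly(info_hat, info_se, test_len, z=1.0):
    """Greedy assembly on a lower confidence bound of item information:
    score_i = I_hat_i - z * SE_i, penalizing uncertain estimates."""
    score = info_hat - z * info_se
    return np.argsort(score)[::-1][:test_len]

rng = np.random.default_rng(0)
info_hat = rng.uniform(0.2, 1.5, 200)  # estimated info at the target theta
info_se = rng.uniform(0.01, 0.4, 200)  # calibration uncertainty
picked = robust_greedy_assembly(info_hat, info_se, test_len=40)
print(np.sort(picked)[:10])
```

A production assembler would use integer programming with content constraints; the penalty term is the robustness idea in miniature.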

Journal ArticleDOI
TL;DR: This study proposes the item pocket (IP) method, a new testing approach that allows test takers greater flexibility in changing their responses by eliminating restrictions that prevent them from moving across test sections to review their answers.
Abstract: Most computerized adaptive testing (CAT) programs do not allow test takers to review and change their responses because it could seriously deteriorate the efficiency of measurement and make tests vulnerable to manipulative test-taking strategies. Several modified testing methods have been developed that provide restricted review options while limiting the trade-off in CAT efficiency. The extent to which these methods provided test takers with options to review test items, however, still was quite limited. This study proposes the item pocket (IP) method, a new testing approach that allows test takers greater flexibility in changing their responses by eliminating restrictions that prevent them from moving across test sections to review their answers. A series of simulations were conducted to evaluate the robustness of the IP method against various manipulative test-taking strategies. Findings and implications of the study suggest that the IP method may be an effective solution for many CAT programs when the...

Journal ArticleDOI
TL;DR: Mixtures of item response theory (IRT) models have been proposed as a technique to explore response patterns in test data related to cognitive strategies, instructional sensitivity, and differential ...
Abstract: Mixtures of item response theory (IRT) models have been proposed as a technique to explore response patterns in test data related to cognitive strategies, instructional sensitivity, and differential ...

Journal ArticleDOI
TL;DR: The Monte Carlo approach is applied here to cognitive diagnostic CAT to test the ability of this approach to address multiple content constraints; the recovery rate of the knowledge states, the distribution of item exposure, and the utilization rate of the item bank are improved when the Monte Carlo method is used.
Abstract: The Monte Carlo approach which has previously been implemented in traditional computerized adaptive testing (CAT) is applied here to cognitive diagnostic CAT to test the ability of this approach to...

Journal ArticleDOI
TL;DR: In this paper, a generalized distance discriminating method for tests with polytomous responses (GDD-P) is proposed, which can identify examinees' ideal response patterns (IRPs) based on a generalized distance index.
Abstract: This article proposes a generalized distance discriminating method for tests with polytomous responses (GDD-P). The new method is the polytomous extension of an item response theory (IRT)-based cognitive diagnostic method, which can identify examinees’ ideal response patterns (IRPs) based on a generalized distance index. The similarities between observed response patterns and IRPs in the polytomous response situation are measured by the GDD-P index, and the attribute patterns can be recognized via the relationship between attribute patterns and IRPs. Feasible designs for the polytomous Q-matrix and for scoring polytomous items are also discussed. In simulation, the classification accuracy of the GDD-P method for tests with polytomous responses was investigated, and results indicated that the proposed method had promising performance in recognizing examinees’ attribute patterns.
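The classification step, assigning an examinee to the attribute pattern whose ideal response pattern lies closest to the observed pattern, can be sketched with a plain weighted Euclidean distance. The GDD-P index itself weights distances with IRT-based quantities, which are omitted here:

```python
import numpy as np

def classify_by_distance(x, irps, weights=None):
    """Return the index of the ideal response pattern (IRP) closest to
    the observed polytomous response pattern x; each IRP corresponds
    to one attribute pattern."""
    w = np.ones(len(x)) if weights is None else weights
    dists = [np.sqrt(np.sum(w * (x - irp) ** 2)) for irp in irps]
    return int(np.argmin(dists))

# Toy example: three attribute patterns, four polytomous items scored 0-2.
irps = [np.array([0, 0, 1, 1]),
        np.array([1, 2, 1, 1]),
        np.array([2, 2, 2, 2])]
print(classify_by_distance(np.array([1, 2, 1, 0]), irps))  # -> 1
```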

Journal ArticleDOI
TL;DR: Results from a study with simulated response data highlight both the effects of within-family item-parameter variability and the severity of the constraint sets in the test-design models on their optimal solutions.
Abstract: Optimal test-design methods are applied to rule-based item generation. Three different cases of automated test design are presented: (a) test assembly from a pool of pregenerated, calibrated items; (b) test generation on the fly from a pool of calibrated item families; and (c) test generation on the fly directly from calibrated features defining the item families. The last two cases do not assume any item calibration under a regular response theory model; instead, entire item families or critical features of them are assumed to be calibrated using a hierarchical response model developed for rule-based item generation. The test-design models maximize an expected version of the Fisher information in the test and control critical attributes of the test forms through explicit constraints. Results from a study with simulated response data highlight both the effects of within-family item-parameter variability and the severity of the constraint sets in the test-design models on their optimal solutions.

Journal ArticleDOI
TL;DR: A linear regression method is proposed for detecting outlying common items and compared with a traditional method under various conditions; the results confirmed the necessity of detecting and removing such items.
Abstract: Common test items play an important role in equating alternate test forms under the common item nonequivalent groups design. When the item response theory (IRT) method is applied in equating, inconsistent item parameter estimates among common items can lead to large bias in equated scores. It is prudent to evaluate inconsistency in parameter estimates of common items before conducting IRT equating. The evaluation of inconsistency in parameter estimates is typically achieved through detecting outliers in the common item set. In this study, a linear regression method is proposed as a detection method. The newly proposed method was compared with a traditional method in various conditions. The results of this study confirmed the necessity of detecting and removing outlying common items. The results also show that the newly proposed method performed better than did the traditional method in most conditions.
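The regression idea is simple to sketch: regress the new-form estimates of the common items' parameters on the old-form estimates and flag large standardized residuals. Using difficulty alone and a z cutoff of 2 are assumptions; the article's procedure and comparison conditions are richer:

```python
import numpy as np

def flag_outlying_common_items(b_old, b_new, z_crit=2.0):
    """Regress new-form difficulty estimates on old-form estimates and
    flag items whose standardized residuals exceed z_crit."""
    X = np.column_stack([np.ones_like(b_old), b_old])
    beta, *_ = np.linalg.lstsq(X, b_new, rcond=None)
    resid = b_new - X @ beta
    z = resid / resid.std(ddof=2)   # two regression parameters estimated
    return np.where(np.abs(z) > z_crit)[0]

rng = np.random.default_rng(42)
b_old = np.linspace(-2, 2, 20)             # form-X anchor difficulties
b_new = b_old + rng.normal(0, 0.05, 20)    # form-Y estimates of the same items
b_new[7] += 0.8                            # one drifting common item
print(flag_outlying_common_items(b_old, b_new))  # expect it to flag item 7
```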

Journal ArticleDOI
TL;DR: In this paper, an item response theory-based framework is proposed to perform item analysis involving P-S data, which improves psychometric analysis for scoring response categories, calibrating items, calculating reliability or internal consistency, and selecting and revising items.
Abstract: The Presence-Severity (P-S) format refers to a compound item structure in which a question is first asked to check the presence of the particular event in question. If the respondent provides an affirmative answer, a follow-up is administered, often about the frequency, density, severity, or impact of the event. Despite the popularity of the P-S format in areas such as patient-reported outcomes, little attention has been paid to their psychometric analysis, which is necessary for making key design decisions about a scale. In this study, an item response theory–based framework is proposed to perform item analysis involving P-S data, which improves psychometric analysis for (a) scoring response categories, (b) calibrating items, (c) calculating reliability or internal consistency, and (d) selecting and revising items. A real-data example involving the Memorial Symptom Assessment Scale–Short Form, which is used as a symptom distress measure for terminally ill cancer patients, demonstrates how the new framework ...

Journal ArticleDOI
TL;DR: MSTGen, a new MST simulation software tool, was developed to serve various purposes ranging from fundamental MST research to technical MST program evaluations and offers a variety of test administration environments and a user-friendly graphical interface.
Abstract: Multistage testing, or MST, was developed as an alternative to computerized adaptive testing (CAT) for applications in which it is preferable to administer a test at the level of item sets (i.e., modules). As with CAT, the simulation technique in MST plays a critical role in the development and maintenance of tests. Theoretically, MST is a special case of CAT (likewise, CAT also can be viewed as a special version of MST). Technically, however, MST and CAT are completely different relative to how test systems work; thus, existing commercial or noncommercial CAT simulation programs, for example, CATSim (Weiss & Guyer, 2012) and SimulCAT (Han, 2012), cannot accommodate MST-based tests. MSTGen, a new MST simulation software tool, was developed to serve various purposes ranging from fundamental MST research to technical MST program evaluations. The new MST simulation software tool supports both traditional MST functioning (MST by routing to preassembled modules after each stage; Luecht & Nungester, 1998) and new MST methods (e.g., MST by shaping a module for each stage; Han & Guo, 2013). It offers a variety of test administration environments and a user-friendly graphical interface.

Journal ArticleDOI
TL;DR: The simulation studies showed that the ω index (Wollack, 1996) and generalized binomial test (GBT; van der Linden & Sotaridona, 2006) provide the highest detection rates, while holding the empirical Type I error rates below the nominal level.
Abstract: Fraud on standardized tests has been an increasing concern (Crouch, 2012; Hildebrand, 2012) because it invalidates the inferences made from test scores. Large-scale, self-report survey results consistently revealed that about 35% of high school students engaged in some type of test fraud two or more times in the previous year (Josephson Institute of Ethics, 2006, 2008, 2010), and answer exchange between two examinees is a type of test fraud commonly observed in multiple-choice examinations (Bopp, Gleason, & Misicka, 2001; Brimble & Clarke, 2005; Hughes & McCabe, 2006; Rakovski & Levy, 2007). Identifying answer copying is an essential part of maintaining the integrity of test scores, and additional evidence is always necessary when a pair of examinees is suspected of exchanging answers on a multiple-choice test. Many scholars have developed a variety of analytical procedures and addressed the issue from a statistical perspective to provide additional evidence of answer copying between two examinees (Angoff, 1972; Bay, 1995; Bellezza & Bellezza, 1989; Cody, 1985; Frary, Tideman, & Watts, 1977; Hanson, Harris, & Brennan, 1987; Holland, 1996; Saupe, 1960; Sotaridona & Meijer, 2002, 2003; van der Linden & Sotaridona, 2006; Wollack, 1996). However, few of these indices have been shown to be effective and reliable based on the results from simulation studies (Bay, 1995; Hanson et al., 1987; Sotaridona & Meijer, 2002, 2003; Wollack, 1996, 2003, 2006; Wollack & Cohen, 1998; Zopluoglu & Davenport, 2012). The simulation studies showed that the ω index (Wollack, 1996) and the generalized binomial test (GBT; van der Linden & Sotaridona, 2006) provide the highest detection rates while holding the empirical Type I error rates below the nominal level. In addition, the K index (Holland, 1996), the K1 and K2 indices (Sotaridona & Meijer, 2002), and the S1 and S2 indices (Sotaridona & Meijer, 2003) have provided reasonable detection rates and held Type I error rates below the nominal level in simulation studies. Although much effort has been put into developing statistical indices for detecting answer copying, so far little effort has been made to develop accessible software for practitioners to compute the statistical indices that have been found effective in the literature. To my knowledge, none of these useful indices are available for practitioners in any ...
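As a flavor of how such indices work: the GBT treats the number of identical answers between two examinees as a sum of independent, item-specific match probabilities (a generalized binomial, i.e., Poisson-binomial count), and the upper-tail probability of the observed match count is the evidence of copying. Given match probabilities, the exact tail is a small dynamic program; deriving those probabilities from an IRT model is the substantive part and is omitted, so the numbers below are hypothetical:

```python
import numpy as np

def poisson_binomial_tail(p, m_obs):
    """P(M >= m_obs) where M is a sum of independent Bernoulli(p_i)
    match indicators, computed by exact convolution."""
    dist = np.zeros(len(p) + 1)
    dist[0] = 1.0
    for pi in p:
        dist[1:] = dist[1:] * (1 - pi) + dist[:-1] * pi
        dist[0] *= (1 - pi)
    return dist[m_obs:].sum()

# Hypothetical item-level match probabilities (these would come from an
# IRT model of how often the pair agrees by chance on each item).
rng = np.random.default_rng(3)
p = rng.uniform(0.1, 0.5, 40)
print(poisson_binomial_tail(p, m_obs=25))  # small value -> copying suspected
```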

Journal ArticleDOI
TL;DR: The results showed that the parameters of RTGUM were recovered fairly well and that ignoring the randomness in thresholds led to biased estimates; the longer the test length, the smaller the randomness in thresholds, and the more categories in an item, the more precise the ability estimates would be.
Abstract: The random-threshold generalized unfolding model (RTGUM) was developed by treating the thresholds in the generalized unfolding model as random effects rather than fixed effects to account for the subjective nature of the selection of categories in Likert items. The parameters of the new model can be estimated with the JAGS (Just Another Gibbs Sampler) freeware, which adopts a Bayesian approach for estimation. A series of simulations was conducted to evaluate the parameter recovery of the new model and the consequences of ignoring the randomness in thresholds. The results showed that the parameters of RTGUM were recovered fairly well and that ignoring the randomness in thresholds led to biased estimates. Computerized adaptive testing was also implemented on RTGUM, where the Fisher information criterion was used for item selection and the maximum a posteriori method was used for ability estimation. The simulation study showed that the longer the test length, the smaller the randomness in thresholds, and the...

Journal ArticleDOI
TL;DR: In this paper, the authors explore the usefulness of latent growth curve modeling in the study of pacing behavior and test speededness, using a high-stakes, computerized examination.
Abstract: This research explores the usefulness of latent growth curve modeling in the study of pacing behavior and test speededness. Examinee response times from a high-stakes, computerized examination, collected ...