
Showing papers in "Educational and Psychological Measurement in 2011"


Journal ArticleDOI
TL;DR: In this article, a multidimensional IRT model based on Thurstone's framework for comparative data is introduced, which is suitable for use with any forced-choice questionnaire composed of items fitting the dominance response model, with any number of measured traits, and any block sizes.
Abstract: Multidimensional forced-choice formats can significantly reduce the impact of numerous response biases typically associated with rating scales. However, if scored with classical methodology, these questionnaires produce ipsative data, which lead to distorted scale relationships and make comparisons between individuals problematic. This research demonstrates how item response theory (IRT) modeling may be applied to overcome these problems. A multidimensional IRT model based on Thurstone’s framework for comparative data is introduced, which is suitable for use with any forced-choice questionnaire composed of items fitting the dominance response model, with any number of measured traits, and any block sizes (i.e., pairs, triplets, quads, etc.). Thurstonian IRT models are normal ogive models with structured factor loadings, structured uniquenesses, and structured local dependencies. These models can be straightforwardly estimated using structural equation modeling (SEM) software Mplus. A number of simulation studies are performed to investigate how latent traits are recovered under various forced-choice designs and provide guidelines for optimal questionnaire design. An empirical application is given to illustrate how the model may be applied in practice. It is concluded that when the recommended design guidelines are met, scores estimated from forced-choice questionnaires with the proposed methodology reproduce the latent traits well.

210 citations
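The ipsativity problem with classical scoring that motivates the Thurstonian IRT model can be seen in a small sketch (a hypothetical scoring function for illustration, not the authors' code): ranking the items within each block distributes a fixed number of points, so every respondent's trait scores sum to the same constant.

```python
def score_forced_choice(ranked_blocks, trait_of_item):
    """Classically score a forced-choice questionnaire.

    ranked_blocks: a respondent's blocks; each block lists item ids ordered
    from "most like me" to "least like me".
    trait_of_item: maps each item id to the trait it measures.
    """
    scores = {trait: 0 for trait in set(trait_of_item.values())}
    for block in ranked_blocks:
        k = len(block)
        for rank, item in enumerate(block):
            # Top-ranked item gets k-1 points, bottom-ranked gets 0.
            scores[trait_of_item[item]] += (k - 1) - rank
    return scores

traits = {"i1": "A", "i2": "B", "i3": "C"}
alice = score_forced_choice([["i1", "i2", "i3"]], traits)
bob = score_forced_choice([["i3", "i2", "i1"]], traits)
# Each triplet always awards 2 + 1 + 0 = 3 points, so the totals are
# identical even though the profiles differ -- the data are ipsative.
```

Because every respondent's total is forced to the same constant, scale scores are negatively interdependent by construction, which is exactly the distortion of scale relationships the abstract describes.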


Journal ArticleDOI
TL;DR: Exploratory factor analysis (EFA) has long been used in the social sciences to depict the relationships between variables/items and latent traits, and researchers face many choices when applying it.
Abstract: Exploratory factor analysis (EFA) has long been used in the social sciences to depict the relationships between variables/items and latent traits. Researchers face many choices when using EFA, incl...

210 citations


Journal ArticleDOI
TL;DR: A systematic review of measures of social and emotional skills for children and young people is presented in this article, where the review process resulted in the retention of 12 measures, which are presented and discussed in relation to key issues in this area, including difficulties with the underlying theory and frameworks for social and emotional skills, inconsistent terminology, the scope and distinctiveness of available measures, and more practical issues such as the type of respondent, location, and purpose of measurement.
Abstract: This study presents the findings of a systematic review of measures of social and emotional skills for children and young people. The growing attention to this area in recent years has resulted in the development of a large number of measures to aid in the assessment of children and young people. These measures vary on a number of variables relating to implementation characteristics and psychometric properties. The methodology of the review followed the general principles of systematic reviewing, such as systematic search of databases, the adoption of predetermined set of inclusion and exclusion criteria, and a multistage filtering process. The review process resulted in the retention of 12 measures, which are presented and discussed in relation to key issues in this area, including difficulties with the underlying theory and frameworks for social and emotional skills, inconsistent terminology, the scope and distinctiveness of available measures, and more practical issues such as the type of respondent, location, and purpose of measurement.

180 citations


Journal ArticleDOI
TL;DR: In this article, the authors synthesize internal consistency reliability for the subscale scores on the Maslach Burnout Inventory (MBI) and find that mean alpha estimates for two of its subscales fall below the levels recommended for high-stakes decisions such as diagnosing burnout.
Abstract: The purpose of this study was to synthesize internal consistency reliability for the subscale scores on the Maslach Burnout Inventory (MBI). The authors addressed three research questions: (a) What is the mean subscale score reliability for the MBI across studies? (b) What factors are associated with observed variance in MBI subscale score reliability? (c) What are the implications for appropriate use based on MBI subscale mean internal consistency estimates? Of the 221 studies reviewed, 84 provided alpha coefficients and were used in the current analysis. Results suggest that mean alpha estimates across subscales generally fell within the .70 to .80 range. Scale variance and language most often accounted for the variance in coefficient alpha, although some variations were apparent between subscales. Of the three MBI subscales, Personal Accomplishment and Depersonalization mean alpha estimates were well below recommended levels for high-stakes decisions, such as the diagnosis of burnout syndrome. Recommen...

141 citations


Journal ArticleDOI
TL;DR: In this article, the authors extended a methodological approach considered by Bolt and Johnson for the measurement and control of extreme response style (ERS) to the analysis of rating data from multiple scales.
Abstract: This article extends a methodological approach considered by Bolt and Johnson for the measurement and control of extreme response style (ERS) to the analysis of rating data from multiple scales. Specifically, it is shown how the simultaneous analysis of item responses across scales allows for more accurate identification of ERS, and more effective control of ERS effects on the substantive trait estimates, than when analyzing just one scale. Moreover, unlike a competing approach presented by Greenleaf, the current strategy can accommodate conditions in which the substantive traits across scales correlate, as is almost always the case in social sciences research. Simulation and real data analyses are used for illustration.

95 citations


Journal ArticleDOI
TL;DR: As discussed in this paper, a two-step process is commonly used to evaluate data–model fit of latent variable path models, the first step addressing the measurement portion of the model and the second addressing the structural portion of the model.
Abstract: A two-step process is commonly used to evaluate data–model fit of latent variable path models, the first step addressing the measurement portion of the model and the second addressing the structural portion of the model. Unfortunately, even if the fit of the measurement portion of the model is perfect, the ability to assess the fit within the structural portion is affected by the quality of the factor–variable relations within the measurement model. The result is that models with poorer quality measurement appear to have better data–model fit, whereas models with better quality measurement appear to have worse data–model fit. The current article illustrates this phenomenon across different classes of fit indices, discusses related structural assessment problems resulting from issues of measurement quality, and endorses a supplemental modeling step evaluating the structural portion of the model in isolation from the measurement model.

82 citations


Journal ArticleDOI
TL;DR: In this article, the performance of the minimum average partial (MAP) procedure in the presence of ordinal-level measurement was investigated with categorical data and the results indicated that using polychoric correlations and the squared partial correlations leads to considerably more accurate estimations than using Pearson correlations and/or raising the partial correlations to the fourth power.
Abstract: Despite strong evidence supporting the use of Velicer's minimum average partial (MAP) method to establish the dimensionality of continuous variables, little is known about its performance with categorical data. Seeking to fill this void, the current study takes an in-depth look at the performance of the MAP procedure in the presence of ordinal-level measurement. Using Monte Carlo methods, seven factors related to the data (sample size, factor loading, number of variables per factor, number of factors, factor correlation, number of response categories, and skewness) as well as two factors related to the MAP method (type of correlation matrix and power) were systematically manipulated. The results indicate that using polychoric correlations and the squared partial correlations leads to considerably more accurate estimations than using Pearson correlations and/or raising the partial correlations to the fourth power. Additionally, the MAP method is shown to be a biased estimator of dimensionality in two conditions: (a) for low factor loadings (.40) and (b) for medium factor loadings (.55) and a small number of variables per factor (≤6). The applicability of this method with categorical variables is discussed in the context of these findings.

78 citations
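The core of the criterion compared in this study, averaging off-diagonal partial correlations raised to either the second or the fourth power, can be sketched as follows (a simplified illustration of that one step; the full MAP procedure also requires partialling out successive principal components):

```python
def map_criterion(partial_corr, power=2):
    """Average of the off-diagonal partial correlations raised to `power`.

    power=2 is Velicer's original (squared) criterion; power=4 is the revised
    criterion. The study above finds the squared version more accurate with
    ordinal data.
    """
    p = len(partial_corr)
    total = sum(partial_corr[i][j] ** power
                for i in range(p) for j in range(p) if i != j)
    return total / (p * (p - 1))

def retained_components(criteria):
    """Retain m components, where criteria[m] is the minimum average partial
    after partialling out the first m components."""
    return min(range(len(criteria)), key=criteria.__getitem__)
```

The estimated dimensionality is simply the number of components at which this average partial correlation reaches its minimum; whether the input matrix is Pearson or polychoric is the other design factor the study manipulates.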


Journal ArticleDOI
TL;DR: The use of causal indicators to formatively measure latent constructs appears to be on the rise, despite what appears to be a troubling lack of consistency in their application, as discussed by the authors, who suggest that this lack of supporting psychometric theory has contributed to the confusion surrounding their implementation.
Abstract: The use of causal indicators to formatively measure latent constructs appears to be on the rise, despite what appears to be a troubling lack of consistency in their application. Scholars in any discipline are responsible not only for advancing theoretical knowledge in their domain of study but also for addressing methodological issues that threaten that advance. In that spirit, the current study traces causal indicators from their origins in causal modeling to their use in structural equation modeling today. Conclusions from this review suggest that unlike effect (reflective) indicators, whose application is based on classical test theory, today’s application of causal (formative) indicators is based on research demonstrating their practical application rather than on psychometric theory supporting their use. The authors suggest that this lack of theory has contributed to the confusion surrounding their implementation. Recent research has questioned the generalizability of formatively measured latent cons...

71 citations


Journal ArticleDOI
TL;DR: The Social Issues Advocacy Scale, as discussed by the authors, is a four-factor scale, accounting for 71.4% of the variance, measuring different aspects of social issue advocacy: Political and Social Advocacy, Confronting Discrimination, Political Awareness, and Social Issue Awareness.
Abstract: This article describes the development and the initial psychometric evaluation of the Social Issues Advocacy Scale in two studies. In the first study, an exploratory factor analysis (n = 278) revealed a four-factor scale, accounting for 71.4% of the variance, measuring different aspects of social issue advocacy: Political and Social Advocacy, Confronting Discrimination, Political Awareness, and Social Issue Awareness. The second study (n = 509) supported the structure. Results indicated excellent internal reliability and associations with another social advocacy scale, political interest, and multicultural empathy, but not with self-esteem and life satisfaction, all of which provided initial evidence of construct and discriminant validity.

70 citations


Journal ArticleDOI
TL;DR: In this paper, the authors investigate whether an ability test delivered on either paper or computer provides the same information, a question of both validity and fairness in applied psychometrics.
Abstract: Whether an ability test delivered on either paper or computer provides the same information is an important question in applied psychometrics. Besides the validity, it is also the fairness of a mea...

68 citations


Journal ArticleDOI
TL;DR: In this paper, the authors provide further empirical support for the use of target rotations as a method for deriving a comparison model, and explore the Schmid-Leiman orthogonalization to specify a viable initial target matrix and the recovery of true bifactor pattern matrices using target factor rotation.
Abstract: Reise, Cook, and Moore proposed a “comparison modeling” approach to assess the distortion in item parameter estimates when a unidimensional item response theory (IRT) model is imposed on multidimensional data. Central to their approach is the comparison of item slope parameter estimates from a unidimensional IRT model (a restricted model), with the item slope parameter estimates from the general factor in an exploratory bifactor IRT model (the unrestricted comparison model). In turn, these authors suggested that the unrestricted comparison bifactor model be derived from a target factor rotation. The goal of this study was to provide further empirical support for the use of target rotations as a method for deriving a comparison model. Specifically, we conducted Monte Carlo analyses exploring (a) the use of the Schmid–Leiman orthogonalization to specify a viable initial target matrix and (b) the recovery of true bifactor pattern matrices using target rotations as implemented in Mplus. Results suggest that t...

Journal ArticleDOI
TL;DR: Results indicate that classification and regression trees generally produced the highest classification accuracy of all techniques tested, though study design characteristics such as sample size and model complexity can greatly influence optimal choice or effectiveness of statistical classification method.
Abstract: The statistical classification of N individuals into G mutually exclusive groups when the actual group membership is unknown is common in the social and behavioral sciences. The results of such classification methods often have important consequences. Among the most common methods of statistical classification are linear discriminant analysis, quadratic discriminant analysis, and logistic regression. However, recent developments in the statistics literature have brought new and potentially more flexible classification models to the forefront. Although these new models are increasingly being used in the physical sciences and marketing research, they are still relatively little used in the social and behavioral sciences. The purpose of this article is to provide a comparison of these modern methods with the classical methods widely used in situations that are relevant in the social and behavioral sciences. This study uses a large-scale Monte Carlo simulation study for the comparisons, as analytic comparison...

Journal ArticleDOI
TL;DR: The authors further investigated the factor structure of the Hong Psychological Reactance Scale (HPRS) by testing four competing models using responses from 1,282 college students and found that a modified bifactor model, in which a general reactance factor explained common variance among all the items and specific factors explained shared residual variance among sets of items, was championed.
Abstract: The Hong Psychological Reactance Scale (HPRS) purports to measure reactance: a motivational state experienced when a behavioral freedom is threatened with elimination. To date, five studies have examined the psychometric properties of the HPRS, but reached different conclusions regarding its factor structure. The current study further investigated the factor structure of the HPRS by testing four competing models using responses from 1,282 college students. A modified bifactor model, in which a general reactance factor explained common variance among all the items and specific factors explained shared residual variance among sets of items, was championed. Implications for estimating reliability and scoring the HPRS are discussed.

Journal ArticleDOI
TL;DR: Results indicate that the predicted standard error reduction (PSER) stopping rule makes efficient use of CAT item pools, administering fewer items when predictive gains in information are small and increasing measurement precision when information is abundant.
Abstract: The goal of the current study was to introduce a new stopping rule for computerized adaptive testing. The predicted standard error reduction stopping rule (PSER) uses the predictive posterior variance to determine the reduction in standard error that would result from the administration of additional items. The performance of the PSER was compared to that of the minimum standard error stopping rule and a modified version of the minimum information stopping rule in a series of simulated adaptive tests, drawn from a number of item pools. Results indicate that the PSER makes efficient use of CAT item pools, administering fewer items when predictive gains in information are small and increasing measurement precision when information is abundant.
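The idea behind a predicted-standard-error-reduction rule can be sketched as follows (a simplified stand-in for the PSER, assuming a normal posterior whose variance shrinks by the Fisher information of the next item; function names are illustrative):

```python
import math

def predicted_se_reduction(posterior_var, item_info):
    # Administering an item with Fisher information `item_info` would shrink
    # the posterior variance to 1/(1/var + info) under a normal approximation.
    new_var = 1.0 / (1.0 / posterior_var + item_info)
    return math.sqrt(posterior_var) - math.sqrt(new_var)

def should_stop(posterior_var, best_remaining_info, threshold=0.01):
    # Stop the adaptive test when even the most informative remaining item
    # would barely reduce the standard error.
    return predicted_se_reduction(posterior_var, best_remaining_info) < threshold
```

With a highly informative item still available the test continues; once only weakly informative items remain, the predicted gain falls below the threshold and the test ends early, which matches the behavior the abstract reports.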

Journal ArticleDOI
TL;DR: In this article, the authors elaborates the Rasch differential item functioning (DIF) model formulation under the marginal maximum likelihood estimation context, and the model performance was examined an an...
Abstract: This study elaborates the Rasch differential item functioning (DIF) model formulation under the marginal maximum likelihood estimation context. Also, the Rasch DIF model performance was examined an...

Journal ArticleDOI
TL;DR: As discussed in this paper, the assessment of preschool learning behavior has become very popular as a mechanism to inform cognitive development and promote successful interventions, and the most widely used measures offer sound pred...
Abstract: Assessment of preschool learning behavior has become very popular as a mechanism to inform cognitive development and promote successful interventions. The most widely used measures offer sound pred...

Journal ArticleDOI
TL;DR: The maximum likelihood estimation and maximum a posteriori classification methods are compared with the expected a posteriori method for cognitive diagnosis models, which are special cases of restricted latent class models.
Abstract: Cognitive diagnosis models have received much attention in the recent psychometric literature because of their potential to provide examinees with information regarding multiple fine-grained discretely defined skills, or attributes. This article discusses the issue of methods of examinee classification for cognitive diagnosis models, which are special cases of restricted latent class models. Specifically, the maximum likelihood estimation and maximum a posteriori classification methods are compared with the expected a posteriori method. A simulation study using the Deterministic Input, Noisy-And model is used to assess the classification accuracy of the methods using various criteria.
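For a restricted latent class model the three classifiers differ only in what they maximize or average over; a minimal sketch with hypothetical numbers (not the DINA simulation above) shows that MAP and EAP can disagree:

```python
def classify(likelihoods, priors):
    """likelihoods/priors: dicts mapping attribute patterns (0/1 tuples) to
    L(data | pattern) and prior probability. Returns (MLE, MAP, EAP)."""
    posterior = {c: likelihoods[c] * priors[c] for c in likelihoods}
    total = sum(posterior.values())
    posterior = {c: v / total for c, v in posterior.items()}
    mle = max(likelihoods, key=likelihoods.get)   # ignores the prior
    map_ = max(posterior, key=posterior.get)      # most probable whole pattern
    # EAP: threshold each attribute's marginal posterior probability at .5.
    n_attr = len(next(iter(posterior)))
    eap = tuple(int(sum(p for c, p in posterior.items() if c[k]) > 0.5)
                for k in range(n_attr))
    return mle, map_, eap

L = {(0, 0): 0.30, (0, 1): 0.05, (1, 0): 0.30, (1, 1): 0.40}
prior = {c: 0.25 for c in L}
mle, map_, eap = classify(L, prior)
# MAP picks the single most probable pattern (1, 1); EAP, averaging over
# patterns attribute by attribute, yields (1, 0) -- the methods can disagree.
```

This divergence between whole-pattern and attribute-wise classification is precisely what makes the comparison of accuracy criteria in the simulation study interesting.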

Journal ArticleDOI
TL;DR: In this article, the authors developed three school bullying scales (the Bully Scale, the Victim Scale, and the Witness Scale) to assess secondary school students' bullying behaviors, including physical bullying, verbal bullying, relational bullying, and cyber bullying.
Abstract: The study aims to develop three school bullying scales—the Bully Scale, the Victim Scale, and the Witness Scale—to assess secondary school students’ bullying behaviors, including physical bullying, verbal bullying, relational bullying, and cyber bullying. The items of the three scales were developed from viewpoints of bullies, victims, and witnesses. Two samples of Taiwanese secondary students participated in the test development, one for item revision and the other for validation. Samples 1 and 2 consisted of 860 and 3,941 students, respectively. Rasch techniques were applied to assess model-data fit of the three scales. The results indicated that the correlations of person measures from viewpoints of bullies, victims, and witnesses were between .78 and .83; the person separation reliabilities of the measures from the Bully Scale, the Victim Scale, and the Witness Scale were .86, .87, and .94, respectively. The person measures from the three scales were positively correlated with external variables—anti...

Journal ArticleDOI
TL;DR: The authors examined the impact of response formats on item attributes of a language awareness test applying different item response theory models, and found that although the test contains items with different response formats, only one latent trait is measured; no format-specific dimensions were found.
Abstract: In aptitude and achievement tests, different response formats are usually used. A fundamental distinction must be made between the class of multiple-choice formats and the constructed response formats. Previous studies have examined the impact of different response formats applying traditional statistical approaches, but these influences can also be studied using methods of item response theory to deal with incomplete data. Response formats can influence item attributes in two ways: different response formats could cause items to measure different latent traits or they could contribute differently to item difficulty. In contrast to previous research, the present study examines the impact of response formats on item attributes of a language awareness test applying different item response theory models. Results indicate that although the language awareness test contains items with different response formats, only one latent trait is measured; no format-specific dimensions were found. Response formats do, ho...

Journal ArticleDOI
TL;DR: The recent flurry of articles on formative measurement, particularly in the information systems literature, appears to be symptomatic of a much larger problem as mentioned in this paper, despite significant objections by met...
Abstract: The recent flurry of articles on formative measurement, particularly in the information systems literature, appears to be symptomatic of a much larger problem. Despite significant objections by met...

Journal ArticleDOI
TL;DR: The authors used item response theory (IRT) mixture models to detect latent groups and estimate the differential item functioning (DIF) caused by these latent groups, but the accuracy of model estimation has not been thoroughly explored.
Abstract: There is a long history of differential item functioning (DIF) detection methods for known, manifest grouping variables, such as sex or ethnicity. But if the experiences or cognitive processes leading to DIF are not perfectly correlated with the manifest groups, it would be more informative to uncover the latent groups underlying DIF. The use of item response theory (IRT) mixture models to detect latent groups and estimate the DIF caused by these latent groups has been explored/interpreted with real data sets, but the accuracy of model estimation has not been thoroughly explored. The purpose of this simulation research was to assess the accuracy of the recovery of classes, item parameters, and DIF effects in contexts where relatively small clusters of items showed DIF. Overall, the results from the study reveal that the use of IRT mixture models for latent DIF detection may be problematic. Class membership recovery was poor in all conditions tested. Discrimination parameters were estimated well for the in...

Journal ArticleDOI
TL;DR: This article used a nominal response item response theory model to estimate category boundary discrimination (CBD) parameters for items drawn from the Emotional Distress item pools (Depression, Anxiety, and Anger) developed in the Patient-Reported Outcomes Measurement Information Systems (PROMIS) project.
Abstract: The authors used a nominal response item response theory model to estimate category boundary discrimination (CBD) parameters for items drawn from the Emotional Distress item pools (Depression, Anxiety, and Anger) developed in the Patient-Reported Outcomes Measurement Information Systems (PROMIS) project. For polytomous items with ordered response categories, CBD parameters index the degree to which a particular dichotomous distinction (e.g., a response in category two vs. one) discriminates trait levels. Findings indicated that 25 of the 86 PROMIS items displayed statistically significant (p ≤ .05) within-item variation in CBD parameters as judged by a Wald test. The most common finding was that the CBD parameters for the first (never vs. rarely) and last (sometimes vs. often/always) response distinctions were higher than the CBD parameter for the remaining distinction (rarely vs. sometimes). The implications of significant CBD variation for model choice, scale analysis, and for scoring individual differences are reviewed.
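Under one common parameterization of the nominal response model, the boundary between adjacent categories discriminates according to the difference of their category slope parameters; a sketch with illustrative numbers (not PROMIS estimates, and assuming that parameterization):

```python
def cbd_parameters(category_slopes):
    """Category boundary discriminations as differences of adjacent category
    slopes (one common nominal-model parameterization)."""
    return [b - a for a, b in zip(category_slopes, category_slopes[1:])]

# A 4-category item (never / rarely / sometimes / often-always) whose first
# and last boundaries discriminate more sharply than the middle one -- the
# pattern most often reported above.
slopes = [0.0, 1.5, 2.0, 3.5]
boundaries = cbd_parameters(slopes)  # [1.5, 0.5, 1.5]
```

If all three boundary discriminations were equal, the item would reduce to a model with a single slope; within-item variation like this is what the Wald test in the study detects.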

Journal ArticleDOI
TL;DR: In this paper, the relationship between the probability of the given response and the response latency was investigated, and it was found that the quantities are negatively related and that more probable responses are given faster.
Abstract: Recent studies have revealed a relation between the given response and the response latency for personality questionnaire items in the form of an inverted-U effect, which has been interpreted in light of schema-driven behavior. In general, more probable responses are given faster. In the present study, the relationship between the probability of the given response and the response latency was investigated. First, a probabilistic model was introduced describing the relationship between response latencies and a latent trait. Second, the model was applied in an empirical study: Employing items from a personality questionnaire and using data from 170 men, the probabilities of responses were estimated based on the Rasch model. Assuming log-normally distributed response latencies, a linear regression model was fit to the logarithmized response latencies, including the response probability as a predictor. Findings suggested that the quantities are negatively related. This relation can be used to incorporate the re...
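The regression step described above can be sketched with ordinary least squares on logarithmized latencies (synthetic data; the actual study derives the response probabilities from a Rasch model):

```python
import math

def fit_log_latency(probs, latencies):
    """OLS of log latency on response probability; returns (slope, intercept).
    A negative slope means more probable responses are given faster."""
    y = [math.log(t) for t in latencies]
    mx = sum(probs) / len(probs)
    my = sum(y) / len(y)
    sxx = sum((x - mx) ** 2 for x in probs)
    sxy = sum((x - mx) * (v - my) for x, v in zip(probs, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Latencies that shrink as the response becomes more probable:
slope, intercept = fit_log_latency([0.1, 0.5, 0.9],
                                   [math.e ** 2, math.e ** 1.5, math.e ** 1])
# The slope comes out negative, mirroring the negative relation reported above.
```

The log transform is what the log-normal latency assumption buys: it turns a multiplicative speed effect into a linear regression problem.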

Journal ArticleDOI
TL;DR: In this paper, a conceptual framework for investigating measurement invariance, specifically differential item functioning within the context of assessing students with disabilities, is described, and the hierarchical generalized linear model, an explanatory model that incorporates item response models into hierarchical models in multilevel settings, is explored.
Abstract: To address whether or not modifications in test administration influence item functioning for students with disabilities on a high-stakes statewide problem-solving assessment, a sample of 868 students (with and without disabilities) from 74 Georgia schools were randomly assigned to one of three testing conditions (resource guide, calculator, or standard administration). The authors describe a conceptual framework for investigating measurement invariance, specifically differential item functioning within the context of assessing students with disabilities. Specifically, we illustrate how the hierarchical generalized linear model, an explanatory model that incorporates item response models into hierarchical models in multilevel settings, can be used to explore issues of measurement invariance traditionally addressed using descriptive item response models, that is, the many-facet Rasch model. Results obtained from the two approaches are reported and discussed.

Journal ArticleDOI
TL;DR: This paper examined discrepant high school grade point average (HSGPA) and SAT performance as measured by the difference between a student's standardized SAT composite score and standardized HSGPA.
Abstract: This study examined discrepant high school grade point average (HSGPA) and SAT performance as measured by the difference between a student’s standardized SAT composite score and standardized HSGPA....

Journal ArticleDOI
TL;DR: In this article, the usefulness of multidimensional adaptive testing (MAT) for the assessment of student literacy in the Programme for International Student Assessment (PISA) was examined within a real data simulation study.
Abstract: The usefulness of multidimensional adaptive testing (MAT) for the assessment of student literacy in the Programme for International Student Assessment (PISA) was examined within a real data simulation study. The responses of N = 14,624 students who participated in the PISA assessments of the years 2000, 2003, and 2006 in Germany were used to simulate MAT with different restrictions (unrestricted, treatment of link items, treatment of open items, content balance, unitwise item selection, all restrictions). Compared with conventional testing based on the booklet design of PISA 2006, unrestricted MAT increases measurement efficiency by 74% and reduces the average number of presented items from 55 to 26 without a loss in measurement precision. The incorporation of restrictions reduces the advantages of MAT. MAT is recommended for the assessment of newly introduced constructs but not for the assessment of the literacy domains in PISA.

Journal ArticleDOI
TL;DR: In this paper, the accuracy of examinee classification into performance categories and the estimation of the theta parameter for several item response theory (IRT) scaling techniques when applied to six administrations of a test were investigated.
Abstract: This article investigates the accuracy of examinee classification into performance categories and the estimation of the theta parameter for several item response theory (IRT) scaling techniques when applied to six administrations of a test. Previous research has investigated only two administrations; however, many testing programs equate tests across multiple administrations. As such, this article seeks to examine the long-term sustainability of IRT scaling methods. Three different types of shifts in the ability distribution were examined: no change, a mean shift, and a change in skewness. Haebara, Stocking and Lord, mean-sigma, mean-mean, and fixed common item parameter (FCIP) scaling were compared relative to bias, root mean square error, and classification of examinees into performance categories. Results indicate that FCIP may be the most suitable for complex changes in examinee performance, whereas the methods performed quite similarly for simple changes.
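Of the scaling methods compared above, mean-sigma has the simplest closed form: two constants rescale one administration's difficulty parameters onto another's scale. A minimal sketch (conventions for which form serves as the base vary across texts):

```python
from statistics import mean, pstdev

def mean_sigma_link(b_base, b_new):
    """Mean-sigma linking constants A, B placing new-form difficulties on the
    base scale via b* = A * b_new + B (population standard deviations)."""
    A = pstdev(b_base) / pstdev(b_new)
    B = mean(b_base) - A * mean(b_new)
    return A, B

# New-form difficulties twice as spread out as the base form, same center:
A, B = mean_sigma_link([-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0])
# A halves the spread (0.5) and B leaves the center alone (0.0).
```

Because only the mean and standard deviation of the common-item difficulties enter the transformation, mean-sigma is exactly the kind of method that can drift over six chained administrations, which is the long-term question the article examines.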

Journal ArticleDOI
TL;DR: In this article, a model capable of estimating a mixture partial credit model using joint maximum likelihood is presented, where step parameters are constrained to be equal across items, making the model a mixture rating scale model.
Abstract: This research provides a demonstration of the utility of mixture Rasch models. Specifically, a model capable of estimating a mixture partial credit model using joint maximum likelihood is presented. Like the partial credit model, the mixture partial credit model has the beneficial feature of being appropriate for analysis of assessment data containing any combination of dichotomous and polytomous item types. Mixture Rasch models are able to provide information regarding latent classes (subpopulations without manifest grouping variables) and separate item parameter estimates for each of these latent classes. In this research, the step parameters were constrained to be equal across items, making the model a mixture rating scale model. An analysis with simulated data provides a clear example demonstration followed by a real-world analysis and interpretation of student survey data.

Journal ArticleDOI
TL;DR: In this article, the Type I error rate and power of the multivariate extension of the S − χ2 statistic were investigated using unidimensional and multidimensional item response theory (UIRT and MIRT).
Abstract: This study investigated the Type I error rate and power of the multivariate extension of the S − χ2 statistic using unidimensional and multidimensional item response theory (UIRT and MIRT, respecti...

Journal ArticleDOI
TL;DR: In this paper, the authors used analysis of covariance and provided methods for computing power within an optimal design framework that incorporates costs of units at different levels and covariate effects for three-level cluster randomized balanced designs.
Abstract: Field experiments with nested structures assign entire groups such as schools to treatment and control conditions. Key aspects of such cluster randomized experiments include knowledge of the intraclass correlation structure and the sample sizes necessary to achieve adequate power to detect the treatment effect. The units at each level of the hierarchy have a cost associated with them, however, and thus, researchers need to take budget and costs into account when designing their studies. This article uses analysis of covariance and provides methods for computing power within an optimal design framework that incorporates costs of units at different levels and covariate effects for three-level cluster randomized balanced designs. The optimal sample sizes are a function of the variances at each level and the cost of each unit. Overall, the results suggest that when units at higher levels become more expensive, the researcher should sample units at lower levels. The covariates affect the sampling of units and ...
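The cost trade-off described above has a well-known closed form in the simpler two-level case, which conveys the same intuition (a sketch under standard optimal-design assumptions; the article itself treats three levels with covariates):

```python
import math

def optimal_cluster_size(cluster_cost, unit_cost, icc):
    """Optimal number of lower-level units per cluster for a two-level
    cluster-randomized design: grows with the cluster/unit cost ratio,
    shrinks as the intraclass correlation (icc) rises."""
    return math.sqrt((cluster_cost / unit_cost) * (1 - icc) / icc)

# Expensive clusters, cheap students, modest ICC:
n = optimal_cluster_size(cluster_cost=100, unit_cost=1, icc=0.20)
# -> 20 students per school; as higher-level units grow more expensive
# relative to lower-level ones, sample more units at the lower level,
# which matches the conclusion reported above.
```

Covariates enter by reducing the residual variances at each level, which shifts this optimum, one of the effects the three-level framework in the article quantifies.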