
Showing papers in "Educational and Psychological Measurement" (2019)


Journal ArticleDOI
TL;DR: The results showed that the effect of p on the population CFI and TLI depended on the type of specification error, whereas a higher p was associated with lower values of the population RMSEA regardless of the type of model misspecification.
Abstract: This study investigated the effect the number of observed variables (p) has on three structural equation modeling indices: the comparative fit index (CFI), the Tucker-Lewis index (TLI), and the root mean square error of approximation (RMSEA). The behaviors of the population fit indices and their sample estimates were compared under various conditions created by manipulating the number of observed variables, the types of model misspecification, the sample size, and the magnitude of factor loadings. The results showed that the effect of p on the population CFI and TLI depended on the type of specification error, whereas a higher p was associated with lower values of the population RMSEA regardless of the type of model misspecification. In finite samples, all three fit indices tended to yield estimates that suggested a worse fit than their population counterparts, which was more pronounced with a smaller sample size, higher p, and lower factor loading.

323 citations
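
As a companion to this abstract, here is a minimal sketch of how such fit indices are typically obtained in practice, using the lavaan R package and its built-in HolzingerSwineford1939 data (both are illustrative choices, not the study's own simulation setup):

```r
# Fit a three-factor CFA and extract the three indices discussed above.
library(lavaan)

model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
'
fit <- cfa(model, data = HolzingerSwineford1939)

# Sample estimates of CFI, TLI, and RMSEA; the study compares such
# estimates against their population counterparts as p, N, and loadings vary.
fitMeasures(fit, c("cfi", "tli", "rmsea"))
```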


Journal ArticleDOI
TL;DR: A procedure that can be used to evaluate the variance inflation factors and tolerance indices in linear regression models is discussed, which allows more informed evaluation of these quantities when addressing multicollinearity-related issues in empirical research using regression models.
Abstract: A procedure that can be used to evaluate the variance inflation factors and tolerance indices in linear regression models is discussed. The method permits both point and interval estimation of these factors and indices associated with explanatory variables considered for inclusion in a regression model. The approach makes use of popular latent variable modeling software to obtain these point and interval estimates. The procedure allows more informed evaluation of these quantities when addressing multicollinearity-related issues in empirical research using regression models. The method is illustrated on an empirical example using the popular software Mplus. Results of a simulation study investigating the capabilities of the procedure are also presented.

124 citations
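
For readers who want the point estimates without latent variable software, a minimal base R sketch of the underlying definitions (VIF_j = 1/(1 - R_j^2) and tolerance = 1 - R_j^2; the simulated predictors are hypothetical, and the paper's interval estimation via Mplus is not reproduced here):

```r
set.seed(1)
n  <- 200
x2 <- rnorm(n)
x1 <- 0.6 * x2 + rnorm(n)   # x1 deliberately correlated with x2
x3 <- rnorm(n)

# Regress the predictor of interest on the remaining predictors
r2_x1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)   # variance inflation factor for x1
tol_x1 <- 1 - r2_x1         # tolerance index for x1
c(VIF = vif_x1, tolerance = tol_x1)
```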


Journal ArticleDOI
TL;DR: The article highlights the fact that, as an index aimed at informing about the reliability of a multiple-component measuring instrument, coefficient alpha is a dependable reliability estimator under certain empirical conditions, and it should remain in service when these conditions are fulfilled rather than be abandoned.
Abstract: This note discusses the merits of coefficient alpha and the conditions under which they hold, in light of recent critical publications that miss out on significant research findings from the past several decades. That earlier research demonstrated the empirical relevance and utility of coefficient alpha under certain empirical circumstances. The article highlights the fact that, under these conditions, coefficient alpha is a dependable index of multiple-component measuring instrument reliability. Therefore, alpha should remain in service when these conditions are fulfilled and not be abandoned.

111 citations
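
A minimal base R sketch of the coefficient under discussion, computed from its textbook definition on simulated item scores (the data are hypothetical):

```r
# alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)
set.seed(2)
items <- replicate(5, rnorm(100)) + rnorm(100)  # five items sharing a common component
k     <- ncol(items)
alpha <- (k / (k - 1)) *
  (1 - sum(apply(items, 2, var)) / var(rowSums(items)))
alpha
```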


Journal ArticleDOI
TL;DR: Three variants of Cohen's kappa that can handle missing data are presented, and it is recommended to use the kappa coefficient based on listwise deletion of units with missing ratings if it can be assumed that missingness is completely at random or not at random.
Abstract: Cohen's kappa coefficient is commonly used for assessing agreement between classifications of two raters on a nominal scale. Three variants of Cohen's kappa that can handle missing data are presented. Data are considered missing if one or both ratings of a unit are missing. We study how well the variants estimate the kappa value for complete data under two missing data mechanisms, namely missingness completely at random and a form of missingness not at random. The kappa coefficient considered in Gwet (Handbook of Inter-rater Reliability, 4th ed.) and the kappa coefficient based on listwise deletion of units with missing ratings were found to have virtually no bias and mean squared error if missingness is completely at random, and small bias and mean squared error if missingness is not at random. Furthermore, the kappa coefficient that treats missing ratings as a regular category appears to be rather heavily biased and has a substantial mean squared error in many of the simulations. Because it performs well and is easy to compute, we recommend using the kappa coefficient based on listwise deletion of units with missing ratings if it can be assumed that missingness is completely at random or not at random.

53 citations
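
A minimal base R sketch of the recommended variant, Cohen's kappa after listwise deletion of units with a missing rating (the two rating vectors are hypothetical):

```r
kappa_listwise <- function(r1, r2) {
  keep <- complete.cases(r1, r2)        # drop units with any missing rating
  tab  <- table(r1[keep], r2[keep])
  po   <- sum(diag(tab)) / sum(tab)     # observed agreement
  pe   <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # chance agreement
  (po - pe) / (1 - pe)
}

r1 <- factor(c("a", "b", "a", NA,  "b", "a"), levels = c("a", "b"))
r2 <- factor(c("a", "b", "b", "a", NA,  "a"), levels = c("a", "b"))
kappa_listwise(r1, r2)
```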


Journal ArticleDOI
TL;DR: This research provides the necessary equations and shows how skewness can increase the precision with which locations of distributions can be estimated, and contrasts with a typical argument in favor of performing transformations to normalize skewed data for the sake of performing more efficient significance tests.
Abstract: Two recent publications in Educational and Psychological Measurement advocated that researchers consider using the a priori procedure. According to this procedure, the researcher specifies, prior to data collection, how close she wishes her sample mean(s) to be to the corresponding population mean(s), and the desired probability of being that close. A priori equations provide the necessary sample size to meet specifications under the normal distribution. Or, if sample size is taken as given, a priori equations provide the precision with which estimates of distribution means can be made. However, there is currently no way to perform these calculations under the more general family of skew-normal distributions. The present research provides the necessary equations. In addition, we show how skewness can increase the precision with which locations of distributions can be estimated. This conclusion, based on the perspective of improving sampling precision, contrasts with a typical argument in favor of performing transformations to normalize skewed data for the sake of performing more efficient significance tests.

42 citations
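
The paper derives closed-form equations; the following is only a simulation sketch of the core claim, using the sn R package (an assumed tool) with an arbitrary slant parameter: holding the scale parameter fixed, skewness shrinks the variance of the skew-normal distribution, so sample means cluster more tightly around their expectation.

```r
library(sn)
set.seed(3)
m_norm <- replicate(5000, mean(rnorm(25)))                      # normal, sd = 1
m_skew <- replicate(5000, mean(rsn(25, xi = 0, omega = 1, alpha = 5)))
c(sd_normal = sd(m_norm), sd_skewnormal = sd(m_skew))           # skew-normal means vary less
```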


Journal ArticleDOI
TL;DR: It is demonstrated that trait scores based on only equally keyed blocks can be improved substantially by measuring a sizable number of traits, and it is concluded that, in high-stakes situations where persons are motivated to give fake answers, Thurstonian IRT models should only be applied to tests measuring a sizable number of traits.
Abstract: Forced-choice questionnaires have been proposed to avoid common response biases typically associated with rating scale questionnaires. To overcome ipsativity issues of trait scores obtained from classical scoring approaches of forced-choice items, advanced methods from item response theory (IRT), such as the Thurstonian IRT model, have been proposed. For convenient model specification, we introduce the thurstonianIRT R package, which uses Mplus, lavaan, and Stan for model estimation. Based on practical considerations, we establish that items within one block need to be equally keyed to achieve similar social desirability, which is essential for creating forced-choice questionnaires that have the potential to resist faking intentions. According to extensive simulations, measuring up to five traits using blocks of only equally keyed items does not yield sufficiently accurate trait scores and inter-trait correlation estimates, whether frequentist or Bayesian estimation methods are used. As a result, persons' trait scores remain partially ipsative and, thus, do not allow for valid comparisons between persons. However, we demonstrate that trait scores based on only equally keyed blocks can be improved substantially by measuring a sizable number of traits. More specifically, in our simulations of 30 traits, scores based on only equally keyed blocks were non-ipsative and highly accurate. We conclude that in high-stakes situations where persons are motivated to give fake answers, Thurstonian IRT models should only be applied to tests measuring a sizable number of traits.

38 citations
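
A minimal base R sketch of the ipsativity problem that motivates the model (hypothetical ranks): under classical scoring of equally keyed forced-choice blocks, each person's trait scores sum to the same constant, so the scores only support within-person comparisons.

```r
set.seed(4)
n_blocks <- 12; n_traits <- 3
# one person: in each block, ranks 1..3 are distributed across the three traits
ranks  <- t(replicate(n_blocks, sample(1:n_traits)))
scores <- colSums(ranks)   # classical trait scores for this person
scores
sum(scores)                # = n_blocks * 6 for every person, i.e., fully ipsative
```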


Journal ArticleDOI
TL;DR: The utility of XGBoost in detecting examinees with potential item preknowledge is investigated using a real data set that includes examinees who engaged in fraudulent testing behavior, such as illegally obtaining live test content before the exam.
Abstract: Researchers frequently use machine-learning methods in many fields. In the area of detecting fraud in testing, there have been relatively few studies that have used these methods to identify potential item preknowledge.

27 citations
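
A minimal sketch of the kind of classifier the study investigates, using the xgboost R package on simulated placeholder features and labels (not the paper's operational data or feature set):

```r
library(xgboost)
set.seed(5)
X <- matrix(rnorm(500 * 4), ncol = 4)   # e.g., response-time/accuracy features
y <- rbinom(500, 1, plogis(X[, 1]))     # 1 = examinee with item preknowledge
dtrain <- xgb.DMatrix(data = X, label = y)
fit <- xgb.train(params = list(objective = "binary:logistic"),
                 data = dtrain, nrounds = 50)
head(predict(fit, X))                   # predicted preknowledge probabilities
```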


Journal ArticleDOI
TL;DR: The results suggest that when class separation is low, very large sample sizes may be needed to obtain stable results and it may often be necessary to consider a preponderance of evidence in latent class enumeration.
Abstract: Regression mixture models are a statistical approach used for estimating heterogeneity in effects. This study investigates the impact of sample size on regression mixture's ability to produce "stable" results.

25 citations
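
A minimal sketch of a regression mixture, fit with the flexmix R package (an assumed tool; the study does not prescribe software): the slope of x differs across two latent classes, and the mixture recovers class-specific coefficients.

```r
library(flexmix)
set.seed(6)
n <- 500
x <- rnorm(n)
class <- rbinom(n, 1, 0.5)                       # unobserved class membership
y <- ifelse(class == 1, 0.8 * x, -0.2 * x) + rnorm(n)
d <- data.frame(x, y)
fit <- flexmix(y ~ x, data = d, k = 2)
parameters(fit)                                  # class-specific intercepts and slopes
```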


Journal ArticleDOI
TL;DR: Visual fixation, an essential eye-tracking indicator, is modeled to reflect the degree of test engagement when a test taker solves a set of test questions.
Abstract: With the development of technology-enhanced learning platforms, eye-tracking biometric indicators can be recorded simultaneously with students' item responses. In the current study, visual fixation, an essential eye-tracking indicator, is modeled to reflect the degree of test engagement when a test taker solves a set of test questions. Three negative binomial regression models are proposed for modeling the visual fixation counts of test takers solving a set of items. These models follow a structure similar to the lognormal response time model and the two-parameter logistic item response model. The proposed modeling structures include individualized latent person parameters reflecting the level of engagement of each test taker and two item parameters indicating the visual attention intensity and discriminating power of each test item. A Markov chain Monte Carlo estimation method is implemented for parameter estimation. Real data are fitted to the three proposed models, and the results are discussed.

21 citations
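
A minimal sketch of a negative binomial regression for fixation counts, the model family used above, with simulated data and MASS as an assumed tool (the paper's models add latent person and item parameters estimated via MCMC, which are not reproduced here):

```r
library(MASS)
set.seed(7)
engagement <- rnorm(300)                                   # stand-in person covariate
counts <- rnegbin(300, mu = exp(1 + 0.5 * engagement), theta = 2)
fit <- glm.nb(counts ~ engagement)
coef(fit)                                                  # approximately (1, 0.5)
```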


Journal ArticleDOI
TL;DR: Fit indices for FSR are proposed that can be used to inspect model fit and to draw inferences about the estimators of the regression coefficients, and a model comparison test based on one of these newly proposed fit indices is introduced.
Abstract: Factor score regression (FSR) is a popular alternative for structural equation modeling. Naively applying FSR induces bias for the estimators of the regression coefficients. Croon proposed a method to correct for this bias.

20 citations
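
A minimal lavaan sketch of the naive FSR baseline whose bias motivates the corrections above (simulated data with a population slope of 0.5): factor scores are estimated and then regressed as if they were observed variables, which generally biases the structural coefficient.

```r
library(lavaan)
set.seed(8)
pop <- '
  fx =~ 0.7*x1 + 0.7*x2 + 0.7*x3
  fy =~ 0.7*y1 + 0.7*y2 + 0.7*y3
  fy ~ 0.5*fx
'
d   <- simulateData(pop, sample.nobs = 500)
fit <- cfa('fx =~ x1 + x2 + x3
            fy =~ y1 + y2 + y3', data = d)
fs  <- as.data.frame(lavPredict(fit))   # factor score estimates
coef(lm(fy ~ fx, data = fs))            # naive FSR slope, biased relative to 0.5
```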


Journal ArticleDOI
TL;DR: Two types of IRTree models, descriptive and explanatory, were introduced and conceived under a larger modeling framework, called explanatory item response models, proposed by De Boeck and Wilson; the results suggested the presence of two distinct extreme response styles and an acquiescence response style in the scale.
Abstract: Item response tree (IRTree) models have recently been introduced as an approach to modeling response data from Likert-type rating scales. IRTree models are particularly useful for capturing a variety of individual response styles.

Journal ArticleDOI
TL;DR: The applicability of quantile regression to empirical work to estimate intervention effects is demonstrated using education data from a large-scale experiment, and the estimation of quantile treatment effects at various quantiles in the presence of dropouts is discussed.
Abstract: This study discusses quantile regression methodology and its usefulness in education and social science research. First, quantile regression is defined and its advantages vis-à-vis ordinary least squares regression are illustrated. Second, specific comparisons are made between ordinary least squares and quantile regression methods. Third, the applicability of quantile regression to empirical work to estimate intervention effects is demonstrated using education data from a large-scale experiment. The estimation of quantile treatment effects at various quantiles in the presence of dropouts is also discussed. Quantile regression is especially suitable for examining predictor effects at various locations of the outcome distribution (e.g., the lower and upper tails).
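
A minimal sketch of the approach with the quantreg R package (an assumed tool): a treatment whose effect differs across the outcome distribution is estimated at the lower tail, the median, and the upper tail.

```r
library(quantreg)
set.seed(9)
n <- 400
treat <- rbinom(n, 1, 0.5)
y <- 50 + 3 * treat + rnorm(n, sd = 10 + 5 * treat)  # effect varies over the distribution
fit <- rq(y ~ treat, tau = c(0.1, 0.5, 0.9))
coef(fit)   # treatment effects at the 10th, 50th, and 90th percentiles
```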

Journal ArticleDOI
TL;DR: The counterintuitive way in which the best prediction of a test taker’s latent ability depends on the factor loadings is highlighted, which means practitioners need to shift their focus to an interpretation which incorporates the structure of the model-based latent ability estimate.
Abstract: Factor loadings and item discrimination parameters play a key role in scale construction. A multitude of heuristics regarding their interpretation are hardwired into practice—for example, neglectin...

Journal ArticleDOI
TL;DR: This study aims to elucidate and illustrate an alternative response format and analytic technique, Thurstonian item response theory (IRT), for analyzing data from surveys using an alternate response format, the forced-choice format.
Abstract: One of the most cited methodological issues is with the response format, which is traditionally a single-response Likert response format. Therefore, our study aims to elucidate and illustrate an alternative response format and analytic technique, Thurstonian item response theory (IRT), for analyzing data from surveys using an alternate response format, the forced-choice format. Specifically, we strove to give a thorough introduction to Thurstonian IRT at a more elementary level than previous publications in order to widen the possible audience. This article presents analyses and a comparison of two versions of a self-report scale, one version using a single-response format and the other using a forced-choice format. Drawing from lessons learned from our study and the literature, we present a number of recommendations for conducting research using the forced-choice format and Thurstonian IRT, as well as suggested avenues for future research.

Journal ArticleDOI
TL;DR: It is suggested that it is possible to use common numeric and graphical indicators of DRF and rater misfit when raters exhibit both these effects, but that these effects may be difficult to distinguish using only numeric indicators.
Abstract: Rater effects, or raters’ tendencies to assign ratings to performances that are different from the ratings that the performances warranted, are well documented in rater-mediated assessments across ...

Journal ArticleDOI
TL;DR: When evaluating goodness-of-fit for ordinal CFA with many observed indicators, researchers should be cautious in interpreting the root mean square error of approximation, as this value appeared overly optimistic under misspecified conditions.
Abstract: A simulation study was conducted to investigate the model size effect when confirmatory factor analysis (CFA) models include many ordinal items. CFA models including between 15 and 120 ordinal items were analyzed with mean- and variance-adjusted weighted least squares to determine how varying sample size, number of ordered categories, and misspecification affect parameter estimates, standard errors of parameter estimates, and selected fit indices. As the number of items increased, the number of admissible solutions and accuracy of parameter estimates improved, even when models were misspecified. Also, standard errors of parameter estimates were closer to empirical standard deviation values as the number of items increased. When evaluating goodness-of-fit for ordinal CFA with many observed indicators, researchers should be cautious in interpreting the root mean square error of approximation, as this value appeared overly optimistic under misspecified conditions.
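
A minimal lavaan sketch of the analysis type studied (simulated five-category items, far fewer than the 15 to 120 items in the study): ordinal indicators are declared as ordered and fit with the WLSMV estimator, and the scaled fit indices are then inspected.

```r
library(lavaan)
set.seed(10)
latent <- rnorm(500)
# six 5-category items obtained by thresholding noisy copies of the factor
u <- sapply(1:6, function(j)
  as.integer(cut(0.7 * latent + rnorm(500),
                 breaks = c(-Inf, -1, 0, 0.8, 1.5, Inf))))
d <- as.data.frame(u); names(d) <- paste0("u", 1:6)
fit <- cfa('f =~ u1 + u2 + u3 + u4 + u5 + u6', data = d,
           ordered = names(d), estimator = "WLSMV")
fitMeasures(fit, c("rmsea.scaled", "cfi.scaled"))
```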

Journal ArticleDOI
TL;DR: It is argued that Chalmers’ critique of ordinal α, proposed in Zumbo et al. as a measure of test reliability in certain research settings, is unfounded.
Abstract: Chalmers recently published a critique of the use of ordinal α, proposed in Zumbo et al., as a measure of test reliability in certain research settings. In this response, we take up the task of refuting this critique.

Journal ArticleDOI
TL;DR: This article includes two simulation studies that test this empirical Q-matrix validation method under a wider range of conditions, with the aim of improving its generalizability and of empirically determining the most suitable EPS for the data conditions at hand.
Abstract: Cognitive diagnosis models (CDMs) are latent class multidimensional statistical models that help classify people accurately by using a set of discrete latent variables, commonly referred to as attributes.

Journal ArticleDOI
TL;DR: The procedures proposed are an FA extension of the “added-value” procedures initially proposed for subscale scores in educational testing, and the basic principle is that the multiple FA solution is defensible when the factor score estimates of the primary factors are better measures of these factors than score estimates derived from a unidimensional or second-order solution.
Abstract: Measures initially designed to be single-trait often yield data that are compatible with both an essentially unidimensional factor-analysis (FA) solution and a correlated-factors solution. For these measures, deciding which of these structures is the most appropriate and useful is of considerable importance.

Journal ArticleDOI
TL;DR: Three simulation studies were conducted to find out whether the effect of a time limit for testing impairs model fit in investigations of structural validity, whether the representation of the assumed source of the effect prevents impairment of model fit and whether it is possible to identify and discriminate this method effect from another method effect.
Abstract: The article reports three simulation studies conducted to find out whether the effect of a time limit for testing impairs model fit in investigations of structural validity, whether the representation of the assumed source of the effect prevents impairment of model fit, and whether it is possible to identify and discriminate this method effect from another method effect.

Journal ArticleDOI
TL;DR: The results suggest that the most accurate estimates can be obtained from the application of multiple group models for nonignorable missing values when the amounts of missing data and the missing data mechanisms change over time.
Abstract: Mechanisms causing item nonresponses in large-scale assessments are often said to be nonignorable. Parameter estimates can be biased if nonignorable missing data mechanisms are not adequately modeled. In trend analyses, it is plausible for the missing data mechanism and the percentage of missing values to change over time. In this article, we investigated (a) the extent to which the missing data mechanism and the percentage of missing values changed over time in real large-scale assessment data, (b) how different approaches for dealing with missing data performed under such conditions, and (c) the practical implications for trend estimates. These issues are highly relevant because the conclusions hold for all kinds of group mean differences in large-scale assessments. In a reanalysis of PISA (Programme for International Student Assessment) data from 35 OECD countries, we found that missing data mechanisms and numbers of missing values varied considerably across time points, countries, and domains. In a simulation study, we generated data in which we allowed the missing data mechanism and the amount of missing data to change over time. We showed that the trend estimates were biased if differences in the missing-data mechanisms were not taken into account, in our case, when omissions were scored as wrong, when omissions were ignored, or when model-based approaches assuming a constant missing data mechanism over time were used. The results suggest that the most accurate estimates can be obtained from the application of multiple group models for nonignorable missing values when the amounts of missing data and the missing data mechanisms changed over time. In an empirical example, we furthermore showed that the large decline in PISA reading literacy in Ireland in 2009 was reduced when we estimated trends using missing data treatments that accounted for changes in missing data mechanisms.

Journal ArticleDOI
TL;DR: This article proposes an external auxiliary procedure in which primary factor scores and general factor scores are related to relevant external variables and is assessed by means of a simulation study and its usefulness is illustrated with a real-data example in the personality domain.
Abstract: Many psychometric measures yield data that are compatible with (a) an essentially unidimensional factor analysis solution and (b) a correlated-factor solution. Deciding which of these structures is the most appropriate and useful is of considerable importance, and various procedures have been proposed to help in this decision. The only fully developed procedures available to date, however, are internal, and they use only the information contained in the item scores. In contrast, this article proposes an external auxiliary procedure in which primary factor scores and general factor scores are related to relevant external variables. Our proposal consists of two groups of procedures. The procedures in the first group (differential validity procedures) assess the extent to which the primary factor scores relate differentially to the external variables. Procedures in the second group (incremental validity procedures) assess the extent to which the primary factor scores yield predictive validity increments with respect to the single general factor scores. Both groups of procedures are based on a second-order structural model with latent variables from which new methodological results are obtained. The functioning of the proposal is assessed by means of a simulation study, and its usefulness is illustrated with a real-data example in the personality domain.

Journal ArticleDOI
TL;DR: The Bayes estimator appears to be a promising method for estimating categorical omega; its performance was investigated under a variety of conditions by manipulating the scale length, the number of response categories, the distributions of the categorical variables, the heterogeneity of thresholds across items, and the prior distributions for model parameters.
Abstract: When item scores are ordered categorical, categorical omega can be computed based on the parameter estimates from a factor analysis model using frequentist estimators such as diagonally weighted least squares. When the sample size is relatively small and thresholds are different across items, using diagonally weighted least squares can yield a substantially biased estimate of categorical omega. In this study, we applied Bayesian estimation methods for computing categorical omega. The simulation study investigated the performance of categorical omega under a variety of conditions through manipulating the scale length, number of response categories, distributions of the categorical variable, heterogeneities of thresholds across items, and prior distributions for model parameters. The Bayes estimator appears to be a promising method for estimating categorical omega. Mplus and SAS codes for computing categorical omega were provided.
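
For orientation, a minimal base R sketch of the generic omega formula that the estimators above target, with made-up standardized loadings (categorical omega proper is computed from the ordinal factor model, i.e., loadings and thresholds estimated from polychoric correlations, which this sketch omits):

```r
lambda <- c(0.6, 0.7, 0.8, 0.5)   # hypothetical standardized loadings
theta  <- 1 - lambda^2            # residual variances under standardization
omega  <- sum(lambda)^2 / (sum(lambda)^2 + sum(theta))
omega
```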

Journal ArticleDOI
TL;DR: An item response modeling procedure is discussed that can be used for point and interval estimation of the individual true score on any item in a measuring instrument or item set following the popular and widely applicable graded response model.
Abstract: This note highlights and illustrates the links between item response theory and classical test theory in the context of polytomous items. An item response modeling procedure is discussed that can be used for point and interval estimation of the individual true score on any item in a measuring instrument or item set following the popular and widely applicable graded response model. The method contributes to the body of research on the relationships between classical test theory and item response theory and is illustrated on empirical data.

Journal ArticleDOI
TL;DR: Three new findings are presented that suggest the original assumption of expectation-independence among predictors can be expanded to encompass many other joint distributions and that for many jointly distributed random variables, even some that enjoy considerable symmetry, the correlation between the centered main effects and their respective interaction can increase when compared with the correlation of the uncentered effects.
Abstract: Within the context of moderated multiple regression, mean centering is recommended both to simplify the interpretation of the coefficients and to reduce the problem of multicollinearity. For almost 30 years, theoreticians and applied researchers have advocated for centering as an effective way to reduce the correlation between variables and thus produce more stable estimates of regression coefficients. By reviewing the theory on which this recommendation is based, this article presents three new findings. First, that the original assumption of expectation-independence among predictors on which this recommendation is based can be expanded to encompass many other joint distributions. Second, that for many jointly distributed random variables, even some that enjoy considerable symmetry, the correlation between the centered main effects and their respective interaction can increase when compared with the correlation of the uncentered effects. Third, that the higher order moments of the joint distribution play as much of a role as lower order moments such that the symmetry of lower dimensional marginals is a necessary but not sufficient condition for a decrease in correlation between centered main effects and their interaction. Theoretical and simulation results are presented to help conceptualize the issues.
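
A minimal base R sketch of the classical motivation the article re-examines (independent skewed predictors, a favorable case): centering sharply lowers the correlation between a main effect and its interaction term. The article's point is that, for other joint distributions, this reduction is not guaranteed, and the centered correlation can even be larger.

```r
set.seed(11)
x <- rexp(10000); z <- rexp(10000)   # independent, positively skewed predictors
xc <- x - mean(x); zc <- z - mean(z)
c(uncentered = cor(x, x * z),        # substantial correlation
  centered   = cor(xc, xc * zc))     # near zero in this favorable case
```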

Journal ArticleDOI
TL;DR: This research first proposed an attribute-balanced item selection criterion, namely, the standardized weighted deviation global discrimination index (SWDGDI), and subsequently formulated the constrained progressive index (CP_SWDGDI) by casting the SWDGDI in a progressive algorithm.
Abstract: For item selection in cognitive diagnostic computerized adaptive testing (CD-CAT), ideally, a single item selection index should be created to simultaneously regulate precision, exposure status, and attribute balancing. For this purpose, in this study, we first proposed an attribute-balanced item selection criterion, namely, the standardized weighted deviation global discrimination index (SWDGDI), and subsequently formulated the constrained progressive index (CP_SWDGDI) by casting the SWDGDI in a progressive algorithm. A simulation study revealed that the SWDGDI method was effective in balancing attribute coverage and the CP_SWDGDI method was able to simultaneously balance attribute coverage and item pool usage while maintaining acceptable estimation precision. This research also demonstrates the advantage of a relatively low number of attributes in CD-CAT applications.

Journal ArticleDOI
TL;DR: This study has implications for researchers looking to apply recommended latent class analysis mixture modeling approaches in that nonnormality, which has not been fully considered in previous studies, was taken into account to address the distributional form of distal outcomes.
Abstract: The present study aims to compare the robustness under various conditions of latent class analysis mixture modeling approaches that deal with auxiliary distal outcomes. Monte Carlo simulations were employed to test the performance of four approaches recommended by previous simulation studies: maximum likelihood (ML) assuming homoskedasticity (ML_E), ML assuming heteroskedasticity (ML_U), BCH, and LTB. For all investigated simulation conditions, the BCH approach yielded the least biased estimates of class-specific distal outcome means. This study has implications for researchers looking to apply recommended latent class analysis mixture modeling approaches in that nonnormality, which has not been fully considered in previous studies, was taken into account to address the distributional form of distal outcomes.

Journal ArticleDOI
TL;DR: Results indicate that 20 is the minimum number of plausible values required to obtain point estimates of the IRT ability parameter that are comparable to marginal maximum likelihood estimation (MMLE)/expected a posteriori (EAP) estimates.
Abstract: Plausible values can be used to either estimate population-level statistics or compute point estimates of latent variables. While it is well known that five plausible values are usually sufficient for estimating population-level statistics, this study finds that 20 is the minimum number required to obtain point estimates of the IRT ability parameter comparable to marginal maximum likelihood/expected a posteriori estimates.
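
A deliberately simplified base R sketch of the point-estimate use at issue (plausible values are stylized here as posterior draws around the true ability, which ignores shrinkage and measurement model details): averaging more draws per examinee stabilizes the person-level point estimate.

```r
set.seed(12)
theta_true <- rnorm(1000)
draw_pv <- function(m)                    # mean of m plausible values per person
  rowMeans(replicate(m, theta_true + rnorm(1000, sd = 0.5)))
c(rmse_5  = sqrt(mean((draw_pv(5)  - theta_true)^2)),
  rmse_20 = sqrt(mean((draw_pv(20) - theta_true)^2)))
```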

Journal ArticleDOI
TL;DR: This article assessed the psychometric qualities of three PCV statistics that can be used in conjunction with principal axis factor analysis: the standard PCV statistic and two modifications of it. It concluded that practitioners can gain additional information from the modified statistic π̂_(SMC:k′+Λ̂) and make more nuanced decisions about the number of factors when R-PA fails to retain the correct number of factors.
Abstract: Past research suggests revised parallel analysis (R-PA) tends to yield relatively accurate results in determining the number of factors in exploratory factor analysis. R-PA can be interpreted as a ...
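
For context, a minimal sketch of ordinary parallel analysis with psych::fa.parallel (an assumed tool; R-PA and the PCV statistics above are refinements of this basic routine, not what fa.parallel implements):

```r
library(psych)
set.seed(13)
m <- matrix(rnorm(300 * 6), ncol = 6)
m[, 1:3] <- m[, 1:3] + rnorm(300)   # induce one common factor in items 1-3
fa.parallel(m, fa = "fa")           # suggests the number of factors to retain
```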

Journal ArticleDOI
TL;DR: The results suggest that unfolding models offer a useful way to evaluate rater-mediated assessments in order to initially explore the judgmental processes underlying the ratings.
Abstract: The purpose of this study is to explore the use of unfolding models for evaluating the quality of ratings obtained in rater-mediated assessments. Two different judgmental processes can be used to conceptualize ratings: impersonal judgments and personal preferences. Impersonal judgments are typically expected in rater-mediated assessments, and these ratings reflect a cumulative response process. However, raters may also be influenced by their personal preferences in providing ratings, and these ratings may reflect a noncumulative or unfolding response process. The goal of rater training in rater-mediated assessments is to stress impersonal judgments represented by scoring rubrics and to minimize the personal preferences that may represent construct-irrelevant variance in the assessment system. In this study, we explore the use of unfolding models as a framework for evaluating the quality of ratings in rater-mediated assessments. Data from a large-scale assessment of writing in the United States are used to illustrate our approach. The results suggest that unfolding models offer a useful way to evaluate rater-mediated assessments in order to initially explore the judgmental processes underlying the ratings. The data also indicate that there are significant relationships between some essay features (e.g., word count, syntactic simplicity, word concreteness, and verb cohesion) and essay orderings based on the personal preferences of raters. The implications of unfolding models for theory and practice in rater-mediated assessments are discussed.