
Showing papers in "Educational and Psychological Measurement in 2020"


Journal ArticleDOI
TL;DR: Estimation methods had substantial impacts on the RMSEA and CFI so that different cutoff values need to be employed for different estimators, and SRMR is robust to the method used to estimate the model parameters.
Abstract: We examined the effect of estimation methods, maximum likelihood (ML), unweighted least squares (ULS), and diagonally weighted least squares (DWLS), on three population SEM (structural equation modeling) fit indices: the root mean square error of approximation (RMSEA), the comparative fit index (CFI), and the standardized root mean square residual (SRMR). We considered different types and levels of misspecification in factor analysis models: misspecified dimensionality, omitting cross-loadings, and ignoring residual correlations. Estimation methods had substantial impacts on the RMSEA and CFI so that different cutoff values need to be employed for different estimators. In contrast, SRMR is robust to the method used to estimate the model parameters. The same criterion can be applied at the population level when using the SRMR to evaluate model fit, regardless of the choice of estimation method.
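The comparison is straightforward to reproduce at small scale. Below is a minimal sketch (not the authors' code) using lavaan's built-in HolzingerSwineford1939 data, fitting a deliberately underfactored model under each estimator and extracting the three indices:

```r
# Sketch: same misspecified model, three estimators, three fit indices
library(lavaan)

# One-factor model fit to indicators that plausibly span two factors
model <- 'f1 =~ x1 + x2 + x3 + x4 + x5 + x6'

fits <- sapply(c("ML", "ULS", "DWLS"), function(est) {
  fit <- cfa(model, data = HolzingerSwineford1939, estimator = est)
  fitMeasures(fit, c("rmsea", "cfi", "srmr"))
})
t(fits)  # RMSEA and CFI shift across estimators; SRMR stays nearly constant
```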

91 citations


Journal ArticleDOI
TL;DR: Results of the simulation demonstrated that fit index difference values outperformed parallel analysis (one of the most reliable extant retention methods) for categorical indicators, and for normally distributed indicators when factor loadings were small.
Abstract: Exploratory factor analysis (EFA) is widely used by researchers in the social sciences to characterize the latent structure underlying a set of observed indicator variables. One of the primary issues that must be resolved when conducting an EFA is determination of the number of factors to retain. There exist a large number of statistical tools designed to address this question, with none being universally optimal across applications. Recently, researchers have investigated the use of model fit indices that are commonly used in the conduct of confirmatory factor analysis to determine the number of factors to retain in EFA. These investigations have yielded mixed results, with the indices appearing to be effective when used in conjunction with normally distributed indicators, but not as effective for categorical indicators. The purpose of this simulation study was to compare the performance of difference values for several fit indices as a method for identifying the optimal number of factors to retain in an EFA, with parallel analysis, which is one of the most reliable such extant methods. Results of the simulation demonstrated that the use of fit index difference values outperformed parallel analysis for categorical indicators, and for normally distributed indicators when factor loadings were small. Implications of these findings are discussed.
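As a concrete illustration of the two approaches being compared, the sketch below (an assumption, not the authors' simulation code) contrasts parallel analysis with a fit-index difference rule using the psych package, where `items` stands in for a respondents-by-indicators data set:

```r
library(psych)

# Parallel analysis: compare observed eigenvalues to those of random data
fa.parallel(items, fa = "fa")

# Fit-index difference rule: fit EFA solutions of increasing size and keep
# adding factors while the improvement in RMSEA remains non-negligible
rmsea <- sapply(1:6, function(k) fa(items, nfactors = k, fm = "ml")$RMSEA[1])
diff(rmsea)  # successive RMSEA improvements
```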

57 citations


Journal ArticleDOI
TL;DR: This study compares two missing data procedures in the context of ordinal factor analysis models, pairwise deletion (PD; the default setting in Mplus) and multiple imputation (MI), examining which procedure yields parameter estimates and model fit indices closer to those of complete data.
Abstract: This study compares two missing data procedures in the context of ordinal factor analysis models: pairwise deletion (PD; the default setting in Mplus) and multiple imputation (MI). We examine which procedure demonstrates parameter estimates and model fit indices closer to those of complete data. The performance of PD and MI are compared under a wide range of conditions, including number of response categories, sample size, percent of missingness, and degree of model misfit. Results indicate that both PD and MI yield parameter estimates similar to those from analysis of complete data under conditions where the data are missing completely at random (MCAR). When the data are missing at random (MAR), PD parameter estimates are shown to be severely biased across parameter combinations in the study. When the percentage of missingness is less than 50%, MI yields parameter estimates that are similar to results from complete data. However, the fit indices (i.e., χ2, RMSEA, and WRMR) yield estimates that suggested a worse fit than results observed in complete data. We recommend that applied researchers use MI when fitting ordinal factor models with missing data. We further recommend interpreting model fit based on the TLI and CFI incremental fit indices.
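For orientation, the two procedures can be set up as follows in R (a hedged sketch rather than the Mplus setup referenced in the abstract), with `dat` a placeholder for an ordinal data set containing missing values:

```r
library(lavaan)
library(mice)

model <- 'f =~ y1 + y2 + y3 + y4'

# Pairwise deletion with a categorical estimator
fit_pd <- cfa(model, data = dat, ordered = paste0("y", 1:4),
              missing = "pairwise", estimator = "WLSMV")

# Multiple imputation: impute, fit each completed data set, then pool
imp  <- mice(dat, m = 20, printFlag = FALSE)
fits <- lapply(1:20, function(i)
  cfa(model, data = complete(imp, i),
      ordered = paste0("y", 1:4), estimator = "WLSMV"))
pooled <- rowMeans(sapply(fits, coef))  # naive pooling of point estimates
```

Proper pooling of standard errors and fit statistics requires Rubin's rules (e.g., via semTools), which the simple averaging above deliberately glosses over.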

37 citations


Journal ArticleDOI
TL;DR: How the use of different IER detection methods may affect psychometric properties such as predictive validity and reliability is demonstrated and recommendations and future research directions for those who suspect their data may contain responses reflecting careless, random, or biased responding are provided.
Abstract: Insufficient effort responding (IER) affects many forms of assessment in both educational and psychological contexts. Much research has examined different types of IER, IER's impact on the psychometric properties of test scores, and preprocessing procedures used to detect IER. However, there is a gap in the literature in terms of practical advice for applied researchers and psychometricians when evaluating multiple sources of IER evidence, including the best strategy or combination of strategies when preprocessing data. In this study, we demonstrate how the use of different IER detection methods may affect psychometric properties such as predictive validity and reliability. Moreover, we evaluate how different data cleansing procedures can detect different types of IER. We provide evidence via simulation studies and applied analysis using the ACT's Engage assessment as a motivating example. Based on the findings of the study, we provide recommendations and future research directions for those who suspect their data may contain responses reflecting careless, random, or biased responding.
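As a pointer for applied readers, several of the screening indices discussed in this literature are available in the R package careless; the snippet below is an illustrative sketch (the article evaluates combinations of such methods, not this exact code), with `responses` a placeholder matrix of Likert answers:

```r
library(careless)

flags <- data.frame(
  longstr = longstring(responses),                        # longest run of identical answers
  irv     = irv(responses),                               # intra-individual response variability
  mahad   = mahad(responses, plot = FALSE, flag = FALSE)  # multivariate outlyingness
)

# Screening rule with illustrative (not article-recommended) cutoffs
suspect <- which(flags$longstr > 10 | flags$irv < 0.5)
```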

33 citations


Journal ArticleDOI
TL;DR: Four methods for handling missing data when estimating ability parameters were compared: full-information maximum likelihood (FIML), zero replacement, and multiple imputation by chained equations using either classification and regression trees or random forest imputation.
Abstract: Large amounts of missing data could distort item parameter estimation and lead to biased ability estimates in educational assessments. Therefore, missing responses should be handled properly before...

31 citations


Journal ArticleDOI
TL;DR: In the authors' simulations, the proposed Croon FSR approaches outperformed methods that blindly assumed conditionally independent uniquenesses, performed comparably to a correctly specified SEM, and outperformed SEMs that correctly specified the unique factor covariances but misspecified the structural model.
Abstract: Recently, quantitative researchers have shown increased interest in two-step factor score regression (FSR) approaches to structural model estimation. A particularly promising approach proposed by C...

20 citations


Journal ArticleDOI
TL;DR: Response times retrieved from computerized testing are used to distinguish missing data due to lack of speed from missingness due to quitting, and a new model is presented that makes it possible to disentangle and simultaneously model the different missing data mechanisms underlying not-reached items.
Abstract: So far, modeling approaches for not-reached items have considered one single underlying process. However, missing values at the end of a test can occur for a variety of reasons. On the one hand, ex...

18 citations


Journal ArticleDOI
TL;DR: The prior sensitivity of BSEM-N was explored in factor analysis models with sparse loading structures through a simulation study and an empirical example; results indicated that shrinkage priors whose 95% credible intervals barely covered the population cross-loading values yielded the best balance between true and false positives.
Abstract: Bayesian structural equation modeling (BSEM) is a flexible tool for the exploration and estimation of sparse factor loading structures; that is, most cross-loading entries are zero and only a few important cross-loadings are nonzero. The current investigation focused on BSEM with small-variance normal distribution priors (BSEM-N) for both variable selection and model estimation. The prior sensitivity in BSEM-N was explored in factor analysis models with sparse loading structures through a simulation study (Study 1) and an empirical example (Study 2). Study 1 examined the prior sensitivity in BSEM-N based on the model fit, population model recovery, true and false positive rates, and parameter estimation. Seven shrinkage priors on cross-loadings and five noninformative/vague priors on other model parameters were examined. Study 2 provided a real data example to illustrate the impact of various priors on model fit and parameter selection and estimation. Results indicated that shrinkage priors whose 95% credible intervals barely covered the population cross-loading values yielded the best balance between true and false positives. If the goal is to perform variable selection, a sparse cross-loading structure is required, preferably with a minimal number of nontrivial cross-loadings and relatively high primary loading values. To improve parameter estimates, a relatively large prior variance is preferred. When cross-loadings are relatively large, BSEM-N with zero-mean priors is not recommended for the estimation of cross-loadings and factor correlations.
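The modeling idea also translates to open-source tools; the following blavaan sketch is an assumption about setup (BSEM work of this kind typically uses Mplus), placing small-variance normal priors, here N(0, sd = .1), i.e., variance .01, on the cross-loadings of a two-factor model:

```r
library(blavaan)  # loads lavaan, which provides HolzingerSwineford1939

model <- '
  f1 =~ x1 + x2 + x3 + prior("normal(0, .1)") * x4 +
        prior("normal(0, .1)") * x5 + prior("normal(0, .1)") * x6
  f2 =~ x4 + x5 + x6 + prior("normal(0, .1)") * x1 +
        prior("normal(0, .1)") * x2 + prior("normal(0, .1)") * x3
'
fit <- bcfa(model, data = HolzingerSwineford1939,
            n.chains = 2, burnin = 1000, sample = 1000)
summary(fit)  # inspect the 95% credible intervals of the cross-loadings
```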

16 citations


Journal ArticleDOI
TL;DR: This article reviews and analyzes the literature on nonidentifying codes and provides recommendations for researchers interested in using these types of codes in conducting anonymous longitudinal studies.
Abstract: Longitudinal studies are commonly used in the social and behavioral sciences to answer a wide variety of research questions. Longitudinal researchers often collect data anonymously from participant...

15 citations


Journal ArticleDOI
TL;DR: Automated scoring based on Latent Semantic Analysis (LSA) is used to score short answer responses to the Consequences Test, a measure of creativity and divergent thinking that encourages a wide range of potential responses.
Abstract: Automated scoring based on Latent Semantic Analysis (LSA) has been successfully used to score essays and constrained short answer responses. Scoring tests that capture open-ended, short answer responses poses some challenges for machine learning approaches. We used LSA techniques to score short answer responses to the Consequences Test, a measure of creativity and divergent thinking that encourages a wide range of potential responses. Analyses demonstrated that the LSA scores were highly correlated with conventional Consequences Test scores, reaching a correlation of .94 with human raters, and were moderately correlated with performance criteria. This approach to scoring short answer constructed responses solves many practical problems, including the time required for humans to rate open-ended responses and the difficulty of achieving reliable scoring.
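The core of the approach is a truncated SVD of a term-document matrix followed by cosine similarity to scored exemplars. The toy sketch below shows only that core (the study used trained semantic spaces calibrated against human raters):

```r
docs <- c(high = "clocks and watches would become useless objects",
          low  = "nothing would change at all",
          new  = "watches would lose their purpose")  # response to be scored

# Binary term-document matrix from a trivial whitespace tokenizer
terms <- unique(unlist(strsplit(docs, " ")))
tdm   <- sapply(docs, function(d) as.numeric(terms %in% strsplit(d, " ")[[1]]))

s    <- svd(tdm)                       # latent semantic space
k    <- 2                              # retained dimensions
dvec <- s$v[, 1:k] %*% diag(s$d[1:k])  # document coordinates in that space
rownames(dvec) <- colnames(tdm)

cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
cosine(dvec["new", ], dvec["high", ])  # similarity to a high-scored exemplar
```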

14 citations


Journal ArticleDOI
TL;DR: Overall, although accuracy in parameter estimation is sacrificed slightly with the proposed strategy, it can provide timely diagnostic feedback to practitioners, which is in line with the concept of “assessment for learning” and the needs of formative assessment.
Abstract: Timely diagnostic feedback is helpful for students and teachers, enabling them to adjust their learning and teaching plans according to a current diagnosis. Motivated by the practical concern that the simultaneity estimation strategy currently adopted by longitudinal learning diagnosis models does not provide timely diagnostic feedback, this study proposes a new Markov estimation strategy, which follows the Markov property. A simulation study was conducted to explore and compare the performance of four estimation strategies: the simultaneity, the Markov, the anchor-item, and the separated estimation strategies. The results show that their performance was highly consistent, with accuracy in the following relative order: simultaneity > Markov > anchor-item ≥ separated. Overall, although accuracy in parameter estimation is sacrificed slightly with the proposed strategy, it can provide timely diagnostic feedback to practitioners, which is in line with the concept of "assessment for learning" and the needs of formative assessment.

Journal ArticleDOI
TL;DR: Additional equations are proposed that expand the a priori procedure to handle differences between means, both in matched and in independent samples.
Abstract: Previous researchers have proposed the a priori procedure, whereby the researcher specifies, prior to data collection, how closely she wishes the sample means to approach corresponding population means, and the degree of confidence of meeting the specification. However, an important limitation of previous research is that researchers sometimes are interested in differences between means, rather than in the means themselves. To address this limitation, we propose additional equations that expand the a priori procedure to handle differences between means, both in matched and in independent samples. Finally, implications are discussed.
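The logic of the procedure is easy to state numerically. The sketch below encodes the single-mean equation from the earlier a priori work and, as an assumption consistent with this abstract, the independent-samples extension that follows from the doubled sampling variance of a mean difference (the matched-samples case additionally depends on the correlation between measures and is omitted):

```r
# f: desired closeness to the population mean, in population SD units
a_priori_n <- function(f, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  c(single_mean          = ceiling((z / f)^2),
    diff_indep_per_group = ceiling(2 * (z / f)^2))  # Var(xbar1 - xbar2) = 2*sigma^2/n
}
a_priori_n(f = 0.2)  # sample means within 0.2 SD of population means, 95% confidence
```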

Journal ArticleDOI
TL;DR: The Fréchet–Hoeffding bounds restrict the theoretical correlation range [−1, 1] such that certain correlation structures may be infeasible, and therefore coefficient alpha is bounded above depending on the shape of the distributions.
Abstract: Simulations concerning the distributional assumptions of coefficient alpha are contradictory. To provide a more principled theoretical framework, this article relies on the Fréchet–Hoeffding bounds in order to show that the distributions of the items play a role in the estimation of correlations and covariances. More specifically, these bounds restrict the theoretical correlation range [−1, 1] such that certain correlation structures may be infeasible. The direct implication of this result is that coefficient alpha is bounded above depending on the shape of the distributions. A general form of the Fréchet–Hoeffding bounds is derived for discrete random variables. R code and a user-friendly Shiny web application are also provided so that researchers can calculate the bounds on their data.
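The key fact is easy to demonstrate without any package: with the marginals fixed, the maximum (minimum) attainable correlation comes from the comonotonic (countermonotonic) coupling, which for equal-length samples amounts to sorting one variable with, or against, the other. A minimal base-R sketch (mirroring the idea behind the provided Shiny application, not its code):

```r
cor_bounds <- function(x, y) {
  c(lower = cor(sort(x), sort(y, decreasing = TRUE)),
    upper = cor(sort(x), sort(y)))
}

# Two oppositely skewed 5-point items: the attainable range shrinks well
# inside [-1, 1], so alpha computed from such items is bounded above too
x <- rep(1:5, times = c(70, 15, 8, 5, 2))
y <- rep(1:5, times = c(2, 5, 8, 15, 70))
cor_bounds(x, y)
```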

Journal ArticleDOI
TL;DR: The simulated annealing algorithm showed the best overall performance as well as robustness to model misspecification, while the genetic algorithm produced short forms with worse fit than the other algorithms under conditions with model misspecification.
Abstract: This study compares automated methods to develop short forms of psychometric scales. Obtaining a short form that has both adequate internal structure and strong validity with respect to relationships with other variables is difficult with traditional methods of short-form development. Metaheuristic algorithms can select items for short forms while optimizing on several validity criteria, such as adequate model fit, composite reliability, and relationship to external variables. Using a Monte Carlo simulation study, this study compared existing implementations of the ant colony optimization, Tabu search, and genetic algorithm to select short forms of scales, as well as a new implementation of the simulated annealing algorithm. Selection of short forms of scales with unidimensional, multidimensional, and bifactor structure were evaluated, with and without model misspecification and/or an external variable. The results showed that when the confirmatory factor analysis model of the full form of the scale was correctly specified or had only minor misspecification, the four algorithms produced short forms with good psychometric qualities that maintained the desired factor structure of the full scale. Major model misspecification resulted in worse performance for all algorithms, but including an external variable only had minor effects on results. The simulated annealing algorithm showed the best overall performance as well as robustness to model misspecification, while the genetic algorithm produced short forms with worse fit than the other algorithms under conditions with model misspecification.
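For intuition about the winning algorithm, here is a toy simulated annealing selector (a sketch of the general technique only; the article's implementation optimizes model fit, reliability, and external validity jointly rather than coefficient alpha alone):

```r
library(psych)

# items: a respondents-by-items data frame (placeholder); k: short-form length
sa_short_form <- function(items, k, iters = 2000, temp0 = 1) {
  score <- function(s) alpha(items[, s], check.keys = FALSE)$total$raw_alpha
  cur <- sample(ncol(items), k); cur_s <- score(cur)
  best <- cur; best_s <- cur_s
  for (i in seq_len(iters)) {
    cand <- cur
    cand[sample(k, 1)] <- sample(setdiff(seq_len(ncol(items)), cur), 1)  # swap one item
    cand_s <- score(cand)
    temp <- max(temp0 * (1 - i / iters), 1e-6)  # linear cooling schedule
    # Always accept improvements; accept worse candidates with a probability
    # that shrinks as the temperature drops
    if (cand_s > cur_s || runif(1) < exp((cand_s - cur_s) / temp)) {
      cur <- cand; cur_s <- cand_s
    }
    if (cur_s > best_s) { best <- cur; best_s <- cur_s }
  }
  list(items = sort(best), alpha = best_s)
}
```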

Journal ArticleDOI
TL;DR: More accurate performance of SMT over UIRT true-score equating was consistently observed across the studies, which supports the benefits of a multidimensional approach in equating for multidimensional data.
Abstract: A theoretical and conceptual framework for true-score equating using a simple-structure multidimensional item response theory (SS-MIRT) model is developed. A true-score equating method, referred to as the SS-MIRT true-score equating (SMT) procedure, also is developed. SS-MIRT has several advantages over other complex multidimensional item response theory models including improved efficiency in estimation and straightforward interpretability. The performance of the SMT procedure was examined and evaluated through four studies using different data types. In these studies, results from the SMT procedure were compared with results from four other equating methods to assess the relative benefits of SMT compared with the other procedures. In general, SMT showed more accurate equating results compared with the traditional unidimensional IRT (UIRT) equating when the data were multidimensional. More accurate performance of SMT over UIRT true-score equating was consistently observed across the studies, which supports the benefits of a multidimensional approach in equating for multidimensional data. Also, SMT performed similarly to a SS-MIRT observed score method across all studies.

Journal ArticleDOI
TL;DR: In this study, test-takers are classified as normal or aberrant using a mixture item response theory (IRT) modeling approach, and aberrant response behavior is described and modeled using item response trees (IRTrees).
Abstract: In educational assessments and achievement tests, test developers and administrators commonly assume that test-takers attempt all test items with full effort and leave no blank responses with unplanned missing values. However, aberrant response behavior (such as performance decline, dropping out beyond a certain point, and skipping certain items over the course of the test) is inevitable, especially for low-stakes assessments and speeded tests due to low motivation and time limits, respectively. In this study, test-takers are classified as normal or aberrant using a mixture item response theory (IRT) modeling approach, and aberrant response behavior is described and modeled using item response trees (IRTrees). Simulations are conducted to evaluate the efficiency and quality of the new class of mixture IRTree models using WinBUGS with Bayesian estimation. The results show that parameter recovery is satisfactory for the proposed mixture IRTree model and that treating missing values as ignorable or incorrect and ignoring possible performance decline results in biased estimation. Finally, the applicability of the new model is illustrated by means of an empirical example based on the Program for International Student Assessment.

Journal ArticleDOI
TL;DR: Results showed that combining multiple administrations’ worth of data via the Rasch model can lead to more accurate equating compared to classical methods designed to work well in small samples.
Abstract: Equating and scaling in the context of small sample exams, such as credentialing exams for highly specialized professions, has received increased attention in recent research. Investigators have pr...

Journal ArticleDOI
TL;DR: This research proposes a new approach that directly uses "participant-own-defined" missing item information (user missingness) in a zero-inflated Poisson model; it found that Confucian students had lower user missingness than their Western counterparts, irrespective of item position.
Abstract: In large-scale low-stake assessment such as the Programme for International Student Assessment (PISA), students may skip items (missingness) which are within their ability to complete. The detectio...

Journal ArticleDOI
TL;DR: It is argued that with adaptations, the alignment method is well suited for combining data across multiple sites even when they use different measurement instruments; the resulting parameter estimates may further inform development of more effective and efficient methods to assess the same constructs in prospectively designed studies.
Abstract: Large-scale studies spanning diverse project sites, populations, languages, and measurements are increasingly important to relate psychological to biological variables. National and international consortia already are collecting and executing mega-analyses on aggregated data from individuals, with different measures on each person. In this research, we show that Asparouhov and Muthén's alignment method can be adapted to align data from disparate item sets and response formats. We argue that with these adaptations, the alignment method is well suited for combining data across multiple sites even when they use different measurement instruments. The approach is illustrated using data from the Whole Genome Sequencing in Psychiatric Disorders consortium, and a real-data-based simulation is used to verify accurate parameter recovery. Factor alignment appears to increase precision of measurement and validity of scores with respect to external criteria. The resulting parameter estimates may further inform development of more effective and efficient methods to assess the same constructs in prospectively designed studies.

Journal ArticleDOI
TL;DR: Six missing data methods were compared and it was shown that no missing data method was always superior, yet random forest imputation performed best for the majority of conditions—in particular when parallel analysis was applied to the averaged correlation matrix rather than to each imputed data set separately.
Abstract: Exploratory factor analysis is a statistical method commonly used in psychological research to investigate latent variables and to develop questionnaires. Although such self-report questionnaires are prone to missing values, there is not much literature on this topic with regard to exploratory factor analysis, and especially the process of factor retention. Determining the correct number of factors is crucial for the analysis, yet little is known about how to deal with missingness in this process. Therefore, in a simulation study, six missing data methods (an expectation-maximization algorithm, predictive mean matching, Bayesian regression, random forest imputation, complete case analysis, and pairwise complete observations) were compared with respect to the accuracy of parallel analysis as the retention criterion. Data were simulated for correlated and uncorrelated factor structures with two, four, or six factors; 12, 24, or 48 variables; 250, 500, or 1,000 observations; and three different missing data mechanisms. Two different procedures for combining multiply imputed data sets were tested. The results showed that no missing data method was always superior, yet random forest imputation performed best for the majority of conditions, in particular when parallel analysis was applied to the averaged correlation matrix rather than to each imputed data set separately. Complete case analysis and pairwise complete observations were often inferior to multiple imputation.
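The best-performing pipeline reported above is short to express in code; the sketch below (assumed packages mice and psych, with `dat` a placeholder questionnaire data set) imputes with random forests, averages the correlation matrices, and runs parallel analysis once on the average:

```r
library(mice)
library(psych)

imp   <- mice(dat, m = 10, method = "rf", printFlag = FALSE)
cors  <- lapply(1:10, function(i) cor(complete(imp, i)))
R_avg <- Reduce(`+`, cors) / length(cors)  # averaged correlation matrix

fa.parallel(R_avg, n.obs = nrow(dat), fa = "fa")  # factor retention decision
```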

Journal ArticleDOI
TL;DR: This article compares the performance of two popular latent variable interaction modeling approaches in handling ordered-categorical indicators: unconstrained product indicator (UPI) and latent moderated structural equations (LMS).
Abstract: Methods to handle ordered-categorical indicators in latent variable interactions have been developed, yet they have not been widely applied. This article compares the performance of two popular latent variable interaction modeling approaches in handling ordered-categorical indicators: unconstrained product indicator (UPI) and latent moderated structural equations (LMS). We conducted a simulation study across sample sizes, indicators' distributions and category conditions. We also studied four strategies to create sets of product indicators for UPI. Results supported using a parceling strategy to create product indicators in the UPI approach or using the LMS approach when the categorical indicators are symmetrically distributed. We applied these models to study the interaction effect between third- to fifth-grade students' social skills improvement and teacher-student closeness on their state English language arts test scores.
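For readers unfamiliar with the UPI side of the comparison, semTools automates product-indicator construction; the sketch below (an illustration with hypothetical variables x1-x3, z1-z3, and outcome y, not the article's setup) uses matched, mean-centered, double-mean-centered products:

```r
library(lavaan)
library(semTools)

dat2 <- indProd(dat, var1 = c("x1", "x2", "x3"), var2 = c("z1", "z2", "z3"),
                match = TRUE, meanC = TRUE, doubleMC = TRUE)

model <- '
  X  =~ x1 + x2 + x3
  Z  =~ z1 + z2 + z3
  XZ =~ x1.z1 + x2.z2 + x3.z3   # matched product indicators from indProd
  y  ~ X + Z + XZ
'
fit <- sem(model, data = dat2)
```

With ordered-categorical indicators, as studied in the article, one would additionally declare them via lavaan's `ordered` argument.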

Journal ArticleDOI
TL;DR: A mixture model is introduced, using both response accuracy and response time information, to help differentiate non-effortful from effortful individuals and to improve item parameter estimation based on the effortful group.
Abstract: The responses of non-effortful test-takers may have serious consequences as non-effortful responses can impair model calibration and latent trait inferences. This article introduces a mixture model...

Journal ArticleDOI
TL;DR: This study presents new models for item response functions (IRFs) in the framework of the D-scoring method (DSM), which is gaining attention in the field of educational and psychological measurement and large-scale assessments; the models are referred to as rational function models (RFMs) with one parameter (RFM1), two parameters (RFM2), and three parameters (RFM3).
Abstract: This study presents new models for item response functions (IRFs) in the framework of the D-scoring method (DSM) that is gaining attention in the field of educational and psychological measurement ...

Journal ArticleDOI
TL;DR: A novel differential item functioning (DIF) method based on propensity score matching that tackles two challenges in analyzing performance assessment data, that is, continuous task scores and lack of a reliable internal variable as a proxy for ability or aptitude is introduced.
Abstract: This study introduces a novel differential item functioning (DIF) method based on propensity score matching that tackles two challenges in analyzing performance assessment data, that is, continuous...

Journal ArticleDOI
TL;DR: The results suggest that the two polytomous item explanatory models are methodologically and practically different in terms of the target difficulty parameters of polytomous items, which are explained by item properties.
Abstract: This study investigates polytomous item explanatory item response theory models under the multivariate generalized linear mixed modeling framework, using the linear logistic test model approach. Building on the original ideas of the many-facet Rasch model and the linear partial credit model, a polytomous Rasch model is extended to the item location explanatory many-facet Rasch model and the step difficulty explanatory linear partial credit model. To demonstrate the practical differences between the two polytomous item explanatory approaches, two empirical studies examine how item properties explain and predict the overall item difficulties or the step difficulties each in the Carbon Cycle assessment data and in the Verbal Aggression data. The results suggest that the two polytomous item explanatory models are methodologically and practically different in terms of (a) the target difficulty parameters of polytomous items, which are explained by item properties; (b) the types of predictors for the item properties incorporated into the design matrix; and (c) the types of item property effects. The potentials and methodological advantages of item explanatory modeling are discussed as well.

Journal ArticleDOI
TL;DR: The use of item response models in the context of a computerized cognitive task designed to assess visual working memory capacity in people with psychosis as well as healthy adults is explored.
Abstract: Although item response models have grown in popularity in many areas of educational and psychological assessment, there are relatively few applications of these models in experimental psychopathology. In this article, we explore the use of item response models in the context of a computerized cognitive task designed to assess visual working memory capacity in people with psychosis as well as healthy adults. We begin our discussion by describing how item response theory can be used to evaluate and improve unidimensional cognitive assessment tasks in various examinee populations. We then suggest how computerized adaptive testing can be used to improve the efficiency of cognitive task administration. Finally, we explore how these ideas might be extended to multidimensional item response models that better represent the complex response processes underlying task performance in psychopathological populations.

Journal ArticleDOI
TL;DR: This study expands on and demonstrates the performance of a new model-based estimate of WCPM (words correct per minute) based on a recently developed latent-variable psychometric model of speed and accuracy for ORF data.
Abstract: Oral reading fluency (ORF), used by teachers and school districts across the country to screen and progress monitor at-risk readers, has been documented as a good indicator of reading comprehension...

Journal ArticleDOI
TL;DR: Differential item functioning (DIF) analyses indicated that DIF procedures are a promising alternative for assessing the interrater reliability of constructed-response items and other polytomous item types, such as rating scales.
Abstract: The purpose of this study was to investigate a new way of evaluating interrater reliability that can allow one to determine if two raters differ with respect to their rating on a polytomous rating ...

Journal ArticleDOI
TL;DR: Via the examination of pooled data from student and teacher school climate surveys, this study lays the groundwork for future educational researchers interested in practical applications of the IDA framework to empirical data sets with complex model structures.
Abstract: Survey research frequently involves the collection of data from multiple informants. Results, however, are usually analyzed by informant group, potentially ignoring important relationships across groups. When the same construct(s) are measured, integrative data analysis (IDA) allows pooling of data from multiple sources into one data set to examine information from multiple perspectives within the same analysis. Here, the IDA procedure is demonstrated via the examination of pooled data from student and teacher school climate surveys. This study contributes to the sparse literature regarding IDA applications in the social sciences, specifically in education. It also lays the groundwork for future educational researchers interested in the practical applications of the IDA framework to empirical data sets with complex model structures.

Journal ArticleDOI
TL;DR: The main thesis of the present study is to use the Bayesian structural equation modeling (BSEM) methodology of establishing approximate measurement invariance (A-MI), using data from a national examination in Saudi Arabia, as an alternative when strong invariance criteria are not met.
Abstract: The main thesis of the present study is to use the Bayesian structural equation modeling (BSEM) methodology of establishing approximate measurement invariance (A-MI), using data from a national examination in Saudi Arabia, as an alternative when strong invariance criteria are not met. We illustrate how to account for the absence of measurement invariance using relative rather than exact criteria. A secondary goal was to compare latent means across groups using invariant parameters only. The A-MI protocol suggested equivalence of the thresholds using prior variances equal to 0.10. Subsequent differences between groups were evaluated using effect size criteria and the prior-posterior predictive p-value (PPPP), which proved invaluable in testing for differences that exceed zero, a minimal nonzero estimate, or the three commonly used effect size benchmarks described by Cohen in 1988 (i.e., .20, .50, and .80). Results substantiated the use of the PPPP for evaluating mean differences across groups when utilizing nonexact evaluative criteria.