
Showing papers in "Educational and Psychological Measurement in 2021"


Journal ArticleDOI
TL;DR: In this article, the authors examined the accuracy of p values obtained using the asymptotic mean and variance (MV) correction to the distribution of the sample standardized root mean squared residual (SRMR) proposed by Maydeu-Olivares to assess the exact fit of SEM models.
Abstract: We examine the accuracy of p values obtained using the asymptotic mean and variance (MV) correction to the distribution of the sample standardized root mean squared residual (SRMR) proposed by Maydeu-Olivares to assess the exact fit of SEM models. In a simulation study, we found that under normality, the MV-corrected SRMR statistic provides reasonably accurate Type I errors even in small samples and for large models, clearly outperforming the current standard, that is, the likelihood ratio (LR) test. When data show excess kurtosis, MV-corrected SRMR p values are only accurate in small models (p = 10), or in medium-sized models (p = 30) if no skewness is present and sample sizes are at least 500. Overall, when data are not normal, the MV-corrected LR test seems to outperform the MV-corrected SRMR. We elaborate on these findings by showing that the asymptotic approximation to the mean of the SRMR sampling distribution is quite accurate, while the asymptotic approximation to the standard deviation is not.
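For readers unfamiliar with the statistic, the sample SRMR is the root mean square of the residuals between the standardized sample and model-implied covariance matrices. A minimal sketch of the plain statistic (not the authors' MV-corrected version; the function name is ours):

```python
import numpy as np

def srmr(sample_cov, implied_cov):
    """Standardized root mean squared residual between a sample
    covariance matrix and a model-implied covariance matrix.

    Standardizes each matrix by its own standard deviations and
    averages the squared residuals over the p(p+1)/2 unique elements.
    """
    s = np.asarray(sample_cov, dtype=float)
    sig = np.asarray(implied_cov, dtype=float)
    d_s = np.sqrt(np.diag(s))
    d_sig = np.sqrt(np.diag(sig))
    r_s = s / np.outer(d_s, d_s)          # sample correlations
    r_sig = sig / np.outer(d_sig, d_sig)  # implied correlations
    idx = np.tril_indices_from(s)         # unique elements incl. diagonal
    resid = r_s[idx] - r_sig[idx]
    return float(np.sqrt(np.mean(resid ** 2)))
```

A perfectly fitting model yields an SRMR of exactly 0; the MV correction studied in the article adjusts the sampling distribution of this quantity, not its point value.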

43 citations


Journal ArticleDOI
TL;DR: All fit indices, except SRMR, are overly sensitive to correlated residuals and nonspecific error, resulting in overfactored solutions; in general, the authors do not recommend using model fit indices to select the number of factors in a scale evaluation framework.
Abstract: Model fit indices are being increasingly recommended and used to select the number of factors in an exploratory factor analysis. Growing evidence suggests that the recommended cutoff values for common model fit indices are not appropriate for use in an exploratory factor analysis context. A particularly prominent problem in scale evaluation is the ubiquity of correlated residuals and imperfect model specification. Our research focuses on a scale evaluation context and the performance of four standard model fit indices: root mean square error of approximation (RMSEA), standardized root mean square residual (SRMR), comparative fit index (CFI), and Tucker-Lewis index (TLI), and two equivalence test-based model fit indices: RMSEAt and CFIt. We use Monte Carlo simulation to generate and analyze data based on a substantive example using the Positive and Negative Affect Schedule (N = 1,000). We systematically vary the number and magnitude of correlated residuals as well as nonspecific misspecification, to evaluate the impact on model fit indices in fitting a two-factor exploratory factor analysis. Our results show that all fit indices, except SRMR, are overly sensitive to correlated residuals and nonspecific error, resulting in solutions that are overfactored. SRMR performed well, consistently selecting the correct number of factors; however, previous research suggests it does not perform well with categorical data. In general, we do not recommend using model fit indices to select the number of factors in a scale evaluation framework.
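The chi-square-based indices named above can be computed from the model and baseline test statistics. A minimal sketch using the standard formulas (not the equivalence-test variants RMSEAt and CFIt):

```python
import math

def rmsea(chisq, df, n):
    """Root mean square error of approximation; 0 when chi-square <= df."""
    return math.sqrt(max(chisq - df, 0.0) / (df * (n - 1)))

def cfi(chisq_m, df_m, chisq_b, df_b):
    """Comparative fit index: noncentrality of the fitted model
    relative to a baseline (independence) model."""
    d_m = max(chisq_m - df_m, 0.0)
    d_b = max(chisq_b - df_b, d_m)
    return 1.0 - d_m / d_b if d_b > 0 else 1.0

def tli(chisq_m, df_m, chisq_b, df_b):
    """Tucker-Lewis index, based on chi-square/df ratios."""
    ratio_b = chisq_b / df_b
    ratio_m = chisq_m / df_m
    return (ratio_b - ratio_m) / (ratio_b - 1.0)
```

For example, a model with chi-square 120 on 100 degrees of freedom at N = 500 gives an RMSEA of about 0.02, conventionally read as close fit.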

36 citations


Journal ArticleDOI
TL;DR: The results not only present practical and general guidelines for substantive researchers to determine minimum required sample sizes but also improve understanding of which factors are related to sample size requirements in mediation models.
Abstract: Mediation models have been widely used in many disciplines to better understand the underlying processes between independent and dependent variables. Despite their popularity and importance, the ap...

33 citations


Journal ArticleDOI
TL;DR: This study demonstrates the application of a new method for multiple-group analysis that concurrently models item responses, response times, and visual fixation counts collected from an eye-tracker.
Abstract: Many approaches have been proposed to jointly analyze item responses and response times to understand behavioral differences between normally and aberrantly behaved test-takers. Biometric informati...

19 citations


Journal ArticleDOI
TL;DR: Results revealed that the respondents preferred a paper-based VAS item with a horizontal, 8-cm long, 3 DTP (“desktop publishing point”) wide, black line, with flat line endpoints, and the ascending numerical anchors “0” and “10”, both for women and men.
Abstract: Paper-based visual analogue scale (VAS) items were developed 100 years ago. Although they gained great popularity in clinical and medical research for assessing pain, they have been scarcely applied in other areas of psychological research for several decades. However, since the beginning of digitization, VAS have attracted growing interest among researchers for carrying out computerized and paper-based data assessments. In the present study, we investigated the research question "Which different design characteristics of paper-based VAS items are preferred by women and men?" Based on a sample of 115 participants (68 female), our results revealed that the respondents preferred a paper-based VAS item with a horizontal, 8-cm long, 3 DTP ("desktop publishing point") wide, black line, with flat line endpoints, and the ascending numerical anchors "0" and "10", both for women and men. Although we did not identify any gender difference in these characteristics, our findings uncovered clear preferences on how to design paper-based VAS items.

16 citations


Journal ArticleDOI
TL;DR: Simulation results suggest that the EM-IRT model provides superior item and equal mean ability parameter estimates in the presence of model violations under realistic conditions when compared with the 2PL model.
Abstract: As low-stakes testing contexts increase, low test-taking effort may serve as a serious validity threat. One common solution to this problem is to identify noneffortful responses and treat them as m...

15 citations


Journal ArticleDOI
TL;DR: Careless responding is a bias in survey responses that disregards the actual item content, constituting a threat to the factor structure, reliability, and validity of psychological measurements as discussed by the authors.
Abstract: Careless responding is a bias in survey responses that disregards the actual item content, constituting a threat to the factor structure, reliability, and validity of psychological measurements. Di...

15 citations


Journal ArticleDOI
TL;DR: The authors note that disengaged item responses pose a threat to the validity of the results provided by large-scale assessments and examine several procedures for identifying disengaged responses on the basis of observed responses.
Abstract: Disengaged item responses pose a threat to the validity of the results provided by large-scale assessments. Several procedures for identifying disengaged responses on the basis of observed response...

15 citations


Journal ArticleDOI
TL;DR: In this paper, the authors compared semiparametric continuous norming (SPCN) with conventional norming methods by simulating psychometric test results.
Abstract: The interpretation of psychometric test results is usually based on norm scores. We compared semiparametric continuous norming (SPCN) with conventional norming methods by simulating results for tes...

14 citations


Journal ArticleDOI
TL;DR: This paper explored the theory of psychological inoculation: if people are preemptively exposed to a weakened version of a misleading argument, they can build resistance against future misinformation.
Abstract: Online misinformation is a pervasive global problem. In response, psychologists have recently explored the theory of psychological inoculation: If people are preemptively exposed to a weakened vers...

13 citations


Journal ArticleDOI
TL;DR: The presence of rapid guessing (RG) presents a challenge to practitioners in obtaining accurate estimates of measurement properties and examinee ability; in response to this concern, researchers have proposed methods to deal with this problem.
Abstract: The presence of rapid guessing (RG) presents a challenge to practitioners in obtaining accurate estimates of measurement properties and examinee ability. In response to this concern, researchers ha...

Journal ArticleDOI
TL;DR: It is suggested that test users should evaluate and document potential differential NER prior to both conducting measurement quality analyses and reporting disaggregated subgroup mean performance.
Abstract: Low test-taking effort as a validity threat is common when examinees perceive an assessment context to have minimal personal value. Prior research has shown that in such contexts, subgroups may dif...

Journal ArticleDOI
TL;DR: A new approach to the analysis of how students answer tests and how they allocate resources in terms of time on task and revisiting previously answered questions is presented, revealing that examinees’ tendency to revisit items was strongly related to their speed and subgroups of examinees displayed different test-taking behaviors.
Abstract: This article presents a new approach to the analysis of how students answer tests and how they allocate resources in terms of time on task and revisiting previously answered questions. Previous res...

Journal ArticleDOI
TL;DR: It is argued that methodology applied to investigate response styles should attend to the inherent uncertainty of response style influence due to the likely influence of both response styles and the content trait on the selection of extreme response categories.
Abstract: This paper presents a mixture item response tree (IRTree) model for extreme response style. Unlike traditional applications of single IRTree models, a mixture approach provides a way of representin...

Journal ArticleDOI
TL;DR: A follow-up regression comparing alpha and omega revealed alpha to be more sensitive to the degree of violation of tau equivalence, whereas omega was more affected by sample size and number of items, especially when population reliability was low.
Abstract: The accuracy of certain internal consistency estimators has been questioned in recent years. The present study tests the accuracy of six reliability estimators (Cronbach’s alpha, omega, omega hier...
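As a point of reference for the estimators being compared, Cronbach's alpha can be computed directly from an item score matrix. A minimal sketch (omega additionally requires factor loadings from a fitted measurement model and is omitted here):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha from an (n_persons, k_items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / total score variance).
    """
    x = np.asarray(items, dtype=float)
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = x.sum(axis=1).var(ddof=1)     # variance of sum scores
    return k / (k - 1) * (1 - item_vars / total_var)
```

With perfectly parallel items (identical columns) alpha equals 1; violations of tau equivalence, the condition the abstract discusses, make alpha a lower bound on reliability.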

Journal ArticleDOI
TL;DR: Forced-choice questionnaires can prevent faking and other response biases typically associated with rating scales as mentioned in this paper. However, the derived trait scores are often unreliable and ipsative.
Abstract: Forced-choice questionnaires can prevent faking and other response biases typically associated with rating scales. However, the derived trait scores are often unreliable and ipsative, making interi...

Journal ArticleDOI
TL;DR: The impact of excluding and misspecifying covariate effects on measurement invariance testing and class enumeration was investigated via Monte Carlo simulations and the utility of a model comparison approach in searching for the correct specification of covariates was evidenced.
Abstract: Factor mixture modeling (FMM) has been increasingly used to investigate unobserved population heterogeneity. This study examined the issue of covariate effects with FMM in the context of measurement invariance testing. Specifically, the impact of excluding and misspecifying covariate effects on measurement invariance testing and class enumeration was investigated via Monte Carlo simulations. Data were generated based on FMM models with (1) a zero covariate effect, (2) a covariate effect on the latent class variable, and (3) covariate effects on both the latent class variable and the factor. For each population model, different analysis models that excluded or misspecified covariate effects were fitted. Results highlighted the importance of including proper covariates in measurement invariance testing and evidenced the utility of a model comparison approach in searching for the correct specification of covariate effects and the level of measurement invariance. This approach was demonstrated using an empirical data set. Implications for methodological and applied research are discussed.
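The model comparison approach described above is commonly operationalized by fitting the candidate covariate specifications and comparing an information criterion such as the BIC. A sketch of the bookkeeping involved; the log-likelihoods and parameter counts below are invented purely for illustration:

```python
import math

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: -2 lnL + k ln N (smaller is better)."""
    return -2.0 * loglik + n_params * math.log(n_obs)

# Hypothetical fit results for three covariate specifications of an FMM:
candidates = {
    "no_covariate":       bic(-5120.4, 13, 1000),
    "covariate_on_class": bic(-5098.7, 14, 1000),
    "covariate_on_both":  bic(-5095.2, 15, 1000),
}
best = min(candidates, key=candidates.get)  # specification with lowest BIC
```

The same comparison can be repeated across levels of measurement invariance, retaining the specification with the lowest BIC.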

Journal ArticleDOI
TL;DR: This simulation study examined the robustness of LPA in terms of class enumeration and parameter recovery when the noninvariance was unmodeled by using composite or factor scores as profile indicators.
Abstract: Latent profile analysis (LPA) identifies heterogeneous subgroups based on continuous indicators that represent different dimensions. It is a common practice to measure each dimension using items, c...

Journal ArticleDOI
TL;DR: Simulations show performance exceeding that of Cronbach's alpha in terms of root mean square error when the formula matching the correct exponential family is used, and a discussion of Jensen’s inequality suggests explanations for peculiarities of the bias and standard error of the simulations across the different exponential families.
Abstract: This article presents some equivalent forms of the common Kuder-Richardson Formula 21 and 20 estimators for nondichotomous data belonging to certain other exponential families, such as Poisson count data, exponential data, or geometric counts of trials until failure. Using the generalized framework of Foster (2020), an equation for the reliability of a subset of the natural exponential family having quadratic variance functions is derived for known population parameters, and both formulas are shown to be different plug-in estimators of this quantity. The equivalent Kuder-Richardson Formulas 20 and 21 are given for six different natural exponential families, and these match earlier derivations in the case of binomial and Poisson data. Simulations show performance exceeding that of Cronbach's alpha in terms of root mean square error when the formula matching the correct exponential family is used, and a discussion of Jensen's inequality suggests explanations for peculiarities of the bias and standard error of the simulations across the different exponential families.
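For the classical dichotomous case, KR-20 and KR-21 are simple plug-in estimators; a minimal sketch of that binomial special case only, not the generalized exponential-family formulas derived in the article:

```python
import numpy as np

def kr20(items):
    """Kuder-Richardson Formula 20 for 0/1 item scores
    (shape: n_persons x k_items); uses per-item p(1-p) variances."""
    x = np.asarray(items, dtype=float)
    k = x.shape[1]
    p = x.mean(axis=0)                      # item difficulties
    total_var = x.sum(axis=1).var(ddof=1)   # variance of total scores
    return k / (k - 1) * (1 - np.sum(p * (1 - p)) / total_var)

def kr21(items):
    """KR-21: like KR-20 but assumes all items share one difficulty,
    so only the mean and variance of the total score are needed."""
    x = np.asarray(items, dtype=float)
    k = x.shape[1]
    m = x.sum(axis=1).mean()
    total_var = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - m * (k - m) / (k * total_var))
```

KR-21 never exceeds KR-20, with equality when all item difficulties are identical; the article's contribution is extending both formulas beyond the binomial family.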

Journal ArticleDOI
TL;DR: The purpose of this article is to show that the large-sample variance of Fleiss’ generalized kappa is systematically being misused, is invalid as a precision measure for kappa, and cannot be used for constructing confidence intervals.
Abstract: Cohen's kappa coefficient was originally proposed for two raters only, and it later extended to an arbitrarily large number of raters to become what is known as Fleiss' generalized kappa. Fleiss' generalized kappa and its large-sample variance are still widely used by researchers and were implemented in several software packages, including, among others, SPSS and the R package "rel." The purpose of this article is to show that the large-sample variance of Fleiss' generalized kappa is systematically being misused, is invalid as a precision measure for kappa, and cannot be used for constructing confidence intervals. A general-purpose variance expression is proposed, which can be used in any statistical inference procedure. A Monte-Carlo experiment is presented, showing the validity of the new variance estimation procedure.
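The point estimate of Fleiss' generalized kappa itself (as opposed to its disputed large-sample variance) is straightforward to compute from a table of rating counts; a minimal sketch:

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' generalized kappa from an (n_subjects, q_categories)
    matrix of rating counts; every subject is rated by the same
    number of raters r."""
    m = np.asarray(counts, dtype=float)
    n = m.shape[0]
    r = m[0].sum()                            # raters per subject
    p_j = m.sum(axis=0) / (n * r)             # overall category proportions
    p_i = (np.sum(m ** 2, axis=1) - r) / (r * (r - 1))  # per-subject agreement
    p_bar, p_e = p_i.mean(), np.sum(p_j ** 2)
    return float((p_bar - p_e) / (1 - p_e))
```

Perfect agreement gives kappa = 1; the article's point is that attaching the commonly used variance formula to this estimate for confidence intervals is invalid.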

Journal ArticleDOI
TL;DR: A generalized latent variable model is presented that, when combined with strong parametric assumptions based on mathematical cognitive models, permits the use of adaptive testing without large samples or the need to precalibrate item parameters.
Abstract: The adaptation of experimental cognitive tasks into measures that can be used to quantify neurocognitive outcomes in translational studies and clinical trials has become a key component of the stra...

Journal ArticleDOI
TL;DR: This study proposes a polytomous scoring approach for handling not-reached items and compares its performance with those of the traditional scoring approaches; the results indicate that the polytomous scoring approaches outperformed the traditional approaches.
Abstract: In low-stakes assessments, some students may not reach the end of the test and leave some items unanswered due to various reasons (e.g., lack of test-taking motivation, poor time management, and test speededness). Not-reached items are often treated as incorrect or not-administered in the scoring process. However, when the proportion of not-reached items is high, these traditional approaches may yield biased scores and thereby threaten the validity of test results. In this study, we propose a polytomous scoring approach for handling not-reached items and compare its performance with those of the traditional scoring approaches. Real data from a low-stakes math assessment administered to second and third graders were used. The assessment consisted of 40 short-answer items focusing on addition and subtraction. The students were instructed to answer as many items as possible within 5 minutes. Using the traditional scoring approaches, students' responses for not-reached items were treated as either not-administered or incorrect in the scoring process. With the proposed scoring approach, students' nonmissing responses were scored polytomously based on how accurately and rapidly they responded to the items to reduce the impact of not-reached items on ability estimation. The traditional and polytomous scoring approaches were compared based on several evaluation criteria, such as model fit indices, test information function, and bias. The results indicated that the polytomous scoring approaches outperformed the traditional approaches. The complete case simulation corroborated our empirical findings that the scoring approach in which nonmissing items were scored polytomously and not-reached items were considered not-administered performed the best. Implications of the polytomous scoring approach for low-stakes assessments were discussed.
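The scoring idea can be illustrated with a deliberately simplified rule: nonmissing responses are scored polytomously from accuracy and speed, and not-reached items are treated as not administered. The 0/1/2 rule and the rt_threshold parameter below are our own simplifications for illustration, not the authors' exact specification:

```python
def polytomous_score(correct, rt, rt_threshold, reached):
    """Score one item response under a simplified polytomous rule.

    correct      -- bool, whether the answer is correct
    rt           -- response time in seconds
    rt_threshold -- cutoff separating fast from slow responses (assumed)
    reached      -- whether the student reached the item
    """
    if not reached:
        return None            # not administered: excluded from estimation
    if not correct:
        return 0               # incorrect response
    return 2 if rt <= rt_threshold else 1  # correct: fast earns more credit
```

Scores of None would simply be omitted from the likelihood when estimating ability, which is the best-performing configuration reported in the abstract.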

Journal ArticleDOI
TL;DR: In data collected from virtual learning environments (VLEs), item response theory (IRT) models can be used to guide the ongoing measurement of student ability.
Abstract: In data collected from virtual learning environments (VLEs), item response theory (IRT) models can be used to guide the ongoing measurement of student ability. However, such applications of IRT rel...

Journal ArticleDOI
TL;DR: In the majority of conditions and for all factor retention criteria except the comparison data approach, the missing data mechanism had little impact on accuracy, and pairwise deletion performed comparably to the more sophisticated imputation methods.
Abstract: Determining the number of factors in exploratory factor analysis is arguably the most crucial decision a researcher faces when conducting the analysis. While several simulation studies exist that c...
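Pairwise deletion, one of the missing-data strategies compared above, estimates each correlation from only the cases observed on that particular pair of variables; a minimal sketch:

```python
import numpy as np

def pairwise_corr(x):
    """Correlation matrix under pairwise deletion: each r_ij uses
    only the rows where both variable i and variable j are observed
    (x is an n x p array with np.nan marking missing entries)."""
    x = np.asarray(x, dtype=float)
    p = x.shape[1]
    r = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            ok = ~np.isnan(x[:, i]) & ~np.isnan(x[:, j])
            r[i, j] = r[j, i] = np.corrcoef(x[ok, i], x[ok, j])[0, 1]
    return r
```

The resulting matrix feeds directly into factor retention criteria, though because each element can rest on a different subsample it is not guaranteed to be positive definite.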

Journal ArticleDOI
TL;DR: This research presents a novel and scalable approach called “Smart Scorecard™”, which automates the labor-intensive, time-consuming, and expensive process of manually cataloging and rating individual performances.
Abstract: Practical constraints in rater-mediated assessments limit the availability of complete data. Instead, most scoring procedures include one or two ratings for each performance, with overlapping perfo...

Journal ArticleDOI
TL;DR: The regression method generally performs the best in terms of coefficient and standard error bias, accuracy, and empirical Type I error rates, and the correlation-preserving method mostly outperforms the sum score methods.
Abstract: Factor score regression has recently received growing interest as an alternative for structural equation modeling. However, many applications are left without guidance because of the focus on norma...

Journal ArticleDOI
TL;DR: This study found that if the class constraint algorithm was used a priori, it should be combined with a post hoc algorithm for accurate classification and is most effective under two-class models when class separation is high.
Abstract: Simulation studies involving mixture models inevitably aggregate parameter estimates and other output across numerous replications. A primary issue that arises in these methodological investigation...

Journal ArticleDOI
TL;DR: This paper presents a meta-analysis of eight randomized control trials (RCTs) conducted in the Netherlands over the course of a 12-month period and found that three out of four trials showed statistically significant improvements in the quality of the control groups.
Abstract: Considerable thought is often put into designing randomized control trials (RCTs). From power analyses and complex sampling designs implemented preintervention to nuanced quasi-experimental models ...

Journal ArticleDOI
TL;DR: A multilevel modeling approach is recommended to effectively analyze data obtained within a nested structure (e.g., students within schools).
Abstract: Oftentimes in many fields of the social and natural sciences, data are obtained within a nested structure (e.g., students within schools). To effectively analyze data with such a structure, multile...
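A standard first step in multilevel analysis of nested data is the intraclass correlation, which quantifies how much of the outcome variance lies between groups. A minimal sketch via one-way ANOVA, assuming equal group sizes for simplicity:

```python
import numpy as np

def icc1(groups):
    """Intraclass correlation ICC(1) from a list of equal-sized groups
    (e.g., students' scores grouped by school), via one-way ANOVA:
    (MSB - MSW) / (MSB + (n - 1) * MSW)."""
    g = [np.asarray(v, dtype=float) for v in groups]
    k, n = len(g), len(g[0])                  # number of groups, group size
    grand = np.mean(np.concatenate(g))
    msb = n * sum((v.mean() - grand) ** 2 for v in g) / (k - 1)
    msw = sum(((v - v.mean()) ** 2).sum() for v in g) / (k * (n - 1))
    return (msb - msw) / (msb + (n - 1) * msw)
```

A nontrivial ICC signals that observations within a group are not independent, which is exactly why single-level analyses of nested data understate standard errors and a multilevel model is preferred.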

Journal ArticleDOI
TL;DR: Results revealed that the highest alternate forms reliability coefficients were obtained when the second test was administered at least 2 to 3 weeks after the first test; however, there is a potential tradeoff in waiting longer to retest, as student ability tended to grow with time.
Abstract: An essential question when computing test–retest and alternate forms reliability coefficients is how many days there should be between tests. This article uses data from reading and math computeriz...