
Showing papers in "Educational and Psychological Measurement in 2018"


Journal ArticleDOI
TL;DR: This article proposes a comprehensive approach for assessing the quality and appropriateness of exploratory factor analysis solutions intended for item calibration and individual scoring by assessing three groups of properties: strength and replicability, determinacy and accuracy of the individual score estimates, and closeness to unidimensionality in the case of multidimensional solutions.
Abstract: This article proposes a comprehensive approach for assessing the quality and appropriateness of exploratory factor analysis solutions intended for item calibration and individual scoring. Three groups of properties are assessed: (a) strength and replicability of the factorial solution, (b) determinacy and accuracy of the individual score estimates, and (c) closeness to unidimensionality in the case of multidimensional solutions. Within each group, indices are considered for two types of factor-analytic models: the linear model for continuous responses and the categorical-variable-methodology model that treats the item scores as ordered-categorical. All the indices proposed have been implemented in a noncommercial and widely known program for exploratory factor analysis. The usefulness of the proposal is illustrated with a real data example in the personality domain.
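
For orientation, one family of indices used to gauge closeness to unidimensionality is the explained common variance (ECV) and its item-level counterpart (I-ECV). Here is a minimal R sketch with an invented bifactor-style loading matrix (all values hypothetical, not from the article):

```r
# Hypothetical bifactor loading matrix: column 1 = general factor,
# columns 2-3 = group factors (values invented for illustration).
L <- matrix(c(
  .70, .40, .00,
  .65, .45, .00,
  .60, .50, .00,
  .55, .00, .45,
  .60, .00, .40,
  .50, .00, .55
), nrow = 6, byrow = TRUE)

# ECV: squared general-factor loadings over all squared loadings.
ecv  <- sum(L[, 1]^2) / sum(L^2)

# I-ECV: the same ratio computed item by item.
iecv <- L[, 1]^2 / rowSums(L^2)

round(c(ECV = ecv), 3)
round(iecv, 3)
```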

209 citations


Journal ArticleDOI
TL;DR: It is demonstrated how IRT model comparison can be conducted with Stan and how the provided Stan code for simple IRT models can be easily extended to their multidimensional and multilevel cases.
Abstract: Stan is a new Bayesian statistical software program that implements the powerful and efficient Hamiltonian Monte Carlo (HMC) algorithm. To date, no source systematically provides Stan code for various item response theory (IRT) models. This article provides Stan code for three representative IRT models: the three-parameter logistic IRT model, the graded response model, and the nominal response model. We demonstrate how IRT model comparison can be conducted with Stan and how the provided Stan code for simple IRT models can be easily extended to their multidimensional and multilevel cases.
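
As a flavor of what such Stan code looks like, here is a minimal two-parameter logistic (2PL) sketch of our own (not the authors' code), run from R via rstan:

```r
library(rstan)

# A minimal 2PL sketch in the spirit of the article; the paper itself
# provides Stan code for the 3PL, graded response, and nominal
# response models.
twopl_code <- "
data {
  int<lower=1> I;                  // number of items
  int<lower=1> P;                  // number of persons
  int<lower=0, upper=1> y[P, I];   // scored responses
}
parameters {
  vector[P] theta;                 // person abilities
  vector<lower=0>[I] a;            // item discriminations
  vector[I] b;                     // item difficulties
}
model {
  theta ~ normal(0, 1);
  a ~ lognormal(0, 1);
  b ~ normal(0, 2);
  for (p in 1:P)
    y[p] ~ bernoulli_logit(a .* (theta[p] - b));
}
"
# fit <- stan(model_code = twopl_code,
#             data = list(I = 20, P = 500, y = resp_matrix))
```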

63 citations


Journal ArticleDOI
TL;DR: It can be stated that the BRMSEA is well suited to evaluate model fit in large sample Bayesian CFA models by taking sample size and model complexity into account.
Abstract: Bayesian confirmatory factor analysis (CFA) offers an alternative to frequentist CFA based on, for example, maximum likelihood estimation for the assessment of reliability and validity of educational and psychological measures. For increasing sample sizes, however, the applicability of current fit statistics evaluating model fit within Bayesian CFA is limited. We propose, therefore, a Bayesian variant of the root mean square error of approximation (RMSEA), the BRMSEA. A simulation study was performed with variations in model misspecification, factor loading magnitude, number of indicators, number of factors, and sample size. This showed that the 90% posterior probability interval of the BRMSEA is valid for evaluating model fit in large samples (N ≥ 1,000), using cutoff values for the lower (<.05) and upper (<.08) limits as a guideline. An empirical illustration further shows the advantage of the BRMSEA in large-sample Bayesian CFA models. In conclusion, the BRMSEA is well suited to evaluate model fit in large-sample Bayesian CFA models, as it takes sample size and model complexity into account.
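
A simplified sketch of the underlying logic, with invented values standing in for an actual Bayesian CFA fit; note the article's exact definition further adjusts for model complexity via the estimated number of parameters:

```r
# Simplified sketch: apply the RMSEA formula to each posterior draw of
# a chi-square-type discrepancy (values invented).
brmsea_draws <- function(chisq, df, n) {
  sqrt(pmax((chisq - df) / (df * (n - 1)), 0))
}

set.seed(1)
fake_disc <- rchisq(2000, df = 120) + 15   # stand-in posterior draws
draws     <- brmsea_draws(fake_disc, df = 120, n = 1500)

# 90% posterior probability interval, judged against the suggested
# guideline: lower limit < .05 and upper limit < .08.
quantile(draws, c(.05, .95))
```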

51 citations


Journal ArticleDOI
TL;DR: A minimum value for the item-score reliability methods to be used in item analysis is recommended, and the relations between the three item-score reliability methods and the four well-known item indices are investigated.

Abstract: Reliability is usually estimated for a total score, but it can also be estimated for item scores. Item-score reliability can be useful to assess the repeatability of an individual item score in a group. Three methods to estimate item-score reliability are discussed, known as method MS, method λ6, and method CA. The item-score reliability methods are compared with four well-known and widely accepted item indices: the item-rest correlation, the item-factor loading, the item scalability, and the item discrimination. Realistic values for item-score reliability in empirical data sets are monitored to obtain an impression of the values to be expected in other empirical data sets. The relations between the three item-score reliability methods and the four well-known item indices are investigated. Tentatively, a minimum value for the item-score reliability methods to be used in item analysis is recommended.
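
For context, method λ6 builds on Guttman's λ6, which estimates each item's error variance from its squared multiple correlation with the remaining items. A minimal total-score sketch in R; the article adapts this logic to single-item scores:

```r
# Guttman's lambda-6 for a total score: each item's error variance is
# estimated from its squared multiple correlation (SMC) with the
# other items.
lambda6 <- function(x) {
  R   <- cor(x)
  smc <- 1 - 1 / diag(solve(R))   # squared multiple correlations
  1 - sum((1 - smc) * apply(x, 2, var)) / var(rowSums(x))
}

set.seed(2)
f <- rnorm(200)                                      # common factor
x <- sapply(1:6, function(j) 0.7 * f + rnorm(200))   # six toy items
lambda6(x)
```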

48 citations


Journal ArticleDOI
TL;DR: Results indicate that cluster bootstrapping, though more computationally demanding, can be used as an alternative procedure for the analysis of clustered data when treatment effects at the group level are of primary interest.
Abstract: Cluster randomized trials involving participants nested within intact treatment and control groups are commonly performed in various educational, psychological, and biomedical studies. However, recruiting and retaining intact groups present various practical, financial, and logistical challenges to evaluators, and often, cluster randomized trials are performed with a low number of clusters (~20 groups). Although multilevel models are often used to analyze nested data, researchers may be concerned about potentially biased results due to having only a few groups under study. Cluster bootstrapping has been suggested as an alternative procedure when analyzing clustered data, though it has seen very little use in educational and psychological studies. Using a Monte Carlo simulation that varied the number of clusters, average cluster size, and intraclass correlations, we compared standard errors using cluster bootstrapping with those derived using ordinary least squares regression and multilevel models. Results indicate that cluster bootstrapping, though more computationally demanding, can be used as an alternative procedure for the analysis of clustered data when treatment effects at the group level are of primary interest. Supplementary material showing how to perform cluster bootstrapped regressions using R is also provided.
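
A minimal sketch of the cluster (case) bootstrap the study evaluates, assuming a data frame d with columns y, treat, and cluster (names hypothetical); the article's supplementary material contains the authors' own R code:

```r
# Minimal cluster (case) bootstrap for a group-level treatment effect.
cluster_boot <- function(d, B = 2000) {
  ids <- unique(d$cluster)
  replicate(B, {
    take <- sample(ids, length(ids), replace = TRUE)  # resample clusters
    db   <- do.call(rbind, lapply(take, function(g) d[d$cluster == g, ]))
    coef(lm(y ~ treat, data = db))[["treat"]]
  })
}
# sd(cluster_boot(d))  # bootstrapped SE of the treatment effect,
#                      # to compare with OLS and multilevel-model SEs
```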

45 citations


Journal ArticleDOI
TL;DR: Recommendations about the minimum required sample sizes that satisfy all four criteria—model selection accuracy, parameter estimation bias, standard error bias, and coverage rate—as well as rules of thumb for sample size requirements when applying MLCMs in data analysis are provided.

Abstract: A multilevel latent class model (MLCM) is a useful tool for analyzing data arising from hierarchically nested structures. One important issue for MLCMs is determining the minimum sample sizes needed to obtain reliable and unbiased results. In this simulation study, the sample sizes required for MLCMs were investigated under various conditions. A series of design factors, including sample sizes at two levels, the distinctness and the complexity of the latent structure, and the number of indicators, were manipulated. The results revealed that larger samples are required when the latent classes are less distinct, the latent structure is more complex, and fewer indicators are available. This study also provides recommendations about the minimum required sample sizes that satisfy all four criteria (model selection accuracy, parameter estimation bias, standard error bias, and coverage rate) as well as rules of thumb for sample size requirements when applying MLCMs in data analysis.

44 citations


Journal ArticleDOI
TL;DR: It is concluded that ordinal alpha should not be used in routine reliability analyses and reports, and instead should be understood as a hypothetical tool, similar to the Spearman–Brown prophecy formula, for theoretically increasing the number of ordinal categorical response options in future applied testing applications.

Abstract: This article discusses the theoretical and practical contributions of Zumbo, Gadermann, and Zeisser’s family of ordinal reliability statistics. Implications, interpretation, recommendations, and practical applications regarding their ordinal measures, particularly ordinal alpha, are discussed. General misconceptions relating to this family of ordinal reliability statistics are highlighted, and arguments for interpreting ordinal alpha as a measure of hypothetical reliability, as opposed to observed reliability, are presented. It is concluded that ordinal alpha should not be used in routine reliability analyses and reports, and instead should be understood as a hypothetical tool, similar to the Spearman–Brown prophecy formula, for theoretically increasing the number of ordinal categorical response options in future applied testing applications.
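
For reference, ordinal alpha is typically obtained by feeding a polychoric correlation matrix into the standardized alpha formula, as in this minimal R sketch using the psych package:

```r
library(psych)

# Ordinal alpha: standardized alpha computed from a polychoric rather
# than a Pearson correlation matrix ('x' holds ordered item responses).
ordinal_alpha <- function(x) {
  R    <- polychoric(x)$rho        # polychoric correlation matrix
  k    <- ncol(R)
  rbar <- mean(R[lower.tri(R)])    # average inter-item correlation
  k * rbar / (1 + (k - 1) * rbar)  # standardized alpha
}
```

On the article's reading, the resulting value describes the reliability of the hypothetical continuous variables assumed to underlie the ordinal responses, not of the observed item scores.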

44 citations


Journal ArticleDOI
TL;DR: While retrofitting may not always be an ideal approach to diagnostic measurement, this article aims to invite discussions through presenting the possibility, challenges, process, and product of retrofitting.
Abstract: Developing a diagnostic tool within the diagnostic measurement framework is the optimal approach to obtain multidimensional and classification-based feedback on examinees. However, end users may seek to obtain diagnostic feedback from existing item responses to assessments that have been designed under either the classical test theory or item response theory frameworks. Retrofitting diagnostic classification models to existing assessments designed under other psychometric frameworks could be a plausible approach to obtain more actionable scores or understand more about the constructs themselves. This study (a) discusses the possibility and problems of retrofitting, (b) proposes a step-by-step retrofitting framework, and (c) explores the information one can gain from retrofitting through an empirical application example. While retrofitting may not always be an ideal approach to diagnostic measurement, this article aims to invite discussions through presenting the possibility, challenges, process, and product of retrofitting.

36 citations


Journal ArticleDOI
TL;DR: This article presents a review of longitudinal factorial invariance, a condition necessary for ensuring that the measured construct is the same across time points, and introduces the CUFFS model.
Abstract: A first-order latent growth model assesses change in an unobserved construct from a single score and is commonly used across different domains of educational research. However, examining change using a set of multiple response scores (e.g., scale items) affords researchers several methodological benefits not possible when using a single score. A curve of factors (CUFFS) model assesses change in a construct from multiple response scores but its use in the social sciences has been limited. In this article, we advocate the CUFFS for analyzing a construct's latent trajectory over time, with an emphasis on applying this model to educational research. First, we present a review of longitudinal factorial invariance, a condition necessary for ensuring that the measured construct is the same across time points. Next, we introduce the CUFFS model, followed by an illustration of testing factorial invariance and specifying a univariate and a bivariate CUFFS model to longitudinal data. To facilitate implementation, we include syntax for specifying these statistical methods using the free statistical software R.
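
An independent minimal lavaan-style sketch of a univariate CUFFS (not the authors' syntax): three waves of a three-item scale, with loadings constrained equal across time per the invariance review. Intercept constraints and residual autocovariances are omitted, and all variable names are hypothetical:

```r
library(lavaan)

model <- '
  # first-order factors, loadings constrained equal across waves
  eta1 =~ L1*x1_t1 + L2*x2_t1 + L3*x3_t1
  eta2 =~ L1*x1_t2 + L2*x2_t2 + L3*x3_t2
  eta3 =~ L1*x1_t3 + L2*x2_t3 + L3*x3_t3

  # second-order growth factors defined on the first-order factors
  i =~ 1*eta1 + 1*eta2 + 1*eta3
  s =~ 0*eta1 + 1*eta2 + 2*eta3
'
# fit <- sem(model, data = dat, meanstructure = TRUE)
```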

32 citations


Journal ArticleDOI
TL;DR: This study proposes a bifactor measurement model for the mediating construct as a way to parse variance and represent the general aspect and specific facets of a construct simultaneously and investigates the conditions when researchers can detect the mediated effect.
Abstract: Statistical mediation analysis allows researchers to identify the most important mediating constructs in the causal process studied. Identifying specific mediators is especially relevant when the hypothesized mediating construct consists of multiple related facets. The general definition of the construct and its facets might relate differently to an outcome. However, current methods do not allow researchers to study the relationships of general and specific aspects of a construct to an outcome simultaneously. This study proposes a bifactor measurement model for the mediating construct as a way to parse variance and represent the general aspect and specific facets of a construct simultaneously. Monte Carlo simulation results are presented to help determine the properties of mediated effect estimation when the mediator has a bifactor structure and a specific facet of a construct is the true mediator. This study also investigates the conditions under which researchers can detect the mediated effect when the multidimensionality of the mediator is ignored and the mediator is treated as unidimensional. Simulation results indicated that the mediation model with a bifactor mediator measurement model produced unbiased estimates and had adequate power to detect the mediated effect with sample sizes greater than 500 and medium a- and b-paths. Also, results indicate that parameter bias and detection of the mediated effect in both the data-generating model and the misspecified model vary as a function of the amount of facet variance represented in the mediation model. This study contributes to the largely unexplored area of measurement issues in statistical mediation analysis.
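
A minimal lavaan sketch of the design studied, with the general factor and one specific facet of a bifactor mediator both transmitting the effect of x on y (all variable names hypothetical):

```r
library(lavaan)

model <- '
  # bifactor measurement model for the mediator
  G =~ m1 + m2 + m3 + m4 + m5 + m6   # general factor
  S =~ m4 + m5 + m6                  # specific facet of interest
  G ~~ 0*S                           # orthogonal, as in bifactor models

  # structural (mediation) part
  G ~ a1*x
  S ~ a2*x
  y ~ b1*G + b2*S + cp*x

  ind_general  := a1*b1   # mediated effect via the general factor
  ind_specific := a2*b2   # mediated effect via the specific facet
'
# fit <- sem(model, data = dat)
```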

32 citations


Journal ArticleDOI
TL;DR: In this article, the authors evaluate the ability of mediation models to detect a significant mediation effect using limited data using simulations comparing four mediation models: sequential, dynamic, and cross-lagged panel.
Abstract: This article serves as a practical guide to mediation design and analysis by evaluating the ability of mediation models to detect a significant mediation effect using limited data. The cross-sectional mediation model, which has been shown to be biased when the mediation is happening over time, is compared with longitudinal mediation models: sequential, dynamic, and cross-lagged panel. These longitudinal mediation models take time into account but bring many problems of their own, such as choosing measurement intervals and number of measurement occasions. Furthermore, researchers with limited resources often cannot collect enough data to fit an appropriate longitudinal mediation model. These issues were addressed using simulations comparing four mediation models each using the same amount of data but with differing numbers of people and time points. The data were generated using multilevel mediation models, with varying data characteristics that may be incorrectly specified in the analysis models. Models were evaluated using power and Type I error rates in detecting a significant indirect path. Multilevel longitudinal mediation analysis performed well in every condition, even in the misspecified conditions. Of the analyses that used limited data, sequential mediation had the best performance; therefore, it offers a viable second choice when resources are limited. Finally, each of these models were demonstrated in an empirical analysis.

Journal ArticleDOI
TL;DR: A new framework for global model tests for polytomous Rasch models based on a model-based recursive partitioning algorithm is proposed, which is more powerful when the group structure is not known a priori—as will usually be the case in practical applications.
Abstract: Psychometric measurement models are only valid if measurement invariance holds between test takers of different groups. Global model tests, such as the well-established likelihood ratio (LR) test, ...

Journal ArticleDOI
TL;DR: The relations between IRT and Mplus FA “Theta” and “Delta” parameterizations are described using expressions without the use of matrices, which can be readily understood by applied researchers and students.

Abstract: The purpose of this article is twofold. The first is to provide evaluative information on the recovery of model parameters and their standard errors for the two-parameter item response theory (IRT) model using different estimation methods by Mplus. The second is to provide easily accessible information for practitioners, instructors, and students about the relationships between IRT and item factor analysis (FA) parameterizations. Specifically, this is done using the "Theta" and "Delta" parameterizations in Mplus for unidimensional and multidimensional modeling with dichotomous and polytomous responses with and without the scaling constant D. The first objective aims at investigating differences that may occur when using different estimation methods in Mplus for binary response modeling. The second objective was motivated by practical interest observed among graduate students and applied researchers. The relations between IRT and Mplus FA "Theta" and "Delta" parameterizations are described using expressions without the use of matrices, which can be readily understood by applied researchers and students.
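
The core conversions are compact enough to state; in the normal-ogive metric they are commonly given as follows (the article develops these relations in detail, including the role of the scaling constant D ≈ 1.7):

```latex
% 2PL IRT vs. item factor analysis parameters: loading lambda_j and
% threshold tau_j; D (about 1.7) links the normal and logistic metrics.
\[
\text{Delta: } a_j = \frac{\lambda_j}{\sqrt{1-\lambda_j^{2}}}, \qquad
\text{Theta: } a_j = \lambda_j, \qquad
b_j = \frac{\tau_j}{\lambda_j} \text{ (both)},
\]
\[
P(y_{ij}=1 \mid \theta_i) \approx
\frac{1}{1+\exp\{-D\,a_j(\theta_i-b_j)\}}.
\]
```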

Journal ArticleDOI
TL;DR: Contrasts between the valid and invalid responses revealed differences in means, prevalence rates of student adjustment, and associations among reports of bullying victimization and student adjustment outcomes, lending additional support for the need to screen for invalid responders in adolescent samples.
Abstract: Self-report surveys are widely used to measure adolescent risk behavior and academic adjustment, with results having an impact on national policy, assessment of school quality, and evaluation of school interventions. However, data obtained from self-reports can be distorted when adolescents intentionally provide inaccurate or careless responses. The current study illustrates the problem of invalid respondents in a sample (N = 52,012) from 323 high schools that responded to a statewide assessment of school climate. Two approaches for identifying invalid respondents were applied, and contrasts between the valid and invalid responses revealed differences in means, prevalence rates of student adjustment, and associations among reports of bullying victimization and student adjustment outcomes. The results lend additional support for the need to screen for invalid responders in adolescent samples.

Journal ArticleDOI
TL;DR: This study reviewed and evaluated two alternative methods within the structural equation modeling (SEM) framework, namely, the reliability-adjusted product indicator (RAPI) method and the latent moderated structural equations (LMS) method, which can both flexibly take into account measurement errors.
Abstract: Path models with observed composites based on multiple items (e.g., mean or sum score of the items) are commonly used to test interaction effects. Under this practice, researchers generally assume that the observed composites are measured without errors. In this study, we reviewed and evaluated two alternative methods within the structural equation modeling (SEM) framework, namely, the reliability-adjusted product indicator (RAPI) method and the latent moderated structural equations (LMS) method, which can both flexibly take into account measurement errors. Results showed that both these methods generally produced unbiased estimates of the interaction effects. On the other hand, the path model—without considering measurement errors—led to substantial bias and a low confidence interval coverage rate of nonzero interaction effects. Other findings and implications for future studies are discussed.

Journal ArticleDOI
TL;DR: Results suggested that parameter estimates for examinee and task facets are quite robust to modifications in the size, model–data fit, and latent-variable location of the link, whereas parameter estimates for the rater facet are more sensitive to reductions in link size.
Abstract: Previous research includes frequent admonitions regarding the importance of establishing connectivity in data collection designs prior to the application of Rasch models. However, details regarding...

Journal ArticleDOI
TL;DR: The present study used Monte Carlo simulation methods to compare the effects of multiple model parameterizations and estimators on the performance of the chi-square test of the exact-fit hypothesis and the chi-square and likelihood ratio difference tests of the equal-fit hypothesis when evaluating measurement invariance with ordered polytomous data.
Abstract: Evaluations of measurement invariance provide essential construct validity evidence—a prerequisite for seeking meaning in psychological and educational research and ensuring fair testing procedures...

Journal ArticleDOI
TL;DR: Results show that the proposed indices have tail-area probabilities that can be closely approximated by central chi-squared random variables under the null hypothesis, are powerful for detecting latent variable distributional assumption violations, and are not sensitive to other forms of model misspecification such as multidimensionality.

Abstract: In item response theory (IRT), the underlying latent variables are typically assumed to be normally distributed. If the assumption of normality is violated, the item and person parameter estimates can become biased. Therefore, it is necessary in practical data analysis situations to examine the adequacy of this assumption in an effective manner. There is a recent surge of interest in limited-information overall goodness-of-fit test statistics for IRT models (see e.g., Cai, Maydeu-Olivares, Coffman, & Thissen, 2006; Joe & Maydeu-Olivares, 2010; Cai & Hansen, 2013), but their appropriateness for diagnosing latent variable distributional fit has not been studied. The approach undertaken in this research is to use summed score likelihood based indices. The idea itself is not new (see e.g., Ferrando & Lorenzo-Seva, 2001; Hambleton & Traub, 1973; Lord, 1953; Ross, 1966; Sinharay, Johnson, & Stern, 2006; Thissen & Wainer, 2001), but this study recasts the problem using the framework of limited-information goodness-of-fit testing. The summed score based indices can be viewed as a particular form of reduction of the full underlying multinomial that is potentially sensitive to latent variable distributional misspecifications. Results from a pilot study (Li & Cai, 2012) show that summed score likelihood based indices enjoy high statistical power for detecting latent variable distributional assumption violations and are, correctly, not sensitive to other forms of model misspecification such as unmodeled multidimensionality. Meanwhile, the limited-information overall fit statistic M2 (Maydeu-Olivares & Joe, 2005) has relatively low power against latent variable non-normality. However, the statistical indices proposed by Li and Cai (2012) do not exactly follow a chi-squared distribution. They proposed a heuristic degrees-of-freedom adjustment, but more rigorous justifications can be developed along the lines of the Satorra-Bentler type moment adjustments popular in structural equation modeling (Satorra & Bentler, 1994). In IRT, moment adjustment approaches have been used by Cai et al. (2006) and Maydeu-Olivares (2001). The major methodological contributions of this study come from simulation studies that examine the calibration and power of the moment-adjusted test statistics across various conditions: number of items, sample size, item type, generating latent variable distribution, and the values of generating item parameters. The performance of these fit statistics is also compared with M2. Simulation study results show that the proposed moment-adjusted statistics improve upon the unadjusted statistics in the null and alternative conditions, especially when generating item parameters are dispersed. Finally, performance of the indices is illustrated with empirical data from educational and psychological assessment development projects.
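
The model-implied summed score distribution at a given ability, the building block of such indices, follows from the classic Lord-Wingersky recursion; a minimal R sketch for the 2PL with invented item parameters:

```r
# Lord-Wingersky recursion: model-implied distribution of the summed
# score at ability theta, given 2PL item parameters a and b (the
# marginal distribution integrates this over the theta density).
summed_score_probs <- function(theta, a, b, D = 1.7) {
  p <- 1 / (1 + exp(-D * a * (theta - b)))  # 2PL response probabilities
  s <- c(1 - p[1], p[1])                    # score distribution, item 1
  for (j in seq_along(p)[-1]) {
    s <- c(s * (1 - p[j]), 0) + c(0, s * p[j])
  }
  s                                         # probabilities, scores 0..J
}

round(summed_score_probs(0, a = c(1.0, 1.5, 0.8), b = c(-0.5, 0, 0.5)), 3)
```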

Journal ArticleDOI
TL;DR: The results of a large simulation study indicated that, in general, the mediated effect was robust to violations of invariance in loadings, and most conditions with violations of intercept invariance exhibited severely positively biased mediated effects.
Abstract: When testing a statistical mediation model, it is assumed that factorial measurement invariance holds for the mediating construct across levels of the independent variable X. The consequences of failing to address the violations of measurement invariance in mediation models are largely unknown. The purpose of the present study was to systematically examine the impact of mediator noninvariance on the Type I error rates, statistical power, and relative bias in parameter estimates of the mediated effect in the single mediator model. The results of a large simulation study indicated that, in general, the mediated effect was robust to violations of invariance in loadings. In contrast, most conditions with violations of intercept invariance exhibited severely positively biased mediated effects, Type I error rates above acceptable levels, and statistical power larger than in the invariant conditions. The implications of these results are discussed and recommendations are offered.

Journal ArticleDOI
TL;DR: It is argued that the utility of LM tests depends on both the method used to compute the test and the degree of misspecification in the initially fitted model, and this is demonstrated in the context of a multidimensional IRT framework.
Abstract: Lagrange multiplier (LM) or score tests have seen renewed interest for the purpose of diagnosing misspecification in item response theory (IRT) models. LM tests can also be used to test whether parameters differ from a fixed value. We argue that the utility of LM tests depends on both the method used to compute the test and the degree of misspecification in the initially fitted model. We demonstrate both of these points in the context of a multidimensional IRT framework. Through an extensive Monte Carlo simulation study, we examine the performance of LM tests under varying degrees of model misspecification, model size, and different information matrix approximations. A generalized LM test designed specifically for use under misspecification, which has apparently not been previously studied in an IRT framework, performed the best in our simulations. Finally, we reemphasize caution in using LM tests for model specification searches.
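
For reference, the generic LM statistic takes the following form; the computational methods compared in the article differ mainly in how the information matrix is approximated:

```latex
% Lagrange multiplier (score) test: s is the score vector and I the
% information matrix, both evaluated at the restricted estimates;
% r is the number of restrictions tested.
\[
LM = s(\hat{\theta})^{\top} I(\hat{\theta})^{-1} s(\hat{\theta})
\;\sim\; \chi^{2}_{r} \quad \text{under } H_0 .
\]
```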

Journal ArticleDOI
TL;DR: Results differed, especially for two highly correlated cognitive tests; neither reproduced full-sample correlations well due to small deviations from normal distribution in skew and kurtosis, and problems in developing further adjustments to offset range-restriction distortions are discussed.
Abstract: Most study samples show less variability in key variables than do their source populations due most often to indirect selection into study participation associated with a wide range of personal and...

Journal ArticleDOI
TL;DR: This article presents some new developments in the methodology of an approach to scoring and equating of tests with binary items, referred to as delta scoring (D-scoring), which is under piloting with large-scale assessments at the National Center for Assessment in Saudi Arabia.
Abstract: This article presents some new developments in the methodology of an approach to scoring and equating of tests with binary items, referred to as delta scoring (D-scoring), which is being piloted with large-scale assessments at the National Center for Assessment in Saudi Arabia. This presentation builds on previous work on delta scoring and adds procedures for scaling and equating, item response functions, and estimation of true values and standard errors of D scores. Also, unlike the previous work on this topic, where D-scoring involves estimates of item and person parameters in the framework of item response theory, the approach presented here does not require item response theory calibration.
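
A rough sketch of the core D-scoring idea as we read it (the article's full procedure adds scaling, equating, and standard error estimation, none of which is shown here):

```r
# Sketch only: weight each correct response by an item difficulty
# parameter, here delta_j = 1 - p_j with p_j the proportion answering
# item j correctly, then rescale so D scores fall in the 0-1 range.
d_score <- function(resp) {                # resp: persons x items, 0/1
  delta <- 1 - colMeans(resp)              # item difficulty weights
  as.vector(resp %*% delta) / sum(delta)   # person-level D scores
}
```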

Journal ArticleDOI
Unkyung No, Sehee Hong
TL;DR: Results demonstrate that the three-step approaches produced more stable and more accurate estimates than the other approaches, even with a small sample size of 100.
Abstract: The purpose of the present study is to compare the performance of mixture modeling approaches (i.e., the one-step approach, three-step maximum-likelihood approach, three-step BCH approach, and LTB approach)...

Journal ArticleDOI
TL;DR: A latent variable modeling approach is discussed that allows point and interval estimation of the relationship of an underlying latent factor to a criterion variable in a setting that is more general than the commonly considered homogeneous psychometric test case.
Abstract: Validity coefficients for multicomponent measuring instruments are known to be affected by measurement error that attenuates them, affects associated standard errors, and influences results of statistical tests with respect to population parameter values. To account for measurement error, a latent variable modeling approach is discussed that allows point and interval estimation of the relationship of an underlying latent factor to a criterion variable in a setting that is more general than the commonly considered homogeneous psychometric test case. The method is particularly helpful in validity studies for scales with a second-order factorial structure, by allowing evaluation of the relationship between the second-order factor and a criterion variable. The procedure is similarly useful in studies of discriminant, convergent, concurrent, and predictive validity of measuring instruments with complex latent structure, and is readily applicable when measuring interrelated traits that share a common variance source. The outlined approach is illustrated using data from an authoritarianism study.
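
A minimal lavaan sketch of the second-order case described, estimating the covariance between a second-order factor and an observed criterion (variable names hypothetical); the standardized value of that covariance is the validity coefficient of interest:

```r
library(lavaan)

model <- '
  f1 =~ y1 + y2 + y3
  f2 =~ y4 + y5 + y6
  f3 =~ y7 + y8 + y9
  g  =~ f1 + f2 + f3   # second-order factor
  g  ~~ crit           # covariance with the observed criterion
'
# fit <- sem(model, data = dat)
# standardizedSolution(fit)   # standardized g ~~ crit row
```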

Journal ArticleDOI
TL;DR: The current simulation study compares the performance of two methods for estimating the correlations among changes in latent variables between two points in time, the two-wave latent change score model and the latent difference factor model.
Abstract: Collection and analysis of longitudinal data is an important tool in understanding growth and development over time in a whole range of human endeavors. Ideally, researchers working in the longitudinal framework are able to collect data at more than two points in time, as this will provide them with the potential for a deeper understanding of the development processes under study and a much broader array of statistical modeling options. However, in some circumstances data collection is limited to only two time points, perhaps because of resource limitations, issues with the context in which the data are collected, or the nature of the trait under study. In such instances, researchers may still want to learn about complex relationships in the data, such as the correlation between changes in latent traits that are being measured. However, with only two data points, standard approaches for modeling such relationships, such as growth curve modeling, cannot be used. The current simulation study compares the performance of two methods for estimating the correlation between changes in latent variables measured at two time points: the two-wave latent change score model and the latent difference factor model.
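
A minimal lavaan sketch of one of the two approaches, the two-wave latent change score specification, for a single construct (variable names hypothetical):

```r
library(lavaan)

model <- '
  eta1 =~ L1*x1_t1 + L2*x2_t1 + L3*x3_t1
  eta2 =~ L1*x1_t2 + L2*x2_t2 + L3*x3_t2

  eta2 ~ 1*eta1     # time-2 factor = time-1 factor ...
  d =~ 1*eta2       # ... plus the latent change factor d
  eta2 ~~ 0*eta2    # route all change variance through d
'
# With two constructs, fit both change factors in one model and
# inspect the correlation between them.
```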

Journal ArticleDOI
Ren Liu
TL;DR: A conceptual framework for understanding misspecifications of attribute structures is provided and a simulation study and application example were used to investigate how misspecification of external shapes and internal organizations affects model fit and item fit assessments, and respondent classification.
Abstract: Attribute structure is an explicit way of presenting the relationship between attributes in diagnostic measurement. The specification of attribute structures directly affects the classification accuracy resulted from psychometric modeling. This study provides a conceptual framework for understanding misspecifications of attribute structures. Under the framework, each attribute structure can be represented through an external shape and an internal organization. A simulation study and an application example were used to investigate how misspecification of external shapes and internal organizations affects model fit and item fit assessments, and respondent classification. The proposed framework and simulation results aim to support using attribute structures to (a) develop better diagnostic assessments and (b) inform theories of constructs.

Journal ArticleDOI
TL;DR: The data suggest that male respondents, respondents with lower levels of education, and respondents who did not report participating in SNAP (formerly the Food Stamp Program) tend to have more misfit, and lack of homeownership appears to be a predictor of misfit for Infit MSE statistics.

Abstract: This study focuses on model-data fit with a particular emphasis on household-level fit within the context of measuring household food insecurity. Household fit indices are used to examine the psychometric quality of household-level measures of food insecurity. In the United States, measures of food insecurity are commonly obtained from the U.S. Household Food Security Survey Module (HFSSM, 18 items) of the Current Population Survey Food Security Supplement (CPS-FSS). These measures, in various forms, are used to inform national programs and policies related to food insecurity. Data for low-income households with children from recent administrations of the HFSSM (2012-2014) are used in this study (N = 7,324). The results suggest that there are detectable levels of misfit with Infit mean square error (MSE) statistics ranging from 6.73% to 21.33% and Outfit MSE statistics ranging from 5.31% to 9.68%. The data suggest for Outfit MSE statistics that (a) male respondents, (b) respondents with lower levels of education, and (c) respondents who did not report participating in SNAP (Supplemental Nutrition Assistance Program, formerly the Food Stamp Program) tend to have more misfit. For Infit MSE statistics, lack of homeownership appears to be a predictor of misfit. The implications of this research for future research, theory, and policy related to the measurement of household food insecurity are discussed.
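
For reference, the household-level fit statistics follow the standard Rasch residual-based definitions, sketched here in R under the assumption that model-implied response probabilities p are available alongside the observed responses x:

```r
# Residual-based Rasch fit statistics at the row (household) level:
# outfit is the plain mean of squared standardized residuals; infit
# weights them by the information w = p(1 - p). 'x' is a households x
# items 0/1 matrix, 'p' the matching matrix of model probabilities.
person_fit <- function(x, p) {
  z2 <- (x - p)^2 / (p * (1 - p))   # squared standardized residuals
  w  <- p * (1 - p)
  list(outfit = rowMeans(z2),
       infit  = rowSums(z2 * w) / rowSums(w))
}
```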

Journal ArticleDOI
TL;DR: This study presents the multilevel bifactor approach to handling wording effects of mixed-format scales used in a multileVEL context and shows that positive and negative wording effects were present at both the within and the between levels.
Abstract: Wording effects associated with positively and negatively worded items have been found in many scales. Such effects may threaten construct validity and introduce systematic bias in the interpretation of results. A variety of models have been applied to address wording effects, such as the correlated uniqueness model and the correlated traits and correlated methods model. This study presents the multilevel bifactor approach to handling wording effects of mixed-format scales used in a multilevel context. The Students Confident in Mathematics scale is used to illustrate this approach. Results from comparing a series of models showed that positive and negative wording effects were present at both the within and the between levels. When the wording effects were ignored, the within-level predictive validity of the Students Confident in Mathematics scale was close to that under the multilevel bifactor model. However, at the between level, a lower validity coefficient was observed when ignoring the wording effects. Implications for applied researchers are discussed.

Journal ArticleDOI
TL;DR: Results of this study may support the use of scales composed of items worded in the same direction, and particularly in the positive direction, following the investigation of how adjacent categories may discriminate differently when items are positively or negatively worded.
Abstract: The generalized partial credit model (GPCM) is often used for polytomous data; however, the nominal response model (NRM) allows for the investigation of how adjacent categories may discriminate differently when items are positively or negatively worded. Ten items from three different self-reported scales were used (anxiety, depression, and perceived stress), and authors wrote an additional item worded in the opposite direction to pair with each original item. Sets of the original and reverse-worded items were administered, and responses were analyzed using the two models. The NRM fit significantly better than the GPCM, and it was able to detect category responses that may not function well. Positively worded items tended to be more discriminating than negatively worded items. For the depression scale, category boundary locations tended to have a larger range for the positively worded items than for the negatively worded items from both models. Some pairs of items functioned comparably when reverse-worded,...
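
For reference, the NRM assigns each response category its own slope and intercept, which is what lets adjacent categories discriminate differently:

```latex
% Nominal response model: category k of item j has slope a_{jk} and
% intercept c_{jk}, so adjacent categories are free to discriminate
% differently -- the feature exploited in this study.
\[
P(X_{ij}=k \mid \theta_i) =
\frac{\exp(a_{jk}\theta_i + c_{jk})}
     {\sum_{h=1}^{K_j} \exp(a_{jh}\theta_i + c_{jh})}
\]
```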

Journal ArticleDOI
TL;DR: Empirical underidentification problems that are encountered when fitting particular types of bifactor models to certain types of data sets are considered.
Abstract: Bifactor models are commonly used to assess whether psychological and educational constructs underlie a set of measures. We consider empirical underidentification problems that are encountered when fitting particular types of bifactor models to certain types of data sets. The objective of the article was fourfold: (a) to allow readers to gain a better general understanding of issues surrounding empirical identification, (b) to offer insights into empirical underidentification with bifactor models, (c) to inform methodologists who explore bifactor models about empirical underidentification with these models, and (d) to propose strategies for structural equation model users to deal with underidentification problems that can emerge when applying bifactor models.