
Showing papers in "Educational and Psychological Measurement in 2014"


Journal ArticleDOI
TL;DR: This article used data from one large-scale survey as a basis for examining the extent to which typical fit measures used in multiple-group confirmatory factor analysis are suitable for detecting measurement invariance in a large-scale survey context.
Abstract: In the field of international educational surveys, equivalence of achievement scale scores across countries has received substantial attention in the academic literature; however, only a relatively recent emphasis on scale score equivalence in nonachievement education surveys has emerged. Given the current state of research in multiple-group models, findings regarding these recent measurement invariance investigations were supported with research that was limited in scope to a few groups and relatively small sample sizes. To that end, this study uses data from one large-scale survey as a basis for examining the extent to which typical fit measures used in multiple-group confirmatory factor analysis are suitable for detecting measurement invariance in a large-scale survey context. Using measures validated in a smaller scale context and an empirically grounded simulation study, our findings indicate that many typical measures and associated criteria are either unsuitable in a large group and varied sample-size...

335 citations


Journal ArticleDOI
TL;DR: It is concluded that structural equation modeling is a viable methodology to model complex regional interdependencies in brain activation in pediatric populations.
Abstract: The present study assessed the impact of sample size on the power and fit of structural equation modeling applied to functional brain connectivity hypotheses. The data consisted of time-constrained minimum norm estimates of regional brain activity during performance of a reading task obtained with magnetoencephalography. Power analysis was first conducted for an autoregressive model with 5 latent variables (brain regions), each defined by 3 indicators (successive activity time bins). A series of simulations was then run by generating data from an existing pool of 51 typical readers (aged 7.5-12.5 years). Sample sizes ranged between 20 and 1,000 participants, and for each sample size 1,000 replications were run. Results were evaluated using chi-square Type I errors, model convergence, mean RMSEA (root mean square error of approximation) values, confidence intervals of the RMSEA, structural path stability, and D-Fit index values. Results suggested that 70 to 80 participants were adequate to model relationships reflecting close to not-so-close fit as per MacCallum et al.'s recommendations. Sample sizes of 50 participants were associated with satisfactory fit. It is concluded that structural equation modeling is a viable methodology to model complex regional interdependencies in brain activation in pediatric populations.

191 citations
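
A small illustration of the RMSEA machinery the simulations above rely on: given a model chi-square, its degrees of freedom, and the sample size, the sketch below computes the point estimate and a 90% confidence interval by inverting the noncentral chi-square distribution. The function name and the example numbers are hypothetical; the study itself fits full structural equation models and evaluates many more criteria.

```python
import numpy as np
from scipy import stats, optimize

def rmsea_with_ci(chisq, df, n, conf=0.90):
    """RMSEA point estimate and confidence interval from a model chi-square."""
    rmsea = np.sqrt(max(chisq - df, 0.0) / (df * (n - 1)))

    def ncp_for(prob):
        # Noncentrality lambda with P(X <= chisq | df, lambda) = prob.
        f = lambda lam: stats.ncx2.cdf(chisq, df, lam) - prob
        if f(1e-8) <= 0:               # even lambda ~ 0 leaves too little mass
            return 0.0
        hi = 1.0
        while f(hi) > 0:               # expand until the root is bracketed
            hi *= 2.0
        return optimize.brentq(f, 1e-8, hi)

    alpha = 1.0 - conf
    lam_lo = ncp_for(1.0 - alpha / 2)  # smaller lambda -> lower RMSEA bound
    lam_hi = ncp_for(alpha / 2)        # larger lambda -> upper RMSEA bound
    scale = df * (n - 1)
    return rmsea, np.sqrt(lam_lo / scale), np.sqrt(lam_hi / scale)

# Hypothetical chi-square from one replication with n = 70.
print(rmsea_with_ci(chisq=123.4, df=80, n=70))
```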


Journal ArticleDOI
TL;DR: This paper investigated the performance of classical and model-based approaches in empirical data, accounting for different kinds of missing responses simultaneously, and confirmed the existence of a unidimensional tendency to omit items.
Abstract: Data from competence tests usually show a number of missing responses on test items due to both omitted and not-reached items. Different approaches for dealing with missing responses exist, and there are no clear guidelines on which of those to use. While classical approaches rely on an ignorable missing data mechanism, the most recently developed model-based approaches account for nonignorable missing responses. Model-based approaches include the missing propensity in the measurement model. Although these models are very promising, the assumptions made in these models have not yet been tested for plausibility in empirical data. Furthermore, studies investigating the performance of different approaches have only focused on one kind of missing response at once. In this study, we investigated the performance of classical and model-based approaches in empirical data, accounting for different kinds of missing responses simultaneously. We confirmed the existence of a unidimensional tendency to omit items. Indicating nonignorability of the missing mechanism, missing tendency due to both omitted and not-reached items correlated with ability. However, results on parameter estimation showed that ignoring missing...

98 citations
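
To make the distinction between the two kinds of missing responses concrete, here is a minimal sketch that separates omitted from not-reached items in a response matrix, assuming items appear in administration order and NaN marks a missing response. The function name and coding rule are illustrative assumptions, not the study's implementation.

```python
import numpy as np

def classify_missing(responses):
    """Split missing responses into 'not reached' (the trailing run of missing
    items after the last answered item) and 'omitted' (all other missing items)."""
    responses = np.asarray(responses, dtype=float)
    miss = np.isnan(responses)
    not_reached = np.zeros_like(miss)
    for i, row in enumerate(miss):
        answered = np.where(~row)[0]
        last = answered[-1] if answered.size else -1
        not_reached[i, last + 1:] = row[last + 1:]
    omitted = miss & ~not_reached
    return omitted, not_reached

data = np.array([[1, np.nan, 0, np.nan, np.nan],
                 [np.nan, 1, 1, 0, 1]])
omitted, not_reached = classify_missing(data)
print(omitted.sum(axis=1), not_reached.sum(axis=1))  # per-person counts
```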


Journal ArticleDOI
TL;DR: This paper developed a new class of item response theory (IRT) models to account for ERS so that the target latent trait is free from the response style and the tendency of ERS is quantified.
Abstract: Extreme response style (ERS) is a systematic tendency for a person to endorse extreme options (e.g., strongly disagree, strongly agree) on Likert-type or rating-scale items. In this study, we develop a new class of item response theory (IRT) models to account for ERS so that the target latent trait is free from the response style and the tendency of ERS is quantified. Parameters of these new models can be estimated with marginal maximum likelihood estimation methods or Bayesian methods. In this study, we use the freeware program WinBUGS, which implements Bayesian methods. In a series of simulations, we find that the parameters are recovered fairly well; ignoring ERS by fitting standard IRT models resulted in biased estimates, and fitting the new models to data without ERS did little harm. Two empirical examples are provided to illustrate the implications and applications of the new models.

64 citations
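
Before fitting a model of the kind proposed above, a quick descriptive check is often useful; the sketch below computes each respondent's proportion of answers in the two extreme categories of a Likert scale. This proportion is only a crude proxy, not the latent ERS tendency the new IRT models estimate, and the function name and example data are made up.

```python
import numpy as np

def ers_proportion(ratings, n_categories=5):
    """Share of each respondent's answers falling in the lowest or highest
    category of an n-point Likert scale (rows = respondents, columns = items)."""
    ratings = np.asarray(ratings)
    extreme = (ratings == 1) | (ratings == n_categories)
    return extreme.mean(axis=1)

likert = np.array([[1, 5, 5, 1, 3],
                   [2, 3, 4, 3, 2]])
print(ers_proportion(likert))  # approx. [0.8, 0.0]
```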


Journal ArticleDOI
TL;DR: The authors found that response style-related and content-related processes were selectively linked to extraneous criteria of response styles and content, and that there was a substantial suppression effect of response style.
Abstract: Response styles, the tendency to respond to Likert-type items irrespective of content, are a widely known threat to the reliability and validity of self-report measures. However, it is still debated how to measure and control for response styles such as extreme responding. Recently, multiprocess item response theory models have been proposed that allow for separating multiple response processes in rating data. The rationale behind these models is to define process variables that capture psychologically meaningful aspects of the response process like, for example, content- and response style-related processes. The aim of the present research was to test the validity of this approach using two large data sets. In the first study, responses to a 7-point rating scale were disentangled, and it was shown that response style-related and content-related processes were selectively linked to extraneous criteria of response styles and content. The second study, using a 4-point rating scale, focused on a content-related criterion and revealed a substantial suppression effect of response style. The findings have implications for both basic and applied fields, namely, for modeling response styles and for the interpretation of rating data.

60 citations
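
Multiprocess (IRTree-style) models of the kind referenced above start by recoding each rating into binary pseudo-items that feed separate response processes. The sketch below shows one common recoding for a 4-point scale; the node definitions and names are illustrative assumptions, and the article's models estimate the processes jointly within an IRT framework rather than analyzing the recoded variables separately.

```python
import numpy as np

def decompose_4pt(responses):
    """Recode 4-point Likert responses (1-4) into two binary pseudo-items:
    a direction/content node (disagree vs. agree side) and an extremity node
    (midpoint vs. endpoint of the chosen side)."""
    r = np.asarray(responses)
    direction = (r >= 3).astype(int)             # 0 for categories 1-2, 1 for 3-4
    extremity = np.isin(r, (1, 4)).astype(int)   # 1 when an endpoint is chosen
    return direction, extremity

print(decompose_4pt([1, 2, 3, 4]))  # (array([0, 0, 1, 1]), array([1, 0, 0, 1]))
```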


Journal ArticleDOI
TL;DR: This article conducted an experimental study that manipulated the length of observation and order of presentation of 40-minute videotaped lessons from secondary grade classrooms and found that two 20-minute observation segments presented in random order produce the most desirable effect on score reliability and validity.
Abstract: Observational methods are increasingly being used in classrooms to evaluate the quality of teaching. Operational procedures for observing teachers are somewhat arbitrary in existing measures and vary across different instruments. To study the effect of different observation procedures on score reliability and validity, we conducted an experimental study that manipulated the length of observation and order of presentation of 40-minute videotaped lessons from secondary grade classrooms. Results indicate that two 20-minute observation segments presented in random order produce the most desirable effect on score reliability and validity. This suggests that 20-minute occasions may be sufficient time for a rater to observe true characteristics of teaching quality assessed by the measure used in the study, and randomizing the order in which segments were rated may reduce construct-irrelevant variance arising from carryover effects and rater drift.

54 citations


Journal ArticleDOI
TL;DR: The authors compared the functioning of positively and negatively worded personality items using item response theory and found that negatively worded items produced comparatively higher difficulty and lower discrimination parameters than positively worded items and yielded almost no information.
Abstract: This study compared the functioning of positively and negatively worded personality items using item response theory. In Study 1, word pairs from the Goldberg Adjective Checklist were analyzed using the Graded Response Model. Across subscales, negatively worded items produced comparatively higher difficulty and lower discrimination parameters than positively worded items and yielded almost no information. Model fit was examined for two forms of each scale: parameters freely estimated versus parameters estimated with item pairs constrained to be equal. Greater misfit was found in the latter. In Study 2, positively and negatively worded items from a more commonly formatted personality assessment were compared. Parameters again differed, albeit to a lesser extent, and model fit was improved in four out of five scales by the removal of negatively worded items. These results indicated that positively and negatively worded items were not psychometrically interchangeable and that negatively worded items have limited...

50 citations


Journal ArticleDOI
TL;DR: In this paper, the authors investigated the effect of sample size on the assignment of items to the correct scales in Mokken scale analysis and found that the AISP and GA algorithms minimally required 250 to 500 respondents when item quality was high and 1,250 to 1,750 respondents when quality was low.
Abstract: An automated item selection procedure in Mokken scale analysis partitions a set of items into one or more Mokken scales, if the data allow. Two algorithms are available that pursue the same goal of selecting Mokken scales of maximum length: Mokken’s original automated item selection procedure (AISP) and a genetic algorithm (GA). Minimum sample size requirements for the two algorithms to obtain stable, replicable results have not yet been established. In practical scale construction reported in the literature, we found that researchers used sample sizes ranging from 133 to 15,022 respondents. We investigated the effect of sample size on the assignment of items to the correct scales. Using a misclassification of 5% as a criterion, we found that the AISP and the GA algorithms minimally required 250 to 500 respondents when item quality was high and 1,250 to 1,750 respondents when item quality was low.

47 citations
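
Both algorithms select items by maximizing Loevinger's scalability coefficient H, so a small sketch of that coefficient may help readers interpret the sample-size findings. The simulation values below are arbitrary, the function is not the authors' code, and in practice the R package mokken is the usual tool.

```python
import numpy as np
from itertools import combinations

def scale_H(X):
    """Loevinger's H for dichotomous items (rows = respondents, columns = items):
    1 minus the ratio of observed to expected Guttman errors over all item pairs."""
    X = np.asarray(X)
    n = X.shape[0]
    p = X.mean(axis=0)
    obs_err = exp_err = 0.0
    for i, j in combinations(range(X.shape[1]), 2):
        easy, hard = (i, j) if p[i] >= p[j] else (j, i)
        # Guttman error: passing the harder item while failing the easier one.
        obs_err += np.sum((X[:, hard] == 1) & (X[:, easy] == 0))
        exp_err += n * p[hard] * (1 - p[easy])
    return 1 - obs_err / exp_err

rng = np.random.default_rng(1)
theta = rng.normal(size=500)
b = np.array([-1.0, -0.3, 0.4, 1.1])               # item difficulties
probs = 1 / (1 + np.exp(-(theta[:, None] - b)))    # Rasch-type response probabilities
X = (rng.random((500, 4)) < probs).astype(int)
print(round(scale_H(X), 3))
```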


Journal ArticleDOI
TL;DR: In this paper, the authors examined how sensitive the commonly used model selection indices are in class enumeration of MRMs with nonnormal errors, and investigated whether a skew-normal MRM can accommodate nonnormality, and illustrate the potential of this model with a real data analysis.
Abstract: A challenge associated with traditional mixture regression models (MRMs), which rest on the assumption of normally distributed errors, is determining the number of unobserved groups. Specifically, even slight deviations from normality can lead to the detection of spurious classes. The current work aims to (a) examine how sensitive the commonly used model selection indices are in class enumeration of MRMs with nonnormal errors, (b) investigate whether a skew-normal MRM can accommodate nonnormality, and (c) illustrate the potential of this model with a real data analysis. Simulation results indicate that model information criteria are not useful for class determination in MRMs unless errors follow a perfect normal distribution. The skew-normal MRM can accurately identify the number of latent classes in the presence of normal or mildly skewed errors, but fails to do so in severely skewed conditions. Furthermore, across the experimental conditions it is seen that some parameter estimates provided by the skew-normal MRM become more biased as skewness increases whereas others remain unbiased. Discussion of these results in the context of the applicability of skew-normal MRMs is provided.

37 citations
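
As a toy illustration of information-criterion class enumeration, the snippet below fits unrestricted multivariate normal mixtures to simulated two-class regression data and prints the BIC for one to four classes; it shows only the enumeration step. The data-generating values are arbitrary, and neither the mixture regression model itself nor its skew-normal extension is fitted here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two latent classes with different regression relations between x and y.
rng = np.random.default_rng(0)
x = rng.normal(size=600)
cls = rng.integers(0, 2, size=600)
y = np.where(cls == 0, 1.0 + 0.5 * x, -1.0 + 1.5 * x) + rng.normal(scale=0.5, size=600)
data = np.column_stack([x, y])

# Class enumeration by BIC for an unrestricted bivariate normal mixture.
for k in range(1, 5):
    gm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(data)
    print(k, round(gm.bic(data), 1))
```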


Journal ArticleDOI
TL;DR: This article illustrated a variety of nonfactor causal worlds that are perfectly, but inappropriately, fit by factor models, showing that close-fitting factor models may seriously misrepresent the world's causal structure.
Abstract: Researchers using factor analysis tend to dismiss the significant ill fit of factor models by presuming that if their factor model is close-to-fitting, it is probably close to being properly causally specified. Close fit may indeed result from a model being close to properly causally specified, but close-fitting factor models can also be seriously causally misspecified. This article illustrates a variety of nonfactor causal worlds that are perfectly, but inappropriately, fit by factor models. Seeing nonfactor worlds that are perfectly yet erroneously fit via factor models should help researchers understand that close-to-fitting factor models may seriously misrepresent the world's causal structure. Statistical cautions regarding the factor model's proclivity to fit when it ought not to fit have been insufficiently publicized and are rarely heeded. A research commitment to understanding the world's causal structure, combined with clear examples of factor mismodeling, should spur diagnostic assessment of significant factor model failures, including reassessment of published failing factor models.

36 citations


Journal ArticleDOI
TL;DR: The authors investigated the 4-year longitudinal stability of behavioral and emotional risk screening scores among a sample of youth to examine change in risk status over time, finding that the vast majority of students continued to be classified within the same risk category across time points.
Abstract: The practice of screening students to identify behavioral and emotional risk is gaining momentum, with limited guidance regarding the frequency with which screenings should occur. Screening frequency decisions are influenced by the stability of the constructs assessed and changes in risk status over time. This study investigated the 4-year longitudinal stability of behavioral and emotional risk screening scores among a sample of youth to examine change in risk status over time. Youth (N = 156) completed a self-report screening measure, the Behavioral and Emotional Screening System, at 1-year intervals in the 8th through 11th grades. Categorical and dimensional stability coefficients, as well as transitions across risk status categories, were analyzed. A latent profile analysis was conducted to determine if there were salient and consistent patterns of screening scores over time. Stability coefficients were moderate to large, with stronger coefficients across shorter time intervals. Latent profile analysis pointed to a three-class solution in which classes were generally consistent with risk categories and stable across time. Results showed that the vast majority of students continued to be classified within the same risk category across time points. Implications for practice and future research needs are discussed.

Journal ArticleDOI
TL;DR: Based on two ways of dealing with model nonconvergence, the performance of the two types of mixture models and of several model fit indices in class identification is examined, and suggestions are provided to practitioners who want to use GMM for their research.
Abstract: Growth mixture modeling has gained much attention in applied and methodological social science research recently, but the selection of the number of latent classes for such models remains a challenging issue, especially when the assumption of proper model specification is violated. The current simulation study compared the performance of a linear growth mixture model (GMM) for determining the correct number of latent classes against a completely unrestricted multivariate normal mixture model. Results revealed that model convergence is a serious problem that has been underestimated by previous GMM studies. Based on two ways of dealing with model nonconvergence, the performance of the two types of mixture models and a number of model fit indices in class identification are examined and discussed. This article provides suggestions to practitioners who want to use GMM for their research.

Journal ArticleDOI
TL;DR: This article showed that standard errors of item response theory (IRT) model parameters are often of immediate interest to practitioners and that there is a correlation between standard errors and the performance of the model.
Abstract: The present study was motivated by the recognition that standard errors (SEs) of item response theory (IRT) model parameters are often of immediate interest to practitioners and that there is curre...

Journal ArticleDOI
TL;DR: In this article, the RC-IRT model, in which the shape of the latent trait distribution is estimated simultaneously with the item parameters, is estimated via the Metropolis-Hastings Robbins-Monro (MH-RM) algorithm.
Abstract: In Ramsay curve item response theory (RC-IRT) modeling, the shape of the latent trait distribution is estimated simultaneously with the item parameters. In its original implementation, RC-IRT is estimated via Bock and Aitkin's EM algorithm, which yields maximum marginal likelihood estimates. This method, however, does not produce the parameter covariance matrix as an automatic byproduct on convergence. In turn, researchers are limited in when they can employ RC-IRT, as the covariance matrix is needed for many statistical inference procedures. The present research remedies this problem by estimating the RC-IRT model parameters by the Metropolis–Hastings Robbins–Monro (MH-RM) algorithm. An attractive feature of MH-RM is that the structure of the algorithm makes estimation of the covariance matrix convenient. Additionally, MH-RM is ideally suited for multidimensional IRT, whereas EM is limited by the "curse of dimensionality." Based on the current research, when RC-IRT or similar IRT models are eventually generalized to include multiple latent dimensions, MH-RM would appear to be the logical choice for estimation.

Journal ArticleDOI
TL;DR: In this article, the authors propose to use the relation between higher-than second-order moments on one side and correlation and regression models on the other to determine direction of dependence in nonexperimental data.
Abstract: Approaches to determining direction of dependence in nonexperimental data are based on the relation between higher-than second-order moments on one side and correlation and regression models on the other. These approaches have experienced rapid development and are being applied in contexts such as research on partner violence, attention deficit hyperactivity disorder, and currency exchange rates. In this article, we propose using these methods in the context of latent variables analysis. Specifically, we propose creating component or factor scores and relating the component score or factor score variables to each other by using methods for the determination of direction of dependence. Empirical examples use data from the development of aggression in adolescence. In the discussion, issues concerning the establishment of causal relation in empirical research are addressed.
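
A minimal sketch of the kind of third-moment reasoning these approaches rest on, applied to two observed (or component/factor score) variables: under a linear model with a skewed predictor and symmetric error, the outcome's skewness is attenuated by the cube of the correlation, so the more skewed variable is the putative cause. The function name, decision rule, and simulated data are illustrative assumptions, not the article's procedure.

```python
import numpy as np
from scipy import stats

def direction_of_dependence(x, y):
    """Compare absolute skewness of two variables as a rough screen for the
    direction of dependence; not a formal test."""
    sk_x, sk_y = stats.skew(x), stats.skew(y)
    r = np.corrcoef(x, y)[0, 1]
    direction = "x -> y" if abs(sk_x) > abs(sk_y) else "y -> x"
    return direction, round(sk_x, 2), round(sk_y, 2), round(r, 2)

rng = np.random.default_rng(2)
x = rng.exponential(size=1000)           # skewed putative cause
y = 0.6 * x + rng.normal(size=1000)      # outcome with symmetric error
print(direction_of_dependence(x, y))     # expected: ('x -> y', ...)
```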

Journal ArticleDOI
TL;DR: In this paper, the authors used signal detection analysis to generate indices of knowledge accuracy (OC-accuracy) and self-enhancement (OC-bias) and compared the OC-accuracy index with multiple choice (MC) and short answer tests in assessing knowledge of introductory psychology topics.
Abstract: The overclaiming technique is a novel assessment procedure that uses signal detection analysis to generate indices of knowledge accuracy (OC-accuracy) and self-enhancement (OC-bias). The technique has previously shown robustness over varied knowledge domains as well as low reactivity across administration contexts. Here we compared the OC-accuracy index with multiple choice (MC) and short answer (SA) tests in assessing knowledge of introductory psychology topics in a sample of 108 undergraduates. Results indicated that OC-accuracy was (a) comparable to MC and SA in predicting overall course grades and (b) superior to SA tests in reliability achieved per unit administration time. By including the OC-bias index, the overclaiming method also adds a unique element to scholastic testing, namely, a measure of knowledge self-enhancement. The latter index was a negative predictor of overall course grade, suggesting a narcissistic self-destructiveness. Because the self-enhancement index adds no extra administration time to the knowledge measure, the overclaiming approach provides a richer and more efficient information source compared with traditional methods of scholastic assessment.
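
To illustrate the signal detection logic behind the overclaiming indices, here is a small sketch computing a d'-style accuracy index (discriminating real items from foils) and a criterion-style bias index (overall willingness to claim knowledge) from binary claim data. The function name, the correction for extreme rates, and the exact correspondence to OC-accuracy and OC-bias are assumptions; the formulas used in the study are not reproduced here.

```python
import numpy as np
from scipy.stats import norm

def overclaiming_indices(claimed, is_real):
    """Signal detection indices from 'I know this' claims on real items vs. foils."""
    claimed, is_real = np.asarray(claimed, bool), np.asarray(is_real, bool)
    # Add 0.5 to counts (log-linear correction) to keep rates away from 0 and 1.
    hit_rate = ((claimed & is_real).sum() + 0.5) / (is_real.sum() + 1)
    fa_rate = ((claimed & ~is_real).sum() + 0.5) / ((~is_real).sum() + 1)
    d_prime = norm.ppf(hit_rate) - norm.ppf(fa_rate)             # accuracy analogue
    criterion = -0.5 * (norm.ppf(hit_rate) + norm.ppf(fa_rate))  # bias analogue
    return d_prime, criterion

claimed = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]   # claims of knowing each item
is_real = [1, 1, 1, 1, 0, 0, 1, 0, 1, 0]   # 1 = real item, 0 = foil
print(overclaiming_indices(claimed, is_real))
```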

Journal ArticleDOI
TL;DR: This paper examined the factorial structure of two administrations of the SAT (October 2010 and May 2011), testing competing models (one factor, general ability; two factor, mathematics and literacy; three factor, mathematics, critical reading, and writing), and found support for a two-factor model in which revise-in-context writing items loaded equally on the reading and writing factors, bridging them into a literacy factor.
Abstract: The name "SAT" has become synonymous with college admissions testing; it has been dubbed "the gold standard." Numerous studies on its reliability and predictive validity show that the SAT predicts college performance beyond high school grade point average. Surprisingly, studies of the factorial structure of the current version of today's SAT, revised in 2005, have not been reported, if conducted. One purpose of this study was to examine the factorial structure of two administrations of the SAT (October 2010 and May 2011), testing competing models (e.g., one factor: general ability; two factor: mathematics and "literacy"; three factor: mathematics, critical reading, and writing). We found support for the two-factor model, with revise-in-context writing items loading equally on the reading and writing factors, thereby bridging these factors into a "literacy" factor. A second purpose was to draw tentative implications of our finding for the "next generation" SAT or other college readiness exams in light of Common Core State Standards Consortia efforts, suggesting that combining critical reading and writing (including the essay) would offer unique revision opportunities. More specifically, a reading and writing (combined) construct might pose a relevant problem or issue with multiple documents to be used to answer questions about the issue(s) (multiple-choice, short answer) and to write an argumentative/analytical essay based on the documents provided. In this way, there may not only be an opportunity to measure students' literacy but also perhaps students' critical thinking, key factors in assessing college readiness.

Journal ArticleDOI
TL;DR: This article investigated potential sources of setting accommodation resulting in differential item functioning (DIF) on math and reading assessments for examinees with varied learning characteristics and found that the observed DIF can be explained by examinees' latent abilities, accommodation status, and characteristics (including gender, home language, and learning attitudes).
Abstract: This exploratory study investigated potential sources of setting accommodation resulting in differential item functioning (DIF) on math and reading assessments for examinees with varied learning characteristics. The examinees were those who participated in large-scale assessments and were tested in either standardized or accommodated testing conditions. The data were examined using multilevel measurement modeling, latent class analyses (LCA), and log-linear and odds ratio analyses. The results indicate that LCA models yielded substantially better fits to the observed data when they included only one covariate (total scores) than others with multiple covariates. Consistent patterns that emerged from the results also show that the observed math and reading DIF can be explained by examinees' latent abilities, accommodation status, and characteristics (including gender, home language, and learning attitudes). The present study not only confirmed previous findings that examinees' characteristics are helpful in identifying...

Journal ArticleDOI
TL;DR: The authors explored the potential for machine scoring of short written responses to the Classroom-Video-Analysis (CVA) assessment, which is designed to measure teachers' usable mathematics skills, and showed that machine scoring can be used to evaluate teachers' ability to answer CVA questions.
Abstract: In this study, we explored the potential for machine scoring of short written responses to the Classroom-Video-Analysis (CVA) assessment, which is designed to measure teachers’ usable mathematics t...
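
As a minimal sketch of what machine scoring of short written responses can look like, the snippet below cross-validates a bag-of-words classifier on a few made-up responses with 0/1 scores. The example texts, labels, and model choice are all assumptions and are far simpler than the CVA scoring explored in the study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical short responses and human scores (1 = credits student thinking).
responses = [
    "the teacher connects the fraction model to the written algorithm",
    "students just repeat the procedure without reasoning",
    "she asks students to justify why the denominators stay the same",
    "the lesson reviews homework answers quickly",
]
scores = [1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
print(cross_val_score(model, responses, scores, cv=2))  # tiny data, rough check only
```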

Journal ArticleDOI
TL;DR: In this article, three IRT approaches to examinee growth modeling were applied to a single-group anchor test design and their examinee growth estimates were compared, showing the importance of modeling the serial correlation over multiple time points and other additional dependence coming from the use of the unique item sets, as well as the anchor test.
Abstract: Typically a longitudinal growth modeling based on item response theory (IRT) requires repeated measures data from a single group with the same test design. If operational or item exposure problems are present, the same test may not be employed to collect data for longitudinal analyses and tests at multiple time points are constructed with unique item sets, as well as a set of common items (i.e., anchor test) for a study of examinee growth. In this study, three IRT approaches to examinee growth modeling were applied to a single-group anchor test design and their examinee growth estimates were compared. In terms of tracking individual growth, growth patterns in the examinee population distribution, and the overall model–data fit, results show the importance of modeling the serial correlation over multiple time points and other additional dependence coming from the use of the unique item sets, as well as the anchor test.

Journal ArticleDOI
TL;DR: The research reported in this article provided step-by-step, hands-on guidance on the item pool design process by applying the bin-and-union method to design item pools for a large-scale licensure CAT employing a complex adaptive testing algorithm with variable test length, a decision-based stopping rule, content balancing, and exposure control.
Abstract: For computerized adaptive tests (CATs) to work well, they must have an item pool with sufficient numbers of good quality items. Many researchers have pointed out that, in developing item pools for CATs, not only is the item pool size important but also the distribution of item parameters and practical considerations such as content distribution and item exposure issues. Yet, there is little research on how to design item pools to have those desirable features. The research reported in this article provided step-by-step, hands-on guidance on the item pool design process by applying the bin-and-union method to design item pools for a large-scale licensure CAT employing a complex adaptive testing algorithm with variable test length, a decision-based stopping rule, content balancing, and exposure control. The design process involved extensive simulations to identify several alternative item pool designs and evaluate their performance against a series of criteria. The design output included the desired item pool size and item parameter distribution. The results indicate that the mechanism used to identify the desirable item pool features functions well and that two recommended item pool designs would support satisfactory performance of the operational testing program.

Journal ArticleDOI
TL;DR: In this paper, it is shown that sum score-based methods for the identification of differential item functioning (DIF), such as the Mantel-Haenszel (MH) approach, can be affected by Type I error inflation in the absence of any DIF effect.
Abstract: It is known that sum score-based methods for the identification of differential item functioning (DIF), such as the Mantel–Haenszel (MH) approach, can be affected by Type I error inflation in the absence of any DIF effect. This may happen when the items differ in discrimination and when there is item impact. On the other hand, outlier DIF methods have been developed that are robust against this Type I error inflation, although they are still based on the MH DIF statistic. The present article gives an explanation for why the common MH method is indeed vulnerable to the inflation effect whereas the outlier DIF versions are not. In a simulation study, we were able to produce the Type I error inflation by inducing item impact and item differences in discrimination. At the same time and in parallel with the Type I error inflation, the dispersion of the DIF statistic across items was increased. As expected, the outlier DIF methods did not seem sensitive to impact and differences in item discrimination.
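
For readers who want the statistic under discussion in concrete form, below is a compact sketch of the Mantel–Haenszel common odds ratio for one dichotomous item, stratified on the total score, together with the ETS delta transform, applied to simulated DIF-free data. The simulation settings are arbitrary and the outlier-robust variants examined in the article are not implemented.

```python
import numpy as np

def mantel_haenszel_dif(item, group, total):
    """MH common odds ratio and ETS delta for one 0/1 item; group 0 = reference,
    group 1 = focal; total = matching score. Assumes both groups appear in strata."""
    item, group, total = map(np.asarray, (item, group, total))
    num = den = 0.0
    for s in np.unique(total):
        m = total == s
        a = np.sum(m & (group == 0) & (item == 1))   # reference correct
        b = np.sum(m & (group == 0) & (item == 0))
        c = np.sum(m & (group == 1) & (item == 1))   # focal correct
        d = np.sum(m & (group == 1) & (item == 0))
        t = a + b + c + d
        if t == 0:
            continue
        num += a * d / t
        den += b * c / t
    alpha = num / den
    return alpha, -2.35 * np.log(alpha)   # delta near 0 means negligible DIF

rng = np.random.default_rng(3)
n, diffs = 2000, np.linspace(-1, 1, 10)
group = rng.integers(0, 2, n)
theta = rng.normal(size=n)                 # no impact, no DIF in this simulation
items = (rng.random((n, 10)) < 1 / (1 + np.exp(-(theta[:, None] - diffs)))).astype(int)
print(mantel_haenszel_dif(items[:, 0], group, items.sum(axis=1)))
```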

Journal ArticleDOI
TL;DR: In this paper, the authors hypothesize that differential item functioning manifested in anchor items will have an effect on equating dependence and show that population invariance is affected when anchor item differential item functioning varies across forms in a differential manner across subpopulations.
Abstract: Invariant relationships in the internal mechanisms of estimating achievement scores on educational tests serve as the basis for concluding that a particular test is fair with respect to statistical bias concerns. Equating invariance and differential item functioning are both concerned with invariant relationships yet are treated separately in the psychometric literature. Connecting these two facets of statistical invariance is critical for developing a holistic definition of fairness in educational measurement, for fostering a deeper understanding of the nature and causes of equating invariance and a lack thereof, and for providing practitioners with guidelines for addressing reported score-level equity concerns. This study hypothesizes that differential item functioning manifested in anchor items of an assessment will have an effect on equating dependence. Findings show that when anchor item differential item functioning varies across forms in a differential manner across subpopulations, population invariance...

Journal ArticleDOI
TL;DR: In this article, a latent variable approach is proposed to find auxiliary variables with the property that if included in subsequent maximum likelihood analyses they may enhance considerably the plausibility of the underlying assumption of data missing at random.
Abstract: This research note contributes to the discussion of methods that can be used to identify useful auxiliary variables for analyses of incomplete data sets. A latent variable approach is discussed, which is helpful in finding auxiliary variables with the property that if included in subsequent maximum likelihood analyses they may enhance considerably the plausibility of the underlying assumption of data missing at random. The auxiliary variables can also be considered for inclusion alternatively in imputation models for following multiple imputation analyses. The approach can be particularly helpful in empirical settings where violations of missing at random are suspected, and is illustrated with data from an aging research study.
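
For context, the usual first-pass screen for auxiliary variables simply correlates each candidate with the missingness indicator of the incomplete variable and with its observed values; the sketch below does that on simulated data. All names and data-generating values are assumptions, and this is the conventional screen rather than the latent variable approach the note proposes.

```python
import numpy as np
import pandas as pd

def screen_auxiliaries(df, target, candidates):
    """Correlate each candidate with (a) the missingness indicator of `target`
    and (b) the observed values of `target`."""
    miss = df[target].isna().astype(float)
    rows = [{"candidate": c,
             "r_with_missingness": df[c].corr(miss),
             "r_with_observed_target": df[c].corr(df[target])}
            for c in candidates]
    return pd.DataFrame(rows)

rng = np.random.default_rng(5)
aux = rng.normal(size=500)
y = 0.6 * aux + rng.normal(size=500)
y[aux + rng.normal(size=500) > 1] = np.nan        # missingness related to aux
df = pd.DataFrame({"y": y, "aux": aux, "noise": rng.normal(size=500)})
print(screen_auxiliaries(df, "y", ["aux", "noise"]))
```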

Journal ArticleDOI
TL;DR: In this article, an alternative estimation approach, Ramsay Curve Item Response Theory (RC-IRT), is proposed to provide more accurate item parameter estimates for the NRM under normal, skewed, and bimodal latent trait distributions for ordered polytomous items.
Abstract: The nominal response model (NRM), a much understudied polytomous item response theory (IRT) model, provides researchers the unique opportunity to evaluate within-item category distinctions. Polytomous IRT models, such as the NRM, are frequently applied to psychological assessments representing constructs that are unlikely to be normally distributed in the population. Unfortunately, models estimated using estimation software with the MML/EM algorithm frequently employ a set of normal quadrature points, effectively ignoring the true shape of the latent trait distribution. To address this problem, the current research implements an alternative estimation approach, Ramsay Curve Item Response Theory (RC-IRT), to provide more accurate item parameter estimates for the NRM under normal, skewed, and bimodal latent trait distributions for ordered polytomous items. Based on the results of improved item parameter recovery under RC-IRT, it is recommended that RC-IRT estimation be implemented whenever a researcher...

Journal ArticleDOI
TL;DR: This paper applied a method based on individual differences multidimensional scaling and principal component analysis to detect item bias in terms of culture and to eliminate this bias variance from the overall item variance, so as to avoid jeopardizing validity levels and to arrive at clearer and more meaningful dimensions after adjusting the raw scores by removing the bias part.
Abstract: Several sources of bias can plague research data and individual assessment. When cultural groups are considered, across or even within countries, it is essential that the constructs assessed and evaluated are as free as possible from any source of bias and specifically from bias caused due to culturally specific characteristics. Employing the Explanations of Unemployment Scale (revised form) for a sample of 1,894 employed and unemployed adults across eight countries (the United States, the United Kingdom, Turkey, Spain, Romania, Poland, Greece, and Brazil), we applied a method based on individual differences multidimensional scaling and principal component analysis to detect item bias in terms of culture and try to eliminate this bias variance from the overall item variance so as to (a) avoid jeopardizing validity levels and (b) arrive at clearer and more meaningful dimensions after adjusting the raw scores by removing the bias part. The results supported our statistical–psychometric intervention as the s...

Journal ArticleDOI
TL;DR: In this paper, the authors compared the performance of differential item functioning (DIF) methods that do not account for multilevel data structure when the intraclass correlation coefficient (r) of the studied item was less than the r of the total score.
Abstract: Previous research has demonstrated that differential item functioning (DIF) methods that do not account for multilevel data structure could result in too frequent rejection of the null hypothesis (i.e., no DIF) when the intraclass correlation coefficient (r) of the studied item was the same as the r of the total score. The current study extended previous research by comparing the performance of DIF methods when r of the studied item was less than r of the total score, a condition that may be observed with considerable frequency in practice. The performance of two simple and frequently used DIF methods that do not account for multilevel data structure, the Mantel–Haenszel test (MH) and logistic regression (LR), was compared with the performance of a complex and less frequently used DIF method that does account for multilevel data structure, hierarchical logistic regression (HLR). Simulation indicated that HLR and LR performed equivalently in terms of significance tests under most conditions, and MH was conservative across most of the conditions. The effect size estimate of HLR was as accurate and consistent as the effect size estimates of LR and MH under the Rasch model and was more accurate and consistent than the LR and MH effect size estimates under the two-parameter item response theory model. The results of the current study provide evidence to help researchers further understand...
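
To make the logistic regression (LR) approach concrete, the sketch below runs the usual nested-model likelihood ratio tests for uniform and nonuniform DIF on simulated single-level, DIF-free data. The variable names and simulation settings are assumptions; the hierarchical variant (HLR) would add a random effect for clusters, which is not shown here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 3000
group = rng.integers(0, 2, n)
theta = rng.normal(size=n)
p = 1 / (1 + np.exp(-(theta - 0.2)))              # item generated without DIF
df = pd.DataFrame({"y": (rng.random(n) < p).astype(int),
                   "group": group,
                   "total": theta})                # stand-in for the matching score

m0 = smf.logit("y ~ total", df).fit(disp=0)
m1 = smf.logit("y ~ total + group", df).fit(disp=0)                 # uniform DIF
m2 = smf.logit("y ~ total + group + total:group", df).fit(disp=0)   # nonuniform DIF
print("uniform DIF LR chi2:", round(2 * (m1.llf - m0.llf), 2))
print("nonuniform DIF LR chi2:", round(2 * (m2.llf - m1.llf), 2))
```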

Journal ArticleDOI
TL;DR: This article presented a comparative judgment approach for holistically scored constructed response tasks, in which the grader rank orders (rather than rates) the quality of a small set of responses, and the final response scores are determined by weighting the prior and ranking information.
Abstract: This article presents a comparative judgment approach for holistically scored constructed response tasks. In this approach, the grader rank orders (rather than rates) the quality of a small set of responses. A prior automated evaluation of responses guides both set formation and scaling of rankings. Sets are formed to have similar prior scores, and subsequent rankings by graders serve to update the prior scores of responses. Final response scores are determined by weighting the prior and ranking information. This approach allows for scaling comparative judgments on the basis of a single ranking, eliminates rater effects in scoring, and offers a conceptual framework for combining human and automated evaluation of constructed response tasks. To evaluate this approach, groups of graders evaluated responses to two tasks using either the ranking (with sets of 5 responses) or traditional rating approach. Results varied by task and the relative weighting of prior versus ranking information, but in general the ranking...
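
A minimal sketch of the weighting idea, assuming a set of five responses with prior automated scores and one grader's ranking from worst to best: ranks are mapped onto the range spanned by the set's prior scores and blended with the priors. The mapping and the weight are illustrative choices, not the scaling model used in the article.

```python
import numpy as np

def update_scores(prior, ranking, weight=0.5):
    """Blend prior (automated) scores with one grader's ranking of the same set.
    `ranking` lists response indices from worst to best."""
    prior = np.asarray(prior, float)
    rank_score = np.empty_like(prior)
    rank_score[np.asarray(ranking)] = np.linspace(prior.min(), prior.max(), len(prior))
    return weight * prior + (1 - weight) * rank_score

prior = np.array([2.8, 3.4, 3.1, 2.5, 3.9])   # automated scores for a set of 5
ranking = [3, 0, 2, 1, 4]                     # grader's order, worst to best
print(update_scores(prior, ranking))
```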

Journal ArticleDOI
TL;DR: The authors investigated the use of factor mixture models (FMMs) for detecting between-class latent DIF and class-specific observed DIF, and found that FMMs with binary outcomes performed well in terms of DIF detection and recovery of large DIF effects.
Abstract: Conventional differential item functioning (DIF) detection methods (e.g., the Mantel–Haenszel test) can be used to detect DIF only across observed groups, such as gender or ethnicity. However, research has found that DIF is not typically fully explained by an observed variable. True sources of DIF may include unobserved, latent variables, such as personality or response patterns. The factor mixture model (FMM) is designed to detect unobserved sources of heterogeneity in factor models. The current study investigated use of the FMM for detecting between-class latent DIF and class-specific observed DIF. Factors that were manipulated included the DIF effect size and the latent class probabilities. The performance of model fit indices (Akaike information criterion [AIC], Bayesian information criterion [BIC], sample size–adjusted BIC, and consistent AIC) in detecting the correct DIF model was assessed. The recovery of DIF parameters was also assessed. Results indicated that FMMs with binary outcomes performed well in terms of DIF detection and recovery of large DIF effects. When class probabilities were unequal with small DIF effects, performance decreased for fit indices, power, and the recovery of DIF effects compared with equal class probability conditions. Inflated Type I errors were found for non-DIF items across simulation conditions. Results and future research directions for applied and methodological researchers are discussed.

Journal ArticleDOI
TL;DR: This article evaluated four measures of instructional differentiation: one for Grade 2 English language arts (ELA), one for Grade 2 mathematics, one for Grade 5 ELA, and one for Grade 5 mathematics.
Abstract: This study operationalizes four measures of instructional differentiation: one for Grade 2 English language arts (ELA), one for Grade 2 mathematics, one for Grade 5 ELA, and one for Grade 5 mathematics. Our study evaluates the measurement properties of each measure in a large field experiment: the Indiana Diagnostic Assessment Tools Study, which included two consecutive cluster randomized trials (CRTs) of the effects of interim assessments on student achievement. Each teacher log was designed to measure instructional practices as they were implemented for eight randomly selected students in the participating teachers' classrooms. A total of 592 teachers from 127 schools took part in this study. Logs were administered 16 times in each experiment. Item responses to the logs were scaled using the Rasch model, and reliability estimates for the differentiation measures were evaluated at the log level (observations within teachers), the teacher level, and the school level. Estimated reliability was above .70 for each of the log- and teacher-level measures. At the school level, reliability estimates were lower for Grade 5 ELA and mathematics. The variance between teachers and schools on the scaled differentiation measures was substantially less than within-teacher variation. These results provide preliminary evidence that teacher instructional logs may provide useful measures of instructional differentiation in elementary grades at multiple levels of aggregation.