Showing papers in "Educational and Psychological Measurement in 2016"

PDF

Open Access

Journal Article•DOI•

Hypothesis Testing Using Factor Score Regression: A Comparison of Four Methods

[...]

Ines Devlieger¹, Axel Mayer¹, Yves Rosseel¹•Institutions (1)

01 Oct 2016-Educational and Psychological Measurement

TL;DR: The bias correcting method, with the newly developed standard error, is the only suitable alternative for SEM and has a higher standard error bias than SEM, but has a comparable bias, efficiency, mean square error, power, and type I error rate.

...read moreread less

Abstract: In this article, an overview is given of four methods to perform factor score regression (FSR), namely regression FSR, Bartlett FSR, the bias avoiding method of Skrondal and Laake, and the bias correcting method of Croon. The bias correcting method is extended to include a reliable standard error. The four methods are compared with each other and with structural equation modeling (SEM) by using analytic calculations and two Monte Carlo simulation studies to examine their finite sample characteristics. Several performance criteria are used, such as the bias using the unstandardized and standardized parameterization, efficiency, mean square error, standard error bias, type I error rate, and power. The results show that the bias correcting method, with the newly developed standard error, is the only suitable alternative for SEM. While it has a higher standard error bias than SEM, it has a comparable bias, efficiency, mean square error, power, and type I error rate.

...read moreread less

105 citations

Journal Article•DOI•

Alternative Models for Small Samples in Psychological Research Applying Linear Mixed Effects Models and Generalized Estimating Equations to Repeated Measures Data

[...]

Chelsea Muth¹, Karen L. Bales¹, Katie Hinde², Nicole Maninger¹, Sally P. Mendoza¹, Emilio Ferrer¹ - Show less +2 more•Institutions (2)

University of California, Davis¹, Harvard University²

01 Feb 2016-Educational and Psychological Measurement

TL;DR: A brief tutorial and exploration of two alternative longitudinal modeling techniques, linear mixed effects models and generalized estimating equations, as applied to a repeated measures study of pairmate attachment and social stress in primates.

...read moreread less

Abstract: Unavoidable sample size issues beset psychological research that involves scarce populations or costly laboratory procedures. When incorporating longitudinal designs these samples are further reduced by traditional modeling techniques, which perform listwise deletion for any instance of missing data. Moreover, these techniques are limited in their capacity to accommodate alternative correlation structures that are common in repeated measures studies. Researchers require sound quantitative methods to work with limited but valuable measures without degrading their data sets. This article provides a brief tutorial and exploration of two alternative longitudinal modeling techniques, linear mixed effects models and generalized estimating equations, as applied to a repeated measures study (n = 12) of pairmate attachment and social stress in primates. Both techniques provide comparable results, but each model offers unique information that can be helpful when deciding the right analytic tool.

...read moreread less

100 citations

Journal Article•DOI•

Robust Coefficients Alpha and Omega and Confidence Intervals With Outlying Observations and Missing Data Methods and Software

[...]

Zhiyong Zhang¹, Ke-Hai Yuan¹•Institutions (1)

University of Notre Dame¹

01 Jun 2016-Educational and Psychological Measurement

TL;DR: This study proposes robust procedures to estimate both alpha and omega as well as corresponding standard errors and confidence intervals from samples that may contain potential outlying observations and missing values and results show that the newly developed robust method yields substantially improvedalpha and omega estimates aswell as better coverage rates of confidence intervals than the conventional nonrobust method.

...read moreread less

Abstract: Cronbach’s coefficient alpha is a widely used reliability measure in social, behavioral, and education sciences. It is reported in nearly every study that involves measuring a construct through mul...

...read moreread less

99 citations

Journal Article•DOI•

A Latent Transition Analysis Model for Assessing Change in Cognitive Skills.

[...]

Feiming Li¹, Allan S. Cohen², Brian A. Bottge³, Jonathan Templin⁴•Institutions (4)

University of North Texas¹, University of Georgia², University of Kentucky³, University of Kansas⁴

01 Apr 2016-Educational and Psychological Measurement

TL;DR: The use of a cognitive diagnostic model, the DINA model, as the measurement model in a LTA is illustrated, thereby demonstrating a means of analyzing change in cognitive skills over time and expanding the utility of LTA to practical problems in educational measurement research.

...read moreread less

Abstract: Latent transition analysis (LTA) was initially developed to provide a means of measuring change in dynamic latent variables. In this article, we illustrate the use of a cognitive diagnostic model, the DINA model, as the measurement model in a LTA, thereby demonstrating a means of analyzing change in cognitive skills over time. An example is presented of an instructional treatment on a sample of seventh-grade students in several classrooms in a Midwestern school district. In the example, it is demonstrated how hypotheses could be framed and then tested regarding the form of the change in different groups within the population. Both manifest and latent groups also are defined and used to test additional hypotheses about change specific to particular subpopulations. Results suggest that the use of a DINA measurement model expands the utility of LTA to practical problems in educational measurement research.

...read moreread less

80 citations

Journal Article•DOI•

A Comparison of Composite Reliability Estimators Coefficient Omega Confidence Intervals in the Current Literature

[...]

Miguel A. Padilla¹, Jasmin Divers²•Institutions (2)

Old Dominion University¹, Wake Forest University²

01 Jun 2016-Educational and Psychological Measurement

TL;DR: Six approaches for estimating confidence intervals for coefficient omega with unidimensional congeneric items were evaluated through a Monte Carlo simulation and the normal theory bootstrap confidence interval had the best performance across all simulation conditions that included sample sizes less than 100.

...read moreread less

Abstract: Coefficient omega and alpha are both measures of the composite reliability for a set of items. Unlike coefficient alpha, coefficient omega remains unbiased with congeneric items with uncorrelated e...

...read moreread less

80 citations

Journal Article•DOI•

Testing the Difference of Correlated Agreement Coefficients for Statistical Significance

[...]

Kilem L. Gwet

01 Aug 2016-Educational and Psychological Measurement

TL;DR: A technique similar to the classical pairwise t test for means, which is based on a large-sample linear approximation of the agreement coefficient is proposed, which requires neither advanced statistical modeling skills nor considerable computer programming experience.

...read moreread less

Abstract: This article addresses the problem of testing the difference between two correlated agreement coefficients for statistical significance. A number of authors have proposed methods for testing the difference between two correlated kappa coefficients, which require either the use of resampling methods or the use of advanced statistical modeling techniques. In this article, we propose a technique similar to the classical pairwise t test for means, which is based on a large-sample linear approximation of the agreement coefficient. We illustrate the use of this technique with several known agreement coefficients including Cohen's kappa, Gwet's AC1, Fleiss's generalized kappa, Conger's generalized kappa, Krippendorff's alpha, and the Brenann-Prediger coefficient. The proposed method is very flexible, can accommodate several types of correlation structures between coefficients, and requires neither advanced statistical modeling skills nor considerable computer programming experience. The validity of this method is tested with a Monte Carlo simulation.

...read moreread less

78 citations

Journal Article•DOI•

It Might Not Make a Big DIF Improved Differential Test Functioning Statistics That Account for Sampling Variability

[...]

R. Philip Chalmers¹, Alyssa Counsell¹, David B. Flora¹•Institutions (1)

York University¹

01 Feb 2016-Educational and Psychological Measurement

TL;DR: Improved DTF statistics are proposed that properly account for sampling variability in item parameter estimates while avoiding the necessity of predicting provisional latent trait estimates to create two-step approximations.

...read moreread less

Abstract: Differential test functioning, or DTF, occurs when one or more items in a test demonstrate differential item functioning (DIF) and the aggregate of these effects are witnessed at the test level. In many applications, DTF can be more important than DIF when the overall effects of DIF at the test level can be quantified. However, optimal statistical methodology for detecting and understanding DTF has not been developed. This article proposes improved DTF statistics that properly account for sampling variability in item parameter estimates while avoiding the necessity of predicting provisional latent trait estimates to create two-step approximations. The properties of the DTF statistics were examined with two Monte Carlo simulation studies using dichotomous and polytomous IRT models. The simulation results revealed that the improved DTF statistics obtained optimal and consistent statistical properties, such as obtaining consistent Type I error rates. Next, an empirical analysis demonstrated the application of the proposed methodology. Applied settings where the DTF statistics can be beneficial are suggested and future DTF research areas are proposed.

...read moreread less

72 citations

Journal Article•DOI•

Survey Satisficing Inflates Reliability and Validity Measures: An Experimental Comparison of College and Amazon Mechanical Turk Samples.

[...]

Tyler Hamby¹, Wyn Taylor¹•Institutions (1)

University of Texas at Arlington¹

23 Jan 2016-Educational and Psychological Measurement

TL;DR: Results indicated effects for task difficulty and motivation in predicting survey satisficing, and satisficing was associated with improved internal consistency reliability and convergent validity but also worse discriminant validity in the second part of the study.

...read moreread less

Abstract: This study examined the predictors and psychometric outcomes of survey satisficing, wherein respondents provide quick, “good enough” answers (satisficing) rather than carefully considered answers (...

...read moreread less

65 citations

Journal Article•DOI•

The Role of Measurement Quality on Practical Guidelines for Assessing Measurement and Structural Invariance.

[...]

Yoonjeong Kang¹, Daniel McNeish², Gregory R. Hancock²•Institutions (2)

American Institutes for Research¹, University of Maryland, College Park²

01 Aug 2016-Educational and Psychological Measurement

TL;DR: Simulation and population analysis methods show that ΔMcDonald’s NCI is minimally affected by loading magnitude and sample size when testing invariance in the measurement model, while differences in comparative fit index varies widely when testing both measurement and structural variance as measurement quality changes, making it difficult to pinpoint a common value that suggests reasonable invariance.

...read moreread less

Abstract: Although differences in goodness-of-fit indices (ΔGOFs) have been advocated for assessing measurement invariance, studies that advanced recommended differential cutoffs for adjudicating invariance actually utilized a very limited range of values representing the quality of indicator variables (i.e., magnitude of loadings). Because quality of measurement has been found to be relevant in the context of assessing data-model fit in single-group models, this study used simulation and population analysis methods to examine the extent to which quality of measurement affects ΔGOFs for tests of invariance in multiple group models. Results show that ΔMcDonald’s NCI is minimally affected by loading magnitude and sample size when testing invariance in the measurement model, while differences in comparative fit index varies widely when testing both measurement and structural variance as measurement quality changes, making it difficult to pinpoint a common value that suggests reasonable invariance.

...read moreread less

56 citations

Journal Article•DOI•

Testing Mediation in Structural Equation Modeling The Effectiveness of the Test of Joint Significance

[...]

Craig Leth-Steensen¹, Elena Gallitto²•Institutions (2)

Carleton University¹, University of Ottawa²

01 Apr 2016-Educational and Psychological Measurement

TL;DR: Four sets of Monte Carlo simulations involving full latent variable structural equation models were run to contrast the effectiveness of the currently popular bias-corrected bootstrapping approach with the simple test of joint significance approach.

...read moreread less

Abstract: A large number of approaches have been proposed for estimating and testing the significance of indirect effects in mediation models. In this study, four sets of Monte Carlo simulations involving full latent variable structural equation models were run in order to contrast the effectiveness of the currently popular bias-corrected bootstrapping approach with the simple test of joint significance approach. The results from these simulations demonstrate that the test of joint significance had more power than bias-corrected bootstrapping and also yielded more reasonable Type I errors.

...read moreread less

55 citations

Journal Article•DOI•

On the Relationship Between Classical Test Theory and Item Response Theory: From One to the Other and Back.

[...]

Tenko Raykov¹, George A. Marcoulides²•Institutions (2)

Michigan State University¹, University of California, Santa Barbara²

01 Apr 2016-Educational and Psychological Measurement

TL;DR: Two distinct observational equivalence approaches are outlined that render the item response models from corresponding classical test theory-based models, and can each be used to obtain the former from the latter models.

...read moreread less

Abstract: The frequently neglected and often misunderstood relationship between classical test theory and item response theory is discussed for the unidimensional case with binary measures and no guessing. It is pointed out that popular item response models can be directly obtained from classical test theory-based models by accounting for the discrete nature of the observed items. Two distinct observational equivalence approaches are outlined that render the item response models from corresponding classical test theory-based models, and can each be used to obtain the former from the latter models. Similarly, classical test theory models can be furnished using the reverse application of either of those approaches from corresponding item response models.

...read moreread less

Journal Article•DOI•

Automatic coding of short text responses via clustering in educational assessment

[...]

Fabian Zehner¹, Christine Sälzer¹, Frank Goldhammer•Institutions (1)

Technische Universität München¹

01 Apr 2016-Educational and Psychological Measurement

TL;DR: The accuracy of automatic text coding is demonstrated by using data collected in the Programme for International Student Assessment (PISA) 2012 in Germany and potential innovations for assessment that are enabled by automatic coding of short text responses are discussed.

...read moreread less

Abstract: Educational and psychological measurement 76 (2016) 2, S. 280-303 Padagogische Teildisziplin: Empirische Bildungsforschung; als elektronischer Volltext verfugbar

...read moreread less

Journal Article•DOI•

A Note on Testing Mediated Effects in Structural Equation Models: Reconciling Past and Current Research on the Performance of the Test of Joint Significance.

[...]

Matthew J. Valente¹, Oscar Gonzalez¹, Milica Miočević¹, David P. MacKinnon¹•Institutions (1)

Arizona State University¹

25 Oct 2016-Educational and Psychological Measurement

TL;DR: Investigation of evidence that the test of joint significance was more powerful than the bias-corrected bootstrap method for detecting mediated effects in SEMs found by Leth-Steensen and Gallitto 2016 and two issues related to testing the significance of mediated effectsin SEMs are described which explain the inconsistent results.

...read moreread less

Abstract: Methods to assess the significance of mediated effects in education and the social sciences are well studied and fall into two categories: single sample methods and computer-intensive methods. A popular single sample method to detect the significance of the mediated effect is the test of joint significance, and a popular computer-intensive method to detect the significance of the mediated effect is the bias-corrected bootstrap method. Both these methods are used for testing the significance of mediated effects in structural equation models (SEMs). A recent study by Leth-Steensen and Gallitto 2015 provided evidence that the test of joint significance was more powerful than the bias-corrected bootstrap method for detecting mediated effects in SEMs, which is inconsistent with previous research on the topic. The goal of this article was to investigate this surprising result and describe two issues related to testing the significance of mediated effects in SEMs which explain the inconsistent results regarding the power of the test of joint significance and the bias-corrected bootstrap found by Leth-Steensen and Gallitto 2015. The first issue was that the bias-corrected bootstrap method was conducted incorrectly. The bias-corrected bootstrap was used to estimate the standard error of the mediated effect as opposed to creating confidence intervals. The second issue was that the correlation between the path coefficients of the mediated effect was ignored as an important aspect of testing the significance of the mediated effect in SEMs. The results of the replication study confirmed prior research on testing the significance of mediated effects. That is, the bias-corrected bootstrap method was more powerful than the test of joint significance, and the bias-corrected bootstrap method had elevated Type 1 error rates in some cases. Additional methods for testing the significance of mediated effects in SEMs were considered and limitations and future directions were discussed.

...read moreread less

Journal Article•DOI•

Improving the Factor Structure of Psychological Scales The Expanded Format as an Alternative to the Likert Scale Format

[...]

Xijuan Zhang¹, Victoria Savalei¹•Institutions (1)

University of British Columbia¹

01 Jun 2016-Educational and Psychological Measurement

TL;DR: This study examines an alternative scale format, called the Expanded format, which replaces each response option in the Likert scale with a full sentence, and tested this hypothesis on three popular psychological scales: the Rosenberg Self-Esteem scale, the Conscientiousness subscale of the Big Five Inventory, and the Beck Depression Inventory II.

...read moreread less

Abstract: Many psychological scales written in the Likert format include reverse worded (RW) items in order to control acquiescence bias. However, studies have shown that RW items often contaminate the factor structure of the scale by creating one or more method factors. The present study examines an alternative scale format, called the Expanded format, which replaces each response option in the Likert scale with a full sentence. We hypothesized that this format would result in a cleaner factor structure as compared with the Likert format. We tested this hypothesis on three popular psychological scales: the Rosenberg Self-Esteem scale, the Conscientiousness subscale of the Big Five Inventory, and the Beck Depression Inventory II. Scales in both formats showed comparable reliabilities. However, scales in the Expanded format had better (i.e., lower and more theoretically defensible) dimensionalities than scales in the Likert format, as assessed by both exploratory factor analyses and confirmatory factor analyses. We encourage further study and wider use of the Expanded format, particularly when a scale's dimensionality is of theoretical interest.

...read moreread less

Journal Article•DOI•

Reliability and Model Fit.

[...]

Leanne Stanley¹, Michael C. Edwards¹•Institutions (1)

Ohio State University¹

17 Mar 2016-Educational and Psychological Measurement

TL;DR: The purpose of this article is to highlight the distinction between the reliability of test scores and the fit of psychometric measurement models, reminding readers why it is important to consider both when evaluating whether test scores are valid for a proposed interpretation and/or use.

...read moreread less

Abstract: The purpose of this article is to highlight the distinction between the reliability of test scores and the fit of psychometric measurement models, reminding readers why it is important to consider both when evaluating whether test scores are valid for a proposed interpretation and/or use. It is often the case that an investigator judges both the reliability of scores and the fit of a corresponding measurement model to be either acceptable or unacceptable for a given situation, but these are not the only possible outcomes. This article focuses on situations in which model fit is deemed acceptable, but reliability is not. Data were simulated based on the item characteristics of the PROMIS (Patient Reported Outcomes Measurement Information System) anxiety item bank and analyzed using methods from classical test theory, factor analysis, and item response theory. Analytic techniques from different psychometric traditions were used to illustrate that reliability and model fit are distinct, and that disagreement among indices of reliability and model fit may provide important information bearing on a particular validity argument, independent of the data analytic techniques chosen for a particular research application. We conclude by discussing the important information gleaned from the assessment of reliability and model fit.

...read moreread less

Journal Article•DOI•

Accuracy of Revised and Traditional Parallel Analyses for Assessing Dimensionality with Binary Data.

[...]

Samuel B. Green¹, Nickalus Redell¹, Marilyn S. Thompson¹, Roy Levy¹•Institutions (1)

Arizona State University¹

01 Feb 2016-Educational and Psychological Measurement

TL;DR: In this paper, the authors argue for a revision to parallel analysis that makes it more consistent with hypothesis testing, and evaluate the relative accuracy of the revised parallel analysis (R-PA) and traditional PA (T-PA), using Monte Carlo methods.

...read moreread less

Abstract: Parallel analysis (PA) is a useful empirical tool for assessing the number of factors in exploratory factor analysis. On conceptual and empirical grounds, we argue for a revision to PA that makes it more consistent with hypothesis testing. Using Monte Carlo methods, we evaluated the relative accuracy of the revised PA (R-PA) and traditional PA (T-PA) methods for factor analysis of tetrachoric correlations between items with binary responses. We manipulated five data generation factors: number of observations, type of factor model, factor loadings, correlation between factors, and distribution of thresholds. The R-PA method tended to be more accurate than T-PA, although not uniformly across conditions. R-PA tended to perform better relative to T-PA if the underlying model (a) was unidimensional but had some unique items, (b) had highly correlated factors, or (c) had a general factor as well as a group factor. In addition, R-PA tended to outperform T-PA if items had higher factor loadings and sample size was large. A major disadvantage of the T-PA method was that it frequently yielded inflated Type I error rates.

...read moreread less

Journal Article•DOI•

The Mediated MIMIC Model for Understanding the Underlying Mechanism of DIF.

[...]

Ying Cheng¹, Can Shao¹, Quinn N. Lathrop¹•Institutions (1)

University of Notre Dame¹

01 Feb 2016-Educational and Psychological Measurement

TL;DR: The mediated MIMIC model method is very successful in detecting the mediation effect that completely or partially accounts for DIF, while keeping the Type I error rate well controlled for both balanced and unbalanced sample sizes between focal and reference groups.

...read moreread less

Abstract: Due to its flexibility, the multiple-indicator, multiple-causes (MIMIC) model has become an increasingly popular method for the detection of differential item functioning (DIF). In this article, we propose the mediated MIMIC model method to uncover the underlying mechanism of DIF. This method extends the usual MIMIC model by including one variable or multiple variables that may completely or partially mediate the DIF effect. If complete mediation effect is found, the DIF effect is fully accounted for. Through our simulation study, we find that the mediated MIMIC model is very successful in detecting the mediation effect that completely or partially accounts for DIF, while keeping the Type I error rate well controlled for both balanced and unbalanced sample sizes between focal and reference groups. Because it is successful in detecting such mediation effects, the mediated MIMIC model may help explain DIF and give guidance in the revision of a DIF item.

...read moreread less

Journal Article•DOI•

A Simulation Study on Methods of Correcting for the Effects of Extreme Response Style

[...]

Eunike Wetzel¹, Eunike Wetzel², Jan R. Böhnke³, Norman Rose²•Institutions (3)

University of Konstanz¹, University of Tübingen², Hull York Medical School³

01 Apr 2016-Educational and Psychological Measurement

TL;DR: This simulation study investigated three methods that have been proposed for the correction of trait estimates for ERS effects: mixed Rasch models, multidimensional item response models, and regression residuals.

...read moreread less

Abstract: The impact of response styles such as extreme response style (ERS) on trait estimation has long been a matter of concern to researchers and practitioners. This simulation study investigated three methods that have been proposed for the correction of trait estimates for ERS effects: (a) mixed Rasch models, (b) multidimensional item response models, and (c) regression residuals. The methods were compared with respect to their ability of recovering the true latent trait levels. Data were generated according to a unidimensional model with only one trait, a mixed Rasch model with two populations of ERS and non-ERS, and a two-dimensional model incorporating a trait and an ERS dimension. The data were analyzed using the same models as well as linear regression where the trait estimate is regressed on an ERS score and the resulting residual is considered the corrected trait estimate. Over all conditions, the two-dimensional model achieved the best trait recovery, though the difference to the unidimensional model ...

...read moreread less

Journal Article•DOI•

Differences in Reaction to Immediate Feedback and Opportunity to Revise Answers for Multiple-Choice and Open-Ended Questions.

[...]

Yigal Attali¹, Cara Cahalan Laitusis¹, Elizabeth Stone¹•Institutions (1)

Princeton University¹

01 Oct 2016-Educational and Psychological Measurement

TL;DR: The reactions of test takers to an interactive assessment with immediate feedback and answer-revision opportunities for the two types of items are investigated, and the concept of effortful engagement is explained—the OE format encourages more mindful engagement with the items in interactive mode.

...read moreread less

Abstract: There are many reasons to believe that open-ended (OE) and multiple-choice (MC) items elicit different cognitive demands of students. However, empirical evidence that supports this view is lacking. In this study, we investigated the reactions of test takers to an interactive assessment with immediate feedback and answer-revision opportunities for the two types of items. Eighth-grade students solved mathematics problems, both MC and OE, with standard instructions and feedback-and-revision opportunities. An analysis of scores based on revised answers in feedback mode revealed gains in measurement precision for OE items but not for MC items. These results are explained through the concept of effortful engagement-the OE format encourages more mindful engagement with the items in interactive mode. This interpretation is supported by analyses of response times and test takers' reports.

...read moreread less

Journal Article•DOI•

Extracting Spurious Latent Classes in Growth Mixture Modeling With Nonnormal Errors

[...]

Kiero Guerra-Peña¹, Douglas Steinley²•Institutions (2)

Pontificia Universidad Católica Madre y Maestra¹, University of Missouri²

01 Mar 2016-Educational and Psychological Measurement

TL;DR: Results show that spurious classes may be selected and optimal solutions obtained in the data analysis when the population departs from normality even when the nonnormality is only present in time invariant covariates.

...read moreread less

Abstract: Growth mixture modeling is generally used for two purposes: (1) to identify mixtures of normal subgroups and (2) to approximate oddly shaped distributions by a mixture of normal components. Often in applied research this methodology is applied to both of these situations indistinctly: using the same fit statistics and likelihood ratio tests. This can lead to the overextraction of latent classes and the attribution of substantive meaning to these spurious classes. The goals of this study are (1) to explore the performance of the Bayesian information criterion, sample-adjusted BIC, and bootstrap likelihood ratio test in growth mixture modeling analysis with nonnormal distributed outcome variables and (2) to examine the effects of nonnormal time invariant covariates in the estimation of the number of latent classes when outcome variables are normally distributed. For both of these goals, we will include nonnormal conditions not considered previously in the literature. Two simulation studies were conducted. Results show that spurious classes may be selected and optimal solutions obtained in the data analysis when the population departs from normality even when the nonnormality is only present in time invariant covariates.

...read moreread less

Journal Article•DOI•

An Approach to Scoring and Equating Tests With Binary Items: Piloting With Large-Scale Assessments.

[...]

Dimiter M. Dimitrov¹•Institutions (1)

George Mason University¹

16 Feb 2016-Educational and Psychological Measurement

TL;DR: The proposed D-scaling proved promising under its current piloting with large-scale assessments and the hope is that it can efficiently complement IRT procedures in the practice of large- scale testing in the field of education and psychology.

...read moreread less

Abstract: This article describes an approach to test scoring, referred to as delta scoring (D-scoring), for tests with dichotomously scored items. The D-scoring uses information from item response theory (IRT) calibration to facilitate computations and interpretations in the context of large-scale assessments. The D-score is computed from the examinee's response vector, which is weighted by the expected difficulties (not "easiness") of the test items. The expected difficulty of each item is obtained as an analytic function of its IRT parameters. The D-scores are independent of the sample of test-takers as they are based on expected item difficulties. It is shown that the D-scale performs a good bit better than the IRT logit scale by criteria of scale intervalness. To equate D-scales, it is sufficient to rescale the item parameters, thus avoiding tedious and error-prone procedures of mapping test characteristic curves under the method of IRT true score equating, which is often used in the practice of large-scale testing. The proposed D-scaling proved promising under its current piloting with large-scale assessments and the hope is that it can efficiently complement IRT procedures in the practice of large-scale testing in the field of education and psychology.

...read moreread less

Journal Article•DOI•

Effort in Low-Stakes Assessments: What Does It Take to Perform as Well as in a High-Stakes Setting?

[...]

Yigal Attali¹•Institutions (1)

Princeton University¹

04 Mar 2016-Educational and Psychological Measurement

TL;DR: Analyses of performance under the high- and low-stakes situations revealed that the level of effort in theLow-stakes situation strongly predicted the stakes effect on performance, and the correlations between the low- and high-stakes scores approached the upper bound possible considering the reliability of the test.

...read moreread less

Abstract: Performance of students in low-stakes testing situations has been a concern and focus of recent research. However, researchers who have examined the effect of stakes on performance have not been ab...

...read moreread less

Journal Article•DOI•

Can Reliability of Multiple Component Measuring Instruments Depend on Response Option Presentation Mode

[...]

Natalja Menold¹, Tenko Raykov²•Institutions (2)

Leibniz Association¹, Michigan State University²

01 Jun 2016-Educational and Psychological Measurement

TL;DR: It is demonstrated that the reliability of an instrument need not be the same when polarity of the response options for its individual components differs across administrations of the instrument.

...read moreread less

Abstract: This article examines the possible dependency of composite reliability on presentation format of the elements of a multi-item measuring instrument. Using empirical data and a recent method for interval estimation of group differences in reliability, we demonstrate that the reliability of an instrument need not be the same when polarity of the response options for its individual components differs across administrations of the instrument. Implications for empirical educational, behavioral, and social research are discussed.

...read moreread less

Journal Article•DOI•

Rasch Model Parameter Estimation in the Presence of a Nonnormal Latent Trait Using a Nonparametric Bayesian Approach.

[...]

Holmes Finch¹, Julianne M. Edwards¹•Institutions (1)

Ball State University¹

01 Aug 2016-Educational and Psychological Measurement

TL;DR: Results of the current study support that the nonparametric Bayesian estimation approach may be a preferred option when fitting a Rasch model in the presence of nonnormal latent traits and item difficulties, as it proved to be most accurate in virtually all scenarios that were simulated in this study.

...read moreread less

Abstract: Standard approaches for estimating item response theory (IRT) model parameters generally work under the assumption that the latent trait being measured by a set of items follows the normal distribution. Estimation of IRT parameters in the presence of nonnormal latent traits has been shown to generate biased person and item parameter estimates. A number of methods, including Ramsay curve item response theory, have been developed to reduce such bias, and have been shown to work well for relatively large samples and long assessments. An alternative approach to the nonnormal latent trait and IRT parameter estimation problem, nonparametric Bayesian estimation approach, has recently been introduced into the literature. Very early work with this method has shown that it could be an excellent option for use when fitting the Rasch model when assumptions cannot be made about the distribution of the model parameters. The current simulation study was designed to extend research in this area by expanding the simulatio...

...read moreread less

Journal Article•DOI•

Evaluating Rater Accuracy in Rater-Mediated Assessments Using an Unfolding Model

[...]

Jue Wang¹, George Engelhard¹, Edward W. Wolfe²•Institutions (2)

University of Georgia¹, Pearson Education²

01 Dec 2016-Educational and Psychological Measurement

TL;DR: It is suggested that HCM is a promising approach for examining rater accuracy, and that the HCM can provide a useful interpretive framework for evaluating the quality of ratings obtained within the context of rater-mediated assessments.

...read moreread less

Abstract: The number of performance assessments continues to increase around the world, and it is important to explore new methods for evaluating the quality of ratings obtained from raters. This study describes an unfolding model for examining rater accuracy. Accuracy is defined as the difference between observed and expert ratings. Dichotomous accuracy ratings (0 = inaccurate, 1 = accurate) are unfolded into three latent categories: inaccurate below expert ratings, accurate ratings, and inaccurate above expert ratings. The hyperbolic cosine model (HCM) is used to examine dichotomous accuracy ratings from a statewide writing assessment. This study suggests that HCM is a promising approach for examining rater accuracy, and that the HCM can provide a useful interpretive framework for evaluating the quality of ratings obtained within the context of rater-mediated assessments.

...read moreread less

Journal Article•DOI•

Partially Compensatory Multidimensional Item Response Theory Models: Two Alternate Model Forms.

[...]

Christine E. DeMars¹•Institutions (1)

James Madison University¹

01 Apr 2016-Educational and Psychological Measurement

TL;DR: Either the model used to simulate the data or the compensatory model generally had the best fit, as indexed by information criteria, and interfactor correlations were estimated well by both the correct model and the compensated model.

...read moreread less

Abstract: Partially compensatory models may capture the cognitive skills needed to answer test items more realistically than compensatory models, but estimating the model parameters may be a challenge. Data were simulated to follow two different partially compensatory models, a model with an interaction term and a product model. The model parameters were then estimated for both models and for the compensatory model. Either the model used to simulate the data or the compensatory model generally had the best fit, as indexed by information criteria. Interfactor correlations were estimated well by both the correct model and the compensatory model. The predicted response probabilities were most accurate from the model used to simulate the data. Regarding item parameters, root mean square errors seemed reasonable for the interaction model but were quite large for some items for the product model. Thetas were recovered similarly by all models, regardless of the model used to simulate the data.

...read moreread less

Journal Article•DOI•

Question Order Affects the Measurement of Bullying Victimization among Middle School Students.

[...]

Francis L. Huang¹, Dewey G. Cornell²•Institutions (2)

University of Missouri¹, University of Virginia²

01 Oct 2016-Educational and Psychological Measurement

TL;DR: A randomized experiment testing the question-order effect found that changing the sequence of questions can result in 45% higher prevalence rates, which raises questions about the accuracy of several widely used bullying surveys.

...read moreread less

Abstract: Bullying among youth is recognized as a serious student problem, especially in middle school. The most common approach to measuring bullying is through student self-report surveys that ask questions about different types of bullying victimization. Although prior studies have shown that question-order effects may influence participant responses, no study has examined these effects with middle school students. A randomized experiment (n = 5,951 middle school students) testing the question-order effect found that changing the sequence of questions can result in 45% higher prevalence rates. These findings raise questions about the accuracy of several widely used bullying surveys.

...read moreread less

Journal Article•DOI•

Exploring Rating Quality in Rater-Mediated Assessments Using Mokken Scale Analysis.

[...]

Stefanie A. Wind¹, George Engelhard²•Institutions (2)

University of Alabama¹, University of Georgia²

01 Aug 2016-Educational and Psychological Measurement

TL;DR: Overall, the findings suggest that indices of rater monotonicity, rater scalability, and invariant rater ordering based on Mokken scaling provide diagnostic information at the level of individual raters related to the requirements for invariant measurement.

...read moreread less

Abstract: Mokken scale analysis is a probabilistic nonparametric approach that offers statistical and graphical tools for evaluating the quality of social science measurement without placing potentially inappropriate restrictions on the structure of a data set. In particular, Mokken scaling provides a useful method for evaluating important measurement properties, such as invariance, in contexts where response processes are not well understood. Because rater-mediated assessments involve complex interactions among many variables, including assessment contexts, student artifacts, rubrics, individual rater characteristics, and others, rater-assigned scores are suitable candidates for Mokken scale analysis. The purposes of this study are to describe a suite of indices that can be used to explore the psychometric quality of data from rater-mediated assessments and to illustrate the substantive interpretation of Mokken-based statistics and displays in this context. Techniques that are commonly used in polytomous applications of Mokken scaling are adapted for use with rater-mediated assessments, with a focus on the substantive interpretation related to individual raters. Overall, the findings suggest that indices of rater monotonicity, rater scalability, and invariant rater ordering based on Mokken scaling provide diagnostic information at the level of individual raters related to the requirements for invariant measurement. These Mokken-based indices serve as an additional suite of diagnostic tools for exploring the quality of data from rater-mediated assessments that can supplement rating quality indices based on parametric models.

...read moreread less

Journal Article•DOI•

Convergence, Admissibility, and Fit of Alternative Confirmatory Factor Analysis Models for MTMM Data.

[...]

Charles E. Lance, Yi Fan¹•Institutions (1)

University of Georgia¹

01 Jun 2016-Educational and Psychological Measurement

TL;DR: The CTCM-R model provided the most accurate estimates of the full range of parameters relevant to a confirmatory factor analytic model of MTMM data, and showed that they are often not plausible when analyzing real data.

...read moreread less

Abstract: We compared six different analytic models for multitrait–multimethod (MTMM) data in terms of convergence, admissibility, and model fit to 258 samples of previously reported data. Two well-known models, the correlated trait–correlated method (CTCM) and the correlated trait–correlated uniqueness (CTCU) models, were fit for reference purposes in comparison to four other under- or unstudied models, including (a) Rindskopf’s reparameterization of the CTCM (CTCM-R) model, (b) a correlated trait–constrained uncorrelated method model and two of its more general cases, (c) a correlated trait–constrained correlated method model, and (d) a correlated trait–uncorrelated method model. Results show that (a) the CTCM-R model often solved convergence and admissibility problems with the CTCM model at rates equivalent to the CTCU model and (b) constrained models often provided convergent and admissible solutions but significantly worse model fit, indicating that they are often not plausible when analyzing real data. A foll...

...read moreread less

Journal Article•DOI•

Evaluation of Measurement Instrument Criterion Validity in Finite Mixture Settings.

[...]

Tenko Raykov¹, George A. Marcoulides², Tenglong Li¹•Institutions (2)

Michigan State University¹, University of California, Santa Barbara²

01 Dec 2016-Educational and Psychological Measurement

TL;DR: A method for evaluating the validity of multicomponent measurement instruments in heterogeneous populations is discussed and can be used for point and interval estimation of criterion validity of linear composites in populations representing mixtures of an unknown number of latent classes.

...read moreread less

Abstract: A method for evaluating the validity of multicomponent measurement instruments in heterogeneous populations is discussed. The procedure can be used for point and interval estimation of criterion validity of linear composites in populations representing mixtures of an unknown number of latent classes. The approach permits also the evaluation of between-class validity differences as well as within-class validity coefficients. The method can similarly be used with known class membership when distinct populations are investigated, their number is known beforehand and membership in them is observed for the studied subjects, as well as in settings where only the number of latent classes is known. The discussed procedure is illustrated with numerical data.

...read moreread less