
Showing papers on "Item response theory" published in 2013


Journal ArticleDOI
TL;DR: In this article, an argument-based approach to validate an interpretation or use of test scores is proposed, where the claims based on the test scores are outlined as an argument that specifies the inferences and supporting assumptions needed to get from test responses to score-based interpretations and uses.
Abstract: To validate an interpretation or use of test scores is to evaluate the plausibility of the claims based on the scores. An argument-based approach to validation suggests that the claims based on the test scores be outlined as an argument that specifies the inferences and supporting assumptions needed to get from test responses to score-based interpretations and uses. Validation then can be thought of as an evaluation of the coherence and completeness of this interpretation/use argument and of the plausibility of its inferences and assumptions. In outlining the argument-based approach to validation, this paper makes eight general points. First, it is the proposed score interpretations and uses that are validated and not the test or the test scores. Second, the validity of a proposed interpretation or use depends on how well the evidence supports the claims being made. Third, more-ambitious claims require more support than less-ambitious claims. Fourth, more-ambitious claims (e.g., construct interpretations) tend to be more useful than less-ambitious claims, but they are also harder to validate. Fifth, interpretations and uses can change over time in response to new needs and new understandings leading to changes in the evidence needed for validation. Sixth, the evaluation of score uses requires an evaluation of the consequences of the proposed uses; negative consequences can render a score use unacceptable. Seventh, the rejection of a score use does not necessarily invalidate a prior, underlying score interpretation. Eighth, the validation of the score interpretation on which a score use is based does not validate the score use.

1,300 citations


Journal ArticleDOI
TL;DR: The HLQ covers 9 conceptually distinct areas of health literacy to assess the needs and challenges of a wide range of people and organisations and is likely to be useful in surveys, intervention evaluation, and studies of theneeds and capabilities of individuals.
Abstract: Health literacy has become an increasingly important concept in public health. We sought to develop a comprehensive measure of health literacy capable of diagnosing health literacy needs across individuals and organisations by utilizing perspectives from the general population, patients, practitioners and policymakers. Using a validity-driven approach we undertook grounded consultations (workshops and interviews) to identify broad conceptually distinct domains. Questionnaire items were developed directly from the consultation data following a strict process aiming to capture the full range of experiences of people currently engaged in healthcare through to people in the general population. Psychometric analyses included confirmatory factor analysis (CFA) and item response theory. Cognitive interviews were used to ensure questions were understood as intended. Items were initially tested in a calibration sample from community health, home care and hospital settings (N=634) and then in a replication sample (N=405) comprising recent emergency department attendees. Initially 91 items were generated across 6 scales with agree/disagree response options and 5 scales with difficulty-in-undertaking-tasks response options. Cognitive testing revealed that most items were well understood and only some minor re-wording was required. Psychometric testing of the calibration sample identified 34 poorly performing or conceptually redundant items, which were removed, resulting in 10 scales. These were then tested in the replication sample and refined to yield 9 final scales comprising 44 items. A 9-factor CFA model was fitted to these items with no cross-loadings or correlated residuals allowed. Given the very restricted nature of the model, the fit was quite satisfactory: χ²(WLSMV, 866 d.f.) = 2927, p < 0.000, CFI = 0.936, TLI = 0.930, RMSEA = 0.076, and WRMR = 1.698. Final scales included: Feeling understood and supported by healthcare providers; Having sufficient information to manage my health; Actively managing my health; Social support for health; Appraisal of health information; Ability to actively engage with healthcare providers; Navigating the healthcare system; Ability to find good health information; and Understand health information well enough to know what to do. The HLQ covers 9 conceptually distinct areas of health literacy to assess the needs and challenges of a wide range of people and organisations. Given the validity-driven approach, the HLQ is likely to be useful in surveys, intervention evaluation, and studies of the needs and capabilities of individuals.

794 citations


Book ChapterDOI
04 Jul 2013
TL;DR: A new methodology that is capable of diagnosing cognitive errors and analyzing different methods for solving problems will be introduced, illustrated with fraction subtraction problems.
Abstract: Finding the sources of misconceptions possessed by students is a difficult task because it is impossible to see what is happening in their heads. The only directly observable outcomes are the students' responses to the test items. Studying their think-aloud protocols is one method for discovering how students solve or think through problems. Several computer programs that are capable of diagnosing students' misconceptions have been developed in the past decade (Brown & Burton, 1978; Marshall, 1980; Ohlsson & Langley, 1985; Sleeman, 1984; Tatsuoka, Baillie, & Yamamoto, 1982; VanLehn, 1983). The common ground for these cognitive diagnostic systems is that they infer the unobservable cognitive processes from logical interrelationships among cognitive tasks, subtasks, and goals involved in problems representing the domain of interest. It is important that we be able to retrieve invisible things from the "black box" and put them into a useful form so that valuable information can be obtained for improving educational quality.

307 citations



Journal ArticleDOI
TL;DR: A unidimensional Barratt Impulsiveness Scale-Brief is introduced that includes 8 of the original BIS-11 items and will reduce the burden on respondents without loss of information in clinical assessment settings and large epidemiological studies of psychiatric disorders.
Abstract: The Barratt Impulsivity Scale (BIS), a 30-item self-report measure, is one of the most commonly used scales for the assessment of the personality construct of impulsiveness. It has recently marked 50 years of use in research and clinical settings. The current BIS-11 is held to measure 3 theoretical subtraits, namely, attentional, motor, and non-planning impulsiveness. We evaluated the factor structure of the BIS using full information item bifactor analysis for Likert-type items. We found no evidence supporting the 3-factor model. In fact, half of the items do not share any relation with other items and do not form any factor. In light of this, we introduce a unidimensional Barratt Impulsiveness Scale-Brief (BIS-Brief) that includes 8 of the original BIS-11 items. Next, we present evidence of construct validity comparing scores obtained with the BIS-Brief against the original BIS total scores using data from (a) a community sample of borderline personality patients and normal controls, (b) a forensic sample, and (c) an inpatient sample of young adults and adolescents. The BIS-Brief score demonstrated indices of construct validity similar to those observed for the BIS-11 total score. Use of the BIS-Brief in clinical assessment settings and large epidemiological studies of psychiatric disorders will reduce the burden on respondents without loss of information.

251 citations


Book
14 Nov 2013
TL;DR: In this article, the authors present a text of self-standing modules for any psychometrics, testing and measurement, or multivariate statistics course taught in psychology, education, marketing, or management; professional researchers in need of a quick refresher on applying measurement theory will also find it an invaluable reference.
Abstract: This book helps readers apply testing and measurement theories. Featuring 22 self-standing modules, instructors can pick and choose the ones that are most appropriate for their course. Each module features an overview of a measurement issue and a step-by-step application of that theory. Best practices provide recommendations for ensuring the appropriate application of the theory. Practical questions help students assess their understanding of the topic, while the examples allow them to apply the material using real data. Two cases in each module depict typical dilemmas faced when applying measurement theory, followed by Questions to Ponder to encourage critical examination of the issues noted in the cases. Each module contains exercises, some of which require no computer access while others involve the use of SPSS to solve the problem. The book's website houses the accompanying data sets and more. The book also features suggested readings, a glossary of the key terms, and a continuing exercise that incorporates many of the steps in the development of a measure of typical performance. Updated throughout to reflect recent changes in the field, the new edition also features: a new co-author, Michael Zickar, who updated the advanced topics and added the new module on generalizability theory (Module 22); expanded coverage of reliability (Modules 5 & 6) and exploratory and confirmatory factor analysis (Modules 18 & 19) to help readers interpret results presented in journal articles; and expanded web resources, where instructors will now find suggested answers to the book's questions and exercises, detailed worked solutions to the exercises, and PowerPoint slides, and where students and instructors can access the SPSS data sets, additional exercises, the glossary, and website references that are helpful in understanding psychometric concepts. Part 1 provides an introduction to measurement theory and specs for scaling and testing and a review of statistics. Part 2 then progresses through practical issues related to test reliability, validation, meta-analysis, and bias. Part 3 reviews practical issues related to test construction, such as the development of measures of maximal performance, CTT item analysis, test scoring, developing measures of typical performance, and issues related to response styles and guessing. The book concludes with advanced topics such as multiple regression, exploratory and confirmatory factor analysis, item response theory (IRT), IRT applications including computer adaptive testing and differential item functioning, and generalizability theory. Ideal as a text for any psychometrics, testing and measurement, or multivariate statistics course taught in psychology, education, marketing, and management; professional researchers in need of a quick refresher on applying measurement theory will also find this an invaluable reference.

196 citations


Book
27 Apr 2013
TL;DR: This book covers latent trait theory and latent class theory, comparative views of the two approaches, and application studies including criterion-referenced testing and adaptive testing.
Abstract: Introduction and Overview.
I. Latent Trait Theory: 1. Measurement Models for Ordered Response Categories. 2. Testing a Latent Trait Model. 3. Latent Trait Models with Indicators of Mixed Measurement Level.
II. Latent Class Theory: 4. New Developments in Latent Class Theory. 5. Log-Linear Modeling, Latent Class Analysis, or Correspondence Analysis: Which Method Should Be Used for the Analysis of Categorical Data? 6. A Latent Class Covariate Model with Applications to Criterion-Referenced Testing.
III. Comparative Views of Latent Traits and Latent Classes: 7. Test Theory with Qualitative and Quantitative Latent Variables. 8. Latent Class Models for Measuring. 9. Comparison of Latent Structure Models.
IV. Application Studies: 10. Latent Variable Techniques for Measuring Development. 11. Item Bias and Test Multidimensionality. 12. On a Rasch-Model-Based Test for Noncomputerized Adaptive Testing. 13. Systematizing the Item Content in Test Design.

190 citations


Journal ArticleDOI
TL;DR: The CPIB provides speech-language pathologists with a unidimensional, self-report outcomes measurement instrument dedicated to the construct of communicative participation and this instrument may be useful to clinicians and researchers wanting to implement measures of Communicative participation in their work.
Abstract: Purpose: The purpose of this study was to calibrate the items for the Communicative Participation Item Bank (CPIB; Baylor, Yorkston, Eadie, Miller, & Amtmann, 2009; Yorkston et al., 2008) using item response theory (IRT). One overriding objective was to examine whether the IRT item parameters would be consistent across different diagnostic groups, thereby allowing creation of a disorder-generic instrument. The intended outcomes were the final item bank and a short form ready for clinical and research applications. Method: Self-report data were collected from 701 individuals representing 4 diagnoses: multiple sclerosis, Parkinson's disease, amyotrophic lateral sclerosis, and head and neck cancer. Participants completed the CPIB and additional self-report questionnaires. CPIB data were analyzed using the IRT graded response model. Results: The initial set of 94 candidate CPIB items was reduced to an item bank of 46 items demonstrating unidimensionality, local independence, good item fit, and good measurement ...
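
The CPIB items were calibrated with the IRT graded response model named above. As a minimal illustration of how that model turns a trait estimate into category probabilities, here is a sketch with made-up parameter values (illustrative only, not the actual CPIB calibration):

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Samejima's graded response model for one polytomous item:
    returns the probability of each of K ordered categories given
    latent trait theta, discrimination a, and K-1 ordered thresholds b."""
    b = np.asarray(b, dtype=float)
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # P(X >= k), k = 1..K-1
    upper = np.concatenate(([1.0], p_star))           # P(X >= 0) = 1
    lower = np.concatenate((p_star, [0.0]))           # P(X >= K) = 0
    return upper - lower                              # category probabilities

# Hypothetical 4-category item evaluated at two trait levels
print(grm_category_probs(theta=0.5, a=1.8, b=[-1.0, 0.0, 1.2]))
print(grm_category_probs(theta=-1.0, a=1.8, b=[-1.0, 0.0, 1.2]))
```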

184 citations


Journal ArticleDOI
TL;DR: In this article, the authors provide an overview of goodness-of-fit assessment methods for item response theory (IRT) models, including root mean squared error of approximation (RMSEA), which makes it possible to test whether the model misfit is below a specific cutoff value.
Abstract: The article provides an overview of goodness-of-fit assessment methods for item response theory (IRT) models. It is now possible to obtain accurate p-values of the overall fit of the model if bivariate information statistics are used. Several alternative approaches are described. As the validity of inferences drawn on the fitted model depends on the magnitude of the misfit, if the model is rejected it is necessary to assess the goodness of approximation. With this aim in mind, a class of root mean squared error of approximation (RMSEA) indices is described, which makes it possible to test whether the model misfit is below a specific cutoff value. Also, regardless of the outcome of the overall goodness-of-fit assessment, a piece-wise assessment of fit should be performed to detect parts of the model whose fit can be improved. A number of statistics for this purpose are described, including a z statistic for residual means, a mean-and-variance correction to Pearson's X 2 statistic applied to each bivariate subtable...
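
To make the goodness-of-approximation idea concrete, an RMSEA-type index can be formed from any approximately chi-square-distributed fit statistic T with df degrees of freedom computed on a sample of size N; a generic formulation (the paper's own estimators may differ in detail) is

```latex
\widehat{\mathrm{RMSEA}} \;=\; \sqrt{\max\!\left(\frac{T - df}{N \cdot df},\; 0\right)},
```

and the hypothesis that the population RMSEA is at or below a chosen cutoff can be tested against a non-central chi-square reference distribution, which is what allows "close enough" fit to be tested rather than only exact fit.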

159 citations


Journal ArticleDOI
TL;DR: A structurally coherent set of items covering a range of important HPV knowledge was developed; responses indicated a reliable questionnaire, which allowed the fitting of an IRT model.

148 citations


Journal ArticleDOI
TL;DR: It is shown that when Thurstonian IRT modeling is applied, even existing forced-choice questionnaires with challenging features can be scored adequately and that the IRT-estimated scores are free from the problems of ipsative data.
Abstract: In multidimensional forced-choice (MFC) questionnaires, items measuring different attributes are presented in blocks, and participants have to rank order the items within each block (fully or partially). Such comparative formats can reduce the impact of numerous response biases often affecting single-stimulus items (aka rating or Likert scales). However, if scored with traditional methodology, MFC instruments produce ipsative data, whereby all individuals have a common total test score. Ipsative scoring distorts individual profiles (it is impossible to achieve all high or all low scale scores), construct validity (covariances between scales must sum to zero), criterion-related validity (validity coefficients must sum to zero), and reliability estimates. We argue that these problems are caused by inadequate scoring of forced-choice items and advocate the use of item response theory (IRT) models based on an appropriate response process for comparative data, such as Thurstone’s law of comparative judgment. We show that when Thurstonian IRT modeling is applied (Brown & Maydeu-Olivares, 2011), even existing forced-choice questionnaires with challenging features can be scored adequately and that the IRT-estimated scores are free from the problems of ipsative data.
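
To see why traditional scoring of forced-choice blocks is ipsative, consider the following sketch of a hypothetical questionnaire in which each block contains one item per scale and points equal to the within-block ranks are summed per scale (illustrative only, not the authors' data or scoring code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_blocks, block_size = 10, 3   # hypothetical design: 10 blocks of 3 items;
                               # column s holds the item measuring scale s

for person in range(3):
    # Arbitrary within-block rankings for this person: block_size points for
    # the most preferred item in a block, 1 point for the least preferred.
    rankings = np.array([rng.permutation(np.arange(1, block_size + 1))
                         for _ in range(n_blocks)])
    scale_scores = rankings.sum(axis=0)          # one score per scale
    print(person, scale_scores, "total =", scale_scores.sum())

# Every person's total is n_blocks * (1 + 2 + 3) = 60, regardless of their
# responses -- the constant-total property that distorts profiles and
# validity coefficients, and that Thurstonian IRT scoring avoids.
```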

Journal ArticleDOI
TL;DR: The cutoff scores, which link to descriptions of the specific movements a patient can, can partially, and cannot perform, may enable formation of heterogeneous patient groups, advance efforts to identify specific movement therapy targets, and define treatment response in terms of the specific movements that changed or did not change with therapy.

Journal ArticleDOI
TL;DR: The case for developing an international core set of PROs building from the US PROMIS network is described and the potential to transform PRO measurement by creating a shared, unifying terminology and metric for reporting of common symptoms and functional life domains is discussed.
Abstract: Patient-reported outcomes (PROs) play an increasingly important role in clinical practice and research. Modern psychometric methods such as item response theory (IRT) enable the creation of item banks that support fixed-length forms as well as computerized adaptive testing (CAT), often resulting in improved measurement precision and responsiveness. Here we describe and discuss the case for developing an international core set of PROs building from the US PROMIS® network.

PROMIS is a U.S.-based cooperative group of research sites and centers of excellence convened to develop and standardize PRO measures across studies and settings. If extended to a global collaboration, PROMIS has the potential to transform PRO measurement by creating a shared, unifying terminology and metric for reporting of common symptoms and functional life domains. Extending a common set of standardized PRO measures to the international community offers great potential for improving patient-centered research, clinical trials reporting, population monitoring, and health care worldwide. Benefits of such standardization include the possibility of: international syntheses (such as meta-analyses) of research findings; international population monitoring and policy development; health services administrators and planners access to relevant information on the populations they serve; better assessment and monitoring of patients by providers; and improved shared decision making.

The goal of the current PROMIS International initiative is to ensure that item banks are translated and culturally adapted for use in adults and children in as many countries as possible. The process includes 3 key steps: translation/cultural adaptation, calibration, and validation. A universal translation, an approach focusing on commonalities, rather than differences across versions developed in regions or countries speaking the same language, is proposed to ensure conceptual equivalence for all items. International item calibration using nationally representative samples of adults and children within countries is essential to demonstrate that all items possess expected strong measurement properties. Finally, it is important to demonstrate that the PROMIS measures are valid, reliable and responsive to change when used in an international context.

IRT item banking will allow for tailoring within countries and facilitate growth and evolution of PROs through contributions from the international measurement community. A number of opportunities and challenges of international development of PROs item banks are discussed.

Journal ArticleDOI
TL;DR: A dimension reduction method that can take advantage of the hierarchical factor structure so that the integrals can be approximated far more efficiently and a new test statistic that can be substantially better calibrated and more powerful than the original M2 statistic when the test is long and the items are polytomous is proposed.
Abstract: In applications of item response theory, assessment of model fit is a critical issue. Recently, limited-information goodness-of-fit testing has received increased attention in the psychometrics literature. In contrast to full-information test statistics such as Pearson's X² or the likelihood ratio G², these limited-information tests utilize lower-order marginal tables rather than the full contingency table. A notable example is Maydeu-Olivares and colleagues' M2 family of statistics based on univariate and bivariate margins. When the contingency table is sparse, tests based on M2 retain better Type I error rate control than the full-information tests and can be more powerful. While in principle the M2 statistic can be extended to test hierarchical multidimensional item factor models (e.g., bifactor and testlet models), the computation is non-trivial. To obtain M2, a researcher often has to obtain (many thousands of) marginal probabilities, derivatives, and weights. Each of these must be approximated with high-dimensional numerical integration. We propose a dimension reduction method that can take advantage of the hierarchical factor structure so that the integrals can be approximated far more efficiently. We also propose a new test statistic that can be substantially better calibrated and more powerful than the original M2 statistic when the test is long and the items are polytomous. We use simulations to demonstrate the performance of our new methods and illustrate their effectiveness with applications to real data.
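
Schematically, statistics in the M2 family are quadratic forms in the univariate and bivariate residuals (notation simplified here, not reproduced from this paper):

```latex
M_2 \;=\; N\,\hat{\mathbf{e}}_2^{\top}\,\hat{\mathbf{C}}_2\,\hat{\mathbf{e}}_2,
\qquad
\hat{\mathbf{e}}_2 \;=\; \mathbf{p}_2 - \boldsymbol{\pi}_2(\hat{\boldsymbol{\theta}}),
```

where p2 stacks the observed univariate and bivariate proportions, π2(θ̂) their model-implied counterparts at the parameter estimates, and Ĉ2 is a weight matrix chosen so that M2 is asymptotically chi-square distributed; the proposed dimension reduction targets the cost of computing π2 and its derivatives for hierarchical factor models.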

Journal ArticleDOI
TL;DR: The PROMIS assessment of self-reported fatigue in pediatrics includes two item banks: Tired and (Lack of) Energy, which demonstrated satisfactory psychometric properties and can be used for research settings.
Abstract: Purpose: This paper reports on the development and psychometric properties of self-reported pediatric fatigue item banks as part of the Patient-Reported Outcomes Measurement Information System (PROMIS).

Journal ArticleDOI
TL;DR: In this article, the authors address possible sources of confusion in interpreting trait scores from the bifactor model and compare the general trait score with a simple-structure model with correlated factors.
Abstract: This tutorial addresses possible sources of confusion in interpreting trait scores from the bifactor model. The bifactor model may be used when subscores are desired, either for formative feedback on an achievement test or for theoretically different constructs on a psychological test. The bifactor model is often chosen because it requires fewer computational resources than other models for subscores. The bifactor model yields a score on the general or primary trait measured by the test overall, as well as specific or secondary traits measured by the subscales. Interpreting the general trait score is straight-forward, but the specific traits must be interpreted as residuals relative to the general trait. Trait scores on the specific factors are contrasted with trait scores on a simple-structure model with correlated factors, using example data from one TIMSS test booklet and a civic responsibility measure. The correlated factors model was used for contrast because its scores correspond to a more intuitive...
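
The contrast the tutorial draws can be summarized by the loading structures of the two models; a schematic for six items forming two subscales (illustrative, not the TIMSS or civic responsibility analyses) is

```latex
\Lambda_{\text{bifactor}} =
\begin{pmatrix}
\lambda_{1g} & \lambda_{1s_1} & 0\\
\lambda_{2g} & \lambda_{2s_1} & 0\\
\lambda_{3g} & \lambda_{3s_1} & 0\\
\lambda_{4g} & 0 & \lambda_{4s_2}\\
\lambda_{5g} & 0 & \lambda_{5s_2}\\
\lambda_{6g} & 0 & \lambda_{6s_2}
\end{pmatrix},
\qquad
\Lambda_{\text{correlated}} =
\begin{pmatrix}
\lambda_{11} & 0\\
\lambda_{21} & 0\\
\lambda_{31} & 0\\
0 & \lambda_{42}\\
0 & \lambda_{52}\\
0 & \lambda_{62}
\end{pmatrix}.
```

In the bifactor model all factors are orthogonal, so each specific factor captures only what its items share beyond the general factor (hence the residual interpretation); in the correlated-factors model the two factors are allowed to covary and each factor score reflects its subscale's full content.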

Journal ArticleDOI
TL;DR: In this article, an improved version of the Lord's χ2 Wald test for comparing item response model parameter estimates between two groups is presented, which uses better approaches for computation of the covariance matrix and equating the item parameters across groups.
Abstract: Differential item functioning (DIF) occurs when the probability of responding in a particular category to an item differs for members of different groups who are matched on the construct being measured. The identification of DIF is important for valid measurement. This research evaluates an improved version of Lord’s χ2 Wald test for comparing item response model parameter estimates between two groups. The improved version uses better approaches for computation of the covariance matrix and equating the item parameters across groups. There are two equating algorithms implemented in IRTPro and flexMIRT software: Wald-1 (one-stage) and Wald-2 (two-stage), only one of which has been studied in simulations before. The present study provides the first evaluation of the Wald-1 algorithm, as well as of Wald-1 and Wald-2 applied to three groups simultaneously. A comparison to two-group IRT-LR-DIF is included. Results indicate that Wald-1 performs very well and is recommended, whereas Type I error is extremely inflated for Wald-2. Perfo...
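
In its classical form, Lord's Wald test compares an item's parameter estimates between a reference (R) and a focal (F) group after their scales have been linked (a general statement of the statistic, not of the improved covariance computation evaluated here):

```latex
\chi^2_{\mathrm{Lord}}
  \;=\;
  (\hat{\boldsymbol{\beta}}_R - \hat{\boldsymbol{\beta}}_F)^{\top}
  \bigl(\hat{\boldsymbol{\Sigma}}_R + \hat{\boldsymbol{\Sigma}}_F\bigr)^{-1}
  (\hat{\boldsymbol{\beta}}_R - \hat{\boldsymbol{\beta}}_F),
```

where β̂ collects the item's parameter estimates in each group and Σ̂ the corresponding covariance matrices; under the null hypothesis of no DIF the statistic is asymptotically chi-square with degrees of freedom equal to the number of parameters compared, which is why accurate covariance matrices and accurate linking are central to its Type I error behavior.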

Journal ArticleDOI
TL;DR: The Dark Triad Dirty Dozen is a personality inventory designed to measure individual differences in narcissism, psychopathy, and Machiavellianism in sub-clinical populations.

Journal ArticleDOI
Chia Yi Chiu1
TL;DR: Results show that the method, which compares residual sums of squares computed from the observed and the ideal item responses, can identify and correct misspecified entries in the Q-matrix, thereby improving its accuracy.
Abstract: Most methods for fitting cognitive diagnosis models to educational test data and assigning examinees to proficiency classes require the Q-matrix that associates each item in a test with the cogniti...
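
A minimal sketch of the residual-sum-of-squares comparison mentioned in the TL;DR, assuming DINA-style ideal responses and known attribute profiles for illustration (hypothetical data, not the author's exact algorithm):

```python
import numpy as np

def ideal_response(q_vector, alpha):
    """DINA-style ideal response: 1 if an examinee masters every attribute
    the candidate q-vector requires, else 0. alpha has one row per examinee."""
    return np.all(alpha >= q_vector, axis=1).astype(float)

def item_rss(observed_item, q_vector, alpha):
    """Residual sum of squares between observed responses to one item and
    the ideal responses implied by a candidate q-vector."""
    eta = ideal_response(q_vector, alpha)
    return float(np.sum((observed_item - eta) ** 2))

# Hypothetical example: 4 examinees, 2 attributes, one item
alpha = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])   # attribute mastery profiles
observed = np.array([1, 1, 0, 0])                    # responses to the item

# The candidate q-vector with the smallest RSS is retained for the Q-matrix
for q in ([1, 0], [0, 1], [1, 1]):
    print(q, item_rss(observed, np.array(q), alpha))
```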

Journal ArticleDOI
TL;DR: The results indicated that most of the ASQ-BR questionnaires had adequate internal consistency and are psychometrically sound developmental screening instruments that can be easily administered by primary caregivers.

Book
20 Dec 2013
TL;DR: This book covers the development, validation, and statistical analysis of patient-reported outcome (PRO) questionnaires, with simulated and real-life examples implemented in SAS, spanning reliability, factor analysis, item response theory, cross-sectional and longitudinal analysis, mediation models, missing data, and interpretation.
Abstract: Introduction Patient-Reported Outcomes in Perspective Patient-Reported Outcomes in Clinical Research Terms and Definitions Measurement Scales Psychometrics vs Clinimetrics Selection of a PRO Questionnaire Development of a Patient-Reported Outcome Population Item Generation Item Wording Cognitive Interviews Validity Content Validity Construct Validity Simulated Example Using SAS: Convergent and Divergent Validity Factors Affecting Response Reliability Intraclass Correlation Coefficient for Continuous Variables ICC Example ICC Simulated Example ICC in Context Bland and Altman Plot for Continuous Variables Simple Kappa and Weighted Kappa Coefficients for Categorical Variables Internal Consistency Reliability: Cronbach's Alpha Coefficient Simulated Example of Cronbach's Alpha Exploratory and Confirmatory Factor Analyses Exploratory Factor Analysis Confirmatory Factor Analysis Causal Indicators vs Effect Indicators Simulated Examples Using SAS: Exploratory Factor Analysis Simulated Examples Using SAS: Confirmatory Factor Analysis Real-Life Examples Item Response Theory Classical Test Theory Revisited Assumptions of IRT Item Characteristic Curves Item Information Item Fit and Person Fit Differential Item Functioning Sample Size Example Example: Rasch Model Implementation Cross-Sectional Analysis Types of PRO Data and Exploratory Methods Comparing Two or More Samples Regression Analysis Longitudinal Analysis Analytic Considerations Repeated Measures Model Random Coefficient Model Real-Life Examples Mediation Models Single Mediator Model Model Invariance Advanced Example Bootstrapping Methodology Implementation Missing Data Study Design to Minimize Missing Data Missing Data Patterns and Mechanisms Approaches for Missing Items within Domains or Measures Approaches for Missing Entire Domains or Entire Questionnaires Sensitivity Analyses Simulated Example Using SAS: Pattern Mixture Models Enriching Interpretation Anchor-Based Approaches Distribution-Based Approaches Multiple Testing Index A Summary and References appear at the end of each chapter
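
As a small illustration of one of the reliability quantities the book covers, here is Cronbach's alpha computed in Python rather than the book's SAS examples (hypothetical data):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a (n_respondents, n_items) matrix of item scores:
    alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_var_sum / total_var)

# Hypothetical 5-item Likert scale scored by 6 respondents
x = np.array([[3, 4, 3, 4, 4],
              [2, 2, 3, 2, 2],
              [4, 5, 4, 5, 4],
              [1, 2, 1, 2, 2],
              [3, 3, 4, 3, 3],
              [5, 4, 5, 5, 4]])
print(round(cronbach_alpha(x), 3))
```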

Journal ArticleDOI
TL;DR: The authors propose a method that evaluates the degree of item-level dimensionality and allows for the selection of subsets of items (i.e., short form) that result in scaled scores and standard errors that are equivalent to other multidimensional IRT-based scoring procedures.
Abstract: Test developers often need to create unidimensional scales from multidimensional data. For item analysis, marginal trace lines capture the relation with the general dimension while accounting for n...

Journal ArticleDOI
TL;DR: The MACH-IV was investigated with item response theory to elucidate its psychometric properties and to suggest a trimmed version, the MACH*; the core content of the MACH-IV seemed to be cynicism/misanthropy, and the MACH* was formed from the 5 most informative and precise MACH-IV items.
Abstract: The MACH-IV was investigated (N = 528) with item response theory to elucidate its psychometric properties and suggest a trimmed version, the MACH*. The core content of the MACH-IV seemed to be cynicism/misanthropy and the MACH* was formed from the 5 most informative and precise MACH-IV items. The MACH* showed good internal consistency and construct and criterion validity comparable to the MACH-IV. The MACH-IV and MACH* measure most precisely at average to above average levels of Machiavellianism. Implications for theory and measurement of Machiavellianism are discussed.
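
The "most informative and precise" items referred to above are selected via the IRT item information function; for orientation, the standard result for a dichotomous two-parameter logistic item (shown as a generic illustration, not as the MACH-IV's exact model) is

```latex
P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}},
\qquad
I_i(\theta) = a_i^{2}\, P_i(\theta)\,\bigl(1 - P_i(\theta)\bigr),
```

so more discriminating items contribute more information, concentrated near their location on the trait, and the test information (the sum of the item informations) determines where on the Machiavellianism continuum scores are most precise.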

Journal ArticleDOI
TL;DR: Overall findings provide evidence of the accuracy of the LOT–R and suggest possible modifications of the scale to improve the assessment of dispositional optimism.
Abstract: The accuracy of the Life Orientation Test–Revised (LOT–R) in measuring dispositional optimism was investigated applying item response theory (IRT). The study was conducted on a sample of 484 university students (62% males, M age = 22.79 years, SD = 5.63). After testing the 1-factor structure of the scale, IRT was applied to evaluate the functioning of the LOT–R along the pessimism–optimism continuum. Item parameter estimates and the test information function showed that each item and the global scale satisfactorily measured the latent trait. Referring to the IRT estimated trait levels, the validity of the LOT–R was studied examining the relationships between dispositional optimism and psychological well-being, sense of mastery, and sense of coherence. Overall findings based on IRT analyses provide evidence of the accuracy of the LOT–R and suggest possible modifications of the scale to improve the assessment of dispositional optimism.

Journal ArticleDOI
TL;DR: Evaluation of the psychometric properties and validity of an expanded set of community enfranchisement items that are suitable for computer adaptive testing suggested 2 distinct subsets of items: importance of participation and control over participation.

Journal ArticleDOI
TL;DR: The proposed 15-minute version of the Wechsler Adult Intelligence Scale–III may serve as a useful screening device for general intellectual ability in research or clinical settings, and is recommended when a quick and accurate IQ estimate is desired.
Abstract: Background. The potential inclusion of cognitive assessments in the DSM-V and large time-consuming assessments drive a need for short tests of cognitive impairments. We examined the reliability and validity of a brief, 15-minute, version of the Wechsler Adult Intelligence Scale–III (WAIS-III). Methods. The sample consisted of patients diagnosed with schizophrenia (n=75), their siblings without schizophrenia (n=74) and unrelated healthy controls (n=84). A short WAIS-III consists of the Digit Symbol Coding subtest, and every second (or third) item of Block Design, Information, and Arithmetic. Psychometric analyses were implemented using item-response theory (IRT) to determine the best minimal item short version, while maintaining the sensitivity and reliability of the IQ score. Results. The proposed 15-minute WAIS-III gave reliable estimates of the Full Scale IQ (FSIQ) in all three groups in the sample. The 15-minute (select-item) version yielded an overall R of .95 (R² = .92) and IRT yielded an R of .96 (R² = .92). ...

Journal ArticleDOI
TL;DR: Item response theory (IRT), also known as latent trait theory, is used for the development, evaluation, and administration of standardized measurements; it is widely used in the areas of psychology and education.
Abstract: Item response theory (IRT), also known as latent trait theory, is used for the development, evaluation, and administration of standardized measurements; it is widely used in the areas of psychology and education. This theory has been developed and expanded over more than 50 years and has contributed to the development of measurement scales for latent traits. This paper presents the basic and fundamental concepts of IRT, and a practical example of the construction of a scale is proposed to illustrate the feasibility, advantages, and validity of IRT through a known measurement: height. The results obtained with the practical application of IRT confirm its effectiveness in the evaluation of latent traits.
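
As a toy version of the kind of scale construction the paper illustrates, the following sketch scores a respondent on a latent trait from dichotomous items under the Rasch model, using hypothetical item difficulties rather than the paper's height example:

```python
import numpy as np

def rasch_prob(theta, b):
    """Rasch model: probability of a positive response at trait level theta
    for items with difficulties b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def ml_theta(responses, b, grid=None):
    """Crude maximum-likelihood trait estimate found over a grid of theta values."""
    if grid is None:
        grid = np.linspace(-4, 4, 801)
    loglik = [np.sum(responses * np.log(rasch_prob(t, b)) +
                     (1 - responses) * np.log(1 - rasch_prob(t, b)))
              for t in grid]
    return grid[int(np.argmax(loglik))]

b = np.array([-1.5, -0.5, 0.0, 0.8, 1.6])   # hypothetical item difficulties
responses = np.array([1, 1, 1, 0, 0])       # one respondent's answers
print(ml_theta(responses, b))               # trait estimate on the same scale as b
```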

Journal ArticleDOI
Abstract: Changing the order of items between alternate test forms to prevent copying and to enhance test security is a common practice in achievement testing. However, these changes in item order may affect item and test characteristics. Several procedures have been proposed for studying these item-order effects. The present study explores the use of descriptive and explanatory models from item response theory for detecting and modeling these effects in a one-step procedure. The framework also allows for consideration of the impact of individual differences in position effect on item difficulty. A simulation was conducted to investigate the impact of a position effect on parameter recovery in a Rasch model. As an illustration, the framework was applied to a listening comprehension test for French as a foreign language and to data from the PISA 2006 assessment.

In achievement testing, administering the same set of items in different orders is a common strategy to prevent copying and to enhance test security. These item-order manipulations across alternate test forms, however, may not be without consequence. After the early work of Mollenkopf (1950), it repeatedly has been shown that changes in the placement of items may have unintended effects on test and item characteristics (Leary & Dorans, 1985). Traditionally, two kinds of item-position effects have been discerned (Kingston & Dorans, 1984): a practice or a learning effect occurs when the items become easier in later positions, and a fatigue effect occurs when items become more difficult if placed towards the end of the test. Recent empirical studies on the effect of item position include Hohensinn et al. (2008), Meyers, Miller, and Way (2009), Moses, Yang and Wilson (2007), Pommerich and Harris (2003), and Schweizer, Schreiner and Gold (2009).

In the present article, item-position effects will be studied within De Boeck and Wilson’s (2004) framework of descriptive and explanatory item response models. It will be argued that modeling item-position effects across alternate test forms can be considered as a special case of differential item functioning (DIF). Apart from the DIF approach, the linear logistic test model of Fischer (1973) and its random-weights extension (Rijmen & De Boeck, 2002) will be used to investigate the effect of item position on individual item parameters and to model the trend of item-position effects across items. A new feature of the approach is that individual differences in the effects of item position on difficulty can be taken into account.

In the following pages we first will present a brief overview of current approaches to studying the impact of item position on test scores and item characteristics. We then present the proposed item response theory (IRT) framework used for modeling item-position effects. After demonstrating the impact of a position effect on parameter recovery with simulated data, the framework is applied to a listening
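
One way to write down the DIF-style position model sketched above, consistent with the LLTM framework the authors use (the notation here is a simplified paraphrase, not copied from the article), is a Rasch model whose item difficulty shifts with the item's position in the administered form:

```latex
\operatorname{logit}\Pr(X_{pi} = 1)
  \;=\; \theta_p \;-\; \bigl(\beta_i + \delta \cdot \mathrm{pos}(i, f_p)\bigr),
```

where θp is the ability of person p, βi the baseline difficulty of item i, pos(i, f_p) the position of item i in the form administered to person p, and δ a fixed position effect; the random-weights extension replaces δ with a person-specific δp, which is how individual differences in the position effect can be modeled.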

Journal Article
TL;DR: A systematic review of the methodology for person fit research targeted specifically at methodologists in training can be found in this paper, where the authors analyze the ways in which researchers in the area of person fit have conducted simulation studies for parametric and nonparametric unidimensional IRT models.
Abstract: This paper is a systematic review of the methodology for person fit research targeted specifically at methodologists in training. I analyze the ways in which researchers in the area of person fit have conducted simulation studies for parametric and nonparametric unidimensional IRT models since the seminal review paper by Meijer and Sijtsma (2001). I specifically review how researchers have operationalized different types of aberrant responding for particular testing conditions in order to compare these simulation design characteristics with features of the real-life testing situations for which person fit analyses are officially reported. I discuss the alignment between the theoretical and practical work and the implications for future simulation work and guidelines for best practice.

Key words: Person fit, systematic review, aberrant responding, item response theory, simulation study, generalizability, experimental design.

This paper is situated in the conceptual space of research on person fit, which is one aspect of the comprehensive enterprise of critiquing the alignment of the structure of a particular statistical model with a particular data set using residual-based statistics (Engelhard Jr., 2009). I first analyze the ways in which researchers in the area of person fit have conducted simulation studies in non-parametric (e.g., Sijtsma & Molenaar, 2002; van der Ark, Hemker, & Sijtsma, 2002) and parametric unidimensional item response theory (IRT) (e.g., De Ayala, 2009; Yen & Fitzpatrick, 2006) since the seminal review paper by Meijer and Sijtsma (2001). I then discuss the alignment between the theoretical and practical work and the implications for future simulation work and guidelines for best practice.

This paper is primarily intended for methodologists in training but should also prove useful for practitioners who are curious about the statistical foundations for proposed guidelines of best practice. The information in this paper may be of less interest for the relatively few specialists who are already conducting advanced simulation studies in this area. However, it should provide some useful insight into the ways these researchers conduct their work for the many other researchers and practitioners who want to be critical consumers of this work.

Simulation studies are designed statistical experiments that can provide reliable scientific evidence about the performance of statistical methods. As noted concisely by Cook and Teo (2011):

In evaluating methodologies, simulation studies: (i) provide a cost-effective way to quantify potential performance for a large range of scenarios, spanning different combinations of sample sizes and underlying parameters, (ii) allow average performance to be estimated under repeat Monte Carlo sampling and (iii) facilitate comparison of estimates against the "true" system underlying the simulations, none of which is really achievable via genuine applications, as gratifying as those are. (p. 1)

In the context of person fit research, simulation studies are most commonly used to quantify the frequency of type-I and type-II errors and associated power rates under a variety of test design and model misspecification conditions.

Researchers who publish in this area clearly make some concerted and thoughtful efforts to summarize findings from simulation studies, especially when they are trying to situate their particular theoretical work within a relevant part of the literature.
Thus, I initially started out writing this paper as a more "traditional" review paper that focused on what researchers had learned about person fit in roughly the last 10 years. However, while reviewing the recent body of work it became quickly clear that there is perhaps a more urgent need to discuss the methodology of simulation research with more scrutiny in order to help methodologists in training understand the kinds of generalizations that can and cannot be made based on this work. …
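
For readers new to the area, the most commonly studied parametric person-fit index is the standardized log-likelihood statistic lz; a minimal sketch for dichotomous items under a two-parameter logistic model, with illustrative parameters not tied to any study in the review, is:

```python
import numpy as np

def lz_statistic(responses, theta, a, b):
    """Standardized log-likelihood person-fit statistic for dichotomous IRT:
    (observed log-likelihood - its expectation) / its standard deviation.
    Large negative values flag response patterns that are unlikely under
    the fitted model (possible aberrant responding)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    l0 = np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    expectation = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    variance = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)
    return (l0 - expectation) / np.sqrt(variance)

# Hypothetical 6-item test: the examinee misses the easy items but answers
# the harder ones correctly, a classic aberrant pattern.
a = np.array([1.2, 0.8, 1.5, 1.0, 1.3, 0.9])
b = np.array([-1.5, -1.0, -0.5, 0.5, 1.0, 1.5])
responses = np.array([0, 0, 0, 1, 1, 1])
print(lz_statistic(responses, theta=0.0, a=a, b=b))
```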

01 Jan 2013
TL;DR: Patrick Meyer et al. describe a framework for maintaining test security and preventing one form of cheating in online assessments, and introduce item response theory, scale linking, and score equating to demonstrate the way these methods can produce fair and equitable test scores.
Abstract: Email meyerjp@virginia.edu Abstract Massive open online courses (MOOCs) are playing an increasingly important role in higher education around the world, but despite their popularity, the measurement of student learning in these courses is hampered by cheating and other problems that lead to unfair evaluation of student learning. In this paper, we describe a framework for maintaining test security and preventing one form of cheating in online assessments. We also introduce readers to item response theory, scale linking, and score equating to demonstrate the way these methods can produce fair and equitable test scores. Patrick Meyer is an Assistant Professor in the Curry School of Education at the University of Virginia. He is the inventor of jMetrik, an open source psychometric software program. Shi Zhu is a doctoral student in the Research, Statistics, and Evaluation program in the Curry School of Education. He holds a Ph.D. in History from Nanjing University in China.