
Showing papers on "Item response theory" published in 2013


Journal ArticleDOI
TL;DR: In this article, an argument-based approach to validate an interpretation or use of test scores is proposed, where the claims based on the test scores are outlined as an argument that specifies the inferences and supporting assumptions needed to get from test responses to score-based interpretations and uses.
Abstract: To validate an interpretation or use of test scores is to evaluate the plausibility of the claims based on the scores. An argument-based approach to validation suggests that the claims based on the test scores be outlined as an argument that specifies the inferences and supporting assumptions needed to get from test responses to score-based interpretations and uses. Validation then can be thought of as an evaluation of the coherence and completeness of this interpretation/use argument and of the plausibility of its inferences and assumptions. In outlining the argument-based approach to validation, this paper makes eight general points. First, it is the proposed score interpretations and uses that are validated and not the test or the test scores. Second, the validity of a proposed interpretation or use depends on how well the evidence supports the claims being made. Third, more-ambitious claims require more support than less-ambitious claims. Fourth, more-ambitious claims (e.g., construct interpretations) tend to be more useful than less-ambitious claims, but they are also harder to validate. Fifth, interpretations and uses can change over time in response to new needs and new understandings leading to changes in the evidence needed for validation. Sixth, the evaluation of score uses requires an evaluation of the consequences of the proposed uses; negative consequences can render a score use unacceptable. Seventh, the rejection of a score use does not necessarily invalidate a prior, underlying score interpretation. Eighth, the validation of the score interpretation on which a score use is based does not validate the score use.

1,300 citations


Journal ArticleDOI
TL;DR: The HLQ covers 9 conceptually distinct areas of health literacy to assess the needs and challenges of a wide range of people and organisations and is likely to be useful in surveys, intervention evaluation, and studies of theneeds and capabilities of individuals.
Abstract: Health literacy has become an increasingly important concept in public health. We sought to develop a comprehensive measure of health literacy capable of diagnosing health literacy needs across individuals and organisations by utilizing perspectives from the general population, patients, practitioners and policymakers. Using a validity-driven approach we undertook grounded consultations (workshops and interviews) to identify broad conceptually distinct domains. Questionnaire items were developed directly from the consultation data following a strict process aiming to capture the full range of experiences of people currently engaged in healthcare through to people in the general population. Psychometric analyses included confirmatory factor analysis (CFA) and item response theory. Cognitive interviews were used to ensure questions were understood as intended. Items were initially tested in a calibration sample from community health, home care and hospital settings (N=634) and then in a replication sample (N=405) comprising recent emergency department attendees. Initially 91 items were generated across 6 scales with agree/disagree response options and 5 scales with difficulty-in-undertaking-tasks response options. Cognitive testing revealed that most items were well understood and only some minor re-wording was required. Psychometric testing of the calibration sample identified 34 poorly performing or conceptually redundant items, which were removed, resulting in 10 scales. These were then tested in the replication sample and refined to yield 9 final scales comprising 44 items. A 9-factor CFA model was fitted to these items with no cross-loadings or correlated residuals allowed. Given the very restricted nature of the model, the fit was quite satisfactory: χ²(WLSMV, 866 d.f.) = 2927, p < 0.000, CFI = 0.936, TLI = 0.930, RMSEA = 0.076, and WRMR = 1.698. Final scales included: Feeling understood and supported by healthcare providers; Having sufficient information to manage my health; Actively managing my health; Social support for health; Appraisal of health information; Ability to actively engage with healthcare providers; Navigating the healthcare system; Ability to find good health information; and Understand health information well enough to know what to do. The HLQ covers 9 conceptually distinct areas of health literacy to assess the needs and challenges of a wide range of people and organisations. Given the validity-driven approach, the HLQ is likely to be useful in surveys, intervention evaluation, and studies of the needs and capabilities of individuals.

794 citations


Book ChapterDOI
04 Jul 2013
TL;DR: A new methodology that is capable of diagnosing cognitive errors and analyzing different methods for solving problems will be introduced, illustrated with fraction subtraction problems.
Abstract: Finding the sources of misconceptions possessed by students is a difficult task because it is impossible to see what is happening in their heads. The only directly observable outcomes are the students' responses to the test items. Studying their think-aloud protocols is one method for discovering how students solve or think through problems. Several computer programs that are capable of diagnosing students' misconceptions have been developed in the past decade (Brown & Burton, 1978; Marshall, 1980; Ohlsson & Langley, 1985; Sleeman, 1984; Tatsuoka, Baillie, & Yamamoto, 1982; VanLehn, 1983). The common ground for these cognitive diagnostic systems is that they infer the unobservable cognitive processes from logical interrelationships among cognitive tasks, subtasks, and goals involved in problems representing the domain of interest. It is important that we be able to retrieve invisible things from the "black box" and put them into a useful form so that valuable information can be obtained for improving educational quality.

307 citations



Journal ArticleDOI
TL;DR: A unidimensional Barratt Impulsiveness Scale-Brief is introduced that includes 8 of the original BIS-11 items and will reduce the burden on respondents without loss of information in clinical assessment settings and large epidemiological studies of psychiatric disorders.
Abstract: The Barratt Impulsivity Scale (BIS), a 30-item self-report measure, is one of the most commonly used scales for the assessment of the personality construct of impulsiveness. It has recently marked 50 years of use in research and clinical settings. The current BIS-11 is held to measure 3 theoretical subtraits, namely, attentional, motor, and non-planning impulsiveness. We evaluated the factor structure of the BIS using full information item bifactor analysis for Likert-type items. We found no evidence supporting the 3-factor model. In fact, half of the items do not share any relation with other items and do not form any factor. In light of this, we introduce a unidimensional Barratt Impulsiveness Scale-Brief (BIS-Brief) that includes 8 of the original BIS-11 items. Next, we present evidence of construct validity comparing scores obtained with the BIS-Brief against the original BIS total scores using data from (a) a community sample of borderline personality patients and normal controls, (b) a forensic sample, and (c) an inpatient sample of young adults and adolescents. The BIS-Brief score demonstrated indices of construct validity similar to those observed for the BIS-11 total score. Use of the BIS-Brief in clinical assessment settings and large epidemiological studies of psychiatric disorders will reduce the burden on respondents without loss of information.

251 citations


Book
14 Nov 2013
TL;DR: In this article, the authors present a text of self-standing modules for any psychometrics, testing and measurement, or multivariate statistics course taught in psychology, education, marketing, or management; professional researchers in need of a quick refresher on applying measurement theory will also find it an invaluable reference.
Abstract: This book helps readers apply testing and measurement theories. Featuring 22 self-standing modules, instructors can pick and choose the ones that are most appropriate for their course. Each module features an overview of a measurement issue and a step-by-step application of that theory. Best practices provide recommendations for ensuring the appropriate application of the theory. Practical questions help students assess their understanding of the topic, while the examples allow them to apply the material using real data. Two cases in each module depict typical dilemmas faced when applying measurement theory, followed by Questions to Ponder to encourage critical examination of the issues noted in the cases. Each module contains exercises, some of which require no computer access while others involve the use of SPSS to solve the problem. The book's website houses the accompanying data sets and more. The book also features suggested readings, a glossary of the key terms, and a continuing exercise that incorporates many of the steps in the development of a measure of typical performance. Updated throughout to reflect recent changes in the field, the new edition also features: a new co-author, Michael Zickar, who updated the advanced topics and added the new module on generalizability theory (Module 22); expanded coverage of reliability (Modules 5 & 6) and exploratory and confirmatory factor analysis (Modules 18 & 19) to help readers interpret results presented in journal articles; and expanded web resources, where instructors will now find suggested answers to the book's questions and exercises, detailed worked solutions to the exercises, and PowerPoint slides, and where students and instructors can access the SPSS data sets, additional exercises, the glossary, and website references that are helpful in understanding psychometric concepts. Part 1 provides an introduction to measurement theory and specs for scaling and testing and a review of statistics. Part 2 then progresses through practical issues related to test reliability, validation, meta-analysis, and bias. Part 3 reviews practical issues related to test construction, such as the development of measures of maximal performance, CTT item analysis, test scoring, developing measures of typical performance, and issues related to response styles and guessing. The book concludes with advanced topics such as multiple regression, exploratory and confirmatory factor analysis, item response theory (IRT), IRT applications including computer adaptive testing and differential item functioning, and generalizability theory. Ideal as a text for any psychometrics, testing and measurement, or multivariate statistics course taught in psychology, education, marketing, and management; professional researchers in need of a quick refresher on applying measurement theory will also find this an invaluable reference.

196 citations


Book
27 Apr 2013
TL;DR: This book covers latent trait theory and latent class theory, comparative views of the two approaches, and application studies including criterion-referenced testing and adaptive testing.
Abstract: Introduction and Overview.
I. Latent Trait Theory: 1. Measurement Models for Ordered Response Categories. 2. Testing a Latent Trait Model. 3. Latent Trait Models with Indicators of Mixed Measurement Level.
II. Latent Class Theory: 4. New Developments in Latent Class Theory. 5. Log-Linear Modeling, Latent Class Analysis, or Correspondence Analysis: Which Method Should Be Used for the Analysis of Categorical Data? 6. A Latent Class Covariate Model with Applications to Criterion-Referenced Testing.
III. Comparative Views of Latent Traits and Latent Classes: 7. Test Theory with Qualitative and Quantitative Latent Variables. 8. Latent Class Models for Measuring. 9. Comparison of Latent Structure Models.
IV. Application Studies: 10. Latent Variable Techniques for Measuring Development. 11. Item Bias and Test Multidimensionality. 12. On a Rasch-Model-Based Test for Noncomputerized Adaptive Testing. 13. Systematizing the Item Content in Test Design.

190 citations


Journal ArticleDOI
TL;DR: The CPIB provides speech-language pathologists with a unidimensional, self-report outcomes measurement instrument dedicated to the construct of communicative participation and this instrument may be useful to clinicians and researchers wanting to implement measures of Communicative participation in their work.
Abstract: Purpose: The purpose of this study was to calibrate the items for the Communicative Participation Item Bank (CPIB; Baylor, Yorkston, Eadie, Miller, & Amtmann, 2009; Yorkston et al., 2008) using item response theory (IRT). One overriding objective was to examine whether the IRT item parameters would be consistent across different diagnostic groups, thereby allowing creation of a disorder-generic instrument. The intended outcomes were the final item bank and a short form ready for clinical and research applications. Method: Self-report data were collected from 701 individuals representing 4 diagnoses: multiple sclerosis, Parkinson's disease, amyotrophic lateral sclerosis, and head and neck cancer. Participants completed the CPIB and additional self-report questionnaires. CPIB data were analyzed using the IRT graded response model. Results: The initial set of 94 candidate CPIB items was reduced to an item bank of 46 items demonstrating unidimensionality, local independence, good item fit, and good measurement ...
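
The CPIB items were calibrated with the IRT graded response model named above. As a minimal illustration of how that model turns a trait estimate into category probabilities, here is a sketch with made-up parameter values (illustrative only, not the actual CPIB calibration):

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Samejima's graded response model for one polytomous item:
    returns the probability of each of K ordered categories given
    latent trait theta, discrimination a, and K-1 ordered thresholds b."""
    b = np.asarray(b, dtype=float)
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # P(X >= k), k = 1..K-1
    upper = np.concatenate(([1.0], p_star))           # P(X >= 0) = 1
    lower = np.concatenate((p_star, [0.0]))           # P(X >= K) = 0
    return upper - lower                              # category probabilities

# Hypothetical 4-category item evaluated at two trait levels
print(grm_category_probs(theta=0.5, a=1.8, b=[-1.0, 0.0, 1.2]))
print(grm_category_probs(theta=-1.0, a=1.8, b=[-1.0, 0.0, 1.2]))
```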

184 citations


Journal ArticleDOI
TL;DR: In this article, the authors provide an overview of goodness-of-fit assessment methods for item response theory (IRT) models, including root mean squared error of approximation (RMSEA), which makes it possible to test whether the model misfit is below a specific cutoff value.
Abstract: The article provides an overview of goodness-of-fit assessment methods for item response theory (IRT) models. It is now possible to obtain accurate p-values of the overall fit of the model if bivariate information statistics are used. Several alternative approaches are described. As the validity of inferences drawn on the fitted model depends on the magnitude of the misfit, if the model is rejected it is necessary to assess the goodness of approximation. With this aim in mind, a class of root mean squared error of approximation (RMSEA) indices is described, which makes it possible to test whether the model misfit is below a specific cutoff value. Also, regardless of the outcome of the overall goodness-of-fit assessment, a piece-wise assessment of fit should be performed to detect parts of the model whose fit can be improved. A number of statistics for this purpose are described, including a z statistic for residual means, a mean-and-variance correction to Pearson's X 2 statistic applied to each bivariate subtable...
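
To make the goodness-of-approximation idea concrete, an RMSEA-type index can be formed from any approximately chi-square-distributed fit statistic T with df degrees of freedom computed on a sample of size N; a generic formulation (the paper's own estimators may differ in detail) is

```latex
\widehat{\mathrm{RMSEA}} \;=\; \sqrt{\max\!\left(\frac{T - df}{N \cdot df},\; 0\right)},
```

and the hypothesis that the population RMSEA is at or below a chosen cutoff can be tested against a non-central chi-square reference distribution, which is what allows "close enough" fit to be tested rather than only exact fit.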

159 citations


Journal ArticleDOI
TL;DR: A structurally coherent set of items covering a range of important HPV knowledge was developed; responses indicated a reliable questionnaire, which allowed the fitting of an IRT model.

148 citations


Journal ArticleDOI
TL;DR: It is shown that when Thurstonian IRT modeling is applied, even existing forced-choice questionnaires with challenging features can be scored adequately and that the IRT-estimated scores are free from the problems of ipsative data.
Abstract: In multidimensional forced-choice (MFC) questionnaires, items measuring different attributes are presented in blocks, and participants have to rank order the items within each block (fully or partially). Such comparative formats can reduce the impact of numerous response biases often affecting single-stimulus items (aka rating or Likert scales). However, if scored with traditional methodology, MFC instruments produce ipsative data, whereby all individuals have a common total test score. Ipsative scoring distorts individual profiles (it is impossible to achieve all high or all low scale scores), construct validity (covariances between scales must sum to zero), criterion-related validity (validity coefficients must sum to zero), and reliability estimates. We argue that these problems are caused by inadequate scoring of forced-choice items and advocate the use of item response theory (IRT) models based on an appropriate response process for comparative data, such as Thurstone’s law of comparative judgment. We show that when Thurstonian IRT modeling is applied (Brown & Maydeu-Olivares, 2011), even existing forced-choice questionnaires with challenging features can be scored adequately and that the IRT-estimated scores are free from the problems of ipsative data.
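
To see why traditional scoring of forced-choice blocks is ipsative, consider the following sketch of a hypothetical questionnaire in which each block contains one item per scale and points equal to the within-block ranks are summed per scale (illustrative only, not the authors' data or scoring code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_blocks, block_size = 10, 3   # hypothetical design: 10 blocks of 3 items;
                               # column s holds the item measuring scale s

for person in range(3):
    # Arbitrary within-block rankings for this person: block_size points for
    # the most preferred item in a block, 1 point for the least preferred.
    rankings = np.array([rng.permutation(np.arange(1, block_size + 1))
                         for _ in range(n_blocks)])
    scale_scores = rankings.sum(axis=0)          # one score per scale
    print(person, scale_scores, "total =", scale_scores.sum())

# Every person's total is n_blocks * (1 + 2 + 3) = 60, regardless of their
# responses -- the constant-total property that distorts profiles and
# validity coefficients, and that Thurstonian IRT scoring avoids.
```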

Journal ArticleDOI
TL;DR: The cutoff scores, which link to descriptions of the specific movements a patient can, can partially, and cannot perform, may enable formation of heterogeneous patient groups, advance efforts to identify specific movement therapy targets, and define treatment response in terms of the specific movements that changed or did not change with therapy.

Journal ArticleDOI
TL;DR: The case for developing an international core set of PROs building from the US PROMIS network is described and the potential to transform PRO measurement by creating a shared, unifying terminology and metric for reporting of common symptoms and functional life domains is discussed.
Abstract: Patient-reported outcomes (PROs) play an increasingly important role in clinical practice and research. Modern psychometric methods such as item response theory (IRT) enable the creation of item banks that support fixed-length forms as well as computerized adaptive testing (CAT), often resulting in improved measurement precision and responsiveness. Here we describe and discuss the case for developing an international core set of PROs building from the US PROMIS® network.

PROMIS is a U.S.-based cooperative group of research sites and centers of excellence convened to develop and standardize PRO measures across studies and settings. If extended to a global collaboration, PROMIS has the potential to transform PRO measurement by creating a shared, unifying terminology and metric for reporting of common symptoms and functional life domains. Extending a common set of standardized PRO measures to the international community offers great potential for improving patient-centered research, clinical trials reporting, population monitoring, and health care worldwide. Benefits of such standardization include the possibility of: international syntheses (such as meta-analyses) of research findings; international population monitoring and policy development; health services administrators and planners access to relevant information on the populations they serve; better assessment and monitoring of patients by providers; and improved shared decision making.

The goal of the current PROMIS International initiative is to ensure that item banks are translated and culturally adapted for use in adults and children in as many countries as possible. The process includes 3 key steps: translation/cultural adaptation, calibration, and validation. A universal translation, an approach focusing on commonalities, rather than differences across versions developed in regions or countries speaking the same language, is proposed to ensure conceptual equivalence for all items. International item calibration using nationally representative samples of adults and children within countries is essential to demonstrate that all items possess expected strong measurement properties. Finally, it is important to demonstrate that the PROMIS measures are valid, reliable and responsive to change when used in an international context.

IRT item banking will allow for tailoring within countries and facilitate growth and evolution of PROs through contributions from the international measurement community. A number of opportunities and challenges of international development of PROs item banks are discussed.

Journal ArticleDOI
TL;DR: A dimension reduction method that can take advantage of the hierarchical factor structure so that the integrals can be approximated far more efficiently and a new test statistic that can be substantially better calibrated and more powerful than the original M2 statistic when the test is long and the items are polytomous is proposed.
Abstract: In applications of item response theory, assessment of model fit is a critical issue. Recently, limited-information goodness-of-fit testing has received increased attention in the psychometrics literature. In contrast to full-information test statistics such as Pearson's X² or the likelihood ratio G², these limited-information tests utilize lower-order marginal tables rather than the full contingency table. A notable example is Maydeu-Olivares and colleagues' M2 family of statistics based on univariate and bivariate margins. When the contingency table is sparse, tests based on M2 retain better Type I error rate control than the full-information tests and can be more powerful. While in principle the M2 statistic can be extended to test hierarchical multidimensional item factor models (e.g., bifactor and testlet models), the computation is non-trivial. To obtain M2, a researcher often has to obtain (many thousands of) marginal probabilities, derivatives, and weights. Each of these must be approximated with high-dimensional numerical integration. We propose a dimension reduction method that can take advantage of the hierarchical factor structure so that the integrals can be approximated far more efficiently. We also propose a new test statistic that can be substantially better calibrated and more powerful than the original M2 statistic when the test is long and the items are polytomous. We use simulations to demonstrate the performance of our new methods and illustrate their effectiveness with applications to real data.
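
Schematically, statistics in the M2 family are quadratic forms in the univariate and bivariate residuals (notation simplified here, not reproduced from this paper):

```latex
M_2 \;=\; N\,\hat{\mathbf{e}}_2^{\top}\,\hat{\mathbf{C}}_2\,\hat{\mathbf{e}}_2,
\qquad
\hat{\mathbf{e}}_2 \;=\; \mathbf{p}_2 - \boldsymbol{\pi}_2(\hat{\boldsymbol{\theta}}),
```

where p2 stacks the observed univariate and bivariate proportions, π2(θ̂) their model-implied counterparts at the parameter estimates, and Ĉ2 is a weight matrix chosen so that M2 is asymptotically chi-square distributed; the proposed dimension reduction targets the cost of computing π2 and its derivatives for hierarchical factor models.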

Journal ArticleDOI
TL;DR: The PROMIS assessment of self-reported fatigue in pediatrics includes two item banks: Tired and (Lack of) Energy, which demonstrated satisfactory psychometric properties and can be used for research settings.
Abstract: Purpose: This paper reports on the development and psychometric properties of self-reported pediatric fatigue item banks as part of the Patient-Reported Outcomes Measurement Information System (PROMIS).

Journal ArticleDOI
TL;DR: In this article, the authors address possible sources of confusion in interpreting trait scores from the bifactor model and compare the general trait score with a simple-structure model with correlated factors.
Abstract: This tutorial addresses possible sources of confusion in interpreting trait scores from the bifactor model. The bifactor model may be used when subscores are desired, either for formative feedback on an achievement test or for theoretically different constructs on a psychological test. The bifactor model is often chosen because it requires fewer computational resources than other models for subscores. The bifactor model yields a score on the general or primary trait measured by the test overall, as well as specific or secondary traits measured by the subscales. Interpreting the general trait score is straight-forward, but the specific traits must be interpreted as residuals relative to the general trait. Trait scores on the specific factors are contrasted with trait scores on a simple-structure model with correlated factors, using example data from one TIMSS test booklet and a civic responsibility measure. The correlated factors model was used for contrast because its scores correspond to a more intuitive...
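
The contrast the tutorial draws can be summarized by the loading structures of the two models; a schematic for six items forming two subscales (illustrative, not the TIMSS or civic responsibility analyses) is

```latex
\Lambda_{\text{bifactor}} =
\begin{pmatrix}
\lambda_{1g} & \lambda_{1s_1} & 0\\
\lambda_{2g} & \lambda_{2s_1} & 0\\
\lambda_{3g} & \lambda_{3s_1} & 0\\
\lambda_{4g} & 0 & \lambda_{4s_2}\\
\lambda_{5g} & 0 & \lambda_{5s_2}\\
\lambda_{6g} & 0 & \lambda_{6s_2}
\end{pmatrix},
\qquad
\Lambda_{\text{correlated}} =
\begin{pmatrix}
\lambda_{11} & 0\\
\lambda_{21} & 0\\
\lambda_{31} & 0\\
0 & \lambda_{42}\\
0 & \lambda_{52}\\
0 & \lambda_{62}
\end{pmatrix}.
```

In the bifactor model all factors are orthogonal, so each specific factor captures only what its items share beyond the general factor (hence the residual interpretation); in the correlated-factors model the two factors are allowed to covary and each factor score reflects its subscale's full content.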

Journal ArticleDOI
TL;DR: In this article, an improved version of the Lord's χ2 Wald test for comparing item response model parameter estimates between two groups is presented, which uses better approaches for computation of the covariance matrix and equating the item parameters across groups.
Abstract: Differential item functioning (DIF) occurs when the probability of responding in a particular category to an item differs for members of different groups who are matched on the construct being measured. The identification of DIF is important for valid measurement. This research evaluates an improved version of Lord’s χ2 Wald test for comparing item response model parameter estimates between two groups. The improved version uses better approaches for computation of the covariance matrix and equating the item parameters across groups. There are two equating algorithms implemented in IRTPro and flexMIRT software: Wald-1 (one-stage) and Wald-2 (two-stage), only one of which has been studied in simulations before. The present study provides the first evaluation of the Wald-1 algorithm, as well as of Wald-1 and Wald-2 applied to three groups simultaneously. A comparison to two-group IRT-LR-DIF is included. Results indicate that Wald-1 performs very well and is recommended, whereas Type I error is extremely inflated for Wald-2. Perfo...
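
In its classical form, Lord's Wald test compares an item's parameter estimates between a reference (R) and a focal (F) group after their scales have been linked (a general statement of the statistic, not of the improved covariance computation evaluated here):

```latex
\chi^2_{\mathrm{Lord}}
  \;=\;
  (\hat{\boldsymbol{\beta}}_R - \hat{\boldsymbol{\beta}}_F)^{\top}
  \bigl(\hat{\boldsymbol{\Sigma}}_R + \hat{\boldsymbol{\Sigma}}_F\bigr)^{-1}
  (\hat{\boldsymbol{\beta}}_R - \hat{\boldsymbol{\beta}}_F),
```

where β̂ collects the item's parameter estimates in each group and Σ̂ the corresponding covariance matrices; under the null hypothesis of no DIF the statistic is asymptotically chi-square with degrees of freedom equal to the number of parameters compared, which is why accurate covariance matrices and accurate linking are central to its Type I error behavior.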

Journal ArticleDOI
TL;DR: The Dark Triad Dirty Dozen is a personality inventory designed to measure individual differences in narcissism, psychopathy, and Machiavellianism in sub-clinical populations.

Journal ArticleDOI
Chia Yi Chiu1
TL;DR: Results show that the method, which compares residual sums of squares computed from the observed and the ideal item responses, can identify and correct misspecified entries in the Q-matrix, thereby improving its accuracy.
Abstract: Most methods for fitting cognitive diagnosis models to educational test data and assigning examinees to proficiency classes require the Q-matrix that associates each item in a test with the cogniti...
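
A minimal sketch of the residual-sum-of-squares comparison mentioned in the TL;DR, assuming DINA-style ideal responses and known attribute profiles for illustration (hypothetical data, not the author's exact algorithm):

```python
import numpy as np

def ideal_response(q_vector, alpha):
    """DINA-style ideal response: 1 if an examinee masters every attribute
    the candidate q-vector requires, else 0. alpha has one row per examinee."""
    return np.all(alpha >= q_vector, axis=1).astype(float)

def item_rss(observed_item, q_vector, alpha):
    """Residual sum of squares between observed responses to one item and
    the ideal responses implied by a candidate q-vector."""
    eta = ideal_response(q_vector, alpha)
    return float(np.sum((observed_item - eta) ** 2))

# Hypothetical example: 4 examinees, 2 attributes, one item
alpha = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])   # attribute mastery profiles
observed = np.array([1, 1, 0, 0])                    # responses to the item

# The candidate q-vector with the smallest RSS is retained for the Q-matrix
for q in ([1, 0], [0, 1], [1, 1]):
    print(q, item_rss(observed, np.array(q), alpha))
```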

Journal ArticleDOI
TL;DR: The results indicated that most of the ASQ-BR questionnaires had adequate internal consistency and are psychometrically sound developmental screening instruments that can be easily administered by primary caregivers.

Book
20 Dec 2013
TL;DR: This book covers the development, validation, and statistical analysis of patient-reported outcome (PRO) questionnaires, with simulated and real-life examples implemented in SAS, spanning reliability, factor analysis, item response theory, cross-sectional and longitudinal analysis, mediation models, missing data, and interpretation.
Abstract: Introduction Patient-Reported Outcomes in Perspective Patient-Reported Outcomes in Clinical Research Terms and Definitions Measurement Scales Psychometrics vs Clinimetrics Selection of a PRO Questionnaire Development of a Patient-Reported Outcome Population Item Generation Item Wording Cognitive Interviews Validity Content Validity Construct Validity Simulated Example Using SAS: Convergent and Divergent Validity Factors Affecting Response Reliability Intraclass Correlation Coefficient for Continuous Variables ICC Example ICC Simulated Example ICC in Context Bland and Altman Plot for Continuous Variables Simple Kappa and Weighted Kappa Coefficients for Categorical Variables Internal Consistency Reliability: Cronbach's Alpha Coefficient Simulated Example of Cronbach's Alpha Exploratory and Confirmatory Factor Analyses Exploratory Factor Analysis Confirmatory Factor Analysis Causal Indicators vs Effect Indicators Simulated Examples Using SAS: Exploratory Factor Analysis Simulated Examples Using SAS: Confirmatory Factor Analysis Real-Life Examples Item Response Theory Classical Test Theory Revisited Assumptions of IRT Item Characteristic Curves Item Information Item Fit and Person Fit Differential Item Functioning Sample Size Example Example: Rasch Model Implementation Cross-Sectional Analysis Types of PRO Data and Exploratory Methods Comparing Two or More Samples Regression Analysis Longitudinal Analysis Analytic Considerations Repeated Measures Model Random Coefficient Model Real-Life Examples Mediation Models Single Mediator Model Model Invariance Advanced Example Bootstrapping Methodology Implementation Missing Data Study Design to Minimize Missing Data Missing Data Patterns and Mechanisms Approaches for Missing Items within Domains or Measures Approaches for Missing Entire Domains or Entire Questionnaires Sensitivity Analyses Simulated Example Using SAS: Pattern Mixture Models Enriching Interpretation Anchor-Based Approaches Distribution-Based Approaches Multiple Testing Index A Summary and References appear at the end of each chapter
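
As a small illustration of one of the reliability quantities the book covers, here is Cronbach's alpha computed in Python rather than the book's SAS examples (hypothetical data):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a (n_respondents, n_items) matrix of item scores:
    alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_var_sum / total_var)

# Hypothetical 5-item Likert scale scored by 6 respondents
x = np.array([[3, 4, 3, 4, 4],
              [2, 2, 3, 2, 2],
              [4, 5, 4, 5, 4],
              [1, 2, 1, 2, 2],
              [3, 3, 4, 3, 3],
              [5, 4, 5, 5, 4]])
print(round(cronbach_alpha(x), 3))
```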

Journal ArticleDOI
TL;DR: The authors propose a method that evaluates the degree of item-level dimensionality and allows for the selection of subsets of items (i.e., short form) that result in scaled scores and standard errors that are equivalent to other multidimensional IRT-based scoring procedures.
Abstract: Test developers often need to create unidimensional scales from multidimensional data. For item analysis, marginal trace lines capture the relation with the general dimension while accounting for n...

Journal ArticleDOI
TL;DR: The MACH-IV was investigated with item response theory to elucidate its psychometric properties and to suggest a trimmed version, the MACH*; the core content of the MACH-IV seemed to be cynicism/misanthropy, and the MACH* was formed from the 5 most informative and precise MACH-IV items.
Abstract: The MACH-IV was investigated (N = 528) with item response theory to elucidate its psychometric properties and suggest a trimmed version, the MACH*. The core content of the MACH-IV seemed to be cynicism/misanthropy and the MACH* was formed from the 5 most informative and precise MACH-IV items. The MACH* showed good internal consistency and construct and criterion validity comparable to the MACH-IV. The MACH-IV and MACH* measure most precisely at average to above average levels of Machiavellianism. Implications for theory and measurement of Machiavellianism are discussed.
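
The "most informative and precise" items referred to above are selected via the IRT item information function; for orientation, the standard result for a dichotomous two-parameter logistic item (shown as a generic illustration, not as the MACH-IV's exact model) is

```latex
P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}},
\qquad
I_i(\theta) = a_i^{2}\, P_i(\theta)\,\bigl(1 - P_i(\theta)\bigr),
```

so more discriminating items contribute more information, concentrated near their location on the trait, and the test information (the sum of the item informations) determines where on the Machiavellianism continuum scores are most precise.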

Journal ArticleDOI
TL;DR: Overall findings provide evidence of the accuracy of the LOT–R and suggest possible modifications of the scale to improve the assessment of dispositional optimism.
Abstract: The accuracy of the Life Orientation Test–Revised (LOT–R) in measuring dispositional optimism was investigated applying item response theory (IRT). The study was conducted on a sample of 484 university students (62% males, M age = 22.79 years, SD = 5.63). After testing the 1-factor structure of the scale, IRT was applied to evaluate the functioning of the LOT–R along the pessimism–optimism continuum. Item parameter estimates and the test information function showed that each item and the global scale satisfactorily measured the latent trait. Referring to the IRT estimated trait levels, the validity of the LOT–R was studied examining the relationships between dispositional optimism and psychological well-being, sense of mastery, and sense of coherence. Overall findings based on IRT analyses provide evidence of the accuracy of the LOT–R and suggest possible modifications of the scale to improve the assessment of dispositional optimism.

Journal ArticleDOI
TL;DR: Evaluation of the psychometric properties and validity of an expanded set of community enfranchisement items that are suitable for computer adaptive testing suggested 2 distinct subsets of items: importance of participation and control over participation.

Journal ArticleDOI
TL;DR: The proposed 15-minute version of the Wechsler Adult Intelligence Scale–III may serve as a useful screening device for general intellectual ability in research or clinical settings, and is recommended when a quick and accurate IQ estimate is desired.
Abstract: Background. The potential inclusion of cognitive assessments in the DSM-V and large time-consuming assessments drive a need for short tests of cognitive impairments. We examined the reliability and validity of a brief, 15-minute, version of the Wechsler Adult Intelligence Scale–III (WAIS-III). Methods. The sample consisted of patients diagnosed with schizophrenia (n=75), their siblings without schizophrenia (n=74) and unrelated healthy controls (n=84). A short WAIS-III consists of the Digit Symbol Coding subtest, and every second (or third) item of Block Design, Information, and Arithmetic. Psychometric analyses were implemented using item-response theory (IRT) to determine the best minimal item short version, while maintaining the sensitivity and reliability of the IQ score. Results. The proposed 15-minute WAIS-III gave reliable estimates of the Full Scale IQ (FSIQ) in all three groups in the sample. The 15-minute (select-item) version yielded an overall R of .95 (R² = .92) and IRT yielded an R of .96 (R² = .92). ...

Journal ArticleDOI
TL;DR: Item response theory (IRT), also known as latent trait theory, is used for the development, evaluation, and administration of standardized measurements; it is widely used in the areas of psychology and education.
Abstract: Item response theory (IRT), also known as latent trait theory, is used for the development, evaluation, and administration of standardized measurements; it is widely used in the areas of psychology and education. This theory has been developed and expanded over more than 50 years and has contributed to the development of measurement scales for latent traits. This paper presents the basic and fundamental concepts of IRT, and a practical example of the construction of a scale is proposed to illustrate the feasibility, advantages, and validity of IRT through a known measurement: height. The results obtained with the practical application of IRT confirm its effectiveness in the evaluation of latent traits.
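
As a toy version of the kind of scale construction the paper illustrates, the following sketch scores a respondent on a latent trait from dichotomous items under the Rasch model, using hypothetical item difficulties rather than the paper's height example:

```python
import numpy as np

def rasch_prob(theta, b):
    """Rasch model: probability of a positive response at trait level theta
    for items with difficulties b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def ml_theta(responses, b, grid=None):
    """Crude maximum-likelihood trait estimate found over a grid of theta values."""
    if grid is None:
        grid = np.linspace(-4, 4, 801)
    loglik = [np.sum(responses * np.log(rasch_prob(t, b)) +
                     (1 - responses) * np.log(1 - rasch_prob(t, b)))
              for t in grid]
    return grid[int(np.argmax(loglik))]

b = np.array([-1.5, -0.5, 0.0, 0.8, 1.6])   # hypothetical item difficulties
responses = np.array([1, 1, 1, 0, 0])       # one respondent's answers
print(ml_theta(responses, b))               # trait estimate on the same scale as b
```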

Journal ArticleDOI
Abstract: Changing the order of items between alternate test forms to prevent copying and to enhance test security is a common practice in achievement testing. However, these changes in item order may affect item and test characteristics. Several procedures have been proposed for studying these item-order effects. The present study explores the use of descriptive and explanatory models from item response theory for detecting and modeling these effects in a one-step procedure. The framework also allows for consideration of the impact of individual differences in position effect on item difficulty. A simulation was conducted to investigate the impact of a position effect on parameter recovery in a Rasch model. As an illustration, the framework was applied to a listening comprehension test for French as a foreign language and to data from the PISA 2006 assessment.

In achievement testing, administering the same set of items in different orders is a common strategy to prevent copying and to enhance test security. These item-order manipulations across alternate test forms, however, may not be without consequence. After the early work of Mollenkopf (1950), it repeatedly has been shown that changes in the placement of items may have unintended effects on test and item characteristics (Leary & Dorans, 1985). Traditionally, two kinds of item-position effects have been discerned (Kingston & Dorans, 1984): a practice or a learning effect occurs when the items become easier in later positions, and a fatigue effect occurs when items become more difficult if placed towards the end of the test. Recent empirical studies on the effect of item position include Hohensinn et al. (2008), Meyers, Miller, and Way (2009), Moses, Yang and Wilson (2007), Pommerich and Harris (2003), and Schweizer, Schreiner and Gold (2009).

In the present article, item-position effects will be studied within De Boeck and Wilson’s (2004) framework of descriptive and explanatory item response models. It will be argued that modeling item-position effects across alternate test forms can be considered as a special case of differential item functioning (DIF). Apart from the DIF approach, the linear logistic test model of Fischer (1973) and its random-weights extension (Rijmen & De Boeck, 2002) will be used to investigate the effect of item position on individual item parameters and to model the trend of item-position effects across items. A new feature of the approach is that individual differences in the effects of item position on difficulty can be taken into account.

In the following pages we first will present a brief overview of current approaches to studying the impact of item position on test scores and item characteristics. We then present the proposed item response theory (IRT) framework used for modeling item-position effects. After demonstrating the impact of a position effect on parameter recovery with simulated data, the framework is applied to a listening
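
One way to write down the DIF-style position model sketched above, consistent with the LLTM framework the authors use (the notation here is a simplified paraphrase, not copied from the article), is a Rasch model whose item difficulty shifts with the item's position in the administered form:

```latex
\operatorname{logit}\Pr(X_{pi} = 1)
  \;=\; \theta_p \;-\; \bigl(\beta_i + \delta \cdot \mathrm{pos}(i, f_p)\bigr),
```

where θp is the ability of person p, βi the baseline difficulty of item i, pos(i, f_p) the position of item i in the form administered to person p, and δ a fixed position effect; the random-weights extension replaces δ with a person-specific δp, which is how individual differences in the position effect can be modeled.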

Journal Article
TL;DR: A systematic review of the methodology for person fit research targeted specifically at methodologists in training can be found in this paper, where the authors analyze the ways in which researchers in the area of person fit have conducted simulation studies for parametric and nonparametric unidimensional IRT models.
Abstract: This paper is a systematic review of the methodology for person fit research targeted specifically at methodologists in training. I analyze the ways in which researchers in the area of person fit have conducted simulation studies for parametric and nonparametric unidimensional IRT models since the seminal review paper by Meijer and Sijtsma (2001). I specifically review how researchers have operationalized different types of aberrant responding for particular testing conditions in order to compare these simulation design characteristics with features of the real-life testing situations for which person fit analyses are officially reported. I discuss the alignment between the theoretical and practical work and the implications for future simulation work and guidelines for best practice.

Key words: Person fit, systematic review, aberrant responding, item response theory, simulation study, generalizability, experimental design.

This paper is situated in the conceptual space of research on person fit, which is one aspect of the comprehensive enterprise of critiquing the alignment of the structure of a particular statistical model with a particular data set using residual-based statistics (Engelhard Jr., 2009). I first analyze the ways in which researchers in the area of person fit have conducted simulation studies in non-parametric (e.g., Sijtsma & Molenaar, 2002; van der Ark, Hemker, & Sijtsma, 2002) and parametric unidimensional item response theory (IRT) (e.g., De Ayala, 2009; Yen & Fitzpatrick, 2006) since the seminal review paper by Meijer and Sijtsma (2001). I then discuss the alignment between the theoretical and practical work and the implications for future simulation work and guidelines for best practice.

This paper is primarily intended for methodologists in training but should also prove useful for practitioners who are curious about the statistical foundations for proposed guidelines of best practice. The information in this paper may be of less interest for the relatively few specialists who are already conducting advanced simulation studies in this area. However, it should provide some useful insight into the ways these researchers conduct their work for the many other researchers and practitioners who want to be critical consumers of this work.

Simulation studies are designed statistical experiments that can provide reliable scientific evidence about the performance of statistical methods. As noted concisely by Cook and Teo (2011):

In evaluating methodologies, simulation studies: (i) provide a cost-effective way to quantify potential performance for a large range of scenarios, spanning different combinations of sample sizes and underlying parameters, (ii) allow average performance to be estimated under repeat Monte Carlo sampling and (iii) facilitate comparison of estimates against the "true" system underlying the simulations, none of which is really achievable via genuine applications, as gratifying as those are. (p. 1)

In the context of person fit research, simulation studies are most commonly used to quantify the frequency of type-I and type-II errors and associated power rates under a variety of test design and model misspecification conditions.

Researchers who publish in this area clearly make some concerted and thoughtful efforts to summarize findings from simulation studies, especially when they are trying to situate their particular theoretical work within a relevant part of the literature.
Thus, I initially started out writing this paper as a more "traditional" review paper that focused on what researchers had learned about person fit in roughly the last 10 years. However, while reviewing the recent body of work it became quickly clear that there is perhaps a more urgent need to discuss the methodology of simulation research with more scrutiny in order to help methodologists in training understand the kinds of generalizations that can and cannot be made based on this work. …
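
For readers new to the area, the most commonly studied parametric person-fit index is the standardized log-likelihood statistic lz; a minimal sketch for dichotomous items under a two-parameter logistic model, with illustrative parameters not tied to any study in the review, is:

```python
import numpy as np

def lz_statistic(responses, theta, a, b):
    """Standardized log-likelihood person-fit statistic for dichotomous IRT:
    (observed log-likelihood - its expectation) / its standard deviation.
    Large negative values flag response patterns that are unlikely under
    the fitted model (possible aberrant responding)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    l0 = np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    expectation = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    variance = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)
    return (l0 - expectation) / np.sqrt(variance)

# Hypothetical 6-item test: the examinee misses the easy items but answers
# the harder ones correctly, a classic aberrant pattern.
a = np.array([1.2, 0.8, 1.5, 1.0, 1.3, 0.9])
b = np.array([-1.5, -1.0, -0.5, 0.5, 1.0, 1.5])
responses = np.array([0, 0, 0, 1, 1, 1])
print(lz_statistic(responses, theta=0.0, a=a, b=b))
```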

01 Jan 2013
TL;DR: Patrick Meyer et al. describe a framework for maintaining test security and preventing one form of cheating in online assessments, and introduce item response theory, scale linking, and score equating to demonstrate the way these methods can produce fair and equitable test scores.
Abstract: Email meyerjp@virginia.edu Abstract Massive open online courses (MOOCs) are playing an increasingly important role in higher education around the world, but despite their popularity, the measurement of student learning in these courses is hampered by cheating and other problems that lead to unfair evaluation of student learning. In this paper, we describe a framework for maintaining test security and preventing one form of cheating in online assessments. We also introduce readers to item response theory, scale linking, and score equating to demonstrate the way these methods can produce fair and equitable test scores. Patrick Meyer is an Assistant Professor in the Curry School of Education at the University of Virginia. He is the inventor of jMetrik, an open source psychometric software program. Shi Zhu is a doctoral student in the Research, Statistics, and Evaluation program in the Curry School of Education. He holds a Ph.D. in History from Nanjing University in China.