Showing papers on "Item response theory published in 2007"


Book
01 Jan 2007
TL;DR: This second edition is a comprehensive guide to assessing quality of life and other patient-reported outcomes in clinical trials, covering questionnaire development and validation, psychometric and item response theory methods, trial design, analysis, and clinical interpretation.
Abstract: Preface to the first edition. Preface to the second edition. List of abbreviations. Part A. Introduction. 1. Introduction. 1.1 Patient-reported outcomes? 1.2 What is quality of life? 1.3 Historical development. 1.4 Why measure quality of life? 1.5 Which clinical trials should assess quality of life? 1.6 How to measure quality of life. 1.7 Instruments. 1.8 Conclusions. 2. Principles of measurement scales. 2.1 Introduction. 2.2 Scales and items. 2.3 Constructs and latent variables. 2.4 Indicator variables and causal variables. 2.5 Single global questions versus multi-item scales. 2.6 Single-item versus multi-item scales. 2.7 Psychometrics and item response theory. 2.8 Psychometric versus clinimetric scales. 2.9 Sufficient causes and necessary causes. 2.10 Discriminative, evaluative and predictive instruments. 2.11 Measuring quality of life: indicator or causal items? 2.12 Conclusions. Part B. Developing and Testing Questionnaires. 3. Developing a questionnaire. 3.1 Introduction. 3.2 General issues. 3.3 Defining the target population. 3.4 Item generation. 3.5 Qualitative methods. 3.6 Forming scales. 3.7 Multi-item scales. 3.8 Wording of questions. 3.9 Face and content validity of the proposed questionnaire. 3.10 Pre-testing the questionnaire. 3.11 Strategies for validation. 3.12 Translation. 3.13 Field testing. 3.14 Conclusions. 3.15 Further reading. 4. Scores and measurements: validity, reliability, sensitivity. 4.1 Introduction. 4.2 Content validity. 4.3 Criterion validity. 4.4 Construct validity. 4.5 Reliability. 4.6 Sensitivity and responsiveness. 4.7 Conclusions. 5. Multi-item scales. 5.1 Introduction. 5.2 Significance tests. 5.3 Correlations. 5.4 Construct validity. 5.5 Cronbach's α and internal consistency. 5.6 Implications for causal items. 5.7 Conclusions. 6. Factor analysis and structural equation modelling. 6.1 Introduction. 6.2 Correlation patterns. 6.3 Path diagrams. 6.4 Factor analysis. 6.5 Factor analysis of the HADS questionnaire. 6.6 Uses of factor analysis. 6.7 Applying factor analysis: choices and decisions. 6.8 Assumptions for factor analysis. 6.9 Factor analysis in QoL research. 6.10 Limitations of correlation-based analysis. 6.11 Causal models. 6.12 Confirmatory factor analysis and structural equation modelling. 6.13 Conclusions. 6.14 Further reading and software. 7. Item response theory and differential item functioning. 7.1 Introduction. 7.2 Item characteristic curves. 7.3 Logistic models. 7.4 Fitting item response theory models: tips. 7.5 Test design. 7.6 IRT versus traditional and Guttman scales. 7.7 Polytomous item response theory models. 7.8 Differential item functioning. 7.9 Quantifying differential item functioning. 7.10 Exploring differential item functioning: tips. 7.11 Conclusions. 7.12 Further reading and software. 8. Item banks, item listing and computer-adaptive tests. 8.1 Introduction. 8.2 Item bank. 8.3 Item calibration. 8.4 Item linking and test equating. 8.5 Test information. 8.6 Computer-adaptive testing. 8.7 Stopping rules and simulations. 8.8 Computer-adaptive testing software. 8.9 Unresolved issues. 8.10 Computer-assisted tests. 8.11 Conclusions. 8.12 Further reading. Part C. Clinical Trials. 9. Choosing and scoring questionnaires. 9.1 Introduction. 9.2 Generic versus specific. 9.3 Finding instruments. 9.4 Choice of instrument. 9.5 Adding ad-hoc items. 9.6 Scoring multi-item scales. 9.7 Conclusions. 9.8 Further reading. 10. Clinical trials. 10.1 Introduction. 10.2 Basic design issues. 10.3 Compliance.
10.4 Administering a quality-of-life assessment. 10.5 Recommendations for writing protocols. 10.6 Standard operating procedures. 10.7 Summary and checklist. 11. Sample sizes. 11.1 Introduction. 11.2 Significance tests, p-values and power. 11.3 Estimating sample size. 11.4 Comparing two groups. 11.5 Comparison with a reference population. 11.6 Equivalence studies. 11.7 Choice of sample size method. 11.8 Multiple endpoints. 11.9 Specifying the target difference. 11.10 Sample size estimation is pre-study. 11.11 Attrition. 11.12 Conclusion. 11.13 Further reading. Part D. Analysis of QoL Data. 12. Cross-sectional analysis. 12.1 Types of data. 12.2 Comparing two groups. 12.3 Adjusting for covariates. 12.4 Changes from baseline. 12.5 Analysis of variance. 12.6 Analysis of variance models. 12.7 Graphical summaries. 12.8 Endpoints. 12.9 Conclusions. 13. Exploring longitudinal data. 13.1 Area under the curve. 13.2 Graphical presentations. 13.3 Tabular presentations. 13.4 Reporting. 13.5 Conclusions. 14. Modelling longitudinal data. 14.1 Preliminaries. 14.2 Auto-correlation. 14.3 Repeated measures. 14.4 Other situations. 14.5 Modelling versus area under the curve. 14.6 Conclusions. 15. Missing data. 15.1 Introduction. 15.2 Types of missing data. 15.3 Why do missing data matter? 15.4 Missing items. 15.5 Methods for missing items within a form. 15.6 Missing forms. 15.7 Methods for missing forms. 15.8 Comments. 15.9 Degrees of freedom. 15.10 Sensitivity analysis. 15.11 Conclusions. 15.12 Further reading. 16. Practical and reporting issues. 16.1 Introduction. 16.2 The reporting of design issues. 16.3 Data analysis. 16.4 Elements of good graphics. 16.5 Some errors. 16.6 Guidelines for reporting. 16.7 Further reading. Part E. Beyond Clinical Trials. 17. Quality-adjusted survival. 17.1 Introduction. 17.2 Preferences and utilities. 17.3 Multi-attribute utility measures. 17.4 Utility-based instruments. 17.5 Quality-adjusted life years. 17.6 Q-TWiST. 17.7 Sensitivity analysis. 17.8 Prognosis and variation with time. 17.9 Healthy-years equivalent. 17.10 Conclusions. 18. Clinical interpretation. 18.1 Introduction. 18.2 Statistical significance. 18.3 Absolute levels and changes over time. 18.4 Threshold values: percentages. 18.5 Population norms. 18.6 Minimal clinically important difference. 18.7 Impact of state of quality of life. 18.8 Changes in relation to life events. 18.9 Effect size. 18.10 Effect sizes and meta-analysis. 18.11 Patient variability. 18.12 Number needed to treat. 18.13 Conclusions. 18.14 Further reading. 19. Meta-analysis. 19.1 Introduction. 19.2 Defining objectives. 19.3 Defining outcomes. 19.4 Literature searching. 19.5 Assessing quality. 19.6 Summarising results. 19.7 Measures of treatment effect. 19.8 Combining studies. 19.9 Forest plot. 19.10 Heterogeneity. 19.11 Publication bias and funnel plots. 19.12 Conclusions. 19.13 Further reading. Appendix Examples of Instruments. Generic instruments. E1 Sickness Impact Profile (SIP). E2 Nottingham Health Profile (NHP). E3 Health Survey Standard Version (SF-36v2). E4 EuroQoL (EQ-5D). E5 Patient Generated Index (PGI). Disease-specific instruments. E6 European Organisation for Research and Treatment of Cancer (EORTC QLQ-C30). E7 EORTC Head and Neck Module (EORTC H&N35). E8 Functional Assessment of Cancer - General version (FACT-G). E9 Rotterdam Symptom Checklist (RSCL). E10 Quality of Life in Epilepsy (QOLIE-89). E11 Paediatric Asthma Quality of Life Questionnaire (PAQLQ). Domain-specific instruments.
E12 Hospital Anxiety and Depression Scale (HADS). E13 Short Form McGill Pain Questionnaire (SF-MPQ). E14 Multidimensional Fatigue Inventory (MFI-20). ADL and disability. E15 Barthel Index of disability (modified) (BI). Statistical tables. T1 Normal distribution. T2 Normal distribution - percentage points. T3 t-distribution. T4 χ2 distribution. T5 F-distribution. References. Index.

1,743 citations


Journal ArticleDOI
TL;DR: Compared with the MAT and the DAS, the CSI scales were shown to have higher precision of measurement and correspondingly greater power for detecting differences in levels of satisfaction, and demonstrated convergent and construct validity suggesting that they assess the same theoretical construct as do prior scales.
Abstract: The present study took a critical look at a central construct in couples research: relationship satisfaction. Eight well-validated self-report measures of relationship satisfaction, including the Marital Adjustment Test (MAT; H. J. Locke & K. M. Wallace, 1959), the Dyadic Adjustment Scale (DAS; G. B. Spanier, 1976), and an additional 75 potential satisfaction items, were given to 5,315 online participants. Using item response theory, the authors demonstrated that the MAT and DAS provided relatively poor levels of precision in assessing satisfaction, particularly given the length of those scales. Principal-components analysis and item response theory applied to the larger item pool were used to develop the Couples Satisfaction Index (CSI) scales. Compared with the MAT and the DAS, the CSI scales were shown to have higher precision of measurement (less noise) and correspondingly greater power for detecting differences in levels of satisfaction. The CSI scales demonstrated strong convergent validity with other measures of satisfaction and excellent construct validity with anchor scales from the nomological net surrounding satisfaction, suggesting that they assess the same theoretical construct as do prior scales. Implications for research are discussed.

1,304 citations


Journal ArticleDOI
TL;DR: An overview of the methods used in the PROMIS item analyses and the proposed calibration of item banks is provided, and recommendations are given for future evaluations of item banks in HRQOL assessment.
Abstract: Background: The construction and evaluation of item banks to measure unidimensional constructs of health-related quality of life (HRQOL) is a fundamental objective of the Patient-Reported Outcomes Measurement Information System (PROMIS) project. Objectives: Item banks will be used as the foundation for developing short-form instruments and enabling computerized adaptive testing. The PROMIS Steering Committee selected 5 HRQOL domains for initial focus: physical functioning, fatigue, pain, emotional distress, and social role participation. This report provides an overview of the methods used in the PROMIS item analyses and proposed calibration of item banks. Analyses: Analyses include evaluation of data quality (eg, logic and range checking, spread of response distribution within an item), descriptive statistics (eg, frequencies, means), item response theory model assumptions (unidimensionality, local independence, monotonicity), model fit, differential item functioning, and item calibration for banking. Recommendations: Summarized are key analytic issues; recommendations are provided for future evaluations of item banks in HRQOL assessment.

1,251 citations


Journal ArticleDOI
TL;DR: The R package ltm has been developed for the analysis of multivariate dichotomous and polytomous data using latent variable models, under the Item Response Theory approach.
Abstract: The R package ltm has been developed for the analysis of multivariate dichotomous and polytomous data using latent variable models, under the Item Response Theory approach. For dichotomous data the Rasch, the Two-Parameter Logistic, and Birnbaum's Three-Parameter models have been implemented, whereas for polytomous data Samejima's Graded Response model is available. Parameter estimates are obtained under marginal maximum likelihood using the Gauss-Hermite quadrature rule. The capabilities and features of the package are illustrated using two real data examples.
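To make the package interface concrete, the following minimal sketch shows how the models named in the abstract are typically fitted with ltm; the fitting functions (rasch, ltm, tpm, grm) belong to the package, but the data objects `responses` and `ordinal_responses` are hypothetical placeholders for user-supplied item-score matrices.

```r
# Hedged usage sketch for the ltm package (placeholder data objects).
library(ltm)

# `responses`: data frame of dichotomous (0/1) item scores (placeholder).
fit_rasch <- rasch(responses)          # Rasch model
fit_2pl   <- ltm(responses ~ z1)       # Two-Parameter Logistic model
fit_3pl   <- tpm(responses)            # Birnbaum's Three-Parameter model

# `ordinal_responses`: data frame of ordered polytomous scores (placeholder).
fit_grm   <- grm(ordinal_responses)    # Samejima's Graded Response model

summary(fit_2pl)   # item discrimination and difficulty estimates (marginal ML)
plot(fit_2pl)      # item characteristic curves
```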

818 citations


Journal ArticleDOI
TL;DR: The authors provide a targeted review and synthesis of the item factor analysis (IFA) estimation literature for ordered-categorical data with specific attention paid to the problems of estimating models with many items and many factors.
Abstract: The rationale underlying factor analysis applies to continuous and categorical variables alike; however, the models and estimation methods for continuous (i.e., interval or ratio scale) data are not appropriate for item-level data that are categorical in nature. The authors provide a targeted review and synthesis of the item factor analysis (IFA) estimation literature for ordered-categorical data (e.g., Likert-type response scales) with specific attention paid to the problems of estimating models with many items and many factors. Popular IFA models and estimation methods found in the structural equation modeling and item response theory literatures are presented. Following this presentation, recent developments in the estimation of IFA parameters (e.g., Markov chain Monte Carlo) are discussed. The authors conclude with considerations for future research on IFA, simulated examples, and advice for applied researchers.
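For readers unfamiliar with the notation, the ordered-categorical item factor analysis model discussed in this review is commonly written in its underlying-variable form; the symbols below are generic textbook notation rather than notation taken from the article itself.

```latex
% Underlying-variable form of item factor analysis for ordered categories
% (generic notation): a continuous latent response is cut by item thresholds.
y_{ij}^{*} = \boldsymbol{\lambda}_{i}^{\top}\boldsymbol{\eta}_{j} + \varepsilon_{ij},
\qquad
y_{ij} = c \ \text{ iff } \ \tau_{i,c-1} < y_{ij}^{*} \le \tau_{i,c},
\quad c = 1,\dots,C_{i},
```

where λ_i are the item's factor loadings, η_j the respondent's factor scores, and τ_i the ordered thresholds; estimating many such items and factors jointly is what motivates the limited-information and MCMC estimation methods reviewed above.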

762 citations


Book
19 Oct 2007
TL;DR: In this book, the authors introduce psychometrics and psychological measurement, discussing the importance of individual differences in test scores, the nature and quantification of variability, test dimensionality, reliability, validity, and item response theory.
Abstract: 1. Psychometrics and the Importance of Psychological Measurement. Observable Behavior and Unobservable Attributes. Psychological Tests: Definition and Types. Psychometrics. Challenges to Measurement in Psychology. Theme: The Importance of Individual Differences. 2. Scaling. Fundamental Issues and Numbers. Units of Measurement. Additivity and Counting. Four Scales of Measurement. Summary. Suggested Readings. 3. Individual Differences and Correlations. The Nature of Variability. Importance of Individual Differences. Variability and Distribution of Scores. Distribution Shapes and Normal Distributions. Quantifying the Association Between Distributions. Variance of "Composite Scores". Interpreting Test Scores. Test Norms. Summary. Suggested Readings. 4. Test Dimensionality and Factor Analysis. Test Dimensionality. Factor Analysis: Examining the Dimensionality of a Test. Summary. Suggested Readings. 5. Reliability: Conceptual Basis. Overview of Reliability and Classical Test Theory. Observed Scores, True Scores and Measurement Error. Variances in Observed Scores, True Scores and Error Scores. Four Ways to Think of Reliability. Reliability and the Standard Error of Measurement. Parallel Tests. Summary. 6. Empirical Estimates of Reliability. Alternate Forms Reliability. Test-Retest Reliability. Internal Consistency Reliability. Factors Affecting the Reliability of Test Scores. Sample Homogeneity and Reliability Generalization. Reliability of Difference Scores. Summary. 7. Importance of Reliability. Behavioral Research. Applied Behavioral Practice: Evaluation of an Individual's Test Score. Test Construction and Refinement. Summary. Suggested Readings. 8. Validity: The Conceptual Basis. What is Validity? Validity Evidence: Test Content. Validity Evidence: Internal Structure of the Test. Validity Evidence: Response Processes. Validity Evidence: Associations with Other Variables. Validity Evidence: Consequences of Testing. Other Perspectives on Validity. Contrasting Reliability and Validity. The Importance of Validity. Summary. Suggested Readings. 9. Validity: Estimating and Evaluating Convergent and Discriminant Validity. Methods for Evaluating Convergent and Discriminant Validity. Factors Affecting a Validity Coefficient. Interpreting a Validity Coefficient. Summary. Suggested Readings. 10. Response Biases. Types of Response Biases. Methods for Coping with Response Biases. Response Biases, Response Sets, and Response Styles. Summary. Suggested Readings. 11. Test Bias. Why Worry about Test Score Bias. Detecting Construct Bias: Internal Evaluation of a Test. Detecting Predictive Bias: External Evaluation of a Test. Summary. Suggested Readings. 12. Generalizability Theory. Multiple Facets of Measurement. Generalizability and Variance Components. G Studies and D Studies. Conducting and Interpreting Generalizability Theory Analysis: A One-facet Design. Conducting and Interpreting Generalizability Theory Analysis: A Two-facet Design. Other Measurement Designs. Summary. Footnote. Suggested Readings. 13. Item Response Theory and Rasch Models. Basics of IRT. IRT Measurement Models. An Example of IRT: A Rasch Model. Item and Test Information. Applications of IRT. Summary. Suggested Readings.

713 citations


Journal ArticleDOI
TL;DR: The outcomes of all three approaches substantiate the conviction that the assessment of dimensionality requires a good deal of judgment.
Abstract: The evaluation of assessment dimensionality is a necessary stage in the gathering of evidence to support the validity of interpretations based on a total score, particularly when assessment development and analysis are conducted within an item response theory (IRT) framework. In this study, we employ polytomous item responses to compare two methods that have received increased attention in recent years (the Rasch model and Parallel Analysis) with a method for evaluating assessment structure that is less well-known in the educational measurement community (TETRAD). The three methods were all found to be reasonably effective. Parallel Analysis successfully identified the correct number of factors, and while the Rasch approach did not show the item misfit that would indicate deviation from clear unidimensionality, the pattern of residuals did seem to indicate the presence of correlated, yet distinct, factors. TETRAD successfully confirmed one dimension in the single-construct data set and was able to confirm two dimensions in the combined data set, yet excluded one item from each cluster for no obvious reason. The outcomes of all three approaches substantiate the conviction that the assessment of dimensionality requires a good deal of judgment.

696 citations


Journal ArticleDOI
TL;DR: When used appropriately, IRT can be a powerful tool for questionnaire development, evaluation, and refinement, resulting in precise, valid, and relatively brief instruments that minimize response burden.
Abstract: Background Health outcomes researchers are increasingly applying Item Response Theory (IRT) methods to questionnaire development, evaluation, and refinement efforts.

597 citations


Journal ArticleDOI
TL;DR: An overview of item banking and CAT is provided, the approach to item banking and its byproducts are discussed, testing options are described, an example of CAT for fatigue is presented, and models for long-term sustainability of an entity such as PROMIS are discussed.
Abstract: The use of item banks and computerized adaptive testing (CAT) begins with clear definitions of important outcomes, and references those definitions to specific questions gathered into large and well-studied pools, or “banks” of items. Items can be selected from the bank to form customized short scales, or can be administered in a sequence and length determined by a computer programmed for precision and clinical relevance. Although far from perfect, such item banks can form a common definition and understanding of human symptoms and functional problems such as fatigue, pain, depression, mobility, social function, sensory function, and many other health concepts that we can only measure by asking people directly. The support of the National Institutes of Health (NIH), as witnessed by its cooperative agreement with measurement experts through the NIH Roadmap Initiative known as PROMIS (www.nihpromis.org), is a big step in that direction. Our approach to item banking and CAT is practical; as focused on application as it is on science or theory. From a practical perspective, we frequently must decide whether to re-write and retest an item, add more items to fill gaps (often at the ceiling of the measure), re-test a bank after some modifications, or split up a bank into units that are more unidimensional, yet less clinically relevant or complete. These decisions are not easy, and yet they are rarely unforgiving. We encourage people to build practical tools that are capable of producing multiple short form measures and CAT administrations from common banks, and to further our understanding of these banks with various clinical populations and ages, so that with time the scores that emerge from these many activities begin to have not only a common metric and range, but a shared meaning and understanding across users. In this paper, we provide an overview of item banking and CAT, discuss our approach to item banking and its byproducts, describe testing options, discuss an example of CAT for fatigue, and discuss models for long term sustainability of an entity such as PROMIS. Some barriers to success include limitations in the methods themselves, controversies and disagreements across approaches, and end-user reluctance to move away from the familiar.

516 citations


Journal ArticleDOI
TL;DR: The heiQ has high construct validity and is a reliable measure of a broad range of patient education program benefits, and will provide valuable information to clinicians, researchers, policymakers and other stakeholders about the value of patient education programs in chronic disease management.

472 citations


Journal ArticleDOI
TL;DR: The R package eRm (extended Rasch modeling) is proposed for computing Rasch models and several extensions; it fits the Rasch model, the rating scale model (RSM), and the partial credit model (PCM), as well as linear reparameterizations through covariate structures such as the linear logistic test model (LLTM), the linear rating scale model (LRSM), and the linear partial credit model (LPCM).
Abstract: Item response theory models (IRT) are increasingly becoming established in social science research, particularly in the analysis of performance or attitudinal data in psychology, education, medicine, marketing and other fields where testing is relevant. We propose the R package eRm (extended Rasch modeling) for computing Rasch models and several extensions. A main characteristic of some IRT models, the Rasch model being the most prominent, concerns the separation of two kinds of parameters, one that describes qualities of the subject under investigation, and the other relates to qualities of the situation under which the response of a subject is observed. Using conditional maximum likelihood (CML) estimation, both types of parameters may be estimated independently from each other. IRT models are well suited to cope with dichotomous and polytomous responses, where the response categories may be unordered as well as ordered. The incorporation of linear structures allows for modeling the effects of covariates and enables the analysis of repeated categorical measurements. The eRm package fits the following models: the Rasch model, the rating scale model (RSM), and the partial credit model (PCM) as well as linear reparameterizations through covariate structures like the linear logistic test model (LLTM), the linear rating scale model (LRSM), and the linear partial credit model (LPCM). We use a unitary, efficient CML approach to estimate the item parameters and their standard errors. Graphical and numeric tools for assessing goodness-of-fit are provided.
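A minimal usage sketch of the fitting functions named above follows; RM, RSM, and PCM are documented eRm functions, while `dichot_items` and `poly_items` are hypothetical placeholder matrices of item responses.

```r
# Hedged usage sketch for the eRm package (placeholder data objects).
library(eRm)

rm_fit  <- RM(dichot_items)    # Rasch model for 0/1 items, CML estimation
rsm_fit <- RSM(poly_items)     # rating scale model for ordered categories
pcm_fit <- PCM(poly_items)     # partial credit model for ordered categories

summary(rm_fit)                # item parameter estimates with standard errors
plotICC(rm_fit)                # item characteristic curves as a descriptive fit aid
```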

Journal ArticleDOI
TL;DR: In this article, a hierarchical framework for modeling speed and accuracy on test items is presented as an alternative to these models, allowing a "plug-and-play" approach with alternative choices of models for the response and response-time distributions as well as the distributions of their parameters.
Abstract: Current modeling of response times on test items has been strongly influenced by the paradigm of experimental reaction-time research in psychology. For instance, some of the models have a parameter structure that was chosen to represent a speed-accuracy tradeoff, while others equate speed directly with response time. Also, several response-time models seem to be unclear as to the level of parametrization they represent. A hierarchical framework for modeling speed and accuracy on test items is presented as an alternative to these models. The framework allows a "plug-and-play approach" with alternative choices of models for the response and response-time distributions as well as the distributions of their parameters. Bayesian treatment of the framework with Markov chain Monte Carlo (MCMC) computation facilitates the approach. Use of the framework is illustrated for the choice of a normal-ogive response model, a lognormal model for the response times, and multivariate normal models for their parameters with Gibbs sampling from the joint posterior distribution.
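As a sketch of the kind of first-level models the framework combines, the normal-ogive and lognormal choices illustrated in the article can be written as follows; the notation is the standard one for such models rather than a quotation from the paper.

```latex
% First-level models: a normal-ogive model for the response U_ij of person j to
% item i, and a lognormal model for the response time T_ij (generic notation).
\Pr(U_{ij} = 1 \mid \theta_j) = \Phi\big(a_i(\theta_j - b_i)\big),
\qquad
\ln T_{ij} \mid \tau_j \sim N\big(\beta_i - \tau_j,\ \alpha_i^{-2}\big),
```

where θ_j is ability and τ_j speed, (a_i, b_i) are item discrimination and difficulty, and (α_i, β_i) are the item's time discrimination and time intensity; the hierarchical second level then models the joint distributions of the person parameters (θ_j, τ_j) and of the item parameters.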

Journal ArticleDOI
TL;DR: The International Knee Documentation Committee (IKDC) Subjective Knee Form is a patient-oriented questionnaire that assesses symptoms and function in daily living activities as mentioned in this paper, and the purpose of this study was to validate the IKDC Subjective Knee Form in a large patient population with various knee disorders.

Journal ArticleDOI
TL;DR: In this paper, the authors developed estimation procedures for fitting the graded response model when the data follow the bifactor structure, which results from the constraint that each item has a nonzero loading on the primary dimension and, at most, one of the group factors.
Abstract: A plausible factorial structure for many types of psychological and educational tests exhibits a general factor and one or more group or method factors. This structure can be represented by a bifactor model. The bifactor structure results from the constraint that each item has a nonzero loading on the primary dimension and, at most, one of the group factors. The authors develop estimation procedures for fitting the graded response model when the data follow the bifactor structure. Using maximum marginal likelihood estimation of item parameters, the bifactor restriction leads to a major simplification of the likelihood equations and (a) permits analysis of models with large numbers of group factors, (b) permits conditional dependence within identified subsets of items, and (c) provides more parsimonious factor solutions than an unrestricted full-information item factor analysis in some cases. Analysis of data obtained from 586 chronically mentally ill patients revealed a clear bifactor structure.
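The bifactor constraint described above can be written compactly; the notation below is generic and only meant to illustrate the loading pattern.

```latex
% Bifactor pattern: every item loads on the general factor theta_0 and on at most
% one group factor theta_k(i); all other group-factor loadings are fixed to zero.
y_{i}^{*} = \lambda_{i0}\,\theta_{0} + \lambda_{i,k(i)}\,\theta_{k(i)} + \varepsilon_{i},
\qquad k(i) \in \{1, \dots, K\},
```

and the graded response model is then applied to the latent response y*_i through its ordered category thresholds. This sparsity, at most two nonzero loadings per item, is what collapses the high-dimensional integration into iterated two-dimensional integrals and yields the computational simplifications listed in the abstract.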

Book
19 Mar 2007
TL;DR: Testlet response theory, as discussed by the authors, generalizes traditional true score theory and item response theory to tests built from testlets, using the Law School Admissions Test as a running example.
Abstract: Preface Part I. Introduction to Testlets: 1. Introduction to testing 2. Traditional true score theory 3. Item response theory 4. Testlet response theory: introduction and preview 5. The origins of testlet response theory: three alternatives 6. Fitting testlets with polytomous IRT models: the Law School Admissions Test as an example Part II. Bayesian Testlet Response Theory: Introduction 7. A brief history and the basic ideas of modern testlet response theory 8. The 2-PL Bayesian testlet model 9. The 3-PL Bayesian testlet model 10. A Bayesian testlet model for a mixture of binary and polytomous data 11. A Bayesian testlet model with covariates 12. Testlet nonresponse theory: dealing with missing data Part III. Applications and Ancillary Topics: Introduction 13. Using posterior distributions to evaluate passing scores: The PPoP curve 14. DIF - Differential Testlet Functioning 15. Estimation: a Bayesian primer.

Journal Article
TL;DR: The CAOS test, designed to measure students' conceptual understanding of important statistical ideas, was developed across three years of revision and testing, content validation, and reliability analysis; results are reported from large-scale class testing, with item responses compared from pretest to posttest to identify areas in which students demonstrated improved performance from beginning to end of the course, as well as areas that showed no improvement or decreased performance.
Abstract: This paper describes the development of the CAOS test, designed to measure students' conceptual understanding of important statistical ideas, across three years of revision and testing, content validation, and reliability analysis. Results are reported from large-scale class testing, and item responses are compared from pretest to posttest in order to learn more about areas in which students demonstrated improved performance from beginning to end of the course, as well as areas that showed no improvement or decreased performance. Items that showed an increase in students' misconceptions about particular statistical concepts were also examined. The paper concludes with a discussion of implications for students' understanding of different statistical topics, followed by suggestions for further research.

Journal ArticleDOI
TL;DR: Several methods for detection of differential item functioning (DIF), including non-parametric and parametric methods such as logistic regression, and those based on item response theory are discussed.
Abstract: Establishing measurement equivalence is important because inaccurate assessment may lead to incorrect estimates of effects in research, and to suboptimal decisions at the individual, clinical level. Examination of differential item functioning (DIF) is a method for studying measurement equivalence. An item (i.e., one question in a longer scale) exhibits DIF if the item response differs across groups (e.g., gender, race), controlling for an estimate of the construct being measured. A distinction between applications in health, as contrasted with other settings such as educational and aptitude testing, is that there are many health-related constructs and multiple measures of each, few of which have received much critical evaluation. Discussed in this article are several methods for detection of differential item functioning (DIF), including non-parametric and parametric methods such as logistic regression, and those based on item response theory. Basic definitions and criteria for DIF detection are provided, as are steps in performing the analyses. Recommendations are presented and future directions discussed.
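As a concrete reference for the logistic regression approach mentioned above, the model is usually specified as follows (generic notation; grp is the group indicator and S the matching variable, such as the total score or an IRT trait estimate).

```latex
% Logistic regression DIF model for a dichotomous item Y_i (generic notation).
\operatorname{logit}\Pr(Y_{i} = 1) =
  \beta_{0} + \beta_{1} S + \beta_{2}\,\mathrm{grp} + \beta_{3}\,(S \times \mathrm{grp}),
```

where a nonzero β2 indicates uniform DIF (a group difference that is constant across the trait range) and a nonzero β3 indicates non-uniform DIF (a group difference that changes with the trait level).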

Journal ArticleDOI
TL;DR: The DS14 adequately measures NA and SI, with highest reliability in the trait range around the cutoff, and is a valid instrument to assess and compare type-D personality across clinical groups.

Journal ArticleDOI
TL;DR: The unique features of item banks and CAT are described, and how to develop item banks and evaluate the assumptions of the IRT model is discussed.
Abstract: Item banks and Computerized Adaptive Testing (CAT) have the potential to greatly improve the assessment of health outcomes. This review describes the unique features of item banks and CAT and discusses how to develop item banks. In CAT, a computer selects the items from an item bank that are most relevant for and informative about the particular respondent, thus optimizing test relevance and precision. Item response theory (IRT) provides the foundation for selecting the items that are most informative for the particular respondent and for scoring responses on a common metric. The development of an item bank is a multi-stage process that requires a clear definition of the construct to be measured, good items, a careful psychometric analysis of the items, and a clear specification of the final CAT. The psychometric analysis needs to evaluate the assumptions of the IRT model such as unidimensionality and local independence; that the items function the same way in different subgroups of the population; and that there is an adequate fit between the data and the chosen item response models. Also, interpretation guidelines need to be established to help the clinical application of the assessment. Although medical research can draw upon expertise from educational testing in the development of item banks and CAT, the medical field also encounters unique opportunities and challenges.
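To illustrate the item-selection principle described above, here is a small self-contained sketch in R; it is not code from the review, and the item bank, stopping rule, and EAP scoring are simplified stand-ins for what an operational CAT would use.

```r
# Illustrative CAT sketch (not from the article): at each step administer the
# unused 2-PL item with maximum Fisher information at the current trait estimate.
set.seed(1)
bank <- data.frame(a = runif(50, 0.8, 2.0),   # made-up discriminations
                   b = rnorm(50))             # made-up difficulties

p2pl <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))
info <- function(theta, a, b) {               # 2-PL item information
  p <- p2pl(theta, a, b)
  a^2 * p * (1 - p)
}

theta_hat    <- 0                             # provisional trait estimate
administered <- integer(0)
responses    <- integer(0)
grid         <- seq(-4, 4, by = 0.1)          # grid for simple EAP updating

for (step in 1:10) {                          # fixed length as a toy stopping rule
  remaining <- setdiff(seq_len(nrow(bank)), administered)
  crit <- info(theta_hat, bank$a[remaining], bank$b[remaining])
  item <- remaining[which.max(crit)]          # most informative remaining item
  # A real CAT records the respondent's answer; here we simulate one at theta = 0.5.
  resp <- rbinom(1, 1, p2pl(0.5, bank$a[item], bank$b[item]))
  administered <- c(administered, item)
  responses    <- c(responses, resp)
  # EAP update of theta under a standard normal prior.
  lik <- dnorm(grid)
  for (k in seq_along(administered)) {
    pk  <- p2pl(grid, bank$a[administered[k]], bank$b[administered[k]])
    lik <- lik * (if (responses[k] == 1) pk else 1 - pk)
  }
  theta_hat <- sum(grid * lik) / sum(lik)
}
theta_hat                                     # final trait estimate after 10 items
```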

Journal ArticleDOI
TL;DR: There was substantial evidence for the cross-cultural equivalence of the KIDSCREEN-27 across the countries studied and the factor structure was highly replicable in individual countries.
Abstract: Objectives: The aim of this study is to assess the structural and cross-cultural validity of the KIDSCREEN-27 questionnaire. Methods: The 27-item version of the KIDSCREEN instrument was derived from a longer 52-item version and was administered to young people aged 8-18 years in 13 European countries in a cross-sectional survey. Structural and cross-cultural validity were tested using multitrait multi-item analysis, exploratory and confirmatory factor analysis, and Rasch analyses. Zumbo's logistic regression method was applied to assess differential item functioning (DIF) across countries. Reliability was assessed using Cronbach's alpha. Results: Responses were obtained from n = 22,827 respondents (response rate 68.9%). For the combined sample from all countries, exploratory factor analysis with procrustean rotations revealed a five-factor structure which explained 56.9% of the variance. Confirmatory factor analysis indicated an acceptable model fit (RMSEA = 0.068, CFI = 0.960). The unidimensionality of all dimensions was confirmed (INFIT: 0.81-1.15). Differential item functioning (DIF) results across the 13 countries showed that 5 items presented uniform DIF whereas 10 displayed non-uniform DIF. Reliability was acceptable (Cronbach's α = 0.78-0.84 for individual dimensions). Conclusions: There was substantial evidence for the cross-cultural equivalence of the KIDSCREEN-27 across the countries studied and the factor structure was highly replicable in individual countries. Further research is needed to correct scores based on DIF results. The KIDSCREEN-27 is a new short and promising tool for use in clinical and epidemiological studies. © 2007 Springer Science+Business Media B.V.

Journal ArticleDOI
TL;DR: A Quality of Life questionnaire rated by professionals that can be used for people with dementia in different stages of the disease, living in residential settings is developed.
Abstract: Objective To develop a Quality of Life questionnaire rated by professionals that can be used for people with dementia in different stages of the disease, living in residential settings. Method Development was performed in two phases: item generation and pilot testing, and a field survey to evaluate the psychometric properties. For unidimensionality we used a non-parametric model from item response theory: the Mokken scaling model, and computed the corresponding scalability coefficients, using a theory driven strategy. Results The pilot survey resulted in a list of 49 items. The field survey was performed in a sample of 238 people with dementia residing in ten nursing homes. The scalability of the subscales positive affect, negative affect, restless tense behavior, and social relations is strong (0.50 < H < 0.63); for care relationship, positive self image, feeling at home, and having something to do, scalability was moderate (0.40 < H < 0.49), and for social isolation it was weak (H = 0.34). The reliability coefficient Rho (under assumption of double monotonicity) varied from 0.60 for social isolation to 0.90 for positive affect (Cronbach's alpha varied from 0.59 to 0.89). Twenty-one of 40 items are suited for people with very severe dementia. Conclusion The QUALIDEM is an easy to administer and sufficiently reliable rating scale that provides a QOL profile of persons with dementia in residential settings. The QUALIDEM can be used for evaluation as well as for research and practice innovation.
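For context on the scalability coefficients reported above, Loevinger's H (the coefficient used in Mokken scaling) compares observed and expected Guttman errors; the formula below is the standard definition rather than anything specific to the QUALIDEM.

```latex
% Loevinger's scalability coefficient for a scale (standard definition).
H = 1 - \frac{\sum_{i<j} F_{ij}}{\sum_{i<j} E_{ij}},
```

where F_ij is the observed number of Guttman errors for item pair (i, j) and E_ij the number expected under marginal independence; by the usual Mokken guidelines, 0.3 ≤ H < 0.4 indicates a weak scale, 0.4 ≤ H < 0.5 a moderate scale, and H ≥ 0.5 a strong scale, which matches the interpretation used in the abstract.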

Journal ArticleDOI
TL;DR: It is recommended that short tests for high-stakes decision making be used in combination with other information so as to increase reliability and classification consistency.
Abstract: Short tests containing at most 15 items are used in clinical and health psychology, medicine, and psychiatry for making decisions about patients. Because short tests have large measurement error, the authors ask whether they are reliable enough for classifying patients into a treatment and a nontreatment group. For a given certainty level, proportions of correct classifications were computed for varying test length, cut-scores, item scoring, and choices of item parameters. Short tests were found to classify at most 50% of a group consistently. Results were much better for tests containing 20 or 40 items. Small differences were found between dichotomous and polytomous (5 ordered scores) items. It is recommended that short tests for high-stakes decision making be used in combination with other information so as to increase reliability and classification consistency.

Journal ArticleDOI
TL;DR: Results show that adoption of the ideal point approach provided a more flexible platform for creating future personality measures, and this transition did not adversely affect the validity of personality test scores.
Abstract: The main aim of this article is to explicate why a transition to ideal point methods of scale construction is needed to advance the field of personality assessment. The study empirically demonstrated the substantive benefits of ideal point methodology as compared with the dominance framework underlying traditional methods of scale construction. Specifically, using a large, heterogeneous pool of order items, the authors constructed scales using traditional classical test theory, dominance item response theory (IRT), and ideal point IRT methods. The merits of each method were examined in terms of item pool utilization, model-data fit, measurement precision, and construct and criterion-related validity. Results show that adoption of the ideal point approach provided a more flexible platform for creating future personality measures, and this transition did not adversely affect the validity of personality test scores.

Journal ArticleDOI
TL;DR: Specific criteria chosen to determine whether items have DIF have an impact on the findings, and criteria based entirely on statistical significance may detect small differences that are clinically negligible.
Abstract: Background Several techniques have been developed to detect differential item functioning (DIF), including ordinal logistic regression (OLR). This study compared different criteria for determining whether items have DIF using OLR.


Journal ArticleDOI
TL;DR: In this article, multidimensional compensatory item response theory (MIRT) models for dichotomous and polytomous items were used to estimate students' subscale score proficiencies.
Abstract: Several approaches to reporting subscale scores can be found in the literature. This research explores a multidimensional compensatory dichotomous and polytomous item response theory modeling approach for subscale score proficiency estimation, leading toward a more diagnostic solution. It also develops and explores the recovery of a Markov chain Monte Carlo (MCMC) estimation approach to multidimensional item and ability parameter estimation, as well as subscale proficiency and classification rates. The simulation study presented here used real data-derived parameters from a large-scale statewide assessment with subscale score information under varying conditions of sample size and correlations between subscales (0.0, 0.1, 0.3, 0.5, 0.7, 0.9). It was found that to report accurate diagnostic information at the subscale level, the subscales need to be highly correlated, or a multidimensional approach should be implemented. MCMC methodology is still a nascent methodology in psychometrics; however, with the growing body of research, its future looks promising. Index terms: multidimensional item response theory (MIRT); Bayesian estimation; MCMC; Domain score; OPI; subscale scores. The field of educational assessment has been dominated by unidimensional summative tests that report overall scores indicating students' levels of achievement in broadly defined content area domains, such as mathematics or language arts. Although there has been an increasing demand for assessments that directly assist students and teachers in understanding and responding to key strengths and weaknesses in specific content domains, this need for large-scale assessments that are more formative and informative in design remains largely unfulfilled. Many testing programs (e.g., Scholastic Aptitude Test [SAT]; College Board) have introduced a diagnostic pretest, as in the PSAT, to assist students in determining which of the skills within a particular domain of knowledge needs improvement. CTB/McGraw-Hill's Tests of Adult Basic Education (TABE) is a diagnostic pretest for the General Education Diploma (GED) test in the United States. Operating under the assumption that a test can measure more than one trait or distinct cognitive dimension, numerous testing programs now require reporting of subscale scores for different objectives defined by the test design. For example, an overall mathematics score might be supplemented with scores for numbers and operations, algebra and functions, geometry and measures, and probability and statistics. Some programs report subscale scores on a simple number-correct score (or percentage-correct), which in many cases is not adjusted for form difficulty and thus does not permit form-to-form or population-to-population comparisons. These simple scores also do not augment the subscores based on information from the other subscores and do not consider the relevance
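The compensatory multidimensional model referred to above is conventionally written as a 2-PL with a vector of abilities; the notation below is generic, not quoted from the article.

```latex
% Compensatory multidimensional 2-PL (generic notation): subscale abilities
% combine additively inside the logit, so strength on one dimension can offset
% weakness on another.
\Pr(U_{ij} = 1 \mid \boldsymbol{\theta}_{j}) =
  \frac{\exp\big(\mathbf{a}_{i}^{\top}\boldsymbol{\theta}_{j} + d_{i}\big)}
       {1 + \exp\big(\mathbf{a}_{i}^{\top}\boldsymbol{\theta}_{j} + d_{i}\big)},
```

where a_i is the item's vector of discriminations on the subscale dimensions, θ_j the examinee's vector of subscale proficiencies, and d_i an intercept; the polytomous case extends this with category-specific intercepts.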

Journal ArticleDOI
TL;DR: This study compared four methods for setting item response time thresholds to differentiate rapid-guessing behavior from solution behavior, indicating that response time effort is not very sensitive to the particular threshold identification method used.
Abstract: This study compared four methods for setting item response time thresholds to differentiate rapid-guessing behavior from solution behavior. Thresholds were either (a) common for all test items, (b) based on item surface features such as the amount of reading required, (c) based on visually inspecting response time frequency distributions, or (d) statistically estimated using a two-state mixture model. The thresholds were compared using the criteria proposed by Wise and Kong to establish the reliability and validity of response time effort scores, which were generated on the basis of the specified threshold values. The four methods yielded very similar results, indicating that response time effort is not very sensitive to the particular threshold identification method used. Recommendations are given regarding use of the various methods.

Journal ArticleDOI
TL;DR: The authors applied item randomized-response (IRR) models for the analysis of multivariate RR data and found that respondents are more willing to comply when the expected benefits of noncompliance are minor and social control is strong.
Abstract: Randomized response (RR) is a well-known method for measuring sensitive behavior. Yet this method is not often applied because: (i) of its lower efficiency and the resulting need for larger sample sizes which make applications of RR costly; (ii) despite its privacy-protection mechanism the RR design may not be followed by every respondent; and (iii) the incorrect belief that RR yields estimates only of aggregate-level behavior but that these estimates cannot be linked to individual-level covariates. This paper addresses the efficiency problem by applying item randomized-response (IRR) models for the analysis of multivariate RR data. In these models, a person parameter is estimated based on multiple measures of a sensitive behavior under study which allow for more powerful analyses of individual differences than available from univariate RR data. Response behavior that does not follow the RR design is approached by introducing mixture components in the IRR models with one component consisting of respondents who answer truthfully and another component consisting of respondents who do not provide truthful responses. An analysis of data from two large-scale Dutch surveys conducted among recipients of invalidity insurance benefits shows that the willingness of a respondent to answer truthfully is related to the educational level of the respondents and the perceived clarity of the instructions. A person is more willing to comply when the expected benefits of noncompliance are minor and social control is strong.
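For readers unfamiliar with randomized response, the classical Warner design on which item randomized-response models build works as follows; the formula is the textbook version, included only for context.

```latex
% Warner randomized-response design: with known probability p the respondent
% answers the sensitive statement, otherwise its negation, so individual answers
% are uninformative but the prevalence pi remains estimable.
\Pr(\text{yes}) = \lambda = p\,\pi + (1 - p)(1 - \pi),
\qquad
\hat{\pi} = \frac{\hat{\lambda} - (1 - p)}{2p - 1}, \quad p \neq 0.5,
```

where π is the population prevalence of the sensitive behavior and λ̂ the observed proportion of "yes" answers; the inflation of sampling variance caused by the randomization is the efficiency problem that the multivariate IRR models are designed to mitigate.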

Journal ArticleDOI
TL;DR: In this article, the authors compared model selection results using the likelihood ratio test, two information-based criteria, and two Bayesian methods, and found that the cross-validation log-likelihood (CVLL) appeared to work the best of the five indices for the conditions simulated in this study.
Abstract: Fit of the model to the data is important if the benefits of item response theory (IRT) are to be obtained. In this study, the authors compared model selection results using the likelihood ratio test, two information-based criteria, and two Bayesian methods. An example illustrated the potential for inconsistency in model selection depending on which of the indices was used. Results from a simulation study indicated that the inconsistencies among the indices were common but that model selection was relatively accurate for longer tests administered to larger samples of examinees. The cross-validation log-likelihood (CVLL) appeared to work the best of the five indices for the conditions simulated in this study.
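For reference, information-based criteria of the kind compared in this study are conventionally defined as follows (standard formulas, stated here for orientation rather than quoted from the article).

```latex
% Standard information criteria for model selection; smaller values favor a model.
\mathrm{AIC} = -2\ln L + 2k,
\qquad
\mathrm{BIC} = -2\ln L + k\,\ln N,
```

where L is the maximized likelihood of the fitted IRT model, k the number of estimated parameters, and N the sample size; the cross-validation log-likelihood instead evaluates the calibrated item parameters on the log-likelihood of a hold-out sample of examinees.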

Journal ArticleDOI
TL;DR: Item response theory models are measurement models for categorical responses as discussed by the authors, where test items are scored either dichotomously (correct/incorrect) or by using an ordinal scale (a grade from poor to excellent).
Abstract: Item response theory models are measurement models for categorical responses. Traditionally, the models are used in educational testing, where responses to test items can be viewed as indirect measures of latent ability. The test items are scored either dichotomously (correct/incorrect) or by using an ordinal scale (a grade from poor to excellent). Item response models also apply equally for measurement of other latent traits. Here we describe the one- and two-parameter logit models for dichotomous items, the partial-credit and rating scale models for ordinal items, and an extension of these models where the latent variable is regressed on explanatory variables. We show how these models can be expressed as generalized linear latent and mixed models and fitted by using the user-written command gllamm.
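The one- and two-parameter logit models described above take the following standard form (generic notation).

```latex
% One-parameter (Rasch) and two-parameter logistic models for a dichotomous item.
\Pr(y_{ij} = 1 \mid \theta_j) = \frac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)}
\quad \text{(1-PL)},
\qquad
\Pr(y_{ij} = 1 \mid \theta_j) = \frac{\exp\{a_i(\theta_j - b_i)\}}{1 + \exp\{a_i(\theta_j - b_i)\}}
\quad \text{(2-PL)},
```

where θ_j is the latent trait, b_i the item difficulty, and a_i the item discrimination; in the latent regression extension the mean of θ_j is modelled as a linear function of explanatory variables, which is what allows these models to be cast as generalized linear latent and mixed models.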