
Showing papers on "Differential item functioning published in 2003"


Journal ArticleDOI
TL;DR: Rasch modelling has increasing application in rehabilitation medicine, allowing for the transformation of the cumulative raw scores into linear continuous measures of ability and difficulty, and making it possible to compare homogeneous measures or foster diagnostic procedures on the reasons for differential item functioning.
Abstract: Variables present in an individual, for example, independence, pain, balance, fatigue, depression and knowledge, cannot be measured directly (hence the term "latent" variables). They are usually assessed by measuring related behaviours, defined by sets of standardized items. The homogeneity of the different items, and the proportionality of raw counts to the measure, can only be postulated. In 1960 Georg Rasch proposed a statistical model that complied with the fundamental assumptions made in measurement in the physical sciences. It allowed for the transformation of the cumulative raw scores (achieved by a subject across items, or by an item across subjects) into linear continuous measures of ability (for subjects) and difficulty (for items). These two parameters alone govern the probability that a "pass" rather than a "fail" occurs. The discrepancies between model-expected scores (continuous between 0 and 1) and observed scores (discrete, either 0 or 1) provide indexes of inconsistency for individual subjects, items and classes of subjects. In subsequent years the same principles were extended to rating scales, with items graded on more than two levels, and to "many-facet" contexts where, beyond items and subjects, multiple raters, times of administration, etc. converge in determining the observed scores. Rasch modelling has increasing application in rehabilitation medicine. New scales with unprecedented metric validity (including internal consistency and reliability) can be built. Existing scales can be improved or rejected on a sound theoretical basis. In clinical trials the consistency and the linearity of measures of either subjects or raters can be validly matched with those of physical and chemical measures. The stability of the item difficulties across time, cultures, diagnostic groups and time of administration can be estimated, thus making it possible to compare homogeneous measures or to foster diagnostic procedures on the reasons for differential item functioning.
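The dichotomous model described in this abstract can be stated compactly: given only a person ability θ and an item difficulty b, the probability of a "pass" is exp(θ − b) / (1 + exp(θ − b)). A minimal sketch (the function and variable names are illustrative, not from the article):

```python
import math

def rasch_prob(ability: float, difficulty: float) -> float:
    """Probability of a 'pass' under the dichotomous Rasch model.
    Only the person's ability and the item's difficulty enter the model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# When ability equals difficulty, the probability of passing is exactly 0.5.
print(rasch_prob(1.0, 1.0))  # 0.5
# A more able person has a higher pass probability on the same item.
print(rasch_prob(2.0, 1.0) > rasch_prob(0.0, 1.0))  # True
```

The difference between this model-expected probability and the observed 0/1 score is what drives the fit indexes the abstract mentions.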

316 citations


Journal ArticleDOI
TL;DR: Results from this study imply that S - X2 may be a useful tool in detecting the misfit of one item contained in an otherwise well-fitted test, lending additional support to the utility of the index for use with dichotomous item response theory models.
Abstract: This study presents new findings on the utility of S - X2 as an item fit index for dichotomous item response theory models. Results are based on a simulation study in which item responses were generated and calibrated for 100 tests under each of 27 conditions. The item fit indices S - X2 and Q1 - X2 were calculated for each item. ROC curves were constructed based on the hit and false alarm rates of the two indices. Examination of these curves indicated that, in general, the performance of S - X2 improved with test length and sample size. The performance of S - X2 was superior to that of Q1 - X2 under most but not all conditions. Results from this study imply that S - X2 may be a useful tool in detecting the misfit of one item contained in an otherwise well-fitted test, lending additional support to the utility of the index for use with dichotomous item response theory models. Index terms: item response theory, S - X2, Q1 - X2, model-data fit, item fit index

305 citations


Journal Article
TL;DR: In this article, the authors provide a general overview of "validity" based on the latest standards, the most relevant feature for both test development and test evaluation; new sources of evidence for validity analysis, such as differential item functioning and consequential validity, have emerged strongly.
Abstract: About test validity. More than two years have passed since the publication of the latest standards for educational and psychological testing (AERA, APA & NCME, 1999). This publication is the best and foremost reference for test evaluation, construction and use. The most important contribution relative to the previous version, published in 1985, is the emphasis on guaranteeing proper use of tests, thus bestowing upon the user new responsibilities in the process. As a result, new sources of evidence for validity analysis, such as differential item functioning and consequential validity, have emerged strongly. The objective of this study is to provide a general overview of "validity" based on the latest standards, the most relevant feature for both test development and test evaluation.

136 citations


Journal ArticleDOI
TL;DR: In this article, the effects of anchor item methods on Type I error and power of detecting differential item functioning (DIF) using the likelihood ratio test within the framework of item response theory was investigated.
Abstract: Through simulations, this study investigates the effects of anchor item methods on Type I error and power of detecting differential item functioning (DIF) using the likelihood ratio test within the framework of item response theory. Four anchor item methods were compared: the all-other, 1-item, 4-item, and 10-item methods. The results showed that it is the average signed area between the reference and focal groups rather than the percentage of DIF items in a test that determines the Type I error of the all-other method. The all-other method yields good control over Type I error and reasonable power only when the average signed area approaches zero. The all-other method is not recommended for practical DIF analysis because it is only adequate under very stringent conditions. The other three methods perform appropriately under all the simulated conditions. The more anchor items are used, the higher the power of DIF detection.

133 citations


Journal ArticleDOI
TL;DR: It is concluded that testing for DIF is a useful way to validate questionnaire translations and discusses the difference between linguistic DIF and DIF caused by confounding, cross-cultural differences, or DIF in other items in the scale.
Abstract: In cross-national comparisons based on questionnaires, accurate translations are necessary to obtain valid results. Differential item functioning (DIF) analysis can be used to test whether translations of items in multi-item scales are equivalent to the original. In data from 10,815 respondents representing 10 European languages we tested for DIF in the nine translations of the EORTC QLQ-C30 emotional function scale when compared to the original English version. We tested for DIF using two different methods in parallel, a contingency table method and logistic regression. The DIF results obtained with the two methods were similar. We found indications of DIF in seven of the nine translations. At least two of the DIF findings seem to reflect linguistic problems in the translation. 'Imperfect' translations can affect conclusions drawn from cross-national comparisons. Given that translations can never be identical to the original we discuss how findings of DIF can be interpreted and discuss the difference between linguistic DIF and DIF caused by confounding, cross-cultural differences, or DIF in other items in the scale. We conclude that testing for DIF is a useful way to validate questionnaire translations.
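The logistic-regression method mentioned in this abstract is conventionally run as a nested-model comparison: the item response is regressed on the matching score, the group term is added (plus a group-by-score interaction for nonuniform DIF), and the likelihood-ratio statistic is checked against a chi-squared cutoff. A stdlib-only sketch on synthetic data; the fitting routine, effect size, sample size, and cutoff are illustrative assumptions, not the article's:

```python
import math
import random

def fit_logistic(X, y, steps=800, lr=1.0):
    """Tiny logistic-regression fit by batch gradient ascent (illustrative only)."""
    w = [0.0] * len(X[0])
    n = len(y)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
            for j, xj in enumerate(xi):
                grad[j] += (yi - p) * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w

def loglik(w, X, y):
    ll = 0.0
    for xi, yi in zip(X, y):
        p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
        ll += math.log(p if yi == 1 else 1.0 - p)
    return ll

random.seed(0)
# Synthetic item with uniform DIF: at every level of the matching variable,
# the focal group (g = 1) finds the item harder by 1.0 logit.
X, y = [], []
for _ in range(600):
    g = random.randint(0, 1)
    score = random.gauss(0.0, 1.0)           # stand-in for the matching total score
    p = 1.0 / (1.0 + math.exp(-(score - 1.0 * g)))
    X.append([1.0, score, g])                # intercept, matching score, group
    y.append(1 if random.random() < p else 0)

X_base = [xi[:2] for xi in X]                # model without the group term
lr_stat = 2 * (loglik(fit_logistic(X, y), X, y)
               - loglik(fit_logistic(X_base, y), X_base, y))
print(lr_stat > 3.84)  # 1-df chi-squared cutoff at alpha = .05 -> flags uniform DIF
```

In practice a library fitter would replace the hand-rolled gradient ascent; the nested-model logic is the point here.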

130 citations


Journal ArticleDOI
TL;DR: Age group comparisons of mental health may be particularly affected by DIF, and differences in education, as well as age and gender, need to be controlled when making group comparisons.
Abstract: Background. Demographic differences have been reported in summary measures of physical and mental health based on the SF-12 instrument. Objectives. This study examines the extent to which differential item functioning (DIF) contributes to observed subgroup differences in health status. DIF refers to situations in which the psychometric properties of items are not invariant across different groups. The presence of DIF confounds interpretation of subgroup differences. Subjects. A national sample of 11,626 adult respondents in the 2000 Medical Expenditure Panel Survey who completed a self-administered questionnaire. Measures. In addition to the SF-12, we collected data on demographic characteristics (age, gender, education, and race/ethnicity) and whether the person had ever been diagnosed with six chronic medical conditions. Results. Multiple-indicator multiple-cause latent variable models showed significant differences in physical health by gender, age, and education. Adjusting for DIF reduced but did not eliminate age and education differences. However, for mental health, adjusting for DIF resulted in Black-White differences becoming nonsignificant, and the effect for the oldest age group switched from positive to negative. Race/ethnicity was not associated with physical health status. Conclusions. Age group comparisons of mental health may be particularly affected by DIF. Differences in education, as well as age and gender, need to be controlled when making group comparisons. Additional work is needed to understand factors that give rise to demographic differences in reported health status.

123 citations


Journal ArticleDOI
TL;DR: In this paper, a multilevel item response (IRT) model is presented which allows for differences between the distributions of item parameters of families of item clones, and an item selection procedure for computerized adaptive testing with item cloning is presented.
Abstract: To increase the number of items available for adaptive testing and reduce the cost of item writing, the use of techniques of item cloning has been proposed. An important consequence of item cloning is possible variability between the item parameters. To deal with this variability, a multilevel item response (IRT) model is presented which allows for differences between the distributions of item parameters of families of item clones. A marginal maximum likelihood and a Bayesian procedure for estimating the hyperparameters are presented. In addition, an item-selection procedure for computerized adaptive testing with item cloning is presented which has the following two stages: First, a family of item clones is selected to be optimal at the estimate of the person parameter. Second, an item is randomly selected from the family for administration. Results from simulation studies based on an item pool from the Law School Admission Test (LSAT) illustrate the accuracy of these item pool calibration and adaptive testing procedures. Index terms: computerized adaptive testing, item cloning, multilevel item response theory, marginal maximum likelihood, Bayesian item selection.

113 citations




Journal ArticleDOI
TL;DR: In this article, the authors investigated potentially biased scale items on the Center for Epidemiologic Studies Depression Scale (CES-D) using two binary methods (presence and persistence) and one ordinal method.
Abstract: The present study investigated potentially biased scale items on the Center for Epidemiologic Studies Depression Scale (CES-D). The 20-item CES-D was scored using two binary methods (presence and persistence) and one ordinal method. Gender differential item functioning (DIF) was explored using Zumbo's OLR method with the corresponding logistic regression effect size estimator under all three scoring methods. Gender DIF was found with the CES-D item "crying" for the ordinal and presence methods of scoring. The persistence scoring method identified two DIF items (effort and hopeful); however, this scoring method appears to be of limited use due to low variability on some items. Overall, the results indicate that the scoring method has an effect on DIF; thus, DIF is a property of the item, the scoring method, and the purpose of the instrument.

94 citations


Journal ArticleDOI
TL;DR: In this paper, scale-level methods are sometimes exclusively used to investigate measurement invariance for test translation, and the results of a simulation study are described based on the observation that scale level methods are often exclusively used for test invariance.
Abstract: Based on the observation that scale-level methods are sometimes exclusively used to investigate measurement invariance for test translation, this article describes the results of a simulation study...

88 citations


Journal ArticleDOI
TL;DR: Although promising, both questionnaires warrant further developmental work and stronger support of measurement validity before they could be considered fully suitable for valid use in PD, in particular in earlier stages of the disease.
Abstract: We assessed the feasibility and psychometric properties of two commonly used health status questionnaires in Parkinson's disease (PD): the generic Nottingham Health Profile (NHP) and the disease-specific 39-item Parkinson's disease Questionnaire (PDQ-39), from a cross-sectional postal survey of PD patients (N = 81), using traditional and Rasch measurement methodologies. Overall response rate was 88%. Both questionnaires were found feasible, although the NHP performed less well. The PDQ-39 had fewer floor effects and was better able to separate respondents into distinct groups than the NHP, whereas the latter exhibited less ambiguous dimensionality and better targeting of respondents with non-extreme scores. Reliability and validity indices were similar, and potential differential item functioning by age and gender groups was found for both questionnaires. PDQ-39 response alternatives indicated ambiguity. With few exceptions, questionnaire scales were unable to meet recommended standards fully. While preliminary, this study illustrates the need for thorough evaluation of outcome measures and has implications beyond the questionnaires used here. Although promising, both questionnaires warrant further developmental work and stronger support of measurement validity before they could be considered fully suitable for valid use in PD, in particular in earlier stages of the disease.

Journal ArticleDOI
TL;DR: In this paper, the relation between classical test theory (CTT) and item response theory (IRT) is discussed. And it is shown that IRT can be used to provide CTT statistics in situations where CTT fails, even though a test was not administered to the intended population.
Abstract: This study is about relations between classical test theory (CTT) and item response theory (IRT). It is shown that CTT is based on the assumption that measures are exchangeable, whereas IRT is based on conditional independence. Thus, IRT is presented as an extension of CTT, and concepts from both theories are related to one another. Furthermore, it is demonstrated that IRT can be used to provide CTT statistics in situations where CTT fails. Reliability, for instance, can be determined even though a test was not administered to the intended population.

Journal ArticleDOI
TL;DR: In this paper, the delta method was used to compute the standard error of the estimates of the converted item response theory (IRT) discrimination and difficulty parameters derived from multiple-indicator, multiple-causes (MIMIC) model parameters.
Abstract: The purpose of this study is to document the delta method to compute the standard error of the estimates of the converted item response theory (IRT) discrimination and difficulty parameters derived from multiple-indicator, multiple-causes (MIMIC) model parameters. Discussed is the formulation of MIMIC models to explore differential item functioning in Mplus and how to obtain factor-analytic estimates that are converted easily into IRT parameters. Also described are the partial derivatives necessary to apply the delta method to estimate variances for the converted parameters. Both item difficulty and discrimination parameters estimated from MIMIC parameters were very close to the Multilog estimates. The variance estimates for most parameters were similar as well.
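The delta method invoked here has a standard general form. If β̂ is the vector of MIMIC estimates with estimated covariance matrix Σ̂, and g is the conversion function from MIMIC parameters to an IRT parameter, the approximate variance of the converted estimate is the usual first-order expansion:

```latex
\operatorname{Var}\bigl(g(\hat{\beta})\bigr)
  \;\approx\; \nabla g(\hat{\beta})^{\top}\,
  \widehat{\Sigma}_{\hat{\beta}}\,
  \nabla g(\hat{\beta})
```

The article's contribution is working out the specific partial derivatives for the MIMIC-to-IRT conversion; the expression above is only the generic template, not the article's formula.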

Journal ArticleDOI
TL;DR: Findings underscore the importance of efforts to generate culture-fair measurement devices, as culture- Fair assessments may attenuate, but not eliminate, group differences in assessed cognition due to the incommensurate action of background variables.
Abstract: This study was undertaken to determine if the difference in assessed cognition between Black/African-American and White older adults was due to differential item functioning (DIF) and/or differences in the effect of background variables. Participants were 15,257 adults aged 50 and older surveyed in the Study of Asset and Health Dynamics of the Oldest Old (AHEAD) and the Health and Retirement Study (HRS). The cognitive measure was a modified telephone interview for cognitive status. The analytic strategy was a multiple-group structural equation model grounded in item response theory. Results suggest that most (89%) of the group difference could be attributed to measurement or structural differences, the remainder being not significantly different from zero (p = 0.193). Most items displayed racial DIF, accounting for most of the group difference. After controlling for DIF, the group difference that remained could be attributed to heterogeneity in the effect of background variables. For example, low education was more deleterious for Black/African-Americans, and high income conferred an advantage only for Whites. These findings underscore the importance of efforts to generate culture-fair measurement devices. However, culture-fair assessments may attenuate, but not eliminate, group differences in assessed cognition due to the incommensurate action of background variables.

01 Nov 2003
TL;DR: Analyses of item functioning on linear forms suggested a high level of isomorphicity across items within models, which provides a promising first step toward significant cost reduction and theoretical improvement in test creation methodology for educational assessment.
Abstract: The goal of this study was to assess the feasibility of an approach to adaptive testing using item models based on the quantitative section of the Graduate Record Examination (GRE) test. An item model is a means of generating items that are isomorphic, that is, equivalent in content and equivalent psychometrically. Item models, like items, are calibrated by fitting an IRT response model. The resulting set of parameter estimates is imputed to all the items generated by the model. An on-the-fly adaptive test tailors the test to examinees and presents instances of an item model rather than independently developed items. A simulation study was designed to explore the effect an on-the-fly test design would have on score precision and bias as a function of the level of item model isomorphicity. In addition, two types of experimental tests were administered – an experimental, on-the-fly, adaptive quantitative-reasoning test as well as an experimental quantitative-reasoning linear test consisting of items based on item models. Results of the simulation study showed that under different levels of isomorphicity, there was no bias, but precision of measurement was eroded at some level. However, the comparison of experimental, on-the-fly adaptive test scores with the GRE test scores closely matched the test-retest correlation observed under operational conditions. Analyses of item functioning on the experimental linear test forms suggested that a high level of isomorphicity across items within models was achieved. The current study provides a promising first step toward significant cost reduction and theoretical improvement in test creation methodology for educational assessment.

Journal ArticleDOI
TL;DR: In this paper, a Bayesian posterior log odds ratio index is proposed for detecting the use of item preknowledge in continuous testing, where the test taker's responses deviate from the underlying item response theory (IRT) model.
Abstract: With the increased use of continuous testing in computerized adaptive testing, new concerns about test security have evolved, such as how to ensure that items in an item pool are safeguarded from theft. In this article, procedures to detect test takers using item preknowledge are explored. When test takers use item preknowledge, their item responses deviate from the underlying item response theory (IRT) model, and estimated abilities may be inflated. This deviation may be detected through the use of person-fit indices. A Bayesian posterior log odds ratio index is proposed for detecting the use of item preknowledge. In this approach to person fit, the estimated probability that each test taker has preknowledge of items is updated after each item response. These probabilities are based on the IRT parameters, a model specifying the probability that each item has been memorized, and the test taker’s item responses. Simulations based on an operational computerized adaptive test (CAT) pool are used to demonstra...
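The sequential updating this abstract describes is a standard Bayes computation: after each response, the log odds of preknowledge are shifted by the log likelihood ratio of that response under "has preknowledge" versus "responds per the IRT model". A sketch; the specific probabilities (`p_memorized`, `p_compromised`) and the mixture form are illustrative assumptions, not the article's model:

```python
import math

def update_log_odds(log_odds, response, p_irt,
                    p_memorized=0.9, p_compromised=0.3):
    """One sequential Bayes update of the log odds that a test taker has item
    preknowledge, after observing a single scored response (1/0).
    p_irt: probability of a correct response under the fitted IRT model.
    p_compromised: assumed probability this item has leaked; p_memorized:
    assumed probability of answering a leaked item correctly from memory."""
    # Response probability for an examinee with preknowledge: a mixture over
    # whether this particular item was among the leaked ones.
    p_pre = p_compromised * p_memorized + (1 - p_compromised) * p_irt
    lr = p_pre / p_irt if response == 1 else (1 - p_pre) / (1 - p_irt)
    return log_odds + math.log(lr)

# A correct answer from a low-ability examinee (small p_irt) is evidence
# toward preknowledge; an incorrect answer is evidence against it.
print(update_log_odds(0.0, response=1, p_irt=0.2) > 0)  # True
print(update_log_odds(0.0, response=0, p_irt=0.2) < 0)  # True
```

Running this update over a whole response string accumulates the posterior log odds ratio the article uses as a person-fit index.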

Journal ArticleDOI
TL;DR: In this article, the Liu-Agresti (1996) Mantel-Haenszel-type estimator of a common odds ratio for several 2 × J tables, where the J columns are ordinal levels of a response variable, is applied to detecting differential item functioning.
Abstract: Liu and Agresti (1996) proposed a Mantel-Haenszel-type (Mantel & Haenszel, 1959) estimator of a common odds ratio for several 2 × J tables, where the J columns are ordinal levels of a response variable. This article applies the Liu-Agresti estimator to the case of assessing differential item functioning (DIF) in items having an ordinal response variable. A simulation study was conducted to investigate the accuracy of the Liu-Agresti estimator in relation to other statistical DIF detection procedures. The results of the simulation study indicate that the Liu-Agresti estimator is a viable alternative to other DIF detection statistics.
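For context, the dichotomous Mantel-Haenszel common odds ratio that Liu and Agresti generalize to ordinal 2 × J tables is computed directly from the stratified 2 × 2 tables (group by correct/incorrect, one table per matched score level). A small sketch with made-up counts:

```python
def mantel_haenszel_or(tables):
    """Mantel-Haenszel common odds ratio across K strata of 2x2 tables.
    Each table is (a, b, c, d): reference correct/incorrect,
    focal correct/incorrect, at one matched score level."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

# Two score strata; in each, the within-stratum odds ratio is exactly 2,
# i.e. the reference group has twice the odds of a correct answer.
strata = [(40, 10, 30, 15), (20, 20, 10, 20)]
print(mantel_haenszel_or(strata))  # ~2.0, flagging uniform DIF against the focal group
```

The Liu-Agresti statistic studied in the article replaces each 2 × 2 table with a 2 × J table of ordinal response levels, but the pooling-over-strata idea is the same.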

Journal ArticleDOI
TL;DR: In this article, the authors present an analytical derivation for the mathematical form of an average between-test overlap index as a function of the item exposure index, for fixed-length computerized adaptive tests (CATs).
Abstract: The purpose of this article is to present an analytical derivation for the mathematical form of an average between-test overlap index as a function of the item exposure index, for fixed-length computerized adaptive tests (CATs). This algebraic relationship is used to investigate the simultaneous control of item exposure at both the item and test levels. The results indicate that, in fixed-length CATs, control of the average between-test overlap is achieved via the mean and variance of the item exposure rates of the items that constitute the CAT item pool. The mean of the item exposure rates is easily manipulated. Control over the variance of the item exposure rates can be achieved via the maximum item exposure rate (rmax). Therefore, item exposure control methods which implement a specification of rmax (e.g., Sympson & Hetter, 1985) provide the most direct control at both the item and test levels.

Journal ArticleDOI
TL;DR: To investigate the pan‐European, cross‐cultural validity of the EuroQol‐5D for assessing quality of life in the Schizophrenia Outpatient Health Outcomes (SOHO) Study.
Abstract: Objective: To investigate the pan-European, cross-cultural validity of the EuroQol-5D (EQ-5D) for assessing quality of life in the Schizophrenia Outpatient Health Outcomes (SOHO) Study. Method: The EQ-5D items investigated were mobility, self-care, usual activities, pain/discomfort and anxiety/depression. A Rasch rating scale model was used to identify invariance of item calibrations (differential item functioning) across the 10 European countries participating in the SOHO study. Results: There was general congruence in the EQ-5D item calibration pattern. The rank of average EQ-5D item calibrations was similar for all countries except Denmark. Denmark showed slight misfits for mobility and pain/discomfort. Conclusion: The EQ-5D is an appropriate measure of health-related quality of life across European countries and translations.

Book ChapterDOI
01 Jan 2003
TL;DR: This article argued that social values have meaning and force outside of measurement wherever evaluative judgments and decisions are made, and that validity, reliability, comparability, and fairness are not just measurement issues.
Abstract: “Validity, reliability, comparability, and fairness are not just measurement issues, but social values that have meaning and force outside of measurement wherever evaluative judgments and decisions are made” (Messick, 1994, p. 2).

Journal ArticleDOI
TL;DR: In this article, a two-stage methodology for evaluating differential item functioning (DIF) in large-scale state assessment data was explored, and the findings illustrated the merit of iterative approaches for DIF detection.
Abstract: In differential item functioning (DIF) studies, examinees from different groups are typically ability matched, and then one or more statistical indices are used to compare performance on a set of test items. Typically, matching is on total test score (a criterion both observable and easily accessible), but it may be limited in value because if DIF is present, it is likely to distort test scores and potentially confound any item performance differences. Thus, some researchers have advocated iterative approaches for DIF detection. In this article, a two-stage methodology for evaluating DIF in large-scale state assessment data was explored. The findings illustrated the merit of iterative approaches for DIF detection. Items being flagged as DIF in the second stage were not necessarily the same items identified as DIF in the first stage and vice versa, and this finding was directly related to the amount of DIF found in the Stage 1 analyses.

Journal ArticleDOI
TL;DR: Adaptation of the Rheumatoid Arthritis Quality of Life questionnaire for use in Turkey was successful and can be used in both national and international studies for cross-cultural comparison with the UK, as long as adjustments are made for the few items displaying DIF for culture.
Abstract: Objective. The aim of this study was to adapt the Rheumatoid Arthritis Quality of Life (RAQoL) questionnaire for use in Turkey and to test its reliability and validity. Methods. The translation process included the recent guidelines for cross-cultural adaptation. Reliability of the Turkish RAQoL was assessed by internal consistency and test-retest reliability, internal construct validity by Rasch analysis, and external construct validity by associations with impairments, disability, and general health status. Cross-cultural validity was tested through analysis of differential item functioning (DIF) by comparison with data from the UK version of the RAQoL. Results. Reliability of the adapted version was good, with high internal consistency (Cronbach's alpha 0.95 and 0.96 at times 1 and 2, respectively) and test-retest reliability (Spearman's rho 0.874). Internal construct validity was confirmed by excellent fit to the Rasch model (mean item fit 0.236, SD 1.113) and external construct validity by expected associations. The DIF for culture was found in four items. Conclusions. Adaptation of the RAQoL for use in Turkey was successful. The instrument can be used in both national and international studies for cross-cultural comparison with the UK, as long as adjustments are made for the few items displaying DIF for culture.

Journal ArticleDOI
TL;DR: In this article, a panel of translators and researchers revised the target-language DIF items detected in that study, and the revised items were then readministered for cross-validation purposes.
Abstract: When a test is translated from a source language into a target language, the 2 tests are generally not psychometrically equivalent. Differential item functioning (DIF) analysis can facilitate an understanding of the differences between the tests in the 2 languages and can help translators to revise translated items to produce non-DIF (or lower DIF) items. Based on the reasons found for DIF in Allalouf, Hambleton, and Sireci (1999; differences in word difficulty, item format, content, and cultural relevance), a panel of translators and researchers revised the target-language DIF items detected in that study. The revised items were then readministered. Results showed that the revisions succeeded in reducing DIF considerably. For cross-validation purposes, the original translated items were readministered and the study results were generally validated. An attempt was made to determine which sources of DIF and which item types could be revised most effectively. The revision process is expensive, so considerat...

Journal ArticleDOI
TL;DR: Combined with previous work with the ADS in treatment-seeking alcoholics, the mapping of ADS item severities suggests a continuum of alcohol problem severity, from heavy drinking to severe withdrawal, that may be reliably tapped with dichotomous items.

Journal ArticleDOI
TL;DR: In this article, a Bayesian hierarchical model is proposed to analyze data involving item families with multiple-choice items, and the family expected response function (FERF) is introduced as a way to summarize the probability of a correct response to an item randomly generated from an item family.
Abstract: Item families, which are groups of related items, are becoming increasingly popular in complex educational assessments. For example, in automatic item generation (AIG) systems, a test may consist of multiple items generated from each of a number of item models. Item calibration or scoring for such an assessment requires fitting models that can take into account the dependence structure inherent among the items that belong to the same item family. Glas and van der Linden (2001) suggest a Bayesian hierarchical model to analyze data involving item families with multiple-choice items. We fit the model using the Markov Chain Monte Carlo (MCMC) algorithm, introduce the family expected response function (FERF) as a way to summarize the probability of a correct response to an item randomly generated from an item family, and suggest a way to estimate the FERFs. This work is thus a step towards creating a tool that can save a significant amount of resources in educational testing, by allowing proper analysis and summ...

Journal ArticleDOI
TL;DR: In this article, a computer simulation study was conducted to determine the effect of using an iterative or noniterative multinomial logistic regression analysis (MLR) to detect differential item functioning (DIF) in polytomous items.
Abstract: We conducted a computer simulation study to determine the effect of using an iterative or noniterative multinomial logistic regression analysis (MLR) to detect differential item functioning (DIF) in polytomous items. A simple iteration in which ability is defined as the total observed score on the test is compared with a two-step MLR in which the ability was purified by eliminating the DIF items. Data were generated to simulate several biased tests. The factors manipulated were: DIF effect size (0.5, 1.0, and 1.5), percentage of DIF items in the test (0%, 10%, 20% and 30%), DIF type (uniform and nonuniform) and sample size (500, 1000 and 2000). Item scores were generated using the graded response model. The MLR procedures were consistently able to detect both uniform and nonuniform DIF. When the two-step MLR procedure was used, the false-positive rate (the proportion of non-DIF items that were detected as DIF) decreased and the correct identification rate increased slightly. The purification process r...
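The two-step purification this abstract compares against the simple approach can be sketched generically: flag items using the raw total score as the matching variable, then re-run the test with a total score purged of the flagged items. Here `flag_dif` is a hypothetical stand-in for whatever DIF test is plugged in (the MLR test in the article, or any other); only the loop structure is the point:

```python
def two_step_dif(items, responses, flag_dif):
    """Two-step purification sketch. `items` indexes the test's items;
    `responses` is one score vector per examinee; `flag_dif(matching, i)`
    is any DIF test returning True when item i is flagged (assumed, not
    specified by the article)."""
    # Step 1: total observed score over all items as the matching variable.
    totals = [sum(r) for r in responses]
    flagged = {i for i in items if flag_dif(totals, i)}
    # Step 2: purified matching score excludes the items flagged in step 1,
    # so the final DIF decisions are not contaminated by the DIF items.
    purified = [sum(x for j, x in enumerate(r) if j not in flagged)
                for r in responses]
    return {i for i in items if flag_dif(purified, i)}

# Toy demo with a stand-in test that always flags item 0.
print(two_step_dif(range(3), [[1, 0, 1], [0, 1, 1]],
                   lambda scores, i: i == 0))  # {0}
```

The abstract's finding is about exactly this loop: recomputing the matching variable in step 2 lowers the false-positive rate relative to the one-pass version.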

Journal ArticleDOI
TL;DR: This paper used an ecological approach to investigate the proximal relationship between raters and test takers in a science version of a high-stakes, topic-based language test, revealing a source of test bias that might otherwise have remained undetected by traditional methods.
Abstract: Researchers from a range of disciplines (Faigley, 1995; Hamp-Lyons, 1994; Huot, 1996; Kirshner & Whitson, 1997; Lemke, 1995) have argued that to draw valid inferences from tests, more than the analysis of factors in isolation (e.g., items-test scores, rater consistency) should be investigated. The complex, interconnected, or "proximal"(Bronfenbrenner, 1994) relationships that form as a result of the processes and practices of testing activity should also become a focus of investigation. The study reported here is offered as an example of the application of an ecological approach in the investigation of the proximal relationship of raters (n = 12) and test takers (n = 423), formed during the trialling of a new science version of a high-stakes, topic-based language test. The study reveals a source of test bias that might otherwise have remained undetected by traditional methods, such as those that are common to differential item functioning studies. Systematically focusing on stakeholders' accounts of tests...


Journal Article
TL;DR: The concept of a behavior domain is a reasonable and essential foundation for psychometric work based on true score theory, the linear model of common factor analysis, and the nonlinear models of item response theory as discussed by the authors.
Abstract: The concept of a behavior domain is a reasonable and essential foundation for psychometric work based on true score theory, the linear model of common factor analysis, and the nonlinear models of item response theory. Investigators applying these models to test data generally treat the true scores or factors or traits as abstractive psychological attributes: common properties of the items, possibly with some inconsistency between their practice and their theoretical statements. A countably infinite item domain defines an attribute uniquely, and a function of the domain item scores gives an identified measure of it, to be estimated from a finite set of item scores, with a defined error of measurement. In test development the investigator must consider and justify the assumption that an item domain exists for the specific measurement application and is large enough to be treated as infinite for that application.

Journal ArticleDOI
TL;DR: In this article, a large item banks with properly calibrated test items are essential for ensuring the validity of computer-based tests, at the same time, item calibrations with small samples are desirable to minimiz...
Abstract: Large item banks with properly calibrated test items are essential for ensuring the validity of computer-based tests. At the same time, item calibrations with small samples are desirable to minimiz...