
Showing papers on "Differential item functioning published in 1999"


Book
01 Jul 1999
TL;DR: In this book, the authors introduce the concept of a scale, treat test homogeneity, reliability, and generalizability for total test scores, and present some scaling theory for test scores.
Abstract: Contents: General Introduction. Items and Item Scores. Item and Test Statistics. The Concept of a Scale. Reliability Theory for Total Test Scores. Test Homogeneity, Reliability, and Generalizability. Reliability--Applications. Prediction and Multiple Regression. The Common Factor Model. Validity. Classical Item Analysis. Item Response Models. Properties of Item Response Models. Multidimensional Item Response Models. Comparing Populations. Alternate Forms and the Problem of Equating. An Introduction to Structural Equation Modeling. Some Scaling Theory. Retrospective. Appendix: Some Rules for Expected Values.

2,928 citations



Journal ArticleDOI
TL;DR: In this paper, a method for studying DIF is demonstrated that can be used with either dichotomous or polytomous items, and the method is shown to be valid for data that follow a partial credit IRT model.
Abstract: In this paper a method for studying DIF is demonstrated that can be used with either dichotomous or polytomous items. The method is shown to be valid for data that follow a partial credit IRT model. It is also shown that logistic regression gives results equivalent to those of the proposed method. In a simulation study, positively biased Type I error rates of the method are shown to be in accord with results from previous studies; however, the size of the bias in the log odds is moderate. Finally, it is demonstrated how these statistics can be used to study DIF variability with the method of Longford, Holland, & Thayer (1993). Much work has been done in recent years in the area of differential item functioning (DIF). Identifying items that exhibit DIF is a preliminary step in the current practice of assessing item and test bias. The ultimate rationale is that removal or modification of biased items will improve the validity of a test, and in conjunction with more direct assessments of validity, will result in a test that is fair to all groups of examinees. This approach to DIF has an intrinsic statistical problem: because many analyses have been based on estimation and testing for individual items, characteristics of clusters of items or of the test as a whole may go unnoticed. Rubin (1988) suggested that in addition to a DIF estimate for each item, a measure of the variability of DIF across items would be desirable for addressing the question of whether or not the measure of DIF in any single item was important when compared to the overall test. He proposed that methods be devised to handle all items simultaneously. A number of statistical methods have been devised to address DIF more holistically than the one-item-at-a-time approach. These methods fall into a category termed differential test functioning (DTF) to distinguish them from differential item functioning (DIF). Within the DTF category, three major subdivisions can be identified. First, DTF may be obtained as the expected signed or unsigned difference between two test (or subtest) response functions. Signed DIF may accumulate across a selected group of items; for example, if all items

620 citations


Journal ArticleDOI
TL;DR: This paper demonstrates Markov chain Monte Carlo techniques that are particularly well-suited to complex models with item response theory (IRT) assumptions, and develops a MCMC methodology, based on Metropolis-Hastings sampling, that can be routinely implemented to fit novel IRT models.
Abstract: This paper demonstrates Markov chain Monte Carlo (MCMC) techniques that are particularly well-suited to complex models with item response theory (IRT) assumptions. MCMC may be thought of as a successor to the standard practice of first calibrating the items using E-M methods and then taking the item parameters to be known and fixed at their calibrated values when proceeding with inference regarding the latent trait. In contrast to this two-stage E-M approach, MCMC methods treat item and subject parameters at the same time; this allows us to incorporate standard errors of item estimates into trait inferences, and vice versa. We develop an MCMC methodology, based on Metropolis-Hastings sampling, that can be routinely implemented to fit novel IRT models, and we compare the algorithmic features of the Metropolis-Hastings approach to other approaches based on Gibbs sampling. For concreteness we illustrate the methodology using the familiar two-parameter logistic (2PL) IRT model; more complex models are treated in a subsequent paper (Patz & Junker, in press). Item response theory (IRT; Lord, 1980) models have greatly extended the data analytic reach of psychometricians, social scientists, and educational measurement specialists. The scaling and measurement applications of parametric IRT models in educational measurement are well known in areas ranging from basic questions about item quality and ability estimation (e.g., Linn, 1989) to equating (Holland & Rubin, 1982), differential item functioning (Holland & Wainer,
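To make the sampling scheme concrete, here is a minimal random-walk Metropolis-Hastings sketch for the 2PL model in Python. It is not the authors' implementation: the priors, proposal scales, chain length, and the simulated data are all illustrative assumptions.

```python
# Minimal random-walk Metropolis-Hastings sketch for the 2PL IRT model.
# Illustrative only: priors (theta ~ N(0,1), log a ~ N(0, 0.5^2), b ~ N(0, 2^2)),
# proposal scales, and the simulated data are assumptions, not the paper's settings.
import numpy as np

rng = np.random.default_rng(0)

def p2pl(theta, a, b):
    """P(correct) under the 2PL, returned as an N x J matrix."""
    return 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))

def loglik(X, theta, a, b):
    P = np.clip(p2pl(theta, a, b), 1e-9, 1 - 1e-9)
    return X * np.log(P) + (1 - X) * np.log(1 - P)

# Simulated responses (N persons, J items).
N, J = 500, 10
theta_true = rng.normal(0, 1, N)
a_true = rng.uniform(0.8, 2.0, J)
b_true = rng.normal(0, 1, J)
X = (rng.random((N, J)) < p2pl(theta_true, a_true, b_true)).astype(float)

theta, a, b = np.zeros(N), np.ones(J), np.zeros(J)
s_theta, s_item = 0.5, 0.15          # random-walk proposal scales (arbitrary tuning)
draws = []

for it in range(2000):
    # Person step: independent Metropolis updates (the posterior factorizes over persons).
    theta_prop = theta + rng.normal(0, s_theta, N)
    cur = loglik(X, theta, a, b).sum(axis=1) - 0.5 * theta**2
    prop = loglik(X, theta_prop, a, b).sum(axis=1) - 0.5 * theta_prop**2
    acc = np.log(rng.random(N)) < (prop - cur)
    theta = np.where(acc, theta_prop, theta)

    # Item step: symmetric random walk on (log a, b), so no Jacobian term is needed.
    la = np.log(a)
    la_prop = la + rng.normal(0, s_item, J)
    b_prop = b + rng.normal(0, s_item, J)
    cur = loglik(X, theta, a, b).sum(axis=0) - la**2 / (2 * 0.5**2) - b**2 / (2 * 2.0**2)
    prop = (loglik(X, theta, np.exp(la_prop), b_prop).sum(axis=0)
            - la_prop**2 / (2 * 0.5**2) - b_prop**2 / (2 * 2.0**2))
    acc = np.log(rng.random(J)) < (prop - cur)
    a = np.where(acc, np.exp(la_prop), a)
    b = np.where(acc, b_prop, b)

    if it >= 1000:                    # discard burn-in
        draws.append((a.copy(), b.copy()))

a_hat = np.mean([d[0] for d in draws], axis=0)
b_hat = np.mean([d[1] for d in draws], axis=0)
print("posterior mean a:", np.round(a_hat, 2))
print("posterior mean b:", np.round(b_hat, 2))
```

Because item and person parameters are updated within the same chain, posterior summaries for the latent trait automatically reflect uncertainty in the item parameters, which is the contrast with the two-stage E-M-then-fixed-parameters practice described above.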

506 citations


Journal ArticleDOI
TL;DR: In this article, a multistage adaptive testing approach that factors a into the item selection process is proposed, where the items in the item bank are stratified into a number of levels based on their a values.
Abstract: Computerized adaptive tests (CAT) commonly use item selection methods that select the item which provides maximum information at an examinee's estimated trait level. However, these methods can yield extremely skewed item exposure distributions. For tests based on the three-parameter logistic model, it was found that administering items with low discrimination parameter (a) values early in the test and administering those with high a values later was advantageous; the skewness of item exposure distributions was reduced while efficiency was maintained in trait level estimation. Thus, a new multistage adaptive testing approach is proposed that factors a into the item selection process. In this approach, the items in the item bank are stratified into a number of levels based on their a values. The early stages of a test use items with lower a values and later stages use items with higher a values. At each stage, items are selected according to an optimization criterion from the corresponding level. Simulation studies were ...
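The stratified selection rule can be sketched in a few lines. This is a schematic illustration under assumed data structures (a hypothetical item bank, equal-sized strata, and no content constraints or scoring step), not the full procedure evaluated in the simulations.

```python
# Schematic sketch of a-stratified multistage item selection (assumed data layout;
# simplified relative to the proposed procedure, e.g., no content balancing).
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical item bank: discrimination (a) and difficulty (b) parameters.
n_items = 300
bank = {"a": rng.lognormal(0.0, 0.3, n_items), "b": rng.normal(0, 1, n_items)}

n_strata, items_per_stage = 4, 10
order = np.argsort(bank["a"])               # stratify the bank into strata of increasing a
strata = np.array_split(order, n_strata)

theta_hat = 0.0                              # provisional trait estimate (updating omitted)
administered = set()
test = []

for stage in range(n_strata):
    for _ in range(items_per_stage):
        # Within the current stratum, pick the unused item whose b is closest to theta_hat.
        candidates = [j for j in strata[stage] if j not in administered]
        j = min(candidates, key=lambda k: abs(bank["b"][k] - theta_hat))
        administered.add(j)
        test.append(j)
        # ... administer item j, score the response, and update theta_hat here ...

print("items used per stage:",
      [test[i * items_per_stage:(i + 1) * items_per_stage] for i in range(n_strata)])
```

The design idea is that low-a items, which are informative over a wide trait range, are spent early while the trait estimate is still rough, and high-a items are saved for later stages where they can be targeted precisely; this is what flattens the exposure distribution while preserving estimation efficiency.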

277 citations


Journal ArticleDOI
TL;DR: The authors used a Bayesian approach to estimate the probabilities that the true DIF for an item falls into the A, B, or C categories (the True DIF method in our terminology) or to
Abstract: future observed status. DIF status is expressed in terms of the probabilities associated with each of the five DIF levels defined by the ETS classification system: C-, B-, A, B+, and C+. The EB methods yield more stable DIF estimates than do conventional methods, especially in small samples, which is advantageous in computer-adaptive testing. The EB approach may also convey information about DIF stability in a more useful way by representing the state of knowledge about an item's DIF status as probabilistic. The results of a Mantel-Haenszel (MH; Mantel & Haenszel, 1959) analysis of differential item functioning (DIF) typically include an index of the magnitude of DIF, along with an estimated standard error (see Holland & Thayer, 1988). In making decisions about whether to discard items or flag them for review, however, testing companies sometimes rely on categorical ratings of the severity of DIF. Educational Testing Service (ETS) has a system for categorizing DIF as negligible ("A"), slight to moderate ("B") or moderate to severe ("C") based on both the magnitude of the DIF index and the statistical significance of the results. A disadvantage of this classification system is that it sometimes conveys the notion that an item's DIF category is deterministic. A possible solution is to use a Bayesian approach to estimate the probabilities that the true DIF for an item falls into the A, B, or C categories (the True DIF method in our terminology) or to
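For reference, the A/B/C rules are usually described in terms of the MH D-DIF statistic and its standard error. The sketch below is a common paraphrase of those rules (delta-scale thresholds of 1.0 and 1.5 and 5% tests); treat the exact cutoffs and test directions as assumptions rather than ETS's official specification.

```python
# A sketch of the ETS A/B/C DIF classification rules as they are commonly described
# (thresholds and significance tests here are a paraphrase, not ETS's exact specification).
from scipy.stats import norm

def ets_category(mh_d_dif: float, se: float, alpha: float = 0.05) -> str:
    """Classify an item from its MH D-DIF estimate and standard error."""
    z0 = abs(mh_d_dif) / se                      # test of D-DIF = 0
    z1 = (abs(mh_d_dif) - 1.0) / se              # test of |D-DIF| = 1
    if abs(mh_d_dif) < 1.0 or z0 < norm.ppf(1 - alpha / 2):
        cat = "A"                                # negligible DIF
    elif abs(mh_d_dif) >= 1.5 and z1 > norm.ppf(1 - alpha):
        cat = "C"                                # moderate to large DIF
    else:
        cat = "B"                                # slight to moderate DIF
    return cat if cat == "A" else cat + ("+" if mh_d_dif > 0 else "-")

print(ets_category(-1.8, 0.30))   # e.g. "C-"
print(ets_category(0.4, 0.25))    # "A"
```

The EB approach described in the abstract replaces this deterministic assignment with posterior probabilities that the item's true DIF falls in each category.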

195 citations


Journal ArticleDOI
TL;DR: The authors found that 34% of the items functioned differentially across languages, mostly in favor of the Russian-speaking examinees, and the main reasons for DIF were changes in word difficulty, changes in item format, differences in cultural relevance, and changes in content.
Abstract: Translated tests are being used increasingly for assessing the knowledge and skills of individuals who speak different languages. There is little research exploring why translated items sometimes function differently across languages. If the sources of differential item functioning (DIF) across languages could be predicted, this could have important implications for test development, scoring, and equating. This study focuses on two questions: "Is DIF related to item type?" and "What are the causes of DIF?" The data were taken from the Israeli Psychometric Entrance Test in Hebrew (source) and Russian (translated). The results indicated that 34% of the items functioned differentially across languages. The analogy items were the most problematic, with 65% showing DIF, mostly in favor of the Russian-speaking examinees. The sentence completion items were also a problem (45% DIF). The main reasons for DIF were changes in word difficulty, changes in item format, differences in cultural relevance, and changes in content.

150 citations


Journal ArticleDOI
TL;DR: In this article, the authors examined the polytomous-DFIT framework and found that it was effective in identifying DTF and DIF for the simulated conditions, but the DTF index did not perform as consistently as the DIF index.
Abstract: Raju, van der Linden, & Fleer (1995) proposed an item response theory based, parametric differential item functioning (DIF) and differential test functioning (DTF) procedure known as differential functioning of items and tests (DFIT). According to Raju et al., the DFIT framework can be used with unidimensional and multidimensional data that are scored dichotomously and/or polytomously. This study examined the polytomous-DFIT framework. Factors manipulated in the simulation were: (1) length of test (20 and 40 items), (2) focal group distribution, (3) number of DIF items, (4) direction of DIF, and (5) type of DIF. The findings provided promising results and indicated directions for future research. The polytomous DFIT framework was effective in identifying DTF and DIF for the simulated conditions. The DTF index did not perform as consistently as the DIF index. The findings are similar to those of unidimensional and multidimensional DFIT studies.
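For orientation, the DFIT indices are usually defined from group differences in expected item scores evaluated over the focal group's trait distribution; the block below is a sketch of those definitions as they are commonly presented (notation assumed), not a restatement of Raju et al.'s exact equations.

```latex
% DFIT quantities for item i, with expectations taken over the focal group's theta distribution.
% ES_{iF}, ES_{iR}: expected item scores under the focal- and reference-group parameters.
\[
d_i(\theta) = ES_{iF}(\theta) - ES_{iR}(\theta), \qquad
D(\theta) = \sum_i d_i(\theta)
\]
\[
\mathrm{NCDIF}_i = E_F\!\left[d_i(\theta)^2\right], \qquad
\mathrm{CDIF}_i = E_F\!\left[d_i(\theta)\, D(\theta)\right], \qquad
\mathrm{DTF} = E_F\!\left[D(\theta)^2\right] = \sum_i \mathrm{CDIF}_i
\]
```

Because DTF is the sum of the compensatory CDIF terms, item-level DIF can cancel or accumulate at the test level, which is why the DTF and DIF indices need not behave identically in the simulation.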

115 citations


Journal ArticleDOI
TL;DR: In this paper, the authors investigated the power and Type I error rate of the likelihood ratio goodness-of-fit (LR) statistic in detecting differential item functioning (DIF) under Samejima's (1969, 1972) graded response model.
Abstract: The purpose of this study was to investigate the power and Type I error rate of the likelihood ratio goodness-of-fit (LR) statistic in detecting differential item functioning (DIF) under Samejima's (1969, 1972) graded response model. A multiple-replication Monte Carlo study was utilized in which DIF was modeled in simulated data sets, which were then calibrated with MULTILOG (Thissen, 1991) using hierarchically nested item response models. In addition, the power and Type I error rate of the Mantel (1963) approach for detecting DIF in ordered response categories were investigated using the same simulated data, for comparative purposes. The power of both the Mantel and LR procedures was affected by sample size, as expected. The LR procedure lacked the power to consistently detect DIF when it existed in reference/focal groups with sample sizes as small as 500/500. The Mantel procedure maintained control of its Type I error rate and was more powerful than the LR procedure when the comparison group ability distributions were identical and there was a constant DIF pattern. On the other hand, the Mantel procedure lost control of its Type I error rate, whereas the LR procedure did not, when the comparison groups differed in mean ability; and the LR procedure demonstrated a profound power advantage over the Mantel procedure under conditions of balanced DIF in which the comparison group ability distributions were identical. The choice and subsequent use of any procedure requires a thorough understanding of its power and Type I error rates under varying conditions of DIF pattern, comparison group ability distributions (or, as a surrogate, observed score distributions), and item characteristics.
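The likelihood-ratio comparison itself is simple once the nested calibrations are available: a compact model constrains the studied item's parameters to be equal across groups, an augmented model frees them, and twice the log-likelihood difference is referred to a chi-square distribution. A minimal sketch follows; the log-likelihood values and degrees of freedom are placeholders (in the study the calibrations came from MULTILOG).

```python
# Generic IRT likelihood-ratio DIF comparison. The compact model constrains the studied
# item's parameters equal across groups; the augmented model frees them. The log-likelihoods
# below are placeholders obtained from an IRT calibration program; df = number of freed parameters.
from scipy.stats import chi2

def lr_dif_test(loglik_compact: float, loglik_augmented: float, df: int, alpha: float = 0.05):
    g2 = -2.0 * (loglik_compact - loglik_augmented)   # G^2, asymptotically chi-square(df)
    p = chi2.sf(g2, df)
    return g2, p, p < alpha

# Hypothetical example: freeing a graded-response item's slope and four thresholds (df = 5).
g2, p, flagged = lr_dif_test(-10523.4, -10514.1, df=5)
print(f"G2 = {g2:.1f}, p = {p:.4f}, DIF flagged: {flagged}")
```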

79 citations


Journal ArticleDOI
TL;DR: In this paper, the effects of the sample size ratio (SSR), the latent trait distribution (LD), and the amount of item information on the accuracy of item parameter estimation in the nominal response model were studied.
Abstract: Establishing guidelines for reasonable item parameter estimation is fundamental to use of the nominal response model. Factors studied were the sample size ratio (SSR), latent trait distribution (LD), and amount of item information. Results showed that the LD accounted for 42.5% of the variability in the accuracy of estimating the slope parameter; the SSR and the maximum item information factors accounted for 29.5% and 3.5% of the variability, respectively. In general, as the LD departed from a normal distribution, a larger number of examinees was required to accurately estimate the slope and intercept parameters. Results indicated that an SSR of 10:1 can produce reasonably accurate item parameter estimates when the LD is normal.
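For context, the nominal response model (Bock, 1972) whose slope and intercept parameters are being estimated can be written, in one common parameterization, as the sketch below (notation assumed).

```latex
% Nominal response model (Bock, 1972) for an item with K response categories,
% slope parameters a_k and intercepts c_k; a common identification constraint is
% \sum_k a_k = \sum_k c_k = 0.
\[
P_k(\theta) = \frac{\exp(a_k \theta + c_k)}{\sum_{h=1}^{K} \exp(a_h \theta + c_h)},
\qquad k = 1, \dots, K
\]
```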

71 citations


Journal ArticleDOI
TL;DR: In this article, the effects of retaining test items manifesting differential item functioning (DIF) on aspects of the measurement quality and validity of that test's scores were investigated using the Mantel-Haenszel procedure, which allows one to detect items that function differently in two groups of examinees at constant levels of the trait.
Abstract: This study investigated effects of retaining test items manifesting differential item functioning (DIF) on aspects of the measurement quality and validity of that test's scores. DIF was evaluated using the Mantel-Haenszel procedure, which allows one to detect items that function differently in two groups of examinees at constant levels of the trait. Multiple composites of DIF- and non-DIF-containing items were created to examine the impact of DIF on the measurement, validity, and predictive relations involving those composites. Criteria used were the American College Testing composite, the Scholastic Aptitude Test (SAT) verbal (SATV), quantitative (SATQ), and composite (SATC), and grade point average rank percentile. Results indicate that the measurement quality of tests is not seriously degraded when items manifesting DIF are retained, even when the number of items in the compared composites has been controlled. Implications of the results are discussed within the framework of multiple determinants of item responses.
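As a concrete reference point, the Mantel-Haenszel procedure used here pools score-stratified 2x2 tables into a common odds ratio, which is often reported on the ETS delta scale as MH D-DIF = -2.35 ln(alpha_MH). Below is a minimal sketch of that computation; the function name and the toy data are illustrative assumptions.

```python
# Sketch of the Mantel-Haenszel common odds ratio and the delta-scale DIF index
# (MH D-DIF = -2.35 * ln(alpha_MH)), computed from score-stratified 2x2 tables.
# Variable names and the toy data are illustrative assumptions.
import numpy as np

def mh_dif(correct, group, matching_score):
    """correct: 0/1 item responses; group: 'R' (reference) or 'F' (focal);
    matching_score: stratifying variable, e.g., total test score."""
    num, den = 0.0, 0.0
    for s in np.unique(matching_score):
        k = matching_score == s
        r, f = (group == "R") & k, (group == "F") & k
        A = np.sum(correct[r] == 1)   # reference, correct
        B = np.sum(correct[r] == 0)   # reference, incorrect
        C = np.sum(correct[f] == 1)   # focal, correct
        D = np.sum(correct[f] == 0)   # focal, incorrect
        T = A + B + C + D
        if T == 0:
            continue
        num += A * D / T
        den += B * C / T
    alpha_mh = num / den
    return alpha_mh, -2.35 * np.log(alpha_mh)

# Toy usage with random data (no DIF built in, so MH D-DIF should be near 0).
rng = np.random.default_rng(2)
n = 2000
group = np.where(rng.random(n) < 0.5, "R", "F")
score = rng.integers(0, 21, n)                   # pretend total scores 0..20
p = 1 / (1 + np.exp(-(score - 10) / 3))          # item depends on the score only
correct = (rng.random(n) < p).astype(int)
print(mh_dif(correct, group, score))
```

Under the coding used here, a negative MH D-DIF indicates an item that favors the reference group after matching; values near zero indicate comparable odds of success for matched reference and focal examinees.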

Journal ArticleDOI
TL;DR: In this paper, the authors derived a general formula for the population parameter being estimated by the Mantel-Haenszel (MH) differential item functioning (DIF) statistic, which is appropriate for either uniform DIF (defined as a difference in item response theory item difficulty values) or non-uniform DIF; and it can be used regardless of the form of the item response function.
Abstract: The present study derives a general formula for the population parameter being estimated by the Mantel-Haenszel (MH) differential item functioning (DIF) statistic. Because the formula is general, it is appropriate for either uniform DIF (defined as a difference in item response theory item difficulty values) or nonuniform DIF; and it can be used regardless of the form of the item response function. In the case of uniform DIF modeled with two-parameter-logistic response functions, the parameter is well known to be linearly related to the difference in item difficulty between the focal and reference groups. Even though this relationship is known to not strictly hold true in the case of three-parameter-logistic (3PL) uniform DIF, the degree of the departure from this relationship has not been known and has been generally believed to be small. By evaluating the MH DIF parameter, we show that for items of medium or high difficulty, the parameter is much smaller in absolute value than expected based on the differ...
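The linear relation referred to for two-parameter-logistic uniform DIF follows from a one-line calculation; the block below sketches the standard argument (notation assumed: scaling constant D and a common discrimination a).

```latex
% 2PL item response functions for the reference (R) and focal (F) groups with a common
% discrimination a and group-specific difficulties b_R, b_F (uniform DIF); D is the usual
% scaling constant (about 1.7).
\[
P_g(\theta) = \frac{1}{1 + \exp\{-D a (\theta - b_g)\}}, \qquad g \in \{R, F\}
\]
\[
\log \frac{P_R(\theta)\,[1 - P_F(\theta)]}{P_F(\theta)\,[1 - P_R(\theta)]}
  = D a (\theta - b_R) - D a (\theta - b_F)
  = D a\,(b_F - b_R)
\]
```

Because this log odds ratio is constant in theta under the 2PL, the population MH parameter reduces to Da(b_F - b_R); with a 3PL lower asymptote c > 0 the ratio varies with theta, which produces the departure the paper quantifies.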

Journal ArticleDOI
TL;DR: If the problem of missing values is solved by layout changes or interview administration, the GH scale appears to be a valid measure of self-rated health in elderly populations; differential item functioning for several items indicates some degree of multidimensionality in the GH scale.
Abstract: The authors used multigroup confirmatory factor analysis and structural equation models to examine the construct validity and item functioning of the five-item General Health (GH) scale from the SF-36 in Danes over 16 years of age (n = 4,084). They included four criterion variables for physical and mental health. Items GH2-GH5 had low response rates among the elderly, probably due to the compact layout of these items in the questionnaire. The authors found differential item functioning for several items, indicating some degree of multidimensionality in the GH scale. Thus, GH1 had stronger associations with age, physical functioning, and chronic diseases than predicted by the one-factor model. However, psychometric problems were mostly found in the youngest age group. If the problem of missing values is solved by layout changes or interview administration, the GH scale appears to be a valid measure of self-rated health in elderly populations.

Journal ArticleDOI
TL;DR: This paper explored methods for detecting gender-based differential item functioning on a 12th-grade constructed-response science test administered as part of the National Education Longitudinal Study of 1988 (NELS:88) and found that gender differences were largest on items that involved visualization and called on knowledge acquired outside of school.
Abstract: In this study, I explored methods for detecting gender-based differential item functioning on a 12th-grade constructed-response (CR) science test administered as part of the National Education Longitudinal Study of 1988 (NELS:88). The primary difficulty encountered with many CR tests is the absence of a reliable and appropriate measure of ability on which to condition. In this study, several combinations of conditioning variables were explored, and results were supplemented with evidence from interviews of students who completed the test items. The study revealed that 1 item in particular displayed a large male advantage and contributed to the gender difference on total score. Results were similar to those obtained with the NELS:88 multiple-choice test. In both cases, gender differences were largest on items that involved visualization and called on knowledge acquired outside of school. Implications for users of large-scale assessment results are discussed.

Journal ArticleDOI
TL;DR: In this paper, a stepwise DIF analysis based on the multiple-group partial credit model was applied to the National Assessment of Educational Progress (NAEP) writing trend data, where uniform and non-uniform items and heterogeneous latent trait distributions were used to generate polytomous responses of multiple groups.
Abstract: Bock, Muraki, and Pfeiffenberger (1988) proposed a dichotomous item response theory (IRT) model for the detection of differential item functioning (DIF), and they estimated the IRT parameters and the means and standard deviations of the multiple latent trait distributions. This IRT DIF detection method is extended to the partial credit model (Masters, 1982; Muraki, 1993) and presented as one of the multiple-group IRT models. Uniform and non-uniform DIF items and heterogeneous latent trait distributions were used to generate polytomous responses of multiple groups. The DIF method was applied to these simulated data using a stepwise procedure. The standardized DIF measures for slope and item location parameters successfully detected the non-uniform and uniform DIF items as well as recovered the means and standard deviations of the latent trait distributions. This stepwise DIF analysis based on the multiple-group partial credit model was then applied to the National Assessment of Educational Progress (NAEP) writing trend data.
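For reference, the partial credit model (Masters, 1982) to which the DIF method is extended is commonly written as follows (category probabilities for item i with m_i + 1 ordered categories; notation assumed).

```latex
% Partial credit model (Masters, 1982) for item i with categories x = 0, 1, ..., m_i and
% step parameters delta_{i1}, ..., delta_{im_i}; the empty sum for x = 0 is defined as 0.
\[
P_{ix}(\theta) =
  \frac{\exp\!\left(\sum_{j=0}^{x} (\theta - \delta_{ij})\right)}
       {\sum_{r=0}^{m_i} \exp\!\left(\sum_{j=0}^{r} (\theta - \delta_{ij})\right)},
\qquad \sum_{j=0}^{0} (\theta - \delta_{ij}) \equiv 0
\]
```

In the multiple-group, generalized form used here, a slope parameter and group-specific latent distributions are added, and DIF appears as group differences in an item's slope or location parameters.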

Journal ArticleDOI
TL;DR: In this article, the authors compared logistic regression and analysis of variance for detecting DIF in dichotomously scored items and found that logistic regression had higher mean detection rates than the analysis of variance method.
Abstract: Differential item functioning (DIF) detection rates were compared between logistic regression and analysis of variance for dichotomously scored items. These two DIF methods were compared using simulated binary item response data sets of varying test length (20, 40, and 60 items), sample size (200, 400, and 600 examinees), discrimination type (fixed and varying), and relative underlying ability (equal and unequal) between groups under conditions of uniform DIF, nonuniform DIF, combination DIF, and false positive errors. These test conditions were replicated 100 times. For both DIF detection methods, a test length of 20 items was sufficient for satisfactory DIF detection with detection rate increasing as sample size increased. With the exception of uniform DIF, the logistic regression method had higher mean detection rates than the analysis of variance method. Because the type of DIF present in real data is rarely known, the logistic regression method is recommended for most practical applications.
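The logistic regression DIF model compared here is conventionally specified with the matching score, group membership, and their interaction as predictors: uniform DIF shows up in the group term, nonuniform DIF in the interaction. Below is a minimal sketch, assuming statsmodels is available and using simulated placeholder data.

```python
# Minimal logistic regression DIF sketch: regress the item response on the matching
# score, group membership, and their interaction. A significant group term indicates
# uniform DIF; a significant interaction indicates nonuniform DIF. The simulated data
# and the use of statsmodels are assumptions for illustration.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(3)
n = 1000
score = rng.normal(0, 1, n)                      # matching variable (e.g., rest score)
group = rng.integers(0, 2, n)                    # 0 = reference, 1 = focal
logit = 1.2 * score - 0.4 * group                # built-in uniform DIF against the focal group
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_full = sm.add_constant(np.column_stack([score, group, score * group]))
full = sm.Logit(y, X_full).fit(disp=0)
reduced = sm.Logit(y, sm.add_constant(score)).fit(disp=0)

g2 = -2 * (reduced.llf - full.llf)               # 2-df omnibus test of group + interaction
print(f"G2(2) = {g2:.2f}, p = {chi2.sf(g2, 2):.4f}")
```

A two-degree-of-freedom test of the group and interaction terms together is a common omnibus screen; one-degree-of-freedom tests of each term separately distinguish uniform from nonuniform DIF.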

Journal ArticleDOI
TL;DR: In this paper, the authors evaluated the connection between gender differences in examinees' familiarity, interest, and negative emotional reactions to items on the Advanced Placement Psychology Examination and the items' gender differential item functioning (DIF).
Abstract: This study evaluated the connection between gender differences in examinees’ familiarity, interest, and negative emotional reactions to items on the Advanced Placement Psychology Examination and the items’ gender differential item functioning (DIF). Gender DIF and gender differences in interest varied appreciably with the content of the items. Gender differences in the three variables were substantially related to the items’ gender DIF (e.g., R = .50). Much of the gender DIF on this test may be attributable to gender differences in these variables.

Journal Article
TL;DR: This article evaluated the equivalence of two translated tests using statistical and judgmental methods and found that the items flagged by the three statistical procedures were relatively consistent, but not identical across the two tests.
Abstract: The purpose of this study was to evaluate the equivalence of two translated tests using statistical and judgmental methods. Performance differences for a large random sample of English- and French-speaking examinees were compared on a grade 6 mathematics and social studies provincial achievement test. Items displaying differential item functioning (DIF) were flagged using three popular statistical methods (Mantel-Haenszel, the Simultaneous Item Bias Test, and logistic regression), and the substantive meaning of these items was studied by comparing the back-translated form with the original English version. The items flagged by the three statistical procedures were relatively consistent, but not identical across the two tests. The correlations between the DIF effect size measures were also strong, but far from perfect, suggesting that two procedures should be used to screen items for translation DIF. To identify the DIF items with translation differences, the French items were back-translated into English and compared with the original English items by three reviewers. Two of seven and six of 26 DIF items in mathematics and social studies, respectively, were judged to be nonequivalent across language forms due to differences introduced in the translation process. There were no apparent translation differences for the remaining items, revealing the necessity for further research on the sources of translation differential item functioning. Results from this study provide researchers and practitioners with a better understanding of how three popular DIF statistical methods compare and contrast. The results also demonstrate how statistical methods inform substantive reviews intended to identify items with translation differences.

Journal ArticleDOI
TL;DR: In this article, the authors demonstrated the empirical relationship between Cohen's chi-square effect size, w, and differential item functioning (DIF), defined as group differences in item response theory (IRT) item difficulty.
Abstract: The authors demonstrated the empirical relationship between Cohen's chi-square effect size, w, and differential item functioning (DIF), defined as group differences in item response theory (IRT) item difficulty. In Experiment 1, in which the lower asymptote was 0, the authors argued that Cohen's designation of small, medium, and large effects connotes reasonably well for that definition of DIF. In Experiment 2, the lower asymptote of the item response function was raised from 0 to 0.2 and the item discrimination parameter was held to 1.0. Doing so admitted non-crossing nonuniform DIF to the model, violating an underlying assumption of the Mantel–Haenszel procedure that the odds ratio is constant across studied levels of the matching criterion. Smaller difficulty parameter difference resulted, which produced larger effects with an inflation in effect size of about 15%. In Experiment 3, the authors used the 1-parameter logistic model to examine the effect that group differences in the matching crit...


Journal ArticleDOI
TL;DR: This paper derives discrimination parameter values, as functions of the guessing parameter and distances between person parameters and item difficulty, that yield maximum information for the three-parameter logistic item response theory model.
Abstract: Items with the highest discrimination parameter values in a logistic item response theory model do not necessarily give maximum information. This paper derives discrimination parameter values, as functions of the guessing parameter and distances between person parameters and item difficulty, that yield maximum information for the three-parameter logistic item response theory model. An upper bound for information as a function of these parameters is also derived. An algorithm is suggested for the maximum information item selection criterion for adaptive testing and is compared with a full bank search algorithm.
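The result turns on the three-parameter-logistic item information function, which in standard notation (scaling constant D; item parameters a, b, c) is sketched below.

```latex
% Three-parameter-logistic item response function and the standard item information
% formula (Lord, 1980); D is the scaling constant and a, b, c the item parameters.
\[
P(\theta) = c + \frac{1 - c}{1 + \exp\{-D a (\theta - b)\}}
\]
\[
I(\theta) = \frac{D^{2} a^{2}\,\bigl[1 - P(\theta)\bigr]}{P(\theta)}
            \left[\frac{P(\theta) - c}{1 - c}\right]^{2}
\]
```

For c > 0 and an examinee located away from the item difficulty, increasing a eventually decreases I(theta), which is why the highest-a item need not be the most informative; the paper derives the a value that maximizes this expression as a function of c and the distance between the person parameter and the item difficulty.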


Journal Article
TL;DR: In this article, the adaptation to Basque of a verbal ability test is studied; adapting a psychological test requires assessing the metric equivalence of scores and, therefore, studying possible bias.
Abstract: Adaptation to Basque of a verbal ability test. The adaptation of psychological tests is a process which goes beyond the study of the linguistic quality of the translation. It is a process that requires the assessment of the metric equivalence between scores and, therefore, a study of possible bias. In this work the adaptation to Basque of a verbal ability test is studied. The impossibility of using item response theory models led to the use of alternative procedures for detecting differential item functioning. The most relevant results are the high percentage of items with differential functioning; the high correlation between the Mantel-Haenszel statistic and SIBTEST, together with the low correlation of iterative logistic regression with both; and the close relationship between differential functioning and the kind of task associated with each item.


01 Apr 1999
TL;DR: In this article, the authors examined the relationship between student achievement in mathematics and pedagogical approach used by middle school mathematics teachers in the United States who participated in the Third International Mathematics and Science Study.
Abstract: The primary objective of this research was to examine the relationship between student achievement in mathematics and pedagogical approach used by middle school mathematics teachers in the United States who participated in the Third International Mathematics and Science Study. In this research, student achievement was explored at the item, rather than test, level with the thought that differences might be found only at this micro level. It was hypothesized that middle school mathematics students whose teachers utilized a more student-centered, or constructivist, pedagogical approach would have a higher probability of obtaining the correct answer to mathematics items that measured conceptual, rather than procedural, understanding. This hypothesis was explicitly tested using differential item functioning analyses. Results support the hypothesis, although not as strongly as had been expected. An appendix contains the teacher survey. (Contains 3 tables, 1 figure, and 10 references.)

01 Apr 1999
TL;DR: In this article, a 50-item translated test was used to assess the percentage and type of agreement between the Mantel-Haenszel (MH) and Differential Functioning of Items and Tests (DFIT) techniques for the detection of differential item functioning (DIF).
Abstract: Data from a 50-item translated test used for certification were used to assess the percentage and type of agreement between the Mantel-Haenszel (MH) and Differential Functioning of Items and Tests (DFIT) techniques for the detection of differential item functioning (DIF). The DFIT procedure flagged 10 of 30 items as exhibiting significant DIF, while the MH technique flagged 2 of 30 items. In both methods, items were flagged for significant DIF when translation differences appeared in the item stems. The DFIT method was more sensitive in detecting DIF resulting exclusively from differences in the item answer options. The overall percent agreement between the two techniques for the detection of DIF in this investigation was 20 percent. The MH technique detected 1 of 10 items as exhibiting nonuniform DIF and 1 of 10 items as displaying uniform DIF. The DFIT procedure detected 4 of 10 items as exhibiting nonuniform DIF and 6 of 10 as displaying uniform DIF. Four appendixes contain tables of descriptive statistics and the English and back-translated item versions. (Contains 34 references.)
From the introduction: Assessment instruments used for certification and licensing are sometimes translated from one language to another when they are used in a cross-cultural setting. Examples of such instruments in physical education and exercise science include certification tests administered to practitioners for the purpose of knowledge mastery in a discipline prior to becoming certified to perform the clinical assessment of physical fitness/wellness or for participation in certain types of sports activities. When tests are modified and used cross-culturally, the measurement equivalence of the instrument should be evaluated. If measurement inequivalence is found, the test should be revised by improving or replacing problematic items. The original and modified tests may not be equivalent because (a) through the translation process the meaning of the test items has been unknowingly changed and/or (b) the test items may not have the same relevance across the different cultural groups (Budgell, Raju, & Quartetti, 1995). Historically, cross-cultural researchers have used procedures such as back-translation and decentering as an initial step in the process of test translation (Brislin, 1980). After test translation was complete, classical test theory methods were used for examining differences within groups, with the final goal of producing measurement equivalence across groups. Classical test theory methods are population or group dependent, however, and are therefore less than ideal for verifying measurement equivalence in translated tests. Statistical methods based on item response theory (IRT) overcome a variety of problems associated with the classical test theory model and provide researchers with an improved methodology for examining measurement equivalence across culturally different groups. Within the framework of IRT, measurement equivalence is a property that exists when the relations between observed test scores and the latent attribute measured by the test are identical across sub-populations (Drasgow, 1984, p. 134). In order for a translated test to exhibit measurement equivalence, individuals who come from different cultural groups that are equal in ability must have the same observed score. Equivalent assessment instruments must be used in

01 Apr 1999
TL;DR: Differential item functioning (DIF) occurs when an item displays different statistical properties for different groups after the groups are matched on an ability measure; the authors propose that DIF arises because the observed data do not reflect a homogeneous population of individuals but are a mixture of data from multiple latent populations or classes.
Abstract: Differential item functioning (DIF) may be said to occur when an item displays different statistical properties for different groups after the groups are matched on an ability measure. For instance, with binary data, DIF exists when there is a difference in the conditional probabilities of a correct response for two manifest groups. This paper suggests that the occurrence of DIF can be explained by recognizing that the observed data do not reflect a homogeneous population of individuals, but are a mixture of data from multiple latent populations or classes. This conceptualization of DIF hypothesizes that when one observes DIF using the current conceptualization of DIF, it is only to the degree that the manifest groups are represented in the latent classes in different proportions. A Monte Carlo study was conducted to compare various approaches to detecting DIF under this formulation of DIF. Results show that as the latent class proportions became more equal, the DIF detection methods' identification rates approached null-condition levels. (Contains 6 tables, 3 figures, and 27 references.)

Journal Article
TL;DR: In this paper, the effects of violating the unidimensionality assumption when applying item response logistic models were studied. Using a multidimensional two-parameter logistic model, two different tests that are common in practice were simulated: (a) a test composed of two equally relevant dimensions, and (b) a test with a dominant dimension and a secondary one.
Abstract: The effects of violating the unidimensionality assumption when applying item response logistic models were studied. Using a multidimensional two-parameter logistic model, two different tests that are common in practice were simulated: (a) a test composed of two equally relevant dimensions, and (b) a test with a dominant dimension and a secondary one. Each test was composed of 40 items, 25 corresponding to the first dimension and 15 to the second. Two sample sizes (N = 300 and N = 1000) and five levels of correlation between the dimensions (0.05, 0.30, 0.60, 0.90, 0.95) were used to generate the data. The unidimensional two-parameter logistic model was used to estimate the item parameters and the ability of the examinees. The results indicate that the unidimensional estimates are consistently robust. Estimates of the item difficulty parameter are less affected by the violation of the unidimensionality assumption than the other item parameter estimates. The item discrimination parameter and the ability estimates are influenced by the size of the correlation between the dimensions and by the type of multidimensionality displayed by the data.
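A compensatory multidimensional two-parameter logistic model of the kind used to generate the data can be written as follows (notation assumed; some parameterizations omit the scaling constant D).

```latex
% Compensatory multidimensional 2PL for person j and item i with two latent dimensions;
% a_{1i}, a_{2i} are discriminations, d_i an intercept, D the usual scaling constant.
\[
P(X_{ij} = 1 \mid \theta_{1j}, \theta_{2j}) =
  \frac{1}{1 + \exp\{-D (a_{1i}\theta_{1j} + a_{2i}\theta_{2j} + d_i)\}}
\]
```

Fitting the unidimensional 2PL to such data amounts to collapsing the two dimensions into a single composite; the correlation between the dimensions and the relative sizes of the two discriminations govern how well that composite approximates the generating structure, which is what the correlation and test-structure factors in the simulation manipulate.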

01 Jan 1999
TL;DR: All five indices performed poorly across the conditions studied and are inadequate fit measures for detecting DIF.
Abstract: This study examined five Rasch-model-based item-fit indices: unweighted and weighted standardized indices (denoted UWz and Wz), the standardized likelihood index (denoted Lz), and Extended Caution Indices (denoted ECI2z and ECI4z), in terms of their distributional properties and their power to detect item bias or differential item functioning (DIF). The results indicated that although these five standardized item-fit indices did not depart significantly from a normal distribution, their Type I error rates were not reasonable. With respect to the power of the five standardized item-fit indices to detect DIF, the results showed that all indices performed poorly across the various conditions. These findings lead to the conclusion that all indices used in this study are inadequate fit measures for detecting DIF.

Journal ArticleDOI
TL;DR: This paper presents latent-class models that fall within the purview of the general model presented by Clogg & Goodman (1984, 1985) and Walter & Irwig (1988) and variations on the general latent- class model allow the investigator to determine whether the criterion measure and/or the diagnostic or screening procedure for multiple groups can be considered error-free.
Abstract: Classification analysis is used widely to detect classification errors determined by evaluating a screening or diagnostic instrument against a criterion measure. The usefulness of classification analysis is limited because it assumes an error-free criterion and provides no statistical test of the validity of that assumption. The classification-analysis model is a special case of a general latent-class model. This paper presents latent-class models that fall within the purview of the general model presented by Clogg & Goodman (1984, 1985) and Walter & Irwig (1988). Variations on the general latent-class model allow the investigator to determine whether the criterion measure and/or the diagnostic or screening procedure for multiple groups can be considered error-free. Analogous to the problem of differential item functioning, the general model makes it possible to test assumptions regarding classification errors that could occur across groups. The proportion of individuals who may be misclassified by a scree...