Showing papers on "Differential item functioning published in 1998"


Journal ArticleDOI
TL;DR: Analysis of DIF proved useful for evaluating questionnaire translations in the Danish translation of the SF-36 Health Survey: in two general population samples, the DIF results agreed with independent ratings of translation quality, but the statistical techniques were more sensitive.

190 citations


Journal Article
TL;DR: The disability scale demonstrated excellent practicality and reliability, accurately reflects patient perceptions of functional status and pain as well as the doctor's global assessment, and is responsive to change over long periods of time.

135 citations


Journal ArticleDOI
TL;DR: This study compared three methods for developing a common metric under IRT and found that, for smaller numbers of common items, linking separate calibration runs with equating coefficients from the characteristic curve method yielded smaller root mean square differences for both item discrimination and difficulty parameters.
Abstract: Applications of item response theory (IRT) to practical testing problems, including equating, differential item functioning, and computerized adaptive testing, require a common metric for item parameter estimates. This study compared three methods for developing a common metric under IRT: (1) linking separate calibration runs using equating coefficients from the characteristic curve method, (2) concurrent calibration based on marginal maximum a posteriori estimation, and (3) concurrent calibration based on marginal maximum likelihood estimation. For smaller numbers of common items, linking using the characteristic curve method yielded smaller root mean square differences for both item discrimination and difficulty parameters. For larger numbers of common items, the three methods yielded similar results.
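As a reading aid for the characteristic curve linking idea compared in this study, here is a minimal Python sketch under simplifying assumptions: a 2PL model, error-free common-item parameters, and toy linking coefficients. All names and values are illustrative and this is not the study's code; the loss matches the test characteristic curves of the common items over a theta grid, which is the core of the Stocking-Lord variant of the characteristic curve method.

```python
# Sketch of characteristic-curve (Stocking-Lord style) linking for the 2PL.
# a_old/b_old and a_new/b_new are hypothetical common-item parameters on two metrics.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_common = 10
a_old = rng.uniform(0.8, 2.0, n_common)          # discriminations on the base metric
b_old = rng.normal(0.0, 1.0, n_common)           # difficulties on the base metric

A_true, B_true = 1.2, 0.3                        # true linking coefficients
a_new = A_true * a_old                           # same items calibrated on the new metric
b_new = (b_old - B_true) / A_true

theta = np.linspace(-4, 4, 41)                   # quadrature grid on the base metric

def tcc(a, b):
    """Test characteristic curve of the common items under the 2PL."""
    return (1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))).sum(axis=1)

def sl_loss(coefs):
    A, B = coefs
    a_star, b_star = a_new / A, A * b_new + B    # place new-form items on the base metric
    return np.sum((tcc(a_old, b_old) - tcc(a_star, b_star)) ** 2)

A_hat, B_hat = minimize(sl_loss, x0=[1.0, 0.0], method="Nelder-Mead").x
print(f"recovered A = {A_hat:.3f}, B = {B_hat:.3f}")   # close to 1.2 and 0.3
```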

123 citations


Journal ArticleDOI
TL;DR: Results do not support arguments that measures of negative affective dispositions "artificially" produce gender mean differences by focusing on selected content areas, and they illustrate how, even in an essentially unidimensional scale, comparisons of group mean differences can be affected by multidimensionality caused by item clusters that share similar content.
Abstract: Item response theory methods were used to study differential item functioning (DIF) between gender groups on a measure of stress reaction. Results revealed that women were more likely to endorse items describing emotional vulnerability and sensitivity, whereas men were more likely to endorse items describing tension, irritability, and being easily upset. Item factor analysis yielded 5 correlated factors, and the DIF analysis, in turn, revealed differential gender mean differences on these factors. This finding illustrates how even in an essentially unidimensional scale, comparison of group mean differences can be affected by multidimensionality caused by item clusters that share similar content. Results do not support arguments that measures of negative affective dispositions "artificially" produce gender mean differences by focusing on specific selected content areas.

116 citations


Journal ArticleDOI
TL;DR: In this paper, restricted factor analysis (RFA) is used to detect item bias (differential item functioning, DIF): the common factor model serves as an item response model, but group membership is also included in the model.
Abstract: Restricted factor analysis (RFA) can be used to detect item bias (also called differential item functioning). In the RFA method of item bias detection, the common factor model serves as an item response model, but group membership is also included in the model. Two simulation studies are reported, both showing that the RFA method detects bias in 7-point scale items very well, especially when the sample size is large, the mean trait difference between groups is small, the group sizes are equal, and the amount of bias is large. The first study further shows that the RFA method detects bias in dichotomous items at least as well as an established method based on the one-parameter logistic item response model. The second study concerns various procedures to evaluate the significance of two item-bias indices provided by the RFA method. The results indicate that the RFA method performs best when it is used in an iterative procedure.
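To make the logic concrete, the sketch below mimics the RFA idea in simplified form: group membership is allowed to predict an item over and above the trait, and a nonzero group effect flags the item as biased. For brevity the fitted common factor is replaced by an observed rest-score proxy, so this is a rough stand-in for restricted factor analysis rather than the authors' procedure; all data and names are simulated.

```python
# Simplified stand-in for the RFA idea: flag an item as biased when group
# membership predicts it beyond the trait (here proxied by the rest score).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
group = rng.integers(0, 2, n)                    # 0 = reference, 1 = focal
trait = rng.normal(0, 1, n)                      # equal trait distributions

loadings = np.array([0.8, 0.7, 0.9, 0.6, 0.75])
items = trait[:, None] * loadings + rng.normal(0, 0.6, (n, 5))
items[:, 2] += 0.5 * group                       # item 3 carries uniform bias

for j in range(items.shape[1]):
    rest = items[:, np.arange(5) != j].mean(axis=1)   # proxy for the common factor
    X = sm.add_constant(np.column_stack([rest, group]))
    fit = sm.OLS(items[:, j], X).fit()
    print(f"item {j + 1}: group effect = {fit.params[2]:+.2f}, p = {fit.pvalues[2]:.3f}")
```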

112 citations


Journal Article
TL;DR: In this paper, it is shown that differential item functioning can be evaluated using the Lagrange multiplier test or Rao's efficient score test in the framework of a number of IRT models such as the Rasch model, the OPLM, the 2-parameter logistic model, the generalized partial credit model, and the nominal response model.
Abstract: In the present paper it is shown that differential item functioning can be evaluated using the Lagrange multiplier test or Rao’s efficient score test. The test is presented in the framework of a number of IRT models such as the Rasch model, the OPLM, the 2-parameter logistic model, the generalized partial credit model and the nominal response model. However, the paradigm for detection of differential item functioning presented here also applies to other IRT models. Two examples are given, one using simulated data and one using real data.
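For orientation, the score (Lagrange multiplier) statistic referred to above has the following generic form; the notation is not taken from the paper, and the details of the covariance matrix (which must account for the estimated nuisance parameters) are omitted. Its practical appeal for DIF screening is that only the no-DIF model has to be estimated.

```latex
% Generic score / Lagrange multiplier statistic: \hat{\eta}_0 denotes the
% estimates of the model with the DIF parameter(s) fixed at zero, h is the
% gradient of the log-likelihood with respect to the tested DIF parameters,
% evaluated at \hat{\eta}_0, and \Sigma is the (asymptotic) covariance of h.
\[
  LM \;=\; h(\hat{\eta}_0)^{\top}\, \Sigma(\hat{\eta}_0)^{-1}\, h(\hat{\eta}_0),
  \qquad
  h(\eta) \;=\; \frac{\partial \log L(\eta)}{\partial \eta},
\]
\[
  LM \;\xrightarrow{\,d\,}\; \chi^2_{q} \quad \text{under the null of no DIF,}
\]
% where q is the number of DIF parameters tested.
```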

102 citations


Journal ArticleDOI
TL;DR: This article shows how item response models can be applied to address specific psychometric issues in the assessment of depression, including the relative effectiveness of response options (option effectiveness), the ability of existing measures to detect differences in depressive severity (scale discriminability), and the extent to which certain groups of individuals use items and options differently (differential item functioning).
Abstract: Despite advances in psychometric theory and analytic techniques, a number of issues regarding the assessment of depression remain unresolved, including the relative effectiveness of response options (option effectiveness), the ability of existing measures to detect differences in depressive severity (scale discriminability), and the extent to which certain groups of individuals use items and options differently (differential item functioning). The first part of the article introduces the fundamentals of nonparametric item response models; the second part illustrates how item response models can be applied to address these specific psychometric issues. Although the article focuses on the assessment of depression, the problems addressed here are present in virtually every field of psychological research, and the techniques offered may be applied broadly. Analytic techniques based on item response models are not only helpful in identifying and ultimately resolving many of these issues; they are essential for ensuring that traits, abilities, and conditions, such as depression, are assessed fairly and equitably.
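As an illustration of what a nonparametric option response curve is, the sketch below kernel-smooths the probability of choosing one response option against a rest-score proxy for the trait. This is only in the spirit of nonparametric (Ramsay-style) IRT; the bandwidth, variable names, and toy data are assumptions, not the article's analysis.

```python
# Minimal kernel-smoothed option response curve: P(option chosen | trait level),
# with the trait proxied by a standardized rest score. Illustrative only.
import numpy as np

def option_curve(rest_score, chose_option, grid, bandwidth=0.5):
    """Kernel-regress the 0/1 indicator 'chose_option' on the standardized rest score."""
    z = (rest_score - rest_score.mean()) / rest_score.std()
    w = np.exp(-0.5 * ((grid[:, None] - z) / bandwidth) ** 2)   # Gaussian kernel weights
    return (w * chose_option).sum(axis=1) / w.sum(axis=1)

# Toy usage: probability of endorsing one option on one item, by trait level.
rng = np.random.default_rng(2)
rest = rng.normal(0, 1, 1000)                     # stand-in for a rest score
chose = (rng.normal(rest, 1.0) > 0.5).astype(float)
grid = np.linspace(-2.5, 2.5, 11)
print(np.round(option_curve(rest, chose, grid), 2))
```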

102 citations


Journal ArticleDOI
TL;DR: In this article, Type I error rates of the likelihood ratio test for the detection of differential item functioning (DIF) were investigated using Monte Carlo simulations; the graded response model with five ordered categories was used to generate datasets of a 30-item test for samples of 300 and 1,000 simulated examinees.
Abstract: Type I error rates of the likelihood ratio test for the detection of differential item functioning (DIF) were investigated using Monte Carlo simulations. The graded response model with five ordered categories was used to generate datasets of a 30-item test for samples of 300 and 1,000 simulated examinees. All DIF comparisons were simulated by randomly pairing two groups of examinees. Three different sample size combinations of reference and focal groups were simulated under two ability matching conditions. For each of the six combinations of sample sizes by ability matching conditions, 100 replications of DIF detection comparisons were simulated. Item parameter estimates and likelihood values were obtained by marginal maximum likelihood estimation using the computer program MULTILOG. Type I error rates of the likelihood ratio test statistics for all six combinations of sample sizes and ability matching conditions were within theoretically expected values at each of the nominal alpha levels considered.
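The likelihood ratio DIF test evaluated here reduces to a deviance comparison between a compact model (the studied item's parameters constrained equal across groups) and an augmented model (those parameters freed). A skeleton of that comparison, with placeholder log-likelihood values rather than results from the study, looks like this:

```python
# Skeleton of the IRT likelihood-ratio DIF test; log-likelihoods and the
# parameter count below are placeholders, not values from the study.
from scipy.stats import chi2

def lr_dif_test(loglik_compact, loglik_augmented, extra_params):
    """G2 = 2 * (llA - llC), referred to a chi-square with df = extra_params."""
    g2 = 2.0 * (loglik_augmented - loglik_compact)
    return g2, chi2.sf(g2, df=extra_params)

# A graded-response item with 5 categories has 1 slope + 4 thresholds,
# so freeing it across two groups adds 5 parameters.
g2, p = lr_dif_test(loglik_compact=-10234.7, loglik_augmented=-10228.1, extra_params=5)
print(f"G2 = {g2:.1f}, p = {p:.3f}")
```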

102 citations


14 Apr 1998
TL;DR: In this paper, the type I error rates of the likelihood ratio test for the detection of differential item functioning (DIF) in the partial credit model were investigated using simulated data.
Abstract: Type I error rates of the likelihood ratio test for the detection of differential item functioning (DIF) in the partial credit model were investigated using simulated data. The partial credit model with four ordered performance levels was used to generate data sets of a 30-item test for samples of 300 and 1,000 simulated examinees. Three different combinations of sample sizes of reference and focal group comparisons were simulated under two different ability matching conditions. One hundred replications of DIF detection comparisons were simulated for each of six conditions. Type I error rates of the likelihood ratio test for all six conditions were within theoretically expected values at each of the nominal alpha levels considered. (Contains 25 references.)

85 citations


Journal ArticleDOI
TL;DR: In this article, it was shown that parallel three-parameter logistic IRFs do not result in uniform differential item functioning (DIF) and that the term "uniform DIF" should be reserved for the condition in which the association between the item response and group is constant for all values of the matching variable, as distinguished from parallel and unidirectional DIF.
Abstract: Uniform differential item functioning (DIF) exists when the statistical relationship between item response and group is constant for all levels of a matching variable. Two other types of DIF are defined based on differences in item response functions (IRFs) among the groups of examinees: unidirectional DIF (the IRFs do not cross) and parallel DIF (the IRFs are the same shape but shifted from one another by a constant, i.e., the IRFs differ only in location). It is shown that these three types of DIF are not equivalent, and the relationships among them are examined in this paper for two item response categories, two groups, and an ideal continuous univariate matching variable. The results imply that unidirectional and parallel DIF, which have been considered uniform DIF by several authors, are not uniform DIF. For example, it is shown in this paper that parallel three-parameter logistic IRFs do not result in uniform DIF. It is suggested that the term "uniform DIF" be reserved for the condition in which the association between the item response and group is constant for all values of the matching variable, as distinguished from parallel and unidirectional DIF.
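The paper's central point can be checked numerically: two 3PL IRFs that are parallel (identical except for location) do not yield a constant odds ratio across theta, so the DIF they produce is not uniform. The parameter values below are arbitrary illustrations.

```python
# Numerical check: parallel 3PL IRFs (same a and c, shifted b) give an odds
# ratio that varies with theta, i.e., the DIF is not uniform. Toy parameters.
import numpy as np

def p3pl(theta, a=1.2, b=0.0, c=0.2):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)
p_ref = p3pl(theta, b=0.0)                       # reference group IRF
p_foc = p3pl(theta, b=0.5)                       # same shape, shifted by 0.5
odds_ratio = (p_ref / (1 - p_ref)) / (p_foc / (1 - p_foc))
for t, orr in zip(theta, odds_ratio):
    print(f"theta = {t:+.1f}: odds ratio = {orr:.2f}")   # varies with theta
```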

73 citations


Journal ArticleDOI
TL;DR: The item parameter recovery characteristics of a Gibbs sampling method (Albert, 1992) for IRT item parameter estimation were investigated in a simulation study in which item parameters were estimated, under a normal ogive item response function model, using Gibbs sampling and BILOG (Mislevy & Bock, 1989).
Abstract: The item parameter recovery characteristics of a Gibbs sampling method (Albert, 1992) for IRT item parameter estimation were investigated using a simulation study. The item parameters were estimated, under a normal ogive item response function model, using Gibbs sampling and BILOG (Mislevy & Bock, 1989). The item parameter estimates were then equated to the metric of the underlying item parameters for tests with 10, 20, 30, and 50 items, and samples of 30, 60, 120, and 500 examinees. Summary statistics of the equating coefficients showed that Gibbs sampling and BILOG both produced trait scale metrics with units of measurement that were too small, but with a proper midpoint of the metric. When expressed in a common metric, the biases of the BILOG estimates of the item discriminations were uniformly smaller and less variable than those from Gibbs sampling. The biases of the item difficulty estimates yielded by the two estimation procedures were small and similar to each other. In addition, the item par...
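For readers unfamiliar with Albert's (1992) approach, the following is a condensed, illustrative sketch of the data-augmentation Gibbs sampler for the two-parameter normal ogive model. Priors, chain length, starting values, and the simulated data are simplifying assumptions; this is not the code used in the study.

```python
# Condensed sketch of a data-augmentation Gibbs sampler for the two-parameter
# normal ogive model, P(y=1) = Phi(a*theta - b). Flat priors on (a, b) and a
# N(0,1) prior on theta are simplifications for illustration only.
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(3)
n_persons, n_items = 500, 20

# --- simulate toy responses ---
theta_true = rng.normal(0, 1, n_persons)
a_true = rng.uniform(0.8, 1.8, n_items)
b_true = rng.normal(0, 0.8, n_items)
eta = a_true * theta_true[:, None] - b_true
y = (rng.normal(eta, 1.0) > 0).astype(int)       # normal ogive responses

# --- Gibbs sampler ---
theta = np.zeros(n_persons)
a, b = np.ones(n_items), np.zeros(n_items)
keep_a = []
for it in range(500):
    # 1) latent responses Z, truncated at 0 according to the observed y
    mean = a * theta[:, None] - b
    lo = np.where(y == 1, -mean, -np.inf)        # Z > 0 when y = 1
    hi = np.where(y == 1, np.inf, -mean)         # Z < 0 when y = 0
    z = mean + truncnorm.rvs(lo, hi, size=mean.shape, random_state=rng)

    # 2) abilities given Z and item parameters (N(0,1) prior on theta)
    prec = 1.0 + np.sum(a ** 2)
    theta = rng.normal((z + b) @ a / prec, np.sqrt(1.0 / prec))

    # 3) item parameters given Z and theta (flat prior, unit error variance)
    X = np.column_stack([theta, -np.ones(n_persons)])
    XtX_inv = np.linalg.inv(X.T @ X)
    for j in range(n_items):
        coef = XtX_inv @ X.T @ z[:, j]
        a[j], b[j] = rng.multivariate_normal(coef, XtX_inv)
    if it >= 200:
        keep_a.append(a.copy())

print("posterior mean a (first 5 items):", np.round(np.mean(keep_a, axis=0)[:5], 2))
print("true a (first 5 items):          ", np.round(a_true[:5], 2))
```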


Journal ArticleDOI
TL;DR: In this paper, data were generated to simulate two-dimensional tests; the dimensional structure of the tests, the discrimination levels of the items, and the correlation between the traits measured by the test were varied.
Abstract: Popular techniques for assessing differential item functioning (DIF) assume that the test under study is unidimensional. When this assumption is tenable, number-correct score is a reasonable matching criterion. When a test is intentionally multidimensional, matching on a single test score does not ensure comparability and may result in inflated error rates. An alternate approach is to match on all relevant traits simultaneously, using a procedure such as logistic regression. In this study, data were generated to simulate two-dimensional tests. The dimensional structure of the tests, the discrimination levels of the items, and the correlation between the traits measured by the test were varied. Standard DIF analyses were conducted using total test score as the matching variable. High false-positive error rates were found. Items were divided into subtests using nonlinear factor analysis and DIF analyses were repeated with subtest scores as the matching criteria. False-positive error rates were reduced for m...
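A minimal sketch of the logistic-regression DIF approach described above, with two matching scores, is given below. The simulated item measures only the second dimension and contains no DIF, yet matching on the total score alone tends to produce a spurious group effect; all names and data are illustrative, not the study's design.

```python
# Logistic-regression DIF with one vs. two matching variables on toy
# two-dimensional data; the studied item has no DIF by construction.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 3000
group = rng.integers(0, 2, n)
sub1 = rng.normal(0.4 * group, 1)                # dimension 1: true group difference (impact)
sub2 = rng.normal(0, 1, n)                       # dimension 2: no group difference
item = (rng.logistic(0, 1, n) < 1.5 * sub2).astype(int)   # item measures dimension 2 only

df = pd.DataFrame({"item": item, "sub1": sub1, "sub2": sub2, "group": group})

total_only = smf.logit("item ~ I(sub1 + sub2) + group", df).fit(disp=0)
both_scores = smf.logit("item ~ sub1 + sub2 + group", df).fit(disp=0)
# Matching on the total typically shows a spurious (negative) group effect;
# matching on both scores typically does not.
print("group effect, matching on total score:", round(total_only.params["group"], 3))
print("group effect, matching on both scores:", round(both_scores.params["group"], 3))
```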

Journal ArticleDOI
TL;DR: In this paper, the authors used differential item functioning (DIF) analysis techniques to determine whether the relationships between measured interests and vocational preferences were equivalent for the two sexes, using SII responses from 16,484 males and females.

17 Apr 1998
TL;DR: Methods for evaluating construct equivalence and item equivalence across different language versions of a test were applied to samples of 1,329 to 2,000 examinees per test; the results indicate that these procedures provide a great deal of information useful for evaluating test and item functioning across groups.
Abstract: Adapting credentialing examinations for international uses involves translating tests for use in multiple languages. This paper explores methods for evaluating construct equivalence and item equivalence across different language versions of a test. These methods were applied to four different language versions (English, French, German, and Japanese) of a Microsoft certification examination with samples ranging from 1,329 to 2,000 examinees per test. Principal components analysis, multidimensional scaling, and confirmatory factor analysis of these data were conducted to evaluate construct equivalence. Detection of differential item functioning across languages was conducted using the standardized p-difference index. The results indicate that these procedures provide a great deal of information useful for evaluating test and item functioning across groups. Some differences in factor and dimension loadings across groups were noted, but a common, one-factor model fit the data well. Four items were flagged for differential item functioning across all groups. Suggestions for using these methods to evaluate translated tests are provided. (Contains 8 tables, 3 figures, and 13 references.)
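The standardized p-difference index used here is a focal-group-weighted average of the difference in proportion correct at each matched score level. A minimal implementation, with hypothetical input names, might look like this:

```python
# Minimal standardized p-difference (STD P-DIF) index: focal-group-weighted
# average of the focal-minus-reference difference in proportion correct,
# computed within matched total-score levels. Input names are illustrative.
import numpy as np

def std_p_dif(correct, score, focal):
    """correct, focal: 0/1 arrays; score: integer matching score per examinee."""
    correct, score, focal = map(np.asarray, (correct, score, focal))
    is_focal = focal.astype(bool)
    num, den = 0.0, 0.0
    for k in np.unique(score):
        f, r = is_focal & (score == k), ~is_focal & (score == k)
        if f.any() and r.any():                   # need both groups at this level
            num += f.sum() * (correct[f].mean() - correct[r].mean())
            den += f.sum()
    return num / den if den else np.nan

# Toy usage with random data (expected value near zero, i.e., no DIF).
rng = np.random.default_rng(5)
score = rng.integers(0, 31, 4000)
focal = rng.integers(0, 2, 4000)
correct = (rng.random(4000) < score / 30).astype(int)
print(round(std_p_dif(correct, score, focal), 3))
```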

01 Jan 1998
TL;DR: The conclusion is drawn that complex procedures are not required to generate interpretable results if relevant differences between the groups being compared are known, and that the inability of many researchers to interpret results for racial/ethnic or gender groups is not due to inadequacies of the methods but more likely to a lack of pertinent knowledge about group differences.
Abstract: This paper presents an analysis of differential item functioning (DIF) in a certification examination for a medical specialty. The groups analyzed were (1) physicians from different subspecialties within this area and (2) physicians who qualified for the examination through two different experiential pathways. The DIF analyses were performed using a simple Rasch model procedure. The results were shown to be readily interpretable in terms of the known differences between the groups being compared. These results serve as validity evidence for the Rasch model procedure as a means for evaluating DIF in examinations. The conclusion is drawn that complex procedures are not required to generate interpretable results if relevant differences between the groups being compared are known. This suggests that the inability of many researchers to interpret results for racial/ethnic or gender groups is not due to inadequacies of the methods, but more likely to a lack of pertinent knowledge about group differences.

Journal Article
TL;DR: In this paper, a simulation study was carried out to compare the Mantel-Haenszel and loglinear procedures for detecting differentially functioning items; the two main conclusions are (1) that the Mantel-Haenszel procedure has higher power than the loglinear procedures, and (2) that iterative application of the procedures substantially improves detection rates and Type I error control compared to single-step applications.
Abstract: Comparison of the Mantel-Haenszel procedure versus loglinear models for detecting differential item functioning. A simulation study was carried out to compare the Mantel-Haenszel and loglinear procedures for detecting differentially functioning items. The procedures were applied both in a single step and iteratively. The two main conclusions are: first, the Mantel-Haenszel procedure has higher power than the loglinear procedures; second, the iterative application of the procedures substantially improves detection rates and Type I error control compared to single-step applications.
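For reference, the Mantel-Haenszel DIF statistic being compared pools a common odds ratio across total-score strata; a compact sketch, including the conventional conversion to the ETS delta scale (-2.35 ln alpha), is given below with toy data and illustrative names.

```python
# Sketch of the Mantel-Haenszel common odds ratio for DIF, pooled across
# total-score strata, plus the ETS delta-scale conversion. Toy data only.
import numpy as np

def mantel_haenszel(correct, score, focal):
    correct, score, focal = map(np.asarray, (correct, score, focal))
    num = den = 0.0
    for k in np.unique(score):
        s = score == k
        ref, foc = s & (focal == 0), s & (focal == 1)
        A, B = correct[ref].sum(), (1 - correct[ref]).sum()   # reference right/wrong
        C, D = correct[foc].sum(), (1 - correct[foc]).sum()   # focal right/wrong
        T = A + B + C + D
        if T > 0:
            num += A * D / T
            den += B * C / T
    alpha = num / den
    return alpha, -2.35 * np.log(alpha)          # MH odds ratio and ETS delta scale

rng = np.random.default_rng(6)
score = rng.integers(0, 21, 5000)
focal = rng.integers(0, 2, 5000)
correct = (rng.random(5000) < (score + 1) / 22).astype(int)
alpha, delta = mantel_haenszel(correct, score, focal)
print(f"alpha_MH = {alpha:.2f}, MH D-DIF = {delta:.2f}")      # near 1 and 0: no DIF
```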

01 Apr 1998
TL;DR: Using the parametric Differential Functioning of Items and Tests (DFIT) framework of Raju, van der Linden, and Fleer (1992), the authors detected differential item functioning (DIF) in a translated certification test and found that 10 of 30 items exhibited significant noncompensatory DIF.
Abstract: Often, educational and psychological measurement instruments must be translated from one language to another when they are administered to different cultural groups. The translation process often introduces measurement inequivalence: an examination may be said to exhibit differential functioning if the test provides a consistent advantage to one particular race or culture through the manner in which the test items are written. One thousand American and 1,134 Japanese entry-level examinees participating in a scuba diving certification course took a standardized criterion mastery test for certification. The parametric Differential Functioning of Items and Tests (DFIT) framework proposed by N. Raju, W. van der Linden, and P. Fleer (1992) was used to detect differential item functioning (DIF). Out of a total of 30 items, 10 were found to exhibit significant noncompensatory DIF. Differential test functioning was also found to be significant. This paper demonstrates that the new DFIT technique can be applied successfully to the translated data, and that possible causes for the differential functioning can be examined using results from the DFIT analysis. (Contains 3 figures, 5 tables, and 25 references.)
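Within the DFIT framework, the item-level noncompensatory DIF (NCDIF) index is the mean squared difference between an item's focal- and reference-calibrated response functions, taken over the focal group's trait estimates. A minimal sketch under a 2PL assumption follows; parameter values and theta draws are illustrative, not results from this study.

```python
# Minimal NCDIF index for one item under a 2PL: average squared difference
# between focal- and reference-calibrated IRFs over focal-group thetas.
import numpy as np

def irf_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))        # 2PL with D = 1.7

def ncdif(theta_focal, a_ref, b_ref, a_foc, b_foc):
    diff = irf_2pl(theta_focal, a_foc, b_foc) - irf_2pl(theta_focal, a_ref, b_ref)
    return np.mean(diff ** 2)

rng = np.random.default_rng(7)
theta_focal = rng.normal(0, 1, 2000)             # focal examinees' trait estimates
print(round(ncdif(theta_focal, a_ref=1.1, b_ref=0.0, a_foc=1.1, b_foc=0.4), 4))
```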

Journal ArticleDOI
TL;DR: The Mattis Dementia Rating Scale shows no appreciable evidence of test bias and minimal differential item functioning (item bias) because of race, suggesting that the MDRS may be used in both African American and Caucasian dementia patients to assess dementia severity.
Abstract: The Mattis Dementia Rating Scale (MDRS) is a commonly used cognitive measure designed to assess the course of decline in progressive dementias. However, little information is available about possible systematic racial bias on the items presented in this test. We investigated race as a potential source of test bias and differential item functioning in 40 pairs of African American and Caucasian dementia patients (N = 80), matched on age, education, and gender. Principal component analysis revealed similar patterns and magnitudes across component loadings for each racial group, indicating no clear evidence of test bias on account of race. Results of an item analysis of the MDRS revealed differential item functioning across groups on only 4 of 36 items, which may potentially be dropped to produce a modified MDRS that may be less sensitive to cultural factors. Given the absence of test bias because of race, the observed racial differences on the total MDRS score are most likely associated with group differences in dementia severity. We conclude that the MDRS shows no appreciable evidence of test bias and minimal differential item functioning (item bias) because of race, suggesting that the MDRS may be used in both African American and Caucasian dementia patients to assess dementia severity.


Journal ArticleDOI
TL;DR: This study tests for differential item functioning by gender and ethnicity of the CAGE, TWEAK, brief MAST, and AUDIT in 492 emergency room patients with lifetime drinking experience.
Abstract: The importance of having available an efficient, sound instrument for the detection of alcohol dependence and problem drinking in a medical setting is underscored by the generally poor performance of clinical judgment and blood alcohol level (BAL) in detecting these conditions (Soderstrom, Dischinger, Smith et al., 1992; Whitney, 1983; Yates, Hadfield & Peters, 1987). Although a number of screening devices have been developed for use in medical settings, such as the CAGE (Ewing, 1984) and the MAST (Selzer, 1971), these instruments were primarily developed using samples of male alcoholics. Other, newer screening devices, such as the AUDIT (Saunders, Aasland, Babor et al., 1993) and the TWEAK (Russell et al., 1994), have been developed to identify problem drinking in primary care settings, but these instruments have seen limited development in these settings in general, as well as among women and ethnic minorities. Information relating to the psychometric properties of these criterion-referenced instruments has been of mixed quality, and studies have generally been limited to an examination of the sensitivity, specificity, and predictive value, or summary measures such as the receiver operating characteristic curve (ROC) (Hsiao, Bartko & Potter, 1989), which plots sensitivity as a function of 1 - specificity. Studies have also evaluated instrument performance at standard cut points (Cherpitel, 1995a; Fleming & Barry, 1989), and some studies have used ROC analysis to determine optimal cut points (Barry & Fleming, 1993; Ross, Gavin & Skinner, 1990). Studies examining the relative performance of these instruments against each other have been sparse, however, and have generally been limited to comparisons between two of the standard screening instruments (Barry & Fleming, 1993; Magruder-Habib, Stevens & Alling, 1993; Ross, Gavin & Skinner, 1990) or between one or two of these instruments and laboratory tests (Bernadt, Taylor, Mumford et al., 1982; Girela, Villanueva, Hernandez-Cueto & Luna, 1994; Skinner, Holt, Schuller et al., 1984). An exception has been the work of Chan and Russell (Chan, Pristach, Welte & Russell, 1993; Russell et al., 1994) examining the comparative performance of a screening instrument developed for detecting pregnancy-risk drinking in prenatal and general populations. Previous analyses (e.g., Cherpitel, 1995a) have not focused on the potential for item bias by gender and ethnicity for these alcohol screening instruments, although this testing process has been recommended as good psychometric practice (Nunnally & Bernstein, 1994). This type of work is important, because if individual items on an alcohol screening instrument function differently for two groups, then the overall sensitivity for that instrument will differ as a function of group membership. The present study was designed to test the individual items on the CAGE, TWEAK, brief MAST, and AUDIT for differential item functioning by gender and ethnic identification among a sample of emergency room patients at a large Southern medical center. Differential item functioning, or DIF, is a psychometric term currently used in place of the term item bias (Camilli and Shepard, 1994). These data have been previously used by Cherpitel (1995a and 1995b; Cherpitel & Clark, 1995) to examine the psychometric properties of the scales (e.g., sensitivity and specificity), but not item functioning. Method: For a more detailed description of the participants, procedures, and instruments, please see Cherpitel (1995a).
A probability sample of every third patient 18 years old and older was drawn over a six-month period from the University of Mississippi Medical Center emergency room. Interviews were obtained from 89% (N = 1,330) of those selected (N = 1,498). Analyses were conducted on those participants who were current drinkers and reported consuming three or more drinks at one time at least once in their lives (N = 492). …

14 Apr 1998
TL;DR: In this paper, a comparison of item response theory and observed score methods for the graded response model is presented, showing stronger agreement within IRT methods and within observed score measures than between these two sets of DIF detection methods.
Abstract: This paper provides a review of procedures for detection of differential item functioning (DIF) for item response theory (IRT) and observed score methods for the graded response model. In addition, data from a test anxiety scale were analyzed to examine the congruence among these procedures. Data from Nasser, Takahashi, and Benson (1997) were reanalyzed for purposes of this study. The data were obtained from participants' responses to an Arabic version of Sarason's (1984) Reactions to Test (RTT) scale. The sample consisted of 421 tenth graders from two Arab high schools in the central district of Israel. Results indicated stronger agreement within IRT methods and within observed score methods than between these two sets of DIF detection methods. A discussion is included focusing on reasons for these similarities and differences. Results of this study can provide useful information about the relationships to expect between various DIF detection methods. (Contains 10 tables and 49 references.)


Journal ArticleDOI
TL;DR: The authors analyzed referral, placement, and retention decisions using item response theory (IRT) to investigate whether classification decisions could be placed on the latent continuum of ability normally associated with test items.
Abstract: Referral, placement, and retention decisions were analyzed using item response theory (IRT) to investigate whether classification decisions could be placed on the latent continuum of ability normally associated with test items. A second question pertained to the existence of classification differential item functioning (DIF) for the various decisions. When the decisions were calibrated, the resulting "item" parameters were similar to those that might be expected from conventional test items. For classification DIF analyses, referral decisions for ethnicity were found to be functioning differently for Whites versus non-Whites. Analyzing decisions represents a new unit of analysis for IRT and represents a powerful methodology that could be applied to a variety of new problem types.

Journal Article
TL;DR: In this paper, the authors examined the possible existence of real gender differences (impact) in a test of numerical ability and possible differential item functioning (DIF), using two different procedures.
Abstract: Gender-related impact and differential item functioning in a test of numerical ability. Standardized tests reflect differences between groups, and we should distinguish whether these differences are real or a product of the measuring instrument itself. In recent decades, psychometric research has approached this problem through the growing development of studies of differential item functioning for subjects or groups with the same level of ability. This study examines the possible existence of real gender differences (impact) in a test of numerical ability and possible differential item functioning (DIF), using two different procedures: the first is based on item response theory and the second on confirmatory factor analysis. The results suggest a careful review of the content of several items of the test. An examination of the convergence of results obtained with the different procedures supports the use of iterative purification mechanisms when working with confirmatory factor analytic techniques and the use of multiple convergent sources of evidence when taking decisions in empirical studies.


Dissertation
14 Oct 1998
TL;DR: In this paper, the authors used item response theory (IRT) methodologies to assess the degree of uniform and non-uniform DIF in a sample of ASVAB takers.
Abstract: Utilising Item Response Theory (IRT) methodologies, the Armed Services Vocational Aptitude Battery (ASVAB) was examined for differential item functioning (DIF) on the basis of crossed gender and ethnicity variables. Both the Mantel-Haenszel procedure and an IRT area-based technique were utilised to assess the degree of uniform and non-uniform DIF in a sample of ASVAB takers. Findings were mixed. At the item level, DIF fluctuated greatly. Numerous instances of DIF favouring the reference as well as the focal group were found. At the scale level, inconsistencies existed across the forms and versions. Tests varied in their tendency to be potentially biased against the focal group of interest and, at times, performed contrary to expectations. Implications for the ASVAB as well as other g-loaded selection instruments are considered.
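The IRT area-based technique mentioned above quantifies DIF as the area between the reference and focal item response functions. A simple numerical version (unsigned area over a bounded theta range, 3PL with illustrative parameters) is sketched below; it is a generic approximation, not the dissertation's implementation.

```python
# Unsigned area between reference and focal 3PL IRFs, approximated on a grid.
# Parameter values and the integration range are illustrative choices.
import numpy as np

def irf_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

def unsigned_area(params_ref, params_foc, lo=-4.0, hi=4.0, n=2001):
    theta = np.linspace(lo, hi, n)
    gap = np.abs(irf_3pl(theta, *params_ref) - irf_3pl(theta, *params_foc))
    return np.trapz(gap, theta)                  # area between the two IRFs

# Item with the same discrimination and guessing but shifted difficulty.
print(round(unsigned_area((1.2, 0.0, 0.15), (1.2, 0.4, 0.15)), 3))
```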