Showing papers on "Differential item functioning" published in 2012


Journal ArticleDOI
TL;DR: The EUROHIS-QOL 8-item index showed acceptable cross-cultural performance and a satisfactory discriminant validity and would be a useful measure to include in studies to assess treatment effectiveness.

177 citations


Journal ArticleDOI
TL;DR: The authors examine the impact of race/ethnicity on responses to the Everyday Discrimination Scale, one of the most widely used discrimination scales in epidemiologic and public health research; the findings suggest that the scale could potentially be used across racial/ethnic groups as originally intended.
Abstract: The authors examined the impact of race/ethnicity on responses to the Everyday Discrimination Scale, one of the most widely used discrimination scales in epidemiologic and public health research. Participants were 3,295 middle-aged US women (African-American, Caucasian, Chinese, Hispanic, and Japanese) from the Study of Women's Health Across the Nation (SWAN) baseline examination (1996-1997). Multiple-indicator, multiple-cause models were used to examine differential item functioning (DIF) on the Everyday Discrimination Scale by race/ethnicity. After adjustment for age, education, and language of interview, meaningful DIF was observed for 3 (out of 10) items: "receiving poorer service in restaurants or stores," "being treated as if you are dishonest," and "being treated with less courtesy than other people" (all P's < 0.001). Consequently, the "profile" of everyday discrimination differed slightly for women of different racial/ethnic groups, with certain "public" experiences appearing to have more salience for African-American and Chinese women and "dishonesty" having more salience for racial/ethnic minority women overall. "Courtesy" appeared to have more salience for Hispanic women only in comparison with African-American women. Findings suggest that the Everyday Discrimination Scale could potentially be used across racial/ethnic groups as originally intended. However, researchers should use caution with items that demonstrated DIF.

144 citations


Journal ArticleDOI
TL;DR: This article shows how item response models can be used to capture multiple response processes in psychological applications and shows that the response processes can be measured via pseudoitems derived from the observed responses.
Abstract: In this article, I show how item response models can be used to capture multiple response processes in psychological applications. Intuitive and analytical responses, agree-disagree answers, response refusals, socially desirable responding, differential item functioning, and choices among multiple options are considered. In each of these cases, I show that the response processes can be measured via pseudoitems derived from the observed responses. The estimation of these models via standard software programs that allow for missing data is also discussed. The article concludes with two detailed applications that illustrate the prevalence of multiple response processes.

143 citations


Journal ArticleDOI
TL;DR: The AAQ-II shows promise as a useful tool for the measurement of experiential avoidance in mild to moderately depressed and anxious populations and shows incremental validity beyond 5 mindfulness facets in explaining depression, anxiety, and positive mental health.
Abstract: The Acceptance and Action Questionnaire–II (AAQ-II) is a self-report measure designed to assess experiential avoidance as conceptualized in acceptance and commitment therapy (ACT). The current study is the first to evaluate the psychometric properties of the AAQ-II in a large sample of adults (N = 376) with mild to moderate levels of depression and anxiety who participated in a study on the effects of an ACT intervention. The internal construct validity and local measurement precision were investigated by fitting the data to a unidimensional item response theory (IRT) model, and the incremental validity of the AAQ-II beyond mindfulness, as measured by the Five Facet Mindfulness Questionnaire, was assessed. Results of the IRT analyses suggest that the AAQ-II is a unidimensional measure of experiential avoidance and has satisfactory reliability for group comparisons in mild to moderately depressed and anxious populations. Item functioning was found to be independent of gender and slightly dependent on age in this sample. Furthermore, the AAQ-II showed incremental validity beyond 5 mindfulness facets in explaining depression, anxiety, and positive mental health. This study suggests the AAQ-II shows promise as a useful tool for the measurement of experiential avoidance in mild to moderately depressed and anxious populations.

142 citations


Book
12 Oct 2012
TL;DR: This edited volume collects chapters on constructed-response testing, including a framework for studying differences between multiple-choice and free-response test items, psychometric models appropriate for constructed responses, and a pragmatic approach to constructed response and differential item functioning.
Abstract: Contents: Preface. R.E. Bennett, On the Meanings of Constructed Response. R.E. Traub, On the Equivalence of the Traits Assessed by Multiple-Choice and Constructed-Response Tests. R.E. Snow, Construct Validity and Constructed-Response Tests. S. Messick, Trait Equivalence as Construct Validity of Score Interpretation Across Multiple Methods of Measurement. R.J. Mislevy, A Framework for Studying Differences Between Multiple-Choice and Free-Response Test Items. K.K. Tatsuoka, Item Construction and Psychometric Models Appropriate for Constructed Responses. N.J. Dorans, A.P. Schmitt, Constructed Response and Differential Item Functioning: A Pragmatic Approach. J. Braswell, J. Kupin, Item Formats for Assessment in Mathematics. R. Camp, The Place of Portfolios in Our Changing Views of Writing Assessment. D.P. Wolf, Assessment as an Episode of Learning. D.H. Gitomer, Performance Assessment and Educational Measurement. C.A. Dwyer, Innovation and Reform: Examples from Teacher Assessment. T.W. Hartle, P.A. Battaglia, The Federal Role in Standardized Testing. S.P. Robinson, The Politics of Multiple-Choice Versus Free-Response Assessment.

132 citations


Journal ArticleDOI
TL;DR: The goal of this project was to review the status of ETS DIF analysis procedures, focusing on three aspects: the nature and stringency of the statistical rules used to flag items, the minimum sample size requirements that are currently in place for DIF analysis, and the efficacy of criterion refinement.
Abstract: Differential item functioning (DIF) analysis is a key component in the evaluation of the fairness and validity of educational tests. The goal of this project was to review the status of ETS DIF analysis procedures, focusing on three aspects: (a) the nature and stringency of the statistical rules used to flag items, (b) the minimum sample size requirements that are currently in place for DIF analysis, and (c) the efficacy of criterion refinement. The main findings of the review are as follows: • The ETS C rule often displays low DIF detection rates even when samples are large. • With improved flagging rules in place, minimum sample size requirements could probably be relaxed. In addition, updated rules for combining data across administrations could allow DIF analyses to be performed in a broader range of situations. • Refinement of the matching criterion improves detection rates when DIF is primarily in one direction but can depress detection rates when DIF is balanced. If nothing is known about the likely pattern of DIF, refinement is advisable. Each of these findings is discussed in detail, focusing on the case of dichotomous items.
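To make the flagging rules discussed in this abstract concrete, here is a minimal Python sketch of the Mantel-Haenszel D-DIF statistic for a dichotomous item, followed by a deliberately simplified A/B/C label. The function names and the input layout are assumptions made for illustration, not taken from the report. The operational ETS rules also involve significance tests on the statistic; this sketch uses only the size thresholds (1.0 and 1.5 on the delta scale), so it should not be read as the ETS procedure itself.

```python
import math


def mantel_haenszel_d_dif(strata):
    """Return (alpha_MH, MH D-DIF) for one dichotomous item.

    strata: iterable of (ref_correct, ref_wrong, focal_correct, focal_wrong)
    counts, one tuple per matched total-score level (hypothetical layout).
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # empty score level carries no information
        num += a * d / n  # reference correct x focal wrong
        den += b * c / n  # reference wrong x focal correct
    if num == 0 or den == 0:
        raise ValueError("common odds ratio is undefined for these counts")
    alpha = num / den
    # Delta-scale transformation; negative values favor the reference group.
    return alpha, -2.35 * math.log(alpha)


def size_only_category(d_dif):
    """Simplified label using effect size only (significance tests omitted)."""
    size = abs(d_dif)
    if size < 1.0:
        return "A"
    if size >= 1.5:
        return "C"
    return "B"


if __name__ == "__main__":
    counts = [(40, 10, 30, 20), (30, 20, 20, 30)]  # made-up counts
    alpha, d_dif = mantel_haenszel_d_dif(counts)
    print(round(alpha, 3), round(d_dif, 3), size_only_category(d_dif))
```

Because the significance conditions are left out, this labeling flags more items as B or C than the full rule would for the same data.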

128 citations


Journal ArticleDOI
TL;DR: In this article, the authors evaluated the measurement invariance of the Mental Health Continuum-Short Form (MHC-SF), a 14-item self-report questionnaire for measuring emotional, social, and psychological well-being.
Abstract: This study evaluated the measurement invariance of the Mental Health Continuum-Short Form (MHC-SF), a 14-item self-report questionnaire for measuring emotional, social, and psychological well-being. The study draws on data of a representative panel (Longitudinal Internet Studies for the Social Sciences of CentERdata). 1,932 Dutch adults filled out the MHC-SF at four timepoints over 9 months. We used item response theory analyses with two-parameter models to examine differential item functioning across demographics, health indicators, and timepoints. The results indicated differences in the performance of one item (social well-being) for educational level, one item (social well-being) for sex, and two items (psychological well-being) for age. The MHC-SF is highly reliable over time, as there was no differential item functioning across the four timepoints. Furthermore, the means and reliabilities of the subscales were consistent over time. The MHC-SF is a reliable and valid instrument to measure positive aspects of mental health.

121 citations


Journal ArticleDOI
TL;DR: This paper investigated the potential for a shared-L1 advantage on an academic English listening test featuring speakers with L2 accents and found that Japanese L1 listeners were advantaged on a small number of items on the test featuring the Japanese-accented speaker, but these were balanced by items which favored no...
Abstract: This paper reports on an investigation of the potential for a shared-L1 advantage on an academic English listening test featuring speakers with L2 accents. Two hundred and twelve second-language listeners (including 70 Mandarin Chinese L1 listeners and 60 Japanese L1 listeners) completed three versions of the University Test of English as a Second Language (UTESL) listening sub-test which featured an Australian English-accented speaker, a Japanese-accented speaker and a Mandarin Chinese-accented speaker. Differential item functioning (DIF) analyses were conducted on data from the tests which featured L2-accented speakers using two methods of DIF detection – the standardization procedure and the Mantel-Haenszel procedure – with candidates matched for ability on the test featuring the Australian English-accented speaker. Findings showed that Japanese L1 listeners were advantaged on a small number of items on the test featuring the Japanese-accented speaker, but these were balanced by items which favoured no...
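As an illustration of the standardization procedure named in this abstract, the following Python sketch computes a standardized proportion-correct difference for one item, weighting each matched ability level by the number of focal-group test takers at that level. The function name, the input layout, and the choice of focal-group weights are assumptions for the example; the abstract does not specify how the authors configured the procedure.

```python
def std_p_dif(levels):
    """Standardized p-difference (focal minus reference) for one item.

    levels: iterable of (focal_n, focal_correct, ref_n, ref_correct),
    one tuple per matched score level (hypothetical layout).
    Weights are the focal-group counts at each level (an assumption).
    """
    weighted_sum = 0.0
    total_weight = 0
    for focal_n, focal_correct, ref_n, ref_correct in levels:
        if focal_n == 0 or ref_n == 0:
            continue  # no comparison is possible at this level
        diff = focal_correct / focal_n - ref_correct / ref_n
        weighted_sum += focal_n * diff
        total_weight += focal_n
    if total_weight == 0:
        raise ValueError("no score level contains both groups")
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Made-up counts for two matched score levels.
    print(std_p_dif([(10, 5, 20, 12), (30, 24, 40, 28)]))
```

A positive value means the focal group did better than matched reference-group members on the item; a negative value means the reverse.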

107 citations


Journal ArticleDOI
TL;DR: In this paper, the authors use the Rasch model for dichotomous responses as a theoretical basis to prove that the source of artificial DIF in the Mantel-Haenszel (MH) procedure is that estimates of the person locations are substituted for their unknown values.
Abstract: The literature in modern test theory on procedures for identifying items with differential item functioning (DIF) among two groups of persons includes the Mantel–Haenszel (MH) procedure. Generally, it is not recognized explicitly that if there is real DIF in some items which favor one group, then as an artifact of this procedure, artificial DIF that favors the other group is induced in the other items. Using the Rasch model for dichotomous responses as a theoretical basis, this article proves that the source of artificial DIF in the MH procedure is that estimates of the person locations are substituted for their unknown values. The article then demonstrates that the formalization of artificial DIF implies mathematically (a) a particular sequential, iterative procedure for detecting items with real DIF and for identifying a set of items that may have no DIF and (b) a resolution of the items with real DIF for quantifying the DIF on the same metric as the items showing no DIF and provides expected value curv...

105 citations


Journal ArticleDOI
TL;DR: The results show that there were several problems with the DLQI, including misfitting items, DIF by disease, age, and gender, disordered response thresholds, and inadequate measurement of patients with mild illness.

101 citations


Journal ArticleDOI
TL;DR: The analysis suggested DIF was mitigated for scales including both ADL and IADL and that summary indexes (counts of limitations) likely underestimate mean disability in these international populations.
Abstract: Objective. To examine the measurement equivalence of items on disability across three international surveys of aging. Method. Data for persons aged 65 and older were drawn from the Health and Retirement Survey (HRS, n = 10,905), English Longitudinal Study of Aging (ELSA, n = 5,437), and Survey of Health, Ageing and Retirement in Europe (SHARE, n = 13,408). Differential item functioning (DIF) was assessed using item response theory (IRT) methods for activities of daily living (ADL) and instrumental activities of daily living (IADL) items. Results. HRS and SHARE exhibited measurement equivalence, but 6 of 11 items in ELSA demonstrated meaningful DIF. At the scale level, this item-level DIF affected scores reflecting greater disability. IRT methods also spread out score distributions and shifted scores higher (toward greater disability). Results for mean disability differences by demographic characteristics, using original and DIF-adjusted scores, were the same overall but differed for some subgroup comparisons involving ELSA. Discussion. Testing and adjusting for DIF is one means of minimizing measurement error in cross-national survey comparisons. IRT methods were used to evaluate potential measurement bias in disability comparisons across three international surveys of aging. The analysis also suggested DIF was mitigated for scales including both ADL and IADL and that summary indexes (counts of limitations) likely underestimate mean disability in these international populations.

Journal ArticleDOI
TL;DR: This study supports the internal validity and reliability of the BBS-12 as a measurement tool independent of the etiology of the neurologic disease causing the balance impairment.

Journal ArticleDOI
TL;DR: The validation process resulted in a revised 30-item AddiQoL questionnaire and an eight-item short version with good psychometric properties and high reliability, and no difference between patients with isolated AD and those with concomitant diseases.
Abstract: Context: Patients with Addison's disease (AD) self-report impairment in specific dimensions on well-being questionnaires. An AD-specific quality-of-life questionnaire (AddiQoL) was developed to aid evaluation of patients. Objective: We aimed to translate and determine construct validity, reliability, and concurrent validity of the AddiQoL questionnaire. Methods: After translation, the final versions were tested in AD patients from Norway (n = 107), Sweden (n = 101), Italy (n = 165), Germany (n = 200), and Poland (n = 50). Construct validity was examined by exploratory factor analysis and Rasch analysis, aiming at unidimensionality and fit to the Rasch model. Reliability was determined by Cronbach's coefficient-α and Person separation index. Longitudinal reliability was tested by differential item functioning in stable patient subgroups. Concurrent validity was examined in Norwegian (n = 101) and Swedish (n = 107) patients. Results: Exploratory factor analysis and Rasch analysis identified six items with p...

Journal ArticleDOI
TL;DR: Study findings support the use of the selected PROMIS short forms for comparing symptoms and quality of life indicators across different diagnoses and age ranges.

Journal ArticleDOI
TL;DR: The results confirm the potential to validly measure subjective qualities of meaningful activity participation and the EMAS can be used to evaluate processes and outcomes central to occupational therapy practice and to aid in the design of therapeutic occupations.
Abstract: Objective: This study evaluated the measurement characteristics of the Engagement in Meaningful Activities Survey (EMAS) in an age-diverse sample. Method: The sample included 154 older adults and 122 college students (age range = 18-100 yr). A Rasch-Andrich rating scale model was used to evaluate the EMAS. Analyses addressed rating scale design, person and item fit, item hierarchy, model unidimensionality, and differential item functioning. Results: Category functioning was improved by reducing the EMAS item responses to four categories. Adequate person response validity was established, and all but one EMAS item demonstrated an ideal fit to the Rasch measurement model. After establishing the item hierarchy, I found the EMAS to be a unidimensional measure. Differential item functioning was not detected using Bonferroni-adjusted statistical criteria. Conclusion: The results confirm the potential to validly measure subjective qualities of meaningful activity participation. The EMAS can be used to evaluate processes and outcomes central to occupational therapy practice and to aid in the design of therapeutic occupations.

Journal ArticleDOI
TL;DR: This study compares 11 variations on 5 qualitatively different approaches from recent literature for selecting optimal anchor items and indicates that for nearly all conditions, an easily implemented 2-stage procedure recently put forth by Lopez Rivas, Stark, and Chernyshenko (2009) provided optimal power while maintaining nominal Type I error.
Abstract: The efficacy of tests of differential item functioning (measurement invariance) has been well established. It is clear that when properly implemented, these tests can successfully identify differentially functioning (DF) items when they exist. However, an assumption of these analyses is that the metric for different groups is linked using anchor items that are invariant. In practice, however, it is impossible to be certain which items are DF and which are invariant. This problem of anchor items, or referent indicators, has long plagued invariance research, and a multitude of suggested approaches have been put forth. Unfortunately, the relative efficacy of these approaches has not been tested. This study compares 11 variations on 5 qualitatively different approaches from recent literature for selecting optimal anchor items. A large-scale simulation study indicates that for nearly all conditions, an easily implemented 2-stage procedure recently put forth by Lopez Rivas, Stark, and Chernyshenko (2009) provided optimal power while maintaining nominal Type I error. With this approach, appropriate anchor items can be easily and quickly located, resulting in more efficacious invariance tests. Recommendations for invariance testing are illustrated using a pedagogical example of employee responses to an organizational culture measure.

Journal ArticleDOI
14 Dec 2012-PLOS ONE
TL;DR: The PHQ-9 can reasonably be used without adjustment in Canadian English- and French-speaking samples and analyses assessing measurement equivalence should be routinely conducted prior to pooling data from English and French versions of patient-reported outcome measures.
Abstract: Background: Medical research increasingly utilizes patient-reported outcome measures administered and scored in different languages. In order to pool or compare outcomes from different language versions, instruments should be measurement equivalent across linguistic groups. The objective of this study was to examine the cross-language measurement equivalence of the Patient Health Questionnaire-9 (PHQ-9) between English- and French-speaking Canadian patients with systemic sclerosis (SSc). Methods: The sample consisted of 739 English- and 221 French-speaking SSc patients. Multiple-Indicator Multiple-Cause (MIMIC) modeling was used to identify items displaying possible differential item functioning (DIF). Results: A one-factor model for the PHQ-9 fit the data well in both English- and French-speaking samples. Statistically significant DIF was found for 3 of 9 items on the PHQ-9. However, the overall estimate in depression latent scores between English- and French-speaking respondents was not influenced substantively by DIF. Conclusions: Although there were several PHQ-9 items with evidence of minor DIF, there was no evidence that these differences influenced overall scores meaningfully. The PHQ-9 can reasonably be used without adjustment in Canadian English- and French-speaking samples. Analyses assessing measurement equivalence should be routinely conducted prior to pooling data from English and French versions of patient-reported outcome measures.

Journal ArticleDOI
15 Mar 2012-Memory
TL;DR: The results show that the scale derived from responses to the AMT operates well over a wide range of scores, consistent with the aim of deriving a continuous measure of over-general memory.
Abstract: Although the Autobiographical Memory Test (AMT) is widely used, its psychometric properties have rarely been investigated. This paper utilises data gathered from a 10-item written version of the AMT, completed by 5792 adolescents participating in the Avon Longitudinal Study of Parents and Children, to examine the psychometric properties of the measure. The results show that the scale derived from responses to the AMT operates well over a wide range of scores, consistent with the aim of deriving a continuous measure of over-general memory. There was strong evidence of group differences in terms of gender, low negative mood, and IQ, and these were in agreement when comparing an item response theory (IRT) approach with that based on a sum score. One advantage of the IRT model is the ability to assess and consequently allow for differential item functioning. This additional analysis showed evidence of response bias for both gender and mood, resulting in attenuation in the mean differences in AMT across these groups. Implications of the findings for the use of the AMT measure in different samples are discussed.

Journal ArticleDOI
TL;DR: The newly developed PFS-12 can be used to assess fatigue in African-American and Caucasian breast cancer survivors and reduces response burden without compromising reliability or validity.
Abstract: Brief, valid measures of fatigue, a prevalent and distressing cancer symptom, are needed for use in research. This study’s primary aim was to create a shortened version of the revised Piper Fatigue Scale (PFS-R) based on data from a diverse cohort of breast cancer survivors. A secondary aim was to determine whether the PFS captured multiple distinct aspects of fatigue (a multidimensional model) or a single overall fatigue factor (a unidimensional model). Breast cancer survivors (n = 799; stages in situ through IIIa; ages 29–86 years) were recruited through three SEER registries (New Mexico, Western Washington, and Los Angeles, CA) as part of the Health, Eating, Activity, and Lifestyle (HEAL) study. Fatigue was measured approximately 3 years post-diagnosis using the 22-item PFS-R that has four subscales (Behavior, Affect, Sensory, and Cognition). Confirmatory factor analysis was used to compare unidimensional and multidimensional models. Six criteria were used to make item selections to shorten the PFS-R: scale’s content validity, items’ relationship with fatigue, content redundancy, differential item functioning by race and/or education, scale reliability, and literacy demand. Factor analyses supported the original 4-factor structure. There was also evidence from the bi-factor model for a dominant underlying fatigue factor. Six items tested positive for differential item functioning between African-American and Caucasian survivors. Four additional items either showed poor association, local dependence, or content validity concerns. After removing these 10 items, the reliability of the PFS-12 subscales ranged from 0.87 to 0.89, compared to 0.90–0.94 prior to item removal. The newly developed PFS-12 can be used to assess fatigue in African-American and Caucasian breast cancer survivors and reduces response burden without compromising reliability or validity. This is the first study to determine PFS literacy demand and to compare PFS-R responses in African-American and Caucasian breast cancer survivors. Further testing in diverse populations is warranted.

Journal ArticleDOI
TL;DR: The authors propose two item-selection criteria that utilize information from a lognormal model for response times, and modifies the maximum information criterion to maximize information per time unit.
Abstract: Traditional methods for item selection in computerized adaptive testing only focus on item information without taking into consideration the time required to answer an item. As a result, some examinees may receive a set of items that take a very long time to finish, and information is not accrued as efficiently as possible. The authors propose two item-selection criteria that utilize information from a lognormal model for response times. The first modifies the maximum information criterion to maximize information per time unit. The second is an inverse time-weighted version of a-stratification that takes advantage of the response time model, but achieves more balanced item exposure than the information-based techniques. Simulations are conducted to compare these procedures against their counterparts that ignore response times, and efficiency of estimation, time-required, and item exposure rates are assessed.
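The first criterion described in this abstract, maximizing information per time unit, can be sketched in Python as follows. Several pieces are assumptions made for illustration rather than details from the abstract: the two-parameter logistic (2PL) model for item information, the lognormal parameterization in which the log response time is normal with mean `beta - tau` and standard deviation `1/alpha`, the dictionary layout of the item pool, and all function names.

```python
import math


def fisher_information_2pl(theta, a, b):
    """Item information under an assumed 2PL model."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def expected_time_lognormal(tau, alpha, beta):
    """Expected response time when ln T ~ Normal(beta - tau, 1 / alpha**2).

    tau is the examinee's speed; beta is the item's time intensity
    (assumed parameterization).
    """
    return math.exp(beta - tau + 1.0 / (2.0 * alpha * alpha))


def select_item(theta, tau, items):
    """Pick the item with the most information per expected time unit."""
    def info_per_time(item):
        info = fisher_information_2pl(theta, item["a"], item["b"])
        return info / expected_time_lognormal(tau, item["alpha"], item["beta"])
    return max(items, key=info_per_time)


if __name__ == "__main__":
    pool = [  # made-up item parameters
        {"id": "slow", "a": 1.5, "b": 0.0, "alpha": 1.0, "beta": 2.0},
        {"id": "fast", "a": 1.2, "b": 0.0, "alpha": 1.0, "beta": 0.5},
    ]
    print(select_item(theta=0.0, tau=0.0, items=pool)["id"])
```

In this toy pool the more informative item is also much slower, so the time-aware criterion prefers the other item, whereas a pure maximum-information rule would pick the slow one.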

Journal ArticleDOI
TL;DR: An examination of the developmental sequence of letter name knowledge acquisition by children from 2 to 5 years of age indicated an approximate developmental sequence in letter name learning for the simplest and most challenging to learn letters, but with no clear sequence between these extremes.

Journal ArticleDOI
TL;DR: The ARAT possesses good psychometric properties in stroke patients with mild to moderate motor severity and without severe cognitive impairment, and has evidence of unidimensionality, predictive validity, and reliability.

Journal ArticleDOI
TL;DR: In this article, the authors developed a new method by adding a scale purification procedure to the rank-based method and conducted two simulation studies to evaluate its performance in DIF assessment.
Abstract: The DIF-free-then-DIF (DFTD) strategy consists of two steps: (a) select a set of items that are the most likely to be DIF-free and (b) assess the other items for DIF (differential item functioning) using the designated items as anchors. The rank-based method together with the computer software IRTLRDIF can select a set of DIF-free polytomous items very accurately, but it loses accuracy when tests contain many DIF items. To resolve this problem, the authors developed a new method by adding a scale purification procedure to the rank-based method and conducted two simulation studies to evaluate its performance in DIF assessment. It was found that the new method outperformed the rank-based method in identifying DIF-free items, especially when the tests contained many DIF items. In addition, the new method, combined with the DFTD strategy, yielded a well-controlled Type I error rate and a high power rate of DIF detection. In contrast, conventional DIF assessment methods yielded an inflated Type I error rate a...

Journal ArticleDOI
TL;DR: Application of the Rasch model produced a valid and reliable scale measuring adolescent attitudes towards abortion, with stable measurement properties, and shows the value of this model in developing scales for both social science and health disciplines.
Abstract: Background: Measurement scales seeking to quantify latent traits, like attitudes, are often developed using traditional psychometric approaches. Application of the Rasch unidimensional measurement model may complement or replace these techniques, as the model can be used to construct scales and check their psychometric properties. If data fit the model, then a scale with invariant measurement properties, including interval-level scores, will have been developed. Aims: This paper highlights the unique properties of the Rasch model. Items developed to measure adolescent attitudes towards abortion are used to exemplify the process. Method: Ten attitude and intention items relating to abortion were answered by 406 adolescents aged 12 to 19 years, as part of the “Teen Relationships Study”. The sampling framework captured a range of sexual and pregnancy experiences. Items were assessed for fit to the Rasch model including checks for Differential Item Functioning (DIF) by gender, sexual experience or pregnancy experience. Results: Rasch analysis of the original dataset initially demonstrated that some items did not fit the model. Rescoring of one item (B5) and removal of another (L31) resulted in fit, as shown by a non-significant item-trait interaction total chi-square and a mean log residual fit statistic for items of -0.05 (SD=1.43). No DIF existed for the revised scale. However, items did not distinguish as well amongst persons with the most intense attitudes as they did for other persons. A person separation index of 0.82 indicated good reliability. Conclusion: Application of the Rasch model produced a valid and reliable scale measuring adolescent attitudes towards abortion, with stable measurement properties. The Rasch process provided an extensive range of diagnostic information concerning item and person fit, enabling changes to be made to scale items. This example shows the value of the Rasch model in developing scales for both social science and health disciplines. Key Words: Rasch unidimensional measurement model, adolescent, abortion, attitudes, attitude scale

Journal ArticleDOI
TL;DR: Initial evaluation revealed that the SCI-FI achieved considerable breadth of coverage in each content domain and demonstrated acceptable psychometric properties, which will minimize assessment burden, while allowing for the comprehensive assessment of the functional abilities of adults with SCI.

Journal ArticleDOI
TL;DR: The purpose of this article is to present the clinician and researcher with a contemporary 8-stage framework for measurement scale development based on a mixed-methods qualitative and quantitative approach.

Journal ArticleDOI
TL;DR: The refined AS-20 may prove to be even more responsive to HRQOL changes in adult strabismus following treatment or changes over time.
Abstract: Purpose: To further refine the Adult Strabismus 20 (AS-20) health-related quality of life (HRQOL) questionnaire using Rasch analysis. Methods: Rasch analysis was performed independently on the original AS-20 using the following steps: dimensionality, response ordering, local dependence, infit and outfit analyses, differential item functioning, subject targeting, and confirmatory dimensionality. Results: Two subscales were present in each of the original AS-20 subscales, for a total of 4 subscales, which were labeled "self-perception," "interaction," "reading function," and "general function." Response ordering was appropriate for 3 of the subscales but required reduction to 4 response options for the fourth subscale. No notable local dependence was found for any subscale. As a result of fit analysis, 2 items were removed, 1 each from 2 subscales. No significant differential item functioning was seen for sex or age. The resulting 5-item self-perception subscale and 4-item reading function subscale are reliable and target the adult strabismus patient cohort appropriately. The resulting 5-item interaction subscale and 4-item general function subscale have less than optimal reliability. Conclusions: The AS-20 benefits from reduction to 4 subscales (self-perception, interaction, reading function, and general function) and reducing the response options in the general function subscale from 5 to 4 options. The refined AS-20 may prove to be even more responsive to HRQOL changes in adult strabismus following treatment or changes over time.

Journal ArticleDOI
TL;DR: This paper investigated a version of the International English Language Testing System (IELTS) listening test for evidence of differential item functioning (DIF) based on gender, nationality, age, and degree of previous exposure to the test.
Abstract: This article investigates a version of the International English Language Testing System (IELTS) listening test for evidence of differential item functioning (DIF) based on gender, nationality, age, and degree of previous exposure to the test. Overall, the listening construct was found to be underrepresented, which is probably an important cause of the observed lack of significant correlation between awarded scores on while-listening performance (WLP) tests and subsequent academic performance. Some short answer items were biased toward higher-ability subgroups, likely due to those test takers' higher ability to apply what they had understood. Finally, some multiple-choice questions (MCQs) with few options likely encouraged attempts at lucky guesses, particularly among low-ability people who had received training in test-taking strategies. Implications for listening assessment and language education are discussed.

Journal ArticleDOI
TL;DR: The authors examined whether students' English language status has an impact on their inquiry science performance using a multifaceted Rasch differential item functioning (DIF) model and found that non-ELLs significantly outperformed ELLs.
Abstract: The performance of English language learners (ELLs) has been a concern given the rapidly changing demographics in US K-12 education. This study aimed to examine whether students' English language status has an impact on their inquiry science performance. Differential item functioning (DIF) analysis was conducted with regard to ELL status on an inquiry-based science assessment, using a multifaceted Rasch DIF model. A total of 1,396 seventh- and eighth-grade students took the science test, including 313 ELL students. The results showed that, overall, non-ELLs significantly outperformed ELLs. Of the four items that showed DIF, three favored non-ELLs while one favored ELLs. The item that favored ELLs provided a graphic representation of a science concept within a family context. There is some evidence that constructed-response items may help ELLs articulate scientific reasoning using their own words. Assessment developers and teachers should pay attention to the possible interaction between linguistic challen

Journal ArticleDOI
TL;DR: The illustrative analyses demonstrate the value of latent variable mixture modeling in revealing the potential implications of sample heterogeneity in the measurement of PROs.
Abstract: Purpose: A fundamental assumption of patient-reported outcomes (PRO) measurement is that all individuals interpret questions about their health status in a consistent manner, such that a measurement model can be constructed that is equivalently applicable to all people in the target population. The related assumption of sample homogeneity has been assessed in various ways, including the many approaches to differential item functioning analysis.