
Showing papers in "Educational and Psychological Measurement in 2007"


Journal ArticleDOI
TL;DR: The Comprehensive Assessment of Team Member Effectiveness (CATME) is a team member effectiveness instrument that measures 29 types of team member contributions with 3 items each. These fall into five categories (contributing to the team's work; interacting with teammates; keeping the team on track; expecting quality; and having relevant knowledge, skills, and abilities).
Abstract: This article describes the development of the Comprehensive Assessment of Team Member Effectiveness. The authors used the teamwork literature to create potential items, which they tested using two surveys of college students (Ns = 2,777 and 1,157). The authors used exploratory factor analysis and confirmatory factor analysis to help them select items for the final instrument. The full instrument has 87 items that measure 29 types of team member contributions with 3 items each. These fall into five categories (contributing to the team's work; interacting with teammates; keeping the team on track; expecting quality; and having relevant knowledge, skills, and abilities). A short version of the instrument has 33 items. Potential uses for the instrument and suggestions for future research are discussed.

244 citations


Journal ArticleDOI
TL;DR: This article used confirmatory factor analysis to assess the measurement of school engagement in prior research that used the National Educational Longitudinal Study of 1988 (NELS:88), systematically developed an improved measurement model for school engagement, and examined the measurement invariance of this model across racial and ethnic groups.
Abstract: The purposes of this study were to (a) assess the measurement of school engagement in prior research that used the National Educational Longitudinal Study of 1988 (NELS:88), (b) systematically develop an improved measurement model for school engagement, and (c) examine the measurement invariance of this model across racial and ethnic groups. Results from confirmatory factor analyses indicated that school engagement should be measured as a multidimensional concept. A higher order measurement model in which behavioral and psychological engagement are second-order latent variables that influence several subdimensions is consistent with the data. Results from a series of multiple group analyses indicated that the proposed measurement model exhibits measurement invariance for White, African American, Latino, and Asian students. Therefore, it is appropriate to compare the effects of the dimensions of engagement across these groups. The results demonstrate the advantages of confirmatory factor analysis for enhancing the understanding and measurement of school engagement.

198 citations
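
To make the invariance-testing workflow concrete, here is a minimal sketch in Python using the semopy package (an assumption; the study does not name its software, and the actual NELS:88 indicators and the higher-order structure differ from the illustrative names below). It fits the same engagement CFA separately in each racial or ethnic group, the configural step that precedes constraining loadings equal across groups.

```python
import pandas as pd
import semopy

# Illustrative first-order model; real NELS:88 indicator names differ.
MODEL = """
behavioral    =~ attendance + homework + preparation
psychological =~ belonging + interest + valuing
"""

def fit_group(df: pd.DataFrame) -> dict:
    """Fit the CFA in one group and return common fit indices."""
    model = semopy.Model(MODEL)
    model.fit(df)
    stats = semopy.calc_stats(model)  # one-row DataFrame of fit indices
    return stats[["chi2", "CFI", "RMSEA"]].iloc[0].to_dict()

data = pd.read_csv("engagement_items.csv")   # hypothetical input file
for group, df in data.groupby("ethnicity"):  # hypothetical grouping column
    print(group, fit_group(df.drop(columns="ethnicity")))
```

Comparable fit across groups supports configural invariance; metric and scalar invariance would then be tested by equating loadings and intercepts across groups and checking the change in fit.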


Journal ArticleDOI
TL;DR: This paper conducted a reliability generalization study to examine the typical reliability coefficients of BIDR scores and to explore factors that explained the variability of reliability estimates across studies, concluding that the overall Balanced Inventory of Desirable Responding (BIDR) scale produced scores that were adequately reliable.
Abstract: The Balanced Inventory of Desirable Responding (BIDR) is one of the most widely used social desirability scales. The authors conducted a reliability generalization study to examine the typical reliability coefficients of BIDR scores and explored factors that explained the variability of reliability estimates across studies. The results indicated that the overall BIDR scale produced scores that were adequately reliable. The mean score reliability estimates for the two subscales, Self-Deception Enhancement and Impression Management, were not satisfactory. In addition, although a number of study characteristics were statistically significantly related to reliability estimates, they accounted for only a small portion of the overall variability in reliability estimates. These findings and their implications are discussed.

151 citations
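
A reliability generalization study pools reliability coefficients across studies and regresses them on study characteristics. The sketch below uses invented numbers, the Hakstian-Whalen cube-root transformation of alpha for pooling, and a weighted least-squares moderator regression; the authors' actual coding scheme and weighting choices are not reproduced here.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical inputs: alpha estimates from five studies, their sample
# sizes, and one study characteristic (mean sample age) as a moderator.
alphas = np.array([0.83, 0.74, 0.79, 0.68, 0.81])
n      = np.array([120, 340, 95, 210, 150])
age    = np.array([19.2, 34.5, 20.1, 41.0, 22.3])

# Hakstian-Whalen transformation: t = (1 - alpha)^(1/3) is roughly normal.
t = (1.0 - alphas) ** (1.0 / 3.0)

# Size-weighted mean, back-transformed to the alpha metric.
t_bar = np.average(t, weights=n)
print(f"pooled alpha ~ {1.0 - t_bar ** 3:.3f}")

# Moderator analysis: regress transformed reliabilities on study features.
X = sm.add_constant(age)
print(sm.WLS(t, X, weights=n).fit().summary())
```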


Journal ArticleDOI
TL;DR: In this paper, a revised version of the School Level Environment Questionnaire (SLEQ) was validated using a sample of teachers from a large school district, and five school environment factors emerged.
Abstract: Scores from a revised version of the School Level Environment Questionnaire (SLEQ) were validated using a sample of teachers from a large school district. An exploratory factor analysis was used with a randomly selected half of the sample. Five school environment factors emerged. A confirmatory factor analysis was run with the remaining half of the sample. Goodness-of-fit indices indicated that the factor structure fit the data reasonably well. Further analyses using structural equation modeling techniques revealed that the Revised SLEQ worked equally well for all samples. Invariance testing showed that the fitted model and the estimated parameter values were statistically equivalent across all samples. Internal consistency estimates provided further evidence of the reliability of factor scores. In addition, an analysis of variance indicated that the instrument discriminated climate differences between schools. Results suggest that the Revised SLEQ provides a good tool for studying teachers' perceptions of their school environments.

137 citations


Journal ArticleDOI
TL;DR: In this article, the reliability, structural validity, and concurrent validity of Zimbardo Time Perspective Inventory (ZTPI) scores in a group of 815 academically talented adolescents were examined.
Abstract: In this study, the authors examined the reliability, structural validity, and concurrent validity of Zimbardo Time Perspective Inventory (ZTPI) scores in a group of 815 academically talented adolescents. Reliability estimates of the purported factors' scores were in the low to moderate range. Exploratory factor analysis supported a five-factor structure similar to the one proposed by Zimbardo and Boyd but also provided support for a six-factor structure that included an additional factor reflecting negative feelings about the future. ZTPI subscale intercorrelations were generally low, and intercorrelations between ZTPI subscale scores and other constructs were low, but in the expected directions. Results are discussed in light of the ZTPI's application to an adolescent-aged population.

135 citations


Journal ArticleDOI
TL;DR: This paper evaluates two unresolved implementation issues with logistic regression (LR) for differential item functioning (DIF) detection, ability purification and effect size use, and examines the effectiveness of such controls, especially when used in combination.
Abstract: Two unresolved implementation issues with logistic regression (LR) for differential item functioning (DIF) detection include ability purification and effect size use. Purification is suggested to control inaccuracies in DIF detection as a result of DIF items in the ability estimate. Additionally, effect size use may be beneficial in controlling Type I error rates. The effectiveness of such controls, especially used in combination, requires evaluation. Detection errors were evaluated through simulation across iterative purification and no purification procedures with and without the use of an effect size criterion. Sample size, DIF magnitude and percentage, and ability differences were manipulated. Purification was beneficial under certain conditions, although overall power and Type I error rates did not substantially improve. The LR statistical test without purification performed as well as other classification criteria and may be the practical choice for many situations. Continued evaluation of the effectiveness of such controls is recommended.

113 citations
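
The LR DIF test and the purification loop can be sketched compactly. The code below is illustrative rather than the authors' implementation: each item gets a 2-df likelihood-ratio test for uniform plus nonuniform DIF, the matching total score is recomputed without flagged items until the flag set stabilizes, and McFadden's pseudo-R² change stands in for the Nagelkerke effect size more commonly reported.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

def lr_dif(item, total, group):
    """2-df LR test of uniform + nonuniform DIF, plus pseudo-R^2 change."""
    X0 = sm.add_constant(np.column_stack([total]))
    X1 = sm.add_constant(np.column_stack([total, group, total * group]))
    m0 = sm.Logit(item, X0).fit(disp=0)
    m1 = sm.Logit(item, X1).fit(disp=0)
    p = stats.chi2.sf(2 * (m1.llf - m0.llf), df=2)
    return p, m1.prsquared - m0.prsquared  # effect-size proxy

def purified_dif(responses: pd.DataFrame, group, alpha=0.05, max_iter=10):
    """Iterative purification: re-score without currently flagged items."""
    flagged: set = set()
    for _ in range(max_iter):
        clean = [c for c in responses.columns if c not in flagged]
        total = responses[clean].sum(axis=1).to_numpy(float)
        new = {c for c in responses.columns
               if lr_dif(responses[c].to_numpy(float), total, group)[0] < alpha}
        if new == flagged:
            break
        flagged = new
    return flagged
```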


Journal ArticleDOI
TL;DR: This study compared four methods for setting item response time thresholds to differentiate rapid-guessing behavior from solution behavior, indicating that response time effort is not very sensitive to the particular threshold identification method used.
Abstract: This study compared four methods for setting item response time thresholds to differentiate rapid-guessing behavior from solution behavior. Thresholds were either (a) common for all test items, (b) based on item surface features such as the amount of reading required, (c) based on visually inspecting response time frequency distributions, or (d) statistically estimated using a two-state mixture model. The thresholds were compared using the criteria proposed by Wise and Kong to establish the reliability and validity of response time effort scores, which were generated on the basis of the specified threshold values. The four methods yielded very similar results, indicating that response time effort is not very sensitive to the particular threshold identification method used. Recommendations are given regarding use of the various methods.

107 citations
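
Of the four threshold methods, the statistically estimated one is the least transparent; below is a sketch that fits a two-component mixture to log response times with scikit-learn's GaussianMixture (an assumption; the study's two-state model is not necessarily Gaussian) and then computes the response time effort (RTE) index of Wise and Kong.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mixture_threshold(rt_seconds: np.ndarray) -> float:
    """Threshold where the fast (guessing) and slow (solution) components
    of a two-state mixture on log response times are equally likely."""
    x = np.log(rt_seconds).reshape(-1, 1)
    gm = GaussianMixture(n_components=2, random_state=0).fit(x)
    lo, hi = np.sort(gm.means_.ravel())
    grid = np.linspace(lo, hi, 500).reshape(-1, 1)
    fast = int(np.argmin(gm.means_.ravel()))
    post = gm.predict_proba(grid)[:, fast]
    return float(np.exp(grid[np.argmin(np.abs(post - 0.5)), 0]))

def rte(rts: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    """Response time effort: per-person share of responses slower than
    each item's threshold (rts is persons x items)."""
    return (rts > thresholds).mean(axis=1)
```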


Journal ArticleDOI
TL;DR: In this paper, two scales, Perceived Work Demand (PWD) and Perceived Family Demand (PFD), are developed and their scores validated using three diverse samples, and the results provide support for both perceived work and family demand scales.
Abstract: Two scales, Perceived Work Demand (PWD) and Perceived Family Demand (PFD), are developed and their scores validated using three diverse samples. The scales are of particular interest in the work-family conflict (WFC) area and provide needed clarification in predicting WFC. Scale development procedures were followed, and dimensionality, internal consistency, discriminant validity, and predictive validity results are discussed. The results provide support for both perceived work and family demand scales.

105 citations


Journal ArticleDOI
TL;DR: This article conducted a meta-analysis of computer-based and paper-and-pencil administration mode effects on K-12 student mathematics tests and found that the administration mode had no statistically significant effect on student mathematics test scores.
Abstract: This study conducted a meta-analysis of computer-based and paper-and-pencil administration mode effects on K-12 student mathematics tests. Both initial and final results based on fixed- and random-effects models are presented. The results based on the final selected studies with homogeneous effect sizes show that the administration mode had no statistically significant effect on K-12 student mathematics tests. Only the moderator variable of computer delivery algorithm contributed to predicting the effect size. The differences in scores between test modes were larger for linear tests than for adaptive tests. However, such variables as study design, grade level, sample size, type of test, computer delivery method, and computer practice did not lead to differences in student mathematics scores between computer-based and paper-and-pencil modes.

100 citations
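
The fixed- and random-effects pooling behind such a meta-analysis follows standard formulas; here is a self-contained sketch with invented mode-effect sizes (computer minus paper) and the DerSimonian-Laird between-study variance estimator.

```python
import numpy as np

def pool(d: np.ndarray, v: np.ndarray) -> dict:
    """Fixed-effect and DerSimonian-Laird random-effects pooled estimates
    for effect sizes d with sampling variances v."""
    w = 1.0 / v
    d_fixed = np.sum(w * d) / np.sum(w)
    q = np.sum(w * (d - d_fixed) ** 2)             # homogeneity statistic Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(d) - 1)) / c)        # between-study variance
    w_star = 1.0 / (v + tau2)
    return {"fixed": d_fixed,
            "random": np.sum(w_star * d) / np.sum(w_star),
            "Q": q, "tau2": tau2}

# Hypothetical effect sizes and variances from five mode-comparison studies.
d = np.array([0.05, -0.02, 0.10, 0.01, -0.04])
v = np.array([0.004, 0.006, 0.010, 0.003, 0.008])
print(pool(d, v))
```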


Journal ArticleDOI
TL;DR: This paper examined the measurement equivalence of both components of the Multigroup Ethnic Identity Measure (MEIM) across racial and ethnic groups using a sample of 1,349 White, Hispanic, African American, and Asian American adults.
Abstract: An increasing number of organizational researchers examine the effects of ethnic identity and other-group orientation. In doing so, many use Phinney's (1992) Multigroup Ethnic Identity Measure (MEIM), which purportedly allows simultaneous assessment of various groups. Although several studies demonstrate adequate validity and reliability for scores on the MEIM, the only two studies that have assessed its measurement equivalence across racial and ethnic groups (a) focus exclusively on the ethnic identity component, (b) use entirely adolescent samples, and (c) obtain somewhat mixed results. Because ethnic identity is still developing during adolescence, it cannot be assumed that equivalence or lack thereof among adolescents will generalize to adults. The present study examines the measurement equivalence of both components of the MEIM across racial and ethnic groups using a sample of 1,349 White, Hispanic, African American, and Asian American adults. The results suggest that Roberts et al.'s revised version...

92 citations


Journal ArticleDOI
TL;DR: The internal and external validity of scores on the two-scale Balanced Inventory of Desirable Responding (BIDR) and its recent revision, the Paulhus Deception Scales (PDS), developed to measure two facets of social desirability, were studied with three groups of forensic clients and two groups of college undergraduates.
Abstract: The internal and external validity of scores on the two-scale Balanced Inventory of Desirable Responding (BIDR) and its recent revision, the Paulhus Deception Scales (PDS), developed to measure two facets of social desirability, were studied with three groups of forensic clients and two groups of college undergraduates (total N = 519). The two scales were statistically significantly related in all groups and for both versions of the inventory. A two-factor congeneric, orthogonal measurement model was rejected for all groups. However, a two-factor model that allowed cross-loadings among the items and correlation between the factors provided adequate fit. Concurrent validity data showed scores on both the Impression Management and Self-Deceptive Enhancement (SDE) scales to be satisfactory measures of their respective constructs and also of general social desirability, for both forensic clients and undergraduates. An exception was found in lower validity correlates for scores on the SDE scale in the PDS form.

Journal ArticleDOI
TL;DR: This article extended the three-factor measure of achievement goals in a work domain to the four-factor conceptualization by adding items to represent mastery-avoidance goals, and found initial support for each of the four goal orientations having a unique relationship to theoretically related external criteria.
Abstract: The current research extended the three-factor (mastery, performance-approach, and performance-avoidance) measure of achievement goals in a work domain to the four-factor conceptualization (in a 2 × 2 framework) by adding items to represent mastery-avoidance goals. Confirmatory factor analysis was conducted on two independent samples to evaluate the dimensionality of scores. Results from both samples indicated that after dropping 5 problematic mastery-avoidance items, responses to a reduced 18-item version of the instrument fit a four-factor model well. In addition, initial support for each of the four goal orientations having a unique relationship to theoretically related external criteria was found.

Journal ArticleDOI
TL;DR: In this paper, the authors compared the performance of four methods: simultaneous item bias test (SIBTEST), logistic regression, item response theory likelihood ratio test, and confirmatory factor analysis.
Abstract: Differential item functioning (DIF) continues to receive attention both in applied and methodological studies. Because DIF can be an indicator of irrelevant variance that can influence test scores, continuing to evaluate and improve the accuracy of detection methods is an essential step in gathering score validity evidence. Methods for detecting uniform DIF are well established, whereas those for identifying the presence of nonuniform or crossing DIF are less clearly understood. Four such methods were compared: simultaneous item bias test (SIBTEST), logistic regression, item response theory likelihood ratio test, and confirmatory factor analysis. Factors manipulated were sample size, ability differences between groups, percentage of DIF, and the underlying model used to generate the data. Results suggest that all methods were able to control Type I error, but SIBTEST had the highest power of the approaches compared. Problems with detection rates were evident with different underlying models.

Journal ArticleDOI
TL;DR: In this paper, the authors present the Rasch perspective on calculating reliability (measurement error) and describe Rasch measurement model programs that compute the various reliability estimates, including reliability estimates of scores under different test situations.
Abstract: Measurement error is a common theme in classical measurement models used in testing and assessment. In classical measurement models, the definition of measurement error and the subsequent reliability coefficients differ on the basis of the test administration design. Internal consistency reliability specifies error due primarily to poor item sampling. Rater reliability indicates error due to inconsistency among raters. For estimates of test-retest reliability, error is attributed mainly to changes over time. In alternate-forms reliability, error is assumed to be due largely to variation between samples of items on test forms. Rasch models can also compute reliability estimates of scores under different test situations. The authors therefore present the Rasch perspective on calculating reliability (measurement error) and present Rasch measurement model programs to compute the various reliability estimates.
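
The Rasch analogue of a classical reliability coefficient is person separation reliability, computed from the person measures and their standard errors rather than from raw scores. A minimal sketch with hypothetical logit measures (standard Rasch practice, not tied to the authors' specific programs):

```python
import numpy as np

def person_reliability(measures: np.ndarray, se: np.ndarray) -> float:
    """Rasch person separation reliability:
    (observed variance - mean square error) / observed variance."""
    observed = np.var(measures, ddof=1)
    return (observed - np.mean(se ** 2)) / observed

theta = np.array([-1.2, -0.4, 0.0, 0.3, 0.9, 1.5])      # person measures (logits)
se    = np.array([0.40, 0.35, 0.33, 0.33, 0.36, 0.45])  # their standard errors
print(f"person separation reliability ~ {person_reliability(theta, se):.2f}")
```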

Journal ArticleDOI
TL;DR: In this paper, the authors compare two classes of normal ogive two-parameter models and show that the multi-unidimensional model offers a better way to represent test situations not realized in unidimensional models.
Abstract: For tests consisting of multiple subtests, unidimensional item response theory (IRT) models apply when the subtests are known to measure a common underlying ability. However, in many instances, due to the lack of a satisfactory index for assessing the dimensionality assumption, the test structure is not clear. A more general IRT model, the multi-unidimensional model, is more flexible and efficient in various test situations. This article compares these two classes of normal ogive two-parameter models and shows that the multi-unidimensional model offers a better way to represent test situations not realized in unidimensional models.
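
The item response function shared by both models is the normal ogive P(correct) = Φ(a(θ − b)); in the multi-unidimensional case each subtest has its own θ, with the unidimensional model as the special case of a single ability. A small illustration with invented parameters:

```python
import numpy as np
from scipy.stats import norm

def ogive_2p(theta: np.ndarray, a: float, b: float) -> np.ndarray:
    """Normal ogive two-parameter model: P(correct) = Phi(a * (theta - b))."""
    return norm.cdf(a * (theta - b))

rng = np.random.default_rng(0)
theta = rng.normal(size=(5, 2))     # persons x subtest-specific abilities
# An item belonging to subtest 1 uses only that subtest's ability column.
print(ogive_2p(theta[:, 0], a=1.2, b=0.0))
```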

Journal ArticleDOI
TL;DR: The Non-Cognitive Questionnaire (NCQ) is a 23-item measure assessing eight noncognitive variables that are thought to predict the performance and retention of students in college.
Abstract: The Non-Cognitive Questionnaire (NCQ) is a 23-item measure assessing eight noncognitive variables that are thought to predict the performance and retention of students in college. The NCQ is widely used in research and practice. This study is a meta-analytic review of the validity of scores on the NCQ across 47 independent samples for predicting academic outcomes (N = 9,321). Across all analyses, none of the scales of the NCQ are adequate predictors of GPA or persistence in college. Based on their evaluation of the NCQ, the authors recommend against its use for research or practice.

Journal ArticleDOI
TL;DR: In this paper, the authors used confirmatory factor analysis (CFA) to test the fit of the SACQ authors' proposed four-factor model using a sample of university students.
Abstract: The construct validity of scores on the Student Adaptation to College Questionnaire (SACQ) was examined using confirmatory factor analysis (CFA). The purpose of this study was to test the fit of the SACQ authors' proposed four-factor model using a sample of university students. Results indicated that the hypothesized model did not fit. Additional CFAs specifying one-factor models for each subscale were performed to diagnose areas of misfit, and results also indicated lack of fit. Exploratory factor analyses were then conducted and a four-factor model, different from the model proposed by the authors, was examined to provide information for future instrument revisions. It was concluded that researchers need to return to the first stage of instrument development, which would entail examining not only the theories behind adjustment to college in greater detail, but also how the current conceptualization of the SACQ relates to such theories.

Journal ArticleDOI
TL;DR: In this paper, the impact of outliers on coefficient α has been investigated for varying values of population reliability and sample sizes for visual analogue scales, and the results show that coefficient α is not affected by symmetric outlier contamination, whereas asymmetric outliers artificially inflate the estimates of coefficient α.
Abstract: The impact of outliers on Cronbach's coefficient α has not been documented in the psychometric or statistical literature. This is an important gap because coefficient α is the most widely used measurement statistic in all of the social, educational, and health sciences. The impact of outliers on coefficient α is investigated for varying values of population reliability and sample sizes for visual analogue scales. Results show that coefficient α is not affected by symmetric outlier contamination, whereas asymmetric outliers artificially inflate the estimates of coefficient α. Coefficient α estimates are upwardly biased and more variable from sample to sample, with increasing asymmetry and proportion of outlier contamination in the population. However, these effects of outliers on the bias and sample variability of coefficient α estimates are reduced for increasing population reliability. The results are discussed in the context of providing guidance for computing or interpreting coefficient α for visual analogue scales.
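
The inflation effect is easy to reproduce: shifting a small fraction of respondents far upward on every item injects spurious common variance, which raises inter-item covariances and hence alpha. A self-contained simulation sketch (invented generating values, not the article's actual design):

```python
import numpy as np

def cronbach_alpha(x: np.ndarray) -> float:
    """Coefficient alpha for a persons-by-items score matrix."""
    k = x.shape[1]
    return k / (k - 1) * (1.0 - x.var(axis=0, ddof=1).sum()
                          / x.sum(axis=1).var(ddof=1))

rng = np.random.default_rng(1)
n, k = 200, 10
trait = rng.normal(size=(n, 1))                      # shared trait
clean = trait + rng.normal(size=(n, k))              # congeneric-ish items

contaminated = clean.copy()                          # asymmetric outliers:
idx = rng.choice(n, size=int(0.05 * n), replace=False)
contaminated[idx] += 8.0                             # shift 5% far upward

print(f"clean alpha:        {cronbach_alpha(clean):.3f}")
print(f"contaminated alpha: {cronbach_alpha(contaminated):.3f}")  # inflated
```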

Journal ArticleDOI
TL;DR: A principal components analysis of the Teacher Rating Scale-Child (TRS-C) of the Behavior Assessment System for Children was conducted with a cross-sectional cohort of 659 children in Grades 1 to 5.
Abstract: A principal components analysis of the Teacher Rating Scale-Child (TRS-C) of the Behavior Assessment System for Children was conducted with a cross-sectional cohort of 659 children in Grades 1 to 5. A predictive validity study was then conducted with a 2-year longitudinal sample of 206 children. The results suggested that scores from the resulting 23-item screener had strong initial reliability and validity evidence. Predictive validity coefficients for the screener scores were acceptable for both behavioral and academic outcomes and equal to or better than those for the full TRS-C, comprised of 148 items. The practicality of the screener was documented by teachers' experiences. Administration time was less than 5 minutes per child, and no specialized teacher training was necessary. These results provide preliminary evidence that routine school screening via a brief teacher rating scale can increase the probability that children with behavioral and emotional problems may be validly identified for diagnostic evaluation.

Journal ArticleDOI
TL;DR: The psychometric properties of scores from the Academic Self-Concept Scale were examined in a group of 198 Asian American college students; using parallel analysis, a four-factor solution accounting for 46% of the variance was found, along with evidence of construct validity.
Abstract: The psychometric properties of scores from the Academic Self-Concept Scale are examined in a group of 198 Asian American college students. Using parallel analysis, a four-factor solution accounting for 46% of the variance was found. In a test of construct validity, academic self-concept was found to be negatively related to adherence to Asian values, positively related to adherence to European American values, and positively related to grade point average. Results suggest that academic self-concept, as currently measured, is primarily an individualistic psychological construct.
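
Parallel analysis retains factors whose observed eigenvalues exceed those expected from random data of the same dimensions. A minimal sketch of Horn's procedure in its simplest form (the authors' exact variant, e.g. which percentile they used, is not specified here):

```python
import numpy as np

def parallel_analysis(data: np.ndarray, n_sims: int = 100,
                      percentile: float = 95, seed: int = 0) -> int:
    """Number of factors whose observed correlation-matrix eigenvalues
    exceed the chosen percentile of eigenvalues from random data."""
    rng = np.random.default_rng(seed)
    n, k = data.shape
    observed = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]
    rand = np.empty((n_sims, k))
    for s in range(n_sims):
        noise = rng.normal(size=(n, k))
        rand[s] = np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False))[::-1]
    return int(np.sum(observed > np.percentile(rand, percentile, axis=0)))

# Usage: n_factors = parallel_analysis(scores)  # scores: persons x items
```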

Journal ArticleDOI
TL;DR: The authors examined the measurement equivalence of the Multidimensional Work Ethic Profile (MWEP) across the diverse cultures of Korea, Mexico, and the United States and found that the MWEP was invariant across samples drawn from each country.
Abstract: The authors examined the measurement equivalence of the Multidimensional Work Ethic Profile (MWEP) across the diverse cultures of Korea, Mexico, and the United States. Korean- and Spanish-language versions of the MWEP were developed and evaluated relative to the original English version of the measure. Confirmatory factor analytic results indicated measurement invariance across samples drawn from each country. Further analyses indicated potential substantive differences for some of the seven subscales of the MWEP across samples. The implications of these findings and directions for future research are presented.

Journal ArticleDOI
TL;DR: A cognitive diagnostic model uses information from educational experts to describe the relationships between item performances and posited proficiencies; when these relationships can be described with a fully Bayesian model, this article suggests a number of graphics and statistics for diagnosing problems with cognitive diagnostic models expressed as Bayesian networks.
Abstract: A cognitive diagnostic model uses information from educational experts to describe the relationships between item performances and posited proficiencies. When the cognitive relationships can be described using a fully Bayesian model, Bayesian model checking procedures become available. Checking models tied to cognitive theory of the domains provides feedback to educators about the underlying cognitive theory. This article suggests a number of graphics and statistics for diagnosing problems with cognitive diagnostic models expressed as Bayesian networks. The suggested diagnostics allow the authors to identify the inadequacy of an earlier cognitive diagnostic model and to hypothesize an improved model that provides better fit to the data.

Journal ArticleDOI
TL;DR: This paper examined the structural validity of scores from the Bem Sex Role Inventory using maximum likelihood confirmatory factor analysis; a hierarchical factor structure model with seven first-order factors (compassionate, interpersonal affect, shy, dominant, decisive, athletic, and self-sufficient) and two second-order factors (masculinity and femininity) fit the data well.
Abstract: This study examines the structural validity of scores from the Bem Sex Role Inventory using a maximum likelihood confirmatory factor analysis (CFA). Six hundred and sixty-five graduate and undergraduate students participate in the study. A seven first-order factor model almost identical to the model reported in a previous CFA study is used as the baseline comparison model. The data for testing these models are obtained from an exploratory sample randomly selected from the whole sample. A hierarchical factor structure model with seven first-order factors (compassionate, interpersonal affect, shy, dominant, decisive, athletic, and self-sufficient) and two second-order factors (masculinity and femininity) fit the data quite well. The fit indices based on the validation sample collectively indicate a very good fit. The results of this study are notably consistent with the hierarchical factor models suggested in two previous CFA studies.

Journal ArticleDOI
TL;DR: Results from simulation studies using three item selection methods, Fisher information (FI), posterior-weighted FI (FIP), and MI, are provided for an adaptive four-category classification test and it is shown that in general, MI item selection classifies the highest proportion of examinees correctly and yields the shortest test lengths.
Abstract: A general approach for item selection in adaptive multiple-category classification tests is provided. The approach uses mutual information (MI), a special case of the Kullback-Leibler distance, or relative entropy. MI works efficiently with the sequential probability ratio test and alleviates the difficulties encountered with using other local- and global-information measures in the multiple-category classification setting. Results from simulation studies using three item selection methods, Fisher information (FI), posterior-weighted FI (FIP), and MI, are provided for an adaptive four-category classification test. Both across and within the four classification categories, it is shown that in general, MI item selection classifies the highest proportion of examinees correctly and yields the shortest test lengths. The next best performance is observed for FIP item selection, followed by FI.
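
For a dichotomous item, the mutual information between the response and the classification category follows directly from the current posterior over categories and the item's correct-response probability within each category. A sketch under those assumptions (function and variable names are illustrative):

```python
import numpy as np

def item_mutual_information(post: np.ndarray, p_correct: np.ndarray) -> float:
    """MI between a dichotomous response and the category variable, given
    the posterior over C categories (post) and P(correct | category)."""
    p_x_given_c = np.stack([p_correct, 1.0 - p_correct])  # shape (2, C)
    p_x = p_x_given_c @ post                              # marginal over x
    joint = p_x_given_c * post                            # p(x, c)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = joint * np.log(p_x_given_c / p_x[:, None])
    return float(np.nansum(terms))

def select_item(post, bank_p_correct, administered: set) -> int:
    """Pick the unused bank item with maximal mutual information."""
    mi = [item_mutual_information(post, bank_p_correct[i])
          if i not in administered else -np.inf
          for i in range(len(bank_p_correct))]
    return int(np.argmax(mi))
```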

Journal ArticleDOI
TL;DR: In this article, the authors compared four methods of evaluating statistically independent errors: Durbin-Watson, Huitema-McKean, Box-Pierce and Ljung-Box (L-B) tests.
Abstract: Regression models used in the analysis of interrupted time-series designs assume statistically independent errors. Four methods of evaluating this assumption are the Durbin-Watson (D-W), Huitema-McKean (H-M), Box-Pierce (B-P), and Ljung-Box (L-B) tests. These tests were compared with respect to Type I error and power under a wide variety of error models and sample sizes. Although the B-P and L-B tests are portmanteau methods that incorporate information from a large portion of the autocorrelation function, the more focused D-W and H-M first-order autoregressive tests are shown to be considerably more powerful. The popular L-B test has unacceptable Type I error and should not be used in the context of the intervention model applied in this study.
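
Several of these tests are available directly in statsmodels; here is a sketch on a simulated interrupted time series with a level shift (invented data; the Huitema-McKean test is not in statsmodels and is omitted):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(0)
t = np.arange(60)
phase = (t >= 30).astype(float)          # intervention after observation 30
y = 2.0 + 1.5 * phase + rng.normal(size=60)

X = sm.add_constant(np.column_stack([t, phase]))
resid = sm.OLS(y, X).fit().resid

print("Durbin-Watson:", durbin_watson(resid))  # ~2 means no lag-1 autocorrelation
print(acorr_ljungbox(resid, lags=[5]))         # portmanteau Ljung-Box test
```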

Journal ArticleDOI
TL;DR: In this paper, the authors note that fear of failure energizes individuals to avoid failure because of the learned aversive consequences of failing (e.g., shame) and that fear of failure is socialized in childhood.
Abstract: Fear of failure (FF) energizes individuals to avoid failure because of the learned aversive consequences of failing (e.g., shame). Although FF is socialized in childhood, little is known about the ...

Journal ArticleDOI
TL;DR: In this paper, the authors describe, test, and illustrate a new implementation of the EH method for ordinal items, which involves the estimation of item response model parameters simultaneously with the approximation of the distribution of the random latent variable (τ) as a histogram.
Abstract: The purpose of this research is to describe, test, and illustrate a new implementation of the empirical histogram (EH) method for ordinal items. The EH method involves the estimation of item response model parameters simultaneously with the approximation of the distribution of the random latent variable (τ) as a histogram. Software for the EH method with ordinal items (having more than two response options) has not been readily available in the past but was created for the present research. Simulation results suggest that with larger (but not smaller) numbers of quadrature points, graded-model item parameter estimates from the EH method are highly accurate when the τ distribution is either normal or skewed. Results for expected a posteriori scores depend on the magnitude of τ.
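
The defining step of the EH method is that the latent distribution is re-estimated on each EM cycle as posterior-weighted proportions at the quadrature points, rather than being fixed at a normal shape. A schematic of that single update step (the surrounding graded-model estimation is omitted):

```python
import numpy as np

def update_histogram(like: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """One EM-type update of the empirical histogram. `like` holds each
    person's likelihood at every quadrature point (persons x points);
    `weights` is the current histogram over those points."""
    post = like * weights                      # unnormalized posteriors
    post /= post.sum(axis=1, keepdims=True)    # normalize per person
    new = post.mean(axis=0)                    # expected mass at each point
    return new / new.sum()
```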

Journal ArticleDOI
TL;DR: In this article, a multigroup structural equation modeling of longitudinal data from the Canadian National Population Health Survey was used to examine the measurement invariance across three age groups (19 to 25 years, n = 1,257; 30 to 55 years,n = 5,326; and ≥ 60 years, N = 2,213) and to compare the stability of the Antonovsky's Sense of Coherence Scale (SOC) scores obtained in 1994 and 1998 in the same participants.
Abstract: The purpose of this investigation was to test the age-related measurement invariance and temporal stability of the 13-item version of Antonovsky's Sense of Coherence Scale (SOC). Multigroup structural equation modeling of longitudinal data from the Canadian National Population Health Survey was used to examine the measurement invariance across 3 age groups (19 to 25 years, n = 1,257; 30 to 55 years, n = 5,326; and ≥60 years, n = 2,213) and to compare the stability of the SOC scores obtained in 1994 and 1998 in the same participants. The results support the age-related structural invariance of the scale. Differences in the latent means and stability coefficients obtained for the three age groups provide weak to moderate support for the stability of SOC scores over time in the general population of Canada.

Journal ArticleDOI
TL;DR: In this article, the authors investigated the coverage probability of an asymptotic and percentile bootstrap confidence interval with respect to the squared multiple correlation coefficient (Δρ2) associated with a variable in a regression equation.
Abstract: The increase in the squared multiple correlation coefficient (ΔR²) associated with a variable in a regression equation is a commonly used measure of importance in regression analysis. The coverage probability that an asymptotic and percentile bootstrap confidence interval includes Δρ² was investigated. As expected, coverage probability for the asymptotic confidence interval was often inadequate (outside the interval .925 to .975 for a 95% confidence interval), even when sample size was quite large (i.e., 200). However, adequate coverage probability for the bootstrap confidence interval could typically be obtained with a sample size of 200 or less, and moreover, this accuracy was obtained with relatively small sample sizes (100 or less) with six or fewer predictors.
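
A percentile bootstrap interval for ΔR² resamples cases whole, recomputes the R² increase in each resample, and takes the 2.5th and 97.5th percentiles. A minimal numpy sketch (the study's exact asymptotic competitor is not reproduced):

```python
import numpy as np

def delta_r2(y, X_reduced, X_full) -> float:
    """Increase in R^2 from the reduced to the full regression model."""
    def r2(X):
        X1 = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        resid = y - X1 @ beta
        return 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    return r2(X_full) - r2(X_reduced)

def percentile_ci(y, X_red, X_full, b: int = 2000, seed: int = 0):
    """95% percentile bootstrap CI for Delta-R^2 (cases resampled whole)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    boots = np.empty(b)
    for i in range(b):
        idx = rng.integers(0, n, n)
        boots[i] = delta_r2(y[idx], X_red[idx], X_full[idx])
    return np.percentile(boots, [2.5, 97.5])
```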

Journal ArticleDOI
TL;DR: In this paper, the authors examined the gender-related differential predictive validity of five subscales of the Institutional Integration Scale (IIS) with regard to college student withdrawal and found no differential functioning.
Abstract: This study examined the gender-related differential predictive validity of five subscales of the Institutional Integration Scale (IIS) with regard to college student withdrawal. Differential functioning of the IIS across genders was assessed using an item response theory (IRT)-based framework of differential item and test functioning. The results confirmed the absence of differential functioning and supported the predictive validity of two of the five subscales for student withdrawal. IRT analyses revealed that a number of the items did not adequately reflect the construct and should be revised or removed from the measure. A discussion of these results and the implications for higher education institutions focused on preventing student withdrawal is presented.