
Showing papers in "Educational and Psychological Measurement in 2008"


Journal ArticleDOI
TL;DR: The psychometric properties and multigroup measurement invariance of scores on the Self-Efficacy for Self-Regulated Learning Scale taken from Bandura's Children's Self-Efficacy Scale were assessed in a sample of 3,760 students from Grades 4 to 11 as discussed by the authors.
Abstract: The psychometric properties and multigroup measurement invariance of scores on the Self-Efficacy for Self-Regulated Learning Scale taken from Bandura's Children's Self-Efficacy Scale were assessed in a sample of 3,760 students from Grades 4 to 11. Latent means differences were also examined by gender and school level. Results reveal a unidimensional construct with equivalent factor pattern coefficients for boys and girls and for students in elementary, middle, and high school. Elementary school students report higher self-efficacy for self-regulated learning than do students in middle and high school. The latent factor is related to self-efficacy, self-concept, task goal orientation, apprehension, and achievement.

302 citations


Journal ArticleDOI
TL;DR: In this paper, the authors examined the relationship between the squared multiple correlation coefficients and minimum necessary sample sizes and found a definite relationship, similar to a negative exponential relationship, and provided guidelines for sample size needed for accurate predictions.
Abstract: When using multiple regression for prediction purposes, the issue of minimum required sample size often needs to be addressed. Using a Monte Carlo simulation, models with varying numbers of independent variables were examined and minimum sample sizes were determined for multiple scenarios at each number of independent variables. The scenarios arise from varying the levels of correlations between the criterion variable and predictor variables as well as among predictor variables. Two minimum sample sizes were determined for each scenario, corresponding to a good and an excellent prediction level. The relationship between the squared multiple correlation coefficient and the minimum necessary sample size was examined. A definite relationship, similar to a negative exponential relationship, was found between the squared multiple correlation coefficient and the minimum sample size. As the squared multiple correlation coefficient decreased, the sample size increased at an increasing rate. This study provides guidelines for sample size needed for accurate predictions.

272 citations
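The sample-size study above rests on a standard Monte Carlo idea: draw repeated samples of a given size from a population with a known correlation and watch how unstable the sample R-squared is. A minimal single-predictor sketch of that idea (the article examined multiple predictors; the correlation, sample sizes, and replication count below are illustrative, not the article's):

```python
# Minimal Monte Carlo sketch: stability of sample R^2 as n grows.
# All constants here are illustrative assumptions, not values from the study.
import random
import math

def sample_r_squared(n, rho, rng):
    """Draw n (x, y) pairs with population correlation rho; return sample R^2."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        xs.append(x)
        ys.append(y)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

rng = random.Random(1)
rho = 0.3  # modest population correlation -> true R^2 of 0.09
for n in (20, 100, 500):
    reps = [sample_r_squared(n, rho, rng) for _ in range(500)]
    mean_r2 = sum(reps) / len(reps)
    sd_r2 = (sum((r - mean_r2) ** 2 for r in reps) / len(reps)) ** 0.5
    print(f"n={n:4d}  mean R^2={mean_r2:.3f}  SD={sd_r2:.3f}")
```

Smaller samples yield upwardly biased and far more variable R-squared estimates, which is the mechanism behind the article's finding that smaller true R-squared values demand larger samples.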


Journal ArticleDOI
TL;DR: This paper provides a summary of 45 exploratory and confirmatory factor-analytic studies that examined the internal structure of scores obtained from the Maslach Burnout Inventory (MBI).
Abstract: This study provides a summary of 45 exploratory and confirmatory factor-analytic studies that examined the internal structure of scores obtained from the Maslach Burnout Inventory (MBI). It highlights characteristics of the studies that account for differences in reporting of the MBI factor structure. This approach includes an examination of the various sample characteristics, forms of the instrument, factor-analytic methods, and the reported factor structure across studies that have attempted to examine the dimensionality of the MBI. This study also investigates the dimensionality of MBI scale scores using meta-analysis. Both descriptive and empirical analysis supported a three-factor model. The pattern of reported dimensions across validation studies should enhance understanding of the structural dimensions that the MBI measures as well as provide a more meaningful interpretation of its test scores.

203 citations


Journal ArticleDOI
TL;DR: In this paper, the authors used teachers' ability to analyze teaching as a proxy for their teaching knowledge, using video clips of classroom instruction as item prompts to measure teacher knowledge of teaching mathematics.
Abstract: Responding to the scarcity of suitable measures of teacher knowledge, this article reports on a novel assessment approach to measuring teacher knowledge of teaching mathematics. The new approach uses teachers' ability to analyze teaching as a proxy for their teaching knowledge. Video clips of classroom instruction, which respondents were asked to analyze in writing, were used as item prompts. Teacher responses were scored along four dimensions: mathematical content, student thinking, alternative teaching strategies, and overall quality of interpretation. A prototype assessment was developed and its reliability and validity were examined. Respondents' scores were found to be reliable. Positive, moderate correlations between teachers' scores on the video-analysis assessment, a criterion measure of mathematical content knowledge for teaching, and expert ratings provide initial evidence for the criterion-related validity of the video-analysis assessment. Results suggest that teachers' ability to analyze teach...

197 citations


Journal ArticleDOI
TL;DR: In this article, the effects of Q-matrix misspecifications on parameter estimates and misclassification rates for the deterministic-input, noisy "and" gate (DINA) were investigated.
Abstract: This article reports a study that investigated the effects of Q-matrix misspecifications on parameter estimates and misclassification rates for the deterministic-input, noisy "and" gate (DINA) mo...

181 citations


Journal ArticleDOI
TL;DR: In this article, a meta-analysis was conducted to synthesize the administration mode effects of CBTs and paper-and-pencil tests on K-12 student reading assessments.
Abstract: In recent years, computer-based testing (CBT) has grown in popularity, is increasingly being implemented across the United States, and will likely become the primary mode for delivering tests in the future. Although CBT offers many advantages over traditional paper-and-pencil testing, assessment experts, researchers, practitioners, and users have expressed concern about the comparability of scores between the two test administration modes. To help provide an answer to this issue, a meta-analysis was conducted to synthesize the administration mode effects of CBTs and paper-and-pencil tests on K-12 student reading assessments. Findings indicate that the administration mode had no statistically significant effect on K-12 student reading achievement scores. Four moderator variables (study design, sample size, computer delivery algorithm, and computer practice) made statistically significant contributions to predicting effect size. Three moderator variables (grade level, type of test, and computer delivery method)...

157 citations
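The core computation in a meta-analysis like the one above is inverse-variance pooling of per-study effect sizes. A hedged sketch of fixed-effect pooling of standardized mean differences (the effect sizes below are invented for illustration, not taken from the article, which also modeled moderators):

```python
# Fixed-effect meta-analytic pooling via inverse-variance weights.
# Study effect sizes and variances below are hypothetical.
import math

def pool_fixed_effect(effects):
    """effects: list of (d, var_d) pairs. Returns pooled d, its SE, and z."""
    weights = [1.0 / v for _, v in effects]
    d_bar = sum(w * d for (d, _), w in zip(effects, weights)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return d_bar, se, d_bar / se

studies = [(0.05, 0.01), (-0.02, 0.02), (0.01, 0.005)]  # hypothetical CBT-vs-paper d's
d_bar, se, z = pool_fixed_effect(studies)
print(f"pooled d = {d_bar:.3f}, SE = {se:.3f}, z = {z:.2f}")
```

With these toy inputs |z| falls well below 1.96, the kind of result that would be read as "no statistically significant administration-mode effect."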


Journal ArticleDOI
TL;DR: The authors investigated aspects of validity reflected in a large and diverse sample of published measures used in educational and psychological testing contexts and found that validity information is not routinely provided in terms of modern validity theory, some sources of validity evidence (e.g., consequential) are essentially ignored in validity reports, and the favorability of judgments about a test is more strongly related to the number of validity sources provided than to the perspective on validity taken or other factors.
Abstract: This study investigates aspects of validity reflected in a large and diverse sample of published measures used in educational and psychological testing contexts. The current edition of Mental Measurements Yearbook served as the data source for this study. The validity aspects investigated included perspective on validity represented, number and kinds of sources of validity evidence provided, overall evaluation of the favorability of the test, and whether these factors varied as a function of the type of test. Findings reveal that validity information is not routinely provided in terms of modern validity theory, some sources of validity evidence (e.g., consequential) are essentially ignored in validity reports, and the favorability of judgments about a test is more strongly related to the number of validity sources provided than to the perspective on validity taken or other factors. The article concludes with implications for extending and refining current validity theory and validation practice.

152 citations


Journal ArticleDOI
TL;DR: In this paper, the authors used the National Longitudinal Survey of Youth 1979 (NLSY79) data to test measurement invariance of the Behavior Problem Index (BPI) during middle childhood across three ethnic groups.
Abstract: Accurate measurement of behavioral functioning is a cornerstone of research on disparities in child development. This study used the National Longitudinal Survey of Youth 1979 (NLSY79) data to test measurement invariance of the Behavior Problem Index (BPI) during middle childhood across three ethnic groups. Using the internalizing and externalizing behavior problem division derived by Parcel and Menaghan (1988) and suggested for use with NLSY79 data, the configural invariance hypothesis was not supported. The BPI factor structure model was revised based on theoretical considerations using the division of items from the Child Behavior Checklist. This model demonstrated configural invariance across ethnic groups and over time. Moreover, measurement invariance of factor loadings and thresholds across ethnic groups at each time point and within each ethnic group over time was also supported. The implications of these findings for educational and cross-cultural research are outlined.

92 citations


Journal ArticleDOI
TL;DR: In this article, the authors used reliability generalization to identify the variability in reliability estimates for WOCS scores across studies, the typical score reliability for the WOCS, and the salient features across studies that relate to this variability.
Abstract: For more than 20 years, the Ways of Coping Scale (WOCS) has been used extensively to measure coping. Yet beyond the original psychometric data, few studies have reexamined its properties utilizing the enormous body of research generated on the WOCS. Reliability has been assumed to be consistent as an attribute of the test. This study used reliability generalization to identify (a) the variability in reliability estimates for the WOCS scores across studies, (b) the typical score reliability for the WOCS, and (c) the salient features across studies that relate to the variability in reliability estimate scores for the WOCS. Typical reliability across subscale scores ranged from .60 to .75 with Positive Reappraisal showing the least variability and Self-Controlling showing the most. Factors related to this variability were age and format of administration.

79 citations
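Reliability generalization studies like this one aggregate internal-consistency coefficients, most often Cronbach's alpha, across published samples. A minimal sketch of the alpha computation itself (the item scores below are invented; real WOCS subscales have more items and respondents):

```python
# Cronbach's alpha from raw item scores; data are hypothetical.
def cronbach_alpha(items):
    """items: one list of scores per item, all of equal length (one score per respondent)."""
    k = len(items)
    n = len(items[0])

    def var(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(item[i] for item in items) for i in range(n)]
    item_var_sum = sum(var(item) for item in items)
    return (k / (k - 1)) * (1 - item_var_sum / var(totals))

# Three hypothetical 4-point items answered by five respondents:
scores = [[3, 2, 4, 1, 3],
          [3, 3, 4, 2, 2],
          [2, 2, 4, 1, 3]]
print(f"alpha = {cronbach_alpha(scores):.2f}")
```

A reliability generalization study then treats many such alphas as data points and regresses them on study features (here, age and administration format).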


Journal ArticleDOI
TL;DR: The most commonly used measures of locus of control are Rotter's Internality-Externality Scale (I-E) and Nowicki and Strickland's Internality-Externality Scale (NSIE), as discussed by the authors.
Abstract: The most commonly used measures of locus of control are Rotter's Internality-Externality Scale (I-E) and Nowicki and Strickland's Internality-Externality Scale (NSIE). A reliability generalization study is conducted to explore variability in I-E and NSIE score reliability. Studies are coded for aspects of the scales used (number of response points, number of items) and for sample demographic descriptors (percentage female, average age). Results indicate no statistically significant difference in the predicted internal consistency estimate for I-E Scale versus NSIE Scale scores. Only the percentage female variable is found to predict variation in internal consistency estimates. Testing interval length explains variability in test-retest coefficient estimates. Results and directions for future research are discussed.

73 citations


Journal ArticleDOI
TL;DR: The results from simulation studies as well as actual data suggest that IRT-based models with continuous latent traits can be developed and that, compared with the unidimensional IRT model, the proposed models better describe the actual data.
Abstract: As item response models gain increased popularity in large-scale educational and measurement testing situations, many studies have been conducted on the development and applications of unidimensional and multidimensional models. Recently, attention has been paid to IRT-based models with an overall ability dimension underlying several ability dimensions specific for individual test items, where the focus is mainly on models with dichotomous latent traits. The purpose of this study is to propose such models with continuous latent traits under the Bayesian framework. The proposed models are further compared with the conventional IRT models using Bayesian model choice techniques. The results from simulation studies as well as actual data suggest that (a) such models can be developed; (b) compared with the unidimensional IRT model, the proposed models better describe the actual data; and (c) the use of the proposed IRT models and the multiunidimensional model should be based on different beliefs about the unde...

Journal ArticleDOI
TL;DR: In this article, both desirability ratings of BSRI traits (both for a man and for a woman) and self-ratings were obtained from the same sample and factor analyzed.
Abstract: Pedhazur and Tetenbaum speculated that factor structures from self-ratings of the Bem Sex-Role Inventory (BSRI) personality traits would be different from factor structures from desirability ratings of the same traits. To explore this hypothesis, both desirability ratings of BSRI traits (both for a man and for a woman) and self-ratings were obtained from the same sample and factor analyzed. Factor analyses performed on the three sets of ratings of the 40 BSRI traits (self-ratings, desirability ratings for a man, and desirability ratings for a woman) confirmed that the factors across ratings were diverse. Thus, the underlying constructs must be studied independently. Predictive discriminant analyses replicated the finding that two traits alone (Masculine and Feminine) provided nearly all of the discrimination of males and females in the sample when self-ratings were employed. Also, predictive discriminant analyses revealed that the classification of participants into gender groups was very accurate using s...

Journal ArticleDOI
TL;DR: In this paper, a combination of two item response theory (IRT) models is used for the observed response data and one for the missing data indicator, which is modeled using a sequential model with linear restrictions on the item parameters.
Abstract: In tests with time limits, items at the end are often not reached. Usually, the pattern of missing responses depends on the ability level of the respondents; therefore, missing data are not ignorable in statistical inference. This study models data using a combination of two item response theory (IRT) models: one for the observed response data and one for the missing data indicator. The missing data indicator is modeled using a sequential model with linear restrictions on the item parameters. The models are connected by the assumption that the respondents' latent proficiency parameters have a joint multivariate normal distribution. Model parameters are estimated by maximum marginal likelihood. Simulations show that treating missing data as ignorable can lead to considerable bias in parameter estimates. Including an IRT model for the missing data indicator removes this bias. The method is illustrated with data from an intelligence test with a time limit.

Journal ArticleDOI
TL;DR: In this paper, the authors test the validity of scores on the Homework Management Scale (HMS) using 699 rural and 482 urban 8th graders and find that urban students were more likely to manage their homework than their rural counterparts in two of the five areas, namely, handling distraction and monitoring motivation.
Abstract: The purpose of this study was to test the validity of scores on the Homework Management Scale (HMS) using 699 rural and 482 urban eighth graders. The study revealed that the HMS comprised 5 separate yet related factors: arranging the environment, managing time, handling distraction, monitoring motivation, and controlling emotion. Given an adequate level of configural, factor loading, common error covariance, and intercept invariance, I further tested the difference between group means. Results revealed that urban students were more likely to manage their homework than their rural counterparts in 2 of the 5 areas, namely, handling distraction and monitoring motivation. Findings also showed that the HMS differentiated among students who were more or less likely to complete homework assignments.

Journal ArticleDOI
TL;DR: In this paper, the authors investigated the measurement invariance of a particular measure of achievement goal orientation, the modified Achievement Goal Questionnaire (AGQ-M), across African American and white university students.
Abstract: There has been growing interest in comparing achievement goal orientations across ethnic groups. Such comparisons, however, cannot be made until validity evidence has been collected to support the use of an achievement goal orientation instrument for that purpose. Therefore, this study investigates the measurement invariance of a particular measure of achievement goal orientation, the modified Achievement Goal Questionnaire (AGQ-M), across African American and White university students. Confirmatory factor analyses support measurement invariance across the two groups. These findings provide additional validity evidence for the newly conceptualized 2 × 2 framework of achievement goal orientation and for the equivalence of functioning of the AGQ-M across these distinct groups. Because this level of invariance is established, researchers can make more valid inferences about differences in the AGQ-M scores across African American and White students.

Journal ArticleDOI
TL;DR: This article examined the consistency of the metric properties of AAI scores by testing their factorial structure for invariance across gender and grade (2 [genders] × 5 [grades] = 10 groups) in a sample of 3,417 high school students.
Abstract: Motivation deficits are common in high school and constitute a significant problem for both students and teachers. The Academic Amotivation Inventory (AAI) was developed to measure the multidimensional nature of the academic amotivation construct (Legault, Green-Demers, & Pelletier, 2006). The present project further examined the consistency of the metric properties of AAI scores by testing their factorial structure for invariance across gender and grade (2 [genders] × 5 [grades] = 10 [groups]) in a sample of 3,417 high school students. Factorial invariance of latent means was also examined as a complementary substantive objective. Configural, metric, and scalar invariance were successfully substantiated across all 10 groups. Results revealed well-fitting models for each group. Moreover, constraining factor loadings and intercepts had no meaningful impact on model fit. Findings are discussed in terms of an increased conceptual and psychometric understanding of scholastic motivational problems.

Journal ArticleDOI
TL;DR: In this article, confirmatory factor analysis for ordered-categorical measures (CFA-OCM) and rating scale item response theory (IRT) analyses explore measurement bias across gender on the Children's Depression Inve...
Abstract: Confirmatory factor analysis for ordered-categorical measures (CFA-OCM) and rating scale item response theory (IRT) analyses explore measurement bias across gender on the Children's Depression Inve...

Journal ArticleDOI
TL;DR: This article examined the invariance of a measurement model underlying Wechsler Adult Intelligence Scale-Third Edition scores in the US and the Canadian standardization samples and found that the measurement model, involving four latent variables, satisfies the assumption of invariance across samples.
Abstract: A measurement model is invoked whenever a psychological interpretation is placed on test scores. When stated in detail, a measurement model provides a description of the numerical and theoretical relationship between observed scores and the corresponding latent variables or constructs. In this way, the hypothesis that similar meaning can be derived from a set of test scores can be tested by examination of a measurement model across groups. This study examines the invariance of a measurement model underlying Wechsler Adult Intelligence Scale-Third Edition scores in the US and the Canadian standardization samples. The measurement model, involving four latent variables, satisfies the assumption of invariance across samples. Subtest scores also show similar reliability in both samples. However, slightly higher latent variable means are found in the Canadian normative sample.

Journal ArticleDOI
TL;DR: This article examined the relationship between item difficulty and differential item functioning by using alternative statistical techniques based on item response theory and a different standardized test, and the results replicate previous research and provide support for the generalizability of the findings.
Abstract: Recent research examining racial differences on standardized cognitive tests has focused on the impact of test item difficulty. Studies using data from the SAT and GRE have reported a correlation between item difficulty and differential item functioning (DIF) such that minority test takers are less likely than majority test takers to respond correctly to easy test items. The statistical techniques used and the effect sizes reported in these studies have been heavily criticized. This study addresses these criticisms by examining the relationship between item difficulty and DIF by using alternative statistical techniques based on item response theory and a different standardized test. The results replicate previous research and provide support for the generalizability of the findings.

Journal ArticleDOI
TL;DR: In this paper, the authors compared two confirmatory factor analysis methods on their ability to verify whether correct assignments of items to subtests are supported by the data, and found that the confirmatory common factor (CCF) method is used most often and defines nonzero loadings so that they correspond to the assignment of items in subtests.
Abstract: This study compares two confirmatory factor analysis methods on their ability to verify whether correct assignments of items to subtests are supported by the data. The confirmatory common factor (CCF) method is used most often and defines nonzero loadings so that they correspond to the assignment of items to subtests. Another method is the oblique multiple group (OMG) method, which defines subtests as unweighted sums of the scores on all items assigned to the subtest, and (corrected) correlations are used to verify the assignment. A simulation study compares both methods, accounting for the influence of model error and the amount of unique variance. The CCF and OMG methods show similar behavior with relatively small amounts of unique variance and low interfactor correlations. However, at high amounts of unique variance and high interfactor correlations, the CCF detected correct assignments more often, whereas the OMG was better at detecting incorrect assignments.

Journal ArticleDOI
TL;DR: In this paper, the authors used the full-information item bifactor model for graded response data to test the dimensionality of an adapted version of the State Metacognitive Inventory.
Abstract: Dimensionality assessment using the full-information item bifactor model for graded response data is provided. The model applies to data in which each item relates to a general factor and one group factor. Specifically, alternative model specification within item response theory (IRT) is shown to test a scale's factor structure. For illustrative purposes, the bifactor model and competing IRT models were fit to the data of separate cohorts of incoming college students (Cohort 1, n = 1,490; Cohort 2, n = 1,533) to test the dimensionality of an adapted version of the State Metacognitive Inventory. Overall, the bifactor analysis did not strongly support distinct group factors after accounting for the general factor. Instead, results suggested conceptualizing the scale as unidimensional, indicating that scores should be based on the total scale, not subscales. Considerations related to the use of the bifactor IRT model are discussed.

Journal ArticleDOI
TL;DR: In this article, the authors compared the criterion-related validity of scores yielded by a work-non-work conflict scale and those yielded by work-family conflict scale using active-duty U.S. Army soldiers stationed in Germany and Italy with spouses and/or children and without spouses or children.
Abstract: Research examining the influence of nonwork issues on work-related outcomes has flourished. Often, however, the breadth of the interrole conflict construct varies widely between studies. To determine if the breadth of the interrole conflict measure makes a difference, the current study compares the criterion-related validity of scores yielded by a work‐nonwork conflict scale and those yielded by a work‐family conflict scale using active-duty U.S. Army soldiers stationed in Germany and Italy with spouses and/or children and without spouses or children. Results demonstrated that the two constructs are related but distinct. In addition, work‐family conflict had a stronger relationship with job satisfaction and turnover intentions for employees with a spouse and/or children than for single, childless employees, whereas work‐nonwork conflict had a stronger relationship with these outcomes for single, childless employees than for employees with a spouse and/or children.

Journal ArticleDOI
TL;DR: In this paper, a revised version of the coaching efficacy scale (CES II-HST) was developed for head coaches of high school teams, and data were collected from head coaches from 14 relevant high school sports (N = 799).
Abstract: The purpose of this validity study was to improve measurement of coaching efficacy, an important variable in models of coaching effectiveness. A revised version of the coaching efficacy scale (CES) was developed for head coaches of high school teams (CES II-HST). Data were collected from head coaches of 14 relevant high school sports (N = 799). Exploratory factor analysis (n = 250) and a conceptual understanding of the construct of interest led to the selection of 18 items. A single-group confirmatory factor analysis (CFA; n = 549) provided evidence for close model-data fit. A multigroup CFA provided evidence for factorial invariance by gender of the coach (n = 588).

Journal ArticleDOI
TL;DR: Mantel-Haenszel methods comprise a highly flexible methodology for assessing the degree of association between two categorical variables, whether they are nominal or ordinal, while controlling for... as discussed by the authors.
Abstract: Mantel-Haenszel methods comprise a highly flexible methodology for assessing the degree of association between two categorical variables, whether they are nominal or ordinal, while controlling for ...
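The central Mantel-Haenszel quantity for stratified 2x2 tables is the common odds-ratio estimate, which pools the association across strata while controlling for the stratifying variable. A hedged sketch of that estimator (the counts below are invented; the article's methodology also covers ordinal extensions and significance testing):

```python
# Mantel-Haenszel common odds-ratio estimate for stratified 2x2 tables.
# Each stratum's counts (a, b, c, d) correspond to the layout [[a, b], [c, d]];
# the example counts are hypothetical.
def mh_odds_ratio(tables):
    """tables: list of (a, b, c, d) counts per stratum. Returns the MH estimate."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

strata = [(10, 5, 4, 12), (8, 6, 3, 10)]  # two hypothetical strata
print(f"MH common OR = {mh_odds_ratio(strata):.2f}")
```

Because each stratum contributes its ad/n and bc/n terms separately, the estimate controls for the stratifying variable rather than collapsing the tables, which is what makes the method resistant to confounding.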

Journal ArticleDOI
TL;DR: In this paper, the authors extended Cheung and Chan's adjusted-individual and adjusted-weighted procedures to the case of a heterogeneous degree of dependence and found that the adjusted-weighted procedure generated slightly less biased estimates of the degree of heterogeneity than the adjusted-individual procedure across conditions.
Abstract: In meta-analysis, it is common to have dependent effect sizes, such as several effect sizes from the same sample but measured at different times. Cheung and Chan proposed the adjusted-individual and adjusted-weighted procedures to estimate the degree of dependence and incorporate this estimate in the meta-analysis. The present study extends the previous study by examining the case of heterogeneous degree of dependence. Simulation results reveal that these two procedures again generated less biased estimates of the degree of heterogeneity than the commonly used samplewise procedure and were statistically more powerful to detect true variations. In addition, the adjusted-weighted procedure generated slightly less biased estimates of the degree of heterogeneity than the adjusted-individual weighted procedure across conditions. Future directions to further refine these procedures are discussed.

Journal ArticleDOI
TL;DR: In this paper, a multilevel modeling approach is proposed to study the general and specific attitudes formed in human learning behavior based on the premises of activity theory, which conceptualizes the unit of analysis for attitude measurement as a scalable and evolving activity system rather than a single action.
Abstract: This article proposes a multilevel modeling approach to study the general and specific attitudes formed in human learning behavior. Based on the premises of activity theory, it conceptualizes the unit of analysis for attitude measurement as a scalable and evolving activity system rather than a single action. Measurement issues related to this conceptualization, including scale development and validation, are discussed with the help of facet analysis and multilevel structural equation modeling techniques. An empirical study was conducted, and the results indicate that this approach is theoretically and methodologically defensible.

Journal ArticleDOI
TL;DR: This paper investigated the use of latent class analysis for the detection of differences in item functioning on the Peabody Picture Vocabulary Test-Third Edition (PPVT-III) and proposed a two-class solution.
Abstract: This study investigated the use of latent class analysis for the detection of differences in item functioning on the Peabody Picture Vocabulary Test-Third Edition (PPVT-III). A two-class solution f...

Journal ArticleDOI
TL;DR: In this article, the authors tested the viability of the expanded nigrescence (NT-E) model as operationalized by Cross Racial Identity Scale (CRIS) scores using confirmatory factor analyses.
Abstract: In this study, the authors tested the viability of the expanded nigrescence (NT-E) model as operationalized by Cross Racial Identity Scale (CRIS) scores using confirmatory factor analyses. Participants were 594 Black college students from the Southeastern United States. Results indicated a good fit for NT-E's proposed six-factor structure. One-factor and two-factor higher-order models also yielded good fit indices, although several coefficients in the one-factor higher-order model were not salient or statistically significant. In sum, the results provide strong support for the CRIS as an operationalization of NT-E. The authors suggest that CRIS scores can be used in studies concerned with drawing inferences about the effects of racial identity attitudes.

Journal ArticleDOI
TL;DR: The authors compared student performance between paper-and-pencil testing (PPT) and computer-based testing (CBT) on a large-scale statewide end-of-course English examination.
Abstract: The current study compared student performance between paper-and-pencil testing (PPT) and computer-based testing (CBT) on a large-scale statewide end-of-course English examination. Analyses were conducted at both the item and test levels. The overall results suggest that scores obtained from PPT and CBT were comparable. However, at the content domain level, a rather large difference in the reading comprehension section suggests that the reading comprehension test may be more affected by the test administration mode. Results from the confirmatory factor analysis suggest that the administration mode did not alter the construct of the test.

Journal ArticleDOI
TL;DR: The present study seeks to fill the void by comparing two approaches for handling missing data in categorical covariates in logistic regression: the expectation-maximization (EM) method of weights and multiple imputation (MI).
Abstract: For the past 25 years, methodological advances have been made in missing data treatment. Most published work has focused on missing data in dependent variables under various conditions. The present study seeks to fill the void by comparing two approaches for handling missing data in categorical covariates in logistic regression: the expectation-maximization (EM) method of weights and multiple imputation (MI). Sample data are drawn randomly from a population with known characteristics. Missing data on covariates are simulated under two conditions: missing completely at random and missing at random with different missing rates. A logistic regression model was fit to each sample using either the EM or MI approach. The performance of these two approaches is compared on four criteria: bias, efficiency, coverage, and rejection rate. Results generally favored MI over EM. Practical issues such as implementation, inclusion of continuous covariates, and interactions between covariates are discussed.
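The MI side of this comparison ends with a pooling step: after fitting the logistic regression to each of m imputed data sets, the m coefficient estimates are combined with Rubin's rules. A hedged sketch of that pooling (the per-imputation estimates and variances below are invented stand-ins for coefficients from m fitted models, not values from the article):

```python
# Rubin's rules for combining estimates across m multiply imputed data sets.
# The example betas and variances are hypothetical.
import math

def rubin_pool(estimates, variances):
    """Combine m completed-data estimates and their variances (Rubin, 1987)."""
    m = len(estimates)
    q_bar = sum(estimates) / m
    u_bar = sum(variances) / m                              # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    t = u_bar + (1 + 1 / m) * b                             # total variance
    return q_bar, math.sqrt(t)

# Five hypothetical logistic-regression coefficients from five imputed data sets:
betas = [0.52, 0.48, 0.55, 0.50, 0.47]
vars_ = [0.010, 0.011, 0.009, 0.010, 0.012]
est, se = rubin_pool(betas, vars_)
print(f"pooled beta = {est:.3f}, SE = {se:.3f}")
```

The between-imputation term inflates the pooled standard error to reflect uncertainty about the missing values themselves, which is why MI tends to give better coverage than single-imputation approaches.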