
Showing papers in "Educational and Psychological Measurement in 2000"


Journal ArticleDOI
TL;DR: In this article, a meta-analysis explores factors associated with higher response rates in electronic surveys reported in both published and unpublished research and concludes that response representativeness is more important than response rate in survey research.
Abstract: Response representativeness is more important than response rate in survey research. However, response rate is important if it bears on representativeness. The present meta-analysis explores factors associated with higher response rates in electronic surveys reported in both published and unpublished research. The number of contacts, personalized contacts, and precontacts are the factors most associated with higher response rates in the Web studies that are analyzed.

2,520 citations


Journal ArticleDOI
TL;DR: In this article, a 2 × 3 design in which item stem direction and item response pattern direction were crossed was used to determine effects on internal consistency reliability as measured by Cronbach's alpha.
Abstract: The controversy over using reverse or negatively worded survey stems has persisted for several decades; the practice is of questionable utility and is intended to guard against acquiescence or response set behaviors. A 2 × 3 design in which item stem direction and item response pattern direction were crossed was used to determine effects on internal consistency reliability as measured by Cronbach’s alpha. The highest alpha occurred when all directly worded stems were used with bidirectional response options; its alpha reflected at least 10%, and in one case 20%, higher internal consistency than any of the three conditions in which negatively worded stems were used. This indicates that using all directly worded stems, with half of the response options running in one direction and half in the other, may be a better way of guarding against acquiescence and response set behaviors than using items with negatively worded stems.
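
To make the alpha comparison concrete, here is a minimal sketch (not the authors' code; the data and the single reversed stem are simulated) of how Cronbach's alpha is computed and why a negatively worded item must be reverse-scored before the computation:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, k_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Simulated 5-point Likert responses to 4 directly worded items.
rng = np.random.default_rng(0)
trait = rng.normal(3, 0.8, size=(200, 1))
direct = np.clip(np.round(trait + rng.normal(0, 0.7, size=(200, 4))), 1, 5)

# Pretend item 0 had a negatively worded stem: its raw scores run opposite
# to the trait, and alpha drops sharply unless it is reverse-scored
# (6 - x on a 1-to-5 scale) before the computation.
mixed = direct.copy()
mixed[:, 0] = 6 - mixed[:, 0]
print(f"all direct:        {cronbach_alpha(direct):.2f}")
print(f"one reversed, raw: {cronbach_alpha(mixed):.2f}")
mixed[:, 0] = 6 - mixed[:, 0]   # reverse-score it back
print(f"after recoding:    {cronbach_alpha(mixed):.2f}")
```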

487 citations


Journal ArticleDOI
TL;DR: In this article, the authors note that although empowerment is a popular management practice, there has been little research identifying the empowering behaviors of leaders, and they discuss the development of an instrument designed to measure such behaviors.
Abstract: Empowerment is a popular management practice, but there has been little research to identify empowering behaviors of leaders. The present article discusses the development of an instrument designed...

379 citations


Journal ArticleDOI
TL;DR: In this article, the authors present a manifesto regarding the nature of score reliability and what are reasonable expectations for psychometric reporting practices in substantive inquiries, and explore the consequences of misunderstandings about score reliability.
Abstract: The present article responds to selected criticisms of some EPM editorial policies and Vacha-Haase’s “reliability generalization” meta-analytic methods. However, the treatment is more broadly a manifesto regarding the nature of score reliability and what are reasonable expectations for psychometric reporting practices in substantive inquiries. The consequences of misunderstandings of score reliability are explored. It is suggested that paradigmatic misconceptions regarding psychometric issues feed into a spiral of presumptions that measurement training is unnecessary for doctoral students, which in turn further reinforces misunderstandings of score integrity issues.

339 citations


Journal ArticleDOI
TL;DR: The authors compared the original intrinsic and extrinsic subscales of the Minnesota Satisfaction Questionnaire short form to revised subscales using data from two samples, and found that revising the intrinsic and extrinsic subscales made little difference in the results obtained.
Abstract: This study compared the original intrinsic and extrinsic subscales of the Minnesota Satisfaction Questionnaire short form to revised subscales using data from two samples. The revised subscales were formed according to critiques by several researchers. Confirmatory factor analysis of the original and revised subscales supported the discriminant validity of scores on the intrinsic and extrinsic job satisfaction measures. Several hierarchical regression models were tested that included job involvement, overall job satisfaction, and volitional absence variables, in addition to the job satisfaction components. The analyses from both samples indicated that revising the intrinsic and extrinsic subscales made little difference in the results obtained.

329 citations


Journal ArticleDOI
TL;DR: In this article, the authors used Monte Carlo methods to assess the per-contrast and experimentwise Type I error rates of two post hoc tests of cellwise residuals and four post hoc tests of pairwise contrasts in 3 × 4 chi-square contingency tables.
Abstract: The authors used Monte Carlo methods to assess the per-contrast and experimentwise Type I error rates of two post hoc tests of cellwise residuals and four post hoc tests of pairwise contrasts in 3 × 4 chi-square contingency tables. The six post hoc procedures were evaluated under three sample sizes and under the null hypotheses of independence and homogeneity. Results of the study indicate that the cellwise adjusted residual method provided adequate experimentwise Type I error rate control when appropriate adjustments to the alpha level were made, and the Gardner pairwise post hoc procedure provided several advantages over the other pairwise procedures. This was true for both the independence and homogeneity models.
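
The Monte Carlo logic is straightforward to sketch. Below is a hypothetical, simplified version (uniform cell probabilities under the independence null, Haberman-style adjusted residuals, and a Bonferroni-type alpha adjustment across the 12 cells of a 3 × 4 table), not a reproduction of the authors' simulation:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
R, C, N, REPS, ALPHA = 3, 4, 300, 5000, 0.05
crit = norm.ppf(1 - ALPHA / (2 * R * C))   # Bonferroni-type cellwise adjustment

def adjusted_residuals(table):
    """Haberman adjusted residuals for a two-way contingency table."""
    n = table.sum()
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / n
    return (table - expected) / np.sqrt(
        expected * (1 - row / n) * (1 - col / n))

hits = 0
for _ in range(REPS):
    # Null of independence with uniform cell probabilities.
    counts = rng.multinomial(N, np.full(R * C, 1 / (R * C))).reshape(R, C)
    if np.any(np.abs(adjusted_residuals(counts)) > crit):
        hits += 1   # at least one false rejection somewhere in this table

print(f"experimentwise Type I error rate ~ {hits / REPS:.3f}")
```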

319 citations


Journal ArticleDOI
TL;DR: This article examined the frequency of use of various types of reliability coefficients for a systematically drawn sample of 696 tests appearing in the APA-published Directory of Unpublished Experimental Mental Measures.
Abstract: This study examined the frequency of use of various types of reliability coefficients for a systematically drawn sample of 696 tests appearing in the APA-published Directory of Unpublished Experimental Mental Measures. Almost all articles included some type of reliability report for at least one test administration. Coefficient alpha was the overwhelming favorite among types of coefficients. Several measures treated almost universally in psychological-testing textbooks were rarely or never used. Problems encountered in the study included ambiguous designations of types of coefficients, reporting reliability based on a study other than the one cited, inadequate information about subscales, and simply incorrect recording of the information given in an original source.

234 citations


Journal ArticleDOI
TL;DR: In this paper, two studies were conducted to develop and provide evidence supporting the construct validity of scores on a scale to measure two aspects of workplace friendship: friendship prevalence and friendship opportunities.
Abstract: Two studies were conducted to develop and provide evidence supporting the construct validity of scores on a scale to measure two aspects of workplace friendship: friendship prevalence and friendship opportunities. In the first study, data collected from 200 part-time graduate students supported the internal consistency and proposed dimensionality of scale scores. In the second study, data were collected from a total sample of 116, which consisted of part-time graduate students and employees of three organizations. Support was provided for convergent, discriminant, and nomological validity of scale scores.

233 citations


Journal ArticleDOI
TL;DR: In this paper, a total of 848 coefficients of stability and 1,359 internal consistency reliabilities were examined across the Big Five factors of personality: Emotional Stability, Extraversion, Openness to Experience, Agreeableness, and Conscientiousness.
Abstract: Meta-analysis was used to cumulate reliabilities of personality scale scores. A total of 848 coefficients of stability and 1,359 internal consistency reliabilities across the Big Five factors of personality were examined. The frequency-weighted mean coefficients of stability were .75 (SD = .10, K = 221), .76 (SD = .12, K = 176), .71 (SD = .13, K = 139), .69 (SD = .14, K = 119), and .72 (SD = .13, K = 193) for Emotional Stability, Extraversion, Openness to Experience, Agreeableness, and Conscientiousness, respectively. The corresponding internal consistency reliabilities were .78 (SD = .11, K = 370), .78 (SD = .09, K = 307), .73 (SD = .12, K = 251), .75 (SD = .11, K = 123), and .78 (SD = .10, K = 307). Sample-size-weighted means also were computed. The dimension of personality being rated does not appear to strongly moderate either the internal consistency or the test-retest reliabilities. Implications for personality assessment are discussed.
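
As a small worked example of the two cumulation schemes, the sketch below uses made-up coefficients; "frequency-weighted" is read here, as one plausible interpretation, as each reported coefficient counting once, while the sample-size-weighted mean weights each coefficient by its study's N:

```python
import numpy as np

# Made-up reliability coefficients and their study sample sizes.
r = np.array([0.78, 0.74, 0.81, 0.69])
n = np.array([120, 300, 85, 410])

freq_weighted = r.mean()               # each of the K coefficients counts once
n_weighted = np.average(r, weights=n)  # large-sample studies count more
print(f"frequency-weighted mean: {freq_weighted:.3f}")
print(f"sample-size-weighted mean: {n_weighted:.3f}")
```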

222 citations


Journal ArticleDOI
TL;DR: In this paper, reliability estimates for Beck Depression Inventory (BDI) scores across studies were accumulated and summarized in a meta-analysis, indicating that the logic of “test score reliability” generally has not prevailed in clinical psychology in applications of the BDI.
Abstract: The reliability estimates for the Beck Depression Inventory (BDI) scores across studies were accumulated and summarized in a meta-analysis. Only 7.5% of the articles reviewed reported meaningful reliability estimates, indicating that the logic of “test score reliability” generally has not prevailed in clinical psychology in applications of the BDI. Analyses revealed that for the BDI, the measurement error due to time sampling, as captured by test-retest reliability estimates, is considerably larger than the measurement error due to item heterogeneity and content sampling, as captured by internal consistency reliability estimates. Also, reliability estimates involving substance addicts were consistently lower than reliability estimates involving normal subjects, possibly due to restriction of range problems. Correlation analyses revealed that standard errors of measurement (SEMs) were not correlated with reliability estimates but were substantially related to standard deviations of BDI scores, suggesting that SEM...
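
The last finding follows from the classical test theory identity SEM = SD × sqrt(1 − r_xx): over the fairly narrow range of reliabilities a published scale typically shows, sqrt(1 − r_xx) varies far less than score SDs do across samples. A tiny illustration with invented numbers:

```python
import math

def sem(sd: float, rxx: float) -> float:
    """Standard error of measurement: SEM = SD * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1 - rxx)

# Invented values: reliability varies modestly, SD varies a lot.
print(sem(sd=6.0, rxx=0.86))   # ~2.25
print(sem(sd=6.0, rxx=0.78))   # ~2.81: lower reliability, modest SEM change
print(sem(sd=10.0, rxx=0.86))  # ~3.74: the larger SD dominates the SEM
```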

207 citations


Journal ArticleDOI
TL;DR: In this article, the authors assess several long-standing criticisms of the Social Readjustment Rating Scale (SRRS), which despite such criticism remains one of the most widely cited measurement instruments in the stress literature.
Abstract: Despite criticism, the Social Readjustment Rating Scale (SRRS) is one of the most widely cited measurement instruments in the stress literature. This research assesses several criticisms of the SRRS after years of widespread use. Specifically, the authors evaluate content-related criticisms, including differential prediction of desirable relative to undesirable life events, controllable relative to uncontrollable life events, and contaminated relative to uncontaminated life event items. On balance, the authors find that the SRRS is a useful tool for stress researchers and practitioners.

Journal ArticleDOI
TL;DR: In this article, the authors show that seemingly equivalent ways of specifying the four-parameter design matrix for the two-phase interrupted time-series design do not lead to the same conclusions: tests and estimates of level change differ dramatically across specifications.
Abstract: It has been recognized that the two-phase version of the interrupted time-series design can be frequently modeled using a four-parameter design matrix. There are differences across writers, however, in the details of the recommended design matrices to be used in the estimation of the four parameters of the model. Various writers imply that different methods of specifying the four-parameter design matrix all lead to the same conclusions; they do not. The tests and estimates for level change are dramatically different under the various seemingly equivalent design specifications. Examples of egregious errors of interpretation are presented and recommendations regarding the correct specification of the design matrix are made. The recommendations hold whether the model is estimated using ordinary least squares (for the case of approximately independent errors) or some more complex time-series approach (for the case of autocorrelated errors).
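
A minimal sketch of the specification issue, using simulated data (the centered design matrix below is one common choice, not necessarily the exact matrix any particular writer recommends):

```python
import numpy as np

n1, n2 = 10, 10                    # baseline and post-intervention lengths
t = np.arange(1, n1 + n2 + 1)      # time index 1..20
D = (t > n1).astype(float)         # phase indicator: 0 = baseline, 1 = post

# Four-parameter design matrix: [intercept, time, level change, slope change].
# Centering the slope-change regressor at the first post-intervention point
# makes the level-change coefficient the displacement at that point; the
# uncentered version silently redefines what "level change" estimates.
X_centered = np.column_stack([np.ones_like(t), t, D, D * (t - (n1 + 1))])
X_naive = np.column_stack([np.ones_like(t), t, D, D * t])

rng = np.random.default_rng(1)
y = 2 + 0.5 * t + 5.0 * D + 0.3 * D * (t - (n1 + 1)) + rng.normal(0, 1, t.size)

for name, X in [("centered", X_centered), ("naive", X_naive)]:
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(name, "level-change estimate:", round(b[2], 2))
```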

Journal ArticleDOI
TL;DR: In this paper, the authors investigated empirically exactly how dissimilar, in both composition and variability, samples inducting reliability coefficients from prior studies were from the cited prior samples from which the coefficients were generalized, challenging the mind-set that reliability, once proven, is immutable.
Abstract: As measurement specialists, we have done a disservice to both ourselves and our profession by habitually referring to “the reliability of the test,” or saying that “the test is reliable.” This has created a mind-set implying that reliability, once proven, is immutable. More important, practitioners and scholars need not know measurement theories if they may simply rely on the reliability purportedly intrinsic within all uses of established measures. The present study investigated empirically exactly how dissimilar in both composition and variability samples inducting reliability coefficients from prior studies were from the cited prior samples from which coefficients were generalized.

Journal ArticleDOI
TL;DR: Reliability generalization is a meta-analytic method for examining the variability in the reliability of scores by determining which sample characteristics are related to differences in score reliability; applying it to 51 samples employing the NEO personality scales, the authors found a large amount of variability in the reliability of NEO scores, both between and within personality domains.
Abstract: A reliability generalization of 51 samples employing one of the NEO personality scales was conducted. Reliability generalization is a meta-analytic method for examining the variability in the reliability of scores by determining which sample characteristics are related to differences in score reliability. It was found that there was a large amount of variability in the reliability of NEO scores, both between and within personality domains. The sample characteristics that are related to score reliability were dependent on NEO domain. Agreeableness scores appear to be the weakest of the domains assessed by the NEO scales in terms of reliability, particularly in clinical samples, for male-only samples, and when temporal consistency was the criterion for reliability. The reliability of Openness to Experience scores was low when the NEO-Five Factor Inventory was used. The advantages of conceptualizing reliability as a property of scores, and not tests, are discussed.
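
The core reliability generalization step can be sketched as a regression of reported reliability coefficients on dummy-coded sample characteristics. Everything below (coefficients, predictors, coding) is hypothetical:

```python
import numpy as np

# Hypothetical RG data set: one row per reported reliability coefficient.
# Columns: reliability, clinical sample (0/1), male-only (0/1), test-retest (0/1).
data = np.array([
    [0.86, 0, 0, 0],
    [0.71, 1, 0, 1],
    [0.79, 0, 1, 0],
    [0.68, 1, 1, 1],
    [0.83, 0, 0, 1],
    [0.75, 1, 0, 0],
])
r, predictors = data[:, 0], data[:, 1:]

# Dummy-coded regression of coefficients on sample characteristics: the
# slopes estimate how much each characteristic shifts score reliability.
X = np.column_stack([np.ones(len(r)), predictors])
b, *_ = np.linalg.lstsq(X, r, rcond=None)
print(dict(zip(["intercept", "clinical", "male_only", "retest"], b.round(3))))
```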

Journal ArticleDOI
TL;DR: In this paper, the authors report the multistage development of the Community Service Attitudes Scale (CSAS), an instrument based on Schwartz’s helping behavior model for measuring college students’ attitudes about community service; CSAS scale scores were related to gender, college major, community service experience, and intentions to engage in community service.
Abstract: This study reports the multistage development of the Community Service Attitudes Scale (CSAS), an instrument for measuring college students’ attitudes about community service. The CSAS was developed based on Schwartz’s helping behavior model. Scores on the scales of the CSAS yielded strong reliability evidence (coefficient alphas ranging from .72 to .93). Principal components analysis yielded results consistent with the Schwartz model. In addition, the CSAS scale scores were positively correlated with gender, college major, community service experience, and intentions to engage in community service. The CSAS will be useful to researchers for conducting further research on the effects of service learning and community service experiences for students.

Journal ArticleDOI
TL;DR: In this article, the authors proposed a bootstrap-F method for one-way repeated measure ANOVA design using a Monte Carlo approach in which sample size, nonsphericity, and sample complexity were taken into account.
Abstract: The current article proposes a bootstrap-F method and a bootstrap-T2 method for use in a one-way repeated measure ANOVA design. Using a Monte Carlo approach in which sample size, nonsphericity, and...

Journal ArticleDOI
TL;DR: In this paper, the authors meta-analytically synthesized studies that have investigated the extent to which individuals can inflate their integrity test scores when coached or instructed to fake good.
Abstract: Although it has been consistently found that test takers can effectively fake good on self-report noncognitive measures when instructed to do so, not all measures are equally susceptible. The present review meta-analytically synthesized studies that have investigated the extent to which individuals can inflate their integrity test scores when coached or instructed to fake good. Both overt and personality-based integrity tests were investigated. Results indicated that the overt test was especially susceptible to both fake good (d = 0.90) and coaching (d = 1.32) instructions. Personality-based measures appeared to be more resistant to both faking good (d = 0.38) and coaching (d = 0.36). Implications of these results for integrity testing are discussed.

Journal ArticleDOI
TL;DR: This article examined the validity of scores on the Multigroup Ethnic Identity Measure (MEIM) in a group of 275 academically talented adolescents attending a summer enrichment program and found that the MEIM was a two-factor measure.
Abstract: This study examined the validity of scores on the Multigroup Ethnic Identity Measure (MEIM) in a group of 275 academically talented adolescents attending a summer enrichment program. The two-factor...


Journal ArticleDOI
TL;DR: In this paper, the authors proposed an improvement-over-chance classification (I) index, which can be used in situations that are univariate, multivariate, homogeneous, heterogeneous, or any combination thereof.
Abstract: The research content of interest herein is that of comparison of means. It is generally recognized that statistical test p values do not adequately reflect mean comparison assessments. What is desirable is some effect-size assessment. The typical effect-size indexes used in mean comparisons are restricted to the variance homogeneity condition. What is proposed here is the use of the group-overlap concept. Group overlap may be assessed via prediction of group assignment, that is, using predictive discriminant analysis. The effect-size index proposed is that of improvement-over-chance classification (I). The I index may be used in situations that are univariate, multivariate, homogeneous, heterogeneous, or any combination thereof. Some very tentative suggestions for cutoffs of I values to define index magnitude for some data situations are made.
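
A rough sketch of computing I, with hypothetical two-group data, a cross-validated hit rate from a linear classification rule, and the proportional chance criterion as the chance baseline:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(7)
# Hypothetical two-group data on two predictor variables.
g1 = rng.normal(0.0, 1.0, size=(60, 2))
g2 = rng.normal(0.8, 1.5, size=(40, 2))
X = np.vstack([g1, g2])
y = np.array([0] * 60 + [1] * 40)

# Hit rate from predictive discriminant analysis, cross-validated so the
# classification of each case does not reuse that case for estimation.
pred = cross_val_predict(LinearDiscriminantAnalysis(), X, y, cv=10)
hit = (pred == y).mean()

# Proportional chance criterion: the hit rate expected by chance alone.
e = sum(np.mean(y == g) ** 2 for g in np.unique(y))
I = (hit - e) / (1 - e)   # improvement-over-chance classification index
print(f"hit rate {hit:.3f}, chance {e:.3f}, I = {I:.3f}")
```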

Journal ArticleDOI
TL;DR: The results of the study are that a reduction of at least 22% in the mean number of items can be expected in a computerized adaptive test (CAT) compared to an existing paper-and-pencil placement test.
Abstract: The objective of this study was to explore the possibilities for using computerized adaptive testing in situations in which examinees are to be classified into one of three categories. Testing algorithms with two different statistical computation procedures are described and evaluated. The first computation procedure is based on statistical testing and the other on statistical estimation. Item selection methods based on maximum information (MI) considering content and exposure control are considered. The measurement quality of the proposed testing algorithms is reported. The results of the study are that a reduction of at least 22% in the mean number of items can be expected in a computerized adaptive test (CAT) compared to an existing paper-and-pencil placement test. Furthermore, statistical testing is a promising alternative to statistical estimation. Finally, it is concluded that imposing constraints on the MI selection strategy does not negatively affect the quality of the testing algorithms.
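
The maximum-information selection step itself is simple to sketch. Below is a hypothetical 2PL version; the ability update from observed responses, the three-category decision rule, and the content and exposure constraints the article considers are all omitted:

```python
import numpy as np

def p2pl(theta, a, b):
    """2PL model: P(correct | theta) = 1 / (1 + exp(-a(theta - b)))."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def info(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p2pl(theta, a, b)
    return a ** 2 * p * (1 - p)

# Hypothetical item bank: discriminations a and difficulties b.
rng = np.random.default_rng(3)
a = rng.uniform(0.8, 2.0, size=50)
b = rng.normal(0.0, 1.0, size=50)

administered = set()
theta_hat = 0.0   # provisional ability estimate (updating it is omitted here)
for step in range(5):
    # MI selection: administer the unused item most informative at theta_hat.
    candidates = [i for i in range(50) if i not in administered]
    best = max(candidates, key=lambda i: info(theta_hat, a[i], b[i]))
    administered.add(best)
    print(step, best, round(float(info(theta_hat, a[best], b[best])), 3))
```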

Journal ArticleDOI
TL;DR: In this paper, the authors report the results of several psychometric analyses that were conducted to provide evidence of construct validity for scores on a measure of psychological empowerment, the Psychological Empowerment Scale (PES), for parents of children with a disability.
Abstract: This article reports the results of several psychometric analyses that were conducted to provide evidence of construct validity for scores on a measure of psychological empowerment, the Psychological Empowerment Scale (PES), for parents of children with a disability. Confirmatory factor analyses were conducted to evaluate the internal structure of the PES and the reliability of its scores. The results of the confirmatory factor analyses provided evidence of convergent and discriminant validity for the scores from the four subscales underlying the PES: (a) attitudes of control and competence, (b) cognitive appraisals of critical skills and knowledge, (c) formal participation in organizations, and (d) informal participation in social systems and relationships. Reliability coefficients for the subscale scores and total scale score ranged from .90 to .97. In addition, the PES scores were correlated with other empowerment-related measures. The results of these correlational analyses and group discrimination an...

Journal ArticleDOI
TL;DR: In this article, three methods of measuring self-efficacy were compared: traditional, Likert, and a simplified scale. Scores on the three scales had highly similar reliability and validity and were strongly related.
Abstract: Three methods of measuring self-efficacy were compared: traditional, Likert, and a simplified scale. Scores on the three scales had highly similar reliability and validity and were strongly related. The Likert and simplified scales required 50% and 70% (respectively) fewer participant responses than the traditional format, whereas the traditional and Likert formats provided more specific diagnostic information.

Journal ArticleDOI
TL;DR: In this article, the authors review issues regarding test reliability, which is psychometric terminology, and score reliability, which is score-centric terminology; these issues have arisen in part from some EPM editorial policies and Vacha-Haase’s “reliability generalization” proposal.
Abstract: The present article reviews issues regarding test reliability, which is psychometric terminology, and score reliability, which is score-centric terminology. These issues have arisen, in part, due to some EPM editorial policies and Vacha-Haase’s “reliability generalization” proposal. The article includes (a) a brief historical review of reliability terminology, (b) discussion on the emergence of datametrics (loosely defined as the application of psychometry to scores as opposed to an instrument), including a review of textbook authors’ uses of psychometric versus datametric terminology, (c) discussion of problems with datametrics, and (d) a critique of Vacha-Haase’s proposed meta-analytic reliability generalization via dummy-coded regression. The article concludes with a brief summary that presents several suggestions.

Journal ArticleDOI
TL;DR: This paper investigated whether the change of response order in a Likert-type scale altered participant responses and scale characteristics and found that response order had no substantial influence on participant responses or scale characteristics.
Abstract: The study investigated whether the change of response order in a Likert-type scale altered participant responses and scale characteristics. Response order is the order in which options of a Likert-type scale are offered. The sample included 490 college students and 368 junior high school students. Scale means with different response orders were compared. Structural equation modeling was used to test the invariance of interitem correlations, covariances, and factor structure across scale formats and educational levels. The results indicated that response order had no substantial influence on participant responses and scale characteristics. Motivating participants and avoiding ambiguous items may minimize possible effects of scale format on participant responses and scale properties.

Journal ArticleDOI
TL;DR: In this paper, a meta-analysis was conducted to determine the extent to which computer administration of a measure influences socially desirable responding, and a small but statistically significant effect was found for impression management, with impression management being lower when assessed by computer.
Abstract: A meta-analysis was conducted to determine the extent to which the computer administration of a measure influences socially desirable responding. Social desirability was defined as consisting of two components: impression management and self-deceptive enhancement. A small but statistically significant effect (d = -0.08) was found for impression management, with impression management being lower when assessed by computer. Correlational analysis revealed, however, that the strength of the effect of computer administration on impression management appeared to diminish over time such that more recent studies have found small or no effects. Consistent with its conceptualization, reports of self-deceptive enhancement did not differ by testing format. The implications of these findings are discussed in terms of how they contribute to the explication of the construct of social desirability and cross-mode equivalence.

Journal ArticleDOI
TL;DR: The historical growth in the popularity of statistical significance testing is examined using a random sample of annual data from 12 American Psychological Association (APA) journals as mentioned in this paper, and the results replicate and extend the findings of Hubbard, Parsa, and Luthy, who used data from only the Journal of Applied Psychology.
Abstract: The historical growth in the popularity of statistical significance testing is examined using a random sample of annual data from 12 American Psychological Association (APA) journals. The results replicate and extend the findings of Hubbard, Parsa, and Luthy, who used data from only the Journal of Applied Psychology. The results also confirm Gigerenzer and Murray’s allegation that an inference revolution occurred in psychology between 1940 and 1955. An assessment of the future prospects for statistical significance testing is offered. It is concluded that replication with extension research, and its connections with meta-analysis, is a better vehicle for developing a cumulative knowledge base in the discipline than statistical significance testing. It is conceded, however, that statistical significance testing is likely here to stay.

Journal ArticleDOI
TL;DR: In this paper, the authors note that despite evidence that Allen and Meyer’s scales measure three-component commitment in a reliable and valid manner, the literature contains recurring criticisms of several scale items.
Abstract: Despite evidence that Allen and Meyer’s scales measure three-component commitment in a reliable and valid manner, the literature contains recurring criticism of several scale items. Criticisms refe...

Journal ArticleDOI
TL;DR: In this article, the authors show that ANCOVA and regression both exhibit a directional bias when measuring correlates of change, a bias that confounds the comparison of changes between naturally occurring groups with large pretest differences.
Abstract: ANCOVA and regression both exhibit a directional bias when measuring correlates of change. This bias confounds the comparison of changes between naturally occurring groups with large pretest differ...

Journal ArticleDOI
TL;DR: The differential functioning of items and tests (DFIT) framework was used to examine the measurement equivalence of a Spanish translation of the Sixteen Personality Factor (16PF) Questionnaire as mentioned in this paper.
Abstract: The differential functioning of items and tests (DFIT) framework was used to examine the measurement equivalence of a Spanish translation of the Sixteen Personality Factor (16PF) Questionnaire. The questionnaire was administered in English to English-speaking Anglo-Americans and English-dominant Hispanic Americans and in Spanish to Spanish-dominant Hispanic Americans and Spanish-speaking Mexican nationals. As expected, the compensatory differential item functioning/differential test functioning (CDIF/DTF) procedure, which accounts for CDIF at the scale level, flagged fewer items as differential functioning than did the noncompensatory differential item functioning (NCDIF) procedure. Results did not support the hypothesis that DIF would be greatest in the Anglo versus Spanish-speaker comparison followed by the Hispanic versus Spanish-speaker comparison and least in the Anglo versus Hispanic comparison. Advantages of using the DFIT framework in assessing test translations, especially for test developers, ar...