
Showing papers in "Educational and Psychological Measurement in 1971"


Journal ArticleDOI
TL;DR: This article addresses the problem of determining the optimal number of rating categories for a rating instrument: whether there is an optimum number, or at least a number of categories beyond which there is no further improvement in discrimination of the rated items.
Abstract: GIVEN that rating scales are so widely used in the social sciences, both as research tools and in practical applications, determination of the optimal number of rating categories becomes an important consideration in the construction of such scales. As Garner (1960) pointed out, the basic question is whether for any given rating instrument there is an optimum number of rating categories, or at least a number of rating categories beyond which there is no further improvement in discrimination of the rated items. Garner and Hake

588 citations


Journal ArticleDOI
TL;DR: In the behavioral and social sciences, a very common problem is the prediction of the standing of a person or thing on one variable, usually designated the criterion, from his or its standing on a number of other variables, usually called the predictors.
Abstract: A very common problem in the behavioral and social sciences is the prediction of the standing of a person or thing on one variable, usually designated the criterion, from his or its standing on a number of other variables, usually called the predictors. Least-squared error multiple regression weights are most commonly used in weighting the predictors into a composite. These weights, which minimize, over the cases in the sample, the sum of the squared deviations of the observed from the predicted criterion score, are calculated from the normal equations which express the minimization conditions (Anderson, 1958). The fitting of the regression weights to the idiosyncrasies of the initial sample leads to a decrease in effectiveness when these weights are applied to a new sample in which these particular idiosyncrasies are not present. This "shrinkage" is often substantial in practical situations (e.g., see: Kurtz, 1948; Cureton, 1950; and Kirkpatrick, 1951), especially when the initial sample is small. And small samples, as Lawshe and Schucker (1959) point out, are the rule rather than the exception in many areas of applied psychology. Certain other approaches to the weighting problem produce

223 citations


Journal ArticleDOI
TL;DR: In this article, a person is asked to indicate, through recall or recognition, the answer to a question and the person asking the question then decides regarding the correctness or appropriateness of the answer.
Abstract: an opportunity for him to demonstrate his knowledge. The person is asked to indicate, through recall or recognition, the answer to a question and the person asking the question then decides regarding the correctness or appropriateness of the answer. The decision regarding the person’s knowledge is a function of what he actually knows, the way the question is asked, and the way the judgment is made concerning the correctness of the answer. The assumption is made that if a question is asked of a person and he answers the question in the proper way, he has the information, and if he does not answer the question properly, he does not have the information. An alternative method to determine whether or not a person knows something is to ask him if he knows it. Thus one can present an individual with a list of laws, principles, persons, or facts and ask the person to check or otherwise indicate the ones with which

85 citations


Journal ArticleDOI
TL;DR: The purpose of the present paper is to show that it is possible to view certain single-subject designs in such a way as to make the technique of Analysis of Vari-
Abstract: IN Psychology, there has been a long-standing conflict between single-subject and multisubject researchers, which tends to center about the question of whether or not statistics is really useful in single-subject research. Such writers as Sidman (1961) and Skinner (1953) emphasize precisely controlled, single-subject experiments as the most fruitful experimental approach in Psychology, with the role of statistics being limited to the use of elementary descriptive statistics such as means and standard deviations. The powerful, multivariate methods of modern statistics are considered to be inapplicable because they are primarily designed to deal with groups instead of individuals and because their averaging out processes tend to obscure individual differences. Other writers, such as Underwood (1957), argue that the best experimental approach in Psychology is to study groups of subjects to which the modern statistical inference methods may be applied. The purpose of the present paper is to show that it is possible to view certain single-subject designs in such a way as to make the technique of Analysis of Vari-

83 citations


Journal ArticleDOI
TL;DR: In tailored testing, items are chosen at a difficulty level matching the examinee's ability, as inferred from responses to items already administered; Robbins-Monro procedures for selecting items and for estimating ability are evaluated.
Abstract: In tailored testing, we try to choose for administration items at a difficulty level matching the examinee's ability, which we infer from his responses to items already administered. Robbins-Monro procedures for selecting items and for estimating the examinee's ability are evaluated. Various ideas of use for tailored testing emerge.

70 citations


Journal ArticleDOI
TL;DR: The factor results used for this illustration of a factor analytic interpretation strategy are the reanalyses, by seven different solutions, of the data from nine of the Guilford studies as reported by C. Harris (1967).
Abstract: THE purpose of this paper is to illustrate the use of a strategy for determining the common factors in a set of data. C. Harris (1967) suggested using several different computing algorithms for the initial solution, obtaining derived solutions, both orthogonal and oblique, comparing the results, and regarding as the important substantive findings those factors that are robust with respect to method. This paper illustrates a way of comparing the results. The factor results used for this illustration of a factor analytic interpretation strategy are the reanalyses, by seven different solutions, of the data from nine of the Guilford studies as reported by C. Harris (1967). The initial component and factor methods used

63 citations


Journal ArticleDOI
TL;DR: This study examined the construct validity of several commonly used personality measures, following Campbell and Fiske's (1959) points that correlational evidence for construct validity requires demonstration of both convergent validity and discriminant validity, and that two tests may correlate highly because they have method, as well as trait, variance in common.
Abstract: The present study was undertaken to examine the construct validity of several commonly-used personality measures. Campbell and Fiske (1959) have pointed out that correlational evidence for the construct validity of a psychological test requires demonstration of both convergent validity and discriminant validity. They also point out that two tests may correlate highly because they have method, as well as trait, variance in common. From these considerations it follows that correlational evidence for construct validity requires that each of at least two traits be measured by at least two methods. The correlations may be presented in what Campbell and

55 citations


Journal ArticleDOI
TL;DR: In this paper, Cattell and Digman proposed a comprehensive perturbation theory with corrections indicated for questionnaires and ratings according to the specialized form of perturbations called trait-view theory.
Abstract: ATTEMPTS to correct questionnaire scores for distortion and instrument factor ("content") effects have been largely based on response set (Cronbach, 1946), multi-method (Campbell and Fiske, 1959) and social desirability (Edwards, 1957) conceptions. Recently Cattell and Digman proposed subsuming most of these under a comprehensive perturbation theory (1964) with corrections indicated for questionnaires and ratings according to the specialized form of perturbation theory called trait-view theory. Trait-view theory proposes to consider the distortion as itself the

49 citations


Journal ArticleDOI
TL;DR: To judge the value of this program, the authors related the AP scores of Yale students to criteria of ability and achievement.
Abstract: some college courses while they are in secondary school (Blackmer, 1952; Casserly, 1966; Cornog, 1956; and Wilcox, 1962). Participating schools offer to their better students courses which the program has planned, outlined, and described. These students take examinations which the program’s committee of readers grades on a scale ranging from five, extremely well qualified, to one, no recommendation. After participating colleges receive the scores they may decide to place students in advanced courses or to award them college credit. To judge the value of this program we decided to relate the AP scores of Yale students to criteria of ability and achievement, such

29 citations


Journal ArticleDOI
TL;DR: This paper addresses the controversy over the interpretation of a well-known laboratory experiment intended to classically condition attitudinal affect to previously neutral stimuli; critics have asserted that some subjects in this situation do become aware and that the experimental effect can be accounted for in terms of this awareness.
Abstract: versy (Cohen, 1964; Insko and Oakes, 1966; Page, 1969; Staats, 1969) regarding the interpretation of a well-known laboratory experiment intended to classically condition attitudinal affect to previously neutral stimuli. The original authors (Staats and Staats, 1958) claimed they had demonstrated that attitudes are acquired through a process similar to classical conditioning, and that this occurred "without awareness-without cognition" on the part of the subjects. The critics of this interpretation have asserted that some subjects in this situation do become aware and that the experimental effect can be accounted for in terms of this awareness. In his recent

25 citations


Journal ArticleDOI
TL;DR: A flexilevel test is found to be inferior to a peaked conventional test for measuring examinees in the middle of the ability range, but superior for examinees at the extremes.
Abstract: A flexilevel test is found to be inferior to a peaked conventional test for measuring examinees in the middle of the ability range, superior for examinees at the extremes. Throughout the entire range of ability, a flexilevel test is much superior to any conventional test that attempts to provide accurate measurement at both extremes. A THEORETICAL STUDY OF THE MEASUREMENT EFFECTIVENESS OF FLEXILEVEL TESTS* A conventional test becomes a flexilevel test when modified so that the examinee follows these rules: 1. Answer first a specified test item of median difficulty. 2. After answering an item correctly, attempt next the easiest unanswered item of more-than-median difficulty. After answering an item incorrectly, attempt next the hardest unanswered item of less-than-median difficulty. A special answer sheet is used so that the examinee will know whether each answer is correct or incorrect. If the conventional test contains N items, the examinee taking the flexilevel test will attempt only n = (N + 1)/2 of these. A method for implementing flexilevel testing is described by
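The item-selection rules in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's procedure: the function name `flexilevel_sequence`, the 0-based difficulty ranking (0 = easiest), and the `answers_correctly` callback are hypothetical names introduced here, and the sketch assumes an odd number of items so that a single median item exists.

```python
from typing import Callable, List


def flexilevel_sequence(
    n_items: int, answers_correctly: Callable[[int], bool]
) -> List[int]:
    """Return the difficulty ranks (0 = easiest) of the items attempted.

    Assumption of this sketch: items are already ranked by difficulty and
    n_items is odd, so the median item is unique.
    """
    if n_items < 1 or n_items % 2 == 0:
        raise ValueError("this sketch assumes a positive, odd number of items")

    median = n_items // 2
    next_harder = median + 1  # easiest unanswered item above the median
    next_easier = median - 1  # hardest unanswered item below the median
    n_attempts = (n_items + 1) // 2  # n = (N + 1)/2 from the abstract

    attempted: List[int] = []
    current = median  # rule 1: start with the item of median difficulty
    for _ in range(n_attempts):
        attempted.append(current)
        if answers_correctly(current):
            # rule 2, correct answer: move to the easiest unanswered harder item
            current = next_harder
            next_harder += 1
        else:
            # rule 2, incorrect answer: move to the hardest unanswered easier item
            current = next_easier
            next_easier -= 1
    return attempted


# Hypothetical examinee who answers correctly only items ranked 4 or easier.
print(flexilevel_sequence(7, lambda rank: rank <= 4))
```

Because only n - 1 items are attempted after the median item, and exactly n - 1 items lie on each side of it, the sequence can never run past the easiest or hardest item.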

Journal ArticleDOI
TL;DR: In spite of widespread agreement regarding the desirability of a theory-derived diagnostic classification system for the psychiatric disorders of childhood, no personality theory has obtained sufficient general support to form the basis for such a system.
Abstract: IN spite of widespread agreement regarding the desirability of a theory-derived diagnostic classification system for the psychiatric disorders of childhood (Bard, Sidwell, and Wittenbrook, 1955; Rutter, 1965; Freud, 1965; Achenbach, 1966), no personality theory has obtained sufficient general support to form the basis for such a system. Increasing numbers of investigators have, therefore, focused their efforts upon developing a descriptive classification of manifest symptoms defined in behavioral terms with a minimum of interpretation and inference. As Miller (1967a) has indicated, many variables contribute to the differences across studies in regard to the descriptive classification scheme derived. Mathematical considerations such as the type of factor analytic rotation employed and the number of factors extracted, subject population considerations such as the types of children evaluated, and item considerations such as number of items

Journal ArticleDOI
K.H. Lu
TL;DR: In many areas of inquiry, the variable of interest often does not lend itself to direct physical measurement; its mensuration, therefore, must rely upon the subjective judgments of the observer.
Abstract: IN many areas of inquiries, the variable of interest often does not lend itself to direct physical measurement; its mensuration, therefore must rely upon the subjective judgments of the observer. For example, the evaluation of works in the arts and letters, the weighing of certain experimental and clinical data in biomedical research, psychology, sociology, and education are all but a few of the areas of this kind. Due to the multitude of factors associated with the variable, precise and detailed considerations of all of these factors are either impractical or impalable. Consequently, the com-

Journal ArticleDOI
TL;DR: Coleman et al. (1966) found that attitudes towards school and learning were significant indicators of verbal skills in sixth graders, while other studies have shown positive but often nonsignificant relationships between attitudes and grades and achievement scores; few instruments for assessing student attitudes exist in the literature.
Abstract: EDUCATORS and parents are becoming increasingly aware of the importance of student attitudes. The child is often expected not only to learn the required subject matter, but also to enjoy school and to look forward to learning new things. Also, there is concern among industrial leaders as well as educators that children appreciate the benefits of technology and that they not be afraid of the many machines in their environments. Previous studies (Coleman, Campbell, Hobson, McPartland, Mood, Weinfeld, and York, 1966) have shown that student attitudes are related to school achievement. In their massive studies of United States schools, Coleman, et al. found that attitudes towards school and learning were significant indicators of verbal skills in sixth graders. Other studies have shown positive, but often nonsignificant relationships between grades and achievement scores and attitudes (Jackson and Lahaderne, 1967; Brodie, 1964). Even with the increasing interest in measures for assessing student attitudes, there are few existing instruments in the literature. Many of those which do exist are parts of larger instruments. Often these subtests do not have tested reliability or validity as independent measures (Coleman, et al., 1966). Other instruments

Journal ArticleDOI
TL;DR: Linn and Werts (1969) point out that Darlington's (1968) "usefulness" requires a number of assumptions when used in making causal inferences; this article specializes their considerations to the analysis of covariance method.
Abstract: (McNemar, 1962) is equivalent to the squared total multiple correlation of dummy variables and covariates with the dependent variable minus the squared multiple correlation of the covariates with the dependent variable and the denominator of the F test is equivalent to the error variance, i.e., one minus the squared total multiple correlation. The numerator is therefore what Darlington (1968) calls "usefulness," i.e., the proportion of variance that the dummy variables add to the prediction of the dependent variables in a stepwise regression after the covariates have entered. However, Linn and Werts (1969) point out that "usefulness" requires a number of assumptions when used in making causal inferences. It follows that the considerations listed by Linn and Werts (1969) can be specialized to the analysis of covariance method as shall be demonstrated below.
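The two quantities described in this abstract can be written out as a short sketch. The function names and the degrees-of-freedom arguments are assumptions introduced here for illustration (the abstract does not state degrees of freedom); the only relationships taken from the text are that "usefulness" is the squared total multiple correlation minus the squared multiple correlation of the covariates, and that the error term is one minus the squared total multiple correlation.

```python
def usefulness(r2_total: float, r2_covariates: float) -> float:
    """Proportion of variance the dummy variables add after the covariates."""
    if not 0.0 <= r2_covariates <= r2_total <= 1.0:
        raise ValueError("expected 0 <= r2_covariates <= r2_total <= 1")
    return r2_total - r2_covariates


def ancova_f(
    r2_total: float, r2_covariates: float, df_groups: int, df_error: int
) -> float:
    """F ratio built from the two squared multiple correlations.

    Assumption of this sketch: df_groups and df_error are supplied by the
    caller, and each term is divided by its own degrees of freedom.
    """
    if df_groups < 1 or df_error < 1:
        raise ValueError("degrees of freedom must be positive")
    if r2_total >= 1.0:
        raise ValueError("error variance is zero when r2_total equals 1")
    numerator = usefulness(r2_total, r2_covariates) / df_groups
    denominator = (1.0 - r2_total) / df_error
    return numerator / denominator


# Hypothetical values, chosen only to show the arithmetic.
print(usefulness(0.5, 0.3))
print(ancova_f(0.5, 0.3, df_groups=2, df_error=20))
```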

Journal ArticleDOI
TL;DR: This paper considers the estimation of the correlation between variables X and Y for a population when complete data are available for only a select, nonrandom sample, as when a college entrance test is used as a selection device and the correlation between the test and freshman grade point average is wanted for all those who applied for admission.
Abstract: BEHAVIORAL scientists frequently utilize interval estimation techniques and point estimates of parameters for which there exists only relatively inadequate sampling theories. Correlation coefficients corrected for attenuation, for example, have estimated standard error formulas, but the form of the sampling distribution is not known. Investigators using such coefficients often employ the sampling theory for the Pearson product-moment correlation coefficient (Yamamoto, 1965). The inadequacies associated with this procedure have been discussed recently by Forsyth and Feldt (1969). A fairly similar situation prevails when an experimenter wishes to estimate the correlation between variables X and Y for some population but complete data are available for only a select, nonrandom sample of subjects. In a common example, subjects have been selected on the basis of their X-scores and data for the Y-variable are available only for this restricted group. This situation would occur if a college entrance test was utilized as a selection device and it was desired to estimate the correlation between the test and freshman grade point average for a population of students represented by all those who applied for admission. In this case, data on the test are available for both groups, but grade point data

Journal ArticleDOI
TL;DR: In this paper, it has been suggested that the grades the students expect may influence their evaluations of the instructors' teaching performances, and that the students may not be completely objective observers.
Abstract: THERE has been some concern about the validity of student ratings of their instructors’ teaching performances because the students may not be completely objective observers. It has been suggested that the grades the students expect may influence their evaluations of the instructors. Unfortunately, the research on this question (e.g., Anikeef, 1953; Heilman and Armentraut, 1936; Hudelson, 1951; Riley, Ryan, and Lifshitz, 1950; Voeks and French, 1960; Weaver, 1960) has been relatively meager, productive of conflicting conclusions, uninformative because different elements of the evaluation were not considered separately, and in

Journal ArticleDOI
Frank Baker
TL;DR: Erikson examines the growth of the personality in terms of a series of eight successive stages, each a potential crisis whose problems can be solved in one of two polar directions, such as trust versus mistrust.
Abstract: ERIKSON examines the growth of the personality in terms of a series of eight successive stages, "predetermined in the human organism’s readiness to be aware of, and to interact with a widening social radius" (Erikson, 1959, p. 52). As the developing individual encounters different aspects of the social environment, each step becomes a potential crisis because of the attendant radical change in perspective. The problems of each stage can be solved in one of two polar directions which lead to the development of a series of alternative basic senses or attitudes: (a) trust versus mistrust, (b) autonomy versus shame and doubt, (c) initiative versus guilt, (d) industry versus inferiority, (e) identity versus identity diffusion, (f) intimacy versus isolation, (g) generativity versus self-absorption, and (h) integrity versus disgust and despair. The psychosocial quality of each basic attitude becomes more differenti-

Journal ArticleDOI
TL;DR: Guttman and Schlesinger (1967) recommended that item distractors be systematically constructed to increase the information yield of test devices, and identified at least three desirable additional benefits that can be derived from distractors.
Abstract: GUTTMAN and Schlesinger (1967) recommended that item distractors be systematically constructed to increase the information yield of test devices. In describing the typical purpose of distractors Guttman and Schlesinger wrote: "Keeping the correct answer company is usually regarded as the only function of distractors, and sufficient attraction is deemed sufficient qualification for being a good distractor" (p. 569). At least three desirable additional benefits can be derived from distractors, according to Guttman and Schlesinger: 1. Successful prediction of relative empirical difficulties of distractors

Journal ArticleDOI
TL;DR: This article asks what assumptions underlie derivations of various indices of error of measurement and such coefficients of reliability as KR-20 and KR-21, noting that relationships between the various derivations remain somewhat obscure.
Abstract: WHAT are the assumptions underlying derivations of various indices of error of measurement and such coefficients of reliability as KR-20 and KR-21? Over the years there has been a great variety of instructive discussion relating to questions of this kind (cf. Cronbach, 1951; Cronbach, Rajaratnam and Gleser, 1963; Gulliksen, 1950; Henrysson, 1959; Hoyt, 1941; Kuder and Richardson, 1937; Lord 1955a; 1955b; 1957; 1959a; 1959b; 1962; Novick, 1966; Novick and Lewis, 1967; Penfield, 1967; Winer, 1962 and others). Recent articles by Lord have been particularly helpful in indicating the minimal assumptions under which a reliability coefficient may be obtained and in showing the basis upon which one would need to justify use of one standard error of measurement (SEM) for all scores. Yet it is true that interesting and important relationships between various derivations still remain somewhat obscure and the implications in use of various ways of estimating a standard error of measurement are by no

Journal ArticleDOI
TL;DR: As samples increase in size, it is more and more likely that any given null hypothesis will be rejected at a statistically significant level; many authors have discussed the implications of this fact and have severely criticized the continuing use by psychologists of the statistical reject-accept hypothesis testing model.
Abstract: As samples increase in size, it is more and more likely that any given null hypothesis will be rejected at a statistically significant level. Many authors have discussed the implications of this fact (Baker, 1966; Lykken, 1968; Nunnally, 1960; Rozeboom, 1960) and have severely criticized the continuing use by psychologists of the statistical reject-accept hypothesis testing model. The essence of their criticisms is that when the model is used by psychologists (e.g., by applying F or t statistics to examine mean differences), large samples are likely to yield "significant" statistics even when

Journal ArticleDOI
TL;DR: This study describes the comparative validities of the Graduate Record Examination, the Miller Analogies Test, and the National Teacher Examinations for academic prediction in graduate education.
Abstract: THE literature of academic prediction in graduate education is filled with studies demonstrating correlations of various selection devices and grade point averages. The favorite devices are of course the Graduate Record Examination and the Miller Analogies Test. Results using these tests have certainly been unpredictable with reported validities ranging from .08 for the GRE-Verbal (Newman, 1968) to .47 for the GRE-Quantitative (Law, 1960). MAT validities ranging from .00 (Travers, 1948) to .69 (Gustad, 1950) have also been noted. Studies have yielded more supportive evidence for the MAT, with validities in the mid .20’s and .30’s (e.g., Payne and Tuttle, 1966). Those concerned with selection in colleges of education will, however, find few readily available data on the potentially applicable and useful National Teacher Examinations. To describe comparative validities between these three instruments was one of the purposes of the present investigation. In addition, since most predictive studies tend to focus on masters or doctoral students,


Journal ArticleDOI
K.H. Lu
TL;DR: It has been shown that the Kuder-Richardson formula 20 and Hoyt’s method are algebraic equivalents.
Abstract: scores. Numerous articles have been written on the concept of reliability and the ways of estimation. The three typical techniques of estimation are (1) the Kuder-Richardson formula 20 (Kuder and Richardson, 1937), (2) the Hoyt analysis of variance method (Hoyt, 1941), and (3) the Rulon split half method (Rulon, 1939). It has been shown that the Kuder-Richardson formula 20 and Hoyt’s method are algebraic equivalents. The reliabilities of many tests have been computed by these methods since their introduction some 30 years ago. Unfortunately, from the theoretical point of view, each of these methods suffers from a certain amount of impurity. Consequently, these methods arrive at the correct estimates only when the impurities are accidentally absent from the data. For instance, Hoyt and Krishnaiah (1960) investigated an analysis of variance model where the item-subject interaction was assumed absent. Under vari-
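As a point of reference for the first of the three techniques named in this abstract, the Kuder-Richardson formula 20 is commonly written as KR-20 = k/(k - 1) * (1 - sum(p_j * q_j) / var(X)), where k is the number of items, p_j is the proportion passing item j, q_j = 1 - p_j, and var(X) is the variance of total scores. The sketch below is a minimal illustration, not the author's method: the function name `kr20`, the 0/1 score matrix layout, and the use of the population variance (dividing by the number of persons) are assumptions of this example.

```python
from typing import Sequence


def kr20(scores: Sequence[Sequence[int]]) -> float:
    """KR-20 for a persons-by-items matrix of 0/1 item scores.

    Assumption of this sketch: the total-score variance is the population
    variance (divided by the number of persons, not by one less).
    """
    n_persons = len(scores)
    if n_persons < 2:
        raise ValueError("need at least two persons")
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("need at least two items")

    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / n_persons
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_persons
    if var_total == 0:
        raise ValueError("total scores have zero variance")

    # Sum of item variances p_j * q_j for dichotomous items.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in scores) / n_persons
        sum_pq += p * (1.0 - p)

    return n_items / (n_items - 1) * (1.0 - sum_pq / var_total)


# Hypothetical 4-person, 3-item data set, used only to show the arithmetic.
data = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(kr20(data))
```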

Journal ArticleDOI
TL;DR: In this article, a correlational analysis of SVIB scales with the CMT and the D-48 was carried out; separate CMT part-scores were also used because, for these subjects, Analogies is more highly correlated with the D-48 than is Vocabulary (Welsh, 1969).
Abstract: a correlational analysis of SVIB scales with the CMT and the D-48. In addition to CMT total-score, separate part-scores for the two sections, Vocabulary (Synonyms-Antonyms) and Analogies, were also utilized since it has been found that for these subjects Analogies is more highly correlated with the D-48 than is Vocabulary (Welsh, 1969). From the SVIB 55 regular vocational and nonvocational scales were examined plus four newly developed scales. These special scales resulted from an item analysis of subgroups of the gifted adolescents scoring relatively high or low on a figure preference test art scale, often used as an index of creative potential (Welsh, 1959), conjointly with high or low scores on the CMT.

Journal ArticleDOI
TL;DR: In this article, the psychometric characteristics of three critical thinking tests were investigated by determining their item difficulty and discrimination indices, reliability coefficients, item validities, and basic dimensions. The tests were administered to students in a junior level educational psychology course at Wisconsin State University, Oshkosh.
Abstract: Introduction. The purpose of this study was to investigate the psychometric characteristics of three critical thinking tests by determining their item difficulty and discrimination indices, reliability coefficients, item validities, and basic dimensions. A Test of Critical Thinking Form G (Form G) (American Council on Education, 1951); the Cornell Critical Thinking Test Form Z (Form Z) (Ennis, 1961); and the Watson-Glaser Critical Thinking Appraisal Form ZM (Form ZM) (Watson and Glaser, 1964) were the tests used. The tests were administered to students in a junior level educational psychology course at Wisconsin State University, Oshkosh in May, 1967. The numbers of subjects for the different analyses ranged from 190 to 227. Form G has 52 items, Form Z has 52 items, and Form ZM has 100 items.

Journal ArticleDOI
TL;DR: When more than one group is assigned at random to each treatment, the group can be used as the unit of analysis and the differential treatment effects tested using the variation among group means within treatments; in practice there is frequently only one group per treatment, and individuals are used as the unit of analysis, for example in the analysis of covariance using initial status as the covariate.
Abstract: measurements on the same instrument at the beginning and at the end of the experiment, it being known that the effect is appropriately indicated by changes in the group means. True experimental design in such a situation would require that more than one group be assigned at random to each treatment. The group could then be used as the unit of analysis and the differential treatment effects could be tested using the variation among group means within treatments. In practice, however, there is frequently only one group per treatment and the individuals are used as the unit of analysis. Since the treatments are randomly assigned, either the analysis of covariance using initial status as the covariate (this would be "usage 2" mentioned by Evans and Anastasio, 1968) or a two factor


Journal ArticleDOI
TL;DR: Vaught (1965) gave a particularly clear description of the technical difficulties an experimenter must surmount with the RFT, including the need to blindfold subjects between trials in order to prevent their establishing visual cues of uprightness.
Abstract: more portable, less expensive instrument with which to assess the individual’s position with regard to being field dependent (FD) (Witkin, Dyk, Faterson, Goodenough, and Karp, 1962). A particularly clear description of the technical difficulties an experimenter must surmount was given by Vaught (1965) in which he described the need to blindfold subjects between trials of the RFT in order to prevent their establishing visual cues of uprightness while the experimenter

Journal ArticleDOI
TL;DR: The basic reason for using true-false test items is that they provide a simple and direct means of measuring the essential outcome of formal education, which is command of useful verbal knowledge; all knowledge can be expressed in a series of propositions, and a proposition is simply a sentence that can be said to be true or false.
Abstract: Why Use True-False Items? The basic reason for using true-false test items is that they provide a simple and direct means of measuring the essential outcome of formal education, which is command of useful verbal knowledge. For all knowledge can be expressed in a series of propositions, and a proposition is simply a sentence that can be said to be true or false. Propositions are the substance of knowledge. Judging their truth or falsity is the essential task of scholarship in any field. Some test constructors obtain scores of respectable reliability from true-false classroom tests. Examples are given in Table 1. These reliabilities indicate that the items in these tests were not seriously ambiguous, and that guessing could not have been extensive. If an ambiguous true-false item is written, it is the fault of the writer, not of the form. Also, when guessing affects test scores seriously, it is because the test is too short, the items too difficult or too ambiguous or the examinees too poorly motivated. Compared with other item forms, true-false test items are relatively easy to write. They are simple declarative sentences of the