Showing papers in "Educational and Psychological Measurement in 1951"


Journal ArticleDOI
TL;DR: The term “cross-validation” is often loosely applied to any one of several distinct, though closely related, experimental designs; the paper identifies each of these, points out their similarities and differences, and makes clear the objectives each serves.
Abstract: The term “cross-validation” is often loosely applied to any one of several distinct, though closely related, experimental designs. Before we get lost in a swamp of semantic confusion, it may be well to identify each of these, to point out their similarities and differences and to make clear the objectives which it serves. What name we give to each is secondary, although for convenience I shall attach a different name to each.

308 citations


Journal ArticleDOI
TL;DR: In the present article the possibility of such increase will be established, and the way in which it varies with the number of assignments, the per-
Abstract: a situation to develop a single composite measure having the highest average validity for all jobs or to develop a battery of tests with differential weighting for each job. In choosing between these two alternative approaches, it is usually recognized that differential weighting is likely to result in higher validities, on the average, than those obtained when the same weighted composite is employed for all jobs. It is not so generally recognized that, even though little increase in validity is obtained by differential weighting, efficiency of selection can be very materially increased in this way. In the present article the possibility of such increase will be established, and the way in which it varies with the number of assignments, the per-

45 citations


Journal ArticleDOI
TL;DR: Psychologists have expended only a minor amount of effort investigating the potential usefulness of the rating-scale approach to the appraisal of social status, a neglect probably due to the immense popularity of the partial-rank-order scale developed most extensively in this country by Moreno (6) and his associates.
Abstract: Only a minor amount of effort has been expended by psychologists to investigate the potential usefulness of the rating-scale approach to the appraisal of social status. This apparent neglect of the rating method is probably due to the immense popularity of the partial-rank-order scale developed most extensively in this country by Moreno (6) and his associates. The partial-rank-order instrument is relatively simple to administer in that it requires no special preparation in its elementary form and entails a minimum of tabulation of results. This approach typically involves an individual’s selecting three or four preferred associates for one or more activities. It may also in its less-widely adopted form solicit information on the

44 citations


Journal ArticleDOI
TL;DR: The existence of consistent, stable interest patterns in older adolescents and adults has been demonstrated conclusively by Strong (4), Kuder (3), and others, but the question of how these interests develop in the individual has not yet been answered.
Abstract: One of the most striking ways in which individuals have been shown to differ from one another is in their likes, dislikes and preferences. The existence of consistent, stable interest patterns in older adolescents and adults has been demonstrated conclusively by Strong (4), Kuder (3) and others. The question of how these interests develop in the individual has not yet been answered. At what age can definite groupings of likes and dislikes first be identified? What are these groupings in the

37 citations


Journal ArticleDOI
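TL;DR: The technician has developed test-construction techniques somewhat different from those commonly adopted by the scientist; rather than contrasting them as the empirical versus the rational, or in terms of another dualism of philosophy, the paper refers to them simply as the technique of the technician and that of the scientist.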
Abstract: worker has had important effects on the methodology of test construction. The technician has developed techniques of his own which are somewhat different from those commonly adopted by the scientist. While one is tempted to contrast these techniques as the empirical versus the rational, or in terms of one of the other dualisms of philosophy, it seems better for the purpose of this discussion to refer to them simply as that of the technician and that of the scientist. In the construction of tests, the scientist and the technician

36 citations


Journal ArticleDOI
TL;DR: The existence of at least two spatial abilities has been shown in several factorial analyses completed by workers in the psychological research units of the AAF (3) and in other analyses reported by Fruchter and by the writers.
Abstract: The existence of at least two spatial abilities has been shown in several factorial analyses completed by workers in the psychological research units of the AAF (3) and in other analyses reported by Fruchter (1) and by the writers (10, 11, 18). However, in most of these studies samples consisting of adult males have been employed, and some degree of selection has been present. Thus, most samples have consisted of either college students, who in most instances had attained at least a specified minimum score on an aptitude test, or of aviation cadets, who

33 citations


Journal ArticleDOI
TL;DR: Earlier work (8) indicates that the level of the testee's intelligence is not a significant factor in determining how successful his efforts at biasing will be; it has not yet been demonstrated in real-life situations that such biasing actually occurs, or that it occurs successfully on every test that might be amenable to it.
Abstract: that the level of the testee’s intelligence, at least when the entire sample is average or better, is not a significant factor in determining how successful his efforts at biasing will be (8). Even though the ability to bias results of these types of tests is admitted, their widespread use in situations that might seem to encourage such behavior has continued unabated. The reason for this is simple. It has not, as yet, been demonstrated in real life situations that such biasing actually occurs, or that it occurs successfully on each and every one of the tests which might be amenable to such behavior, or that when it does occur

33 citations


Journal ArticleDOI
TL;DR: The more traditional methods of converting test scores into scale values have marked shortcomings: standard scores involve both positive and negative scale values, which are sometimes a little difficult to manipulate and must generally then be placed in a frequency distribution, while McCall's T-scale overcomes only the difficulty of negative values.
Abstract: Investigators concerned with the evaluation of test data have found that the more traditional methods of converting test scores into scale values have some rather marked shortcomings. The conversion of raw scores into standard scores involves the use of both positive and negative scale values (sometimes a little difficult to manipulate) and generally these values must then be placed in a frequency distribution. McCall’s T-scale overcomes the difficulty of using negative values, but does

31 citations


Journal ArticleDOI
TL;DR: In this paper, the authors consider the open book final examination as a more practical means of achieving a final coordination and testing of the work presented in a course and consider the various uses for examinations.
Abstract: more accurate manner. However, there has been less research on the more basic problem of evaluating the type of examination best fitting the needs of the course. It is the general practice to consider all types of examination with the same objective in mind: has the student learned the material presented? Has he memorized certain salient facts? It is the purpose of this paper to consider the open book final examination as a more practical means of achieving a final coordination and testing of the work presented in a course. It may be well to consider the various uses for examinations. Teachers who have taught for any length of time are very much aware of the stimulus value derived from giving a short ten-minute test every week or every other week. Such a test

29 citations


Journal ArticleDOI
TL;DR: Two part scores are sometimes available for a test when the parts are not of equal length: for example, scores from the front and back of an answer sheet, from a first and second page or group of pages of a test booklet, or from separately timed parts with unequal time limits.
Abstract: It sometimes happens that two-part scores are available for a test when the parts are not of equal length. This can happen in a number of ways. For example, the two scores may be those from the front and back of an answer sheet, respectively, or from a first and second page, or group of pages, of a test booklet. Or, the scores may be from separately timed parts of a test where unequal time limits were used.

26 citations


Journal ArticleDOI
TL;DR: The validity of a test may be viewed in terms of the composite validity of its items, and it is common practice in constructing psychological tests to determine the validity of each item, that is, the item's ability to make discriminations in an external criterion.
Abstract: The validity of a test may be viewed in terms of the composite validity of its items. It is, therefore, common practice in constructing psychological tests to determine the validity of each item, that is, the item’s ability to make discriminations in an external criterion. When this is done, the test constructor is then faced with essentially the same two-fold problem raised in the other papers of this symposium: (1) What weights shall be ascribed to the various predictors (in this case, of course, the items), and (2) what is the best estimate of the validity of the battery (in this case, the total test)? But there are two important respects in which item-analysis data differ from other prediction batteries: 1. The number of predictors is much larger than in other

Journal ArticleDOI
TL;DR: Cautioning prospective test users to ask “Where are the validity data?” has, crude as the procedure is, been mainly beneficial, but it has placed a strong emphasis on obtaining empirical validity coefficients; empirical validity data are certainly a very desirable type of information to have.
Abstract: use by the test-using public. On the whole, crude as this procedure is, its results have been mainly beneficial. Its use has, however, placed a strong emphasis on the obtaining of empirical validity coefficients. The textbooks and the instructors have cautioned the prospective user to be wary and to ask, “Where are the validity data?” Empirical validity data are certainly a very desirable type of information to have. In quest of these data a lot of technical

Journal ArticleDOI
TL;DR: This paper compares the results obtained through the use of three sociometric techniques commonly employed in assessing pupil status in the classroom.
Abstract: In the past few years the use of sociometric devices for bringing to light and analyzing pupil status and interrelationships in the classroom situation has become increasingly popular. Present-day emphasis in educational circles upon the development of well-rounded, emotionally stable personalities and the almost coequal stress upon the formation of sound intergroup relationships has accelerated a trend which had its inception shortly after the appearance of Moreno’s Who Shall Survive? In view of the wide variations in such characteristics as chronological age, mental ability, ethnic composition, and socio-economic status in almost every school, it is not surprising to find that several methods seeking to determine pupil status have been evolved since the publication of Moreno’s early approach. This paper presents a comparison of the results obtained through the use of three such techniques commonly employed in assessing pupil status in the classroom.

Journal ArticleDOI
G.A. Ferguson
TL;DR: Split-half and Kuder-Richardson coefficients, although commonly referred to as reliability coefficients, are essentially coefficients of internal consistency; under given conditions the Kuder-Richardson coefficient has a value of unity where no inconsistencies are found in the answer pattern.
Abstract: I. The Relationship between Split-half and Kuder-Richardson Reliability Coefficients. Both split-half and Kuder-Richardson coefficients (1), although commonly referred to as reliability coefficients, are essentially coefficients of internal consistency; that is, they serve as indices of the amount of inconsistency present in an answer pattern obtained by administering a test of k items to a sample of n individuals. The Kuder-Richardson coefficient will, under given conditions, have a value of unity where no inconsistencies are found in the answer pattern, and an expected value of zero when the responses of individuals on the items

Journal ArticleDOI
TL;DR: Least-squares weights maximize the multiple correlation in the validation sample by giving optimal weights to everything that contributes to prediction in that sample, including its sampling errors; hence the least-squares process over-fits, fitting the errors as well as the systematic trends in the data.
Abstract: weights to use in predicting the criterion scores of the validation-sample subjects. Their use maximizes the multiple correlation in that particular sample. It does so by giving optimal weights to everything which, in that sample, will contribute to prediction. “Everything” includes, however, the sampling errors in the validation-sample data. Hence the least-squares process over-fits; it fits the errors as well as the systematic trends in the data.
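
The over-fitting described in this abstract can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not the article's own procedure: the data are simulated, the sample size, number of predictors, and true weights are arbitrary, and the names `multiple_r` and `fit_and_cross_validate` are hypothetical. It fits least-squares weights in one sample and then applies the same weights, unchanged, to a fresh sample drawn from the same population.

```python
import numpy as np


def multiple_r(weights, X, y):
    """Correlation between the weighted composite and the criterion."""
    design = np.column_stack([np.ones(len(y)), X])  # intercept column + predictors
    return float(np.corrcoef(design @ weights, y)[0, 1])


def fit_and_cross_validate(n_subjects=40, n_predictors=10, seed=0):
    """Fit least-squares weights in one sample, then apply them to a second.

    All settings are illustrative assumptions: only the first predictor
    carries real signal; the other nine are pure noise.
    """
    rng = np.random.default_rng(seed)
    true_weights = np.zeros(n_predictors)
    true_weights[0] = 0.5

    def draw_sample():
        X = rng.standard_normal((n_subjects, n_predictors))
        y = X @ true_weights + rng.standard_normal(n_subjects)
        return X, y

    X_val, y_val = draw_sample()  # sample used to derive the weights
    X_new, y_new = draw_sample()  # independent sample from the same population

    design = np.column_stack([np.ones(n_subjects), X_val])
    weights, *_ = np.linalg.lstsq(design, y_val, rcond=None)

    r_in_sample = multiple_r(weights, X_val, y_val)
    r_fresh_sample = multiple_r(weights, X_new, y_new)
    return r_in_sample, r_fresh_sample


if __name__ == "__main__":
    results = np.array([fit_and_cross_validate(seed=s) for s in range(200)])
    print("mean R in the fitting sample:", results[:, 0].mean().round(3))
    print("mean r in a fresh sample:    ", results[:, 1].mean().round(3))
```

Averaged over many replications, the correlation in the fitting sample should exceed the correlation obtained when the same weights are carried to a fresh sample, because the weights have partly been fitted to sampling error.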

Journal ArticleDOI
TL;DR: Two approaches may be employed to study responses to self-report type personality tests: the first compares test results with ratings, clinical diagnoses, and the like; the second has subjects attempt to simulate or slant scores into a specified category or wanted direction and then determines how successfully this has been done.
Abstract: ditioned by several fairly obvious factors. Among these are, one, his knowledge of and insight into his own behavior; two, his desire to be truthful; three, his insight into the implications of the questions; four, the subtlety of the instrument; and five, the intelligence, maturity, sex and other characteristics of the subject. Two approaches to the study of this problem may be employed. The first is to compare results of self-report type personality tests with ratings, clinical diagnoses, and the like. This technique does not seem to have been much used. The second approach is to have subjects attempt to simulate or slant scores into a specified category or in a wanted direction and then to determine how successfully this has been done.

Journal ArticleDOI
TL;DR: One hypothesis tested was that the individuals who actually talked the most would not necessarily be perceived by others to have talked the most, that is, that time spent talking, measured objectively, would have a far-less-than-perfect correlation with the group's estimate of who talked the most.
Abstract: by time spent talking, was indicative of an observer’s rating of that individual. Another hypothesis to be tested was that those individuals who actually talked the most would not be perceived necessarily by others to have talked the most or that the time individuals spent talking measured objectively would have a far-less-than-perfect correlation with the group’s estimate of who talked the most. It was hypothesized that the group would estimate

Journal ArticleDOI
TL;DR: Inference can be developed along two different lines of emphasis: the first is concerned largely with the problem of correction for the fitting of error, while the second is concerned principally with the nature of the sampling involved.
Abstract: inference can be developed along two different lines of emphasis. The first of these is concerned largely with the problem of correction for the fitting of error, while the second is concerned principally with the nature of the sampling involved. If the correction for the fitting-of-error aspect is emphasized, the two approaches appear as competitors and we must attempt to answer the question, “Which is the best method to determine

Journal ArticleDOI
TL;DR: The problems of differential prediction may be thought of as falling into three broad, interrelated categories (methodological, statistical, and practical), whose chief advantage is that they organize the problems into a more coherent framework for discussion.
Abstract: military classification officer who asks, “To what specialty shall I assign this soldier?” The problems of differential prediction may be thought of as falling into three broad categories: methodological, statistical, and practical. This is not to say that these categories are distinct; far from it. They are inevitably interrelated. It would, in fact, be difficult to find a problem which falls exclusively into one of the categories. The chief advantage of the categories is that they serve to organize the problems into a more coherent framework for discussion.

Journal ArticleDOI
TL;DR: Thirty-two graduating seniors of the commerce college of a large midwestern state university, all interested in applying for sales positions with an insurance company and/or an office-machine concern, signed up voluntarily as members of one of four groups of eight, labeled Groups A, B, C, and D.
Abstract: two were graduating seniors of the commerce college of a large midwestern state university, all of whom were interested in applying for sales positions with an insurance company and/or an office-machine concern. The 32 men signed up voluntarily to be members of one of four groups of eight commerce college seniors. These four “sales” groups were labeled Groups A, B, C, and D, respectively.

Journal ArticleDOI
TL;DR: The AGCT was designed to measure verbal comprehension (with vocabulary), quantitative reasoning (with arithmetic word problems), and spatial thinking (with block-counting), and an attempt was made to minimize the effect of schooling.
Abstract: vised edition of the Examiner Manual for the Army General Classification Test, published in November, 1948 (5). D. E. Super, in his new volume on vocational tests, also carefully reviews published studies and evaluates the test in an eight-page report (7: 124-132). These adequate reviews will not be repeated here. The AGCT was designed to measure not only verbal comprehension but also quantitative reasoning and spatial thinking. Verbal comprehension was measured with vocabulary, quantitative reasoning with arithmetic word problems, and spatial thinking with block-counting. The attempt was made to minimize the effect of schooling and to obtain a measure of

Journal ArticleDOI
TL;DR: Occupational interest scores of 345 college students agree with the occupation engaged in twenty years later to the extent of 86 per cent of the possible maximum.
Abstract: Occupational interest scores of 345 college students agree with the occupation engaged in twenty years later to the extent of 86 per cent of the possible maximum. For the 230 men who did not change their occupation the agreement amounts to 91 per cent of the maximum; for the 115 men who have changed their occupations the agreement amounts to 77 per cent of the maximum. Data on two groups of students are considered here. The first group consists of 285 seniors at Stanford University in 1927. They filled out the Vocational Interest Blank at that time and 218 of them did so again in 1949. The second group consists of 306 freshmen who completed the test blank in 1930, of whom

Journal ArticleDOI
TL;DR: The paper briefly reviews the requirements of fundamental measurement, discusses some difficulties in applying this model to mental testing and their consequences for measurement practice, and suggests that mental-test methods be judged by their practical validity for the purposes at hand in addition to their conformity to the pattern of fundamental measurement.
Abstract: cussed, and an interpretation of the position of psychological measurement with respect to these requirements was offered. In the present paper, some of the general problems of psychological measurement will be discussed as they apply to the mental-test field. A brief review of the requirements of fundamental measurement will be given, together with a discussion of some difficulties in applying this model to mental testing. Some of the consequences of these difficulties for measurement practice will be mentioned and, finally, some suggestions regarding criteria for evaluating mental-test methods will be made which depart from the customary criteria of conformity to the pattern of fundamental measurement. The point of view will be expressed that the excellence of measurement methods in mental testing may be judged by the practical validity of those methods for the purposes at hand, in addition to comparing them with the model of measurement in the physical sciences. Reasons for giving greater emphasis to the former criterion will be offered.

Journal ArticleDOI
TL;DR: Answer sheets were filled out at random, some by die (an odd die was entered as a false answer) and twenty-six from a table of random numbers (even numbers counted as true, odd numbers as false), and product-moment correlation coefficients were computed for the 50 cases on all scales of the MMPI except the “Cannot Say” category.
Abstract: an odd die was entered as a false answer. Twenty-six answer sheets were filled out by using a table of random numbers. Even numbers were considered a true answer and odd numbers were considered false. One answer sheet was scored where all items were answered true and one where all items were answered false. Product-moment correlation coefficients were computed for the 50 cases on all scales of the MMPI, except the “Cannot Say” category.

Journal ArticleDOI
Sam C. Webb
TL;DR: When this investigation was conducted, the two most commonly used methods of scale construction were the method of equal appearing intervals developed by Thurstone and Chave (9) and the method of summated ratings developed by Likert (3).
Abstract: subjects of botany, chemistry, geology, physics, psychology, and zoology at the college level. When this investigation was conducted, the two most commonly used methods of scale construction were the method of equal appearing intervals developed by Thurstone and Chave (9) and the method of summated ratings developed by Likert (3). These techniques have been described in detail elsewhere (3, 9) and are only briefly summarized here.

Journal ArticleDOI
TL;DR: Long-term consistency of measurement concerns how well present measurement predicts a person's rating on a variable in the future, that is, whether relative ability in the characteristic measured is independent of the many factors which may affect an individual.
Abstract: I. Long Term Consistency of Measurement. Assuming that a characteristic or trait is worth measuring in the first place, a question that may be asked is: “How well does present measurement predict a person’s rating on this variable in the future?” In effect, this question seeks to determine whether relative ability, in the characteristic measured, is independent of the many factors which may affect an indi-

Journal ArticleDOI
TL;DR: A subcommittee of the American College Personnel Association Research Committee organized a co-operative study of freshmen entering in the fall of 1949 to investigate the assumption that the separate Q and L scores of the widely used ACE Psychological Examination provide for differential prediction of subjects loosely labelled as primarily quantitative or primarily linguistic.
Abstract: In the spring of 1950 Dr. Kate Mueller, Chairman of the American College Personnel Association Research Committee, designated a subcommittee with the assignment of arousing interest in a co-operative research project on a problem of general concern. The authors of this paper constituted the subcommittee, Ralph Berdie being chairman, and the title of the paper indicates the problem which was to be studied. Announcement was made of the project with the result that 36 people inquired for information and were sent outlines of the study. Of this group, 20 individuals agreed to participate, and progress reports indicated the initiation of some activity. Reports ultimately were received from 17 people, but only 13 were in a form usable for the report. These 13 reports included four from state universities, four from private universities, two from state colleges, two from state teachers colleges and one from a technical institute. The list of individuals involved in the study is entirely too long for mention here, but their contributions certainly merit acknowledgment and praise. The project decided upon grew out of the feeling of the subcommittee that the ACE Psychological Examination was very widely used and that its separate Q and L scores were almost as widely used, with the assumption that they provided for differential prediction of subjects loosely labelled as primarily quantitative or primarily linguistic. It seemed desirable to collect data to investigate this assumption, not for the first time, but on a more extensive basis. It was decided that entering freshmen for the fall of 1949 should be studied, and that the Q, L, and T scores should be related to certain subjects and groups of subjects commonly studied in the freshman year. English, Mathematics, Physics, Chemistry, Biological Science, Social Science, Foreign Language, Music, Art and total grade-point

Journal ArticleDOI
TL;DR: Dean Eshbach reported that the increase in fees had discouraged a number of institutions from participating in the project and that some schools, among them the College of Engineering at the University of Utah, had turned to other tests and high-school marks to select students for engineering training.
Abstract: self-supporting financially and that fees for the program had been increased in 1948 for that reason. And, finally, Dean Eshbach reported that the increase in fees had discouraged a number of institutions from participating in the project and that some schools had turned to other tests and high-school marks to select students for engineering training. The College of Engineering at the University of Utah is one of these schools. In the Spring of 1948, the engineering faculty decided that the cost of administering the Pre-Engineering In-

Journal ArticleDOI
TL;DR: In the last half of the century, with the rapid expansion of faculties, the attendant influx of laymen researchers, and the importation of continental ideals of intellectualism, the out-of-class needs of students, social relations in particular, came to be neglected and the individual student became lost.
Abstract: was being made available to unprecedented numbers of students, there was, according to Rugg (32), a noticeable neglect of one important aspect of student life: social relations. Cowley (12) has traced one source of this neglect. Especially during the years preceding the Civil War, the majority of the staff members of institutions of higher learning were clergymen who considered the spiritual welfare of their students to be as much their concern as intellectual development. While this situation prevailed, many of the out-of-class needs of the students were dealt with in one way or another. However, with the rapid expansion of faculties in the last half of the century, with the attendant influx of laymen researchers, with the importation of continental ideals of intellectualism, the individual student became lost except insofar as he was able to assimilate the class-

Journal ArticleDOI
TL;DR: Earlier retest studies are reviewed: Flory (2) concluded in 1940, from retesting students at Lawrence College with the American Council Psychological Examination, that “there is a real improvement in intellectual ability during the college years,” and Hunter (4) reported in 1942 that students at Converse College had made significant gains on a retest of the same examination.
Abstract: on a repeated test of the Ohio State University Psychological Examination. In 1940, Flory (2) concluded in a report on retesting students with the American Council Psychological Examination at Lawrence College that “there is a real improvement in intellectual ability during the college years.” Hunter (4) reported in 1942 that students at Converse College had made significant gains on the re-test of the American Council Psychological Examination. Stalnaker and Stalnaker (7) reported that students in Stanford University made substantial