
Showing papers in "Educational and Psychological Measurement in 1970"



Journal ArticleDOI
TL;DR: In content analysis, in the process of developing recording instructions, defining units of analysis, and operationalizing scales, the researcher requires more detailed information about the sources and kinds of unreliability; over-all measures of agreement do not provide such information readily.
Abstract: In content analysis, as in all situations in which unstructured observations play an important role, the reliability with which data are generated is of crucial importance. Low data reliability limits confidence in the validity of subsequent inferences, and the reliability of a population of data must be estimated from the agreement among many observers regarding a sample. The way reliability of data is assessed is not different in principle from the way the reliability of psychological tests is measured. However, in the process of developing recording instructions, defining units of analysis, and operationalizing scales, the researcher requires more detailed information about the sources and kinds of unreliability. Over-all measures of agreement do not provide such information readily. More specifically, the analyst of a recording instrument may wish to obtain:
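The contrast the abstract draws, between an over-all agreement figure and the more detailed diagnostics a researcher needs, can be illustrated with a small sketch. The codings below are invented, and Cohen's kappa stands in here for the general class of chance-corrected over-all agreement measures; it is not necessarily the coefficient this article develops.

```python
from collections import Counter

def agreement_and_kappa(coder_a, coder_b):
    """Raw percent agreement and chance-corrected kappa for two coders."""
    n = len(coder_a)
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n  # observed agreement
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    cats = set(coder_a) | set(coder_b)
    # Agreement expected by chance: product of the coders' marginal proportions.
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in cats)
    return p_o, (p_o - p_e) / (1 - p_e)

# Hypothetical codings of five text units into categories 'a' and 'b'.
p_o, kappa = agreement_and_kappa(list("aabba"), list("abbba"))
```

A single number like kappa says how much agreement exceeds chance, but, as the abstract notes, it does not by itself reveal which categories or units the disagreements come from.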

423 citations


Journal ArticleDOI
R. H. Finn
TL;DR: The most widely used method for estimating the reliability of ratings or judgments is the intraclass correlation, or some variation of it (Ebel, 1951), as mentioned in this paper; one is basically concerned with the consistency with which a group of judges place a given item (or items) into categories.
Abstract: In carrying out research, one is often confronted with the need to estimate the reliability with which a group of judges place stimuli into categories. Guttman (1946) presents methods for estimating the upper bound and lower bound reliabilities of such data. Essentially, these estimates are functions of the number of categories available and the proportion of instances with which the modal response is chosen. In his discussion of reliability, Guilford (1954) acknowledges the interest in reliability of categorical data and concludes that there has not been sufficient experience with approaches such as that proposed by Guttman to permit an evaluation. Probably the most generally applicable and widely used method for estimating the reliability of ratings or judgments is the intraclass correlation, or some variation of it (Ebel, 1951). Regarding the matter of categorical data, one is basically concerned with the consistency with which a group of judges place a given item (or items) into categories. As a measure of this kind of consistency, the intraclass correlation suffers from the limitation of requiring non-zero variance between items being rated in order to obtain a significant coefficient. It obviously follows that the
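The intraclass correlation the abstract refers to can be sketched from a one-way ANOVA decomposition. The ratings below are hypothetical, and this is the simple one-way random-effects variant (each target rated by the same number of judges), not necessarily the specific variation Ebel or Finn discusses.

```python
def icc_oneway(ratings):
    """One-way random-effects intraclass correlation.

    ratings: one row per target, k ratings per target.
    ICC = (MS_between - MS_within) / (MS_between + (k - 1) * MS_within)
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    means = [sum(row) / k for row in ratings]
    ms_between = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    ms_within = sum((x - m) ** 2
                    for row, m in zip(ratings, means) for x in row) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical: three stimuli, each rated by the same two judges.
icc = icc_oneway([[4, 5], [2, 3], [8, 9]])
```

Note how the coefficient depends on between-target variance (MS_between): if all targets received identical mean ratings it would collapse, which is exactly the limitation the abstract raises for categorical data.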

199 citations


Journal ArticleDOI
TL;DR: In this paper, the authors investigate a person's general expectations of the positive and negative reinforcing quality of others, together with the actual positive and negative reinforcing quality of a specific other, or target, with whom he might associate in a given situation.
Abstract: of (1) his general expectation of the positive reinforcing quality of others for himself, R1; (2) his general expectation of the negative reinforcing quality of others for himself, R2; (3) the actual positive reinforcing quality of a specific other with whom he might associate in a given situation, r1; and (4) the actual negative reinforcing quality of a specific other, or target, r2. Further, with more familiar targets whose actual reinforcing qualities are bet-

124 citations


Journal ArticleDOI
TL;DR: It is possible, though not easy, to administer, by computer, questionnaires calling for the respondent to generate and type sentences as discussed by the authors, and it is reasonable to assume that one advantage to be gained by the use of computers rather than conventional means to administer psychological tests is an increase in honesty or a reduction in response bias of answers to certain sensitive or ambiguous questions.
Abstract: also possible, though not easy, to administer, by computer, questionnaires calling for the respondent to generate and type sentences. In neither case would the computer be used efficiently unless there were enough respondents (or other tasks) to keep the machine from being either idle or congested unreasonably long. Using machines in such a time-sharing fashion trades certain desiderata for the increased cost due to the computer's having to keep track of the traffic. What are these desiderata? It is reasonable to suppose that one advantage to be gained by the use of computers rather than conventional means to administer psychological tests is an increase in honesty or a reduction in response bias of answers to certain sensitive or ambiguous questions. Response bias has been defined as the tendency to distort answers, consciously or unconsciously. The work of Rosenthal (1963) has persuasively demonstrated that the experimenter's or tester's ex-

71 citations


Journal ArticleDOI
Jacob Cohen
TL;DR: Cohen, 1969 as mentioned in this paper discussed the need for and relative neglect of statistical power analysis of the Neyman-Pearson (1928, 1933) type in the design and interpretation of research in the behavioral sciences.
Abstract: In the course of preparing a handbook for power analysis (Cohen, 1969), it became apparent that at the cost of (a) working with approximate rather than "exact" solutions, and (b) doing a modest amount of computing, many problems of statistical power analysis frequently encountered by hypothesis-testing behavioral scientists could be brought into a simple common framework. Past publications in this program have discussed (Cohen, 1965, pp. 95-101) and documented (Cohen, 1962) the need for and relative neglect of statistical power analysis of the Neyman-Pearson (1928, 1933) type in the design and interpretation of research in the behavioral sciences. This article and the more
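The "approximate rather than exact solutions" Cohen mentions can be illustrated with the usual normal approximation to the power of a two-sample t test. The effect size d = 0.5 and n = 64 per group below are merely illustrative values, not figures taken from the article.

```python
from math import sqrt
from statistics import NormalDist

def approx_power_two_sample(d, n, alpha=0.05):
    """Approximate power of a two-tailed, two-sample t test.

    Uses the normal approximation and ignores the negligible opposite-tail term.
    d: standardized mean difference (Cohen's d); n: per-group sample size.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)        # two-tailed critical value
    return z.cdf(d * sqrt(n / 2) - z_crit)

power = approx_power_two_sample(d=0.5, n=64)  # close to the conventional .80
```

The approximation is accurate to within a point or two of exact noncentral-t power for moderate n, which is the trade Cohen describes: slight inexactness for a single, simple framework.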

70 citations


Journal ArticleDOI
TL;DR: In this article, the authors define a trait as any potentially measureable attribute of the subject whose responses are observed, e.g., anxiety, neuroticism, morale, etc.
Abstract: IN order to delineate the problem considered in this paper, and the rationale upon which the solution is based, the following definitions mill be adhered to. A trait is defined as any potentially measureable attribute of the subject whose responses are observed, e.g., anxiety, neuroticism, morale, etc. A method refers to the means used to ob: -ewe or assess the trait, e.g. peer ratings, psychologist’s ratings, questionnaires, etc. The measure refers to the particular trait-method combination administered to a subject to obtain an observation. The multitrait-multimethod matrix is comprised of correlations between measures which use a number of methods to assess each member of the same set of traits. The procedure described in this paper is based on explicit models for multitrait-multimethod data. The formal models are related to models implicit in the classical Campbell and Fiske (1959) presentation. The procedure is based also on previous research conducted by Wolins (1964) and by Stanley (1961). Another

69 citations


Journal ArticleDOI
TL;DR: One of the most important, but still inadequately resolved issues pertinent to factor analytically oriented research is that of matching factors from two or more independent studies, as mentioned in this paper; the use of factor analytic methods in programmatic research has been severely limited by a lack of equally powerful rotational methods for establishing identity of concepts across a series of researches.
Abstract: One of the most important, but still inadequately resolved issues pertinent to factor analytically oriented research is that of matching factors from two or more independent studies. Powerful though the methods of factor analysis may be for purposes of organizing masses of data within the context of a single multivariate experiment, their use in comprehensive, programmatic research has been severely limited by a lack of equally powerful rotational methods for establishing identity of concepts, as evidenced by matched factors, across a series of researches.

63 citations


Journal ArticleDOI
TL;DR: This paper advances several considerations for the use and interpretation of the eta coefficient (η), or correlation ratio, as a postmortem measure of association for comparative experiments; there is a growing awareness among substantive researchers, as reflected by the increased use of ex post facto measures of association (Johnson and Gade, 1968; Bruning, 1968; Kennedy, 1969).
Abstract: This paper advances several considerations for the use and interpretation of the eta coefficient (η), or correlation ratio, as a postmortem measure of association for comparative experiments. There is a growing awareness among substantive researchers, as reflected by the increased use of ex post facto measures of association (Johnson and Gade, 1968; Bruning, 1968; Kennedy, 1969), that a meaningful indication of the strength of an effect is not provided by merely reporting a significant test statistic (i.e., a t or F, etc.) with its associated p value. As Bakan (1966), among others, has pointed out, the size of a test statistic, and thus the p value, is to a considerable extent a function of sample size. Since ultimately the null hypothesis can always be rejected, the test statistic simply implies that the observed effect is significant for a given n. The
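As a concrete instance of the strength-of-effect reporting the abstract argues for, eta squared is just the between-groups sum of squares as a proportion of the total. The three small groups below are made-up data for illustration.

```python
def eta_squared(groups):
    """Correlation ratio (eta squared) = SS_between / SS_total for k groups."""
    all_x = [x for g in groups for x in g]
    grand = sum(all_x) / len(all_x)
    ss_total = sum((x - grand) ** 2 for x in all_x)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    return ss_between / ss_total

eta_sq = eta_squared([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
```

Unlike a t or F statistic, this proportion does not grow with sample size alone, which is precisely the property that makes it useful alongside the p value.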

63 citations


Journal ArticleDOI
TL;DR: In this paper, the authors define a trait as any potentially observable attribute of the subject. A method refers to the means used to observe or assess the trait, and a measure is the trait-method combination used to obtain an observation on the subject; the observations yielded by a measure are treated as if they vary on an interval scale.
Abstract: The following definitions, taken from Thurstone (1947) and from Campbell and Fiske (1959), will be employed to characterize the measurement system considered in this paper. A subject is generally the person whose responses are observed. A trait is defined as any potentially observable attribute of the subject. A method refers to the means used to observe or assess the trait. A measure refers to the trait-method combination used to obtain an observation on the subject. The observations yielded by a measure are treated as if they vary on an interval scale. Consider, for example, the traits called manager human relations

60 citations


Journal ArticleDOI
TL;DR: Tversky as mentioned in this paper presented mathematical proof that, given a fixed number of alternatives on a multiple-choice type test, the use of three alternatives at each choice point will maximize the discrimination and power of the test.
Abstract: Tversky (1964) has presented mathematical proof that given a fixed number of alternatives on a multiple-choice type test, the use of three alternatives at each choice point will maximize the discrimination and power of the test. Empirical support for this proof could be of considerable practical importance, since in constructing multiple-choice achievement tests teachers commonly prefer items with more than three alternatives, in the expectation that by doing so their tests will be more discriminating (Ebel, 1965, p. 165). Unfortunately, very few empirical studies have been carried

Journal ArticleDOI
TL;DR: Rice compared the mean spelling score for one school with that for another, using a constant weight of two for any item; the question raised is whether variable item weights would have differentiated better among the schools.
Abstract: means complete yet. The first real "comparative" test used systematically with large numbers of students seems due to Joseph M. Rice (1897). It consisted of 50 dictated words to be spelled on paper. A pupil received two points if he spelled a given word correctly, and zero points if he spelled it incorrectly. This resulted directly in a percentage score that could range from zero through 100, a very common method of scoring to this day. Unlike most persons then, however, Rice did not focus his attention on some arbitrary normative scheme such as 90-100% = A, etc. Instead, he compared the mean spelling score for one school with that for another. Might he have differentiated better among the schools if he had used variable item weights, rather than the constant weight of two for any item? Perhaps the most

Journal ArticleDOI
TL;DR: The authors examine the degree to which measures of aptitude and undergraduate preparation obtained before the beginning of doctoral study predict the "success" of psychology graduate students; judgments of the overall success of each student were made six years after the beginning of graduate work, when all students in the research either had completed a Ph.D. or not.
Abstract: This study examines the degree to which measures of aptitude and undergraduate preparation obtained before the beginning of doctoral study are predictive of the "success" of psychology graduate students. Criterion measures were taken at two points in time. At the end of the first year of graduate study, the general progress and potential of each student were rated, and first-year course grades were obtained. Judgments of the overall success of each student were made six years after the beginning of graduate work, when all students in the research either had completed a Ph.D. or

Journal ArticleDOI
TL;DR: The authors survey the proliferation of factor analytic rotation procedures, which range from quartimax, maximizing the sum of fourth powers of factor loadings, to varimax, maximizing the variance of each factor's squared loadings, to maxplane, maximizing hyperplane count.
Abstract: procedures from which to choose. Rather, the present situation sees an extensive proliferation of various techniques all designed to reach essentially the same goal. Suggested procedures for analytic rotation vary from quartimax (Neuhaus and Wrigley, 1954), which maximizes the sum of fourth powers of factor loadings, to varimax (Kaiser, 1958), which maximizes the variance of the factor's squared loadings, to maxplane (Eber, 1966), which maximizes hyperplane count. Other procedures, such as covarimin (Kaiser, 1958), biquartimin (Carroll, 1957), direct oblimin (Jennrich and Sampson, 1966), proportional profiles (Cattell and Cattell, 1955), and promax (Hendrickson and White, 1964), have also been suggested. Differences in factor analytic technique can, of course, produce different results (e.g., Bechtoldt, 1961; Cattell, 1966). The proliferation of techniques is not just a theoretical issue but is also reflected in the use of factor analytic rotation procedures. In a brief scanning of APA journals most likely to publish factor analytic studies, some 40 papers were found which used factor analytic techniques. Although varimax was used in half of the cases,
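The two criteria named first in the abstract are simple functions of the loading matrix: quartimax's sum of fourth powers, and (raw) varimax's within-factor variance of squared loadings. The tiny loading matrix below is invented, and a real rotation program would of course search over orthogonal rotations to maximize these criteria rather than merely evaluate them once, as this sketch does.

```python
def quartimax_criterion(loadings):
    """Sum of fourth powers of all loadings (quartimax maximizes this)."""
    return sum(a ** 4 for row in loadings for a in row)

def varimax_criterion(loadings):
    """Sum over factors of the variance of squared loadings (raw varimax)."""
    p = len(loadings)                       # number of variables
    total = 0.0
    for j in range(len(loadings[0])):       # loop over factors
        sq = [loadings[i][j] ** 2 for i in range(p)]
        mean_sq = sum(sq) / p
        total += sum((s - mean_sq) ** 2 for s in sq) / p
    return total

A = [[0.8, 0.1], [0.2, 0.9]]                # hypothetical 2-variable, 2-factor loadings
q, v = quartimax_criterion(A), varimax_criterion(A)
```

Both criteria reward "simple structure" (loadings near 0 or near their maximum), which is why the many procedures the abstract lists can be described as different routes to essentially the same goal.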


Journal ArticleDOI
TL;DR: In this paper, the authors developed a large pool of items to measure the dimension of Evaluation Sensitivity, and appraised the degree to which this item pool would show adequate psychometric properties.
Abstract: The area of personality assessment has experienced very substantial advances during the past decade, both in terms of its conceptual development, and in terms of its increased use of measurement techniques. Nowhere have these changes become more apparent than in the area of personality test development, where a modern methodology has replaced the simplistic conceptions of an earlier day. Ad hoc procedures, like empirical item selection against external criteria, have been found inadequate as a means for approaching the measurement of personality traits as constructs (Loevinger, 1957). Increasingly, there has developed an appreciation of the value of sequential strategies for item selection in personality assessment. Multiple strategies for item selection have been required by virtue of the simultaneous demands for item representativeness and scale generalizability, substantive cogency, homogeneity, and freedom from response biases, as well as for convergent and discriminant validity. One important purpose of the present investigation was to develop a large pool of items to measure the dimension of Evaluation Sensitivity, and to appraise the degree to which this item pool would show adequate psychometric properties. The theoretical,


Journal ArticleDOI
TL;DR: In this paper, it is noted that the cosine of the angle between vectors X and Y equals Pearson's product-moment correlation coefficient between variables X and Y, i.e., cos θ_XY = r_XY, where θ_XY is the angle separating X and Y.
Abstract: of variable X may be considered to be coordinates on the n orthogonal axes of an n-dimensional space; thus, the observations of X may be regarded as corresponding to a vector, X, in n-space. Similarly, two vectors corresponding to the n observations on Y and Z may be established in the n-space. It is well known (see Anderson, 1958, pp. 49-50) that the cosine of the angle between vectors X and Y, say, equals Pearson's product-moment correlation coefficient between variables X and Y, i.e., cos θ_XY = r_XY, where θ_XY is the angle separating X and Y.
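The identity cos θ_XY = r_XY holds when the observations are expressed as deviations from their means, i.e., for the centered vectors that Anderson's treatment assumes. A quick numerical check on made-up data:

```python
from math import sqrt

def cosine_of_centered(x, y):
    """Cosine of the angle between the mean-centered versions of x and y."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [xi - mx for xi in x]
    dy = [yi - my for yi in y]
    dot = sum(a * b for a, b in zip(dx, dy))
    return dot / (sqrt(sum(a * a for a in dx)) * sqrt(sum(b * b for b in dy)))

def pearson_r(x, y):
    """Pearson product-moment correlation, computed from raw sums."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx, syy = sum(a * a for a in x), sum(b * b for b in y)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))

x, y = [1.0, 2.0, 4.0, 5.0], [1.0, 3.0, 3.0, 6.0]
# The two quantities agree to rounding error.
```

Geometrically, uncorrelated variables correspond to orthogonal centered vectors (cos 90° = 0), and perfectly correlated variables to collinear ones (cos 0° = 1).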

Journal ArticleDOI
TL;DR: In this article, the authors state two requirements implied by the purpose of vocational interest inventories: that an inventory be valid with respect to the criterion of job satisfaction, and that it be usable in practice.
Abstract: Vocational interest inventories are intended to help young people find the occupations which will bring them the greatest satisfaction in their work. This statement of purpose carries the implication of two requirements which must be met in the development of an inventory. One of these requirements is that an interest inventory must be valid with respect to the criterion of job satisfaction. The other is that the inventory must be usable in the

Journal ArticleDOI
TL;DR: Test-retest reliability as discussed by the authors refers to the stability of measurement across some time interval; individual characteristics contributing to unreliability include test-taking ability, response sets, response styles, and guessing habits.
Abstract: The concept of reliability of measurement is clearly not as simple and static as standard definitions often imply. Reliability is not an all-or-none criterion which, if once satisfied, is invariant for a given measuring instrument, for different groups, or for different testing conditions. Reliability may also be examined in relation to a given measure for a given individual, thus implying the relevance of examining specific individual factors contributing to unreliability. Unreliability, Thorndike's "error variance" (1951), can be seen as being composed of two classes of elements: (1) characteristics of the observer and the environment; and (2) characteristics of the individual. The first group is composed of such factors as poor testing conditions, careless investigators, inaccurate calculations and numerous other factors which are external to the individual being examined. Included in individual characteristics are aspects such as test-taking ability, response sets, response styles and guessing habits. Reliability of measurement implies more than consistency of response over a time interval. Rather, reliability can be discussed in two different frameworks: test-retest reliability (stability) and internal consistency reliability. Test-retest reliability refers to the stability of measurement across some time interval. Stability de-

Journal ArticleDOI
TL;DR: In this paper, the authors compared the three estimates of new sample R given by using Wherry's original formula (1931), a modified version (McNemar, 1962; Lord and Novick, 1968), and a formula by Lord (1950), with the actual calculated values of R from the original sample.
Abstract: one sample from a defined population, we are often interested in determining how accurate this same equation would be in predicting the same criterion variable for new samples from the same population. Several formulas have been developed for determining the amount of shrinkage to expect in the multiple correlation coefficient (R) when a regression equation from one sample is used with a new sample. This study compares the three estimates of new sample Rs given by using Wherry’s original formula (1931), a modified version (McNemar, 1962; Lord and Novick, 1968), and a formula by Lord (1950), with the actual calculated values of R from the
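The flavor of these shrinkage corrections can be shown with the familiar Wherry-type adjustment. Note that the literature contains variants differing in the denominator degrees of freedom (n − k versus n − k − 1), so the version below, and the sample values plugged into it, are illustrative rather than a reproduction of the three formulas the study compares.

```python
def wherry_adjusted_r2(r2, n, k):
    """Wherry-type shrunken estimate of the squared multiple correlation.

    r2: sample R squared; n: sample size; k: number of predictors.
    This variant uses n - k - 1, the form usually credited to
    McNemar and to Lord and Novick.
    """
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

adj = wherry_adjusted_r2(r2=0.50, n=50, k=4)  # shrinks below the sample value of .50
```

The adjustment grows as predictors are added or the sample shrinks, which is why a regression equation fitted in one sample predicts less well in a new sample from the same population.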

Journal ArticleDOI
TL;DR: The Personal Orientation Inventory (POI) as discussed by the authors was designed by Shostrom (1964) to provide a comprehensive measure of values and behaviors believed to be of importance in the development of self-actualization.
Abstract: The Personal Orientation Inventory (POI) designed by Shostrom (1964, 1966) attempts to provide a comprehensive measure of values and behaviors believed to be of importance in the development of self-actualization. The POI consists of 150 two-choice, paired-opposite statements of values. Scores are reported for two major scales and for ten secondary scales. Culbert, Clark, and Bobele (1968), Gibb (1968), Foulds (1969), Knapp (1965), and Lieb and Snyder (1967) demonstrated the value of the POI in measuring certain dimensions that could be of importance in

Journal ArticleDOI
TL;DR: Taylor as mentioned in this paper described a rationale and a method for constructing rating scales of clinical judgment, which allowed an assessment of unidimensionality for the variable being rated, as well as an estimate of expectable reliability between independent raters.
Abstract: AN earlier communication to this journal (Taylor, 1968a) described a rationale and a method for constructing rating scales of clinical judgment. The method allows an assessment of unidimensionality for the variable being rated, as well as an estimate of expectable reliability between independent raters. It results in a thermometer-like rating scale of up to 100 points, with particular points being anchored by brief case-history vignettes. In the work

Journal ArticleDOI
TL;DR: In this paper, it is shown that the effect of nonhomogeneous correlations between treatments is to introduce a positive bias in the F ratio when the correlations are positive but unequal, and a negative bias when the correlations are negative but unequal.
Abstract: Box (1954), Geisser and Greenhouse (1958) and others have shown that the effect of nonhomogeneous correlations between treatments is to introduce a positive bias in the F ratio when the correlations are positive but unequal and a negative bias when the correlations are negative but unequal. Various approaches have been suggested to contend with this bias; some are methodological, others focus on the choice of the statistical test, and some attempt to correct the biased F mathematically. In many instances, nonhomogeneous covariances result from such factors as sequence effects, carry-over effects or transfer of training which may occur during collection of data. Suggestions have been made to administer treatments randomly within sub-

Journal ArticleDOI
TL;DR: In this paper, the authors determine the best combination of test scores for predicting the achievement of boys in the first and third grades from a battery of tests given in the first grade.
Abstract: among elementary school children are attributed to intelligence, perceptual and motor skills, linguistic skills, social and emotional factors, and a number of other skills and characteristics. The purpose of the present study is to determine the best combination of test scores in predicting achievement of boys in the first and third grades from a battery of tests given in the first grade. The battery included a measure of intelligence, psycholinguistic abilities, reading-readiness skills, and visual-motor skills. To measure these variables the Wechsler Intelligence Scale for Children (WISC-Wechsler, 1949), the Illinois Test of Psycholinguistic Abilities (ITPA-McCarthy and Kirk, 1961), the Bender Visual Motor Gestalt Test (BG-Bender, 1938), and the Harrison-Stroud Reading Readiness Profiles (Harrison and Stroud, 1956) were used. The specific questions asked in this study

Journal ArticleDOI
TL;DR: A discussion of the methodology of matrix sampling, and of the empirical and theoretical research on matrix sampling, can be found in this paper, where the authors demonstrate the following points: 1. Matrix sampling can be viewed as a simple two-factor, random model analysis of variance design, the matrix sampling formulas for estimating the mean and variance being simply the point estimate formulas for estimating components of the underlying linear model.
Abstract: This paper, a discussion of the methodology of matrix sampling, and the empirical and theoretical research on matrix sampling, attempts to demonstrate the following points: 1. Matrix sampling can be viewed as a simple two-factor, random model analysis of variance design, the matrix sampling formulas for estimating the mean and variance being simply the point estimate formulas for estimating components of the underlying linear model. 2. These formulas can be based on the weakest possible set of assumptions, viz., random and independent sampling of examinees and items. No assumptions about the statistical nature of the data need be made. 3. The literature is unclear about what effect the above sampling assumptions have upon matrix sampling in the estimation of the mean and, especially, the variance. 4. Of the three alternative procedures suggested for dealing with negative variance estimates in multiple matrix sampling (equating the negative estimate to zero, Winsorizing the distribution of estimates, or treating all estimates alike regardless of sign), the third procedure appears to be most promising. A simulation study is necessary to determine the shape of the distribution of variance component estimates for matrix sampling as well as the relative efficiency of the three methods for handling negative estimates.

Journal ArticleDOI
TL;DR: In this paper, the Pearson product-moment correlation coefficient is directly related to the standard deviations (SDs) of the two variables being correlated, and a reduction in either or both of the SDs will also lower the correlation coefficient between the two variable.
Abstract: The magnitude of a Pearson product-moment correlation coefficient is directly related to the standard deviations (SDs) of the two variables being correlated. A reduction in either or both of the SDs will also lower the correlation coefficient between the two variables. If the unrestricted SDs are known, it has been possible in certain situations to estimate what the correlation would be without restriction. In psychological and educational testing, most of the recent test standardizations specify the SD upon which the test scores are based. Accordingly, one can usually determine whether the ranges of the tests being correlated are low by comparing the obtained SDs with the standardization SDs. Restriction usually results either from biased sampling procedures or from the dropping out of all scores below a specified cutoff value. This paper will consider only the latter situation. Where two variables are being correlated, the following three possible cases may occur. 1. Variable X is explicitly restricted and there is knowledge of the restricted correlation coefficient (r) and of the restricted (s) and unrestricted (
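For the explicit-selection case the abstract sets up (variable X restricted directly, with the restricted correlation and both SDs known), the classical Pearson-Thorndike correction estimates the unrestricted correlation. The numbers below are invented for illustration.

```python
from math import sqrt

def correct_for_restriction(r, sd_restricted, sd_unrestricted):
    """Estimate the unrestricted correlation when X was explicitly restricted.

    r: correlation observed in the restricted sample;
    sd_restricted, sd_unrestricted: SDs of X in the restricted sample and in
    the unrestricted (e.g., standardization) population.
    """
    ratio = sd_unrestricted / sd_restricted
    return (r * ratio) / sqrt(1.0 - r * r + r * r * ratio * ratio)

# E.g., r = .30 observed above a cutoff where the SD of X is half its
# unrestricted value; the corrected estimate is noticeably larger.
r_full = correct_for_restriction(r=0.30, sd_restricted=5.0, sd_unrestricted=10.0)
```

The correction illustrates the abstract's opening point in reverse: the smaller the restricted SD relative to the standardization SD, the more the observed correlation understates the unrestricted one.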

Journal ArticleDOI
TL;DR: In this article, the concept of matrix sampling is considered as one term for a class of operations called matrix sampling, i.e., sampling of items from a population, persons from the population (finite population sampling), or sampled matrices reflecting different combinations of persons and items.
Abstract: sampling&dquo; may be considered as one term for a class of operations called matrix sampling. Types of matrix sampling may include the sampling of items from a population, persons from a population (finite population sampling), or sampled matrices reflecting different combinations of persons and items. Osbum (1967) has recently shown a relationship of matrix or item sampling to the concept of unmatched data samples in the terminology of generalizability developed by Cronbach, Rajaratnam, and Gleser (1963). Studies dealing with item sampling have been reported by Cook

Journal ArticleDOI
TL;DR: In this paper, a randomized groups analysis of variance (ANOVA) or its nonparametric counterpart, the Kruskal-Wallis H test, is used to test the omnibus null hypothesis against both omnibus and more specific, partially ordered, alternatives.
Abstract: Very often a researcher employs a multiple group (G) design in which n observations are in each of k independent groups or samples. A randomized groups analysis of variance (ANOVA) or its nonparametric counterpart, the Kruskal-Wallis H test, is typically employed to test the null hypothesis: G1 = G2 = ... = Gk. Similarly, in the randomized blocks or matched groups design, ANOVA or the Friedman chi-square test might be employed. In all four of these cases the alternative hypothesis is the omnibus: Gi ≠ Gj for some i ≠ j. However, there are also specific alternatives which might be tested, for example, partially ordered hypotheses (G1 = G2 < ...
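The H statistic mentioned above is simple to compute when there are no tied observations. The three small samples below are hypothetical, and a real analysis would apply a ties correction and compare H against a chi-square reference distribution.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H for k independent groups, assuming no tied observations.

    H = 12 / (N(N+1)) * sum(R_i^2 / n_i) - 3(N+1),
    where R_i is the sum of pooled ranks in group i.
    """
    pooled = sorted((x, g) for g, grp in enumerate(groups) for x in grp)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    for rank, (_, g) in enumerate(pooled, start=1):
        rank_sums[g] += rank
    return 12.0 / (n * (n + 1)) * sum(
        rs ** 2 / len(grp) for rs, grp in zip(rank_sums, groups)
    ) - 3 * (n + 1)

h = kruskal_wallis_h([[1, 2], [3, 4], [5, 6]])  # maximally separated groups
```

Because H depends only on ranks, it tests the same omnibus null hypothesis as the randomized groups ANOVA without assuming normality.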

Journal ArticleDOI
TL;DR: In this article, a program written in TELCOMP, an interactive computer language, is presented for doing all the necessary computations for the Johnson-Neyman (1936) technique in the simplest case, starting from raw sums, sums of squares, and sums of cross-products, or from means, standard deviations, and correlations previously computed for the two groups.
Abstract: A program written in TELCOMP, an interactive-computer language, is presented for doing all the necessary computations for the Johnson-Neyman (1936) technique in the simplest case, starting from raw sums, sums of squares, and sums of cross-products or from means, standard deviations, and correlations previously computed for the two groups. Computations are shown for three sets of data, two of which are from previous expositions of the method; it is noted that these previous expositions suffered from computational errors of major seriousness. The third set of data illustrates use of the method for investigating interactions between individual differences and treatments in such a way as to supplement the usual analysis of covariance procedures.