
Showing papers in "Journal of Educational Measurement in 1975"


Journal ArticleDOI
TL;DR: The assumption that examinees either know the answer to a test item or else choose among all alternative responses at random, often made in discussions of formula scoring, is indefensible because examinees typically have partial information about an item.
Abstract: In discussions of formula scoring, the following assertion is sometimes made (for example, see Diamond & Evans, 1973, p. 181): Formula scoring is based on the assumption that examinees either know the answer to a test item or else choose among all alternative responses at random. Other discussions in the literature (for example, Davis, 1959; Lord & Novick, 1968, chapt. 14; Thorndike, 1971) often suggest but usually do not explicitly make the same assertion. The asserted assumption is, of course, indefensible. Typically, examinees have some partial information about an item. For most multiple-choice items, they very likely can rule out one or more of the alternative responses with greater or lesser assurance. It is very difficult to be content with any kind of scoring based on an assumption of random selection.
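The assumption criticized here is the one that justifies the classical correction-for-guessing formula, score = R - W/(k - 1). A minimal sketch of that standard formula (not code from the paper; the function name and example values are illustrative):

```python
def formula_score(num_right: int, num_wrong: int, num_options: int) -> float:
    """Classical correction-for-guessing score: R - W / (k - 1).

    Assumes every wrong answer comes from random guessing among all k options,
    which is exactly the assumption the paper calls indefensible.
    """
    return num_right - num_wrong / (num_options - 1)

# Example: 40 right, 10 wrong, 5 omitted on a 55-item test with 5 options per item.
print(formula_score(40, 10, 5))  # 37.5
```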

113 citations


Journal ArticleDOI
TL;DR: This article briefly reviews construct-validation methodology and applies some of that methodology to the problem of validating construct interpretations of measures of cognitive structure.
Abstract: The purposes of this paper are to briefly review construct-validation methodology and to apply some of this methodology to the problem of validating construct interpretations of measures of cognitive structure. A review of construct-validation methodology seemed warranted in view of the paucity of studies concerned with this topic, especially in the recent literature on human learning and memory (e.g., Adams, 1967; Anderson & Bower, 1973; Ausubel, 1963, 1968; Frijda, 1972; Johnson, P., 1967, 1969; Mayer & Greeno, 1972; Neisser, 1967; Norman, 1969, 1970; Tulving & Donaldson, 1972).

103 citations



Journal ArticleDOI
TL;DR: In this paper, the authors present procedures for estimating the appropriate number of subjects to include in educational experiments designed to detect differences among two or more treatments, including designs that test directional and nondirectional planned comparisons.
Abstract: The purpose of this paper is to present some new procedures for estimating the appropriate (or at least, the approximate) number of subjects to include in educational experiments designed to detect differences among two or more treatments. The major appeal of these procedures is essentially threefold: (a) they do not require an estimate of the unknown within-treatment variance, previously acknowledged as a difficult task for most researchers (e.g., Feldt, 1973), (b) they consist of a straightforward extension (to multi-treatment and multi-factor designs) of a concept proposed for the case in which only two treatments are compared (e.g., Cohen, 1969), and (c) they are adaptable to the testing of "planned comparisons" (cf. Hays, 1973)-both directional and nondirectional-which in certain conditions has distinct economic advantages. In addition, because of the great versatility of the basic approach, it is applicable not only to situations wherein subjects are randomly assigned to treatments (completely randomized designs), but to those wherein subjects are matched or "blocked" on some relevant control variable prior to assignment (randomized block designs), when subjects serve as their own controls (repeated measures designs), when a control variable is employed as a "covariate" in the design, or when the appropriate "experimental unit" consists of something other than individual subjects (as when groups of students or classrooms are simultaneously administered a single treatment). Due to space limitations, however, only a brief discussion of the basic approach, along with an example illustrating its usage, will be included here. To set the stage, suppose that a researcher is interested in determining whether there are performance differences associated with K independently administered treatments. Under the assumptions of random sampling and, especially, random assignment of subjects to treatments, a one-way analysis of variance (as an extension of the two-sample t test) might be conducted to assess whether the observed mean differences in performance exceed what would have been expected on the basis of chance. Should such an F test prove statistically significant, the researcher might then wish to identify which of the K treatment means differ from one another. Thus, a multiple comparison procedure such as that of Tukey or Scheffé (cf. Kirk, 1968) would generally be selected in order to determine which means or combinations of means were responsible for rejecting the hypothesis of no differences. While previous approaches to the determination of "sample size" have generally been framed with regard to the probability of rejecting the no-difference hypothesis per se, the present approach is framed with regard to the probability of detecting
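The effect-size-based power calculation alluded to (Cohen, 1969) can be illustrated with a short sketch. This is not the authors' procedure, only a generic one-way ANOVA sample-size calculation using Cohen's effect size f; the values of k, f, and alpha are illustrative:

```python
from scipy.stats import f as f_dist, ncf

def anova_power(n_per_group: int, k_groups: int, effect_f: float, alpha: float = 0.05) -> float:
    """Power of a one-way ANOVA F test for k groups of size n,
    given Cohen's effect size f (noncentrality = f^2 * total N)."""
    df_num = k_groups - 1
    df_den = k_groups * (n_per_group - 1)
    noncentrality = effect_f ** 2 * k_groups * n_per_group
    f_crit = f_dist.ppf(1 - alpha, df_num, df_den)
    return 1 - ncf.cdf(f_crit, df_num, df_den, noncentrality)

def n_for_power(k_groups: int, effect_f: float, target: float = 0.80) -> int:
    """Smallest per-group n whose power reaches the target."""
    n = 2
    while anova_power(n, k_groups, effect_f) < target:
        n += 1
    return n

# Example: 3 treatments, a medium effect (f = 0.25), 80% power.
print(n_for_power(3, 0.25))  # roughly 52 subjects per group
```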

78 citations


Journal ArticleDOI
TL;DR: In this article, Tversky has considered tests which can be represented as a sequence of n choice points with a alternatives per choice point and found that the use of three alternatives per point was sufficient to maximize three criteria: the power of a test, defined as one minus the probability of getting a perfect score by chance, the discrimination capacity of the test, defining as the number of possible response patterns the test can distinguish between; and the uncertainty index, a measure of the information gained from using the test.
Abstract: Tversky (1964) has considered tests which can be represented as a sequence of n choice points with a alternatives per choice point. With the total number of alternatives fixed at some constant number c (c = na), the use of three alternatives per choice point was found to maximize three criteria: the power of a test, defined as one minus the probability of getting a perfect score by chance; the discrimination capacity of the test, defined as the number of possible response patterns the test can distinguish between; and the uncertainty index, a measure of the information gained from using the test. A simple example is shown in Table 1, for the case c = 12. Note that fixing c (called the size of the test) requires that test length increase to compensate for the reduced number of alternatives per question. The discrimination capacity of each possible combination of n and a is presented in the table for comparison.
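The trade-off is easy to reproduce. A small sketch (not from the paper; it simply recomputes the kind of comparison the referenced Table 1 makes) enumerates the factorizations of c = 12 and the discrimination capacity a^n of each format:

```python
# For a fixed test "size" c = n * a, compare formats with n items of a options each.
c = 12
for a in range(2, c + 1):
    if c % a == 0:
        n = c // a
        capacity = a ** n            # number of distinguishable response patterns
        p_perfect = (1 / a) ** n     # chance of a perfect score by blind guessing
        print(f"n={n:2d} items, a={a:2d} options: capacity={capacity:4d}, "
              f"power={1 - p_perfect:.4f}")
# a = 3 (with n = 4) gives the largest capacity (81) among these formats.
```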

69 citations


Journal ArticleDOI
TL;DR: In this paper, the authors consider the phenomenon of adaptation level (Helson, 1947, 1948), which concerns the "anchoring" effects of background stimuli upon the perception of focal stimuli.
Abstract: College grading is a deeper topic than it at first appears. It can be investigated and described on several levels. On one level, studying college grading is equivalent to studying the behavior of college instructors. Such an investigation would focus on the input factors or antecedents (e.g., student ability levels, work habits, etc.) that influence grading as well as the characteristics of the persons assigning grades. Another level at which college grading can be investigated concerns the consequences of grading practices. These consequences can be studied for their effects on individuals or on aggregates. The latter approach would assess the systemwide effects of grading practices upon the whole institution, including student enrollments, major field choice, and faculty hiring. The authors believe that the antecedents and consequences of college grading are inextricably tied together by a personal characteristic of college instructors. This characteristic is so pervasive among college instructors (and perhaps people in general) as to be considered an almost inevitable factor in the college grading process. The characteristic to which we refer is the phenomenon of adaptation level (Helson, 1947, 1948). Adaptation level, briefly, concerns the "anchoring" effects of background stimuli upon the perception of focal stimuli. This concept, originally developed to account for psychophysical data, can be translated into the language of college grading. When the performance of the individual student is considered as a focal stimulus, the performances of all other students in the class can be considered as background stimuli against which the individual's performance is judged. Thus, grading standards would be partly determined by the ability level of the student population. If, for example, the ability level declines without an accompanying decline in average grades, then grading standards have "fallen" (or become less "stringent").

68 citations



Journal ArticleDOI
TL;DR: The authors investigated the effect of the quality of papers preceding an essay response on the grade assigned to that response and determined whether the effect was in accordance with Adaptation Level (AL) theory.
Abstract: A generally held belief is that the position of a student's paper in a stack of papers to be graded will influence the grade which is assigned to that paper (Ross & Stanley, 1954, p. 194; Stalnaker, 1936, p. 41), especially if the examinee's responses are preceded by several excellent responses or several poor responses; an essay response graded after the scoring of several very good responses will be graded differently than if it were preceded by several poor papers. That the quality of the responses immediately preceding the response being scored is a factor in grading is little more than an assumption, albeit a logical one to make. Nevertheless, this assumption is given as one reason for employing a sorting method in scoring essay examinations (Ross & Stanley, 1954, p. 205). If one were to extend Helson's Adaptation Level (AL) theory (Helson, 1951; Helson, 1959; Helson, Dworkin, & Michels, 1956) to essay question response grading, a theoretical explanation of this assumed phenomenon may emerge. In accordance with AL theory, a grader who initially encounters a block of poor essay question responses should "adapt" to these responses and establish a high AL to quality of responses. Therefore, subsequent essay question responses should be perceived by the grader as being even better than they would have been perceived in another situation. The converse should be true when a grader initially encounters several very good essay question responses. The purpose of this study was to investigate the effect of the quality of papers preceding an essay response on the grade assigned to the response and to determine whether the effect was in accordance with AL theory. Specifically, the purpose was to investigate the effect of initial blocks of either very good or poor essay question responses on the grades assigned to subsequent essay responses.

47 citations


Journal ArticleDOI
TL;DR: Hakstian and Kansup conducted a large-scale study of the effects of differential option weighting on reliability and validity, examining two independent variables: (1) the manner in which examinees are instructed to respond, and (2) the manner in which responses obtained by some method are scored.
Abstract: The notion of going beyond simple right-wrong scoring of multiple-choice items to assess intermediate states of knowledge has received considerable attention in the last 10 years. The techniques proposed have involved differential weighting of the item response alternatives. The accumulated literature (to 1970) on differential weighting, of not only item alternatives but also item scores, has been well summarized by Wang and Stanley (1970). Motivated by the apparent continued optimism surrounding these techniques (see, e.g., Hambleton, Roberts, & Traub, 1970; Patnaik & Traub, 1973), the authors conducted a large-scale study of the effects, on reliability and validity, of differential option weighting. Two independent variables were examined: (1) the manner in which examinees are instructed to respond, and (2) the manner in which responses obtained by some method are scored. To investigate these two variables, a large sample was randomly divided into three experimental test-taking groups. The effects of different item-response instructions are dealt with in a forthcoming paper (Hakstian & Kansup, 1975). The effects of different scoring procedures, applied to the same set of responses, are of concern in the present paper. In the present study, one group of 346 subjects responded to multiple-choice items after being given conventional test-taking instructions, i.e., simply marking the correct answer. There have been attempts in the past to increase the reliability and validity of conventionally administered tests by scoring incorrect choices according to some a priori-determined degree of correctness each possesses, a scoring procedure we refer to as logical weighting. Somewhat similar is empirical weighting, in which the weights for each option of each item are determined by their contributions to the overall psychometric qualities of the test, for the particular examinee sample. Investigations of the latter-usually employing modifications of Guttman's (1941) procedure-have shown insubstantial increases in reliability and no increase in validity (Davis & Fifer, 1959; Hendrickson, 1971; Sabers & White, 1969). Some investigations of logical weighting have shown increases in reliability for logically-weighted scores (Nedelsky, 1954; Patnaik & Traub, 1973), whereas at least one study (Hambleton et al., 1970) has shown a decrease (although not statistically significant). None of these studies has demonstrated a statistically significant increase in validity. In the present study, the tests were scored both conventionally and by a logical weighting procedure. Internal consistency was compared for the two scoring procedures, as well as, unlike earlier
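As a concrete illustration of the contrast described above, here is a minimal sketch (not the authors' procedure; the weights and responses are invented) comparing conventional right-wrong scoring with logical option weighting, where each distractor carries an a priori degree-of-correctness weight:

```python
# Hypothetical 3-item test; option weights are a priori "logical" weights,
# with the keyed answer worth 1.0 and distractors worth partial credit.
logical_weights = [
    {"A": 1.0, "B": 0.5, "C": 0.0, "D": 0.0},
    {"A": 0.0, "B": 0.25, "C": 1.0, "D": 0.0},
    {"A": 0.0, "B": 0.0, "C": 0.5, "D": 1.0},
]
key = ["A", "C", "D"]
responses = ["A", "B", "C"]  # one examinee's answers

conventional = sum(1 for r, k in zip(responses, key) if r == k)
weighted = sum(w[r] for r, w in zip(responses, logical_weights))

print(conventional)  # 1    (only item 1 keyed correct)
print(weighted)      # 1.75 (partial credit for near-miss distractors)
```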

44 citations


Journal ArticleDOI
TL;DR: In this article, a psychometric examination of the responses of 68 subjects to a 102-item inventory which measures a predisposition to behave creatively was carried out, and nonmetric multidimensional scaling (Kruskal, 1964a, 1964b; Shepard, 1962a, 1962b) of inter-item correlations was used as an objective method for identifying the subscale structure underlying the test.
Abstract: The present paper summarizes the findings of a psychometric examination of the responses of 68 subjects to a 102-item inventory which measures a predisposition to behave creatively. More specifically, nonmetric multidimensional scaling (Kruskal, 1964a, 1964b; Shepard, 1962a, 1962b) of inter-item correlations was used as an objective method for identifying the subscale structure underlying the test. Nonmetric multidimensional scaling was used in lieu of factor analysis because the number of variables (102) exceeded the number of subjects (68) and also because nonmetric multidimensional results tend to involve few factors, since the method is not based upon the linearity assumptions of factor analysis. Structural analyses of this type have recently been applied to a number of different inventories and tests (Farley & Cohen, 1974; Karni & Levin, 1972; Napior, 1972; Rosenberg & Sedlak, 1972). Given a matrix of correlations (or other types of proximity measures such as distances) between each pair of n variables, nonmetric multidimensional scaling represents the variables as n points in a spatial configuration or "picture" so that positively correlated (similar) variables are close together and negatively correlated (dissimilar) variables are far apart. In other words, the inter-point distance between two variables in the spatial configuration is made to reflect the degree of input proximity between those variables. An index called "stress" (Kruskal, 1964a) indicates the extent of agreement between the rank order of input proximities and the rank order of interpoint distances. Stress values near zero are desirable since they indicate close agreement between the two rank orders while stress values near one are undesirable since they indicate little agreement between the two rank orders. Regarding the creativity inventory, a core problem in creativity research has been the development of tests which are easily administered and efficiently scored, yet are reasonably valid as predictors of real creative behavior. The present test, How Do You Think, Form B, attacks this prediction problem by assessing attitudes, motivations, interests, values, and other personality and biographical information.1 Our assumption is that creative individuals generally possess a certain constellation of personological traits which lead them to think and behave more creatively than the average person. We assume, for example, that independence, self-confidence, and nonconformity would contribute to innovativeness. Also, the creative person is assumed to be above average in spontaneity, willingness to take risks and make mistakes, playfulness, and sense of humor. He may also tend to be curious, to be attracted to the complex and mysterious, to be open to new ideas and experiences, to have artistic and aesthetic interests, and especially to be highly energetic. There is evidence from past research that highly creative people are fully aware of their unique abilities, habits, and experiences and can respond accurately to test items which simply ask if they are creative, inventive, or
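A minimal sketch of the kind of analysis described (nonmetric MDS of inter-item correlations). It uses scikit-learn rather than the Kruskal programs of the original study, and the correlation matrix is invented for illustration:

```python
import numpy as np
from sklearn.manifold import MDS

# Invented inter-item correlation matrix for five items.
R = np.array([
    [ 1.0,  0.6,  0.5, -0.2, -0.3],
    [ 0.6,  1.0,  0.4, -0.1, -0.2],
    [ 0.5,  0.4,  1.0,  0.0, -0.1],
    [-0.2, -0.1,  0.0,  1.0,  0.5],
    [-0.3, -0.2, -0.1,  0.5,  1.0],
])
D = 1.0 - R  # convert correlations to dissimilarities (zero diagonal)

mds = MDS(n_components=2, metric=False, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)

print(coords)       # positively correlated items land close together
print(mds.stress_)  # Kruskal-type stress; values near zero indicate good fit
```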

44 citations


Journal ArticleDOI
TL;DR: In a previous paper, Hambleton and Novick conceptualized a decision-theoretic formulation for several issues in criterion-referenced measurement and proposed a solution based on a Bayesian procedure given by Novick, Lewis, and Jackson (1973).
Abstract: In a previous paper, Hambleton and Novick (1973) conceptualized a decision-theoretic formulation for several issues in criterion-referenced measurement. Among the issues discussed was the important problem of allocating individuals to mastery states. These authors proposed a solution to the problem based on a Bayesian procedure given by Novick, Lewis, and Jackson (1973). More recently, Lewis, Wang, and Novick (1973) have developed a Bayesian procedure that is more appropriate in the context of criterion-referenced measurement. Based on this most recent
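To make the mastery-allocation problem concrete, here is a generic beta-binomial sketch, not the Lewis-Wang-Novick procedure itself; the prior, criterion, cutoff, and data are all invented for illustration:

```python
from scipy.stats import beta

def posterior_mastery_probability(correct: int, attempted: int,
                                  criterion: float = 0.8,
                                  prior_a: float = 1.0, prior_b: float = 1.0) -> float:
    """P(true proportion correct >= criterion | data) under a Beta prior."""
    post = beta(prior_a + correct, prior_b + attempted - correct)
    return 1.0 - post.cdf(criterion)

# Example: 17 of 20 criterion-referenced items correct, mastery cutoff 0.80.
p = posterior_mastery_probability(17, 20)
print(p)  # posterior probability that the examinee is a master
# A 0.5 decision threshold is arbitrary here; a loss-ratio threshold is more typical.
print("master" if p > 0.5 else "nonmaster")
```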

Journal ArticleDOI
TL;DR: In this article, the effects of item difficulty sequencing on performance and on post-state anxiety were investigated using a timed mathematics aptitude test, where Ss were randomly assigned to a random, easy-to-hard, or hard-to-easy item difficulty sequence group.
Abstract: Effects of item difficulty sequencing on performance and on post-state anxiety were investigated using a timed mathematics aptitude test. The Ss were randomly assigned to a random, easy-to-hard, or hard-to-easy item difficulty sequencing group. The hard-to-easy sequence group performance was significantly lower than that of either the random or easy-to-hard sequence groups. Though not statistically different, (1) the mathematics aptitude test scores of four achievement anxiety types grouped using the Achievement Anxiety Test, and (2) levels of state anxiety provoked by the three difficulty sequences were in the predicted direction.

Journal ArticleDOI
TL;DR: In this paper, the authors contribute to the conceptualization and measurement of some of the abilities involved in the task of working with people, drawing on Thorndike's distinction among abstract, mechanical, and social intelligence.
Abstract: In her recent presidential address to the American Association for Higher Education, K. Patricia Cross (1974) pointed out that present models of education overemphasize the narrow band of human abilities that enable people to perform academic tasks. She suggested a three-dimensional model that prepares people to work with data, work with things, and work with people. In Cross' view the task of education is to develop the student's ability to the point of excellence in one area and to "prepare him or her to live in today's world by developing at least minimum competence in the other two areas" (p. 4). If these suggestions are to be implemented by educators, they require means of conceptualizing and measuring the abilities involved in each of these three areas so that appropriate curricula may be developed. More than fifty years ago, E. L. Thorndike (1920) suggested that human intelligence is composed of three aspects: abstract, mechanical, and social intelligence (paralleling the work skills outlined by Cross). Although both abstract and mechanical intelligence have been successfully measured, early attempts to develop independent measures of social intelligence had not been successful. Both R. L. Thorndike (1936) and Woodrow (1939) found that tests designed to measure social intelligence were loaded on factors defined by verbal-ability tests. The present study is one contribution to the conceptualization and measurement of some of the abilities involved in the task of working with people. In our view, Thorndike's "social intelligence" and Cross's "work with people" are too general to be of practical value over and above naming the domain of interest. Observing people interacting with other people suggests that there are a number of different ways of being socially intelligent. As Argyle (1972) has noted, "Clearly most people are better at some social tasks than others .... There are, for example, people who are better at handling audiences, or committees, than at dealing with individuals or vice versa" (p. 77). Some people are astute in understanding or cognizing what others think and feel, but for reasons of timidity or poor social training do not behave well in social situations. Others, while not particularly perceptive of others' feelings and thoughts, are socially poised and well-informed. Some individuals, such as successful statesmen, can produce many different solutions to a social problem. One of the few theories of human intelligence that includes social intelligence abilities is Guilford's (1967) Structure of Intellect (SI) model. The Structure of Intellect postulates 120 different factors of intellectual ability organized along three dimensions:

Journal ArticleDOI
TL;DR: The authors found that 64 percent of the students believed that changing answers would tend to lower a student's total score, while 36 percent believed that changing answers would neither raise nor lower the total score.
Abstract: Few teachers include test-taking strategies as part of their curricula. Nevertheless, students do acquire strategies to use against their recurring enemy-the objective achievement test. One tactic of test-taking strategy deals with answer changing behavior; whether to change a response after deliberation, or not to change it, is the student's dilemma. Many students do change answers, but research indicates that students generally believe that changing answers is an unwise tactic (Foote & Belinky, 1972; Mathews, 1929). In a graduate level educational measurement course, the authors found that 64 percent of the students believed that changing answers would tend to lower a student's total score; 36 percent believed that changing answers would neither raise nor lower total score. None believed that changing answers would increase scores. Contrary to students' expectations, research studies report that most students gain in total score by changing answers (Bath, 1967; Jacobs, 1972; Reiling & Taylor, 1972). A small percentage of the students in the above mentioned studies did, however, decrease their total scores as a consequence of changing answers. The purpose of this study was to determine the relationship of sex, answer-changing incidence, and total score to net changes in total score resulting from changing answers, by examining the answer-changing behavior of a large number of graduate students responding to achievement test items.

Journal ArticleDOI
TL;DR: In this article, the authors address the validation of non-linear hierarchical task networks, in which a single task is often a prerequisite to two or more tasks, or two or more tasks are often immediately prerequisite to a single higher-level task, and note that Guttman Scalogram Analysis is constrained to defining only linear orders.
Abstract: The works of Gagne and his collaborators (Gagne, 1962, 1968; Gagne & Bassler, 1963; Gagne & Paradise, 1961; Gagne, Mayor, Garstens, & Paradise, 1962) represent the classic examples of research focused upon the specification and validation of hierarchical task networks. As an outgrowth of the interest in task analysis and programmed instruction which characterized the early 1960's, Gagne sought to define hierarchies of tasks which were prerequisite to the performance of various terminal objectives. These hierarchies were posited on the basis of rational analysis and sought to identify the order in which the tasks were learned. The early studies by Gagne and his associates have spawned numerous similar studies (e.g., Cox & Graham, 1966; Ford & Meyer, 1966; Kropp, Stoker, & Bashaw, 1966; Merrill, Barton, & Wood, 1970; Resnick, 1967; Walbesser, 1968; White, 1973). Crucial to the investigation of instructional hierarchies is the need to demonstrate that the hypothesized prerequisite relations among tasks in a hierarchy are confirmed by student learning data. To date, the methodological strategies used to validate hierarchical task networks have been limited by two factors. First, the hierarchies investigated generally have been non-linear in their patterns of prerequisite relationships. That is, the systems of prerequisite relations are such that a single task is often a prerequisite to two or more tasks, or, alternatively, two or more tasks are often immediately prerequisite to a single higher level task. Guttman Scalogram Analysis (Guttman, 1944, 1950) and its extensions (Lingoes, 1963), the most prevalently used methods for ordering tasks into a hierarchy, are constrained to defining only linear orders among tasks (Torgerson, 1958). Thus, Guttman-type methods cannot handle the complexities involved in validating non-linearly ordered task hierarchies (Wang, 1969). A second limitation of prior validation studies is more conceptual than methodological in nature. To date, validation studies have focused attention solely upon those prerequisite relationships posited a priori. Other, non-posited, potential prerequisite relationships among tasks in the hierarchy rarely have been subjected to analysis. To realize the richness inherent in the study of instructional hierarchies and to add rigor to hierarchy validation procedures, it is important to have methodologies which can generate the best fitting hierarchy from a data set independent of any a priori hypothesized hierarchy. Such methodologies would be both theory-generating and theory-confirming. They would be theory-generating in that they would permit data to be analyzed post hoc to determine whether and in what form prerequisite relations exist in the data. Such procedures would be especially helpful in those areas where a paucity of theory prevents a priori definition of hierarchies (Bart & Airasian, 1974). The methodologies would be theory-confirming since they could be used to define the best fitting hierarchical network among a set of tasks. This empirically derived network then could be compared to the hypothesized, a priori network to determine the correspondence between the two. Such a comparison of correspondence would afford a
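One simple way to check a hypothesized prerequisite relation against student learning data is to count response patterns that disconfirm it (passing the higher task while failing its supposed prerequisite). The sketch below is a generic illustration of that idea, not the methodology the paper goes on to propose; the data and tolerance threshold are invented:

```python
# Rows = students, columns = tasks; 1 = pass, 0 = fail (invented data).
data = [
    # task: A  B  C
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],   # violates "A is prerequisite to B"
    [1, 0, 0],
]
tasks = {"A": 0, "B": 1, "C": 2}

def violation_rate(prereq: str, target: str) -> float:
    """Proportion of students who pass `target` while failing `prereq`."""
    p, t = tasks[prereq], tasks[target]
    violations = sum(1 for row in data if row[t] == 1 and row[p] == 0)
    return violations / len(data)

# A hypothesized relation is retained if disconfirming patterns are rare.
for prereq, target in [("A", "B"), ("B", "C")]:
    rate = violation_rate(prereq, target)
    print(prereq, "->", target, f"violation rate = {rate:.2f}",
          "retained" if rate <= 0.05 else "questioned")
```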


Journal ArticleDOI
TL;DR: In this paper, the authors identify the components of high-inference ratings of instructors by correlating a high-inference rating of teacher effectiveness with ratings obtained on items reflecting more specific instructor attributes.
Abstract: Student ratings are one of the most frequently used methods for evaluating teacher effectiveness in colleges and universities. One drawback to the use of student ratings for the improvement of instruction is that the results obtained from them are often based upon rating items that are too general. It is thus difficult to make specific recommendations to faculty members who wish to improve their ratings. Rosenshine (1970) has referred to this type of measurement as "high-inference" measurement, because the rater has to make a number of inferences as to what constitutes "good" or "effective" performance. In contrast, Rosenshine used the term "low-inference" to describe questionnaire or rating items that focus upon specific and relatively objective characteristics. Results obtained on scales composed of low-inference rating items, therefore, facilitate making specific recommendations for improvement. The purpose of this study was to identify the components of high-inference ratings of instructors by correlating a high-inference rating of teacher effectiveness with ratings obtained on items reflecting more specific instructor attributes. Also, those more specific item ratings which were good predictors of a high-inference item were to be identified.

Journal ArticleDOI
TL;DR: Gilman and Ferry (1972) examined the reliability associated with the use of an answer-until-correct (AUC) method with graduate students of educational measurement and found that the AUC scores were substantially more reliable than the inferred-number-right (INR) scores.
Abstract: Effort has long been devoted to seeking more adequate measurement with multiplechoice tests than conventional right-wrong scoring affords. Confidence weighing of options constitutes one dimension of these attempts to bolster reliability and validity. A variant of these techniques involves securing an examinee's second, third, fourth, etc., choice of options only when he errs in his prior choice(s). To do this requires immediate feedback concerning the correctness of each response. Since Pressey (1926) invented the teaching-testing machine, numerous mechanical, graphic, electrical, and chemical means have been devised to give immediate feedback. The examinee may be told to continue selecting answers to a question until he is successful. One attraction of this procedure is that immediate feedback may promote learning. Another attraction is that it enables examinees to continue responding in a real-to-life fashion until feedback indicates success. Yet another rationale for testing with immediate feedback is the supposition that if examinees continue answering questions until they answer them correctly, the range of possible scores will be increased (Pressey, 1950), and hence reliability and validity may be improved. Gilman and Ferry (1972) examined the reliability associated with the use of an answer-until-correct (AUC) method with graduate students of educational measurement. Their procedures involved directing the examinee to respond to each multiplechoice question by erasing a carbon shield covering a feedback message. If the selected answer was correct on the first trial, the student had completed the item. Otherwise, he made another response, etc., until the feedback signified that the right answer had been selected. An AUC score was obtained for each student by subtracting the total number of responses (erasures) made in finding the correct answers from the total number of possible responses. An inferred number right (INR) score was obtained for each examinee by counting as correct those questions answered correctly on the first trial. Comparisons of the two sets of scores revealed that the AUC scores were substantially more reliable than the INR scores. Hanna (1974) suggested that part of the reliability gain realized by the AUC procedure may result from affective characteristics. For example, the immediate feedback inherent in AUC media may adversely affect the performance of some anxious examinees who happen to score poorly on initial items. If the internal consistency of cognitive tests were raised by consistent affective traits, then this incremental reliability would be obtained at the expense of construct validity, and consequently would be undesirable.
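The two scores described can be computed from the number of attempts each examinee needed per item. A small sketch of that bookkeeping (the data are invented; the formulas follow the description above):

```python
# Attempts needed to reach the keyed answer on each of five 4-option items
# for one examinee (1 = right on the first try).  Invented data.
attempts = [1, 3, 1, 2, 1]
options_per_item = 4

# Inferred-number-right: items answered correctly on the first trial.
inr = sum(1 for a in attempts if a == 1)

# Answer-until-correct: total possible responses minus responses actually made.
total_possible = options_per_item * len(attempts)
auc = total_possible - sum(attempts)

print(inr)  # 3
print(auc)  # 20 - 8 = 12
```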

Journal ArticleDOI
TL;DR: Guthrie (1954) reported correlations of .87 and .89 between overall rankings of instructors from one year to the next, which, if general, would easily make student ratings the most stable measure of teaching proficiency available.
Abstract: The high reliabilities of student ratings of instruction have been extensively documented, both for rating instruments as a whole, and for many of the individual items making up these instruments (e.g., Bausell & Magoon, 1972; Lovell & Haner, 1955; Spencer, 1968); their relatively high consistency across time has also been documented. Guthrie (1954) reported correlations of .87 and .89 between overall rankings of instructors from one year to the next, which if general, would easily make student ratings the most stable measure of teaching proficiency available. While reliability information of this sort is extremely important, there are equally important validation concerns which need to be addressed before ratings are used as measures of teaching proficiency by faculty and administrative consumers. Regardless of how stable student ratings are, for example, it must be recognized that they normally involve several different parameters, and that they occur under several different conditions. It is not only important to know (1) which parameters replicate across time and (2) the extent of these replications; it is also crucial to know (3) under what conditions the parameters replicate, in order to ascertain which are primarily due to consistent, idiosyncratic instructional behaviors (i.e. teaching proficiency), and which are due primarily to consistencies in instructional settings (and hence invalid for the assessment of teaching proficiency).

Journal ArticleDOI
TL;DR: There has been considerable research activity in recent years concerned with the issue of test bias (Cleary, 1968; Goldman & Richards, 1974; Kallingal, 1971; Stanley, 1971; Temp, 1971).
Abstract: There has been considerable research activity in recent years concerned with the issue of test bias (Cleary, 1968; Goldman & Richards, 1974; Kallingal, 1971; Stanley, 1971; Temp, 1971). Briefly, test bias may be manifest in two predominant forms. The first occurs when a predictor (or set of predictors) is invalid (or less valid) for certain subgroups. This form of test bias implies inaccurate prediction for some groups but not necessarily systematic underprediction or over-prediction. This type of test bias can be identified by comparison of the correlation coefficients (between predictor and criterion) for different subgroups. The second form of test bias occurs when a predictor test consistently underpredicts or over-predicts the criterion for a subgroup. As Cleary (1968, p. 115) concisely states, ". . . the test is biased if the criterion score predicted from the common regression line is consistently too high or too low for members of the subgroup." Both forms of test bias have been studied with reference to Black-White differences. Stanley and Porter (1967) found the Scholastic Aptitude Test (SAT) a nearly equally accurate predictor of college success (GPA) for Blacks as well as Whites in essentially segregated colleges. While the correlations of SAT with GPA were similar for Blacks and Whites, it is clear that the regression systems were quite different. Cleary (1968) found homogeneous regression systems for Blacks and Whites in three integrated colleges. Furthermore, regression intercepts differed in only one of the three colleges. However, it should be noted that the accuracy of GPA prediction appeared to be greater for Whites than for Blacks. Investigations by Temp (1971) and Kallingal (1971) found that regression systems for the prediction of GPA were nonparallel for Blacks and Whites in almost all colleges sampled. These investigators also found that GPA's for Blacks were either accurately predicted or overpredicted when using White-derived regression equations. While the aforementioned investigations are valuable, they do not provide direct information on the issue of test bias for other minority subgroups. Temp (1971, p. 247) states in a footnote, "Most investigations have dealt solely with black students and then the generalizations have been extrapolated to other 'minorities' (i.e. Mexican-Americans, the disadvantaged, low income females, etc.)." There are a number of good reasons why it is hazardous to generalize Black-White comparisons to other groups. Most obvious among the reasons is the issue of bilingualism. In addition, child-rearing patterns and values are not identical in these minority groups. Thus, an investigation of test bias should be conducted separately for every subgroup in question. Although Cleary's definition of test bias is probably most widely used by psychologists and educators, several alternative models of test bias have recently emerged (Cole, 1973; Einhorn & Bass, 1971; Thorndike, 1971). These three models, while distinct from each other, suggest that test bias may reside in the use of tests for decision making as well as in regression systems. Thorndike has noted that minority
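Cleary's definition can be made concrete with a short sketch: fit a common regression of the criterion on the predictor, then check whether residuals for a subgroup are systematically positive (underprediction) or negative (overprediction). This is a generic illustration with simulated data, not the analyses of the studies cited:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated predictor (test score) and criterion (GPA) for two groups.
n = 200
x_a = rng.normal(500, 80, n); gpa_a = 1.0 + 0.004 * x_a + rng.normal(0, 0.3, n)
x_b = rng.normal(450, 80, n); gpa_b = 0.7 + 0.004 * x_b + rng.normal(0, 0.3, n)

# Common regression line fitted to the pooled sample.
x = np.concatenate([x_a, x_b]); y = np.concatenate([gpa_a, gpa_b])
slope, intercept = np.polyfit(x, y, 1)

# Mean residual by group: positive = underprediction, negative = overprediction.
for label, xg, yg in [("group A", x_a, gpa_a), ("group B", x_b, gpa_b)]:
    mean_resid = np.mean(yg - (intercept + slope * xg))
    print(label, f"mean residual = {mean_resid:+.3f}")
# Under Cleary's definition, a consistently nonzero mean residual for a
# subgroup indicates bias in the common regression line for that group.
```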


Journal ArticleDOI
TL;DR: It has been suggested that effective use of the true-false item form requires special talent and that typical classroom teachers are unlikely to be able to use it effectively; the present study was designed to shed some light on this question.
Abstract: True-false test items are regarded with disfavor by some test specialists and test users. They are suspected of being often trivial or ambiguous and always susceptible to guessing. Ebel (1970) has argued that these faults are not inherent in the form, and need not seriously limit its usefulness. He has provided a rationale for the validity of the form in tests of educational achievement, and has shown that highly reliable test scores can be obtained from true-false tests. Others (Burmester & Olson, 1966; Frisbie, 1973) have provided evidence to support the belief that true-false items are not essentially different from multiple choice items, and that they can serve essentially the same purposes. However, it has been suggested that effective use of the form requires special talent, and that typical classroom teachers are unlikely to be able to use it effectively (Storey, 1966; Wesman, 1971). The present study was designed to shed some light on this question.

Journal ArticleDOI
TL;DR: This paper investigated the effects on test reliability and student performance of response sequencing that would be extremely unlikely under a random model, and found little empirical evidence that keyed response arrangement is related to test performance.
Abstract: with respect to keyed response position, is a desirable test characteristic. Nearly every basic educational measurement textbook devotes some discussion to item sequencing, usually recommending that the correct answer appear in each position about an equal number of times and that the items be arranged randomly. The theoretical rationale underlying such procedures is that if one were to choose the same option for each item of a test, he could not obtain a score beyond that of a chance score. Other reasons are to avoid providing test-takers with systematic devices which would enable them to "beat" the test and to establish a safeguard against an unconscious bias by the test constructor to allow the correct response to occur appreciably more often in one option position than in another. Although detailed methods for arranging items in proper sequence have been given (Anderson, 1952; Mosier and Price, 1945), there is little empirical evidence that keyed response arrangement is related to test performance. The purpose of this study was to investigate the effects on test reliability and student performance of response sequencing that would be extremely unlikely under a random model.
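The rationale that a same-position strategy should earn only a chance score is easy to simulate. A quick sketch (invented parameters, not from the study) compares a balanced key with a key biased toward one position:

```python
import random

random.seed(1)

def same_option_score(key: list[str], pick: str = "A") -> float:
    """Proportion correct for an examinee who marks `pick` on every item."""
    return sum(1 for k in key if k == pick) / len(key)

options = ["A", "B", "C", "D"]
n_items = 60

# Balanced key: each position is keyed correct an equal number of times.
balanced_key = options * (n_items // len(options))
random.shuffle(balanced_key)

# Biased key: the test constructor unconsciously favors position A.
biased_key = random.choices(options, weights=[0.55, 0.15, 0.15, 0.15], k=n_items)

print(same_option_score(balanced_key))  # exactly the chance level, 0.25
print(same_option_score(biased_key))    # well above chance: the test can be "beaten"
```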

Journal ArticleDOI
TL;DR: Studies such as Super, Braasch, and Shay (1947) and Standt (1948) examined the effects of distractions during test administration on test performance and found no significant differences in accuracy of performance.
Abstract: In a testing situation, factors other than the test itself may influence test performance. It is generally assumed that a quiet, distraction-free situation is necessary for testing. Test manuals and texts on the topic of test administration usually indicate that distraction-free situations promote better test performance than do distractive environments. Classroom teachers normally try to establish conditions as "ideal" as possible for testing. Research does not support the general assumption that distractions in the testing situation influence test scores. Probably the best known study of distractions' effects on test performance was conducted by Super, Braasch, and Shay (1947). The Minnesota Vocational Test for Clerical Workers and the Otis Quick Score Gamma Am were given to graduate students. One group was tested under "normal, quiet" conditions. Another group was subjected to such distractions as breaking pencils, arguments in the hall, apparent mistiming, and a poorly played trumpet. Analyses of test scores showed no differences in the performances of the two groups. Hovey (1928) and Standt (1948) also experimented with distractions in a college level test situation. Hovey found that with college sophomores, distractions such as noise, lights, music, whistles and stunt man performances did not affect performance on the Army Alpha Test. Standt tested college women with several measures which included verbal analogies, cancellation, addition, multiplication, and with the Otis Self-Administering Intelligence Test. Every 30 seconds a buzzer was sounded and subjects were told which problem to begin on next. While this study may have confounded distractions with instructions, the main interest was in the distraction condition. No significant differences were found in accuracy of performance. In a study using high school students, Ingle and DeAmico (1969) found that conditions in the testing room did not affect standardized achievement test scores. Insofar as poor lighting, poor writing surfaces and such can be considered distractive, distraction had no influence.





Journal ArticleDOI
TL;DR: In this paper, the effects of two levels of penalty for incorrect responses on two dependent variables (a measure of risk-taking or confidence, based on nonsense items, and the number of response attempts to legitimate items) were investigated for three treatment groups in a 2 x 3, multi-response repeated measures, multivariate ANOVA design.
Abstract: Investigated were the effects of two levels of penalty for incorrect responses on two dependent variables (a measure of risk-taking or confidence, based on nonsense items, and the number of response attempts to legitimate items) for three treatment groups in a 2 x 3, multi-response repeated measures, multivariate ANOVA design. Ss responded under one of three scoring-administrative rules: conventional Coombs-type directions and two variants suggested as mathematically more adequate. Results indicated significant differences both among groups and across conditions. The results were discussed with reference to the question of test validity in general, and the problems posed for criterion-referenced measurement. A number of alternative administrative and scoring procedures for objective tests have been suggested (e.g., Coombs, 1953; de Finetti, 1965; Ebel, 1965; Rippey, 1968) which have as their common objective a more adequate assessment of the degree of partial knowledge held by a given student with reference to a given item (see Echternacht, 1972, for a comprehensive description and review of a number of alternative testing procedures). A procedure known as "option-elimination" or "Coombs-type directions" (CTD) seems quite applicable to the typical classroom testing situation. With CTD, the student is required to identify as many of the j-1 distractors among the j item options as he or she is able. With the usual scoring rule, a student earns one point for each distractor so identified. A penalty of -(j-1) points is suffered if the correct answer is identified as a distractor. Item scores, then, can range from -(j-1) points to (j-1) points, taking 2(j-1)+1 possible values.
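The CTD item-scoring rule just described is simple to implement. A minimal sketch (invented item and responses; not code from the study):

```python
def ctd_item_score(eliminated: set[str], correct: str, options: set[str]) -> int:
    """Coombs-type directions scoring for one item with j options:
    +1 per distractor correctly eliminated, -(j-1) if the keyed
    answer is eliminated.  Scores range from -(j-1) to (j-1)."""
    j = len(options)
    if correct in eliminated:
        return -(j - 1)
    return len(eliminated & (options - {correct}))

options = {"A", "B", "C", "D"}

print(ctd_item_score({"B", "C"}, correct="A", options=options))       # 2: partial knowledge
print(ctd_item_score({"B", "C", "D"}, correct="A", options=options))  # 3: full knowledge
print(ctd_item_score({"A", "D"}, correct="A", options=options))       # -3: keyed answer eliminated
```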