
Showing papers in "Educational Measurement: Issues and Practice in 2009"


Journal ArticleDOI
TL;DR: This article found that teachers are better at drawing reasonable inferences about students' levels of understanding from assessment information than they are at deciding the next instructional steps. The authors discuss the implications of these results for effective formative assessment and end with considerations of how teachers can be supported to know what to teach next.
Abstract: Based on the results of a generalizability study of measures of teacher knowledge for teaching mathematics developed at the National Center for Research on Evaluation, Standards, and Student Testing at the University of California, Los Angeles, this article provides evidence that teachers are better at drawing reasonable inferences about student levels of understanding from assessment information than they are at deciding the next instructional steps. We discuss the implications of the results for effective formative assessment and end with considerations of how teachers can be supported to know what to teach next.

263 citations


Journal ArticleDOI
TL;DR: In this paper, student growth percentiles are introduced, supplying a normative description of growth that can accommodate criterion-referenced aims like those embedded within NCLB and, more importantly, extend possibilities for descriptive data use beyond the current high-stakes paradigm.
Abstract: Annual student achievement data derived from state assessment programs have led to widespread enthusiasm for statistical models suitable for longitudinal analysis. The current policy environment's adherence to high stakes accountability vis-a-vis No Child Left Behind (NCLB)'s universal proficiency mandate has fostered an impoverished view of what an examination of student growth can provide. To address this, student growth percentiles are introduced supplying a normative description of growth capable of accommodating criterion-referenced aims like those embedded within NCLB and, more importantly, extending possibilities for descriptive data use beyond the current high stakes paradigm.

226 citations
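The growth percentile idea described above conditions a student's current score on prior scores and reports where the student falls among the estimated conditional quantiles. The sketch below is a simplified, single-prior-year version using linear quantile regression; the operational methodology uses B-spline quantile regression over multiple prior years, and the `growth_percentile` helper, variable names, and scores are illustrative only.

```python
# Minimal sketch of a student growth percentile: fit quantile regressions of
# current-year on prior-year scores at percentiles 1-99, then report the highest
# conditional quantile the student's observed score meets or exceeds.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"prior": rng.normal(500, 50, 2000)})
df["current"] = 0.8 * df["prior"] + rng.normal(120, 30, 2000)  # fabricated scores

def growth_percentile(prior_score, current_score, data):
    """Percentile rank of current_score among conditional quantiles given prior_score."""
    sgp = 0
    for q in range(1, 100):
        fit = smf.quantreg("current ~ prior", data).fit(q=q / 100)
        predicted = float(fit.predict(pd.DataFrame({"prior": [prior_score]}))[0])
        if current_score >= predicted:
            sgp = q
    return sgp

print(growth_percentile(prior_score=520, current_score=540, data=df))
```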


Journal ArticleDOI
TL;DR: Interim assessments are driven by their purposes, which fall into the categories of instructional, evaluative, or predictive, as discussed by the authors; they can be an important piece of a comprehensive assessment system that includes formative, interim, and summative assessments.
Abstract: Local assessment systems are being marketed as formative, benchmark, predictive, and a host of other terms. Many so-called formative assessments are not at all similar to the types of assessments and strategies studied by Black and Wiliam (1998) but instead are interim assessments. In this article, we clarify the definition and uses of interim assessments and argue that they can be an important piece of a comprehensive assessment system that includes formative, interim, and summative assessments. Interim assessments are given on a larger scale than formative assessments, have less flexibility, and are aggregated to the school or district level to help inform policy. Interim assessments are driven by their purposes, which fall into the categories of instructional, evaluative, or predictive. Our intent is to provide a specific definition for these “interim assessments” and to develop a framework that district and state leaders can use to evaluate these systems for purchase or development. The discussion lays out some concerns with the current state of these assessments as well as hopes for future directions and suggestions for further research.

198 citations


Journal ArticleDOI
TL;DR: The construction of booklet designs is described as the task of allocating items to booklets under context-specific constraints in large-scale assessments of student achievement.
Abstract: In most large-scale assessments of student achievement, several broad content domains are tested. Because more items are needed to cover the content domains than can be presented in the limited testing time to each individual student, multiple test forms or booklets are utilized to distribute the items to the students. The construction of an appropriate booklet design is a complex and challenging endeavor that has far-reaching implications for data calibration and score reporting. This module describes the construction of booklet designs as the task of allocating items to booklets under context-specific constraints. Several types of experimental designs are presented that can be used as booklet designs. The theoretical properties and construction principles for each type of design are discussed and illustrated with examples. Finally, the evaluation of booklet designs is described and future directions for researching, teaching, and reporting on booklet designs for large-scale assessments of student achievement are identified.

150 citations
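One family of experimental designs used for booklet construction balances which item blocks appear, and co-appear, across booklets. The sketch below builds a cyclic balanced incomplete block design assigning 7 item blocks to 7 booklets of 3 blocks each; the difference set {0, 1, 3} is a standard construction, and the block and booklet labels are illustrative, not taken from the module.

```python
# Cyclic balanced incomplete block design: every block appears in 3 booklets and
# every pair of blocks appears together in exactly 1 booklet.
from itertools import combinations
from collections import Counter

n_blocks = 7
base = [0, 1, 3]  # difference set generating the design

booklets = [sorted((b + shift) % n_blocks for b in base) for shift in range(n_blocks)]
for i, bk in enumerate(booklets, start=1):
    print(f"Booklet {i}: item blocks {bk}")

# Check the balance properties that make calibration and linking tractable.
block_counts = Counter(b for bk in booklets for b in bk)
pair_counts = Counter(pair for bk in booklets for pair in combinations(bk, 2))
assert set(block_counts.values()) == {3}   # each block in 3 booklets
assert set(pair_counts.values()) == {1}    # each pair together exactly once
```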


Journal ArticleDOI
TL;DR: The authors argue that the validity research that will tell us the most about how formative assessment can be used to improve student learning must be embedded in rich curriculum and must at the same time attempt to foster instructional practices consistent with learning research.
Abstract: In many school districts, the pressure to raise test scores has created overnight celebrity status for formative assessment. Its powers to raise student achievement have been touted, however, without attending to the research on which these claims were based. Sociocultural learning theory provides theoretical grounding for understanding how formative assessment works to increase student learning. The articles in this special issue bring us back to underlying first principles by offering separate validity frameworks for evaluating formative assessment (Nichols, Meyers, & Burling) and newly-invented interim assessments (Perie, Marion, & Gong). The article by Heritage, Kim, Vendlinski, and Herman then offers the most important insight of all; that is, formative assessment is of little use if teachers don't know what to do when students are unable to grasp an important concept. While it is true that validity investigations are needed, I argue that the validity research that will tell us the most—about how formative assessment can be used to improve student learning—must be embedded in rich curriculum and must at the same time attempt to foster instructional practices consistent with learning research.

104 citations


Journal ArticleDOI
TL;DR: In this paper, empirical patterns of growth in student achievement are compared as a function of different approaches to creating a vertical scale, and it is shown that interpretations of empirical growth patterns appear to depend upon the extent to which a vertical scale has been effectively "stretched" or "compressed" by the psychometric decisions made to establish it.
Abstract: Most growth models implicitly assume that test scores have been vertically scaled. What may not be widely appreciated are the different choices that must be made when creating a vertical score scale. In this paper empirical patterns of growth in student achievement are compared as a function of different approaches to creating a vertical scale. Longitudinal item-level data from a standardized reading test are analyzed for two cohorts of students between Grades 3 and 6 and Grades 4 and 7 for the entire state of Colorado from 2003 to 2006. Eight different vertical scales were established on the basis of choices made for three key variables: Item Response Theory modeling approach, linking approach, and ability estimation approach. It is shown that interpretations of empirical growth patterns appear to depend upon the extent to which a vertical scale has been effectively “stretched” or “compressed” by the psychometric decisions made to establish it. While all of the vertical scales considered show patterns of decelerating growth across grade levels, there is little evidence of scale shrinkage.

90 citations
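The eight vertical scales come from crossing three two-level psychometric choices. The sketch below only illustrates how such a comparison might be organized: enumerate the crossing and summarize grade-to-grade mean gains under each resulting scale. The choice labels and the fabricated score arrays are placeholders, not the paper's data or results.

```python
# Enumerate the 2 x 2 x 2 crossing of scaling decisions and report mean
# grade-to-grade gains under each hypothetical vertical scale.
from itertools import product
import numpy as np

rng = np.random.default_rng(1)
choices = {
    "irt_model": ["1PL", "3PL"],
    "linking": ["separate", "concurrent"],
    "ability": ["ML", "EAP"],
}

grades = [3, 4, 5, 6]
for combo in product(*choices.values()):
    # Placeholder: scores on the vertical scale implied by this combination.
    scale_scores = {g: rng.normal(400 + 30 * (g - 3), 40, 1000) for g in grades}
    gains = [scale_scores[g + 1].mean() - scale_scores[g].mean() for g in grades[:-1]]
    print(dict(zip(choices, combo)), [round(g, 1) for g in gains])
```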


Journal ArticleDOI
TL;DR: In this paper, the authors propose a framework within which to consider evidence-based claims that assessment information can be used to improve student achievement and illustrate its use with an example of one-on-one tutoring.
Abstract: Assessments labeled as formative have been offered as a means to improve student achievement. But labels can be a powerful way to miscommunicate. For an assessment use to be appropriately labeled “formative,” both empirical evidence and reasoned arguments must be offered to support the claim that improvements in student achievement can be linked to the use of assessment information. Our goal in this article is to support the construction of such an argument by offering a framework within which to consider evidence-based claims that assessment information can be used to improve student achievement. We describe this framework and then illustrate its use with an example of one-on-one tutoring. Finally, we explore the framework's implications for understanding when the use of assessment information is likely to improve student achievement and for advising test developers on how to develop assessments that are intended to offer information that can be used to improve student achievement.

50 citations


Journal ArticleDOI
TL;DR: This paper focuses on critical issues and concerns related to the assessment of English language learners in U.S. and Canadian schools and emphasizes assessment approaches for test developers and decision makers that will facilitate increased equity, meaningfulness, and accuracy in assessment and accountability efforts.
Abstract: Substantial growth in the numbers of English language learners (ELLs) in the United States and Canada in recent years has significantly affected the educational systems of both countries. This article focuses on critical issues and concerns related to the assessment of ELLs in U.S. and Canadian schools and emphasizes assessment approaches for test developers and decision makers that will facilitate increased equity, meaningfulness, and accuracy in assessment and accountability efforts. It begins by examining the crucial issue of defining ELLs as a group. Next, it examines the impact of testing originating from the No Child Left Behind Act of 2001 (NCLB) in the U.S. and government-mandated standards-driven testing in Canada by briefly describing each country's respective legislated testing requirements and outlining their consequences at several levels. Finally, the authors identify key points that test developers and decision makers in both contexts should consider in testing this ever-increasing group of students.

43 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present a didactic overview of the DSF framework and provide specific guidance and recommendations on how DSF can be used to enhance the examination of DIF in polytomous items.
Abstract: Traditional methods for examining differential item functioning (DIF) in polytomously scored test items yield a single item-level index of DIF and thus provide no information concerning which score levels are implicated in the DIF effect. To address this limitation of DIF methodology, the framework of differential step functioning (DSF) has recently been proposed, whereby measurement invariance is examined within each step underlying the polytomous response variable. The examination of DSF can provide valuable information concerning the nature of the DIF effect (i.e., is the DIF an item-level effect or an effect isolated to specific score levels), the location of the DIF effect (i.e., precisely which score levels are manifesting the DIF effect), and the potential causes of a DIF effect (i.e., what properties of the item stem or task are potentially biasing). This article presents a didactic overview of the DSF framework and provides specific guidance and recommendations on how DSF can be used to enhance the examination of DIF in polytomous items. An example with real testing data is presented to illustrate the comprehensive information provided by a DSF analysis.

34 citations
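Differential step functioning looks for group differences at each step underlying a polytomous response rather than at the item level only. The sketch below is a simplified approximation: dichotomize the item at each step and estimate a group effect conditional on total score with logistic regression. The DSF literature develops more refined step-level estimators, and the data here are fabricated.

```python
# Step-level DIF approximation: for each step k, model P(item score >= k) from
# total score and a group indicator; the group coefficient flags step-level effects.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 2000
total = rng.normal(50, 10, n)                        # matching/ability proxy
group = rng.integers(0, 2, n)                        # 0 = reference, 1 = focal
# Fabricated polytomous item scored 0-3, with a group effect near the top step.
latent = 0.08 * (total - 50) - 0.4 * group * (total > 55) + rng.normal(0, 1, n)
item = np.digitize(latent, [-0.8, 0.2, 1.2])          # scores 0, 1, 2, 3

df = pd.DataFrame({"total": total, "group": group, "item": item})
for step in (1, 2, 3):
    y = (df["item"] >= step).astype(int)
    X = sm.add_constant(df[["total", "group"]])
    fit = sm.Logit(y, X).fit(disp=0)
    print(f"step {step}: group log-odds = {fit.params['group']:.2f} "
          f"(SE {fit.bse['group']:.2f})")
```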


Journal ArticleDOI
TL;DR: In this article, the dependence of state- and school-level growth results on cut-score choice is examined along three dimensions: rigor, since states set cut scores largely at their discretion; across-grade articulation; and the time horizon chosen for growth to proficiency.
Abstract: States participating in the Growth Model Pilot Program reference individual student growth against “proficiency” cut scores that conform with the original No Child Left Behind Act (NCLB). Although achievement results from conventional NCLB models are also cut-score dependent, the functional relationships between cut-score location and growth results are more complex and are not currently well described. We apply cut-score scenarios to longitudinal data to demonstrate the dependence of state- and school-level growth results on cut-score choice. This dependence is examined along three dimensions: 1) rigor, as states set cut scores largely at their discretion, 2) across-grade articulation, as the rigor of proficiency standards may vary across grades, and 3) the time horizon chosen for growth to proficiency. Results show that the selection of plausible alternative cut scores within a growth model can change the percentage of students “on track to proficiency” by more than 20 percentage points and reverse accountability decisions for more than 40% of schools. We contribute a framework for predicting these dependencies, and we argue that the cut-score dependence of large-scale growth statistics must be made transparent, particularly for comparisons of growth results across states.

30 citations
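The sensitivity the article documents arises from "on track to proficiency" calculations of roughly the following form: project each student's growth forward to the horizon grade and compare the projection with a proficiency cut score. The sketch below uses a simple linear projection and fabricated scores and cut scores to show how the on-track percentage moves with the cut-score choice; it is not the article's specific growth model.

```python
# Project one-year gains forward and compute the share of students "on track"
# under alternative proficiency cut scores.
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
score_g4 = rng.normal(440, 45, n)
score_g5 = score_g4 + rng.normal(25, 20, n)          # observed one-year gains

horizon_years = 2                                    # e.g., must be proficient by grade 7
projected_g7 = score_g5 + horizon_years * (score_g5 - score_g4)

for cut in (500, 510, 520):                          # plausible alternative cut scores
    on_track = np.mean(projected_g7 >= cut)
    print(f"cut score {cut}: {on_track:.1%} of students on track")
```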


Journal ArticleDOI
TL;DR: The authors investigated whether teaching experience was a necessary selection criterion for raters of a National Curriculum test in English and found that it was not.
Abstract: Hundreds of thousands of raters are recruited internationally to score examinations, but little research has been conducted on the selection criteria for these raters. Many countries insist upon teaching experience as a selection criterion and this has frequently become embedded in the cultural expectations surrounding the tests. Shortages in raters for some of England's national examinations have led to non-teachers being hired to score a small minority of items, and changes in technology have fostered this approach. For a National Curriculum test in English taken at age 14, this study investigated whether teaching experience was a necessary selection criterion for all aspects of the examination. Fifty-seven raters with different backgrounds were trained in the normal manner and scored the same 97 students' work. Accuracy was investigated using a cross-classified multilevel model of absolute score differences with accuracy measures at level 1 and raters crossed with candidates at level 2. By comparing the scoring accuracy of graduates with a degree in English, teacher trainees, experienced teachers and experienced raters, this study found that teaching experience was not a necessary selection criterion. A rudimentary model for allocation of raters to different question types is proposed and further research to investigate the limits of necessary qualifications for scoring is suggested.
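The accuracy comparison at the heart of this study can be illustrated descriptively: mean absolute difference between each rater's score and a reference score, summarized by rater background. The sketch below is only that simplified summary; the paper itself fits a cross-classified multilevel model (raters crossed with candidates), which is not reproduced here, and the scores below are fabricated.

```python
# Descriptive accuracy comparison across rater backgrounds using fabricated data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
backgrounds = ["English graduate", "teacher trainee", "experienced teacher",
               "experienced rater"]
rows = []
for rater in range(57):
    bg = backgrounds[rater % 4]
    for candidate in range(97):
        reference = rng.normal(30, 6)                 # agreed reference score
        awarded = reference + rng.normal(0, 2.5)      # this rater's score
        rows.append({"rater": rater, "background": bg,
                     "abs_diff": abs(awarded - reference)})

df = pd.DataFrame(rows)
print(df.groupby("background")["abs_diff"].agg(["mean", "std"]).round(2))
```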

Journal ArticleDOI
TL;DR: In this article, the authors present a framework that attempts to prescribe the conditions under which the responsibility for collecting evidence of test score use consequences falls to the test developer or the test user.
Abstract: This article has three goals. The first goal is to clarify the role that the consequences of test score use play in validity judgments by reviewing the role that modern writers on validity have ascribed for consequences in supporting validity judgments. The second goal is to summarize current views on who is responsible for collecting evidence of test score use consequences by attempting to separate the responsibilities of the test developer and the test user. The last goal is to offer a framework that attempts to prescribe the conditions under which the responsibility for collecting evidence of consequences falls to the test developer or to the test user.

Journal ArticleDOI
TL;DR: In this paper, six pilot-approved growth models were applied to vertically scaled mathematics assessment data from a single state collected over 2 years and student and school classifications were compared across models.
Abstract: A key intent of the NCLB growth pilot is to reward low-status schools who are closing the gap to proficiency. In this article, we demonstrate that the capability of proposed models to identify those schools depends on how the growth model is incorporated into accountability decisions. Six pilot-approved growth models were applied to vertically scaled mathematics assessment data from a single state collected over 2 years. Student and school classifications were compared across models. Accountability classifications using status and growth to proficiency as defined by each model were considered from two perspectives. The first involved adding the number of students moving toward proficiency to the count of proficient students, while the second involved a multitier accountability system where each school was first held accountable for status and then held accountable for the growth of their nonproficient students. Our findings emphasize the importance of evaluating status and growth independently when attempting to identify low-status schools with insufficient growth among nonproficient students.
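The two perspectives described above translate into two different decision rules for a school. The sketch below codes both rules for a single school: (1) add students "on track to proficiency" to the proficient count, and (2) a multitier rule that checks status first and then the growth of nonproficient students. The thresholds, field names, and example school are illustrative assumptions, not values from the article.

```python
# Two ways of folding growth into an accountability decision for one school.
from dataclasses import dataclass

@dataclass
class School:
    n_students: int
    n_proficient: int
    n_nonproficient_on_track: int   # nonproficient students growing toward proficiency

def status_plus_growth(s: School, target: float = 0.70) -> bool:
    rate = (s.n_proficient + s.n_nonproficient_on_track) / s.n_students
    return rate >= target

def multitier(s: School, status_target: float = 0.70,
              growth_target: float = 0.50) -> bool:
    if s.n_proficient / s.n_students >= status_target:
        return True                                   # meets status outright
    n_nonproficient = s.n_students - s.n_proficient
    return s.n_nonproficient_on_track / n_nonproficient >= growth_target

school = School(n_students=400, n_proficient=240, n_nonproficient_on_track=70)
print(status_plus_growth(school), multitier(school))  # the two rules can disagree
```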

Journal ArticleDOI
TL;DR: In this paper, the authors addressed the challenge of scoring cognitive interviews in research involving multiple cultural groups, and interviewed 123 fourth- and fifth-grade students from three cultural groups to probe how they related a mathematics item to their personal lives.
Abstract: We addressed the challenge of scoring cognitive interviews in research involving multiple cultural groups. We interviewed 123 fourth- and fifth-grade students from three cultural groups to probe how they related a mathematics item to their personal lives. Item meaningfulness—the tendency of students to relate the content and/or context of an item to activities in which they are actors—was scored from interview transcriptions with a procedure similar to the scoring of constructed-response tasks. Generalizability theory analyses revealed a small amount of score variation due to the main and interaction effect of rater but a sizeable magnitude of measurement error due to the interaction of person and question (context). Students from different groups tended to draw on different sets of contexts of their personal lives to make sense of the item. In spite of individual and potential cultural communication style differences, cognitive interviews can be reliably scored by well-trained raters with the same kind of rigor used in the scoring of constructed-response tasks. However, to make valid generalizations of cognitive interview-based measures, a considerable number of interview questions may be needed. Information obtained with cognitive interviews for a given cultural group may not be generalizable to other groups.
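The generalizability theory analysis reported here decomposes score variance for a persons x questions x raters design. The sketch below gives the classical ANOVA-based variance component estimates for a fully crossed design; the function name is illustrative, the numbers of questions and raters are assumptions (only the 123 students come from the abstract), and negative estimates are usually truncated to zero in practice.

```python
# ANOVA estimates of variance components for a fully crossed p x q x r G-study.
import numpy as np

def g_study_pxqxr(X):
    """X has shape (persons, questions, raters), one score per cell."""
    n_p, n_q, n_r = X.shape
    grand = X.mean()
    m_p, m_q, m_r = X.mean(axis=(1, 2)), X.mean(axis=(0, 2)), X.mean(axis=(0, 1))
    m_pq, m_pr, m_qr = X.mean(axis=2), X.mean(axis=1), X.mean(axis=0)

    ss = {
        "p": n_q * n_r * ((m_p - grand) ** 2).sum(),
        "q": n_p * n_r * ((m_q - grand) ** 2).sum(),
        "r": n_p * n_q * ((m_r - grand) ** 2).sum(),
        "pq": n_r * ((m_pq - m_p[:, None] - m_q[None, :] + grand) ** 2).sum(),
        "pr": n_q * ((m_pr - m_p[:, None] - m_r[None, :] + grand) ** 2).sum(),
        "qr": n_p * ((m_qr - m_q[:, None] - m_r[None, :] + grand) ** 2).sum(),
    }
    ss["pqr,e"] = ((X - grand) ** 2).sum() - sum(ss.values())

    ms = {
        "p": ss["p"] / (n_p - 1), "q": ss["q"] / (n_q - 1), "r": ss["r"] / (n_r - 1),
        "pq": ss["pq"] / ((n_p - 1) * (n_q - 1)),
        "pr": ss["pr"] / ((n_p - 1) * (n_r - 1)),
        "qr": ss["qr"] / ((n_q - 1) * (n_r - 1)),
        "pqr,e": ss["pqr,e"] / ((n_p - 1) * (n_q - 1) * (n_r - 1)),
    }
    return {
        "person": (ms["p"] - ms["pq"] - ms["pr"] + ms["pqr,e"]) / (n_q * n_r),
        "question": (ms["q"] - ms["pq"] - ms["qr"] + ms["pqr,e"]) / (n_p * n_r),
        "rater": (ms["r"] - ms["pr"] - ms["qr"] + ms["pqr,e"]) / (n_p * n_q),
        "person x question": (ms["pq"] - ms["pqr,e"]) / n_r,
        "person x rater": (ms["pr"] - ms["pqr,e"]) / n_q,
        "question x rater": (ms["qr"] - ms["pqr,e"]) / n_p,
        "residual": ms["pqr,e"],
    }

rng = np.random.default_rng(5)
scores = rng.integers(0, 4, size=(123, 6, 2)).astype(float)  # persons x questions x raters
print({k: round(v, 3) for k, v in g_study_pxqxr(scores).items()})
```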

Journal ArticleDOI
TL;DR: In this paper, large data sets from a state reading assessment for third and fifth graders were analyzed to examine differential item functioning, differential distractor functioning, and differential omission frequency between students with particular categories of disabilities (speech/language impairments, learning disabilities, and emotional behavior disorders) and students without disabilities.
Abstract: Large data sets from a state reading assessment for third and fifth graders were analyzed to examine differential item functioning (DIF), differential distractor functioning (DDF), and differential omission frequency (DOF) between students with particular categories of disabilities (speech/language impairments, learning disabilities, and emotional behavior disorders) and students without disabilities. Multinomial logistic regression was employed to compare response characteristic curves (RCCs) of individual test items. Although no evidence for serious test bias was found for the state assessment examined in this study, the results indicated that students in different disability categories showed different patterns of DIF, DDF, and DOF, and that the use of RCCs helps clarify the implications of DIF and DDF.
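The response characteristic curve comparison described above treats each item's response as a categorical outcome (correct, each distractor, or omitted) and models it with multinomial logistic regression. The sketch below fits such a model for a single fabricated item with ability, a group indicator, and their interaction as predictors; the single indicator is a stand-in for the disability categories, and none of the data reflect the state assessment analyzed in the study.

```python
# Multinomial logistic regression for DIF/DDF/DOF on one item with fabricated data.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 3000
total = rng.normal(0, 1, n)                 # ability proxy (rest-of-test score)
group = rng.integers(0, 2, n)               # 0 = without disabilities, 1 = focal category

# Response categories: 0 = correct, 1 = distractor A, 2 = distractor B, 3 = omit,
# with the focal group made slightly more likely to omit.
logits = np.column_stack([
    1.2 * total,                             # correct
    np.zeros(n),                             # distractor A
    -0.3 * total,                            # distractor B
    -1.0 + 0.8 * group - 0.5 * total,        # omit
])
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
resp = np.array([rng.choice(4, p=p) for p in probs])

X = pd.DataFrame({"const": 1.0, "total": total, "group": group,
                  "total_x_group": total * group})
fit = sm.MNLogit(resp, X).fit(disp=0)
print(fit.summary())                         # group terms flag DDF/DOF for this item
```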

Journal ArticleDOI
TL;DR: In a randomized experiment with repeat examinees on a national certification test, this article found that score gains for examinees who retook the same form were indistinguishable from those who took a parallel form, suggesting that the same-form advantage is minimal in credentialing testing even though the use of multiple forms is expensive and can present psychometric challenges, particularly for low-volume credentialing programs.
Abstract: Examinees who take high-stakes assessments are usually given an opportunity to repeat the test if they are unsuccessful on their initial attempt. To prevent examinees from obtaining unfair score increases by memorizing the content of specific test items, testing agencies usually assign a different test form to repeat examinees. The use of multiple forms is expensive and can present psychometric challenges, particularly for low-volume credentialing programs; thus, it is important to determine if unwarranted score gains actually occur. Prior studies provide strong evidence that the same-form advantage is pronounced for aptitude tests. However, the sparse research within the context of achievement and credentialing testing suggests that the same-form advantage is minimal. For the present experiment, 541 examinees who failed a national certification test were randomly assigned to receive either the same test or a different (parallel) test on their second attempt. Although the same-form group had shorter response times on the second administration, score gains for the two groups were indistinguishable. We discuss factors that may limit the generalizability of these findings to other assessment contexts.
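The core comparison in this experiment is between the score gains of the same-form and parallel-form groups. The sketch below runs that comparison as an independent-samples t-test; the gain distributions and the roughly even 271/270 split of the 541 examinees are fabricated assumptions, not the study's data.

```python
# Compare second-attempt score gains between randomly assigned form conditions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
gain_same_form = rng.normal(6.0, 8.0, 271)       # second attempt minus first attempt
gain_parallel_form = rng.normal(5.8, 8.0, 270)

t, p = stats.ttest_ind(gain_same_form, gain_parallel_form)
print(f"mean gain (same form) = {gain_same_form.mean():.2f}, "
      f"(parallel form) = {gain_parallel_form.mean():.2f}, t = {t:.2f}, p = {p:.3f}")
```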

Journal ArticleDOI
TL;DR: In this paper, an alignment procedure, called Links for Academic Learning (LAL), is described for examining the degree of alignment of alternate assessments based on alternate achievement standards (AA-AAS) to grade-level content standards and instruction.
Abstract: This article describes an alignment procedure, called Links for Academic Learning (LAL), for examining the degree of alignment of alternate assessments based on alternate achievement standards (AA-AAS) to grade-level content standards and instruction. Although some of the alignment criteria are similar to those used in general education assessments, modifications to these criteria as well as additional unique criteria are incorporated because of the needs of the students with significant cognitive disabilities. This article addresses the alignment challenges that are specific to AA-AAS, the rationale for the LAL alignment criteria, and a description of how the criteria are measured.

Journal ArticleDOI
TL;DR: In this paper, the authors present findings from the Third Annual Survey of Assessment and Accommodations for Students who are Deaf or Hard-of-Hearing (SDHH).
Abstract: Students who are deaf or hard of hearing (SDHH) often use test accommodations when they participate in large-scale, standardized assessments. The purpose of this article is to present findings from the Third Annual Survey of Assessment and Accommodations for Students who are Deaf or Hard of Hearing. The “big five” accommodations were reported by at least two-thirds of the 389 participants: extended time, small group/individual administration, test directions interpreted, test items read aloud, and test items interpreted. In a regression analysis, language used in instruction showed the most significant effects on accommodations use. The article considers these findings in light of a more proactive role for the National Survey in providing evidence for the effectiveness of accommodations with SDHH.

Journal ArticleDOI
Henry Braun
TL;DR: The Growth Model Pilot Program (GMPP) as discussed by the authors was introduced as a response to criticism of the reliance of No Child Left Behind on status-based indicators, and although incorporating a growth component appears to be a step in the right direction, it adds a level of complexity that brings to the fore new concerns.
Abstract: The Growth Model Pilot Program (GMPP) was introduced as a response to criticism of the reliance of No Child Left Behind on status-based indicators. Although incorporating a growth component appears to be a step in the right direction, it adds a level of complexity that brings to the fore new concerns. Three of the four papers in this special issue address different manifestations of a common problem: how choices related to implementation of the GMPP, made on technical or practical grounds, can have substantial consequences for schools and teachers. The fourth paper explicates a different approach to growth modeling. Together these papers add considerably to our understanding of the strengths and weaknesses of growth-based indicators. In this discussion, I attempt to set these papers in a larger context before providing brief reviews. I conclude by arguing that the nation would be better served if policy makers appreciated the consequences of choices before crafting legislation and the accompanying regulations.