Abstract: A weakness of most previous score-inflation studies is that differences in performance between a high-stakes test and a readily available “audit” test could reflect factors other than inflation. Koretz and Beguin (2010) proposed a “self-monitoring assessment” (SMA) intended to eliminate such factors. Using pilot data from a first-ever designed SMA—comprising past state-test items (the non-audit subtest) and specially designed audit items (the audit subtest)—administered to a random sample of 4th graders in New York State in 2011, we demonstrated empirically the feasibility of designing an SMA to measure score inflation associated with coaching. To examine the practical impact of using inflated scores for accountability, we also used the pilot data to investigate the consistency of school-performance ratings on the two subtests. We found that schools’ ratings, using various school-performance measures, varied substantially between the subtests, suggesting that ratings based on the non-audit subtest reflect schools’ relative engagement in coaching.

Sensitivity of School-Performance Ratings to Score Inflation: An Exploratory Study Using a Self-Monitoring Assessment

Introduction

In recent years, student scores on standardized tests of academic achievement have grown in importance as indicators of schools’ success, both in the US and internationally (Linn, 2004; Mathison, 2009; Organisation for Economic Cooperation and Development [OECD], 2008; Torrance, 2009). But how trustworthy are ratings of schools’ performance based on measures derived from high-stakes testing, where scores could be inflated (i.e., higher than the actual achievement levels they represent)?

While the phenomenon of score inflation is well documented in the US (Koretz & Hamilton, 2006), existing research is limited in two ways with regard to its impact on school-performance measures. First, existing studies have investigated score inflation only at the state or district level (e.g., Fuller, Gesicki, Kang, & Wright, 2006; Jacob, 2005, 2007; Klein, Hamilton, McCaffrey, & Stecher, 2000), but not at the school level. More importantly, these studies have typically relied on results from readily available “audit” tests, not designed specifically to detect score inflation, to evaluate performance on the high-stakes test. For example, all the studies cited above used scores on a test (e.g., NAEP) that may differ from the high-stakes test in ways other than the stakes for educators. These include differences in test-curriculum alignment, tested-student populations, and students’ motivational levels. Thus, any discrepancy in performance on the high- and lower-stakes tests could also be due to these factors rather than to score inflation. To address this, Koretz and Beguin (2010) proposed a “self-monitoring assessment” (SMA) that incorporates specially designed audit items directly into an operational assessment to eliminate specific non-inflation-related factors, thereby providing a measure of score inflation that is free from such potential confounders.

In this study, we used data from the first pilot implementation of an SMA. This SMA, designed by the Education Accountability Project at the Harvard Graduate School of Education, was administered to a statewide random sample of 4th graders in New York State (NYS) in Spring 2011. Students were administered two sets of test items.
The first (the “non-audit subtest”) comprised items from past NYS state tests that were deemed likely to be the focus of inappropriate test preparation. The second (the “audit subtest”) comprised items designed to test similar content but to be less susceptible to a particular type of test preparation—“coaching”—that we expected to be directed at the past items in the non-audit subtest. The two sets of items were interspersed in random order and were not distinguishable to students. To the extent that the design of the audit items eliminated the effects of other factors, discrepancies in performance between the two subtests are a measure of score inflation.

As this was the first pilot test of the SMA design, we first verified the extent to which the audit subtest provides a valid measure of score inflation. Then, to investigate the practical impact of using inflated scores for accountability, we examined the consistency of schools’ performance ratings on the two subtests. Specifically, we asked:

RQ1. To what extent does the difference in student performance on the two subtests provide a measure of score inflation?

RQ2. How consistent are schools’ performance ratings when we use scores from the two subtests to derive the school-performance estimates?

In the rest of the article, we first describe the theoretical setup underlying the use of two measures of the same outcome to detect score inflation, and how the SMA framework proposed by Koretz and Beguin (2010) provides a way to detect score inflation caused by inappropriate test preparation that is free from potential confounding by specific non-inflation-related factors. Then, we review how coaching could inflate students’ scores, which formed the basis for designing the SMA used in the study. After briefly describing the context of the high-stakes school-accountability system in NYS, we set out the research design for the study and present our key findings. We conclude by discussing the implications and limitations of the study.

Detecting Score Inflation Using Two Measures of the Same Outcome

There are five main factors associated with systematic differences between two standardized tests in the same academic subject that could result in inconsistent school ratings when one test is used rather than the other: (1) alignment between schools’ implemented curricula and the content mixes of the tests; (2) timing of the tests; (3) students’ motivational levels while taking the tests; (4) test-administration procedures; and (5) tested-student populations. Factors (2) and (3) are different aspects of the occasion of testing that are potentially confounded. Further, for each factor, the differences between the tests could be due either to non-inflation-related sources or to schools’ behavioral responses to the high-stakes use of the results of one test but not the other (i.e., induced by stakes). For example, variations in test-curriculum alignment among schools could arise when schools adopt different content emphases or pedagogical approaches for non-stakes-related reasons, such as their values orientation or the curriculum-design model they adopt (Marsh, 2009). They could also arise when some schools engage in inappropriate test-preparation activities focused on the materials tested on the high-stakes test (see later).
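To make RQ2 concrete before formalizing the confounding problem, the sketch below (in Python) computes school-performance estimates from simulated student scores on the two subtests and quantifies the consistency of the resulting school ratings. It is a minimal illustration, not the study’s actual estimation procedure: the data are simulated, and the choices of mean proportion correct as the school-performance measure and Spearman rank correlation as the consistency index are our assumptions for exposition.

import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(42)

# Simulated student-level data (hypothetical structure for illustration):
# each school has a true achievement level plus a coaching-driven
# inflation component that affects only the non-audit subtest.
n_schools, n_students = 50, 40
true_ach = rng.normal(0.60, 0.08, n_schools)            # school mean achievement
inflation = rng.normal(0.05, 0.03, n_schools).clip(0)   # school-level inflation

rows = []
for s in range(n_schools):
    audit = rng.normal(true_ach[s], 0.12, n_students).clip(0, 1)
    # Non-audit scores reflect achievement plus coaching-induced inflation.
    non_audit = (audit + inflation[s]
                 + rng.normal(0, 0.05, n_students)).clip(0, 1)
    rows.extend({"school": s, "audit": a, "non_audit": na}
                for a, na in zip(audit, non_audit))
df = pd.DataFrame(rows)

# School-performance measure: mean proportion correct on each subtest.
school_means = df.groupby("school")[["audit", "non_audit"]].mean()

# Estimated score inflation per school: non-audit minus audit performance.
school_means["inflation_est"] = school_means["non_audit"] - school_means["audit"]

# Consistency of school ratings across the two subtests (RQ2).
rho, p = spearmanr(school_means["audit"], school_means["non_audit"])
print(f"Spearman rank correlation of school ratings: {rho:.2f} (p = {p:.3g})")

Rank-based agreement is a natural summary here because accountability systems act largely on schools’ relative standings; a low rank correlation between subtest-based ratings would indicate that inflation distorts which schools appear successful.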
For detecting and measuring score inflation, the difference in scores from two tests that differ in both non-inflation-related and stakes-induced aspects thus confounds the effects of the two sources of differences. This is a known weakness of previous score-inflation studies that have relied on readily available “audit” tests, as discussed by other researchers (e.g., Applegate, Applegate, McGeehan, Pinto, & Kong, 2009; Center on Education Policy, 2010; Corcoran, Jennings, & Beveridge, 2011; Jacob, 2002; Jirka & Hambleton, 2004; Koretz & Beguin, 2010; Wei, Shen, Lukoff, Ho, & Haertel, 2006). Typically, past score-inflation studies addressed this limitation by introducing the dimension of time, treating discrepant score gains over time on the two tests as indicative of score inflation. This approach, however, cannot detect score inflation at any single time point. The SMA framework proposed by Koretz and Beguin (2010) thus represents an important methodological breakthrough because it is intended to detect score inflation at a single time point. Before we describe how it does so, we first formalize the confounding involved in using the difference in scores between two tests as a measure of score inflation at a single time point.

Assume that we are interested in making an inference about students’ achievement in a particular content domain. Consider two tests designed to measure achievement in that domain: one is the target high-stakes test for which we want to investigate the incidence of score inflation, and the other is an audit test that we want to use to make that evaluation. For simplicity of exposition, let us also assume that scores on the audit test are free from inflation. Then, for student $i$ in school $s$, let $X_{is}^{Target}$ and $X_{is}^{Audit}$ denote the observed scores on the target and audit tests, $T_{is}^{Target}$ and $T_{is}^{Audit}$ the corresponding true scores, and $e_{is}^{Target}$ and $e_{is}^{Audit}$ the respective measurement errors.
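Under the stated assumptions, one plausible way to complete this formalization is the following decomposition (a sketch; the symbols $\theta_{is}$ and $b_{is}$ are ours, introduced for exposition):

$$X_{is}^{Target} = T_{is}^{Target} + e_{is}^{Target}, \qquad X_{is}^{Audit} = T_{is}^{Audit} + e_{is}^{Audit},$$

where the measurement errors have mean zero. Because the audit test is assumed free from inflation, its true score reflects actual achievement in the domain, $T_{is}^{Audit} = \theta_{is}$, and score inflation for student $i$ is $b_{is} = T_{is}^{Target} - \theta_{is}$. Taking expectations over the measurement errors,

$$E\left[X_{is}^{Target} - X_{is}^{Audit}\right] = b_{is},$$

so the expected difference between the two observed scores identifies score inflation at a single time point, provided that the audit test has eliminated the non-inflation-related differences listed above.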