Abstract: A weakness of most previous score-inflation studies is that differences in performance between a high-stakes test and a readily available “audit” test could reflect factors other than inflation. Koretz and Beguin (2010) proposed a “self-monitoring assessment” (SMA) intended to eliminate such factors. Using pilot data from a first-ever designed SMA—comprising past state-test items (the non-audit subtest) and specially designed audit items (the audit subtest)—administered to a random sample of 4th graders in New York State in 2011, we demonstrated empirically the feasibility of designing an SMA to measure score inflation associated with coaching. To examine the practical impact of using inflated scores for accountability, we also used the pilot data to investigate the consistency of school-performance ratings on the two subtests. We found that schools’ ratings, using various school-performance measures, varied substantially between the subtests, suggesting that ratings based on the non-audit subtest reflect schools’ relative engagement in coaching.

Sensitivity of School-Performance Ratings to Score Inflation: An Exploratory Study Using a Self-Monitoring Assessment

Introduction

In recent years, student scores on standardized tests of academic achievement have grown in importance as indicators of schools’ success, both in the US and internationally (Linn, 2004; Mathison, 2009; Organisation for Economic Cooperation and Development [OECD], 2008; Torrance, 2009). But how trustworthy are ratings of schools’ performance based on measures derived from high-stakes testing, where scores could be inflated (i.e., higher than the actual achievement levels they represent)?

While the phenomenon of score inflation is well documented in the US (Koretz & Hamilton, 2006), existing research is limited in two ways with regard to its impact on school-performance measures. First, existing studies have investigated score inflation only at the state or district level (e.g., Fuller, Gesicki, Kang, & Wright, 2006; Jacob, 2005, 2007; Klein, Hamilton, McCaffrey, & Stecher, 2000), but not at the school level. More importantly, these studies have typically relied on results from readily available “audit” tests, not designed specifically to detect score inflation, to evaluate performance on the high-stakes test. For example, all the studies cited above used scores on a test (e.g., NAEP) that may differ from the high-stakes test in ways other than the stakes for educators. These include differences in test-curriculum alignment, tested-student populations, and students’ motivational levels. Thus, any discrepancy in performance on the high- and lower-stakes tests could also be due to these factors rather than to score inflation. To address this, Koretz and Beguin (2010) proposed a “self-monitoring assessment” (SMA) that incorporates specially designed audit items directly into an operational assessment to eliminate specific non-inflation-related factors, thereby providing a measure of score inflation that is free from such potential confounders.

In this study, we used data from the first pilot implementation of an SMA. This SMA, designed by the Education Accountability Project at the Harvard Graduate School of Education, was administered to a statewide random sample of 4th graders in New York State (NYS) in Spring 2011. Students were administered two sets of test items.
The first (the “non-audit subtest”) comprised items from past NYS state tests that were deemed likely to be the focus of inappropriate test preparation. The second (the “audit subtest”) comprised items designed to test similar content but to be less susceptible to a particular type of test preparation—“coaching”—that we expected to be directed at the past items in the non-audit subtest. The two sets of items were interspersed in random order and were not distinguishable to students. To the extent that the design of the audit items eliminated the effects of other factors, discrepancies in performance between the two subtests are a measure of score inflation.

As this was the first pilot test of the SMA design, we first verified the extent to which the audit subtest provides a valid measure of score inflation. Then, to investigate the practical impact of using inflated scores for accountability, we examined the consistency of schools’ performance ratings on the two subtests. Specifically, we asked:

RQ1. To what extent does the difference in student performance on the two subtests provide a measure of score inflation?

RQ2. How consistent are schools’ performance ratings when we use scores from the two subtests to derive the school-performance estimates?

In the rest of the article, we first describe the theoretical setup underlying the use of two measures of the same outcome to detect score inflation, and how the SMA framework proposed by Koretz and Beguin (2010) provides a way to detect score inflation caused by inappropriate test preparation that is free from potential confounding by specific non-inflation-related factors. Then, we review how coaching could inflate students’ scores, which formed the basis for designing the SMA used in the study. After briefly describing the context of the high-stakes school-accountability system in NYS, we set out the research design for the study and present our key findings. We conclude by discussing the implications and limitations of the study.

Detecting Score Inflation Using Two Measures of the Same Outcome

There are five main factors associated with systematic differences between two standardized tests in the same academic subject that could result in inconsistent school ratings when one test is used rather than the other: (1) alignment between schools’ implemented curricula and the content mixes of the tests; (2) timing of the tests; (3) students’ motivational levels while taking the tests; (4) test-administration procedures; and (5) tested-student populations. Factors (2) and (3) are different aspects of the occasion of testing that are potentially confounded. Further, for each factor, the differences between the tests could be due either to non-inflation-related sources or to schools’ behavioral responses to the high-stakes use of the results of one test but not the other (i.e., induced by stakes). For example, variations in test-curriculum alignment among schools could arise when schools adopt different content emphases or pedagogical approaches for non-stakes-related reasons, such as their values orientation or the curriculum-design model they adopt (Marsh, 2009). They could also arise when some schools engage in inappropriate test-preparation activities focused on the materials tested on the high-stakes test (see later).
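To make RQ2 concrete before formalizing the confounding problem, the sketch below (in Python) computes school-performance estimates from simulated student scores on the two subtests and quantifies the consistency of the resulting school ratings. It is a minimal illustration, not the study’s actual estimation procedure: the data are simulated, and the choices of mean proportion correct as the school-performance measure and Spearman rank correlation as the consistency index are our assumptions for exposition.

import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(42)

# Simulated student-level data (hypothetical structure for illustration):
# each school has a true achievement level plus a coaching-driven
# inflation component that affects only the non-audit subtest.
n_schools, n_students = 50, 40
true_ach = rng.normal(0.60, 0.08, n_schools)            # school mean achievement
inflation = rng.normal(0.05, 0.03, n_schools).clip(0)   # school-level inflation

rows = []
for s in range(n_schools):
    audit = rng.normal(true_ach[s], 0.12, n_students).clip(0, 1)
    # Non-audit scores reflect achievement plus coaching-induced inflation.
    non_audit = (audit + inflation[s]
                 + rng.normal(0, 0.05, n_students)).clip(0, 1)
    rows.extend({"school": s, "audit": a, "non_audit": na}
                for a, na in zip(audit, non_audit))
df = pd.DataFrame(rows)

# School-performance measure: mean proportion correct on each subtest.
school_means = df.groupby("school")[["audit", "non_audit"]].mean()

# Estimated score inflation per school: non-audit minus audit performance.
school_means["inflation_est"] = school_means["non_audit"] - school_means["audit"]

# Consistency of school ratings across the two subtests (RQ2).
rho, p = spearmanr(school_means["audit"], school_means["non_audit"])
print(f"Spearman rank correlation of school ratings: {rho:.2f} (p = {p:.3g})")

Rank-based agreement is a natural summary here because accountability systems act largely on schools’ relative standings; a low rank correlation between subtest-based ratings would indicate that inflation distorts which schools appear successful.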
For detecting and measuring score inflation, the difference in scores from two tests that differ in both non-inflation-related and stakes-induced aspects thus confounds the effects of the two sources of differences. This is a known weakness of previous score-inflation studies that have relied on readily available “audit” tests, as discussed by other researchers (e.g., Applegate, Applegate, McGeehan, Pinto, & Kong, 2009; Center on Education Policy, 2010; Corcoran, Jennings, & Beveridge, 2011; Jacob, 2002; Jirka & Hambleton, 2004; Koretz & Beguin, 2010; Wei, Shen, Lukoff, Ho, & Haertel, 2006). Typically, past score-inflation studies addressed this limitation by introducing the dimension of time, treating discrepant score gains over time on the two tests as indicative of score inflation. This approach, however, cannot detect score inflation at any single time point. The SMA framework proposed by Koretz and Beguin (2010) thus represents an important methodological breakthrough because it is intended to detect score inflation at a single time point. Before we describe how it does so, we first formalize the confounding involved in using the difference in scores between two tests as a measure of score inflation at a single time point.

Assume that we are interested in making an inference about students’ achievement in a particular content domain. Consider two tests designed to measure achievement in that domain: one is the target high-stakes test for which we want to investigate the incidence of score inflation, and the other is an audit test that we want to use to make that evaluation. For simplicity of exposition, let us also assume that scores on the audit test are free from inflation. Then, for student $i$ in school $s$, let $X_{is}^{Target}$ and $X_{is}^{Audit}$ denote the observed scores on the target and audit tests, $T_{is}^{Target}$ and $T_{is}^{Audit}$ the corresponding true scores, and $e_{is}^{Target}$ and $e_{is}^{Audit}$ the respective measurement errors.
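Under the stated assumptions, one plausible way to complete this formalization is the following decomposition (a sketch; the symbols $\theta_{is}$ and $b_{is}$ are ours, introduced for exposition):

$$X_{is}^{Target} = T_{is}^{Target} + e_{is}^{Target}, \qquad X_{is}^{Audit} = T_{is}^{Audit} + e_{is}^{Audit},$$

where the measurement errors have mean zero. Because the audit test is assumed free from inflation, its true score reflects actual achievement in the domain, $T_{is}^{Audit} = \theta_{is}$, and score inflation for student $i$ is $b_{is} = T_{is}^{Target} - \theta_{is}$. Taking expectations over the measurement errors,

$$E\left[X_{is}^{Target} - X_{is}^{Audit}\right] = b_{is},$$

so the expected difference between the two observed scores identifies score inflation at a single time point, provided that the audit test has eliminated the non-inflation-related differences listed above.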