Absolute and relative measures of instructional sensitivity

doi:10.3102/1076998617703649

Naumann, Alexander; Hartig, Johannes; Hochweber, Jan

Journal of educational and behavioral statistics 42 (2017) 6, S. 678-705

Quellenangabe/ Reference:

Naumann, Alexander; Hartig, Johannes; Hochweber, Jan: Absolute and relative measures of

instructional sensitivity - In: Journal of educational and behavioral statistics 42 (2017) 6, S. 678-705 -

URN: urn:nbn:de:0111-pedocs-156029 - DOI: 10.25656/01:15602

https://nbn-resolving.org/urn:nbn:de:0111-pedocs-156029

https://doi.org/10.25656/01:15602

Nutzungsbedingungen Terms of use

Gewährt wird ein nicht exklusives, nicht übertragbares,

persönliches und beschränktes Recht auf Nutzung dieses

Dokuments. Dieses Dokument ist ausschließlich für den

persönlichen, nicht-kommerziellen Gebrauch bestimmt. Die

Nutzung stellt keine Übertragung des Eigentumsrechts an diesem

Dokument dar und gilt vorbehaltlich der folgenden

Einschränkungen: Auf sämtlichen Kopien dieses Dokuments

müssen alle Urheberrechtshinweise und sonstigen Hinweise auf

gesetzlichen Schutz beibehalten werden. Sie dürfen dieses

Dokument nicht in irgendeiner Weise abändern, noch dürfen Sie

dieses Dokument für öffentliche oder kommerzielle Zwecke

vervielfältigen, öffentlich ausstellen, aufführen, vertreiben oder

anderweitig nutzen.

We grant a non-exclusive, non-transferable, individual and limited

right to using this document.

This document is solely intended for your personal, non-commercial

use. Use of this document does not include any transfer of property

rights and it is conditional to the following limitations: All of the

copies of this documents must retain all copyright information and

other information regarding legal protection. You are not allowed to

alter this document in any way, to copy it for public or commercial

purposes, to exhibit the document in public, to perform, distribute or

otherwise use the document in public.

Mit der Verwendung dieses Dokuments erkennen Sie die

Nutzungsbedingungen an.

By using this particular document, you accept the above-stated

conditions of use.

Kontakt / Contact:

pe

DOCS

DIPF | Leibniz-Institut für Bildungsforschung und Bildungsinformation

Informationszentrum (IZ) Bildung

E-Mail: pedocs@dipf.de

Internet: www.pedocs.de

Article

Absolute and Relative Measures

of Instructional Sensitivity

Alexander Naumann

Johannes Hartig

German Institute for International Educational Research (DIPF)

Jan Hochweber

University of Teacher Education St. Gallen (PHSG)

Valid inferences on teaching drawn from students’ test scores require that tests

are sensitive to the instruction students received in class. Accordingly, measures

of the test items’ instructional sensitivity provide empirical support for validity

claims about inferences on instruction. In the present study, we first introduce

the concepts of absolute and relative measures of instructional sensitivity.

Absolute measures summarize a single item’s total capacity of capturing effects

of instruction, which is independent of the test’s sensitivity. In contrast, relative

measures summarize a single item’s capacity of capturing effects of instruction

relative to test sensitivity. Then, we propose a longitudinal multilevel item

response theory model that allows estimating both types of measures depending

on the identification constraints.

Keywords:instructional sensitivity; multilevel IRT; differential item functioning

Researchers as well as policymakers regularly rely on student performance data

to draw inferences on schools, teachers, or teaching (Creemers & Kyriakides,

2008; Pellegrino, 2002). Yet valid inferences drawn from student test scores

require that instruments are sensitive to the instruction that students have received

in class (Popham, 2007; Popham & Ryan, 2012). Accordingly, measures of test

items’ instructional sensitivity may provide empirical support for validity claims

about the inferences on instruction derived from student test scores.

Instructional sensitivity is defined as the psychometric property of a test or a

single item to capture effects of instruction (Polikoff, 2010). Scores of instruc-

tionally sensitive tests are expected to increase with more or better teaching

(Baker, 1994). Students who received different instruction should produce dif-

ferent responses to highly instructionally sensitive items (Ing, 2008). Fundamen-

tally, instructional sensitivity relates to the observation of change in students’

responses on items as a consequence of instruction (Burstein, 1989). If item

responses do not change as a consequence of instruction, it may remain unclear

Journal of Educational and Behavioral Statistics

2017, Vol. 42, No. 6, pp. 678–705

DOI: 10.3102/1076998617703649

# 2017 AERA. http://jebs.aera.net

678

whether teaching was ineffective or thetest was insensitive (Naumann, Hoch-

weber, & Hartig, 2014). To test the hypothesis of whether an item is instruc-

tionally sensitive, various measures have been proposed (see Haladyna & Roid,

1981; Polikoff, 2010). Most commonly,these item sensitivity measures are

based on item parameters, that is, item difficulty or discrimination (Haladyna,

2004).

According to Naumann, Hochweber, and Klieme (2016), each item sensitivity

measure refers to one of the three perspectives on how to test the instructional

sensitivity of items. From the first perspective, instructional sensitivity is con-

ceived as change in item parameters between two time points of measurement,

while from the second perspective instructional sensitivity is conceived as dif-

ferences in item parameters between at least two groups (e.g., treatment and

control groups or classes) within a sample. The third perspective is a combination

of the two preceding ones, which allows deriving measures addressing two facets

of item sensitivity: global and differential sensitivity. Global sensitivity refers to

the extent to which item parameters change on average across time. Differential

sensitivity refers to the variation of change in parameters across groups,

indicating an item’s capacity of detecting differences in group-specific learning.

Overall, these perspectives provide an elaborate framework for the measurement

of instructional sensitivity based on item statistics by highlighting the relevant

sources of variance: variance between (a) time points, (b) groups, and (c) groups

and time points. As item sensitivity measures rooted in different perspectives

target different sources of variance, they do not necessarily provide consistent

results (Naumann et al., 2014).

Yet the three perspectives are not sufficient for describing common charac-

teristics and distinctions of instructional sensitivity measures. Actually, instruc-

tional sensitivity measures referring to the same perspective may address two

essentially different hypotheses regarding item sensitivity: Some measures relate

to the hypothesis of whether an item is sensitive at all, that is,absolute sensitivity,

while others relate to the hypothesis of whether an item substantially deviates

from the test’s overall sensitivity, that is,relative sensitivity.

This additional distinction has important theoretical and practical implications

for the evaluation of instructional sensitivity. For example, studies have shown

that the most commonly applied approaches, the Pretest–Posttest Difference

Index (PPDI; Cox & Vargas, 1966) and differential item functioning (DIF)-based

methods (e.g., Linn & Harnisch, 1981; Robitzsch, 2009), are inconsistent in their

judgment of item sensitivity (Li, Ruiz-Primo, & Wills, 2012; Naumann et al.,

2014). One reason for this finding lies in the difference of the perspective taken

on instructional sensitivity by these approaches (Naumann, Hochweber, &

Klieme, 2016): While the PPDI focuses on change in item difficulties across

time points, DIF approaches focus on differences in item difficulty between at

least two groups of students (e.g., treatment groups or courses or classes) within a

sample. Yet another reason is that the approaches differ in the way they measure

Naumann et al.

679

instructional sensitivity: While the PPDI is an absolute sensitivity measure, DIF

approaches provide relative measures of item sensitivity.

Thus, in the present study, we aim to contribute to the measurement frame-

work of instructional sensitivity by introducing the distinction between absolute

and relative measures. Absolute and relative measures may be distinguished

within each of the three perspectives on instructional sensitivity and provide

unique and valuable information on item functioning in educational assessments

when inferences on schools, teachers, or teaching are to be drawn. In the follow-

ing, we will first elaborate on the distinction of absolute and relative measures.

We will point out how absolute and relative measures relate to test sensitivity and

current approaches to the instructional sensitivity of items. Second, we will

provide a model-based approach that allows testing the hypothesis of whether

items are absolutely and/or relatively sensitive within a more general item

response theory (IRT) framework. For illustration purposes, we apply our

approach to simulated and empirical item response data. Finally, we will discuss

implications for the measurement of instructional sensitivity, test development,

and test score interpretation.

Extending the Measurement Framework of Instructional Sensitivity

Figure 1 depicts an extended measurement framework. The extended mea-

surement framework comprises the three perspectives as well as the two sensi-

tivity facets—global and differentialsensitivity—that can be distinguished

within the groups and time points perspective following Naumann and col-

leagues (2016). In addition, we draw the distinction between absolute and rela-

tive item sensitivity measures within each perspective, making explicit that two

different hypotheses regarding item sensitivity may be tested via absolute and

relative measures.

Absolute measures address the hypothesis of whether a single item is sensitive

to instruction. In principle, absolute measures summarize a single item’s total

capacity of capturing potential effects of instruction in terms of variation in item

parameters across time, groups, or both. Hence, absolute measures are expected

to approach zero the less sensitive an item is and depart from zero the higher the

item’s sensitivity to instruction is.

In contrast, relative measures address the hypothesis of whether a single

item’s sensitivity substantially deviates from test sensitivity. Test sensitivity is

a concept that so far has only been implicitly used in the measurement of instruc-

tional sensitivity. In consistence with the predominant statistical notion of item

sensitivity (see Haladyna & Roid, 1981; Haladyna, 2004; Polikoff, 2010), test

sensitivity may be defined as the overall (i.e., unconditional) variation of

test scores across either time points, groups, or both (cf. Naumann et al.,

2016). Test sensitivity then is a prerequisite for what is commonly conceived

as the instructional sensitivity of a test, which typically refers to the proportion of

Absolute and Relative Measures of Instructional Sensitivity

680

variance in test scores explained by school, teacher, or teaching characteristics

(e.g., D’Agostino, Welsh, & Corson, 2007; Grossman, Cohen, Ronfeldt, &

Brown, 2014; Ing, 2008). Generally, test sensitivity captures the degree of item

sensitivity that is common to all the items within a test. Technically speaking, the

stronger the item sensitivity correlates across all test items, the higher the test

sensitivity. Accordingly, relative measures express the degree to which a single

item’s sensitivity differs from test sensitivity. More precisely, relative measures

are expected to approach zero the more an item’s sensitivity is in consistence

with test sensitivity and to be nonzero if the item’s sensitivity deviates from test

sensitivity.

In general, whether a specific item sensitivity measure is absolute or relative

depends on whether or not the underlying measurement model comprises one or

more parameters capturing test sensitivity. Absolute measures of sensitivity are

unconditional on test sensitivity while relative measures are conditional on test

sensitivity. That is, from each of the three perspectives, measures are obtainable

in two ways, either independently of (i.e., absolute) or depending on (i.e., rela-

tive) test sensitivity. As a result, there are eight different ways of measuring an

FIGURE 1.Extended measurement framework of instructional sensitivity comprising the

three perspectives, the two facets, and the eight absolute and relative sensitivity measures.

Naumann et al.

681

Absolute and relative measures of instructional sensitivity

Citations

Mathematical content knowledge and knowledge for teaching: exploring their distinguishability and contribution to student learning

Individuelle Beteiligung am Unterrichtsgespräch in Grundschulklassen: Wer ist (nicht) beteiligt und welche Konsequenzen hat das für den Lernerfolg?

Sensitivity of test items to teaching quality

Exploring teacher popularity: associations with teacher characteristics and student outcomes in primary school

Instruktionssensitivität von Tests und Items

References

R: A language and environment for statistical computing.

Bayesian data analysis.

JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling

Data Analysis Using Regression and Multilevel/Hierarchical Models

Reasonable mean-square fit values

Related Papers (5)

A Validation Study of the National Assessment Instruments for Young English Language Learners in Norway and Slovenia

Examining the Quality of English Test Items Using Psychometric and Linguistic Characteristics among Grade Six Pupils.

Accommodations for English Language Learner Students: The Effect of Linguistic Modification of Math Test Item Sets. Final Report. NCEE 2009-4079.

Assessing the Test Information Function and Differential Item Functioning for the TOEFL ® Junior™ Standard Test

A comparison of reliability and precision of subscore reporting methods for a state English language proficiency assessment