Journal Article

Absolute and relative measures of instructional sensitivity

17 Apr 2017, Journal of Educational and Behavioral Statistics (SAGE Publications: Los Angeles, CA), Vol. 42, Iss. 6, pp. 678-705

Abstract: Valid inferences on teaching drawn from students' test scores require that tests are sensitive to the instruction students received in class. Accordingly, measures of the test items' instructional ...

Topics: Item response theory (51%), Test theory (51%), Test (assessment) (51%)


Reference:
Naumann, Alexander; Hartig, Johannes; Hochweber, Jan: Absolute and relative measures of
instructional sensitivity - In: Journal of educational and behavioral statistics 42 (2017) 6, S. 678-705 -
URN: urn:nbn:de:0111-pedocs-156029 - DOI: 10.25656/01:15602
Terms of use
We grant a non-exclusive, non-transferable, individual, and limited right to use this document. This document is solely intended for your personal, non-commercial use. Use of this document does not include any transfer of property rights, and it is conditional on the following limitations: All copies of this document must retain all copyright information and other information regarding legal protection. You are not allowed to alter this document in any way, to copy it for public or commercial purposes, to exhibit the document in public, or to perform, distribute, or otherwise use the document in public. By using this particular document, you accept the above-stated conditions of use.
Contact:
DIPF | Leibniz-Institut für Bildungsforschung und Bildungsinformation
Informationszentrum (IZ) Bildung

Absolute and Relative Measures
of Instructional Sensitivity
Alexander Naumann
Johannes Hartig
German Institute for International Educational Research (DIPF)
Jan Hochweber
University of Teacher Education St. Gallen (PHSG)
Valid inferences on teaching drawn from students' test scores require that tests
are sensitive to the instruction students received in class. Accordingly, measures
of the test items' instructional sensitivity provide empirical support for validity
claims about inferences on instruction. In the present study, we first introduce
the concepts of absolute and relative measures of instructional sensitivity.
Absolute measures summarize a single item's total capacity of capturing effects
of instruction, which is independent of the test's sensitivity. In contrast, relative
measures summarize a single item's capacity of capturing effects of instruction
relative to test sensitivity. Then, we propose a longitudinal multilevel item
response theory model that allows estimating both types of measures depending
on the identification constraints.
Keywords: instructional sensitivity; multilevel IRT; differential item functioning
Researchers as well as policymakers regularly rely on student performance data
to draw inferences on schools, teachers, or teaching (Creemers & Kyriakides,
2008; Pellegrino, 2002). Yet valid inferences drawn from student test scores
require that instruments are sensitive to the instruction that students have received
in class (Popham, 2007; Popham & Ryan, 2012). Accordingly, measures of test
items' instructional sensitivity may provide empirical support for validity claims
about the inferences on instruction derived from student test scores.
Instructional sensitivity is defined as the psychometric property of a test or a
single item to capture effects of instruction (Polikoff, 2010). Scores of
instructionally sensitive tests are expected to increase with more or better teaching
(Baker, 1994). Students who received different instruction should produce
different responses to highly instructionally sensitive items (Ing, 2008).
Fundamentally, instructional sensitivity relates to the observation of change in
students' responses on items as a consequence of instruction (Burstein, 1989). If item
responses do not change as a consequence of instruction, it may remain unclear
whether teaching was ineffective or the test was insensitive (Naumann,
Hochweber, & Hartig, 2014). To test the hypothesis of whether an item is
instructionally sensitive, various measures have been proposed (see Haladyna & Roid,
1981; Polikoff, 2010). Most commonly, these item sensitivity measures are
based on item parameters, that is, item difficulty or discrimination (Haladyna, 2004).

Journal of Educational and Behavioral Statistics, 2017, Vol. 42, No. 6, pp. 678-705.
DOI: 10.3102/1076998617703649
© 2017 AERA.
According to Naumann, Hochweber, and Klieme (2016), each item sensitivity
measure refers to one of the three perspectives on how to test the instructional
sensitivity of items. From the first perspective, instructional sensitivity is
conceived as change in item parameters between two time points of measurement,
while from the second perspective instructional sensitivity is conceived as
differences in item parameters between at least two groups (e.g., treatment and
control groups or classes) within a sample. The third perspective is a combination
of the two preceding ones, which allows deriving measures addressing two facets
of item sensitivity: global and differential sensitivity. Global sensitivity refers to
the extent to which item parameters change on average across time. Differential
sensitivity refers to the variation of change in parameters across groups,
indicating an item's capacity of detecting differences in group-specific learning.
Overall, these perspectives provide an elaborate framework for the measurement
of instructional sensitivity based on item statistics by highlighting the relevant
sources of variance: variance between (a) time points, (b) groups, and (c) groups
and time points. As item sensitivity measures rooted in different perspectives
target different sources of variance, they do not necessarily provide consistent
results (Naumann et al., 2014).
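To make these variance sources concrete, the following is a minimal sketch with hypothetical data; classical proportion-correct values stand in for IRT item parameters, and all numbers are illustrative, not taken from the article:

```python
import numpy as np

# Hypothetical proportion-correct values for 3 items observed in
# 4 classes (groups) at pretest and posttest. Shape: (classes, items).
pre = np.array([
    [0.40, 0.55, 0.70],
    [0.42, 0.53, 0.68],
    [0.38, 0.56, 0.72],
    [0.41, 0.54, 0.69],
])
post = np.array([
    [0.70, 0.56, 0.80],
    [0.60, 0.52, 0.78],
    [0.75, 0.57, 0.82],
    [0.55, 0.55, 0.79],
])

change = post - pre  # per-class, per-item change across time

# Time-points perspective: average change across classes
# ("global sensitivity" of each item).
global_sensitivity = change.mean(axis=0)

# Groups-and-time-points perspective: variation of change across
# classes ("differential sensitivity" of each item).
differential_sensitivity = change.std(axis=0, ddof=1)

print(global_sensitivity.round(3))
print(differential_sensitivity.round(3))
```

In this toy data, item 1 shows both a large average change and strong variation across classes, item 2 barely changes at all, and item 3 changes globally but identically in every class, illustrating how the two facets can come apart.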
Yet the three perspectives are not sufficient for describing common
characteristics and distinctions of instructional sensitivity measures. Actually,
instructional sensitivity measures referring to the same perspective may address two
essentially different hypotheses regarding item sensitivity: Some measures relate
to the hypothesis of whether an item is sensitive at all, that is, absolute sensitivity,
while others relate to the hypothesis of whether an item substantially deviates
from the test's overall sensitivity, that is, relative sensitivity.
This additional distinction has important theoretical and practical implications
for the evaluation of instructional sensitivity. For example, studies have shown
that the most commonly applied approaches, the Pretest-Posttest Difference
Index (PPDI; Cox & Vargas, 1966) and differential item functioning (DIF)-based
methods (e.g., Linn & Harnisch, 1981; Robitzsch, 2009), are inconsistent in their
judgment of item sensitivity (Li, Ruiz-Primo, & Wills, 2012; Naumann et al.,
2014). One reason for this finding lies in the difference of the perspective taken
on instructional sensitivity by these approaches (Naumann, Hochweber, &
Klieme, 2016): While the PPDI focuses on change in item difficulties across
time points, DIF approaches focus on differences in item difficulty between at
least two groups of students (e.g., treatment groups or courses or classes) within a
sample. Yet another reason is that the approaches differ in the way they measure
instructional sensitivity: While the PPDI is an absolute sensitivity measure, DIF
approaches provide relative measures of item sensitivity.
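As a sketch of how the two measure types can disagree, consider the following hypothetical p-values. The PPDI is taken here as the raw pre-post gain in proportion correct, and the "relative" index is a simplified DIF-style deviation from the test's average gain; neither is meant to reproduce the exact procedures of the cited studies:

```python
import numpy as np

# Hypothetical p-values for 4 items before and after instruction.
pre = np.array([0.30, 0.35, 0.40, 0.45])
post = np.array([0.60, 0.65, 0.70, 0.50])

# Absolute measure: Pretest-Posttest Difference Index (PPDI),
# the gain in proportion correct per item.
ppdi = post - pre

# Relative (DIF-style) sketch: deviation of each item's gain from the
# test's average gain, i.e., conditional on test sensitivity.
relative = ppdi - ppdi.mean()

print(ppdi)
print(relative)
```

Items 1-3 gain 0.30 each, so their PPDI flags them as sensitive, yet their relative values are small because the whole test shifts with them; item 4 gains only 0.05 and is the only one that stands out relative to the test. The two measure types thus judge the items inconsistently, as reported by Li et al. (2012) and Naumann et al. (2014).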
Thus, in the present study, we aim to contribute to the measurement framework
of instructional sensitivity by introducing the distinction between absolute
and relative measures. Absolute and relative measures may be distinguished
within each of the three perspectives on instructional sensitivity and provide
unique and valuable information on item functioning in educational assessments
when inferences on schools, teachers, or teaching are to be drawn. In the
following, we will first elaborate on the distinction of absolute and relative measures.
We will point out how absolute and relative measures relate to test sensitivity and
current approaches to the instructional sensitivity of items. Second, we will
provide a model-based approach that allows testing the hypothesis of whether
items are absolutely and/or relatively sensitive within a more general item
response theory (IRT) framework. For illustration purposes, we apply our
approach to simulated and empirical item response data. Finally, we will discuss
implications for the measurement of instructional sensitivity, test development,
and test score interpretation.
Extending the Measurement Framework of Instructional Sensitivity
Figure 1 depicts an extended measurement framework. The extended
measurement framework comprises the three perspectives as well as the two
sensitivity facets (global and differential sensitivity) that can be distinguished
within the groups and time points perspective following Naumann and
colleagues (2016). In addition, we draw the distinction between absolute and
relative item sensitivity measures within each perspective, making explicit that two
different hypotheses regarding item sensitivity may be tested via absolute and
relative measures.
Absolute measures address the hypothesis of whether a single item is sensitive
to instruction. In principle, absolute measures summarize a single item's total
capacity of capturing potential effects of instruction in terms of variation in item
parameters across time, groups, or both. Hence, absolute measures are expected
to approach zero the less sensitive an item is and depart from zero the higher the
item's sensitivity to instruction is.
In contrast, relative measures address the hypothesis of whether a single
item's sensitivity substantially deviates from test sensitivity. Test sensitivity is
a concept that so far has only been implicitly used in the measurement of
instructional sensitivity. Consistent with the predominant statistical notion of item
sensitivity (see Haladyna & Roid, 1981; Haladyna, 2004; Polikoff, 2010), test
sensitivity may be defined as the overall (i.e., unconditional) variation of
test scores across either time points, groups, or both (cf. Naumann et al.,
2016). Test sensitivity then is a prerequisite for what is commonly conceived
as the instructional sensitivity of a test, which typically refers to the proportion of
variance in test scores explained by school, teacher, or teaching characteristics
(e.g., D'Agostino, Welsh, & Corson, 2007; Grossman, Cohen, Ronfeldt, &
Brown, 2014; Ing, 2008). Generally, test sensitivity captures the degree of item
sensitivity that is common to all the items within a test. Technically speaking, the
stronger the item sensitivity correlates across all test items, the higher the test
sensitivity. Accordingly, relative measures express the degree to which a single
item's sensitivity differs from test sensitivity. More precisely, relative measures
are expected to approach zero the more an item's sensitivity is consistent
with test sensitivity and to be nonzero if the item's sensitivity deviates from test
sensitivity.
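The claim that test sensitivity rises with the correlation of item sensitivities can be illustrated with a small simulation (hypothetical data and variable names; class-level changes in proportion correct stand in for item-parameter changes):

```python
import numpy as np

rng = np.random.default_rng(1)
n_classes = 50

# Hypothetical class-level instructional effect shared by all items.
shared_effect = rng.normal(0.5, 0.2, n_classes)

# Per-class change for three items: items 1 and 2 track the shared
# effect closely, while item 3 changes idiosyncratically.
change = np.column_stack([
    shared_effect + rng.normal(0.0, 0.05, n_classes),
    shared_effect + rng.normal(0.0, 0.05, n_classes),
    rng.normal(0.5, 0.2, n_classes),
])

# The more strongly the item sensitivities correlate, the more of the
# variation in change is shared, i.e., the higher the test sensitivity.
corr = np.corrcoef(change, rowvar=False)
print(corr.round(2))
```

Items 1 and 2 correlate strongly and jointly constitute the test's sensitivity; item 3, although it changes just as much on average, contributes little to it and would stand out on a relative measure.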
In general, whether a specific item sensitivity measure is absolute or relative
depends on whether or not the underlying measurement model comprises one or
more parameters capturing test sensitivity. Absolute measures of sensitivity are
unconditional on test sensitivity while relative measures are conditional on test
sensitivity. That is, from each of the three perspectives, measures are obtainable
in two ways, either independently of (i.e., absolute) or depending on (i.e.,
relative) test sensitivity. As a result, there are eight different ways of measuring an
item's instructional sensitivity.

FIGURE 1. Extended measurement framework of instructional sensitivity comprising the
three perspectives, the two facets, and the eight absolute and relative sensitivity measures.
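How identification constraints decide between the two measure types can be sketched with a toy example (hypothetical numbers; in the article, this role is played by constraints on the longitudinal multilevel IRT model):

```python
import numpy as np

# Hypothetical average change in easiness for five items after instruction.
item_change = np.array([0.8, 0.9, 0.7, 1.0, 0.1])

# Identification A: the model contains no parameter capturing test
# sensitivity, so each item effect is interpreted on its own -> absolute.
absolute = item_change

# Identification B: a test-sensitivity parameter absorbs the average
# change, and item effects are centered on it -> relative.
test_sensitivity = item_change.mean()
relative = item_change - test_sensitivity

print(absolute)
print(test_sensitivity)
print(relative)
```

Under identification A, item 5 looks insensitive (its effect is near zero) while items 1-4 look sensitive; under identification B, items 1-4 are unremarkable relative to the test, and item 5 is the one that deviates. The same data thus answer two different hypotheses depending on how the model is identified.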