Journal ArticleDOI

Absolute and relative measures of instructional sensitivity

17 Apr 2017 - Journal of Educational and Behavioral Statistics (SAGE Publications, Los Angeles, CA) - Vol. 42, Iss. 6, pp. 678-705

Reference:
Naumann, Alexander; Hartig, Johannes; Hochweber, Jan: Absolute and relative measures of instructional sensitivity. In: Journal of Educational and Behavioral Statistics 42 (2017) 6, pp. 678-705.
URN: urn:nbn:de:0111-pedocs-156029 - DOI: 10.25656/01:15602
https://nbn-resolving.org/urn:nbn:de:0111-pedocs-156029
https://doi.org/10.25656/01:15602
Terms of use
We grant a non-exclusive, non-transferable, individual, and limited right to use this document. This document is solely intended for your personal, non-commercial use. Use of this document does not include any transfer of property rights, and it is subject to the following limitations: All copies of this document must retain all copyright information and other information regarding legal protection. You are not allowed to alter this document in any way, to copy it for public or commercial purposes, to exhibit the document in public, or to perform, distribute, or otherwise use the document in public. By using this particular document, you accept the above-stated conditions of use.
Contact:
peDOCS
DIPF | Leibniz-Institut für Bildungsforschung und Bildungsinformation
Informationszentrum (IZ) Bildung
E-Mail: pedocs@dipf.de
Internet: www.pedocs.de

Article
Absolute and Relative Measures
of Instructional Sensitivity
Alexander Naumann
Johannes Hartig
German Institute for International Educational Research (DIPF)
Jan Hochweber
University of Teacher Education St. Gallen (PHSG)
Valid inferences on teaching drawn from students' test scores require that tests are sensitive to the instruction students received in class. Accordingly, measures of the test items' instructional sensitivity provide empirical support for validity claims about inferences on instruction. In the present study, we first introduce the concepts of absolute and relative measures of instructional sensitivity. Absolute measures summarize a single item's total capacity of capturing effects of instruction, which is independent of the test's sensitivity. In contrast, relative measures summarize a single item's capacity of capturing effects of instruction relative to test sensitivity. Then, we propose a longitudinal multilevel item response theory model that allows estimating both types of measures depending on the identification constraints.

Keywords: instructional sensitivity; multilevel IRT; differential item functioning
Researchers as well as policymakers regularly rely on student performance data
to draw inferences on schools, teachers, or teaching (Creemers & Kyriakides,
2008; Pellegrino, 2002). Yet valid inferences drawn from student test scores
require that instruments are sensitive to the instruction that students have received
in class (Popham, 2007; Popham & Ryan, 2012). Accordingly, measures of test
items' instructional sensitivity may provide empirical support for validity claims
about the inferences on instruction derived from student test scores.
Instructional sensitivity is defined as the psychometric property of a test or a
single item to capture effects of instruction (Polikoff, 2010). Scores of instruc-
tionally sensitive tests are expected to increase with more or better teaching
(Baker, 1994). Students who received different instruction should produce dif-
ferent responses to highly instructionally sensitive items (Ing, 2008). Fundamen-
tally, instructional sensitivity relates to the observation of change in students'
responses on items as a consequence of instruction (Burstein, 1989). If item
responses do not change as a consequence of instruction, it may remain unclear
Journal of Educational and Behavioral Statistics
2017, Vol. 42, No. 6, pp. 678-705
DOI: 10.3102/1076998617703649
© 2017 AERA. http://jebs.aera.net

whether teaching was ineffective or the test was insensitive (Naumann, Hochweber, & Hartig, 2014). To test the hypothesis of whether an item is instructionally sensitive, various measures have been proposed (see Haladyna & Roid, 1981; Polikoff, 2010). Most commonly, these item sensitivity measures are based on item parameters, that is, item difficulty or discrimination (Haladyna, 2004).
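As a toy illustration of these two item parameters (our own sketch, not a procedure from the paper; the function name, the Rasch-style data generation, and all values are assumptions), the following computes classical difficulty (proportion correct) and discrimination (corrected item-total correlation) for a binary response matrix:

```python
import numpy as np

def item_statistics(responses: np.ndarray):
    """Classical statistics for a persons-by-items 0/1 response matrix."""
    difficulty = responses.mean(axis=0)      # proportion correct per item
    total = responses.sum(axis=1)
    discrimination = np.empty(responses.shape[1])
    for i in range(responses.shape[1]):
        rest = total - responses[:, i]       # total score excluding item i
        discrimination[i] = np.corrcoef(responses[:, i], rest)[0, 1]
    return difficulty, discrimination

# Toy data from a Rasch-like process: one common ability drives all items
rng = np.random.default_rng(7)
theta = rng.normal(size=(300, 1))                  # student abilities
b = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])          # item difficulties
resp = (rng.random((300, 5)) < 1 / (1 + np.exp(b - theta))).astype(int)
p, r = item_statistics(resp)
print("difficulty:", p.round(2))
print("discrimination:", r.round(2))
```

Sensitivity measures built on such parameters then ask how these statistics change with instruction, rather than what their level is at a single time point.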
According to Naumann, Hochweber, and Klieme (2016), each item sensitivity
measure refers to one of the three perspectives on how to test the instructional
sensitivity of items. From the first perspective, instructional sensitivity is con-
ceived as change in item parameters between two time points of measurement,
while from the second perspective instructional sensitivity is conceived as dif-
ferences in item parameters between at least two groups (e.g., treatment and
control groups or classes) within a sample. The third perspective is a combination
of the two preceding ones, which allows deriving measures addressing two facets
of item sensitivity: global and differential sensitivity. Global sensitivity refers to
the extent to which item parameters change on average across time. Differential
sensitivity refers to the variation of change in parameters across groups,
indicating an item's capacity of detecting differences in group-specific learning.
Overall, these perspectives provide an elaborate framework for the measurement
of instructional sensitivity based on item statistics by highlighting the relevant
sources of variance: variance between (a) time points, (b) groups, and (c) groups
and time points. As item sensitivity measures rooted in different perspectives
target different sources of variance, they do not necessarily provide consistent
results (Naumann et al., 2014).
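To make the three variance sources concrete, one can write the difficulty of item i for group g at time point t as a sum of components (this decomposition and all symbols are our illustration, not notation from the paper):

```latex
\beta_{igt} \;=\; \beta_i \;+\; \tau_{it} \;+\; \gamma_{ig} \;+\; \delta_{igt}
% Perspective (a) targets variation in \tau_{it} across time points, and
% perspective (b) targets variation in \gamma_{ig} across groups. Under the
% combined perspective (c), the average change across groups reflects global
% sensitivity, while the spread of \delta_{igt} across groups reflects
% differential sensitivity.
```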
Yet the three perspectives are not sufficient for describing common charac-
teristics and distinctions of instructional sensitivity measures. Actually, instruc-
tional sensitivity measures referring to the same perspective may address two
essentially different hypotheses regarding item sensitivity: Some measures relate
to the hypothesis of whether an item is sensitive at all, that is, absolute sensitivity, while others relate to the hypothesis of whether an item substantially deviates from the test's overall sensitivity, that is, relative sensitivity.
This additional distinction has important theoretical and practical implications
for the evaluation of instructional sensitivity. For example, studies have shown
that the most commonly applied approaches, the Pretest-Posttest Difference
Index (PPDI; Cox & Vargas, 1966) and differential item functioning (DIF)-based
methods (e.g., Linn & Harnisch, 1981; Robitzsch, 2009), are inconsistent in their
judgment of item sensitivity (Li, Ruiz-Primo, & Wills, 2012; Naumann et al.,
2014). One reason for this finding lies in the difference of the perspective taken
on instructional sensitivity by these approaches (Naumann, Hochweber, &
Klieme, 2016): While the PPDI focuses on change in item difficulties across
time points, DIF approaches focus on differences in item difficulty between at
least two groups of students (e.g., treatment groups or courses or classes) within a
sample. Yet another reason is that the approaches differ in the way they measure

instructional sensitivity: While the PPDI is an absolute sensitivity measure, DIF
approaches provide relative measures of item sensitivity.
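The contrast can be made tangible with a small sketch (our own toy code; the simplified "relative" index below merely subtracts the test's average change and is a stand-in for, not an implementation of, the cited DIF methods):

```python
import numpy as np

def ppdi(pre: np.ndarray, post: np.ndarray) -> np.ndarray:
    """Pretest-Posttest Difference Index per item (Cox & Vargas, 1966):
    change in proportion correct from pretest to posttest. An absolute
    measure: it ignores how much the other items change."""
    return post.mean(axis=0) - pre.mean(axis=0)

def relative_change(pre: np.ndarray, post: np.ndarray) -> np.ndarray:
    """Simplified relative measure: each item's change minus the mean
    change across items, i.e., conditional on overall test sensitivity."""
    change = ppdi(pre, post)
    return change - change.mean()

# Toy data: 200 students x 4 items before and after instruction
rng = np.random.default_rng(1)
pre = (rng.random((200, 4)) < np.array([0.3, 0.4, 0.5, 0.6])).astype(int)
post = (rng.random((200, 4)) < np.array([0.6, 0.7, 0.5, 0.9])).astype(int)
print("PPDI (absolute):", ppdi(pre, post).round(2))      # items 1, 2, 4 shift
print("relative:", relative_change(pre, post).round(2))  # item 3 lags the test
```

An item can thus look sensitive in absolute terms yet unremarkable in relative terms when all items shift together, and vice versa.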
Thus, in the present study, we aim to contribute to the measurement frame-
work of instructional sensitivity by introducing the distinction between absolute
and relative measures. Absolute and relative measures may be distinguished
within each of the three perspectives on instructional sensitivity and provide
unique and valuable information on item functioning in educational assessments
when inferences on schools, teachers, or teaching are to be drawn. In the follow-
ing, we will first elaborate on the distinction of absolute and relative measures.
We will point out how absolute and relative measures relate to test sensitivity and
current approaches to the instructional sensitivity of items. Second, we will
provide a model-based approach that allows testing the hypothesis of whether
items are absolutely and/or relatively sensitive within a more general item
response theory (IRT) framework. For illustration purposes, we apply our
approach to simulated and empirical item response data. Finally, we will discuss
implications for the measurement of instructional sensitivity, test development,
and test score interpretation.
Extending the Measurement Framework of Instructional Sensitivity
Figure 1 depicts an extended measurement framework. The extended mea-
surement framework comprises the three perspectives as well as the two sensitivity facets, global and differential sensitivity, that can be distinguished within the groups and time points perspective following Naumann and colleagues (2016).
leagues (2016). In addition, we draw the distinction between absolute and rela-
tive item sensitivity measures within each perspective, making explicit that two
different hypotheses regarding item sensitivity may be tested via absolute and
relative measures.
Absolute measures address the hypothesis of whether a single item is sensitive
to instruction. In principle, absolute measures summarize a single item's total capacity of capturing potential effects of instruction in terms of variation in item parameters across time, groups, or both. Hence, absolute measures are expected to approach zero the less sensitive an item is and to depart from zero the higher the item's sensitivity to instruction is.
In contrast, relative measures address the hypothesis of whether a single item's sensitivity substantially deviates from test sensitivity. Test sensitivity is
a concept that so far has only been implicitly used in the measurement of instruc-
tional sensitivity. Consistent with the predominant statistical notion of item sensitivity (see Haladyna & Roid, 1981; Haladyna, 2004; Polikoff, 2010), test
sensitivity may be defined as the overall (i.e., unconditional) variation of
test scores across either time points, groups, or both (cf. Naumann et al.,
2016). Test sensitivity then is a prerequisite for what is commonly conceived
as the instructional sensitivity of a test, which typically refers to the proportion of
variance in test scores explained by school, teacher, or teaching characteristics
(e.g., D'Agostino, Welsh, & Corson, 2007; Grossman, Cohen, Ronfeldt, &
Brown, 2014; Ing, 2008). Generally, test sensitivity captures the degree of item
sensitivity that is common to all the items within a test. Technically speaking, the more strongly item sensitivity correlates across all test items, the higher the test sensitivity. Accordingly, relative measures express the degree to which a single item's sensitivity differs from test sensitivity. More precisely, relative measures are expected to approach zero the more an item's sensitivity is consistent with test sensitivity and to be nonzero if the item's sensitivity deviates from test sensitivity.
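In symbols (our notation, purely for illustration): if \delta_i denotes the instruction-related shift in item i's parameter and the test contains I items, then

```latex
\bar{\delta} \;=\; \frac{1}{I}\sum_{i=1}^{I} \delta_i \quad\text{(test sensitivity)},
\qquad
\underbrace{\delta_i}_{\text{absolute measure}},
\qquad
\underbrace{\delta_i - \bar{\delta}}_{\text{relative measure}}.
```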
In general, whether a specific item sensitivity measure is absolute or relative depends on whether the underlying measurement model comprises one or more parameters capturing test sensitivity. Absolute measures of sensitivity are unconditional on test sensitivity, while relative measures are conditional on test sensitivity. That is, from each of the three perspectives, measures are obtainable in two ways, either independently of (i.e., absolute) or depending on (i.e., relative) test sensitivity. As a result, there are eight different ways of measuring an item's instructional sensitivity.

FIGURE 1. Extended measurement framework of instructional sensitivity comprising the three perspectives, the two facets, and the eight absolute and relative sensitivity measures.
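The longitudinal multilevel IRT model itself lies outside this excerpt. As a hedged sketch of how identification constraints can yield either type of measure (model, symbols, and constraints are our reconstruction from the abstract, not the authors' specification), consider a Rasch-type model for person p in class j answering item i at time t ∈ {0, 1}:

```latex
\operatorname{logit}\Pr(Y_{pjit}=1)
  \;=\; \theta_{pj} + t\,\mu_j \;-\; \beta_i \;-\; t\,(\delta_i + \delta_{ij}),
\qquad \delta_{ij} \sim N(0, \sigma^2_{\delta i}).
% \theta_{pj}: ability at t = 0;  \mu_j: class-level latent change;
% \delta_i: average shift of item i;  \delta_{ij}: class-specific shift.
% Fixing the latent change (e.g., mean of \mu_j = 0) identifies the \delta_i
% as absolute shifts in item difficulty; constraining the item shifts instead
% (e.g., \sum_i \delta_i = 0) pushes the common shift into \mu_j, so the
% \delta_i become deviations from test sensitivity, i.e., relative measures.
% The variances \sigma^2_{\delta i} would capture differential sensitivity.
```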
Naumann et al.
681

Citations
Journal ArticleDOI
TL;DR: In this paper, a multi-year study of over 200 fourth- and fifth-grade US teachers revealed that teacher knowledge positively predicts student achievement gains, although empirical findings on the distinguishability of disciplinary knowledge and the knowledge needed for teaching, and their relationship with student outcomes, are mixed.
Abstract: During the last three decades, scholars have proposed several conceptual structures to represent teacher knowledge. A common denominator in this work is the assumption that disciplinary knowledge and the knowledge needed for teaching are distinct. However, empirical findings on the distinguishability of these two knowledge components, and their relationship with student outcomes, are mixed. In this replication and extension study, we explore these issues, drawing on evidence from a multi-year study of over 200 fourth- and fifth-grade US teachers. Exploratory and confirmatory factor analyses of these data suggested a single dimension for teacher knowledge. Value-added models predicting student test outcomes on both state tests and a test with cognitively challenging tasks revealed that teacher knowledge positively predicts student achievement gains. We consider the implications of these findings for teacher selection and education.

22 citations

Journal ArticleDOI
TL;DR: Active participation in classroom discourse is considered an important building block of school learning and an indicator of education-related participation.
Abstract: Summary. Active participation in classroom discourse is considered an important building block of school learning and an indicator of education-related participation. In the present study ...

15 citations

Journal ArticleDOI
TL;DR: In this article, the authors investigated test and item sensitivity to teaching quality, reanalyzing data from a quasi-experimental intervention study in primary school science education (1,026 students, 53 classes, M_age = 8.79 years, SD_age = 0.49, 50% female).

13 citations

Journal ArticleDOI
TL;DR: In this article, conditions and consequences of teacher popularity in primary schools were investigated, and teacher popularity was embedded in a theoretical framework that describes relationships between teacher competence, teaching quality, and student outcomes.
Abstract: In this study, we investigated conditions and consequences of teacher popularity in primary schools. Teacher popularity is embedded in a theoretical framework that describes relationships between teacher competence, teaching quality, and student outcomes. In the empirical analyses, we used multilevel modeling to distinguish between individual students’ liking of the teacher and a teacher’s popularity as rated by the whole class (N = 1070 students, 54 teachers). The classroom level composite of the extent to which students liked their teacher was a reliable indicator of teacher popularity. Teacher popularity was associated with teacher self-reports of self-efficacy and teaching enthusiasm and with external observers’ ratings of teaching quality. The grades students received were not related to the popularity ratings. In a longitudinal study, teacher popularity predicted students’ learning gains and interest development over and above the effects of teaching quality. These results suggest that teacher popularity can be a useful and informative indicator in research on students’ academic development and teacher effectiveness.

10 citations

Journal ArticleDOI
TL;DR: In this article, the authors discuss that it can remain empirically unclear whether a test was not instructionally sensitive or instruction was not effective.
Abstract: Students' test results regularly serve as a central criterion for judging the effectiveness of schools and teaching. Valid conclusions about schools and teaching presuppose that the test instruments used can capture possible effects of instruction, that is, that they are instructionally sensitive. However, this prerequisite is rarely examined empirically. It thus sometimes remains unclear whether a test was not instructionally sensitive or instruction was not effective. Resolving this question requires empirically investigating the instructional sensitivity of the tests and items used. While instructional sensitivity has long been discussed in the United States, the concept has so far received little attention in the German-language discourse. Our work therefore aims to embed the concept of instructional sensitivity in the German-language discourse on educational assessment. To this end, three topics are addressed: (a) the theoretical background of the concept of instructional sensitivity, (b) the measurement of instructional sensitivity, and (c) the identification of further research needs.

6 citations

References
Journal ArticleDOI

50 citations


"Absolute and relative measures of i..." refers background in this paper

  • ...To test the hypothesis of whether an item is instructionally sensitive, various measures have been proposed (see Haladyna & Roid, 1981; Polikoff, 2010)....


  • ...In consistence with the predominant statistical notion of item sensitivity (see Haladyna & Roid, 1981; Haladyna, 2004; Polikoff, 2010), test sensitivity may be defined as the overall (i.e., unconditional) variation of test scores across either time points, groups, or both (cf. Naumann et al., 2016)....


Journal ArticleDOI
TL;DR: Bayesian tests (Bayes factor, deviance information criterion) are proposed which enable multiple marginal invariance hypotheses to be tested simultaneously and show that background information can be used to explain cross-national variation in item functioning.
Abstract: Random item effects models provide a natural framework for the exploration of violations of measurement invariance without the need for anchor items. Within the random item effects modelling framework, Bayesian tests (Bayes factor, deviance information criterion) are proposed which enable multiple marginal invariance hypotheses to be tested simultaneously. The performance of the tests is evaluated with a simulation study which shows that the tests have high power and low Type I error rate. Data from the European Social Survey are used to test for measurement invariance of attitude towards immigrant items and to show that background information can be used to explain cross-national variation in item functioning.

49 citations


"Absolute and relative measures of i..." refers methods in this paper

  • ...We checked items’ absolute and relative differential sensitivity, that is, the variance components φ²₂ᵢ, following a procedure by Verhagen and colleagues (Verhagen & Fox, 2013; Verhagen, Levy, Millsap, & Fox, 2015)....


Journal ArticleDOI
TL;DR: In this article, the authors developed a method for capturing the alignment between how teachers bring standards to life in their classrooms and how the standards are defined on a test, and found that the best predictor of classroom achievement was the match between how teachers brought the state's academic standards to life and how they were defined on the state test.
Abstract: The accuracy of achievement test score inferences largely depends on the sensitivity of scores to instruction focused on tested objectives. Sensitivity requirements are particularly challenging for standards-based assessments because a variety of plausible instructional differences across classrooms must be detected. For this study, we developed a new method for capturing the alignment between how teachers bring standards to life in their classrooms and how the standards are defined on a test. Teachers were asked to report the degree to which they emphasized the state's academic standards, and to describe how they taught certain objectives from the standards. Two curriculum experts judged the alignment between how teachers brought the objectives to life in their classrooms and how the objectives were operationalized on the state test. Emphasis alone did not account for achievement differences among classrooms. The best predictors of classroom achievement were the match between how the standards w...

47 citations

Book ChapterDOI
01 Jan 2004
TL;DR: In this article, the authors consider the inclusion of person-by-item predictors into the model and distinguish between static and dynamic interaction models, focusing on models for differential item functioning (DIF) and local item dependencies.
Abstract: In this chapter we consider the inclusion of person-by-item predictors into the model. Unlike person predictors or item predictors, person-by-item predictors vary both within and between persons. The inclusion of person-by-item predictors besides person predictors or item predictors is relevant for modeling various phenomena such as differential item functioning (DIF) and local item dependencies (LID) (see Zwinderman, 1997). To describe models with person-by-item predictors we will distinguish between static and dynamic interaction models. We concentrate here on models for DIF and LID, but the interaction concept is of course more general.

45 citations


"Absolute and relative measures of i..." refers background in this paper

  • ...DIF approaches from the groups perspective focus on cross-sectional data and may become computationally rather demanding when accounting for multilevel structures (multilevel DIF; Meulders & Xie, 2004)....


Journal ArticleDOI
TL;DR: In this paper, the authors examined the potential to improve matching by conditioning simultaneously on test score and a categorical variable representing the educational background of the examinees using a logistic regression procedure.
Abstract: When tests are designed to measure dimensionally complex material, DIF analysis with matching based on the total test score may be inappropriate. Previous research has demonstrated that matching can be improved by using multiple internal or both internal and external measures to more completely account for the latent ability space. The present article extends this line of research by examining the potential to improve matching by conditioning simultaneously on test score and a categorical variable representing the educational background of the examinees. The responses of male and female examinees from a test of medical competence were analyzed using a logistic regression procedure. Results show a substantial reduction in the number of items identified as displaying significant DIF when conditioning is based on total test score and a variable representing educational background as opposed to total test score only.

41 citations