
Article
Applied Psychological Measurement
2017, Vol. 41(3), 178–194
© The Author(s) 2016
Reprints and permissions: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0146621616677520
journals.sagepub.com/home/apm
Critical Values for Yen's Q3: Identification of Local Dependence in the Rasch Model Using Residual Correlations

Karl Bang Christensen (1), Guido Makransky (2), and Mike Horton (3)
Abstract
The assumption of local independence is central to all item response theory (IRT) models. Violations can lead to inflated estimates of reliability and problems with construct validity. For the most widely used fit statistic, Q3, there are currently no well-documented suggestions of the critical values which should be used to indicate local dependence (LD), and for this reason, a variety of arbitrary rules of thumb are used. In this study, an empirical data example and Monte Carlo simulation were used to investigate the different factors that can influence the null distribution of residual correlations, with the objective of proposing guidelines that researchers and practitioners can follow when making decisions about LD during scale development and validation. A parametric bootstrapping procedure should be implemented in each separate situation to obtain the critical value of LD applicable to the data set; example critical values are provided for a number of data structure situations. The results show that for the Q3 fit statistic, no single critical value is appropriate for all situations, as the percentiles in the empirical null distribution are influenced by the number of items, the sample size, and the number of response categories. Furthermore, the results show that LD should be considered relative to the average observed residual correlation, rather than to a uniform value, as this results in more stable percentiles for the null distribution of an adjusted fit statistic.
Keywords
local dependence, Rasch model, Yen's Q3, residual correlations, Monte Carlo simulation
(1) University of Copenhagen, Denmark
(2) University of Southern Denmark, Odense, Denmark
(3) University of Leeds, UK

Corresponding Author:
Karl Bang Christensen, Section of Biostatistics, Department of Public Health, University of Copenhagen, P.O. Box 2099, Copenhagen DK-1014, Denmark.
Email: KACH@sund.ku.dk

Introduction
Statistical independence of two variables implies that knowledge about one variable does not change the expectations about another variable. Thus, test items X_1, ..., X_I are not independent, because a student giving a correct answer to one test item would change the expectation of his or her probability of also giving a correct answer to another item in the same test. A fundamental assumption in the Rasch (1960) model and in other item response theory (IRT) models is that item responses are conditionally independent given the latent variable:

P(X_1 = x_1, \ldots, X_I = x_I \mid \theta) = \prod_{i=1}^{I} P(X_i = x_i \mid \theta). \quad (1)
The items should only be correlated through the latent trait that the test is measuring (Lord &
Novick, 1968). This is generally referred to as local independence (Lazarsfeld & Henry, 1968).
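Equation 1 also gives a direct recipe for simulating locally independent data: given θ, draw each item response independently. A minimal sketch for dichotomous Rasch items (the sample size, difficulty range, and seed below are illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_rasch(theta, b, rng):
    """Simulate dichotomous responses under local independence (Equation 1):
    given theta, each item is drawn independently with
    P(X_i = 1 | theta) = exp(theta - b_i) / (1 + exp(theta - b_i))."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))  # N x I success probabilities
    return (rng.random(p.shape) < p).astype(int)

theta = rng.normal(0.0, 1.0, size=500)   # person parameters
b = np.linspace(-2.0, 2.0, 10)           # evenly spaced item difficulties
X = simulate_rasch(theta, b, rng)        # 500 x 10 response matrix
```

Because the rows are drawn independently given θ, any correlation between item columns is attributable to the latent variable alone.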
The assumption of local independence can be violated through response dependency and multidimensionality, and these violations are often referred to under the umbrella term of "local dependence" (LD). Both of these situations yield interitem correlations beyond what can be attributed to the latent variable, but for very different reasons. Response dependency occurs when items are linked in some way, such that the response on one item governs the response on another because of similarities in, for example, item content or response format. A typical example is where several walking items are included in the same scale. If a person can walk several miles without difficulty, then that person must be able to walk 1 mile, or any lesser distance, without difficulty (Tennant & Conaghan, 2007). This is a structural dependency which is inherent within the items, because there is no other logical way in which a person may validly respond. Another form of LD could be caused by a redundancy dependency, where the degree of overlap within the content of items is such that the items are not independent (i.e., where the same question is essentially asked twice, using slightly different language or synonymous descriptive words). Yen (1993) offered an in-depth discussion of ways that the format and presentation of items can cause LD.
Violation of the local independence assumption through multidimensionality is typically seen for instruments composed of bundles of items that measure different aspects of the latent variable or different domains of a broader latent construct. In this case, the higher order latent variable alone might not account for correlation between items in the same bundle.
Violations of local independence in a unidimensional scale will influence estimation of person parameters and can lead to inflated estimates of reliability and problems with construct validity. Consequences of LD have been described in detail elsewhere (Lucke, 2005; Marais, 2009; Marais & Andrich, 2008a; Scott & Ip, 2002; Yen, 1993). Ignoring LD in a unidimensional scale thus leads to reporting of inflated reliability, giving a false impression of the accuracy and precision of estimates (Marais, 2013). For a discussion of the effect of multidimensionality on estimates of reliability, see Marais and Andrich (2008b).
Detecting LD
One of the earliest methods for detecting LD in the Rasch model is the fit measure Q2 (van den Wollenberg, 1982), which was derived from contingency tables and used the sufficiency properties of the Rasch model. Kelderman (1984) expressed the Rasch model as a log-linear model in which LD can be shown to correspond to interactions between items. Log-linear Rasch models have also been considered by Haberman (2007) and by Kreiner and Christensen (2004, 2007), who proposed to test for LD by evaluating partial correlations using an approach similar to the Mantel–Haenszel analysis of differential item functioning (DIF; Holland & Thayer, 1988).
The latter approach is readily implemented in standard software such as SAS or SPSS. Notably, Kreiner and Christensen (2007) argued that the log-linear Rasch models proposed by Kelderman that incorporate LD still provide essentially valid and objective measurement, and described the measurement properties of such models. Furthermore, a way of quantifying LD has been proposed by Andrich and Kreiner (2010) for two dichotomous items. It is based on splitting a dependent item into two new ones, according to the responses to the other item within the dependent pair. LD is then easily quantified by estimating the difference d between the item locations of the two new items. However, Andrich and Kreiner do not go on to investigate whether d is statistically significant. For the partial credit model (Masters, 1982) and the rating scale model (Andrich, 1978), a generalized version of this methodology exists (Andrich, Humphry, & Marais, 2012).
Beyond the Rasch model, Yen (1984) proposed the Q3 statistic for detecting LD in the three-parameter logistic (3PL) model. This fit statistic is based on the item residuals,

d_i = X_i - E(X_i \mid \hat{\theta}), \quad (2)

and computed as the Pearson correlation (taken over examinees),

Q_{3,ij} = r_{d_i d_j}, \quad (3)

where d_i and d_j are item residuals for items i and j, respectively. This method is often used for the Rasch model, the partial credit model, and the rating scale model.
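The residuals of Equation 2 and the correlations of Equation 3 are straightforward to compute once person estimates are available. A sketch for the dichotomous Rasch case (the function name is ours, and θ̂ and the item difficulties are assumed to come from a prior estimation step):

```python
import numpy as np

def q3_matrix(X, theta_hat, b):
    """Yen's Q3: Pearson correlations (taken over examinees) of the item
    residuals d_i = X_i - E(X_i | theta_hat), for a dichotomous Rasch model.
    X: N x I matrix of 0/1 responses; theta_hat: N person estimates;
    b: I item difficulties."""
    # E(X_i | theta) under the Rasch model is logistic in (theta - b_i)
    expected = 1.0 / (1.0 + np.exp(-(theta_hat[:, None] - b[None, :])))
    d = X - expected                      # N x I matrix of residuals (Equation 2)
    return np.corrcoef(d, rowvar=False)   # I x I matrix with entries Q3_ij (Equation 3)
```

The off-diagonal entries of the returned matrix are the Q3 values for each item pair.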
Chen and Thissen (1997) discussed X^2 and G^2 LD statistics that, although not more powerful than the Q3, have null distributions very similar to the chi-square distribution with one degree of freedom. Other methods for detecting LD are standardized bivariate residuals for dichotomous or multinomial IRT models (Maydeu-Olivares & Liu, 2015), the use of conditional covariances (Douglas, Kim, Habing, & Gao, 1998), or the use of Mantel–Haenszel type tests (Ip, 2001). Tests based on parametric models are also a possibility: Glas and Suarez-Falcon (2003) proposed Lagrange multiplier (LM) tests based on a threshold shift model, but bifactor models (Liu & Thissen, 2012, 2014), specification of other models that incorporate LD (Hoskens & De Boeck, 1997; Ip, 2002), or limited information goodness-of-fit tests (Liu & Maydeu-Olivares, 2013) are also possible.
The Use of the Q3 Fit Statistic
Yen's Q3 is probably the most often reported index in published Rasch analyses due to its inclusion (in the form of the residual correlation matrix) in widely used software such as RUMM (Andrich, Sheridan, & Luo, 2010). Yen (1984) argued that if the IRT model is correct, then the distribution of the Q3 is known, and proposed that p values could be based on the Fisher (1915) z-transform. Chen and Thissen (1997) stated, "In using Q3 to screen items for local dependence, it is more common to use a uniform critical value of an absolute value of 0.2 for the Q3 statistic itself" (pp. 284-285). They went on to present results showing that, although the sampling distribution under the Rasch model is bell shaped, it is not well approximated by the standard normal distribution, especially in the tails (Chen & Thissen, 1997, Figure 3).
In practical applications of the Q3 test statistic, researchers will often compute the complete correlation matrix of residuals and look at the maximum value:

Q_{3,\max} = \max_{i > j} Q_{3,ij}. \quad (4)
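Yen's suggestion of basing p values on the Fisher (1915) z-transform can be sketched for a single correlation as follows (the correlation value and sample size are illustrative; as noted above, this normal approximation is poor in the tails under the Rasch model):

```python
import math

def fisher_z_pvalue(r, n):
    """Two-sided p value for a correlation r computed over n examinees, using
    the Fisher (1915) z-transform: atanh(r) is approximately normal with
    standard deviation 1 / sqrt(n - 3) when the true correlation is zero."""
    z = math.atanh(r) * math.sqrt(n - 3)
    # standard normal survival function expressed via the complementary error function
    p_one_sided = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    return 2 * p_one_sided

print(round(fisher_z_pvalue(0.1, 500), 4))
```

For example, a residual correlation of 0.1 over 500 examinees gives a two-sided p value of about 0.025 under this approximation.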

Critical Values of Residual Correlations
When investigating LD based on Yen's Q3, residuals for any pair of items should be uncorrelated, and generally close to zero. Residual correlations that are high indicate a violation of the local independence assumption, and this suggests that the pair of items have something more in common than the rest of the item set have in common with each other (Marais, 2013).
As noted by Yen (1984), a negative bias is built into Q3. This problem is due to the fact that measures of association will be biased away from zero even though the assumption of local independence applies, due to the conditioning on a proxy variable instead of the latent variable (Rosenbaum, 1984). A second problem is that the way the residuals are computed induces a bias (Kreiner & Christensen, 2011). Marais (2013) recognized that the sampling properties among residuals are unknown; therefore, these statistics cannot be used for formal tests of LD. A third, and perhaps the most important, problem in applications is that there are currently no well-documented suggestions of the critical values which should be used to indicate LD, and for this reason, arbitrary rules of thumb are used when evaluating whether an observed correlation is such that it can be reasonably supposed to have arisen from random sampling.
Standards often reported in the literature include looking at residual correlations over the critical value of 0.2, as proposed by Chen and Thissen (1997). For examples of this, see Reeve et al. (2007); Hissbach, Klusmann, and Hampe (2011); Makransky and Bilenberg (2014); and Makransky, Rogers, and Creed (2014). However, other critical values are also used, and there seems to be a wide variation in what is seen as indicative of dependence. Marais and Andrich (2008a) investigated dependence at a critical value of 0.1, but a value of 0.3 has also often been used (see, for example, das Nair, Moreton, & Lincoln, 2011; La Porta et al., 2011; Ramp, Khan, Misajon, & Pallant, 2009; Røe, Damsgård, Fors, & Anke, 2014), and critical values of 0.5 (Davidson, Keating, & Eyres, 2004; Ten Klooster, Taal, & van de Laar, 2008) and even 0.7 (González-de Paz et al., 2015) can be found in use.
There are two fundamental problems with this use of standard critical values: (a) there is limited evidence of their validity, and often no reference for where the values come from, and (b) they are not sensitive to specific characteristics of the data.
Marais (2013) not only identified that the residual correlations are difficult to interpret confidently when there are fewer than 20 items in the item set but also stated that the correlations should always be considered relative to the overall set of correlations. This is because the magnitude of a residual correlation value which indicates LD will vary depending on the number of items in a data set. Instead of an absolute critical value, Marais (2013) suggested that residual correlation values should be compared with the average item residual correlation of the complete data set to give a truer picture of the LD within a data set. It was concluded that when diagnosing response dependence, item residual correlations should be considered relative to each other and in light of the number of items, although there is no indication of a relative critical value (above the average residual correlation) that could indicate LD.
Thus, under the null hypothesis, the average correlation of residuals is negative (cf. Marais, 2013) and, ideally, observed correlations between residuals in a data set should be evaluated with reference to this average value. Marais proposes to evaluate them with reference to the average of the observed correlations rather than the average under the null hypothesis. Thus, following Marais, the average value of the observed correlations could be considered:

\bar{Q}_3 = \binom{I}{2}^{-1} \sum_{i > j} Q_{3,ij}, \quad (5)

where \binom{I}{2} is the number of item pairs, and defines the test statistic:

Q_{3,*} = Q_{3,\max} - \bar{Q}_3, \quad (6)

that compares the largest observed correlation with the average of the observed correlations.
The problem with the currently used critical values is that they are neither theoretically nor empirically based. Researchers and practitioners faced with making scale validation and development decisions need to know what level of LD could be expected, given the properties of their items and data.
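Given a residual-correlation matrix, the adjusted statistic of Equations 5 and 6 reduces to a few lines (the matrix below is a toy illustration, not data from the study):

```python
import numpy as np

def q3_star(Q3):
    """Adjusted statistic: the largest residual correlation minus the average
    over all I*(I-1)/2 item pairs (Equations 5 and 6)."""
    iu = np.triu_indices_from(Q3, k=1)  # one index pair per item pair
    pair_corrs = Q3[iu]
    q3_bar = pair_corrs.mean()          # Equation 5: average over item pairs
    return pair_corrs.max() - q3_bar    # Equation 6: Q3_max - Q3_bar

# toy residual-correlation matrix (illustrative values only)
Q3 = np.array([[ 1.0, -0.1,  0.3],
               [-0.1,  1.0, -0.2],
               [ 0.3, -0.2,  1.0]])
print(round(q3_star(Q3), 3))  # prints 0.3
```

Here the three pair correlations average to 0.0, so the statistic equals the maximum correlation, 0.3.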
A possible solution would be to use a parametric bootstrap approach and simulate the residual correlation matrix several times under the assumption of fit to the Rasch model. This would provide information about the level of residual correlation that could be expected for the particular case, given that the Rasch model fits. To the authors' knowledge, there is no existing research that describes how important characteristics such as the number of items, number of response categories, number of respondents, the distribution of items and persons, and the targeting of the items affect the residual correlations expected, given fit to the Rasch model. In the current study, the possibility of identifying critical values of LD is investigated by examining the distribution of Q3 under the null hypothesis, where the data fit the model. This is done using an empirical example along with a simulation study.
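The parametric bootstrap idea can be sketched as follows: repeatedly simulate data from the fitted model, compute Q3,max for each replicate, and use an upper percentile of the resulting null distribution as the critical value. For brevity, this sketch treats the person parameters as known normal draws rather than re-estimating them in each replication, which a full implementation would do:

```python
import numpy as np

def simulate(theta, b, rng):
    """Draw locally independent dichotomous Rasch responses given theta."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    return (rng.random(p.shape) < p).astype(int)

def q3_max_stat(X, theta, b):
    """Largest off-diagonal residual correlation (Equation 4)."""
    expected = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    Q3 = np.corrcoef(X - expected, rowvar=False)
    iu = np.triu_indices_from(Q3, k=1)
    return Q3[iu].max()

def bootstrap_critical_value(b, n_persons=500, n_boot=200, level=0.95, seed=1):
    """Approximate the null distribution of Q3_max under the Rasch model by
    parametric bootstrap and return its `level` percentile as a critical value."""
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        theta = rng.normal(size=n_persons)   # simplification: theta treated as known
        X = simulate(theta, b, rng)
        stats.append(q3_max_stat(X, theta, b))
    return float(np.quantile(stats, level))

crit = bootstrap_critical_value(b=np.linspace(-2.0, 2.0, 10))
```

An observed Q3,max above `crit` would then indicate more residual correlation than expected under fit to the model, for this particular number of items, sample size, and targeting.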
Given the existence of the wide range of fit statistics with known sampling distributions outlined above, it is surprising that Rasch model applications abound with reporting of Q3 using arbitrary cut-points without theoretical or empirical justification. The reason for this is that the theoretically sound LD indices are not included in the software packages used by practitioners. For this reason, this article presents extensive simulation studies that will (a) illustrate that Q3 should be interpreted with caution and (b) allow researchers to know what level of LD could be expected, given properties of their items and data. Furthermore, these simulation studies will be used to study whether the maximum correlation, or the difference between the maximum correlation and the average correlation, as suggested by Marais (2013), is the most informative.
Thus, the objectives of this article are (a) to provide an overview of the influence of different factors upon the null distribution of residual correlations and (b) to propose guidelines that researchers and practitioners can follow when making decisions about LD during scale development and validation. Two different situations are addressed: first, the situation where the test statistic is computed for all item pairs and only the strongest evidence (the largest correlation) is considered, and second, the less common case where only a single a priori defined item pair is considered.
Simulation Study
Methods
The simulated data sets used are as follows: (a) I dichotomous items simulated from

P(X_i = x \mid \theta) = \frac{\exp(x(\theta - b_i))}{1 + \exp(\theta - b_i)} \quad (i = 1, \ldots, I), \quad (7)

with evenly spaced item difficulties b_i ranging from -2 to 2,

b_i = 2\left(\frac{2(i-1)}{I-1} - 1\right) \quad (i = 1, \ldots, I), \quad (8)

or (b) I polytomous items simulated from

References

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.