A comparison of some methods to analyze repeated measures ordinal categorical data

doi:10.4148/2475-7772.1219

Kansas State University Libraries

New Prairie Press

Conference on Applied Statistics in Agriculture 2001 - 13th Annual Conference Proceedings


Yaobing Sui

Walter W. Stroup


This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.

Recommended Citation

Sui, Yaobing and Stroup, Walter W. (2001). "A COMPARISON OF SOME METHODS TO ANALYZE REPEATED MEASURES ORDINAL CATEGORICAL DATA," Conference on Applied Statistics in Agriculture. https://doi.org/10.4148/2475-7772.1219


A COMPARISON OF SOME METHODS TO ANALYZE REPEATED MEASURES ORDINAL CATEGORICAL DATA

by Yaobing Sui and Walter W. Stroup

Department of Biometry, University of Nebraska, Lincoln, NE 68583-0712

Abstract: Recent advances in statistical software made possible by the rapid development of computer technology in the past decade have made many new procedures available to data analysts. We focus in this paper on methods for ordinal categorical data with repeated measures that can be implemented using SAS. These procedures are illustrated using data from an animal health experiment. The responses, measured as severity of symptoms on an ordinal scale, are recorded for test animals over time. The experiment was designed to estimate treatment and time effects on the severity of symptoms. The data were analyzed with various approaches using PROC MIXED, PROC NLMIXED, PROC GENMOD, and the GLIMMIX macro. In this paper, we compare the strengths and weaknesses of these different methods.

1. Introduction

Consider an experiment in which three treatments are compared. There are r blocks of three animals each, formed using criteria relevant to the experiment. Within each block, one animal is assigned at random to each treatment. Animals are measured at "week 0," the time the treatments first take effect, and again at weeks 4 and 12. The variables measured include weight, presence or absence of disease symptoms, and severity of symptoms, classified as "worse," "no change," or "better." This type of experiment is called a repeated measures experiment. The focus of this paper is on repeated measures analysis of the last two types of data in the above list: categorical data that are either binary or ordinal.

Repeated measures data, also known as longitudinal data, come from experiments in which observations are made on subjects at regular, planned times. These experiments have two or more treatments and are set up using familiar designs: randomized complete or incomplete block designs, if blocking is appropriate; row-column designs such as Latin Squares, when appropriate; or completely randomized assignment of experimental units to treatments when blocking is not required. Repeated measures designs are widely used throughout the life sciences.

Repeated measures analysis is fairly well understood for normally distributed data, but less so for categorical data. However, recent developments in methodology and statistical computing software have greatly increased the number of tools available to categorical data analysts. The purpose of this paper is to review these tools, what we currently know of their advantages and disadvantages, and what we still need to learn about them.

Regardless of whether the observations are normally distributed, categorical, or have some other distribution, a general approach to repeated measures analysis based on the linear mixed model uses the following general form:

observation = between subjects systematic effects + between subjects random variation
+ within subjects systematic effects + within subjects random variation

For non-normal data, a function of the observation, e.g. the link function in a generalized linear mixed model, often replaces the literal observation in the above model.

Conference on Applied Statistics in Agriculture

Kansas State University

New Prairie Press

https://newprairiepress.org/agstatconference/2001/proceedings/9

In the example that begins this section, the between subjects systematic effects are for block and treatment, and the between subjects random effects correspond to block x treatment random effects, i.e. the between subjects model is identical to the model one would use for a randomized complete block analysis of variance. The within subjects systematic effects are the main effects of time and the treatment x time interaction. Within subjects random variation (formally, block x time within treatment variation) is essentially whatever is left unexplained, i.e. variation among the measurements at different times on the same experimental unit not explained by systematic effects already specified.

Formally, for normal errors, the model equation is:

$$Y_{ijk} = \mu + \tau_i + r_j + b_{ij} + \omega_k + (\tau\omega)_{ik} + e_{ijk},$$

where $Y_{ijk}$ is the observation on the $i$th treatment, $j$th block at the $k$th week (or, more generally, time), $\mu$ is the intercept, $\tau_i$ is the $i$th treatment main effect, $r_j$ is the $j$th block effect, $b_{ij}$ is the $ij$th block-treatment random effect, assumed i.i.d. $N(0, \sigma_b^2)$, $\omega_k$ is the $k$th time main effect, $(\tau\omega)_{ik}$ is the $ik$th time-treatment interaction effect, and $e_{ijk}$ is the $ijk$th within subject error. The $e_{ijk}$ are assumed multivariate normal and, at least potentially, correlated.
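To make the roles of the terms in the model equation concrete, the following is a minimal simulation sketch in Python. It is not part of the SAS analyses discussed in this paper. The function name `simulate_rcb_repeated`, all parameter values, and the simplifying assumption of independent within subject errors are illustrative only; the paper's model allows the errors to be correlated.

```python
import random

def simulate_rcb_repeated(mu, tau, r, omega, tau_omega, sd_b, sd_e, seed=0):
    """Draw Y[i][j][k] = mu + tau_i + r_j + b_ij + omega_k + (tau*omega)_ik + e_ijk.

    Assumption for illustration: the e_ijk are independent N(0, sd_e^2).
    """
    rng = random.Random(seed)
    y = []
    for i, tau_i in enumerate(tau):          # treatments
        per_block = []
        for r_j in r:                        # blocks
            b_ij = rng.gauss(0.0, sd_b)      # between subjects random effect
            per_block.append([
                mu + tau_i + r_j + b_ij + omega[k] + tau_omega[i][k]
                + rng.gauss(0.0, sd_e)       # within subject error
                for k in range(len(omega))   # times (e.g. weeks 0, 4, 12)
            ])
        y.append(per_block)
    return y

# Hypothetical parameter values: 3 treatments, 2 blocks, 3 times.
y = simulate_rcb_repeated(
    mu=10.0, tau=[0.0, 1.0, 2.0], r=[-0.5, 0.5], omega=[0.0, 0.3, 0.6],
    tau_omega=[[0.0, 0.0, 0.0], [0.0, 0.2, 0.4], [0.0, 0.4, 0.8]],
    sd_b=1.0, sd_e=0.5)
```

Because one draw of `b_ij` is shared by all measurements on the same animal, observations on that animal are correlated even with independent errors.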

There are two main distinguishing features of repeated measures analysis:

1. The primary objective is to see if changes over time are the same for each treatment, i.e. to assess the time x treatment interaction.

2. The errors, $e_{ijk}$, are correlated. Specifically, let $e_{ij}' = [e_{ij1}, e_{ij2}, \ldots, e_{ijK}]$ be the vector of within subjects errors, where $K$ is the number of time periods observed. Then $e_{ij} \sim MVN(0, \Sigma)$, where $\Sigma$ is the covariance matrix reflecting the correlation structure. The vector $e' = [e_{11}', \ldots, e_{1r}', \ldots, e_{a1}', \ldots, e_{ar}']$ is thus distributed with a block-diagonal covariance matrix, i.e. $e \sim MVN(0, I_{ar} \otimes \Sigma)$, where $a$ is the number of treatments.
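The block-diagonal structure of the error covariance can be illustrated with a short Python sketch. The first-order autoregressive form used for the within subject covariance matrix is only one possible choice and is an assumption here, as are the helper names `ar1_cov` and `kron_identity` and the numerical values.

```python
def ar1_cov(K, sigma2, rho):
    """K x K covariance with entries sigma2 * rho**|s - t| (assumed AR(1) form)."""
    return [[sigma2 * rho ** abs(s - t) for t in range(K)] for s in range(K)]

def kron_identity(n, S):
    """Kronecker product of the n x n identity with S: n copies of S on the diagonal."""
    K = len(S)
    V = [[0.0] * (n * K) for _ in range(n * K)]
    for m in range(n):                  # one diagonal block per subject
        for s in range(K):
            for t in range(K):
                V[m * K + s][m * K + t] = S[s][t]
    return V

Sigma = ar1_cov(K=3, sigma2=1.0, rho=0.5)   # 3 measurement times
V = kron_identity(2, Sigma)                 # 2 subjects -> 6 x 6 matrix
```

Entries of `V` linking different subjects are zero, while entries linking two times on the same subject come from `Sigma`. An AR(1) form presumes equally spaced times, so it would be at best a rough choice for measurements at weeks 0, 4, and 12.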

With normal errors, repeated measures analysis can be implemented with mixed model software such as PROC MIXED. The main issues in using PROC MIXED for repeated measures analysis involve choosing an appropriate covariance model for $\Sigma$, realistically approximating the error degrees of freedom for various tests, and adjusting for potential bias of standard errors and test statistics that result from estimating the components of $\Sigma$. Readers seeking more detail on the use of PROC MIXED for repeated measures analysis are referred to Littell et al. (1996). Carlin and Louis (1996) discussed covariance model selection issues. Kenward and Roger (1997) discussed standard error bias and degrees of freedom issues and presented approximations now available with PROC MIXED. Guerin and Stroup (2000) presented an extensive simulation study documenting the small sample behavior of PROC MIXED under various options.

Models with non-normal errors, including categorical data, require some modifications. To make these modifications more understandable, one can re-express the normal errors model in terms that make it more amenable to the required changes. Specifically, define the linear mixed model in terms of the distribution of the random model effects and in terms of the conditional distribution of the observations given the random model effects:

$$y \mid u \sim MVN(X\beta + Zu, R) \quad \text{and} \quad u \sim MVN(0, G).$$
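A standard consequence of this two-part specification is that, marginally, the observation vector has covariance $ZGZ' + R$. The Python sketch below computes that matrix for a single animal measured at three times with one random effect. The helper names and the numerical values of $G$ and $R$ are hypothetical, and taking $R$ diagonal is an assumption made only to keep the example small.

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def marginal_cov(Z, G, R):
    """Return Z G Z' + R, the marginal covariance of y under the linear mixed model."""
    ZGZt = matmul(matmul(Z, G), transpose(Z))
    return [[ZGZt[s][t] + R[s][t] for t in range(len(R))] for s in range(len(R))]

Z = [[1.0], [1.0], [1.0]]       # one random effect shared by 3 times
G = [[2.0]]                     # hypothetical variance of that random effect
R = [[1.0, 0.0, 0.0],           # hypothetical independent within subject errors
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
V = marginal_cov(Z, G, R)
```

With these inputs `V` has 3.0 on the diagonal and 2.0 elsewhere, a compound symmetric pattern: the shared random effect alone induces equal correlation among the repeated measurements.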

The linear mixed model is a model of the conditional mean of the observation vector, $y$, given the random effects, $u$. For non-normal data, one adapts the generalized linear model approach used for categorical models such as logistic regression and log-linear models. Specifically, drop the assumption of multivariate normality for $y \mid u$ and use $X\beta + Zu$ to model a function of the conditional mean, $E(y \mid u)$, called the link function in generalized linear models. This results in the generalized linear mixed model (GLMM), widely discussed in the statistical literature of the 1980's through the present. See, for example, Breslow and Clayton (1993). The GLMM is thus described as follows:

1. The distribution of the random effects: $u \sim MVN(0, G)$.

2. The conditional distribution of the observations, $y$, given the random effects, $u$. For categorical data, this distribution is typically assumed Poisson (for log-linear models fit to contingency tables), binomial (for logistic models), or multinomial (for extensions of logit models when there are more than two categories). Quasi-likelihood methods allow the use of GLMM-based analysis even when one can only specify the expected value and variance of $y \mid u$ rather than the distribution per se.

3. The inverse link, $E(y \mid u) = h(X\beta + Zu)$. The inverse link may be the inverse of the link function, or the inverse link may be a set of functions, as is the case for some multinomial models. With the latter case, there is no one-to-one relationship between the conditional mean and the link. When a one-to-one relationship does exist, the GLMM can be described in terms of the link function, that is, $\eta = X\beta + Zu$, where $\eta = g[E(y \mid u)]$ is the link function.
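As an illustration of an inverse link that is a set of functions, the Python sketch below uses a cumulative logit model for a three-category ordinal response such as "worse," "no change," or "better." This particular model, the sign convention P(Y <= c) = expit(cutpoint_c - eta), the function names, and the cutpoint values are assumptions chosen for illustration; the text above does not commit to them.

```python
import math

def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))

def category_probs(cutpoints, eta):
    """Category probabilities from cumulative logits.

    Assumed parameterization: P(Y <= c) = expit(cutpoints[c] - eta),
    with cutpoints in increasing order. One linear predictor eta is mapped
    to several probabilities, so the inverse link is a set of functions.
    """
    cumulative = [expit(c - eta) for c in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for p in cumulative:
        probs.append(p - previous)   # difference of adjacent cumulative probabilities
        previous = p
    return probs

# Hypothetical cutpoints; eta would be X*beta + Z*u for one observation.
probs = category_probs([-1.0, 1.0], eta=0.0)   # [P(worse), P(no change), P(better)]
```

Raising `eta` under this convention shifts probability toward the higher categories while the probabilities continue to sum to one.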

For the randomized complete block design with repeated measures described above, the GLMM would thus be

$$\eta_{ijk} = \mu + \tau_i + r_j + b_{ij} + \omega_k + (\tau\omega)_{ik},$$

where $\eta_{ijk}$ is the link function, $g[E(Y_{ijk} \mid b_{ij})]$, and the terms of the right-hand side of the model are defined as they were with the linear mixed model given previously. Alternatively, one can use the inverse link

$$E(Y_{ijk} \mid b_{ij}) = h[\mu + \tau_i + r_j + b_{ij} + \omega_k + (\tau\omega)_{ik}].$$

Several options exist in SAS for fitting categorical repeated measures models. PROC GENMOD can be used to fit log-linear models. For binomial data only, GENMOD can also fit certain GLMM's for repeated measures using the method of generalized estimating equations (Zeger et al., 1988), commonly referred to as GEE's. The GLIMMIX macro can also fit repeated measures GLMM's to binomial data. GLIMMIX uses a pseudo-likelihood approach (Wolfinger and O'Connell, 1993) that is similar to the quasi-likelihood approach described by Breslow and Clayton (1993), but somewhat more general. GLIMMIX is not as restrictive as the GENMOD GEE option in terms of the types of covariance models available. PROC NLMIXED, introduced in SAS Version 8, can estimate repeated measures GLMM's for multinomial data in addition to models for binomial data. It uses a maximum likelihood algorithm based on Gaussian quadrature. With some programming ingenuity, NLMIXED can fit certain covariance matrices, although convergence can be an issue with more complex structures.
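The idea behind quadrature-based maximum likelihood can be sketched in Python: the marginal likelihood for one subject is the conditional likelihood averaged over the normal distribution of the random effect, and Gauss-Hermite quadrature approximates that average by a weighted sum. The sketch below uses a fixed three-point rule and a random-intercept logistic model. It is an illustration of the general technique under these assumptions and does not reproduce the algorithm or options of PROC NLMIXED; all function names are hypothetical.

```python
import math

# Three-point Gauss-Hermite rule for integrals of f(x) * exp(-x**2).
GH3_NODES = (-math.sqrt(1.5), 0.0, math.sqrt(1.5))
GH3_WEIGHTS = (math.sqrt(math.pi) / 6.0,
               2.0 * math.sqrt(math.pi) / 3.0,
               math.sqrt(math.pi) / 6.0)

def normal_expectation(g, sigma):
    """Approximate E[g(b)] for b ~ N(0, sigma^2) via the substitution b = sqrt(2)*sigma*x."""
    total = sum(w * g(math.sqrt(2.0) * sigma * x)
                for x, w in zip(GH3_NODES, GH3_WEIGHTS))
    return total / math.sqrt(math.pi)

def marginal_likelihood(y, eta_fixed, sigma):
    """Marginal likelihood of one subject's binary responses y (list of 0/1).

    eta_fixed holds the fixed-effect part of the linear predictor at each time;
    a random intercept b ~ N(0, sigma^2) is added on the logit scale.
    """
    def conditional_likelihood(b):
        lik = 1.0
        for y_k, eta_k in zip(y, eta_fixed):
            p = 1.0 / (1.0 + math.exp(-(eta_k + b)))
            lik *= p if y_k == 1 else 1.0 - p
        return lik
    return normal_expectation(conditional_likelihood, sigma)

# Hypothetical subject: responses at three times, fixed-effect predictors of zero.
lik = marginal_likelihood([1, 0, 1], [0.0, 0.0, 0.0], sigma=1.0)
```

A full fit would sum the log of this quantity over subjects and maximize it over the fixed effects and `sigma`; more quadrature points would typically be needed for accuracy.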

The next section describes in more detail SAS-based methods useful for categorical repeated measures data, with a focus on ordinal data. Section 3 presents an example from an animal health experiment. Section 4 presents some tentative simulation results. These will be pursued in far more detail in work now in progress.

2. Review of Methods

Table 1 shows the data for the experiment described at the beginning of Section 1 in contingency table form. Each cell contains the number of animals in a given treatment x week x response category combination. This section describes the methods available in SAS to analyze these data.

The simplest categorical data analysis approach is to compute the Cochran-Mantel-Haenszel statistic to test treatment x response category association. A statistically significant result constitutes evidence of a treatment effect, assuming that the association does not change over weeks. SAS PROC FREQ can compute the Cochran-Mantel-Haenszel test. It can also compute the Breslow-Day statistic for no three-way treatment x response category x week association (i.e. no change in treatment x response association over weeks) if the treatment x response table is 2x2, but not for the more general case, such as the 3x3 shown here. See Agresti (1996) for a more in-depth discussion of the contingency table approach.
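For the special case of a 2x2 treatment x response table within each week, the Cochran-Mantel-Haenszel statistic has a simple closed form, sketched below in Python without a continuity correction. The function name and the counts are made up for illustration; they are not the data of Table 1, and the 3x3 case discussed in this paper requires the generalized form of the statistic.

```python
def cmh_2x2(strata):
    """Cochran-Mantel-Haenszel statistic for a list of 2x2 tables.

    Each stratum (e.g. one week) is (a, b, c, d):
        a, b = counts in response categories 1 and 2 for treatment 1
        c, d = counts in response categories 1 and 2 for treatment 2
    Compare the result to a chi-square distribution with 1 degree of freedom.
    """
    deviation, variance = 0.0, 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        deviation += a - row1 * col1 / n                       # observed minus expected
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    return deviation ** 2 / variance

# Hypothetical counts for three weeks.
stat = cmh_2x2([(10, 5, 5, 10), (9, 6, 6, 9), (11, 4, 5, 10)])
```

The deviations are summed across weeks before squaring, which is why the test has power only when the association points in the same direction in every week.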

Alternatively, the contingency table approach can be implemented using a log-linear model. For the above example, the log-linear model is

$$\log(\lambda_{ijk}) = \mu + \tau_i + \omega_j + (\tau\omega)_{ij} + c_k + (\tau c)_{ik} + (\tau\omega c)_{ijk},$$

where $\lambda_{ijk}$ is the expected count of the $ijk$th treatment x week x response category combination, and $\tau$, $\omega$, and $c$ refer to treatment, week, and response category effects, respectively. The two effects of primary interest are the three-way association effects and, assuming the three-way effects, $(\tau\omega c)_{ijk}$, are zero, the two-way treatment x response category effects. The test of $H_0$: all $(\tau\omega c)_{ijk} = 0$ is equivalent to the Breslow-Day test, but more general because it is not restricted to 2x2 treatment x response category cases. The test of $H_0$: all $(\tau c)_{ik} = 0$ is equivalent to the Cochran-Mantel-Haenszel test. PROC GENMOD can do all the required computations for the log-linear model.

While the log-linear model is easy to compute, the contingency table approach may not take correlation among repeated measurements on the same experimental unit into account realistically. Agresti (1996) presents the logic of the contingency table approach when there are two times, but the logic does not necessarily extend to three or more times. Approaches using GEE's or other GLMM methods with more flexibility in specifying the covariance structure are, at least in theory, preferable.

In SAS, for binary data only, GEE's can be implemented using the REPEATED option in PROC GENMOD. This approach is limited in that it assumes no random model effects. The model is thus

$$\eta_{ijk} = \mu + \tau_i + r_j + \omega_k + (\tau\omega)_{ik},$$

where $\eta_{ijk}$ is usually either the logit or probit link, and $\tau$, $r$, and $\omega$ refer to treatment, block, and week effects, respectively. The logit link is defined as $\text{logit}(\pi_{ijk}) = \log\left(\frac{\pi_{ijk}}{1-\pi_{ijk}}\right)$, where $\pi_{ijk}$ is the probability of the outcome of interest occurring for the $ijk$th treatment x block x week combination. The probit link is defined as $\text{probit}(\pi_{ijk}) = \Phi^{-1}(\pi_{ijk})$, where $\Phi^{-1}$ is the inverse cumulative standard normal distribution. The observations are assumed to have a covariance matrix $R = DPD$, where

$$D = \text{diag}\left[\sqrt{\frac{\pi_{ijk}(1-\pi_{ijk})}{n_{ijk}}}\right]$$

and $n_{ijk}$ is the number of Bernoulli trials observed on the $ijk$th treatment x block x week combination. The form of $D$ given here is specific to the binomial distribution. In general, $D$ is a diagonal matrix whose elements are determined by the variance function for each treatment x block x week combination. $P$ is a working correlation matrix. Working correlation matrices are not true correlation matrices, but their structure follows common correlated error


References

SAS System for Mixed Models

An introduction to categorical data analysis

Models for longitudinal data: a generalized estimating equation approach.

Small Sample Inference for Fixed Effects from Restricted Maximum Likelihood

The analysis of longitudinal data