
Multivariate Student-t regression models: Pitfalls and inference

Carmen Fernandez, Mark F.J. Steel
- 01 Mar 1999 - 
- Vol. 86, Iss: 1, pp 153-167
TLDR
In this paper, the authors consider likelihood-based inference from multivariate regression models with independent Student-t errors and uncover some very intriguing pitfalls of both Bayesian and classical methods on the basis of point observations.
Abstract
We consider likelihood-based inference from multivariate regression models with independent Student-t errors. Some very intriguing pitfalls of both Bayesian and classical methods on the basis of point observations are uncovered. Bayesian inference may be precluded as a consequence of the coarse nature of the data. Global maximization of the likelihood function is a vacuous exercise since the likelihood function is unbounded as we tend to the boundary of the parameter space. A Bayesian analysis on the basis of set observations is proposed and illustrated by several examples.


Tilburg University
Multivariate Student-t Regression Models
Fernández, C.; Steel, M.F.J.
Publication date: 1997
Link to publication in Tilburg University Research Portal
Citation for published version (APA):
Fernández, C., & Steel, M. F. J. (1997). Multivariate Student-t Regression Models: Pitfalls and Inference. (CentER Discussion Paper; Vol. 1997-08). Econometrics.

Multivariate Student-t Regression Models:
Pitfalls and Inference
By Carmen Fernández and Mark F.J. Steel¹
CentER for Economic Research and Department of Econometrics
Tilburg University, 5000 LE Tilburg, The Netherlands
FIRST VERSION DECEMBER 1996; CURRENT VERSION JANUARY 1997
Abstract
We consider likelihood-based inference from multivariate regression models with in-
dependent Student-t errors. Some very intriguing pitfalls of both Bayesian and classical
methods on the basis of point observations are uncovered. Bayesian inference may be
precluded as a consequence of the coarse nature of the data. Global maximization of the
likelihood function is a vacuous exercise since the likelihood function is unbounded as we
tend to the boundary of the parameter space. A Bayesian analysis on the basis of set
observations is proposed and illustrated by several examples.
KEY WORDS: Bayesian inference; Coarse data; Continuous distribution; Maximum like-
lihood; Missing data; Scale mixture of Normals.
1. INTRODUCTION
The multivariate regression model with unknown scatter matrix is widely used in
many fields of science. Applications to real data often indicate that the analytically con-
venient assumption of Normality is not quite tenable and thicker tails are called for in
order to adequately capture the main features of the data. Thus, we consider regression
error vectors that are distributed as scale mixtures of Normals. We shall mainly emphasize
the empirically relevant case of independent sampling from a multivariate Student-t distri-
bution with unknown degrees of freedom. In particular, we provide a complete Bayesian
analysis of the linear Student-t regression model, and also comment on the behaviour of
the likelihood function.

¹ Carmen Fernández is Research Fellow, CentER for Economic Research and Assistant Professor, Department of Econometrics, Tilburg University, 5000 LE Tilburg, The Netherlands. Mark Steel is Senior Research Fellow, CentER for Economic Research and Associate Professor, Department of Econometrics, Tilburg University, 5000 LE Tilburg, The Netherlands. We gratefully acknowledge the extremely valuable help of F. Chamizo in the proof of Theorem 3 as well as useful comments from B. Melenberg and W.J. Studden. Both authors benefitted from a travel grant awarded by the Netherlands Organization for Scientific Research (NWO) and were visiting the Statistics Department at Purdue University during much of the work on this paper.
The Bayesian model will be completed with a commonly used improper prior on the
regression coefficients and scatter matrix, and some proper prior on the degrees of freedom.
Section 3 examines the usual posterior inference on the basis of a recorded sample of point
observations. Even though Theorem 1 indicates that Bayesian inference is possible for
almost all samples (i.e. except for a set of zero probability under the sampling model),
problems can occur since any sample of point observations formally has probability zero of
being observed. In practice, this can become relevant due to rounding or finite precision of
the recorded observations, and we can easily end up with a sample for which inference is
precluded. This incompatibility between the continuous sampling model and any sample of
point observations can have very disturbing consequences: the posterior distribution may
not exist, even if it already existed on the basis of a subset of the sample. New observations
can, thus, have a devastating effect on the usual Bayesian inference. Fernández and Steel
(1996a) present a detailed discussion of this phenomenon in the context of a univariate
location-scale model.
Section 4 presents a solution through the use of set observations, which have positive
probability under the sampling model, and are, thus, in agreement with the sampling
assumptions. This leads to a fully coherent Bayesian analysis where new observations
can never harm the possibility of conducting inference. A Gibbs sampling scheme [see
e.g. Gelfand and Smith (1990) and Casella and George (1992)] is seen to be a convenient
way to implement this solution in practice. Some examples are presented: a univariate
regression model for the well-known stackloss data [see Brownlee (1965)], and a bivariate
location-scale model for the iris setosa data of Fisher (1936).
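To make the set-observation approach concrete, the following sketch implements a data-augmentation Gibbs sampler of the kind described above for the simplest case: a univariate Student-t location-scale model (p = 1, k = 1, x_i = 1), with the degrees of freedom ν held fixed rather than given a prior, and with each recorded value replaced by the interval it was rounded from. This is a minimal illustration under our own naming (gibbs_t_location_scale and its arguments), not the authors' code, and it conditions on ν purely to keep the example short; the paper treats ν as unknown with a proper prior.

```python
# Illustrative sketch, not the authors' code: data-augmentation Gibbs sampler for a
# univariate Student-t location-scale model (p = 1, k = 1, x_i = 1) with FIXED degrees of
# freedom nu, using set observations [lower_i, upper_i] in place of point observations.
# Prior: p(mu, sigma2) proportional to 1/sigma2, the p = 1 case of (2.3).
import numpy as np
from scipy.stats import truncnorm

def gibbs_t_location_scale(lower, upper, nu=4.0, n_iter=5000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(lower)
    y = 0.5 * (lower + upper)              # start latent responses at interval midpoints
    mu, sigma2 = y.mean(), y.var() + 1e-6
    lam = np.ones(n)                       # mixing variables lambda_i
    draws = []
    for _ in range(n_iter):
        # latent responses: Normal(mu, sigma2/lambda_i) truncated to [lower_i, upper_i]
        sd = np.sqrt(sigma2 / lam)
        a, b = (lower - mu) / sd, (upper - mu) / sd
        y = truncnorm.rvs(a, b, loc=mu, scale=sd, random_state=rng)
        # mixing variables: Gamma((nu + 1)/2, rate = (nu + (y_i - mu)^2 / sigma2) / 2)
        rate = 0.5 * (nu + (y - mu) ** 2 / sigma2)
        lam = rng.gamma(shape=0.5 * (nu + 1.0), scale=1.0 / rate)
        # location: Normal full conditional under the flat prior on mu
        w = lam.sum()
        mu = rng.normal((lam * y).sum() / w, np.sqrt(sigma2 / w))
        # scale: Inverse-Gamma(n/2, S/2) with S = sum_i lambda_i (y_i - mu)^2
        S = (lam * (y - mu) ** 2).sum()
        sigma2 = 1.0 / rng.gamma(shape=0.5 * n, scale=2.0 / S)
        draws.append((mu, sigma2))
    return np.array(draws)

# Observations recorded to the nearest integer become intervals of half-width 0.5.
obs = np.array([10.0, 11.0, 9.0, 12.0, 10.0, 35.0])      # includes one gross outlier
samples = gibbs_t_location_scale(obs - 0.5, obs + 0.5)
print(samples[1000:].mean(axis=0))                        # posterior means of (mu, sigma2)
```

Each sweep draws the latent exact responses from truncated Normals on their intervals, then the mixing variables, location and scale from standard full conditionals; because every interval has positive probability under the sampling model, adding observations can never break the sampler.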
The analysis through set observations is naturally extended to the case where some
components of the multivariate response are not observed (missing data). We illustrate
this with the artificial Murray (1977) data, extended with some extreme values in Liu and
Rubin (1995).
In addition, we find that none of the results concerning the feasibility of Bayesian
inference with set observations depend on the particular scale mixture of Normals that we
sample from.
Finally, in Section 5 the Student likelihood function for point observations is analyzed
in some detail: it is found that the likelihood is unbounded as we tend to the boundary
of the parameter space in a certain direction. This casts some doubt on the meaning and
validity of a maximum likelihood analysis of this model [as performed in e.g. Lange, Little
and Taylor (1989), Lange and Sinsheimer (1993) and Liu and Rubin (1994, 1995)]. This
behaviour of the likelihood function is illustrated through the stackloss data example, and
it also explains the source of the problems encountered by Lange et al. (1989) and Lange
and Sinsheimer (1993) when applying the EM algorithm for joint estimation of regression
coefficients, scale and degrees of freedom to the radioimmunoassay data set of Tiede and
Pagano (1979).
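The nature of this unboundedness is easy to reproduce numerically in the simplest univariate location-scale case: put the location exactly at one observation, take the degrees of freedom small enough that (n − 1)ν < 1, and let the scale shrink to zero; the exactly-fitted observation contributes a factor of order 1/σ that the remaining observations cannot offset. The sketch below is only an illustration of this phenomenon with toy data, not an analysis from the paper.

```python
# Illustrative numerical check with toy data (not an example from the paper): with the
# location put exactly at one observation and (n - 1) * nu < 1, the Student-t likelihood
# grows without bound as the scale sigma tends to 0, so a global MLE does not exist.
import numpy as np
from scipy.stats import t as student_t

y = np.array([1.2, 0.7, 2.3, -0.4, 1.9])    # toy data, n = 5
nu = 0.1                                     # (n - 1) * nu = 0.4 < 1
mu = y[0]                                    # location fixed at the first observation

for sigma in [1.0, 1e-2, 1e-4, 1e-8, 1e-16]:
    loglik = student_t.logpdf(y, df=nu, loc=mu, scale=sigma).sum()
    print(f"sigma = {sigma:.0e}   log-likelihood = {loglik:.2f}")
# The log-likelihood increases steadily as sigma -> 0.
```

Any routine that searches for a global maximum jointly over location, scale and degrees of freedom can therefore drift toward such degenerate boundary points, which is consistent with the EM difficulties mentioned above.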
All proofs are grouped in Appendix A, whereas Appendix B recalls some matricvariate probability densities used in the body of the paper. With some abuse of notation, we do not explicitly distinguish between random variables and their realizations, and p(·) (a density function) or P(·) (a measure) can correspond to either a probability measure or a general σ-finite measure. All density functions are Radon-Nikodym derivatives with respect to the Lebesgue measure in the corresponding space, unless stated otherwise.
2. THE MODEL
Observations for the p-variate response variable y_i are assumed to be generated through the linear regression model

$$ y_i = \beta' x_i + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad (2.1) $$

where β is a k × p matrix of regression coefficients, x_i is a k-dimensional vector of explanatory variables and the entire design matrix, X = (x_1, . . . , x_n)′, is taken to be of full column rank k [denoted as r(X) = k]. The error vectors ε_i are independent and identically distributed (i.i.d.) as p-variate scale mixtures of Normals with mean zero and positive definite symmetric (PDS) covariance matrix Σ. The mixing variables, denoted by λ_i, i = 1, . . . , n, follow a probability distribution P_{λ_i|ν} on ℜ^+, which can depend on a parameter ν ∈ N (possibly of infinite dimension). Thus, we have n independent replications from the sampling density
$$ p(y_i \mid \beta, \Sigma) = \int_0^{\infty} \frac{\lambda_i^{p/2}}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{\lambda_i}{2}\,(y_i - \beta' x_i)'\, \Sigma^{-1} (y_i - \beta' x_i) \right\} dP_{\lambda_i|\nu}. \qquad (2.2) $$
By changing P_{λ_i|ν} we cover the class of p-variate scale mixtures of Normals. The latter is a subset of the elliptical class [see Fang, Kotz and Ng (1990), chap. 2], with ellipsoids in ℜ^p as isodensity sets, while allowing for a wide variety of tail behaviour. Leading examples are finite mixtures of Normals, corresponding to a discrete distribution on λ_i, and multivariate Student-t distributions with ν > 0 degrees of freedom, where P_{λ_i|ν} is a Gamma distribution with unitary mean and both shape and precision parameters equal to ν/2. Most of the subsequent discussion will focus on the empirically relevant case of Student-t sampling.
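As a quick sanity check of the Student-t case of (2.2), the sketch below (an illustration of ours, not from the paper; the evaluation point, mean vector and scatter matrix are arbitrary choices) draws mixing variables from the Gamma distribution with unit mean and shape and precision ν/2, averages the Normal integrand of (2.2) over those draws, and compares the result with the multivariate Student-t density.

```python
# Illustrative Monte Carlo check (not from the paper) that the Gamma(nu/2, rate nu/2)
# mixing distribution turns (2.2) into the multivariate Student-t density.
import numpy as np
from scipy.stats import multivariate_t

rng = np.random.default_rng(1)
p, nu = 2, 5.0
mean = np.array([1.0, -0.5])                     # plays the role of beta' x_i
Sigma = np.array([[2.0, 0.7], [0.7, 1.0]])
y_point = np.array([0.3, 0.8])                   # point at which the density is evaluated

lam = rng.gamma(shape=nu / 2, scale=2 / nu, size=500_000)        # unit-mean mixing variables
q = (y_point - mean) @ np.linalg.solve(Sigma, y_point - mean)    # quadratic form in (2.2)
integrand = (lam ** (p / 2)
             / ((2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma)))
             * np.exp(-0.5 * lam * q))
print(integrand.mean())                                          # Monte Carlo value of (2.2)
print(multivariate_t.pdf(y_point, loc=mean, shape=Sigma, df=nu)) # exact Student-t density
```

The two printed numbers agree up to Monte Carlo error.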
Special cases of the model in (2.1) are the multivariate location-scale model, where k = 1 and x_i = 1, and the univariate regression model for p = 1.
The Bayesian model needs to be completed with a prior distribution for (β, Σ, ν). In particular, we assume a product structure between the three parameters where

$$ p(\beta, \Sigma) \propto |\Sigma|^{-(p+1)/2} \qquad (2.3) $$

and

P_ν is any probability measure on N. (2.4)

The prior in (2.3) is the “usual” default prior in the absence of compelling prior information on (β, Σ). Under fixed ν it corresponds to the Jeffreys’ prior under “independence” and is thus invariant under separate reparameterizations of β and of Σ.
Note that the model in (2.1) implies that all p components of y_i are regressed on the same variables x_i. Thus, we treat a special case of Zellner’s (1962) seemingly unrelated regression (SUR) model, which allows for different regressors on the p components. Alternatively, our framework can be extended to general SUR models by considering priors that impose zero restrictions on certain elements of β.
3. BAYESIAN INFERENCE USING POINT OBSERVATIONS
We now consider the feasibility of a Bayesian analysis of the model in (2.2)–(2.4) on the basis of the recorded point observations, as is the usual practice. Since the prior in (2.3) is improper, we clearly need to verify the existence of the posterior distribution. The following Theorem addresses this issue.
Theorem 1. Consider n independent replications from (2.2) with any mixing distribution P_{λ_i|ν} and the prior in (2.3)–(2.4) with any proper P_ν. Then the conditional distribution of (β, Σ) given y ≡ (y_1, . . . , y_n)′ exists if and only if n ≥ k + p.
Somewhat surprisingly, neither the mixing distribution nor the prior on ν affects the existence of the conditional distribution of the parameters given the observables (i.e. the posterior distribution). Thus, whenever n ≥ k + p, the fact that the prior is improper is of no consequence for the existence of the posterior distribution. However, probability theory tells us that a conditional distribution is only defined up to a set of measure zero in the conditioning variables. In other words, Theorem 1 assures us that p(y) < ∞ except possibly on a set of Lebesgue measure zero in ℜ^{n×p}. Theoretically, this validates inference since problems can only occur for samples that have zero probability of being observed. However, as stressed in Fernández and Steel (1996a), any recorded sample of point observations has zero probability of occurrence under any continuous sampling distribution. Thus, Theorem 1 does not guarantee that p(y) < ∞ for our particular observed sample, and the latter has to be verified explicitly. Note that this problem stems from an inherent violation of the rules of probability calculus, since the recorded observations are in contradiction with the assumed sampling model, and is in no way linked to the improperness of the prior [see Fernández and Steel (1996a) for a more detailed discussion].
If we complement Theorem 1 by considering any possible point y ∈ ℜ^{n×p}, Lemma 1 in the Appendix shows that both P_{λ_i|ν} and P_ν can intervene. It is immediate from Lemma 1 that

for finite mixtures of Normals, p(y) < ∞ if and only if r(X : y) = k + p,   (3.1)

which is the minimal possible requirement for any scale mixture of Normals. In the sequel of this Section, we shall, therefore, assume that this rank condition holds.
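The rank condition in (3.1) is straightforward to verify numerically for a given design matrix and sample; the short sketch below (our own helper, not from the paper) does so and also constructs a degenerate sample for which the condition, and hence inference under finite mixtures of Normals, fails.

```python
# Illustrative check (not from the paper) of the rank condition r(X : y) = k + p in (3.1).
import numpy as np

def satisfies_rank_condition(X, y):
    """Return True when the augmented matrix (X : y) has full column rank k + p."""
    n, k = X.shape
    p = y.shape[1]
    return np.linalg.matrix_rank(np.hstack([X, y])) == k + p

rng = np.random.default_rng(0)
n, k, p = 10, 2, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])    # intercept plus one regressor
y = X @ rng.normal(size=(k, p)) + rng.standard_t(df=3, size=(n, p))
print(satisfies_rank_condition(X, y))                    # True for continuously generated data

y_degenerate = np.hstack([X[:, :1], X[:, :1]])           # responses lying in the column space of X
print(satisfies_rank_condition(X, y_degenerate))         # False: inference would be precluded
```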
Let us now analyze the more challenging case of Student-t sampling, where we shall use the following Definition:
Definition 1. For a design matrix X and a sample y ∈ ℜ^{n×p}, s_j, j = 1, . . . , p, is the largest number of observations such that the rank of the corresponding submatrix of X is k while the rank of the corresponding submatrix of (X : y) is k + p − j.
Clearly, since r(X : y) = k + p, we obtain that k ≤ s_p < s_{p−1} < . . . < s_1 < n. Now we can present the following Theorem.
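Since Definition 1 is purely combinatorial, the numbers s_j can be computed by exhaustive search over subsets of observations when n is small. The sketch below (our own illustration, not code from the paper) does this for the location-scale case and reproduces the ordering k ≤ s_p < . . . < s_1 < n noted above.

```python
# Illustrative brute-force computation (not from the paper) of the quantities s_j in
# Definition 1; feasible only for small n since it enumerates all subsets of observations.
from itertools import combinations
import numpy as np

def s_values(X, y):
    """Return {j: s_j}, where s_j is the largest subset size whose X-rows have rank k
    and whose (X : y)-rows have rank k + p - j."""
    n, k = X.shape
    p = y.shape[1]
    Z = np.hstack([X, y])
    s = {}
    for j in range(1, p + 1):
        best = 0
        for m in range(1, n + 1):                 # subset sizes in increasing order
            for rows in combinations(range(n), m):
                rows = list(rows)
                if (np.linalg.matrix_rank(X[rows]) == k
                        and np.linalg.matrix_rank(Z[rows]) == k + p - j):
                    best = m
        s[j] = best
    return s

rng = np.random.default_rng(0)
n, k, p = 6, 1, 2
X = np.ones((n, k))                               # location-scale case: x_i = 1
y = rng.standard_t(df=3, size=(n, p))
print(s_values(X, y))                             # generically {1: 2, 2: 1}, so k <= s_2 < s_1 < n
```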

Citations
Book

Finite Mixture and Markov Switching Models

TL;DR: This book should help newcomers to the field to understand how finite mixture and Markov switching models are formulated, what structures they imply on the data, what they could be used for, and how they are estimated.
Journal ArticleDOI

Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t‐distribution

TL;DR: In this paper, a fairly general procedure is studied to perturb a multivariate density satisfying a weak form of multivariate symmetry, and to generate a whole set of non-symmetric densities.
Journal ArticleDOI

Phylogeography Takes a Relaxed Random Walk in Continuous Space and Time

TL;DR: A Bayesian statistical approach is presented to infer continuous phylogeographic diffusion using random walk models while simultaneously reconstructing the evolutionary history in time from molecular sequence data and demonstrates increased statistical efficiency in spatial reconstructions of overdispersed random walks.
References
Book

The Theory of Matrices

TL;DR: In this article, the Routh-Hurwitz problem of singular pencils of matrices has been studied in the context of systems of linear differential equations with variable coefficients, and its applications to the analysis of complex matrices have been discussed.
Journal ArticleDOI

An Efficient Method of Estimating Seemingly Unrelated Regressions and Tests for Aggregation Bias

TL;DR: In this paper, a method of estimating the parameters of a set of regression equations is reported which involves application of Aitken's generalized least-squares to the whole system of equations.
Journal Article

Sampling-based approaches to calculating marginal densities

TL;DR: Stochastic substitution, the Gibbs sampler, and the sampling-importance-resampling algorithm can be viewed as three alternative sampling- (or Monte Carlo-) based approaches to the calculation of numerical estimates of marginal probability distributions.
Journal ArticleDOI

Explaining the Gibbs Sampler

TL;DR: A simple explanation of how and why the Gibbs sampler works is given and analytically establish its properties in a simple case and insight is provided for more complicated cases.
Frequently Asked Questions (11)
Q1. What are the contributions in this paper?

The authors consider likelihood-based inference from multivariate regression models with independent Student-t errors. 

Once a posterior is found to exist, this does not guarantee inference on the basis of an extended sample! The analysis is now fully coherent, in that new observations can never destroy the possibility of conducting inference. 

For a design matrix X and a sample y ∈ ℜ^{n×p}, s_j, j = 1, . . . , p, is the largest number of observations such that the rank of the corresponding submatrix of X is k while the rank of the corresponding submatrix of (X : y) is k + p − j. Clearly, since r(X : y) = k + p, the authors obtain that k ≤ s_p < s_{p−1} < . . . < s_1 < n. 

A simple Gibbs sampling strategy is proposed to implement Bayesian analysis with set observations and a number of Examples is considered, all under Student sampling with an Exponential prior on ν. 

It is immediate from Lemma 1 that, for finite mixtures of Normals, p(y) < ∞ if and only if r(X : y) = k + p (3.1), which is the minimal possible requirement for any scale mixture of Normals. 

In general, Bayesian inference using set observations can easily be implemented through a Gibbs sampler on the parameters augmented with y = (y_1, . . . , y_n)′. 

As explained in the discussion of Theorem 3, the assumption of Theorem 4 (ii) is most easily interpreted in the location-scale case (k = 1 and x_i = 1), where r(X : y) < 1 + p means that there exists a (p − 1)-dimensional affine space that intersects all of the sets S_1, . . . , S_n. 

As expected, the analysis based on all sixteen set observations identifies the four extra observations [i.e. the last four in (4.6)] as outliers through small values of the mixing variables λi associated with these observations. 

classical inference could be based on efficient likelihood estimation [Lehmann (1983, chap. 6)], grouped likelihoods [see Giesbrecht and Kempthorne (1976) for a lognormal model and Beckman and Johnson (1987) for the Student-t case], sample percentiles [Resek (1976)], modified likelihood [as in Cheng and Iles (1987)] or spacings methods [as in Cheng and Amin (1979)]. 

Even though Theorem 1 indicates that Bayesian inference is possible for almost all samples (i.e. except for a set of zero probability under the sampling model), problems can occur since any sample of point observations formally has probability zero of being observed. 

The Bayesian model will be completed with a commonly used improper prior on the regression coefficients and scatter matrix, and some proper prior on the degrees of freedom.