
Multivariate Student-t regression models: Pitfalls and inference

Carmen Fernandez, Mark F.J. Steel
- 01 Mar 1999 - 
- Vol. 86, Iss: 1, pp 153-167
TLDR
In this paper, the authors consider likelihood-based inference from multivariate regression models with independent Student-t errors and uncover some very intriguing pitfalls of both Bayesian and classical methods on the basis of point observations.
Abstract
We consider likelihood-based inference from multivariate regression models with independent Student-t errors. Some very intriguing pitfalls of both Bayesian and classical methods on the basis of point observations are uncovered. Bayesian inference may be precluded as a consequence of the coarse nature of the data. Global maximization of the likelihood function is a vacuous exercise since the likelihood function is unbounded as we tend to the boundary of the parameter space. A Bayesian analysis on the basis of set observations is proposed and illustrated by several examples.


Tilburg University
Multivariate Student-t Regression Models
Fernández, C.; Steel, M.F.J.
Publication date: 1997
Link to publication in Tilburg University Research Portal
Citation for published version (APA):
Fernández, C., & Steel, M. F. J. (1997). Multivariate Student-t Regression Models: Pitfalls and Inference. (CentER Discussion Paper; Vol. 1997-08). Econometrics.

Multivariate Student-t Regression Models:
Pitfalls and Inference
By Carmen Fernández and Mark F.J. Steel¹
CentER for Economic Research and Department of Econometrics
Tilburg University, 5000 LE Tilburg, The Netherlands
FIRST VERSION DECEMBER 1996; CURRENT VERSION JANUARY 1997
Abstract
We consider likelihood-based inference from multivariate regression models with in-
dependent Student-t errors. Some very intriguing pitfalls of both Bayesian and classical
methods on the basis of point observations are uncovered. Bayesian inference may be
precluded as a consequence of the coarse nature of the data. Global maximization of the
likelihood function is a vacuous exercise since the likelihood function is unbounded as we
tend to the boundary of the parameter space. A Bayesian analysis on the basis of set
observations is proposed and illustrated by several examples.
KEY WORDS: Bayesian inference; Coarse data; Continuous distribution; Maximum like-
lihood; Missing data; Scale mixture of Normals.
1. INTRODUCTION
The multivariate regression model with unknown scatter matrix is widely used in
many fields of science. Applications to real data often indicate that the analytically con-
venient assumption of Normality is not quite tenable and thicker tails are called for in
order to adequately capture the main features of the data. Thus, we consider regression
error vectors that are distributed as scale mixtures of Normals. We shall mainly emphasize
the empirically relevant case of independent sampling from a multivariate Student-t distri-
bution with unknown degrees of freedom. In particular, we provide a complete Bayesian
analysis of the linear Student-t regression model, and also comment on the behaviour of
the likelihood function.

¹ Carmen Fernández is Research Fellow, CentER for Economic Research and Assistant Professor, Department of Econometrics, Tilburg University, 5000 LE Tilburg, The Netherlands. Mark Steel is Senior Research Fellow, CentER for Economic Research and Associate Professor, Department of Econometrics, Tilburg University, 5000 LE Tilburg, The Netherlands. We gratefully acknowledge the extremely valuable help of F. Chamizo in the proof of Theorem 3 as well as useful comments from B. Melenberg and W.J. Studden. Both authors benefitted from a travel grant awarded by the Netherlands Organization for Scientific Research (NWO) and were visiting the Statistics Department at Purdue University during much of the work on this paper.
The Bayesian model will be completed with a commonly used improper prior on the
regression coefficients and scatter matrix, and some proper prior on the degrees of freedom.
Section 3 examines the usual posterior inference on the basis of a recorded sample of point
observations. Even though Theorem 1 indicates that Bayesian inference is possible for
almost all samples (i.e. except for a set of zero probability under the sampling model),
problems can occur since any sample of point observations formally has probability zero of
being observed. In practice, this can become relevant due to rounding or finite precision of
the recorded observations, and we can easily end up with a sample for which inference is
precluded. This incompatibility between the continuous sampling model and any sample of
point observations can have very disturbing consequences: the posterior distribution may
not exist, even if it already existed on the basis of a subset of the sample. New observations
can, thus, have a devastating effect on the usual Bayesian inference. Fernández and Steel
(1996a) present a detailed discussion of this phenomenon in the context of a univariate
location-scale model.
Section 4 presents a solution through the use of set observations, which have positive
probability under the sampling model, and are, thus, in agreement with the sampling
assumptions. This leads to a fully coherent Bayesian analysis where new observations
can never harm the possibility of conducting inference. A Gibbs sampling scheme [see
e.g. Gelfand and Smith (1990) and Casella and George (1992)] is seen to be a convenient
way to implement this solution in practice. Some examples are presented: a univariate
regression model for the well-known stackloss data [see Brownlee (1965)], and a bivariate
location-scale model for the iris setosa data of Fisher (1936).
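To make the set-observation approach concrete, the following sketch implements a data-augmentation Gibbs sampler of the kind described above for the simplest case: a univariate Student-t location-scale model (p = 1, k = 1, x_i = 1), with the degrees of freedom ν held fixed rather than given a prior, and with each recorded value replaced by the interval it was rounded from. This is a minimal illustration under our own naming (gibbs_t_location_scale and its arguments), not the authors' code, and it conditions on ν purely to keep the example short; the paper treats ν as unknown with a proper prior.

```python
# Illustrative sketch, not the authors' code: data-augmentation Gibbs sampler for a
# univariate Student-t location-scale model (p = 1, k = 1, x_i = 1) with FIXED degrees of
# freedom nu, using set observations [lower_i, upper_i] in place of point observations.
# Prior: p(mu, sigma2) proportional to 1/sigma2, the p = 1 case of (2.3).
import numpy as np
from scipy.stats import truncnorm

def gibbs_t_location_scale(lower, upper, nu=4.0, n_iter=5000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(lower)
    y = 0.5 * (lower + upper)              # start latent responses at interval midpoints
    mu, sigma2 = y.mean(), y.var() + 1e-6
    lam = np.ones(n)                       # mixing variables lambda_i
    draws = []
    for _ in range(n_iter):
        # latent responses: Normal(mu, sigma2/lambda_i) truncated to [lower_i, upper_i]
        sd = np.sqrt(sigma2 / lam)
        a, b = (lower - mu) / sd, (upper - mu) / sd
        y = truncnorm.rvs(a, b, loc=mu, scale=sd, random_state=rng)
        # mixing variables: Gamma((nu + 1)/2, rate = (nu + (y_i - mu)^2 / sigma2) / 2)
        rate = 0.5 * (nu + (y - mu) ** 2 / sigma2)
        lam = rng.gamma(shape=0.5 * (nu + 1.0), scale=1.0 / rate)
        # location: Normal full conditional under the flat prior on mu
        w = lam.sum()
        mu = rng.normal((lam * y).sum() / w, np.sqrt(sigma2 / w))
        # scale: Inverse-Gamma(n/2, S/2) with S = sum_i lambda_i (y_i - mu)^2
        S = (lam * (y - mu) ** 2).sum()
        sigma2 = 1.0 / rng.gamma(shape=0.5 * n, scale=2.0 / S)
        draws.append((mu, sigma2))
    return np.array(draws)

# Observations recorded to the nearest integer become intervals of half-width 0.5.
obs = np.array([10.0, 11.0, 9.0, 12.0, 10.0, 35.0])      # includes one gross outlier
samples = gibbs_t_location_scale(obs - 0.5, obs + 0.5)
print(samples[1000:].mean(axis=0))                        # posterior means of (mu, sigma2)
```

Each sweep draws the latent exact responses from truncated Normals on their intervals, then the mixing variables, location and scale from standard full conditionals; because every interval has positive probability under the sampling model, adding observations can never break the sampler.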
The analysis through set observations is naturally extended to the case where some
components of the multivariate response are not observed (missing data). We illustrate
this with the artificial Murray (1977) data, extended with some extreme values in Liu and
Rubin (1995).
In addition, we find that none of the results concerning the feasibility of Bayesian
inference with set observations depend on the particular scale mixture of Normals that we
sample from.
Finally, in Section 5 the Student likelihood function for point observations is analyzed
in some detail: it is found that the likelihood is unbounded as we tend to the boundary
of the parameter space in a certain direction. This casts some doubt on the meaning and
validity of a maximum likelihood analysis of this model [as performed in e.g. Lange, Little
and Taylor (1989), Lange and Sinsheimer (1993) and Liu and Rubin (1994, 1995)]. This
behaviour of the likelihood function is illustrated through the stackloss data example, and
it also explains the source of the problems encountered by Lange et al. (1989) and Lange
and Sinsheimer (1993) when applying the EM algorithm for joint estimation of regression
coefficients, scale and degrees of freedom to the radioimmunoassay data set of Tiede and
Pagano (1979).
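The nature of this unboundedness is easy to reproduce numerically in the simplest univariate location-scale case: put the location exactly at one observation, take the degrees of freedom small enough that (n − 1)ν < 1, and let the scale shrink to zero; the exactly-fitted observation contributes a factor of order 1/σ that the remaining observations cannot offset. The sketch below is only an illustration of this phenomenon with toy data, not an analysis from the paper.

```python
# Illustrative numerical check with toy data (not an example from the paper): with the
# location put exactly at one observation and (n - 1) * nu < 1, the Student-t likelihood
# grows without bound as the scale sigma tends to 0, so a global MLE does not exist.
import numpy as np
from scipy.stats import t as student_t

y = np.array([1.2, 0.7, 2.3, -0.4, 1.9])    # toy data, n = 5
nu = 0.1                                     # (n - 1) * nu = 0.4 < 1
mu = y[0]                                    # location fixed at the first observation

for sigma in [1.0, 1e-2, 1e-4, 1e-8, 1e-16]:
    loglik = student_t.logpdf(y, df=nu, loc=mu, scale=sigma).sum()
    print(f"sigma = {sigma:.0e}   log-likelihood = {loglik:.2f}")
# The log-likelihood increases steadily as sigma -> 0.
```

Any routine that searches for a global maximum jointly over location, scale and degrees of freedom can therefore drift toward such degenerate boundary points, which is consistent with the EM difficulties mentioned above.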
All proofs are grouped in Appendix A, whereas Appendix B recalls some matricvariate probability densities used in the body of the paper. With some abuse of notation, we do not explicitly distinguish between random variables and their realizations, and p(·) (a density function) or P(·) (a measure) can correspond to either a probability measure or a general σ-finite measure. All density functions are Radon-Nikodym derivatives with respect to the Lebesgue measure in the corresponding space, unless stated otherwise.
2. THE MODEL
Observations for the p-variate response variable y_i are assumed to be generated through the linear regression model

$$ y_i = \beta' x_i + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad (2.1) $$

where β is a k × p matrix of regression coefficients, x_i is a k-dimensional vector of explanatory variables and the entire design matrix, X = (x_1, . . . , x_n)′, is taken to be of full column rank k [denoted as r(X) = k]. The error vectors ε_i are independent and identically distributed (i.i.d.) as p-variate scale mixtures of Normals with mean zero and positive definite symmetric (PDS) covariance matrix Σ. The mixing variables, denoted by λ_i, i = 1, . . . , n, follow a probability distribution P_{λ_i|ν} on ℜ^+, which can depend on a parameter ν ∈ N (possibly of infinite dimension). Thus, we have n independent replications from the sampling density
$$ p(y_i \mid \beta, \Sigma) = \int_0^{\infty} \frac{\lambda_i^{p/2}}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{\lambda_i}{2}\,(y_i - \beta' x_i)'\, \Sigma^{-1} (y_i - \beta' x_i) \right\} dP_{\lambda_i|\nu}. \qquad (2.2) $$
By changing P_{λ_i|ν} we cover the class of p-variate scale mixtures of Normals. The latter is a subset of the elliptical class [see Fang, Kotz and Ng (1990), chap. 2], with ellipsoids in ℜ^p as isodensity sets, while allowing for a wide variety of tail behaviour. Leading examples are finite mixtures of Normals, corresponding to a discrete distribution on λ_i, and multivariate Student-t distributions with ν > 0 degrees of freedom, where P_{λ_i|ν} is a Gamma distribution with unitary mean and both shape and precision parameters equal to ν/2. Most of the subsequent discussion will focus on the empirically relevant case of Student-t sampling.
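As a quick sanity check of the Student-t case of (2.2), the sketch below (an illustration of ours, not from the paper; the evaluation point, mean vector and scatter matrix are arbitrary choices) draws mixing variables from the Gamma distribution with unit mean and shape and precision ν/2, averages the Normal integrand of (2.2) over those draws, and compares the result with the multivariate Student-t density.

```python
# Illustrative Monte Carlo check (not from the paper) that the Gamma(nu/2, rate nu/2)
# mixing distribution turns (2.2) into the multivariate Student-t density.
import numpy as np
from scipy.stats import multivariate_t

rng = np.random.default_rng(1)
p, nu = 2, 5.0
mean = np.array([1.0, -0.5])                     # plays the role of beta' x_i
Sigma = np.array([[2.0, 0.7], [0.7, 1.0]])
y_point = np.array([0.3, 0.8])                   # point at which the density is evaluated

lam = rng.gamma(shape=nu / 2, scale=2 / nu, size=500_000)        # unit-mean mixing variables
q = (y_point - mean) @ np.linalg.solve(Sigma, y_point - mean)    # quadratic form in (2.2)
integrand = (lam ** (p / 2)
             / ((2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma)))
             * np.exp(-0.5 * lam * q))
print(integrand.mean())                                          # Monte Carlo value of (2.2)
print(multivariate_t.pdf(y_point, loc=mean, shape=Sigma, df=nu)) # exact Student-t density
```

The two printed numbers agree up to Monte Carlo error.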
Special cases of the model in (2.1) are the multivariate location-scale model, where k = 1 and x_i = 1, and the univariate regression model for p = 1.
The Bayesian model needs to be completed with a prior distribution for (β, Σ, ν). In particular, we assume a product structure between the three parameters where

$$ p(\beta, \Sigma) \propto |\Sigma|^{-(p+1)/2} \qquad (2.3) $$

and

P_ν is any probability measure on N. (2.4)

The prior in (2.3) is the “usual” default prior in the absence of compelling prior information on (β, Σ). Under fixed ν it corresponds to the Jeffreys’ prior under “independence” and is thus invariant under separate reparameterizations of β and of Σ.
Note that the model in (2.1) implies that all p components of y_i are regressed on the same variables x_i. Thus, we treat a special case of Zellner’s (1962) seemingly unrelated regression (SUR) model, which allows for different regressors on the p components. Alternatively, our framework can be extended to general SUR models by considering priors that impose zero restrictions on certain elements of β.
3. BAYESIAN INFERENCE USING POINT OBSERVATIONS
We now consider the feasibility of a Bayesian analysis of the model in (2.2)–(2.4) on the basis of the recorded point observations, as is the usual practice. Since the prior in (2.3) is improper, we clearly need to verify the existence of the posterior distribution. The following Theorem addresses this issue.
Theorem 1. Consider n independent replications from (2.2) with any mixing distribution P_{λ_i|ν} and the prior in (2.3)–(2.4) with any proper P_ν. Then the conditional distribution of (β, Σ) given y ≡ (y_1, . . . , y_n)′ exists if and only if n ≥ k + p.
Somewhat surprisingly, neither the mixing distribution nor the prior on ν affects the existence of the conditional distribution of the parameters given the observables (i.e. the posterior distribution). Thus, whenever n ≥ k + p, the fact that the prior is improper is of no consequence for the existence of the posterior distribution. However, probability theory tells us that a conditional distribution is only defined up to a set of measure zero in the conditioning variables. In other words, Theorem 1 assures us that p(y) < ∞ except possibly on a set of Lebesgue measure zero in ℜ^{n×p}. Theoretically, this validates inference since problems can only occur for samples that have zero probability of being observed. However, as stressed in Fernández and Steel (1996a), any recorded sample of point observations has zero probability of occurrence under any continuous sampling distribution. Thus, Theorem 1 does not guarantee that p(y) < ∞ for our particular observed sample, and the latter has to be verified explicitly. Note that this problem stems from an inherent violation of the rules of probability calculus, since the recorded observations are in contradiction with the assumed sampling model, and is in no way linked to the improperness of the prior [see Fernández and Steel (1996a) for a more detailed discussion].
If we complement Theorem 1 by considering any possible point y ∈ ℜ^{n×p}, Lemma 1 in the Appendix shows that both P_{λ_i|ν} and P_ν can intervene. It is immediate from Lemma 1 that

for finite mixtures of Normals, p(y) < ∞ if and only if r(X : y) = k + p,   (3.1)

which is the minimal possible requirement for any scale mixture of Normals. In the sequel of this Section, we shall, therefore, assume that this rank condition holds.
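The rank condition in (3.1) is straightforward to verify numerically for a given design matrix and sample; the short sketch below (our own helper, not from the paper) does so and also constructs a degenerate sample for which the condition, and hence inference under finite mixtures of Normals, fails.

```python
# Illustrative check (not from the paper) of the rank condition r(X : y) = k + p in (3.1).
import numpy as np

def satisfies_rank_condition(X, y):
    """Return True when the augmented matrix (X : y) has full column rank k + p."""
    n, k = X.shape
    p = y.shape[1]
    return np.linalg.matrix_rank(np.hstack([X, y])) == k + p

rng = np.random.default_rng(0)
n, k, p = 10, 2, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])    # intercept plus one regressor
y = X @ rng.normal(size=(k, p)) + rng.standard_t(df=3, size=(n, p))
print(satisfies_rank_condition(X, y))                    # True for continuously generated data

y_degenerate = np.hstack([X[:, :1], X[:, :1]])           # responses lying in the column space of X
print(satisfies_rank_condition(X, y_degenerate))         # False: inference would be precluded
```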
Let us now analyze the more challenging case of Student-t sampling, where we shall use the following Definition:
Definition 1. For a design matrix X and a sample y ∈ ℜ^{n×p}, s_j, j = 1, . . . , p, is the largest number of observations such that the rank of the corresponding submatrix of X is k while the rank of the corresponding submatrix of (X : y) is k + p − j.
Clearly, since r(X : y) = k + p, we obtain that k ≤ s_p < s_{p−1} < . . . < s_1 < n. Now we can present the following Theorem.
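Since Definition 1 is purely combinatorial, the numbers s_j can be computed by exhaustive search over subsets of observations when n is small. The sketch below (our own illustration, not code from the paper) does this for the location-scale case and reproduces the ordering k ≤ s_p < . . . < s_1 < n noted above.

```python
# Illustrative brute-force computation (not from the paper) of the quantities s_j in
# Definition 1; feasible only for small n since it enumerates all subsets of observations.
from itertools import combinations
import numpy as np

def s_values(X, y):
    """Return {j: s_j}, where s_j is the largest subset size whose X-rows have rank k
    and whose (X : y)-rows have rank k + p - j."""
    n, k = X.shape
    p = y.shape[1]
    Z = np.hstack([X, y])
    s = {}
    for j in range(1, p + 1):
        best = 0
        for m in range(1, n + 1):                 # subset sizes in increasing order
            for rows in combinations(range(n), m):
                rows = list(rows)
                if (np.linalg.matrix_rank(X[rows]) == k
                        and np.linalg.matrix_rank(Z[rows]) == k + p - j):
                    best = m
        s[j] = best
    return s

rng = np.random.default_rng(0)
n, k, p = 6, 1, 2
X = np.ones((n, k))                               # location-scale case: x_i = 1
y = rng.standard_t(df=3, size=(n, p))
print(s_values(X, y))                             # generically {1: 2, 2: 1}, so k <= s_2 < s_1 < n
```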

Citations
Book

Finite Mixture and Markov Switching Models

TL;DR: This book should help newcomers to the field to understand how finite mixture and Markov switching models are formulated, what structures they imply on the data, what they could be used for, and how they are estimated.
Journal ArticleDOI

Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t‐distribution

TL;DR: In this paper, a fairly general procedure is studied to perturb a multivariate density satisfying a weak form of multivariate symmetry, and to generate a whole set of non-symmetric densities.
Journal ArticleDOI

Phylogeography Takes a Relaxed Random Walk in Continuous Space and Time

TL;DR: A Bayesian statistical approach is presented to infer continuous phylogeographic diffusion using random walk models while simultaneously reconstructing the evolutionary history in time from molecular sequence data and demonstrates increased statistical efficiency in spatial reconstructions of overdispersed random walks.
References
Book

The Theory of Matrices

TL;DR: In this article, the Routh-Hurwitz problem of singular pencils of matrices has been studied in the context of systems of linear differential equations with variable coefficients, and its applications to the analysis of complex matrices have been discussed.
Journal ArticleDOI

An Efficient Method of Estimating Seemingly Unrelated Regressions and Tests for Aggregation Bias

TL;DR: In this paper, a method of estimating the parameters of a set of regression equations is reported which involves application of Aitken's generalized least-squares to the whole system of equations.
Journal Article

Sampling-based approaches to calculating marginal densities

TL;DR: Stochastic substitution, the Gibbs sampler, and the sampling-importance-resampling algorithm can be viewed as three alternative sampling- (or Monte Carlo-) based approaches to the calculation of numerical estimates of marginal probability distributions.
Journal ArticleDOI

Explaining the Gibbs Sampler

TL;DR: A simple explanation of how and why the Gibbs sampler works is given and analytically establish its properties in a simple case and insight is provided for more complicated cases.
Frequently Asked Questions (11)
Q1. What are the contributions in this paper?

The authors consider likelihood-based inference from multivariate regression models with independent Student-t errors. 

Once a posterior is found to exist, this does not guarantee inference on the basis of an extended sample! The analysis is now fully coherent, in that new observations can never destroy the possibility of conducting inference. 

For a design matrix X and a sample y ∈ ℜ^{n×p}, s_j, j = 1, . . . , p, is the largest number of observations such that the rank of the corresponding submatrix of X is k while the rank of the corresponding submatrix of (X : y) is k + p − j. Clearly, since r(X : y) = k + p, the authors obtain that k ≤ s_p < s_{p−1} < . . . < s_1 < n. 

A simple Gibbs sampling strategy is proposed to implement Bayesian analysis with set observations and a number of Examples is considered, all under Student sampling with an Exponential prior on ν. 

It is immediate from Lemma 1 that, for finite mixtures of Normals, p(y) < ∞ if and only if r(X : y) = k + p (3.1), which is the minimal possible requirement for any scale mixture of Normals. 

In general, Bayesian inference using set observations can easily be implemented through a Gibbs sampler on the parameters augmented with y = (y_1, . . . , y_n)′. 

As explained in the discussion of Theorem 3, the assumption of Theorem 4 (ii) is most easily interpreted in the location-scale case (k = 1 and x_i = 1), where r(X : y) < 1 + p means that there exists a (p − 1)-dimensional affine space that intersects all of the sets S_1, . . . , S_n. 

As expected, the analysis based on all sixteen set observations identifies the four extra observations [i.e. the last four in (4.6)] as outliers through small values of the mixing variables λi associated with these observations. 

classical inference could be based on efficient likelihood estimation [Lehmann (1983, chap. 6)], grouped likelihoods [see Giesbrecht and Kempthorne (1976) for a lognormal model and Beckman and Johnson (1987) for the Student-t case], sample percentiles [Resek (1976)], modified likelihood [as in Cheng and Iles (1987)] or spacings methods [as in Cheng and Amin (1979)]. 

Even though Theorem 1 indicates that Bayesian inference is possible for almost all samples (i.e. except for a set of zero probability under the sampling model), problems can occur since any sample of point observations formally has probability zero of being observed. 

The Bayesian model will be completed with a commonly used improper prior on the regression coefficients and scatter matrix, and some proper prior on the degrees of freedom.