Statistical Computations on Grassmann and Stiefel Manifolds for Image and Video-Based Recognition
Pavan Turaga, Student Member, IEEE, Ashok Veeraraghavan, Member, IEEE,
Anuj Srivastava, Senior Member, IEEE, and Rama Chellappa, Fellow, IEEE
Abstract
In this paper, we examine image and video-based recognition applications where the underlying
models have a special structure: the linear subspace structure. We discuss how commonly used
parametric models for videos and image-sets can be described using the unified framework of Grassmann
and Stiefel manifolds. We first show that the parameters of linear dynamic models are finite-dimensional
linear subspaces of appropriate dimensions. Unordered image-sets as samples from a finite-dimensional
linear subspace naturally fall under this framework. We show that inference over subspaces
can be naturally cast as an inference problem on the Grassmann manifold.
To perform recognition using subspace-based models, we need tools from the Riemannian geometry
of the Grassmann manifold. This involves a study of the geometric properties of the space, appropriate
definitions of Riemannian metrics, and definitions of geodesics. Further, we derive statistical models
of inter- and intra-class variations that respect the geometry of the space. We apply techniques such as
intrinsic and extrinsic statistics to enable maximum-likelihood classification. We also provide algorithms
for unsupervised clustering derived from the geometry of the manifold. Finally, we demonstrate the
improved performance of these methods in a wide variety of vision applications such as activity
recognition, video-based face recognition, object recognition from image-sets, and activity-based video
clustering.

A preliminary version of this paper appeared in [1].

Pavan Turaga and Rama Chellappa are with the University of Maryland Institute for Advanced Computer Studies (UMIACS) ({pturaga,rama}@umiacs.umd.edu). Ashok Veeraraghavan is with the Mitsubishi Electric Research Labs (MERL) (veerarag@merl.com). Anuj Srivastava is with the Dept. of Statistics, Florida State University (anuj@stat.fsu.edu). This work was partially supported by ONR Grant N00014-09-1-0664.

December 1, 2010 DRAFT
Index Terms
Image and Video Models, Feature Representation, Statistical Models, Manifolds, Stiefel, Grassmann
I. INTRODUCTION
Many applications in computer vision such as dynamic textures [2], [3], human activity modeling
and recognition [4], [5], video-based face recognition [6], and shape analysis [7], [8] involve
learning and recognition of patterns from exemplars which obey certain constraints. To enable
this study, we often make simplifying assumptions about the image-formation process, such as a
pinhole camera model or the Lambertian reflectance model. These assumptions lead to constraints
on the set of images thus obtained. A classic example of such a constraint is that images of a
convex object under all possible illumination conditions form a ‘cone’ in image-space [9]. Once
the underlying assumptions and constraints are well understood, the next important step is to
design inference algorithms that are consistent with the algebra and/or geometry of the constraint
set. In this paper, we shall examine image and video-based recognition applications where the
models have a special structure: the linear subspace structure.
In many of these applications, given a database of examples and a query, the following two
questions are to be addressed: a) what is the ‘closest’ example to the query in the database?
b) what is the ‘most probable’ class to which the query belongs? A systematic solution to
these problems involves a study of the underlying constraints that the data obeys. The answer to
the first question involves a study of the geometric properties of the space, which then leads to
appropriate definitions of Riemannian metrics and further to the definition of geodesics. The
answer to the second question involves statistical modeling of inter- and intra-class variations.
It is well-known that the space of linear subspaces can be viewed as a Riemannian manifold
[10], [11]. More formally, the space of d-dimensional subspaces of R^n is called the Grassmann
manifold. On a related note, the Stiefel manifold is the space of ordered d-tuples of orthonormal
vectors in R^n. The study of these manifolds has important consequences for applications such
as dynamic textures [2], [3], human activity modeling and recognition [4], [5], video-based face
recognition [6], and shape analysis [7], [8], where data naturally lie either on the Stiefel or the
Grassmann manifold. Estimating linear models of data is standard methodology in many applications
and manifests in various forms such as linear regression, linear classification, and linear subspace
estimation. However, comparatively less attention has been devoted to statistical inference on the
space of linear subspaces.
A. Prior Work
The Grassmann manifold’s geometric properties have been utilized in certain vision problems
involving subspace constraints. Examples include [12], which deals with optimization over the
Grassmann manifold for obtaining informative projections. The Grassmann manifold structure
of the affine shape space is also exploited in [13] to perform affine-invariant clustering of shapes.
The work in [14] performs discriminative classification over subspaces for object recognition
tasks by using Mercer kernels on the Grassmann manifold. In [15], a face image and its
perturbations due to registration errors are approximated as a linear subspace, and hence are
embedded as points on a Grassmann manifold. Most of these methods either do not employ
statistics on the Grassmann manifold or are tuned to specific domains, lacking generality. The
authors of [16] exploited the geometry of the Grassmann manifold for subspace tracking in
array signal processing applications. On a related note, the geometry of the related Stiefel
manifold has been found to be useful in applications where, in addition to the subspace structure,
the specific choice of basis vectors is also important [17]. The methods that we present in
this paper form a comprehensive (though not exhaustive) set of tools that draw upon the
Riemannian geometry of the Grassmann manifold. Along with the mathematical formulations,
we also present efficient algorithms to perform these computations.
The geometric properties of general Riemannian manifolds form the subject matter of differential
geometry; a good introduction can be found in [18]. Statistical methods on manifolds have
been studied for several years in the statistics community. Some of the landmark papers in this
area include [19], [20], [21]; however, an exhaustive survey is beyond the scope of this paper. The
geometric properties of the Stiefel and Grassmann manifolds have received significant attention.
A good introduction to the geometry of the Stiefel and Grassmann manifolds can be found in
[10], which introduced gradient methods on these manifolds in the context of eigenvalue problems.
These problems mainly involved optimization of cost functions with orthogonality constraints.
A compilation of techniques for solving optimization problems with such matrix manifolds is
provided in [22]. Algorithmic computations of the geometric operations in such problems were
discussed in [11]. A compilation of research results on statistical analysis on the Stiefel and
Grassmann manifolds can be found in [23].
In addition to the Grassmann manifold, general Riemannian manifolds have found important
applications in the vision community. A recently developed formulation based on the covariance
of features in image-patches has found several applications such as texture classification [24],
pedestrian detection [25], and tracking [26]. The Riemannian geometry of covariance matrices
was exploited effectively in all these applications to design state-of-the-art algorithms. More
recently, [27] provided an extension of Euclidean mean-shift clustering to the case of Riemannian
manifolds.
Shape analysis is another application area where statistics on Riemannian manifolds have found
wide applicability. Theoretical foundations for manifold-based shape analysis were described in
[7], [8]. Statistical learning of shape classes using non-linear shape manifolds was presented in
[28], where statistics are learnt on the manifold’s tangent space. Using a similar formulation, the
variations due to execution-rate changes in human activities are modeled as a distribution over
time-warp functions, which are considered as points on a spherical manifold in [29]. This was
used for execution rate-invariant recognition of human activities.
A preliminary version of this paper was presented in [1], which used extrinsic methods for
statistical modeling on the Grassmann manifold. This paper provides a mathematically well-
grounded basis for these methods, where the specific choice of the method in [1] is interpreted as
a special case of using a non-parametric density estimator with an extrinsic divergence measure.
In this paper, we provide more detailed analysis and show how to exploit the geometry of the
manifold to derive intrinsic statistical models. This provides a more consistent approach than
the extrinsic methods of [1]. Further, the dimensionality of the manifold presents a significant
road-block for computer implementation of Riemannian computations. Straightforward
implementation of the formulas for geodesic distances, exponential and inverse-exponential maps
given in earlier work such as [10], [11], [27] is computationally prohibitive in large dimensions.
This is especially true of our applications, where we deal with high-dimensional image and video
data. Toward this end, we also employ numerically efficient versions of these computations.
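As an illustration of the kind of savings involved (a sketch of one standard route, not necessarily the exact algorithm this paper uses later), consider the Grassmann exponential map. Using the geodesic formula of [10], a geodesic starting at an orthonormal basis Y in the direction of a tangent vector H (with Y^T H = 0) can be evaluated from a thin SVD of H at O(nd^2) cost, avoiding any O(n^3) operation on n x n matrices. A minimal sketch, assuming NumPy:

```python
import numpy as np

def grassmann_exp(Y, H, t=1.0):
    """Point at time t on the Grassmann geodesic from span(Y) with velocity H.

    Y: n x d with orthonormal columns; H: n x d tangent vector (Y^T H = 0).
    Uses the thin-SVD geodesic formula of [10]:
        Y(t) = Y V cos(t*S) V^T + U sin(t*S) V^T,  where H = U S V^T.
    Cost is O(n d^2) rather than O(n^3).
    """
    U, s, Vt = np.linalg.svd(H, full_matrices=False)  # U: n x d, s: d, Vt: d x d
    cos_ts = np.diag(np.cos(t * s))
    sin_ts = np.diag(np.sin(t * s))
    Yt = Y @ Vt.T @ cos_ts @ Vt + U @ sin_ts @ Vt
    # Re-orthonormalize to suppress floating-point drift.
    Q, _ = np.linalg.qr(Yt)
    return Q
```

At t = 0 the formula returns the starting subspace, and for any t the result has orthonormal columns, so it remains a valid point on the manifold.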
Contributions: We first show how a large class of problems drawn from face, activity, and
object recognition can be recast as statistical inference problems on the Stiefel and/or Grassmann
manifolds. Then, we present methods to solve these problems using the Riemannian geometry
of the manifolds. We also discuss some recently proposed extrinsic approaches to statistical
modeling on the Grassmann manifold. We present a wide range of experimental evaluations to
demonstrate the effectiveness of these approaches and provide a comprehensive comparison.
Organization of the paper: In Section II, we discuss parametric subspace-based models
of image-sets and videos and show how the study of these models can be recast as a study of
the Grassmann manifold. Section III introduces the special orthogonal group and its quotient
spaces: the Stiefel and the Grassmann manifolds. Section IV discusses statistical methods that
follow from the quotient interpretation of these manifolds. In Section V, we develop supervised
and unsupervised learning algorithms. Complexity issues and numerically efficient algorithms for
performing Riemannian computations are discussed in Section VI. In Section VII, we demonstrate
the strength of the framework for several applications, including activity recognition, video-based
face recognition, object matching, and activity-based clustering. Finally, concluding remarks are
presented in Section VIII.
II. MODELS FOR VIDEOS AND IMAGES
A. Spatio-temporal dynamical models and the ARMA model
A wide variety of spatio-temporal data have often been modeled as realizations of dynamical
models. Examples include dynamic textures [2], human joint angle trajectories [4], and silhouettes
[5]. Linear dynamical systems represent a well-known class of parametric models for such
time-series data; in particular, dynamic textures, human joint angle trajectories, shape sequences,
and videos of faces are frequently modeled using the autoregressive and moving average (ARMA)
model [2], [4], [5], [6]. The ARMA model
equations are given by

f(t) = Cz(t) + w(t),    w(t) ~ N(0, R)    (1)
z(t+1) = Az(t) + v(t),    v(t) ~ N(0, Q)    (2)

where z ∈ R^d is the hidden state vector, A ∈ R^{d×d} the transition matrix, and C ∈ R^{p×d} the
measurement matrix. f ∈ R^p represents the observed features, while w and v are noise components
modeled as normal with zero mean and covariances R ∈ R^{p×p} and Q ∈ R^{d×d}, respectively.
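A standard closed-form procedure for estimating these parameters from data (of the kind used for dynamic textures, e.g. [2]) takes C from the top-d left singular vectors of the observation matrix and A by least squares on the resulting state sequence. The sketch below, assuming NumPy, illustrates the idea on a feature matrix F whose columns are the observed f(t); the function name is illustrative:

```python
import numpy as np

def fit_arma(F, d):
    """Closed-form ARMA/LDS parameter estimates from features F (p x T).

    Returns (C, A, Z): measurement matrix, transition matrix, and the
    d x T estimated state sequence. C is taken from the top-d left
    singular vectors of F; A solves z(t+1) ≈ A z(t) in least squares.
    """
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    C = U[:, :d]                      # p x d, orthonormal columns
    Z = np.diag(s[:d]) @ Vt[:d, :]    # d x T estimated states
    # Least-squares transition: A = Z_{2:T} pinv(Z_{1:T-1})
    A = Z[:, 1:] @ np.linalg.pinv(Z[:, :-1])
    return C, A, Z
```

The column span of C (and of the observability matrix built from C and A) is exactly the kind of subspace-valued parameter that the rest of the paper treats as a point on the Grassmann manifold.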