Statistical Computations on Grassmann and Stiefel Manifolds for Image and Video-Based Recognition
Pavan Turaga, Student Member, IEEE, Ashok Veeraraghavan, Member, IEEE,
Anuj Srivastava, Senior Member, IEEE, and Rama Chellappa, Fellow, IEEE
Abstract
In this paper, we examine image and video-based recognition applications where the underlying
models have a special structure: the linear subspace structure. We discuss how commonly used
parametric models for videos and image-sets can be described using the unified framework of Grassmann
and Stiefel manifolds. We first show that the parameters of linear dynamic models are finite-dimensional
linear subspaces of appropriate dimensions. Unordered image-sets as samples from a finite-dimensional
linear subspace naturally fall under this framework. We show that inference over subspaces
can be naturally cast as an inference problem on the Grassmann manifold.
To perform recognition using subspace-based models, we need tools from the Riemannian geometry
of the Grassmann manifold. This involves a study of the geometric properties of the space, appropriate
definitions of Riemannian metrics, and definitions of geodesics. Further, we derive statistical models
of inter- and intra-class variations that respect the geometry of the space. We apply techniques such as
intrinsic and extrinsic statistics to enable maximum-likelihood classification. We also provide algorithms
for unsupervised clustering derived from the geometry of the manifold. Finally, we demonstrate the
improved performance of these methods in a wide variety of vision applications such as activity
recognition, video-based face recognition, object recognition from image-sets, and activity-based video
clustering.

A preliminary version of this paper appeared in [1].

Pavan Turaga and Rama Chellappa are with the University of Maryland Institute for Advanced Computer Studies (UMIACS) ({pturaga,rama}@umiacs.umd.edu). Ashok Veeraraghavan is with the Mitsubishi Electric Research Labs (MERL) (veerarag@merl.com). Anuj Srivastava is with the Dept. of Statistics, Florida State University (anuj@stat.fsu.edu). This work was partially supported by ONR Grant N00014-09-1-0664.

December 1, 2010 DRAFT
Index Terms
Image and Video Models, Feature Representation, Statistical Models, Manifolds, Stiefel, Grassmann
I. INTRODUCTION
Many applications in computer vision such as dynamic textures [2], [3], human activity modeling
and recognition [4], [5], video-based face recognition [6], and shape analysis [7], [8] involve
learning and recognition of patterns from exemplars which obey certain constraints. To enable
this study, we often make simplifying assumptions about the image-formation process, such as a
pinhole camera model or the Lambertian reflectance model. These assumptions lead to constraints
on the set of images thus obtained. A classic example of such a constraint is that images of a
convex object under all possible illumination conditions form a ‘cone’ in image-space [9]. Once
the underlying assumptions and constraints are well understood, the next important step is to
design inference algorithms that are consistent with the algebra and/or geometry of the constraint
set. In this paper, we shall examine image and video-based recognition applications where the
models have a special structure: the linear subspace structure.
In many of these applications, given a database of examples and a query, the following two
questions are to be addressed: a) what is the ‘closest’ example to the query in the database?
b) what is the ‘most probable’ class to which the query belongs? A systematic solution to
these problems involves a study of the underlying constraints that the data obeys. The answer to
the first question involves a study of the geometric properties of the space, which then leads to
appropriate definitions of Riemannian metrics and further to the definition of geodesics. The
answer to the second question involves statistical modeling of inter- and intra-class variations.
It is well-known that the space of linear subspaces can be viewed as a Riemannian manifold
[10], [11]. More formally, the space of d-dimensional subspaces of R^n is called the Grassmann
manifold. On a related note, the Stiefel manifold is the space of ordered d-tuples of orthonormal
vectors in R^n. The study of these manifolds has important consequences for applications such
as dynamic textures [2], [3], human activity modeling and recognition [4], [5], video-based face
recognition [6], and shape analysis [7], [8], where data naturally lie either on the Stiefel or the
Grassmann manifold. Estimating linear models of data is standard methodology in many applications
and manifests in various forms such as linear regression, linear classification, and linear subspace
estimation. However, comparatively less attention has been devoted to statistical inference on the
space of linear subspaces.
A. Prior Work
The Grassmann manifold’s geometric properties have been utilized in certain vision problems
involving subspace constraints. Examples include [12], which deals with optimization over the
Grassmann manifold for obtaining informative projections. The Grassmann manifold structure
of the affine shape space is also exploited in [13] to perform affine-invariant clustering of shapes.
The work in [14] performs discriminative classification over subspaces for object recognition
tasks by using Mercer kernels on the Grassmann manifold. In [15], a face image and its
perturbations due to registration errors are approximated as a linear subspace, and hence are
embedded as points on a Grassmann manifold. Most of these methods either do not employ
statistics on the Grassmann manifold or are tuned to specific domains, lacking generality. The
authors of [16] exploited the geometry of the Grassmann manifold for subspace tracking in
array signal processing applications. On a related note, the geometry of the related Stiefel
manifold has been found to be useful in applications where, in addition to the subspace structure,
the specific choice of basis vectors is also important [17]. The methods that we present in
this paper form a comprehensive (though not exhaustive) set of tools that draw upon the
Riemannian geometry of the Grassmann manifold. Along with the mathematical formulations,
we also present efficient algorithms to perform these computations.
The geometric properties of general Riemannian manifolds form the subject matter of differential
geometry; a good introduction can be found in [18]. Statistical methods on manifolds have
been studied for several years in the statistics community. Some of the landmark papers in this
area include [19], [20], [21]; however, an exhaustive survey is beyond the scope of this paper. The
geometric properties of the Stiefel and Grassmann manifolds have received significant attention.
A good introduction to the geometry of the Stiefel and Grassmann manifolds can be found in
[10], which introduced gradient methods on these manifolds in the context of eigenvalue problems.
These problems mainly involved optimization of cost functions with orthogonality constraints.
A compilation of techniques for solving optimization problems with such matrix manifolds is
provided in [22]. Algorithmic computations of the geometric operations in such problems were
discussed in [11]. A compilation of research results on statistical analysis on the Stiefel and
Grassmann manifolds can be found in [23].
In addition to the Grassmann manifold, general Riemannian manifolds have found important
applications in the vision community. A recently developed formulation based on the covariance
of features in image-patches has found several applications such as texture classification [24],
pedestrian detection [25], and tracking [26]. The Riemannian geometry of covariance matrices
was exploited effectively in all these applications to design state-of-the-art algorithms. More
recently, [27] provided an extension of Euclidean mean-shift clustering to the case of Riemannian
manifolds.
Shape analysis is another application area where statistics on Riemannian manifolds have found
wide applicability. Theoretical foundations for manifold-based shape analysis were described in
[7], [8]. Statistical learning of shape classes using non-linear shape manifolds was presented in
[28], where statistics are learnt on the manifold’s tangent space. Using a similar formulation, the
variations due to execution-rate changes in human activities are modeled as a distribution over
time-warp functions, which are considered as points on a spherical manifold in [29]. This was
used for execution rate-invariant recognition of human activities.
A preliminary version of this paper was presented in [1], which used extrinsic methods for
statistical modeling on the Grassmann manifold. This paper provides a mathematically well-
grounded basis for these methods, where the specific choice of the method in [1] is interpreted as
a special case of using a non-parametric density estimator with an extrinsic divergence measure.
In this paper, we provide more detailed analysis and show how to exploit the geometry of the
manifold to derive intrinsic statistical models. This provides a more consistent approach than
the extrinsic methods of [1]. Further, the dimensionality of the manifold presents a significant
road-block for computer implementation of Riemannian computations. Straightforward
implementation of the formulas for geodesic distances, exponential and inverse-exponential maps
given in earlier work such as [10], [11], [27] is computationally prohibitive in large dimensions.
This is especially true of our applications, where we deal with high-dimensional image and video
data. Toward this end, we also employ numerically efficient versions of these computations.
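As an illustration of the kind of savings involved (a sketch of one standard route, not necessarily the exact algorithm this paper uses later), consider the Grassmann exponential map. Using the geodesic formula of [10], a geodesic starting at an orthonormal basis Y in the direction of a tangent vector H (with Y^T H = 0) can be evaluated from a thin SVD of H at O(nd^2) cost, avoiding any O(n^3) operation on n x n matrices. A minimal sketch, assuming NumPy:

```python
import numpy as np

def grassmann_exp(Y, H, t=1.0):
    """Point at time t on the Grassmann geodesic from span(Y) with velocity H.

    Y: n x d with orthonormal columns; H: n x d tangent vector (Y^T H = 0).
    Uses the thin-SVD geodesic formula of [10]:
        Y(t) = Y V cos(t*S) V^T + U sin(t*S) V^T,  where H = U S V^T.
    Cost is O(n d^2) rather than O(n^3).
    """
    U, s, Vt = np.linalg.svd(H, full_matrices=False)  # U: n x d, s: d, Vt: d x d
    cos_ts = np.diag(np.cos(t * s))
    sin_ts = np.diag(np.sin(t * s))
    Yt = Y @ Vt.T @ cos_ts @ Vt + U @ sin_ts @ Vt
    # Re-orthonormalize to suppress floating-point drift.
    Q, _ = np.linalg.qr(Yt)
    return Q
```

At t = 0 the formula returns the starting subspace, and for any t the result has orthonormal columns, so it remains a valid point on the manifold.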
Contributions: We first show how a large class of problems drawn from face, activity, and
object recognition can be recast as statistical inference problems on the Stiefel and/or Grassmann
manifolds. Then, we present methods to solve these problems using the Riemannian geometry
of the manifolds. We also discuss some recently proposed extrinsic approaches to statistical
modeling on the Grassmann manifold. We present a wide range of experimental evaluations to
demonstrate the effectiveness of these approaches and provide a comprehensive comparison.
Organization of the paper: In Section II, we discuss parametric subspace-based models
of image-sets and videos and show how the study of these models can be recast as a study of
the Grassmann manifold. Section III introduces the special orthogonal group and its quotient
spaces: the Stiefel and the Grassmann manifolds. Section IV discusses statistical methods that
follow from the quotient interpretation of these manifolds. In Section V, we develop supervised
and unsupervised learning algorithms. Complexity issues and numerically efficient algorithms for
performing Riemannian computations are discussed in Section VI. In Section VII, we demonstrate
the strength of the framework for several applications, including activity recognition, video-based
face recognition, object matching, and activity-based clustering. Finally, concluding remarks are
presented in Section VIII.
II. MODELS FOR VIDEOS AND IMAGES
A. Spatio-temporal dynamical models and the ARMA model
A wide variety of spatio-temporal data have often been modeled as realizations of dynamical
models. Examples include dynamic textures [2], human joint angle trajectories [4], and silhouettes
[5]. Linear dynamical systems represent a well-known class of parametric models for such
time-series data; in particular, dynamic textures, human joint angle trajectories, shape sequences,
and videos of faces are frequently modeled using the autoregressive and moving average (ARMA)
model [2], [4], [5], [6]. The ARMA model
equations are given by

f(t) = Cz(t) + w(t),    w(t) ~ N(0, R)    (1)
z(t+1) = Az(t) + v(t),    v(t) ~ N(0, Q)    (2)

where z ∈ R^d is the hidden state vector, A ∈ R^{d×d} the transition matrix, and C ∈ R^{p×d} the
measurement matrix. f ∈ R^p represents the observed features, while w and v are noise components
modeled as normal with zero mean and covariances R ∈ R^{p×p} and Q ∈ R^{d×d}, respectively.
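A standard closed-form procedure for estimating these parameters from data (of the kind used for dynamic textures, e.g. [2]) takes C from the top-d left singular vectors of the observation matrix and A by least squares on the resulting state sequence. The sketch below, assuming NumPy, illustrates the idea on a feature matrix F whose columns are the observed f(t); the function name is illustrative:

```python
import numpy as np

def fit_arma(F, d):
    """Closed-form ARMA/LDS parameter estimates from features F (p x T).

    Returns (C, A, Z): measurement matrix, transition matrix, and the
    d x T estimated state sequence. C is taken from the top-d left
    singular vectors of F; A solves z(t+1) ≈ A z(t) in least squares.
    """
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    C = U[:, :d]                      # p x d, orthonormal columns
    Z = np.diag(s[:d]) @ Vt[:d, :]    # d x T estimated states
    # Least-squares transition: A = Z_{2:T} pinv(Z_{1:T-1})
    A = Z[:, 1:] @ np.linalg.pinv(Z[:, :-1])
    return C, A, Z
```

The column span of C (and of the observability matrix built from C and A) is exactly the kind of subspace-valued parameter that the rest of the paper treats as a point on the Grassmann manifold.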