Outlier-Robust PCA: The High-Dimensional Case
Huan Xu, Constantine Caramanis, Member, and Shie Mannor, Senior Member

Preliminary versions of these results have appeared, in part, in the Proceedings of the 46th Annual Allerton Conference on Control, Communication, and Computing, and at the 23rd International Conference on Learning Theory (COLT).
H. Xu is with the Department of Mechanical Engineering, National University of Singapore, Singapore (email: mpexuh@nus.edu.sg).
C. Caramanis is with the Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712 USA (email: caramanis@mail.utexas.edu).
S. Mannor is with the Department of Electrical Engineering, Technion, Israel (email: shie@ee.technion.ac.il).
Abstract
Principal Component Analysis plays a central role in statistics, engineering, and science. Because of the prevalence of corrupted data in real-world applications, much research has focused on developing robust algorithms. Perhaps surprisingly, these algorithms are unequipped, indeed unable, to deal with outliers in the high-dimensional setting where the number of observations is of the same magnitude as the number of variables of each observation, and the data set contains some (arbitrarily) corrupted observations. We propose a High-dimensional Robust Principal Component Analysis (HR-PCA) algorithm that is as efficient as PCA, robust to contaminated points, and easily kernelizable. In particular, our algorithm achieves maximal robustness: it has a breakdown point of 50% (the best possible), while all existing algorithms have a breakdown point of zero. Moreover, our algorithm recovers the optimal solution exactly in the case where the number of corrupted points grows sublinearly in the dimension.
Index Terms
Statistical Learning, Dimension Reduction, Principal Component Analysis, Robustness, Outlier
I. INTRODUCTION
The analysis of very high dimensional data, data sets where the dimensionality of each observation is comparable to or even larger than the number of observations, has drawn increasing attention in the last few decades [1], [2]. Individual observations can be curves, spectra, images, movies, behavioral characteristics or preferences, or even a genome; a single observation's dimensionality can be astronomical, and, critically, it can equal or even outnumber the number of samples available. Practical high dimensional data examples include DNA microarray data, financial data, climate data, web search engine data, and consumer data. In addition, the now-standard "Kernel Trick" [3], a pre-processing routine which non-linearly maps the observations into a (possibly infinite dimensional) Hilbert space, transforms virtually every data set into a high dimensional one. Efforts to extend traditional statistical tools (designed for the low dimensional case) into this high-dimensional regime are often (if not generally) unsuccessful. This fact has stimulated research on formulating fresh data-analysis techniques able to cope with such a "dimensionality explosion."
Principal Component Analysis (PCA) is perhaps one of the most widely used statistical techniques
for dimensionality reduction. Work on PCA dates back to the beginning of the 20th century [4], and has
become one of the most important techniques for data compression and feature extraction. It is widely used
in statistical data analysis, communication theory, pattern recognition, image processing and far beyond
[5]. The standard PCA algorithm constructs the optimal (in a least-square sense) subspace approximation
to observations by computing the eigenvectors or Principal Components (PCs) of the sample covariance
or correlation matrix. Its broad application can be attributed primarily to two features: its success in
the classical regime for recovering a low-dimensional subspace even in the presence of noise, and also
the existence of efficient algorithms for computation. Indeed, PCA is nominally a non-convex problem,
which we can, nevertheless, solve, thanks to the magic of the SVD which allows us to maximize a convex
function. It is well-known, however, that precisely because of the quadratic error criterion, standard PCA is exceptionally fragile, and the quality of its output can suffer dramatically in the face of only a few
(even a vanishingly small fraction) grossly corrupted points. Such non-probabilistic errors may be present
due to data corruption stemming from sensor failures, malicious tampering, or other reasons. Attempts to use error functions that grow more slowly than the quadratic, and might therefore be more robust to outliers, result in non-convex (and intractable) optimization problems.
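For reference, the standard (non-robust) PCA computation described above can be sketched as follows. This is a minimal illustration, not the algorithm proposed in this paper; the function name and interface are our own.

import numpy as np

def standard_pca(Y, d):
    """Return the top-d principal components (columns) of the n x p data matrix Y.

    Ordinary, non-robust PCA: center the data, then take the top right-singular
    vectors of the centered matrix (equivalently, the top eigenvectors of the
    sample covariance matrix).
    """
    Y_centered = Y - Y.mean(axis=0)              # remove the sample mean
    # Economy-size SVD; rows of Vt are right-singular vectors of Y_centered.
    _, _, Vt = np.linalg.svd(Y_centered, full_matrices=False)
    return Vt[:d].T                              # p x d matrix of principal components

Because the objective is the quadratic reconstruction error, a single point of very large norm can rotate the subspace returned by such a routine essentially arbitrarily, which is the fragility discussed above.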
In this paper, we consider a high-dimensional counterpart of Principal Component Analysis (PCA) that
is robust to the existence of arbitrarily corrupted or contaminated data. We start with the standard statistical
setup: a low dimensional signal is (linearly) mapped to a very high dimensional space, after which point
high-dimensional Gaussian noise is added, to produce points that no longer lie on a low dimensional
subspace. At this point, we deviate from the standard setting in two important ways: (1) a constant
fraction of the points are arbitrarily corrupted in a perhaps non-probabilistic manner. We emphasize that
these “outliers” can be entirely arbitrary, rather than from the tails of any particular distribution, e.g., the
noise distribution; we call the remaining points “authentic”; (2) the number of data points is of the same
order as (or perhaps considerably smaller than) the dimensionality. As we discuss below, these two points
confound (to the best of our knowledge) all tractable existing Robust PCA algorithms.
A fundamental feature of the high dimensionality is that the noise is large in some direction, with very
high probability, and therefore definitions of “outliers” from classical statistics are of limited use in this
setting. Another important property of this setup is that the signal-to-noise ratio (SNR) can go to zero, as
the
2
norm of the high-dimensional Gaussian noise scales as the square root of the dimensionality. In the
standard (i.e., low-dimensional case), a low SNR generally implies that the signal cannot be recovered,
even without any corrupted points.
The Main Result
Existing algorithms fail spectacularly in this regime: to the best of our knowledge, there is no algorithm that can provide any nontrivial bounds on the quality of the solution in the presence of even a vanishing fraction of corrupted points. In this paper we do just this. We provide a novel robust PCA algorithm we call High-dimensional Robust PCA (HR-PCA). HR-PCA is efficient (no harder than PCA), and robust with provable nontrivial performance bounds with up to 50% arbitrarily corrupted points. If that fraction is vanishing (e.g., $n$ samples and a number of outliers growing sublinearly in $n$), then HR-PCA guarantees perfect recovery of the low-dimensional subspace providing the optimal approximation of the authentic points. Moreover, our algorithm is easily kernelizable. This is the first algorithm of its kind: tractable, maximally robust (in terms of breakdown point; see below), and asymptotically optimal when the number of authentic points scales faster than the number of corrupted points.
The proposed algorithm alternates between performing PCA and randomly removing points, so that each iteration produces a candidate subspace. The random removal process guarantees that, with high probability, one of the candidate solutions found by the algorithm is "close" to the optimal one. Thus, comparing all candidates using a (computationally efficient) one-dimensional robust variance estimator leads to a "sufficiently good" output. Alternatively, our algorithm can be shown to be a randomized algorithm giving a constant-factor approximation to the non-convex projection pursuit algorithm.
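To convey the structure of the method before its formal description in Section III, the following Python sketch illustrates the alternation just described. It is a simplified illustration under our own choices of interface, removal rule, and stopping criterion; the precise algorithm and its parameters are specified in Section III.

import numpy as np

def trimmed_variance(proj_sq, eta=0.5):
    # Mean of the smallest (1 - eta)-fraction of the squared projections.
    keep = int(np.ceil((1 - eta) * len(proj_sq)))
    return np.sort(proj_sq)[:keep].mean()

def hr_pca_sketch(Y, d, n_removals, eta=0.5, seed=None):
    # Simplified HR-PCA-style loop: alternate PCA and random removal, and keep
    # the candidate subspace whose trimmed variance, evaluated on all observed
    # points, is largest.
    rng = np.random.default_rng(seed)
    Y = np.asarray(Y, dtype=float)
    Y = Y - Y.mean(axis=0)
    remaining = Y.copy()
    best_score, best_W = -np.inf, None
    for _ in range(n_removals):
        _, _, Vt = np.linalg.svd(remaining, full_matrices=False)
        W = Vt[:d].T                                  # p x d candidate principal components
        # Score this candidate by the trimmed variance of all observed points.
        score = sum(trimmed_variance((Y @ W[:, j]) ** 2, eta) for j in range(d))
        if score > best_score:
            best_score, best_W = score, W
        # Remove one remaining point at random, with probability proportional
        # to its squared projection onto the current candidate subspace.
        proj_sq = ((remaining @ W) ** 2).sum(axis=1)
        idx = rng.choice(len(remaining), p=proj_sq / proj_sq.sum())
        remaining = np.delete(remaining, idx, axis=0)
    return best_W

The intuition is that corrupted points that strongly influence the PCA output also have large projections onto the candidate subspace, and are therefore likely to be removed; the robust one-dimensional variance estimator then selects a good candidate among the iterations.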
Organization and Notation
The paper is organized as follows: In Section II we discuss past work and the reasons that classical
robust PCA algorithms fail to extend to the high dimensional regime. In Section III we present the setup
of the problem, and the HR-PCA algorithm. We also provide finite sample and asymptotic performance
guarantees. Section IV is devoted to the kernelization of HR-PCA. We provide some numerical experiment
results in Section V. The performance guarantees are proved in Section VI. Some technical details in the
derivation of the performance guarantees are postponed to the appendix.
Capital letters and boldface letters are used to denote matrices and vectors, respectively. A $k \times k$ identity matrix is denoted by $I_k$. For $c \in \mathbb{R}$, $[c]_+ \triangleq \max(0, c)$. We let $\mathcal{B}^d \triangleq \{\mathbf{w} \in \mathbb{R}^d \mid \|\mathbf{w}\|_2 \le 1\}$, and $\mathcal{S}^d$ be its boundary. We use a subscript $(\cdot)$ to represent order statistics of a random variable. For example, let $v_1, \ldots, v_n \in \mathbb{R}$. Then $v_{(1)}, \ldots, v_{(n)}$ is a permutation of $v_1, \ldots, v_n$ in non-decreasing order. The operators $\vee$ and $\wedge$ are used to represent the maximal and the minimal value of the operands, respectively; for example, $x \vee y = \max(x, y)$. The standard asymptotic notations $o(\cdot)$, $O(\cdot)$, $\Theta(\cdot)$, $\omega(\cdot)$, and $\Omega(\cdot)$ are used to lighten notation. Throughout the paper, "with high probability" means with probability (jointly on the sampling and the randomness of the algorithm) at least $1 - Cn^{-10}$ for some absolute constant $C$. The exponent 10 is arbitrary, and can readily be changed to any fixed integer with all the results still holding.
II. RELATION TO PAST WORK
In this section, we discuss past work and the reasons that classical robust PCA algorithms fail to extend
to the high dimensional regime.
Much previous robust PCA work focuses on the traditional robustness measurement known as the
“breakdown point” [6]: the percentage of corrupted points that can make the output of the algorithm
arbitrarily bad. To the best of our knowledge, no other algorithm can handle any constant fraction of outliers with a nontrivial bound on the error in the high-dimensional regime. That is, the best known breakdown
point for this problem is zero. As discussed above, we show that the algorithm we provide has breakdown
point of 50%, which is the best possible for any algorithm. In addition to this, we focus on providing
explicit bounds on the performance, for all corruption levels up to the breakdown point.
In the low-dimensional regime where the observations significantly outnumber the variables of each
observation, several robust PCA algorithms have been proposed (e.g., [7]–[16]). These algorithms can be
roughly divided into two classes: (i) The algorithms that obtain a robust estimate of the covariance matrix
and then perform standard PCA. The robust estimate is typically obtained either by an outlier rejection
procedure, subsampling (including “leave-one-out” and related approaches) or by a robust estimation
procedure of each element of the covariance matrix; (ii) So-called projection pursuit algorithms that seek
to find directions $\{\mathbf{w}_1, \ldots, \mathbf{w}_d\}$ maximizing a robust variance estimate of the points projected onto these $d$ dimensions. Both approaches encounter serious difficulties when applied to high-dimensional data sets,
as we explain.
One of the fundamental challenges tied to the high-dimensional regime relates to the relative magnitude
of the signal component and the noise component of even the authentic samples. In the classical regime,
most of the authentic points must have a larger projection along the true (or optimal) principal components
than in other directions. That is, the noise component must be smaller than the signal component, for many
of the authentic points. In the high dimensional setting entirely the opposite may happen. As a consequence,
and in stark deviation from our intuition from the classical setting, in the high dimensional setting, all
the authentic points may be far from the origin, far from each other, and nearly perpendicular to all the
principal components. To explain this better, consider a simple generative model for the authentic points:
$\mathbf{y}_i = A\mathbf{x}_i + \mathbf{v}_i$, $i = 1, \ldots, n$, where $A$ is a $p \times d$ matrix, $\mathbf{x}$ is drawn from a zero-mean symmetric random variable, and $\mathbf{v} \sim \mathcal{N}(\mathbf{0}, I_p)$. Let us suppose that for $n$ the number of points, $p$ the ambient dimension, and $\sigma_A = \sigma_{\max}(A)$ the largest singular value of $A$, we have $n \approx p \gg \sigma_A$, and both are also much bigger than $d$, the number of principal components. Then, standard calculation shows that $\sqrt{\mathbb{E}(\|A\mathbf{x}\|_2^2)} \approx \sigma_A$, while $\sqrt{\mathbb{E}(\|\mathbf{v}\|_2^2)} \approx \sqrt{p}$, and in fact there is sharp concentration of the Gaussian about this value. Thus we may have $\sqrt{\mathbb{E}(\|\mathbf{v}\|_2^2)} \approx \sqrt{p} \gg \sigma_A \approx \sqrt{\mathbb{E}(\|A\mathbf{x}\|_2^2)}$: the magnitude of the noise may be vastly larger than the magnitude of the signal.
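A quick numerical check makes this scale separation vivid. The dimensions and singular value below are illustrative choices of ours, not values used in the paper.

import numpy as np

rng = np.random.default_rng(0)
p, d, sigma_A = 10_000, 5, 10.0                 # ambient dim, signal dim, largest singular value
A = sigma_A * np.linalg.qr(rng.standard_normal((p, d)))[0]   # p x d with sigma_max(A) = sigma_A
x = rng.standard_normal(d)                       # signal coefficients
v = rng.standard_normal(p)                       # isotropic Gaussian noise

print(np.linalg.norm(A @ x))   # roughly sigma_A * sqrt(d): on the order of tens
print(np.linalg.norm(v))       # concentrates near sqrt(p): on the order of 100

Even with a fairly large singular value, the noise component of each authentic point dominates its signal component by an order of magnitude in this example.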
While this observation is simple, it has severe consequences. First, Robust PCA techniques based on
some form of outlier rejection or anomaly detection are destined to fail. The reason is that in the ambient
(high dimensional) space, since the noise is the dominant component of even the authentic points, it is
essentially impossible to distinguish a corrupted from an authentic point.
Two criteria are often used to determine whether a point is an outlier: a large Mahalanobis distance, or a large Stahel-Donoho outlyingness. The Mahalanobis distance of a point $\mathbf{y}$ is defined as
$$D_M(\mathbf{y}) = \sqrt{(\mathbf{y} - \bar{\mathbf{y}})^\top S^{-1} (\mathbf{y} - \bar{\mathbf{y}})},$$
where $\bar{\mathbf{y}}$ is the sample mean and $S$ is the sample covariance matrix. The Stahel-Donoho outlyingness is defined as
$$u_i \triangleq \sup_{\|\mathbf{w}\|=1} \frac{\bigl|\mathbf{w}^\top \mathbf{y}_i - \mathrm{med}_j(\mathbf{w}^\top \mathbf{y}_j)\bigr|}{\mathrm{med}_k \bigl|\mathbf{w}^\top \mathbf{y}_k - \mathrm{med}_j(\mathbf{w}^\top \mathbf{y}_j)\bigr|}.$$
Both the Mahalanobis distance and the Stahel-Donoho (S-D) outlyingness are extensively used in existing
robust PCA algorithms. For example, Classical Outlier Rejection, Iterative Deletion and various alternatives
of Iterative Trimming all use the Mahalanobis distance to identify possible outliers. Depth Trimming [17]
weights the contribution of observations based on their S-D outlyingness. More recently, the ROBPCA
algorithm proposed in [18] selects a subset of observations with least S-D outlyingness to compute the
d-dimensional signal space. Indeed, consider $\lambda n$ corrupted points, of magnitude some (large) constant multiple of $\sigma_A$, all aligned. Using matrix concentration arguments (we develop these arguments in detail in the sequel), it is easy to see that the output of PCA can be strongly manipulated; on the other hand, since the noise magnitude is of order $\sqrt{p} \approx \sqrt{n}$ in directions perpendicular to the principal components, the Mahalanobis distance of each corrupted point will be very small. Similarly, the S-D outlyingness of the corrupted points in this example is smaller than that of the authentic points, again due to the overwhelming magnitude of the noise component of each authentic point.
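For concreteness, the two outlyingness measures above can be computed as follows. The supremum in the Stahel-Donoho definition is intractable to evaluate exactly, so this sketch (our own, not an algorithm from the paper) approximates it by sampling random unit directions, and it uses a pseudo-inverse since the sample covariance may be singular when $n$ is comparable to $p$.

import numpy as np

def mahalanobis_distances(Y):
    """D_M(y_i) = sqrt((y_i - mean)' S^{-1} (y_i - mean)) for each row of Y."""
    centered = Y - Y.mean(axis=0)
    S_pinv = np.linalg.pinv(np.cov(Y, rowvar=False))       # pseudo-inverse of sample covariance
    return np.sqrt(np.einsum('ij,jk,ik->i', centered, S_pinv, centered))

def sd_outlyingness(Y, n_directions=1000, seed=None):
    """Approximate Stahel-Donoho outlyingness, maximizing over random unit directions."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_directions, Y.shape[1]))
    W /= np.linalg.norm(W, axis=1, keepdims=True)           # rows are unit directions
    proj = Y @ W.T                                           # n x n_directions projections
    med = np.median(proj, axis=0)
    abs_dev = np.abs(proj - med)
    mad = np.median(abs_dev, axis=0)                         # median absolute deviation per direction
    return (abs_dev / mad).max(axis=1)                       # sup over the sampled directions

Running either measure on the aligned-outlier example described above shows exactly the failure discussed: the corrupted points score no higher, and often lower, than the authentic ones.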
Subsampling and leave-one-out attempts at outlier rejection also fail to work, this time because of the
large number (a constant fraction) of outliers. Other algorithms designed for robust estimation of the
covariance matrix fail because there are not enough observations compared to the dimensionality. For
instance, the widely used Minimum Volume Ellipsoid (MVE) estimator [19] finds the minimum volume
ellipsoid that covers half the points, and uses it to define a robust covariance matrix. Finding such an
ellipsoid is typically hard (combinatorial). Yet beyond this issue, in the high dimensional regime, the
minimum volume ellipsoid problem is fundamentally ill-posed.
The discussion above lies at the core of the failure of many popular algorithms. Indeed, in [17], several
classical covariance estimators including M-estimator [20], Convex Peeling [21], [22], Ellipsoidal Peeling
[23], [24], Classical Outlier Rejection [25], [26], Iterative Deletion [27] and Iterative Trimming [28], [29]
are all shown to have breakdown points upper-bounded by the inverse of the dimensionality, and hence are not useful in the regime of interest.
Next, we turn to Algorithmic Tractability. Projection pursuit algorithms seek to find a direction (or set
of directions) that maximizes some robust measure of variance of the points projected onto this low-dimensional subspace. A common example (and one which we utilize in the sequel) is the so-called trimmed variance in a particular direction,
$\mathbf{w}$. This projects all points onto $\mathbf{w}$, and computes the average squared distance from the origin over the $(1 - \eta)$-fraction of the points with smallest projection magnitude, for some $\eta \in (0, 1)$. As a byproduct of our analysis, we show that this procedure has excellent robustness properties; in particular, our analysis implies that it has a breakdown point of 50% if $\eta$ is set to 0.5. However, it is easy to see that this procedure requires the solution of a non-convex optimization problem. To the best of our knowledge, there is no tractable algorithm that can do
this. (As part of our work, we implicitly provide a randomized algorithm with guaranteed approximation
rate for this problem). In the classical setting, we note that the situation is different. In [30], the authors
propose a fast approximate Projection-Pursuit algorithm, avoiding the non-convex optimization problem of
finding the optimal direction, by only examining the directions defined by the sample points. In the classical regime,
in most samples the signal component is larger than the noise component, and hence many samples make
an acute angle with the principal components to be recovered. In contrast, in the high-dimensional setting
this algorithm fails, since as discussed above, the direction of each sample is almost orthogonal to the
direction of the true principal components. Such an approach would therefore only be examining candidate directions nearly orthogonal to the true maximizing directions.
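As an illustration of this robust variance measure, the function below (our own minimal implementation, with name and default chosen for exposition) evaluates the trimmed variance of the data projected onto a given direction. The hard part, which no known tractable algorithm solves exactly, is maximizing this quantity over all unit vectors $\mathbf{w}$.

import numpy as np

def trimmed_variance_along(Y, w, eta=0.5):
    """Trimmed variance of the points projected onto direction w: the mean squared
    projection over the (1 - eta)-fraction of points whose projections are smallest
    in magnitude."""
    proj_sq = (Y @ (w / np.linalg.norm(w))) ** 2
    keep = int(np.ceil((1 - eta) * len(proj_sq)))
    return np.sort(proj_sq)[:keep].mean()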
Finally, we discuss works addressing robust PCA using low-rank techniques and matrix decomposition.
Starting with the work in [31], [32] and [33], recent focus has turned to the problem of recovering a
low-rank matrix from corruption. The work in [31], [32] considers matrix completion: recovering a low-rank matrix from an overwhelming number of erasures. The work initiated in [33], and subsequently continued and extended in [34], [35], focuses on recovering a low-rank matrix from erasures and possibly
gross but sparse corruptions. In the noiseless case, stacking all our samples as columns of a $p \times n$ matrix, we indeed obtain a corrupted low-rank matrix. But the corruption is not sparse; rather, the corruption is
column-sparse, with the corrupted columns corresponding to the corrupted points. In addition to this, the
matrix has Gaussian noise. It is easy to check via simple simulation, and not at all surprising, that the
sparse-plus-low-rank matrix decomposition approaches fail to recover a low-rank matrix corrupted by a
column-sparse matrix.
When this manuscript was under review, a subset of us, together with co-authors, developed a low-
rank matrix decomposition technique to handle outliers (i.e., column-wise corruption) [36], [37], see
also [38] for a similar study performed independently. In [36], [37], we give conditions that guarantee
the exact recovery of the principal components and the identity of the outliers in the noiseless case,
up to a (small) constant fraction of outliers depending on the number of principal components. We
provide parallel approximate results in the presence of Frobenius-bounded noise. Outside the realm where
the guarantees hold, the performance of the matrix decomposition approach is unknown. In particular, its breakdown point depends inversely on the number of principal components, and the dependence on noise is severe. Specifically, the level of noise considered here would result in only trivial bounds. In short, we
do not know of performance guarantees for the matrix decomposition approach that are comparable to
the results presented here (although it is clearly a topic of interest).
III. HR-PCA: SETUP, ALGORITHM AND GUARANTEES
In this section we describe the precise setting, then provide the HR-PCA algorithm, and finally state
the main theorems of the paper, providing the performance guarantees.
A. Problem Setup
This paper is about the following problem: given a mix of authentic and corrupted points, our goal is to find a low-dimensional subspace that captures as much of the variance of the authentic points as possible. The corrupted points are arbitrary in every way except their number, which is controlled. We consider two settings for the authentic points: first a deterministic (arbitrary) model, and then a stochastic model. In the deterministic
setting, we assume nothing about the authentic points; in the stochastic setting, we assume the standard
generative model, namely, that authentic points are generated according to $\mathbf{z}_i = A\mathbf{x}_i + \mathbf{v}_i$, as we explain below. In either case, we measure the quality of our solution (i.e., of the low-dimensional subspace) by how much variance of the authentic points it captures, compared to the maximum possible.
The guarantees for the deterministic setting are, necessarily, presented in reference to the optimal solution
which is a function of all the points. The stochastic setting allows more interpretable results, since the
optimal solution is defined by the matrix A.
We now turn to the basic definitions.
Let $n$ denote the total number of samples, and $p$ the ambient dimension, so that $\mathbf{y}_i \in \mathbb{R}^p$, $i = 1, \ldots, n$. Let $\lambda$ denote the fraction of corrupted points; thus, there are $t = (1 - \lambda)n$ "authentic samples" $\mathbf{z}_1, \ldots, \mathbf{z}_t \in \mathbb{R}^p$. We assume $\lambda < 0.5$. Hence we have $0.5n \le t \le n$, i.e., $t$ and $n$ are of the same order.
The remaining $\lambda n$ points are outliers (the corrupted data) and are denoted $\mathbf{o}_1, \ldots, \mathbf{o}_{n-t} \in \mathbb{R}^p$; as emphasized above, they are arbitrary (perhaps even maliciously chosen).
We only observe the contaminated data set
$$\mathcal{Y} \triangleq \{\mathbf{y}_1, \ldots, \mathbf{y}_n\} = \{\mathbf{z}_1, \ldots, \mathbf{z}_t\} \cup \{\mathbf{o}_1, \ldots, \mathbf{o}_{n-t}\}.$$
An element of $\mathcal{Y}$ is called a "point".
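To fix ideas, the following sketch generates a contaminated data set of the above form under the stochastic model $\mathbf{z}_i = A\mathbf{x}_i + \mathbf{v}_i$. The specific dimensions, the value of $\sigma_A$, and the placement of the outliers (all aligned along one coordinate direction) are illustrative choices of ours, one possible adversarial configuration rather than an assumption of the analysis.

import numpy as np

def make_contaminated_data(n=200, p=200, d=3, lam=0.2, sigma_A=5.0, seed=None):
    """Generate Y = {authentic points z_i = A x_i + v_i} union {lam*n corrupted points}."""
    rng = np.random.default_rng(seed)
    t = int(round((1 - lam) * n))                                # number of authentic samples
    A = sigma_A * np.linalg.qr(rng.standard_normal((p, d)))[0]   # p x d, sigma_max(A) = sigma_A
    Z = rng.standard_normal((t, d)) @ A.T + rng.standard_normal((t, p))   # z_i = A x_i + v_i
    direction = np.zeros(p)
    direction[0] = 1.0
    O = 3.0 * sigma_A * np.tile(direction, (n - t, 1))           # aligned corrupted points
    return np.vstack([Z, O]), A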
Setup 1: In the deterministic setup, we make no assumptions whatsoever on the authentic points, and
thus there is no implicit assumption that there is a good low-dimensional approximation of these points.
The results are necessarily finite-sample, and their quality is a function of all the authentic points.
