Outlier-Robust PCA: The High-Dimensional Case
Huan Xu, Constantine Caramanis, Member, and Shie Mannor, Senior Member

Preliminary versions of these results have appeared, in part, in the Proceedings of the 46th Annual Allerton Conference on Control, Communication, and Computing, and at the 23rd International Conference on Learning Theory (COLT).
H. Xu is with the Department of Mechanical Engineering, National University of Singapore, Singapore (email: mpexuh@nus.edu.sg).
C. Caramanis is with the Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712 USA (email: caramanis@mail.utexas.edu).
S. Mannor is with the Department of Electrical Engineering, Technion, Israel (email: shie@ee.technion.ac.il).
Abstract
Principal Component Analysis plays a central role in statistics, engineering, and science. Because of the prevalence of corrupted data in real-world applications, much research has focused on developing robust algorithms. Perhaps surprisingly, these algorithms are unequipped, indeed unable, to deal with outliers in the high-dimensional setting where the number of observations is of the same magnitude as the number of variables of each observation, and the data set contains some (arbitrarily) corrupted observations. We propose a High-dimensional Robust Principal Component Analysis (HR-PCA) algorithm that is as efficient as PCA, robust to contaminated points, and easily kernelizable. In particular, our algorithm achieves maximal robustness: it has a breakdown point of 50% (the best possible), while all existing algorithms have a breakdown point of zero. Moreover, our algorithm recovers the optimal solution exactly in the case where the number of corrupted points grows sublinearly in the dimension.
Index Terms
Statistical Learning, Dimension Reduction, Principal Component Analysis, Robustness, Outlier
I. INTRODUCTION
The analysis of very high dimensional data, data sets where the dimensionality of each observation is comparable to or even larger than the number of observations, has drawn increasing attention in the last few decades [1], [2]. Individual observations can be curves, spectra, images, movies, behavioral characteristics or preferences, or even a genome; a single observation's dimensionality can be astronomical, and, critically, it can equal or even outnumber the number of samples available. Practical high dimensional data examples include DNA microarray data, financial data, climate data, web search engine data, and consumer data. In addition, the now-standard "Kernel Trick" [3], a pre-processing routine which non-linearly maps the observations into a (possibly infinite dimensional) Hilbert space, transforms virtually every data set into a high dimensional one. Efforts to extend traditional statistical tools (designed for the low dimensional case) into this high-dimensional regime are often (if not generally) unsuccessful. This fact has stimulated research on formulating fresh data-analysis techniques able to cope with such a "dimensionality explosion."
Principal Component Analysis (PCA) is perhaps one of the most widely used statistical techniques
for dimensionality reduction. Work on PCA dates back to the beginning of the 20th century [4], and has
become one of the most important techniques for data compression and feature extraction. It is widely used
in statistical data analysis, communication theory, pattern recognition, image processing and far beyond
[5]. The standard PCA algorithm constructs the optimal (in a least-square sense) subspace approximation
to observations by computing the eigenvectors or Principal Components (PCs) of the sample covariance
or correlation matrix. Its broad application can be attributed primarily to two features: its success in
the classical regime for recovering a low-dimensional subspace even in the presence of noise, and also
the existence of efficient algorithms for computation. Indeed, PCA is nominally a non-convex problem,
which we can, nevertheless, solve, thanks to the magic of the SVD which allows us to maximize a convex
function. It is well-known, however, that precisely because of the quadratic error criterion, standard PCA is exceptionally fragile, and the quality of its output can suffer dramatically in the face of only a few
(even a vanishingly small fraction) grossly corrupted points. Such non-probabilistic errors may be present
due to data corruption stemming from sensor failures, malicious tampering, or other reasons. Attempts to use error functions that grow more slowly than the quadratic, and might therefore be more robust to outliers, result in non-convex (and intractable) optimization problems.
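For reference, the standard (non-robust) PCA computation described above can be sketched as follows. This is a minimal illustration, not the algorithm proposed in this paper; the function name and interface are our own.

import numpy as np

def standard_pca(Y, d):
    """Return the top-d principal components (columns) of the n x p data matrix Y.

    Ordinary, non-robust PCA: center the data, then take the top right-singular
    vectors of the centered matrix (equivalently, the top eigenvectors of the
    sample covariance matrix).
    """
    Y_centered = Y - Y.mean(axis=0)              # remove the sample mean
    # Economy-size SVD; rows of Vt are right-singular vectors of Y_centered.
    _, _, Vt = np.linalg.svd(Y_centered, full_matrices=False)
    return Vt[:d].T                              # p x d matrix of principal components

Because the objective is the quadratic reconstruction error, a single point of very large norm can rotate the subspace returned by such a routine essentially arbitrarily, which is the fragility discussed above.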
In this paper, we consider a high-dimensional counterpart of Principal Component Analysis (PCA) that
is robust to the existence of arbitrarily corrupted or contaminated data. We start with the standard statistical
setup: a low dimensional signal is (linearly) mapped to a very high dimensional space, after which point
high-dimensional Gaussian noise is added, to produce points that no longer lie on a low dimensional
subspace. At this point, we deviate from the standard setting in two important ways: (1) a constant
fraction of the points are arbitrarily corrupted in a perhaps non-probabilistic manner. We emphasize that
these “outliers” can be entirely arbitrary, rather than from the tails of any particular distribution, e.g., the
noise distribution; we call the remaining points “authentic”; (2) the number of data points is of the same
order as (or perhaps considerably smaller than) the dimensionality. As we discuss below, these two points
confound (to the best of our knowledge) all tractable existing Robust PCA algorithms.
A fundamental feature of the high dimensionality is that the noise is large in some direction, with very
high probability, and therefore definitions of “outliers” from classical statistics are of limited use in this
setting. Another important property of this setup is that the signal-to-noise ratio (SNR) can go to zero, as
the
2
norm of the high-dimensional Gaussian noise scales as the square root of the dimensionality. In the
standard (i.e., low-dimensional case), a low SNR generally implies that the signal cannot be recovered,
even without any corrupted points.
The Main Result
Existing algorithms fail spectacularly in this regime: to the best of our knowledge, there is no algorithm that can provide any nontrivial bounds on the quality of the solution in the presence of even a vanishing fraction of corrupted points. In this paper we do just this. We provide a novel robust PCA algorithm we call High-dimensional Robust PCA (HR-PCA). HR-PCA is efficient (no harder than PCA), and robust with provable nontrivial performance bounds with up to 50% arbitrarily corrupted points. If that fraction is vanishing (e.g., $n$ samples and a number of outliers growing sublinearly in $n$), then HR-PCA guarantees perfect recovery of the low-dimensional subspace providing the optimal approximation of the authentic points. Moreover, our algorithm is easily kernelizable. This is the first algorithm of its kind: tractable, maximally robust (in terms of breakdown point; see below), and asymptotically optimal when the number of authentic points scales faster than the number of corrupted points.
The proposed algorithm alternates between performing PCA and randomly removing points, so that each iteration produces a candidate subspace. The random removal process guarantees that, with high probability, one of the candidate solutions found by the algorithm is "close" to the optimal one. Thus, comparing all candidates using a (computationally efficient) one-dimensional robust variance estimator leads to a "sufficiently good" output. Alternatively, our algorithm can be shown to be a randomized algorithm giving a constant-factor approximation to the non-convex projection pursuit algorithm.
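To convey the structure of the method before its formal description in Section III, the following Python sketch illustrates the alternation just described. It is a simplified illustration under our own choices of interface, removal rule, and stopping criterion; the precise algorithm and its parameters are specified in Section III.

import numpy as np

def trimmed_variance(proj_sq, eta=0.5):
    # Mean of the smallest (1 - eta)-fraction of the squared projections.
    keep = int(np.ceil((1 - eta) * len(proj_sq)))
    return np.sort(proj_sq)[:keep].mean()

def hr_pca_sketch(Y, d, n_removals, eta=0.5, seed=None):
    # Simplified HR-PCA-style loop: alternate PCA and random removal, and keep
    # the candidate subspace whose trimmed variance, evaluated on all observed
    # points, is largest.
    rng = np.random.default_rng(seed)
    Y = np.asarray(Y, dtype=float)
    Y = Y - Y.mean(axis=0)
    remaining = Y.copy()
    best_score, best_W = -np.inf, None
    for _ in range(n_removals):
        _, _, Vt = np.linalg.svd(remaining, full_matrices=False)
        W = Vt[:d].T                                  # p x d candidate principal components
        # Score this candidate by the trimmed variance of all observed points.
        score = sum(trimmed_variance((Y @ W[:, j]) ** 2, eta) for j in range(d))
        if score > best_score:
            best_score, best_W = score, W
        # Remove one remaining point at random, with probability proportional
        # to its squared projection onto the current candidate subspace.
        proj_sq = ((remaining @ W) ** 2).sum(axis=1)
        idx = rng.choice(len(remaining), p=proj_sq / proj_sq.sum())
        remaining = np.delete(remaining, idx, axis=0)
    return best_W

The intuition is that corrupted points that strongly influence the PCA output also have large projections onto the candidate subspace, and are therefore likely to be removed; the robust one-dimensional variance estimator then selects a good candidate among the iterations.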
Organization and Notation
The paper is organized as follows: In Section II we discuss past work and the reasons that classical
robust PCA algorithms fail to extend to the high dimensional regime. In Section III we present the setup
of the problem, and the HR-PCA algorithm. We also provide finite sample and asymptotic performance
guarantees. Section IV is devoted to the kernelization of HR-PCA. We provide some numerical experiment
results in Section V. The performance guarantees are proved in Section VI. Some technical details in the
derivation of the performance guarantees are postponed to the appendix.
Capital letters and boldface letters are used to denote matrices and vectors, respectively. A $k \times k$ identity matrix is denoted by $I_k$. For $c \in \mathbb{R}$, $[c]_+ \triangleq \max(0, c)$. We let $\mathcal{B}^d \triangleq \{\mathbf{w} \in \mathbb{R}^d \mid \|\mathbf{w}\|_2 \le 1\}$, and $\mathcal{S}^d$ be its boundary. We use a subscript $(\cdot)$ to represent order statistics of a random variable. For example, let $v_1, \ldots, v_n \in \mathbb{R}$. Then $v_{(1)}, \ldots, v_{(n)}$ is a permutation of $v_1, \ldots, v_n$ in non-decreasing order. The operators $\vee$ and $\wedge$ are used to represent the maximal and the minimal value of the operands, respectively; for example, $x \vee y = \max(x, y)$. The standard asymptotic notations $o(\cdot)$, $O(\cdot)$, $\Theta(\cdot)$, $\omega(\cdot)$, and $\Omega(\cdot)$ are used to lighten notation. Throughout the paper, "with high probability" means with probability (jointly on the sampling and the randomness of the algorithm) at least $1 - Cn^{-10}$ for some absolute constant $C$. The exponent 10 is arbitrary, and can readily be changed to any fixed integer with all the results still holding.
II. RELATION TO PAST WORK
In this section, we discuss past work and the reasons that classical robust PCA algorithms fail to extend
to the high dimensional regime.
Much previous robust PCA work focuses on the traditional robustness measurement known as the
“breakdown point” [6]: the percentage of corrupted points that can make the output of the algorithm
arbitrarily bad. To the best of our knowledge, no other algorithm can handle any constant fraction of outliers with a nontrivial bound on the error in the high-dimensional regime. That is, the best known breakdown
point for this problem is zero. As discussed above, we show that the algorithm we provide has breakdown
point of 50%, which is the best possible for any algorithm. In addition to this, we focus on providing
explicit bounds on the performance, for all corruption levels up to the breakdown point.
In the low-dimensional regime where the observations significantly outnumber the variables of each
observation, several robust PCA algorithms have been proposed (e.g., [7]–[16]). These algorithms can be
roughly divided into two classes: (i) The algorithms that obtain a robust estimate of the covariance matrix
and then perform standard PCA. The robust estimate is typically obtained either by an outlier rejection
procedure, subsampling (including “leave-one-out” and related approaches) or by a robust estimation
procedure of each element of the covariance matrix; (ii) So-called projection pursuit algorithms that seek
to find directions $\{\mathbf{w}_1, \ldots, \mathbf{w}_d\}$ maximizing a robust variance estimate of the points projected onto these $d$ dimensions. Both approaches encounter serious difficulties when applied to high-dimensional data sets,
as we explain.
One of the fundamental challenges tied to the high-dimensional regime relates to the relative magnitude
of the signal component and the noise component of even the authentic samples. In the classical regime,
most of the authentic points must have a larger projection along the true (or optimal) principal components
than in other directions. That is, the noise component must be smaller than the signal component, for many
of the authentic points. In the high dimensional setting entirely the opposite may happen. As a consequence,
and in stark deviation from our intuition from the classical setting, in the high dimensional setting, all
the authentic points may be far from the origin, far from each other, and nearly perpendicular to all the
principal components. To explain this better, consider a simple generative model for the authentic points:
$\mathbf{y}_i = A\mathbf{x}_i + \mathbf{v}_i$, $i = 1, \ldots, n$, where $A$ is a $p \times d$ matrix, $\mathbf{x}$ is drawn from a zero-mean symmetric random variable, and $\mathbf{v} \sim \mathcal{N}(\mathbf{0}, I_p)$. Let us suppose that for $n$ the number of points, $p$ the ambient dimension, and $\sigma_A = \sigma_{\max}(A)$ the largest singular value of $A$, we have $n \approx p \gg \sigma_A$, and both are also much bigger than $d$, the number of principal components. Then, standard calculation shows that $\sqrt{\mathbb{E}(\|A\mathbf{x}\|_2^2)} \approx \sigma_A$, while $\sqrt{\mathbb{E}(\|\mathbf{v}\|_2^2)} \approx \sqrt{p}$, and in fact there is sharp concentration of the Gaussian about this value. Thus we may have $\sqrt{\mathbb{E}(\|\mathbf{v}\|_2^2)} \approx \sqrt{p} \gg \sigma_A \approx \sqrt{\mathbb{E}(\|A\mathbf{x}\|_2^2)}$: the magnitude of the noise may be vastly larger than the magnitude of the signal.
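A quick numerical check makes this scale separation vivid. The dimensions and singular value below are illustrative choices of ours, not values used in the paper.

import numpy as np

rng = np.random.default_rng(0)
p, d, sigma_A = 10_000, 5, 10.0                 # ambient dim, signal dim, largest singular value
A = sigma_A * np.linalg.qr(rng.standard_normal((p, d)))[0]   # p x d with sigma_max(A) = sigma_A
x = rng.standard_normal(d)                       # signal coefficients
v = rng.standard_normal(p)                       # isotropic Gaussian noise

print(np.linalg.norm(A @ x))   # roughly sigma_A * sqrt(d): on the order of tens
print(np.linalg.norm(v))       # concentrates near sqrt(p): on the order of 100

Even with a fairly large singular value, the noise component of each authentic point dominates its signal component by an order of magnitude in this example.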
While this observation is simple, it has severe consequences. First, Robust PCA techniques based on
some form of outlier rejection or anomaly detection are destined to fail. The reason is that in the ambient
(high dimensional) space, since the noise is the dominant component of even the authentic points, it is
essentially impossible to distinguish a corrupted from an authentic point.
Two criteria are often used to determine whether a point is an outlier: a large Mahalanobis distance, or a large Stahel-Donoho outlyingness. The Mahalanobis distance of a point $\mathbf{y}$ is defined as
$$D_M(\mathbf{y}) = \sqrt{(\mathbf{y} - \bar{\mathbf{y}})^\top S^{-1} (\mathbf{y} - \bar{\mathbf{y}})},$$
where $\bar{\mathbf{y}}$ is the sample mean and $S$ is the sample covariance matrix. The Stahel-Donoho outlyingness is defined as
$$u_i \triangleq \sup_{\|\mathbf{w}\|=1} \frac{\bigl|\mathbf{w}^\top \mathbf{y}_i - \mathrm{med}_j(\mathbf{w}^\top \mathbf{y}_j)\bigr|}{\mathrm{med}_k \bigl|\mathbf{w}^\top \mathbf{y}_k - \mathrm{med}_j(\mathbf{w}^\top \mathbf{y}_j)\bigr|}.$$
Both the Mahalanobis distance and the Stahel-Donoho (S-D) outlyingness are extensively used in existing
robust PCA algorithms. For example, Classical Outlier Rejection, Iterative Deletion and various alternatives
of Iterative Trimming all use the Mahalanobis distance to identify possible outliers. Depth Trimming [17]
weights the contribution of observations based on their S-D outlyingness. More recently, the ROBPCA
algorithm proposed in [18] selects a subset of observations with least S-D outlyingness to compute the
d-dimensional signal space. Indeed, consider $\lambda n$ corrupted points, of magnitude some (large) constant multiple of $\sigma_A$, all aligned. Using matrix concentration arguments (we develop these arguments in detail in the sequel), it is easy to see that the output of PCA can be strongly manipulated; on the other hand, since the noise magnitude is of order $\sqrt{p} \approx \sqrt{n}$ in directions perpendicular to the principal components, the Mahalanobis distance of each corrupted point will be very small. Similarly, the S-D outlyingness of the corrupted points in this example is smaller than that of the authentic points, again due to the overwhelming magnitude of the noise component of each authentic point.
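For concreteness, the two outlyingness measures above can be computed as follows. The supremum in the Stahel-Donoho definition is intractable to evaluate exactly, so this sketch (our own, not an algorithm from the paper) approximates it by sampling random unit directions, and it uses a pseudo-inverse since the sample covariance may be singular when $n$ is comparable to $p$.

import numpy as np

def mahalanobis_distances(Y):
    """D_M(y_i) = sqrt((y_i - mean)' S^{-1} (y_i - mean)) for each row of Y."""
    centered = Y - Y.mean(axis=0)
    S_pinv = np.linalg.pinv(np.cov(Y, rowvar=False))       # pseudo-inverse of sample covariance
    return np.sqrt(np.einsum('ij,jk,ik->i', centered, S_pinv, centered))

def sd_outlyingness(Y, n_directions=1000, seed=None):
    """Approximate Stahel-Donoho outlyingness, maximizing over random unit directions."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_directions, Y.shape[1]))
    W /= np.linalg.norm(W, axis=1, keepdims=True)           # rows are unit directions
    proj = Y @ W.T                                           # n x n_directions projections
    med = np.median(proj, axis=0)
    abs_dev = np.abs(proj - med)
    mad = np.median(abs_dev, axis=0)                         # median absolute deviation per direction
    return (abs_dev / mad).max(axis=1)                       # sup over the sampled directions

Running either measure on the aligned-outlier example described above shows exactly the failure discussed: the corrupted points score no higher, and often lower, than the authentic ones.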
Subsampling and leave-one-out attempts at outlier rejection also fail to work, this time because of the
large number (a constant fraction) of outliers. Other algorithms designed for robust estimation of the
covariance matrix fail because there are not enough observations compared to the dimensionality. For
instance, the widely used Minimum Volume Ellipsoid (MVE) estimator [19] finds the minimum volume
ellipsoid that covers half the points, and uses it to define a robust covariance matrix. Finding such an
ellipsoid is typically hard (combinatorial). Yet beyond this issue, in the high dimensional regime, the
minimum volume ellipsoid problem is fundamentally ill-posed.
The discussion above lies at the core of the failure of many popular algorithms. Indeed, in [17], several
classical covariance estimators including M-estimator [20], Convex Peeling [21], [22], Ellipsoidal Peeling
[23], [24], Classical Outlier Rejection [25], [26], Iterative Deletion [27] and Iterative Trimming [28], [29]
are all shown to have breakdown points upper-bounded by the inverse of the dimensionality, and hence are not useful in the regime of interest.
Next, we turn to Algorithmic Tractability. Projection pursuit algorithms seek to find a direction (or set
of directions) that maximizes some robust measure of variance of the points projected onto this low-dimensional subspace. A common example (and one which we utilize in the sequel) is the so-called trimmed variance in a particular direction,
$\mathbf{w}$. This projects all points onto $\mathbf{w}$, and computes the average squared distance from the origin over the $(1 - \eta)$-fraction of the points with smallest projection magnitude, for some $\eta \in (0, 1)$. As a byproduct of our analysis, we show that this procedure has excellent robustness properties; in particular, our analysis implies that it has a breakdown point of 50% if $\eta$ is set to 0.5. However, it is easy to see that this procedure requires the solution of a non-convex optimization problem. To the best of our knowledge, there is no tractable algorithm that can do
this. (As part of our work, we implicitly provide a randomized algorithm with guaranteed approximation
rate for this problem). In the classical setting, we note that the situation is different. In [30], the authors
propose a fast approximate Projection-Pursuit algorithm, avoiding the non-convex optimization problem of
finding the optimal direction, by only examining the directions defined by the sample points. In the classical regime,
in most samples the signal component is larger than the noise component, and hence many samples make
an acute angle with the principal components to be recovered. In contrast, in the high-dimensional setting
this algorithm fails, since as discussed above, the direction of each sample is almost orthogonal to the
direction of the true principal components. Such an approach would therefore only be examining candidate directions nearly orthogonal to the true maximizing directions.
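As an illustration of this robust variance measure, the function below (our own minimal implementation, with name and default chosen for exposition) evaluates the trimmed variance of the data projected onto a given direction. The hard part, which no known tractable algorithm solves exactly, is maximizing this quantity over all unit vectors $\mathbf{w}$.

import numpy as np

def trimmed_variance_along(Y, w, eta=0.5):
    """Trimmed variance of the points projected onto direction w: the mean squared
    projection over the (1 - eta)-fraction of points whose projections are smallest
    in magnitude."""
    proj_sq = (Y @ (w / np.linalg.norm(w))) ** 2
    keep = int(np.ceil((1 - eta) * len(proj_sq)))
    return np.sort(proj_sq)[:keep].mean()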
Finally, we discuss works addressing robust PCA using low-rank techniques and matrix decomposition.
Starting with the work in [31], [32] and [33], recent focus has turned to the problem of recovering a
low-rank matrix from corruption. The work in [31], [32] considers matrix completion: recovering a low-rank matrix from an overwhelming number of erasures. The work initiated in [33], and subsequently continued and extended in [34], [35], focuses on recovering a low-rank matrix from erasures and possibly
gross but sparse corruptions. In the noiseless case, stacking all our samples as columns of a $p \times n$ matrix, we indeed obtain a corrupted low-rank matrix. But the corruption is not sparse; rather, the corruption is
column-sparse, with the corrupted columns corresponding to the corrupted points. In addition to this, the
matrix has Gaussian noise. It is easy to check via simple simulation, and not at all surprising, that the
sparse-plus-low-rank matrix decomposition approaches fail to recover a low-rank matrix corrupted by a
column-sparse matrix.
When this manuscript was under review, a subset of us, together with co-authors, developed a low-
rank matrix decomposition technique to handle outliers (i.e., column-wise corruption) [36], [37], see
also [38] for a similar study performed independently. In [36], [37], we give conditions that guarantee
the exact recovery of the principal components and the identity of the outliers in the noiseless case,
up to a (small) constant fraction of outliers depending on the number of principal components. We
provide parallel approximate results in the presence of Frobenius-bounded noise. Outside the realm where
the guarantees hold, the performance of the matrix decomposition approach is unknown. In particular, its breakdown point depends inversely on the number of principal components, and the dependence on noise is severe. Specifically, the level of noise considered here would result in only trivial bounds. In short, we
do not know of performance guarantees for the matrix decomposition approach that are comparable to
the results presented here (although it is clearly a topic of interest).
III. HR-PCA: SETUP, ALGORITHM AND GUARANTEES
In this section we describe the precise setting, then provide the HR-PCA algorithm, and finally state
the main theorems of the paper, providing the performance guarantees.
A. Problem Setup
This paper is about the following problem: given a mix of authentic and corrupted points, our goal is to find a low-dimensional subspace that captures as much of the variance of the authentic points as possible. The corrupted points are arbitrary in every way except their number, which is controlled. We consider two settings for the authentic points: first a deterministic (arbitrary) model, and then a stochastic model. In the deterministic
setting, we assume nothing about the authentic points; in the stochastic setting, we assume the standard
generative model, namely, that authentic points are generated according to $\mathbf{z}_i = A\mathbf{x}_i + \mathbf{v}_i$, as we explain below. In either case, we measure the quality of our solution (i.e., of the low-dimensional subspace) by how much variance of the authentic points it captures, compared to the maximum possible.
The guarantees for the deterministic setting are, necessarily, presented in reference to the optimal solution
which is a function of all the points. The stochastic setting allows more interpretable results, since the
optimal solution is defined by the matrix A.
We now turn to the basic definitions.
Let $n$ denote the total number of samples, and $p$ the ambient dimension, so that $\mathbf{y}_i \in \mathbb{R}^p$, $i = 1, \ldots, n$. Let $\lambda$ denote the fraction of corrupted points; thus, there are $t = (1 - \lambda)n$ "authentic samples" $\mathbf{z}_1, \ldots, \mathbf{z}_t \in \mathbb{R}^p$. We assume $\lambda < 0.5$. Hence we have $0.5n \le t \le n$, i.e., $t$ and $n$ are of the same order.
The remaining $\lambda n$ points are outliers (the corrupted data) and are denoted $\mathbf{o}_1, \ldots, \mathbf{o}_{n-t} \in \mathbb{R}^p$; as emphasized above, they are arbitrary (perhaps even maliciously chosen).
We only observe the contaminated data set
$$\mathcal{Y} \triangleq \{\mathbf{y}_1, \ldots, \mathbf{y}_n\} = \{\mathbf{z}_1, \ldots, \mathbf{z}_t\} \cup \{\mathbf{o}_1, \ldots, \mathbf{o}_{n-t}\}.$$
An element of $\mathcal{Y}$ is called a "point".
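To fix ideas, the following sketch generates a contaminated data set of the above form under the stochastic model $\mathbf{z}_i = A\mathbf{x}_i + \mathbf{v}_i$. The specific dimensions, the value of $\sigma_A$, and the placement of the outliers (all aligned along one coordinate direction) are illustrative choices of ours, one possible adversarial configuration rather than an assumption of the analysis.

import numpy as np

def make_contaminated_data(n=200, p=200, d=3, lam=0.2, sigma_A=5.0, seed=None):
    """Generate Y = {authentic points z_i = A x_i + v_i} union {lam*n corrupted points}."""
    rng = np.random.default_rng(seed)
    t = int(round((1 - lam) * n))                                # number of authentic samples
    A = sigma_A * np.linalg.qr(rng.standard_normal((p, d)))[0]   # p x d, sigma_max(A) = sigma_A
    Z = rng.standard_normal((t, d)) @ A.T + rng.standard_normal((t, p))   # z_i = A x_i + v_i
    direction = np.zeros(p)
    direction[0] = 1.0
    O = 3.0 * sigma_A * np.tile(direction, (n - t, 1))           # aligned corrupted points
    return np.vstack([Z, O]), A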
Setup 1: In the deterministic setup, we make no assumptions whatsoever on the authentic points, and
thus there is no implicit assumption that there is a good low-dimensional approximation of these points.
The results are necessarily finite-sample, and their quality is a function of all the authentic points.
