HiCS: High Contrast Subspaces
for Density-Based Outlier Ranking
Fabian Keller, Emmanuel Müller, Klemens Böhm
Institute for Program Structures and Data Organization
Karlsruhe Institute of Technology (KIT), Germany
{fabian.keller, emmanuel.mueller, klemens.boehm}@kit.edu
Abstract—Outlier mining is a major task in data analysis.
Outliers are objects that highly deviate from regular objects in
their local neighborhood. Density-based outlier ranking methods
score each object based on its degree of deviation. In many
applications, these ranking methods degenerate to random list-
ings due to low contrast between outliers and regular objects.
Outliers do not show up in the scattered full space; they are hidden in multiple high contrast subspace projections of the data.
Measuring the contrast of such subspaces for outlier rankings is
an open research challenge.
In this work, we propose a novel subspace search method that
selects high contrast subspaces for density-based outlier ranking.
It is designed as a pre-processing step for outlier ranking algorithms.
It searches for high contrast subspaces with a significant amount
of conditional dependence among the subspace dimensions. With
our approach, we propose a first measure for the contrast of
subspaces. Thus, we enhance the quality of traditional outlier
rankings by computing outlier scores in high contrast projections
only. The evaluation on real and synthetic data shows that
our approach outperforms traditional dimensionality reduction
techniques, naive random projections as well as state-of-the-art
subspace search techniques and provides enhanced quality for
outlier ranking.
I. INTRODUCTION
Outlier mining is an important task in the field of knowl-
edge discovery. In applications such as fraud detection, gene-
expression analysis or environmental surveillance, one is in-
terested in rare, suspicious, and unexpected objects. Outlier
analysis searches for such highly deviating objects in contrast
to regular objects. An outlier has highly deviating attribute
values compared to its local neighborhood. For example, in
environmental surveillance (cf. Fig. 1) a sensor node might be
an outlier as it shows an abnormally high deviation w.r.t. air
pollution index and noise level. For instance, outlier 1 shows a high deviation in this specific subset of attributes only. Another sensor node (outlier 2) shows high deviation w.r.t. humidity
and temperature, independent of its air pollution index and its
noise level. Thus, a sensor node might be an outlier in one of
these attribute combinations and a regular object in all other
attributes. In general, these multiple roles (outlying vs. regular
behavior) of objects can be observed in other domains as well:
Suspicious customers show fraud activity only w.r.t. some
financial transactions, and genes show unexpected expression
only under specific medical conditions.
Traditional outlier mining [26], [16], [5], [13], [7], [25] is
unable to detect such outliers hidden in subsets of all given
attributes.
Fig. 1. Environmental surveillance example: suspicious sensor readings (three attribute-pair plots: air pollution index vs. noise level showing outlier 1, and temperature vs. humidity showing outlier 2, both high contrast; noise level vs. humidity, low contrast)
Most outlier mining techniques search for outliers
w.r.t. all given attributes. Considering object distances in the
full data space, these methods fall prey to randomly distributed
attribute combinations. In our example, humidity and noise
level in combination show no clear outlier objects and hinder
outlier detection. Furthermore, due to the increasing number
of attributes in today’s databases, distances between objects
grow more and more alike [6]. Outlier ranking techniques
score each object based on the degree of deviation, e.g.,
by computing its density in the full data space [7]. Thus,
for high dimensional data, outlier rankings degenerate to
random listings, as outliers do not show up in the full space.
Other common statistical techniques try to detect outliers in
single attributes [26]. However, by ignoring the dependencies
between several attributes, these techniques miss outliers that
appear only due to correlations in multi-dimensional spaces.
We focus on such outliers, which are neither visible in the full
space nor in a single attribute.
Subspace mining has been proposed as a novel data mining
paradigm to tackle this challenge. It detects highly deviating
objects in any possible attribute combination (low dimensional
projection). While dimensionality reduction techniques aim at
such lower dimensional projections, they are not designed as
a pre-processing step for outlier ranking. General measures, such
as the variance of the data in PCA [14], are not appropriate
objective functions for outlier ranking. Novel quality criteria
and processing schemes are required for subspace outlier
mining. In particular, we search for high contrast subspaces.
Such subspaces have the defining characteristic that outliers
can be clearly distinguished from regular objects within the
subspace context. Our general aim is a two-step processing:
(1) Subspace search: measuring the contrast of subspaces
(2) Outlier ranking: score objects in high contrast subspaces
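To make the decoupling concrete, here is a minimal Python sketch of the two-step pipeline; the function names and the restriction to two-dimensional candidate subspaces are illustrative assumptions, not part of the paper.

```python
from itertools import combinations
import numpy as np

def subspace_search(data, contrast, top_k=10):
    """Step 1: rank candidate attribute subsets by a contrast measure
    and keep the top_k high contrast subspaces."""
    D = data.shape[1]
    candidates = list(combinations(range(D), 2))  # 2-D candidates for brevity
    return sorted(candidates, key=lambda S: contrast(data, S), reverse=True)[:top_k]

def outlier_ranking(data, subspaces, score_S):
    """Step 2: average an outlier score over the selected subspaces
    (cf. Definition 1 in Section III)."""
    return np.mean([score_S(data, S) for S in subspaces], axis=0)
```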

We consider the decoupling of these two steps to be an open
research issue. Current subspace outlier mining techniques [1],
[18], [23], [21] focus on interleaved algorithms only, which
select subspaces during outlier mining. We propose to consider
subspace outlier mining as a decoupled process, divided into
“subspace search” and “outlier ranking”. By treating these
two steps as independent problems, one can design and
combine the respective algorithms in a modular fashion. It
also allows both research fields to evolve independently. Consequently, any improvement in either of these steps will lead
to an improvement in the overall outlier detection quality.
Thus, future research in outlier mining may benefit from the
proposed decoupling.
In this work, we focus on the first step and propose a novel
subspace search method that selects high contrast subspaces
for density-based outlier ranking. As outlier score for the
ranking we rely on the commonly used local outlier factor
(LOF) [7]. However, any other outlier score could be used as
instantiation of the second step. Our subspace search technique
is based on a novel selection of high contrast subspaces
(HiCS). It provides three main contributions:
• The decoupling of subspace search as a generalized pre-processing step for outlier ranking
• A contrast measure based on the conditional dependence of dimensions in the selected subspaces
• Two statistical instantiations of our contrast measure, ensuring a robust parametrization of our technique
Our contrast measure is based on statistical tests and enables
a high quality outlier ranking of outliers hidden in arbitrary
subspace projections. Our approach searches for high contrast
subspaces with a significant amount of conditional dependence
among the selected dimensions. Thus, we enhance the quality
of traditional outlier rankings by computing outlier scores in
high contrast projections only. The evaluation on real and
synthetic data shows that our approach outperforms tradi-
tional dimensionality reduction techniques [14], naive random
projections [20] as well as state-of-the-art subspace search
techniques [8], [15] and provides enhanced quality for outlier
rankings.
II. RELATED WORK
In this section, we review existing techniques in the areas
of outlier discovery and subspace mining. In particular, we
explain the differences of existing paradigms compared to our
novel subspace search approach.
a) Traditional Outlier Ranking: Different outlier detection paradigms have been proposed in the literature, ranging from deviation-based methods [26] and distance-based methods [16], [5], [13] to density-based methods [7], [25]. We
focus on the density-based outlier ranking paradigm, which
computes a score for each object by measuring its degree
of deviation w.r.t. a local neighborhood. Thus, one is able to
detect local density variations between low density outliers and
their high density (clustered) neighborhood. However, all of
those traditional outlier mining approaches have one drawback.
They cannot detect outliers in subspaces, as their degree of
deviation considers only the full data space.
b) Subspace Outlier Ranking: Outlier detection in sub-
spaces has first been proposed by [1]. Recent approaches have
enhanced subspace outlier mining by ranking objects based
on any possible subspace projection [11], [20], [18], [23],
[21]. These techniques differ in their choice of subspaces. The
majority of approaches uses specialized heuristics for subspace
selection that are integrated into the outlier ranking [11], [18],
[23], [21]. In general, all of these techniques use an integrated
processing of subspaces and outliers. This implies that scoring
functions and subspace selection are tightly coupled such that
none of these techniques would benefit from a novel scoring
function or a novel subspace selection technique.
The only approach with a decoupled processing serves as a baseline for our technique. It selects several subspace
projections randomly [20]. Obviously, this random selection
does not guarantee high quality results. Selection of arbitrary
projections will result in random rankings just as in the full
data space. With our work we aim at a decoupled processing
with two steps as proposed in [20]. In contrast to a naive
random selection of subspaces, we aim at an enhanced contrast
measure based on sound statistical foundations.
c) Subspace Search: Based on the general idea of sub-
space mining in arbitrary projections of the data, several pre-
processing techniques for the selection of subspaces have
been proposed [8], [15], [24], [4]. All of these techniques
focus on the related domain of subspace clustering. They
try to decouple the detection of clusters and the selection of
individual subspaces for each cluster. However, each of the
four subspace search models depends on a specific cluster
definition.
First, the Enclus approach proposes a selection based on the
entropy measure [8]. Its quality measure for subspaces highly
depends on the subspace clustering algorithm CLIQUE [2]. It
partitions the data space in equally sized grid cells. A subspace
is selected if it has low entropy, i.e., if it shows a large variation
in the densities of the grid cells. With our approach we follow
this basic idea of contrast, however, we do not rely on fixed
grid cells. This is because they induce several drawbacks for
density estimation in high dimensional spaces.
Other techniques, namely RIS [15] and SURFING [4], have
been proposed for the detection of density-based subspace
clusters based on the DBSCAN paradigm [10]. For instance,
RIS counts the core objects in a subspace projection and
uses them as a measure for its subspace selection criterion.
Recently, a subspace search method has been proposed for
spectral clustering as well [24].
In general, all of the proposed subspace search methods fo-
cus on specific clustering tasks. Their selection highly depends
on the underlying clustering model. In contrast to this, our
technique is based on a more general analysis of conditional
dependence. Furthermore, we propose an instantiation of our
objective function that aims at high contrast w.r.t. density-
based outlier ranking, and thus, is tailored to detect low density
regions as required for many outlier models.

III. HIGH CONTRAST SUBSPACES (HICS)
The main idea of our HiCS approach is the statistical
selection of high contrast subspaces. We propose a processing
based on a series of statistical tests. Each test compares the
data distribution in a local subspace region to its marginal
distribution. Dependencies between attributes highlight the
high contrast of a subspace. Based on these statistical tests
and the detected dependence between attributes we derive our
contrast measure. It provides the means for high quality outlier
ranking in a selection of high contrast subspaces.
Overall, HiCS establishes a first statistical subspace search
technique for density-based outlier ranking. In the following,
we will introduce the necessary notation in Section III-A, and
define the general objective for our high contrast subspaces in
Section III-B. We will introduce the notion of subspace slices
that specify local subspace regions in Section III-C, and define
the contrast measure in Section III-D. In Section III-E we will
show how different statistical tests can be used to instantiate
our contrast definition.
A. Notation
Let DB be a database containing N objects, each described by a D-dimensional real-valued data vector $x = (x_1, \dots, x_D)$. The set $A = \{1, \dots, D\}$ denotes the full data space of all given attributes. Any attribute subset $S = \{s_1, \dots, s_d\} \subseteq A$ will be called a d-dimensional subspace projection. We denote the distance between objects x and y as $dist_A(x, y)$, which can be instantiated for instance by the widely used Euclidean distance $dist_A(x, y) = \sqrt{\sum_{s \in A} (x_s - y_s)^2}$.
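As a small illustration of this notation, the following Python sketch (the function name and array layout are our own, for illustration only) computes the Euclidean distance restricted to an attribute subset S:

```python
import numpy as np

def dist(x, y, S):
    """Euclidean distance between objects x and y, restricted to
    the attribute subset S (a list of column indices)."""
    x, y = np.asarray(x), np.asarray(y)
    return np.sqrt(np.sum((x[list(S)] - y[list(S)]) ** 2))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.0, 1.0, 4.0])
print(dist(x, y, S=[0, 1, 2, 3]))  # full-space distance dist_A
print(dist(x, y, S=[0, 2]))        # subspace distance dist_S
```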
As a general property of any outlier ranking method, we have
to consider the underlying scoring function. It measures the
outlierness of an object. Traditionally, each object is sorted
according to a single outlier score score(x) measuring the
degree of deviation in all given attributes A. Traditional
density-based outlier scores measure the density p(x) of an
object and compare it to the density in the local neighborhood
of x. Local outlier ranking based on density deviation in
local neighborhoods has first been proposed by LOF [7]. In
recent years, this outlier mining paradigm has been extended
by enhanced scoring functions and efficient outlier ranking
algorithms [25], [5], [13], [19], [17], [23], [9].
The problem with all of these full space approaches is intro-
duced by the curse of dimensionality. As pointed out in [6], the
definition of a local neighborhood becomes meaningless for
a large number of attributes. Furthermore, distances between objects grow more and more alike; thus
$$\lim_{|A| \to \infty} \left( \max_{z \in DB} dist_A(z, x) - \min_{z \in DB} dist_A(z, x) \right) = 0$$
Since local outlier ranking calculates the density based on the
object distances, we observe the same effect for the minimal
and maximal value of score(x). As a result, all mentioned
outlier score functions will suffer from a loss of contrast, i.e.:
$$score(x) \approx score(y) \quad \forall\, x, y \in DB$$
Any outlier ranking obtained for a sufficiently high dimen-
sional database will degenerate into a random ranking with
very similar scores for all objects.
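This loss of contrast is easy to reproduce; a short illustrative Python experiment (our own construction, with uniformly distributed data) shows the relative gap between the farthest and nearest neighbor shrinking as the dimensionality D grows:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
for D in [2, 10, 100, 1000]:
    data = rng.random((N, D))    # uniformly scattered objects
    query = rng.random(D)
    dists = np.linalg.norm(data - query, axis=1)
    # relative contrast (max - min) / min shrinks toward 0 as D grows
    print(D, (dists.max() - dists.min()) / dists.min())
```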
Subspace outlier rankings address this problem by evalu-
ating the score function in lower dimensional subspace pro-
jections. They simply restrict the distance computation to a
selected subspace S, i.e., compute $dist_S$. Thus, any outlier ranking with score(x) can be extended to a subspace score $score_S(x)$. The idea is to aggregate these $score_S(x)$ values over several subspaces. Each score provides some insights about the deviation of x in a lower dimensional projection S. The final ranking is derived from the aggregation of these scores:
Definition 1 (Outlier Score):
$$score(x) = \frac{1}{|RS|} \sum_{S \in RS} score_S(x)$$
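A minimal Python sketch of Definition 1, using scikit-learn's LocalOutlierFactor as one possible instantiation of $score_S(x)$; the fixed subspace list RS below is a placeholder for illustration, whereas HiCS selects RS by contrast:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def subspace_outlier_scores(data, RS, k=10):
    """Average the LOF score of every object over the subspaces in RS.
    data: (N, D) array; RS: list of attribute index tuples."""
    scores = np.zeros(len(data))
    for S in RS:
        lof = LocalOutlierFactor(n_neighbors=k)
        lof.fit(data[:, list(S)])
        # negative_outlier_factor_ is -LOF; negate so higher = more outlying
        scores += -lof.negative_outlier_factor_
    return scores / len(RS)

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 5))
print(subspace_outlier_scores(data, RS=[(0, 1), (2, 3, 4)])[:5])
```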
In the most basic approach [20], RS is a selection of
random subspaces that contribute to the overall ranking. A
major drawback of this approach is that irrelevant subspaces
in RS might blur the overall order of objects. To tackle this
challenge, we propose a novel method to select high contrast
subspaces only. Our subspace search technique excludes low
contrast subspaces, which inhibit a clear distinction between
outliers and regular objects.
For our experiments, we instantiate $score_S(x)$ with the commonly used local outlier factor [7]. It has been used for the subspace extension based on random projections [20] as well. However, our technique is not restricted to LOF. Any other density-based scoring function could be used for $score_S(x)$. This flexibility w.r.t. the score function is a main
advantage of our method. We only consider the contrast of
subspaces and their selection as pre-processing step. Any
improvement in the area of outlier scoring can be applied
directly to our approach as well. In recent years several
extensions of LOF have addressed specific challenges for this
local outlier ranking [25], [19], [23], [17]. While each of these
publications proposes an individual score function, they all
have an assumption in common: An outlier has low density
compared to its local neighborhood. Our technique relies
only on this general assumption.
To derive our criterion for subspace contrast, we treat the
attributes in DB as random variables. We use the notion of
probability density functions (pdf) to derive the formal back-
ground of our contrast criterion. We will adapt the notation for
subspaces as follows. For a given subspace $S = \{s_1, \dots, s_d\}$, we refer to the projected data vectors as $x_S = (x_{s_1}, \dots, x_{s_d})$.
Notation 1: The subspace data vector $x_S$ is distributed by an unknown joint pdf of S:
$$p_{s_1,\dots,s_d}(x_{s_1}, \dots, x_{s_d})$$
By integration over all attributes $s \in A \setminus \{s_i\}$ we obtain:
Notation 2: The marginal pdf of attribute $s_i$:
$$p_{s_i}(x_{s_i})$$

Fig. 2. High vs. low contrast and the effects on outlier ranking: (a) dataset A, an example of an uncorrelated joint pdf; (b) dataset B, an example of a correlated joint pdf
Please note that the marginal densities are simply one-
dimensional projections, independent from any subspace. Fur-
thermore, we can require a condition on the attributes $s \in S \setminus \{s_i\}$, which leads to the following notion.
Notation 3: The conditional pdf of attribute $s_i$:
$$p_{s_i \mid s \in S \setminus \{s_i\}}\left(x_{s_i} \mid \{x_s : s \in S \setminus \{s_i\}\}\right)$$
Thus, we express the probability density function of $s_i$ w.r.t. $|S| - 1$ conditions on all other attributes in the subspace.
B. High Contrast Improves Outlier Ranking
Given the notion of probability density in any subspace
S, we measure the contrast by comparing conditional prob-
ability densities to the corresponding marginal densities for
all attributes $s_i \in S$. This idea is based on the following key hypothesis: the detection of non-trivial outliers is only possible in a subspace S that shows high dependence between all attributes $s_i \in S$. The notion of non-trivial outliers is
a new concept and we will postpone the formal definition
for a moment. Intuitively, a non-trivial outlier is an outlier
in subspace S, but it is not visible as outlier in any one-
dimensional projection of S, i.e., all its one-dimensional
attribute values are located in regions of high density. Based on
the one-dimensional projections, a non-trivial outlier appears
to be a clustered object.
1) Motivation Example:
We illustrate the relationship between correlated subspaces
and non-trivial outliers by a toy example (cf. Figure 2). It
consists of two two-dimensional datasets. Both datasets were
generated from the same marginal distributions. In dataset A, $s_1$ and $s_2$ are completely uncorrelated. As a result, this two-dimensional subspace is filled by a random scattering of objects in consistency with the marginal distribution. Nevertheless, the dataset contains an outlier object $o_1$. By considering the one-dimensional projections of this subspace, the existence of $o_1$ is not a surprise: $o_1$ could trivially be detected by the examination of the one-dimensional distribution of attribute $s_2$. We call such an object a trivial outlier. In summary, the
evaluation of the two-dimensional subspace does not reveal
any new information for this dataset.
The other dataset features marginal distributions identical to
the ones of dataset A. The difference is that dataset B shows a
significant correlation. The correlation allows the data objects
to form regions of varying or unexpected densities over the
total possible area that would be consistent with the marginal
distribution. We observe (a) cluster-like dense agglomerations
of objects and (b) sparse or even empty regions. Besides a
trivial outlier $o_1$, the subspace also features another outlier $o_2$. This time the outlier is hidden in all one-dimensional subspace projections, where it even appears to be a clustered object. We will call this type of object a non-trivial outlier.
For dataset B the evaluation of the two-dimensional subspace
was worthwhile and reveals significant insight regarding the
data structure. Accordingly, we have found an example for a
high contrast subspace in this case.
Once we have found such a high contrast subspace we
can apply any density-based outlier ranking algorithm: for
instance in dataset B, $o_1$ and $o_2$ both exhibit a much lower density compared to the local neighborhood. Thus, determining the outlierness in the two-dimensional subspace of dataset B would result in a detection of $o_1$ and $o_2$, i.e., $score_S(o_{1/2}) \gg score_S(o_i)$ for all other objects $o_i$ in the database.
We can also explain the essential idea of our approach
to identify high contrast subspaces using this toy example.
Depicted on top of each plot in Figure 2, we show two different
histograms for the $s_1$ axis of both datasets. The first one (red) represents the full data sample, i.e., corresponds to the marginal probability distribution $p_{s_1}(x_{s_1})$. The blue one shows the conditional probability distribution that is generated by the sample according to the selection range w.r.t. the $s_2$ axis (blue
area). The comparison of the blue vs. the red histograms for both datasets shows a basic property of correlation: Whereas
the histograms for dataset A are in good agreement, we see
a significant discrepancy between the two histograms for the
high contrast subspace B. The proposed HiCS algorithm is
based on the evaluation of this discrepancy.
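The following Python sketch mimics this comparison on synthetic stand-ins for datasets A and B (the data generation and the fixed selection range on $s_2$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10_000
# Dataset A: s1 and s2 independent; dataset B: s1 and s2 correlated.
a_s1, a_s2 = rng.normal(size=N), rng.normal(size=N)
b_s1 = rng.normal(size=N)
b_s2 = b_s1 + 0.3 * rng.normal(size=N)

for name, s1, s2 in [("A", a_s1, a_s2), ("B", b_s1, b_s2)]:
    cond = s1[(s2 > 0.5) & (s2 < 1.5)]   # conditional sample: range on s2
    bins = np.linspace(-4, 4, 30)
    marg_hist, _ = np.histogram(s1, bins=bins, density=True)
    cond_hist, _ = np.histogram(cond, bins=bins, density=True)
    # total absolute discrepancy between conditional and marginal histogram:
    # small for the uncorrelated dataset A, large for the correlated dataset B
    print(name, np.abs(marg_hist - cond_hist).sum())
```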
Please note that we design our contrast measure as a
conservative subspace selection criterion. The set of selected
subspaces is a proper superset of the subspaces containing
non-trivial outliers. We will later show that high contrast is
a necessary condition for non-trivial outliers. Still, the result
may contain subspaces without any outliers.
In the following we will focus on non-trivial outliers only.
The reason is simple: A user might already know about the
existence of one-dimensional outliers; one can detect these
outliers by existing methods [26] without difficulty. Moreover,
our subspace search can detect trivial outliers as a by-product
of the search for non-trivial outliers. For instance in dataset B,
we will always detect $o_1$ as an outlier as soon as attribute $s_2$ is
part of any high contrast subspace. In any case, the detection
of non-trivial outliers will provide a much higher information
gain to the user. Therefore, we focus on the detection of
correlated subspaces containing such non-trivial outliers.
2) Contrast based on correlation of dimensions:
In probability theory, two events A and B are called inde-
pendent and uncorrelated, if and only if the probability of
the combined event is given by the product of the individual
probabilities, i.e.:
$$p(A \cap B) = p(A) \cdot p(B) \qquad (1)$$
By putting the notion of correlation in the context of sub-
spaces, we obtain:
Definition 2: A subspace S is called an uncorrelated
subspace if and only if:
$$p_{s_1,\dots,s_d}(x_{s_1}, \dots, x_{s_d}) = \prod_{i=1}^{d} p_{s_i}(x_{s_i}) \qquad (2)$$
Please note that the formal distinction between statistical
dependence and correlation is not important for our purpose.
Strictly speaking, the term set of independent attributes would
be the appropriate expression. Instead we prefer to use the
more concise term uncorrelated subspace to express the sta-
tistical independence within a subspace.
To support the observations regarding Figure 2, we want to
examine the characteristics of outlier mining in uncorrelated
subspaces more formally. The observation of a high value of
$score_S(x)$ implies that the object x is located in a region with a low value of the joint pdf $p_{s_1,\dots,s_d}(x_{s_1}, \dots, x_{s_d})$. On the other hand, we can evaluate the expected density for x under the assumption of an uncorrelated subspace:
$$p_{expected}(x_{s_1}, \dots, x_{s_d}) = \prod_{i=1}^{d} p_{s_i}(x_{s_i}) \qquad (3)$$
We define the notion of non-trivial outliers via a comparison of the expected density with the joint density:
Definition 3: We call an object $x_S$ a non-trivial outlier w.r.t. subspace S if
$$p_{s_1,\dots,s_d}(x_{s_1}, \dots, x_{s_d}) \ll p_{expected}(x_{s_1}, \dots, x_{s_d}) \qquad (4)$$
Comparing the definition of an uncorrelated subspace (Eq. 2)
with the definition of non-trivial outliers leads to:
Theorem 1: An uncorrelated subspace S does not contain
any non-trivial outlier.
For an uncorrelated subspace, the joint probability density
function $p_{s_1,\dots,s_d}(x_{s_1}, \dots, x_{s_d})$ is by definition equal to the product of the marginal pdfs and thus will never fulfill Eq. 4. On the other hand, a correlated subspace allows significantly smaller values of $p_{s_1,\dots,s_d}(x_{s_1}, \dots, x_{s_d})$ compared to the
expected density. Thus, we define subspace correlation as the objective function for the subspace contrast.
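To illustrate Definition 3 and Theorem 1, a small Python sketch (using SciPy's gaussian_kde; the data and the evaluation point are our own) compares an estimated joint density with the product of the estimated marginals (Eq. 3):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
N = 5000
s1 = rng.normal(size=N)
s2 = s1 + 0.2 * rng.normal(size=N)         # correlated subspace {s1, s2}

joint = gaussian_kde(np.vstack([s1, s2]))  # estimate of the joint pdf
m1, m2 = gaussian_kde(s1), gaussian_kde(s2)

# A point off the correlation line whose 1-D coordinates are clustered:
x = np.array([[1.0], [-1.0]])
p_joint = joint(x)[0]
p_expected = m1(x[0])[0] * m2(x[1])[0]     # Eq. 3: product of marginals
# p_joint << p_expected: a non-trivial outlier region (Eq. 4);
# in an uncorrelated subspace the two values would agree (Theorem 1)
print(p_joint, p_expected)
```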
3) Measuring Correlation:
We propose to quantify the subspace contrast by a comparison
of different probability density functions. To simplify the
notation, we will express all following conditional probability
densities only for $s_1$ without loss of generality. In the case of an uncorrelated subspace, Eq. 2 simplifies the definition of all conditional probability densities within the subspace, i.e.:
$$p_{s_1}(x_{s_1} \mid x_{s_2}, \dots, x_{s_d}) = \frac{p_{s_1,\dots,s_d}(x_{s_1}, \dots, x_{s_d})}{p_{s_2,\dots,s_d}(x_{s_2}, \dots, x_{s_d})} = p_{s_1}(x_{s_1}) \qquad (5)$$
This allows us to measure the contrast of a subspace by determining the degree of violation of Eq. 5. In other words, we have to compare a conditional pdf of $s_1$ to the corresponding
marginal pdf, and we assign a high contrast to a subspace
if we observe a significant deviation between the two pdfs.
Please note that the correlation analysis within subspaces
goes beyond classical correlation analysis approaches, since
we may be faced with high contrast subspaces with more
than two dimensions. In contrast to, say, the Pearson or
Spearman correlation coefficient [28], the proposed approach
is not limited in the subspace dimensionality. Furthermore,
it is possible to detect any kind of non-linear correlation.
Above all, our approach does not require an evaluation of a
high dimensional joint pdf, but is based on one-dimensional
densities only. Hence, it does not fall prey to the curse of
dimensionality.
In the following sections we will discuss (1) how to empirically analyze the conditional pdf by introducing the notion
of subspace slices, (2) how to compare the conditional pdf to
the marginal pdf by means of statistical tests, and (3) how to
instantiate these statistical tests in our contrast measure.
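As a preview, one simple way to instantiate such a test is a two-sample Kolmogorov-Smirnov test on a conditional sample of $s_1$ versus its marginal sample; using SciPy's ks_2samp here is an illustrative choice on our own synthetic data, and the concrete instantiations follow in Section III-E:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(4)
N = 10_000
s1 = rng.normal(size=N)
s2_indep = rng.normal(size=N)              # uncorrelated with s1
s2_corr = s1 + 0.3 * rng.normal(size=N)    # correlated with s1

for name, s2 in [("uncorrelated", s2_indep), ("correlated", s2_corr)]:
    cond = s1[np.abs(s2 - 1.0) < 0.25]     # conditional sample: slice on s2
    res = ks_2samp(cond, s1)               # deviation conditional vs marginal
    print(name, round(res.statistic, 3))   # larger statistic = higher contrast
```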
C. Evaluation of conditional densities
The main challenge for the proposed calculation of the
subspace contrast is the empirical analysis of the conditional
probability densities $p_{s_1|\dots} \equiv p_{s_1 \mid s_2,\dots,s_d}(x_{s_1} \mid x_{s_2}, \dots, x_{s_d})$.
Since we do not require any knowledge of the underlying

References
I. T. Jolliffe. Principal Component Analysis. Springer.
M. Ester, H.-P. Kriegel, J. Sander, X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. KDD, 1996.
R. Agrawal, R. Srikant. Fast algorithms for mining association rules. In Proc. VLDB, 1994.