Steganalysis in high dimensions: Fusing classifiers
built on random subspaces
Jan Kodovský and Jessica Fridrich
Department of Electrical and Computer Engineering
Binghamton University, State University of New York
ABSTRACT
By working with high-dimensional representations of covers, modern steganographic methods are capable of preserving a large number of complex dependencies among individual cover elements and thus avoid detection using current best steganalyzers. Inevitably, steganalysis needs to start using high-dimensional feature sets as well. This brings two key problems: construction of good high-dimensional features and machine learning that scales well with respect to dimensionality. Depending on the classifier, high dimensionality may lead to problems with the lack of training data, infeasibly high complexity of training, degradation of generalization abilities, lack of robustness to cover source, and saturation of performance below its potential. To address these problems, collectively known as the curse of dimensionality, we propose ensemble classifiers as an alternative to the much more complex support vector machines. Based on the character of the media being analyzed, the steganalyst first puts together a high-dimensional set of diverse "prefeatures" selected to capture dependencies among individual cover elements. Then, a family of weak classifiers is built on random subspaces of the prefeature space. The final classifier is constructed by fusing the decisions of individual classifiers. The advantage of this approach is its universality, low complexity, simplicity, and improved performance when compared to classifiers trained on the entire prefeature set. Experiments with the steganographic algorithms nsF5 and HUGO demonstrate the usefulness of this approach over current state of the art.
1. MOTIVATION
Today, security of steganographic algorithms designed for empirical cover sources is evaluated using steganalyzers built as binary classifiers trained on cover and stego features. Originally, the features were designed by hand to capture the impact of known steganographic schemes.12 A cleaner strategy is to obtain the features by adopting a model for individual cover elements and use its sampled form as the feature.7, 22, 27 Close attention has usually been paid to keeping the dimensionality of the feature space low due to potential problems with the Curse of Dimensionality (CoD). However, modern steganographic methods, such as HUGO,23 are designed to approximately preserve a high-dimensional representation of covers, and thus many complex dependencies among pixels as well.

With the increased sophistication of steganographic algorithms, steganalysis has already begun using feature spaces of increased dimensionality. The most accurate spatial-domain steganalysis of ±1 embedding (LSB matching) uses the 686-dimensional SPAM features,22 while a 1,234-dimensional Cross-Domain Feature (CDF) set was employed in18 to attack YASS.25 Moreover, the recent results of the steganalysis competition BOSS11 indicate that there is little hope that a human-designed low-dimensional feature space effective against HUGO23 exists.

In this paper, we address both issues: the design of useful high-dimensional feature spaces and a machine learning approach that scales well. We form a high-dimensional feature space of "prefeatures" whose role is to capture as many statistical dependencies among individual cover elements as possible. This is achieved either by merging existing feature sets or by forming the prefeatures from joint statistics of groups of cover elements that exhibit the strongest relationship. The emphasis here is on diversity, while one should not be as concerned with dimensionality. Having formed the prefeature space, the steganalyst builds a scalable ensemble classifier by fusing the decisions of a set of simple base learners built on random subspaces of the prefeature space. This machine-learning strategy is introduced as a low-cost and scalable alternative to Support Vector Machines (SVMs), and it can achieve performance as good as or even better than that of the SVMs.

E-mail: {jan.kodovsky, fridrich}@binghamton.edu; J.F.: http://www.ws.binghamton.edu/fridrich
HUGO works in a feature space of dimensionality 10^7.

[Figure 1 (three panels): testing error of the L-SVM on the synthetic problem. Left: error vs. number of training samples N for fixed d (50, 200, 500) and E_C (0.1, 0.3). Middle: error vs. dimensionality d for fixed N (100, 300, 1000) and E_C (0.1, 0.3). Right: error vs. clairvoyant error E_C for fixed N = 300 and d (20, 100, 1000).]

Figure 1. Effect of N, d, and E_C on the classification performance when fixing E_C (D_C) and d (left), N and E_C (middle), and N and d (right). All experiments were repeated 50 times and the median values are plotted, together with their MAD.
We explain our approach in six sections. In the next section, using a carefully designed artificial scenario, we point out the complications that manifest when training a classifier in high dimensions. The ensemble classifier is described in Section 3. Several strategies for forming prefeatures in both the JPEG and spatial domains are introduced in Sections 4 and 5, where we include all experiments. The goal is to demonstrate the merit of using high-dimensional prefeatures in combination with ensemble classifiers by comparing to selected existing steganalyzers. Section 6 summarizes our contribution.

We use calligraphic font for sets and collections, while vectors or matrices are always in boldface. The symbol N_0 is used for the set of positive integers. MAD stands for the Median Absolute Deviation.
2. CURSE OF DIMENSIONALITY
The purpose of this section is to shed more light on the difficulties one may encounter when classifying in high
dimensions. We do so on an artificially created supervised classification problem.
Let us assume that we have N training examples of dimensionality d from two classes, X = {x^(1), . . . , x^(N/2)} and Y = {y^(1), . . . , y^(N/2)}, x^(m), y^(m) ∈ R^d. Each example is formed by d i.i.d. realizations of a Gaussian random variable with mean 0 (for class X) and s > 0 (for class Y). Assuming both classes are equiprobable, for a given test sample z ∈ R^d, the optimal test statistic is the sample mean z̄ thresholded with s/2. We call this classifier clairvoyant. The total testing error of this classifier is E_C = 1 - Φ(s√d/2), where Φ(x) is the c.d.f. of a standard normal variable. Furthermore, we define the class distinguishability as D_C = 1 - 2E_C.
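As a concrete illustration of this setup (a minimal sketch with arbitrary parameter values, not taken from the paper), the following Python snippet generates the two synthetic classes and compares the empirical error of the clairvoyant detector, which thresholds the sample mean at s/2, against the closed form E_C = 1 - Φ(s√d/2):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
d, s, n = 500, 0.05, 20000                 # dimensionality, Gaussian shift, samples per class

X = rng.normal(0.0, 1.0, size=(n, d))      # class X: i.i.d. N(0, 1)
Y = rng.normal(s, 1.0, size=(n, d))        # class Y: i.i.d. N(s, 1)

# Clairvoyant detector: threshold the sample mean of each example at s/2.
p_fa = np.mean(X.mean(axis=1) > s / 2)     # covers classified as stego
p_md = np.mean(Y.mean(axis=1) <= s / 2)    # stego classified as cover
empirical_EC = (p_fa + p_md) / 2

theoretical_EC = 1 - norm.cdf(s * np.sqrt(d) / 2)
print(empirical_EC, theoretical_EC)        # the two values should closely agree
```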
The performance of detectors built using machine learning tools may be quite different from the clairvoyant classifier. In this section, we study the performance of linear SVMs (L-SVMs) for different numbers of training examples, N, feature dimensionalities, d, and levels of class distinguishability, controlled through E_C. Since we know that the optimal separating boundary is linear, kernelized SVMs cannot give better results.

Figure 1 shows the results of experiments in which two of the three parameters N, d, and E_C (or D_C) are fixed while the remaining one varies. For fixed dimensionality d and class distinguishability D_C, with an increasing number of training examples, N, the testing error of the L-SVM approaches that of the clairvoyant detector (Figure 1 (left)). Furthermore, higher d or lower D_C requires more samples for the L-SVM to perform well.
In Figure 1 (middle), we fix the number of training samples N and the class distinguishability D_C, and increase the dimensionality d. Note that since we fixed D_C, the Gaussian shift s decreases as d grows and can be analytically expressed as s = (2/√d) Φ^{-1}(1 - E_C). We show the testing errors obtained for two different values of E_C = 0.1 and 0.3. We can clearly see the negative effect of the increased dimension.

(Our testing set consists of 4,000 samples generated from the underlying cover/stego distributions.)
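The experiment behind the middle panel can be reproduced qualitatively with a short script. The sketch below is our illustration, not the authors' code; it assumes scikit-learn's LinearSVC as the L-SVM and uses a fixed C rather than a cross-validated one:

```python
import numpy as np
from scipy.stats import norm
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
N, E_C, N_test = 300, 0.1, 4000                 # training samples, clairvoyant error, test samples

def sample(n, d, s):
    """Draw n/2 examples per class for shift s and dimensionality d."""
    X = rng.normal(0.0, 1.0, size=(n // 2, d))
    Y = rng.normal(s, 1.0, size=(n // 2, d))
    return np.vstack([X, Y]), np.r_[np.zeros(n // 2), np.ones(n // 2)]

for d in (50, 200, 500, 1000):
    s = 2 / np.sqrt(d) * norm.ppf(1 - E_C)      # shift keeping the class distinguishability fixed
    X_trn, y_trn = sample(N, d, s)
    X_tst, y_tst = sample(N_test, d, s)
    clf = LinearSVC(C=1.0, max_iter=10000).fit(X_trn, y_trn)
    err = np.mean(clf.predict(X_tst) != y_tst)
    print(d, err)                               # the error grows with d at fixed N and E_C
```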
In the last experiment, we switch the roles of E_C and d: we fix d and vary E_C by properly adjusting the shift s. Figure 1 (right) shows the testing errors for three different values of d and for a fixed number of training samples N = 300. The curse of dimensionality manifests itself the most for moderate distinguishability. For perfectly separable classes with E_C ≈ 0, the L-SVM handles high dimensionality well. When E_C approaches 0.5 (the classes become indistinguishable), both the clairvoyant and L-SVM classifiers start making random guesses. As before, higher dimensionality leads to a larger difference between the L-SVM and the clairvoyant classifier.
We conclude that there are three main factors that can negatively influence the performance of machine learning tools: (1) a small number of training samples, (2) low class distinguishability, and (3) high dimensionality. Weak steganographic methods are easily detectable because they disturb some elementary cover properties that can be captured by a low-dimensional feature vector with high distinguishability. A fairly small training dataset is then usually sufficient to train a classifier with excellent performance. On the other hand, more advanced steganographic methods (and these are of our interest) require high-dimensional feature spaces capable of capturing more complex dependencies among individual cover elements, which in turn necessitates more training samples.
A seemingly straightforward strategy to improve the performance of existing steganalyzers may be to increase the size of the training set. This way, we allow the machine learning tool to better utilize the given feature space, and we may use feature spaces of higher dimensions without degradation of performance. However, sooner or later one will likely encounter technical problems with data or memory management, or the training would be unacceptably long. Furthermore, in many practical scenarios, the steganalyst lacks information about the cover source (only a limited number of cover examples is available). Here, training the classifier on a different cover source may result in a serious drop in testing performance.5, 15
3. PROPOSED FRAMEWORK
This section contains the description of our proposed framework for building steganalyzers. Instead of hand-crafting low-dimensional features with good class distinguishability and directly applying a machine learning tool, we divide the problem into two separate stages:

1. Form a set of diverse prefeatures that capture as many dependencies among individual cover elements as possible. The emphasis is on diversity while making sure all prefeatures are well populated when computed on a database of covers. At this stage, we do not attempt to limit the dimensionality in any way.

2. In the second, discriminative stage, we take into account the steganographic method under investigation. Our goal is maximizing the classification accuracy on the prefeature space using machine learning techniques that scale well with feature dimensionality.
The construction of prefeatures is discussed at the beginning of each experimental section. Here, we focus on
high-dimensional classification and introduce our ensemble classifier.
3.1 High-dimensional classification options
We start with the following rhetorical question: "What is the best way of utilizing the distinguishing power of high-dimensional prefeatures for steganalysis, given a limited amount of training samples?" One option is a direct application of a classification tool. It is known that SVMs are quite robust to the CoD, so provided it is computationally feasible, this should always be tried. And we may indeed obtain a satisfying performance, depending on the security of the steganographic method (and the secret message length). In this paper, we are more interested in the scenario where the steganographic method is difficult to detect and the classification cannot be performed directly on the prefeature space due to computational issues. Furthermore, in Section 2 we saw that a direct application of machine learning when the dimensionality is much higher than the number of training examples may lead to poor performance.
There exist several well-developed strategies that can be grouped into the following three broad categories:

1. Reduce dimensionality and then classify. The high dimensionality of the prefeature space is first reduced using a dimensionality-reduction technique that can be either unsupervised (PCA) or supervised (e.g., feature selection20). The goal is to reduce the impact of the CoD on the subsequent classification problem. In a "traditional" approach to steganalysis, this reduction is usually achieved using human insight and heuristics.

2. Reduce dimensionality and simultaneously classify. Here, the dimensionality reduction and classification are combined into a single task. One can minimize an appropriately constructed single objective function directly (SVDM21) or, in general, construct an iterative algorithm for dimensionality reduction with a classification feedback after every iteration. In machine learning, these methods are known as embedded methods and wrapper methods.20

3. Ensemble methods follow a simple recipe: reduce dimensionality randomly, construct a classifier (base learner) on the reduced space, and set it aside. Repeating this procedure many times, each base learner is built on a different subspace of the original space. The final decision is formed by aggregating the decisions of individual classifiers.19 This is the direction pursued here.
In order to make the supervised ensemble strategy work, the individual base learners have to be sufficiently diverse in the sense that they should make different errors on unseen data. The diversity is often more important than the accuracy of the individual classifiers, provided their performance is better than random guessing. From this point of view, overtrained base learners are not a big issue. In fact, ensemble classification is often applied to relatively weak and unstable classifiers since these yield higher diversity. It was shown that even fully overtrained base learners, when combined through a classification ensemble, may produce accuracy comparable to state-of-the-art techniques.8
The idea of injecting an element of randomness into classification has been previously used in different forms. In bagging,3 individual classifiers are generated by training a classifier on different bootstrap samples from the training set. In,4 the randomness is in the construction of individual classifiers (classification trees) rather than in the training set. In both cases, a set of "weak" classifiers is created and combined into an ensemble of classifiers. Different combining methods may then be applied to form the final, accurate classification tool;19 however, a simple majority vote is often sufficient. The process of improving the accuracy of a set of base learners by a proper aggregation strategy is known as boosting.26
There exist methods based on random projections that leverage the Johnson–Lindenstrauss (JL) Theorem.1, 2 In a nutshell, the JL Theorem states that, with high probability, the distances and angles between features are approximately preserved when the feature space is projected onto a random subspace of a certain smaller dimension. When an SVM is applied in the projected space, the margin (distinguishability) between classes stays approximately the same and one thus alleviates the CoD.
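To make the JL statement concrete, the following sketch (our own illustration with arbitrary dimensions, not the construction used in the cited work) projects a set of high-dimensional feature vectors onto a random Gaussian subspace and checks how tightly the pairwise distances are preserved:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n, d, k = 200, 10000, 500                    # points, original and projected dimensionality

X = rng.normal(size=(n, d))                  # stand-in for high-dimensional features
R = rng.normal(size=(d, k)) / np.sqrt(k)     # random Gaussian projection matrix
X_proj = X @ R

ratio = pdist(X_proj) / pdist(X)             # ratio of projected to original pairwise distances
print(ratio.min(), ratio.max())              # concentrated around 1 for sufficiently large k
```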
After numerous experiments with ensemble classifiers on the nsF5 algorithm, we arrived at a construction that worked the best for us not only for JPEG images but also for the spatial domain. We make no claim that the classifier described below is the best possible approach. This work should rather be viewed as the first and quite promising step in this direction, while additional effort is certainly required to study and optimize its performance across various embedding algorithms.
3.2 The proposed ensemble classifier
Let d be the dimensionality of the prefeatures, N_trn and N_tst the number of training and testing examples from each class, and L the number of base learners whose decisions will be fused. Furthermore, let x^(m), y^(m) ∈ R^d, m = 1, . . . , N_trn, and b^(k) ∈ R^d, k = 1, . . . , N_tst, be the cover and stego prefeature vectors for the training set and the prefeatures for the testing examples, respectively. The ensemble classifier is described using Algorithm 1. For D ⊂ {1, . . . , d}, x^(D) is the subset of features {x^(k)}_{k∈D}.
The individual classifiers F_l map to {0, 1}, where 0 stands for cover and 1 for stego. The base learners are Fisher Linear Discriminants (FLDs) because such low-complexity, weak, and unstable classifiers desirably increase diversity. After collecting L base learners, the final class predictor is formed by combining their individual decisions using an unweighted (majority) voting strategy.

Algorithm 1 Ensemble classifier.
1: for l = 1 to L do
2:    Randomly select D_l ⊂ {1, . . . , d}, |D_l| = d_red < d.
3:    Train a classifier F_l on cover features x_m^(D_l) and stego features y_m^(D_l), m = 1, . . . , N_trn. Each classifier is a mapping F_l : R^{d_red} → {0, 1}.
4:    Make decisions using F_l:
          F_l(b) ≜ [F_l(b^(1)), . . . , F_l(b^(N_tst))] ∈ {0, 1}^{N_tst}.   (1)
5: end for
6: Fuse all decisions by voting for each test example k ∈ {1, . . . , N_tst}:
          F(k) = 1 if Σ_{l=1}^{L} F_l(b^(k)) > L/2, and 0 otherwise.   (2)
7: return F(k), k = 1, . . . , N_tst
Note that all base learners in the algorithm are trained on feature spaces of a fixed dimension d_red that can be chosen to be significantly smaller than the full dimensionality d. Even though the performance of individual base learners can be weak, the accuracy quickly improves after fusion and eventually levels out. The voting could be replaced by other aggregation rules. For example, when the decision boundary is a hyperplane, one can use the sum of projections on the normal vector of each classifier, or the sum of likelihoods of each projection after fitting models to the projections of cover and stego features. Because in our experiments all three fusion strategies gave essentially identical results, we recommend using voting due to its simplicity. Finally, the individual classifiers should be adjusted to meet a desired performance criterion.
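For concreteness, the construction of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration of the recipe described above and not the authors' implementation; scikit-learn's LinearDiscriminantAnalysis stands in for the FLD base learner, and the default parameter values are placeholders:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_ensemble(cover, stego, L=51, d_red=200, seed=0):
    """Train L FLD base learners, each on its own random d_red-dimensional subspace D_l."""
    rng = np.random.default_rng(seed)
    d = cover.shape[1]
    X = np.vstack([cover, stego])
    y = np.r_[np.zeros(len(cover)), np.ones(len(stego))]        # 0 = cover, 1 = stego
    ensemble = []
    for _ in range(L):
        D_l = rng.choice(d, size=d_red, replace=False)          # random subspace (step 2)
        F_l = LinearDiscriminantAnalysis().fit(X[:, D_l], y)    # base learner (step 3)
        ensemble.append((D_l, F_l))
    return ensemble

def predict_ensemble(ensemble, B):
    """Fuse the L individual decisions on test prefeatures B by majority vote, Eq. (2)."""
    votes = np.sum([F_l.predict(B[:, D_l]) for D_l, F_l in ensemble], axis=0)
    return (votes > len(ensemble) / 2).astype(int)
```

A typical call would be ensemble = train_ensemble(cover_trn, stego_trn) followed by predict_ensemble(ensemble, test_prefeatures); replacing the vote with a sum of the FLD projections, as mentioned above, would only change the fusion step.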
3.3 Implementation issues
An important advantage of the proposed algorithm is its low computational complexity: if implemented correctly, the training does not depend on the prefeature space dimensionality d, which can be achieved by storing the prefeatures as individual files. The complexity is therefore driven by d_red rather than d, as only the selected d_red prefeatures need to be accessed at a time. The ensemble classifier depends on two parameters, d_red and L. As will become apparent in the next experimental section, the classification accuracy saturates rather quickly with L. For the fastest performance, one should choose the smallest L that gives satisfactory performance. The optimal value of d_red is more critical, as values that are too small or too large may give sub-optimal results. Obviously, one could implement a one-dimensional grid search similar to cross-validation in SVMs to find the optimal value of d_red. Fortunately, we observed that the optimum is quite flat, which means that the grid could be sparse. In all our tests, we simply selected d_red by hand after a few initial trials.
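One way such a one-dimensional search over d_red could look, building on the sketch of Algorithm 1 above, is shown below. The sparse candidate grid and the use of a held-out validation split are our assumptions, not the paper's procedure:

```python
import numpy as np

def choose_d_red(cov_trn, stg_trn, cov_val, stg_val, candidates=(100, 200, 400, 800), L=51):
    """Pick d_red from a sparse grid by validation error; the optimum tends to be flat."""
    errors = {}
    for d_red in candidates:
        ens = train_ensemble(cov_trn, stg_trn, L=L, d_red=d_red)   # from the earlier sketch
        p_fa = np.mean(predict_ensemble(ens, cov_val) == 1)        # covers flagged as stego
        p_md = np.mean(predict_ensemble(ens, stg_val) == 0)        # stego flagged as cover
        errors[d_red] = (p_fa + p_md) / 2
    return min(errors, key=errors.get), errors
```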
4. TESTING THE FRAMEWORK IN JPEG DOMAIN
We analyze the proposed steganalysis framework and demonstrate the effects of various design parameters on the nsF5,14 which is currently the most secure algorithm that directly manipulates quantized DCT coefficients. All experiments were carried out on a database of 6,500 JPEG images coming from more than 20 different cameras. All images were converted to grayscale, resized so that the smaller side had 512 pixels with the aspect ratio preserved, and JPEG compressed with quality factor 75. A randomly selected half of the images was used for training and the other half for testing. We used LIBSVM,6 the publicly available implementation of SVMs. The decision threshold of all classifiers (including the FLD base learners) was always adjusted to produce the minimum overall average classification error

    P_E = min_{P_FA} (1/2)(P_FA + P_MD(P_FA))   (3)

on the training data (P_FA and P_MD are the false-alarm and missed-detection probabilities). This error is also used to report the accuracy of detection in the entire paper.
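The criterion (3) can be evaluated directly from the real-valued outputs of a classifier (e.g., FLD projections) on the training data by sweeping the decision threshold. The following sketch is one straightforward way to compute it (our illustration, not the authors' code); it assumes that a larger score indicates "stego":

```python
import numpy as np

def min_average_error(scores_cover, scores_stego):
    """P_E = min over thresholds of (P_FA + P_MD)/2, i.e., Eq. (3)."""
    thresholds = np.unique(np.concatenate([scores_cover, scores_stego]))
    best = 0.5                                   # decide-all-cover / decide-all-stego baseline
    for t in thresholds:
        p_fa = np.mean(scores_cover >= t)        # false alarms: covers classified as stego
        p_md = np.mean(scores_stego < t)         # missed detections: stego classified as cover
        best = min(best, (p_fa + p_md) / 2)
    return best
```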

References
Breiman, L., "Random Forests."
Chang, C.-C. and Lin, C.-J., "LIBSVM: A library for support vector machines."
Breiman, L., "Bagging predictors."
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J., "LIBLINEAR: A library for large linear classification."