Steganalysis in high dimensions: Fusing classifiers
built on random subspaces
Jan Kodovský and Jessica Fridrich
Department of Electrical and Computer Engineering
Binghamton University, State University of New York
ABSTRACT
By working with high-dimensional representations of covers, modern steganographic methods are capable of preserving a large number of complex dependencies among individual cover elements and thus avoid detection using current best steganalyzers. Inevitably, steganalysis needs to start using high-dimensional feature sets as well. This brings two key problems: construction of good high-dimensional features and machine learning that scales well with respect to dimensionality. Depending on the classifier, high dimensionality may lead to problems with the lack of training data, infeasibly high complexity of training, degradation of generalization abilities, lack of robustness to cover source, and saturation of performance below its potential. To address these problems, collectively known as the curse of dimensionality, we propose ensemble classifiers as an alternative to the much more complex support vector machines. Based on the character of the media being analyzed, the steganalyst first puts together a high-dimensional set of diverse "prefeatures" selected to capture dependencies among individual cover elements. Then, a family of weak classifiers is built on random subspaces of the prefeature space. The final classifier is constructed by fusing the decisions of individual classifiers. The advantage of this approach is its universality, low complexity, simplicity, and improved performance when compared to classifiers trained on the entire prefeature set. Experiments with the steganographic algorithms nsF5 and HUGO demonstrate the usefulness of this approach over current state of the art.
1. MOTIVATION
Today, security of steganographic algorithms designed for empirical cover sources is evaluated using steganalyzers built as binary classifiers trained on cover and stego features. Originally, the features were designed by hand to capture the impact of known steganographic schemes.12 A cleaner strategy is to obtain the features by adopting a model for individual cover elements and use its sampled form as the feature.7, 22, 27 Close attention has usually been paid to keeping the dimensionality of the feature space low due to potential problems with the Curse of Dimensionality (CoD). However, modern steganographic methods, such as HUGO,23 are designed to approximately preserve a high-dimensional representation of covers, and thus many complex dependencies among pixels as well.

With the increased sophistication of steganographic algorithms, steganalysis has already begun using feature spaces of increased dimensionality. The most accurate spatial-domain steganalysis of ±1 embedding (LSB matching) uses the 686-dimensional SPAM features,22 while a 1,234-dimensional Cross-Domain Feature (CDF) set was employed in18 to attack YASS.25 Moreover, the recent results of the steganalysis competition BOSS11 indicate that there is little hope that a human-designed low-dimensional feature space effective against HUGO23 exists.

In this paper, we address both issues: the design of useful high-dimensional feature spaces and a machine learning approach that scales well. We form a high-dimensional feature space of "prefeatures" whose role is to capture as many statistical dependencies among individual cover elements as possible. This is achieved either by merging existing feature sets or by forming the prefeatures from joint statistics of groups of cover elements that exhibit the strongest relationship. The emphasis here is on diversity, while one should not be as concerned with dimensionality. Having formed the prefeature space, the steganalyst builds a scalable ensemble classifier by fusing the decisions of a set of simple base learners built on random subspaces of the prefeature space. This machine-learning strategy is introduced as a low-cost and scalable alternative to Support Vector Machines (SVMs), and it can achieve performance as good as or even better than that of the SVMs.

E-mail: {jan.kodovsky, fridrich}@binghamton.edu; J.F.: http://www.ws.binghamton.edu/fridrich
HUGO works in a feature space of dimensionality 10^7.

[Figure 1 (three panels): testing error of the L-SVM on the synthetic problem. Left: error vs. number of training samples N for fixed d (50, 200, 500) and E_C (0.1, 0.3). Middle: error vs. dimensionality d for fixed N (100, 300, 1000) and E_C (0.1, 0.3). Right: error vs. clairvoyant error E_C for fixed N = 300 and d (20, 100, 1000).]

Figure 1. Effect of N, d, and E_C on the classification performance when fixing E_C (D_C) and d (left), N and E_C (middle), and N and d (right). All experiments were repeated 50 times and the median values are plotted, together with their MAD.
We explain our approach in six sections. In the next section, using a carefully designed artificial scenario, we point out the complications that manifest when training a classifier in high dimensions. The ensemble classifier is described in Section 3. Several strategies for forming prefeatures in both the JPEG and spatial domains are introduced in Sections 4 and 5, where we include all experiments. The goal is to demonstrate the merit of using high-dimensional prefeatures in combination with ensemble classifiers by comparing to selected existing steganalyzers. Section 6 summarizes our contribution.

We use calligraphic font for sets and collections, while vectors or matrices are always in boldface. The symbol N_0 is used for the set of positive integers. MAD stands for the Median Absolute Deviation.
2. CURSE OF DIMENSIONALITY
The purpose of this section is to shed more light on the difficulties one may encounter when classifying in high
dimensions. We do so on an artificially created supervised classification problem.
Let us assume that we have N training examples of dimensionality d from two classes, X = {x^(1), . . . , x^(N/2)} and Y = {y^(1), . . . , y^(N/2)}, x^(m), y^(m) ∈ R^d. Each example is formed by d i.i.d. realizations of a Gaussian random variable with mean 0 (for class X) and s > 0 (for class Y). Assuming both classes are equiprobable, for a given test sample z ∈ R^d, the optimal test statistic is the sample mean z̄ thresholded with s/2. We call this classifier clairvoyant. The total testing error of this classifier is E_C = 1 - Φ(s√d/2), where Φ(x) is the c.d.f. of a standard normal variable. Furthermore, we define the class distinguishability as D_C = 1 - 2E_C.
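As a concrete illustration of this setup (a minimal sketch with arbitrary parameter values, not taken from the paper), the following Python snippet generates the two synthetic classes and compares the empirical error of the clairvoyant detector, which thresholds the sample mean at s/2, against the closed form E_C = 1 - Φ(s√d/2):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
d, s, n = 500, 0.05, 20000                 # dimensionality, Gaussian shift, samples per class

X = rng.normal(0.0, 1.0, size=(n, d))      # class X: i.i.d. N(0, 1)
Y = rng.normal(s, 1.0, size=(n, d))        # class Y: i.i.d. N(s, 1)

# Clairvoyant detector: threshold the sample mean of each example at s/2.
p_fa = np.mean(X.mean(axis=1) > s / 2)     # covers classified as stego
p_md = np.mean(Y.mean(axis=1) <= s / 2)    # stego classified as cover
empirical_EC = (p_fa + p_md) / 2

theoretical_EC = 1 - norm.cdf(s * np.sqrt(d) / 2)
print(empirical_EC, theoretical_EC)        # the two values should closely agree
```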
The performance of detectors built using machine learning tools may be quite different from the clairvoyant classifier. In this section, we study the performance of linear SVMs (L-SVMs) for different numbers of training examples, N, feature dimensionalities, d, and levels of class distinguishability, controlled through E_C. Since we know that the optimal separating boundary is linear, kernelized SVMs cannot give better results.

Figure 1 shows the results of experiments in which two of the three parameters N, d, and E_C (or D_C) are fixed while the remaining one varies. For fixed dimensionality d and class distinguishability D_C, with an increasing number of training examples, N, the testing error of the L-SVM approaches that of the clairvoyant detector (Figure 1 (left)). Furthermore, higher d or lower D_C requires more samples for the L-SVM to perform well.
In Figure 1 (middle), we fix the number of training samples N and the class distinguishability D_C, and increase the dimensionality d. Note that since we fixed D_C, the Gaussian shift s decreases as d grows and can be analytically expressed as s = (2/√d) Φ^{-1}(1 - E_C). We show the testing errors obtained for two different values of E_C = 0.1 and 0.3. We can clearly see the negative effect of the increased dimension.

(Our testing set consists of 4,000 samples generated from the underlying cover/stego distributions.)
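The experiment behind the middle panel can be reproduced qualitatively with a short script. The sketch below is our illustration, not the authors' code; it assumes scikit-learn's LinearSVC as the L-SVM and uses a fixed C rather than a cross-validated one:

```python
import numpy as np
from scipy.stats import norm
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
N, E_C, N_test = 300, 0.1, 4000                 # training samples, clairvoyant error, test samples

def sample(n, d, s):
    """Draw n/2 examples per class for shift s and dimensionality d."""
    X = rng.normal(0.0, 1.0, size=(n // 2, d))
    Y = rng.normal(s, 1.0, size=(n // 2, d))
    return np.vstack([X, Y]), np.r_[np.zeros(n // 2), np.ones(n // 2)]

for d in (50, 200, 500, 1000):
    s = 2 / np.sqrt(d) * norm.ppf(1 - E_C)      # shift keeping the class distinguishability fixed
    X_trn, y_trn = sample(N, d, s)
    X_tst, y_tst = sample(N_test, d, s)
    clf = LinearSVC(C=1.0, max_iter=10000).fit(X_trn, y_trn)
    err = np.mean(clf.predict(X_tst) != y_tst)
    print(d, err)                               # the error grows with d at fixed N and E_C
```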
In the last experiment, we switch the roles of E_C and d: we fix d and vary E_C by properly adjusting the shift s. Figure 1 (right) shows the testing errors for three different values of d and for a fixed number of training samples N = 300. The curse of dimensionality manifests itself the most for moderate distinguishability. For perfectly separable classes with E_C ≈ 0, the L-SVM handles high dimensionality well. When E_C approaches 0.5 (the classes become indistinguishable), both the clairvoyant and L-SVM classifiers start making random guesses. As before, higher dimensionality leads to a larger difference between the L-SVM and the clairvoyant classifier.
We conclude that there are three main factors that can negatively influence the performance of machine learning tools: (1) a small number of training samples, (2) low class distinguishability, and (3) high dimensionality. Weak steganographic methods are easily detectable because they disturb some elementary cover properties that can be captured by a low-dimensional feature vector with high distinguishability. A fairly small training dataset is then usually sufficient to train a classifier with excellent performance. On the other hand, more advanced steganographic methods (and these are of our interest) require high-dimensional feature spaces capable of capturing more complex dependencies among individual cover elements, which in turn necessitates more training samples.
A seemingly straightforward strategy to improve the performance of existing steganalyzers may be to increase the size of the training set. This way, we allow the machine learning tool to better utilize the given feature space, and we may use feature spaces of higher dimensions without degradation of performance. However, sooner or later one will likely encounter technical problems with data or memory management, or the training would be unacceptably long. Furthermore, in many practical scenarios, the steganalyst lacks information about the cover source (only a limited number of cover examples is available). Here, training the classifier on a different cover source may result in a serious drop in testing performance.5, 15
3. PROPOSED FRAMEWORK
This section contains the description of our proposed framework for building steganalyzers. Instead of hand-crafting low-dimensional features with good class distinguishability and directly applying a machine learning tool, we divide the problem into two separate stages:

1. Form a set of diverse prefeatures that capture as many dependencies among individual cover elements as possible. The emphasis is on diversity while making sure all prefeatures are well populated when computed on a database of covers. At this stage, we do not attempt to limit the dimensionality in any way.

2. In the second, discriminative stage, we take into account the steganographic method under investigation. Our goal is maximizing the classification accuracy on the prefeature space using machine learning techniques that scale well with feature dimensionality.
The construction of prefeatures is discussed at the beginning of each experimental section. Here, we focus on
high-dimensional classification and introduce our ensemble classifier.
3.1 High-dimensional classification options
We start with the following rhetorical question: "What is the best way of utilizing the distinguishing power of high-dimensional prefeatures for steganalysis, given a limited amount of training samples?" One option is a direct application of a classification tool. It is known that SVMs are quite robust to the CoD, so provided it is computationally feasible, this should always be tried. And we may indeed obtain a satisfying performance, depending on the security of the steganographic method (and the secret message length). In this paper, we are more interested in the scenario where the steganographic method is difficult to detect and the classification cannot be performed directly on the prefeature space due to computational issues. Furthermore, in Section 2 we saw that a direct application of machine learning when the dimensionality is much higher than the number of training examples may lead to poor performance.
There exist several well-developed strategies that can be grouped into the following three broad categories:

1. Reduce dimensionality and then classify. The high dimensionality of the prefeature space is first reduced using a dimensionality-reduction technique that can be either unsupervised (PCA) or supervised (e.g., feature selection20). The goal is to reduce the impact of the CoD on the subsequent classification problem. In a "traditional" approach to steganalysis, this reduction is usually achieved using human insight and heuristics.

2. Reduce dimensionality and simultaneously classify. Here, the dimensionality reduction and classification are combined into a single task. One can minimize an appropriately constructed single objective function directly (SVDM21) or, in general, construct an iterative algorithm for dimensionality reduction with a classification feedback after every iteration. In machine learning, these methods are known as embedded methods and wrapper methods.20

3. Ensemble methods follow a simple recipe: reduce dimensionality randomly, construct a classifier (base learner) on the reduced space, and set it aside. Repeating this procedure many times, each base learner is built on a different subspace of the original space. The final decision is formed by aggregating the decisions of individual classifiers.19 This is the direction pursued here.
In order to make the supervised ensemble strategy work, the individual base learners have to be sufficiently diverse in the sense that they should make different errors on unseen data. The diversity is often more important than the accuracy of the individual classifiers, provided their performance is better than random guessing. From this point of view, overtrained base learners are not a big issue. In fact, ensemble classification is often applied to relatively weak and unstable classifiers since these yield higher diversity. It was shown that even fully overtrained base learners, when combined through a classification ensemble, may produce accuracy comparable to state-of-the-art techniques.8
The idea of injecting an element of randomness into classification has been previously used in different forms. In bagging,3 individual classifiers are generated by training a classifier on different bootstrap samples from the training set. In,4 the randomness is in the construction of individual classifiers (classification trees) rather than in the training set. In both cases, a set of "weak" classifiers is created and combined into an ensemble of classifiers. Different combining methods may then be applied to form the final, accurate classification tool;19 however, a simple majority vote is often sufficient. The process of improving the accuracy of a set of base learners by a proper aggregation strategy is known as boosting.26
There exist methods based on random projections that leverage the Johnson–Lindenstrauss (JL) Theorem.1, 2 In a nutshell, the JL Theorem states that, with high probability, the distances and angles between features are approximately preserved when the feature space is projected onto a random subspace of a certain smaller dimension. When an SVM is applied in the projected space, the margin (distinguishability) between classes stays approximately the same and one thus alleviates the CoD.
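To make the JL statement concrete, the following sketch (our own illustration with arbitrary dimensions, not the construction used in the cited work) projects a set of high-dimensional feature vectors onto a random Gaussian subspace and checks how tightly the pairwise distances are preserved:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n, d, k = 200, 10000, 500                    # points, original and projected dimensionality

X = rng.normal(size=(n, d))                  # stand-in for high-dimensional features
R = rng.normal(size=(d, k)) / np.sqrt(k)     # random Gaussian projection matrix
X_proj = X @ R

ratio = pdist(X_proj) / pdist(X)             # ratio of projected to original pairwise distances
print(ratio.min(), ratio.max())              # concentrated around 1 for sufficiently large k
```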
After numerous experiments with ensemble classifiers on the nsF5 algorithm, we arrived at a construction that worked the best for us not only for JPEG images but also for the spatial domain. We make no claim that the classifier described below is the best possible approach. This work should rather be viewed as the first and quite promising step in this direction, while additional effort is certainly required to study and optimize its performance across various embedding algorithms.
3.2 The proposed ensemble classifier
Let d be the dimensionality of the prefeatures, N_trn and N_tst the number of training and testing examples from each class, and L the number of base learners whose decisions will be fused. Furthermore, let x^(m), y^(m) ∈ R^d, m = 1, . . . , N_trn, and b^(k) ∈ R^d, k = 1, . . . , N_tst, be the cover and stego prefeature vectors for the training set and the prefeatures for the testing examples, respectively. The ensemble classifier is described using Algorithm 1. For D ⊂ {1, . . . , d}, x^(D) is the subset of features {x^(k)}_{k∈D}.
The individual classifiers F_l map to {0, 1}, where 0 stands for cover and 1 for stego. The base learners are Fisher Linear Discriminants (FLDs) because such low-complexity, weak, and unstable classifiers desirably increase diversity. After collecting L base learners, the final class predictor is formed by combining their individual decisions using an unweighted (majority) voting strategy.

Algorithm 1 Ensemble classifier.
1: for l = 1 to L do
2:    Randomly select D_l ⊂ {1, . . . , d}, |D_l| = d_red < d.
3:    Train a classifier F_l on cover features x_m^(D_l) and stego features y_m^(D_l), m = 1, . . . , N_trn. Each classifier is a mapping F_l : R^{d_red} → {0, 1}.
4:    Make decisions using F_l:
          F_l(b) ≜ [F_l(b^(1)), . . . , F_l(b^(N_tst))] ∈ {0, 1}^{N_tst}.   (1)
5: end for
6: Fuse all decisions by voting for each test example k ∈ {1, . . . , N_tst}:
          F(k) = 1 if Σ_{l=1}^{L} F_l(b^(k)) > L/2, and 0 otherwise.   (2)
7: return F(k), k = 1, . . . , N_tst
Note that all base learners in the algorithm are trained on feature spaces of a fixed dimension d_red that can be chosen to be significantly smaller than the full dimensionality d. Even though the performance of individual base learners can be weak, the accuracy quickly improves after fusion and eventually levels out. The voting could be replaced by other aggregation rules. For example, when the decision boundary is a hyperplane, one can use the sum of projections on the normal vector of each classifier, or the sum of likelihoods of each projection after fitting models to the projections of cover and stego features. Because in our experiments all three fusion strategies gave essentially identical results, we recommend using voting due to its simplicity. Finally, the individual classifiers should be adjusted to meet a desired performance criterion.
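For concreteness, the construction of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration of the recipe described above and not the authors' implementation; scikit-learn's LinearDiscriminantAnalysis stands in for the FLD base learner, and the default parameter values are placeholders:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_ensemble(cover, stego, L=51, d_red=200, seed=0):
    """Train L FLD base learners, each on its own random d_red-dimensional subspace D_l."""
    rng = np.random.default_rng(seed)
    d = cover.shape[1]
    X = np.vstack([cover, stego])
    y = np.r_[np.zeros(len(cover)), np.ones(len(stego))]        # 0 = cover, 1 = stego
    ensemble = []
    for _ in range(L):
        D_l = rng.choice(d, size=d_red, replace=False)          # random subspace (step 2)
        F_l = LinearDiscriminantAnalysis().fit(X[:, D_l], y)    # base learner (step 3)
        ensemble.append((D_l, F_l))
    return ensemble

def predict_ensemble(ensemble, B):
    """Fuse the L individual decisions on test prefeatures B by majority vote, Eq. (2)."""
    votes = np.sum([F_l.predict(B[:, D_l]) for D_l, F_l in ensemble], axis=0)
    return (votes > len(ensemble) / 2).astype(int)
```

A typical call would be ensemble = train_ensemble(cover_trn, stego_trn) followed by predict_ensemble(ensemble, test_prefeatures); replacing the vote with a sum of the FLD projections, as mentioned above, would only change the fusion step.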
3.3 Implementation issues
An important advantage of the proposed algorithm is its low computational complexity: if implemented correctly, the training does not depend on the prefeature space dimensionality d, which can be achieved by storing the prefeatures as individual files. The complexity is therefore driven by d_red rather than d, as only the selected d_red prefeatures need to be accessed at a time. The ensemble classifier depends on two parameters, d_red and L. As will become apparent in the next experimental section, the classification accuracy saturates rather quickly with L. For the fastest performance, one should choose the smallest L that gives satisfactory performance. The optimal value of d_red is more critical, as values that are too small or too large may give sub-optimal results. Obviously, one could implement a one-dimensional grid search similar to cross-validation in SVMs to find the optimal value of d_red. Fortunately, we observed that the optimum is quite flat, which means that the grid could be sparse. In all our tests, we simply selected d_red by hand after a few initial trials.
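One way such a one-dimensional search over d_red could look, building on the sketch of Algorithm 1 above, is shown below. The sparse candidate grid and the use of a held-out validation split are our assumptions, not the paper's procedure:

```python
import numpy as np

def choose_d_red(cov_trn, stg_trn, cov_val, stg_val, candidates=(100, 200, 400, 800), L=51):
    """Pick d_red from a sparse grid by validation error; the optimum tends to be flat."""
    errors = {}
    for d_red in candidates:
        ens = train_ensemble(cov_trn, stg_trn, L=L, d_red=d_red)   # from the earlier sketch
        p_fa = np.mean(predict_ensemble(ens, cov_val) == 1)        # covers flagged as stego
        p_md = np.mean(predict_ensemble(ens, stg_val) == 0)        # stego flagged as cover
        errors[d_red] = (p_fa + p_md) / 2
    return min(errors, key=errors.get), errors
```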
4. TESTING THE FRAMEWORK IN JPEG DOMAIN
We analyze the proposed steganalysis framework and demonstrate the effects of various design parameters on the nsF5,14 which is currently the most secure algorithm that directly manipulates quantized DCT coefficients. All experiments were carried out on a database of 6,500 JPEG images coming from more than 20 different cameras. All images were converted to grayscale, resized so that the smaller side had 512 pixels with the aspect ratio preserved, and JPEG compressed with quality factor 75. A randomly selected half of the images was used for training and the other half for testing. We used LIBSVM,6 the publicly available implementation of SVMs. The decision threshold of all classifiers (including the FLD base learners) was always adjusted to produce the minimum overall average classification error

    P_E = min_{P_FA} (1/2)(P_FA + P_MD(P_FA))   (3)

on the training data (P_FA and P_MD are the false-alarm and missed-detection probabilities). This error is also used to report the accuracy of detection in the entire paper.
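The criterion (3) can be evaluated directly from the real-valued outputs of a classifier (e.g., FLD projections) on the training data by sweeping the decision threshold. The following sketch is one straightforward way to compute it (our illustration, not the authors' code); it assumes that a larger score indicates "stego":

```python
import numpy as np

def min_average_error(scores_cover, scores_stego):
    """P_E = min over thresholds of (P_FA + P_MD)/2, i.e., Eq. (3)."""
    thresholds = np.unique(np.concatenate([scores_cover, scores_stego]))
    best = 0.5                                   # decide-all-cover / decide-all-stego baseline
    for t in thresholds:
        p_fa = np.mean(scores_cover >= t)        # false alarms: covers classified as stego
        p_md = np.mean(scores_stego < t)         # missed detections: stego classified as cover
        best = min(best, (p_fa + p_md) / 2)
    return best
```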

References
Breiman, L., "Random Forests."
Chang, C.-C. and Lin, C.-J., "LIBSVM: A library for support vector machines."
Breiman, L., "Bagging predictors."
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J., "LIBLINEAR: A library for large linear classification."