Coupled Information-Theoretic Encoding for Face Photo-Sketch Recognition
Wei Zhang^1    Xiaogang Wang^2,3    Xiaoou Tang^1,3
^1 Department of Information Engineering, The Chinese University of Hong Kong, zw009@ie.cuhk.edu.hk
^2 Department of Electronic Engineering, The Chinese University of Hong Kong, xgwang@ee.cuhk.edu.hk
^3 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China, xtang@ie.cuhk.edu.hk
Abstract
Automatic face photo-sketch recognition has important applications for law enforcement. Recent research has focused on transforming photos and sketches into the same modality for matching, or on developing advanced classification algorithms to reduce the modality gap between features extracted from photos and sketches. In this paper, we propose a new inter-modality face recognition approach by reducing the modality gap at the feature extraction stage. A new face descriptor based on coupled information-theoretic encoding is used to capture discriminative local face structures and to effectively match photos and sketches. Guided by maximizing the mutual information between photos and sketches in the quantized feature spaces, the coupled encoding is achieved by the proposed coupled information-theoretic projection tree, which is extended to a randomized forest to further boost the performance. We create the largest face sketch database, including sketches of 1,194 people from the FERET database. Experiments on this large scale dataset show that our approach significantly outperforms the state-of-the-art methods.
1. Introduction
Face photo-sketch recognition aims to match a face sketch drawn by an artist to one of many face photos in a database. In law enforcement, it is desirable to automatically search photos in police mug-shot databases using a sketch drawing when the photo of a suspect is not available. This application has led to a number of studies on the topic [26, 27, 28, 31, 9, 14, 6]. Photo-sketch generation and recognition are also useful in the digital entertainment industry.
The major challenge of face photo-sketch recognition is matching images across different modalities. Sketches are a concise representation of human faces, often containing shape exaggeration and having different textures than photos, so directly applying face photo recognition algorithms is infeasible. Recently, great progress has been made in two directions. The first family of approaches [27, 18, 31] focused on the preprocessing stage and synthesized a pseudo-photo from the query sketch, or pseudo-sketches from the gallery photos, to transform inter-modality face recognition into intra-modality face recognition. However, face photo/sketch synthesis is actually a harder problem than recognition, and imperfect synthesis results significantly degrade recognition performance. The second family of approaches [17, 15, 14] focused on the classification stage and tried to design advanced classifiers to reduce the modality gap between features extracted from photos and sketches. If the inter-modality difference between the extracted features is large, the discriminative power of the classifiers is reduced.
Figure 1. A CITP tree with three levels, for illustration purposes. The local structures of photos and sketches are sampled and coupled-encoded via the CITP tree. Each leaf node of the CITP tree corresponds to a cell in the photo vector space and in the sketch vector space. The sampled vectors in the same cell are assigned the same code, so that different local structures have different codes and the same structures in different modalities have the same code.
In this paper, we propose a new approach that reduces the modality gap at the feature extraction stage. A new face descriptor is designed via coupled information-theoretic encoding, which quantizes the local structures of face photos and sketches into discrete codes. To effectively match photos and sketches, the extracted codes should be uniformly distributed across different subjects, which leads to high discriminative power, and the codes of the same subject's photo and sketch should be highly correlated, which leads to a small inter-modality gap. These requirements are well captured by the criterion of maximizing the mutual information between photos and sketches in the quantized feature spaces. The coupled encoding is achieved by the proposed randomized coupled information-theoretic projection forest, which is learned with the maximum mutual information (MMI) criterion.
Another contribution of this work is the release of the CUHK Face Sketch FERET Database (CUFSF, available at http://mmlab.ie.cuhk.edu.hk/cufsf/), a large scale face sketch database. It includes the sketches of 1,194 people from the FERET database [22]. Wang and Tang [31] published the CUFS database with sketches of 606 people, whose sketches had less shape distortion. The new database is not only larger in size but also more challenging, because its sketches have more shape exaggeration and are thus closer to practical applications. Experiments on this large scale dataset show that our approach significantly outperforms the state-of-the-art methods.
1.1. Related work
To synthesize pseudo photos (sketches) from sketches (photos), Tang and Wang [27] proposed to apply the eigentransform globally. Another global approach, proposed by Gao et al. [9], was based on the embedded hidden Markov model and the selective ensemble strategy. Liu et al. [18] proposed patch-based face sketch reconstruction using local linear embedding based mapping; the sketch patches were synthesized independently, ignoring their spatial relationship. Wang and Tang [31] used a multiscale Markov random field (MRF) to model the dependency of neighboring sketch patches. Photos and sketches were matched once they were transformed into the same modality.
In order to reduce the inter-modality gap at the classification stage, Lin and Tang [17] mapped features from two modalities into a common discriminative space. Lei and Li [15] proposed coupled spectral regression (CSR), which is computationally efficient in learning projections that map data from two modalities into a common subspace. Klare et al. [14] proposed local feature-based discriminant analysis (LFDA), which uses multiple projections to extract a discriminative representation from partitioned vectors of SIFT and LBP features.
There is an extensive literature on descriptor-based face recognition [1, 32, 36], owing to its computational efficiency and relative robustness to illumination and pose variations. These methods are relevant to our coupled encoding. However, handcrafted features such as LBP [1] and SIFT [19] were not designed for inter-modality face recognition, so the features extracted from photos and sketches may have large inter-modality variations.
Although information-theoretic concepts have been explored in building decision trees and decision forests for vector quantization [2, 21, 23] in object recognition, these algorithms were applied in a single space and did not address the problem of inter-modality matching. Moreover, with the supervision of object labels, their tree construction processes were much more straightforward than ours.
2. Information-Theoretic Projection Tree
Vector quantization has been widely used to create discrete image representations, such as textons [20] and visual words [24], for object recognition and face recognition. Image pixels [5, 23], filter-bank responses [20] or invariant descriptors [24, 33] are computed either sparsely or densely on a training set, and clustered to produce a codebook by algorithms such as k-means, mean shift [12], random projection trees [5, 8, 33] and random forests [21, 23]. With the codebook, any image can then be turned into an encoded representation.

However, to the best of our knowledge, it is not yet clear how to apply vector quantization to cross-modality object matching. In this section, we present a new coupled information-theoretic projection (CITP) tree for coupled quantization across modalities. We further extend the CITP tree to the randomized CITP tree and forest. For clarity of exposition, we present the method in the photo-sketch recognition scenario.
2.1. Projection Tree
A projection tree [8] partitions a feature space R^D into cells. It is built in a recursive manner, splitting the data along one projection direction at a time. The succession of splits leads to a binary tree, whose leaves are individual cells in R^D. With a built projection tree, a code is assigned to each test sample x according to the cell (i.e. leaf node) it belongs to. The sample is simply propagated down the tree, starting from the root node and branching left or right until a leaf node is reached. Each node is associated with a learned binary function f(x) = sign(w^T x − τ). The node propagates x to its left child if f(x) = −1 and to its right child if f(x) = 1.
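To make the lookup concrete, the following is a minimal sketch of this propagation (the Node class and encode function are illustrative names, not part of the paper):

```python
import numpy as np

class Node:
    """A projection-tree node; leaf nodes carry a code index."""
    def __init__(self, w=None, tau=0.0, left=None, right=None, code=None):
        self.w, self.tau = w, tau            # projection vector and threshold
        self.left, self.right = left, right  # child nodes (None at a leaf)
        self.code = code                     # set only at leaf nodes

def encode(node, x):
    """Propagate sample x down the tree and return its leaf code."""
    while node.code is None:                 # descend until a leaf is reached
        f = np.sign(node.w @ x - node.tau)   # f(x) = sign(w^T x - tau)
        node = node.left if f < 0 else node.right
    return node.code
```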
2.2. Mutual Information Maximization
Since quantization needs to be done in both the photo space and the sketch space, we extend the projection tree to a coupled projection tree. In a coupled projection tree, vectors sampled from photos and sketches share the same tree structure, but are input to different binary functions f^p(x^p) and f^s(x^s) at each node. A vector x^p sampled from the neighborhood of a photo pixel is quantized with f^p, and a vector x^s sampled from the neighborhood of a sketch pixel is quantized with f^s. The sampled photo vectors and sketch vectors are thus mapped to the same codebook, but their coding functions represented by the tree are different, denoted by C^p and C^s, respectively.
To train a coupled projection tree, a set of vector pairs X = {(x^p_i, x^s_i), i = 1, ..., N} is prepared, where x^p_i, x^s_i ∈ R^D. In this paper, x^p_i and x^s_i are the normalized vectors of sampled gradients (i.e. the first-order derivatives I_u and I_v in the horizontal and vertical directions; see Section 3 for details) around the same location in a photo and a sketch of the same subject, respectively. Denote X^p = [x^p_1, ..., x^p_N] and X^s = [x^s_1, ..., x^s_N]. Since x^p_i and x^s_i are sampled from the same subject at the same location, they are expected to be quantized into the same code by the coupled projection tree. Meanwhile, in order to increase the discriminative power, the codes of X^p and X^s are expected to be uniformly distributed across different subjects. To achieve these goals, our coupled information-theoretic projection (CITP) trees are learned using the maximum mutual information (MMI) criterion (see Fig. 2).
Mutual information, a symmetric measure quantifying the statistical information shared between two random variables [7], provides a sound indication of the matching quality between coded photo vectors and coded sketch vectors. Formally, the objective function is

I(C^p(X^p); C^s(X^s)) = H(C^p(X^p)) − H(C^p(X^p) | C^s(X^s)).   (1)

(The mutual information is originally defined between the two random variables C^p(x^p_i) and C^s(x^s_i); we use the empirical mutual information estimated on the training set throughout this paper.)
To increase the discriminative power, the quantization should maximize the entropy H(C^p(X^p)), so that the samples are nearly uniformly distributed over the codebook. To reduce the inter-modality gap, the quantization should minimize the conditional entropy H(C^p(X^p) | C^s(X^s)).
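For later reference, Eqn. (1) can be rewritten with the standard entropy identity (this symmetric form is the one the node-splitting criterion in Eqn. (3) instantiates):

\[
I(C^p(X^p); C^s(X^s)) = H(C^p(X^p)) + H(C^s(X^s)) - H(C^p(X^p), C^s(X^s)),
\]

so maximizing the mutual information simultaneously pushes both marginal code distributions toward uniformity and concentrates the joint distribution on matched code pairs.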
2.3. Tree Construction with MMI
Similar to the random projection tree [8], the CITP tree is built top-down recursively. However, it differs in that the CITP tree is not a balanced binary tree, i.e. the leaf nodes may be at different levels. The tree building process therefore consists of searching for both the best tree structure and the optimal parameters at each node.
Tree structure searching. We adopt a greedy algorithm to build the tree structure. At each iteration, we search for the node whose splitting maximizes the mutual information between the codes of the sampled photo and sketch vectors. The mutual information in Eqn. (1) can be easily approximated in a nonparametric way: all the sampled photo and sketch vectors in the training set are quantized into codes with the current tree after splitting the candidate node, and the joint distribution of photo and sketch codes is computed to estimate the mutual information. A toy example is shown in Fig. 2.
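As an illustrative sketch of this nonparametric estimate (the function name, natural-log units, and array layout are our assumptions):

```python
import numpy as np

def empirical_mutual_information(photo_codes, sketch_codes, n_codes):
    """Estimate I(C^p; C^s) from paired code assignments.

    photo_codes, sketch_codes: int arrays of shape (N,), the codes of the
    i-th photo vector and its paired sketch vector under the current tree.
    """
    # Joint histogram of (photo code, sketch code) pairs.
    joint = np.zeros((n_codes, n_codes))
    np.add.at(joint, (photo_codes, sketch_codes), 1.0)
    joint /= joint.sum()

    pp = joint.sum(axis=1)   # marginal distribution of photo codes
    ps = joint.sum(axis=0)   # marginal distribution of sketch codes

    nz = joint > 0           # zero cells contribute nothing to the sum
    return np.sum(joint[nz] * np.log(joint[nz] / (pp[:, None] * ps[None, :])[nz]))
```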
Figure 2. An illustration of tree construction with MMI. In each step, all current leaf nodes are tested and the one giving the maximum mutual information is selected to split. For each leaf node, we tentatively split it and obtain a tree to encode photo vectors and sketch vectors. The selected leaf node should satisfy two requirements: (1) the codes are uniformly distributed; (2) the codes of photo vectors and corresponding sketch vectors are highly correlated. These requirements are well captured by the MMI criterion. In this example, splitting node A violates requirement (2) and splitting node C violates requirement (1); the corresponding mutual information of both is relatively small (I = 0.08 and I = 1.41, respectively). So node B, with the maximum mutual information (I = 1.51), is selected. (The figure visualizes the histograms and joint histograms of photo and sketch codes; in the joint histograms, the colors represent joint probability densities.)
Node parameter searching. It is critical to search for the optimal parameters of the binary functions f^p(x^p) and f^s(x^s), which determine how to split the node. Formally, we aim at finding projection vectors w_p, w_s and thresholds τ_p, τ_s for node k (the index k of the parameters is omitted for conciseness), such that

ỹ^p_i = w_p^T x^p_i − τ_p,   ŷ^p_i = sign(ỹ^p_i),
ỹ^s_i = w_s^T x^s_i − τ_s,   ŷ^s_i = sign(ỹ^s_i).   (2)
Then a binary value ŷ^p_i (or ŷ^s_i) is assigned to each vector x^p_i (or x^s_i), to split the training data into two subsets and propagate them to the two child nodes. The node propagates a training vector pair (x^p_i, x^s_i) to its children only if the binary values ŷ^p_i and ŷ^s_i are the same; otherwise, the vector pair is treated as an outlier and discarded.
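A minimal sketch of this coupled split (illustrative names; the left/right assignment follows the convention of Algorithm 1 below):

```python
import numpy as np

def split_pairs(Xp, Xs, wp, ws, tau_p, tau_s):
    """Split paired photo/sketch vectors at one CITP node.

    Xp, Xs: arrays of shape (N, D) holding paired photo and sketch vectors.
    Returns index arrays for the left and right children; pairs whose binary
    values disagree are treated as outliers and dropped.
    """
    yp = np.where(Xp @ wp <= tau_p, -1, 1)   # \hat{y}^p_i
    ys = np.where(Xs @ ws <= tau_s, -1, 1)   # \hat{y}^s_i
    agree = yp == ys                         # keep only consistently coded pairs
    left = np.where(agree & (yp == -1))[0]
    right = np.where(agree & (yp == 1))[0]
    return left, right
```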
Suppose that the input of a node k is a set of vector pairs X_k = {(x^p_{k_i}, x^s_{k_i}), 1 ≤ i ≤ N_k}. Denote X^p_k = [x^p_{k_1}, ..., x^p_{k_{N_k}}], X^s_k = [x^s_{k_1}, ..., x^s_{k_{N_k}}], Ỹ^p_k = [ỹ^p_{k_1}, ..., ỹ^p_{k_{N_k}}], Ỹ^s_k = [ỹ^s_{k_1}, ..., ỹ^s_{k_{N_k}}], Ŷ^p_k = [ŷ^p_{k_1}, ..., ŷ^p_{k_{N_k}}] and Ŷ^s_k = [ŷ^s_{k_1}, ..., ŷ^s_{k_{N_k}}]. The node is split according to the MMI criterion, i.e. by maximizing

I(Ŷ^p_k; Ŷ^s_k) = H(Ŷ^p_k) + H(Ŷ^s_k) − H(Ŷ^p_k, Ŷ^s_k).   (3)
Instead of solving the above maximization problem directly, an approximate objective I(Ỹ^p_k; Ỹ^s_k) is maximized first, through which w_p and w_s are estimated without considering τ_p and τ_s. Assume that ỹ^p_{k_i} and ỹ^s_{k_i} are jointly Gaussian distributed. The entropy of a jointly Gaussian random vector g is (1/2) ln[det(Σ_g)] + const [7], where Σ_g is the covariance matrix of g. Following this, the mutual information can be rewritten in a simple form:

I(Ỹ^p_k; Ỹ^s_k) = \frac{1}{2} \ln \frac{\det(Σ^p_k)\,\det(Σ^s_k)}{\det(Σ_k)} + const,   (4)
where Σ^p_k, Σ^s_k and Σ_k are the covariances of Ỹ^p_k, Ỹ^s_k and [(Ỹ^p_k)^T, (Ỹ^s_k)^T]^T, respectively. According to Eqn. (2),

Σ^p_k = w_p^T C^p_k w_p,   Σ^s_k = w_s^T C^s_k w_s,

Σ_k = \begin{pmatrix} w_p^T C^p_k w_p & w_p^T C^{p,s}_k w_s \\ (w_p^T C^{p,s}_k w_s)^T & w_s^T C^s_k w_s \end{pmatrix},   (5)

where C^p_k and C^s_k are the covariance matrices of X^p_k and X^s_k, respectively, and C^{p,s}_k is the covariance matrix between X^p_k and X^s_k.
Substituting Eqn. (5) into Eqn. (4), we find that maximizing (4) is equivalent to the Canonical Correlation Analysis (CCA) model

\max_{w_p, w_s} \frac{w_p^T C^{p,s}_k w_s}{\sqrt{(w_p^T C^p_k w_p)(w_s^T C^s_k w_s)}}.   (6)

So the optimal w_p and w_s are obtained by solving CCA (details are given later). We find that CCA offers a good trade-off between scalability and performance, since the input set is usually very large (about 2.5 million sample pairs in our experiments).
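To see the equivalence (a step the text leaves implicit, spelled out here), write ρ for the objective in (6). With Eqn. (5) the determinant of Σ_k factorizes:

\[
\det(Σ_k) = Σ^p_k\,Σ^s_k - (w_p^T C^{p,s}_k w_s)^2 = Σ^p_k\,Σ^s_k\,(1 - ρ^2),
\qquad
I(Ỹ^p_k; Ỹ^s_k) = -\tfrac{1}{2}\ln(1 - ρ^2) + const,
\]

which is monotonically increasing in ρ^2, so maximizing (4) is exactly maximizing the canonical correlation |ρ|.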
To estimate the thresholds τ_p and τ_s, we use brute-force search to maximize (3) over the region (τ_p, τ_s) ∈ [µ̂_p − σ̂_p, µ̂_p + σ̂_p] × [µ̂_s − σ̂_s, µ̂_s + σ̂_s], where µ̂_p = median_i(ỹ^p_i) and σ̂_p = median_i(|ỹ^p_i − µ̂_p|) are the median and the median absolute deviation of ỹ^p_i, respectively, and µ̂_s and σ̂_s are the median and the median absolute deviation of ỹ^s_i, respectively.
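A sketch of this brute-force search, reusing the empirical_mutual_information helper sketched earlier (the grid resolution is our assumption; the paper does not specify it):

```python
import numpy as np

def search_thresholds(yp, ys, n_grid=16):
    """Grid search for (tau_p, tau_s) maximizing the binary MI of Eqn. (3).

    yp, ys: projected values w_p^T x^p_i and w_s^T x^s_i at this node.
    """
    mu_p, mu_s = np.median(yp), np.median(ys)
    sig_p = np.median(np.abs(yp - mu_p))     # median absolute deviation
    sig_s = np.median(np.abs(ys - mu_s))

    best_mi, best_taus = -np.inf, (mu_p, mu_s)
    for tp in np.linspace(mu_p - sig_p, mu_p + sig_p, n_grid):
        for ts in np.linspace(mu_s - sig_s, mu_s + sig_s, n_grid):
            bp = (yp > tp).astype(int)       # binary codes \hat{y}^p
            bs = (ys > ts).astype(int)       # binary codes \hat{y}^s
            mi = empirical_mutual_information(bp, bs, 2)
            if mi > best_mi:
                best_mi, best_taus = mi, (tp, ts)
    return best_taus
```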
Canonical Correlation Analysis. CCA was introduced by Hotelling for correlating linear relationships between two sets of variates [10], and has been used in several computer vision applications [34, 13, 25]. However, it has not been explored as a component of a vector quantization algorithm. Blaschko and Lampert [4] proposed an algorithm for spectral clustering with paired data based on kernel CCA. However, that method is not appropriate for quantization: the kernel trick incurs high computational and memory cost due to the very large size of the training set, and the nearest centroid assignment may be unstable (there is no hard constraint requiring a pair of vectors to fall in the same cluster).
Algorithm 1: Building a CITP tree
1: Input: a set of vector pairs X = {(x^p_i, x^s_i), i = 1, ..., N}, where x^p_i, x^s_i ∈ R^D, and the expected number of codes (i.e. leaf nodes) n_L.
2: Create an empty set S, and add the root node to S.
3: repeat
4:   for each node k in S and its associated vector set X_k do
5:     Compute the possible node splitting:
       (i) generate projection vectors w_p, w_s and thresholds τ_p, τ_s from X_k;
       (ii) for its left child L and right child R,
            X_L ← {(x^p_i, x^s_i) | w_p^T x^p_i ≤ τ_p, w_s^T x^s_i ≤ τ_s},
            X_R ← {(x^p_i, x^s_i) | w_p^T x^p_i > τ_p, w_s^T x^s_i > τ_s},
            (X_L ⊂ X_k, X_R ⊂ X_k).
6:   end for
7:   Select the best node splitting, i.e. the one with the maximum mutual information in Eqn. (1).
8:   Split that node: remove it from S and add its child nodes to S.
9: until the number of leaf nodes is n_L.
10: Output: the CITP tree with projection vectors and thresholds at each node.
To solve CCA in (6), let

S_m = \begin{pmatrix} 0 & C^{p,s}_k \\ (C^{p,s}_k)^T & 0 \end{pmatrix},
S_n = \begin{pmatrix} C^p_k & C^{p,s}_k \\ (C^{p,s}_k)^T & C^s_k \end{pmatrix},

and then w = [w_p^T, w_s^T]^T can be solved as the eigenvector associated with the largest eigenvalue of the generalized eigenvalue problem S_m w = λ(S_n + εI)w, where ε is a small positive number for regularization.
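A sketch of this eigen-solution with SciPy (the matrix assembly follows the definitions above; the regularization value is illustrative):

```python
import numpy as np
from scipy.linalg import eigh

def solve_coupled_cca(Cp, Cs, Cps, eps=1e-4):
    """Solve S_m w = lambda (S_n + eps I) w for the top eigenvector.

    Cp, Cs: (D, D) covariance matrices of the photo and sketch vectors;
    Cps: (D, D) cross-covariance between them. Returns (w_p, w_s).
    """
    D = Cp.shape[0]
    Sm = np.block([[np.zeros((D, D)), Cps],
                   [Cps.T, np.zeros((D, D))]])
    Sn = np.block([[Cp, Cps],
                   [Cps.T, Cs]])
    # eigh solves the symmetric-definite generalized problem A v = lambda B v;
    # eigenvalues are returned in ascending order, so take the last column.
    vals, vecs = eigh(Sm, Sn + eps * np.eye(2 * D))
    w = vecs[:, -1]
    return w[:D], w[D:]
```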
The whole algorithm for building a CITP tree is summarized in Algorithm 1.
2.4. Randomized CITP Forest
Randomization is an effective way to create an ensemble of trees and thereby boost the performance of tree-structured algorithms [21, 23, 33]. The randomized counterpart of the CITP tree includes the following two modifications to node splitting.
Randomization in sub-vector choice. At each node, we randomly sample α percent (empirically α = 80) of the element indices of the sampled vectors, i.e. we use a sub-vector of each sampled vector to learn the projections. To improve the strength of the generated trees, the random choice is repeated 10 times (a value chosen empirically) at each node, and the one with the maximum mutual information in Eqn. (3) is selected. The randomization at each node results in randomized trees that have different structures and utilize different information from the training data; the randomized trees are therefore more complementary.
Figure 3. The pipeline of extracting CITE descriptors. A photo and a sketch each go through gradient extraction, sampling and normalization (using a ring sampling pattern, e.g. r = 2), and coupled encoding via the CITP tree, yielding coded images; regional code histograms are concatenated into the final descriptors, which are compared with a PCA+LDA classifier.
Randomization in parameter selection. The eigenvectors associated with the first d largest eigenvalues in the CCA model are first selected. Then a set of n vectors is generated by randomly linearly combining the d selected eigenvectors (the eigenvectors are orthogonalized with Gram-Schmidt orthogonalization and normalized to unit L2-norm). According to the MMI criterion in Eqn. (3), the best one is selected from the set of n random vectors and used as the projection vectors w_p and w_s. In our experiments, we choose d = 3 and n = 20.
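A sketch of the candidate generation (the Gaussian combination weights and helper name are our assumptions; the paper only says "randomly linearly combining"):

```python
import numpy as np

def random_projection_candidates(eigvecs, n=20, rng=None):
    """Generate n random projections from the top-d CCA eigenvectors.

    eigvecs: (2D, d) array whose columns are the Gram-Schmidt-orthogonalized,
    L2-normalized eigenvectors of the generalized eigenvalue problem.
    """
    rng = rng or np.random.default_rng()
    d = eigvecs.shape[1]
    cands = []
    for _ in range(n):
        coeffs = rng.normal(size=d)          # random linear combination
        w = eigvecs @ coeffs
        cands.append(w / np.linalg.norm(w))  # keep unit norm
    return cands                             # score each by Eqn. (3), keep the best
```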
The creation of a random ensemble of diverse trees can significantly improve the performance over a single tree, which is verified by our experiments.
3. Coupled Encoding Based Descriptor
In this section, we introduce our coupled information-theoretic encoding (CITE) based descriptor. With a CITP tree, a photo or a sketch can be converted into an image of discrete codes. The CITE descriptor is a collection of region-based histograms of the "code" image. The pipeline of photo-sketch recognition using a single CITP tree is shown in Fig. 3. The details are given as follows.
Preprocessing. The same geometric rectification and photometric rectification are applied to all the photos and sketches. With an affine transform, the images are cropped to 80 × 64, so that the two eye centers and the mouth center of all the face images are at fixed positions. Then both the photo and sketch images are processed with a Difference-of-Gaussians (DoG) filter [11] to remove both high-frequency and low-frequency illumination variations. Empirical investigation shows that (σ_1, σ_2) = (1, 2) works best in our experiments.
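A minimal sketch of the DoG filtering under these settings (the exact implementation used in the paper is not specified):

```python
from scipy.ndimage import gaussian_filter

def dog_filter(image, sigma1=1.0, sigma2=2.0):
    """Difference-of-Gaussians band-pass filtering of a grayscale face image.

    Subtracting two Gaussian blurs suppresses high-frequency noise (via
    sigma1) and low-frequency illumination variation (via sigma2).
    """
    return gaussian_filter(image, sigma1) - gaussian_filter(image, sigma2)
```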
Sampling and normalization. At each pixel, its neighboring pixels are sampled in a certain pattern to form a vector. A sampling pattern is a combination of one or several rings and the pixel itself. On a ring with radius r, 8r pixels are sampled evenly. Fig. 3 shows the sampling pattern for r = 2. We denote a CITE descriptor with a sampling pattern of rings of radius r_1, ..., r_s as CITE_{r_1,...,r_s}.
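A sketch of the sampling geometry (off-grid ring points would need interpolation, e.g. bilinear, which is our assumption):

```python
import numpy as np

def ring_offsets(radii):
    """Sampling offsets for a CITE pattern: the center pixel plus, for each
    ring of radius r, 8r points spaced evenly around the circle."""
    offsets = [(0.0, 0.0)]                       # the pixel itself
    for r in radii:
        for k in range(8 * r):
            theta = 2 * np.pi * k / (8 * r)
            offsets.append((r * np.cos(theta), r * np.sin(theta)))
    return np.array(offsets)

# e.g. ring_offsets([2]) gives 1 + 16 sampling positions (the r = 2 pattern).
```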
We find that sampling the gradients I_u and I_v yields a better descriptor than sampling the intensities [5]. The gradient domain explicitly reflects relationships between neighboring pixels, and therefore has more discriminating power for uncovering key facial features than the intensity domain. In addition, the similarity between photos and sketches is easier to compare in the gradient domain than in the intensity domain [35]. After sampling, each sampled vector is normalized to unit L2-norm.
Coupled Information-Theoretic Encoding. In the encoding step, the sampled vectors are turned into discrete codes using the proposed CITP tree (Section 2). Each pixel then has a code, and the input image is converted into a "code" image. The vectors sampled from photos and sketches for training the CITP tree are paired according to facial landmarks detected by a state-of-the-art alignment algorithm [16]. (According to our observation, a general face alignment algorithm trained on commonly used face photo data sets is also effective for sketch alignment; we did not separately train a face alignment algorithm for sketches.) Specifically, a pixel in the sketch image finds its counterpart in the photo image using a simple warping based on the landmarks. Note that the pairing is performed after sampling, so that local structures are not deformed by the warping.
CITE Descriptor. The image is divided into 7 × 5 local regions of equal size, and a histogram of the codes is computed in each region. The local histograms are then concatenated to form a histogram representation of the image, i.e. the CITE descriptor.
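A sketch of this pooling step (the per-region L1 normalization is our assumption; the code image is assumed to hold integer codes):

```python
import numpy as np

def cite_descriptor(code_image, n_codes, grid=(7, 5)):
    """Concatenate per-region code histograms of a coded face image."""
    H, W = code_image.shape
    rows, cols = grid
    hists = []
    for i in range(rows):
        for j in range(cols):
            region = code_image[i * H // rows:(i + 1) * H // rows,
                                j * W // cols:(j + 1) * W // cols]
            hist = np.bincount(region.ravel(), minlength=n_codes)
            hists.append(hist / max(hist.sum(), 1))  # per-region normalization
    return np.concatenate(hists)
```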
References (partial)
[7] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley.
[10] H. Hotelling. Relations between two sets of variates. Biometrika, 1936.
[19] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[24] J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. ICCV, 2003.
P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman. Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. IEEE TPAMI, 1997.