Learning View-Model Joint Relevance
for 3D Object Retrieval
Ke Lu, Member, IEEE, Ning He, Jian Xue, Jiyang Dong, and Ling Shao, Senior Member, IEEE
Abstract: 3D object retrieval has attracted extensive research efforts and become an important task in recent years. It is noted that how to measure the relevance between 3D objects is still a difficult issue. Most of the existing methods employ just the model-based or view-based approaches, which may lead to incomplete information for 3D object representation. In this paper, we propose to jointly learn the view-model relevance among 3D objects for retrieval, in which the 3D objects are formulated in different graph structures. With the view information, the multiple views of 3D objects are employed to formulate the 3D object relationship in an object hypergraph structure. With the model data, the model-based features are extracted to construct an object graph to describe the relationship among the 3D objects. Learning on the two graphs is conducted to estimate the relevance among the 3D objects, in which the view/model graph weights can also be optimized in the learning process. This is the first work to jointly explore the view-based and model-based relevance among 3D objects in a graph-based framework. The proposed method has been evaluated on three datasets. The experimental results and the comparison with the state-of-the-art methods demonstrate the effectiveness, in terms of retrieval accuracy, of the proposed 3D object retrieval method.

Index Terms: 3D object retrieval, view information, model data, joint learning.
I. INTRODUCTION
3D objects have been widely applied in many diverse applications [1]-[3], e.g., computer graphics, the medical industry, and virtual reality, due to the fast advances in graphics hardware, computing techniques and networks. Large-scale databases of 3D objects are growing rapidly, which leads to a high demand for effective and efficient 3D object retrieval algorithms.

Manuscript received July 28, 2014; revised November 10, 2014; accepted January 19, 2015. Date of publication January 28, 2015; date of current version March 6, 2015. This work was supported in part by the National Natural Science Foundation of China under Grant 61271435, Grant 61370138, and Grant U1301251, in part by the Beijing Natural Science Foundation under Grant 4152017 and Grant 4141003, in part by the Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions under Grant IDHT20130225, in part by the National Program on Key Basic Research Project (973 Programs) under Grant 2011CB706901-4, and in part by the Instrument Developing Project of the Chinese Academy of Sciences under Grant YZ201321. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Marios S. Pattichis.

K. Lu is with the University of Chinese Academy of Sciences, Beijing 100049, China, and also with the Beijing Center for Mathematics and Information Interdisciplinary Sciences, Beijing 100190, China (e-mail: luk@ucas.ac.cn).
N. He is with Beijing Union University, Beijing 100191, China (e-mail: xxthening@buu.edu.cn).
J. Xue and J. Dong are with the University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: xuejian@ucas.ac.cn; dongjiyang12@mails.ucas.ac.cn).
L. Shao is with the Department of Computer Science and Digital Technologies, Northumbria University, Newcastle upon Tyne, NE1 8ST, U.K. (e-mail: ling.shao@ieee.org).

Fig. 1. Example views of two 3D objects.
Recently, extensive research efforts have been dedicated to 3D object retrieval technologies [4]-[7]. Existing 3D object retrieval approaches can be broadly divided into two paradigms, i.e., model-based methods and view-based methods.
In model-based methods [8]-[10], 3D objects are described by model-based features, such as low-level features (e.g., the volumetric descriptor [11], the surface distribution [9] and surface geometry [8], [12], [13]) or high-level features, e.g., the method in [14]. In [14], both visual and geometric characteristics are taken into consideration, and a high-level semantic space mapping from the low-level features is further learned with user relevance feedback; this semantic space is another Euclidean space and can be regarded as a dimension reduction or feature selection method. One advantage of model-based methods is that they can preserve the global spatial information of 3D objects. Although model-based methods are effective, they explicitly require 3D model information, which limits their applications: the 3D model information is not always available, especially in some practical applications.
In view-based methods [15]-[17], 3D objects are represented by a group of images captured from different directions. Depending on the method, these views may be captured with a static camera array or without such a camera array constraint. In view-based methods, the matching between two 3D objects is accomplished via multiple-view matching. Figure 1 shows some examples of multiple views of 3D objects. The view-based methods benefit from existing image processing/matching technologies. These methods make 3D object retrieval more flexible because they do not require explicit 3D model information. Existing works [18] also show that view-based methods can be highly discriminative for 3D objects and can provide better retrieval performance than model-based methods [3], [19].

Fig. 2. The framework of the proposed method.

Compared with model-based methods, one disadvantage of view-based methods is that, when the camera array information is not available, it is difficult to describe the spatial relationship among the different views.
One typical scenario in which 3D model information is not available is searching for objects in the real world. For example, when a tourist finds something interesting and wants to find similar objects in a dataset, it is hard to obtain the model information; the tourist can only take several pictures. In this case, model-based methods cannot work and only image-based methods can be applied. For model-based methods, computer-aided design (CAD) is a very important application area. Other areas where model-based methods work well are entertainment, such as 3D TV and games, and the medical field, such as tele-medical treatment and diagnosis. It is noted that visual information has recently become more important in these applications. Both the model information and the view-based information can bring in useful perspectives, which can further improve the performance.
It is noted that most existing methods separate the model-based methods and the view-based methods, and employ either model information or view features for 3D object retrieval. In this work, we propose to jointly employ both the model and the view information for 3D object relevance estimation. In the view part, representative views are first selected for each object, and then the view-level distances are calculated. Following the method in [20], an object hypergraph is constructed using the view star expansion. In the model part, the spatial structure circular descriptor [21] is extracted and a simple graph is generated using the pairwise object distances. In this way, the view information and the model data can be formulated in two graph structures. Learning on the two graphs is conducted to estimate the relevance among 3D objects, in which the graph weights can also be optimized. Figure 2 demonstrates the schematic framework of the proposed approach. Evaluation on three datasets has shown superior 3D object retrieval accuracy compared with the state-of-the-art methods.
The rest of the paper is organized as follows. Related work
on 3D object retrieval is reviewed in Section II. The proposed
method is provided in Section III. Experiments and discussion
are given in Section IV. We conclude the paper in Section V.
II. RELATED WORK
In this section, we briefly review existing methods for 3D object retrieval. To represent 3D objects, low-level features, such as the volumetric descriptor [11] and surface geometry [8], [12], and high-level features, such as the method in [14], have been employed in previous works.
For model-based 3D object retrieval, the shape descriptor plays an important role in 3D object representation. According to [22], 3D shape descriptors can be divided into four categories, i.e., histogram-based methods [9], [23]-[25], transform-based methods [26]-[29], graph-based methods [30]-[32] and view-based methods [21], [33], [34].
In histogram-based methods, a histogram-like feature is extracted from the 3D model to collect numerical values of certain attributes. Typical histogram-based descriptors include the shape distribution [9], the generalized shape distribution [23], the extended Gaussian image [24] and the 3D Hough transform [25]. In transform-based methods, transform coefficients are employed as the 3D shape descriptor, such as the 3D Fourier transform [26], the spherical trace transform [27], the radialized spherical extent function [28], and the concrete radialized spherical projection [29]. Graph-based methods aim to represent 3D objects by graph structures, and the comparison between 3D objects turns into the matching of two graphs. Some typical graph-based methods include Reeb graphs [30], [31] and skeletal graphs [32].
Given the 3D model, the spatial structure circular descriptor (SSCD) was introduced in [21]; it projects the model information onto a circular region to preserve the global spatial information of the 3D model. In this method, the histogram of each SSCD view is calculated to measure the distance between two 3D objects. A panoramic view, named PANORAMA, was employed in [35] for 3D model representation. In PANORAMA, the panoramic view is generated by projecting the model onto the lateral surface of a cylinder, and the distance between two models is calculated by matching their two PANORAMA images. Leng et al. [14] employed both the DBuffer descriptor [36], which contains six depth buffer images from the front, lateral and vertical views, and GEDT coefficients [37] as the descriptors. These two descriptors are combined into the 982-dimensional TUGE descriptor. With user feedback, these low-level features are mapped to a high-level semantic space, which is another Euclidean space and can be regarded as a dimension reduction or feature selection method. A bipartite graph learning method is introduced in [38], where the comparison between two groups of multiple views is formulated as a bipartite graph. A learning-based method for bipartite graph matching is proposed in [39].

In view-based 3D object retrieval methods, how to generate multiple views is an important issue. Some existing methods employ predefined camera arrays to capture views, while other works have no such constraint. The Light Field Descriptor (LFD) [33] was the first view-based 3D object retrieval method. In LFD, each 3D object is represented by several groups of representative views. Each group contains 10 views, and Zernike moments and Fourier descriptors are employed as the view features. The minimal distance between two groups of views from two compared 3D objects is employed as the pairwise object distance. Different from LFD, the Elevation Descriptor (ED) [34] employs six range views from different directions of the 3D object. A depth histogram is extracted to describe the EDs, and the matching between two groups of EDs is conducted to calculate the distance between two 3D objects. In the Compact Multi-View Descriptor (CMVD) [18], 18 views are captured from the 18 vertices of a 32-hedron. Seven characteristic views are generated in [40] from different directions. In the camera constraint-free view method (CCFV) [41], a set of representative views is selected from the originally captured multiple views via view clustering, and a probabilistic matching method is then employed to calculate the similarity between each two 3D objects.
Some other methods first generate a large set of raw views, and then select representative views from this view pool. One typical method is Adaptive Views Clustering (AVC) [15]. In AVC, 320 initial views are first captured, and representative views, generally about 20 to 40, are selected from these raw views. The comparison between 3D objects is formulated as a probabilistic approach measuring the posterior probability of the target object given the query. In [41], a positive matching model and a negative matching model were used to measure the relevance between a target object and the query. This was the first attempt to explore the relevance of one candidate object with both positive and negative samples, and the evaluation has shown satisfactory performance. In [42], the curvature scale space was employed as the view descriptor, which was further combined with Zernike moments to measure the distance between two 3D models. In the depth gradient image (DGI) model [43], both the surface and the contour information are synthesized, which avoids restrictions concerning the layout and visibility of the models in the scene.
Distance estimation between two groups of views is an important problem in view-based 3D object retrieval. Gao et al. [44] proposed a learning-based Hausdorff distance for 3D object retrieval. In this method, a Mahalanobis distance metric is learned for the view-level distance measure, which can be further used in the object-level Hausdorff distance calculation. This method addresses the challenge that the labels are at the object level while the distance metric is at the view level. To estimate the relevance among 3D objects, semi-supervised learning has been investigated in recent years. In [20], a hypergraph structure was employed to formulate the relationship among 3D objects. In this method, view clustering is conducted to generate the hyperedges, which are used to connect the 3D objects. Based on different view clustering results, multiple hypergraphs can be constructed, and learning is conducted on the hypergraph to estimate the relevance among 3D objects. This method extends existing view-based 3D object retrieval methods to a semi-supervised learning approach, which has been justified as the state of the art. A Gaussian mixture model (GMM) was used in [45] to formulate the distribution of the multiple views of 3D objects. In this method, the KL divergence is employed to measure the distance between two 3D objects.
III. LEARNING VIEW-MODEL JOINT RELEVANCE FOR 3D OBJECT RETRIEVAL
In this section, we introduce the view-model joint relevance learning method for 3D object retrieval. This method explores both the view information and the model data of 3D objects. The proposed method is composed of three key components, as shown in Figure 2. Given the view information of 3D objects, the proposed method first constructs a hypergraph to formulate the relationship among 3D objects through their view connections. Then, with the model data, a spatial structure circular descriptor is extracted from each 3D model, and the distance between each two 3D models is used to generate a simple graph that explores the relationship among the 3D models. Finally, learning on the joint view-model graphs is conducted to estimate the relevance among 3D objects.
A. View-Based Hypergraph Generation
Here the view-based hypergraph is generated following the method in [20], briefly introduced as follows. Let $O = \{O_1, O_2, \ldots, O_n\}$ denote the $n$ 3D objects in the dataset, and let $V_i = \{v_{i1}, v_{i2}, \ldots, v_{in_i}\}$ denote the $n_i$ views of the $i$-th 3D object $O_i$. In this part, we aim to explore the relevance among 3D objects with the multiple view information.
Generally, although multiple views can represent rich information about 3D objects, they also bring in redundant data, which may incur a high computational cost and even lead to false results. Here we first select representative views for each 3D object, and only these representative views are employed in the 3D object retrieval process.
Given the $n_i$ views $V_i = \{v_{i1}, v_{i2}, \ldots, v_{in_i}\}$ of $O_i$, we conduct hierarchical agglomerative clustering (HAC) [46] to group these views into view clusters. The HAC method is selected because it guarantees that the intracluster distance between each pair of views does not exceed a given threshold. Here the widely employed Zernike moments [47] are used as the view features; they are robust to image rotation, scaling and translation, and have been used in many 3D object retrieval tasks [15], [20], [33], [48]. The 49-D Zernike moments are extracted from each view of the 3D objects. With the view clustering results, one representative view is selected from each view cluster. Here we let $V_i = \{v_{i1}, v_{i2}, \ldots, v_{im_i}\}$ denote the $m_i$ representative views of $O_i$. In our experiments, $m_i$ mostly ranges from 5 to 20.
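To make the view-selection step concrete, the following is a minimal Python sketch of HAC-based representative view selection, not the authors' implementation. The complete-linkage criterion, the medoid choice of representative, the threshold value, and the hypothetical `view_features` array (one 49-D Zernike feature row per view) are our assumptions; the paper only requires that intracluster pairwise distances stay below a threshold.

```python
# Sketch: representative-view selection via HAC (Section III-A).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

def select_representative_views(view_features, threshold):
    """Cluster views by HAC and return one representative index per cluster."""
    dists = pdist(view_features, metric="euclidean")
    # Complete linkage cut at `threshold` keeps every intra-cluster
    # pairwise distance below the threshold, matching the HAC property
    # cited in the text.
    labels = fcluster(linkage(dists, method="complete"),
                      t=threshold, criterion="distance")
    sq = squareform(dists)
    reps = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        # Pick the medoid: the view closest to all others in its cluster
        # (one reasonable choice of "representative"; an assumption here).
        reps.append(idx[np.argmin(sq[np.ix_(idx, idx)].sum(axis=1))])
    return sorted(reps)
```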
Hypergraphs have been used in many multimedia information retrieval tasks, such as image retrieval [49], [50], and have shown their superiority in representing high-order information. In our work, we propose to employ star expansion to construct an object hypergraph with views to formulate the relationship among 3D objects. Here we denote the object hypergraph as $G_H = (V_H, E_H, W_H)$. For the $n$ objects in the dataset, there are $n$ vertices in $G_H$, where each vertex represents one 3D object.

The hyperedges are generated as follows. We assume there are in total $n_r$ representative views over all $n$ objects. We first calculate the Zernike moments-based distance between each two views, so that the top $K$ closest views can be found for each representative view. For each representative view, one hyperedge is constructed, which connects the objects owning views among the top $K$ closest views. In our experiments, $K$ is set as 10. Figure 3 shows an example of hyperedge generation. In total, $n_r$ hyperedges are generated for $G_H$. The weight of a hyperedge $e_H$ can be calculated by

$$w(e_H) = \frac{1}{K} \sum_{v_x} \exp\left( -\frac{d(v_x, v_c)^2}{\sigma_H^2} \right) \qquad (1)$$

where $v_c$ is the central view of the hyperedge, $v_x$ is one of the top $K$ closest views to $v_c$, $d(v_x, v_c)$ is the distance between $v_c$ and $v_x$, and $\sigma_H$ is empirically set as the median of all view pair distances.

Fig. 3. An illustration of hyperedge construction. In this figure, there are seven objects with representative views. One view from O_4 is selected as the central view, and its four closest views are located in the figure, which are from O_1, O_3, O_6 and O_7. Then the corresponding hyperedge connects O_1, O_3, O_4, O_6 and O_7.

Given the object hypergraph $G_H = (V_H, E_H, W_H)$, the incidence matrix $H$ can be generated by

$$h(v_H, e_H) = \begin{cases} 1, & \text{if } v_H \in e_H \\ 0, & \text{if } v_H \notin e_H \end{cases} \qquad (2)$$

The vertex degree of $v_H$ can be defined as

$$\rho(v_H) = \sum_{e_H \in E_H} \omega(e_H)\, h(v_H, e_H). \qquad (3)$$

The edge degree of $e_H$ can be defined as

$$\rho(e_H) = \sum_{v_H \in V_H} h(v_H, e_H). \qquad (4)$$

The vertex degree matrix and the edge degree matrix can be denoted by two diagonal matrices $D_v$ and $D_e$. In the constructed hypergraph, when two 3D objects share more similar views, they are connected by more hyperedges with high weights, which indicates a high correlation between these 3D objects.
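The star-expansion construction of Eqs. (1)-(4) can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: `rep_features` is a hypothetical (n_r x 49) Zernike feature array over all representative views, `owner` maps each representative view to the index of the object it belongs to, and the dense distance computation is for clarity only.

```python
# Sketch: object hypergraph via star expansion, Eqs. (1)-(4).
import numpy as np

def build_hypergraph(rep_features, owner, n_objects, K=10):
    n_r = len(rep_features)
    D = np.linalg.norm(rep_features[:, None] - rep_features[None, :], axis=2)
    sigma_H = np.median(D[np.triu_indices(n_r, k=1)])  # median view-pair distance
    H = np.zeros((n_objects, n_r))       # incidence matrix, Eq. (2)
    w = np.zeros(n_r)                    # hyperedge weights, Eq. (1)
    for c in range(n_r):                 # one hyperedge per representative view
        nn = np.argsort(D[c])[1:K + 1]   # top-K closest views to the central view
        w[c] = np.mean(np.exp(-D[c, nn] ** 2 / sigma_H ** 2))
        H[owner[nn], c] = 1              # connect the objects owning these views
        H[owner[c], c] = 1               # plus the central view's own object (cf. Fig. 3)
    d_v = H @ w                          # vertex degrees, Eq. (3)
    d_e = H.sum(axis=0)                  # edge degrees, Eq. (4)
    return H, w, d_v, d_e
```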
B. Model-Based Graph Generation

Given the model data of 3D objects, here we further explore the model-based object relationship. The spatial structure circular descriptor (SSCD) [21] is employed as the model feature. SSCD represents the depth information of the model surface on the projection minimal bounding box of the 3D model, and a depth histogram is generated as the feature for the 3D model. Following [21], bipartite graph matching is conducted to measure the distance between each two 3D models, i.e., $d_{SSCD}(O_i, O_j)$.

Here, the relationship among 3D objects is formulated in a simple object graph structure $G = (V, E, W)$. Each vertex in $G$ represents one 3D object, i.e., there are $n$ vertices in $G$. The weight of an edge $e(i, j)$ in $G$ is calculated using the similarity between the two corresponding 3D objects $O_i$ and $O_j$ as

$$W(v_i, v_j) = \exp\left( -\frac{d_{SSCD}(O_i, O_j)^2}{\sigma_s^2} \right) \qquad (5)$$

where $d_{SSCD}(O_i, O_j)$ is the distance between $O_i$ and $O_j$, and $\sigma_s$ is set as the median of all model pair distances.
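A minimal sketch of Eq. (5), assuming the pairwise SSCD distances have already been computed by bipartite matching (the zeroed diagonal, i.e., no self-loops, is a common convention we adopt here, not something the paper specifies):

```python
# Sketch: model-based object graph, Eq. (5).
import numpy as np

def build_model_graph(d_sscd):
    """d_sscd: hypothetical symmetric (n x n) matrix of SSCD distances."""
    n = d_sscd.shape[0]
    sigma_s = np.median(d_sscd[np.triu_indices(n, k=1)])  # median model-pair distance
    W = np.exp(-d_sscd ** 2 / sigma_s ** 2)               # edge weights, Eq. (5)
    np.fill_diagonal(W, 0.0)                              # no self-loops (assumption)
    return W
```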
C. Learning on the Joint Graphs

Now we have two formulations of the relationship among 3D objects, i.e., view-based and model-based. Here these two formulations are jointly explored to estimate the relevance among 3D objects. In this part, we first introduce the learning framework in which the view-based and the model-based information are regarded with equal weight, and then we propose a joint learning framework to learn the optimal combination weight for each modality.

1) The Initial Learning Framework: Here we start from the learning framework which regards the different modalities, i.e., model and view, as equal. The 3D object retrieval task can be formulated as a one-class classification problem, as shown in [51]. The main objective is to learn the optimal pairwise object relevance under both the graph and the hypergraph structure. Given the initial labeled data (the query object in our case), an empirical loss term can be added as a constraint for the learning process. The transductive inference can be formulated as a regularization:

$$\arg\min_{f} \left\{ K_V(f) + K_M(f) + \mu R(f) \right\} \qquad (6)$$

In this formulation, $f$ is the to-be-learned relevance vector, $K_V(f)$ is the regularizer term on the view-based hypergraph structure, $K_M(f)$ is the regularizer term on the model-based graph structure, and $R(f)$ is the empirical loss. This objective function aims to minimize the empirical loss and the regularizers on the model-based graph and the view-based hypergraph simultaneously, which leads to the optimal relevance vector $f$ for retrieval. The two regularizers and the empirical loss term are defined as follows.

The view-based hypergraph regularizer $K_V(f)$ is defined as

$$K_V(f) = \frac{1}{2} \sum_{e_H \in E_H} \sum_{u,v \in V_H} \frac{w_H(e_H)\, h(u, e_H)\, h(v, e_H)}{\rho(e_H)} \left( \frac{f(u)}{\sqrt{\rho(u)}} - \frac{f(v)}{\sqrt{\rho(v)}} \right)^{2} = f^{T} (I - \Theta_H)\, f, \qquad (7)$$

where $\Theta_H$ is defined as $\Theta_H = D_v^{-1/2} H W_H D_e^{-1} H^{T} D_v^{-1/2}$. Here we denote $\Delta_H = I - \Theta_H$, so $K_V(f)$ can be written as

$$K_V(f) = f^{T} \Delta_H f. \qquad (8)$$

The model-based graph regularizer $K_M(f)$ is defined as

$$K_M(f) = \frac{1}{2} \sum_{u,v \in V} w(e(u,v)) \left( \frac{f(u)}{\sqrt{d(u)}} - \frac{f(v)}{\sqrt{d(v)}} \right)^{2} = f^{T} (I - \Theta_S)\, f, \qquad (9)$$

where $\Theta_S = D^{-1/2} W D^{-1/2}$. Here we denote $\Delta_S = I - \Theta_S$, so $K_M(f)$ can be written as

$$K_M(f) = f^{T} \Delta_S f. \qquad (10)$$

The empirical loss term $R(f)$ is defined as

$$R(f) = \| f - y \|^{2}, \qquad (11)$$

where $y$ is the initial label vector. In the retrieval process, it is defined as an $n \times 1$ vector in which only the query is set as 1 and all other components are set as 0.

Now the objective function can be rewritten as

$$\arg\min_{f} \left\{ f^{T} \Delta_H f + f^{T} \Delta_S f + \mu \| f - y \|^{2} \right\}, \qquad (12)$$

and $f$ can be solved by

$$f = \left( I + \frac{1}{\mu} (\Delta_H + \Delta_S) \right)^{-1} y. \qquad (13)$$

Here $f$ is the relevance of all the objects in the dataset with respect to the query object. A large relevance value indicates high similarity between the object and the query: the higher the corresponding relevance value, the more similar the two objects. With the generated object relevance $f$, all the objects in the dataset can be sorted in descending order according to $f$.
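Putting Eqs. (7)-(13) together, the equal-weight relevance vector has a direct closed-form solve. The sketch below builds the two normalized Laplacians and solves the linear system; it is illustrative only, reusing the outputs of the earlier sketches, with `mu` left as a free parameter.

```python
# Sketch: equal-weight transductive solve, Eqs. (7)-(13).
import numpy as np

def normalized_laplacians(H, w, W):
    d_v, d_e = H @ w, H.sum(axis=0)
    Dv_is = np.diag(1.0 / np.sqrt(d_v))
    theta_H = Dv_is @ H @ np.diag(w) @ np.diag(1.0 / d_e) @ H.T @ Dv_is
    delta_H = np.eye(H.shape[0]) - theta_H            # Eq. (8)
    D_is = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    delta_S = np.eye(W.shape[0]) - D_is @ W @ D_is    # Eq. (10)
    return delta_H, delta_S

def solve_relevance(delta_H, delta_S, query_idx, n, mu=1.0):
    y = np.zeros(n)
    y[query_idx] = 1.0                                # label vector of Eq. (11)
    f = np.linalg.solve(np.eye(n) + (delta_H + delta_S) / mu, y)  # Eq. (13)
    return np.argsort(-f), f                          # descending relevance ranking
```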
2) Learning the Combination Weights: We note that the view information and the model information may not have the same impact on 3D object representation. In some scenarios the view information may be more important, while in other cases the model data may play the more important role. Under such circumstances, we further learn the optimal weights for the view information and the model data. In this part, we introduce the learning framework embedding the combination weight learning. The objective for the learning process is composed of three parts, i.e., the graph/hypergraph structure regularizers, the empirical loss, and the combination weight regularizer.

Here we let $\alpha$ and $\beta$ denote the combination weights for the view-based and the model-based information respectively, where $\alpha + \beta = 1$. After adding the $l_2$ norm on the combination weights, the objective function can be further revised as

$$\arg\min_{f, \alpha, \beta} \left\{ \alpha f^{T} \Delta_H f + \beta f^{T} \Delta_S f + \mu \| f - y \|^{2} + \eta (\alpha^{2} + \beta^{2}) \right\}, \qquad (14)$$

where $\alpha + \beta = 1$.

The solution of the above optimization task is provided as follows. To solve the above objective function, we alternately optimize $f$ and $\alpha/\beta$. We first fix $\alpha$ and $\beta$, and optimize $f$. Now the objective function changes to

$$\arg\min_{f} \left\{ \alpha f^{T} \Delta_H f + \beta f^{T} \Delta_S f + \mu \| f - y \|^{2} \right\}. \qquad (15)$$

Following Eq. (13), it can be solved by

$$f = \left( I + \frac{1}{\mu} (\alpha \Delta_H + \beta \Delta_S) \right)^{-1} y. \qquad (16)$$

Then we optimize $\alpha/\beta$ with fixed $f$. Here we employ the Lagrangian method, and the objective function changes to

$$\arg\min_{\alpha, \beta} \left\{ \alpha f^{T} \Delta_H f + \beta f^{T} \Delta_S f + \eta (\alpha^{2} + \beta^{2}) + \xi (\alpha + \beta - 1) \right\}. \qquad (17)$$

Solving the above optimization problem, we can obtain

$$\xi = -\frac{f^{T} \Delta_H f + f^{T} \Delta_S f}{2} - \eta, \qquad (18)$$

$$\alpha = \frac{1}{2} + \frac{f^{T} \Delta_S f - f^{T} \Delta_H f}{4 \eta}, \qquad (19)$$

and

$$\beta = \frac{1}{2} + \frac{f^{T} \Delta_H f - f^{T} \Delta_S f}{4 \eta}. \qquad (20)$$

The above alternating optimization is repeated until the optimal $f$ is achieved, which can then be used for 3D object retrieval. With the learned combination weights, the model-based and the view-based data can be optimally explored simultaneously and the relevance vector $f$ can be obtained. The main merit of the proposed method is that it jointly explores the view information and the model data of 3D objects in a hypergraph/graph framework for 3D object retrieval.
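The alternating optimization of Eqs. (14)-(20) can be sketched as follows. The fixed iteration count and the clipping safeguard on the weights are our additions (the paper iterates until the optimal f is achieved); `mu` and `eta` are free parameters.

```python
# Sketch: alternating optimization of f and (alpha, beta), Eqs. (14)-(20).
import numpy as np

def joint_learning(delta_H, delta_S, query_idx, n, mu=1.0, eta=1.0, iters=10):
    y = np.zeros(n)
    y[query_idx] = 1.0
    alpha = beta = 0.5                                 # start from equal weights
    for _ in range(iters):
        A = np.eye(n) + (alpha * delta_H + beta * delta_S) / mu
        f = np.linalg.solve(A, y)                      # Eq. (16)
        rH, rS = f @ delta_H @ f, f @ delta_S @ f      # regularizer values
        alpha = 0.5 + (rS - rH) / (4 * eta)            # Eq. (19)
        beta = 0.5 + (rH - rS) / (4 * eta)             # Eq. (20)
        # Safeguard (our addition): keep the weights in [0, 1] when eta is small.
        alpha = float(np.clip(alpha, 0.0, 1.0))
        beta = 1.0 - alpha
    return f, alpha, beta
```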

References

M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in Proc. KDD Workshop on Text Mining, 2000.

A. E. Johnson and M. Hebert, "Using spin images for efficient object recognition in cluttered 3D scenes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 5, 1999.

M. Hilaga, Y. Shinagawa, T. Kohmura, and T. L. Kunii, "Topology matching for fully automatic similarity estimation of 3D shapes," in Proc. ACM SIGGRAPH, 2001.

A. Khotanzad and Y. H. Hong, "Invariant image recognition by Zernike moments," IEEE Trans. Pattern Anal. Mach. Intell., vol. 12, no. 5, 1990.

R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin, "Shape distributions," ACM Trans. Graph., vol. 21, no. 4, 2002.