
SIMPLIcity: semantics-sensitive integrated matching for picture libraries

TL;DR: SIMPLIcity (semantics-sensitive integrated matching for picture libraries), an image retrieval system, which uses semantics classification methods, a wavelet-based approach for feature extraction, and integrated region matching based upon image segmentation to improve retrieval.
Abstract: We present here SIMPLIcity (semantics-sensitive integrated matching for picture libraries), an image retrieval system, which uses semantics classification methods, a wavelet-based approach for feature extraction, and integrated region matching based upon image segmentation. An image is represented by a set of regions, roughly corresponding to objects, which are characterized by color, texture, shape, and location. The system classifies images into semantic categories. Potentially, the categorization enhances retrieval by permitting semantically-adaptive searching methods and narrowing down the searching range in a database. A measure for the overall similarity between images is developed using a region-matching scheme that integrates properties of all the regions in the images. The application of SIMPLIcity to several databases has demonstrated that our system performs significantly better and faster than existing ones. The system is fairly robust to image alterations.

Summary (6 min read)

1 INTRODUCTION

  • WITH the steady growth of computer power, rapidly declining cost of storage, and ever-increasing access to the Internet, digital acquisition of information has become increasingly popular in recent years.
  • The automatic derivation of semantically-meaningful information from the content of an image is the focus of interest for most research on image databases.
  • The image "semantics," i.e., the meanings of an image, has several levels.
  • Content-based image retrieval (CBIR) is the set of techniques for retrieving semantically-relevant images from an image database based on automatically-derived image features.

1.3 Overview of the SIMPLIcity System

  • CBIR is a complex and challenging problem spanning diverse disciplines, including computer vision, color perception, image processing, image classification, statistical clustering, psychology, human-computer interaction (HCI), and specific application domain dependent criteria.
  • While the authors are not claiming to be able to solve all the problems related to CBIR, they have made some advances towards the final goal, close to human-level automatic image understanding and retrieval performance.
  • The authors discuss issues related to the design and implementation of a semantics-sensitive CBIR system for picture libraries.
  • An experimental system, the SIMPLIcity (Semantics-sensitive Integrated Matching for Picture LIbraries) system, has been developed to validate the methods.
  • The authors summarize the main contributions as follows.

1.3.1 Semantics-Sensitive Image Retrieval

  • The capability of existing CBIR systems is limited in large part by fixing a set of features used for retrieval.
  • The authors propose a semantics-sensitive approach to the problem of searching general-purpose image databases.
  • Semantic classification methods are used to categorize images so that semantically-adaptive searching methods applicable to each category can be applied.
  • Automatic classification methods can be used to categorize a general-purpose picture library into semantic classes including "graph," "photograph," "textured," "nontextured," "benign," "objectionable," "indoor," "outdoor," "city," "landscape," "with people," and "without people."
  • Automatic derivation of optimal features is a challenging and important issue in its own right.

1.3.2 Image Classification

  • For the purpose of searching picture libraries such as those on the Web or in a patient digital library, the authors are initially focusing on techniques to classify images into the classes "textured" versus "nontextured" and "graph" versus "photograph."
  • Several other classification methods have been previously developed elsewhere, including "city" versus "landscape" [26] and "with people" versus "without people" [1].
  • The authors report on several classification methods they have developed and their performance.

1.3.3 Integrated Region Matching (IRM) Similarity Measure

  • Besides using semantics classification, another strategy of SIMPLIcity to better capture the image semantics is to define a robust region-based similarity measure, the Integrated Region Matching (IRM) metric.
  • Image segmentation is an extremely difficult process and is still an open problem in computer vision.
  • Traditionally, region-based matching is performed on individual regions [2], [11].
  • The IRM metric the authors have developed has the following major advantages. First, compared with retrieval based on individual regions, the overall "soft similarity" approach in IRM reduces the adverse effect of inaccurate segmentation, an important property lacking in previous systems.
  • In many cases, knowing that one object usually appears with another helps to clarify the semantics of a particular region.

1.4 Outline of the Paper

  • The remainder of the paper is organized as follows:
  • The semantics-sensitive architecture is further introduced in Section 2.
  • The image segmentation algorithm is described in Section 3.
  • Classification methods are presented in Section 4.
  • In Section 6, experiments and results are described.

2 SEMANTICS-SENSITIVE ARCHITECTURE

  • The architecture of the SIMPLIcity retrieval system is presented in Fig. 1.
  • During indexing, the system partitions an image into 4 × 4 pixel blocks and extracts a feature vector for each block.
  • A statistical clustering [8] algorithm is then used to quickly segment the image into regions.
  • For an image in the database, its semantic type is first checked and then its signature is extracted from the corresponding database.
  • Once the signature of the query image is obtained, similarity scores between the query image and images in the database with the same semantic type are computed and sorted to provide the list of images that appear to have the closest semantics.
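To make the flow concrete, below is a minimal Python sketch of the query-time routing just described. All names here (signature_dbs, irm_distance, and so on) are illustrative stand-ins for the pipeline stages, not the authors' actual code.

```python
def run_query(query_signature, sem_type, signature_dbs, irm_distance):
    """Rank database images against a query, restricted to the query's
    semantic class. signature_dbs maps a semantic type to a dict of
    {image_id: signature}; irm_distance is the similarity measure.
    These names are illustrative, not the authors' actual API."""
    candidates = signature_dbs[sem_type]
    scores = [(image_id, irm_distance(query_signature, sig))
              for image_id, sig in candidates.items()]
    scores.sort(key=lambda pair: pair[1])  # smallest distance = closest semantics
    return scores
```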

3 THE IMAGE SEGMENTATION METHOD

  • The authors describe the image segmentation procedure based on the k-means algorithm [8] using color and spatial variation features.
  • A low D(k) indicates high purity in the clustering process.
  • The first derivative of the distortion with respect to k, D(k) - D(k-1), is below a threshold in comparison with the average derivative at k = 2, 3. A low D(k) - D(k-1) indicates convergence in the clustering process.
  • After a one-level wavelet transform, a 4 × 4 block is decomposed into four frequency bands, as shown in Fig. 2.
  • An image with vertical strips thus has high energy in the HL band and low energy in the LH band.

4 THE IMAGE CLASSIFICATION METHODS

  • The image classification methods described in this section have been developed mainly for searching picture libraries such as Web images.
  • The authors are initially interested in classifying images into the classes textured versus nontextured, graph versus photograph, and objectionable versus benign.
  • Karu et al. provided an overview of texture-related research [10].
  • Other classification methods such as city versus landscape [26] and with people versus without people [1] were developed elsewhere.

4.1 Textured versus Nontextured Classification

  • The authors describe the algorithm to classify images into the semantic classes textured or nontextured.
  • Fig. 4 shows some sample textured images.
  • The classification of an image as textured or nontextured is performed by thresholding the average χ² statistic over all m regions in the image, $\bar{\chi}^2 = \frac{1}{m}\sum_{i=1}^{m}\chi_i^2$ (see the sketch below).
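A sketch of this thresholding rule follows, assuming the per-region χ² statistics have already been computed; the threshold and the direction of the comparison are assumptions (regions in textured images scatter across the whole image, so a small average deviation from uniform scatter is taken to indicate texture).

```python
def classify_textured(region_chi2, threshold):
    """Textured vs. nontextured by thresholding the average chi-square
    statistic over all m regions. Assumption: each value measures a
    region's deviation from a uniform scatter over the image, so
    textured images yield small averages; flip the comparison if the
    statistic is defined the other way around."""
    avg = sum(region_chi2) / len(region_chi2)
    return "textured" if avg < threshold else "nontextured"
```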

4.2 Graph versus Photograph Classification

  • An image is a photograph if it is a continuous-tone image.
  • The authors have developed a graph-photograph classification method.
  • The classifier partitions an image into blocks and classifies every block into either of the two classes.
  • If the percentage of blocks classified as photograph is higher than a threshold, the image is marked as photograph; otherwise, graph.
  • The authors achieved 100 percent sensitivity for photographic images and higher than 95 percent specificity.
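The voting step reduces to a one-line ratio test, sketched below; the per-block decisions come from the wavelet-based block classifier, and the 0.5 cutoff is a placeholder, not the paper's value.

```python
def classify_graph_photograph(block_labels, photo_threshold=0.5):
    """Mark the image 'photograph' when the fraction of blocks labeled
    as photograph exceeds a threshold, otherwise 'graph'."""
    photo_ratio = sum(1 for b in block_labels if b == "photo") / len(block_labels)
    return "photograph" if photo_ratio > photo_threshold else "graph"
```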

5 THE IRM SIMILARITY MEASURE

  • The integrated region matching (IRM) measure of image similarity is described.
  • An advantage of the overall similarity measure is the robustness against poor segmentation (Fig. 6), an important property lacking in previous work [2], [11].
  • Every point in the space corresponds to the feature vector or the descriptor of a region.
  • Unlike a distance between two points, such as the Euclidean distance, it is not obvious how to define a distance between two sets of feature points.
  • The distance should be sufficiently consistent with a person's concept of the semantic "closeness" of two images.

5.1 Integrated Region Matching (IRM)

  • Every match between images is characterized by links between regions and their significance credits.
  • If a graph represents an admissible matching, the distance between the two region sets is the summation of all the weighted edge lengths, i.e., $d(R_1, R_2) = \sum_{i,j} s_{i,j} d_{i,j}$ (Eq. (4) in the paper; see the sketch below).
  • The SIMPLIcity system uses the area percentage scheme.
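The sketch below implements this weighted sum with area-percentage significance credits, filled by the greedy most-similar-pair-first allocation commonly described for IRM; region_dist stands in for the region-pair distance d(r, r') of Section 5.2, and the exact credit-assignment details of the authors' implementation may differ.

```python
def irm_distance(regions1, regions2, region_dist):
    """Integrated Region Matching sketch. regions1/regions2 are lists of
    (feature, area_fraction) pairs whose area fractions sum to 1 per
    image (the area-percentage scheme). Significance credits s_ij are
    filled greedily, most-similar pair first, so that every region
    participates in the overall match."""
    p = [area for _, area in regions1]   # remaining credit per region
    q = [area for _, area in regions2]
    pairs = sorted(
        ((region_dist(f1, f2), i, j)
         for i, (f1, _) in enumerate(regions1)
         for j, (f2, _) in enumerate(regions2)),
        key=lambda t: t[0])
    total = 0.0
    for d, i, j in pairs:
        s = min(p[i], q[j])              # credit this pair can absorb
        if s > 0:
            total += s * d               # weighted edge length
            p[i] -= s
            q[j] -= s
    return total                         # d(R1, R2) = sum_ij s_ij * d_ij
```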

5.2 Distance between Regions

  • Now, the authors discuss the definition of the distance between a region pair, d(r, r').
  • Denote the γth-order normalized inertia of spheres as L_γ.
  • If two regions match very well in shape, their color and texture distance is attenuated by a smaller weight to provide the final distance.

5.3 Characteristics of IRM

  • To study the characteristics of the IRM distance, the authors performed 100 random queries on their COREL photograph data set.
  • Based on the 5.6 million IRM distances obtained, the authors estimated the distribution of the IRM distance.
  • The authors may notify the user that two images are considered to be very close when the IRM distance between the two images is less than 15.
  • Likewise, the authors may advise the user that two images are considerably different when the IRM distance between the two images is greater than 50.
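These empirical cutoffs map directly to a user-facing hint, for example:

```python
def similarity_hint(irm_dist, close=15.0, different=50.0):
    """Translate an IRM distance into the hints described above;
    15 and 50 are the cutoffs estimated from the empirical distance
    distribution."""
    if irm_dist < close:
        return "very close"
    if irm_dist > different:
        return "considerably different"
    return "intermediate"
```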

6 EXPERIMENTS

  • The SIMPLIcity system has been implemented with a general-purpose image database including about 200,000 pictures, which are stored in JPEG format with size 384 × 256 or 256 × 384.
  • Two classification methods, graph-photograph and textured-nontextured, have been used in their experiments.
  • WBIIS had been compared with the original IBM QBIC system and found to perform better [28].
  • It is difficult to design a fair comparison with existing region-based searching algorithms such as the Blobworld system and the NeTra system, which depend on additional information provided by the user during the query process.
  • A list of online image retrieval demonstration Web sites can be found on their site.

6.1 Accuracy

  • The authors evaluated the accuracy of the system in two ways.
  • First, the authors used a 200,000-image COREL database to compare with existing systems such as EMD-based color histogram and WBIIS.
  • Then, the authors designed systematic evaluation methods to judge the performance statistically.
  • The SIMPLIcity system has demonstrated much improved accuracy over the other systems.

6.2 Query Comparison

  • The authors compare the SIMPLIcity system with the WBIIS (Wavelet-Based Image Indexing and Searching) system [28] with the same image database.
  • Due to space limitations, the authors show only two rows of images with the top 11 matches to each query.
  • The authors chose the numbers "11" and "29" before viewing the results.
  • For each query, the authors decided the relevance to the query image before viewing the query results.
  • To view the images better or to see more matched images, users can visit the demonstration Web site and use the query image ID to repeat the retrieval.

6.3.1 Performance on Image Queries

  • To provide numerical results, the authors tested 27 sample images chosen randomly from nine categories, with three images from each category.
  • The categories of images tested are listed in Table 1a.
  • Images in the ªsports and public eventsº class contain people in a game or public event, such as a festival.
  • On average, the precision and the weighted precision of SIMPLIcity are higher than those of WBIIS by 0.227 and 0.273, respectively.

6.3.2 Performance on Image Categorization

  • The SIMPLIcity system was also evaluated based on a subset of the COREL database, formed by 10 image categories (shown in Table 1b), each containing 100 pictures.
  • The recall within the first 100 retrieved images is identical to the precision in this special case.
  • The authors used LUV color space and a matching metric similar to the EMD described in [18] to extract color histogram features and match in the categorized image database.
  • The authors call the one with less filled color bins the Color Histogram 1 system and the other the Color Histogram 2 system.
  • For this reason, the authors cannot evaluate this system using the COREL database of 200,000 images and the 27 sample query images described in the previous section.
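For the 10-category, 100-images-per-category test, retrieval accuracy reduces to the fraction of same-category images among the top 100 returned, as sketched below with illustrative names; because each query has exactly 100 relevant images, this precision equals the recall noted above.

```python
def precision_at_100(ranked_ids, category_of, query_category):
    """Fraction of the first 100 retrieved images that share the query's
    category. With 100 relevant images per category, recall within the
    top 100 equals this precision."""
    top = ranked_ids[:100]
    hits = sum(1 for image_id in top if category_of[image_id] == query_category)
    return hits / 100.0
```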

6.4.1 Speed

  • The algorithm has been implemented on a Pentium III 450 MHz PC running the Linux operating system.
  • On average, one second is needed to segment an image and to compute the features of all regions.
  • The speed is much faster than other region-based methods.
  • Fast indexing has provided us with the capability of handling external queries and sketch queries in real time.
  • If the query image is not already in the database, one extra second of CPU time is spent to extract the feature from the query image.


SIMPLIcity: Semantics-Sensitive Integrated Matching for Picture LIbraries
James Z. Wang, Member, IEEE, Jia Li, Member, IEEE, and Gio Wiederhold, Fellow, IEEE
Abstract: The need for efficient content-based image retrieval has increased tremendously in many application areas such as biomedicine, military, commerce, education, and Web image classification and searching. We present here SIMPLIcity (Semantics-sensitive Integrated Matching for Picture LIbraries), an image retrieval system, which uses semantics classification methods, a wavelet-based approach for feature extraction, and integrated region matching based upon image segmentation. As in other region-based retrieval systems, an image is represented by a set of regions, roughly corresponding to objects, which are characterized by color, texture, shape, and location. The system classifies images into semantic categories, such as textured-nontextured and graph-photograph. Potentially, the categorization enhances retrieval by permitting semantically-adaptive searching methods and narrowing down the searching range in a database. A measure for the overall similarity between images is developed using a region-matching scheme that integrates properties of all the regions in the images. Compared with retrieval based on individual regions, the overall similarity approach 1) reduces the adverse effect of inaccurate segmentation, 2) helps to clarify the semantics of a particular region, and 3) enables a simple querying interface for region-based image retrieval systems. The application of SIMPLIcity to several databases, including a database of about 200,000 general-purpose images, has demonstrated that our system performs significantly better and faster than existing ones. The system is fairly robust to image alterations.
Index Terms: Content-based image retrieval, image classification, image segmentation, integrated region matching, clustering, robustness.
1 INTRODUCTION

WITH the steady growth of computer power, rapidly declining cost of storage, and ever-increasing access to the Internet, digital acquisition of information has become increasingly popular in recent years. Effective indexing and searching of large-scale image databases remain as challenges for computer systems.
The automatic derivation of semantically-meaningful
information from the content of an image is the focus of
interest for most research on image databases. The image
"semantics," i.e., the meanings of an image, has several
levels. From the lowest to the highest, these levels can be
roughly categorized as
1. semantic types (e.g., landscape photograph, clip art),
2. object composition (e.g., a bike and a car parked on a
beach, a sunset scene),
3. abstract semantics (e.g., people fighting, happy
person, objectionable photograph), and
4. detailed semantics (e.g., a detailed description of a
given picture).
Content-based image retrieval (CBIR) is the set of techniques
for retrieving semantically-relevant images from an image
database based on automatically-derived image features.
1.1 Related Work in CBIR
CBIR for general-purpose image databases is a highly
challenging problem because of the large size of the
database, the difficulty of understanding images, both by
people and computers, the difficulty of formulating a query,
and the issue of evaluating results properly. A number of
general-purpose image search engines have been devel-
oped. We cannot survey all related work in the allocated
space. Instead, we try to emphasize some of the work that is
most related to our work. The references below are to be
taken as examples of related work, not as the complete list
of work in the cited area.
In the commercial domain, IBM QBIC [4] is one of the
earliest systems. Recently, additional systems have been
developed at IBM T.J. Watson [22], VIRAGE [7], NEC
AMORA [13], Bell Laboratory [14], and Interpix. In the
academic domain, MIT Photobook [15], [17], [12] is one of
the earliest. Berkeley Blobworld [2], Columbia VisualSEEK
and WebSEEK [21], CMU Informedia [23], UCSB NeTra
[11], UCSD [9], University of Maryland [16], Stanford EMD
[18], and Stanford WBIIS [28] are some of the recent
systems.
The common ground for CBIR systems is to extract a
signature for every image based on its pixel values and to
define a rule for comparing images. The signature serves as
an image representation in the "view" of a CBIR system.
The components of the signature are called features. One
advantage of a signature over the original pixel values is the
significant compression of image representation. However,
J.Z. Wang is with the School of Information Sciences and Technology and the Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16801. E-mail: wangz@cs.stanford.edu.
J. Li is with the Department of Statistics, The Pennsylvania State University, University Park, PA 16801. E-mail: jiali@stat.psu.edu.
G. Wiederhold is with the Department of Computer Science, Stanford University, Stanford, CA 94305. E-mail: gio@cs.stanford.edu.

a more important reason for using the signature is to gain
on improved correlation between image representation and
semantics. Actually, the main task of designing a signature
is to bridge the gap between image semantics and the pixel
representation, that is, to create a better correlation with
image semantics.
Existing general-purpose CBIR systems roughly fall into
three categories depending on the approach to extract
signatures: histogram, color layout, and region-based
search. We will briefly review the three methods in this
section. There are also systems that combine retrieval
results from individual algorithms by a weighted sum
matching metric [7], [4], or other merging schemes [19].
After extracting signatures, the next step is to determine a
comparison rule, including a querying scheme and the
definition of a similarity measure between images. For most
image retrieval systems, a query is specified by an image to
be matched. We refer to this as global search since similarity
is based on the overall properties of images. By contrast,
there are also "partial search" querying systems that retrieve
based on a particular region in an image [11], [2].
1.1.1 Histogram Search
Histogram search algorithms [4], [18] characterize an image
by its color distribution or histogram. Many distances have
been used to define the similarity of two color histogram
representations. Euclidean distance and its variations are
the most commonly used [4]. Rubner et al. of Stanford
University proposed the earth mover's distance (EMD) [18]
using linear programming for matching histograms.
The drawback of a global histogram representation is
that information about object location, shape, and texture
[10] is discarded. Color histogram search is sensitive to
intensity variation, color distortions, and cropping.
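As a generic illustration (not any particular system's code), the sketch below builds a quantized LUV histogram signature and compares two signatures with the Euclidean distance; an EMD-based system would replace the final comparison with a linear-programming transport cost.

```python
import numpy as np

def color_histogram(pixels_luv, bins=8):
    """Global color signature: count pixels in a bins^3 grid over LUV
    space and normalize. pixels_luv is an (N, 3) array of pixel values."""
    hist, _ = np.histogramdd(pixels_luv, bins=bins)
    return (hist / hist.sum()).ravel()

def histogram_distance(h1, h2):
    # Euclidean distance, the most commonly used comparison rule [4].
    return float(np.linalg.norm(h1 - h2))
```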
1.1.2 Color Layout Search
The "color layout" approach attempts to overcome the
drawback of histogram search. In simple color layout
indexing [4], images are partitioned into blocks and the
average color of each block is stored. Thus, the color layout
is essentially a low resolution representation of the original
image. A relatively recent system, WBIIS [28], uses
significant Daubechies' wavelet coefficients instead of
averaging. By adjusting block sizes or the levels of wavelet
transforms, the coarseness of a color layout representation
can be tuned. The finest color layout using a single pixel
block is the original pixel representation. Hence, we can
view a color layout representation as an opposite extreme of
a histogram. At proper resolutions, the color layout
representation naturally retains shape, location, and texture
information. However, as with pixel representation,
although information such as shape is preserved in the
color layout representation, the retrieval system cannot
perceive it directly. Color layout search is sensitive to
shifting, cropping, scaling, and rotation because images are
described by a set of local properties [28].
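A minimal sketch of the block-averaging layout signature described above (WBIIS would keep significant wavelet coefficients instead of plain block means):

```python
import numpy as np

def color_layout(image, grid=8):
    """Partition the image into grid x grid blocks and store each
    block's average color, i.e., a low-resolution version of the image.
    image is an (H, W, 3) array; edge pixels beyond an even multiple of
    the grid are ignored for simplicity."""
    h, w, _ = image.shape
    bh, bw = h // grid, w // grid
    sig = np.empty((grid, grid, 3))
    for i in range(grid):
        for j in range(grid):
            block = image[i*bh:(i+1)*bh, j*bw:(j+1)*bw]
            sig[i, j] = block.reshape(-1, 3).mean(axis=0)
    return sig
```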
The approach taken by the recent WALRUS system [14]
to reduce the shifting and scaling sensitivity for color layout
search is to exhaustively reproduce many subimages based
on an original image. The subimages are formed by sliding
windows of various sizes and a color layout signature is
computed for every subimage. The similarity between
images is then determined by comparing the signatures of
subimages. An obvious drawback of the system is the
sharply increased computational complexity and increase of
size of the search space due to exhaustive generation of
subimages. Furthermore, texture and shape information is
discarded in the signatures because every subimage is
partitioned into four blocks and only average colors of the
blocks are used as features. This system is also limited to
intensity-level image representations.
1.1.3 Region-Based Search
Region-based retrieval systems attempt to overcome the
deficiencies of color layout search by representing images at
the object-level. A region-based retrieval system applies
image segmentation [20], [27] to decompose an image into
regions, which correspond to objects if the decomposition is
ideal. The object-level representation is intended to be close
to the perception of the human visual system (HVS).
However, image segmentation is nearly as difficult as
image understanding because the images are 2D projections
of 3D objects and computers are not trained in the 3D world
the way human beings are.
Since the retrieval system has identified what objects are
in the image, it is easier for the system to recognize similar
objects at different locations and with different orientations
and sizes. Region-based retrieval systems include the NeTra
system [11], the Blobworld system [2], and the query system
with color region templates [22].
The NeTra and the Blobworld systems compare images
based on individual regions. Although querying based on a
limited number of regions is allowed, the query is
performed by merging single-region query results. The
motivation is to shift part of the comparison task to the
users. To query an image, a user is provided with the
segmented regions of the image and is required to select the
regions to be matched and also attributes, e.g., color and
texture, of the regions to be used for evaluating similarity.
Such querying systems provide more control to the user.
However, the user's semantic understanding of an image is
at a higher level than the region representation. For objects
without discerning attributes, such as special texture, it is
not obvious for the user how to select a query from the large
variety of choices. Thus, such a querying scheme may add
burdens on users without significant reward. On the other
hand, because of the great difficulty of achieving accurate
segmentation, systems in [11], [2] often partition one object
into several regions with none of them being representative
for the object, especially for images without distinctive
objects and scenes.
Not much attention has been paid to developing similarity
measures that combine information from all of the regions.
One effort in this direction is the querying system developed
by Smith and Li [22]. Their system decomposes an image into
regions with characterizations predefined in a finite pattern
library. With every pattern labeled by a symbol, images are
then represented by region strings. Region strings are
converted to composite region template (CRT) descriptor
matrices that provide the relative ordering of symbols.
Similarity between images is measured by the closeness
between the CRT descriptor matrices. This measure is
sensitive to object shifting since a CRT matrix is determined
solely by the ordering of symbols. The measure also lacks

robustness to scaling and rotation. Because the definition of
the CRT descriptor matrix relies on the pattern library, the
system performance depends critically on the library. The
performance degrades if region types in an image are not
represented by patterns in the library. The system uses a
CRT library with patterns described only by color. In
particular, the patterns are obtained by quantizing color
space. If texture and shape features are also used to
distinguish patterns, the number of patterns in the library
will increase dramatically, roughly exponentially in the
number of features if patterns are obtained by uniformly
quantizing features.
1.2 Related Work in Semantic Classification
The underlying assumption of CBIR is that semantically-
relevant images have similar visual characteristics, or
features. Consequently, a CBIR system is not necessarily
capable of understanding image semantics. Image semantic
classification, on the other hand, is a technique for
classifying images based on their semantics. While image
semantics classification is a limited form of image under-
standing, the goal of image classification is not to under-
stand images the way human beings do, but merely to
assign the image to a semantic class. We argue that image
class membership can assist retrieval.
Minka and Picard [12] introduced a learning component
in their CBIR system. The system internally generated many
segmentations or groupings of each image's regions based
on different combinations of features, then it learned which
combinations best represented the semantic categories
given as exemplars by the user. The system requires the
supervised training of various parts of the image.
Although region-based systems aim at decomposing
images into constituent objects, a representation composed
of pictorial properties of regions is indirectly related to its
semantics. There is no clear mapping from a set of pictorial
properties to semantics. An approximately round brown
region might be a flower, an apple, a face, or part of a sunset
sky. Moreover, pictorial properties such as color, shape, and
texture of an object vary dramatically in different images. If
a system understood the semantics of images and could
determine which features of an object are significant, it
would be capable of fast and accurate search. However, due
to the great difficulty of recognizing and classifying images,
not much success has been achieved in identifying high-
level semantics for the purpose of image retrieval. There-
fore, most systems are confined to matching images with
low-level pictorial properties.
Despite the fact that it is currently impossible to reliably
recognize objects in general-purpose images, there are
methods to distinguish certain semantic types of images.
Any information about semantic types is helpful since a
system can constrict the search to images with a particular
semantic type. More importantly, the semantic classification
schemes can improve retrieval by using various matching
schemes tuned to the semantic class of the query image.
One example of semantic classification is the identifica-
tion of natural photographs versus artificial graphs gener-
ated by computer tools [29]. The classifier divides an image
into blocks and classifies every block into either of the
two classes. If the percentage of blocks classified as
photograph is higher than a threshold, the image is marked
as photograph; otherwise, text.
Other examples include the WIPE system to detect
objectionable images developed by Wang et al. [29],
motivated by an earlier system by Fleck et al. [5] of the
University of California at Berkeley. WIPE uses training
images and CBIR to determine if a given image is closer to
the set of objectionable training images or the set of benign
training images. The system developed by Fleck et al.,
however, is more deterministic and involves a skin filter
and a human figure grouper.
Szummer and Picard [24] have developed a system to
classify indoor and outdoor scenes. Classification over
low-level image features such as color histogram and
DCT coefficients is performed. A 90 percent accuracy rate
has been reported over a database of 1,300 images from Kodak.
Other examples of image semantic classification include
city versus landscape [26] and face detection [1]. Wang and
Fischler [30] have shown that rough, but accurate semantic
understanding, can be very helpful in computer vision tasks
such as image stereo matching.
1.3 Overview of the SIMPLIcity System
CBIR is a complex and challenging problem spanning
diverse disciplines, including computer vision, color per-
ception, image processing, image classification, statistical
clustering, psychology, human-computer interaction (HCI),
and specific application domain dependent criteria. While
we are not claiming to be able to solve all the problems
related to CBIR, we have made some advances towards the
final goal, close to human-level automatic image under-
standing and retrieval performance.
In this paper, we discuss issues related to the design and
implementation of a semantics-sensitive CBIR system for
picture libraries. An experimental system, the SIMPLIcity
(Semantics-sensitive Integrated Matching for Picture
LIbraries) system, has been developed to validate the
methods. We summarize the main contributions as follows.
1.3.1 Semantics-Sensitive Image Retrieval
The capability of existing CBIR systems is limited in large
part by fixing a set of features used for retrieval.
Apparently, different image features are suitable for the
retrieval of images in different semantic types. For example,
a color layout indexing method may be good for outdoor
pictures, while a region-based indexing approach is much
better for indoor pictures. Similarly, global texture matching
is suitable only for textured pictures.
We propose a semantics-sensitive approach to the problem
of searching general-purpose image databases. Semantic
classification methods are used to categorize images so that
semantically-adaptive searching methods applicable to each
category can be applied. At the same time, the system
can narrow down the searching range to a subset of the
original database to facilitate fast retrieval. For example,
automatic classification methods can be used to categorize a
general-purpose picture library into semantic classes
including "graph," "photograph," "textured," "nontextured,"
"benign," "objectionable," "indoor," "outdoor," "city,"
"landscape," "with people," and "without people."
In our experiments, we used textured-nontextured and
graph-photograph classification methods. We apply a

suitable feature extraction method and a corresponding
matching metric to each of the semantic classes. When more
classification methods are utilized, the current semantic
classification architecture may need to be improved.
In our current system, the set of features for a particular
image category is determined empirically based on the
perception of the developers. For example, shape-related
features are not used for textured images. Automatic
derivation of optimal features is a challenging and important
issue in its own right. A major difficulty in feature selection is
the lack of information about whether any two images in the
database match with each other. The only reliable way to
obtain this information is through manual assessment which
is formidable for a database of even moderate size.
Furthermore, it is hard to keep human evaluation consistent
from person to person. To explore feature selection, primitive
studies can be carried with relatively small databases. A
database can be formed from several distinctive groups of
images, among which only images from the same group are
considered matched. A search algorithm can be developed to
select a subset of candidate features that provides optimal
retrieval according to an objective performance measure.
Although such studies are likely to be seriously biased,
insights regarding which features are most useful for a certain
image category may be obtained.
1.3.2 Image Classification
For the purpose of searching picture libraries such as those
on the Web or in a patient digital library, we are initially
focusing on techniques to classify images into the classes
"textured" versus "nontextured" and "graph" versus
"photograph." Several other classification methods have been
previously developed elsewhere, including "city" versus
"landscape" [26] and "with people" versus "without
people" [1]. In this paper, we report on several classification
methods we have developed and their performance.
1.3.3 Integrated Region Matching (IRM) Similarity
Measure
Besides using semantics classification, another strategy of
SIMPLIcity to better capture the image semantics is to
define a robust region-based similarity measure, the
Integrated Region Matching (IRM) metric. It incorporates
the properties of all the segmented regions so that
information about an image can be fully used to gain
robustness against inaccurate segmentation. Image segmen-
tation is an extremely difficult process and is still an open
problem in computer vision. For example, an image
segmentation algorithm may segment an image of a dog
into two regions: the dog and the background. The same
algorithm may segment another image of a dog into six
regions: the body of the dog, the front leg(s) of the dog, the
rear leg(s) of the dog, the eye(s), the background grass, and
the sky.
Traditionally, region-based matching is performed on
individual regions [2], [11]. The IRM metric we have
developed has the following major advantages:
1. Compared with retrieval based on individual regions,
the overall "soft similarity" approach in IRM reduces
the adverse effect of inaccurate segmentation, an
important property lacking in previous systems.
2. In many cases, knowing that one object usually
appears with another helps to clarify the semantics
of a particular region. For example, flowers typically
appear with green leaves, and boats usually appear
with water.
3. By defining an overall image-to-image similarity
measure, the SIMPLIcity system provides users with
a simple querying interface. To complete a query, a
user only needs to specify the query image. If desired,
the system can be added with a function allowing
users to query based on a specific region or a few
regions.
1.4 Outline of the Paper
The remainder of the paper is organized as follows: The
semantics-sensitive architecture is further introduced in
Section 2. The image segmentation algorithm is described in
Section 3. Classification methods are presented in Section 4.
The IRM similarity measure based on segmentation is
defined in Section 5. In Section 6, experiments and results
are described. We conclude and suggest future research in
Section 7.
2 SEMANTICS-SENSITIVE ARCHITECTURE
The architecture of the SIMPLIcity retrieval system is
presented in Fig. 1. During indexing, the system partitions
an image into 4 × 4 pixel blocks and extracts a feature vector
for each block. A statistical clustering [8] algorithm is then
used to quickly segment the image into regions. The
segmentation result is fed into a classifier that decides the
semantic type of the image. An image is currently classified as
one of the n manually-defined mutually exclusive and
collectively exhaustive semantic classes. The system can be
extended to one that classifies an image softly into multiple
classes with probability assignments. Examples of semantic
types are indoor-outdoor, objectionable-benign, textured-
nontextured, city-landscape, with-without people, and
graph-photograph images. Features reflecting color, texture,
shape, and location information are then extracted for each
region in the image. The features selected depend on the
semantic type of the image. The signature of an image is the
collection of features for all of its regions. Signatures of images
with various semantic types are stored in separate databases.
In the querying process, if the query image is not in the
database as indicated by the user interface, it is first passed
through the same feature extraction process as was used
during indexing. For an image in the database, its semantic
type is first checked and then its signature is extracted from
the corresponding database. Once the signature of the
query image is obtained, similarity scores between the
query image and images in the database with the same
semantic type are computed and sorted to provide the list of
images that appear to have the closest semantics.
3 THE IMAGE SEGMENTATION METHOD
In this section, we describe the image segmentation
procedure based on the k-means algorithm [8] using color
and spatial variation features. For general-purpose images
such as the images in a photo library or on the World Wide
Web (WWW), automatic image segmentation is almost as

difficult as automatic image semantic understanding. The
segmentation accuracy of our system is not crucial because
an integrated region-matching (IRM) scheme is used to
provide robustness against inaccurate segmentation.
To segment an image, SIMPLIcity partitions the image into
blocks with 4 × 4 pixels and extracts a feature vector for each
block. The k-means algorithm is used to cluster the feature
vectors into several classes with every class corresponding to
one region in the segmented image. Since the block size is
small and boundary blockiness has little effect on retrieval,
we choose blockwise segmentation rather than pixelwise
segmentation to lower computational cost significantly.
Suppose the observations are $\{x_i : i = 1, \ldots, L\}$. The goal of
the k-means algorithm is to partition the observations into
$k$ groups with means $\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_k$ such that

$$D(k) = \sum_{i=1}^{L} \min_{1 \le j \le k} \| x_i - \hat{x}_j \|^2 \qquad (1)$$
is minimized. The k-means algorithm does not specify how
many clusters to choose. We adaptively choose the number
of clusters k by gradually increasing k and stop when a
criterion is met. We start with k = 2 and stop increasing k if
one of the following conditions is satisfied.

1. The distortion D(k) is below a threshold. A low D(k)
indicates high purity in the clustering process. The
threshold is not critical because the IRM measure is
not sensitive to k.
2. The first derivative of distortion with respect to k,
D(k) - D(k-1), is below a threshold in comparison
with the average derivative at k = 2, 3. A low D(k) -
D(k-1) indicates convergence in the clustering
process. The threshold determines the overall time
to segment images and needs to be set to a near-zero
value. It is critical to the speed, but not the quality, of
the final image segmentation. The threshold can be
adjusted according to the experimental runtime.
3. The number k exceeds an upper bound. We allow an
image to be segmented into a maximum of
16 segments. That is, we assume an image has fewer
than 16 distinct types of objects. Usually, the
segmentation process generates far fewer segments
in an image. This threshold is rarely met.
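The sketch below implements this adaptive choice of k around SciPy's k-means; the distortion follows (1), both thresholds are left as parameters because the paper sets them empirically, and reading "the average derivative at k = 2, 3" as the early drop D(2) - D(3) is our assumption.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def segment_blocks(features, max_k=16, d_thresh=None, deriv_thresh=None):
    """Grow k from 2 and stop by the three criteria above. features is
    an (L, 6) array of block feature vectors; returns the block labels
    of the final clustering."""
    distortion, labels_by_k = {}, {}
    for k in range(2, max_k + 1):                       # criterion 3: k <= 16
        centroids, labels = kmeans2(features, k, minit='++')
        D = float(((features - centroids[labels]) ** 2).sum())  # Eq. (1)
        distortion[k], labels_by_k[k] = D, labels
        if d_thresh is not None and D < d_thresh:       # criterion 1: purity
            break
        if k >= 4 and deriv_thresh is not None:
            early_drop = distortion[2] - distortion[3]  # assumed baseline
            last_drop = distortion[k - 1] - distortion[k]
            if last_drop < deriv_thresh * early_drop:   # criterion 2: converged
                break
    return labels_by_k[k]
```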
Six features are used for segmentation. Three of them are
the average color components in a 4 × 4 block. The other three
represent energy in high frequency bands of wavelet trans-
forms [3], that is, the square root of the second order moment
of wavelet coefficients in high frequency bands. We use the
well-known LUV color space, where L encodes luminance
and U and V encode color information (chrominance). The
LUV color space has good perception correlation properties.
The block size is chosen to be 4 × 4 to compromise between
the texture detail and the computation time.
To obtain the other three features, we apply either the
Daubechies-4 wavelet transform or the Haar transform to
the L component of the image. We use these two wavelet
transforms because they have better localization proper-
ties and require less computation compared to Daube-
chies' wavelets with longer filters. After a one-level
wavelet transform, a 4 × 4 block is decomposed into four
frequency bands, as shown in Fig. 2. Each band contains
2 × 2 coefficients. Without loss of generality, suppose the
coefficients in the HL band are $\{c_{k,l}, c_{k,l+1}, c_{k+1,l}, c_{k+1,l+1}\}$.
One feature is then computed as

$$f = \left( \frac{1}{4} \sum_{i=0}^{1} \sum_{j=0}^{1} c_{k+i,\,l+j}^{2} \right)^{\frac{1}{2}}.$$
The other two features are computed similarly from the
LH and HH bands. The motivation for using these features is
their reflection of texture properties. Moments of wavelet
coefficients in various frequency bands have proven effective
for discerning texture [25]. The intuition behind this is that
coefficients in different frequency bands signal variations in different directions; for example, an image with vertical stripes has high energy in the HL band and low energy in the LH band.
Fig. 1. The architecture of the feature indexing process. The heavy lines show a sample indexing path of an image.
Fig. 2. Decomposition of images into frequency bands by wavelet transforms.
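For concreteness, here is a sketch of the six block features: the three LUV means plus the three band energies f computed from a one-level Haar transform of the 4 × 4 luminance block. The paper also allows Daubechies-4 in place of Haar, and the averaging normalization used here is a simplification.

```python
import numpy as np

def block_features(block_luv):
    """Six segmentation features for one 4 x 4 block (Section 3):
    average L, U, V plus sqrt of the mean squared one-level Haar
    coefficients in the HL, LH, and HH bands of the L channel."""
    assert block_luv.shape == (4, 4, 3)
    avg = block_luv.reshape(-1, 3).mean(axis=0)      # mean L, U, V
    L = block_luv[:, :, 0]
    # One-level Haar transform (average/difference pairs) on the
    # luminance block: columns first, then rows.
    lo = (L[:, 0::2] + L[:, 1::2]) / 2.0             # low-pass in x
    hi = (L[:, 0::2] - L[:, 1::2]) / 2.0             # high-pass in x
    HL = (hi[0::2, :] + hi[1::2, :]) / 2.0           # responds to vertical stripes
    LH = (lo[0::2, :] - lo[1::2, :]) / 2.0           # responds to horizontal stripes
    HH = (hi[0::2, :] - hi[1::2, :]) / 2.0           # diagonal detail
    energy = lambda band: float(np.sqrt((band ** 2).mean()))  # the feature f
    return np.concatenate([avg, [energy(HL), energy(LH), energy(HH)]])
```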

Citations
More filters
Journal ArticleDOI
TL;DR: Almost 300 key theoretical and empirical contributions in the current decade related to image retrieval and automatic image annotation are surveyed, the spawning of related subfields is discussed, and the adaptation of existing image retrieval techniques to build systems that can be useful in the real world is examined.
Abstract: We have witnessed great interest and a wealth of promise in content-based image retrieval as an emerging technology. While the last decade laid foundation to such promise, it also paved the way for a large number of new techniques and systems, got many new people involved, and triggered stronger association of weakly related fields. In this article, we survey almost 300 key theoretical and empirical contributions in the current decade related to image retrieval and automatic image annotation, and in the process discuss the spawning of related subfields. We also discuss significant challenges involved in the adaptation of existing image retrieval techniques to build systems that can be useful in the real world. In retrospect of what has been achieved so far, we also conjecture what the future may hold for image retrieval research.

3,433 citations

Journal ArticleDOI
TL;DR: This paper attempts to provide a comprehensive survey of the recent technical achievements in high-level semantic-based image retrieval, identifying five major categories of the state-of-the-art techniques in narrowing down the 'semantic gap'.

1,713 citations

Journal ArticleDOI
TL;DR: The goal is not, in general, to replace text-based retrieval methods as they exist at the moment but to complement them with visual search tools.

1,535 citations

Proceedings ArticleDOI
01 Oct 2001
TL;DR: This work proposes the use of a support vector machine active learning algorithm for conducting effective relevance feedback for image retrieval and achieves significantly higher search accuracy than traditional query refinement schemes after just three to four rounds of relevance feedback.
Abstract: Relevance feedback is often a critical component when designing image databases. With these databases it is difficult to specify queries directly and explicitly. Relevance feedback interactively determines a user's desired output or query concept by asking the user whether certain proposed images are relevant or not. For a relevance feedback algorithm to be effective, it must grasp a user's query concept accurately and quickly, while also only asking the user to label a small number of images. We propose the use of a support vector machine active learning algorithm for conducting effective relevance feedback for image retrieval. The algorithm selects the most informative images to query a user and quickly learns a boundary that separates the images that satisfy the user's query concept from the rest of the dataset. Experimental results show that our algorithm achieves significantly higher search accuracy than traditional query refinement schemes after just three to four rounds of relevance feedback.

1,512 citations

Proceedings ArticleDOI
28 Jul 2003
TL;DR: Three hierarchical probabilistic mixture models which aim to describe annotated data with multiple types, culminating in correspondence latent Dirichlet allocation, a latent variable model that is effective at modeling the joint distribution of both types and the conditional distribution of the annotation given the primary type.
Abstract: We consider the problem of modeling annotated data---data with multiple types where the instance of one type (such as a caption) serves as a description of the other type (such as an image). We describe three hierarchical probabilistic mixture models which aim to describe such data, culminating in correspondence latent Dirichlet allocation, a latent variable model that is effective at modeling the joint distribution of both types and the conditional distribution of the annotation given the primary type. We conduct experiments on the Corel database of images and captions, assessing performance in terms of held-out likelihood, automatic annotation, and text-based image retrieval.

1,199 citations

References
More filters
Book
01 Jan 1967

22,994 citations


"SIMPLIcity: semantics-sensitive int..." refers methods in this paper

  • ...The goodness of fit is measured by the χ² statistics [20]....


Book
01 May 1992
TL;DR: This paper presents a meta-analysis of the wavelet transforms of Coxeter's inequality and its applications to multiresolution analysis and orthonormal bases.
Abstract: Introduction Preliminaries and notation The what, why, and how of wavelets The continuous wavelet transform Discrete wavelet transforms: Frames Time-frequency density and orthonormal bases Orthonormal bases of wavelets and multiresolutional analysis Orthonormal bases of compactly supported wavelets More about the regularity of compactly supported wavelets Symmetry for compactly supported wavelet bases Characterization of functional spaces by means of wavelets Generalizations and tricks for orthonormal wavelet bases References Indexes.

16,073 citations


"SIMPLIcity: semantics-sensitive int..." refers background in this paper

  • ...The other three represent energy in high frequency bands of wavelet transforms [3], that is, the square root of the second order moment of wavelet coefficients in high frequency bands....


Journal ArticleDOI
TL;DR: In this article, the regularity of compactly supported wavelets and symmetry of wavelet bases are discussed. But the authors focus on the orthonormal bases of wavelets, rather than the continuous wavelet transform.
Abstract: Introduction Preliminaries and notation The what, why, and how of wavelets The continuous wavelet transform Discrete wavelet transforms: Frames Time-frequency density and orthonormal bases Orthonormal bases of wavelets and multiresolutional analysis Orthonormal bases of compactly supported wavelets More about the regularity of compactly supported wavelets Symmetry for compactly supported wavelet bases Characterization of functional spaces by means of wavelets Generalizations and tricks for orthonormal wavelet bases References Indexes.

14,157 citations

Journal ArticleDOI
TL;DR: This work treats image segmentation as a graph partitioning problem and proposes a novel global criterion, the normalized cut, for segmenting the graph, which measures both the total dissimilarity between the different groups as well as the total similarity within the groups.
Abstract: We propose a novel approach for solving the perceptual grouping problem in vision. Rather than focusing on local features and their consistencies in the image data, our approach aims at extracting the global impression of an image. We treat image segmentation as a graph partitioning problem and propose a novel global criterion, the normalized cut, for segmenting the graph. The normalized cut criterion measures both the total dissimilarity between the different groups as well as the total similarity within the groups. We show that an efficient computational technique based on a generalized eigenvalue problem can be used to optimize this criterion. We applied this approach to segmenting static images, as well as motion sequences, and found the results to be very encouraging.

13,789 citations

Frequently Asked Questions (19)
Q1. What are the contributions in "Simplicity: semantics-sensitive integrated matching for picture libraries" ?

The authors present here SIMPLIcity (Semantics-sensitive Integrated Matching for Picture LIbraries), an image retrieval system, which uses semantics classification methods, a wavelet-based approach for feature extraction, and integrated region matching based upon image segmentation. Potentially, the categorization enhances retrieval by permitting semantically-adaptive searching methods and narrowing down the searching range in a database.

The authors are planning to build a sharable testbed for statistical evaluation of different CBIR systems. 

The automatic derivation of semantically-meaningful information from the content of an image is the focus of interest for most research on image databases.

The algorithm the authors used to classify image blocks is based on a probability density analysis of wavelet coefficients in high frequency bands. 

When the query image is in the database, it takes about 1.5 seconds of CPU time on average to sort all the images in the 200,000-image database using the IRM similarity measure.

The main task of designing a signature is to bridge the gap between image semantics and the pixel representation, that is, to create a better correlation with image semantics.

Region-based retrieval systems attempt to overcome the deficiencies of color layout search by representing images at the object-level. 

To compute the feature vectors for the 200,000 color images of size 384 × 256 in their general-purpose image database requires approximately 60 hours.

Existing general-purpose CBIR systems roughly fall into three categories depending on the approach to extract signatures: histogram, color layout, and region-based search. 

The approach taken by the recent WALRUS system [14] to reduce the shifting and scaling sensitivity for color layout search is to exhaustively reproduce many subimages based on an original image. 

If texture and shape features are also used to distinguish patterns, the number of patterns in the library will increase dramatically, roughly exponentially in the number of features if patterns are obtained by uniformly quantizing features. 

The SIMPLIcity system has been implemented with a general-purpose image database including about 200,000 pictures, which are stored in JPEG format with size 384 × 256 or 256 × 384.

The authors used LUV color space and a matching metric similar to the EMD described in [18] to extract color histogram features and match in the categorized image database. 

Because the definition of the CRT descriptor matrix relies on the pattern library, the system performance depends critically on the library. 

If the query image is not already in the database, one extra second of CPU time is spent to extract the feature from the query image. 

The authors may notify the user that two images are considered to be very close when the IRM distance between the two images is less than 15. 

To view the images better or to see more matched images, users can visit the demonstration Web site and use the query image ID to repeat the retrieval. 

As shown by the segmentation results in Fig. 3, regions in textured images tend to scatter in the entire image, whereas nontextured images are usually partitioned into clumped regions. 

The application of SIMPLIcity to a database of about 200,000 general-purpose images shows more accurate and much faster retrieval compared with the existing algorithms.