
SIMPLIcity: semantics-sensitive integrated matching for picture libraries

TL;DR: SIMPLIcity (semantics-sensitive integrated matching for picture libraries), an image retrieval system, which uses semantics classification methods, a wavelet-based approach for feature extraction, and integrated region matching based upon image segmentation to improve retrieval.
Abstract: We present here SIMPLIcity (semantics-sensitive integrated matching for picture libraries), an image retrieval system, which uses semantics classification methods, a wavelet-based approach for feature extraction, and integrated region matching based upon image segmentation. An image is represented by a set of regions, roughly corresponding to objects, which are characterized by color, texture, shape, and location. The system classifies images into semantic categories. Potentially, the categorization enhances retrieval by permitting semantically-adaptive searching methods and narrowing down the searching range in a database. A measure for the overall similarity between images is developed using a region-matching scheme that integrates properties of all the regions in the images. The application of SIMPLIcity to several databases has demonstrated that our system performs significantly better and faster than existing ones. The system is fairly robust to image alterations.

Summary (6 min read)

1 INTRODUCTION

  • WITH the steady growth of computer power, rapidly declining cost of storage, and ever-increasing access to the Internet, digital acquisition of information has become increasingly popular in recent years.
  • The automatic derivation of semantically-meaningful information from the content of an image is the focus of interest for most research on image databases.
  • The image "semantics," i.e., the meanings of an image, has several levels.
  • Content-based image retrieval (CBIR) is the set of techniques for retrieving semantically-relevant images from an image database based on automatically-derived image features.

1.3 Overview of the SIMPLIcity System

  • CBIR is a complex and challenging problem spanning diverse disciplines, including computer vision, color perception, image processing, image classification, statistical clustering, psychology, human-computer interaction (HCI), and specific application domain dependent criteria.
  • While the authors are not claiming to be able to solve all the problems related to CBIR, they have made some advances towards the final goal, close to human-level automatic image understanding and retrieval performance.
  • The authors discuss issues related to the design and implementation of a semantics-sensitive CBIR system for picture libraries.
  • An experimental system, the SIMPLIcity (Semantics-sensitive Integrated Matching for Picture LIbraries) system, has been developed to validate the methods.
  • The authors summarize the main contributions as follows.

1.3.1 Semantics-Sensitive Image Retrieval

  • The capability of existing CBIR systems is limited in large part by fixing a set of features used for retrieval.
  • The authors propose a semantics-sensitive approach to the problem of searching general-purpose image databases.
  • Semantic classification methods are used to categorize images so that semantically-adaptive searching methods applicable to each category can be applied.
  • Automatic classification methods can be used to categorize a general-purpose picture library into semantic classes including "graph," "photograph," "textured," "nontextured," "benign," "objectionable," "indoor," "outdoor," "city," "landscape," "with people," and "without people."
  • Automatic derivation of optimal features is a challenging and important issue in its own right.

1.3.2 Image Classification

  • For the purpose of searching picture libraries such as those on the Web or in a patient digital library, the authors are initially focusing on techniques to classify images into the classes "textured" versus "nontextured" and "graph" versus "photograph."
  • Several other classification methods have been previously developed elsewhere, including "city" versus "landscape" [26] and "with people" versus "without people" [1].
  • The authors report on several classification methods they have developed and their performance.

1.3.3 Integrated Region Matching (IRM) Similarity Measure

  • Besides using semantics classification, another strategy of SIMPLIcity to better capture the image semantics is to define a robust region-based similarity measure, the Integrated Region Matching (IRM) metric.
  • Image segmentation is an extremely difficult process and is still an open problem in computer vision.
  • Traditionally, region-based matching is performed on individual regions [2], [11].
  • The IRM metric the authors have developed has the following major advantages. First, compared with retrieval based on individual regions, the overall "soft similarity" approach in IRM reduces the adverse effect of inaccurate segmentation, an important property lacking in previous systems.
  • In many cases, knowing that one object usually appears with another helps to clarify the semantics of a particular region.

1.4 Outline of the Paper

  • The remainder of the paper is organized as follows:
  • The semantics-sensitive architecture is further introduced in Section 2.
  • The image segmentation algorithm is described in Section 3.
  • Classification methods are presented in Section 4.
  • In Section 6, experiments and results are described.

2 SEMANTICS-SENSITIVE ARCHITECTURE

  • The architecture of the SIMPLIcity retrieval system is presented in Fig. 1.
  • During indexing, the system partitions an image into 4 × 4 pixel blocks and extracts a feature vector for each block.
  • A statistical clustering [8] algorithm is then used to quickly segment the image into regions.
  • For an image in the database, its semantic type is first checked and then its signature is extracted from the corresponding database.
  • Once the signature of the query image is obtained, similarity scores between the query image and images in the database with the same semantic type are computed and sorted to provide the list of images that appear to have the closest semantics.
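To make the flow concrete, below is a minimal Python sketch of the query-time routing just described. All names here (signature_dbs, irm_distance, and so on) are illustrative stand-ins for the pipeline stages, not the authors' actual code.

```python
def run_query(query_signature, sem_type, signature_dbs, irm_distance):
    """Rank database images against a query, restricted to the query's
    semantic class. signature_dbs maps a semantic type to a dict of
    {image_id: signature}; irm_distance is the similarity measure.
    These names are illustrative, not the authors' actual API."""
    candidates = signature_dbs[sem_type]
    scores = [(image_id, irm_distance(query_signature, sig))
              for image_id, sig in candidates.items()]
    scores.sort(key=lambda pair: pair[1])  # smallest distance = closest semantics
    return scores
```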

3 THE IMAGE SEGMENTATION METHOD

  • The authors describe the image segmentation procedure based on the k-means algorithm [8] using color and spatial variation features.
  • A low D(k) indicates high purity in the clustering process.
  • The first derivative of the distortion with respect to k, D(k) - D(k-1), is below a threshold in comparison with the average derivative at k = 2, 3. A low D(k) - D(k-1) indicates convergence in the clustering process.
  • After a one-level wavelet transform, a 4 × 4 block is decomposed into four frequency bands, as shown in Fig. 2.
  • An image with vertical strips thus has high energy in the HL band and low energy in the LH band.

4 THE IMAGE CLASSIFICATION METHODS

  • The image classification methods described in this section have been developed mainly for searching picture libraries such as Web images.
  • The authors are initially interested in classifying images into the classes textured versus nontextured, graph versus photograph, and objectionable versus benign.
  • Karu et al. provided an overview of texture-related research [10].
  • Other classification methods such as city versus landscape [26] and with people versus without people [1] were developed elsewhere.

4.1 Textured versus Nontextured Classification

  • The authors describe the algorithm to classify images into the semantic classes textured or nontextured.
  • Fig. 4 shows some sample textured images.
  • The classification of an image as textured or nontextured is performed by thresholding the average χ² statistic over all m regions in the image, $\bar{\chi}^2 = \frac{1}{m}\sum_{i=1}^{m}\chi_i^2$ (see the sketch below).
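A sketch of this thresholding rule follows, assuming the per-region χ² statistics have already been computed; the threshold and the direction of the comparison are assumptions (regions in textured images scatter across the whole image, so a small average deviation from uniform scatter is taken to indicate texture).

```python
def classify_textured(region_chi2, threshold):
    """Textured vs. nontextured by thresholding the average chi-square
    statistic over all m regions. Assumption: each value measures a
    region's deviation from a uniform scatter over the image, so
    textured images yield small averages; flip the comparison if the
    statistic is defined the other way around."""
    avg = sum(region_chi2) / len(region_chi2)
    return "textured" if avg < threshold else "nontextured"
```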

4.2 Graph versus Photograph Classification

  • An image is a photograph if it is a continuous-tone image.
  • The authors have developed a graph-photograph classification method.
  • The classifier partitions an image into blocks and classifies every block into either of the two classes.
  • If the percentage of blocks classified as photograph is higher than a threshold, the image is marked as photograph; otherwise, graph.
  • The authors achieved 100 percent sensitivity for photographic images and higher than 95 percent specificity.
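The voting step reduces to a one-line ratio test, sketched below; the per-block decisions come from the wavelet-based block classifier, and the 0.5 cutoff is a placeholder, not the paper's value.

```python
def classify_graph_photograph(block_labels, photo_threshold=0.5):
    """Mark the image 'photograph' when the fraction of blocks labeled
    as photograph exceeds a threshold, otherwise 'graph'."""
    photo_ratio = sum(1 for b in block_labels if b == "photo") / len(block_labels)
    return "photograph" if photo_ratio > photo_threshold else "graph"
```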

5 THE IRM SIMILARITY MEASURE

  • The integrated region matching (IRM) measure of image similarity is described.
  • An advantage of the overall similarity measure is the robustness against poor segmentation (Fig. 6), an important property lacking in previous work [2], [11].
  • Every point in the space corresponds to the feature vector or the descriptor of a region.
  • Unlike a distance between two points, such as the Euclidean distance, it is not obvious how to define a distance between two sets of feature points.
  • The distance should be sufficiently consistent with a person's concept of the semantic "closeness" of two images.

5.1 Integrated Region Matching (IRM)

  • Every match between images is characterized by links between regions and their significance credits.
  • If a graph represents an admissible matching, the distance between the two region sets is the summation of all the weighted edge lengths, i.e., $d(R_1, R_2) = \sum_{i,j} s_{i,j} d_{i,j}$ (Eq. (4) in the paper; see the sketch below).
  • The SIMPLIcity system uses the area percentage scheme.
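The sketch below implements this weighted sum with area-percentage significance credits, filled by the greedy most-similar-pair-first allocation commonly described for IRM; region_dist stands in for the region-pair distance d(r, r') of Section 5.2, and the exact credit-assignment details of the authors' implementation may differ.

```python
def irm_distance(regions1, regions2, region_dist):
    """Integrated Region Matching sketch. regions1/regions2 are lists of
    (feature, area_fraction) pairs whose area fractions sum to 1 per
    image (the area-percentage scheme). Significance credits s_ij are
    filled greedily, most-similar pair first, so that every region
    participates in the overall match."""
    p = [area for _, area in regions1]   # remaining credit per region
    q = [area for _, area in regions2]
    pairs = sorted(
        ((region_dist(f1, f2), i, j)
         for i, (f1, _) in enumerate(regions1)
         for j, (f2, _) in enumerate(regions2)),
        key=lambda t: t[0])
    total = 0.0
    for d, i, j in pairs:
        s = min(p[i], q[j])              # credit this pair can absorb
        if s > 0:
            total += s * d               # weighted edge length
            p[i] -= s
            q[j] -= s
    return total                         # d(R1, R2) = sum_ij s_ij * d_ij
```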

5.2 Distance between Regions

  • Now, the authors discuss the definition of the distance between a region pair, d(r, r').
  • Denote the γth-order normalized inertia of spheres as L_γ.
  • If two regions match very well in shape, their color and texture distance is attenuated by a smaller weight to provide the final distance.

5.3 Characteristics of IRM

  • To study the characteristics of the IRM distance, the authors performed 100 random queries on their COREL photograph data set.
  • Based on the 5.6 million IRM distances obtained, the authors estimated the distribution of the IRM distance.
  • The authors may notify the user that two images are considered to be very close when the IRM distance between the two images is less than 15.
  • Likewise, the authors may advise the user that two images are considerably different when the IRM distance between the two images is greater than 50.
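These empirical cutoffs map directly to a user-facing hint, for example:

```python
def similarity_hint(irm_dist, close=15.0, different=50.0):
    """Translate an IRM distance into the hints described above;
    15 and 50 are the cutoffs estimated from the empirical distance
    distribution."""
    if irm_dist < close:
        return "very close"
    if irm_dist > different:
        return "considerably different"
    return "intermediate"
```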

6 EXPERIMENTS

  • The SIMPLIcity system has been implemented with a general-purpose image database including about 200,000 pictures, which are stored in JPEG format with size 384 × 256 or 256 × 384.
  • Two classification methods, graph-photograph and textured-nontextured, have been used in their experiments.
  • WBIIS had been compared with the original IBM QBIC system and found to perform better [28].
  • It is difficult to design a fair comparison with existing region-based searching algorithms such as the Blobworld system and the NeTra system, which depend on additional information provided by the user during the query process.
  • A list of online image retrieval demonstration Web sites can be found on their site.

6.1 Accuracy

  • The authors evaluated the accuracy of the system in two ways.
  • First, the authors used a 200,000-image COREL database to compare with existing systems such as EMD-based color histogram and WBIIS.
  • Then, the authors designed systematic evaluation methods to judge the performance statistically.
  • The SIMPLIcity system has demonstrated much improved accuracy over the other systems.

6.2 Query Comparison

  • The authors compare the SIMPLIcity system with the WBIIS (Wavelet-Based Image Indexing and Searching) system [28] with the same image database.
  • Due to space limitations, the authors show only two rows of images with the top 11 matches to each query.
  • The authors chose the numbers "11" and "29" before viewing the results.
  • For each query, the authors decided the relevance to the query image before viewing the query results.
  • To view the images better or to see more matched images, users can visit the demonstration Web site and use the query image ID to repeat the retrieval.

6.3.1 Performance on Image Queries

  • To provide numerical results, the authors tested 27 sample images chosen randomly from nine categories, with three images from each category.
  • The categories of images tested are listed in Table 1a.
  • Images in the ªsports and public eventsº class contain people in a game or public event, such as a festival.
  • On average, the precision and the weighted precision of SIMPLIcity are higher than those of WBIIS by 0.227 and 0.273, respectively.

6.3.2 Performance on Image Categorization

  • The SIMPLIcity system was also evaluated based on a subset of the COREL database, formed by 10 image categories (shown in Table 1b), each containing 100 pictures.
  • The recall within the first 100 retrieved images is identical to the precision in this special case.
  • The authors used LUV color space and a matching metric similar to the EMD described in [18] to extract color histogram features and match in the categorized image database.
  • The authors call the one with less filled color bins the Color Histogram 1 system and the other the Color Histogram 2 system.
  • For this reason, the authors cannot evaluate this system using the COREL database of 200,000 images and the 27 sample query images described in the previous section.
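For the 10-category, 100-images-per-category test, retrieval accuracy reduces to the fraction of same-category images among the top 100 returned, as sketched below with illustrative names; because each query has exactly 100 relevant images, this precision equals the recall noted above.

```python
def precision_at_100(ranked_ids, category_of, query_category):
    """Fraction of the first 100 retrieved images that share the query's
    category. With 100 relevant images per category, recall within the
    top 100 equals this precision."""
    top = ranked_ids[:100]
    hits = sum(1 for image_id in top if category_of[image_id] == query_category)
    return hits / 100.0
```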

6.4.1 Speed

  • The algorithm has been implemented on a Pentium III 450 MHz PC running the Linux operating system.
  • On average, one second is needed to segment an image and to compute the features of all regions.
  • The speed is much faster than other region-based methods.
  • Fast indexing has provided us with the capability of handling external queries and sketch queries in real time.
  • If the query image is not already in the database, one extra second of CPU time is spent to extract the feature from the query image.


SIMPLIcity: Semantics-Sensitive Integrated Matching for Picture LIbraries
James Z. Wang, Member, IEEE, Jia Li, Member, IEEE, and Gio Wiederhold, Fellow, IEEE
Abstract: The need for efficient content-based image retrieval has increased tremendously in many application areas such as biomedicine, military, commerce, education, and Web image classification and searching. We present here SIMPLIcity (Semantics-sensitive Integrated Matching for Picture LIbraries), an image retrieval system, which uses semantics classification methods, a wavelet-based approach for feature extraction, and integrated region matching based upon image segmentation. As in other region-based retrieval systems, an image is represented by a set of regions, roughly corresponding to objects, which are characterized by color, texture, shape, and location. The system classifies images into semantic categories, such as textured-nontextured and graph-photograph. Potentially, the categorization enhances retrieval by permitting semantically-adaptive searching methods and narrowing down the searching range in a database. A measure for the overall similarity between images is developed using a region-matching scheme that integrates properties of all the regions in the images. Compared with retrieval based on individual regions, the overall similarity approach 1) reduces the adverse effect of inaccurate segmentation, 2) helps to clarify the semantics of a particular region, and 3) enables a simple querying interface for region-based image retrieval systems. The application of SIMPLIcity to several databases, including a database of about 200,000 general-purpose images, has demonstrated that our system performs significantly better and faster than existing ones. The system is fairly robust to image alterations.
Index Terms: Content-based image retrieval, image classification, image segmentation, integrated region matching, clustering, robustness.
1 INTRODUCTION

WITH the steady growth of computer power, rapidly declining cost of storage, and ever-increasing access to the Internet, digital acquisition of information has become increasingly popular in recent years. Effective indexing and searching of large-scale image databases remain as challenges for computer systems.
The automatic derivation of semantically-meaningful
information from the content of an image is the focus of
interest for most research on image databases. The image
"semantics," i.e., the meanings of an image, has several
levels. From the lowest to the highest, these levels can be
roughly categorized as
1. semantic types (e.g., landscape photograph, clip art),
2. object composition (e.g., a bike and a car parked on a
beach, a sunset scene),
3. abstract semantics (e.g., people fighting, happy
person, objectionable photograph), and
4. detailed semantics (e.g., a detailed description of a
given picture).
Content-based image retrieval (CBIR) is the set of techniques
for retrieving semantically-relevant images from an image
database based on automatically-derived image features.
1.1 Related Work in CBIR
CBIR for general-purpose image databases is a highly
challenging problem because of the large size of the
database, the difficulty of understanding images, both by
people and computers, the difficulty of formulating a query,
and the issue of evaluating results properly. A number of
general-purpose image search engines have been devel-
oped. We cannot survey all related work in the allocated
space. Instead, we try to emphasize some of the work that is
most related to our work. The references below are to be
taken as examples of related work, not as the complete list
of work in the cited area.
In the commercial domain, IBM QBIC [4] is one of the
earliest systems. Recently, additional systems have been
developed at IBM T.J. Watson [22], VIRAGE [7], NEC
AMORA [13], Bell Laboratory [14], and Interpix. In the
academic domain, MIT Photobook [15], [17], [12] is one of
the earliest. Berkeley Blobworld [2], Columbia VisualSEEK
and WebSEEK [21], CMU Informedia [23], UCSB NeTra
[11], UCSD [9], University of Maryland [16], Stanford EMD
[18], and Stanford WBIIS [28] are some of the recent
systems.
The common ground for CBIR systems is to extract a
signature for every image based on its pixel values and to
define a rule for comparing images. The signature serves as
an image representation in the "view" of a CBIR system.
The components of the signature are called features. One
advantage of a signature over the original pixel values is the
significant compression of image representation. However,
J.Z. Wang is with the School of Information Sciences and Technology and the Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16801. E-mail: wangz@cs.stanford.edu.
J. Li is with the Department of Statistics, The Pennsylvania State University, University Park, PA 16801. E-mail: jiali@stat.psu.edu.
G. Wiederhold is with the Department of Computer Science, Stanford University, Stanford, CA 94305. E-mail: gio@cs.stanford.edu.

a more important reason for using the signature is to gain
on improved correlation between image representation and
semantics. Actually, the main task of designing a signature
is to bridge the gap between image semantics and the pixel
representation, that is, to create a better correlation with
image semantics.
Existing general-purpose CBIR systems roughly fall into
three categories depending on the approach to extract
signatures: histogram, color layout, and region-based
search. We will briefly review the three methods in this
section. There are also systems that combine retrieval
results from individual algorithms by a weighted sum
matching metric [7], [4], or other merging schemes [19].
After extracting signatures, the next step is to determine a
comparison rule, including a querying scheme and the
definition of a similarity measure between images. For most
image retrieval systems, a query is specified by an image to
be matched. We refer to this as global search since similarity
is based on the overall properties of images. By contrast,
there are also "partial search" querying systems that retrieve
based on a particular region in an image [11], [2].
1.1.1 Histogram Search
Histogram search algorithms [4], [18] characterize an image
by its color distribution or histogram. Many distances have
been used to define the similarity of two color histogram
representations. Euclidean distance and its variations are
the most commonly used [4]. Rubner et al. of Stanford
University proposed the earth mover's distance (EMD) [18]
using linear programming for matching histograms.
The drawback of a global histogram representation is
that information about object location, shape, and texture
[10] is discarded. Color histogram search is sensitive to
intensity variation, color distortions, and cropping.
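As a generic illustration (not any particular system's code), the sketch below builds a quantized LUV histogram signature and compares two signatures with the Euclidean distance; an EMD-based system would replace the final comparison with a linear-programming transport cost.

```python
import numpy as np

def color_histogram(pixels_luv, bins=8):
    """Global color signature: count pixels in a bins^3 grid over LUV
    space and normalize. pixels_luv is an (N, 3) array of pixel values."""
    hist, _ = np.histogramdd(pixels_luv, bins=bins)
    return (hist / hist.sum()).ravel()

def histogram_distance(h1, h2):
    # Euclidean distance, the most commonly used comparison rule [4].
    return float(np.linalg.norm(h1 - h2))
```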
1.1.2 Color Layout Search
The "color layout" approach attempts to overcome the
drawback of histogram search. In simple color layout
indexing [4], images are partitioned into blocks and the
average color of each block is stored. Thus, the color layout
is essentially a low resolution representation of the original
image. A relatively recent system, WBIIS [28], uses
significant Daubechies' wavelet coefficients instead of
averaging. By adjusting block sizes or the levels of wavelet
transforms, the coarseness of a color layout representation
can be tuned. The finest color layout using a single pixel
block is the original pixel representation. Hence, we can
view a color layout representation as an opposite extreme of
a histogram. At proper resolutions, the color layout
representation naturally retains shape, location, and texture
information. However, as with pixel representation,
although information such as shape is preserved in the
color layout representation, the retrieval system cannot
perceive it directly. Color layout search is sensitive to
shifting, cropping, scaling, and rotation because images are
described by a set of local properties [28].
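A minimal sketch of the block-averaging layout signature described above (WBIIS would keep significant wavelet coefficients instead of plain block means):

```python
import numpy as np

def color_layout(image, grid=8):
    """Partition the image into grid x grid blocks and store each
    block's average color, i.e., a low-resolution version of the image.
    image is an (H, W, 3) array; edge pixels beyond an even multiple of
    the grid are ignored for simplicity."""
    h, w, _ = image.shape
    bh, bw = h // grid, w // grid
    sig = np.empty((grid, grid, 3))
    for i in range(grid):
        for j in range(grid):
            block = image[i*bh:(i+1)*bh, j*bw:(j+1)*bw]
            sig[i, j] = block.reshape(-1, 3).mean(axis=0)
    return sig
```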
The approach taken by the recent WALRUS system [14]
to reduce the shifting and scaling sensitivity for color layout
search is to exhaustively reproduce many subimages based
on an original image. The subimages are formed by sliding
windows of various sizes and a color layout signature is
computed for every subimage. The similarity between
images is then determined by comparing the signatures of
subimages. An obvious drawback of the system is the
sharply increased computational complexity and increase of
size of the search space due to exhaustive generation of
subimages. Furthermore, texture and shape information is
discarded in the signatures because every subimage is
partitioned into four blocks and only average colors of the
blocks are used as features. This system is also limited to
intensity-level image representations.
1.1.3 Region-Based Search
Region-based retrieval systems attempt to overcome the
deficiencies of color layout search by representing images at
the object-level. A region-based retrieval system applies
image segmentation [20], [27] to decompose an image into
regions, which correspond to objects if the decomposition is
ideal. The object-level representation is intended to be close
to the perception of the human visual system (HVS).
However, image segmentation is nearly as difficult as
image understanding because the images are 2D projections
of 3D objects and computers are not trained in the 3D world
the way human beings are.
Since the retrieval system has identified what objects are
in the image, it is easier for the system to recognize similar
objects at different locations and with different orientations
and sizes. Region-based retrieval systems include the NeTra
system [11], the Blobworld system [2], and the query system
with color region templates [22].
The NeTra and the Blobworld systems compare images
based on individual regions. Although querying based on a
limited number of regions is allowed, the query is
performed by merging single-region query results. The
motivation is to shift part of the comparison task to the
users. To query an image, a user is provided with the
segmented regions of the image and is required to select the
regions to be matched and also attributes, e.g., color and
texture, of the regions to be used for evaluating similarity.
Such querying systems provide more control to the user.
However, the user's semantic understanding of an image is
at a higher level than the region representation. For objects
without discerning attributes, such as special texture, it is
not obvious for the user how to select a query from the large
variety of choices. Thus, such a querying scheme may add
burdens on users without significant reward. On the other
hand, because of the great difficulty of achieving accurate
segmentation, systems in [11], [2] often partition one object
into several regions with none of them being representative
for the object, especially for images without distinctive
objects and scenes.
Not much attention has been paid to developing similarity
measures that combine information from all of the regions.
One effort in this direction is the querying system developed
by Smith and Li [22]. Their system decomposes an image into
regions with characterizations predefined in a finite pattern
library. With every pattern labeled by a symbol, images are
then represented by region strings. Region strings are
converted to composite region template (CRT) descriptor
matrices that provide the relative ordering of symbols.
Similarity between images is measured by the closeness
between the CRT descriptor matrices. This measure is
sensitive to object shifting since a CRT matrix is determined
solely by the ordering of symbols. The measure also lacks

robustness to scaling and rotation. Because the definition of
the CRT descriptor matrix relies on the pattern library, the
system performance depends critically on the library. The
performance degrades if region types in an image are not
represented by patterns in the library. The system uses a
CRT library with patterns described only by color. In
particular, the patterns are obtained by quantizing color
space. If texture and shape features are also used to
distinguish patterns, the number of patterns in the library
will increase dramatically, roughly exponentially in the
number of features if patterns are obtained by uniformly
quantizing features.
1.2 Related Work in Semantic Classification
The underlying assumption of CBIR is that semantically-
relevant images have similar visual characteristics, or
features. Consequently, a CBIR system is not necessarily
capable of understanding image semantics. Image semantic
classification, on the other hand, is a technique for
classifying images based on their semantics. While image
semantics classification is a limited form of image under-
standing, the goal of image classification is not to under-
stand images the way human beings do, but merely to
assign the image to a semantic class. We argue that image
class membership can assist retrieval.
Minka and Picard [12] introduced a learning component
in their CBIR system. The system internally generated many
segmentations or groupings of each image's regions based
on different combinations of features, then it learned which
combinations best represented the semantic categories
given as exemplars by the user. The system requires the
supervised training of various parts of the image.
Although region-based systems aim at decomposing
images into constituent objects, a representation composed
of pictorial properties of regions is indirectly related to its
semantics. There is no clear mapping from a set of pictorial
properties to semantics. An approximately round brown
region might be a flower, an apple, a face, or part of a sunset
sky. Moreover, pictorial properties such as color, shape, and
texture of an object vary dramatically in different images. If
a system understood the semantics of images and could
determine which features of an object are significant, it
would be capable of fast and accurate search. However, due
to the great difficulty of recognizing and classifying images,
not much success has been achieved in identifying high-
level semantics for the purpose of image retrieval. There-
fore, most systems are confined to matching images with
low-level pictorial properties.
Despite the fact that it is currently impossible to reliably
recognize objects in general-purpose images, there are
methods to distinguish certain semantic types of images.
Any information about semantic types is helpful since a
system can constrict the search to images with a particular
semantic type. More importantly, the semantic classification
schemes can improve retrieval by using various matching
schemes tuned to the semantic class of the query image.
One example of semantic classification is the identifica-
tion of natural photographs versus artificial graphs gener-
ated by computer tools [29]. The classifier divides an image
into blocks and classifies every block into either of the
two classes. If the percentage of blocks classified as
photograph is higher than a threshold, the image is marked
as photograph; otherwise, text.
Other examples include the WIPE system to detect
objectionable images developed by Wang et al. [29],
motivated by an earlier system by Fleck et al. [5] of the
University of California at Berkeley. WIPE uses training
images and CBIR to determine if a given image is closer to
the set of objectionable training images or the set of benign
training images. The system developed by Fleck et al.,
however, is more deterministic and involves a skin filter
and a human figure grouper.
Szummer and Picard [24] have developed a system to
classify indoor and outdoor scenes. Classification over
low-level image features such as color histogram and
DCT coefficients is performed. A 90 percent accuracy rate
has been reported over a database of 1,300 images from Kodak.
Other examples of image semantic classification include
city versus landscape [26] and face detection [1]. Wang and
Fischler [30] have shown that rough, but accurate semantic
understanding, can be very helpful in computer vision tasks
such as image stereo matching.
1.3 Overview of the SIMPLIcity System
CBIR is a complex and challenging problem spanning
diverse disciplines, including computer vision, color per-
ception, image processing, image classification, statistical
clustering, psychology, human-computer interaction (HCI),
and specific application domain dependent criteria. While
we are not claiming to be able to solve all the problems
related to CBIR, we have made some advances towards the
final goal, close to human-level automatic image under-
standing and retrieval performance.
In this paper, we discuss issues related to the design and
implementation of a semantics-sensitive CBIR system for
picture libraries. An experimental system, the SIMPLIcity
(Semantics-sensitive Integrated Matching for Picture
LIbraries) system, has been developed to validate the
methods. We summarize the main contributions as follows.
1.3.1 Semantics-Sensitive Image Retrieval
The capability of existing CBIR systems is limited in large
part by fixing a set of features used for retrieval.
Apparently, different image features are suitable for the
retrieval of images in different semantic types. For example,
a color layout indexing method may be good for outdoor
pictures, while a region-based indexing approach is much
better for indoor pictures. Similarly, global texture matching
is suitable only for textured pictures.
We propose a semantics-sensitive approach to the problem
of searching general-purpose image databases. Semantic
classification methods are used to categorize images so that
semantically-adaptive searching methods applicable to each
category can be applied. At the same time, the system
can narrow down the searching range to a subset of the
original database to facilitate fast retrieval. For example,
automatic classification methods can be used to categorize a
general-purpose picture library into semantic classes
including "graph," "photograph," "textured," "nontextured,"
"benign," "objectionable," "indoor," "outdoor," "city,"
"landscape," "with people," and "without people."
In our experiments, we used textured-nontextured and
graph-photograph classification methods. We apply a

suitable feature extraction method and a corresponding
matching metric to each of the semantic classes. When more
classification methods are utilized, the current semantic
classification architecture may need to be improved.
In our current system, the set of features for a particular
image category is determined empirically based on the
perception of the developers. For example, shape-related
features are not used for textured images. Automatic
derivation of optimal features is a challenging and important
issue in its own right. A major difficulty in feature selection is
the lack of information about whether any two images in the
database match with each other. The only reliable way to
obtain this information is through manual assessment which
is formidable for a database of even moderate size.
Furthermore, it is hard to keep human evaluation consistent
from person to person. To explore feature selection, primitive
studies can be carried with relatively small databases. A
database can be formed from several distinctive groups of
images, among which only images from the same group are
considered matched. A search algorithm can be developed to
select a subset of candidate features that provides optimal
retrieval according to an objective performance measure.
Although such studies are likely to be seriously biased,
insights regarding which features are most useful for a certain
image category may be obtained.
1.3.2 Image Classification
For the purpose of searching picture libraries such as those
on the Web or in a patient digital library, we are initially
focusing on techniques to classify images into the classes
"textured" versus "nontextured" and "graph" versus
"photograph." Several other classification methods have been
previously developed elsewhere, including "city" versus
"landscape" [26] and "with people" versus "without
people" [1]. In this paper, we report on several classification
methods we have developed and their performance.
1.3.3 Integrated Region Matching (IRM) Similarity
Measure
Besides using semantics classification, another strategy of
SIMPLIcity to better capture the image semantics is to
define a robust region-based similarity measure, the
Integrated Region Matching (IRM) metric. It incorporates
the properties of all the segmented regions so that
information about an image can be fully used to gain
robustness against inaccurate segmentation. Image segmen-
tation is an extremely difficult process and is still an open
problem in computer vision. For example, an image
segmentation algorithm may segment an image of a dog
into two regions: the dog and the background. The same
algorithm may segment another image of a dog into six
regions: the body of the dog, the front leg(s) of the dog, the
rear leg(s) of the dog, the eye(s), the background grass, and
the sky.
Traditionally, region-based matching is performed on
individual regions [2], [11]. The IRM metric we have
developed has the following major advantages:
1. Compared with retrieval based on individual regions,
the overall "soft similarity" approach in IRM reduces
the adverse effect of inaccurate segmentation, an
important property lacking in previous systems.
2. In many cases, knowing that one object usually
appears with another helps to clarify the semantics
of a particular region. For example, flowers typically
appear with green leaves, and boats usually appear
with water.
3. By defining an overall image-to-image similarity
measure, the SIMPLIcity system provides users with
a simple querying interface. To complete a query, a
user only needs to specify the query image. If desired,
the system can be added with a function allowing
users to query based on a specific region or a few
regions.
1.4 Outline of the Paper
The remainder of the paper is organized as follows: The
semantics-sensitive architecture is further introduced in
Section 2. The image segmentation algorithm is described in
Section 3. Classification methods are presented in Section 4.
The IRM similarity measure based on segmentation is
defined in Section 5. In Section 6, experiments and results
are described. We conclude and suggest future research in
Section 7.
2 SEMANTICS-SENSITIVE ARCHITECTURE
The architecture of the SIMPLIcity retrieval system is
presented in Fig. 1. During indexing, the system partitions
an image into 4 × 4 pixel blocks and extracts a feature vector
for each block. A statistical clustering [8] algorithm is then
used to quickly segment the image into regions. The
segmentation result is fed into a classifier that decides the
semantic type of the image. An image is currently classified as
one of the n manually-defined mutually exclusive and
collectively exhaustive semantic classes. The system can be
extended to one that classifies an image softly into multiple
classes with probability assignments. Examples of semantic
types are indoor-outdoor, objectionable-benign, textured-
nontextured, city-landscape, with-without people, and
graph-photograph images. Features reflecting color, texture,
shape, and location information are then extracted for each
region in the image. The features selected depend on the
semantic type of the image. The signature of an image is the
collection of features for all of its regions. Signatures of images
with various semantic types are stored in separate databases.
In the querying process, if the query image is not in the
database as indicated by the user interface, it is first passed
through the same feature extraction process as was used
during indexing. For an image in the database, its semantic
type is first checked and then its signature is extracted from
the corresponding database. Once the signature of the
query image is obtained, similarity scores between the
query image and images in the database with the same
semantic type are computed and sorted to provide the list of
images that appear to have the closest semantics.
3 THE IMAGE SEGMENTATION METHOD
In this section, we describe the image segmentation
procedure based on the k-means algorithm [8] using color
and spatial variation features. For general-purpose images
such as the images in a photo library or on the World Wide
Web (WWW), automatic image segmentation is almost as

difficult as automatic image semantic understanding. The
segmentation accuracy of our system is not crucial because
an integrated region-matching (IRM) scheme is used to
provide robustness against inaccurate segmentation.
To segment an image, SIMPLIcity partitions the image into
blocks with 4 × 4 pixels and extracts a feature vector for each
block. The k-means algorithm is used to cluster the feature
vectors into several classes with every class corresponding to
one region in the segmented image. Since the block size is
small and boundary blockiness has little effect on retrieval,
we choose blockwise segmentation rather than pixelwise
segmentation to lower computational cost significantly.
Suppose the observations are $\{x_i : i = 1, \ldots, L\}$. The goal of
the k-means algorithm is to partition the observations into
$k$ groups with means $\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_k$ such that

$$D(k) = \sum_{i=1}^{L} \min_{1 \le j \le k} \| x_i - \hat{x}_j \|^2 \qquad (1)$$
is minimized. The k-means algorithm does not specify how
many clusters to choose. We adaptively choose the number
of clusters k by gradually increasing k and stop when a
criterion is met. We start with k = 2 and stop increasing k if
one of the following conditions is satisfied.

1. The distortion D(k) is below a threshold. A low D(k)
indicates high purity in the clustering process. The
threshold is not critical because the IRM measure is
not sensitive to k.
2. The first derivative of distortion with respect to k,
D(k) - D(k-1), is below a threshold in comparison
with the average derivative at k = 2, 3. A low D(k) -
D(k-1) indicates convergence in the clustering
process. The threshold determines the overall time
to segment images and needs to be set to a near-zero
value. It is critical to the speed, but not the quality, of
the final image segmentation. The threshold can be
adjusted according to the experimental runtime.
3. The number k exceeds an upper bound. We allow an
image to be segmented into a maximum of
16 segments. That is, we assume an image has fewer
than 16 distinct types of objects. Usually, the
segmentation process generates far fewer segments
in an image. This threshold is rarely met.
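The sketch below implements this adaptive choice of k around SciPy's k-means; the distortion follows (1), both thresholds are left as parameters because the paper sets them empirically, and reading "the average derivative at k = 2, 3" as the early drop D(2) - D(3) is our assumption.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def segment_blocks(features, max_k=16, d_thresh=None, deriv_thresh=None):
    """Grow k from 2 and stop by the three criteria above. features is
    an (L, 6) array of block feature vectors; returns the block labels
    of the final clustering."""
    distortion, labels_by_k = {}, {}
    for k in range(2, max_k + 1):                       # criterion 3: k <= 16
        centroids, labels = kmeans2(features, k, minit='++')
        D = float(((features - centroids[labels]) ** 2).sum())  # Eq. (1)
        distortion[k], labels_by_k[k] = D, labels
        if d_thresh is not None and D < d_thresh:       # criterion 1: purity
            break
        if k >= 4 and deriv_thresh is not None:
            early_drop = distortion[2] - distortion[3]  # assumed baseline
            last_drop = distortion[k - 1] - distortion[k]
            if last_drop < deriv_thresh * early_drop:   # criterion 2: converged
                break
    return labels_by_k[k]
```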
Six features are used for segmentation. Three of them are
the average color components in a 4 × 4 block. The other three
represent energy in high frequency bands of wavelet trans-
forms [3], that is, the square root of the second order moment
of wavelet coefficients in high frequency bands. We use the
well-known LUV color space, where L encodes luminance
and U and V encode color information (chrominance). The
LUV color space has good perception correlation properties.
The block size is chosen to be 4 × 4 to compromise between
the texture detail and the computation time.
To obtain the other three features, we apply either the
Daubechies-4 wavelet transform or the Haar transform to
the L component of the image. We use these two wavelet
transforms because they have better localization proper-
ties and require less computation compared to Daube-
chies' wavelets with longer filters. After a one-level
wavelet transform, a 4 × 4 block is decomposed into four
frequency bands, as shown in Fig. 2. Each band contains
2 × 2 coefficients. Without loss of generality, suppose the
coefficients in the HL band are $\{c_{k,l}, c_{k,l+1}, c_{k+1,l}, c_{k+1,l+1}\}$.
One feature is then computed as

$$f = \left( \frac{1}{4} \sum_{i=0}^{1} \sum_{j=0}^{1} c_{k+i,\,l+j}^{2} \right)^{\frac{1}{2}}.$$
The other two features are computed similarly from the
LH and HH bands. The motivation for using these features is
their reflection of texture properties. Moments of wavelet
coefficients in various frequency bands have proven effective
for discerning texture [25]. The intuition behind this is that
coefficients in different frequency bands signal variations in different directions; for example, an image with vertical stripes has high energy in the HL band and low energy in the LH band.
Fig. 1. The architecture of the feature indexing process. The heavy lines show a sample indexing path of an image.
Fig. 2. Decomposition of images into frequency bands by wavelet transforms.
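For concreteness, here is a sketch of the six block features: the three LUV means plus the three band energies f computed from a one-level Haar transform of the 4 × 4 luminance block. The paper also allows Daubechies-4 in place of Haar, and the averaging normalization used here is a simplification.

```python
import numpy as np

def block_features(block_luv):
    """Six segmentation features for one 4 x 4 block (Section 3):
    average L, U, V plus sqrt of the mean squared one-level Haar
    coefficients in the HL, LH, and HH bands of the L channel."""
    assert block_luv.shape == (4, 4, 3)
    avg = block_luv.reshape(-1, 3).mean(axis=0)      # mean L, U, V
    L = block_luv[:, :, 0]
    # One-level Haar transform (average/difference pairs) on the
    # luminance block: columns first, then rows.
    lo = (L[:, 0::2] + L[:, 1::2]) / 2.0             # low-pass in x
    hi = (L[:, 0::2] - L[:, 1::2]) / 2.0             # high-pass in x
    HL = (hi[0::2, :] + hi[1::2, :]) / 2.0           # responds to vertical stripes
    LH = (lo[0::2, :] - lo[1::2, :]) / 2.0           # responds to horizontal stripes
    HH = (hi[0::2, :] - hi[1::2, :]) / 2.0           # diagonal detail
    energy = lambda band: float(np.sqrt((band ** 2).mean()))  # the feature f
    return np.concatenate([avg, [energy(HL), energy(LH), energy(HH)]])
```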

Citations
More filters
Journal ArticleDOI
TL;DR: Almost 300 key theoretical and empirical contributions in the current decade related to image retrieval and automatic image annotation are surveyed, the spawning of related subfields is discussed, and the adaptation of existing image retrieval techniques to build systems that can be useful in the real world is examined.
Abstract: We have witnessed great interest and a wealth of promise in content-based image retrieval as an emerging technology. While the last decade laid foundation to such promise, it also paved the way for a large number of new techniques and systems, got many new people involved, and triggered stronger association of weakly related fields. In this article, we survey almost 300 key theoretical and empirical contributions in the current decade related to image retrieval and automatic image annotation, and in the process discuss the spawning of related subfields. We also discuss significant challenges involved in the adaptation of existing image retrieval techniques to build systems that can be useful in the real world. In retrospect of what has been achieved so far, we also conjecture what the future may hold for image retrieval research.

3,433 citations

Journal ArticleDOI
TL;DR: This paper attempts to provide a comprehensive survey of the recent technical achievements in high-level semantic-based image retrieval, identifying five major categories of the state-of-the-art techniques in narrowing down the 'semantic gap'.

1,713 citations

Journal ArticleDOI
TL;DR: The goal is not, in general, to replace text-based retrieval methods as they exist at the moment but to complement them with visual search tools.

1,535 citations

Proceedings ArticleDOI
01 Oct 2001
TL;DR: This work proposes the use of a support vector machine active learning algorithm for conducting effective relevance feedback for image retrieval and achieves significantly higher search accuracy than traditional query refinement schemes after just three to four rounds of relevance feedback.
Abstract: Relevance feedback is often a critical component when designing image databases. With these databases it is difficult to specify queries directly and explicitly. Relevance feedback interactively determines a user's desired output or query concept by asking the user whether certain proposed images are relevant or not. For a relevance feedback algorithm to be effective, it must grasp a user's query concept accurately and quickly, while also only asking the user to label a small number of images. We propose the use of a support vector machine active learning algorithm for conducting effective relevance feedback for image retrieval. The algorithm selects the most informative images to query a user and quickly learns a boundary that separates the images that satisfy the user's query concept from the rest of the dataset. Experimental results show that our algorithm achieves significantly higher search accuracy than traditional query refinement schemes after just three to four rounds of relevance feedback.

1,512 citations

Proceedings ArticleDOI
28 Jul 2003
TL;DR: Three hierarchical probabilistic mixture models which aim to describe annotated data with multiple types, culminating in correspondence latent Dirichlet allocation, a latent variable model that is effective at modeling the joint distribution of both types and the conditional distribution of the annotation given the primary type.
Abstract: We consider the problem of modeling annotated data---data with multiple types where the instance of one type (such as a caption) serves as a description of the other type (such as an image). We describe three hierarchical probabilistic mixture models which aim to describe such data, culminating in correspondence latent Dirichlet allocation, a latent variable model that is effective at modeling the joint distribution of both types and the conditional distribution of the annotation given the primary type. We conduct experiments on the Corel database of images and captions, assessing performance in terms of held-out likelihood, automatic annotation, and text-based image retrieval.

1,199 citations

References
More filters
Book
01 Jan 1967

22,994 citations


"SIMPLIcity: semantics-sensitive int..." refers methods in this paper

  • ...The goodness of fit is measured by the χ² statistics [20]....


Book
01 May 1992
TL;DR: This paper presents a meta-analysis of the wavelet transforms of Coxeter's inequality and its applications to multiresolution analysis and orthonormal bases.
Abstract: Introduction Preliminaries and notation The what, why, and how of wavelets The continuous wavelet transform Discrete wavelet transforms: Frames Time-frequency density and orthonormal bases Orthonormal bases of wavelets and multiresolutional analysis Orthonormal bases of compactly supported wavelets More about the regularity of compactly supported wavelets Symmetry for compactly supported wavelet bases Characterization of functional spaces by means of wavelets Generalizations and tricks for orthonormal wavelet bases References Indexes.

16,073 citations


"SIMPLIcity: semantics-sensitive int..." refers background in this paper

  • ...The other three represent energy in high frequency bands of wavelet transforms [3], that is, the square root of the second order moment of wavelet coefficients in high frequency bands....


Journal ArticleDOI
TL;DR: In this article, the regularity of compactly supported wavelets and symmetry of wavelet bases are discussed. But the authors focus on the orthonormal bases of wavelets, rather than the continuous wavelet transform.
Abstract: Introduction Preliminaries and notation The what, why, and how of wavelets The continuous wavelet transform Discrete wavelet transforms: Frames Time-frequency density and orthonormal bases Orthonormal bases of wavelets and multiresolutional analysis Orthonormal bases of compactly supported wavelets More about the regularity of compactly supported wavelets Symmetry for compactly supported wavelet bases Characterization of functional spaces by means of wavelets Generalizations and tricks for orthonormal wavelet bases References Indexes.

14,157 citations

Journal ArticleDOI
TL;DR: This work treats image segmentation as a graph partitioning problem and proposes a novel global criterion, the normalized cut, for segmenting the graph, which measures both the total dissimilarity between the different groups as well as the total similarity within the groups.
Abstract: We propose a novel approach for solving the perceptual grouping problem in vision. Rather than focusing on local features and their consistencies in the image data, our approach aims at extracting the global impression of an image. We treat image segmentation as a graph partitioning problem and propose a novel global criterion, the normalized cut, for segmenting the graph. The normalized cut criterion measures both the total dissimilarity between the different groups as well as the total similarity within the groups. We show that an efficient computational technique based on a generalized eigenvalue problem can be used to optimize this criterion. We applied this approach to segmenting static images, as well as motion sequences, and found the results to be very encouraging.

13,789 citations

Frequently Asked Questions (19)
Q1. What are the contributions in "Simplicity: semantics-sensitive integrated matching for picture libraries" ?

The authors present here SIMPLIcity (Semantics-sensitive Integrated Matching for Picture LIbraries), an image retrieval system, which uses semantics classification methods, a wavelet-based approach for feature extraction, and integrated region matching based upon image segmentation. Potentially, the categorization enhances retrieval by permitting semantically-adaptive searching methods and narrowing down the searching range in a database.

The authors are planning to build a sharable testbed for statistical evaluation of different CBIR systems. 

The automatic derivation of semantically-meaningful information from the content of an image is the focus of interest for most research on image databases.

The algorithm the authors used to classify image blocks is based on a probability density analysis of wavelet coefficients in high frequency bands. 

When the query image is in the database, it takes about 1.5 seconds of CPU time on average to sort all the images in the 200,000-image database using the IRM similarity measure.

The main task of designing a signature is to bridge the gap between image semantics and the pixel representation, that is, to create a better correlation with image semantics.

Region-based retrieval systems attempt to overcome the deficiencies of color layout search by representing images at the object-level. 

To compute the feature vectors for the 200,000 color images of size 384 × 256 in their general-purpose image database requires approximately 60 hours.

Existing general-purpose CBIR systems roughly fall into three categories depending on the approach to extract signatures: histogram, color layout, and region-based search. 

The approach taken by the recent WALRUS system [14] to reduce the shifting and scaling sensitivity for color layout search is to exhaustively reproduce many subimages based on an original image. 

If texture and shape features are also used to distinguish patterns, the number of patterns in the library will increase dramatically, roughly exponentially in the number of features if patterns are obtained by uniformly quantizing features. 

The SIMPLIcity system has been implemented with a general-purpose image database including about 200,000 pictures, which are stored in JPEG format with size 384 × 256 or 256 × 384.

The authors used LUV color space and a matching metric similar to the EMD described in [18] to extract color histogram features and match in the categorized image database. 

Because the definition of the CRT descriptor matrix relies on the pattern library, the system performance depends critically on the library. 

If the query image is not already in the database, one extra second of CPU time is spent to extract the feature from the query image. 

The authors may notify the user that two images are considered to be very close when the IRM distance between the two images is less than 15. 

To view the images better or to see more matched images, users can visit the demonstration Web site and use the query image ID to repeat the retrieval. 

As shown by the segmentation results in Fig. 3, regions in textured images tend to scatter in the entire image, whereas nontextured images are usually partitioned into clumped regions. 

The application of SIMPLIcity to a database of about 200,000 general-purpose images shows more accurate and much faster retrieval compared with the existing algorithms.