Content-based Representation and Retrieval of Visual Media: A State-of-the-Art Review
(To appear in Multimedia Tools and Applications, special issue on Representation and Retrieval of Visual Media.)
Philippe Aigrain
Institut de Recherche en Informatique de Toulouse, Universite Paul Sabatier
118, route de Narbonne, F-31062 Toulouse Cedex, France
HongJiang Zhang
Broadband Information Systems Lab, Hewlett-Packard Labs
1501 Page Mill Road, Palo Alto, CA 94304, USA
Dragutin Petkovic
IBM Almaden Research Center
San Jose, CA 95120-6099, USA
Abstract
This paper reviews a number of recently available techniques in content analysis of visual media and their application to the indexing, retrieval, abstracting, relevance assessment, interactive perception, annotation and re-use of visual documents.
1. Background
A few years ago, the problems of representation and retrieval of visual media were confined to specialized image databases (geographical, medical, pilot experiments in computerized slide libraries), to the professional applications of the audiovisual industries (production, broadcasting and archives), and to computerized training or education. The present development of multimedia technology and information highways has put content processing of visual media at the core of key application domains: digital and interactive video, large distributed digital libraries, and multimedia publishing. Though the most important investments have been targeted at the information infrastructure (networks, servers, coding and compression, delivery models, multimedia systems architecture), a growing number of researchers have realized that content processing will be a key asset in putting together successful applications. The need for content processing techniques has been made evident from a variety of angles, ranging from achieving better quality in compression, allowing user choice of programs in video-on-demand, and achieving better productivity in video production, to providing access to large still image databases and integrating still images and video in multimedia publishing and cooperative work.
Content-based retrieval of visual media and representation of visual documents in human-computer interfaces are based on the availability of content representation data (time-structure for time-based media, image signatures, object and motion data). When it is possible, the human production of this descriptive data is so time-consuming, and thus costly, that it is almost impossible to generate it for large document spaces. There is some hope that for video documents some of this data will be created at production time and coded in the document itself. Nonetheless, it will never be available for many existing documents, and when considering the history of media and carriers one is led to a very cautious estimate of how often this type of information will really be available even in future documents. Thus, there is a clear need for automatic analysis tools which are able to extract representation data from the documents.
The researchers involved in content processing efforts come from various backgrounds, for instance:
- the publishing, entertainment, retail or document industry, where researchers try to extend their activity to visual documents, or to integrate them in hypertext-based new document types,
- the AV hardware and software industry, primarily interested in digital editing tools and other programme production tools,
- academic laboratories where research had been conducted for some time on computer analysis and access to existing visual media, such as the MIT Media Laboratory, the Institute of Systems Science in Singapore, or IRIT in France,
- large telecommunication company laboratories, where researchers are primarily interested in cooperative work and remote access to visual media,
- the robotics vision, signal processing, image sequence processing for security, or data compression research communities, who try to find new applications for their models of images or human perception,
- computer hardware manufacturers developing digital library or visual media research programs.
These researchers originally used very different models and techniques and often conflicting vocabulary. After a few years of lively confusion and exciting achievements, it is now possible to draw a clearer panorama of the state of this emerging field, and to outline some of its possible directions of development.
In this paper, we review the methods available for different types of visual content analysis and representation, together with their applications, and survey some open research problems. Section 2 covers various visual features for representing and comparing image content. Section 3 reviews video content parsing and representation algorithms and schemes, including temporal segmentation, video abstraction, shot comparison and soundtrack analysis. Section 4 presents applications of visual representation schemes in content-based image and video retrieval and browsing. Finally, Section 5 summarizes our survey and current research directions.
2. The many facets of image similarity
Retrieval of still images by similarity, i.e. retrieving images which are similar to an already retrieved image (retrieval by example) or to a model or schema, is a relatively old idea. Some might date it to the mnemotechnical ideas of antiquity, but more seriously it appeared in specialized geographical information system databases around 1980, in particular in the Query by Pictorial Example system of IMAID [CF80]. From the start, it was clear that retrieval by similarity called for specific definitions of what it means to be similar. In the mapping system, a satellite image was matched to existing map images from the point of view of similarity of road and river networks, easily extracted from images by edge detection. Apart from paper models [Aig87], it was only in the beginning of the 90s that researchers started to look at retrieval by similarity in large sets of heterogeneous images with no specific model of their semantic contents. The prototype systems of Kato [Kat92], followed by the availability of the QBIC commercial system using several types of similarities [FSN+95], contributed to making this idea more and more popular.
A system for retrieval by similarity rests on three components (a minimal sketch of such a pipeline follows this list):
- extraction of features or image signatures from the images, and an efficient representation and storage strategy for this precomputed data,
- a set of similarity measures, each of which captures some perceptually meaningful definition of similarity, and which should be efficiently computable when matching an example with the whole database,
- a user interface for the choice of which definition(s) of similarity should be applied for retrieval, for the ordered and visually efficient presentation of retrieved images, and for supporting relevance feedback.
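As a concrete illustration of these three components, the following is a minimal Python/NumPy sketch of a retrieval-by-example pipeline; the coarse RGB-histogram signature and L1 measure are placeholder choices for illustration only, not the design of any particular system discussed in this paper.

# Minimal sketch of a retrieval-by-example pipeline (illustrative only).
# Assumes images are NumPy uint8 arrays of shape (H, W, 3) with values 0-255.
import numpy as np

def extract_signature(image):
    # Placeholder signature: a coarse, normalized RGB histogram.
    hist, _ = np.histogramdd(image.reshape(-1, 3),
                             bins=(8, 8, 8), range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def dissimilarity(sig_a, sig_b):
    # L1 distance between signatures; any perceptually grounded
    # measure could be substituted here.
    return np.abs(sig_a - sig_b).sum()

def build_index(images):
    # Precompute and store signatures for the whole collection.
    return [extract_signature(img) for img in images]

def query_by_example(example, index, k=10):
    # Rank the collection by dissimilarity to the example image.
    q = extract_signature(example)
    scores = [(dissimilarity(q, sig), i) for i, sig in enumerate(index)]
    return sorted(scores)[:k]
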
Recent work has made it evident that:
- A large number of meaningful types of similarity can and must be defined. Only some of these definitions are associated with efficient feature extraction mechanisms and (dis)similarity measures.
- Since there are many definitions of similarity and the discriminating power of each of the measures is likely to degrade significantly for large image databases, the user interface and the feature storage strategy components of the systems will play a more and more essential role. We will come back to this point in Section 4.1.
- Visual content-based retrieval is best utilized when combined with traditional search, both at the user interface and the system level. The basic reason for this is that we do not see content-based retrieval replacing the ability of parametric (SQL) search, text and keywords to represent the rich semantic content of the visual material (names, places, actions, prices, etc.). The key is to apply content-based retrieval where appropriate, that is, where the use of text and keywords is suboptimal. Examples of such applications are those where visual appearance (e.g. color, texture, shape, motion) is an important search argument, as in stock photo/video, art, retail, on-line shopping, etc. Not only does content-based retrieval reduce the high variability among human indexers, it also enables more "fuzzy" browsing and search, which in many applications is an essential part of the process. It is obvious then that content-based retrieval involves strong user interaction, thus necessitating the development of special fast browsers and UI techniques (a minimal sketch of combining a keyword filter with similarity ranking follows this list).
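The filter-then-rank combination argued for above can be sketched as follows. This sketch assumes the hypothetical extract_signature and dissimilarity helpers from the previous example, plus per-image keyword sets; it is only one of many possible ways of coupling parametric and visual search.

# Sketch of combining traditional (keyword/parametric) search with
# content-based ranking: filter by metadata first, then rank the
# surviving candidates by visual dissimilarity to the example image.
def hybrid_search(example, index, metadata, keywords, k=10):
    # index: precomputed signatures; metadata: list of keyword sets,
    # aligned with the index; keywords: a set of required keywords.
    candidates = [i for i, kw in enumerate(metadata) if keywords <= kw]
    q = extract_signature(example)
    ranked = sorted((dissimilarity(q, index[i]), i) for i in candidates)
    return ranked[:k]
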
In this section, we briefly survey the various types of similarity definitions and associated feature extraction mechanisms and measures for systems which do not assume any specific image domain or a priori semantic knowledge about the images.
Gudivada has listed possible types of similarity for retrieval in [Gudivada95]: color similarity, texture similarity, shape similarity, spatial similarity, etc. Some of these types can be considered over all of an image or only part of it, and can be considered independently of scale or angle or not, depending on whether one is interested in the scene represented by the image or in the image per se.
2.1 Color similarity
Color distribution similarity has been one of the first choices [HK92, FSN+95] because, if one chooses a proper representation and measure, it can be partially reliable even in the presence of changes in lighting, view angle, and scale. For the capture of properties of the global color distribution in images, the need for a perceptually meaningful color model leads to the choice of HLS (Hue-Luminosity-Saturation) models, and of measures based on the first three moments of color distributions [SO94] in preference to histogram distances. It has been proposed in [AJL95] to use hue and saturation distributions only when one wants to capture lighting-independent color distribution properties, which are good signatures of a scene when the scale does not change too much. In this case one can identify the hue-saturation perceptual space with the complex unit disc and define measures using statistical moments in this space. This is useful to avoid the biases of measures which do not take into account the circular nature of hue, and could be further refined to distinguish between true spectral hues and the purples. Stricker and Orengo have argued in [SO94] for the importance of including the third moment (distribution skewness) in the definition of the similarity measure.
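A minimal sketch of a color-moment signature computed in a hue/saturation space, written in the spirit of [SO94] and [AJL95] but not reproducing either paper's exact formulation; the embedding of (hue, saturation) in the complex unit disc follows the idea described above, and matplotlib is assumed to be available for the RGB-to-HSV conversion.

# Sketch of color-moment signatures in a hue/saturation space (illustrative).
import numpy as np
from matplotlib.colors import rgb_to_hsv  # assumes matplotlib is installed

def color_signature(image_rgb):
    # image_rgb: float array in [0, 1], shape (H, W, 3).
    hsv = rgb_to_hsv(image_rgb).reshape(-1, 3)
    h, s, v = hsv[:, 0], hsv[:, 1], hsv[:, 2]

    # Embed (hue, saturation) in the complex unit disc so that the
    # circular nature of hue is respected (cf. [AJL95]).
    z = s * np.exp(2j * np.pi * h)

    def first_three_moments(x):
        mean = x.mean()
        centered = x - mean
        return mean, (centered ** 2).mean(), (centered ** 3).mean()

    # Moments of the disc embedding (lighting-independent part) plus
    # moments of the value channel, following the three-moment idea of [SO94].
    return (first_three_moments(z.real) + first_three_moments(z.imag)
            + first_three_moments(v))
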
One important difficulty with color similarity is that when using it for retrieval, a user will often be looking for an image "with a red object such as this one". This problem of restricting color similarity to a spatial component, and more generally of combining spatial similarity and color similarity, is also present for texture similarity. It explains why prototype and commercial systems have included complex ad hoc mechanisms in their user interfaces to combine various similarity functions.
2.2 Texture similarity
For texture as for color, it is essential to define a well-founded perceptual space. Picard and Liu [PL94] have shown that it is possible to do so using the Wold decomposition of the texture considered as a luminance field. One gets three components (periodic, evanescent and random) corresponding to the bi-dimensional periodicity, mono-dimensional orientation, and complexity of the analyzed texture. Experiments have shown that these independent components agree well with the perceptual evaluation of texture similarity [TMY79]. The related similarity measures have led to remarkably efficient results, including for the retrieval of large-scale textures such as images of buildings and cars [PM95]. In the QBIC system, the Tamura texture features (coarseness, contrast and directionality) are used [FSN+95]. But of course one is again confronted with the problem of combining texture information with the spatial organization of several textures (see below).
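For illustration only, the following sketch computes two crude proxies for texture attributes mentioned above (a contrast statistic and a gradient-based directionality statistic); these are not the Wold or Tamura formulations, merely simple statistics one might start from.

# Rough, illustrative texture statistics; NOT the Tamura or Wold models.
import numpy as np

def texture_sketch(gray):
    # gray: 2-D float array (luminance field).
    contrast = gray.std() / (gray.mean() + 1e-8)

    # Gradient-orientation histogram as a crude directionality signature.
    gy, gx = np.gradient(gray)
    angles = np.arctan2(gy, gx)
    magnitude = np.hypot(gx, gy)
    hist, _ = np.histogram(angles, bins=16, range=(-np.pi, np.pi),
                           weights=magnitude)
    hist = hist / (hist.sum() + 1e-8)
    # A peaked histogram suggests a strongly oriented texture.
    directionality = hist.max()
    return contrast, directionality, hist
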
2.3 Shape similarity
A proper definition of shape similarity calls for the distinction between shape similarity in images (similarity between actual geometrical shapes appearing in the images) and shape similarity between the objects depicted by the images, i.e. similarity modulo a number of geometrical transformations corresponding to changes in view angle, optical parameters and scale. In some cases, one wants to include even the deformation of non-rigid bodies. The first type of similarity has attracted research work only for calibrated image databases of special types of objects, such as ceramic plates. Even in this case, researchers have tried to define shape representations which are scale-independent, resting on curvature, angle statistics and contour complexity. Systems such as QBIC [FSN+95] use circularity, eccentricity, major axis orientation (not angle-independent) and algebraic moments. It should be noted that in some cases the user of a retrieval system will want a definition of shape similarity which is dependent on view angle (for instance, will want to retrieve trapezoids with a horizontal base and not other trapezoids).
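A minimal sketch of how global descriptors of the kind listed for QBIC (circularity, eccentricity, major-axis orientation) can be computed from a binary object mask using second-order central moments; the perimeter estimate is deliberately crude and the whole routine is illustrative, not QBIC's implementation.

# Sketch of simple global shape descriptors from a binary mask (illustrative).
import numpy as np

def shape_descriptors(mask):
    # mask: 2-D boolean array, True on the object; assumed non-empty.
    ys, xs = np.nonzero(mask)
    area = len(xs)
    cx, cy = xs.mean(), ys.mean()

    # Central second-order moments.
    mu20 = ((xs - cx) ** 2).mean()
    mu02 = ((ys - cy) ** 2).mean()
    mu11 = ((xs - cx) * (ys - cy)).mean()

    # Major-axis orientation (not rotation-invariant, as noted above).
    orientation = 0.5 * np.arctan2(2 * mu11, mu20 - mu02)

    # Eccentricity from the eigenvalues of the covariance matrix.
    common = np.sqrt((mu20 - mu02) ** 2 + 4 * mu11 ** 2)
    lam_max = (mu20 + mu02 + common) / 2
    lam_min = (mu20 + mu02 - common) / 2
    eccentricity = np.sqrt(1 - lam_min / lam_max) if lam_max > 0 else 0.0

    # Crude perimeter: object pixels with at least one background 4-neighbour.
    padded = np.pad(mask, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    perimeter = (mask & ~interior).sum()
    circularity = 4 * np.pi * area / (perimeter ** 2) if perimeter else 0.0
    return circularity, eccentricity, orientation
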
In the general case, a promising approach has been proposed by Sclaroff and Pentland [PS91, SP95], in which shapes are represented as canonical deformations of prototype objects. In this approach, a "physical" model of the 2D shape is built using a new form of Galerkin's interpolation method (finite-element discretization). The possible deformation modes are analyzed using the Karhunen-Loeve transform. This yields an ordered list of deformation modes corresponding to rigid-body modes (translation, rotation), low-frequency non-rigid modes associated with global deformations, and higher-frequency modes associated with localized deformations.
As for color and texture, the present schemes for shape similarity modelling are faced with serious difficulties when images include several objects or a background. A preliminary segmentation, as well as a modelling of spatial relationships between shapes, is then necessary (are we interested in finding images where one region represents a shape similar to a given prototype, or images matching some spatial organization of several shapes?).
2.4 Spatial similarity
Gudivada and Raghavan [GR95] have treated spatial similarity in the situation in which it is assumed that images have been (automatically or manually) segmented into meaningful objects, each object being associated with its centroid and a symbolic name. Such a representation is called a symbolic image, and it is relatively easy to define similarity functions for such images modulo transformations such as rotation, scaling and translation. Efforts have also been made to address spatial similarity directly (without segmentation and object indexing). This was the case, for instance, in the original work of Kato [Kat92], in the limited case of direct spatial similarity (without geometrical transformation), using a number of ad hoc statistical features computed on very low-resolution images.
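A minimal sketch of comparing two symbolic images, here represented as hypothetical name-to-centroid dictionaries, after normalizing away translation and uniform scale; handling rotation, as in [GR95], would require an additional alignment step not shown here.

# Sketch of comparing "symbolic images" (name -> centroid) up to
# translation and uniform scaling (illustrative, not the [GR95] method).
import numpy as np

def normalize(symbolic):
    # symbolic: dict mapping object name to (x, y) centroid.
    names = sorted(symbolic)
    pts = np.array([symbolic[n] for n in names], dtype=float)
    pts -= pts.mean(axis=0)                       # translation invariance
    scale = np.sqrt((pts ** 2).sum(axis=1).mean())
    if scale > 0:
        pts /= scale                              # scale invariance
    return dict(zip(names, pts))

def symbolic_dissimilarity(sym_a, sym_b):
    # Compare only the objects present in both symbolic images.
    a, b = normalize(sym_a), normalize(sym_b)
    shared = set(a) & set(b)
    if not shared:
        return float("inf")
    return float(np.mean([np.linalg.norm(a[n] - b[n]) for n in shared]))
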
2.5 Object presence analysis
Finding the images in a set in which a particular object or type of object appears (all images with cars, all shots in a video in which a given character is present) is a particular case of similarity computation. Once again, the range of applicable methods is defined by the invariants of the object to be recognized. For color images, and for images whose color does not change, local color distribution is efficient, and can be reliable even when changes in scale or angle occur [NT92]. In the general case, the best results so far have been obtained with texture-based models [PPDH94]. A pyramidal analysis of texture (with the whole image considered as the texture

References
- Query by image and video content: the QBIC system.
- Similarity of color images.
- Photobook: content-based manipulation of image databases.
- Automatic partitioning of full-motion video.
- Photobook: tools for content-based manipulation of image databases.