
A Coherent Computational Approach to
Model Bottom-Up Visual Attention
Olivier Le Meur, Patrick Le Callet, Member, IEEE,
Dominique Barba, Senior Member, IEEE, and Dominique Thoreau
Abstract—Visual attention is a mechanism which filters out redundant visual information and detects the most relevant parts of our visual
field. Automatic determination of the most visually relevant areas would be useful in many applications such as image and video coding,
watermarking, video browsing, and quality assessment. Many research groups are currently investigating computational modeling of the
visual attention system. The first published computational models have been based on some basic and well-understood Human Visual
System (HVS) properties. These models feature a single perceptual layer that simulates only one aspect of the visual system. More recent
models integrate complex features of the HVS and simulate hierarchical perceptual representation of the visual input. The bottom-up
mechanism is the most occurring feature found in modern models. This mechanism refers to involuntary attention (i.e., salient spatial
visual features that effortlessly or involuntary attract our attention). This paper presents a coherent computational approach to the
modeling of the bottom-up visual attention. This model is mainly based on the current understanding of the HVS behavior. Contrast
sensitivity functions, perceptual decomposition, visual masking, and center-surround interactions are some of the features implemented
in this model. The performance of this algorithm is assessed using natural images and experimental measurements from an eye-
tracking system. Two well-known metrics (the correlation coefficient and the Kullback-Leibler divergence) are used to validate this
model. A further metric is also defined. The results from this model are finally compared to those from a reference bottom-up model.
Index Terms—Computationally modeled human vision, bottom-up visual attention, coherent modeling, eye tracking experiments.
1 INTRODUCTION
VISUAL attention is one of the most important features of
the human visual system. Rather than speaking about
the usefulness of visual attention, which seems obvious, it is
worth lingering over its description. The first attempt dates back
to 1890, when James [1] suggested that everyone knows what
attention is. It is the taking possession by the mind, in clear and
vivid form, of one out of what seem several simultaneously possible
objects or trains of thought. In other words, visual attention
serves as a mediating mechanism that involves competition
between different aspects of the visual scene and selects the
most relevant areas to the detriment of others.
Nevertheless, our environment presents far more per-
ceptual information than can be effectively processed. In
order to keep the essential visual information, humans have
developed a particular strategy, first outlined by James.
This strategy, confirmed during the last two decades,
involves two mechanisms. The first refers to the sensory
attention driven by environmental events, commonly called
bottom-up or stimulus-driven. The second one is the
voluntary attention to both external and internal stimuli,
commonly called top-down or goal-driven.
Most recent computational models of visual attention can
be placed in two categories. A recent trend concerns a
statistical signal-based approach [2] which consists of
automatically predicting salient regions of the visual scene
by directly using image statistics at the point of gaze. In fact,
several studies have recently reported [3], [4], [5] that the
human fixation regions present higher spatial contrast and
spatial entropy than random fixation regions. These studies
show that human eyes movements are not necessarily
random but rather driven by particular features. The second
category consists of models [6], [7], [8], [9], [10], [11] built
around two important concepts: the Feature Integration
Theory (FIT) from Treisman and Gelade [12] and a neurally
plausible architecture proposed by Koch and Ullman [13].
The FIT suggests that visual information is analyzed in
parallel from different maps. These maps are retinotopically
organized according to locations in our visual field. There is a
map for each early visual feature. From this theory, several
frameworks for simulating human visual attention have been
designed. The most interesting one has been proposed by
Koch and Ullman [13]. Their framework is based on the
concept of saliency map which is a two-dimensional topo-
graphic representation of conspicuity for every pixel in the
image. Fig. 1 illustrates the general structure of their model. It
mainly consists of early visual feature extraction, feature
map building, and feature map fusion.
In this paper, a new bottom-up model based on the FIT and
the plausible architecture proposed by Koch and Ullman [13]
is described. Its purpose is to automatically detect the most
relevant parts of a color picture displayed on a television
screen. The general philosophy of this approach is to design a
biologically-inspired algorithm that performs better than
. O. Le Meur and D. Thoreau are with the Video Compression Laboratory,
Thomson, 1 avenue Belle Fontaine-CS 17616, 35576 Cesson-Sévigné
Cedex, France. E-mail: {olivier.le-meur, dominique.thoreau}@thomson.net.
. P. Le Callet and D. Barba are with the Institut de Recherche en
Communications et Cybernétique de Nantes (IRCCyN) Laboratory, Ecole
Polytechnique de l'Université de Nantes, Rue Christian Pauc-BP 50609,
44306 Nantes Cedex 3, France.
E-mail: {patrick.lecallet, dominique.barba}@polytech.univ-nantes.fr.
Manuscript received 20 July 2004; revised 24 Aug. 2005; accepted 12 Sept.
2005; published online 13 Mar. 2006.
Recommended for acceptance by M. Srinivasan.
For information on obtaining reprints of this article, please send e-mail to:
tpami@computer.org, and reference IEEECS Log Number TPAMI-0367-0704.
0162-8828/06/$20.00 © 2006 IEEE. Published by the IEEE Computer Society

conventional approaches. The proposed model is based on a
coherent psychovisual space from which a saliency map is
deduced. This space, well justified by psychophysical
experiments, is used to combine the visual features (intensity,
color, orientation, spatial frequencies, etc.) of the image, which are
normalized to their individual visibility thresholds. Accurate
nonlinear models simulating the behavior of visual cells are used
to calculate the visibility threshold associated with each value of
each component. From this coherent psychovisual space, a
new way of calculating a saliency map is proposed.
The paper is organized as follows: Section 2 gives insight
into the natural mechanisms that allow us to reduce the
amount of visual information. Experiments are conducted
to record and track real observers' eye movements with an
eye tracking apparatus. These experiments aim to build the
ground truth required to achieve a performance assessment
of the bottom-up model described here. These experiments are
presented in Section 3. The proposed coherent computa-
tional approach to model the bottom-up visual attention is
described in Section 4. In Section 5, the performance of this
model is evaluated, both qualitatively and quantitatively,
using relevant metrics. A particular saliency-based applica-
tion is then briefly described. Finally, the results are
summarized and some conclusions are drawn in Section 6.
2 THE NATURAL SELECTION OF THE VISUAL
INFORMATION
2.1 A Passive Selection
The HVS acts as a passive selector, acknowledging some stimuli
but rejecting others. The first information reduction occurs
in the retina, where the photoreceptors only process the
wavelengths of visible light. The neural signal is then
processed by ganglion cells, which are insensitive to uniform
illumination. This particular property is due to the spatial
organization of their receptive fields (RF). This fundamental
notion was first emphasized in the work of Hartline [33]: The
RF is defined as a particular region of the retina within which
an appropriate stimulation gives a relevant response. The RF
presents an antagonistic center-surround organization. The
center is roughly circular and is surrounded by an annulus. These
two regions provide an opposite response for the same
stimulation. This center-surround organization is responsi-
ble for our high sensitivity to contrast and to spatial
frequency, leading to the definition of the Contrast Sensitivity
Function (CSF).
The responses stemming from the retinal neurons are then
transmitted to the primary visual cortex. Hubel and Wiesel,
who received the Nobel Prize in Physiology or Medicine
in 1981, discovered that the RF structure of cortical cells
is considerably different from that of retinal
and lateral geniculate nucleus (LGN) cells. The RFs of retinal
and LGN cells have a circular structure with a center-
surround organization whereas the cortical cells present an
elongated RF and respond best to a particular orientation
and to a particular spatial frequency. In addition, recent
studies [15], [16], [17], [18], [19], [20] have shown that a
cortical cell's response can be influenced by stimuli outside
its classical RF. These contextual influences are mediated
by long-range connections linking cells with nonoverlap-
ping receptive fields. Studies by Kapadia et al. [19], [20]
show that a cell's response can be greatly enhanced by the
presentation of coaligned, co-oriented stimuli in its
neighborhood, and that it increases with the number of appropriate
stimuli placed outside the CRF. Generally speaking, the
contour and feature linking [21], [23], [43] and texture segmen-
tation [22] are assumed to be closely related to the long-
range connections.
2.2 An Active Selection
Human beings have a collection of passive mechanisms
lessening the amount of incoming visual information. For
instance, the signal stemming from the photoreceptors is
assumed to be compressed by a factor of about 130:1, before it
is transmitted to the visual cortex. Nevertheless, the visual
system is still faced with too much information. To deal with
the still overwhelming amount of input, an active selection,
involving eye movement, is required to allocate processing
resources to some parts of our visual field. Oculomotor
mechanisms involve different types of eye movements. A
saccade is a rapid eye movement that allows the gaze to jump from one
location to another. The purpose of this type of eye move-
ment, occurring up to three times per second, is to direct a
small part of our visual field into the fovea in order to achieve
a closer inspection. This last step corresponds to a fixation.
Saccades are therefore a major instrument of selective
visual attention. This active selection is assumed to be
controlled by two major mechanisms called bottom-up and
top-down control. The former, the bottom-up attentional
selection, is linked to involuntary attention. This mechanism
is fast, involuntary, and stimulus-driven. Our attention is
effortlessly drawn to salient parts in our visual field. These
salient parts consist of abrupt onsets [25] or local
singularities [12]. An image containing one green circle (called
target) located among a number of red circles (distractors) is a
classic example. The target is easily seen against the red
circles due to its local singularity (its local hue), no matter how
many distractors are present. The appearance of a new
perceptual object, whether consistent with the context of the
scene or not, can also attract our attention [24], [26]. Several studies
have shown that observers tend to make longer and more
frequent fixations on such objects [24].
The second control, top-down attentional selection,
refers to voluntary attention closely linked to the experience
Fig. 1. Framework proposed by Koch and Ullman. Early visual features
are extracted from the visual input into several separate parallel
channels. After this extraction and a particular treatment, a feature
map is obtained for each channel. Next, the saliency map is built by
fusing all these maps.

of the observers and to the task they have in mind.
Compared to the bottom-up attentional selection, the top-
down mechanism, voluntary and task-driven, is slower.
3 EYE TRACKING EXPERIMENTS
3.1 Apparatus and Procedure
In order to track and record real observers' eye movements,
experiments were conducted using an eye tracker from
Cambridge Research Corporation. This apparatus is
mounted on a rigid headrest for greater measurement
accuracy (less than 0.5 degree of error on the fixation point).
Experiments were conducted in normalized conditions
(ITU-R BT.500-10) at a viewing distance of four times the
TV monitor height. Ten natural color images with various
contents have been selected. The quality of these pictures was
then degraded using different techniques (spatial filtering,
JPEG and JPEG2000 coding, etc.). Forty-six pictures were finally
obtained. Every image was seen in random order by up to
40 observers for 15 seconds each in a task-free viewing mode.
The collected data correspond to regular time samples (every
20 ms) of the eye gaze position on the monitor.
3.2 Human Fixation Density Map Computation
A fixation map, which encodes the conspicuous locations, is
computed from the collected data. For a particular picture
and for each observer, the samples corresponding to
saccades are filtered out. A data point is removed if the
number of data points included in a square window is below a
given threshold. The size of the window and the threshold
are functions of the viewing distance, the accuracy of the eye
tracker (0.25 degrees of visual angle) and the resolution of
the display (800 × 600 pixels). In practice, the size of the
window and the threshold are, respectively, 9 × 9 (corre-
sponding to 0.25 degrees of visual angle) and 5 (correspond-
ing to the number of data points required in the previously
defined window).
All fixation patterns for a given picture are added together
providing a spatial distribution of human fixations (see
examples in Fig. 2). The resulting map is then smoothed
using a two-dimensional Gaussian filter. Its standard devia-
tion is determined according to the accuracy of the eye-
tracking apparatus. The result is a fixation density map [34],
which represents the observers' regions of interest (RoI). This
is often compared to a landscape map [35] consisting of peaks
and valleys (see examples in Fig. 2).
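
To make the procedure concrete, here is a short Python sketch of the two steps just described: removing saccade samples with the 9 × 9 window and threshold of 5 given above, then accumulating the surviving samples and smoothing with a 2D Gaussian. The function names, the smoothing standard deviation, and the assumption that the gaze data arrive as (x, y) pixel positions are ours, not the paper's.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def filter_saccades(samples, win=9, min_count=5):
    """Keep only gaze samples belonging to fixations: a sample survives
    if at least `min_count` samples fall in a win x win pixel window
    centered on it (9 x 9 ~ 0.25 deg and threshold 5, as in the text)."""
    pts = np.asarray(samples, dtype=float)      # shape (N, 2): (x, y)
    half = win // 2
    kept = []
    for x, y in pts:
        inside = (np.abs(pts[:, 0] - x) <= half) & (np.abs(pts[:, 1] - y) <= half)
        if inside.sum() >= min_count:
            kept.append((x, y))
    return kept

def fixation_density_map(all_observers, width=800, height=600, sigma=4.0):
    """Accumulate the filtered fixations of every observer, then smooth
    with a 2D Gaussian; sigma (an assumed value) stands in for the
    eye tracker accuracy mentioned in the text."""
    fmap = np.zeros((height, width))
    for samples in all_observers:
        for x, y in filter_saccades(samples):
            xi = min(max(int(round(x)), 0), width - 1)   # clip to screen
            yi = min(max(int(round(y)), 0), height - 1)
            fmap[yi, xi] += 1.0
    density = gaussian_filter(fmap, sigma=sigma)
    return density / density.max()              # normalized to [0, 1]
```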
3.3 Conclusions from Empirical Data
3.3.1 Coverage
Coverage has been previously defined by Wooding [34] in
the following terms: the coverage is a measure of the amount of
the original stimulus covered by the fixations. The coverage
value is therefore given by the ratio between the number of
fixated pixels and the number of inspected pixels. A
threshold, called T , is required in order to decide whether
a pixel is fixated or not.
The coverage value is assessed on the human fixation
density maps for three threshold values (0.25, 0.5, and 0.75) and
for three viewing times (2 s, 8 s, and 14 s). Table 1 gives the
results for three pictures (Kayak (see Fig. 3), Rapids (second
row of the Fig. 2), and ChurchandCapitol).
As expected, the coverage value increases with increas-
ing viewing time and decreasing threshold T . Moreover, the
coverage value is highly dependent on the picture content:
for picture Kayak, the coverage is equal to 21 percent for a
viewing time of 14s and for a threshold of 0.75 whereas, in
the same condition, the coverage is about 40 percent for
picture ChurchandCapitol. It is worth noticing that only a
small area of the pictures (on average 36 percent across the
three thresholds T for 14 s of viewing time) has been fixated.
In fact, humans tend to fixate areas of interest rather than
scan the whole scene.
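
As a minimal sketch, the coverage measure reduces to a thresholded pixel count; here `density` is assumed to be a fixation density map normalized to [0, 1], such as the one built in the sketch of Section 3.2.

```python
def coverage(density, T=0.25):
    """Wooding's coverage: fraction of pixels whose normalized
    fixation density exceeds the threshold T."""
    return float((density > T).sum()) / density.size
```

With T = 0.75 and 14 s of viewing, this would return about 0.21 for a picture like Kayak and about 0.40 for ChurchandCapitol, per Table 1.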
3.3.2 Bias toward the Central Part of Pictures
Fig. 2 shows the spatial distribution and the density of human
fixations. These results are coherent with a well-known
property of the human visual strategy. Observers have a
general tendency to stare at the central locations of the screen.
This tendency is not reduced with the viewing time: It can be
Fig. 2. (a) The original picture, (b) the spatial distribution of human fixations for 14 s of viewing time, (c) fixation density map obtained by convolving
the spatial distribution with a 2D Gaussian filter, and (d) highlighted human RoI (Regions of Interest) obtained by redrawing the original picture,
leaving the nonfixated areas in darkness.

shown that observers continue to focus on these areas rather
than scan the whole picture. There are at least two plausible
explanations: The nonuniform distribution of photoreceptors
is a biological candidate. However, it seems more logical to
tackle this question by introducing a top-down or higher-
level explanation as proposed by Parkhurst et al. [14]. The
great majority of visually important information is tradition-
ally located in the central part of the picture frame.
Consequently, observers unconsciously tend to select central
locations in order to catch the potentially most important
visual information.
4 THE PROPOSED COMPUTATIONAL MODEL
The model proposed in this paper is based on the architecture
of Koch and Ullman. The model designed by Itti et al. [7] was
one of the first to take advantage of such an architecture. It has
been chosen as the benchmark for the model presented here
and is therefore briefly described hereafter.
The first step of Itti et al.’s model consists of the extraction
of early visual features. The visual input is broken down into
three separate feature channels (color, intensity, and orienta-
tion). Each channel is obtained from Gaussian pyramids as in
[32]. This allows the computation of different spatial scales by
progressively applying a low-pass filter and subsampling the
visual features. In order to take into account the organization
of the visual cells, a center-surround mechanism based on a
Difference of Gaussians (DoG) is applied at each scale. The
resulting maps are then linearly summed across feature
channels to form the saliency map.
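
For illustration, a rough Python sketch of this pyramid-plus-center-surround structure follows. It is indicative only: the actual model of Itti et al. [7] uses several center-surround scale pairs per channel and a dedicated normalization operator, both omitted here, and every name below is ours.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def gaussian_pyramid(channel, levels=5):
    """Progressively low-pass filter and subsample, as in [32]."""
    pyr = [np.asarray(channel, dtype=float)]
    for _ in range(levels - 1):
        pyr.append(gaussian_filter(pyr[-1], sigma=1.0)[::2, ::2])
    return pyr

def center_surround(pyr, center=1, surround=3):
    """DoG-like response: difference between a fine ('center') level
    and a coarse ('surround') level upsampled to the center's size."""
    c, s = pyr[center], pyr[surround]
    s_up = zoom(s, (c.shape[0] / s.shape[0], c.shape[1] / s.shape[1]), order=1)
    return np.abs(c - s_up)

def itti_like_saliency(channels):
    """Average the center-surround maps of the feature channels
    (color, intensity, orientation) into a single saliency map."""
    maps = [center_surround(gaussian_pyramid(ch)) for ch in channels]
    return sum(maps) / len(maps)
```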
Although this model provides good results on several
types of picture, it contains arbitrary steps that are difficult
to justify with respect to the HVS:
. several normalization steps are applied before and
after the fusion step,
. each channel is normalized independently to a
common scale in order to be independent of the
feature extraction mechanisms, and
. there are strong links between visual sensitivity and
viewing distance; however, these links have been
overlooked.
The proposed computational bottom-up model has been
developed bearing numerous properties of human visual
cells in mind. Three aspects of the vision process are
sequentially tackled, namely, the visibility, the perception,
and the perceptual grouping. The complete flow chart is
shown in Fig. 3 and described in the following sections.
4.1 Visibility Process
The visibility process simulates the limited sensitivity of the
HVS. Despite the seemingly complex mechanisms under-
lying human vision, the visual system is not able to
perceive all the information present in the visual field with the
same accuracy. A coherent normalization is first used to
scale all the visual data. A value of 1 represents a feature
which is just noticeable. All the normalized data is grouped
into a psychovisual space. This space is built from the
following set of basic mechanisms, all identified and
validated through psychophysical experiments.
4.1.1 Transformation of the RGB Luminance into
Krauskopf's Color Space
There are two different types of photoreceptors in the retina:
cones and rods. As TV displays operate at luminance levels
that do not correspond to scotopic conditions (low light levels),
rods can be neglected. Cones form the basis of color perception
and work at photopic conditions. Cones are of three types:
L-cones, M-cones, and S-cones which are sensitive to long,
medium, and short wavelengths, respectively. They are
mainly located in the central part of the retina, called fovea,
which is 2 degrees in diameter. Both psychological and
physiological experiments give evidence for the theory of an
early transformation in the HVS of the L, M, and S signals
issued from cone absorption. This transformation provides
an opponent-color space in which the signals are less
correlated. The principal components of the opponent-color
space are black-white (B-W), red-green (R-G), and blue-
yellow (B-Y). There is a variety of opponent-color spaces
which differ in the way they combine the different cone
responses. The color space proposed by Krauskopf was
validated through psychophysical experiments. These experi-
ments are based on the interaction between a color masking
signal and a color stimulus signal in terms of the differential
visibility threshold¹ (DVT) of the stimulus. The color
orientations of the masking and stimulus signals, respectively,
for which the DVT value is minimum are determined. These
experiments were carried out with both still and time-varying
stimuli. The color space is given by relation (1):
$$
\begin{pmatrix} A \\ Cr_1 \\ Cr_2 \end{pmatrix}
=
\begin{pmatrix} 1 & 1 & 0 \\ 1 & -1 & 0 \\ 0.5 & 0.5 & -1 \end{pmatrix}
\begin{pmatrix} L \\ M \\ S \end{pmatrix}
\qquad (1)
$$
1. The differential visibility threshold of a stimulus superimposed on a
background (masking signal) is defined as the magnitude required for the
stimulus to be just noticeable.
TABLE 1
Coverage Evolution as a Function of Viewing Time
and Picture Content

A is a pure achromatic perceptual signal, whereas $Cr_1$ and
$Cr_2$ are pure chromatic perceptual signals.
During these experiments, the adaptation effects through a
mechanism of "desensibilization" [16] were taken into account.
While Krauskopf used only a temporal "desensibilization"
mechanism, a spatial "desensibilization" mechanism was used
here. Both methods produced the same result.
4.1.2 Early Visual Features Extraction
It was previously mentioned that visual cells can be
characterized by a radial spatial frequency and by orienta-
tion. It could therefore be interesting to group visual cells
sharing similar properties. The early visual feature extrac-
tion, performed by a perceptual channel decomposition,
consists of splitting the 2D spatial frequency domain both in
spatial radial frequency and in orientation. This decomposi-
tion is applied to each of the three perceptual components.
Psychophysical experiments [17] show that psychovisual
spatial frequency partitioning for the achromatic component
leads to 17 psychovisual channels in standard TV viewing
conditions, while only five channels are obtained for each
chromatic component (see Fig. 3). Each resulting subband
or channel may be regarded as the neural image correspond-
ing to a population of visual cells tuned to a range of spatial
frequency and to a particular orientation.
The achromatic subbands are distributed over four
crowns, denoted I, II, III, and IV (see Fig. 3). Chromatic
subbands are distributed over two crowns, denoted I and II. The
main properties of these decompositions and the main
differences from a similar transform, called the cortex
transform [27], are a nondyadic radial selectivity and an
orientation selectivity that increases with radial frequency
(except for the chromatic components).
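
The sketch below illustrates, under assumed parameter values, how such a decomposition maps a spatial-frequency coordinate onto a (crown, orientation) channel. The crown boundaries and per-crown orientation counts are placeholders chosen so that they sum to the 17 achromatic channels mentioned above; the exact partition of [17] is not reproduced.

```python
import numpy as np

CROWN_EDGES = [1.5, 5.7, 14.2, 28.2]   # upper radial limits, cycles/degree (assumed)
ORIENTATIONS = [1, 4, 6, 6]            # sectors per crown: 1 + 4 + 6 + 6 = 17

def channel_index(fx, fy):
    """Map a 2D spatial-frequency coordinate (cycles/degree) to a
    (crown, orientation-sector) pair, or None above the last crown."""
    radial = np.hypot(fx, fy)
    theta = np.arctan2(fy, fx) % np.pi          # orientation in [0, pi)
    for crown, edge in enumerate(CROWN_EDGES):
        if radial <= edge:
            n = ORIENTATIONS[crown]
            sector = min(int(theta / (np.pi / n)), n - 1)
            return crown, sector
    return None
```

Note how the orientation selectivity grows with radial frequency, as the text describes: the lowest crown is isotropic (one sector), while the higher crowns are split into four and then six orientation sectors.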
4.1.3 Contrast Sensitivity Functions
Contrast sensitivity functions (CSF) have been widely used to
measure the visibility of natural image components. In fact,
these components can be described by a set of Fourier functions
and their amplitudes. The visibility of a specific component can
be assessed by applying a CSF in the frequency domain. When
the amplitude of a frequency component is greater than a
threshold $CT_0$, the frequency component is perceptible. This
threshold is called the visibility threshold, and its inverse
defines the value of the CSF at this spatial frequency.
Fig. 3. Flow chart of the proposed computational model of bottom-up visual selective attention. It presents three aspects of vision: visibility,
perception, and perceptual grouping. The visibility part, also called the psychovisual space, simulates the limited sensitivity of the human eyes and
takes into account the major properties of the retinal cells. The perception part is used to suppress redundant visual information by simulating the
behavior of cortical cells. Finally, the non-CRF interactions and the saliency map building are achieved by the perceptual grouping.

REFERENCES
[1] W. James, The Principles of Psychology. New York: Henry Holt, 1890.
[7] L. Itti, C. Koch, and E. Niebur, "A Model of Saliency-Based Visual Attention for Rapid Scene Analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254-1259, Nov. 1998.
[12] A.M. Treisman and G. Gelade, "A Feature-Integration Theory of Attention," Cognitive Psychology, vol. 12, pp. 97-136, 1980.
[13] C. Koch and S. Ullman, "Shifts in Selective Visual Attention: Towards the Underlying Neural Circuitry," Human Neurobiology, vol. 4, pp. 219-227, 1985.
[32] P.J. Burt and E.H. Adelson, "The Laplacian Pyramid as a Compact Image Code," IEEE Trans. Communications, vol. 31, no. 4, pp. 532-540, Apr. 1983.