A model of saliency-based visual attention for rapid scene analysis

doi:10.1109/34.730558

Home
/
Papers
/
A model of saliency-based visual attention for rapid scene analysis

Journal Article•DOI•

A model of saliency-based visual attention for rapid scene analysis

Laurent Itti¹, Christof Koch¹, Ernst Niebur²•Institutions (2)

California Institute of Technology¹, Johns Hopkins University²

01 Nov 1998-IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE Computer Society)-Vol. 20, Iss: 11, pp 1254-1259

TL;DR: In this article, a visual attention system inspired by the behavior and the neuronal architecture of the early primate visual system is presented, where multiscale image features are combined into a single topographical saliency map.

read less

Abstract: A visual attention system, inspired by the behavior and the neuronal architecture of the early primate visual system, is presented. Multiscale image features are combined into a single topographical saliency map. A dynamical neural network then selects attended locations in order of decreasing saliency. The system breaks down the complex problem of scene understanding by rapidly selecting, in a computationally efficient manner, conspicuous locations to be analyzed in detail.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Proceedings Article•DOI•

Rapid object detection using a boosted cascade of simple features

[...]

Paul A. Viola¹, Michael Jones•Institutions (1)

Mitsubishi¹

01 Dec 2001

TL;DR: A machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates and the introduction of a new image representation called the "integral image" which allows the features used by the detector to be computed very quickly.

...read moreread less

Abstract: This paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the "integral image" which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers. The third contribution is a method for combining increasingly more complex classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. The cascade can be viewed as an object specific focus-of-attention mechanism which unlike previous approaches provides statistical guarantees that discarded regions are unlikely to contain the object of interest. In the domain of face detection the system yields detection rates comparable to the best previous systems. Used in real-time applications, the detector runs at 15 frames per second without resorting to image differencing or skin color detection.

...read moreread less

18,620 citations

Journal Article•DOI•

Squeeze-and-Excitation Networks

[...]

Jie Hu¹, Li Shen², Samuel Albanie², Gang Sun¹, Enhua Wu¹ - Show less +1 more•Institutions (2)

Chinese Academy of Sciences¹, University of Oxford²

18 Jun 2018

TL;DR: This work proposes a novel architectural unit, which is term the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels and finds that SE blocks produce significant performance improvements for existing state-of-the-art deep architectures at minimal additional computational cost.

...read moreread less

Abstract: The central building block of convolutional neural networks (CNNs) is the convolution operator, which enables networks to construct informative features by fusing both spatial and channel-wise information within local receptive fields at each layer. A broad range of prior research has investigated the spatial component of this relationship, seeking to strengthen the representational power of a CNN by enhancing the quality of spatial encodings throughout its feature hierarchy. In this work, we focus instead on the channel relationship and propose a novel architectural unit, which we term the “Squeeze-and-Excitation” (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We show that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets. We further demonstrate that SE blocks bring significant improvements in performance for existing state-of-the-art CNNs at slight additional computational cost. Squeeze-and-Excitation Networks formed the foundation of our ILSVRC 2017 classification submission which won first place and reduced the top-5 error to 2.251 percent, surpassing the winning entry of 2016 by a relative improvement of ${\sim }$ ∼ 25 percent. Models and code are available at https://github.com/hujie-frank/SENet .

...read moreread less

14,807 citations

Journal Article•DOI•

Robust Real-Time Face Detection

[...]

Paul A. Viola¹, Michael Jones²•Institutions (2)

Microsoft¹, Mitsubishi Electric²

01 May 2004-International Journal of Computer Vision

TL;DR: In this paper, a face detection framework that is capable of processing images extremely rapidly while achieving high detection rates is described. But the detection performance is limited to 15 frames per second.

...read moreread less

Abstract: This paper describes a face detection framework that is capable of processing images extremely rapidly while achieving high detection rates. There are three key contributions. The first is the introduction of a new image representation called the “Integral Image” which allows the features used by our detector to be computed very quickly. The second is a simple and efficient classifier which is built using the AdaBoost learning algorithm (Freund and Schapire, 1995) to select a small number of critical visual features from a very large set of potential features. The third contribution is a method for combining classifiers in a “cascade” which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions. A set of experiments in the domain of face detection is presented. The system yields face detection performance comparable to the best previous systems (Sung and Poggio, 1998; Rowley et al., 1998; Schneiderman and Kanade, 2000; Roth et al., 2000). Implemented on a conventional desktop, face detection proceeds at 15 frames per second.

...read moreread less

13,037 citations

Proceedings Article•DOI•

Robust real-time face detection

[...]

Paul A. Viola¹, Michael Jones²•Institutions (2)

Microsoft¹, Mitsubishi Electric²

07 Jul 2001

TL;DR: A new image representation called the “Integral Image” is introduced which allows the features used by the detector to be computed very quickly and a method for combining classifiers in a “cascade” which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions.

...read moreread less

Abstract: This paper describes a face detection framework that is capable of processing images extremely rapidly while achieving high detection rates. There are three key contributions. The first is the introduction of a new image representation called the "Integral Image" which allows the features used by our detector to be computed very quickly. The second is a simple and efficient classifier which is built using the AdaBoost learning algo- rithm (Freund and Schapire, 1995) to select a small number of critical visual features from a very large set of potential features. The third contribution is a method for combining classifiers in a "cascade" which allows back- ground regions of the image to be quickly discarded while spending more computation on promising face-like regions. A set of experiments in the domain of face detection is presented. The system yields face detection perfor- mance comparable to the best previous systems (Sung and Poggio, 1998; Rowley et al., 1998; Schneiderman and Kanade, 2000; Roth et al., 2000). Implemented on a conventional desktop, face detection proceeds at 15 frames per second.

...read moreread less

10,592 citations

Journal Article•DOI•

Content-based image retrieval at the end of the early years

[...]

Arnold W. M. Smeulders¹, Marcel Worring¹, Simone Santini², Amarnath Gupta², Ramesh Jain - Show less +1 more•Institutions (2)

University of Amsterdam¹, University of California, San Diego²

01 Dec 2000-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: The working conditions of content-based retrieval: patterns of use, types of pictures, the role of semantics, and the sensory gap are discussed, as well as aspects of system engineering: databases, system architecture, and evaluation.

...read moreread less

Abstract: Presents a review of 200 references in content-based image retrieval. The paper starts with discussing the working conditions of content-based retrieval: patterns of use, types of pictures, the role of semantics, and the sensory gap. Subsequent sections discuss computational steps for image retrieval systems. Step one of the review is image processing for retrieval sorted by color, texture, and local geometry. Features for retrieval are discussed next, sorted by: accumulative and global features, salient points, object and shape features, signs, and structural combinations thereof. Similarity of pictures and objects in pictures is reviewed for each of the feature types, in close connection to the types and means of feedback the user of the systems is capable of giving by interaction. We briefly discuss aspects of system engineering: databases, system architecture, and evaluation. In the concluding section, we present our view on: the driving force of the field, the heritage from computer vision, the influence on computer vision, the role of similarity and of interaction, the need for databases, the problem of evaluation, and the role of the semantic gap.

...read moreread less

6,447 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse

References

PDF

Open Access

More filters

Journal Article•DOI•

A feature-integration theory of attention

[...]

Anne Treisman¹, Garry A. Gelade²•Institutions (2)

University of British Columbia¹, University of Oxford²

01 Jan 1980-Cognitive Psychology

TL;DR: A new hypothesis about the role of focused attention is proposed, which offers a new set of criteria for distinguishing separable from integral features and a new rationale for predicting which tasks will show attention limits and which will not.

...read moreread less

11,452 citations

Book Chapter•DOI•

Shifts in selective visual attention: towards the underlying neural circuitry.

[...]

Christof Koch¹, Shimon Ullman²•Institutions (2)

California Institute of Technology¹, Massachusetts Institute of Technology²

01 Jan 1985-Human neurobiology

TL;DR: This study addresses the question of how simple networks of neuron-like elements can account for a variety of phenomena associated with this shift of selective visual attention and suggests a possible role for the extensive back-projection from the visual cortex to the LGN.

...read moreread less

Abstract: Psychophysical and physiological evidence indicates that the visual system of primates and humans has evolved a specialized processing focus moving across the visual scene. This study addresses the question of how simple networks of neuron-like elements can account for a variety of phenomena associated with this shift of selective visual attention. Specifically, we propose the following: (1) A number of elementary features, such as color, orientation, direction of movement, disparity etc. are represented in parallel in different topographical maps, called the early representation. (2) There exists a selective mapping from the early topographic representation into a more central non-topographic representation, such that at any instant the central representation contains the properties of only a single location in the visual scene, the selected location. We suggest that this mapping is the principal expression of early selective visual attention. One function of selective attention is to fuse information from different maps into one coherent whole. (3) Certain selection rules determine which locations will be mapped into the central representation. The major rule, using the conspicuity of locations in the early representation, is implemented using a so-called Winner-Take-All network. Inhibiting the selected location in this network causes an automatic shift towards the next most conspicious location. Additional rules are proximity and similarity preferences. We discuss how these rules can be implemented in neuron-like networks and suggest a possible role for the extensive back-projection from the visual cortex to the LGN.

...read moreread less

3,930 citations

Journal Article•DOI•

Guided Search 2.0 A revised model of visual search

[...]

Jeremy M. Wolfe¹, Jeremy M. Wolfe²•Institutions (2)

Brigham and Women's Hospital¹, Harvard University²

01 Jun 1994-Psychonomic Bulletin & Review

TL;DR: This paper reviews the visual search literature and presents a model of human search behavior, a revision of the guided search 2.0 model in which virtually all aspects of the model have been made more explicit and/or revised in light of new data.

...read moreread less

Abstract: An important component of routine visual behavior is the ability to find one item in a visual world filled with other, distracting items. This ability to performvisual search has been the subject of a large body of research in the past 15 years. This paper reviews the visual search literature and presents a model of human search behavior. Built upon the work of Neisser, Treisman, Julesz, and others, the model distinguishes between a preattentive, massively parallel stage that processes information about basic visual features (color, motion, various depth cues, etc.) across large portions of the visual field and a subsequent limited-capacity stage that performs other, more complex operations (e.g., face recognition, reading, object identification) over a limited portion of the visual field. The spatial deployment of the limited-capacity process is under attentional control. The heart of the guided search model is the idea that attentional deployment of limited resources isguided by the output of the earlier parallel processes. Guided Search 2.0 (GS2) is a revision of the model in which virtually all aspects of the model have been made more explicit and/or revised in light of new data. The paper is organized into four parts: Part 1 presents the model and the details of its computer simulation. Part 2 reviews the visual search literature on preattentive processing of basic features and shows how the GS2 simulation reproduces those results. Part 3 reviews the literature on the attentional deployment of limited-capacity processes in conjunction and serial searches and shows how the simulation handles those conditions. Finally, Part 4 deals with shortcomings of the model and unresolved issues.

...read moreread less

3,436 citations

Journal Article•

Components of visual orienting

[...]

Michael I. Posner

01 Jan 1984-Attention and Performance

2,856 citations

Book•DOI•

Biophysics of computation : information processing in single neurons

[...]

Christof Koch

12 Nov 1998

TL;DR: This work presents simplified models of individual neurons, and unconventional coupling, of action-potential generation and phase space analysis of neuronal excitability in response to the Hodgkin-Huxley model.

...read moreread less

Abstract: 1. The membrane equation 2. Linear cable theory 3. Passive dendritic trees 4. Synaptic input 5. Synaptic interactions in a passive dendritic tree 6. The Hodgkin-Huxley model of action-potential generation 7. Phase space analysis of neuronal excitability 8. Ionic channels 9. Beyond Hodgkin and Huxley: calcium, and calcium-dependent potassium currents 10. Linearizing voltage-dependent currents 11. Diffusion, buffering, and binding 12. Dendritic spines 13. Synaptic plasticity 14. Simplified models of individual neurons 15. Stochastic models of single cells 16. Bursting cells 17. Input resistance, time constants, and spike initiation 18. Synaptic input to a passive tree 19. Voltage-dependent events in the dendritic tree 20. Unconventional coupling 21. Computing with neurons - a summary

...read moreread less

1,368 citations