A Vision Architecture for Unconstrained and Incremental Learning of Multiple Categories

Stephan Kirstein (1,2), Alexander Denecke (1,3), Stephan Hasler (1), Heiko Wersing (1), Horst-Michael Gross (2) and Edgar Körner (1)

(1) Honda Research Institute Europe GmbH, Carl-Legien-Str. 30, 63073 Offenbach am Main, Germany
    {stephan.kirstein, stephan.hasler, heiko.wersing, edgar.koerner}@honda-ri.de
(2) Ilmenau University of Technology, Neuroinformatics and Cognitive Robotics Lab, P.O.B. 10 05 65, 98684 Ilmenau, Germany
    horst-michael.gross@tu-ilmenau.de
(3) Bielefeld University, CoR-Lab, P.O.B. 10 01 31, 33501 Bielefeld, Germany
    adenecke@cor-lab.uni-bielefeld.de
Abstract
We present an integrated vision architecture capable of incrementally learning several visual categories based on natural hand-held objects. Additionally we focus on interactive learning, which requires real-time image processing methods and a fast learning algorithm. The overall system is composed of a figure-ground segregation part, several feature extraction methods and a life-long learning approach combining incremental learning with category-specific feature selection. In contrast to most visual categorization approaches, where typically each view is assigned to a single category, we allow labeling with an arbitrary number of shape and color categories. We also impose no restrictions on the viewing angle of presented objects, relaxing the common constraint on canonical views.

1 Introduction
An amazing capability of the human visual system is the ability to learn an enormous repertoire of visual categories. This large repertoire is acquired incrementally during our lives and at least partially requires direct interaction with a tutor. Inspired by this child-like knowledge acquisition, we propose an architecture for learning several visual categories in an incremental and interactive fashion. The architecture is composed of several building blocks including figure-ground segregation, feature extraction, a category learning module and user interaction. Together these modules allow the training of categories based on natural objects presented in hand.
The learning system proposed in this paper is partly based on earlier work dealing with online object identification in cluttered scenes (Wersing et al., 2007). For our learning system a novel incremental category learning method is proposed that combines a learning vector quantization (LVQ) network (Kohonen, 1989), used to approach the “stability-plasticity dilemma”, with a category-specific forward feature selection. Based on this combination we are able to interactively learn a category-specific long-term memory (LTM) representation, where previous LTM models proposed by Kirstein, Wersing, & Körner (2008) could only be learned offline. Other major contributions are the integration of an enhanced figure-ground segregation method and the extraction of parts-based features. In the following, further related work with respect to categorization frameworks, online learning methods and life-long learning architectures is discussed in more detail.
1.1 Visual Categorization Architectures
In the past few years many architectures dealing with object detection and categorization tasks have been proposed in the computer vision community. Interestingly, many of these approaches build object models for categories like faces or cars based on local parts-based features, which are extracted around defined interest points, e.g. (Leibe et al., 2004; Willamowski et al., 2004; Agarwal et al., 2004), or on agglomerative clustering (Mikolajczyk, Leibe, & Schiele, 2006). The advantages of these approaches are their robustness against partial occlusion, scale changes, and the ability to deal with cluttered environments. One drawback is that such methods are typically restricted to the canonical view of a certain category. Thomas et al. (2006) try to overcome this limitation by training several pose-specific implicit shape models (ISM) (Leibe, Leonardis, & Schiele, 2004) for each category. Afterwards, detected parts from neighboring pose-dependent ISMs are linked by so-called “activation links”, which allows the detection of categories from many viewpoints. Such categorization architectures, however, are designed for offline usage only, where the required training time is not important. This makes them unsuitable for our desired online and interactive training. A recent work of Fritz, Kruijff, & Schiele (2007) addresses this issue and proposes a semi-supervised and incremental clustering method for interactive category learning. This approach is, however, restricted to the canonical view of the categories.
1.2 Online and Interactive Learning Systems
The development of online and interactive learning systems has become more and more popular in recent years, see e.g. (Roth et al., 2006), (Steels & Kaplan, 2001), (Arsenio, 2004) or (Wersing et al., 2007). All these systems are able to identify several objects in cluttered environments, but are not applicable to categorization tasks, because their learning methods cannot extract a sufficiently variable category representation. Nonetheless those models are useful as a short-term memory (STM) representation. Afterwards this representation is consolidated into a more abstract LTM representation of categories, allowing a higher generalization performance compared to the object-specific STM representation. Of particular interest with respect to online and interactive learning of categories is the work of Skočaj et al. (2007). It enables learning of several simple color and shape categories by selecting a single feature that describes the particular category most consistently. The category itself is then represented by the mean and variance of this selected feature (Skočaj et al., 2007) or, more recently, by an incremental kernel density estimation using mixtures of Gaussians (Skočaj et al., 2008). This feature selection clearly enhances the categorization performance, but the restriction to a single feature allows only the representation of simple categories with little appearance variation. Therefore we propose a feature selection process that can incrementally select an arbitrary number of features, if they are required for the representation of a particular category.
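To make the idea of category-specific forward selection concrete, the following is a minimal sketch of a greedy selection loop for one category. The function names, the stopping criterion and the scoring rule (validation accuracy of a simple nearest-mean classifier) are illustrative assumptions and not the exact criterion used in this work.

```python
import numpy as np

def category_score(X, y, feats, train_idx, val_idx):
    """Hypothetical scoring: nearest-mean classification accuracy on a
    validation split, using only the currently selected feature subset.
    Assumes both positive and negative examples occur in the training split."""
    Xs = X[:, feats]
    mu_pos = Xs[train_idx][y[train_idx] == 1].mean(axis=0)
    mu_neg = Xs[train_idx][y[train_idx] == 0].mean(axis=0)
    d_pos = np.linalg.norm(Xs[val_idx] - mu_pos, axis=1)
    d_neg = np.linalg.norm(Xs[val_idx] - mu_neg, axis=1)
    pred = (d_pos < d_neg).astype(int)
    return (pred == y[val_idx]).mean()

def forward_select(X, y, train_idx, val_idx, max_feats=10, min_gain=1e-3):
    """Greedy forward selection of feature dimensions for one category
    (binary labels y: 1 = view belongs to the category, 0 = it does not)."""
    selected, best = [], 0.0
    candidates = set(range(X.shape[1]))
    while candidates and len(selected) < max_feats:
        gains = {f: category_score(X, y, selected + [f], train_idx, val_idx)
                 for f in candidates}
        f_best = max(gains, key=gains.get)
        if gains[f_best] - best < min_gain:   # stop when no feature helps enough
            break
        selected.append(f_best)
        best = gains[f_best]
        candidates.remove(f_best)
    return selected
```

Because the loop is restarted per category, each category can end up with its own, arbitrarily sized feature subset, which is the property argued for above.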
1.3 Life-Long Learning Architectures
Based on the STM representation, which is assumed to be limited in capacity, we propose an incremental and life-long learning method to acquire a category-specific long-term memory (LTM) representation. For the LTM we approach the so-called “stability-plasticity dilemma”. This dilemma occurs when neural networks are trained with a limited and changing training ensemble, causing the well-known “catastrophic forgetting effect” (French, 1999). A common strategy for life-long learning architectures, e.g. (Hamker, 2001; Furao & Hasegawa, 2006; Kirstein et al., 2008), is the usage of a node-specific learning rate combined with an incremental node insertion rule. This permits plasticity of newly inserted neurons, while the stability of matured neurons is preserved. The major drawback of those architectures, which are commonly used for identification tasks, is the inefficient separation of co-occurring categories. This means that for natural objects, which typically belong to several different categories (e.g. a red-white car), a decoupled representation should be learned for each category (here red, white and car). This decoupling leads to a more condensed representation and higher generalization performance compared to object identification architectures. Another approach to the “stability-plasticity dilemma” was proposed by Ozawa et al. (2005). Here representative input-output pairs are stored into a long-term memory for stabilizing an incremental learning radial basis function (RBF) like network. Additionally it also accounts for a feature selection mechanism based on incremental principal component analysis, but no class-specific feature selection is applied. Therefore this method is unsuitable for categorization tasks without modification.
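The sketch below illustrates the general strategy mentioned above, a node-specific learning rate combined with an incremental insertion rule, for a plain nearest-prototype LVQ. It is not the exact update scheme of the cited architectures; the class name, thresholds and decay schedule are assumptions made for illustration.

```python
import numpy as np

class IncrementalLVQ:
    """Toy LVQ with per-node learning rates and node insertion on errors."""

    def __init__(self, dim, lr0=0.3, lr_decay=0.99, insert_threshold=3):
        self.protos = np.empty((0, dim))   # prototype vectors
        self.labels = []                   # one label per prototype
        self.lrates = []                   # node-specific learning rates
        self.errors = 0
        self.insert_threshold = insert_threshold
        self.lr0, self.lr_decay = lr0, lr_decay

    def _winner(self, x):
        return int(np.argmin(np.linalg.norm(self.protos - x, axis=1)))

    def train_step(self, x, label):
        if not self.labels:
            self._insert(x, label)
            return
        w = self._winner(x)
        lr = self.lrates[w]
        if self.labels[w] == label:
            self.protos[w] += lr * (x - self.protos[w])   # attract correct winner
        else:
            self.protos[w] -= lr * (x - self.protos[w])   # repel wrong winner
            self.errors += 1
            if self.errors >= self.insert_threshold:      # incremental insertion rule
                self._insert(x, label)
                self.errors = 0
        # only the adapted node's rate decays, so matured nodes stay stable
        self.lrates[w] *= self.lr_decay

    def _insert(self, x, label):
        self.protos = np.vstack([self.protos, np.asarray(x, float)[None, :]])
        self.labels.append(label)
        self.lrates.append(self.lr0)       # newly inserted node starts plastic
```

New nodes keep a high learning rate (plasticity) while frequently adapted, older nodes converge to small rates (stability), which is the trade-off the dilemma refers to.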

[Figure 1 graphic: processing pipeline with the blocks Input Image, Depth Map, Foreground Segment, Color Histogram, Holistic C2 Features, Parts-Based Features, Feature Vector, Incremental Category Learning, Category Labels and User Interaction.]
Figure 1: Category Learning System. Based on an object hypothesis extracted from the depth map, a figure-ground segregation is performed. The detected foreground is used to extract color and shape features. Color features are represented as histogram bins in the RGB color space. In contrast to most other categorization approaches we combine general, category-independent features obtained from a detection hierarchy with parts-based features. All extracted features are concatenated into a single structureless vector. This vector, together with the category labels provided by a human tutor, is the input to the incremental category learning module.
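To make the data flow of Fig. 1 concrete, the following is a minimal sketch of how a color histogram over the segmented foreground could be computed and concatenated with the other feature channels into one structureless vector. The bin count, the placeholder shape features and all function names are illustrative assumptions, not the exact feature parameters used here.

```python
import numpy as np

def rgb_histogram(segment, mask, bins_per_channel=4):
    """Joint RGB histogram over foreground pixels (mask > 0), L1-normalized."""
    fg = segment[mask > 0]                       # (N, 3) foreground RGB values
    edges = np.linspace(0, 256, bins_per_channel + 1)
    hist, _ = np.histogramdd(fg.astype(float), bins=(edges, edges, edges))
    hist = hist.ravel()
    return hist / max(hist.sum(), 1.0)

def build_feature_vector(segment, mask, shape_features, parts_features):
    """Concatenate all channels into one structureless vector, as in Fig. 1."""
    color = rgb_histogram(segment, mask)
    return np.concatenate([color, shape_features, parts_features])

# usage sketch: one 144x144 RGB segment with its binary foreground mask
segment = np.random.randint(0, 256, (144, 144, 3))
mask = np.ones((144, 144), dtype=np.uint8)
x = build_feature_vector(segment, mask,
                         shape_features=np.zeros(100),   # e.g. holistic C2 responses
                         parts_features=np.zeros(50))    # e.g. parts-based responses
```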
In the following we describe step by step the building blocks of our learning system illustrated in Fig. 1. The first processing block extracts the object hypothesis from cluttered scenes. This hypothesis is further refined by a figure-ground segregation method as described in Section 2. Additionally, we describe all used feature extraction methods in Section 3. The extracted shape and color information is combined and used to train the proposed life-long learning vector quantization approach described in Section 4, which is trained in direct interaction with a human tutor. The target of our system is interactive and life-long learning of categories. Therefore, Section 5 presents the learning results of our proposed methods for databases of different complexity. Additionally we show the interactive learning capability of the proposed learning system under real-world constraints. Finally, we discuss the results and related work in Section 6.

2 Preprocessing and Figure-ground Segregation
One of the essential problems when dealing with learning in unconstrained environments is the definition of a shared attention concept between the learning system and the human tutor. Specifically, this is necessary to decide what and when to learn. In our architecture we use the peri-personal space concept (Goerick et al., 2006), which basically is defined as the manipulation range around an active vision system. Everything in this short distance range is of particular interest to the system with respect to interaction and learning. Therefore we use a stereo camera system with a pan-tilt unit and cameras aligned in parallel, which deliver a stream of image pairs. Depth information is calculated after the correction of lens distortions. This depth information is used to generate an interaction hypothesis in cluttered scenes, which after its initial detection is actively tracked until it disappears from the peri-personal attention range. Additionally we apply a color constancy method (Pomierski & Gross, 1996) and a size normalization of the hypothesis. Both operations ensure invariances that are beneficial for any kind of recognition system, but are essential for fast online and interactive learning in unconstrained environments. Finally, a region of interest (ROI) of an object view is extracted and scaled to a fixed segment size of 144×144 pixels.
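A minimal sketch of this last preprocessing step is given below, under the assumption that the peri-personal range can be expressed as a simple depth interval. The distance limits, the helper name and the nearest-neighbor scaling are illustrative choices, not the exact procedure of the system.

```python
import numpy as np

def extract_roi(image, depth, near=0.2, far=0.8, out_size=144):
    """Crop the region whose depth falls into the peri-personal range and
    scale it to a fixed out_size x out_size segment (nearest-neighbor)."""
    hypothesis = (depth > near) & (depth < far)       # near-range pixels only
    ys, xs = np.nonzero(hypothesis)
    if ys.size == 0:
        return None                                   # nothing inside the range
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    crop = image[y0:y1, x0:x1]
    # nearest-neighbor resampling to the fixed segment size
    row_idx = (np.arange(out_size) * crop.shape[0] / out_size).astype(int)
    col_idx = (np.arange(out_size) * crop.shape[1] / out_size).astype(int)
    return crop[row_idx][:, col_idx]
```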
The extracted segment j_i contains the object view, but also a substantial amount of background clutter, as can be seen in Fig. 2. For the incremental build-up of category representations it is beneficial to suppress such clutter, otherwise it would slow down the learning process and considerably more training examples would be necessary. Therefore we apply an additional figure-ground segregation as proposed by Denecke et al. (2009) to reduce this influence. The basic idea of this segregation method, illustrated in Fig. 2, is to train for each segment j_i a learning vector quantization (LVQ) network based on a predefined number of distinct prototypes for foreground and background. As an initial hypothesis for the foreground, the noisy depth information belonging to the extracted segment is used. The noise of this hypothesis is caused by the ill-posed problem of disparity calculation and is basically located at the corners of the corresponding object view. Furthermore, “holes” at textureless object parts are common. Because the objects are presented by hand, skin-colored parts in the segment are systematic noise, which we remove from the initial foreground hypothesis based on the detection method proposed by Fritsch et al. (2002). Due to this skin color removal, faces and gestures cannot be learned with this preprocessing. Nevertheless, with a modified preprocessing as proposed in Wersing et al. (2007) a combined learning of objects and faces can be achieved. The learning of each LVQ prototype is based on feature maps consisting of RGB color features as well as the pixel positions. Instead of the standard Euclidean metric for the distance computation, an extended version of the generalized matrix LVQ approach (Schneider, Biehl, & Hammer, 2007) is used. This metric adaptation is used to learn relevance factors for each prototype and feature dimension. These local relevance factors are adapted online and dynamically weight the different feature maps to discriminate between foreground and background.
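As a rough illustration of what metric adaptation buys here, the sketch below uses a diagonal, prototype-local relevance vector to weight the feature dimensions (RGB plus pixel position) when assigning a pixel to foreground or background. The full generalized matrix LVQ of Schneider, Biehl, & Hammer (2007) adapts a full matrix with dedicated update rules, so this is only a simplified stand-in with assumed names and values.

```python
import numpy as np

def relevance_distance(x, proto, relevance):
    """Squared distance with per-dimension relevance factors (diagonal metric)."""
    diff = x - proto
    return np.sum(relevance * diff * diff)

def classify_pixel(x, protos, labels, relevances):
    """Assign a pixel feature vector (R, G, B, row, col) to foreground or
    background via its nearest relevance-weighted prototype."""
    d = [relevance_distance(x, p, r) for p, r in zip(protos, relevances)]
    return labels[int(np.argmin(d))]

# usage sketch: two foreground and two background prototypes in a 5-D space
protos = np.random.rand(4, 5)
labels = ["fg", "fg", "bg", "bg"]
relevances = np.full((4, 5), 1.0 / 5)   # uniform start; adapted online in GMLVQ
pixel = np.array([0.8, 0.1, 0.1, 0.5, 0.5])
print(classify_pixel(pixel, protos, labels, relevances))
```

Dimensions with large relevance values dominate the distance, so a prototype can, for example, rely mostly on color in textured regions and mostly on position near the object border.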

References

Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics.

Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research.

Kohonen, T. (1989). Self-Organization and Associative Memory. Springer.

Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision.

Swain, M. J., & Ballard, D. H. (1991). Color indexing. International Journal of Computer Vision.