A Vision Architecture for Unconstrained
and Incremental Learning of Multiple
Categories
Stephan Kirstein (1,2), Alexander Denecke (1,3), Stephan Hasler (1), Heiko Wersing (1), Horst-Michael Gross (2) and Edgar Körner (1)

(1) Honda Research Institute Europe GmbH
Carl-Legien-Str. 30, 63073 Offenbach am Main, Germany
{stephan.kirstein, stephan.hasler, heiko.wersing, edgar.koerner}@honda-ri.de

(2) Ilmenau University of Technology
Neuroinformatics and Cognitive Robotics Lab
P.O.B. 10 05 65, 98684 Ilmenau, Germany
horst-michael.gross@tu-ilmenau.de

(3) Bielefeld University
CoR-Lab
P.O.B. 10 01 31, 33501 Bielefeld, Germany
adenecke@cor-lab.uni-bielefeld.de
Abstract
We present an integrated vision architecture capable of incrementally learning several visual categories based on natural hand-held objects. Additionally we focus on interactive learning, which requires real-time image processing methods and a fast learning algorithm. The overall system is composed of a figure-ground segregation part, several feature extraction methods and a life-long learning approach combining incremental learning with category-specific feature selection. In contrast to most visual categorization approaches, where typically each view is assigned to a single category, we allow labeling with an arbitrary number of shape and color categories. We also impose no restrictions on the viewing angle of presented objects, relaxing the common constraint on canonical views.

1 Introduction
An amazing capability of the human visual system is the ability to learn an enormous repertoire of visual categories. This large number of categories is acquired incrementally during our life and requires, at least partially, direct interaction with a tutor. Inspired by this child-like knowledge acquisition we propose an architecture for learning several visual categories in an incremental and interactive fashion. The architecture is composed of several building blocks including figure-ground segregation, feature extraction, a category learning module and user interaction. Together these modules allow the training of categories based on natural objects presented in hand.

The learning system proposed in this paper is partly based on earlier work dealing with online object identification in cluttered scenes (Wersing et al., 2007). For our learning system a novel incremental category learning method is proposed that combines a learning vector quantization (LVQ) (Kohonen 1989) network, to approach the "stability-plasticity dilemma", with a category-specific forward feature selection. Based on this combination we are able to interactively learn a category-specific long-term memory (LTM) representation, where previous LTM models proposed by Kirstein, Wersing, & Körner (2008) could only be learned offline. Other major contributions are the integration of an enhanced figure-ground segregation method and the extraction of parts-based features. In the following, further related work with respect to categorization frameworks, online learning methods and life-long learning architectures is discussed in more detail.
1.1 Visual Categorization Architectures
In the past few years many architectures dealing with object detection and categorization tasks have been proposed in the computer vision community. Interestingly, many of those approaches are based on local parts-based features, which are extracted around some defined interest points, e.g. (Leibe et al., 2004; Willamowski et al., 2004; Agarwal et al., 2004), or on agglomerative clustering (Mikolajczyk, Leibe, & Schiele 2006) to build up object models for categories like faces or cars. The advantages of these approaches are their robustness against partial occlusion, scale changes, and the ability to deal with cluttered environments. One drawback is that such methods are typically restricted to the canonical view of a certain category. Thomas et al. (2006) try to overcome this limitation by training several pose-specific implicit shape models (ISM) (Leibe, Leonardis, & Schiele 2004) for each category. Afterwards, detected parts from neighboring pose-dependent ISMs are linked by so-called "activation links". This allows the detection of categories from many viewpoints. Such categorization architectures, however, are designed for offline usage only, where the required training time is not important. This makes them unsuitable for our desired online and interactive training. A recent work of Fritz, Kruijff, & Schiele (2007) addresses this issue and proposes a semi-supervised and incremental clustering method for interactive category learning. This approach is, however, restricted to the canonical view of the categories.
1.2 Online and Interactive Learning Systems
The development of online and interactive learning systems has become more and more popular in recent years, see e.g. (Roth et al., 2006), (Steels & Kaplan, 2001), (Arsenio, 2004) or (Wersing et al., 2007). All these systems are able to identify several objects in cluttered environments, but are not applicable to categorization tasks, because their learning methods cannot extract a more variable category representation. Nonetheless those models are useful as a short-term memory (STM) representation. Afterwards this representation is consolidated into a more abstract LTM representation of categories, allowing a higher generalization performance compared to the object-specific STM representation. Of particular interest with respect to online and interactive learning of categories is the work of Skočaj et al. (2007). It enables learning of several simple color and shape categories by selecting a single feature which describes the particular category most consistently. The category itself is then represented by the mean and variance of this selected feature (Skočaj et al., 2007) or, more recently, by an incremental kernel density estimation using mixtures of Gaussians (Skočaj et al., 2008). This feature selection in particular enhances the categorization performance, but the restriction to a single feature allows only the representation of simple categories with little appearance variation. Therefore we propose a feature selection process that can incrementally select an arbitrary number of features, if they are required for the representation of a particular category.
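As a rough illustration of such a category-specific forward selection loop, the following Python sketch greedily adds the feature dimension that most improves a per-category score and stops when no candidate helps. The scoring function, the stopping rule and the nearest-mean toy scorer are assumptions made for illustration; they are not the cLVQ selection criteria used later in the paper.

```python
# Minimal sketch of category-specific forward feature selection (illustrative only;
# the scoring function and stopping rule are assumptions, not the paper's exact criteria).
import numpy as np

def forward_select(X, y, score_fn, max_features=None):
    """Greedily pick feature dimensions for one category.

    X: (n_samples, n_features) feature vectors
    y: (n_samples,) binary labels for a single category
    score_fn: callable(X_subset, y) -> validation score (higher is better)
    """
    n_features = X.shape[1]
    selected, best_score = [], -np.inf
    limit = max_features or n_features
    while len(selected) < limit:
        candidates = [f for f in range(n_features) if f not in selected]
        # score every candidate when added to the current subset
        scores = {f: score_fn(X[:, selected + [f]], y) for f in candidates}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score:      # stop when no candidate improves the score
            break
        selected.append(f_best)
        best_score = scores[f_best]
    return selected

# toy scorer (hypothetical stand-in for the real category-specific criterion)
def nearest_mean_score(Xs, y):
    mu_pos, mu_neg = Xs[y == 1].mean(0), Xs[y == 0].mean(0)
    pred = np.linalg.norm(Xs - mu_pos, axis=1) < np.linalg.norm(Xs - mu_neg, axis=1)
    return (pred == (y == 1)).mean()
```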
1.3 Life-Long Learning Architectures
Based on the STM representation, which is assumed to be limited in capacity, we propose an incremental and life-long learning method to acquire a category-specific long-term memory (LTM) representation. For the LTM we approach the so-called "stability-plasticity dilemma". This dilemma occurs when neural networks are trained with a limited and changing training ensemble, causing the well-known "catastrophic forgetting effect" (French 1999). A common strategy for life-long learning architectures, e.g. (Hamker, 2001; Furao & Hasegawa, 2006; Kirstein et al., 2008), is the usage of a node-specific learning rate combined with an incremental node insertion rule. This permits plasticity of newly inserted neurons, while the stability of matured neurons is preserved. The major drawback of those architectures, which are commonly used for identification tasks, is the inefficient separation of co-occurring categories. For natural objects, which typically belong to several different categories (e.g. a red-white car), a decoupled representation for each category (here for the categories red, white and car) should be learned. This decoupling leads to a more condensed representation and higher generalization performance compared to object identification architectures.
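The following minimal sketch illustrates this common strategy of a node-specific learning rate combined with an incremental node insertion rule. The insertion threshold and the 1/age decay are assumptions made for illustration; they are not the exact rules of the cited architectures or of the method proposed here.

```python
# Illustrative sketch of incremental node insertion with node-specific learning rates
# (the insertion threshold and the 1/age decay are assumptions, not the paper's exact rule).
import numpy as np

class IncrementalLVQ:
    def __init__(self, insert_threshold=1.0):
        self.prototypes = []              # list of (weight vector, label) pairs
        self.ages = []                    # per-node update counters
        self.insert_threshold = insert_threshold

    def train_step(self, x, label):
        if not self.prototypes:
            self._insert(x, label)
            return
        dists = [np.linalg.norm(x - w) for w, _ in self.prototypes]
        k = int(np.argmin(dists))
        w_k, label_k = self.prototypes[k]
        if label_k != label and dists[k] > self.insert_threshold:
            # unfamiliar input: add a new, plastic node instead of overwriting old knowledge
            self._insert(x, label)
        else:
            # node-specific learning rate: young nodes move a lot, matured nodes barely change
            self.ages[k] += 1
            theta = 1.0 / self.ages[k]
            sign = 1.0 if label_k == label else -1.0
            self.prototypes[k] = (w_k + sign * theta * (x - w_k), label_k)

    def _insert(self, x, label):
        self.prototypes.append((x.copy(), label))
        self.ages.append(1)
```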
Another approach to the "stability-plasticity dilemma" was proposed by Ozawa et al. (2005). Here representative input-output pairs are stored in a long-term memory for stabilizing an incrementally learning radial basis function (RBF)-like network. It additionally accounts for a feature selection mechanism based on incremental principal component analysis, but no class-specific feature selection is applied. Therefore this method is unsuitable for categorization tasks without modification.

Figure 1: Category Learning System. Based on an object hypothesis extracted from the depth map a figure-ground segregation is performed. The detected foreground is used to extract color and shape features. Color features are represented as histogram bins in the RGB color space. In contrast to most other categorization approaches we combine general, category-independent features obtained from a detection hierarchy with parts-based features. All extracted features are concatenated into a single structureless vector. This vector, together with the category labels provided by a human tutor, is the input to the incremental category learning module. (The figure shows the processing pipeline: input image and depth map, foreground segment, color histogram, holistic C2 features and parts-based features, the concatenated feature vector, and the incremental category learning module receiving category labels through user interaction.)
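To make the data flow of Fig. 1 concrete, the sketch below assembles such a structureless feature vector from an RGB histogram of the foreground pixels plus placeholder shape features, together with a multi-hot label vector for an arbitrary number of categories. The bin count, feature dimensionalities and helper names are illustrative assumptions rather than the actual system parameters.

```python
# Sketch of the feature concatenation in Fig. 1 (bin counts, dimensionalities and the
# multi-hot label encoding are illustrative assumptions, not the values used in the paper).
import numpy as np

def rgb_histogram(segment, mask, bins=8):
    """Color features: histogram bins in RGB space, computed on foreground pixels only."""
    fg = segment[mask > 0].reshape(-1, 3)          # (n_pixels, 3) RGB values
    hist, _ = np.histogramdd(fg, bins=(bins, bins, bins), range=[(0, 256)] * 3)
    return hist.ravel() / max(fg.shape[0], 1)      # normalize by number of foreground pixels

def build_feature_vector(segment, mask, c2_features, parts_features):
    """Concatenate color and shape features into a single structureless vector."""
    return np.concatenate([rgb_histogram(segment, mask), c2_features, parts_features])

def encode_labels(view_labels, all_categories):
    """Multi-hot label vector: a view may carry several shape and color categories at once."""
    return np.array([1.0 if c in view_labels else 0.0 for c in all_categories])

# toy usage with placeholder shape features
segment = np.random.randint(0, 256, (144, 144, 3))
mask = np.ones((144, 144))
x = build_feature_vector(segment, mask, c2_features=np.zeros(50), parts_features=np.zeros(30))
y = encode_labels({"red", "white", "car"}, ["red", "white", "car", "cup", "bottle"])
```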
In the following we describe step by step the building blocks of our learning system illustrated in Fig. 1. The first processing block extracts the object hypothesis from cluttered scenes. This hypothesis is further refined by a figure-ground segregation method as described in Section 2. Additionally we describe all used feature extraction methods in Section 3. The extracted shape and color information is combined and used to train the proposed life-long learning vector quantization approach described in Section 4, which is trained in direct interaction with a human tutor. The target of our system is interactive and life-long learning of categories. Therefore, in Section 5 the learning results of our proposed methods are shown for databases of different complexity. Additionally we show the interactive learning capability of the proposed learning system under real-world constraints. Finally we discuss the results and related work in Section 6.

2 Preprocessing and Figure-ground Segregation
One of the essential problems when dealing with learning in unconstrained environments is the definition of a shared attention concept between the learning system and the human tutor. Specifically, this is necessary to decide what and when to learn. In our architecture we use the peri-personal space concept (Goerick et al., 2006), which basically is defined as the manipulation range around an active vision system. Everything in this short distance range is of particular interest to the system with respect to interaction and learning. Therefore we use a stereo camera system with a pan-tilt unit and parallel-aligned cameras, which delivers a stream of image pairs. Depth information is calculated after the correction of lens distortions. This depth information is used to generate an interaction hypothesis in cluttered scenes, which after its initial detection is actively tracked until it disappears from the peri-personal attention range. Additionally we apply a color constancy method (Pomierski & Gross 1996) and a size normalization of the hypothesis. Both operations ensure invariances, which are beneficial for any kind of recognition system, but are essential for fast online and interactive learning in unconstrained environments. Finally a region of interest (ROI) of an object view is extracted and scaled to a fixed segment size of 144×144 pixels.
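A minimal sketch of this ROI extraction and size normalization step is given below, assuming a simple depth-range threshold for the peri-personal space and nearest-neighbor scaling; the actual system uses stereo disparity, active tracking and a color constancy method that are not reproduced here.

```python
# Sketch of ROI extraction from the depth map and size normalization to 144x144
# (the depth range of the peri-personal space and the nearest-neighbor scaling are assumptions).
import numpy as np

def extract_segment(image, depth, near=0.2, far=0.8, size=144):
    """Crop the region inside the peri-personal depth range and rescale it to a fixed size."""
    mask = (depth > near) & (depth < far)          # interaction hypothesis from depth
    if not mask.any():
        return None
    rows, cols = np.where(mask)
    r0, r1, c0, c1 = rows.min(), rows.max() + 1, cols.min(), cols.max() + 1
    roi = image[r0:r1, c0:c1]
    # nearest-neighbor resize to the fixed segment size
    ri = np.linspace(0, roi.shape[0] - 1, size).astype(int)
    ci = np.linspace(0, roi.shape[1] - 1, size).astype(int)
    return roi[np.ix_(ri, ci)]
```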
The extracted segment j_i contains the object view, but also a substantial amount of background clutter, as can be seen in Fig. 2. For the incremental build-up of category representations it is beneficial to suppress such clutter, because otherwise it would slow down the learning process and considerably more training examples would be necessary. Therefore we apply an additional figure-ground segregation as proposed by Denecke et al. (2009) to reduce this influence. The basic idea of this segregation method, illustrated in Fig. 2, is to train for each segment j_i a learning vector quantization (LVQ) network based on a predefined number of distinct prototypes for foreground and background. As an initial hypothesis for the foreground the noisy depth information belonging to the extracted segment is used. The noise of this hypothesis is caused by the ill-posed problem of disparity calculation and is basically located at the corner of the corresponding object view. Furthermore, "holes" at textureless object parts are common. Because the objects are presented by hand, skin-colored parts in the segment are systematic noise, which we remove from the initial foreground hypothesis based on the detection method proposed by Fritsch et al. (2002). Due to this skin color removal, faces and gestures cannot be learned with this preprocessing. Nevertheless, with a modified preprocessing as proposed in Wersing et al. (2007) a combined learning of objects and faces can be achieved. The learning of each LVQ prototype is based on feature maps consisting of RGB-color features as well as the pixel positions. Instead of the standard Euclidean metric for the distance computation, an extended version of the generalized matrix LVQ approach (Schneider, Biehl, & Hammer 2007) is used. This metric adaptation is used to learn relevance factors for each prototype and feature dimension. These local relevance factors are adapted online and dynamically weight the different feature maps to discriminate between foreground and background. For the

Citations
Journal ArticleDOI
TL;DR: To achieve the life-long learning ability for a cognitive system, a new learning vector quantization approach is combined with a category-specific feature selection method to allow several metrical "views" on the representation space of each individual vector quantization node.

53 citations


Cites methods from "A vision architecture for unconstra..."

  • ...Additionally Θmin is the node-dependent learning rate as proposed by Kirstein et al. (2008):...

    [...]

  • ...Finally the long-term stability of these incrementally learned representation nodes is considered as proposed by Kirstein et al. (2008). Additionally for our learning approach a category-specific forward feature selection method is used to enable the separation of co-occurring categories, because it defines category-specific metrical “views” on the nodes of the exemplar-based network....

    [...]

  • ...Furthermore we recently could show that our proposed cLVQ learning method can be integrated into a larger vision system that allows online learning of categories based on hand-held and complex-shaped objects under full rotation (Kirstein et al., 2008, 2009)....

    [...]

Journal ArticleDOI
TL;DR: An architecture and a set of representations used in two robot systems that exhibit a limited degree of autonomous mental development, termed self-extension, are presented; the contributions include representations of gaps and uncertainty for specific kinds of knowledge.
Abstract: There are many different approaches to building a system that can engage in autonomous mental development. In this paper, we present an approach based on what we term self-understanding, by which we mean the explicit representation of and reasoning about what a system does and does not know, and how that knowledge changes under action. We present an architecture and a set of representations used in two robot systems that exhibit a limited degree of autonomous mental development, which we term self-extension. The contributions include: representations of gaps and uncertainty for specific kinds of knowledge, and a goal management and planning system for setting and achieving learning goals.

41 citations


Cites background from "A vision architecture for unconstra..."

  • ...Different systems focus on different aspects of the problem, such as the system architecture and integration [68], [69], [71], learning [66], [67], [71], or social interaction [70]....

    [...]

Proceedings ArticleDOI
05 Dec 2011
TL;DR: Representations and mechanisms that facilitate continuous learning of visual concepts in dialogue with a tutor and the implemented robot system are presented and demonstrated.
Abstract: In this paper we present representations and mechanisms that facilitate continuous learning of visual concepts in dialogue with a tutor and show the implemented robot system. We present how beliefs about the world are created by processing visual and linguistic information and show how they are used for planning system behaviour with the aim at satisfying its internal drive - to extend its knowledge. The system facilitates different kinds of learning initiated by the human tutor or by the system itself. We demonstrate these principles in the case of learning about object colours and basic shapes.

41 citations


Cites background from "A vision architecture for unconstra..."

  • ...Different systems focus on different aspects of this problem, such as the system architecture and integration [3], [4], [6], learning [1], [2], [6], [7], or social interaction [5]....

    [...]

Journal ArticleDOI
TL;DR: This work proposes a metric learning scheme which allows for an autonomous learning of parameters (such as the underlying scoring matrix in sequence alignments) according to a given discriminative task in relational LVQ, and offers an increased interpretability of the results by pointing out structural invariances for the given task.

35 citations


Cites background from "A vision architecture for unconstra..."

  • ...Because of the intuitive definition of models in terms of prototypical representatives, prototype-based methods like LVQ enjoy a wide popularity in application domains, particularly if human inspection and interaction are necessary, or life-long model adaptation is considered [28, 20, 18]....

    [...]

Journal ArticleDOI
TL;DR: A collection of mechanisms that enable the integration of heterogeneous competencies in a principled way is described; the resulting system is capable of engaging in different kinds of learning interactions, e.g. those initiated by a tutor or by the system itself.
Abstract: This article presents an integrated robot system capable of interactive learning in dialogue with a human. Such a system needs to have several competencies and must be able to process different typ...

20 citations


Cites background from "A vision architecture for unconstra..."

  • ...Different systems focus on different aspects of this problem, such as the system architecture and integration (Bauckhage et al., 2001; Billard & Hayes, 1999; Briggs & Scheutz, 2012; Bolder et al., 2008; Hawes et al., 2010; Karaoguz, Rodemann, Wrede, & Goerick, 2012; Kirstein et al., 2009; Lallee et al., 2012; Lutkebohle et al., 2009; Mason & Lopes, 2011; Sun, 2007); learning and symbol grounding (Salvi, Montesano, Bernardino, & Santos-Victor, 2012; Roy & Pentland, 2002; Billard & Hayes, 1999; Steels & Kaplan, 2000; Kirstein et al....

    [...]

  • ...…and symbol grounding (Salvi, Montesano, Bernardino, & Santos-Victor, 2012; Roy & Pentland, 2002; Billard & Hayes, 1999; Steels & Kaplan, 2000; Kirstein et al., 2009; de Greeff, Delaunay, & Belpaeme, 2009; Chernova & Veloso, 2009; Belpaeme & Morse, 2012; Briggs & Scheutz, 2012; Tellex,…...

    [...]

  • ..., 2009; Mason & Lopes, 2011; Sun, 2007); learning and symbol grounding (Salvi, Montesano, Bernardino, & Santos-Victor, 2012; Roy & Pentland, 2002; Billard & Hayes, 1999; Steels & Kaplan, 2000; Kirstein et al., 2009; de Greeff, Delaunay, & Belpaeme, 2009; Chernova & Veloso, 2009; Belpaeme & Morse, 2012; Briggs & Scheutz, 2012; Tellex, Thaker, Joseph, & Roy, 2014; Perera & Allen, 2013; Schiebener, Morimoto, Asfour, & Ude, 2013; Deits et al., 2013); motivation (Lutkebohle et al....

    [...]

  • ...…et al., 2001; Billard & Hayes, 1999; Briggs & Scheutz, 2012; Bolder et al., 2008; Hawes et al., 2010; Karaoguz, Rodemann, Wrede, & Goerick, 2012; Kirstein et al., 2009; Lallee et al., 2012; Lutkebohle et al., 2009; Mason & Lopes, 2011; Sun, 2007); learning and symbol grounding (Salvi,…...

    [...]

References
Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.

46,906 citations

Journal ArticleDOI
TL;DR: The contributions of this special issue cover a wide range of aspects of variable selection: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.
Abstract: Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available. These areas include text processing of internet documents, gene expression array analysis, and combinatorial chemistry. The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data. The contributions of this special issue cover a wide range of aspects of such problems: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.

14,509 citations


"A vision architecture for unconstra..." refers methods in this paper

  • ...For this learning method we propose a combination of an incremental exemplar-based network and a forward feature selection method (see (Guyon & Elissee 2003) for an introduction to feature selection methods)....

    [...]

Book
01 Jan 1984
TL;DR: The purpose and nature of biological memory and various aspects of memory modelling are explained.
Abstract: Book covering various aspects of memory, pattern mathematics, classical learning systems, adaptive filters, self-organizing feature maps, optimal associative mappings, pattern recognition (including learning vector quantization and the subspace methods of classification), biological memory, neural computing, and optical associative memories.

8,197 citations


"A vision architecture for unconstra..." refers methods in this paper

  • ...(10) Each w^{k_min(c)}(r_l) is updated based on the standard LVQ learning rule (Kohonen 1989), but is restricted to feature dimensions f ∈ S_c: w_f^{k_min(c)} := w_f^{k_min(c)} + µ Θ^{k_min(c)} (r_{lf} − w_f^{k_min(c)}) ∀f ∈ S_c, (11) where µ = 1 if the categorization decision for r_l was correct, otherwise µ = −1 and…...

    [...]

  • ...For our learning system a novel incremental category learning method is proposed that combines a learning vector quantization (LVQ) (Kohonen 1989) network to approach the “stability-plasticity dilemma” with a category-specific forward feature selection....

    [...]

Journal ArticleDOI
TL;DR: In this paper, color histograms of multicolored objects provide a robust, efficient cue for indexing into a large database of models, and they can differentiate among a large number of objects.
Abstract: Computer vision is moving into a new era in which the aim is to develop visual skills for robots that allow them to interact with a dynamic, unconstrained environment. To achieve this aim, new kinds of vision algorithms need to be developed which run in real time and subserve the robot's goals. Two fundamental goals are determining the identity of an object with a known location, and determining the location of a known object. Color can be successfully used for both tasks. This dissertation demonstrates that color histograms of multicolored objects provide a robust, efficient cue for indexing into a large database of models. It shows that color histograms are stable object representations in the presence of occlusion and over change in view, and that they can differentiate among a large number of objects. For solving the identification problem, it introduces a technique called Histogram Intersection, which matches model and image histograms and a fast incremental version of Histogram Intersection which allows real-time indexing into a large database of stored models. It demonstrates techniques for dealing with crowded scenes and with models with similar color signatures. For solving the location problem it introduces an algorithm called Histogram Backprojection which performs this task efficiently in crowded scenes.

5,672 citations

Journal ArticleDOI
TL;DR: A neural network model for a mechanism of visual pattern recognition that is self-organized by “learning without a teacher”, and acquires an ability to recognize stimulus patterns based on the geometrical similarity of their shapes without affected by their positions.
Abstract: A neural network model for a mechanism of visual pattern recognition is proposed in this paper. The network is self-organized by “learning without a teacher”, and acquires an ability to recognize stimulus patterns based on the geometrical similarity (Gestalt) of their shapes without affected by their positions. This network is given a nickname “neocognitron”. After completion of self-organization, the network has a structure similar to the hierarchy model of the visual nervous system proposed by Hubel and Wiesel. The network consits of an input layer (photoreceptor array) followed by a cascade connection of a number of modular structures, each of which is composed of two layers of cells connected in a cascade. The first layer of each module consists of “S-cells”, which show characteristics similar to simple cells or lower order hypercomplex cells, and the second layer consists of “C-cells” similar to complex cells or higher order hypercomplex cells. The afferent synapses to each S-cell have plasticity and are modifiable. The network has an ability of unsupervised learning: We do not need any “teacher” during the process of self-organization, and it is only needed to present a set of stimulus patterns repeatedly to the input layer of the network. The network has been simulated on a digital computer. After repetitive presentation of a set of stimulus patterns, each stimulus pattern has become to elicit an output only from one of the C-cell of the last layer, and conversely, this C-cell has become selectively responsive only to that stimulus pattern. That is, none of the C-cells of the last layer responds to more than one stimulus pattern. The response of the C-cells of the last layer is not affected by the pattern's position at all. Neither is it affected by a small change in shape nor in size of the stimulus pattern.

4,713 citations


"A vision architecture for unconstra..." refers methods in this paper

  • ...We use a feed-forward feature extraction architecture inspired by the Neocognitron (Fukushima 1980) to extract shape features....

    [...]

Frequently Asked Questions (16)
Q1. What have the authors contributed in "A vision architecture for unconstrained and incremental learning of multiple categories" ?

The authors present an integrated vision architecture capable of incrementally learning several visual categories based on natural hand-held objects. The authors also impose no restrictions on the viewing angle of presented objects, relaxing the common constraint on canonical views. 

The forward feature selection method is used to find low-dimensional subsets of category-specific features by predominantly selecting features which occur almost exclusively for a certain category. 

Due to the fact that the objects are presented by hand, skin color parts in the segment are systematic noise, which the authors remove from the initial foreground hypothesis based on the detection method proposed by Fritsch et al. (2002). 

The major drawback of those architectures commonly used for identification tasks is the inefficient separation of co-occurring categories. 

It seems that for their categorization task the independent representation of categories somehow weakens the forgetting effect of SLP networks. 

A common strategy for life-long learning architectures, e.g. (Hamker, 2001; Furao & Hasegawa, 2006; Kirstein et al., 2008), is the usage of a node-specific learning rate combined with an incremental node insertion rule. 

The advantages of these approaches are their robustness against partial occlusion, scale changes, and the ability to deal with cluttered environments. 

For color categories the effect of imprecise foreground masks on the categorization performance also seems to be minor; otherwise the performance would be considerably lower. 

This allows object views to be used first to test the STM and LTM representations; after confirmed labels are provided, the same views can also be used to enhance the representation by transferring them into the STM, even if they were recorded before the confirmation. 

Based on the currently available feature vectors, the learning methods are used to incorporate this STM knowledge into the LTM by applying the learning dynamics of the cLVQ method described in Section 4.2.3. 

To relax this separation and to make the most efficient use of object views, the authors introduce a sensory memory concept for temporarily remembering views of the currently attended object, by using the same one-shot learning method as used for the STM. 
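A minimal sketch of such a sensory memory buffer, assuming a fixed capacity and a hypothetical `stm.add` interface (the paper's one-shot learning method is not reproduced here):

```python
# Sketch of the sensory-memory idea described above: views of the currently attended object are
# buffered and transferred to the STM once the tutor confirms the labels (capacity is an assumption).
from collections import deque

class SensoryMemory:
    def __init__(self, capacity=50):
        self.buffer = deque(maxlen=capacity)   # recent feature vectors of the attended object

    def observe(self, feature_vector):
        self.buffer.append(feature_vector)     # store every incoming view of the attended object

    def confirm(self, labels, stm):
        # after label confirmation, even views recorded before it are used for training
        for view in self.buffer:
            stm.add(view, labels)              # hypothetical STM interface
        self.buffer.clear()
```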

(1) This computation of local edge responses is restricted to the positions in the foreground mask with ξi(x, y) > 0, where ∗ denotes the inner product of two vectors. 
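The following sketch illustrates the idea of restricting local edge responses to foreground positions, using a simple 3×3 edge filter and an explicit inner product; the filter and patch size are assumptions, since the actual shape features come from a Neocognitron-inspired feature hierarchy.

```python
# Sketch of local edge responses restricted to the foreground mask (filter and patch size are
# assumptions; the paper's shape feature extraction is a Neocognitron-inspired hierarchy).
import numpy as np

def edge_responses(gray, mask, filt):
    """Inner product of each local 3x3 patch with an edge filter, only where mask > 0."""
    h, w = gray.shape
    out = np.zeros((h, w))
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if mask[y, x] > 0:                         # restriction to foreground positions
                patch = gray[y - 1:y + 2, x - 1:x + 2].ravel()
                out[y, x] = patch @ filt.ravel()       # inner product of two vectors
    return out

# example horizontal edge filter (illustrative choice)
horizontal_edge = np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]], dtype=float)
```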

One of the essential problems when dealing with learning in unconstrained environments is the definition of a shared attention concept between the learning system and the human tutor. 

Additionally the hypothesis list is repeatedly communicated to the user (in 5 second intervals), while newly acquired segments are also used to refine this list. 

Additionally this constraint strongly reduces the appearance variations of the presented objects and therefore makes the category learning task much easier. 

The authors could show that their learning system can efficiently perform all necessary processing steps including figure-ground segregation, feature extraction and incremental learning.