
Towards Autonomous Bootstrapping for
Life-long Learning Categorization Tasks
Stephan Kirstein, Heiko Wersing and Edgar Körner
Abstract—We present an exemplar-based learning approach
for incremental and life-long learning of visual categories. The
basic concept of the proposed learning method is to subdivide
the learning process into two phases. In the first phase we utilize
supervised learning to generate an appropriate category seed,
while in the second phase this seed is used to autonomously
bootstrap the visual representation. This second learning phase
is especially useful for assistive systems like a mobile robot,
because the visual knowledge can be enhanced even if no
tutor is present. Although for this autonomous bootstrapping
no category labels are provided, we argue that contextual
information is beneficial for this process. Finally we investigate
the effect of the proposed second learning phase with respect
to the overall categorization performance.
I. INTRODUCTION
In recent decades a wide variety of category learning paradigms have been proposed, ranging from generative [10], [14] to discriminative models [6], [18]. However, most research on this topic has so far focused on supervised learning. The major advantage of supervised over unsupervised learning is the higher categorization performance, while the time-consuming and costly collection of accurately labeled training data is its fundamental drawback. In the context of assistive systems this means that whenever the system should enhance its category representation, a tutor has to specify the corresponding labels. Although we consider the interaction with a tutor a necessary part of the early learning phase, we want to enable the system to bootstrap its acquired category representation more and more autonomously. Therefore, in this paper we investigate the combination of semi-supervised and life-long learning to reduce the necessity of tutor interactions.
The basic idea of semi-supervised learning is to combine supervised with unsupervised learning [12], [2]. The advantage of this combination is typically a considerably higher performance compared to purely data-driven unsupervised methods, while the labeling effort can be strongly reduced. Typically for semi-supervised learning the initial representation is trained based on the labeled portion of the training data. Afterwards this initial representation is utilized to estimate the correct class labels for the unlabeled portion of the training data. Commonly, only unlabeled training examples with high classifier confidence are used for the bootstrapping. This guarantees a low amount of errors in the estimated labels, but such data is most probably less useful for enhancing the classifier performance, because it is already well represented [17]. To overcome this limitation, semi-supervised learning can be extended by active learning [13], [15], where the learning system requests the tutor-driven labeling for the currently worst represented training data.

Stephan Kirstein is with the Honda Research Institute Europe GmbH, Carl-Legien-Strasse 30, 63073 Offenbach, Germany (email: stephan.kirstein@honda-ri.de). Heiko Wersing is with the Honda Research Institute Europe GmbH (email: heiko.wersing@honda-ri.de). Edgar Körner is with the Honda Research Institute Europe GmbH (email: edgar.koerner@honda-ri.de).
In contrast to this, we propose to use temporal context information to overcome this limitation, rather than requesting additional user interactions. To use the temporal context, object views that belong to the same physical object have to be identified first. In offline experiments this can typically be achieved easily. For an autonomous system this requires tracking the object over a longer period, so that the corresponding views most probably belong to the same physical object. Based on this list of object views, a majority voting can be applied. The advantage of such voting is that not only already well-represented views are added to the training ensemble, but also currently wrongly categorized views of the same object. We believe that such a combination has the highest potential effect with respect to increasing the categorization performance.
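For illustration, the majority vote over the tracked views of one physical object might look as follows; the function name and the string labels are invented for this sketch and are not the paper's actual implementation:

```python
from collections import Counter

def majority_vote_labels(view_predictions):
    """Assign the majority label over all tracked views of one physical
    object to every view, so that currently misclassified views also
    enter the training ensemble with the (probably) correct label."""
    majority, _ = Counter(view_predictions).most_common(1)[0]
    return [majority] * len(view_predictions)

# Views of one tracked object: most views are recognized as "cup",
# two are misclassified; after voting, all views carry the label "cup".
preds = ["cup", "cup", "can", "cup", "bottle", "cup"]
print(majority_vote_labels(preds))
```

This is precisely how wrongly categorized views of a tracked object receive a usable label without any tutor interaction.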
Although semi-supervised learning is a common learning technique (see [19] for an overview), it has so far received much less interest in the context of incremental and life-long learning. We consider the ability to increase the visual knowledge in a life-long learning fashion a basic requirement for an autonomous system. Nevertheless, combining semi-supervised with life-long learning is more challenging compared to typical semi-supervised learning approaches. This is because for life-long learning tasks the learning method commonly has access to only a limited amount of training data, so that the bootstrapping is normally based purely on the unlabeled training views and their autonomously assigned label information. This is in contrast to typical semi-supervised approaches, where the labeled and unlabeled training views are combined into one single training set. Furthermore, to cope with the “stability-plasticity dilemma” [1] of life-long learning tasks, on the one hand stability considerations are required to avoid the “catastrophic forgetting effect” [3] in the learned representation, while for the plasticity the allocation of new network resources is necessary. It is obvious that this resource allocation is considerably more difficult if the label information is unreliable, as is the case for the unsupervised training data.
The paper is structured in the following way. In the next Section II we briefly explain our category learning vector quantization (cLVQ) framework. Afterwards the modifications of the basic cLVQ approach and the context-dependent estimation of category labels are described in Section III. In Section IV the experimental results are summarized, and they are discussed in Section V.

Fig. 1. Illustration of the Category Learning Framework. The learning with our proposed category learning vector quantization (cLVQ) approach is based on a limited and changing training set. Based on the currently available training vectors x^i and the corresponding target labels t^i, the cLVQ incrementally allocates new representation nodes and category-specific features. The selected feature sets for each category c enable an efficient separation of co-occurring categories (e.g. if an object belongs to several categories, which is the standard setting in our experiments) and the definition of various metrical “views” on a single node w^k. The categorization decision itself is based on the allocated cLVQ nodes w^k and the low-dimensional category-specific feature spaces. (The figure maps high-dimensional feature vectors, with positive and negative representatives of several objects, through the cLVQ nodes w^1, ..., w^K onto low-dimensional subspaces for categories 1, ..., C.)
II. CATEGORY LEARNING VECTOR QUANTIZATION
Our proposed category learning approach [8] enables interactive and life-long learning and therefore can be utilized for autonomous systems, but so far we have only considered supervised learning based on interactions with a human tutor. In the following we briefly describe the learning framework as illustrated in Fig. 1. In the present paper we utilize this framework for creating the category seed in a purely supervised fashion. The proposed learning approach is based on an exemplar-based incremental learning network combined with a forward feature selection method, to enable incremental and life-long learning of arbitrary categories. Both parts are optimized together to find a balance between the insertion of features and the allocation of representation nodes, while using as few resources as possible. In the following we refer to this architecture as category learning vector quantization (cLVQ).
To achieve the interactive and incremental learning capability, the exemplar-based network part of the cLVQ method is used to approach the “stability-plasticity dilemma” of life-long learning problems. Thus we define a node insertion rule that automatically determines the number of required representation nodes. The final number of allocated nodes w^k and the assigned category labels u^k correspond to the difficulty of the different categories itself, but also to the within-category variance. Finally, the long-term stability of these incrementally learned nodes is considered based on an individual node learning rate Θ^k, as proposed in [7].
Additionally, a category-specific forward feature selection method is used to enable the separation of co-occurring categories, because it defines category-specific metrical “views” on the representation nodes of the exemplar-based network. During the learning process it selects low-dimensional subsets of features by predominantly choosing features that occur almost exclusively for this particular category. Furthermore, only these selected category-specific features are used to decide whether a particular category is present or not, as illustrated in Fig. 1. For guiding this selection process a feature scoring value h_cf is calculated for each category c and feature f. This scoring value is based only on previously seen exemplars of a certain category, and it can strongly change if further information is encountered. Therefore a continuous update of the h_cf values is required to follow this change.

A. Distance Computation and Learning Rule
The learning in the cLVQ architecture is based on a set of high-dimensional and sparse feature vectors x^i = (x^i_1, ..., x^i_F), where F denotes the total number of features. Each x^i is assigned to a list of category labels t^i = (t^i_1, ..., t^i_C). We use C to denote the current number of represented color and shape categories, whereas each t^i_c ∈ {−1, 0, +1} labels x^i as a positive or negative example of category c. The third state t^i_c = 0 is interpreted as unknown category membership, which means that all x^i with t^i_c = 0 have no influence on the representation of category c.
The cLVQ representative nodes w^k with k = 1, ..., K are built up incrementally, where K denotes the current number of allocated vectors w. Each w^k is attached to a label vector u^k, where u^k_c ∈ {−1, 0, +1} is the model target output for category c, representing positive, negative, and missing label output, respectively. The winning nodes w^{k_min(c)}(x^i) are calculated independently for each category c, where k_min(c) is determined in the following way:

k_min(c) = argmin_k Σ_{f=1}^{F} λ_cf (x^i_f − w^k_f)²,  ∀k with u^k_c ≠ 0,  (1)
where the category-specific weights λ_cf are updated continuously, inspired by the generalized relevance LVQ proposed by [4]. We denote the set of selected features for an active category c ∈ C as S_c. We choose λ_cf = 0 for all f ∉ S_c, and otherwise adjust it according to a scoring procedure explained later. Each w^{k_min(c)}(x^i) is updated based on the standard LVQ learning rule [9], but restricted to feature dimensions f ∈ S_c:

w_f^{k_min(c)} := w_f^{k_min(c)} + µ Θ^{k_min(c)} (x^i_f − w_f^{k_min(c)}),  ∀f ∈ S_c,  (2)
where µ = 1 if the categorization decision for x^i was correct; otherwise µ = −1 and the winning node w^{k_min(c)} is shifted away from x^i. Additionally, Θ^{k_min(c)} is the node-dependent learning rate as proposed by [7]:

Θ^{k_min(c)} = Θ_0 exp(−a^{k_min(c)} / σ).  (3)

Here Θ_0 is a predefined initial value, σ is a fixed scaling factor, and a^k is an iteration-dependent age factor. The age factor a^k is incremented every time the corresponding w^k becomes the winning node.
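Under the definitions above, Eqs. (1)–(3) can be sketched in a few lines; the vectors, weights, and parameter values below are invented for illustration and only mimic the described computation:

```python
import math

def winning_node(x, nodes, u_c, lam_c):
    """Eq. (1): among the nodes with a nonzero label for category c,
    return the index minimizing the lambda-weighted squared distance."""
    best_k, best_d = None, float("inf")
    for k, w in enumerate(nodes):
        if u_c[k] == 0:                       # node carries no label for c
            continue
        d = sum(lam_c[f] * (x[f] - w[f]) ** 2 for f in range(len(x)))
        if d < best_d:
            best_k, best_d = k, d
    return best_k

def update_winner(w, x, S_c, mu, theta0, age, sigma):
    """Eqs. (2)-(3): move the winner toward (mu=+1) or away from (mu=-1)
    x with the node-dependent learning rate, restricted to f in S_c."""
    theta = theta0 * math.exp(-age / sigma)   # Eq. (3)
    for f in S_c:
        w[f] += mu * theta * (x[f] - w[f])    # Eq. (2)
    return w

x = [1.0, 0.0, 0.5]
nodes = [[1.0, 0.0, 0.4], [0.0, 1.0, 0.0], [0.8, 0.1, 0.5]]
u_c = [+1, 0, -1]                  # node 1 is excluded (u^k_c = 0)
lam_c = [1.0, 0.0, 1.0]            # only features in S_c have nonzero weight
k = winning_node(x, nodes, u_c, lam_c)
print(k)                           # node 0 has the smallest weighted distance
update_winner(nodes[k], x, S_c=[0, 2], mu=+1, theta0=0.5, age=0, sigma=10.0)
print(nodes[k])                    # moved toward x on features 0 and 2 only
```

Note how feature 1 (λ = 0, f ∉ S_c) influences neither the distance nor the update, which is exactly the category-specific metrical “view” described above.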
B. Feature Scoring and Category Initialization
The learning dynamics of the cLVQ learning approach is organized in training epochs, where at each epoch only a limited amount of objects and their corresponding views are visible to the learning method. After each epoch some of the training vectors x^i and their corresponding target category values t^i are removed and replaced by the vectors of a new object. Therefore, for each training epoch the scoring values h_cf, used for guiding the feature selection process, are updated in the following way:

h_cf = H_cf / (H_cf + H̄_cf).  (4)
The variables H_cf and H̄_cf are the numbers of previously seen positive and negative training examples of category c for which the corresponding feature f was active (x^i_f > 0). For each newly inserted object view, the counter value H_cf is updated in the following way:

H_cf := H_cf + 1  if x^i_f > 0 and t^i_c = +1,  (5)

whereas H̄_cf is updated as follows:

H̄_cf := H̄_cf + 1  if x^i_f > 0 and t^i_c = −1.  (6)

The score h_cf defines the metrical weighting in the cLVQ representation space. We then choose λ_cf = h_cf for all f ∈ S_c and λ_cf = 0 otherwise.
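A minimal sketch of this counting and scoring scheme (Eqs. 4–6); the toy activity vectors are invented for illustration:

```python
def update_counts(H, Hbar, x, t_c):
    """Eqs. (5)-(6): per-feature counters of positive (H) and negative
    (Hbar) examples of one category in which the feature was active."""
    for f, value in enumerate(x):
        if value > 0:
            if t_c == +1:
                H[f] += 1
            elif t_c == -1:
                Hbar[f] += 1
    return H, Hbar

def score(H, Hbar, f):
    """Eq. (4): h_cf = H_cf / (H_cf + Hbar_cf)."""
    total = H[f] + Hbar[f]
    return H[f] / total if total > 0 else 0.0

H, Hbar = [0, 0], [0, 0]
update_counts(H, Hbar, x=[0.7, 0.0], t_c=+1)   # feature 0 active, positive view
update_counts(H, Hbar, x=[0.2, 0.9], t_c=-1)   # both features active, negative view
print(score(H, Hbar, 0))  # 1 / (1 + 1) = 0.5
```

A feature active almost exclusively in positive examples drives h_cf toward 1, which is why the selection process described above favors such features.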
For our learning architecture we assume that not all categories are known from the beginning, so that new categories can occur in each training epoch. Therefore, if a category c with the category label t^i_c = +1 occurs for the first time in the current training epoch, we initialize this category c with a single feature and one cLVQ node. We select the feature v_c = argmax_f (h_cf) with the largest scoring value and initialize S_c = {v_c}. As the initial cLVQ node we select the training vector x^i in which the selected feature v_c has the highest activation, i.e. w^{K+1} = x^q with x^q_{v_c} ≥ x^i_{v_c} for all i. The attached label vector is chosen as u^{K+1}_c = +1 and zero for all other categories.
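The initialization of a newly occurring category can be sketched as follows; the helper name and the toy views and scores are assumptions for this illustration:

```python
def init_category(views, scores):
    """Sketch of the category initialization: pick the feature with the
    largest score h_cf as v_c, and the view with the highest activation
    of v_c as the initial representative node (w^{K+1} = x^q)."""
    v_c = max(range(len(scores)), key=lambda f: scores[f])
    q = max(range(len(views)), key=lambda i: views[i][v_c])
    S_c = {v_c}                       # the new category starts with one feature
    return v_c, S_c, list(views[q])   # and one node w^{K+1}

views = [[0.1, 0.9], [0.8, 0.3]]      # positive views of the new category
scores = [0.2, 0.7]                   # h_cf values for the two features
v_c, S_c, new_node = init_category(views, scores)
print(v_c, new_node)  # feature 1 is selected; view 0 has its highest activation
```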
C. Learning Dynamics
All changes of the cLVQ network are based only on the limited and changing set of training vectors x^i. During a single learning epoch of the cLVQ method, an optimization loop is performed iteratively, as illustrated in Fig. 2. The basic concept behind this optimization loop is to apply small changes to the representation of erroneous categories by testing new features v_c and representation nodes w^k that may lead to a considerable performance increase for the current set of training vectors. A single run through the optimization loop is composed of the following processing steps:
Step 1: Feature Testing. For each category c with remaining errors, a new feature is temporarily added and tested. If a category c is not present in the current training set or is error-free, then no modification of its representation is applied. The feature selection itself is based on the observable training vectors x^i, the feature scoring values h_cf, and the e^+_cf values. The value e^+_cf is defined as the ratio of active feature entries (x^i_f > 0.0) for feature f among the positive training errors E^+_c of category c. The set E^+_c is calculated in the following way:

E^+_c = {i | t^i_c = +1 ∧ t^i_c ≠ u^{k_min(c)}_c(x^i)},  (7)

where t^i_c ∈ {−1, 0, +1} is defined as the target signal for x^i and u^{k_min(c)}_c is the label assigned to the winning node w^{k_min(c)}(x^i) of category c.
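As an illustration of Eq. (7) and the e^+_cf ratio, with invented labels and feature activations:

```python
def positive_errors_and_activity(T_c, winner_labels, X, f):
    """Eq. (7) and the e^+_cf ratio: indices of positive training errors
    for one category, and the fraction of them in which feature f is
    active (x^i_f > 0)."""
    E_plus = [i for i, (t, u) in enumerate(zip(T_c, winner_labels))
              if t == +1 and t != u]
    if not E_plus:
        return E_plus, 0.0
    active = sum(1 for i in E_plus if X[i][f] > 0.0)
    return E_plus, active / len(E_plus)

T_c = [+1, +1, -1, +1]             # target labels t^i_c
winner = [+1, -1, -1, -1]          # labels of the winning nodes
X = [[0.0], [0.9], [0.5], [0.7]]   # activity of one feature per view
E_plus, e_cf = positive_errors_and_activity(T_c, winner, X, f=0)
print(E_plus, e_cf)  # views 1 and 3 are positive errors; feature 0 active in both
```

A feature that is active in many of the remaining errors (high e^+_cf) is exactly the kind of candidate Eq. (8) below prefers.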
Fig. 2. Illustration of the cLVQ Optimization Loop. The basic idea of this optimization loop is to make small modifications to the representation of categories where categorization errors on the available training vectors occur. If the gain in categorization performance, based on all available training examples of category c, is above the insertion threshold, the modification is kept; otherwise it is retracted. (Flowchart: errors for category c occurred → start learning; select an erroneous vector; select and add a new feature → keep it if gain > ε_1, else delete it; insert the vector as a new node → keep it if gain > ε_2, else delete it; repeat until all errors are solved or no features are left; all errors solved for category c → stop learning.)

For the feature testing, a candidate v_c should be added to the category-specific feature set S_c that potentially improves the categorization performance of category c by having a high scoring value h_cf. Additionally, the feature candidate should also be very active in the remaining training errors of this category, in order to quickly resolve all remaining errors of this particular category. Therefore we choose:
v_c = argmax_{f ∉ S_c} (e^+_cf + h_cf),  (8)

and add S_c := S_c ∪ {v_c}. The added feature dimension modifies the cLVQ metrics by changing the decision boundaries of all Voronoi clusters assigned to category c, which potentially reduces the remaining categorization errors. Thus, based on all training vectors x^i, we calculate the actual categorization performance of the erroneous categories. If the performance increase for category c is larger than the prespecified threshold ε_1, the feature v_c is permanently added; otherwise it is removed and excluded from further training iterations of this epoch.
Furthermore, in rare cases the removal of already selected features is also possible. This is done if the total number of negative errors #E^−_c > #E^+_c, where E^−_c is defined analogously to E^+_c:

E^−_c = {i | t^i_c = −1 ∧ t^i_c ≠ u^{k_min(c)}_c(x^i)}.  (9)

The only difference is that in this case a feature f ∈ S_c is removed from the set of selected features S_c, and the performance gain is computed for the final decision on the removal.
Step 2: LVQ Node Testing. Similar to Step 1, we test new LVQ nodes only for erroneous categories. In contrast to the node insertion rule proposed in [7], where nodes are inserted for training vectors with the smallest distance to wrong winning nodes, we propose to insert new LVQ nodes based on the training vectors x^i with the most categorization errors. This leads to a more compact representation, because a single node typically improves the representation of several categories. In this optimization step we insert new representation nodes w^k until at least one new node is inserted for each erroneous category c. As categorization labels u^k for these nodes, only the correct target labels for the categorization errors are assigned. For all other categories c the corresponding u^k_c = 0, keeping all error-free categories unchanged.

Again we calculate the performance increase based on all currently available training vectors. If this increase for category c is above the threshold ε_2, we make no modifications to the LVQ node labels of the newly inserted nodes. Otherwise we set the labels u^k_c of this set of newly inserted nodes w^k to zero. If due to this evaluation step all u^k_c become zero, then we remove the corresponding w^k.

Step 3: Stop condition. If all remaining categorization errors for the current training set are resolved, or all possible features f of erroneous categories c have been tested, then we start the next training epoch. Otherwise we continue this optimization loop and test further feature candidates and LVQ representation nodes.
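The keep-or-retract logic shared by Steps 1 and 2 can be condensed into a small sketch; the toy "performance" measure and the function names are invented here, not the paper's actual evaluation:

```python
def try_modification(apply_fn, undo_fn, evaluate_fn, epsilon):
    """Keep-or-retract logic of the optimization loop: apply a tentative
    modification (new feature or new node), measure the performance gain
    on the current training set, and retract it if the gain is too small."""
    before = evaluate_fn()
    apply_fn()
    gain = evaluate_fn() - before
    if gain > epsilon:
        return True        # keep the feature/node
    undo_fn()
    return False           # retract it

# Toy state: a feature set whose "performance" is simply its size / 10.
features = set()
kept = try_modification(
    apply_fn=lambda: features.add("f3"),
    undo_fn=lambda: features.discard("f3"),
    evaluate_fn=lambda: len(features) / 10.0,
    epsilon=0.05,
)
print(kept, features)  # gain 0.1 > 0.05, so the tentative feature is kept
```

The same gain test gates both feature insertion (threshold ε_1) and node insertion (threshold ε_2), which is what keeps the allocated resources small.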
III. UNSUPERVISED BOOTSTRAPPING OF CATEGORY
REPRESENTATIONS
Our focus is the life-long learning of visual representations. For such learning tasks it is normally unsuitable to store all previously seen training vectors. Thus we decided that the learning during the bootstrapping phase is based only on the unlabeled training views and their estimated category labels, which is distinct from most commonly used semi-supervised learning methods. Before the cLVQ modifications are described in more detail, we first define the majority voting scheme used for the autonomous estimation of category labels for the unlabeled training views.
A. Autonomous Estimation of Category Labels
For the autonomous estimation of category labels we first
measure the network response for all available unlabeled
training views based on the previously supervised trained
category seed. For each individual object o in this current
training set we calculate the detection rates d
+
oc
= D
+
oc
/Q
o
and d
oc
= D
oc
/Q
o
, where the Q
o
is defined as the number
of unlabeled training views of object o. The measures d
+
oc
indicates how reliable the category c can be detected in the
views of object o, while the rate d
oc
indicates how probable
the category c is not present in these views. Furthermore we
count the number of object views indicating the presence
(D
+
oc
) and absence (D
oc
) of category c in the following way:
D
+
oc
:= D
+
oc
+ 1 if u
k
min
c
(x
i
) = +1 (10)
and
D
oc
:= D
oc
+ 1 if u
k
min
c
(x
i
) = 1, (11)
where the sum of D
+
oc
+ D
oc
= Q
o
.
Based on these detection rates and the predetermined thresholds ε^+ and ε^−, the target values t^i_c ∈ {−1, 0, +1} are estimated for all views of the same object. The assignment of the target values is done in the following way:

t^i_c = +1 if d^+_oc > ε^+;  t^i_c = −1 if d^+_oc ≤ ε^+ and d^−_oc > ε^−;  t^i_c = 0 otherwise.  (12)
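A compact sketch of this label estimation (Eqs. 10–12); the per-view winner outputs and the threshold values are invented for illustration:

```python
def estimate_label(predictions, eps_plus, eps_minus):
    """Eqs. (10)-(12): from the per-view winner outputs (+1/-1) for one
    object and category, compute the detection rates and the estimated
    target label t_c shared by all views of that object."""
    Q = len(predictions)
    D_plus = sum(1 for p in predictions if p == +1)    # Eq. (10)
    D_minus = sum(1 for p in predictions if p == -1)   # Eq. (11)
    d_plus, d_minus = D_plus / Q, D_minus / Q
    if d_plus > eps_plus:
        return +1
    if d_minus > eps_minus:        # reached only when d_plus <= eps_plus
        return -1
    return 0                       # unknown: views have no effect

# 8 of 10 tracked views detect the category: confident positive label.
print(estimate_label([+1] * 8 + [-1] * 2, eps_plus=0.7, eps_minus=0.7))
```

Inconclusive objects (neither rate above its threshold) yield t_c = 0, so their views simply do not touch the representation of category c.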
The selection of ε^+ and ε^− is crucial with respect to the potential performance gain of this bootstrapping phase. If these values are chosen too conservatively, many t^i_c become zero and the corresponding object views have no effect on the representation. On the contrary, the possibility of mislabeling increases if these values are low. In general our cLVQ approach is robust with respect to a smaller amount of mislabeled training vectors, because additional network resources are only allocated if the performance gain is above the insertion thresholds ε_1 and ε_2. Nevertheless, if the number of wrongly labeled training views becomes too large, the categorization performance can also decrease.
B. Modification of the cLVQ Learning Approach
For our first evaluation of the unsupervised bootstrapping of visual category representations, we keep the incremental learning approach as in [8]. Thus, in this bootstrapping phase the learning process is also subdivided into epochs, and the overall cLVQ learning dynamics is reused. This means the category representation is enhanced by making small changes to it, either by selecting new category-specific features or by allocating additional representation nodes. Furthermore, the same learning parameters are used: the learning rate Θ, the feature insertion threshold ε_1, and the node insertion threshold ε_2.
Although the same learning parameters are utilized, we still want to express the reliability of the autonomously estimated category labels. This means that if the reliability is low, only small changes with respect to the modification of existing nodes and the allocation of new category-specific features and representation nodes should be applied. To achieve this effect, all learning parameters are modulated based on the parameter r^i_oc ∈ [0, 1], defined as follows:

r^i_oc = d^+_oc if t^i_c = +1;  r^i_oc = d^−_oc if t^i_c = −1;  r^i_oc = 0 if t^i_c = 0.  (13)
The r^i_oc value is assigned to each unlabeled object view and is equal for all views of one physical object o.

For both insertion thresholds ε_1 and ε_2, this r^i_oc modulates the measurement of the performance gain after the insertion of a new feature v_c or representation node w^k. In the basic cLVQ, each erroneous training view that could be resolved by such a slight modification of the representation is counted as 1.0. In contrast, in the modified version of the cLVQ each resolved erroneous training view is counted as r^i_oc only. This means that the amount of training vectors necessary to reach the insertion threshold is inversely proportional to the corresponding r^i_oc values (e.g. if r^i_oc = 0.8 for all current training views, a factor of 1.25 more views is required compared to the basic cLVQ). The fundamental effect of the modulation of ε_1 and ε_2 is that it becomes distinctly more difficult to allocate new resources the more unreliable the corresponding estimated category labels become. Therefore the allocation of category-unspecific or even erroneous network resources should be strongly reduced.
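The reliability-weighted counting can be illustrated as follows; analogous to the factor-1.25 example in the text, the sketch uses an invented reliability of r = 0.5, giving a factor of 2:

```python
def weighted_gain(resolved_views, r):
    """Reliability-weighted counting: each resolved erroneous view is
    counted as r_oc instead of 1.0, so views with unreliable estimated
    labels need proportionally more evidence to reach the threshold."""
    return sum(r[i] for i in resolved_views)

# Five resolved views, all from objects with reliability 0.5, count
# like 2.5 fully labeled views in the basic cLVQ (factor 1/0.5 = 2).
r = {0: 0.5, 1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5}
print(weighted_gain([0, 1, 2, 3, 4], r))  # 2.5
```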
Also for the adaptation of the representation nodes w^k, the original cLVQ learning rule (see Eq. 2) is multiplied by r^i_oc. Besides the node-dependent learning rate Θ^{k_min(c)}, this modification guarantees the stability of the learned visual category representation. The update step for the winning node w^{k_min(c)} of category c is calculated as follows:

w_f^{k_min(c)} := w_f^{k_min(c)} + r^i_oc µ Θ^{k_min(c)} (x^i_f − w_f^{k_min(c)}),  ∀f ∈ S_c,  (14)

where r^i_oc is the reliability factor and µ indicates the correctness of the categorization decision.
Besides this reliability-weighted modulation of the learning parameters, the continuous update of the scoring values h_cf was deactivated for this bootstrapping phase, because these values are the most fragile with respect to errors in the estimation process of category labels. A larger amount of such errors could interfere strongly and globally with the previously trained category representations. This can cause a global performance decrease for all categories, whereas all other modifications, due to the allocation of new features and representation nodes, have only a local effect.
IV. EXPERIMENTAL RESULTS
A. Image Ensemble
As experimental setup we use an image database composed of 44 training and 33 test objects, as shown in Fig. 3. This image ensemble contains objects assigned to five different color and ten shape categories. Each object was rotated around the vertical axis in front of a black background. For each of the training and test objects, 300 views were collected. The views of all training objects are furthermore subdivided into labeled and unlabeled views, as illustrated at the bottom of Fig. 3. Out of all 300 views, 200 are used to train the seed of the category representation in a supervised manner, while the remaining 100 object views (view ranges 50–100 and 150–200) are used for the unsupervised bootstrapping of this representation. This separation into labeled and unlabeled object views means that for the autonomous bootstrapping the cLVQ has to generalize to a quite large unseen angular range of object views. Compared to a random sampling of the unlabeled object views this is more challenging, because for randomly selected views the appearance difference to already seen labeled views would be considerably smaller.
B. Feature Representation
For the representation of visual categories we combine simple color histograms with a parts-based feature representation, but we do not utilize this a priori separation for our category learning approach. Therefore for each object view all extracted features are concatenated into a single

References (partial list, as extracted)
- M. J. Swain and D. H. Ballard, “Color indexing,” International Journal of Computer Vision, 1991.
- G. A. Carpenter and S. Grossberg, “ART 2: self-organization of stable category recognition codes for analog input patterns,” Applied Optics, 1987.
- R. M. French, “Catastrophic forgetting in connectionist networks,” Trends in Cognitive Sciences, 1999.