Proceedings ArticleDOI

Towards autonomous bootstrapping for life-long learning categorization tasks

18 Jul 2010-pp 1-8
TL;DR: An exemplar-based learning approach for incremental and life-long learning of visual categories and it is argued that contextual information is beneficial for this process.
Abstract: We present an exemplar-based learning approach for incremental and life-long learning of visual categories. The basic concept of the proposed learning method is to subdivide the learning process into two phases. In the first phase we utilize supervised learning to generate an appropriate category seed, while in the second phase this seed is used to autonomously bootstrap the visual representation. This second learning phase is especially useful for assistive systems like a mobile robot, because the visual knowledge can be enhanced even if no tutor is present. Although for this autonomous bootstrapping no category labels are provided, we argue that contextual information is beneficial for this process. Finally we investigate the effect of the proposed second learning phase with respect to the overall categorization performance.

Summary (3 min read)

Introduction

  • In recent decades a wide variety of category learning paradigms have been proposed, ranging from generative [10], [14] to discriminative models [6], [18].
  • The major advantage of supervised over unsupervised learning is the higher categorization performance, while the time-consuming and costly collection of accurately labeled training data is its fundamental drawback.
  • In the context of incremental and life-long learning it has so far gained much less interest.
  • Afterwards the modifications of the basic cLVQ approach and the context-dependent estimation of category labels are described in Section III.

A. Distance Computation and Learning Rule

  • The authors use C to denote the current number of represented color and shape categories, whereas each t^i_c ∈ {−1, 0, +1} labels an x^i as a positive or negative example of category c.
  • Each w^k is attached to a label vector u^k, where u^k_c ∈ {−1, 0, +1} is the model target output for category c, representing positive, negative, and missing label output, respectively.
  • The weights λ_cf are set to 0 for all f ∉ S_c, and otherwise adjusted according to a scoring procedure explained later.
  • The age factor a^k is incremented every time the corresponding w^k becomes the winning node.

B. Feature Scoring and Category Initialization

  • The learning dynamics of the cLVQ approach is organized in training epochs, where at each epoch only a limited amount of objects and their corresponding views are visible to the learning method.
  • After each epoch some of the training vectors x^i and their corresponding target category values t^i are removed and replaced by vectors of a new object.
  • Therefore for each training epoch the scoring values h_cf, used for guiding the feature selection process, are updated as h_cf = H_cf / (H_cf + H̄_cf).
  • Therefore if category c with the category label t^i_c = +1 occurred for the first time in the current training epoch, the authors initialize this category c with a single feature and one cLVQ node.
  • The attached label vector is chosen as u^{K+1}_c = +1 and zero for all other categories.

C. Learning Dynamics

  • All changes of the cLVQ network are only based on the limited and changing set of training vectors x^i.
  • A single run through the optimization loop is composed of the following processing steps: Step 1: Feature Testing.
  • A feature f ∈ S_c is removed from the set of selected features S_c and the performance gain is computed for the final decision on the removal.
  • If all remaining categorization errors for the current training set are resolved, or all possible features f of erroneous categories c are tested, then the authors start the next training epoch.
  • Otherwise the authors continue this optimization loop and test further feature candidates and LVQ representation nodes.

A. Autonomous Estimation of Category Labels

  • For the autonomous estimation of category labels the authors first measure the network response for all available unlabeled training views based on the previously supervised trained category seed.
  • The measure d^+_oc indicates how reliably the category c can be detected in the views of object o, while the rate d^−_oc indicates how probable it is that the category c is not present in these views.
  • If these values are chosen too conservatively, many t^i_c become zero and the corresponding object views have no effect on the representation.
  • On the contrary, the possibility of mislabeling increases if these values are low.
  • In general their cLVQ approach is robust with respect to a smaller amount of mislabeled training vectors, because additional network resources are only allocated if the performance gain is above the insertion thresholds ǫ_1 and ǫ_2.

B. Modification of the cLVQ Learning Approach

  • For their first evaluation of the unsupervised bootstrapping of visual category representations the authors keep the incremental learning approach as in [8].
  • In contrast to this, for the modified version of the cLVQ each resolved erroneous training view is counted as r^i_oc only.
  • Besides the node-dependent learning rate Θ^{k_min(c)}, this modification guarantees the stability of the learned visual category representation.
  • This can cause a global performance decrease of all categories, while all other modifications due to the allocation of new features and representation nodes have only a local effect.
  • The views of all training objects are furthermore subdivided into labeled and unlabeled views as illustrated at the bottom of Fig. 3.

B. Feature Representation

  • For the representation of visual categories the authors combine simple color histograms with a parts-based feature representation, but they do not utilize this a priori separation for their category learning approach.
  • Therefore for each object view all extracted features are concatenated into a single structureless feature vector.
  • The authors use color histograms because they combine robustness against view and scale changes with computational efficiency [16].
  • The parts-based shape feature extraction [5] is based on a learned set of category-specific feature detectors that are based on SIFT descriptors [11].
  • This especially allows the representation of less structured categories.

C. Categorization Performance

  • As already mentioned, for the experimental evaluation of their semi-supervised category learning framework the training is split into two training phases.
  • In the second training phase the categories are bootstrapped based on the incremental presentation of the unlabeled training set.
  • These additionally allocated shape features are most probably the cause for the slight performance decrease of the color categories.
  • In this experiment the authors selected the optimal detection thresholds ǫ^+ = 0.5 and ǫ^− = 0.9 for the shape categories and investigate the effect of a continuously increasing set of additional object views with respect to the change in categorization performance.


Towards Autonomous Bootstrapping for
Life-long Learning Categorization Tasks
Stephan Kirstein, Heiko Wersing and Edgar Körner
I. INTRODUCTION
In recent decades a wide variety of category learn-
ing paradigms have been proposed, ranging from generative
[10], [14] to discriminative models [6], [18]. However, most
research on this topic has so far focused on supervised learn-
ing. The major advantage of supervised over unsupervised
learning is the higher categorization performance, while the
time-consuming and costly collection of accurately labeled
training data is its fundamental drawback. In the context of
assistive systems this means that whenever the system should
enhance its category representation a tutor has to specify
the corresponding labels. Although we consider the interac-
tion with a tutor as a necessary part of the early learning
phase, we want to enable the system to more and more
autonomously bootstrap its acquired category representation.
Therefore we investigate in this paper the combination of
semi-supervised and life-long learning to reduce the necessity
of tutor interactions.
The basic idea of semi-supervised learning is to com-
bine supervised with unsupervised learning [12], [2]. The
advantage of this combination is typically a considerably
higher performance compared to purely data driven unsuper-
vised methods, whereas the labeling effort can be strongly
reduced. Typically for semi-supervised learning the initial
representation is trained based on the labeled portion of the
training data. Afterwards this initial representation is utilized
to estimate the correct class labels for the unlabeled portion
of the training data. Commonly only unlabeled training
examples with high classifier confidence are used for the
Stephan Kirstein is with the Honda Research Institute Europe
GmbH, Carl-Legien-Strasse 30, 63073 Offenbach, Germany; (email:
stephan.kirstein@honda-ri.de).
Heiko Wersing is with the Honda Research Institute Europe GmbH,
(email: heiko.wersing@honda-ri.de).
Edgar Körner is with the Honda Research Institute Europe GmbH, (email:
edgar.koerner@honda-ri.de).
bootstrapping. This guarantees a low number of errors in
the estimated labels, but this data most probably is less
useful for enhancing the classifier performance, because it is
already well represented [17]. To overcome this limitation
semi-supervised learning can be extended by active learning
[13], [15], where the learning system requests the tutor-driven
labeling for the currently worst represented training data.
In contrast to this we propose to use temporal context
information to overcome this limitation rather than requesting
additional user interactions. To use the temporal context,
object views that belong to the same physical object have
to be identified first. In offline experiments this typically can
be easily achieved. For an autonomous system this requires
the tracking of the object over a longer period, so that it
is most probable that the corresponding views belong to
the same physical object. Based on this object view list a
majority voting can be applied. The advantage of such voting
is that not only already well represented views are added to
the training ensemble, but also currently wrongly categorized
views of the same object. We believe that such a combination
has the highest potential effect with respect to an increasing
categorization performance.
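The temporal-context voting described above can be sketched as follows; this is a minimal Python sketch where the function name and the example labels are illustrative, not from the paper:

```python
from collections import Counter

def vote_labels(view_predictions):
    """Assign the majority category to every view of one tracked
    physical object, so that currently wrongly categorized views also
    enter the training ensemble with the (likely correct) majority
    label."""
    majority, _ = Counter(view_predictions).most_common(1)[0]
    return [majority] * len(view_predictions)
```

For example, if three of four tracked views of an object are categorized as "cup" and one as "bottle", all four views would be labeled "cup", so the outlier view also contributes to the representation.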
Although semi-supervised learning is a common learn-
ing technique (see [19] for an overview), in the context
of incremental and life-long learning it has so far gained
much less interest. We consider the ability of increasing
the visual knowledge in a life-long learning fashion as a
basic requirement for an autonomous system. Nevertheless
combining semi-supervised with life-long learning is more
challenging compared to typical semi-supervised learning
approaches. This is because for life-long learning tasks the
learning method commonly has only access to a limited
amount of training data, so that the bootstrapping is normally
purely based on the unlabeled training views and their
autonomously assigned label information. This is in contrast
to typical semi-supervised approaches, where the labeled
and unlabeled training views are combined to one single
training set. Furthermore to cope with the “stability-plasticity
dilemma” [1] of life-long learning tasks on the one hand sta-
bility considerations are required to avoid the “catastrophic
forgetting effect” [3] of the learned representation, while
for the plasticity the allocation of new network resources is
necessary. It is obvious that this resource allocation is con-
siderably more difficult if the label information is unreliable
as this is the case for the unsupervised training data.
The paper is structured in the following way. In the next
Section II we briefly explain our category learning vector
quantization (cLVQ) framework. Afterwards the modifica-

[Fig. 1 diagram: a limited and changing training set of vectors x^i (positive
and negative representatives of several objects in a high-dimensional feature
space) is mapped by the cLVQ nodes w^1, ..., w^K onto low-dimensional
category-specific subspaces for categories 1, ..., C.]
Fig. 1. Illustration of the Category Learning Framework. The learning with our proposed category learning vector quantization (cLVQ) approach
is based on a limited and changing training set. Based on the currently available training vectors x^i and the corresponding target labels t^i the cLVQ
incrementally allocates new representation nodes and category-specific features. The selected feature sets for each category c enable an efficient separation
of co-occurring categories (e.g. if an object belongs to several categories, which is the standard setting in our experiments) and the definition of various
metrical “views” to a single node w^k. The categorization decision itself is based on the allocated cLVQ nodes w^k and the low-dimensional category-specific
feature spaces.
tions of the basic cLVQ approach and the context-dependent
estimation of category labels are described in Section III. In
Section IV the experimental results are summarized, and they
are discussed in Section V.
II. CATEGORY LEARNING VECTOR QUANTIZATION
Our proposed category learning approach [8] enables in-
teractive and life-long learning and therefore can be utilized
for autonomous systems, but so far we only considered
supervised learning based on interactions with a human
tutor. In the following we briefly describe the learning
framework as illustrated in Fig. 1. In the presented paper we
utilized this framework for creating the category seed in a
purely supervised fashion. The proposed learning approach
is basically based on an exemplar-based incremental learning
network combined with a forward feature selection method
to enable incremental and life-long learning of arbitrary
categories. Both parts are optimized together to find a balance
between the insertion of features and allocation of represen-
tation nodes, while using as little resources as possible. In the
following we refer to this architecture as category learning
vector quantization (cLVQ).
To achieve the interactive and incremental learning capa-
bility the exemplar-based network part of the cLVQ method
is used to approach the “stability-plasticity dilemma” of life-
long learning problems. Thus we define a node insertion
rule that automatically determines the number of required
representation nodes. The final number of allocated nodes
w^k and the assigned category labels u^k correspond to the
difficulty of the different categories itself, but also to the
within-category variance. Finally the long-term stability of
these incrementally learned nodes is considered based on an
individual node learning rate Θ^k as proposed in [7].
Additionally a category-specific forward feature selection
method is used to enable the separation of co-occurring cate-
gories, because it defines category-specific metrical “views”
on the representation nodes of the exemplar-based network.
During the learning process it selects low-dimensional sub-
sets of features by predominantly choosing features that
occur almost exclusively for this particular category. Fur-
thermore only these selected category-specific features are
used to decide whether a particular category is present or
not, as illustrated in Fig. 1. For guiding this selection process
a feature scoring value h_cf is calculated for each category c
and feature f. This scoring value is only based on previously
seen exemplars of a certain category, which can strongly
change if further information is encountered. Therefore a
continuous update of the h_cf values is required to follow
this change.

A. Distance Computation and Learning Rule
The learning in the cLVQ architecture is based on a
set of high-dimensional and sparse feature vectors x^i =
(x^i_1, ..., x^i_F), where F denotes the total number of features.
Each x^i is assigned to a list of category labels t^i =
(t^i_1, ..., t^i_C). We use C to denote the current number of
represented color and shape categories, whereas each t^i_c ∈
{−1, 0, +1} labels an x^i as positive or negative example of
category c. The third state t^i_c = 0 is interpreted as unknown
category membership, which means that all x^i with t^i_c = 0
have no influence on the representation of category c.
The cLVQ representative nodes w^k with k = 1, ..., K are
built up incrementally, where K denotes the current number
of allocated vectors w. Each w^k is attached to a label vector
u^k, where u^k_c ∈ {−1, 0, +1} is the model target output for
category c, representing positive, negative, and missing label
output, respectively. The winning nodes w^{k_min(c)}(x^i) are
calculated independently for each category c, where k_min(c)
is determined in the following way:

  k_min(c) = arg min_k Σ_{f=1}^F λ_cf (x^i_f − w^k_f)^2,  ∀ k with u^k_c ≠ 0,   (1)
where the category-specific weights λ_cf are updated contin-
uously, inspired by the generalized relevance LVQ proposed
by [4]. We denote the set of selected features for an active
category c ∈ C as S_c. We choose λ_cf = 0 for all f ∉ S_c,
and otherwise adjust it according to a scoring procedure
explained later. Each w^{k_min(c)}(x^i) is updated based on the
standard LVQ learning rule [9], but is restricted to feature
dimensions f ∈ S_c:

  w^{k_min(c)}_f := w^{k_min(c)}_f + μ Θ^{k_min(c)} (x^i_f − w^{k_min(c)}_f)  ∀ f ∈ S_c,   (2)
where μ = 1 if the categorization decision for x^i was correct;
otherwise μ = −1 and the winning node w^{k_min(c)} will be
shifted away from x^i. Additionally Θ^{k_min(c)} is the node-
dependent learning rate as proposed by [7]:

  Θ^{k_min(c)} = Θ_0 exp(−a^{k_min(c)} / σ).   (3)

Here Θ_0 is a predefined initial value, σ is a fixed scaling
factor, and a^k is an iteration-dependent age factor. The age
factor a^k is incremented every time the corresponding w^k
becomes the winning node.
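As a concrete illustration, the winning-node search of Eq. (1), the update rule of Eq. (2) and the age-dependent learning rate of Eq. (3) can be sketched as follows; this is a minimal list-based Python sketch with our own illustrative names, not the implementation used in the experiments:

```python
import math

def k_min(x, nodes, u, lam, c):
    """Eq. (1): among all nodes k whose label u[k][c] is non-zero,
    pick the node minimizing the lambda-weighted squared distance."""
    best_k, best_d = None, float("inf")
    for k, w in enumerate(nodes):
        if u[k][c] == 0:   # node carries no label for category c
            continue
        d = sum(lam[c][f] * (x[f] - w[f]) ** 2 for f in range(len(x)))
        if d < best_d:
            best_k, best_d = k, d
    return best_k

def node_learning_rate(theta0, age, sigma):
    """Eq. (3): the learning rate decays with the node's age a^k."""
    return theta0 * math.exp(-age / sigma)

def update_node(w, x, selected, rate, correct):
    """Eq. (2): move the winning node towards x if the categorization
    was correct (mu = 1), away from it otherwise (mu = -1), restricted
    to the selected feature dimensions S_c."""
    mu = 1.0 if correct else -1.0
    for f in selected:
        w[f] += mu * rate * (x[f] - w[f])
    return w
```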
B. Feature Scoring and Category Initialization
The learning dynamics of the cLVQ learning approach
is organized in training epochs, where at each epoch only
a limited amount of objects and their corresponding views
are visible to the learning method. After each epoch some
of the training vectors x^i and their corresponding target
category values t^i are removed and replaced by vectors of
a new object. Therefore for each training epoch the scoring
values h_cf, used for guiding the feature selection process,
are updated in the following way:

  h_cf = H_cf / (H_cf + H̄_cf).   (4)
The variables H_cf and H̄_cf are the numbers of previously
seen positive and negative training examples of category c
where the corresponding feature f was active (x_f > 0). For
each newly inserted object view, the counter value H_cf is
updated in the following way:

  H_cf := H_cf + 1  if x^i_f > 0 and t^i_c = +1,   (5)

where H̄_cf is updated as follows:

  H̄_cf := H̄_cf + 1  if x^i_f > 0 and t^i_c = −1.   (6)

The score h_cf defines the metrical weighting in the cLVQ
representation space. We then choose λ_cf = h_cf for all f ∈
S_c and λ_cf = 0 otherwise.
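The scoring update of Eqs. (4)-(6) amounts to simple per-feature counting; a minimal sketch with plain lists and hypothetical names:

```python
def update_counts(H, H_bar, x, t_c):
    """Eqs. (5)-(6): count, per feature, the positive and negative
    examples of the category in which the feature was active."""
    for f, value in enumerate(x):
        if value > 0:
            if t_c == +1:
                H[f] += 1
            elif t_c == -1:
                H_bar[f] += 1
    return H, H_bar

def score(H, H_bar, f):
    """Eq. (4): h_cf is the fraction of active occurrences of feature f
    that came from positive examples of the category."""
    return H[f] / (H[f] + H_bar[f])
```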
For our learning architecture we assume that not all cate-
gories are known from the beginning, so that new categories
can occur in each training epoch. Therefore if category c
with the category label t^i_c = +1 occurred for the first time
in the current training epoch, we initialize this category c
with a single feature and one cLVQ node. We select the
feature v_c = arg max_f (h_cf) with the largest scoring value
and initialize S_c = {v_c}. The training vector x^i is selected
as the initial cLVQ node, where the selected feature v_c has
the highest activation, i.e. w^{K+1} = x^q with x^q_{v_c} ≥ x^i_{v_c} for
all i. The attached label vector is chosen as u^{K+1}_c = +1 and
zero for all other categories.
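The initialization step above can be sketched as follows; the function name is ours, and the score vector h_c and training vectors X are assumed to be plain lists:

```python
def init_category(h_c, X):
    """Category seeding: select the feature v_c with the largest score
    h_cf, and the training vector with the highest activation of that
    feature as the first representation node w^{K+1}."""
    v_c = max(range(len(h_c)), key=lambda f: h_c[f])   # v_c = argmax_f h_cf
    q = max(range(len(X)), key=lambda i: X[i][v_c])    # most active vector
    return v_c, list(X[q])
```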
C. Learning Dynamics
All changes of the cLVQ network are only based on the
limited and changing set of training vectors x^i. During a
single learning epoch of the cLVQ method an optimization
loop is performed iteratively as illustrated in Fig. 2. The
basic concept behind this optimization loop is to apply small
changes to the representation of erroneous categories by
testing new features v_c and representation nodes w^k that may
lead to a considerable performance increase for the current
set of training vectors. A single run through the optimization
loop is composed of the following processing steps:

Step 1: Feature Testing. For each category c with remain-
ing errors a new feature is temporarily added and tested. If
a category c is not present in the current training set or is
error free then no modification to its representation is applied.
The feature selection itself is based on the observable training
vectors x^i, the feature scoring values h_cf and the e^+_cf values.
The e^+_cf is defined as the ratio of active feature entries
(x^i_f > 0.0) for feature f among the positive training errors
E^+_c of class c. The E^+_c is calculated in the following way:

  E^+_c = {i | t^i_c = +1 ∧ t^i_c ≠ u^{k_min(c)}_c(x^i)},   (7)
where the t^i_c ∈ {−1, 0, +1} is defined as target signal for
x^i and u^{k_min(c)}_c is the label assigned to the winning node
w^{k_min(c)}(x^i) of category c.

For the feature testing a candidate v_c should be added to
the category-specific feature set S_c that potentially improves

[Fig. 2 diagram: when errors for category c occur, learning starts; the loop
selects an erroneous vector, selects and adds a new feature (kept if the gain
exceeds ǫ_1, otherwise deleted) and inserts the vector as a new node (kept if
the gain exceeds ǫ_2, otherwise deleted), until all errors are solved for
category c or no features are left.]
Fig. 2. Illustration of the cLVQ Optimization Loop. The basic idea of
this optimization loop is to make small modifications to the representation of
categories where categorization errors on the available training vectors occur.
If the gain in categorization performance, based on all available training
examples of category c, is above the insertion threshold the modification is
kept, and otherwise it is retracted.
the categorization performance of category c by having a
high scoring value h_cf. Additionally the feature candidate
should also be very active in the remaining training errors of
this category, to quickly resolve all remaining errors of this
particular category. Therefore we choose:

  v_c = arg max_{f ∉ S_c} (e^+_cf + h_cf)   (8)

and add S_c := S_c ∪ {v_c}. The added feature dimension modi-
fies the cLVQ metrics by changing the decision boundaries of
all Voronoi clusters assigned to category c, which potentially
reduces the remaining categorization errors. Thus based on
all training vectors x^i we calculate the actual categorization
performance of the erroneous categories. If the performance
increase for category c is larger than the prespecified thresh-
old ǫ_1 the v_c is permanently added, and otherwise it is removed
and excluded for further training iterations of this epoch.
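Eq. (8) reduces to a straightforward arg-max over the not yet selected features; a minimal sketch with hypothetical names:

```python
def select_candidate(e_plus_c, h_c, selected):
    """Eq. (8): among features not yet in S_c, pick the one maximizing
    e+_cf + h_cf, i.e. a feature that is both active in the remaining
    errors and typical for the category."""
    candidates = [f for f in range(len(h_c)) if f not in selected]
    return max(candidates, key=lambda f: e_plus_c[f] + h_c[f])
```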
Furthermore in rare cases also the removal of already
selected features is possible. This is done if the total number
of negative errors #E^−_c > #E^+_c, where the E^−_c is, analogously
to E^+_c, defined as:

  E^−_c = {i | t^i_c = −1 ∧ t^i_c ≠ u^{k_min(c)}_c(x^i)}.   (9)

The only difference is that in this case a feature f ∈ S_c
is removed from the set of selected features S_c and the
performance gain is computed for the final decision on the
removal.
Step 2: LVQ Node Testing. Similar to Step 1 we test new
LVQ nodes only for erroneous categories. In contrast to the
node insertion rule proposed in [7], where nodes are inserted
for training vectors with smallest distance to wrong winning
nodes, we propose to insert new LVQ nodes based on training
vectors x^i with most categorization errors. This leads to a
more compact representation, because a single node typically
improves the representation of several categories. In this
optimization step we insert new representation nodes w^k
until for each erroneous category c at least one new node
is inserted. As categorization labels u^k for these nodes only
the correct target labels for the categorization errors are
assigned. For all other categories c the corresponding u^k_c = 0,
keeping all error-free categories unchanged.

Again we calculate the performance increase based on all
currently available training vectors. If this increase for cate-
gory c is above the threshold ǫ_2, we make no modifications
to the LVQ node labels of the newly inserted nodes. Otherwise
we set the labels u^k_c of this set of newly inserted nodes w^k
to zero. If due to this evaluation step all u^k_c become zero
then we remove the corresponding w^k.

Step 3: Stop condition. If all remaining categorization
errors for the current training set are resolved or all possible
features f of erroneous categories c are tested, then we
start the next training epoch. Otherwise we continue this
optimization loop and test further feature candidates and
LVQ representation nodes.
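The three steps above can be summarized structurally as follows; the callbacks stand in for the feature and node tests of Steps 1 and 2 and are placeholders, not the authors' implementation:

```python
def optimization_loop(erroneous_categories, try_feature, try_node, eps1, eps2):
    """Structural sketch of one pass through the optimization loop.
    try_feature(c) / try_node(c) are hypothetical callbacks returning
    (performance_gain, candidate); a modification is kept only if its
    gain exceeds the corresponding insertion threshold."""
    kept_features, kept_nodes = {}, {}
    for c in erroneous_categories:
        gain, v_c = try_feature(c)      # Step 1: feature testing
        if gain > eps1:
            kept_features[c] = v_c
        gain, w_new = try_node(c)       # Step 2: LVQ node testing
        if gain > eps2:
            kept_nodes[c] = w_new
    # Step 3: the caller repeats the loop until all errors are solved
    # or no feature candidates remain.
    return kept_features, kept_nodes
```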
III. UNSUPERVISED BOOTSTRAPPING OF CATEGORY
REPRESENTATIONS
Our focus is the life-long learning of visual representa-
tions. For such learning tasks it is normally unsuitable to
store all previously seen training vectors. Thus we decided
that the learning during the bootstrapping phase is only based
on unlabeled training views and their estimated category
labels, which is distinct from most commonly used semi-
supervised learning methods. Before the cLVQ modifications
are described in more detail, we first define the majority vot-
ing scheme used for the autonomous estimation of category
labels for the unlabeled training views.
A. Autonomous Estimation of Category Labels
For the autonomous estimation of category labels we first
measure the network response for all available unlabeled
training views based on the previously supervised trained
category seed. For each individual object o in this current
training set we calculate the detection rates d^+_oc = D^+_oc / Q_o
and d^−_oc = D^−_oc / Q_o, where Q_o is defined as the number
of unlabeled training views of object o. The measure d^+_oc
indicates how reliably the category c can be detected in the
views of object o, while the rate d^−_oc indicates how probable
it is that the category c is not present in these views. Furthermore we
count the number of object views indicating the presence
(D^+_oc) and absence (D^−_oc) of category c in the following way:

  D^+_oc := D^+_oc + 1  if u^{k_min(c)}_c(x^i) = +1   (10)

and

  D^−_oc := D^−_oc + 1  if u^{k_min(c)}_c(x^i) = −1,   (11)

where the sum D^+_oc + D^−_oc = Q_o.

Based on these detection rates and the predetermined
thresholds ǫ^+ and ǫ^− the correct target values t^i_c ∈
{−1, 0, +1} are estimated for all views of the same object.

The assignment of the target values is done in the following
way:

  t^i_c = +1 : if d^+_oc > ǫ^+,
         −1 : if d^+_oc ≤ ǫ^+ and d^−_oc > ǫ^−,
          0 : else.   (12)

The selection of ǫ^+ and ǫ^− is crucial with respect to the
potential performance gain of this bootstrapping phase. If
these values are chosen too conservatively, many t^i_c become
zero and the corresponding object views have no effect
on the representation. On the contrary the possibility of
mislabeling increases if these values are low. In general our
cLVQ approach is robust with respect to a smaller amount
of mislabeled training vectors, because additional network
resources are only allocated if the performance gain is above
the insertion thresholds ǫ_1 and ǫ_2. Nevertheless if the number
of wrongly labeled training views becomes too large the
categorization performance can possibly also decrease.
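Eqs. (10)-(12) can be sketched as a single function operating on the winning-node outputs of one object's views; names are illustrative, not from the paper:

```python
def estimate_object_label(view_outputs, eps_plus, eps_minus):
    """Derive one shared target value t_c for all views of an object
    from the per-view winning-node outputs (+1 or -1)."""
    q = len(view_outputs)
    d_plus = sum(1 for out in view_outputs if out == +1) / q   # Eq. (10)
    d_minus = sum(1 for out in view_outputs if out == -1) / q  # Eq. (11)
    if d_plus > eps_plus:          # Eq. (12), first case
        return +1
    if d_minus > eps_minus:        # here d_plus <= eps_plus already holds
        return -1
    return 0
```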
B. Modification of the cLVQ Learning Approach
For our first evaluation of the unsupervised bootstrapping
of visual category representations we keep the incremental
learning approach as in [8]. Thus also in this bootstrapping
phase the learning process is subdivided into epochs and
also the overall cLVQ learning dynamics is reused. This
means the category representation is enhanced by making
small changes to the category representation, by selecting new
category-specific features or by allocating additional repre-
sentation nodes. Furthermore the same learning parameters,
like the learning rate Θ, the feature insertion threshold ǫ_1
and the node insertion threshold ǫ_2, are used.

Although the same learning parameters are utilized we still
want to express the reliability of the autonomously estimated
category labels. This means if the reliability is low, only
small changes with respect to the modification of existing
nodes, the allocation of new category-specific features and
representation nodes should be applied. To achieve this effect
all learning parameters are modulated based on the parameter
r^i_oc ∈ [0, 1] that is defined as follows:

  r^i_oc = d^+_oc : if t^i_c = +1,
           d^−_oc : if t^i_c = −1,
           0 : if t^i_c = 0.   (13)
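Eq. (13) in code form; this is a hypothetical helper, with the detection rates being those of Eqs. (10) and (11):

```python
def reliability(t_c, d_plus, d_minus):
    """Eq. (13): an autonomously estimated label inherits the detection
    rate that produced it as its reliability; unlabeled views get 0."""
    if t_c == +1:
        return d_plus
    if t_c == -1:
        return d_minus
    return 0.0
```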
The r^i_oc value is assigned to each unlabeled object view and
is equal for all views of one physical object o.

For both insertion thresholds ǫ_1 and ǫ_2 this r^i_oc modulates
the measurement of the performance gain after the insertion
of a new feature v_c or representation node w^k. In the
basic cLVQ each erroneous training view that could be
resolved by such a slight modification of the representation
is counted with 1.0. In contrast to this, for the modified
version of the cLVQ each resolved erroneous training view is
counted as r^i_oc only. This means that the required amount of
training vectors, necessary to reach the insertion threshold,
is inversely proportional to the corresponding r^i_oc values
(e.g. if for all current training views r^i_oc = 0.8, a factor
of 1.25 more views are required compared to the basic cLVQ).
The fundamental effect of the modulation of ǫ_1 and ǫ_2 is
that it becomes distinctly more difficult to allocate new
resources the more unreliable the corresponding estimated
category labels become. Therefore the allocation of category-
unspecific or even erroneous network resources should be
strongly reduced.
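The weighted counting described above, including the 1.25-factor example, can be illustrated with two hypothetical helpers:

```python
def effective_gain(resolved_reliabilities):
    """In the bootstrapping phase each resolved erroneous view counts
    only with its reliability r (instead of 1.0 in the basic cLVQ)."""
    return sum(resolved_reliabilities)

def views_needed(basic_views, r):
    """With a constant reliability r, reaching the same insertion
    evidence requires 1/r times as many views; r = 0.8 gives the
    factor 1.25 mentioned in the text."""
    return basic_views / r
```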
Also for the adaptation of the representation nodes w^k the
original cLVQ learning rule (see Eq. 2) is multiplied with
r^i_oc. Besides the node-dependent learning rate Θ^{k_min(c)}, this
modification guarantees the stability of the learned visual
category representation. The update step for the winning
node w^{k_min(c)} of category c is calculated as follows:

  w^{k_min(c)}_f := w^{k_min(c)}_f + r^i_oc μ Θ^{k_min(c)} (x^i_f − w^{k_min(c)}_f)  ∀ f ∈ S_c,   (14)

where r^i_oc is the reliability factor and μ indicates the
correctness of the categorization decision.
Besides this modulation of the learning parameters,
weighted with the reliability, the continuous update of the scor-
ing values h_cf was deactivated for this bootstrapping phase,
because these values are most fragile with respect to errors
in the estimation process of category labels. A larger amount
of such errors could strongly interfere globally with the
previously trained category representations. This can cause
a global performance decrease of all categories, while all
other modifications due to the allocation of new features and
representation nodes have only a local effect.
IV. EXPERIMENTAL RESULTS
A. Image Ensemble
As experimental setup we use an image database composed of 44 training and 33 test objects as shown in Fig. 3. This image ensemble contains objects assigned to five different color and ten shape categories. Each object was rotated around the vertical axis in front of a black background, and for each training and test object 300 views were collected. The views of all training objects are furthermore subdivided into labeled and unlabeled views as illustrated at the bottom of Fig. 3. Out of the 300 views, 200 are used to train the seed of the category representation in a supervised manner, while the remaining 100 object views (view ranges 50–100 and 150–200) are used for the unsupervised bootstrapping of this representation. This separation into labeled and unlabeled object views means that for the autonomous bootstrapping the cLVQ has to generalize to a quite large unseen angular range of object views. Compared to a random sampling of the unlabeled object views this is more challenging, because for randomly selected views the appearance difference to already seen labeled views would be considerably smaller.
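The labeled/unlabeled split of the 300 views per object can be written down explicitly (a sketch of the protocol described above; variable names are ours):

```python
# 300 views per object; the angular ranges 50-100 and 150-200 are held
# out as the unlabeled set for autonomous bootstrapping, the rest is
# used for the supervised training of the category seed.
ALL_VIEWS = range(300)
UNLABELED_RANGES = [(50, 100), (150, 200)]

unlabeled = [v for v in ALL_VIEWS
             if any(lo <= v < hi for lo, hi in UNLABELED_RANGES)]
labeled = [v for v in ALL_VIEWS if v not in set(unlabeled)]
```

This yields 200 labeled and 100 unlabeled views per object, with the unlabeled views forming two contiguous unseen angular ranges rather than a random subsample.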
B. Feature Representation
For the representation of visual categories we combine simple color histograms with a parts-based feature representation, but we do not utilize this a priori separation in our category learning approach. Therefore, for each object view all extracted features are concatenated into a single

Citations
Proceedings ArticleDOI
24 Dec 2012
TL;DR: This paper presents a system that gives a robot the ability to diminish its own disturbing noise by utilizing template-based ego noise estimation, an algorithm previously developed by the authors.
Abstract: This paper presents a system that gives a robot the ability to diminish its own disturbing noise (i.e., ego noise) by utilizing template-based ego noise estimation, an algorithm previously developed by the authors. In pursuit of an autonomous, online and adaptive template learning system in this work, we specifically focus on eliminating the requirement of an offline training session performed in advance to build the essential templates, which represent the ego noise. The idea of discriminating ego noise from all other sound sources in the environment enables the robot to learn the templates online without requiring any prior information. Based on the directionality/diffuseness of the sound sources, the robot can easily decide whether the template should be discarded because it is corrupted by external noises, or it should be inserted into the database because the template consists of pure ego noise only. Furthermore, we aim to update the template database optimally by introducing an additional time-variant forgetting factor parameter, which provides a balance between adaptivity and stability of the learning process automatically. Moreover, we enhanced the single-channel noise estimation system to be compatible with the multi-channel robot audition framework so that ego noise can be eliminated from all signals stemming from multiple sound sources respectively. We demonstrate that the proposed system allows the robot to have the ability of online template learning as well as a high performance of noise estimation and suppression for multiple sound sources.

13 citations


Cites methods from "Towards autonomous bootstrapping fo..."

  • ...The learning process, which continues over the entire lifespan of a robot without human intervention, is called life-long learning and it is successfully applied for various tasks such as robot navigation/manipulation [9] and object recognition/categorization [10] in robotics....


Journal ArticleDOI
TL;DR: An autonomous and local neural learning algorithm termed PROPRE (projection-prediction) that updates induced representations based on predictability that is computationally efficient and stable, and that the multimodal transfer of feature selectivity is successful and robust under resource constraints.

11 citations

Proceedings ArticleDOI
06 Jul 2014
TL;DR: The modulation of the projection learning by the predictability measure improves significantly classification performances of the system independently of the measure used, and multiple generic predictability measures are proposed.
Abstract: PROPRE is a generic and modular unsupervised neural learning paradigm that extracts meaningful concepts of multimodal data flows based on predictability across modalities. It consists on the combination of three modules. First, a topological projection of each data flow on a self-organizing map. Second, a decentralized prediction of each projection activity from each others map activities. Third, a predictability measure that compares predicted and real activities. This measure is used to modulate the projection learning so that to favor the mapping of predictable stimuli across modalities. In this article, we use Kohonen map for the projection module, linear regression for the prediction one and we propose multiple generic predictability measures. We illustrate the properties and performances of PROPRE paradigm on a challenging supervised classification task of visual pedestrian data. The modulation of the projection learning by the predictability measure improves significantly classification performances of the system independently of the measure used. Moreover, PROPRE provides a combination of interesting functional properties, such as a dynamical adaptation to input statistic variations, that is rarely available in other machine learning algorithms.

6 citations


Cites background from "Towards autonomous bootstrapping fo..."

  • ...Hence, autonomous and progressive construction of sensory-motor representations is currently an active research field in developmental robotics [2], [3], [4]....


Proceedings ArticleDOI
13 Oct 2014
TL;DR: The self-evaluation module of PROPRE is improved, by introducing a sliding threshold, and applied to the unsupervised classification of gestures caught from two time-of-flight (ToF) cameras to illustrate that the modulation mechanism is still useful although less efficient than purely supervised learning.
Abstract: PROPRE is a generic and modular neural learning paradigm that autonomously extracts meaningful concepts of multimodal data flows driven by predictability across modalities in an unsupervised, incremental and online way. For that purpose, PROPRE consists of the combination of projection and prediction. Firstly, each data flow is topologically projected with a self-organizing map, largely inspired from the Kohonen model. Secondly, each projection is predicted by each other map activities, by mean of linear regressions. The main originality of PROPRE is the use of a simple and generic predictability measure that compares predicted and real activities for each modal stream. This measure drives the corresponding projection learning to favor the mapping of predictable stimuli across modalities at the system level (i.e. that their predictability measure overcomes some threshold). This predictability measure acts as a self-evaluation module that tends to bias the representations extracted by the system so that to improve their correlations across modalities. We already showed that this modulation mechanism is able to bootstrap representation extraction from previously learned representations with artificial multimodal data related to basic robotic behaviors and improves performance of the system for classification of visual data within a supervised learning context. In this article, we improve the self-evaluation module of PROPRE, by introducing a sliding threshold, and apply it to the unsupervised classification of gestures caught from two time- of-flight (ToF) cameras. In this context, we illustrate that the modulation mechanism is still useful although less efficient than purely supervised learning.

5 citations


Cites background from "Towards autonomous bootstrapping fo..."

  • ...This work fits in the currently active research on autonomous and progressive construction of sensory-motor representations in the developmental robotics field [3], [4], [5]....


Proceedings ArticleDOI
01 Nov 2012
TL;DR: An online learning method that simultaneously discovers “meaningful” concepts in the associated processing streams is described, extending methods such as PCA, SOM or sparse coding to the multimodal case and finding that those concepts which are predictable from other modalities successively “grow”, i.e., become over-represented, whereas concepts that are not predictable become systematically under-represented.
Abstract: This study is conducted in the context of developmental learning in embodied agents who have multiple data sources (sensors) at their disposal We describe an online learning method that simultaneously discovers “meaningful” concepts in the associated processing streams, extending methods such as PCA, SOM or sparse coding to the multimodal case In addition to the avoidance of redundancies in the concepts derived from single modalities, we claim that “meaningful” concepts are those who have statistical relations across modalities This is a reasonable claim because measurements by different sensors often have common cause in the external world and therefore carry correlated information To capture such cross-modal relations while avoiding redundancy of concepts, we propose a set of interacting self-organization processes which are modulated by local predictability To validate the fundamental applicability of the method, we conduct a plausible simulation experiment with synthetic data and find that those concepts which are predictable from other modalities successively “grow”, ie, become over-represented, whereas concepts that are not predictable become systematically under-represented We conclude the article by a discussion of applicability in real-world robotics scenarios

5 citations


Cites background from "Towards autonomous bootstrapping fo..."

  • ...I. INTRODUCTION The autonomous formation of representations is a currently very active research topic in developmental robotics[1], [2], [3], [4]....


References
Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.

46,906 citations

01 Jan 1967
TL;DR: The k-means algorithm as mentioned in this paper partitions an N-dimensional population into k sets on the basis of a sample, which is a generalization of the ordinary sample mean, and it is shown to give partitions which are reasonably efficient in the sense of within-class variance.
Abstract: The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets on the basis of a sample. The process, which is called 'k-means,' appears to give partitions which are reasonably efficient in the sense of within-class variance. That is, if p is the probability mass function for the population, S = {S1, S2, * *, Sk} is a partition of EN, and ui, i = 1, 2, * , k, is the conditional mean of p over the set Si, then W2(S) = ff=ISi f z u42 dp(z) tends to be low for the partitions S generated by the method. We say 'tends to be low,' primarily because of intuitive considerations, corroborated to some extent by mathematical analysis and practical computational experience. Also, the k-means procedure is easily programmed and is computationally economical, so that it is feasible to process very large samples on a digital computer. Possible applications include methods for similarity grouping, nonlinear prediction, approximating multivariate distributions, and nonparametric tests for independence among several variables. In addition to suggesting practical classification methods, the study of k-means has proved to be theoretically interesting. The k-means concept represents a generalization of the ordinary sample mean, and one is naturally led to study the pertinent asymptotic behavior, the object being to establish some sort of law of large numbers for the k-means. This problem is sufficiently interesting, in fact, for us to devote a good portion of this paper to it. The k-means are defined in section 2.1, and the main results which have been obtained on the asymptotic behavior are given there. The rest of section 2 is devoted to the proofs of these results. Section 3 describes several specific possible applications, and reports some preliminary results from computer experiments conducted to explore the possibilities inherent in the k-means idea. The extension to general metric spaces is indicated briefly in section 4. 
The original point of departure for the work described here was a series of problems in optimal classification (MacQueen [9]) which represented special

24,320 citations

Proceedings ArticleDOI
01 Dec 2001
TL;DR: A machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates and the introduction of a new image representation called the "integral image" which allows the features used by the detector to be computed very quickly.
Abstract: This paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the "integral image" which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers. The third contribution is a method for combining increasingly more complex classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. The cascade can be viewed as an object specific focus-of-attention mechanism which unlike previous approaches provides statistical guarantees that discarded regions are unlikely to contain the object of interest. In the domain of face detection the system yields detection rates comparable to the best previous systems. Used in real-time applications, the detector runs at 15 frames per second without resorting to image differencing or skin color detection.

18,620 citations


"Towards autonomous bootstrapping fo..." refers background in this paper

  • ...In the recent decades a wide variety of category learning paradigms have been proposed ranging from generative [10], [14] to discriminative models [6], [18]....


Book
01 Jan 1984
TL;DR: The purpose and nature of Biological Memory, as well as some of the aspects of Memory Aspects, are explained.
Abstract: 1. Various Aspects of Memory.- 1.1 On the Purpose and Nature of Biological Memory.- 1.1.1 Some Fundamental Concepts.- 1.1.2 The Classical Laws of Association.- 1.1.3 On Different Levels of Modelling.- 1.2 Questions Concerning the Fundamental Mechanisms of Memory.- 1.2.1 Where Do the Signals Relating to Memory Act Upon?.- 1.2.2 What Kind of Encoding is Used for Neural Signals?.- 1.2.3 What are the Variable Memory Elements?.- 1.2.4 How are Neural Signals Addressed in Memory?.- 1.3 Elementary Operations Implemented by Associative Memory.- 1.3.1 Associative Recall.- 1.3.2 Production of Sequences from the Associative Memory.- 1.3.3 On the Meaning of Background and Context.- 1.4 More Abstract Aspects of Memory.- 1.4.1 The Problem of Infinite-State Memory.- 1.4.2 Invariant Representations.- 1.4.3 Symbolic Representations.- 1.4.4 Virtual Images.- 1.4.5 The Logic of Stored Knowledge.- 2. Pattern Mathematics.- 2.1 Mathematical Notations and Methods.- 2.1.1 Vector Space Concepts.- 2.1.2 Matrix Notations.- 2.1.3 Further Properties of Matrices.- 2.1.4 Matrix Equations.- 2.1.5 Projection Operators.- 2.1.6 On Matrix Differential Calculus.- 2.2 Distance Measures for Patterns.- 2.2.1 Measures of Similarity and Distance in Vector Spaces.- 2.2.2 Measures of Similarity and Distance Between Symbol Strings.- 2.2.3 More Accurate Distance Measures for Text.- 3. Classical Learning Systems.- 3.1 The Adaptive Linear Element (Adaline).- 3.1.1 Description of Adaptation by the Stochastic Approximation.- 3.2 The Perceptron.- 3.3 The Learning Matrix.- 3.4 Physical Realization of Adaptive Weights.- 3.4.1 Perceptron and Adaline.- 3.4.2 Classical Conditioning.- 3.4.3 Conjunction Learning Switches.- 3.4.4 Digital Representation of Adaptive Circuits.- 3.4.5 Biological Components.- 4. 
A New Approach to Adaptive Filters.- 4.1 Survey of Some Necessary Functions.- 4.2 On the "Transfer Function" of the Neuron.- 4.3 Models for Basic Adaptive Units.- 4.3.1 On the Linearization of the Basic Unit.- 4.3.2 Various Cases of Adaptation Laws.- 4.3.3 Two Limit Theorems.- 4.3.4 The Novelty Detector.- 4.4 Adaptive Feedback Networks.- 4.4.1 The Autocorrelation Matrix Memory.- 4.4.2 The Novelty Filter.- 5. Self-Organizing Feature Maps.- 5.1 On the Feature Maps of the Brain.- 5.2 Formation of Localized Responses by Lateral Feedback.- 5.3 Computational Simplification of the Process.- 5.3.1 Definition of the Topology-Preserving Mapping.- 5.3.2 A Simple Two-Dimensional Self-Organizing System.- 5.4 Demonstrations of Simple Topology-Preserving Mappings.- 5.4.1 Images of Various Distributions of Input Vectors.- 5.4.2 "The Magic TV".- 5.4.3 Mapping by a Feeler Mechanism.- 5.5 Tonotopic Map.- 5.6 Formation of Hierarchical Representations.- 5.6.1 Taxonomy Example.- 5.6.2 Phoneme Map.- 5.7 Mathematical Treatment of Self-Organization.- 5.7.1 Ordering of Weights.- 5.7.2 Convergence Phase.- 5.8 Automatic Selection of Feature Dimensions.- 6. 
Optimal Associative Mappings.- 6.1 Transfer Function of an Associative Network.- 6.2 Autoassociative Recall as an Orthogonal Projection.- 6.2.1 Orthogonal Projections.- 6.2.2 Error-Correcting Properties of Projections.- 6.3 The Novelty Filter.- 6.3.1 Two Examples of Novelty Filter.- 6.3.2 Novelty Filter as an Autoassociative Memory.- 6.4 Autoassociative Encoding.- 6.4.1 An Example of Autoassociative Encoding.- 6.5 Optimal Associative Mappings.- 6.5.1 The Optimal Linear Associative Mapping.- 6.5.2 Optimal Nonlinear Associative Mappings.- 6.6 Relationship Between Associative Mapping, Linear Regression, and Linear Estimation.- 6.6.1 Relationship of the Associative Mapping to Linear Regression.- 6.6.2 Relationship of the Regression Solution to the Linear Estimator.- 6.7 Recursive Computation of the Optimal Associative Mapping.- 6.7.1 Linear Corrective Algorithms.- 6.7.2 Best Exact Solution (Gradient Projection).- 6.7.3 Best Approximate Solution (Regression).- 6.7.4 Recursive Solution in the General Case.- 6.8 Special Cases.- 6.8.1 The Correlation Matrix Memory.- 6.8.2 Relationship Between Conditional Averages and Optimal Estimator.- 7. Pattern Recognition.- 7.1 Discriminant Functions.- 7.2 Statistical Formulation of Pattern Classification.- 7.3 Comparison Methods.- 7.4 The Subspace Methods of Classification.- 7.4.1 The Basic Subspace Method.- 7.4.2 The Learning Subspace Method (LSM).- 7.5 Learning Vector Quantization.- 7.6 Feature Extraction.- 7.7 Clustering.- 7.7.1 Simple Clustering (Optimization Approach).- 7.7.2 Hierarchical Clustering (Taxonomy Approach).- 7.8 Structural Pattern Recognition Methods.- 8. 
More About Biological Memory.- 8.1 Physiological Foundations of Memory.- 8.1.1 On the Mechanisms of Memory in Biological Systems.- 8.1.2 Structural Features of Some Neural Networks.- 8.1.3 Functional Features of Neurons.- 8.1.4 Modelling of the Synaptic Plasticity.- 8.1.5 Can the Memory Capacity Ensue from Synaptic Changes?.- 8.2 The Unified Cortical Memory Model.- 8.2.1 The Laminar Network Organization.- 8.2.2 On the Roles of Interneurons.- 8.2.3 Representation of Knowledge Over Memory Fields.- 8.2.4 Self-Controlled Operation of Memory.- 8.3 Collateral Reading.- 8.3.1 Physiological Results Relevant to Modelling.- 8.3.2 Related Modelling.- 9. Notes on Neural Computing.- 9.1 First Theoretical Views of Neural Networks.- 9.2 Motives for the Neural Computing Research.- 9.3 What Could the Purpose of the Neural Networks be?.- 9.4 Definitions of Artificial "Neural Computing" and General Notes on Neural Modelling.- 9.5 Are the Biological Neural Functions Localized or Distributed?.- 9.6 Is Nonlinearity Essential to Neural Computing?.- 9.7 Characteristic Differences Between Neural and Digital Computers.- 9.7.1 The Degree of Parallelism of the Neural Networks is Still Higher than that of any "Massively Parallel" Digital Computer.- 9.7.2 Why the Neural Signals Cannot be Approximated by Boolean Variables.- 9.7.3 The Neural Circuits do not Implement Finite Automata.- 9.7.4 Undue Views of the Logic Equivalence of the Brain and Computers on a High Level.- 9.8 "Connectionist Models".- 9.9 How can the Neural Computers be Programmed?.- 10. Optical Associative Memories.- 10.1 Nonholographic Methods.- 10.2 General Aspects of Holographic Memories.- 10.3 A Simple Principle of Holographic Associative Memory.- 10.4 Addressing in Holographic Memories.- 10.5 Recent Advances of Optical Associative Memories.- Bibliography on Pattern Recognition.- References.

8,197 citations

Frequently Asked Questions (10)
Q1. What is the basic concept behind this optimization loop?

The basic concept behind this optimization loop is to apply small changes to the representation of erroneous categories by testing new features $v_c$ and representation nodes $w^k$ that may lead to a considerable performance increase for the current set of training vectors.

The learning in the cLVQ architecture is based on a set of high-dimensional and sparse feature vectors $x^i = (x^i_1, \dots, x^i_F)$, where $F$ denotes the total number of features.

Furthermore, the same learning parameters are used, such as the learning rate $\Theta$, the feature insertion threshold $\epsilon_1$ and the node insertion threshold $\epsilon_2$.

Additionally, the fluctuations in the feature responses of the extracted parts-based features are larger during object rotation than those of the color features, so that the unlabeled object views contain further information with respect to the representation of shape categories.

Their proposed category learning approach [8] enables interactive and life-long learning and therefore can be utilized for autonomous systems, but so far the authors have only considered supervised learning based on interactions with a human tutor.

The authors selected a distinctly smaller range for the threshold $\epsilon^-$ because, due to the selection of low-dimensional feature sets, the rejection of categories is typically nearly perfect.

The update step for the winning node $w^{k_{\min}(c)}$ of category $c$ is calculated as follows: $w^{k_{\min}(c)}_f := w^{k_{\min}(c)}_f + r^i_{oc}\,\mu\,\Theta^{k_{\min}(c)}\,(x^i_f - w^{k_{\min}(c)}_f)\ \forall f \in S_c$ (14), where $r^i_{oc}$ is the reliability factor and $\mu$ indicates the correctness of the categorization decision.

Besides this reliability-weighted modulation of the learning parameters, the continuous update of the scoring values $h_{cf}$ was deactivated for this bootstrapping phase, because these values are the most fragile with respect to errors in the estimation of category labels.

For each newly inserted object view, the counter value $H_{cf}$ is updated in the following way: $H_{cf} := H_{cf} + 1$ if $x^i_f > 0$ and $t^i_c = +1$ (5), where $\bar{H}_{cf}$ is updated analogously.

Therefore for each training epoch the scoring values $h_{cf}$, used for guiding the feature selection process, are updated in the following way: $h_{cf} = H_{cf} / (H_{cf} + \bar{H}_{cf})$.