Journal ArticleDOI

# Compact Q-Learning Optimized for Micro-robots with Processing and Memory Constraints

31 Aug 2004-Robotics and Autonomous Systems (North-Holland)-Vol. 48, Iss: 1, pp 49-61

TL;DR: This paper proposes a simplified reinforcement learning algorithm based on one-step Q-learning that is optimized for speed and memory consumption; it uses only integer-based sum operators and avoids floating-point and multiplication operators.

Abstract: Scaling down robots to miniature size introduces many new challenges including memory and program size limitations, low processor performance and low power autonomy. In this paper we describe the concept and implementation of learning of a safe-wandering task with the autonomous micro-robots, Alice. We propose a simplified reinforcement learning algorithm based on one-step Q-learning that is optimized in speed and memory consumption. This algorithm uses only integer-based sum operators and avoids floating-point and multiplication operators. Finally, quality of learning is compared to a floating-point based algorithm.

Topics: Q-learning (54%)

### 1. Introduction

• Swarm Intelligence metaphor [1] has become a hot topic in recent years.
• Miniaturizing robots introduces many problems in physical parts and behavior implementations [6].
• Due to simplicity of hardware parts, the control program must handle all additional processing such as noise filtering.
• Additionally, due to limited power autonomy, there is a serious limitation in long tasks such as learning.
• In the following section the learning of safe-wandering is described and the experimental results are discussed.

### 2. State of the Art

• But, since no learning is done during fitness evaluation of a newly generated individual, the training takes too much time (3 hours) even for such a simple task.
• Dean et al. [7] applied ROLNNET (Rapid Output Learning Neural Network with Eligibility Traces) neural networks to mini-robots for backing a car with trailers.
• Input and output spaces are divided to discrete regions and a single neuron is assigned to each region.
• In this paper the authors study the potential for optimization and implementation of reinforcement learning on their micro-robots.

### 2.1. Reinforcement Learning

• Reinforcement learning [19] is one of the widely used online learning methods in robotics.
• This action changes the world’s state and as a result the agent gets a feedback signal from the environment, called "reinforcement signal", indicating the quality of the new state.
• In the one-step Q-learning algorithm the external world is modeled as a Markov Decision Process with discrete finite-time states.
• In the next section the authors show what happens to the convergence when the numbers are limited to integers.

### 3. The Compact Q-Learning Algorithm

• In order to implement Q-learning on the micro-robot Alice, the authors need a simplified Q-learning algorithm that can cope with the limited memory and processing resources and with the restricted power autonomy.
• Thus the authors propose a new algorithm based only on integer operations.

### 3.1. Integer vs. floating point operators

• Floating-point operations take too much processing time and program memory.
• For the sake of comparison, in Table 1, the authors have listed the number of instructions generated by their C compiler (PCW Compiler, from Custom Computer Services Inc.) and the average execution time for four floating-point operations: a=b+c, a=b-c, a=b*c, and a=b/c, and they compared them to integer-based operations.
• Call overhead takes both processing time and program memory for every instance of operator.
• Therefore, the authors prefer to use only integer sum operators since they have no call overhead, require just a few instructions and run very fast.
• Moreover, the authors prefer to use unsigned operations to save memory bits, ease computations and reduce overflow-checking.

### 3.2. Q-Learning problems with Integer operators

• To their knowledge, all reinforcement learning algorithms deal with real numbers at least in the action selection mechanisms.
• In this section the authors discuss some problems that happen when trying to switch to integer numbers.
• The first problem arises in the Boltzmann probability distribution (1).
• Since the Q-values must be of type integer, they must be incremented or decremented by one (not a fraction).
• Also the reward value should be increased according to Q-value increase (see the c factor in (4)).
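
One way to see why fractional updates do not survive the switch to integers is the small sketch below. It is only an illustration: the learning rate of 1/8, the target value of 20 and both function names are assumptions made for this example, not values or routines taken from the paper. Truncating the fractional step to an integer makes the Q-value stall short of its target, whereas a step of exactly one keeps moving.

```python
def truncated_update(q, target, beta_inverse=8):
    """Integer form of Q <- Q + beta * (target - Q), with beta = 1 / beta_inverse.

    The fractional step is truncated toward zero, as a naive integer-only
    port of the floating-point update would do (hypothetical example).
    """
    step = abs(target - q) // beta_inverse
    return q + step if target >= q else q - step


def unit_step_update(q, target):
    """Move the integer Q-value by exactly one toward the target."""
    if target > q:
        return q + 1
    if target < q:
        return q - 1
    return q


q_truncated = 0
q_unit = 0
for _ in range(100):
    q_truncated = truncated_update(q_truncated, target=20)
    q_unit = unit_step_update(q_unit, target=20)

# The truncated version stops as soon as the remaining gap is below 8;
# the unit-step version reaches the target.
print(q_truncated, q_unit)
```

Once the gap between the Q-value and the target drops below `beta_inverse`, the truncated step is zero and learning stops, which is the motivation for changing Q-values by whole units.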

### 3.3. The proposed algorithm

• Based on the problems described in the previous section the authors propose a very simple algorithm dealing with only unsigned integer summation.
• The probability assignment formula is changed to Roulette Selection as follows: $P(a_i \mid x) = \dfrac{Q(x,a_i) + 1}{|Actions| + \sum_{k \in Actions} Q(x,a_k)}$ (7), where $|Actions|$ is the size of the action set.
• One is added to each Q-value so that zero-valued actions have a small positive probability.
• In order to get an idea about the simplification, assume that there are 10 available actions in a state, 9 of them have zero values and the value of the 10th action changes.
• For the positive part, comparing the shape of the curves shows they are similar enough for their purpose: incremental and logarithmic.
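
The roulette rule (7) can be sketched with integer additions and comparisons only. This is a minimal Python illustration, not the robot's firmware: the function names are hypothetical, and `randrange` stands in for whatever random source the micro-robot actually uses. `Fraction` is used only to report exact probabilities for the 10-action example described above.

```python
import random
from fractions import Fraction


def roulette_select(q_row, rng=random):
    """Pick an action index with probability (Q + 1) / (sum(Q) + |Actions|)."""
    total = len(q_row)              # the "+1" of every action, i.e. |Actions|
    for q in q_row:
        total += q                  # plus the sum of all Q-values
    ticket = rng.randrange(total)   # uniform integer in [0, total)
    running = 0
    for index, q in enumerate(q_row):
        running += q + 1            # each action owns q + 1 tickets
        if ticket < running:
            return index
    raise AssertionError("unreachable: ticket is always below total")


def roulette_probability(q_row, i):
    """Exact selection probability of action i under rule (7)."""
    return Fraction(q_row[i] + 1, sum(q_row) + len(q_row))


# Ten actions, nine with zero value; the value of the tenth one changes.
for value in (0, 10, 90):
    row = [0] * 9 + [value]
    print(value, roulette_probability(row, 9))
```

With the tenth Q-value at 0, 10 and 90, its selection probability is 1/10, 11/20 and 91/100 respectively, so the probability grows with the Q-value while every zero-valued action keeps a non-zero chance of being explored.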

### 5. Learning Safe-Wandering Behavior

• The H-maze is composed of very narrow parts and has a complex shape.
• The task of the robot is to wander in the maze while having a preference to move forward, without hitting the walls.
• At the end the program detects the next state, computes the reward and updates the policy.
• Also, it is possible to stop the learning and command the robot to behave according to the learned policy.
• Fig. 5 shows the changes of the received rewards during a 20-minute learning experiment in the X-maze.

### 6. Integer vs. Floating-point

• To measure the introduced error due to the simplification of the algorithm, the authors compared the proposed algorithm with the floating-point based algorithm (regular Q-learning algorithm).
• Then learning is stopped and the robot behavior is tested for 10 minutes.
• If any learning algorithm finds an optimal policy then its average should be higher in this phase.
• Both the convergence rate and the learning quality are near the second-best results of the float-based case.
• On the other hand, the floating-point based program plus operating system takes 89% of data and 83% of program memory (60% and 64% without OS).


ETH Library
Conference Paper
Author(s): Asadpour, Masoud; Siegwart, Roland
Publication date: 2004-08-31
Permanent link: https://doi.org/10.3929/ethz-a-010002588
Rights / license: In Copyright - Non-Commercial Use Permitted
Originally published in: Robotics and Autonomous Systems 48(1), https://doi.org/10.1016/j.robot.2004.05.006

Authors:
Masoud Asadpour, Roland Siegwart (roland.siegwart@epfl.ch)
Autonomous Systems Laboratory (http://asl.epfl.ch)
Swiss Federal Institute of Technology (EPFL)
CH-1015, Lausanne
Switzerland
Fax number: (+41) 21 693 7807
Corresponding Author:

Keywords: Reinforcement Learning; Q-Learning; micro-robots.
1. Introduction
Swarm Intelligence metaphor [1] has become a hot topic in recent years. The number of its successful applications is
exponentially growing in combinatorial optimization [8], communication networks [17] and robotics [14]. This approach
emphasizes collective intelligence of groups of simple and small agents like ants, bees, and cockroaches. Small robots are
also good frameworks to study biology [12]. Small mobile machines could one day perform noninvasive microsurgery,
miniaturized rovers could greatly reduce the cost of planetary missions, and tiny surveillance vehicles could carry equipment
undetected.
Miniaturizing robots introduces many problems in physical parts and behavior implementations [6]. The robot parts must
have low power consumption. This forces the designer to add parts such as sensors conservatively. Due to simplicity of
hardware parts, the control program must handle all additional processing such as noise filtering. The instruction set of
processors is reduced. The robot behavior must be coded compactly and efficiently while having limitation on program size,
memory and processing speed. Additionally, due to limited power autonomy, there is a serious limitation in long tasks such
as learning.
In this paper we describe how to practically tackle the on-line learning problem on micro-robots with processing constraints
and optimize it in program and memory consumption. The proposed algorithm is then verified on learning of a
safe-wandering task using the Alice micro-robots [3].

The next section deals with previous works in micro-robots, reinforcement learning and one-step Q-learning algorithm.
The third section discusses the problems happening when applying simplifications to Q-learning and introduces an optimized
algorithm in size, processing time, and memory consumption based on integer-calculation and low-level instructions. Section
4 presents the micro-robot Alice and its hardware and software features. In the following section the learning of
safe-wandering is described and the experimental results are discussed. The sixth section compares the results to a
floating-point based algorithm and the last section contains the conclusion and future work.
2. State of the Art
In his PhD thesis, Gilles Caprari [6] showed that the processing power of a micro-robot, which is related to available energy,
scales down by a factor of $L^2$ ($L$ is length). This drastically limits control algorithm capacity. It therefore forces robot designers to
further reduce the calculation power by using 8-bit instead of 16 or 32-bit microcontrollers. As a consequence, we have to
accept that the intelligence of micro-robots will be limited. Nevertheless, in connection with an external supervisor
(computer, human), an adequate collective approach or enough simplifications, small robots might still be able to fulfill
Different learning algorithms have been implemented by researchers on micro-robots. Floreano et al. [9] used
evolutionary algorithms in combination with spike neurons to train the old version of the Alice micro-robots for obstacle-
avoidance. The spiking neural networks are encoded into genetic strings, and the population is evolved. Crossover, mutation
and fitness evaluation tasks are optimized and use bitwise operators. But, since no learning is done during fitness evaluation
of a newly generated individual, the training takes too much time (3 hours) even for such a simple task.
Dean et al. [7] applied ROLNNET (Rapid Output Learning Neural Network with Eligibility Traces) neural networks to
mini-robots for backing a car with trailers. ROLNNET [11] is a mixture of Neural Networks and Reinforcement Learning. It
has been designed for real robots with very limited computing power and memory. Input and output spaces are divided into
discrete regions and a single neuron is assigned to each region. Neurons are provided with regional sensitivity through the
use of eligibility traces. Response learning takes place rapidly using cooperation among neighbor neurons. Even though the
mathematical formulation is simple, consisting only of summation, multiplication and division, the required floating-point
operations might still cause a processing problem on autonomous micro-robots with limited processing capacity.
Various implementations of micro-robots exist, such as Sandia MARV [2], MIT Ants [15], Nagoya MARS [10], KAIST
Kity [13] and ULB Meloe [16]. However, to our knowledge, no learning task has been implemented on them. In this paper
we study the potential for optimization and implementation of reinforcement learning on our micro-robots.
2.1. Reinforcement Learning
Reinforcement learning [19] is one of the widely used online learning methods in robotics. With an online approach, the
robot learns during action and acts during learning. Supervised learning methods neglect this feature. With reinforcement
learning the learner perceives the state of its environment (or conditions at higher levels), and based on a predefined criterion
chooses an action (or behavior). This action changes the world’s state and as a result the agent gets a feedback signal from
the environment, called "reinforcement signal", indicating the quality of the new state. After receiving the reinforcement
signal, it updates the learned policy based on the type of signal, which can be positive (reward) or negative (punishment).
The reinforcement learning method that we use in this work is the one-step Q-learning method [20][21]. However, we
have to adapt the algorithms in accordance with the robot’s limitations. In the one-step Q-learning algorithm the external
world is modeled as a Markov Decision Process with discrete finite-time states. After each action, the agent immediately
receives a scalar "reward" or "punishment".
An action-value table, called Q-table, determines the learned policy of the agent. It estimates the long-term discounted
reward for each state-action pair. Given the current state $x$ and the available actions $a_i$, a Q-learning agent selects action $a_i$
with the probability $P(a_i \mid x)$ given by the Boltzmann probability distribution:

$$P(a_i \mid x) = \frac{e^{Q(x,a_i)/\tau}}{\sum_{k \in actions} e^{Q(x,a_k)/\tau}} \qquad (1)$$

where $\tau$ is the temperature parameter, which adjusts the exploration rate of action selection. High $\tau$ values give high
randomness to selection at the beginning. The exploration rate decreases as the Q-values gradually increase, which
makes exploitation more favorable at the end.
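
As a rough illustration of the Boltzmann selection rule (1), the sketch below computes the action probabilities for one state and samples an action from them. The function names and the sample Q-values are assumptions made for this example; subtracting the maximum Q-value before exponentiating is a standard numerical safeguard that leaves the probabilities unchanged and is not something the paper describes.

```python
import math
import random


def boltzmann_probabilities(q_values, tau):
    """Return P(a_i | x) for every action, following rule (1)."""
    # Shifting by the maximum keeps the ratios identical but avoids
    # overflow in exp() when Q / tau is large.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(q_values, tau, rng=random):
    """Sample an action index according to the Boltzmann probabilities."""
    probs = boltzmann_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


q_row = [0.0, 1.0, 2.0]  # hypothetical Q(x, a_k) values for one state
print(boltzmann_probabilities(q_row, tau=10.0))  # high temperature: near-uniform
print(boltzmann_probabilities(q_row, tau=0.1))   # low temperature: near-greedy
```

Every call needs one exponential and one floating-point division per action, which is the cost the integer-only variant is meant to avoid.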
After selecting the favorite action based on the probability distribution, the agent executes the action, receives an
immediate reward $r$, moves to the next state $y$, and updates $Q(x,a)$ as follows:

$$Q(x,a) \leftarrow (1-\beta)\,Q(x,a) + \beta\,(r + \gamma V(y)) \qquad (2)$$

where $\beta$ is the learning rate, $\gamma$ (between 0 and 1) is a discount parameter and $V(y)$ is given by:

$$V(y) = \max_{b \in actions} Q(y,b) \qquad (3)$$
Q is improved gradually and the agent learns to maximize the future rewards.
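
The one-step update (2) with the state value (3) can be sketched as below, assuming the Q-table is stored as a list of per-state rows. The table layout, the function name and the numeric values are hypothetical and chosen only for illustration.

```python
def q_update(q_table, x, a, r, y, beta, gamma):
    """Apply update (2) to Q(x, a) in place and return the new value."""
    v_y = max(q_table[y])  # V(y) = max over b of Q(y, b), equation (3)
    q_table[x][a] = (1 - beta) * q_table[x][a] + beta * (r + gamma * v_y)
    return q_table[x][a]


# Hypothetical table with two states and two actions.
q = [[0.0, 0.0],
     [1.0, 3.0]]
new_value = q_update(q, x=0, a=1, r=1.0, y=1, beta=0.5, gamma=0.9)
print(new_value)
```

Each update costs several floating-point multiplications, which is part of what makes the standard form expensive on an 8-bit microcontroller.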
Studies by Sutton [19] showed that convergence of reinforcement learning can be guaranteed. In all of the formulas above,
and in the others in use, the numbers are floating-point or at least require floating-point operations, so no discretization or
interpolation is applied. In the next section we show what happens to the convergence when the numbers are limited to
integers.
3. The Compact Q-Learning Algorithm
In order to implement Q-learning on the micro-robot Alice, we need a simplified Q-learning algorithm that is able to
cope with the limited memory and processing resources and with the restricted power autonomy. Thus we propose a new
algorithm based only on integer operations.

##### Citations

Journal ArticleDOI
TL;DR: New methods are proposed for the choice and adaptation of the smoothing parameter of the probabilistic neural network (PNN), based on three reinforcement learning algorithms: Q(0)-learning, Q(λ)-learning, and stateless Q-learning.
Abstract: In this paper, we propose new methods for the choice and adaptation of the smoothing parameter of the probabilistic neural network (PNN). These methods are based on three reinforcement learning algorithms: $Q(0)$-learning, $Q(\lambda)$-learning, and stateless $Q$-learning. We regard three types of PNN classifiers: the model that uses single smoothing parameter for the whole network, the model that utilizes single smoothing parameter for each data attribute, and the model that possesses the matrix of smoothing parameters different for each data variable and data class. Reinforcement learning is applied as the method of finding such a value of the smoothing parameter, which ensures the maximization of the prediction ability. PNN models with smoothing parameters computed according to the proposed algorithms are tested on eight databases by calculating the test error with the use of the cross validation procedure. The results are compared with state-of-the-art methods for PNN training published in the literature up to date and, additionally, with PNN whose sigma is determined by means of the conjugate gradient approach. The results demonstrate that the proposed approaches can be used as alternative PNN training procedures.

56 citations

Journal ArticleDOI
TL;DR: The scenario of distributed data aggregation in wireless sensor networks is considered, where sensors can obtain and estimate the information of the whole sensing field through local data exchange and aggregation and a sequential decision process model is proposed.
Abstract: The scenario of distributed data aggregation in wireless sensor networks is considered, where sensors can obtain and estimate the information of the whole sensing field through local data exchange and aggregation. An intrinsic tradeoff between energy and aggregation delay is identified, where nodes must decide optimal instants for forwarding samples. The samples could be from a node's own sensor readings or an aggregation with samples forwarded from neighboring nodes. By considering the randomness of the sample arrival instants and the uncertainty of the availability of the multiaccess communication channel, a sequential decision process model is proposed to analyze this problem and determine optimal decision policies with local information. It is shown that, once the statistics of the sample arrival and the availability of the channel satisfy certain conditions, there exist optimal control-limit-type policies that are easy to implement in practice. In the case that the required conditions are not satisfied, the performance loss of using the proposed control-limit-type policies is characterized. In general cases, a finite-state approximation is proposed and two on-line algorithms are provided to solve it. Practical distributed data aggregation simulations demonstrate the effectiveness of the developed policies, which also achieve a desired energy-delay tradeoff.

50 citations

Journal ArticleDOI
TL;DR: It is shown that the presented procedure can be applied to the automatic adaptation of the smoothing parameter of each of the considered PNN models and that this is an alternative training method.
Abstract: In this article, an iterative procedure is proposed for the training process of the probabilistic neural network (PNN). In each stage of this procedure, the Q(0)-learning algorithm is utilized for the adaptation of PNN smoothing parameter ($\sigma$). Four classes of PNN models are regarded in this study. In the case of the first, simplest model, the smoothing parameter takes the form of a scalar; for the second model, $\sigma$ is a vector whose elements are computed with respect to the class index; the third considered model has the smoothing parameter vector for which all components are determined depending on each input attribute; finally, the last and the most complex of the analyzed networks, uses the matrix of smoothing parameters where each element is dependent on both class and input feature index. The main idea of the presented approach is based on the appropriate update of the smoothing parameter values according to the Q(0)-learning algorithm. The proposed procedure is verified on six repository data sets. The prediction ability of the algorithm is assessed by computing the test accuracy on 10 %, 20 %, 30 %, and 40 % of examples drawn randomly from each input data set. The results are compared with the test accuracy obtained by PNN trained using the conjugate gradient procedure, support vector machine algorithm, gene expression programming classifier, k-means method, multilayer perceptron, radial basis function neural network and learning vector quantization neural network. It is shown that the presented procedure can be applied to the automatic adaptation of the smoothing parameter of each of the considered PNN models and that this is an alternative training method. PNN trained by the Q(0)-learning based approach constitutes a classifier which can be treated as one of the top models in data classification problems.

32 citations

Journal ArticleDOI
19 Aug 2020
TL;DR: The design and physical realization of RoBeetle is the result of combining the notion of controllable NiTi-Pt–based catalytic artificial micromuscle with that of integrated millimeter-scale mechanical control mechanism (MCM).
Abstract: The creation of autonomous subgram microrobots capable of complex behaviors remains a grand challenge in robotics largely due to the lack of microactuators with high work densities and capable of using power sources with specific energies comparable to that of animal fat (38 megajoules per kilogram). Presently, the vast majority of microrobots are driven by electrically powered actuators; consequently, because of the low specific energies of batteries at small scales (below 1.8 megajoules per kilogram), almost all the subgram mobile robots capable of sustained operation remain tethered to external power sources through cables or electromagnetic fields. Here, we present RoBeetle, an 88-milligram insect-sized autonomous crawling robot powered by the catalytic combustion of methanol, a fuel with high specific energy (20 megajoules per kilogram). The design and physical realization of RoBeetle is the result of combining the notion of controllable NiTi-Pt-based catalytic artificial micromuscle with that of integrated millimeter-scale mechanical control mechanism (MCM). Through tethered experiments on several robotic prototypes and system characterization of the thermomechanical properties of their driving artificial muscles, we obtained the design parameters for the MCM that enabled RoBeetle to achieve autonomous crawling. To evaluate the functionality and performance of the robot, we conducted a series of locomotion tests: crawling under two different atmospheric conditions and on surfaces with different levels of roughness, climbing of inclines with different slopes, transportation of payloads, and outdoor locomotion.

21 citations

Proceedings ArticleDOI

01 Oct 2006
Abstract: Using each other's knowledge and expertise in learning - what we call cooperation in learning- is one of the major existing methods to reduce the number of learning trials, which is quite crucial for real world applications. In situated systems, robots become expert in different areas due to being exposed to different situations and tasks. As a consequence, areas of expertise (AOE) of the other agents must be detected before using their knowledge, especially when the exchanged knowledge is not abstract, and simple information exchange might result in incorrect knowledge, which is the case for Q-learning agents. In this paper we introduce an approach for extraction of AOE of agents for cooperation in learning using their Q-tables. The evaluating robot uses a behavioral measure to evaluate itself, in order to find a set of states it is expert in. That set is used, then, along with a Q-table-based feature for extraction of areas of expertise of other robots by means of a classifier. Extracted areas are merged in the last stage. The proposed method is tested both in extensive simulations and in real world experiments using mobile robots. The results show effectiveness of the introduced approach, both in accurate extraction of areas of expertise and increasing the quality of the combined knowledge, even when, there are uncertainty and perceptual aliasing in the application and the robot

20 citations

##### References

Book
01 Jan 1988
TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.
Abstract: Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives when interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning. Their discussion ranges from the history of the field's intellectual foundations to the most recent developments and applications. The only necessary mathematical background is familiarity with elementary concepts of probability. The book is divided into three parts. Part I defines the reinforcement learning problem in terms of Markov decision processes. Part II provides basic solution methods: dynamic programming, Monte Carlo methods, and temporal-difference learning. Part III presents a unified view of the solution methods and incorporates artificial neural networks, eligibility traces, and planning; the two final chapters present case studies and consider the future of reinforcement learning.

32,257 citations

### "Compact Q-Learning Optimized for Mi..." refers background in this paper

• ...We would like to thank the two unknown reviewers for their remarkable comments and our colleague Gilles Caprari for his valuable works on the Alice robot and its operating system....

[...]

• ...In his PhD thesis, Gilles Caprari [6] showed processing power of a micro-robot, which is related to available energy, scales down by L2 factor (L is length)....

[...]

BookDOI
01 Jan 1999
TL;DR: This chapter discusses Ant Foraging Behavior, Combinatorial Optimization, and Routing in Communications Networks, and its application to Data Analysis and Graph Partitioning.
Abstract: 1. Introduction 2. Ant Foraging Behavior, Combinatorial Optimization, and Routing in Communications Networks 3. Division of Labor and Task Allocation 4. Cemetery Organization, Brood Sorting, Data Analysis, and Graph Partitioning 5. Self-Organization and Templates: Application to Data Analysis and Graph Partitioning 6. Nest Building and Self-Assembling 7. Cooperative Transport by Insects and Robots 8. Epilogue

5,634 citations

### "Compact Q-Learning Optimized for Mi..." refers background in this paper

• ...Swarm Intelligence metaphor [ 1 ] has become a hot topic in recent years....

[...]

01 Jan 1989

4,910 citations

Journal ArticleDOI
TL;DR: This book provides fairly comprehensive coverage of recent research developments and constitutes an excellent resource for researchers in the swarm intelligence area or for those wishing to familiarize themselves with current approaches e.g. it would be an ideal introduction for a doctoral student wanting to enter this area.
Abstract: (2002). Swarm Intelligence: From Natural to Artificial Systems. Connection Science: Vol. 14, No. 2, pp. 163-164.

1,674 citations

### "Compact Q-Learning Optimized for Mi..." refers methods in this paper


• ...Research Collection Conference Paper Compact Q-Learning Optimized for Micro-robots with Processing and Memory Constraints Author(s): Asadpour, Masoud; Siegwart, Roland Publication Date: 2004-08-31 Permanent Link: https://doi.org/10.3929/ethz-a-010002588 Originally published in: Robotics and Autonomous Systems 48(1), http://doi.org/10.1016/j.robot.2004.05.006 Rights / License: In Copyright - Non-Commercial Use Permitted This page was generated automatically upon download from the ETH Zurich Research Collection....

[...]

Journal ArticleDOI
TL;DR: A novel method of achieving load balancing in telecommunications networks using ant-based control, which is shown to result in fewer call failures than the other methods, while exhibiting many attractive features of distributed control.
Abstract: This article describes a novel method of achieving load balancing in telecommunications networks. A simulated network models a typical distribution of calls between nodes; nodes carrying an excess ...

830 citations

### "Compact Q-Learning Optimized for Mi..." refers methods in this paper

• ...…Optimized for Micro-robots with Processing and Memory Constraints Author(s): Asadpour, Masoud; Siegwart, Roland Publication Date: 2004-08-31 Permanent Link: https://doi.org/10.3929/ethz-a-010002588 Originally published in: Robotics and Autonomous Systems 48(1),…...

[...]