# Compact Q-Learning Optimized for Micro-robots with Processing and Memory Constraints

TL;DR: This paper proposes a simplified reinforcement learning algorithm based on one-step Q-learning that is optimized for speed and memory consumption, uses only integer-based sum operators, and avoids floating-point and multiplication operators.

Abstract: Scaling down robots to miniature size introduces many new challenges, including memory and program size limitations, low processor performance, and low power autonomy. In this paper we describe the concept and implementation of learning a safe-wandering task on the autonomous micro-robot Alice. We propose a simplified reinforcement learning algorithm based on one-step Q-learning that is optimized for speed and memory consumption. This algorithm uses only integer-based sum operators and avoids floating-point and multiplication operators. Finally, the quality of learning is compared to that of a floating-point based algorithm.

## Summary (2 min read)

### 1. Introduction

- The Swarm Intelligence metaphor [1] has become a hot topic in recent years.
- Miniaturizing robots introduces many problems in physical parts and behavior implementation [6].
- Due to the simplicity of the hardware parts, the control program must handle all additional processing, such as noise filtering.
- Additionally, due to limited power autonomy, there is a serious limitation on long tasks such as learning.
- In the following sections, the learning of safe-wandering is described and the experimental results are discussed.

### 2. State of the Art

- However, since no learning is done during the fitness evaluation of a newly generated individual, training takes too much time (3 hours) even for such a simple task.
- Dean et al. [7] applied ROLNNET (Rapid Output Learning Neural Network with Eligibility Traces) neural networks to mini-robots for backing a car with trailers.
- Input and output spaces are divided into discrete regions and a single neuron is assigned to each region.
- In this paper the authors study the potential for optimization and implementation of reinforcement learning on their micro-robots.

### 2.1. Reinforcement Learning

- Reinforcement learning [19] is one of the widely used online learning methods in robotics.
- The agent acts on the world; this action changes the world's state and, as a result, the agent receives a feedback signal from the environment, called the "reinforcement signal", indicating the quality of the new state.
- In the one-step Q-learning algorithm the external world is modeled as a Markov Decision Process with discrete finite-time states.
- In the next section the authors show what happens to the convergence when the numbers are limited to integers.
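For reference, the standard one-step Q-learning update that the compact algorithm later simplifies can be sketched as follows. This is a minimal floating-point baseline; the state/action counts and the learning parameters are illustrative, not values from the paper:

```c
#define N_STATES  4
#define N_ACTIONS 3

/* Standard (floating-point) one-step Q-learning update:
 *   Q(x,a) <- Q(x,a) + alpha * (r + gamma * max_a' Q(x',a') - Q(x,a))
 * This is the baseline that the compact algorithm replaces with
 * integer-only operations. */
static float Q[N_STATES][N_ACTIONS];

static float max_q(int state) {
    float m = Q[state][0];
    for (int a = 1; a < N_ACTIONS; a++)
        if (Q[state][a] > m)
            m = Q[state][a];
    return m;
}

void q_update(int x, int a, float r, int x_next, float alpha, float gamma) {
    Q[x][a] += alpha * (r + gamma * max_q(x_next) - Q[x][a]);
}
```

Every term of this update (the multiplications by alpha and gamma, and the fractional values themselves) is costly on the target processor, which is what motivates Section 3.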

### 3. The Compact Q-Learning Algorithm

- In order to implement Q-learning on the micro-robot Alice, the authors need a simplified Q-learning algorithm that is able to cope with the limited memory and processing resources and with the restricted power autonomy.
- Thus the authors propose a new algorithm based only on integer operations.

### 3.1. Integer vs. floating point operators

- Floating-point operations take too much processing time and program memory.
- For the sake of comparison, in Table 1, the authors have listed the number of instructions generated by their C compiler (PCW Compiler, from Custom Computer Services Inc.) and the average execution time for four floating-point operations: a=b+c, a=b-c, a=b*c, and a=b/c, and they compared them to integer-based operations.
- Call overhead costs both processing time and program memory for every instance of an operator.
- Therefore, the authors prefer to use only integer sum operators since they have no call overhead, require just a few instructions and run very fast.
- Moreover, the authors prefer to use unsigned operations to save memory bits, ease computations and reduce overflow-checking.
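As a minimal illustration of the unsigned, addition-only arithmetic the authors prefer, a saturating add can stand in for explicit overflow checks. The 8-bit width and the cap at 255 are assumptions for the sketch, not values taken from the paper:

```c
/* Unsigned 8-bit Q-values updated with addition only: no floating
 * point, no multiplication, and saturation at 255 in place of
 * signed overflow handling. */
typedef unsigned char qval_t;

qval_t sat_add(qval_t q, qval_t inc) {
    /* Clamp instead of wrapping when q + inc would exceed 255. */
    return (qval_t)((q > (qval_t)(255 - inc)) ? (qval_t)255
                                              : (qval_t)(q + inc));
}
```

A single compare-and-add like this compiles to a handful of instructions on an 8-bit microcontroller, with no library call overhead.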

### 3.2. Q-Learning problems with Integer operators

- To their knowledge, all reinforcement learning algorithms deal with real numbers at least in the action selection mechanisms.
- In this section the authors discuss some problems that happen when trying to switch to integer numbers.
- The first problem arises in the Boltzmann probability distribution (1).
- Since the Q-values must be of type integer, they must be incremented or decremented by one (not a fraction).
- Also the reward value should be increased according to Q-value increase (see the c factor in (4)).

### 3.3. The proposed algorithm

- Based on the problems described in the previous section, the authors propose a very simple algorithm that uses only unsigned integer summation.
- The probability assignment formula is changed to roulette selection as follows:

  P(a_i | x) = (Q(x, a_i) + 1) / (|Actions| + Σ_{k ∈ Actions} Q(x, a_k))    (7)

  where |Actions| is the size of the action set.
- Q-values are incremented by one so that zero-valued actions have a small positive probability.
- To get an idea of the simplification, assume that there are 10 available actions in a state, 9 of them have zero values, and the value of the 10th action changes.
- For the positive part, comparing the shape of the curves shows they are similar enough for their purpose: incremental and logarithmic.
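The roulette selection of eq. (7) can be sketched with unsigned sums and comparisons only; `rand()` here stands in for whatever random source the Alice firmware actually provides:

```c
#include <stdlib.h>

/* Integer-only roulette action selection following eq. (7):
 *   P(a_i|x) = (Q(x,a_i) + 1) / (|Actions| + sum_k Q(x,a_k))
 * Adding 1 to each Q-value keeps a small positive probability
 * for zero-valued actions. Only additions and comparisons are
 * needed; overflow of the running sum is ignored in this sketch. */
int roulette_select(const unsigned int *q, unsigned int n) {
    unsigned int total = n;            /* equals sum of (q[i] + 1) */
    for (unsigned int i = 0; i < n; i++)
        total += q[i];

    unsigned int r = (unsigned int)rand() % total;

    unsigned int cum = 0;
    for (unsigned int i = 0; i < n; i++) {
        cum += q[i] + 1;               /* cumulative slice of the wheel */
        if (r < cum)
            return (int)i;
    }
    return (int)(n - 1);               /* unreachable safeguard */
}
```

In the 10-action example above, only the slice belonging to the changed action needs recomputing; there is no exponential and no normalization pass.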

### 5. Learning Safe-Wandering Behavior

- The H-maze is composed of very narrow parts and has a complex shape.
- The task of the robot is to wander in the maze while having a preference to move forward, without hitting the walls.
- At the end the program detects the next state, computes the reward and updates the policy.
- Also, it is possible to stop the learning and command the robot to behave according to the learned policy.
- Fig. 5 shows the changes of the received rewards during a 20-minute learning experiment in the X-maze.
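Once learning is stopped, commanding the robot to behave according to the learned policy reduces to a greedy argmax over the integer Q-values of the current state. A minimal sketch (the function name is illustrative, not the Alice firmware API):

```c
/* Execute the learned policy (Sec. 5): with learning stopped, the
 * robot simply picks the action with the highest Q-value in the
 * current state. Ties resolve to the lowest-indexed action. */
int greedy_action(const unsigned int *q, unsigned int n) {
    unsigned int best = 0;
    for (unsigned int i = 1; i < n; i++)
        if (q[i] > q[best])
            best = i;
    return (int)best;
}
```

Like the rest of the compact algorithm, this needs only comparisons on unsigned integers.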

### 6. Integer vs. Floating-point

- To measure the introduced error due to the simplification of the algorithm, the authors compared the proposed algorithm with the floating-point based algorithm (regular Q-learning algorithm).
- Then learning is stopped and the robot's behavior is tested for 10 minutes.
- If any learning algorithm finds an optimal policy then its average should be higher in this phase.
- Both the convergence rate and the learning quality are close to the second-best results of the floating-point case.
- On the other hand, the floating-point based program plus operating system takes 89% of the data memory and 83% of the program memory (60% and 64% without the OS).

Originally published in: Robotics and Autonomous Systems 48(1), 2004-08-31. Authors: Masoud Asadpour, Roland Siegwart. DOI: 10.1016/j.robot.2004.05.006. Permanent link: https://doi.org/10.3929/ethz-a-010002588

##### Frequently Asked Questions

###### Q2. What are the future works mentioned in the paper "Compact q-learning optimized for micro-robots with processing and memory constraints" ?

For future works the authors plan to test the algorithm on other, more complex tasks, especially collective behaviors and cooperative learning among groups of robots.