Evolving large-scale neural networks for vision-based reinforcement learning
Citations
Deep learning in neural networks
End-to-end training of deep visuomotor policies
Deep Reinforcement Learning: A Brief Survey
Evolution Strategies as a Scalable Alternative to Reinforcement Learning
References
Machine perception of three-dimensional solids
Designing Neural Networks Using Genetic Algorithms with Graph Generation System
Accelerated Neural Evolution through Cooperatively Coevolved Synapses
Dynamic Model of the Octopus Arm. I. Biomechanics of the Octopus Reaching Movement
Related Papers (5)
Human-level control through deep reinforcement learning
Mastering the game of Go with deep neural networks and tree search
Frequently Asked Questions (23)
Q2. What have the authors stated for future works in "Evolving large-scale neural networks for vision-based reinforcement learning" ?
In the future, the authors would like to apply Compressed Network Complexity Search [5] to simultaneously determine the number of coefficients and the number of neurons (topology) by running multiple evolutionary algorithms in parallel, one for each topology-coefficient complexity class, and assigning run-time to each based on a probability distribution that is adapted on-line according to the performance of each class. This approach has so far only been applied to much simpler control tasks than those used here, but it should produce solutions that are simple in terms of both weight-matrix regularity and model class, potentially yielding more robust controllers for harder tasks. A potentially more tractable approach might be Generalized Compressed Network Search (GCNS; [14]), which uses a messy GA to simultaneously determine which arbitrary subset of frequencies should be used as well as the value at each of those frequencies. Initial work with this method has been promising.
Q3. How is the search-space dimensionality reduced in the TORCS task?
Using fewer coefficients than weights sacrifices some expressive power (some networks can no longer be represented), but constrains the search to a subspace of lower-complexity yet still sufficiently powerful networks, reducing the search-space dimensionality by a factor of more than 5000 for the car-driving networks evolved here.
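As a rough sanity check of that factor, the weight count implied by the architecture can be sketched as below. Only the 16×16 hidden grid and 3 outputs appear in this FAQ; the 64×64 input-image resolution is an assumption about the paper's setup.

```python
# Back-of-the-envelope check of the >5000x reduction factor.
in_px = 64 * 64          # input image pixels (assumed resolution)
hidden = 16 * 16         # recurrent hidden neurons (the 16x16 grid)
outputs = 3              # output neurons

weights = (
    in_px * hidden       # input -> hidden weights
    + hidden * hidden    # recurrent hidden -> hidden weights
    + hidden             # hidden biases
    + hidden * outputs   # hidden -> output weights
    + outputs            # output biases
)
coeffs = 200             # DCT coefficients in the genome
print(weights, round(weights / coeffs))  # 1115139 5576
```

Under these assumptions the genome is roughly 5576 times smaller than the weight matrices it encodes, consistent with the "factor of more than 5000" above.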
Q4. What is the main appeal of evolving neural networks instead of training them?
The main appeal of evolving neural networks instead of training them (e.g. backpropagation) is that it can potentially harness the universal function approximation capability of neural networks to solve reinforcement learning (RL) tasks without relying on noisy, nonstationary gradient information to perform temporal credit assignment.
Q5. How many fully interconnected neurons are fed into the SRN?
The saturation plane is passed through the Roberts edge detector [12] and then fed into an Elman-type simple recurrent network (SRN) with 16×16 fully interconnected neurons.
Q6. What did early work in neuroevolution (NE) focus on?
Early work in NE focused on evolving rather small networks (hundreds of weights) for RL benchmarks, and control problems with relatively few inputs/outputs.
Q7. How is the inverse DCT used to generate the weight values?
A Dm-dimensional inverse DCT is applied to the coefficient array to generate the weight values, which are then mapped to their positions in the corresponding 2D weight matrix.
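A minimal sketch of this decoding step, assuming the coefficients fill the low-frequency corner of a zero array in order of increasing frequency (the fill order and shapes here are illustrative assumptions, not the paper's exact scheme):

```python
import numpy as np
from scipy.fft import idctn

# A few DCT coefficients fill the low-frequency corner of a zero array;
# an inverse DCT then expands them into a full weight matrix.
def decode_weights(coeffs, shape):
    c = np.zeros(shape)
    # fill entries in order of increasing total frequency i + j (assumption)
    order = sorted(np.ndindex(*shape), key=lambda ij: (sum(ij), ij))
    for val, ij in zip(coeffs, order):
        c[ij] = val
    return idctn(c, norm="ortho")

W = decode_weights([1.0, 0.5, -0.3], (4, 4))  # 3 coefficients -> 16 weights
print(W.shape)  # (4, 4)
```

Because only low-frequency coefficients are nonzero, the resulting weight matrix varies smoothly, which is exactly the correlation structure described in Q13.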
Q8. What is the image passed in the UDP?
The image passed in the UDP is encoded as a message chunk with image prefix, followed by unsigned byte values of the image pixels.
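A sketch of that message format, with the literal prefix bytes and the address/port as illustrative assumptions:

```python
import socket

# Message chunk: an "image" prefix followed by the unsigned byte values
# of the image pixels (prefix bytes are an assumption for illustration).
def make_image_message(pixels):
    return b"image" + bytes(pixels)  # pixels: ints in 0..255

msg = make_image_message([0, 127, 255])
print(msg)  # b'image\x00\x7f\xff'

# Sending it over UDP (address and port are placeholders):
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# sock.sendto(msg, ("127.0.0.1", 3001))
```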
Q9. How do the authors determine the number of coefficients and the number of neurons?
In the future, the authors would like to apply Compressed Network Complexity Search [5] to simultaneously determine the number of coefficients and the number of neurons (topology) by running multiple evolutionary algorithms in parallel, one for each topology-coefficient complexity class, and assigning run-time to each based on a probability distribution that is adapted on-line according to the performance of each class.
Q10. What is the dimensionality of the coefficient array for chromosome m?
The number of chromosomes is determined by the choice of network architecture, Ψ, and the data structures used to decode the genome, specified by Ω = {D1, ..., Dk}, where Dm, m = 1..k, is the dimensionality of the coefficient array for chromosome m.
Q11. Why is scaling NE to large networks infeasible with a direct encoding?
The result is that scaling NE to large nets (i.e. tens of thousands of weights) is infeasible using a straightforward, direct encoding where genes map one-to-one to network components.
Q12. How many times is the race run?
In each fitness evaluation, the car is placed at the starting line of the track shown in figure 4(c) and of its mirror image, and a race is run for 25 s of simulated time, giving a maximum of 125 time-steps at the 5 Hz control frequency.
Q13. How are the network weights encoded?
The weights are encoded indirectly by 200 DCT coefficients which are mapped into 5 coefficient arrays using the mapping Ω = {4, 4, 2, 3, 1}: (1) a 4D array encodes the input weights from the 2D input image to the 2D array of neurons in the hidden layer, so that each weight is correlated (a) with the weights of adjacent pixels for the same neuron, (b) with the weights for the same pixel for neurons that are adjacent in the 16×16 grid, and (c) with the weights from adjacent pixels connected to adjacent neurons; (2) a 4D array encodes the recurrent weights in the hidden layer, again capturing three types of correlations; (3) a 2D array encodes the hidden-layer biases; (4) a 3D array encodes the weights between the hidden layer and the 3 output neurons; and (5) a 1D array with 3 elements encodes the output-neuron biases.
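The five-array genome layout above can be sketched as follows. Only the dimensionalities Ω = {4, 4, 2, 3, 1}, the 16×16 hidden grid, and the 3 outputs come from the text; the 64×64 input size and the per-array split of the 200 coefficients are assumptions for illustration.

```python
import numpy as np

# Five coefficient arrays with dimensionalities Omega = {4, 4, 2, 3, 1}.
shapes = [
    (64, 64, 16, 16),  # (1) 4D: input image -> hidden grid
    (16, 16, 16, 16),  # (2) 4D: recurrent hidden -> hidden
    (16, 16),          # (3) 2D: hidden biases
    (16, 16, 3),       # (4) 3D: hidden -> 3 outputs
    (3,),              # (5) 1D: output biases
]
counts = [120, 50, 10, 17, 3]  # assumed split of the 200 coefficients
genome = np.random.randn(200)  # the 200 evolved DCT coefficients

arrays, start = [], 0
for shape, n in zip(shapes, counts):
    a = np.zeros(shape)
    a.flat[:n] = genome[start:start + n]  # low-index corner, flattened order
    arrays.append(a)
    start += n
print([a.shape for a in arrays])
```

An inverse DCT of each array (as in Q7) would then yield the actual weight values.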
Q14. How does the controller learn to distinguish between the two tracks?
The controllers start to distinguish between the two tracks as they develop useful visual feature detectors, and from then on the evolutionary search refines the control to optimize acceleration and braking through the curves and straight sections.
Q15. What is the approach used to decode the genome?
The approach taken here restricts the search space to band-limited neural networks, where the power spectrum of the weight matrices goes to zero above a specified limit frequency, c_ℓ^m, and chromosomes contain all frequencies up to c_ℓ^m: g_m = (c_0^m, ..., c_ℓ^m).
Q16. What is the idea of modifying an existing RL benchmark to use visual inputs?
The idea of modifying an existing RL benchmark to use visual inputs dates back to the adaptive “broom balancer” of Tolat and Widrow [15], and more recently the vision-based mountain car in [1].
Q17. How many weights were represented in the RL benchmark?
The controllers were represented by fully-connected recurrent neural networks with 32 neurons, one for each muscle in the 10-compartment arm, for a total of 33,824 weights organized into 3 weight matrices.
Q18. What is the goal of the task?
The goal of the task is to evolve a recurrent neural network controller that can drive the car around a race track using only vision. (For a video demo of the evolved behavior go to http://www.idsia.ch/~koutnik/images/octopusVisual.mp4)
Q19. What is the performance of the vision-based controller?
The performance of the vision-based controller is similar to that of the other controllers which enjoy access to the full set of pre-processed TORCS features, such as forward and lateral speed, angle to the track axis, position at the track, distance to the track side, etc.
Q20. What is the main drawback of the TORCS?
This feature is critical for the experiments presented below since, unlike the non-vision-based TORCS, the costly image rendering, required for vision, cannot be disabled.
Q21. How does the controller learn to drive straight?
This can be seen in the flat portion of the curve until generation 118, when the fitness jumps from 140 to 190, as the controller learns to turn both left and right. (Footnote: evolution can find weights that implement a dynamical system that drives around the track from the same initial conditions, even with no input.)
Q22. What is the way to drive straight on the TORCS?
In the initial stages of evolution, a suboptimal strategy is to just drive straight on both tracks ignoring the first curve, and crashing into the barrier.
Q23. What grants were used to support this research?
This research was supported by Swiss National Science Foundation grants #137736: “Advanced Cooperative NeuroEvolution for Autonomous Control” and #138219: “Theory and Practice of Reinforcement Learning 2”.