What have the authors stated for future works in "Evolving large-scale neural networks for vision-based reinforcement learning" ?

In the future, the authors would like to apply Compressed Network Complexity Search [5] to simultaneously determine the number of coefficients and the number of neurons (topology) by running multiple evolutionary algorithms in parallel, one for each topology-coefficient complexity class, and assigning run-time to each based on a probability distribution that is adapted on-line according to the performance of each class. This approach has so far only been applied to much simpler control tasks than those used here, but should produce solutions for harder tasks that are simple in terms of both weight matrix regularity and model class, to evolve potentially more robust controllers. A potentially more tractable approach might be Generalized Compressed Network Search (GCNS; [14]), which uses a messy GA to simultaneously determine which arbitrary subset of frequencies should be used as well as the value at each of those frequencies. Their initial work with this method has been promising.

How does the TORCS search space dimensionalize?

Using fewer coefficients than weights sacrifices some expressive power (some networks can no longer be represented), but constrains the search to the subspace of lower complexity, but still sufficiently powerful networks, reducing the search space dimensionality by, e.g. a factor of more than 5000 for the car-driving networks evolved here.

What is the image passed in the UDP?

The image passed in the UDP is encoded as a message chunk with image prefix, followed by unsigned byte values of the image pixels.

How do the authors determine the number of coefficients and the number of neurons?

In the future, the authors would like to apply Compressed Network Complexity Search [5] to simultaneously determine the number of coefficients and the number of neurons (topology) by running multiple evolutionary algorithms in parallel, one for each topology-coefficient complexity class, and assigning run-time to each based on a probability distribution that is adapted on-line according to the performance of each class.

What is the dimensionality of the coefficient array for chromosome m?

The number of chromosomes is determined by the choice of network architecture, Ψ, and data structures used to decode the genome, specified by Ω={D1, . . . , Dk}, where Dm, m = 1..k, is the dimensionality of the coefficient array for chromosome m.

What is the goal of the task?

The goal of the task is to evolve a recurrent neural network controller that can drive the car around a race track using only vision.

How does the controller learn to drive straight?

In the initial stages of evolution, a suboptimal strategy is to just drive straight on both tracks, ignoring the first curve and crashing into the barrier. This can be seen in the flat portion of the curve until generation 118, when the fitness jumps from 140 to 190, as the controller learns to turn both left and right.

What grants were used to support this research?

This research was supported by Swiss National Science Foundation grants #137736: “Advanced Cooperative NeuroEvolution for Autonomous Control” and #138219: “Theory and Practice of Reinforcement Learning 2”.

(Open Access) Evolving large-scale neural networks for vision-based reinforcement learning (2013) | Jan Koutník

Q: What are the contributions mentioned in the paper "Evolving large-scale neural networks for vision-based reinforcement learning" ?

In this paper, the authors scale-up their “ compressed ” network encoding where network weight matrices are represented indirectly as a set of Fourier-type coefficients, to tasks that require very-large networks due to the high-dimensionality of their input space.

Q: How many fully interconnected neurons are fed into the SRN?

The saturation plane is passed through Roberts' edge detector [12] and then fed into an Elman (recurrent) neural network (SRN) with 16 × 16 = 256 fully-interconnected neurons in the hidden layer, and 3 output neurons.

Q: What is the main appeal of evolving neural networks?

Early work in NE focused on evolving rather small networks (hundreds of weights) for RL benchmarks, and control problems with relatively few inputs/outputs.

Q: What is the inverse DCT transform used to generate the weight values?

A Dm-dimensional inverse DCT transform is applied to the array to generate the weight values, which are mapped to their position in the corresponding 2D weight matrix.

Evolving Large-Scale Neural Networks

for Vision-Based Reinforcement Learning

Jan Koutník Giuseppe Cuccu Jürgen Schmidhuber Faustino Gomez

IDSIA, USI-SUPSI

Galleria 2

Manno-Lugano, CH 6928

{hkou, giuse, juergen, tino}@idsia.ch

ABSTRACT

The idea of using evolutionary computation to train artiﬁ-

cial neural networks, or neuroevolution (NE), for reinforce-

ment learning (RL) tasks has now been around for over 20

years. However, as RL tasks become more challenging, the

networks required become larger, as do their genomes. But,

scaling NE to large nets (i.e. tens of thousands of weights)

is infeasible using direct encodings that map genes one-to-

one to network components. In this paper, we scale-up our

“compressed” network encoding where network weight ma-

trices are represented indirectly as a set of Fourier-type co-

eﬃcients, to tasks that require very-large networks due to

the high-dimensionality of their input space. The approach

is demonstrated successfully on two reinforcement learning

tasks in which the control networks receive visual input: (1)

a vision-based version of the octopus control task requiring

networks with over 3 thousand weights, and (2) a version of

the TORCS driving game where networks with over 1 mil-

lion weights are evolved to drive a car around a track using

video images from the driver’s perspective.

1. INTRODUCTION

Neuroevolution (NE) has now been around for over 20

years. The main appeal of evolving neural networks instead

of training them (e.g. backpropagation) is that it can po-

tentially harness the universal function approximation ca-

pability of neural networks to solve reinforcement learning

(RL) tasks without relying on noisy, nonstationary gradient

information to perform temporal credit assignment.

Early work in NE focused on evolving rather small net-

works (hundreds of weights) for RL benchmarks, and con-

trol problems with relatively few inputs/outputs. However,

as RL tasks become more challenging, the networks required

become larger, as do their genomes. The result is that scal-

ing NE to large nets (i.e. tens of thousands of weights)

is infeasible using a straightforward, direct encoding where

genes map one-to-one to network components. Therefore,

recent eﬀorts have focused increasingly on indirect encod-

ings [2, 3, 6, 7, 13] where relatively small genomes are trans-

formed into networks of arbitrary size using a more complex

mapping.

In previous work [5, 8, 9, 14], we presented a new indirect encoding where network weight matrices are represented as a set of coefficients that are transformed into weight values via an inverse Fourier-type transform, so that evolutionary search is conducted in the frequency domain instead of weight space. The basic idea is that if nearby weights in the matrices are correlated, then this regularity can be encoded using fewer coefficients than weights, effectively reducing the search space dimensionality. For problems exhibiting a high degree of redundancy, this “compressed” approach can result in two orders of magnitude fewer free parameters and significant speedup [9].

GECCO’13 Companion, July 6–10, 2013, Amsterdam, The Netherlands.

ACM 978-1-4503-1964-5/13/07.

With this encoding, networks with over 3000 weights were

evolved to successfully control a high-dimensional version of

the Octopus Arm task [16], by searching in the space of as

few as 20 Fourier coeﬃcients (164:1 compression ratio) [10].

In this paper, the approach is scaled up to two tasks that

require networks with up to over 1 million weights, due to

their use of high-dimensional, vision inputs: (1) a visual

version of the aforementioned Octopus Arm task, and (2) a

visual version of the TORCS race car driving environment.

In the standard setup for TORCS, used now for several years

in reinforcement learning competitions (e.g. [11]), a set of

features describing the state of the car is provided to the

driver. In the version used here, the controllers do not have

access to these features, but instead must drive the car using

only a stream of images from the driver’s perspective; no

task-speciﬁc information is provided to the controller, and

the controllers must compute the car velocity internally, via

feedback (recurrent) connections, based on the history of

observed images. To our knowledge, this is the first attempt to

tackle TORCS using vision, and successfully evolve neural

network controllers of this size.

The next section describes the compressed network encod-

ing in detail. Section 3 presents the experiments in the two

test domains, which are discussed in section 4.

2. COMPRESSED NETWORKS

Networks are encoded as a string or genome, $g = \{g_1, \ldots, g_k\}$, consisting of $k$ substrings or chromosomes of real numbers representing DCT coefficients. The number of chromosomes is determined by the choice of network architecture, $\Psi$, and data structures used to decode the genome, specified by $\Omega = \{D_1, \ldots, D_k\}$, where $D_m$, $m = 1..k$, is the dimensionality of the coefficient array for chromosome $m$. The total number of coefficients, $C = \sum_{m=1}^{k} |g_m| \ll N$, is user-specified (for a compression ratio of $N/C$, where $N$ is the number of weights in the network), and the coefficients are distributed evenly over the chromosomes. Which frequencies should be included in the encoding is unknown. The approach taken here restricts the search space to band-limited neural networks where the power spectrum of the weight matrices goes to zero above a specified limit frequency, $c^m_\ell$, and chromosomes contain all frequencies up to $c^m_\ell$, $g_m = (c^m_0, \ldots, c^m_\ell)$.

Figure 1: Mapping the coefficients: The cuboidal array (top) is filled with the coefficients from chromosome g one simplex at a time, according to Algorithm 1, starting at the origin and moving to the opposite corner one simplex at a time.

Figure 2 illustrates the procedure used to decode the genomes. In this example, a fully-recurrent neural network (on the right) is represented by $k = 3$ weight matrices, one for the input layer weights, one for the recurrent weights, and one for the bias weights. The weights in each matrix are generated from a different chromosome which is mapped into its own $D_m$-dimensional array with the same number of elements as its corresponding weight matrix; in the case shown, $\Omega = \{3, 3, 2\}$: 3D arrays for both the input and recurrent matrices, and a 2D array for the bias weights.

Algorithm 1: Coefficient mapping(g, d)

    j ← 0
    K ← sort(diag(d) − I)
    for i = 0 to |d| − 1 + Σ_{n=1}^{|d|} d_n do
        l ← 0
        s_i ← {e | Σ_{k=1}^{|d|} ξ_k = i}
        while |s_i| > 0 do
            ind[j] ← argmin_{e ∈ s_i} ‖e − K[l++ mod |d|]‖
            s_i ← s_i \ ind[j++]
        end
    end
    for i = 0 to |ind| do
        if i < |g| then
            coeff_array[ind[i]] ← c_i
        else
            coeff_array[ind[i]] ← 0
    end

In [9], the coefficient matrices were 2D, where the simplexes are just the secondary diagonals; starting in the top-left corner, each diagonal is filled alternately starting from its corners. However, if the task exhibits inherent structure that cannot be captured by low frequencies in a 2D layout, more compression can potentially be gained by organizing the coefficients in higher-dimensional arrays [10].

Figure 3: Visual Octopus Arm. (a) The arm has to reach the goal (red dot) using a noisy visualization of the environment (b).

Each chromosome is mapped to its coefficient array according to Algorithm 1 (figure 1), which takes a list of array dimension sizes, $d = (d_1, \ldots, d_{D_m})$, and the chromosome, $g_m$, to create a total ordering on the array elements, $e_{\xi_1, \ldots, \xi_{D_m}}$. In the first loop, the array is partitioned into $(D_m - 1)$-simplexes, where each simplex, $s_i$, contains only those elements $e$ whose Cartesian coordinates, $(\xi_1, \ldots, \xi_{D_m})$, sum to integer $i$. The elements of simplex $s_i$ are ordered in the while loop according to their distance to the corner points (i.e. those points having exactly one non-zero coordinate; see example points for a 3D-array in figure 1), which form the rows of matrix $K = [p_1, \ldots, p_{D_m}]^T$, sorted in descending order by their sole, non-zero dimension size. In each loop iteration, the coordinates of the element with the smallest Euclidean distance to the selected corner are appended to the list ind, and the element is removed from $s_i$. The loop terminates when $s_i$ is empty.
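As a rough illustration of the ordering described above, the following Python sketch enumerates array coordinates simplex by simplex and, within each simplex, alternates between corner points. It is a sketch under assumptions, not the authors' implementation (which the paper says was written in Mathematica): the function names `coefficient_order` and `fill_coefficients` are hypothetical, ties in distance are broken by lexicographic coordinate order, and the outer loop visits only the non-empty simplexes.

```python
from itertools import product


def coefficient_order(dims):
    """Return array coordinates in a simplex-by-simplex fill order.

    Assumption: ties in Euclidean distance are broken by lexicographic
    coordinate order; the paper does not state a tie-breaking rule.
    """
    n = len(dims)
    # Corner points have exactly one non-zero coordinate (d_k - 1 on axis k),
    # sorted in descending order of that coordinate.
    corners = sorted(
        (tuple(dims[k] - 1 if axis == k else 0 for axis in range(n))
         for k in range(n)),
        key=max,
        reverse=True,
    )
    # Group elements into simplexes by the sum of their coordinates.
    simplexes = {}
    for element in product(*(range(size) for size in dims)):
        simplexes.setdefault(sum(element), []).append(element)

    order = []
    for i in sorted(simplexes):
        remaining = simplexes[i]
        l = 0
        while remaining:
            corner = corners[l % n]
            nearest = min(
                remaining,
                key=lambda e: sum((a - b) ** 2 for a, b in zip(e, corner)),
            )
            order.append(nearest)
            remaining.remove(nearest)
            l += 1
    return order


def fill_coefficients(genes, dims):
    """Place genes on the first len(genes) positions; zero everywhere else."""
    order = coefficient_order(dims)
    array = {coord: 0.0 for coord in order}
    for coord, gene in zip(order, genes):
        array[coord] = gene
    return array
```

For a 2 × 2 array this yields the order (0, 0), (1, 0), (0, 1), (1, 1), so a two-gene chromosome fills the origin and one neighbor and leaves the remaining positions at zero.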

After all of the simplexes have been traversed, the vector ind holds the ordered element coordinates. In the final loop, the array is filled with the coefficients from low to high frequency to the positions indicated by ind; the remaining positions are filled with zeroes. Finally, a $D_m$-dimensional inverse DCT transform is applied to the array to generate the weight values, which are mapped to their position in the corresponding 2D weight matrix. Once the k chromosomes have been transformed, the network is complete.
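To make the last decoding step concrete, here is a minimal pure-Python sketch of an inverse DCT for the 2D case. It assumes the orthonormal DCT-III convention, which the paper does not specify, and the function names `idct_1d` and `idct_2d` are hypothetical. The higher-dimensional arrays used in the experiments would apply the same 1D transform along each axis in turn.

```python
import math


def idct_1d(coeffs):
    """Orthonormal inverse DCT (DCT-III) of a 1D sequence.

    Assumption: orthonormal scaling; other conventions differ only by
    constant factors on the coefficients.
    """
    n = len(coeffs)
    out = []
    for x in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos(math.pi * (2 * x + 1) * k / (2 * n))
        out.append(total)
    return out


def idct_2d(coeff_matrix):
    """Separable 2D inverse DCT: transform rows first, then columns."""
    rows = [idct_1d(row) for row in coeff_matrix]
    cols = [idct_1d(list(col)) for col in zip(*rows)]
    # cols holds transformed columns; transpose back to row-major layout.
    return [list(row) for row in zip(*cols)]


# A single low-frequency coefficient decodes to a full matrix of weights.
coeffs = [[0.0] * 4 for _ in range(4)]
coeffs[0][0] = 4.0
weights = idct_2d(coeffs)
```

With only the lowest-frequency coefficient set, every decoded weight takes the same value, which is the extreme case of the correlation between neighboring weights that the encoding exploits.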

3. EXPERIMENTS

Two vision-based control tasks were used to scale-up the

compressed network encoding, the Visual Octopus Arm and

Visual TORCS. All neural network controllers were evolved

using the Cooperative Synapse NeuroEvolution (CoSyNE; [4])

algorithm.

3.1 Visual Octopus Arm

The octopus arm (see figure 3) consists of p compartments floating in a 2D water environment. Each compartment has a constant volume and contains three controllable muscles (dorsal, transverse and ventral). The state of a compartment is described by the x, y-coordinates of two of its corners plus their corresponding x and y velocities. Together with the arm base rotation, the arm has 8p + 2 state variables and 3p + 2 control variables. In the vision-based version used here, the control network does not have access to the state variables. Instead it receives a noisy 32 × 32 pixel gray-scale image of the arm from the perspective shown in figure 3(b). The goal of the task is to reach a target position with the tip of the arm, from the starting position (arm hanging down), by contracting the 32 muscles appropriately at each 1s step of simulated time. Figure 3(a) shows the standard visualization of the environment with the arm hanging down and two target positions shown in red. The idea of modifying an existing RL benchmark to use visual inputs dates back to the adaptive “broom balancer” of Tolat and Widrow [15], and more recently the vision-based mountain car in [1].

Figure 2: Decoding the compressed networks. The figure shows the three step process involved in transforming a genome of frequency-domain coefficients into a recurrent neural network. First, the genome (left) is divided into k chromosomes, one for each of the weight matrices specified by the network architecture, Ψ. Each chromosome is mapped, by Algorithm 1, into a coefficient array of a dimensionality specified by Ω. In this example, an RNN with two inputs and four neurons is encoded as 8 coefficients. There are k = |Ω| = 3 chromosomes and Ω={3, 3, 2}. The second step is to apply the inverse DCT to each array to generate the weight values, which are mapped into the weight matrices in the last step.

3.1.1 Setup

An evaluation consists of two trials, one with the target

on the left, the other on the right, see ﬁgure 3. In each trial

the target disappears after the ﬁrst time-step, so that the

network must remember which target is active throughout the

trial in order to solve the task. Having two target positions

forces the controller to use the visual input instead of just

outputting a ﬁxed action sequence (i.e. open-loop control).

The controllers were represented by fully-connected recur-

rent neural networks with 32 neurons, one for each muscle

in the 10 compartment arm, for a total of 33,824 weights

organized into 3 weight matrices. Twenty simulations were

run with networks encoded using the following numbers of

DCT coefficients: {10, 20, 40, 80, 160, 320, 640, 1280, 2560}. In

all cases the coefficients were mapped into 3 coefficient arrays

using mapping Ω={4, 4, 2}: (1) a 4D array encodes

the input weights from the 2D input image to the 2D array

of neurons, so that each weight is correlated (a) with the

weights of adjacent pixels for the same neuron, (b) with the

weights for the same pixel for neurons that are adjacent in

the 3 × 11 grid, and (c) with the weights from adjacent pix-

els connected to adjacent neurons; (2) a 4D array encodes

the recurrent weights, again capturing three types of correla-

tion; (3) a 2D array encodes the hidden layer biases (see [10]

for further discussion of higher-dimensional coeﬃcient ma-

trices).

CoSyNE was used to evolve the coefficient genomes, with a population size of 128, a mutation rate of 0.8, and fitness computed as the average of the following score over the two trials:

$$\max\left[1 - \frac{t}{T}\,\frac{d}{D},\; 0\right], \qquad (1)$$

where $t$ is the number of time-steps before the arm touches the target, $T$ is the number of time-steps in a trial, $d$ is the final distance of the arm tip to the target and $D$ is the initial distance of the arm tip to the goal. Each trial lasted for $T = 200$ time-steps.
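The scoring rule can be sketched in a few lines of Python. This assumes the score in equation (1) has the form max(1 − (t/T)(d/D), 0), built from the four quantities defined above; the function names are hypothetical.

```python
def octopus_trial_score(t, T, d, D):
    """Score for one trial, assuming the form max(1 - (t/T) * (d/D), 0).

    t: time-steps before the arm touches the target
    T: time-steps in a trial
    d: final distance of the arm tip to the target
    D: initial distance of the arm tip to the target
    """
    return max(1.0 - (t / T) * (d / D), 0.0)


def octopus_fitness(trials):
    """Average the trial score over the trials (left and right target)."""
    return sum(octopus_trial_score(*trial) for trial in trials) / len(trials)


# One trial that times out halfway to the target, one that reaches it early.
example = octopus_fitness([(200, 200, 1.0, 2.0), (50, 200, 0.0, 2.0)])
```

Under this form, a controller that touches the target (d = 0) scores 1 regardless of how long it took, while one that never gets closer than its starting distance within the full trial scores 0.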

3.1.2 Results

Figure 6 compares the performance of the various compressed encodings with the direct encoding, in which evolution is conducted in 33,824-dimensional weight space. Using only 10 coefficients performs poorly, but almost as well as the direct approach. With just 20 coefficients, performance increases significantly, and after 40 coefficients, near-optimal performance is achieved. For a video demo of the evolved behavior go to http://www.idsia.ch/~koutnik/images/octopusVisual.mp4

3.2 Visual TORCS

The goal of the task is to evolve a recurrent neural network

controller that can drive the car around a race track using

only vision. The challenge for the controller is not only to

interpret each static image as it is received, but also to retain

information from previous images in order to compute the

velocity of the car internally, via its feedback connections.

The visual TORCS environment is based on TORCS version 1.3.1. The simulator had to be modified to provide images as input to the controllers. At each time-step during a network evaluation, an image rendered in OpenGL is captured in the car code (C++), and passed via UDP to the client (Java) that contains the RNN controller. The client is wrapped into a Java class that provides methods for setting up the RNN weights, executing the evaluation, and returning the fitness score. These methods are called from Mathematica, which is used to implement the compressed networks and the evolutionary search.

Figure 4: Visual TORCS environment. (a) The 1st-person perspective used as input to the RNN controllers (figure 5) to drive the car around the track. (b) A 3rd-person perspective of the car. (c) The track used to evolve the controllers, with a length of 714.16 m and a road width of 10 m, consisting of straight segments of length 50 m and 100 m and curves with a radius of 25 m. The car starts at the bottom (start line) and has to drive counter-clockwise. The track boundary has a width of 14 m.

The Java wrapper allows multiple controllers to be evalu-

ated in parallel in diﬀerent instances of the simulator via dif-

ferent UDP ports. This feature is critical for the experiments

presented below since, unlike the non-vision-based TORCS,

the costly image rendering, required for vision, cannot be

disabled. The main drawback of the current implementa-

tion is that the images are captured from the screen buﬀer

and, therefore, have to actually be rendered to the screen.

Other tweaks to the original TORCS include changing the

control frequency from 50 Hz to 5 Hz, and removing the “3-

2-1-GO” waiting sequence from the beginning of each race.

The image passed in the UDP is encoded as a message chunk

with image preﬁx, followed by unsigned byte values of the

image pixels. Each image is decomposed into the HSB color

space and only the saturation (S) plane is passed in the

message.
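The message layout described above (a prefix followed by one unsigned byte per pixel) could be decoded roughly as follows. This is only an illustration in Python; the paper's client is written in Java and its exact wire format is not given. The prefix literal `b"image"`, the row-major pixel order, the scaling to [0, 1], and the function name are all assumptions.

```python
IMAGE_PREFIX = b"image"  # hypothetical literal; the paper only says "image prefix"


def decode_image_message(payload, width=64, height=64, prefix=IMAGE_PREFIX):
    """Turn a datagram payload into a 2D list of saturation values in [0, 1].

    Assumptions: pixels are stored row by row, one unsigned byte each,
    immediately after the prefix.
    """
    if not payload.startswith(prefix):
        raise ValueError("payload does not start with the image prefix")
    pixels = payload[len(prefix):]
    if len(pixels) != width * height:
        raise ValueError("unexpected number of pixel bytes")
    # Indexing a bytes object yields integers in 0..255 (unsigned bytes).
    return [
        [pixels[row * width + col] / 255.0 for col in range(width)]
        for row in range(height)
    ]


# A tiny 2 x 2 example message.
tiny = decode_image_message(b"image" + bytes([0, 255, 51, 102]), width=2, height=2)
```

In a real client the payload would come from a UDP socket bound to the port of one simulator instance, which is how the Java wrapper described above keeps parallel evaluations apart.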

3.2.1 Setup

In each ﬁtness evaluation, the car is placed at the starting

line of the track shown in ﬁgure 4(c), and its mirror image,

and a race is run for 25s of simulated time, resulting in a

maximum of 125 time-steps at the 5Hz control frequency.

At each control step (see figure 5), a raw 64 × 64 pixel image, taken from the driver’s perspective, is split into three color planes (hue, saturation and brightness). The saturation plane is passed through Roberts' edge detector [12] and then fed into an Elman (recurrent) neural network (SRN) with 16 × 16 = 256 fully-interconnected neurons in the hidden layer, and 3 output neurons. The first two outputs, $o_1$, $o_2$, are averaged, $(o_1 + o_2)/2$, to provide the steering signal, and the third neuron, $o_3$, controls the brake and throttle (−1 = full brake, 1 = full throttle). All neurons use sigmoidal activation functions.
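The preprocessing and the output decoding can be sketched as below. The edge detector follows the standard Roberts cross operator (gradient magnitude from two 2 × 2 diagonal differences); how the original code handles the image border is not stated, so this version simply returns an image one pixel smaller in each direction. The function names are hypothetical.

```python
import math


def roberts_cross(image):
    """Roberts cross edge magnitude for a 2D list of pixel values.

    Assumption: no border padding, so the result is (h - 1) x (w - 1).
    """
    h, w = len(image), len(image[0])
    return [
        [
            math.hypot(
                image[r][c] - image[r + 1][c + 1],   # first diagonal difference
                image[r][c + 1] - image[r + 1][c],   # second diagonal difference
            )
            for c in range(w - 1)
        ]
        for r in range(h - 1)
    ]


def decode_outputs(o1, o2, o3):
    """Map the three network outputs to (steering, brake/throttle)."""
    steering = (o1 + o2) / 2.0
    return steering, o3


# A vertical step edge produces a response only along the step.
edges = roberts_cross([[0, 0, 1, 1]] * 3)
controls = decode_outputs(0.25, 0.75, -1.0)
```

Here `controls` is (0.5, -1.0): a steering signal of 0.5 with the brake fully applied.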

With this architecture, the networks have a total of 1,115,139 weights, organized into 5 weight matrices. The weights are encoded indirectly by 200 DCT coefficients which are mapped into 5 coefficient arrays using mapping Ω={4, 4, 2, 3, 1}: (1) a 4D array encodes the input weights from the 2D input image to the 2D array of neurons in the hidden layer, so that each weight is correlated (a) with the weights of adjacent pixels for the same neuron, (b) with the weights for the same pixel for neurons that are adjacent in the 16 × 16 grid, and (c) with the weights from adjacent pixels connected to adjacent neurons; (2) a 4D array encodes the recurrent weights in the hidden layer, again capturing three types of correlations; (3) a 2D array encodes the hidden layer biases; (4) a 3D array encodes weights between the hidden layer and 3 output neurons; and (5) a 1D array with 3 elements encodes the output neuron biases.
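The stated weight total can be checked with a short calculation. The sketch below assumes that all 64 × 64 = 4096 pixels feed every hidden neuron and that the five matrices are the ones listed above; the function name is hypothetical.

```python
def elman_weight_counts(n_inputs, n_hidden, n_outputs):
    """Weights per matrix for a fully connected Elman network with biases."""
    return {
        "input": n_inputs * n_hidden,
        "recurrent": n_hidden * n_hidden,
        "hidden_bias": n_hidden,
        "output": n_hidden * n_outputs,
        "output_bias": n_outputs,
    }


counts = elman_weight_counts(n_inputs=64 * 64, n_hidden=16 * 16, n_outputs=3)
total_weights = sum(counts.values())        # 1,115,139
compression_ratio = total_weights / 200     # weights per DCT coefficient
```

The five entries are 1,048,576 + 65,536 + 256 + 768 + 3 = 1,115,139 weights, and dividing by the 200 coefficients gives a compression ratio of roughly 5576:1.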

CoSyNE was used to evolve the coefficient genomes, with a population size of 64, a mutation rate of 0.8, and fitness being computed by:

$$f = d - \frac{3m}{1000} + \frac{v_{\max}}{5} - 100c, \qquad (2)$$

where $d$ is the distance along the track axis, $v_{\max}$ is maximum speed, $m$ is the cumulative damage, and $c$ is the sum of squares of the control signal differences, divided by the number of control variables, 3, and the number of simulation control steps, $T$:

$$c = \frac{1}{3T} \sum_{i=1}^{3} \sum_{t=1}^{T} \left[o_i(t) - o_i(t-1)\right]^2. \qquad (3)$$

Figure 6: Performance on Visual Octopus Arm Task. Each curve is the average of 20 runs using a particular number of coefficients to encode the networks.

The maximum speed component in equation (2) forces the controllers to accelerate and brake efficiently, while the damage component favors controllers that drive safely, and c encourages smoother driving. Fitness scores roughly correspond to the distance traveled along the race track axis.

Each individual is evaluated both on the track and its mirror image to prevent the RNN from blindly memorizing the track without using the visual input. The original track starts with a left turn, while the mirrored track starts with a right turn, forcing the network to use the visual input to distinguish between tracks. The fitness score is calculated as the minimum of the two track scores.
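The fitness computation in equations (2) and (3), together with the minimum over the two tracks, can be sketched as follows. The damage weight 3/1000 and the speed weight 1/5 are assumptions about the coefficients in equation (2) and are exposed as keyword parameters so they can be changed; the −100c term, the definition of c, and the minimum over the two tracks come from the text. All function names are hypothetical.

```python
def control_smoothness(outputs):
    """Equation (3): mean squared change of the control signals.

    outputs: list of (o1, o2, o3) tuples for steps 0..T, so T = len(outputs) - 1.
    """
    steps = len(outputs) - 1
    total = sum(
        (cur - prev) ** 2
        for prev_vec, cur_vec in zip(outputs, outputs[1:])
        for prev, cur in zip(prev_vec, cur_vec)
    )
    return total / (3 * steps)


def track_fitness(distance, damage, v_max, c,
                  damage_weight=3 / 1000, speed_weight=1 / 5):
    """Equation (2), with the two assumed weights as parameters."""
    return distance - damage_weight * damage + speed_weight * v_max - 100 * c


def overall_fitness(track_score, mirror_score):
    """The fitness of an individual is the worse of its two track scores."""
    return min(track_score, mirror_score)


c_example = control_smoothness([(0, 0, 0), (1, 0, 0), (1, 0, 0)])
f_example = track_fitness(distance=100.0, damage=1000.0, v_max=50.0, c=0.01)
```

Taking the minimum rather than the average means a controller that drives one track well and crashes on the mirrored one gains nothing from the good run, which pushes evolution toward using the visual input.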

3.2.2 Results

Table 1 compares the distance travelled and maximum

speed of the visual RNN controller with that of other, hard-

coded controllers that come with the TORCS package. The

performance of the vision-based controller is similar to that

of the other controllers which enjoy access to the full set of

pre-processed TORCS features, such as forward and lateral

speed, angle to the track axis, position at the track, distance

to the track side, etc.

Figure 7 shows the learning curve for the compressed net-

works (upper curve). The lower curve shows a typical evolu-

tionary run where the network is evolved directly in weight

space, i.e. using chromosomes with 1,115,139 genes, one for

each weight, instead of 200 coeﬃcient genes. Direct evolu-

tion makes little progress as each of the weights has to be

set individually, without being explicitly constrained by the

values of other weights in their matrix neighborhood, as is

the case for the compressed encoding.

As discussed above, the controllers were evaluated on two tracks to prevent them from simply “memorizing” a single sequence of curves. In the initial stages of evolution, a suboptimal strategy is to just drive straight on both tracks, ignoring the first curve and crashing into the barrier. This is a simple behavior, requiring no vision, that produces relatively high fitness, and therefore represents local minima in the fitness landscape. This can be seen in the flat portion of the curve until generation 118, when the fitness jumps from 140 to 190, as the controller learns to turn both left and right. Gradually, the controllers start to distinguish between the two tracks as they develop useful visual feature detectors, and from then on the evolutionary search refines the control to optimize acceleration and braking through the curves and straight sections. For a video demo go to http://www.idsia.ch/~koutnik/images/torcsVisual.mp4

Footnote 1: Evolution can find weights that implement a dynamical system that drives around the track from the same initial conditions, even with no input.

Table 1: Maximum distance, d, in meters and maximum speed, v_max, in kilometers per hour achieved by the selected hard-coded controllers that enjoy access to the state variables, compared to the visual RNN controller which does not.

controller    d [m]    v_max [km/h]
olethros      570      147
bt            613      141
berniw        624      149
tita          657      150
inferno       682      150
visual RNN    625      144

4. DISCUSSION

The compressed network encoding reduces the search space

dimensionality by exploiting the inherent regularity in the

environment. Since, as with most natural images, the pix-

els in a given neighborhood tend to have correlated values,

searching for each weight independently is overkill. Us-

ing fewer coeﬃcients than weights sacriﬁces some expressive

power (some networks can no longer be represented), but

constrains the search to the subspace of lower complexity,

but still suﬃciently powerful networks, reducing the search

space dimensionality by, e.g. a factor of more than 5000 for

the car-driving networks evolved here.

Figure 8(a) shows the weights from the input layer of a

successful car-driving network. Each 64×64 square corre-

sponds to the input weight values of a particular neuron

in the 16×16 hidden layer. The pattern in each square in-

dicates the part of the input image to which the neuron

responds. Because of the 4D structure of the input coeﬃ-

cient matrix, these feature detectors vary smoothly across

the layer. This regularity is apparent in all ﬁve of the net-

work matrices. Figures 8(b) and 8(c) show the activation of

the hidden layer while driving through a left and right curve

on the track, respectively. The two curves produce very dif-

ferent activation patterns from which the network computes

the control signal. The highly activated neurons (in orange)

form contiguous regions due to the correlation between the

feature detectors of adjacent neurons.

Further experiments are needed to compare the approach

with other indirect or generative encodings such as Hyper-

NEAT [2]; not only to evaluate the relative eﬃciency of each

algorithm, but also to understand how the methods diﬀer in

the type of solutions they produce. Part of that comparison

should involve testing the controllers in diﬀerent conditions

from those under which they were evolved (e.g. on diﬀerent

tracks) to measure the degree to which the ability to gen-

eralize beneﬁts from the low-complexity representation, as

was shown in [10].

In this work, the size of the networks was decided heuris-

tically. In the future, we would like to apply Compressed


References

I and J

Machine perception of three-dimensional solids,

Designing Neural Networks Using Genetic Algorithms with Graph Generation System

Accelerated Neural Evolution through Cooperatively Coevolved Synapses

Dynamic Model of the Octopus Arm. I. Biomechanics of the Octopus Reaching Movement
