Symbolic Reasoning with Differentiable Neural Computers

Alex Graves*, Greg Wayne*, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gomez, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adrià Puigdomènech Badia, Karl Moritz Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, Demis Hassabis.

*Joint first authors
Recent breakthroughs demonstrate that neural networks are remarkably adept at sensory processing [1], sequence learning [2, 3] and reinforcement learning [4]. However, cognitive scientists and neuroscientists have argued that neural networks are limited in their ability to define variables and data structures [5-9], store data over long time scales without interference [10, 11], and manipulate data to solve tasks. Conventional computers, on the other hand, can easily be programmed to store and process large data structures in memory, but cannot learn to recognise complex patterns. This work aims to combine the advantages of neural and computational processing by providing a neural network with read-write access to an external memory. We refer to the resulting architecture as a Differentiable Neural Computer (DNC). Memory access is sparse, minimising interference among memoranda and enabling long-term storage [12, 13], and the entire system can be trained with gradient descent, allowing the network to learn how to operate and organise the memory in a goal-directed manner. We demonstrate DNC's ability to manipulate large data structures by applying it to a set of synthetic question-answering tasks involving graphs, such as finding shortest paths and inferring missing links. We then show that DNC can learn, based solely on behavioural reinforcement [14, 15], to carry out complex symbolic instructions in a game environment [16]. Taken together, these results suggest that DNC is a promising model for tasks requiring a combination of pattern recognition and symbol manipulation, such as question-answering and memory-based reinforcement learning.
Modern computers separate computation and memory. Computation is performed by a processor, which can use an addressable memory to bring operands in and out of play. This confers on the computer two important properties: extensible storage, which allows new information to be written as it arrives, and the ability to treat the contents of memory locations as variables. Variables are critical to algorithm generality: to perform the same procedure on one datum or another, an algorithm merely has to change the address it looks up or the content of the address. In contrast to computers, the computational and memory resources of artificial neural networks are mixed together in the network weights and neuron activity. This is a major liability: as the memory demands of a task increase, these networks cannot allocate new storage dynamically, nor easily learn algorithms that act independently of the values realised by the task variables.
The Differentiable Neural Computer (DNC) is a neural network coupled to an external memory matrix (Figure 1). The behaviour of the controller network is independent of the memory size as long as the memory is not filled to capacity, which is why we view the memory as "external". If the memory can be thought of as DNC's RAM, then the network, referred to as the controller, is a CPU whose operations are learned. DNCs differ from recent neural memory frameworks [17, 18] in that the memory can be selectively written to as well as read, allowing iterative modification of memory content. An earlier form of DNC, the Neural Turing Machine [19], had a similar structure but less flexible memory access methods (Methods).
While conventional computers use unique addresses to access memory contents, DNC uses differentiable attention mechanisms [2, 19-22] to define distributions over the rows, or locations, in the memory matrix. These distributions, which we call weightings, represent the degree to which each location is involved in a read or write operation, and are typically very sparse in a trained system. For example, the read vector r returned by weighting w over memory M is simply a weighted sum over the N memory locations: r = Σ_{i=1}^{N} M[i, ·] w[i]. The functional units that determine and apply the weightings are called read and write heads. Crucially, the heads are differentiable, allowing the complete system to learn by gradient descent.
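As a toy illustration of the read operation above (not the authors' code; the dimensions and values here are invented for the example), the weighted sum over locations is a single matrix-vector product in NumPy:

```python
import numpy as np

def read_memory(M, w):
    """Return the read vector r = sum_i M[i, :] * w[i].

    M: memory matrix of shape (N, W) -- N locations, word size W.
    w: read weighting of shape (N,), a sparse distribution over locations.
    """
    return M.T @ w

# 4 locations, word size 3; the weighting focuses sharply on location 0
M = np.array([[1., 0., 0.],
              [0., 1., 0.],
              [0., 0., 1.],
              [1., 1., 1.]])
w = np.array([0.9, 0.1, 0.0, 0.0])
r = read_memory(M, w)  # close to M[0], with a small blend of M[1]
```

Because the weighting is sparse, the read vector is dominated by the content of a single location, which is what makes the operation behave like a soft version of conventional addressing.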
The heads employ three distinct forms of attention. The first is content lookup [19, 20, 23-25], in which a key emitted by the controller is compared to the content of each location in memory according to a similarity measure (here: cosine similarity). The similarity scores determine a weighting that can be used by the read heads for associative recall [26] or by the write head to modify an existing vector in memory. Importantly, a key that only partially matches the content of a memory location can still be used to attend strongly to that location. This enables key-value retrieval, where the value recovered by reading the memory location includes additional information not present in the key. Key-value retrieval provides a rich mechanism for navigating associative data structures in the external memory, as the content of one address can effectively encode references to other addresses. In our experiments, this proved essential to processing graph data.
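A minimal sketch of content lookup, assuming the similarities are sharpened by a softmax with a key-strength parameter β (the softmax and β are a common choice in this line of work, not spelled out in this section; all names here are ours):

```python
import numpy as np

def content_weighting(M, key, beta=10.0):
    """Weighting over locations by cosine similarity to a key.

    A softmax over beta-scaled similarities yields a normalised
    distribution over the N locations; beta controls its sharpness.
    """
    eps = 1e-8  # guards against division by zero for empty rows
    sims = (M @ key) / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + eps)
    e = np.exp(beta * sims)
    return e / e.sum()

# a partial key still attends strongly to the best-matching location
M = np.array([[1., 1., 0., 0.],
              [0., 0., 1., 1.],
              [1., 0., 1., 0.]])
key = np.array([1., 1., 0., 1.])  # matches location 0 only partially
w = content_weighting(M, key)     # mass concentrates on location 0
```

Reading with this weighting then returns the full stored row, including the components absent from the key, which is the key-value retrieval behaviour described above.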
Figure 1: DNC Architecture. a: A recurrent controller network receives input from an external data source and produces output. b & c: The controller also outputs vectors that parameterise one write head (green) and multiple read heads (two in this case: blue and pink). The heads define weightings that selectively focus on the rows, or locations, in the memory matrix (stronger colour for higher weight). The read vectors returned by the read heads are passed to the controller at the next time step. d: A temporal link matrix records the order in which locations were written; here, we represent that order using directed arrows. The grey arrows indicate a write event that was split between two locations.

A second attention mechanism records transitions between consecutively written locations in an N × N temporal link matrix L (Figure 1d). L[i, j] is close to one if i was the next location written after j, and is close to zero otherwise. For any weighting w, the operation Lw smoothly shifts the focus forward to the locations written after those emphasised in w, while Lᵀw shifts the focus backward. This gives DNC the native ability to recover sequences in the order in which they were presented.
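To make the forward and backward shifts concrete, here is a hand-built link matrix for a write order 0 → 1 → 2 (illustrative only; in a real DNC, L is accumulated online from the write weightings rather than set by hand):

```python
import numpy as np

# L[i, j] ~ 1 if location i was written immediately after location j
L = np.zeros((3, 3))
L[1, 0] = 1.0  # location 1 written after location 0
L[2, 1] = 1.0  # location 2 written after location 1

w = np.array([0.0, 1.0, 0.0])  # current read focus on location 1
forward = L @ w                # L w   -> focus shifts to location 2
backward = L.T @ w             # L^T w -> focus shifts back to location 0
```

Iterating the forward shift from the first written location replays the whole sequence in presentation order, which is the ability the text describes.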
The third form of attention allocates memory for writing. The usage of each location is represented as a number between zero and one. Based on the usages, a weighting over unused locations is delivered to the write head. As well as automatically increasing with each write to a location, usage can be decreased after each read using the free gates. This allows the controller to reallocate memory that is no longer required (Supplementary Figure 3). As a consequence of its allocation mechanism, DNC can be trained to solve a task using one size of memory and later be upgraded to a larger memory without retraining and without any impact on performance (Supplementary Figure 1). This property would also make it possible to use an unbounded external memory by automatically increasing the number of locations every time the usage of all locations passes a certain threshold.
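One plausible reading of how usages become a weighting over unused locations (the exact formula lives in the Methods, not in this section; this sketch simply ranks locations by usage so the least-used location receives the most allocation weight):

```python
import numpy as np

def allocation_weighting(u):
    """Allocation weighting from per-location usages u in [0, 1].

    Locations are visited in order of increasing usage; each receives
    (1 - usage) scaled by the product of the usages of the locations
    ranked before it, so nearly-free locations dominate the weighting.
    """
    phi = np.argsort(u)  # "free list": indices sorted by ascending usage
    a = np.zeros_like(u)
    prod = 1.0
    for j in phi:
        a[j] = (1.0 - u[j]) * prod
        prod *= u[j]
    return a

u = np.array([0.9, 0.1, 0.5])
a = allocation_weighting(u)  # location 1 (least used) dominates
```

Because the weighting depends only on relative usages, not on the number of locations, a construction like this is consistent with the memory-resizing property claimed above: adding fresh (zero-usage) rows simply gives the write head more candidates.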
Although the design of DNC was motivated largely by computational considerations, we cannot resist drawing some connection between the attention mechanisms and the functional capabilities of the mammalian hippocampus. DNC memory modification is fast and can be one-shot, resembling the associative long-term potentiation of hippocampal CA3 and CA1 synapses [27]. The hippocampal dentate gyrus, a region known to support neurogenesis [28], has been proposed to in-