Symbolic Reasoning with Differentiable Neural Computers

Alex Graves*, Greg Wayne*, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gomez, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adrià Puigdomènech Badia, Karl Moritz Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, Demis Hassabis.

*Joint first authors
Recent breakthroughs demonstrate that neural networks are remarkably adept at sensory processing [1], sequence learning [2, 3] and reinforcement learning [4]. However, cognitive scientists and neuroscientists have argued that neural networks are limited in their ability to define variables and data structures [5-9], store data over long time scales without interference [10, 11], and manipulate data to solve tasks. Conventional computers, on the other hand, can easily be programmed to store and process large data structures in memory, but cannot learn to recognise complex patterns. This work aims to combine the advantages of neural and computational processing by providing a neural network with read-write access to an external memory. We refer to the resulting architecture as a Differentiable Neural Computer (DNC). Memory access is sparse, minimising interference among memoranda and enabling long-term storage [12, 13], and the entire system can be trained with gradient descent, allowing the network to learn how to operate and organise the memory in a goal-directed manner. We demonstrate DNC's ability to manipulate large data structures by applying it to a set of synthetic question-answering tasks involving graphs, such as finding shortest paths and inferring missing links. We then show that DNC can learn, based solely on behavioural reinforcement [14, 15], to carry out complex symbolic instructions in a game environment [16]. Taken together, these results suggest that DNC is a promising model for tasks requiring a combination of pattern recognition and symbol manipulation, such as question-answering and memory-based reinforcement learning.
Modern computers separate computation and memory. Computation is performed by a processor, which can use an addressable memory to bring operands in and out of play. This confers on the computer two important properties: extensible storage, which allows new information to be written as it arrives, and the ability to treat the contents of memory locations as variables. Variables are critical to algorithm generality: to perform the same procedure on one datum or another, an algorithm merely has to change the address it looks up or the content of the address. In contrast to computers, the computational and memory resources of artificial neural networks are mixed together in the network weights and neuron activity. This is a major liability: as the memory demands of a task increase, these networks cannot allocate new storage dynamically, nor easily learn algorithms that act independently of the values realised by the task variables.
The Differentiable Neural Computer (DNC) is a neural network coupled to an external memory matrix (Figure 1). The behaviour of the controller network is independent of the memory size as long as the memory is not filled to capacity, which is why we view the memory as "external". If the memory can be thought of as DNC's RAM, then the network, referred to as the controller, is a CPU whose operations are learned. DNCs differ from recent neural memory frameworks [17, 18] in that the memory can be selectively written to as well as read, allowing iterative modification of memory content. An earlier form of DNC, the Neural Turing Machine [19], had a similar structure but less flexible memory access methods (Methods).
While conventional computers use unique addresses to access memory contents, DNC uses differentiable attention mechanisms [2, 19-22] to define distributions over the rows, or locations, in the memory matrix. These distributions, which we call weightings, represent the degree to which each location is involved in a read or write operation, and are typically very sparse in a trained system. For example, the read vector r returned by weighting w over memory M is simply a weighted sum over the N memory locations: r = Σ_{i=1}^{N} M[i, ·] w[i]. The functional units that determine and apply the weightings are called read and write heads. Crucially, the heads are differentiable, allowing the complete system to learn by gradient descent.
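As a toy illustration of the read operation above (not the authors' code; the dimensions and values here are invented for the example), the weighted sum over locations is a single matrix-vector product in NumPy:

```python
import numpy as np

def read_memory(M, w):
    """Return the read vector r = sum_i M[i, :] * w[i].

    M: memory matrix of shape (N, W) -- N locations, word size W.
    w: read weighting of shape (N,), a sparse distribution over locations.
    """
    return M.T @ w

# 4 locations, word size 3; the weighting focuses sharply on location 0
M = np.array([[1., 0., 0.],
              [0., 1., 0.],
              [0., 0., 1.],
              [1., 1., 1.]])
w = np.array([0.9, 0.1, 0.0, 0.0])
r = read_memory(M, w)  # close to M[0], with a small blend of M[1]
```

Because the weighting is sparse, the read vector is dominated by the content of a single location, which is what makes the operation behave like a soft version of conventional addressing.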
The heads employ three distinct forms of attention. The first is content lookup [19, 20, 23-25], in which a key emitted by the controller is compared to the content of each location in memory according to a similarity measure (here: cosine similarity). The similarity scores determine a weighting that can be used by the read heads for associative recall [26] or by the write head to modify an existing vector in memory. Importantly, a key that only partially matches the content of a memory location can still be used to attend strongly to that location. This enables key-value retrieval, where the value recovered by reading the memory location includes additional information not present in the key. Key-value retrieval provides a rich mechanism for navigating associative data structures in the external memory, as the content of one address can effectively encode references to other addresses. In our experiments, this proved essential to processing graph data.
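A minimal sketch of content lookup, assuming the similarities are sharpened by a softmax with a key-strength parameter β (the softmax and β are a common choice in this line of work, not spelled out in this section; all names here are ours):

```python
import numpy as np

def content_weighting(M, key, beta=10.0):
    """Weighting over locations by cosine similarity to a key.

    A softmax over beta-scaled similarities yields a normalised
    distribution over the N locations; beta controls its sharpness.
    """
    eps = 1e-8  # guards against division by zero for empty rows
    sims = (M @ key) / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + eps)
    e = np.exp(beta * sims)
    return e / e.sum()

# a partial key still attends strongly to the best-matching location
M = np.array([[1., 1., 0., 0.],
              [0., 0., 1., 1.],
              [1., 0., 1., 0.]])
key = np.array([1., 1., 0., 1.])  # matches location 0 only partially
w = content_weighting(M, key)     # mass concentrates on location 0
```

Reading with this weighting then returns the full stored row, including the components absent from the key, which is the key-value retrieval behaviour described above.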
Figure 1: DNC Architecture. a: A recurrent controller network receives input from an external data source and produces output. b & c: The controller also outputs vectors that parameterise one write head (green) and multiple read heads (two in this case: blue and pink). The heads define weightings that selectively focus on the rows, or locations, in the memory matrix (stronger colour for higher weight). The read vectors returned by the read heads are passed to the controller at the next time step. d: A temporal link matrix records the order in which locations were written; here, we represent that order using directed arrows. The grey arrows indicate a write event that was split between two locations.

A second attention mechanism records transitions between consecutively written locations in an N × N temporal link matrix L (Figure 1d). L[i, j] is close to one if i was the next location written after j, and is close to zero otherwise. For any weighting w, the operation Lw smoothly shifts the focus forward to the locations written after those emphasised in w, while Lᵀw shifts the focus backward. This gives DNC the native ability to recover sequences in the order in which they were presented.
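To make the forward and backward shifts concrete, here is a hand-built link matrix for a write order 0 → 1 → 2 (illustrative only; in a real DNC, L is accumulated online from the write weightings rather than set by hand):

```python
import numpy as np

# L[i, j] ~ 1 if location i was written immediately after location j
L = np.zeros((3, 3))
L[1, 0] = 1.0  # location 1 written after location 0
L[2, 1] = 1.0  # location 2 written after location 1

w = np.array([0.0, 1.0, 0.0])  # current read focus on location 1
forward = L @ w                # L w   -> focus shifts to location 2
backward = L.T @ w             # L^T w -> focus shifts back to location 0
```

Iterating the forward shift from the first written location replays the whole sequence in presentation order, which is the ability the text describes.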
The third form of attention allocates memory for writing. The usage of each location is represented as a number between zero and one. Based on the usages, a weighting over unused locations is delivered to the write head. As well as automatically increasing with each write to a location, usage can be decreased after each read using the free gates. This allows the controller to reallocate memory that is no longer required (Supplementary Figure 3). As a consequence of its allocation mechanism, DNC can be trained to solve a task using one size of memory and later be upgraded to a larger memory without retraining and without any impact on performance (Supplementary Figure 1). This property would also make it possible to use an unbounded external memory by automatically increasing the number of locations every time the usage of all locations passes a certain threshold.
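One plausible reading of how usages become a weighting over unused locations (the exact formula lives in the Methods, not in this section; this sketch simply ranks locations by usage so the least-used location receives the most allocation weight):

```python
import numpy as np

def allocation_weighting(u):
    """Allocation weighting from per-location usages u in [0, 1].

    Locations are visited in order of increasing usage; each receives
    (1 - usage) scaled by the product of the usages of the locations
    ranked before it, so nearly-free locations dominate the weighting.
    """
    phi = np.argsort(u)  # "free list": indices sorted by ascending usage
    a = np.zeros_like(u)
    prod = 1.0
    for j in phi:
        a[j] = (1.0 - u[j]) * prod
        prod *= u[j]
    return a

u = np.array([0.9, 0.1, 0.5])
a = allocation_weighting(u)  # location 1 (least used) dominates
```

Because the weighting depends only on relative usages, not on the number of locations, a construction like this is consistent with the memory-resizing property claimed above: adding fresh (zero-usage) rows simply gives the write head more candidates.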
Although the design of DNC was motivated largely by computational considerations, we cannot resist drawing some connection between the attention mechanisms and the functional capabilities of the mammalian hippocampus. DNC memory modification is fast and can be one-shot, resembling the associative long-term potentiation of hippocampal CA3 and CA1 synapses [27]. The hippocampal dentate gyrus, a region known to support neurogenesis [28], has been proposed to in-