Overview of the SpiNNaker System Architecture
Steve B. Furber, Fellow, IEEE, David R. Lester, Luis A. Plana, Senior Member, IEEE, Jim D. Garside, Eustace Painkras, Steve Temple, and Andrew D. Brown, Senior Member, IEEE
Abstract: SpiNNaker (a contraction of Spiking Neural Network Architecture) is a million-core computing engine whose flagship goal is to be able to simulate the behaviour of aggregates of up to a billion neurons in real time. It consists of an array of ARM9 cores, communicating via packets carried by a custom interconnect fabric. The packets are small (40 or 72 bits), and their transmission is brokered entirely by hardware, giving the overall engine an extremely high bisection bandwidth of over 5 billion packets/s. Three of the principal axioms of parallel machine design (memory coherence, synchronicity and determinism) have been discarded in the design without, surprisingly, compromising the ability to perform meaningful computations. A further attribute of the system is the acknowledgment, from the initial design stages, that the sheer size of the implementation will make component failures an inevitable aspect of day-to-day operation, and fault detection and recovery mechanisms have been built into the system at many levels of abstraction. This paper describes the architecture of the machine and outlines the underlying design philosophy; software and applications are to be described in detail elsewhere, and only introduced in passing here as necessary to illuminate the description.
Index Terms: Interconnection architectures, parallel processors, neurocomputers, real-time distributed.
1 INTRODUCTION
The SpiNNaker engine [1] is a massively-parallel multi-core computing system. It will contain up to 1,036,800 ARM9 cores and 7Tbytes of RAM distributed throughout the system in 57K nodes, each node being a System-in-Package (SiP) containing 18 cores plus a 128Mbyte off-die SDRAM (Synchronous Dynamic Random Access Memory). Each core has associated with it 64Kbytes of data tightly-coupled memory (DTCM) and 32Kbytes of instruction tightly-coupled memory (ITCM).
The cores have a variety of ways of communicating
with each other and with the memory, the dominant of
which is by packets. These are 5- or 9-byte (40- or 72-bit)
quanta of information that are transmitted around the
system under the aegis of a bespoke concurrent hardware
routing system.
The physical hierarchy of the system has each node containing two silicon dies: the SpiNNaker chip itself, plus the Mobile DDR (Double Data Rate) SDRAM, which is physically mounted on top of the SpiNNaker die and stitch-bonded to it (see Fig. 1). The nodes are packaged and mounted in a 48-node hexagonal array on a PCB (Printed Circuit Board), the full system requiring 1,200 such boards. In operation, the engine consumes at most 90kW of electrical power.
This paper will describe architectural and physical design aspects of the system. Clearly, there are many challenges associated with the design, construction and use of a system as large and complex as this; the software and application portfolio will be described in detail elsewhere. While previous papers have presented aspects of the architecture (e.g. [2], [3]; a complete list of SpiNNaker publications is available on the project web site [1]), the contribution here is to offer a comprehensive overview focusing on the motivation and rationale for the architectural decisions taken in the design of the machine.
Fig. 1. SDRAM stitch-bonded to the underlying SpiNNaker die. 3D packaging by UNISEM (Europe) Ltd.

S.B. Furber, D.R. Lester, L.A. Plana, J.D. Garside, E. Painkras and S. Temple are with the School of Computer Science, the University of Manchester, UK.
A.D. Brown is with Electronics and Computer Science, the University of Southampton, UK.
Manuscript received 14th January 2012.

2 HIGH-LEVEL PROJECT GOALS AND BACKGROUND
Multi-core processors are now clearly established as the way forward on the desktop, and highly-parallel systems have been the norm for high-performance computing for some while. In a surprisingly short space of time, industry has abandoned the exploitation of Moore's Law through ever more complex uniprocessors, and is embracing the 'new' Moore's Law: the number of processor cores on a chip will double roughly every 18 months. If projected over the next 25 years this leads inevitably to the landmark of a million-core processor system.
Much work is required to understand how to optimize the scheduling of workloads on such machines, but the nature of this task is changing: in the past, a large application was distributed 'evenly' over a few processors and much effort went into scheduling to keep all of the processor resources busy; today, the nature of the cost function is different: processing is effectively a free resource. Although the automatic parallelization of general-purpose codes remains a 'holy grail' of computer science, biological systems achieve much higher levels of parallelism, and we turn for inspiration to connectivity patterns and computational models based on our (extremely limited) understanding of the brain.
This biological inspiration draws us to two parallel, synergistic directions of enquiry [4]; significant progress in either direction will represent a major scientific breakthrough:
- How can massively-parallel computing resources accelerate our understanding of brain function?
- How can our growing understanding of brain function point the way to more efficient, parallel, fault-tolerant computation?
We start from the following question: what will happen when processors become so cheap that there is, in effect, an unlimited supply of them? The goal is now to get the job done as quickly and/or energy-efficiently as possible, and as many processors can be brought into play as is useful; this may well result in a significant number of processors doing identical calculations, or indeed nothing at all: they are a free resource.
2.1 The mammalian nervous system
The mammalian nervous system is, by any metric, one of the most remarkable, effective and efficient structures occurring in nature. The human brain exhibits massive parallelism (10^11 neurons) and massive connectivity (10^15 synapses). It consumes around 25W, and is composed of very low-performance components (neurons 'behave' at up to around 100Hz; the biological interconnect propagates information at speeds of a few m/s). It is massively tolerant of component-level failure: typically a human will lose neurons at a rate of about one per second throughout their adult life [5].
For a computer engineer, the similarities between the nervous system and a digital system are overwhelming. The principal component of the nervous system, the neuron [6], is a unidirectional device, connected to its peers via a single output, the axon. Near its terminal the axon branches and forms connections (synapses) with the inputs of its fellow neurons. The input structure of a neuron is termed the dendritic tree (see Fig. 2). Specialised neurons interface to muscles (and drive the system 'actuators'), and others to various sensors.
2.2 Spiking communication
Most biological neurons communicate predominantly via an electrochemical impulse known as an action potential [6]. This is a complex, propagating electrochemical pulse, supported mainly by transient sodium, potassium, chloride and electron fluxes, and perturbations of the electrochemical impedance to these species in the axon cell walls. To a zeroth approximation, these impulses can be viewed as spikes. The size and shape of the spike is largely invariant (and, indeed, probably irrelevant), being determined by local instabilities in the cell membrane current balance, so a spike can be viewed as a unit impulse that conveys information solely in the time at which it occurs. It costs the axon energy to transmit an event, but this is provided by a kind of electrochemical 'gain' distributed along the length of the fibre: the net effect is that, again to a zeroth approximation, the axon can be viewed as a lossless, dispersion-free transmission line, although it has to have a 'rest' just after a pulse has gone by to 'charge itself up' again.
2.3 Point neuron model
SpiNNaker is optimized for what is commonly known as the 'point neuron model' [4], where the details of the dendritic structure of the neuron are ignored and all inputs are effectively applied directly to the soma (the 'body' of the neuron). The inputs arrive in the correct temporal order, more or less, but there is no attempt to model the geometry of the dendritic tree. The abstract synaptic inputs are summed to form a net soma input that drives a system of simple differential equations that compute when an output spike should be issued.

Fig. 2. A biological neuron.
Fig. 3. The corresponding point neuron model.
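To make the point-neuron abstraction concrete, the following minimal sketch (Python, purely illustrative; the leaky integrate-and-fire equations and parameter values are assumptions, not the specific models shipped with SpiNNaker) shows summed synaptic input driving a simple membrane equation that decides when a spike is issued.

def lif_step(v, i_syn, dt=1.0, tau_m=20.0, v_rest=-65.0,
             v_thresh=-50.0, v_reset=-65.0, r_m=1.0):
    # Advance the membrane potential v (mV) by one timestep dt (ms);
    # i_syn is the summed synaptic input for this timestep.
    v = v + (-(v - v_rest) + r_m * i_syn) * (dt / tau_m)
    if v >= v_thresh:            # threshold crossed: emit a spike and reset
        return v_reset, True
    return v, False

def soma_input(weights, spiked_ids):
    # Point neuron model: all inputs are applied directly to the soma, so
    # the net input is just the sum of the weights of the active synapses.
    return sum(weights.get(n, 0.0) for n in spiked_ids)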
2.4 Synapses
A synapse is the 'component' whereby a spike from one neuron couples into the input of another neuron. A spike has unit impulse, but the synapse has a variable efficiency which is often represented by a numerical 'weight' [4]. If the weight is positive, the synapse is excitatory. If the weight is negative, the synapse is inhibitory.
The modeling abstraction is summarized in Fig. 3. In the jargon of electronic circuits, a neural circuit is represented by a devices-on-devices graph. Biology, as one might expect, is vastly more complex than this extreme abstraction. An unresolved issue is how much of the complexity is biological artifact, and how much is necessary for the information processing required to support a viable organism? The performance of electronic circuits is ultimately dictated by the speed and efficiency with which the flow of electrons through silicon can be choreographed by the designer, and there are physical limits. In biology, the information carriers are more diverse (ionic species) and they are controlled by an electrochemical field gradient. Ions are necessarily big, electrochemical fields necessarily small. Nature compensates by utilizing massive parallelism, but there will always be huge functional compromises. It is interesting to note that almost every creature on the planet today utilizes broadly the same structure for its controlling neural system. Comprehensive descriptions of the many types of real neurons and synapses are available elsewhere [6], [7].
2.5 Address Event Representation
The central idea of the standard SpiNNaker execution model is that of Address Event Representation (AER) [8], [9]. The underlying principle of AER, which is well-established in the neuromorphic community, is that when a neuron fires the spike is a pure asynchronous 'event'. All of the information is conveyed solely in the time of the spike and the identity of the neuron that emitted the spike. In a real-time system, time models itself, so in an AER system the identity (the 'address') of a neuron that spikes is simply broadcast, at the time that it spikes, to all neurons to which the spiking neuron connects.

In SpiNNaker, AER is implemented using packet-switched communication and multicast routing. Although the communication system introduces some temporal latency, provided this is small compared with biological time constants (which in practice means provided it is well under 1ms) then the error introduced by this latency is negligible (when modeling biological neural systems).
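A toy software analogue of AER multicast may help fix the idea (hypothetical Python; SpiNNaker's router is hardware and its tables are not organised like this): the only thing transmitted is the identity of the neuron that fired, and a routing table fans that key out to every destination that subscribes to it.

# A spike is just the source neuron's 32-bit ID; its timing is implicit,
# because in a real-time system "time models itself".
routing_table = {
    0x00000101: ["core_A", "core_B"],   # neuron 0x101 projects to two cores
    0x00000102: ["core_B"],
}

def emit_spike(neuron_id, deliver):
    # Broadcast the bare ID to every subscribed destination.
    for destination in routing_table.get(neuron_id, []):
        deliver(destination, neuron_id)

emit_spike(0x00000101, lambda dst, key: print(dst, hex(key)))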
2.6 Topological virtualization
Biological neural systems develop and operate in three dimensions, and both their topologies and geometries are constrained by their physical structures. SpiNNaker employs a two-dimensional physical communication structure, but this in no way limits its capacity to model three- (or higher-) dimensional networks. Because electronic communication is effectively instantaneous on biological time-scales, every neuron in a SpiNNaker system can be connected to any other neuron with a time delay that equates to adjacency in the biological three-dimensional space. Thus the mapping of neurons from the biological 3-D space into the SpiNNaker 2-D network of processors can be arbitrary: any neuron can be mapped to any processor. In practice, the SpiNNaker model will be more efficient if the mapping is chosen carefully, and this, in turn, means mapping physically close neurons onto physically close processors, but this is only a matter of efficiency and is in no way fundamentally constrained by the SpiNNaker implementation.
2.7 Time models itself
Biological systems have no central synchronising clock. Spikes are launched, spikes propagate, spikes arrive (usually), target neurons react. In a conventional electronic synchronous system, data is expected to be at the right place at the right time. If it isn't, the system is broken. In an asynchronous electronic system, data arrives, is processed and passed on, and a non-trivial choreography of request and acknowledge signals ensures that the integrity of the dataflow is maintained. In biology, data is transmitted in the hope that most of it will get to the right place in a timely but strictly undefined manner. Strangely, it is clear by inspection that it is possible to create hugely complex systems (mammals) operating successfully on this principle.

In SpiNNaker, cores react to packets, process packets, and optionally emit further packets. These are transmitted to their target by the routing subsystem, to the best of its ability. If the routing fabric becomes congested (an unpredictable function of the workload), packets will, in the first instance, be re-routed (causing them to arrive late) or even dropped (if there is no space to hold them). A design axiom of SpiNNaker is that nothing can ever prevent a packet from being launched. A consequence of the effects described above is that not only is the arrival time of the packets non-deterministic, the packet ordering is non-transitive.
A single SpiNNaker core is a single ARM9 processor. This is deterministic and is expected to multiplex the behaviour of around 1,000 neurons. The nodes, each containing 18 such cores, are equipped with six bi-directional fast links, and embedded in a communication mesh (see the next section) which intelligently redirects and duplicates packets as necessary. The speed at which packets are transmitted over the network is about 0.2 microseconds per node hop, all of which means that we can reasonably expect the neuron models to react to stimuli on a wall-clock timescale of milliseconds, just like biology.
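A rough back-of-the-envelope check (the square 240 x 240 toroidal layout and the Manhattan-style distance below are assumptions used only for illustration) confirms that even a worst-case path across the machine is far shorter than the ~1ms biological time constants.

nodes_per_side = 240      # assumed: 57,600 nodes laid out as a 240 x 240 torus
hop_time_us = 0.2         # ~0.2 microseconds per node hop

# On a torus, no destination is further than half the side length away in
# each dimension (the diagonal links of the triangular lattice only help).
worst_case_hops = 2 * (nodes_per_side // 2)
print(worst_case_hops * hop_time_us)    # 48.0 microseconds, well under 1 ms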
3 THE TECHNOLOGY LANDSCAPE
There are other approaches to brain modelling with objectives broadly similar to, though methods rather different from, the work described here.

3.1 BlueBrain
The Blue Brain project at EPFL [10] is bringing together wet neuroscience with high-performance computing to deliver high-fidelity computer models of biological neural systems. The computing resource available to the project is an IBM Blue Gene supercomputer [11] with very sophisticated visualisation facilities.
3.2 SyNAPSE
An IBM project, funded under the DARPA SyNAPSE programme [12], claims the successful modelling of a neural network on the scale of a cat cortex (which is around a billion neurons with 10^13 synapses).
3.3 Izhikevich
Eugene Izhikevich, at the Neuroscience Research Institute in San Diego, developed a 100 billion neuron model based on the mammalian thalamo-cortical system [13], [14]. One second of simulation took 50 days on a 27-processor Beowulf cluster.
3.4 Issues
These major projects demonstrate the debate (as yet unresolved within the brain modelling research community): to what extent are the finer details of biological neurons essential to the accurate modelling of the information processing capabilities of the brain, and to what extent can they be ignored as artifacts resulting from the evolutionary development of the biological neuron and its need to grow and find energy?

The SpiNNaker architecture is biased towards the simpler side of this debate: the machine is optimised for simple point neuron models, and it is capable of modelling very complex networks of these simple models.
The principal differentiator of the SpiNNaker project from other large-scale neural models is our objective to run in biological real time. None of the above systems is close to this goal, but we believe this to be essential if the neural experiments are to benefit from 'embodiment' by integration with robotic systems.

Other approaches to large-scale neural modelling are, of course, possible, for example using GPGPUs or FPGAs. It is difficult with such approaches to achieve the balance of computation, memory hierarchy and communication that SpiNNaker achieves, though of course they do avoid the high development cost of the bespoke chip approach.
4 ARCHITECTURE OVERVIEW
4.1 Overview
A block diagram of a single SpiNNaker node is shown in Fig. 4. The six communications links are used to connect the nodes in a triangular lattice; this lattice is then folded onto the surface of a toroid, as in Fig. 5. Other tilings are obviously possible; this design decision was guided by the pragmatics of assembling the system onto a set of two-dimensional printed circuit boards.
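One way to realise a six-neighbour triangular lattice on a 2-D torus is to add two diagonal links to an ordinary square grid; the sketch below uses an assumed coordinate convention purely for illustration and is not a description of the actual board wiring.

def neighbours(x, y, n):
    # Six links per node: the four square-grid neighbours plus the
    # (+1,+1) and (-1,-1) diagonals, wrapped toroidally modulo n.
    deltas = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, 1), (-1, -1)]
    return [((x + dx) % n, (y + dy) % n) for dx, dy in deltas]

print(neighbours(0, 0, 240))   # e.g. [(1, 0), (239, 0), (0, 1), ...]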
Fig. 4. A SpiNNaker node.
Fig. 6. The SpiNNaker die.

Fig. 6 depicts the individual SpiNNaker die. Each chip contains 18 identical processing subsystems (ARM cores). The die is fabricated by UMC on a 130nm CMOS process, and was designed using Synopsys, Inc., synthesis tools for the clocked subsystems and Silistix Ltd tools and libraries for the self-timed on-chip and inter-chip networks.

At start-up, following self-test, one of the processors is elected to a special role as Monitor Processor (achieved by a deliberate hardware race) and thereafter performs system management tasks. The other processors are available for application processing; normally 16 will be used to support the application and one is reserved as a spare for fault-tolerance and manufacturing yield-enhancement purposes.
The router is responsible for routing neural event packets both between the on-chip processors and from and to other SpiNNaker nodes. The Tx and Rx interface components (Fig. 4) are used to extend the on-chip Communications NoC (Network-on-Chip) to other SpiNNaker chips. Inputs from the various on- and off-chip sources are assembled into a single serial stream which is then passed to the router.

Various resources are accessible from the processor systems via the System NoC. Each of the processors has access to the shared off-die SDRAM, and various system components also connect through the System NoC in order that, whichever processor is the monitor, it will have access to these components.
4.2 Quantitative drivers
The SpiNNaker architecture is driven by the quantitative characteristics of the biological neural systems it is designed to model. The human brain comprises in the region of 10^11 neurons; the objective of the SpiNNaker work is to model 1% of this scale, which amounts to a billion neurons. This corresponds approximately to 10% of the human cortex, or ten complete mouse brains. Each neuron in the brain connects to thousands of other neurons. The mean firing rate of neurons is below 10 Hz, with the peak rate being 100s of Hz. These numerical points of reference can be summarized in the following deductions (the arithmetic is reproduced in the short sketch after the list):
- 10^9 neurons, mean fan in/out 10^3 => 10^12 synapses.
- 10^12 synapses, ~4 bytes/synapse => 4x10^6 Mbytes.
- 10^12 synapses switching at ~10Hz => 10^13 connections/s.
- 10^13 conn/s, 20 instr/conn => 2x10^8 MIPS.
- 2x10^8 MIPS, ~200MHz ARM => 10^6 ARMs.
So 10^9 neurons need 10^6 ARMs, whence:
- 1 ARM at ~200MHz => 10^3 neurons.
- 1 node: 16 ARM968 + 64MB => 1.6x10^4 neurons.
- 6x10^4 nodes, 1.6x10^4 neur/node => 10^9 neurons.
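The following short script simply reproduces the arithmetic above; the 20 instructions per connection and the ~200MHz, roughly one-instruction-per-cycle ARM are the planning figures used in the deductions, not measured values.

neurons = 1e9               # target: 10^9 neurons
fan_in = 1e3                # mean fan in/out per neuron
bytes_per_synapse = 4
mean_rate_hz = 10
instr_per_conn = 20
arm_mips = 200              # ~200 MHz ARM, ~1 instruction per cycle assumed

synapses = neurons * fan_in                                 # 1e12 synapses
synapse_storage_mbytes = synapses * bytes_per_synapse / 1e6 # 4e6 Mbytes
conn_per_s = synapses * mean_rate_hz                        # 1e13 connections/s
total_mips = conn_per_s * instr_per_conn / 1e6              # 2e8 MIPS
arms_needed = total_mips / arm_mips                         # 1e6 ARM cores
print(synapse_storage_mbytes, arms_needed, neurons / arms_needed)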
The above numbers all assume each neuron has 1,000 inputs. In biology, this number varies from 1 to of the order of 10^5, and it is probably most useful to think of each ARM being able to model about 1M synapses, so it can model 100 neurons each with 10,000 inputs, and so on.
The system will be inefficient unless there is some commonality across the inputs to the set of neurons modeled on a processor, so that each input event typically connects to tens or hundreds of neurons modeled by a processor. In biology, connections tend to be sparse, so, for example, a processor could model 1,000 neurons each of which connects to a random 10% of the 10^4 inputs that are routed to the processor. The standard model assumes sparse connectivity.
4.3 Routing
With a billion neurons, a 32-bit address is (more than) sufficient. The AER packets incur a small overhead for control purposes, which amounts to one byte in the current design. This is generally transparent to the software running on the ARM cores and exists only while the packet is in transit. Since spike events are unit impulses, all the packet need carry is the control byte and the 32-bit ID of the neuron that fired. SpiNNaker packet formats support an optional 32-bit data payload in addition, but that is not used for neural system modeling directly. The payload will be used for other applications and for debug and diagnostics. Thus the communication traffic generated by one node is:

1.6x10^4 neurons x 10Hz x 5 bytes => 0.8Mbyte/s.
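The 40-bit spike packet is just the control byte plus the 32-bit AER key; the packing below uses an assumed field layout for illustration (it is not the documented SpiNNaker bit-level encoding), followed by the per-node traffic figure.

def pack_spike(control, neuron_id):
    # Assumed layout: 8-bit control field in the top byte, 32-bit key below.
    return ((control & 0xFF) << 32) | (neuron_id & 0xFFFFFFFF)

packet = pack_spike(0x01, 0x00000101)
assert packet.bit_length() <= 40        # fits a 5-byte (40-bit) packet

neurons_per_node = 1.6e4                # 16 application cores x 1,000 neurons
print(neurons_per_node * 10 * 5)        # 10 Hz x 5 bytes = 800,000 bytes/s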
Each chip incorporates a router that implements AER-based routing of neural spike-event packets. The total traffic from neurons modeled by the processors on the same chip as the router averages 1.6x10^5 packets/s, which is undemanding, although the router also handles incoming and passing traffic:

4-bit symbols @ 60MHz/link => 6x10^6 pkts/s
6 incoming links => 3.6x10^7 pkts/s

So a router operating at 100MHz, processing one packet per clock cycle, can easily handle all local, incoming and passing traffic.
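The link and router headroom quoted above can be checked directly; the calculation assumes 40-bit packets carried as 4-bit symbols, one symbol per 60MHz link clock.

bits_per_packet = 40
bits_per_symbol = 4
link_symbol_rate = 60e6                              # 4-bit symbols at 60 MHz

pkts_per_link = link_symbol_rate / (bits_per_packet / bits_per_symbol)
incoming = 6 * pkts_per_link                         # six incoming links
router_capacity = 100e6                              # one packet per 100 MHz cycle
print(pkts_per_link, incoming, incoming < router_capacity)   # 6e6, 3.6e7, True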
4.4 Bisection bandwidth
If a 57K-node system is organized in such a way that all of the neurons in one half are connected to at least one neuron in the other half, the traffic across the border from one half to the other is 29K x 160K = 4.6G packets/s. The border is 480 nodes long (assuming a square layout, mapped to a toroid), so each node must carry 10M packets/s, which is well within the capacity of the router, and 960 links connect the two halves, each carrying 5M packets/s, which is again within a link's capability.
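The bisection figures can likewise be reproduced from the per-node packet rate of section 4.3, again assuming the square 240 x 240 toroidal layout.

nodes = 240 * 240                        # 57,600 nodes
pkts_per_node = 1.6e5                    # packets/s generated per node

cross_traffic = (nodes // 2) * pkts_per_node   # ~4.6e9 packets/s across the cut
border_nodes = 2 * 240                   # a torus cut crosses 480 border nodes
links_across = 2 * border_nodes          # 960 links span the two halves
print(cross_traffic / border_nodes)      # ~9.6e6 packets/s per border node
print(cross_traffic / links_across)      # ~4.8e6 packets/s per link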
5 SYSTEM COMPONENTS
The routing subsystem, which is a crucial component of
SpiNNaker, is described in section 7.
5.1 ARM968
The ARM subsystem [15] organisation is shown in Fig. 7.
The system is memory-mapped (see section 6), and the
map for the ARM968 spans a number of devices and
