IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 9, SEPTEMBER 2009
CAVIAR: A 45k Neuron, 5M Synapse, 12G
Connects/s AER Hardware Sensory–Processing–
Learning–Actuating System for High-Speed
Visual Object Recognition and Tracking
Rafael Serrano-Gotarredona, Matthias Oster, Patrick Lichtsteiner, Alejandro Linares-Barranco,
Rafael Paz-Vicente, Francisco Gómez-Rodríguez, Luis Camuñas-Mesa, Raphael Berner, Manuel Rivas-Pérez,
Tobi Delbrück, Shih-Chii Liu, Rodney Douglas, Philipp Häfliger, Gabriel Jiménez-Moreno, Anton Civit Ballcels,
Teresa Serrano-Gotarredona, Member, IEEE, Antonio J. Acosta-Jiménez, and Bernabé Linares-Barranco
Abstract—This paper describes CAVIAR, a massively par-
allel hardware implementation of a spike-based sensing–pro-
cessing–learning–actuating system inspired by the physiology of
the nervous system. CAVIAR uses the asynchronous address–event
representation (AER) communication framework and was de-
veloped in the context of a European Union funded project. It
has four custom mixed-signal AER chips, five custom digital
AER interface components, 45k neurons (spiking cells), up to
5M synapses, performs 12G synaptic operations per second, and
achieves millisecond object recognition and tracking latencies.
Manuscript received June 29, 2008; revised November 11, 2008 and April 06,
2009; accepted April 24, 2009. First published July 24, 2009; current version
published September 02, 2009. This work was supported by the European Com-
mission under Grant IST-2001-34124 (CAVIAR). The work of R. Serrano-Go-
tarredona was supported by the Spanish Ministry of Education and Science
under an FPU scholarship. The work of L. Camuñas-Mesa was supported by the
Spanish Ministry of Education and Science under an FPI scholarship. The work of
S.-C. Liu and T. Delbrück was supported by the Institute of Neuroinformatics
(INI), ETH Zürich/University of Zürich, Zürich, Switzerland and some fabrica-
tion costs were paid by Austria Research Corporation.
R. Serrano-Gotarredona was with the Consejo Superior de Investigaciones
Cientificas, Seville Microelectronics Institute, Seville 41012, Spain. He is now
with Austriamicrosystems, Valencia, Spain (e-mail: rserrano@imse.cnm.
es).
M. Oster was with the Institute of Neuroinformatics (INI), ETH Zürich/Uni-
versity of Zürich, Zürich CH-8057, Switzerland. He is now with Varian
Medical Systems, Baden CH-5405, Switzerland (e-mail: matthias.oster@gmail.
com).
P. Lichtsteiner was with the Institute of Neuroinformatics (INI), ETH Zürich/
University of Zürich, Zürich CH-8057, Switzerland. He is now with Espros
Photonics Corporation, Baar CH-6340, Switzerland (e-mail: patrick.licht-
steiner@espros.ch).
A. Linares-Barranco, R. Paz-Vicente, F. Gómez-Rodríguez, M. Rivas-Pérez,
G. Jiménez-Moreno, and A. Civit Ballcels are with the Computer Archi-
tecture and Technology Department, University of Seville, Seville 41012,
Spain (e-mail: alinares@atc.us.es; rpaz@atc.us.es; gomezroz@atc.us.es;
mrivas@us.es; gaji@atc.us.es; civit@atc.us.es).
L. Camuñas-Mesa, T. Serrano-Gotarredona, A. Acosta-Jiménez, and B.
Linares-Barranco are with the Consejo Superior de Investigaciones Ci-
entificas, Seville Microelectronics Institute, Seville 41092, Spain (e-mail:
luiscamu@imse.cnm.es; terese@imse.cnm.es; acojim@imse.cnm.es;
bernabe@imse.cnm.es).
R. Berner, T. Delbrück, S.-C. Liu, and R. Douglas are with the Insti-
tute of Neuroinformatics (INI), ETH Zürich/University of Zürich, Zürich
CH-8057, Switzerland (e-mail: bernerr@ee.ethz.ch; tobi@ini.phys.ethz.ch;
shih@ini.phys.ethz.ch; rjd@ini.phys.ethz.ch).
P. Häfliger is with the Department of Informatics, University of Oslo, Oslo NO-0316,
Norway (e-mail: hafliger@ifi.uio.no).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNN.2009.2023653
Index Terms—Address–event representation (AER), neuromor-
phic chips, neuromorphic systems, vision.
I. INTRODUCTION
Brains perform powerful and fast vision processing in a
way conceptually different from that of machine vision
systems. Machine vision systems process sequences of still
frames from a camera. For performing scale- and rotation-in-
variant 3-D object recognition, for example, sequences of
computationally demanding operations need to be performed
on each acquired frame. The computational power and speed
required for such tasks make it difficult to develop real-time
autonomous systems for such applications.
On the other hand, vision sensing and object recognition in
brains are performed without using the “frame” concept, at least
not in the usual sense of implying a fixed-rate sequence of still
images. Throughout this paper, we intentionally avoid the use
of the expression “image processing,” because in our hardware
technology, there never is an “image” or a “frame,” but rather a
continuous flow of visual information in the form of temporal
spikes.
The visual cortex is structured as a sequence of layers (8–10
layers in the human cortex [1], [13]), starting from the retina,
which does its own preprocessing in a more compact and analog
architecture. Although cortex has massive feedback and recur-
rent connections, it is known that a very fast and purely feed-
forward recognition path exists within the ventral stream of the
visual cortex [1], [2]. Here we exploited this feedforward path
concept to build a fast vision recognition system. A concep-
tual block diagram of such a cortically inspired feedforward
hierarchically structured autonomous system for sensing/pro-
cessing/decision–actuation can be seen in Fig. 1(a) [1]–[11].
The pattern of connectivity in cortex follows a basic structure:
each neuron in a layer connects to a “cluster of neurons” or “pro-
jective field” in the next layer [12], [13].
In most cases, these projective fields can be approximated by
computing 2-D convolutions. A single layer of a single con-
volution kernel can detect and localize a preprogrammed or
prelearned object, independent of its position. Using multiple
kernels of different sizes and rotations can make the compu-
tation scale and rotation invariant. Multilayered convolutional
networks are capable of complex object recognition [3]–[8].
Spiking neurons receive synaptic input from other cells in the
form of electrical spikes, and they autonomously decide when
to generate their own output spikes. Hardware that combines
spike-based multineuron modules to compute projective fields
can enable powerful and fast frame-free vision processing. If the
components generate short-latency, meaningful, nonredundant
spikes, then spike-based systems can efficiently compute “on-
demand” compared to conventional approaches. The processing
delay depends mainly on the number of layers, and not on the
complexity of objects and shapes to be recognized. Their latency
and throughput are not limited by a conventional sampling
rate.
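To make this event-driven style of computation concrete, the following minimal sketch (ours, for illustration only; the class name, threshold, and leak constant are not taken from any CAVIAR chip) shows an integrate-and-fire unit that does work only when an input spike arrives and emits an output spike only when its integrated state crosses a threshold:

```python
class IntegrateAndFireUnit:
    """Minimal integrate-and-fire element: integrates weighted input spikes
    and emits an output spike when the accumulated state crosses a threshold."""

    def __init__(self, threshold=1.0, leak_per_us=0.001):
        self.state = 0.0
        self.threshold = threshold
        self.leak_per_us = leak_per_us   # slow decay toward rest between events
        self.last_t_us = 0

    def on_input_spike(self, weight, t_us):
        # Decay the state for the time elapsed since the previous input event,
        # then integrate the new weighted contribution.
        dt = t_us - self.last_t_us
        self.last_t_us = t_us
        self.state = max(0.0, self.state - self.leak_per_us * dt) + weight
        if self.state >= self.threshold:
            self.state = 0.0     # reset after firing
            return True          # an output spike is emitted now
        return False             # otherwise the unit stays silent
```

Because nothing is evaluated between events, the cost of the computation scales with spike activity rather than with a fixed frame rate.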
In recent years, significant progress has been made towards
the understanding of the computational principles exploited by
visual cortex. Many artificial systems that implement bioin-
spired software models use biological-like (convolution-based)
processing that outperforms more conventionally engineered
machines [3]–[11], [14]–[17]. However, these systems gen-
erally run at extremely low speeds because the models are
implemented as software programs on conventional computers.
For real-time solutions, direct hardware implementations of
these models are required. However, hardware engineers face
a large hurdle when trying to mimic the bioinspired layered
structure and the massive connectivity within and between
layers. A growing number of research groups worldwide are
mapping some of these computational principles onto real-time
spiking hardware through the development and exploitation of
the so-called address–event representation (AER) technology.
In this paper, we report on the results of our European Union
consortium project “Convolution AER Vision Architecture for
Real-Time” (CAVIAR), in which the largest multichip, multilayer
AER real-time frame-free vision system built to date was developed.
The purpose of this paper is to introduce to various commu-
nities, including computational neuroscience and machine vi-
sion, the promising and effective AER hardware technology that
allows the construction of modular, multilayered, hierarchical,
and scalable (visual) sensory–processing–learning–actuating
systems. Throughout this paper, we will illustrate the power and
potential of the AER hardware technology through the demon-
strator assembled in the CAVIAR project.
The AER is a spike-based representation technique for
communicating asynchronous spikes between layers of neurons
in different chips. The spikes in AER are carried as addresses of
sending or receiving neurons on a digital bus. Time “represents
itself” as the asynchronous occurrence of the event. AER was
first proposed in 1991 by Mead’s Lab at California Institute
of Technology (Caltech, Pasadena) [24]–[28], and has been
used since then by a wide community of hardware engineers.
Unarbitrated, simpler event readout schemes have been used [29],
[30], and more elaborate and efficient arbitrated versions have
also been proposed, based on winner-take-all (WTA) [31],
or the use of arbiter trees [32], which have evolved to row
parallel [33] and burst-mode word-serial [34]–[36] readout
schemes by Boahen’s Lab. The AER has been used in image
and vision sensors, for simple light intensity to frequency
transformations [38], time-to-first-spike codings [40]–[42],
foveated sensors [43], [44], spatial contrast sensors [23],
[45], temporal intensity difference [39] and temporal contrast
sensors [19], [20], and motion sensing and computation systems
[46]–[50]. AER has also been used for auditory systems
[51]–[53], competition and WTA networks [54]–[56], and
even for systems distributed over wireless networks [57]. For
AER-based 2-D convolution, Vernier
et al. [58] and Choi et al.
[59] reported on 2-D convolution chips with hard-wired elliptic
or Gabor-shaped kernels for orientation extraction. AER has
made it feasible to emulate large scale neurocortical-like
multilayered realistic structures since the development of
scalable and reprogrammable kernel 2-D convolution chips,
either with some minor restrictions on symmetry [60], or
without any restrictions on shape or size [18]. Of great
importance for the spread and success of AER systems has also
been the availability of open-source reusable silicon IP [37], a
better understanding by the community of asynchronous logic
design, and the development of conventional synchronous
interfacing logic and computer interfaces [61]–[64].
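As a concrete (and purely illustrative) picture of the representation, an address–event is just the address of the emitting neuron placed on a shared digital bus at the instant it fires; the bit layout below is a hypothetical example, not the word format of any particular AER chip:

```python
from dataclasses import dataclass

@dataclass
class AddressEvent:
    x: int          # column address of the emitting pixel/neuron
    y: int          # row address
    polarity: int   # +1 (ON) or -1 (OFF); a single address bit in hardware
    t_us: int       # timestamp attached only when events are monitored/logged;
                    # on the asynchronous bus itself, "time represents itself"

def decode_word(word: int, t_us: int) -> AddressEvent:
    """Decode a hypothetical 15-bit AER word laid out as [y:7][x:7][sign:1]."""
    polarity = 1 if (word & 0x1) else -1
    x = (word >> 1) & 0x7F
    y = (word >> 8) & 0x7F
    return AddressEvent(x, y, polarity, t_us)
```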
In CAVIAR, an AER infrastructure was developed to support
a set of AER modules (chips and interfaces) [Fig. 1(b)] that are
connected in series and parallel to embody the abstract layered
architecture in Fig. 1(a). The following modules were devel-
oped: 1) a temporal contrast retina (motion sensing camera)
chip; 2) a programmable kernel 2-D convolution processing
chip; 3) a 2-D WTA object chip; 4) spatio–temporal processing
and learning chips; 5) AER remapping, splitting, and merging
field-programmable gate array (FPGA)-based modules; and
6) computer–AER interfacing FPGA modules for generating
and/or capturing AER. These modules were then used for
building a multilayer artificial vision demonstrator system for
detecting and tracking balls moving at high speeds.
The overall architecture of the CAVIAR vision system is il-
lustrated in Fig. 1(b) and in more detail in Fig. 13. Moving
objects in the field of view of the retina cause spikes. Each
spike from the retina causes a splat of each convolution chip’s
kernel onto its own integrator array. When the integrator array
pixels exceed positive or negative thresholds they in turn emit
spikes. In the CAVIAR system experiments, we generally used
circular kernels such as the ones in Fig. 3(c) and (d), which de-
tect circular objects of particular sizes. The resulting convolu-
tion spike outputs are noise filtered by the WTA object chip.
The WTA output spikes, whose addresses represent the loca-
tion of the “best” circular object, are fed into a configurable
delay line chip that spreads time into space. This spatial pat-
tern of temporal delayed spikes is then learned by the learning
chip. The WTA spikes also control a mechanical or electronic
tracking system that stabilizes the programmed object in the
field-of-view center.
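Viewed as software, this chain is a cascade of event-stream transformers. The toy generators below (names, thresholds, and window sizes are ours, and the stages are crude stand-ins for the chips rather than models of them) only illustrate how each stage consumes address–events from the previous one and emits fewer, more meaningful events to the next:

```python
from collections import Counter, defaultdict

def feature_stage(events, threshold=4):
    """Crude stand-in for a convolution chip: accumulate signed activity per
    address and emit an event when a local integrator crosses the threshold."""
    acc = defaultdict(int)
    for x, y, sign in events:
        acc[(x, y)] += sign
        if abs(acc[(x, y)]) >= threshold:
            yield x, y, (1 if acc[(x, y)] > 0 else -1)
            acc[(x, y)] = 0

def winner_stage(events, window=64):
    """Crude stand-in for the WTA object chip: every `window` input events,
    emit only the address that accumulated the most activity."""
    counts, n = Counter(), 0
    for x, y, _sign in events:
        counts[(x, y)] += 1
        n += 1
        if n == window:
            (wx, wy), _ = counts.most_common(1)[0]
            yield wx, wy, 1
            counts.clear()
            n = 0

def caviar_like_pipeline(retina_events):
    # retina -> feature extraction -> winner-take-all -> tracker/learning stages
    return winner_stage(feature_stage(retina_events))
```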
The rest of this paper is structured as follows. Section II
describes the temporal contrast retina, Section III the pro-
grammable kernel 2-D convolution chip, Section IV the 2-D
WTA chip, Section V the learning chips, Section VI the
different interfaces, and finally, Section VII describes the com-
plete CAVIAR vision system and shows experimental results.
Section VIII concludes the paper and gives future outlooks.
Fig. 1. CAVIAR system overview. (a) A bioinspired system architecture performing feedforward sensing + processing + actuation tends to have the following
conceptual hierarchical structure: 1) a sensing layer; 2) a set of low-level processing layers, usually implemented through projection fields (convolutions), for feature
extraction and combination; 3) a set of high-level processing layers that operate on “abstractions” and progressively compress information through, for example,
dimension reduction, competition, and learning; and 4) once a reduced set of signals/decisions is obtained, they are conveyed to (usually mechanical) actuators. (b)
The CAVIAR system components and multilayer architecture; an example output of each component is shown in response to the rotating stimulus, and the basic
functionality is illustrated below each chip component.
TABLE I
TEMPORAL CONTRAST VISION SENSOR PROPERTIES ADAPTED FROM [20]

II. AER TEMPORAL CONTRAST RETINA
The temporal contrast silicon retina is an asynchronous vi-
sion sensor that emits spike address–events (AEs) (Fig. 2 and
Table I) [19], [20]. Each AE from the chip is the address of a
pixel and signifies that the log intensity at that pixel has changed
by a fixed threshold amount since the last event from that pixel.
We typically set this global event threshold to about 15% contrast.
In addition, one bit of the address encodes the sign of the change
(ON or OFF). This representation of “change in log intensity” gen-
erally encodes scene reflectance change. The compressive log-
arithmic transformation in each pixel allows for wide dynamic
range operation (120 dB, compared with, for example, 60 dB
for a high-quality traditional image sensor).

TABLE II
CONVOLUTION CHIP PROPERTIES

This wide dynamic range means that the sensor can be used with
uncontrolled natural lighting. The asynchronous response property also means
that the events have a latency down to 15 μs with bright lighting
and typically about 1 ms under indoor illumination, resulting in
an effective frame rate of typically several kilohertz. The tem-
poral redundancy reduction greatly reduces the output data rate
for scenes in which most pixels are not changing. The design of
the pixel also allows for unprecedented uniformity of response:
the mismatch between pixel contrast thresholds is 2.1% con-
trast. The event threshold can be set down to 10% contrast, al-
lowing the device to sense natural scenes rather than only artifi-
cial high-contrast stimuli. The vision sensor also has integrated
digitally controlled biases that greatly reduce chip-to-chip vari-
ation in parameters and temperature sensitivity [21].
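A rough software analogy of this pixel behavior is sketched below (for illustration only: the real sensor is asynchronous and frame-free, and intensity frames are used here merely to drive the simulation):

```python
import numpy as np

def temporal_contrast_events(frames, threshold=0.15):
    """Emulate temporal-contrast pixels on a stack of intensity frames: emit
    (t, x, y, polarity) whenever the log intensity at a pixel has moved by
    more than `threshold` (about 15% contrast) since that pixel's last event."""
    frames = np.asarray(frames, dtype=float)
    log_ref = np.log(frames[0] + 1e-6)        # per-pixel reference at last event
    events = []
    for t in range(1, len(frames)):
        log_i = np.log(frames[t] + 1e-6)
        diff = log_i - log_ref
        ys, xs = np.nonzero(np.abs(diff) >= threshold)
        for y, x in zip(ys, xs):
            polarity = 1 if diff[y, x] > 0 else -1   # ON or OFF event
            events.append((t, int(x), int(y), polarity))
            log_ref[y, x] = log_i[y, x]              # reset reference at the event
    return events
```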
Fig. 2. Temporal contrast silicon retina vision sensor. (a) Silicon retina USB2
system. The vision sensor with its lens and USB2.0 interface. (b) Chip micro-
graph. A die photograph labeled with the row and column from a pixel that
generates an event with (x, y, type) output, where type is ON or OFF. (c) Simpli-
fied pixel core schematic that responds with events to fixed-size changes of log
intensity. (d) Principle of operation. How the ON and OFF events are internally
represented and output in response to an input signal. Figure adapted from [20].
III. AER PROGRAMMABLE KERNEL 2-D CONVOLUTION CHIP
The convolution chip is an AER transceiver with an array
of event integrators, already reported elsewhere [18]. Table II
summarizes the chip performance figures and specifications. For
each incoming event, integrators within a projection field around
the addressed pixel compute a weighted event integration. The
weight of this integration is defined by the convolution kernel
[18], [60]. Each incoming event computation splats the kernel
onto the integrators.
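In software terms, this per-event “splat” could be sketched as follows (array size, threshold, and names are ours; the chip performs the same operation with analog current integrators and a forgetting mechanism rather than floating-point state):

```python
import numpy as np

class EventConvolver:
    """Toy event-driven convolver: for each input event, add the kernel into
    an integrator array around the event address and emit signed output
    events from pixels whose integrators cross a threshold."""

    def __init__(self, kernel, shape=(32, 32), threshold=1.0):
        self.kernel = np.asarray(kernel, dtype=float)
        self.state = np.zeros(shape)
        self.threshold = threshold

    def process_event(self, x, y, sign=1):
        kh, kw = self.kernel.shape
        cy, cx = kh // 2, kw // 2
        h, w = self.state.shape
        # Clip the projection field at the array borders.
        x0, x1 = max(0, x - cx), min(w, x - cx + kw)
        y0, y1 = max(0, y - cy), min(h, y - cy + kh)
        ky0, kx0 = y0 - (y - cy), x0 - (x - cx)
        self.state[y0:y1, x0:x1] += sign * self.kernel[ky0:ky0 + (y1 - y0),
                                                       kx0:kx0 + (x1 - x0)]
        # Threshold-and-reset: emit output events where |state| >= threshold.
        out = []
        for yy, xx in zip(*np.nonzero(np.abs(self.state) >= self.threshold)):
            out.append((int(xx), int(yy), 1 if self.state[yy, xx] > 0 else -1))
            self.state[yy, xx] = 0.0
        return out
```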
Fig. 3(a) shows the block diagram of the convolution chip.
The main parts of the chip are as follows.
1) An array of 32 × 32 pixels, where each pixel contains a
binary weighted signed current source and an integrate-
and-fire signed integrator. The current source is controlled
by the kernel weight read from the RAM and stored in a
dynamic register for each input event.
2) A 32 × 32 kernel static RAM, where each kernel weight
value is stored with signed 4-b resolution.
3) A digital controller that handles the whole sequence of operations.
4) For each incoming event, a monostable generates a pulse of
fixed duration that enables the integration simultaneously
in all active pixels.
Fig. 3. Convolution chip. (a) Architecture of the convolution chip. (b) Mi-
crophotograph of fabricated chip. (c) Kernel for detecting circumferences of
radius close to four pixels and (d) close to nine pixels.
5) An x-neighborhood block that performs a displacement of
the kernel in the x direction.
6) Arbitration and decoding circuitry that generates the output
AEs using Boahen’s burst-mode word-parallel AER [33].
The chip operation sequence is as follows.
1) The digital control block stores the address of an
incoming event and acknowledges reception of the event
through the request and acknowledge handshake signals.
2) The control block computes the x-displacement that has to
be applied to the kernel and the limits in the y addresses
where the kernel has to be copied.
3) The control block copies the kernel from the kernel RAM
row by row to the corresponding rows in the pixel array.
4) The control block activates the generation of a monostable
pulse. This way, in each pixel a current weighted by the
corresponding kernel weight is integrated during a fixed
time interval.
5) Kernel weights in the pixels are erased.
A pixel (Fig. 4) contains two digitally controlled pulsing
current sources (pulsing CDAC) which provide a current pulse
of fixed width [equal to the width of the signal “event pulse”
coming from the monostable in Fig. 3(a)] and amplitude de-
pendent on the kernel weight stored in the dynamic register
Fig. 4. Simplified block diagram of convolution chip pixel.
“weight” in Fig. 4. Depending on the combination of kernel
weight sign and input event sign, the current pulse has to
be positive (provided by CDACp) or negative (provided by
CDACn). The current of each CDAC is proportional to a locally
trimmable calibration current (one per CDAC) to compensate for interpixel
mismatches. Calibration values are loaded from an external
source over a serial interface. Current pulses are integrated
onto a capacitor, whose voltage is monitored by two compara-
tors. If an upper (lower) threshold is reached, the
pixel sends a positive (negative) output event, and resets the
capacitor voltage to the intermediate resting level. This event is
arbitrated and decoded in the periphery of the chip. In parallel,
all pixels receive a periodic signal “forgetting pulse” which
discharges (charges) the capacitor voltage to the intermediate
resting voltage if CapSign is high (low), by generating fixed
amplitude current pulses at CDACn (CDACp).
Both the size of the pixel array and the size of the kernel
storage RAM are 32 × 32. The input address space can be up to
128 × 128 (14 b), and the chip is programmed to receive input
from a part of this space. Fig. 3(b) shows the microphotograph
of the fabricated chip. AER events can be fed-in up to a peak
rate of 50 million events per second (Meps). The chip can gen-
erate output events at a maximum rate of 25 Meps. Input event
throughput depends on kernel size and internal clock frequency.
The event cycle time is approximately (n + 2) · T_clk, where n
is the number of programmed kernel lines (from 1 to 32) and
T_clk is the internal clock period. The internal clock is tunable
and could be set up to 200 MHz (T_clk = 5 ns) before ob-
serving operation degradation, although in our setup we gener-
ally used 100 MHz. Maximum sustained input event throughput
can, therefore, vary between 33 Meps for a one-line kernel down
to 3 Meps for a full 32-line kernel. Further details are given in
Table II and elsewhere [18].
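As a quick sanity check of these figures (a sketch only; the two-cycle overhead term is inferred from the quoted throughput numbers, not taken from the chip documentation):

```python
def sustained_throughput_meps(kernel_lines, f_clk_hz=100e6, overhead_cycles=2):
    """Sustained input event rate, assuming an event cycle of
    (kernel_lines + overhead_cycles) internal clock periods."""
    t_event_s = (kernel_lines + overhead_cycles) / f_clk_hz
    return 1.0 / t_event_s / 1e6

# With a 100-MHz internal clock this reproduces the quoted figures:
print(sustained_throughput_meps(1))    # ~33 Meps for a one-line kernel
print(sustained_throughput_meps(32))   # ~3 Meps for a full 32-line kernel
```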
Each convolution chip can process an input space of up to
128 × 128 pixels, but can produce outputs for only 32 × 32
pixels. This is useful for multichip assembly. For example,
Fig. 5 illustrates how an array of 4 × 4 chips, each with 32 × 32
pixels, could be used to process a visual input of 128 × 128
pixels. Each chip stores into internal registers its own limit
coordinates within the total 128 × 128 pixel space. All chips
share the same input AER bus (this is done in practice using
AER splitters). Maximum kernel size can be 31 × 31 (see Fig. 5),
which means that pixels up to 30 positions apart from a chip
might need to be processed by it.

Fig. 5. Multichip assembly of convolution chips. All chips “see” the same input
space (up to 128 × 128 pixels), but each chip can process only 32 × 32 pixels.
Each chip stores its limit coordinates (its x and y bounds). In general, when an
event is received at coordinate (x, y), up to four chips process it.

For example, in Fig. 5, we can see how an event at one address is processed
simultaneously by four neighboring chips. The output events produced by all chips are
merged on a single AER bus by an external merger.
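In software terms, the address decoding performed by each chip could be sketched as follows (the 4 × 4 tiling is from Fig. 5, but the register handling and the 15-pixel margin of a centered 31 × 31 kernel are our assumptions for illustration; the exact overlap margin depends on how the kernel is anchored):

```python
def chips_processing_event(x, y, tile=32, grid=4, margin=15):
    """Return the (row, col) indices of the chips, in a grid x grid tiling of
    the 128 x 128 input space, whose padded 32 x 32 windows contain (x, y)."""
    hits = []
    for row in range(grid):
        for col in range(grid):
            x_min, x_max = col * tile, (col + 1) * tile - 1   # chip limit registers
            y_min, y_max = row * tile, (row + 1) * tile - 1
            if (x_min - margin <= x <= x_max + margin and
                    y_min - margin <= y <= y_max + margin):
                hits.append((row, col))
    return hits

# An event well inside a tile is handled by one chip; an event near a corner
# shared by four tiles is handled by all four neighbors.
print(len(chips_processing_event(5, 5)))     # -> 1
print(len(chips_processing_event(31, 31)))   # -> 4
```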
For the vision system described in Section VII, we assembled
four convolution chips on a single printed circuit board (PCB).
The PCB has one AER input bus connector and one AER output
bus connector. The input bus goes to a 1-to-4 splitter, imple-
mented on a complex programmable logic device (CPLD) chip,
that feeds the input AER ports of the four chips. The chips’
output AER ports connect to a merger circuit, implemented on
another CPLD circuit, whose output goes to the PCB output
AER connector. The four chips can be programmed to “see” the
same input space and each compute a different 2-D filter (con-
volution) on the same 32 × 32 pixel space, or the four chips can
be programmed to process the same kernel while operating on
an expanded 64 × 64 pixel space. In Section VII, we used this
latter option, so that the PCB would work as one single convolu-
tion processor of array size 64 × 64 and maximum kernel size
of 31 × 31.
IV. AER 2-D WTA CHIP
The AER WTA transceiver chip [66]–[70] is designed to si-
multaneously determine the “what” and “where” of the convo-
lution chip outputs. The “whats” are the best matched features
in the case of multiple convolution chips, each with a different
kernel, and the “wheres” are the spatial locations of these fea-
tures (Fig. 6 and Table III). The WTA chip implements this
feature competition using four populations of spiking neurons
which receive the outputs of individual convolution chips and
computes the winner (strongest input) in two dimensions. First,
it performs a WTA operation on the inputs from a feature map
to determine the strongest input (which codes the location of the
strongest feature in the feature map), and second, it performs a
second level of WTA operation on the sparse feature maps to
determine the strongest feature out of all preprogrammed fea-
tures. The parameters of the network are configured so that it
implements a hard WTA with only one neuron active at a time.
The spike rate of the winning neuron is proportional to its input