A binary Self-Organizing Map and its FPGA implementation

doi:10.1109/IJCNN.2009.5179001

Proceedings of International Joint Conference on Neural Networks, Atlanta, Georgia, USA, June 14-19, 2009


Kofi Appiah, Andrew Hunter, Hongying Meng and Shigang Yue

Mervyn Hobden, Nigel Priestley, Peter Hobden and Cy Pettit

$$D_j = \sum_{i=0}^{N-1} (x_i - w_{ji})^2. \qquad (1)$$

The vector components of the winning node $w_k$ with minimum distance $D_k$ are then updated as follows

$$\Delta w_k = \eta\,(x - w_k) \qquad (2)$$

where $\eta$ is the learning rate. The topological ordering property is imposed by also updating weight vectors of nodes in the neighbourhood of the winning node. This can be achieved by the following learning rule

$$\Delta w_j = \eta\,N_j\,(x - w_j) \qquad (3)$$

where $N_j$ is a neighbourhood function (defining the region around $w_k$) based on the topological displacement of the neighbouring neuron from the winning neuron. The size of $N_j$ decreases as training progresses.
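As a minimal sketch of the conventional training step described by equations 1 to 3, the Python below finds the winner by squared Euclidean distance and then applies the update. The function name `som_step`, the 1-D map layout and the step-shaped neighbourhood (1 inside `radius`, 0 outside) are illustrative assumptions, not details taken from [1].

```python
def som_step(weights, x, eta, radius):
    """One training step of a 1-D SOM held as a list of weight lists."""
    # Equation 1: squared Euclidean distance from x to every neuron.
    dists = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in weights]
    k = dists.index(min(dists))  # index of the winning neuron
    for j, w in enumerate(weights):
        # Assumed neighbourhood function: 1 within `radius` of the winner, else 0.
        n_j = 1.0 if abs(j - k) <= radius else 0.0
        # Equations 2 and 3: move the weight towards the input.
        weights[j] = [wi + eta * n_j * (xi - wi) for xi, wi in zip(x, w)]
    return k


weights = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
winner = som_step(weights, x=[1.0, 2.0], eta=0.5, radius=0)
print(winner, weights[winner])
```

With `radius=0` only the winner moves, which corresponds to equation 2; a larger radius also moves the neighbours, as in equation 3.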

In the vast majority of implementations, the SOM input data and neurons are represented by real numbers, making it difficult to implement on a hardware architecture like the Field Programmable Gate Array (FPGA). However, in many applications the data is either presented as a binary string, or may be conveniently recoded as such (a "binary signature"). For example, in image processing applications a bank of Haar filters produces a long binary signature. In this paper we present a new learning algorithm which takes binary inputs and maintains tri-state weights (neurons) in the SOM. We also present the FPGA implementation of this binary Self Organizing Map (bSOM). The bSOM is designed for efficient hardware implementation, having both greatly reduced circuit size compared to a real-valued SOM, and exceptionally fast execution and training times.

In section II, we review previous implementations of SOM on hardware architectures. The novel bSOM algorithm is then presented in section III, followed by its FPGA implementation in section IV. Section V presents the experimental results in software and hardware, and we conclude in section VI.

During training, the "nearest" neuron prototype vector to the input vector (called the "winning" neuron) is identified using a distance metric, D. The Euclidean distance is most frequently used as the metric. For a given network with M neurons and an N-dimensional input vector x, the distance for the neuron with weight vector $w_j$ ($j < M$) is given by equation 1.

I. INTRODUCTION

The original Self Organizing Map (SOM) proposed by Kohonen [1] consists of two layers: the input and the competitive layers. It is an unsupervised neural network with a competitive learning model that captures the topology and probability distribution of input data, which facilitates clustering and classification in pattern recognition [2], [3], [4].

The SOM is typically implemented on a standard von Neumann architecture computer. For large input dimensionality and training set size, execution speeds are reasonable, but training is rather slow, as the SOM training algorithm typically requires thousands of iterations, each of which involves the calculation of the Euclidean distance of each of the input vectors to each of the neuron prototype vectors.

Hardware implementation is therefore of interest. Fortunately, the structure is fairly easy to convert into hardware processing units executing in parallel [5]. However, a direct implementation of the standard SOM onto hardware results in large designs, which consume substantial hardware internal resources (slices, registers and look-up table (LUT) units), limiting the scale of network implementation.

The SOM algorithm presented in [1] is based on a competitive learning algorithm, the winner-take-all (WTA) network, where an input vector is represented by the closest neuron prototype vector, which is assigned during training to a data cluster centre. The prototype vectors are stored in the "weights" of the neural network. The architecture consists of a topologically organized array of neurons, each with an N-dimensional weight vector, where N is also the dimensionality of the input vector. The basic principle of the SOM is to adjust the weight vectors until the neurons represent the input data, while using a topological neighbourhood update rule to ensure that similar prototypes occupy nearby positions on the topological map.

Abstract: A binary Self Organizing Map (SOM) has been designed and implemented on a Field Programmable Gate Array (FPGA) chip. A novel learning algorithm which takes binary inputs and maintains tri-state weights is presented. The binary SOM has the capability of recognizing binary input sequences after training. A novel tri-state rule is used in updating the network weights during the training phase. The rule implementation is highly suited to the FPGA architecture, and allows extremely rapid training. This architecture may be used in real-time for fast pattern clustering and classification of binary features.

Kofi Appiah, Andrew Hunter, Hongying Meng and Shigang Yue are with the Department of Computing and Informatics, University of Lincoln, UK, and Mervyn Hobden, Nigel Priestley, Peter Hobden and Cy Pettit are with e2v Technologies, Lincoln, UK.

This work was supported by TSB under BRAINS Project

II. HARDWARE ARCHITECTURES FOR KOHONEN'S MAP

Software simulations are very useful for investigating the capabilities of neural network models [6], and are suitable for many applications, but are limited in the size of network implementation, particularly where very fast execution and training are required. Hardware neural networks can be implemented using analogue or digital systems [12].

The popularity of digital implementations stems from the fact that they are more accurate, more flexible and less sensitive to noise than analogue ones [7], notwithstanding the analogue inspiration from theoretical neural models. The computational complexity of the SOM algorithm [1] prevents it from training in real-time on single processor architectures for many real-time applications. The FPGA provides a suitable platform for the implementation of a digital version of the SOM neural network, due to its reconfigurability and smaller non-recurring engineering (NRE) costs.

However, a floating-point representation of neurons in a neural network presents significant difficulties for implementation on FPGAs, despite the current advances in FPGA technology [13], since floating-point multipliers and the computation of nonlinear excitation functions are complex and consume large resources [7], [8]. A number of authors have sought to mitigate this problem by introducing simplifications to the SOM algorithm. Pena et al. [4] implemented a digital version of the SOM on FPGA by replacing the Euclidean distance computations with a Cityblock (Manhattan distance) computation to avoid the expense of hardware multiplication. In addition, they simplified the neighbourhood function and introduced a set of new learning parameters.

A similar implementation of the SOM, where the distance, neighbourhood and learning rate computations are replaced with simplified versions, has been presented by Chang et al. [9] and Porrmann et al. [10]. An efficient SOM architecture based on a new Frequency Adaptive Learning (FAL) algorithm, which efficiently replaces the neighbourhood adaptation function of the conventional SOM, has been presented in [9]. The design was implemented on a Xilinx FPGA and is capable of quantizing a 512 x 512 pixel colour image in about 1.003 s at a 35 MHz clock rate without the use of sub-sampling.

A design based on the universal rapid prototyping system RAPTOR2000 for the acceleration of SOM is presented in [10]. Using Xilinx FPGAs, the implementation achieves a speed-up of up to 190 (with five FPGA modules on the RAPTOR2000 system) compared to a software implementation on a state of the art personal computer, for typical applications of self-organizing maps. A similar system, implemented on a Xilinx Virtex II XC2V300, aimed at reducing the training processing time of SOM, has been presented in [11]. The design consists of 16 units in the input layer and N neurons in the output layer, and is divided into three sections: the processing unit array, the address generator and the controller. Compared with an all-software implementation, the design achieves approximately 89% speed-up.

Other forms of neural networks have also been designed and implemented on hardware architectures such as FPGA. In [17], Nedjah et al. proposed the design of a feed-forward neural network on FPGA using a stochastic process to implement the computation performed by the neurons. In the implementation, the multiplication and addition of stochastic values are achieved by an ensemble of XNOR and AND gates respectively. In the proposed stochastic model, a number is represented by a long probabilistic bit-stream whose density of set bits is proportional to the encoded numeric value.

Merchant et al. [13] designed an intrinsic embedded online evolution system using Block-based neural networks (BbNN) [15], a grid-based network structure of interconnected block-based neurons. Each neuron block can have up to 3 inputs, 3 outputs and 9 synaptic weights and biases depending on the internal configuration determined by the network structure. The design has been implemented on a Xilinx Virtex II Pro FPGA running at 40MHz, using a LUT based BbNN implemented on the block RAM.

A modified version of Boolean k-nearest neighbour (BKNN), a supervised classifier using Boolean Neural Networks with binary inputs and outputs, has been implemented on FPGA by Liu et al. [14]. The modification omits the iterative classification procedure and is characterised by one-shot training and a single classification sweep to obtain the answer. The design has been verified with Xilinx ISE 6, targeting the Xilinx Spartan2E XC2S100E FPGA.

To entirely avoid numeric weights in the SOM, while maintaining the level of performance and speeding up training so that the SOM can be used for real-time applications, Yamakawa et al. [3] proposed a binary weighted vector SOM and simulated it in hardware. The proposed SOM uses binary data for both input and weight vectors. The Hamming distance is used to calculate the distance between the input and weight vectors, to identify the winning neuron in the network. However, the weight vector is updated with priority given to the most significant bit, thus attempting to treat the weights as a direct representation of integer values.

The use of the binary weighted SOM on FPGA proves to be very successful compared to the others. The design of the binary weighted SOM is five times faster than the real number weighted SOM in software and 140 times faster in hardware [3]. This highlights a key principle: the most successful design will take account of the nature of the hardware architecture. A novel binary SOM that follows this principle is presented in the following section.

III. PROPOSED BINARY SOM ALGORITHM

In this section we introduce the binary Self Organizing Map (bSOM). This takes a binary vector input, and maintains tri-state vector weights with {0, 1, #} as the possible values. The # represents a "don't care" state, which signifies that the corresponding input vector bit may be either set or clear. The weight vectors have the same length as the input binary vector. The bSOM has the same essential structure as a standard SOM, with an input layer and a competitive layer (see figure 1). Given a binary input vector $b_i = (b_1, b_2, \ldots, b_n)$, all the units in the competitive layer are "connected" by corresponding prototype vectors, $w_j = (w_{j1}, w_{j2}, \ldots, w_{jn})$.


• A bit in the weight vector is only updated if it is different from its corresponding input vector bit.
• An update value is generated for each iteration during training. This value decreases as training progresses.
• A random number is then generated and if the number is greater than the update value, the bit is updated.
• A bit is updated by changing its value from 1 to #, 0 to # or # to (0 or 1) depending on the input bit value.
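The four rules above can be sketched in Python as follows. This is only an illustration: the symbol encoding (integers 0 and 1 plus the string "#"), the function names, and the assumption that the random number is drawn uniformly from [0, 1) and compared against an update value in the same range are all choices made for this sketch rather than details given in the text.

```python
import random

DONT_CARE = "#"


def update_bit(w, b, update_value, rng):
    """Apply the tri-state rule to one weight symbol w for input bit b."""
    if w == b:
        return w  # matching bits are never changed
    if rng.random() <= update_value:
        return w  # the draw did not exceed the update value: no change
    # A mismatched 0 or 1 becomes '#'; a '#' takes the value of the input bit.
    return b if w == DONT_CARE else DONT_CARE


def update_neuron(weights, bits, update_value, rng):
    """Update every symbol of one neuron against one binary input vector."""
    return [update_bit(w, b, update_value, rng) for w, b in zip(weights, bits)]


rng = random.Random(0)  # seeded only to make the illustration repeatable
print(update_neuron([0, 1, DONT_CARE, 1], [0, 0, 1, 1], 0.3, rng))
```

A '#' never equals an input bit in this encoding, so it is always treated as a candidate for update, which is consistent with the transition from # to 0 or 1 described in the last rule.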

Fig. 1. Structure of the original SOM [18].

where $\bar{x}_i$ and $\bar{w}_{ji}$ are the bit inverses of $x_i$ and $w_{ji}$ respectively.

The bSOM training algorithm is discussed below, and compared and contrasted with the original SOM algorithm and Yamakawa's [3] implementation.

        0       1       #
0      0.5      0      0.5
1       0      0.5     0.5
#      0.5     0.5      0

Fig. 2. The conditional Markov transition matrix

        0         1         #
0    1-0.5p       0        0.5p
1       0       1-0.5p     0.5p
#     0.5p      0.5p       1-p

Fig. 3. The effective Markov transition matrix

The bit transition can be modelled as a Markov chain with a conditional Markov transition matrix (T) as shown in figure 2. If the probability of applying the conditional Markov transition matrix is given as p = 1 - update rate, the resulting effective Markov transition matrix ($T_e$) for a bit to change is as shown in figure 3. If T is a regular transition matrix, then as n approaches infinity, $T^n \to S$, where S is a matrix with constant vectors, as shown in figure 4. The transition matrix settles after the 12th iteration. Solving for X in $(T - \lambda I)X = 0$, where $\lambda = 1$ and X is a vector representing the three states (0, 1, #), gives $X_1 = X_2 = X_3$. This shows that increasing the number of training iterations makes no significant difference to the final results, confirming that the bSOM requires fewer iterations to converge, as compared to the original SOM and that presented in [3]. The following section gives the architectural and implementation features of the proposed bSOM algorithm.
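The convergence of the transition matrix can be checked numerically. The sketch below, which uses plain Python lists and hypothetical helper names (`matmul`, `matpow`, `effective`), raises the conditional matrix of figure 2 to the powers shown in figure 4 and builds the effective matrix of figure 3 as $pT + (1-p)I$.

```python
def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matpow(t, n):
    """n-th power of a 3x3 matrix by repeated multiplication."""
    result = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    for _ in range(n):
        result = matmul(result, t)
    return result


def effective(t, p):
    """Effective matrix: apply t with probability p, otherwise stay put."""
    return [[p * t[i][j] + (1.0 - p) * (1.0 if i == j else 0.0)
             for j in range(3)] for i in range(3)]


# Conditional transition matrix over the states (0, 1, #), as in figure 2.
T = [[0.5, 0.0, 0.5],
     [0.0, 0.5, 0.5],
     [0.5, 0.5, 0.0]]

for n in (2, 12, 13, 14):
    print(n, [[round(v, 4) for v in row] for row in matpow(T, n)])
```

Every entry approaches 1/3 as n grows, which agrees with the equal stationary probabilities $X_1 = X_2 = X_3$.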

(5)

D. Updating Weight Vectors

The winning neuron and its neighbourhood are updated as shown in equation 3. In bSOM, a probabilistic update is used. The probabilistic update in the bSOM is summarised as follows:

B. Winner Take All (WTA)

Analogously to the original SOM, the unit with the smallest Hamming distance to the input is defined as the winning neuron; see equation 5. Since the weight is a tri-state vector, a # is considered as a matching bit irrespective of the input bit's value. The total number of #'s in the weight vector is stored and used when selecting the winning unit in the competitive layer. When there is a tie, that is, when two neurons have the same Hamming distance to the input vector, the neuron with the minimum number of #'s is chosen as the better match.
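The winner selection just described, including the tie-break on the number of #'s, can be sketched as below. Representing vectors as strings over "0", "1" and "#", and resolving any remaining tie by lowest index, are assumptions made for this illustration.

```python
DONT_CARE = "#"


def hamming(x, w):
    """Count mismatches, treating '#' in the weight as matching any input bit."""
    return sum(1 for xi, wi in zip(x, w) if wi != DONT_CARE and wi != xi)


def winner(x, neurons):
    """Index of the neuron with the smallest distance, then the fewest '#'s."""
    return min(range(len(neurons)),
               key=lambda j: (hamming(x, neurons[j]),
                              neurons[j].count(DONT_CARE)))


x = "1011"
neurons = ["####", "10#1", "0011"]
print([hamming(x, w) for w in neurons], winner(x, neurons))
```

Here the first two neurons both have distance 0, and the second is preferred because it carries fewer #'s and is therefore the more specific match.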

C. Neighbourhood Selection

As in the original SOM and in [3], a neighbourhood $N_e$ of neurons around the winning neuron $w_e$ is selected and updated with the winning neuron. The size of the neighbourhood is inversely proportional to the iteration value.

A. Distance Computation

The Euclidean distance computation, equation 1, is used in the original SOM to calculate the distance between the input vector and the neuron prototype vectors. The implementation of this equation is not only difficult to realise in hardware, but also unnecessary for binary vectors. Following [3], we use the Hamming distance H, as shown in equation 4, for an input vector x and weight vector $w_j$.

T^2                          T^12
0.5000  0.2500  0.2500       0.3335  0.3333  0.3333
0.2500  0.5000  0.2500       0.3333  0.3335  0.3333
0.2500  0.2500  0.5000       0.3333  0.3333  0.3335

T^13                         T^14
0.3334  0.3333  0.3334       0.3334  0.3333  0.3333
0.3333  0.3334  0.3334       0.3333  0.3334  0.3333
0.3334  0.3334  0.3333       0.3333  0.3333  0.3334

many clock cycles as there are bits in the binary input vector to complete the initialization. The hardware architecture presented here has been tested with binary character images of size 28 x 28, totalling 784 bits. The sizes of the input and weight vectors are all set to 784 bits and can easily be altered for any image size. The presented implementation takes exactly 784 clock cycles to completely initialize all the neurons.

Fig. 5. A block diagram of the design circuit.

Fig. 4. The conditional Markov transition matrix after the 2nd, 12th, 13th and 14th iterations respectively.

TABLE I. SPECIFICATION OF FPGA CIRCUIT DESIGN.

D. Neighbourhood update block

This block is used to select the neighbourhood of the winning neuron and to update the neurons in the specified region. In the hardware implementation the maximum size of the neighbourhood is set to 4, and it decreases as training progresses. The iteration count determines the size of the neighbourhood; for example, if the total number of iterations is set to 100, then for the first 25 iterations the neighbourhood is set to 4, then 3 in the second 25 iterations (that is, iterations 26 to 50) and then 1 in the last 25 iterations.
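One way to express this schedule is sketched below. The function name and the integer arithmetic are illustrative, and the size of 2 for the third quarter of training is an assumption, since the example only states the sizes for the first, second and last quarters.

```python
def neighbourhood_size(iteration, total_iterations, max_size=4):
    """Neighbourhood size for a 1-based iteration index.

    Training is split into `max_size` equal phases and the size drops
    by one at the start of each new phase.
    """
    phase = (iteration - 1) * max_size // total_iterations
    return max_size - phase


for it in (1, 25, 26, 50, 51, 75, 76, 100):
    print(it, neighbourhood_size(it, 100))
```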

where k is the total number of bits in the input vector and $j \in (1 \ldots 40)$ is the address of the neuron. It is worth noting that the neuron vector is tri-state and the # state is ignored when computing the Hamming distance. Thus, for a neuron with 784 #'s, the Hamming distance will always be 0.

2) Winning neuron unit: This unit uses the results from the Hamming distance computed in section IV-C.1 to identify the winning neuron. The design, as shown in figure 6, uses a series of comparators to select the minimum of every two input Hamming distances. For an implementation with 40 values, the design takes exactly seven clock cycles to compute the node with the minimum Hamming distance.
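A pairwise comparator reduction of this kind can be sketched in Python as follows. The function name, the pass-through of an unpaired entry and the tie-break on the lower address are assumptions of this sketch. It counts comparison stages only (6 for 40 inputs) and does not attempt to model the registers or control logic behind the seven clock cycles reported for the hardware.

```python
def min_tree(distances):
    """Reduce distances pairwise; return (minimum, address, stages)."""
    layer = [(d, j) for j, d in enumerate(distances)]
    stages = 0
    while len(layer) > 1:
        # Each comparator keeps the smaller (distance, address) pair.
        nxt = [min(layer[i], layer[i + 1])
               for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:
            nxt.append(layer[-1])  # an unpaired entry passes straight through
        layer = nxt
        stages += 1
    return layer[0][0], layer[0][1], stages


# 40 made-up, distinct distances purely for demonstration.
demo = [(37 * j + 11) % 41 for j in range(40)]
print(min_tree(demo))
```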

C. Winner Take All block

This block is made up of two parts, the Hamming distance computation unit and the winning neuron unit.

1) Distance computation unit: This unit is used to compute the Hamming distance between the input binary vector and all the (40) neurons in the bSOM. The Hamming distance between the input vector $x_i$ and a neuron $w_j$, as shown in equation 6, is a bitwise operation and hence takes as many clock cycles as there are bits in the input vector. Since the Hamming distances for all the 40 neurons are computed in parallel, it takes exactly 784 clock cycles to compute the Hamming distance for all the neurons in the network.

$$H_{ij} = \sum_{k=1}^{784} H_{ijk}, \quad \text{where } w_{jk} \neq \#. \qquad (6)$$
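The bit-serial, neuron-parallel timing of the distance unit can be illustrated with a small software model. The sketch below is an assumption-laden simulation rather than the hardware design itself: it keeps one accumulator per neuron, treats each pass of the outer loop as one clock cycle, and uses strings over "0", "1" and "#" for the vectors.

```python
def bit_serial_hamming(x, neurons):
    """Return per-neuron Hamming distances and the number of modelled cycles."""
    acc = [0] * len(neurons)
    cycles = 0
    for k in range(len(x)):              # one modelled clock cycle per bit
        for j, w in enumerate(neurons):  # all accumulators step together
            if w[k] != "#" and w[k] != x[k]:
                acc[j] += 1              # '#' positions are skipped
        cycles += 1
    return acc, cycles


x = "10" * 392                           # a 784-bit input vector
neurons = ["10" * 392, "01" * 392, "#" * 784]
print(bit_serial_hamming(x, neurons))
```

The cycle count depends only on the vector length, not on the number of neurons, which is why all 40 distances are available after 784 cycles.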

B. Pattern Input block

This block is used to acquire the binary input vector (or binary image) from an external camera. The size of the input vector (784 bits) is pre-programmed and the input is complete when a total of 784 bits has been read from the camera. This binary data is stored in the input vector and then passed on to the WTA block for further processing.

Network size             40 neurons
Input vectors            784 bits
Neuron vectors           784 bits
Initial weights          Random
Maximum neighbourhood    4 neurons

IV. FPGA ARCHITECTURE AND IMPLEMENTATION

The most critical aspect of any hardware design is the selection and design of the architecture that provides the most efficient and effective implementation [9]. The specifications of the circuit implemented on FPGA are given in table I, with the corresponding block diagram in figure 5. The circuitry is made up of five basic blocks, namely the weight initialization, pattern input, Winner Take All, neighbourhood update and display blocks.

Three of the five blocks run in parallel. These are the pattern input, Winner Take All and display (output) blocks. The weight initialization block is triggered only at start-up. Similarly, the neighbourhood update block is triggered when a winning node $w_k$ is identified for an input binary vector. Details of the five basic blocks are presented in the following sections.

A. Weight Initialization block

This block is used to randomly initialize all the weight (neuron) vectors in the network. All the neurons in the network are initialized in parallel bit-by-bit; hence it takes as

Fig. 6. Structure of the WTA unit (figure label: 7 clock cycles).

Resource Name      Total      Used      Per. (%)
Flip Flops         135,168    4,095     3
4 input LUTs       135,168    18,387    13
Bonded IOBs        768        147       19
Occupied Slices    67,584     11,468    16
RAM16s             288        43        14

A. Software Simulation

The software based simulation of the bSOM has been carried out on a PC with a general purpose processor clocked at 2.8GHz and 2GB of SDRAM. Initial experiments were conducted to empirically select control parameters (number of neurons, neighbourhood size and learning rate) for all three implementations of the SOM: the conventional SOM, the strict binary SOM and the tri-state SOM (bSOM).

To determine the number of neurons required to represent all 60,000 patterns in the dataset (see figure 7), we experimented with different numbers of neurons from 10 to 100 in steps of 10. This experiment was primarily based on the bSOM and also applies to the conventional SOM algorithms. The results improve with increasing numbers of neurons until performance begins to plateau at 80 neurons for the bSOM (with minimal improvement thereafter).

initialization. The clock frequency of 40MHz also includes the design for controlling the external logic for the VGA and the camera. This is the actual hardware test and the most stable clock frequency. The frequency could be much higher without the requirement to interface these devices. Table II gives the details of the resource utilization of the FPGA implementation.
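The quoted clock frequency and training rate imply a per-pattern cycle budget, which the short calculation below works out. The variable names are illustrative, and the text does not say how the cycles left over after the distance and winner stages are divided between input, update and control, so they are reported as a single remainder.

```python
CLOCK_HZ = 40_000_000          # 40MHz design clock
PATTERNS_PER_SECOND = 25_000   # quoted training rate

cycles_per_pattern = CLOCK_HZ // PATTERNS_PER_SECOND

DISTANCE_CYCLES = 784          # bit-serial Hamming distance, all neurons
WTA_CYCLES = 7                 # comparator stage of the winning neuron unit

remaining_cycles = cycles_per_pattern - DISTANCE_CYCLES - WTA_CYCLES

print(cycles_per_pattern, remaining_cycles)
```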

V. EXPERIMENTAL RESULTS

This section describes some of the experiments conducted on the algorithm to verify its correctness, and compares it with other implementations. The MNIST database of handwritten digits [19] (a sample is shown in figure 7) is used to test the implementation both in PC simulation and on the FPGA hardware architecture. A comparison on the PC between the original SOM as presented by Kohonen in [1] (herein referred to as the conventional SOM), a strictly binary SOM and the proposed tri-state SOM (bSOM) algorithms is also given in this section.

Even though the bSOM is meant for hardware implementation, for simulation and a fair comparison with the conventional SOM we have also implemented the bSOM on a PC using MATLAB. To justify the use of tri-state (0, 1, #) rather than just binary (0, 1) weights, we have also implemented a version of the binary SOM with only 0's and 1's, excluding the third state #. The solely binary implementation uses the same rules as the tri-state (or bSOM) implementation and is herein referred to as the strict binary SOM.

TABLE II. IMPLEMENTATION RESULTS FOR THE BSOM, USING VIRTEX-4 XC4VLX160, PACKAGE FF1148 AND SPEED GRADE -10.

[Fig. 6 labels: multiplexer stages; output is the minimum Hamming distance and the address of the corresponding neuron.]

The update requires a random number generator, which is not only complex to implement in hardware but also computationally expensive. To avoid these costs, an LUT with 2000 randomly generated numbers has been implemented on the FPGA. For a mismatched bit between the input vector and the neuron to be updated, one of the 2000 values is selected using the iteration count. If the number of iterations exceeds 2000, the last 10 bits of the iteration count are used to address the random number in the LUT.
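The addressing scheme can be sketched as below. The 0-based indexing, the function name and the treatment of an iteration count of exactly 2000 (which this sketch masks, since index 2000 would fall outside a 0-based table) are assumptions made for illustration.

```python
LUT_SIZE = 2000


def lut_address(iteration):
    """Table index for a given iteration count (0-based, assumed)."""
    if iteration < LUT_SIZE:
        return iteration
    # Beyond the table: keep only the last 10 bits, giving 0..1023.
    return iteration & 0x3FF


for it in (5, 1999, 2000, 2048, 5000):
    print(it, lut_address(it))
```

One consequence of the 10-bit mask is that only the first 1024 table entries are reused once the iteration count passes the table size.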

Mismatching bits in the neuron vector are updated as discussed in section III-D; thus a 1 changes to #, a 0 changes to # and a # changes into 0 or 1 depending on the binary input value. Note that a # is implemented as '10', or decimal 2.

E. Output display block

This block displays the neurons (weights) as binary images on an external Video Graphics Array (VGA) display for visual verification. It runs in parallel with the input and WTA blocks, at the refresh rate of the VGA used, typically 60Hz.

The bSOM architecture discussed here has been implemented on a Xilinx Virtex-4 FPGA chip (XC4VLX160) with approximately 152,064 logic cells and embedded RAM totalling 5,184 Kbits. The design and verification were accomplished using the Handel-C high level descriptive language. Compilation and simulation were achieved using the Agility DK design suite. Synthesis (the translation of abstract high-level code into a gate-level net-list) was accomplished using Xilinx ISE tools.

The entire design can be clocked up to 40MHz, making it possible to train the binary Self Organizing Map with up to 25,000 patterns of size 784 bits in a second after


References

Self-Organizing Maps

The Amsterdam Library of Object Images

Neural Network Implementation Using FPGA: Issues and Application

Block-based neural networks

Self-organizing learning array
