ORIGINAL RESEARCH
published: 21 March 2017
doi: 10.3389/fncom.2017.00013

Edited by: Marcel van Gerven, Radboud University Nijmegen, Netherlands
Reviewed by: Michael W. Spratling, King’s College London, UK; Kandan Ramakrishnan, University of Amsterdam, Netherlands
*Correspondence: Alberto Testolin (alberto.testolin@unipd.it); Marco Zorzi (marco.zorzi@unipd.it)
Received: 30 November 2016; Accepted: 27 February 2017; Published: 21 March 2017
Citation: Testolin A, De Filippo De Grazia M and Zorzi M (2017) The Role of Architectural and Learning Constraints in Neural Network Models: A Case Study on Visual Space Coding. Front. Comput. Neurosci. 11:13. doi: 10.3389/fncom.2017.00013
The Role of Architectural and Learning Constraints in Neural Network Models: A Case Study on Visual Space Coding

Alberto Testolin¹*, Michele De Filippo De Grazia¹ and Marco Zorzi¹,²*

¹ Department of General Psychology and Padova Neuroscience Center, University of Padova, Padova, Italy
² San Camillo Hospital IRCCS, Venice, Italy
The recent “deep learning revolution” in artificial neural networks had strong impact and widespread deployment for engineering applications, but the use of deep learning for neurocomputational modeling has been so far limited. In this article we argue that unsupervised deep learning represents an important step forward for improving neurocomputational models of perception and cognition, because it emphasizes the role of generative learning as opposed to discriminative (supervised) learning. As a case study, we present a series of simulations investigating the emergence of neural coding of visual space for sensorimotor transformations. We compare different network architectures commonly used as building blocks for unsupervised deep learning by systematically testing the type of receptive fields and gain modulation developed by the hidden neurons. In particular, we compare Restricted Boltzmann Machines (RBMs), which are stochastic, generative networks with bidirectional connections trained using contrastive divergence, with autoencoders, which are deterministic networks trained using error backpropagation. For both learning architectures we also explore the role of sparse coding, which has been identified as a fundamental principle of neural computation. The unsupervised models are then compared with supervised, feed-forward networks that learn an explicit mapping between different spatial reference frames. Our simulations show that both architectural and learning constraints strongly influenced the emergent coding of visual space in terms of distribution of tuning functions at the level of single neurons. Unsupervised models, and particularly RBMs, were found to more closely adhere to neurophysiological data from single-cell recordings in the primate parietal cortex. These results provide new insights into how basic properties of artificial neural networks might be relevant for modeling neural information processing in biological systems.

Keywords: connectionist modeling, unsupervised deep learning, restricted Boltzmann machines, autoencoders, sparseness, space coding, gain modulation, sensorimotor transformations

INTRODUCTION
Artificial neural network models aim at explaining human
cognition and behavior in terms of the emergent consequences
of a large number of simple, subcognitive processes (McClelland
et al., 2010). Within this framework, the pattern seen in
overt behavior (macroscopic dynamics of the system) reflects
the coordinated operations of simple biophysical mechanisms
(microscopic dynamics of the system), such as the propagation
of activation and inhibition among elementary processing units.
Though this general tenet is shared by all connectionist models, there is large variability in processing architectures and learning algorithms, which translates into varying degrees of psychological and biological realism (e.g., Thorpe and Imbert, 1989; O’Reilly, 1998). When the aim is to investigate high-level cognitive functions, simplification is essential (McClelland, 2009) and the underlying processing mechanisms do not need to faithfully implement the neuronal circuits supposed to carry out such functions in the brain. However, modelers should strive to consider biological plausibility if this can bridge different levels of description (Testolin and Zorzi, 2016).
Recent theoretical and technical progress in artificial neural networks has significantly expanded the range of tasks that can be solved by machine intelligence. In particular, the advent of powerful parallel computing architectures based on Graphic Processing Units (GPUs), coupled with the availability of “big data,” has made it possible to create and train large-scale, hierarchical neural networks known as deep neural networks (LeCun et al., 2015, for review). These powerful learning systems achieve impressive performance in many challenging cognitive tasks, such as visual object recognition (Krizhevsky et al., 2012), speech processing (Mohamed et al., 2012) and natural language understanding (Collobert et al., 2011). However, while the impact of deep learning for engineering applications is undisputed, its relevance for modeling neural information processing in biological systems still needs to be fully evaluated (for seminal attempts, see Stoianov and Zorzi, 2012; Khaligh-Razavi and Kriegeskorte, 2014; Güçlü and van Gerven, 2015).
One critical aspect of most deep learning systems is the reliance on a feed-forward architecture trained with error backpropagation (Rumelhart et al., 1986), which has been repeatedly shown to yield state-of-the-art performance in a variety of problems (LeCun et al., 2015). However, the assumptions that learning is largely discriminative (e.g., classification or function learning) and that an external teaching signal is always available at each learning event (i.e., all training data is “labeled”) are clearly implausible from both a cognitive and a biological perspective (Zorzi et al., 2013; Cox and Dean, 2014). Reinforcement learning is a valuable alternative and has already shown promising results when combined with deep learning (Mnih et al., 2015; Silver et al., 2016), but there is a broad range of situations where learning seems to be fully unsupervised and its only objective is that of discovering the latent structure of the input data in order to build rich, internal representations of the environment (Hinton and Sejnowski, 1999). We argue that more realistic neurocognitive models should therefore also exploit unsupervised forms of deep learning, where the objective is not to explicitly classify the input patterns but rather to discover internal representations by fitting a hierarchical generative model to the sensory data (Hinton, 2007, 2013; Zorzi et al., 2013). Compared to its supervised counterpart, this modeling approach emphasizes the role of feedback (recurrent) connections (Sillito et al., 2006), which carry top-down expectations that are gradually adjusted to better reflect the observed data (Hinton and Ghahramani, 1997; Friston, 2010) and which can be used to implement concurrent probabilistic inference along the whole cortical hierarchy (Lee and Mumford, 2003; Gilbert and Sigman, 2007). Notably, top-down processing is also relevant for understanding attentional mechanisms in terms of modulation of neural information processing (Kastner and Ungerleider, 2000).
A powerful class of stochastic neural networks that learn a generative model of the data is that of Restricted Boltzmann Machines (RBMs), which can efficiently discover internal representations (i.e., latent features) using Hebbian-like learning mechanisms (Hinton, 2002). RBMs constitute the building block of hierarchical generative models such as Deep Belief Networks (Hinton and Salakhutdinov, 2006) and Deep Boltzmann Machines (Salakhutdinov, 2015). These unsupervised deep learning models have been successfully used to simulate a variety of cognitive functions, such as numerosity perception (Stoianov and Zorzi, 2012), letter perception (Testolin et al., under review), location-invariant visual word recognition (Di Bono and Zorzi, 2013), and visual hallucinations in psychiatric syndromes (Reichert et al., 2013). A similar approach has been used to simulate how early visual cortical representations are adapted to statistical regularities in natural images, in order to predict single voxel responses to natural images and identify images from stimulus-evoked multiple voxel responses (Güçlü and van Gerven, 2014). A temporal extension of RBMs has also been recently used to model sequential orthographic processing and spontaneous pseudoword generation (Testolin et al., 2016).
Unsupervised deep learning can also be implemented using an alternative architecture based on autoencoders (Bengio et al., 2007), which are deterministic, feed-forward networks whose learning goal is to accurately reconstruct the input data into a separate layer of output units. Single-layer autoencoders are trained using error backpropagation, and can be stacked in order to build more complex, multi-layer architectures. However, despite the common view that RBMs and autoencoders can be considered equivalent (Ranzato et al., 2007), we note that their underlying architectural and learning assumptions are significantly different. In this study we empirically compare RBMs and autoencoders in terms of the type of internal encoding that emerges in the hidden neurons. Moreover, we investigate how additional learning constraints, such as sparsity and the limitation of computational resources (i.e., hidden layer size), influence the representations developed by the networks. As a case study, we focus on the problem of learning visuospatial coding for sensorimotor transformations, which is a prominent example of how the emergentist approach based on learning in artificial neural networks has offered important insights into the computations performed by biological neurons (Zipser and Andersen, 1988).
Sensorimotor transformations refer to the process by which sensory stimuli are converted into motor commands. For example, reaching requires mapping visual information, represented in retinal coordinates, into a system of coordinates that is centered on the effector. Coordinate transformations can be accomplished by combining sensory information with extra-retinal information, such as postural signals representing the position of eyes, head, or hand, thereby obtaining abstract representations of the space interposed between the sensory input and the motor output (Pouget and Snyder, 2000). Single-neuron recordings from monkey posterior parietal cortex have shown that the response amplitude of many neurons indeed depends on the position of the eyes, thereby unveiling a fundamental coding principle used to perform this type of signal integration (Andersen et al., 1985). The term gain field was coined to describe this gaze-dependent response of parietal neurons, and since then the notion of gain modulation has been generalized to indicate the multiplicative control of one neuron’s responses by the responses of another set of neurons (Salinas and Thier, 2000). Another fundamental property unveiled by neuronal recordings is that the encoding of space used for coordinate transformations involves a variety of different, complementary frames of reference. For example, although many parietal neurons are centered on retinal coordinates (Andersen et al., 1985; Duhamel et al., 1992), others represent space using body-centered (Snyder et al., 1998) or effector-centered (Sakata et al., 1995) coordinate systems. Moreover, some neurons exhibit multiple gain modulation (Chang et al., 2009), suggesting more complex forms of spatial coding. For example, postural information related to both eye and head positions can be combined in order to encode “gaze direction” (Brotchie et al., 1995; Stricanne et al., 1996; Duhamel et al., 1997).
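In formal terms, gain modulation is often summarized as a multiplicative interaction between a retinotopic tuning curve and a postural gain term. As a schematic illustration (our formalization of the standard account, not an equation taken from this article), the response r of a gain-field neuron to a target at retinal position x under eye position e can be written as

r(x, e) = \exp\left( -\frac{(x - x_0)^2}{2\sigma^2} \right) \cdot g(e)

where x_0 and σ are the center and width of the retinal receptive field, and g is a monotonic (e.g., linear or sigmoid) function of eye position that scales the amplitude of the tuning curve without shifting its peak.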
From a computational perspective, the seminal work of Zipser and Andersen (1988) showed that gain modulation could spontaneously emerge in supervised, feed-forward neural networks trained to explicitly map visual targets into head-centered coordinates, taking as input any arbitrary pair of eye and retinal positions. Similar results have been observed using more biologically-plausible learning settings, such as reinforcement learning (Mazzoni et al., 1991) and predictive coding (De Meyer and Spratling, 2011). Note that these learning settings assume that gain modulation emerges because the task requires establishing a mapping between different reference frames. However, it is unclear whether the form of modulation and the distribution of neuronal tuning functions are influenced by the type of learning algorithm and/or by the nature of the learning task (i.e., learning input-output mappings vs. unsupervised learning of internal representations). We also note that a popular alternative framework for modeling sensorimotor transformations is not based on learning, but rather stipulates that parietal neurons represent a set of basis functions that combine visual and postural information (for review, see Pouget and Snyder, 2000).
In summary, space coding represents an interesting case
study for testing the adequacy of different neural network
architectures and learning algorithms, because it provides a
wealth of neurophysiological data (both at the population and
single-neuron levels), and it departs from the classic problem of
visual object recognition investigated in the large majority of deep
learning research.
MATERIALS AND METHODS
In this section we describe the space coding tasks used in our
simulations, including training and test stimuli, the different
learning architectures, and the procedures for analyzing the
emergent neural representations.
Space Coding Tasks
In this study we consider a visual signal in retinotopic coordinates and two different postural signals, one for eye position and another for a generic “effector,” which might represent, for example, the position of the hand. We do not consider the integration between different modalities (see Xing and Andersen, 2000, for a computational investigation of multimodal integration in several coordinate frames). We implemented three types of space coding tasks to test the different learning architectures.
Unsupervised Learning with No Coordinate Transformation
The first learning architecture is depicted in Figure 1A. Unsupervised learning is represented by undirected arrows, which connect the sensory input to a separate layer of hidden neurons. The input signal to the network consists of a visual map, which represents target location in retinotopic coordinates, and two postural maps, which represent eye and effector positions. The learning goal is only to build a compact representation of these input signals in the hidden layer, which is later read out by a simple linear associator in order to establish a mapping with the corresponding motor program. Details of input and output representations are provided in Section Dataset and Stimuli. The unsupervised learning phase does not involve any coordinate transformation because information about the motor program is not available.
Unsupervised Learning with Coordinate Transformation
The second learning architecture is depicted in Figure 1B. The input signal to the network still consists of a visual map and two postural maps, but in this case we also provide as input the corresponding motor program. In this setting the unsupervised learning phase implicitly involves coordinate transformation (i.e., different coordinate systems become associated). In order to compare the mapping accuracy of different learning architectures using the same method, the motor program is still read out from hidden neurons via a simple linear associator.
Supervised Learning with Coordinate Transformation
The third learning architecture is depicted in Figure 1C, and it corresponds to the model used by Zipser and Andersen (1988). The input is the same as for the unsupervised architecture shown in Figure 1A, but in this case supervised learning (directed arrows) is used to establish an explicit mapping between input signals and motor programs. As for the previous architectures, accuracy of the motor program is also tested by read-out from hidden neurons via linear association.

FIGURE 1 | Graphical representations of the learning architectures used to simulate the space coding tasks. Undirected edges entail bidirectional (recurrent) connections, while directed arrows represent feed-forward connections. (A) Unsupervised learning with no coordinate transformation. (B) Unsupervised learning with coordinate transformation. (C) Supervised learning with coordinate transformation.
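To make the read-out step concrete, the linear associator can be fit in closed form with regularized least squares. The sketch below is our own illustration (the article does not specify the solver); the function names, the ridge parameter, and the array shapes (hidden activations of size n_patterns × n_hidden, motor targets of size n_patterns × 625) are assumptions:

```python
import numpy as np

def fit_linear_readout(H_train, M_train, l2=1e-3):
    """Fit a ridge-regularized linear mapping from hidden activations
    (n_patterns x n_hidden) to motor-map targets (n_patterns x 625)."""
    # Append a constant column so the read-out also learns a bias term.
    X = np.hstack([H_train, np.ones((H_train.shape[0], 1))])
    # Closed-form ridge solution: W = (X^T X + l2 * I)^-1 X^T M
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ M_train)

def decode_motor_program(H_test, W):
    """Predict motor-map activations for held-out hidden patterns."""
    X = np.hstack([H_test, np.ones((H_test.shape[0], 1))])
    return X @ W
```

Decoding accuracy can then be scored by comparing the location of the peak of the predicted motor map against the true target location.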
Dataset and Stimuli
The representation format adopted for the sensory stimuli was the same used in previous computational investigations (Zipser and Andersen, 1988; Pouget and Snyder, 2000; De Filippo De Grazia et al., 2012), which is broadly consistent with neurophysiological data recorded in animals performing tasks involving coordinate transformations (e.g., Andersen et al., 1985).

The visual input to the models consisted of a real-valued vector representing the position of the stimulus as a Gaussian peak of activity in a specific location. These visible neurons simulate the activity of the cortical areas supplying retinotopic sensory information to the posterior parietal cortex. The retinotopic map consisted of a square matrix of 17 × 17 neurons, which employed a population code with Gaussian tuning functions (standard deviation = 4°). Visual receptive fields were uniformly spread between −9° and +9° with increments of 3°, both in the horizontal and vertical dimensions.

Four postural maps, each one consisting of 17 neurons, were used to represent the horizontal and vertical positions of the eye and the effector. These visible neurons used a sigmoid activation function (steepness parameter = 0.125) to represent postural information between −18° and +18°, with steps of 3°.

The motor program consisted of a real-valued vector representing the target position of the stimulus. Similarly to the retinotopic map, it was coded as a square matrix of 25 × 25 neurons, which employed a population code with Gaussian tuning functions to represent target position in coordinates centered on the effector (standard deviation = 6°). Motor programs were uniformly spread between −9° and +9° with increments of 3°, both in the horizontal and vertical dimensions.

In order to create the stimulus dataset, all possible combinations of visual input and postural signals were first generated, and the corresponding motor program (target location) was computed. We then balanced the patterns to ensure that target locations were equally distributed across the motor map, to avoid position biases when decoding the motor program. This resulted in a total of 28,880 patterns, which were randomly split into a training set (20,000 patterns) and an independent test set (8,880 patterns). The latter was used to assess the generalization performance of the models.
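For concreteness, the population codes described above could be generated along the following lines. This is a minimal sketch under our own assumptions: in particular, the placement of receptive-field centers and sigmoid inflection points across the neuron grids is our reading of the text, not the authors' exact code:

```python
import numpy as np

def visual_map(x, y, grid=17, lo=-9.0, hi=9.0, sigma=4.0):
    """Gaussian population code for a target at (x, y) degrees on a
    grid x grid retinotopic map (center placement is our assumption)."""
    centers = np.linspace(lo, hi, grid)
    gx = np.exp(-(centers - x) ** 2 / (2 * sigma ** 2))
    gy = np.exp(-(centers - y) ** 2 / (2 * sigma ** 2))
    return np.outer(gy, gx)  # 2-D activity bump; flatten before training

def postural_map(value, n=17, lo=-18.0, hi=18.0, steepness=0.125):
    """Sigmoid population code for one postural dimension (eye or
    effector position): each neuron has a different inflection point."""
    thresholds = np.linspace(lo, hi, n)
    return 1.0 / (1.0 + np.exp(-steepness * (value - thresholds)))
```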
Learning Architectures
Although they differ in several respects, Boltzmann machines and autoencoders can both be defined within the mathematical framework of energy-based models (Ranzato et al., 2007), where the learning objective is to carve the surface of an energy function so as to minimize the energies of training points and maximize the energies of unobserved points. A set of latent variables is used to learn an internal code that can efficiently represent the observed data points, and since the number of latent variables is usually smaller than that of the observed variables, the encoding process can be interpreted as a form of dimensionality reduction (Hinton and Salakhutdinov, 2006). In this unsupervised setting, the model learns the statistical structure of the data without the need for any explicit, external label.
Restricted Boltzmann Machines (RBMs)
Boltzmann machines are stochastic neural networks that use a
set of hidden neurons to model the latent causes of the observed
data vectors, which are presented to the network through a set of visible neurons (Ackley et al., 1985). In the “restricted” case, the
network connectivity is constrained in order to obtain a bipartite
graph (i.e., there are no connections within the same layer; see
Figure 2A for a graphical representation). The behavior of the
network is driven by an energy function E, which defines the
joint distribution of the hidden and visible neurons by assigning
a probability value to each of their possible configurations:
p(v, h) = \frac{e^{-E(v, h)}}{Z}
where v and h are the column vectors containing the values of
visible and hidden neurons, respectively, and Z is the partition
function. The energy function is defined as a linear combination
of visible and hidden neuron activations:

E(v, h) = -b^T v - c^T h - h^T W v
where W is the matrix of connection weights, b and c are two additional parameters known as unit biases, and T denotes the
transpose operator. Since there are no connections within the
same layer, hidden neurons are conditionally independent given
the state of visible neurons (and vice versa). In particular, the
activation probability of the neurons in each layer conditioned
on the activation of the neurons in the opposite layer can be
efficiently computed in one parallel step:
P(h_j = 1 \mid v) = \sigma\Big( c_j + \sum_i w_{ij} v_i \Big)

P(v_i = 1 \mid h) = \sigma\Big( b_i + \sum_j w_{ij} h_j \Big)
where σ is the sigmoid function, c_j and b_i are the biases of hidden and visible neurons (h_j and v_i, respectively), and w_{ij} is the connection weight between h_j and v_i. Learning in RBMs can be performed through maximum likelihood, where each weight should be changed at each step according to a Hebbian-like learning rule:
\Delta W = \eta \left( v^+ h^+ - v^- h^- \right)
where η represents the learning rate, v^+ h^+ are the visible-hidden correlations computed on the training data (positive phase), and v^- h^- are the visible-hidden correlations computed according to the model’s expectations (negative phase). The model’s expectations have traditionally been computed by running Gibbs sampling until the network reached equilibrium (Ackley et al., 1985). However, more efficient algorithms such as contrastive divergence (Hinton, 2002) speed up learning by approximating the log-probability gradient. The reader is referred to Hinton (2010) and Zorzi et al. (2013) for more details about RBMs and for a discussion of the hyper-parameters of the learning algorithm.
In our simulations, RBMs were trained using 1-step contrastive divergence with a learning rate of 0.03, a weight decay of 0.0002, and a momentum coefficient of 0.9, which was initialized to 0.5 for the first few epochs. Learning was performed using a mini-batch scheme, with a mini-batch size of 4 patterns, for a total of 100 learning epochs (reconstruction error always converged). Sparse representations were encouraged by forcing the network’s internal representations to rely on a limited number of active hidden units, that is, by driving the probability q of a unit to be active toward a certain desired (low) probability p (Lee et al., 2008). For logistic units, this can be practically implemented by first calculating the quantity q − p, which is then multiplied by a scaling factor and added to the bias of each hidden unit at every weight update. When the sparsity constraint was applied, we always verified that the average activation of hidden units was indeed maintained below the desired level. All the simulations were performed using an efficient implementation of RBMs on graphic processors (Testolin et al., 2013). The complete source code is available for download¹.
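As a concrete sketch of this training scheme, the following NumPy code performs one CD-1 update with momentum, weight decay, and the bias-based sparsity adjustment described above. It is our own minimal illustration (initialization, momentum scheduling, and the sign convention of the sparsity term are our assumptions); the authors' actual GPU implementation is available at the link given in the footnote:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, vel, eta=0.03, momentum=0.9, decay=0.0002,
               sparsity_target=None, sparsity_scale=0.1):
    """One 1-step contrastive divergence update on a mini-batch v0
    (batch x n_visible). W is n_hidden x n_visible; b and c are the
    visible and hidden biases; vel holds momentum-smoothed gradients."""
    # Positive phase: hidden probabilities and sampled states given data.
    ph0 = sigmoid(v0 @ W.T + c)
    h0 = (np.random.rand(*ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step (reconstruction, then hidden probs).
    pv1 = sigmoid(h0 @ W + b)
    ph1 = sigmoid(pv1 @ W.T + c)
    n = v0.shape[0]
    # Hebbian-like gradient: data correlations minus model correlations.
    dW = (ph0.T @ v0 - ph1.T @ pv1) / n - decay * W
    db = (v0 - pv1).mean(axis=0)
    dc = (ph0 - ph1).mean(axis=0)
    if sparsity_target is not None:
        # Nudge hidden biases so mean activation q tracks the target p
        # (scaled q - p term, as in the text; sign convention is ours).
        q = ph0.mean(axis=0)
        dc -= sparsity_scale * (q - sparsity_target)
    for key, g in (('W', dW), ('b', db), ('c', dc)):
        vel[key] = momentum * vel[key] + eta * g
    W += vel['W']; b += vel['b']; c += vel['c']
    return W, b, c, vel
```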
Autoencoders
Similarly to RBMs, autoencoders rely on a single layer of nonlinear hidden units to compactly represent the statistical regularities of the training data. However, autoencoders are feed-forward, deterministic networks trained with error backpropagation (Bengio et al., 2007). The training data is presented to a layer of input units, and the learning goal is to accurately reconstruct the input vector into a separate output layer. An autoencoder is therefore composed of a set of encoding weights W_1 that are used to compute the activation of the hidden units h given the activation of the input units v, and a set of decoding weights W_2 that are used to compute the network reconstructions v_rec from the activations of the hidden units:

h = \sigma(W_1 v + c)

v\_rec = \sigma(W_2 h + b)
where b and c are the vectors of output and hidden unit
biases, and σ is the sigmoid function (see Figure 2B for a
graphical representation). The error function E to be minimized
corresponds to the average reconstruction error, which is
quantified by the sum across all output units of the squared
difference between the original and the reconstructed values:
E = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( v_k - v\_rec_k \right)^2 + \beta \, \Omega_{sparsity}
where K is the number of output units and N is the number of training patterns. Similarly to RBMs, sparse representations can be induced by adding to the cost function a regularization term \Omega_{sparsity} that takes a large value when the average activation value q of each hidden neuron diverges from a certain desired (low) value p. In particular, the sparsity constraint was implemented as the Kullback-Leibler divergence from q to p:

\Omega_{sparsity} = \sum_{i=1}^{H} KL(p \,\|\, q_i)
where H is the number of hidden units. As for RBMs, when sparsity was applied we always verified that the average activation of hidden units was indeed maintained below the desired level.
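A minimal NumPy sketch of one gradient step under this objective is given below; it is our own illustration (plain batch gradient descent, with the KL penalty folded into backpropagation), not the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sparse_autoencoder_step(v, W1, W2, b, c, eta=0.1, beta=0.1, p=0.05):
    """One gradient step on batch v (batch x K). W1 (hidden x K) encodes,
    W2 (K x hidden) decodes; b and c are output and hidden biases."""
    n = v.shape[0]
    h = sigmoid(v @ W1.T + c)        # encode
    v_rec = sigmoid(h @ W2.T + b)    # decode
    q = h.mean(axis=0)               # mean activation of each hidden unit
    # Gradient of the mean squared reconstruction error at the output.
    d_out = 2.0 * (v_rec - v) * v_rec * (1.0 - v_rec) / n
    # Gradient of beta * sum_i KL(p || q_i), distributed over the batch.
    d_kl = beta * (-p / q + (1.0 - p) / (1.0 - q)) / n
    d_hid = (d_out @ W2 + d_kl) * h * (1.0 - h)
    W2 -= eta * (d_out.T @ h)
    b -= eta * d_out.sum(axis=0)
    W1 -= eta * (d_hid.T @ v)
    c -= eta * d_hid.sum(axis=0)
    return W1, W2, b, c
```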
¹ http://ccnl.psy.unipd.it/research/deeplearning

References

Andersen, R. A., Essick, G. K., and Siegel, R. M. (1985). Encoding of spatial location by posterior parietal neurons. Science 230, 456–458.

Lee, T. S., and Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. J. Opt. Soc. Am. A 20, 1434–1448.

Ma, W. J., Beck, J. M., Latham, P. E., and Pouget, A. (2006). Bayesian inference with probabilistic population codes. Nat. Neurosci. 9, 1432–1438.

Vinje, W. E., and Gallant, J. L. (2000). Sparse coding and decorrelation in primary visual cortex during natural vision. Science 287, 1273–1276.

Zipser, D., and Andersen, R. A. (1988). A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature 331, 679–684.