Hedonic Value: Enhancing Adaptation for Motivated Agents
Ignasi Cos, Lola Cañamero, Gillian M. Hayes, Andrew Gillies**

Institute of Perception, Action and Behaviour, School of Informatics, University of Edinburgh, Informatics Forum, 10 Crichton Street, Edinburgh, EH8 9AB, Scotland, UK.

Adaptive Systems Research Group, School of Computer Science, University of Hertfordshire, Hatfield, Herts, AL10 9AB, UK.

** Institute of Artificial and Neural Computation, School of Informatics, University of Edinburgh, 5 Forrest Hill, Edinburgh, EH1 2QL, Scotland, UK.
Abstract
Reinforcement learning (RL) in the context of artificial agents is typically used to produce behavioural responses as a function of the reward obtained by interacting with the environment. When the problem consists of learning the shortest path to a goal, it is common to use reward functions that yield a fixed value after each decision, for example a positive value when the target location has been attained and a negative one at each intermediate step. However, this fixed strategy may be overly simplistic for agents that have to adapt to dynamic environments, in which resources may vary from time to time. By contrast, there is significant evidence that most living beings internally modulate reward value as a function of their context to expand their range of adaptivity. Inspired by the potential of this operation, we review its underlying processes and introduce a simplified formalisation for artificial agents. The performance of this formalism is tested by monitoring the adaptation of an agent endowed with a motivated actor-critic model, which embeds our formalisation of value and is constrained by physiological stability, to environments with different resource distributions. Our main result shows that the manner in which reward is internally processed as a function of the agent's motivational state strongly influences the adaptivity of the behavioural cycles generated and the agent's physiological stability.
Keywords: Hedonic Value, Motivation, Reinforcement Learning, Actor-Critic, Grounding.
Current Address: ISIR, Université Pierre et Marie Curie, 4 Place Jussieu, 75005 Paris, France. Email: ignasi.cos@isir.upmc.fr

1 Introduction
Although it is possible to learn efficient behavioural sequences in the context of reinforcement learn-
ing (RL) by propagating value backwards to previously visited states (Sutton and Barto, 1998), this
procedure by itself may fall short in the more demanding context of a motivated agent that has to
adapt to and survive in a situated, dynamic environment. If an agent has several needs, a reasonable
strategy to survive may consist of learning behavioural patterns that prioritise the compensation of
an internal resource over another, as a function of what the environment affords and of the internal
rate of consumption of each resource, in a similar fashion to most animals. However, reaching this behavioural flexibility via RL in changing environments may demand mechanisms that modulate the criterion of internal assessment so as to influence behaviour in an adaptive manner. An example of this is observed in the level of pleasure associated with food consumption, which varies a great deal as a function of the level of hunger (Shizgal, 1997). While an empty stomach typically reinforces the pleasure and urge with which a meal is consumed, this pleasure progressively diminishes, and sometimes even reverses, as one gradually satiates. Although the modulation of value by the physiological state is a well-known phenomenon, often studied in the context of stimulus devaluation (Dittrich and Klauer, 2011; Eder and Rothermund, 2008), we still lack a unified view of its underlying mechanisms that would facilitate devising procedures to modulate behaviour in RL for artificial agents (Konidaris and Barto, 2006).
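As an illustrative sketch of this contrast (not the reward formula used in this paper; the function names, the linear scaling and the satiation threshold are assumptions chosen purely for illustration), consider a fixed shortest-path reward next to one modulated by a hunger-like state, which can even reverse sign past satiation:

```python
# Illustrative sketch only (not the paper's reward formula): a fixed shortest-path
# reward versus a reward modulated by a hunger-like internal state, which can even
# reverse sign once the agent is over-sated. Names and constants are assumptions.

STEP_COST = -0.01          # small fixed penalty per intermediate step
SATIATION_POINT = 0.2      # hunger level below which eating becomes aversive (assumed)

def fixed_reward(at_goal: bool) -> float:
    """Classic fixed scheme: the same values regardless of internal state."""
    return 1.0 if at_goal else STEP_COST

def modulated_reward(ate_food: bool, hunger: float) -> float:
    """Reward for eating scaled by hunger in [0, 1]; negative when nearly sated."""
    if not ate_food:
        return STEP_COST
    return hunger - SATIATION_POINT    # > 0 when hungry, < 0 once over-sated

print(modulated_reward(True, hunger=0.9))   # 0.7 -> strongly reinforcing
print(modulated_reward(True, hunger=0.05))  # about -0.15 -> mildly aversive (devaluation)
```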
Related to this, a great number of studies in neurophysiology have been devoted to investigating the underlying mechanisms of hedonic value (HV) (Smith et al., 2011; Grabenhorst et al., 2008).
Loosely speaking, HV refers to the subjective, internally perceived value resulting from any given
interaction. It may influence decision-making and adaptation, and may vary as a function of the state
of the animal and of its perception of the environment. Although hedonic phenomena encompass a
large number of brain areas whose implication is under current investigation (Rolls, 2004; Damasio,
2000), there is some consensus that the neural encoding of HV involves recurrent projections between
ventral striatum, amygdala and prefrontal areas (Alexander et al., 1990), such as orbito-frontal, ante-
rior cingulate or dorso-lateral pre-frontal cortices (Reynolds and O’Reilly, 2009; Hazy et al., 2007;
Tanji and Hoshi, 2001), as well as the dorso-lateral striatum (Guitart-Masip et al., 2011; Rolls, 2004).
We dedicate the next section to reviewing some of this evidence, which inspired us to propose an artificial implementation of a modulatory mechanism of value based on these principles. However, our goal is to investigate the contribution to adaptation of a mechanism of value modulation grounded in the situated nature of our agent (Cañamero, 1997; Wilson, 1991), rather than to provide a descriptive model of the parts of the brain involved in this process. This mechanism would be tailored to each sort of stimulus, experience or sensory modality that biases decision-making across candidate options, and to its associated reinforcement learning process (McClure et al., 2003). Our proposal may therefore be viewed as an extension of the ecological relationship between the agent and its environment (Gibson, 1986; Pfeifer, 1996) to include the dynamics of each internal variable, the resources offered by the environment to replenish them, and the range of policies the agent can possibly learn.
We assume that value modulation varies within a specific range for each resource, expressed along
a single scale that makes it possible to reconcile assessments across dissimilar options (Grabenhorst
and Rolls, 2011; Gurney et al., 1998).
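One simple way to read this single-scale assumption is as a normalisation step that maps each resource-specific assessment onto a common bounded range before options are compared; the min-max mapping below is an illustrative choice, not the formula used in this paper:

```python
# Illustrative normalisation onto a common scale, so that assessments of
# dissimilar resources (e.g. food vs. water) become directly comparable.
# The ranges and the min-max form are assumptions, not the paper's formula.

def to_common_scale(value: float, lo: float, hi: float) -> float:
    """Map a resource-specific assessment into the shared [0, 1] scale."""
    return (value - lo) / (hi - lo)

food  = to_common_scale(3.0,  lo=0.0, hi=10.0)   # 0.3
water = to_common_scale(40.0, lo=0.0, hi=50.0)   # 0.8
print(max([("food", food), ("water", water)], key=lambda kv: kv[1]))  # ('water', 0.8)
```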
To test our formulation of hedonic value, we expanded a motivated architecture (Cos et al., 2010),
which initially focused on the dynamics of perception only, with an actor-critic algorithm to learn
decision-making strategies (Sutton and Barto, 1981). Therein, our neuro-inspired notion of subjec-
tive assessment is implemented as a value function, and is tested by learning behavioural responses
to different stimuli and physiological states in a manner compliant with the hypothesis of phasic
dopamine as an error signal (Khamassi et al., 2005; McClure et al., 2003; Schultz et al., 2000; Houk et al., 1995). The adaptivity of the agent has been assessed in terms of its physiological stability (Ashby, 1965), as a function of its response to changes in the availability of resources in the environment. The results show the influence of the subjective interpretation of reward during adaptation,
and importantly, the dependence of the behavioural patterns and of the agent’s physiological stability
on the agent’s subjective view of the environment.
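For reference, the sketch below shows a generic tabular actor-critic in which the temporal-difference error plays the role attributed to phasic dopamine; it is a textbook formulation with assumed parameters, not the specific architecture described in Section 3:

```python
import numpy as np

# Generic tabular actor-critic sketch (textbook form, not this paper's architecture).
# The TD error 'delta' plays the role attributed above to phasic dopamine.

n_states, n_actions = 10, 4
V = np.zeros(n_states)                      # critic: state-value estimates
prefs = np.zeros((n_states, n_actions))     # actor: action preferences
alpha_v, alpha_p, gamma = 0.1, 0.1, 0.95    # learning rates and discount (assumed)

def select_action(s: int) -> int:
    """Softmax policy over the actor's preferences for state s."""
    p = np.exp(prefs[s] - prefs[s].max())
    p /= p.sum()
    return int(np.random.choice(n_actions, p=p))

def update(s: int, a: int, r: float, s_next: int) -> float:
    """One learning step driven by the TD (dopamine-like) error."""
    delta = r + gamma * V[s_next] - V[s]    # reward-prediction error
    V[s] += alpha_v * delta                 # critic: improve the value estimate
    prefs[s, a] += alpha_p * delta          # actor: reinforce or punish the chosen action
    return delta
```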
2 Background and Related Research
Value-based decisions are based on differences of expected value (Wallis, 2012; Kennerley et al.,
2011; Wallis and Miller, 2003), and their related learning algorithms are sensitive to the matching
between expected reward and outcome value (McClure et al., 2003; Houk et al., 1995). However, differences in value between options are not solely dependent on actions and stimuli; they also include a component of subjective perception, constrained by the specifics of an individual and its relationship with the environment (Pfeifer, 1996; Gibson, 1986). Here we describe how this additional dimension can be exploited to enhance adaptivity.
In general, most attention in RL has been devoted to structuring the problem in a tractable manner, typically by devising an efficient hierarchy of motor primitives, either imposed by prior design constraints (Matarić and Brooks, 1990) or obtained by self-adaptation on the basis of sensorimotor interaction (Toussaint, 2003). The result is a dramatic decrease in the learning interval, owing to the reduced dimensionality of the RL state space, and a more parsimonious behaviour. In a complementary fashion, some of the architectures best suited to reproduce aspects of animal behaviour have incorporated the dynamics of interaction between the animal and the statistics of the environment into the operation of their adaptation mechanisms (Konidaris and Hayes, 2005; Matarić and Brooks, 1990). As an extension to these, we propose to endow motivated agents with an additional element of behaviour control, namely a mechanism of internal assessment, constrained by the agent's condition of situatedness (Cos et al., 2010; Velásquez, 1998; Cañamero, 1997; Wilson, 1991). In the same manner that both perception and value-based (or reward-based) decision-making are influenced by internal physiological processes, our proposal of subjective assessment is founded on the valuation of the physiological effects of executed behaviours (Dickinson and Balleine, 2001), organised in cycles of sensorimotor interaction (McFarland and Sibly, 1975; Seth, 2000). In a continuous fashion, drives may express an urge for action (Hull, 1943) and incorporate internal information to give rise to the agent's motivations, which may exert a direct influence on the saliency of the agent's behaviours, as part of the internal-external dynamics seeking physiological stability (Ashby, 1965). Motivation, defined in several contexts and disciplines (McDougall, 1913; Freud, 1940; Tinbergen, 1951; Lorenz, 1966), often with different nuances, has always conveyed the role of "a substance, capable of energising behaviour, held back in a container and subsequently released in action" (Hinde, 1971, 1960), hence relating physiology to behaviour.
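The drive and motivation terminology used above can be grounded with a minimal numeric sketch: a drive read as the deficit of a homeostatic variable with respect to its set point, and a motivation as that drive combined with an external incentive cue. This is a loose, Hull-style reading under assumed names and a multiplicative form, not the model introduced in Section 3:

```python
# Loose Hull-style sketch (assumptions, not this paper's equations): a drive is
# the deficit of a homeostatic variable from its set point, and a motivation
# combines that drive with the presence of an external incentive cue.

def drive(level: float, set_point: float = 1.0) -> float:
    """Deficit of a homeostatic variable, clipped at zero."""
    return max(0.0, set_point - level)

def motivation(level: float, cue: float) -> float:
    """Drive amplified by an external cue in [0, 1] (multiplicative form assumed)."""
    return drive(level) * (1.0 + cue)

# A depleted variable together with a visible resource yields the most salient motivation.
print(motivation(level=0.2, cue=1.0))  # 1.6
print(motivation(level=0.9, cue=0.0))  # about 0.1
```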
Intrinsically, the notion of value includes an objective and a subjective component: the objective part is independent of the physiological state, whereas the subjective part depends on it (Grabenhorst and Rolls, 2011; Conover et al., 1994; Conover and Shizgal, 1994). Revisiting the adaptation processes implemented by some of the aforementioned robotic architectures, most of them could be classified as based on objective value, as they do not explicitly incorporate motivation into their modulation of value. An exception, however, is the Hullian drive-based architecture of Konidaris and Barto (2006), which combines the expression of intended purpose characteristic of any motivated architecture with the learning of priorities by reinforcement learning. As a novelty, this architecture includes a procedure of internal modulation that biases its internal motivations as a function of the environment's statistics, e.g., scarce resources result in an overexpressed related motivation, showing that this may lead to better adaptation. Likewise, Coninx et al. (2008) extended Konidaris' architecture with a model of the basal ganglia (Girard et al., 2008) that arbitrates between several actions, showing that two different policies arise in two different environments. Although these architectures, which use fixed reward formulae, do not explicitly address the notion of hedonic value, they strongly suggest that mechanisms of internal modulation do influence the overall behaviour of the robot. Along a different line of research, Damoulas et al. (2005) proposed an RL context wherein genetic algorithms were used to evolve an interpretation of physiological effect in the form of Q-values (Sutton and Barto, 1998), showing that only a small fraction of agents yielded physiologically stable behaviour to be transferred to the next generation. In conclusion, Damoulas et al. (2005) showed that adaptivity depends on the manner in which reward is assessed over generations.
a complementary fashion to these studies, we propose to investigate the role of a subjective inter-
pretation of reward, modulated by the environment and by the internal dynamics of the agent, which
influences the manner in which reward is used by an actor-critic to adapt behaviour to the environ-
ment. Finally, seeking the maximisation of reward and the avoidance of penalty, other architectures
have included principles of contextual grounding and RL to learn behavioural policies adapted to a
certain environment (Butz et al., 2010).
A brief summary of the neuroscience of decision-making and hedonic value. Despite progress in characterising the neural organisation of the brain and the different processes encompassing decision-making, our knowledge about the interplay of the neural structures implicated in hedonic value remains incomplete (Wallis, 2012; Kennerley et al., 2011). Recent evidence has gradually revealed a complex and subtle organisation of the different factors and sub-roles implicated in decision-making: expected outcome, energetic cost, hedonic value, time, risk or confidence in the decision, each associated with specific brain areas. Numerous experiments have been performed to characterise the operation of different brain areas during the learning of stimulus-response (SR) and behaviour-reward relationships (Balleine and O'Doherty, 2010), showing that the brain areas mainly responsible for these functions are the pre-frontal cortex (Balleine and O'Doherty, 2010) and the ventral striatum (Guitart-Masip et al., 2011). In the following, we summarise some of the main aspects of neural encoding relevant to hedonic value:
• The main brain area specialised in encoding hedonic value independently of behaviour is the Orbito-Frontal Cortex (OFC). The OFC is an area of integration of multimodal sensory and limbic information (Wallis, 2012; Cardinal et al., 2002), receiving major afferents from sensory cortex, hypothalamus, dorsal and ventral striatum and amygdala (Rolls, 2005, 2004). Consistent with the operation of hedonic assessment, which depends on the internal physiology, neurons in the OFC activate with pleasant or positively hedonic stimuli and are sensitive to stimulus devaluation after satiation (Grabenhorst and Rolls, 2011). Furthermore, OFC neurons can rapidly reverse their responses to a visual stimulus, depending on whether its previous association was rewarding or punitive (Grabenhorst and Rolls, 2011). In addition to the OFC, neighbouring areas such as the Anterior Cingulate Cortex (ACC) and the Lateral Pre-Frontal Cortex (LPFC) have been shown to encode aspects of value related to the energy cost of the candidate options of different goal-directed actions (Kennerley et al., 2011; Hazy et al., 2007).
• Absolute vs. Relative Reward. The encoding of hedonic value cannot solely be attributed to the OFC, as several brain areas exert different functions during the perception-action loop that require this notion. Most important may be the binding between action and value attributed to the ACC (Grabenhorst and Rolls, 2011; Quilodran et al., 2008). While the OFC encodes the absolute value associated with stimuli via projections from the amygdala and the ventral striatum (nucleus accumbens), its projections to the ACC modulate activity and value as a function of external stimuli, incorporating a component of cost into the encoding of value. For example, value representations in the ACC are sensitive to the presence of fat in food (Shizgal, 1997; Conover et al., 1994). Furthermore, the nature of value encoding is constrained by the need to implement decisions, hence requiring comparisons across often dissimilar options. Therefore, although areas unrelated to the execution of behaviour, such as the OFC, encode value in absolute terms, values in areas that also encode motor responses, such as PPC and PMd, are relative, so as to make comparisons possible (Chib et al., 2009).
• Although the involvement of the nigrostriatal circuitry in the arbitration of behaviour has long been established (Redgrave et al., 1999; Houk et al., 1995), its relation to hedonic value is still under investigation. A recent study has shown a specialisation of the medial part of the substantia nigra pars compacta (SNc) and the ventral tegmental area (VTA) for the encoding of hedonic value (Guitart-Masip et al., 2011), and of the nucleus accumbens (ventral striatum) for prediction errors (Rushworth et al., 2009; Hare et al., 2008; Rolls and Grabenhorst, 2008), probably as a means to learn behavioural policies in an actor-critic-like manner (Li and Daw, 2012; McClure et al., 2003; Houk et al., 1995). These areas modulate value by projecting to the ventro-medial cortex (vmOFC), fine-tuning value predictions of future actions (Peters and Buechel, 2010; Samejima et al., 2005). Different aspects of value are also encoded by different brain regions: while those aspects of value related to the goal may be attributed to the mOFC, mPFC and amygdala, those related to the decision are attributed to the central OFC (cOFC) (Kable and Glimcher, 2007; Schaeffer and Rotte, 2007). In this context, the contribution of the basal ganglia is the correction of value-prediction errors and the control of the intensity with which the behaviour is elicited (Jin and Costa, 2010; Turner and Desmurget, 2010).
In summary, a full description of HV, or the sense of valency, as this notion is also referred to in AI (Ackley and Littman, 1991), would require a descriptive model of the recurrent activity across OFC, ACC and the ventral striatum. However, rather than describing the neural activity across these brain areas, we deemed it more useful, for application to artificial agents, to build a model that captures the principle of operation of HV. In this light, the main principle of our model consists of making the modulation of value dependent on the loop of interaction between the agent's homeostasis and the sensorimotor cycle. In other words, value depends not only on the agent's internal state, but also on the perception the agent may build of its environment. For example, the agent will not value a consummatory action in the same manner when its internal level of energy is close to satiation as when it is close to depletion, and its valuation will also vary depending on whether the necessary resource is easily attainable or scarce. This set of principles is captured by a reward formula embedded into an actor-critic RL algorithm, which aims at maintaining the agent's internal physiology within the boundaries that permit the agent's operation (Cañamero, 1997); see Sections 3.1 and 3.2. We have tested the performance of this formulation by monitoring the resulting decision-making patterns of an artificial agent endowed with this formulation of HV in several simulated environments. Overall, the results show that a subjective component in the assessment of value increases learning speed and enhances the stability of the behavioural patterns.
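As a rough sketch of this principle (with assumed variable names and a multiplicative form, not the reward formula defined in Sections 3.1 and 3.2), the hedonic reward of a consummatory act can be made to grow both with the internal deficit and with the perceived scarcity of the resource:

```python
# Rough sketch of the principle above (assumptions, not this paper's formula):
# the hedonic reward of consuming a resource grows with the internal deficit
# and with the agent's estimate of how scarce that resource is.

def hedonic_reward(deficit: float, scarcity: float) -> float:
    """Reward for a consummatory act.

    deficit  -- distance of the relevant homeostatic variable from its set point, in [0, 1]
    scarcity -- the agent's estimate of how hard the resource is to find, in [0, 1]
    """
    return deficit * (1.0 + scarcity)

# The same meal is worth most to a depleted agent in a resource-poor environment.
print(hedonic_reward(deficit=0.8, scarcity=0.7))  # about 1.36
print(hedonic_reward(deficit=0.8, scarcity=0.0))  # 0.8
print(hedonic_reward(deficit=0.1, scarcity=0.7))  # about 0.17
```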
3 Theoretical Model
The model consists of four main parts: a module of artificial physiology (top right, Fig. 1), which is an abstraction of internal physiological processes; a perception module (bottom right, Fig. 1), which provides grounded knowledge about the behaviours afforded by each object near the agent; a motivated actor-critic module (centre, Fig. 1), which learns behavioural patterns adapted to the environment; and a module to calculate hedonic value (see Value Function, centre-top, Fig. 1). Each module is described in the next sections; however, we have considered it appropriate to introduce a
