Hedonic Value: Enhancing Adaptation for Motivated Agents
Ignasi Cos, Lola Cañamero, Gillian M. Hayes, Andrew Gillies**

Institute of Perception, Action and Behaviour, School of Informatics, University of Edinburgh, Informatics Forum, 10 Crichton Street, Edinburgh, EH8 9AB, Scotland, UK.

Adaptive Systems Research Group, School of Computer Science, University of Hertfordshire, Hatfield, Herts, AL10 9AB, UK.

** Institute of Artificial and Neural Computation, School of Informatics, University of Edinburgh, 5 Forrest Hill, Edinburgh, EH1 2QL, Scotland, UK.
Abstract
Reinforcement learning (RL) in the context of artificial agents is typically used to produce behavioural responses as a function of the reward obtained by interacting with the environment. When the problem consists of learning the shortest path to a goal, it is common to use reward functions that yield a fixed value after each decision, for example a positive value when the target location has been attained and a negative one at each intermediate step. However, this fixed strategy may be overly simplistic for agents that have to adapt to dynamic environments, in which resources may vary from time to time. By contrast, there is significant evidence that most living beings internally modulate reward value as a function of their context to expand their range of adaptivity. Inspired by the potential of this operation, we review its underlying processes and introduce a simplified formalisation for artificial agents. The performance of this formalism is tested by monitoring the adaptation of an agent endowed with a motivated actor-critic model, which embeds our formalisation of value and is constrained by physiological stability, to environments with different resource distributions. Our main result shows that the manner in which reward is internally processed as a function of the agent's motivational state strongly influences the adaptivity of the behavioural cycles generated and the agent's physiological stability.
Keywords: Hedonic Value, Motivation, Reinforcement Learning, Actor-Critic, Grounding.
Current Address: ISIR, Université Pierre et Marie Curie, 4 Place Jussieu, 75005 Paris, France. Email: ignasi.cos@isir.upmc.fr

1 Introduction
Although it is possible to learn efficient behavioural sequences in the context of reinforcement learn-
ing (RL) by propagating value backwards to previously visited states (Sutton and Barto, 1998), this
procedure by itself may fall short in the more demanding context of a motivated agent that has to
adapt to and survive in a situated, dynamic environment. If an agent has several needs, a reasonable
strategy to survive may consist of learning behavioural patterns that prioritise the compensation of
an internal resource over another, as a function of what the environment affords and of the internal
rate of consumption of each resource, in a similar fashion to most animals. However, reaching this behavioural flexibility via RL in changing environments may demand mechanisms that modulate the criterion of internal assessment so as to influence behaviour in an adaptive manner. An example of this is observed in the level of pleasure associated with food consumption, which varies a great deal as a function of the level of hunger (Shizgal, 1997). While an empty stomach typically reinforces the pleasure and urge with which a meal is consumed, this pleasure progressively diminishes, and sometimes even reverses, as one gradually satiates. Although the modulation of value by the physiological state is a well-known phenomenon, often studied in the context of stimulus devaluation (Dittrich and Klauer, 2011; Eder and Rothermund, 2008), we still lack a unified view of its underlying mechanisms that would facilitate devising procedures to modulate behaviour in RL for artificial agents (Konidaris and Barto, 2006).
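As an illustrative sketch of this contrast (not the reward formula used in this paper; the function names, the linear scaling and the satiation threshold are assumptions chosen purely for illustration), consider a fixed shortest-path reward next to one modulated by a hunger-like state, which can even reverse sign past satiation:

```python
# Illustrative sketch only (not the paper's reward formula): a fixed shortest-path
# reward versus a reward modulated by a hunger-like internal state, which can even
# reverse sign once the agent is over-sated. Names and constants are assumptions.

STEP_COST = -0.01          # small fixed penalty per intermediate step
SATIATION_POINT = 0.2      # hunger level below which eating becomes aversive (assumed)

def fixed_reward(at_goal: bool) -> float:
    """Classic fixed scheme: the same values regardless of internal state."""
    return 1.0 if at_goal else STEP_COST

def modulated_reward(ate_food: bool, hunger: float) -> float:
    """Reward for eating scaled by hunger in [0, 1]; negative when nearly sated."""
    if not ate_food:
        return STEP_COST
    return hunger - SATIATION_POINT    # > 0 when hungry, < 0 once over-sated

print(modulated_reward(True, hunger=0.9))   # 0.7 -> strongly reinforcing
print(modulated_reward(True, hunger=0.05))  # about -0.15 -> mildly aversive (devaluation)
```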
Related to this, a great number of studies in neurophysiology have been devoted to investigating the underlying mechanisms of hedonic value (HV) (Smith et al., 2011; Grabenhorst et al., 2008).
Loosely speaking, HV refers to the subjective, internally perceived value resulting from any given
interaction. It may influence decision-making and adaptation, and may vary as a function of the state
of the animal and of its perception of the environment. Although hedonic phenomena encompass a
large number of brain areas whose implication is under current investigation (Rolls, 2004; Damasio,
2000), there is some consensus that the neural encoding of HV involves recurrent projections between
ventral striatum, amygdala and prefrontal areas (Alexander et al., 1990), such as orbito-frontal, ante-
rior cingulate or dorso-lateral pre-frontal cortices (Reynolds and O’Reilly, 2009; Hazy et al., 2007;
Tanji and Hoshi, 2001), as well as the dorso-lateral striatum (Guitart-Masip et al., 2011; Rolls, 2004).
We dedicate the next section to reviewing some of this evidence, which inspired us to propose an artificial implementation of a modulatory mechanism of value based on these principles. However, our goal is to investigate the contribution to adaptation of a mechanism of value modulation grounded in the situated nature of our agent (Cañamero, 1997; Wilson, 1991), rather than to provide a descriptive model of the parts of the brain involved in this process. This mechanism would be tailored to each sort of stimulus, experience or sensory modality that biases decision-making across candidate options, and to its associated reinforcement learning process (McClure et al., 2003). Our proposal may therefore be viewed as an extension of the ecological relationship between the agent and its environment (Gibson, 1986; Pfeifer, 1996) to include the dynamics of each internal variable, the resources offered by the environment to replenish them, and the range of policies the agent can possibly learn.
We assume that value modulation varies within a specific range for each resource, expressed along
a single scale that makes it possible to reconcile assessments across dissimilar options (Grabenhorst
and Rolls, 2011; Gurney et al., 1998).
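One simple way to read this single-scale assumption is as a normalisation step that maps each resource-specific assessment onto a common bounded range before options are compared; the min-max mapping below is an illustrative choice, not the formula used in this paper:

```python
# Illustrative normalisation onto a common scale, so that assessments of
# dissimilar resources (e.g. food vs. water) become directly comparable.
# The ranges and the min-max form are assumptions, not the paper's formula.

def to_common_scale(value: float, lo: float, hi: float) -> float:
    """Map a resource-specific assessment into the shared [0, 1] scale."""
    return (value - lo) / (hi - lo)

food  = to_common_scale(3.0,  lo=0.0, hi=10.0)   # 0.3
water = to_common_scale(40.0, lo=0.0, hi=50.0)   # 0.8
print(max([("food", food), ("water", water)], key=lambda kv: kv[1]))  # ('water', 0.8)
```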
To test our formulation of hedonic value, we expanded a motivated architecture (Cos et al., 2010),
which initially focused on the dynamics of perception only, with an actor-critic algorithm to learn
decision-making strategies (Sutton and Barto, 1981). Therein, our neuro-inspired notion of subjec-
tive assessment is implemented as a value function, and is tested by learning behavioural responses
to different stimuli and physiological states in a manner compliant with the hypothesis of phasic
dopamine as an error signal (Khamassi et al., 2005; McClure et al., 2003; Schultz et al., 2000; Houk et al., 1995). The adaptivity of the agent has been assessed in terms of its physiological stability (Ashby, 1965), as a function of its response to changes in the availability of resources in the environment. The results show the influence of the subjective interpretation of reward during adaptation,
and importantly, the dependence of the behavioural patterns and of the agent’s physiological stability
on the agent’s subjective view of the environment.
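For reference, the sketch below shows a generic tabular actor-critic in which the temporal-difference error plays the role attributed to phasic dopamine; it is a textbook formulation with assumed parameters, not the specific architecture described in Section 3:

```python
import numpy as np

# Generic tabular actor-critic sketch (textbook form, not this paper's architecture).
# The TD error 'delta' plays the role attributed above to phasic dopamine.

n_states, n_actions = 10, 4
V = np.zeros(n_states)                      # critic: state-value estimates
prefs = np.zeros((n_states, n_actions))     # actor: action preferences
alpha_v, alpha_p, gamma = 0.1, 0.1, 0.95    # learning rates and discount (assumed)

def select_action(s: int) -> int:
    """Softmax policy over the actor's preferences for state s."""
    p = np.exp(prefs[s] - prefs[s].max())
    p /= p.sum()
    return int(np.random.choice(n_actions, p=p))

def update(s: int, a: int, r: float, s_next: int) -> float:
    """One learning step driven by the TD (dopamine-like) error."""
    delta = r + gamma * V[s_next] - V[s]    # reward-prediction error
    V[s] += alpha_v * delta                 # critic: improve the value estimate
    prefs[s, a] += alpha_p * delta          # actor: reinforce or punish the chosen action
    return delta
```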
2 Background and Related Research
Value-based decisions are based on differences of expected value (Wallis, 2012; Kennerley et al.,
2011; Wallis and Miller, 2003), and their related learning algorithms are sensitive to the matching
between expected reward and outcome value (McClure et al., 2003; Houk et al., 1995). However, differences in value between options are not solely dependent on actions and stimuli; they also include a component of subjective perception, constrained by the specifics of an individual and its relationship with the environment (Pfeifer, 1996; Gibson, 1986). Here we describe how this additional dimension can be exploited to enhance adaptivity.
In general, most attention in RL has been devoted to structuring the problem in a tractable manner, typically by devising an efficient hierarchy of motor primitives, either imposed by prior design constraints (Matarić and Brooks, 1990) or obtained by self-adaptation on the basis of sensorimotor interaction (Toussaint, 2003). The result is a dramatic decrease in the learning interval, owing to the reduced dimensionality of the RL state space, and a more parsimonious behaviour. In a complementary fashion, some of the architectures best suited to reproduce aspects of animal behaviour have incorporated the dynamics of interaction between the animal and the statistics of the environment into the operation of their adaptation mechanisms (Konidaris and Hayes, 2005; Matarić and Brooks, 1990). As an extension to these, we propose to endow motivated agents with an additional element of behaviour control, namely a mechanism of internal assessment, constrained by the agent's condition of situatedness (Cos et al., 2010; Velásquez, 1998; Cañamero, 1997; Wilson, 1991). In the same manner that both perception and value-based (or reward-based) decision-making are influenced by internal physiological processes, our proposal of subjective assessment is founded on the valuation of the physiological effects of executed behaviours (Dickinson and Balleine, 2001), organised in cycles of sensorimotor interaction (McFarland and Sibly, 1975; Seth, 2000). In a continuous fashion, drives may express an urge for action (Hull, 1943) and incorporate internal information to give rise to the agent's motivations, which may exert a direct influence on the saliency of the agent's behaviours, as part of the internal-external dynamics seeking physiological stability (Ashby, 1965). Motivation, defined in several contexts and disciplines (McDougall, 1913; Freud, 1940; Tinbergen, 1951; Lorenz, 1966), often with different nuances, has always conveyed the role of "a substance, capable of energising behaviour, held back in a container and subsequently released in action" (Hinde, 1971, 1960), hence relating physiology to behaviour.
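The drive and motivation terminology used above can be grounded with a minimal numeric sketch: a drive read as the deficit of a homeostatic variable with respect to its set point, and a motivation as that drive combined with an external incentive cue. This is a loose, Hull-style reading under assumed names and a multiplicative form, not the model introduced in Section 3:

```python
# Loose Hull-style sketch (assumptions, not this paper's equations): a drive is
# the deficit of a homeostatic variable from its set point, and a motivation
# combines that drive with the presence of an external incentive cue.

def drive(level: float, set_point: float = 1.0) -> float:
    """Deficit of a homeostatic variable, clipped at zero."""
    return max(0.0, set_point - level)

def motivation(level: float, cue: float) -> float:
    """Drive amplified by an external cue in [0, 1] (multiplicative form assumed)."""
    return drive(level) * (1.0 + cue)

# A depleted variable together with a visible resource yields the most salient motivation.
print(motivation(level=0.2, cue=1.0))  # 1.6
print(motivation(level=0.9, cue=0.0))  # about 0.1
```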
Intrinsically, the notion of value includes an objective and a subjective component: the objective part is independent of the physiological state, whereas the subjective part depends on it (Grabenhorst and Rolls, 2011; Conover et al., 1994; Conover and Shizgal, 1994). Revisiting the adaptation processes implemented by some of the aforementioned robotic architectures, most of them could be classified as based on objective value, as they do not explicitly incorporate motivation into their modulation of value. An exception, however, is the Hullian drive-based architecture of Konidaris and Barto (2006), which combines the expression of intended purpose characteristic of any motivated architecture with the learning of priorities by reinforcement learning. As a novelty, this architecture includes a procedure of internal modulation that biases its internal motivations as a function of the environment's statistics, e.g., scarce resources result in an overexpressed related motivation, showing that this may lead to better adaptation. Likewise, Coninx et al. (2008) extended Konidaris' architecture with a model of the basal ganglia (Girard et al., 2008) that arbitrates between several actions, showing that two different policies arise in two different environments. Although these architectures, which use fixed reward formulae, do not explicitly address the notion of hedonic value, they strongly suggest that mechanisms of internal modulation do influence the overall behaviour of the robot. Along a different line of research, Damoulas et al. (2005) proposed an RL context wherein genetic algorithms were used to evolve an interpretation of physiological effect in the form of Q-values (Sutton and Barto, 1998), showing that only a small fraction of agents yielded physiologically stable behaviour to be transferred to the next generation. In conclusion, Damoulas et al. (2005) showed that adaptivity depends on the manner in which reward is assessed over generations.
a complementary fashion to these studies, we propose to investigate the role of a subjective inter-
pretation of reward, modulated by the environment and by the internal dynamics of the agent, which
influences the manner in which reward is used by an actor-critic to adapt behaviour to the environ-
ment. Finally, seeking the maximisation of reward and the avoidance of penalty, other architectures
have included principles of contextual grounding and RL to learn behavioural policies adapted to a
certain environment (Butz et al., 2010).
A brief summary of the neuroscience of decision-making and hedonic value. Despite progress in characterising the neural organisation of the brain and the different processes encompassing decision-making, our knowledge about the interplay of the neural structures implicated in hedonic value remains incomplete (Wallis, 2012; Kennerley et al., 2011). Recent evidence has gradually revealed a complex and subtle organisation of the different factors and sub-roles implicated in decision-making: expected outcome, energetic cost, hedonic value, time, risk or confidence in the decision, each associated with specific brain areas. Numerous experiments have been performed to characterise the operation of different brain areas during the learning of stimulus-response (SR) and behaviour-reward relationships (Balleine and O'Doherty, 2010), showing that the brain areas mainly responsible for these functions are the pre-frontal cortex (Balleine and O'Doherty, 2010) and the ventral striatum (Guitart-Masip et al., 2011). In the following, we summarise some of the main aspects of neural encoding relevant to hedonic value:
• The main brain area specialised in encoding hedonic value independently of behaviour is the Orbito-Frontal Cortex (OFC). The OFC is an area of integration of multimodal sensory and limbic information (Wallis, 2012; Cardinal et al., 2002), receiving major afferents from sensory cortex, hypothalamus, dorsal and ventral striatum and amygdala (Rolls, 2005, 2004). Consistent with the operation of hedonic assessment, which depends on the internal physiology, neurons in the OFC activate with pleasant or positively hedonic stimuli and are sensitive to stimulus devaluation after satiation (Grabenhorst and Rolls, 2011). Furthermore, OFC neurons can rapidly reverse their responses to a visual stimulus, depending on whether its previous association was rewarding or punitive (Grabenhorst and Rolls, 2011). In addition to the OFC, neighbouring areas such as the Anterior Cingulate Cortex (ACC) and the Lateral Pre-Frontal Cortex (LPFC) have been shown to encode aspects of value related to the energy cost of the candidate options of different goal-directed actions (Kennerley et al., 2011; Hazy et al., 2007).
• Absolute vs. Relative Reward. The encoding of hedonic value cannot solely be attributed to the OFC, as several brain areas exert different functions during the perception-action loop that require this notion. Most important may be the binding between action and value attributed to the ACC (Grabenhorst and Rolls, 2011; Quilodran et al., 2008). While the OFC encodes the absolute value associated with stimuli via projections from the amygdala and the ventral striatum (nucleus accumbens), its projections to the ACC modulate activity and value as a function of external stimuli, incorporating a component of cost into the encoding of value. For example, value representations in the ACC are sensitive to the presence of fat in food (Shizgal, 1997; Conover et al., 1994). Furthermore, the nature of value encoding is constrained by the need to implement decisions, hence requiring comparisons across often dissimilar options. Therefore, although areas unrelated to the execution of behaviour, such as the OFC, encode value in absolute terms, values in areas that also encode motor responses, such as PPC and PMd, are relative, so as to make comparisons possible (Chib et al., 2009).
• Although the involvement of the nigrostriatal circuitry in the arbitration of behaviour has long been established (Redgrave et al., 1999; Houk et al., 1995), its relation to hedonic value is still under investigation. A recent study has shown a specialisation of the medial part of the substantia nigra pars compacta (SNc) and the ventral tegmental area (VTA) for the encoding of hedonic value (Guitart-Masip et al., 2011), and of the nucleus accumbens (ventral striatum) for prediction errors (Rushworth et al., 2009; Hare et al., 2008; Rolls and Grabenhorst, 2008), probably as a means to learn behavioural policies in an actor-critic-like manner (Li and Daw, 2012; McClure et al., 2003; Houk et al., 1995). These areas modulate value by projecting to the ventro-medial cortex (vmOFC), fine-tuning value predictions of future actions (Peters and Buechel, 2010; Samejima et al., 2005). Different aspects of value are also encoded by different brain regions: while those aspects of value related to the goal may be attributed to the mOFC, mPFC and amygdala, those related to the decision are attributed to the central OFC (cOFC) (Kable and Glimcher, 2007; Schaeffer and Rotte, 2007). In this context, the contribution of the basal ganglia is the correction of value-prediction errors and the control of the intensity with which the behaviour is elicited (Jin and Costa, 2010; Turner and Desmurget, 2010).
In summary, a full description of HV, or the sense of valency, as this notion is also referred to in AI (Ackley and Littman, 1991), would require a descriptive model of the recurrent activity across OFC, ACC and the ventral striatum. However, rather than describing the neural activity across these brain areas, we deemed it more useful, for application to artificial agents, to build a model that captures the principle of operation of HV. In this light, the main principle of our model consists of making the modulation of value dependent on the loop of interaction between the agent's homeostasis and the sensorimotor cycle. In other words, value depends not only on the agent's internal state, but also on the perception the agent may build of its environment. For example, the agent will not value a consummatory action in the same manner when its internal level of energy is close to satiation as when it is close to depletion, and its valuation will also vary depending on whether the necessary resource is easily attainable or scarce. This set of principles is captured by a reward formula embedded into an actor-critic RL algorithm, which aims at maintaining the agent's internal physiology within the boundaries that permit the agent's operation (Cañamero, 1997); see Sections 3.1 and 3.2. We have tested the performance of this formulation by monitoring the resulting decision-making patterns of an artificial agent endowed with this formulation of HV in several simulated environments. Overall, the results show that a subjective component in the assessment of value increases learning speed and enhances the stability of the behavioural patterns.
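As a rough sketch of this principle (with assumed variable names and a multiplicative form, not the reward formula defined in Sections 3.1 and 3.2), the hedonic reward of a consummatory act can be made to grow both with the internal deficit and with the perceived scarcity of the resource:

```python
# Rough sketch of the principle above (assumptions, not this paper's formula):
# the hedonic reward of consuming a resource grows with the internal deficit
# and with the agent's estimate of how scarce that resource is.

def hedonic_reward(deficit: float, scarcity: float) -> float:
    """Reward for a consummatory act.

    deficit  -- distance of the relevant homeostatic variable from its set point, in [0, 1]
    scarcity -- the agent's estimate of how hard the resource is to find, in [0, 1]
    """
    return deficit * (1.0 + scarcity)

# The same meal is worth most to a depleted agent in a resource-poor environment.
print(hedonic_reward(deficit=0.8, scarcity=0.7))  # about 1.36
print(hedonic_reward(deficit=0.8, scarcity=0.0))  # 0.8
print(hedonic_reward(deficit=0.1, scarcity=0.7))  # about 0.17
```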
3 Theoretical Model
The model consists of four main parts: a module of artificial physiology (top right, Fig. 1), which is an abstraction of internal physiological processes; a perception module (bottom right, Fig. 1), which provides grounded knowledge about the behaviours afforded by each object near the agent; a motivated actor-critic module (centre, Fig. 1), which learns behavioural patterns adapted to the environment; and a module to calculate hedonic value (see Value Function, centre-top, Fig. 1). Each module is described in the next sections; however, we have considered it appropriate to introduce a
