Published as a conference paper at ICLR 2017
MULTI-AGENT COOPERATION
AND THE EMERGENCE OF (NATURAL) LANGUAGE
Angeliki Lazaridou¹, Alexander Peysakhovich², Marco Baroni²,³
¹Google DeepMind, ²Facebook AI Research, ³University of Trento
angeliki@google.com, {alexpeys,mbaroni}@fb.com
ABSTRACT
The current mainstream approach to train natural language systems is to expose
them to large amounts of text. This passive learning is problematic if we are in-
terested in developing interactive machines, such as conversational agents. We
propose a framework for language learning that relies on multi-agent communi-
cation. We study this learning in the context of referential games. In these games,
a sender and a receiver see a pair of images. The sender is told one of them is
the target and is allowed to send a message from a fixed, arbitrary vocabulary to
the receiver. The receiver must rely on this message to identify the target. Thus,
the agents develop their own language interactively out of the need to communi-
cate. We show that two networks with simple configurations are able to learn to
coordinate in the referential game. We further explore how to make changes to the
game environment to cause the “word meanings” induced in the game to better re-
flect intuitive semantic properties of the images. In addition, we present a simple
strategy for grounding the agents’ code into natural language. Both of these are
necessary steps towards developing machines that are able to communicate with
humans productively.
1 INTRODUCTION
I tried to break it to him gently [...] the only way to learn an unknown language
is to interact with a native speaker [...] asking questions, holding a conversation,
that sort of thing [...] If you want to learn the aliens’ language, someone [...] will
have to talk with an alien. Recordings alone aren’t sufficient.
Ted Chiang, Story of Your Life
One of the main aims of AI is to develop agents that can cooperate with others to achieve goals
(Wooldridge, 2009). Such coordination requires communication. If the coordination partners are to
include humans, the most obvious channel of communication is natural language. Thus, handling
natural-language-based communication is a key step toward the development of AI that can thrive
in a world populated by other agents.
Given the success of deep learning models in related domains such as image captioning or machine
translation (e.g., Sutskever et al., 2014; Xu et al., 2015), it would seem reasonable to cast the prob-
lem of training conversational agents as an instance of supervised learning (Vinyals & Le, 2015).
However, training on “canned” conversations does not allow learners to experience the interactive
aspects of communication. Supervised approaches, which focus on the structure of language, are an
excellent way to learn general statistical associations between sequences of symbols. However, they
do not capture the functional aspects of communication, i.e., that humans use words to coordinate
with others and make things happen (Austin, 1962; Clark, 1996; Wittgenstein, 1953).
This paper introduces the first steps of a research program based on multi-agent coordination com-
munication games. These games place agents in simple environments where they need to develop a
language to coordinate and earn payoffs. Importantly, the agents start as blank slates, but, by play-
ing a game together, they can develop and bootstrap knowledge on top of each other, leading to the
emergence of a language.
Work done while at Facebook AI Research.
The central problem of our program, then, is the following: How do we design environments that
foster the development of a language that is portable to new situations and to new communication
partners (in particular humans)?
We start from the most basic challenge of using a language in order to refer to things in the context
of a two-agent game. We focus on two questions. First, whether tabula rasa agents succeed in com-
munication. Second, what features of the environment lead to the development of codes resembling
human language.
We assess this latter question in two ways. First, we consider whether the agents associate general
conceptual properties, such as broad object categories (as opposed to low-level visual properties),
to the symbols they learn to use. Second, we examine whether the agents’ “word usage” is partially
interpretable by humans in an online experiment.
Other researchers have proposed communication-based environments for the development of
coordination-capable AI. Work in multi-agent systems has focused on the design of pre-programmed
communication systems to solve specific tasks (e.g., robot soccer, Stone & Veloso 1998). Most re-
lated to our work, Sukhbaatar et al. (2016) and Foerster et al. (2016) show that neural networks can
evolve communication in the context of games without a pre-coded protocol. We pursue the same
question, but further ask how we can change our environment to make the emergent language more
interpretable.
Others (e.g., the SHRDLU program of Winograd 1971 or the game in Wang et al. 2016) propose
building a communicating AI by putting humans in the loop from the very beginning. This approach
has benefits but faces serious scalability issues, as active human intervention is required at each step.
An attractive component of our game-based paradigm is that humans may be added as players, but
do not need to be there all the time.
A third branch of research focuses on “Wizard-of-Oz” environments, where agents learn to play
games by interacting with a complex scripted environment (Mikolov et al., 2015). This approach
gives the designer tight control over the learning curriculum, but imposes a heavy engineering burden
on developers. We also stress the importance of the environment (game setup), but we focus on
simpler environments with multiple agents that force them to get smarter by bootstrapping on top of
each other.
We leverage ideas from work in linguistics, cognitive science and game theory on the emergence of
language (Wagner et al., 2003; Skyrms, 2010; Crawford & Sobel, 1982; Crawford, 1998). Our game
is a variation of Lewis’ signaling game (Lewis, 1969). There is a rich tradition of linguistic and
cognitive studies using similar setups (e.g., Briscoe, 2002; Cangelosi & Parisi, 2002; Spike et al.,
2016; Steels & Loetzsch, 2012). What distinguishes us from this literature is our aim to, eventually,
develop practical AI. This motivates our focus on more realistic input data (a large collection of
noisy natural images) and on trying to align the agents’ language with human intuitions.
Lewis’ classic games have been studied extensively in game theory under the name of “cheap talk”.
These games have been used as models to study the evolution of language both theoretically and
experimentally (Crawford, 1998; Blume et al., 1998; Crawford & Sobel, 1982). A major question
in game theory is whether an equilibrium is actually reached in a game, since convergence in learning is
not guaranteed (Fudenberg & Peysakhovich, 2014; Roth & Erev, 1995), and, if an equilibrium
is reached, which one it will be (since equilibria are typically not unique). This is particularly true for
cheap talk games, which exhibit Nash equilibria in which precise language emerges, others where
vague language emerges and others where no language emerges at all (Crawford & Sobel, 1982). In
addition, because in these games language has no ex-ante meaning and only emerges in the context
of the equilibrium, some of the emergent languages may not be very natural. Our results speak to
both the convergence question and the question of what features of the game cause the appearance
of different types of languages. Thus, our results are also of interest to game theorists.
An evolutionary perspective has recently been advocated as a way to mitigate the data hunger of
traditional supervised approaches (Goodfellow et al., 2014; Silver et al., 2016). This research con-
firms that learning can be bootstrapped from competition between agents. We focus, however, on
cooperation between agents as a way to foster learning while reducing the need for annotated data.
2 GENERAL FRAMEWORK
Our general framework includes K players, each parametrized by θ_k, a collection of tasks/games that
the players have to perform, a communication protocol V that enables the players to communicate
with each other, and payoffs assigned to the players as a deterministic function of a well-defined
goal. In this paper we focus on a particular version of this: referential games. These games are
structured as follows.
1. There is a set of images represented by vectors {i_1, . . . , i_N}; two images are drawn at random from this set, call them (i_L, i_R), and one of them is chosen to be the “target” t ∈ {L, R}.
2. There are two players, a sender and a receiver, each seeing the images; the sender receives input θ_S(i_L, i_R, t).
3. There is a vocabulary V of size K and the sender chooses one symbol to send to the receiver; we call this the sender’s policy s(θ_S(i_L, i_R, t)) ∈ V.
4. The receiver does not know the target, but sees the sender’s symbol and tries to guess the target image. We call this the receiver’s policy r(i_L, i_R, s(θ_S(i_L, i_R, t))) ∈ {L, R}.
5. If r(i_L, i_R, s(θ_S(i_L, i_R, t))) = t, that is, if the receiver guesses the target, both players receive a payoff of 1 (win), otherwise they receive a payoff of 0 (lose).
Many extensions to the basic referential game explored here are possible. There can be more images,
or a more sophisticated communication protocol (e.g., communication of a sequence of symbols or
multi-step communication requiring back-and-forth interaction¹), rotation of the sender and receiver
roles, having a human occasionally playing one of the roles, etc.
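To make the basic round structure concrete, here is a minimal sketch of one game round with placeholder random policies. The helper names and the random agents are illustrative assumptions, not the paper's implementation; real agents would replace `random_sender` and `random_receiver`.

```python
import random

VOCAB = list(range(10))  # a fixed, arbitrary vocabulary V (size K = 10 here)

def play_round(images, sender_policy, receiver_policy):
    """Play one referential game round and return the shared payoff."""
    i_L, i_R = random.sample(images, 2)       # two images drawn at random
    target = random.choice(["L", "R"])        # one of them is the target t

    # The sender sees both images and the target, and sends one symbol from V.
    symbol = sender_policy(i_L, i_R, target)

    # The receiver sees the images (but not the target) plus the symbol.
    guess = receiver_policy(i_L, i_R, symbol)

    return 1 if guess == target else 0        # both agents win or lose together

# Placeholder agents, just to exercise the protocol.
random_sender = lambda i_L, i_R, t: random.choice(VOCAB)
random_receiver = lambda i_L, i_R, s: random.choice(["L", "R"])

toy_images = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
payoff = play_round(toy_images, random_sender, random_receiver)
```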
3 EXPERIMENTAL SETUP
Images We use the McRae et al. (2005) set of 463 base-level concrete concepts (e.g., cat, apple,
car. . . ) spanning 20 general categories (e.g., animal, fruit/vegetable, vehicle. . . ). We
randomly sample 100 images of each concept from ImageNet (Deng et al., 2009). To create tar-
get/distractor pairs, we randomly sample two concepts, one image for each concept and whether the
first or second image will serve as target. We apply to each image a forward-pass through the pre-
trained VGG ConvNet (Simonyan & Zisserman, 2014), and represent it with the activations from
either the top 1000-D softmax layer (sm) or the second-to-last 4096-D fully connected layer (fc).
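The feature extraction step can be reproduced roughly as follows; the exact VGG variant and preprocessing used by the authors are not stated here, so this sketch assumes torchvision's VGG-16 with standard ImageNet preprocessing as a stand-in.

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Assumption: torchvision's VGG-16 as a stand-in for the "pre-trained VGG ConvNet".
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def image_representation(path, layer="fc"):
    """Return the 4096-D second-to-last fully connected activations ('fc')
    or the 1000-D softmax output ('sm') for one image."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feats = torch.flatten(vgg.avgpool(vgg.features(x)), 1)
        if layer == "fc":
            return vgg.classifier[:4](feats).squeeze(0)   # up to the second Linear
        return torch.softmax(vgg.classifier(feats), dim=1).squeeze(0)
```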
Agent Players Both sender and receiver are simple feed-forward networks. For the sender, we
experiment with the two architectures depicted in Figure 1. Both sender architectures take as input
the target (marked with a green square in Figure 1) and distractor representations, always in this
order, so that they are implicitly informed of which image is the target (the receiver, instead, sees
the two images in random order).
The agnostic sender is a generic neural network that maps the original image vectors onto a “game-
specific” embedding space (in the sense that the embedding is learned while playing the game)
followed by a sigmoid nonlinearity. Fully-connected weights are applied to the embedding concate-
nation to produce scores over vocabulary symbols.
The informed sender also first embeds the images into a “game-specific” space. It then applies
1-D convolutions (“filters”) on the image embeddings by treating them as different channels. The
informed sender uses convolutions with kernel size 2x1 applied dimension-by-dimension to the
two image embeddings (in Figure 1, there are 4 such filters). This is followed by the sigmoid
nonlinearity. The resulting feature maps are combined through another filter (kernel size fx1, where
f is the number of filters on the image embeddings), to produce scores for the vocabulary symbols.
Intuitively, the informed sender has an inductive bias towards combining the two images dimension-
by-dimension whereas the agnostic sender does not (though we note the agnostic architecture nests
the informed one).
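A rough PyTorch reading of the two sender bodies is sketched below (scores only; the discretization step is described next). The embedding size and filter count follow the hyperparameters reported later; the assumption that the informed sender's game embedding dimension equals the vocabulary size, so that the final feature map can be read directly as vocabulary scores, is ours, since the text leaves that detail implicit.

```python
import torch
import torch.nn as nn

class AgnosticSender(nn.Module):
    """Embeds each image into a game-specific space (with a sigmoid), then scores
    the vocabulary from the concatenated embeddings with a fully connected layer."""
    def __init__(self, img_dim, vocab_size, embed_dim=50):
        super().__init__()
        self.embed = nn.Linear(img_dim, embed_dim)
        self.to_vocab = nn.Linear(2 * embed_dim, vocab_size)

    def forward(self, target, distractor):             # target always given first
        e_t = torch.sigmoid(self.embed(target))
        e_d = torch.sigmoid(self.embed(distractor))
        return self.to_vocab(torch.cat([e_t, e_d], dim=-1))   # vocabulary scores


class InformedSender(nn.Module):
    """Treats the two image embeddings as channels and combines them
    dimension-by-dimension with 1-D convolutions (an assumption-laden sketch)."""
    def __init__(self, img_dim, vocab_size, n_filters=20):
        super().__init__()
        self.embed = nn.Linear(img_dim, vocab_size)     # assumed: embed dim = |V|
        self.conv1 = nn.Conv1d(2, n_filters, kernel_size=1)   # the "2x1" filters
        self.conv2 = nn.Conv1d(n_filters, 1, kernel_size=1)   # the "f x 1" filter

    def forward(self, target, distractor):
        x = torch.stack([self.embed(target), self.embed(distractor)], dim=1)
        x = torch.sigmoid(self.conv1(x))                # (batch, n_filters, dims)
        return self.conv2(x).squeeze(1)                 # (batch, vocab_size) scores
```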
¹ For example, Jorge et al. (2016) explore agents playing a “Guess Who” game to learn about the emergence
of question-asking and answering in language.
Figure 1: Architectures of agent players (agnostic sender, informed sender, and receiver).
For both senders, motivated by the discrete nature of language, we enforce a strong communication
bottleneck that discretizes the communication protocol. Activations on the top (vocabulary) layer
are converted to a Gibbs distribution (with temperature parameter τ), and then a single symbol s is
sampled from the resulting probability distribution.
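The communication bottleneck amounts to a temperature softmax followed by sampling. The sketch below uses the common softmax(scores/τ) parameterization, which is an assumption, since the text does not spell out the exact form of its Gibbs distribution.

```python
import torch

def sample_symbol(vocab_scores, tau=10.0):
    """Turn raw vocabulary scores into a Gibbs (temperature softmax) distribution
    and sample one discrete symbol index from it."""
    probs = torch.softmax(vocab_scores / tau, dim=-1)
    symbol = torch.multinomial(probs, num_samples=1).squeeze(-1)
    return symbol, probs
```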
The receiver takes as input the target and distractor image vectors in random order, as well as the
symbol produced by the sender (as a one-hot vector over the vocabulary). It embeds the images and
the symbol into its own “game-specific” space. It then computes dot products between the symbol
and image embeddings. Ideally, dot similarity should be higher for the image that is better denoted
by the symbol. The two dot products are converted to a Gibbs distribution (with temperature τ ) and
the receiver “points” to an image by sampling from the resulting distribution.
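A corresponding receiver sketch, under the same caveats: the symbol embedding is implemented here with an embedding table (equivalent to a linear layer applied to the one-hot symbol), and the shapes follow the textual description rather than released code.

```python
import torch
import torch.nn as nn

class Receiver(nn.Module):
    """Embeds the two candidate images and the received symbol into a shared
    game-specific space, scores each image by its dot product with the symbol,
    and 'points' by sampling from the resulting Gibbs distribution."""
    def __init__(self, img_dim, vocab_size, embed_dim=50, tau=10.0):
        super().__init__()
        self.embed_img = nn.Linear(img_dim, embed_dim)
        self.embed_sym = nn.Embedding(vocab_size, embed_dim)
        self.tau = tau

    def forward(self, img_left, img_right, symbol):
        sym = self.embed_sym(symbol)                               # (batch, d)
        imgs = torch.stack([self.embed_img(img_left),
                            self.embed_img(img_right)], dim=1)     # (batch, 2, d)
        dots = torch.bmm(imgs, sym.unsqueeze(-1)).squeeze(-1)      # (batch, 2)
        probs = torch.softmax(dots / self.tau, dim=-1)
        guess = torch.multinomial(probs, num_samples=1).squeeze(-1)  # 0 = L, 1 = R
        return guess, probs
```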
General Training Details We set the following hyperparameters without tuning: embedding di-
mensionality: 50, number of filters applied to embeddings by informed sender: 20, temperature of
Gibbs distributions: 10. We explore two vocabulary sizes: 10 and 100 symbols.
The sender and receiver parameters θ = ⟨θ_R, θ_S⟩ are learned while playing the game. No weights
are shared and the only supervision used is communication success, i.e., whether the receiver pointed
at the right referent.
This setup is naturally modeled with Reinforcement Learning (Sutton & Barto, 1998). As out-
lined in Section 2, the sender follows policy s(θ_S(i_L, i_R, t)) ∈ V and the receiver policy
r(i_L, i_R, s(θ_S(i_L, i_R, t))) ∈ {L, R}. The loss function that the two agents must minimize is
−E_r̃[R(r̃)], where R is the reward function returning 1 iff r(i_L, i_R, s(θ_S(i_L, i_R, t))) = t. Param-
eters are updated through the Reinforce rule (Williams, 1992). We apply mini-batch updates, with
a batch size of 32 and for a total of 50k iterations (games). At test time, we compile a set of 10k
games using the same method as for the training games.
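Putting the pieces together, one plausible Reinforce training step is sketched below. It applies the vanilla score-function estimator to both agents' sampled actions with the shared 0/1 reward; any baseline or other variance-reduction detail is omitted because it is not described in this excerpt, and the helper names are ours.

```python
import torch

def reinforce_step(sender, receiver, optimizer, batch, tau=10.0):
    """One mini-batch Reinforce update (a sketch, not the authors' code)."""
    targets, distractors, target_pos = batch     # target_pos in {0, 1}: left/right

    # Sender: vocabulary scores -> Gibbs distribution -> sampled symbol.
    sym_probs = torch.softmax(sender(targets, distractors) / tau, dim=-1)
    symbols = torch.multinomial(sym_probs, 1).squeeze(-1)

    # Receiver sees the two images in random order (encoded by target_pos).
    flip = target_pos.bool().unsqueeze(-1)
    img_left = torch.where(flip, distractors, targets)
    img_right = torch.where(flip, targets, distractors)
    guesses, guess_probs = receiver(img_left, img_right, symbols)

    reward = (guesses == target_pos).float()     # 1 iff communication succeeded

    # Minimize -E[R] with the log-derivative trick for both stochastic policies.
    log_p_sym = torch.log(sym_probs.gather(1, symbols.unsqueeze(-1))).squeeze(-1)
    log_p_guess = torch.log(guess_probs.gather(1, guesses.unsqueeze(-1))).squeeze(-1)
    loss = -((log_p_sym + log_p_guess) * reward).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.mean().item()
```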
We now turn to our main questions. The first is whether the agents can learn to successfully coordi-
nate in a reasonable amount of time. The second is whether the agents’ language can be thought of
as “natural language”, i.e., symbols are assigned to meanings that make intuitive sense in terms of
our conceptualization of the world.
4 LEARNING TO COMMUNICATE
Our first question is whether agents converge to successful communication at all. We see that they
do: agents almost perfectly coordinate in the 1k rounds following the 10k training games for every
architecture and parameter choice (Table 1).
We see, though, some differences between different sender architectures. Figure 2 (left) shows
performance on a sample of the test set as a function of the first 5,000 rounds of training. The agents
Figure 2: Left: Communication success as a function of training iterations; we see that informed
senders converge faster than agnostic ones. Right: Spectrum of an example symbol usage matrix:
the first few dimensions capture only partial variance, suggesting that the usage of more symbols
by the informed sender is not just due to synonymy.
id  sender    vis rep  voc size  used symbols  comm success (%)  purity (%)  obs-chance purity (%)
1   informed  sm       100       58            100               46          27
2   informed  fc       100       38            100               41          23
3   informed  sm       10        10            100               35          18
4   informed  fc       10        10            100               32          17
5   agnostic  sm       100       2             99                21          15
6   agnostic  fc       10        2             99                21          15
7   agnostic  sm       10        2             99                20          15
8   agnostic  fc       100       2             99                19          15
Table 1: Playing the referential game: test results after 50K training games. Used symbols column
reports number of distinct vocabulary symbols that were produced at least once in the test phase. See
text for explanation of comm success and purity. All purity values are highly significant (p < 0.001)
compared to simulated chance symbol assignment when matching observed symbol usage. The obs-
chance purity column reports the difference between observed and expected purity under chance.
converge to coordination quite fast, but the informed sender reaches higher levels more quickly than
the agnostic one.
The informed sender makes use of more symbols from the available vocabulary, while the agnostic
sender constantly uses a compact 2-symbol vocabulary. This suggests that the informed sender is
using more varied and word-like symbols (recall that the images depict 463 distinct objects, so we
would expect a natural-language-endowed sender to use a wider array of symbols to discriminate
among them). However, it could also be the case that the informed sender vocabulary simply con-
tains higher redundancy/synonymy. To check this, we construct a (sampled) matrix where rows are
game image pairs, columns are symbols, and entries represent how often that symbol is used for that
pair. We then decompose the matrix through SVD. If the sender is indeed just using a strategy with
few effective symbols but high synonymy, then we should expect a 1- or 2-dimensional decomposi-
tion. Figure 2 (right) plots the normalized spectrum of this matrix. While there is some redundancy
in the matrix (thus potentially implying there is synonymy in the usage), the language still requires
multiple dimensions to summarize (cross-validated SVD suggests 50 dimensions).
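The synonymy check can be reproduced from a log of (image pair, symbol) events; the sketch below builds the usage-count matrix described above and returns its normalized singular-value spectrum.

```python
import numpy as np

def normalized_spectrum(pair_ids, symbol_ids, n_pairs, vocab_size):
    """Rows are game image pairs, columns are symbols, entries count how often a
    symbol was used for a pair; a spectrum concentrated on one or two singular
    values would indicate few effective symbols with high synonymy."""
    usage = np.zeros((n_pairs, vocab_size))
    for p, s in zip(pair_ids, symbol_ids):
        usage[p, s] += 1
    singular_values = np.linalg.svd(usage, compute_uv=False)
    return singular_values / singular_values.sum()
```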
We now turn to investigating the semantic properties of the emergent communication protocol. Re-
call that the vocabulary that agents use is arbitrary and has no initial meaning. One way to understand
its emerging semantics is by looking at the relationship between symbols and the sets of images they
refer to.
References

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database.
Goodfellow, I., et al. (2014). Generative Adversarial Nets.
Simonyan, K. and Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction.
van der Maaten, L. and Hinton, G. (2008). Visualizing Data using t-SNE.