Published as a conference paper at ICLR 2017
MULTI-AGENT COOPERATION
AND THE EMERGENCE OF (NATURAL) LANGUAGE
Angeliki Lazaridou¹, Alexander Peysakhovich², Marco Baroni²,³
¹Google DeepMind, ²Facebook AI Research, ³University of Trento
angeliki@google.com, {alexpeys,mbaroni}@fb.com
ABSTRACT
The current mainstream approach to train natural language systems is to expose
them to large amounts of text. This passive learning is problematic if we are in-
terested in developing interactive machines, such as conversational agents. We
propose a framework for language learning that relies on multi-agent communi-
cation. We study this learning in the context of referential games. In these games,
a sender and a receiver see a pair of images. The sender is told one of them is
the target and is allowed to send a message from a fixed, arbitrary vocabulary to
the receiver. The receiver must rely on this message to identify the target. Thus,
the agents develop their own language interactively out of the need to communi-
cate. We show that two networks with simple configurations are able to learn to
coordinate in the referential game. We further explore how to make changes to the
game environment to cause the “word meanings” induced in the game to better re-
flect intuitive semantic properties of the images. In addition, we present a simple
strategy for grounding the agents’ code into natural language. Both of these are
necessary steps towards developing machines that are able to communicate with
humans productively.
1 INTRODUCTION
I tried to break it to him gently [...] the only way to learn an unknown language
is to interact with a native speaker [...] asking questions, holding a conversation,
that sort of thing [...] If you want to learn the aliens’ language, someone [...] will
have to talk with an alien. Recordings alone aren’t sufficient.
Ted Chiang, Story of Your Life
One of the main aims of AI is to develop agents that can cooperate with others to achieve goals
(Wooldridge, 2009). Such coordination requires communication. If the coordination partners are to
include humans, the most obvious channel of communication is natural language. Thus, handling
natural-language-based communication is a key step toward the development of AI that can thrive
in a world populated by other agents.
Given the success of deep learning models in related domains such as image captioning or machine
translation (e.g., Sutskever et al., 2014; Xu et al., 2015), it would seem reasonable to cast the prob-
lem of training conversational agents as an instance of supervised learning (Vinyals & Le, 2015).
However, training on “canned” conversations does not allow learners to experience the interactive
aspects of communication. Supervised approaches, which focus on the structure of language, are an
excellent way to learn general statistical associations between sequences of symbols. However, they
do not capture the functional aspects of communication, i.e., that humans use words to coordinate
with others and make things happen (Austin, 1962; Clark, 1996; Wittgenstein, 1953).
This paper introduces the first steps of a research program based on multi-agent coordination com-
munication games. These games place agents in simple environments where they need to develop a
language to coordinate and earn payoffs. Importantly, the agents start as blank slates, but, by play-
ing a game together, they can develop and bootstrap knowledge on top of each other, leading to the
emergence of a language.
Work done while at Facebook AI Research.
The central problem of our program, then, is the following: How do we design environments that
foster the development of a language that is portable to new situations and to new communication
partners (in particular humans)?
We start from the most basic challenge of using a language in order to refer to things in the context
of a two-agent game. We focus on two questions. First, whether tabula rasa agents succeed in com-
munication. Second, what features of the environment lead to the development of codes resembling
human language.
We assess this latter question in two ways. First, we consider whether the agents associate general
conceptual properties, such as broad object categories (as opposed to low-level visual properties),
to the symbols they learn to use. Second, we examine whether the agents’ “word usage” is partially
interpretable by humans in an online experiment.
Other researchers have proposed communication-based environments for the development of
coordination-capable AI. Work in multi-agent systems has focused on the design of pre-programmed
communication systems to solve specific tasks (e.g., robot soccer, Stone & Veloso 1998). Most re-
lated to our work, Sukhbaatar et al. (2016) and Foerster et al. (2016) show that neural networks can
evolve communication in the context of games without a pre-coded protocol. We pursue the same
question, but further ask how we can change our environment to make the emergent language more
interpretable.
Others (e.g., the SHRDLU program of Winograd 1971 or the game in Wang et al. 2016) propose
building a communicating AI by putting humans in the loop from the very beginning. This approach
has benefits but faces serious scalability issues, as active human intervention is required at each step.
An attractive component of our game-based paradigm is that humans may be added as players, but
do not need to be there all the time.
A third branch of research focuses on “Wizard-of-Oz” environments, where agents learn to play
games by interacting with a complex scripted environment (Mikolov et al., 2015). This approach
gives the designer tight control over the learning curriculum, but imposes a heavy engineering burden
on developers. We also stress the importance of the environment (game setup), but we focus on
simpler environments with multiple agents that force them to get smarter by bootstrapping on top of
each other.
We leverage ideas from work in linguistics, cognitive science and game theory on the emergence of
language (Wagner et al., 2003; Skyrms, 2010; Crawford & Sobel, 1982; Crawford, 1998). Our game
is a variation of Lewis’ signaling game (Lewis, 1969). There is a rich tradition of linguistic and
cognitive studies using similar setups (e.g., Briscoe, 2002; Cangelosi & Parisi, 2002; Spike et al.,
2016; Steels & Loetzsch, 2012). What distinguishes us from this literature is our aim to, eventually,
develop practical AI. This motivates our focus on more realistic input data (a large collection of
noisy natural images) and on trying to align the agents’ language with human intuitions.
Lewis’ classic games have been studied extensively in game theory under the name of “cheap talk”.
These games have been used as models to study the evolution of language both theoretically and
experimentally (Crawford, 1998; Blume et al., 1998; Crawford & Sobel, 1982). A major question
in game theory is whether an equilibrium is actually reached in a game, since convergence in learning is
not guaranteed (Fudenberg & Peysakhovich, 2014; Roth & Erev, 1995), and, if an equilibrium
is reached, which one it will be (since equilibria are typically not unique). This is particularly true for
cheap talk games, which exhibit Nash equilibria in which precise language emerges, others where
vague language emerges and others where no language emerges at all (Crawford & Sobel, 1982). In
addition, because in these games language has no ex-ante meaning and only emerges in the context
of the equilibrium, some of the emergent languages may not be very natural. Our results speak to
both the convergence question and the question of what features of the game cause the appearance
of different types of languages. Thus, our results are also of interest to game theorists.
An evolutionary perspective has recently been advocated as a way to mitigate the data hunger of
traditional supervised approaches (Goodfellow et al., 2014; Silver et al., 2016). This research con-
firms that learning can be bootstrapped from competition between agents. We focus, however, on
cooperation between agents as a way to foster learning while reducing the need for annotated data.
2 GENERAL FRAMEWORK
Our general framework includes K players, each parametrized by θ_k, a collection of tasks/games that
the players have to perform, a communication protocol V that enables the players to communicate
with each other, and payoffs assigned to the players as a deterministic function of a well-defined
goal. In this paper we focus on a particular version of this: referential games. These games are
structured as follows.
1. There is a set of images represented by vectors {i_1, . . . , i_N}; two images are drawn at random from this set, call them (i_L, i_R), and one of them is chosen to be the “target” t ∈ {L, R}.
2. There are two players, a sender and a receiver, each seeing the images; the sender receives input θ_S(i_L, i_R, t).
3. There is a vocabulary V of size K and the sender chooses one symbol to send to the receiver; we call this the sender’s policy s(θ_S(i_L, i_R, t)) ∈ V.
4. The receiver does not know the target, but sees the sender’s symbol and tries to guess the target image. We call this the receiver’s policy r(i_L, i_R, s(θ_S(i_L, i_R, t))) ∈ {L, R}.
5. If r(i_L, i_R, s(θ_S(i_L, i_R, t))) = t, that is, if the receiver guesses the target, both players receive a payoff of 1 (win), otherwise they receive a payoff of 0 (lose).
Many extensions to the basic referential game explored here are possible. There can be more images,
or a more sophisticated communication protocol (e.g., communication of a sequence of symbols or
multi-step communication requiring back-and-forth interaction¹), rotation of the sender and receiver
roles, having a human occasionally playing one of the roles, etc.
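To make the basic round structure concrete, here is a minimal sketch of one game round with placeholder random policies. The helper names and the random agents are illustrative assumptions, not the paper's implementation; real agents would replace `random_sender` and `random_receiver`.

```python
import random

VOCAB = list(range(10))  # a fixed, arbitrary vocabulary V (size K = 10 here)

def play_round(images, sender_policy, receiver_policy):
    """Play one referential game round and return the shared payoff."""
    i_L, i_R = random.sample(images, 2)       # two images drawn at random
    target = random.choice(["L", "R"])        # one of them is the target t

    # The sender sees both images and the target, and sends one symbol from V.
    symbol = sender_policy(i_L, i_R, target)

    # The receiver sees the images (but not the target) plus the symbol.
    guess = receiver_policy(i_L, i_R, symbol)

    return 1 if guess == target else 0        # both agents win or lose together

# Placeholder agents, just to exercise the protocol.
random_sender = lambda i_L, i_R, t: random.choice(VOCAB)
random_receiver = lambda i_L, i_R, s: random.choice(["L", "R"])

toy_images = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
payoff = play_round(toy_images, random_sender, random_receiver)
```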
3 EXPERIMENTAL SETUP
Images We use the McRae et al. (2005) set of 463 base-level concrete concepts (e.g., cat, apple,
car. . . ) spanning 20 general categories (e.g., animal, fruit/vegetable, vehicle. . . ). We
randomly sample 100 images of each concept from ImageNet (Deng et al., 2009). To create tar-
get/distractor pairs, we randomly sample two concepts, one image for each concept and whether the
first or second image will serve as target. We apply to each image a forward-pass through the pre-
trained VGG ConvNet (Simonyan & Zisserman, 2014), and represent it with the activations from
either the top 1000-D softmax layer (sm) or the second-to-last 4096-D fully connected layer (fc).
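The feature extraction step can be reproduced roughly as follows; the exact VGG variant and preprocessing used by the authors are not stated here, so this sketch assumes torchvision's VGG-16 with standard ImageNet preprocessing as a stand-in.

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Assumption: torchvision's VGG-16 as a stand-in for the "pre-trained VGG ConvNet".
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def image_representation(path, layer="fc"):
    """Return the 4096-D second-to-last fully connected activations ('fc')
    or the 1000-D softmax output ('sm') for one image."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feats = torch.flatten(vgg.avgpool(vgg.features(x)), 1)
        if layer == "fc":
            return vgg.classifier[:4](feats).squeeze(0)   # up to the second Linear
        return torch.softmax(vgg.classifier(feats), dim=1).squeeze(0)
```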
Agent Players Both sender and receiver are simple feed-forward networks. For the sender, we
experiment with the two architectures depicted in Figure 1. Both sender architectures take as input
the target (marked with a green square in Figure 1) and distractor representations, always in this
order, so that they are implicitly informed of which image is the target (the receiver, instead, sees
the two images in random order).
The agnostic sender is a generic neural network that maps the original image vectors onto a “game-
specific” embedding space (in the sense that the embedding is learned while playing the game)
followed by a sigmoid nonlinearity. Fully-connected weights are applied to the embedding concate-
nation to produce scores over vocabulary symbols.
The informed sender also first embeds the images into a “game-specific” space. It then applies
1-D convolutions (“filters”) on the image embeddings by treating them as different channels. The
informed sender uses convolutions with kernel size 2x1 applied dimension-by-dimension to the
two image embeddings (in Figure 1, there are 4 such filters). This is followed by the sigmoid
nonlinearity. The resulting feature maps are combined through another filter (kernel size fx1, where
f is the number of filters on the image embeddings), to produce scores for the vocabulary symbols.
Intuitively, the informed sender has an inductive bias towards combining the two images dimension-
by-dimension whereas the agnostic sender does not (though we note the agnostic architecture nests
the informed one).
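A rough PyTorch reading of the two sender bodies is sketched below (scores only; the discretization step is described next). The embedding size and filter count follow the hyperparameters reported later; the assumption that the informed sender's game embedding dimension equals the vocabulary size, so that the final feature map can be read directly as vocabulary scores, is ours, since the text leaves that detail implicit.

```python
import torch
import torch.nn as nn

class AgnosticSender(nn.Module):
    """Embeds each image into a game-specific space (with a sigmoid), then scores
    the vocabulary from the concatenated embeddings with a fully connected layer."""
    def __init__(self, img_dim, vocab_size, embed_dim=50):
        super().__init__()
        self.embed = nn.Linear(img_dim, embed_dim)
        self.to_vocab = nn.Linear(2 * embed_dim, vocab_size)

    def forward(self, target, distractor):             # target always given first
        e_t = torch.sigmoid(self.embed(target))
        e_d = torch.sigmoid(self.embed(distractor))
        return self.to_vocab(torch.cat([e_t, e_d], dim=-1))   # vocabulary scores


class InformedSender(nn.Module):
    """Treats the two image embeddings as channels and combines them
    dimension-by-dimension with 1-D convolutions (an assumption-laden sketch)."""
    def __init__(self, img_dim, vocab_size, n_filters=20):
        super().__init__()
        self.embed = nn.Linear(img_dim, vocab_size)     # assumed: embed dim = |V|
        self.conv1 = nn.Conv1d(2, n_filters, kernel_size=1)   # the "2x1" filters
        self.conv2 = nn.Conv1d(n_filters, 1, kernel_size=1)   # the "f x 1" filter

    def forward(self, target, distractor):
        x = torch.stack([self.embed(target), self.embed(distractor)], dim=1)
        x = torch.sigmoid(self.conv1(x))                # (batch, n_filters, dims)
        return self.conv2(x).squeeze(1)                 # (batch, vocab_size) scores
```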
¹ For example, Jorge et al. (2016) explore agents playing a “Guess Who” game to learn about the emergence
of question-asking and answering in language.
Figure 1: Architectures of agent players (agnostic sender, informed sender, and receiver).
For both senders, motivated by the discrete nature of language, we enforce a strong communication
bottleneck that discretizes the communication protocol. Activations on the top (vocabulary) layer
are converted to a Gibbs distribution (with temperature parameter τ), and then a single symbol s is
sampled from the resulting probability distribution.
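The communication bottleneck amounts to a temperature softmax followed by sampling. The sketch below uses the common softmax(scores/τ) parameterization, which is an assumption, since the text does not spell out the exact form of its Gibbs distribution.

```python
import torch

def sample_symbol(vocab_scores, tau=10.0):
    """Turn raw vocabulary scores into a Gibbs (temperature softmax) distribution
    and sample one discrete symbol index from it."""
    probs = torch.softmax(vocab_scores / tau, dim=-1)
    symbol = torch.multinomial(probs, num_samples=1).squeeze(-1)
    return symbol, probs
```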
The receiver takes as input the target and distractor image vectors in random order, as well as the
symbol produced by the sender (as a one-hot vector over the vocabulary). It embeds the images and
the symbol into its own “game-specific” space. It then computes dot products between the symbol
and image embeddings. Ideally, dot similarity should be higher for the image that is better denoted
by the symbol. The two dot products are converted to a Gibbs distribution (with temperature τ ) and
the receiver “points” to an image by sampling from the resulting distribution.
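A corresponding receiver sketch, under the same caveats: the symbol embedding is implemented here with an embedding table (equivalent to a linear layer applied to the one-hot symbol), and the shapes follow the textual description rather than released code.

```python
import torch
import torch.nn as nn

class Receiver(nn.Module):
    """Embeds the two candidate images and the received symbol into a shared
    game-specific space, scores each image by its dot product with the symbol,
    and 'points' by sampling from the resulting Gibbs distribution."""
    def __init__(self, img_dim, vocab_size, embed_dim=50, tau=10.0):
        super().__init__()
        self.embed_img = nn.Linear(img_dim, embed_dim)
        self.embed_sym = nn.Embedding(vocab_size, embed_dim)
        self.tau = tau

    def forward(self, img_left, img_right, symbol):
        sym = self.embed_sym(symbol)                               # (batch, d)
        imgs = torch.stack([self.embed_img(img_left),
                            self.embed_img(img_right)], dim=1)     # (batch, 2, d)
        dots = torch.bmm(imgs, sym.unsqueeze(-1)).squeeze(-1)      # (batch, 2)
        probs = torch.softmax(dots / self.tau, dim=-1)
        guess = torch.multinomial(probs, num_samples=1).squeeze(-1)  # 0 = L, 1 = R
        return guess, probs
```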
General Training Details We set the following hyperparameters without tuning: embedding di-
mensionality: 50, number of filters applied to embeddings by informed sender: 20, temperature of
Gibbs distributions: 10. We explore two vocabulary sizes: 10 and 100 symbols.
The sender and receiver parameters θ = ⟨θ_R, θ_S⟩ are learned while playing the game. No weights
are shared and the only supervision used is communication success, i.e., whether the receiver pointed
at the right referent.
This setup is naturally modeled with Reinforcement Learning (Sutton & Barto, 1998). As out-
lined in Section 2, the sender follows policy s(θ_S(i_L, i_R, t)) ∈ V and the receiver policy
r(i_L, i_R, s(θ_S(i_L, i_R, t))) ∈ {L, R}. The loss function that the two agents must minimize is
−E_r̃[R(r̃)], where R is the reward function returning 1 iff r(i_L, i_R, s(θ_S(i_L, i_R, t))) = t. Param-
eters are updated through the Reinforce rule (Williams, 1992). We apply mini-batch updates, with
a batch size of 32 and for a total of 50k iterations (games). At test time, we compile a set of 10k
games using the same method as for the training games.
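Putting the pieces together, one plausible Reinforce training step is sketched below. It applies the vanilla score-function estimator to both agents' sampled actions with the shared 0/1 reward; any baseline or other variance-reduction detail is omitted because it is not described in this excerpt, and the helper names are ours.

```python
import torch

def reinforce_step(sender, receiver, optimizer, batch, tau=10.0):
    """One mini-batch Reinforce update (a sketch, not the authors' code)."""
    targets, distractors, target_pos = batch     # target_pos in {0, 1}: left/right

    # Sender: vocabulary scores -> Gibbs distribution -> sampled symbol.
    sym_probs = torch.softmax(sender(targets, distractors) / tau, dim=-1)
    symbols = torch.multinomial(sym_probs, 1).squeeze(-1)

    # Receiver sees the two images in random order (encoded by target_pos).
    flip = target_pos.bool().unsqueeze(-1)
    img_left = torch.where(flip, distractors, targets)
    img_right = torch.where(flip, targets, distractors)
    guesses, guess_probs = receiver(img_left, img_right, symbols)

    reward = (guesses == target_pos).float()     # 1 iff communication succeeded

    # Minimize -E[R] with the log-derivative trick for both stochastic policies.
    log_p_sym = torch.log(sym_probs.gather(1, symbols.unsqueeze(-1))).squeeze(-1)
    log_p_guess = torch.log(guess_probs.gather(1, guesses.unsqueeze(-1))).squeeze(-1)
    loss = -((log_p_sym + log_p_guess) * reward).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.mean().item()
```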
We now turn to our main questions. The first is whether the agents can learn to successfully coordi-
nate in a reasonable amount of time. The second is whether the agents’ language can be thought of
as “natural language”, i.e., symbols are assigned to meanings that make intuitive sense in terms of
our conceptualization of the world.
4 LEARNING TO COMMUNICATE
Our first question is whether agents converge to successful communication at all. We see that they
do: agents almost perfectly coordinate in the 1k rounds following the 10k training games for every
architecture and parameter choice (Table 1).
We see, though, some differences between different sender architectures. Figure 2 (left) shows
performance on a sample of the test set as a function of the first 5,000 rounds of training. The agents
Figure 2: Left: Communication success as a function of training iterations; we see that informed
senders converge faster than agnostic ones. Right: Spectrum of an example symbol usage matrix:
the first few dimensions capture only partial variance, suggesting that the usage of more symbols
by the informed sender is not just due to synonymy.
id  sender    vis rep  voc size  used symbols  comm success (%)  purity (%)  obs-chance purity (%)
1   informed  sm       100       58            100               46          27
2   informed  fc       100       38            100               41          23
3   informed  sm       10        10            100               35          18
4   informed  fc       10        10            100               32          17
5   agnostic  sm       100       2             99                21          15
6   agnostic  fc       10        2             99                21          15
7   agnostic  sm       10        2             99                20          15
8   agnostic  fc       100       2             99                19          15
Table 1: Playing the referential game: test results after 50K training games. Used symbols column
reports number of distinct vocabulary symbols that were produced at least once in the test phase. See
text for explanation of comm success and purity. All purity values are highly significant (p < 0.001)
compared to simulated chance symbol assignment when matching observed symbol usage. The obs-
chance purity column reports the difference between observed and expected purity under chance.
converge to coordination quite fast, but the informed sender reaches higher levels more quickly than
the agnostic one.
The informed sender makes use of more symbols from the available vocabulary, while the agnostic
sender constantly uses a compact 2-symbol vocabulary. This suggests that the informed sender is
using more varied and word-like symbols (recall that the images depict 463 distinct objects, so we
would expect a natural-language-endowed sender to use a wider array of symbols to discriminate
among them). However, it could also be the case that the informed sender vocabulary simply con-
tains higher redundancy/synonymy. To check this, we construct a (sampled) matrix where rows are
game image pairs, columns are symbols, and entries represent how often that symbol is used for that
pair. We then decompose the matrix through SVD. If the sender is indeed just using a strategy with
few effective symbols but high synonymy, then we should expect a 1- or 2-dimensional decomposi-
tion. Figure 2 (right) plots the normalized spectrum of this matrix. While there is some redundancy
in the matrix (thus potentially implying there is synonymy in the usage), the language still requires
multiple dimensions to summarize (cross-validated SVD suggests 50 dimensions).
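The synonymy check can be reproduced from a log of (image pair, symbol) events; the sketch below builds the usage-count matrix described above and returns its normalized singular-value spectrum.

```python
import numpy as np

def normalized_spectrum(pair_ids, symbol_ids, n_pairs, vocab_size):
    """Rows are game image pairs, columns are symbols, entries count how often a
    symbol was used for a pair; a spectrum concentrated on one or two singular
    values would indicate few effective symbols with high synonymy."""
    usage = np.zeros((n_pairs, vocab_size))
    for p, s in zip(pair_ids, symbol_ids):
        usage[p, s] += 1
    singular_values = np.linalg.svd(usage, compute_uv=False)
    return singular_values / singular_values.sum()
```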
We now turn to investigating the semantic properties of the emergent communication protocol. Re-
call that the vocabulary that agents use is arbitrary and has no initial meaning. One way to understand
its emerging semantics is by looking at the relationship between symbols and the sets of images they
refer to.
References

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database.
Goodfellow, I., et al. (2014). Generative Adversarial Nets.
Simonyan, K. and Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction.
van der Maaten, L. and Hinton, G. (2008). Visualizing Data using t-SNE.