scispace - formally typeset
Search or ask a question

Showing papers in "arXiv: Multiagent Systems in 2019"


Proceedings ArticleDOI
TL;DR: A new traffic simulator CityFlow with fundamentally optimized data structures and efficient algorithms that can support flexible definitions for road network and traffic flow based on synthetic and real-world data and provides user-friendly interface for reinforcement learning.
Abstract: Traffic signal control is an emerging application scenario for reinforcement learning. Besides being as an important problem that affects people's daily life in commuting, traffic signal control poses its unique challenges for reinforcement learning in terms of adapting to dynamic traffic environment and coordinating thousands of agents including vehicles and pedestrians. A key factor in the success of modern reinforcement learning relies on a good simulator to generate a large number of data samples for learning. The most commonly used open-source traffic simulator SUMO is, however, not scalable to large road network and large traffic flow, which hinders the study of reinforcement learning on traffic scenarios. This motivates us to create a new traffic simulator CityFlow with fundamentally optimized data structures and efficient algorithms. CityFlow can support flexible definitions for road network and traffic flow based on synthetic and real-world data. It also provides user-friendly interface for reinforcement learning. Most importantly, CityFlow is more than twenty times faster than SUMO and is capable of supporting city-wide traffic simulation with an interactive render for monitoring. Besides traffic signal control, CityFlow could serve as the base for other transportation studies and can create new possibilities to test machine learning methods in the intelligent transportation domain.

126 citations


Posted Content
TL;DR: This paper addresses the order dispatching problem using multi-agent reinforcement learning (MARL), which follows the distributed nature of the peer-to-peer ridesharing problem and possesses the ability to capture the stochastic demand-supply dynamics in large-scale ridesh sharing scenarios.
Abstract: A fundamental question in any peer-to-peer ridesharing system is how to, both effectively and efficiently, dispatch user's ride requests to the right driver in real time. Traditional rule-based solutions usually work on a simplified problem setting, which requires a sophisticated hand-crafted weight design for either centralized authority control or decentralized multi-agent scheduling systems. Although recent approaches have used reinforcement learning to provide centralized combinatorial optimization algorithms with informative weight values, their single-agent setting can hardly model the complex interactions between drivers and orders. In this paper, we address the order dispatching problem using multi-agent reinforcement learning (MARL), which follows the distributed nature of the peer-to-peer ridesharing problem and possesses the ability to capture the stochastic demand-supply dynamics in large-scale ridesharing scenarios. Being more reliable than centralized approaches, our proposed MARL solutions could also support fully distributed execution through recent advances in the Internet of Vehicles (IoV) and the Vehicle-to-Network (V2N). Furthermore, we adopt the mean field approximation to simplify the local interactions by taking an average action among neighborhoods. The mean field approximation is capable of globally capturing dynamic demand-supply variations by propagating many local interactions between agents and the environment. Our extensive experiments have shown the significant improvements of MARL order dispatching algorithms over several strong baselines on the gross merchandise volume (GMV), and order response rate measures. Besides, the simulated experiments with real data have also justified that our solution can alleviate the supply-demand gap during the rush hours, thus possessing the capability of reducing traffic congestion.

124 citations


Posted Content
Kun Shao1, Zhentao Tang1, Yuanheng Zhu1, Nannan Li1, Dongbin Zhao1 
TL;DR: The progress of DRL methods, including value-based, policy gradient, and model-based algorithms, and compare their main techniques and properties are surveyed, including exploration-exploitation, sample efficiency, generalization and transfer, multi-agent learning, imperfect information, and delayed spare rewards.
Abstract: Deep reinforcement learning (DRL) has made great achievements since proposed. Generally, DRL agents receive high-dimensional inputs at each step, and make actions according to deep-neural-network-based policies. This learning mechanism updates the policy to maximize the return with an end-to-end method. In this paper, we survey the progress of DRL methods, including value-based, policy gradient, and model-based algorithms, and compare their main techniques and properties. Besides, DRL plays an important role in game artificial intelligence (AI). We also take a review of the achievements of DRL in various video games, including classical Arcade games, first-person perspective games and multi-agent real-time strategy games, from 2D to 3D, and from single-agent to multi-agent. A large number of video game AIs with DRL have achieved super-human performance, while there are still some challenges in this domain. Therefore, we also discuss some key points when applying DRL methods to this field, including exploration-exploitation, sample efficiency, generalization and transfer, multi-agent learning, imperfect information, and delayed spare rewards, as well as some research directions.

94 citations


Proceedings ArticleDOI
TL;DR: In this article, the authors propose a model, CoLight, which uses graph attentional networks to facilitate communication between traffic signals in a large-scale road network with hundreds of traffic signals and demonstrate that the proposed model can achieve superior performance against the state-of-the-art methods.
Abstract: Cooperation among the traffic signals enables vehicles to move through intersections more quickly. Conventional transportation approaches implement cooperation by pre-calculating the offsets between two intersections. Such pre-calculated offsets are not suitable for dynamic traffic environments. To enable cooperation of traffic signals, in this paper, we propose a model, CoLight, which uses graph attentional networks to facilitate communication. Specifically, for a target intersection in a network, CoLight can not only incorporate the temporal and spatial influences of neighboring intersections to the target intersection, but also build up index-free modeling of neighboring intersections. To the best of our knowledge, we are the first to use graph attentional networks in the setting of reinforcement learning for traffic signal control and to conduct experiments on the large-scale road network with hundreds of traffic signals. In experiments, we demonstrate that by learning the communication, the proposed model can achieve superior performance against the state-of-the-art methods.

73 citations


Posted Content
TL;DR: A deep recurrent multiagent actor-critic framework (R-MADDPG) for handling multiagent coordination under partial observable set-tings and limited communication and demonstrates that the resulting framework learns time dependencies for sharing missing observations, handling resource limitations, and developing different communication patterns among agents.
Abstract: There are several real-world tasks that would benefit from applying multiagent reinforcement learning (MARL) algorithms, including the coordination among self-driving cars. The real world has challenging conditions for multiagent learning systems, such as its partial observable and nonstationary nature. Moreover, if agents must share a limited resource (e.g. network bandwidth) they must all learn how to coordinate resource use. This paper introduces a deep recurrent multiagent actor-critic framework (R-MADDPG) for handling multiagent coordination under partial observable set-tings and limited communication. We investigate recurrency effects on performance and communication use of a team of agents. We demonstrate that the resulting framework learns time dependencies for sharing missing observations, handling resource limitations, and developing different communication patterns among agents.

56 citations


Posted Content
TL;DR: This work established that agents cluster around a network centroid and proceeded to study the dynamics of this point, and established expected descent in non-convex environments in the large-gradient regime and introduced a short-term model to examine the dynamics over finite-time horizons.
Abstract: The diffusion strategy for distributed learning from streaming data employs local stochastic gradient updates along with exchange of iterates over neighborhoods. In Part I [2] of this work we established that agents cluster around a network centroid and proceeded to study the dynamics of this point. We established expected descent in non-convex environments in the large-gradient regime and introduced a short-term model to examine the dynamics over finite-time horizons. Using this model, we establish in this work that the diffusion strategy is able to escape from strict saddle-points in O(1/$\mu$) iterations; it is also able to return approximately second-order stationary points in a polynomial number of iterations. Relative to prior works on the polynomial escape from saddle-points, most of which focus on centralized perturbed or stochastic gradient descent, our approach requires less restrictive conditions on the gradient noise process.

48 citations


Posted Content
TL;DR: An artificial intelligence research environment inspired by the human game genre of MMORPGs is presented, that aims to simulate this setting in microcosm and reveals that population size magnifies and incentivizes the development of skillful behaviors and results in agents that outcompete agents trained in smaller populations.
Abstract: The emergence of complex life on Earth is often attributed to the arms race that ensued from a huge number of organisms all competing for finite resources. We present an artificial intelligence research environment, inspired by the human game genre of MMORPGs (Massively Multiplayer Online Role-Playing Games, a.k.a. MMOs), that aims to simulate this setting in microcosm. As with MMORPGs and the real world alike, our environment is persistent and supports a large and variable number of agents. Our environment is well suited to the study of large-scale multiagent interaction: it requires that agents learn robust combat and navigation policies in the presence of large populations attempting to do the same. Baseline experiments reveal that population size magnifies and incentivizes the development of skillful behaviors and results in agents that outcompete agents trained in smaller populations. We further show that the policies of agents with unshared weights naturally diverge to fill different niches in order to avoid competition.

45 citations


Proceedings ArticleDOI
TL;DR: Experiments show that the proposed decentralized execution order-dispatching method outperforms the baselines in terms of accumulated driver income (ADI) and Order Response Rate (ORR) in various traffic environments.
Abstract: Improving the efficiency of dispatching orders to vehicles is a research hotspot in online ride-hailing systems. Most of the existing solutions for order-dispatching are centralized controlling, which require to consider all possible matches between available orders and vehicles. For large-scale ride-sharing platforms, there are thousands of vehicles and orders to be matched at every second which is of very high computational cost. In this paper, we propose a decentralized execution order-dispatching method based on multi-agent reinforcement learning to address the large-scale order-dispatching problem. Different from the previous cooperative multi-agent reinforcement learning algorithms, in our method, all agents work independently with the guidance from an evaluation of the joint policy since there is no need for communication or explicit cooperation between agents. Furthermore, we use KL-divergence optimization at each time step to speed up the learning process and to balance the vehicles (supply) and orders (demand). Experiments on both the explanatory environment and real-world simulator show that the proposed method outperforms the baselines in terms of accumulated driver income (ADI) and Order Response Rate (ORR) in various traffic environments. Besides, with the support of the online platform of Didi Chuxing, we designed a hybrid system to deploy our model.

43 citations


Posted Content
TL;DR: ABIDES, an Agent-Based Interactive Discrete Event Simulation environment designed from the ground up to support AI agent research in market applications is introduced and its use and configuration is illustrated with sample code, validating the environment with example trading scenarios.
Abstract: We introduce ABIDES, an Agent-Based Interactive Discrete Event Simulation environment. ABIDES is designed from the ground up to support AI agent research in market applications. While simulations are certainly available within trading firms for their own internal use, there are no broadly available high-fidelity market simulation environments. We hope that the availability of such a platform will facilitate AI research in this important area. ABIDES currently enables the simulation of tens of thousands of trading agents interacting with an exchange agent to facilitate transactions. It supports configurable pairwise network latencies between each individual agent as well as the exchange. Our simulator's message-based design is modeled after NASDAQ's published equity trading protocols ITCH and OUCH. We introduce the design of the simulator and illustrate its use and configuration with sample code, validating the environment with example trading scenarios. The utility of ABIDES is illustrated through experiments to develop a market impact model. We close with discussion of future experimental problems it can be used to explore, such as the development of ML-based trading algorithms.

41 citations


Posted Content
Tom Eccles1, Yoram Bachrach1, Guy Lever1, Angeliki Lazaridou1, Thore Graepel1 
TL;DR: This work introduces inductive biases for positive signalling and positive listening, which ease the learning problem in emergent communication and applies these methods to a more extended environment, showing that agents with these inductive bias achieve better performance.
Abstract: We study the problem of emergent communication, in which language arises because speakers and listeners must communicate information in order to solve tasks. In temporally extended reinforcement learning domains, it has proved hard to learn such communication without centralized training of agents, due in part to a difficult joint exploration problem. We introduce inductive biases for positive signalling and positive listening, which ease this problem. In a simple one-step environment, we demonstrate how these biases ease the learning problem. We also apply our methods to a more extended environment, showing that agents with these inductive biases achieve better performance, and analyse the resulting communication protocols.

37 citations


Posted Content
TL;DR: In this paper, Feudal Multi-Agent Hierarchies (FMH) is proposed to learn to cooperate in reinforcement learning, in which a manager is tasked with maximising the environmentally-determined reward function, and workers are rewarded for achieving managerial subgoals.
Abstract: We investigate how reinforcement learning agents can learn to cooperate. Drawing inspiration from human societies, in which successful coordination of many individuals is often facilitated by hierarchical organisation, we introduce Feudal Multi-agent Hierarchies (FMH). In this framework, a 'manager' agent, which is tasked with maximising the environmentally-determined reward function, learns to communicate subgoals to multiple, simultaneously-operating, 'worker' agents. Workers, which are rewarded for achieving managerial subgoals, take concurrent actions in the world. We outline the structure of FMH and demonstrate its potential for decentralised learning and control. We find that, given an adequate set of subgoals from which to choose, FMH performs, and particularly scales, substantially better than cooperative approaches that use a shared reward function.

Posted Content
TL;DR: A novel Spatio-Temporal Multi-Agent Reinforcement Learning (STMARL) framework for effectively capturing the spatio-temporal dependency of multiple related traffic lights and control these traffic lights in a coordinating way is proposed.
Abstract: The development of intelligent traffic light control systems is essential for smart transportation management. While some efforts have been made to optimize the use of individual traffic lights in an isolated way, related studies have largely ignored the fact that the use of multi-intersection traffic lights is spatially influenced and there is a temporal dependency of historical traffic status for current traffic light control. To that end, in this paper, we propose a novel SpatioTemporal Multi-Agent Reinforcement Learning (STMARL) framework for effectively capturing the spatio-temporal dependency of multiple related traffic lights and control these traffic lights in a coordinating way. Specifically, we first construct the traffic light adjacency graph based on the spatial structure among traffic lights. Then, historical traffic records will be integrated with current traffic status via Recurrent Neural Network structure. Moreover, based on the temporally-dependent traffic information, we design a Graph Neural Network based model to represent relationships among multiple traffic lights, and the decision for each traffic light will be made in a distributed way by the deep Q-learning method. Finally, the experimental results on both synthetic and real-world data have demonstrated the effectiveness of our STMARL framework, which also provides an insightful understanding of the influence mechanism among multi-intersection traffic lights.

Posted Content
TL;DR: The authors empirically investigate the representational power of various network architectures on a series of one-shot games, and quantify how well various approaches can represent the requisite value functions, and help us identify issues that can impede good performance.
Abstract: Recent years have seen the application of deep reinforcement learning techniques to cooperative multi-agent systems, with great empirical success. However, given the lack of theoretical insight, it remains unclear what the employed neural networks are learning, or how we should enhance their representational power to address the problems on which they fail. In this work, we empirically investigate the representational power of various network architectures on a series of one-shot games. Despite their simplicity, these games capture many of the crucial problems that arise in the multi-agent setting, such as an exponential number of joint actions or the lack of an explicit coordination mechanism. Our results quantify how well various approaches can represent the requisite value functions, and help us identify issues that can impede good performance.

Journal ArticleDOI
TL;DR: A new taxonomy is developed which classifies multi-objective multi-agent decision making settings, on the basis of the reward structures, and which and how utility functions are applied, and defines and discusses these solution concepts under both ESR and SER optimisation criteria.
Abstract: The majority of multi-agent system (MAS) implementations aim to optimise agents' policies with respect to a single objective, despite the fact that many real-world problem domains are inherently multi-objective in nature. Multi-objective multi-agent systems (MOMAS) explicitly consider the possible trade-offs between conflicting objective functions. We argue that, in MOMAS, such compromises should be analysed on the basis of the utility that these compromises have for the users of a system. As is standard in multi-objective optimisation, we model the user utility using utility functions that map value or return vectors to scalar values. This approach naturally leads to two different optimisation criteria: expected scalarised returns (ESR) and scalarised expected returns (SER). We develop a new taxonomy which classifies multi-objective multi-agent decision making settings, on the basis of the reward structures, and which and how utility functions are applied. This allows us to offer a structured view of the field, to clearly delineate the current state-of-the-art in multi-objective multi-agent decision making approaches and to identify promising directions for future research. Starting from the execution phase, in which the selected policies are applied and the utility for the users is attained, we analyse which solution concepts apply to the different settings in our taxonomy. Furthermore, we define and discuss these solution concepts under both ESR and SER optimisation criteria. We conclude with a summary of our main findings and a discussion of many promising future research directions in multi-objective multi-agent systems.

Posted Content
Tom Eccles1, Edward Hughes1, János Kramár1, Steven Wheelwright1, Joel Z. Leibo1 
TL;DR: This work presents a general online reinforcement learning algorithm that displays reciprocal behavior towards its co-players and shows that it can induce pro-social outcomes for the wider group when learning alongside selfish agents, both in a $2$-player Markov game, and in intertemporal social dilemmas.
Abstract: Reciprocity is an important feature of human social interaction and underpins our cooperative nature. What is more, simple forms of reciprocity have proved remarkably resilient in matrix game social dilemmas. Most famously, the tit-for-tat strategy performs very well in tournaments of Prisoner's Dilemma. Unfortunately this strategy is not readily applicable to the real world, in which options to cooperate or defect are temporally and spatially extended. Here, we present a general online reinforcement learning algorithm that displays reciprocal behavior towards its co-players. We show that it can induce pro-social outcomes for the wider group when learning alongside selfish agents, both in a $2$-player Markov game, and in $5$-player intertemporal social dilemmas. We analyse the resulting policies to show that the reciprocating agents are strongly influenced by their co-players' behavior.

Proceedings ArticleDOI
TL;DR: It is demonstrated that CoRide provides superior performance in terms of platform revenue and user experience in the task of city-wide hybrid order dispatching and fleet management over strong baselines.
Abstract: How to optimally dispatch orders to vehicles and how to tradeoff between immediate and future returns are fundamental questions for a typical ride-hailing platform. We model ride-hailing as a large-scale parallel ranking problem and study the joint decision-making task of order dispatching and fleet management in online ride-hailing platforms. This task brings unique challenges in the following four aspects. First, to facilitate a huge number of vehicles to act and learn efficiently and robustly, we treat each region cell as an agent and build a multi-agent reinforcement learning framework. Second, to coordinate the agents from different regions to achieve long-term benefits, we leverage the geographical hierarchy of the region grids to perform hierarchical reinforcement learning. Third, to deal with the heterogeneous and variant action space for joint order dispatching and fleet management, we design the action as the ranking weight vector to rank and select the specific order or the fleet management destination in a unified formulation. Fourth, to achieve the multi-scale ride-hailing platform, we conduct the decision-making process in a hierarchical way where a multi-head attention mechanism is utilized to incorporate the impacts of neighbor agents and capture the key agent in each scale. The whole novel framework is named as CoRide. Extensive experiments based on multiple cities real-world data as well as analytic synthetic data demonstrate that CoRide provides superior performance in terms of platform revenue and user experience in the task of city-wide hybrid order dispatching and fleet management over strong baselines.

Journal ArticleDOI
TL;DR: In this article, the authors present the implementation of a decentralized control framework, which was developed previously, in a scaled-city using robotic CAVs, and discuss the implications of CAVs on travel time.
Abstract: The implementation of connected and automated vehicle (CAV) technologies enables a novel computational framework to deliver real-time control actions that optimize travel time, energy, and safety. Hardware is an integral part of any practical implementation of CAVs, and as such, it should be incorporated in any validation method. However, high costs associated with full scale, field testing of CAVs have proven to be a significant barrier. In this paper, we present the implementation of a decentralized control framework, which was developed previously, in a scaled-city using robotic CAVs, and discuss the implications of CAVs on travel time. Supplemental information and videos can be found at this https URL.

Journal ArticleDOI
TL;DR: In this paper, the robust time-varying formation design problems for second-order multi-agent systems subjected to external disturbances are investigated, and the robustness property of the tracked systems is analyzed.
Abstract: Robust time-varying formation design problems for second-order multi-agent systems subjected to external disturbances are investigated. Firstly, by constructing an extended state observer, the disturbance compensation is estimated, which is a critical term in the proposed robust time-varying formation control protocol. Then, an explicit expression of the formation center function is determined and impacts of disturbance compensations on the formation center function are presented. With the formation feasibility conditions, robust time-varying formation design criteria are derived to determine the gain matrix of the formation control protocol by utilizing the algebraic Riccati equation technique. Furthermore, the tracking performance and the robustness property of multi-agent systems are analyzed. Finally, the numerical simulation is provided to illustrate the effectiveness of theoretical results.

Posted Content
TL;DR: In this article, the authors proposed a cooperative driving of connected and automated vehicles (CAVs) at conflict areas (e.g., non-signalized intersections and ramping regions).
Abstract: This paper studies the cooperative driving of connected and automated vehicles (CAVs) at conflict areas (e.g., non-signalized intersections and ramping regions). Due to safety concerns, most existing studies prohibit lane change since this may cause lateral collisions when coordination is not appropriately performed. However, in many traffic scenarios (e.g., work zones), vehicles must change lanes. To solve this problem, we categorize the potential collision into two kinds and thus establish a bi-level planning problem. The right-of-way of vehicles for the critical conflict zone is considered in the upper-level, and the right-of-way of vehicles during lane changes is then resolved in the lower-level. The solutions of the upper-level problem are represented in tree space, and a near-optimal solution is searched for by combining Monte Carlo Tree Search (MCTS) with some heuristic rules within a very short planning time. The proposed strategy is suitable for not only the shortest delay objective but also other objectives (e.g., energy-saving and passenger comfort). Numerical examples show that the proposed strategy leads to good traffic performance in real-time.

Posted Content
TL;DR: In this paper, the authors explore how actor-critic methods in deep reinforcement learning, in particular Asynchronous Advantage Actor-Critic (A3C), can be extended with agent modeling.
Abstract: In this paper we explore how actor-critic methods in deep reinforcement learning, in particular Asynchronous Advantage Actor-Critic (A3C), can be extended with agent modeling. Inspired by recent works on representation learning and multiagent deep reinforcement learning, we propose two architectures to perform agent modeling: the first one based on parameter sharing, and the second one based on agent policy features. Both architectures aim to learn other agents' policies as auxiliary tasks, besides the standard actor (policy) and critic (values). We performed experiments in both cooperative and competitive domains. The former is a problem of coordinated multiagent object transportation and the latter is a two-player mini version of the Pommerman game. Our results show that the proposed architectures stabilize learning and outperform the standard A3C architecture when learning a best response in terms of expected rewards.

Posted Content
TL;DR: This paper shows that the system using Resource Abstraction significantly improves the learning speed and scalability, and achieves the highest possible or near-highest joint performance/social welfare for both congestion problems in large-scale scenarios involving up to 1000 reinforcement learning agents.
Abstract: Real-world congestion problems (e.g. traffic congestion) are typically very complex and large-scale. Multiagent reinforcement learning (MARL) is a promising candidate for dealing with this emerging complexity by providing an autonomous and distributed solution to these problems. However, there are three limiting factors that affect the deployability of MARL approaches to congestion problems. These are learning time, scalability and decentralised coordination i.e. no communication between the learning agents. In this paper we introduce Resource Abstraction, an approach that addresses these challenges by allocating the available resources into abstract groups. This abstraction creates new reward functions that provide a more informative signal to the learning agents and aid the coordination amongst them. Experimental work is conducted on two benchmark domains from the literature, an abstract congestion problem and a realistic traffic congestion problem. The current state-of-the-art for solving multiagent congestion problems is a form of reward shaping called difference rewards. We show that the system using Resource Abstraction significantly improves the learning speed and scalability, and achieves the highest possible or near-highest joint performance/social welfare for both congestion problems in large-scale scenarios involving up to 1000 reinforcement learning agents.

Posted Content
TL;DR: Arena as mentioned in this paper is a general evaluation platform for multi-agent intelligence with 35 games of diverse logics and representations, and it provides a building toolkit for researchers to easily invent and build novel multiagent problems from the provided game set.
Abstract: Learning agents that are not only capable of taking tests, but also innovating is becoming a hot topic in AI. One of the most promising paths towards this vision is multi-agent learning, where agents act as the environment for each other, and improving each agent means proposing new problems for others. However, existing evaluation platforms are either not compatible with multi-agent settings, or limited to a specific game. That is, there is not yet a general evaluation platform for research on multi-agent intelligence. To this end, we introduce Arena, a general evaluation platform for multi-agent intelligence with 35 games of diverse logics and representations. Furthermore, multi-agent intelligence is still at the stage where many problems remain unexplored. Therefore, we provide a building toolkit for researchers to easily invent and build novel multi-agent problems from the provided game set based on a GUI-configurable social tree and five basic multi-agent reward schemes. Finally, we provide Python implementations of five state-of-the-art deep multi-agent reinforcement learning baselines. Along with the baseline implementations, we release a set of 100 best agents/teams that we can train with different training schemes for each game, as the base for evaluating agents with population performance. As such, the research community can perform comparisons under a stable and uniform standard. All the implementations and accompanied tutorials have been open-sourced for the community at this https URL.

Posted Content
TL;DR: A two-stage framework which incorporates a combinatorial optimization and multi-agent deep reinforcement learning methods is established and is able to remarkably improve system performances.
Abstract: Ride-sourcing services are now reshaping the way people travel by effectively connecting drivers and passengers through mobile internets. Online matching between idle drivers and waiting passengers is one of the most key components in a ride-sourcing system. The average pickup distance or time is an important measurement of system efficiency since it affects both passengers' waiting time and drivers' utilization rate. It is naturally expected that a more effective bipartite matching (with smaller average pickup time) can be implemented if the platform accumulates more idle drivers and waiting passengers in the matching pool. A specific passenger request can also benefit from a delayed matching since he/she may be matched with closer idle drivers after waiting for a few seconds. Motivated by the potential benefits of delayed matching, this paper establishes a two-stage framework which incorporates a combinatorial optimization and multi-agent deep reinforcement learning methods. The multi-agent reinforcement learning methods are used to dynamically determine the delayed time for each passenger request (or the time at which each request enters the matching pool), while the combinatorial optimization conducts an optimal bipartite matching between idle drivers and waiting passengers in the matching pool. Two reinforcement learning methods, spatio-temporal multi-agent deep Q learning (ST-M-DQN) and spatio-temporal multi-agent actor-critic (ST-M-A2C) are developed. Through extensive empirical experiments with a well-designed simulator, we show that the proposed framework is able to remarkably improve system performances.

Posted Content
TL;DR: This team is composed of 2 neural networks trained with state of the art deep reinforcement learning algorithms and makes use of concepts like reward shaping, curriculum learning, and an automatic reasoning module for action pruning and presents a collection of open-sourced agents that can be used for training and testing in the Pommerman environment.
Abstract: The Pommerman Team Environment is a recently proposed benchmark which involves a multi-agent domain with challenges such as partial observability, decentralized execution (without communication), and very sparse and delayed rewards. The inaugural Pommerman Team Competition held at NeurIPS 2018 hosted 25 participants who submitted a team of 2 agents. Our submission nn_team_skynet955_skynet955 won 2nd place of the "learning agents'' category. Our team is composed of 2 neural networks trained with state of the art deep reinforcement learning algorithms and makes use of concepts like reward shaping, curriculum learning, and an automatic reasoning module for action pruning. Here, we describe these elements and additionally we present a collection of open-sourced agents that can be used for training and testing in the Pommerman environment. Code available at: this https URL

Posted Content
Andrea Tacchetti, DJ Strouse, Marta Garnelo, Thore Graepel, Yoram Bachrach1 
TL;DR: This work proposes a deep learning based approach to automatically design auctions in a wide variety of domains, shifting the design work from human to machine, and focuses on producing truthful and efficient auctions that minimize the economic burden on participants.
Abstract: Auctions are protocols to allocate goods to buyers who have preferences over them, and collect payments in return Economists have invested significant effort in designing auction rules that result in allocations of the goods that are desirable for the group as a whole However, for settings where participants' valuations of the items on sale are their private information, the rules of the auction must deter buyers from misreporting their preferences, so as to maximize their own utility, since misreported preferences hinder the ability for the auctioneer to allocate goods to those who want them most Manual auction design has yielded excellent mechanisms for specific settings, but requires significant effort when tackling new domains We propose a deep learning based approach to automatically design auctions in a wide variety of domains, shifting the design work from human to machine We assume that participants' valuations for the items for sale are independently sampled from an unknown but fixed distribution Our system receives a data-set consisting of such valuation samples, and outputs an auction rule encoding the desired incentive structure We focus on producing truthful and efficient auctions that minimize the economic burden on participants We evaluate the auctions designed by our framework on well-studied domains, such as multi-unit and combinatorial auctions, showing that they outperform known auction designs in terms of the economic burden placed on participants

Posted Content
TL;DR: A novel network architecture, named Action Semantics Network (ASN), is proposed that characterizes different actions' influence on other agents using neural networks based on the action semantics between them and can be easily combined with existing deep reinforcement learning algorithms to boost their performance.
Abstract: In multiagent systems (MASs), each agent makes individual decisions but all of them contribute globally to the system evolution. Learning in MASs is difficult since each agent's selection of actions must take place in the presence of other co-learning agents. Moreover, the environmental stochasticity and uncertainties increase exponentially with the increase in the number of agents. Previous works borrow various multiagent coordination mechanisms into deep learning architecture to facilitate multiagent coordination. However, none of them explicitly consider action semantics between agents that different actions have different influences on other agents. In this paper, we propose a novel network architecture, named Action Semantics Network (ASN), that explicitly represents such action semantics between agents. ASN characterizes different actions' influence on other agents using neural networks based on the action semantics between them. ASN can be easily combined with existing deep reinforcement learning (DRL) algorithms to boost their performance. Experimental results on StarCraft II micromanagement and Neural MMO show ASN significantly improves the performance of state-of-the-art DRL approaches compared with several network architectures.

Posted Content
TL;DR: A gating mechanism to adaptively prune unprofitable messages and outperforms several state-of-the-art DRL-based and rule-based methods by a large margin in both the real-world packet routing tasks and four benchmark tasks.
Abstract: Communication is an important factor for the big multi-agent world to stay organized and productive. Recently, the AI community has applied the Deep Reinforcement Learning (DRL) to learn the communication strategy and the control policy for multiple agents. However, when implementing the communication for real-world multi-agent applications, there is a more practical limited-bandwidth restriction, which has been largely ignored by the existing DRL-based methods. Specifically, agents trained by most previous methods keep sending messages incessantly in every control cycle; due to emitting too many messages, these methods are unsuitable to be applied to the real-world systems that have a limited bandwidth to transmit the messages. To handle this problem, we propose a gating mechanism to adaptively prune unprofitable messages. Results show that the gating mechanism can prune more than 80% messages with little damage to the performance. Moreover, our method outperforms several state-of-the-art DRL-based and rule-based methods by a large margin in both the real-world packet routing tasks and four benchmark tasks.

Posted Content
TL;DR: This article reviews a game theoretical approach to address this issue, where reinforcement learning is employed to predict the time-extended interaction dynamics, and proposes a computationally feasible approach to simultaneously model multiple humans as decision makers.
Abstract: Predicting the outcomes of cyber-physical systems with multiple human interactions is a challenging problem. This article reviews a game theoretical approach to address this issue, where reinforcement learning is employed to predict the time-extended interaction dynamics. We explain that the most attractive feature of the method is proposing a computationally feasible approach to simultaneously model multiple humans as decision makers, instead of determining the decision dynamics of the intelligent agent of interest and forcing the others to obey certain kinematic and dynamic constraints imposed by the environment. We present two recent exploitations of the method to model 1) unmanned aircraft integration into the National Airspace System and 2) highway traffic. We conclude the article by providing ongoing and future work about employing, improving and validating the method. We also provide related open problems and research opportunities.

Posted Content
TL;DR: The simulation results show that by providing battery swapping infrastructure the performance indicators of SAEV service are significantly improved, and faster charging infrastructure and optimized placement of charging locations in order to minimize distances between demand hubs and charging stations result in a higher performance.
Abstract: Shared autonomous vehicles (SAVs) are the next major evolution in urban mobility. This technology has attracted much interest of car manufacturers aiming at playing a role as transportation network companies (TNCs) in order to gain benefits per kilometer and per ride. The majority of future SAVs will most probably be electric. It is therefore important to understand how limited vehicle range and the configuration of charging infrastructure will affect the performance of shared autonomous electric vehicle (SAEV) services. We aim to explore the impacts of charging station placement, charging type (rapid charging, battery swapping) as well as vehicle range onto service efficiency and customer experience in terms of service availability and response time. We perform an agent-based simulation of SAEVs across the Rouen Normandie metropolitan area in France. The simulation process features impact assessment by considering dynamic demand responsive to the network and traffic. Research results suggest that the performance of SAEVs is strongly correlated to the charging infrastructure. Importantly, faster charging infrastructure and optimized placement of charging locations in order to minimize distances between demand hubs and charging stations result in a higher performance. Further analysis indicates the importance of dispersing charging stations across the service area and how this affects service effectiveness. The results also underline that SAEV battery capacity has to be carefully selected to avoid the overlaps between demand and charging peak times. Finally, the simulation results show that by providing battery swapping infrastructure the performance indicators of SAEV service are significantly improved.

Posted Content
TL;DR: In this article, the authors investigate the evaluation of learned multiagent strategies in the incomplete information setting, which plays a critical role in ranking and training of agents, and derive sample complexity guarantees required to confidently rank agents in this setting.
Abstract: This paper investigates the evaluation of learned multiagent strategies in the incomplete information setting, which plays a critical role in ranking and training of agents. Traditionally, researchers have relied on Elo ratings for this purpose, with recent works also using methods based on Nash equilibria. Unfortunately, Elo is unable to handle intransitive agent interactions, and other techniques are restricted to zero-sum, two-player settings or are limited by the fact that the Nash equilibrium is intractable to compute. Recently, a ranking method called {\alpha}-Rank, relying on a new graph-based game-theoretic solution concept, was shown to tractably apply to general games. However, evaluations based on Elo or {\alpha}-Rank typically assume noise-free game outcomes, despite the data often being collected from noisy simulations, making this assumption unrealistic in practice. This paper investigates multiagent evaluation in the incomplete information regime, involving general-sum many-player games with noisy outcomes. We derive sample complexity guarantees required to confidently rank agents in this setting. We propose adaptive algorithms for accurate ranking, provide correctness and sample complexity guarantees, then introduce a means of connecting uncertainties in noisy match outcomes to uncertainties in rankings. We evaluate the performance of these approaches in several domains, including Bernoulli games, a soccer meta-game, and Kuhn poker.