
# Stochastic game

About: Stochastic game is a research topic. Over its lifetime, 9,493 publications have been published within this topic, receiving 202,664 citations.

##### Papers published on a yearly basis

##### Papers



Graz University of Technology^{1}, University of Milan^{2}, Hebrew University of Jerusalem^{3}, AT&T^{4}

TL;DR: A solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs.

Abstract: In the multiarmed bandit problem, a gambler must decide which arm of K nonidentical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines.
In this work, we make no statistical assumptions whatsoever about the nature of the process generating the payoffs of the slot machines. We give a solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. In a sequence of T plays, we prove that the per-round payoff of our algorithm approaches that of the best arm at the rate O(T^{-1/2}). We show by a matching lower bound that this is the best possible.
We also prove that our algorithm approaches the per-round payoff of any set of strategies at a similar rate: if the best strategy is chosen from a pool of N strategies, then our algorithm approaches the per-round payoff of the best strategy at the rate O((log N)^{1/2} T^{-1/2}). Finally, we apply our results to the problem of playing an unknown repeated matrix game. We show that our algorithm approaches the minimax payoff of the unknown game at the rate O(T^{-1/2}).

2,082 citations
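The adversarial setting described in this abstract is the one solved by exponential-weights schemes in the spirit of Exp3. A minimal sketch of that idea, not the paper's exact algorithm: the mixing parameter `gamma`, the rescaling step, and the example payoff function are all illustrative choices.

```python
import math
import random

def exp3(K, T, payoff, gamma=0.1, seed=0):
    """Exponential-weights bandit play against arbitrary payoffs.

    payoff(t, arm) -> reward in [0, 1]; may be chosen by an adversary.
    Returns the total reward collected over T plays.
    """
    rng = random.Random(seed)
    weights = [1.0] * K
    total = 0.0
    for t in range(T):
        wsum = sum(weights)
        # mix the exponential-weights distribution with uniform exploration
        probs = [(1 - gamma) * w / wsum + gamma / K for w in weights]
        arm = rng.choices(range(K), weights=probs)[0]
        x = payoff(t, arm)
        total += x
        # importance-weighted estimate keeps the payoff estimates unbiased
        weights[arm] *= math.exp(gamma * (x / probs[arm]) / K)
        # rescale to avoid floating-point overflow
        m = max(weights)
        weights = [w / m for w in weights]
    return total
```

Run against a fixed (oblivious) adversary, e.g. `exp3(3, 2000, lambda t, arm: 1.0 if arm == 1 else 0.2)`, the per-round payoff drifts toward that of the best arm, consistent with the O(T^{-1/2}) rate quoted above.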


12 Dec 2012

TL;DR: In this article, the authors focus on regret analysis in the context of multi-armed bandit problems, where the exploration-exploitation trade-off is the balance between staying with the option that gave the highest payoffs in the past and exploring new options that might give higher payoffs in the future.

Abstract: A multi-armed bandit problem - or, simply, a bandit problem - is a sequential allocation problem defined by a set of actions. At each time step, a unit resource is allocated to an action and some observable payoff is obtained. The goal is to maximize the total payoff obtained in a sequence of allocations. The name bandit refers to the colloquial term for a slot machine (a "one-armed bandit" in American slang). In a casino, a sequential allocation problem is obtained when the player is facing many slot machines at once (a "multi-armed bandit"), and must repeatedly choose where to insert the next coin. Multi-armed bandit problems are the most basic examples of sequential decision problems with an exploration-exploitation trade-off. This is the balance between staying with the option that gave highest payoffs in the past and exploring new options that might give higher payoffs in the future. Although the study of bandit problems dates back to the 1930s, exploration-exploitation trade-offs arise in several modern applications, such as ad placement, website optimization, and packet routing. Mathematically, a multi-armed bandit is defined by the payoff process associated with each option. In this book, the focus is on two extreme cases in which the analysis of regret is particularly simple and elegant: independent and identically distributed payoffs and adversarial payoffs. Besides the basic setting of finitely many actions, it also analyzes some of the most important variants and extensions, such as the contextual bandit model. This monograph is an ideal reference for students and researchers with an interest in bandit problems.

2,061 citations
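For the independent and identically distributed payoff case that the monograph analyzes, the classical UCB1 index rule is the standard illustration of the exploration-exploitation balance. A minimal sketch with Bernoulli arms; the arm means and horizon below are made-up inputs:

```python
import math
import random

def ucb1(means, T, seed=0):
    """UCB1 for stochastic bandits with Bernoulli arms.

    means: true success probabilities, unknown to the algorithm.
    Returns the number of times each arm was pulled over T rounds.
    """
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K
    sums = [0.0] * K
    # pull each arm once to initialize its estimate
    for a in range(K):
        counts[a] = 1
        sums[a] = float(rng.random() < means[a])
    for t in range(K, T):
        # optimism in the face of uncertainty: empirical mean + bonus
        ucb = [sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a])
               for a in range(K)]
        a = max(range(K), key=lambda i: ucb[i])
        counts[a] += 1
        sums[a] += float(rng.random() < means[a])
    return counts
```

With e.g. `ucb1([0.2, 0.8, 0.5], 5000)`, the best arm accumulates the vast majority of pulls, since suboptimal arms are sampled only logarithmically often.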


TL;DR: This article showed that the Folk Theorem with discounting always holds in infinitely repeated two-player games, and that with more players it holds under a full-dimensionality condition.

Abstract: When either there are only two players or a "full dimensionality" condition holds, any individually rational payoff vector of a one-shot game of complete information can arise in a perfect equilibrium of the infinitely-repeated game if players are sufficiently patient. In contrast to earlier work, mixed strategies are allowed in determining the individually rational payoffs (even when only realized actions are observable). Any individually rational payoffs of a one-shot game can be approximated by sequential equilibrium payoffs of a long but finite game of incomplete information, where players' payoffs are almost certainly as in the one-shot game. That strategic rivalry in a long-term relationship may differ from that of a one-shot game is by now quite a familiar idea. Repeated play allows players to respond to each other's actions, and so each player must consider the reactions of his opponents in making his decision. The fear of retaliation may thus lead to outcomes that otherwise would not occur. The most dramatic expression of this phenomenon is the celebrated "Folk Theorem" for repeated games. An outcome that Pareto dominates the minimax point is called individually rational. The Folk Theorem asserts that any individually rational outcome can arise as a Nash equilibrium in infinitely repeated games with sufficiently little discounting. As Aumann and Shapley [3] and Rubinstein [20] have shown, the same result is true when we replace the word "Nash" by "(subgame) perfect" and assume no discounting at all. Because the Aumann-Shapley/Rubinstein result supposes literally no discounting, one may wonder whether the exact counterpart of the Folk Theorem holds for perfect equilibrium, i.e., whether as the discount factor tends to one, the set of perfect equilibrium outcomes converges to the individually rational set. After all, agents in most games of economic interest are not completely patient; the no discounting case is of interest as an approximation.
It turns out that this counterpart is false. There can be a discontinuity (formally, a failure of lower hemicontinuity) where the discount factor, δ, equals one, as we show in Example 3. Nonetheless the games in which discontinuities occur are quite degenerate, and, in the end, we can give a qualified "yes" (Theorem 2) to the question of whether the Folk Theorem holds with discounting. In particular, it always holds in two-player games (Theorem 1). This last result contrasts with the recent work of Radner-Myerson-Maskin [18] showing that, even in two-player games, the equilibrium set may not be continuous at δ = 1.

1,832 citations
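The individually rational ("minimax") point that the Folk Theorem is built around can be computed directly for small games. A sketch over pure strategies only; the theorem itself allows mixed punishments, though in the prisoner's dilemma used here for illustration the two coincide:

```python
def pure_minimax_payoffs(payoffs):
    """Pure-strategy punishment ("minimax") level for each player.

    payoffs[i][j] = (u1, u2) for row action i and column action j.
    """
    rows = range(len(payoffs))
    cols = range(len(payoffs[0]))
    # player 2 punishes player 1: minimize player 1's best response
    v1 = min(max(payoffs[i][j][0] for i in rows) for j in cols)
    # player 1 punishes player 2 symmetrically
    v2 = min(max(payoffs[i][j][1] for j in cols) for i in rows)
    return v1, v2

# Prisoner's dilemma, actions: 0 = cooperate, 1 = defect
pd = [[(3, 3), (0, 4)],
      [(4, 0), (1, 1)]]
# minimax point is (1, 1); any payoff vector Pareto-dominating it,
# such as the cooperative (3, 3), is individually rational and hence
# sustainable by sufficiently patient players under the Folk Theorem
```

This makes the abstract's terminology concrete: the set of payoffs the Folk Theorem can sustain is exactly the feasible set above `pure_minimax_payoffs(pd)`.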


TL;DR: In this paper, the authors consider dynamic choice behavior under conditions of uncertainty, with emphasis on the timing of the resolution of uncertainty, and provide an axiomatic treatment of the dynamic choice problem that is more general than the standard payoff-vector approach yet still permits tractable analysis.

Abstract: We consider dynamic choice behavior under conditions of uncertainty, with emphasis on the timing of the resolution of uncertainty. Choice behavior in which an individual distinguishes between lotteries based on the times at which their uncertainty resolves is axiomatized and represented; the result is choice behavior which cannot be represented by a single cardinal utility function on the vector of payoffs. Both descriptive and normative treatments of the problem are given and are shown to be equivalent. Various specializations are provided, including an extension of "separable" utility and representation by a single cardinal utility function. Consider the following idealization of a dynamic choice problem with uncertainty. At each time in a finite, discrete sequence of times t = 0, 1, ..., T, an individual must choose an action d_t. His choice is constrained by what we temporarily call the state at time t, x_t. Then some random event takes place, determining an immediate payoff z_t to the individual and the next state x_{t+1}. The probability distribution of the pair (z_t, x_{t+1}) is determined by the action d_t. The standard approach in analyzing this problem, which we will call the payoff vector approach, assumes that the individual's choice behavior is representable as follows: He has a von Neumann-Morgenstern utility function U defined on the vector of payoffs (z_0, z_1, ..., z_T). Each strategy (which is a contingent plan for choosing actions given states) induces a probability distribution on the vector of payoffs. So the individual's choice of action is that specified by any optimal strategy, i.e., any strategy which maximizes the expectation of utility among all strategies (assuming sufficient conditions so that an optimal strategy exists). This paper presents an axiomatic treatment of the dynamic choice problem which is more general than the payoff vector approach, but which still permits tractable analysis.
The fundamental difference between our treatment and the payoff vector approach lies in our treatment of the temporal resolution of uncertainty: In our models, uncertainty is "dated" by the time of its resolution, and the individual regards uncertainties resolving at different times as being different. For example, consider a situation in which a fair coin is to be flipped. If it comes up heads, the payoff vector will be (z_0, z_1) = (5, 10); if it is tails, the vector will be (5, 0). Because z_0 = 5 in either case, the coin flip can take place at either time 0 or time 1. It will not matter when the flip occurs to someone who has cardinal utility on the vector of payoffs. But an individual can obey our axioms and prefer either one to the other. One justification for our approach is the well known "timeless-temporal" or "joint time-risk" feature of some models (usually models which are not "complete"). For example, preferences on income streams which are induced from primitive preferences on consumption streams in general depend upon when the

1,649 citations
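The coin-flip example in this abstract can be made numeric: a utility that applies a certainty equivalent at the date uncertainty resolves distinguishes early from late resolution, while any vNM utility on the payoff vector cannot, since the distribution over the vector is identical. A sketch under assumed functional forms; the square-root certainty equivalent and additive time aggregation are illustrative choices, not the paper's axioms:

```python
import math

def ce(lottery):
    """Certainty equivalent under u(x) = sqrt(x): (E[sqrt(X)])**2."""
    return sum(p * math.sqrt(x) for p, x in lottery) ** 2

# Fair coin: heads -> payoff vector (z_0, z_1) = (5, 10),
#            tails -> (5, 0), each with probability 1/2.

# Late resolution (flip at time 1): collapse the time-1 lottery to its
# certainty equivalent, then add the known time-0 payoff.
late = 5 + ce([(0.5, 10), (0.5, 0)])       # 5 + 2.5 = 7.5

# Early resolution (flip at time 0): the whole stream is known once the
# coin lands, so the certainty equivalent is taken over total payoffs.
early = ce([(0.5, 5 + 10), (0.5, 5 + 0)])  # > late
```

Under this particular specification the agent strictly prefers early resolution; other aggregators can produce the opposite ranking, which is exactly the flexibility the axiomatization permits.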


TL;DR: Graph-theoretic ideas are used to analyze cooperation structures in games, and fair allocation rules are proven to be unique, closely related to the Shapley value, and stable for a wide class of games.

Abstract: Graph-theoretic ideas are used to analyze cooperation structures in games. Allocation rules, selecting a payoff for every possible cooperation structure, are studied for games in characteristic function form. Fair allocation rules are defined, and these are proven to be unique, closely related to the Shapley value, and stable for a wide class of games.

1,320 citations
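The fair allocation rule in this abstract (often called the Myerson value) is the Shapley value of the graph-restricted game, in which a coalition's worth is the sum of the worths of its connected components. A brute-force sketch for a hypothetical three-player game on a line graph; the characteristic function below is made up for illustration:

```python
import math
from itertools import permutations

def components(coalition, links):
    """Connected components of `coalition` under the cooperation graph."""
    coalition = set(coalition)
    adj = {p: set() for p in coalition}
    for a, b in links:
        if a in coalition and b in coalition:
            adj[a].add(b)
            adj[b].add(a)
    seen, comps = set(), []
    for p in coalition:
        if p in seen:
            continue
        comp, stack = set(), [p]
        while stack:
            q = stack.pop()
            if q not in comp:
                comp.add(q)
                stack.extend(adj[q] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def shapley(players, v):
    """Shapley value by averaging marginal contributions over all orders."""
    phi = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = []
        for p in order:
            phi[p] += v(coalition + [p]) - v(coalition)
            coalition.append(p)
    n_fact = math.factorial(len(players))
    return {p: x / n_fact for p, x in phi.items()}

def myerson_value(players, links, v):
    """Shapley value of the graph-restricted game."""
    def v_graph(S):
        return sum(v(c) for c in components(S, links))
    return shapley(players, v_graph)

# Hypothetical game: any connected coalition of two or more players
# produces 1; players 1-2 and 2-3 are linked, 1 and 3 are not.
v = lambda S: 1.0 if len(S) >= 2 else 0.0
mv = myerson_value([1, 2, 3], [(1, 2), (2, 3)], v)
# the middleman 2 is essential to every productive component, so the
# rule awards 2 a larger share (2/3) than 1 and 3 (1/6 each)
```

The brute-force enumeration over orderings is exponential in the number of players, which is fine for small illustrations; the paper's uniqueness and stability results do not depend on how the value is computed.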