Journal ArticleDOI

A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the Sum of Observations

01 Dec 1952-Annals of Mathematical Statistics (Institute of Mathematical Statistics)-Vol. 23, Iss: 4, pp 493-507
TL;DR: In this paper, it is shown that the likelihood ratio test for fixed sample size can be reduced to a test based on the sum of observations, and that, for large samples, a sample of size $n$ with the first test gives about the same probabilities of error as a sample of size $en$ with the second test.
Abstract: In many cases an optimum or computationally convenient test of a simple hypothesis $H_0$ against a simple alternative $H_1$ may be given in the following form. Reject $H_0$ if $S_n = \sum^n_{j=1} X_j \leqq k,$ where $X_1, X_2, \cdots, X_n$ are $n$ independent observations of a chance variable $X$ whose distribution depends on the true hypothesis and where $k$ is some appropriate number. In particular the likelihood ratio test for fixed sample size can be reduced to this form. It is shown that with each test of the above form there is associated an index $\rho$. If $\rho_1$ and $\rho_2$ are the indices corresponding to two alternative tests $e = \log \rho_1/\log \rho_2$ measures the relative efficiency of these tests in the following sense. For large samples, a sample of size $n$ with the first test will give about the same probabilities of error as a sample of size $en$ with the second test. To obtain the above result, use is made of the fact that $P(S_n \leqq na)$ behaves roughly like $m^n$ where $m$ is the minimum value assumed by the moment generating function of $X - a$. It is shown that if $H_0$ and $H_1$ specify probability distributions of $X$ which are very close to each other, one may approximate $\rho$ by assuming that $X$ is normally distributed.
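The quantity $m$ in the abstract, the minimum of the moment generating function of $X - a$, is easy to compute numerically. The following minimal sketch (ours, not from the paper) does so for an illustrative Bernoulli variable with arbitrarily chosen values of $p$, $a$, and $n$, and compares the resulting bound $m^n$ with a Monte Carlo estimate of $P(S_n \leqq na)$.

```python
# Minimal numerical sketch (not from the paper): approximate
# m = min_t E[exp(t(X - a))] for X ~ Bernoulli(p) on a grid of t values,
# and compare the bound m**n with a Monte Carlo estimate of P(S_n <= n*a).
# The values of p, a, n are arbitrary illustrations.
import numpy as np

p, a, n = 0.5, 0.4, 200

t = np.linspace(-20.0, 20.0, 400001)
mgf = (1 - p) * np.exp(-t * a) + p * np.exp(t * (1 - a))   # E[exp(t(X - a))]
m = mgf.min()
print("index m =", m, " bound m^n =", m ** n)

rng = np.random.default_rng(0)
S = rng.binomial(n, p, size=200_000)                        # draws of S_n
print("Monte Carlo P(S_n <= n a) =", np.mean(S <= n * a))
```

In this example (where $a$ is below $E[X]$) the bound $m^n$ lies above the Monte Carlo estimate, and both decay at the same exponential rate, which is the sense in which $P(S_n \leqq na)$ behaves roughly like $m^n$.
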
Citations
Journal ArticleDOI
TL;DR: The new experts method represents a shift from the paradigms of regret minimization and myopic optimization to consideration of the long-term effect of a player's actions on the environment, and is capable of inducing cooperation in the repeated Prisoner's Dilemma game.
Abstract: “Experts algorithms” constitute a methodology for choosing actions repeatedly, when the rewards depend both on the choice of action and on the unknown current state of the environment. An experts algorithm has access to a set of strategies (“experts”), each of which may recommend which action to choose. The algorithm learns how to combine the recommendations of individual experts so that, in the long run, for any fixed sequence of states of the environment, it does as well as the best expert would have done relative to the same sequence. This methodology may not be suitable for situations where the evolution of states of the environment depends on past chosen actions, as is usually the case, for example, in a repeated non-zero-sum game. A general exploration-exploitation experts method is presented along with a proper definition of value. The definition is shown to be adequate in that it both captures the impact of an expert's actions on the environment and is learnable. The new experts method is quite different from previously proposed experts algorithms. It represents a shift from the paradigms of regret minimization and myopic optimization to consideration of the long-term effect of a player's actions on the environment. The importance of this shift is demonstrated by the fact that this algorithm is capable of inducing cooperation in the repeated Prisoner's Dilemma game, whereas previous experts algorithms converge to the suboptimal non-cooperative play. The method is shown to asymptotically perform as well as the best available expert. Several variants are analyzed from the viewpoint of the exploration-exploitation tradeoff, including explore-then-exploit, polynomially vanishing exploration, constant-frequency exploration, and constant-size exploration phases. Complexity and performance bounds are proven.
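The abstract describes the method only at a high level. As a rough schematic of the explore-then-exploit variant it mentions (not the authors' algorithm), each expert is followed for a long contiguous block so that its long-run value in a possibly reactive environment can be estimated, after which play commits to the empirically best expert. The `experts` and `env` interfaces below are hypothetical.

```python
# Schematic sketch (not the paper's algorithm) of an explore-then-exploit
# experts scheme. `experts` and `env` are hypothetical interfaces: each expert
# has an act(observation) method, and env has reset(), observation(), and
# step(action) -> reward.
import numpy as np

def explore_then_exploit(experts, env, block_len, horizon):
    # Exploration: follow each expert for a contiguous block so the environment
    # has time to react to that expert's behaviour, and record the average
    # reward as an estimate of the expert's long-run value.
    values = []
    for expert in experts:
        env.reset()
        rewards = []
        for _ in range(block_len):
            action = expert.act(env.observation())
            rewards.append(env.step(action))
        values.append(np.mean(rewards))

    # Exploitation: commit to the expert with the best estimated value.
    best = experts[int(np.argmax(values))]
    env.reset()
    total = 0.0
    for _ in range(horizon):
        total += env.step(best.act(env.observation()))
    return total
```
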

49 citations

Proceedings ArticleDOI
01 Jan 1998
TL;DR: This paper examines noisy radio (broadcast) networks in which every bit transmitted has a certain probability of being flipped and shows a protocol to compute any threshold function using only a linear number of transmissions.
Abstract: In this paper, we examine noisy radio (broadcast) networks in which every bit transmitted has a certain probability of being flipped. Each processor has some initial input bit, and the goal is to compute a function of these input bits. In this model, we show a protocol to compute any threshold function using only a linear number of transmissions.
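For context, the noise model is a binary symmetric channel: each transmitted bit is independently flipped with some probability. The sketch below simulates that model together with the standard repetition-plus-majority amplification whose failure probability is controlled by Chernoff-type bounds; it illustrates the model only and is not the paper's threshold protocol.

```python
# Minimal simulation of the noise model (each transmitted bit is flipped with
# probability eps) and repetition-plus-majority decoding. Illustration only,
# not the paper's protocol; eps, reps, trials are arbitrary values.
import numpy as np

rng = np.random.default_rng(0)

def noisy_send(bit, eps, repetitions):
    # Transmit `bit` `repetitions` times over a binary symmetric channel.
    flips = rng.random(repetitions) < eps
    received = np.where(flips, 1 - bit, bit)
    # Majority decoding at the receiver.
    return int(received.sum() * 2 > repetitions)

eps, reps, trials = 0.2, 11, 50_000
errors = sum(noisy_send(1, eps, reps) != 1 for _ in range(trials))
print("empirical decoding error:", errors / trials)   # far below eps
```
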

49 citations

Posted Content
TL;DR: In this article, the authors studied the regret of the Thompson sampling (TS) algorithm for the stochastic combinatorial multi-armed bandit (CMAB) problem.
Abstract: We study the application of the Thompson sampling (TS) methodology to the stochastic combinatorial multi-armed bandit (CMAB) framework. We analyze the standard TS algorithm for the general CMAB, and obtain the first distribution-dependent regret bound of $O(mK_{\max}\log T / \Delta_{\min})$, where $m$ is the number of arms, $K_{\max}$ is the size of the largest super arm, $T$ is the time horizon, and $\Delta_{\min}$ is the minimum gap between the expected reward of the optimal solution and any non-optimal solution. We also show that one cannot directly replace the exact offline oracle with an approximation oracle in TS algorithm for even the classical MAB problem. Then we expand the analysis to two special cases: the linear reward case and the matroid bandit case. When the reward function is linear, the regret of the TS algorithm achieves a better bound $O(m\log K_{\max}\log T / \Delta_{\min})$. For matroid bandit, we could remove the independence assumption across arms and achieve a regret upper bound that matches the lower bound for the matroid case. Finally, we use some experiments to show the comparison between regrets of TS and other existing algorithms like CUCB and ESCB.
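As a point of reference for the abstract, the sketch below implements Thompson sampling for the classical Bernoulli multi-armed bandit (the "classical MAB" special case the abstract mentions), with Beta(1, 1) priors and illustrative arm means of our choosing; it is not the paper's combinatorial (CMAB) algorithm.

```python
# Thompson sampling for a Bernoulli multi-armed bandit: keep a Beta posterior
# per arm, sample a mean from each posterior, and play the arm with the largest
# sample. True arm means below are hypothetical illustrations.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])     # hypothetical arm parameters
T = 10_000
successes = np.ones(len(true_means))        # Beta posterior alpha per arm
failures = np.ones(len(true_means))         # Beta posterior beta per arm

total_reward = 0.0
for _ in range(T):
    theta = rng.beta(successes, failures)   # one posterior sample per arm
    arm = int(np.argmax(theta))             # play the arm with the best sample
    reward = 1.0 if rng.random() < true_means[arm] else 0.0
    successes[arm] += reward
    failures[arm] += 1.0 - reward
    total_reward += reward

print("approximate regret:", T * true_means.max() - total_reward)
```
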

49 citations

Journal ArticleDOI
TL;DR: It is proved that, if $T_n$ denotes the random tournament on $n$ vertices, then $P\big(h(T_n) \leq \tfrac{1}{2}\binom{n}{2} + 1.73\,n^{3/2}\big) \to 1$ as $n \to \infty$.

49 citations


Cites background from "A Measure of Asymptotic Efficiency ..."

  • ...(1) k > (n/2)(1+1). This inequality is easily deduced from the well-known inequality of Chernoff [1], $\sum_{i=k}^{m}\binom{m}{i}p^i q^{m-i} \leq \exp[(m-k)\log(mq/(m-k)) + k\log(mp/k)]$,...

    [...]
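For completeness, the inequality quoted in the snippet above is the standard Chernoff bound for the upper tail of a Binomial(m, p) variable with q = 1 - p and threshold k ≥ mp. A textbook derivation (not taken from the citing paper) is:

```latex
% Standard derivation of the binomial tail bound quoted above.
% Let S ~ Binomial(m, p), q = 1 - p, and k >= mp.
\[
  P(S \ge k) \;\le\; e^{-tk}\,\mathbb{E}\!\left[e^{tS}\right]
             \;=\; e^{-tk}\left(q + p\,e^{t}\right)^{m},
  \qquad t \ge 0 .
\]
% The right-hand side is minimized at e^{t} = kq/((m-k)p); substituting gives
\[
  P(S \ge k) \;\le\;
  \exp\!\left[(m-k)\log\frac{mq}{m-k} + k\log\frac{mp}{k}\right],
\]
% which is the inequality attributed to Chernoff in the snippet.
```
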

Posted Content
TL;DR: A marginalized importance sampling (MIS) estimator is proposed that recursively estimates the state marginal distribution for the target policy at every step; its analysis is believed to give the first OPE estimation error bound with a polynomial dependence on the RL horizon $H$.
Abstract: Motivated by the many real-world applications of reinforcement learning (RL) that require safe-policy iterations, we consider the problem of off-policy evaluation (OPE) -- the problem of evaluating a new policy using the historical data obtained by different behavior policies -- under the model of nonstationary episodic Markov Decision Processes (MDP) with a long horizon and a large action space. Existing importance sampling (IS) methods often suffer from large variance that depends exponentially on the RL horizon $H$. To solve this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution for the target policy at every step. MIS achieves a mean-squared error of $$ \frac{1}{n} \sum\nolimits_{t=1}^H\mathbb{E}_{\mu}\left[\frac{d_t^\pi(s_t)^2}{d_t^\mu(s_t)^2} \mathrm{Var}_{\mu}\left[\frac{\pi_t(a_t|s_t)}{\mu_t(a_t|s_t)}\big( V_{t+1}^\pi(s_{t+1}) + r_t\big) \middle| s_t\right]\right] + \tilde{O}(n^{-1.5}) $$ where $\mu$ and $\pi$ are the logging and target policies, $d_t^{\mu}(s_t)$ and $d_t^{\pi}(s_t)$ are the marginal distributions of the state at the $t$-th step, $H$ is the horizon, $n$ is the sample size, and $V_{t+1}^\pi$ is the value function of the MDP under $\pi$. The result matches the Cramér-Rao lower bound in \citet{jiang2016doubly} up to a multiplicative factor of $H$. To the best of our knowledge, this is the first OPE estimation error bound with a polynomial dependence on $H$. Besides theory, we show empirical superiority of our method in time-varying, partially observable, and long-horizon RL environments.
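The following is a simplified tabular sketch of the MIS idea described in the abstract: estimate the target policy's state marginals $d_t^\pi$ recursively from the logged data, then weight logged rewards by $d_t^\pi/d_t^\mu$ times a single-step action ratio rather than by a cumulative product of action ratios. The data layout and function name are hypothetical, and this is not the authors' exact estimator.

```python
# Simplified tabular MIS sketch (hypothetical data layout, not the authors'
# exact estimator). Finite state space of size S, horizon H, n logged episodes.
import numpy as np

def mis_estimate(states, actions, rewards, pi, mu, S, H):
    """states, actions, rewards: integer/float arrays of shape (n, H);
    pi, mu: arrays of shape (H, S, A) with action probabilities."""
    n = states.shape[0]
    # At t = 0 the target and behaviour policies share the same start-state
    # distribution, so d_0^pi is just the empirical start-state frequency.
    d_pi = np.bincount(states[:, 0], minlength=S) / n
    v = 0.0
    for t in range(H):
        d_mu = np.bincount(states[:, t], minlength=S) / n      # empirical d_t^mu
        w_state = np.where(d_mu > 0, d_pi / np.maximum(d_mu, 1e-12), 0.0)
        rho = pi[t, states[:, t], actions[:, t]] / mu[t, states[:, t], actions[:, t]]
        # Reward contribution at step t, weighted by the state-marginal ratio
        # and a single-step action ratio instead of a cumulative product.
        v += np.mean(w_state[states[:, t]] * rho * rewards[:, t])
        if t + 1 < H:
            # Recursive estimate of the target policy's next-step state marginal.
            d_next = np.zeros(S)
            np.add.at(d_next, states[:, t + 1], w_state[states[:, t]] * rho / n)
            d_pi = d_next
    return v
```
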

49 citations


Cites background from "A Measure of Asymptotic Efficiency ..."

  • ...Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations....

    [...]

References