
Showing papers by "Yishay Mansour published in 2022"


Proceedings ArticleDOI
19 Jun 2022
TL;DR: A new complexity measure called the myopic exploration gap, denoted by α, is proposed; it captures a structural property of the MDP, the exploration policy, and the given value function class F, and it is shown that the sample complexity of myopic exploration scales quadratically with the inverse of this quantity, 1/α².
Abstract: Myopic exploration policies such as epsilon-greedy, softmax, or Gaussian noise fail to explore efficiently in some reinforcement learning tasks and yet, they perform well in many others. In fact, in practice, they are often selected as the top choices, due to their simplicity. But, for what tasks do such policies succeed? Can we give theoretical guarantees for their favorable performance? These crucial questions have been scarcely investigated, despite the prominent practical importance of these policies. This paper presents a theoretical analysis of such policies and provides the first regret and sample-complexity bounds for reinforcement learning with myopic exploration. Our results apply to value-function-based algorithms in episodic MDPs with bounded Bellman Eluder dimension. We propose a new complexity measure called myopic exploration gap, denoted by alpha, that captures a structural property of the MDP, the exploration policy and the given value function class. We show that the sample-complexity of myopic exploration scales quadratically with the inverse of this quantity, 1 / alpha^2. We further demonstrate through concrete examples that myopic exploration gap is indeed favorable in several tasks where myopic exploration succeeds, due to the corresponding dynamics and reward structure.
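As a concrete illustration of the kind of myopic exploration policy covered by this analysis, here is a minimal epsilon-greedy sketch in Python (the value table, environment, and parameter choices are hypothetical and not taken from the paper):

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    """Myopic exploration: with probability epsilon pick a uniformly random
    action, otherwise act greedily with respect to the current value estimates."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Toy usage: three actions with current value estimates.
q = [0.2, 0.5, 0.1]
actions = [epsilon_greedy_action(q, epsilon=0.1) for _ in range(1000)]
print("fraction of greedy picks:", actions.count(1) / len(actions))
```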

11 citations


Proceedings Article
31 Jan 2022
TL;DR: This paper presents the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^K d^k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.
Abstract: The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed in delay. This paper studies online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only in the end of episode $k + d^k$, where the delay $d^k$ can be changing over episodes and chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^K d^k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.
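To see the scale of the improvement, the toy snippet below (illustrative delay sequence, constants ignored) compares the new $\sqrt{K + D}$ rate with the earlier $(K + D)^{2/3}$ rate for a given total delay $D = \sum_{k=1}^K d^k$:

```python
# Illustrative comparison of the two regret rates (constants ignored).
K = 10_000                                    # number of episodes
delays = [k % 50 for k in range(1, K + 1)]    # toy choice of adversarial delays d^k
D = sum(delays)                               # total delay D = sum_k d^k

new_rate = (K + D) ** 0.5                     # near-optimal sqrt(K + D) regret
old_rate = (K + D) ** (2 / 3)                 # previously best known (K + D)^(2/3) regret
print(f"D = {D}, sqrt(K+D) ~ {new_rate:.0f}, (K+D)^(2/3) ~ {old_rate:.0f}")
```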

9 citations


Journal ArticleDOI
TL;DR: The key observation is that online learning via policy optimization in Markov games essentially reduces to a form of weighted regret minimization, with unknown weights determined by the path length of the agents’ policy sequence.
Abstract: An abundance of recent impossibility results establish that regret minimization in Markov games with adversarial opponents is both statistically and computationally intractable. Nevertheless, none of these results preclude the possibility of regret minimization under the assumption that all parties adopt the same learning procedure. In this work, we present the first (to our knowledge) algorithm for learning in general-sum Markov games that provides sublinear regret guarantees when executed by all agents. The bounds we obtain are for swap regret, and thus, along the way, imply convergence to a correlated equilibrium. Our algorithm is decentralized, computationally efficient, and does not require any communication between agents. Our key observation is that online learning via policy optimization in Markov games essentially reduces to a form of weighted regret minimization, with unknown weights determined by the path length of the agents’ policy sequence. Consequently, controlling the path length leads to weighted regret objectives for which sufficiently adaptive algorithms provide sublinear regret guarantees.

7 citations


Proceedings ArticleDOI
17 May 2022
TL;DR: A new combinatorial notion of regret, called polytope swap regret, is introduced while building a theory of optimizer-learner interactions; it could be of independent interest in other settings.
Abstract: We study repeated two-player games where one of the players, the learner, employs a no-regret learning strategy, while the other, the optimizer, is a rational utility maximizer. We consider general Bayesian games, where the payoffs of both the optimizer and the learner could depend on the type, which is drawn from a publicly known distribution, but revealed privately to the learner. We address the following questions: (a) what is the bare minimum that the optimizer can guarantee to obtain regardless of the no-regret learning algorithm employed by the learner? (b) are there learning algorithms that cap the optimizer payoff at this minimum? (c) can these algorithms be implemented efficiently? While building this theory of optimizer-learner interactions, we define a new combinatorial notion of regret called polytope swap regret, that could be of independent interest in other settings.

7 citations


Proceedings Article
27 Feb 2022
TL;DR: It turns out that SGD is not algorithmically stable in any sense, and its generalization ability cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis).
Abstract: We study to what extent may stochastic gradient descent (SGD) be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to training data. We consider the fundamental stochastic convex optimization framework, where (one pass, without-replacement) SGD is classically known to minimize the population risk at rate $O(1/\sqrt n)$, and prove that, surprisingly, there exist problem instances where the SGD solution exhibits both empirical risk and generalization gap of $\Omega(1)$. Consequently, it turns out that SGD is not algorithmically stable in any sense, and its generalization ability cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis). We then continue to analyze the closely related with-replacement SGD, for which we show that an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate. Finally, we interpret our main results in the context of without-replacement SGD for finite-sum convex optimization problems, and derive upper and lower bounds for the multi-epoch regime that significantly improve upon previously known results.
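The distinction the paper draws between one-pass without-replacement SGD and with-replacement SGD is only in how sample indices are drawn; the following sketch (a toy least-squares objective, not the paper's lower-bound construction) makes that difference explicit:

```python
import numpy as np

def sgd(x0, data, grad, lr, with_replacement):
    """One pass of SGD over n samples.

    without replacement: visit a random permutation of the data once;
    with replacement:    draw n i.i.d. uniform indices."""
    n = len(data)
    idx = (np.random.randint(0, n, size=n) if with_replacement
           else np.random.permutation(n))
    x = x0
    for i in idx:
        x = x - lr * grad(x, data[i])
    return x

# Toy convex problem: f(x) = mean_i (x - z_i)^2, per-sample gradient 2 (x - z_i).
z = np.random.randn(1000) + 3.0
grad = lambda x, zi: 2.0 * (x - zi)
print(sgd(0.0, z, grad, lr=0.01, with_replacement=False))
print(sgd(0.0, z, grad, lr=0.01, with_replacement=True))
```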

5 citations


Journal ArticleDOI
TL;DR: This work studies learning contextual MDPs using function approximation for both the rewards and the dynamics, and derives polynomial sample and time complexity (assuming an efficient ERM oracle).
Abstract: We study learning contextual MDPs using function approximation for both the rewards and the dynamics. We consider both the case where the dynamics depend on the context and the case where they are independent of it. For both models we derive polynomial sample and time complexity (assuming an efficient ERM oracle). Our methodology gives a general reduction from learning contextual MDPs to supervised learning.

5 citations


Proceedings ArticleDOI
25 Mar 2022
TL;DR: This work proposes a novel multi-armed bandit setup that captures such policy-dependent horizons, as well as an efficient learning algorithm that achieves O(sqrt(T)ln(T)) regret, where T is the number of users.
Abstract: Traditionally, when recommender systems are formalized as multi-armed bandits, the policy of the recommender system influences the rewards accrued, but not the length of interaction. However, in real-world systems, dissatisfied users may depart (and never come back). In this work, we propose a novel multi-armed bandit setup that captures such policy-dependent horizons. Our setup consists of a finite set of user types, and multiple arms with Bernoulli payoffs. Each (user type, arm) tuple corresponds to an (unknown) reward probability. Each user's type is initially unknown and can only be inferred through their response to recommendations. Moreover, if a user is dissatisfied with their recommendation, they might depart the system. We first address the case where all users share the same type, demonstrating that a recent UCB-based algorithm is optimal. We then move forward to the more challenging case, where users are divided among two types. While naive approaches cannot handle this setting, we provide an efficient learning algorithm that achieves O(sqrt(T)ln(T)) regret, where T is the number of users.
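A rough sketch of a UCB-style recommender facing departing users is given below; the departure rule (a user leaves after the first unsatisfying recommendation) and all parameters are simplifying assumptions for illustration, not the paper's exact model or algorithm:

```python
import math, random

def run_departing_bandit(reward_probs, num_users, seed=0):
    """UCB-style sketch for the single-user-type case: each arriving user is
    served until a recommendation dissatisfies them (reward 0), at which point
    they depart.  The departure rule here is an assumed simplification."""
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    pulls, wins, total = [0] * n_arms, [0] * n_arms, 0
    for _ in range(num_users):
        departed = False
        while not departed:
            total += 1
            # UCB index; unexplored arms are tried first.
            ucb = [float("inf") if pulls[a] == 0 else
                   wins[a] / pulls[a] + math.sqrt(2 * math.log(total) / pulls[a])
                   for a in range(n_arms)]
            a = max(range(n_arms), key=lambda i: ucb[i])
            r = 1 if rng.random() < reward_probs[a] else 0
            pulls[a] += 1
            wins[a] += r
            departed = (r == 0)   # dissatisfied users leave (assumed rule)
    return wins, pulls

wins, pulls = run_departing_bandit([0.9, 0.5], num_users=200)
print("pulls per arm:", pulls)
```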

4 citations


Proceedings Article
11 Feb 2022
TL;DR: This work addresses the question of how many labeled and unlabeled examples are required to ensure learning, and establishes a gap between supervised and semi-supervised label complexities which is known not to hold in standard non-robust PAC learning.
Abstract: We study the problem of learning an adversarially robust predictor to test time attacks in the semi-supervised PAC model. We address the question of how many labeled and unlabeled examples are required to ensure learning. We show that having enough unlabeled data (the size of a labeled sample that a fully-supervised method would require), the labeled sample complexity can be arbitrarily smaller compared to previous works, and is sharply characterized by a different complexity measure. We prove nearly matching upper and lower bounds on this sample complexity. This shows that there is a significant benefit in semi-supervised robust learning even in the worst-case distribution-free model, and establishes a gap between the supervised and semi-supervised label complexities which is known not to hold in standard non-robust PAC learning.

4 citations


Journal ArticleDOI
22 Jul 2022
TL;DR: This approach is the first optimistic approach applied to contextual MDPs with general function approximation (i.e., without additional knowledge regarding the function class, such as it being linear).
Abstract: We present regret minimization algorithms for stochastic contextual MDPs under a minimum reachability assumption, using access to an offline least squares regression oracle. We analyze three different settings: where the dynamics is known, where the dynamics is unknown but independent of the context, and the most challenging setting where the dynamics is unknown and context-dependent. For the latter, our algorithm obtains a regret bound (up to poly-logarithmic factors) of order $(H + 1/p_{\min}) H |S|^{3/2} (|A| T \log(\max\{|\mathcal{P}|, |\mathcal{F}|\}/\delta))^{1/2}$ with probability $1-\delta$, where $\mathcal{P}$ and $\mathcal{F}$ are finite and realizable function classes used to approximate the dynamics and rewards respectively, $p_{\min}$ is the minimum reachability parameter, $S$ is the set of states, $A$ the set of actions, $H$ the horizon, and $T$ the number of episodes. To our knowledge, our approach is the first optimistic approach applied to contextual MDPs with general function approximation (i.e., without additional knowledge regarding the function class, such as it being linear). We present a lower bound of $\Omega((TH|S||A|\ln|\mathcal{F}|/\ln|A|)^{1/2})$ on the expected regret, which holds even in the case of known dynamics. Lastly, we discuss an extension of our results to CMDPs without minimum reachability, which obtains regret of order $T^{3/4}$.

3 citations


Proceedings Article
31 Jan 2022
TL;DR: This work is the first to consider cooperative reinforcement learning (RL) with either non-fresh randomness or in adversarial MDPs, and proves nearly-matching regret lower and upper bounds.
Abstract: We study cooperative online learning in stochastic and adversarial Markov decision processes (MDPs). That is, in each episode, m agents interact with an MDP simultaneously and share information in order to minimize their individual regret. We consider environments with two types of randomness: fresh – where each agent’s trajectory is sampled i.i.d., and non-fresh – where the realization is shared by all agents (but each agent’s trajectory is also affected by its own actions). More precisely, with non-fresh randomness the realization of every cost and transition is fixed at the start of each episode, and agents that take the same action in the same state at the same time observe the same cost and next state. We thoroughly analyze all relevant settings, highlight the challenges and differences between the models, and prove nearly-matching regret lower and upper bounds. To our knowledge, we are the first to consider cooperative reinforcement learning (RL) with either non-fresh randomness or in adversarial MDPs.

2 citations


Journal ArticleDOI
TL;DR: An artificial-intelligence-based approach, based on training a regularized Lasso-regression model, which provides a means to select an optimal subset of sites and a formula by which one can compute the log-likelihood of the entire data based on this subset.
Abstract: Motivation: In recent years, full-genome sequences have become increasingly available and as a result many modern phylogenetic analyses are based on very long sequences, often with over 100 000 sites. Phylogenetic reconstructions of large-scale alignments are challenging for likelihood-based phylogenetic inference programs and usually require using a powerful computer cluster. Current tools for alignment trimming prior to phylogenetic analysis do not promise a significant reduction in the alignment size and are claimed to have a negative effect on the accuracy of the obtained tree. Results: Here, we propose an artificial-intelligence-based approach, which provides means to select the optimal subset of sites and a formula by which one can compute the log-likelihood of the entire data based on this subset. Our approach is based on training a regularized Lasso-regression model that optimizes the log-likelihood prediction accuracy while putting a constraint on the number of sites used for the approximation. We show that computing the likelihood based on 5% of the sites already provides accurate approximation of the tree likelihood based on the entire data. Furthermore, we show that using this Lasso-based approximation during a tree search decreased running-time substantially while retaining the same tree-search performance. Availability and implementation: The code was implemented in Python version 3.8 and is available through GitHub (https://github.com/noaeker/lasso_positions_sampling). The datasets used in this paper were retrieved from Zhou et al. (2018) as described in section 3. Supplementary information: Supplementary data are available at Bioinformatics online.
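A minimal sketch of the core idea, fitting a Lasso regression that predicts the total log-likelihood from per-site log-likelihood contributions and thereby selects a small subset of sites, is shown below (synthetic data; the actual pipeline is in the linked GitHub repository):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_trees, n_sites = 200, 1000
# X[i, j]: per-site log-likelihood of site j under candidate tree i (synthetic here).
X = rng.normal(size=(n_trees, n_sites))
y = X.sum(axis=1)                      # total log-likelihood is the sum over sites

model = Lasso(alpha=0.5, max_iter=5000).fit(X, y)   # L1 penalty limits sites used
chosen = np.flatnonzero(model.coef_)                # the selected subset of sites
print(f"{len(chosen)} of {n_sites} sites selected "
      f"({100 * len(chosen) / n_sites:.1f}%)")

# The fitted coefficients give a formula approximating the full log-likelihood
# from only the selected sites.
approx = X[:, chosen] @ model.coef_[chosen] + model.intercept_
print("max abs approximation error:", np.abs(approx - y).max())
```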

Proceedings Article
10 Feb 2022
TL;DR: A general result is derived in multiclass classification, showing that every learning algorithm A can be transformed into a monotone one with similar performance, demonstrating that one can provably avoid non-monotonic behaviour without compromising performance.
Abstract: The amount of training-data is one of the key factors which determines the generalization capacity of learning algorithms. Intuitively, one expects the error rate to decrease as the amount of training-data increases. Perhaps surprisingly, natural attempts to formalize this intuition give rise to interesting and challenging mathematical questions. For example, in their classical book on pattern recognition, Devroye, Györfi, and Lugosi (1996) ask whether there exists a monotone Bayes-consistent algorithm. This question remained open for over 25 years, until recently Pestov (2021) resolved it for binary classification, using an intricate construction of a monotone Bayes-consistent algorithm. We derive a general result in multiclass classification, showing that every learning algorithm A can be transformed to a monotone one with similar performance. Further, the transformation is efficient and only uses a black-box oracle access to A. This demonstrates that one can provably avoid non-monotonic behaviour without compromising performance, thus answering questions asked by Devroye, Györfi, and Lugosi (1996), Viering, Mey, and Loog (2019), Viering and Loog (2021), and by Mhammedi (2021). Our general transformation readily implies monotone learners in a variety of contexts: for example, Pestov’s result follows by applying it on any Bayes-consistent algorithm (e.g., k-Nearest-Neighbours). In fact, our transformation extends Pestov’s result to classification tasks with an arbitrary number of labels. This is in contrast with Pestov’s work which is tailored to binary classification. In addition, we provide uniform bounds on the error of the monotone algorithm. This makes our transformation applicable in distribution-free settings. For example, in PAC learning it implies that every learnable class admits a monotone PAC learner. This resolves questions asked by Viering, Mey, and Loog (2019); Viering and Loog (2021); Mhammedi (2021).

Journal ArticleDOI
TL;DR: In this paper, the UC$^3$RL algorithm for regret minimization in stochastic contextual MDPs is presented, which operates under the minimal assumptions of realizable function class and access to offline least squares and log loss regression oracles.
Abstract: We present the UC$^3$RL algorithm for regret minimization in Stochastic Contextual MDPs (CMDPs). The algorithm operates under the minimal assumptions of realizable function class, and access to offline least squares and log loss regression oracles. Our algorithm is efficient (assuming efficient offline regression oracles) and enjoys an $\widetilde{O}(H^3 \sqrt{T |S| |A| (\log(|\mathcal{F}|/\delta) + \log(|\mathcal{P}|/\delta))})$ regret guarantee, with $T$ being the number of episodes, $S$ the state space, $A$ the action space, $H$ the horizon, and $\mathcal{P}$ and $\mathcal{F}$ finite function classes used to approximate the context-dependent dynamics and rewards, respectively. To the best of our knowledge, our algorithm is the first efficient and rate-optimal regret minimization algorithm for CMDPs, which operates under the general offline function approximation setting.

Journal Article
TL;DR: The main result is a bi-criteria approximation algorithm which gives a factor of almost 2 approximation for both the escape probability and SafeZone size, using a polynomial size sample complexity.
Abstract: Given a policy, we define a SafeZone as a subset of states, such that most of the policy's trajectories are confined to this subset. The quality of the SafeZone is parameterized by the number of states and the escape probability, i.e., the probability that a random trajectory will leave the subset. SafeZones are especially interesting when they have a small number of states and low escape probability. We study the complexity of finding optimal SafeZones, and show that in general the problem is computationally hard. For this reason we concentrate on computing approximate SafeZones. Our main result is a bi-criteria approximation algorithm which gives a factor of almost $2$ approximation for both the escape probability and SafeZone size, using a polynomial size sample complexity. We conclude the paper with an empirical evaluation of our algorithm.
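The escape probability of a candidate SafeZone is straightforward to estimate from sampled trajectories; the sketch below (hypothetical trajectory format, not the paper's approximation algorithm) shows the quantity being controlled:

```python
def escape_probability(trajectories, safe_zone):
    """Fraction of sampled trajectories that leave the candidate SafeZone,
    i.e. visit at least one state outside the given set of states."""
    escapes = sum(any(s not in safe_zone for s in traj) for traj in trajectories)
    return escapes / len(trajectories)

# Toy usage: states are integers, a trajectory is a list of visited states.
trajs = [[0, 1, 2], [0, 1, 1, 0], [0, 3, 2], [1, 2, 2]]
print(escape_probability(trajs, safe_zone={0, 1, 2}))   # 0.25
```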

27 Nov 2022
TL;DR: In this paper, the E-UC$^3$RL algorithm for regret minimization in Stochastic Contextual Markov Decision Processes (CMDPs) is presented, which operates under the minimal assumptions of realizable function class and access to offline least squares and log loss regression oracles.
Abstract: We present the E-UC$^3$RL algorithm for regret minimization in Stochastic Contextual Markov Decision Processes (CMDPs). The algorithm operates under the minimal assumptions of realizable function class and access to \emph{offline} least squares and log loss regression oracles. Our algorithm is efficient (assuming efficient offline regression oracles) and enjoys a regret guarantee of $\widetilde{O}(H^3 \sqrt{T |S| |A| d_{\mathrm{E}}(\mathcal{P}) \log (|\mathcal{F}| |\mathcal{P}|/ \delta)})$, with $T$ being the number of episodes, $S$ the state space, $A$ the action space, $H$ the horizon, $\mathcal{P}$ and $\mathcal{F}$ finite function classes used to approximate the context-dependent dynamics and rewards, respectively, and $d_{\mathrm{E}}(\mathcal{P})$ the Eluder dimension of $\mathcal{P}$ w.r.t. the Hellinger distance. To the best of our knowledge, our algorithm is the first efficient and rate-optimal regret minimization algorithm for CMDPs that operates under the general offline function approximation setting. In addition, we extend the Eluder dimension to general bounded metrics, which may be of separate interest.

Journal ArticleDOI
TL;DR: It is shown that in certain cases one can achieve policy interpretability while maintaining its optimality, and the existence of a small decision tree with a linear function at each inner node and depth O(log k + 2^d) that represents an optimal policy is proved.
Abstract: Interpretability is an essential building block for trustworthiness in reinforcement learning systems. However, interpretability might come at the cost of deteriorated performance, leading many researchers to build complex models. Our goal is to analyze the cost of interpretability. We show that in certain cases, one can achieve policy interpretability while maintaining its optimality. We focus on a classical problem from reinforcement learning: mazes with k obstacles in R^d. We prove the existence of a small decision tree with a linear function at each inner node and depth O(log k + 2^d) that represents an optimal policy. Note that for the interesting case of a constant d, we have O(log k) depth. Thus, in this setting, there is no accuracy-interpretability tradeoff. To prove this result, we use a new “compressing” technique that might be useful in additional settings.

Proceedings Article
31 Jan 2022
TL;DR: This work introduces a new family of techniques to postprocess (“wrap”) a black-box classifier in order to reduce its bias, and exemplifies the use of the technique in three fairness notions: conditional value at risk, equality of opportunity, and statistical parity.
Abstract: We introduce a new family of techniques to post-process ("wrap") a black-box classifier in order to reduce its bias. Our technique builds on the recent analysis of improper loss functions whose optimization can correct any twist in prediction, unfairness being treated as a twist. In the post-processing, we learn a wrapper function which we define as an $\alpha$-tree, which modifies the prediction. We provide two generic boosting algorithms to learn $\alpha$-trees. We show that our modification has appealing properties in terms of composition of $\alpha$-trees, generalization, interpretability, and KL divergence between modified and original predictions. We exemplify the use of our technique in three fairness notions: conditional value-at-risk, equality of opportunity, and statistical parity; and provide experiments on several readily available datasets.

Journal ArticleDOI
TL;DR: This work calls to the half-century+ founding theory of losses for class probability estimation (properness), an extension of Long and Servedio’s results and a new general boosting algorithm to demonstrate that the real culprit in their context was in fact the (linear) model class.
Abstract: A landmark negative result of Long and Servedio established a worst-case spectacular failure of a supervised learning trio (loss, algorithm, model) otherwise praised for its high-precision machinery. Hundreds of papers followed up on the two suspected culprits: the loss (for being convex) and/or the algorithm (for fitting a classical boosting blueprint). Here, we call to the half-century+ founding theory of losses for class probability estimation (properness), an extension of Long and Servedio’s results and a new general boosting algorithm to demonstrate that the real culprit in their specific context was in fact the (linear) model class. We advocate for a more general standpoint on the problem as we argue that the source of the negative result lies in the dark side of a pervasive – and otherwise prized – aspect of ML: parameterisation.

Journal ArticleDOI
TL;DR: This paper formalizes this setting and characterizes the resulting Stackelberg equilibrium, where the seller commits to her strategy and then the buyers best respond, and derives a vanishing regret bound with respect to both the optimal pure strategy and the optimal mixed strategy.
Abstract: We consider a seller faced with buyers which have the ability to delay their decision, which we call patience. Each buyer's type is composed of value and patience, and it is sampled i.i.d. from a distribution. The seller, using posted prices, would like to maximize her revenue from selling to the buyer. In this paper, we formalize this setting and characterize the resulting Stackelberg equilibrium, where the seller first commits to her strategy, and then the buyers best respond. Following this, we show how to compute both the optimal pure and mixed strategies. We then consider a learning setting, where the seller does not have access to the distribution over buyer's types. Our main results are the following. We derive a sample complexity bound for the learning of an approximate optimal pure strategy, by computing the fat-shattering dimension of this setting. Moreover, we provide a general sample complexity bound for the approximate optimal mixed strategy. We also consider an online setting and derive a vanishing regret bound with respect to both the optimal pure strategy and the optimal mixed strategy.
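Under one simple reading of the model, a buyer's best response to a fixed (pure) price sequence is to buy at the cheapest price posted within their patience window, provided it does not exceed their value. The sketch below illustrates that reading; it is an assumption-laden illustration, not the paper's full characterization:

```python
def buyer_best_response(prices, value, patience):
    """Buyer may delay the purchase by at most `patience` periods, so they face
    prices[0 : patience + 1]; they buy at the cheapest of those prices provided
    it is at most their value (simplified reading of the model)."""
    window = prices[: patience + 1]
    best = min(window)
    if best <= value:
        return best          # price paid, i.e. seller revenue from this buyer
    return 0.0               # buyer leaves without purchasing

# Toy usage: a decreasing price sequence rewards patient buyers.
prices = [10.0, 8.0, 6.0]
print(buyer_best_response(prices, value=7.0, patience=0))   # 0.0 (price 10 > value)
print(buyer_best_response(prices, value=7.0, patience=2))   # 6.0
```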

Journal ArticleDOI
TL;DR: This work addresses the problem of convex optimization with dueling feedback, considering a very general transfer function class that includes all functions that can be approximated by a finite polynomial with a minimal degree p.
Abstract: We address the problem of convex optimization with dueling feedback, where the goal is to minimize a convex function given a weaker form of dueling feedback. Each query consists of two points and the dueling feedback returns a (noisy) single-bit binary comparison of the function values of the two queried points. The translation of the function values to the single comparison bit is through a transfer function. This problem has been addressed previously for some restricted classes of transfer functions, but here we consider a very general transfer function class which includes all functions that can be approximated by a finite polynomial with a minimal degree p. Our main contribution is an efficient algorithm with convergence rate of $\widetilde{O}(\epsilon^{-4p})$ for a smooth convex objective function, and an optimal rate of $\widetilde{O}(\epsilon^{-2p})$ when the objective is smooth and strongly convex.
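The dueling feedback described above can be mocked up directly: each query compares two points and returns a noisy single bit whose bias is governed by a transfer function of the value gap. The sketch below uses a polynomial-type transfer function chosen for illustration only; it does not reproduce the paper's algorithm or exact noise model:

```python
import random

def dueling_feedback(f, x, y, p=3, scale=1.0):
    """Return a noisy bit indicating whether f(x) < f(y).

    The probability of answering 1 is 1/2 + sigma(f(y) - f(x)) / 2, where the
    transfer function sigma(t) = sign(t) * min(1, |scale * t| ** p) depends on
    the value gap through a degree-p polynomial near zero (illustrative choice)."""
    gap = f(y) - f(x)
    sigma = (1 if gap > 0 else -1) * min(1.0, abs(scale * gap) ** p)
    prob_one = 0.5 + 0.5 * sigma
    return 1 if random.random() < prob_one else 0

# Toy usage: convex objective f(z) = (z - 2)^2, comparing x = 0 and y = 1.
f = lambda z: (z - 2.0) ** 2
bits = [dueling_feedback(f, 0.0, 1.0) for _ in range(10000)]
print("empirical estimate of P[bit = 1]:", sum(bits) / len(bits))
```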


Journal ArticleDOI
TL;DR: In this paper, a universally Bayes consistent learning rule that satisfies differential privacy (DP) is constructed for the setting of binary classification and then extended to the more general setting of density estimation.
Abstract: We construct a universally Bayes consistent learning rule that satisfies differential privacy (DP). We first handle the setting of binary classification and then extend our rule to the more general setting of density estimation (with respect to the total variation metric). The existence of a universally consistent DP learner reveals a stark difference with the distribution-free PAC model. Indeed, in the latter DP learning is extremely limited: even one-dimensional linear classifiers are not privately learnable in this stringent model. Our result thus demonstrates that by allowing the learning rate to depend on the target distribution, one can circumvent the above-mentioned impossibility result and in fact learn arbitrary distributions by a single DP algorithm. As an application, we prove that any VC class can be privately learned in a semi-supervised setting with a near-optimal labeled sample complexity of $\tilde{O}(d/\varepsilon)$ labeled examples (and with an unlabeled sample complexity that can depend on the target distribution).

Journal Article
TL;DR: This work considers a seller faced with buyers which have the ability to delay their decision, which is called patience, and derives a sample complexity bound for the learning of an approximate optimal pure strategy, by computing the fat-shattering dimension of this setting.
Abstract: We consider a seller faced with buyers which have the ability to delay their decision, which we call patience. Each buyer’s type is composed of value and patience, and it is sampled i.i.d. from a distribution. The seller, using posted prices, would like to maximize her revenue from selling to the buyer. Our main results are the following.
• We formalize this setting and characterize the resulting Stackelberg equilibrium, where the seller first commits to her strategy and then the buyers best respond.
• We show a separation between the best fixed price, the best pure strategy, which is a fixed sequence of prices, and the best mixed strategy, which is a distribution over price sequences.
• We characterize both the optimal pure strategy of the seller and the buyer’s best response strategy to any seller’s mixed strategy.
• We show how to compute efficiently the optimal pure strategy and give an algorithm for the optimal mixed strategy (which is exponential in the maximum patience).
We then consider a learning setting, where the seller does not have access to the distribution over buyer’s types. Our main results are the following.
• We derive a sample complexity bound for the learning of an approximate optimal pure strategy, by computing the fat-shattering dimension of this setting.
• We give a general sample complexity bound for the approximate optimal mixed strategy.
• We consider an online setting and derive a vanishing regret bound with respect to both the optimal pure strategy and the optimal mixed strategy.