
Showing papers by "Aviv Tamar" published in 2023


Journal ArticleDOI
TL;DR: This article argues that the current direction of reinforcement learning research is not likely to lead to deployable RL: RL that works in practical situations and is also economically viable.
Abstract: Reinforcement learning (RL) has demonstrated great potential, but is currently full of over-hyping and pipe dreams. We point to some difficulties with current research which we feel are endemic to the direction taken by the community. To us, the current direction is not likely to lead to “deployable” RL: RL that works in practical situations and is also economically viable. We also propose a potential fix to some of the difficulties of the field.

1 citation


Journal ArticleDOI
TL;DR: ContraBAR as discussed by the authors uses contrastive predictive coding (CPC) in lieu of variational belief inference to learn a Bayes-optimal policy in meta reinforcement learning (meta RL), and achieves performance comparable to the state of the art in domains with state-based observations.
Abstract: In meta reinforcement learning (meta RL), an agent seeks a Bayes-optimal policy -- the optimal policy when facing an unknown task that is sampled from some known task distribution. Previous approaches tackled this problem by inferring a belief over task parameters, using variational inference methods. Motivated by recent successes of contrastive learning approaches in RL, such as contrastive predictive coding (CPC), we investigate whether contrastive methods can be used for learning Bayes-optimal behavior. We begin by proving that representations learned by CPC are indeed sufficient for Bayes optimality. Based on this observation, we propose a simple meta RL algorithm that uses CPC in lieu of variational belief inference. Our method, ContraBAR, achieves comparable performance to state-of-the-art in domains with state-based observation and circumvents the computational toll of future observation reconstruction, enabling learning in domains with image-based observations. It can also be combined with image augmentations for domain randomization and used seamlessly in both online and offline meta RL settings.
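
To make the "CPC in lieu of variational belief inference" idea concrete, here is a minimal PyTorch sketch of an InfoNCE objective over trajectory embeddings. All names, dimensions, and architectural choices (a linear encoder, a GRU aggregator, per-offset bilinear projections) are illustrative assumptions, not the ContraBAR implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CPCBelief(nn.Module):
    """Hypothetical sketch: learn a belief-like latent with CPC (InfoNCE)
    instead of variational inference, in the spirit of ContraBAR."""

    def __init__(self, obs_dim, act_dim, latent_dim=64, horizon=5):
        super().__init__()
        self.encoder = nn.Linear(obs_dim + act_dim + 1, latent_dim)  # (s, a, r) -> z_t
        self.aggregator = nn.GRU(latent_dim, latent_dim, batch_first=True)  # z_{1:t} -> b_t
        # one projection per future offset k, as in standard CPC
        self.proj = nn.ModuleList(nn.Linear(latent_dim, latent_dim, bias=False)
                                  for _ in range(horizon))
        self.horizon = horizon

    def info_nce(self, traj):
        """traj: (B, T, obs_dim + act_dim + 1) concatenated transitions, T > horizon."""
        z = self.encoder(traj)             # per-step codes, (B, T, D)
        b, _ = self.aggregator(z)          # belief-like summary of the past, (B, T, D)
        loss, T = 0.0, z.size(1)
        for k in range(1, self.horizon + 1):
            pred = self.proj[k - 1](b[:, :T - k])                  # predict code k steps ahead
            logits = torch.einsum('btd,bsd->bts', pred, z[:, k:])  # other steps act as negatives
            labels = torch.arange(T - k, device=z.device).expand(z.size(0), -1)
            loss = loss + F.cross_entropy(logits.flatten(0, 1), labels.flatten())
        return loss / self.horizon

model = CPCBelief(obs_dim=8, act_dim=2)
loss = model.info_nce(torch.randn(4, 20, 11))  # 4 trajectories of 20 steps
loss.backward()
```

Note that nothing here reconstructs future observations; the contrastive loss only scores compatibility between predicted and actual codes, which is what lets this style of training scale to image observations.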

Journal ArticleDOI
TL;DR: The authors proposed Trajectory Iterative Learner (TraIL), an extension of GCSL that further exploits the information in a trajectory, and uses it for learning to predict both actions and sub-goals.
Abstract: Recently, a simple yet effective algorithm -- goal-conditioned supervised learning (GCSL) -- was proposed to tackle goal-conditioned reinforcement learning. GCSL is based on the principle of hindsight learning: by observing states visited in previously executed trajectories and treating them as attained goals, GCSL learns the corresponding actions via supervised learning. However, GCSL only learns a goal-conditioned policy, discarding other information in the process. Our insight is that the same hindsight principle can be used to learn to predict goal-conditioned sub-goals from the same trajectory. Based on this idea, we propose Trajectory Iterative Learner (TraIL), an extension of GCSL that further exploits the information in a trajectory, and uses it for learning to predict both actions and sub-goals. We investigate the settings in which TraIL can make better use of the data, and discover that for several popular problem settings, replacing real goals in GCSL with predicted TraIL sub-goals allows the agent to reach a greater set of goal states using the exact same data as GCSL, thereby improving its overall performance.
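
The hindsight idea extends naturally from actions to sub-goals. Below is a rough PyTorch sketch of one such update; the midpoint choice of sub-goal, the discrete-action assumption, and the function names (`policy`, `subgoal_net`) are illustrative, not the authors' code.

```python
import random
import torch
import torch.nn.functional as F

def trail_update(trajectories, policy, subgoal_net, optimizer):
    """One TraIL-style gradient step (illustrative, not the authors' code):
    relabel a future state as the goal, then supervise both the action taken
    (as in GCSL) and a mid-trajectory sub-goal from the same data."""
    states, goals, actions, subgoals = [], [], [], []
    for traj in trajectories:                   # traj: list of (state, action) pairs
        t = random.randrange(len(traj) - 1)     # current time step
        h = random.randrange(t + 1, len(traj))  # hindsight goal index
        m = (t + h) // 2                        # midpoint state stands in as the sub-goal
        states.append(traj[t][0]); actions.append(traj[t][1])
        goals.append(traj[h][0]); subgoals.append(traj[m][0])
    s, g = torch.stack(states), torch.stack(goals)
    a, sg = torch.stack(actions), torch.stack(subgoals)
    loss = F.cross_entropy(policy(s, g), a)          # GCSL action loss (discrete actions)
    loss = loss + F.mse_loss(subgoal_net(s, g), sg)  # TraIL-style sub-goal prediction loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```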

Proceedings ArticleDOI
15 Feb 2023
TL;DR: In this article, the authors model the problem as a Markov Decision Process (MDP) whose objective is to maximize expected throughput, and pursue an approximate solution based on model predictive control: at each time step, they plan based only on the currently visible objects.
Abstract: Deep learning-based grasp prediction models have become an industry standard for robotic bin-picking systems. To maximize pick success, production environments are often equipped with several end-effector tools that can be swapped on-the-fly, based on the target object. Tool-change, however, takes time. Choosing the order of grasps to perform, and corresponding tool-change actions, can improve system throughput; this is the topic of our work. The main challenge in planning tool change is uncertainty - we typically cannot see objects in the bin that are currently occluded. Inspired by queuing and admission control problems, we model the problem as a Markov Decision Process (MDP), where the goal is to maximize expected throughput, and we pursue an approximate solution based on model predictive control, where at each time step we plan based only on the currently visible objects. Special to our method is the idea of void zones, which are geometrical boundaries in which an unknown object will be present, and therefore cannot be accounted for during planning. Our planning problem can be solved using integer linear programming (ILP). However, we find that an approximate solution based on sparse tree search yields near optimal performance at a fraction of the time. Another question that we explore is how to measure the performance of tool-change planning: we find that throughput alone can fail to capture delicate and smooth behavior, and propose a principled alternative. Finally, we demonstrate our algorithms on both synthetic and real world bin picking tasks.
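
As a toy illustration of the MPC step over visible objects, the brute-force planner below enumerates pick orders and charges a fixed cost per tool change; the paper instead solves this at scale with ILP or sparse tree search, and the object/tool encoding here is an assumption.

```python
import itertools

def plan_visible(objects, pick_time, change_time, current_tool):
    """Toy MPC step: brute-force the pick order over currently visible objects,
    charging change_time whenever the next object needs a different tool."""
    best_order, best_time = None, float('inf')
    for order in itertools.permutations(objects):
        tool, t = current_tool, 0.0
        for obj in order:
            if obj['best_tool'] != tool:  # swapping end-effectors costs time
                t += change_time
                tool = obj['best_tool']
            t += pick_time
        if t < best_time:
            best_order, best_time = order, t
    return list(best_order), best_time

# hypothetical bin contents: alternating suction-cup and gripper objects
objs = [{'id': i, 'best_tool': 'suction' if i % 2 else 'gripper'} for i in range(5)]
order, total = plan_visible(objs, pick_time=2.0, change_time=5.0, current_tool='gripper')
```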

Journal Article
TL;DR: In this paper, the authors use deep reinforcement learning to analyze how a rational miner performs selfish mining by deviating from the protocol to maximize revenue when petty compliant miners are present, and they find that a selfish miner can exploit the petty compliant miners to increase her revenue by bribing them.
Abstract: Blockchain security relies on incentives to ensure participants, called miners, cooperate and behave as the protocol dictates. Such protocols have a security threshold – a miner whose relative computational power is larger than the threshold can deviate to improve her revenue. Moreover, blockchain participants can behave in a petty compliant manner: usually follow the protocol, but deviate to increase revenue when deviation cannot be distinguished externally from the prescribed behavior. The effect of petty compliant miners on the security threshold of blockchains is not well understood. Due to the complexity of the analysis, it remained an open question since Carlsten et al. identified it in 2016. In this work, we use deep Reinforcement Learning (RL) to analyze how a rational miner performs selfish mining by deviating from the protocol to maximize revenue when petty compliant miners are present. We find that a selfish miner can exploit petty compliant miners to increase her revenue by bribing them. Our method reveals that the security threshold is lower when petty compliant miners are present. In particular, with parameters estimated from the Bitcoin blockchain, we find the threshold drops from the known value of 25% to only 21% (or 19%) when 50% (or 75%) of the other miners are petty compliant. Hence, our deep RL analysis puts the open question to rest; the presence of petty compliant miners exacerbates a blockchain’s vulnerability to selfish mining and is a major security threat.
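
For background on the threshold being discussed, here is a Monte Carlo sketch of classic Eyal-Sirer selfish mining (without petty compliant miners or bribes, which are the paper's contribution); `alpha` is the selfish miner's hash power and `gamma` the fraction of honest power that mines on her tip during a race.

```python
import random

def selfish_revenue_share(alpha, gamma, n_blocks=500_000, seed=0):
    """Monte Carlo revenue share of a classic selfish miner (Eyal-Sirer style).
    Background sketch only; the paper's RL agent learns richer strategies."""
    rng = random.Random(seed)
    lead, tie = 0, False   # private lead; tie = two equal-height public tips
    selfish = honest = 0   # blocks credited on the eventual main chain
    for _ in range(n_blocks):
        if tie:                      # a race is pending; the next block resolves it
            if rng.random() < alpha:
                selfish += 2         # selfish extends her own tip, wins both blocks
            elif rng.random() < gamma:
                selfish += 1; honest += 1  # honest block lands on the selfish tip
            else:
                honest += 2          # honest tip wins; the selfish block is orphaned
            tie = False
        elif rng.random() < alpha:
            lead += 1                # selfish finds a block and keeps it private
        else:                        # the honest network finds a block
            if lead == 0:
                honest += 1
            elif lead == 1:
                tie, lead = True, 0  # reveal the private block; race begins
            elif lead == 2:
                selfish += 2; lead = 0   # reveal both, orphaning the honest block
            else:
                selfish += 1; lead -= 1  # reveal one block and stay ahead
    return selfish / (selfish + honest)
```

With gamma = 0.5, this share crosses alpha around alpha = 0.25, the classical 25% threshold the abstract starts from; the paper shows that bribing petty compliant miners pushes the threshold lower still.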

Journal ArticleDOI
TL;DR: Teacher Guided Reinforcement Learning (TGRL) as mentioned in this paper adjusts the importance of teacher supervision by comparing the agent's performance to the counterfactual scenario of the agent learning without teacher supervision.
Abstract: Learning from rewards (i.e., reinforcement learning or RL) and learning to imitate a teacher (i.e., teacher-student learning) are two established approaches for solving sequential decision-making problems. To combine the benefits of these different forms of learning, it is common to train a policy to maximize a combination of reinforcement and teacher-student learning objectives. However, without a principled method to balance these objectives, prior work used heuristics and problem-specific hyperparameter searches to balance the two objectives. We present a principled approach, along with an approximate implementation, for dynamically and automatically balancing when to follow the teacher and when to use rewards. The main idea is to adjust the importance of teacher supervision by comparing the agent's performance to the counterfactual scenario of the agent learning without teacher supervision, only from rewards. If using teacher supervision improves performance, the importance of teacher supervision is increased; otherwise it is decreased. Our method, Teacher Guided Reinforcement Learning (TGRL), outperforms strong baselines across diverse domains without hyperparameter tuning.
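
The feedback rule described in the abstract can be caricatured in a few lines; the step size, clipping, and the way the counterfactual return is estimated are all assumptions here, and the actual TGRL update is derived more carefully than this bang-bang rule.

```python
def update_teacher_weight(lam, return_with_teacher, return_reward_only,
                          step=0.05, lam_min=0.0, lam_max=1.0):
    """Caricature of TGRL's balancing signal (names and update rule are
    illustrative): compare the agent's return against a counterfactual
    agent trained from rewards alone, and move the teacher weight lam
    toward whichever source of supervision is currently winning."""
    if return_with_teacher > return_reward_only:
        lam = min(lam_max, lam + step)  # teacher supervision helps: trust it more
    else:
        lam = max(lam_min, lam - step)  # teacher supervision hurts: lean on rewards
    return lam

# schematic combined loss per update: rl_loss + lam * imitation_loss
```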

Journal ArticleDOI
TL;DR: In this paper, a new approach to routing under demand uncertainty is presented: tackling the challenge as stochastic optimization and employing deep learning to learn complex patterns in traffic demands.
Abstract: Routing is, arguably, the most fundamental task in computer networking, and the most extensively studied one. A key challenge for routing in real-world environments is the need to contend with uncertainty about future traffic demands. We present a new approach to routing under demand uncertainty: tackling this challenge as stochastic optimization, and employing deep learning to learn complex patterns in traffic demands. We show that our method provably converges to the global optimum in well-studied theoretical models of multicommodity flow. We exemplify the practical usefulness of our approach by zooming in on the real-world challenge of traffic engineering (TE) on wide-area networks (WANs). Our extensive empirical evaluation on real-world traffic and network topologies establishes that our approach's TE quality almost matches that of an (infeasible) omniscient oracle, outperforming previously proposed approaches, and also substantially lowers runtimes.
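
A minimal sketch of the stochastic-optimization view: optimize a routing decision against sampled demand matrices rather than a single fixed demand. For brevity this optimizes one static set of path-split ratios by gradient descent; the paper learns a deep model of demand patterns, and the topology, dimensions, and objective (expected max link utilization) below are illustrative assumptions.

```python
import torch
import torch.nn as nn

n_commodities, n_paths, n_links = 6, 3, 12
path_to_link = (torch.rand(n_commodities, n_paths, n_links) < 0.3).float()  # incidence
capacity = torch.ones(n_links)

split_logits = nn.Parameter(torch.zeros(n_commodities, n_paths))
opt = torch.optim.Adam([split_logits], lr=0.05)

for _ in range(500):
    demands = torch.rand(32, n_commodities)     # a batch of sampled demand matrices
    frac = torch.softmax(split_logits, dim=-1)  # split each commodity across its paths
    flow = torch.einsum('bc,cp,cpl->bl', demands, frac, path_to_link)  # load per link
    mlu = (flow / capacity).amax(dim=-1).mean() # expected max link utilization
    opt.zero_grad(); mlu.backward(); opt.step()
```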

Journal ArticleDOI
TL;DR: Deep Dynamic Latent Particles (DDLP) as discussed by the authors uses a set of keypoints with learned parameters for properties such as position and size, and is both efficient and interpretable.
Abstract: We propose a new object-centric video prediction algorithm based on the deep latent particle (DLP) representation. In comparison to existing slot- or patch-based representations, DLPs model the scene using a set of keypoints with learned parameters for properties such as position and size, and are both efficient and interpretable. Our method, deep dynamic latent particles (DDLP), yields state-of-the-art object-centric video prediction results on several challenging datasets. The interpretable nature of DDLP allows us to perform "what-if" generation -- predict the consequence of changing properties of objects in the initial frames, and DLP's compact structure enables efficient diffusion-based unconditional video generation. Videos, code and pre-trained models are available: https://taldatech.github.io/ddlp-web
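
To illustrate what "a set of keypoints with learned parameters" looks like as a data structure, here is a hypothetical container for DLP-style particles; the field names and the shift edit are illustrative, not the released API.

```python
import torch
from dataclasses import dataclass

@dataclass
class LatentParticles:
    """Hypothetical container for DLP-style particles (field names are
    illustrative): each of K particles carries interpretable attributes."""
    position: torch.Tensor  # (B, K, 2) keypoint x, y locations
    scale: torch.Tensor     # (B, K, 2) per-particle extent, e.g. a box size
    depth: torch.Tensor     # (B, K, 1) ordering cue for occlusions
    features: torch.Tensor  # (B, K, D) appearance latent per particle

    def shift(self, delta):
        """A 'what-if' style edit: displace every particle before decoding."""
        return LatentParticles(self.position + delta, self.scale,
                               self.depth, self.features)
```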

Journal ArticleDOI
TL;DR: ExpGen as mentioned in this paper trains an ensemble of reward-maximizing agents alongside an exploration policy. At test time, either the ensemble agrees on an action, and the agent generalizes well, or it takes exploratory actions, which are guaranteed to generalize and drive it to a novel part of the state space, where the ensemble may agree again.
Abstract: We study zero-shot generalization in reinforcement learning - optimizing a policy on a set of training tasks such that it will perform well on a similar but unseen test task. To mitigate overfitting, previous work explored different notions of invariance to the task. However, on problems such as the ProcGen Maze, an adequate solution that is invariant to the task visualization does not exist, and therefore invariance-based approaches fail. Our insight is that learning a policy that explores the domain effectively is harder to memorize than a policy that maximizes reward for a specific task, and therefore we expect such learned behavior to generalize well; we indeed demonstrate this empirically on several domains that are difficult for invariance-based approaches. Our Explore to Generalize algorithm (ExpGen) builds on this insight: We train an additional ensemble of agents that optimize reward. At test time, either the ensemble agrees on an action, and we generalize well, or we take exploratory actions, which are guaranteed to generalize and drive us to a novel part of the state space, where the ensemble may potentially agree again. We show that our approach is the state-of-the-art on several tasks in the ProcGen challenge that have so far eluded effective generalization. For example, we demonstrate a success rate of 82% on the Maze task and 74% on Heist with 200 training levels.
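
The test-time rule in the abstract - exploit when the reward ensemble agrees, explore otherwise - fits in a few lines. The agreement criterion, argmax action selection, and agent call signatures below are assumptions for illustration.

```python
import torch

def expgen_act(state, reward_agents, explore_agent, agree_frac=1.0):
    """Test-time rule in the spirit of ExpGen (simplified sketch; call
    signatures and the agreement test are assumptions): exploit when the
    reward ensemble agrees on an action, otherwise explore."""
    votes = torch.tensor([int(agent(state).argmax()) for agent in reward_agents])
    counts = votes.bincount()
    if counts.max().item() >= agree_frac * len(votes):  # ensemble consensus
        return int(counts.argmax())                     # take the agreed action
    # no consensus: act with the exploration policy, which generalizes by design
    return int(torch.distributions.Categorical(logits=explore_agent(state)).sample())
```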