Showing papers by "Aditya Grover published in 2022"

PDF

Open Access

Proceedings Article•

[...]

Qinqing Zheng, Amy B. Z. Zhang, Aditya Grover

11 Feb 2022

TL;DR: Online Decision Transformers (ODT), an RL algorithm based on sequence modeling that blends ofﬂine pretraining with online ﬁnetuning in a uniﬁed framework, is proposed and shown to be competitive with the state-of-the-art in absolute performance on the D4RL benchmark.

...read moreread less

Abstract: Recent work has shown that offline reinforcement learning (RL) can be formulated as a sequence modeling problem (Chen et al., 2021; Janner et al., 2021) and solved via approaches similar to large-scale language modeling. However, any practical instantiation of RL also involves an online component, where policies pretrained on passive offline datasets are finetuned via taskspecific interactions with the environment. We propose Online Decision Transformers (ODT), an RL algorithm based on sequence modeling that blends offline pretraining with online finetuning in a unified framework. Our framework uses sequence-level entropy regularizers in conjunction with autoregressive modeling objectives for sample-efficient exploration and finetuning. Empirically, we show that ODT is competitive with the state-of-the-art in absolute performance on the D4RL benchmark but shows much more significant gains during the finetuning procedure.

...read moreread less

75 citations

Journal Article•DOI•

Frozen Pretrained Transformers as Universal Computation Engines

[...]

Kevin Lu, Aditya Grover, Pieter Abbeel, Igor Mordatch

28 Jun 2022-Proceedings of the ... AAAI Conference on Artificial Intelligence

TL;DR: This work investigates the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning, and finds language-pretrained transformers can obtain strong performance on a variety of non-language tasks.

...read moreread less

Abstract: We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning -- in particular, without finetuning of the self-attention and feedforward layers of the residual blocks. We consider such a model, which we call a Frozen Pretrained Transformer (FPT), and study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction. In contrast to prior works which investigate finetuning on the same modality as the pretraining dataset, we show that pretraining on natural language can improve performance and compute efficiency on non-language downstream tasks. Additionally, we perform an analysis of the architecture, comparing the performance of a random initialized transformer to a random LSTM. Combining the two insights, we find language-pretrained transformers can obtain strong performance on a variety of non-language tasks.

...read moreread less

29 citations

Proceedings Article•DOI•

Transformer Neural Processes: Uncertainty-Aware Meta Learning Via Sequence Modeling

[...]

Tung Nguyen, Aditya Grover

09 Jul 2022

TL;DR: This work proposes Transformer Neural Processes (TNPs), a new member of the NP family that casts uncertainty-aware meta learning as a sequence modeling problem and achieves state-of-the-art performance on various benchmark problems, outperforming all previous NP variants.

...read moreread less

Abstract: Neural Processes (NPs) are a popular class of approaches for meta-learning. Similar to Gaussian Processes (GPs), NPs define distributions over functions and can estimate uncertainty in their predictions. However, unlike GPs, NPs and their variants suffer from underfitting and often have intractable likelihoods, which limit their applications in sequential decision making. We propose Transformer Neural Processes (TNPs), a new member of the NP family that casts uncertainty-aware meta learning as a sequence modeling problem. We learn TNPs via an autoregressive likelihood-based objective and instantiate it with a novel transformer-based architecture. The model architecture respects the inductive biases inherent to the problem structure, such as invariance to the observed data points and equivariance to the unobserved points. We further investigate knobs within the TNP framework that tradeoff expressivity of the decoding distribution with extra computation. Empirically, we show that TNPs achieve state-of-the-art performance on various benchmark problems, outperforming all previous NP variants on meta regression, image completion, contextual multi-armed bandits, and Bayesian optimization.

...read moreread less

26 citations

Proceedings Article•DOI•

Masked Autoencoding for Scalable and Generalizable Decision Making

[...]

Fangchen Liu, Hao Liu, Aditya Grover, Pieter Abbeel

23 Nov 2022

TL;DR: Mask Decision Prediction (MaskDP) as mentioned in this paper employs a masked autoencoder (MAE) to state-action trajectories, wherein the model is required to infer masked-out states and actions and extract information about dynamics.

...read moreread less

Abstract: We are interested in learning scalable agents for reinforcement learning that can learn from large-scale, diverse sequential data similar to current large vision and language models. To this end, this paper presents masked decision prediction (MaskDP), a simple and scalable self-supervised pretraining method for reinforcement learning (RL) and behavioral cloning (BC). In our MaskDP approach, we employ a masked autoencoder (MAE) to state-action trajectories, wherein we randomly mask state and action tokens and reconstruct the missing data. By doing so, the model is required to infer masked-out states and actions and extract information about dynamics. We find that masking different proportions of the input sequence significantly helps with learning a better model that generalizes well to multiple downstream tasks. In our empirical study, we find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching, and it can zero-shot infer skills from a few example transitions. In addition, MaskDP transfers well to offline RL and shows promising scaling behavior w.r.t. to model size. It is amenable to data-efficient finetuning, achieving competitive results with prior methods based on autoregressive pretraining.

...read moreread less

12 citations

Proceedings Article•DOI•

Matching Normalizing Flows and Probability Paths on Manifolds

[...]

Heli Ben-Hamu, Samuel A. Cohen, Joey Bose, Brandon Amos, Maximilian Nickel, Aditya Grover, Ricky T. Q. Chen, Yaron Lipman - Show less +4 more

11 Jul 2022

TL;DR: Empirically, CNFs learned by minimizing PPD achieve state-of-the-art results in likelihoods and sample quality on existing low-dimensional manifold benchmarks, and is the first example of a generative model to scale to moderately high dimensional manifolds.

...read moreread less

Abstract: Continuous Normalizing Flows (CNFs) are a class of generative models that transform a prior distribution to a model distribution by solving an ordinary differential equation (ODE). We propose to train CNFs on manifolds by minimizing probability path divergence (PPD), a novel family of divergences between the probability density path generated by the CNF and a target probability density path. PPD is formulated using a logarithmic mass conservation formula which is a linear first order partial differential equation relating the log target probabilities and the CNF's defining vector field. PPD has several key benefits over existing methods: it sidesteps the need to solve an ODE per iteration, readily applies to manifold data, scales to high dimensions, and is compatible with a large family of target paths interpolating pure noise and data in finite time. Theoretically, PPD is shown to bound classical probability divergences. Empirically, we show that CNFs learned by minimizing PPD achieve state-of-the-art results in likelihoods and sample quality on existing low-dimensional manifold benchmarks, and is the first example of a generative model to scale to moderately high dimensional manifolds.

...read moreread less

11 citations

Proceedings Article•

It Takes Four to Tango: Multiagent Selfplay for Automatic Curriculum Generation

[...]

Yuqing Du, Pieter Abbeel, Aditya Grover

22 Feb 2022

TL;DR: Curriculum Self Play (CuSP), an automated goal generation framework that seeks to satisfy desiderata by virtue of a multi-player game with 4 agents, is proposed and succeeds at generating an effective curricula of goals for a range of control tasks.

...read moreread less

Abstract: We are interested in training general-purpose reinforcement learning agents that can solve a wide variety of goals. Training such agents efficiently requires automatic generation of a goal curriculum. This is challenging as it requires (a) exploring goals of increasing difficulty, while ensuring that the agent (b) is exposed to a diverse set of goals in a sample efficient manner and (c) does not catastrophically forget previously solved goals. We propose Curriculum Self Play (CuSP), an automated goal generation framework that seeks to satisfy these desiderata by virtue of a multi-player game with 4 agents. We extend the asymmetric curricula learning in PAIRED (Dennis et al., 2020) to a symmetrized game that carefully balances cooperation and competition between two off-policy student learners and two regret-maximizing teachers. CuSP additionally introduces entropic goal coverage and accounts for the non-stationary nature of the students, allowing us to automatically induce a curriculum that balances progressive exploration with anticatastrophic exploitation. We demonstrate that our method succeeds at generating an effective curricula of goals for a range of control tasks, outperforming other methods at zero-shot test-time generalization to novel out-of-distribution goals.

...read moreread less

8 citations

Journal Article•DOI•

Generative Pretraining for Black-Box Optimization

[...]

Siddarth Krishnamoorthy, Satvik Mashkaria, Aditya Grover

22 Jun 2022-arXiv.org

TL;DR: This work proposes B lack-b o x O ptimization Transfor mer (BOOMER), a generative framework for pretraining black-box optimizers using ofﬂine datasets, and introduces mechanisms to control the rate at which a trajectory transitions from exploration to exploitation, and uses it to generalize outside the ofﰂine data at test-time.

...read moreread less

Abstract: Many problems in science and engineering involve optimizing an expensive black-box function over a high-dimensional space. For such black-box optimization (BBO) problems, we typically assume a small budget for online function evaluations, but also often have access to a fixed, offline dataset for pretraining. Prior approaches seek to utilize the offline data to approximate the function or its inverse but are not sufficiently accurate far from the data distribution. We propose BONET, a generative framework for pretraining a novel black-box optimizer using offline datasets. In BONET, we train an autoregressive model on fixed-length trajectories derived from an offline dataset. We design a sampling strategy to synthesize trajectories from offline data using a simple heuristic of rolling out monotonic transitions from low-fidelity to high-fidelity samples. Empirically, we instantiate BONET using a causally masked Transformer and evaluate it on Design-Bench, where we rank the best on average, outperforming state-of-the-art baselines.

...read moreread less

6 citations

Journal Article•DOI•

Semi-Supervised Offline Reinforcement Learning with Action-Free Trajectories

[...]

Qinqing Zheng, Mikael Henaff, Brandon Amos, Aditya Grover

12 Oct 2022-arXiv.org

TL;DR: A simple meta-algorithmic pipeline is developed that learns an inverse-dynamics model on the labelled data to obtainproxy-labels for the unlabelled data, followed by the use of any ofﬂine RL algorithm on the true and proxy-labelled trajectories.

...read moreread less

Abstract: Natural agents can effectively learn from multiple data sources that differ in size, quality, and types of measurements. We study this heterogeneity in the context of offline reinforcement learning (RL) by introducing a new, practically motivated semi-supervised setting. Here, an agent has access to two sets of trajectories: labelled trajectories containing state, action and reward triplets at every timestep, along with unlabelled trajectories that contain only state and reward information. For this setting, we develop and study a simple meta-algorithmic pipeline that learns an inverse dynamics model on the labelled data to obtain proxy-labels for the unlabelled data, followed by the use of any offline RL algorithm on the true and proxy-labelled trajectories. Empirically, we find this simple pipeline to be highly successful -- on several D4RL benchmarks~\cite{fu2020d4rl}, certain offline RL algorithms can match the performance of variants trained on a fully labelled dataset even when we label only 10\% of trajectories which are highly suboptimal. To strengthen our understanding, we perform a large-scale controlled empirical study investigating the interplay of data-centric properties of the labelled and unlabelled datasets, with algorithmic design choices (e.g., choice of inverse dynamics, offline RL algorithm) to identify general trends and best practices for training RL agents on semi-supervised offline datasets.

...read moreread less

5 citations

Reliable Conditioning of Behavioral Cloning for Offline Reinforcement Learning

[...]

Tung Nguyen, Qinqing Zheng, Aditya Grover

11 Oct 2022

TL;DR: This paper proposed ConserWeightive Behavioral Cloning (CWBC), a simple and effective method for improving the reliability of conditional BC with two key components: trajectory weighting and conservative regularization.

...read moreread less

Abstract: Behavioral cloning (BC) provides a straightforward solution to offline RL by mimicking offline trajectories via supervised learning. Recent advances (Chen et al., 2021; Janner et al., 2021; Emmons et al., 2021) have shown that by conditioning on desired future returns, BC can perform competitively to their value-based counterparts, while enjoying much more simplicity and training stability. While promising, we show that these methods can be unreliable, as their performance may degrade significantly when conditioned on high, out-of-distribution (ood) returns. This is crucial in practice, as we often expect the policy to perform better than the offline dataset by conditioning on an ood value. We show that this unreliability arises from both the suboptimality of training data and model architectures. We propose ConserWeightive Behavioral Cloning (CWBC), a simple and effective method for improving the reliability of conditional BC with two key components: trajectory weighting and conservative regularization. Trajectory weighting upweights the high-return trajectories to reduce the train-test gap for BC methods, while conservative regularizer encourages the policy to stay close to the data distribution for ood conditioning. We study CWBC in the context of RvS (Emmons et al., 2021) and Decision Transformers (Chen et al., 2021), and show that CWBC significantly boosts their performance on various benchmarks.

...read moreread less

3 citations

Journal Article•DOI•

Imitating, Fast and Slow: Robust learning from demonstrations via decision-time planning

[...]

Carl Qi, Pieter Abbeel, Aditya Grover

07 Apr 2022-arXiv.org

TL;DR: Imitation with Planning at Test-time (IMPLANT) is proposed, a new meta-algorithm for imitation learning that utilizes decision-time planning to correct for compounding errors of any base imitation policy.

...read moreread less

Abstract: The goal of imitation learning is to mimic expert behavior from demonstrations, without access to an explicit reward signal. A popular class of approach infers the (unknown) reward function via inverse reinforcement learning (IRL) followed by maximizing this reward function via reinforcement learning (RL). The policies learned via these approaches are however very brittle in practice and deteriorate quickly even with small test-time perturbations due to compounding errors. We propose Imitation with Planning at Test-time (IMPLANT), a new meta-algorithm for imitation learning that utilizes decision-time planning to correct for compounding errors of any base imitation policy. In contrast to existing approaches, we retain both the imitation policy and the rewards model at decision-time, thereby benefiting from the learning signal of the two components. Empirically, we demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments and excels at zero-shot generalization when subject to challenging perturbations in test-time dynamics.

...read moreread less

2 citations

Journal Article•

Controllable Generative Modeling via Causal Reasoning

[...]

Joey Bose, Ricardo Pio Monti, Aditya Grover

TL;DR: CAGE as mentioned in this paper infers the implicit cause-effect relationships between two attributes as induced by a deep generative model and uses the inferred causal relationships to design a novel strategy for controllable generation based on counterfactual sampling.

...read moreread less

Abstract: Deep latent variable generative models excel at generating complex, high-dimensional data, often exhibiting impressive generalization beyond the training distribution. However, many such models in use today are black-boxes trained on large unlabelled datasets with statistical objectives and lack an interpretable understanding of the latent space required for controlling the generative process. We propose CAGE, a framework for controllable generation in latent variable models based on causal reasoning. Given a pair of attributes, CAGE infers the implicit cause-effect relationships between these attributes as induced by a deep generative model. This is achieved by defining and estimating a novel notion of unit-level causal effects in the latent space of the generative model. Thereafter, we use the inferred cause-effect relationships to design a novel strategy for controllable generation based on counterfactual sampling. Through a series of large-scale synthetic and human evaluations, we demonstrate that generating counterfactual samples which respect the underlying causal relationships inferred via CAGE leads to subjectively more realistic images.

...read moreread less

Journal Article•DOI•

ConserWeightive Behavioral Cloning for Reliable Offline Reinforcement Learning

[...]

Tung Nguyen, Qinqing Zheng, Aditya Grover

arXiv.org

TL;DR: ConserWeightive Behavioral Cloning (CWBC) is proposed, a simple and effective method for improving the performance of conditional BC for ofﬂine RL with two key components: trajectory weighting and conservative regularization.

...read moreread less

Abstract: The goal of ofﬂine reinforcement learning (RL) is to learn near-optimal policies from static logged datasets, thus sidestepping expensive online interactions. Behavioral cloning (BC) provides a straightforward solution to ofﬂine RL by mimicking ofﬂine trajectories via supervised learning. Recent advances (Chen et al., 2021; Janner et al., 2021; Emmons et al., 2021) have shown that by conditioning on desired future returns, BC can perform competitively to their value-based counterparts, while enjoying much more simplicity and training stability. However, the distribution of returns in the ofﬂine dataset can be arbitrarily skewed and suboptimal, which poses a unique challenge for conditioning BC on expert returns at test-time. We propose ConserWeightive Behavioral Cloning (CWBC), a simple and effective method for improving the performance of conditional BC for ofﬂine RL with two key components: trajectory weighting and conservative regularization. Trajectory weighting addresses the bias-variance tradeoff in conditional BC and provides a principled mechanism to learn from both low return trajectories (typically plentiful) and high return trajectories (typically few). Further, we analyze the notion of conservatism in existing BC methods, and propose a novel conservative regularizer that explicitly encourages the policy to stay close to the data distribution. The regularizer helps achieve more reliable performance, and removes the need for ad-hoc tuning of the conditioning value during evaluation. We instantiate CWBC in the context of Reinforcement Learning via Supervised Learning (RvS) (Emmons et al., 2021) and Decision Transformer (DT) (Chen et al., 2021), and empirically show that it signiﬁcantly boosts the performance and stability of prior methods on various ofﬂine RL benchmarks. Code is available at https://github.com/tung-nd/cwbc.

...read moreread less

Rethinking Machine Learning for Climate Science: A Dataset Perspective

[...]

Aditya Grover

TL;DR: The authors argue that many such climate datasets are uniquely biased due to the pervasive use of external simulation models and proxy variables (e.g., satellite measure-ments) for imputing and extrapolating in-situ observational data.

...read moreread less

Abstract: The growing availability of data sources is a predominant factor enabling the widespread success of machine learning (ML) systems across a wide range of applications. Typically, training data in such systems constitutes a source of ground-truth , such as measurements about a physical object (e.g., natural images) or a human artifact (e.g., natural language). In this position paper, we take a critical look at the validity of this assumption for datasets for climate science. We argue that many such climate datasets are uniquely biased due to the pervasive use of external simulation models (e.g., general cir-culation models) and proxy variables (e.g., satellite measure- ments) for imputing and extrapolating in-situ observational data. We discuss opportunities for mitigating the bias in the training and deployment of ML systems using such datasets. Finally, we share views on improving the reliability and ac- countability of ML systems for climate science applications.

...read moreread less

P areto -e fficient d ecision a gents for o ffline m ulti -o bjective r einforcement l earning

[...]

Baiting Zhu, Meihua Dang, Aditya Grover

TL;DR: In this article , a data-driven setup for offline multi-objective reinforcement learning (MORL) is proposed, where a preference-agnostic policy agent is learned using only a finite dataset of offline demonstrations of other agents and their preferences.

...read moreread less

Abstract: The goal of multi-objective reinforcement learning (MORL) is to learn policies that simultaneously optimize multiple competing objectives. In practice, an agent’s preferences over the objectives may not be known apriori, and hence, we require policies that can generalize to arbitrary preferences at test time. In this work, we propose a new data-driven setup for offline MORL, where we wish to learn a preference-agnostic policy agent using only a finite dataset of offline demonstrations of other agents and their preferences. The key contributions of this work are two-fold. First, we introduce D4MORL, (D)atasets for MORL that are specifically designed for offline settings. It contains 1.8 million annotated demonstrations obtained by rolling out reference policies that optimize for randomly sampled preferences on 6 MuJoCo environments with 2-3 objectives each. Second, we propose Pareto-Efficient Decision Agents (PEDA), a family of offline MORL algorithms that builds and extends return-conditioned offline methods including Decision Transformers (Chen et al., 2021) and RvS (Emmons et al., 2022) via a novel preference-and-return conditioned policy. Empirically, we show that PEDA closely approximates the behavioral policy on the D4MORL benchmark and provides an excellent approximation of the Pareto-front with appropriate conditioning, as measured by the hypervolume and sparsity metrics.

...read moreread less