
Showing papers by "Hao Su" published in 2023


Proceedings ArticleDOI
09 Feb 2023
TL;DR: ManiSkill2 as mentioned in this paper is the next generation of the SAPIEN ManiSkill benchmark and includes 20 manipulation task families with 2000+ object models and 4M+ demonstration frames, which cover stationary/mobile-base, single/dual-arm, and rigid/soft-body manipulation tasks with 2D/3D-input data simulated by fully dynamic engines.
Abstract: Generalizable manipulation skills, which can be composed to tackle long-horizon and complex daily chores, are one of the cornerstones of Embodied AI. However, existing benchmarks, mostly composed of a suite of simulatable environments, are insufficient to push cutting-edge research works because they lack object-level topological and geometric variations, are not based on fully dynamic simulation, or are short of native support for multiple types of manipulation tasks. To this end, we present ManiSkill2, the next generation of the SAPIEN ManiSkill benchmark, to address critical pain points often encountered by researchers when using benchmarks for generalizable manipulation skills. ManiSkill2 includes 20 manipulation task families with 2000+ object models and 4M+ demonstration frames, which cover stationary/mobile-base, single/dual-arm, and rigid/soft-body manipulation tasks with 2D/3D-input data simulated by fully dynamic engines. It defines a unified interface and evaluation protocol to support a wide range of algorithms (e.g., classic sense-plan-act, RL, IL), visual observations (point cloud, RGBD), and controllers (e.g., action type and parameterization). Moreover, it empowers fast visual input learning algorithms so that a CNN-based policy can collect samples at about 2000 FPS with 1 GPU and 16 processes on a regular workstation. It implements a render server infrastructure to allow sharing rendering resources across all environments, thereby significantly reducing memory usage. We open-source all codes of our benchmark (simulator, environments, and baselines) and host an online challenge open to interdisciplinary researchers.
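Below is a minimal sampling-loop sketch of how such a benchmark is typically driven through a Gym-style interface; the environment id, the obs_mode/control_mode keyword arguments, and the Gymnasium-flavored API are assumptions for illustration and may differ across ManiSkill2 releases.

```python
# Minimal sampling-loop sketch. The env id "PickCube-v0" and the obs_mode /
# control_mode kwargs are illustrative assumptions; older ManiSkill2 releases
# use the classic gym API (4-tuple step) rather than Gymnasium's 5-tuple.
import gymnasium as gym
import mani_skill2.envs  # noqa: F401  # assumed to register the benchmark's environments

env = gym.make("PickCube-v0", obs_mode="rgbd", control_mode="pd_ee_delta_pose")
obs, info = env.reset(seed=0)
for _ in range(100):
    action = env.action_space.sample()           # stand-in for a CNN-based policy
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```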

10 citations


Journal ArticleDOI
TL;DR: In this article, a scene is represented by a set of local neural radiance fields ("nerflets") that together fit and approximate it more efficiently than a traditional global NeRF, allowing the extraction of panoptic and photometric renderings from arbitrary views.
Abstract: We address efficient and structure-aware 3D scene representation from images. Nerflets are our key contribution -- a set of local neural radiance fields that together represent a scene. Each nerflet maintains its own spatial position, orientation, and extent, within which it contributes to panoptic, density, and radiance reconstructions. By leveraging only photometric and inferred panoptic image supervision, we can directly and jointly optimize the parameters of a set of nerflets so as to form a decomposed representation of the scene, where each object instance is represented by a group of nerflets. During experiments with indoor and outdoor environments, we find that nerflets: (1) fit and approximate the scene more efficiently than traditional global NeRFs, (2) allow the extraction of panoptic and photometric renderings from arbitrary views, and (3) enable tasks rare for NeRFs, such as 3D panoptic segmentation and interactive editing.
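A toy sketch of the core idea, blending a set of local fields into one global density/radiance query, is given below; the isotropic Gaussian influence weights, softmax normalization, and network sizes are illustrative simplifications, not the paper's actual formulation.

```python
import torch

# Toy sketch: blend K local radiance fields ("nerflets") into a global field.
# Each nerflet has a center and an extent; its influence decays with a Gaussian
# of the distance to the query point (isotropic here for simplicity). The real
# method also models orientation and per-nerflet semantics; this only shows blending.
K, D = 8, 16
centers = torch.randn(K, 3)              # per-nerflet position
log_extent = torch.zeros(K)              # per-nerflet scale (log std)
local_mlps = torch.nn.ModuleList(
    [torch.nn.Sequential(torch.nn.Linear(3, D), torch.nn.ReLU(), torch.nn.Linear(D, 4))
     for _ in range(K)]                  # each outputs (density, rgb)
)

def query(x):                            # x: (N, 3) query points
    d2 = ((x[:, None, :] - centers[None]) ** 2).sum(-1)           # (N, K) squared distances
    w = torch.softmax(-d2 / (2 * log_extent.exp() ** 2), dim=-1)  # influence weights
    outs = torch.stack([m(x) for m in local_mlps], dim=1)         # (N, K, 4)
    blended = (w[..., None] * outs).sum(dim=1)                    # (N, 4)
    sigma, rgb = torch.relu(blended[:, :1]), torch.sigmoid(blended[:, 1:])
    return sigma, rgb

sigma, rgb = query(torch.randn(32, 3))
```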

8 citations


Journal ArticleDOI
TL;DR: Factor Fields as mentioned in this paper decomposes a signal into a product of factors, each represented by a neural or regular field operating on a coordinate-transformed input signal, leading to improvements over previous fast reconstruction methods on the three critical goals of neural signal representation.
Abstract: We present Factor Fields, a novel framework for modeling and representing signals. Factor Fields decomposes a signal into a product of factors, each of which is represented by a neural or regular field representation operating on a coordinate transformed input signal. We show that this decomposition yields a unified framework that generalizes several recent signal representations including NeRF, PlenOxels, EG3D, Instant-NGP, and TensoRF. Moreover, the framework allows for the creation of powerful new signal representations, such as the Coefficient-Basis Factorization (CoBaFa) which we propose in this paper. As evidenced by our experiments, CoBaFa leads to improvements over previous fast reconstruction methods in terms of the three critical goals in neural signal representation: approximation quality, compactness and efficiency. Experimentally, we demonstrate that our representation achieves better image approximation quality on 2D image regression tasks, higher geometric quality when reconstructing 3D signed distance fields and higher compactness for radiance field reconstruction tasks compared to previous fast reconstruction methods. Besides, our CoBaFa representation enables generalization by sharing the basis across signals during training, enabling generalization tasks such as image regression with sparse observations and few-shot radiance field reconstruction.
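The following toy sketch illustrates a coefficient-basis style factorization, where the signal at coordinate x is a dot product of a coefficient field c(x) and a basis field b(gamma(x)), fit to a 2D signal by gradient descent; the factor networks and the coordinate transform gamma are illustrative stand-ins, not the paper's CoBaFa implementation.

```python
import torch

# Toy sketch of a coefficient-basis factorization: s(x) ~= <c(x), b(gamma(x))>,
# with a learned coefficient field c, a learned basis field b, and a fixed
# coordinate transform gamma (here a simple frequency scaling). All sizes are
# illustrative; the paper's actual factor and transform choices differ.
R = 32                                   # number of factor components
coeff = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.ReLU(), torch.nn.Linear(64, R))
basis = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.ReLU(), torch.nn.Linear(64, R))

def gamma(x, freq=8.0):                  # coordinate transform applied before the basis
    return torch.sin(freq * x)

def signal(x):                           # x: (N, 2) 2D coordinates, returns (N,) values
    return (coeff(x) * basis(gamma(x))).sum(dim=-1)

# Fit a 2D signal (e.g. an image) by regressing sampled coordinates to values.
x = torch.rand(1024, 2)
y = torch.sin(6.28 * x[:, 0]) * torch.cos(6.28 * x[:, 1])   # stand-in target signal
opt = torch.optim.Adam(list(coeff.parameters()) + list(basis.parameters()), lr=1e-3)
for _ in range(200):
    loss = torch.nn.functional.mse_loss(signal(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
```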

7 citations



Journal ArticleDOI
TL;DR: In this article, a dual Lagrangian view is introduced to enforce cycle consistency between representations under the Eulerian and Lagrangian views, which can be used for dynamic scene reconstruction and part discovery.
Abstract: We present MovingParts, a NeRF-based method for dynamic scene reconstruction and part discovery. We consider motion as an important cue for identifying parts, assuming that all particles on the same part share a common motion pattern. From the perspective of fluid simulation, existing deformation-based methods for dynamic NeRF can be seen as parameterizing the scene motion under the Eulerian view, i.e., focusing on specific locations in space through which the fluid flows as time passes. However, it is intractable to extract the motion of constituting objects or parts using the Eulerian view representation. In this work, we introduce the dual Lagrangian view and enforce representations under the Eulerian/Lagrangian views to be cycle-consistent. Under the Lagrangian view, we parameterize the scene motion by tracking the trajectory of particles on objects. The Lagrangian view makes it convenient to discover parts by factorizing the scene motion as a composition of part-level rigid motions. Experimentally, our method can achieve fast and high-quality dynamic scene reconstruction from even a single moving camera, and the induced part-based representation allows direct applications of part tracking, animation, 3D scene editing, etc.
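A minimal sketch of the Eulerian/Lagrangian cycle-consistency idea is shown below: a forward (Lagrangian) map tracks a canonical point to time t, a backward (Eulerian) map pulls it back, and both are penalized for failing to invert each other; the MLPs and loss weights are illustrative, not the paper's networks.

```python
import torch

# Sketch of Eulerian/Lagrangian cycle consistency: forward_map tracks a canonical
# point x0 to its position at time t (Lagrangian view); backward_map pulls a point
# observed at time t back to canonical space (Eulerian view). Both are trained so
# that composing them recovers the input point; architectures here are illustrative.
def mlp():
    return torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3))

forward_map, backward_map = mlp(), mlp()

def cycle_loss(x0, t):
    xt = forward_map(torch.cat([x0, t], dim=-1))          # canonical -> time t
    x0_rec = backward_map(torch.cat([xt, t], dim=-1))     # time t -> canonical
    xt_rec = forward_map(torch.cat([x0_rec, t], dim=-1))  # and back again
    return ((x0 - x0_rec) ** 2).mean() + ((xt - xt_rec) ** 2).mean()

x0 = torch.randn(256, 3)                 # sampled canonical points
t = torch.rand(256, 1)                   # sampled times
loss = cycle_loss(x0, t)
loss.backward()
```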

2 citations


Journal ArticleDOI
TL;DR: In this paper, tensor factorization and neural fields are used to estimate scene geometry, surface reflectance, and environment illumination from multi-view images captured under unknown lighting conditions.
Abstract: We propose TensoIR, a novel inverse rendering approach based on tensor factorization and neural fields. Unlike previous works that use purely MLP-based neural fields, thus suffering from low capacity and high computation costs, we extend TensoRF, a state-of-the-art approach for radiance field modeling, to estimate scene geometry, surface reflectance, and environment illumination from multi-view images captured under unknown lighting conditions. Our approach jointly achieves radiance field reconstruction and physically-based model estimation, leading to photo-realistic novel view synthesis and relighting results. Benefiting from the efficiency and extensibility of the TensoRF-based representation, our method can accurately model secondary shading effects (like shadows and indirect lighting) and generally support input images captured under single or multiple unknown lighting conditions. The low-rank tensor representation allows us to not only achieve fast and compact reconstruction but also better exploit shared information under an arbitrary number of capturing lighting conditions. We demonstrate the superiority of our method to baseline methods qualitatively and quantitatively on various challenging synthetic and real-world scenes.

2 citations


Journal ArticleDOI
TL;DR: OpenShape as discussed by the authors learns multi-modal joint representations of text, images, and point clouds to enable open-world 3D shape understanding, scaling up training data by ensembling multiple 3D datasets.
Abstract: We introduce OpenShape, a method for learning multi-modal joint representations of text, image, and point clouds. We adopt the commonly used multi-modal contrastive learning framework for representation alignment, but with a specific focus on scaling up 3D representations to enable open-world 3D shape understanding. To achieve this, we scale up training data by ensembling multiple 3D datasets and propose several strategies to automatically filter and enrich noisy text descriptions. We also explore and compare strategies for scaling 3D backbone networks and introduce a novel hard negative mining module for more efficient training. We evaluate OpenShape on zero-shot 3D classification benchmarks and demonstrate its superior capabilities for open-world recognition. Specifically, OpenShape achieves a zero-shot accuracy of 46.8% on the 1,156-category Objaverse-LVIS benchmark, compared to less than 10% for existing methods. OpenShape also achieves an accuracy of 85.3% on ModelNet40, outperforming previous zero-shot baseline methods by 20% and performing on par with some fully-supervised methods. Furthermore, we show that our learned embeddings encode a wide range of visual and semantic concepts (e.g., subcategories, color, shape, style) and facilitate fine-grained text-3D and image-3D interactions. Due to their alignment with CLIP embeddings, our learned shape representations can also be integrated with off-the-shelf CLIP-based models for various applications, such as point cloud captioning and point cloud-conditioned image generation.
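A minimal sketch of the CLIP-style multi-modal contrastive objective is given below, aligning point-cloud embeddings with paired text and image embeddings; the encoder outputs and the (assumed frozen) CLIP features are random placeholders here, not the paper's training setup.

```python
import torch
import torch.nn.functional as F

# CLIP-style contrastive alignment sketch: bring each point-cloud embedding close
# to its paired text/image embedding and away from other pairs in the batch.
def info_nce(a, b, tau=0.07):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau                       # (B, B) similarity matrix
    labels = torch.arange(a.size(0))               # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

B, D = 16, 512
pc_emb = torch.randn(B, D, requires_grad=True)     # stand-in for a 3D backbone's output
txt_emb = torch.randn(B, D)                        # stand-in for frozen CLIP text features
img_emb = torch.randn(B, D)                        # stand-in for frozen CLIP image features

loss = info_nce(pc_emb, txt_emb) + info_nce(pc_emb, img_emb)
loss.backward()
```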

2 citations


Journal ArticleDOI
TL;DR: In this paper, a multi-stage aerobic-biofilm/anaerobic-granular baffle reactor (MOABR) and a control strategy based on pH/aeration time were designed to accelerate the partial nitrification/anammox (PN/A) process.

1 citation


Journal ArticleDOI
10 Jul 2023
TL;DR: In this paper, the authors propose AnyTeleop, a unified and general teleoperation system that supports multiple different arms, hands, realities, and camera configurations within a single system.
Abstract: Vision-based teleoperation offers the possibility to endow robots with human-level intelligence to physically interact with the environment, while only requiring low-cost camera sensors. However, current vision-based teleoperation systems are designed and engineered towards a particular robot model and deployment environment, which scales poorly as the pool of robot models expands and the variety of operating environments increases. In this paper, we propose AnyTeleop, a unified and general teleoperation system to support multiple different arms, hands, realities, and camera configurations within a single system. Although designed to provide great flexibility in the choice of simulators and real hardware, our system still achieves strong performance. For real-world experiments, AnyTeleop can outperform a previous system that was designed for a specific robot hardware, achieving a higher success rate with the same robot. For teleoperation in simulation, AnyTeleop leads to better imitation learning performance compared with a previous system that is particularly designed for that simulator. Project page: http://anyteleop.com/.

Journal ArticleDOI
TL;DR: In this article, the authors propose an imitation learning method that incorporates the idea of temporal abstraction and the planning capabilities of Hierarchical RL (HRL) in a novel and effective manner.
Abstract: We study generalizable policy learning from demonstrations for complex low-level control tasks (e.g., contact-rich object manipulations). We propose an imitation learning method that incorporates the idea of temporal abstraction and the planning capabilities from Hierarchical RL (HRL) in a novel and effective manner. As a step towards decision foundation models, our design can utilize scalable, albeit highly sub-optimal, demonstrations. Specifically, we find that certain short subsequences of the demos, i.e., the chain-of-thought (CoT), reflect their hierarchical structures by marking the completion of subgoals in the tasks. Our model learns to dynamically predict the entire CoT as coherent and structured long-term action guidance and consistently outperforms typical two-stage subgoal-conditioned policies. On the other hand, such CoTs facilitate generalizable policy learning as they exemplify the decision patterns shared among demos (even those with heavy noise and randomness). Our method, Chain-of-Thought Predictive Control (CoTPC), significantly outperforms existing ones on challenging low-level manipulation tasks from scalable yet highly sub-optimal demos.

Journal ArticleDOI
TL;DR: In this paper, a self-assembling paclitaxel (PTX) filament hydrogel that stimulates a macrophage-mediated immune response for local treatment of recurrent glioblastoma multiforme (GBM) is presented.
Abstract: The unique cancer-associated immunosuppression in the brain, combined with a paucity of infiltrating T cells, contributes to the low response rate and poor treatment outcomes of T cell-based immunotherapy for patients diagnosed with glioblastoma multiforme (GBM). Here, we report on a self-assembling paclitaxel (PTX) filament (PF) hydrogel that stimulates a macrophage-mediated immune response for local treatment of recurrent glioblastoma. Our results suggest that aqueous PF solutions containing aCD47 can be directly deposited into the tumor resection cavity, enabling seamless hydrogel filling of the cavity and long-term release of both therapeutics. The PTX PFs elicit an immune-stimulating tumor microenvironment (TME) and thus sensitize the tumor to the aCD47-mediated blockade of the antiphagocytic "don't eat me" signal, which subsequently promotes tumor cell phagocytosis by macrophages and also triggers an antitumor T cell response. As adjuvant therapy after surgery, this aCD47/PF supramolecular hydrogel effectively suppresses primary brain tumor recurrence and prolongs overall survival with minimal off-target side effects.

Hao Su, Xuefeng Liu, Jianwei Niu, Ji Wan, Xinghao Wu 
19 Jul 2023
TL;DR: As discussed by the authors, 3Deformer is a general-purpose framework for interactive 3D shape editing that only requires supervision from readily available semantic images and is compatible with editing various objects, not limited by datasets.
Abstract: We propose 3Deformer, a general-purpose framework for interactive 3D shape editing. Given a source 3D mesh with semantic materials and a user-specified semantic image, 3Deformer can accurately edit the source mesh following the shape guidance of the semantic image, while preserving the source topology as rigidly as possible. Recent studies of 3D shape editing mostly focus on learning neural networks to predict 3D shapes, which requires high-cost 3D training datasets and is limited to handling objects involved in the datasets. Unlike these studies, our 3Deformer is a training-free and general framework, which only requires supervision from readily available semantic images and is compatible with editing various objects unlimited by datasets. In 3Deformer, the source mesh is deformed utilizing the differentiable renderer technique, according to the correspondences between semantic images and mesh materials. However, guiding complex 3D shapes with a simple 2D image incurs extra challenges: the deformation accuracy, surface smoothness, geometric rigidity, and global synchronization of the edited mesh must be guaranteed. To address these challenges, we propose a hierarchical optimization architecture to balance the global and local shape features, and we further propose various strategies and losses to improve accuracy, smoothness, rigidity, and other properties. Extensive experiments show that our 3Deformer is able to produce impressive results and reaches the state-of-the-art level.
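A simplified sketch of such a differentiable-renderer-driven deformation loop follows; `render_fn` is a hypothetical user-supplied differentiable renderer, and the edge-length rigidity and offset penalties are simplified stand-ins for the paper's losses.

```python
import torch

# Sketch of the deformation loop: optimize per-vertex offsets so that a
# differentiable rendering of the deformed mesh matches the target semantic image,
# while simple regularizers keep the surface near-rigid and the deformation small.
# render_fn(vertices, faces, camera) -> (H, W, C) semantics is assumed to be a
# user-supplied differentiable renderer (e.g. a soft rasterizer).
def edge_rigidity(vertices, edges, rest_lengths):
    d = vertices[edges[:, 0]] - vertices[edges[:, 1]]
    return ((d.norm(dim=-1) - rest_lengths) ** 2).mean()

def deform(vertices, faces, edges, target_sem, camera, render_fn, steps=500, lr=1e-2):
    offsets = torch.zeros_like(vertices, requires_grad=True)
    rest = (vertices[edges[:, 0]] - vertices[edges[:, 1]]).norm(dim=-1).detach()
    opt = torch.optim.Adam([offsets], lr=lr)
    for _ in range(steps):
        v = vertices + offsets
        loss = (torch.nn.functional.mse_loss(render_fn(v, faces, camera), target_sem)
                + 0.1 * edge_rigidity(v, edges, rest)   # rigidity regularizer
                + 0.01 * offsets.pow(2).mean())         # keep the deformation small
        opt.zero_grad(); loss.backward(); opt.step()
    return (vertices + offsets).detach()
```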

Journal ArticleDOI
TL;DR: In this paper , the advantages of using demonstrations in sequential decision making, various ways to apply demonstrations in learning-based decision making paradigms (for example, reinforcement learning and planning in the learned models), and how to collect the demonstrations in various scenarios.
Abstract: Although reinforcement learning has seen tremendous success recently, this kind of trial-and-error learning can be impractical or inefficient in complex environments. The use of demonstrations, on the other hand, enables agents to benefit from expert knowledge rather than having to discover the best action to take through exploration. In this survey, we discuss the advantages of using demonstrations in sequential decision making, various ways to apply demonstrations in learning-based decision making paradigms (for example, reinforcement learning and planning in the learned models), and how to collect the demonstrations in various scenarios. Additionally, we exemplify a practical pipeline for generating and utilizing demonstrations in the recently proposed ManiSkill robot learning benchmark.

Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors presented a data-driven approach to automatically detect L dwarfs from SDSS images using an improved Faster R-CNN framework based on deep learning.
Abstract: We present a data-driven approach to automatically detect L dwarfs from Sloan Digital Sky Survey (SDSS) images using an improved Faster R-CNN framework based on deep learning. The established L-dwarf automatic detection (LDAD) model distinguishes L dwarfs from other celestial objects and backgrounds in SDSS field images by learning the features of 387 SDSS images containing L dwarfs. Applying the LDAD model to the SDSS images containing 93 labeled L dwarfs in the test set, we successfully detected 83 known L dwarfs, corresponding to a recall rate of 89.25% for known L dwarfs. Several techniques are implemented in the LDAD model to improve its detection performance for L dwarfs, including the deep residual network and the feature pyramid network. As a result, the LDAD model outperforms the original Faster R-CNN model, whose recall rate of known L dwarfs is 80.65% for the same test set. The LDAD model was also applied to detect L dwarfs from a larger validation set including 843 labeled L dwarfs, resulting in a recall rate of 94.42% for known L dwarfs. The newly identified candidates include L dwarfs as well as late-M and T dwarfs, which were estimated from the color (i − z) versus spectral-type relation. The contamination rates for the test candidates and validation candidates are 8.60% and 9.27%, respectively. The detection results indicate that our model is effective in searching for L dwarfs from astronomical images.
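For context, the snippet below shows the generic torchvision recipe for fine-tuning a Faster R-CNN with a ResNet-50 + FPN backbone on a single object class; it is not the paper's LDAD code, and the image size and hyperparameters are illustrative.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Generic torchvision sketch (not the paper's LDAD code): Faster R-CNN with a
# ResNet-50 + FPN backbone, with the box head replaced for 2 classes
# (background + "L dwarf"), trained on images with bounding-box targets.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)

optimizer = torch.optim.SGD(model.parameters(), lr=5e-3, momentum=0.9)
model.train()

# One illustrative training step on a dummy image/target pair.
images = [torch.rand(3, 512, 512)]
targets = [{"boxes": torch.tensor([[100.0, 120.0, 140.0, 160.0]]),
            "labels": torch.tensor([1])}]
loss_dict = model(images, targets)          # returns a dict of detection losses
loss = sum(loss_dict.values())
optimizer.zero_grad(); loss.backward(); optimizer.step()
```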

Journal ArticleDOI
TL;DR: In this article, the authors evaluated the effect of the weight distribution around the pelvis and the amount of weight applied on the gait of 21 healthy individuals walking on a treadmill while bearing weights on the pelvis, characterizing the inertial transparency of a hypothetical hip exoskeleton.
Abstract: In exoskeleton research, transparency is the degree to which a device hinders the movement of the user, a critical component of performance and usability. Transparency is most often evaluated individually, thus lacking generalization. Our goal was to systematically evaluate transparency due to inertial effects on the gait of a hypothetical hip exoskeleton. We predicted that the weight distribution around the pelvis and the amount of weight applied would change gait characteristics. We instructed 21 healthy individuals to walk on a treadmill while bearing weights on the pelvis between 4 and 8 kg in three different configurations: bilaterally, unilaterally (left side), and on the lumbar portion of the back (L4). We measured kinematics, kinetics, and muscle activity during randomly ordered trials of 1.5 min at typical walking speed. We also calculated the margin of stability to measure medial-lateral stability. We observed that loading the hips bilaterally with 4 kg produced no changes in kinematics, kinetics, dynamic stability, or muscle activity, but above 6 kg, sagittal joint power was increased. Loading the lumbar area increased posterior pelvic tilt at 6 kg and decreased dynamic stability at 4 kg, with many individuals reporting some discomfort. For the unilateral placement, above 4 kg dynamic stability was decreased and hip joint power was increased, and above 6 kg the pelvis began to dip towards the loaded side. These results show the different effects of weight distribution around the pelvis. This study represents a novel, systematic approach to characterizing transparency in exoskeleton design (clinicaltrials.gov: NCT05120115).
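As a point of reference for the margin-of-stability measure mentioned above, the sketch below implements the common extrapolated-center-of-mass formulation (XCoM = CoM + v/omega0, omega0 = sqrt(g/l)); the study's exact convention may differ, so treat signs and boundaries as illustrative.

```python
import numpy as np

# Hedged sketch of a common medial-lateral margin-of-stability computation
# (extrapolated center of mass). The study may use a different convention;
# signs and boundary definitions here are illustrative.
def margin_of_stability(com_ml, com_vel_ml, bos_boundary_ml, leg_length, g=9.81):
    omega0 = np.sqrt(g / leg_length)
    xcom = com_ml + com_vel_ml / omega0          # extrapolated center of mass (m)
    return bos_boundary_ml - xcom                # positive = XCoM inside the base of support

# Example: lateral BoS boundary 2 cm from the CoM, CoM moving laterally at 0.1 m/s.
print(margin_of_stability(com_ml=0.00, com_vel_ml=0.10,
                          bos_boundary_ml=0.02, leg_length=0.9))
```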

Journal ArticleDOI
TL;DR: In this article, a three-stage partial-denitrification/partial-nitrification/anammox (PD/PN/Anammox) system was established, and the authors showed that the excess of COD and the reduction of nitrite in the PD process caused by the instability of influent nitrate are avoided by maintaining a real-time influent COD/N ratio of 2.1.
Abstract: In this study, a novel three-stage partial-denitrification/partial-nitrification/anammox (PD/PN/Anammox) system was established. Results showed that the excess of COD and the reduction of nitrite in the PD process caused by the instability of influent nitrate are avoided by maintaining a real-time influent COD/N ratio of 2.1. Extending the aeration interval in the PN process enhanced the inhibition of nitrite-oxidizing bacteria and promoted the activity of ammonia-oxidizing bacteria under low free ammonia (FA, 1.1 ± 0.2 mg/L). In the upflow porous-plate anaerobic reactor (UPPAR), anammox bacteria (Candidatus Kuenenia) were rapidly and naturally enriched from 0% to 48.4%, and granular anammox was formed through the increase of filamentous bacteria, α-d-glucose, and α-helix content. Moreover, the removal and robustness of ammonia nitrogen (98.9 ± 1.8%) and total nitrogen (87.0 ± 2.3%) in this system are better than those of the single-stage PN/PD/Anammox process, and its costs of oxygen consumption, COD consumption, and sludge treatment are far lower than those of the nitrification-denitrification process and the PN/Anammox/denitrification process. This study provides a series of feasible start-up and operation strategies for the PD, PN, Anammox, and PD/PN/Anammox processes for the application of anammox to rare earth mining wastewater, and it proposes upgrading and renovation plans for local wastewater treatment plants.

Journal ArticleDOI
22 May 2023 - ACS Nano
TL;DR: In this article, a peptide-based supramolecular filament (SF) hydrogel was used as a universal carrier for localized delivery of three immunomodulating agents, including an aPD1 antibody, an IL15 cytokine, and a STING agonist (CDA).
Abstract: A major challenge of cancer immunotherapy is to develop delivery strategies that can effectively and safely augment the immune system's antitumor response. Here, we report on the design and synthesis of a peptide-based supramolecular filament (SF) hydrogel as a universal carrier for localized delivery of three immunomodulating agents of distinct action mechanisms and different molecular weights, including an aPD1 antibody, an IL15 cytokine, and a STING agonist (CDA). We show that in situ hydrogelation can be triggered to occur upon intratumoral injection of SF solutions containing each of aPD1, IL15, or CDA. The formed hydrogel serves as a scaffold depot for sustained and MMP-2-responsive release of immunotherapeutic agents, achieving enhanced antitumor activities and reduced side effects. When administered in combination, the aPD1/IL15 or aPD1/CDA hydrogel led to substantially increased T-cell infiltration and prevented the development of adaptive immune resistance induced by IL15 or CDA alone. These immunotherapy combinations resulted in complete regression of established large GL-261 tumors in all mice and elicited a protective long-acting and systemic antitumor immunity to prevent tumor recurrence while eradicating distant tumors. We believe this SF hydrogel offers a simple yet generalizable strategy for local delivery of diverse immunomodulators for enhanced antitumoral response and improved treatment outcomes.

Journal ArticleDOI
TL;DR: As discussed by the authors, a view-conditioned 2D diffusion model, Zero123, is used to generate multi-view images for the input view, which are then lifted up to 3D space.
Abstract: Single image 3D reconstruction is an important but challenging task that requires extensive knowledge of our natural world. Many existing methods solve this problem by optimizing a neural radiance field under the guidance of 2D diffusion models but suffer from lengthy optimization time, 3D-inconsistent results, and poor geometry. In this work, we propose a novel method that takes a single image of any object as input and generates a full 360-degree 3D textured mesh in a single feed-forward pass. Given a single image, we first use a view-conditioned 2D diffusion model, Zero123, to generate multi-view images for the input view, and then aim to lift them up to 3D space. Since traditional reconstruction methods struggle with inconsistent multi-view predictions, we build our 3D reconstruction module upon an SDF-based generalizable neural surface reconstruction method and propose several critical training strategies to enable the reconstruction of 360-degree meshes. Without costly optimizations, our method reconstructs 3D shapes in significantly less time than existing methods. Moreover, our method favors better geometry, generates more 3D consistent results, and adheres more closely to the input image. We evaluate our approach on both synthetic data and in-the-wild images and demonstrate its superiority in terms of both mesh quality and runtime. In addition, our approach can seamlessly support the text-to-3D task by integrating with off-the-shelf text-to-image diffusion models.

Journal ArticleDOI
TL;DR: In this article, differentiable rendering-based camera pose optimization and consistency-based joint space exploration are proposed for hand-eye calibration, enabling accurate end-to-end optimization of the calibration process and eliminating the need for the manual design of robot joint poses.
Abstract: Hand-eye calibration is a critical task in robotics, as it directly affects the efficacy of critical operations such as manipulation and grasping. Traditional methods for achieving this objective necessitate the careful design of joint poses and the use of specialized calibration markers, while most recent learning-based approaches using solely pose regression are limited in their abilities to diagnose inaccuracies. In this work, we introduce a new approach to hand-eye calibration called EasyHeC, which is markerless, white-box, and offers comprehensive coverage of positioning accuracy across the entire robot configuration space. We introduce two key technologies: differentiable rendering-based camera pose optimization and consistency-based joint space exploration, which enable accurate end-to-end optimization of the calibration process and eliminate the need for the laborious manual design of robot joint poses. Our evaluation demonstrates superior performance in synthetic and real-world datasets, enhancing downstream manipulation tasks by providing precise camera poses for locating and interacting with objects. The code is available at the project page: https://ootts.github.io/easyhec.
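A minimal sketch of differentiable-rendering-based camera pose optimization is shown below; `render_fn` is a hypothetical differentiable renderer of the robot mask at known joint angles, and the axis-angle pose parameterization and loss are illustrative, not the EasyHeC implementation.

```python
import torch

# Sketch of differentiable-rendering-based pose optimization for hand-eye
# calibration: parameterize the camera pose as axis-angle rotation + translation
# and minimize the discrepancy between a rendered robot mask and the observed mask.
# render_fn(R, t, joint_angles) -> (H, W) mask in [0, 1] is a user-supplied
# differentiable renderer; everything else here is illustrative.
def skew(v):
    zero = torch.zeros((), dtype=v.dtype)
    return torch.stack([torch.stack([zero, -v[2], v[1]]),
                        torch.stack([v[2], zero, -v[0]]),
                        torch.stack([-v[1], v[0], zero])])

def rodrigues(rvec):                     # axis-angle vector -> rotation matrix
    theta = rvec.norm() + 1e-8
    K = skew(rvec / theta)
    return torch.eye(3, dtype=rvec.dtype) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

def calibrate(observed_mask, joint_angles, render_fn, steps=300, lr=1e-2):
    rvec = torch.zeros(3, requires_grad=True)
    tvec = torch.zeros(3, requires_grad=True)
    opt = torch.optim.Adam([rvec, tvec], lr=lr)
    for _ in range(steps):
        pred = render_fn(rodrigues(rvec), tvec, joint_angles)
        loss = torch.nn.functional.binary_cross_entropy(
            pred.clamp(1e-6, 1 - 1e-6), observed_mask)
        opt.zero_grad(); loss.backward(); opt.step()
    return rodrigues(rvec).detach(), tvec.detach()
```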

Journal ArticleDOI
TL;DR: Wei et al. as discussed by the authors combine the benefits of both worlds (robust volumetric optimization and high-quality differentiable rasterization); they take the geometry initialization obtained from neural volumetric fields and further optimize the geometry, as well as a compact neural texture representation, with differentiable rasterizers.
Abstract: We present a method for generating high-quality watertight manifold meshes from multi-view input images. Existing volumetric rendering methods are robust in optimization but tend to generate noisy meshes with poor topology. Differentiable rasterization-based methods can generate high-quality meshes but are sensitive to initialization. Our method combines the benefits of both worlds; we take the geometry initialization obtained from neural volumetric fields, and further optimize the geometry as well as a compact neural texture representation with differentiable rasterizers. Through extensive experiments, we demonstrate that our method can generate accurate mesh reconstructions with faithful appearance that are comparable to previous volume rendering methods while being an order of magnitude faster in rendering. We also show that our generated mesh and neural texture reconstruction is compatible with existing graphics pipelines and enables downstream 3D applications such as simulation. Project page: https://sarahweiii.github.io/neumanifold/


20 Jul 2023
TL;DR: In this article, the authors propose Reparameterized Policy Gradient (RPG), a model-based reinforcement learning method for high-dimensional continuous action spaces, which can help agents evade local optima in tasks with dense rewards and solve challenging sparse-reward environments by incorporating an object-centric intrinsic reward.
Abstract: We investigate the challenge of parametrizing policies for reinforcement learning (RL) in high-dimensional continuous action spaces. Our objective is to develop a multimodal policy that overcomes limitations inherent in the commonly-used Gaussian parameterization. To achieve this, we propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories. By conditioning the policy on a latent variable, we derive a novel variational bound as the optimization objective, which promotes exploration of the environment. We then present a practical model-based RL method, called Reparameterized Policy Gradient (RPG), which leverages the multimodal policy parameterization and learned world model to achieve strong exploration capabilities and high data efficiency. Empirical results demonstrate that our method can help agents evade local optima in tasks with dense rewards and solve challenging sparse-reward environments by incorporating an object-centric intrinsic reward. Our method consistently outperforms previous approaches across a range of tasks. Code and supplementary materials are available on the project page https://haosulab.github.io/RPG/
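A generic sketch of the latent-conditioned (multimodal) policy parameterization is given below: a categorical latent selects a mode, and a Gaussian head conditioned on the state and latent produces the action; the sizes, prior, and heads are illustrative and are not the paper's RPG code.

```python
import torch

# Generic sketch of a latent-conditioned multimodal policy: a categorical latent z
# selects a mode, and a Gaussian head conditioned on (state, z) outputs the action.
S, A, Z = 16, 4, 8                      # state dim, action dim, number of latent modes

prior_logits = torch.zeros(Z, requires_grad=True)                    # learnable p(z)
embed_z = torch.nn.Embedding(Z, 16)
head = torch.nn.Sequential(torch.nn.Linear(S + 16, 64), torch.nn.ReLU(),
                           torch.nn.Linear(64, 2 * A))               # mean and log-std

def sample_action(state):
    z = torch.distributions.Categorical(logits=prior_logits).sample((1,))  # pick a mode
    h = head(torch.cat([state, embed_z(z)[0]], dim=-1))
    mean, log_std = h[:A], h[A:]
    dist = torch.distributions.Normal(mean, log_std.exp())
    action = dist.sample()
    return action, z.item(), dist.log_prob(action).sum()

action, mode, logp = sample_action(torch.randn(S))
```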

Proceedings ArticleDOI
27 Mar 2023
TL;DR: In this article , a principled framework that abstracts dexterous manipulation skills from human demonstration and refines the learned skills with differentiable physics is proposed, which is able to better explore and generalize across novel goals unseen in the initial human demonstrations.
Abstract: In this work, we aim to learn dexterous manipulation of deformable objects using multi-fingered hands. Reinforcement learning approaches for dexterous rigid object manipulation would struggle in this setting due to the complexity of physics interaction with deformable objects. At the same time, previous trajectory optimization approaches with differentiable physics for deformable manipulation would suffer from local optima caused by the explosion of contact modes from hand-object interactions. To address these challenges, we propose DexDeform, a principled framework that abstracts dexterous manipulation skills from human demonstration and refines the learned skills with differentiable physics. Concretely, we first collect a small set of human demonstrations using teleoperation. And we then train a skill model using demonstrations for planning over action abstractions in imagination. To explore the goal space, we further apply augmentations to the existing deformable shapes in demonstrations and use a gradient optimizer to refine the actions planned by the skill model. Finally, we adopt the refined trajectories as new demonstrations for finetuning the skill model. To evaluate the effectiveness of our approach, we introduce a suite of six challenging dexterous deformable object manipulation tasks. Compared with baselines, DexDeform is able to better explore and generalize across novel goals unseen in the initial human demonstrations.

Journal ArticleDOI
TL;DR: In this article, 3D point clouds are used to train 3D neural networks that learn features in the 3D-native space, and a robust point cloud RL algorithm is developed for various robotic manipulation and control tasks.
Abstract: Recent studies on visual reinforcement learning (visual RL) have explored the use of 3D visual representations. However, none of these works has systematically compared the efficacy of 3D representations with 2D representations across different tasks, nor have they analyzed 3D representations from the perspective of agent-object / object-object relationship reasoning. In this work, we seek answers to the question of when and how 3D neural networks that learn features in the 3D-native space provide a beneficial inductive bias for visual RL. We specifically focus on 3D point clouds, one of the most common forms of 3D representations. We systematically investigate design choices for 3D point cloud RL, leading to the development of a robust algorithm for various robotic manipulation and control tasks. Furthermore, through comparisons between 2D image vs. 3D point cloud RL methods on both minimalist synthetic tasks and complex robotic manipulation tasks, we find that 3D point cloud RL can significantly outperform the 2D counterpart when agent-object / object-object relationship encoding is a key factor.
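A minimal PointNet-style encoder for point-cloud observations, of the kind such 3D RL pipelines typically build on, is sketched below; the shared per-point MLP, max pooling, and policy-head sizes are illustrative, not the paper's architecture.

```python
import torch

# Minimal PointNet-style encoder for point-cloud RL observations: a shared
# per-point MLP followed by a permutation-invariant max pool, whose global
# feature feeds a small policy head. Sizes are illustrative.
class PointEncoder(torch.nn.Module):
    def __init__(self, in_dim=3, feat_dim=128, action_dim=8):
        super().__init__()
        self.point_mlp = torch.nn.Sequential(
            torch.nn.Linear(in_dim, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, feat_dim), torch.nn.ReLU())
        self.policy_head = torch.nn.Sequential(
            torch.nn.Linear(feat_dim, 128), torch.nn.ReLU(),
            torch.nn.Linear(128, action_dim), torch.nn.Tanh())

    def forward(self, points):                       # points: (B, N, 3) xyz coordinates
        per_point = self.point_mlp(points)           # (B, N, feat_dim)
        global_feat = per_point.max(dim=1).values    # permutation-invariant pooling
        return self.policy_head(global_feat)         # (B, action_dim) actions in [-1, 1]

actions = PointEncoder()(torch.randn(4, 1024, 3))
```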