Showing papers by "Leonidas J. Guibas published in 2022"

PDF

Open Access

Proceedings Article•DOI•

Panoptic Neural Fields: A Semantic Object-Aware Neural Scene Representation

[...]

Abhijit Kundu, Kyle Genova, Xiaoqi Yin, Alireza Fathi, Caroline Pantofaru, Leonidas J. Guibas, Andrea Tagliasacchi, Frank Dellaert, Thomas Funkhouser - Show less +5 more

09 May 2022

TL;DR: Panoptic Neural Fields is presented, an object-aware neural scene representation that decomposes a scene into a set of objects (things) and background (stuff) that can be smaller and faster than previous object- aware approaches, while still leveraging category-specific priors incorporated via meta-learned initialization.

...read moreread less

Abstract: We present Panoptic Neural Fields (PNF), an object-aware neural scene representation that decomposes a scene into a set of objects (things) and background (stuff). Each object is represented by an oriented 3D bounding box and a multi-layer perceptron (MLP) that takes position, direction, and time and outputs density and radiance. The background stuff is represented by a similar MLP that additionally outputs semantic labels. Each object MLPs are instance-specific and thus can be smaller and faster than previous object-aware approaches, while still leveraging category-specific priors incorporated via meta-learned initialization. Our model builds a panoptic radiance field representation of any scene from just color images. We use off-the-shelf algorithms to predict camera poses, object tracks, and 2D image semantic segmentations. Then we jointly optimize the MLP weights and bounding box parameters using analysis-by-synthesis with self-supervision from color images and pseudo-supervision from predicted semantic segmentations. During experiments with real-world dynamic scenes, we find that our model can be used effectively for several tasks like novel view synthesis, 2D panoptic segmentation, 3D scene editing, and multiview depth prediction.

...read moreread less

69 citations

Proceedings Article•DOI•

Object Scene Representation Transformer

[...]

Mehdi Sajjadi, Daniel Duckworth, Arun Suria Karnan Mahendran, S. Van Steenkiste, Filip Paveti'c, Mario Luvci'c, Leonidas J. Guibas, Klaus Greff, T. Kipf - Show less +5 more

14 Jun 2022

TL;DR: Object Scene Representation Transformer is introduced, a 3D-centric model in which individual object representations naturally emerge through novel view synthesis and is multiple orders of magnitude faster at compositional rendering thanks to its light parametrization and the novel Slot Mixer decoder.

...read moreread less

Abstract: A compositional understanding of the world in terms of objects and their geometry in 3D space is considered a cornerstone of human cognition. Facilitating the learning of such a representation in neural networks holds promise for substantially improving labeled data efficiency. As a key step in this direction, we make progress on the problem of learning 3D-consistent decompositions of complex scenes into individual objects in an unsupervised fashion. We introduce Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis. OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods. At the same time, it is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder. We believe this work will not only accelerate future architecture exploration and scaling efforts, but it will also serve as a useful tool for both object-centric as well as neural scene representation learning communities.

...read moreread less

33 citations

Journal Article•DOI•

ACID: Action-Conditional Implicit Visual Dynamics for Deformable Object Manipulation

[...]

Bokui Shen, Zhenyu Jiang, Christopher Choy, Leonidas J. Guibas, Silvio Savarese, Animashree Anandkumar, Yuke Zhu - Show less +3 more

14 Mar 2022-Robotics

TL;DR: ACID, an action-conditional visual dynamics model for volumetric deformable objects based on structured implicit neural representations, achieves the best performance in geometry, correspondence, and dynamics predictions over existing approaches.

...read moreread less

22 citations

Proceedings Article•DOI•

ConDor: Self-Supervised Canonicalization of 3D Pose for Partial Shapes

[...]

Rahul Sajnani, Adrien Poulenard, Jivitesh Jain, Radhika Dua Dua, Leonidas J. Guibas, Srinath Sridhar - Show less +2 more

19 Jan 2022

TL;DR: ConDor is a self-supervised method that learns to Canonicalize the 3D orientation and position for full and partial 3D point clouds on top of Tensor Field Networks, a class of permutation- and rotation-equivariant, and translation-invariant 3D networks.

...read moreread less

Abstract: Progress in 3D object understanding has relied on manually “canonicalized” shape datasets that contain instances with consistent position and orientation (3D pose). This has made it hard to generalize these methods to in-the-wild shapes, e.g., from internet model collections or depth sensors. ConDor is a self-supervised method that learns to Canonicalize the 3D orientation and position for full and partial 3D point clouds. We build on top of Tensor Field Networks (TFNs), a class of permutation- and rotation-equivariant, and translation-invariant 3D networks. During inference, our method takes an unseen full or partial 3D point cloud at an arbitrary pose and outputs an equivariant canonical pose. During training, this network uses self-supervision losses to learn the canonical pose from an un-canonicalized collection of full and partial 3D point clouds. ConDor can also learn to consistently co-segment object parts without any supervision. Extensive quantitative results on four new metrics show that our approach out-performs existing methods while enabling new applications such as operation on depth images and annotation transfer.

...read moreread less

19 citations

Journal Article•DOI•

NeRDi: Single-View NeRF Synthesis with Language-Guided Diffusion as General Image Priors

[...]

Congyue Deng, Chiyu "Max" Jiang, Charles R. Qi, Xinchen Yan, Yin Zhou, Leonidas J. Guibas, Dragomir Anguelov - Show less +3 more

06 Dec 2022-arXiv.org

TL;DR: NeRDi as discussed by the authors proposes a single-view NeRF synthesis framework with general image priors from 2D diffusion models to improve multiview content coherence and regularize the underlying 3D geometry of the NeRF.

...read moreread less

Abstract: 2D-to-3D reconstruction is an ill-posed problem, yet humans are good at solving this problem due to their prior knowledge of the 3D world developed over years. Driven by this observation, we propose NeRDi, a single-view NeRF synthesis framework with general image priors from 2D diffusion models. Formulating single-view reconstruction as an image-conditioned 3D generation problem, we optimize the NeRF representations by minimizing a diffusion loss on its arbitrary view renderings with a pretrained image diffusion model under the input-view constraint. We leverage off-the-shelf vision-language models and introduce a two-section language guidance as conditioning inputs to the diffusion model. This is essentially helpful for improving multiview content coherence as it narrows down the general image prior conditioned on the semantic and visual features of the single-view input image. Additionally, we introduce a geometric loss based on estimated depth maps to regularize the underlying 3D geometry of the NeRF. Experimental results on the DTU MVS dataset show that our method can synthesize novel views with higher quality even compared to existing methods trained on this dataset. We also demonstrate our generalizability in zero-shot NeRF synthesis for in-the-wild images.

...read moreread less

16 citations

Journal Article•DOI•

3D-Aware Video Generation

[...]

Sherwin Bahmani, Jeong Joon Park, Despoina Paschalidou, Hao Tang, Gordon Wetzstein, Leonidas J. Guibas, Luc Van Gool, Radu Timofte - Show less +4 more

29 Jun 2022-arXiv.org

TL;DR: This work develops a GAN framework that synthesizes 3D video supervised only with monocular videos and learns a rich embedding of decomposable 3D structures and motions that enables new visual effects of spatio-temporal renderings while producing imagery with quality comparable to that of existing 3D or video GANs.

...read moreread less

Abstract: Generative models have emerged as an essential building block for many image synthesis and editing tasks. Recent advances in this field have also enabled high-quality 3D or video content to be generated that exhibits either multi-view or temporal consistency. With our work, we explore 4D generative adversarial networks (GANs) that learn unconditional generation of 3D-aware videos. By combining neural implicit representations with time-aware discriminator, we develop a GAN framework that synthesizes 3D video supervised only with monocular videos. We show that our method learns a rich embedding of decomposable 3D structures and motions that enables new visual effects of spatio-temporal renderings while producing imagery with quality comparable to that of existing 3D or video GANs.

...read moreread less

10 citations

Proceedings Article•DOI•

GIMO: Gaze-Informed Human Motion Prediction in Context

[...]

Yang Zheng, Yanchao Yang, Kaichun Mo, Jiaman Li, Yebin Liu, Karen Liu, Leonidas J. Guibas - Show less +3 more

20 Apr 2022

TL;DR: Zheng et al. as discussed by the authors proposed a large-scale human motion dataset that delivers high-quality body pose sequences, scene scans, as well as ego-centric views with the eye gaze that serves as a surrogate for inferring human intent.

...read moreread less

Abstract: Predicting human motion is critical for assistive robots and AR/VR applications, where the interaction with humans needs to be safe and comfortable. Meanwhile, an accurate prediction depends on understanding both the scene context and human intentions. Even though many works study scene-aware human motion prediction, the latter is largely underexplored due to the lack of ego-centric views that disclose human intent and the limited diversity in motion and scenes. To reduce the gap, we propose a large-scale human motion dataset that delivers high-quality body pose sequences, scene scans, as well as ego-centric views with the eye gaze that serves as a surrogate for inferring human intent. By employing inertial sensors for motion capture, our data collection is not tied to specific scenes, which further boosts the motion dynamics observed from our subjects. We perform an extensive study of the benefits of leveraging the eye gaze for ego-centric human motion prediction with various state-of-the-art architectures. Moreover, to realize the full potential of the gaze, we propose a novel network architecture that enables bidirectional communication between the gaze and motion branches. Our network achieves the top performance in human motion prediction on the proposed dataset, thanks to the intent information from eye gaze and the denoised gaze feature modulated by the motion. Code and data can be found at https://github.com/y-zheng18/GIMO.

...read moreread less

9 citations

Proceedings Article•DOI•

NeuForm: Adaptive Overfitting for Neural Shape Editing

[...]

Connor Z. Lin, Niloy J. Mitra, Gordon Wetzstein, Leonidas J. Guibas, Paul Guerrero - Show less +1 more

18 Jul 2022

TL;DR: N EU F ORM is proposed to combine the advantages of both over-tted and generalizable representations by adaptively using the one most appropriate for each shape region: the overﬁtted representation where reliable data is available, and the generalizable representation everywhere else.

...read moreread less

Abstract: Neural representations are popular for representing shapes, as they can be learned form sensor data and used for data cleanup, model completion, shape editing, and shape synthesis. Current neural representations can be categorized as either overfitting to a single object instance, or representing a collection of objects. However, neither allows accurate editing of neural scene representations: on the one hand, methods that overfit objects achieve highly accurate reconstructions, but do not generalize to unseen object configurations and thus cannot support editing; on the other hand, methods that represent a family of objects with variations do generalize but produce only approximate reconstructions. We propose NEUFORM to combine the advantages of both overfitted and generalizable representations by adaptively using the one most appropriate for each shape region: the overfitted representation where reliable data is available, and the generalizable representation everywhere else. We achieve this with a carefully designed architecture and an approach that blends the network weights of the two representations, avoiding seams and other artifacts. We demonstrate edits that successfully reconfigure parts of human-designed shapes, such as chairs, tables, and lamps, while preserving semantic integrity and the accuracy of an overfitted shape representation. We compare with two state-of-the-art competitors and demonstrate clear improvements in terms of plausibility and fidelity of the resultant edits.

...read moreread less

6 citations

Journal Article•DOI•

SinGRAF: Learning a 3D Generative Radiance Field for a Single Scene

[...]

Minjung Son, Jeong Joon Park, Leonidas J. Guibas, Gordon Wetzstein

30 Nov 2022-arXiv.org

TL;DR: Li et al. as discussed by the authors introduced a 3D-aware generative model that is trained with a few input images of a single scene, which can generate different realizations of this 3D scene that preserve the appearance of the input while varying scene layout.

...read moreread less

Abstract: Generative models have shown great promise in synthesizing photorealistic 3D objects, but they require large amounts of training data. We introduce SinGRAF, a 3D-aware generative model that is trained with a few input images of a single scene. Once trained, SinGRAF generates different realizations of this 3D scene that preserve the appearance of the input while varying scene layout. For this purpose, we build on recent progress in 3D GAN architectures and introduce a novel progressive-scale patch discrimination approach during training. With several experiments, we demonstrate that the results produced by SinGRAF outperform the closest related works in both quality and diversity by a large margin.

...read moreread less

5 citations

Journal Article•

Directed Weight Neural Networks for Protein Structure Representation Learning

[...]

Jiahan Li, Shitong Luo, Congyue Deng, Chaoran Cheng, Jiaqi Guan, Leonidas J. Guibas, Jian Peng, Jianzhu Ma - Show less +4 more

28 Jan 2022-arXiv.org

TL;DR: The Directed Weight Neural Network is proposed for better capturing geometric relations among different amino acids, and an equivariant message passing paradigm on proteins for plugging the directed weight perceptrons into existing Graph Neural Networks, showing superior versatility in main-taining SO(3) -equivariance at the global scale.

...read moreread less

Abstract: A protein performs biological functions by folding to a particular 3D structure. To accurately model the protein structures, both the overall geometric topology and local ﬁne-grained relations between amino acids (e.g. side-chain torsion angles and inter-amino-acid orientations) should be carefully considered. In this work, we propose the Directed Weight Neural Network for better capturing geometric relations among different amino acids. Extending a single weight from a scalar to a 3D directed vector, our new framework supports a rich set of geometric operations on both classical and SO(3) -representation features, on top of which we construct a perceptron unit for processing amino-acid information. In addi-tion, we introduce an equivariant message passing paradigm on proteins for plugging the directed weight perceptrons into existing Graph Neural Networks, showing superior versatility in main-taining SO(3) -equivariance at the global scale. Experiments show that our network has remark-ably better expressiveness in representing geometric relations in comparison to classical neural networks and the (globally) equivariant networks. It also achieves state-of-the-art performance on vari-ous computational biology applications related to protein 3D structures. All codes and models will be published upon acceptance.

...read moreread less

5 citations

Proceedings Article•DOI•

SpOT: Spatiotemporal Modeling for 3D Object Tracking

[...]

Colton Stearns, Davis Rempe, Jie Liu, Rares Ambrus, Sergey Zakharov, Vitor Guizilini, Yanchao Yang, Leonidas J. Guibas - Show less +4 more

12 Jul 2022

TL;DR: This work reformulate tracking as a spatiotemporal problem by representing tracked objects as sequences of time-stamped points and bounding boxes over a long temporal history, and develops a holistic representation of traffic scenes that leverages both spatial and temporal information of the actors in the scene.

...read moreread less

Abstract: 3D multi-object tracking aims to uniquely and consistently identify all mobile entities through time. Despite the rich spatiotemporal information available in this setting, current 3D tracking methods primarily rely on abstracted information and limited history, e.g. single-frame object bounding boxes. In this work, we develop a holistic representation of traffic scenes that leverages both spatial and temporal information of the actors in the scene. Specifically, we reformulate tracking as a spatiotemporal problem by representing tracked objects as sequences of time-stamped points and bounding boxes over a long temporal history. At each timestamp, we improve the location and motion estimates of our tracked objects through learned refinement over the full sequence of object history. By considering time and space jointly, our representation naturally encodes fundamental physical priors such as object permanence and consistency across time. Our spatiotemporal tracking framework achieves state-of-the-art performance on the Waymo and nuScenes benchmarks.

...read moreread less

Proceedings Article•DOI•

ADeLA: Automatic Dense Labeling with Attention for Viewpoint Shift in Semantic Segmentation

[...]

Hanxiang Ren, Yanchao Yang, He Wang, Bokui Shen, Qingnan Fan, Youyi Zheng, C. Karen Liu, Leonidas J. Guibas - Show less +4 more

01 Jun 2022

TL;DR: A method to deal with performance drop in semantic segmentation caused by viewpoint changes within multi-camera systems, where temporally paired images are readily available, but the annotations may only be abundant for a few typical views is described.

...read moreread less

Abstract: We describe a method to deal with performance drop in semantic segmentation caused by viewpoint changes within multi-camera systems, where temporally paired images are readily available, but the annotations may only be abundant for a few typical views. Existing methods alleviate performance drop via domain alignment in a shared space and assume that the mapping from the aligned space to the output is transferable. However, the novel content induced by viewpoint changes may nullify such a space for effective alignments, thus resulting in negative adaptation. Our method works without aligning any statistics of the images between the two domains. Instead, it utilizes a novel attention-based view transformation network trained only on color images to hallucinate the semantic images for the target. Despite the lack of supervision, the view transformation network can still generalize to semantic images thanks to the induced “information transport” bias. Furthermore, to resolve ambiguities in converting the semantic images to semantic labels, we treat the view transformation network as a functional representation of an unknown mapping implied by the color images and propose functional label hallucination to generate pseudo-labels with uncertainties in the target domains. Our method surpasses baselines built on state-of-the-art correspondence estimation and view synthesis methods. Moreover, it outperforms the state-of-the-art unsupervised domain adaptation methods that utilize self-training and adversarial domain alignments. Our code and dataset will be made publicly available.

...read moreread less

Journal Article•DOI•

MetaCLUE: Towards Comprehensive Visual Metaphors Research

[...]

Arjun R. Akula, Brenda S. Driscoll, Pradyumna Narayana, Soravit Changpinyo, Zhi-xuan Jia, Suyash Damle, Garima Pruthi, Sugato Basu, Leonidas J. Guibas, WilliamR. Freeman, Yuanzhen Li, Varun Jampani - Show less +8 more

19 Dec 2022-arXiv.org

TL;DR: MetaCLUE as mentioned in this paper is a set of vision tasks on visual metaphor, focusing on the visual metaphor comprehension of images, which is an indispensable part of human cognition and also an inherent part of how we make sense of the world.

...read moreread less

Abstract: Creativity is an indispensable part of human cognition and also an inherent part of how we make sense of the world. Metaphorical abstraction is fundamental in communicating creative ideas through nuanced relationships between abstract concepts such as feelings. While computer vision benchmarks and approaches predominantly focus on understanding and generating literal interpretations of images, metaphorical comprehension of images remains relatively unexplored. Towards this goal, we introduce MetaCLUE, a set of vision tasks on visual metaphor. We also collect high-quality and rich metaphor annotations (abstract objects, concepts, relationships along with their corresponding object boxes) as there do not exist any datasets that facilitate the evaluation of these tasks. We perform a comprehensive analysis of state-of-the-art models in vision and language based on our annotations, highlighting strengths and weaknesses of current approaches in visual metaphor Classification, Localization, Understanding (retrieval, question answering, captioning) and gEneration (text-to-image synthesis) tasks. We hope this work provides a concrete step towards developing AI systems with human-like creative capabilities.

...read moreread less

Proceedings Article•DOI•

LADIS: Language Disentanglement for 3D Shape Editing

[...]

Ian Yiran Huang, Panos Achlioptas, Tianyi Zhang, Sergey Tulyakov, Minhyuk Sung, Leonidas J. Guibas - Show less +2 more

09 Dec 2022

TL;DR: The authors propose to learn disentangled latent representations that ground language in 3D geometry to produce decoupled, local edits to 3D shapes, which is a promising direction for democratizing 3D shape design.

...read moreread less

Abstract: Natural language interaction is a promising direction for democratizing 3D shape design. However, existing methods for text-driven 3D shape editing face challenges in producing decoupled, local edits to 3D shapes. We address this problem by learning disentangled latent representations that ground language in 3D geometry. To this end, we propose a complementary tool set including a novel network architecture, a disentanglement loss, and a new editing procedure. Additionally, to measure edit locality, we define a new metric that we call part-wise edit precision. We show that our method outperforms existing SOTA methods by 20% in terms of edit locality, and up to 6.6% in terms of language reference resolution accuracy. Our work suggests that by solely disentangling language representations, downstream 3D shape editing can become more local to relevant parts, even if the model was never given explicit part-based supervision.

...read moreread less

Journal Article•DOI•

ALTO: Alternating Latent Topologies for Implicit 3D Reconstruction

[...]

Zhen Wang, Shijie Zhou, Jeong Joon Park, Despoina Paschalidou, Suya ok You, Gordon Wetzstein, Leonidas J. Guibas, Achuta Kadambi - Show less +4 more

08 Dec 2022-arXiv.org

TL;DR: al. as discussed by the authors proposed alternating latent topologies (ALTO) for high-fidelity reconstruction of implicit 3D surfaces from noisy point clouds, which sequentially alternate between geometric representations, before converging to an easy-to-decode latent.

...read moreread less

Abstract: This work introduces alternating latent topologies (ALTO) for high-fidelity reconstruction of implicit 3D surfaces from noisy point clouds. Previous work identifies that the spatial arrangement of latent encodings is important to recover detail. One school of thought is to encode a latent vector for each point (point latents). Another school of thought is to project point latents into a grid (grid latents) which could be a voxel grid or triplane grid. Each school of thought has tradeoffs. Grid latents are coarse and lose high-frequency detail. In contrast, point latents preserve detail. However, point latents are more difficult to decode into a surface, and quality and runtime suffer. In this paper, we propose ALTO to sequentially alternate between geometric representations, before converging to an easy-to-decode latent. We find that this preserves spatial expressiveness and makes decoding lightweight. We validate ALTO on implicit 3D recovery and observe not only a performance improvement over the state-of-the-art, but a runtime improvement of 3-10$\times$. Project website at https://visual.ee.ucla.edu/alto.htm/.

...read moreread less

Journal Article•DOI•

Affection: Learning Affective Explanations for Real-World Visual Data

[...]

Panos Achlioptas, Maks Ovsjanikov, Leonidas J. Guibas, Sergey Tulyakov

04 Oct 2022-arXiv.org

TL;DR: This work introduces and shares with the research community a large-scale dataset that contains emotional reactions and free-form textual explanations for 85,007 publicly available images, analyzed by 6,283 annotators who were asked to indicate and explain how and why they felt in a particular way when observing a particular image.

...read moreread less

Abstract: In this work, we explore the emotional reactions that real-world images tend to induce by using natural language as the medium to express the rationale behind an affective response to a given visual stimulus. To embark on this journey, we introduce and share with the research community a large-scale dataset that contains emotional reactions and free-form textual explanations for 85,007 publicly available images, analyzed by 6,283 annotators who were asked to indicate and explain how and why they felt in a particular way when observing a specific image, producing a total of 526,749 responses. Even though emotional reactions are subjective and sensitive to context (personal mood, social status, past experiences) - we show that there is significant common ground to capture potentially plausible emotional responses with a large support in the subject population. In light of this crucial observation, we ask the following questions: i) Can we develop multi-modal neural networks that provide reasonable affective responses to real-world visual data, explained with language? ii) Can we steer such methods towards producing explanations with varying degrees of pragmatic language or justifying different emotional reactions while adapting to the underlying visual stimulus? Finally, iii) How can we evaluate the performance of such methods for this novel task? With this work, we take the first steps in addressing all of these questions, thus paving the way for richer, more human-centric, and emotionally-aware image analysis systems. Our introduced dataset and all developed methods are available on https://affective-explanations.org

...read moreread less

Proceedings Article•DOI•

Fixing Malfunctional Objects With Learned Physical Simulation and Functional Prediction

[...]

Yi Hong, Kaichun Mo, Li Yi, Leonidas J. Guibas, Antonio Torralba, Joshua B. Tenenbaum, Chuang Gan - Show less +3 more

05 May 2022

TL;DR: This paper presents FixNet, a novel framework that seamlessly incorporates perception and physical dynamics, and shows that the framework outperforms baseline models by a large margin, and can generalize well to objects with similar interaction types.

...read moreread less

Abstract: This paper studies the problem of fixing malfunctional 3D objects. While previous works focus on building passive perception models to learn the functionality from static 3D objects, we argue that functionality is reckoned with respect to the physical interactions between the object and the user. Given a malfunctional object, humans can perform mental simulations to reason about its functionality and figure out how to fix it. Inspired by this, we propose FixIt, a dataset that contains about 5k poorly-designed 3D physical objects paired with choices to fix them. To mimic humans' mental simulation process, we present FixNet, a novel framework that seamlessly incorporates perception and physical dynamics. Specifically, FixNet consists of a perception module to extract the structured representation from the 3D point cloud, a physical dynamics prediction module to simulate the results of interactions on 3D objects, and a functionality prediction module to evaluate the functionality and choose the correct fix. Experimental results show that our framework outperforms baseline models by a large margin, and can generalize well to objects with similar interaction types. Code and dataset are publicly available11http://fixing-malfunctional.csail.mit.edu.

...read moreread less

Journal Article•DOI•

COPILOT: Human Collision Prediction and Localization from Multi-view Egocentric Videos

[...]

Boxiao Pan, Bokui Shen, Davis Rempe, Despoina Paschalidou, Kaichun Mo, Yanchao Yang, Leonidas J. Guibas - Show less +3 more

arXiv.org

TL;DR: The challenging and novel problem of predicting human-scene collisions for diverse environments from multi-view egocentric RGB videos captured from an exoskeleton is proposed and COPILOT, a video transformer-based model that performs both collision prediction and localization simultaneously, is proposed.

...read moreread less

Abstract: —To produce safe human motions, assistive wear- able exoskeletons must be equipped with a perception system that enables anticipating potential collisions from egocentric observations. However, previous approaches to exoskeleton perception greatly simplify the problem to speciﬁc types of en- vironments, limiting their scalability. In this paper, we propose the challenging and novel problem of predicting human-scene collisions for diverse environments from multi-view egocentric RGB videos captured from an exoskeleton. By classifying which body joints will collide with the environment and predicting a collision region heatmap that localizes potential collisions in the environment, we aim to develop an exoskeleton perception system that generalizes to complex real-world scenes and pro- vides actionable outputs for downstream control. We propose COPILOT, a video transformer-based model that performs both collision prediction and localization simultaneously, leveraging multi-view video inputs via a proposed joint space-time- viewpoint attention operation. To train and evaluate the model, we build a synthetic data generation framework to simulate virtual humans moving in photo-realistic 3D environments. This framework is then used to establish a dataset consisting of 8.6M egocentric RGBD frames to enable future work on the problem. Extensive experiments suggest that our model achieves promising performance and generalizes to unseen scenes as well as real world. We apply COPILOT to a downstream collision avoidance task, and successfully reduce collision cases by 29% on unseen scenes using a simple closed-loop control algorithm.

...read moreread less

COPILOT: Human-Environment Collision Prediction and Localization from Egocentric Videos

[...]

Boxiao Pan, Bokui Shen, Davis Rempe, Despoina Paschalidou, Kaichun Mo, Yanchao Yang, Leonidas J. Guibas - Show less +3 more

04 Oct 2022

TL;DR: In this article , a transformer-based model called COPILOT is proposed to perform collision prediction and localization simultaneously, which accumulates information across multi-view inputs through a novel 4D space-time-viewpoint attention mechanism.

...read moreread less

Abstract: The ability to forecast human-environment collisions from egocentric observations is vital to enable collision avoidance in applications such as VR, AR, and wearable assistive robotics. In this work, we introduce the challenging problem of predicting collisions in diverse environments from multi-view egocentric videos captured from body-mounted cameras. Solving this problem requires a generalizable perception system that can classify which human body joints will collide and estimate a collision region heatmap to localize collisions in the environment. To achieve this, we propose a transformer-based model called COPILOT to perform collision prediction and localization simultaneously, which accumulates information across multi-view inputs through a novel 4D space-time-viewpoint attention mechanism. To train our model and enable future research on this task, we develop a synthetic data generation framework that produces egocentric videos of virtual humans moving and colliding within diverse 3D environments. This framework is then used to establish a large-scale dataset consisting of 8.6M egocentric RGBD frames. Extensive experiments show that COPILOT generalizes to unseen synthetic as well as real-world scenes. We further demonstrate COPILOT outputs are useful for downstream collision avoidance through simple closed-loop control. Please visit our project webpage at https://sites.google.com/stanford.edu/copilot.

...read moreread less

Journal Article•DOI•

3DPointCaps++: Learning 3D Representations with Capsule Networks

[...]

Yongheng Zhao, Guangchi Fang, Yulan Guo, Leonidas J. Guibas, Federico Tombari, Tolga Birdal - Show less +2 more

30 Jul 2022-International Journal of Computer Vision

TL;DR: 3DPointCaps++ as mentioned in this paper uses deconvolution operators to reconstruct 3D points in a self-supervised manner and introduces a cluster loss ensuring that the points reconstructed by a single capsule remain local and do not spread across the object uncontrollably.

...read moreread less

Abstract: Abstract We present 3DPointCaps++ for learning robust, flexible and generalizable 3D object representations without requiring heavy annotation efforts or supervision. Unlike conventional 3D generative models, our algorithm aims for building a structured latent space where certain factors of shape variations, such as object parts, can be disentangled into independent sub-spaces. Our novel decoder then acts on these individual latent sub-spaces (i.e. capsules) using deconvolution operators to reconstruct 3D points in a self-supervised manner. We further introduce a cluster loss ensuring that the points reconstructed by a single capsule remain local and do not spread across the object uncontrollably. These contributions allow our network to tackle the challenging tasks of part segmentation, part interpolation/replacement as well as correspondence estimation across rigid / non-rigid shape, and across / within category. Our extensive evaluations on ShapeNet objects and human scans demonstrate that our network can learn generic representations that are robust and useful in many applications.

...read moreread less

Journal Article•DOI•

Breaking the Symmetry: Resolving Symmetry Ambiguities in Equivariant Neural Networks

[...]

S. Balachandar, Adrien Poulenard, Congyue Deng, Leonidas J. Guibas

29 Oct 2022-arXiv.org

TL;DR: Orientation Aware Vector Neuron Network (OAVNN) as mentioned in this paper is an extension of the vector neuron network that is robust to planar symmetric inputs, which is a rotation equivariant network.

...read moreread less

Abstract: Equivariant networks have been adopted in many 3-D learning areas. Here we identify a fundamental limitation of these networks: their ambiguity to symmetries. Equivariant networks cannot complete symmetry-dependent tasks like segmenting a left-right symmetric object into its left and right sides. We tackle this problem by adding components that resolve symmetry ambiguities while preserving rotational equivariance. We present OAVNN: Orientation Aware Vector Neuron Network, an extension of the Vector Neuron Network (Deng et al., 2021). OAVNN is a rotation equivariant network that is robust to planar symmetric inputs. Our network consists of three key components. 1) We introduce an algorithm to calculate symmetry detecting features. 2) We create a symmetry-sensitive orientation aware linear layer. 3) We construct an attention mechanism that relates directional information across points. We evaluate the network using left-right segmentation and ﬁnd that the network quickly obtains accurate segmentations. We hope this work motivates investigations on the expressivity of equivariant networks on symmetric objects.

...read moreread less

Journal Article•DOI•

6D Camera Relocalization in Visually Ambiguous Extreme Environments

[...]

Yang Zheng, Tolga Birdal, Yanchao Yang, Yu-Kai Duan, Leonidas J. Guibas - Show less +1 more

13 Jul 2022-arXiv.org

TL;DR: A hierarchical localization system, where the pose of a camera is estimated given a sequence of images acquired in extreme environments such as deep seas or extraterrestrial terrains, and a novel environment-aware image enhancement method to boost the robustness and accuracy.

...read moreread less

Abstract: We propose a novel method to reliably estimate the pose of a camera given a sequence of images acquired in extreme environments such as deep seas or extraterrestrial terrains. Data acquired under these challenging conditions are corrupted by textureless surfaces, image degradation, and presence of repetitive and highly ambiguous structures. When naively deployed, the state-of-the-art methods can fail in those scenarios as confirmed by our empirical analysis. In this paper, we attempt to make camera relocalization work in these extreme situations. To this end, we propose: (i) a hierarchical localization system, where we leverage temporal information and (ii) a novel environment-aware image enhancement method to boost the robustness and accuracy. Our extensive experimental results demonstrate superior performance in favor of our method under two extreme settings: localizing an autonomous underwater vehicle and localizing a planetary rover in a Mars-like desert. In addition, our method achieves comparable performance with state-of-the-art methods on the indoor benchmark (7-Scenes dataset) using only 20% training data.

...read moreread less

Journal Article•DOI•

Equivalence Between SE(3) Equivariant Networks via Steerable Kernels and Group Convolution

[...]

Adrien Poulenard, Maks Ovsjanikov, Leonidas J. Guibas

29 Nov 2022-arXiv.org

TL;DR: In this article , the equivalence of steerable convolution and the Fourier transform of the group convolution is discussed. But the equivalences between these two methods are not widely known and moreover the exact relations between deep learning architectures built upon these two approaches have not been precisely described.

...read moreread less

Abstract: A wide range of techniques have been proposed in recent years for designing neural networks for 3D data that are equivariant under rotation and translation of the input. Most approaches for equivariance under the Euclidean group $\mathrm{SE}(3)$ of rotations and translations fall within one of the two major categories. The first category consists of methods that use $\mathrm{SE}(3)$-convolution which generalizes classical $\mathbb{R}^3$-convolution on signals over $\mathrm{SE}(3)$. Alternatively, it is possible to use \textit{steerable convolution} which achieves $\mathrm{SE}(3)$-equivariance by imposing constraints on $\mathbb{R}^3$-convolution of tensor fields. It is known by specialists in the field that the two approaches are equivalent, with steerable convolution being the Fourier transform of $\mathrm{SE}(3)$ convolution. Unfortunately, these results are not widely known and moreover the exact relations between deep learning architectures built upon these two approaches have not been precisely described in the literature on equivariant deep learning. In this work we provide an in-depth analysis of both methods and their equivalence and relate the two constructions to multiview convolutional networks. Furthermore, we provide theoretical justifications of separability of $\mathrm{SE}(3)$ group convolution, which explain the applicability and success of some recent approaches. Finally, we express different methods using a single coherent formalism and provide explicit formulas that relate the kernels learned by different methods. In this way, our work helps to unify different previously-proposed techniques for achieving roto-translational equivariance, and helps to shed light on both the utility and precise differences between various alternatives. We also derive new TFN non-linearities from our equivalence principle and test them on practical benchmark datasets.

...read moreread less

Journal Article•DOI•

3DPointCaps++: Learning 3D Representations with Capsule Networks

[...]

Yongheng Zhao, Guangchi Fang, Yulan Guo, Leonidas J. Guibas, Federico Tombari, Tolga Birdal - Show less +2 more

01 Sep 2022-International Journal of Computer Vision

TL;DR: 3DPointCaps++ as discussed by the authors uses deconvolution operators to reconstruct 3D points in a self-supervised manner and introduces a cluster loss ensuring that the points reconstructed by a single capsule remain local and do not spread across the object uncontrollably.

...read moreread less