Showing papers by "Carl Vondrick published in 2021"

PDF

Open Access

Proceedings Article•DOI•

Learning the Predictability of the Future

[...]

Dídac Surís¹, Ruoshi Liu¹, Carl Vondrick¹•Institutions (1)

01 Jun 2021

TL;DR: This work proposes a predictive model in hyperbolic space that will predict at a concrete level of the hierarchy, but when the model is not confident, it learns to automatically select a higher level of abstraction.

...read moreread less

Abstract: We introduce a framework for learning from unlabeled video what is predictable in the future. Instead of committing up front to features to predict, our approach learns from data which features are predictable. Based on the observation that hyperbolic geometry naturally and compactly encodes hierarchical structure, we propose a predictive model in hyperbolic space. When the model is most confident, it will predict at a concrete level of the hierarchy, but when the model is not confident, it learns to automatically select a higher level of abstraction. Experiments on two established datasets show the key role of hierarchical representations for action prediction. Although our representation is trained with unlabeled video, visualizations show that action hierarchies emerge in the representation.

...read moreread less

47 citations

Proceedings Article•DOI•

Generative Interventions for Causal Learning

[...]

Chengzhi Mao¹, Augustine Cha¹, Amogh Gupta¹, Hao Wang², Junfeng Yang¹, Carl Vondrick¹ - Show less +2 more•Institutions (2)

Columbia University¹, Rutgers University²

01 Jun 2021

TL;DR: The authors introduce a framework for learning robust visual representations that generalize to new viewpoints, backgrounds, and scene contexts, and demonstrate state-of-the-art performance generalizing from ImageNet to ObjectNet dataset.

...read moreread less

Abstract: We introduce a framework for learning robust visual representations that generalize to new viewpoints, backgrounds, and scene contexts. Discriminative models often learn naturally occurring spurious correlations, which cause them to fail on images outside of the training distribution. In this paper, we show that we can steer generative models to manufacture interventions on features caused by confounding factors. Experiments, visualizations, and theoretical results show this method learns robust representations more consistent with the underlying causal relationships. Our approach improves performance on multiple datasets demanding out-of-distribution generalization, and we demonstrate state-of-the-art performance generalizing from ImageNet to ObjectNet dataset.

...read moreread less

32 citations

Journal Article•DOI•

Visual behavior modelling for robotic theory of mind.

[...]

Boyuan Chen¹, Carl Vondrick¹, Hod Lipson¹•Institutions (1)

Columbia University¹

11 Jan 2021-Scientific Reports

TL;DR: In this paper, an observer can model the behavior of an actor through visual processing alone, without any prior symbolic information and assumptions about relevant inputs, which is an essential cognitive ability that underlies many aspects of human and animal social behavior.

...read moreread less

Abstract: Behavior modeling is an essential cognitive ability that underlies many aspects of human and animal social behavior (Watson in Psychol Rev 20:158, 1913), and an ability we would like to endow robots. Most studies of machine behavior modelling, however, rely on symbolic or selected parametric sensory inputs and built-in knowledge relevant to a given task. Here, we propose that an observer can model the behavior of an actor through visual processing alone, without any prior symbolic information and assumptions about relevant inputs. To test this hypothesis, we designed a non-verbal non-symbolic robotic experiment in which an observer must visualize future plans of an actor robot, based only on an image depicting the initial scene of the actor robot. We found that an AI-observer is able to visualize the future plans of the actor with 98.5% success across four different activities, even when the activity is not known a-priori. We hypothesize that such visual behavior modeling is an essential cognitive ability that will allow machines to understand and coordinate with surrounding agents, while sidestepping the notorious symbol grounding problem. Through a false-belief test, we suggest that this approach may be a precursor to Theory of Mind, one of the distinguishing hallmarks of primate social cognition.

...read moreread less

11 citations

Journal Article•

Globetrotter: Unsupervised Multilingual Translation from Visual Alignment

[...]

Didac Suris Coll-Vinent¹, Dave Epstein², Carl Vondrick¹•Institutions (2)

Columbia University¹, University of California, Berkeley²

04 May 2021-arXiv: Computation and Language

TL;DR: A framework that instead uses the visual modality to align multiple languages, using images as the bridge between them, and estimates the cross-modal alignment between language and images, and uses this estimate to guide the learning of cross-lingual representations.

...read moreread less

Abstract: Machine translation in a multi-language scenario requires large-scale parallel corpora for every language pair. Unsupervised translation is challenging because there is no explicit connection between languages, and the existing methods have to rely on topological properties of the language representations. We introduce a framework that leverages visual similarity to align multiple languages, using images as the bridge between them. We estimate the cross-modal alignment between language and images, and use this estimate to guide the learning of cross-lingual representations. Our language representations are trained jointly in one model with a single stage. Experiments with fifty-two languages show that our method outperforms prior work on unsupervised word-level and sentence-level translation using retrieval.

...read moreread less

9 citations

Proceedings Article•

Towards a Unifying Framework for Formal Theories of Novelty.

[...]

Terrance E. Boult¹, Przemyslaw A. Grabowicz, Derek S. Prijatelj², Roni Stern, Lawrence B. Holder³, Joshua Alspector, Mohsen Jafarzadeh¹, Touqeer Ahmad¹, Akshay Raj Dhamija¹, Chunchun Li¹, Steve Cruz¹, Abhinav Shrivastava, Carl Vondrick⁴, Walter J. Scheirer² - Show less +10 more•Institutions (4)

University of Colorado Colorado Springs¹, University of Notre Dame², Washington State University³, Columbia University⁴

18 May 2021

TL;DR: In this article, the authors present a unified framework for formal theories of novelty and use the framework to formally define a family of novelty types, which can be applied across a wide range of domains, from symbolic AI to reinforcement learning, and beyond to open world image recognition.

...read moreread less

Abstract: Managing inputs that are novel, unknown, or out-of-distribution is critical as an agent moves from the lab to the open world. Novelty-related problems include being tolerant to novel perturbations of the normal input, detecting when the input includes novel items, and adapting to novel inputs. While significant research has been undertaken in these areas, a noticeable gap exists in the lack of a formalized definition of novelty that transcends problem domains. As a team of researchers spanning multiple research groups and different domains, we have seen, first hand, the difficulties that arise from ill-specified novelty problems, as well as inconsistent definitions and terminology. Therefore, we present the first unified framework for formal theories of novelty and use the framework to formally define a family of novelty types. Our framework can be applied across a wide range of domains, from symbolic AI to reinforcement learning, and beyond to open world image recognition. Thus, it can be used to help kick-start new research efforts and accelerate ongoing work on these important novelty-related problems.

...read moreread less

6 citations

Proceedings Article•DOI•

RESIN: A Dockerized Schema-Guided Cross-document Cross-lingual Cross-media Information Extraction and Event Tracking System

[...]

Haoyang Wen¹, Ying Lin², Tuan Lai, Xiaoman Pan³, Sha Li⁴, Xudong Lin⁵, Ben Zhou⁶, Manling Li², Haoyu Wang⁷, Hongming Zhang⁸, Xiaodong Yu⁴, Alexander Dong, Zhenhailong Wang⁹, Yi Fung, Piyush Mishra, Qing Lyu⁷, Dídac Surís⁵, Brian Chen⁵, Susan Brown¹⁰, Martha Palmer¹⁰, Chris Callison-Burch⁷, Carl Vondrick⁵, Jiawei Han⁴, Dan Roth⁷, Shih-Fu Chang⁵, Heng Ji⁴ - Show less +22 more•Institutions (10)

Harbin Institute of Technology¹, Rensselaer Polytechnic Institute², Tencent³, University of Illinois at Urbana–Champaign⁴, Columbia University⁵, Allen Institute for Artificial Intelligence⁶, University of Pennsylvania⁷, Hong Kong University of Science and Technology⁸, Zhejiang University⁹, University of Colorado Boulder¹⁰

01 Jun 2021

TL;DR: This article presented a new information extraction system that can automatically construct temporal event graphs from a collection of news documents from multiple sources, multiple languages (English and Spanish for their experiment), and multiple data modalities (speech, text, image and video).

...read moreread less

Abstract: We present a new information extraction system that can automatically construct temporal event graphs from a collection of news documents from multiple sources, multiple languages (English and Spanish for our experiment), and multiple data modalities (speech, text, image and video). The system advances state-of-the-art from two aspects: (1) extending from sentence-level event extraction to cross-document cross-lingual cross-media event extraction, coreference resolution and temporal event tracking; (2) using human curated event schema library to match and enhance the extraction output. We have made the dockerlized system publicly available for research purpose at GitHub, with a demo video.

...read moreread less

5 citations

Proceedings Article•DOI•

Learning Goals from Failure

[...]

Dave Epstein¹, Carl Vondrick¹•Institutions (1)

Columbia University¹

01 Jun 2021

TL;DR: In this paper, a framework that predicts the goals behind observable human action in video is introduced. But the model is trained with minimal supervision and it is not able to predict the underlying goals in video of unintentional action.

...read moreread less

Abstract: We introduce a framework that predicts the goals behind observable human action in video. Motivated by evidence in developmental psychology, we leverage video of unintentional action to learn video representations of goals without direct supervision. Our approach models videos as contextual trajectories that represent both low-level motion and high-level action features. Experiments and visualizations show our trained model is able to predict the underlying goals in video of unintentional action. We also propose a method to "automatically correct" unintentional action by leveraging gradient signals of our model to adjust latent trajectories. Although the model is trained with minimal supervision, it is competitive with or outperforms baselines trained on large (supervised) datasets of successfully executed goals, showing that observing unintentional action is crucial to learning about goals in video.

...read moreread less

4 citations

Posted Content•

Adversarial Attacks are Reversible with Natural Supervision

[...]

Chengzhi Mao¹, Mia Chiquier, Hao Wang², Junfeng Yang¹, Carl Vondrick¹ - Show less +1 more•Institutions (2)

Columbia University¹, Chinese Academy of Sciences²

26 Mar 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, the authors demonstrate that modifying the attacked image to restore the natural structure will reverse many types of attacks, providing a defense against adversarial examples, and demonstrate that their defense is still effective even if the attacker is aware of the defense mechanism.

...read moreread less

Abstract: We find that images contain intrinsic structure that enables the reversal of many adversarial attacks. Attack vectors cause not only image classifiers to fail, but also collaterally disrupt incidental structure in the image. We demonstrate that modifying the attacked image to restore the natural structure will reverse many types of attacks, providing a defense. Experiments demonstrate significantly improved robustness for several state-of-the-art models across the CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets. Our results show that our defense is still effective even if the attacker is aware of the defense mechanism. Since our defense is deployed during inference instead of training, it is compatible with pre-trained networks as well as most other defenses. Our results suggest deep networks are vulnerable to adversarial examples partly because their representations do not enforce the natural structure of images.

...read moreread less

3 citations

Posted Content•

Learning the Predictability of the Future.

[...]

Dídac Surís¹, Ruoshi Liu¹, Carl Vondrick¹•Institutions (1)

Columbia University¹

01 Jan 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, the authors propose a predictive model in hyperbolic space to learn from unlabeled video what is predictable in the future, instead of committing up front to features to predict, they learn from data which features are predictable.

...read moreread less

2 citations

Posted Content•

The Boombox: Visual Reconstruction from Acoustic Vibrations.

[...]

Boyuan Chen, Mia Chiquier, Hod Lipson, Carl Vondrick

17 May 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: The Boombox as mentioned in this paper uses acoustic vibrations to reconstruct an image of its inside contents, which can be used in human-computer interaction and robotics applications, such as human-robot interaction.

...read moreread less

Abstract: We introduce The Boombox, a container that uses acoustic vibrations to reconstruct an image of its inside contents. When an object interacts with the container, they produce small acoustic vibrations. The exact vibration characteristics depend on the physical properties of the box and the object. We demonstrate how to use this incidental signal in order to predict visual structure. After learning, our approach remains effective even when a camera cannot view inside the box. Although we use low-cost and low-power contact microphones to detect the vibrations, our results show that learning from multi-modal data enables us to transform cheap acoustic sensors into rich visual sensors. Due to the ubiquity of containers, we believe integrating perception capabilities into them will enable new applications in human-computer interaction and robotics. Our project website is at: this http URL

...read moreread less

1 citations

Posted Content•

Full-Body Visual Self-Modeling of Robot Morphologies

[...]

Boyuan Chen¹, Robert Kwiatkowski, Carl Vondrick, Hod Lipson•Institutions (1)

Columbia University¹

11 Nov 2021-arXiv: Robotics

TL;DR: In this paper, the authors propose a query-driven self-modeling approach to answer space occupancy queries, conditioned on the robot's state, which is continuous in the spatial domain, memory efficient, fully differentiable and kinematic aware.

...read moreread less

Abstract: Internal computational models of physical bodies are fundamental to the ability of robots and animals alike to plan and control their actions. These "self-models" allow robots to consider outcomes of multiple possible future actions, without trying them out in physical reality. Recent progress in fully data-driven self-modeling has enabled machines to learn their own forward kinematics directly from task-agnostic interaction data. However, forward-kinema\-tics models can only predict limited aspects of the morphology, such as the position of end effectors or velocity of joints and masses. A key challenge is to model the entire morphology and kinematics, without prior knowledge of what aspects of the morphology will be relevant to future tasks. Here, we propose that instead of directly modeling forward-kinematics, a more useful form of self-modeling is one that could answer space occupancy queries, conditioned on the robot's state. Such query-driven self models are continuous in the spatial domain, memory efficient, fully differentiable and kinematic aware. In physical experiments, we demonstrate how a visual self-model is accurate to about one percent of the workspace, enabling the robot to perform various motion planning and control tasks. Visual self-modeling can also allow the robot to detect, localize and recover from real-world damage, leading to improved machine resiliency. Our project website is at: https://robot-morphology.cs.columbia.edu/

...read moreread less

Posted Content•

Discrete Representations Strengthen Vision Transformer Robustness

[...]

Chengzhi Mao, Lu Jiang, Mostafa Dehghani, Carl Vondrick, Rahul Sukthankar, Irfan Essa - Show less +2 more

20 Nov 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this paper, the authors propose to add discrete tokens produced by a vector-quantized encoder to the input layer of the vision transformer to improve the robustness of the transform.

...read moreread less

Abstract: Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition. While recent studies suggest that ViTs are more robust than their convolutional counterparts, our experiments find that ViTs are overly reliant on local features (e.g., nuisances and texture) and fail to make adequate use of global context (e.g., shape and structure). As a result, ViTs fail to generalize to out-of-distribution, real-world data. To address this deficiency, we present a simple and effective architecture modification to ViT's input layer by adding discrete tokens produced by a vector-quantized encoder. Different from the standard continuous pixel tokens, discrete tokens are invariant under small perturbations and contain less information individually, which promote ViTs to learn global information that is invariant. Experimental results demonstrate that adding discrete representation on four architecture variants strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks while maintaining the performance on ImageNet.

...read moreread less