
Showing papers by "Ali Farhadi published in 2021"


Journal Article
TL;DR: Video Noise Contrastive Estimation is proposed, a method for using unlabeled video to learn strong, transferable single image representations that demonstrate improvements over recent unsupervised single image techniques, as well as over fully supervised ImageNet pretraining, across a variety of temporal and non-temporal tasks.
Abstract: Recent unsupervised representation learning techniques show remarkable success on many single image tasks by using instance discrimination: learning to differentiate between two augmented versions of the same image and a large batch of unrelated images. Prior work uses artificial data augmentation techniques such as cropping and color jitter, which can only affect the image in superficial ways and are not aligned with how objects actually change, e.g. occlusion, deformation, and viewpoint change. We argue that videos offer this natural augmentation for free. Videos can provide entirely new views of objects, show deformation, and even connect semantically similar but visually distinct concepts. We propose Video Noise Contrastive Estimation, a method for using unlabeled video to learn strong, transferable, single image representations. We demonstrate improvements over recent unsupervised single image techniques, as well as over fully supervised ImageNet pretraining, across temporal and non-temporal tasks.
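
The instance-discrimination idea above lends itself to a compact illustration. Below is a minimal PyTorch sketch of an NCE-style contrastive loss where two frames of the same video serve as the positive pair and the other videos in the batch act as negatives; the function name, temperature, and embedding sizes are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def video_nce_loss(frame_a_emb, frame_b_emb, temperature=0.07):
    """InfoNCE-style loss: two frames of the same video are positives,
    every other video in the batch is a negative. Embeddings: (B, D)."""
    a = F.normalize(frame_a_emb, dim=1)
    b = F.normalize(frame_b_emb, dim=1)
    logits = a @ b.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # diagonal entries are positives
    return F.cross_entropy(logits, targets)

# usage: embeddings from any image encoder applied to two frames per video
emb_a = torch.randn(32, 128)
emb_b = torch.randn(32, 128)
loss = video_nce_loss(emb_a, emb_b)
```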

68 citations


Proceedings ArticleDOI
01 Aug 2021
TL;DR: A model that learns physical commonsense knowledge through interaction, and then uses this knowledge to ground language, and is able to correctly forecast “what happens next” given an English sentence over 80% of the time, outperforming a 100x larger, text-to-text approach by over 10%.
Abstract: We propose PIGLeT: a model that learns physical commonsense knowledge through interaction, and then uses this knowledge to ground language. We factorize PIGLeT into a physical dynamics model, and a separate language model. Our dynamics model learns not just what objects are but also what they do: glass cups break when thrown, plastic ones don’t. We then use it as the interface to our language model, giving us a unified model of linguistic form and grounded meaning. PIGLeT can read a sentence, simulate neurally what might happen next, and then communicate that result through a literal symbolic representation, or natural language. Experimental results show that our model effectively learns world dynamics, along with how to communicate them. It is able to correctly forecast what happens next given an English sentence over 80% of the time, outperforming a 100x larger, text-to-text approach by over 10%. Likewise, its natural language summaries of physical interactions are also judged by humans as more accurate than LM alternatives. We present comprehensive analysis showing room for future work.
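
To make the factorization concrete, here is a hedged PyTorch sketch of the role the physical dynamics model plays: mapping an object's attributes plus an action to the object's next attributes (e.g., a "broken" flag flipping for a thrown glass cup). The class name, attribute layout, and dimensions are assumptions for illustration only, not PIGLeT's actual architecture.

```python
import torch
import torch.nn as nn

class SymbolicDynamicsModel(nn.Module):
    """Illustrative stand-in for a dynamics model of the kind described above:
    given an object's attribute vector and an action embedding, predict logits
    over the object's next-step attributes. Sizes are assumptions."""
    def __init__(self, attr_dim=32, action_dim=16, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(attr_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, attr_dim),  # logits over next-step attributes
        )

    def forward(self, object_attrs, action_emb):
        return self.net(torch.cat([object_attrs, action_emb], dim=-1))

# usage with random placeholder inputs
model = SymbolicDynamicsModel()
next_attr_logits = model(torch.randn(1, 32), torch.randn(1, 16))
```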

37 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: Empirical results show that today’s models struggle at TuringAdvice, even multibillion parameter models finetuned on 600k in-domain training examples, and this low performance reveals language understanding errors that are hard to spot outside of a generative setting.
Abstract: We propose TuringAdvice, a new challenge task and dataset for language understanding models. Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language. Our evaluation framework tests a fundamental aspect of human language understanding: our ability to use language to resolve open-ended situations by communicating with each other. Empirical results show that today’s models struggle at TuringAdvice, even multibillion parameter models finetuned on 600k in-domain training examples. The best model, T5, writes advice that is at least as helpful as human-written advice in only 14% of cases; a much larger non-finetunable GPT3 model does even worse at 4%. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.

18 citations


Posted Content
TL;DR: Weight-space ensembles, formed by ensembling the weights of the zero-shot and fine-tuned models as proposed in this paper, provide large accuracy improvements out-of-distribution while matching or improving in-distribution accuracy.
Abstract: Large pre-trained models such as CLIP offer consistent accuracy across a range of data distributions when performing zero-shot inference (i.e., without fine-tuning on a specific dataset). Although existing fine-tuning approaches substantially improve accuracy in-distribution, they also reduce out-of-distribution robustness. We address this tension by introducing a simple and effective method for improving robustness: ensembling the weights of the zero-shot and fine-tuned models. Compared to standard fine-tuning, the resulting weight-space ensembles provide large accuracy improvements out-of-distribution, while matching or improving in-distribution accuracy. On ImageNet and five derived distribution shifts, weight-space ensembles improve out-of-distribution accuracy by 2 to 10 percentage points while increasing in-distribution accuracy by nearly 1 percentage point relative to standard fine-tuning. These improvements come at no additional computational cost during fine-tuning or inference.
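
The core operation, interpolating the zero-shot and fine-tuned weights, is simple enough to sketch. The snippet below is a minimal PyTorch illustration assuming both models share the same architecture; the function name and the mixing coefficient are placeholders rather than values prescribed by the paper.

```python
import copy
import torch

def weight_space_ensemble(zero_shot_model, finetuned_model, alpha=0.5):
    """Interpolate parameters: theta = (1 - alpha) * theta_zeroshot + alpha * theta_finetuned.
    Non-floating-point buffers are copied from the zero-shot model unchanged."""
    ensembled = copy.deepcopy(zero_shot_model)
    zs_state = zero_shot_model.state_dict()
    ft_state = finetuned_model.state_dict()
    merged = {
        k: (1 - alpha) * zs_state[k] + alpha * ft_state[k]
        if zs_state[k].dtype.is_floating_point else zs_state[k]
        for k in zs_state
    }
    ensembled.load_state_dict(merged)
    return ensembled

# e.g. (hypothetical model names): robust = weight_space_ensemble(clip_zero_shot, clip_finetuned, alpha=0.5)
```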

15 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: Zeng et al. introduce the Neural Interaction Engine (NIE) to explicitly predict the change in the environment caused by the agent's actions, and find that agents which model these changes while planning exhibit significant improvements in their navigational capabilities.
Abstract: We have observed significant progress in visual navigation for embodied agents. A common assumption in studying visual navigation is that the environments are static; this is a limiting assumption. Intelligent navigation may involve interacting with the environment beyond just moving forward/backward and turning left/right. Sometimes, the best way to navigate is to push something out of the way. In this paper, we study the problem of interactive navigation, where agents learn to change the environment to navigate more efficiently to their goals. To this end, we introduce the Neural Interaction Engine (NIE) to explicitly predict the change in the environment caused by the agent’s actions. By modeling the changes while planning, we find that agents exhibit significant improvements in their navigational capabilities. More specifically, we consider two downstream tasks in the physics-enabled, visually rich AI2-THOR environment: (1) reaching a target while the path to the target is blocked, and (2) moving an object to a target location by pushing it. For both tasks, agents equipped with an NIE significantly outperform agents without an understanding of the effects of their actions, indicating the benefits of our approach. The code and dataset are available at github.com/KuoHaoZeng/Interactive_Visual_Navigation.
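
As a rough illustration of predicting action-conditioned environment change, the sketch below maps an object's visual features and a discrete action to a predicted change in the object's position. It is a hedged stand-in with assumed sizes and outputs, not the NIE architecture itself.

```python
import torch
import torch.nn as nn

class InteractionForwardModel(nn.Module):
    """Illustrative forward model: given visual features of an object and a
    one-hot action, predict the change in the object's state (here, a 3-D
    translation). Dimensions and heads are assumptions, not the paper's."""
    def __init__(self, feat_dim=512, num_actions=6, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + num_actions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 3),  # predicted (dx, dy, dz) of the object
        )

    def forward(self, object_features, action_onehot):
        return self.net(torch.cat([object_features, action_onehot], dim=-1))

# usage: features from the agent's visual encoder plus the chosen action
model = InteractionForwardModel()
delta = model(torch.randn(1, 512), torch.eye(6)[0:1])
```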

11 citations


04 May 2021
TL;DR: A unified learning and evaluation framework, In The Wild (NED), is introduced; it is designed to be a more general paradigm that loosens the restrictive design decisions of past settings and imposes fewer restrictions on learning algorithms.
Abstract: Enabling robust intelligence in the wild entails learning systems that offer uninterrupted inference while affording sustained learning from varying amounts of data and supervision. Such ML systems must be able to cope with the openness and variability inherent to the real world. The machine learning community has organically broken down this challenging task into manageable subtasks such as supervised, few-shot, continual, and self-supervised learning, each affording distinct challenges and a unique set of methods. Notwithstanding this remarkable progress, the simplified and isolated nature of these experimental setups has resulted in methods that excel in their specific settings, but struggle to generalize beyond them. To foster research towards more general ML systems, we present a new learning and evaluation framework, In The Wild (NED). NED naturally integrates the objectives of previous frameworks while removing many of the overly strong assumptions such as predefined training and test phases, sufficient labeled data for every class, and the closed-world assumption. In NED, a learner faces a stream of data and must make sequential predictions while choosing how to update itself, adapt quickly to novel classes, and deal with changing data distributions, all while optimizing for the total amount of compute. We present novel insights from NED that contradict the findings of less realistic or smaller-scale experiments, emphasizing the need to move towards more pragmatic setups. For example, we show that meta-training causes larger networks to overfit in a way that supervised training does not, few-shot methods break down outside of their narrow experimental setting, and the self-supervised method MoCo performs significantly worse when the downstream task contains new and old classes. Additionally, we present two new pragmatic methods (Exemplar Tuning and Minimum Distance Thresholding) that significantly outperform all other methods evaluated in NED.
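
The predict-then-update protocol described above can be summarized in a short loop: the learner must answer on every incoming example before seeing its label, and only then decides whether to spend compute on updating itself. The sketch below is an assumed simplification of that setup; `learner`, `stream`, and the budget accounting are illustrative placeholders, not NED's exact interface.

```python
def run_stream(learner, stream, compute_budget):
    """Sketch of a streaming evaluation loop: the learner predicts on each
    incoming example *before* seeing its label, then may choose whether to
    spend compute updating itself. All names here are assumptions."""
    correct, seen, spent = 0, 0, 0
    for x, y in stream:                      # data arrives sequentially
        pred = learner.predict(x)            # inference is never interrupted
        correct += int(pred == y)
        seen += 1
        if spent < compute_budget and learner.wants_update(x, y):
            spent += learner.update(x, y)    # returns compute used by the update
    return correct / max(seen, 1)
```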

10 citations


Posted Content
TL;DR: This paper proposes MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech in an entirely label-free, self-supervised manner, pretraining with a mix of both frame-level and video-level objectives.
Abstract: As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future. We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech -- in an entirely label-free, self-supervised manner. By pretraining with a mix of both frame-level (spatial) and video-level (temporal) objectives, our model not only learns to match images to temporally corresponding words, but also to contextualize what is happening globally over time. As a result, MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets when finetuned. It also transfers well to the world of static images, allowing models to reason about the dynamic context behind visual scenes. On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%, even those that make heavy use of auxiliary supervised data (like object bounding boxes). Ablation analyses demonstrate the complementary importance of: 1) training on videos versus static images; 2) scaling the magnitude and diversity of the pretraining video corpus; and 3) using diverse objectives that encourage full-stack multimodal reasoning, from the recognition to cognition level.

7 citations


Proceedings Article
03 May 2021
TL;DR: This article showed that embodied adversarial reinforcement learning agents playing Cache, a variant of hide-and-seek, in a high-fidelity, interactive environment, learn generalizable representations of their observations encoding information such as object permanence, free space, and containment.
Abstract: A growing body of research suggests that embodied gameplay, prevalent not just in human cultures but across a variety of animal species including turtles and ravens, is critical in developing the neural flexibility for creative problem solving, decision making, and socialization. Comparatively little is known regarding the impact of embodied gameplay upon artificial agents. While recent work has produced agents proficient in abstract games, these environments are far removed from the real world and thus these agents can provide little insight into the advantages of embodied play. Hiding games, such as hide-and-seek, played universally, provide a rich ground for studying the impact of embodied gameplay on representation learning in the context of perspective taking, secret keeping, and false belief understanding. Here we are the first to show that embodied adversarial reinforcement learning agents playing Cache, a variant of hide-and-seek, in a high-fidelity, interactive environment, learn generalizable representations of their observations encoding information such as object permanence, free space, and containment. Moving closer to biologically motivated learning strategies, our agents' representations, enhanced by intentionality and memory, are developed through interaction and play. These results serve as a model for studying how facets of vision develop through interaction, provide an experimental framework for assessing what is learned by artificial agents, and demonstrate the value of moving from large, static datasets towards experiential, interactive representation learning.

5 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: A probing model is designed that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations, and shows that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Abstract: The success of large-scale contextual language models has attracted great interest in probing what is encoded in their representations. In this work, we consider a new question: to what extent are contextual representations of concrete nouns aligned with corresponding visual representations? We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations. Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories. Moreover, they are effective in retrieving specific instances of image patches; textual context plays an important role in this process. Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans. We hope our analyses inspire future research in understanding and improving the visual capabilities of language models.
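
A probe of this kind can be illustrated in a few lines: a learned projection maps a frozen text representation into the space of frozen image-patch features, and retrieval picks the highest-similarity patch. The sketch below uses assumed dimensions and a cosine-similarity scorer; it is not the paper's exact probing architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextToPatchProbe(nn.Module):
    """Minimal probe sketch: a linear map from frozen text features into the
    space of frozen image-patch features, scored by cosine similarity.
    Dimensions are illustrative assumptions."""
    def __init__(self, text_dim=768, patch_dim=512):
        super().__init__()
        self.proj = nn.Linear(text_dim, patch_dim)

    def forward(self, noun_embedding, patch_embeddings):
        q = F.normalize(self.proj(noun_embedding), dim=-1)   # (D,)
        p = F.normalize(patch_embeddings, dim=-1)            # (N, D)
        return p @ q                                         # similarity score per patch

# retrieval: pick the patch with the highest score for a given noun
probe = TextToPatchProbe()
scores = probe(torch.randn(768), torch.randn(100, 512))
best_patch = scores.argmax().item()
```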

5 citations


Proceedings Article
18 Jul 2021
TL;DR: In this paper, a single method is proposed that, in a single training run, learns lines, curves, and simplexes of high-accuracy neural networks; the networks in these subspaces can be ensembled, approaching the ensemble performance of independently trained networks without the training cost.
Abstract: Recent observations have advanced our understanding of the neural network optimization landscape, revealing the existence of (1) paths of high accuracy containing diverse solutions and (2) wider minima offering improved performance. Previous methods observing diverse paths require multiple training runs. In contrast we aim to leverage both property (1) and (2) with a single method and in a single training run. With a similar computational cost as training one model, we learn lines, curves, and simplexes of high-accuracy neural networks. These neural network subspaces contain diverse solutions that can be ensembled, approaching the ensemble performance of independently trained networks without the training cost. Moreover, using the subspace midpoint boosts accuracy, calibration, and robustness to label noise, outperforming Stochastic Weight Averaging.
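
To give a flavor of what "learning a line of networks" means in practice, the sketch below shows a linear layer with two learned weight endpoints; each training forward pass samples a random point on the segment between them, and evaluating at alpha = 0.5 uses the subspace midpoint mentioned above. The layer name and initialization are illustrative simplifications, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LineLinear(nn.Module):
    """Sketch of a 'line of networks' layer: two weight endpoints are learned,
    and each forward pass uses a point on the segment between them."""
    def __init__(self, in_f, out_f):
        super().__init__()
        self.w0 = nn.Parameter(torch.randn(out_f, in_f) * 0.01)
        self.w1 = nn.Parameter(torch.randn(out_f, in_f) * 0.01)
        self.b0 = nn.Parameter(torch.zeros(out_f))
        self.b1 = nn.Parameter(torch.zeros(out_f))

    def forward(self, x, alpha=None):
        if alpha is None:                      # training: sample a point on the line
            alpha = torch.rand(()).item()
        w = (1 - alpha) * self.w0 + alpha * self.w1
        b = (1 - alpha) * self.b0 + alpha * self.b1
        return F.linear(x, w, b)

# at test time, alpha=0.5 uses the subspace midpoint
layer = LineLinear(16, 4)
out = layer(torch.randn(8, 16), alpha=0.5)
```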

4 citations


Posted Content
TL;DR: This paper proposes PIGLeT, a model that learns physical commonsense knowledge through interaction and then uses this knowledge to ground language, factorizing it into a physical dynamics model and a separate language model.
Abstract: We propose PIGLeT: a model that learns physical commonsense knowledge through interaction, and then uses this knowledge to ground language. We factorize PIGLeT into a physical dynamics model, and a separate language model. Our dynamics model learns not just what objects are but also what they do: glass cups break when thrown, plastic ones don't. We then use it as the interface to our language model, giving us a unified model of linguistic form and grounded meaning. PIGLeT can read a sentence, simulate neurally what might happen next, and then communicate that result through a literal symbolic representation, or natural language. Experimental results show that our model effectively learns world dynamics, along with how to communicate them. It is able to correctly forecast "what happens next" given an English sentence over 80% of the time, outperforming a 100x larger, text-to-text approach by over 10%. Likewise, its natural language summaries of physical interactions are also judged by humans as more accurate than LM alternatives. We present comprehensive analysis showing room for future work.

Posted Content
TL;DR: In this paper, a single method is proposed that, in a single training run, learns lines, curves, and simplexes of high-accuracy neural networks; the networks in these subspaces can be ensembled, approaching the ensemble performance of independently trained networks without the training cost.
Abstract: Recent observations have advanced our understanding of the neural network optimization landscape, revealing the existence of (1) paths of high accuracy containing diverse solutions and (2) wider minima offering improved performance. Previous methods observing diverse paths require multiple training runs. In contrast we aim to leverage both property (1) and (2) with a single method and in a single training run. With a similar computational cost as training one model, we learn lines, curves, and simplexes of high-accuracy neural networks. These neural network subspaces contain diverse solutions that can be ensembled, approaching the ensemble performance of independently trained networks without the training cost. Moreover, using the subspace midpoint boosts accuracy, calibration, and robustness to label noise, outperforming Stochastic Weight Averaging.

Proceedings Article
06 Dec 2021
TL;DR: This paper proposes MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech in an entirely label-free, self-supervised manner, pretraining with a mix of both frame-level and video-level objectives.
Abstract: As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future. We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech -- in an entirely label-free, self-supervised manner. By pretraining with a mix of both frame-level (spatial) and video-level (temporal) objectives, our model not only learns to match images to temporally corresponding words, but also to contextualize what is happening globally over time. As a result, MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets when finetuned. It also transfers well to the world of static images, allowing models to reason about the dynamic context behind visual scenes. On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%, even those that make heavy use of auxiliary supervised data (like object bounding boxes). Ablation analyses demonstrate the complementary importance of: 1) training on videos versus static images; 2) scaling the magnitude and diversity of the pretraining video corpus; and 3) using diverse objectives that encourage full-stack multimodal reasoning, from the recognition to cognition level.

Journal ArticleDOI
TL;DR: As discussed by the authors, the majority of patients suffering from anxiety disorders in low- and middle-income countries do not receive evidence-based treatments; the Unified Protocol (UP) for the Transdiagnostic Treatment of Anxiety Disorders (UP-TRAD) is one of the most widely used protocols.
Abstract: The majority of patients suffering from anxiety disorders in low- and middle-income countries do not receive evidence-based treatments. The Unified Protocol (UP) for the Transdiagnostic Treatment o...

Posted Content
TL;DR: A transformer-based spatial-language model is proposed that, given a reconstructed 3D scene in the form of a point cloud with 3D bounding boxes of candidate objects and a language utterance referring to a target object in the scene, identifies the target object from the set of potential candidates.
Abstract: To realize robots that can understand human instructions and perform meaningful tasks in the near future, it is important to develop learned models that can understand referential language to identify common objects in real-world 3D scenes. In this paper, we develop a spatial-language model for a 3D visual grounding problem. Specifically, given a reconstructed 3D scene in the form of a point cloud with 3D bounding boxes of potential object candidates, and a language utterance referring to a target object in the scene, our model identifies the target object from a set of potential candidates. Our spatial-language model uses a transformer-based architecture that combines a spatial embedding of each bounding box with a finetuned language embedding from DistilBert and reasons among the objects in the 3D scene to find the target object. We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D. We provide additional analysis of performance on spatial reasoning tasks decoupled from perception noise, of the effect of view-dependent utterances on accuracy, and of view-point annotations for potential robotics applications.
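
A rough sketch of such an architecture is shown below: candidate 3D boxes are embedded, fused with a precomputed sentence embedding (standing in for the DistilBert features mentioned above), run through a transformer encoder, and scored per candidate. All dimensions and the additive fusion scheme are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SpatialLanguageGrounder(nn.Module):
    """Hedged sketch of a spatial-language grounding model: embed candidate
    boxes, fuse with a precomputed utterance embedding, reason with a
    transformer encoder, and score each candidate. Sizes are assumptions."""
    def __init__(self, lang_dim=768, d_model=256, nhead=4, layers=2):
        super().__init__()
        self.box_proj = nn.Linear(6, d_model)        # box center (3) + size (3)
        self.lang_proj = nn.Linear(lang_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.score = nn.Linear(d_model, 1)

    def forward(self, boxes, utterance_emb):
        # boxes: (B, N, 6); utterance_emb: (B, lang_dim)
        tokens = self.box_proj(boxes) + self.lang_proj(utterance_emb).unsqueeze(1)
        fused = self.encoder(tokens)                 # joint reasoning over candidates
        return self.score(fused).squeeze(-1)         # (B, N) logit per candidate box

model = SpatialLanguageGrounder()
logits = model(torch.randn(2, 12, 6), torch.randn(2, 768))
target_idx = logits.argmax(dim=-1)                   # predicted target object per scene
```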

Posted Content
TL;DR: In this article, the authors introduce the Neural Interaction Engine (NIE) to explicitly predict the change in the environment caused by the agent's actions; by modeling the changes while planning, they find that agents exhibit significant improvements in their navigational capabilities.
Abstract: We have observed significant progress in visual navigation for embodied agents. A common assumption in studying visual navigation is that the environments are static; this is a limiting assumption. Intelligent navigation may involve interacting with the environment beyond just moving forward/backward and turning left/right. Sometimes, the best way to navigate is to push something out of the way. In this paper, we study the problem of interactive navigation, where agents learn to change the environment to navigate more efficiently to their goals. To this end, we introduce the Neural Interaction Engine (NIE) to explicitly predict the change in the environment caused by the agent's actions. By modeling the changes while planning, we find that agents exhibit significant improvements in their navigational capabilities. More specifically, we consider two downstream tasks in the physics-enabled, visually rich AI2-THOR environment: (1) reaching a target while the path to the target is blocked, and (2) moving an object to a target location by pushing it. For both tasks, agents equipped with an NIE significantly outperform agents without an understanding of the effects of their actions, indicating the benefits of our approach.

Proceedings Article
03 May 2021
TL;DR: In this article, a self-supervised representation that encodes human interaction and attention cues is proposed, and is shown to learn better representations than visual-only approaches across a variety of target tasks such as scene classification, action recognition, depth estimation, and walkable surface estimation.
Abstract: Learning effective representations of visual data that generalize to a variety of downstream tasks has been a long quest for computer vision. Most representation learning approaches rely solely on visual data such as images or videos. In this paper, we explore a novel approach, where we use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations. For this study, we collect a dataset of human interactions capturing body part movements and gaze in their daily lives. Our experiments show that our self-supervised representation that encodes interaction and attention cues outperforms a visual-only state-of-the-art method, MoCo (He et al., 2020), on a variety of target tasks: scene classification (semantic), action recognition (temporal), depth estimation (geometric), dynamics prediction (physics), and walkable surface estimation (affordance).

Proceedings Article
01 Nov 2021
TL;DR: Iconary is a collaborative game of drawing and guessing in which a Guesser tries to identify a phrase that a Drawer is drawing by composing icons, and the Drawer iteratively revises the drawing in response to help the Guesser; models are proposed to play the game and are trained on over 55,000 games between human players.
Abstract: Communicating with humans is challenging for AIs because it requires a shared understanding of the world, complex semantics (e.g., metaphors or analogies), and at times multi-modal gestures (e.g., pointing with a finger, or an arrow in a diagram). We investigate these challenges in the context of Iconary, a collaborative game of drawing and guessing based on Pictionary, that poses a novel challenge for the research community. In Iconary, a Guesser tries to identify a phrase that a Drawer is drawing by composing icons, and the Drawer iteratively revises the drawing to help the Guesser in response. This back-and-forth often uses canonical scenes, visual metaphor, or icon compositions to express challenging words, making it an ideal test for mixing language and visual/symbolic communication in AI. We propose models to play Iconary and train them on over 55,000 games between human players. Our models are skillful players and are able to employ world knowledge in language models to play with words unseen during training.

Posted Content
TL;DR: In this paper, the authors propose a method for training a "compressible subspace" of neural networks that contains a fine-grained spectrum of models that range from highly efficient to highly accurate.
Abstract: When deploying deep learning models to a device, it is traditionally assumed that available computational resources (compute, memory, and power) remain static. However, real-world computing systems do not always provide stable resource guarantees. Computational resources need to be conserved when load from other processes is high or battery power is low. Inspired by recent works on neural network subspaces, we propose a method for training a "compressible subspace" of neural networks that contains a fine-grained spectrum of models that range from highly efficient to highly accurate. Our models require no retraining, thus our subspace of models can be deployed entirely on-device to allow adaptive network compression at inference time. We present results for achieving arbitrarily fine-grained accuracy-efficiency trade-offs at inference time for structured and unstructured sparsity. We achieve accuracies on-par with standard models when testing our uncompressed models, and maintain high accuracy for sparsity rates above 90% when testing our compressed models. We also demonstrate that our algorithm extends to quantization at variable bit widths, achieving accuracy on par with individually trained networks.
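
The deployment-side idea, picking a compression level at inference time without retraining, can be sketched as simple magnitude masking of each weight tensor. The snippet below illustrates only that inference-time selection under an assumed unstructured-sparsity scheme; it makes no claim about the subspace training recipe that keeps such compressed points accurate.

```python
import copy
import torch

@torch.no_grad()
def compress_at_inference(model, sparsity):
    """Hedged sketch of inference-time compression: zero out the smallest-magnitude
    fraction of each weight matrix. This shows only the deployment-side masking,
    not the paper's subspace training procedure."""
    pruned = copy.deepcopy(model)
    for param in pruned.parameters():
        if param.dim() < 2:               # skip biases/norm parameters in this sketch
            continue
        k = int(param.numel() * sparsity)
        if k == 0:
            continue
        threshold = param.abs().flatten().kthvalue(k).values
        param.mul_((param.abs() > threshold).to(param.dtype))
    return pruned

# e.g. pick a sparsity level based on the device's current load or battery
# fast_model = compress_at_inference(full_model, sparsity=0.9)   # hypothetical usage
```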

Posted Content
TL;DR: Iconary is a collaborative game of drawing and guessing in which a Guesser tries to identify a phrase that a Drawer is drawing by composing icons, and the Drawer iteratively revises the drawing in response to help the Guesser; models are proposed to play the game and are trained on over 55,000 games between human players.
Abstract: Communicating with humans is challenging for AIs because it requires a shared understanding of the world, complex semantics (e.g., metaphors or analogies), and at times multi-modal gestures (e.g., pointing with a finger, or an arrow in a diagram). We investigate these challenges in the context of Iconary, a collaborative game of drawing and guessing based on Pictionary, that poses a novel challenge for the research community. In Iconary, a Guesser tries to identify a phrase that a Drawer is drawing by composing icons, and the Drawer iteratively revises the drawing to help the Guesser in response. This back-and-forth often uses canonical scenes, visual metaphor, or icon compositions to express challenging words, making it an ideal test for mixing language and visual/symbolic communication in AI. We propose models to play Iconary and train them on over 55,000 games between human players. Our models are skillful players and are able to employ world knowledge in language models to play with words unseen during training. Elite human players outperform our models, particularly at the drawing task, leaving an important gap for future research to address. We release our dataset, code, and evaluation setup as a challenge to the community at this http URL

Posted Content
TL;DR: In this paper, the authors propose a method for learning low-dimensional binary codes (LLC) for instances as well as classes, which does not require any side-information, like annotated attributes or label meta-data.
Abstract: Learning binary representations of instances and classes is a classical problem with several high-potential applications. In modern settings, the compression of high-dimensional neural representations to low-dimensional binary codes is a challenging task and often requires large bit-codes to be accurate. In this work, we propose a novel method for Learning Low-dimensional binary Codes (LLC) for instances as well as classes. Our method does not require any side-information, like annotated attributes or label meta-data, and learns extremely low-dimensional binary codes (~20 bits for ImageNet-1K). The learnt codes are super-efficient while still ensuring nearly optimal classification accuracy for ResNet50 on ImageNet-1K. We demonstrate that the learnt codes capture intrinsically important features in the data by discovering an intuitive taxonomy over classes. We further quantitatively measure the quality of our codes by applying them to efficient image retrieval as well as out-of-distribution (OOD) detection problems. For the ImageNet-100 retrieval problem, our learnt binary codes outperform 16-bit HashNet using only 10 bits and are also as accurate as 10-dimensional real representations. Finally, our learnt binary codes can perform OOD detection, out-of-the-box, as accurately as a baseline that needs ~3000 samples to tune its threshold, while we require none. Code and pre-trained models are available at this https URL.
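
The retrieval-side use of such codes is easy to picture: project features to a handful of dimensions, binarize with a sign threshold, and rank by Hamming distance. The sketch below uses a random projection purely for illustration, whereas LLC learns the codes; all sizes and names are assumptions.

```python
import torch

def to_binary_codes(features, projection):
    """Project high-dimensional features down (here ~20 dims) and take the sign.
    The projection is random here for illustration; LLC learns it."""
    return features @ projection > 0                  # (N, 20) boolean codes

def hamming_search(query_code, database_codes):
    """Return database indices sorted by Hamming distance to the query."""
    dists = (query_code ^ database_codes).sum(dim=1)
    return torch.argsort(dists)

# hypothetical sizes: ResNet50-style 2048-D features compressed to 20 bits
proj = torch.randn(2048, 20)
db = to_binary_codes(torch.randn(1000, 2048), proj)
q = to_binary_codes(torch.randn(1, 2048), proj)
ranking = hamming_search(q[0], db)
```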