Showing papers by "Facebook" published in 2021

Journal Article

OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields

Zhe Cao¹, Gines Hidalgo², Tomas Simon³, Shih-En Wei³, Yaser Sheikh²

University of California, Berkeley¹, Carnegie Mellon University², Facebook³

01 Jan 2021-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: OpenPose uses Part Affinity Fields (PAFs), a nonparametric representation, to learn to associate body parts with individuals in the image, achieving high accuracy and realtime performance regardless of the number of people in the image.

Abstract: Realtime multi-person 2D pose estimation is a key component in enabling machines to have an understanding of people in images and videos. In this work, we present a realtime approach to detect the 2D pose of multiple people in an image. The proposed method uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. This bottom-up system achieves high accuracy and realtime performance, regardless of the number of people in the image. In previous work, PAFs and body part location estimation were refined simultaneously across training stages. We demonstrate that a PAF-only refinement rather than both PAF and body part location refinement results in a substantial increase in both runtime performance and accuracy. We also present the first combined body and foot keypoint detector, based on an internal annotated foot dataset that we have publicly released. We show that the combined detector not only reduces the inference time compared to running them sequentially, but also maintains the accuracy of each component individually. This work has culminated in the release of OpenPose, the first open-source realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints.

2,911 citations

Journal Article

Billion-Scale Similarity Search with GPUs

Jeff Johnson¹, Matthijs Douze¹, Hervé Jégou¹

Facebook¹

01 Jul 2021-IEEE Transactions on Big Data

TL;DR: This paper proposes a novel design for k-selection on GPUs and applies it to brute-force, approximate, and compressed-domain similarity search based on product quantization, enabling the construction of a high-accuracy k-NN graph at billion-vector scale.

Abstract: Similarity search finds application in database systems handling complex data such as images or videos, which are typically represented by high-dimensional features and require specific indexing structures. This paper tackles the problem of better utilizing GPUs for this task. While GPUs excel at data parallel tasks such as distance computation, prior approaches in this domain are bottlenecked by algorithms that expose less parallelism, such as $k$-min selection, or make poor use of the memory hierarchy. We propose a novel design for $k$-selection. We apply it in different similarity search scenarios, by optimizing brute-force, approximate and compressed-domain search based on product quantization. In all these setups, we outperform the state of the art by large margins. Our implementation operates at up to 55 percent of theoretical peak performance, enabling a nearest neighbor implementation that is 8.5× faster than prior GPU state of the art. It enables the construction of a high accuracy $k$-NN graph on 95 million images from the YFCC100M dataset in 35 minutes, and of a graph connecting 1 billion vectors in less than 12 hours on 4 Maxwell Titan X GPUs. We have open-sourced our approach for the sake of comparison and reproducibility.

1,050 citations
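
As a rough illustration of the brute-force setting mentioned in the abstract above, the sketch below computes squared L2 distances from a query to every database vector and then performs k-selection, keeping only the k smallest distances. It is a CPU-only toy in plain Python, not the paper's GPU design; the names `squared_l2` and `brute_force_knn` and the sample data are hypothetical.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_knn(query, database, k):
    """Return the k nearest database entries as (distance, index) pairs.

    Distances are computed against every vector (brute force); the
    k-selection step then keeps only the k smallest, sorted ascending.
    """
    distances = ((squared_l2(query, vec), idx) for idx, vec in enumerate(database))
    return heapq.nsmallest(k, distances)


if __name__ == "__main__":
    # Hypothetical toy database of 2-D vectors.
    database = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0)]
    print(brute_force_knn((0.0, 0.0), database, k=2))
```

Selecting with a bounded heap avoids sorting all distances, which is the part of the pipeline the abstract identifies as the bottleneck on GPUs.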

Posted Content

An Empirical Study of Training Self-Supervised Vision Transformers

Xinlei Chen¹, Saining Xie², Kaiming He²

Carnegie Mellon University¹, Facebook²

05 Apr 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work investigates the effects of several fundamental components for training self-supervised ViT and finds that instability is a major issue: apparently good results can hide partial failures, which improve when training is made more stable.

Abstract: This paper does not describe a novel method. Instead, it studies a straightforward, incremental, yet must-know baseline given the recent progress in computer vision: self-supervised learning for Vision Transformers (ViT). While the training recipes for standard convolutional networks have been highly mature and robust, the recipes for ViT are yet to be built, especially in the self-supervised scenarios where training becomes more challenging. In this work, we go back to basics and investigate the effects of several fundamental components for training self-supervised ViT. We observe that instability is a major issue that degrades accuracy, and it can be hidden by apparently good results. We reveal that these results are indeed partial failure, and they can be improved when training is made more stable. We benchmark ViT results in MoCo v3 and several other self-supervised frameworks, with ablations in various aspects. We discuss the currently positive evidence as well as challenges and open questions. We hope that this work will provide useful data points and experience for future research.

949 citations

Proceedings Article

Exploring Simple Siamese Representation Learning

Xinlei Chen¹, Kaiming He¹

Facebook¹

01 Jun 2021

TL;DR: SimSiam shows that simple Siamese networks can learn meaningful representations without negative sample pairs, large batches, or momentum encoders; a stop-gradient operation plays an essential role in preventing collapsing solutions, and the method achieves competitive results on ImageNet and downstream tasks.

Abstract: Siamese networks have become a common structure in various recent models for unsupervised visual representation learning. These models maximize the similarity between two augmentations of one image, subject to certain conditions for avoiding collapsing solutions. In this paper, we report surprising empirical results that simple Siamese networks can learn meaningful representations even using none of the following: (i) negative sample pairs, (ii) large batches, (iii) momentum encoders. Our experiments show that collapsing solutions do exist for the loss and structure, but a stop-gradient operation plays an essential role in preventing collapsing. We provide a hypothesis on the implication of stop-gradient, and further show proof-of-concept experiments verifying it. Our "SimSiam" method achieves competitive results on ImageNet and downstream tasks. We hope this simple baseline will motivate people to rethink the roles of Siamese architectures for unsupervised representation learning. Code is made available.

754 citations

Journal Article

Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences

Alexander Rives¹,², Joshua Meier², Tom Sercu², Siddharth Goyal², Zeming Lin¹, Jason Liu², Demi Guo³, Myle Ott², C. Lawrence Zitnick², Jerry Ma⁴,⁵, Rob Fergus¹

New York University¹, Facebook², Harvard University³, University of Chicago⁴, Yale University⁵

13 Apr 2021-Proceedings of the National Academy of Sciences of the United States of America

TL;DR: This paper uses unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity; the resulting model contains information about biological properties in its representations.

Abstract: In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and improving state-of-the-art features for long-range contact prediction.

700 citations

Posted Content

Emerging Properties in Self-Supervised Vision Transformers

Mathilde Caron¹, Hugo Touvron¹,², Ishan Misra¹, Hervé Jégou¹, Julien Mairal, Piotr Bojanowski¹, Armand Joulin¹

Facebook¹, University of Paris²

29 Apr 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper questions whether self-supervised learning provides new properties to Vision Transformers (ViTs) that stand out compared to convolutional networks (convnets), and observes that self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs or with convnets.

Abstract: In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT. Our study also underlines the importance of momentum encoder, multi-crop training, and the use of small patches with ViTs. We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels. We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.

557 citations

Proceedings Article

ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases

Stéphane d'Ascoli, Hugo Touvron¹, Matthew L. Leavitt¹, Ari S. Morcos¹, Giulio Biroli², Levent Sagun¹

Facebook¹, École Normale Supérieure²

18 Jul 2021

TL;DR: Gated positional self-attention (GPSA) is introduced, a form of positional self-attention which can be equipped with a "soft" convolutional inductive bias; the resulting ConViT architecture outperforms DeiT on ImageNet while offering much improved sample efficiency.

Abstract: Convolutional architectures have proven extremely successful for vision tasks. Their hard inductive biases enable sample-efficient learning, but come at the cost of a potentially lower performance ceiling. Vision Transformers (ViTs) rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification. However, they require costly pre-training on large external datasets or distillation from pre-trained convolutional networks. In this paper, we ask the following question: is it possible to combine the strengths of these two architectures while avoiding their respective limitations? To this end, we introduce gated positional self-attention (GPSA), a form of positional self-attention which can be equipped with a "soft" convolutional inductive bias. We initialise the GPSA layers to mimic the locality of convolutional layers, then give each attention head the freedom to escape locality by adjusting a gating parameter regulating the attention paid to position versus content information. The resulting convolutional-like ViT architecture, ConViT, outperforms the DeiT on ImageNet, while offering a much improved sample efficiency. We further investigate the role of locality in learning by first quantifying how it is encouraged in vanilla self-attention layers, then analysing how it is escaped in GPSA layers. We conclude by presenting various ablations to better understand the success of the ConViT. Our code and models are released publicly at this https URL.

339 citations
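
The gating idea in the abstract above, a per-head parameter that regulates attention paid to position versus content, could be sketched roughly as follows. The abstract does not give the exact formula, so this is only one plausible reading: the sigmoid gate, the convex blend of two softmax distributions, and the names `softmax` and `gated_attention` are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_attention(content_scores, position_scores, gate_param):
    """Blend content and positional attention for one query position.

    gate = sigmoid(gate_param) is the weight on positional attention
    (assumed form); (1 - gate) is the weight on content attention.
    """
    gate = 1.0 / (1.0 + math.exp(-gate_param))
    content = softmax(content_scores)
    position = softmax(position_scores)
    return [(1.0 - gate) * c + gate * p for c, p in zip(content, position)]


if __name__ == "__main__":
    content_scores = [2.0, 0.0, 0.0]   # hypothetical: content favours key 0
    position_scores = [0.0, 0.0, 3.0]  # hypothetical: position favours key 2
    for gate_param in (-4.0, 0.0, 4.0):
        weights = gated_attention(content_scores, position_scores, gate_param)
        print(gate_param, [round(w, 3) for w in weights])
```

Under this reading, a large positive gate parameter keeps a head close to its positional (convolution-like) pattern, while a large negative one lets it attend by content instead.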

Journal Article

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

Wei-Ning Hsu¹, Benjamin Bolte¹, Yao-Hung Hubert Tsai², Kushal Lakhotia¹, Ruslan Salakhutdinov², Abdelrahman Mohamed¹

Facebook¹, Carnegie Mellon University²

26 Oct 2021-IEEE/ACM Transactions on Audio, Speech, and Language Processing

TL;DR: HuBERT utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss; applying the loss over the masked regions only forces the model to learn a combined acoustic and language model over the continuous inputs.

Abstract: Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. A key ingredient of our approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined acoustic and language model over the continuous inputs. HuBERT relies primarily on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HuBERT model either matches or improves upon the state-of-the-art wav2vec 2.0 performance on the Librispeech (960 h) and Libri-light (60,000 h) benchmarks with 10 min, 1 h, 10 h, 100 h, and 960 h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.

266 citations
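
The "prediction loss over the masked regions only" from the abstract above can be illustrated with a small sketch. It assumes a plain cross-entropy between per-frame cluster scores and offline cluster ids, averaged over masked frames; the function name `masked_prediction_loss`, its argument layout, and the choice of cross-entropy are hypothetical and not taken from the paper's implementation.

```python
import math


def masked_prediction_loss(logits, targets, mask):
    """Average cross-entropy over masked frames only.

    logits:  per-frame lists of unnormalised scores, one per cluster
    targets: per-frame cluster ids (e.g. from an offline k-means step)
    mask:    per-frame booleans, True where the input frame was masked
    """
    total, count = 0.0, 0
    for frame_logits, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue  # unmasked frames contribute nothing to the loss
        peak = max(frame_logits)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in frame_logits))
        total += log_norm - frame_logits[target]  # -log softmax(logits)[target]
        count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    # Hypothetical toy input: 3 frames, 4 clusters, only the middle frame masked.
    logits = [[5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0]]
    targets = [0, 1, 2]
    mask = [False, True, False]
    print(masked_prediction_loss(logits, targets, mask))
```

Because unmasked frames are skipped, the model gets no credit for copying visible inputs and has to infer the hidden units from context.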

Proceedings Article

KILT: a Benchmark for Knowledge Intensive Language Tasks

Fabio Petroni¹, Aleksandra Piktus², Angela Fan², Patrick S. H. Lewis³, Majid Yazdani², Nicola De Cao⁴, James Thorne⁵, Yacine Jernite⁶, Vladimir Karpukhin², Jean Maillard², Vassilis Plachouras², Tim Rocktäschel², Sebastian Riedel²

University of California, Berkeley¹, Facebook², University College London³, University of Amsterdam⁴, University of Cambridge⁵, New York University⁶

01 Jun 2021

TL;DR: It is found that a shared dense vector index coupled with a seq2seq model is a strong baseline, outperforming more tailor-made approaches for fact checking, open-domain question answering and dialogue, and yielding competitive results on entity linking and slot filling, by generating disambiguated text.

Abstract: Challenging problems such as open-domain question answering, fact checking, slot filling and entity linking require access to large, external knowledge sources. While some models do well on individual tasks, developing general models is difficult as each task might require computationally expensive indexing of custom knowledge sources, in addition to dedicated infrastructure. To catalyze research on models that condition on specific information in large textual resources, we present a benchmark for knowledge-intensive language tasks (KILT). All tasks in KILT are grounded in the same snapshot of Wikipedia, reducing engineering turnaround through the re-use of components, as well as accelerating research into task-agnostic memory architectures. We test both task-specific and general baselines, evaluating downstream performance in addition to the ability of the models to provide provenance. We find that a shared dense vector index coupled with a seq2seq model is a strong baseline, outperforming more tailor-made approaches for fact checking, open-domain question answering and dialogue, and yielding competitive results on entity linking and slot filling, by generating disambiguated text. KILT data and code are available at https://github.com/facebookresearch/KILT.

227 citations

Posted Content

VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

Changhan Wang¹, Morgane Riviere¹, Ann B. Lee¹, Anne Wu¹, Chaitanya Talnikar¹, Daniel Haziza¹, Mary Williamson¹, Juan Pino¹, Emmanuel Dupoux¹

Facebook¹

02 Jan 2021-arXiv: Computation and Language

TL;DR: VoxPopuli is introduced, a large-scale multilingual corpus providing 100K hours of unlabelled speech data in 23 languages; it is the largest open data to date for unsupervised representation learning as well as semi-supervised learning.

Abstract: We introduce VoxPopuli, a large-scale multilingual corpus providing 100K hours of unlabelled speech data in 23 languages. It is the largest open data to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 16 languages and their aligned oral interpretations into 5 other languages totaling 5.1K hours. We provide speech recognition baselines and validate the versatility of VoxPopuli unlabelled data in semi-supervised learning under challenging out-of-domain settings. We will release the corpus at this https URL under an open license.

186 citations

Posted Content

MSA Transformer

Roshan Rao¹, Jason Liu¹, Robert Verkuil¹, Joshua Meier¹, John Canny², Pieter Abbeel², Tom Sercu¹, Alexander Rives³

Facebook¹, University of California, Berkeley², New York University³

15 Feb 2021-bioRxiv

TL;DR: This article introduced a protein language model which takes as input a set of sequences in the form of a multiple sequence alignment and interleaves row and column attention across the input sequences and is trained with a variant of the masked language modeling objective across many protein families.

Abstract: Unsupervised protein language models trained across millions of diverse sequences learn structure and function of proteins. Protein language models studied to date have been trained to perform inference from individual sequences. The longstanding approach in computational biology has been to make inferences from a family of evolutionarily related sequences by fitting a model to each family independently. In this work we combine the two paradigms. We introduce a protein language model which takes as input a set of sequences in the form of a multiple sequence alignment. The model interleaves row and column attention across the input sequences and is trained with a variant of the masked language modeling objective across many protein families. The performance of the model surpasses current state-of-the-art unsupervised structure learning methods by a wide margin, with far greater parameter efficiency than prior state-of-the-art protein language models.

Proceedings Article

A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning

Christoph Feichtenhofer¹, Haoqi Fan¹, Bo Xiong¹, Ross Girshick¹, Kaiming He¹

Facebook¹

29 Apr 2021

TL;DR: This large-scale study of unsupervised spatiotemporal representation learning examines a simple objective that encourages temporally-persistent features in the same video; in spite of its simplicity, it works surprisingly well across (i) different unsupervised frameworks, (ii) pre-training datasets, (iii) downstream datasets, and (iv) backbone architectures.

Abstract: We present a large-scale study on unsupervised spatiotemporal representation learning from videos. With a unified perspective on four recent image-based frameworks, we study a simple objective that can easily generalize all these methods to space-time. Our objective encourages temporally-persistent features in the same video, and in spite of its simplicity, it works surprisingly well across: (i) different unsupervised frameworks, (ii) pre-training datasets, (iii) downstream datasets, and (iv) backbone architectures. We draw a series of intriguing observations from this study, e.g., we discover that encouraging long-spanned persistency can be effective even if the timespan is 60 seconds. In addition to state-of-the-art results in multiple benchmarks, we report a few promising cases in which unsupervised pre-training can outperform its supervised counterpart. Code will be made available at https://github.com/facebookresearch/SlowFast.

Proceedings Article

Dynabench: Rethinking Benchmarking in NLP

Douwe Kiela¹, Max Bartolo², Yixin Nie³, Divyansh Kaushik¹, Atticus Geiger⁴, Zhengxuan Wu⁴, Bertie Vidgen⁵, Grusha Prasad⁶, Amanpreet Singh⁷, Pratik Ringshia⁷, Zhiyi Ma, Tristan Thrush⁸, Sebastian Riedel⁷, Zeerak Waseem⁹, Pontus Stenetorp¹⁰, Robin Jia¹¹, Mohit Bansal³, Christopher Potts⁴, Adina Williams⁷

Carnegie Mellon University¹, University College London², University of North Carolina at Chapel Hill³, Stanford University⁴, University of Oxford⁵, Johns Hopkins University⁶, Facebook⁷, Université catholique de Louvain⁸, University of Sheffield⁹, Salesforce.com¹⁰, University of Southern California¹¹

01 Jun 2021

TL;DR: It is argued that Dynabench addresses a critical need in the community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios.

Abstract: We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios. With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and address potential objections to dynamic benchmarking as a new standard for the field.

Proceedings Article

Advances in neural rendering

Ayush Tewari, Ohad Fried¹, Justus Thies², Vincent Sitzmann³, Stephen Lombardi⁴, Zexiang Xu⁵, Tomas Simon⁴, Matthias Nießner⁶, Edgar Tretschk, Lingjie Liu, Ben Mildenhall⁷, Pratul P. Srinivasan⁷, Rohit Pandey⁷, Sergio Orts-Escolano⁷, Sean Fanello⁷, M. Guo⁸, Gordon Wetzstein⁸, Jun-Yan Zhu⁹, Christian Theobalt, Maneesh Agrawala⁸, Dan B. Goldman⁷, Michael Zollhöfer⁴

Interdisciplinary Center Herzliya¹, Max Planck Society², Massachusetts Institute of Technology³, Facebook⁴, Adobe Systems⁵, Technische Universität München⁶, Google⁷, Stanford University⁸, Carnegie Mellon University⁹

09 Aug 2021

Abstract: Loss Functions for Neural Rendering (Jun-Yan Zhu)

Posted Content

Language models enable zero-shot prediction of the effects of mutations on protein function

Joshua Meier¹,², Roshan Rao³, Robert Verkuil¹, Jason Liu¹, Tom Sercu¹, Alexander Rives¹,²

Facebook¹, New York University², University of California, Berkeley³

10 Jul 2021-bioRxiv

TL;DR: This paper shows that protein language models, using only zero-shot inference without any supervision from experimental data or additional training, capture the functional effects of sequence variation and perform at the state of the art.

Abstract: Modeling the effect of sequence variation on function is a fundamental problem for understanding and designing proteins. Since evolution encodes information about function into patterns in protein sequences, unsupervised models of variant effects can be learned from sequence data. The approach to date has been to fit a model to a family of related sequences. The conventional setting is limited, since a new model must be trained for each prediction task. We show that using only zero-shot inference, without any supervision from experimental data or additional training, protein language models capture the functional effects of sequence variation, performing at state-of-the-art.

Posted Content

Language models enable zero-shot prediction of the effects of mutations on protein function

Joshua Meier¹, Roshan Rao², Robert Verkuil¹, Jason Liu¹, Tom Sercu¹, Alexander Rives³

Facebook¹, University of California, Berkeley², New York University³

06 Dec 2021

TL;DR: This paper shows that protein language models, using only zero-shot inference without any supervision from experimental data or additional training, capture the functional effects of sequence variation and perform at the state of the art.

Proceedings Article

Training data-efficient image transformers & distillation through attention

Hugo Touvron¹, Matthieu Cord², Matthijs Douze¹, Francisco Massa¹, Alexandre Sablayrolles¹, Hervé Jégou¹

Facebook¹, University of Paris²

18 Jul 2021

Journal Article

Correction to “The Open Catalyst 2020 (OC20) Dataset and Community Challenges”

Lowik Chanussot¹, Abhishek Das¹, Siddharth Goyal¹, Thibaut Lavril¹, Muhammed Shuaibi², Morgane Riviere¹, Kevin Tran², Javier Heras-Domingo², Caleb Ho¹, Weihua Hu³, Aini Palizhati², Anuroop Sriram¹, Brandon Wood⁴, Junwoong Yoon², Devi Parikh¹,⁵, C. Lawrence Zitnick¹, Zachary W. Ulissi²

Facebook¹, Carnegie Mellon University², Stanford University³, Lawrence Berkeley National Laboratory⁴, Georgia Institute of Technology⁵

05 Nov 2021-ACS Catalysis

Journal Article

Multi-modal Self-Supervision from Generalized Data Transformations

Mandela Patrick¹, Yuki M. Asano¹, Polina Kuznetsova², Ruth Fong¹, João F. Henriques¹, Geoffrey Zweig², Andrea Vedaldi

University of Oxford¹, Facebook²

04 May 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: Generalized Data Transformations (GDTs) are introduced to reduce several recent self-supervised learning objectives to a single formulation for ease of comparison, analysis, and extension. The framework allows a choice between being invariant or distinctive to each data transformation, yielding different supervisory signals, and derives the conditions that combinations of transformations must obey in order to lead to well-posed learning objectives.

Abstract: In the image domain, excellent representations can be learned by inducing invariance to content-preserving transformations, such as image distortions. In this paper, we show that, for videos, the answer is more complex, and that better results can be obtained by accounting for the interplay between invariance, distinctiveness, multiple modalities and time. We introduce Generalized Data Transformations (GDTs) as a way to capture this interplay. GDTs reduce most previous self-supervised approaches to a choice of data transformations, even when this was not the case in the original formulations. They also allow us to choose whether the representation should be invariant or distinctive w.r.t. each effect and tell which combinations are valid, thus allowing us to explore the space of combinations systematically. We show in this manner that being invariant to certain transformations and distinctive to others is critical to learning effective video representations, improving the state-of-the-art by a large margin, and even surpassing supervised pretraining. We demonstrate results on a variety of downstream video and audio classification and retrieval tasks, on datasets such as HMDB-51, UCF-101, DCASE2014, ESC-50 and VGG-Sound. In particular, we achieve new state-of-the-art accuracies of 72.8% on HMDB-51 and 95.2% on UCF-101.

Journal Article

Open Catalyst 2020 (OC20) Dataset and Community Challenges

Lowik Chanussot¹, Abhishek Das¹, Siddharth Goyal¹, Thibaut Lavril¹, Muhammed Shuaibi², Morgane Riviere¹, Kevin Tran², Javier Heras-Domingo², Caleb Ho¹, Weihua Hu³, Aini Palizhati², Anuroop Sriram¹, Brandon Wood, Junwoong Yoon², Devi Parikh¹,⁴, C. Lawrence Zitnick¹, Zachary W. Ulissi²

Facebook¹, Carnegie Mellon University², Stanford University³, Georgia Institute of Technology⁴

04 May 2021-ACS Catalysis

TL;DR: The OC20 dataset is presented, consisting of 1,281,121 Density Functional Theory relaxations across a wide swath of materials, surfaces, and adsorbates; three state-of-the-art graph neural network models are applied as baseline demonstrations for the community to build on.

Abstract: Catalyst discovery and optimization is key to solving many societal and energy challenges including solar fuel synthesis, long-term energy storage, and renewable fertilizer production. Despite cons...

Proceedings Article

SUPERB: Speech processing Universal PERformance Benchmark

Shu-wen Yang¹, Po-Han Chi¹, Yung-Sung Chuang¹, Cheng-I Jeff Lai², Kushal Lakhotia³, Yist Y. Lin¹, Andy T. Liu¹, Jiatong Shi⁴, Xuankai Chang⁴, Guan-Ting Lin, Tzu-hsien Huang¹, Wei-Cheng Tseng⁵, Ko-tik Lee, Da-Rong Liu¹, Zili Huang⁴, Shuyan Dong⁶, Shang-Wen Li⁶, Shinji Watanabe⁴, Abdelrahman Mohamed³, Hung-yi Lee¹

National Taiwan University¹, Massachusetts Institute of Technology², Facebook³, Johns Hopkins University⁴, National Tsing Hua University⁵, Amazon.com⁶

03 May 2021

TL;DR: The Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks with minimal architecture changes and labeled data.

Abstract: Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV). The paradigm pretrains a shared model on large volumes of unlabeled data and achieves state-of-the-art (SOTA) for various tasks with minimal adaptation. However, the speech processing community lacks a similar setup to systematically explore the paradigm. To bridge this gap, we introduce Speech processing Universal PERformance Benchmark (SUPERB). SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks with minimal architecture changes and labeled data. Among multiple usages of the shared model, we especially focus on extracting the representation learned from SSL due to its preferable re-usability. We present a simple framework to solve SUPERB tasks by learning task-specialized lightweight prediction heads on top of the frozen shared model. Our results demonstrate that the framework is promising as SSL representations show competitive generalizability and accessibility across SUPERB tasks. We release SUPERB as a challenge with a leaderboard and a benchmark toolkit to fuel the research in representation learning and general speech processing.

Proceedings Article•DOI•

Boundary IoU: Improving Object-Centric Image Segmentation Evaluation

Bowen Cheng¹, Ross Girshick², Piotr Dollár², Alexander C. Berg², Alexander Kirillov²•Institutions (2)

University of Illinois at Urbana–Champaign¹, Facebook²

01 Jun 2021

TL;DR: Boundary IoU (Intersection-over-Union) is a new segmentation evaluation measure focused on boundary quality; it is significantly more sensitive than the standard Mask IoU to boundary errors on large objects and does not over-penalize errors on smaller objects.

Abstract: We present Boundary IoU (Intersection-over-Union), a new segmentation evaluation measure focused on boundary quality. We perform an extensive analysis across different error types and object sizes and show that Boundary IoU is significantly more sensitive than the standard Mask IoU measure to boundary errors for large objects and does not over-penalize errors on smaller objects. The new quality measure displays several desirable characteristics like symmetry w.r.t. prediction/ground truth pairs and balanced responsiveness across scales, which makes it more suitable for segmentation evaluation than other boundary-focused measures like Trimap IoU and F-measure. Based on Boundary IoU, we update the standard evaluation protocols for instance and panoptic segmentation tasks by proposing the Boundary AP (Average Precision) and Boundary PQ (Panoptic Quality) metrics, respectively. Our experiments show that the new evaluation metrics track boundary quality improvements that are generally overlooked by current Mask IoU-based evaluation metrics. We hope that the adoption of the new boundary-sensitive evaluation metrics will lead to rapid progress in segmentation methods that improve boundary quality.
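
The abstract does not spell out the formula. A minimal sketch follows, assuming Boundary IoU is the IoU of the two masks after each is restricted to the pixels within `d` of its own contour. That band is approximated here by `d` steps of 3x3 erosion. The function names, the erosion-based band, and the empty-mask convention are assumptions for illustration, not the authors' reference code.

```python
def erode(mask):
    """One step of 3x3 binary erosion; pixels outside the image count as background."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            keep = True
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < h and 0 <= nx < w) or not mask[ny][nx]:
                        keep = False
            out[y][x] = 1 if keep else 0
    return out

def boundary_region(mask, d):
    """Mask pixels removed by d erosion steps, i.e. roughly within d pixels of the contour."""
    eroded = mask
    for _ in range(d):
        eroded = erode(eroded)
    return {(y, x) for y, row in enumerate(mask)
            for x, v in enumerate(row) if v and not eroded[y][x]}

def mask_iou(gt, pred):
    """Standard Mask IoU over all foreground pixels."""
    g = {(y, x) for y, row in enumerate(gt) for x, v in enumerate(row) if v}
    p = {(y, x) for y, row in enumerate(pred) for x, v in enumerate(row) if v}
    union = g | p
    return len(g & p) / len(union) if union else 1.0  # assumed convention for two empty masks

def boundary_iou(gt, pred, d=1):
    """IoU computed only over each mask's boundary band of width d."""
    g, p = boundary_region(gt, d), boundary_region(pred, d)
    union = g | p
    return len(g & p) / len(union) if union else 1.0  # assumed convention for two empty masks
```

For an 8x8 square and a copy shifted by one pixel, Mask IoU under this sketch is 56/72 (about 0.78) and Boundary IoU with `d=1` is 14/42 (about 0.33). This matches the abstract's claim that the boundary measure reacts more strongly to boundary errors on larger objects.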

Journal Article•DOI•

Results of the 2020 fastMRI Challenge for Machine Learning MR Image Reconstruction

Matthew J. Muckley¹, Bruno Riemenschneider², Alireza Radmanesh², Sunwoo Kim, Geunu Jeong, Jingyu Ko, Yohan Jun³, Hyungseob Shin³, Dosik Hwang³, Mahmoud Mostapha⁴, Simon Arberet⁴, Dominik Nickel⁵, Zaccharie Ramzi⁶, Philippe Ciuciu⁶, Jean-Luc Starck, Jonas Teuwen⁷, Dimitrios Karkalousos, Chaoping Zhang, Anuroop Sriram¹, Zhengnan Huang², Nafissa Yakubova¹, Yvonne W. Lui², Florian Knoll²•Institutions (7)

Facebook¹, New York University², Yonsei University³, Princeton University⁴, Siemens⁵, Université Paris-Saclay⁶, Radboud University Nijmegen⁷

30 Apr 2021-IEEE Transactions on Medical Imaging

TL;DR: The second fastMRI challenge focused its radiologist evaluations on pathological assessment in brain images and added a Transfer track that required participants to submit models evaluated on MRI scanners from outside the training set.

Abstract: Accelerating MRI scans is one of the principal outstanding problems in the MRI research community. Towards this goal, we hosted the second fastMRI competition targeted towards reconstructing MR images with subsampled k-space data. We provided participants with data from 7,299 clinical brain scans (de-identified via a HIPAA-compliant procedure by NYU Langone Health), holding back the fully-sampled data from 894 of these scans for challenge evaluation purposes. In contrast to the 2019 challenge, we focused our radiologist evaluations on pathological assessment in brain images. We also debuted a new Transfer track that required participants to submit models evaluated on MRI scanners from outside the training set. We received 19 submissions from eight different groups. Results showed one team scoring best in both SSIM scores and qualitative radiologist evaluations. We also performed analysis on alternative metrics to mitigate the effects of background noise and collected feedback from the participants to inform future challenges. Lastly, we identify common failure modes across the submissions, highlighting areas of need for future research in the MRI reconstruction community.

Proceedings Article•DOI•

Recipes for building an open-domain chatbot

Stephen Roller¹, Emily Dinan¹, Naman Goyal², Da Ju¹, Mary Williamson¹, Yinhan Liu¹, Jing Xu¹, Myle Ott¹, Eric Michael Smith¹, Y-Lan Boureau¹, Jason Weston¹•Institutions (2)

Facebook¹, Georgia Institute of Technology²

01 Apr 2021

TL;DR: The authors show that large-scale models can learn the blended skills that good conversation requires when given appropriate training data and choice of generation strategy; they build variants of these recipes with 90M, 2.7B and 9.4B parameter models and make their models and code publicly available.

Abstract: Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we highlight other ingredients. Good conversation requires blended skills: providing engaging talking points, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent persona. We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models and code publicly available. Human evaluations show our best models outperform existing approaches in multi-turn dialogue on engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models.

Posted Content•DOI•

ETA Prediction with Graph Neural Networks in Google Maps

Austin Derrow-Pinion, Jennifer She, David Wong, Oliver Fritz Lange¹, Todd Hester², Luis Perez³, Marc Nunkesser¹, Seongjae Lee¹, Xueying Guo¹, Brett Wiltshire, Peter W. Battaglia, Vishal Gupta, Ang Li, Zhongwen Xu, Alvaro Sanchez-Gonzalez, Yujia Li, Petar Veličković•Institutions (3)

Google¹, Amazon.com², Facebook³

25 Aug 2021-arXiv: Learning

TL;DR: This paper presents a graph neural network estimator for estimated time of arrival (ETA) deployed in production at Google Maps, where it significantly reduced negative ETA outcomes relative to the previous production baseline (40+% in cities like Sydney).

Abstract: Travel-time prediction constitutes a task of high importance in transportation networks, with web mapping services like Google Maps regularly serving vast quantities of travel time queries from users and enterprises alike. Further, such a task requires accounting for complex spatiotemporal interactions (modelling both the topological properties of the road network and anticipating events -- such as rush hours -- that may occur in the future). Hence, it is an ideal target for graph representation learning at scale. Here we present a graph neural network estimator for estimated time of arrival (ETA) which we have deployed in production at Google Maps. While our main architecture consists of standard GNN building blocks, we further detail the usage of training schedule methods such as MetaGradients in order to make our model robust and production-ready. We also provide prescriptive studies: ablating on various architectural decisions and training regimes, and qualitative analyses on real-world situations where our model provides a competitive edge. Our GNN proved powerful when deployed, significantly reducing negative ETA outcomes in several regions compared to the previous production baseline (40+% in cities like Sydney).

Proceedings Article•DOI•

Space-time Neural Irradiance Fields for Free-Viewpoint Video

Wenqi Xian¹, Jia-Bin Huang¹, Johannes Kopf², Changil Kim²•Institutions (2)

Cornell University¹, Facebook²

01 Jun 2021

TL;DR: This method learns a spatiotemporal neural irradiance field for a dynamic scene from a single video, constraining the time-varying geometry of the scene representation with scene depth estimated by video depth estimation methods.

Abstract: We present a method that learns a spatiotemporal neural irradiance field for dynamic scenes from a single video. Our learned representation enables free-viewpoint rendering of the input video. Our method builds upon recent advances in implicit representations. Learning a spatiotemporal irradiance field from a single video poses significant challenges because the video contains only one observation of the scene at any point in time. The 3D geometry of a scene can be legitimately represented in numerous ways since varying geometry (motion) can be explained with varying appearance and vice versa. We address this ambiguity by constraining the time-varying geometry of our dynamic scene representation using the scene depth estimated from video depth estimation methods, aggregating contents from individual frames into a single global representation. We provide an extensive quantitative evaluation and demonstrate compelling free-viewpoint rendering results.

Proceedings Article•DOI•

QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization

Ming Zhong¹, Da Yin², Tao Yu³, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha⁴, Ahmed Hassan Awadallah⁵, Asli Celikyilmaz⁵, Yang Liu⁶, Xipeng Qiu¹, Dragomir R. Radev⁷•Institutions (7)

Fudan University¹, Peking University², Yale University³, Facebook⁴, Microsoft⁵, Tsinghua University⁶, Salesforce.com⁷

12 Apr 2021

TL;DR: This work defines a new query-based multi-domain meeting summarization task, where models have to select and summarize relevant spans of meetings in response to a query, and introduces QMSum, a new benchmark for this task.

Abstract: Meetings are a key component of human collaboration. As increasing numbers of meetings are recorded and transcribed, meeting summaries have become essential to remind those who may or may not have attended the meetings about the key decisions made and the tasks to be completed. However, it is hard to create a single short summary that covers all the content of a long meeting involving multiple people and topics. In order to satisfy the needs of different types of users, we define a new query-based multi-domain meeting summarization task, where models have to select and summarize relevant spans of meetings in response to a query, and we introduce QMSum, a new benchmark for this task. QMSum consists of 1,808 query-summary pairs over 232 meetings in multiple domains. Besides, we investigate a locate-then-summarize method and evaluate a set of strong summarization baselines on the task. Experimental results and manual analysis reveal that QMSum presents significant challenges in long meeting summarization for future research. Dataset is available at https://github.com/Yale-LILY/QMSum.

Proceedings Article•DOI•

Hubert: How Much Can a Bad Teacher Benefit ASR Pre-Training?

Wei-Ning Hsu¹, Yao-Hung Hubert Tsai², Benjamin Bolte¹, Ruslan Salakhutdinov², Abdelrahman Mohamed¹•Institutions (2)

Facebook¹, Carnegie Mellon University²

06 Jun 2021

TL;DR: This paper proposes the Hidden-Unit BERT (HUBERT) model, which uses a cheap k-means clustering step to provide aligned target labels for pre-training a BERT model.

Abstract: Compared to vision and language applications, self-supervised pre-training approaches for ASR are challenged by three unique problems: (1) There are multiple sound units in each input utterance, (2) With audio-only pre-training, there is no lexicon of sound units, and (3) Sound units have variable lengths with no explicit segmentation. In this paper, we propose the Hidden-Unit BERT (HUBERT) model which utilizes a cheap k-means clustering step to provide aligned target labels for pre-training of a BERT model. A key ingredient of our approach is applying the predictive loss over the masked regions only. This allows the pre-training stage to benefit from the consistency of the unsupervised teacher rather than its intrinsic quality. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HUBERT model matches the state-of-the-art wav2vec 2.0 performance on the ultra low-resource Libri-light 10h, 1h, 10min supervised subsets.
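
The two ingredients named in the abstract, cluster assignments as frame-level targets and a predictive loss applied over masked regions only, can be sketched in plain Python. This is an illustrative toy, not the authors' implementation: `assign_clusters` and `masked_cross_entropy` are hypothetical names, and the centroids are assumed to come from an already-fitted k-means model.

```python
import math

def assign_clusters(frames, centroids):
    """Label each frame with the index of its nearest centroid (the k-means 'teacher')."""
    labels = []
    for frame in frames:
        dists = [sum((a - b) ** 2 for a, b in zip(frame, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels

def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy over frames where mask is True; unmasked frames are ignored."""
    total, count = 0.0, 0
    for frame_logits, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue
        m = max(frame_logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in frame_logits))
        total += log_z - frame_logits[target]
        count += 1
    return total / count if count else 0.0
```

Because unmasked frames are skipped, changing the model's predictions on them leaves the loss unchanged. The training signal therefore comes only from predicting the teacher's labels where the input was hidden.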

Journal Article•DOI•

BrainGNN: Interpretable Brain Graph Neural Network for fMRI Analysis

Xiaoxiao Li¹, Xiaoxiao Li², Yuan Zhou², Nicha C. Dvornek², Muhan Zhang³, Siyuan Gao², Juntang Zhuang², Dustin Scheinost², Lawrence H. Staib², Pamela Ventola², James S. Duncan•Institutions (3)

University of British Columbia¹, Yale University², Facebook³

01 Dec 2021-Medical Image Analysis

TL;DR: This paper proposes BrainGNN, a graph neural network (GNN) framework that analyzes functional magnetic resonance images (fMRI) and discovers neurological biomarkers by leveraging the topological and functional information of fMRI.

Proceedings Article•DOI•

ETA Prediction with Graph Neural Networks in Google Maps

Google¹, Amazon.com², Facebook³

26 Oct 2021

TL;DR: This paper presents a graph neural network estimator for estimated time of arrival (ETA) deployed in production at Google Maps, where it significantly reduced negative ETA outcomes relative to the previous production baseline (40+% in cities like Sydney).

Abstract: Travel-time prediction constitutes a task of high importance in transportation networks, with web mapping services like Google Maps regularly serving vast quantities of travel time queries from users and enterprises alike. Further, such a task requires accounting for complex spatiotemporal interactions (modelling both the topological properties of the road network and anticipating events -- such as rush hours -- that may occur in the future). Hence, it is an ideal target for graph representation learning at scale. Here we present a graph neural network estimator for estimated time of arrival (ETA) which we have deployed in production at Google Maps. While our main architecture consists of standard GNN building blocks, we further detail the usage of training schedule methods such as MetaGradients in order to make our model robust and production-ready. We also provide prescriptive studies: ablating on various architectural decisions and training regimes, and qualitative analyses on real-world situations where our model provides a competitive edge. Our GNN proved powerful when deployed, significantly reducing negative ETA outcomes in several regions compared to the previous production baseline (40+% in cities like Sydney).
