Showing papers by "Facebook published in 2018"

PDF

Open Access

Proceedings Article•DOI•

[...]

Xiaolong Wang¹, Ross Girshick¹, Abhinav Gupta², Kaiming He¹•Institutions (2)

18 Jun 2018

TL;DR: In this article, the non-local operation computes the response at a position as a weighted sum of the features at all positions, which can be used to capture long-range dependencies.

...read moreread less

Abstract: Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method [4] in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our nonlocal models can compete or outperform current competition winners on both Kinetics and Charades datasets. In static image recognition, our non-local models improve object detection/segmentation and pose estimation on the COCO suite of tasks. Code will be made available.

...read moreread less

8,059 citations

Proceedings Article•DOI•

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

[...]

Alex Wang¹, Amanpreet Singh¹, Julian Michael², Felix Hill³, Omer Levy⁴, Samuel R. Bowman¹ - Show less +2 more•Institutions (4)

New York University¹, University of Washington², Google³, Facebook⁴

01 Nov 2018

TL;DR: The gluebenchmark as mentioned in this paper is a benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models.

...read moreread less

Abstract: Human ability to understand language is general, flexible, and robust. In contrast, most NLU models above the word level are designed for a specific task and struggle with out-of-domain data. If we aspire to develop models with understanding beyond the detection of superficial correspondences between inputs and outputs, then it is critical to develop a unified model that can execute a range of linguistic tasks across different domains. To facilitate research in this direction, we present the General Language Understanding Evaluation (GLUE, gluebenchmark.com): a benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models. For some benchmark tasks, training data is plentiful, but for others it is limited or does not match the genre of the test set. GLUE thus favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks. While none of the datasets in GLUE were created from scratch for the benchmark, four of them feature privately-held test data, which is used to ensure that the benchmark is used fairly. We evaluate baselines that use ELMo (Peters et al., 2018), a powerful transfer learning technique, as well as state-of-the-art sentence representation models. The best models still achieve fairly low absolute scores. Analysis with our diagnostic dataset yields similarly weak performance over all phenomena tested, with some exceptions.

...read moreread less

3,225 citations

Journal Article•DOI•

Optimization Methods for Large-Scale Machine Learning

[...]

Léon Bottou¹, Frank E. Curtis², Jorge Nocedal•Institutions (2)

Facebook¹, Lehigh University²

08 May 2018-Siam Review

TL;DR: The authors provides a review and commentary on the past, present, and future of numerical optimization algorithms in the context of machine learning applications and discusses how optimization problems arise in machine learning and what makes them challenging.

...read moreread less

Abstract: This paper provides a review and commentary on the past, present, and future of numerical optimization algorithms in the context of machine learning applications. Through case studies on text classification and the training of deep neural networks, we discuss how optimization problems arise in machine learning and what makes them challenging. A major theme of our study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient (SG) method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter. Based on this viewpoint, we present a comprehensive theory of a straightforward, yet versatile SG algorithm, discuss its practical behavior, and highlight opportunities for designing algorithms with improved performance. This leads to a discussion about the next generation of optimization methods for large-scale machine learning, including an investigation of two main streams of research on techniques th...

...read moreread less

2,238 citations

Proceedings Article•DOI•

A Closer Look at Spatiotemporal Convolutions for Action Recognition

[...]

Du Tran¹, Heng Wang¹, Lorenzo Torresani¹, Jamie Ray², Jamie Ray¹, Yann LeCun¹, Manohar Paluri¹ - Show less +3 more•Institutions (2)

Facebook¹, Dartmouth College²

12 Apr 2018

TL;DR: In this article, a new spatio-temporal convolutional block "R(2+1)D" was proposed, which achieved state-of-the-art performance on Sports-1M, Kinetics, UCF101, and HMDB51.

...read moreread less

Abstract: In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significantly gains in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block "R(2+1)D" which produces CNNs that achieve results comparable or superior to the state-of-the-art on Sports-1M, Kinetics, UCF101, and HMDB51.

...read moreread less

1,827 citations

Posted Content•

Deep Clustering for Unsupervised Learning of Visual Features

[...]

Mathilde Caron¹, Piotr Bojanowski¹, Armand Joulin¹, Matthijs Douze¹•Institutions (1)

Facebook¹

15 Jul 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work presents DeepCluster, a clustering method that jointly learns the parameters of a neural network and the cluster assignments of the resulting features and outperforms the current state of the art by a significant margin on all the standard benchmarks.

...read moreread less

Abstract: Clustering is a class of unsupervised learning methods that has been extensively applied and studied in computer vision. Little work has been done to adapt it to the end-to-end training of visual features on large scale datasets. In this work, we present DeepCluster, a clustering method that jointly learns the parameters of a neural network and the cluster assignments of the resulting features. DeepCluster iteratively groups the features with a standard clustering algorithm, k-means, and uses the subsequent assignments as supervision to update the weights of the network. We apply DeepCluster to the unsupervised training of convolutional neural networks on large datasets like ImageNet and YFCC100M. The resulting model outperforms the current state of the art by a significant margin on all the standard benchmarks.

...read moreread less

1,363 citations

Book Chapter•DOI•

Deep Clustering for Unsupervised Learning of Visual Features

[...]

Mathilde Caron¹, Piotr Bojanowski¹, Armand Joulin¹, Matthijs Douze¹•Institutions (1)

Facebook¹

08 Sep 2018

TL;DR: DeepCluster as discussed by the authors is a clustering method that jointly learns the parameters of a neural network and the cluster assignments of the resulting features, and uses the subsequent assignments as supervision to update the weights of the network.

...read moreread less

Abstract: Clustering is a class of unsupervised learning methods that has been extensively applied and studied in computer vision. Little work has been done to adapt it to the end-to-end training of visual features on large-scale datasets. In this work, we present DeepCluster, a clustering method that jointly learns the parameters of a neural network and the cluster assignments of the resulting features. DeepCluster iteratively groups the features with a standard clustering algorithm, k-means, and uses the subsequent assignments as supervision to update the weights of the network. We apply DeepCluster to the unsupervised training of convolutional neural networks on large datasets like ImageNet and YFCC100M. The resulting model outperforms the current state of the art by a significant margin on all the standard benchmarks.

...read moreread less

1,361 citations

Journal Article•DOI•

Forecasting at scale

[...]

Sean J. Taylor¹, Benjamin Letham¹•Institutions (1)

Facebook¹

24 Apr 2018-The American Statistician

TL;DR: A practical approach to forecasting “at scale” that combines configurable models with analyst-in-the-loop performance analysis, and a modular regression model with interpretable parameters that can be intuitively adjusted by analysts with domain knowledge about the time series are described.

...read moreread less

Abstract: Forecasting is a common data science task that helps organizations with capacity planning, goal setting, and anomaly detection. Despite its importance, there are serious challenges associated with ...

...read moreread less

1,166 citations

Proceedings Article•

Word translation without parallel data

[...]

Guillaume Lample¹, Alexis Conneau¹, Marc'Aurelio Ranzato¹, Ludovic Denoyer², Hervé Jégou¹ - Show less +1 more•Institutions (2)

Facebook¹, University of Paris²

15 Feb 2018

TL;DR: It is shown that a bilingual dictionary can be built between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way.

...read moreread less

1,068 citations

Proceedings Article•DOI•

DensePose: Dense Human Pose Estimation in the Wild

[...]

Riza Alp Guler¹, Natalia Neverova², Iasonas Kokkinos³•Institutions (3)

French Institute for Research in Computer Science and Automation¹, Facebook², University College London³

01 Feb 2018

TL;DR: This work establishes dense correspondences between an RGB image and a surface-based representation of the human body, a task referred to as dense human pose estimation, and improves accuracy through cascading, obtaining a system that delivers highly-accurate results at multiple frames per second on a single gpu.

...read moreread less

Abstract: In this work we establish dense correspondences between an RGB image and a surface-based representation of the human body, a task we refer to as dense human pose estimation. We gather dense correspondences for 50K persons appearing in the COCO dataset by introducing an efficient annotation pipeline. We then use our dataset to train CNN-based systems that deliver dense correspondence 'in the wild', namely in the presence of background, occlusions and scale variations. We improve our training set's effectiveness by training an inpainting network that can fill in missing ground truth values and report improvements with respect to the best results that would be achievable in the past. We experiment with fully-convolutional networks and region-based models and observe a superiority of the latter. We further improve accuracy through cascading, obtaining a system that delivers highly-accurate results at multiple frames per second on a single gpu. Supplementary materials, data, code, and videos are provided on the project page http://densepose.org.

...read moreread less

987 citations

Posted Content•

OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields

[...]

Zhe Cao¹, Gines Hidalgo², Tomas Simon³, Shih-En Wei³, Yaser Sheikh² - Show less +1 more•Institutions (3)

University of California, Berkeley¹, Carnegie Mellon University², Facebook³

18 Dec 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: OpenPose is released, the first open-source realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints, and the first combined body and foot keypoint detector, based on an internal annotated foot dataset.

...read moreread less

Abstract: Realtime multi-person 2D pose estimation is a key component in enabling machines to have an understanding of people in images and videos. In this work, we present a realtime approach to detect the 2D pose of multiple people in an image. The proposed method uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. This bottom-up system achieves high accuracy and realtime performance, regardless of the number of people in the image. In previous work, PAFs and body part location estimation were refined simultaneously across training stages. We demonstrate that a PAF-only refinement rather than both PAF and body part location refinement results in a substantial increase in both runtime performance and accuracy. We also present the first combined body and foot keypoint detector, based on an internal annotated foot dataset that we have publicly released. We show that the combined detector not only reduces the inference time compared to running them sequentially, but also maintains the accuracy of each component individually. This work has culminated in the release of OpenPose, the first open-source realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints.

...read moreread less

986 citations

Proceedings Article•DOI•

Understanding Back-Translation at Scale.

[...]

Sergey Edunov¹, Myle Ott¹, Michael Auli¹, David Grangier¹•Institutions (1)

Facebook¹

01 Jan 2018

TL;DR: This work broadens the understanding of back-translation and investigates a number of methods to generate synthetic source sentences, finding that in all but resource poor settings back-translations obtained via sampling or noised beam outputs are most effective.

...read moreread less

Abstract: An effective method to improve neural machine translation with monolingual data is to augment the parallel training corpus with back-translations of target language sentences This work broadens the understanding of back-translation and investigates a number of methods to generate synthetic source sentences We find that in all but resource poor settings back-translations obtained via sampling or noised beam outputs are most effective Our analysis shows that sampling or noisy synthetic data gives a much stronger training signal than data generated by beam or greedy search We also compare how synthetic data compares to genuine bitext and study various domain effects Finally, we scale to hundreds of millions of monolingual sentences and achieve a new state of the art of 35 BLEU on the WMT’14 English-German test set

...read moreread less

Book Chapter•DOI•

Exploring the Limits of Weakly Supervised Pretraining

[...]

Dhruv Mahajan¹, Ross Girshick¹, Vignesh Ramanathan¹, Kaiming He¹, Manohar Paluri¹, Yixuan Li¹, Ashwin Bharambe¹, Laurens van der Maaten¹ - Show less +4 more•Institutions (1)

Facebook¹

08 Sep 2018

TL;DR: In this paper, the authors presented a transfer learning approach with large convolutional networks trained to predict hashtags on billions of social media images and reported the highest ImageNet-1k single-crop, top-1 accuracy to date.

...read moreread less

Abstract: State-of-the-art visual perception models for a wide range of tasks rely on supervised pretraining. ImageNet classification is the de facto pretraining task for these models. Yet, ImageNet is now nearly ten years old and is by modern standards “small”. Even so, relatively little is known about the behavior of pretraining with datasets that are multiple orders of magnitude larger. The reasons are obvious: such datasets are difficult to collect and annotate. In this paper, we present a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images. Our experiments demonstrate that training for large-scale hashtag prediction leads to excellent results. We show improvements on several image classification and object detection tasks, and report the highest ImageNet-1k single-crop, top-1 accuracy to date: 85.4% (97.6% top-5). We also perform extensive experiments that provide novel empirical data on the relationship between large-scale pretraining and transfer learning performance.

...read moreread less

Proceedings Article•

Learning Word Vectors for 157 Languages

[...]

Edouard Grave¹, Piotr Bojanowski¹, Prakhar Gupta², Armand Joulin¹, Tomas Mikolov¹ - Show less +1 more•Institutions (2)

Facebook¹, École Polytechnique Fédérale de Lausanne²

19 Feb 2018

TL;DR: This article used two sources of data to train these models: the free online encyclopedia Wikipedia and data from the common crawl project, and introduced three new word analogy datasets to evaluate these word vectors, for French, Hindi and Polish.

...read moreread less

Abstract: Distributed word representations, or word vectors, have recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient to the successful application of these representations is to train them on very large corpora, and use these pre-trained models in downstream tasks. In this paper, we describe how we trained such high quality word representations for 157 languages. We used two sources of data to train these models: the free online encyclopedia Wikipedia and data from the common crawl project. We also introduce three new word analogy datasets to evaluate these word vectors, for French, Hindi and Polish. Finally, we evaluate our pre-trained word vectors on 10 languages for which evaluation datasets exists, showing very strong performance compared to previous models.

...read moreread less

Proceedings Article•DOI•

3D Semantic Segmentation with Submanifold Sparse Convolutional Networks

[...]

Benjamin Graham¹, Martin Engelcke², Laurens van der Maaten³•Institutions (3)

Facebook¹, University of Oxford², Carnegie Mellon University³

18 Jun 2018

TL;DR: This work introduces new sparse convolutional operations that are designed to process spatially-sparse data more efficiently, and uses them to develop Spatially-Sparse Convolutional networks, which outperform all prior state-of-the-art models on two tasks involving semantic segmentation of 3D point clouds.

...read moreread less

Abstract: Convolutional networks are the de-facto standard for analyzing spatio-temporal data such as images, videos, and 3D shapes. Whilst some of this data is naturally dense (e.g., photos), many other data sources are inherently sparse. Examples include 3D point clouds that were obtained using a LiDAR scanner or RGB-D camera. Standard "dense" implementations of convolutional networks are very inefficient when applied on such sparse data. We introduce new sparse convolutional operations that are designed to process spatially-sparse data more efficiently, and use them to develop spatially-sparse convolutional networks. We demonstrate the strong performance of the resulting models, called submanifold sparse convolutional networks (SS-CNs), on two tasks involving semantic segmentation of 3D point clouds. In particular, our models outperform all prior state-of-the-art on the test set of a recent semantic segmentation competition.

...read moreread less

Proceedings Article•

Mutual Information Neural Estimation.

[...]

Mohamed Ishmael Belghazi¹, Aristide Baratin², Sai Rajeshwar, Sherjil Ozair³, Yoshua Bengio², Aaron Courville², Devon Hjelm⁴ - Show less +3 more•Institutions (4)

Facebook¹, Université de Montréal², Baidu³, Microsoft⁴

03 Jul 2018

TL;DR: A Mutual Information Neural Estimator (MINE) is presented that is linearly scalable in dimensionality as well as in sample size, trainable through back-prop, and strongly consistent, and applied to improve adversarially trained generative models.

...read moreread less

Abstract: We argue that the estimation of mutual information between high dimensional continuous random variables can be achieved by gradient descent over neural networks. We present a Mutual Information Neural Estimator (MINE) that is linearly scalable in dimensionality as well as in sample size, trainable through back-prop, and strongly consistent. We present a handful of applications on which MINE can be used to minimize or maximize mutual information. We apply MINE to improve adversarially trained generative models. We also use MINE to implement the Information Bottleneck, applying it to supervised classification; our results demonstrate substantial improvement in flexibility and performance in these settings.

...read moreread less

Proceedings Article•DOI•

Personalizing Dialogue Agents: I have a dog, do you have pets too?

[...]

Saizheng Zhang¹, Emily Dinan², Jack Urbanek², Arthur Szlam², Douwe Kiela², Jason Weston³ - Show less +2 more•Institutions (3)

Université de Montréal¹, Facebook², New York University³

22 Jan 2018

TL;DR: In this paper, the task of making chit-chat more engaging by conditioning on profile information is addressed, and the resulting dialogue can be used to predict profile information about the interlocutors.

...read moreread less

Abstract: Chit-chat models are known to have several problems: they lack specificity, do not display a consistent personality and are often not very captivating. In this work we present the task of making chit-chat more engaging by conditioning on profile information. We collect data and train models to (i)condition on their given profile information; and (ii) information about the person they are talking to, resulting in improved dialogues, as measured by next utterance prediction. Since (ii) is initially unknown our model is trained to engage its partner with personal topics, and we show the resulting dialogue can be used to predict profile information about the interlocutors.

...read moreread less

Proceedings Article•

Unsupervised Machine Translation Using Monolingual Corpora Only

[...]

Guillaume Lample¹, Alexis Conneau¹, Ludovic Denoyer², Marc'Aurelio Ranzato¹•Institutions (2)

Facebook¹, University of Paris²

15 Feb 2018

TL;DR: This work proposes a model that takes sentences from monolingual corpora in two different languages and maps them into the same latent space and effectively learns to translate without using any labeled data.

...read moreread less

Abstract: Machine translation has recently achieved impressive performance thanks to recent advances in deep learning and the availability of large-scale parallel corpora. There have been numerous attempts to extend these successes to low-resource language pairs, yet requiring tens of thousands of parallel sentences. In this work, we take this research direction to the extreme and investigate whether it is possible to learn to translate even without any parallel data. We propose a model that takes sentences from monolingual corpora in two different languages and maps them into the same latent space. By learning to reconstruct in both languages from this shared feature space, the model effectively learns to translate without using any labeled data. We demonstrate our model on two widely used datasets and two language pairs, reporting BLEU scores of 32.8 and 15.1 on the Multi30k and WMT English-French datasets, without using even a single parallel sentence at training time.

...read moreread less

Posted Content•

QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning

[...]

Tabish Rashid¹, Mikayel Samvelyan, Christian Schroeder de Witt¹, Gregory Farquhar¹, Jakob Foerster², Shimon Whiteson¹ - Show less +2 more•Institutions (2)

University of Oxford¹, Facebook²

30 Mar 2018-arXiv: Learning

TL;DR: In this article, the authors propose a value-based method that can train decentralised policies in a centralised end-to-end fashion in simulated or laboratory settings, where global state information is available and communication constraints are lifted.

...read moreread less

Abstract: In many real-world settings, a team of agents must coordinate their behaviour while acting in a decentralised way. At the same time, it is often possible to train the agents in a centralised fashion in a simulated or laboratory setting, where global state information is available and communication constraints are lifted. Learning joint action-values conditioned on extra state information is an attractive way to exploit centralised learning, but the best strategy for then extracting decentralised policies is unclear. Our solution is QMIX, a novel value-based method that can train decentralised policies in a centralised end-to-end fashion. QMIX employs a network that estimates joint action-values as a complex non-linear combination of per-agent values that condition only on local observations. We structurally enforce that the joint-action value is monotonic in the per-agent values, which allows tractable maximisation of the joint action-value in off-policy learning, and guarantees consistency between the centralised and decentralised policies. We evaluate QMIX on a challenging set of StarCraft II micromanagement tasks, and show that QMIX significantly outperforms existing value-based multi-agent reinforcement learning methods.

...read moreread less

Proceedings Article•DOI•

XNLI: Evaluating Cross-lingual Sentence Representations

[...]

Alexis Conneau¹, Ruty Rinott¹, Guillaume Lample¹, Adina Williams², Samuel R. Bowman¹, Holger Schwenk¹, Veselin Stoyanov¹ - Show less +3 more•Institutions (2)

Facebook¹, New York University²

13 Sep 2018

TL;DR: This work constructs an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus to 14 languages, including low-resource languages such as Swahili and Urdu and finds that XNLI represents a practical and challenging evaluation suite and that directly translating the test data yields the best performance among available baselines.

...read moreread less

Abstract: State-of-the-art natural language processing systems rely on supervision in the form of annotated data to learn competent models. These models are generally trained on data in a single language (usually English), and cannot be directly used beyond that language. Since collecting data in every language is not realistic, there has been a growing interest in cross-lingual language understanding (XLU) and low-resource cross-language transfer. In this work, we construct an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus (MultiNLI) to 14 languages, including low-resource languages such as Swahili and Urdu. We hope that our dataset, dubbed XNLI, will catalyze research in cross-lingual sentence understanding by providing an informative standard evaluation task. In addition, we provide several baselines for multilingual sentence understanding, including two based on machine translation systems, and two that use parallel data to train aligned multilingual bag-of-words and LSTM encoders. We find that XNLI represents a practical and challenging evaluation suite, and that directly translating the test data yields the best performance among available baselines.

...read moreread less

Posted Content•

Efficient Lifelong Learning with A-GEM

[...]

Arslan Chaudhry¹, Marc'Aurelio Ranzato², Marcus Rohrbach², Mohamed Elhoseiny²•Institutions (2)

University of Oxford¹, Facebook²

02 Dec 2018-arXiv: Learning

TL;DR: An improved version of GEM is proposed, dubbed Averaged GEM (A-GEM), which enjoys the same or even better performance as GEM, while being almost as computationally and memory efficient as EWC and other regularization-based methods.

...read moreread less

Abstract: In lifelong learning, the learner is presented with a sequence of tasks, incrementally building a data-driven prior which may be leveraged to speed up learning of a new task. In this work, we investigate the efficiency of current lifelong approaches, in terms of sample complexity, computational and memory cost. Towards this end, we first introduce a new and a more realistic evaluation protocol, whereby learners observe each example only once and hyper-parameter selection is done on a small and disjoint set of tasks, which is not used for the actual learning experience and evaluation. Second, we introduce a new metric measuring how quickly a learner acquires a new skill. Third, we propose an improved version of GEM (Lopez-Paz & Ranzato, 2017), dubbed Averaged GEM (A-GEM), which enjoys the same or even better performance as GEM, while being almost as computationally and memory efficient as EWC (Kirkpatrick et al., 2016) and other regularization-based methods. Finally, we show that all algorithms including A-GEM can learn even more quickly if they are provided with task descriptors specifying the classification tasks under consideration. Our experiments on several standard lifelong learning benchmarks demonstrate that A-GEM has the best trade-off between accuracy and efficiency.

...read moreread less

Proceedings Article•

Wizard of Wikipedia: Knowledge-Powered Conversational Agents

[...]

Emily Dinan¹, Stephen Roller¹, Kurt Shuster¹, Angela Fan², Michael Auli¹, Jason Weston¹ - Show less +2 more•Institutions (2)

Facebook¹, Harvard University²

27 Sep 2018

TL;DR: The best performing dialogue models are able to conduct knowledgeable discussions on open-domain topics as evaluated by automatic metrics and human evaluations, while a new benchmark allows for measuring further improvements in this important research direction.

...read moreread less

Abstract: In open-domain dialogue intelligent agents should exhibit the use of knowledge, however there are few convincing demonstrations of this to date. The most popular sequence to sequence models typically "generate and hope" generic utterances that can be memorized in the weights of the model when mapping from input utterance(s) to output, rather than employing recalled knowledge as context. Use of knowledge has so far proved difficult, in part because of the lack of a supervised learning benchmark task which exhibits knowledgeable open dialogue with clear grounding. To that end we collect and release a large dataset with conversations directly grounded with knowledge retrieved from Wikipedia. We then design architectures capable of retrieving knowledge, reading and conditioning on it, and finally generating natural responses. Our best performing dialogue models are able to conduct knowledgeable discussions on open-domain topics as evaluated by automatic metrics and human evaluations, while our new benchmark allows for measuring further improvements in this important research direction.

...read moreread less

Proceedings Article•DOI•

DeepGlobe 2018: A Challenge to Parse the Earth through Satellite Images

[...]

Ilke Demir¹, Krzysztof Koperski, David Lindenbaum, Guan Pang¹, Jing Huang¹, Saikat Basu¹, Forest Hughes¹, Devis Tuia², Ramesh Raska³ - Show less +5 more•Institutions (3)

Facebook¹, Wageningen University and Research Centre², Massachusetts Institute of Technology³

18 Jun 2018

TL;DR: The DeepGlobe 2018 Satellite Image Understanding Challenge is presented, which includes three public competitions for segmentation, detection, and classification tasks on satellite images, and characteristics of each dataset are analyzed, and evaluation criteria for each task are defined.

...read moreread less

Abstract: We present the DeepGlobe 2018 Satellite Image Understanding Challenge, which includes three public competitions for segmentation, detection, and classification tasks on satellite images (Figure 1). Similar to other challenges in computer vision domain such as DAVIS[21] and COCO[33], DeepGlobe proposes three datasets and corresponding evaluation methodologies, coherently bundled in three competitions with a dedicated workshop co-located with CVPR 2018. We observed that satellite imagery is a rich and structured source of information, yet it is less investigated than everyday images by computer vision researchers. However, bridging modern computer vision with remote sensing data analysis could have critical impact to the way we understand our environment and lead to major breakthroughs in global urban planning or climate change research. Keeping such bridging objective in mind, DeepGlobe aims to bring together researchers from different domains to raise awareness of remote sensing in the computer vision community and vice-versa. We aim to improve and evaluate state-of-the-art satellite image understanding approaches, which can hopefully serve as reference benchmarks for future research in the same topic. In this paper, we analyze characteristics of each dataset, define the evaluation criteria of the competitions, and provide baselines for each task.

...read moreread less

Proceedings Article•DOI•

Low-Shot Learning from Imaginary Data

[...]

Yu-Xiong Wang¹, Ross Girshick¹, Martial Hebert², Bharath Hariharan³•Institutions (3)

Facebook¹, Carnegie Mellon University², Cornell University³

18 Jun 2018

TL;DR: This work builds on recent progress in meta-learning by combining a meta-learner with a "hallucinator" that produces additional training examples, and optimizing both models jointly, yielding state-of-the-art performance on the challenging ImageNet low-shot classification benchmark.

...read moreread less

Abstract: Humans can quickly learn new visual concepts, perhaps because they can easily visualize or imagine what novel objects look like from different views. Incorporating this ability to hallucinate novel instances of new concepts might help machine vision systems perform better low-shot learning, i.e., learning concepts from few examples. We present a novel approach to low-shot learning that uses this idea. Our approach builds on recent progress in meta-learning ("learning to learn") by combining a meta-learner with a "hallucinator" that produces additional training examples, and optimizing both models jointly. Our hallucinator can be incorporated into a variety of meta-learners and provides significant gains: up to a 6 point boost in classification accuracy when only a single training example is available, yielding state-of-the-art performance on the challenging ImageNet low-shot classification benchmark.

...read moreread less

Proceedings Article•DOI•

What you can cram into a single \$&!#* vector: Probing sentence embeddings for linguistic properties

[...]

Alexis Conneau¹, Germán Kruszewski, Guillaume Lample, Loïc Barrault, Marco Baroni¹ - Show less +1 more•Institutions (1)

Facebook¹

15 Jul 2018

TL;DR: The authors introduce 10 probing tasks designed to capture simple linguistic features of sentences, and use them to study embeddings generated by three different encoders trained in eight distinct ways, uncovering intriguing properties of both encoder and training methods.

...read moreread less

Abstract: Although much effort has recently been devoted to training high-quality sentence embeddings, we still have a poor understanding of what they are capturing. "Downstream" tasks, often based on sentence classification, are commonly used to evaluate the quality of sentence representations. The complexity of the tasks makes it however difficult to infer what kind of information is present in the representations. We introduce here 10 probing tasks designed to capture simple linguistic features of sentences, and we use them to study embeddings generated by three different encoders trained in eight distinct ways, uncovering intriguing properties of both encoders and training methods.

...read moreread less

Book Chapter•DOI•

Memory Aware Synapses: Learning What (not) to Forget

[...]

Rahaf Aljundi¹, Francesca Babiloni¹, Mohamed Elhoseiny², Marcus Rohrbach², Tinne Tuytelaars¹ - Show less +1 more•Institutions (2)

Katholieke Universiteit Leuven¹, Facebook²

08 Sep 2018

TL;DR: This paper argues that, given the limited model capacity and the unlimited new information to be learned, knowledge has to be preserved or erased selectively and proposes a novel approach for lifelong learning, coined Memory Aware Synapses (MAS), which computes the importance of the parameters of a neural network in an unsupervised and online manner.

...read moreread less

Abstract: Humans can learn in a continuous manner. Old rarely utilized knowledge can be overwritten by new incoming information while important, frequently used knowledge is prevented from being erased. In artificial learning systems, lifelong learning so far has focused mainly on accumulating knowledge over tasks and overcoming catastrophic forgetting. In this paper, we argue that, given the limited model capacity and the unlimited new information to be learned, knowledge has to be preserved or erased selectively. Inspired by neuroplasticity, we propose a novel approach for lifelong learning, coined Memory Aware Synapses (MAS). It computes the importance of the parameters of a neural network in an unsupervised and online manner. Given a new sample which is fed to the network, MAS accumulates an importance measure for each parameter of the network, based on how sensitive the predicted output function is to a change in this parameter. When learning a new task, changes to important parameters can then be penalized, effectively preventing important knowledge related to previous tasks from being overwritten. Further, we show an interesting connection between a local version of our method and Hebb’s rule, which is a model for the learning process in the brain. We test our method on a sequence of object recognition tasks and on the challenging problem of learning an embedding for predicting triplets. We show state-of-the-art performance and, for the first time, the ability to adapt the importance of the parameters based on unlabeled data towards what the network needs (not) to forget, which may vary depending on test conditions.

...read moreread less

Posted Content•

Rethinking ImageNet Pre-training.

[...]

Kaiming He¹, Ross Girshick, Piotr Dollár•Institutions (1)

Facebook¹

21 Nov 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: Experiments show that ImageNet pre-training speeds up convergence early in training, but does not necessarily provide regularization or improve final target task accuracy, and these discoveries will encourage people to rethink the current de facto paradigm of `pre-training and fine-tuning' in computer vision.

...read moreread less

Abstract: We report competitive results on object detection and instance segmentation on the COCO dataset using standard models trained from random initialization. The results are no worse than their ImageNet pre-training counterparts even when using the hyper-parameters of the baseline system (Mask R-CNN) that were optimized for fine-tuning pre-trained models, with the sole exception of increasing the number of training iterations so the randomly initialized models may converge. Training from random initialization is surprisingly robust; our results hold even when: (i) using only 10% of the training data, (ii) for deeper and wider models, and (iii) for multiple tasks and metrics. Experiments show that ImageNet pre-training speeds up convergence early in training, but does not necessarily provide regularization or improve final target task accuracy. To push the envelope we demonstrate 50.9 AP on COCO object detection without using any external data---a result on par with the top COCO 2017 competition results that used ImageNet pre-training. These observations challenge the conventional wisdom of ImageNet pre-training for dependent tasks and we expect these discoveries will encourage people to rethink the current de facto paradigm of `pre-training and fine-tuning' in computer vision.

...read moreread less

Book Chapter•DOI•

Graph R-CNN for Scene Graph Generation

[...]

Jianwei Yang¹, Jiasen Lu¹, Stefan Lee¹, Dhruv Batra², Dhruv Batra¹, Devi Parikh², Devi Parikh¹ - Show less +3 more•Institutions (2)

Georgia Institute of Technology¹, Facebook²

08 Sep 2018

TL;DR: A novel scene graph generation model called Graph R-CNN, that is both effective and efficient at detecting objects and their relations in images, is proposed and a new evaluation metric is introduced that is more holistic and realistic than existing metrics.

...read moreread less

Abstract: We propose a novel scene graph generation model called Graph R-CNN, that is both effective and efficient at detecting objects and their relations in images. Our model contains a Relation Proposal Network (RePN) that efficiently deals with the quadratic number of potential relations between objects in an image. We also propose an attentional Graph Convolutional Network (aGCN) that effectively captures contextual information between objects and relations. Finally, we introduce a new evaluation metric that is more holistic and realistic than existing metrics. We report state-of-the-art performance on scene graph generation as evaluated using both existing and our proposed metrics.

...read moreread less

Proceedings Article•

Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks

[...]

Brenden M. Lake, Marco Baroni¹•Institutions (1)

Facebook¹

03 Jul 2018

TL;DR: This paper introduces the SCAN domain, consisting of a set of simple compositional navigation commands paired with the corresponding action sequences, and tests the zero-shot generalization capabilities of a variety of recurrent neural networks trained on SCAN with sequence-to-sequence methods.

...read moreread less

Abstract: Humans can understand and produce new utterances effortlessly, thanks to their compositional skills. Once a person learns the meaning of a new verb "dax," he or she can immediately understand the meaning of "dax twice" or "sing and dax." In this paper, we introduce the SCAN domain, consisting of a set of simple compositional navigation commands paired with the corresponding action sequences. We then test the zero-shot generalization capabilities of a variety of recurrent neural networks (RNNs) trained on SCAN with sequence-to-sequence methods. We find that RNNs can make successful zero-shot generalizations when the differences between training and test commands are small, so that they can apply "mix-and-match" strategies to solve the task. However, when generalization requires systematic compositional skills (as in the "dax" example above), RNNs fail spectacularly. We conclude with a proof-of-concept experiment in neural machine translation, suggesting that lack of systematicity might be partially responsible for neural networks' notorious training data thirst.

...read moreread less

Proceedings Article•DOI•

Hierarchical Neural Story Generation

[...]

Angela Fan¹, Michael Lewis², Yann N. Dauphin³•Institutions (3)

Harvard University¹, University of Pittsburgh², Facebook³

13 May 2018

TL;DR: This paper proposed a hierarchical story generation model, where the model first generates a premise, and then transforms it into a passage of text, and further improves the relevance of the story to the prompt by adding a gated multi-scale self-attention mechanism.

...read moreread less

Abstract: We explore story generation: creative systems that can build coherent and fluent passages of text about a topic. We collect a large dataset of 300K human-written stories paired with writing prompts from an online forum. Our dataset enables hierarchical story generation, where the model first generates a premise, and then transforms it into a passage of text. We gain further improvements with a novel form of model fusion that improves the relevance of the story to the prompt, and adding a new gated multi-scale self-attention mechanism to model long-range context. Experiments show large improvements over strong baselines on both automated and human evaluations. Human judges prefer stories generated by our approach to those from a strong non-hierarchical model by a factor of two to one.

...read moreread less

Proceedings Article•DOI•

Zero-Shot Recognition via Semantic Embeddings and Knowledge Graphs

[...]

Xiaolong Wang¹, Yufei Ye², Abhinav Gupta²•Institutions (2)

Facebook¹, Carnegie Mellon University²

18 Jun 2018

TL;DR: In this article, a graph convolutional network (GCN) is used to predict the visual classifiers of unseen categories, which is robust to noise in the learned knowledge graph (KG) given a semantic embedding for each node (representing visual category).

...read moreread less

Abstract: We consider the problem of zero-shot recognition: learning a visual classifier for a category with zero training examples, just using the word embedding of the category and its relationship to other categories, which visual data are provided. The key to dealing with the unfamiliar or novel category is to transfer knowledge obtained from familiar classes to describe the unfamiliar class. In this paper, we build upon the recently introduced Graph Convolutional Network (GCN) and propose an approach that uses both semantic embeddings and the categorical relationships to predict the classifiers. Given a learned knowledge graph (KG), our approach takes as input semantic embeddings for each node (representing visual category). After a series of graph convolutions, we predict the visual classifier for each category. During training, the visual classifiers for a few categories are given to learn the GCN parameters. At test time, these filters are used to predict the visual classifiers of unseen categories. We show that our approach is robust to noise in the KG. More importantly, our approach provides significant improvement in performance compared to the current state-of-the-art results (from 2 ~ 3% on some metrics to whopping 20% on a few).

...read moreread less

Collapse