
Showing papers by "Carnegie Mellon University" published in 2019


Proceedings Article
19 Jun 2019
TL;DR: The authors propose XLNet, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and overcomes the limitations of BERT thanks to its autoregressive formulation.
Abstract: With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.

3,863 citations
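
For reference, the permutation objective described above can be written as maximizing the expected autoregressive log-likelihood over factorization orders, with $\mathcal{Z}_T$ the set of permutations of a length-$T$ sequence (notation follows the paper):

$$ \max_{\theta}\; \mathbb{E}_{\mathbf{z}\sim\mathcal{Z}_T}\Big[ \sum_{t=1}^{T} \log p_{\theta}\big(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\big) \Big] $$

Because each token is, in expectation, predicted conditioned on every other position, the model sees bidirectional context without ever masking the input.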


Posted Content
TL;DR: XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and overcomes the limitations of BERT thanks to its autoregressive formulation.
Abstract: With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.

3,009 citations


Proceedings ArticleDOI
09 Jan 2019
TL;DR: This work proposes Transformer-XL, a novel neural architecture that enables learning dependency beyond a fixed length without disrupting temporal coherence, consisting of a segment-level recurrence mechanism and a novel positional encoding scheme.
Abstract: Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens. Our code, pretrained models, and hyperparameters are available in both Tensorflow and PyTorch.

2,353 citations
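
As a rough illustration of the segment-level recurrence described above, each layer consumes a cache of hidden states from the previous segment as extra attention context. This is a minimal sketch, not the paper's implementation; `attend` stands in for Transformer-XL's relative-position self-attention.

```python
import numpy as np

def transformer_xl_layer(attend, x_seg, memory, mem_len):
    """One layer with segment-level recurrence (schematic sketch).

    x_seg:  hidden states of the current segment, shape (seg_len, d)
    memory: cached hidden states from the previous segment, shape (m, d);
            treated as a constant (no gradient) during training
    """
    # Queries come from the current segment only, but keys/values also see
    # the cached memory, so dependencies can span segment boundaries.
    context = np.concatenate([memory, x_seg], axis=0)
    out = attend(x_seg, context)
    new_memory = context[-mem_len:]  # cache to reuse for the next segment
    return out, new_memory
```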


Journal ArticleDOI
TL;DR: This paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy to enable researchers to better understand the state of the field and identify directions for future research.
Abstract: Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced, and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.

1,945 citations


Journal ArticleDOI
Peter A. R. Ade, James E. Aguirre, Z. Ahmed, Simone Aiola, and 276 more authors (53 institutions)
TL;DR: The Simons Observatory (SO) is a new cosmic microwave background experiment being built on Cerro Toco in Chile, due to begin observations in the early 2020s.
Abstract: The Simons Observatory (SO) is a new cosmic microwave background experiment being built on Cerro Toco in Chile, due to begin observations in the early 2020s. We describe the scientific goals of the experiment, motivate the design, and forecast its performance. SO will measure the temperature and polarization anisotropy of the cosmic microwave background in six frequency bands centered at: 27, 39, 93, 145, 225 and 280 GHz. The initial configuration of SO will have three small-aperture 0.5-m telescopes and one large-aperture 6-m telescope, with a total of 60,000 cryogenic bolometers. Our key science goals are to characterize the primordial perturbations, measure the number of relativistic species and the mass of neutrinos, test for deviations from a cosmological constant, improve our understanding of galaxy evolution, and constrain the duration of reionization. The small aperture telescopes will target the largest angular scales observable from Chile, mapping ≈ 10% of the sky to a white noise level of 2 μK-arcmin in combined 93 and 145 GHz bands, to measure the primordial tensor-to-scalar ratio, r, at a target level of σ(r)=0.003. The large aperture telescope will map ≈ 40% of the sky at arcminute angular resolution to an expected white noise level of 6 μK-arcmin in combined 93 and 145 GHz bands, overlapping with the majority of the Large Synoptic Survey Telescope sky region and partially with the Dark Energy Spectroscopic Instrument. With up to an order of magnitude lower polarization noise than maps from the Planck satellite, the high-resolution sky maps will constrain cosmological parameters derived from the damping tail, gravitational lensing of the microwave background, the primordial bispectrum, and the thermal and kinematic Sunyaev-Zel'dovich effects, and will aid in delensing the large-angle polarization signal to measure the tensor-to-scalar ratio. The survey will also provide a legacy catalog of 16,000 galaxy clusters and more than 20,000 extragalactic sources.

1,027 citations


Journal ArticleDOI
02 Aug 2019-Science
TL;DR: 3D-bioprinted hearts accurately reproduce patient-specific anatomical structure as determined by micro-computed tomography, and printed cardiac ventricles showed synchronized contractions, directional action potential propagation, and wall thickening up to 14% during peak systole.
Abstract: Collagen is the primary component of the extracellular matrix in the human body. It has proved challenging to fabricate collagen scaffolds capable of replicating the structure and function of tissues and organs. We present a method to 3D-bioprint collagen using freeform reversible embedding of suspended hydrogels (FRESH) to engineer components of the human heart at various scales, from capillaries to the full organ. Control of pH-driven gelation provides 20-micrometer filament resolution, a porous microstructure that enables rapid cellular infiltration and microvascularization, and mechanical strength for fabrication and perfusion of multiscale vasculature and tri-leaflet valves. We found that FRESH 3D-bioprinted hearts accurately reproduce patient-specific anatomical structure as determined by micro-computed tomography. Cardiac ventricles printed with human cardiomyocytes showed synchronized contractions, directional action potential propagation, and wall thickening up to 14% during peak systole.

996 citations


Proceedings ArticleDOI
15 Jun 2019
TL;DR: Argoverse includes sensor data collected by a fleet of autonomous vehicles in Pittsburgh and Miami, 3D tracking annotations, 300k extracted interesting vehicle trajectories, and rich semantic maps whose geometric and semantic metadata are not currently available in any other public dataset.
Abstract: We present Argoverse, a dataset designed to support autonomous vehicle perception tasks including 3D tracking and motion forecasting. Argoverse includes sensor data collected by a fleet of autonomous vehicles in Pittsburgh and Miami as well as 3D tracking annotations, 300k extracted interesting vehicle trajectories, and rich semantic maps. The sensor data consists of 360 degree images from 7 cameras with overlapping fields of view, forward-facing stereo imagery, 3D point clouds from long range LiDAR, and 6-DOF pose. Our 290km of mapped lanes contain rich geometric and semantic metadata which are not currently available in any public dataset. All data is released under a Creative Commons license at Argoverse.org. In baseline experiments, we use map information such as lane direction, driveable area, and ground height to improve the accuracy of 3D object tracking. We use 3D object tracking to mine for more than 300k interesting vehicle trajectories to create a trajectory forecasting benchmark. Motion forecasting experiments ranging in complexity from classical methods (k-NN) to LSTMs demonstrate that using detailed vector maps with lane-level information substantially reduces prediction error. Our tracking and forecasting experiments represent only a superficial exploration of the potential of rich maps in robotic perception. We hope that Argoverse will enable the research community to explore these problems in greater depth.

950 citations


Journal ArticleDOI
Željko Ivezić, Steven M. Kahn, J. Anthony Tyson, Bob Abel, and 332 more authors (55 institutions)
TL;DR: The Large Synoptic Survey Telescope (LSST) is a large, wide-field ground-based system designed to obtain repeated images covering the sky visible from Cerro Pachon in northern Chile.
Abstract: We describe here the most ambitious survey currently planned in the optical, the Large Synoptic Survey Telescope (LSST). The LSST design is driven by four main science themes: probing dark energy and dark matter, taking an inventory of the solar system, exploring the transient optical sky, and mapping the Milky Way. LSST will be a large, wide-field ground-based system designed to obtain repeated images covering the sky visible from Cerro Pachon in northern Chile. The telescope will have an 8.4 m (6.5 m effective) primary mirror, a 9.6 deg^2 field of view, a 3.2-gigapixel camera, and six filters (ugrizy) covering the wavelength range 320–1050 nm. The project is in the construction phase and will begin regular survey operations by 2022. About 90% of the observing time will be devoted to a deep-wide-fast survey mode that will uniformly observe an 18,000 deg^2 region about 800 times (summed over all six bands) during the anticipated 10 yr of operations and will yield a co-added map to r ~ 27.5. These data will result in databases including about 32 trillion observations of 20 billion galaxies and a similar number of stars, and they will serve the majority of the primary science programs. The remaining 10% of the observing time will be allocated to special projects such as Very Deep and Very Fast time domain surveys, whose details are currently under discussion. We illustrate how the LSST science drivers led to these choices of system parameters, and we describe the expected data products and their characteristics.

921 citations


Posted Content
TL;DR: Strong empirical results suggest that randomized smoothing is a promising direction for future research into adversarially robust classification; on smaller-scale datasets where competing approaches to certified $\ell_2$ robustness are viable, smoothing delivers higher certified accuracies.
Abstract: We show how to turn any classifier that classifies well under Gaussian noise into a new classifier that is certifiably robust to adversarial perturbations under the $\ell_2$ norm. This "randomized smoothing" technique has been proposed recently in the literature, but existing guarantees are loose. We prove a tight robustness guarantee in $\ell_2$ norm for smoothing with Gaussian noise. We use randomized smoothing to obtain an ImageNet classifier with e.g. a certified top-1 accuracy of 49% under adversarial perturbations with $\ell_2$ norm less than 0.5 (=127/255). No certified defense has been shown feasible on ImageNet except for smoothing. On smaller-scale datasets where competing approaches to certified $\ell_2$ robustness are viable, smoothing delivers higher certified accuracies. Our strong empirical results suggest that randomized smoothing is a promising direction for future research into adversarially robust classification. Code and models are available at this http URL.

719 citations
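
The certificate in this paper has a closed form: if the smoothed classifier's top class has probability at least $p_A$ and the runner-up at most $p_B$ under Gaussian noise $\mathcal{N}(0, \sigma^2 I)$, the prediction is constant within $\ell_2$ radius $R = \tfrac{\sigma}{2}\big(\Phi^{-1}(p_A) - \Phi^{-1}(p_B)\big)$. A small sketch (in practice the probability bounds come from Monte Carlo sampling with a confidence correction, omitted here):

```python
from scipy.stats import norm

def certified_radius(sigma: float, p_a: float, p_b: float) -> float:
    """Certified L2 radius for Gaussian randomized smoothing:
    R = sigma/2 * (Phi^{-1}(p_a) - Phi^{-1}(p_b)), valid when p_a > p_b."""
    return 0.5 * sigma * (norm.ppf(p_a) - norm.ppf(p_b))

# Example: noise level 0.25, top-class lower bound 0.8, runner-up upper bound 0.2
print(certified_radius(0.25, 0.8, 0.2))  # ~0.21
```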


Proceedings Article
01 Jan 2019
TL;DR: In this paper, randomized smoothing is used to obtain an ImageNet classifier with a certified top-1 accuracy of 49% under adversarial perturbations with $\ell_2$ norm less than 0.5.
Abstract: We show how to turn any classifier that classifies well under Gaussian noise into a new classifier that is certifiably robust to adversarial perturbations under the $\ell_2$ norm. This "randomized smoothing" technique has been proposed recently in the literature, but existing guarantees are loose. We prove a tight robustness guarantee in $\ell_2$ norm for smoothing with Gaussian noise. We use randomized smoothing to obtain an ImageNet classifier with e.g. a certified top-1 accuracy of 49% under adversarial perturbations with $\ell_2$ norm less than 0.5 (=127/255). No certified defense has been shown feasible on ImageNet except for smoothing. On smaller-scale datasets where competing approaches to certified $\ell_2$ robustness are viable, smoothing delivers higher certified accuracies. Our strong empirical results suggest that randomized smoothing is a promising direction for future research into adversarially robust classification. Code and models are available at this http URL.

714 citations


Journal ArticleDOI
28 May 2019-JAMA
TL;DR: In this retrospective analysis of data sets from patients with sepsis, 4 clinical phenotypes were identified that correlated with host-response patterns and clinical outcomes, and simulations suggested these phenotypes may help in understanding heterogeneity of treatment effects.
Abstract: Importance Sepsis is a heterogeneous syndrome. Identification of distinct clinical phenotypes may allow more precise therapy and improve care. Objective To derive sepsis phenotypes from clinical data, determine their reproducibility and correlation with host-response biomarkers and clinical outcomes, and assess the potential causal relationship with results from randomized clinical trials (RCTs). Design, Setting, and Participants Retrospective analysis of data sets using statistical, machine learning, and simulation tools. Phenotypes were derived among 20 189 total patients (16 552 unique patients) who met Sepsis-3 criteria within 6 hours of hospital presentation at 12 Pennsylvania hospitals (2010-2012) using consensus k-means clustering applied to 29 variables. Reproducibility and correlation with biological parameters and clinical outcomes were assessed in a second database (2013-2014; n = 43 086 total patients and n = 31 160 unique patients), in a prospective cohort study of sepsis due to pneumonia (n = 583), and in 3 sepsis RCTs (n = 4737). Exposures All clinical and laboratory variables in the electronic health record. Main Outcomes and Measures Derived phenotype (α, β, γ, and δ) frequency, host-response biomarkers, 28-day and 365-day mortality, and RCT simulation outputs. Results The derivation cohort included 20 189 patients with sepsis (mean age, 64 [SD, 17] years; 10 022 [50%] male; mean maximum 24-hour Sequential Organ Failure Assessment [SOFA] score, 3.9 [SD, 2.4]). The validation cohort included 43 086 patients (mean age, 67 [SD, 17] years; 21 993 [51%] male; mean maximum 24-hour SOFA score, 3.6 [SD, 2.0]). Of the 4 derived phenotypes, the α phenotype was the most common (n = 6625; 33%) and included patients with the lowest administration of a vasopressor; in the β phenotype (n = 5512; 27%), patients were older and had more chronic illness and renal dysfunction; in the γ phenotype (n = 5385; 27%), patients had more inflammation and pulmonary dysfunction; and in the δ phenotype (n = 2667; 13%), patients had more liver dysfunction and septic shock. Phenotype distributions were similar in the validation cohort. There were consistent differences in biomarker patterns by phenotype. In the derivation cohort, cumulative 28-day mortality was 287 deaths of 5691 unique patients (5%) for the α phenotype; 561 of 4420 (13%) for the β phenotype; 1031 of 4318 (24%) for the γ phenotype; and 897 of 2223 (40%) for the δ phenotype. Across all cohorts and trials, 28-day and 365-day mortality were highest among the δ phenotype vs the other 3 phenotypes (P < .001). In the simulation models, the proportion of RCTs reporting benefit, harm, or no effect changed considerably with shifts in phenotype frequencies (e.g., from a >33% chance of benefit to a >60% chance of harm). Conclusions and Relevance In this retrospective analysis of data sets from patients with sepsis, 4 clinical phenotypes were identified that correlated with host-response patterns and clinical outcomes, and simulations suggested these phenotypes may help in understanding heterogeneity of treatment effects. Further research is needed to determine the utility of these phenotypes in clinical care and for informing trial design and interpretation.

Proceedings ArticleDOI
15 Jun 2019
TL;DR: The proposed AS-GCN achieves consistently large improvements over state-of-the-art methods and also shows promising results for future pose prediction.
Abstract: Action recognition with skeleton data has recently attracted much attention in computer vision. Previous studies are mostly based on fixed skeleton graphs, only capturing local physical dependencies among joints, which may miss implicit joint correlations. To capture richer dependencies, we introduce an encoder-decoder structure, called A-link inference module, to capture action-specific latent dependencies, i.e. actional links, directly from actions. We also extend the existing skeleton graphs to represent higher-order dependencies, i.e. structural links. Combining the two types of links into a generalized skeleton graph, we further propose the actional-structural graph convolution network (AS-GCN), which stacks actional-structural graph convolution and temporal convolution as a basic building block, to learn both spatial and temporal features for action recognition. A future pose prediction head is added in parallel to the recognition head to help capture more detailed action patterns through self-supervision. We validate AS-GCN in action recognition using two skeleton data sets, NTU-RGB+D and Kinetics. The proposed AS-GCN achieves consistently large improvement compared to the state-of-the-art methods. As a side product, AS-GCN also shows promising results for future pose prediction.
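
Schematically, each actional-structural graph convolution aggregates over the two link types described above. This is a simplified single-layer sketch with hypothetical variable names (the actual block uses several structural-link orders plus temporal convolution):

```python
import numpy as np

def as_gc_layer(x, a_struct, a_act, w_struct, w_act):
    """One simplified actional-structural graph convolution.

    x:        per-joint features, shape (num_joints, d_in)
    a_struct: normalized structural-link adjacency (skeleton + higher-order)
    a_act:    actional-link adjacency inferred by the A-link module
    w_*:      weight matrices, shape (d_in, d_out)
    """
    # Aggregate over both link types, then apply a ReLU nonlinearity.
    return np.maximum(a_struct @ x @ w_struct + a_act @ x @ w_act, 0.0)
```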

Posted Content
TL;DR: The results reveal that reducing intra-class variation is an important factor when the feature backbone is shallow, but not as critical when using deeper backbones, and a baseline method with a standard fine-tuning practice compares favorably against other state-of-the-art few-shot learning algorithms.
Abstract: Few-shot classification aims to learn a classifier to recognize unseen classes during training with limited labeled examples. While significant progress has been made, the growing complexity of network designs, meta-learning algorithms, and differences in implementation details make a fair comparison difficult. In this paper, we present 1) a consistent comparative analysis of several representative few-shot classification algorithms, with results showing that deeper backbones significantly reduce the performance differences among methods on datasets with limited domain differences, 2) a modified baseline method that surprisingly achieves competitive performance when compared with the state-of-the-art on both the mini-ImageNet and the CUB datasets, and 3) a new experimental setting for evaluating the cross-domain generalization ability for few-shot classification algorithms. Our results reveal that reducing intra-class variation is an important factor when the feature backbone is shallow, but not as critical when using deeper backbones. In a realistic cross-domain evaluation setting, we show that a baseline method with a standard fine-tuning practice compares favorably against other state-of-the-art few-shot learning algorithms.
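
The baseline referred to above amounts to "freeze the pretrained backbone, then fit a new linear head on the few labeled support examples." A minimal PyTorch-style sketch under that reading (names are illustrative):

```python
import torch
import torch.nn as nn

def finetune_baseline(backbone, support_x, support_y, n_way, feat_dim, steps=100):
    """Fit a fresh linear classifier on support features from a frozen backbone."""
    backbone.eval()
    with torch.no_grad():                       # backbone stays frozen
        feats = backbone(support_x)             # (n_way * n_shot, feat_dim)
    clf = nn.Linear(feat_dim, n_way)
    opt = torch.optim.SGD(clf.parameters(), lr=0.01)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(clf(feats), support_y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return clf
```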

Proceedings Article
24 May 2019
TL;DR: TRADES decomposes the prediction error for adversarial examples (robust error) into the sum of the natural (classification) error and the boundary error, and provides a differentiable upper bound using the theory of classification-calibrated loss.
Abstract: We identify a trade-off between robustness and accuracy that serves as a guiding principle in the design of defenses against adversarial examples. Although this problem has been widely studied empirically, much remains unknown concerning the theory underlying this trade-off. In this work, we decompose the prediction error for adversarial examples (robust error) as the sum of the natural (classification) error and boundary error, and provide a differentiable upper bound using the theory of classification-calibrated loss, which is shown to be the tightest possible upper bound uniform over all probability distributions and measurable predictors. Inspired by our theoretical analysis, we also design a new defense method, TRADES, to trade adversarial robustness off against accuracy. Our proposed algorithm performs well experimentally on real-world datasets. The methodology is the foundation of our entry to the NeurIPS 2018 Adversarial Vision Challenge in which we won the 1st place out of ~2,000 submissions, surpassing the runner-up approach by $11.41\%$ in terms of mean $\ell_2$ perturbation distance.
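
The decomposition underlying TRADES, and the regularized surrogate it is commonly implemented with (CE = cross-entropy; $\beta$ trades accuracy against robustness), can be summarized as:

$$ \mathcal{R}_{\mathrm{rob}}(f) = \mathcal{R}_{\mathrm{nat}}(f) + \mathcal{R}_{\mathrm{bdy}}(f), \qquad \min_{f}\; \mathbb{E}_{(X,Y)}\Big[ \mathrm{CE}\big(f(X),Y\big) + \beta \max_{\|X'-X\|\le\epsilon} \mathrm{KL}\big(f(X)\,\|\,f(X')\big) \Big] $$

The KL term pushes the decision boundary away from the data, directly targeting the boundary-error component of the decomposition.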

Book ChapterDOI
Matej Kristan, Ales Leonardis, Jiří Matas, Michael Felsberg, and 155 more authors (47 institutions)
23 Jan 2019
TL;DR: The Visual Object Tracking challenge VOT2018 is the sixth annual tracker benchmarking activity organized by the VOT initiative; results of over eighty trackers are presented, many of them state-of-the-art trackers published at major computer vision conferences or in journals in recent years.
Abstract: The Visual Object Tracking challenge VOT2018 is the sixth annual tracker benchmarking activity organized by the VOT initiative. Results of over eighty trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The evaluation included the standard VOT and other popular methodologies for short-term tracking analysis and a “real-time” experiment simulating a situation where a tracker processes images as if provided by a continuously running sensor. A long-term tracking sub-challenge has been introduced to the set of standard VOT sub-challenges. The new sub-challenge focuses on long-term tracking properties, namely coping with target disappearance and reappearance. A new dataset has been compiled and a performance evaluation methodology that focuses on long-term tracking capabilities has been adopted. The VOT toolkit has been updated to support both the standard short-term and the new long-term tracking sub-challenges. Performance of the tested trackers typically far exceeds that of standard baselines. The source code for most of the trackers is publicly available from the VOT page. The dataset, the evaluation kit and the results are publicly available at the challenge website (http://votchallenge.net).

Proceedings Article
25 May 2019
TL;DR: The authors make the surprising observation that even if models have been trained using multiple heads, in practice a large percentage of attention heads can be removed at test time without significantly impacting performance.
Abstract: Multi-headed attention is a driving force behind recent state-of-the-art NLP models. By applying multiple attention mechanisms in parallel, it can express sophisticated functions beyond the simple weighted average. However we observe that, in practice, a large proportion of attention heads can be removed at test time without significantly impacting performance, and that some layers can even be reduced to a single head. Further analysis on machine translation models reveals that the self-attention layers can be significantly pruned, while the encoder-decoder layers are more dependent on multi-headedness.
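
The ablation behind this observation amounts to masking head outputs at test time and re-scoring the model. A minimal sketch (shapes are illustrative; real implementations apply the mask inside each attention layer before the output projection):

```python
import torch

def mask_heads(per_head_outputs: torch.Tensor, head_mask: torch.Tensor) -> torch.Tensor:
    """Zero out selected attention heads.

    per_head_outputs: (batch, n_heads, seq_len, head_dim)
    head_mask:        (n_heads,) with 1 = keep, 0 = prune
    """
    return per_head_outputs * head_mask.view(1, -1, 1, 1)

# Example: prune heads 0 and 3 of an 8-head layer, then re-evaluate the model.
mask = torch.ones(8)
mask[[0, 3]] = 0.0
```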

Proceedings Article
24 May 2019
TL;DR: This paper proves that gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet), and extends the analysis to deep residual convolutional neural networks, obtaining a similar convergence result.
Abstract: Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex. The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet). Our analysis relies on the particular structure of the Gram matrix induced by the neural network architecture. This structure allows us to show the Gram matrix is stable throughout the training process and this stability implies the global optimality of the gradient descent algorithm. We further extend our analysis to deep residual convolutional neural networks and obtain a similar convergence result.
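
For intuition, in the two-layer analysis this line of work builds on, the key object is the Gram matrix of the infinite-width network; if its least eigenvalue $\lambda_0$ is positive, gradient descent with a small step size $\eta$ converges linearly to zero training loss. A sketch of the result's form (not the deep-ResNet statement itself), where $\mathbf{u}(k)$ is the vector of network outputs on the training set after $k$ steps:

$$ \mathbf{H}^{\infty}_{ij} = \mathbb{E}_{\mathbf{w}\sim\mathcal{N}(0,\mathbf{I})}\big[ \sigma'(\mathbf{w}^{\top}\mathbf{x}_i)\, \sigma'(\mathbf{w}^{\top}\mathbf{x}_j) \big]\, \mathbf{x}_i^{\top}\mathbf{x}_j, \qquad \|\mathbf{y}-\mathbf{u}(k)\|_2^2 \le \big(1 - \tfrac{\eta\lambda_0}{2}\big)^{k}\, \|\mathbf{y}-\mathbf{u}(0)\|_2^2 $$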

Proceedings ArticleDOI
15 Jun 2019
TL;DR: Contrastive Adaptation Network (CAN) optimizes a new metric that explicitly models the intra-class domain discrepancy and the inter-class domain discrepancy, and is trained end-to-end with an alternating update strategy.
Abstract: Unsupervised Domain Adaptation (UDA) makes predictions for the target domain data while manual annotations are only available in the source domain. Previous methods minimize the domain discrepancy neglecting the class information, which may lead to misalignment and poor generalization performance. To address this issue, this paper proposes Contrastive Adaptation Network (CAN) optimizing a new metric which explicitly models the intra-class domain discrepancy and the inter-class domain discrepancy. We design an alternating update strategy for training CAN in an end-to-end manner. Experiments on two real-world benchmarks Office-31 and VisDA-2017 demonstrate that CAN performs favorably against the state-of-the-art methods and produces more discriminative features.
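
Schematically, the contrastive discrepancy that CAN optimizes pulls same-class source/target features together and pushes different-class features apart. With $\hat{D}^{cc'}$ an MMD-style estimate between class-$c$ source features and class-$c'$ target features over $M$ classes, the metric has the shape below (exact weighting follows the paper):

$$ \hat{D}_{\mathrm{cdd}} = \frac{1}{M}\sum_{c=1}^{M} \hat{D}^{cc} \;-\; \frac{1}{M(M-1)}\sum_{c=1}^{M}\sum_{c'\neq c} \hat{D}^{cc'} $$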

Journal ArticleDOI
TL;DR: In this article, a new type of atomically dispersed Co doped carbon catalyst with a core-shell structure has been developed via a surfactant-assisted metal-organic framework approach.
Abstract: Development of platinum group metal (PGM)-free catalysts for oxygen reduction reaction (ORR) is essential for affordable proton exchange membrane fuel cells. Herein, a new type of atomically dispersed Co doped carbon catalyst with a core–shell structure has been developed via a surfactant-assisted metal–organic framework approach. The cohesive interactions between the selected surfactant and the Co-doped zeolitic imidazolate framework (ZIF-8) nanocrystals lead to a unique confinement effect. During the thermal activation, this confinement effect suppressed the agglomeration of Co atomic sites and mitigated the collapse of internal microporous structures of ZIF-8. Among the studied surfactants, Pluronic F127 block copolymer led to the greatest performance gains with a doubling of the active site density relative to that of the surfactant-free catalyst. According to density functional theory calculations, unlike other Co catalysts, this new atomically dispersed Co–N–C@F127 catalyst is believed to contain substantial CoN2+2 sites, which are active and thermodynamically favorable for the four-electron ORR pathway. The Co–N–C@F127 catalyst exhibits an unprecedented ORR activity with a half-wave potential (E1/2) of 0.84 V (vs. RHE) as well as enhanced stability in the corrosive acidic media. It also demonstrated high initial performance with a power density of 0.87 W cm−2 along with encouraging durability in H2–O2 fuel cells. The atomically dispersed Co site catalyst approaches that of the Fe–N–C catalyst and represents the highest reported PGM-free and Fe-free catalyst performance.

Proceedings ArticleDOI
02 Mar 2019
TL;DR: The FSAF module robustly improves the baseline RetinaNet by a large margin under various settings, while introducing nearly free inference overhead, and the resulting best model can achieve a state-of-the-art 44.6% mAP, outperforming all existing single-shot detectors on COCO.
Abstract: We motivate and present feature selective anchor-free (FSAF) module, a simple and effective building block for single-shot object detectors. It can be plugged into single-shot detectors with feature pyramid structure. The FSAF module addresses two limitations brought up by the conventional anchor-based detection: 1) heuristic-guided feature selection; 2) overlap-based anchor sampling. The general concept of the FSAF module is online feature selection applied to the training of multi-level anchor-free branches. Specifically, an anchor-free branch is attached to each level of the feature pyramid, allowing box encoding and decoding in the anchor-free manner at an arbitrary level. During training, we dynamically assign each instance to the most suitable feature level. At the time of inference, the FSAF module can work independently or jointly with anchor-based branches. We instantiate this concept with simple implementations of anchor-free branches and online feature selection strategy. Experimental results on the COCO detection track show that our FSAF module performs better than anchor-based counterparts while being faster. When working jointly with anchor-based branches, the FSAF module robustly improves the baseline RetinaNet by a large margin under various settings, while introducing nearly free inference overhead. And the resulting best model can achieve a state-of-the-art 44.6% mAP, outperforming all existing single-shot detectors on COCO.
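
The online feature selection described above reduces to a small training-time rule: route each ground-truth instance to the pyramid level whose anchor-free branch currently incurs the lowest loss. A schematic sketch; `branch.losses` is a hypothetical stand-in for the focal and IoU losses:

```python
def select_feature_level(instance, branches):
    """Assign an instance to the pyramid level with the smallest current loss."""
    best_level, best_loss = None, float("inf")
    for level, branch in enumerate(branches):
        cls_loss, reg_loss = branch.losses(instance)  # focal + IoU losses
        if cls_loss + reg_loss < best_loss:
            best_loss, best_level = cls_loss + reg_loss, level
    # Gradients then flow only through the selected level for this instance.
    return best_level
```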

Proceedings Article
26 Apr 2019
TL;DR: The current paper gives the first efficient exact algorithm for computing the extension of NTK to convolutional neural nets, which is called Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm.
Abstract: How well does a classic deep net architecture like AlexNet or VGG19 classify on a standard dataset such as CIFAR-10 when its “width”— namely, number of channels in convolutional layers, and number of nodes in fully-connected internal layers — is allowed to increase to infinity? Such questions have come to the forefront in the quest to theoretically understand deep learning and its mysteries about optimization and generalization. They also connect deep learning to notions such as Gaussian processes and kernels. A recent paper [Jacot et al., 2018] introduced the Neural Tangent Kernel (NTK) which captures the behavior of fully-connected deep nets in the infinite width limit trained by gradient descent; this object was implicit in some other recent papers. An attraction of such ideas is that a pure kernel-based method is used to capture the power of a fully-trained deep net of infinite width. The current paper gives the first efficient exact algorithm for computing the extension of NTK to convolutional neural nets, which we call Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm. This results in a significant new benchmark for performance of a pure kernel-based method on CIFAR-10, being 10% higher than the methods reported in [Novak et al., 2019], and only 6% lower than the performance of the corresponding finite deep net architecture (once batch normalization etc. are turned off). Theoretically, we also give the first non-asymptotic proof showing that a fully-trained sufficiently wide net is indeed equivalent to the kernel regression predictor using NTK.
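
For context, the NTK being extended here is the expected inner product of network gradients at random initialization, and the associated infinite-width predictor is kernel regression with that kernel (ridgeless form shown; $X$ and $\mathbf{y}$ are the training inputs and labels):

$$ \Theta(\mathbf{x},\mathbf{x}') = \mathbb{E}_{\theta}\big[ \langle \nabla_{\theta} f(\theta,\mathbf{x}),\, \nabla_{\theta} f(\theta,\mathbf{x}') \rangle \big], \qquad f_{*}(\mathbf{x}) = \Theta(\mathbf{x},X)\,\Theta(X,X)^{-1}\,\mathbf{y} $$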

Journal ArticleDOI
22 Feb 2019-Science
TL;DR: The direct visualization of the keyhole morphology and dynamics with high-energy x-rays shows that (i) keyholes are present across the range of power and scanning velocity used in laser powder bed fusion; and (ii) there is a well-defined threshold from conduction mode to keyhole based on laser power density.
Abstract: We used ultrahigh-speed synchrotron x-ray imaging to quantify the phenomenon of vapor depressions (also known as keyholes) during laser melting of metals as practiced in additive manufacturing. Although expected from welding and inferred from postmortem cross sections of fusion zones, the direct visualization of the keyhole morphology and dynamics with high-energy x-rays shows that (i) keyholes are present across the range of power and scanning velocity used in laser powder bed fusion; (ii) there is a well-defined threshold from conduction mode to keyhole based on laser power density; and (iii) the transition follows the sequence of vaporization, depression of the liquid surface, instability, and then deep keyhole formation. These and other aspects provide a physical basis for three-dimensional printing in laser powder bed machines.

Proceedings ArticleDOI
13 Mar 2019
TL;DR: PointNetLK unrolls PointNet and the Lucas & Kanade (LK) algorithm into a single trainable recurrent deep neural network for point cloud registration.
Abstract: PointNet has revolutionized how we think about representing point clouds. For classification and segmentation tasks, the approach and its subsequent variants/extensions are considered state-of-the-art. To date, the successful application of PointNet to point cloud registration has remained elusive. In this paper we argue that PointNet itself can be thought of as a learnable "imaging" function. As a consequence, classical vision algorithms for image alignment can be brought to bear on the problem -- namely the Lucas & Kanade (LK) algorithm. Our central innovations stem from: (i) how to modify the LK algorithm to accommodate the PointNet imaging function, and (ii) unrolling PointNet and the LK algorithm into a single trainable recurrent deep neural network. We describe the architecture, and compare its performance against state-of-the-art in several common registration scenarios. The architecture offers some remarkable properties including: generalization across shape categories and computational efficiency -- opening up new paths of exploration for the application of deep learning to point cloud registration. Code and videos are available at https://github.com/hmgoforth/PointNetLK.
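
A sketch of the LK-style update on PointNet features $\phi$: compute the feature residual between the target and the transformed source cloud, solve for a twist increment, and compose. The notation here is schematic; in the paper the Jacobian $\mathbf{J}$ is approximated once on the template by finite differences (inverse-compositional form):

$$ \mathbf{r} = \phi(\mathbf{P}_{\mathrm{tgt}}) - \phi(G \cdot \mathbf{P}_{\mathrm{src}}), \qquad \boldsymbol{\xi} = \mathbf{J}^{+}\,\mathbf{r}, \qquad G \leftarrow \Delta G(\boldsymbol{\xi}) \cdot G $$

iterated until the increment $\boldsymbol{\xi} \in \mathfrak{se}(3)$ becomes small.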

Proceedings ArticleDOI
02 May 2019
TL;DR: This paper proposes a conceptual framework for building human-centered, decision-theory-driven XAI based on an extensive review across philosophy and psychology, and identifies pathways along which human cognitive patterns drive needs for XAI, as well as ways XAI can mitigate common cognitive biases.
Abstract: From healthcare to criminal justice, artificial intelligence (AI) is increasingly supporting high-consequence human decisions. This has spurred the field of explainable AI (XAI). This paper seeks to strengthen empirical application-specific investigations of XAI by exploring theoretical underpinnings of human decision making, drawing from the fields of philosophy and psychology. In this paper, we propose a conceptual framework for building human-centered, decision-theory-driven XAI based on an extensive review across these fields. Drawing on this framework, we identify pathways along which human cognitive patterns drives needs for building XAI and how XAI can mitigate common cognitive biases. We then put this framework into practice by designing and implementing an explainable clinical diagnostic tool for intensive care phenotyping and conducting a co-design exercise with clinicians. Thereafter, we draw insights into how this framework bridges algorithm-generated explanations and human decision-making theories. Finally, we discuss implications for XAI design and development.

Journal ArticleDOI
TL;DR: The authors measure cosmic weak lensing shear power spectra with the Subaru Hyper Suprime-Cam (HSC) survey first-year shear catalog covering 137 deg^2 of the sky.
Abstract: We measure cosmic weak lensing shear power spectra with the Subaru Hyper Suprime-Cam (HSC) survey first-year shear catalog covering 137 deg^2 of the sky. Thanks to the high effective galaxy number density of ∼17 arcmin^−2, even after conservative cuts such as a magnitude cut of i < 24.5 and photometric redshift cut of 0.3 ≤ z ≤ 1.5, we obtain a high-significance measurement of the cosmic shear power spectra in four tomographic redshift bins, achieving a total signal-to-noise ratio of 16 in the multipole range 300 ≤ l ≤ 1900. We carefully account for various uncertainties in our analysis including the intrinsic alignment of galaxies, scatters and biases in photometric redshifts, residual uncertainties in the shear measurement, and modeling of the matter power spectrum. The accuracy of our power spectrum measurement method as well as our analytic model of the covariance matrix are tested against realistic mock shear catalogs. For a flat Λ cold dark matter model, we find $S_8 \equiv \sigma_8(\Omega_{\rm m}/0.3)^{\alpha} = 0.800^{+0.029}_{-0.028}$ for α = 0.45 ($S_8 = 0.780^{+0.030}_{-0.033}$ for α = 0.5) from our HSC tomographic cosmic shear analysis alone. In comparison with Planck cosmic microwave background constraints, our results prefer slightly lower values of S_8, although metrics such as the Bayesian evidence ratio test do not show significant evidence for discordance between these results. We study the effect of possible additional systematic errors that are unaccounted for in our fiducial cosmic shear analysis, and find that they can shift the best-fit values of S_8 by up to ∼0.6σ in both directions. The full HSC survey data will contain several times more area, and will lead to significantly improved cosmological constraints.

Proceedings ArticleDOI
15 Jun 2019
TL;DR: Stereo R-CNN is a 3D object detection method for autonomous driving that fully exploits the sparse and dense, semantic and geometric information in stereo imagery; it adds extra branches after the stereo Region Proposal Network (RPN) to predict sparse keypoints, viewpoints, and object dimensions, which are combined with 2D left-right boxes to calculate a coarse 3D bounding box.
Abstract: We propose a 3D object detection method for autonomous driving by fully exploiting the sparse and dense, semantic and geometry information in stereo imagery. Our method, called Stereo R-CNN, extends Faster R-CNN for stereo inputs to simultaneously detect and associate objects in left and right images. We add extra branches after the stereo Region Proposal Network (RPN) to predict sparse keypoints, viewpoints, and object dimensions, which are combined with 2D left-right boxes to calculate a coarse 3D object bounding box. We then recover the accurate 3D bounding box by a region-based photometric alignment using left and right RoIs. Our method does not require depth input or 3D position supervision, yet it outperforms all existing fully supervised image-based methods. Experiments on the challenging KITTI dataset show that our method outperforms the state-of-the-art stereo-based method by around 30% AP on both 3D detection and 3D localization tasks. Code will be made publicly available.

Proceedings ArticleDOI
27 Oct 2019
TL;DR: PipeDream is presented, a system that adds inter-batch pipelining to intra-batch parallelism to further improve parallel training throughput, helping to better overlap computation with communication and reduce the amount of communication when possible.
Abstract: DNN training is extremely time-consuming, necessitating efficient multi-accelerator parallelization. Current approaches to parallelizing training primarily use intra-batch parallelization, where a single iteration of training is split over the available workers, but suffer from diminishing returns at higher worker counts. We present PipeDream, a system that adds inter-batch pipelining to intra-batch parallelism to further improve parallel training throughput, helping to better overlap computation with communication and reduce the amount of communication when possible. Unlike traditional pipelining, DNN training is bi-directional, where a forward pass through the computation graph is followed by a backward pass that uses state and intermediate data computed during the forward pass. Naive pipelining can thus result in mismatches in state versions used in the forward and backward passes, or excessive pipeline flushes and lower hardware efficiency. To address these challenges, PipeDream versions model parameters for numerically correct gradient computations, and schedules forward and backward passes of different minibatches concurrently on different workers with minimal pipeline stalls. PipeDream also automatically partitions DNN layers among workers to balance work and minimize communication. Extensive experimentation with a range of DNN tasks, models, and hardware configurations shows that PipeDream trains models to high accuracy up to 5.3X faster than commonly used intra-batch parallelism techniques.
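
The weight-versioning idea described above can be sketched in a few lines: stash the parameter version each in-flight minibatch used in its forward pass so that its backward pass sees the same weights. This is a schematic sketch, not PipeDream's implementation; `stage.params`, `stage.forward`, and `stage.backward` are hypothetical:

```python
import copy

class WeightStash:
    """Keep per-minibatch parameter versions for numerically correct gradients."""

    def __init__(self):
        self.versions = {}

    def forward(self, stage, minibatch_id, x):
        # Remember exactly which weights this minibatch saw on the way forward.
        self.versions[minibatch_id] = copy.deepcopy(stage.params)
        return stage.forward(x, stage.params)

    def backward(self, stage, minibatch_id, grad):
        # Use the stashed version so the forward and backward passes match.
        params = self.versions.pop(minibatch_id)
        return stage.backward(grad, params)
```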

Journal ArticleDOI
TL;DR: This paper aims to give an introduction to and a brief review of the computational methods for causal discovery that were developed in the past three decades, including constraint-based and score-based methods and those based on functional causal models, supplemented by some illustrations and applications.
Abstract: A fundamental task in various disciplines of science, including biology, is to find underlying causal relations and make use of them. Causal relations can be seen if interventions are properly applied; however, in many cases they are difficult or even impossible to conduct. It is then necessary to discover causal relations by analyzing statistical properties of purely observational data, which is known as causal discovery or causal structure search. This paper aims to give an introduction to and a brief review of the computational methods for causal discovery that were developed in the past three decades, including constraint-based and score-based methods and those based on functional causal models, supplemented by some illustrations and applications.
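
To make the constraint-based idea concrete, here is a deliberately simplified skeleton-search sketch in the spirit of the PC algorithm: start from a complete undirected graph and delete an edge whenever a conditional-independence test succeeds. The real PC algorithm restricts conditioning sets to current neighbors and then orients edges; `ci_test` is a user-supplied test, e.g. partial correlation.

```python
from itertools import combinations

def skeleton_search(nodes, ci_test, max_cond=2):
    """PC-style skeleton discovery (simplified).

    ci_test(x, y, s) -> True if X and Y are conditionally independent given
    the tuple of variables s, according to the observational data.
    """
    edges = {frozenset(pair) for pair in combinations(nodes, 2)}
    for size in range(max_cond + 1):
        for edge in list(edges):
            x, y = tuple(edge)
            others = [n for n in nodes if n not in edge]
            if any(ci_test(x, y, s) for s in combinations(others, size)):
                edges.discard(edge)  # independence found: no direct edge
    return edges
```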

Proceedings ArticleDOI
15 Jun 2019
TL;DR: The Action Transformer model repurposes a Transformer-style architecture to aggregate features from the spatio-temporal context around the person whose actions we are trying to classify; using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others.
Abstract: We introduce the Action Transformer model for recognizing and localizing human actions in video clips. We repurpose a Transformer-style architecture to aggregate features from the spatiotemporal context around the person whose actions we are trying to classify. We show that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others. Additionally its attention mechanism learns to emphasize hands and faces, which are often crucial to discriminate an action – all without explicit supervision other than boxes and class labels. We train and test our Action Transformer network on the Atomic Visual Actions (AVA) dataset, outperforming the state-of-the-art by a significant margin using only raw RGB frames as input.

Journal Article
TL;DR: The authors introduced the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content, given an image, a dialog history and a question about the image, the agent has to ground the question in image, infer context from history, and answer the question accurately.
Abstract: We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in image, infer context from history, and answer the question accurately. Visual Dialog is disentangled enough from a specific downstream task so as to serve as a general test of machine intelligence, while being sufficiently grounded in vision to allow objective evaluation of individual responses and benchmark progress. We develop a novel two-person real-time chat data-collection protocol to curate a large-scale Visual Dialog dataset (VisDial). VisDial v0.9 has been released and consists of $\sim$ ∼ 1.2M dialog question-answer pairs from 10-round, human-human dialogs grounded in $\sim$ ∼ 120k images from the COCO dataset. We introduce a family of neural encoder-decoder models for Visual Dialog with 3 encoders—Late Fusion, Hierarchical Recurrent Encoder and Memory Network (optionally with attention over image features)—and 2 decoders (generative and discriminative), which outperform a number of sophisticated baselines. We propose a retrieval-based evaluation protocol for Visual Dialog where the AI agent is asked to sort a set of candidate answers and evaluated on metrics such as mean-reciprocal-rank and recall $@k$ @ k of human response. We quantify the gap between machine and human performance on the Visual Dialog task via human studies. Putting it all together, we demonstrate the first ‘visual chatbot’! Our dataset, code, pretrained models and visual chatbot are available on https://visualdialog.org .