
Showing papers by "Carnegie Mellon University" published in 2019


Proceedings Article
19 Jun 2019
TL;DR: The authors propose XLNet, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and overcomes the limitations of BERT thanks to its autoregressive formulation.
Abstract: With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.

3,863 citations
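
For reference, the permutation objective described above can be written as maximizing the expected autoregressive log-likelihood over factorization orders, with $\mathcal{Z}_T$ the set of permutations of a length-$T$ sequence (notation follows the paper):

$$ \max_{\theta}\; \mathbb{E}_{\mathbf{z}\sim\mathcal{Z}_T}\Big[ \sum_{t=1}^{T} \log p_{\theta}\big(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\big) \Big] $$

Because each token is, in expectation, predicted conditioned on every other position, the model sees bidirectional context without ever masking the input.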


Posted Content
TL;DR: XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and overcomes the limitations of BERT thanks to its autoregressive formulation.
Abstract: With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.

3,009 citations


Proceedings ArticleDOI
09 Jan 2019
TL;DR: This work proposes Transformer-XL, a novel neural architecture that enables learning dependency beyond a fixed length without disrupting temporal coherence, consisting of a segment-level recurrence mechanism and a novel positional encoding scheme.
Abstract: Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens. Our code, pretrained models, and hyperparameters are available in both Tensorflow and PyTorch.

2,353 citations
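
As a rough illustration of the segment-level recurrence described above, each layer consumes a cache of hidden states from the previous segment as extra attention context. This is a minimal sketch, not the paper's implementation; `attend` stands in for Transformer-XL's relative-position self-attention.

```python
import numpy as np

def transformer_xl_layer(attend, x_seg, memory, mem_len):
    """One layer with segment-level recurrence (schematic sketch).

    x_seg:  hidden states of the current segment, shape (seg_len, d)
    memory: cached hidden states from the previous segment, shape (m, d);
            treated as a constant (no gradient) during training
    """
    # Queries come from the current segment only, but keys/values also see
    # the cached memory, so dependencies can span segment boundaries.
    context = np.concatenate([memory, x_seg], axis=0)
    out = attend(x_seg, context)
    new_memory = context[-mem_len:]  # cache to reuse for the next segment
    return out, new_memory
```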


Journal ArticleDOI
TL;DR: This paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy to enable researchers to better understand the state of the field and identify directions for future research.
Abstract: Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced, and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.

1,945 citations


Journal ArticleDOI
Peter A. R. Ade, James E. Aguirre, Z. Ahmed, Simone Aiola, and 276 more authors (53 institutions)
TL;DR: The Simons Observatory (SO) is a new cosmic microwave background experiment being built on Cerro Toco in Chile, due to begin observations in the early 2020s.
Abstract: The Simons Observatory (SO) is a new cosmic microwave background experiment being built on Cerro Toco in Chile, due to begin observations in the early 2020s. We describe the scientific goals of the experiment, motivate the design, and forecast its performance. SO will measure the temperature and polarization anisotropy of the cosmic microwave background in six frequency bands centered at: 27, 39, 93, 145, 225 and 280 GHz. The initial configuration of SO will have three small-aperture 0.5-m telescopes and one large-aperture 6-m telescope, with a total of 60,000 cryogenic bolometers. Our key science goals are to characterize the primordial perturbations, measure the number of relativistic species and the mass of neutrinos, test for deviations from a cosmological constant, improve our understanding of galaxy evolution, and constrain the duration of reionization. The small aperture telescopes will target the largest angular scales observable from Chile, mapping ≈ 10% of the sky to a white noise level of 2 μK-arcmin in combined 93 and 145 GHz bands, to measure the primordial tensor-to-scalar ratio, r, at a target level of σ(r)=0.003. The large aperture telescope will map ≈ 40% of the sky at arcminute angular resolution to an expected white noise level of 6 μK-arcmin in combined 93 and 145 GHz bands, overlapping with the majority of the Large Synoptic Survey Telescope sky region and partially with the Dark Energy Spectroscopic Instrument. With up to an order of magnitude lower polarization noise than maps from the Planck satellite, the high-resolution sky maps will constrain cosmological parameters derived from the damping tail, gravitational lensing of the microwave background, the primordial bispectrum, and the thermal and kinematic Sunyaev-Zel'dovich effects, and will aid in delensing the large-angle polarization signal to measure the tensor-to-scalar ratio. The survey will also provide a legacy catalog of 16,000 galaxy clusters and more than 20,000 extragalactic sources.

1,027 citations


Journal ArticleDOI
02 Aug 2019-Science
TL;DR: 3D-bioprinted hearts accurately reproduce patient-specific anatomical structure as determined by micro-computed tomography, and printed cardiac ventricles showed synchronized contractions, directional action potential propagation, and wall thickening up to 14% during peak systole.
Abstract: Collagen is the primary component of the extracellular matrix in the human body. It has proved challenging to fabricate collagen scaffolds capable of replicating the structure and function of tissues and organs. We present a method to 3D-bioprint collagen using freeform reversible embedding of suspended hydrogels (FRESH) to engineer components of the human heart at various scales, from capillaries to the full organ. Control of pH-driven gelation provides 20-micrometer filament resolution, a porous microstructure that enables rapid cellular infiltration and microvascularization, and mechanical strength for fabrication and perfusion of multiscale vasculature and tri-leaflet valves. We found that FRESH 3D-bioprinted hearts accurately reproduce patient-specific anatomical structure as determined by micro-computed tomography. Cardiac ventricles printed with human cardiomyocytes showed synchronized contractions, directional action potential propagation, and wall thickening up to 14% during peak systole.

996 citations


Proceedings ArticleDOI
15 Jun 2019
TL;DR: Argoverse includes sensor data collected by a fleet of autonomous vehicles in Pittsburgh and Miami, 3D tracking annotations, 300k extracted interesting vehicle trajectories, and rich semantic maps whose geometric and semantic metadata are not currently available in any other public dataset.
Abstract: We present Argoverse, a dataset designed to support autonomous vehicle perception tasks including 3D tracking and motion forecasting. Argoverse includes sensor data collected by a fleet of autonomous vehicles in Pittsburgh and Miami as well as 3D tracking annotations, 300k extracted interesting vehicle trajectories, and rich semantic maps. The sensor data consists of 360 degree images from 7 cameras with overlapping fields of view, forward-facing stereo imagery, 3D point clouds from long range LiDAR, and 6-DOF pose. Our 290km of mapped lanes contain rich geometric and semantic metadata which are not currently available in any public dataset. All data is released under a Creative Commons license at Argoverse.org. In baseline experiments, we use map information such as lane direction, driveable area, and ground height to improve the accuracy of 3D object tracking. We use 3D object tracking to mine for more than 300k interesting vehicle trajectories to create a trajectory forecasting benchmark. Motion forecasting experiments ranging in complexity from classical methods (k-NN) to LSTMs demonstrate that using detailed vector maps with lane-level information substantially reduces prediction error. Our tracking and forecasting experiments represent only a superficial exploration of the potential of rich maps in robotic perception. We hope that Argoverse will enable the research community to explore these problems in greater depth.

950 citations


Journal ArticleDOI
Željko Ivezić, Steven M. Kahn, J. Anthony Tyson, Bob Abel, and 332 more authors (55 institutions)
TL;DR: The Large Synoptic Survey Telescope (LSST) is a large, wide-field ground-based system designed to obtain repeated images covering the sky visible from Cerro Pachon in northern Chile.
Abstract: We describe here the most ambitious survey currently planned in the optical, the Large Synoptic Survey Telescope (LSST). The LSST design is driven by four main science themes: probing dark energy and dark matter, taking an inventory of the solar system, exploring the transient optical sky, and mapping the Milky Way. LSST will be a large, wide-field ground-based system designed to obtain repeated images covering the sky visible from Cerro Pachon in northern Chile. The telescope will have an 8.4 m (6.5 m effective) primary mirror, a 9.6 deg^2 field of view, a 3.2-gigapixel camera, and six filters (ugrizy) covering the wavelength range 320–1050 nm. The project is in the construction phase and will begin regular survey operations by 2022. About 90% of the observing time will be devoted to a deep-wide-fast survey mode that will uniformly observe an 18,000 deg^2 region about 800 times (summed over all six bands) during the anticipated 10 yr of operations and will yield a co-added map to r ~ 27.5. These data will result in databases including about 32 trillion observations of 20 billion galaxies and a similar number of stars, and they will serve the majority of the primary science programs. The remaining 10% of the observing time will be allocated to special projects such as Very Deep and Very Fast time domain surveys, whose details are currently under discussion. We illustrate how the LSST science drivers led to these choices of system parameters, and we describe the expected data products and their characteristics.

921 citations


Posted Content
TL;DR: Strong empirical results suggest that randomized smoothing is a promising direction for future research into adversarially robust classification; on smaller-scale datasets where competing approaches to certified $\ell_2$ robustness are viable, smoothing delivers higher certified accuracies.
Abstract: We show how to turn any classifier that classifies well under Gaussian noise into a new classifier that is certifiably robust to adversarial perturbations under the $\ell_2$ norm. This "randomized smoothing" technique has been proposed recently in the literature, but existing guarantees are loose. We prove a tight robustness guarantee in $\ell_2$ norm for smoothing with Gaussian noise. We use randomized smoothing to obtain an ImageNet classifier with e.g. a certified top-1 accuracy of 49% under adversarial perturbations with $\ell_2$ norm less than 0.5 (=127/255). No certified defense has been shown feasible on ImageNet except for smoothing. On smaller-scale datasets where competing approaches to certified $\ell_2$ robustness are viable, smoothing delivers higher certified accuracies. Our strong empirical results suggest that randomized smoothing is a promising direction for future research into adversarially robust classification. Code and models are available at this http URL.

719 citations
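
The certificate in this paper has a closed form: if the smoothed classifier's top class has probability at least $p_A$ and the runner-up at most $p_B$ under Gaussian noise $\mathcal{N}(0, \sigma^2 I)$, the prediction is constant within $\ell_2$ radius $R = \tfrac{\sigma}{2}\big(\Phi^{-1}(p_A) - \Phi^{-1}(p_B)\big)$. A small sketch (in practice the probability bounds come from Monte Carlo sampling with a confidence correction, omitted here):

```python
from scipy.stats import norm

def certified_radius(sigma: float, p_a: float, p_b: float) -> float:
    """Certified L2 radius for Gaussian randomized smoothing:
    R = sigma/2 * (Phi^{-1}(p_a) - Phi^{-1}(p_b)), valid when p_a > p_b."""
    return 0.5 * sigma * (norm.ppf(p_a) - norm.ppf(p_b))

# Example: noise level 0.25, top-class lower bound 0.8, runner-up upper bound 0.2
print(certified_radius(0.25, 0.8, 0.2))  # ~0.21
```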


Proceedings Article
01 Jan 2019
TL;DR: In this paper, randomized smoothing is used to obtain an ImageNet classifier with a certified top-1 accuracy of 49% under adversarial perturbations with $\ell_2$ norm less than 0.5.
Abstract: We show how to turn any classifier that classifies well under Gaussian noise into a new classifier that is certifiably robust to adversarial perturbations under the $\ell_2$ norm. This "randomized smoothing" technique has been proposed recently in the literature, but existing guarantees are loose. We prove a tight robustness guarantee in $\ell_2$ norm for smoothing with Gaussian noise. We use randomized smoothing to obtain an ImageNet classifier with e.g. a certified top-1 accuracy of 49% under adversarial perturbations with $\ell_2$ norm less than 0.5 (=127/255). No certified defense has been shown feasible on ImageNet except for smoothing. On smaller-scale datasets where competing approaches to certified $\ell_2$ robustness are viable, smoothing delivers higher certified accuracies. Our strong empirical results suggest that randomized smoothing is a promising direction for future research into adversarially robust classification. Code and models are available at this http URL.

714 citations


Journal ArticleDOI
28 May 2019-JAMA
TL;DR: In this retrospective analysis of data sets from patients with sepsis, 4 clinical phenotypes were identified that correlated with host-response patterns and clinical outcomes, and simulations suggested these phenotypes may help in understanding heterogeneity of treatment effects.
Abstract: Importance Sepsis is a heterogeneous syndrome. Identification of distinct clinical phenotypes may allow more precise therapy and improve care. Objective To derive sepsis phenotypes from clinical data, determine their reproducibility and correlation with host-response biomarkers and clinical outcomes, and assess the potential causal relationship with results from randomized clinical trials (RCTs). Design, Setting, and Participants Retrospective analysis of data sets using statistical, machine learning, and simulation tools. Phenotypes were derived among 20 189 total patients (16 552 unique patients) who met Sepsis-3 criteria within 6 hours of hospital presentation at 12 Pennsylvania hospitals (2010-2012) using consensus k-means clustering applied to 29 variables. Reproducibility and correlation with biological parameters and clinical outcomes were assessed in a second database (2013-2014; n = 43 086 total patients and n = 31 160 unique patients), in a prospective cohort study of sepsis due to pneumonia (n = 583), and in 3 sepsis RCTs (n = 4737). Exposures All clinical and laboratory variables in the electronic health record. Main Outcomes and Measures Derived phenotype (α, β, γ, and δ) frequency, host-response biomarkers, 28-day and 365-day mortality, and RCT simulation outputs. Results The derivation cohort included 20 189 patients with sepsis (mean age, 64 [SD, 17] years; 10 022 [50%] male; mean maximum 24-hour Sequential Organ Failure Assessment [SOFA] score, 3.9 [SD, 2.4]). The validation cohort included 43 086 patients (mean age, 67 [SD, 17] years; 21 993 [51%] male; mean maximum 24-hour SOFA score, 3.6 [SD, 2.0]). Of the 4 derived phenotypes, the α phenotype was the most common (n = 6625; 33%) and included patients with the lowest administration of a vasopressor; in the β phenotype (n = 5512; 27%), patients were older and had more chronic illness and renal dysfunction; in the γ phenotype (n = 5385; 27%), patients had more inflammation and pulmonary dysfunction; and in the δ phenotype (n = 2667; 13%), patients had more liver dysfunction and septic shock. Phenotype distributions were similar in the validation cohort. There were consistent differences in biomarker patterns by phenotype. In the derivation cohort, cumulative 28-day mortality was 287 deaths of 5691 unique patients (5%) for the α phenotype; 561 of 4420 (13%) for the β phenotype; 1031 of 4318 (24%) for the γ phenotype; and 897 of 2223 (40%) for the δ phenotype. Across all cohorts and trials, 28-day and 365-day mortality were highest among the δ phenotype vs the other 3 phenotypes (P < .001). In the simulation models, the proportion of RCTs reporting benefit, harm, or no effect changed considerably with shifts in phenotype frequencies (e.g., from a >33% chance of benefit to a >60% chance of harm). Conclusions and Relevance In this retrospective analysis of data sets from patients with sepsis, 4 clinical phenotypes were identified that correlated with host-response patterns and clinical outcomes, and simulations suggested these phenotypes may help in understanding heterogeneity of treatment effects. Further research is needed to determine the utility of these phenotypes in clinical care and for informing trial design and interpretation.

Proceedings ArticleDOI
15 Jun 2019
TL;DR: The proposed AS-GCN achieves consistently large improvements over state-of-the-art methods and also shows promising results for future pose prediction.
Abstract: Action recognition with skeleton data has recently attracted much attention in computer vision. Previous studies are mostly based on fixed skeleton graphs, only capturing local physical dependencies among joints, which may miss implicit joint correlations. To capture richer dependencies, we introduce an encoder-decoder structure, called A-link inference module, to capture action-specific latent dependencies, i.e. actional links, directly from actions. We also extend the existing skeleton graphs to represent higher-order dependencies, i.e. structural links. Combining the two types of links into a generalized skeleton graph, we further propose the actional-structural graph convolution network (AS-GCN), which stacks actional-structural graph convolution and temporal convolution as a basic building block, to learn both spatial and temporal features for action recognition. A future pose prediction head is added in parallel to the recognition head to help capture more detailed action patterns through self-supervision. We validate AS-GCN in action recognition using two skeleton data sets, NTU-RGB+D and Kinetics. The proposed AS-GCN achieves consistently large improvement compared to the state-of-the-art methods. As a side product, AS-GCN also shows promising results for future pose prediction.
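
Schematically, each actional-structural graph convolution aggregates over the two link types described above. This is a simplified single-layer sketch with hypothetical variable names (the actual block uses several structural-link orders plus temporal convolution):

```python
import numpy as np

def as_gc_layer(x, a_struct, a_act, w_struct, w_act):
    """One simplified actional-structural graph convolution.

    x:        per-joint features, shape (num_joints, d_in)
    a_struct: normalized structural-link adjacency (skeleton + higher-order)
    a_act:    actional-link adjacency inferred by the A-link module
    w_*:      weight matrices, shape (d_in, d_out)
    """
    # Aggregate over both link types, then apply a ReLU nonlinearity.
    return np.maximum(a_struct @ x @ w_struct + a_act @ x @ w_act, 0.0)
```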

Posted Content
TL;DR: The results reveal that reducing intra-class variation is an important factor when the feature backbone is shallow, but not as critical when using deeper backbones, and a baseline method with a standard fine-tuning practice compares favorably against other state-of-the-art few-shot learning algorithms.
Abstract: Few-shot classification aims to learn a classifier to recognize unseen classes during training with limited labeled examples. While significant progress has been made, the growing complexity of network designs, meta-learning algorithms, and differences in implementation details make a fair comparison difficult. In this paper, we present 1) a consistent comparative analysis of several representative few-shot classification algorithms, with results showing that deeper backbones significantly reduce the performance differences among methods on datasets with limited domain differences, 2) a modified baseline method that surprisingly achieves competitive performance when compared with the state-of-the-art on both the mini-ImageNet and the CUB datasets, and 3) a new experimental setting for evaluating the cross-domain generalization ability for few-shot classification algorithms. Our results reveal that reducing intra-class variation is an important factor when the feature backbone is shallow, but not as critical when using deeper backbones. In a realistic cross-domain evaluation setting, we show that a baseline method with a standard fine-tuning practice compares favorably against other state-of-the-art few-shot learning algorithms.
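
The baseline referred to above amounts to "freeze the pretrained backbone, then fit a new linear head on the few labeled support examples." A minimal PyTorch-style sketch under that reading (names are illustrative):

```python
import torch
import torch.nn as nn

def finetune_baseline(backbone, support_x, support_y, n_way, feat_dim, steps=100):
    """Fit a fresh linear classifier on support features from a frozen backbone."""
    backbone.eval()
    with torch.no_grad():                       # backbone stays frozen
        feats = backbone(support_x)             # (n_way * n_shot, feat_dim)
    clf = nn.Linear(feat_dim, n_way)
    opt = torch.optim.SGD(clf.parameters(), lr=0.01)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(clf(feats), support_y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return clf
```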

Proceedings Article
24 May 2019
TL;DR: TRADES decomposes the prediction error for adversarial examples (robust error) into the sum of the natural (classification) error and the boundary error, and provides a differentiable upper bound using the theory of classification-calibrated loss.
Abstract: We identify a trade-off between robustness and accuracy that serves as a guiding principle in the design of defenses against adversarial examples. Although this problem has been widely studied empirically, much remains unknown concerning the theory underlying this trade-off. In this work, we decompose the prediction error for adversarial examples (robust error) as the sum of the natural (classification) error and boundary error, and provide a differentiable upper bound using the theory of classification-calibrated loss, which is shown to be the tightest possible upper bound uniform over all probability distributions and measurable predictors. Inspired by our theoretical analysis, we also design a new defense method, TRADES, to trade adversarial robustness off against accuracy. Our proposed algorithm performs well experimentally on real-world datasets. The methodology is the foundation of our entry to the NeurIPS 2018 Adversarial Vision Challenge in which we won the 1st place out of ~2,000 submissions, surpassing the runner-up approach by $11.41\%$ in terms of mean $\ell_2$ perturbation distance.
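
The decomposition underlying TRADES, and the regularized surrogate it is commonly implemented with (CE = cross-entropy; $\beta$ trades accuracy against robustness), can be summarized as:

$$ \mathcal{R}_{\mathrm{rob}}(f) = \mathcal{R}_{\mathrm{nat}}(f) + \mathcal{R}_{\mathrm{bdy}}(f), \qquad \min_{f}\; \mathbb{E}_{(X,Y)}\Big[ \mathrm{CE}\big(f(X),Y\big) + \beta \max_{\|X'-X\|\le\epsilon} \mathrm{KL}\big(f(X)\,\|\,f(X')\big) \Big] $$

The KL term pushes the decision boundary away from the data, directly targeting the boundary-error component of the decomposition.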

Book ChapterDOI
Matej Kristan, Ales Leonardis, Jiří Matas, Michael Felsberg, and 155 more authors (47 institutions)
23 Jan 2019
TL;DR: The Visual Object Tracking challenge VOT2018 is the sixth annual tracker benchmarking activity organized by the VOT initiative; results of over eighty trackers are presented, many of them state-of-the-art trackers published at major computer vision conferences or in journals in recent years.
Abstract: The Visual Object Tracking challenge VOT2018 is the sixth annual tracker benchmarking activity organized by the VOT initiative. Results of over eighty trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The evaluation included the standard VOT and other popular methodologies for short-term tracking analysis and a “real-time” experiment simulating a situation where a tracker processes images as if provided by a continuously running sensor. A long-term tracking sub-challenge has been introduced to the set of standard VOT sub-challenges. The new sub-challenge focuses on long-term tracking properties, namely coping with target disappearance and reappearance. A new dataset has been compiled and a performance evaluation methodology that focuses on long-term tracking capabilities has been adopted. The VOT toolkit has been updated to support both the standard short-term and the new long-term tracking sub-challenges. Performance of the tested trackers typically far exceeds that of standard baselines. The source code for most of the trackers is publicly available from the VOT page. The dataset, the evaluation kit and the results are publicly available at the challenge website (http://votchallenge.net).

Proceedings Article
25 May 2019
TL;DR: The authors make the surprising observation that even if models have been trained using multiple heads, in practice a large percentage of attention heads can be removed at test time without significantly impacting performance.
Abstract: Multi-headed attention is a driving force behind recent state-of-the-art NLP models. By applying multiple attention mechanisms in parallel, it can express sophisticated functions beyond the simple weighted average. However we observe that, in practice, a large proportion of attention heads can be removed at test time without significantly impacting performance, and that some layers can even be reduced to a single head. Further analysis on machine translation models reveals that the self-attention layers can be significantly pruned, while the encoder-decoder layers are more dependent on multi-headedness.
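
The ablation behind this observation amounts to masking head outputs at test time and re-scoring the model. A minimal sketch (shapes are illustrative; real implementations apply the mask inside each attention layer before the output projection):

```python
import torch

def mask_heads(per_head_outputs: torch.Tensor, head_mask: torch.Tensor) -> torch.Tensor:
    """Zero out selected attention heads.

    per_head_outputs: (batch, n_heads, seq_len, head_dim)
    head_mask:        (n_heads,) with 1 = keep, 0 = prune
    """
    return per_head_outputs * head_mask.view(1, -1, 1, 1)

# Example: prune heads 0 and 3 of an 8-head layer, then re-evaluate the model.
mask = torch.ones(8)
mask[[0, 3]] = 0.0
```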

Proceedings Article
24 May 2019
TL;DR: This paper proves that gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet), and extends the analysis to deep residual convolutional neural networks, obtaining a similar convergence result.
Abstract: Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex. The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet). Our analysis relies on the particular structure of the Gram matrix induced by the neural network architecture. This structure allows us to show the Gram matrix is stable throughout the training process and this stability implies the global optimality of the gradient descent algorithm. We further extend our analysis to deep residual convolutional neural networks and obtain a similar convergence result.
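
For intuition, in the two-layer analysis this line of work builds on, the key object is the Gram matrix of the infinite-width network; if its least eigenvalue $\lambda_0$ is positive, gradient descent with a small step size $\eta$ converges linearly to zero training loss. A sketch of the result's form (not the deep-ResNet statement itself), where $\mathbf{u}(k)$ is the vector of network outputs on the training set after $k$ steps:

$$ \mathbf{H}^{\infty}_{ij} = \mathbb{E}_{\mathbf{w}\sim\mathcal{N}(0,\mathbf{I})}\big[ \sigma'(\mathbf{w}^{\top}\mathbf{x}_i)\, \sigma'(\mathbf{w}^{\top}\mathbf{x}_j) \big]\, \mathbf{x}_i^{\top}\mathbf{x}_j, \qquad \|\mathbf{y}-\mathbf{u}(k)\|_2^2 \le \big(1 - \tfrac{\eta\lambda_0}{2}\big)^{k}\, \|\mathbf{y}-\mathbf{u}(0)\|_2^2 $$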

Proceedings ArticleDOI
15 Jun 2019
TL;DR: Contrastive Adaptation Network (CAN) optimizes a new metric that explicitly models the intra-class domain discrepancy and the inter-class domain discrepancy, and is trained end-to-end with an alternating update strategy.
Abstract: Unsupervised Domain Adaptation (UDA) makes predictions for the target domain data while manual annotations are only available in the source domain. Previous methods minimize the domain discrepancy neglecting the class information, which may lead to misalignment and poor generalization performance. To address this issue, this paper proposes Contrastive Adaptation Network (CAN) optimizing a new metric which explicitly models the intra-class domain discrepancy and the inter-class domain discrepancy. We design an alternating update strategy for training CAN in an end-to-end manner. Experiments on two real-world benchmarks Office-31 and VisDA-2017 demonstrate that CAN performs favorably against the state-of-the-art methods and produces more discriminative features.
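
Schematically, the contrastive discrepancy that CAN optimizes pulls same-class source/target features together and pushes different-class features apart. With $\hat{D}^{cc'}$ an MMD-style estimate between class-$c$ source features and class-$c'$ target features over $M$ classes, the metric has the shape below (exact weighting follows the paper):

$$ \hat{D}_{\mathrm{cdd}} = \frac{1}{M}\sum_{c=1}^{M} \hat{D}^{cc} \;-\; \frac{1}{M(M-1)}\sum_{c=1}^{M}\sum_{c'\neq c} \hat{D}^{cc'} $$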

Journal ArticleDOI
TL;DR: In this article, a new type of atomically dispersed Co doped carbon catalyst with a core-shell structure has been developed via a surfactant-assisted metal-organic framework approach.
Abstract: Development of platinum group metal (PGM)-free catalysts for oxygen reduction reaction (ORR) is essential for affordable proton exchange membrane fuel cells. Herein, a new type of atomically dispersed Co doped carbon catalyst with a core–shell structure has been developed via a surfactant-assisted metal–organic framework approach. The cohesive interactions between the selected surfactant and the Co-doped zeolitic imidazolate framework (ZIF-8) nanocrystals lead to a unique confinement effect. During the thermal activation, this confinement effect suppressed the agglomeration of Co atomic sites and mitigated the collapse of internal microporous structures of ZIF-8. Among the studied surfactants, Pluronic F127 block copolymer led to the greatest performance gains with a doubling of the active site density relative to that of the surfactant-free catalyst. According to density functional theory calculations, unlike other Co catalysts, this new atomically dispersed Co–N–C@F127 catalyst is believed to contain substantial CoN2+2 sites, which are active and thermodynamically favorable for the four-electron ORR pathway. The Co–N–C@F127 catalyst exhibits an unprecedented ORR activity with a half-wave potential (E1/2) of 0.84 V (vs. RHE) as well as enhanced stability in the corrosive acidic media. It also demonstrated high initial performance with a power density of 0.87 W cm−2 along with encouraging durability in H2–O2 fuel cells. The atomically dispersed Co site catalyst approaches that of the Fe–N–C catalyst and represents the highest reported PGM-free and Fe-free catalyst performance.

Proceedings ArticleDOI
02 Mar 2019
TL;DR: The FSAF module robustly improves the baseline RetinaNet by a large margin under various settings, while introducing nearly free inference overhead, and the resulting best model can achieve a state-of-the-art 44.6% mAP, outperforming all existing single-shot detectors on COCO.
Abstract: We motivate and present feature selective anchor-free (FSAF) module, a simple and effective building block for single-shot object detectors. It can be plugged into single-shot detectors with feature pyramid structure. The FSAF module addresses two limitations brought up by the conventional anchor-based detection: 1) heuristic-guided feature selection; 2) overlap-based anchor sampling. The general concept of the FSAF module is online feature selection applied to the training of multi-level anchor-free branches. Specifically, an anchor-free branch is attached to each level of the feature pyramid, allowing box encoding and decoding in the anchor-free manner at an arbitrary level. During training, we dynamically assign each instance to the most suitable feature level. At the time of inference, the FSAF module can work independently or jointly with anchor-based branches. We instantiate this concept with simple implementations of anchor-free branches and online feature selection strategy. Experimental results on the COCO detection track show that our FSAF module performs better than anchor-based counterparts while being faster. When working jointly with anchor-based branches, the FSAF module robustly improves the baseline RetinaNet by a large margin under various settings, while introducing nearly free inference overhead. And the resulting best model can achieve a state-of-the-art 44.6% mAP, outperforming all existing single-shot detectors on COCO.
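
The online feature selection described above reduces to a small training-time rule: route each ground-truth instance to the pyramid level whose anchor-free branch currently incurs the lowest loss. A schematic sketch; `branch.losses` is a hypothetical stand-in for the focal and IoU losses:

```python
def select_feature_level(instance, branches):
    """Assign an instance to the pyramid level with the smallest current loss."""
    best_level, best_loss = None, float("inf")
    for level, branch in enumerate(branches):
        cls_loss, reg_loss = branch.losses(instance)  # focal + IoU losses
        if cls_loss + reg_loss < best_loss:
            best_loss, best_level = cls_loss + reg_loss, level
    # Gradients then flow only through the selected level for this instance.
    return best_level
```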

Proceedings Article
26 Apr 2019
TL;DR: The current paper gives the first efficient exact algorithm for computing the extension of NTK to convolutional neural nets, which is called Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm.
Abstract: How well does a classic deep net architecture like AlexNet or VGG19 classify on a standard dataset such as CIFAR-10 when its “width”— namely, number of channels in convolutional layers, and number of nodes in fully-connected internal layers — is allowed to increase to infinity? Such questions have come to the forefront in the quest to theoretically understand deep learning and its mysteries about optimization and generalization. They also connect deep learning to notions such as Gaussian processes and kernels. A recent paper [Jacot et al., 2018] introduced the Neural Tangent Kernel (NTK) which captures the behavior of fully-connected deep nets in the infinite width limit trained by gradient descent; this object was implicit in some other recent papers. An attraction of such ideas is that a pure kernel-based method is used to capture the power of a fully-trained deep net of infinite width. The current paper gives the first efficient exact algorithm for computing the extension of NTK to convolutional neural nets, which we call Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm. This results in a significant new benchmark for performance of a pure kernel-based method on CIFAR-10, being 10% higher than the methods reported in [Novak et al., 2019], and only 6% lower than the performance of the corresponding finite deep net architecture (once batch normalization etc. are turned off). Theoretically, we also give the first non-asymptotic proof showing that a fully-trained sufficiently wide net is indeed equivalent to the kernel regression predictor using NTK.
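
For context, the NTK being extended here is the expected inner product of network gradients at random initialization, and the associated infinite-width predictor is kernel regression with that kernel (ridgeless form shown; $X$ and $\mathbf{y}$ are the training inputs and labels):

$$ \Theta(\mathbf{x},\mathbf{x}') = \mathbb{E}_{\theta}\big[ \langle \nabla_{\theta} f(\theta,\mathbf{x}),\, \nabla_{\theta} f(\theta,\mathbf{x}') \rangle \big], \qquad f_{*}(\mathbf{x}) = \Theta(\mathbf{x},X)\,\Theta(X,X)^{-1}\,\mathbf{y} $$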

Journal ArticleDOI
22 Feb 2019-Science
TL;DR: The direct visualization of the keyhole morphology and dynamics with high-energy x-rays shows that (i) keyholes are present across the range of power and scanning velocity used in laser powder bed fusion; and (ii) there is a well-defined threshold from conduction mode to keyhole based on laser power density.
Abstract: We used ultrahigh-speed synchrotron x-ray imaging to quantify the phenomenon of vapor depressions (also known as keyholes) during laser melting of metals as practiced in additive manufacturing. Although expected from welding and inferred from postmortem cross sections of fusion zones, the direct visualization of the keyhole morphology and dynamics with high-energy x-rays shows that (i) keyholes are present across the range of power and scanning velocity used in laser powder bed fusion; (ii) there is a well-defined threshold from conduction mode to keyhole based on laser power density; and (iii) the transition follows the sequence of vaporization, depression of the liquid surface, instability, and then deep keyhole formation. These and other aspects provide a physical basis for three-dimensional printing in laser powder bed machines.

Proceedings ArticleDOI
13 Mar 2019
TL;DR: PointNetLK unrolls PointNet and the Lucas & Kanade (LK) algorithm into a single trainable recurrent deep neural network for point cloud registration.
Abstract: PointNet has revolutionized how we think about representing point clouds. For classification and segmentation tasks, the approach and its subsequent variants/extensions are considered state-of-the-art. To date, the successful application of PointNet to point cloud registration has remained elusive. In this paper we argue that PointNet itself can be thought of as a learnable "imaging" function. As a consequence, classical vision algorithms for image alignment can be brought to bear on the problem -- namely the Lucas & Kanade (LK) algorithm. Our central innovations stem from: (i) how to modify the LK algorithm to accommodate the PointNet imaging function, and (ii) unrolling PointNet and the LK algorithm into a single trainable recurrent deep neural network. We describe the architecture, and compare its performance against state-of-the-art in several common registration scenarios. The architecture offers some remarkable properties including: generalization across shape categories and computational efficiency -- opening up new paths of exploration for the application of deep learning to point cloud registration. Code and videos are available at https://github.com/hmgoforth/PointNetLK.
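
A sketch of the LK-style update on PointNet features $\phi$: compute the feature residual between the target and the transformed source cloud, solve for a twist increment, and compose. The notation here is schematic; in the paper the Jacobian $\mathbf{J}$ is approximated once on the template by finite differences (inverse-compositional form):

$$ \mathbf{r} = \phi(\mathbf{P}_{\mathrm{tgt}}) - \phi(G \cdot \mathbf{P}_{\mathrm{src}}), \qquad \boldsymbol{\xi} = \mathbf{J}^{+}\,\mathbf{r}, \qquad G \leftarrow \Delta G(\boldsymbol{\xi}) \cdot G $$

iterated until the increment $\boldsymbol{\xi} \in \mathfrak{se}(3)$ becomes small.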

Proceedings ArticleDOI
02 May 2019
TL;DR: This paper proposes a conceptual framework for building human-centered, decision-theory-driven XAI based on an extensive review across philosophy and psychology, and identifies pathways along which human cognitive patterns drive needs for XAI, as well as ways XAI can mitigate common cognitive biases.
Abstract: From healthcare to criminal justice, artificial intelligence (AI) is increasingly supporting high-consequence human decisions. This has spurred the field of explainable AI (XAI). This paper seeks to strengthen empirical application-specific investigations of XAI by exploring theoretical underpinnings of human decision making, drawing from the fields of philosophy and psychology. In this paper, we propose a conceptual framework for building human-centered, decision-theory-driven XAI based on an extensive review across these fields. Drawing on this framework, we identify pathways along which human cognitive patterns drives needs for building XAI and how XAI can mitigate common cognitive biases. We then put this framework into practice by designing and implementing an explainable clinical diagnostic tool for intensive care phenotyping and conducting a co-design exercise with clinicians. Thereafter, we draw insights into how this framework bridges algorithm-generated explanations and human decision-making theories. Finally, we discuss implications for XAI design and development.

Journal ArticleDOI
TL;DR: The authors measure cosmic weak lensing shear power spectra with the Subaru Hyper Suprime-Cam (HSC) survey first-year shear catalog covering 137 deg^2 of the sky.
Abstract: We measure cosmic weak lensing shear power spectra with the Subaru Hyper Suprime-Cam (HSC) survey first-year shear catalog covering 137 deg^2 of the sky. Thanks to the high effective galaxy number density of ∼17 arcmin^−2, even after conservative cuts such as a magnitude cut of i < 24.5 and photometric redshift cut of 0.3 ≤ z ≤ 1.5, we obtain a high-significance measurement of the cosmic shear power spectra in four tomographic redshift bins, achieving a total signal-to-noise ratio of 16 in the multipole range 300 ≤ l ≤ 1900. We carefully account for various uncertainties in our analysis including the intrinsic alignment of galaxies, scatters and biases in photometric redshifts, residual uncertainties in the shear measurement, and modeling of the matter power spectrum. The accuracy of our power spectrum measurement method as well as our analytic model of the covariance matrix are tested against realistic mock shear catalogs. For a flat Λ cold dark matter model, we find $S_8 \equiv \sigma_8(\Omega_{\rm m}/0.3)^{\alpha} = 0.800^{+0.029}_{-0.028}$ for α = 0.45 ($S_8 = 0.780^{+0.030}_{-0.033}$ for α = 0.5) from our HSC tomographic cosmic shear analysis alone. In comparison with Planck cosmic microwave background constraints, our results prefer slightly lower values of S_8, although metrics such as the Bayesian evidence ratio test do not show significant evidence for discordance between these results. We study the effect of possible additional systematic errors that are unaccounted for in our fiducial cosmic shear analysis, and find that they can shift the best-fit values of S_8 by up to ∼0.6σ in both directions. The full HSC survey data will contain several times more area, and will lead to significantly improved cosmological constraints.

Proceedings ArticleDOI
15 Jun 2019
TL;DR: Stereo R-CNN is a 3D object detection method for autonomous driving that fully exploits the sparse and dense, semantic and geometric information in stereo imagery; it adds extra branches after the stereo Region Proposal Network (RPN) to predict sparse keypoints, viewpoints, and object dimensions, which are combined with 2D left-right boxes to calculate a coarse 3D bounding box.
Abstract: We propose a 3D object detection method for autonomous driving by fully exploiting the sparse and dense, semantic and geometry information in stereo imagery. Our method, called Stereo R-CNN, extends Faster R-CNN for stereo inputs to simultaneously detect and associate objects in left and right images. We add extra branches after the stereo Region Proposal Network (RPN) to predict sparse keypoints, viewpoints, and object dimensions, which are combined with 2D left-right boxes to calculate a coarse 3D object bounding box. We then recover the accurate 3D bounding box by a region-based photometric alignment using left and right RoIs. Our method does not require depth input or 3D position supervision, yet it outperforms all existing fully supervised image-based methods. Experiments on the challenging KITTI dataset show that our method outperforms the state-of-the-art stereo-based method by around 30% AP on both 3D detection and 3D localization tasks. Code will be made publicly available.

Proceedings ArticleDOI
27 Oct 2019
TL;DR: PipeDream is presented, a system that adds inter-batch pipelining to intra-batch parallelism to further improve parallel training throughput, helping to better overlap computation with communication and reduce the amount of communication when possible.
Abstract: DNN training is extremely time-consuming, necessitating efficient multi-accelerator parallelization. Current approaches to parallelizing training primarily use intra-batch parallelization, where a single iteration of training is split over the available workers, but suffer from diminishing returns at higher worker counts. We present PipeDream, a system that adds inter-batch pipelining to intra-batch parallelism to further improve parallel training throughput, helping to better overlap computation with communication and reduce the amount of communication when possible. Unlike traditional pipelining, DNN training is bi-directional, where a forward pass through the computation graph is followed by a backward pass that uses state and intermediate data computed during the forward pass. Naive pipelining can thus result in mismatches in state versions used in the forward and backward passes, or excessive pipeline flushes and lower hardware efficiency. To address these challenges, PipeDream versions model parameters for numerically correct gradient computations, and schedules forward and backward passes of different minibatches concurrently on different workers with minimal pipeline stalls. PipeDream also automatically partitions DNN layers among workers to balance work and minimize communication. Extensive experimentation with a range of DNN tasks, models, and hardware configurations shows that PipeDream trains models to high accuracy up to 5.3X faster than commonly used intra-batch parallelism techniques.
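
The weight-versioning idea described above can be sketched in a few lines: stash the parameter version each in-flight minibatch used in its forward pass so that its backward pass sees the same weights. This is a schematic sketch, not PipeDream's implementation; `stage.params`, `stage.forward`, and `stage.backward` are hypothetical:

```python
import copy

class WeightStash:
    """Keep per-minibatch parameter versions for numerically correct gradients."""

    def __init__(self):
        self.versions = {}

    def forward(self, stage, minibatch_id, x):
        # Remember exactly which weights this minibatch saw on the way forward.
        self.versions[minibatch_id] = copy.deepcopy(stage.params)
        return stage.forward(x, stage.params)

    def backward(self, stage, minibatch_id, grad):
        # Use the stashed version so the forward and backward passes match.
        params = self.versions.pop(minibatch_id)
        return stage.backward(grad, params)
```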

Journal ArticleDOI
TL;DR: This paper aims to give an introduction to and a brief review of the computational methods for causal discovery that were developed in the past three decades, including constraint-based and score-based methods and those based on functional causal models, supplemented by some illustrations and applications.
Abstract: A fundamental task in various disciplines of science, including biology, is to find underlying causal relations and make use of them. Causal relations can be seen if interventions are properly applied; however, in many cases they are difficult or even impossible to conduct. It is then necessary to discover causal relations by analyzing statistical properties of purely observational data, which is known as causal discovery or causal structure search. This paper aims to give an introduction to and a brief review of the computational methods for causal discovery that were developed in the past three decades, including constraint-based and score-based methods and those based on functional causal models, supplemented by some illustrations and applications.
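
To make the constraint-based idea concrete, here is a deliberately simplified skeleton-search sketch in the spirit of the PC algorithm: start from a complete undirected graph and delete an edge whenever a conditional-independence test succeeds. The real PC algorithm restricts conditioning sets to current neighbors and then orients edges; `ci_test` is a user-supplied test, e.g. partial correlation.

```python
from itertools import combinations

def skeleton_search(nodes, ci_test, max_cond=2):
    """PC-style skeleton discovery (simplified).

    ci_test(x, y, s) -> True if X and Y are conditionally independent given
    the tuple of variables s, according to the observational data.
    """
    edges = {frozenset(pair) for pair in combinations(nodes, 2)}
    for size in range(max_cond + 1):
        for edge in list(edges):
            x, y = tuple(edge)
            others = [n for n in nodes if n not in edge]
            if any(ci_test(x, y, s) for s in combinations(others, size)):
                edges.discard(edge)  # independence found: no direct edge
    return edges
```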

Proceedings ArticleDOI
15 Jun 2019
TL;DR: The Action Transformer model repurposes a Transformer-style architecture to aggregate features from the spatio-temporal context around the person whose actions we are trying to classify; using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others.
Abstract: We introduce the Action Transformer model for recognizing and localizing human actions in video clips. We repurpose a Transformer-style architecture to aggregate features from the spatiotemporal context around the person whose actions we are trying to classify. We show that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others. Additionally its attention mechanism learns to emphasize hands and faces, which are often crucial to discriminate an action – all without explicit supervision other than boxes and class labels. We train and test our Action Transformer network on the Atomic Visual Actions (AVA) dataset, outperforming the state-of-the-art by a significant margin using only raw RGB frames as input.

Journal Article
TL;DR: The authors introduced the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content, given an image, a dialog history and a question about the image, the agent has to ground the question in image, infer context from history, and answer the question accurately.
Abstract: We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in image, infer context from history, and answer the question accurately. Visual Dialog is disentangled enough from a specific downstream task so as to serve as a general test of machine intelligence, while being sufficiently grounded in vision to allow objective evaluation of individual responses and benchmark progress. We develop a novel two-person real-time chat data-collection protocol to curate a large-scale Visual Dialog dataset (VisDial). VisDial v0.9 has been released and consists of $\sim$ ∼ 1.2M dialog question-answer pairs from 10-round, human-human dialogs grounded in $\sim$ ∼ 120k images from the COCO dataset. We introduce a family of neural encoder-decoder models for Visual Dialog with 3 encoders—Late Fusion, Hierarchical Recurrent Encoder and Memory Network (optionally with attention over image features)—and 2 decoders (generative and discriminative), which outperform a number of sophisticated baselines. We propose a retrieval-based evaluation protocol for Visual Dialog where the AI agent is asked to sort a set of candidate answers and evaluated on metrics such as mean-reciprocal-rank and recall $@k$ @ k of human response. We quantify the gap between machine and human performance on the Visual Dialog task via human studies. Putting it all together, we demonstrate the first ‘visual chatbot’! Our dataset, code, pretrained models and visual chatbot are available on https://visualdialog.org .