Showing papers by "Amazon.com" published in 2019


Proceedings ArticleDOI
15 Jun 2019
TL;DR: The objective is to learn feature embeddings that generalize well under a linear classification rule for novel categories and this work exploits two properties of linear classifiers: implicit differentiation of the optimality conditions of the convex problem and the dual formulation of the optimization problem.
Abstract: Many meta-learning approaches for few-shot learning rely on simple base learners such as nearest-neighbor classifiers. However, even in the few-shot regime, discriminatively trained linear predictors can offer better generalization. We propose to use these predictors as base learners to learn representations for few-shot learning and show they offer better tradeoffs between feature size and performance across a range of few-shot recognition benchmarks. Our objective is to learn feature embeddings that generalize well under a linear classification rule for novel categories. To efficiently solve the objective, we exploit two properties of linear classifiers: implicit differentiation of the optimality conditions of the convex problem and the dual formulation of the optimization problem. This allows us to use high-dimensional embeddings with improved generalization at a modest increase in computational overhead. Our approach, named MetaOptNet, achieves state-of-the-art performance on miniImageNet, tieredImageNet, CIFAR-FS, and FC100 few-shot learning benchmarks.

1,084 citations


Proceedings ArticleDOI
Tong He1, Zhi Zhang1, Hang Zhang1, Zhongyue Zhang1, Junyuan Xie1, Mu Li1 
01 Jun 2019
TL;DR: This article examined a collection of such refinements and empirically evaluated their impact on the final model accuracy through ablation study, and showed that by combining these refinements together, they are able to improve various CNN models significantly.
Abstract: Much of the recent progress made in image classification research can be credited to training procedure refinements, such as changes in data augmentations and optimization methods. In the literature, however, most refinements are either briefly mentioned as implementation details or only visible in source code. In this paper, we will examine a collection of such refinements and empirically evaluate their impact on the final model accuracy through ablation study. We will show that, by combining these refinements together, we are able to improve various CNN models significantly. For example, we raise ResNet-50's top-1 validation accuracy from 75.3% to 79.29% on ImageNet. We will also demonstrate that improvement on image classification accuracy leads to better transfer learning performance in other application domains such as object detection and semantic segmentation.

980 citations


Journal ArticleDOI
TL;DR: Temporal Segment Networks (TSN) as discussed by the authors is proposed to model long-range temporal structure with a new segment-based sampling and aggregation scheme, which enables the TSN framework to efficiently learn action models by using the whole video.
Abstract: We present a general and flexible video-level framework for learning action models in videos. This method, called temporal segment network (TSN), aims to model long-range temporal structure with a new segment-based sampling and aggregation scheme. This unique design enables the TSN framework to efficiently learn action models by using the whole video. The learned models could be easily deployed for action recognition in both trimmed and untrimmed videos with simple average pooling and multi-scale temporal window integration, respectively. We also study a series of good practices for the implementation of the TSN framework given limited training samples. Our approach obtains state-of-the-art performance on five challenging action recognition benchmarks: HMDB51 (71.0 percent), UCF101 (94.9 percent), THUMOS14 (80.1 percent), ActivityNet v1.2 (89.6 percent), and Kinetics400 (75.7 percent). In addition, using the proposed RGB difference as a simple motion representation, our method can still achieve competitive accuracy on UCF101 (91.0 percent) while running at 340 FPS. Furthermore, based on the proposed TSN framework, we won the video classification track at the ActivityNet challenge 2016 among 24 teams.

562 citations
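
The segment-based sampling and aggregation idea in the TSN abstract above can be illustrated with a minimal sketch. It assumes equal-length segments, one randomly chosen frame per segment, and plain averaging of per-snippet class scores as the aggregation step; the function names `sample_segment_indices` and `average_consensus` are hypothetical and are not taken from the paper or its code.

```python
import random


def sample_segment_indices(num_frames, num_segments, rng):
    """Pick one frame index from each of num_segments roughly equal segments.

    Assumption: segments are contiguous and cover the whole video, so the
    sampled snippets span its full temporal extent.
    """
    if num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    # Integer boundaries of the segments: [bounds[i], bounds[i + 1])
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(num_segments)]


def average_consensus(snippet_scores):
    """Average per-snippet class scores into one video-level score vector."""
    num_snippets = len(snippet_scores)
    num_classes = len(snippet_scores[0])
    return [
        sum(scores[c] for scores in snippet_scores) / num_snippets
        for c in range(num_classes)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_segment_indices(300, 3, rng))           # one index per segment
    print(average_consensus([[1.0, 0.0], [0.0, 1.0]]))   # video-level scores
```

Because only one snippet per segment is processed, the cost per video in this sketch depends on the number of segments rather than on the video length.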


Proceedings ArticleDOI
01 Jan 2019
TL;DR: OCGAN as discussed by the authors uses a de-noising auto-encoder network to explicitly constrain the latent space to exclusively represent the given class and uses a gradient-descent based sampling technique to generate potential out-of-class examples.
Abstract: We present a novel model called OCGAN for the classical problem of one-class novelty detection, where, given a set of examples from a particular class, the goal is to determine if a query example is from the same class. Our solution is based on learning latent representations of in-class examples using a de-noising auto-encoder network. The key contribution of our work is our proposal to explicitly constrain the latent space to exclusively represent the given class. In order to accomplish this goal, firstly, we force the latent space to have bounded support by introducing a tanh activation in the encoder's output layer. Secondly, using a discriminator in the latent space that is trained adversarially, we ensure that encoded representations of in-class examples resemble uniform random samples drawn from the same bounded space. Thirdly, using a second adversarial discriminator in the input space, we ensure all randomly drawn latent samples generate examples that look real. Finally, we introduce a gradient-descent based sampling technique that explores points in the latent space that generate potential out-of-class examples, which are fed back to the network to further train it to generate in-class examples from those points. The effectiveness of the proposed method is measured across four publicly available datasets using two one-class novelty detection protocols where we achieve state-of-the-art results.

460 citations


Proceedings ArticleDOI
10 Jan 2019
TL;DR: This paper presents an alternative scheme, named Guided Anchoring, which leverages semantic features to guide the anchoring, and jointly predicts the locations where the center of objects of interest are likely to exist as well as the scales and aspect ratios at different locations.
Abstract: Region anchors are the cornerstone of modern object detection techniques. State-of-the-art detectors mostly rely on a dense anchoring scheme, where anchors are sampled uniformly over the spatial domain with a predefined set of scales and aspect ratios. In this paper, we revisit this foundational stage. Our study shows that it can be done much more effectively and efficiently. Specifically, we present an alternative scheme, named Guided Anchoring, which leverages semantic features to guide the anchoring. The proposed method jointly predicts the locations where the centers of objects of interest are likely to exist as well as the scales and aspect ratios at different locations. On top of predicted anchor shapes, we mitigate the feature inconsistency with a feature adaption module. We also study the use of high-quality proposals to improve detection performance. The anchoring scheme can be seamlessly integrated into proposal methods and detectors. With Guided Anchoring, we achieve 9.1% higher recall on MS COCO with 90% fewer anchors than the RPN baseline. We also adopt Guided Anchoring in Fast R-CNN, Faster R-CNN and RetinaNet, respectively improving the detection mAP by 2.2%, 2.7% and 1.2%. Code is available at https://github.com/open-mmlab/mmdetection.

458 citations


Journal ArticleDOI
TL;DR: This paper utilizes the unique properties of the mesh for a direct analysis of 3D shapes using MeshCNN, a convolutional neural network designed specifically for triangular meshes, and demonstrates the effectiveness of MeshCNN on various learning tasks applied to 3D meshes.
Abstract: Polygonal meshes provide an efficient representation for 3D shapes. They explicitly capture both shape surface and topology, and leverage non-uniformity to represent large flat regions as well as sharp, intricate features. This non-uniformity and irregularity, however, inhibits mesh analysis efforts using neural networks that combine convolution and pooling operations. In this paper, we utilize the unique properties of the mesh for a direct analysis of 3D shapes using MeshCNN, a convolutional neural network designed specifically for triangular meshes. Analogous to classic CNNs, MeshCNN combines specialized convolution and pooling layers that operate on the mesh edges, by leveraging their intrinsic geodesic connections. Convolutions are applied on edges and the four edges of their incident triangles, and pooling is applied via an edge collapse operation that retains surface topology, thereby generating new mesh connectivity for the subsequent convolutions. MeshCNN learns which edges to collapse, thus forming a task-driven process where the network exposes and expands the important features while discarding the redundant ones. We demonstrate the effectiveness of MeshCNN on various learning tasks applied to 3D meshes.

414 citations


Posted Content
TL;DR: The increasing popularity of unrolled deep networks is due, in part, to their potential in developing efficient, high-performance (yet interpretable) network architectures from reasonably sized training sets.
Abstract: Deep neural networks provide unprecedented performance gains in many real world problems in signal and image processing. Despite these gains, future development and practical deployment of deep networks is hindered by their blackbox nature, i.e., lack of interpretability, and by the need for very large training sets. An emerging technique called algorithm unrolling or unfolding offers promise in eliminating these issues by providing a concrete and systematic connection between iterative algorithms that are used widely in signal processing and deep neural networks. Unrolling methods were first proposed to develop fast neural network approximations for sparse coding. More recently, this direction has attracted enormous attention and is rapidly growing both in theoretic investigations and practical applications. The growing popularity of unrolled deep networks is due in part to their potential in developing efficient, high-performance and yet interpretable network architectures from reasonable size training sets. In this article, we review algorithm unrolling for signal and image processing. We extensively cover popular techniques for algorithm unrolling in various domains of signal and image processing including imaging, vision and recognition, and speech processing. By reviewing previous works, we reveal the connections between iterative algorithms and neural networks and present recent theoretical results. Finally, we provide a discussion on current limitations of unrolling and suggest possible future research directions.

398 citations


Proceedings ArticleDOI
Sungsoo Ahn1, Shell Xu Hu, Andreas Damianou2, Neil D. Lawrence2, Zhenwen Dai2 
15 Jun 2019
TL;DR: In this article, the authors propose an information-theoretic framework for knowledge transfer which formulates knowledge transfer as maximizing the mutual information between the teacher and the student networks, and compare their method with existing knowledge transfer methods on both knowledge distillation and transfer learning tasks and show that their method consistently outperforms existing methods.
Abstract: Transferring knowledge from a teacher neural network pretrained on the same or a similar task to a student neural network can significantly improve the performance of the student neural network. Existing knowledge transfer approaches match the activations or the corresponding hand-crafted features of the teacher and the student networks. We propose an information-theoretic framework for knowledge transfer which formulates knowledge transfer as maximizing the mutual information between the teacher and the student networks. We compare our method with existing knowledge transfer methods on both knowledge distillation and transfer learning tasks and show that our method consistently outperforms existing methods. We further demonstrate the strength of our method on knowledge transfer across heterogeneous network architectures by transferring knowledge from a convolutional neural network (CNN) to a multi-layer perceptron (MLP) on CIFAR-10. The resulting MLP significantly outperforms state-of-the-art methods and achieves similar performance to the CNN with a single convolutional layer.

298 citations


Proceedings Article
01 Jan 2019
TL;DR: Theoretically, it is proved that implicit MAML can compute accurate meta-gradients with a memory footprint that is, up to small constant factors, no more than that which is required to compute a single inner loop gradient and at no overall increase in the total computational cost.
Abstract: A core capability of intelligent systems is the ability to quickly learn new tasks by drawing on prior experience. Gradient (or optimization) based meta-learning has recently emerged as an effective approach for few-shot learning. In this formulation, meta-parameters are learned in the outer loop, while task-specific models are learned in the inner-loop, by using only a small amount of data from the current task. A key challenge in scaling these approaches is the need to differentiate through the inner loop learning process, which can impose considerable computational and memory burdens. By drawing upon implicit differentiation, we develop the implicit MAML algorithm, which depends only on the solution to the inner level optimization and not the path taken by the inner loop optimizer. This effectively decouples the meta-gradient computation from the choice of inner loop optimizer. As a result, our approach is agnostic to the choice of inner loop optimizer and can gracefully handle many gradient steps without vanishing gradients or memory constraints. Theoretically, we prove that implicit MAML can compute accurate meta-gradients with a memory footprint that is, up to small constant factors, no more than that which is required to compute a single inner loop gradient and at no overall increase in the total computational cost. Experimentally, we show that these benefits of implicit MAML translate into empirical gains on few-shot image recognition benchmarks.

280 citations
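
The claim in the implicit MAML abstract above, that the meta-gradient depends only on the inner solution and not on the optimizer's path, can be checked on a one-dimensional toy problem. This sketch assumes a proximally regularized inner objective, L_train(phi) + (lam/2)(phi - theta)^2, for which implicit differentiation gives d(phi*)/d(theta) = 1 / (1 + L_train''(phi*) / lam); the abstract itself does not spell out this objective, and the quadratic losses, constants, and function names here are hypothetical choices for illustration only.

```python
def inner_solve(theta, a, c, lam, lr=0.05, steps=2000):
    """Minimize 0.5*a*(phi - c)**2 + 0.5*lam*(phi - theta)**2 by gradient descent."""
    phi = theta
    for _ in range(steps):
        grad = a * (phi - c) + lam * (phi - theta)
        phi -= lr * grad
    return phi


def adapted_loss(theta, a, c, lam, target):
    """Test loss 0.5*(phi* - target)**2 evaluated at the adapted parameter phi*."""
    phi_star = inner_solve(theta, a, c, lam)
    return 0.5 * (phi_star - target) ** 2


def implicit_meta_grad(theta, a, c, lam, target):
    """Meta-gradient via implicit differentiation; no backprop through the loop."""
    phi_star = inner_solve(theta, a, c, lam)
    test_grad = phi_star - target   # derivative of the test loss w.r.t. phi
    inner_hessian = a               # second derivative of the training loss
    return test_grad / (1.0 + inner_hessian / lam)


if __name__ == "__main__":
    theta, a, c, lam, target = 0.5, 2.0, 1.0, 1.0, 3.0
    eps = 1e-5
    finite_diff = (
        adapted_loss(theta + eps, a, c, lam, target)
        - adapted_loss(theta - eps, a, c, lam, target)
    ) / (2 * eps)
    print(implicit_meta_grad(theta, a, c, lam, target), finite_diff)
```

Only the converged `phi_star` and the curvature at that point enter `implicit_meta_grad`, so the 2000 inner steps never need to be stored or differentiated through.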


Proceedings ArticleDOI
15 Jun 2019
TL;DR: This paper builds an Aggregated Co-occurrent Feature (ACF) Module, which learns a fine-grained spatial invariant representation to capture co-occurrent context information across the scene and significantly improves the segmentation results using FCN.
Abstract: Recent work has achieved great success in utilizing global contextual information for semantic segmentation, including increasing the receptive field and aggregating pyramid feature representations. In this paper, we go beyond global context and explore the fine-grained representation using co-occurrent features by introducing the Co-occurrent Feature Model, which predicts the distribution of co-occurrent features for a given target. To leverage the semantic context in the co-occurrent features, we build an Aggregated Co-occurrent Feature (ACF) Module by aggregating the probability of the co-occurrent feature with the co-occurrent context. The ACF Module learns a fine-grained spatial invariant representation to capture co-occurrent context information across the scene. Our approach significantly improves the segmentation results using FCN and achieves superior performance: 54.0% mIoU on Pascal Context, 87.2% mIoU on Pascal VOC 2012 and 44.89% mIoU on ADE20K datasets. The source code and complete system will be publicly available upon publication.

273 citations


Journal ArticleDOI
Adriane Esquivel-Muelbert1, Timothy R. Baker1, Kyle G. Dexter2, Simon L. Lewis3, Simon L. Lewis1, Roel J. W. Brienen1, Ted R. Feldpausch4, Jon Lloyd5, Abel Monteagudo-Mendoza6, Luzmila Arroyo7, Esteban Álvarez-Dávila, Niro Higuchi8, Beatriz Schwantes Marimon9, Ben Hur Marimon-Junior9, Marcos Silveira10, Emilio Vilanova11, Emilio Vilanova12, Emanuel Gloor1, Yadvinder Malhi13, Jérôme Chave14, Jos Barlow15, Jos Barlow16, Damien Bonal17, Nallaret Davila Cardozo18, Terry L. Erwin19, Sophie Fauset1, Bruno Hérault20, Susan G. Laurance21, Lourens Poorter22, Lan Qie5, Clément Stahl23, Martin J. P. Sullivan1, Hans ter Steege24, Hans ter Steege25, Vincent A. Vos, Pieter A. Zuidema22, Everton Cristo de Almeida26, Edmar Almeida de Oliveira9, Ana Andrade8, Simone Aparecida Vieira27, Luiz E. O. C. Aragão28, Luiz E. O. C. Aragão4, Alejandro Araujo-Murakami7, Eric Arets22, Gerardo A. Aymard C, Christopher Baraloto29, Plínio Barbosa de Camargo30, Jorcely Barroso10, Frans Bongers22, René G. A. Boot31, José Luís Camargo8, Wendeson Castro10, Victor Chama Moscoso6, James A. Comiskey19, Fernando Cornejo Valverde32, Antonio Carlos Lola da Costa33, Jhon del Aguila Pasquel32, Jhon del Aguila Pasquel34, Anthony Di Fiore35, Luisa Fernanda Duque, Fernando Elias9, Julien Engel20, Julien Engel29, Gerardo Flores Llampazo, David W. Galbraith1, Rafael Herrera Fernández36, Rafael Herrera Fernández37, Eurídice N. Honorio Coronado34, Wannes Hubau38, Eliana Jimenez-Rojas39, Adriano José Nogueira Lima8, Ricardo Keichi Umetsu9, William F. Laurance21, Gabriela Lopez-Gonzalez1, Thomas E. Lovejoy40, Omar Aurelio Melo Cruz41, Paulo S. Morandi9, David A. Neill, Percy Núñez Vargas6, Nadir Pallqui Camacho6, Alexander Parada Gutierrez, Guido Pardo, Julie Peacock1, Marielos Peña-Claros22, Maria Cristina Peñuela-Mora, Pascal Petronelli14, Georgia Pickavance1, Nigel C. A. Pitman, Adriana Prieto42, Carlos A. Quesada8, Hirma Ramírez-Angulo11, Maxime Réjou-Méchain43, Zorayda Restrepo Correa, Anand Roopsind44, Agustín Rudas42, Rafael de Paiva Salomão15, Natalino Silva, Javier Silva Espejo45, James Singh46, Juliana Stropp47, John Terborgh48, Raquel Thomas44, Marisol Toledo7, Armando Torres-Lezama11, Luis Valenzuela Gamarra, Peter J. van de Meer49, Geertje M. F. van der Heijden50, Peter van der Hout, Rodolfo Vásquez Martínez, César I.A. Vela6, Ima Célia Guimarães Vieira15, Oliver L. Phillips1
University of Leeds1, University of Edinburgh2, University College London3, University of Exeter4, Imperial College London5, National University of Saint Anthony the Abbot in Cuzco6, Universidad Autónoma Gabriel René Moreno7, National Institute of Amazonian Research8, Universidade do Estado de Mato Grosso9, Universidade Federal do Acre10, University of Los Andes11, University of Washington12, Environmental Change Institute13, Centre national de la recherche scientifique14, Museu Paraense Emílio Goeldi15, Lancaster University16, University of Lorraine17, Universidad Nacional de la Amazonía Peruana18, Smithsonian Institution19, University of Montpellier20, James Cook University21, Wageningen University and Research Centre22, Agro ParisTech23, Naturalis24, University of Amsterdam25, Federal University of Western Pará26, State University of Campinas27, National Institute for Space Research28, Florida International University29, University of São Paulo30, Tropenbos International31, Amazon.com32, Federal University of Pará33, Michigan Technological University34, University of Texas at Austin35, Venezuelan Institute for Scientific Research36, Polytechnic University of Valencia37, Royal Museum for Central Africa38, Tecnológico de Antioquia39, George Mason University40, Universidad del Tolima41, National University of Colombia42, Paul Sabatier University43, Georgetown University44, University of La Serena45, Forestry Commission46, Federal University of Alagoas47, Duke University48, Van Hall Larenstein University of Applied Sciences49, University of Nottingham50
TL;DR: A slow shift to a more dry‐affiliated Amazonia is underway, with changes in compositional dynamics consistent with climate‐change drivers, but yet to significantly impact whole‐community composition.
Abstract: Most of the planet's diversity is concentrated in the tropics, which includes many regions undergoing rapid climate change. Yet, while climate‐induced biodiversity changes are widely documented elsewhere, few studies have addressed this issue for lowland tropical ecosystems. Here we investigate whether the floristic and functional composition of intact lowland Amazonian forests have been changing by evaluating records from 106 long‐term inventory plots spanning 30 years. We analyse three traits that have been hypothesized to respond to different environmental drivers (increase in moisture stress and atmospheric CO2 concentrations): maximum tree size, biogeographic water‐deficit affiliation and wood density. Tree communities have become increasingly dominated by large‐statured taxa, but to date there has been no detectable change in mean wood density or water deficit affiliation at the community level, despite most forest plots having experienced an intensification of the dry season. However, among newly recruited trees, dry‐affiliated genera have become more abundant, while the mortality of wet‐affiliated genera has increased in those plots where the dry season has intensified most. Thus, a slow shift to a more dry‐affiliated Amazonia is underway, with changes in compositional dynamics (recruits and mortality) consistent with climate‐change drivers, but yet to significantly impact whole‐community composition. The Amazon observational record suggests that the increase in atmospheric CO2 is driving a shift within tree communities to large‐statured species and that climate changes to date will impact forest composition, but long generation times of tropical trees mean that biodiversity change is lagging behind climate change.

Posted Content
TL;DR: The implicit MAML algorithm as discussed by the authors decouples the meta-gradient computation from the choice of inner-loop optimizer and can gracefully handle many gradient steps without vanishing gradients or memory constraints.
Abstract: A core capability of intelligent systems is the ability to quickly learn new tasks by drawing on prior experience. Gradient (or optimization) based meta-learning has recently emerged as an effective approach for few-shot learning. In this formulation, meta-parameters are learned in the outer loop, while task-specific models are learned in the inner-loop, by using only a small amount of data from the current task. A key challenge in scaling these approaches is the need to differentiate through the inner loop learning process, which can impose considerable computational and memory burdens. By drawing upon implicit differentiation, we develop the implicit MAML algorithm, which depends only on the solution to the inner level optimization and not the path taken by the inner loop optimizer. This effectively decouples the meta-gradient computation from the choice of inner loop optimizer. As a result, our approach is agnostic to the choice of inner loop optimizer and can gracefully handle many gradient steps without vanishing gradients or memory constraints. Theoretically, we prove that implicit MAML can compute accurate meta-gradients with a memory footprint that is, up to small constant factors, no more than that which is required to compute a single inner loop gradient and at no overall increase in the total computational cost. Experimentally, we show that these benefits of implicit MAML translate into empirical gains on few-shot image recognition benchmarks.

Journal ArticleDOI
01 Dec 2019
TL;DR: A conceptual model for cloud futurology is proposed in this article to explore the influence of emerging paradigms and technologies on the evolution of cloud computing. The study focuses on three emerging paradigms: Blockchain, IoT and Artificial Intelligence.
Abstract: Cloud computing plays a critical role in modern society and enables a range of applications from infrastructure to social media. Such systems must cope with varying load and evolving usage reflecting societies’ interaction and dependency on automated computing systems whilst satisfying Quality of Service (QoS) guarantees. Enabling these systems are a cohort of conceptual technologies, synthesized to meet the demand of evolving computing applications. In order to understand current and future challenges of such systems, there is a need to identify key technologies enabling future applications. In this study, we aim to explore how three emerging paradigms (Blockchain, IoT and Artificial Intelligence) will influence future cloud computing systems. Further, we identify several technologies driving these paradigms and invite international experts to discuss the current status and future directions of cloud computing. Finally, we propose a conceptual model for cloud futurology to explore the influence of emerging paradigms and technologies on the evolution of cloud computing.

Journal ArticleDOI
TL;DR: The authors argue that word embedding models are a useful tool for the study of culture using a historical analysis of shared understandings of social class as an empirical case, and they argue word embeddings represent semant...
Abstract: We argue word embedding models are a useful tool for the study of culture using a historical analysis of shared understandings of social class as an empirical case. Word embeddings represent semant...

Proceedings ArticleDOI
06 Apr 2019
TL;DR: This article proposed an improved self-supervised method, where a single neural encoder is followed by multiple workers that jointly solve different self-supervised tasks, and the needed consensus across different tasks naturally imposes meaningful constraints to the encoder, contributing to discover general representations and to minimize the risk of learning superficial ones.
Abstract: Learning good representations without supervision is still an open issue in machine learning, and is particularly challenging for speech signals, which are often characterized by long sequences with a complex hierarchical structure. Some recent works, however, have shown that it is possible to derive useful speech representations by employing a self-supervised encoder-discriminator approach. This paper proposes an improved self-supervised method, where a single neural encoder is followed by multiple workers that jointly solve different self-supervised tasks. The needed consensus across different tasks naturally imposes meaningful constraints to the encoder, contributing to discover general representations and to minimize the risk of learning superficial ones. Experiments show that the proposed approach can learn transferable, robust, and problem-agnostic features that carry on relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level features such as emotional cues. In addition, a number of design choices make the encoder easily exportable, facilitating its direct usage or adaptation to different problems.

Posted Content
TL;DR: An information-theoretic framework for knowledge transfer is proposed which formulates knowledge transfer as maximizing the mutual information between the teacher and the student networks and which consistently outperforms existing methods.
Abstract: Transferring knowledge from a teacher neural network pretrained on the same or a similar task to a student neural network can significantly improve the performance of the student neural network. Existing knowledge transfer approaches match the activations or the corresponding hand-crafted features of the teacher and the student networks. We propose an information-theoretic framework for knowledge transfer which formulates knowledge transfer as maximizing the mutual information between the teacher and the student networks. We compare our method with existing knowledge transfer methods on both knowledge distillation and transfer learning tasks and show that our method consistently outperforms existing methods. We further demonstrate the strength of our method on knowledge transfer across heterogeneous network architectures by transferring knowledge from a convolutional neural network (CNN) to a multi-layer perceptron (MLP) on CIFAR-10. The resulting MLP significantly outperforms state-of-the-art methods and achieves similar performance to the CNN with a single convolutional layer.

Proceedings ArticleDOI
15 Sep 2019
TL;DR: Topical-Chat is introduced, a knowledge-grounded human-human conversation dataset where the underlying knowledge spans 8 broad topics and conversation partners don’t have explicitly defined roles, to help further research in open-domain conversational AI.
Abstract: Building socialbots that can have deep, engaging open-domain conversations with humans is one of the grand challenges of artificial intelligence (AI). To this end, bots need to be able to leverage world knowledge spanning several domains effectively when conversing with humans who have their own world knowledge. Existing knowledge-grounded conversation datasets are primarily stylized with explicit roles for conversation partners. These datasets also do not explore depth or breadth of topical coverage with transitions in conversations. We introduce Topical-Chat, a knowledge-grounded human-human conversation dataset where the underlying knowledge spans 8 broad topics and conversation partners don’t have explicitly defined roles, to help further research in open-domain conversational AI. We also train several state-of-the-art encoder-decoder conversational models on Topical-Chat and perform automated and human evaluation for benchmarking.

Proceedings ArticleDOI
05 Apr 2019
TL;DR: This paper reports state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data and introduces a new layer-wise optimizer called NovoGrad to improve training.
Abstract: In this paper, we report state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data. Our model, Jasper, uses only 1D convolutions, batch normalization, ReLU, dropout, and residual connections. To improve training, we further introduce a new layer-wise optimizer called NovoGrad. Through experiments, we demonstrate that the proposed deep architecture performs as well or better than more complex choices. Our deepest Jasper variant uses 54 convolutional layers. With this architecture, we achieve 2.95% WER using a beam-search decoder with an external neural language model and 3.86% WER with a greedy decoder on LibriSpeech test-clean. We also report competitive results on the Wall Street Journal and the Hub5'00 conversational evaluation datasets.
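
The WER figures quoted in the Jasper abstract above refer to word error rate. As a reminder of how that metric is usually computed, here is a minimal sketch using the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It assumes whitespace tokenization and no text normalization, which may differ from the scoring setup the authors used; the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100% for very poor hypotheses.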

Proceedings ArticleDOI
15 Jun 2019
TL;DR: It is shown that self-consistency alone is not sufficient to generate realistic skeletons, however adding a 2D pose discriminator enables the lifter to output valid 3D poses and demonstrates the usefulness of 2D pose data for unsupervised 3D lifting.
Abstract: We present an unsupervised learning approach to recover 3D human pose from 2D skeletal joints extracted from a single image. Our method does not require any multi-view image data, 3D skeletons, correspondences between 2D-3D points, or use previously learned 3D priors during training. A lifting network accepts 2D landmarks as inputs and generates a corresponding 3D skeleton estimate. During training, the recovered 3D skeleton is reprojected on random camera viewpoints to generate new ‘synthetic’ 2D poses. By lifting the synthetic 2D poses back to 3D and re-projecting them in the original camera view, we can define self-consistency loss both in 3D and in 2D. The training can thus be self supervised by exploiting the geometric self-consistency of the lift-reproject-lift process. We show that self-consistency alone is not sufficient to generate realistic skeletons, however adding a 2D pose discriminator enables the lifter to output valid 3D poses. Additionally, to learn from 2D poses ‘in the wild’, we train an unsupervised 2D domain adapter network to allow for an expansion of 2D data. This improves results and demonstrates the usefulness of 2D pose data for unsupervised 3D lifting. Results on Human3.6M dataset for 3D human pose estimation demonstrate that our approach improves upon the previous unsupervised methods by 30% and outperforms many weakly supervised approaches that explicitly use 3D data.

Journal ArticleDOI
Roy Burstein1, Nathaniel J Henry1, Michael Collison1, Laurie B. Marczak1, +663 more (290 institutions)
16 Oct 2019-Nature
TL;DR: A high-resolution, global atlas of mortality of children under five years of age between 2000 and 2017 highlights subnational geographical inequalities in the distribution, rates and absolute counts of child deaths by age.
Abstract: Since 2000, many countries have achieved considerable success in improving child survival, but localized progress remains unclear. To inform efforts towards United Nations Sustainable Development Goal 3.2—to end preventable child deaths by 2030—we need consistently estimated data at the subnational level regarding child mortality rates and trends. Here we quantified, for the period 2000–2017, the subnational variation in mortality rates and number of deaths of neonates, infants and children under 5 years of age within 99 low- and middle-income countries using a geostatistical survival model. We estimated that 32% of children under 5 in these countries lived in districts that had attained rates of 25 or fewer child deaths per 1,000 live births by 2017, and that 58% of child deaths between 2000 and 2017 in these countries could have been averted in the absence of geographical inequality. This study enables the identification of high-mortality clusters, patterns of progress and geographical inequalities to inform appropriate investments and implementations that will help to improve the health of all populations.

Proceedings ArticleDOI
10 May 2019
TL;DR: This work proposes a method for learning embeddings for few-shot learning that is suitable for use with any number of shots (shot-free), that encompasses metric learning, and that facilitates adding new classes without crowding the class representation space.
Abstract: We propose a method for learning embeddings for few-shot learning that is suitable for use with any number of shots (shot-free). Rather than fixing the class prototypes to be the Euclidean average of sample embeddings, we allow them to live in a higher-dimensional space (embedded class models) and learn the prototypes along with the model parameters. The class representation function is defined implicitly, which allows us to deal with a variable number of shots per class with a simple constant-size architecture. The class embedding encompasses metric learning, which facilitates adding new classes without crowding the class representation space. Despite being general and not tuned to the benchmark, our approach achieves state-of-the-art performance on the standard few-shot benchmark datasets.

Proceedings Article
01 Jan 2019
TL;DR: This paper evaluates on several standard retrieval datasets such as CAR-196, CUB-200-2011, Stanford Online Product, and In-Shop datasets for image retrieval and clustering, and establishes that the classification-based approach is competitive across different feature dimensions and base feature networks.
Abstract: Deep metric learning aims to learn a function mapping image pixels to embedding feature vectors that model the similarity between images. Two major applications of metric learning are content-based image retrieval and face verification. For the retrieval tasks, the majority of current state-of-the-art (SOTA) approaches use triplet-based non-parametric training. For the face verification tasks, however, recent SOTA approaches have adopted classification-based parametric training. In this paper, we look into the effectiveness of classification-based approaches on image retrieval datasets. We evaluate on several standard retrieval datasets such as CAR-196, CUB-200-2011, Stanford Online Product, and In-Shop datasets for image retrieval and clustering, and establish that our classification-based approach is competitive across different feature dimensions and base feature networks. We further provide insights into the performance effects of subsampling classes for scalable classification-based training, and the effects of binarization, enabling efficient storage and computation for practical applications.

Posted ContentDOI
TL;DR: It is shown that pre-norm residual connections (PreNorm) and smaller initializations enable warmup-free, validation-based training with large learning rates, and $\ell_2$ normalization with a single scale parameter (ScaleNorm) is proposed for faster training and better performance.
Abstract: We evaluate three simple, normalization-centric changes to improve Transformer training. First, we show that pre-norm residual connections (PreNorm) and smaller initializations enable warmup-free, validation-based training with large learning rates. Second, we propose $\ell_2$ normalization with a single scale parameter (ScaleNorm) for faster training and better performance. Finally, we reaffirm the effectiveness of normalizing word embeddings to a fixed length (FixNorm). On five low-resource translation pairs from TED Talks-based corpora, these changes always converge, giving an average +1.1 BLEU over state-of-the-art bilingual baselines and a new 32.8 BLEU on IWSLT'15 English-Vietnamese. We observe sharper performance curves, more consistent gradient norms, and a linear relationship between activation scaling and decoder depth. Surprisingly, in the high-resource setting (WMT'14 English-German), ScaleNorm and FixNorm remain competitive but PreNorm degrades performance.
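
The normalization changes named in this abstract can be illustrated with a minimal sketch. The code below assumes the usual reading of the description: ScaleNorm rescales a vector to a fixed L2 length given by one scalar `g`, and a pre-norm residual block normalizes the input before the sublayer and then adds the residual. The function names, the `eps` guard, and the toy sublayer are illustrative assumptions, not the authors' implementation.

```python
import math

def scale_norm(x, g, eps=1e-5):
    """Rescale vector x so its L2 norm equals the single scalar g.

    eps is an assumed guard against division by zero.
    """
    norm = math.sqrt(sum(v * v for v in x))
    scale = g / max(norm, eps)
    return [scale * v for v in x]

def pre_norm_block(x, sublayer, g):
    """Pre-norm residual connection: x + sublayer(norm(x))."""
    return [a + b for a, b in zip(x, sublayer(scale_norm(x, g)))]

def post_norm_block(x, sublayer, g):
    """Post-norm residual connection, for contrast: norm(x + sublayer(x))."""
    return scale_norm([a + b for a, b in zip(x, sublayer(x))], g)

# Toy sublayer (hypothetical): doubles every component.
double = lambda v: [2.0 * c for c in v]

y = scale_norm([3.0, 4.0], g=2.0)
z = pre_norm_block([3.0, 4.0], double, g=1.0)
```

In the pre-norm variant the residual path carries `x` through unnormalized, which is the structural difference from the post-norm variant; in a real model `g` would be a learned parameter.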

Journal ArticleDOI
TL;DR: In this paper, the authors formulate the service migration problem as a Markov decision process (MDP) and provide a mathematical framework to design optimal service migration policies in mobile edge computing.
Abstract: In mobile edge computing, local edge servers can host cloud-based services, which reduces network overhead and latency but requires service migrations as users move to new locations. It is challenging to make migration decisions optimally because of the uncertainty in such a dynamic cloud environment. In this paper, we formulate the service migration problem as a Markov decision process (MDP). Our formulation captures general cost models and provides a mathematical framework to design optimal service migration policies. In order to overcome the complexity associated with computing the optimal policy, we approximate the underlying state space by the distance between the user and service locations. We show that the resulting MDP is exact for the uniform 1-D user mobility, while it provides a close approximation for uniform 2-D mobility with a constant additive error. We also propose a new algorithm and a numerical technique for computing the optimal solution, which is significantly faster than traditional methods based on the standard value or policy iteration. We illustrate the application of our solution in practical scenarios where many theoretical assumptions are relaxed. Our evaluations based on real-world mobility traces of San Francisco taxis show the superior performance of the proposed solution compared to baseline solutions.
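
To make the distance-based formulation concrete, here is a toy sketch of standard value iteration (the baseline mentioned in the abstract, not the faster algorithm the paper proposes). Everything numeric is an assumption for illustration: the state is the user-service distance on a bounded 1-D walk, staying costs `cost_per_hop * d`, migrating costs a flat `migrate_cost` and resets the distance to zero, and future costs are discounted by `gamma`.

```python
def next_distance_distribution(d, max_dist, move_p):
    """Return {next_distance: probability} for a toy 1-D walk on the distance."""
    if d == 0:
        # Any user move increases the distance from 0 to 1.
        return {0: 1.0 - 2.0 * move_p, 1: 2.0 * move_p}
    dist = {}
    steps = ((d - 1, move_p), (d, 1.0 - 2.0 * move_p), (min(d + 1, max_dist), move_p))
    for nxt, p in steps:
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist

def expected_value(V, d, max_dist, move_p):
    """Expected value of the next state when the current distance is d."""
    return sum(p * V[n] for n, p in next_distance_distribution(d, max_dist, move_p).items())

def value_iteration(max_dist=10, move_p=0.3, migrate_cost=4.0,
                    cost_per_hop=1.0, gamma=0.9, tol=1e-9):
    """Standard value iteration over distance states 0..max_dist (all costs hypothetical)."""
    V = [0.0] * (max_dist + 1)
    while True:
        # Migrating resets the distance to 0 before the user's next move.
        q_migrate = migrate_cost + gamma * expected_value(V, 0, max_dist, move_p)
        q_stay = [cost_per_hop * d + gamma * expected_value(V, d, max_dist, move_p)
                  for d in range(max_dist + 1)]
        new_V = [min(q, q_migrate) for q in q_stay]
        delta = max(abs(a - b) for a, b in zip(new_V, V))
        V = new_V
        if delta < tol:
            break
    policy = ["migrate" if q_migrate < q else "stay" for q in q_stay]
    return V, policy

V, policy = value_iteration()
```

With these assumed costs the resulting policy keeps the service in place when the user is co-located and migrates once the user is far away; the exact switch point depends on the hypothetical parameters.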

Proceedings ArticleDOI
22 Aug 2019
TL;DR: The authors propose a multi-passage BERT model that globally normalizes answer scores across all passages of the same question; this change enables the QA model to find better answers by utilizing more passages.
Abstract: The BERT model has been successfully applied to open-domain QA tasks. However, previous work trains BERT by viewing passages corresponding to the same question as independent training instances, which may cause incomparable scores for answers from different passages. To tackle this issue, we propose a multi-passage BERT model to globally normalize answer scores across all passages of the same question, and this change enables our QA model to find better answers by utilizing more passages. In addition, we find that splitting articles into passages with the length of 100 words by sliding window improves performance by 4%. By leveraging a passage ranker to select high-quality passages, multi-passage BERT gains an additional 2%. Experiments on four standard benchmarks showed that our multi-passage BERT outperforms all state-of-the-art models on all benchmarks. In particular, on the OpenSQuAD dataset, our model gains 21.4% EM and 21.5% F1 over all non-BERT models, and 5.8% EM and 6.5% F1 over BERT-based models.
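
The difference between per-passage and global normalization can be seen in a small sketch. It assumes each passage yields a list of raw candidate answer scores and that normalization is a softmax; the helper names and the example scores are hypothetical, and the actual model scores answer spans with BERT rather than taking numbers as input.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def per_passage_normalize(passage_scores):
    """Normalize each passage independently (scores not comparable across passages)."""
    return [softmax(scores) for scores in passage_scores]

def global_normalize(passage_scores):
    """Normalize all candidate scores for one question in a single softmax."""
    flat = [s for scores in passage_scores for s in scores]
    probs = softmax(flat)
    out, start = [], 0
    for scores in passage_scores:
        out.append(probs[start:start + len(scores)])
        start += len(scores)
    return out

# Hypothetical raw scores: passage 0 holds a strong candidate, passage 1 only weak ones.
scores = [[5.0, 1.0], [0.0, -1.0]]
local_probs = per_passage_normalize(scores)
global_probs = global_normalize(scores)
```

Under per-passage normalization the weak passage's best candidate still receives a probability above 0.5 within its own passage, whereas under global normalization it competes directly with the strong candidate from the other passage and receives a much smaller share.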

Proceedings ArticleDOI
01 Jun 2019
TL;DR: This work proposes a novel paired convolution that infers the semantic correlation of a pixel pair and uses it to generate a shape mask, together with a shape-variant convolution whose receptive field is controlled by the shape mask, which varies with the appearance of the input.
Abstract: Context is essential for semantic segmentation. Due to the diverse shapes of objects and their complex layout in various scene images, the spatial scales and shapes of contexts for different objects have very large variation. It is thus ineffective or inefficient to aggregate various context information from a predefined fixed region. In this work, we propose to generate a scale- and shape-variant semantic mask for each pixel to confine its contextual region. To this end, we first propose a novel paired convolution to infer the semantic correlation of the pair and, based on that, to generate a shape mask. Using the inferred spatial scope of the contextual region, we propose a shape-variant convolution whose receptive field is controlled by the shape mask, which varies with the appearance of the input. In this way, the proposed network aggregates the context information of a pixel from its semantic-correlated region instead of a predefined fixed region. Furthermore, this work also proposes a labeling denoising model to reduce wrong predictions caused by the noisy low-level features. Without bells and whistles, the proposed segmentation network consistently achieves new state-of-the-art results on six public segmentation datasets.

Proceedings ArticleDOI
10 Feb 2019
TL;DR: A method to generate vectorial representations of visual classification tasks which can be used to reason about the nature of those tasks and their relations, and is demonstrated to be capable of predicting task similarities that match the authors' intuition about semantic and taxonomic relations between different visual tasks.
Abstract: We introduce a method to generate vectorial representations of visual classification tasks which can be used to reason about the nature of those tasks and their relations. Given a dataset with ground-truth labels and a loss function, we process images through a "probe network" and compute an embedding based on estimates of the Fisher information matrix associated with the probe network parameters. This provides a fixed-dimensional embedding of the task that is independent of details such as the number of classes and requires no understanding of the class label semantics. We demonstrate that this embedding is capable of predicting task similarities that match our intuition about semantic and taxonomic relations between different visual tasks. We demonstrate the practical value of this framework for the meta-task of selecting a pre-trained feature extractor for a novel task. We present a simple meta-learning framework for learning a metric on embeddings that is capable of predicting which feature extractors will perform well on which task. Selecting a feature extractor with task embedding yields performance close to the best available feature extractor, with substantially less computational effort than exhaustively training and evaluating all available models.

Proceedings ArticleDOI
22 Jan 2019
TL;DR: This work proposes SAN-CTC, a deep, fully self-attentional network for CTC, and shows it is tractable and competitive for end-to-end speech recognition, and explores how label alphabets affect attention heads and performance.
Abstract: The success of self-attention in NLP has led to recent applications in end-to-end encoder-decoder architectures for speech recognition. Separately, connectionist temporal classification (CTC) has matured as an alignment-free, non-autoregressive approach to sequence transduction, either by itself or in various multitask and decoding frameworks. We propose SAN-CTC, a deep, fully self-attentional network for CTC, and show it is tractable and competitive for end-to-end speech recognition. SAN-CTC trains quickly and outperforms existing CTC models and most encoder-decoder models, with character error rates (CERs) of 4.7% in 1 day on WSJ eval92 and 2.8% in 1 week on LibriSpeech test-clean, with a fixed architecture and one GPU. Similar improvements hold for WERs after LM decoding. We motivate the architecture for speech, evaluate position and down-sampling approaches, and explore how label alphabets (character, phoneme, subword) affect attention heads and performance.

Posted Content
Zhi Zhang, Tong He, Hang Zhang, Zhongyue Zhang, Junyuan Xie, Mu Li1 
TL;DR: This work explores training tweaks that apply to various models including Faster R-CNN and YOLOv3 and that can improve absolute precision by up to 5% compared to state-of-the-art baselines.
Abstract: Training heuristics greatly improve various image classification model accuracies \cite{he2018bag}. Object detection models, however, have more complex neural network structures and optimization targets. The training strategies and pipelines vary dramatically among different models. In this work, we explore training tweaks that apply to various models including Faster R-CNN and YOLOv3. These tweaks do not change the model architectures; therefore, the inference costs remain the same. Nevertheless, our empirical results demonstrate that these freebies can improve absolute precision by up to 5% compared to state-of-the-art baselines.

Proceedings ArticleDOI
27 Oct 2019
TL;DR: Nexus is a fully implemented system that includes cluster-scale resource management, which performs detailed scheduling of GPUs, reasons about groups of DNN invocations that need to be co-scheduled, and moves from the conventional whole-DNN execution model to executing fragments of DNNs.
Abstract: We address the problem of serving Deep Neural Networks (DNNs) efficiently from a cluster of GPUs. In order to realize the promise of very low-cost processing made by accelerators such as GPUs, it is essential to run them at sustained high utilization. Doing so requires cluster-scale resource management that performs detailed scheduling of GPUs, reasoning about groups of DNN invocations that need to be co-scheduled, and moving from the conventional whole-DNN execution model to executing fragments of DNNs. Nexus is a fully implemented system that includes these innovations. In large-scale case studies on 16 GPUs, when required to stay within latency constraints at least 99% of the time, Nexus can process requests at rates 1.8-12.7X higher than state of the art systems can. A long-running multi-application deployment stays within 84% of optimal utilization and, on a 100-GPU cluster, violates latency SLOs on 0.27% of requests.