Open Access · Posted Content
Learning Factored Representations in a Deep Mixture of Experts
TL;DR
The Deep Mixture of Experts is a stacked model with multiple sets of gating and experts, which exponentially increases the number of effective experts by associating each input with a combination of experts at each layer, yet maintains a modest model size.

Abstract
Mixtures of Experts combine the outputs of several "expert" networks, each of which specializes in a different part of the input space. This is achieved by training a "gating" network that maps each input to a distribution over the experts. Such models show promise for building larger networks that are still cheap to compute at test time, and more parallelizable at training time. In this work, we extend the Mixture of Experts to a stacked model, the Deep Mixture of Experts, with multiple sets of gating and experts. This exponentially increases the number of effective experts by associating each input with a combination of experts at each layer, yet maintains a modest model size. On a randomly translated version of the MNIST dataset, we find that the Deep Mixture of Experts automatically learns to develop location-dependent ("where") experts at the first layer, and class-specific ("what") experts at the second layer. In addition, we see that different combinations are in use when the model is applied to a dataset of speech monophones. These results demonstrate effective use of all expert combinations.
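The two-layer architecture the abstract describes can be sketched as follows. This is a minimal numpy illustration of the forward pass only: the layer sizes, the use of linear experts, and the tanh nonlinearity between layers are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, experts, gate_w):
    """One mixture-of-experts layer: the gate maps each input to a
    distribution over experts; the output is the gate-weighted sum
    of the expert outputs."""
    g = softmax(x @ gate_w)                       # (batch, n_experts)
    outs = np.stack([x @ W for W in experts], 1)  # (batch, n_experts, d_out)
    return (g[:, :, None] * outs).sum(axis=1)

d_in, d_hid, d_out, n_exp = 8, 6, 4, 3
experts1 = [rng.normal(size=(d_in, d_hid)) for _ in range(n_exp)]
experts2 = [rng.normal(size=(d_hid, d_out)) for _ in range(n_exp)]
gate1 = rng.normal(size=(d_in, n_exp))
gate2 = rng.normal(size=(d_hid, n_exp))

x = rng.normal(size=(5, d_in))
h = np.tanh(moe_layer(x, experts1, gate1))  # first set of gating + experts
y = moe_layer(h, experts2, gate2)           # second set of gating + experts
print(y.shape)  # (5, 4)
```

Stacking two layers of 3 experts each yields 3 × 3 = 9 effective expert combinations, while the parameter count grows only linearly in the number of layers; this is the exponential-paths-at-modest-size property the abstract refers to.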
Citations
Posted Content
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, Jeffrey Dean
TL;DR: This work introduces a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks, and applies the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora.
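The sparsity mechanism this TL;DR describes amounts to keeping only the k largest gate logits per example and renormalizing over them, so that all but k experts are skipped. A minimal numpy sketch of that top-k gating step (without the noise term the full method also uses; function names are mine, not the paper's):

```python
import numpy as np

def top_k_gating(logits, k):
    """Keep the k largest gate logits per row of a 2D batch, softmax over
    just those, and assign every other expert exactly zero weight."""
    idx = np.argsort(logits, axis=-1)[:, -k:]   # indices of the top-k experts
    masked = np.full_like(logits, -np.inf)
    np.put_along_axis(masked, idx, np.take_along_axis(logits, idx, -1), -1)
    z = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(z)  # exp(-inf) underflows cleanly to 0
    return e / e.sum(axis=-1, keepdims=True)

gates = top_k_gating(np.array([[0.1, 2.0, -1.0, 1.5]]), k=2)
# only experts 1 and 3 get nonzero weight, and those weights sum to 1
```

Because zero-weight experts need not be evaluated at all, total capacity can grow to thousands of experts while per-example compute stays roughly constant.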
Journal ArticleDOI
Deep Multi-Modal Object Detection and Semantic Segmentation for Autonomous Driving: Datasets, Methods, and Challenges
Di Feng, Christian Haase-Schütz, Lars Rosenbaum, Heinz Hertlein, Claudius Gläser, Fabian Timm, Werner Wiesbeck, Klaus Dietmayer
TL;DR: In this article, the authors systematically summarize methodologies and discuss challenges for deep multi-modal object detection and semantic segmentation in autonomous driving and provide an overview of on-board sensors on test vehicles, open datasets, and background information for object detection.
Posted Content
FiLM: Visual Reasoning with a General Conditioning Layer
TL;DR: Feature-wise Linear Modulation (FiLM) is a general-purpose conditioning method for neural networks that influences network computation via a simple, feature-wise affine transformation based on conditioning information.
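The feature-wise affine transformation in that TL;DR is just a per-channel scale and shift whose coefficients come from a separate conditioning network. A minimal numpy sketch of the modulation step alone (the shapes and values here are illustrative; in FiLM proper, gamma and beta are predicted from the conditioning input):

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each channel of a
    (batch, channels, H, W) feature map, broadcasting over spatial dims."""
    return gamma[:, :, None, None] * features + beta[:, :, None, None]

# illustrative shapes: batch of 2, 3 channels, 4x4 feature maps
feats = np.ones((2, 3, 4, 4))
gamma = np.array([[2.0, 1.0, 0.5]] * 2)  # would be output by a conditioning net
beta  = np.zeros((2, 3))
out = film(feats, gamma, beta)
print(out[0, 0, 0, 0])  # 2.0
```

Like the gating in a mixture of experts, this lets one input stream modulate how another is processed, but via continuous per-feature scaling rather than a distribution over discrete experts.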
Posted Content
Gradient Episodic Memory for Continual Learning
TL;DR: In this article, Gradient Episodic Memory (GEM) is proposed for continual learning, where the model observes, once and one by one, examples concerning a sequence of tasks.
Proceedings ArticleDOI
Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts
TL;DR: This work proposes a novel multi-task learning approach, Multi-gate Mixture-of-Experts (MMoE), which explicitly learns to model task relationships from data and demonstrates the performance improvements by MMoE on real tasks including a binary classification benchmark, and a large-scale content recommendation system at Google.
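The structural idea in MMoE is that the experts are shared across tasks while each task gets its own gate, so tasks can weight the shared experts differently. A minimal numpy sketch of that forward pass, under the same illustrative linear-expert assumption as above (function and variable names are mine):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mmoe(x, expert_ws, gate_ws):
    """Multi-gate mixture of experts: one shared pool of experts, one gate
    per task; returns a list with one mixed representation per task."""
    expert_out = np.stack([x @ W for W in expert_ws], axis=1)  # (B, E, D)
    outputs = []
    for gw in gate_ws:                       # each task has its own gate
        g = softmax(x @ gw)                  # (B, E) task-specific weights
        outputs.append((g[:, :, None] * expert_out).sum(axis=1))
    return outputs

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 5))
expert_ws = [rng.normal(size=(5, 3)) for _ in range(4)]  # 4 shared experts
gate_ws = [rng.normal(size=(5, 4)) for _ in range(2)]    # 2 tasks
task_outs = mmoe(x, expert_ws, gate_ws)
print(len(task_outs), task_outs[0].shape)  # 2 (4, 3)
```

When two tasks are closely related their gates tend to select similar expert mixtures, and when they conflict the gates can diverge, which is how the model learns task relationships from data.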
References
Proceedings ArticleDOI
Speech recognition with deep recurrent neural networks
TL;DR: This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Posted Content
Speech Recognition with Deep Recurrent Neural Networks
TL;DR: In this paper, deep recurrent neural networks (RNNs) are used to combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Journal ArticleDOI
Adaptive mixtures of local experts
TL;DR: A new supervised learning procedure for systems composed of many separate networks, each of which learns to handle a subset of the complete set of training cases, so that a complex task can be decomposed into subtasks each solvable by a very simple expert network.
Journal ArticleDOI
Hierarchical mixtures of experts and the EM algorithm
TL;DR: An Expectation-Maximization (EM) algorithm for adjusting the parameters of the tree-structured architecture for supervised learning and an on-line learning algorithm in which the parameters are updated incrementally.
Posted Content
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation
TL;DR: This work considers a small-scale version of conditional computation, where sparse stochastic units form a distributed representation of gaters that can turn off, in combinatorially many ways, large chunks of the computation performed in the rest of the neural network.
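The hard 0/1 gates in that TL;DR are not differentiable, which is the problem the gradient estimators in this reference address. One of the estimators discussed, the straight-through estimator, can be sketched in numpy as two functions; the autodiff machinery is omitted and the function names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_gates(p):
    """Forward pass: sample hard 0/1 gates from probabilities p; a gate of
    0 switches off the corresponding block of computation entirely."""
    return (rng.random(p.shape) < p).astype(float)

def straight_through_grad(upstream):
    """Backward pass sketch: the straight-through estimator treats the
    non-differentiable sampling step as the identity, passing the upstream
    gradient through to the gate probabilities unchanged."""
    return upstream

p = np.array([0.9, 0.1, 0.5])
g = hard_gates(p)                  # each entry is exactly 0.0 or 1.0
dL_dg = np.array([0.2, -0.3, 0.7]) # gradient arriving at the gates
dL_dp = straight_through_grad(dL_dg)
```

This biased but simple estimator is what makes hard conditional computation trainable by backpropagation, and it is a conceptual precursor to the sparse gating used in later mixture-of-experts layers.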