Open Access · Posted Content

Learning Factored Representations in a Deep Mixture of Experts

TLDR
The Deep Mixture of Experts, as proposed in this paper, is a stacked model with multiple sets of gating and expert networks; it exponentially increases the number of effective experts by associating each input with a combination of experts at each layer, yet maintains a modest model size.
Abstract
Mixtures of Experts combine the outputs of several "expert" networks, each of which specializes in a different part of the input space. This is achieved by training a "gating" network that maps each input to a distribution over the experts. Such models show promise for building larger networks that are still cheap to compute at test time, and more parallelizable at training time. In this work, we extend the Mixture of Experts to a stacked model, the Deep Mixture of Experts, with multiple sets of gating and experts. This exponentially increases the number of effective experts by associating each input with a combination of experts at each layer, yet maintains a modest model size. On a randomly translated version of the MNIST dataset, we find that the Deep Mixture of Experts automatically learns to develop location-dependent ("where") experts at the first layer, and class-specific ("what") experts at the second layer. In addition, we see that the different combinations are in use when the model is applied to a dataset of speech monophones. These results demonstrate effective use of all expert combinations.
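The stacked architecture described in the abstract lends itself to a compact sketch. Below is a minimal, illustrative PyTorch implementation of a two-layer mixture of experts with softmax gating; the layer widths, expert counts, and final classifier are assumptions made for illustration, not the paper's exact configuration.

```python
# Minimal sketch of a two-layer Deep Mixture of Experts forward pass (PyTorch).
# All sizes below are illustrative, not the configuration used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """One mixture-of-experts layer: a gating net weights the expert outputs."""
    def __init__(self, in_dim, out_dim, num_experts):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(in_dim, out_dim) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(in_dim, num_experts)

    def forward(self, x):
        weights = F.softmax(self.gate(x), dim=-1)                    # (batch, E)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, E, out_dim)
        return torch.einsum("be,beo->bo", weights, outputs)          # weighted sum of experts

class DeepMoE(nn.Module):
    """Stacking two MoE layers yields e1 * e2 effective expert combinations."""
    def __init__(self, in_dim=784, hidden=128, num_classes=10, e1=4, e2=4):
        super().__init__()
        self.layer1 = MoELayer(in_dim, hidden, e1)   # e.g. location-dependent "where" experts
        self.layer2 = MoELayer(hidden, hidden, e2)   # e.g. class-specific "what" experts
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x):
        h = torch.relu(self.layer1(x))
        h = torch.relu(self.layer2(h))
        return self.classifier(h)

model = DeepMoE()
logits = model(torch.randn(32, 784))   # e.g. flattened, randomly translated MNIST digits
```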


Citations
Posted Content

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

TL;DR: This work introduces a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks, and applies the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora.
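For intuition, here is a rough sketch of the sparse top-k gating idea summarized above, in PyTorch; the expert count, the value of k, and the omission of the noise and load-balancing terms are simplifying assumptions, not the paper's exact gating network.

```python
import torch
import torch.nn.functional as F

def topk_gate(x, gate_weights, k=2):
    """Sparse gating: keep only the top-k expert scores per example,
    renormalize them with a softmax, and zero out the rest, so only
    k experts need to be evaluated for each input."""
    scores = x @ gate_weights                      # (batch, num_experts)
    topk_vals, topk_idx = scores.topk(k, dim=-1)   # (batch, k)
    sparse_weights = torch.zeros_like(scores)
    sparse_weights.scatter_(-1, topk_idx, F.softmax(topk_vals, dim=-1))
    return sparse_weights                          # most entries are exactly zero

gate_w = torch.randn(16, 64)                       # 16-dim input, 64 experts (illustrative)
weights = topk_gate(torch.randn(8, 16), gate_w, k=2)
```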
Journal ArticleDOI

Deep Multi-Modal Object Detection and Semantic Segmentation for Autonomous Driving: Datasets, Methods, and Challenges

TL;DR: In this article, the authors systematically summarize methodologies and discuss challenges for deep multi-modal object detection and semantic segmentation in autonomous driving and provide an overview of on-board sensors on test vehicles, open datasets, and background information for object detection.
Posted Content

FiLM: Visual Reasoning with a General Conditioning Layer

TL;DR: Feature-wise linear modulation (FiLM) as mentioned in this paper is a general-purpose conditioning method for neural networks, which can influence neural network computation via a simple, feature-wise affine transformation based on conditioning information.
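The feature-wise affine transformation FiLM applies can be sketched in a few lines; the tensor shapes and the single linear conditioning network below are illustrative assumptions rather than the paper's full architecture.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation: scale and shift each feature map
    with (gamma, beta) predicted from the conditioning input."""
    def __init__(self, cond_dim, num_features):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_features)

    def forward(self, features, cond):
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        # Broadcast over the spatial dimensions of a (batch, C, H, W) feature map.
        return gamma[:, :, None, None] * features + beta[:, :, None, None]

film = FiLM(cond_dim=32, num_features=64)
out = film(torch.randn(8, 64, 14, 14), torch.randn(8, 32))
```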
Posted Content

Gradient Episodic Memory for Continual Learning

TL;DR: In this article, Gradient Episodic Memory (GEM) is proposed for continual learning, where the model observes, once and one by one, examples concerning a sequence of tasks.
Proceedings ArticleDOI

Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts

TL;DR: This work proposes a novel multi-task learning approach, Multi-gate Mixture-of-Experts (MMoE), which explicitly learns to model task relationships from data and demonstrates the performance improvements by MMoE on real tasks including a binary classification benchmark, and a large-scale content recommendation system at Google.
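A minimal sketch of the multi-gate idea follows, assuming linear experts, one softmax gate per task, and single-output task towers purely for illustration; the real system uses deeper experts and towers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMoE(nn.Module):
    """Shared experts with one softmax gate per task, so each task
    learns its own mixture over the common expert pool."""
    def __init__(self, in_dim, expert_dim, num_experts, num_tasks):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(in_dim, expert_dim) for _ in range(num_experts)]
        )
        self.gates = nn.ModuleList(
            [nn.Linear(in_dim, num_experts) for _ in range(num_tasks)]
        )
        self.towers = nn.ModuleList(
            [nn.Linear(expert_dim, 1) for _ in range(num_tasks)]
        )

    def forward(self, x):
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, D)
        task_outputs = []
        for gate, tower in zip(self.gates, self.towers):
            w = F.softmax(gate(x), dim=-1)                      # (B, E), task-specific
            mixed = torch.einsum("be,bed->bd", w, expert_out)   # (B, D)
            task_outputs.append(tower(mixed))
        return task_outputs

model = MMoE(in_dim=20, expert_dim=16, num_experts=4, num_tasks=2)
preds = model(torch.randn(8, 20))   # list of two (8, 1) per-task predictions
```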
References
Proceedings ArticleDOI

Speech recognition with deep recurrent neural networks

TL;DR: This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Journal ArticleDOI

Adaptive mixtures of local experts

TL;DR: A new supervised learning procedure for systems composed of many separate networks, each of which learns to handle a subset of the complete set of training cases; the procedure is shown to divide a task into subtasks, each of which can be solved by a very simple expert network.
Journal ArticleDOI

Hierarchical mixtures of experts and the EM algorithm

TL;DR: Presents an Expectation-Maximization (EM) algorithm for adjusting the parameters of a tree-structured mixture-of-experts architecture for supervised learning, along with an on-line learning algorithm in which the parameters are updated incrementally.
Posted Content

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

TL;DR: This work considers a small-scale version of {\em conditional computation}, in which sparse stochastic units form a distributed representation of gaters that can turn off large chunks of the computation performed in the rest of the neural network, in combinatorially many ways.
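As a rough illustration of propagating gradients through stochastic binary gaters, here is a common straight-through estimator variant in PyTorch; it is one estimator in this line of work, not necessarily the exact one proposed in the paper.

```python
import torch

def bernoulli_straight_through(logits):
    """Stochastic binary gate with a straight-through gradient estimator:
    sample hard 0/1 gates on the forward pass, but let gradients flow
    through the underlying sigmoid probabilities on the backward pass."""
    probs = torch.sigmoid(logits)
    hard = torch.bernoulli(probs)
    # (hard - probs).detach() + probs equals `hard` in the forward pass,
    # while its gradient is that of `probs`.
    return (hard - probs).detach() + probs

gates = bernoulli_straight_through(torch.randn(8, 16, requires_grad=True))
```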