Open Access · Posted Content

Learning Factored Representations in a Deep Mixture of Experts

TLDR
The Deep Mixture of Experts, as proposed in this paper, is a stacked model with multiple sets of gating and expert networks; it exponentially increases the number of effective experts by associating each input with a combination of experts at each layer, yet maintains a modest model size.
Abstract
Mixtures of Experts combine the outputs of several "expert" networks, each of which specializes in a different part of the input space. This is achieved by training a "gating" network that maps each input to a distribution over the experts. Such models show promise for building larger networks that are still cheap to compute at test time, and more parallelizable at training time. In this work, we extend the Mixture of Experts to a stacked model, the Deep Mixture of Experts, with multiple sets of gating and experts. This exponentially increases the number of effective experts by associating each input with a combination of experts at each layer, yet maintains a modest model size. On a randomly translated version of the MNIST dataset, we find that the Deep Mixture of Experts automatically learns to develop location-dependent ("where") experts at the first layer, and class-specific ("what") experts at the second layer. In addition, we see that the different combinations are in use when the model is applied to a dataset of speech monophones. These results demonstrate effective use of all expert combinations.
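The stacked architecture described in the abstract lends itself to a compact sketch. Below is a minimal, illustrative PyTorch implementation of a two-layer mixture of experts with softmax gating; the layer widths, expert counts, and final classifier are assumptions made for illustration, not the paper's exact configuration.

```python
# Minimal sketch of a two-layer Deep Mixture of Experts forward pass (PyTorch).
# All sizes below are illustrative, not the configuration used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """One mixture-of-experts layer: a gating net weights the expert outputs."""
    def __init__(self, in_dim, out_dim, num_experts):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(in_dim, out_dim) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(in_dim, num_experts)

    def forward(self, x):
        weights = F.softmax(self.gate(x), dim=-1)                    # (batch, E)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, E, out_dim)
        return torch.einsum("be,beo->bo", weights, outputs)          # weighted sum of experts

class DeepMoE(nn.Module):
    """Stacking two MoE layers yields e1 * e2 effective expert combinations."""
    def __init__(self, in_dim=784, hidden=128, num_classes=10, e1=4, e2=4):
        super().__init__()
        self.layer1 = MoELayer(in_dim, hidden, e1)   # e.g. location-dependent "where" experts
        self.layer2 = MoELayer(hidden, hidden, e2)   # e.g. class-specific "what" experts
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x):
        h = torch.relu(self.layer1(x))
        h = torch.relu(self.layer2(h))
        return self.classifier(h)

model = DeepMoE()
logits = model(torch.randn(32, 784))   # e.g. flattened, randomly translated MNIST digits
```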


Citations
Posted Content

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

TL;DR: This work introduces a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks, and applies the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora.
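For intuition, here is a rough sketch of the sparse top-k gating idea summarized above, in PyTorch; the expert count, the value of k, and the omission of the noise and load-balancing terms are simplifying assumptions, not the paper's exact gating network.

```python
import torch
import torch.nn.functional as F

def topk_gate(x, gate_weights, k=2):
    """Sparse gating: keep only the top-k expert scores per example,
    renormalize them with a softmax, and zero out the rest, so only
    k experts need to be evaluated for each input."""
    scores = x @ gate_weights                      # (batch, num_experts)
    topk_vals, topk_idx = scores.topk(k, dim=-1)   # (batch, k)
    sparse_weights = torch.zeros_like(scores)
    sparse_weights.scatter_(-1, topk_idx, F.softmax(topk_vals, dim=-1))
    return sparse_weights                          # most entries are exactly zero

gate_w = torch.randn(16, 64)                       # 16-dim input, 64 experts (illustrative)
weights = topk_gate(torch.randn(8, 16), gate_w, k=2)
```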
Journal ArticleDOI

Deep Multi-Modal Object Detection and Semantic Segmentation for Autonomous Driving: Datasets, Methods, and Challenges

TL;DR: In this article, the authors systematically summarize methodologies and discuss challenges for deep multi-modal object detection and semantic segmentation in autonomous driving and provide an overview of on-board sensors on test vehicles, open datasets, and background information for object detection.
Posted Content

FiLM: Visual Reasoning with a General Conditioning Layer

TL;DR: Feature-wise linear modulation (FiLM) as mentioned in this paper is a general-purpose conditioning method for neural networks, which can influence neural network computation via a simple, feature-wise affine transformation based on conditioning information.
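The feature-wise affine transformation FiLM applies can be sketched in a few lines; the tensor shapes and the single linear conditioning network below are illustrative assumptions rather than the paper's full architecture.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation: scale and shift each feature map
    with (gamma, beta) predicted from the conditioning input."""
    def __init__(self, cond_dim, num_features):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_features)

    def forward(self, features, cond):
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        # Broadcast over the spatial dimensions of a (batch, C, H, W) feature map.
        return gamma[:, :, None, None] * features + beta[:, :, None, None]

film = FiLM(cond_dim=32, num_features=64)
out = film(torch.randn(8, 64, 14, 14), torch.randn(8, 32))
```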
Posted Content

Gradient Episodic Memory for Continual Learning

TL;DR: In this article, Gradient Episodic Memory (GEM) is proposed for continual learning, where the model observes, once and one by one, examples concerning a sequence of tasks.
Proceedings ArticleDOI

Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts

TL;DR: This work proposes a novel multi-task learning approach, Multi-gate Mixture-of-Experts (MMoE), which explicitly learns to model task relationships from data and demonstrates the performance improvements by MMoE on real tasks including a binary classification benchmark, and a large-scale content recommendation system at Google.
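A minimal sketch of the multi-gate idea follows, assuming linear experts, one softmax gate per task, and single-output task towers purely for illustration; the real system uses deeper experts and towers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMoE(nn.Module):
    """Shared experts with one softmax gate per task, so each task
    learns its own mixture over the common expert pool."""
    def __init__(self, in_dim, expert_dim, num_experts, num_tasks):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(in_dim, expert_dim) for _ in range(num_experts)]
        )
        self.gates = nn.ModuleList(
            [nn.Linear(in_dim, num_experts) for _ in range(num_tasks)]
        )
        self.towers = nn.ModuleList(
            [nn.Linear(expert_dim, 1) for _ in range(num_tasks)]
        )

    def forward(self, x):
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, D)
        task_outputs = []
        for gate, tower in zip(self.gates, self.towers):
            w = F.softmax(gate(x), dim=-1)                      # (B, E), task-specific
            mixed = torch.einsum("be,bed->bd", w, expert_out)   # (B, D)
            task_outputs.append(tower(mixed))
        return task_outputs

model = MMoE(in_dim=20, expert_dim=16, num_experts=4, num_tasks=2)
preds = model(torch.randn(8, 20))   # list of two (8, 1) per-task predictions
```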
References
Proceedings ArticleDOI

Speech recognition with deep recurrent neural networks

TL;DR: This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Journal ArticleDOI

Adaptive mixtures of local experts

TL;DR: A new supervised learning procedure for systems composed of many separate networks, each of which learns to handle a subset of the complete set of training cases; the procedure is shown to divide a task into subtasks, each of which can be solved by a very simple expert network.
Journal ArticleDOI

Hierarchical mixtures of experts and the EM algorithm

TL;DR: Presents an Expectation-Maximization (EM) algorithm for adjusting the parameters of a tree-structured mixture-of-experts architecture for supervised learning, along with an on-line learning algorithm in which the parameters are updated incrementally.
Posted Content

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

TL;DR: This work considers a small-scale version of {\em conditional computation}, in which sparse stochastic units form a distributed representation of gaters that can turn off large chunks of the computation performed in the rest of the neural network, in combinatorially many ways.
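As a rough illustration of propagating gradients through stochastic binary gaters, here is a common straight-through estimator variant in PyTorch; it is one estimator in this line of work, not necessarily the exact one proposed in the paper.

```python
import torch

def bernoulli_straight_through(logits):
    """Stochastic binary gate with a straight-through gradient estimator:
    sample hard 0/1 gates on the forward pass, but let gradients flow
    through the underlying sigmoid probabilities on the backward pass."""
    probs = torch.sigmoid(logits)
    hard = torch.bernoulli(probs)
    # (hard - probs).detach() + probs equals `hard` in the forward pass,
    # while its gradient is that of `probs`.
    return (hard - probs).detach() + probs

gates = bernoulli_straight_through(torch.randn(8, 16, requires_grad=True))
```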