Distilling the Knowledge in a Neural Network
Citations
[...]
38,208 citations
14,406 citations
Cites background or methods from "Distilling the Knowledge in a Neural Network"
...Distillation [9] works by training the classifier to emulate the outputs of a larger model instead of the ground-truth labels, hence enabling training from large (and potentially infinite) unlabeled datasets....
[...]
...In a face attribute classification task, we demonstrate a synergistic relationship between MobileNet and distillation [9], a knowledge transfer technique for deep networks....
[...]
...Another method for training small networks is distillation [9] which uses a larger network to teach a smaller network....
[...]
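The three excerpts above all use distillation the same way: a small student network is trained to match the output distribution of a larger teacher rather than the ground-truth labels, so even unlabeled data can supervise it. A minimal PyTorch sketch of that soft-target objective follows; the `teacher`, `student`, and temperature default are illustrative assumptions, not taken from any of the citing papers.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between temperature-softened teacher and student
    distributions; scaling by T**2 keeps gradient magnitudes roughly
    constant as T varies."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

def train_step(student, teacher, x, optimizer, T=2.0):
    # No ground-truth labels are used, so x can come from a large
    # (potentially infinite) pool of unlabeled data.
    with torch.no_grad():
        teacher_logits = teacher(x)  # fixed, pretrained teacher
    loss = distillation_loss(student(x), teacher_logits, T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```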
10,422 citations
Cites background from "Distilling the Knowledge in a Neural Network"
...Since its first introduction, Inception has been one of the best performing family of models on the ImageNet dataset [14], as well as internal datasets in use at Google, in particular JFT [5]....
[...]
...in [5], which comprises over 350 million high-resolution images annotated with labels from a set of 17,000 classes....
[...]
10,132 citations
7,113 citations
Cites methods from "Distilling the Knowledge in a Neural Network"
...Pretraining on JFT: Similar to [10], we also employ the proposed Xception model that has been pretrained on both ImageNet-1k [62] and JFT-300M dataset [29, 12, 69], which brings extra 0....
[...]
References
73,978 citations
33,597 citations
"Distilling the Knowledge in a Neura..." refers methods in this paper
...The cumbersome model could be an ensemble of separately trained models or a single very large model trained with a very strong regularizer such as dropout [9]....
[...]
9,091 citations
"Distilling the Knowledge in a Neura..." refers background in this paper
...State-of-the-art ASR systems currently use DNNs to map a (short) temporal context of features derived from the waveform to a probability distribution over the discrete states of a Hidden Markov Model (HMM) [4]....
[...]
...More specifically, the DNN produces a probability distribution over clusters of tri-phone states at each time and a decoder then finds a path through the HMM states that is the best compromise between using high probability states and producing a transcription that is probable under the language model....
[...]
...The input is 26 frames of 40 Mel-scaled filterbank coefficients with a 10ms advance per frame, and we predict the HMM state of the 21st frame....
[...]
...We use an architecture with 8 hidden layers each containing 2560 rectified linear units and a final softmax layer with 14,000 labels (HMM targets $h_t$)....
[...]
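The excerpts above pin down the acoustic model's shape: 26 frames of 40 filterbank coefficients in, 8 hidden layers of 2560 rectified linear units, and a 14,000-way softmax over HMM targets. A sketch of that architecture in PyTorch, assuming the input window is simply flattened into one vector (the excerpt does not say how the frames are combined):

```python
import torch.nn as nn

INPUT_DIM = 26 * 40       # 26 frames x 40 Mel-scaled filterbank coefficients
HIDDEN_DIM = 2560         # 8 hidden layers of rectified linear units
NUM_HMM_TARGETS = 14_000  # final softmax over HMM targets h_t

layers, dim = [nn.Flatten()], INPUT_DIM
for _ in range(8):
    layers += [nn.Linear(dim, HIDDEN_DIM), nn.ReLU()]
    dim = HIDDEN_DIM
layers.append(nn.Linear(dim, NUM_HMM_TARGETS))  # logits; softmax applied in the loss

acoustic_model = nn.Sequential(*layers)
```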
...Although it is possible (and desirable) to train the DNN in such a way that the decoder (and, thus, the language model) is taken into account by marginalizing over all possible paths, it is common to train the DNN to perform frame-by-frame classification by (locally) minimizing the cross entropy between the predictions made by the net and the labels given by a forced alignment with the ground truth sequence of states for each observation: $\theta = \arg\max_{\theta'} P(h_t \mid s_t; \theta')$, where $\theta$ are the parameters of our acoustic model $P$, which maps acoustic observations at time $t$, $s_t$, to a probability, $P(h_t \mid s_t; \theta')$, of the "correct" HMM state $h_t$, which is determined by a forced alignment with the correct sequence of words....
[...]
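In code, the frame-by-frame objective $\theta = \arg\max_{\theta'} P(h_t \mid s_t; \theta')$ is ordinary cross entropy between the network's distribution over HMM states and the forced-alignment label of each frame. A sketch, assuming a model like the one above and hypothetical tensors `s_t` (acoustic windows) and `h_t` (aligned state indices):

```python
import torch
import torch.nn.functional as F

def frame_cross_entropy(acoustic_model, s_t, h_t):
    """s_t: (batch, 26, 40) windows of filterbank frames;
    h_t: (batch,) forced-alignment HMM-state indices for the central frame.
    Minimizing this cross entropy locally maximizes P(h_t | s_t; theta)."""
    logits = acoustic_model(s_t)
    return F.cross_entropy(logits, h_t)
```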
6,899 citations
"Distilling the Knowledge in a Neura..." refers methods in this paper
...The net was strongly regularized using dropout and weight-constraints as described in [5]....
[...]
...For the distillation we tried temperatures of [1, 2, 5, 10] and used a relative weight of 0....
[...]
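The excerpt sweeps the distillation temperature and weights the soft-target term against the hard-label cross entropy (the exact relative weight is truncated above). A sketch of that combined objective; `soft_weight` is a placeholder for the truncated value:

```python
import torch.nn.functional as F

TEMPERATURES = [1.0, 2.0, 5.0, 10.0]  # swept as in the excerpt

def combined_loss(student_logits, teacher_logits, labels, T, soft_weight):
    """Weighted sum of the soft-target (distillation) term and the usual
    hard-label cross entropy; soft_weight stands in for the truncated
    relative weight reported above."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return soft_weight * soft + (1.0 - soft_weight) * hard
```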
5,679 citations
"Distilling the Knowledge in a Neura..." refers background in this paper
...A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions [3]....
[...]
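That excerpt states the ensemble baseline that distillation aims to compress: train several models independently on the same data and average their predictions. A minimal sketch, assuming classifiers that output logits:

```python
import torch

def ensemble_predict(models, x):
    """Average the class probabilities of independently trained models;
    a simple way to improve almost any classifier."""
    with torch.no_grad():
        probs = [torch.softmax(m(x), dim=-1) for m in models]
    return torch.stack(probs).mean(dim=0)
```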