Distilling the Knowledge in a Neural Network
Citations
2,393 citations
2,391 citations
2,366 citations
Cites background from "Distilling the Knowledge in a Neura..."
...Saxe, McClelland, and Ganguli (2013); Sussillo and Abbott (2014); Hinton, Vinyals, and Dean (2015); Romero et al. (2015); and Srivastava (2015a, 2015b) can be referred to for other appropriate techniques....
[...]
2,291 citations
Additional excerpts
...A related approach is discussed in [202]....
[...]
2,258 citations
Additional excerpts
...Distillation is a model compression technique that transfers information (dark knowledge) from deep networks (the ‘‘teacher’’) to shallow networks (the ‘‘student’’) [121], [122]....
[...]
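The teacher-to-student transfer the excerpt describes is commonly trained with a weighted combination of a softened cross entropy against the teacher's distribution and the usual cross entropy against the hard label. A minimal NumPy sketch, assuming plain logit vectors (the function names and example weighting are illustrative, not from the excerpt):

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; higher T yields a softer distribution.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=5.0, alpha=0.5):
    # Soft term: cross entropy between the teacher's and the student's
    # temperature-T distributions. Hard term: cross entropy with the
    # ground-truth label at T = 1.
    p_teacher = softmax(teacher_logits, T)
    soft = -np.sum(p_teacher * np.log(softmax(student_logits, T)))
    hard = -np.log(softmax(student_logits)[hard_label])
    # The T**2 factor keeps soft-target gradient magnitudes comparable
    # across temperatures.
    return alpha * T**2 * soft + (1 - alpha) * hard
```

A student whose logits match the teacher's incurs a lower loss than one whose ranking disagrees with both the teacher and the label.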
References
73,978 citations
33,597 citations
"Distilling the Knowledge in a Neura..." refers to methods in this paper
...The cumbersome model could be an ensemble of separately trained models or a single very large model trained with a very strong regularizer such as dropout [9]....
[...]
9,091 citations
"Distilling the Knowledge in a Neura..." refers to background in this paper
...State-of-the-art ASR systems currently use DNNs to map a (short) temporal context of features derived from the waveform to a probability distribution over the discrete states of a Hidden Markov Model (HMM) [4]....
[...]
...More specifically, the DNN produces a probability distribution over clusters of tri-phone states at each time and a decoder then finds a path through the HMM states that is the best compromise between using high probability states and producing a transcription that is probable under the language model....
[...]
...The input is 26 frames of 40 Mel-scaled filterbank coefficients with a 10ms advance per frame, and we predict the HMM state of the 21st frame....
[...]
...We use an architecture with 8 hidden layers each containing 2560 rectified linear units and a final softmax layer with 14,000 labels (HMM targets h_t)....
[...]
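The architecture in the excerpt (26 × 40 = 1040 inputs, eight hidden layers of 2560 rectified linear units, a 14,000-way softmax) pins down the size of a fully connected stack. A small sketch that counts weights and biases under that assumption (dense layers only, no convolution or factorization), giving roughly 84 million parameters:

```python
def dnn_param_count(in_dim=26 * 40, hidden=2560, n_hidden=8, out_dim=14000):
    # Weights plus biases of a fully connected stack:
    # 1040 inputs -> 8 hidden layers of 2560 units -> 14,000-way softmax.
    dims = [in_dim] + [hidden] * n_hidden + [out_dim]
    return sum(a * b + b for a, b in zip(dims, dims[1:]))
```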
...Although it is possible (and desirable) to train the DNN in such a way that the decoder (and, thus, the language model) is taken into account by marginalizing over all possible paths, it is common to train the DNN to perform frame-by-frame classification by (locally) minimizing the cross entropy between the predictions made by the net and the labels given by a forced alignment with the ground truth sequence of states for each observation: θ = argmax_{θ′} P(h_t | s_t; θ′), where θ are the parameters of our acoustic model P, which maps the acoustic observation at time t, s_t, to a probability P(h_t | s_t; θ′) of the “correct” HMM state h_t, determined by a forced alignment with the correct sequence of words....
[...]
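The frame-by-frame cross-entropy objective in the excerpt amounts to summing, over frames, the negative log-probability the net assigns to the forced-alignment state h_t. A minimal NumPy sketch of that loss (names are illustrative):

```python
import numpy as np

def frame_nll(logits, alignment):
    # Per-frame log-softmax over HMM states, then pick out the
    # forced-alignment state h_t for each frame t; the sum over frames
    # is the frame-level cross entropy being minimized.
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(alignment)), alignment].sum()
```

With uniform logits over K states, each frame contributes log K, so the loss for T frames is T log K.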
6,899 citations
"Distilling the Knowledge in a Neura..." refers to methods in this paper
...The net was strongly regularized using dropout and weight-constraints as described in [5]....
[...]
...For the distillation we tried temperatures of [1, 2, 5, 10] and used a relative weight of 0....
[...]
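Dividing the logits by a temperature T before the softmax, as in the temperature sweep [1, 2, 5, 10] the excerpt mentions, spreads probability mass over more classes as T grows. A short illustration (the logit values are made up):

```python
import numpy as np

def softened(logits, T):
    # Divide logits by T before the softmax; large T flattens the output.
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([5.0, 2.0, 0.5])
for T in [1, 2, 5, 10]:
    print(T, np.round(softened(logits, T), 3))
```

At T = 1 almost all mass sits on the top class; by T = 10 the distribution is close to uniform, exposing the relative probabilities of the "wrong" classes that soft targets carry.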
5,679 citations
"Distilling the Knowledge in a Neura..." refers to background in this paper
...A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions [3]....
[...]
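The ensembling the excerpt describes is just an element-wise average of the individual models' predictive distributions. A minimal sketch with made-up softmax outputs from three models:

```python
import numpy as np

def ensemble_predict(prob_list):
    # Element-wise mean of each model's predictive distribution;
    # the result is itself a valid probability distribution.
    return np.mean(prob_list, axis=0)

# Hypothetical softmax outputs of three separately trained models
# over the same three classes.
probs = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.6, 0.3, 0.1],
]
avg = ensemble_predict(probs)
```

Distillation aims to compress exactly this averaged prediction into a single model so that deployment does not require running every ensemble member.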