arXiv: https://arxiv.org/pdf/1503.02531.pdf
Slides: https://www.ttic.edu/dl/dark14.pdf
What is Geoffrey Hinton’s Dark Knowledge?
It is a technique for distilling the knowledge in an ensemble of models (or in one large model) into a single, smaller model.
It builds on the ‘model compression’ paper (KDD 2006) by Rich Caruana and colleagues http://www.cs.cornell.edu/~caruana/
http://www.cs.cornell.edu/~caruana/compression.kdd06.pdf
When you train a smaller network to get the same results as a bigger network, there is a distinction between hard and soft targets. Hard targets are single class labels; soft targets are the full probability distribution that the larger network assigns over all classes. If you train the smaller network only on hard targets, you lose the knowledge encoded in the ‘softer targets’: the relative probabilities of the wrong classes, which show how likely a class is to be mistaken for each of the others. By adding a temperature T to the softmax function at the end of the classification network, so that each logit is divided by T before exponentiation, these small probabilities are raised enough for the smaller network to learn from them.
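A minimal sketch of the softened softmax, assuming the temperature form from the arXiv paper (each logit divided by a temperature T before the softmax). The function names, the example logits, and the choice of T = 5 are illustrative assumptions, not taken from the paper or slides.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_cross_entropy(teacher_logits, student_logits, temperature):
    """Cross-entropy between the softened teacher and student distributions."""
    p = softmax_with_temperature(teacher_logits, temperature)
    q = softmax_with_temperature(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


# Hypothetical logits from a large network for three classes.
teacher_logits = [9.0, 4.0, 1.0]

hard_like = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=5.0)

# Hypothetical student logits: one close to the teacher, one far from it.
loss_close = soft_target_cross_entropy(teacher_logits, [8.0, 4.0, 1.5], 5.0)
loss_far = soft_target_cross_entropy(teacher_logits, [1.0, 4.0, 9.0], 5.0)
```

At T = 1 almost all of the probability sits on the first class; at T = 5 the second and third classes receive visibly more mass, which is the information about likely confusions that the smaller network is trained to match. The student whose logits are closer to the teacher's gets the lower soft-target loss.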