NN: Dropout

Research on dropout in deep learning defines it as a technique that randomly modifies neural network parameters or activations during training or inference.

Dropout was introduced in 2012 by Hinton et al. as an effective technique to avoid overfitting in feedforward neural networks. During each training iteration, each neuron is removed from the network with some fixed probability. Once trained, the full network with all neurons is used for inference, with each neuron's output multiplied by the probability that the neuron was retained during training. Dropout is applied to all layers except the output layer.
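
A minimal sketch of this procedure, assuming NumPy and an illustrative drop probability of 0.5 (the function and variable names here are not from the original paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(x, drop_prob=0.5, training=True):
    """Standard (non-inverted) dropout as described above.

    During training, each activation is zeroed independently with
    probability drop_prob. At inference the full layer is used, and
    activations are scaled by the retention probability (1 - drop_prob)
    so their expected magnitude matches training.
    """
    keep_prob = 1.0 - drop_prob
    if training:
        mask = rng.random(x.shape) < keep_prob   # 1 keeps the unit, 0 drops it
        return x * mask
    return x * keep_prob

# Toy usage: hidden activations for a batch of 4 examples with 6 units each.
h = rng.standard_normal((4, 6))
h_train = dropout_layer(h, training=True)    # random units zeroed
h_test = dropout_layer(h, training=False)    # full layer, outputs scaled
```

Many modern frameworks implement the equivalent "inverted" variant, which scales activations by 1/keep_prob during training so that no scaling is needed at inference.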

There are two major theoretical interpretations of dropout: 1) it implicitly averages over an ensemble of neural networks (similar to the bagging concept), and 2) it acts as a simplification of Bayesian machine learning models.

In the Bayesian interpretation, an ideal Bayesian model assumes a prior distribution over the model parameters, determines the posterior distribution of these parameters from a training set, and, once trained, marginalizes over this distribution to perform inference on a new input. This is computationally expensive, so approximations are commonly used to simplify the process. Many studies have proposed that training with dropout can be interpreted as using a Bayesian model with certain approximations. Gal and Ghahramani 2016 showed that training a neural network with standard dropout is equivalent to optimizing a variational objective between an approximate distribution and the posterior of a deep Gaussian process, which is essentially a Bayesian machine learning model.
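
Under this interpretation, the marginalization step can be approximated by keeping dropout active at test time and averaging several stochastic forward passes (often called Monte Carlo dropout). The sketch below illustrates the idea with a toy two-layer network whose weights are random placeholders; it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "trained" weights for a two-layer network (random values, for illustration only).
W1, b1 = rng.standard_normal((6, 16)), np.zeros(16)
W2, b2 = rng.standard_normal((16, 3)), np.zeros(3)

def stochastic_forward(x, drop_prob=0.5):
    """One forward pass with dropout kept active, i.e. one sample from the
    approximate posterior in the Bayesian interpretation."""
    h = np.maximum(x @ W1 + b1, 0.0)                    # ReLU hidden layer
    h = h * (rng.random(h.shape) < (1.0 - drop_prob))   # dropout mask stays on
    return h @ W2 + b2

def mc_dropout_predict(x, n_samples=50):
    """Approximate the Bayesian marginalization by averaging stochastic passes;
    the sample variance gives a rough uncertainty estimate."""
    samples = np.stack([stochastic_forward(x) for _ in range(n_samples)])
    return samples.mean(axis=0), samples.var(axis=0)

x = rng.standard_normal((1, 6))
pred_mean, pred_var = mc_dropout_predict(x)
```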

For RNNs (recurrent neural networks), especially LSTMs (long short-term memory networks), an alternative technique that preserves memory in the LSTM while still generating different dropout masks for each sample was proposed by Semeniuta et al. 2016. This technique applies dropout to the part of the network that updates the memory, not to the state itself. Thus, if a neuron is dropped, it simply does not contribute to the network's memory at that step, rather than having part of the state erased.
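
A sketch of this idea for a single LSTM step, assuming NumPy; the parameter names, weight shapes, and drop probability are placeholders rather than the configuration used by Semeniuta et al.:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_recurrent_dropout(x, h_prev, c_prev, params, drop_prob=0.25, training=True):
    """One LSTM step where dropout is applied only to the candidate update g,
    so the memory cell c itself is never masked: a dropped unit simply adds
    nothing to memory at this step."""
    z = np.concatenate([x, h_prev])
    i = sigmoid(params["Wi"] @ z + params["bi"])    # input gate
    f = sigmoid(params["Wf"] @ z + params["bf"])    # forget gate
    o = sigmoid(params["Wo"] @ z + params["bo"])    # output gate
    g = np.tanh(params["Wg"] @ z + params["bg"])    # candidate update
    if training:
        g = g * (rng.random(g.shape) < (1.0 - drop_prob))   # mask the update only
    c = f * c_prev + i * g                          # cell state left intact
    h = o * np.tanh(c)
    return h, c

# Toy usage with random parameters.
dim_x, dim_h = 4, 8
params = {k: rng.standard_normal((dim_h, dim_x + dim_h)) for k in ["Wi", "Wf", "Wo", "Wg"]}
params.update({b: np.zeros(dim_h) for b in ["bi", "bf", "bo", "bg"]})
h, c = lstm_step_recurrent_dropout(rng.standard_normal(dim_x), np.zeros(dim_h), np.zeros(dim_h), params)
```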

A nice review of dropout in neural networks by Labach et al. 2019 can be found here: https://arxiv.org/pdf/1904.13310.pdf

References:

G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.

A. Labach, H. Salehinejad, and S. Valaee, “Survey of dropout methods for deep neural networks,” arXiv preprint arXiv:1904.13310, 2019. https://arxiv.org/pdf/1904.13310.pdf

Y. Gal and Z. Ghahramani, “A theoretically grounded application of dropout in recurrent neural networks,” in Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS), 2016.

S. Semeniuta, A. Severyn, and E. Barth, “Recurrent dropout without memory loss,” in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics, 2016.