CNN: Training

Convolutional neural networks (CNNs) are widely used in computer vision. Convolution layers extract image features, and pooling layers downsample the feature maps, keeping the most salient features.

Generally, the first convolution layer learns low-level features such as edges, the second convolution layer learns basic shape features, and the remaining convolution layers learn progressively more abstract features.

Local invariance: the same features can be extracted from an image after transformations such as translation, rotation, and resizing.

Pooling layers not only preserve this local invariance but also filter out noise, which helps reduce overfitting.
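
As a minimal sketch (assuming PyTorch; the layer sizes are arbitrary), a convolution layer extracts local features and a pooling layer downsamples them:

```python
import torch
import torch.nn as nn

# Convolution extracts local features; max pooling downsamples the
# feature map, keeping the strongest response in each window.
block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),  # halves the spatial resolution
)

x = torch.randn(1, 3, 32, 32)  # one 32x32 RGB image
y = block(x)
print(y.shape)                 # torch.Size([1, 16, 16, 16])
```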

Batch normalization is critical in CNN training to mitigate distribution drift between layers. It standardizes activations within each mini-batch, which allows larger learning rates and generally speeds up training. Technically, it subtracts the batch mean, divides by the square root of the batch variance (the batch standard deviation), and then applies a learnable scale (gamma) and shift (beta).
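
The computation is easy to write out. A simplified training-mode sketch (assuming PyTorch, and omitting the running statistics that real BN layers track for inference):

```python
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each channel over the batch and spatial dimensions,
    # then apply the learnable scale (gamma) and shift (beta).
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta

x = torch.randn(8, 16, 32, 32)   # a batch of 16-channel feature maps
gamma = torch.ones(1, 16, 1, 1)  # learnable scale, initialized to 1
beta = torch.zeros(1, 16, 1, 1)  # learnable shift, initialized to 0
y = batch_norm(x, gamma, beta)   # zero mean, unit variance per channel
```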

The paper cited below investigates the improvements from several training tricks for image classification with convolutional neural networks such as ResNet (minimal sketches of several of the tricks follow the list):

  1. Linear scaling learning rate

  2. Learning rate warmup

  3. Zero gamma in batch normalization

  4. No bias decay

  5. Architecture: stride, number of residual blocks, kernel size, pooling size

  6. Cosine learning rate decay

  7. Label smoothing

  8. Knowledge distillation

  9. Mixup training
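
For the learning-rate tricks (1, 2, and 6), a common recipe is to scale the reference rate linearly with batch size, warm up from zero over the first few epochs, and then decay along a cosine curve. A minimal standalone sketch, not the paper's exact code (the constants 0.1 and 256 follow the paper's reference setup; the epoch counts are illustrative):

```python
import math

def learning_rate(epoch, total_epochs=120, warmup_epochs=5,
                  base_lr=0.1, batch_size=256):
    # Trick 1: linear scaling -- the reference lr 0.1 is tuned for
    # batch size 256, so scale it proportionally for other batch sizes.
    lr = base_lr * batch_size / 256
    if epoch < warmup_epochs:
        # Trick 2: warmup -- grow the lr linearly up to the scaled value.
        return lr * (epoch + 1) / warmup_epochs
    # Trick 6: cosine decay -- anneal smoothly from the scaled lr down to 0.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * lr * (1 + math.cos(math.pi * progress))
```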
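
Tricks 3 and 4 are small changes to initialization and regularization: zero the scale (gamma) of the last BN layer in each residual block so every block starts out as an identity mapping, and apply weight decay only to weights, not to biases or BN parameters. A hedged PyTorch sketch; the attribute name `bn2` is a placeholder that depends on how the residual block is defined:

```python
import torch.nn as nn

def zero_init_last_bn(model):
    # Trick 3: zero gamma of the block's last BN so the residual branch
    # initially contributes nothing (the block acts as an identity).
    for m in model.modules():
        if hasattr(m, "bn2") and isinstance(m.bn2, nn.BatchNorm2d):
            nn.init.zeros_(m.bn2.weight)  # "bn2" is a placeholder name

def param_groups(model, weight_decay=1e-4):
    # Trick 4: no bias decay -- regularize only conv/linear weights;
    # biases and BN gamma/beta stay unregularized.
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if p.ndim == 1 or name.endswith(".bias"):
            no_decay.append(p)
        else:
            decay.append(p)
    return [{"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0}]
```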
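
Tricks 7 and 9 both soften the training targets: label smoothing mixes the one-hot label with a uniform distribution over classes, and mixup trains on convex combinations of image pairs and their labels. A minimal sketch, assuming PyTorch and one-hot-encoded labels for mixup:

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, target, eps=0.1):
    # Trick 7: put (1 - eps) of the probability mass on the true class
    # and spread eps uniformly over all classes.
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    return ((1 - eps) * nll - eps * log_probs.mean(dim=-1)).mean()

def mixup(x, y_onehot, alpha=0.2):
    # Trick 9: blend each image (and its label) with a shuffled partner,
    # with mixing weight drawn from a Beta(alpha, alpha) distribution.
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mixed, y_mixed
```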

A combination of these tricks also yielded improvements in transfer learning, for both object detection and semantic segmentation.

He T, Zhang Zh, Zhang H, Zhang Z, Xie J, Li M (2019): Bag of Tricks for Image Classification with Convolutional Neural Networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.