- Title: ImageNet Classification with Deep Convolutional Neural Networks
- Authors: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton
- URL: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
- Year: 2012
- Other information: Published in Advances in Neural Information Processing Systems (NIPS 2012).
- Key: Krizhevsky2012
- Train a deep convolutional neural network to classify 1.2 million images into 1000 different categories.
- CNNs make strong and mostly correct assumptions about the nature of images (stationarity of statistics, locality of pixel dependencies).
- Far fewer connections and parameters than comparably sized fully-connected networks, which makes them easier to train.
- ImageNet: 15 million labeled high-resolution images from 22000 categories. Labeled manually using Amazon Mechanical Turk.
- ImageNet Large-Scale Visual Recognition Challenge (ILSVRC): subset of ImageNet
- 1.2 million training images, 50000 validation images, 150000 test images.
- 1000 categories
- Variable-resolution images:
- Downsampled to a fixed 256 x 256 resolution (the shorter side is rescaled to 256, then the central 256 x 256 patch is cropped).
- 8 layers: 5 convolutional and 3 fully-connected, with a 1000-way softmax at the output (a minimal sketch follows).
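
A minimal single-GPU sketch of that architecture in PyTorch. The padding values are the usual choices that reproduce the paper's layer sizes for 224 x 224 inputs; the original's two-GPU split and restricted cross-GPU connectivity are not modeled here:

```python
import torch
import torch.nn as nn

class AlexNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        lrn = dict(size=5, alpha=1e-4, beta=0.75, k=2.0)  # paper's LRN hyperparameters
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),  # -> 55x55x96
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(**lrn),
            nn.MaxPool2d(kernel_size=3, stride=2),                  # overlapping pooling -> 27x27
            nn.Conv2d(96, 256, kernel_size=5, padding=2),           # -> 27x27x256
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(**lrn),
            nn.MaxPool2d(kernel_size=3, stride=2),                  # -> 13x13
            nn.Conv2d(256, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                  # -> 6x6x256
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),  # 1000-way softmax applied via the loss
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

logits = AlexNet()(torch.randn(1, 3, 224, 224))  # -> shape (1, 1000)
```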
Methodology
- ReLU activation function: networks with ReLUs train several times faster than equivalent networks with tanh units.
- Faster learning has a large influence on the performance of large models trained on large datasets.
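
A tiny NumPy illustration (not from the paper) of why the ReLU speeds things up: tanh saturates, so its gradient vanishes for large inputs, while the ReLU's gradient stays 1 for any positive input:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): non-saturating, gradient is 1 for all positive inputs
    return np.maximum(0.0, x)

def tanh_grad(x):
    # tanh saturates: its gradient 1 - tanh(x)^2 vanishes for large |x|
    return 1.0 - np.tanh(x) ** 2

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu(x))       # [0.  0.  0.5 3. ]
print(tanh_grad(x))  # gradient is only ~0.01 at |x| = 3, slowing learning
```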
- Training on multiple GPUs: the network is split across two GPUs, which communicate only in certain layers; this reduced the top-1 and top-5 error rates by 1.7% and 1.2% versus a half-size single-GPU network.
- Local Response Normalization (formula below)
- Mimics a form of lateral inhibition found in real neurons.
- Applied after the ReLU in the 1st and 2nd convolutional layers.
- Reduces top-1 and top-5 error rates by 1.4% and 1.2%, respectively.
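
The normalization from the paper: $a^i_{x,y}$ is the ReLU output of kernel $i$ at position $(x, y)$, $N$ is the number of kernels in the layer, and the hyperparameters were set on the validation set to $k = 2$, $n = 5$, $\alpha = 10^{-4}$, $\beta = 0.75$:

$$b^i_{x,y} = a^i_{x,y} \bigg/ \left( k + \alpha \sum_{j=\max(0,\, i - n/2)}^{\min(N-1,\, i + n/2)} \left( a^j_{x,y} \right)^2 \right)^{\beta}$$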
- Overlapping pooling
- Pooling neighborhood z = 3 with stride s = 2, so adjacent pooling windows overlap; this reduces top-1 and top-5 error rates by 0.4% and 0.3% compared with non-overlapping pooling (z = s = 2).
- Max-pooling is employed in the 1st and 2nd convolutional layers (after response normalization), as well as after the 5th convolutional layer.
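
With window $z$ and stride $s$, an input of width $W$ pools to $\lfloor (W - z)/s \rfloor + 1$; for example, the 55 x 55 output of the first convolutional layer pools down to 27 x 27:

$$\left\lfloor \frac{55 - 3}{2} \right\rfloor + 1 = 27$$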
- Reducing Overfitting
- Data Augmentation (a sketch follows this sub-list)
- Generate image translations and horizontal reflections: random 224 x 224 patches (and their horizontal reflections) are extracted from the 256 x 256 images.
- Alter the intensities of the RGB channels: add multiples of the principal components of the RGB pixel values over the training set, scaled by Gaussian random variables (PCA color augmentation).
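
A minimal NumPy sketch of both augmentations, assuming `eigvals`/`eigvecs` of the training set's 3x3 RGB covariance were computed offline; the function name and RNG seeding are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, eigvals, eigvecs, crop=224):
    """One training sample from a 256x256x3 image: random 224x224 crop,
    random horizontal flip, and PCA color jitter."""
    h, w, _ = img.shape
    # Random translation: extract a random crop
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = img[top:top + crop, left:left + crop].astype(np.float64)
    # Random horizontal reflection
    if rng.random() < 0.5:
        patch = patch[:, ::-1]
    # PCA color jitter: add eigvecs @ (alpha * eigvals) to every pixel,
    # with each alpha_i drawn from N(0, 0.1) once per image
    alpha = rng.normal(0.0, 0.1, size=3)
    patch += eigvecs @ (alpha * eigvals)
    return patch
```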
- Dropout (a sketch follows)
- Used in the first two fully-connected layers with p(keep) = 0.5; it roughly doubles the number of iterations required to converge.
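
A NumPy sketch of the paper's dropout scheme: each hidden unit is zeroed with probability 0.5 at training time, and all outputs are multiplied by 0.5 at test time:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_keep=0.5, train=True):
    if train:
        # Each unit is dropped independently; dropped units neither
        # contribute to the forward pass nor receive gradient
        mask = rng.random(activations.shape) < p_keep
        return activations * mask
    # At test time all units are active, scaled by p_keep so the
    # expected input to the next layer matches training
    return activations * p_keep
```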
- Learning (the full update rule is given after this list)
- Stochastic gradient descent, batch size = 128, momentum = 0.9, weight decay = 0.0005.
- Weights initialized from a zero-mean Gaussian distribution with standard deviation 0.01.
- Biases in the 2nd, 4th, and 5th convolutional layers, as well as in the fully-connected hidden layers, initialized to 1. This accelerated early learning, as the ReLUs were fed positive inputs from the start.
- Biases in the remaining layers initialized to 0.
- Learning rate ($\epsilon$): equal for all layers, initialized at 0.01 and divided by 10 whenever the validation error stopped improving (reduced three times before termination).
- Trained for roughly 90 epochs (5-6 days on two NVIDIA GTX 580 3GB GPUs).
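
The paper's weight update combines momentum and weight decay; $i$ is the iteration index, $v$ the momentum variable, and $\left\langle \partial L / \partial w \right\rangle_{D_i}$ the average gradient over batch $D_i$:

$$v_{i+1} = 0.9\, v_i - 0.0005\, \epsilon\, w_i - \epsilon \left\langle \frac{\partial L}{\partial w} \bigg|_{w_i} \right\rangle_{D_i}, \qquad w_{i+1} = w_i + v_{i+1}$$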
- Results on ILSVRC-2010
- Baselines: sparse coding and SIFT + Fisher Vectors (FVs).

Model | Top-1 | Top-5 |
---|---|---|
Sparse Coding | 47.1% | 28.2% |
SIFT + FVs | 45.7% | 25.7% |
CNN | 37.5% | 17.0% |
- Results on ILSVRC-2012

Model | Top-1 (val) | Top-5 (val) | Top-5 (test) |
---|---|---|---|
SIFT + FVs | -- | -- | 26.2% |
1 CNN | 40.7% | 18.2% | -- |
5 CNNs | 38.1% | 16.4% | 16.4% |
1 CNN* | 39.0% | 16.6% | -- |
7 CNNs* | 36.7% | 15.4% | 15.3% |

- Models marked CNN* were pre-trained on the entire ImageNet 2011 Fall release and fine-tuned on the ILSVRC-2012 training data.
Qualitative assessment
- The learned convolutional kernels showed specialization: kernels on one GPU became largely color-agnostic (frequency- and orientation-selective), while those on the other GPU became color-specific.
- Most of the top-5 labels assigned by the network were reasonable.
- Image similarity can be measured by the feature activations induced at the last hidden (fully-connected) layer: images whose activation vectors are close in Euclidean distance tend to be semantically similar, even when their raw pixels are not (see the sketch after this list).
- Most of the choices made in the paper are justified by experimental results; there is not much theory behind them.
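
A minimal NumPy sketch of that retrieval idea, assuming a `features` matrix whose rows are the 4096-dimensional last-hidden-layer activations (all names here are illustrative):

```python
import numpy as np

def nearest_images(features, query_idx, k=5):
    """Return indices of the k images whose last-hidden-layer activation
    vectors are closest in Euclidean distance to the query image's vector."""
    diffs = features - features[query_idx]  # (N, 4096)
    dists = np.linalg.norm(diffs, axis=1)   # (N,)
    order = np.argsort(dists)
    return order[1:k + 1]                   # skip the query itself at index 0

# Example with random stand-in features
features = np.random.default_rng(0).normal(size=(100, 4096))
print(nearest_images(features, query_idx=0))
```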