Question

How does Batch Normalization work?

Answer 1

Batch Normalization normalizes each scalar feature independently so that it has zero mean and unit variance across the mini-batch. The normalized values are then scaled and shifted.
During training, the batchnorm layer works as follows:

Input: Values of \(x\) over a mini-batch: \(\mathcal{B} = \{x_{1 \ldots m}\}\)
Output: \(y_i = \operatorname{BatchNorm}_{\gamma, \beta}(x_i)\)
\[\mu_{\mathcal{B}} = \frac{1}{m} \sum^m_{i=1}x_i\]
\[\sigma^2_{\mathcal{B}} = \frac{1}{m} \sum^m_{i=1}(x_i - \mu_{\mathcal{B}})^2\]
\[\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma^2_{\mathcal{B}} + \epsilon}} \]
\[y_i = \gamma \hat{x}_i + \beta\]

Here \(\epsilon\) is a small constant added for numerical stability, and \(\gamma\) (scale) and \(\beta\) (shift) are learned parameters.
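
The training-time steps above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the function name `batchnorm_train`, the `(m, d)` input layout (m samples, d features), and the default `eps=1e-5` are assumptions made for the example.

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm for x of shape (m, d): m samples, d features."""
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature variance (divides by m)
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize each feature
    y = gamma * x_hat + beta               # learned scale and shift
    return y, mu, var

# Illustrative data: 64 samples, 4 features, far from zero mean / unit variance.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))
y, mu, var = batchnorm_train(x, gamma=np.ones(4), beta=np.zeros(4))
```

With \(\gamma = 1\) and \(\beta = 0\), each column of `y` has a mean of approximately 0 and a variance of approximately 1 (slightly below 1 because of \(\epsilon\)).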
During inference, the normalization uses population statistics rather than mini-batch statistics, so the output for a given input no longer depends on the other examples in the batch:
\[\hat{x} = \frac{x - \operatorname{E}[x]}{\sqrt{\operatorname{Var}[x] + \epsilon}} \]
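
A hedged sketch of the inference path: the answer does not say how the population statistics are obtained, so the exponential moving average below (and its `momentum=0.9`) is an assumption, chosen because it is a common way to track them during training. The helper names `update_running` and `batchnorm_infer` are hypothetical.

```python
import numpy as np

def update_running(running, batch_stat, momentum=0.9):
    # Exponential moving average of a batch statistic (assumed estimator).
    return momentum * running + (1.0 - momentum) * batch_stat

def batchnorm_infer(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    # Normalize with fixed population estimates instead of batch statistics.
    x_hat = (x - pop_mean) / np.sqrt(pop_var + eps)
    return gamma * x_hat + beta

# Simulate training: track statistics over many mini-batches.
rng = np.random.default_rng(0)
running_mean, running_var = np.zeros(3), np.ones(3)
for _ in range(200):
    batch = rng.normal(loc=5.0, scale=2.0, size=(32, 3))
    running_mean = update_running(running_mean, batch.mean(axis=0))
    running_var = update_running(running_var, batch.var(axis=0))

# At inference time a single example can be normalized on its own.
single = np.array([[5.0, 5.0, 5.0]])
out = batchnorm_infer(single, np.ones(3), np.zeros(3), running_mean, running_var)
```

Because the statistics are fixed, the result for `single` is the same whether it is processed alone or as part of a larger batch.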

Answer 2

For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a minibatch, over all locations.
In other words, in convolutions we normalize (and scale/shift) per feature map (channel), rather than per individual value.
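
The per-channel normalization can be sketched as below. The `(N, C, H, W)` layout and the function name `batchnorm_conv_train` are assumptions for illustration. The only change from the fully connected case is that the mean and variance are taken over the batch and both spatial axes, leaving one statistic (and one \(\gamma\), \(\beta\) pair) per channel.

```python
import numpy as np

def batchnorm_conv_train(x, gamma, beta, eps=1e-5):
    """x has shape (N, C, H, W); gamma and beta have shape (C,)."""
    # Reduce over batch and spatial axes: one mean/variance per channel.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)   # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)   # shape (1, C, 1, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Reshape so the per-channel parameters broadcast over N, H, W.
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

# Three channels with very different scales.
rng = np.random.default_rng(0)
scales = np.array([1.0, 10.0, 0.1]).reshape(1, 3, 1, 1)
x = rng.normal(size=(8, 3, 5, 5)) * scales
y = batchnorm_conv_train(x, np.ones(3), np.zeros(3))
```

Every location within a feature map is shifted and scaled by the same amounts, which is what keeps the normalization consistent with the convolutional property.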

Answer 3

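The learned scale and shift parameters \(\gamma\) and \(\beta\) in
\[y_k = \gamma_k \hat{x}_k + \beta_k\]
preserve the full representation power of the neural network. Normalization alone may change what a layer can represent, whereas setting \(\gamma_k = \sqrt{\operatorname{Var}[x_k]}\) and \(\beta_k = \operatorname{E}[x_k]\) recovers the original activations (up to the small \(\epsilon\) in the denominator).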
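
One way to see why the scale and shift matter is to check numerically that a particular choice of \(\gamma\) and \(\beta\) turns the layer into an identity map. The snippet below is a small sketch with made-up data; it sets \(\gamma\) to the batch standard deviation (including \(\epsilon\)) and \(\beta\) to the batch mean.

```python
import numpy as np

eps = 1e-5
rng = np.random.default_rng(0)
x = rng.normal(loc=-2.0, scale=3.0, size=(128, 4))

# Normalize each feature over the batch.
mu, var = x.mean(axis=0), x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)

# Choose gamma and beta so that the scale/shift exactly undoes the normalization.
gamma = np.sqrt(var + eps)
beta = mu
y = gamma * x_hat + beta

print(np.allclose(y, x))
```

Since the network can learn these values if they are optimal, adding the normalization does not remove any function the layer could represent before.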