# Batch Normalization

*Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift* (Ioffe & Szegedy, 2015)

How does **Batch Normalization** work?

**Batch Normalization** normalizes **each scalar feature independently**, by making it have **zero mean** and a **variance of 1**. The normalized values are then **scaled and shifted**.

__During training__, the batchnorm layer works as follows:

**Input:** Values of \(x\) over a mini-batch: \(\mathcal{B} = \{x_{1 \dots m}\}\)

**Output:** \(y_i = \operatorname{BatchNorm}_{\gamma, \beta}(x_i)\)

\[\mu_{\mathcal{B}} = \frac{1}{m} \sum^m_{i=1}x_i\]

\[\sigma^2_{\mathcal{B}} = \frac{1}{m} \sum^m_{i=1}(x_i - \mu_{\mathcal{B}})^2\]

\[\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma^2_{\mathcal{B}} + \epsilon}} \]

\[y_i = \gamma \hat{x}_i + \beta\]

with \(\epsilon\) a constant added for numerical stability, and **\(\gamma\) and \(\beta\) learned parameters**.
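To make the four steps concrete, here is a minimal NumPy sketch of the training-time forward pass for a fully connected layer (the function name and the \(\epsilon\) default are illustrative choices, not from the paper):

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Training-time batch norm for an (m, d) mini-batch:
    each of the d features is normalized independently."""
    mu = x.mean(axis=0)                    # mini-batch mean, shape (d,)
    var = x.var(axis=0)                    # mini-batch variance, shape (d,)
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize: zero mean, unit variance
    return gamma * x_hat + beta            # learned scale and shift

x = 3.0 * np.random.randn(32, 4) + 1.0     # mini-batch of 32 examples, 4 features
y = batchnorm_train(x, gamma=np.ones(4), beta=np.zeros(4))
```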

__During inference__, the normalization is done using population statistics rather than mini-batch statistics:

\[\hat{x} = \frac{x - \operatorname{E}[x]}{\sqrt{\operatorname{Var}[x] + \epsilon}} \]
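In practice, the population statistics are usually estimated with running averages accumulated during training. A sketch under that assumption (the `running_mean`/`running_var` names are hypothetical):

```python
import numpy as np

def batchnorm_inference(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Inference-time batch norm: uses fixed population estimates, so the
    output for one example does not depend on the rest of the batch."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta
```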

What is different when using **Batch Normalization** in **convolutions** compared to **fully connected layers**?

For convolutional layers, we additionally want the normalization to obey the **convolutional property**, so that different elements of the same feature map, at **different locations, are normalized in the same way**. To achieve this, we **jointly normalize all the activations in a mini-batch, over all locations**. In other words, in convolutions **we normalize (and scale/shift) per feature map (channel), rather than per individual value**.
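A sketch of the convolutional case, assuming an NCHW layout (the layout is an assumption for illustration): statistics are computed per channel over the batch and both spatial dimensions, so \(\gamma\) and \(\beta\) have one entry per feature map:

```python
import numpy as np

def batchnorm_conv_train(x, gamma, beta, eps=1e-5):
    """x has shape (N, C, H, W); each channel is normalized jointly over
    the batch and all spatial locations (effective batch size N*H*W)."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)   # per-channel mean, (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)   # per-channel variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    # gamma and beta have shape (C,); reshape so they broadcast per channel
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```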

Why does **Batch Normalization** have **learned scale and shift parameters**?

The learned scale and shift parameters **\(\gamma\)** and **\(\beta\)** in

\[y_k = \gamma_k \hat{x}_k + \beta_k\]

**enable the full representation power of the neural network.** By setting \(\gamma_k = \sqrt{\operatorname{Var}[x_k]}\) and \(\beta_k = \operatorname{E}[x_k]\), we **could recover the original values**, if that were the optimal thing to do.
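Substituting those choices back into the transform makes the recovery explicit (it is exact up to the \(\epsilon\) term in the denominator):

\[y_k = \sqrt{\operatorname{Var}[x_k]} \cdot \frac{x_k - \operatorname{E}[x_k]}{\sqrt{\operatorname{Var}[x_k] + \epsilon}} + \operatorname{E}[x_k] \approx x_k\]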