When the minibatch size is multiplied by \(k\), how should the learning rate be scaled?
Multiply the learning rate by \(k\) (Linear Scaling Rule)
What is the technique used to train large minibatch SGD.
Linear scaling rule.