MobileNetV1

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

What is the computational cost of a standard convolution with input \(h_i \times w_i \times d_i\), output \(h_i \times w_i \times d_j\) and kernel size \(k\)?

\(h_i \times w_i \times d_i \times d_j \times k \times k\)
What is the computational cost of a depthwise separabale convolution with input \(h_i \times w_i \times d_i\), output \(h_i \times w_i \times d_j\) and kernel size \(k\)?

\(h_i \cdot w_i \cdot d_i (k^2 + d_j) \)
How much reduction in computation do you get by replacing standard convolutions with depthwise separable convolutions?

Computational cost standard convolution:
\(h_i \times w_i \times d_i \times d_j \times k \times k\)
Computational cost depthwise separable convolution:
\(h_i \cdot w_i \cdot d_i (k^2 + d_j) \)
with input \(h_i \times w_i \times d_i\), output \(h_i \times w_i \times d_j\) and kernel size \(k\)
Reduction:
\(= \frac{h_i \cdot w_i \cdot d_i (k^2 + d_j)}{h_i \times w_i \times d_i \times d_j \times k \times k}\)
\(= \frac{1}{d_j} + \frac{1}{k^2}\)
MobileNet uses \(3 \times 3\) depthwise separable convolutions which uses between 8 to 9 times less computation than standard convolutions.
With what does MobileNet replace the standard convolutional layer with batchnorm and ReLU?


With a Depthwise Separable convolutions with Depthwise and Pointwise layers followed by batchnorm and ReLU.

Which two additional hyperparameters are introduced in MobileNet to construct scaled versions of the standard architecture?

Width multiplier:
The role of the width multiplier \(\alpha\) is to thin a network uniformly at each layer. For a given layer and width multiplier \(\alpha\), the number of input channels \(d_i\) becomes \(\alpha d_i\) and the number of output channels \(d_j\) becomes \(\alpha d_j\).
Width multiplier has the effect of reducing computational cost and the number of parameters quadratically by roughly \(\alpha^2\).

Resolution multiplier:
The resolution multiplier \(\rho\) is applied to the input image and the internal representation of every layer is subsequently reduced by the same multiplier. In practice we implicitly set \(\rho\) by setting the input resolution.
Resolution multiplier has the effect of reducing computational cost by \(\rho^2\).
How many multiply-adds and parameters does the default MobileNetV1 have? And how much accuracy does it get on ImageNet?

569M MAdds and 4.2M parameters with an accuracy of 70.6% on ImageNet.

Machine Learning Research Flashcards is a collection of flashcards associated with scientific research papers in the field of machine learning. Best used with Anki. Edit MLRF on GitHub.