Vision Transformers

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Draw the Vision Transformer model.

The model design follows the original Transformer (Vaswani et al., 2017) as closely as possible.
An advantage of this intentionally simple setup is that scalable NLP Transformer architectures – and their efficient implementations – can be used almost out of the box.
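
A minimal sketch of the model in PyTorch, assuming ViT-B/16-like hyperparameters (module names and details are illustrative, not the official JAX/Flax implementation): the image is split into fixed-size patches, each patch is linearly projected, a [class] token and position embeddings are added, and the sequence is fed to a standard Transformer encoder.

```python
# Minimal Vision Transformer sketch (illustrative, not the official code).
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=768, depth=12,
                 heads=12, mlp_dim=3072, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding: a strided convolution is equivalent to flattening
        # each patch and applying a shared linear projection.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                # learnable [class] token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # learnable 1D position embeddings
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=mlp_dim,
                                           activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)       # fine-tuning-style linear head

    def forward(self, x):                             # x: (B, 3, H, W)
        x = self.patch_embed(x)                       # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)              # (B, N, dim) sequence of patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(self.norm(x[:, 0]))          # classify from the [class] token

logits = MiniViT()(torch.randn(2, 3, 224, 224))       # -> (2, 1000)
```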
How is the transformer architecture adapted to make image classification predictions?

Similar to BERT’s [class] token, we prepend a learnable embedding to the sequence of embedded patches (\(\mathbf{z}^0_0 = \mathbf{x}_{\text{class}}\)), whose state at the output of the Transformer encoder (\(\mathbf{z}^0_L\)) serves as the image representation \(\mathbf{y}\) (Eq. 4). During both pre-training and fine-tuning, a classification head is attached to \(\mathbf{z}^0_L\). The classification head is implemented by an MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time.
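
A short sketch of the two heads attached to \(\mathbf{z}^0_L\) (the hidden size, class count, and tanh nonlinearity are assumptions for illustration):

```python
# Sketch: classification heads attached to the [class]-token output z_L^0 (illustrative).
import torch
import torch.nn as nn

dim, num_classes = 768, 1000
pretrain_head = nn.Sequential(                 # pre-training: MLP with one hidden layer
    nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, num_classes))
finetune_head = nn.Linear(dim, num_classes)    # fine-tuning: a single linear layer

z_L = torch.randn(2, 197, dim)                 # encoder output: [class] token + 196 patch tokens
y = finetune_head(z_L[:, 0])                   # classify from token 0, the [class] token
```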
What type of positional embeddings are used in Vision Transformers?

Standard learnable 1D position embeddings.
Other variants were also tested, but the authors did not observe significant performance gains from using more advanced 2D-aware position embeddings.

Figure: Similarity of position embeddings of ViT-L/32. Tiles show the cosine similarity between the position embedding of the patch with the indicated row and column and the position embeddings of all other patches.
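
A small sketch of how such a similarity map can be computed from the learned embeddings (shapes and the random placeholder weights are illustrative):

```python
# Sketch: cosine similarity between position embeddings, as in the figure above (illustrative).
import torch
import torch.nn.functional as F

grid = 7                                    # ViT-L/32 at 224x224: 224 / 32 = 7 patches per side
pos_embed = torch.randn(grid * grid, 1024)  # learned 1D position embeddings (no [class] token here)
sim = F.cosine_similarity(pos_embed[:, None, :], pos_embed[None, :, :], dim=-1)  # (49, 49)
sim = sim.reshape(grid, grid, grid, grid)   # sim[i, j] is the (grid, grid) similarity map of patch (i, j)
```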
What was the most crucial element to get Transformers to work well in vision?

Dataset size.
When pre-trained on the smallest dataset, ImageNet, ViT models underperform compared to ResNet models. With ImageNet-21k pre-training, their performance is similar. Only with JFT-300M do we see the full benefit of the Transformer models.
What is a possible explanation to why vision transformers perform worse on small datasets but better on very large datasets?

The intuition is that the convolutional inductive bias is useful for smaller datasets, but for larger ones, learning the relevant patterns directly from data is sufficient, even beneficial.
What changes do you need to make to fine-tune ViT?

Remove the pre-trained prediction head and attach a zero-initialized feedforward layer (a single linear layer mapping to the number of downstream classes, rather than the MLP with one hidden layer used during pre-training).
It is often beneficial to fine-tune at higher resolution than pre-training.
When feeding images of higher resolution, we keep the patch size the same, which results in a larger effective sequence length. The Vision Transformer can handle arbitrary sequence lengths (up to memory constraints), however, the pre-trained position embeddings may no longer be meaningful. We therefore perform 2D interpolation of the pre-trained position embeddings, according to their location in the original image.
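
A sketch of this resizing step, assuming a (1, 1 + N, D) position-embedding tensor with the [class] token first (the function name and the bicubic mode are illustrative choices, not necessarily the official code):

```python
# Sketch: 2D interpolation of pre-trained position embeddings for a higher
# fine-tuning resolution (illustrative).
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    # pos_embed: (1, 1 + old_grid**2, dim); token 0 is the [class] token.
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pe.shape[-1]
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)   # (1, dim, g, g)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pe, patch_pe], dim=1)

# e.g. from 224x224 (14x14 patches of size 16) to 384x384 (24x24 patches)
new_pe = resize_pos_embed(torch.randn(1, 1 + 14 * 14, 768), old_grid=14, new_grid=24)
```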
How many FLOPS and parameters does the ViT model have? And how accurate is it on ImageNet?

ViT-B/16: 33 GFLOPs, 86M parameters, 85.43% top-1 accuracy on ImageNet.
ViT-L/16: 117 GFLOPs, 304M parameters, 85.63% top-1 accuracy on ImageNet.
The reported accuracies come from the official models pre-trained on ImageNet-21k.
How long does it take to pretrain a ViT-L/16 model on the public ImageNet-21k dataset, using standard cloud TPUv3 machines?

240 TPUv3-core-days.
That is about 1 month when using 8 cores (240 / 8 = 30 days).
While this does require a lot of compute, Vision Transformers are more efficient to pre-train than ResNet-based architectures.
