# Token Merging: Your ViT But Faster

Give the algorithm for

**Token Merging**.- Partition the tokens into two sets \(\mathbb{A}\) and \(\mathbb{B}\) by alternating the assignment.
- Calculate the similarity score for all tokens in \(\mathbb{A}\) by taking the cosine similarity of the \(\mathbf{K}\) vector of a token in \(\mathbb{A}\) to all tokens in \(\mathbb{B}\). The final similarity score of a token in \(\mathbb{A}\) is the highest pairwise score.
- Merge the \(k\) pairs with the highest similarity score
- Concatenate the two sets back together

Apart from the merging module, which other change needs to be made to the standard Vision Transformer in order to apply

**Token Merging**?The standard attention function need to be changed to

\[\mathbf{A} = \operatorname{softmax}(\frac{\mathbf{QK}^T}{\sqrt{d}} + \log(\mathbf{s}))\]

**proportional attention**:\[\mathbf{A} = \operatorname{softmax}(\frac{\mathbf{QK}^T}{\sqrt{d}} + \log(\mathbf{s}))\]

*where \(\mathbf{s}\) is a row vector containing the size of each token (number of patches the token represents)*. Tokens are also weighted by \(\mathbf{s}\) any time they are aggregated, like when the tokens are merged together.This performs the same operation as if you'd have

*s*copies of the keyWhere is the

**Token Merging**module inserted in the vision transformer?Between the Multi-headed Self-Attention (MSA) layer and the MLP layer.

What schedule does

**Token Merging**use to merge tokens?The default setting merges a

They also report on a linearly decreasing schedule.

**fixed**\(k\) tokens per layer.They also report on a linearly decreasing schedule.