Question

Give a schematic of the contrastive learning framework used in SimCLR.

Answer 1

Cosine similarity
It is computed as the dot product of the two vectors divided by the product of their magnitudes.
\[s(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u}^\top\mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|}\]
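
As a minimal sketch of the formula above, the similarity can be computed in plain Python. The function name `cosine_similarity` and the choice to raise an error on a zero vector are assumptions made for illustration; they are not part of the original answer.

```python
import math


def cosine_similarity(u, v):
    """Return u.v / (||u|| ||v||) for two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: treat a zero vector as an error, since the ratio is undefined.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Orthogonal vectors give a similarity of 0, and vectors pointing in the same direction give 1 regardless of their lengths.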

Answer 2

The loss function for a positive pair of examples \((i, j)\) is defined as:
\[\begin{aligned} \mathcal{L}_\text{SimCLR}^{(i,j)} &= - \log\frac{\exp(s(\mathbf{z}_i, \mathbf{z}_j) / \tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(s(\mathbf{z}_i, \mathbf{z}_k) / \tau)} \end{aligned}\] where \(s(\cdot, \cdot)\) is the similarity metric (usually cosine similarity) and \(\tau\) is the temperature.
The final loss is computed across all positive pairs, both \((i,j)\) and \((j,i)\).
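
The pair loss above can be sketched in plain Python as follows. This is an illustrative sketch, not a reference implementation: the names `pair_loss` and `cosine_similarity`, the 0-based indices, and the default `tau=0.5` are all assumptions.

```python
import math


def cosine_similarity(u, v):
    """Return u.v / (||u|| ||v||); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_loss(z, i, j, tau=0.5):
    """Loss for the positive pair (i, j) among the 2N projections in z.

    Indices are 0-based here, unlike the 1-based indices in the formula.
    """
    # Temperature-scaled similarity between z[i] and every projection.
    logits = [cosine_similarity(z[i], z[k]) / tau for k in range(len(z))]
    # The indicator 1[k != i] drops the self-similarity term from the sum.
    denom = sum(math.exp(s) for k, s in enumerate(logits) if k != i)
    # -log(exp(logits[j]) / denom), written without the intermediate division.
    return math.log(denom) - logits[j]
```

Because the denominator excludes only \(k = i\), the loss is not symmetric in \((i, j)\) in general, which is why both orderings contribute to the final loss.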

Answer 3

input: batch size \(N\), temperature constant \(\tau\), encoder \(f\), projection head \(g\), augmentation family \(\mathcal{T}\).
for sampled minibatch \(\{\mathbf{x}_k\}^N_{k=1}\) do:
for all \(k \in \{1, \dots, N\}\) do:
sample two augmentation functions \(t \sim \mathcal{T}\), \(t' \sim \mathcal{T}\)
\(\tilde{\mathbf{x}}_{2k - 1}= t(\mathbf{x}_k)\)
\(\tilde{\mathbf{x}}_{2k}= t'(\mathbf{x}_k)\)
\(\mathbf{h}_{2k - 1}= f(\tilde{\mathbf{x}}_{2k -1 })\)
\(\mathbf{h}_{2k}= f(\tilde{\mathbf{x}}_{2k})\)
\(\mathbf{z}_{2k-1} = g(\mathbf{h}_{2k-1})\)
\(\mathbf{z}_{2k} = g(\mathbf{h}_{2k})\)
for all \(i \in \{1, \dots, 2N\}\) and \(j \in \{1, \dots, 2N\}\) do:
\(s_{i,j} = \frac{\mathbf{z}_i^\top\mathbf{z}_j}{\|\mathbf{z}_i\| \|\mathbf{z}_j\|}\)
define \(\mathcal{L}^{(i,j)} = - \log\frac{\exp(s_{i,j} / \tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(s_{i,k} / \tau)}\)
\(\mathcal{L} = \frac{1}{2N} \sum^N_{k=1}[\mathcal{L}^{(2k-1,2k)} +\mathcal{L}^{(2k,2k-1)}]\)
update networks \(f\) and \(g\) to minimize \(\mathcal{L}\)
return encoder \(f\) and throw away \(g\)
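
The loss computation in the algorithm above can be sketched in plain Python for a single minibatch. This is a sketch under stated assumptions: `simclr_batch_loss`, `add_noise`, and `identity` are hypothetical names, the Gaussian-noise augmentation and identity networks are toy stand-ins for \(\mathcal{T}\), \(f\), and \(g\), and the gradient update step is omitted because plain Python has no automatic differentiation.

```python
import math
import random


def cosine_similarity(u, v):
    """Return u.v / (||u|| ||v||); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simclr_batch_loss(batch, augment, f, g, tau=0.5):
    """Compute the minibatch loss L for N examples (2N augmented views)."""
    # Two independently augmented views per example, encoded and projected.
    # Views of example k land at 0-based positions 2k and 2k + 1.
    z = []
    for x in batch:
        for _ in range(2):
            z.append(g(f(augment(x))))

    n_views = len(z)
    # Pairwise cosine similarities s[i][j].
    s = [[cosine_similarity(z[i], z[j]) for j in range(n_views)]
         for i in range(n_views)]

    def pair_loss(i, j):
        # The sum over k skips k == i, matching the indicator in the formula.
        denom = sum(math.exp(s[i][k] / tau) for k in range(n_views) if k != i)
        return math.log(denom) - s[i][j] / tau

    total = 0.0
    for k in range(len(batch)):
        a, b = 2 * k, 2 * k + 1
        total += pair_loss(a, b) + pair_loss(b, a)
    return total / n_views  # n_views == 2N


def add_noise(x):
    """Toy augmentation (an assumption): add small Gaussian noise."""
    return [a + random.gauss(0.0, 0.1) for a in x]


def identity(x):
    """Toy stand-in for the encoder f and the projection head g."""
    return x


random.seed(0)
toy_batch = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(simclr_batch_loss(toy_batch, add_noise, identity, identity))
```

In a real training loop the returned value would be minimized with respect to the parameters of \(f\) and \(g\) by an autodiff framework, and only \(f\) would be kept afterwards.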

Answer 4

This is likely because the contrastive representation must be invariant to many data transformations, so information such as color is removed from it even though that information may be useful for downstream tasks. With an additional projection head, \(g\) can discard this information in order to maximize the contrastive similarity, while the encoder output \(\mathbf{h}\) can still retain it. However, this explanation rests on empirical findings rather than a formal argument.