Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
Schematic illustration of the proposed SEgmentation TRansformer (SETR) (a). We first split an image into fixed-size patches, linearly embed each of them, add position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder. To perform pixel-wise segmentation, we introduce different decoder designs: (b) progressive upsampling (resulting in a variant called SETR-PUP); and (c) multi-level feature aggregation (a variant called SETR-MLA).
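The tokenization step described above (split into fixed-size patches, linearly embed, add position embeddings) can be sketched shape-wise in plain numpy. This is a minimal illustration, not the paper's implementation: random matrices stand in for the learned embedding and position parameters, and the function name `patchify_and_embed` is our own.

```python
import numpy as np

def patchify_and_embed(image, patch, d_model, rng=None):
    """Split an HxWxC image into non-overlapping patches, linearly embed
    each flattened patch, and add position embeddings.

    Random weights stand in for the learned projection and position
    embeddings of the actual model (illustrative sketch only)."""
    rng = np.random.default_rng(0) if rng is None else rng
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    n = (H // patch) * (W // patch)  # number of tokens in the sequence
    # Rearrange into (n, patch*patch*C): one flattened vector per patch.
    patches = (image.reshape(H // patch, patch, W // patch, patch, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(n, patch * patch * C))
    W_embed = rng.standard_normal((patch * patch * C, d_model)) * 0.02
    pos = rng.standard_normal((n, d_model)) * 0.02
    # Linear embedding of each patch, plus per-position embedding.
    return patches @ W_embed + pos

# A 256x256 RGB image with 16x16 patches yields a 256-token sequence.
tokens = patchify_and_embed(np.zeros((256, 256, 3)), patch=16, d_model=1024)
print(tokens.shape)  # (256, 1024)
```

The resulting `(num_tokens, d_model)` sequence is what a standard Transformer encoder consumes, with no convolutional backbone involved.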
A 3-layer network is applied to each stream, with the feature channels halved at the first and third layers, and the spatial resolution upsampled \(4 \times\) by bilinear interpolation after the third layer. To enhance interaction across the different streams, we introduce a top-down aggregation design via element-wise addition after the first layer; an additional \(3 \times 3\) conv is applied to the element-wise added feature. After the third layer, we fuse the features from all the streams via channel-wise concatenation and bilinearly upsample the result \(4 \times\) to the full resolution. When using this decoder, we denote our model as SETR-MLA.
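The data flow of this multi-level aggregation head can be traced at the shape level with a small numpy sketch. To stay dependency-free this is heavily simplified and hypothetical in its details: the \(1 \times 1\) and \(3 \times 3\) convs are replaced by per-pixel channel projections, bilinear upsampling by nearest-neighbor repetition, and the function names (`mla_decoder`, `project`, `upsample`) are our own. Only the channel-halving, top-down addition, concatenation, and upsampling structure follows the description above.

```python
import numpy as np

def upsample(x, s):
    # Nearest-neighbor stand-in for the paper's bilinear upsampling.
    return x.repeat(s, axis=0).repeat(s, axis=1)

def project(x, c_out, rng):
    # Per-pixel channel projection, standing in for a conv layer.
    W = rng.standard_normal((x.shape[-1], c_out)) * 0.02
    return x @ W

def mla_decoder(streams, rng=None):
    """Shape-level sketch of the multi-level aggregation decoder.

    `streams` is a list of (h, w, C) feature maps taken from different
    Transformer layers (here h = H/16, w = W/16)."""
    rng = np.random.default_rng(0) if rng is None else rng
    C = streams[0].shape[-1]
    # Layer 1: halve the channels in each stream.
    mids = [project(s, C // 2, rng) for s in streams]
    # Top-down aggregation via element-wise addition, followed by an
    # extra projection (the paper's 3x3 conv) on the added feature.
    agg, top = [], None
    for m in reversed(mids):
        top = m if top is None else m + top
        agg.append(project(top, C // 2, rng))
    agg = agg[::-1]
    # Layers 2-3: halve channels again at the third layer, then 4x upsample.
    outs = [upsample(project(project(a, C // 2, rng), C // 4, rng), 4)
            for a in agg]
    fused = np.concatenate(outs, axis=-1)  # channel-wise concatenation
    return upsample(fused, 4)              # final 4x to full resolution

# Four 16x16x1024 streams (a 256x256 input with 16x16 patches).
streams = [np.zeros((16, 16, 1024)) for _ in range(4)]
out = mla_decoder(streams)
print(out.shape)  # (256, 256, 1024)
```

The two \(4 \times\) upsampling stages compose to recover the \(16 \times\) downsampling introduced by the patch embedding, so the fused map matches the input resolution.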
Machine Learning Research Flashcards is a collection of flashcards associated with scientific research papers in the field of machine learning. Best used with Anki.