Image Transformer

Image Transformer [1] casts image generation as a sequence modeling problem by generalizing the Transformer. It restricts the self-attention mechanism to local neighborhoods while maintaining a large receptive field. These are notes from reading and implementing it.


Paper & Code & note

Paper: Image Transformer (2018 arXiv paper)
Code: [Code]
Note: Mendeley




  1. Image generation has been successfully cast as an autoregressive sequence generation or transformation problem.
  2. In this work, the authors generalize the Transformer to a sequence modeling formulation of image generation.
  3. They restrict the self-attention mechanism to local neighborhoods while maintaining a large receptive field.
  4. The resulting model outperforms the state of the art at the time of publication in image generation and super-resolution.
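
For reference, the autoregressive formulation in item 1 can be written as a product of conditionals. This is a sketch under the assumption that an RGB image of height $h$ and width $w$ is flattened in raster-scan order into $h \cdot w \cdot 3$ channel intensities $x_1, x_2, \ldots$; the symbols $x_t$, $h$ and $w$ are introduced here for illustration only.

$$
p(x) = \prod_{t=1}^{h \cdot w \cdot 3} p(x_t \mid x_1, \ldots, x_{t-1})
$$

Each factor is the distribution over one intensity given everything generated before it, which is what the model has to predict at every position.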

Problem Description


  1. Training RNNs (recurrent neural networks) to sequentially predict each pixel of even a small image is computationally very challenging. Thus, parallelizable models that use CNNs (convolutional neural networks), such as the PixelCNN, have recently received much more attention and have now surpassed the PixelRNN in quality.
  2. One disadvantage of CNNs compared to RNNs is their typically fairly limited receptive field. This can adversely affect their ability to model long-range phenomena common in images, such as symmetry and occlusion, especially with a small number of layers.

Problem Solution


  1. Self-attention can achieve a better balance in the trade-off between the virtually unlimited receptive field of the necessarily sequential PixelRNN and the limited receptive field of the much more parallelizable PixelCNN and its various extensions.
  2. The Image Transformer, a model based entirely on a self-attention mechanism, allows significantly larger receptive fields than the PixelCNN.
  3. Increasing the size of the receptive field plays a significant role in the improvements observed in the experiments.

Conceptual Understanding



  1. Each self-attention layer computes a d-dimensional representation for each position.
  2. It first compares the position’s current representation to other positions’ representations, obtaining an attention distribution over the other positions.
  3. This distribution is then used to weight the contribution of the other positions’ representations to the next representation for the position.
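
The three steps above can be sketched for a single position in plain Python. This is a minimal illustration, not the authors' implementation: it assumes a scaled dot product as the comparison function and omits the learned query, key and value projections. The names `softmax` and `attend` are hypothetical helpers.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, memory):
    """Return (weights, new_representation) for one position.

    query:  list of d floats, the position's current representation
    memory: list of n vectors of d floats, the other positions
    """
    d = len(query)
    # Step 2: compare the query with every memory position
    # (assumption: scaled dot product, no learned projections).
    scores = [sum(q * m for q, m in zip(query, mem)) / math.sqrt(d)
              for mem in memory]
    weights = softmax(scores)  # attention distribution over the memory
    # Step 3: weight each memory representation by its attention weight.
    new_repr = [sum(w * mem[j] for w, mem in zip(weights, memory))
                for j in range(d)]
    return weights, new_repr

query = [1.0, 0.0]
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, new_repr = attend(query, memory)
```

In this toy input the first and third memory vectors have the same dot product with the query, so they receive equal weight, and the second one receives less.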

Local Self-Attention


  1. Attending to every previous position becomes too expensive as images grow. Inspired by CNNs, the authors address this by adopting a notion of locality, restricting the positions in the memory matrix M to a local neighborhood around the query position.
  2. They partition the image into query blocks and associate each of these with a larger memory block.
  3. All queries in a given query block attend to the same memory matrix, so self-attention can be computed for all query blocks in parallel.
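
The query block and memory block idea can be sketched for a flattened (1D) sequence of positions. This is an assumption-laden illustration: `local_1d_blocks`, `visible_positions`, `query_len` and `memory_len` are hypothetical names, the memory block is taken to be the query block itself plus a fixed number of preceding positions, and the causal rule "attend only to strictly earlier positions" is one common convention.

```python
def local_1d_blocks(seq_len, query_len, memory_len):
    """Return a list of (query_positions, memory_positions) pairs.

    The sequence is split into non-overlapping query blocks. Each memory
    block covers its query block plus up to `memory_len` earlier positions.
    """
    blocks = []
    for start in range(0, seq_len, query_len):
        end = min(start + query_len, seq_len)
        mem_start = max(0, start - memory_len)
        blocks.append((list(range(start, end)), list(range(mem_start, end))))
    return blocks

def visible_positions(query_pos, memory_positions):
    # Causal mask: a query may only attend to already generated positions.
    return [p for p in memory_positions if p < query_pos]

# 12 positions, query blocks of 4, each extended by 4 earlier positions.
blocks = local_1d_blocks(seq_len=12, query_len=4, memory_len=4)
third_queries, third_memory = blocks[2]
seen_by_9 = visible_positions(9, third_memory)
```

Because every query in a block shares one memory block, the attention for a block reduces to one small matrix product, and the blocks are independent of each other during training.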

Core Conception


  1. Each layer recomputes the representation $q'$ of a single channel of one pixel $q$ by attending to a memory of previously generated pixels $m_1, m_2, \ldots$.
  2. After local self-attention, a two-layer position-wise feed-forward neural network is applied, with the same parameters for all positions in a given layer.
  3. Self-attention and the feed-forward networks are followed by dropout and bypassed by a residual connection with subsequent layer normalization.
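
The sublayer structure described above can be sketched for one position as follows. This is a simplified illustration rather than the paper's code: dropout is omitted (it is the identity at inference time), layer normalization is shown without its learned gain and bias, ReLU is assumed as the hidden activation, and the weights are random placeholders. All function and variable names are hypothetical.

```python
import math
import random

def layer_norm(x, eps=1e-6):
    # Normalize to zero mean and unit variance (learned gain/bias omitted).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def feed_forward(x, w1, b1, w2, b2):
    # Two-layer position-wise network; ReLU is an assumption of this sketch.
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row)) + b)
              for row, b in zip(w1, b1)]
    return [sum(h * w for h, w in zip(hidden, row)) + b
            for row, b in zip(w2, b2)]

def sublayer(x, fn_out):
    # Residual connection followed by layer normalization.
    return layer_norm([a + b for a, b in zip(x, fn_out)])

def block(q, attn_out, params):
    q1 = sublayer(q, attn_out)                       # after self-attention
    return sublayer(q1, feed_forward(q1, *params))   # after feed-forward

random.seed(0)
d, hidden_size = 4, 8
w1 = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(hidden_size)]
b1 = [0.0] * hidden_size
w2 = [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(d)]
b2 = [0.0] * d

q = [0.5, -1.0, 2.0, 0.0]          # current representation of one position
attn_out = [0.1, 0.2, -0.3, 0.4]   # placeholder for the attention output
q_next = block(q, attn_out, (w1, b1, w2, b2))
normed = layer_norm([1.0, 2.0, 3.0])
```

The same `params` would be reused for every position in the layer, which is what "position-wise" refers to.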







  1. The authors hope to have provided additional evidence that, even in light of GANs (generative adversarial networks), likelihood-based models of images are very much a promising area for further research.
  2. They would like to explore a broader variety of conditioning information, including free-form text, and tasks combining modalities such as language-driven editing of images.
  3. Fundamentally, they aim to move beyond still images to video and towards applications in model-based reinforcement learning.


[1] Parmar, Niki, et al. “Image transformer.” arXiv preprint arXiv:1802.05751 (2018).



