AR2 Pixel recurrent neural networks
Sanchit Vohra (sv4@illinois.edu)
Pixel Recurrent Neural Networks
Modeling the distribution of high-dimensional data is a central problem in unsupervised machine learning. Since images are high-dimensional and highly structured, estimating their underlying distribution is notoriously challenging. With the recent advances in deep learning, there has been significant progress in developing expressive, scalable, and tractable methods for generative modeling. In this blog, we are going to explore the PixelRNN 1 and Gated PixelCNN 2 models for generating images.
Related Work
Perhaps the most popular technique for generative modeling in recent years has been the Generative Adversarial Network (GAN). These models generate rich, sharp images. However, GANs are notoriously hard to train because the adversarial nature of the training makes optimization unstable. 3
On the other hand, stochastic latent variable models such as the Variational Auto-Encoder (VAE) produce blurry samples due to the nature of their reconstruction loss. Additionally, the VAE only optimizes a lower bound (the ELBO) on the desired log-likelihood rather than the exact distribution. 4
Previous methods that model the distribution as a product of conditionals, such as NADE and MADE, are limited because they lack sophisticated recurrent units like LSTM cells. 5 6 By using more expressive auto-regressive modeling techniques, the PixelRNN is able to achieve state-of-the-art performance on image generation benchmarks.
Background
PixelRNN Model
In PixelRNN, each pixel is conditionally dependent on the previous pixels in a top-to-bottom, left-to-right ordering. We model the joint probability distribution of the image as a product of these conditional probabilities. Image taken from paper 1.
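Reproduced here, for an n × n image treated as a sequence of pixels read in that order, the factorization from the paper is:

$$
p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})
$$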
In the following image, the highlighted pixel depends on all the pixels above it and to its left, i.e., every pixel that precedes it in the top-to-bottom, left-to-right ordering. Visualization taken from 7.
Additionally, each RGB channel is conditionally dependent on the previously generated channels. For example, the green channel of a pixel is conditionally dependent on the red channel of the same pixel. Image taken from paper 1.
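In the paper's notation, each per-pixel factor is split further over the three channels:

$$
p(x_i \mid \mathbf{x}_{<i}) = p(x_{i,R} \mid \mathbf{x}_{<i}) \; p(x_{i,G} \mid \mathbf{x}_{<i}, x_{i,R}) \; p(x_{i,B} \mid \mathbf{x}_{<i}, x_{i,R}, x_{i,G})
$$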
As we will see later in this blog, ensuring this auto-regressive property holds requires clever masking of inputs in the network.
Discrete Softmax Output
The output layer of the network produces 256 values for every channel of every pixel. Each of these 256-dimensional outputs is normalized via a softmax and represents a discrete multinomial probability distribution over the possible channel values (0–255). The following is a visualization of the softmax output for one channel of one pixel. Image taken from paper 1.
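As a rough sketch (the exact output layout is an assumption, and the tensor names are made up), the final-layer logits can be reshaped and normalized like this:

```python
import torch

batch, height, width = 8, 32, 32
logits = torch.randn(batch, 3 * 256, height, width)  # stand-in for the network's final-layer output
logits = logits.view(batch, 3, 256, height, width)   # one 256-way output per RGB channel per pixel
probs = torch.softmax(logits, dim=2)                 # normalize over the 256 intensity levels
```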
PixelRNN Generation
Images are generated sequentially pixel-by-pixel and channel-by-channel, from top to bottom and left to right. This makes the generation process extremely slow, which is a big weakness of this model. However, training the PixelRNN model can be done in parallel since all the conditional inputs are present in the training image; the inputs just need to be masked to preserve the autoregressive property. Visualization taken from 7.
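The sequential nature of sampling is easiest to see in code. Below is a minimal sketch of the sampling loop, assuming a hypothetical `model` that maps an image batch to per-pixel, per-channel 256-way logits of shape `(batch, channels, 256, height, width)`:

```python
import torch

@torch.no_grad()
def sample(model, batch_size=16, channels=3, height=32, width=32):
    """Generate images pixel-by-pixel and channel-by-channel."""
    x = torch.zeros(batch_size, channels, height, width)
    for i in range(height):            # top to bottom
        for j in range(width):         # left to right
            for c in range(channels):  # R, then G, then B
                logits = model(x)      # one full forward pass per sampled value
                probs = torch.softmax(logits[:, c, :, i, j], dim=-1)
                value = torch.multinomial(probs, num_samples=1).squeeze(-1)
                x[:, c, i, j] = value / 255.0  # assumes the model expects inputs in [0, 1]
    return x
```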
PixelRNN Network Architecture
The model always starts with a masked convolution. This is then followed by several residual blocks, which can be convolutional, RowLSTM, or DiagonalBiLSTM. Finally, there are two convolutional layers that generate the final output. Image taken from paper 1.
Input Masking
There are two types of masks in the PixelRNN network. The first type (Mask A) exists to maintain the autoregressive property in the first convolutional layer: the output of the layer may depend on all information from previous pixels, but only on previous channels of the current pixel. The second variant (Mask B) is applied to all subsequent layers. This variant also allows the output of the layer to depend on the features at the same channel of the same pixel. This is safe because, after the first layer, the features at that position already depend only on the current and previous channels of that pixel (and on previous pixels), so reusing them does not violate the autoregressive property. Below is a visualization of these two masking schemes. Image taken from paper 1.
Below is a visualization of Mask A for the red, green, and blue channels respectively. We are masking the inputs to generate the center pixel for every channel.
Below is a visualization of Mask B for the red, green, and blue channels respectively. Note how the pixel of the same channel can be used this time.
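A minimal PyTorch-style sketch of such a masked convolution is shown below. For simplicity it masks spatial positions only; the full PixelRNN masks also split the feature maps into R/G/B groups to enforce the channel ordering within the center pixel. The class name and arguments are my own.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """2D convolution whose weights are masked to preserve the autoregressive order.

    Mask 'A' blocks the center pixel (used only in the first layer);
    mask 'B' allows it (used in every subsequent layer).
    """
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ('A', 'B')
        _, _, kh, kw = self.weight.shape
        mask = torch.ones(kh, kw)
        # In the center row: zero from the center pixel onward (mask A)
        # or from one past the center onward (mask B) ...
        mask[kh // 2, kw // 2 + (mask_type == 'B'):] = 0
        # ... and zero every row below the center row.
        mask[kh // 2 + 1:, :] = 0
        self.register_buffer('mask', mask[None, None])

    def forward(self, x):
        self.weight.data *= self.mask  # re-apply the mask before every convolution
        return super().forward(x)

# Example: the first layer uses mask A, later layers use mask B.
first = MaskedConv2d('A', 3, 64, kernel_size=7, padding=3)
later = MaskedConv2d('B', 64, 64, kernel_size=3, padding=1)
```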
PixelCNN
In the PixelCNN, each residual block is a masked convolution. PixelCNN is heavily parallelizable due to its convolutional layers. However, since a convolution only looks at a local neighborhood, it does not capture information from all previous pixels. While the receptive field of the convolution grows linearly with the depth of the network, in any one layer the masked convolution has a small receptive field. The image below is a visualization of the receptive field of the masked convolution. Image taken from paper 1.
RowLSTM
RowLSTM generates its output row-by-row from top to bottom and left to right. To do this, RowLSTM modifies the traditional LSTM cell to compute all hidden outputs via convolutions. RowLSTM uses one-dimensional (k × 1) convolutions for both the state-to-state kernel and the input-to-state kernel. Note that the input-to-state component depends on the input and must be appropriately masked to ensure the autoregressive property. Additionally, since the input-to-state component depends only on the input, it can be computed for the entire input in parallel. However, the state-to-state component of the RowLSTM must be computed sequentially using the previous hidden states. Below is the mathematical notation for the convolutional LSTM cell in RowLSTM. Image taken from paper 1.
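Written out (with ⊛ denoting convolution and ⊙ element-wise multiplication), the convolutional LSTM update for row i is:

$$
\begin{aligned}
[\mathbf{o}_i, \mathbf{f}_i, \mathbf{i}_i, \mathbf{g}_i] &= \sigma\!\left(\mathbf{K}^{ss} \circledast \mathbf{h}_{i-1} + \mathbf{K}^{is} \circledast \mathbf{x}_i\right) \\
\mathbf{c}_i &= \mathbf{f}_i \odot \mathbf{c}_{i-1} + \mathbf{i}_i \odot \mathbf{g}_i \\
\mathbf{h}_i &= \mathbf{o}_i \odot \tanh(\mathbf{c}_i)
\end{aligned}
$$

Here x_i is row i of the input map, K^is and K^ss are the input-to-state and state-to-state kernels, and σ applies the sigmoid to the gates o, f, i and tanh to the content g.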
The image below is a visualization of the convolutions in the RowLSTM. The convolution slides left-to-right row-by-row. Visualization taken from 7 .
Because of the sequential nature of the computation, RowLSTM is more computationally intensive than convolutional layers. However, the hidden state of the RowLSTM encapsulates a much larger context than convolutional layers: specifically, it captures a triangular region of context above and to the left of the output pixel. The image below is a visualization of the receptive field of the RowLSTM. Image taken from paper 1.
DiagonalBiLSTM
While the RowLSTM is an improvement over convolutional layers in terms of receptive field, there is still room for improvement. This is where the DiagonalBiLSTM comes in. The goal of the DiagonalBiLSTM is to capture all of the available context. To accomplish this, the DiagonalBiLSTM scans the diagonals of the image from two directions: top-left to bottom-right and top-right to bottom-left. The outputs from these two scans are added together for the final output.
Similar to RowLSTM, the DiagonalBiLSTM uses a convolutional LSTM framework, with a 1 × 1 convolution for the input-to-state kernel and a column-wise 2 × 1 convolution for the state-to-state kernel. Additionally, the input-to-state convolution must be masked to preserve the autoregressive property, and it can be precomputed for the entire input. Visualization taken from 7.
The above image shows how the DiagonalBiLSTM generates its output along the top-left to bottom-right diagonal. Implementing this diagonal scan directly is tricky. To simplify the computation, the image is skewed by offsetting each row one position to the right of the row above, which turns the diagonals into columns and rearranges the convolution as shown in the image below. Visualization taken from 7.
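A small sketch of the skew/unskew operations (the function names are my own; the offset convention follows the description above):

```python
import torch

def skew(x):
    """Shift row i of a (batch, channels, height, width) map right by i positions,
    so that the image's diagonals become columns of the skewed map."""
    b, c, h, w = x.shape
    out = x.new_zeros(b, c, h, 2 * w - 1)
    for i in range(h):
        out[:, :, i, i:i + w] = x[:, :, i]
    return out

def unskew(x, w):
    """Inverse of `skew`: recover the original width-`w` feature map."""
    b, c, h, _ = x.shape
    out = x.new_zeros(b, c, h, w)
    for i in range(h):
        out[:, :, i] = x[:, :, i, i:i + w]
    return out
```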
Note that the DiagonalBiLSTM computes this operation along both diagonal directions. As a result, it is able to capture all the available context when generating outputs and has a complete receptive field. However, the DiagonalBiLSTM has even more computational overhead because it computes two outputs. Image taken from paper 1.
The image above shows how the receptive field for the DiagonalBiLSTM is able to capture the entire available context to generate its output.
Residual Connections
As mentioned earlier, each of the blocks (convolutional, RowLSTM, DiagonalBiLSTM) is residual. Residual connections enable training deeper PixelRNN networks and speed up convergence by propagating signals more directly through the network. The image below is a visualization of how residual connections are set up in the convolutional and LSTM cells. Image taken from paper 1.
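For the convolutional case, a residual block might look roughly like the sketch below, which reuses the `MaskedConv2d` class from the masking section; the exact 1 × 1 / 3 × 3 layout only loosely follows the paper's figure.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block built from mask-'B' convolutions (`channels` plays the role of 2h)."""
    def __init__(self, channels):
        super().__init__()
        h = channels // 2
        self.net = nn.Sequential(
            nn.ReLU(),
            MaskedConv2d('B', channels, h, kernel_size=1),      # 1x1: reduce 2h -> h
            nn.ReLU(),
            MaskedConv2d('B', h, h, kernel_size=3, padding=1),  # 3x3 masked convolution
            nn.ReLU(),
            MaskedConv2d('B', h, channels, kernel_size=1),      # 1x1: expand h -> 2h
        )

    def forward(self, x):
        return x + self.net(x)  # skip connection: add the block's input to its output
```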
PixelRNN Model Summary
This visualization summarizes the model architecture of the PixelCNN, RowLSTM, and Diagonal BiLSTM variants of the model. Image taken from paper 1 .
Preliminary Results
As shown in the image below, the PixelRNN variants achieve state-of-the-art performance on common benchmarks (MNIST, CIFAR-10, and ImageNet). The best-performing variant is the DiagonalBiLSTM, which is expected since it has the largest receptive field. Image taken from paper 1.
Gated PixelCNN Motivation
The authors of the PixelRNN paper released another paper shortly after the first that improved upon the design of the PixelCNN. The authors reasoned that the PixelRNN variants (RowLSTM and DiagonalBiLSTM) outperform the PixelCNN for two main reasons:
- The LSTM's element-wise multiplicative units are able to model more complex interactions. The absence of multiplicative operations in the PixelCNN limits its performance.
- The PixelRNNs capture much larger receptive fields. While the receptive field of the PixelCNN grows linearly with the number of layers, a blind spot forms in the receptive field of masked CNNs (more on this below).
The authors proposed modifications to the PixelCNN architecture to fix these shortcomings. First, they added gated activation units that contain multiplicative operations to make the model more expressive. Second, they fixed the receptive-field blind spot by splitting the convolution into an unmasked vertical stack and a masked horizontal stack that takes the output of the vertical stack as input.
Horizontal and Vertical Stack
As mentioned above, the receptive field of the PixelCNN, while increasing linearly with depth, contains a growing blind spot. Pixels in this blind spot are never used as context, regardless of how many layers are stacked. The blind spot is caused by the masking in the convolutions used to maintain the autoregressive property. The image below is a visualization of the blind-spot problem. Image taken from paper 2.
To fix the receptive-field blind spot, the single masked convolution is replaced with a horizontal and a vertical stack. The vertical stack is an unmasked operation that captures the full context in the rows above the output pixel. The horizontal stack is a masked operation that captures the context to the left of the output pixel and takes the vertical stack's output as an additional input. The authors also add a residual connection within the horizontal stack. Splitting up the convolution in this way removes the receptive-field blind spot, as shown in the figure below. Image taken from paper 2.
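One way to make the two stacks causal is with asymmetric padding and cropping, as in the PyTorch-style sketch below (the class names, kernel sizes, and padding scheme are my own simplifications; the very first layer would additionally need a mask-A-style shift to exclude the current pixel):

```python
import torch
import torch.nn as nn

class VerticalConv(nn.Module):
    """Unmasked convolution over the current row and the rows above it.
    Shifting its output down by one row before feeding the horizontal stack
    removes the current row, so no future pixels leak in."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, (k // 2 + 1, k), padding=(k // 2, k // 2))

    def forward(self, x):
        return self.conv(x)[:, :, :x.shape[2], :]  # crop the extra rows introduced by padding

class HorizontalConv(nn.Module):
    """Convolution over the current pixel and the pixels to its left in the same row."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, (1, k // 2 + 1), padding=(0, k // 2))

    def forward(self, x):
        return self.conv(x)[:, :, :, :x.shape[3]]  # crop the extra columns introduced by padding
```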
Gated Activation Units
Additionally, the authors replace the ReLU activations between convolutional blocks with a more sophisticated gated activation unit. This activation computes two separate convolutions, each with half of the feature maps; the two outputs are passed through different non-linearities (tanh and sigmoid) and multiplied together element-wise to produce the final output. Image taken from paper 2.
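In the paper's notation (k indexes the layer, f and g denote the filter and gate, ∗ is convolution, and ⊙ is element-wise multiplication), the gated activation is:

$$
\mathbf{y} = \tanh\!\left(W_{k,f} \ast \mathbf{x}\right) \odot \sigma\!\left(W_{k,g} \ast \mathbf{x}\right)
$$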
The final convolutional block combining the two stacks is shown in the image below: one convolutional path forms the vertical stack and the other forms the horizontal stack, with the output of the vertical stack fed into the horizontal stack as mentioned above. Image taken from paper 2.
Conditional PixelCNN
In my opinion, the most interesting part of the paper is adding the ability to conditionally generate images. You can condition the output probability distribution of the images on a high-dimensional latent vector h that acts as a description of the image. For example, the vector could be a one-hot encoding of the class labels in the ImageNet dataset. The PixelCNN network would then learn to conditionally generate specific classes of ImageNet data, so passing in the latent vector corresponding to the class "Dog" would generate images of dogs! The equation given below shows how the output probability is now conditionally dependent on h. Image taken from paper 2.
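The conditional factorization from the paper is:

$$
p(\mathbf{x} \mid \mathbf{h}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1}, \mathbf{h})
$$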
However, the latent vector h does not contain any spatial information about the object. So in the above example, while images of dogs would be generated, the dog could appear anywhere in the image. Fortunately, the authors of the paper had a solution to this problem: the latent vector can be passed through a deconvolutional network m(·) to produce an output s = m(h) that has the same spatial dimensions as the image but an arbitrary number of channels. Since s carries spatial information about the generated object, you can now control where in the image the dog is generated!
The equations below show how the gated activation unit in the network is modified to accommodate conditional generation. For the location-independent conditioning on h, the terms involving V are linear layers broadcast over all spatial positions; for the location-dependent conditioning on s, they are 1 × 1 convolutions. Image taken from paper 2.
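In the paper's notation, the two conditional variants of the gated activation are:

$$
\mathbf{y} = \tanh\!\left(W_{k,f} \ast \mathbf{x} + V_{k,f}^{T}\mathbf{h}\right) \odot \sigma\!\left(W_{k,g} \ast \mathbf{x} + V_{k,g}^{T}\mathbf{h}\right)
$$

$$
\mathbf{y} = \tanh\!\left(W_{k,f} \ast \mathbf{x} + V_{k,f} \ast \mathbf{s}\right) \odot \sigma\!\left(W_{k,g} \ast \mathbf{x} + V_{k,g} \ast \mathbf{s}\right), \qquad \mathbf{s} = m(\mathbf{h})
$$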
Final Results
The image below shows the performance of the GatedPixelCNN on CIFAR-10 (left) and ImageNet (right) from paper 2 .
Conditional Generation Examples
Here are some examples from the paper of conditional image generation on the ImageNet dataset, taken from paper 2.
1. Oord, Aaron van den, et al. "Pixel Recurrent Neural Networks." arXiv:1601.06759 [cs], Aug. 2016. http://arxiv.org/abs/1601.06759
2. Oord, Aaron van den, et al. "Conditional Image Generation with PixelCNN Decoders." arXiv:1606.05328 [cs], June 2016. http://arxiv.org/abs/1606.05328
3. Recent research has made progress in demystifying the problems in training GANs. However, when the original PixelRNN paper was published in 2016, training GANs for generative modeling was still a daunting task.
4. Advances in VAEs have made it possible to generate sharp high-dimensional data by using hierarchical techniques and by modifying the ELBO loss for better reconstructions.
5. Uria, Benigno, et al. "Neural Autoregressive Distribution Estimation." arXiv:1605.02226 [cs], May 2016. http://arxiv.org/abs/1605.02226
6. Germain, Mathieu, et al. "MADE: Masked Autoencoder for Distribution Estimation." arXiv:1502.03509 [cs, stat], June 2015. http://arxiv.org/abs/1502.03509
7. Slides from the UCF PixelRNN presentation by Logan Lebanoff, 2/22/17. https://www.crcv.ucf.edu/wp-content/uploads/2019/03/CAP6412_Spring2018_Pixel-Recurrent-Neural-Networks.pdf