Introduction

In recent years, normalizing flows have become a promising method for both generative tasks and density estimation. This is because, unlike variational autoencoders, they allow exact log-likelihoods to be computed directly, which can then be used to optimize the model. While normalizing flows have shown strong results in modeling continuous domains, they have not been explored in discrete settings. This may be because the idea of a flow in a discrete setting is difficult to intuit; however, a discrete flow can instead be thought of as “relabeling” the data. The key question of this paper is: can flows be used on discrete distributions?

In this paper, two new models are introduced:

  • Discrete Autoregressive Flows
  • Discrete Bipartite Flows

Background

Continuous normalizing flows

Normalizing flows generally assume that there is an invertible function $f^{-1}$, usually parameterized by a neural network. The goal is to map the distribution of the data to a more easily interpretable base distribution. Generally it is desirable for the base distribution to be factorizable (independent components), and spherical Gaussians are often used. Additionally, much work constrains the Jacobian of $f^{-1}$ (for example, to be triangular) so that its determinant is cheap to compute. To maximize the likelihood of the data, a change of variables formulation is used to instead evaluate the likelihood under the base distribution, which is tractable.

Given a continuous random variable $x$ from the base distribution and $y = f(x)$, we can write the change of variables as:
$p(y) = p(f^{-1}(y)) \left| \det \frac{dx}{dy} \right|$
Note that prior work often writes this in the other direction, but the two forms are equivalent.
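In log form, which is the quantity actually maximized during training, this becomes:
$\log p(y) = \log p(f^{-1}(y)) + \log \left| \det \frac{dx}{dy} \right|$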

Generally, normalizing flows can be divided into two classes:

  • Autoregressive Flows, such as MAF [2] or IAF [3].
  • Bipartite flows, such as RealNVP [4].

Autoregressive Flows

These are models that are both autoregressive and flows. Autoregressive models take the previous outputs and use them to produce the next output; that output is then fed back into the model to compute the one after it.
Given a base distribution $x \sim p(x)$ in $D$ dimensions, transform $x$ into $y$:
$y_d = \mu_d + \sigma_d \cdot x_d$
where $\mu_d$ and $\sigma_d$ are functions of $y_1, ..., y_{d-1}$, for $d \in [D]$.
For the inverse,
$x_d = \sigma^{-1}_d (y_d - \mu_d)$
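As a concrete illustration, here is a minimal NumPy sketch of this forward/inverse pair. The function `net` is a hypothetical stand-in for the neural network that maps the already-generated prefix $y_{1:d-1}$ to $(\mu_d, \sigma_d)$; it is not part of the paper.

```python
import numpy as np

def ar_forward(x, net):
    # Map a base sample x to data y, one dimension at a time (sequential).
    D = len(x)
    y = np.zeros(D)
    for d in range(D):
        mu_d, sigma_d = net(y[:d])        # parameters depend on y_{1:d-1}
        y[d] = mu_d + sigma_d * x[d]      # y_d = mu_d + sigma_d * x_d
    return y

def ar_inverse(y, net):
    # Recover x from y; each dimension only needs y_{1:d-1}, so this parallelizes.
    D = len(y)
    x = np.zeros(D)
    for d in range(D):
        mu_d, sigma_d = net(y[:d])
        x[d] = (y[d] - mu_d) / sigma_d    # x_d = sigma_d^{-1} (y_d - mu_d)
    return x
```

With this parameterization, computing $y$ from $x$ is inherently sequential (each $y_d$ needs the previous outputs), while recovering $x$ from $y$ parallelizes across dimensions once $y$ is known.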

Bipartite flows

These methods use a bipartite factorization: some variables are held constant while the others are transformed. For total dimensions $D$ and $0 < d < D$,
$y_{1:d} = x_{1:d}$
$y_{d+1:D} = \mu + \sigma \cdot x_{(d+1):D}$
Here, $\mu$ is a location transform and $\sigma$ is a scale transform, both computed as functions of the unchanged block $x_{1:d}$.
Both forward (inference) and inverse (generation) computations are fast, but these models are not as expressive as autoregressive flows.
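For comparison, here is a minimal NumPy sketch of a single coupling (bipartite) layer. Again, `net` is a hypothetical stand-in for the network that maps the unchanged block $x_{1:d}$ to $(\mu, \sigma)$ for the remaining dimensions.

```python
import numpy as np

def coupling_forward(x, d, net):
    # x[:d] passes through unchanged; x[d:] is shifted and scaled.
    mu, sigma = net(x[:d])
    return np.concatenate([x[:d], mu + sigma * x[d:]])

def coupling_inverse(y, d, net):
    # Because y[:d] == x[:d], mu and sigma can be recomputed directly.
    mu, sigma = net(y[:d])
    return np.concatenate([y[:d], (y[d:] - mu) / sigma])
```

Both directions are a single pass through `net`, which is why bipartite flows are fast in both directions.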

Normalizing flows have the downside that they cannot be used for dimensionality reduction, since invertibility forces the output to have the same dimensionality as the input.

Discrete Distribution Modelling

For discrete distributions, there has been no work directly comparable to normalizing flows. Existing work either focuses on:

  • latent variable models, such as generating sentences from a continuous space
  • models that assume a fixed-ordering of the data, such as RNNs, Transformers, etc.

Older work has also used bidirectional models such as Markov random fields, but these require approximate inference or sampling. However, note that models such as BERT show improvements from bidirectionality, which this paper has in the form of discrete bipartite flows.
There has also been work on non-autoregressive discrete models, but they do not maintain an exact density like this work.

Building Blocks

Discrete Change of Variables

Suppose $x$ is a discrete random variable and $y = f(x)$.
Then the discrete change of variables is:
$p(\textbf{y} = y) = \sum_{x \in f^{-1}(y)} p(\textbf{x} = x)$
where f1f^{-1} is the pre-image of ff.
If ff is invertible, this simplifies to:
$p(\textbf{y} = y) = p(\textbf{x} = f^{-1}(y))$
Compare this to the continuous change of variables:
$p(y) = p(f^{-1}(y)) \left| \det \frac{dx}{dy} \right|$
Essentially, the discrete formulation is just lacking the Jacobian determinant. This makes intuitive sense: discrete distributions have no volume, so there is no need to correct for the change in volume (which is what the determinant does).

Discrete Flow Transformations - XOR Example

To create discrete flows, we need a discrete, invertible analog to the functions used for transformation in the continuous case. To this end, we will first consider a binary case: XOR.
Given a binary vector $x$, we can compute
$y_d = \mu_d \oplus x_d$
for $d = 1, ..., D$.
This is of course invertible:
$x_d = \mu_d \oplus y_d$
Now we will consider an example to build intuition. Assume $D = 2$ and $p(x)$ defined as follows:

XOR example table [1]: $p(x_1=0, x_2=0) = 0.63$, $p(x_1=0, x_2=1) = 0.07$, $p(x_1=1, x_2=0) = 0.03$, $p(x_1=1, x_2=1) = 0.27$

It is clear that we cannot factorize this distribution (i.e., we cannot write it as a product $p(x_1)p(x_2)$, which would be an independence assumption).
Now, we will show that flows can model this. To do so, consider the flow
$f(x_1, x_2) = (x_1, x_1 \oplus x_2) = (y_1, y_2)$
Here, $y_2 = x_1 \oplus x_2$ is the transformed component, while $y_1 = x_1$ is passed through unchanged.

Thus, after applying one layer of flow, the paper reports $p(x_1) = [0.7, 0.3]$ and $p(x_2) = [0.9, 0.1]$. This notation is confusing (these are distributions over the transformed variables), so we will instead write $p(y_1)$ and $p(y_2)$:
$p(y_1) = [0.7, 0.3], \quad p(y_2) = [0.9, 0.1]$

$p(y_1)$ is simply the marginal of $x_1$ in the table above. But how do we get $p(y_2)$? Plugging into $f$,
$p(y_2 = 0) = p(x_1 \oplus x_2 = 0) = p(x_1 = x_2) = 0.63 + 0.27 = 0.9$
Essentially, the flow “relabels” the data so it is better modeled by the base.
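As a quick sanity check, here is a small NumPy computation of the example above, using the joint table values from this section:

```python
import numpy as np

# Joint p(x1, x2): rows index x1, columns index x2.
p = np.array([[0.63, 0.07],
              [0.03, 0.27]])

# Apply the flow (y1, y2) = (x1, x1 XOR x2) by relabeling the table entries.
q = np.zeros_like(p)
for x1 in range(2):
    for x2 in range(2):
        q[x1, x1 ^ x2] += p[x1, x2]

p_y1, p_y2 = q.sum(axis=1), q.sum(axis=0)
print(p_y1, p_y2)                            # [0.7 0.3] [0.9 0.1]
print(np.allclose(q, np.outer(p_y1, p_y2)))  # True: p(y1, y2) factorizes
```

The relabeled joint factorizes exactly into $p(y_1)p(y_2)$, which a factorized base distribution can model.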

Discrete Flow Transformations - Extension to Categorical Data

We now have flows for binary variables, but we want to extend them to categorical discrete data. To do so, we take a page from number theory and generalize the XOR function. The authors call this “modulo location-scale transform”.

Given a $D$-dimensional vector $x$ where each element takes one of $K$ values,
$y_d = (\mu_d + \sigma_d \cdot x_d) \bmod K$
Here, $\mu_d$ and $\sigma_d$ are autoregressive functions of $y$. Note that $\sigma_d$ cannot be zero (it takes values in $1, ..., K-1$), just like in the continuous case.

An essential requirement for a flow is that the transformation be invertible. To invert the function above, it is necessary that $\sigma$ and $K$ are coprime, meaning their only common divisor is 1. Hence, there are three possible constraints that can be used to make this happen:

  • Set $K$ to be prime
  • Mask noninvertible (non-coprime to $K$) values of $\sigma$
  • Set $\sigma = 1$

Additionally, note what happens when $K = 2$ and $\sigma = 1$: it's the XOR function!
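Here is a minimal NumPy sketch of the modulo location-scale transform and its inverse for a single flow step. Assuming $\gcd(\sigma, K) = 1$, the modular multiplicative inverse of $\sigma$ exists and can be computed with Python's built-in `pow(sigma, -1, K)` (Python 3.8+); the example values below are illustrative, not from the paper.

```python
import numpy as np

def mod_forward(x, mu, sigma, K):
    # y_d = (mu_d + sigma_d * x_d) mod K, applied elementwise.
    return (mu + sigma * x) % K

def mod_inverse(y, mu, sigma, K):
    # Requires gcd(sigma, K) == 1 so that the modular inverse of sigma exists.
    sigma_inv = pow(int(sigma), -1, K)
    return (sigma_inv * (y - mu)) % K

x = np.array([0, 1, 2, 3, 4])
y = mod_forward(x, mu=2, sigma=3, K=5)       # K prime, so any sigma in 1..K-1 works
assert np.array_equal(mod_inverse(y, mu=2, sigma=3, K=5), x)

# With K = 2 and sigma = 1 this reduces to XOR: (mu + x) mod 2 == mu ^ x.
```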

We can see how the modulo location-scale transform works in the following figure.

Example of a single flow using the modulo location-scale transform. (a) is the data being modelled (discretized into bins), (b) is an attempt to fit a factorized base distribution directly, and (c) is a single discrete flow. Note that even one flow is much better at modelling the data. [1]

Model

Discrete Autoregressive Flows

This figure shows a discrete autoregressive flow over 4 inputs (and outputs); multiple levels of autoregression can be stacked. The solid lines show the receptive field of the red block, while the dashed lines show other connections. Note how the autoregressive ordering can be switched.

Discrete Bipartite Flows

This figure shows two bipartite flows. The receptive field of the second output is only $x_{1:3}$, unlike the autoregressive flows above. The blue and green blocks are binary masks indicating variables that are not transformed; the remaining blocks are transformed. This can quickly be confirmed by counting the outgoing lines from each block.

Training

Like other normalizing flows, the model can be trained by directly maximizing the likelihood. Again, note the change of variables:
$p(\textbf{y} = y) = p(\textbf{x} = f^{-1}(y))$
Here we optimize both the parameters of $f$ and the base distribution $p$. Note that the paper does not notationally distinguish the $p$ on the left from the $p$ on the right; the distribution being parameterized and optimized is the base distribution on the right.

Unfortunately, using discrete functions means we cannot calculate a direct gradient. Specifically, μ\mu and σ\sigma both produce discrete values but need to be backpropagated through. To address this issue, the authors use the straight-through gradient estimator. This essentially means that on the backward pass of the network they pretend the discrete function is the identity function and just pass the gradients through.

We run into another issue with differentiation. To compute $\mu$ and $\sigma$, the network first produces $K$ logits $\theta_d$ for each dimension, and the value is then chosen with an argmax:
$\mu_d = \text{one\_hot}(\text{argmax}(\theta_d))$
However, this also isn't differentiable! Instead, on the backward pass we use a softmax with temperature $\tau$:
$\frac{d \mu_d}{d \theta_d} \approx \frac{d}{d \theta_d} \text{softmax}\left(\frac{\theta_d}{\tau}\right)$
This is differentiable, and the softmax approaches the argmax as $\tau$ approaches 0. The authors pick $\tau = 0.1$ for their experiments.
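A minimal PyTorch-style sketch of the straight-through / tempered-softmax trick described above; this is an illustration of the idea, not the authors' code, and `theta` is a placeholder logit tensor.

```python
import torch
import torch.nn.functional as F

def straight_through_one_hot(theta, tau=0.1):
    # Forward pass: hard one-hot of the argmax.
    # Backward pass: gradients of softmax(theta / tau).
    soft = F.softmax(theta / tau, dim=-1)
    hard = F.one_hot(theta.argmax(dim=-1), num_classes=theta.shape[-1]).to(theta.dtype)
    # (hard - soft).detach() + soft equals `hard` in the forward pass,
    # but only `soft` contributes to the gradient.
    return (hard - soft).detach() + soft

theta = torch.randn(4, 8, requires_grad=True)  # logits for mu over K = 8 classes
mu = straight_through_one_hot(theta)           # one-hot vectors in the forward pass
mu.sum().backward()                            # gradients reach theta via the softmax
```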

Results

Alright, we’ve looked at how the model works. Now, let’s see the results!

Experimental Settings

For discrete autoregressive flows, the authors use an autoregressive Categorical base distribution; for discrete bipartite flows, a factorized Categorical base distribution. They set $\sigma = 1$ for all experiments except character-level language modeling.

Full-rank Discrete Distribution

First, to test the flexibility of the model, the authors evaluate how well it fits a full-rank discrete distribution. They sample a distribution with $K$ classes in $D$ dimensions from a Dirichlet distribution with $\alpha = 1$. A Transformer with 64 hidden units is used both as the base model and to produce the flow parameters. Negative log-likelihood is reported in nats, indicating that the natural logarithm is used.

Negative log-likelihoods for the full-rank distribution fitting task, with $D$ and $K$ varied. Note that bipartite flows achieve nearly the same values as autoregressive flows without the associated disadvantages. [1]
Addition

Next, the authors consider a more challenging toy example: adding base-10 numbers, which they do with $D = 10$ and $D = 20$ digits. Addition is a right-to-left task, which disadvantages the base autoregressive model.

They use an LSTM with 256 hidden units for $D = 10$ and 512 hidden units for $D = 20$ as the base model.

  • For $D = 10$, the autoregressive base (left-to-right) achieves 4.0 nats (negative log-likelihood); the autoregressive flow achieves 0.2 nats.
  • The bipartite model achieves 4.0, 3.17, and 2.58 nats for 1, 2, and 4 flows, respectively.
  • For $D = 20$, the autoregressive base achieves 12.2 nats; the autoregressive flow achieves 4.8 nats.
  • The bipartite model achieves 12.2, 8.8, 7.6, and 5.08 nats for 1, 2, 4, and 8 flows, respectively.

Potts Model

Next, the authors wonder: Can the discrete flows be applied to models with intractable sampling and likelihood?
To test this, they sample from the Potts model, a 2D Markov random field. Samples are $D \times D$ matrices, where the coupling between neighboring elements is $J$; essentially, it is a grid of discrete values where neighbors are correlated according to $J$. Since exact sampling is intractable, they generate samples with 500 steps of the Metropolis-Hastings algorithm (an MCMC method).

An example of Potts models from [5]. Note that β = J here.

The following figure shows their tests on this model. Note that the authors use this task to examine the difference between an autoregressive base model and their autoregressive flows; they do not test bipartite flows. The table shows that the base autoregressive model cannot capture the undirected Potts model as well as the autoregressive flow.

Additionally, the authors sample from these models and find that they cannot distinguish the samples from the true distribution.

Character Level Tasks

Finally, the authors test the models on a real-world task: modelling natural language. First, they use the Penn Treebank, which has $K = 51$ characters. For preprocessing, the data is split into sentences, and the sequence length is restricted to 288 characters (a choice that is not explained). The authors compare to only one other non-autoregressive language model [6], a VAE-based generative model that learns a normalizing flow in the latent space. However, that model is not directly comparable because its sequence length was not restricted.

Next, the authors test on the text8 dataset. It was originally intended for evaluating text compression algorithms, and it has 100M characters as opposed to Penn Treebank's roughly 5M, which provides more data to improve the models. The results are shown in the following table:

Note that the important takeaway is how much faster these models can generate compared to the large Transformer and RNN models.

This figure from Papers With Code [7] shows how the bipartite flow results compare to all models tested on text8.

Final Takeaway

This paper is an initial step in the direction to extend the power of normalizing flows to discrete distributions. It can be summarized in the following points:

  • Motivation: Normalizing flows generally are only used for continuous distributions

    • This can be extended to discrete distributions using a different change of variables formulation (without a Jacobian determinant!)
  • Discrete autoregressive flows enable bidirectionality

  • Discrete bipartite flows enable quick generation

  • Future work:

    • Can inverse autoregressive flows be made discrete?

    • How can this method be scaled to many more classes? The straight-through estimator might not work on word sequences with 1000s of vocabulary tokens.

      • (This is also why they test on character modelling instead, which has significantly fewer values)
    • Can other invertible discrete functions be applied, such as from cryptography or random number generation?

References

[1] Tran, Dustin, et al. “Discrete flows: Invertible generative models of discrete data.” Advances in Neural Information Processing Systems 32 (2019): 14719-14728.
[2] Papamakarios, George, Theo Pavlakou, and Iain Murray. “Masked autoregressive flow for density estimation.” arXiv preprint arXiv:1705.07057 (2017).
[3] Kingma, Durk P., et al. “Improved variational inference with inverse autoregressive flow.” Advances in neural information processing systems 29 (2016): 4743-4751.
[4] Dinh, Laurent, Jascha Sohl-Dickstein, and Samy Bengio. “Density estimation using real nvp.” arXiv preprint arXiv:1605.08803 (2016).
[5] ACM Youtube: “Session 6B - Efficient sampling and counting algorithms for the Potts model”
[6] Ziegler, Zachary, and Alexander Rush. “Latent normalizing flows for discrete sequences.” International Conference on Machine Learning. PMLR, 2019.
[7] https://paperswithcode.com/sota/language-modelling-on-text8