NF4 Discrete Flows - Invertible Generative Models of Discrete Data
Carl Edwards (cne2@illinois.edu)
Introduction
In recent years, normalizing flows have become a promising method for both generative tasks and density estimation. This is because, unlike variational autoencoders, they allow direct, tractable computation of log-likelihoods, which can then be used to optimize the model. While normalizing flows have shown strong results in modeling continuous domains, they have remained largely unexplored in the discrete setting. This may be because the idea of a flow over discrete values is difficult to intuit; however, a discrete flow can instead be thought of as "relabeling" the data. The key question of this paper is: can flows be used on discrete distributions?
In this paper, two new models are introduced:
- Discrete Autoregressive Flows
- Discrete Bipartite Flows
Background
Continuous normalizing flows
Normalizing flows generally assume that there is an invertible function $f$, usually parameterized by a neural network. The goal is to map the distribution of the data to a more easily interpretable base distribution. Generally it is desirable for the base distribution to be factorizable (independent components), and spherical Gaussians are often used. Additionally, much work constrains the structure of $f$ so that its Jacobian determinant is computationally tractable. To maximize the likelihood of the data, a change-of-variables formulation is used so that the likelihood is instead evaluated under the base distribution, which is tractable.
Given a continuous random variable $x$ from the base distribution and $y = f(x)$, we can write the change of variables as:

$$p(y = \bar{y}) = p\!\left(x = f^{-1}(\bar{y})\right)\left|\det \frac{\partial f^{-1}(\bar{y})}{\partial \bar{y}}\right|$$
Note that prior work often writes this the other direction, but it is equivalent.
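As a quick sanity check of the formula, here is a standard one-dimensional example (not from the paper): take $f(x) = 2x$, so $f^{-1}(\bar{y}) = \bar{y}/2$ and the Jacobian term is the constant $1/2$:

$$p(y = \bar{y}) = p\!\left(x = \tfrac{\bar{y}}{2}\right)\left|\frac{d}{d\bar{y}}\frac{\bar{y}}{2}\right| = \frac{1}{2}\, p\!\left(x = \tfrac{\bar{y}}{2}\right).$$

Stretching the axis by a factor of 2 spreads the same probability mass over twice the volume, and the determinant corrects for exactly that.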
Generally, normalizing flows can be divided into two classes:
- Autoregressive Flows, such as MAF [2] or IAF [3].
- Bipartite flows, such as RealNVP [4].
Autoregressive Flows
These are models that are both autoregressive and flows. Autoregressive models take the previously generated outputs as input to produce the next output, which is then fed back into the function to produce the one after that.
Given a base distribution $p(x)$ over $D$ dimensions, transform $x$ into $y$:

$$y_d = \mu_d + \sigma_d \odot x_d,$$

where $\mu_d$ and $\sigma_d$ are functions of the previous outputs $y_{1:d-1}$, for $d = 1, \dots, D$.

For the inverse,

$$x_d = \frac{y_d - \mu_d}{\sigma_d}.$$
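To make the mechanics concrete, here is a minimal NumPy sketch of such an affine autoregressive transform. It assumes the MAF-style convention written above and uses a hard-coded strictly-lower-triangular linear conditioner purely for illustration (in practice $\mu$ and $\sigma$ come from neural networks): the forward pass is sequential over dimensions, while the inverse can be computed for all dimensions at once.

```python
import numpy as np

# Toy affine autoregressive flow (MAF-style) in D = 3 dimensions.
# mu_d and sigma_d depend only on previous outputs y_{1:d-1}; here they are
# strictly-lower-triangular linear maps purely for illustration.
D = 3
W_mu = np.tril(np.ones((D, D)), k=-1) * 0.5
W_sig = np.tril(np.ones((D, D)), k=-1) * 0.1

def mu_sigma(y):
    mu = W_mu @ y
    sigma = np.exp(W_sig @ y)      # keep scales strictly positive
    return mu, sigma

def forward(x):
    """Base sample x -> output y (sequential over dimensions)."""
    y = np.zeros(D)
    for d in range(D):
        mu, sigma = mu_sigma(y)    # row d only uses y[:d], which is already set
        y[d] = mu[d] + sigma[d] * x[d]
    return y

def inverse(y):
    """Output y -> base sample x (all dimensions at once)."""
    mu, sigma = mu_sigma(y)
    return (y - mu) / sigma

x = np.random.randn(D)
y = forward(x)
assert np.allclose(inverse(y), x)
```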
Bipartite flows
These methods use a bipartite factorization. Essentially, some variables are held constant while the others are transformed. For $D$ total dimensions and a split point $d < D$,

$$y_{1:d} = x_{1:d}, \qquad y_{d+1:D} = \mu + \sigma \odot x_{d+1:D},$$

where $\mu$ and $\sigma$ are functions of $x_{1:d}$ only. Here, note that $\mu$ is a location (shift) transform and $\sigma$ is a scale transform.
Both forward (inference) and inverse (generation) computations are fast, but these models are not as expressive as autoregressive flows.
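A minimal sketch of a single RealNVP-style coupling layer, again with a toy linear conditioner standing in for the usual neural network, shows why both directions are fast: each is a single vectorized computation.

```python
import numpy as np

# Toy RealNVP-style coupling layer: x_{1:d} passes through unchanged, while
# x_{d+1:D} is shifted and scaled by functions of x_{1:d}. The conditioner is
# a tiny random linear map standing in for the usual neural network.
D, d = 4, 2
W = np.random.randn(D - d, d)

def conditioner(x_fixed):
    h = W @ x_fixed
    return h, np.exp(0.1 * h)              # (mu, sigma), with sigma > 0

def forward(x):
    mu, sigma = conditioner(x[:d])
    return np.concatenate([x[:d], mu + sigma * x[d:]])

def inverse(y):
    mu, sigma = conditioner(y[:d])         # y_{1:d} equals x_{1:d}
    return np.concatenate([y[:d], (y[d:] - mu) / sigma])

x = np.random.randn(D)
assert np.allclose(inverse(forward(x)), x)
```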
Normalizing flows have the downside that they cannot be used for dimensionality reduction.
Discrete Distribution Modelling
There is no comparable prior work on normalizing flows for discrete distributions. Existing work focuses on either:
- latent variable models, such as generating sentences from a continuous space
- models that assume a fixed-ordering of the data, such as RNNs, Transformers, etc.
Older work has also used bidirectional models such as Markov random fields, but these require approximate inference or sampling. However, note that models such as BERT show improvements from bidirectionality, which this paper obtains in the form of discrete autoregressive flows (by stacking flows with different orderings).
There has also been work on non-autoregressive discrete models, but they do not maintain an exact density like this work.
Building Blocks
Discrete Change of Variables
Suppose $x$ is a discrete random variable and $y = f(x)$.
Then the discrete change of variables is:

$$p(y = \bar{y}) = \sum_{x \in f^{-1}(\bar{y})} p(x),$$

where $f^{-1}(\bar{y})$ is the pre-image of $\bar{y}$.
If $f$ is invertible, this simplifies to:

$$p(y = \bar{y}) = p\!\left(x = f^{-1}(\bar{y})\right).$$

Compare this to the continuous change of variables:

$$p(y = \bar{y}) = p\!\left(x = f^{-1}(\bar{y})\right)\left|\det \frac{\partial f^{-1}(\bar{y})}{\partial \bar{y}}\right|.$$
Essentially, the discrete formulation is just lacking the Jacobian determinant. This makes intuitive sense: discrete distributions have no volume, so there is no need to correct for the change in volume (which is what the determinant does).
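A tiny numerical check of this (a generic illustration, not code from the paper): push a categorical pmf through an arbitrary permutation of its support and confirm that probabilities are simply relabeled, with no volume correction.

```python
import numpy as np

# Discrete change of variables with an invertible f on {0, ..., K-1}:
# p_Y(y) = p_X(f^{-1}(y)) -- the probabilities are relabeled, no Jacobian.
K = 4
p_x = np.array([0.1, 0.2, 0.3, 0.4])
f = np.array([2, 0, 3, 1])                 # a permutation: f(x) = f[x]

p_y = np.zeros(K)
for x in range(K):
    p_y[f[x]] += p_x[x]                    # mass moves with the relabeling

f_inv = np.argsort(f)                      # inverse permutation
assert np.allclose(p_y, p_x[f_inv])        # i.e. p_Y(y) = p_X(f^{-1}(y))
```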
Discrete Flow Transformations - XOR Example
To create discrete flows, we need a discrete, invertible analog to the functions used for transformation in the continuous case. To this end, we will first consider a binary case: XOR.
Given a binary vector $x \in \{0, 1\}^D$, we can compute

$$y_d = \mu_d \oplus x_d$$

for $d = 1, \dots, D$, where $\oplus$ is XOR and $\mu_d$ is a binary location term produced by the model (e.g., an autoregressive function of the preceding dimensions).
This is of course invertible:

$$x_d = \mu_d \oplus y_d.$$
Now we will consider an example to build intuition. Assume $D = 2$ and a joint distribution $p(x_1, x_2)$ over binary pairs in which $x_2$ tends to agree with $x_1$: say $x_1$ is 1 with probability $\alpha$, and $x_2$ equals $x_1$ except that it is flipped with a small probability $\beta$.
It is clear that we cannot factorize this distribution (i.e., we cannot write it as a product $p(x_1)\,p(x_2)$, which would be an independence assumption).
Now, we will show that flows can model this. To do so, consider the flow

$$y_1 = x_1, \qquad y_2 = x_1 \oplus x_2.$$

Here, the XOR with $x_1$ is the invertible transformation, while $y_1$ is simply passed through unchanged.
After applying this one layer of flow, $y_1$ is Bernoulli($\alpha$), matching the marginal of $x_1$ in the joint above, and $y_2$ is Bernoulli($\beta$): it simply indicates whether $x_2$ was flipped, which is independent of $x_1$ by construction. Thus $p(y_1, y_2) = p(y_1)\,p(y_2)$, so a factorized base distribution models it exactly.
Essentially, the flow “relabels” the data so it is better modeled by the base.
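The example can be checked numerically. In the sketch below, the marginal of $x_1$ ($\alpha = 0.7$) and the flip probability ($\beta = 0.1$) are arbitrary illustrative values, not numbers from the paper.

```python
import numpy as np

# XOR example: x1 ~ Bernoulli(a); x2 copies x1 but is flipped with probability b.
# The joint p(x1, x2) does not factorize, but after the flow
# (y1, y2) = (x1, x1 XOR x2) it does. a and b are arbitrary illustrative values.
a, b = 0.7, 0.1
p = np.zeros((2, 2))                       # p[x1, x2]
for x1 in (0, 1):
    for x2 in (0, 1):
        flip = x1 ^ x2                     # 1 iff x2 differs from x1
        p[x1, x2] = (a if x1 else 1 - a) * (b if flip else 1 - b)

q = np.zeros((2, 2))                       # q[y1, y2], the flowed distribution
for x1 in (0, 1):
    for x2 in (0, 1):
        q[x1, x1 ^ x2] += p[x1, x2]

# q factorizes into its marginals; p does not.
assert np.allclose(q, np.outer(q.sum(axis=1), q.sum(axis=0)))
assert not np.allclose(p, np.outer(p.sum(axis=1), p.sum(axis=0)))
```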
Discrete Flow Transformations - Extension to Categorical Data
We now have flows for binary variables, but we want to extend them to categorical discrete data. To do so, we take a page from number theory and generalize the XOR function. The authors call this “modulo location-scale transform”.
Given a $D$-dimensional vector $x$ where each element takes one of $K$ values,

$$y_d = (\mu_d + \sigma_d \odot x_d) \bmod K.$$

Here, $\mu_d$ and $\sigma_d$ are integer-valued autoregressive functions of the preceding dimensions. Note that $\sigma_d$ cannot be zero (only $\mu_d$ can), just like in the continuous case.
An important requirement of a flow is that its transformation be invertible. To invert the function above, it is necessary that $\sigma_d$ and $K$ are coprime, meaning they share no divisor other than 1. Hence, there are three possible constraints that can be used to make this happen:
- Set $K$ to be prime
- Mask noninvertible (non-coprime to $K$) values of $\sigma_d$
- Set $\sigma_d = 1$
Additionally, note what happens when $K = 2$ and $\sigma_d = 1$: it's the XOR function!
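Here is a minimal sketch of the modulo location-scale transform on a single dimension, including the inverse via the modular inverse of $\sigma$. This is a generic illustration; in the paper, $\mu$ and $\sigma$ are produced by autoregressive networks.

```python
import math

# Modulo location-scale transform on one dimension with K classes.
# Invertibility requires gcd(sigma, K) == 1; the inverse multiplies by the
# modular inverse of sigma (pow(sigma, -1, K), Python 3.8+).
K = 5

def forward(x, mu, sigma):
    assert math.gcd(sigma, K) == 1, "sigma must be coprime with K"
    return (mu + sigma * x) % K

def inverse(y, mu, sigma):
    sigma_inv = pow(sigma, -1, K)          # modular multiplicative inverse
    return (sigma_inv * (y - mu)) % K

for x in range(K):
    assert inverse(forward(x, mu=3, sigma=2), mu=3, sigma=2) == x

# With K = 2 and sigma = 1 this reduces to XOR: (mu + x) mod 2 == mu ^ x.
assert all((m + x) % 2 == m ^ x for m in (0, 1) for x in (0, 1))
```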
We can see how the modulo location-scale transform works in the following figure.
Model
Discrete Autoregressive Flows
This figure shows 4 inputs (and outputs). Note that multiple levels of autoregression can be stacked. The solid lines show the receptive field of the red block, while the dashed lines show other connections. Note how the ordering can be switched between levels.
Discrete Bipartite Flows
This figure shows two bipartite flows. The receptive field of the 2nd output covers only part of the input, unlike the autoregressive flows above. Note that the blue and green blocks are binary masks indicating variables that are not transformed; the other blocks are transformed. This can quickly be confirmed by looking at the number of outgoing lines from a block.
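A minimal sketch of one discrete bipartite (coupling) layer, using only a location term and a toy conditioner in place of a real network, makes the masking concrete.

```python
import numpy as np

# Toy discrete bipartite (coupling) layer: masked elements pass through
# unchanged; the rest receive a modulo shift whose location depends only on
# the unchanged elements. The conditioner is a toy formula, not a network.
K, D = 5, 4
mask = np.array([1, 1, 0, 0])              # 1 = keep, 0 = transform

def conditioner(kept):
    return int(7 * kept.sum() + 3) % K     # toy location parameter mu

def forward(x):
    mu = conditioner(x[mask == 1])
    y = x.copy()
    y[mask == 0] = (x[mask == 0] + mu) % K
    return y

def inverse(y):
    mu = conditioner(y[mask == 1])         # kept half is identical in x and y
    x = y.copy()
    x[mask == 0] = (y[mask == 0] - mu) % K
    return x

x = np.random.randint(0, K, size=D)
assert np.array_equal(inverse(forward(x)), x)
```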
Training
Like other normalizing flows, the model can be trained by directly maximizing the log-likelihood. Again, note that we have the change of variables:

$$p(y = \bar{y}) = p\!\left(x = f^{-1}(\bar{y})\right).$$

In this case, we can optimize both the parameters of the flow $f$ and of the base distribution $p(x)$. Note that the paper doesn't notationally distinguish the distribution on the left from the one on the right, but we are optimizing the right-hand side.
Unfortunately, using discrete functions means we cannot calculate a direct gradient. Specifically, $\mu$ and $\sigma$ both take discrete values but need to be backpropagated through. To address this issue, the authors use the straight-through gradient estimator. This essentially means that on the backward pass of the network they pretend the discrete function is the identity function and just pass the gradients through.
We run into another issue with differentiation. To compute $\mu$ and $\sigma$, the network first produces logits $\theta_d$ for each, and the discrete value is chosen using an argmax (in one-hot form):

$$\mu_d = \text{one\_hot}\!\left(\operatorname*{argmax}_k \theta_{d,k}\right).$$

However, the argmax also isn't differentiable! Instead, on the backward pass we use a temperature softmax,

$$\frac{\partial \mu_d}{\partial \theta_d} \approx \frac{\partial}{\partial \theta_d}\, \text{softmax}\!\left(\frac{\theta_d}{\tau}\right).$$

Now, we have a differentiable surrogate. Further, this function approaches the argmax (one-hot) as the temperature $\tau$ approaches 0. The authors pick a small fixed $\tau$ for their experiments.
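A short PyTorch sketch of this straight-through trick follows. It is a generic implementation of the idea, not the authors' code, and the temperature value is just a small placeholder.

```python
import torch
import torch.nn.functional as F

# Straight-through one-hot: the forward pass uses a hard argmax one-hot, while
# gradients flow through softmax(logits / tau). Generic sketch of the trick;
# tau = 0.1 is a placeholder value, not necessarily the paper's setting.
def st_one_hot(logits, tau=0.1):
    soft = torch.softmax(logits / tau, dim=-1)
    hard = F.one_hot(logits.argmax(dim=-1), logits.shape[-1]).float()
    return hard + soft - soft.detach()     # value = hard, gradient = soft's

logits = torch.randn(3, 5, requires_grad=True)   # e.g. location logits, K = 5
mu = st_one_hot(logits)                          # exact one-hot rows
loss = (mu * torch.arange(5.0)).sum()            # toy objective applied to mu
loss.backward()                                  # grads reach `logits` via softmax(logits/tau)
print(logits.grad)
```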
Results
Alright, we’ve looked at how the model works. Now, let’s see the results!
Experimental Settings
For discrete autoregressive flows, the authors use an autoregressive Categorical base distribution; for discrete bipartite flows, a factorized Categorical base distribution. The same fixed softmax temperature is used for all experiments except character-level language modeling.
Full-rank Discrete Distribution
First, to test the flexibility of the model, the authors examine how well it fits a full-rank discrete distribution. They sample a joint distribution over discrete classes in $D$ dimensions ($D = 10$ and $D = 20$) from a Dirichlet prior. A Transformer with 64 hidden units is used for the flow parameters, and negative log-likelihood is reported in "nats", indicating the natural logarithm is used.
They use an LSTM with 256 hidden units for D=10 and 512 hidden units for D=20 as the base model.
- For D=10, the autoregressive base (left to right) achieves 4.0 nats (negative log-likelihood), while the autoregressive flow achieves 0.2 nats. The bipartite model achieves 4.0, 3.17, and 2.58 nats with 1, 2, and 4 flows.
- For D=20, the autoregressive base achieves 12.2 nats, while the autoregressive flow achieves 4.8 nats. The bipartite model achieves 12.2, 8.8, 7.6, and 5.08 nats with 1, 2, 4, and 8 flows.
Potts Model
Next, the authors wonder: Can the discrete flows be applied to models with intractable sampling and likelihood?
To test this, they sample from the Potts model, a 2D Markov random field. Samples are square grids of discrete values, where neighboring elements are correlated according to a coupling strength $\beta$. Additionally, since exact sampling is intractable, they use 500 steps of Metropolis-Hastings (MCMC) to draw approximate samples.
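For intuition, here is a minimal Metropolis-Hastings sampler for a small Potts grid. The grid size, number of states, coupling, boundary conditions, and step count are illustrative choices, not the paper's settings.

```python
import numpy as np

# Minimal Metropolis-Hastings sampler for a K-state Potts model on an L x L
# grid with coupling beta (nearest neighbours prefer to agree). Periodic
# boundaries; all constants here are arbitrary illustrative choices.
rng = np.random.default_rng(0)
L, K, beta, steps = 4, 3, 0.5, 500

def local_energy(grid, i, j, value):
    """Minus beta times the number of agreeing nearest neighbours."""
    neighbours = [grid[(i - 1) % L, j], grid[(i + 1) % L, j],
                  grid[i, (j - 1) % L], grid[i, (j + 1) % L]]
    return -beta * sum(value == n for n in neighbours)

grid = rng.integers(0, K, size=(L, L))
for _ in range(steps):
    i, j = rng.integers(0, L, size=2)
    proposal = rng.integers(0, K)
    delta = local_energy(grid, i, j, proposal) - local_energy(grid, i, j, grid[i, j])
    if rng.random() < np.exp(-delta):      # Metropolis acceptance probability
        grid[i, j] = proposal

print(grid)   # one approximate Potts sample after 500 single-site MH updates
```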
The following figure shows their tests on this model. Note that the authors use this task to examine the difference between an autoregressive base model and their autoregressive flows; they do not test bipartite flows. The following table shows how the base autoregressive model cannot capture the undirected Potts model as well as the autoregressive flow can.
Additionally, the authors sample from these models and find that they cannot distinguish the samples from the true distribution.
Character Level Tasks
Finally, the authors test the models on a real-world task: modelling natural language at the character level. First, they use the Penn Treebank. To preprocess, the data is split into sentences, and the sequence length is restricted to 288 (a choice that is not explained). The authors compare to only one other non-autoregressive language model [6], a VAE-based generative model which learns a normalizing flow in the latent space. However, that model isn't directly comparable because its sequence length was not restricted.
Next, the authors tested on the text8 dataset. This was originally intended for testing text compression algorithms, and it has 100M characters as opposed to just 5M, which provides more data to improve the models. The results are shown in the following table:
Note, the important takeaway is how much faster these models can generate compared to the large Transformer and RNN models.
Final Takeaway
This paper is an initial step toward extending the power of normalizing flows to discrete distributions. It can be summarized in the following points:
- Motivation: normalizing flows are generally only used for continuous distributions.
  - They can be extended to discrete distributions using a different change-of-variables formulation (without a Jacobian determinant!).
- Discrete autoregressive flows enable bidirectionality.
- Discrete bipartite flows enable quick generation.
- Future work:
  - Can inverse autoregressive flows be made discrete?
  - How can this method be scaled to many more classes? The straight-through estimator might not work on word sequences with 1000s of vocabulary tokens.
    - (This is also why they test on character modelling instead, which has significantly fewer values.)
  - Can other invertible discrete functions be applied, such as those from cryptography or random number generation?
References
[1] Tran, Dustin, et al. “Discrete flows: Invertible generative models of discrete data.” Advances in Neural Information Processing Systems 32 (2019): 14719-14728.
[2] Papamakarios, George, Theo Pavlakou, and Iain Murray. “Masked autoregressive flow for density estimation.” arXiv preprint arXiv:1705.07057 (2017).
[3] Kingma, Durk P., et al. “Improved variational inference with inverse autoregressive flow.” Advances in neural information processing systems 29 (2016): 4743-4751.
[4] Dinh, Laurent, Jascha Sohl-Dickstein, and Samy Bengio. “Density estimation using real nvp.” arXiv preprint arXiv:1605.08803 (2016).
[5] ACM Youtube: “Session 6B - Efficient sampling and counting algorithms for the Potts model”
[6] Ziegler, Zachary, and Alexander Rush. “Latent normalizing flows for discrete sequences.” International Conference on Machine Learning. PMLR, 2019.
[7] https://paperswithcode.com/sota/language-modelling-on-text8