layout | title |
---|---|
post | Autoregressive Models |
We begin our study of generative modeling with autoregressive models. As before, we assume we are given access to a dataset
By the chain rule of probability, we can factorize the joint distribution over the
{% math %} p(\mathbf{x}) = \prod\limits_{i=1}^{n}p(x_i \vert x_1, x_2, \ldots, x_{i-1}) = \prod\limits_{i=1}^{n} p(x_i \vert \mathbf{x}_{< i } ) {% endmath %}
where $$\mathbf{x}_{< i}=[x_1, x_2, \ldots, x_{i-1}]$$ denotes the vector of random variables with index less than $$i$$.
The chain rule factorization can be expressed graphically as a Bayesian network.
Graphical model for an autoregressive Bayesian network with no conditional independence assumptions.

Such a Bayesian network that makes no conditional independence assumptions is said to obey the autoregressive property.
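The factorization can be checked numerically on a small example. The sketch below is only an illustration: it builds a hypothetical random joint table over three binary variables (the table, the seed, and the helper names `marginal`, `conditional`, and `chain_rule_product` are all assumptions made for this example) and confirms that the product of the conditionals in the fixed ordering recovers the joint probability of every configuration.

```python
import itertools
import random

random.seed(0)
n = 3

# Hypothetical joint distribution over n binary variables, stored as a table.
configs = list(itertools.product([0, 1], repeat=n))
weights = [random.random() for _ in configs]
total = sum(weights)
joint = {x: w / total for x, w in zip(configs, weights)}


def marginal(prefix):
    """p(x_1, ..., x_k) for a prefix of length k, summing out the remaining variables."""
    k = len(prefix)
    return sum(p for x, p in joint.items() if x[:k] == prefix)


def conditional(x, i):
    """p(x_i | x_<i), where i is a 0-based index into the tuple x."""
    return marginal(x[: i + 1]) / marginal(x[:i])


def chain_rule_product(x):
    """Product of the n conditionals in the fixed ordering x_1, ..., x_n."""
    prob = 1.0
    for i in range(len(x)):
        prob *= conditional(x, i)
    return prob


# Largest disagreement between the joint table and the chain-rule product.
max_gap = max(abs(chain_rule_product(x) - joint[x]) for x in configs)
print(max_gap)
```

Because no conditional independence is assumed, every conditional here depends on the full prefix, which is exactly the autoregressive property.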
The term autoregressive originates from the literature on time-series models, where observations from the previous time steps are used to predict the value at the current time step. Here, we fix an ordering of the variables
If we allow for every conditional
To see why, let us consider the conditional for the last dimension, given by $$p(x_n \vert \mathbf{x}_{< n})$$. In order to fully specify this conditional, we need to specify a probability distribution for each of the $$2^{n-1}$$ configurations of the variables $$x_1, x_2, \ldots, x_{n-1}$$. For any one of the
In an autoregressive generative model, the conditionals are specified as parameterized functions with a fixed number of parameters. That is, we assume the conditional distributions $$p(x_i \vert \mathbf{x}_{< i})$$ to correspond to a Bernoulli random variable and learn a function that maps the preceding random variables $$x_1, x_2, \ldots, x_{i-1}$$ to the mean of this distribution. Hence, we have
{% math %}
p_{\theta_i}(x_i \vert \mathbf{x}_{< i}) = \mathrm{Bern}(f_i(x_1, x_2, \ldots, x_{i-1}))
{% endmath %}
where
The number of parameters of an autoregressive generative model is given by
{% math %} f_i(x_1, x_2, \ldots, x_{i-1}) =\sigma(\alpha^{(i)}_0 + \alpha^{(i)}_1 x_1 + \ldots + \alpha^{(i)}_{i-1} x_{i-1}) {% endmath %}
where
A natural way to increase the expressiveness of an autoregressive generative model is to use more flexible parameterizations for the mean function, e.g., multi-layer perceptrons (MLPs). For example, consider the case of a neural network with 1 hidden layer. The mean function for variable
{% math %} \mathbf{h}_i = \sigma(A_i \mathbf{x}_{< i} + \mathbf{c}_i)\\ f_i(x_1, x_2, \ldots, x_{i-1}) =\sigma(\boldsymbol{\alpha}^{(i)}\mathbf{h}_i +b_i ) {% endmath %}
where
The Neural Autoregressive Density Estimator (NADE) provides an alternate MLP-based parameterization that is more statistically and computationally efficient than the vanilla approach. In NADE, parameters are shared across the functions used for evaluating the conditionals. In particular, the hidden layer activations are specified as
{% math %}
\mathbf{h}_i = \sigma(W_{., < i} \mathbf{x}_{< i} + \mathbf{c})\\
f_i(x_1, x_2, \ldots, x_{i-1}) =\sigma(\boldsymbol{\alpha}^{(i)}\mathbf{h}_i +b_i )
{% endmath %}
where $$\theta=\{W\in \mathbb{R}^{d\times n}, \mathbf{c} \in \mathbb{R}^d, \{\boldsymbol{\alpha}^{(i)}\in \mathbb{R}^d\}^n_{i=1}, \{b_i \in \mathbb{R}\}^n_{i=1}\}$$ is the full set of parameters for the mean functions $$f_1(\cdot), f_2(\cdot), \ldots, f_n(\cdot)$$.
- The total number of parameters gets reduced from $$O(n^2 d)$$ to $$O(nd)$$ [readers are encouraged to check!].
- The hidden unit activations can be evaluated in $$O(nd)$$ time via the following recursive strategy:
  {% math %} \mathbf{h}_i = \sigma(\mathbf{a}_i)\\ \mathbf{a}_{i+1} = \mathbf{a}_{i} + W[., i]x_i {% endmath %}
  with the base case given by $$\mathbf{a}_1=\mathbf{c}$$.
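The recursive strategy can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names, the random initialization, and the sizes are assumptions made for the example, and the base case $$\mathbf{a}_1 = \mathbf{c}$$ is taken from the fact that $$\mathbf{x}_{< 1}$$ is empty. The recursive version updates the pre-activations one column of $$W$$ at a time, while the direct version recomputes $$W_{., < i} \mathbf{x}_{< i} + \mathbf{c}$$ from scratch for every $$i$$.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_nade(n, d, seed=0):
    """Hypothetical random initialization of the shared NADE parameters."""
    rng = random.Random(seed)
    return {
        "W": [[rng.gauss(0, 0.5) for _ in range(n)] for _ in range(d)],  # d x n
        "c": [rng.gauss(0, 0.5) for _ in range(d)],                      # d
        "alpha": [[rng.gauss(0, 0.5) for _ in range(d)] for _ in range(n)],  # n x d
        "b": [rng.gauss(0, 0.5) for _ in range(n)],                      # n
    }


def output_mean(params, i, a):
    """f_i = sigmoid(alpha^(i) . h_i + b_i) with h_i = sigmoid(a_i)."""
    h = [sigmoid(a_k) for a_k in a]
    return sigmoid(sum(w * h_k for w, h_k in zip(params["alpha"][i], h)) + params["b"][i])


def nade_means_recursive(params, x):
    """All n Bernoulli means using a_{i+1} = a_i + W[., i] x_i (O(nd) for the hidden layer)."""
    W = params["W"]
    a = list(params["c"])  # base case: a_1 = c
    means = []
    for i in range(len(x)):
        means.append(output_mean(params, i, a))
        for k in range(len(a)):  # add column i of W, scaled by x_i
            a[k] += W[k][i] * x[i]
    return means


def nade_means_direct(params, x):
    """Same means, recomputing the pre-activations from scratch for every i."""
    W, c = params["W"], params["c"]
    means = []
    for i in range(len(x)):
        a = [c[k] + sum(W[k][j] * x[j] for j in range(i)) for k in range(len(c))]
        means.append(output_mean(params, i, a))
    return means


def log_likelihood(params, x):
    """log p(x) as the sum of the n Bernoulli log-conditionals."""
    means = nade_means_recursive(params, x)
    return sum(math.log(f if x_i == 1 else 1.0 - f) for x_i, f in zip(x, means))


params = init_nade(n=5, d=3)
x = [1, 0, 1, 1, 0]
fast = nade_means_recursive(params, x)
slow = nade_means_direct(params, x)
print(max(abs(u - v) for u, v in zip(fast, slow)))
```

The two functions agree up to floating-point rounding. The recursive one touches each column of $$W$$ once, which is where the saving in hidden-layer computation comes from.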
The RNADE algorithm extends NADE to learn generative models over real-valued data. Here, the conditionals are modeled via a continuous distribution such as an equi-weighted mixture of
Notice that NADE requires specifying a single, fixed ordering of the variables. The choice of ordering can lead to different models. The EoNADE algorithm allows training an ensemble of NADE models with different orderings.
Recall that learning a generative model involves optimizing the closeness between the data and model distributions. One commonly used notion of closeness is the KL divergence between the data and the model distributions.
{% math %} \min_{\theta\in \mathcal{M}}d_{KL} (p_{\mathrm{data}}, p_{\theta}) = \mathbb{E}_{\mathbf{x} \sim p_{\mathrm{data}} }\left[\log p_{\mathrm{data}}(\mathbf{x}) - \log p_{\theta}(\mathbf{x})\right] {% endmath %}
Before moving any further, we make two comments about the KL divergence. First, we note that the KL divergence between any two distributions is asymmetric. As we navigate through this chapter, the reader is encouraged to think about what could go wrong if we decided to optimize the reverse KL divergence instead. Secondly, the KL divergence heavily penalizes any model distribution $$p_\theta$$ that assigns low probability to a datapoint that is likely under $$p_{\mathrm{data}}$$.
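Both comments can be illustrated on a small discrete example. The sketch below is an illustration under stated assumptions: the three-outcome distributions `p_data` and `p_model` are made up for the example, and `kl_divergence` is a hypothetical helper that uses the usual convention that terms with zero probability under the first argument contribute nothing.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions given as lists of probabilities."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0.0:
            continue  # outcomes that p never produces contribute nothing
        if q_x == 0.0:
            return math.inf  # q assigns zero mass to an outcome that p can produce
        total += p_x * (math.log(p_x) - math.log(q_x))
    return total


# Made-up distributions over three outcomes.
p_data = [0.5, 0.4, 0.1]
p_model = [0.8, 0.1, 0.1]

forward = kl_divergence(p_data, p_model)  # the direction used in the objective
reverse = kl_divergence(p_model, p_data)  # the reverse KL divergence
print(forward, reverse)

# A model that gives zero probability to an outcome the data can produce.
print(kl_divergence(p_data, [0.6, 0.4, 0.0]))
```

The two directions give different values, and the last call returns infinity because the model places no mass on an outcome that occurs under the data distribution.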
Since
{% math %} \max_{\theta\in \mathcal{M}}\mathbb{E}_{\mathbf{x} \sim p_{\mathrm{data}} }\left[\log p_{\theta}(\mathbf{x})\right]. {% endmath %}
Here,
To approximate the expectation over the unknown
{% math %} \max_{\theta\in \mathcal{M}}\frac{1}{\vert \mathcal{D} \vert} \sum_{\mathbf{x} \in\mathcal{D} }\log p_{\theta}(\mathbf{x}) = \mathcal{L}(\theta \vert \mathcal{D}). {% endmath %}
The maximum likelihood estimation (MLE) objective has an intuitive interpretation: pick the model parameters
In practice, we optimize the MLE objective using mini-batch gradient ascent. The algorithm operates in iterations. At every iteration
where
From a practical standpoint, we must think about how to choose hyperparameters (such as the initial learning rate) and a stopping criterion for gradient ascent. For both these questions, we follow the standard practice in machine learning of monitoring the objective on a validation dataset. Consequently, we choose the hyperparameters with the best performance on the validation dataset and stop updating the parameters when the validation log-likelihoods cease to improve<sup>1</sup>.
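The training loop with validation-based early stopping can be sketched as follows. This is a minimal sketch under several assumptions that are not part of the notes: the model is a small sigmoid-of-linear (fully visible) parameterization of the conditionals, the update is plain ascent (parameters plus learning rate times the mini-batch gradient), the data come from a toy generator `sample_chain`, and the learning rate, batch size, and `patience` values are arbitrary illustrative choices.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def means(alpha, x):
    """f_i = sigmoid(alpha_i[0] + sum_j alpha_i[j + 1] * x_j) over the preceding x_j."""
    return [sigmoid(a[0] + sum(a[j + 1] * x[j] for j in range(i)))
            for i, a in enumerate(alpha)]


def avg_log_likelihood(alpha, data):
    total = 0.0
    for x in data:
        for x_i, f_i in zip(x, means(alpha, x)):
            total += math.log(f_i if x_i == 1 else 1.0 - f_i)
    return total / len(data)


def ascent_step(alpha, batch, lr):
    """One gradient ascent step on the mini-batch average log-likelihood."""
    grads = [[0.0] * len(a) for a in alpha]
    for x in batch:
        for i, f_i in enumerate(means(alpha, x)):
            err = x[i] - f_i  # derivative of the Bernoulli log-likelihood w.r.t. the logit
            grads[i][0] += err
            for j in range(i):
                grads[i][j + 1] += err * x[j]
    for a, g in zip(alpha, grads):
        for k in range(len(a)):
            a[k] += lr * g[k] / len(batch)


def sample_chain(rng, n):
    """Toy data: x_1 is a fair coin; each later bit copies the previous one w.p. 0.9."""
    x = [rng.randint(0, 1)]
    for _ in range(n - 1):
        x.append(x[-1] if rng.random() < 0.9 else 1 - x[-1])
    return x


def train(n=4, lr=0.5, batch_size=16, patience=5, max_epochs=200, seed=0):
    rng = random.Random(seed)
    train_set = [sample_chain(rng, n) for _ in range(400)]
    val_set = [sample_chain(rng, n) for _ in range(200)]
    alpha = [[0.0] * (i + 1) for i in range(n)]  # bias plus one weight per preceding variable
    initial_val = avg_log_likelihood(alpha, val_set)
    best_val, best_alpha, bad_epochs = initial_val, [a[:] for a in alpha], 0
    for _ in range(max_epochs):
        rng.shuffle(train_set)
        for start in range(0, len(train_set), batch_size):
            ascent_step(alpha, train_set[start:start + batch_size], lr)
        val = avg_log_likelihood(alpha, val_set)
        if val > best_val:
            best_val, best_alpha, bad_epochs = val, [a[:] for a in alpha], 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # validation log-likelihood stopped improving
                break
    return initial_val, best_val, best_alpha


initial_val, best_val, best_alpha = train()
print(initial_val, best_val)
```

Keeping a copy of the parameters with the best validation log-likelihood, rather than the final ones, is one common way to implement the stopping rule. Hyperparameters such as `lr` would be compared using the same validation score.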
Now that we have a well-defined objective and optimization procedure, the only remaining task is to evaluate the objective in the context of an autoregressive generative model. To this end, we substitute the factorized joint distribution of an autoregressive model in the MLE objective to get
{% math %} \max_{\theta \in \mathcal{M}}\frac{1}{\vert \mathcal{D} \vert} \sum_{\mathbf{x} \in\mathcal{D} }\sum_{i=1}^n\log p_{\theta_i}(x_i \vert \mathbf{x}_{< i}) {% endmath %}
where
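Inference in an autoregressive model is straightforward. For density estimation of an arbitrary point $$\mathbf{x}$$, we evaluate the log-conditionals $$\log p_{\theta_i}(x_i \vert \mathbf{x}_{< i})$$ for each $$i$$ and add them up to obtain the log-likelihood that the model assigns to $$\mathbf{x}$$.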
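Sampling from an autoregressive model is a sequential procedure. Here, we first sample $$x_1$$, then sample $$x_2$$ conditioned on the sampled $$x_1$$, and so on until we sample $$x_n$$ conditioned on the previously sampled $$\mathbf{x}_{< n}$$.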
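Density evaluation and sampling can be contrasted in a short sketch. It assumes the sigmoid-of-linear form of the mean functions given earlier, with hypothetical randomly drawn parameters `alpha` (one bias plus one weight per preceding variable). The names `conditional_mean`, `log_prob`, and `sample` are illustrative only.

```python
import itertools
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def conditional_mean(alpha_i, prefix):
    """f_i(x_<i): Bernoulli mean of x_i given the values that precede it."""
    return sigmoid(alpha_i[0] + sum(w * v for w, v in zip(alpha_i[1:], prefix)))


def log_prob(alpha, x):
    """log p(x) as the sum of the n log-conditionals.

    Every term depends only on the given x, so the n terms are independent
    of one another and could be computed in parallel.
    """
    total = 0.0
    for i, alpha_i in enumerate(alpha):
        f_i = conditional_mean(alpha_i, x[:i])
        total += math.log(f_i if x[i] == 1 else 1.0 - f_i)
    return total


def sample(alpha, rng):
    """Draw x_1, then x_2 given x_1, and so on; each step needs the earlier draws."""
    x = []
    for alpha_i in alpha:
        f_i = conditional_mean(alpha_i, x)
        x.append(1 if rng.random() < f_i else 0)
    return x


rng = random.Random(0)
n = 4
# Hypothetical parameters: alpha[i] holds a bias followed by i weights.
alpha = [[rng.gauss(0, 1) for _ in range(i + 1)] for i in range(n)]

# The model is normalized by construction: probabilities of all 2^n points sum to 1.
total_mass = sum(math.exp(log_prob(alpha, x)) for x in itertools.product([0, 1], repeat=n))
draw = sample(alpha, rng)
print(total_mass, draw, log_prob(alpha, draw))
```

The loop in `sample` cannot be reordered, since step $$i$$ needs the values drawn at steps $$1, \ldots, i-1$$, whereas the loop in `log_prob` has no such dependency once $$\mathbf{x}$$ is known.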
Finally, an autoregressive model does not directly learn unsupervised representations of the data. In the next few lectures, we will look at latent variable models (e.g., variational autoencoders) which explicitly learn latent representations of the data.
Footnotes

1. Given the non-convex nature of such problems, the optimization procedure can get stuck in local optima. Hence, early stopping will generally not be optimal but is a very practical strategy.