\section{Autoregressive Models}
We begin our study with the autoregressive generative models. As before, we assume we are given access to a dataset $\mathcal{D}$ of $n$-dimensional datapoints $\mathbf{x}$. For simplicity, we assume the datapoints are binary, i.e., $\mathbf{x} \in \{0,1\}^n$.
\subsection{Representation}
By the chain rule of probability, we can factorize the joint distribution over the $n$-dimensions as:
\begin{equation}
p(\mathbf{x}) = \prod\limits_{i=1}^{n}p(x_i \vert x_1, x_2, \ldots, x_{i-1}) = \prod\limits_{i=1}^{n} p(x_i \vert \mathbf{x}_{<i})
\label{eq:chain_rule}
\end{equation}
where $\mathbf{x}_{<i}=[x_1, x_2, \ldots, x_{i-1}]$ denotes the vector of random variables with index less than $i$. If we allow for every conditional $p(x_i \vert \mathbf{x}_{<i})$ to be specified in a tabular form, then such a representation is fully general and can represent any possible distribution over $n$ random variables. However, the space complexity for such a representation grows exponentially with $n$.
To see why, let us consider the conditional for the last dimension, given by $p(x_n \vert \mathbf{x}_{<n})$. In order to fully specify this conditional, we need to specify a probability for each of the $2^{n-1}$ configurations of the variables $x_1, x_2, \ldots, x_{n-1}$. Since $x_n$ is a binary variable, one parameter suffices per configuration of $x_1, x_2, \ldots, x_{n-1}$, namely $p(x_n = 1 \vert x_1, x_2, \ldots, x_{n-1})$. In total, the number of parameters needed to specify this single conditional is $2^{n-1}$. Hence, a tabular representation of the conditionals is impractical for learning the joint distribution in Eq.~(\ref{eq:chain_rule}).
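As a concrete illustration, consider binarized MNIST images with $n = 28 \times 28 = 784$ pixels: a tabular representation of the joint distribution would require on the order of $2^{784} \approx 10^{236}$ parameters, astronomically more than any dataset could ever pin down.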
In an \textit{autoregressive generative model}, the conditionals are specified as parameterized functions with a fixed number of parameters. That is, we assume each conditional distribution $p(x_i \vert \mathbf{x}_{<i})$ corresponds to a Bernoulli random variable and learn a function that maps the preceding random variables $x_1, x_2, \ldots, x_{i-1}$ to the mean of this distribution. Hence, we have:
\[
p_{\theta_i}(x_i \vert \mathbf{x}_{<i}) = \mathrm{Bern}(f_i(x_1, x_2, \ldots, x_{i-1}))
\]
where $\theta_i$ denotes the set of parameters used to specify the mean function $f_i: \{0,1\}^{i-1}\rightarrow [0,1]$. The term \textit{autoregressive} originates from the literature on time-series models, where observations from previous time steps are used to predict the value at the current time step. Here, we predict the distribution of the $i$-th random variable using the values of the preceding random variables in the sequence $x_1, x_2, \ldots, x_n$.
The number of parameters of an autoregressive generative model is given by $\sum_{i=1}^n \vert \theta_i \vert$. As we shall see in the examples below, this is far fewer than in the tabular setting considered previously. Unlike the tabular setting, however, an autoregressive generative model cannot represent all possible distributions: its expressiveness is limited because we restrict the conditional distributions to Bernoulli random variables whose means are specified by a restricted class of parameterized functions.
In the simplest case, we can specify the function as a linear combination of the input elements followed by a sigmoid non-linearity (to restrict the output to lie between 0 and 1). This gives us the formulation of a \textit{fully-visible sigmoid belief network} (FVSBN):
\[
f_i(x_1, x_2, \ldots, x_{i-1}) =\sigma(\alpha^{(i)}_0 + \alpha^{(i)}_1 x_1 + \ldots + \alpha^{(i)}_{i-1} x_{i-1})
\]
where $\sigma$ denotes the sigmoid function and $\theta_i=\{\alpha^{(i)}_0,\alpha^{(i)}_1, \ldots, \alpha^{(i)}_{i-1}\}$ denotes the parameters of the mean function. The conditional for variable $i$ requires $i$ parameters, so the total number of parameters in the model is given by $\sum_{i=1}^n i = O(n^2)$. Note that this is far fewer than the exponential number of parameters required in the tabular case.
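To make the parameter count concrete, here is a minimal NumPy sketch of the FVSBN mean functions for a toy dimension $n=5$; the weights are random placeholders rather than learned values.
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n = 5  # number of binary dimensions (illustrative)
rng = np.random.default_rng(0)

# One weight vector per conditional: alpha[i] holds the i+1 parameters
# (bias plus i input weights) of the mean function f_i.
alpha = [rng.normal(size=i + 1) for i in range(n)]

def fvsbn_mean(i, x):
    """Mean of the Bernoulli conditional p(x_i = 1 | x_{<i})."""
    return sigmoid(alpha[i][0] + alpha[i][1:] @ x[:i])

x = rng.integers(0, 2, size=n)             # an arbitrary binary datapoint
means = [fvsbn_mean(i, x) for i in range(n)]
total_params = sum(a.size for a in alpha)  # sum_{i=1}^{n} i = O(n^2)
\end{verbatim}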
A natural way to increase the expressiveness of an autoregressive generative model is to use more flexible parameterizations for the mean function, e.g., multi-layer perceptrons (MLPs). For a one-hidden-layer neural network, the mean function for variable $i$ can be expressed as:
\[
\begin{aligned}
\mathbf{h}_i &= \sigma(A_i \mathbf{x}_{<i} + \mathbf{c}_i)\\
f_i(x_1, x_2, \ldots, x_{i-1}) &= \sigma(\boldsymbol{\alpha}^{(i)}\mathbf{h}_i + b_i)
\end{aligned}
\]
where $\mathbf{h}_i \in \mathbb{R}^d$ denotes the hidden layer activations for the MLP and $\theta_i = \{A_i \in \mathbb{R}^{d\times (i-1)}, \mathbf{c}_i \in \mathbb{R}^d, \boldsymbol{\alpha}^{(i)}\in \mathbb{R}^d, b_i \in \mathbb{R}\}$ is the set of parameters for the mean function $f_i(\cdot)$. The total number of parameters in this model is dominated by the matrices $A_i$ and is given by $O(n^2 d)$.
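A rough NumPy sketch of this one-hidden-layer parameterization for a single conditional follows; the shapes match the definitions above, and the randomly initialized parameters and the choices of $n$, $d$, and $i$ are illustrative placeholders.
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, d = 5, 3                  # data dimension and hidden width (illustrative)
rng = np.random.default_rng(0)
i = 4                        # which conditional to evaluate

A_i = rng.normal(size=(d, i - 1))   # A_i in R^{d x (i-1)}
c_i = rng.normal(size=d)            # c_i in R^d
alpha_i = rng.normal(size=d)        # alpha^{(i)} in R^d
b_i = rng.normal()                  # b_i in R

x = rng.integers(0, 2, size=n)
h_i = sigmoid(A_i @ x[: i - 1] + c_i)   # hidden layer activations
f_i = sigmoid(alpha_i @ h_i + b_i)      # mean of p(x_i = 1 | x_{<i})
\end{verbatim}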
The Neural Autoregressive Density Estimator (NADE) provides a more efficient MLP parameterization that shares parameters across the conditionals when evaluating the hidden layer activations:
\[
\begin{aligned}
\mathbf{h}_i &= \sigma(W_{., <i} \mathbf{x}_{<i} + \mathbf{c})\\
f_i(x_1, x_2, \ldots, x_{i-1}) &= \sigma(\boldsymbol{\alpha}^{(i)}\mathbf{h}_i + b_i)
\end{aligned}
\]
where $\theta=\{W\in \mathbb{R}^{d\times n}, \mathbf{c} \in \mathbb{R}^d, \{\boldsymbol{\alpha}^{(i)}\in \mathbb{R}^d\}^n_{i=1}, \{b_i \in \mathbb{R}\}^n_{i=1}\}$ is the full set of parameters for the mean functions $f_1(\cdot), f_2(\cdot), \ldots, f_n(\cdot)$. The weight matrix $W$ and the bias vector $\mathbf{c}$ are shared across the conditionals. Sharing parameters offers two benefits:
\begin{enumerate}
\item The total number of parameters is reduced from $O(n^2 d)$ to $O(nd)$ [readers are encouraged to check!].
\item The hidden unit activations can be evaluated in $O(nd)$ time via the following recursive strategy (see the sketch after this list):
\[
\begin{aligned}
\mathbf{h}_i &= \sigma(\mathbf{a}_i)\\
\mathbf{a}_{i+1} &= \mathbf{a}_{i} + W[., i]x_i
\end{aligned}
\]
with the base case given by $\mathbf{a}_1=\mathbf{c}$.
\end{enumerate}
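To make the savings concrete, here is a minimal NumPy sketch of this recursion with random placeholder parameters; it computes all $n$ hidden activations in $O(nd)$ total time.
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, d = 5, 3                     # illustrative sizes
rng = np.random.default_rng(0)

W = rng.normal(size=(d, n))     # shared weight matrix, W in R^{d x n}
c = rng.normal(size=d)          # shared bias, c in R^d
x = rng.integers(0, 2, size=n)

# Recursion: a_1 = c, a_{i+1} = a_i + W[:, i] * x_i. Each step costs O(d),
# so all n hidden layers cost O(nd) overall, instead of the O(n^2 d) cost
# of recomputing each W_{.,<i} x_{<i} from scratch.
a = c.copy()
h = []
for i in range(n):
    h.append(sigmoid(a))        # h_i depends only on x_{<i}
    a = a + W[:, i] * x[i]      # fold in x_i for the next conditional
\end{verbatim}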
\subsection{Learning and inference}
Recall that learning a generative model involves optimizing the closeness between the data and model distributions. One commonly used notion of closeness is the KL divergence between the data and the model distributions.
$$
\begin{align*}
\min_{\theta\in \mathcal{M}}d_{KL}(p_{\mathrm{data}}, p_{\theta}) &= \mathbb{E}_{\mathbf{x} \sim p_{\mathrm{data}} }\left[\log p_{\mathrm{data}}(\mathbf{x}) - \log p_{\theta}(\mathbf{x})\right].
\end{align*}
$$
Before moving any further, we make two comments about the KL divergence. First, we note that the KL divergence between two distributions is asymmetric. As we navigate through this chapter, the reader is encouraged to think about what could go wrong if we decided to optimize the reverse KL divergence instead. Second, the KL divergence heavily penalizes any model distribution $p_\theta$ that places little mass on a datapoint with non-zero probability under $p_{\mathrm{data}}$. In the extreme case, if the density $p_\theta(\mathbf{x})$ evaluates to zero for a datapoint sampled from $p_{\mathrm{data}}$, the objective evaluates to $+\infty$.
Since $p_{\mathrm{data}}$ does not depend on $\theta$, we can equivalently recover the optimal parameters via maximum likelihood estimation:
$$
\begin{align*}
\max_{\theta\in \mathcal{M}}\mathbb{E}_{\mathbf{x} \sim p_{\mathrm{data}} }\left[\log p_{\theta}(\mathbf{x})\right].
\end{align*}
$$
Here, $\log p_{\theta}(\mathbf{x})$ is referred to as the log-likelihood of the datapoint $\mathbf{x}$ with respect to the model distribution $p_\theta$.
To approximate the expectation over the unknown $p_{\mathrm{data}}$, we make an assumption: points in the dataset $\mathcal{D}$ are sampled i.i.d. from $p_{\mathrm{data}}$. This allows us to obtain an unbiased Monte Carlo estimate of the objective:
$$
\begin{align}
\max_{\theta\in \mathcal{M}}\frac{1}{\vert \mathcal{D} \vert} \sum_{\mathbf{x} \in\mathcal{D} }\log p_{\theta}(\mathbf{x}) = \mathcal{L}(\theta \vert \mathcal{D}).
\label{eq:mle}\tag{2}
\end{align}
$$
The maximum likelihood estimation (MLE) objective has an intuitive interpretation: pick the model parameters $\theta \in \mathcal{M}$ that maximize the log-probability of the observed datapoints in $\mathcal{D}$.
In practice, we optimize the MLE objective using mini-batch gradient ascent. The algorithm operates in iterations. At every iteration $t$, we sample a mini-batch $\mathcal{B}_t$ of datapoints drawn randomly from the dataset ($\vert \mathcal{B}_t\vert < \vert \mathcal{D} \vert$) and compute the gradient of the objective evaluated on the mini-batch. The parameters at iteration $t+1$ are then given by the following update rule:
\[
\theta^{(t+1)} = \theta^{(t)} + r_t \nabla_\theta\mathcal{L}(\theta^{(t)} \vert \mathcal{B}_t)
\]
where $\theta^{(t+1)}$ and $\theta^{(t)}$ are the parameters at iterations $t+1$ and $t$ respectively, and $r_t$ is the learning rate at iteration $t$. Typically, we only specify the initial learning rate $r_1$ and update the rate based on a schedule. \href{http://cs231n.github.io/optimization-1/}{Variants} of stochastic gradient ascent, such as RMSprop and Adam, employ modified update rules that work slightly better in practice.
From a practical standpoint, we must think about how to choose hyperparameters (such as the initial learning rate) and a stopping criterion for the gradient ascent updates. For both of these questions, we follow the standard practice in machine learning of monitoring the objective on a validation dataset. Consequently, we choose the hyperparameters with the best performance on the validation dataset and stop updating the parameters when the validation log-likelihood stops improving.
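To illustrate the update rule, the sketch below runs mini-batch gradient ascent on a deliberately simple model with a closed-form gradient: independent Bernoulli conditionals parameterized by logits, fit to synthetic data. It is a toy stand-in for the autoregressive case, with the learning rate held constant for brevity.
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, N = 5, 1000
data = (rng.random(size=(N, n)) < 0.3).astype(float)  # synthetic binary data

theta = np.zeros(n)        # logits; p(x_i = 1) = sigmoid(theta_i)
r = 0.1                    # learning rate r_t (held constant here)
batch_size = 64

for t in range(500):
    batch = data[rng.integers(0, N, size=batch_size)]
    # Gradient of the average Bernoulli log-likelihood w.r.t. the logits:
    # d/dtheta_i [x_i log p_i + (1 - x_i) log(1 - p_i)] = x_i - p_i.
    grad = (batch - sigmoid(theta)).mean(axis=0)
    theta = theta + r * grad   # ascent: theta^{(t+1)} = theta^{(t)} + r_t grad

print(sigmoid(theta))      # approaches the empirical marginals (~0.3)
\end{verbatim}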
Now that we have a well-defined objective and optimization procedure, the only remaining task is to evaluate the objective in the context of an autoregressive generative model. To this end, we substitute the factorization of the joint distribution in Eq.~\ref{eq:chain_rule} into the MLE objective in Eq.~\ref{eq:mle} to get:
\[
\max_{\theta \in \mathcal{M}}\frac{1}{\vert \mathcal{D} \vert} \sum_{\mathbf{x} \in\mathcal{D} }\sum_{i=1}^n\log p_{\theta_i}(x_i \vert \mathbf{x}_{<i})
\]
where $\theta = \{\theta_1, \theta_2, \ldots, \theta_n\}$ now denotes the collective set of parameters for the conditionals.
Inference in an autoregressive model is straightforward. For density estimation of an arbitrary point $\mathbf{x}$, we simply evaluate the log-conditionals $\log p_{\theta_i}(x_i \vert \mathbf{x}_{<i})$ for each $i$ and add them up to obtain the log-likelihood assigned by the model to $\mathbf{x}$. Since the conditioning vector $\mathbf{x}$ is known, each of the conditionals can be evaluated in parallel. Hence, density estimation is efficient on modern hardware.
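A minimal sketch of density estimation, reusing the FVSBN parameterization from the earlier sketch with random placeholder weights: every conditional is evaluated directly from the known $\mathbf{x}$ and the log-probabilities are summed. The loop is written sequentially, but no iteration depends on another's output, which is why the conditionals parallelize.
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n = 5
rng = np.random.default_rng(0)
alpha = [rng.normal(size=i + 1) for i in range(n)]  # placeholder weights

def log_likelihood(x):
    """Sum of Bernoulli log-conditionals log p(x_i | x_{<i})."""
    total = 0.0
    for i in range(n):  # iterations are independent given x
        mean = sigmoid(alpha[i][0] + alpha[i][1:] @ x[:i])
        total += x[i] * np.log(mean) + (1 - x[i]) * np.log(1 - mean)
    return total

x = rng.integers(0, 2, size=n)
print(log_likelihood(x))
\end{verbatim}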
Sampling from an autoregressive model is a sequential procedure: we first sample $x_1$, then sample $x_2$ conditioned on the sampled $x_1$, followed by $x_3$ conditioned on both $x_1$ and $x_2$, and so on until we sample $x_n$ conditioned on the previously sampled $\mathbf{x}_{<n}$. For applications requiring real-time generation of high-dimensional data, such as audio synthesis, this sequential sampling can be an expensive process.
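A corresponding sketch of this sequential (ancestral) sampling procedure, under the same placeholder FVSBN model: each $x_i$ can only be drawn after $x_{<i}$ has been sampled, so the loop cannot be parallelized.
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n = 5
rng = np.random.default_rng(0)
alpha = [rng.normal(size=i + 1) for i in range(n)]  # placeholder weights

x = np.zeros(n)
for i in range(n):  # inherently sequential: x_i needs the sampled x_{<i}
    mean = sigmoid(alpha[i][0] + alpha[i][1:] @ x[:i])
    x[i] = rng.random() < mean                      # x_i ~ Bern(mean)
print(x)
\end{verbatim}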
TODO: add NADE samples figure
Finally, an autoregressive model does not directly learn unsupervised representations of the data. In the next few lectures, we will look at latent variable models (e.g., variational autoencoders) that explicitly learn latent representations of the data.
TODO: Autoregressive generative models based on Autoencoders, RNNs, and CNNs.
MADE, Char-RNN, Pixel-CNN, Wavenet