<p><em>Variational Bounds on Mutual Information — Matthew Wiesner, 2020-10-30</em></p>
<p>There has been incredible success using InfoNCE as a contrastive unsupervised pretraining objective for ASR in the past year. The objective function, originally presented in <a href="https://arxiv.org/pdf/1807.03748.pdf">Contrastive Predictive Coding (CPC)</a>, was described more theoretically in a subsequent <a href="https://arxiv.org/pdf/1905.06922.pdf">paper</a>.</p>
<p>Estimating the mutual information between two random variables is difficult. Training objectives that aim to maximize the mutual information between various quantities in sequence-to-sequence prediction tasks can be formulated by constructing lower bounds on the mutual information and maximizing them. Most of these bounds come from viewing the mutual information in terms of the KL-divergence</p>
\[I\left(X; Y\right) = D_{KL}\left(p\left(X, Y\right) || p\left(X\right)p\left(Y\right)\right)\]
<h2 id="upper-bound-on-mutual-information">Upper Bound on Mutual Information</h2>
<p>To upper bound the mutual information, we express the KL-divergence as an expectation, introduce a third distribution \(q\left(Y\right)\), and factor the joint distribution as \(p\left(X, Y\right) = p\left(Y|X\right)p\left(X\right)\).</p>
\[\begin{align}
I\left(X; Y\right) &= \mathbb{E}_{p\left(X, Y\right)}\left[\log{\frac{p\left(Y|X\right)}{p\left(Y\right)}}\right] \\
&= \mathbb{E}_{p\left(X, Y\right)}\left[\log{\frac{p\left(Y|X\right)q\left(Y\right)}{p\left(Y\right)q\left(Y\right)}}\right] \\
&= \mathbb{E}_{p\left(X, Y\right)}\left[\log{\frac{p\left(Y|X\right)}{q\left(Y\right)}}\right] - D_{KL}\left(p\left(Y\right) || q\left(Y\right)\right) \\
&\leq \mathbb{E}_{p\left(X, Y\right)}\left[\log{\frac{p\left(Y|X\right)}{q\left(Y\right)}}\right] \quad \mbox{since the KL-divergence is non-negative} \\
&= \mathbb{E}_{p\left(X\right)}\left[D_{KL}\left(p\left(Y|X\right) || q\left(Y\right)\right)\right] \\
\end{align}\]
<p>And now we have our first bound!</p>
\[I\left(X; Y\right) \leq \mathbb{E}_{p\left(X\right)}\left[D_{KL}\left(p\left(Y|X\right) || q\left(Y\right)\right)\right] = R\]
<p>This term can be thought of as a regularizer. Overfitting is reduced by forcing overconfident predictions to be smoothed toward the prior distribution.
\(R\) stands for the rate of a model and is a limit on the information about the output \(Y\) that is transmitted through the model from the input \(X\). Similarly, most regularizers can be interpreted as limiting the rate of the model. An ideal regularizer limits the rate to the true mutual information between the random variables representing the inputs and desired outputs.</p>
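<p>As a sanity check, here is a minimal numpy sketch of the rate on a toy discrete joint distribution (the joint table and all variable names are my own illustration, not from any paper): choosing \(q\left(Y\right)\) equal to the true marginal \(p\left(Y\right)\) makes the upper bound tight, while any other choice only loosens it.</p>

```python
import numpy as np

# Toy joint p(x, y) over two binary variables (illustrative values only).
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)               # p(X)
p_y = p_xy.sum(axis=0)               # p(Y)
p_y_given_x = p_xy / p_x[:, None]    # p(Y|X), one row per x

def kl(p, q):
    # D_KL(p || q) for discrete distributions.
    return np.sum(p * np.log(p / q))

# True mutual information: D_KL(p(X,Y) || p(X)p(Y)).
mi = kl(p_xy, np.outer(p_x, p_y))

def rate(q_y):
    # R = E_{p(X)}[ D_KL(p(Y|X) || q(Y)) ]
    return sum(p_x[i] * kl(p_y_given_x[i], q_y) for i in range(len(p_x)))

assert np.isclose(rate(p_y), mi)            # tight when q(Y) = p(Y)
assert rate(np.array([0.9, 0.1])) >= mi     # any other q(Y) loosens the bound
```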
<h2 id="lower-bound-on-mutual-information">Lower Bound on Mutual Information</h2>
<p>To lower bound the mutual information, we factor in the opposite direction, \(p\left(X, Y\right) = p\left(X | Y\right)p\left(Y\right)\), introduce a third distribution \(q\left(X | Y\right)\), and use the non-negativity of the KL-divergence together with the definition of the differential entropy of a random variable to arrive at our lower bound.</p>
\[\begin{align}
I\left(X; Y\right) &= \mathbb{E}_{p\left(X, Y\right)}\left[\log{\frac{p\left(X|Y\right)q\left(X | Y\right)}{p\left(X\right)q\left(X|Y\right)}}\right] \\
&= \mathbb{E}_{p\left(X, Y\right)}\left[\log{\frac{q\left(X | Y\right)}{p\left(X\right)}}\right] + \mathbb{E}_{p\left(Y\right)}\left[D_{KL}\left(p\left(X|Y\right) || q\left(X|Y\right)\right)\right] \\
&\geq \mathbb{E}_{p\left(X, Y\right)}\left[\log{\frac{q\left(X | Y\right)}{p\left(X\right)}}\right] \quad \mbox{due to the non-negativity of the KL-divergence} \\
&= \mathbb{E}_{p\left(X, Y\right)}\left[\log{q\left(X | Y\right)}\right] - \mathbb{E}_{p\left(X\right)}\left[\log{p\left(X\right)}\right] \\
&= \mathbb{E}_{p\left(X, Y\right)}\left[\log{q\left(X | Y\right)}\right] + h\left(X\right) \\
\end{align}\]
<p>And now we have our second bound!</p>
\[I\left(X; Y\right) \geq \mathbb{E}_{p\left(X, Y\right)}\left[\log{q\left(X | Y\right)}\right] + h\left(X\right) = I_{BA}\]
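<p>A quick numerical check of \(I_{BA}\) on a toy discrete joint (my own illustrative numbers, with the plain entropy standing in for the differential entropy): the bound is tight when \(q\left(X|Y\right)\) is the true posterior \(p\left(X|Y\right)\), and any other \(q\) only lowers it.</p>

```python
import numpy as np

# Toy joint p(x, y); values are illustrative only.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)
p_x_given_y = p_xy / p_y[None, :]    # true posterior p(X|Y), one column per y

mi = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))
h_x = -np.sum(p_x * np.log(p_x))     # entropy of X (discrete analogue of h(X))

def i_ba(q_x_given_y):
    # I_BA = E_{p(X,Y)}[ log q(X|Y) ] + h(X)
    return np.sum(p_xy * np.log(q_x_given_y)) + h_x

assert np.isclose(i_ba(p_x_given_y), mi)   # tight at q = p(X|Y)
q_smooth = 0.5 * p_x_given_y + 0.25        # a wrong, smoothed posterior
assert i_ba(q_smooth) <= mi                # any other q lowers the bound
```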
<h2 id="lower-bound-on-mutual-information-estimated-with-unormalized-distributions">Lower Bound on Mutual Information Estimated with Unnormalized Distributions</h2>
<p>In general, computing normalized distributions as well as the differential entropy is intractable. For this reason it is important to find bounds that use unnormalized distributions. By choosing a specific form for our unnormalized distribution and plugging it into the expression in the third-to-last step of our derivation for the lower bound on mutual information, we arrive at the following bound for unnormalized distributions.</p>
<p>Let</p>
\[q\left(X|Y\right) = \frac{p\left(X\right)e^{f\left(X, Y\right)}}{\mathbb{E}_{p\left(X\right)}\left[e^{f\left(X, Y\right)}\right]}\]
<p>We then have that</p>
\[\begin{align}
\mathbb{E}_{p\left(X, Y\right)}\left[\log{\frac{q\left(X | Y\right)}{p\left(X\right)}}\right] &= \mathbb{E}_{p\left(X, Y\right)}\left[f\left(X, Y\right)\right] - \mathbb{E}_{p\left(Y\right)}\left[\log{\mathbb{E}_{p\left(X\right)}\left[e^{f\left(X, Y\right)}\right]}\right]
\end{align}\]
<p>So finally we have our third bound!!</p>
\[I\left(X; Y\right) \geq \mathbb{E}_{p\left(X, Y\right)}\left[f\left(X, Y\right)\right] - \mathbb{E}_{p\left(Y\right)}\left[\log{\mathbb{E}_{p\left(X\right)}\left[e^{f\left(X, Y\right)}\right]}\right] = I_{UBA}\]
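<p>Because \(I_{UBA}\) comes from substituting a (normalized) \(q\) of the form above into \(I_{BA}\), it is a valid lower bound for <em>any</em> critic \(f\), and it is tight at \(f\left(x, y\right) = \log{\frac{p\left(x|y\right)}{p\left(x\right)}}\). A small exact check on a toy discrete joint (numbers and names are my own illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)
mi = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))

def i_uba(f):
    # Exact expectations on the toy joint; f is a 2x2 table of critic scores.
    first = np.sum(p_xy * f)                # E_{p(X,Y)}[ f(X, Y) ]
    inner = p_x @ np.exp(f)                 # E_{p(X)}[ e^f ], one value per y
    second = np.sum(p_y * np.log(inner))    # E_{p(Y)}[ log E_{p(X)}[ e^f ] ]
    return first - second

f_opt = np.log(p_xy / np.outer(p_x, p_y))   # f = log p(x|y)/p(x)
assert np.isclose(i_uba(f_opt), mi)         # tight at the optimal critic
for _ in range(5):
    # Any critic at all still gives a valid lower bound.
    assert i_uba(rng.normal(size=(2, 2))) <= mi + 1e-12
```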
<h2 id="donsker-varadhan-bound-on-mutual-information">Donsker-Varadhan Bound on Mutual Information</h2>
<p>The Donsker-Varadhan lower bound on mutual information is a well-known bound that can be recovered by applying Jensen’s Inequality to the outer expectation in the second term of the bound.</p>
\[I\left(X; Y\right) \geq \mathbb{E}_{p\left(X, Y\right)}\left[f\left(X, Y\right)\right] - \mathbb{E}_{p\left(Y\right)}\left[\log{\mathbb{E}_{p\left(X\right)}\left[e^{f\left(X, Y\right)}\right]}\right]\]
\[\begin{align}
\mathbb{E}_{p\left(Y\right)}\left[\log{\mathbb{E}_{p\left(X\right)}\left[e^{f\left(X, Y\right)}\right]}\right] &\leq \log{\mathbb{E}_{p\left(Y\right)}\left[\mathbb{E}_{p\left(X\right)}\left[e^{f\left(X, Y\right)}\right]\right]} \\
&\implies \mathbb{E}_{p\left(X, Y\right)}\left[f\left(X, Y\right)\right] - \mathbb{E}_{p\left(Y\right)}\left[\log{\mathbb{E}_{p\left(X\right)}\left[e^{f\left(X, Y\right)}\right]}\right] \geq \mathbb{E}_{p\left(X, Y\right)}\left[f\left(X, Y\right)\right] - \log{\mathbb{E}_{p\left(Y\right)}\left[\mathbb{E}_{p\left(X\right)}\left[e^{f\left(X, Y\right)}\right]\right]} \\
&\implies I\left(X; Y\right) \geq \mathbb{E}_{p\left(X, Y\right)}\left[f\left(X, Y\right)\right] - \log{\mathbb{E}_{p\left(Y\right)}\left[\mathbb{E}_{p\left(X\right)}\left[e^{f\left(X, Y\right)}\right]\right]}
\end{align}\]
<p>This is our fourth bound!!</p>
\[I\left(X; Y\right) \geq \mathbb{E}_{p\left(X, Y\right)}\left[f\left(X, Y\right)\right] - \log{\mathbb{E}_{p\left(Y\right)}\left[\mathbb{E}_{p\left(X\right)}\left[e^{f\left(X, Y\right)}\right]\right]} = I_{DV}\]
<p>In summary, we have derived an upper bound and three lower bounds (or estimators) on the mutual information. Their relationship is as follows.</p>
\[R \geq I\left(X; Y\right) \geq I_{BA} \geq I_{UBA} \geq I_{DV}\]
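<p>For a fixed critic \(f\) the last link of this chain can be verified numerically, since \(I_{DV}\) differs from \(I_{UBA}\) only by moving the expectation over \(p\left(Y\right)\) inside the logarithm. A toy exact computation (the joint table is my own illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)
mi = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))

def bounds(f):
    first = np.sum(p_xy * f)        # E_{p(X,Y)}[f], shared by both bounds
    inner = p_x @ np.exp(f)         # E_{p(X)}[ e^f ] as a function of y
    i_uba = first - np.sum(p_y * np.log(inner))    # log inside E_{p(Y)}
    i_dv = first - np.log(np.sum(p_y * inner))     # log outside E_{p(Y)}
    return i_uba, i_dv

for _ in range(10):
    i_uba, i_dv = bounds(rng.normal(size=(2, 2)))
    # For any critic f: I(X;Y) >= I_UBA(f) >= I_DV(f), by Jensen.
    assert mi + 1e-12 >= i_uba >= i_dv - 1e-12
```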
<h2 id="the-mine-estimator-for-mutual-information">The <a href="https://arxiv.org/pdf/1801.04062.pdf">MINE</a> Estimator for Mutual Information</h2>
<p>MINE is an estimator for the mutual information parameterized by a neural network that is almost identical to InfoNCE. It uses the \(I_{DV}\) bound from above and replaces the expectations with Monte-Carlo Estimates from minibatches of data.</p>
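<p>A minimal numpy sketch of that Monte-Carlo recipe, using a hand-picked quadratic critic on correlated Gaussians rather than a trained network (the critic, constants, and names here are my own illustration): paired samples approximate \(p\left(X, Y\right)\), and shuffling one side of the batch approximates \(p\left(X\right)p\left(Y\right)\).</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def mine_estimate(f, x, y):
    # I_DV with expectations replaced by minibatch averages: paired samples
    # stand in for p(X, Y); shuffling y breaks the pairing, giving p(X)p(Y).
    joint_term = f(x, y).mean()
    marginal_term = np.log(np.exp(f(x, rng.permutation(y))).mean())
    return joint_term - marginal_term

# Correlated Gaussians with known MI = -0.5 * log(1 - rho^2).
n, rho = 50_000, 0.8
x = rng.normal(size=n)
y = rho * x + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
true_mi = -0.5 * np.log(1 - rho ** 2)

critic = lambda a, b: 0.2 * a * b     # a fixed, deliberately sub-optimal critic
est = mine_estimate(critic, x, y)     # a (noisy) lower-bound estimate of true_mi
```

With a sub-optimal critic the estimate sits strictly below the true value, which is exactly the behavior the bound predicts.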
<h2 id="maximum-mutual-information-and-pseudo-labeling">Maximum Mutual Information and Pseudo-Labeling</h2>
<p>In a <a href="https://m-wiesner.github.io/LF-MMI">previous post</a> I gave a somewhat wrong explanation of why the MMI objective function actually did maximize the mutual information between random variables. Here is a better explanation for the particular case where we are working with an un-normalized neural estimator \(f\left(X, Y\right)\).</p>
\[\begin{align}
I\left(X; Y\right) &\geq I_{UBA} \\
&= \mathbb{E}_{p\left(X, Y\right)}\left[f\left(X, Y\right)\right] - \mathbb{E}_{p\left(X\right)}\left[\log{\mathbb{E}_{p\left(Y\right)}\left[e^{f\left(X, Y\right)}\right]}\right] \\
&= \mathbb{E}_{p\left(X, Y\right)}\left[\log{\frac{e^{f\left(X, Y\right)}}{\mathbb{E}_{p\left(Y\right)}\left[e^{f\left(X, Y\right)}\right]}}\right] \\
\end{align}\]
<p>This is exactly the MMI objective with the \(\log{p\left(Y\right)}\) term in the numerator removed. Since \(p\left(Y\right)\) is fixed and we do not optimize with respect to it, optimizing either objective is clearly the same as optimizing a lower bound on the mutual information.</p>
<p>We can understand pseudo-labeling as factoring the expectation in the first term in terms of the posterior and the marginal data likelihood. In this view, we first sample unlabeled data, estimate a posterior distribution over output sequences (by producing a hypothesis lattice, for instance), and then use this lattice to marginalize and compute the expectation over the posterior.</p>
<h2 id="tractable-lower-bound">Tractable Lower Bound</h2>
<p>An easier-to-compute lower bound comes from the identity \(\log{x} \leq \frac{x}{a} + \log{a} - 1\), which holds for any \(a > 0\) and is tight at \(x = a\). Therefore</p>
\[\begin{align}
\log{\mathbb{E}_{p\left(X\right)}\left[e^{f\left(X, Y\right)}\right]} &\leq \frac{\mathbb{E}_{p\left(X\right)}\left[e^{f\left(X, Y\right)}\right]}{a\left(Y\right)} + \log{a\left(Y\right)} - 1 \\
&\implies I_{UBA} \geq \mathbb{E}_{p\left(X, Y\right)}\left[f\left(X, Y\right)\right] - \mathbb{E}_{p\left(Y\right)}\left[\frac{\mathbb{E}_{p\left(X\right)}\left[e^{f\left(X, Y\right)}\right]}{a\left(Y\right)} + \log{a\left(Y\right)} - 1\right] = I_{TUBA} \\
\end{align}\]
<p>Since this relationship holds true for all values of \(a\left(Y\right)\), we can set \(a\left(Y\right) = e \ \forall \ Y\) at the expense of having a slightly looser bound on the mutual information. This gives us the bound</p>
\[\begin{align}
I\left(X; Y\right) &\geq I_{TUBA} \\
&\geq \mathbb{E}_{p\left(X, Y\right)} \left[ f\left(X, Y\right) \right] - \mathbb{E}_{p\left(Y\right)} \left[e^{-1} \mathbb{E}_{p\left(X\right)} \left[e^{f\left(X, Y\right)}\right]\right] \\
&= I_{NJW} \\
\end{align}\]
<p>This is the bound used in the f-MINE variant of the MINE objective. When \(a\left(Y\right)\) is instead estimated by an exponential moving average, this corresponds to the heuristic used in MINE to reduce the bias of the MINE gradient.</p>
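<p>Since \(I_{NJW}\) avoids the inner \(\log\), it is cheap to evaluate. An exact toy check (my own illustrative joint) that the bound is tight at the critic \(f = 1 + \log{\frac{p\left(x|y\right)}{p\left(x\right)}}\) and remains a valid lower bound for arbitrary critics:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)
mi = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))

def i_njw(f):
    # I_NJW = E_{p(X,Y)}[f] - e^{-1} E_{p(Y)}[ E_{p(X)}[ e^f ] ]
    return np.sum(p_xy * f) - np.exp(-1.0) * np.sum(np.outer(p_x, p_y) * np.exp(f))

f_opt = 1.0 + np.log(p_xy / np.outer(p_x, p_y))   # f = 1 + log p(x|y)/p(x)
assert np.isclose(i_njw(f_opt), mi)               # tight at this critic
for _ in range(5):
    # Still a valid lower bound for any critic.
    assert i_njw(rng.normal(size=(2, 2))) <= mi + 1e-12
```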
<p>To summarize all of the bounds we’ve seen again</p>
\[R \geq I\left(X; Y\right) \geq I_{BA} \geq I_{UBA} = F_{MMI} - \mathbb{E}_{p\left(Y\right)}\left[\log{p\left(Y\right)}\right] = F_{MMI} + H\left(Y\right) \geq I_{TUBA} \geq I_{NJW}\]
<h2 id="infonce">InfoNCE</h2>
<p>The main insight of InfoNCE is that we can use independent samples from some other distribution to decrease the variance of our estimate of the mutual information.</p>
\[I\left(X, Z; Y\right) = \mathbb{E}_{p\left(Z\right)}\left[I\left(X; Y\right)\right] = I\left(X; Y\right)\]
<p>In InfoNCE the RV \(Z=\{X_2^{\prime} \ldots X_K^{\prime}\}\) consists of \(K-1\) samples from some other distribution over \(X\), which we treat as negative examples of \(X\). In this literature the neural network is often called the critic, as it is tasked with comparing inputs \(X\) to outputs \(Y\). Setting the critic to</p>
\[\begin{align}
f^{\prime}\left(X, Y\right) &= 1 + \log{\frac{e^{f\left(X, Y\right)}}{a\left(Y; X, Z\right)}} \\
&\implies I_{TUBA} = 1 + \mathbb{E}_{p\left(X,Y\right)}\left[\log{\frac{e^{f\left(X, Y\right)}}{a\left(Y; X, Z\right)}}\right] - \mathbb{E}_{p\left(Y\right)p\left(X\right)}\left[\frac{e^{f\left(X, Y\right)}}{a\left(Y; X, Z\right)}\right] \\
&= 1 + \mathbb{E}_{p\left(X,Y\right)p\left(Z\right)}\left[\log{\frac{e^{f\left(X, Y\right)}}{a\left(Y; X, Z\right)}}\right] - \mathbb{E}_{p\left(Y\right)p\left(X\right)p\left(Z\right)}\left[\frac{e^{f\left(X, Y\right)}}{a\left(Y; X, Z\right)}\right]
\end{align}\]
<p>Note that the optimal critic for \(I_{UBA}\) is
\(f\left(X, Y\right) = 1 + \log{\frac{p\left(Y|X\right)}{p\left(Y\right)}}\)</p>
<p>So we are simply replacing
\(\frac{p\left(Y|X\right)}{p\left(Y\right)} \to \frac{e^{f\left(X, Y\right)}}{a\left(Y\right)}\)</p>
<p>where \(e^{f\left(X, Y\right)}\) and \(a\left(Y\right)\) are learned; if trained to convergence we should recover the optimal critic. Optimally, \(a\left(Y\right)\) should be the partition function for \(Y\).</p>
<p>There are two final steps to get the InfoNCE objective. The inner two expectations of the last term of the above expression for \(I_{TUBA}\) can be rewritten as</p>
\[\begin{align}
\mathbb{E}_{p\left(Y\right)p\left(X\right)p\left(Z\right)}\left[\frac{e^{f\left(X, Y\right)}}{a\left(Y; X, Z\right)}\right] &= \mathbb{E}_{p\left(Y\right)}\left[\frac{1}{K}\sum_{i=1}^K \mathbb{E}_{p\left(X\right)p\left(Z\right)}\left[\frac{e^{f\left(X, Y\right)}}{a\left(Y; X, Z\right)}\right]\right] \\
\end{align}\]
<p>In other words, we can just rewrite the expectation in terms \(K\) replicas of the expectation. How do we get \(K\) replicas of the data? Since \(Z\) are other input examples drawn <em>independently</em>, each example can also be considered as a draw from \(X\). So we can simply swap one of the \(K-1\) examples in \(Z\) with the original example from \(p\left(X\right)\). Since there are \(K-1\) examples in \(Z\) we can repeat this swapping procedure \(K-1\) times, which in addition to using the original value \(X\) drawn from \(p\left(X\right)\) gives us \(K\) replicas. In expectation this sum will be the same as the sum of the expectations.</p>
<p>The second step is to use \(Z\) to form a Monte-Carlo approximation for \(a\left(Y; X, Z\right)\), which can also be viewed as approximating the partition function.</p>
<p>\(a\left(Y; X, Z\right) = \frac{1}{K}\left(e^{f\left(X, Y\right)} + \sum_{i=2}^{K} e^{f\left(Z_i, Y\right)}\right)\).</p>
<p>Therefore …</p>
\[\begin{align}
\mathbb{E}_{p\left(Y\right)}\left[\frac{1}{K}\sum_{i=1}^K \mathbb{E}_{p\left(X\right)p\left(Z\right)}\left[\frac{e^{f\left(X, Y\right)}}{a\left(Y; X, Z\right)}\right]\right] &= \mathbb{E}_{p\left(Y\right)p\left(X\right)p\left(Z\right)}\left[\frac{1}{K} \sum_{i=1}^K \frac{e^{f\left(X_i, Y\right)}}{a\left(Y; X, Z\right)}\right] \\
&= \mathbb{E}_{p\left(Y\right)p\left(X\right)p\left(Z\right)}\left[\frac{\frac{1}{K}\left(e^{f\left(X, Y\right)} + \sum_{i=2}^K e^{f\left(Z_i, Y\right)}\right)}{a\left(Y; X, Z\right)}\right] \\
&= \mathbb{E}_{p\left(Y\right)p\left(X\right)p\left(Z\right)}\left[\frac{\frac{1}{K}\left(e^{f\left(X, Y\right)} + \sum_{i=2}^K e^{f\left(Z_i, Y\right)}\right)}{\frac{1}{K}\left(e^{f\left(X, Y\right)} + \sum_{i=2}^{K} e^{f\left(Z_i, Y\right)}\right)}\right] \\
&= \mathbb{E}_{p\left(Y\right)p\left(X\right)p\left(Z\right)}\left[ 1 \right] = 1 \\
&\implies I_{TUBA} = I_{NCE} = \mathbb{E}_{p\left(X, Y\right)p\left(Z\right)}\left[\log{\frac{e^{f\left(X, Y\right)}}{\frac{1}{K}\left(e^{f\left(X, Y\right)} + \sum_{i=2}^{K} e^{f\left(Z_i, Y\right)}\right)}}\right] \\
\end{align}\]
<p>Since the second term in the bound is now a constant \(1\) it cancels with the 1 in \(I_{TUBA}\) and only the first expectation remains.</p>
<p>Also note that this lower bound is itself upper bounded by \(\log{K}\).</p>
\[I_{NCE} = \log{K} + \mathbb{E}_{p\left(X, Y\right)p\left(Z\right)}\left[\log{\frac{e^{f\left(X, Y\right)}}{e^{f\left(X, Y\right)} + \sum_{i=2}^{K} e^{f\left(Z_i, Y\right)}}}\right]\]
<p>And \(\mathbb{E}_{p\left(X, Y\right)p\left(Z\right)}\left[\log{\frac{e^{f\left(X, Y\right)}}{e^{f\left(X, Y\right)} + \sum_{i=2}^{K} e^{f\left(Z_i, Y\right)}}}\right]\) is guaranteed to be non-positive, since the denominator is a sum of non-negative values that includes the numerator. Therefore, the largest value this term can take is \(0\), leaving …</p>
<p>\(I_{NCE} \leq \log{K}\).</p>
<p>Therefore, if \(I\left(X; Y\right) \geq \log{K}\) this estimator will drastically underestimate the mutual information. It also shows that it is critical to use a large number of negative samples to accurately estimate the mutual information.</p>
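<p>The \(\log{K}\) ceiling is easy to see numerically. Below is a minimal batched InfoNCE sketch (the critic, data, and batch size are my own illustration, not from the CPC paper): even with near-perfectly dependent \(X, Y\) and a strong critic, the estimate never exceeds \(\log{K}\).</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def infonce(f, x, y):
    # One positive pair per row; the other K-1 batch elements act as the
    # negatives Z, and the positive itself is included in the denominator.
    K = len(x)
    scores = f(x[None, :], y[:, None])           # scores[i, j] = f(x_j, y_i)
    log_mean = np.logaddexp.reduce(scores, axis=1) - np.log(K)
    return (np.diag(scores) - log_mean).mean()   # E[ log e^f / ((1/K) sum e^f) ]

K = 512
x = rng.normal(size=K)
y = x + 0.01 * rng.normal(size=K)                # y is a near-copy of x
est = infonce(lambda a, b: -10.0 * (a - b) ** 2, x, y)
assert 0.0 < est <= np.log(K)                    # capped at log K, about 6.24 nats
```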
<p>Approximating the expectation over \(p\left(X, Y\right)p\left(Z\right)\) can be handled in many ways. In the original CPC paper, \(X\) and \(Y\) are particular values called \(z_{t+k}, c_t\), which correspond to learned latent, local encodings of speech frames, and a global context vector learned over these encodings. They are both deterministic functions of the <em>same</em> input \(X=\{x_1, x_2, \ldots, x_N \}\). The expectation is then approximated with the Monte-Carlo estimate using the neural network outputs corresponding to a minibatch of inputs.</p>
<h2 id="alternative-factorizations-of-the-expecation">Alternative Factorizations of the Expectation</h2>
<p>We could factor the expectation in multiple ways. Let</p>
\[\begin{align}
I_{NCE} &= \mathbb{E}_{p\left(X, Y\right)p\left(Z\right)}\left[\log{\frac{e^{f\left(X, Y\right)}}{\frac{1}{K}\left(e^{f\left(X, Y\right)} + \sum_{i=2}^{K} e^{f\left(Z_i, Y\right)}\right)}}\right] \\
&= \mathbb{E}_{p\left(X, Y\right)p\left(Z\right)}\left[L_{NCE}\right] \\
&= \mathbb{E}_{p\left(Z\right)p\left(X\right)} \left[\sum_Y p\left(Y|X\right) L_{NCE}\right] \\
&= \mathbb{E}_{p\left(Z\right)p\left(X\right)} \left[\sum_Y p\left(Y|X\right) \log{\frac{e^{f\left(X, Y\right)}}{\frac{1}{K}\left(e^{f\left(X, Y\right)} + \sum_{i=2}^{K} e^{f\left(Z_i, Y\right)}\right)}} \right] \\
&= \mathbb{E}_{p\left(Z\right)p\left(X\right)} \left[\sum_Y p\left(Y|X\right) \left(f\left(X, Y\right) - \log{\frac{1}{K}\left(e^{f\left(X, Y\right)} + \sum_{i=2}^{K} e^{f\left(Z_i, Y\right)} \right)} \right) \right] \\
&= \mathbb{E}_{p\left(Z\right)p\left(X\right)} \left[\sum_Y p\left(Y|X\right) \left( f\left(X, Y\right) - f^{*} \right)\right] \\
&= \mathbb{E}_{p\left(Z\right)p\left(X\right)} \left[\sum_Y \frac{p\left(Y\right)e^{f\left(X, Y\right)}}{\mathbb{E}_{p\left(Y\right)}\left[e^{f\left(X, Y\right)}\right]} \left( f\left(X, Y\right) - f^{*} \right)\right] \\
&= \mathbb{E}_{p\left(Z\right)p\left(X\right)} \left[\frac{\sum_Y p\left(Y\right)e^{f\left(X, Y\right)}\left( f\left(X, Y\right) - f^{*} \right)}{\mathbb{E}_{p\left(Y\right)}\left[e^{f\left(X, Y\right)}\right]}\right] \\
\end{align}\]
<p>What if we did have labeled data and did not have to marginalize over all possible outputs \(Y\)? Then the above equation becomes</p>
\[\begin{align}
I_{NCE} &= \mathbb{E}_{p\left(Z\right)p\left(X\right)} \left[\frac{p\left(Y\right)e^{f\left(X, Y\right)}\left( f\left(X, Y\right) - f^{*} \right)}{\mathbb{E}_{p\left(Y\right)}\left[e^{f\left(X, Y\right)}\right]}\right]
\end{align}\]
<p>This is exactly the MMI objective scaled by the term \(\left( f\left(X, Y\right) - f^{*} \right)\). We therefore see that under this objective function, a good \(f\left(X, Y\right)\) is one that learns to discriminate between the correct <em>output</em> and competing outputs, as well as ensuring that different <em>inputs</em> result in different outputs.</p>
<h2 id="tractable-alternative-factorization">Tractable Alternative Factorization</h2>
<p>The above alternative factorization is unfortunately completely intractable. To solve this we can bound it one more time using Jensen’s Inequality. Also we will assume that the \(K-1\) draws from \(p\left(Z\right)\) will be the \(K-1\) other examples in a minibatch. We call the whole minibatch</p>
\[X = \{X, Z_2, \ldots, Z_K \} = \{X_1, X_2, \ldots, X_K\}\]
\[\begin{align}
\sum_Y p\left(Y | X \right) \left(f\left(X, Y\right) - f^{*}\right) &= \sum_Y p\left(Y | X \right) f\left(X, Y\right) - \sum_Y p\left(Y | X\right) \log{\sum_{i=1}^K e^{f\left(X_i, Y\right)}} \\
&\geq \sum_Y p\left(Y | X \right) f\left(X, Y\right) - \log{\sum_Y p\left(Y | X\right) \sum_{i=1}^K e^{f\left(X_i, Y\right)}} \\
&= \sum_Y p\left(Y | X \right) f\left(X, Y\right) - \log{\sum_{i=1}^K \sum_Y p\left(Y | X\right) e^{f\left(X_i, Y\right)}} \\
\end{align}\]
<p>So finally we have</p>
\[\begin{align}
I_{NCE} = \mathbb{E}_{p\left(Z\right)p\left(X\right)} \left[ \sum_Y p\left(Y | X \right) f\left(X, Y\right) - \log{\sum_{i=1}^K \sum_Y p\left(Y | X\right) e^{f\left(X_i, Y\right)}} \right] &= \mathbb{E}_{p\left(X\right)} \left[ \sum_Y p\left(Y | X \right) f\left(X, Y\right) \right] - \mathbb{E}_{p\left(Z\right)p\left(X\right)}\left[\log{\sum_{i=1}^K \sum_Y p\left(Y | X\right) e^{f\left(X_i, Y\right)}}\right] \\
\end{align}\]
<h2 id="gradient-of-the-alternative-factorization">Gradient of the alternative factorization</h2>
<p>The above objective leaves us with a catch-22. We are trying to estimate a posterior distribution, but doing so requires an estimate of it. One potential solution is to hold the posterior distribution fixed when updating the model with unlabeled data. Also assume that our expectations are over minibatches \(\mathcal{B} = \{X_1, \ldots, X_K\}\) and \(X\) is simply the first element in the minibatch, \(X_1\). We will also assume a neural network \(\phi\) parameterizing \(f\left(\cdot\right)\), and that inputs are of length \(T\).</p>
\[f\left(X, Y\right) = \sum_{t=1}^T \phi\left(X\right)_{Y_t}^t\]
<p>and that marginalization over output sequences is handled by representing the space of sequences with a WFST \(G\). We denote the forward and backward scores at state \(s\) and time \(\tau\) over this graph respectively as</p>
\[\alpha\left(s, \tau\right), \beta\left(s, \tau\right)\]
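<p>Before bringing in the lattice machinery, note that for a single label sequence the critic score is just a sum of picked-out network outputs. A tiny numpy sketch (the shapes and the random \(\phi\) are purely illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

T, V = 6, 4                        # T frames, V output symbols (toy sizes)
phi = rng.normal(size=(T, V))      # phi(X): the network's per-frame scores
Y = np.array([0, 2, 2, 1, 3, 0])   # one label sequence Y_1 ... Y_T

# f(X, Y) = sum_t phi(X)_{Y_t}^t : pick out the score of label Y_t at frame t.
f_xy = phi[np.arange(T), Y].sum()
```

Marginalizing this score over all sequences \(Y\) accepted by a graph is what the forward-backward recursions below compute.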
<p>In this case the gradient becomes …</p>
\[\begin{align}
\frac{\partial I_{NCE}}{\partial y_s^{\tau}\left(j\right)} &= \mathbb{E}_{\mathcal{B}}\left[\sum_{Y} p\left(Y | X_1\right) \mathbb{1}\left(Y_{\tau}, s\right) \mathbb{1}\left(j, 1\right) - \frac{\sum_Y p\left(Y | X_1\right) e^{f\left(X_j, Y\right)} \mathbb{1}\left(Y_{\tau}, s\right)}{\sum_{i=1}^K \sum_Y p\left(Y|X_1\right)e^{f\left(X_i, Y\right)}}\right] \\
&= \mathbb{E}_{\mathcal{B}}\left[ \gamma_{X_1}\left(s, \tau\right)\mathbb{1}\left(j, 1\right) - \frac{\alpha_{X_{1,j}}\left(s, \tau\right)\beta_{X_{1, j}}\left(s, \tau\right)}{\sum_{i=1}^K \sum_{\sigma} \alpha_{X_{1, i}}\left(\sigma, \tau\right)\beta_{X_{1,i}}\left(\sigma, \tau\right)} \right]\\
\end{align}\]
<p>Since in a single minibatch we have \(K\) examples of speech, and \(K-1\) negative samples, we can use a single example to generate \(K\) unique minibatches where instead of using \(X_1\) as \(X\) we use \(X_i\).</p>
<p>The forward score through the lattice generated by inputs \(X_k, X_j\) is</p>
\[E(k, j) = [\![\left(\phi\left(X_k\right) + \phi\left(X_j\right)\right) \circ G]\!]\]
<p>The loss function for the minibatch of data then becomes</p>
\[\begin{align}
\frac{\partial I_{NCE}}{\partial y_s^{\tau}\left(j\right)} &= \frac{1}{K} \sum_{k=1}^K \gamma_{X_k}\left(s, \tau\right)\mathbb{1}\left(j, k\right) - \frac{\alpha_{X_{k,j}}\left(s, \tau\right)\beta_{X_{k, j}}\left(s, \tau\right)}{\sum_{i=1}^K \sum_{\sigma} \alpha_{X_{k, i}}\left(\sigma, \tau\right)\beta_{X_{k,i}}\left(\sigma, \tau\right)}\\
&= \frac{1}{K} \left[\gamma_{X_j}\left(s, \tau\right) - \sum_{k=1}^K \frac{\alpha_{X_{k,j}}\left(s, \tau\right)\beta_{X_{k, j}}\left(s, \tau\right)}{\sum_{i=1}^K \sum_{\sigma} \alpha_{X_{k, i}}\left(\sigma, \tau\right)\beta_{X_{k,i}}\left(\sigma, \tau\right)} \right]\\
&= \frac{1}{K} \left[\gamma_{X_j}\left(s, \tau\right) - \sum_{k=1}^K \gamma_{X_{k, j}}\left(s, \tau\right) \frac{e^{E\left(k, j\right)}}{\sum_{i=1}^K e^{E\left(k, i\right)}} \right] \\
\end{align}\]
<h2 id="updating-py--x">Updating p(Y | X)</h2>
<p>The problem with the above update is that as mentioned before, the optimal critic is</p>
\[f\left(X, Y\right) = \log{\frac{p\left(Y | X \right)}{p\left(Y\right)}}\]
<p>When we evaluate over the wrong distribution
\(q\left(Y | X\right) \neq p\left(Y | X\right)\)
we are only able to train the network to perform as well as the original posterior we supplied. To increase the mutual information, we also have to be able to update our model of the posterior distribution. Unfortunately, we run into some computation that, as far as I can tell, is intractable. Notably</p>
\[\sum_Y p\left(Y | X\right) f\left(X, Y\right)\]
<p>is intractable because of the form of \(f\left(X, Y\right)\). If this were a simple classification task, then we could probably evaluate this quantity; in sequence tasks, however, evaluating \(f\left(X, Y\right)\) for all possible sequences \(Y\) is not feasible. Nonetheless, we take the gradient of this term, this time holding \(f\left(X, Y\right)\) fixed. For the purpose of taking the gradient, I will use a specific functional form for the posterior</p>
\[p\left(Y | X\right) = \frac{p\left(Y\right)e^{f\left(X, Y\right)}}{\mathbb{E}_{p\left(Y\right)}\left[e^{f\left(X, Y\right)}\right]}\]
\[\begin{align}
\frac{\partial p\left(Y | X \right)}{\partial y_s^{\tau}\left(j\right)} &= \frac{\partial}{\partial y_s^{\tau}\left(j\right)} p\left(Y\right) e^{f\left(X, Y\right)}\left(\sum_Y p\left(Y\right)e^{f\left(X, Y\right)}\right)^{-1} \\
&= p\left(Y\right)e^{f\left(X, Y\right)} \frac{\partial}{\partial y_s^{\tau}\left(j\right)} f\left(X, Y\right)\left(\sum_Y p\left(Y\right)e^{f\left(X, Y\right)}\right)^{-1} - p\left(Y\right)e^{f\left(X, Y\right)} \frac{\sum_Y p\left(Y\right)e^{f\left(X, Y\right)}\frac{\partial f\left(X, Y\right)}{\partial y_s^{\tau}\left(j\right)}}{\left(\sum_Y p\left(Y\right)e^{f\left(X, Y\right)}\right)^2} \\
&= p\left(Y | X\right) \left(\mathbb{1}\left(Y_{\tau}, s\right) - \gamma_{X}\left(s, \tau \right)\right) \\
&\implies \sum_Y \frac{\partial p\left(Y | X \right)}{\partial y_s^{\tau}\left(j\right)} f\left(X, Y\right) = \mathbb{1}\left(j, 1\right)\sum_Y p\left(Y | X\right) f\left(X, Y\right)\left(\mathbb{1}\left(Y_{\tau}, s\right) - \gamma_{X}\left(s, \tau \right)\right) \\
\end{align}\]
<p>Having worked out the gradient of the posterior we can easily get the gradient of the second term in the objective function.</p>
\[\begin{align}
\frac{\partial}{\partial y_s^{\tau}\left(j\right)} \mathbb{E}_{\mathcal{B}}\left[\log{\sum_{i=1}^K \sum_Y p\left(Y | X_1\right) e^{f\left(X_i, Y\right)}} \right] &= \mathbb{E}_{\mathcal{B}}\left[\frac{1}{\sum_{i=1}^K \sum_Y p\left(Y | X_1\right) e^{f\left(X_i, Y\right)}} \sum_Y p\left(Y | X_1\right) e^{f\left(X_1, Y\right)}\left(\mathbb{1}\left(Y_{\tau}, s\right) - \gamma_{X_1}\left(s, \tau\right)\right)\right] \\
&= \mathbb{E}_{\mathcal{B}}\left[\frac{\alpha_{1, 1}\left(s, \tau\right)\beta_{1, 1}\left(s, \tau\right) - \gamma_{X_1}\left(s, \tau\right)\sum_{\sigma}\alpha_{1, 1}\left(s, \tau\right)\beta_{1,1}\left(s, \tau\right)}{\sum_{i=1}^K \sum_{\sigma} \alpha_{1, i}\left(\sigma, \tau\right)\beta_{1, i}\left(\sigma, \tau\right)}\right] \\
&= \mathbb{E}_{\mathcal{B}}\left[ \left(\gamma_{X_{1, 1}}\left(s, \tau\right) - \gamma_{X_1}\left(s, \tau\right)\right) \frac{e^{E\left(1, 1\right)}}{\sum_{i=1}^K e^{E\left(1, i\right)}}\right]
\end{align}\]
<p>You can interpret this as an acoustic confidence of an input weighted by how distinguishable it is on average (from a sample of \(K\) other examples).
Now we have to deal with the intractable(?) first term.</p>
\[\begin{align}
\sum_Y p\left(Y | X_1\right) f\left(X_1, Y\right) &= \frac{1}{Z\left(X_1\right)} \sum_Y p\left(Y\right) e^{\sum_t \phi\left(X_1\right)_{Y_t}^t} \sum_{t^{\prime}} \phi\left(X_1\right)_{Y_{t^{\prime}}}^{t^{\prime}} \\
&= \frac{1}{Z\left(X_1\right)} \sum_Y p\left(Y\right) \sum_{t^{\prime}} \phi\left(X_1\right)_{Y_{t^{\prime}}}^{t^{\prime}} e^{\sum_t \phi\left(X_1\right)_{Y_t}^t} \\
&= \frac{1}{Z\left(X_1\right)} \sum_{t^{\prime}} \sum_Y \phi\left(X_1\right)_{Y_{t^{\prime}}}^{t^{\prime}} p\left(Y\right) e^{\sum_t \phi\left(X_1\right)_{Y_t}^t} \\
&\simeq \frac{T}{N Z\left(X_1\right)} \sum_{i=1}^N \sum_Y \phi\left(X_1\right)_{Y_{t_i}}^{t_i} p\left(Y\right) e^{\sum_t \phi\left(X_1\right)_{Y_t}^t} \\
&= T \frac{Z\left(X_1\right)}{NZ\left(X_1\right)} \sum_{i=1}^N \sum_{\sigma}\gamma_{X_1}\left(\sigma, t_i\right) \phi\left(X_1\right)^{t_i}\\
&=\frac{T}{N} \sum_{i=1}^N \sum_{\sigma}\gamma_{X_1}\left(\sigma, t_i\right) \phi\left(X_1\right)^{t_i} \\
&= T \hat{Z}\left(X_1 \right)\\
&\implies \sum_Y p\left(Y | X_1\right) f\left(X_1, Y\right) \left(\mathbb{1}\left(Y_{\tau}, s\right) - \gamma_{X_1}\left(s, \tau\right) \right) \simeq \sum_Y p\left(Y | X_1\right) f\left(X_1, Y\right) \mathbb{1}\left(Y_{\tau}, s\right) - \gamma_{X_1}\left(s, \tau \right) T\hat{Z}\left(X_1\right) \\
\end{align}\]
<h2 id="alternative-approximation">Alternative Approximation</h2>
<p>This is clearly not necessarily a lower bound any more, but it makes computation easy. I just used Jensen’s Inequality on the numerator and denominator separately after splitting the log fraction up. The second term is lower than it should be, but the first term is greater than it should be.</p>
\[\begin{align}
I_{NCE} &= \mathbb{E}_{p\left(X, Y\right)p\left(Z\right)}\left[\log{\frac{e^{f\left(X, Y\right)}}{\sum_{i=1}^{K} e^{f\left(X_i, Y\right)}}}\right] \\
&= \mathbb{E}_{p\left(X, Y\right)p\left(Z\right)}\left[\sum_Y p\left(Y | X\right)\log{\frac{e^{f\left(X, Y\right)}}{\sum_{i=1}^{K} e^{f\left(X_i, Y\right)}}}\right] \\
&\simeq \mathbb{E}_{p\left(X\right)p\left(Z\right)}\left[\log{\sum_Y p\left(Y | X\right)e^{f\left(X, Y\right)}} - \log{\sum_{i=1}^{K} \sum_Y p\left(Y | X\right)e^{f\left(X_i, Y\right)}}\right] \\
&= \mathbb{E}_{p\left(X\right)p\left(Z\right)}\left[\log{\frac{\sum_Y p\left(Y | X\right)e^{f\left(X, Y\right)}}{\sum_{i=1}^{K} \sum_Y p\left(Y | X\right)e^{f\left(X_i, Y\right)}}}\right] \\
&= \mathbb{E}_{p\left(X\right)p\left(Z\right)}\left[[\![\left(\phi_X + \phi_X\right) \circ G]\!] - \mbox{logsumexp}_i\left([\![\left(\phi_X + \phi_{X_i}\right) \circ G]\!]\right )\right] \\
\end{align}\]
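<p>To make the generic \(I_{NCE}\) estimator from the first line above concrete, here is a minimal numpy sketch for the standard (non-structured) case, in which the critic scores \(f\left(x_i, y_j\right)\) for a batch of \(K\) paired samples are collected in a matrix. The score-matrix layout and the added \(\log K\) offset (which makes the bound saturate at \(\log K\)) are illustrative assumptions, not part of the derivation above.</p>

```python
import numpy as np

def info_nce(scores):
    """InfoNCE bound estimated from a K x K matrix of critic scores.

    scores[i, j] = f(x_i, y_j); diagonal entries are the positive pairs,
    and each column's off-diagonal entries serve as its negatives.
    """
    K = scores.shape[0]
    # log e^{f(x_i, y_i)} - log sum_j e^{f(x_j, y_i)}, computed stably per column
    log_ratios = np.diag(scores) - np.logaddexp.reduce(scores, axis=0)
    # Adding log K gives the usual lower bound on I(X; Y), capped at log K
    return np.log(K) + log_ratios.mean()

# A critic that cleanly separates positives from negatives pushes the bound toward log K
print(info_nce(10.0 * np.eye(4)))    # close to log(4)
# An uninformative critic gives a bound of approximately 0
print(info_nce(np.zeros((4, 4))))
```

<p>Note the estimator can never exceed \(\log K\) no matter how good the critic is, which is the saturation problem discussed in the variational-bounds paper cited above.</p>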
<!--
$$\begin{align}
&\implies \frac{\partial}{\partial y_s^{\tau}\left(j\right)} I_{MCE} \simeq \mathbb{E}_{\mathcal{B}}\left[ \mathbb{1}\left(1, j\right)\left(\left(\hat{\gamma}_{X_1}\left(s, \tau\right) - \gamma_{X_1}\left(s, \tau \right)\right) \frac{T\hat{Z}\left(X_1\right)}{Z\left(X_1\right)}\right) - \left(\gamma_{X_{1, 1}}\left(s, \tau\right) - \gamma_{X_1}\left(s, \tau\right)\right) \frac{e^{E\left(1, 1\right)}}{\sum_{i=1}^K e^{E\left(1, i\right)}} + \mathbb{1}\left(1, j\right)\gamma_{X_1}\left(s, \tau\right) - \gamma_{X_{1, j}}\left(s, \tau\right) \frac{e^{E\left(1, j\right)}}{\sum_{i=1}^K e^{E\left(1, i\right)}} \right]\\
&\simeq \frac{1}{K} \left[ \hat{\gamma}_{X_j}\left(s, \tau\right) + \gamma_{X_j}\left(s, \tau \right) \left(1 - \frac{T\hat{Z}\left(X_j\right)}{Z\left(X_j\right)}\right) - \sum_{k=1}^K \left[\left(\gamma_{X_k}\left(s, \tau\right) - \gamma_{X_{k, k}}\left(s, \tau\right)\right) \frac{e^{E\left(k, k\right)}}{\sum_{i=1}^K e^{E\left(k, i\right)}} + \gamma_{X_{k, j}}\left(s, \tau\right) \frac{e^{E\left(k, j\right)}}{\sum_{i=1}^K e^{E\left(k, i\right)}}\right]\right] \\
\end{align}$$
-->
<!--
&\simeq \frac{1}{K}\left[ \left(\hat{\gamma}_{X_j}\left(s, \tau\right) - \gamma_{X_j}\left(s, \tau \right) \frac{\hat{Z}\left(X_j\right)}{Z\left(X_j\right)}\right) - \sum_{k=1}^{K}\left(\gamma_{X_{k, j}}\left(s, \tau\right) - \gamma_{X_k}\left(s, \tau\right)\right) \frac{e^{E\left(k, j\right)}}{\sum_{i=1}^K e^{E\left(k, i\right)}} + \mathbb{1}\left(1, j\right)\gamma_{X_1}\left(s, \tau\right) - \gamma_{X_{1, j}}\left(s, \tau\right) \frac{e^{E\left(1, j\right)}}{\sum_{i=1}^K e^{E\left(1, i\right)}} \right]
\end{align}$$
-->
<!--
&\implies \frac{\partial}{\partial y_s^{\tau}\left(j\right)} I_{MCE} = \mathbb{E}_{\mathcal{B}}\left[ \hat{\gamma}_{X_1}\left(s, \tau\right) + \gamma_{X_1}\left(s, \tau\right)\left[\left(1-\frac{\hat{Z}\left(X_1\right)}{Z\left(X_1\right)}\right) + \frac{e^{E\left(1, j\right)}}{\sum_{i=1}^K e^{E\left(1, i\right)}}\right] - \gamma_{X_{1, j}}\left(s, \tau\right) \frac{2e^{E\left(1, j\right)}}{\sum_{i=1}^K e^{E\left(1, i\right)}}\right]
### Semi-supervised Algorithm
$$\begin{align}
\mbox{Sample} B_{sup} &~ \mathcal{D}_{sup} \\
\mbox{Sample} B_{unsup} &~ \mathcal{D}_{unsup} \\
\mbox{Update } \Theta \mbox{ according to } &\mathcal{L}_{MMI}\left(B_{sup}, \Theta \right) \\
\mbox{compute} \phi\left(B_{unsup}\right) &\\
\mbox{\textbf{for}} &\mbox{all combinations } \left(i, j\right) \\
& \phi_{i,j} = \phi\left(B_{unsup}\right)_i + \phi\left(B_{unsup}\right)_j \\
& \mbox{Do forward-backward on} \phi_{i,j} \circ G \mbox{and store} \\
\mbox{compute gradients using the graident fomula from before}\\
\end{align}$$
-->MatthewWiesnerEnergy Based Models2020-02-05T15:56:00+00:002020-02-05T15:56:00+00:00https://m-wiesner.github.io/Energy-Based-Models<p>I’ve wanted to learn about generative neural models for some time. I’m focusing on energy-based models for now. I had been trying to come up with
other ways to use untranscribed speech in domain adaptation when I came across <a href="https://arxiv.org/pdf/1912.03263.pdf">this paper</a>, which was
doing something very similar to what I had been thinking about. Most of this post is just me going through the background I needed to
understand this paper.</p>
<h1 id="energy-based-models">Energy Based Models</h1>
<p>The key idea in Energy Based Models (EBMs) is to generate a score for data points. Our data can be viewed as measurements of the underlying system
that we are attempting to model. We use the score as a goodness measure of a particular configuration. This score is termed the Energy.</p>
\[E_{\theta}\left(x_i\right) : \mathbb{R}^{d \times 1} \to \mathbb{R}\]
<p>The only restriction on this score is that it result in a finite integral over the entire domain of our data. We can generate a probability distribution
from the energy.</p>
\[\begin{align}
p_{\theta}\left(x\right) &= \frac{e^{-E\left(x\right)}}{\int_{x \in \mathcal{X}} e^{-E\left(x\right)} dx} \\
&= \frac{e^{-E\left(x\right)}}{Z\left(\theta\right)}
\end{align}\]
<p>Here, \(Z\left(\theta\right)\) is known as the partition function, and computing it is intractable because we can never integrate over all possible values of our data. In spite of this, we will proceed to take gradients of the log-likelihood as if we could use them for gradient-based training.</p>
\[\nabla_{\theta} \log{p_{\theta}\left(x\right)} = -\nabla_{\theta} E\left(x\right) - \nabla_{\theta} \log{Z\left(\theta\right)}\]
\[= -\nabla_{\theta} E\left(x\right) - \frac{1}{Z\left(\theta\right)} \int_{x \in \mathcal{X}} e^{-E\left(x\right)} \left(- \nabla_{\theta}E\left(x\right)\right) dx\]
\[= -\nabla_{\theta} E\left(x\right) + \int_{x \in \mathcal{X}} \frac{e^{-E\left(x\right)}}{Z\left(\theta\right)} \nabla_{\theta} E\left(x\right) dx\]
\[= -\nabla_{\theta} E\left(x\right) + \int_{x \in \mathcal{X}} p_{\theta}\left(x\right) \nabla_{\theta} E\left(x\right) dx\]
\[= \mathbb{E}_{p_{\theta}\left(x\right)} \left[\nabla_{\theta}E\left(x\right)\right] - \nabla_{\theta} E\left(x\right)\]
<p>So if we know how to compute the gradient with respect to \(E\left(x\right)\), then we can approximate the expectation by sampling.
This sampling procedure therefore becomes crucial. One easy way of sampling is to use a technique known as Stochastic Gradient Langevin Dynamics (SGLD).</p>
<h1 id="sgld">SGLD</h1>
<p>The main idea behind SGLD is to generate low-energy data points according to our current model. If we can do this, then we basically have a way of
sampling from \(p_{\theta}\left(x\right)\), since the low energy points should correspond to likely points. This sampling technique is itself very
similar to stochastic gradient descent (SGD).</p>
<p>We initially start with points sampled uniformly from our domain. Then we find the direction of steepest energy decrease and take a step in that direction.
If we did this for enough steps, we would eventually reach the points of minimum energy, which correspond to the modes of \(p_{\theta}\left(x\right)\).
But we obviously do not want to sample only the modes of the distribution. To ensure that we at least sometimes return samples corresponding to other points,
we need to inject some noise. In this way we can sample points around the modes of the distribution. The amount of noise
(its variance, if we model the noise as Gaussian) should be tuned appropriately to ensure the desired behavior.</p>
<p>Formally, the sampling procedure is:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="n">uniform_sample</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">*</span> <span class="p">[</span><span class="n">sigma_1</span><span class="p">,</span> <span class="n">sigma_2</span><span class="p">,</span> <span class="p">...,</span> <span class="n">sigma_D</span><span class="p">]</span> <span class="c1"># Sample uniformly from the input domain (approximated by 3 standard deviations per dimension)
</span><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_sgld_steps</span><span class="p">):</span>
    <span class="n">x</span> <span class="o">+=</span> <span class="o">-</span><span class="n">step_size</span> <span class="o">*</span> <span class="n">grad</span><span class="p">(</span><span class="n">E</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="n">x</span><span class="p">)</span> <span class="o">+</span> <span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">sgld_variance</span><span class="p">)</span> <span class="c1"># Step against the gradient of the energy E, then add zero-mean Gaussian noise
</span></code></pre></div></div>
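<p>As a self-contained check that this procedure actually works, here is a runnable sketch that applies SGLD to the toy energy \(E\left(x\right) = x^2/2\), whose Gibbs distribution \(e^{-E\left(x\right)}/Z\) is a standard Gaussian. The gradient is analytic here (with a neural energy it would come from autograd), and the step size and step count are arbitrary choices for illustration.</p>

```python
import numpy as np

def grad_E(x):
    # E(x) = 0.5 * x**2, so grad E = x and exp(-E)/Z is the standard normal density
    return x

def sgld_sample(n_chains=2000, n_steps=500, step_size=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize uniformly over roughly 3 standard deviations of the target
    x = rng.uniform(-3.0, 3.0, size=n_chains)
    for _ in range(n_steps):
        # Langevin update: step down the energy gradient, then inject noise
        x += -0.5 * step_size * grad_E(x) + np.sqrt(step_size) * rng.normal(size=n_chains)
    return x

samples = sgld_sample()
print(samples.mean(), samples.std())  # should land near 0 and 1 respectively
```

<p>Running many short chains in parallel, as above, mimics generating a minibatch of negative samples for the EBM gradient estimate.</p>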
<p>In this way we can generate a full minibatch of samples which we use to approximate the gradient. In practice, rather than sampling uniformly at random
for each minibatch, a buffer of previously generated points is stored, and points for a new iteration can be sampled from this buffer instead of uniformly.
Chains restarted from the buffer have already taken some SGLD steps, so we effectively run longer chains and get
easier convergence without requiring too many SGLD steps for any given minibatch. This is known as replay memory. Some points are still sampled completely
randomly.</p>MatthewWiesnerLattice Free Maximum Mutual Information (LF-MMI)2020-01-22T10:27:00+00:002020-01-22T10:27:00+00:00https://m-wiesner.github.io/LF-MMI<p>I’m writing this to remember all of the details of MMI and LF-MMI, especially the gradient computation that I worked through with <a href="https://desh2608.github.io/2019-05-21-chain/">Desh Raj</a>. I probably missed a few things, but hopefully the main concepts are covered. If you are reading this and you already know what MMI is and you are just looking for the details of LF-MMI, skip to the LF-MMI section. I added some very basic information about ASR with HMMs just for the sake of completeness. The original paper is <a href="http://www.danielpovey.com/files/2016_interspeech_mmi.pdf">here</a>.</p>
<p>tldr;</p>
<p>LF-MMI is just like lattice based MMI, but replaces utterance specific denominator lattices with a globally shared denominator graph constructed by means of a 4-gram phone language model. Some tricks to prevent overfitting are usually required, including cross-entropy regularization via multitask learning and L-2 regularization on the network outputs. Since the network is directly trained to produce pseudo-likelihoods, no prior normalization is required.</p>
<hr />
<p>The LF-MMI objective function is a particular discriminative objective function used especially in hybrid HMM-DNN ASR.
Discriminative objective functions are of interest because they train models not only to make the correct output sequence
more likely, but also to make incorrect sequences less likely. In other words, models are trained to maximize the separation
between correct and incorrect answers, that is, to discriminate between them, rather than simply to assign high weight to the correct sequences.</p>
<p>One such objective function is called the <strong>Maximum Mutual Information</strong> (MMI) objective function. It is also sometimes referred to as
<strong>Maximum Conditional Likelihood Estimation</strong>. LF-MMI is essentially just the MMI objective function that has been modified to enable
training ASR systems on GPU. I describe these modifications later. For now, I am just going to describe the MMI objective function.</p>
<h2 id="relationship-of-maximum-mutual-information-objective-to-mutual-information">Relationship of Maximum Mutual Information Objective to Mutual Information</h2>
<p>The MMI objective function is called MMI, because it can be derived from maximizing the mutual information between the input
\(X\), and output \(W\) sequences. First the MMI objective function is defined to be</p>
\[F_{MMI} = \sum_{r=1}^N \log{\frac{p_{\theta}\left(X_r | W_r\right)p\left(W_r\right)}{\sum_{W} p_{\theta}\left(X_r | W\right)p\left(W\right)}}\]
<p>In this function, \(r\) indexes the training examples (utterances or short chunks of audio), \(X_r\) are the audio features corresponding to chunk \(r\), and \(W_r\) is the reference transcript. Probability density functions (PDFs) subscripted by \(\theta\) are those parameterized with learnable parameters \(\theta\). This objective function can be shown to be equivalent to maximizing the mutual information over the parameter space between the input and output sequences.</p>
\[arg\max_\theta I_\theta \left(X_r; W_r\right) = arg\max_\theta H\left(W_r\right) - H_\theta\left(W_r | X_r\right)\]
<p>In general since we are only trying to model the relationship between inputs and outputs, the only parameters we are able to optimize
are those responsible for the conditional distribution \(p_\theta\left(W | X\right)\). The distribution \(p\left(W\right)\) is estimated
from the training transcripts and is considered fixed. In the case of ASR, this simply corresponds to a language model.</p>
<p>From this we see that the above optimization problem is equivalent to</p>
\[arg\max_\theta H\left(W_r\right) - H_\theta\left(W_r | X_r\right) = arg\min_\theta H_{\theta}\left(W_r | X_r\right)\]
<p>Using the definition of conditional entropy we have that</p>
\[\begin{align}
\mbox{ (1) } H_{\theta}\left(W_r | X_r\right) &=& E_{p\left(X_r, W_r\right)} \left[ - \log{p_{\theta}\left(W_r | X_r\right)}\right] \\\
\mbox{ (2) } &=& E_{p\left(X_r, W_r\right)} \left[ - \log{\frac{p_{\theta}\left(X_r | W_r\right) p\left(W_r\right)}{p\left(X_r\right)}}\right] \\\
\mbox{ (3) } &=& E_{p\left(X_r, W_r\right)} \left[ - \log{\frac{p_{\theta}\left(X_r | W_r\right) p\left(W_r\right)}{\sum_{W} p_{\theta}\left(X_r | W\right) p\left(W\right)}}\right]
\end{align}\]
<p>Line (1) from above is the definition of conditional entropy. Line (2) uses Bayes’ rule to factor the posterior distribution. Line (3) expands the marginal \(p\left(X_r\right)\) in the denominator as a sum over all word sequences \(W\) of the joint distribution, factored as \(p_{\theta}\left(X_r, W\right) = p_{\theta}\left(X_r | W\right)p\left(W\right)\).</p>
<p>Then, using the law of large numbers we note that …</p>
\[E_{p\left(X_r, W_r\right)} \left[ - \log{\frac{p_{\theta}\left(X_r | W_r\right) p\left(W_r\right)}{\sum_{W} p_{\theta}\left(X_r | W\right) p\left(W\right)}}\right] = \lim_{N \to \infty} \frac{-1}{N} \sum_{r=1}^N\log{\frac{p_{\theta}\left(X_r | W_r\right) p\left(W_r\right)}{\sum_{W} p_{\theta}\left(X_r | W\right) p\left(W\right)}}\]
<p>So finally, by approximating this limit by using a finite sample size \(N\) …</p>
\[E_{p\left(X_r, W_r\right)} \left[ - \log{\frac{p_{\theta}\left(X_r | W_r\right) p\left(W_r\right)}{\sum_{W} p_{\theta}\left(X_r | W\right) p\left(W\right)}}\right] \simeq \frac{-1}{N} \sum_{r=1}^{N}\log{\frac{p_{\theta}\left(X_r | W_r\right) p\left(W_r\right)}{\sum_{W} p_{\theta}\left(X_r | W\right) p\left(W\right)}}\]
<p>And we convert the minimization problem</p>
\[arg\min_{\theta} H_{\theta}\left(W_r | X_r\right)\]
<p>into a maximization problem by negating the conditional entropy and dropping the constant \(N\) which is just a scaling factor and won’t change the optimal parameter values. This leaves us with the originally presented expression for the MMI objective function.</p>
<h2 id="acoustic-modeling-with-hmms-background">Acoustic Modeling with HMMs Background</h2>
<p>The HMM acoustic model is constructed as follows.</p>
<ul>
<li>Words are modeled as a sequence of units. In traditional ASR these units are triphones, but even just a sequence of letters would probably work fine. A single word can correspond to different allowable sequences of units.</li>
</ul>
<blockquote>
<p>EITHER –> IY - TH - ER</p>
</blockquote>
<blockquote>
<p>EITHER –> AY - TH - ER</p>
</blockquote>
<ul>
<li>For each of these units, there is an associated HMM. Traditionally this is a 3-state HMM, but for reasons I’ll explain later, we tend to use a 1-state HMM instead. It would look something like this …</li>
</ul>
<p><img src="/LF-MMI/graphviz (1).png" alt="" height="50%" width="50%" /></p>
<ul>
<li>In an HMM we model
\(p_{\theta}\left(X_r | W_r\right)\)
where \(X_r = \{x_0, \ldots, x_{T-1} \}\)
is a length \(T\) sequence as …</li>
</ul>
\[p_{\theta}\left(X_r | W_r \right) = \sum_{\pi_r} \prod_{t=0}^{T-1} p_{\theta}\left(x_t | \pi_r^t\right) p\left(\pi_r^t | \pi_r^{t-1} \right)\]
<p>\(\pi_r\) corresponds to one of the valid paths through the HMM for the word sequence \(w_r\). \(\pi_r^t\) is the state at time \(t\) along the path \(\pi_r\).</p>
<h2 id="gradient-of-mmi">Gradient of MMI</h2>
<p>I am going to assume that the underlying acoustic model is an HMM. I assume we are using a Hybrid HMM-DNN model, where the DNN output activations are used as log emission probabilities for the \(D\) states \(s_1, \ldots, s_D\) in our HMM. We define \(y_s^t\) to be the DNN activations at time \(t\) for a state \(s\). In other terms \(y_s^t = \log{p_{\theta}\left(x_t | s \right)}\)</p>
<p>We also note that since the transition probabilities are not trained, we will just consider these to be a multiplicative weight associated with a particular path
(i.e. \(K_{\pi_r} = \prod_{t=0}^{T-1} p\left(\pi_r^t | \pi_r^{t-1}\right)\)).</p>
<p>Plugging this into the expression for \(F_{MMI}\) and converting the logarithm of a product into a sum of logarithms, we get</p>
\[\begin{align}
F_{MMI} &= \sum_{r} \log{\frac{p\left(w_r\right) \sum_{\pi_r}K_{\pi_r} e^{\sum_{t=0}^{T-1}y_{\pi_r^t}^t}}{\sum_{w} p\left(w\right) \sum_{\pi_w}K_{\pi_w} e^{\sum_{t=0}^{T-1}y_{\pi_w^t}^t}}} \\\
&= \sum_{r} \left[ \log{p\left(w_r\right)} + \log{\sum_{\pi_r}K_{\pi_r} e^{\sum_{t=0}^{T-1}y_{\pi_r^t}^t}} - \log{\sum_{w, \pi_w} p\left(w\right) K_{\pi_w} e^{\sum_{t=0}^{T-1}y_{\pi_w^t}^t}} \right]
\end{align}\]
<p>Now since we are modeling the emission probabilities using a neural network trained using backpropagation and automatic differentiation, we really only need the partial gradient with respect to the output activations of the neural network \(y_s^t\).</p>
\[\begin{align}
\frac{\partial F_{MMI}}{\partial y_s^{\tau}} &= \frac{\partial}{\partial y_s^{\tau}} \sum_{r} \left[ \log{p\left(w_r\right)} + \log{\sum_{\pi_r}K_{\pi_r} e^{\sum_{t=0}^{T-1}y_{\pi_r^t}^t}} - \log{\sum_{w, \pi_w} p\left(w\right) K_{\pi_w} e^{\sum_{t=0}^{T-1}y_{\pi_w^t}^t}} \right] \\\
&= \sum_{r} \frac{\partial}{\partial y_s^{\tau}} \left[ \log{p\left(w_r\right)} + \log{\sum_{\pi_r}K_{\pi_r} e^{\sum_{t=0}^{T-1}y_{\pi_r^t}^t}} - \log{\sum_{w, \pi_w} p\left(w\right) K_{\pi_w} e^{\sum_{t=0}^{T-1}y_{\pi_w^t}^t}} \right] \\\
&= \sum_{r} \frac{\partial}{\partial y_s^{\tau}} \left[\log{\sum_{\pi_r}K_{\pi_r} e^{\sum_{t=0}^{T-1}y_{\pi_r^t}^t}} - \log{\sum_{w, \pi_w} p\left(w\right) K_{\pi_w} e^{\sum_{t=0}^{T-1}y_{\pi_w^t}^t}} \right] \\\
&= \sum_{r} \left[ \frac{\partial}{\partial y_s^{\tau}} \log{\sum_{\pi_r}K_{\pi_r} e^{\sum_{t=0}^{T-1}y_{\pi_r^t}^t}} - \frac{\partial}{\partial y_s^{\tau}} \log{\sum_{w, \pi_w} p\left(w\right) K_{\pi_w} e^{\sum_{t=0}^{T-1}y_{\pi_w^t}^t}} \right] \\\
&= \sum_{r} \left[ \frac{1}{\sum_{\pi_r}K_{\pi_r} e^{\sum_{t=0}^{T-1}y_{\pi_r^t}^t}} \cdot \frac{\partial}{\partial y_s^{\tau}} \sum_{\pi_r}K_{\pi_r} e^{\sum_{t=0}^{T-1}y_{\pi_r^t}^t} - \frac{1}{\sum_{w, \pi_w} p\left(w\right) K_{\pi_w} e^{\sum_{t=0}^{T-1}y_{\pi_w^t}^t}} \cdot \frac{\partial}{\partial y_s^{\tau}} \sum_{w, \pi_w} p\left(w\right) K_{\pi_w} e^{\sum_{t=0}^{T-1}y_{\pi_w^t}^t} \right] \\\
&= \sum_{r} \left[ \frac{1}{\sum_{\pi_r}K_{\pi_r} e^{\sum_{t=0}^{T-1}y_{\pi_r^t}^t}} \cdot \sum_{\pi_r}K_{\pi_r} \frac{\partial}{\partial y_s^{\tau}} e^{\sum_{t=0}^{T-1}y_{\pi_r^t}^t} - \frac{1}{\sum_{w, \pi_w} p\left(w\right) K_{\pi_w} e^{\sum_{t=0}^{T-1}y_{\pi_w^t}^t}} \cdot \sum_{w, \pi_w} p\left(w\right) K_{\pi_w} \frac{\partial}{\partial y_s^{\tau}} e^{\sum_{t=0}^{T-1}y_{\pi_w^t}^t} \right] \\\
&= \sum_{r} \left[ \frac{1}{\sum_{\pi_r}K_{\pi_r} e^{\sum_{t=0}^{T-1}y_{\pi_r^t}^t}} \cdot \sum_{\pi_r}K_{\pi_r} e^{\sum_{t=0}^{T-1}y_{\pi_r^t}^t} \frac{\partial}{\partial y_s^{\tau}} \sum_{t=0}^{T-1}y_{\pi_r^t}^t - \frac{1}{\sum_{w, \pi_w} p\left(w\right) K_{\pi_w} e^{\sum_{t=0}^{T-1}y_{\pi_w^t}^t}} \cdot \sum_{w, \pi_w} p\left(w\right) K_{\pi_w} e^{\sum_{t=0}^{T-1}y_{\pi_w^t}^t} \frac{\partial}{\partial y_s^{\tau}} \sum_{t=0}^{T-1}y_{\pi_w^t}^t \right] \\\
&= \sum_{r} \left[ \frac{1}{\sum_{\pi_r}K_{\pi_r} e^{\sum_{t=0}^{T-1}y_{\pi_r^t}^t}} \cdot \sum_{\pi_r}K_{\pi_r} e^{\sum_{t=0}^{T-1}y_{\pi_r^t}^t} \mathbb{1}\left(\pi_r^\tau, s\right) - \frac{1}{\sum_{w, \pi_w} p\left(w\right) K_{\pi_w} e^{\sum_{t=0}^{T-1}y_{\pi_w^t}^t}} \cdot \sum_{w, \pi_w} p\left(w\right) K_{\pi_w} e^{\sum_{t=0}^{T-1}y_{\pi_w^t}^t} \mathbb{1}\left(\pi_w^\tau, s\right)\right]
\end{align}\]
<p>Here we introduced the indicator function</p>
\[\begin{align}
\mathbb{1}\left(\pi_r^\tau, s\right) = \begin{cases} 1 & \pi_r^\tau = s \\\ 0 & \pi_r^\tau \neq s \end{cases}
\end{align}\]
<p>We are almost done now. We note that the numerator of the first term in our expression corresponds to the joint probability of the acoustic sequence \(X_r\) going through <strong>any</strong> path for which \(\pi_r^{\tau} = s\). Note that we have restricted the set of paths to be those that correspond to the word sequence \(W_r\). Using the forward and backward probabilities at each time step we can write this as \(p\left(X_r, \pi_r^\tau = s\right) = \alpha_r\left(s, \tau\right) \beta_r\left(s, \tau\right)\). In the denominator we can partition the set of paths into the set of all paths that use a state \(s\) at time \(\tau\). In this way when we sum over all of the states at time \(\tau\) we are in fact summing over all paths. Hence we can rewrite the denominator as \(\sum_{\sigma} \alpha_r\left(\sigma, \tau \right) \beta_r\left(\sigma, \tau\right)\).</p>
<p>The second term in our expression can be decomposed in the same way as the first term. The only difference is that the set of paths is now the set of paths that are valid for <strong>any</strong> possible word sequence. We can still represent this as an HMM where paths for different word sequences are weighted by the probability of those word sequences. The state space just happens to be much larger. We will name the forward and backward probabilities associated with the space of all possible words \(\alpha_{w^\ast}\left(s, \tau\right), \beta_{w^\ast}\left(s, \tau\right)\).</p>
<p>Our expression for the gradient then becomes …</p>
\[\begin{align}
\frac{\partial F_{MMI}}{\partial y_s^{\tau}} &= \sum_{r} \left[ \frac{\alpha_r\left(s, \tau\right) \beta_r\left(s, \tau\right)}{\sum_{\sigma} \alpha_{r}\left(\sigma,\tau\right) \beta_{r}\left(\sigma, \tau\right)} - \frac{\alpha_{w^\ast}\left(s,\tau\right) \beta_{w^\ast}\left(s, \tau\right)}{\sum_{\sigma^\prime} \alpha_{w^\ast}\left(\sigma^\prime,\tau\right) \beta_{w^\ast}\left(\sigma^\prime, \tau\right)}\right] \\\
&= \sum_{r} \left[ \gamma_{r}\left(s, \tau\right) - \gamma_{w^\ast}\left(s, \tau\right)\right]
\end{align}\]
<h2 id="algorithm">Algorithm</h2>
<p>We now almost have an algorithm for computing the gradient with respect to a neural network output!</p>
<ol>
<li>
<p>Create a graph (HMM) representing the space of all possible word sequences. To do this you could imagine enumerating all possible word sequences. You could then enumerate all possible pronunciations of each word sequence. Finally you would chain together the HMM models for the phonemes present in each of these pronunciations. Taking the union of all such HMM chains would correspond to the graph of all possible word sequences. We call this the denominator graph, as it corresponds to the denominator in the MMI objective function.</p>
</li>
<li>
<p>For each audio chunk \(X_r\), create a graph that corresponds to the reference word sequence. This corresponds to the union of all HMM chains that correspond to the ground truth word sequence for the audio chunk \(X_r\). We call this the numerator graph as it corresponds to the numerator in the MMI objective function.</p>
</li>
<li>
<p>Use the DNN to produce outputs \(y_s^t\) for the audio chunk \(X_r\).</p>
</li>
<li>
<p>Using these outputs, run the forward and backward algorithm on both the numerator and denominator graph to generate
\(\alpha_{r}\left(s, t\right), \beta_{r}\left(s, t\right), \alpha_{w^{\ast}}\left(s, t\right), \beta_{w^\ast}\left(s, t\right)\)</p>
</li>
<li>
<p>Compute the gradient according to</p>
</li>
</ol>
\[\frac{\partial F_{MMI}}{\partial \theta} = \sum_{r} \sum_{s, \tau} \left[ \gamma_{r}\left(s, \tau\right) - \gamma_{w^\ast}\left(s, \tau\right)\right] \frac{\partial y_s^{\tau}}{\partial \theta}\]
<p>where \(\theta\) is some parameter in the neural network, and that gradient is just computed via autograd and backpropagation.</p>
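<p>The occupancy computation in steps 4 and 5 can be sketched numerically for a toy graph. The snippet below computes state occupancies \(\gamma\left(s, \tau\right)\) with the forward-backward algorithm and checks, by finite differences, that \(\gamma\left(s, \tau\right)\) really is the gradient of the log total score of a graph with respect to \(y_s^{\tau}\), so that the MMI gradient is simply the numerator-graph occupancy minus the denominator-graph occupancy. The dense transition matrix is a stand-in for what would be an FST in a real implementation.</p>

```python
import numpy as np

def forward(y, log_trans, log_init):
    """Return log alpha_t(s) for all t, for a dense toy graph."""
    T, D = y.shape
    alphas = np.empty((T, D))
    alphas[0] = log_init + y[0]
    for t in range(1, T):
        alphas[t] = np.logaddexp.reduce(alphas[t - 1][:, None] + log_trans, axis=0) + y[t]
    return alphas

def occupancies(y, log_trans, log_init):
    """gamma[t, s] = posterior probability of being in state s at time t."""
    T, D = y.shape
    alphas = forward(y, log_trans, log_init)
    log_Z = np.logaddexp.reduce(alphas[-1])   # log total score of the graph
    gammas = np.empty((T, D))
    beta = np.zeros(D)                        # log beta_{T-1}(s) = 0
    for t in range(T - 1, -1, -1):
        gammas[t] = alphas[t] + beta - log_Z
        if t > 0:
            # beta_{t-1}(s) = logsumexp_u [ log_trans(s, u) + y[t, u] + beta_t(u) ]
            beta = np.logaddexp.reduce(log_trans + (y[t] + beta)[None, :], axis=1)
    return np.exp(gammas)

# Per-utterance MMI-style gradient w.r.t. the network outputs:
# grad[t, s] = occupancies(y, num_trans, num_init)[t, s] - occupancies(y, den_trans, den_init)[t, s]
```

<p>Each row of the returned occupancy matrix sums to one, and perturbing a single output \(y_s^{\tau}\) changes the log total score by \(\gamma\left(s, \tau\right)\) times the perturbation, matching the derivation above.</p>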
<p>Our algorithm has some major problems, however, especially in the way we proposed to generate the numerator and denominator graphs. Below are details explaining how these graphs can be generated in practice. The main contribution of LF-MMI is how it approximates the denominator graph in order to make the computation feasible and the graph of a manageable size. In practice this is all done using Finite State Transducers (FSTs), which enable us to compactly store, for instance, the set of all possible sequences needed in (1.). I will be making another post about FSTs in ASR, specifically on the decoding graph and the components used to create it: <a href="https://m-wiesner.github.io/HCLG/">HCLG</a>.</p>
<h2 id="lf-mmi">LF-MMI</h2>
<p>In order to make the denominator graph a manageable size, the following modifications are made to the denominator graph:</p>
<ol>
<li>
<p>The denominator graph uses a 4-gram phone language model instead of a word level language model. The space of phones is much smaller than the space of all words. Furthermore, the language model is not smoothed; smoothing introduces many back-off states and edges which increases the size of the denominator graph.</p>
</li>
<li>
<p>The HMM topology used is the one state topology described above (as opposed to the 3-state topology). Again this makes the denominator graph smaller, and speeds up the forward / backward computation.</p>
</li>
<li>
<p>The DNN output frame rate is also reduced from 1 frame/10ms to 1 frame/30ms for the same reasons.</p>
</li>
</ol>
<h3 id="training-with-chunks">Training with chunks</h3>
<p>In order to train on chunks that are smaller than a whole utterance, there are a number of other necessary changes as well.</p>
<ul>
<li>The utterance level HMMs are represented as finite state transducers (FSTs). A separate FST acceptor is constructed that enforces a chunk of audio to be roughly aligned with the single best state sequence. The FST is created by having 1 node per time step. Between each node are a set of edges representing the set of state-ids that are allowable at this time step. This set is constructed from looking at a user specified window around the single best path and allowing any of the pdfids in this window to be accepted at the specific time. By composing this FST with the utterance level HMM, we get back a lattice representing the set of all probable alignments of the audio to the HMM states. Alternate paths in the resulting lattice therefore correspond to alternative pronunciations or alternative alignments of the audio. In this way, each state in the lattice is associated with a particular time index, which allows us to chop the utterance level lattices into chunks. Note that since we use state-tied parameters for HMMs in ASR, we actually align to pdf-ids (which could be shared across state), rather than on state id.</li>
</ul>
<p>The time enforcer FST looks like the FST shown below. For more information on the time enforcement, read <a href="https://ieeexplore.ieee.org/abstract/document/8639684">Improving LF-MMI Using Unconstrained Supervisions for ASR</a>, which is where the image below comes from.</p>
<p><img src="/LF-MMI/Screen Shot 2020-01-23 at 9.29.32 PM.png" alt="" height="50%" width="50%" /></p>
<p>where \(\{\alpha_t^k | k \in \left[0, N-1\right]\}\)
is the set of \(N\) distinct pdf-ids allowed at time \(t\).</p>
<ul>
<li>
<p>The denominator graph is created using a language model trained on full utterances. Since we are using chunks of audio, this means we could be both starting and ending in the middle of an utterance. Clearly, the initial and final probabilities of the initial denominator graph would be wrong if that were the case. To compensate for this we use modified initial and final probabilities. The final probability at every state is set to 1. This lets the utterance end at any arbitrary HMM state (not just at the end of the utterance). New initial probabilities are computed by creating the first 100 steps of the FST trellis. The state occupancy probabilities are averaged across all 100 time steps. We use this average as our new initial probabilities for each state. This modified denominator graph is called the <strong>normalization fst</strong>.</p>
</li>
<li>
<p>Finally, since the numerator and denominator graphs are created in different ways, we need to ensure that the set of paths in the numerator graph is a subset of those in the denominator graph. We do this by composing the numerator graph with the normalization fst. To avoid double counting the transition probabilities, they are actually omitted in the original numerator graph.</p>
</li>
</ul>
<h3 id="regularization">Regularization</h3>
<p>The LF-MMI objective was observed to overfit. Three methods of regularizing the network are used to prevent overfitting.</p>
<ol>
<li>
<p>Multitask training with the cross-entropy objective. The forward probabilities at each time step in the numerator graph are used as soft targets instead of the usual hard targets.</p>
</li>
<li>
<p>L-2 regularization on the network outputs. In other words, the network is trained using the objective
\(F_{LF-MMI} = F_{MMI} - \lambda {\left\lVert y^t\right\rVert}_2^2 + \omega F_{xent}\)</p>
</li>
<li>
<p>A Leaky HMM is used. Here a small transition probability between any two states is allowed. This allows for gradual forgetting of context.</p>
</li>
</ol>MatthewWiesnerBootstrap Sampling2019-08-30T22:44:00+00:002019-08-30T22:44:00+00:00https://m-wiesner.github.io/bootstrap-sampling<p>I’m writing this post to remind myself how bootstrap sampling works. There are
no proofs, only intuition, some quick experiments, and some comments.</p>
<hr />
<p>One way to gauge the certainty of a reported result is to provide confidence
estimates. This is something that bootstrap sampling can be used for, even
though we only have access to a relatively small sample from the larger
population of interest.</p>
<h2 id="intuition">Intuition</h2>
<p>The key idea of Bootstrap sampling is exceedingly simple. Let’s assume that the
small population to which I have access is representative of my larger
population of interest. If I repeatedly sample subsets from the small population
(with replacement), and I measure a particular statistic on each subset, I can simply report
the interval in which this statistic falls some fraction of the time.
This becomes my confidence interval.</p>
<p>Let’s call the statistic \(g(.)\). Let’s assume the test sample
\(\mathcal{T} \sim \mathcal{P}\) is drawn from a population \(\mathcal{P}\). In
bootstrap sampling we are just simulating other possible subsets of
\(\mathcal{P}\) that I could have drawn, by sampling new subsets
\(\mathcal{T}_i\) with replacement from \(\mathcal{T}\). My new estimate
of my statistic becomes</p>
\[g^{\ast} = \frac{1}{B}\sum_{i=1}^{B} g\left(\mathcal{T}_i\right)\]
<p>where \(B\) is the number of simulated subsets I create. The empirical
distribution of \(g\left(\mathcal{T}_i\right)\) is used to determine confidence
intervals.</p>
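<p>As a quick sketch (the function name and defaults below are my own, not from the post), the \(B\) replicates \(g(\mathcal{T}_i)\) and the bootstrap estimate \(g^{\ast}\) can be computed as:</p>

```python
import numpy as np

def bootstrap_replicates(data, g, B=1000, seed=None):
    """Return the B replicates g(T_i), where each T_i is drawn
    with replacement from data and has the same size as data."""
    rng = np.random.default_rng(seed)
    n = len(data)
    return np.array([g(rng.choice(data, size=n, replace=True))
                     for _ in range(B)])

data = np.random.default_rng(0).normal(loc=1.3, scale=0.4, size=100)
reps = bootstrap_replicates(data, np.mean, B=2000, seed=0)
g_star = reps.mean()  # bootstrap estimate of the mean
```

<p>The confidence interval then comes from the empirical distribution of <code>reps</code>, e.g. its 2.5th and 97.5th percentiles.</p>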
<p>There is an implicit assumption here that the sampled data points are
independent. In speech, the use of speaker information when training systems
breaks this assumption if the data points are individual sentences, many of
which may have been spoken by the same speaker. Details about bootstrap WER
estimation can be found in <a href="http://www-i6.informatik.rwth-aachen.de/PostScript/InterneArbeiten/Bisani_BootstrapEstimatesForConfidenceIntervalsInASRPerformanceEvaluation_ICASSP_2004.pdf">Bootstrap Confidence Intervals in ASR</a>.</p>
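<p>One way to soften the independence issue (a sketch of the general idea; the helper below is mine, not from the paper) is to resample whole speakers rather than individual sentences, so that utterances from the same speaker stay together:</p>

```python
import numpy as np

def bootstrap_by_speaker(per_utt_stat, speaker_of_utt, B=1000, seed=None):
    """Bootstrap the mean of a per-utterance statistic by resampling
    speakers with replacement, keeping each speaker's utterances
    together so within-speaker dependence is preserved."""
    rng = np.random.default_rng(seed)
    speakers = np.unique(speaker_of_utt)
    # Group utterance indices by speaker once, up front.
    utts_of = {s: np.flatnonzero(speaker_of_utt == s) for s in speakers}
    stats = []
    for _ in range(B):
        drawn = rng.choice(speakers, size=len(speakers), replace=True)
        idx = np.concatenate([utts_of[s] for s in drawn])
        stats.append(per_utt_stat[idx].mean())
    return np.array(stats)

# Toy example: per-utterance error rates for two speakers.
errs = np.array([1.0, 2.0, 3.0, 4.0])
spk = np.array(["a", "a", "b", "b"])
reps = bootstrap_by_speaker(errs, spk, B=500, seed=0)
```

<p>Each bootstrap replicate now mixes whole speakers, so the spread of <code>reps</code> reflects between-speaker variability rather than treating every sentence as independent.</p>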
<p>Some code is included below to illustrate how this works on a toy example. We
construct \(\mathcal{T} = [t_1, t_2, \ldots, t_{100}], \ t_i \sim \mathcal{N}\left(1.3, 0.16\right)\), i.e. a Gaussian with mean \(1.3\) and standard deviation \(0.4\).</p>
<h2 id="example">Example</h2>
<p>We construct bootstrap samples by sampling \(100\) points with replacement from
\(\mathcal{T}\).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="n">pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="c1"># Sample 100 points from gaussian mean=1.3 sig=0.4
</span><span class="n">gauss</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="mf">1.3</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="mf">0.4</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
<span class="n">bssize_means</span> <span class="o">=</span> <span class="p">[]</span> <span class="c1"># Collect all of the bootstrap statistics
</span><span class="k">for</span> <span class="n">bs_size</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">20000</span><span class="p">,</span><span class="mi">500</span><span class="p">):</span>
    <span class="n">samples</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">bs_size</span><span class="p">):</span>
        <span class="n">samples</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">gauss</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">gauss</span><span class="p">),</span> <span class="n">replace</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
    <span class="n">bs_means</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">bs</span><span class="p">)</span> <span class="k">for</span> <span class="n">bs</span> <span class="ow">in</span> <span class="n">samples</span><span class="p">]</span>
    <span class="n">bssize_means</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">bs_means</span><span class="p">))</span>
</code></pre></div></div>
<p>An interesting question is how many bootstrap samples are necessary for
the bootstrap estimate to converge. Clearly, more bootstrap samples are
better, but do we really need that many? Below is a plot of the bootstrap mean
estimate as a function of the number of bootstrap samples.
<img src="/bootstrap-sampling/Bootstrap_convergence_demo.png" alt="" /></p>
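<p>For reference, a plot like the one above can be produced with something like the following (a self-contained sketch; I use a shorter range of \(B\) values than the figure so it runs quickly):</p>

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
from matplotlib import pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
gauss = rng.normal(loc=1.3, scale=0.4, size=100)
bs_sizes = list(range(10, 2000, 100))
# For each B, average B bootstrap-sample means.
bssize_means = [
    np.mean([rng.choice(gauss, size=len(gauss), replace=True).mean()
             for _ in range(b)])
    for b in bs_sizes
]
plt.plot(bs_sizes, bssize_means, marker="o")
plt.axhline(gauss.mean(), linestyle="--", label="sample mean")
plt.xlabel("Number of bootstrap samples B")
plt.ylabel("Bootstrap estimate of the mean")
plt.legend()
plt.savefig("Bootstrap_convergence_demo.png")
```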
<p>It seems only a few bootstrap samples are needed to achieve a stable estimate of
the desired statistic, but if the computation is cheap, you may as well use as
many as is practical. Recommendations of 1000 to 10000 are common.</p>
<p>Our claim was that small datasets result in a lot of uncertainty
about the measured statistic. In ASR, this means a small test set causes uncertainty
about model performance. So what is the relationship between dataset size and the
bootstrap distribution? For the bootstrap distribution to be useful for estimating
confidence intervals, it should have a large variance when the dataset is small and a
small variance when the dataset is large. Below we show the bootstrap
distributions for datasets of different sizes sampled from the same population.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="p">[</span><span class="mi">50</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">1000</span><span class="p">]:</span>
    <span class="n">gauss</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="mf">1.3</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="mf">0.4</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">s</span><span class="p">)</span>
    <span class="n">samples</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">20000</span><span class="p">):</span>
        <span class="n">samples</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">gauss</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">s</span><span class="p">,</span> <span class="n">replace</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
    <span class="n">bs_means</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">bs</span><span class="p">)</span> <span class="k">for</span> <span class="n">bs</span> <span class="ow">in</span> <span class="n">samples</span><span class="p">]</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">bs_means</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"size="</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">s</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">"Bootstrap Distribution of the Empirical Mean"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">"Bootstrap_distribution_demo.png"</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/bootstrap-sampling/Bootstrap_distribution_demo.png" alt="" /></p>
<p>Sure enough, we find that the bootstrap distribution encodes the uncertainty
we’d expect when using smaller datasets. To report a confidence interval, we
can simply report the range of the middle 95% of the bootstrap simulations.
Alternatively, we could model the bootstrap distribution as a Gaussian or other
distribution and estimate the confidence interval from the empirical variance of
the bootstrap simulations.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">__future__</span> <span class="kn">import</span> <span class="n">print_function</span>
<span class="k">print</span><span class="p">(</span><span class="s">"95% Confidence interval: ("</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">percentile</span><span class="p">(</span><span class="n">bs_means</span><span class="p">,</span> <span class="mf">2.5</span><span class="p">),</span> <span class="s">", "</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">percentile</span><span class="p">(</span><span class="n">bs_means</span><span class="p">,</span> <span class="mf">97.5</span><span class="p">),</span> <span class="s">")"</span><span class="p">)</span>
</code></pre></div></div>
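<p>The Gaussian-approximation alternative mentioned above looks like this in code (a sketch; the \(\pm 1.96\) half-width assumes the bootstrap distribution is roughly normal):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
gauss = rng.normal(loc=1.3, scale=0.4, size=1000)
# 5000 bootstrap replicates of the sample mean.
bs_means = [rng.choice(gauss, size=len(gauss), replace=True).mean()
            for _ in range(5000)]

# Normal approximation: center +/- 1.96 bootstrap standard deviations.
center = np.mean(bs_means)
halfwidth = 1.96 * np.std(bs_means)
ci = (center - halfwidth, center + halfwidth)
print("95%% Confidence interval: (%.4f, %.4f)" % ci)
```

<p>When the bootstrap distribution is approximately Gaussian, this agrees closely with the percentile interval above.</p>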
<p>In our case the 95% confidence interval for the 1000-sample dataset was \((1.27195901268, 1.32081083565)\).</p>
<h2 id="bootstrap-confidence-interval-in-kaldi">Bootstrap Confidence Interval in Kaldi</h2>
<p>For anyone using Kaldi, the bootstrap confidence interval is computed by default
using the script <code class="language-plaintext highlighter-rouge">./steps/score_kaldi.sh</code> or the binary <code class="language-plaintext highlighter-rouge">compute-wer-bootci</code>.
After running most of the Kaldi examples you can find the relevant file in
<code class="language-plaintext highlighter-rouge">exp/MODEL_NAME/decode_DATASET/scoring_kaldi/wer_details/wer_bootci</code>.</p>
<p>That’s about it.</p>