Mathursday: Rayleigh-Ritz, Courant-Fischer, and Weyl’s Inequality

Mathursday is back after a very long time. The last year was unusually hectic for all of us and I couldn’t devote enough time to posts. We restart with the study of eigenvalues, which finds significant use in many important areas of mathematics and computer science. In this post, we’ll discuss some fundamental results on eigenvalues that are commonly reused. These results are well-known, but it can help to revisit their proofs.


Lemma: (Eigenvalues of symmetric matrices) If A is an n \times n symmetric matrix, then there is an orthonormal basis of \mathbb{R}^n consisting of eigenvectors of A.

We’ll assume knowledge of the following:

Lemma: (Dimension Formula) Given vector subspaces V, W of \mathbb{R}^n, we have:

\dim(V + W) = \dim(V) + \dim(W) - \dim(V\cap W)

Definition (Rayleigh Quotient): For a square matrix A, its Rayleigh quotient is the function R_A: \mathbb{R}^n \setminus \{0\} \rightarrow \mathbb{R} defined below:

R_A(x) = \frac{x^\top A x}{x^\top x}.

Rayleigh–Ritz Theorem

Rayleigh–Ritz Theorem: Let A \in M_n be a symmetric matrix with eigenvalues \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n counted with multiplicity and with an orthonormal basis of eigenvectors \{u_i\}_{i=1}^n, where u_i has eigenvalue \lambda_i. Then for all 1 \le k \le n we have:

\displaystyle \lambda_k = \min_{x \ne 0; x \in \{u_1, \cdots, u_{k-1}\}^\perp} R_A(x),

where x \in \{u_1, \cdots, u_{k-1}\}^\perp means that x is orthogonal to \{u_1, \cdots, u_{k-1}\}. For k=1, the only constraint on x is that x \ne 0.

Proof: We can express each x \ne 0 as x = \sum_{i=1}^n \alpha_i u_i for some nonzero \alpha \in \mathbb{R}^n. Then we have:

\begin{aligned} R_A(x) = \frac{\left(\sum_{i=1}^n \alpha_i u_i\right)^\top A\left(\sum_{i=1}^n \alpha_i u_i\right)}{(\sum_{i=1}^n \alpha_i u_i)^\top (\sum_{i=1}^n \alpha_i u_i)} &= \frac{\left(\sum_{i=1}^n \alpha_i u_i\right)^\top \left(\sum_{i=1}^n \alpha_i A u_i\right)}{(\sum_{i, j=1}^n \alpha_i \alpha_j u_i^\top u_j)}  \\ &= \frac{\left(\sum_{i=1}^n \alpha_i u_i\right)^\top \left(\sum_{i=1}^n \lambda_i \alpha_i u_i\right)}{\sum_{i=1}^n \alpha^2_i} \\ &= \frac{\sum_{i=1}^n \lambda_i \alpha^2_i}{\sum_{i=1}^n \alpha^2_i} = \sum_{i=1}^n p_i \lambda_i \end{aligned}

where in the last line we have p_i = \frac{\alpha^2_i}{\sum_{j=1}^n \alpha^2_j}. Observe that this p is a probability vector for any value of \alpha, and any probability vector can be realized by some \alpha. Hence, optimizing over x is the same as optimizing over \alpha, which in turn is the same as optimizing over p.

If x \in \{u_1, \cdots, u_{k-1}\}^\perp, then we have \alpha_i = 0, and hence p_i=0, for all i \in \{1, 2, \cdots, k-1\}. This implies:

\displaystyle \qquad \min_{x \ne 0; x \in \{u_1, \cdots, u_{k-1}\}^\perp} R_A(x) = \min_{p \in \Delta^n; p_1=p_2=\cdots = p_{k-1}=0} \sum_{i=1}^n p_i \lambda_i .

It is straightforward to observe that this minimum is achieved when the entire probability mass is on \lambda_k. Hence, proved.
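As a sanity check, the theorem is easy to verify numerically. The sketch below (a minimal NumPy example; the matrix and the index k are our own arbitrary choices) uses the fact that vectors orthogonal to u_1, \cdots, u_{k-1} are exactly the span of u_k, \cdots, u_n, and that the minimum of R_A over that span is the smallest eigenvalue of the compressed matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 3
M = rng.standard_normal((n, n))
A = (M + M.T) / 2                  # random symmetric matrix
lam, U = np.linalg.eigh(A)         # ascending eigenvalues, orthonormal eigenvectors

def rayleigh(A, x):
    return x @ A @ x / (x @ x)

# vectors orthogonal to u_1..u_{k-1} live in span{u_k..u_n}; the minimum of
# R_A there is the smallest eigenvalue of the compressed matrix B^T A B
B = U[:, k-1:]
assert np.isclose(np.linalg.eigvalsh(B.T @ A @ B)[0], lam[k-1])
# and the minimum is attained at u_k itself
assert np.isclose(rayleigh(A, U[:, k-1]), lam[k-1])
```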

Lemma: Let A \in M_n be a symmetric matrix with eigenvalues \{\lambda_i\}_{i=1}^n arranged in ascending order and an orthonormal basis of eigenvectors \{u_i\}_{i=1}^n. Then for all 1 \le i \le j \le n we have:

\forall x \in LS\{u_i, \cdots, u_j\}, x \ne 0, \qquad R_A(x) \in \left[\lambda_i, \lambda_j \right],

where LS denotes the linear span.

Proof: We can adapt the proof of Rayleigh–Ritz to write:

\displaystyle \min_{x \ne 0; x \in LS\{u_i, \cdots, u_j\}} R_A(x) = \min_{p \in \Delta^n; p_1=\cdots = p_{i-1}=0,\, p_{j+1}=\cdots=p_n=0} \sum_{l=1}^n p_l \lambda_l .

The minimum is achieved when the entire probability mass is on the smallest eigenvalue available, which is \lambda_i, and the maximum is achieved when the entire probability mass is on the largest eigenvalue available, which is \lambda_j. Hence, proved.
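This lemma can also be checked numerically. A small sketch (random symmetric matrix and arbitrary 1-indexed endpoints i, j, all our own choices) samples vectors from the span and verifies the two-sided bound:

```python
import numpy as np

rng = np.random.default_rng(2)
n, i, j = 6, 2, 4                  # 1-indexed block of eigenvectors u_i..u_j
M = rng.standard_normal((n, n))
A = (M + M.T) / 2                  # random symmetric matrix
lam, U = np.linalg.eigh(A)         # eigenvalues in ascending order

for _ in range(1000):
    # random nonzero vector in LS{u_i, ..., u_j}
    x = U[:, i-1:j] @ rng.standard_normal(j - i + 1)
    r = x @ A @ x / (x @ x)
    assert lam[i-1] - 1e-9 <= r <= lam[j-1] + 1e-9
```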

Courant-Fischer Min-Max Theorem

Courant-Fischer Min-Max Theorem: Let A be a real symmetric matrix with eigenvalues \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n corresponding to an orthonormal set of eigenvectors u_1, u_2, \cdots, u_n. Then for any k \in \{1, 2, \cdots, n\} we have:

\displaystyle \lambda_k = \min_{U, \dim(U)=k}~~\max_{x; x \in U, x \ne 0} R_A(x),

\displaystyle \lambda_k = \max_{U, \dim(U)=n-k+1}~~\min_{x; x \in U, x \ne 0} R_A(x).

Proof: Let V = LS\{u_k, \cdots, u_n\} and let U be any subspace of dimension k. Then from the dimension formula we have:

\displaystyle \begin{aligned}  \dim(V\cap U) &= \dim(V) + \dim(U) - \dim(V + U) \\ &\ge \dim(V) + \dim(U) - n \\ &= (n - k + 1) + k - n = 1 \end{aligned}.

This allows us to pick a non-zero vector x in V \cap U. For this x, we have x \in LS\{u_k, \cdots, u_n\}, and therefore, from the previous result, R_A(x) \ge \lambda_k. This gives us:

\displaystyle \max_{x \in U; x\ne 0} R_A(x) \ge \lambda_k .

Since this result holds for any U of dimension k, we can write:

\displaystyle \min_{U; \dim(U)=k}\max_{x \in U; x\ne 0} R_A(x) \ge \lambda_k .

We now need to prove the other direction. Let W = LS\{u_1, \cdots, u_k\}; then \dim(W)=k. Further, for any non-zero x \in W, we have, from the previous result, R_A(x) \le \lambda_k. This implies:

\displaystyle \min_{U; \dim(U)=k}\max_{x \in U; x\ne 0} R_A(x) \le \max_{x \in W; x\ne 0} R_A(x) \le \lambda_k .

Combining these two inequalities proves the first min-max equality. The second equality can be proven similarly.
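The min-max characterization can be probed numerically. A useful exact device (our own helper, not from the post): the maximum of R_A over a subspace with orthonormal basis B equals the largest eigenvalue of the compressed matrix B^\top A B, so no sampling is needed:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 6, 3
M = rng.standard_normal((n, n))
A = (M + M.T) / 2                  # random symmetric matrix
lam, U = np.linalg.eigh(A)         # ascending eigenvalues

def max_rayleigh(A, B):
    # max of R_A over the span of the orthonormal columns of B:
    # the largest eigenvalue of the compressed matrix B^T A B
    return np.linalg.eigvalsh(B.T @ A @ B)[-1]

# W = span{u_1..u_k} attains the min-max value lambda_k ...
assert np.isclose(max_rayleigh(A, U[:, :k]), lam[k-1])
# ... while a random k-dimensional subspace does no better (its max is >= lambda_k)
Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
assert max_rayleigh(A, Q) >= lam[k-1] - 1e-9
```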

Weyl’s Inequality

Weyl’s Inequality: Let A, B be two n \times n real symmetric matrices. For all i \in \{1, 2, \cdots, n\}, let \lambda_i(A), \lambda_i(B), \lambda_i(A+B) denote the i^{th} eigenvalues of A, B, and A+B respectively, arranged in ascending order. Let u_i, v_i, w_i be the unit-norm eigenvectors of A, B, and A+B respectively corresponding to the i^{th} eigenvalue. Then for all k \in \{1, 2, \cdots, n\} and i \in \{k, k+1, \cdots, n\}, we have:

\displaystyle \lambda_k(A+B) \le \lambda_i(A) + \lambda_{n+k-i}(B)

\displaystyle \lambda_{n-k+1}(A+B) \ge \lambda_{n-i+1}(A) + \lambda_{i-k+1}(B)

Proof: The proof is similar to that of Courant-Fischer, in that we will define a set of subspaces and show that we can pick a point in their intersection. We will prove the first inequality; the second one is proven in a similar fashion. For fixed values of k, i, we define three subspaces:

\displaystyle \begin{aligned} S_1 &= LS\{w_k, \cdots, w_n\},\\ S_2 &= LS\{u_1, \cdots, u_i\},\\ S_3 &= LS\{v_1, \cdots, v_{n+k-i}\}\\ \end{aligned}

You can sort of guess why these subspaces were defined in this manner by looking at the inequality we are trying to prove. We have k on the left-hand side corresponding to the matrix A + B, and i and n+k-i on the right-hand side corresponding to matrices A and B respectively. We have \dim(S_1) = n-k+1, \dim(S_2) = i, and \dim(S_3) = n+k-i. Applying the dimension formula twice we get:

\displaystyle \begin{aligned} \dim(S_1 \cap S_2 \cap S_3) &= \dim(S_1) + \dim(S_2 \cap S_3) - \dim(S_1 + (S_2 \cap S_3))\\ &= \dim(S_1) + \dim(S_2) + \dim(S_3) - \dim(S_2 + S_3) - \dim(S_1 + (S_2 \cap S_3))\\ &\ge \dim(S_1) + \dim(S_2) + \dim(S_3) - 2n\\  &= (n-k+1) + i + (n + k - i) - 2n\\ &= 1 \end{aligned}

Let z be a non-zero vector in S_1 \cap S_2 \cap S_3. Since z \in S_1, we have \lambda_k(A + B) \le R_{A+B}(z) = R_A(z) + R_B(z), where the last equality holds because the Rayleigh quotient is linear in the matrix. Finally, as z \in S_2 and z \in S_3, we have R_A(z) \le \lambda_i(A) and R_B(z) \le \lambda_{n+k-i}(B). Combining these terms we get \lambda_k(A + B) \le \lambda_i(A) + \lambda_{n+k-i}(B), which is what we wanted to prove.
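Both Weyl inequalities can be verified numerically over all admissible pairs (k, i). A quick sketch with random symmetric matrices (our own test setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

def rand_sym(n):
    M = rng.standard_normal((n, n))
    return (M + M.T) / 2

A, B = rand_sym(n), rand_sym(n)
lA = np.linalg.eigvalsh(A)        # eigenvalues in ascending order
lB = np.linalg.eigvalsh(B)
lAB = np.linalg.eigvalsh(A + B)

for k in range(1, n + 1):
    for i in range(k, n + 1):
        # lambda_k(A+B) <= lambda_i(A) + lambda_{n+k-i}(B)
        assert lAB[k-1] <= lA[i-1] + lB[n+k-i-1] + 1e-9
        # lambda_{n-k+1}(A+B) >= lambda_{n-i+1}(A) + lambda_{i-k+1}(B)
        assert lAB[n-k] >= lA[n-i] + lB[i-k] - 1e-9
```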

Equalities and inequalities involving eigenvalues are quite useful and constitute a big part of linear algebra literature. Interested readers can look at Matrix Algebra & Its Applications to Statistics & Econometrics by C. R. Rao and M. B. Rao, or Matrix Analysis by Rajendra Bhatia. Bhatia also has an extremely interesting article on eigenvalue inequalities which covers interesting history: Linear Algebra to Quantum Cohomology: The Story of Alfred Horn’s Inequalities.


Academia and Compute-Intensive AI Research

Artificial Intelligence (AI) research has achieved great empirical success in the last decade. One major reason has been the availability of fast GPU and CPU clusters that have made it possible to run experiments at very large scale (e.g., training models on millions of documents, or training an image classifier on a hundred thousand images). Some recent AI directions (particularly in architecture search) have been using notoriously large amounts of resources. For example, Strubell et al. 2019 found that training a language model can produce the same carbon emission as taking a trans-American flight.

While it is an interesting research direction to create computationally-efficient algorithms, the state at the end of 2019 is still that beating state-of-the-art (SOTA) justifies it all. Getting an empirical paper accepted without SOTA results is generally hard, and my speculation is that reviewers do not care as much about carbon emissions and FLOPs spent as improvement over state-of-the-art.

However, an interesting and overlooked side-story to this is that not everyone has benefited from this resource heavy research.

Research labs, particularly those in academia without connections to well-funded industry, are at a massive disadvantage in playing this SOTA game.

A student in a randomly chosen academic lab will have to use a small shared GPU cluster to play the same numbers game as a person in industry with access to a cluster of 100 GPU machines. This is generally less of a concern for top-tier schools, where many professors are well connected to industry (or otherwise well-funded) and therefore are in a better position to provide computational resources for their students.

It is relatively easy to dismiss these concerns as unwarranted: that good research should come from well-founded intuition rather than heavy computational use. I believe this sentiment is true in the long run. Maybe in 10 years, when we understand the black-box nature of deep learning, we will be able to derive these models on a whiteboard or find them in an ultra-fast way using basic principles; ResNets, LSTMs, and Transformers will then magically appear as optimal solutions of some functional. That would be a great achievement. However, until that time comes, the problem remains as follows:

  • students in empirical research need to publish
  • current reviewing practices, such as the use of leaderboards, heavily favour beating SOTA
  • current deep learning methods are not well enough understood to allow a whiteboard derivation of solutions. E.g., we don’t have a proof of why ResNets work better; the need for residual connections is not apparent from a formal whiteboard argument.
  • students from underfunded labs are at a disadvantage

What can underfunded labs do?

  1. Do theoretical research or focus on understanding existing empirical solutions
    One can avoid the computation-driven try-and-fail rat race by focusing on theoretical questions or on understanding existing solutions. This area is relatively under-explored and plenty of open questions remain that likely won’t be solved anytime soon.
  2. Think of existing problems in a different way
    The community as a whole can get tunnel vision, resulting in a significant number of researchers working on the same problem. I remember the time when 3-4 similar papers came out concurrently from well-established labs on the same problem. Working on a different problem, or analyzing the same problem in a different way (such as by questioning the assumptions or changing the setup), can help get out of this tunnel vision. There is more to contribute to the literature this way, compared to solving a problem that will surely be solved by 10 other people in the next few months.
  3. Focus on the learning problem, or data-poor problems
    Focus on sample-efficient learning and/or work with setups where one cannot simply scale the data (due to annotation cost, privacy or other reasons).
  4. Reach out to industry for resources
    Research labs in universities can reach out to industry for collaboration or funding. For example, Microsoft supports various fellowships (investigator fellowship, PhD fellowship, dissertation grant, Ada Lovelace fellowship). Facebook, Amazon, and others also have similar fellowship programs.

What can the community do?

  1. Avoid reducing reviewing to SOTA measurements
    A research paper is more than a performance gain over prior work. Reducing reviewing to how many datasets a paper killed is harmful for long-term research. Consider for example two contributions: one achieves a 2% improvement over SOTA by tweaking the SOTA solution, while another takes a less travelled route for which previous performance was significantly below SOTA and manages to come within 1% of SOTA (but still below it). Which one is more informative? The first one reaffirms the belief that current SOTA techniques, when mixed together, work well. The second one says that we have grossly underestimated the performance of another route.
  2. Take FLOPs, memory footprint, carbon emissions, etc. into consideration in reviewing
    Traditionally, these statistics have either never been reported, or if reported then ignored by reviewers. Conference guidelines should encourage authors to report these statistics. This doesn’t require any more experiments, only simple accounting. Reviewers should judge for themselves whether a 1% gain at the cost of 10x more compute is acceptance-worthy.
  3. Ask if empirical gains are coming from technical novelty or extensive tuning
    It is not rare to see a paper reporting a 5% gain over SOTA where the core idea contributes less than 1% to performance. A simple way to check this is to ask authors to perform ablations.
  4. Evaluate candidates on core research contributions rather than citations, number of papers, etc.
    Many AI students feel a lot of pressure from being unable to match the productivity of large well-funded labs writing 50-100 papers a year (1-2 papers a week). Generally, most of these papers revolve around a few topics and do not contribute fundamentally in all 100 directions (that would be nearly impossible).
    Okay, but what does this have to do with compute-intensive work? Firstly, I speculate that if students are not pressured to write many papers, they will rely more on fundamental principles (or take a completely new route) rather than using brute-force compute power to improve rank on leaderboards. Secondly, if there is no pressure to publish constantly, then students may be encouraged to try different research problems and increase the diversity of their research.

Heavily tuned deep neural architectures trained with relatively simple SGD methods have revolutionized machine learning. Similarly, the use of leaderboards has made it possible to perform controlled experiments, which has had huge impact. These two developments have generated a large amount of financial interest in machine learning and AI, which also benefits other aspects of AI and ML (and for good). However, we might be starting to see some disadvantages of over-reliance on compute-intensive and leaderboard-driven research.

Those who are feeling underwhelmed by the SOTA competition can find solace in knowing that there is still lots of work to do in developing re-usable first principles, in understanding existing solutions, and in taking the more rigorous albeit slow path towards better solutions. On the way, you may find solutions that have been proposed before by people with great intuition, or by exhaustion of computational resources. However, there is great benefit in discovering a new, elegant path to the same solution.

Mathursday: PAC with Hoeffding-Bernstein

PAC results with Hoeffding-Bernstein’s inequality are the bread and butter of machine learning theory. In this post, we’ll see how to use Hoeffding’s inequality to derive agnostic PAC bounds with an error rate that decreases as O\left(\sqrt{\frac{1}{n}}\right) in number of training samples n.  Then, we’ll see that in the realizable setting we can use Bernstein’s inequality to improve this to O\left(\frac{1}{n}\right).

Supervised Learning

Consider the task of regression with input space \mathcal{X} and output space \mathcal{Y}. We have access to a model class (also known as a hypothesis family) \mathcal{F} = \left\{ f: \mathcal{X} \rightarrow \mathcal{Y} \right\}. We will assume \mathcal{F} to be finite for simplicity. There exists some unknown data distribution D \in \Delta\left(\mathcal{X} \times \mathcal{Y}\right) that generates examples. Of course, we don’t have access to this data distribution, but we do have access to some samples from it. Let S = \{(x_i, y_i)\}_{i=1}^n be n independent and identically distributed (i.i.d.) samples from the data distribution, i.e., (x_i, y_i) \sim D(\cdot,\cdot), and the probability of observing a sample S is given by P(S) = \prod_{i=1}^n D(x_i, y_i).

In order to evaluate the performance of a model we will need a loss function. Let \ell: \mathcal{Y} \times \mathcal{Y} \rightarrow [0, 1] be a loss function. Given a datapoint (x, y) and a model f \in \mathcal{F}, the loss associated with our prediction is \ell(y, f(x)). For now, we will not make any assumptions about D, \mathcal{F}, \ell.

Our aim is to find a function f \in \mathcal{F} that is as good as possible. What is a good function? One that has the least loss across the input space, i.e., whose generalization error L_D(f) = \mathbb{E}_{(x, y) \sim D}\left[\ell(y, f(x)) \right] is as low as possible. The only signal we have for finding such an f is the training data S. We define the training error L_S(f) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i)). The aim of machine learning theory is to (i) allow us to find a “good” f, and (ii) bound the performance of the learned model in terms of S, \mathcal{F}, and other relevant parameters. The first result we will look at is the Occam’s Razor bound; then we will see how to improve it.

Occam’s Razor Bound

One of the first results taught in a machine learning theory class is the Occam’s Razor bound. The formal result is given in the theorem below. Occam’s Razor bound suggests using the model that minimizes the error on the observed sample S, i.e., the solution of the following optimization: \hat{f}_S := \arg\min_{f \in \mathcal{F}} L_S(f). This solution is called the empirical risk minimizer (ERM) or the ERM solution. We will drop the S from the notation when it is clear from the context.

Theorem: Let (\mathcal{X}, \mathcal{Y}, D, \ell) be a supervised learning problem with input space \mathcal{X}, output space \mathcal{Y}, data distribution D \in \Delta(\mathcal{X} \times \mathcal{Y}), and loss function \ell: \mathcal{Y} \times \mathcal{Y} \rightarrow [0, 1]. We are given n i.i.d. samples \{(x_i, y_i)\}_{i=1}^n from the data distribution D. Let \mathcal{F} be a finite collection of models from \mathcal{X} to \mathcal{Y}. Fix any \delta \in (0, 1); then with probability at least 1 - \delta over samples S drawn i.i.d. from D, we have:

\forall f \in \mathcal{F}, \,\, \left|L_D(f) -  L_S(f)\right| < \sqrt{\frac{1}{2n} \log\left(\frac{2|\mathcal{F}|}{\delta}\right)}, and

L_D(\hat{f}) < \inf_{f\in \mathcal{F}} L_D(f) + \sqrt{\frac{2}{n} \log\left(\frac{2|\mathcal{F}|}{\delta}\right)}.

Proof: Fix any model f \in \mathcal{F} in the model family. Let Z_i = \ell(y_i, f(x_i)). Then observe that Z_i \in [0, 1] and \{Z_i\}_{i=1}^n are i.i.d. random variables; the randomness is with respect to the randomly drawn sample S. Further, L_S(f) = \frac{1}{n}\sum_{i=1}^n Z_i and

\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n Z_i\right] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}\left[Z_i\right] = \frac{1}{n} \sum_{i=1}^n L_D(f) = L_D(f),

where the first equality follows from linearity of expectation and the second equality follows from the i.i.d. assumption. Then using Hoeffding’s inequality:

P\left( \left| L_D(f) - L_S(f) \right| \ge t\right) \le 2\exp\left(-2nt^2\right),

Setting \delta = 2\exp\left(-2nt^2\right), we get \left|L_D(f) - L_S(f) \right| < \sqrt{\frac{1}{2n}\log\left(\frac{2}{\delta}\right)} with probability at least 1 - \delta. Let Q_f be the event that the above inequality holds; then P(Q_f) \ge 1 - \delta, or in other words, P(\overline{Q_f}) < \delta (\overline{X} denotes the complement of a measurable set X). Even though we derived this result for a fixed f, we never used any property of f, and therefore it holds for every f \in \mathcal{F}. Therefore, we can bound the probability of \cap_{f \in \mathcal{F}} Q_f using the union bound:

P\left(\bigcap_{f \in \mathcal{F}} Q_f\right) = 1 - P\left(\bigcup_{f \in \mathcal{F}} \overline{Q_f}\right) \ge 1 - \sum_{f \in \mathcal{F}} P(\overline{Q_f}) \ge 1 - \sum_{f \in \mathcal{F}} \delta = 1 - \delta |\mathcal{F}|.

This means with probability of at least 1 - \delta |\mathcal{F}| we have:

\forall f \in \mathcal{F}, \,\,\,\,\,\, \left| L_D(f) - L_S(f)  \right| < \sqrt{\frac{1}{2n} \log\left(\frac{2}{\delta}\right)}.

Replacing \delta by \frac{\delta}{|\mathcal{F}|} proves the first result. Now let \hat{f} be the ERM solution and fix any f \in \mathcal{F}. Then:

L_D(\hat{f}) < L_S(\hat{f}) +  \sqrt{\frac{1}{2n} \log\left(\frac{2|\mathcal{F}|}{\delta}\right)}     (from Hoeffding’s inequality)
L_S(\hat{f}) \le L_S(f)                                 (from the definition of ERM)
L_S(f) < L_D(f) + \sqrt{\frac{1}{2n} \log\left(\frac{2|\mathcal{F}|}{\delta}\right)}     (from Hoeffding’s inequality)

This implies, for any f \in \mathcal{F}, with probability at least 1 - \delta: L_D(\hat{f}) < L_D(f) + \sqrt{\frac{2}{n} \log\left(\frac{2|\mathcal{F}|}{\delta}\right)}. As this holds for any f \in \mathcal{F}, we can pick the f with the smallest value of L_D(f). This proves the second result.
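The Hoeffding step can be checked with a quick Monte Carlo simulation. In the sketch below (our own toy setup), the losses Z_i are Bernoulli with mean \mu = 0.3, and the empirical tail probability should sit below the Hoeffding bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, mu, trials = 200, 0.1, 0.3, 20000
# empirical means of n Bernoulli(mu) losses, repeated over many trials
means = rng.binomial(n, mu, size=trials) / n
tail = np.mean(np.abs(means - mu) >= t)    # estimate of P(|L_S - L_D| >= t)
bound = 2 * np.exp(-2 * n * t**2)          # Hoeffding's bound
assert tail <= bound
```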

There are a couple of things that stand out in this result:

  • More training data helps: Formally, the bound improves as O\left(\sqrt{\frac{1}{n}}\right) in the sample size n. Informally, this means that with more training data we will learn a model whose generalization error has a better upper bound.
  • A complex model class can be undesirable: The bound becomes weaker as the model family \mathcal{F} becomes larger. This is an important lesson in model regularization: if we have two model classes \mathcal{F}_1, \mathcal{F}_2 that have the same training error on a sample S, and \mathcal{F}_2 is a more complex class than \mathcal{F}_1, i.e., |\mathcal{F}_1| < |\mathcal{F}_2|, then the bound on the generalization error for the latter will be weaker. Note that the result does not say that a larger model class is always bad. We can get better bounds with a more complex model class provided we can appropriately improve the training error.
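Both observations can be read off directly from the deviation term of the bound; a minimal sketch (the helper name `occam_deviation` is our own):

```python
import numpy as np

def occam_deviation(n, class_size, delta=0.05):
    # deviation term sqrt(log(2|F|/delta) / (2n)) from the theorem
    return np.sqrt(np.log(2 * class_size / delta) / (2 * n))

# more data helps: quadrupling n exactly halves the deviation term
assert np.isclose(occam_deviation(4000, 100), occam_deviation(1000, 100) / 2)
# larger classes hurt only logarithmically: 100x more models, modest increase
assert occam_deviation(1000, 10000) < 1.5 * occam_deviation(1000, 100)
```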

Improving Occam’s Razor Bound with Bernstein’s Inequality

Bernstein’s inequality allows us to incorporate the variance of the random variables in the bound, thereby giving a tighter bound in low-variance settings.

Theorem (Bernstein’s Inequality): Let \left\{X_i\right\}_{i=1}^n be n i.i.d random variables with mean \mu and variance \sigma^2. Further, let \left|X_i - \mu \right| \le M almost surely, for all i. Then:

P\left(\left| \frac{1}{n}\sum_{i=1}^n X_i - \mu \right| > t \right) \le 2\exp\left\{- \frac{3nt^2}{6\sigma^2 + 2Mt} \right\}

If the variance were negligible, \sigma^2 \approx 0, we would get P\left(\left| \frac{1}{n}\sum_{i=1}^n X_i - \mu \right| > t \right) \le 2\exp\left\{-\frac{3nt}{2M}\right\}. Repeating what we did for the Occam’s Razor bound, this would give an O\left( \frac{1}{n}\right) bound.

Before we proceed to derive an O\left( \frac{1}{n}\right) bound in the realizable setting, we first state Bernstein’s inequality in a more familiar form. Setting the right-hand side to \delta and solving the quadratic equation gives us:

With probability at least 1-\delta we have:

\left| \frac{1}{n}\sum_{i=1}^n X_i - \mu \right| \le \frac{M}{3n} \log\left( \frac{2}{\delta} \right) + \sqrt{\frac{M^2}{9n^2} \log^2\left(\frac{2}{\delta}\right) + \frac{2\sigma^2}{n}\log\left(\frac{2}{\delta}\right)}

We now apply the inequality \forall a, b \in [0, \infty), \sqrt{a + b} \le \sqrt{a} + \sqrt{b} to get:

\left| \frac{1}{n}\sum_{i=1}^n X_i - \mu \right| \le \frac{2M}{3n} \log\left( \frac{2}{\delta} \right) + \sqrt{\frac{2\sigma^2}{n}\log\left(\frac{2}{\delta}\right)}.
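To see the benefit, we can compare the two deviation terms numerically. A sketch with M = 1 and an assumed small variance (the helper names and all numbers are our own illustrative choices):

```python
import numpy as np

def hoeffding_dev(n, delta):
    # deviation term from Hoeffding's inequality
    return np.sqrt(np.log(2 / delta) / (2 * n))

def bernstein_dev(n, delta, sigma2, M=1.0):
    # deviation term from the familiar form of Bernstein's inequality
    L = np.log(2 / delta)
    return 2 * M * L / (3 * n) + np.sqrt(2 * sigma2 * L / n)

n, delta = 1000, 0.05
# with tiny variance, Bernstein's deviation is far smaller than Hoeffding's
assert bernstein_dev(n, delta, sigma2=1e-4) < hoeffding_dev(n, delta) / 3
# with worst-case variance (sigma^2 = 1/4 for [0,1] variables) the advantage vanishes
assert bernstein_dev(n, delta, sigma2=0.25) > hoeffding_dev(n, delta)
```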

Theorem: Let (\mathcal{X}, \mathcal{Y}, D, \ell) be a supervised learning problem with input space \mathcal{X}, output space \mathcal{Y}, data distribution D \in \Delta(\mathcal{X} \times \mathcal{Y}), and loss function \ell \in [0, 1]. Suppose there exists f^\star \in \mathcal{F} such that L_D(f^\star) = 0 (the realizable setting). We are given n i.i.d. samples \{(x_i, y_i)\}_{i=1}^n from the data distribution D. Let \mathcal{F} be a finite collection of models from \mathcal{X} to \mathcal{Y}. Fix any \delta \in (0, 1); then with probability at least 1 - \delta we have:

L_D(\hat{f}) < \frac{20}{3n} \log\left( \frac{2|\mathcal{F}|}{\delta} \right).

Proof: Fix an f \in \mathcal{F} and define \{Z_1, \cdots, Z_n\} as before. This time we will apply Bernstein’s inequality. We have |Z_i - \mu| \le 1 since the loss function is bounded in [0,1], so we can take M = 1. Bernstein’s inequality gives us:

\left|L_D(f) - L_S(f) \right| \le \frac{2}{3n} \log\left( \frac{2}{\delta} \right) + \sqrt{\frac{2\sigma^2}{n}\log\left(\frac{2}{\delta}\right)},

with probability at least 1 - \delta. All we need to do is bound the variance \sigma^2 = \mathbb{E}_{(x, y) \sim D}[ (\ell(y, f(x)) - L_D(f) )^2] = \mathbb{E}_{(x, y) \sim D}[\ell(y, f(x))^2] - L_D(f)^2. As the loss function is bounded in [0, 1], we have \mathbb{E}_{(x, y) \sim D}[\ell(y, f(x))^2] \le \mathbb{E}_{(x, y) \sim D}[1 \cdot \ell(y, f(x))] = L_D(f). This gives us \sigma^2 \le L_D(f). Plugging this in we get:

\left|L_D(f) - L_S(f) \right| \le \frac{2}{3n} \log\left( \frac{2}{\delta} \right) + \sqrt{\frac{2L_D(f)}{n}\log\left(\frac{2}{\delta}\right)}.

Applying the AM-GM inequality, we have \sqrt{\frac{2L_D(f)}{n}\log\left(\frac{2}{\delta}\right)} \le \frac{L_D(f)}{2} + \frac{1}{n}\log\left(\frac{2}{\delta}\right), giving us

\left|L_D(f) - L_S(f) \right| \le \frac{L_D(f)}{2} + \frac{5}{3n} \log\left( \frac{2}{\delta} \right),

with probability 1-\delta. Applying the union bound over all hypotheses we get:

\forall f \in \mathcal{F}, \,\,\,\,  \left|L_D(f) - L_S(f) \right| \le \frac{L_D(f)}{2} + \frac{5}{3n} \log\left( \frac{2|\mathcal{F}|}{\delta} \right),

with probability at least 1-\delta. Let \hat{f} be the ERM solution; then:

L_D(\hat{f}) < L_S(\hat{f}) + \frac{L_D(\hat{f})}{2} + \frac{5}{3n} \log\left( \frac{2|\mathcal{F}|}{\delta} \right)    (Bernstein’s inequality)
L_S(\hat{f}) \le L_S(f^\star)                                        (ERM solution)
L_S(f^\star) < \frac{3}{2}L_D(f^\star) + \frac{5}{3n} \log\left( \frac{2|\mathcal{F}|}{\delta} \right)            (Bernstein’s inequality)
L_D(f^\star) = 0                                              (realizability condition)

Combining these we get L_D(\hat{f}) < \frac{20}{3n} \log\left( \frac{2|\mathcal{F}|}{\delta} \right) with probability at least 1-\delta. Hence, proved.

So what is happening? We are making a realizability assumption, i.e., there is a function f^\star \in \mathcal{F} such that L_D(f^\star) = 0. As the loss is non-negative and its mean under f^\star is zero, the loss of f^\star is zero almost surely, and hence its variance is zero. Now, as our ERM solution gets better with more data, it converges towards this f^\star. Therefore, the variance of the ERM solution’s loss (which is bounded by its mean) becomes smaller. This is what improves the rate of convergence.
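The faster realizable rate shows up in a toy simulation: a finite class of threshold classifiers on [0, 1], with labels generated by a threshold at 0.5 that lies in the class, so L_D(f^\star) = 0. All setup choices here (the class, the distribution, the sample sizes) are our own, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
thresholds = np.linspace(0, 1, 101)   # finite class F; contains f* (threshold 0.5)

def gen_err(th):
    # generalization error of classifier 1{x > th} vs truth 1{x > 0.5}, x ~ U[0,1]:
    # they disagree exactly on the interval between th and 0.5
    return abs(th - 0.5)

def erm(n):
    # empirical risk minimizer over the finite class on n samples
    x = rng.uniform(size=n)
    y = (x > 0.5).astype(int)          # realizable labels: L_D(f*) = 0
    errs = [np.mean((x > th).astype(int) != y) for th in thresholds]
    return thresholds[int(np.argmin(errs))]

e_small = np.mean([gen_err(erm(50)) for _ in range(200)])
e_large = np.mean([gen_err(erm(800)) for _ in range(200)])
# 16x more data shrinks the error by much more than the 4x factor
# that an O(sqrt(1/n)) rate would predict
assert e_large < e_small / 4
```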

Summary: We saw how Hoeffding’s inequality gives us an O(\sqrt{\frac{1}{n}}) bound in the number of samples n. In the realizable case, we can get O(\frac{1}{n}) using Bernstein’s inequality.

Growing Bifurcation of AI Scholarship

The field of artificial intelligence (which includes machine learning, computer vision, natural language understanding, etc.) is going through an unprecedented phase of interest and enthusiasm. This is evident in the number of articles written in popular media, the interest of non-experts in the field, the increase in the audience at AI conferences, and the rise in the number of articles submitted to the top conferences. The latter has posed a challenge in ensuring high-quality reviewing at these conferences.

AI, like other scientific disciplines, relies on a double-blind peer-review process to ensure that quality papers are added to the literature, although alternatives such as arXiv exist, where anyone can upload their paper. The importance of peer review is evident from observing that every year several papers appear on the maths arXiv claiming to have solved the Riemann Hypothesis (a famous unsolved problem). Almost all of these papers have glaring mistakes. I picked a maths example because an error in a proof is more glaring; however, arXiv CS papers can likewise contain errors in proofs or experimental protocol, or an ethical concern, that might be caught during peer review.

However, the multiplicative increase in submissions at AI conferences has put this peer-review process under stress. Various methods have been proposed to ensure that only quality submissions are submitted, including: preventing authors from posting on arXiv a month before the conference deadline (to avoid flag-planting papers), adopting an open reviewing policy (to discourage resubmission without revision), and some interesting proposals that place a limit on the number of papers an author can write every year.

Anecdotally, it appears to me that a large part of the increase in submissions is driven by papers that demonstrate a novel application of an existing technology, or an important but not novel modification of an existing technology. These are perfectly valid and important contributions; however, it might be prudent to publish such papers separately. For example, we can create an applied research/demonstration conference where such papers are published. This conference can be held more frequently to accommodate the high-frequency load and can have a more tailored reviewing process (such as a focus on code review and application). The existing research conferences can then focus on more fundamental research questions. This has several advantages:

  • These papers will not face heat from NeurIPS/ICML reviewers who demand technical novelty. A paper can have an actual application without being technically novel, but the current conference system does not do justice in distinguishing between these two types of contributions.
  • It encourages engineering claims, such as hyperparameter tuning, to be published. The current conference model inevitably pressures authors to come up with a technical story to sell their paper to reviewers. Oftentimes, these papers are forced to pick or add a story even when the biggest empirical gain is coming from hyperparameter tuning. Alternatively, such results can be published as short applied papers that claim hyperparameter tuning as their main contribution. For example, say someone downloads the state-of-the-art RL algorithm and demonstrates a better result by adding another layer to the model. Such a result should be communicated, but it would be better suited as a short applied AI conference paper instead of being packaged as an 8-page conference paper.

If a paper claims to have found a better choice of hyperparameters for an important problem, then this result should be communicated. However, it should be a two-paragraph article in an applied AI magazine instead of being packaged as an 8-page conference paper.

  • The load on existing conferences will be reduced. These mainstream conferences can focus on the bigger picture, such as understanding why existing neural network methods work, creating better optimization algorithms, or in general creating algorithms that are fundamentally novel and not minor modifications of existing ones.
  • The audience of an applied AI conference might be better suited for these papers than the general NeurIPS/ICML/ICLR audience. For example, an engineer working in the field of machine translation might readily benefit from knowledge of improved hyperparameter tuning, or from an observation that curriculum learning is helpful. They would not have to extract this detail from the appendix of an 8-page paper whose main story does not feature it.

A significant bulk of AI submissions are applied research papers of the type described above. The current conference cycle does not do justice in appreciating these results and demands technical novelty, which results in a weird incentive system that pushes authors to come up with a story to package their results. Creating applied conferences/workshops/magazines where such results are appreciated would both decrease the load on existing conferences and provide the right audience and appreciation for these applied AI papers.

Mathursday: Dynkin’s π-λ Theorem and CDF (Part 2)

The aim of this post is to prove that a cumulative distribution function (CDF) uniquely determines the probability distribution. This is a well-known fundamental result that is quite intuitive but requires some advanced techniques to prove. In this post, we provide a complete proof using Dynkin’s π-λ Theorem. The previous post on this topic covered basics of topology and measure theory. If you are unfamiliar with concepts like measure, open set, or Borel algebra then please see the previous post.

The main result we want to prove is the following.

Theorem: Let f: (X_1, \Sigma_1, P_1) \rightarrow (\mathbb{R}, \mathbb{B}) and g: (X_2, \Sigma_2, P_2) \rightarrow (\mathbb{R}, \mathbb{B}) be two random variables with CDFs F, G respectively. If F(x) = G(x), \,\, \forall x \in \mathbb{R} then P_1(f^{-1}(A))=P_2(g^{-1}(A)), \,\, \forall A \in \mathbb{B}(\mathbb{R}).

Notice that we want to prove P_1(f^{-1}(A))=P_2(g^{-1}(A)) for every A \in \mathbb{B}(\mathbb{R}). Proving this for every member of a σ-algebra is difficult. Instead, we will prove it for a subset of the σ-algebra and then show that it holds for the entire σ-algebra. The last step will be achieved using the powerful Dynkin’s π-λ Theorem. This is going to be our first stop.

Dynkin’s π-λ Theorem

To state the theorem we will have to define two more things:

Definition (π-system): A collection of subsets S of X is called a π-system if it is nonempty and closed under intersection i.e., \forall A, B \in S \rightarrow A \cap B \in S.

It is trivial to verify that every algebra of subsets is a π-system.

Definition (λ-system): A collection of subsets S of X is called a λ-system if:
1. X \in S.
2. For any A, B \in S such that A \subseteq B we have B - A \in S.
3. For any \{A_i\}_{i=1}^\infty where A_i \in S and A_i \subseteq A_{i+1} for all i \in \mathbb{N}, we have \cup_{i=1}^\infty A_i \in S.

This definition looks quite similar to that of a σ-algebra. Like a σ-algebra, a λ-system is closed under complement. Why? Let S be a λ-system and let A \in S. Then as X \in S and A \subseteq X, we have A^c = X - A \in S. However, unlike a σ-algebra, which is closed under countable unions of arbitrary sets, condition (3) above only applies to an increasing sequence of sets (A_i \subseteq A_{i+1}). It is however straightforward to show the following:

Lemma 1: Every σ-algebra is both a π-system and a λ-system
Proof: Trivial
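Lemma 1 is easy to spot-check computationally on a small finite example. The sketch below (illustrative only; the sets are chosen arbitrarily) represents subsets of X = \{1, 2, 3, 4\} as Python frozensets and verifies both properties for the σ-algebra generated by the partition \{\{1, 2\}, \{3, 4\}\}:

```python
# Spot-check Lemma 1: a finite sigma-algebra is both a pi-system and a
# lambda-system. On a finite set, increasing sequences stabilize, so the
# lambda-system conditions reduce to containing X and closure under
# proper set differences.
X = frozenset({1, 2, 3, 4})
S = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), X}

def is_pi_system(S):
    # Nonempty and closed under pairwise intersection.
    return len(S) > 0 and all(A & B in S for A in S for B in S)

def is_lambda_system(S, X):
    # Contains X and is closed under differences B - A whenever A <= B.
    return X in S and all(B - A in S for A in S for B in S if A <= B)

pi_ok, lam_ok = is_pi_system(S), is_lambda_system(S, X)
```

Removing \{3, 4\} from S breaks closure under differences (X - \{1, 2\} would be missing), and is_lambda_system then returns False.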

With these two definitions we are ready to state Dynkin’s π-λ Theorem.

Dynkin’s π-λ Theorem: If A is a π-system, B is a λ-system, and A \subseteq B, then A \subseteq \sigma(A) \subseteq B.

The proof of this theorem is short but tricky. We defer the proof for now to first demonstrate the usefulness of the above result.

Usefulness: The way we use Dynkin’s π-λ Theorem is by showing that the property of interest holds for a π-system that generates the σ-algebra of interest. Then we show that the collection of sets where the property holds is a λ-system. Dynkin’s π-λ Theorem then shows that the property holds for the entire σ-algebra.

CDF Uniquely Determines the Distribution

Theorem: Let f: (X_1, \Sigma_1, P_1) \rightarrow (\mathbb{R}, \mathbb{B}) and g: (X_2, \Sigma_2, P_2) \rightarrow (\mathbb{R}, \mathbb{B}) be two random variables with CDFs F, G respectively. If F(x) = G(x), \,\, \forall x \in \mathbb{R} then P_1(f^{-1}(A))=P_2(g^{-1}(A)), \,\, \forall A \in \mathbb{B}(\mathbb{R}).

Proof: Let S be the collection of sets where the two probability measures agree, i.e., S = \left\{ A \in \mathbb{B}(\mathbb{R}) \mid P_1(f^{-1}(A)) = P_2(g^{-1}(A))\right\}. We will construct a π-system where the property holds.

Claim 1: Let T = \left\{ \left(-\infty, x\right] \mid x \in (-\infty, \infty] \right\} be the collection of closed rays on the real number line. Then T is a π-system, T \subseteq S and \sigma(T) = \mathbb{B}(\mathbb{R}).
Proof of Claim 1: The proof directly exploits the relation between closed rays and CDF. For any value of x \in (-\infty, \infty] we have

P_1(f^{-1}((-\infty, x])) = F(x) = G(x) = P_2(g^{-1}((-\infty, x])).

Hence, (-\infty, x] \in S, i.e., all closed rays are in S, which gives T \subseteq S. Secondly, T is closed under intersection as the intersection of closed rays is another closed ray (\,\, (-\infty, x] \cap (-\infty, y] = (-\infty, \min(x,y)] \,\,). As T is nonempty, it is therefore a π-system. As we proved in the last post, the collection of closed rays generates the Borel σ-algebra, i.e., \sigma(T) = \mathbb{B}(\mathbb{R}).

Great! So now we have a π-system where the desired result holds and that generates \mathbb{B}(\mathbb{R}), which in this case is the σ-algebra of interest.

Claim 2: S is a λ-system
Proof of Claim 2: We prove each required property of a λ-system below.

  • To simplify notation, abbreviate P_1(A) := P_1(f^{-1}(A)) and P_2(A) := P_2(g^{-1}(A)) for the induced measures on \mathbb{R}. From the unitarity property of a probability measure we have P_1(\mathbb{R}) = 1 = P_2(\mathbb{R}). Therefore, \mathbb{R} \in S.
  • Let A, B \in S such that A \subseteq B. Let C = B-A; then C and A are disjoint. Therefore, for any probability measure P we have P(C \cup A) = P(A) + P(C) \Rightarrow P(C) = P(B) - P(A). Then
    P_1(C) = P_1(B) - P_1(A) = P_2(B) - P_2(A) = P_2(C). Therefore, C=B-A \in S.
  • Let \{A_i\}_{i=1}^\infty be an increasing sequence of sets with A_i \in S for all i \in \mathbb{N}. For any probability measure P we have P(\cup_{i=1}^\infty A_i) = \lim_{n \rightarrow \infty} P(A_n) (continuity from below). We have P_1(A_n) = P_2(A_n) for every value of n, and two sequences with the same members have the same limit. Hence, \lim_{n \rightarrow \infty} P_1(A_n) = \lim_{n \rightarrow \infty} P_2(A_n) \Rightarrow P_1(\cup_{i=1}^\infty A_i) = P_2(\cup_{i=1}^\infty A_i). Therefore, \cup_{i=1}^\infty A_i \in S.
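The limit step P(\cup_{i=1}^\infty A_i) = \lim_{n \rightarrow \infty} P(A_n) used above can be seen numerically on a toy example (the geometric-style measure on the natural numbers is an assumption made purely for illustration):

```python
# Continuity from below: P({k}) = 2**(-k) for k = 1, 2, ... defines a
# probability measure on the natural numbers. For the increasing sets
# A_n = {1, ..., n} we get P(A_n) = 1 - 2**(-n), which converges to
# P(union of all A_n) = P(N) = 1.
def P(A):
    return sum(2.0 ** (-k) for k in A)

probs = [P(range(1, n + 1)) for n in (1, 5, 10, 30)]
# probs increases monotonically towards 1.
```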

Wrapping up: We showed that T \subseteq S where T is a π-system and S is a λ-system. Therefore, using Dynkin’s π-λ theorem we get \sigma(T) = \mathbb{B}(\mathbb{R}) \subseteq S. What does this mean? It means that for any A \in \mathbb{B}(\mathbb{R}) we have A \in S and therefore, by definition of S, P_1(f^{-1}(A)) = P_2(g^{-1}(A)). This is exactly what we wanted to prove.

Proof of Dynkin’s π-λ Theorem

In order to prove this result we will make use of a couple of basic lemmas:

Lemma 2: An intersection of λ-systems is a λ-system.
Proof: The proof is trivial; we verify the second property as an example. Let \{S_i\}_1^\infty be a set of λ-systems (so this is a set of sets of sets #inception). Now let A, B \in \cap_{i=1}^\infty S_i with A \subseteq B. As each S_i is a λ-system, B-A \in S_i for every i \in \mathbb{N}. This implies B-A \in \cap_{i=1}^\infty S_i. The other two properties follow similarly.

Lemma 3: An intersection of σ-algebras is a σ-algebra.
Proof: Trivial to verify.

Lemma 4: If a collection of subsets of X is both a π-system and a λ-system then it is a σ-algebra.
Proof: Let S be both a π-system and a λ-system. We will show that it satisfies all three properties of a σ-algebra.

  • As S is a λ-system, X \in S. Similarly, using the second property of a λ-system we have X \subseteq X \Rightarrow \emptyset = X - X \in S.
  • Let E \in S. Then E \subseteq X \Rightarrow E^c = X - E \in S.
  • Let \{E_i\}_{i=1}^\infty be a countable collection of sets in S. Define a sequence of increasing sets as follows: Y_1 = E_1 and Y_i = E_i \cup Y_{i-1}. Then Y_1 \subseteq Y_2 \subseteq Y_3 \cdots. If we can show that each Y_i \in S, then property (3) of a λ-system gives \cup_{i=1}^\infty Y_i = \cup_{i=1}^\infty E_i \in S. Note that we have not used the π-system property so far; if Y_i \in S came for free, every λ-system would be a σ-algebra. However, as we will see, we need the π-system property to show Y_i \in S. We make an inductive argument: Y_1 = E_1 \in S. Assume Y_k \in S for all k \le i. Now Y_{i+1} = E_{i+1} \cup Y_i = \left( E_{i+1}^c \cap Y_i^c \right)^c. As S is closed under complement (point 2 above) and intersection (S is a π-system), Y_{i+1} \in S. Hence, proved.
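The replacement of an arbitrary sequence E_i by an increasing sequence Y_i with the same union is mechanical enough to sketch in Python (a toy illustration with finite sets; the particular E_i are arbitrary):

```python
# Turn an arbitrary sequence of sets into an increasing sequence with the
# same union, mirroring Y_1 = E_1 and Y_i = E_i | Y_{i-1} from the proof.
def increasing_hull(Es):
    Ys, acc = [], set()
    for E in Es:
        acc = acc | set(E)          # Y_i = E_i union Y_{i-1}
        Ys.append(frozenset(acc))
    return Ys

Es = [{3}, {1, 2}, {2, 5}]
Ys = increasing_hull(Es)
# Ys is increasing, and its union equals the union of the E_i.
```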

Dynkin’s π-λ Theorem: If A is a π-system, B is a λ-system, and A \subseteq B, then A \subseteq \sigma(A) \subseteq B.

Proof: Let \ell(A) be the smallest λ-system containing A. As an intersection of λ-systems is another λ-system, we can define \ell(A) as the intersection of every λ-system containing A. As the power set of X is a λ-system, \ell(A) is well-defined. Now B is another λ-system containing A, and therefore \ell(A) \subseteq B: if this were not the case then the intersection \ell(A) \cap B would be a smaller λ-system containing A, which is a contradiction. This argument will be repeated again and again so make sure you understand it!

Now every σ-algebra is a λ-system and as \sigma(A) contains A therefore, \ell(A) \subseteq \sigma(A). Here we have used the above argument to show containment. At this point we have shown \ell(A) \subseteq B and \ell(A) \subseteq \sigma(A). If we can show that \ell(A) = \sigma(A) then we will be done. This is exactly what we will show.

Now this is where the proof gets tricky. We first define a rather unintuitive set:

For any E \subseteq X we define D_E = \{ F \mid F \subseteq X, E \cap F \in \ell(A) \}. We will now make a couple of claims:

Claim 1: D_E is a λ-system  for any E \in \ell(A).
Proof of Claim 1:
We prove each property of a λ-system below:

  • X \cap E = E \in \ell(A) therefore, X \in D_E
  • Let F_1, F_2 \in D_E and F_1 \subseteq F_2. This implies F_1 \cap E \in \ell(A) and F_2 \cap E \in \ell(A). Further, F_1 \subseteq F_2 \Rightarrow F_1 \cap  E \subseteq F_2 \cap E.
    As \ell(A) is a λ-system therefore, F_2 \cap E - F_1 \cap E = (F_2 - F_1) \cap E \in \ell(A). From the definition of D_E this implies F_2 - F_1 \in D_E.
  • Let \{F_i\}_{i=1}^\infty be an increasing sequence of sets in D_E, i.e., F_i \subseteq F_{i+1} and F_i \cap E \in \ell(A). Then \{F_i \cap E\}_{i=1}^\infty is also an increasing sequence in \ell(A). As \ell(A) is a λ-system, \cup_{i=1}^\infty (F_i \cap E) = \left( \cup_{i=1}^\infty F_i \right) \cap E \in \ell(A). This implies \cup_{i=1}^\infty F_i \in D_E.

So far we have not used the fact that A is a π-system. The next claim will exploit this property:

Claim 2: A \subseteq D_E for every E \in A.
Proof of Claim 2: As A is a π-system therefore, F \in A \Rightarrow F \cap E \in A. As A \subseteq \ell(A) therefore, F \cap E \in \ell(A). This implies F \in D_E.

Claim 3: Define A \cap \ell(A) = \left\{ E \cap F \mid E \in A, F \in \ell(A)\right\}. Then A \cap \ell(A) \subseteq \ell(A).
Proof of Claim 3:  For E \in A \subseteq \ell(A) we have D_E as a λ-system (Claim 1) containing A (Claim 2). This implies \ell(A) \subseteq D_E for every E \in A. Therefore, for any F \in \ell(A) we have F \in D_E which implies F \cap E \in \ell(A).

Using Claim 3 we will now strengthen Claim 2.

Claim 4: A \subseteq D_E for every E \in \ell(A).
Proof of Claim 4:  Let E \in \ell(A) and F \in A then F \cap E \in \ell(A) (Claim 3). This implies F \in D_E.

Claim 5: \ell(A) is closed under intersection.
Proof of Claim 5: For E \in \ell(A) we have D_E as a λ-system (Claim 1) containing A (Claim 4). This implies \ell(A) \subseteq D_E for every E \in \ell(A). Now let F \in \ell(A). Then F \in D_E and therefore, by definition of D_E we have F \cap E \in \ell(A). Proof is completed by observing that E, F are arbitrary members of \ell(A).

At this point, take a moment to observe the interesting pattern in Claim 2-5.

Wrapping Up: Claim 5 implies \ell(A) is both a π-system and a λ-system. From Lemma 4, this implies it is a σ-algebra. Further, A \subseteq \ell(A), hence we must have \sigma(A) \subseteq \ell(A): otherwise \ell(A) \cap \sigma(A) would be a σ-algebra (Lemma 3) that contains A and is smaller than \sigma(A), a contradiction. However, we also showed earlier that \ell(A) \subseteq \sigma(A). This implies \sigma(A) = \ell(A).

At the beginning of the proof, we showed A \subseteq \ell(A) and \ell(A) \subseteq B. Therefore, we have A \subseteq \ell(A) = \sigma(A) \subseteq B. Hence, proved.

Further Reading

If you are interested in learning more about probability theory then John Pike’s lecture notes are a useful resource.

Mathursday: Dynkin’s π-λ Theorem and CDF (Part 1)

I am starting a new series of posts called “Mathursday” (portmanteau: Maths+Thursday) containing short articles on important mathematical results. Today’s article is about Dynkin’s π-λ Theorem and using it to prove that a cumulative distribution function (CDF) uniquely determines the probability distribution. This is one of those results that are quite intuitive and well known but require a bit of analysis to formally prove or even to state. There are several such results in statistics that use similar proof techniques. Generally, the treatment of these proofs is completely omitted in statistics books (like Wasserman’s). The alternative is to meander your way through a mathematical course on probability theory to extract the result. In this series of posts, I hope to give a direct, self-contained mathematical treatment.

The proof uses basic results and definitions from two important areas of mathematics: topology and measure theory. This series is divided into two posts. The first post (this one) covers the basics of the above fields. The aim here is not to reproduce a complete treatment as found in mathematical books but to give a concise, concentrated treatment of the basic topics. You can skip to the next post if you already know these basics.

Set Theory

A set X is a collection of some members. Each member in the set appears one time. It can be finite (e.g., natural numbers less than 20), countable (meaning it can be enumerated e.g., set of odd natural numbers) or uncountable (e.g., real numbers). A subset A of X (denoted as A \subseteq X) is a set containing only members of X. Power set of a set X (denoted as 2^X) is the set of all subsets of X. A set containing no members is called an empty set (denoted by \emptyset). The cardinality of a set S (denoted by |S|) is the number of members in the set.

For two sets A and B:  their union (denoted A \cup B) is a set containing all members of A and B; their intersection (denoted A \cap B) is a set containing members common to A and B; the set subtraction A-B is a set obtained by removing all members of B present in A while retaining other members. If A \subseteq X is a subset of X then the complement set of A is defined as A^c = X - A. Observe that for any two sets A, B we have A-B = A \cap B^c. There is a useful identity called De Morgan’s law which states, for any two sets A, B: (A\cap B)^c = A^c \cup B^c and (A\cup B)^c = A^c \cap B^c.
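These identities are easy to spot-check on finite sets; a minimal Python sketch (the particular sets are arbitrary, chosen only for illustration):

```python
# Spot-check De Morgan's laws (A n B)^c = A^c u B^c and (A u B)^c = A^c n B^c
# on finite subsets of a universe U, with complement taken relative to U.
U = set(range(10))
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

lhs1, rhs1 = U - (A & B), (U - A) | (U - B)   # first De Morgan law
lhs2, rhs2 = U - (A | B), (U - A) & (U - B)   # second De Morgan law
lhs3, rhs3 = A - B, A & (U - B)               # A - B = A n B^c
```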

While the above may look all easy and simple, developing an axiom based version of set theory has been quite challenging. Naive set theory suffered from Russell’s paradox while modern attempts like ZFC had their own controversy. For our current purpose, we won’t go into these details.

General Topology

Definition (Topology): (X, \tau) is called a topological space if X is a set and \tau \subseteq 2^X is a collection of subsets of X, called open sets, that satisfies the following properties:
1) \emptyset, X \in \tau
2) \tau is closed under arbitrary union, i.e., if A_i \in \tau\,\, \forall i \in I for an arbitrary index set I then \cup_{i \in I} A_i \in \tau.
3) \tau is closed under finite intersection, i.e., if A_i \in \tau\,\, \forall i \in \{1, \cdots, k\} for k < \infty then \cap_{i=1}^k A_i \in \tau.
If \tau satisfies these properties then it is called a topology on X.

There is a reason why we call members of the topology open sets, and this will become clear when we consider the topology induced by metric spaces.

Example 1: For any set X, the power set 2^X and \{\emptyset, X\} are topologies on X. Let X = \{1, 2, 3\} then \tau = \{ \emptyset, \{1\}, \{2, 3\}, \{1, 2, 3\}\} is a topology on X.
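For a finite X, the axioms can be checked exhaustively, since arbitrary unions and intersections reduce to pairwise ones. A small checker for Example 1 (an illustrative sketch, not part of the theory):

```python
from itertools import combinations

# Exhaustively check the topology axioms on a finite set: contains the empty
# set and X, and is closed under pairwise union and pairwise intersection
# (which implies closure under all finite, hence here all, combinations).
def is_topology(X, tau):
    if frozenset() not in tau or frozenset(X) not in tau:
        return False
    return all(A | B in tau and A & B in tau for A, B in combinations(tau, 2))

X = frozenset({1, 2, 3})
tau = {frozenset(), frozenset({1}), frozenset({2, 3}), X}
bad = {frozenset(), frozenset({1}), frozenset({2}), X}  # missing {1, 2}
```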

Definition (Metric Spaces): (X, d) is a metric space if X is a set and d: X \times X \rightarrow \mathbb{R} is a function satisfying the following properties for all x, y, z \in X:
1) d(x, y) = 0 \Longleftrightarrow x=y (identity of indiscernibles)
2) d(x, y) = d(y, x) (symmetry)
3) d(x, y) \le d(x, z) + d(z, y) (triangle inequality)
If d satisfies the above properties then it is called a metric on X.

Observe that setting y=x in the triangle inequality gives us d(x, x) \le d(x, z)+ d(z, x). Using symmetry we get d(x, x) \le 2d(x, z). As d(x, x)=0 we get d(x, z) \ge 0. Hence, d is a non-negative valued function.

Example 2: A metric is supposed to abstract commonly used distance functions. For example, the Minkowski distance d(x, y) = \| x - y\|_p = \left( \sum_{i=1}^n |x_i-y_i|^p \right)^{\frac{1}{p}} for p \in [1, \infty] is a metric (for p < 1 the triangle inequality fails). It is straightforward to verify that \| x - y\|_p satisfies the first two properties. Proving the triangle inequality is more challenging and uses Minkowski’s inequality.
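A quick numerical spot-check of the Minkowski distance (the points are arbitrary; this is an illustration, not a proof):

```python
# Minkowski distance ||x - y||_p, and a spot-check of the triangle inequality
# d(x, y) <= d(x, z) + d(z, y) for a few values of p >= 1.
def minkowski(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y, z = (0.0, 0.0), (3.0, 4.0), (1.0, 1.0)
checks = [minkowski(x, y, p) <= minkowski(x, z, p) + minkowski(z, y, p)
          for p in (1, 1.5, 2, 3)]
```

For p = 2 this is the Euclidean distance, e.g. minkowski((0, 0), (3, 4), 2) gives 5.0.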

One can define various sorts of interesting geometric subsets using the metric. For example, we can define an open ball of radius \epsilon as B_\epsilon(x) = \left\{ y \in X \mid d(x, y) < \epsilon \right\}.

Metric spaces (X, d) provide a way to induce a useful topology on X. We define the topology induced by a metric space as follows:

\tau = \left\{ S \subseteq X \,\,|\,\, \forall x \in S, \exists\,\, \epsilon > 0\,\, \mbox{ such that } \,\, B_\epsilon(x) \subseteq S \right\}

It is straightforward to verify that \tau is a topology. It is interesting to note that if X = \mathbb{R} and d(x, y) = |x-y|, then open intervals (a, b) (open in the geometric sense) are members of \tau. This means our standard geometric notion of open set coincides with open set in topology. Unless mentioned otherwise, we will assume the topology defined on real number space \mathbb{R}^n to be the topology defined above using Euclidean distance as our choice of metric.

Lemma 1: Every open set in \mathbb{R} is a countable union of disjoint open intervals.
Proof: Let S be an open set in \mathbb{R}. We define an equivalence relation x \sim y if [\min(x, y), \max(x, y)] \subseteq S. It is easy to see that each equivalence class is an open interval (a, b) with a < b (possibly with a = -\infty or b = \infty). Further, these classes are disjoint, else they could be merged together. Finally, to each equivalence class we can associate a rational number in the interval. The proof is completed by observing that the set of rational numbers is countable.

Measure Theory

Measure theory is a field of mathematics concerned with assigning values to subsets. It provides a useful theoretical basis for probability theory, for defining Lebesgue integration, and for studying dynamical systems. Below we consider the main concepts in measure theory that will be useful in our proof.

Definition (Algebra of Subsets): For a set X, a nonempty collection of subsets C \subseteq 2^X is an algebra of subsets if:
1. A  \in C \rightarrow  A^c \in C (closed under complement).
2. A, B \in C \rightarrow A \cap B \in C (closed under finite intersection).

Lemma 2: If C is an algebra on X then \emptyset, X \in C and for all A, B \in C we have A \cup B \in C.
Proof: As C is nonempty, let A \in C. Then A^c \in C and \emptyset = A \cap A^c \in C. From De Morgan’s law, A \cup B = (A^c \cap B^c)^c, which belongs to C as the algebra is closed under complement and finite intersection. Lastly, X = A \cup A^c \in C.

Definition (σ-algebra): For a set X, a σ-algebra is an algebra on X that is closed under countable union, i.e., if \{A_i\}_{i\in \mathbb{N}} is a countable collection of sets in the σ-algebra then \cup_{i\in \mathbb{N}} A_i also belongs to the σ-algebra.

Observe that the second condition of an algebra of subsets only implies closure under finite unions, not countable unions. For example, consider the collection S of subsets of \mathbb{R} that are finite or have finite complement. Then S is an algebra and closed under finite union: the union of any two members is still finite or has a finite complement. But by taking a countable union of members of S we can get the set of natural numbers, which is neither finite nor has a finite complement. Therefore, the countable union condition makes a σ-algebra a strict generalization of an algebra of subsets.

Using De Morgan’s law we can show that any σ-algebra is also closed under countable intersection: let \{A_i\}_{i=1}^\infty be a countable collection of sets in the σ-algebra. Then \cap_{i=1}^\infty A_i = \left(\cup_{i=1}^\infty A_i^c\right)^c. As a σ-algebra is closed under complement and countable union, \cap_{i=1}^\infty A_i belongs to the σ-algebra. Another nice property, whose straightforward proof we leave as an exercise, is that an (arbitrary) intersection of σ-algebras is also a σ-algebra.

The intersection property of σ-algebras allows us to define a useful notion: given a collection of subsets Y \subseteq 2^X, we define \sigma(Y) as the smallest σ-algebra that contains Y. This can be formally defined as: \sigma(Y) = \cap \left\{ \Sigma \mid \Sigma \right. is a σ-algebra and \left. Y \subseteq \Sigma \right\}. This definition is well formed as the power set is always a σ-algebra containing any collection of subsets.
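When X is finite, \sigma(Y) can be computed directly by closing under complement and union until a fixed point is reached; a small sketch (finite toy universe, illustrative only):

```python
# Compute sigma(Y) on a finite universe X by iterating closure under
# complement and pairwise union until nothing new appears. On a finite
# universe this terminates, and countable unions reduce to finite ones.
def generated_sigma_algebra(X, Y):
    X = frozenset(X)
    sigma = {frozenset(), X} | {frozenset(A) for A in Y}
    while True:
        new = {X - A for A in sigma} | {A | B for A in sigma for B in sigma}
        if new <= sigma:
            return sigma
        sigma |= new

sigma = generated_sigma_algebra({1, 2, 3, 4}, [{1}])
# sigma({{1}}) = {emptyset, {1}, {2, 3, 4}, X}
```

Closure under intersection then follows automatically via De Morgan’s law.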

σ-algebra and Topology: One cannot fail to observe a remarkable similarity between the definitions of σ-algebra and topology: both contain the empty set and X, and both are closed under countable union. However, a topology is not closed under complement and is only closed under finite intersection. This difference stems from their different purposes. For example, assigning a measure to a set intuitively suggests being able to assign a measure to its complement.

In general, measure theory and topology do not always sync; one has only to look at the different topics in the two fields. However, there is an important interplay between them that allows us to talk about measures on top of open sets. The Borel σ-algebra captures this idea:

Definition (Borel Set): Given a topological space (X, C), the Borel σ-algebra \mathbb{B} is the σ-algebra generated by the topology: \mathbb{B} = \sigma(C). A member of \mathbb{B} is called a Borel set.

Lemma 3: For the standard topology on \mathbb{R}, the Borel σ-algebra \mathbb{B}(\mathbb{R}) on \mathbb{R} is given by any of the following:

1) \mathbb{B}(\mathbb{R}) = \sigma\left(\left\{ (a, b) \mid a, b \in \mathbb{R} \right\}\right)
2) \mathbb{B}(\mathbb{R}) = \sigma\left(\left\{ (-\infty, a] \mid a \in \mathbb{R} \right\}\right)
3) \mathbb{B}(\mathbb{R}) = \sigma\left(\left\{ (-\infty, a) \mid a \in \mathbb{R} \right\}\right)
4) \mathbb{B}(\mathbb{R}) = \sigma\left(\left\{ [b, \infty) \mid b \in \mathbb{R} \right\}\right)
5) \mathbb{B}(\mathbb{R}) = \sigma\left(\left\{ (b, \infty) \mid b \in \mathbb{R} \right\}\right)

Proof: We prove the first two cases. The proofs for the remaining cases are similar.

1. From Lemma 1, any open set in \mathbb{R} is a countable union of disjoint open intervals. As a σ-algebra is closed under countable union, the σ-algebra generated by the open intervals contains every open set, so \mathbb{B}(\mathbb{R})=\sigma\left(\left\{ (a, b) \mid a, b \in \mathbb{R} \right\}\right).

2. Let A = \sigma\left(\left\{ (-\infty, a] \mid a \in \mathbb{R} \right\}\right). As A is closed under complement therefore, we can also express: A = \sigma\left(\left\{ (b, \infty) \mid b \in \mathbb{R} \right\}\right). However, (b, \infty) is an open set and A is a σ-algebra generated by a collection of open sets. Therefore, A \subseteq \mathbb{B}(\mathbb{R}).

Let a < b; then (-\infty, b], (a, \infty) \in A. As a σ-algebra is closed under intersection, (a, b] = (-\infty, b] \cap (a, \infty) \in A. It can now be shown that any open interval can be generated from a countable union: (a, b) = \cup_{n=1}^\infty \left(a, b - \frac{1}{n}\right] \in A. As A contains all open intervals, it contains all members of \mathbb{B}(\mathbb{R}) (using our proof of case 1). This implies \mathbb{B}(\mathbb{R}) \subseteq A. Hence, \mathbb{B}(\mathbb{R}) = A.

Definition (Measurable Space): A measurable space is a tuple (X, \Sigma) where X is a set and \Sigma is a σ-algebra on X.

Example 3: (\mathbb{R}, \mathbb{B}) is a measurable space where the real number line is the set and set of all Borel sets is our σ-algebra.

The next notion is useful in defining the notion of random variables in probability theory.

Definition (Measurable Function): A function f: (X_1, \Sigma_1) \rightarrow (X_2, \Sigma_2) is called measurable if (X_1, \Sigma_1), (X_2, \Sigma_2) are measurable spaces, and \forall E \in \Sigma_2, f^{-1}(E) \in \Sigma_1.

In the above definition: f^{-1}(E) = \left\{ x \in X_1 \mid f(x) \in E \right\}. The job of measurable functions is to ensure that we stick to members in σ-algebras when doing calculations.

Definition (Measure): Let X be a set and \Sigma be a σ-algebra on X. A measure \mu: \Sigma \rightarrow [0, \infty] is a function satisfying the following properties:
1. \mu(\emptyset) = 0 (null empty set)
2. If \{E_i\}_{i=1}^\infty is a countable collection of pairwise disjoint members of \Sigma then \mu(\cup_{i=1}^\infty E_i) = \sum_{i=1}^\infty \mu(E_i) (σ-additivity).

The σ-additivity condition also holds for finite collections: let \{E_1, \cdots, E_k\} be a set of pairwise disjoint members of \Sigma. Define E_j = \emptyset for all j > k. Then \{E_i\}_{i=1}^\infty is a countable collection of pairwise disjoint members (yes, \{\emptyset, \emptyset\} are disjoint). Therefore, \mu(\cup_{i=1}^k E_i) = \mu(\cup_{i=1}^\infty E_i) = \sum_{i=1}^\infty \mu(E_i) = \sum_{i=1}^k \mu(E_i). The last equality used \mu(\emptyset) = 0.

A measure is called finite if \forall E \in \Sigma, \mu(E) < \infty. If for a member E \in \Sigma we have \mu(E) = 0 then we call E a null set. Intuitively, null sets are the negligible subsets in the σ-algebra.

Example 4: If X is finite and the σ-algebra is the power set then \mu(E) = |E| is a measure. This measure is called the counting measure.
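The counting measure makes the finite form of σ-additivity derived above easy to verify directly (toy sets, illustration only):

```python
# Counting measure on a finite X: mu(E) = |E|. For pairwise disjoint sets,
# additivity mu(E1 u E2 u E3) = mu(E1) + mu(E2) + mu(E3) holds exactly.
def mu(E):
    return len(E)

E1, E2, E3 = {1, 2}, {3}, {4, 5, 6}   # pairwise disjoint
total = mu(E1 | E2 | E3)
parts = mu(E1) + mu(E2) + mu(E3)
# mu(emptyset) = 0, so padding with empty sets changes nothing.
```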

Measures are intuitively supposed to generalize the common notions of length, area, and volume. In general, however, constructing a useful measure such as the Lebesgue measure is quite challenging. See Hunter’s lecture notes on measure theory for a full treatment.

Definition (Probability Measure): Let X be a set and \Sigma be a σ-algebra on X. A probability measure P is a measure with the unitarity property: P(X) = 1.

When talking about probability measures, we understand X as the set of elementary outcomes and \Sigma as a collection of events. For example, say we flip a coin two times; then an elementary outcome is [H, H], indicating two heads, and an event could be the set of outcomes with at least one head.

The need for σ-algebras: The definition of a σ-algebra looks opaque. Why can we not simply use the power set to define the domain of our measure? The answer to this question has an interesting history, which arises from the discovery of non-measurable sets at the beginning of the 20th century. Mathematicians found that if we consider all possible subsets of \mathbb{R} for defining a measure, then no function exists that satisfies the above properties and generalizes the notion of the length of an interval, i.e., \mu([a, b]) = b-a! Interesting non-measurable sets like the Vitali set were discovered. The concept of a σ-algebra was, therefore, used to narrow down the choice of subsets while maintaining useful properties. For example, if we talk about the probability of an event E then we may also want to consider the probability of it not taking place, i.e., E^c; hence our σ-algebra must contain E^c.

Definition (Measure Space and Probability Space): A measure space is a tuple (X, \Sigma, \mu) where (X, \Sigma) is a measurable space and \mu is a measure on \Sigma. A measure space is called a probability space if \mu is a probability measure i.e., \mu(X) = 1.

Definition (Random Variables): A real-valued random variable g: (X, \Sigma, P) \rightarrow (\mathbb{R}, \mathbb{B}) is a measurable function from a probability space  (X, \Sigma, P) to the measurable space (\mathbb{R}, \mathbb{B}).

When considering the measurability of the random variable we ignore the probability measure P and consider (X, \Sigma) as our input measurable space. Further, even though we use the Borel σ-algebra as our choice of σ-algebra on \mathbb{R}, we could use any other σ-algebra.

Cumulative Distribution Function: Given a random variable g: (X, \Sigma, P) \rightarrow (\mathbb{R}, \mathbb{B}), the cumulative distribution function (CDF) is the function G: \mathbb{R} \rightarrow [0, 1] defined as G(z) = P(g^{-1}((-\infty, z])).

This definition is a bit involved. Firstly, observe that the closed ray (-\infty, z] \in \mathbb{B} for any value of z \in \mathbb{R}. Therefore, as a random variable is a measurable function, we have g^{-1}((-\infty, z]) \in \Sigma. As P is a probability measure on \Sigma, P(g^{-1}((-\infty, z])) \in [0, 1].

The reason it is called a CDF is that we can think of it as the total accumulated probability measure assigned to the outcomes that give a value less than or equal to z, as we vary z from -\infty to +\infty. It is straightforward to verify that a CDF is a non-decreasing function. Further, it can be easily shown that \lim_{z \rightarrow -\infty} G(z) = 0 and \lim_{z \rightarrow \infty} G(z) = 1.
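For a finite probability space the definition G(z) = P(g^{-1}((-\infty, z])) can be evaluated literally. A sketch for the two-coin-flip example from earlier (fair coins are an assumption made for illustration):

```python
from fractions import Fraction

# Two fair coin flips: X = {HH, HT, TH, TT}, each outcome has probability 1/4.
# Random variable g = number of heads. CDF: G(z) = P(g^{-1}((-inf, z])).
outcomes = ["HH", "HT", "TH", "TT"]
P = {w: Fraction(1, 4) for w in outcomes}
g = {w: w.count("H") for w in outcomes}

def G(z):
    # Sum the probability of every outcome whose value under g is <= z,
    # i.e., the measure of the preimage of the closed ray (-inf, z].
    return sum(P[w] for w in outcomes if g[w] <= z)

values = [G(-1), G(0), G(1), G(2)]
# G is non-decreasing: 0 for z < 0 and 1 for z >= 2.
```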

We are now ready to state the main theorem that we prove in the next post.

Theorem: Let f: (X_1, \Sigma_1, P_1) \rightarrow (\mathbb{R}, \mathbb{B}) and g: (X_2, \Sigma_2, P_2) \rightarrow (\mathbb{R}, \mathbb{B}) be two random variables with CDFs F, G respectively. If F(x) = G(x), \,\, \forall x \in \mathbb{R} then P_1(f^{-1}(A))=P_2(g^{-1}(A)), \,\, \forall A \in \mathbb{B}(\mathbb{R}).

We do not require the two random variables to even have the same input space or σ-algebra. The theorem states that if two real-valued random variables have the same CDF then they have the same induced distribution over the real number line.

Are Synthetic Datasets in AI Useful?

Problems in Artificial Intelligence (AI) are generally approached by solving a dataset/setup that is a proxy of the real world. This is a fairly old practice going back to the start of AI as a field. Some important milestones along the way include the ATIS dataset for natural language understanding (1990s), the MNIST dataset for digit recognition (1990s), and ImageNet for visual object recognition (2000s). All the above datasets have something in common: they are real samples of the actual problem distribution, i.e., each sample in the dataset can actually occur in real life. An image of a flower in the ImageNet dataset can actually be found in real life. We will distinguish between datasets and setups; e.g., ImageNet is a dataset whereas Atari games are a setup. However, setups are eventually used to generate datasets of some form, so for the rest of this post we will treat setups as datasets (albeit with control over how the dataset is generated).

In contrast, another trend has become very popular for certain problems, whereby the samples in the dataset are not real-life samples but synthetic ones. Examples of real and synthetic datasets are shown in the Figure below:


Figure: Examples of real and synthetic datasets. The real datasets on the left show natural language text and real images. The synthetic datasets on the right show templated text and synthetic images.

But what does it formally mean for a dataset to be synthetic? Without a formal definition we will not be able to make complete sense of what we are critiquing.

Formal Definition of a Synthetic Dataset

Let (X, \mathcal{C}) be a topological space where each point in X is a single datapoint. Let \mu be a Borel measure defined on (X, \mathcal{C}). Let \Delta(X) be the space of distributions over X. We will assume there exists a unique stationary gold distribution D^\star \in \Delta(X) such that datapoints in real life are sampled from D^\star. This is a gross simplification but is sufficient for our current purpose. Now consider a sampling procedure used to create an artificial dataset Z \subseteq X where each datapoint in Z is sampled from a procedural distribution D \in \Delta(X). We will call the dataset Z synthetic if the procedural distribution satisfies certain constraints that we describe below.

We will consider two types of synthetic dataset distributions D: synthetic datasets of the first kind, or exclusive synthetic datasets, and synthetic datasets of the second kind, or inclusive synthetic datasets.

Synthetic Datasets of First Kind (Exclusive)

Definition: D is a synthetic dataset of the first kind, or exclusive, if the measure of the set of datapoints that can be generated by both D and D^\star is 0, i.e., \mu\left(\left\{x \mid x \in X, D^\star(x) > 0, D(x) > 0\right\}\right) = 0.

Example: The synthetic image dataset shown in Figure 1 is of the first kind since not even a single image in this dataset will occur in real life.

Synthetic Datasets of Second Kind (Inclusive)

Definition: D is a synthetic dataset of the second kind, or inclusive, if the measure of the set of datapoints that can be generated by both D and D^\star is greater than 0 but less than the measure of the set of points that can be sampled from D^\star, i.e., 0 < \mu\left(\left\{x \mid x \in X, D^\star(x) > 0, D(x) > 0 \right\}\right) < \mu\left(\left\{ x \mid x \in X, D^\star(x) > 0 \right\} \right).

Example: The synthetic text dataset shown in Figure 2 is of the second kind since, while these templated sentences can occur in real life, they represent only a limited form of real-life free-form text.

Example: Consider an object detection dataset created using the following approach. We pick 1000 object categories in the world uniformly at random and, for each category, select 10,000 randomly chosen images. Such a dataset will not be synthetic since every object has a non-zero chance of selection. For the current purpose, I will avoid the discussion of whether this can still constitute a null set with respect to some measure.
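To make the two definitions concrete, here is a minimal sketch for the finite case, where X is a finite set and \mu is the counting measure. The function name and dictionary representation are my own illustration, not part of any formal treatment:

```python
# A minimal sketch of the exclusive/inclusive distinction for a finite X,
# using the counting measure. All names here are hypothetical.

def classify(d_star, d):
    """Classify a procedural distribution d against the gold d_star.

    d_star, d: dicts mapping datapoints to probabilities (finite X).
    Returns 'exclusive' (first kind), 'inclusive' (second kind),
    or 'real' when d covers the full support of d_star.
    """
    support_star = {x for x, p in d_star.items() if p > 0}
    support_d = {x for x, p in d.items() if p > 0}
    overlap = support_star & support_d       # {x : D*(x) > 0 and D(x) > 0}
    if len(overlap) == 0:                    # mu(overlap) = 0
        return "exclusive"
    if len(overlap) < len(support_star):     # 0 < mu(overlap) < mu(supp D*)
        return "inclusive"
    return "real"
```

For instance, a templated-text distribution whose sentences all lie inside the support of the gold distribution, but which misses most of it, would come out as "inclusive" under this check.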

The above definition is quite conservative; there could be other cases where a dataset should be considered synthetic. For example, a highly biased dataset distribution could be considered synthetic. However, the above definition does not use any thresholds (e.g., some bound on the KL-divergence between D and D^\star or any other hyperparameter). This leaves room to further refine the definition by adding more constraints.

Are Synthetic Datasets Useful?

Since models trained on these datasets/setups are not directly useful in real life, a debate has arisen in the community over whether such synthetic datasets are useful at all.

Synthetic datasets can be useful if they are designed with a specific objective in mind.

I would argue that synthetic datasets are useful if they are designed with a specific objective in mind. There are three main reasons for creating a new dataset.

  • Generating Real Life Samples: An obvious use of datasets is to predict the performance of a system in real life. If this is your aim then you should try to create a dataset that is as close to the real world as possible for accurate prediction. An example of this trend is the ImageNet dataset.
  • Ablating Real Life Phenomena: Oftentimes the real-life problem is far too complex and challenging. A common solution is to consider a specific phenomenon and “ablate” away the others. Let’s say you are interested in creating autonomous vehicles. This involves solving challenges in planning, computer vision and control. Directly creating a setup where the car navigates in the city is dangerous, not to mention illegal. Therefore, to make progress towards this difficult challenge, people ablate away other phenomena to focus on a single challenge. For example, if you are designing a planning component then you can create a setup where the vision and control problems have been solved. Or if you are solving the computer vision aspect then you can assume access to a set of images taken from a camera mounted on top of a car. You can then focus on doing object detection and localization in these static images without worrying about how the car is moving around. You can even create simulations to ablate phenomena. If this is your goal then the dataset has to be as realistic as possible in terms of the phenomenon you are interested in while ablating away the other challenges. The ablation can be done by giving “oracle” access to the solution of those challenges (e.g., an oracle planner) or by making the problem really easy (e.g., using simple images). Most of the empirical side of my research has been in this category (e.g., [EMNLP 2017], [EMNLP 2018]). In these works, we are interested in natural language understanding. Therefore, we use free-form text but use simple images to ablate vision challenges. As we continue making progress on this front, we want to move closer to our real-life goal.

    There is an important challenge in pursuing this approach: the solution to the main problem need not be a composition of the solutions to the individual problems.

    A reinforcement learning algorithm that does well on problems with simple vision challenges like Atari games need not do well on problems with difficult vision challenges like a 3D house.

    How do we resolve this? Can we build theoretical models to support our ablations? It turns out we can, using the learning reduction approach. For example, we can create an RL algorithm that can solve any Contextual Decision Process assuming access to solutions of elementary problems. Then improving the solutions of these elementary problems necessarily improves the overall solution. See this paper and this one for examples.

  •  Debugging and Stress Testing of AI systems: Synthetic datasets can be used for performing a quick sanity check or for stress testing the system. It is often beneficial to design easy but non-trivial synthetic setups such that if you cannot solve them then you have no hope of solving the real problem. This saves money and helps you understand the challenges before going out and building an expensive real-life dataset. Alternatively, real-life problems are often not adversarial and do not let you stress test your system. For example, if you are building a planning algorithm for navigation then you may not be able to find challenging real-life environments (except perhaps this). But why do we want something like that? Because you do not want to tell a customer that their problem is too hard for your system to solve. If the bot has to navigate in a burning building to rescue workers then it must not fail because the environment is hard to plan for. Synthetic datasets can be used to create an adversarial setup such that if you can solve it then you can solve any other problem.

When creating a synthetic dataset you are aiming for either (2) or (3). One must understand this objective prior to collecting the dataset: is it for stress testing, sanity checking, or ablation? What must be avoided is a rush to create a dataset just to flag-plant a research idea. Generally, datasets have a longer shelf life than solutions and, therefore, it is important to get them right.

Are we doing NLP the right way?

I have been pondering some of the questions below for some time and, after taking a break from chasing paper deadlines, I found some time over Christmas to pen down my thoughts. I’ll discuss four issues facing NLP (or at least the way I see them) and offer suggestions.

Please note that I do not advocate the following as the only way everyone should work. I believe that research is best done when people do what they are excited about, and we should try diverse things with an open mind, especially in the early stages of a field.

1: Too Many Datasets and Not Enough Structure

A few months back, I joked that there are now “more NLP datasets than NLP researchers”. I do think there is an element of truth here. The field has witnessed explosive growth and the result has been an overemphasis on too many problems in isolation, without enough work on how these different problems relate to one another.

For reference, let’s consider how the field of mathematics progressed. People first proved a set of theorems grounded in something of immediate interest, such as how to approximate the area of a circle or measure the height of a tower. It is no surprise that early mathematics was mostly real algebra and trigonometry instead of abstract algebra and algebraic geometry. A healthy trend in mathematics is that most maths papers use at least some results from the literature. Unfortunately, in NLP one seldom has the opportunity to use (I do not mean cite but actually use) results from very old research papers, or the solution to another problem that has been reliably solved (e.g., POS tags). Why is this the case?


Let’s look at a bit of history first.

NLP from the early 1990s to sometime around 2012 was dominated by the linguistic agenda. This agenda consisted of building a structure, one level at a time: first solving part-of-speech tagging, then syntactic parsing, then approaching semantics, and finally something in pragmatics/discourse. It made total sense! Surely one cannot really approach discourse if one cannot understand a sentence. And you cannot understand a sentence if you cannot even understand dependencies between its words. These problems were also generally well defined (mostly one can agree on what the right parse tree is) and interpretable (you can look at the parse tree to see what mistakes were made).

Unfortunately, it is becoming apparent that these beautiful linguistic approaches didn’t yield empirical gains (using empirical evaluations was another breakthrough that happened in NLP in the 1990s; I talk about that in point 4 below). The biggest evidence of this is that most NLP papers these days outside the parsing track don’t use a syntactic parser either in their model or in any baseline. A single LSTM with attention processing the entire sentence outperforms linear models using parse trees. The field realized this and, as a result, many researchers changed their agenda and left the linguistic bandwagon for the deep learning supercar.


It is disappointing that after improving the parsing score on the Penn Treebank from somewhere in the 80s to around 96 now, we still don’t end up using parse trees as features in our models. But this is mostly a tragedy for the linguistic agenda and not for the field, because across-the-board performance on several datasets has improved (due to better feature representations using deep learning, improved learning and optimization algorithms, and simply more researchers working), new problems have blossomed, and attendance and submissions to conferences have skyrocketed. Another change that has taken place is that more people now directly work on the final problem of interest, like Question Answering, Machine Comprehension, Instruction Based Navigation, etc. The general approach these days appears to be:

  1. Create a new dataset and problem.
  2. Improve the performance on this dataset, generally by using more and more complicated models.
  3. Once the performance reaches a certain level and further improvement is hard, interest dies and people move on to step 1.

A new problem however is appearing.

Rarely does a solution for one problem inform the solution for another. The field is littered with dead datasets and solutions that are never used again.

I believe this is happening because we are not building a structure. There are too many problems and it is not clear how they relate to one another.

For reference, one crowning achievement of complexity theory was to reduce thousands of problems to a few complexity classes and study how these relate to one another. There are a thousand different NLP problems. What is our analog of the class NP?

The linguistic agenda of studying a hierarchy of linguistic structure hasn’t succeeded after decades of work. What new structure could we use? The following is one suggestion, which I call the Learning Reduction Agenda:

  1. Narrow down to problems that are of immediate interest and well defined, for example, knowledge-base or visual question answering, vision-based instruction following, and multiple-choice reading comprehension. These will be our analog of the Pythagorean theorem.
  2. Create new problems which use solutions to (1) for evaluation or reduction. Here is one example: one can define a machine translation metric in terms of question answering. If the gold French text and the translated French text can be used to answer the same questions then they have the same content and should be treated as paraphrases (I am assuming there is an executable environment).
  3. Gradually expand the problem set by repeating step 2.

    [Learning Reduction Agenda] Researchers should think carefully about where a new problem stands in this structure. If it is of end interest with an unambiguous evaluation metric, it belongs on the ground floor; otherwise it should reduce to a problem lower in the structure and thereby reveal its position.

2: Solving an Ill-Defined Problem

There are several problems in NLP that do not have a well-defined metric. Most of these problems involve generating text, for example, a chat bot having a conversation with a human, machine translation, or image caption generation. The solution has been to defer to humans for evaluation or to use proxy evaluation metrics such as BLEU, ROUGE, METEOR, CIDEr, etc. If one really buys into these metrics then one can use standard methods to maximize the performance of one's approach with respect to them. Some sophisticated researchers prefer an ensemble of metrics or report performance on several metrics.

The main question still remains: what do you want to achieve with the generated text? An image caption that says, “this is an image” is an accurate caption but doesn’t contain any content. A chat bot that says, “Hmm…” accurately mimics the attitude of a person who doesn’t want to be bothered. I believe this question is not adequately answered in most research papers. We should understand what this output is going to be used for. For example, if the image caption generation model is being used to describe the image to a visually impaired person then we must understand what kind of information that person seeks. Are they interested in hearing about the style of the image (e.g., impressionist, cubist), the color of the image (e.g., pale blue background with a pink tinge), or the number of people in the image? We should understand what the output of our models is going to be used for; merely maximizing BLEU, METEOR, etc. is not enough.

Following the learning reduction agenda, one can, for example, define this evaluation metric in terms of the accuracy of a well-trained QA system answering questions based on the generated text. If the generated text is used to assist a human then the human's performance could be used as the evaluation metric. Or if you are using a chat bot for retention then it makes sense to use the expected retention time as the evaluation metric (the chatbot is one of those problems for which it is really hard to think of any scenario where it could be useful).

3: Over Reliance on “Linguistic Intuition”

Another thing one hears about is the idea of “linguistic priors” or “linguistic intuition”. Oftentimes, papers report using a linguistic prior as one of their main contributions and back it up by showing a performance gain. An example of this is the use of recursive structures for deep learning (Stanford really championed this research), which was very popular in 2013-2015. It is very well known that sentences have tree structure, so it made “total sense” when recursive structures were employed for sentence representation. These were some very popular research works at that point. However, they fell out of fashion later, largely due to results showing that the same gains can be obtained without recursive structures, by simply averaging.

The point where it gets slippery is that these linguistic intuitions could be wrong or, when coupled with other ingredients (models, optimization, learning), may in fact reduce performance. I have often tried changes to my model that I expected would increase performance, only to see performance go down (from what I hear, this is a common experience; for a related discussion, see this NIPS talk by Ali Rahimi).

We should not over-rely on linguistic intuition (the failure of the linguistic agenda should be a warning in itself) and should try a broader range of solutions. After all, the current deep learning trend did not emerge out of the mainstream. Who knows, maybe the answers to NLP are hidden in some abstract mathematical objects.

4: Product-less Offline Evaluation

The field of NLP underwent a revolution with the use of datasets and rigorous evaluation. Nowadays nearly all papers at NLP conferences either introduce a dataset or evaluate on an existing one. One could say the field is going through a dataset revolution (or bubble?). However, existing datasets have an issue: they only offer offline evaluation, i.e., they are static and not continuously changing. This is different from online evaluation, where one subjects one's algorithm to live use by people across the world and computes statistics on hourly (or other timescale) performance. Most applications in industry have an online evaluation: the Amazon Echo Dot and Google Translate engines are both being evaluated every minute by users who can type anything based on something that just happened in the world. For example, a Korean video called “Gangnam Style” going viral can spur users to try this phrase on these algorithms. Offline datasets just don’t allow us to make such evaluations (one can simulate an online audience, but this has other issues).

I was told by a researcher at a top industry lab that product engineers are not convinced of the utility of a research algorithm by performance on some dataset. The gold standard in industry is A/B testing: showing improvement on an end metric (e.g., engagement, profit, or other utility) with a real audience in a real setting. Industry standards are quite high, and rightfully so! A Twitter chatbot that becomes racist can be a PR nightmare for the company (you can read more about chatbot fails here). There is too much to lose. The damage is not only to the company but also to real users. Think of the following example:

Human: “I feel like just killing myself. No one cares about me.”

Some Conversation Engine: “Sure.”

If this exchange is part of an offline dataset then generating “Sure” results in a minor reduction in BLEU score, but it could abet a suicide and cause real harm in live testing. I would even argue that this error should get a score of -\infty, giving a total score of -\infty even when every other response was generated perfectly.
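The scoring rule suggested here is easy to write down. A toy sketch (the per-response scores and the harmful-response flags are hypothetical inputs; detecting harm is of course the hard, unsolved part):

```python
import math

def aggregate_score(per_response_scores, harmful_flags):
    """Average the per-response scores, but return -infinity if any
    response was flagged as harmful, no matter how good the rest are."""
    if any(harmful_flags):
        return -math.inf
    return sum(per_response_scores) / len(per_response_scores)
```

Under this rule, a system that answers every other prompt perfectly but says “Sure.” in the example above still gets a total score of -\infty.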

While using human evaluation and presenting Likert scores is better, I think we should go all the way. We need online evaluation where anyone can use these systems and see the performance for themselves. Never Ending Language Learning (NELL) is one popular NLP project that does this. In robotics, I was part of a team that attempted something similar (this project is no longer active). Maybe we can create an online platform where people can register their NLP systems and get feedback in the form of likes or dislikes. These systems could also teach each other. We need more such evaluation, and I believe this is the only way to truly judge an NLP system.

Writing and Proofreading Research Code

Most papers in Artificial Intelligence (AI) research these days have accompanying source code. Sometimes authors will release their code with their paper after publication (our research group always does) and sometimes, for various reasons, they don’t (e.g., the code is built at a company on top of proprietary source code, or the code requires a lengthy internal review before it can be released).

Issues with using Released Source Code

Even when the code is released, you can run into the following problems:

  1. Code doesn’t compile or run.


    You cannot run the code. You even email the authors but they don’t reply. This is a common issue, and the main reason is that, unfortunately, writing high-quality code doesn’t align well with the incentives of a PhD student. You can spend 3 months improving your code but your advisor may want to see another research paper. Further, the student may have graduated and now works in quant finance with no interest in replying to your email about their code from 10 years ago.

    I strongly advocate incentivizing code quality and giving strong preference to papers with accompanying code when accepting a paper.
    The standard argument against this is that an additional burden will be placed on reviewers who don’t even have time to read the supplementary material, but I feel that we could automate parts of it (in any case, we need to address it more).

    “Accepting a paper without source code is, to an approximation, like accepting a cheque (check) without a signature: it feels like a cheque, it has the right number, the memo looks beautiful, but there is no guarantee.”

    Doing this will, however, slow down research because, for whatever reasons embedded in human nature, most researchers end up submitting on the last day of the conference deadline, and often the code is just not ready to be submitted (unfortunately this has happened to me and to almost everyone I know). Discouraging such behaviour would be a good thing for the community. The plot below shows that most papers for NAACL 2018 were submitted on the last day (special thanks to the program chairs for this analysis). Full analysis is available here.
  2. Code gives wrong results.

    Even when you can compile and run the code, you may find that the results differ from what the paper reports, or that it outputs several numbers without telling you which one to use. I argue that errors of this type are much more common and difficult to deal with. They can happen for reasons the authors cannot foresee, e.g., the version of an underlying library changed (DyNet is particularly susceptible to giving different results across versions). Remember that most code is developed by PhD students who aren’t necessarily the best software developers (this is one skill that we need to incentivize as a community). This kind of dependency-hell situation is common in software development, and solutions have been developed to handle it. Releasing Amazon AMI images or Docker files is one way to handle this situation. I have used both of these solutions in the past and they greatly helped me. For example, if you run experiments across different machines you can get different results even with the random seed fixed and using synchronous computation (due to different libraries or the way they compile on each system). Docker can help fix this issue and also ensures that performance changes you see are not caused by using a different machine but by the changes you made.

  3. Code performs poorly on a different but related task

    Most research papers using deep learning these days report increasingly complicated neural networks for smaller and more specific tasks. A main problem with this approach could be a form of community overfitting, which I define as:

    Community Overfitting: “The phenomenon where a research community as a whole overfits on a single or a few tasks by trying many solutions, even though an individual researcher may have only evaluated on the test set once.”

    A colleague of mine, for example, recently found that models built for several visual question answering datasets perform around chance (i.e., 50% accuracy) on a new dataset that she collected. This is in stark contrast to the general feeling (particularly on social media) that we have made great improvements on tasks requiring understanding in vision and language.

    A solution might be to stop using very old datasets (like the Penn Treebank) and maintain a held-out test set that is not revealed to the public (there is a contradiction here with keeping everything public) and on which authors cannot evaluate more than once a month (or another suitably long time frame). We also need to be careful when stating our results. Instead of saying, “we present a model that improves on the visual question answering problem”, say “we present a model that improves on the visual question answering dataset XYZ (cite ABC)”. I believe it is generally better to be cautious and under-claim, and then enjoy being proven right on new datasets or tasks, than to over-claim and find your claims proven wrong.
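On issue 2 above: a common (partial) guard is to fix all random seeds up front, in one place. This is only a sketch; as noted above, identical seeds still do not guarantee identical results across machines or library versions, and the numpy/torch branches are optional assumptions about which libraries a project uses:

```python
import random

def set_seeds(seed=42):
    """Fix random seeds in one place, for whatever libraries are present.

    This narrows, but does not eliminate, run-to-run variation: different
    machines or library versions can still produce different results.
    """
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass
```

Pairing this with a pinned environment (a Docker file or AMI, as suggested above) covers the library-version side of the problem that seeding alone cannot.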

Three Main Crimes in an AI Source Code

When developing big code bases for research, it is helpful to debug them carefully to avoid making wrong claims. In general, however, one cannot guard against every possible error (e.g., an error in an underlying library is extremely hard to detect). I have a feeling that there is an underlying power-law-like relationship where making sure that your code has no error with probability at least 1-p requires a number of debugging hours that grows polynomially with 1/p. There is a fair bit of research now on writing provably correct code (for example, see this paper from Selsam, Liang and Dill). However, today most research code doesn’t use these automated proving techniques. I have come up with the following quick checks to make sure that your code doesn’t contain the gravest types of errors:

  1. Train-Test Dilution. 
    This is the classic machine learning error where you use information from the test data during training. This could involve training on the test data or more subtle information leakage from the test data. The standard guard is to split your dataset into 4 disjoint files: train, tune, dev and test, and to make sure that these files are indeed disjoint. Then use only the train file for training and the tune file for finding optimal hyperparameters. The dev file should only be used for occasionally checking performance, and the test file should only be used once (generally two weeks before the deadline). Using different files, as opposed to indices, is better since an accidental index bug won't let you see the wrong split during training (or evaluate on train data when testing). I do want to point out that this error can happen in very subtle forms. E.g., once I was using a dataset with train, dev and test splits. I was using a model for transfer learning that was publicly available for download from an existing code base. I was getting very high dev performance but then very low test performance (generally you expect a ~1% drop in performance when running on the test set, but this was a ~15% drop). I then realized that the publicly available model had used labels from the dev set during training. Fortunately, we were able to fix this error and improve our model in other ways to regain our performance before submitting our results.
  2.  Label Leakage. 
    Label leakage is the phenomenon where your model uses the gold labels (or some prohibited information) for an input when evaluating. For example, a classifier that has to predict a label for an image, where that label is encoded in the image itself, say as a watermark, or, in a more direct case, given as input to the classifier. The standard guard against label leakage is to separate the prohibited information in your class. E.g., if you are using Java then all prohibited labels should be made private, and one is only allowed access to a scoring function which evaluates to a score using the gold label and the predicted label.
  3. Wrong Evaluation Metric.

    A wrong evaluation metric will give you wrong results. E.g., one can compute the F1 score incorrectly or write one's own buggy function for the BLEU score. I remember that the Python NLTK BLEU scoring function used to give wrong results. In general, when using a complicated evaluation metric, use the test scripts that previous publications have used, or, when using a dataset with an accompanying evaluation script, use that. When writing your own implementation of an evaluation metric, I suggest manually computing the score and checking it against the implementation on a set of randomly chosen inputs.
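Each of the three checks above can be reduced to a few lines of defensive code. The following is a minimal sketch of my own (all names are hypothetical): a disjointness assertion over the splits (error 1), an evaluator that keeps the gold labels private and only exposes a scoring function (error 2), and a spot check of a hand-written metric against a manual computation (error 3):

```python
def assert_disjoint_splits(splits):
    """Guard against train-test dilution (error 1): verify that the
    train/tune/dev/test splits share no example ids."""
    names = list(splits)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            shared = splits[names[i]] & splits[names[j]]
            assert not shared, f"{names[i]}/{names[j]} share {len(shared)} ids"


class Evaluator:
    """Guard against label leakage (error 2): gold labels are name-mangled
    and only reachable through the scoring function."""

    def __init__(self, gold_labels):
        self.__gold = list(gold_labels)  # not part of the public interface

    def score(self, predictions):
        """Accuracy of the predictions against the hidden gold labels."""
        assert len(predictions) == len(self.__gold)
        correct = sum(p == g for p, g in zip(predictions, self.__gold))
        return correct / len(self.__gold)


def accuracy(gold, pred):
    """A hand-written metric to be spot-checked (error 3)."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Spot check the metric against a manual computation on a tiny input:
# gold = [1, 0, 1], pred = [1, 1, 1] -> 2 correct out of 3.
assert abs(accuracy([1, 0, 1], [1, 1, 1]) - 2 / 3) < 1e-9
```

Python's name mangling is only a soft version of Java's `private`, but the point is the same: model code should never be handed the labels, only a way to be scored against them.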

When supervising an intern, I ask them to make sure that error types 1 to 3 don’t exist in their code. This has the advantage of ensuring that their code is trustable, by which I roughly mean that the results from the code can be reported for that task. The following argument makes this clearer.

Theorem: If your code doesn’t contain errors 1 to 3 then the results from your code can be reported.

Proof: Say you train your model on this task and then evaluate it. Guarding against errors 1 and 2 ensures that you haven’t seen the test data or the labels of the test examples during training. The model you train is then evaluated on the test set and not on the train set, since you don’t have error 1. So you are evaluating, on a test set it has never seen, a model that does not use the test labels. Since your evaluation metric is correct, your results are also correct.

This is only a proof against high-level errors. If you save the model incorrectly, give wrong inputs to the evaluation metric, or use other information that you are not allowed to use even though you are not using the test set, then the above proof doesn’t hold. But I hope I was able to get the general idea of “trustable results” across.


In this blog post, I talked about common errors in running released code and how to make sure that your code doesn’t contain the gravest errors before submitting results. I also talked about a rough notion of trustable results. I haven’t talked about debuggability or whether what one claims in the paper is what the code does. For example, one can avoid errors of type 1 to 3, yet the model could be incorrectly implemented or differ from what is described in the paper. In general, it is good to see that the community is talking more about these issues, and I hope that this blog post will help the discourse.


Mathematical Analysis of Policy Gradient Methods

Most papers using reinforcement learning these days use the policy gradient class of learning algorithms. In this post, I will give a basic tutorial on policy gradients, clear up some confusion about using baselines and regularization, and give some suggestions for debugging these learning algorithms that I have found useful in the past.

Policy gradient methods are a class of reinforcement learning algorithms that optimize the reinforcement learning (RL) objective by performing gradient ascent on the policy parameters. As opposed to value-function-based methods (SARSA, Q-learning), the central object of study in policy gradient methods is a policy function \pi(a\mid s; \theta), which denotes a probability distribution over actions given a state. \theta are the parameters of this policy. Assuming we are optimizing using stochastic gradient methods, our updates are given by:

 \theta^{(t + 1)} \leftarrow \theta^{(t)} +\eta_t \nabla_\theta J^\pi

The objective J^\pi can be the expected total discounted reward, the expected average reward, or something else. We will stick to the total discounted reward for this discussion. The aim now is to express \nabla_\theta J^\pi in a form that is easier to work with. This is done using the policy gradient theorem.
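Before deriving the theorem, here is the update above instantiated in the simplest possible case: a one-step bandit with a softmax policy, trained with the REINFORCE estimator \eta \, r \, \nabla_\theta \log \pi(a \mid \theta). This is a toy sketch of my own (two actions, made-up rewards), not the full discounted setting:

```python
import math
import random

random.seed(0)  # fixed seed so the run is reproducible

def softmax(theta):
    z = [math.exp(t) for t in theta]
    total = sum(z)
    return [p / total for p in z]

def reinforce_step(theta, rewards, eta=0.1):
    """One stochastic update: theta <- theta + eta * r * grad log pi(a)."""
    pi = softmax(theta)
    a = random.choices(range(len(theta)), weights=pi)[0]  # sample an action
    r = rewards[a]
    # gradient of log softmax at action a is (one_hot(a) - pi)
    return [t + eta * r * ((1.0 if i == a else 0.0) - pi[i])
            for i, t in enumerate(theta)]

theta = [0.0, 0.0]
for _ in range(2000):
    theta = reinforce_step(theta, rewards=[0.0, 1.0])
# After training, the policy puts most of its mass on the higher-reward action.
```

The derivation below shows why this kind of sampled update is an unbiased estimate of \nabla_\theta J^\pi in the general multi-step case.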

Policy Gradient Theorem

Let M = \langle S, A, R, P, \gamma, \mu \rangle be our MDP where S, A, R, P are the state space, action space, reward function and transition probabilities respectively. \gamma is the discount factor and \mu is the initial state distribution.

The expected total discounted reward can be expressed as:

J^\pi = E[\sum_{t\ge 0} \gamma^t r_t] = \sum_s \mu(s) V^\pi(s)

See my previous post if you don’t know what the V function is. Using the above equation we have \nabla_\theta J^\pi = \sum_s \mu(s) \nabla_\theta V^\pi(s). The policy gradient theorem (Sutton et al. 1999) computes \nabla_\theta V^\pi(s) by repeatedly applying the Bellman self-consistency equation. It really is just that.

\nabla_\theta V^\pi(s) =\nabla_\theta \left\{ \sum_a \pi(a\mid s) Q^\pi(s, a) \right\}

\nabla_\theta V^\pi(s) = \sum_a \pi(a \mid s) \nabla_\theta Q^\pi(s, a) + \sum_a Q^\pi(s, a)\nabla_\theta\pi(a \mid s) 

using Q^\pi(s, a) = \sum_{s'} P^a_{s, s'} \left\{ R^a_{s, s'} + \gamma V^\pi(s') \right\} we get:

\nabla_\theta Q^\pi(s, a) =  \gamma \sum_{s'} P^a_{s, s'} \nabla_\theta V^\pi(s')

so now we have an expression for \nabla_\theta V^\pi in terms of itself. Writing it down together we have:

\nabla_\theta V^\pi(s) = \sum_a Q^\pi(s, a)\nabla_\theta\pi(a \mid s)   +\sum_a \pi(a \mid s)\gamma \sum_{s'} P^a_{s, s'} \nabla_\theta V^\pi(s')

we will change the notation as s \rightarrow s_0, a \rightarrow a_1, s' \rightarrow s_1 giving us:

\nabla_\theta V^\pi(s_0) = \sum_{a_1} Q^\pi(s_0, a_1)\nabla_\theta\pi(a_1 \mid s_0)   +\sum_{a_1} \pi(a_1 \mid s_0)\gamma \sum_{s_1} P^{a_1}_{s_0, s_1} \nabla_\theta V^\pi(s_1).

unfolding this equation gives us (verify it for yourself):

\nabla_\theta V^\pi(s_0) =\sum_s \sum_{t \ge 0} \gamma^t P(s_t = s \mid \pi, s_0) \sum_a \nabla_\theta \pi(a \mid s) Q^\pi(s, a).

where P(s_t = s \mid \pi, s_0) is the probability of being in state s after time t when starting in state s_0 and taking actions according to the policy \pi. Note that our setup does not include multiple agents, policies dependent upon time or non-Markovian policies.

Discounted unnormalized visitation probability: The term \sum_{t \ge 0} \gamma^t P(s_t = s \mid \pi, s_0) appears fairly often in RL theory, so it is given the notation d^\pi(s; s_0). Essentially, this value is high for a state s if it is likely to be visited by an agent sampling actions according to the policy \pi and starting in state s_0. This term is called the “discounted unnormalized visitation probability”. It is a “visitation probability” since it is higher for states which are likely to be visited. It is discounted because the probability terms are discounted with a \gamma^t factor, and it is unnormalized because it may not sum to 1. In fact, for the infinite horizon:

\sum_s d^\pi(s; s_0) = \sum_s \sum_{t \ge 0} \gamma^t P(s_t = s \mid \pi, s_0)

=  \sum_{t \ge 0} \gamma^t \sum_s P(s_t = s \mid \pi, s_0) = \sum_{t \ge 0} \gamma^t \cdot 1 = \frac{1}{1-\gamma}.

and for finite horizon with \gamma =1 we have:

\sum_s d^\pi(s; s_0) = \sum_s \sum_{t = 0}^{T-1} P(s_t = s \mid \pi, s_0)

= \sum_{t = 0}^{T-1} \sum_s P(s_t = s \mid \pi, s_0) = \sum_{t = 0}^{T-1} 1 = T

so in either case the sum of d^\pi(s; s_0) is equal to the time horizon (for the episodic undiscounted case) or the effective time horizon (for the unending discounted case).
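As a quick numerical sanity check, assuming a small made-up 3-state MDP, the matrix of d^\pi(s; s_0) values has the closed form (I - \gamma P_\pi)^{-1} (a Neumann series), and each of its rows sums to the effective horizon derived above. The transition matrix below is arbitrary, chosen only for illustration:

```python
import numpy as np

# Hypothetical 3-state MDP: P_pi[s, s'] is the transition matrix induced by
# some fixed policy pi (rows sum to 1). Values are made up for illustration.
P_pi = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])
gamma = 0.9

# d_pi[s0, s] = sum_t gamma^t P(s_t = s | pi, s0) = [(I - gamma * P_pi)^-1][s0, s],
# since sum_t (gamma * P_pi)^t = (I - gamma * P_pi)^-1 (Neumann series).
d_pi = np.linalg.inv(np.eye(3) - gamma * P_pi)

# Each row sums to the effective horizon 1 / (1 - gamma) = 10.
print(d_pi.sum(axis=1))
```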

Using this notation we have: \nabla_\theta V^\pi(s_0) = \sum_s d^\pi(s; s_0) \sum_a \nabla_\theta \pi(a \mid s) Q^\pi(s, a)

and which gives us:

\nabla_\theta J^\pi = \sum_{s_0} \mu(s_0) \sum_s d^\pi(s; s_0) \sum_a \nabla_\theta \pi(a \mid s) Q^\pi(s, a)

or, using the notation of Kakade and Langford 2002, we define d^\pi_\mu(s) = \sum_{s_0} \mu(s_0) d^\pi(s; s_0), which allows us to express the gradient as:

\nabla_\theta J^\pi = \sum_s d^\pi_\mu(s) \sum_a \nabla_\theta \pi(a \mid s) Q^\pi(s, a).
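To build confidence in this final formula, here is a small numerical check (a sketch, with an arbitrary randomly generated MDP and a softmax policy, none of which appear in the original post): we compute the gradient via the policy gradient theorem and compare it against finite differences of J^\pi.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9

# Made-up MDP: P[a] is the transition matrix for action a, R the reward table.
P = rng.dirichlet(np.ones(nS), size=(nA, nS))      # P[a][s, s']
R = rng.standard_normal((nS, nA))                   # R[s, a]
mu = np.full(nS, 1.0 / nS)                          # uniform start distribution

def policy(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)         # pi[s, a]

def J(theta):
    pi = policy(theta)
    P_pi = np.einsum('sa,asn->sn', pi, P)           # induced transition matrix
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    return mu @ V

def policy_gradient(theta):
    pi = policy(theta)
    P_pi = np.einsum('sa,asn->sn', pi, P)
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = R + gamma * np.einsum('asn,n->sa', P, V)    # Q[s, a]
    d_mu = mu @ np.linalg.inv(np.eye(nS) - gamma * P_pi)  # visitation d_mu[s]
    grad = np.zeros_like(theta)
    for s in range(nS):
        # softmax Jacobian: d pi(a|s) / d theta[s, b] = pi(a|s) (1{a=b} - pi(b|s))
        jac = np.diag(pi[s]) - np.outer(pi[s], pi[s])
        grad[s] = d_mu[s] * (jac @ Q[s])
    return grad

theta = rng.standard_normal((nS, nA))
g = policy_gradient(theta)

# Check against central finite differences of J.
eps, g_fd = 1e-5, np.zeros_like(theta)
for i in np.ndindex(theta.shape):
    dt = np.zeros_like(theta); dt[i] = eps
    g_fd[i] = (J(theta + dt) - J(theta - dt)) / (2 * eps)
print(np.max(np.abs(g - g_fd)))  # difference should be tiny
```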

Another way to express the objective: The notation of the visitation distribution allows us to express the expected total discounted reward objective in a form that gives another interpretation to the policy gradient theorem. Observe that:

J^\pi = E[\sum_{t \ge 0} \gamma^t r_t] = \sum_{t \ge 0} \gamma^t E[r_t]  (linearity of expectation)

The reward r_t = R(s_t, a_{t+1}), where s_t is the state the agent is in at time t and a_{t+1} is the action taken at that time. Therefore:

E[r_t] = E[R(s_t, a_{t+1})] = \sum_{s} P(s_t = s) \sum_a \pi(a \mid s) R(s, a).

Plugging it in we get:

J^\pi = \sum_{t \ge 0} \gamma^t E[r_t] =\sum_{t \ge 0} \gamma^t\sum_{s} P(s_t = s) \sum_a \pi(a \mid s) R(s, a).

Rearranging the terms we get:

J^\pi = \sum_{s} \sum_{t \ge 0} \gamma^t P(s_t = s) \sum_a \pi(a \mid s) R(s, a)

and then using the notation of d^\pi_\mu(s) we get:

J^\pi = \sum_s d^\pi_\mu(s) \sum_a \pi(a \mid s) R(s, a).

Compare this with the gradient of the objective:

\nabla J^\pi = \sum_s d^\pi_\mu(s) \sum_a \left\{ \nabla \pi(a \mid s) \right \}Q^\pi(s, a).

It appears as if we don’t take the gradient over the term d^\pi_\mu(s), and this comes at the price of replacing R(s, a) with Q^\pi(s, a). It also allows us to derive a corollary for the derivative of the visitation distribution. The derivative of the visitation distribution has been studied in other contexts, not using this corollary, which I will probably cover in a future post.

Corollary*: \sum_s \nabla_\theta d^\pi_\mu(s) \sum_a \pi(a \mid s) R(s, a) =\sum_s d^\pi_\mu(s) \sum_a \left\{ \nabla \pi(a \mid s) \right \}\{ Q^\pi(s, a) - R(s ,a)\}

Proof: \nabla J^\pi = \sum_s d^\pi_\mu(s) \sum_a \left\{ \nabla \pi(a \mid s) \right \}Q^\pi(s, a) (from policy gradient theorem).

\nabla J^\pi = \nabla \left( \sum_s d^\pi_\mu(s) \sum_a \pi(a \mid s) R(s, a) \right) (from alternative expression for J^\pi).

=  \sum_s \nabla d^\pi_\mu(s) \sum_a \pi(a \mid s) R(s, a) + \sum_s  d^\pi_\mu(s) \sum_a \nabla \pi(a \mid s) R(s, a)

comparing the two equations and rearranging gives the result.

The REINFORCE algorithm

REINFORCE algorithm (Williams 1992) comes directly from approximating the gradient given by the policy gradient objective. Firstly, we will review sampling for the wider audience.

Brief Summary on Sampling: Given an expectation E_p[f] = \sum_x f(x) p(x) of a function f: X \rightarrow \mathbb{R} with respect to a probability distribution p over X, we can approximate the expectation by drawing N samples x_i \sim p(.) and using the empirical mean \mu_N = \sum_{i=1}^N \frac{f(x_i)}{N} as our approximation. By the law of large numbers, this approximation converges to E_p[f] as N \rightarrow \infty. The estimate \mu_N is called unbiased since its mean is the quantity we are approximating, i.e., E_p[f]:

E_{x_i \sim p}\left[\sum_{i=1}^N \frac{f(x_i)}{N}\right] = \sum_{i=1}^N \frac{1}{N} E_{x_i \sim p}[f(x_i)] =\sum_{i=1}^N \frac{1}{N} E_p[f] = E_p[f].

Note that if our samples were drawn from another distribution then our estimate may not be unbiased. Another quantity of interest for an estimate is its variance, var(X) = E[(X- E[X])^2], which measures how far the estimate deviates in expectation from its mean. For the above estimate, we have var(\mu_N) = \frac{var(f(x))}{N}. Thus, as we collect more samples our estimate deviates less from the mean, and if the variance var(f(x)) is low then even fewer samples can give us an accurate estimate.
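A tiny illustration of these two facts (unbiasedness and the 1/N variance decay), using a made-up discrete distribution and function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E_p[f] for an arbitrary discrete distribution p and function f.
p = np.array([0.2, 0.5, 0.3])
f = np.array([1.0, -2.0, 4.0])
true_mean = p @ f

def estimate(N):
    xs = rng.choice(3, size=N, p=p)       # N i.i.d. samples from p
    return f[xs].mean()                    # unbiased empirical mean

# The spread of the estimator shrinks like var(f(x)) / N.
variances = {}
for N in (10, 1000, 100000):
    ests = [estimate(N) for _ in range(200)]
    variances[N] = np.var(ests)
    print(N, variances[N])
```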

Vanilla REINFORCE: A quick look at the policy gradient tells us that we can never compute it exactly except for tiny MDPs. We therefore want to approximate the gradient of J^\pi given by the policy gradient theorem, and we will do it using sampling. We show below how to do it:

\nabla J^\pi = \sum_s d^\pi_\mu(s) \sum_a \left\{ \nabla \pi(a \mid s) \right \}Q^\pi(s, a)

=\sum_s \sum_{t\ge 0} \gamma^t P(s_t = s \mid \pi) \sum_a \left\{ \nabla \pi(a \mid s) \right \}Q^\pi(s, a)

= \sum_{t\ge 0} \gamma^t \sum_s P(s_t = s \mid \pi) \sum_a \left\{ \nabla \pi(a \mid s) \right \}Q^\pi(s, a)

Let’s say we sample a rollout \langle s_0, a_1, r_1, s_1, a_2, r_2, \cdots a_T, r_T, s_T \rangle using the policy \pi. A rollout is generated by starting from a sampled state s_0 \sim \mu(.), and then repeatedly taking actions in the current state using the policy and receiving rewards. Then note that s_t \sim P(s_t = \cdot \mid \pi). Using the sampling approach described above,

\sum_s P(s_t = s \mid \pi) \sum_a \left\{ \nabla \pi(a \mid s) \right \}Q^\pi(s, a) \approx  \sum_a \left\{ \nabla \pi(a \mid s_t) \right \}Q^\pi(s_t, a)

Note that we only use a single sample here (N=1) for approximation!

After this approximation we have,

\nabla  J^\pi \approx \sum_{t\ge 0} \gamma^t \sum_a \left\{ \nabla \pi(a \mid s_t) \right \}Q^\pi(s_t, a)

If we knew Q^\pi(s, a) perfectly for every state s and action a then we could compute the above objective exactly, provided the number of actions is not prohibitively large. However, in general we do not have these conditions, therefore we want to get rid of the second summation as well.

Unfortunately, the summation is not written in the form of an expectation i.e. \sum_a f(a) p(a) for some probability distribution p. Therefore we cannot apply the sampling approach. To solve this, we will use calculus 101 to express it in an expectation format by using \sum_a (\nabla p(a)) f(a) = \sum_a p(a)  (\nabla \ln p(a) )f(a) = E_{a \sim p(.)}[ \nabla \ln p(a) f(a)]. Thus,

\nabla  J^\pi \approx \sum_{t\ge 0} \gamma^t \sum_a \left\{\pi(a \mid s_t) \nabla \ln \pi(a \mid s_t) \right \}Q^\pi(s_t, a)
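The log-derivative identity itself is easy to verify numerically. The sketch below (with an arbitrary softmax distribution and made-up values of f, not taken from the post) checks that \nabla_\theta \sum_a p(a) f(a) = E_{a \sim p}[\nabla_\theta \ln p(a) f(a)]:

```python
import numpy as np

# Score-function (log-derivative) identity for p = softmax(theta):
#   grad_theta sum_a p(a) f(a) = E_{a ~ p}[ (grad_theta ln p(a)) f(a) ].
# theta and f below are arbitrary made-up values.
theta = np.array([0.1, -0.3, 0.7])
f = np.array([2.0, 0.5, -1.0])

def p(th):
    e = np.exp(th - th.max())
    return e / e.sum()

# Left-hand side via central finite differences.
eps = 1e-6
lhs = np.array([(p(theta + eps * np.eye(3)[i]) @ f -
                 p(theta - eps * np.eye(3)[i]) @ f) / (2 * eps)
                for i in range(3)])

# Right-hand side in closed form, using grad ln p(a) w.r.t. theta[b] = 1{a=b} - p(b).
pi = p(theta)
rhs = sum(pi[a] * (np.eye(3)[a] - pi) * f[a] for a in range(3))

print(np.allclose(lhs, rhs, atol=1e-5))
```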

We already have samples from \pi(. \mid s_t) which are nothing but the action a_{t+1}. Using it as a sample gives us:

\nabla  J^\pi \approx \sum_{t \ge 0} \gamma^t \nabla \ln \pi(a_{t+1} \mid s_t) Q^\pi(s_t, a_{t+1}).

We are still left with estimating Q^\pi(s_t, a_{t+1}) to complete our algorithm. It turns out we can approximate even this by sampling. By definition, Q^\pi(s, a) = E[\sum_{t \ge 0} \gamma^t r_{t+1} \mid s_0=s, a_1=a] is the expected total discounted reward collected by a rollout sampled using our policy. For Q^\pi(s_t, a_{t+1}), the value \sum_{t' \ge t} \gamma^{t'-t} r_{t'+1} is an unbiased sample estimating Q^\pi(s_t, a_{t+1}) (convince yourself that this is true). This gives us

\begin{aligned}  \nabla  J^\pi &\approx \sum_{t \ge 0} \gamma^t \nabla \ln \pi(a_{t+1} \mid s_t)\sum_{t' \ge t} \gamma^{t'-t} r_{t'+1} \\  \nabla  J^\pi &\approx \sum_{t \ge 0} \nabla \ln \pi(a_{t+1} \mid s_t)\sum_{t' \ge t} \gamma^{t'} r_{t'+1}  \end{aligned}  .

After these approximations, we are ready to state the vanilla REINFORCE algorithm:

The REINFORCE Algorithm

  1. Initialize the parameters \theta^{(0)} randomly.
  2. Generate a rollout s_0, a_1, r_1, s_1, \cdots a_T, r_T, s_T using the current policy \pi_\theta.
  3. Do gradient ascent: \theta^{(k+1)} \leftarrow \theta^{(k)} + \eta_k \sum_{t \ge 0}  \nabla \ln \pi(a_{t+1} \mid s_t) \sum_{t' \ge t} \gamma^{t'} r_{t'+1}, then go back to step 2.
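To make the algorithm concrete, here is a minimal sketch of REINFORCE on a made-up 3-armed bandit (a one-step MDP), where the return-to-go is just the immediate reward. All constants here (arm rewards, learning rate, batch size) are arbitrary choices for illustration, not part of the original algorithm statement:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-armed bandit: arm rewards are deterministic, the policy is a
# softmax over theta. Arm 0 is the best.
true_reward = np.array([1.0, 0.0, 0.5])
theta = np.zeros(3)
lr = 0.5

def pi(th):
    e = np.exp(th - th.max())
    return e / e.sum()

for step in range(300):
    probs = pi(theta)
    grad = np.zeros(3)
    for _ in range(100):                  # a batch of rollouts per update
        a = rng.choice(3, p=probs)        # step 2: sample a rollout
        r = true_reward[a]
        # step 3 summand: grad ln pi(a) * return; for a softmax,
        # grad ln pi(a) = e_a - pi
        grad += (np.eye(3)[a] - probs) * r
    theta += lr * grad / 100              # gradient ascent

print(pi(theta))  # probability mass should concentrate on the best arm (arm 0)
```

Batching 100 rollouts per update is one of the variance-reduction tricks mentioned later; with a single rollout per update the same code works but converges much more noisily.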

Keep the Samples Unbiased. Throw away Rollouts after Update.

Our entire derivation approximating \nabla J^\pi relies on a single rollout generated using the policy \pi. If the rollout was generated using another policy then our assumptions are no longer valid. This matters if we keep old rollouts around after updating the policy: those rollouts are no longer unbiased samples from the current policy and hence do not estimate the objective above. Therefore, we cannot keep those samples after an update. This is unlike Q-learning, where experiences are stored in a huge replay memory. Since our derivation relies on samples coming from the current policy, REINFORCE is an example of an on-policy reinforcement learning approach. It seems wasteful and inefficient to throw away old samples after updating the parameters. A common tactic is therefore to collect several rollouts in parallel using a large number of threads. We will talk about this in another post.

Problem with REINFORCE:

REINFORCE is not a very good learning algorithm for anything interesting. A few main issues with it are:

  1. No Exogenous Randomization: REINFORCE uses the same policy for exploration as the one being optimized. There is no way to control the flow of exploration or way to add exogenous randomization. This can hinder exploration since the policy can get stuck in specific regions and may not be able to explore outside.
  2. Variance: We used a single rollout above to make several approximations. This can affect the variance of our estimate of the gradient. Having high-variance estimate can prohibit learning.
  3. Degenerate Solutions: REINFORCE can get stuck in degenerate solutions where the policy can suddenly become deterministic while being far from optimal. Once the policy is deterministic, the updates stop since our estimated gradient:

\sum_{t \ge 0}  \nabla \ln \pi(a_{t+1} \mid s_t) \sum_{t' \ge t} \gamma^{t'} r_{t'+1},

becomes 0. This effectively kills the learning, and it has to be restarted. This situation is frequently encountered in practice, and we will call it the entropy-collapse issue. Q-learning does not suffer from this degeneracy.

Next we will discuss some solutions to the above problems and a few methods for debugging.

No Exogenous Randomization

REINFORCE’s inability to separate the exploration policy from the policy being optimized is one of its biggest weaknesses. There are at least two different ways in which this has been addressed. The first approach uses a warm start: initialize the policy using behaviour cloning or imitation learning. This also makes sense from a deployment point of view, where we have to place our agent in the real world in order to perform reinforcement learning and thus don’t want the starting policy to be randomly initialized (imagine asking users to chat with a randomly initialized conversation model!). One can then hope that the warm start will enable the policy to explore in the right region of the problem.

The other approach is to use a separate exploration policy and unbias the gradients by using importance weights. However this approach may not be suitable if the policy has close to 0 mass on actions which are chosen by the exploration policy.

Another line of research involves designing intrinsic reward function (pseudo-counts, prediction error, variance of ensemble) to incentivize the policy to eventually learn to explore the desired state space. However, these approaches generally range from having theoretical guarantees in tabular MDPs (e.g., MBIE-EB count based exploration) to having no theoretical guarantee at all (e.g., prediction error).

Reducing Variance through Control Variate

We made a remark earlier pertaining to the variance of the gradient estimate in the REINFORCE algorithm. Let’s make this more formal:

Given a rollout, \tau = \langle s_0, a_1, r_1, s_1, \cdots a_T, r_T, s_T \rangle generated by a policy \pi, we approximate the derivative of expected total discounted reward \nabla J^\pi using:

X(\tau) = \sum_{t \ge 0} \nabla \ln \pi(a_{t+1} \mid s_t)\sum_{t' \ge t} \gamma^{t'} r_{t'+1}

This estimate is unbiased i.e. E[X(\tau)] = \nabla J^\pi and its variance is given by:

Var(X) = E[(X(\tau) -\nabla J) (X(\tau) - \nabla J)^\top].

This variance can be high due to the value of \sum_{t' \ge t} \gamma^{t'} r_{t'+1}. Therefore, we want to modify the estimate, keeping it unbiased while reducing its variance. In statistics, one way to do this is to use a control variate.

Control-Variate: Let X be an estimate of a quantity with mean \mu. We pick a control variable Z with known mean E[Z] and define a new estimate Y = X + c (Z - E[Z]). This new estimate is still unbiased, as E[Y] = E[X] + c E[Z] - c E[Z] = \mu. The variance is given by:

Var(Y) = Var(X) + c^2 Var(Z) + 2c Cov(X, Z) which can be made smaller than Var(X) by optimizing the choice of c and Z.
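Here is a small sketch of the control-variate idea outside of RL, with made-up X and Z (X = e^u and Z = u for uniform u are arbitrary illustrative choices): using the variance-minimizing c = -Cov(X, Z)/Var(Z), the variance drops sharply while the mean is preserved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Control-variate sketch: estimate E[X] where X = exp(u) for u ~ Uniform(0,1),
# so E[X] = e - 1. Use the correlated Z = u, whose mean E[Z] = 0.5 is known.
u = rng.uniform(size=100000)
X = np.exp(u)
Z = u

c = -np.cov(X, Z)[0, 1] / np.var(Z)   # variance-minimizing c = -Cov(X,Z)/Var(Z)
Y = X + c * (Z - 0.5)

print(np.var(X), np.var(Y))   # variance drops sharply
print(np.mean(X), np.mean(Y))  # both means stay near e - 1
```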

REINFORCE Baseline: We need to define a control variable to reduce the variance.

We define it as Z(\tau) = \sum_{t \ge 0} \gamma^t \nabla \ln \pi(a_{t+1} \mid s_t) b(s_t), where b: \mathcal{S} \rightarrow \mathbb{R} is a function called the baseline. We can observe that E[Z(\tau)] = 0, as:

E[Z(\tau)] = E[\sum_{t \ge 0} \gamma^t \nabla \ln \pi(a_{t+1} \mid s_t) b(s_t) ]

E[Z(\tau)] = \sum_{t \ge 0} \gamma^t E[\nabla \ln \pi(a_{t+1} \mid s_t) b(s_t) ]

=  \sum_{t \ge 0} \gamma^t \sum_{s_t} P_t(s_t) \sum_{a_{t+1}} \pi(a_{t+1} \mid s_t) \nabla \ln \pi(a_{t+1} \mid s_t) b(s_t)

= \sum_{t \ge 0} \gamma^t \sum_{s_t} P_t(s_t) b(s_t) \sum_{a_{t+1}} \nabla \pi(a_{t+1} \mid s_t)

= \sum_{t \ge 0} \gamma^t \sum_{s_t} P_t(s_t) b(s_t) \nabla \sum_{a_{t+1}} \pi(a_{t+1} \mid s_t)

= \sum_{t \ge 0} \gamma^t \sum_{s_t} P_t(s_t) b(s_t) \nabla 1 = 0.

We will further set the value of c=-1, giving us our new estimate as:

Y(\tau) = X(\tau) - (Z(\tau) - E[Z]) =\sum_{t \ge 0} \gamma^t \nabla \ln \pi(a_{t+1} \mid s_t) \{ Q(s_t, a_{t+1}) - b(s_t) \}, where Q(s_t, a_{t+1}) = \sum_{t' \ge t} \gamma^{t' - t} r_{t' + 1}.

Proof of Variance Reduction(*): We still haven’t chosen a baseline function, and we set the value c = -1 without justification. Overall, we still haven’t shown that the variance gets reduced. I will provide a proof that gives a choice of the baseline function and proves variance reduction. For simplicity, we will consider the single-step case (i.e., \tau = \langle s_0, a_1, r_1, s_1 \rangle). We have:

Var(Y) = Var(X) + Var(Z) - 2 Cov(X, Z)

= Var(X) + E[(Z - E[Z])^2] - 2 E[(X- E[X])(Z-E[Z])]

= Var(X) + E[Z^2] - 2 E[XZ] + 2 E[X] E[Z] = Var(X) + E[Z(Z - 2X)].

where E[Z(Z-2X)] = E[ \nabla \ln \pi(a_1 \mid s_0) b(s_0) \cdot \nabla \ln \pi(a_1 \mid s_0) \{ b(s_0) - 2Q(s_0, a_1)\} ]

which can be simplified to = E[ \|\nabla \ln \pi(a_1 \mid s_0) \|^2 b(s_0) \{ b(s_0) - 2Q(s_0, a_1) \}]

= E[\|\nabla \ln \pi(a_1 \mid s_0) \|^2] b^2(s_0) - 2 b(s_0) E[ \|\nabla \ln \pi(a_1 \mid s_0) \|^2 Q(s_0, a_1)].

Minimizing over b(s_0) (setting the derivative with respect to b(s_0) to 0) gives the optimal baseline b(s_0) = \frac{E[ \|\nabla \ln \pi(a_1 \mid s_0) \|^2 Q(s_0, a_1)]}{E[\|\nabla \ln \pi(a_1 \mid s_0) \|^2]}.

If one makes an independence assumption between \|\nabla \ln \pi(a_1 \mid s_0) \|^2 and Q(s_0, a_1), then we get:

b(s_0) = \frac{E[ \|\nabla \ln \pi(a_1 \mid s_0) \|^2 Q(s_0, a_1)]}{E[\|\nabla \ln \pi(a_1 \mid s_0) \|^2]} =\frac{E[ \|\nabla \ln \pi(a_1 \mid s_0) \|^2] E[Q(s_0, a_1)]}{E[\|\nabla \ln \pi(a_1 \mid s_0) \|^2]} =E[Q(s_0, a_1)] = V(s_0).

One can do a similar analysis for multi-step reinforcement learning and derive the optimal baseline. While the optimal baseline has been known for many years, most empirical applications of REINFORCE or its derivatives (among which I count actor-critic methods) use an approximate baseline given by the state-value function V^\pi. I am not familiar with any literature that proves (or disproves) whether this approximate baseline actually reduces the variance. However, in empirical RL circles the variance reduction due to the approximate baseline is often taken for granted.

Importance of Regularization

REINFORCE can get stuck in degenerate solutions as pointed out before. To avoid this degeneracy, a common tactic of regularizing the objective with the entropy of the policy or KL-divergence from a reference policy is adopted. We will focus on the former method here. The entropy regularized objective is given below:

J^\pi_\lambda = \sum_s d^\pi_\mu(s) \left\{ \sum_a \pi(a \mid s) R(s, a) + \lambda H(\pi(.\mid s)) \right\},

where \lambda is a hyperparameter controlling the effect of the regularization. If one computes the derivative of the new objective one gets:

\nabla J^\pi_\lambda = \sum_s d^\pi_\mu(s) \sum_a \nabla \pi(a \mid s) Q^\pi(s, a) + \lambda \sum_s \nabla \{d^\pi_\mu(s)  H(\pi(.\mid s))\}

Most researchers however use biased gradients given below:

\nabla J^\pi_\lambda = \sum_s d^\pi_\mu(s) \sum_a \nabla \pi(a \mid s) Q^\pi(s, a) + \lambda \sum_s d^\pi_\mu(s)  \nabla H(\pi(.\mid s))

\Rightarrow \nabla J^\pi_\lambda = \sum_s d^\pi_\mu(s) \left\{ \sum_a \nabla \pi(a \mid s) Q^\pi(s, a) + \lambda \nabla H(\pi(.\mid s)) \right\} .

The entropy-regularized objective no longer satisfies our classic Bellman optimality conditions, and the optimal policy no longer remains deterministic. An easy way to verify this is to set \lambda = \infty and observe that the optimal solution of \arg\max_\pi J^\pi_\lambda is to remain uniformly random everywhere (assuming the reward function is bounded).

Evaluating Training Progress

When training a classifier for a task like ImageNet, one generally monitors the training error and the error on a held-out tune set. A decrease in training error tells you that your model has sufficient capacity, and an increase in tune-set error indicates potential overfitting. Now consider the approximated gradient of the objective for REINFORCE:

\nabla  J^\pi \approx \sum_{t \ge 0} \nabla \ln \pi(a_{t+1} \mid s_t)\sum_{t' \ge t} \gamma^{t'} r_{t'+1}

the standard way to program this using pytorch or tensorflow is to define the loss variable:

L = -\sum_{t \ge 0} \ln \pi(a_{t+1} \mid s_t)\sum_{t' \ge t} \gamma^{t'} r_{t'+1}

When doing gradient descent on this variable we get the same gradient as our approximation. This does not make L the real loss function; in fact, it is not the loss for the REINFORCE algorithm at all, but using it gives us the correct gradients, therefore we will call it the substitute loss.

Monitoring the substitute loss is not the same as monitoring the actual training loss in supervised learning. To begin with, notice that the substitute loss is positive when the agent receives only positive rewards, and negative when the agent receives only negative rewards. This is counterintuitive, as one ideally associates high loss with low return. So why is this so?

This happens because when all the rewards are positive, the loss is positive and the only way the agent can reduce it is by pushing the term \ln \pi(a_{t+1} \mid s_t) towards 0, which means increasing the probability of actions which generate positive reward (which is what we want). Similarly, when all rewards are negative, the loss is negative and the only way the agent can make it more negative is by decreasing the probability of these actions until the term \ln \pi(a_{t+1} \mid s_t) tends towards negative infinity. This eventually leads to these actions not being sampled anymore.
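The sign behaviour described above is easy to see on a made-up rollout (the log-probabilities and rewards below are arbitrary illustrative values):

```python
import numpy as np

# Substitute loss for one hypothetical 3-step rollout: log-probs of the actions
# actually taken, and per-step rewards (values are made up).
gamma = 0.99
log_pi = np.log(np.array([0.3, 0.5, 0.2]))           # ln pi(a_{t+1} | s_t)
rewards = np.array([1.0, 2.0, 0.5])                   # all positive here

# return-to-go with the gamma^{t'} weighting from the derivation above
returns = np.array([sum(gamma**tp * rewards[tp] for tp in range(t, 3))
                    for t in range(3)])

L = -np.sum(log_pi * returns)
print(L)        # positive: negative log-probs times positive returns, negated

L_neg = -np.sum(log_pi * (-returns))
print(L_neg)    # negative when all rewards are flipped to be negative
```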

Thus, instead of monitoring the substitute loss, one can monitor two things: (i) the total reward received by the agent which also represents an unbiased estimate of the actual objective and (ii) the entropy of the policy. Ideally the total reward achieved by the agent should increase and the entropy of the policy should decrease. Any stagnation means the learning is not happening effectively.

Major Failure Case for Policy Gradient

Policy gradient methods have no guarantees in theory or practice. A simple example demonstrating this comes from John Langford. There are H+1 states and two actions, and the agent always starts in state s_0. At each state s_i (i \in \{0, 1, \cdots, H-1\}), one action takes you to s_{i+1} and the other action takes you back to s_0. The action mappings are randomly assigned to each state, but the MDP is deterministic. Every state has a reward of 0 except s_H, which has a reward of 1. The game ends when the agent reaches the state s_H.


With probability close to 1, any rollout of a randomly initialized policy will fail to reach the destination state where it earns a reward, and therefore will not cause any meaningful learning when using on-policy policy gradient methods. It will require exponentially many samples to reach the state s_H, and therefore on-policy policy gradient with a randomly initialized policy is not what is called a PAC RL algorithm, which is an important theoretical guarantee.
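A quick simulation of this MDP (a sketch, with an arbitrary choice of H and random action assignments) shows how hopeless random exploration is here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Langford's combination-lock MDP: H+1 states; in each state one (randomly
# assigned) action advances, the other resets to s_0. A uniformly random
# policy must guess correctly H times in a row, with probability (1/2)^H.
H = 20
print(0.5 ** H)   # chance an H-step random rollout reaches s_H

# Empirically: fraction of H-step random rollouts that reach s_H.
good = rng.integers(0, 2, size=H)     # the advancing action for each state
hits = 0
for _ in range(10000):
    s = 0
    for _ in range(H):
        a = rng.integers(0, 2)
        s = s + 1 if a == good[s] else 0   # advance or reset
        if s == H:
            break
    hits += (s == H)
print(hits / 10000)   # essentially 0 for H = 20
```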


Despite their flaws, policy gradient methods are used widely for all sorts of AI applications. One reason behind their widespread use is their remarkable similarity to supervised learning, with which most empiricists are familiar. Thus, one can train their favourite dialogue model using REINFORCE by simply multiplying the log probabilities with a value term. Further, variants of policy gradient methods like PPO perform much better and are reasonable baselines. However, it could also be the case that most people don’t care about guarantees as long as they can solve their favourite childhood game, and as a field we must guard against this trend as we move towards more serious applications of these algorithms.