## Are we doing NLP the right way?

I have been pondering over some of these questions below for sometime and after taking a break from chasing many paper deadlines, I found sometime over Christmas to pen down my thoughts. I’ll discuss four issues (or at least the way I see them) facing NLP and I’ll offer suggestions.

Please note that I do not advocate the following as the only way everyone should be working. I believe that research is best done when people do what they are excited about and we should try diverse things with open mind specially in the early stages of a field.

### 1: Too Many Datasets and Not Enough Structure

Few months back, I made a comment jokingly that there are now “more NLP datasets than NLP researchers”. I do think there is an element of truth here. The field has witnessed an explosive growth and the result has been an over emphasis on too many problems in isolation without enough work on how these different problems relate to one another.

For reference, let’s consider how the field of mathematics progressed. People first proved  a set of theorems which are grounded in something of immediate interest such as how to approximate the area of a circle or measure height of a tower. It is of no surprise that initial mathematics was mostly real algebra and trigonometry instead of abstract algebra and algebraic geometry. A good healthy trend in mathematics is that  most maths paper will use at least some results from the literature. Unfortunately, in NLP one seldom has the opportunity to use (I do not mean citation but actual use) results from very old research papers or use the solution of another problem which has been reliably solved (e.g., POS tags). Why is this case?

*********************************

Let’s look at a bit of history first.

NLP from early 1990s to sometime around 2012 was dominated by the linguistic agenda. This agenda consisted of building a structure, one level at a time, by first solving part of speech tagging, then syntactic parsing, then approaching semantic and finally something in pragmatics/discourse. It made total sense! Surely one cannot really approach discourse if one cannot understand a sentence. And you cannot understand a sentence if you cannot even understand dependencies between its words. These problems were also generally well defined (mostly one can agree on what is the right parse tree) and were also interpretable (you can look at the parse tree to see what mistakes were made).

Unfortunately, it is beginning to be apparent that these beautiful linguistic approaches didn’t yield empirical gains (using empirical evaluations was another breakthrough that happened in NLP from 1990s and I talk about that in point 4 below). The biggest evidence of this is that most NLP papers these days in non parsing track, don’t use a syntactic parser either in their model or in some baseline. A single LSTM with attention processing the entire sentence outperforms linear models using parse trees. The field realized this and as a result, many researchers changed their agenda and left the linguistic bandwagon for deep learning supercar.

*********************************

It is disappointing that after improving score on the Penn Tree Bank from somewhere in 80% to now 96%, we still don’t end up using parse trees as features in our model.  But this is mostly a tragedy for the linguistic agenda and not the field cause across the board  performance on several datasets have improved (due to better feature representations using deep learning, improved learning and optimization algorithms and just more researchers working), new problems have blossomed, attendance and submissions to conference has skyrocketed. Another change that has taken place is that more people are now directly working on the final problem of interest like Question Answering, Machine Comprehension, Instruction Based Navigation etc. The general approach these day appears to be:

1. Create a new dataset and problem.
2. Improve the performance on this dataset by generally using more and more complicated models.
3. Once the performance reaches a certain level and further improvement is hard, interest dies and people move on to step 1.

A new problem however is appearing.

Rarely a solution for one problem informs solution for another problem. The field is littered with dead datasets or solutions which are not used again.

I believe this is happening cause we are not building a structure. There are too many problems and it is not clear how these problems relate to one another.

For reference, one crowning achievement of complexity theory was to reduce thousands of problems to few complexity classes and study how these relate to one another. There are thousand different NLP problems. What is our analog of a NP class?

Since linguistic agenda of studying a hierarchy of linguistic structure hasn’t succeeded after decades of work. What new structure could we use? The following is one suggestion which I call Learning Reduction Agenda:

1. Narrow down to problems that are of immediate interest and well defined. I can name for example, knowledge base or visual question answering, vision based instruction following and multiple choice reading comprehension. These will be our  analog of Pythagorean theorem.
2. Create new problems which use solutions to (1) for evaluation or reduction. Here is one example: One can define a machine translation metric in terms of question answering. If the gold French text and the translated French text can be used to answer the same question then they have the same content and should be treated as paraphrase (I am assuming there is an executable environment).
3. Gradually expand the problem set by doing 2.

[Learning Reduction Agenda] Researchers should think carefully where a new problem stands in this structure. If it is of end interest with an unambiguous evaluation metric it belongs in the ground floor otherwise it should reduce to a problem in the lower structure and thereby reveal its position.

### 2: Solving an Ill-Defined Problem

There are several problems in NLP that do not have a well defined metric. Most of these problems involve generating some text. For example, a chat bot having a conversation with a human,  machine translation, image caption generation. The solution to this problem has been to defer to humans for evaluation or use proxy evaluation metrics such as BLEUE, ROGUE, METEOR, CIDER etc. If one really buys into these metrics then they can use standard method to maximize the performance of their approach with respect to this metric. Some sophisticated researchers often prefer an ensemble of metrics or report performance on several metrics.

The main question still remains: what do you want to achieve from the generated text? An image caption that says, “this is an image” is an accurate caption but doesn’t contain any content. A chat bot that says, “Hmm…” accurately mimics the attitude of a person who doesn’t want to be bothered. I believe this question is not adequately answered in most research papers. We should understand what this problem is going to be used for. For example, if the image caption generation model is being used to describe the image to a visually impaired person then we must understand what kind of information that person seeks. Are they interested in hearing about the type of image (e.g., impressionist, cubism etc.) or the color of the image (e.g. ,pale blue background with pink tinge), or the number of people in the image. We should understand what the output of our models is going to be used for and merely maximizing BLEU, METEOR etc. is not enough.

Following the learning reduction agenda, one can for example define this evaluation metric in terms of the accuracy of a well trained QA system to answer questions based on the generated text. If the generated text is used for assisting a human then their performance could be used as the evaluation metric. Or if you are using a chat bot for retention then it makes sense to use the expected retention time as the choice of evaluation metric (chatbot is one of those problems for which it is really hard to think of any scenario where it could be useful).

### 3: Over Reliance on “Linguistic Intuition”

Another thing that one hears is the idea of “linguistic priors” or “linguistic intuition“. Often times, papers report using a linguistic prior as one of their main contribution and back it up by showing performance gain. An example of this is the use of recursive structures for deep learning (Stanford really championed this research) that was very popular in 2013-2015. It is very well known that sentences have tree structures and therefore it made “total sense” when recursive structures were employed for sentence representation. These were some very popular research work at that point. However, they fell out of fashion later largely due to results showing that same gain can be obtained by not using recursive structures and simply doing averaging.

The point where it gets slippery is that these linguistic intuitions could be wrong or when coupled with other ingredients (models, optimization, learning) may in fact reduce performance. I have often tried several changes to my model that I expected should have increased performance only to see the performance go down (from what I hear this is a general experience. For a general discussion related to this, see this NIPS talk by Ali Rahimi).

We should not over rely on linguistic intuition (and the failure of the linguistic agenda should be a warning in itself) and try more broad range of solutions. After all the current deep learning trend did not emerge out of the mainstream.  Who knows, maybe the answer to NLP are hidden in some abstract mathematical objects.

### 4: Product-less Off Line Evaluation

The field of NLP underwent a revolution with the use of datasets and rigorous evaluation. Nowadays nearly all papers at NLP conferences either introduce a dataset or evaluate on an existing dataset. One can say, the field is going through a dataset revolution (or bubble?). However, existing datasets have an issue. They only offer offline evaluation i.e., they are static and not continuously changing. This is different from online evaluation where one subjects their algorithm to live use by various people across the world and one computes statistics based on hourly (or other timescale) performance. Most applications in industry have an online evaluation: the Amazon echo dot and Google Translate engines are both being evaluated by users every minute who can type anything based on something that just happened in the world. For example, a Korean video called “Oppa Gagnam style” going live can spur users to try this phrase on these algorithms. Offline datasets just don’t allow us to make such evaluation (one can however simulate an online audience but this has other issues).

I was told by a researcher at a top industry that product engineers do not get convinced of the utility of a research algorithm by performance on some datasets. The gold standard for industry is A/B testing by showing improvement on an end metric (e.g., engagement, profit, or other utility) with real audience and real setting. Industry standards are quite high and rightfully so! A twitter chatbot that becomes racist can be a PR nightmare for the company (you can read more about chatbot fails here). There is too much to loose. The damages are not only for the company but also real users. Think of the following example:

Human: “I feel like just killing myself. No one cares about me.”

Some Conversation Engine: “Sure.”

If this is part of offline dataset then generating “Sure” results in a minor reduction in BLEUE score but this can abet a suicide and cause real harm if this was a live testing. I would even argue that this error should get a score of $-\infty$, thus giving a total score of $-\infty$ even when every other response was generated perfectly.

While using human evaluation and presenting Likert score is better, I think we should go the full extent. We need online evaluation where anyone can use these systems and one can see the performance for themselves. Never Ending Language Learning (NELL) is one such popular NLP project to do that. In robotics, I was part of a team that attempted to do something similar (this project is no longer active). Maybe we can create an online platform where people can register their NLP systems and get some feedback in the form of like or dislike. These systems can also teach each other. We do need more such evaluation and I also believe that this is the only way to truly judge an NLP system.

## Writing and Proof Reading Research Code

Most papers in Artificial Intelligence (AI) research these days have an accompanying source code. Sometimes authors will release their code with their paper after publication (our research group always does) and sometimes for various reasons they don’t (e.g., the code is built at an industry using another source code which is proprietary; code requires a lengthy internal peer review before being allowed to release).

### Issues with using Released Source Code

Even when the code is released, you can experience the following problem:

1. Code doesn’t compile or run.

You cannot run the code. You even email the authors but they don’t reply. This is a common issue and the main reason is that unfortunately writing a high quality code doesn’t align very well with incentives for a PhD student. You can spend 3 months improving your code but your advisor may want to see another research paper. Further, the student may have graduated and now works in quant and has no interest in replying to your email about their code from 10 years ago.

I strongly advocate for incentivizing the quality of the code and giving strong preference to papers with accompanying code when accepting a paper.
The standard argument against this is that additional burden will be placed on reviewers who don’t even have time to read the supplementary paper but I feel that we could automate parts of it (in any case we need to address it more).“Accepting a paper without a source code is to an approximation like accepting a cheque (check) without a signature (it feels like a cheque, it has the right number, the memo looks beautiful but there is no guarantee).”

Doing this will however slow down the research cause for whatever reasons embedded in human nature most researchers end up submitting on the last day of the conference deadline and often times the code is just not ready to be submitted (unfortunately this has happened to me and to almost everyone I know). Discouraging such behaviour will be a good thing for the community. The plot below shows that most papers for NAACL 2018 were submitted on the last day (special thanks to program chairs for this analysis). Full analysis is available here.

2. Code gives wrong results.

Even when you can compile and run the code, you find that results are different from what the paper reports or it outputs several numbers without telling which one to use. I argue that errors of this type are much more common and difficult to deal with. It can happen for reasons that authors cannot for see e.g., the version of the underline library changed (DyNet is particularly susceptible to giving different result based on different versions). Remember that most codes are developed by PhD students who aren’t necessarily the best code developers (this is one skill that we need to incentivize as a community).This kind of dependency hell like situation is common in software developments and solutions have been discovered to handle these issues. Releasing Amazon AMI images or docker files is one way to handle this situation. I have used both of these solutions in the past and they greatly helped me. For example, if you are running experiments across different machines you can get different results even with the random seed fixed and using synchronous computation (due to different libraries or the way they compile on that system). Docker can help fix this issue and also ensures that performance changes you see are not cause of using a different machine but due to the changes you made).

3. Code performs poorly on a different but related task

Most research papers using deep learning these days are reporting increasingly complicated neural networks for smaller and more specific task. A main problem with this approach could be a form of community overfitting. I define this as:

Community Overfitting:“Phenomenon where a research community as a whole overfits on a single or a few tasks by trying many solutions even though an individual researcher may have only evaluated on the test set once.”

A colleague of mine for example, recently found that several datasets that were released for visual question answering perform around chance (i.e. 50% accuracy) when used for a new dataset that she collected. This is in stark contrast to the general feeling around (particularly on social media) that say that we have made great improvements on tasks requiring understanding in vision and language.

A solution to this might be to stop using very old datasets (like the Penn Tree Bank), maintain a held out test set that is not revealed to the public (there is a contradiction here to keeping everything public) and on which authors cannot evaluate more than once a month (or a relevant long time frame). We also need to be careful when stating our results. Instead of saying, “we present a model that improves on visual question answering problem.”, say “we present a model that improves on the visual question answering dataset XYZ (cite ABC)”. I believe it is generally useful to  be cautious and under claim and then enjoy being proven right for new datasets or tasks, than over claim and find your claims proven wrong.

## Three Main Crimes in an AI Source Code

When developing big code bases for research, it is helpful to debug them carefully to avoid claiming wrongly. In general, however one cannot guard against all possible sort of errors (e.g., an error in the underline library is extremely hard to detect). I have a feeling that there is an underline power law like distribution where making sure that your code has no error with probability 1-p requires number of debugging hours that grow polynomially with 1/p. There is a fair bit of research now on writing provable code (for example, see this paper from Selsam, Liang and Dill). However, today most research code don’t use these automated proving techniques. I have come up with the following quick checks to make sure that your code doesn’t contain the gravest type of errors:

1. Train-Test Dilution.

This is the classic machine learning error where you use information from the test data when training. This could involve training on the test data or more subtle information leakage from the test data. The standard guard against it to make sure your dataset is split into 4 disjoint files: train, tune, dev and test and making sure that these files are indeed disjoint. Then using only the train file for training and the tune file for finding optimal hyperparameters. Dev file should only be used for occasionally checking for performance and the test file should only be used once (generally two weeks before the deadline). Using different files as opposed to using indices is better as using wrong indices accidentally won’t let you see the wrong split during training (or evaluate on train data when testing). I do want to point out that this error can happen is very subtle form. E.g., once I was using a dataset with train, dev and test splits. I was using a model for transfer learning that was publically available for download from an existing code base. I was getting very high dev performance but then I got very low test performance during testing (generally you expect a ~1% drop in performance when running on the test set but this was a ~15% drop). I then realized that the publically available model used labels from the dev set when training. Fortunately, we were able to fix this error and improve our model in other ways to gain back our performance before submitting our results.

2.  Label Leakage.

Label leakage is the phenomenon where your model uses the gold labels (or some prohibited information) for an input when evaluating. For example, a classifier that has to predict some label for an image but that label is encoded in the image itself say as a watermark, or in more direct case given as input to the classifier. The standard guard against label leakage is to separate prohibited information in your class. E.g., if you you are using Java then all labels that are prohibited should be made private and once is only allowed access to a scoring function which evaluates to a score using the gold label and the predicted label.

3. Wrong Evaluation Metric.

Wrong evaluation metric is obviously give you wrong results. E.g., one can compute F1 score wrongly or write their own buggy functions for BLEU score. I remember that using Python NLTK BLEU scoring function used to give wrong results. In general, when using a complicated evaluation metric, use the test scripts that previous publications have used or when using a dataset with an accompanying evaluation script use that. When writing your own implementation for an evaluation metric,  I suggest manually computing the score and then checking it with the evaluation metric for a set of randomly chosen inputs.

When supervising an intern I ask them to make sure that error type 1 to 3 doesn’t exist in their code. This has the advantage of ensuring that their code is trustable by which I roughly mean that the results from the code can be reported for that task. Following argument makes it more clear.

Theorem: If your code doesn’t contain error 1 to 3 then the result from your code can be reported.

Proof: Say you train your model on this task and then evaluate this model. From ensuring against error 1 and 2 you make sure that you haven’t seen the test data or labels from the test examples when training. The model that you train is then evaluated on the test set and not on train set since you don’t have error 1. So now you are evaluating a model that is not using the test labels on the test set and hasn’t seen this test set before. Since your evaluation metric is correct, therefore your results are also correct.

This is only a proof against high level errors. If you save the model wrongly, give wrong inputs to the evaluation metric or use other information in the world that you are not allowed to use even though you are not using the test set, then the above proof doesn’t hold. But I hope I am able to get the rough idea of “trustable results” general idea across.

### Conclusion

In this blog post, I talked about certain errors in running released code and how to make sure that your code doesn’t have the gravest error before submitting these results. I also talked about a rough idea behind trustable results. I haven’t talked about debuggability or whether what one claims in the paper is what their code does For example, one can not make error of type 1 to 3 but their model could be incorrectly implemented or different from what they describe in the paper. In general, it is good to see that community is talking more about these issues and I hope that my blog post will help the discourse.

## Getting into Top CS PhD Programs in the US

December 15th deadline is approaching and thousands of PhD applicants from around the world would be submitting several pieces of documents for an increasingly more competitive PhD programs at the top US schools. A repetitive theme that one hears from some old professors is that they wouldn’t been accepted into the PhD program at their own school with the application they submitted.

I learned a lot in the process of applying to PhD schools five years ago and I’ll offer some suggestions in this post for being competitive and avoiding common mistakes. The information below is specific to the artificial intelligence (computer vision, natural language, machine learning) field but could apply to  other fields as well.

## Die has been cast

Unfortunately to a large extent the admission process is predecided. Many programs generally take students from a select few universities (of similar ranking) that they have come to trust over the period of time and many of these students would have already written papers with some professors at these universities as part of internship or exchange program. This is not necessarily a bad thing and from a professor’s perspective it makes sense to hire someone you have worked with and who has been productive over an unknown person that you are unsure of. This is also true for jobs and life in general, your connection and where you come from is a good indicator of where you will be and the more at top you are the more opportunities you have to rise even further. Further a lot of this advantage could be very subtle e.g., children of professors would know much more about the university and inner working and naturally this makes them more competitive.

However, for an applicant from a less prestigious school or one who hasn’t had the opportunity to find good internships can be at severe disadvantage. I remember my second year of summer at IIT Kanpur when I was very desperate to find a summer internship (I ended up spending the summer reading about advanced data structures and hashing techniques on my own). While I did not get an official internship position, students at Ivy Leagues, Stanford or UC Berkeley have plenty of such opportunities.

However, to balance what I said, I must add that if you are a student from a top-tier university applying to a PhD program then bar for you is also higher and not being able to take advantage of your alma mater makes your application look much weaker than a student with similar credentials from a less prestigious school.

By writing a quality research statement describing any personal research that you did as a course project or on your own, one could negate the above disadvantage. However something like the following is generally true for top CS PhD programs:

### ″If you haven’t published, have not done top-tier internship, do not have good grades and do not have great recommendation letters then your chances are very slim (<2%)″

and none of the above can be fixed in a couple of weeks before the deadline (unless you solve P vs NP or some major result).

A colleague of mine has pointed out to me that the admission committee (to a varying extent) looks for (1) capability and technical skills to succeed in a competitive research program, (2) interest in research, (3) likelihood to succeed and stick through with it when things get difficult. Experience in a research lab that has led to a publication can demonstrate all three of these, but these could also be demonstrated separately in unique circumstances, in which case properly presenting them in a personal statement is crucial.

## The Usual Thing

The material below has been repeated by several more eminent people elsewhere so I’ll cover it briefly. An application consists of the following:

• CV: consisting of list of papers that you have published, list of internships and your grades. This is an objective history of your capabilities. A good candidate would have one or few publications (only top-tier counts), would have done internship at good schools or top research labs and would have high grades.
• Research Statement: Your description of your capabilities and experiences. You can use this to talk about any specific hardship that you had to face. E.g., if you were very sick for a specific semester which affected your grades or performance.
• Recommendation Letters: Other people’s description of your capabilities. It is very important to get letters from people who are well known and who know you well. To a large extent, hiring process is based on trust which is a transitive relation. So if a school trusts a person and they trust you then the school trusts you. Generally schools require 3 recommendation letters and a common solution is to get recommendation from your thesis supervisor, someone with whom you did an internship and someone with whom you collaborated on a project. Getting a recommendation letter that says, “person XYZ did good in my class” is a near waste  and shows you in bad light. It is advisable to plan this ahead in time and reach out to your letter writers to give them enough time. No one likes to be rushed particularly when they are doing you a favour.  If you did not go to a school in US, try to get recommendation from someone the university may know. You can look at the success rate of applicants who asked a letter from a specific person. You should also let your letter writers that you need a strong letter and it is better for them to say no than to write a weak one.
• GRE and (TOEFEL) score: This is generally not a problem for students who are targeting top-tier US schools. However, TOEFEL could be hard for students from non-English speaking countries like China. If this is the case, you can seek some coaching in advance for speaking and writing. This is going to be very important later in your career.

### Common Mistakes

The Subset Rejection Rule: This is not really mistake in the sense that you could do something about it. But if you are a strict subset of another applicant then your rejection chances increases. For example, if there is another student in your university who did whatever you did but has better grades or has more experience, and if there is no other distinguish criteria then universities have no incentive to hire you over the other. Further, the admission process may want to cover more universities and countries and therefore may only select a single applicant from your school which then rules you out.

Talk to Alums: The biggest mistake you can make, which I did make, is to not seek help  from your alums who are studying at the school of your liking. They are the ones to provide the most tailored advice that anyone can get you.

Applying too Narrowly: Even if you are a very capable applicant, there are aspects of the decision making that is not in your hand. A school may not be taking any students in that field that year or a professor may have already committed to working with someone else. For this reason, one should apply to several schools unless you are very sure that you would rather not do PhD if you don’t get your favourite school.

Writing Lousy Research Statements: I bet each year some applicants write a research statement that reads like:

A: “My parents gave me a toy as a kid which I quickly proceeded to disassemble to figure out its internal working since then I decided…..”

B: “I saw Beautiful Mind/Goodwill Hunting/Sherlock/House MD… and I was inspired to be a…”

C: “I want to do a PhD to make my family proud…”

The first one doesn’t say much about you besides that you like breaking toys. The second is a similar type of mistake and the last one shows that you are not interested in pursuing PhD cause you like doing research.

A good research statement can contain your research experience, the problem you really enjoyed and could contain stuff like,

“In the CS470 course, I was introduced to the spectral properties of the Laplacian matrix and I was intrigued by capabilities of such an abstract quantity to explain several properties of graphs. This made me decide to spend the next summer doing research on this topic with Professor ABC QRB during which we investigated the use of spectral methods for analyzing XYZ problem which arises in MNR. By the end, even though we couldn’t solve the problem in its entirety, we felt confident enough by the novelty of our findings to submit a paper at ICML 2018. I am attaching this paper with my statement. I am interested in continuing to do research in spectral methods with Professor MNR whose work RQR is very similar to our finding…..”

This statement, while not fully fluent, shows a genuine interest in something concrete, shows your confidence in pursuing a topic you really like and seeing it through the end and at last that you did research the university properly. PhD is a lot about a kiddish fascination with things that probably no one else on Earth except you and your advisor appreciates.

## How to find a Good Advisor

If you get accepted or make it through the first round, the university may arrange interviews with the professors that you have listed in your research statement or someone related to your interest. In some universities, you get hired by a professor and so your advisor is predecided. At Cornell University that is not true and you get freedom for a year to try out various advisors (and for them to try you) to find the best match. In general, finding a good advisor is quite hard cause your interest may change or you may not match in terms of working style or personality. For example, in AI certain groups publish more than 50 papers per year while some groups publish only 3-4 papers. Some groups have projects with several first authors while other groups will generally have a single student own the project completely. Some groups will be more abstractly inclined while some groups will be more empirical. Some advisors maybe more hands off while others maybe more hands on. In general, finding the right balance could take time and it is important to get it right otherwise you can have a very unhappy PhD life.

## Big Universities vs Small Universities

One final thing I want to talk about is the dilemma of going to a big university (defn: place with lots of researchers) cause of the more number of options available or a small university (defn: place with few researchers) cause you really want to work with a specific professor there. In general, I always recommend going to the big universities particularly if you are undecided. While the second idea is idealistic and could work if you are very sure about working with that professor and they plan to stick around and help you graduate. In practice, several things can happen that is not in your control: the professor you are working with is very good and gets an offer from a top school and decides to leave; they may not get a tenure or in worst case they die. The first two conditions are quite common actually. My first advisor decided to quit academia to open a startup after my second year. In the worst case, you’ll be left in a place with few backup options (for example the professor moves to a top school and cannot bring you along). Top universities have way more options and generally have someone who works on your topic of interest. They also have great connections with industries and provide a safe option in case your interests change.

Acknowledgement: I am thankful to my friend and colleague Valts Blukis for taking valuable time to offer suggestions.

Disclaimer: The opinion above are my alone and do not represent my research group,  Cornell University, Microsoft Research or any other institution that I am associated with or was.

## Mathematical Analysis of Policy Gradient Methods

Most papers using reinforcement learning these days use the policy gradient class of learning. In this post, I will cover a basic tutorial of policy gradient, uncover some confusion on using baseline and regularization and give some suggestion for debugging these learning algorithms that I have found useful in the past.

Policy gradient methods is a class of reinforcement learning algorithm that optimize the reinforcement learning (RL) objective by performing gradient descent on the policy parameters. As opposed to value function based methods (SARSA, Q-learning), the central object of study in policy gradient methods is a policy function $\pi(a\mid s; \theta)$, which denotes a probability distribution over actions given a state. $\theta$ are the parameters of this policy. Assuming we are optimizing using stochastic gradient methods our updates are given by:

$\theta^{(t + 1)} \leftarrow \theta^{(t)} +\eta_t \nabla_\theta J^\pi$

The objective $J^\pi$ can be expected total discounted reward, expected average reward or something else. We will stick to total discounted reward for this discussion. The aim now is to express $\nabla_\theta J^\pi$  in a form that is easier to work with. This is done using the policy gradient theorem.

Let $M = \langle S, A, R, P, \gamma, \mu \rangle$ be our MDP where $S, A, R, P$ are the state space, action space, reward function and transition probability respectively. $\gamma$ is the discounting factor and $\mu$ is an initial state distribution.

The expected total discounted reward can be expressed as:

$J^\pi = E[\sum_{t\ge 0} \gamma^t r_t] = \sum_s \mu(s) V^\pi(s)$

See my previous post if you don’t know what $V$ function is. Using above equation we have $\nabla_\theta J^\pi = \sum_s \mu(s) \nabla_\theta V^\pi(s)$. Policy gradient theorem (Sutton et al. 1999) computes $V^\pi(s)$ by repeatedly applying the Bellman self consistency theorem. It really is just that.

$\nabla_\theta V^\pi(s) =\nabla_\theta \left\{ \sum_a \pi(a\mid s) Q^\pi(s, a) \right\}$

$\nabla_\theta V^\pi(s) = \sum_a \pi(a \mid s) \nabla_\theta Q^\pi(s, a) + \sum_a Q^\pi(s, a)\nabla_\theta\pi(a \mid s)$

using $Q^\pi(s, a) = \sum_{s'} P^a_{s, s'} \left\{ R^a_{s, s'} + \gamma V^\pi(s') \right\}$ we get:

$\nabla_\theta Q^\pi(s, a) = \gamma P^a_{s, s'} \sum_{s'} \nabla_\theta V^\pi(s')$

so now we have an expression for $\nabla_\theta V^\pi$ in terms of itself. Writing it down together we have:

$\nabla_\theta V^\pi(s) = \sum_a Q^\pi(s, a)\nabla_\theta\pi(a \mid s) +\sum_a \pi(a \mid s)\gamma P^a_{s, s'} \sum_{s'}\nabla_\theta V^\pi(s')$

we will change the notation as $s \rightarrow s_0, a \rightarrow a_1, s' \rightarrow s_1$ giving us:

$\nabla_\theta V^\pi(s_0) = \sum_{a_1} Q^\pi(s_0, a_1)\nabla_\theta\pi(a_1 \mid s_0) +\sum_{a_1} \pi(a_1 \mid s_0)\gamma P^{a_1}_{s_0, s_1} \sum_{s_1}\nabla_\theta V^\pi(s_1)$.

unfolding this equation gives us (verify it for yourself):

$\nabla_\theta V^\pi(s_0) =\sum_s \sum_{t \ge 0} \gamma^t P(s_t = s \mid \pi, s_0) \sum_a \nabla_\theta \pi(a \mid s) Q^\pi(s, a)$.

where $P(s_t = s \mid \pi, s_0)$ is the probability of being in state $s$ after time $t$ when starting in state $s_0$ and taking actions according to the policy $\pi$. Note that our setup does not include multiple agents, policies dependent upon time or non-Markovian policies.

Discounted unnormalized visitation probability: The term $\sum_{t \ge 0} \gamma^t P(s_t = s \mid \pi, s_0)$  appears fairly often in RL theory so it is given a notation of $d^\pi(s \mid s_0)$. Essentially this value is high for a state $s$ if it is likely to be visited by an agent sampling actions according to the policy $\pi$ and starting in state $s_0$. This term is called “discounted unnormalized visitation probability”. It is “visitation probability” since it is higher for state which are likely to be visited. It is discounted cause probability terms are discounted with $\gamma^t$ factor and it is unnormalized cause it may not sum to 1. In fact for infinite horizon:

$\sum_s d^\pi(s; s_0) = \sum_s \sum_{t \ge 0} \gamma^t P(s_t = s \mid \pi, s_0)$

$= \sum_{t \ge 0} \gamma^t \sum_s P(s_t = s \mid \pi, s_0) = \sum_{t \ge 0} \gamma^t* 1 = \frac{1}{1-\gamma}$.

and for finite horizon with $\gamma =1$ we have:

$\sum_s d^\pi(s; s_0) = \sum_s \sum_{t \ge 0}^T P(s_t = s \mid \pi, s_0)$

$\sum_{t \ge 0}^{T-1} \sum_s P(s_t = s \mid \pi, s_0) = \sum_{t \ge 0}^{T-1} 1 = T$

so in  either case the sum of $d^\pi(s; s_0$ is equal to the time horizon (for episodic undiscounted case) or effective time horizon (for unending discounted case).

using the notation we have: $\nabla_\theta V^\pi(s_0) = \sum_s d^pi(s; s_0) \sum_a \nabla_\theta \pi(a \mid s) Q^\pi(s, a)$

and which gives us:

$\nabla_\theta J^\pi = \sum_{s_0} \mu(s_0) \sum_s d^pi(s; s_0) \sum_a \nabla_\theta \pi(a \mid s) Q^\pi(s, a)$

or using the notation of Kakade and Langford 2002, we define $d^\pi_\mu(s; s_0) = \sum_{s_0} \mu(s_0) d^\pi(s; s_0)$ which allows us to express the gradient as:

$\nabla_\theta J^\pi = \sum_s d^\pi_\mu(s) \sum_a \nabla_\theta \pi(a \mid s) Q^\pi(s, a)$.

Another way to express objective: Using the notation of visitation distribution, allows us to express the expected total discounted reward objective in a form that gives another interpretation to the policy gradient theorem. Observe that:

$J = E[\sum_{t \ge 0} \gamma^t r_t] = \sum_{t \ge 0} \gamma^t E[r_t]$  (linearity of expectation)

reward $r_t = R(s_t, a_{t+1})$ where $s_t, a_{t+1}$ is the state in which the agent is at time $t$ and $a_{t+1}$ is the action taken at that time. Therefore:

$E[r_t] = E[R(s_t, a_{t+1})] = \sum_{s} P(s_t = s) \sum_a \pi(a \mid s)$.

Plugging it in we get:

$J^\pi = \sum_{t \ge 0} \gamma^t E[r_t] =\sum_{t \ge 0} \gamma^t\sum_{s} P(s_t = s) \sum_a \pi(a \mid s)$.

Rearranging the terms we get:

$J^\pi = \sum_{t \ge 0} \gamma^t E[r_t] =\sum_{s} P(s_t = s) \sum_{t \ge 0} \gamma^t\sum_{s} P(s_{t-1} = s) \sum_a \pi(a \mid s)$

and then using the notation of $d^\pi_\mu(s)$ we get:

$J^\pi = \sum_s d^\pi_\mu(s) \sum_a \pi(a \mid s) R(s, a)$.

Compare this with the gradient of the objective:

$\nabla J^\pi = \sum_s d^\pi_\mu(s) \sum_a \left\{ \nabla \pi(a \mid s) \right \}Q^\pi(s, a)$.

It appears as if we don’t take gradient over the term $d^\pi_\mu(s)$ and it comes with the price of replacing $R(s, a)$ with $Q^\pi(s, a)$. It also allows us to derive a corollary for derivative of visitation distribution. Derivative of visitation distribution has been studied in other context, not using this corollary, which I will probably cover in a future blog.

Corollary*: $\sum_s \nabla_\theta d^\pi_\mu(s) \sum_a \pi(a \mid s) R(s, a) =\sum_s d^\pi_\mu(s) \sum_a \left\{ \nabla \pi(a \mid s) \right \}\{ Q^\pi(s, a) - R(s ,a)\}$

Proof: $\nabla J^\pi = \sum_s d^\pi_\mu(s) \sum_a \left\{ \nabla \pi(a \mid s) \right \}Q^\pi(s, a)$ (from policy gradient theorem).

$\nabla J^\pi = \nabla \left( \sum_s d^\pi_\mu(s) \sum_a \pi(a \mid s) R(s, a) \right)$ (from alternative expression for $J^\pi$).

$= \sum_s \nabla d^\pi_\mu(s) \sum_a \pi(a \mid s) R(s, a) + \sum_s d^\pi_\mu(s) \sum_a \nabla \pi(a \mid s) R(s, a)$

comparing the two equations and rearranging gives the result.

## The REINFORCE algorithm

REINFORCE algorithm (Williams 1992) comes directly from approximating the gradient given by the policy gradient objective. Firstly, we will review sampling for the wider audience.

Brief Summary on Sampling: Given an expectation $E_p[f] = \sum_s f(s) p(s)$ of a function $f: X \rightarrow \mathbb{R}$ with respect to probability distribution $p$ over $X$, we can approximate the expectation by drawing $N$ samples $x_i \sim p(.)$ and using empirical mean given by $\mu_N = \sum_{i=1}^N \frac{f(x_i)}{N}$ as our approximation. From the law of large number this approximation will converge to $E_p[f]$ as $N \rightarrow \infty$. The estimate $\mu_N$ is called unbiased since its mean is the quantity we are approximating i.e. $E_p[f]$. $E_{x_i \sim p}[\sum_{i=1}^N \frac{f(x_i)}{N}] = \sum_{i=1}^N \frac{1}{N} E_{x_i \sim p}[f(x_i)] =\sum_{i=1}^N \frac{1}{N} E_p[f] = E_p[f]$. Note that if our samples were from another distribution then our estimate may not be unbiased. Another quantity of interest for an estimate is its variance given by $var(X) = E[(X- E[X])^2]$, which measures how far can the estimate deviate on expectation from the mean. For the above estimate, we have $var(\sum_{i=1}^N \frac{1}{N}E_{x_i \sim p}[f(x_i)]) = \frac{var(x)}{N}$. Thus, as we collect more samples our estimate deviates less from the mean and if the variance $var(x)$ is low then even fewer samples can help us get accurate estimate.

Vanilla REINFORCE: A quick look at the policy gradient tells us that we can never compute it exactly except for tiny MDP. We therefore want to approximate the gradient of $J^\pi$ given by policy gradient objective and we will do it using sampling. We show below how to do it,

$\nabla J^\pi = \sum_s d^\pi_\mu(s) \sum_a \left\{ \nabla \pi(a \mid s) \right \}Q^\pi(s, a)$

$=\sum_s \sum_{t\ge 0} \gamma^t P(s_t = s \mid \pi) \sum_a \left\{ \nabla \pi(a \mid s) \right \}Q^\pi(s, a)$

$\sum_{t\ge 0} \gamma^t \sum_s P(s_t = s \mid \pi) \sum_a \left\{ \nabla \pi(a \mid s) \right \}Q^\pi(s, a)$

Let’s say we sample a rollout $\langle s_0, a_1, r_1, s_1, a_2, r_2, \cdots a_T, r_T, s_T \rangle$ using the policy $\pi$. A rollout is generated by starting from a sampled state $s_0 \sim \mu(.)$ and then taking actions in the current state using the policy and receiving reward. Then note that $s_t \sim P(. \mid \pi )$. Then using the sampling approach described above,

$\sum_s P(s_t = s \mid \pi) \sum_a \left\{ \nabla \pi(a \mid s) \right \}Q^\pi(s, a) \approx \sum_a \left\{ \nabla \pi(a \mid s_t) \right \}Q^\pi(s_t, a)$

Note that we only use a single sample here ($N=1$) for approximation!

After this approximation we have,

$\nabla J^\pi \approx \sum_{t\ge 0} \gamma^t \sum_a \left\{ \nabla \pi(a \mid s_t) \right \}Q^\pi(s_t, a)$

If we knew $Q^\pi(s, a)$ perfectly for every state $s$ and action $a$ then we can compute the above objective perfectly provided the number of actions are not prohibitively large. However, in general we do not have these conditions therefore we want to get ride of the second summation as well.

Unfortunately, the summation is not written in the form of an expectation i.e. $\sum_a f(a) p(a)$ for some probability distribution $p$. Therefore we cannot apply the sampling approach. To solve this, we will use calculus 101 to express it in an expectation format by using $\sum_a (\nabla p(a)) f(a) = \sum_a p(a) (\nabla \ln p(a) )f(a) = E_{a \sim p(.)}[ \nabla \ln p(a) f(a)]$. Thus,

$\nabla J^\pi \approx \sum_{t\ge 0} \gamma^t \sum_a \left\{\pi(a \mid s_t) \nabla \ln \pi(a \mid s_t) \right \}Q^\pi(s_t, a)$

We already have samples from $\pi(. \mid s_t)$ which are nothing but the action $a_{t+1}$. Using it as a sample gives us:

$\nabla J^\pi \approx \sum_{t \ge 0} \gamma^t \nabla \ln \pi(a_{t+1} \mid s_t) Q^\pi(s_t, a_{t+1})$.

We are still left with estimating $Q^\pi(s_t, a_{t+1})$ to complete our algorithm. Turns out, we can approximate even this by sampling. By definition, $Q^\pi(s, a) = E[\sum_{t \ge 0} \gamma^t r_t \mid s_0=s, a_1=a]$ is an expectation of total discounted reward collected by a rollout sampled using our policy. For $Q^\pi(s_t, a_{t+1})$, the value $\sum_{t' \ge t} \gamma^{t'-t} r_{t'+1}$ is an unbiased sample estimating $Q^\pi(s_t, a_{t+1})$ (convince yourself that this is true). This gives us

\begin{aligned} \nabla J^\pi &\approx \sum_{t \ge 0} \gamma^t \nabla \ln \pi(a_{t+1} \mid s_t)\sum_{t' \ge t} \gamma^{t'-t} r_{t'+1} \\ \nabla J^\pi &\approx \sum_{t \ge 0} \nabla \ln \pi(a_{t+1} \mid s_t)\sum_{t' \ge t} \gamma^{t'} r_{t'+1} \end{aligned}.

After these approximations, we are ready to state the vanilla REINFORCE algorithm:

The REINFORCE Algorithm

1. Initialize the parameters $\theta^0$ randomly.
2. Generate rollout $s_0, a_1, r_1, s_1, \cdots a_T, r_T, s_T$ using the current policy $\pi_\theta$
3. Do gradient ascent: $\theta^{t+1} \leftarrow \theta^t + \sum_{t \ge 0} \nabla \ln \pi(a_{t+1} \mid s_t) \sum_{t' \ge t} \gamma^{t'} r_{t'+1}$

Keep the Samples Unbiased. Throw away Rollouts after Update.

Our entire derivation approximating the value of $J^\pi$ relies on a single rollout that is generated using the policy $\pi$. If the rollout was generated using another policy then our assumptions will not be valid anymore. This is true, if we keep old rollouts and update the policy. Then those rollouts are no longer unbiased sample of the current policy and hence do not estimate the objective above. Therefore, we cannot keep those samples after update. This is unlike Q-learning where experiences are stored in huge replay memory. Since our derivation relies on samples coming from the current policy, therefore REINFORCE is an example of on-policy reinforcement learning approach. It seems kind of wasteful and inefficient to throw away old samples after updating the parameter. A common tactic is therefore to collect several rollouts in parallel using large number of threads. We will talk about this in another blog.

Problem with REINFORCE:

REINFORCE is not a very good learning algorithm for anything interesting. Few main issues with it are:

1. No Exogenous Randomization: REINFORCE uses the same policy for exploration as the one being optimized. There is no way to control the flow of exploration or way to add exogenous randomization. This can hinder exploration since the policy can get stuck in specific regions and may not be able to explore outside.
2. Variance: We used a single rollout above to make several approximations. This can affect the variance of our estimate of the gradient. Having high-variance estimate can prohibit learning.
3. Degenerate Solutions: REINFORCE can get stuck in degenerate solutions where the policy can suddenly become deterministic while being far from optimal. Once the policy is deterministic, the updates stop since our estimated gradient:

$\sum_{t \ge 0} \nabla \ln \pi(a_{t+1} \mid s_t) \sum_{t' \ge t} \gamma^{t'} r_{t'+1}$,

becomes 0. This effectively kills the learning and it has to be restarted. This situation is frequently encountered in practice and we will call it the entropy-collapse issue. Q-learning does not suffer from this degeneracy.

Next we will discuss some solutions to above problem and discuss few methods for debugging.

## No Exogenous Randomization

REINFORCE’s inability to separate the exploration policy from the policy being optimized is one of its biggest weakness. There are atleast two different ways in which this has been addressed. The first approach uses a warm start to initialize the policy using behaviour cloning or imitation learning. This also makes sense from deployment point of view where we have to place our agent in the real world in order to perform reinforcement learning and thus don’t want the start policy to be randomly initialized (imagine asking users to chat with a randomly intialized conversation model!). One can naively hope that warm start will enable the policy to explore in the right region of the problem.

The other approach is to use a separate exploration policy and unbias the gradients by using importance weights. However this approach may not be suitable if the policy has close to 0 mass on actions which are chosen by the exploration policy.

Another line of research involves designing intrinsic reward function (pseudo-counts, prediction error, variance of ensemble) to incentivize the policy to eventually learn to explore the desired state space. However, these approaches generally range from having theoretical guarantees in tabular MDPs (e.g., MBIE-EB count based exploration) to having no theoretical guarantee at all (e.g., prediction error).

## Reducing Variance through Control Variate

We made a remark earlier pertaining to the variance of the estimate of the gradient for the REINFORCE algorithm. Let’s make it more formalized:

Given a rollout, $\tau = \langle s_0, a_1, r_1, s_1, \cdots a_T, r_T, s_T \rangle$ generated by a policy $\pi$, we approximate the derivative of expected total discounted reward $\nabla J^\pi$ using:

$X(\tau) = \sum_{t \ge 0} \nabla \ln \pi(a_{t+1} \mid s_t)\sum_{t' \ge t} \gamma^{t'} r_{t'+1}$

This estimate is unbiased i.e. $E[X(\tau)] = \nabla J^\pi$ and its variance is given by:

$Var(X) = E[(X(\tau) -\nabla J) (X(\tau) - \nabla J)^T]$. This variance can be high due to  the value of $\sum_{t' \ge t} \gamma^{t'} r_{t'+1}$. Therefore, we want to modify the estimate, keeping it unbiased but reducing the variance. In statistics, one way to do this is to use control-variate.

Control-Variate: Let $X$ be an estimate of a quantity with mean $\mu$, we define a control variable $Z$ with mean $E[Z]$ and define a new estimate given by $Y = X + c (Z - E[Z])$. This new estimate is still unbiased as $E[Y] = E[X] + c E[Z] - c E[Z] = \mu$. The variance is given by:

$Var(Y) = Var(X) + c^2 Var(Z) + 2c Cov(X, Z)$ which can be made smaller than $Var(X)$ by optimizing the choice of $c$ and $Z$.

Reinforce Baseline: We need to define a control variable to reduce variance.

We define it as: $Z(\tau) = \sum_{t \ge 0} \gamma^t \nabla \ln \pi(a_{t+1} \mid s_t) b(s_t)$,  where $b: \mathcal{S} \rightarrow R$ is a function called the baseline. We can observe that $E[Z(\tau)] = 0$ as,

$E[Z(\tau)] = E[\sum_{t \ge 0} \gamma^t \nabla \ln \pi(a_{t+1} \mid s_t) b(s_t) ]$

$E[Z(\tau)] = \sum_{t \ge 0} \gamma^t E[\nabla \ln \pi(a_{t+1} \mid s_t) b(s_t) ]$

$= \sum_{t \ge 0} \gamma^t \sum_{s_t} P_t(s_t) \sum_{a_{t+1}} \pi(a_{t+1} \mid s_t) \nabla \ln \pi(a_{t+1} \mid s_t) b(s_t)$

$= \sum_{t \ge 0} \sum_{s_t} P_t(s_t) b(s_t) \sum_{a_{t+1}} \nabla \pi(a_{t+1} \mid s_t)$

$= \sum_{t \ge 0} \sum_{s_t} P_t(s_t) b(s_t) \nabla \sum_{a_{t+1}} \pi(a_{t+1} \mid s_t)$

$= \sum_{t \ge 0} \sum_{s_t} P_t(s_t) b(s_t) \nabla 1 = 0$.

We will further set the value of $c=-1$, giving us our new estimate as:

$Y(\tau) = X(\tau) - (Z(\tau) - E[Z]) =\sum_{t \ge 0} \gamma^t \nabla \ln \pi(a_{t+1} \mid s_t) \{ Q(s_t, a_{t+1}) - b(s_t) \}$, where $Q(s_t, a_{t+1}) = \sum_{t' \ge t} \gamma^{t' - t} r_{t' + 1}$.

Proof of Variance Reduction(*):  We still haven’t chosen a baseline function and we set the value of $c = -1$ without justification. Overall we still haven’t shown that variance gets reduced. I will provide a proof that gives a choice of the baseline function and proves variance reduction. For simplicity, we will consider a single step case (i.e. $\tau = \langle s_0, a_1, r_1, s_1 \rangle$. We have,

$Var(Y) = Var(X) + Var(Z) - 2 Cov(X, Z)$

$= Var(X) + E[(Z - E[Z])^2] - 2 E[(X- E[X])(Z-E[Z])]$

$= Var(X) + E[Z^2] - 2 E[XZ] + 2 E[X] E[Z] = Var(X) + E[Z(Z - 2X)]$.

Where $E[Z(Z-2X)] = E[ \nabla p(a_1 \mid s_0) b(s_0) \{\nabla p(a_1 \mid s_0) \{ b(s_0) - Q(s_1, a_1)\} \}]$.

which can be simplified to $= E[ \|\nabla p(a_1 \mid s_0) \|^2 b(s_0) \{ b(s_0) - Q(s_0, a_1) \}]$

$= E[\|\nabla p(a_1 \mid s_0) \|^2] b^2(s_0) - b(s_0) E[ \|\nabla p(a_1 \mid s_0) \|^2 Q(s_0, a_1)]$.

setting it to 0 gives the optimal baseline as $b(s_0) = \frac{E[ \|\nabla p(a_1 \mid s_0) \|^2 Q(s_0, a_1)]}{E[\|\nabla p(a_1 \mid s_0) \|^2]}$.

If one makes an independency assumption between $\|\nabla p(a_1 \mid s_0) \|^2$ and $Q(s_0, a_1)$ then we get

$b(s_0) = \frac{E[ \|\nabla p(a_1 \mid s_0) \|^2 Q(s_0, a_1)]}{E[\|\nabla p(a_1 \mid s_0) \|^2]} =\frac{E[ \|\nabla p(a_1 \mid s_0) \|^2] E[Q(s_0, a_1)]}{E[\|\nabla p(a_1 \mid s_0) \|^2]} =E[Q(s_0, a_1)] = V(s_0)$.

One can do a similar analysis for multi-step reinforcement learning and derive the optimal policy. While the optimal policy has been known for many years, all empirical application of REINFORCE or its derivative (in which I count actor-critic methods) use an  approximated baseline given by the state-value function $V^\pi$. I am not familiar with any literature where they prove (or disprove) if the approximate baseline actually does reduces the variance. However, in the empirical RL circle the variance reduction due to the approximated baseline is often taken as granted.

## Importance of Regularization

REINFORCE can get stuck in degenerate solutions as pointed out before. To avoid this degeneracy, a common tactic of regularizing the objective with the entropy of the policy or KL-divergence from a reference policy is adopted. We will focus on the former method here. The entropy regularized objective is given below:

$J^\pi_\lambda = \sum_s d^\pi_\mu(s) \left\{ \sum_a \pi(a \mid s) R(s, a) + \lambda H(\pi(.\mid s)) \right\}$,

where $\lambda$ is a hyperparameter controlling the effect of the regularization. If one computes the derivative of the new objective one gets:

$\nabla J^\pi_\lambda = \sum_s d^\pi_\mu(s) \sum_a \nabla \pi(a \mid s) Q^\pi(s, a) + \lambda \sum_s \nabla \{d^\pi_\mu(s) H(\pi(.\mid s))\}$

Most researchers however use biased gradients given below:

$\nabla J^\pi_\lambda = \sum_s d^\pi_\mu(s) \sum_a \nabla \pi(a \mid s) Q^\pi(s, a) + \lambda \sum_s d^\pi_\mu(s) \nabla H(\pi(.\mid s))$

$\Rightarrow \nabla J^\pi_\lambda = \sum_s d^\pi_\mu(s) \{ \sum_a \nabla \pi(a \mid s) Q^\pi(s, a) + \lambda H(\pi(.\mid s))\}$.

Entropy regularized objective will no longer follow our classic Bellman optimality conditions and the optimal policy no longer remains deterministic. An easy verification of this is to set $\lambda = \infty$ and observe that the optimal solution of $\arg\max_\pi J^\pi_\lambda$ will be to remain uniformly random everywhere (assuming the reward function is bounded).

## Evaluating Training Progress

When training a classifier for a task like ImageNet, one generally monitors the training error and the error on a held out tune set. The decrease in training error tells that your model has sufficient capacity and an increase in tune set indicates potential overfitting. Consider, the approximated gradient of the objective for REINFORCE:

$\nabla J^\pi \approx \sum_{t \ge 0} \nabla \ln \pi(a_{t+1} \mid s_t)\sum_{t' \ge t} \gamma^{t'} r_{t'+1}$

the standard way to program this using pytorch or tensorflow is to define the loss variable:

$L = -\sum_{t \ge 0} \ln \pi(a_{t+1} \mid s_t)\sum_{t' \ge t} \gamma^{t'} r_{t'+1}$

when doing gradient descent on this variable we get the same gradient as the approximation gives us. This does not make $L$ the real loss function, in fact it is not the loss for the REINFORCE algorithm at all but using it gives us the real gradients therefore we will call it the substitute loss.

Monitoring the substitute loss is not the same as monitoring the actual training loss when doing supervised learning. To begin with, notice that the substitute loss is positive when the agent receives only positive reward and it is negative when the agent only receives negative rewards. This is counterintuitive as one ideally associates high loss with low return. So why is this so?

This happens cause when all the rewards are positive then the loss is positive and the only way the agent can reduce it is by pushing the term $\ln \pi(a_{t+1} \mid s_t)$ towards 0 which means increasing the probability of actions which generate positive reward (which is what we want). Similarly, when all rewards are negative then the loss is negative and the only way the agent can make it more negative is by decreasing the probability of these actions until the term $\ln \pi(a_{t+1} \mid s_t)$ tends towards negative infinity. This will eventually lead to these actions not being sampled anymore.

Thus, instead of monitoring the substitute loss, one can monitor two things: (i) the total reward received by the agent which also represents an unbiased estimate of the actual objective and (ii) the entropy of the policy. Ideally the total reward achieved by the agent should increase and the entropy of the policy should decrease. Any stagnation means the learning is not happening effectively.

## Major Failure Case for Policy Gradient

Policy gradient methods have no guarantees in theory or practice. A simple example which can be used to demonstrate this is from John Langford. There are $H+1$ states and two actions and the agent always starts in state $s_0$. At each state $s_i$ ($i \in \{0, 1, \cdots, H-1\})$, an action takes you to $s_{i+1}$ and the other actions takes you back to $s_0$. The action mappings are randomly assigned to each state but the MDP is deterministic. Every state has a reward of 0 except $s_H$ which has a reward of 1. The game ends when the agent reaches the state $s_H$.

With a close to 1 probability, any rollout will fail to reach the destination state where it earns a reward and therefore will not cause any meaningful learning when using on-policy policy gradient methods. It will require exponentially many samples for it to reach the state $s_H$ and therefore on-policy policy gradient with a randomly initialized policy is not what is called a  PAC RL algorithm, which is an important theoretical guarantee.

## Conclusion

Despite its flaws policy gradient methods are used widely for all sort of AI applications. One reason behind its widespread use is their remarkable similarity to supervised learning with which most empiricists are widely familiar. Thus, one can train their favourite dialogue model using REINFORCE by simply multiplying the log probabilities with a value term. Further, variants of policy gradient methods like PPO perform much better and are reasonable baselines. However, it could also be the case that most people don’t care about guarantees as long as they can solve their favourite childhood game and as a field we must guard against this trend as we move towards more serious applications of these algorithms.

## Is the MTA System Holding New York City Back?

I presume that the title of this blog would come across as either shocking or trivial. Growing up in a 2nd tier city in India, I always imagined New York City to be the center of civilization for our time. Though I never gave much consideration to its transport system, my mind occupied with its lavish skyscrapers, busy life and a broad range of restaurants all accessible in a tiny island which generates more money than most countries; I would have imagined it in positive terms. However, having spent several years in the city I have come to accept the current state of the MTA system as serious impediment to the city’s long term competitiveness. At best it is annoying and at worst it can cause serious harm. An assortment of news articles and research show that New Yorkers have lost money, jobs and precious time. For my own part, I have missed (been late for) many meetings, classes, dates and other appointments. This tardiness manifested in several form. At times the train skipped my destination, other times it metamorphosed in the middle of the journey, local became express, express became local, laws of relativity changed to mean 2 mins to the next train to imply 20 mins. However, two incidents were particularly dreadful. In both instances, the train stopped in the middle of a tunnel for almost  2hours without moving, wifi or phone signal and often with extremely uncomfortable coach temperatures. While I was able to grit my teeth, I can only imagine the condition of infants, pregnant women, elderly or people with health issues (imagine someone with bladder issues). The most horrific incident that I know to be true, can be read here.

After one such incident described above, my Swedish friend educated me on the policy of his country which entitles passengers to reimbursement if they end up experiencing delay in commute. A quick conversation with my fellow subway passengers reveals a deep cynicism towards the possibility of this scheme in New York. This is depressing considering that GDP of New York metropolitan area is about twice the national economy of Sweden. My girlfriend, who is a sociologist, tells me that part of the reason is due to unwillingness of state officials in Albany to spend more money in improving the system.

Back in my home country, the metro system in the capital New Delhi is not only much cleaner and better but also arguably more environment friendly. I was equally impressed with the metro system in Beijing which is better despite being longer and busier than New York city. Only my experience with the Paris metro system was unimpressive. However, the case is not as simple as poorer countries doing better due to their adoption of the recent technology since both subway systems in Berlin and in Tokyo are considered excellent. Maybe the commute of passengers means more to these countries and their officials?

As the economy of other countries rise and other cities continue to grow, as someone who has come to deeply love New York, I would be disappointed if the city loses the baton of the center of civilization.

## Learning to Listen: Acquiring a New Language vs. Doing Sociology with One’s Ears

I have mentioned in various blog posts that Hindi is the language that I am learning this year (Than 2017a, Than 2017b). My current level is A1 according to the Common European Framework of Reference for Languages. My goal is to move to the next level by the end of April. Before starting my Hindi language learning journey I messaged a couple of polyglot friends, and they sent me materials to self-learn the language. Yet to be honest, I am nowhere at the level where I can teach myself a language that I have had very little contact with. Therefore, looking for a language teacher online was my solution. I found one from Mumbai. Rachna Singh is her name. She is very patient, and positive all the time.  We have had regular lessons for almost two months.

My Hindi learning routine goes as follows: 5 minutes…

View original post 1,153 more words

## Markov Decision Process and Reinforcement Learning (Part 1)

Markov decision process (MDP) is a framework for modeling decision making and is very popularly used in the areas of machine learning, robotics, economics and others. An MDP models a decision maker (henceforth agent) in a world, where the agent gets a reward for performing action and the long term goal is to maximize the total reward. The concept of MDP is very intimately tied with the idea of Reinforcement Learning (RL), which is a machine learning framework for learning policies for decision making.

The general idea of MDP has been around since 1950s whereas RL has been around since 1969 (Andreae’s 1969). Long considered (often unjustifiably) suitable for only toy problems, RL has seen a recent renewal due to reasons, I will explain later. There is a good amount of literature on MDP and RL, with the Reinforcement learning book by Sutton and Barto, being a particularly good introduction to the subject. However, I found that these resources remove some essential mathematical treatment or they are written for mathematicians and beyond the reach of general audience.

In this post, I will cover a simple mathematical treatment of MDP with proofs of fundamental concepts. It will be best to use this post as complementary to Sutton and Barto. At places, I did not find proof of many theorems, and in such cases I have supplied my own proofs (marked with *). Please be careful in reading these proofs and let me know if you find any error.

## 1. Introduction

Definition: An MDP $M$ is defined by the tuple $(S, A, P, R, \gamma)$ where $S$ is a set of states, $A$ is a set of actions that an agent can take, $P^a_{s,s'}$ is defined for all $s,s' \in S$ and $a \in A$ and denotes the probability of transitioning to state $s'$ from state $s$ on taking action $a$, we will denote such a transition as $s \rightarrow_a s'$$R^a_{s,s'}$ is a real number denoting the reward received by the agent in the transition$s \rightarrow_a s'$$\gamma$ is a real number that denotes how much to discount future rewards (its purpose will become clear shortly).

Thus, MDP defines a problem setup, where there is an agent that can take action and in the process receive a reward as well change the state. The setup is Markovian since the next state and reward depends only upon the current state and the action taken.

The goal of a rational agent should be to achieve the maximum reward in the long term. This is expressed as a policy which gives a distribution over actions to take, when presented with a state. More formally,

Definition: Policy $\pi: S\times A \rightarrow [0,1]$ defines the probability distribution over action given a state. $\pi(s,a)$ denotes the probability of taking action a given state s. The policy is deterministic if $\forall s\in S, \exists a \in A \,\,s.t.\,\, \pi(s,a) = 1$.

The agent then plays a game where it is landed in a state $s$, and it continuously takes action to maximize the total reward. More precisely, a trace $\tau$ of agent’s game will look like a sequence of state, action, reward $\tau =$, where $a_t$ is the action taken when presented with state $s_t$ and $r_t$ is the reward given in the process and $s_{t+1}$ is the new state.

What we are interested in is the expected reward given the state$s$, which we denote by the concept of state value function $V^\pi(s)$.

$V^\pi(s) = E[ \sum_{t=0}^\infty \gamma^{t} r_{t + 1} | s_o = s]$

a similar concept is given by action value function $Q^\pi(s,a)$ :

$Q^\pi(s) = E[ \sum_{t=0}^\infty \gamma^{t} r_{t + 1} | s_o = s, a_1 = a]$

Episodic and Continuous Task: If the task defined by MDP, always terminates in  some bounded number of steps irrespective of the starting state and actions taken by the agent, the task is called episodic (will game of Chess be episodic? how about a racing game?). A single play in an episodic task is called an episode. If a task is not episodic then it can potentially run forever and such tasks are called continuous task.

Discounting: As seen in the equation above, the reward at time step $t+1$ is discounted by $\gamma^t$. There are several reasons why it is done but two main reasons are: there are evidence that humans behave in similar manner and secondly, we want the value function is a well defined quantity. When the reward is bounded by some finite constant$M$ and $\gamma \in [0, 1)$ then value function is always defined. The maximum reward achievable is bounded by $\frac{M}{1-\gamma}$ as shown below:

$V^\pi(s) =E[ \sum_{t=0}^\infty \gamma^{t} r_{t + 1} | s_o = s]$

$V^\pi(s) \le E[ \sum_{t=0}^\infty \gamma^{t} M | s_o = s] = M \sum_{t=0}^\infty \gamma^t = \frac{M}{1-\gamma}$

Note that the maximum achievable reward is $M * \frac{1}{1-\gamma}$. If the task was episodic and terminates in $T$ steps, then the maximum achievable reward in undiscounted scenario would have been $MT$. Comparing it with what we have for continuous discounted task, we see that $\frac{1}{1-\gamma}$ plays the role of $T$ and for this reason, this is sometimes called the effective time horizon of the task. If $\gamma = 0.99$ means that the task is effectively being played for $100$ steps.

In our proofs below, we will focus only on continuous tasks with $\gamma \in [0,1)$.

## 2. Bellman Self-consistency equation

### 2.1 Self-consistency equation

By a little work, we can express the value functions in recursive form giving us what is called the self-consistency equation. Lets see how it is done:

$V^\pi(s) = E[ \sum_{t=0}^\infty \gamma^{t} r_{t + 1} | s_o = s]$ $= \sum_{\tau} \left(\sum_{t=0}^\infty \gamma^{t} r_{t + 1}\right) p(\tau)$ $= \sum_{a_1, r_1, s_1} \sum_{\tau'} \left(\sum_{t=0}^\infty \gamma^{t} r_{t + 1}\right) p(\langle a_1, r_1, s_1\rangle:\tau')$ $= \sum_{a_1, r_1, s_1} \sum_{\tau'}\left( r_ 1 + \sum_{t=1}^\infty \gamma^{t} r_{t + 1} \right) p(\langle a_1, r_1, s_1\rangle)p(\tau' | s_1)$ $= \sum_{a_1, r_1, s_1}p(\langle a_1, r_1, s_1\rangle)\sum_{\tau'}\left\{ r_1 +\left( \sum_{t=1}^\infty \gamma^{t} r_{t + 1} \right) \right\}p(\tau' | s_1)$ $= \sum_{a_1, r_1, s_1}p(\langle a_1, r_1, s_1\rangle)\left\{\sum_{\tau'}r_1p(\tau' | s_1) + \sum_{\tau'}\left( \sum_{t=1}^\infty \gamma^{t} r_{t + 1} \right)p(\tau' | s_1) \right\}$

$r_1$ only depends upon $s, a_1, s_1$ therefore we get

$= \sum_{a_1, r_1, s_1}p(\langle a_1, r_1, s_1\rangle)\left\{r_1 + \gamma \sum_{\tau'}\left( \sum_{t=0}^\infty \gamma^{t} r_{t + 2} \right)p(\tau' | s_1) \right\}$

the second term is same as $V^\pi(s_1)$ therefore we get

$= \sum_{a_1, r_1, s_1}p(\langle a_1, r_1, s_1\rangle)\left\{r_1 + \gamma V^\pi(s_1) \right\}$

Expanding $p(\langle a_1, r_1, s_1 \rangle)$ using the fact that given $s_1, a_1$ the reward $r_1$ is fixed, we get:

$V^\pi(s) = \sum_{a_1} \pi(s, a_1) \sum_{s_1} P^a_{s, s_1}\left\{R^a_{s,s_1} + \gamma V^\pi(s_1) \right\}$

this is the Bellman self consistency equation for the value function. Basically any state-value function should satisfy the self-consistency equation.

Similarly, $Q^\pi$ satisfies a self-consistency equation below:

$Q^\pi(s,a) = \sum_{s'}P^a_{s,s'} \{ R^a_{s,s'} + \gamma \sum_{a'} \pi(s',a')Q^\pi(s', a') \}$

we also have relation between $Q^\pi$ and $V^\pi$ given by:

$Q^\pi(s,a) = \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^\pi(s') \}$

It is not clear whether the self-consistency equations admit a unique solution. We show below that infact they do for finite MDPs.

### 2.2 Self-consistency equation has a unique solution

Theorem*: Self-consistency equation for state-value function for finite MDPs have a unique solution.

Proof: We have $\forall s\in S$

$V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s_1} P^a_{s, s_1}\left\{R^a_{s,s_1} + \gamma V^\pi(s_1) \right\}$ $V^\pi(s) = \sum_{s_1} \{ \gamma \sum_{a}\pi(s, a)P^a_{s, s_1} \} V^\pi(s_1) + \sum_{s_1} \sum_{a} \pi(s,a) P^a_{s,s_1} R^a_{s,s_1}$

define $\alpha_{s,s_1} =\{ \gamma \sum_{a}\pi(s, a) P^a_{s, s_1} \}$, and $\beta_s = \sum_{s_1} \sum_{a} \pi(s,a) P^a_{s,s_1} R^a_{s,s_1}$ we get

$V^\pi(s) = \sum_{s_1}\alpha_{s,s_1} V^\pi(s_1) + \beta_s$

or in matrix form with $A = [\alpha_{s,s_1}]$ and $b = [\beta_s]$,

$I V^\pi = A V^\pi + b \Rightarrow (I - A) V^\pi = b$

We show that this linear equation has a unique solution by showing that matrix $(I - A)$ has a null space of dimenion 0.

Let $x = (x_s) \in N(I - A) \Rightarrow (I - A) x = 0$  or equivalentlty $Ax = x$.

$x_s = \sum_{s'} \alpha_{s,s'} x_{s'}$

Here we note following properties of matrix $A$:

1. $\alpha_{s,s'} \ge 0$
2. $\sum_{s'} \alpha_{s,s'} = \gamma \sum_{s'} \sum_{a}\pi(s, a)P^a_{s, s'}$
$=\gamma \sum_{a}\pi(s, a)(\sum_{s'} P^a_{s, s'}) = \gamma \sum_{a}\pi(s, a) = \gamma$.

Defining $x_{max} = max\{x_s | s\in S\}$ and $x_{min} = min\{x_s | s\in S\}$ and using the above property of matrix $A$ we get.

This means $\gamma x_{min} \le x_s = \sum_{s'} \alpha_{s,s'} x_{s'} \le \gamma x_{max}$. This implies:

$\gamma x_{min} \le x_{min} \Rightarrow (1 - \gamma) x_{min} \ge 0 \Rightarrow x_{min} \ge 0$ and
$x_{max} \le \gamma x_{max} \Rightarrow (1 - \gamma) x_{max} \le 0 \Rightarrow x_{max} \le 0$

Using $x_{min} \ge 0$ and $x_{max} \le 0$, we get $x = 0$. This implies that $N(I - A) = \{ 0 \}$ and therefore the self consistency equation for state-value has a unique solution.

## 3. Optimal Policy

A natural question to ask is, how to find a policy that gives us very high value of state value function. For a given state $s$, we can say that policy $\pi$ is as good as $\pi^{'}$ if $V^\pi(s) \ge V^{\pi'}(s)$. The policy is better if the inequality is strict. It is easy to see that for finite MDP and state, there exists a policy that is as good as all other policies. But how do we define an ordering over policies, irrespective of state?

One way would be, $\pi$ is as good as $\pi^{'}$ if $\forall s\in S; V^\pi(s) \ge V^{\pi'}(s)$. That is the policy must equal (or exceed) the other policy for every state. But this leads to the question of optimality, does an optimal policy exists. After all, we can end up in a situation where for two states $s_1, s_2 \in S$ and policies $\pi_1, \pi_2$; $V^{\pi_1}(s_1) > V^{\pi_2}(s_1)$  and $V^{\pi_1}(s_2) < V^{\pi_2}(s_2)$. We now prove that such an optimal policy does exists for finite MDP.

### 3.1 Existence of optimal policy

Theorem*: There exists an optimal policy for every finite MDP.

Proof:  Let MDP be $M = (S,A,P,R,\gamma)$ be a finite MDP and $\Pi$ be the space of all possible policies $\pi$. For a given state $s$, $\exists \pi_s \in \Pi$ such that $V^{\pi_s}(s) \ge V^{\pi}(s) \forall \pi \in \Pi$. We call such $\pi_s$ as the leader policy for state $s$. We now define a policy as $\pi(s,a) = \pi_s(s,a)$ i.e. for a given state $s$ we use the leader policy for that state to make our move.

This seems most natural and probably the only way to define $\pi$ since, for a given state $s$ the maximum reward is attained by policy $\pi_s$ and hence if we do not follow the policy $\pi_s$ for the first step itself then we may not be guaranteed to achieve the maximum reward in the long run. All that remains to show is that $\pi(s,a)$ is optimal. For this we first define $\delta(s) = V^\pi(s) - V{^\pi_s}(s)$. Showing that $\pi$ is optimal corresponds to showing that $\delta(s) \ge 0$ $\forall s\in S$ since if we beat leader policy for every state then we beat all policies for all states.

From Bellman self-consistency equation we have:

$V^{\pi}(s) = \sum_{a \in A} \pi(s,a) \sum_{s' \in S} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^\pi(s') \}$

$V^{\pi_s}(s) = \sum_{a \in A} \pi_s(s,a) \sum_{s' \in S} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^{\pi_s}(s') \}$

by definition $\pi(s,a) = \pi_s(s,a)$. Subtracting (3) and (4) we get:

$\delta(s) = V^{\pi}(s) - V^{\pi_s}(s) = \gamma \sum_{a \in A} \pi(s,a) \sum_{s' \in S} P^a_{s,s'} \{V^\pi(s') - V^{\pi_s}(s')\}$

We have $V^\pi(s') - V^{\pi_s}(s') \ge V^\pi(s') - V^{\pi_{s'}}(s') = \delta(s')$, which together with equation 5 gives us:

$\delta(s) \ge \gamma \sum_{a \in A} \pi(s,a) \sum_{s' \in S} P^a_{s,s'} \delta(s') = \gamma \sum_{s'\in S} \left(\sum_{a \in A} \pi(s,a) P^a_{s,s'} \right) \delta(s')$

the summation on RHS represents convex combination of $\delta$ and therefore we can use convex-combination$(x) \ge x_{min}$ to give us:

$\delta(s) \ge \gamma \,\, \inf \{\delta(s') \mid s' \in S\}$

as inequality 7 holds for all $s \in S$, therefore this gives us:

$\inf \{\delta(s') \mid s' \in S\} \ge \gamma \,\, \inf \{\delta(s') \mid s' \in S\}$

as $\gamma \in [0,1)$, therefore inequality 8 can only hold if $\inf \{\delta(s') \mid s' \in S\} \ge 0 \Rightarrow \delta(s) \ge 0$ $\forall s \in S$. Hence proved.

### 3.2 Bellman optimality condition

Note that our proof for existence of optimal policy is almost impractical. It requires us to find the optimal policy for every state, which can be highly inefficient if we have a large state space or action space. Here we will derive an equation that will allow us to later compute the optimal policy more efficiently.

We first denote the optimal policy by $\pi^*$ and define $V^*(s) = V^{\pi^*}(s)$ and $Q^*(s) = Q^{\pi^*}(s)$.

Theorem* (Bellman Optimality Condition):  $V^*(s)$ and $Q^*(s)$ satisfy the following equations:

$V^*(s) = \max_a Q^*(s,a)$

$V^*(s) = \max_a \sum_s' P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^*(s') \}$

$Q^*(s,a) = \sum_s' P^a_{s,s'} \{ R^a_{s,s'} + \gamma \max_{a'} Q^*(s',a')\}$

Proof: Let $\pi^*(s,a)$ be the optimal policy. Then we have, $V^*(s) = \sum_a \pi^*(s,a) Q^*(s,a) \le \max_a Q^*(s,a)$. To prove the first equation, all we have to show is that we can never have $V^*(s) < \max_a Q^*(s,a)$. We will prove this via contradiction.

Let $\exists s' \,\,s.t.\,\, V^*(s') < \max_a Q^*(s',a)$. We will use this fact to show that $\pi^*$ is not optimal by deriving a better policy. We define this new policy as: $\pi'(s) = \arg\max_a Q^*(s,a)$. Note that this new policy is a deterministic policy. Then from self-consistency equations we have:

$V^*(s) = \sum_a' \pi^*(s,a) Q^*(s,a)$

$V^{\pi'}(s) = Q^{\pi'}(s, \pi'(s))$

As before, we define $\delta(s) = V^{\pi'}(s) -V^*(s) =Q^{\pi'}(s, \pi'(s)) - \sum_a' \pi^*(s,a) Q^*(s,a)$. We then have:

$\delta(s) = V^{\pi'}(s) -V^*(s) =Q^{\pi'}(s, \pi'(s)) - \sum_a' \pi^*(s,a) Q^*(s,a)$

Using the definition of $\pi'$, as the policy that picks an action that maximizes $Q^*$ values, we get:

$\delta(s) \ge Q^{\pi'}(s, \pi'(s)) - Q^{*}(s, \pi'(s))$.

Note that for the state $s'$, the above inequality is strict. This follows from $V^*(s') < \max_a Q^*(s,a') = Q^*(s, \pi(s'))$.

Let $a = \pi'(s)$ then we have:

$Q^{\pi'}(s, a) - Q^{*}(s, a) = \sum_{s'} P^a_{s,s'} (R^a_{s,s'} + \gamma V^{\pi'}(s')) - \sum_s' P^a_{s,s'} (R^a_{s,s'} + \gamma V^{*}(s'))$

$Q^{\pi'}(s, a) - Q^{*}(s, a) = \gamma \sum_{s'}P^a_{s,s'} (V^{\pi'}(s') -V^{*}(s')) = \gamma \sum_{s'} P^a_{s,s'} \delta(s')$.

Giving us, $\delta(s) \ge\gamma \sum_{s'} P^a_{s,s'} \delta(s')$.

using argument similar to earlier, we show that $\inf_s \delta(s) \ge \gamma \inf_{s'} \delta(s')$ and as $\gamma\in [0,1)$, we have $\inf_s \delta(s) \ge \Rightarrow \forall s, \delta(s) \ge 0$.

But wait, this is not what we wanted to show. We want to show that $\pi' > \pi^*$ or in other words, we want to show that $\exists s \,\,s.t.\,\, \delta(s) > 0$. But this follows easily since for the state $s'$, the above inequality is strict. Hence, $\pi^*$ cannot be the optimal policy, which is a contradiction.

The next 2 equations in the proof, easily follow from the first equation (try it yourself).

These three equations are called the Bellman optimality conditions.

Interesting observations from the above proof:

1. For the optimal policy $\pi^*$ satisfying $V^{\pi^*}(s) = \max_a Q^{\pi^*}(s,a)$, we can still define a new policy $\pi'(s) = \arg\max_a Q(s,a)$. We can then follow the above proof to derive that the new policy is as good as the optimal policy, but not better. But this new policy $\pi'$ is  deterministic. This shows that every MDP has a deterministic optimal policy. So you do not have to worry about being stochastic if you want to be optimal. This is a remarkably interesting fact about MDPs.
2. If $\pi$ does not satisfy the Bellman optimality condition, then we can follow the proof to derive a new policy $\pi'$ given by $\pi'(s) = \arg\max_a Q^{\pi}(s,a)$ such that $\pi' > \pi$. That is we can improve upon our current policy. This is one way of doing policy improvement. Given a policy, find its Q-values and then define a better policy as discussed. We can keep doing this, either forever or till we finally find a policy that satisfies the Bellman optimality condition.
3. Lastly, notice how the reward terms simply cancel out. It beautifully shows that the proof does not depend upon how the reward function looks like.

### 3.3 Bellman backup operator

To solve an MDP means to find the optimal policy and the optimal value functions. Unfortunately, we cannot easily solve the Bellman optimality condition as they are non-linear due to $max$ sitting in the equation. We now provide a strategy for computing $V^*, Q^*$ by iteratively solving the optimality condition. Given $V^*$ we can compute $Q^*$ using the equations above. Therefore, in the remainder of this section we will be solely interested in computing $V^*$.

Let $V: S \rightarrow R$ be the state-value function for an MDP then $T: V \rightarrow V$ is the Bellman backup operator given by:

$TV(s) = max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V(s') \}$

We then prove that $T$ satisfies the contraction property namely:

Theorem: $\| TV_1 - TV_2 \|_{\infty} \le \gamma \| V_1 - V_2 \|_{\infty}$ where $\| . \|_{\infty}$ is the max norm given by $\| x \| = \max\{ |x_1|, \cdots |x_n| \}$

Proof: $\| TV_1 - TV_2 \|_{\infty} = \| \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V_1(s') \} - \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V_2(s') \} \|$
we now use the fact that $|\max_x f(x) - \max_x g(x)| \le \max_x |f(x) - g(x)|$.

$|\max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V_1(s') \} - \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V_2(s') \}|$
$\le \max_a | \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V_1(s') \} - \sum_{s'}P^a_{s,s'} \{R^a_{s,s'} + \gamma V_2(s') \}|$
$= \max_a \gamma |\sum_{s'}P^a_{s,s'} \{V_1(s') -V_2(s')\}|$

since this holds for every $s$, we have

$\| TV_1 - TV_2 \|_{\infty} \le \|\max_a \gamma |\sum_{s'}P^a_{s,s'} \{V_1(s') -V_2(s')\}|\|_{\infty}$
$\le \|\max_a \gamma \sum_{s'}P^a_{s,s'}|V_1(s') - V_2(s')|\|_{\infty}$
$\le \|\max_a \gamma \max_{s'}|V_1(s') -V_2(s')|\|_{\infty}$
$= \|\gamma \max_{s'}|V_1(s') -V_2(s')|\|_{\infty}$
$= \gamma \max_{s'}|V_1(s') -V_2(s')| = \gamma \|V_1 -V_2\|_{\infty}$

Hence proved.

We now prove that $T$ has a unique fixed point. A fixed point $x$ of an operator satisfies $Tx = x$. This also means $T^n x = T^{n-1}( Tx) = T^{n-1}x \cdots = x$.

Lemma: Bellman backup operator $T$ has a unique fixed point.
Proof: Let $V_1, V_2$ be two fixed point of $T$. Whic means $TV_1 = V_1$ and $TV_2 = V_2$. Then using the contraction property of $T$ we get:

$0 \le \| TV_1 - TV_2\|_{\infty} \le \gamma \| V_1 - V_2\|_{\infty}$
further, $\| TV_1 - TV_2\|_{\infty} = \| V_1 - V_2\|_{\infty}$.

Using these two equations and $\gamma \in [0, 1)$ we get,

$\| V_1 - V_2\|_{\infty} \le \gamma \| V_1 - V_2\|_{\infty} \Rightarrow \| V_1 - V_2\|_{\infty} \le 0$. This gives us $V_1 = V_2$.

Corrollary: Optimal state-value function is a fixed point of its Bellman backup operator. By above lemma, it is also the fixed point.

This means that solving for the fixed point of this operator gives us our optimal value functions and optimal policy. We finally show that repeated application of Bellman backup operator to any state-value function $V$ eventually converges to the optimal value function.

Theorem: $\lim_{n\rightarrow \infty} T^nV = V^*$  $\forall V$
Proof: Using the contraction property.
$\| T^n V - V^* \|_{\infty} = \| T^n V - TV^*\|_{\infty} \le \gamma \| T^{n-1}V-V^*\|_{\infty}$

Applying the contraction property $n-1$ times we get:
$\| T^n V - V^*\|_{\infty} \le \gamma^n \| V - V^* \|_{\infty}$

taking limits on both sides and using $\gamma \in [0,1)$ we get,
$\lim_{n \rightarrow \infty} \| T^n V - V^*\|_{\infty} \le \lim_{n\rightarrow} \gamma^n \| V - V^* \|_{\infty} = 0$.

Here, we assume that value functions are bounded which follows from bounded reward values.

## 4 Conclusion

Applying backup operator to compute optimal policy is impractical in most cases as number of states become very large. In next blog post on this topic, we will discuss algorithms for efficient learning of optimal policies such as Q-learning, Monte Carlo methods etc. We will also look at interesting reinforcement learning problems such as deep reinforcement learning (reinforcement learning with state represented by deep neural network features).

## 5 References

1. Reinforcement learning: An Introduction, R. Sutton and A. Bato.
2. Scribes from Peter Abeel’s course for section 3