# Mathematical Analysis of Policy Gradient Methods

Most papers using reinforcement learning these days use the policy gradient class of algorithms. In this post, I will give a basic tutorial on policy gradient, clear up some confusion about using baselines and regularization, and give some suggestions for debugging these learning algorithms that I have found useful in the past.

Policy gradient methods are a class of reinforcement learning algorithms that optimize the reinforcement learning (RL) objective by performing gradient ascent on the policy parameters. As opposed to value function based methods (SARSA, Q-learning), the central object of study in policy gradient methods is a policy function $\pi(a\mid s; \theta)$, which denotes a probability distribution over actions given a state. $\theta$ are the parameters of this policy. Assuming we are optimizing using stochastic gradient methods, our updates are given by:

$\theta^{(t + 1)} \leftarrow \theta^{(t)} +\eta_t \nabla_\theta J^\pi$

The objective $J^\pi$ can be expected total discounted reward, expected average reward or something else. We will stick to total discounted reward for this discussion. The aim now is to express $\nabla_\theta J^\pi$  in a form that is easier to work with. This is done using the policy gradient theorem.

Let $M = \langle S, A, R, P, \gamma, \mu \rangle$ be our MDP where $S, A, R, P$ are the state space, action space, reward function and transition probability respectively. $\gamma$ is the discounting factor and $\mu$ is an initial state distribution.

The expected total discounted reward can be expressed as:

$J^\pi = E[\sum_{t\ge 0} \gamma^t r_t] = \sum_s \mu(s) V^\pi(s)$

See my previous post if you don’t know what the $V$ function is. Using the above equation we have $\nabla_\theta J^\pi = \sum_s \mu(s) \nabla_\theta V^\pi(s)$. The policy gradient theorem (Sutton et al. 1999) computes $\nabla_\theta V^\pi(s)$ by repeatedly applying the Bellman self-consistency equation. It really is just that.

$\nabla_\theta V^\pi(s) =\nabla_\theta \left\{ \sum_a \pi(a\mid s) Q^\pi(s, a) \right\}$

$\nabla_\theta V^\pi(s) = \sum_a \pi(a \mid s) \nabla_\theta Q^\pi(s, a) + \sum_a Q^\pi(s, a)\nabla_\theta\pi(a \mid s)$

using $Q^\pi(s, a) = \sum_{s'} P^a_{s, s'} \left\{ R^a_{s, s'} + \gamma V^\pi(s') \right\}$ we get:

$\nabla_\theta Q^\pi(s, a) = \gamma \sum_{s'} P^a_{s, s'} \nabla_\theta V^\pi(s')$

(the rewards and transition probabilities do not depend on $\theta$), so now we have an expression for $\nabla_\theta V^\pi$ in terms of itself. Writing it down together we have:

$\nabla_\theta V^\pi(s) = \sum_a Q^\pi(s, a)\nabla_\theta\pi(a \mid s) +\gamma \sum_a \pi(a \mid s) \sum_{s'} P^a_{s, s'} \nabla_\theta V^\pi(s')$

we will change the notation as $s \rightarrow s_0, a \rightarrow a_1, s' \rightarrow s_1$ giving us:

$\nabla_\theta V^\pi(s_0) = \sum_{a_1} Q^\pi(s_0, a_1)\nabla_\theta\pi(a_1 \mid s_0) +\gamma \sum_{a_1} \pi(a_1 \mid s_0) \sum_{s_1} P^{a_1}_{s_0, s_1} \nabla_\theta V^\pi(s_1)$.

unfolding this equation gives us (verify it for yourself):

$\nabla_\theta V^\pi(s_0) =\sum_s \sum_{t \ge 0} \gamma^t P(s_t = s \mid \pi, s_0) \sum_a \nabla_\theta \pi(a \mid s) Q^\pi(s, a)$.
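For readers who want the unfolding spelled out, here is a short sketch. The symbols $g(s) = \sum_a \nabla_\theta \pi(a \mid s) Q^\pi(s, a)$ and $P^\pi(s' \mid s) = \sum_a \pi(a \mid s) P^a_{s, s'}$ are shorthand introduced only for this sketch:

$\begin{aligned} \nabla_\theta V^\pi(s_0) &= g(s_0) + \gamma \sum_{s_1} P^\pi(s_1 \mid s_0) \nabla_\theta V^\pi(s_1) \\ &= g(s_0) + \gamma \sum_{s_1} P^\pi(s_1 \mid s_0) g(s_1) + \gamma^2 \sum_{s_1, s_2} P^\pi(s_1 \mid s_0) P^\pi(s_2 \mid s_1) \nabla_\theta V^\pi(s_2) \\ &= \cdots = \sum_{t \ge 0} \gamma^t \sum_s P(s_t = s \mid \pi, s_0) g(s) \end{aligned}$

The last step uses the fact that chaining $t$ copies of $P^\pi$ and summing over the intermediate states gives $P(s_t = s \mid \pi, s_0)$. It also assumes that the leftover term $\gamma^t \sum_{s_t} P(s_t \mid \pi, s_0) \nabla_\theta V^\pi(s_t)$ vanishes as $t$ grows, which holds for example when $\gamma < 1$ and $\nabla_\theta V^\pi$ is bounded.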

where $P(s_t = s \mid \pi, s_0)$ is the probability of being in state $s$ after time $t$ when starting in state $s_0$ and taking actions according to the policy $\pi$. Note that our setup does not include multiple agents, policies dependent upon time or non-Markovian policies.

Discounted unnormalized visitation probability: The term $\sum_{t \ge 0} \gamma^t P(s_t = s \mid \pi, s_0)$ appears fairly often in RL theory, so it is given the notation $d^\pi(s; s_0)$. Essentially this value is high for a state $s$ if it is likely to be visited by an agent sampling actions according to the policy $\pi$ and starting in state $s_0$. This term is called “discounted unnormalized visitation probability”. It is a “visitation probability” since it is higher for states which are likely to be visited. It is discounted because the probability terms are discounted with a $\gamma^t$ factor, and it is unnormalized because it may not sum to 1. In fact for infinite horizon:

$\sum_s d^\pi(s; s_0) = \sum_s \sum_{t \ge 0} \gamma^t P(s_t = s \mid \pi, s_0)$

$= \sum_{t \ge 0} \gamma^t \sum_s P(s_t = s \mid \pi, s_0) = \sum_{t \ge 0} \gamma^t \cdot 1 = \frac{1}{1-\gamma}$.

and for finite horizon with $\gamma =1$ we have:

$\sum_s d^\pi(s; s_0) = \sum_s \sum_{t = 0}^{T-1} P(s_t = s \mid \pi, s_0)$

$= \sum_{t = 0}^{T-1} \sum_s P(s_t = s \mid \pi, s_0) = \sum_{t = 0}^{T-1} 1 = T$

so in either case the sum of $d^\pi(s; s_0)$ over states is equal to the time horizon (for the episodic undiscounted case) or the effective time horizon (for the unending discounted case).
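The two sums above are easy to check numerically. Below is a minimal sketch in plain Python, assuming a made-up 3-state transition matrix `P_pi` (the chain induced by some fixed policy; the numbers are purely illustrative) and a truncated horizon standing in for the infinite sum:

```python
def discounted_visitation(P_pi, s0, gamma, horizon):
    """Return d(s; s0) = sum_t gamma^t P(s_t = s | pi, s0), truncated at `horizon` steps."""
    n = len(P_pi)
    p_t = [1.0 if s == s0 else 0.0 for s in range(n)]  # distribution of s_0
    d = [0.0] * n
    discount = 1.0
    for _ in range(horizon):
        for s in range(n):
            d[s] += discount * p_t[s]
        # advance one step: p_{t+1}(s') = sum_s p_t(s) * P_pi[s][s']
        p_t = [sum(p_t[s] * P_pi[s][s2] for s in range(n)) for s2 in range(n)]
        discount *= gamma
    return d

# Hypothetical state-to-state transition matrix under a fixed policy (rows sum to 1).
P_pi = [[0.5, 0.5, 0.0],
        [0.1, 0.6, 0.3],
        [0.0, 0.2, 0.8]]

d_discounted = discounted_visitation(P_pi, s0=0, gamma=0.9, horizon=2000)
d_episodic = discounted_visitation(P_pi, s0=0, gamma=1.0, horizon=5)

total_discounted = sum(d_discounted)  # close to 1 / (1 - 0.9) = 10
total_episodic = sum(d_episodic)      # equals the horizon T = 5
```

The individual entries of `d_discounted` depend on the chain, but their sum only depends on $\gamma$ (or on $T$ in the undiscounted case).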

using the notation we have: $\nabla_\theta V^\pi(s_0) = \sum_s d^\pi(s; s_0) \sum_a \nabla_\theta \pi(a \mid s) Q^\pi(s, a)$

which gives us:

$\nabla_\theta J^\pi = \sum_{s_0} \mu(s_0) \sum_s d^\pi(s; s_0) \sum_a \nabla_\theta \pi(a \mid s) Q^\pi(s, a)$

or using the notation of Kakade and Langford 2002, we define $d^\pi_\mu(s) = \sum_{s_0} \mu(s_0) d^\pi(s; s_0)$ which allows us to express the gradient as:

$\nabla_\theta J^\pi = \sum_s d^\pi_\mu(s) \sum_a \nabla_\theta \pi(a \mid s) Q^\pi(s, a)$.
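One way to build trust in this formula is to compare it against finite differences on a tiny MDP. The sketch below assumes a tabular softmax policy and a made-up 2-state, 2-action MDP (`P`, `R`, `MU` are illustrative numbers, not from any benchmark), and evaluates everything by brute-force iteration rather than by sampling:

```python
import math

GAMMA, N_S, N_A = 0.9, 2, 2
# Hypothetical toy MDP: P[s][a][s'] transition probabilities and R[s][a] rewards.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
MU = [0.6, 0.4]  # initial state distribution

def softmax(row):
    m = max(row)
    e = [math.exp(x - m) for x in row]
    z = sum(e)
    return [x / z for x in e]

def evaluate(theta, iters=2000):
    """Return (pi, V, Q) for the softmax policy via iterative policy evaluation."""
    pi = [softmax(theta[s]) for s in range(N_S)]
    V = [0.0] * N_S
    for _ in range(iters):
        Q = [[R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in range(N_S))
              for a in range(N_A)] for s in range(N_S)]
        V = [sum(pi[s][a] * Q[s][a] for a in range(N_A)) for s in range(N_S)]
    return pi, V, Q

def objective(theta):
    _, V, _ = evaluate(theta)
    return sum(MU[s] * V[s] for s in range(N_S))

def visitation(pi, iters=2000):
    """d_mu(s) = sum_t gamma^t P(s_t = s | pi, mu), truncated at `iters` steps."""
    p_t, d, disc = MU[:], [0.0] * N_S, 1.0
    for _ in range(iters):
        d = [d[s] + disc * p_t[s] for s in range(N_S)]
        p_t = [sum(p_t[s] * pi[s][a] * P[s][a][s2]
                   for s in range(N_S) for a in range(N_A)) for s2 in range(N_S)]
        disc *= GAMMA
    return d

def theorem_gradient(theta):
    pi, _, Q = evaluate(theta)
    d = visitation(pi)
    grad = [[0.0] * N_A for _ in range(N_S)]
    for s in range(N_S):
        for b in range(N_A):
            # d pi(a|s) / d theta[s][b] = pi(a|s) * (1[a == b] - pi(b|s))
            grad[s][b] = d[s] * sum(
                pi[s][a] * ((a == b) - pi[s][b]) * Q[s][a] for a in range(N_A))
    return grad

def finite_difference_gradient(theta, eps=1e-5):
    grad = [[0.0] * N_A for _ in range(N_S)]
    for s in range(N_S):
        for b in range(N_A):
            up = [row[:] for row in theta]
            dn = [row[:] for row in theta]
            up[s][b] += eps
            dn[s][b] -= eps
            grad[s][b] = (objective(up) - objective(dn)) / (2 * eps)
    return grad

theta = [[0.3, -0.2], [0.1, 0.5]]
g_thm = theorem_gradient(theta)
g_fd = finite_difference_gradient(theta)
max_gap = max(abs(g_thm[s][b] - g_fd[s][b]) for s in range(N_S) for b in range(N_A))
```

If the theorem is implemented correctly, `max_gap` should be tiny, limited only by the finite-difference step and the truncated iterations.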

Another way to express the objective: Using the notation of the visitation distribution allows us to express the expected total discounted reward objective in a form that gives another interpretation to the policy gradient theorem. Observe that:

$J = E[\sum_{t \ge 0} \gamma^t r_t] = \sum_{t \ge 0} \gamma^t E[r_t]$  (linearity of expectation)

The reward is $r_t = R(s_t, a_{t+1})$, where $s_t$ is the state the agent is in at time $t$ and $a_{t+1}$ is the action taken at that time. Therefore:

$E[r_t] = E[R(s_t, a_{t+1})] = \sum_{s} P(s_t = s) \sum_a \pi(a \mid s) R(s, a)$.

Plugging it in we get:

$J^\pi = \sum_{t \ge 0} \gamma^t E[r_t] =\sum_{t \ge 0} \gamma^t\sum_{s} P(s_t = s) \sum_a \pi(a \mid s) R(s, a)$.

Rearranging the terms we get:

$J^\pi = \sum_{s} \left\{ \sum_{t \ge 0} \gamma^t P(s_t = s) \right\} \sum_a \pi(a \mid s) R(s, a)$

and then using the notation of $d^\pi_\mu(s)$ we get:

$J^\pi = \sum_s d^\pi_\mu(s) \sum_a \pi(a \mid s) R(s, a)$.

Compare this with the gradient of the objective:

$\nabla J^\pi = \sum_s d^\pi_\mu(s) \sum_a \left\{ \nabla \pi(a \mid s) \right \}Q^\pi(s, a)$.

It appears as if we don’t take the gradient over the term $d^\pi_\mu(s)$, and this comes at the price of replacing $R(s, a)$ with $Q^\pi(s, a)$. It also allows us to derive a corollary for the derivative of the visitation distribution. The derivative of the visitation distribution has been studied in other contexts, not using this corollary, which I will probably cover in a future post.

Corollary*: $\sum_s \nabla_\theta d^\pi_\mu(s) \sum_a \pi(a \mid s) R(s, a) =\sum_s d^\pi_\mu(s) \sum_a \left\{ \nabla \pi(a \mid s) \right \}\{ Q^\pi(s, a) - R(s ,a)\}$

Proof: $\nabla J^\pi = \sum_s d^\pi_\mu(s) \sum_a \left\{ \nabla \pi(a \mid s) \right \}Q^\pi(s, a)$ (from policy gradient theorem).

$\nabla J^\pi = \nabla \left( \sum_s d^\pi_\mu(s) \sum_a \pi(a \mid s) R(s, a) \right)$ (from alternative expression for $J^\pi$).

$= \sum_s \nabla d^\pi_\mu(s) \sum_a \pi(a \mid s) R(s, a) + \sum_s d^\pi_\mu(s) \sum_a \nabla \pi(a \mid s) R(s, a)$

comparing the two equations and rearranging gives the result.

## The REINFORCE algorithm

The REINFORCE algorithm (Williams 1992) comes directly from approximating the gradient given by the policy gradient theorem. First, we will review sampling for the wider audience.

Brief Summary on Sampling: Given an expectation $E_p[f] = \sum_x f(x) p(x)$ of a function $f: X \rightarrow \mathbb{R}$ with respect to a probability distribution $p$ over $X$, we can approximate the expectation by drawing $N$ independent samples $x_i \sim p(.)$ and using the empirical mean $\mu_N = \sum_{i=1}^N \frac{f(x_i)}{N}$ as our approximation. From the law of large numbers, this approximation converges to $E_p[f]$ as $N \rightarrow \infty$. The estimate $\mu_N$ is called unbiased since its mean is the quantity we are approximating, i.e. $E_p[f]$: $E_{x_i \sim p}[\sum_{i=1}^N \frac{f(x_i)}{N}] = \sum_{i=1}^N \frac{1}{N} E_{x_i \sim p}[f(x_i)] =\sum_{i=1}^N \frac{1}{N} E_p[f] = E_p[f]$. Note that if our samples were from another distribution then our estimate may not be unbiased. Another quantity of interest for an estimate is its variance, given by $var(X) = E[(X- E[X])^2]$, which measures how far the estimate deviates from its mean in expectation. For the above estimate with independent samples, we have $var(\mu_N) = \frac{var(f(x))}{N}$. Thus, as we collect more samples our estimate deviates less from the mean, and if the variance $var(f(x))$ is low then even a few samples can give us an accurate estimate.
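The $1/N$ scaling of the variance can be seen in a few lines of Python. This is only an illustrative sketch: the distribution (uniform over $\{0, 1, 2, 3\}$), the function $f(x) = x^2$, and the number of repeats are arbitrary choices made for the example.

```python
import random
import statistics

rng = random.Random(0)
support = [0, 1, 2, 3]

def f(x):
    return x * x

def empirical_mean(n):
    """Monte Carlo estimate of E_p[f] from n independent samples of the uniform p."""
    return sum(f(rng.choice(support)) for _ in range(n)) / n

def estimator_variance(n, repeats=2000):
    """Empirical variance of the estimate mu_N across many independent repetitions."""
    estimates = [empirical_mean(n) for _ in range(repeats)]
    return statistics.pvariance(estimates)

true_mean = sum(f(x) for x in support) / len(support)   # E_p[f] = 3.5
var_10 = estimator_variance(10)     # roughly var(f(x)) / 10
var_100 = estimator_variance(100)   # roughly var(f(x)) / 100
```

Here $var(f(x)) = 24.5 - 3.5^2 = 12.25$, so `var_10` should land near 1.2 and `var_100` near 0.12, up to sampling noise.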

Vanilla REINFORCE: A quick look at the policy gradient tells us that we can never compute it exactly except for tiny MDPs. We therefore want to approximate the gradient of $J^\pi$ given by the policy gradient theorem and we will do it using sampling. We show below how to do it,

$\nabla J^\pi = \sum_s d^\pi_\mu(s) \sum_a \left\{ \nabla \pi(a \mid s) \right \}Q^\pi(s, a)$

$=\sum_s \sum_{t\ge 0} \gamma^t P(s_t = s \mid \pi) \sum_a \left\{ \nabla \pi(a \mid s) \right \}Q^\pi(s, a)$

$= \sum_{t\ge 0} \gamma^t \sum_s P(s_t = s \mid \pi) \sum_a \left\{ \nabla \pi(a \mid s) \right \}Q^\pi(s, a)$

Let’s say we sample a rollout $\langle s_0, a_1, r_1, s_1, a_2, r_2, \cdots a_T, r_T, s_T \rangle$ using the policy $\pi$. A rollout is generated by starting from a sampled state $s_0 \sim \mu(.)$ and then repeatedly taking an action in the current state using the policy and receiving a reward. Then note that $s_t \sim P(s_t = \cdot \mid \pi)$. Then using the sampling approach described above,

$\sum_s P(s_t = s \mid \pi) \sum_a \left\{ \nabla \pi(a \mid s) \right \}Q^\pi(s, a) \approx \sum_a \left\{ \nabla \pi(a \mid s_t) \right \}Q^\pi(s_t, a)$

Note that we only use a single sample here ($N=1$) for approximation!

After this approximation we have,

$\nabla J^\pi \approx \sum_{t\ge 0} \gamma^t \sum_a \left\{ \nabla \pi(a \mid s_t) \right \}Q^\pi(s_t, a)$

If we knew $Q^\pi(s, a)$ perfectly for every state $s$ and action $a$ then we could compute the above expression exactly, provided the number of actions is not prohibitively large. However, in general we do not have these conditions, so we want to get rid of the second summation as well.

Unfortunately, the summation is not written in the form of an expectation i.e. $\sum_a f(a) p(a)$ for some probability distribution $p$. Therefore we cannot apply the sampling approach. To solve this, we will use calculus 101 to express it in an expectation format by using $\sum_a (\nabla p(a)) f(a) = \sum_a p(a) (\nabla \ln p(a) )f(a) = E_{a \sim p(.)}[ \nabla \ln p(a) f(a)]$. Thus,

$\nabla J^\pi \approx \sum_{t\ge 0} \gamma^t \sum_a \left\{\pi(a \mid s_t) \nabla \ln \pi(a \mid s_t) \right \}Q^\pi(s_t, a)$

We already have samples from $\pi(. \mid s_t)$ which are nothing but the action $a_{t+1}$. Using it as a sample gives us:

$\nabla J^\pi \approx \sum_{t \ge 0} \gamma^t \nabla \ln \pi(a_{t+1} \mid s_t) Q^\pi(s_t, a_{t+1})$.

We are still left with estimating $Q^\pi(s_t, a_{t+1})$ to complete our algorithm. It turns out we can approximate even this by sampling. By definition, $Q^\pi(s, a) = E[\sum_{t \ge 0} \gamma^t r_{t+1} \mid s_0=s, a_1=a]$ is the expected total discounted reward collected by a rollout sampled using our policy. For $Q^\pi(s_t, a_{t+1})$, the value $\sum_{t' \ge t} \gamma^{t'-t} r_{t'+1}$ is an unbiased sample estimating $Q^\pi(s_t, a_{t+1})$ (convince yourself that this is true). This gives us

$\begin{aligned} \nabla J^\pi &\approx \sum_{t \ge 0} \gamma^t \nabla \ln \pi(a_{t+1} \mid s_t)\sum_{t' \ge t} \gamma^{t'-t} r_{t'+1} \\ \nabla J^\pi &\approx \sum_{t \ge 0} \nabla \ln \pi(a_{t+1} \mid s_t)\sum_{t' \ge t} \gamma^{t'} r_{t'+1} \end{aligned}$

After these approximations, we are ready to state the vanilla REINFORCE algorithm:

The REINFORCE Algorithm

1. Initialize the parameters $\theta^{(0)}$ randomly.
2. Generate a rollout $s_0, a_1, r_1, s_1, \cdots a_T, r_T, s_T$ using the current policy $\pi_\theta$.
3. Do gradient ascent: $\theta^{(k+1)} \leftarrow \theta^{(k)} + \eta_k \sum_{t \ge 0} \nabla \ln \pi(a_{t+1} \mid s_t) \sum_{t' \ge t} \gamma^{t'} r_{t'+1}$, where $k$ indexes the updates and $\eta_k$ is the learning rate.
4. Repeat from step 2 until convergence.
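The steps above can be sketched in plain Python. This is a toy illustration and not a reference implementation: the environment `toy_step` (action 1 pays reward 1, the state alternates), the tabular softmax policy, the learning rate and the episode count are all assumptions made for the example.

```python
import math
import random

GAMMA, HORIZON, N_STATES, N_ACTIONS = 0.9, 5, 2, 2

def softmax(row):
    m = max(row)
    e = [math.exp(x - m) for x in row]
    z = sum(e)
    return [x / z for x in e]

def toy_step(s, a):
    """Hypothetical environment: action 1 pays reward 1 and the state alternates."""
    return (1.0 if a == 1 else 0.0), 1 - s

def sample_rollout(theta, rng):
    s, rollout = 0, []
    for _ in range(HORIZON):
        probs = softmax(theta[s])
        a = rng.choices(range(N_ACTIONS), weights=probs)[0]
        r, s_next = toy_step(s, a)
        rollout.append((s, a, r))   # (s_t, a_{t+1}, r_{t+1})
        s = s_next
    return rollout

def reinforce_gradient(theta, rollout):
    """sum_t grad ln pi(a_{t+1} | s_t) * sum_{t' >= t} gamma^{t'} r_{t'+1}."""
    grad = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    tail = 0.0  # running value of sum_{t' >= t} gamma^{t'} r_{t'+1}
    for t in reversed(range(len(rollout))):
        s, a, r = rollout[t]
        tail += (GAMMA ** t) * r
        probs = softmax(theta[s])
        for b in range(N_ACTIONS):
            # d ln pi(a|s) / d theta[s][b] = 1[a == b] - pi(b|s) for a tabular softmax
            grad[s][b] += ((a == b) - probs[b]) * tail
    return grad

def train(episodes=2000, lr=0.05, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        rollout = sample_rollout(theta, rng)   # fresh on-policy rollout, used once
        grad = reinforce_gradient(theta, rollout)
        for s in range(N_STATES):
            for b in range(N_ACTIONS):
                theta[s][b] += lr * grad[s][b]  # gradient ascent step
    return theta

theta = train()
prob_good = [softmax(theta[s])[1] for s in range(N_STATES)]  # pi(action 1 | s)
```

On this easy problem the probability of the rewarding action should climb towards 1 in both states. Each rollout is used for exactly one update and then discarded.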

Keep the Samples Unbiased. Throw away Rollouts after Update.

Our entire derivation approximating the value of $\nabla J^\pi$ relies on a single rollout that is generated using the policy $\pi$. If the rollout was generated using another policy then our assumptions are no longer valid. This is exactly what happens if we keep old rollouts after updating the policy: those rollouts are no longer unbiased samples from the current policy and hence do not estimate the gradient above. Therefore, we cannot keep those samples after an update. This is unlike Q-learning, where experiences are stored in a huge replay memory. Since our derivation relies on samples coming from the current policy, REINFORCE is an example of an on-policy reinforcement learning approach. It seems kind of wasteful and inefficient to throw away old samples after updating the parameters. A common tactic is therefore to collect several rollouts in parallel using a large number of threads. We will talk about this in another post.

Problem with REINFORCE:

REINFORCE is not a very good learning algorithm for anything interesting. A few main issues with it are:

1. No Exogenous Randomization: REINFORCE uses the same policy for exploration as the one being optimized. There is no way to control the flow of exploration or way to add exogenous randomization. This can hinder exploration since the policy can get stuck in specific regions and may not be able to explore outside.
2. Variance: We used a single rollout above to make several approximations. This can make the variance of our gradient estimate high, and a high-variance estimate can prevent learning.
3. Degenerate Solutions: REINFORCE can get stuck in degenerate solutions where the policy can suddenly become deterministic while being far from optimal. Once the policy is deterministic, the updates stop since our estimated gradient:

$\sum_{t \ge 0} \nabla \ln \pi(a_{t+1} \mid s_t) \sum_{t' \ge t} \gamma^{t'} r_{t'+1}$,

becomes 0. This effectively kills the learning and it has to be restarted. This situation is frequently encountered in practice and we will call it the entropy-collapse issue. Q-learning does not suffer from this degeneracy.

Next we will discuss some solutions to the above problems and a few methods for debugging.

## No Exogenous Randomization

REINFORCE’s inability to separate the exploration policy from the policy being optimized is one of its biggest weaknesses. There are at least two different ways in which this has been addressed. The first approach uses a warm start to initialize the policy using behaviour cloning or imitation learning. This also makes sense from a deployment point of view, where we have to place our agent in the real world in order to perform reinforcement learning and thus don’t want the start policy to be randomly initialized (imagine asking users to chat with a randomly initialized conversation model!). One can naively hope that a warm start will enable the policy to explore in the right region of the problem.

The other approach is to use a separate exploration policy and unbias the gradients by using importance weights. However this approach may not be suitable if the policy has close to 0 mass on actions which are chosen by the exploration policy.

Another line of research involves designing intrinsic reward function (pseudo-counts, prediction error, variance of ensemble) to incentivize the policy to eventually learn to explore the desired state space. However, these approaches generally range from having theoretical guarantees in tabular MDPs (e.g., MBIE-EB count based exploration) to having no theoretical guarantee at all (e.g., prediction error).

## Reducing Variance through Control Variate

We made a remark earlier pertaining to the variance of the estimate of the gradient for the REINFORCE algorithm. Let’s make it more formal:

Given a rollout, $\tau = \langle s_0, a_1, r_1, s_1, \cdots a_T, r_T, s_T \rangle$ generated by a policy $\pi$, we approximate the derivative of expected total discounted reward $\nabla J^\pi$ using:

$X(\tau) = \sum_{t \ge 0} \nabla \ln \pi(a_{t+1} \mid s_t)\sum_{t' \ge t} \gamma^{t'} r_{t'+1}$

This estimate is unbiased i.e. $E[X(\tau)] = \nabla J^\pi$ and its variance is given by:

$Var(X) = E[(X(\tau) -\nabla J) (X(\tau) - \nabla J)^T]$. This variance can be high due to the value of $\sum_{t' \ge t} \gamma^{t'} r_{t'+1}$. Therefore, we want to modify the estimate, keeping it unbiased but reducing the variance. In statistics, one way to do this is to use a control variate.

Control-Variate: Let $X$ be an estimate of a quantity with mean $\mu$, we define a control variable $Z$ with mean $E[Z]$ and define a new estimate given by $Y = X + c (Z - E[Z])$. This new estimate is still unbiased as $E[Y] = E[X] + c E[Z] - c E[Z] = \mu$. The variance is given by:

$Var(Y) = Var(X) + c^2 Var(Z) + 2c Cov(X, Z)$ which can be made smaller than $Var(X)$ by optimizing the choice of $c$ and $Z$.
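A classic toy example, unrelated to RL, shows the effect. The sketch below assumes $X = e^U$ with $U \sim \text{Uniform}(0, 1)$ and uses $Z = U$ (known mean $0.5$) as the control variable. The coefficient $c = -Cov(X, Z) / Var(Z)$ minimizes the quadratic above. Here it is estimated from the same samples, which is a common shortcut but introduces a small bias.

```python
import math
import random

rng = random.Random(0)
n = 20000
u = [rng.random() for _ in range(n)]
x = [math.exp(v) for v in u]   # X = e^U, with E[X] = e - 1
z = u                          # control variable Z = U, with known mean E[Z] = 0.5

def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

mx, mz = mean(x), mean(z)
cov_xz = sum((a - mx) * (b - mz) for a, b in zip(x, z)) / n
c = -cov_xz / variance(z)      # minimizer of Var(X) + c^2 Var(Z) + 2 c Cov(X, Z)

# New estimate Y = X + c (Z - E[Z]); same mean as X but much lower variance.
y = [a + c * (b - 0.5) for a, b in zip(x, z)]

var_x, var_y = variance(x), variance(y)
mean_y = mean(y)
```

Because $e^U$ and $U$ are strongly correlated, `var_y` comes out far smaller than `var_x` while `mean_y` still estimates $e - 1$.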

Reinforce Baseline: We need to define a control variable to reduce variance.

We define it as: $Z(\tau) = \sum_{t \ge 0} \gamma^t \nabla \ln \pi(a_{t+1} \mid s_t) b(s_t)$, where $b: \mathcal{S} \rightarrow \mathbb{R}$ is a function called the baseline. We can observe that $E[Z(\tau)] = 0$ as,

$E[Z(\tau)] = E[\sum_{t \ge 0} \gamma^t \nabla \ln \pi(a_{t+1} \mid s_t) b(s_t) ]$

$E[Z(\tau)] = \sum_{t \ge 0} \gamma^t E[\nabla \ln \pi(a_{t+1} \mid s_t) b(s_t) ]$

$= \sum_{t \ge 0} \gamma^t \sum_{s_t} P_t(s_t) \sum_{a_{t+1}} \pi(a_{t+1} \mid s_t) \nabla \ln \pi(a_{t+1} \mid s_t) b(s_t)$

$= \sum_{t \ge 0} \gamma^t \sum_{s_t} P_t(s_t) b(s_t) \sum_{a_{t+1}} \nabla \pi(a_{t+1} \mid s_t)$

$= \sum_{t \ge 0} \gamma^t \sum_{s_t} P_t(s_t) b(s_t) \nabla \sum_{a_{t+1}} \pi(a_{t+1} \mid s_t)$

$= \sum_{t \ge 0} \gamma^t \sum_{s_t} P_t(s_t) b(s_t) \nabla 1 = 0$.

We will further set the value of $c=-1$, giving us our new estimate as:

$Y(\tau) = X(\tau) - (Z(\tau) - E[Z]) =\sum_{t \ge 0} \gamma^t \nabla \ln \pi(a_{t+1} \mid s_t) \{ Q(s_t, a_{t+1}) - b(s_t) \}$, where $Q(s_t, a_{t+1}) = \sum_{t' \ge t} \gamma^{t' - t} r_{t' + 1}$.

Proof of Variance Reduction(*): We still haven’t chosen a baseline function and we set the value of $c = -1$ without justification. Overall we still haven’t shown that variance gets reduced. I will provide a proof that gives a choice of the baseline function and proves variance reduction. For simplicity, we will consider the single step case (i.e. $\tau = \langle s_0, a_1, r_1, s_1 \rangle$). We have,

$Var(Y) = Var(X) + Var(Z) - 2 Cov(X, Z)$

$= Var(X) + E[(Z - E[Z])^2] - 2 E[(X- E[X])(Z-E[Z])]$

$= Var(X) + E[Z^2] - 2 E[XZ] + 2 E[X] E[Z] = Var(X) + E[Z(Z - 2X)]$.

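Here, in the single step case, $Z = \nabla \ln \pi(a_1 \mid s_0) b(s_0)$ and $X = \nabla \ln \pi(a_1 \mid s_0) Q(s_0, a_1)$, so $E[Z(Z-2X)] = E[ \nabla \ln \pi(a_1 \mid s_0) b(s_0) \{\nabla \ln \pi(a_1 \mid s_0) \{ b(s_0) - 2 Q(s_0, a_1)\} \}]$,

which can be simplified to $= E[ \|\nabla \ln \pi(a_1 \mid s_0) \|^2 b(s_0) \{ b(s_0) - 2 Q(s_0, a_1) \}]$

$= E[\|\nabla \ln \pi(a_1 \mid s_0) \|^2] b^2(s_0) - 2 b(s_0) E[ \|\nabla \ln \pi(a_1 \mid s_0) \|^2 Q(s_0, a_1)]$,

where the expectation is over $a_1 \sim \pi(. \mid s_0)$ for a fixed $s_0$.

This is a quadratic in $b(s_0)$. Setting its derivative with respect to $b(s_0)$ to 0 gives the optimal baseline as $b(s_0) = \frac{E[ \|\nabla \ln \pi(a_1 \mid s_0) \|^2 Q(s_0, a_1)]}{E[\|\nabla \ln \pi(a_1 \mid s_0) \|^2]}$. Plugging this back in gives $E[Z(Z-2X)] = -\frac{\left(E[ \|\nabla \ln \pi(a_1 \mid s_0) \|^2 Q(s_0, a_1)]\right)^2}{E[\|\nabla \ln \pi(a_1 \mid s_0) \|^2]} \le 0$, so $Var(Y) \le Var(X)$ for this choice of baseline.

If one makes an independence assumption between $\|\nabla \ln \pi(a_1 \mid s_0) \|^2$ and $Q(s_0, a_1)$ then we get

$b(s_0) = \frac{E[ \|\nabla \ln \pi(a_1 \mid s_0) \|^2 Q(s_0, a_1)]}{E[\|\nabla \ln \pi(a_1 \mid s_0) \|^2]} =\frac{E[ \|\nabla \ln \pi(a_1 \mid s_0) \|^2] E[Q(s_0, a_1)]}{E[\|\nabla \ln \pi(a_1 \mid s_0) \|^2]} =E[Q(s_0, a_1)] = V(s_0)$.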
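For a single state with a handful of actions, the effect of the baseline can be computed exactly by enumerating actions. The sketch below assumes a softmax policy over three actions with made-up logits and made-up action values `q`. It measures the total variance (trace of the covariance) of the one-step estimate $\nabla \ln \pi(a \mid s_0) \{Q(s_0, a) - b\}$ for three baselines: none, the value function, and the optimal baseline derived above.

```python
import math

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    z = sum(e)
    return [x / z for x in e]

theta = [0.2, -0.1, 0.4]   # hypothetical logits of pi(. | s_0)
q = [10.0, 11.0, 12.5]     # hypothetical Q(s_0, a) for each action
pi = softmax(theta)
K = len(pi)

def score(a):
    """grad_theta ln pi(a | s_0) for a softmax policy."""
    return [(1.0 if a == j else 0.0) - pi[j] for j in range(K)]

sq_norm = [sum(g * g for g in score(a)) for a in range(K)]

def mean_and_variance(b):
    """Exact mean and total variance of score(a) * (Q(a) - b) under a ~ pi."""
    mean = [sum(pi[a] * score(a)[j] * (q[a] - b) for a in range(K)) for j in range(K)]
    second_moment = sum(pi[a] * sq_norm[a] * (q[a] - b) ** 2 for a in range(K))
    return mean, second_moment - sum(m * m for m in mean)

b_value = sum(pi[a] * q[a] for a in range(K))   # V(s_0)
b_optimal = (sum(pi[a] * sq_norm[a] * q[a] for a in range(K))
             / sum(pi[a] * sq_norm[a] for a in range(K)))

mean_0, var_0 = mean_and_variance(0.0)
mean_v, var_v = mean_and_variance(b_value)
mean_opt, var_opt = mean_and_variance(b_optimal)
```

All three estimates have the same mean, so the baseline does not bias the gradient. In this particular example the value-function baseline already removes most of the variance, and the optimal baseline does at least as well by construction. The example says nothing about how the two compare in general.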

One can do a similar analysis for multi-step reinforcement learning and derive the optimal baseline. While the optimal baseline has been known for many years, all empirical applications of REINFORCE or its derivatives (in which I count actor-critic methods) use an approximated baseline given by the state-value function $V^\pi$. I am not familiar with any literature that proves (or disproves) that the approximate baseline actually reduces the variance. However, in the empirical RL circle the variance reduction due to the approximated baseline is often taken for granted.

## Importance of Regularization

REINFORCE can get stuck in degenerate solutions as pointed out before. To avoid this degeneracy, a common tactic of regularizing the objective with the entropy of the policy or KL-divergence from a reference policy is adopted. We will focus on the former method here. The entropy regularized objective is given below:

$J^\pi_\lambda = \sum_s d^\pi_\mu(s) \left\{ \sum_a \pi(a \mid s) R(s, a) + \lambda H(\pi(.\mid s)) \right\}$,

where $\lambda$ is a hyperparameter controlling the effect of the regularization. If one computes the derivative of the new objective one gets:

$\nabla J^\pi_\lambda = \sum_s d^\pi_\mu(s) \sum_a \nabla \pi(a \mid s) Q^\pi(s, a) + \lambda \sum_s \nabla \{d^\pi_\mu(s) H(\pi(.\mid s))\}$

Most researchers however use the biased gradient given below, which ignores the dependence of $d^\pi_\mu(s)$ on $\theta$ in the entropy term:

$\nabla J^\pi_\lambda \approx \sum_s d^\pi_\mu(s) \sum_a \nabla \pi(a \mid s) Q^\pi(s, a) + \lambda \sum_s d^\pi_\mu(s) \nabla H(\pi(.\mid s))$

$\Rightarrow \nabla J^\pi_\lambda \approx \sum_s d^\pi_\mu(s) \{ \sum_a \nabla \pi(a \mid s) Q^\pi(s, a) + \lambda \nabla H(\pi(.\mid s))\}$.
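The per-state entropy gradient $\nabla H(\pi(.\mid s))$ is cheap to compute. As a small sketch, assume $\pi(. \mid s)$ is a softmax over logits $\theta_b$. Then differentiating $H = -\sum_a \pi_a \ln \pi_a$ gives $\partial H / \partial \theta_b = -\pi_b (\ln \pi_b + H)$. The code checks this closed form against finite differences, using arbitrary logits.

```python
import math

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    z = sum(e)
    return [x / z for x in e]

def entropy(logits):
    return -sum(p * math.log(p) for p in softmax(logits))

def entropy_gradient(logits):
    """Closed form for softmax logits: dH / dtheta_b = -pi_b * (ln pi_b + H)."""
    pi = softmax(logits)
    h = entropy(logits)
    return [-p * (math.log(p) + h) for p in pi]

def numeric_gradient(logits, eps=1e-6):
    grad = []
    for b in range(len(logits)):
        up, dn = logits[:], logits[:]
        up[b] += eps
        dn[b] -= eps
        grad.append((entropy(up) - entropy(dn)) / (2 * eps))
    return grad

logits = [0.5, -1.0, 2.0]   # arbitrary example logits
g_closed = entropy_gradient(logits)
g_numeric = numeric_gradient(logits)
max_gap = max(abs(a - b) for a, b in zip(g_closed, g_numeric))
g_uniform = entropy_gradient([0.0, 0.0, 0.0])   # entropy is maximal here
```

At the uniform policy the gradient vanishes, which matches the intuition that the entropy bonus pulls the policy towards uniform and stops pulling once it gets there.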

The entropy regularized objective no longer follows our classic Bellman optimality conditions, and the optimal policy is no longer deterministic in general. An easy verification of this is to let $\lambda \rightarrow \infty$ and observe that the optimal solution of $\arg\max_\pi J^\pi_\lambda$ will be to remain uniformly random everywhere (assuming the reward function is bounded).

## Evaluating Training Progress

When training a classifier for a task like ImageNet, one generally monitors the training error and the error on a held-out tune set. A decrease in training error tells you that your model has sufficient capacity, and an increase in tune set error indicates potential overfitting. Consider the approximated gradient of the objective for REINFORCE:

$\nabla J^\pi \approx \sum_{t \ge 0} \nabla \ln \pi(a_{t+1} \mid s_t)\sum_{t' \ge t} \gamma^{t'} r_{t'+1}$

the standard way to program this using PyTorch or TensorFlow is to define the loss variable:

$L = -\sum_{t \ge 0} \ln \pi(a_{t+1} \mid s_t)\sum_{t' \ge t} \gamma^{t'} r_{t'+1}$

When doing gradient descent on this variable, treating the sampled returns $\sum_{t' \ge t} \gamma^{t'} r_{t'+1}$ as constants, we get the same update as gradient ascent with the approximation above. This does not make $L$ the real loss function. In fact it is not the loss for the REINFORCE algorithm at all, but using it gives us the right gradients, so we will call it the substitute loss.

Monitoring the substitute loss is not the same as monitoring the actual training loss when doing supervised learning. To begin with, notice that the substitute loss is positive when the agent receives only positive reward and it is negative when the agent only receives negative rewards. This is counterintuitive as one ideally associates high loss with low return. So why is this so?

This happens because when all the rewards are positive the loss is positive, and the only way the agent can reduce it is by pushing the term $\ln \pi(a_{t+1} \mid s_t)$ towards 0, which means increasing the probability of actions which generate positive reward (which is what we want). Similarly, when all rewards are negative the loss is negative, and the only way the agent can make it more negative is by decreasing the probability of these actions until the term $\ln \pi(a_{t+1} \mid s_t)$ tends towards negative infinity. This will eventually lead to these actions not being sampled anymore.
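The sign behaviour is easy to reproduce without any deep learning framework. The sketch below computes the substitute loss for a hand-written rollout. The log probabilities and rewards are made up for the example, and `rewards[t]` stands for $r_{t+1}$.

```python
import math

def substitute_loss(log_probs, rewards, gamma):
    """L = -sum_t ln pi(a_{t+1} | s_t) * sum_{t' >= t} gamma^{t'} r_{t'+1}."""
    loss, tail = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        tail += (gamma ** t) * rewards[t]   # running sum of discounted future rewards
        loss -= log_probs[t] * tail
    return loss

uncertain = [math.log(0.5)] * 3   # policy picked each action with probability 0.5
confident = [math.log(0.9)] * 3   # policy picked each action with probability 0.9

loss_pos = substitute_loss(uncertain, [1.0, 1.0, 1.0], gamma=1.0)
loss_neg = substitute_loss(uncertain, [-1.0, -1.0, -1.0], gamma=1.0)
loss_confident = substitute_loss(confident, [1.0, 1.0, 1.0], gamma=1.0)
```

With all-positive rewards the loss is positive ($6 \ln 2$ here), with all-negative rewards it flips sign, and the more confident policy has a lower loss even though the return is identical. The value of $L$ therefore tells us little about how well the agent is doing.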

Thus, instead of monitoring the substitute loss, one can monitor two things: (i) the total reward received by the agent which also represents an unbiased estimate of the actual objective and (ii) the entropy of the policy. Ideally the total reward achieved by the agent should increase and the entropy of the policy should decrease. Any stagnation means the learning is not happening effectively.

## Major Failure Case for Policy Gradient

Policy gradient methods have no guarantees in theory or practice. A simple example which can be used to demonstrate this is from John Langford. There are $H+1$ states and two actions, and the agent always starts in state $s_0$. At each state $s_i$ ($i \in \{0, 1, \cdots, H-1\}$), one action takes you to $s_{i+1}$ and the other action takes you back to $s_0$. The action mappings are randomly assigned to each state but the MDP is deterministic. Every state has a reward of 0 except $s_H$ which has a reward of 1. The game ends when the agent reaches the state $s_H$.

With probability close to 1, any rollout will fail to reach the destination state where it earns a reward (a uniformly random policy picks the $H$ correct actions in a row with probability $2^{-H}$) and therefore will not cause any meaningful learning when using on-policy policy gradient methods. It will require exponentially many samples to reach the state $s_H$, and therefore on-policy policy gradient with a randomly initialized policy is not what is called a PAC RL algorithm, which is an important theoretical guarantee.
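A quick simulation makes the exponential blow-up concrete. The sketch below assumes a uniformly random policy and lets the rollout continue until $s_H$ is reached. This is the same process as waiting for $H$ consecutive heads of a fair coin, whose expected waiting time is $2^{H+1} - 2$ steps. The chain size and sample count are arbitrary.

```python
import random

def make_chain(H, rng):
    """forward[i] is the action that moves s_i to s_{i+1}; the other action resets to s_0."""
    return [rng.randrange(2) for _ in range(H)]

def steps_to_goal(forward, rng):
    """Number of steps a uniformly random policy needs to reach s_H from s_0."""
    s, steps = 0, 0
    while s < len(forward):
        a = rng.randrange(2)                  # uniformly random action
        s = s + 1 if a == forward[s] else 0   # advance on the right action, else reset
        steps += 1
    return steps

rng = random.Random(0)
H = 4
forward = make_chain(H, rng)
samples = [steps_to_goal(forward, rng) for _ in range(20000)]
mean_steps = sum(samples) / len(samples)
expected_steps = 2 ** (H + 1) - 2             # 30 for H = 4
```

Already at $H = 30$ the expected number of steps exceeds two billion, and until the goal is hit every reward is 0, so the REINFORCE gradient estimate is exactly zero.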

## Conclusion

Despite their flaws, policy gradient methods are used widely for all sorts of AI applications. One reason behind their widespread use is their remarkable similarity to supervised learning, with which most empiricists are familiar. Thus, one can train their favourite dialogue model using REINFORCE by simply multiplying the log probabilities with a value term. Further, variants of policy gradient methods like PPO perform much better and are reasonable baselines. However, it could also be the case that most people don’t care about guarantees as long as they can solve their favourite childhood game, and as a field we must guard against this trend as we move towards more serious applications of these algorithms.