Markov Decision Process and Reinforcement Learning (Part 1)

Markov decision process (MDP) is a framework for modeling decision making and is very popularly used in the areas of machine learning, robotics, economics and others. An MDP models a decision maker (henceforth agent) in a world, where the agent gets a reward for performing action and the long term goal is to maximize the total reward. The concept of MDP is very intimately tied with the idea of Reinforcement Learning (RL), which is a machine learning framework for learning policies for decision making.

The general idea of MDP has been around since 1950s whereas RL has been around since 1969 (Andreae’s 1969). Long considered (often unjustifiably) suitable for only toy problems, RL has seen a recent renewal due to reasons, I will explain later. There is a good amount of literature on MDP and RL, with the Reinforcement learning book by Sutton and Barto, being a particularly good introduction to the subject. However, I found that these resources remove some essential mathematical treatment or they are written for mathematicians and beyond the reach of general audience.

In this post, I will cover a simple mathematical treatment of MDP with proofs of fundamental concepts. It will be best to use this post as complementary to Sutton and Barto. At places, I did not find proof of many theorems, and in such cases I have supplied my own proofs (marked with *). Please be careful in reading these proofs and let me know if you find any error.

1. Introduction

Definition: An MDP M is defined by the tuple (S, A, P, R, \gamma) where S is a set of states, A is a set of actions that an agent can take, P^a_{s,s'} is defined for all s,s' \in S and a \in A and denotes the probability of transitioning to state s' from state s on taking action a, we will denote such a transition as s \rightarrow_a s'R^a_{s,s'} is a real number denoting the reward received by the agent in the transitions \rightarrow_a s'\gamma is a real number that denotes how much to discount future rewards (its purpose will become clear shortly).

Thus, MDP defines a problem setup, where there is an agent that can take action and in the process receive a reward as well change the state. The setup is Markovian since the next state and reward depends only upon the current state and the action taken.

The goal of a rational agent should be to achieve the maximum reward in the long term. This is expressed as a policy which gives a distribution over actions to take, when presented with a state. More formally, 

Definition: Policy \pi: S\times A \rightarrow [0,1] defines the probability distribution over action given a state. \pi(s,a) denotes the probability of taking action a given state s. The policy is deterministic if \forall s\in S, \exists a \in A \,\,s.t.\,\, \pi(s,a) = 1 .

The agent then plays a game where it is landed in a state s, and it continuously takes action to maximize the total reward. More precisely, a trace \tau of agent’s game will look like a sequence of state, action, reward \tau = , where a_t is the action taken when presented with state s_t and r_t is the reward given in the process and s_{t+1} is the new state.

What we are interested in is the expected reward given the states, which we denote by the concept of state value function V^\pi(s).

 V^\pi(s) = E[ \sum_{t=0}^\infty \gamma^{t} r_{t + 1} | s_o = s] 

a similar concept is given by action value function Q^\pi(s,a) :

 Q^\pi(s) = E[ \sum_{t=0}^\infty \gamma^{t} r_{t + 1} | s_o = s, a_1 = a]

Episodic and Continuous Task: If the task defined by MDP, always terminates in  some bounded number of steps irrespective of the starting state and actions taken by the agent, the task is called episodic (will game of Chess be episodic? how about a racing game?). A single play in an episodic task is called an episode. If a task is not episodic then it can potentially run forever and such tasks are called continuous task.

Discounting: As seen in the equation above, the reward at time step t+1 is discounted by \gamma^t. There are several reasons why it is done but two main reasons are: there are evidence that humans behave in similar manner and secondly, we want the value function is a well defined quantity. When the reward is bounded by some finite constantM and \gamma \in [0, 1) then value function is always defined. The maximum reward achievable is bounded by \frac{M}{1-\gamma} as shown below:

 V^\pi(s) =E[ \sum_{t=0}^\infty \gamma^{t} r_{t + 1} | s_o = s]

 V^\pi(s) \le E[ \sum_{t=0}^\infty \gamma^{t} M | s_o = s] = M \sum_{t=0}^\infty  \gamma^t = \frac{M}{1-\gamma}

Note that the maximum achievable reward is M * \frac{1}{1-\gamma}. If the task was episodic and terminates in T steps, then the maximum achievable reward in undiscounted scenario would have been MT. Comparing it with what we have for continuous discounted task, we see that \frac{1}{1-\gamma} plays the role of T and for this reason, this is sometimes called the effective time horizon of the task. If \gamma = 0.99 means that the task is effectively being played for 100 steps.

In our proofs below, we will focus only on continuous tasks with \gamma \in [0,1).

2. Bellman Self-consistency equation

2.1 Self-consistency equation

 By a little work, we can express the value functions in recursive form giving us what is called the self-consistency equation. Lets see how it is done:

 V^\pi(s) = E[ \sum_{t=0}^\infty \gamma^{t} r_{t + 1} | s_o = s]  = \sum_{\tau} \left(\sum_{t=0}^\infty \gamma^{t} r_{t + 1}\right) p(\tau)     = \sum_{a_1, r_1, s_1} \sum_{\tau'} \left(\sum_{t=0}^\infty \gamma^{t} r_{t + 1}\right) p(\langle a_1, r_1, s_1\rangle:\tau')  = \sum_{a_1, r_1, s_1} \sum_{\tau'}\left( r_ 1 + \sum_{t=1}^\infty \gamma^{t} r_{t + 1}  \right) p(\langle a_1, r_1, s_1\rangle)p(\tau' | s_1)   = \sum_{a_1, r_1, s_1}p(\langle a_1, r_1, s_1\rangle)\sum_{\tau'}\left\{ r_1  +\left( \sum_{t=1}^\infty \gamma^{t} r_{t + 1}  \right) \right\}p(\tau' | s_1)   = \sum_{a_1, r_1, s_1}p(\langle a_1, r_1, s_1\rangle)\left\{\sum_{\tau'}r_1p(\tau' | s_1)  + \sum_{\tau'}\left( \sum_{t=1}^\infty \gamma^{t} r_{t + 1}  \right)p(\tau' | s_1) \right\}

  r_1  only depends upon  s, a_1, s_1 therefore we get

  = \sum_{a_1, r_1, s_1}p(\langle a_1, r_1, s_1\rangle)\left\{r_1  + \gamma \sum_{\tau'}\left( \sum_{t=0}^\infty \gamma^{t} r_{t + 2}  \right)p(\tau' | s_1) \right\}

the second term is same as V^\pi(s_1) therefore we get

  = \sum_{a_1, r_1, s_1}p(\langle a_1, r_1, s_1\rangle)\left\{r_1  + \gamma V^\pi(s_1) \right\}

Expanding p(\langle a_1, r_1, s_1 \rangle) using the fact that given s_1, a_1 the reward r_1 is fixed, we get:

V^\pi(s) = \sum_{a_1} \pi(s, a_1) \sum_{s_1} P^a_{s, s_1}\left\{R^a_{s,s_1}  + \gamma V^\pi(s_1) \right\}  

this is the Bellman self consistency equation for the value function. Basically any state-value function should satisfy the self-consistency equation.

Similarly, Q^\pi satisfies a self-consistency equation below:

Q^\pi(s,a) = \sum_{s'}P^a_{s,s'} \{ R^a_{s,s'} + \gamma \sum_{a'} \pi(s',a')Q^\pi(s', a') \}

we also have relation between Q^\pi and V^\pi given by:

Q^\pi(s,a) = \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^\pi(s') \}

It is not clear whether the self-consistency equations admit a unique solution. We show below that infact they do for finite MDPs.

2.2 Self-consistency equation has a unique solution

Theorem*: Self-consistency equation for state-value function for finite MDPs have a unique solution.

Proof: We have  \forall s\in S  

V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s_1} P^a_{s, s_1}\left\{R^a_{s,s_1}  + \gamma V^\pi(s_1) \right\}   V^\pi(s) = \sum_{s_1} \{ \gamma \sum_{a}\pi(s, a)P^a_{s, s_1} \} V^\pi(s_1) + \sum_{s_1} \sum_{a} \pi(s,a) P^a_{s,s_1} R^a_{s,s_1}  

define \alpha_{s,s_1} =\{ \gamma \sum_{a}\pi(s, a) P^a_{s, s_1} \}, and \beta_s =  \sum_{s_1} \sum_{a} \pi(s,a) P^a_{s,s_1} R^a_{s,s_1}  we get

V^\pi(s) = \sum_{s_1}\alpha_{s,s_1} V^\pi(s_1) + \beta_s   

or in matrix form with  A = [\alpha_{s,s_1}] and b = [\beta_s],

I V^\pi = A V^\pi + b   \Rightarrow (I - A) V^\pi = b

We show that this linear equation has a unique solution by showing that matrix  (I - A) has a null space of dimenion 0.

Let  x = (x_s) \in N(I - A) \Rightarrow (I - A) x = 0  or equivalentlty  Ax = x .

 x_s = \sum_{s'} \alpha_{s,s'} x_{s'}

Here we note following properties of matrix  A :

  1.  \alpha_{s,s'} \ge 0
  2.  \sum_{s'} \alpha_{s,s'}  = \gamma \sum_{s'} \sum_{a}\pi(s, a)P^a_{s, s'}
    =\gamma \sum_{a}\pi(s, a)(\sum_{s'} P^a_{s, s'}) =  \gamma \sum_{a}\pi(s, a) = \gamma .

Defining  x_{max} = max\{x_s | s\in S\} and  x_{min} = min\{x_s | s\in S\} and using the above property of matrix  A we get.

This means  \gamma x_{min} \le x_s = \sum_{s'} \alpha_{s,s'} x_{s'} \le \gamma x_{max} . This implies:

 \gamma x_{min} \le x_{min}  \Rightarrow (1 - \gamma) x_{min} \ge 0 \Rightarrow x_{min} \ge 0 and
 x_{max} \le \gamma x_{max} \Rightarrow (1 - \gamma) x_{max} \le 0 \Rightarrow x_{max} \le 0

Using  x_{min} \ge 0 and  x_{max} \le 0 , we get  x = 0 . This implies that  N(I - A) = \{ 0 \} and therefore the self consistency equation for state-value has a unique solution.

3. Optimal Policy

A natural question to ask is, how to find a policy that gives us very high value of state value function. For a given state s, we can say that policy \pi is as good as   \pi^{'} if V^\pi(s) \ge V^{\pi'}(s). The policy is better if the inequality is strict. It is easy to see that for finite MDP and state, there exists a policy that is as good as all other policies. But how do we define an ordering over policies, irrespective of state?

One way would be, \pi is as good as   \pi^{'} if \forall s\in S; V^\pi(s) \ge V^{\pi'}(s). That is the policy must equal (or exceed) the other policy for every state. But this leads to the question of optimality, does an optimal policy exists. After all, we can end up in a situation where for two states s_1, s_2 \in S and policies \pi_1, \pi_2; V^{\pi_1}(s_1) > V^{\pi_2}(s_1)  and V^{\pi_1}(s_2) < V^{\pi_2}(s_2). We now prove that such an optimal policy does exists for finite MDP.

3.1 Existence of optimal policy

Theorem*: There exists an optimal policy for every finite MDP.

Proof:  Let MDP be M = (S,A,P,R,\gamma) be a finite MDP and \Pi be the space of all possible policies \pi. For a given state s, \exists \pi_s \in \Pi such that V^{\pi_s}(s) \ge V^{\pi}(s) \forall \pi \in \Pi. We call such \pi_s as the leader policy for state s. We now define a policy as \pi(s,a) = \pi_s(s,a) i.e. for a given state $s$ we use the leader policy for that state to make our move.

This seems most natural and probably the only way to define \pi since, for a given state s the maximum reward is attained by policy \pi_s and hence if we do not follow the policy \pi_s for the first step itself then we may not be guaranteed to achieve the maximum reward in the long run. All that remains to show is that \pi(s,a) is optimal. For this we first define \delta(s) = V^\pi(s) - V{^\pi_s}(s). Showing that \pi is optimal corresponds to showing that \delta(s) \ge 0 \forall s\in S since if we beat leader policy for every state then we beat all policies for all states.

From Bellman self-consistency equation we have:

V^{\pi}(s) = \sum_{a \in A} \pi(s,a) \sum_{s' \in S} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^\pi(s') \}

V^{\pi_s}(s) = \sum_{a \in A} \pi_s(s,a) \sum_{s' \in S} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^{\pi_s}(s') \}

by definition \pi(s,a) = \pi_s(s,a). Subtracting (3) and (4) we get:

\delta(s) = V^{\pi}(s) - V^{\pi_s}(s) = \gamma \sum_{a \in A} \pi(s,a) \sum_{s' \in S} P^a_{s,s'} \{V^\pi(s') - V^{\pi_s}(s')\}

We have V^\pi(s') - V^{\pi_s}(s') \ge V^\pi(s') - V^{\pi_{s'}}(s') = \delta(s'), which together with equation 5 gives us:

\delta(s)  \ge \gamma \sum_{a \in A} \pi(s,a) \sum_{s' \in S} P^a_{s,s'} \delta(s') = \gamma \sum_{s'\in S} \left(\sum_{a \in A} \pi(s,a) P^a_{s,s'} \right) \delta(s')

the summation on RHS represents convex combination of \delta and therefore we can use convex-combination(x) \ge x_{min} to give us:

\delta(s) \ge \gamma \,\, \inf \{\delta(s') \mid s' \in S\}

as inequality 7 holds for all s \in S, therefore this gives us:

\inf \{\delta(s') \mid s' \in S\} \ge \gamma \,\, \inf \{\delta(s') \mid s' \in S\}

as \gamma \in [0,1), therefore inequality 8 can only hold if \inf \{\delta(s') \mid s' \in S\} \ge 0 \Rightarrow \delta(s) \ge 0 \forall s \in S. Hence proved.

3.2 Bellman optimality condition

Note that our proof for existence of optimal policy is almost impractical. It requires us to find the optimal policy for every state, which can be highly inefficient if we have a large state space or action space. Here we will derive an equation that will allow us to later compute the optimal policy more efficiently.

We first denote the optimal policy by \pi^* and define V^*(s) = V^{\pi^*}(s) and Q^*(s) = Q^{\pi^*}(s).

Theorem* (Bellman Optimality Condition):  V^*(s) and Q^*(s) satisfy the following equations:

V^*(s) = \max_a Q^*(s,a)

V^*(s) = \max_a \sum_s' P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^*(s') \}

Q^*(s,a) = \sum_s' P^a_{s,s'} \{ R^a_{s,s'} + \gamma \max_{a'} Q^*(s',a')\}

Proof: Let \pi^*(s,a) be the optimal policy. Then we have, V^*(s) = \sum_a \pi^*(s,a) Q^*(s,a) \le \max_a Q^*(s,a). To prove the first equation, all we have to show is that we can never have V^*(s) < \max_a Q^*(s,a). We will prove this via contradiction.

Let \exists s' \,\,s.t.\,\, V^*(s') < \max_a Q^*(s',a). We will use this fact to show that \pi^* is not optimal by deriving a better policy. We define this new policy as: \pi'(s) = \arg\max_a Q^*(s,a). Note that this new policy is a deterministic policy. Then from self-consistency equations we have:

V^*(s) = \sum_a' \pi^*(s,a) Q^*(s,a)

V^{\pi'}(s) = Q^{\pi'}(s, \pi'(s))

As before, we define \delta(s) =  V^{\pi'}(s) -V^*(s) =Q^{\pi'}(s, \pi'(s)) - \sum_a' \pi^*(s,a) Q^*(s,a). We then have:

\delta(s) =  V^{\pi'}(s) -V^*(s) =Q^{\pi'}(s, \pi'(s)) - \sum_a' \pi^*(s,a) Q^*(s,a)

Using the definition of \pi', as the policy that picks an action that maximizes Q^* values, we get:

\delta(s) \ge Q^{\pi'}(s, \pi'(s)) - Q^{*}(s, \pi'(s)).

Note that for the state s', the above inequality is strict. This follows from V^*(s') < \max_a Q^*(s,a') = Q^*(s, \pi(s')).

Let a = \pi'(s) then we have:

Q^{\pi'}(s, a) - Q^{*}(s, a) = \sum_{s'} P^a_{s,s'} (R^a_{s,s'} + \gamma V^{\pi'}(s')) - \sum_s' P^a_{s,s'} (R^a_{s,s'} + \gamma V^{*}(s'))

Q^{\pi'}(s, a) - Q^{*}(s, a) = \gamma \sum_{s'}P^a_{s,s'}  (V^{\pi'}(s') -V^{*}(s')) = \gamma \sum_{s'} P^a_{s,s'} \delta(s').

Giving us, \delta(s) \ge\gamma \sum_{s'} P^a_{s,s'} \delta(s').

using argument similar to earlier, we show that \inf_s \delta(s) \ge \gamma \inf_{s'} \delta(s') and as \gamma\in [0,1), we have \inf_s \delta(s) \ge  \Rightarrow \forall s, \delta(s) \ge 0.

But wait, this is not what we wanted to show. We want to show that \pi' > \pi^* or in other words, we want to show that \exists s \,\,s.t.\,\, \delta(s) > 0. But this follows easily since for the state s', the above inequality is strict. Hence, \pi^* cannot be the optimal policy, which is a contradiction.

The next 2 equations in the proof, easily follow from the first equation (try it yourself).

These three equations are called the Bellman optimality conditions.

Interesting observations from the above proof:

  1. For the optimal policy \pi^* satisfying V^{\pi^*}(s) = \max_a Q^{\pi^*}(s,a), we can still define a new policy \pi'(s)  = \arg\max_a Q(s,a). We can then follow the above proof to derive that the new policy is as good as the optimal policy, but not better. But this new policy \pi' is  deterministic. This shows that every MDP has a deterministic optimal policy. So you do not have to worry about being stochastic if you want to be optimal. This is a remarkably interesting fact about MDPs.
  2. If \pi does not satisfy the Bellman optimality condition, then we can follow the proof to derive a new policy \pi' given by \pi'(s) = \arg\max_a Q^{\pi}(s,a) such that \pi' > \pi. That is we can improve upon our current policy. This is one way of doing policy improvement. Given a policy, find its Q-values and then define a better policy as discussed. We can keep doing this, either forever or till we finally find a policy that satisfies the Bellman optimality condition.
  3. Lastly, notice how the reward terms simply cancel out. It beautifully shows that the proof does not depend upon how the reward function looks like.


3.3 Bellman backup operator

To solve an MDP means to find the optimal policy and the optimal value functions. Unfortunately, we cannot easily solve the Bellman optimality condition as they are non-linear due to max sitting in the equation. We now provide a strategy for computing V^*, Q^* by iteratively solving the optimality condition. Given V^* we can compute Q^* using the equations above. Therefore, in the remainder of this section we will be solely interested in computing V^*.

Let V: S \rightarrow R be the state-value function for an MDP then T: V \rightarrow V is the Bellman backup operator given by:

TV(s) = max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V(s') \}

We then prove that T satisfies the contraction property namely:

Theorem: \| TV_1 - TV_2 \|_{\infty} \le \gamma \| V_1 - V_2 \|_{\infty} where \| . \|_{\infty} is the max norm given by \| x \| = \max\{ |x_1|, \cdots |x_n| \}

Proof: \| TV_1 - TV_2 \|_{\infty} = \| \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V_1(s') \} - \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V_2(s') \} \|
we now use the fact that |\max_x f(x) - \max_x g(x)| \le \max_x |f(x) - g(x)|.

|\max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V_1(s') \} - \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V_2(s') \}|
 \le \max_a | \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V_1(s') \} -  \sum_{s'}P^a_{s,s'} \{R^a_{s,s'} + \gamma V_2(s') \}|
= \max_a \gamma |\sum_{s'}P^a_{s,s'} \{V_1(s') -V_2(s')\}|

since this holds for every s, we have

\| TV_1 - TV_2 \|_{\infty} \le \|\max_a \gamma |\sum_{s'}P^a_{s,s'} \{V_1(s') -V_2(s')\}|\|_{\infty}
\le \|\max_a \gamma \sum_{s'}P^a_{s,s'}|V_1(s') - V_2(s')|\|_{\infty}
\le \|\max_a \gamma \max_{s'}|V_1(s') -V_2(s')|\|_{\infty}
= \|\gamma \max_{s'}|V_1(s') -V_2(s')|\|_{\infty}
= \gamma \max_{s'}|V_1(s') -V_2(s')| = \gamma \|V_1 -V_2\|_{\infty}

Hence proved.

We now prove that T has a unique fixed point. A fixed point x of an operator satisfies Tx  = x. This also means T^n x = T^{n-1}( Tx) = T^{n-1}x \cdots = x.

Lemma: Bellman backup operator T has a unique fixed point.
Proof: Let V_1, V_2 be two fixed point of T. Whic means TV_1 = V_1 and TV_2 = V_2. Then using the contraction property of T we get:

0 \le \| TV_1 - TV_2\|_{\infty} \le \gamma \| V_1 - V_2\|_{\infty}
further, \| TV_1 - TV_2\|_{\infty} = \| V_1 - V_2\|_{\infty}.

Using these two equations and \gamma \in [0, 1) we get,

\| V_1 - V_2\|_{\infty} \le \gamma \| V_1 - V_2\|_{\infty} \Rightarrow \| V_1 - V_2\|_{\infty} \le 0. This gives us V_1 = V_2.

Corrollary: Optimal state-value function is a fixed point of its Bellman backup operator. By above lemma, it is also the fixed point.

This means that solving for the fixed point of this operator gives us our optimal value functions and optimal policy. We finally show that repeated application of Bellman backup operator to any state-value function V eventually converges to the optimal value function.

Theorem: \lim_{n\rightarrow \infty} T^nV = V^*  \forall V
Proof: Using the contraction property.
\| T^n V - V^* \|_{\infty} = \| T^n V - TV^*\|_{\infty} \le \gamma \| T^{n-1}V-V^*\|_{\infty}

Applying the contraction property n-1 times we get:
\| T^n V - V^*\|_{\infty} \le \gamma^n \| V - V^* \|_{\infty}

taking limits on both sides and using \gamma \in [0,1) we get,
\lim_{n \rightarrow \infty} \| T^n V - V^*\|_{\infty} \le \lim_{n\rightarrow} \gamma^n \| V - V^* \|_{\infty} = 0.

Here, we assume that value functions are bounded which follows from bounded reward values.

4 Conclusion

Applying backup operator to compute optimal policy is impractical in most cases as number of states become very large. In next blog post on this topic, we will discuss algorithms for efficient learning of optimal policies such as Q-learning, Monte Carlo methods etc. We will also look at interesting reinforcement learning problems such as deep reinforcement learning (reinforcement learning with state represented by deep neural network features).

5 References

  1. Reinforcement learning: An Introduction, R. Sutton and A. Bato.
  2. Scribes from Peter Abeel’s course for section 3