A Markov decision process (MDP) is a framework for modeling decision making, widely used in machine learning, robotics, economics and other areas. An MDP models a decision maker (henceforth agent) in a world, where the agent receives a reward for performing actions, and the long-term goal is to maximize the total reward. The concept of an MDP is intimately tied to Reinforcement Learning (RL), a machine learning framework for learning policies for decision making.

The general idea of the MDP has been around since the 1950s, whereas RL has been around since at least 1969 (Andreae, 1969). Long considered (often unjustifiably) suitable only for toy problems, RL has seen a recent renewal for reasons I will explain later. There is a good amount of literature on MDPs and RL, with the *Reinforcement Learning* book by Sutton and Barto being a particularly good introduction to the subject. However, I found that these resources either omit some essential mathematical treatment or are written for mathematicians and beyond the reach of a general audience.

In this post, I will cover a simple mathematical treatment of MDPs with proofs of fundamental results. It is best used as a complement to Sutton and Barto. In places I could not find proofs of certain theorems, and in such cases I have supplied my own (marked with *). Please read these proofs carefully and let me know if you find any errors.

## 1. Introduction

**Definition:** An MDP is defined by the tuple $(S, A, P, R, \gamma)$, where $S$ is a set of states; $A$ is a set of actions that an agent can take; $P(s' \mid s, a)$, defined for all $s, s' \in S$ and $a \in A$, denotes the probability of transitioning to state $s'$ from state $s$ on taking action $a$ (we will denote such a transition as $s \xrightarrow{a} s'$); $R(s, a, s')$ is a real number denoting the reward received by the agent in the transition $s \xrightarrow{a} s'$; and $\gamma \in [0, 1]$ is a real number that denotes how much to discount future rewards (its purpose will become clear shortly).

Thus, an MDP defines a problem setup in which an agent can take actions and in the process receive rewards as well as change the state. The setup is Markovian since the next state and reward depend only upon the current state and the action taken.
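To make the definition concrete, here is a minimal sketch of a finite MDP as plain Python data structures. The two-state, two-action MDP and all of its numbers are invented purely for illustration.

```python
# A tiny two-state, two-action MDP, written directly as the tuple
# (S, A, P, R, gamma) from the definition above. All numbers are made up.
S = [0, 1]        # states
A = [0, 1]        # actions
gamma = 0.9       # discount factor

# P[(s, a)] maps a next state s' to the transition probability P(s' | s, a)
P = {
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 0.6, 1: 0.4},
    (1, 1): {0: 0.3, 1: 0.7},
}

def R(s, a, s2):
    """Reward for the transition s -a-> s': here, 1 for landing in state 1."""
    return float(s2)

# sanity check: every row of P must be a probability distribution
for (s, a), row in P.items():
    assert all(p >= 0 for p in row.values())
    assert abs(sum(row.values()) - 1.0) < 1e-12
```

The same toy MDP is reused in the later sketches in this post.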

The goal of a rational agent should be to achieve the maximum reward in the long term. The agent's behavior is expressed as a policy, which gives a distribution over actions to take when presented with a state. More formally,

**Definition:** A policy $\pi$ defines a probability distribution over actions given a state: $\pi(a \mid s)$ denotes the probability of taking action $a$ given state $s$. The policy is *deterministic* if $\pi(a \mid s) \in \{0, 1\}$ for all $s, a$.

The agent then plays a game where it lands in a state $s_0$, and it continually takes actions to maximize the total reward. More precisely, a trace of the agent's game will look like a sequence of states, actions and rewards $s_0, a_0, r_1, s_1, a_1, r_2, s_2, \ldots$, where $a_t$ is the action taken when presented with state $s_t$, $r_{t+1}$ is the reward given in the process, and $s_{t+1}$ is the new state.
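Such a trace can be generated by repeatedly sampling an action from the policy and a next state from $P$. The sketch below rolls out a short trace of a hypothetical two-state MDP under a uniformly random policy (the MDP and its numbers are assumptions for illustration only).

```python
import random

random.seed(0)
S, A = [0, 1], [0, 1]
# hypothetical transition probabilities P(s' | s, a), for illustration only
P = {
    (0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 0.6, 1: 0.4}, (1, 1): {0: 0.3, 1: 0.7},
}

def R(s, a, s2):
    return float(s2)   # reward 1 for landing in state 1

def sample_next(s, a):
    """Sample s' according to P(. | s, a)."""
    u, acc = random.random(), 0.0
    for s2, p in P[(s, a)].items():
        acc += p
        if u <= acc:
            return s2
    return s2          # guard against floating-point round-off

# roll out the sequence s0, a0, r1, s1, a1, r2, ... for a few steps
s, trace = 0, []
for t in range(5):
    a = random.choice(A)               # uniformly random policy pi(a|s) = 0.5
    s2 = sample_next(s, a)
    trace.append((s, a, R(s, a, s2)))  # (s_t, a_t, r_{t+1})
    s = s2
```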

What we are interested in is the expected discounted reward given the state, which we denote by the *state value function* $V^\pi(s)$:

$$V^\pi(s) = \mathbb{E}_\pi\left[ \sum_{t=0}^\infty \gamma^t r_{t+1} \,\middle|\, s_0 = s \right]$$

A similar concept is given by the *action value function* $Q^\pi(s, a)$:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ \sum_{t=0}^\infty \gamma^t r_{t+1} \,\middle|\, s_0 = s, a_0 = a \right]$$

**Episodic and Continuous Tasks:** If the task defined by the MDP always terminates in some bounded number of steps, irrespective of the starting state and the actions taken by the agent, the task is called *episodic* (would a game of chess be episodic? How about a racing game?). A single play of an episodic task is called an episode. If a task is not episodic then it can potentially run forever; such tasks are called *continuous* tasks.

**Discounting:** As seen in the equations above, the reward at time step $t+1$ is discounted by $\gamma^t$. There are several reasons why this is done, but two main ones are: first, there is evidence that humans behave in a similar manner; and second, we want the value function to be a well-defined quantity. When the reward is bounded by some finite constant $r_{max}$ and $\gamma < 1$, the value function is always defined. The maximum achievable reward is bounded by $\frac{r_{max}}{1-\gamma}$, as shown below:

$$\sum_{t=0}^\infty \gamma^t r_{t+1} \le \sum_{t=0}^\infty \gamma^t r_{max} = \frac{r_{max}}{1-\gamma}$$

Note that the maximum achievable reward is $\frac{r_{max}}{1-\gamma}$. If the task were episodic and terminated in $T$ steps, then the maximum achievable reward in the undiscounted scenario would have been $T\, r_{max}$. Comparing this with what we have for the continuous discounted task, we see that $\frac{1}{1-\gamma}$ plays the role of $T$, and for this reason it is sometimes called the *effective time horizon* of the task. $\gamma = 0.99$ means that the task is effectively being played for $100$ steps.
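A quick numeric check of the bound and of the effective-horizon reading of $\gamma$:

```python
gamma, r_max = 0.99, 1.0

# the discounted-sum bound r_max / (1 - gamma); gamma = 0.99 gives 100,
# matching the "effective time horizon" of 100 steps mentioned above
bound = r_max / (1 - gamma)
assert abs(bound - 100.0) < 1e-9

# even a very long discounted sum of maximal rewards stays below the bound
discounted = sum(gamma ** t * r_max for t in range(500))
assert discounted < bound
```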

In our proofs below, we will focus only on continuous tasks with $\gamma < 1$.

## 2. Bellman Self-consistency equation

**2.1 Self-consistency equation**

By a little work, we can express the value functions in recursive form, giving us what is called the self-consistency equation. Let's see how it is done:

$$V^\pi(s) = \mathbb{E}_\pi\left[ \sum_{t=0}^\infty \gamma^t r_{t+1} \,\middle|\, s_0 = s \right] = \mathbb{E}_\pi\left[ r_1 + \gamma \sum_{t=0}^\infty \gamma^t r_{t+2} \,\middle|\, s_0 = s \right]$$

$r_1$ only depends upon $(s_0, a_0, s_1)$, therefore we get

$$V^\pi(s) = \mathbb{E}_{a_0, s_1}\left[ R(s, a_0, s_1) \right] + \gamma\, \mathbb{E}_{a_0, s_1}\left[ \mathbb{E}_\pi\left[ \sum_{t=0}^\infty \gamma^t r_{t+2} \,\middle|\, s_1 \right] \right]$$

the inner expectation in the second term is the same as $V^\pi(s_1)$, therefore we get

$$V^\pi(s) = \mathbb{E}_{a_0, s_1}\left[ R(s, a_0, s_1) + \gamma V^\pi(s_1) \right]$$

Expanding the expectation, using the fact that given $(s, a, s')$ the reward is fixed, we get:

$$V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^\pi(s') \right]$$

This is the Bellman self-consistency equation for the value function. Any state-value function must satisfy the self-consistency equation.
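We can also verify the self-consistency equation numerically. The sketch below evaluates $V^\pi$ for a uniformly random policy on a hypothetical two-state MDP (numbers invented for illustration) by iterating the recursion, then checks that the resulting values satisfy the equation.

```python
S, A, gamma = [0, 1], [0, 1], 0.9
# hypothetical MDP, for illustration only
P = {
    (0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 0.6, 1: 0.4}, (1, 1): {0: 0.3, 1: 0.7},
}
def R(s, a, s2):
    return float(s2)

pi = {s: {a: 0.5 for a in A} for s in S}   # uniformly random policy

def expectation(s, V):
    """RHS of the self-consistency equation at state s."""
    return sum(pi[s][a] * sum(p * (R(s, a, s2) + gamma * V[s2])
                              for s2, p in P[(s, a)].items())
               for a in A)

# iterate the recursion to convergence (iterative policy evaluation)
V = {s: 0.0 for s in S}
for _ in range(1000):
    V = {s: expectation(s, V) for s in S}

# V^pi now satisfies the Bellman self-consistency equation
for s in S:
    assert abs(V[s] - expectation(s, V)) < 1e-9
```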

Similarly, $Q^\pi$ satisfies the self-consistency equation below:

$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a') \right]$$

We also have a relation between $V^\pi$ and $Q^\pi$, given by:

$$V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$$

It is not clear whether the self-consistency equations admit a unique solution. We show below that in fact they do for finite MDPs.

**2.2 Self-consistency equation has a unique solution**

**Theorem*:** The self-consistency equation for the state-value function of a finite MDP has a unique solution.

**Proof:** We have

$$V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^\pi(s') \right]$$

Define $P^\pi(s' \mid s) = \sum_a \pi(a \mid s)\, P(s' \mid s, a)$ and $R^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\, R(s, a, s')$, and we get

$$V^\pi(s) = R^\pi(s) + \gamma \sum_{s'} P^\pi(s' \mid s)\, V^\pi(s')$$

or in matrix form, with $v, r \in \mathbb{R}^{|S|}$ given by $v_s = V^\pi(s)$ and $r_s = R^\pi(s)$,

$$(I - \gamma P^\pi)\, v = r$$

We show that this linear equation has a unique solution by showing that the matrix $I - \gamma P^\pi$ has a null space of dimension 0.

Let $(I - \gamma P^\pi)\, x = 0$, or equivalently $x = \gamma P^\pi x$.

Here we note the following property of the matrix $P^\pi$: each of its rows is a probability distribution, i.e.

$$P^\pi(s' \mid s) \ge 0 \quad \text{and} \quad \sum_{s'} P^\pi(s' \mid s) = 1.$$

Defining $x_{max} = \max_s x_s$ and $x_{min} = \min_s x_s$, and using the above property of $P^\pi$, we get

$$x_s = \gamma \sum_{s'} P^\pi(s' \mid s)\, x_{s'} \quad \Rightarrow \quad \gamma\, x_{min} \le x_s \le \gamma\, x_{max} \text{ for all } s.$$

This means $x_{max} \le \gamma\, x_{max}$ and $x_{min} \ge \gamma\, x_{min}$. This implies:

$$(1 - \gamma)\, x_{max} \le 0 \quad \text{i.e.} \quad x_{max} \le 0$$

and

$$(1 - \gamma)\, x_{min} \ge 0 \quad \text{i.e.} \quad x_{min} \ge 0.$$

Using $x_{min} \le x_{max}$ and $\gamma < 1$, we get $0 \le x_{min} \le x_{max} \le 0$. This implies that $x = 0$, and therefore the self-consistency equation for the state-value function has a unique solution.
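The proof is constructive: for a finite MDP, $V^\pi$ is the solution of the linear system $(I - \gamma P^\pi)v = r$. Here is a sketch for a hypothetical two-state MDP (numbers invented for illustration), solving the resulting 2x2 system by hand:

```python
S, A, gamma = [0, 1], [0, 1], 0.9
# hypothetical MDP, for illustration only
P = {
    (0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 0.6, 1: 0.4}, (1, 1): {0: 0.3, 1: 0.7},
}
def R(s, a, s2):
    return float(s2)

pi = {s: {a: 0.5 for a in A} for s in S}   # uniformly random policy

# P^pi(s'|s) and R^pi(s), exactly as defined in the proof
P_pi = {s: {s2: sum(pi[s][a] * P[(s, a)].get(s2, 0.0) for a in A) for s2 in S}
        for s in S}
R_pi = {s: sum(pi[s][a] * sum(p * R(s, a, s2) for s2, p in P[(s, a)].items())
               for a in A)
        for s in S}

# solve (I - gamma P^pi) v = r; with two states this is a 2x2 system
m00, m01 = 1 - gamma * P_pi[0][0], -gamma * P_pi[0][1]
m10, m11 = -gamma * P_pi[1][0], 1 - gamma * P_pi[1][1]
det = m00 * m11 - m01 * m10        # nonzero, as the theorem guarantees
v0 = (m11 * R_pi[0] - m01 * R_pi[1]) / det
v1 = (m00 * R_pi[1] - m10 * R_pi[0]) / det

# the solution satisfies the self-consistency equation
assert abs(v0 - (R_pi[0] + gamma * (P_pi[0][0] * v0 + P_pi[0][1] * v1))) < 1e-9
assert abs(v1 - (R_pi[1] + gamma * (P_pi[1][0] * v0 + P_pi[1][1] * v1))) < 1e-9
```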

## 3. Optimal Policy

A natural question to ask is how to find a policy that gives a high value of the state value function. For a given state $s$, we can say that policy $\pi$ is as good as $\pi'$ if $V^\pi(s) \ge V^{\pi'}(s)$. The policy is better if the inequality is strict. It is easy to see that for a finite MDP and a fixed state, there exists a policy that is as good as all other policies. But how do we define an ordering over policies irrespective of the state?

One way would be: $\pi$ is as good as $\pi'$ if $V^\pi(s) \ge V^{\pi'}(s)$ for all $s \in S$. That is, the policy must equal (or exceed) the other policy for every state. But this leads to the question of optimality: does an optimal policy, one that is as good as every other policy, exist? After all, we could end up in a situation where for two states $s_1, s_2$ and policies $\pi_1, \pi_2$ we have $V^{\pi_1}(s_1) > V^{\pi_2}(s_1)$ and $V^{\pi_2}(s_2) > V^{\pi_1}(s_2)$. We now prove that such an optimal policy does exist for finite MDPs.

**3.1 Existence of optimal policy**

**Theorem*:** There exists an optimal policy for every finite MDP.

**Proof:** Let $M$ be a finite MDP and $\Pi$ be the space of all possible policies. For a given state $s$, let $\pi_s \in \Pi$ be such that $V^{\pi_s}(s) = \max_{\pi \in \Pi} V^\pi(s)$. We call such a $\pi_s$ the leader policy for state $s$. We now define a policy $\pi^*$ as $\pi^*(a \mid s) = \pi_s(a \mid s)$, i.e. for a given state $s$ we use the leader policy for that state to make our move.

This seems the most natural, and probably the only, way to define $\pi^*$: for a given state the maximum value is attained by the policy $\pi_s$, and hence if we do not follow $\pi_s$ for the first step itself then we may not be guaranteed to achieve the maximum reward in the long run. All that remains is to show that $\pi^*$ is optimal. For this we first define $\Delta = \max_s \left[ V^{\pi_s}(s) - V^{\pi^*}(s) \right]$. Showing that $\pi^*$ is optimal corresponds to showing that $\Delta = 0$, since if we match the leader policy for every state then we are as good as all policies for all states.

From the Bellman self-consistency equation we have:

$$V^{\pi_s}(s) = \sum_a \pi_s(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{\pi_s}(s') \right] \tag{3}$$

$$V^{\pi^*}(s) = \sum_a \pi^*(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{\pi^*}(s') \right] \tag{4}$$

By definition, $\pi^*(a \mid s) = \pi_s(a \mid s)$. Subtracting (4) from (3), the reward terms cancel and we get:

$$V^{\pi_s}(s) - V^{\pi^*}(s) = \gamma \sum_a \pi_s(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ V^{\pi_s}(s') - V^{\pi^*}(s') \right] \tag{5}$$

We have $V^{\pi_s}(s') \le V^{\pi_{s'}}(s')$, which together with equation (5) gives us:

$$V^{\pi_s}(s) - V^{\pi^*}(s) \le \gamma \sum_a \pi_s(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ V^{\pi_{s'}}(s') - V^{\pi^*}(s') \right] \tag{6}$$

The summation on the RHS represents a convex combination of the terms $V^{\pi_{s'}}(s') - V^{\pi^*}(s') \le \Delta$, and therefore the *convex-combination* bound gives us:

$$V^{\pi_s}(s) - V^{\pi^*}(s) \le \gamma \Delta \tag{7}$$

As inequality (7) holds for all $s$, taking the maximum over $s$ gives us:

$$\Delta \le \gamma \Delta \tag{8}$$

As $\gamma < 1$ and $\Delta \ge 0$ by the definition of the leader policies, inequality (8) can only hold if $\Delta = 0$. Hence proved.

**3.2 Bellman optimality condition**

Note that our proof of the existence of an optimal policy is rather impractical: it requires us to find the leader policy for every state, which can be highly inefficient if we have a large state space or action space. Here we will derive equations that will later allow us to compute the optimal policy more efficiently.

We first denote the optimal policy by $\pi^*$ and define $V^*(s) = V^{\pi^*}(s)$ and $Q^*(s, a) = Q^{\pi^*}(s, a)$.

**Theorem* (Bellman Optimality Condition):** $V^*$ and $Q^*$ satisfy the following equations:

$$V^*(s) = \max_a Q^*(s, a)$$

$$Q^*(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^*(s') \right]$$

$$V^*(s) = \max_a \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^*(s') \right]$$

**Proof:** Let $\pi^*$ be the optimal policy. Then we have $V^*(s) = \sum_a \pi^*(a \mid s)\, Q^*(s, a) \le \max_a Q^*(s, a)$. To prove the first equation, all we have to show is that we can never have $V^*(s) < \max_a Q^*(s, a)$. We will prove this via contradiction.

Let $V^*(\bar{s}) < \max_a Q^*(\bar{s}, a)$ for some state $\bar{s}$. We will use this fact to show that $\pi^*$ is not optimal by deriving a better policy. We define this new policy as $\pi'(s) = \arg\max_a Q^*(s, a)$. Note that this new policy is a deterministic policy. Then from the self-consistency equations we have:

$$V^{\pi'}(s) = \sum_{s'} P(s' \mid s, \pi'(s)) \left[ R(s, \pi'(s), s') + \gamma V^{\pi'}(s') \right]$$

$$V^*(s) = \sum_a \pi^*(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^*(s') \right]$$

As before, we define $\Delta = \max_s \left[ V^*(s) - V^{\pi'}(s) \right]$. We then have:

$$V^*(s) = \sum_a \pi^*(a \mid s)\, Q^*(s, a)$$

Using the definition of $\pi'$, as the policy that picks an action that maximizes the $Q^*$-values, we get:

$$V^*(s) = \sum_a \pi^*(a \mid s)\, Q^*(s, a) \le \max_a Q^*(s, a) = Q^*(s, \pi'(s)).$$

Note that for the state $\bar{s}$, the **above inequality is strict**. This follows from $V^*(\bar{s}) < \max_a Q^*(\bar{s}, a) = Q^*(\bar{s}, \pi'(\bar{s}))$.

Expanding $Q^*(s, \pi'(s))$, we then have:

$$V^*(s) \le Q^*(s, \pi'(s)) = \sum_{s'} P(s' \mid s, \pi'(s)) \left[ R(s, \pi'(s), s') + \gamma V^*(s') \right].$$

Subtracting the self-consistency equation for $V^{\pi'}(s)$, the reward terms cancel, giving us:

$$V^*(s) - V^{\pi'}(s) \le \gamma \sum_{s'} P(s' \mid s, \pi'(s)) \left[ V^*(s') - V^{\pi'}(s') \right].$$

Using an argument similar to the earlier one (the RHS is a convex combination bounded by $\gamma \Delta$), we show that $\Delta \le \gamma \Delta$, and as $\gamma < 1$, we have $\Delta \le 0$, i.e. $V^{\pi'}(s) \ge V^*(s)$ for all $s$.

But wait, this is not what we wanted to show. We want to show that $\pi'$ is strictly better, or in other words, that $V^{\pi'}(\bar{s}) > V^*(\bar{s})$. But this follows easily, since for the state $\bar{s}$ the above inequality is strict. Hence $\pi^*$ cannot be the optimal policy, which is a contradiction.

The remaining two equations in the theorem follow easily from the first equation (try it yourself).

These three equations are called the Bellman optimality conditions.

**Interesting observations from the above proof:**

- For an optimal policy $\pi^*$ satisfying the Bellman optimality condition, we can still define a new policy $\pi'(s) = \arg\max_a Q^*(s, a)$. We can then follow the above proof to derive that the new policy is as good as the optimal policy, but not better. But this new policy is deterministic. This shows that every MDP has a deterministic optimal policy. So you do not have to worry about being stochastic if you want to be optimal. This is a remarkably interesting fact about MDPs.
- If a policy $\pi$ does not satisfy the Bellman optimality condition, then we can follow the proof to derive a new policy given by $\pi'(s) = \arg\max_a Q^\pi(s, a)$ such that $V^{\pi'}(s) \ge V^\pi(s)$ for all $s$. That is, we can improve upon our current policy. This is one way of doing policy improvement: given a policy, find its Q-values and then define a better policy as discussed. We can keep doing this, either forever or until we finally find a policy that satisfies the Bellman optimality condition.
- Lastly, notice how the reward terms simply cancel out in the proof. It beautifully shows that the proof does not depend upon what the reward function looks like.
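The second observation above is exactly policy iteration. A minimal sketch on a hypothetical two-state MDP (numbers invented for illustration): evaluate the current deterministic policy, act greedily with respect to its Q-values, and repeat until the policy stops changing.

```python
S, A, gamma = [0, 1], [0, 1], 0.9
# hypothetical MDP, for illustration only
P = {
    (0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 0.6, 1: 0.4}, (1, 1): {0: 0.3, 1: 0.7},
}
def R(s, a, s2):
    return float(s2)

def evaluate(policy, iters=1000):
    """Iterative policy evaluation for a deterministic policy {s: a}."""
    V = {s: 0.0 for s in S}
    for _ in range(iters):
        V = {s: sum(p * (R(s, policy[s], s2) + gamma * V[s2])
                    for s2, p in P[(s, policy[s])].items())
             for s in S}
    return V

policy = {0: 0, 1: 0}     # start from an arbitrary deterministic policy
while True:
    V = evaluate(policy)
    # Q-values of the current policy
    Q = {(s, a): sum(p * (R(s, a, s2) + gamma * V[s2])
                     for s2, p in P[(s, a)].items())
         for s in S for a in A}
    improved = {s: max(A, key=lambda a: Q[(s, a)]) for s in S}
    if improved == policy:   # greedy policy unchanged: no further improvement
        break
    policy = improved
```

Since there are finitely many deterministic policies and each step strictly improves some state's value, the loop terminates.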

**3.3 Bellman backup operator**

To solve an MDP means to find the optimal policy and the optimal value functions. Unfortunately, we cannot easily solve the Bellman optimality conditions directly, as they are non-linear due to the $\max$ sitting in the equation. We now provide a strategy for computing $V^*$ by iteratively solving the optimality condition. Given $V^*$, we can compute $Q^*$ and $\pi^*$ using the equations above. Therefore, in the remainder of this section we will be solely interested in computing $V^*$.

Let $V$ be a state-value function for an MDP. The Bellman backup operator $B$ is given by:

$$(BV)(s) = \max_a \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V(s') \right]$$

We then prove that $B$ satisfies the contraction property, namely:

**Theorem:** $\|BV_1 - BV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$, where $\|\cdot\|_\infty$ is the max norm given by $\|V\|_\infty = \max_s |V(s)|$.

**Proof:** For any state $s$,

$$|(BV_1)(s) - (BV_2)(s)| = \left| \max_a \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V_1(s') \right] - \max_a \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V_2(s') \right] \right|$$

We now use the fact that $|\max_a f(a) - \max_a g(a)| \le \max_a |f(a) - g(a)|$:

$$|(BV_1)(s) - (BV_2)(s)| \le \max_a \gamma \sum_{s'} P(s' \mid s, a)\, |V_1(s') - V_2(s')| \le \gamma \|V_1 - V_2\|_\infty$$

Since this holds for every $s$, we have

$$\|BV_1 - BV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$$

Hence proved.
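A numeric illustration of the contraction property, on a hypothetical two-state MDP (numbers invented for illustration) and two arbitrary value functions:

```python
S, A, gamma = [0, 1], [0, 1], 0.9
# hypothetical MDP, for illustration only
P = {
    (0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 0.6, 1: 0.4}, (1, 1): {0: 0.3, 1: 0.7},
}
def R(s, a, s2):
    return float(s2)

def backup(V):
    """The Bellman backup (BV)(s) = max_a sum_s' P(s'|s,a)[R + gamma V(s')]."""
    return {s: max(sum(p * (R(s, a, s2) + gamma * V[s2])
                       for s2, p in P[(s, a)].items())
                   for a in A)
            for s in S}

def max_norm_dist(V1, V2):
    return max(abs(V1[s] - V2[s]) for s in S)

# any two value functions get pulled together by at least a factor of gamma
V1 = {0: 3.0, 1: -1.0}
V2 = {0: 0.5, 1: 4.0}
lhs = max_norm_dist(backup(V1), backup(V2))
rhs = gamma * max_norm_dist(V1, V2)
assert lhs <= rhs + 1e-12
```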

We now prove that $B$ has a unique fixed point. A fixed point of the operator $B$ is a value function $V$ satisfying $BV = V$. This also means $B^n V = V$ for all $n \ge 1$.

**Lemma:** The Bellman backup operator has a unique fixed point.

**Proof:** Let $V_1, V_2$ be two fixed points of $B$, which means $BV_1 = V_1$ and $BV_2 = V_2$. Then using the contraction property of $B$ we get:

$$\|BV_1 - BV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$$

Further, since $V_1, V_2$ are fixed points, $\|BV_1 - BV_2\|_\infty = \|V_1 - V_2\|_\infty$.

Using these two equations and $\gamma < 1$, we get

$$\|V_1 - V_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty \quad \Rightarrow \quad \|V_1 - V_2\|_\infty = 0.$$

This gives us $V_1 = V_2$.

**Corollary:** The optimal state-value function $V^*$ is a fixed point of the Bellman backup operator, by the Bellman optimality condition. By the above lemma, it is also the *only* fixed point.

This means that solving for the fixed point of this operator gives us our optimal value function, and from it the optimal policy. We finally show that repeated application of the Bellman backup operator to any state-value function converges to the optimal value function.

**Theorem:** For any bounded state-value function $V$, $\lim_{n \to \infty} B^n V = V^*$.

**Proof:** Using the contraction property and the fact that $V^*$ is a fixed point ($BV^* = V^*$):

$$\|B^n V - V^*\|_\infty = \|B^n V - B^n V^*\|_\infty$$

Applying the contraction property $n$ times we get:

$$\|B^n V - B^n V^*\|_\infty \le \gamma^n \|V - V^*\|_\infty$$

Taking limits on both sides and using $\gamma < 1$, we get

$$\lim_{n \to \infty} \|B^n V - V^*\|_\infty = 0.$$

Here, we assume that the value functions are bounded, which follows from bounded reward values.
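This convergence result is value iteration in action: start from any bounded $V$ and apply the backup operator repeatedly. A sketch on a hypothetical two-state MDP (numbers invented for illustration), finishing with the greedy policy read off from the fixed point:

```python
S, A, gamma = [0, 1], [0, 1], 0.9
# hypothetical MDP, for illustration only
P = {
    (0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 0.6, 1: 0.4}, (1, 1): {0: 0.3, 1: 0.7},
}
def R(s, a, s2):
    return float(s2)

def q(s, a, V):
    return sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in P[(s, a)].items())

def backup(V):
    return {s: max(q(s, a, V) for a in A) for s in S}

# value iteration: B^n V converges to V* for any bounded starting point
V = {s: 0.0 for s in S}
for _ in range(500):
    V = backup(V)

# V now satisfies the Bellman optimality condition up to float error ...
for s in S:
    assert abs(V[s] - backup(V)[s]) < 1e-9

# ... and the greedy policy w.r.t. V is a deterministic optimal policy
policy = {s: max(A, key=lambda a: q(s, a, V)) for s in S}
```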

## 4. Conclusion

Applying the backup operator to compute the optimal policy is impractical in most cases, as the number of states becomes very large. In the next blog post on this topic, we will discuss algorithms for efficient learning of optimal policies, such as Q-learning and Monte Carlo methods. We will also look at interesting reinforcement learning problems such as deep reinforcement learning (reinforcement learning with states represented by deep neural network features).

## 5. References

- Reinforcement Learning: An Introduction, R. Sutton and A. Barto.
- Scribes from Pieter Abbeel's course, for section 3.