Recently, research on the geometric properties of value functions, and attempts to apply these properties across Reinforcement Learning (RL), has been very active. This post is an introduction to value function geometry, including an overview of the paper "The Value Function Polytope in Reinforcement Learning" by Dadashi et al. [1].
This post also aims to explain the basic concepts of value function linear approximation and representation learning. The paper [1] establishes the geometric shape of the value function space of finite state-action Markov decision processes: a general polytope. By reimplementing the paper myself, I was able to verify its surprising theorems and geometric properties, and I have attached some of the results I checked directly.
Figure 1: Mapping between policies and value functions [1]
We usually consider an environment modeled as a Markov Decision Process (MDP) in Reinforcement Learning [2]. A Markov Decision Process is a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $P(s' \mid s, a)$ the transition function, $r(s, a)$ the reward function, and $\gamma \in [0, 1)$ the discount factor.
A stationary policy $\pi$ is a mapping from states to probability distributions over actions, $\pi : \mathcal{S} \to \mathcal{P}(\mathcal{A})$; we write $\pi(a \mid s)$ for the probability of choosing action $a$ in state $s$.

The state-value function $V^{\pi}$ is the expected cumulative discounted reward from starting in a state and following $\pi$:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\Big|\, s_0 = s\Big]$$

The action-value function $Q^{\pi}$ additionally fixes the first action:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\Big|\, s_0 = s,\ a_0 = a\Big]$$

In particular, the state-value function is an expectation over the action-value functions under a policy: $V^{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a)$.

In the case we take actions by following an argmax (greedy) policy, the state value is the maximum action value at each state: $V(s) = \max_{a} Q(s, a)$.
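The argmax relationship above can be sketched in a few lines with a made-up tabular Q-function (the numbers below are purely illustrative):

```python
import numpy as np

# A hypothetical tabular Q-function: Q[s, a] for 3 states and 2 actions.
Q = np.array([[1.0, 2.0],
              [0.5, 0.1],
              [3.0, 3.5]])

# Under an argmax (greedy) policy, the state value is the maximum
# action value at each state: V(s) = max_a Q(s, a).
greedy_actions = Q.argmax(axis=1)   # a*(s) = argmax_a Q(s, a)
V = Q.max(axis=1)                   # V(s)  = max_a Q(s, a)

print(greedy_actions)  # [1 0 1]
print(V)               # [2.  0.5 3.5]
```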
Figure 2: Geometric View of Value Function 3
Figure 2 represents the geometric relationship between the state-value function and the action-value function when following an argmax policy. Each line corresponds to an action-value function, and the bold line corresponds to the state-value function. It can be seen that the maximum action value selected at each state forms the state-value function. In Figure 2, it can also be checked that $V(s) = \max_a Q(s, a) \ge Q(s, a)$ for every action $a$.
Figure 3: Value Functions for the mountain car problem using DQN
Figure 3 is an example of value functions that I obtained for the mountain car problem. Both panels used DQN (Deep Q-Networks) to train the agent. Figure 3(a) is based on 2,000 runs of mountain car without changing the reward, and Figure 3(b) is based on 500 runs with an added heuristic reward: a bonus of 15 times the velocity of the next state. With this heuristic reward, the mountain car agent was able to learn noticeably faster.
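A minimal sketch of that heuristic bonus as a reward-shaping wrapper. It assumes a Gym-style environment whose observations are (position, velocity), as in MountainCar-v0; `DummyEnv`, the use of the absolute velocity, and the wrapper name are my own illustrative assumptions, not details from the post:

```python
class VelocityBonusWrapper:
    """Adds a bonus of coef * |next velocity| to the environment reward."""
    def __init__(self, env, coef=15.0):
        self.env = env
        self.coef = coef

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Bonus proportional to the next state's speed, rewarding momentum.
        shaped = reward + self.coef * abs(obs[1])
        return obs, shaped, done, info

class DummyEnv:
    """Stand-in environment: one fixed transition, reward -1 per step."""
    def reset(self):
        return (-0.5, 0.0)

    def step(self, action):
        return (-0.48, 0.02), -1.0, False, {}

env = VelocityBonusWrapper(DummyEnv())
obs, reward, done, info = env.step(0)
print(reward)  # -1.0 plus the bonus 15 * 0.02
```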
3. Value Function Linear Approximations
A value function for a policy $\pi$ can be approximated as a linear function of state features: $\hat{V}_{\theta}(s) = \phi(s)^{\top}\theta$. For a given representation $\phi : \mathcal{S} \to \mathbb{R}^{d}$, learning then reduces to finding a weight vector $\theta \in \mathbb{R}^{d}$ that makes $\hat{V}_{\theta}$ close to the true value function $V^{\pi}$.
Figure 4: A deep reinforcement learning architecture viewed as a two-part approximation [4]
Assume a finite state space: $\mathcal{S} = \{s_1, \dots, s_n\}$.

Let's start by introducing some vector and matrix notation to make things more convenient: stack the values into a vector $V^{\pi} \in \mathbb{R}^{n}$, define the expected rewards $r^{\pi} \in \mathbb{R}^{n}$ by $r^{\pi}(s) = \sum_{a} \pi(a \mid s)\, r(s, a)$, and the state-to-state transition matrix $P^{\pi} \in \mathbb{R}^{n \times n}$ by $P^{\pi}(s, s') = \sum_{a} \pi(a \mid s)\, P(s' \mid s, a)$.

It is well known that the state-value function satisfies the Bellman equation:

$$V^{\pi} = r^{\pi} + \gamma P^{\pi} V^{\pi}$$

Given a policy $\pi$, the value function is the unique fixed point of this equation:

$$V^{\pi} = (I - \gamma P^{\pi})^{-1} r^{\pi}$$
When we do not know the model completely, we cannot calculate $r^{\pi}$, $P^{\pi}$, or the matrix inverse directly; instead, the value function must be estimated from sampled transitions.
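When the model is known, however, the fixed point $V^{\pi} = (I - \gamma P^{\pi})^{-1} r^{\pi}$ can be computed exactly. A minimal sketch on a 2-state, 2-action MDP; the transition and reward numbers are made up for illustration:

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.99, 0.01], [0.3, 0.7]]])   # P[s, a, s']
r = np.array([[-0.45, -0.1],
              [0.5, 0.5]])                   # r[s, a]

def value_of(pi):
    """Exact policy evaluation: solve (I - gamma P^pi) V = r^pi."""
    P_pi = np.einsum('sa,sat->st', pi, P)    # P^pi(s, s') = sum_a pi(a|s) P(s'|s, a)
    r_pi = np.einsum('sa,sa->s', pi, r)      # r^pi(s)     = sum_a pi(a|s) r(s, a)
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

pi = np.array([[0.5, 0.5], [0.5, 0.5]])      # uniform policy
print(value_of(pi))                          # V^pi as a length-2 vector
```

Solving the linear system with `np.linalg.solve` is preferable to forming the inverse explicitly, but both express the same fixed point.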
The projection operator is widely used in statistics, e.g., in multiple linear regression models and least-squares estimation: projecting a target $y$ onto the column space of a design matrix $X$ gives $\hat{y} = X\hat{\beta}$ with $\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y$.

The projection operator satisfies the normal equations:

$$X^{\top}(X\hat{\beta} - y) = 0$$
Back to value function approximation: again assume a finite state space and stack the features into a matrix $\Phi \in \mathbb{R}^{n \times d}$ whose $s$-th row is $\phi(s)^{\top}$.

We consider the approximation minimizing the squared error:

$$\min_{\theta} \lVert \Phi\theta - V^{\pi} \rVert_{2}^{2}$$
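This least-squares projection can be sketched directly; the feature matrix and target values below are made up for illustration:

```python
import numpy as np

Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0],
                [1.0, 3.0]])          # one row of features per state
V = np.array([0.5, 1.4, 2.6, 3.5])   # target value function

# Solve the normal equations Phi^T Phi theta = Phi^T V.
theta, *_ = np.linalg.lstsq(Phi, V, rcond=None)
V_hat = Phi @ theta                  # projection of V onto span(Phi)

# The residual is orthogonal to the feature columns (normal equations).
print(Phi.T @ (V_hat - V))           # ~ [0, 0]
```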
4. Value Function Polytope in Reinforcement Learning
Definition 1 : Polytope
Definition 2 : Polyhedron
The Space of Value Functions
The main contribution of Dadashi et al. [1] is the characterization of the space of value functions. The space of value functions is the set of all value functions obtained from all possible policies.
Dadashi et al. [1] show that this space is a (possibly non-convex) polytope.
Figure 5: Space of Value Functions for 2-state MDPs
Figure 5 depicts the value function space $\mathcal{V}$ of two different 2-state MDPs.
Figure 5: Left
Figure 5: Right
I used the following convention to express a value function of a 2-state MDP as a point in the plane: the x-axis is $V(s_1)$ and the y-axis is $V(s_2)$.
To easily check the shape of the polytope, I examined the value function spaces of 2-state MDPs, which are convenient to visualize. The policy space was sampled uniformly, and each sampled policy was mapped to its value function.
In fact, if you look closely at the right of Figure 5, you'll notice something a little odd. You may think it's not a perfect polytope because the upper-left region is slightly empty. To double-check that the policies were sampled uniformly from the policy space, I visualized the mapping between policy samples and value function samples.
Figure 6: Mapping between sample policies and sample value functions. The red points are the value functions of deterministic policies
To easily visualize both the policy space and the value function space, a 2-state, 2-action MDP was used for Figure 6. It includes 50,000 pairs of policy and value function samples. The specific details of the MDP presented in this figure are as follows.
The left of Figure 6 definitely shows that the policies have been sampled uniformly. Despite the uniformly sampled policies, you can still see sparse empty space in the upper-left corner of the value function space. Also, to confirm the existence of the polytope's vertices, I took the value functions of deterministic policies and confirmed that they sit at the vertices. In conclusion, the sparse empty space in the figure is simply due to the limited number of samples. It can also be seen that the density of points within the value function polytope is far from uniform.
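The sampling experiment above can be reproduced with a short script: sample policies uniformly for a 2-state, 2-action MDP and map each one to its value function. The MDP parameters below are illustrative stand-ins, since the post's exact numbers were not preserved here:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.99, 0.01], [0.3, 0.7]]])   # P[s, a, s']
r = np.array([[-0.45, -0.1], [0.5, 0.5]])    # r[s, a]

def value_of(pi):
    """Exact policy evaluation for policy pi[s, a]."""
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Sample pi(a_0 | s) uniformly in [0, 1] for each of the 2 states;
# the second action gets the remaining probability mass.
p = rng.uniform(size=(5000, 2))
values = np.array([value_of(np.stack([q, 1.0 - q], axis=-1)) for q in p])

# Each row of `values` is a point (V(s0), V(s1)) in the value polytope;
# scatter-plotting these rows reveals the shape seen in Figure 5.
print(values.shape)
```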
Definition 3 : Policy Agreement
For a given policy $\pi$ and states $s_1, \dots, s_k$, we say that a policy $\pi'$ agrees with $\pi$ on $s_1, \dots, s_k$ if $\pi'(\cdot \mid s_i) = \pi(\cdot \mid s_i)$ for each $i = 1, \dots, k$.
Definition 4 : Policy Determinism

A policy $\pi$ is:
- $s$-deterministic for $s \in \mathcal{S}$ if $\pi(a \mid s) \in \{0, 1\}$ for every action $a$.
- semi-deterministic if it is $s$-deterministic for at least one $s \in \mathcal{S}$.
- deterministic if it is $s$-deterministic for all states $s \in \mathcal{S}$.
We denote by $Y^{\pi}_{s_1, \dots, s_k}$ the set of policies that agree with $\pi$ on $s_1, \dots, s_k$.
Value Functions and Policy Agreement
Now, I will introduce one of the most amazing theorems of this paper, the line theorem. It characterizes the subsets of value function space that are generated by policies which agree on all but one state. Surprisingly, each such subset is a line segment within the value function polytope, and this line segment is monotone.
Theorem 1 : Line Theorem

Let $s$ be a state and $\pi$ a policy, and consider the set $Y^{\pi}_{\mathcal{S} \setminus \{s\}}$ of policies that agree with $\pi$ on all states except $s$. Then the image of $Y^{\pi}_{\mathcal{S} \setminus \{s\}}$ in value function space is a line segment.

Furthermore, the image of $Y^{\pi}_{\mathcal{S} \setminus \{s\}}$ is monotone: the value functions along the segment are totally ordered by the componentwise partial order.

Furthermore, the image of $Y^{\pi}_{\mathcal{S} \setminus \{s\}}$ has as its endpoints the value functions of two $s$-deterministic policies in $Y^{\pi}_{\mathcal{S} \setminus \{s\}}$.
Theorem 1 implies that we can generate the entire line segment by interpolating between the two endpoint policies at the single state on which they differ.
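This is easy to verify numerically: mix two policies that differ only at one state and check that all resulting value functions are collinear. The MDP and the two policies below are illustrative assumptions, not the post's exact setup:

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.99, 0.01], [0.3, 0.7]]])   # P[s, a, s']
r = np.array([[-0.45, -0.1], [0.5, 0.5]])    # r[s, a]

def value_of(pi):
    """Exact policy evaluation for policy pi[s, a]."""
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# pi1 and pi2 agree on state s1 and differ only on state s0,
# so all their mixtures lie in Y^pi_{S \ {s0}}.
pi1 = np.array([[1.0, 0.0], [0.4, 0.6]])
pi2 = np.array([[0.0, 1.0], [0.4, 0.6]])

points = np.array([value_of((1 - t) * pi1 + t * pi2)
                   for t in np.linspace(0, 1, 11)])

# Collinearity check: 2-D cross product of each displacement with the
# segment direction should vanish if all points lie on one line.
d = points - points[0]
cross = d[:, 0] * d[-1, 1] - d[:, 1] * d[-1, 0]
print(np.max(np.abs(cross)))   # ~ 0, i.e. the points are collinear
```

Note that the map from the mixture weight $t$ to the value function is not itself linear; the theorem only guarantees that the image is a line segment, which is exactly what the collinearity check confirms.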
Figure 7: Illustration of Theorem 1 (Line Theorem).
Figure 7 illustrates the value function space traced out by policies interpolated between two policies that agree on all states but one.
Figure 7: Left
Figure 7: Right
For both figures in Figure 7, the following setup was used.
Figure 8: Value functions of mixtures of two policies in the general case.
The red points in Figure 8 are the value functions of mixtures of two arbitrary policies. You can see that the red points trace a curve, not a straight line. This shows that the condition that the policies agree on all but one state is necessary for the image to be a line segment.
By visualizing the policy space, we can easily recognize that the two policies do not agree on any state. In that case, the value functions of their mixtures form a curve rather than a line segment.
**Corollary 1.** *For any set of states $s_1, \dots, s_k$ and any policy $\pi$, the image of $Y^{\pi}_{s_1, \dots, s_k}$ in value function space is contained in the convex hull of the value functions of the policies in $Y^{\pi}_{s_1, \dots, s_k}$ that are deterministic on the remaining states.*
Corollary 1 can be proved by mathematical induction on the number of states $k$, applying Theorem 1 at each step.
Figure 9: Visual representation of Corollary 1. The red dots are the value functions of deterministic policies.
As you can see in Figure 9, the space of value functions is included in the convex hull of the value functions of deterministic policies. You can also check the relationship between the vertices of $\mathcal{V}$ and the value functions of deterministic policies.
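For a 2-state, 2-action MDP there are only $2^2 = 4$ deterministic policies, so the hull points in Figure 9 can be enumerated directly. The MDP parameters are again illustrative:

```python
import numpy as np
from itertools import product

gamma = 0.9
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.99, 0.01], [0.3, 0.7]]])   # P[s, a, s']
r = np.array([[-0.45, -0.1], [0.5, 0.5]])    # r[s, a]

def value_of(pi):
    """Exact policy evaluation for policy pi[s, a]."""
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Enumerate every deterministic policy: one action choice per state.
det_values = {}
for a0, a1 in product(range(2), repeat=2):
    pi = np.zeros((2, 2))
    pi[0, a0] = pi[1, a1] = 1.0          # deterministic: pi(a|s) in {0, 1}
    det_values[(a0, a1)] = value_of(pi)

# By Corollary 1, every value function of this MDP lies in the convex
# hull of these four points.
for actions, v in sorted(det_values.items()):
    print(actions, v)
```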
Figure 10: Visual representation of Corollary 1. The value functions of deterministic policies are not identical to the vertices of V.
The Boundary of $\mathcal{V}$
Theorem 2 and Corollary 3 show that the boundary of value function space is described by semi-deterministic policies.
Let's start by first defining the affine vector space before we state Theorem 2. Consider a policy $\pi$ and a set of states $s_1, \dots, s_k$; denote by $H^{\pi}_{s_1, \dots, s_k}$ the smallest affine vector space containing the value functions of the policies in $Y^{\pi}_{s_1, \dots, s_k}$.
Consider the ensemble of policies $Y^{\pi}_{s_1, \dots, s_k}$ that agree with $\pi$ on $s_1, \dots, s_k$. Theorem 2 states that the boundary of its image in value function space, taken relative to the affine space $H^{\pi}_{s_1, \dots, s_k}$, is contained in the value functions of policies in $Y^{\pi}_{s_1, \dots, s_k}$ that are $s$-deterministic for some state $s \notin \{s_1, \dots, s_k\}$.
**Corollary 3.** *Consider a policy $\pi$. Then the boundary of the space of value functions, $\partial \mathcal{V}$, is contained in the set of value functions of semi-deterministic policies.*
Figure 11: Visual representation of Corollary 3. The orange points are the value functions of semi-deterministic policies.
The Polytope of Value Functions
Consider a policy $\pi$ and a set of states $s_1, \dots, s_k$. The paper's final theorem shows that the image of $Y^{\pi}_{s_1, \dots, s_k}$ in value function space is a polytope; in particular, the whole space of value functions $\mathcal{V}$ is itself a (possibly non-convex) polytope.