The Geometry of Value Functions
I recently read a series of excellent papers on the geometry of value functions:
- The Value Function Polytope in Reinforcement Learning
- A Geometric Perspective on Optimal Representations for Reinforcement Learning
- The Value-Improvement Path: Towards Better Representations for Reinforcement Learning
I wrote up these notes as I was going through them, and thought I’d post them in case they might be useful to anyone, including future me.
- The Geometry of Value Functions
- The Value Function Polytope
- A Geometric Perspective on Optimal Representations for RL
- The Value Improvement Path
The Value Function Polytope
- Studies the map \(\pi \mapsto V^\pi\) from stationary policies (e.g. of RL agents) to value functions
The space of policies, \(\Pi\)
- Consider a 2-state, 2-action MDP: \(S =\{s_1, s_2\}, A=\{a_1, a_2\}\). What is the space of policies for state \(s_1\)?
- Line from “always \(a_1\)” to “always \(a_2\)”
- Same shape for \(s_2\).
- What about a 2-state, 3-action MDP: \(S =\{s_1, s_2\}, A = \{a_1, a_2, a_3\}\)?
- What is the space of policies for state \(s_1\)?
- Triangle from “always \(a_1\)” to “always \(a_2\)” to “always \(a_3\)”
- Same shape for \(s_2\).
- In general, the policy space \(\Pi\) is described by a simplex for each state (see the sketch below).
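To make the one-simplex-per-state picture concrete, here is a minimal NumPy sketch (mine, not from the papers); the Dirichlet sampling and the particular deterministic choices are purely illustrative.

```python
import numpy as np

# A stationary policy for an MDP with |S| states and |A| actions is one point
# on the |A|-simplex per state, i.e. an |S| x |A| matrix whose rows are
# non-negative and sum to 1.
n_states, n_actions = 2, 3
rng = np.random.default_rng(0)

# A random stochastic policy: one Dirichlet sample per state.
pi_stochastic = rng.dirichlet(np.ones(n_actions), size=n_states)

# A deterministic policy ("always a_1" in s_1, "always a_3" in s_2) sits at a
# vertex of each state's simplex: one-hot rows.
pi_deterministic = np.eye(n_actions)[[0, 2]]

assert np.allclose(pi_stochastic.sum(axis=1), 1.0)
assert np.allclose(pi_deterministic.sum(axis=1), 1.0)
```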
The space of value functions
- Is a polytope!
- What is a polytope?
- Geometric object with flat sides
- Examples:
- Triangle, square, polygon
- Tetrahedron, cube, polyhedron
- …
- Components (by dimension):
- Vertices, 0-dim
- Edges, 1-dim
- Faces, 2-dim
- Cells, 3-dim
- …
- Facets, (N-1)-dim
- The polytope itself, N-dim
How does this correspond to concepts from RL?
- Figure 2: four 2-state MDPs, each with either 2 or 3 actions
- Plotting value of \(s_1\) against value of \(s_2\).
- Notice:
- These are polytopes
- \(V^*\) is always the top-right vertex
- Lines always have positive slope
- The space is connected, but not necessarily convex
- How does changing the policy move us through value function space?
- Change policy in only one state?
- Sweeps out a line segment (with positive slope); see the sketch after this list
- Notice: one end dominates the other end
- Interpolate between two arbitrary policies?
- Sweeps out a path, possibly non-linear, possibly non-monotonic
- Notice: intermediate value functions may get worse before they get better or better before they get worse
- What about vertices?
- Figure 5: Red dots are the value functions of deterministic policies
- The convex hull of these dots contains the polytope
- Not all dots are at vertices
- (Not pictured) Not all vertices are attained by deterministic policies
- Figure 6 (right): The polytope can be non-convex.
- Here the intersection means two policies can have the same value function even though they don’t agree on the action in either state.
- The boundary of the polytope is described by semi-deterministic policies (deterministic for at least one state)
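Here is a hedged sketch (my own reconstruction, not the papers’ code) of how such plots can be produced: fix a random 2-state MDP, compute \(V^\pi = (I - \gamma P_\pi)^{-1} r_\pi\) exactly, and sweep the policy in a single state to trace out one of the line segments described above.

```python
import numpy as np

# Random 2-state, 2-action MDP (transition tensor P[s, a, s'] and rewards r[s, a]).
rng = np.random.default_rng(1)
n_states, n_actions, gamma = 2, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(size=(n_states, n_actions))

def value_of(pi):
    """Exact V^pi for a row-stochastic policy pi[s, a]."""
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state transitions under pi
    r_pi = np.einsum("sa,sa->s", pi, r)     # expected one-step reward under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Sweep the action probability in s_1 only, holding the policy in s_2 fixed:
# the points (V(s_1), V(s_2)) trace a line segment in value-function space.
for p in np.linspace(0.0, 1.0, 5):
    pi = np.array([[p, 1.0 - p],   # state s_1: mix a_1 and a_2
                   [1.0, 0.0]])    # state s_2: always a_1
    print(np.round(value_of(pi), 3))
```

Sampling many random policies instead of sweeping one state would fill in pictures like Figure 2.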
What if we try to visualize learning?
- Value Iteration
- VI generates a sequence of vectors that may not map to any policy. Value functions go outside the polytope! (A sketch of both value and policy iteration follows this list.)
- Policy Iteration
- The sequence of value functions visited by policy iteration (Figure 8) corresponds to value functions of deterministic policies.
- Policy Gradient
- The convergence rate strongly depends on the initial condition
- Gradients vanish at boundary of polytope
- With entropy regularization:
- Avoids boundaries, converges to suboptimal policy
- With natural gradients:
- “Gradient steps follow direction of steepest ascent in the underlying structure of the parameter space”
- Does not take “shortest path” through polytope; looks more like policy iteration
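As a rough illustration (assumptions and naming mine), the two dynamics can be written side by side for a small random MDP: value iteration updates a raw vector with the Bellman optimality backup, while policy iteration alternates exact evaluation with greedy improvement, so each of its iterates is the value function of some deterministic policy.

```python
import numpy as np

# Random 2-state, 2-action MDP as before: P[s, a, s'], r[s, a].
rng = np.random.default_rng(2)
n_states, n_actions, gamma = 2, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(size=(n_states, n_actions))

def value_iteration(n_steps=50):
    V = np.zeros(n_states)
    for _ in range(n_steps):
        Q = r + gamma * P @ V   # Bellman optimality backup; Q[s, a]
        V = Q.max(axis=1)       # intermediate V need not equal V^pi for any pi
    return V

def policy_iteration():
    pi = np.zeros(n_states, dtype=int)      # deterministic policy: action index per state
    while True:
        idx = np.arange(n_states)
        # Exact evaluation of the current deterministic policy.
        V = np.linalg.solve(np.eye(n_states) - gamma * P[idx, pi], r[idx, pi])
        # Greedy improvement step.
        pi_new = (r + gamma * P @ V).argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new

print(value_iteration())
print(policy_iteration())
```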
A Geometric Perspective on Optimal Representations for RL
Can we leverage these insights to learn good representations?
- Section 3: We measure the quality of a representation \(\phi\) by how well it can approximate all possible value functions.
- Eqn 1: \[\min_\phi \max_\pi \|\hat{V}_\phi^\pi - V^\pi\|_2^2 \qquad (1)\] “Find the representation \(\phi\) that minimizes the worst-case error in our value function estimate over all policies.” (A sketch of this error for a fixed linear representation follows this list.)
- Figure 1 (right): visualize using the polytope
- The worst-case error always happens at an extremal vertex
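A small sketch of the quantity inside Eqn (1), under my own simplifying assumptions: take a fixed linear representation \(\Phi\) (one feature vector per state), measure how badly it approximates each candidate value function via the least-squares residual, and take the worst case. The candidate value functions here are random placeholders standing in for the extremal vertices.

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, d = 5, 2

Phi = rng.normal(size=(n_states, d))            # representation: phi(s) as rows
candidate_Vs = rng.normal(size=(10, n_states))  # placeholders for extremal V^pi

def approx_error(Phi, V):
    """Squared L2 error of the best linear fit of V within span(Phi)."""
    w, *_ = np.linalg.lstsq(Phi, V, rcond=None)
    return float(np.sum((Phi @ w - V) ** 2))

# The max over candidates plays the role of max_pi in Eqn (1);
# Theorem 1 below says it is enough to take this max over the AVFs.
worst_case = max(approx_error(Phi, V) for V in candidate_Vs)
print(worst_case)
```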
An idea for representation learning
- Idea: an agent trained to predict various value functions should develop a good state representation
- Problem: can’t consider all policies (too expensive), or a random subset (not representative)
- Solution: only consider vertices!
- Theorem 1: the worst-case approximation error is the same whether measured over all value functions or only over the extremal vertices; call the latter “adversarial value functions” (AVFs)
- Corollary 2: There are at most \(2^{|S|}\) AVFs.
- Problem: That’s still a lot of AVFs, and finding one AVF is still NP-hard.
- Solution: Relax the problem!
- Eqn 3: Change ‘max’ in Eqn. (1) to an expectation over a finite set of value functions \(\mathbb{V}\): \[\min_\phi \sum_{V \in \mathbb{V}} \|\hat{V}_\phi - V\|_2^2 \qquad (3)\]
- Hope that \(\mathbb{V}\) is representative of the polytope.
- Basically: train the representation with useful auxiliary tasks (a minimal sketch follows this list)
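Continuing the linear sketch above (again an illustration under my own assumptions, with random placeholder targets rather than actual AVFs), the relaxed objective in Eqn (3) can be minimized over the representation itself by gradient descent, which is one way to read “train the representation with auxiliary value-prediction tasks”.

```python
import numpy as np

rng = np.random.default_rng(4)
n_states, d, lr = 5, 2, 0.05

Phi = rng.normal(size=(n_states, d))       # learnable tabular representation
targets = rng.normal(size=(8, n_states))   # placeholder set V from Eqn (3)

for step in range(200):
    grad = np.zeros_like(Phi)
    for V in targets:
        # Best linear weights for this auxiliary target under the current Phi.
        w, *_ = np.linalg.lstsq(Phi, V, rcond=None)
        residual = Phi @ w - V
        # Gradient of ||Phi w - V||^2 with respect to Phi, holding w fixed.
        grad += 2.0 * np.outer(residual, w)
    Phi -= lr * grad / len(targets)
```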
Does it work?
- It kind of works!
- Representations look meaningful
- Average return is competitive
- Caveats:
- For tabular MDPs.
- For one specific tabular MDP (4-rooms).
- Procedure (summary; a brute-force sketch follows this list):
- Randomly assign “interest” \(\delta(s) \in \{-1, 0, 1\}\) to each state.
- Find the policy \(\pi\) maximizing \(\sum_{s\in S} \delta(s)V^\pi(s)\).
- Compute its network flow and AVF.
- Repeat \(k\) times.
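A hedged sketch of this sampling loop for a tiny tabular MDP. Everything here is my own simplification: the paper solves for the maximizing policy via a network-flow formulation, whereas this brute-forces over deterministic policies, which is only feasible because the MDP is tiny.

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
n_states, n_actions, gamma, k = 3, 2, 0.9, 4
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.uniform(size=(n_states, n_actions))                       # r[s, a]

def value_of(pi):
    """Exact V^pi for a deterministic policy given as an action index per state."""
    idx = np.arange(n_states)
    return np.linalg.solve(np.eye(n_states) - gamma * P[idx, pi], r[idx, pi])

avf_targets = []
for _ in range(k):
    delta = rng.choice([-1, 0, 1], size=n_states)   # random "interest" per state
    # Brute-force search for the policy maximizing sum_s delta(s) * V^pi(s).
    best_V = max((value_of(np.array(pi)) for pi in
                  itertools.product(range(n_actions), repeat=n_states)),
                 key=lambda V: float(delta @ V))
    avf_targets.append(best_V)                      # use as an auxiliary prediction target
```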
The Value Improvement Path
Can we do better? Can we scale up?
- What is the actual task we care about in RL?
- Learn optimal value function?
- Moving target!
- During Q-learning:
- Learn a value function, act greedily, get more experience, learn the next value function
- Want a representation \(\phi\) that is:
- Good for approximating current value function \(V_i\)
- Good at generalizing to future value functions \(V_j\), where \(j>i\)
- How?
- Auxiliary tasks!
What does that look like?
- Figure 3
- Value improvement path
- Only estimate current Q function
- Invent “cumulants” or pseudo-reward functions
- Estimate the optimal policy for maximizing each cumulant’s pseudo-reward
- Estimate those policies’ values under their cumulants’ pseudo-rewards
- Use those value functions as auxiliary tasks
- Use the same cumulants and the same optimal policies
- Estimate their value under the true reward function
- Use those value functions as auxiliary tasks
- Consider the past \(k\) policies you’ve used
- You already estimated their value
- Use those value functions as auxiliary tasks (a rough sketch of the multi-head setup follows this list)
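A very rough sketch of the resulting architecture, with all module names, sizes, and head groupings being my own assumptions rather than the paper’s: a shared torso produces the representation, the main head estimates the current Q-function, and the extra heads regress onto the auxiliary value functions listed above.

```python
import torch
from torch import nn

class MultiHeadQ(nn.Module):
    """Shared representation with a main Q head and auxiliary value heads."""

    def __init__(self, obs_dim, n_actions, n_cumulants, k_past, d=64):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, d), nn.ReLU())  # phi(s)
        self.q_main = nn.Linear(d, n_actions)                     # current Q-function
        # Values of each cumulant's policy under its own pseudo-reward...
        self.q_cumulant = nn.Linear(d, n_cumulants * n_actions)
        # ...and under the true reward function.
        self.v_cumulant_true = nn.Linear(d, n_cumulants * n_actions)
        # Values of the past k policies (already estimated during training).
        self.q_past = nn.Linear(d, k_past * n_actions)

    def forward(self, obs):
        phi = self.torso(obs)
        return (self.q_main(phi), self.q_cumulant(phi),
                self.v_cumulant_true(phi), self.q_past(phi))

# The training loss would sum the usual TD loss on q_main with regression
# losses from each auxiliary head onto its target value function.
```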
Does it work?
- It actually works really well!
- Figure 4 (left): much better generalization to future value functions
- Figure 4 (center, right): much better performance on Atari