The Geometry of Value Functions
I recently read a series of excellent papers on the geometry of value functions:
- The Value Function Polytope in Reinforcement Learning
- A Geometric Perspective on Optimal Representations for Reinforcement Learning
- The Value-Improvement Path: Towards Better Representations for Reinforcement Learning
I wrote up these notes as I was going through them, and thought I’d post them in case they might be useful to anyone, including future me.
- The Geometry of Value Functions
- The Value Function Polytope
- A Geometric Perspective on Optimal Representations for RL
- The Value Improvement Path
The Value Function Polytope
- Studies the map \(\pi \mapsto V^\pi\) from stationary policies (e.g. of RL agents) to value functions
The space of policies, \(\Pi\)
- Consider a 2-state, 2-action MDP: \(S =\{s_1, s_2\}, A=\{a_1, a_2\}\). What is the space of policies for state \(s_1\)?
- Line from “always \(a_1\)” to “always \(a_2\)”
- Same shape for \(s_2\).
- What about a 2-state, 3-action MDP: \(S =\{s_1, s_2\}, A = \{a_1, a_2, a_3\}\)?
- What is the space of policies for state \(s_1\)?
- Triangle from “always \(a_1\)” to “always \(a_2\)” to “always \(a_3\)”
- Same shape for \(s_2\).
- In general, the policy space \(\Pi\) is described by a simplex for each state (see the sketch below).
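To make the one-simplex-per-state picture concrete, here is a minimal NumPy sketch (mine, not from the papers); the Dirichlet sampling and the particular deterministic choices are purely illustrative.

```python
import numpy as np

# A stationary policy for an MDP with |S| states and |A| actions is one point
# on the |A|-simplex per state, i.e. an |S| x |A| matrix whose rows are
# non-negative and sum to 1.
n_states, n_actions = 2, 3
rng = np.random.default_rng(0)

# A random stochastic policy: one Dirichlet sample per state.
pi_stochastic = rng.dirichlet(np.ones(n_actions), size=n_states)

# A deterministic policy ("always a_1" in s_1, "always a_3" in s_2) sits at a
# vertex of each state's simplex: one-hot rows.
pi_deterministic = np.eye(n_actions)[[0, 2]]

assert np.allclose(pi_stochastic.sum(axis=1), 1.0)
assert np.allclose(pi_deterministic.sum(axis=1), 1.0)
```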
The space of value functions
- Is a polytope!
- What is a polytope?
- Geometric object with flat sides
- Examples:
- Triangle, square, polygon
- Tetrahedron, cube, polyhedron
- …
- Components (by dimension):
- Vertices, 0-dim
- Edges, 1-dim
- Faces, 2-dim
- Cells, 3-dim
- …
- Facets, (N-1)-dim
- The polytope itself, N-dim
How does this correspond to concepts from RL?
- Figure 2: four 2-state MDPs, each with either 2 or 3 actions
- Plotting value of \(s_1\) against value of \(s_2\).
- Notice:
- These are polytopes
- \(V^*\) is always the top-right vertex
- Lines always have positive slope
- The space is connected, but not necessarily convex
- How does changing the policy move us through value function space?
- Change policy in only one state?
- Sweeps out a line segment (with positive slope); see the sketch after this list
- Notice: one end dominates the other end
- Interpolate between two arbitrary policies?
- Sweeps out a path, possibly non-linear, possibly non-monotonic
- Notice: intermediate value functions may get worse before they get better or better before they get worse
- What about vertices?
- Figure 5: Red dots are the value functions of deterministic policies
- The convex hull of these dots contains the polytope
- Not all dots are at vertices
- (Not pictured) Not all vertices are attained by deterministic policies
- Figure 6 (right): The polytope can be non-convex.
- Here the intersection means two policies can have the same value function even though they don’t agree on the action in either state.
- The boundary of the polytope is described by semi-deterministic policies (deterministic for at least one state)
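Here is a hedged sketch (my own reconstruction, not the papers’ code) of how such plots can be produced: fix a random 2-state MDP, compute \(V^\pi = (I - \gamma P_\pi)^{-1} r_\pi\) exactly, and sweep the policy in a single state to trace out one of the line segments described above.

```python
import numpy as np

# Random 2-state, 2-action MDP (transition tensor P[s, a, s'] and rewards r[s, a]).
rng = np.random.default_rng(1)
n_states, n_actions, gamma = 2, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(size=(n_states, n_actions))

def value_of(pi):
    """Exact V^pi for a row-stochastic policy pi[s, a]."""
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state transitions under pi
    r_pi = np.einsum("sa,sa->s", pi, r)     # expected one-step reward under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Sweep the action probability in s_1 only, holding the policy in s_2 fixed:
# the points (V(s_1), V(s_2)) trace a line segment in value-function space.
for p in np.linspace(0.0, 1.0, 5):
    pi = np.array([[p, 1.0 - p],   # state s_1: mix a_1 and a_2
                   [1.0, 0.0]])    # state s_2: always a_1
    print(np.round(value_of(pi), 3))
```

Sampling many random policies instead of sweeping one state would fill in pictures like Figure 2.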
What if we try to visualize learning?
- Value Iteration
- VI generates a sequence of vectors that may not map to any policy. Value functions go outside the polytope! (A sketch of both value and policy iteration follows this list.)
- Policy Iteration
- The sequence of value functions visited by policy iteration (Figure 8) corresponds to value functions of deterministic policies.
- Policy Gradient
- The convergence rate strongly depends on the initial condition
- Gradients vanish at boundary of polytope
- With entropy regularization:
- Avoids boundaries, converges to suboptimal policy
- With natural gradients:
- “Gradient steps follow direction of steepest ascent in the underlying structure of the parameter space”
- Does not take “shortest path” through polytope; looks more like policy iteration
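As a rough illustration (assumptions and naming mine), the two dynamics can be written side by side for a small random MDP: value iteration updates a raw vector with the Bellman optimality backup, while policy iteration alternates exact evaluation with greedy improvement, so each of its iterates is the value function of some deterministic policy.

```python
import numpy as np

# Random 2-state, 2-action MDP as before: P[s, a, s'], r[s, a].
rng = np.random.default_rng(2)
n_states, n_actions, gamma = 2, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(size=(n_states, n_actions))

def value_iteration(n_steps=50):
    V = np.zeros(n_states)
    for _ in range(n_steps):
        Q = r + gamma * P @ V   # Bellman optimality backup; Q[s, a]
        V = Q.max(axis=1)       # intermediate V need not equal V^pi for any pi
    return V

def policy_iteration():
    pi = np.zeros(n_states, dtype=int)      # deterministic policy: action index per state
    while True:
        idx = np.arange(n_states)
        # Exact evaluation of the current deterministic policy.
        V = np.linalg.solve(np.eye(n_states) - gamma * P[idx, pi], r[idx, pi])
        # Greedy improvement step.
        pi_new = (r + gamma * P @ V).argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new

print(value_iteration())
print(policy_iteration())
```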
A Geometric Perspective on Optimal Representations for RL
Can we leverage these insights to learn good representations?
- Section 3: We measure the quality of a representation \(\phi\) by how well it can approximate all possible value functions.
- Eqn 1: \[\min_\phi \max_\pi \|\hat{V}_\phi^\pi - V^\pi\|_2^2 \qquad (1)\] “Find the representation \(\phi\) that minimizes the worst-case error in our value function estimate over all policies.” (A sketch of this error for a fixed linear representation follows this list.)
- Figure 1 (right): visualize using the polytope
- The worst-case error always happens at an extremal vertex
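A small sketch of the quantity inside Eqn (1), under my own simplifying assumptions: take a fixed linear representation \(\Phi\) (one feature vector per state), measure how badly it approximates each candidate value function via the least-squares residual, and take the worst case. The candidate value functions here are random placeholders standing in for the extremal vertices.

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, d = 5, 2

Phi = rng.normal(size=(n_states, d))            # representation: phi(s) as rows
candidate_Vs = rng.normal(size=(10, n_states))  # placeholders for extremal V^pi

def approx_error(Phi, V):
    """Squared L2 error of the best linear fit of V within span(Phi)."""
    w, *_ = np.linalg.lstsq(Phi, V, rcond=None)
    return float(np.sum((Phi @ w - V) ** 2))

# The max over candidates plays the role of max_pi in Eqn (1);
# Theorem 1 below says it is enough to take this max over the AVFs.
worst_case = max(approx_error(Phi, V) for V in candidate_Vs)
print(worst_case)
```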
An idea for representation learning
- Idea: an agent trained to predict various value functions should develop a good state representation
- Problem: can’t consider all policies (too expensive), or a random subset (not representative)
- Solution: only consider vertices!
- Theorem 1: the worst-case approximation error is the same whether measured over all value functions or only over the extremal vertices; call the latter “adversarial value functions” (AVFs)
- Corollary 2: There are at most \(2^{|S|}\) AVFs.
- Problem: That’s still a lot of AVFs, and finding one AVF is still NP-hard.
- Solution: Relax the problem!
- Eqn 3: Change ‘max’ in Eqn. (1) to an expectation over a finite set of value functions \(\mathbb{V}\): \[\min_\phi \sum_{V \in \mathbb{V}} \|\hat{V}_\phi - V\|_2^2 \qquad (3)\]
- Hope that \(\mathbb{V}\) is representative of the polytope.
- Basically: train the representation with useful auxiliary tasks (a minimal sketch follows this list)
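Continuing the linear sketch above (again an illustration under my own assumptions, with random placeholder targets rather than actual AVFs), the relaxed objective in Eqn (3) can be minimized over the representation itself by gradient descent, which is one way to read “train the representation with auxiliary value-prediction tasks”.

```python
import numpy as np

rng = np.random.default_rng(4)
n_states, d, lr = 5, 2, 0.05

Phi = rng.normal(size=(n_states, d))       # learnable tabular representation
targets = rng.normal(size=(8, n_states))   # placeholder set V from Eqn (3)

for step in range(200):
    grad = np.zeros_like(Phi)
    for V in targets:
        # Best linear weights for this auxiliary target under the current Phi.
        w, *_ = np.linalg.lstsq(Phi, V, rcond=None)
        residual = Phi @ w - V
        # Gradient of ||Phi w - V||^2 with respect to Phi, holding w fixed.
        grad += 2.0 * np.outer(residual, w)
    Phi -= lr * grad / len(targets)
```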
Does it work?
- It kind of works!
- Representations look meaningful
- Average return is competitive
- Caveats:
- For tabular MDPs.
- For one specific tabular MDP (4-rooms).
- Procedure (summary; a brute-force sketch follows this list):
- Randomly assign “interest” \(\delta(s) \in \{-1, 0, 1\}\) to each state.
- Find the policy \(\pi\) maximizing \(\sum_{s\in S} \delta(s)V^\pi(s)\).
- Compute its network flow and AVF.
- Repeat \(k\) times.
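A hedged sketch of this sampling loop for a tiny tabular MDP. Everything here is my own simplification: the paper solves for the maximizing policy via a network-flow formulation, whereas this brute-forces over deterministic policies, which is only feasible because the MDP is tiny.

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
n_states, n_actions, gamma, k = 3, 2, 0.9, 4
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.uniform(size=(n_states, n_actions))                       # r[s, a]

def value_of(pi):
    """Exact V^pi for a deterministic policy given as an action index per state."""
    idx = np.arange(n_states)
    return np.linalg.solve(np.eye(n_states) - gamma * P[idx, pi], r[idx, pi])

avf_targets = []
for _ in range(k):
    delta = rng.choice([-1, 0, 1], size=n_states)   # random "interest" per state
    # Brute-force search for the policy maximizing sum_s delta(s) * V^pi(s).
    best_V = max((value_of(np.array(pi)) for pi in
                  itertools.product(range(n_actions), repeat=n_states)),
                 key=lambda V: float(delta @ V))
    avf_targets.append(best_V)                      # use as an auxiliary prediction target
```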
The Value Improvement Path
Can we do better? Can we scale up?
- What is the actual task we care about in RL?
- Learn optimal value function?
- Moving target!
- During Q-learning:
- Learn a value function, act greedily, get more experience, learn the next value function
- Want a representation \(\phi\) that is:
- Good for approximating current value function \(V_i\)
- Good at generalizing to future value functions \(V_j\), where \(j>i\)
- How?
- Auxiliary tasks!
What does that look like?
- Figure 3
- Value improvement path
- Only estimate current Q function
- Invent “cumulants” or pseudo-reward functions
- Estimate the optimal policy for maximizing each cumulant’s pseudo-reward
- Estimate those policies’ values under their cumulants’ pseudo-rewards
- Use those value functions as auxiliary tasks
- Use the same cumulants and the same optimal policies
- Estimate their value under the true reward function
- Use those value functions as auxiliary tasks
- Consider the past \(k\) policies you’ve used
- You already estimated their value
- Use those value functions as auxiliary tasks (a rough sketch of the multi-head setup follows this list)
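A very rough sketch of the resulting architecture, with all module names, sizes, and head groupings being my own assumptions rather than the paper’s: a shared torso produces the representation, the main head estimates the current Q-function, and the extra heads regress onto the auxiliary value functions listed above.

```python
import torch
from torch import nn

class MultiHeadQ(nn.Module):
    """Shared representation with a main Q head and auxiliary value heads."""

    def __init__(self, obs_dim, n_actions, n_cumulants, k_past, d=64):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, d), nn.ReLU())  # phi(s)
        self.q_main = nn.Linear(d, n_actions)                     # current Q-function
        # Values of each cumulant's policy under its own pseudo-reward...
        self.q_cumulant = nn.Linear(d, n_cumulants * n_actions)
        # ...and under the true reward function.
        self.v_cumulant_true = nn.Linear(d, n_cumulants * n_actions)
        # Values of the past k policies (already estimated during training).
        self.q_past = nn.Linear(d, k_past * n_actions)

    def forward(self, obs):
        phi = self.torso(obs)
        return (self.q_main(phi), self.q_cumulant(phi),
                self.v_cumulant_true(phi), self.q_past(phi))

# The training loss would sum the usual TD loss on q_main with regression
# losses from each auxiliary head onto its target value function.
```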
Does it work?
- It actually works really well!
- Figure 4 (left): much better generalization to future value functions
- Figure 4 (center, right): much better performance on Atari