Asynchronous Dynamic Programming#
Dynamic Programming#
Dynamic programming refers to the family of methods that iteratively solve the Bellman Equations to estimate the values of states or state-action pairs. The Bellman Equations relate the value of the current state to the values of possible future states, or the value of the current state-action pair to the values of future state-action pairs. They are solvable in the ideal situation where we know the dynamics of the environment, namely the probability of transitioning from one state to another given an action, together with the associated reward.
Mathematically, we want to estimate \(v_\pi(s)\), the value of state \(s\) given that we follow a policy \(\pi(a|s)\), which gives the probability of selecting action \(a\) when the agent is in state \(s\).
To solve this, we first look at the Bellman Equations, which give us the relationship:
\(v_\pi(s) = f(\pi(a|s), r, \gamma, \text{p}(s',r|s,a))\) which means \(v_\pi\) is a function of:
our policy \(\pi(a|s)\)
immediate next reward \(r\)
discount rate \(\gamma\), a factor between [0, 1] by which we discount future rewards
transition probabilities \(\text{p}(s',r|s,a)\)
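Written out in full, this relationship is the standard Bellman equation for the state-value function:
\(v_\pi(s) = \sum_{a} \pi(a|s) \sum_{s', r} \text{p}(s',r|s,a) \left[ r + \gamma v_\pi(s') \right]\)
where the inner sum runs over all possible next states \(s'\) and rewards \(r\).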
Now, solving this directly becomes quite complicated, especially with many states \(s\), because it effectively becomes a system of \(|S|\) equations in \(|S|\) unknowns \(v_\pi(s)\), where \(|S|\) is the number of possible discrete states. Note that in this formulation, we assume that states, actions, and rewards are discrete.
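One way to see the difficulty: for a finite state space, the Bellman Equations form a linear system, which in the standard matrix notation (with \(P_\pi\) the policy-averaged transition matrix and \(r_\pi\) the expected reward vector) reads
\(v_\pi = r_\pi + \gamma P_\pi v_\pi \quad \Longrightarrow \quad v_\pi = (I - \gamma P_\pi)^{-1} r_\pi\)
so a direct solution requires solving an \(|S| \times |S|\) system, which scales poorly as \(|S|\) grows.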
A more practical method for estimating \(v_\pi(s)\) is Dynamic Programming, which iteratively refines the estimates following this update rule:
\(v_{k+1}(s) \leftarrow \sum_{a} \pi(a|s) \sum_{s', r} \text{p}(s',r|s,a) \left[ r + \gamma v_{k}(s') \right]\)
This means we use the previous estimates \(v_{k}\) at iteration \(k\) to compute a refined estimate \(v_{k+1}(s)\). Now, in the limit as \(k \rightarrow \infty\), we expect
\(v_{k+1}(s) = v_{k}(s) = v_\pi(s)\)
for some arbitrary policy \(\pi(a|s)\).
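To make this concrete, here is a minimal sketch of this iterative policy evaluation in Python, assuming a tabular problem where the dynamics are given explicitly; the data structures (`policy`, `transitions`) are illustrative and not taken from the experiments below.

```python
def policy_evaluation(states, actions, policy, transitions, gamma=0.99, tol=1e-6):
    """Iteratively refine v_k(s) until it converges to v_pi(s).

    policy[s][a]      -> probability of taking action a in state s
    transitions[s][a] -> list of (prob, next_state, reward) tuples
    """
    v = {s: 0.0 for s in states}  # v_0(s), an arbitrary initial guess
    while True:
        delta = 0.0
        v_new = {}
        for s in states:
            # Bellman backup: expectation over actions and next states using v_k
            v_new[s] = sum(
                policy[s][a] * sum(p * (r + gamma * v[s2])
                                   for p, s2, r in transitions[s][a])
                for a in actions
            )
            delta = max(delta, abs(v_new[s] - v[s]))
        v = v_new  # v_{k+1} replaces v_k only after a full sweep over all states
        if delta < tol:
            return v
```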
Asynchronous Dynamic Programming#
Now, the original formulation of dynamic programming requires that all states be updated in each iteration: we sweep over every \(s \in S\) and apply the update above. However, this becomes prohibitively expensive when there are many possible states, in which case we can apply a modified version called Asynchronous Dynamic Programming (ADP).
ADP gives us the flexibility to update the state values in any order; in fact, certain states may be updated much more frequently than others. It also allows us to update states as the agent “experiences” the environment. This is the formulation that we use in this experiment.
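To make the contrast concrete, below is a sketch of the asynchronous, in-place version of the same backup, assuming the same `policy`/`transitions` structures as the sketch above; it updates one state at a time, e.g. whichever state the agent currently occupies.

```python
def async_backup(v, s, actions, policy, transitions, gamma=0.99):
    """In-place Bellman backup of a single state s; all other states are untouched.

    Calling this only for states the agent actually visits is the
    experience-driven flavor of asynchronous DP described above.
    """
    v[s] = sum(
        policy[s][a] * sum(p * (r + gamma * v[s2])
                           for p, s2, r in transitions[s][a])
        for a in actions
    )
    return v
```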
The Experiments#
We experiment on a simple gridworld problem. We create an \(N \times N\) grid where there are \(X\) cells that are the “goal states.” The objective is for an agent randomly placed on the grid to get to any of these goal states as quickly as possible.
Each transition from one state to another yields a reward of -1. If the agent bumps into the border/edges of the grid, it stays in place but still incurs the -1 reward; the same applies if it bumps into a wall, when walls are present. The reward in any of the goal states is 0. We use a discount rate of 0.99.
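A minimal sketch of such an environment is below, assuming deterministic moves; the class name, method names, and the choice to end an episode on the step that reaches a goal are illustrative, not necessarily identical to the code behind the experiments.

```python
import random

class GridWorld:
    """N x N grid with goal cells, optional walls, and a -1 reward per step."""

    ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def __init__(self, n, goals, walls=()):
        self.n = n
        self.goals = set(goals)
        self.walls = set(walls)
        self.state = None

    def reset(self):
        # Place the agent on a random cell that is neither a goal nor a wall
        free = [(i, j) for i in range(self.n) for j in range(self.n)
                if (i, j) not in self.goals and (i, j) not in self.walls]
        self.state = random.choice(free)
        return self.state

    def step(self, action):
        di, dj = self.ACTIONS[action]
        ni, nj = self.state[0] + di, self.state[1] + dj
        # Bumping the border or a wall keeps the agent in place, still costing -1
        if 0 <= ni < self.n and 0 <= nj < self.n and (ni, nj) not in self.walls:
            self.state = (ni, nj)
        done = self.state in self.goals  # goal states end the episode (value 0)
        return self.state, -1.0, done
```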
We experiment with three \(\epsilon\)-greedy policies: Greedy (\(\epsilon = 0\)), Epsilon-Greedy (\(\epsilon = 0.1\)), and Random (\(\epsilon = 1\)). Greedy always moves to the neighboring cell with the largest estimated value \(v_k(s)\), breaking ties randomly. Epsilon-Greedy does the same, except that at each timestep there is an \(\epsilon\) chance of picking an action at random. Random simply chooses any action with equal probability.
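All three behaviors can be expressed with one \(\epsilon\)-greedy selector, sketched below; `next_values` is assumed to map each action to the estimated value \(v_k\) of the cell that action leads to.

```python
import random

def epsilon_greedy(next_values, epsilon):
    """Select an action from {action: estimated value of the resulting cell}.

    epsilon = 0   -> Greedy (ties broken at random)
    epsilon = 0.1 -> Epsilon-Greedy
    epsilon = 1   -> Random
    """
    actions = list(next_values)
    if random.random() < epsilon:
        return random.choice(actions)
    best = max(next_values.values())
    ties = [a for a in actions if next_values[a] == best]
    return random.choice(ties)
```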
We wanted to see how these three policies explore the environment to estimate the state values \(v_k(s)\) using dynamic programming.
In our first set of experiments, there are no walls, and whenever the agent reaches a goal, it resets to a random state excluding the goal states. However, its knowledge of the state values is kept, so it can continue to refine the estimates.
In our second set of experiments, we add walls following the four-rooms environment (Sutton). The goal is fixed, but the agent always starts at a random location. In the first experiments, the boundary values are initially set to a very low -1e3, while in the second experiments, both walls and boundaries have an initial value of -1e3.
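Putting the pieces together, the experiment loop might look roughly like the sketch below, reusing the illustrative `GridWorld` and `epsilon_greedy` helpers from earlier. The grid size, goal location, number of steps, and the exact way the -1e3 initialization enters the updates are assumptions on our part; the actual experiment code may differ.

```python
gamma, epsilon, n_steps = 0.99, 0.1, 10_000      # epsilon = 0, 0.1, or 1 per policy
env = GridWorld(n=11, goals=[(5, 5)], walls=[])  # size and goal are placeholders

# Value table: goals stay at 0; wall cells (when present) start at a very low -1e3
v = {(i, j): 0.0 for i in range(env.n) for j in range(env.n)}
for w in env.walls:
    v[w] = -1e3

s = env.reset()
for t in range(n_steps):
    # One-step lookahead using the known dynamics: value of the cell each action leads to
    lookahead = {}
    for a, (di, dj) in GridWorld.ACTIONS.items():
        ni, nj = s[0] + di, s[1] + dj
        blocked = not (0 <= ni < env.n and 0 <= nj < env.n) or (ni, nj) in env.walls
        lookahead[a] = v[s] if blocked else v[(ni, nj)]  # bumps leave the agent in place

    # Asynchronous backup of the visited state under the epsilon-greedy policy
    if s not in env.goals:
        probs = {a: epsilon / len(lookahead) for a in lookahead}
        probs[max(lookahead, key=lookahead.get)] += 1 - epsilon
        v[s] = sum(p * (-1 + gamma * lookahead[a]) for a, p in probs.items())

    s, _, done = env.step(epsilon_greedy(lookahead, epsilon))
    if done:
        s = env.reset()  # estimates in v are kept across resets
```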
No Walls#
We visualize in the animation below the movement of the Greedy, Epsilon-Greedy, and Random agents, as well as the corresponding estimated state values \(v_k(s)\). We find that the Random agent is the slowest to estimate the state values, which is interesting: we initially thought Random would cover more of the grid because it is not encouraged to head toward the goals.
The goal is indicated by the yellow cell. In terms of the value estimates, redder means more negative while bluer means less negative or more positive.
Note that each frame in the animation below corresponds to 20 timesteps in the simulation.