Asynchronous Dynamic Programming#

Dynamic Programming#

Dynamic programming refers to the set of methods that iteratively apply the Bellman Equations to estimate the values of states or state-action pairs. The Bellman Equations relate the value of the current state to the values of possible future states, or the value of the current state-action pair to the values of possible future state-action pairs. They are solvable in the ideal situation where we know the dynamics of the environment, namely the probability of transitioning from one state to another and receiving a given reward, given an action.

Mathematically, we want to estimate \(v_\pi(s)\) which is the value at state \(s\) given we follow a policy \(\pi(a|s)\) which gives us the probability of selecting action \(a\) given the agent is in state \(s\).

To solve this, we first look at the Bellman Equations, which give us the relationship:

\(v_\pi(s) = f(\pi(a|s), r, \gamma, \text{p}(s',r|s,a))\) which means \(v_\pi\) is a function of:

  • our policy \(\pi(a|s)\)

  • immediate next reward \(r\)

  • discount rate \(\gamma \in [0, 1]\), by which we discount future rewards

  • transition probabilities \(\text{p}(s',r|s,a)\)
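
Putting these pieces together, the Bellman Equation for the state-value function can be written out as:

\(v_\pi(s) = \sum_{a} \pi(a|s) \sum_{s', r} \text{p}(s',r|s,a)\left[r + \gamma\, v_\pi(s')\right]\)

with one such equation for every state \(s\).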

Now, solving this directly becomes quite complicated, especially with many states, because it is effectively a system of \(|S|\) equations in \(|S|\) unknowns \(v_\pi(s)\), where \(|S|\) is the number of possible discrete states. Note that in this formulation, we assume that states, actions, and rewards are discrete.

A more practical method for estimating \(v_\pi(s)\) is Dynamic Programming, which iteratively refines the estimates by turning the Bellman Equation into an update rule:

\(v_{k+1}(s) \leftarrow \sum_{a} \pi(a|s) \sum_{s', r} \text{p}(s',r|s,a)\left[r + \gamma\, v_{k}(s')\right]\)

That is, we use the previous estimates \(v_{k}\) at iteration \(k\) to compute refined estimates \(v_{k+1}\). In the limit as \(k \rightarrow \infty\), we expect

\(v_{k+1}(s) = v_{k}(s) = v_\pi(s)\)

for some arbitrary policy \(\pi(a|s)\).
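
To make this concrete, below is a minimal sketch of one full-sweep update in Python, assuming a small tabular problem where the model \(\text{p}(s',r|s,a)\) is available as a lookup table. The names (`policy_evaluation_sweep`, `transitions`) are illustrative, not the code used in our experiments.

```python
import numpy as np

def policy_evaluation_sweep(V, policy, transitions, gamma=0.99):
    """One synchronous sweep: recompute every state's value from the previous estimates.

    V           : np.ndarray of shape (n_states,), the current estimates v_k
    policy      : np.ndarray of shape (n_states, n_actions), pi(a|s)
    transitions : dict mapping (s, a) -> list of (prob, next_state, reward) tuples
    """
    V_new = np.zeros_like(V, dtype=float)
    for s in range(len(V)):
        for a, pi_sa in enumerate(policy[s]):
            for prob, s_next, reward in transitions[(s, a)]:
                # Bellman expectation backup using the old table V (= v_k)
                V_new[s] += pi_sa * prob * (reward + gamma * V[s_next])
    return V_new  # the refined estimates v_{k+1}
```

Sweeps like this are repeated until the estimates stop changing (e.g., until the largest update falls below some tolerance).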

Asynchronous Dynamic Programming#

Now, the original formulation of dynamic programming requires that all states be updated in some sequence per iteration; that is, we apply the update above to every \(s \in S\) to go from \(v_k\) to \(v_{k+1}\). However, this becomes prohibitively expensive when there are many possible states. In such cases, we can apply a modified version called Asynchronous Dynamic Programming (ADP).

ADP allows the flexibility to update the state values in any order. In fact, certain states may be updated much more frequently than others. It also allows us to update states as the agent “experiences” the environment. This is the formulation that we use in this experiment.
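
As a rough sketch of the asynchronous variant (again with illustrative names, not our exact experiment code), we back up only the state the agent currently occupies and write the result into the table immediately:

```python
def async_backup(V, s, policy_row, transitions, gamma=0.99):
    """Back up a single state in place; states can be visited in any order.

    V           : mutable mapping (dict or array) of current value estimates
    s           : the state the agent is currently experiencing
    policy_row  : iterable of (action, pi(a|s)) pairs for state s
    transitions : dict mapping (s, a) -> list of (prob, next_state, reward) tuples
    """
    new_value = 0.0
    for a, pi_sa in policy_row:
        for prob, s_next, reward in transitions[(s, a)]:
            new_value += pi_sa * prob * (reward + gamma * V[s_next])
    V[s] = new_value  # later backups immediately see this refreshed value
```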

The Experiments#

We experiment on a simple gridworld problem. We create an \(N \times N\) grid where there are \(X\) cells that are the “goal states.” The objective is for an agent randomly placed on the grid to get to any of these goal states as quickly as possible.

Each transition from one state to another yields a reward of -1. If the agent bumps into the border/edges of the grid, it stays in place but still incurs the -1 reward; the same applies if it bumps into a wall, where walls exist. The reward in any of the goal states is 0. We use a discount rate of 0.99.
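
One way to read these rules as a step function is sketched below; the coordinate convention, action names, and the handling of the goal reward are assumptions for illustration, not the experiment code.

```python
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action, n, goals, walls=frozenset()):
    """Apply one move on an N x N grid.

    Bumping into the border or a wall keeps the agent in place but still
    costs -1; being in a goal state yields a reward of 0.
    """
    if state in goals:
        return state, 0.0
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    blocked = not (0 <= r < n and 0 <= c < n) or (r, c) in walls
    next_state = state if blocked else (r, c)
    return next_state, -1.0
```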

We experiment on three epsilon-greedy policies: Greedy (\(\epsilon = 0\)), Epsilon-Greedy (\(\epsilon = 0.1\)), and Random (\(\epsilon = 1\)). Greedy always moves to the cell with the largest estimated value \(v_k(s)\), breaking ties randomly. Epsilon-Greedy does the same, except that at each timestep there is an \(\epsilon\) chance it picks an action at random. Random simply chooses any action with equal probability.
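
A minimal sketch of this action-selection rule, where `neighbor_values` holds the estimated values \(v_k(s')\) of the cells each action would lead to (the helper name and the NumPy-based tie-breaking are illustrative):

```python
import numpy as np

def epsilon_greedy_action(neighbor_values, epsilon, rng=np.random.default_rng()):
    """Pick an action index given the estimated values of the resulting cells.

    neighbor_values : values v_k(s') of the cell each action leads to
    epsilon         : 0 -> Greedy, 0.1 -> Epsilon-Greedy, 1 -> Random
    """
    neighbor_values = np.asarray(neighbor_values, dtype=float)
    if rng.random() < epsilon:
        return rng.integers(len(neighbor_values))   # explore: any action
    best = np.flatnonzero(neighbor_values == neighbor_values.max())
    return rng.choice(best)                         # greedy, ties broken randomly
```

Setting \(\epsilon = 0\) recovers the Greedy policy and \(\epsilon = 1\) recovers the Random policy.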

We wanted to see how these three policies explore the environment to estimate the state values \(v_k(s)\) using dynamic programming.

In our first set of experiments, there are no walls, and when the agent reaches any of the goals, it resets to a random state excluding the goal states. However, its knowledge of the state values is kept, so it can continue to refine the estimates.

In our second set of experiments, we add walls following the four rooms environment (Sutton). The goal is fixed, but the agent always starts at a random location. In the first set of experiments, the boundary values are initialized to a very low -1e3, while in the second set, both walls and boundaries are initialized to -1e3.
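
As a sketch of what such an initialization might look like (the padded-grid representation and names here are assumptions for illustration, not the experiment code):

```python
import numpy as np

def init_values(n, walls=(), low_value=-1e3):
    """Value table for an N x N grid padded with boundary cells set to a very low value."""
    V = np.zeros((n + 2, n + 2))          # interior cells start at 0
    V[0, :] = V[-1, :] = low_value        # top and bottom boundaries
    V[:, 0] = V[:, -1] = low_value        # left and right boundaries
    for r, c in walls:                    # wall cells (e.g., four rooms layout), if any
        V[r + 1, c + 1] = low_value
    return V
```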

No Walls#

We visualize in the animation below the movement of the Greedy, Epsilon-Greedy, and Random agents, as well as the corresponding estimated state values \(v_k(s)\). We find that Random is the slowest to estimate the state values, which is interesting because we initially thought Random would cover more of the grid, since it is not encouraged to go to the goals.

The goal is indicated by the yellow cell. In terms of the value estimates, redder means more negative while bluer means less negative or more positive.

Note that each frame in the animation below corresponds to 20 timesteps in the simulation.


We ran our Greedy agent for 500,000 timesteps to get our “ground truth” \(v_\pi(s)\) estimate. Note that even if we use our random agent, it will still result in the same values, which we have validated. We visualize the optimal \(v(s)\) on the right chart below.

../_images/1ad69d745b4c0e9f1060591d182894afd667d6fbc3be17abaae7906e644bdf7f.png

We then compare the rates at which our agents reach the ground truth values by computing the mean squared error between the estimate and the ground truth over many timesteps.
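
The metric itself is simply the mean squared difference between the current value table and the ground-truth table, along these lines (names assumed):

```python
import numpy as np

def value_mse(V_estimate, V_ground_truth):
    """Mean squared error between the current value table and the ground-truth table."""
    diff = np.asarray(V_estimate, dtype=float) - np.asarray(V_ground_truth, dtype=float)
    return float(np.mean(diff ** 2))
```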

../_images/f3425656f702d53ca904e0ca67c644d09b9e277562a965c4163da69b29d7f88b.png

We find that while Greedy leads to lower errors initially, there comes a time when Random leads to better estimates of \(v_\pi(s)\) as indicated by lower MSE. We hypothesize that this is because Greedy and Epsilon-Greedy tend to get biased to move towards the goal, which reduces the chance of moving into certain areas, thus leading to lower opportunities to refine the estimates there. On the other hand, Random does not have this bias, and thus we expect it to continuously refine its estimates across the grid.

../_images/b0e39155c69f9a0a8569874168b7e00a8d9db48e8dd6dec71fc6cf14abd580bc.png

On the other hand, when we look at cumulative rewards over timesteps, we find that Greedy performs best and Random performs dismally. This makes sense since, for the Random agent, knowing the values of states does not affect its decisions.

For this visualization, we treated the problem as one long continuing task where cumulative rewards “spill over” to the next episode. In other words, resetting the agent to a random state is treated as merely a transition in our Markov Decision Process.
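
Under this continuing-task view, the plotted curve is just a running sum of the per-timestep rewards across episode boundaries, along the lines of:

```python
import numpy as np

# rewards holds one entry per timestep; episode resets are ordinary transitions,
# so the running sum "spills over" across episodes
rewards = np.array([-1, -1, -1, 0, -1, -1, 0])   # toy example
cumulative_rewards = np.cumsum(rewards)
```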

With Walls#

We ran similar experiments for the “four rooms” case. Here, there is a high chance that the Greedy policy gets stuck in a state, or in a loop of switching back and forth between two states along a wall. The wall becomes a kind of trap that prevents the Greedy algorithm from progressing. Imagine the agent is in a cell where the cells below it have lower estimated values than its current cell, but there is a wall directly above it. In this situation, the Greedy action is to stay in place!

This is not a problem for the Epsilon-Greedy or Random policies, which are not constrained to always move toward the state with the highest estimated value.


To estimate the optimal \(v(s)\) for this case, we used the Random policy. This is because, in our experiments above, we find that the Greedy policy sometimes gets stuck in a loop of iterating between just two states. This isn't a problem for the Random policy.

Same as above, we visualize the estimated optimal \(v(s)\) below, where for visualization purposes we use a value of 0 for the walls, even though their values are set to -1e3.

../_images/c86904d0f47614dd28857ea50976c97d77ae30d49f939c77fccb5508ae95147c.png

In the charts below, we find that the Random policy is again able to approximate \(v(s)\) best across all states on average.

../_images/88ce1e5afcddd6be7b0fb0ffacf3f2e9b1d117cc13c5002637866e99146e25ae.png

However, in terms of returns, we find that the cumulative returns are noisier than in the case without walls, which is why we smoothed the chart further using a moving average with a window of over 9,000.
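
The smoothing is a plain moving average; a minimal version (window size shown for illustration) could look like:

```python
import numpy as np

def moving_average(x, window=9000):
    """Smooth a 1-D series with a simple moving average over the given window."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")
```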

The Greedy policy got stuck in a trap, which caused its cumulative rewards to decrease linearly. Meanwhile, Epsilon-Greedy (\(\epsilon=0.5\)) got the highest rewards in this case.

../_images/257c6d882105182e6bcaa38579508f17b1c72ae7dc317b1b339293a51ecb9259.png