# Learning RL

# Table of Contents

- Sutton and Barto
- How I Am Reading This Book
- What I Think About The Book
- Exercises
- 1 - Introduction
- 2 - Multi-armed Bandits
- 3 - Finite Markov Decision Processes
- Exercise 3.1
- Exercise 3.2
- Exercise 3.3
- Exercise 3.4
- Exercise 3.5
- Exercise 3.7
- Exercise 3.8
- Exercise 3.9
- Exercise 3.10
- Exercise 3.11
- Exercise 3.12
- Exercise 3.13
- Exercise 3.14
- Exercise 3.15
- Exercise 3.16
- Exercise 3.17
- Exercise 3.18
- Exercise 3.19
- Exercise 3.20
- Exercise 3.21
- Exercise 3.22
- Exercise 3.23
- Exercise 3.24
- Exercise 3.25
- Exercise 3.26
- Exercise 3.27
- Exercise 3.28
- Exercise 3.29

- 4 - Dynamic Programming
- 5 - Monte Carlo Methods
- 6 - Temporal-Difference Learning
- 7 - n-step Bootstrapping
- 8 - Planning and Learning with Tabular Methods
- 9 - On-policy Prediction with Approximation
- 10 - On-policy Control with Approximation
- 11 - Off-policy Methods with Approximation
- 12 - Eligibility Traces
- 13 - Policy Gradient Methods
- 14-16: No Exercises
- 17 - Frontiers

# Sutton and Barto

I am anticipating needing to deeply understand Reinforcement Learning to properly do my job. Everybody says that this book is the way to do that, so I'm working through it and doing the exercises. I'll be putting the exercises in this document along with some random thoughts I may have. I have done this on here before, with the textbook Probabilistic Models of Cognition, so it will largely be the same as that. I am also roughly following the DeepMind x UCL Reinforcement Learning Course (2018) which follows this book. I learn much better from reading and practicing compared to from watching and listening, but it doesn't hurt to attack the problem from all fronts.

## How I Am Reading This Book

My primary objective in reading this book is to understand deep reinforcement learning well enough to make meaningful contributions to research, especially where that intersects with both games and broader machine learning (since I currently work at Riot, and think this is the most interesting application of RL^{1}).

In general, there are two types of learning: let's call them "waow learning" and "ugh learning". In waow learning, you learn a new concept, some good words to say about that concept, and a corpus of other people's thoughts on a topic. In ugh learning, you are running on the treadmill and trying not to die. Since my objective is research, it's of higher relative importance that I do a lot of ugh learning on this topic. I'm not anticipating RL becoming my primary specialty within machine learning, but I need to *get it* more than I would to read through an RL paper and nod sagely when I see familiar terms. I do my best work when I get really *opinionated* about a topic, and I can't reliably do that without chaining myself to the treadmill.

So, I'll be doing all of the exercises in this textbook, barring the ones which are starred or are inside of starred sections. The exception here is chapter 11, which I will be doing despite the optional designation since I'm anticipating it being relevant to what I'm aiming to actually do with it. The majority of these are near the beginning of the book (chapter 3 has almost 30 exercises, most chapters after have about 10, part III has one exercise total). I made a good faith effort to actually get through the whole book, including these less exercise-heavy chapters / ones not covered in the DeepMind course. Likewise, at the end I'll be making cloze deletion flashcards of the convenient summary sections at the end of every chapter.

**Please Note: I give no guarantee of correctness for my solutions to these exercises.** This is not an answer key, it's my attempt at fighting through a textbook to learn a new topic^{2}. Likewise, these answers aren't of the quality I would submit as a homework assignment, they're just having passed the threshold where I believe it is clear to myself that I understand the solution^{3}. I'll be checking my answers to the exercises with LLMs as I work through the book, which should be noted in the little *status* line at the start of each subsection.

In addition, this book is actually pretty surface level on deep RL specifically – this book just represents the background necessary for understanding the RL part. I'm hopeful that I already understand the deep part, but it's helpful for me to have a plan of attack for after I complete the book. In particular, I'm planning on reading through the code for important algorithms in CleanRL, along some important relevant papers. This will probably be it's own post, but I'll link it here once I get to that point.

**Update 8-27:** I have completed this book, having read through all the chapters and completed all the exercises aside from ones marked as optional. All in all, it took me just over 2 weeks to get through the whole thing, doing something like roughly 4-7 hours per day of effort. I probably would have learned the topics in this book more deeply had I given myself the recommended two semesters worth of time, but since I had a particular *destination* in mind I was willing to adhere to this accelerated pace provided I could still complete the exercises^{4}. In all likelihood, I probably could have started immediately from deep RL papers given my background, but I think I would have pidgeonholed myself into just being able to get stuff working and not understanding why they worked.

## What I Think About The Book

So far I have a few thoughts about this book. In general I consider myself relatively not math-inclined^{5}; my strengths mostly lie in my ability to throw borderline obsessive amounts of effort at projects. But I do have a fair bit of good experience in machine learning, programming, artificial intelligence, all that sort of stuff. The jury is probably out on whether I was better-suited for this book compared to a plucky engineering undergraduate, but I figured it should all be within my grasp.

I found myself able to navigate the exercises in this book with great difficulty. They were very challenging to me, and I expected my background to be of much greater help than it ended up being. I was forced to acknowledge that maybe, just a little bit, I was letting my eyes glaze over a little bit when I read through dense notation in papers. The notation in RL is very intense^{6}, and this book does not pull punches. It was all fairly doable, but I was grateful that I worked through it after we invented LLMs / after I landed somewhere with bright colleagues who already know this topic.

I think the structure of the book is very strong. Almost all the ideas that are introduced feel like very natural extensions of previously introduced topics, and you arrive at a nice unifying framework for why the important techniques make the decisions that they make. This is pretty nice compared to just reading papers about it.

Overall I think that it makes sense to consider this the canonical book on reinforcement learning. I think it could be more gentle, but it would probably be twice the length. I think it could have more neural networks / deep RL in it, but then the purpose of the book would probably get too muddled, and it runs the risk of more quickly growing dated. This book is probably your best bet if you have access to some sort of teacher, and definitely seems like more of a foundation-building text about the field in general.

Overall good.

## Exercises

### 1 - Introduction

*status: checked with o1-preview*

#### Exercise 1.1: Self-Play

Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?

I already know self-play already works for achieving high skill in RL tasks (see: alphago) but I do think it's possible for larger environments that policies learned through self-play to tunnel into behaviors which do not fully explore a very complex environment. I remember seeing some papers exploiting this by learning adversarial policies that win by creating out-of-distribution states. But for something simple like tic-tac-toe I doubt it would cause huge problems, in the limit both will explore the space completely.

#### Exercise 1.2: Symmetries

Many tic-tac-toe positions appear different but are really the same because of symmetries. How might we amend the learning process described above to take advantage of this? In what ways would this change improve the learning process? Now think again. Suppose the opponent did not take advantage of symmetries. In that case, should we? Is it true, then, that symmetrically equivalent positions should necessarily have the same value?

We could take advantage of this via data augmentation in training, applying all the different rotations and mirrors to the game board and having 8 training examples instead of 1 per episode. This would make us equally strong from all angles, since we will have an equal balance of the different versions of the same positions. Taking advantage of the symmetries shouldn't have any different effects whether or not our opponent uses them or not – they're literally identical states, and should have identical value functions as a result. One thing which may break this is if we learn that our opponent is worse at one particular rotation of the game board, and seeking to exploit that. If we do augmentation this way, we won't be able to learn that sort of thing, since we will be artificially improving their performance in those spots, but in exchange we get more training examples and better balance subject to symmetry.

#### Exercise 1.3: Greedy Play

Suppose the reinforcement learning player was greedy, that is, it always played the move that brought it to the position that it rated the best. Might it learn to play better, or worse, than a nongreedy player? What problems might occur?

If the player was greedy and never explored, that could lead to lots of problems where states get assigned incorrect value estimations. For example, if you have an action which causes +1 reward 90% of the time and -1 reward 10% of the time, and you happen to get -1 reward the first time you take that option, your estimation for that action is likely to be irrecoverably wrong. The purpose of non-greedy decisions is to make sure that your "rating the best" is actually accurate, since over time you'll get a better picture of more types of states that way.

#### Exercise 1.4: Learning from Exploration

Suppose learning updates occurred after all moves, including exploratory moves. If the step-size parameter is appropriately reduced over time (but not the tendency to explore), then the state values would converge to a different set of probabilities. What (conceptually) are the two sets of probabilities computed when we do, and when we do not, learn from exploratory moves? Assuming that we do continue to make exploratory moves, which set of probabilities might be better to learn? Which would result in more wins?

Consider a case where you have a state which is next to the goal state, and also next to a state which kills you. In the greedy case, you would love to be in this state, because then you can go to the goal state, and you'll get very high reward. But in the exploration-enabled case, this state is kind of risky because there's a change you'll decide to explore in this state, land in the death state, and recieve a big negative reward. If you learn after exploratory moves, you're learning the value of the state including that probability of randomly dying, whereas otherwise you aren't. If you want to keep doing this exploration, you might be better off using these probabilities, since avoiding death is pretty important, but if you plan on turning those moves off then there's no need to avoid this hypothetical state.

#### Exercise 1.5: Other Improvements

Can you think of other ways to improve the reinforcement learning player? Can you think of any better way to solve the tic-tac-toe problem as posed?

Arming the player with search seems like a good way to get better estimates of the current state.

### 2 - Multi-armed Bandits

*status: checked with o1-preview, need to fix 2.7 because code is not the correct deliverable*

#### Exercise 2.1

In ε -greedy action selection, for the case of two actions and ε = 0.5, what is the probability that the greedy action is selected?

.75 (from greedy 0.5, from random 0.5 / 2)

#### Exercise 2.2: Bandit example

Consider a k-armed bandit problem with k = 4 actions, denoted 1, 2, 3, and 4. Consider applying to this problem a bandit algorithm using "-greedy action selection, sample-average action-value estimates, and initial estimates of Q1(a) = 0, for all a. Suppose the initial sequence of actions and rewards is A1 = 1, R1 = 1, A2 = 2, R2 = 1, A3 = 2, R3 = 2, A4 = 2, R4 = 2, A5 = 3, R5 = 0. On some of these time steps the " case may have occurred, causing an action to be selected at random. On which time steps did this definitely occur? On which time steps could this possibly have occurred? ⇤

Step | Action | Reward | Greedy ? |
---|---|---|---|

1 | 1 | 1 | Maybe (all 0) |

2 | 2 | 1 | No (A1 has 1) |

3 | 2 | 2 | Maybe (A1 and A2 both with 1) |

4 | 2 | 2 | Maybe (A2 clearly best now but can still be selected by exploration) |

5 | 3 | 0 | No (No info on 3 at all) |

#### Exercise 2.3

In the comparison shown in Figure 2.2, which method will perform best in the long run in terms of cumulative reward and probability of selecting the best action? How much better will it be? Express your answer quantitatively.

In the long run. \eps = 0.01 will perform best. \eps = 0.1 learns the optimal action the fastest, but is bottlenecked by the fact that it must select a random action 10% of the time, meaning it gets optimal reward 91% of the time. In comparison, once \eps = 0.01 learns the optimal action, it will pick that option 99.5% of the time. You can observe this in the slopes of the figure, where 0.01 is initially lower but continues to grow.

#### Exercise 2.4

If the step-size parameters, α_{n}, are not constant, then the estimate Qn is
a weighted average of previously received rewards with a weighting different from that
given by (2.6). What is the weighting on each prior reward for the general case, analogous
to (2.6), in terms of the sequence of step-size parameters? ⇤

If you expand out the terms you'll get:

Q_{n+1} = a_{1} R_{n} + (1 - a_{1})(a_{2} R_{n-1} + (1 - a_{2})(a_{3} R_{n-2} + (1 - a_{3})(Q_{n-3})))

etc

if we try to break it apart we get

Q_{n+1} = a_{1} R_{n} + (1 - a_{1})(a_{2} R_{n-1}) + (1 - a_{1})(1 - a_{2})(a_{3} R_{n-2}) + (1-a_{1})(1-a_{2})(1 - a_{3})(Q_{n-3})))

suggesting that in the general case we arrive at a form that looks like this:

W_{n+1} = \Prod{i=1}{n} (1 - a_{i}) (a_{n} R_{n})

I think I likely have the notation wrong here but visually it makes sense.

*o1-preview: \(Q_{n+1} = \sum_{k=1}^{n}(\alpha_k \prod_{i=k+1}^{n}(1-\alpha_i))R_k\)*

#### Exercise 2.5 (programming)

Design and conduct an experiment to demonstrate the
diculties that sample-average methods have for nonstationary problems. Use a modified
version of the 10-armed testbed in which all the q_{*}(a) start out equal and then take
independent random walks (say by adding a normally distributed increment with mean 0
and standard deviation 0.01 to all the q⇤(a) on each step). Prepare plots like Figure 2.2
for an action-value method using sample averages, incrementally computed, and another
action-value method using a constant step-size parameter, α = 0.1. Use ε = 0.1 and
longer runs, say of 10,000 steps

import numpy as np import matplotlib.pyplot as plt # define k armed bandit k = 10 q_stars = [5 for _ in range(k)] def run_experiment(epsilon, method='constant'): num_acts = [0 for _ in q_stars] q_vals = [0 for _ in q_stars] if method == 'constant': alpha = 0.1 steps = 10000 avg_rewards = [] pct_optimals = [] avg_reward = 0 optimal_actions = 0 for step in range(steps): if method != 'constant': alpha = 1 / (step + 1) # random walks for i, bandit in enumerate(q_stars): q_stars[i] += np.random.normal(0, 0.01) # epsilon-greedy if np.random.random() < epsilon: act = np.random.randint(0, k) else: act = np.argmax(q_vals) num_acts[act] += 1 q_vals[act] += alpha * (q_stars[act] - q_vals[act]) avg_reward += (1 / (step + 1)) * q_stars[act] avg_rewards.append(avg_reward) if act == np.argmax(q_stars): optimal_actions += 1 pct_optimal = optimal_actions / (step + 1) pct_optimals.append(pct_optimal) return avg_rewards, pct_optimals const_rewards, const_optimals = run_experiment(0.1, method='constant') avg_rewards, avg_optimals = run_experiment(0.1, method='average') plt.title("Average rewards") plt.plot(const_rewards, label='constant alpha') plt.plot(avg_rewards, label='averaging') plt.legend() plt.show() plt.title("Optimal actions") plt.plot(const_optimals, label='constant alpha') plt.plot(avg_optimals, label='averaging') plt.legend() plt.show()

#### Exercise 2.6: Mysterious Spikes

The results shown in Figure 2.3 should be quite reliable because they are averages over 2000 individual, randomly chosen 10-armed bandit tasks. Why, then, are there oscillations and spikes in the early part of the curve for the optimistic method? In other words, what might make this method perform particularly better or worse, on average, on particular early steps? ⇤

If the rewards are optimistic, it's very likely that you will pull all the levers once after only a few turns, since you'll be disappointed each time. You should then get a good picture of the best one very quickly, which means you should pick the best option very often very early on. However, you run into a problem – greedily picking that option will make your estimate of that state worse, so by picking it you temporarily make it less likely to be selected again. This will continue until the estimates are accurate enough for selecting the best option to not make the estimate worse than the estimates for the other options.

#### Exercise 2.7: Unbiased Constant-Step-Size Trick

In most of this chapter we have used sample averages to estimate action values because sample averages do not produce the initial bias that constant step sizes do (see the analysis leading to (2.6)). However, sample averages are not a completely satisfactory solution because they may perform poorly on nonstationary problems. Is it possible to avoid the bias of constant step sizes while retaining their advantages on nonstationary problems? One way is to use a step size of

\(\beta_n \doteq \alpha / \bar{o}_n\)

to process the nth reward for a particular action, where α > 0 is a conventional constant step size, and ¯on is a trace of one that starts at 0:

\(\bar{o}_n \doteq \bar{o}_{n-1} + \alpha (1 - \bar{o}_{n-1}) \text{ for } n > 0, \text{ with } \bar{o}_0 \doteq 0\).

Carry out an analysis like that in (2.6) to show that Qn is an exponential recency-weighted average without initial bias.

import numpy as np import matplotlib.pyplot as plt # define k armed bandit k = 10 q_stars = [np.random.normal(0, 1) for _ in range(k)] def run_experiment(epsilon, method='constant'): num_acts = [0 for _ in q_stars] q_vals = [5 for _ in q_stars] #optimistic reward if method == 'constant': alpha = 0.1 o_bar = 0 steps = 10000 avg_rewards = [] pct_optimals = [] avg_reward = 0 optimal_actions = 0 for step in range(steps): if method == 'constant': o_bar += alpha * (1 - o_bar) beta = alpha / o_bar else: beta = 1 / (step + 1) # epsilon-greedy if np.random.random() < epsilon: act = np.random.randint(0, k) else: act = np.argmax(q_vals) num_acts[act] += 1 q_vals[act] += beta * (q_stars[act] - q_vals[act]) avg_reward += (1 / (step + 1)) * q_stars[act] avg_rewards.append(avg_reward) if act == np.argmax(q_stars): optimal_actions += 1 pct_optimal = optimal_actions / (step + 1) pct_optimals.append(pct_optimal) return avg_rewards, pct_optimals const_rewards, const_optimals = run_experiment(0.1, method='constant') avg_rewards, avg_optimals = run_experiment(0.1, method='average') plt.title("Average rewards") plt.plot(const_rewards, label='constant alpha') plt.plot(avg_rewards, label='averaging') plt.legend() plt.show() plt.title("Optimal actions") plt.plot(const_optimals, label='constant alpha') plt.plot(avg_optimals, label='averaging') plt.legend() plt.show()

*TODO: I think this question requires me to show that the weights sum to 1, not to implement it*

#### Exercise 2.8: UCB Spikes

In Figure 2.4 the UCB algorithm shows a distinct spike in performance on the 11th step. Why is this? Note that for your answer to be fully satisfactory it must explain both why the reward increases on the 11th step and why it decreases on the subsequent steps. Hint: If c = 1, then the spike is less prominent. ⇤

If you have 10 bandits after only a few trials, the UCB term will likely dominate for untested bandits, so it will test all the bandits once each in the first ten trials. On the 11th trial, all of the UCB terms will be equal, so it's very likely to pull the bandit which returned the highest value, which is most often the optimal one. However, once you do that, you reduce the UCB term for that bandit, which means that you'll start wanting to pull the other bandits again. This will repeat until the UCB term goes to ~0 after many trials. When c=1, this term is less dominating, so it becomes more possible to select two bandits twice in the first 10 trials, which would diffuse this spike to adjacent timesteps.

#### Exercise 2.9

Show that in the case of two actions, the soft-max distribution is the same as that given by the logistic, or sigmoid, function often used in statistics and artificial neural networks.

with two actions we have

e^{zi} / ∑{j=1}{K} e^{zj}

e^{zi} / (e^{z1} + e^{z2})

p(1) + p(2) = 1

p(1) = e^{z1} / (e^{z1} + e^{z2})

dividing numerator and denomenator by e^{z}_{2} is equivalent to subtraction

p(1) = e^{z1 - z2} / (e^{z1 - z2} + e^{z2 - z2})

p(1) = e^{z1 - z2} / (1 + e^{z1 - z2})

if x = z_{1} - z_{2} we now have

e^{x} / (1 + e^{x})

which is the sigmoid

#### Exercise 2.10

Suppose you face a 2-armed bandit task whose true action values change randomly from time step to time step. Specifically, suppose that, for any time step, the true values of actions 1 and 2 are respectively 10 and 20 with probability 0.5 (case A), and 90 and 80 with probability 0.5 (case B). If you are not able to tell which case you face at any step, what is the best expected reward you can achieve and how should you behave to achieve it? Now suppose that on each step you are told whether you are facing case A or case B (although you still don’t know the true action values). This is an associative search task. What is the best expected reward you can achieve in this task, and how should you behave to achieve it?

If you don't know the state, you do the same on both cases. picking action A will give you (10 + 90)/2 = 50 and action B will give you (20 + 80)/2 = 50 on average, so you can't do better than random. If you know what state you're in, you will want to select 2 in case A and 1 in case B, which will give you (20 + 90) / 2 = 55 average reward. Once you know the state, you collapse to the normal learning problem in a k-armed bandit, so any of those methods would work once you know the underlying state.

#### Exercise 2.11 (programming)

Make a figure analogous to Figure 2.6 for the nonstationary case outlined in Exercise 2.5. Include the constant-step-size ε-greedy algorithm with α = 0.1. Use runs of 200,000 steps and, as a performance measure for each algorithm and parameter setting, use the average reward over the last 100,000 steps.

import numpy as np import matplotlib.pyplot as plt # define k armed bandit k = 10 q_stars = [np.random.normal(0, 1) for _ in range(k)] def run_experiment(epsilon, method='constant'): num_acts = [0 for _ in q_stars] if method == 'optimistic': q_vals = [5 for _ in q_stars] else: q_vals = [0 for _ in q_stars] #do they want the unbiased one? if method == 'constant' or method == 'optimistic': alpha = 0.1 o_bar = 0 steps = 200000 avg_rewards = [] pct_optimals = [] avg_reward = 0 optimal_actions = 0 for step in range(steps): if method == 'constant' or method == 'optimistic': o_bar += alpha * (1 - o_bar) beta = alpha / o_bar else: beta = 1 / (step + 1) # epsilon-greedy if method != 'ucb' and np.random.random() < epsilon: act = np.random.randint(0, k) elif method == 'ucb': ucbs = [q_vals[i] + np.sqrt(epsilon * np.log(step+1) / \ num_acts[i]) for i in range(k)] act = np.argmax(ucbs) else: act = np.argmax(q_vals) num_acts[act] += 1 q_vals[act] += beta * (q_stars[act] - q_vals[act]) avg_reward += (1 / (step + 1)) * q_stars[act] avg_rewards.append(avg_reward) if act == np.argmax(q_stars): optimal_actions += 1 pct_optimal = optimal_actions / (step + 1) pct_optimals.append(pct_optimal) return np.mean(avg_rewards[:100000]) vals = [1/128, 1/64, 1/32, 1/16, 1/8, 1/4, 1/2, 1, 2, 4] const_rewards = [run_experiment(x, method='constant') for x in vals] optimistic_rewards = [run_experiment(x, method='optimistic') for x in vals] ucb_rewards = [run_experiment(x, method='ucb') for x in vals] plt.title("Parameter Study") plt.plot(vals, const_rewards, label='eps-greedy') plt.plot(vals, optimistic_rewards, label='optimistic eps-greedy') plt.plot(vals, ucb_rewards, label='UCB') plt.xlabel("epsilon") plt.ylabel("Average reward over last 100k steps") plt.legend() plt.show()

### 3 - Finite Markov Decision Processes

*status: kinda rocky, but checked with o1-preview*

#### Exercise 3.1

Devise three example tasks of your own that fit into the MDP framework, identifying for each its states, actions, and rewards. Make the three examples as different from each other as possible. The framework is abstract and flexible and can be applied in many different ways. Stretch its limits in some way in at least one of your examples. ⇤

- Chess can be framed as an MDP, where each state is a board position, each action is the legal moves you can perform in that position, and each reward is the relative value of the position (or just 1 for goal state and -1 for loss state)
- Flirting with someone can be framed as an MDP, where each state is the current point in a conversation, each action is what you can say at that point, and the reward is how much you observe they're into what you're saying (can be negative, for example if you start talking about how flirting is a Markov Decision Process)
- Doing the exercises in Sutton and Barto can be framed as an MDP. Each state is your current location in the textbook, each action is your letter by letter solving of the problem (e.g. you write answers one letter at a time), and each reward is the feedback from a teacher or LLM about how well you solved an exercise.

#### Exercise 3.2

Is the MDP framework adequate to usefully represent all goal-directed learning tasks? Can you think of any clear exceptions? ⇤

Maybe not usefully; a big component of this is that MDPs have the markov property (where the past sequence of events is priced into the current state, and two identical "states" which would have different local behaviors based on the path required to reach them would get represented as different states). It's possible there are MDPs it's hard to represent the state as being independent of / inclusive of the entire history prior (i.e. it is possible, but the state space is so large that the dynamics can't be learned well). Starcraft might be one of these? They struggled to reach superhuman play under human constraints and had to rely on imitation learning due to the overly large state space, due to the "exploration problem".

#### Exercise 3.3

Consider the problem of driving. You could define the actions in terms of the accelerator, steering wheel, and brake, that is, where your body meets the machine. Or you could define them farther out—say, where the rubber meets the road, considering your actions to be tire torques. Or you could define them farther in—say, where your brain meets your body, the actions being muscle twitches to control your limbs. Or you could go to a really high level and say that your actions are your choices of where to drive. What is the right level, the right place to draw the line between agent and environment? On what basis is one location of the line to be preferred over another? Is there any fundamental reason for preferring one location over another, or is it a free choice? ⇤

I imagine your framing matters a lot here. If you want to build a system which outperforms humans at driving, you'll likely be interested in defining it at the machine level (unless you were building a humanoid robot which drives) because in that case you're able to directly actuate the pedals and stuff. If you're building a gps navigation service which arrives at a location while avoiding the most traffic, you don't actually care about the machine at all. If you're drunk at a bar, you hopefully would carefully consider that your body's condition introduces an additional level of uncertainty to your observations and actions, even though your car in the parking lot didn't change at all. It's not so much that it's a free choice, rather that it depends on the type of problem you are attempting to solve with your agent.

#### Exercise 3.4

Give a table analogous to that in Example 3.3, but for p(s', r|s, a). It should have columns for s, a, s', r, and p(s', r|s, a), and a row for every 4-tuple for which p(s', r|s, a) > 0.

s | a | s' | r | p(s' / s, a) | p(s', r / s, a) |
---|---|---|---|---|---|

high | search | high | r_{search} |
α | α * p(r = R / s, a, s') |

high | search | low | r_{search} |
1 - α | (1 - α) * p(r = R / s, a, s') |

low | search | high | -3 | 1 - β | (1 - β) * p(r = R / s, a, s') |

low | search | low | r_{search} |
β | β * p(r = R / s, a, s') |

high | wait | high | r_{wait} |
1 | 1 * p(r = R / s, a, s') |

low | wait | low | r_{wait} |
1 | 1 * p(r = R / s, a, s') |

low | recharge | high | 0 | 1 | 1 * p(r = R / s, a, s') |

I am a bit confused by this because it doesn't look like there's anything about the probability of a specific reward, but I guess in concept it should be this right?

#### Exercise 3.5

The equations in Section 3.1 are for the continuing case and need to be modified (very slightly) to apply to episodic tasks. Show that you know the modifications needed by giving the modified version of (3.3).

continuing case:

\(\sum_{s' \in S} \sum_{r \in R} p(s', r | s, a) = 1 \text{ for all } s \in S, a \in A(s)\)

episodic case:

\(\sum_{s' \in S \cup T} \sum_{r \in R} p(s', r | s, a) = 1 \text{ for all } s \in S, a \in A(s) \text{ where T is the set of terminal states }\)

Exercise 3.6 Suppose you treated pole-balancing as an episodic task but also used discounting, with all rewards zero except for -1 upon failure. What then would the return be at each time? How does this return differ from that in the discounted, continuing formulation of this task? ⇤

\(G_t = \sum_{k=0}^{T} \gamma^k R_{t+k+1}\)

Since R is always 0 except at the terminal state, we can just write this simply as

\(G_t = -\gamma^T\)

This differs from the discounted, continuing formulation of this task because the reward in the continuous case the model will get negative reward every time it's not balancing, but if it falls it can right itself again to resume having no penalty. In the episodic case, it will just reset so that you start again, and you're directly maximizing the time to first failure rather than the minimum number of failures as late as possible.

#### Exercise 3.7

Imagine that you are designing a robot to run a maze. You decide to give it a reward of +1 for escaping from the maze and a reward of zero at all other times. The task seems to break down naturally into episodes—the successive runs through the maze—so you decide to treat it as an episodic task, where the goal is to maximize expected total reward (3.7). After running the learning agent for a while, you find that it is showing no improvement in escaping from the maze. What is going wrong? Have you effectively communicated to the agent what you want it to achieve? ⇤

If you do this, the agent will try to get out of the maze eventually, with no rush at all for how long that takes. As a result, with a long enough time horizon, taking enough random actions will eventually reach the terminal state, and all trials will have the same reward (+1). You aren't making it learn the maze, you're just asking it to exist until the terminal state is reached, and then rewarding it. What you would prefer is punishing -1 for every time step, so that the agent is rewarded for getting out faster, which will incentivize it to actually learn to escape the maze.

#### Exercise 3.8

Suppose γ = 0.5 and the following sequence of rewards is received R1 = 1, R2 = 2, R3 = 6, R4 = 3, and R5 = 2, with T = 5. What are G0, G1, …, G5? Hint: Work backwards. ⇤

G_{0} = r_{1} + γ G_{1}
G_{1} = r_{2} + γ G_{2}
G_{2} = r_{3} + γ G_{3}
G_{3} = r_{4} + γ G_{4}
G_{4} = r_{5} + γ G_{5}
G_{5} = 0

G_{4} = 2 + 0 = 2
G_{3} = 3 + 0.5 * 2 = 4
G_{2} = 6 + 0.5 * 4 = 8
G_{1} = 2 + 0.5 * 8 = 6
G_{0} = 1 + 0.5 * 6 = 4

#### Exercise 3.9

Suppose γ = 0.9 and the reward sequence is R1 = 2 followed by an infinite sequence of 7s. What are G1 and G0? ⇤

\(G_1 = 7 + \gamma G_2\)

\(G_2 = 7 \sum_{k=0}^{\infty} \gamma^k = \frac{7}{1 - \gamma} = 70\)

\(G_1 = 7 + 0.9*70 = 70\)

\(G_0 = 2 + 0.9*70 = 65\)

#### Exercise 3.10

Prove the second equality in (3.10). ⇤

\(G_0 = \sum_{k=0}^{\infty} \gamma^k\) is the geometric series.

\(G_0 = \gamma^0 + \gamma^1 + \gamma^2 + \gamma^3 + ... + \gamma^\infty\)

\(G_0 = 1 + \gamma (1 + \gamma + \gamma^2 + ... + \gamma^\infty)\)

\(G_0 = 1 + \gamma G_0\)

\(G_0 = 1 + \gamma G_0\)

\(G_0 - \gamma G_0 = 1\)

\(G_0 (1 - \gamma) = 1\)

\(G_0 = 1 / (1 - \gamma)\)

#### Exercise 3.11

If the current state is St, and actions are selected according to a stochastic policy π, then what is the expectation of Rt+1 in terms of π and the four-argument function p (3.2)? ⇤

Framing this as an expectation means we need to sum across all possible actions

\(\sum_{a} \pi(a | S_t) \sum_{s', r} r * p(r| s', a)\)

#### Exercise 3.12

Give an equation for v⇡ in terms of q⇡ and π. ⇤

\(v_\pi(s) \doteq E_\pi[G_t | S_t = s]\)

\(q_\pi(s, a) \doteq E_\pi[G_t | S_t = s, A_t = a]\)

—

To write in terms of q we just need to marginalize over all actions

\(v_\pi(s) \doteq \sum_{a} \pi(a|s) E_\pi[G_t | S_t = s, A_t = a]\)

that last term is the same as q

\(v_\pi(s) \doteq \sum_{a} \pi(a|s) q_\pi(s, a)\)

#### Exercise 3.13

Give an equation for q⇡ in terms of v⇡ and the four-argument p. ⇤

\(q_\pi(s, a) \doteq E_\pi[G_t | S_t = s, A_t = a]\)

Expanding out G_{t}

\(q_\pi(s, a) \doteq E_\pi[R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a]\)

Now we can condition on the next state to get v

\(q_\pi(s, a) \doteq E_\pi[R_{t+1} + \gamma E[G_{t+1} | S_{t+1}] | S_t = s, A_t = a]\)

\(q_\pi(s, a) \doteq E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s, A_t = a]\)

and now since we have something with the shape (s', r | s, a) we can undo the expectation using the 4 argument p

\(q_\pi(s, a) \doteq \sum_{s'} \sum_{r} p(s', r | s, a) * [r + \gamma v_\pi(s')]\)

#### Exercise 3.14

The Bellman equation (3.14) must hold for each state for the value function v⇡ shown in Figure 3.2 (right) of Example 3.5. Show numerically that this equation holds for the center state, valued at +0.7, with respect to its four neighboring states, valued at +2.3, +0.4, 0.4, and +0.7. (These numbers are accurate only to one decimal place.) ⇤

The four actions are equally likely, discount factor is 0.9

the discounted other rewards are 2.07, 0.36, 0.36, 0.63

\(0.7 = \sum_{a} 1/4 \sum_{s, r} 1[r + \text{discounted reward}]\)

\(0.7 = \frac{1}{4} (0 + 2.07) + \frac{1}{4} (0 + 0.36) + \frac{1}{4} (0 + 0.36) + \frac{1}{4} (0 + 0.63)\)

\(0.7 = 0.5175 + .009 + .009 + .1575\)

0.7 = 0.693 (accurate enough to the tenth)

#### Exercise 3.15

In the gridworld example, rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or only the intervals between them? Prove, using (3.8), that adding a constant c to all the rewards adds a constant, vc, to the values of all states, and thus does not affect the relative values of any states under any policies. What is vc in terms of c and ? ⇤

Only the differences are important if we're trying to maximize it, the signs are mostly useful to semantically describe which are rewards and which are punishments. The advantage of a good state over a bad one exists independent of sign.

\(G_t \doteq \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\)

\(G_t \doteq \sum_{k=0}^{\infty} \gamma^k (R_{t+k+1} + c)\)

\(G_t \doteq \sum_{k=0}^{\infty} [\gamma^k R_{t+k+1} + \gamma^k c]\)

\(G_t \doteq \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} + \sum_{k=0}^{\infty} \gamma^k c\)

Since it's a constant term (i.e. a sum of constants) We can define \(v_c = \sum_{k=0}^{\infty} \gamma^k c\) so \(G_t \doteq \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} + v_c\)

Ergo, the relative value of the states will not change, because no matter what you will be adding \(v_c\) to the state, which does not change from state to state.

#### Exercise 3.16

Now consider adding a constant c to all the rewards in an episodic task, such as maze running. Would this have any e↵ect, or would it leave the task unchanged as in the continuing task above? Why or why not? Give an example. ⇤

In an episodic task, it does cause problems to add a constant to all values. Consider maze running. If you have a negative reward for each non-solved turn, and then a big positive reward at the end, your total reward is maximized by getting out of the maze as fast as possible. If you have a small positive reward for each non-solved turn, and then an even bigger reward at the end, your total reward is now maximized by existing in the maze for all eternity, since eventually you will accumulate more reward by deliberately not finding the exit and bounding your reward.

#### Exercise 3.17

What is the Bellman equation for action values, that
is, for q_{π}? It must give the action value q_{π}(s, a) in terms of the action
values, q_{\i}(s', a'), of possible successors to the state–action pair (s, a).
Hint: The backup diagram to the right corresponds to this equation.
Show the sequence of equations analogous to (3.14), but for action
values.

Well let's start from bellman equation for values

\(v_\pi(s) \doteq \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_\pi(s')]\)

We've already shown we can write v in terms of q

\(v_\pi(s) \doteq \sum_{a} \pi(a|s) q_\pi(s, a)\)

so it seems to emerge that we can just do this

\(q_\pi(s, a) \doteq \sum_{s', r} p(s', r | s, a) [r + \gamma v_\pi(s')]\)

*o1: this might be wrong?*

#### Exercise 3.18

The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action:

Give the equation corresponding to this intuition and diagram for the value at the root node, v⇡(s), in terms of the value at the expected leaf node, q⇡(s, a), given St = s. This equation should include an expectation conditioned on following the policy, ⇡. Then give a second equation in which the expected value is written out explicitly in terms of ⇡(a|s) such that no expected value notation appears in the equation. ⇤

\(v_\pi(s) \doteq \mathbb{E}[q_\pi(s, a) | s = S_t]\)

\(v_\pi(s) \doteq \sum_{a} \pi(a|s) q_\pi(s, a)\)

#### Exercise 3.19

The value of an action, q⇡(s, a), depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state–action pair) and branching to the possible next states:

Give the equation corresponding to this intuition and diagram for the action value, q⇡(s, a), in terms of the expected next reward, Rt+1, and the expected next state value, v⇡(St+1), given that St =s and At =a. This equation should include an expectation but not one conditioned on following the policy. Then give a second equation, writing out the expected value explicitly in terms of p(s', r|s, a) defined by (3.2), such that no expected value notation appears in the equation. ⇤

\(q_\pi(s, a) \doteq \mathbb{E}[R_{t+1} + \gamma v_\pi(s') | s = S_t, a = A_t]\)

\(q_\pi(s, a) \doteq \sum_{s', r} p(s' r | s, a) [r + \gamma v_\pi(s')]\)

#### Exercise 3.20

Draw or describe the optimal state-value function for the golf example. ⇤

In the golf example the optimal state value function is \(max_a \sum_{s', r} p(s' r | s, a)[r + \gamma max_a q_*(s', a')]\)

As a result, the state-value function should look like the listed q*(s, driver) contours but with the values subtracted by 1, since the cost of the action is -1

#### Exercise 3.21

Draw or describe the contours of the optimal action-value function for putting, q⇤(s, putter), for the golf example. ⇤

it will have the same first contour as v_{putt}, but then it will have the contours of v_{driver}, until you get to the green, which will entirely be -1 (return to putting)

#### Exercise 3.22

Consider the continuing MDP shown to the right. The only decision to be made is that in the top state, where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, ⇡left and ⇡right. What policy is optimal if γ = 0? If γ = 0.9? If γ = 0.5? ⇤

if γ is zero, future rewards will be ignored, and you'll prefer π_{left} which provides immediate reward. With γ = 0.9, you'll prefer π_{right} since you'll care a lot about the resulting +2 after the first state. At γ = 0.5, both policies are equivalent, since left is \(1 + 0.5(0) + 0.25 R_{t+3}\) and right is \(0 + 0.5(2) + 0.25 R_{t+3}\).

#### Exercise 3.23

Give the Bellman equation for q_{*} for the recycling robot. ⇤

Given that v_{*} is provided in the text, and v_{*}(s) = max q_{*}(s, a), we can just say

\(q_*(s, a) = \sum_{s', r} p(s', r | s, a) [r + \gamma v_*(s')]\)

where v_{*}(s') are the provided optimality equations for the recycling robot from the text.

I don't really want to write it all out in tex. I can revisit this if necessary.

#### Exercise 3.24

Figure 3.5 gives the optimal value of the best state of the gridworld as 24.4, to one decimal place. Use your knowledge of the optimal policy and (3.8) to express this value symbolically, and then to compute it to three decimal places. ⇤

Recall the bellman equation

\(v_\pi(s) = \sum_{a} \pi(a|s) \sum_{s, r} p(s' r | s, a) [r + \gamma v_\pi(s')]\)

In our case, we have a reward of 10, a fixed action, and a certain probability of identical reward and state transition. So:

\(v_*(s) = 10 + \gamma v_*(s')\)

We know that v_{*}(s') here is 16, and I think it was mentioned that γ was 0.9

Ergo \(v_*(s) = 10 + 0.9(16) = 24.400\)

A bit confused about this problem, I guess I could chain it together until I arrive back at v_{*} but I don't really feel like doing that at the moment.

#### Exercise 3.25

Give an equation for v_{*} in terms of q_{*}. ⇤

Isn't this just \(v_* = max_a q_*(s, a)\)

#### Exercise 3.26

Give an equation for q⇤ in terms of v⇤ and the four-argument p. ⇤

This was already in the text I think, it's \(q_*(s, a) = \sum_{s', r} p(s', r | s, a) [r + \gamma v_*(s')]\)

#### Exercise 3.27

Give an equation for π_{*} in terms of q_{*}. ⇤

\(\pi_*(a | s) = \mathbb{1}[q_*(s, a) = max_{a \in A}(q_*(s, a))]\)

*Note: I am not thrilled with this answer. I feel like it should actually be something like this:*

\(\pi_*(a | s) = \frac{\mathbb{1}[q_*(s, a) = max_{a \in A}(q_*(s, a))]}{\sum \mathbb{1}[q_*(s, a) = max_{a \in A}(q_*(s, a))]}\)

*Because in the case where multiple actions are equally optimal they'll both be 1, which means the total probability will sum to greater than 1 which isn't right, I think. This seems way too wordy but at least conceptually has the right idea.*

#### Exercise 3.28

Give an equation for π_{*} in terms of v_{*} and the four-argument p. ⇤

\(\pi_*(a | s) = \mathbb{1}[\sum_{s', r} p(s', r | s, a) [r + \gamma v_*(s')] = max_{a}(\sum_{s', r} p(s', r | s, a) [r + \gamma v_*(s')])]\)

#### Exercise 3.29

Rewrite the four Bellman equations for the four value functions (v_{π}, v_{*}, q_{π},
and q_{*}) in terms of the three argument function p (3.4) and the two-argument function r
(3.5). ⇤

\(v_\pi(s) = \sum_{a} \pi(a|s) [r(s, a) + \gamma \sum_{s'} p(s' | s, a) v_\pi(s')]\)

\(v_*(s) = max_{a \in A} [r(s,a) + \gamma \sum_{s'} p(s'|s, a) v_*(s')]\)

\(q_\pi(s, a) = r(s, a) + \gamma \sum_{s'} p(s' | s, a) \sum_{a'} \pi(a' | s') q_\pi(s', a')\)

\(q_*(s, a) = r(s, a) + \gamma \sum_{s'} p(s' | s,a) max_{a'}q_*(s', a')\)

### 4 - Dynamic Programming

*status: checked with o1-preview, but maybe it was a bit much for it*

#### Exercise 4.1

In Example 4.1, if π is the equiprobable random policy, what is q_{π}(11, down)?
What is q_{π}(7, down)?

\(q_{\pi}(s,a) \doteq \sum_{s', r} p(s', r | s, a)[r + \gamma v_\pi(s')]\)

because it's not discounted, and the rewards and state transitions are fixed

\(q_{\pi}(11, down) \doteq 1[-1 + v_\pi(s')] = -1\)

…since v_{π}(s') has to be 0 (it's a terminal state)

\(q_{\pi}(7,down) \doteq \sum_{s', r} p(s', r | s, a)[r + \gamma v_\pi(s')]\)

\(q_{\pi}(7,down) \doteq -1 + v_\pi(11)\)

Which depends on k per the diagram (at initialization v_{π}(11) is 0, but eventually it climbs)

#### Exercise 4.2

In Example 4.1, suppose a new state 15 is added to the gridworld just below
state 13, and its actions, left, up, right, and down, take the agent to states 12, 13, 14,
and 15, respectively. Assume that the transitions from the original states are unchanged.
What, then, is v_{π}(15) for the equiprobable random policy? Now suppose the dynamics of
state 13 are also changed, such that action down from state 13 takes the agent to the new
state 15. What is v_{π}(15) for the equiprobable random policy in this case? ⇤

\(\sum_{a} 1/4 (-1 + v_\pi(s'))\)

That is, 1/4 (-1 + v_{π}(12)) + 1/4 (-1 + v_{π}(13)) + 1/4 (-1 + v_{π}(14)) + 1/4 (-1 + v_{π}(15))

or just -1 + 1/4(v_{π}(12) + v_{π}(13) + v_{π}(14) + v_{π}(15))

if down in state 13 moves us to 15 instead of 13, then 13's v values needs to be recalc as

\(v'_{\pi}(13) = -1 + 1/4(v_\pi(12) + v_\pi(9) + v_\pi(14) + v_\pi(15))\)

and then v_{π}(15) would be

\(v_\pi(15) = -1 + 1/4(v_\pi(12) + v'_\pi(13) + v_\pi(14) + v_\pi(15))\)

These can probably be calculated from the listed converged values, but I guess it depends on k.

#### Exercise 4.3

What are the equations analogous to (4.3), (4.4), and (4.5), but for actionvalue functions instead of state-value functions?

(4.3) \(v_\pi(s) \doteq \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1})]\)

(4.4) \(v_\pi(s) \doteq \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_\pi(s')]\)

(4.5) \(v_{k+1}(s) \doteq \max_{a} \pi(a|s) \sum_{s', r}p(s', r | s, a)[r + \gamma v_k(s')]\)

(4.3) \(q_\pi(s, a) \doteq \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) | s=S_t, a=A_t]\)

(4.4) \(q_\pi(s, a) \doteq \sum_{s', r} p(s', r | s, a) [r + \gamma \sum_{a'} \pi(a'|s') q_\pi(s', a')]\)

(4.5) \(q_{k+1}(s, a) \doteq \sum_{s', r}p(s', r | s, a)[r + \gamma max_{a'}q_k(s', a')]\)

#### Exercise 4.4

The policy iteration algorithm on page 80 has a subtle bug in that it may never terminate if the policy continually switches between two or more policies that are equally good. This is okay for pedagogy, but not for actual use. Modify the pseudocode so that convergence is guaranteed. ⇤

The only way it's possible to repeatedly switch between policies that are equally good but not the same are if the resulting rewards from both states are the same. In this code we save the existing action π(s) and then assign the action in that state to the argmax of the new values. All we have to do to avoid this bug is additionally store old-value which is just V(old-action), and then mark policy-stable as false if V(π(s)) != V(old-action). We don't care if the action changes, we care if the policy improves.

#### Exercise 4.5

How would policy iteration be defined for action values? Give a complete algorithm for computing q⇤, analogous to that on page 80 for computing v⇤. Please pay special attention to this exercise, because the ideas involved will be used throughout the rest of the book. ⇤

in policy evaluation you can substitute these lines

\(v \leftarrow V(s)\)

\(V(s) \leftarrow \sum_{s', r} p(s', r | s, \pi(s))[r + \gamma V(s')]\)

\(\Delta \leftarrow max(\Delta, |v - V(s)|)\)

with this, looping over \(a \in A\):

\(q \leftarrow Q(s, a)\)

\(Q(s, a) \leftarrow \sum_{s', r} p(s', r | s, a)[r + \gamma [\sum_{a'}Q(s', a')\pi(a|s')]]\)

\(\Delta \leftarrow max(\Delta, |q - Q(s)|)\)

and it should be good from there (plus changing initialization and updating policy with max q instead of v, of course). The important part is that since you're not keeping track of V(s), you have to expand it out in terms of Q, which involves the all the possible actions over the next state.

#### Exercise 4.6

Suppose you are restricted to considering only policies that are ε-soft, meaning that the probability of selecting each action in each state, s, is at least ε/|A(s)|. Describe qualitatively the changes that would be required in each of the steps 3, 2, and 1, in that order, of the policy iteration algorithm for v⇤ on page 80. ⇤

In step 3, you would replace the argmax assignment with one which takes the argmax with probability 1 - ε and takes a random action with probability ε

In step 2, you would need to replace the value update step with one which first sums across all actions and multiplies them by the probability \(\pi(a|s)\).

In step 1, you'll need to declare an epsilon.

#### Exercise 4.7 (programming)

Write a program for policy iteration and re-solve Jack’s car rental problem with the following changes. One of Jack’s employees at the first location rides a bus home each night and lives near the second location. She is happy to shuttle one car to the second location for free. Each additional car still costs $2, as do all cars moved in the other direction. In addition, Jack has limited parking space at each location. If more than 10 cars are kept overnight at a location (after any moving of cars), then an additional cost of $4 must be incurred to use a second parking lot (independent of how many cars are kept there). These sorts of nonlinearities and arbitrary dynamics often occur in real problems and cannot easily be handled by optimization methods other than dynamic programming. To check your program, first replicate the results given for the original problem.

import math import numpy as np def get_proba(n, lam): return ((lam ** n)/(math.factorial(n))) * np.exp(-lam) lam_rent_l1 = 3 lam_return_l1 = 3 lam_rent_l2 = 4 lam_return_l2 = 2 max_cars = 20 max_move = 5 gamma = 0.9 ## policy iter vals = [[0 for _ in range(max_cars+1)] for _ in range(max_cars+1)] policy = [[0 for _ in range(max_cars+1)] for _ in range(max_cars+1)] # poisson gets very small after less than the full range so we can just ignore the tails rent_transition_probs_l1 = [get_proba(x, lam_rent_l1) for x in range(11)] rent_transition_probs_l2 = [get_proba(x, lam_rent_l2) for x in range(11)] ret_transition_probs_l1 = [get_proba(x, lam_return_l1) for x in range(11)] ret_transition_probs_l2 = [get_proba(x, lam_return_l2) for x in range(11)] # make this tractable def get_t_probs(a, b, c, d): t_probs = [[[[1 for _ in a] for _ in b] for _ in c] for _ in d] for i,x in enumerate(a): for j,y in enumerate(b): two_prob = x * y for k,z in enumerate(c): three_prob = two_prob * z for l,zz in enumerate(d): t_probs[i][j][k][l] = three_prob * zz return t_probs t_probs_table = get_t_probs(rent_transition_probs_l1, rent_transition_probs_l2, ret_transition_probs_l1, ret_transition_probs_l2) def expected_return(state, action, vals): i,j = state old_val = vals[i][j] move = action new_value = 0 #for all r, s' # employee is willing to move 1 car for free if move > 0: base_reward = -2 * abs(move-1) else: base_reward = -2 * abs(move) table_dims = 11 # hard coded for now for x_sub in range(table_dims): for x_add in range(table_dims): for y_sub in range(table_dims): for y_add in range(table_dims): # you cannot rent out more than you have reward = 10*(min(i, x_sub) + min(j, y_sub)) reward += base_reward new_i = i - move + x_add - min(i, x_sub) new_i = min(20, max(0, new_i)) new_j = j + move + y_add - min(j, y_sub) new_j = min(20, max(0, new_j)) # add 2nd parking lot penalty for both locations if new_i > 10: reward -= 4 if new_j > 10: reward -= 4 next_state_value = vals[new_i][new_j] transition_prob = t_probs_table[x_sub][x_add][y_sub][y_add] new_value += transition_prob * (reward + gamma*next_state_value) delta = abs(old_val - new_value) return new_value, delta def policy_iteration(vals, policy, probs): theta = 1 delta = theta+1 while delta > theta: print(f"delta: {delta}, theta: {theta}") delta = 0 #for all s for i in range(max_cars+1): print(f"row i {i}") for j in range(max_cars+1): new_value, obs_delta = expected_return((i,j), policy[i][j], vals) vals[i][j] = new_value delta = max(delta, obs_delta) return vals, policy # policy improvement policy_stable = False while not policy_stable: policy_stable = True #if policy-stable, stop, else do policy iteration then improvement print(f"doing policy iteration!") vals, policy = policy_iteration(vals, policy, t_probs_table) print(f"improving the policy now!") #for all s for i in range(max_cars+1): print(f"Improving row {i}") for j in range(max_cars+1): old_action = policy[i][j] new_action = old_action old_action_value, _ = expected_return((i,j), policy[i][j], vals) for a in range(-max_move, max_move): new_action_value, _ = expected_return((i,j), a, vals) if new_action_value > old_action_value: new_action = a old_action_value = new_action_value policy_stable = False policy[i][j] = new_action

Overall I'm not thrilled with this implementation – it does recreate everything per the text but I can't shake the feeling there's some substantial optimization improvements I can be doing here.

The plots look reasonable, though.

#### Exercise 4.8

Why does the optimal policy for the gambler’s problem have such a curious form? In particular, for capital of 50 it bets it all on one flip, but for capital of 51 it does not. Why is this a good policy? ⇤

If you imagine the all flip at 50 being 0.4 probability to win, betting 1 at 51 means you are adding the probability of getting 50 more points starting with 1 to the fixed probability of 0.4, making it strictly better. Not sure it's intuitive why it's optimal but it definitely does not make sense to bet 51 (because you only need 100).

#### Exercise 4.9 (programming)

Implement value iteration for the gambler’s problem and solve it for ph = 0.25 and ph = 0.55. In programming, you may find it convenient to introduce two dummy states corresponding to termination with capital of 0 and 100, giving them values of 0 and 1 respectively. Show your results graphically, as in Figure 4.3. Are your results stable as θ → 0? ⇤

import math import numpy as np from matplotlib import pyplot as plt def value_iteration(vals, actions, theta, ph): iters = 0 delta = theta+1 while delta > theta or iters < 256: iters += 1 delta = 0 for s in range(1, 100): v = vals[s] max_checkval = None for bet in range(actions[s]+1): check_val = (ph * vals[min(100, s + bet)]) + ((1-ph) * vals[max(0, s - bet)]) if not max_checkval or check_val > max_checkval: max_checkval = check_val delta = max(delta, abs(v - max_checkval)) vals[s] = max_checkval policy = [] for s in range(1,100): greedy_act = 0 greedy_val = 0 for bet in range(1, actions[s]+1): q = (ph * vals[min(100, s + bet)]) + ((1-ph) * vals[max(0, s - bet)]) if q > greedy_val: greedy_val = q greedy_act = bet policy.append(greedy_act) return policy, vals def viz(policy, values): plt.plot(policy) plt.show() plt.plot(values) plt.show() ph = 0.4 vals = [0 for x in range(101)] vals[-1] = 1 actions = [x for x in range(100)] theta = 1e-12 p_4, v_4 = value_iteration(vals, actions, theta, ph) viz(p_4, v_4) ph = 0.2 vals = [0 for x in range(101)] vals[-1] = 1 actions = [x for x in range(100)] theta = 1e-12 p_2, v_2 = value_iteration(vals, actions, theta, ph) viz(p_2, v_2)

Honestly, no. The value and policy both do converge, given enough timesteps, but the form they take is pretty unusual. I wonder if there are multiple optimal actions for each state, and the wild behavior of the policy is not preferring one type of state to the other? The code is pretty simple so I imagine this has to do with the problem statement, but I'm definitely left with more questions than answers.

#### Exercise 4.10

What is the analog of the value iteration update (4.10) for action values,
q_{k+1}(s, a)? ⇤

(4.10) \(v_{k+1}(s) \doteq max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_k(s')]\)

\(q_{k+1}(s, a) \doteq \sum_{s', r} p(s', r | s, a) [r + \gamma \sum_{a'} \pi(a|s) q_k(s')]\)

### 5 - Monte Carlo Methods

*status: checked with o1-preview*

#### Exercise 5.1

Consider the diagrams on the right in Figure 5.1. Why does the estimated value function jump up for the last two rows in the rear? Why does it drop off for the whole last row on the left? Why are the frontmost values higher in the upper diagrams than in the lower? ⇤

It jumps up because those are the ones you stick and often win. It drops off on the left because if the dealer has a usable ace they're more likely to win also. Frontmost values are higher in the upper diagrams because having a useable ace lets you salvage bad hands by going over 21 and looping back to a lower value.

#### Exercise 5.2

Suppose every-visit MC was used instead of first-visit MC on the blackjack task. Would you expect the results to be very different? Why or why not? ⇤

Because it also quadratically converges to v_{π} I imagine it would largely look the same. Especially because the episode lengths are very short, the only states which would be revisitable are small sums which are looped back upon with the usable ace, otherwise per-episode it should be about the same everywhere.

#### Exercise 5.3

What is the backup diagram for Monte Carlo estimation of q⇡? ⇤

It should look like the backup diagram on page 95, a single line representing the experience, but starting from an action instead of a state. One episode, no bootstrapping.

#### Exercise 5.4

The pseudocode for Monte Carlo ES is inefficient because, for each state– action pair, it maintains a list of all returns and repeatedly calculates their mean. It would be more ecient to use techniques similar to those explained in Section 2.4 to maintain just the mean and a count (for each state–action pair) and update them incrementally. Describe how the pseudocode would be altered to achieve this. ⇤

Initialize returns as a list of tuples (0,0) instead of an empty list.

Instead of appending, add 1 to the second value and change the first value to the reward + 1 / second value times the old first value.

#### Exercise 5.5

Consider an MDP with a single nonterminal state and a single action that transitions back to the nonterminal state with probability p and transitions to the terminal state with probability 1p. Let the reward be +1 on all transitions, and let = 1. Suppose you observe one episode that lasts 10 steps, with a return of 10. What are the first-visit and every-visit estimators of the value of the nonterminal state? ⇤

The first-visit should be \(V(S) = G_0\) which refers to the episode reward, NOT the immediate reward. So in this case, it would be 10.

First visit it will be 10 like above. Second visit it will be 9 (since we lose the reward from the first transition). This will get added to returns which will make it [10,9], and the average is 9.5. This will continue [10,9,8,7,6,5,4,3,2,1], and the average of this is 55/10 = 5.5.

#### Exercise 5.6

What is the equation analogous to (5.6) for action values Q(s, a) instead of state values V (s), again given returns generated using b? ⇤

I actually think it should be pretty much the same? We need to change \(t \in T(s)\) to be instead \(t \in T(s, a)\) but p is already in terms of the actions of each policy rather than as states, and we are doing it episode by episode which means its the same episode but in terms of the state-action transitions rather than the states alone.

#### Exercise 5.7

In learning curves such as those shown in Figure 5.3 error generally decreases with training, as indeed happened for the ordinary importance-sampling method. But for the weighted importance-sampling method error first increased and then decreased. Why do you think this happened? ⇤

Weighted importance sampling starts from a biased estimation and updates a ratio towards the other policy, compared to ordinary importance sampling which monotonically increases the denominator. With OIS, the variance is high that improving with more samples is much larger than the effect of having only a few samples, but with WIS, you'll likely need a few samples before it starts heading in the right direction, which could lead to a temporary increase in error.

#### Exercise 5.8

The results with Example 5.5 and shown in Figure 5.4 used a first-visit MC method. Suppose that instead an every-visit MC method was used on the same problem. Would the variance of the estimator still be infinite? Why or why not? ⇤

The variance would definitely still be infinite if we used an every-visit MC method. Because the importance sampling ratio is the same in both cases, each step is still 2^{k} which is unbounded and diverging. The every-visit method will sum together averaged versions which will divide the terms by n, but that still diverges in the same way.

#### Exercise 5.9

Modify the algorithm for first-visit MC policy evaluation (Section 5.1) to use the incremental implementation for sample averages described in Section 2.4. ⇤

This would be done the same way as in exercise 5.4, you just replace the list with a list of tuples, and keep track of N and the running average and do the incremental average instead.

#### Exercise 5.10

Derive the weighted-average update rule (5.8) from (5.7). Follow the pattern of the derivation of the unweighted rule (2.3). ⇤

\(V_n = \frac{\sum_{k=1}^{n-1} W_k G_k}{\sum_{k=1}^{n-1} W_k}\)

\(V_n = \frac{1}{\sum_{k=1}^{n-1} W_k} (W_{n-1} G_{n-1} + \sum_{k=1}^{n-2} W_k G_k )\)

\(V_n = \frac{1}{\sum_{k=1}^{n-1} W_k} (W_{n-1} G_{n-1} + (\sum_{k=1}^{n-2}W_k) \frac{1}{\sum_{k=1}^{n-2}W_k} \sum_{k=1}^{n-2} W_k G_k )\)

\(V_n = \frac{1}{\sum_{k=1}^{n-1} W_k} (W_{n-1} G_{n-1} + (\sum_{k=1}^{n-2}W_k) V_{n-1})\)

\(V_n = \frac{1}{\sum_{k=1}^{n-1} W_k} (W_{n-1} G_{n-1}) + \frac{1}{\sum_{k=1}^{n-1} W_k} \sum_{k=1}^{n-2}W_k V_{n-1})\)

\(V_n = \frac{1}{\sum_{k=1}^{n-1} W_k} (W_{n-1} G_{n-1}) + \frac{\sum_{k=1}^{n-2} W_k}{\sum_{k=1}^{n-1} W_k} V_{n-1})\)

\(V_n = \frac{W_{n-1}}{\sum_{k=1}^{n-1} W_k} (G_{n-1}) + \frac{\sum_{k=1}^{n-2} W_k}{\sum_{k=1}^{n-1} W_k} V_{n-1})\)

Let's introduce \(C_n = \sum_{k=1}^n W_k\)

\(V_n = \frac{W_{n-1}}{C_{n-1}} (G_{n-1}) + \frac{C_{n-2}}{C_{n-1}} V_{n-1}\)

\(V_n - V_{n-1} = \frac{W_{n-1}}{C_{n-1}} (G_{n-1}) + \frac{C_{n-2}}{C_{n-1}} V_{n-1} - V_{n-1}\)

\(V_n - V_{n-1} = \frac{W_{n-1}}{C_{n-1}} (G_{n-1}) + V_{n-1} (\frac{C_{n-2}}{C_{n-1}} - 1)\)

\(V_n - V_{n-1} = \frac{W_{n-1}}{C_{n-1}} (G_{n-1}) + V_{n-1} (\frac{C_{n-2}}{C_{n-2} + W_{n-1}} - \frac{C_{n-2}+ {W_{n-1}}}{C_{n-2} + {W_{n-1}}})\)

\(V_n - V_{n-1} = \frac{W_{n-1}}{C_{n-1}} (G_{n-1}) + V_{n-1} (\frac{C_{n-2} - C_{n-2} + {W_{n-1}}}{C_{n-2}{W_{n-1}}})\)

\(V_n - V_{n-1} = \frac{W_{n-1}}{C_{n-1}} (G_{n-1}) - V_{n-1} (\frac{W_{n-1}}{C_{n-1}})\)

\(V_n - V_{n-1} = \frac{W_{n-1}}{C_{n-1}} (G_{n-1} - V_{n-1})\)

\(V_n = V_{n-1} + \frac{W_{n-1}}{C_{n-1}} [G_{n-1} - V_{n-1}]\)

#### Exercise 5.11

In the boxed algorithm for off-policy MC control, you may have been expecting the W update to have involved the importance-sampling ratio m(At|St) / b(At|St) , but instead it involves 1 / b(At|St) . Why is this nevertheless correct? ⇤

This is because we are using W to incrementally update the importance sampling ratio, and we modify the ratio each Q assignment rather than calculating it from scratch every time.

Specifically, we can observe that after A_{t} has been selected by π, \(\pi(A_t|S_t) = 1\), which matches what we have above. W in this case is just repeatedly multiplying this together, which makes it the same as a product of importance sampling ratios.

#### Exercise 5.12: Racetrack (programming)

Consider driving a race car around a turn like those shown in Figure 5.5. You want to go as fast as possible, but not so fast as to run o↵ the track. In our simplified racetrack, the car is at one of a discrete set of grid positions, the cells in the diagram. The velocity is also discrete, a number of grid cells moved horizontally and vertically per time step. The actions are increments to the velocity components. Each may be changed by +1, 1, or 0 in each step, for a total of nine (3 ⇥ 3) actions. Both velocity components are restricted to be nonnegative and less than 5, and they cannot both be zero except at the starting line. Each episode begins in one of the randomly selected start states with both velocity components zero and ends when the car crosses the finish line. The rewards are 1 for each step until the car crosses the finish line. If the car hits the track boundary, it is moved back to a random position on the starting line, both velocity components are reduced to zero, and the episode continues. Before updating the car’s location at each time step, check to see if the projected path of the car intersects the track boundary. If it intersects the finish line, the episode ends; if it intersects anywhere else, the car is considered to have hit the track boundary and is sent back to the starting line. To make the task more challenging, with probability 0.1 at each time step the velocity increments are both zero, independently of the intended increments. Apply a Monte Carlo control method to this task to compute the optimal policy from each starting state. Exhibit several trajectories following the optimal policy (but turn the noise off for these trajectories). ⇤

Better to put this as a link to a colab here.

#### Exercise 5.15

Make new equations analogous to the importance-sampling Monte Carlo estimates (5.5) and (5.6), but for action value estimates Q(s, a). You will need new notation T(s, a) for the time steps on which the state–action pair s, a is visited on the episode. Do these estimates involve more or less importance-sampling correction?

These are pretty close to the same thing:

\(Q(s,a) = \frac{\sum_{t \in T(s,a)} p_{t:T(t)-1} G_t}{|T(s,a)|}\)

\(Q(s,a) = \frac{\sum_{t \in T(s,a)} p_{t:T(t)-1} G_t}{\sum_{t \in T(s,a)} p_{t:T(t)-1}}\)

These are actually likely to require a lot more importance-sampling correction. In general, there are almost always many more state-action pairs than there are states, so you'll likely need to get more samples as you construct these estimates.

#### TODO Asterisked Exercises to Revisit Later

**These below are asterisked, so come back to them later**

Exercise 5.13 Show the steps to derive (5.14) from (5.12). ⇤

Exercise 5.14 Modify the algorithm for off-policy Monte Carlo control (page 111) to use the idea of the truncated weighted-average estimator (5.10). Note that you will first need to convert this equation to action values. ⇤

### 6 - Temporal-Difference Learning

*status: checked with o1-preview*

#### Exercise 6.1

If V changes during the episode, then (6.6) only holds approximately; what would the difference be between the two sides? Let Vt denote the array of state values used at time t in the TD error (6.5) and in the TD update (6.2). Redo the derivation above to determine the additional amount that must be added to the sum of TD errors in order to equal the Monte Carlo error.

Definitions

\(\delta_t = R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t)\)

\(G_t = R_{t+1} + \gamma G_{t+1}\)

\(V_{t+1}(S_t) = V_t(S_t) + \alpha_t \delta_t\)

We start with MC return

\(G_t = R_{t+1} + \gamma G_{t+1}\)

\(G_t - V_t(S_t) = R_{t+1} + \gamma G_{t+1} - V_t(S_t)\)

\(G_t - V_t(S_t) = R_{t+1} + \gamma G_{t+1} - V_t(S_t) + \gamma V_t(S_{t+1}) - \gamma V_t(S_{t+1})\)

\(G_t - V_t(S_t) = \delta_t + \gamma G_{t+1} - \gamma V_t(S_{t+1})\)

\(G_t - V_t(S_t) = \delta_t + \gamma[G_{t+1} - V_t(S_{t+1})]\)

We can define a new quantity

\(V_\Delta(S) = V_{t+1}(S) - V_{t}(S)\)

\(G_{t+1} - V_{t}(S_{t+1}) = G_{t+1} - V_{t}(S_{t+1})\)

\(G_{t+1} - V_{t}(S_{t+1}) = G_{t+1} - [V_{t+1}(S_{t+1}) - V_\Delta(S_{t+1})]\)

\(G_{t+1} - V_{t}(S_{t+1}) = G_{t+1} - V_{t+1}(S_{t+1}) + V_\Delta(S_{t+1})\)

Now we have it written in terms of what we want so we can do the same expansions as in 6.6

\(G_t - V_t(S_t) = \delta_t + \gamma[G_{t+1} - V_{t+1}(S_{t+1})] + \gamma V_\Delta(S_{t+1})\)

\(G_t - V_t(S_t) = \delta_t + \gamma \delta_{t+1} + \gamma^2[G_{t+2} - V_{t+2}(S_{t+2}) + V_\Delta(S_{t+2})] + \gamma V_\Delta(S_{t+1})\)

…

\(= \gamma V_\Delta(S_{t+1}) + \gamma^2 V_\Delta(S_{t+2}) + ... + \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k\)

\(= \sum_{k=t}^{T-1} \gamma^{k-t} V_\Delta(S_{k}) + \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k\)

\(= \sum_{k=t}^{T-1} \gamma^{k-t} [\delta_k + V_\Delta(S_k)]\)

#### Exercise 6.2

This is an exercise to help develop your intuition about why TD methods are often more ecient than Monte Carlo methods. Consider the driving home example and how it is addressed by TD and Monte Carlo methods. Can you imagine a scenario in which a TD update would be better on average than a Monte Carlo update? Give an example scenario—a description of past experience and a current state—in which you would expect the TD update to be better. Here’s a hint: Suppose you have lots of experience driving home from work. Then you move to a new building and a new parking lot (but you still enter the highway at the same place). Now you are starting to learn predictions for the new building. Can you see why TD updates are likely to be much better, at least initially, in this case? Might the same sort of thing happen in the original scenario?

In the new case where you have a new final destination, an important note is that your initial path home from work is the same in both cases (getting on and off the highway in the same spots). As a result, you already know how long those time steps take, and the likely difference in total time between those states is likely to be the exact same. In TD learning, the important thing is that your early states stay similar relative to each other, and that you figure out your new trailing states. In MC methods, you need to completely update all the states to predict the final time, which resets everything since every state directly learns the resulting dynamics of everything after it.

In the original scenario, this might happen too. If a road is closed and one leg is now abnormally slow, the rest of the state value estimates might still be pretty accurate relative to each other, compared to MC methods which do not handle this type of case well.

#### Exercise 6.3

From the results shown in the left graph of the random walk example it appears that the first episode results in a change in only V(A). What does this tell you about what happened on the first episode? Why was only the estimate for this one state changed? By exactly how much was it changed?

It makes sense that only one state updated. The only two states that can update in TD learning with equal initialization like this are the ones next to the terminal states. The middle ones all update to \(V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]\) which in this case is V(S_{t}) + 0 + (1 * 0.5) - 0.5 which is just V(S_{t}). The right state can update upwards, but only if the episode actually visits that state (and it can only go to one of the states. In our case here, V(A) will update to 0.5 + 0.1[0 + 1*0 - 0.5] which equals 0.45, which checks out in the diagram.

#### Exercise 6.4

The specific results shown in the right graph of the random walk example are dependent on the value of the step-size parameter, α. Do you think the conclusions about which algorithm is better would be affected if a wider range of α values were used? Is there a different, fixed value of α at which either algorithm would have performed significantly better than shown? Why or why not?

For this problem it makes sense that TD methods are more sample efficient. MC methods rely heavily upon the fact that you can arrive at the value of the state after sampling lots of episode returns and averaging them for each state. TD methods will instead increment based on the estimates of surrounding states, which makes it more like a combination of DP methods and MC methods. I can imagine that there are very unwise alpha values for which TD methods may fail to converge, though, for example if α = 1, V(A) in our last example would be assigned a new value of 0, which seems bad. In this case, MC methods would still be making updates based on the total returns, while TD methods only get to consider adjacent states, which might make them more likely to diverge instead of converge.

#### Exercise 6.6

In Example 6.2 we stated that the true values for the random walk example are 1/6, 2/6, 3/6, 4/6, and 5/6, for states A through E. Describe at least two different ways that these could have been computed. Which would you guess we actually used? Why?

I probably would have used value iteration, where we can repeatedly improve the estimates until they converged. You could also use monte carlo methods starting from each state and averaging the return, which would also be pretty easy and would arrive at these values in the limit.

#### Exercise 6.8

Exercise 6.8 Show that an action-value version of (6.6) holds for the action-value form
of the TD error δ_{t} = Rt+1 + γ Q(St+1, At+1) - Q(St, At), again assuming that the values
don’t change from step to step.

\(G_t - Q(S_t, A_t) = R_{t+1} + \gamma G_{t+1} - Q(S_t, A_t) + \gamma Q(S_{t+1}, A_{t+1}) - \gamma Q(S_{t+1}, A_{t+1})\)

\(= \delta_t + \gamma(G_{t+1} - Q(S_{t+1}, A_{t+1}))\)

\(= \delta_t + \gamma \delta_{t+1}+ \gamma^2(G_{t+2} - Q(S_{t+2}, A_{t+2}))\)

\(= \delta_t + \gamma \delta_{t+1}+ ... + \gamma^{T-t}(G_{T} - Q(S_{T}, A_{T}))\)

\(= \delta_t + \gamma \delta_{t+1}+ ... + \gamma^{T-t}(0 - 0)\)

\(= \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k\)

Thankfully this is the same, since Q(S_{T}, A_{T}) and G_{T} are both still 0.

#### Exercise 6.9 (programming)

Re-solve the windy gridworld assuming eight possible actions, including the diagonal moves, rather than four. How much better can you do with the extra actions? Can you do even better by including a ninth action that causes no movement at all other than that caused by the wind?

import numpy as np # hardcoded for now def init_gridworld(): world = [["-" for _ in range(10)] for _ in range(6)] winds = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0] return world, winds def apply_move(position, action, world, winds): tall = len(world)-1 wide = len(world[0])-1 current_y = position[0] move_y = action[0] current_x = position[1] move_x = action[1] # this seems how it works from the figure current_wind = winds[current_x] new_y = max(0, min(tall, current_y + move_y - current_wind)) new_x = max(0, min(wide, current_x + move_x)) return -1, (new_y, new_x) def follow_policy(Q, pos, eps=0.1): if pos == GOAL_STATE: return (0,0) if np.random.random() < eps: return (np.random.choice([-1,0,1]), np.random.choice([-1,0,1])) valued_actions = [] for a_y in [-1, 0, 1]: for a_x in [-1, 0, 1]: if (pos, (a_y, a_x)) in Q.keys(): valued_actions.append(((a_y, a_x), Q[(pos, (a_y, a_x))])) if len(valued_actions) == 0: return (np.random.choice([-1,0,1]), np.random.choice([-1,0,1])) else: return max(valued_actions, key=lambda x: x[1])[0] GOAL_STATE = (3, 7) START_STATE = (3, 0) # sarsa world, winds = init_gridworld() alpha = 0.1 eps = 0.1 gamma = 1 n_episodes = 100 Q = {} for _ in range(n_episodes): steps = 0 pos = START_STATE action = follow_policy(Q, pos, eps=eps) while pos != GOAL_STATE: steps += 1 reward, new_pos = apply_move(pos, action, world, winds) new_action = follow_policy(Q, new_pos, eps=eps) sa = (pos, action) spap = (new_pos, new_action) if sa not in Q: Q[sa] = 0 if spap not in Q: Q[spap] = 0 Q[sa] = Q[sa] + alpha * (reward + gamma * Q[spap] - Q[sa]) pos = new_pos action = new_action print(f"Completed in {steps} steps")

#### Exercise 6.10 (programming)

Exercise 6.10: Stochastic Wind (programming) Re-solve the windy gridworld task with King’s moves, assuming that the effect of the wind, if there is any, is stochastic, sometimes varying by 1 from the mean values given for each column. That is, a third of the time you move exactly according to these values, as in the previous exercise, but also a third of the time you move one cell above that, and another third of the time you move one cell below that. For example, if you are one cell to the right of the goal and you move left, then one-third of the time you move one cell above the goal, one-third of the time you move two cells above the goal, and one-third of the time you move to the goal. ⇤

import numpy as np # hardcoded for now def init_gridworld(): world = [["-" for _ in range(10)] for _ in range(6)] winds = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0] return world, winds def apply_move(position, action, world, winds): tall = len(world)-1 wide = len(world[0])-1 current_y = position[0] move_y = action[0] current_x = position[1] move_x = action[1] # stochastic current_wind = np.random.choice([winds[current_x]-1, winds[current_x], winds[current_x]+1]) new_y = max(0, min(tall, current_y + move_y - current_wind)) new_x = max(0, min(wide, current_x + move_x)) return -1, (new_y, new_x) def follow_policy(Q, pos, eps=0.1): if pos == GOAL_STATE: return (0,0) if np.random.random() < eps: return (np.random.choice([-1,0,1]), np.random.choice([-1,0,1])) valued_actions = [] for a_y in [-1, 0, 1]: for a_x in [-1, 0, 1]: if (pos, (a_y, a_x)) in Q.keys(): valued_actions.append(((a_y, a_x), Q[(pos, (a_y, a_x))])) if len(valued_actions) == 0: return (np.random.choice([-1,0,1]), np.random.choice([-1,0,1])) else: return max(valued_actions, key=lambda x: x[1])[0] GOAL_STATE = (3, 7) START_STATE = (3, 0) # sarsa world, winds = init_gridworld() alpha = 0.1 eps = 0.1 gamma = 1 n_episodes = 100 Q = {} for _ in range(n_episodes): steps = 0 pos = START_STATE action = follow_policy(Q, pos, eps=eps) while pos != GOAL_STATE: steps += 1 reward, new_pos = apply_move(pos, action, world, winds) new_action = follow_policy(Q, new_pos, eps=eps) sa = (pos, action) spap = (new_pos, new_action) if sa not in Q: Q[sa] = 0 if spap not in Q: Q[spap] = 0 Q[sa] = Q[sa] + alpha * (reward + gamma * Q[spap] - Q[sa]) pos = new_pos action = new_action print(f"Completed in {steps} steps")

#### Exercise 6.11

Why is Q-learning considered an *off-policy* control method?

Q-learning is considered an off-policy control method because it directly approximates the optimal action-value function but doesn't use that function to actually select actions to perform this calculation. This makes it different from Sarsa, which will use the Q value of the action actually taken from S_{t+1}, compared to Q learning which performs the update using the maximum Q value available from S_{t+1} (even if it decides to do something else)

#### Exercise 6.12

Suppose action selection is greedy. Is Q-learning then exactly the same algorithm as Sarsa? Will they make exactly the same action selections and weight updates?

If action selection is greedy, then \(max_a Q(S_{t+1}, a) = Q(S_{t+1}, A_{t+1})\), meaning Sarsa and Q-learning are the same. The difference here is that if you want to add an exploratory policy, Sarsa would use that policy to update, but Q-learning would use the same greedy action selection regardless of what action is actually taken. Note: this also assumes that ties are broken the same way. If they differ then obviously they are no longer the same, but I'm assuming from the way the question is asked that this means "if the actions are always the same"

#### Exercise 6.14

Describe how the task of Jack’s Car Rental (Example 4.2) could be reformulated in terms of afterstates. Why, in terms of this specific task, would such a reformulation be likely to speed convergence?

In Jack's Car Rental, your "move" is how many cars you move from location A to location B (which can be negative, if you want to move cars from B to A). The "opponent move" is how many cars get added or subtracted from each location (i.e. from rentals and returns).

In this situation, you will often arrive at the same number of cars in both locations even if all of these numbers vary a lot (i.e. I have 0 and 3, I move -1 to get 1 and 2, location A has 1 return, location B has 1 rental, leading to 2 and 1. This is the same, for example, as starting with 2 and 1, doing 0 moves, and having 3 rentals and 3 returns in each situation). The number of possible paths spirals out to a dramatically high number (since many values can vary) but the actual resulting states is limited to just that 20x20 box, so framing it as the resulting number before doing the update is much more likely to lead to fast convergence since you'll be able to cross-learn from different situations.

#### TODO Asterisked Exercises to Return To Later

Exercise 6.5 In the right graph of the random walk example, the RMS error of the TD method seems to go down and then up again, particularly at high ↵’s. What could have caused this? Do you think this always occurs, or might it be a function of how the approximate value function was initialized?

Exercise 6.7 Design an off-policy version of the TD(0) update that can be used with arbitrary target policy π and covering behavior policy b, using at each step t the importance sampling ratio pt:t (5.3).

Exercise 6.13 What are the update equations for Double Expected Sarsa with an ε -greedy target policy?

### 7 - n-step Bootstrapping

*Status: Checked with o1-preview*

#### Exercise 7.1

In Chapter 6 we noted that the Monte Carlo error can be written as the sum of TD errors (6.6) if the value estimates don’t change from step to step. Show that the n-step error used in (7.2) can also be written as a sum of TD errors (again if the value estimates don’t change) generalizing the earlier result.

Recall the TD Error

\(\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\)

The n-step error here is the bracketed part of 7.2:

\(V_{t+n}(S_t) \doteq V_{t+n-1}(S_t) + \alpha [G_{t:t+n} - V_{t+n-1}(S_t)]\)

Since the values won't change we can just write it as V from now on.

The n-step return is:

\(G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})\)

\(G_{t:t+n} - V(S_t) = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n}) - V(S_t)\)

For this last step, we can add and subtract the same values like so:

\(G_{t:t+n} - V(S_t) = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n}) + \gamma^{n-1} V(S_{t+n-1}) - \gamma^{n-1} V(S_{t+n-1}) + ... - V(S_t)\)

Rearranging we get

\(G_{t:t+n} - V(S_t) = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) + R_{t+2} + \gamma^2 V(S_{t+2}) - \gamma V(S_{t+1}) + ... + R_{t+n} + \gamma^n V(S_{t+n}) - \gamma^{n-1} V(S_{t+n-1})\)

Which is

\(G_{t:t+n} - V(S_t) = \delta_t + \gamma \delta_{t+1} + ... + \gamma^{n-1} \delta_{t+n-1}\)

\(G_{t:t+n} - V(S_t) = \sum_{k=t}^{t+n-1} \gamma^{k-1} \delta_k\)

#### Exercise 7.2 (programming)

With an n-step method, the value estimates do change from step to step, so an algorithm that used the sum of TD errors (see previous exercise) in place of the error in (7.2) would actually be a slightly different algorithm. Would it be a better algorithm or a worse one? Devise and program a small experiment to answer this question empirically.

I am stupid and I tried to do this with n-step sarsa, which doesn't use value estimates at all. I don't really feel like going back and implementing n-step TD and then making this changes, but intuitively I feel like using the TD errors would provide a more biased estimate towards the current values, which might be better in situations where the values have higher variance or change over time. It's probably better or worse depending on the problem.

I may revisit this later but I at least still got to do some programming in this chapter so we will call it completed even though it's wrong.

import numpy as np # hardcoded for now def init_gridworld(): world = [["-" for _ in range(10)] for _ in range(10)] return world def apply_move(position, action, world): tall = len(world)-1 wide = len(world[0])-1 current_y = position[0] move_y = action[0] current_x = position[1] move_x = action[1] new_y = max(0, min(tall, current_y + move_y)) new_x = max(0, min(wide, current_x + move_x)) return -1, (new_y, new_x) def follow_policy(Q, pos, eps=0.1): if pos == GOAL_STATE: return (0,0) if np.random.random() < eps: return (np.random.choice([-1,0,1]), np.random.choice([-1,0,1])) valued_actions = [] for a_y in [-1, 0, 1]: for a_x in [-1, 0, 1]: if (pos, (a_y, a_x)) in Q.keys(): valued_actions.append(((a_y, a_x), Q[(pos, (a_y, a_x))])) if len(valued_actions) == 0: return (np.random.choice([-1,0,1]), np.random.choice([-1,0,1])) else: return max(valued_actions, key=lambda x: x[1])[0] GOAL_STATE = (3, 7) #START_STATE = (3, 0) # n-step sarsa world = init_gridworld() alpha = 0.1 eps = 0.1 gamma = 1 n_episodes = 100 n_steps = 10 Q = {} for _ in range(n_episodes): steps = 0 pos = GOAL_STATE while pos == GOAL_STATE: pos = (np.random.randint(0,9), np.random.randint(0,9)) action = follow_policy(Q, pos, eps=eps) all_rewards = [] all_states = [pos] all_actions = [] T = np.inf while steps < T: update_step = steps - n_steps + 1 reward, new_pos = apply_move(pos, action, world) all_rewards.append(reward) all_states.append(new_pos) all_actions.append(action) if new_pos == GOAL_STATE: T = steps + 1 else: new_action = follow_policy(Q, new_pos, eps=eps) if update_step >= 0: StnAtn = (all_states[steps], all_actions[steps]) StAt = (all_states[update_step], all_actions[update_step]) if StnAtn not in Q: Q[StnAtn] = 0 if StAt not in Q: Q[StAt] = 0 G = 0 for i in range(update_step+1, min(update_step+n_steps, T)): G += gamma**(i-update_step-1) * all_rewards[i] if update_step + n_steps < T: G = G + gamma**n_steps * Q[StnAtn] Q[StAt] += alpha * (G - Q[StAt]) pos = new_pos action = new_action steps += 1 print(f"Completed in {steps} steps")

#### Exercise 7.3

Why do you think a larger random walk task (19 states instead of 5) was used in the examples of this chapter? Would a smaller walk have shifted the advantage to a different value of n? How about the change in left-side outcome from 0 to -1 made in the larger walk? Do you think that made any difference in the best value of n?

With a shorter random walk task, it takes fewer n for your n-step method to pretty much be a MC method (barring the very unlikely extremes, which are truncated). It seems unlikely that you'd get as interesting separation in this experiment compared to the version with the longer walk.

The left-side value being -1 means that the range of values is twice as large, which means that the variance of the cumulative returns. With higher variance, larger values of n are likely to be better.

#### Exercise 7.4

Prove that the n-step return of Sarsa (7.4) can be written exactly in terms of a novel TD error

\(G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} ... + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n})\)

\(G_{t:t+n} \doteq \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n}) + \sum_{k=t}^{t+n} \gamma^{k-t }R_{k+1}\)

We use that plus minus trick again

\(G_{t:t+n} \doteq \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n}) + \gamma^{n-1} Q_{t+n-1}(S_{t+n-1}, A_{t+n-1}) - \gamma^{n-1} Q_{t+n-1}(S_{t+n-1}, A_{t+n-1}) + ... + \sum_{k=t}^{t+n} \gamma^{k-t }R_{k+1}\)

Then we can pull everything except the last Q_{t-1}(S_{t}, A_{t}) into the sum

\(G_{t:t+n} = Q_{t-1}(S_t, A_t) + \sum_{k=t}^{min(t+n, T)-1} \gamma^{k-t} [R_{k+1} + \gamma Q_k(S_{k+1}, A_{k+1}) - Q_{k-1}(S_k, A_k)]\)

#### Exercise 7.11

Show that if the approximate action values are unchanging, then the tree-backup return (7.16) can be written as a sum of expectation-based TD errors

\(\delta_t \doteq R_{t+1} + \gamma \bar{V}_t(S_{t+1}) - Q(S_t, A_t)\)

\(\bar{V}_t(s) \doteq \sum_a \pi(a|s)Q_t(s, a)\)

(7.16) \(G_{t:t+n} \doteq R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a|S_{t+1})Q(S_{t+1}, a) + \gamma \pi(A_{t+1}|S_{t+1})G_{t+1:t+n}\)

let's rewrite \(\bar{V}\) in terms of this weird action sum we're doing

\(\bar{V}_t(s) \doteq \pi(s, A_{t+1})Q(s,A_{t+1}) + \sum_{a \neq A_{t+1}} \pi(a|s)Q_t(s, a)\)

And therefore

\(\sum_{a \neq A_{t+1}} \pi(a|s)Q_t(s, a) = \bar{V}_t(s) - \pi(A_{t+1}|s)Q(s,A_{t+1})\)

so

\(G_{t:t+n} \doteq R_{t+1} + \gamma \bar{V}_t(S_{t+1}) - \gamma \pi(A_{t+1}|S_{t+1})Q(S_{t+1}, A_{t+1}) + \gamma \pi(A_{t+1}|S_{t+1})G_{t+1:t+n}\)

Let's observe that

\(\delta_t + Q(S_t, A_t) \doteq R_{t+1} + \gamma \bar{V}_t(S_{t+1})\)

So we can simplify

\(G_{t:t+n} \doteq \delta_t + Q(S_t, A_t) - \gamma \pi(A_{t+1}|S_{t+1})Q(S_{t+1}, A_{t+1}) + \gamma \pi(A_{t+1}|S_{t+1})G_{t+1:t+n}\)

\(G_{t:t+n} \doteq \delta_t + Q(S_t, A_t) + \gamma \pi(A_{t+1}|S_{t+1})[G_{t+1:t+n} - Q(S_{t+1}, A_{t+1})]\)

\(G_{t:t+n} \doteq \delta_t + Q(S_t, A_t) + \gamma \pi(A_{t+1}|S_{t+1})[[\delta_{t+1} + Q(S_{t+1}, A_{t+1}) + \gamma \pi(A_{t+2}|S_{t+2})[G_{t+2:t+n} - Q(S_{t+2}, A_{t+2})] - Q(S_{t+1}, A_{t+1})]\)

Distribute it back in

\(G_{t:t+n} \doteq \delta_t + Q(S_t, A_t) + \gamma \pi(A_{t+1}|S_{t+1})[\delta_{t+1} + Q(S_{t+1}, A_{t+1}) + \gamma \pi(A_{t+2}|S_{t+2})[G_{t+2:t+n} - Q(S_{t+2}, A_{t+2})] - \gamma \pi(A_{t+1}|S_{t+1})Q(S_{t+1}, A_{t+1})]\)

\(G_{t:t+n} \doteq \delta_t + Q(S_t, A_t) + \gamma \pi(A_{t+1}|S_{t+1})\delta_{t+1} + \gamma \pi(A_{t+1}|S_{t+1})Q(S_{t+1}, A_{t+1}) + \gamma \pi(A_{t+1}|S_{t+1})\gamma \pi(A_{t+2}|S_{t+2})[G_{t+2:t+n} - Q(S_{t+2}, A_{t+2})] - \gamma \pi(A_{t+1}|S_{t+1})Q(S_{t+1}, A_{t+1})\)

\(G_{t:t+n} \doteq \delta_t + Q(S_t, A_t) + \gamma \pi(A_{t+1}|S_{t+1})\delta_{t+1} + \gamma \pi(A_{t+1}|S_{t+1})\gamma \pi(A_{t+2}|S_{t+2})[G_{t+2:t+n} - Q(S_{t+2}, A_{t+2})]\)

Rewrite it, note that the product over an empty index set is 1.

\(G_{t:t+n} \doteq Q(S_t, A_t) + \prod_{i=t+1}^{t}\gamma\pi(A_i, S_i)\delta_t + \prod_{i=t+1}^{t+1}\gamma\pi(A_i, S_i) \delta_{t+1} + \prod_{i=t+1}^{t+2}\gamma\pi(A_i, S_i)[Q(S_{t+2}, A_{t+2}) + G_{t+2:t+n}]]\)

Expand out until terminal n (or terminal state)

\(G_{t:t+n} = Q(S_t, A_t) + \sum_{k=t}^{min(t+n-1, T-1)} \delta_k \prod_{i=t+1}^{k} \gamma \pi(A_i|S_i)\)

#### TODO Starred Exercises

This section is a bit odd – Chapter 7 has 5 exercises in section 7.4, which is a starred section. One of these exercises is starred, the rest are not. I'm treating these exercises as starred, and the starred exercise as double-starred, but that might be wrong?

Exercise 7.5 Write the pseudocode for the off-policy state-value prediction algorithm described above. (This one isn't technically starred? But it's in a starred section, so I'll put it here)

Exercise 7.6 Prove that the control variate in the above equations does not change the expected value of the return. ⇤

(double-starred) Exercise 7.7 Write the pseudocode for the off-policy action-value prediction algorithm described immediately above. Pay particular attention to the termination conditions for the recursion upon hitting the horizon or the end of episode. ⇤

Exercise 7.8 Show that the general (off-policy) version of the n-step return (7.13) can still be written exactly and compactly as the sum of state-based TD errors (6.5) if the approximate state value function does not change. ⇤

Exercise 7.9 Repeat the above exercise for the action version of the off-policy n-step return (7.14) and the Expected Sarsa TD error (the quantity in brackets in Equation 6.9). ⇤

Exercise 7.10 (programming) Devise a small off-policy prediction problem and use it to show that the off-policy learning algorithm using (7.13) and (7.2) is more data efficient than the simpler algorithm using (7.1) and (7.9). ⇤

### 8 - Planning and Learning with Tabular Methods

*status: checked with Claude Sonnet 3.5, provided with the textbook chapter as pdf*

#### Exercise 8.1

The nonplanning method looks particularly poor in Figure 8.3 because it is a one-step method; a method using multi-step bootstrapping would do better. Do you think one of the multi-step bootstrapping methods from Chapter 7 could do as well as the Dyna method? Explain why or why not.

The methods do slightly different things. The dyna method specifically is useful in these cases where the environment doesn't change very much and is deterministic, so you can just replay old experience and continually get useful updates without actually observing them. The multi-step bootstrapping methods are more about propagating information through more states per real update, which might more quickly arrive at a policy which arrives at the goal state from the start state using a completely greedy policy, but would still take longer to correct suboptimal path turns compared to something like dyna.

#### Exercise 8.2

Why did the Dyna agent with exploration bonus, Dyna-Q+, perform better in the first phase as well as in the second phase of the blocking and shortcut experiments?

Encouraging it to explore means it's more likely to correct errors in the model, which is important because otherwise dyna will fill the Q table with values based on unlikely events it happened to observe.

#### Exercise 8.3

Careful inspection of Figure 8.5 reveals that the difference between Dyna-Q+ and Dyna-Q narrowed slightly over the first part of the experiment. What is the reason for this?

When the models are correct, encouraging it to do extra exploration provides minimal value over doing more exploitation, which is done in Dyna-Q more than Dyna-Q+. Once exploration becomes more important, the advantage of Dyna-Q+ widens substantially, but there's a cost associated with that, it's not just always better.

#### Exercise 8.4 (programming)

The exploration bonus described above actually changes the estimated values of states and actions. Is this necessary? Suppose the bonus \(k \sqrt{t}\) was used not in updates, but solely in action selection. That is, suppose the action selected was always that for which Q(St, a) + \(k \sqrt{t(S_t, a)}\) was maximal. Carry out a gridworld experiment that tests and illustrates the strengths and weaknesses of this alternate approach.

import numpy as np import random # hardcoded for now def init_gridworld(): world = [["-" for _ in range(10)] for _ in range(10)] world[7] = ["x" for _ in range(9)] + ['-'] return world def apply_move(position, action, world): tall = len(world)-1 wide = len(world[0])-1 current_y = position[0] move_y = action[0] current_x = position[1] move_x = action[1] new_y = max(0, min(tall, current_y + move_y)) new_x = max(0, min(wide, current_x + move_x)) if world[new_y][new_x] == 'x': new_y = current_y new_x = current_x return -1, (new_y, new_x) # not very efficient just getting working def revalue_actions(state, acts, model, k, history): revalued_acts = [] for (act, val) in acts: t = 0 for h in history[::-1]: t += 1 if h == (state, act): break revalued_acts.append((act, val + k*np.sqrt(t))) return revalued_acts def follow_policy(Q, pos, eps=0.1, k=0.1, model=None, history=None): if pos == GOAL_STATE: return (0,0) if np.random.random() < eps: return random.choice(ACTIONS) valued_actions = [] for (a_y, a_x) in ACTIONS: if (pos, (a_y, a_x)) in Q.keys(): valued_actions.append(((a_y, a_x), Q[(pos, (a_y, a_x))])) if len(valued_actions) == 0: return random.choice(ACTIONS) else: if model is None or history is None: return max(valued_actions, key=lambda x: x[1])[0] else: return max(revalue_actions(pos, valued_actions, model, k, history), key=lambda x: x[1])[0] def Q_update(Q, alpha, state, action, reward, new_state): best_nextQ = -np.inf for a in ACTIONS: if (new_state, a) in Q: if Q[(new_state, a)] > best_nextQ: best_nextQ = Q[(new_state, a)] if best_nextQ == -np.inf: best_nextQ = 0 if (state, action) not in Q: Q[(state, action)] = 0 Q[(state, action)] += alpha * (reward + gamma * best_nextQ - Q[(state, action)]) return Q GOAL_STATE = (0, 4) START_STATE = (9, 3) ACTIONS = [] for a_y in [-1, 0, 1]: for a_x in [-1, 0, 1]: ACTIONS.append((a_y, a_x)) # Dyna-Q world = init_gridworld() alpha = 0.1 eps = 0.1 gamma = 0.9 n_episodes = 200 n_steps = 25 Q = {} model = {} for episode in range(n_episodes): steps = 0 pos = START_STATE if episode == 100: world[7][-1] = 'x' world[7][0] = '-' print("~~~~~~~~~~~~~~~~~") history = [] while pos != GOAL_STATE: action = follow_policy(Q, pos, eps=eps, model=model, history=history) reward, new_pos = apply_move(pos, action, world) Q = Q_update(Q, alpha, pos, action, reward, new_pos) model[(pos, action)] = (reward, new_pos) history.append((pos, action)) for _ in range(n_steps): (sim_s, sim_a) = random.choice(list(model.keys())) sim_r, sim_sp = model[(sim_s, sim_a)] Q = Q_update(Q, alpha, sim_s, sim_a, sim_r, sim_sp) pos = new_pos steps += 1 print(f"Completed in {steps} steps")

In general this seems much slower to learn compared to normal, since it adds a substantial premium to exploring new states. I get the sense that this is building a better Q table compared to the normal way, where the Q values more accurate reflect the actual expected returns (by virtue of not having the values polluted by the exploration bonus), at the cost of slower convergence for the best policy resulting from it.

#### Exercise 8.5

How might the tabular Dyna-Q algorithm shown on page 164 be modified to handle stochastic environments? How might this modification perform poorly on changing environments such as considered in this section? How could the algorithm be modified to handle stochastic environments and changing environments?

Replace (e) with a table of state transitions and rewards, and instead of just taking from it you sample from it instead. This might perform poorly because you'll take a long time to notice changes in the environment, since you'll just have to keep updating until the new change dwarfs the old one, so to make it handle changing environments you could keep a sort of running table of the most recent \(k\) visits to that state, instead of every one, which will naturally kill off information from older visits.

#### Exercise 8.6

The analysis above assumed that all of the b possible next states were equally likely to occur. Suppose instead that the distribution was highly skewed, that some of the b states were much more likely to occur than most. Would this strengthen or weaken the case for sample updates over expected updates? Support your answer.

Would probably strengthen; in the case where you have a lot of very unlikely successor states you waste a lot of potential calculating expectations compared to sampling, which will more directly explore the more likely trajectories much more often.

#### Exercise 8.7

Some of the graphs in Figure 8.8 seem to be scalloped in their early portions, particularly the upper graph for b = 1 and the uniform distribution. Why do you think this is? What aspects of the data shown support your hypothesis?

A very noteworthy thing about the scalloping in the graphs is that they all happen at the same time for each plot, regardless of whether it's using on-policy or uniform. If we assume that the on-policy and uniform cases make the same number of model updates per real update, it's probably at those cases where the model completes an episode and resets back to the start state. Since both models have done a number of real updates at the beginning of the MDP, it is likely that this update is going to be a pretty moderate change compared to later values which are likely to be more inaccurate (and thus change the overall estimate a lot more).

#### Exercise 8.8 (programming)

Replicate the experiment whose results are shown in the lower part of Figure 8.8, then try the same experiment but with b = 3. Discuss the meaning of your results.

This is a bit much to dump here so here's the colab.

I didn't do the smoothing / multiple runs part, and I am worried that my value estimation for the starting state is kind of unstable as a result. However, I did produce a graph which shows what is described (i.e. at branching factor 3 the on-policy method learns faster but the eventual value of the uniform sampling is higher after a longer period of time.) I imagine that as the branching factor grows,

### 9 - On-policy Prediction with Approximation

*Status: Checked with o1-preview*

#### Exercise 9.1

Show that tabular methods such as presented in Part I of this book are a special case of linear function approximation. What would the feature vectors be?

Linear function approximation estimates \(\hat{v}\) using the following form:

\(\hat{v} \doteq w^T x(s) \doteq \sum_{i=1}^{d} w_ix_i(s)\)

It takes this form because in linear function approximation, |w| << |S|. If we relax this assumption such that |w| = |S|, we can have |S| parameters, each with a weight in |w| representing that parameter's estimated v value. The features \(x\), as a result, would just be a one-hot vector of size |S| which is 0 in all states except for the single parameter corresponding to that state's estimated value. This makes linear function approximation reduce to keeping estimated v values for each state, as was done in every tabular method in part I.

#### Exercise 9.2

Why does 9.17 define (n+1)^{k} distinct features for dimension k?

I'm a bit confused by this question – if you have n features and k dimensions, then this product which includes 0 (i.e. not including the feature) will have (n+1)^{k} features. For example, take k=1, you have your original n features (plus the null feature) of (n+1). Now take k=2, you have each of your (n+1) features multiplied with each of the other (n+1) features, for (n+1)^{2} features. etc. You need the null feature because that way k=2 can have all of the k=1 features (e.g. that feature multiplied by 1).

#### Exercise 9.3

What n and c_{i,j} produce the feature vectors x(s) = (1, s1, s2, s1s2, s1^{2}, s2^{2}, s1s2^{2}, s1^{2s2}, s1^{2s2}^{2})^{T}?

This is n=2 c_{i, j} = {0, 1, 2}

#### Exercise 9.4

Suppose we believe that one of two state dimensions is more likely to have an effect on the value function than is the other, that generalization should be primarily across this dimension rather than along it. What kind of tilings could be used to take advantage of this prior knowledge?

It would be more advantageous to use tiling which (comparatively) ignores the second dimension in favor of providing higher granularity along the first dimension. For example, in the horizontal stripes tiling, we care a lot about the vertical axis, but don't differentiate between two examples which have the same vertical coordinate. We could use a much coarser second tiling for the horizontal axis, since we know ahead of time that it if less likely to vary, and we will get superior generalization compared to the case where we consider each axis equally.

#### Exercise 9.5

Suppose you are using tile coding to transform a seven-dimensional continuous state space into binary feature vectors to estimate a state value function vˆ(s,w) ⇡ v⇡(s). You believe that the dimensions do not interact strongly, so you decide to use eight tilings of each dimension separately (stripe tilings), for 7 x 8 = 56 tilings. In addition, in case there are some pairwise interactions between the dimensions, you also take all \({7 \choose 2}\) = 21 pairs of dimensions and tile each pair conjunctively with rectangular tiles. You make two tilings for each pair of dimensions, making a grand total of 21 x 2 + 56 = 98 tilings. Given these feature vectors, you suspect that you still have to average out some noise, so you decide that you want learning to be gradual, taking about 10 presentations with the same feature vector before learning nears its asymptote. What step-size parameter α should you use? Why?

per 9.19 you should use \(\alpha \doteq (10 \mathbb{E}[x^T x])^{-1}\)

In our case, you have 98 tilings. Specifically, you have eight different stripe tilings for each dimension, where for each dimension 1 will be active and the rest will be inactive. Likewise, for the pairwise ones, you have two different pairwise tilings for each dimension pair, for each 1 will be active and the rest will be inactive. This is important because no matter what vector x you take, you should have the same number of 1s and 0s, which means the dot product \(x^T x\) is a constant (i.e. 98). Since it's a constant, it's expectation is a constant, so we arrive at just \(\alpha \doteq \frac{1}{980}\)

#### Exercise 9.6

If \(\tau = 1\) and x(St)^{T} x(St) = E[x^{T} x], prove that (9.19) together with (9.7)
and linear function approximation results in the error being reduced to zero in one update.

(9.7) \(w_{t+1} = w_t + \alpha [U_t -\hat{v}(S_t, W_t)] \nabla \hat{v}(S_t, w_t)\)

(9.19) \(\alpha \doteq (\tau \mathbb{E}[x^T x])^{-1}\)

Things to note in linear methods:

(9.8) \(\hat{v}(s,w) \doteq w^Tx(s)\)

\(\nabla \hat{v}(s, w) = x(s)\)

note that we can also write 9.8 as

\(\hat{v}(s,w) \doteq x(s)^T w\) since that's just the same scalar value

substituting everything we get

\(w_{t+1} = w_t + (x(S_t)^T x(S_t))^{-1} [U_t - x(S_t)^T w_t] x(S_t)\)

\(w_{t+1}x(S_t)^T = w_tx(S_t)^T + (x(S_t)^T x(S_t))^{-1} [U_t - x(S_t)^T w_t] x(S_t)x(S_t)^T\)

\(w_{t+1}x(S_t)^T = w_tx(S_t)^T + U_t - x(S_t)^T w_t\)

\(w_{t+1}x(S_t)^T = U_t\)

\(x(S_t)^Tw_{t+1} = \hat{v}(S_t, w_{t+1}) = U_t\) by 9.8

Since U_{t} is estimating \(v_\pi(S_t)\) this means that the estimate and the actual values match, meaning there is zero error.

#### Exercise 9.7

One of the simplest artificial neural networks consists of a single semi-linear unit with a logistic nonlinearity. The need to handle approximate value functions of this form is common in games that end with either a win or a loss, in which case the value of a state can be interpreted as the probability of winning. Derive the learning algorithm for this case, from (9.7), such that no gradient notation appears.

(9.7) \(w_{t+1} = w_t + \alpha [U_t -\hat{v}(S_t, W_t)] \nabla \hat{v}(S_t, w_t)\)

semi-linear unit with logistic nonlinearity:

\(\hat{v}(S_t, w_t) = (1 + e^{-(\sum_{i=1}^n w_i s_i)})^{-1}\)

We do the chain rule here, because we want to take the derivative of this, which is the derivative of the logistic multiplied by the derivative of this sum

\(\nabla \hat{v}(S_t, w_t) = \hat{v}(S_t, w_t)(1-\hat{v}(S_t, w_t))S_t\)

so

\(w_{t+1} = w_t + \alpha [U_t - \hat{v}(S_t, w_t)] \hat{v}(S_t, w_t)(1-\hat{v}(S_t, w_t))S_t\)

I don't know how to move forwards from here

*Claude 3.5 Sonnet: This is the correct answer, and is the final form.* I'm not sure I buy this but I'll mark it as completed since I got the chain rule part.

#### TODO Starred Exercises to Revisit Later

*Exercise 9.8 Arguably, the squared error used to derive (9.7) is inappropriate for the case treated in the preceding exercise, and the right error measure is the cross-entropy loss (which you can find on Wikipedia). Repeat the derivation in Section 9.3, using the cross-entropy loss instead of the squared error in (9.4), all the way to an explicit form with no gradient or logarithm notation in it. Is your final form more complex, or simpler, than that you obtained in the preceding exercise?

### 10 - On-policy Control with Approximation

*Status: Checked with o1-preview*

#### Exercise 10.1

We have not explicitly considered or given pseudocode for any Monte Carlo methods in this chapter. What would they be like? Why is it reasonable not to give pseudocode for them? How would they perform on the Mountain Car task?

There would be a step that says "Generate an episode S_{0}, A_{0}, R_{1}, S_{1}, A_{1} … S_{T}". The first one of these in an environment like mountain car would take an unfathomably long time to complete, since it would need to solve the task using completely random actions. In these partially observable / huge state space cases, it's much better to be able to learn during episodes.

#### Exercise 10.2

Give pseudocode for semi-gradient one-step Expected Sarsa for control. ⇤

This would be the same as the existing method on page 247 with the following changes:

- We would remove the \(\tau\) line, since it would always just be equal to \(t\).
- We would replace the G sum with \(G = R_{t+1} + \gamma \sum_a \pi(a|S_{t+1}) \hat{q}(S_{t+1}, a)\)

The rest should be the same.

#### Exercise 10.3

Why do the results shown in Figure 10.4 have higher standard errors at large n than at small n?

If the n is too large, the time horizon for updating is too long. As a result, the greater distance between the current state and the state n steps ago creates a larger degree of uncertainty between the two states, which in turn increases the standard error.

#### Exercise 10.4

Give pseudocode for a differential version of semi-gradient Q-learning

On page 251 they give an example of differential semi-gradient Sarsa. Semi-gradient Q-learning would just be this algorithm, but replacing \(\delta \leftarrow R - \bar{R} + \hat{q}(S', A', w) - \hat{q}(S, A, w)\) with \(\delta \leftarrow R - \bar{R} + max_a \hat{q}(S', a, w) - \hat{q}(S, A, w)\)

#### Exercise 10.5

What equations are needed (beyond 10.10) to specify the differential version of TD(0)?

We need a differential form of the TD error in terms of values, i.e.

\(w_{t+1} \doteq w_t + \alpha \delta_t \nabla \hat{v}(S_{t}, w_t)\)

#### Exercise 10.6

Consider a Markov reward process consisting of a ring of three states A, B, and C, with state transitions going deterministically around the ring. A reward of +1 is received upon arrival in A and otherwise the reward is 0. What are the differential values of the three states?

The average reward here is 1/3. The differential reward is the difference between the reward vs the average reward. As a result the rewards are:

A = 2/3 B = -1/3 C = -1/3

#### Exercise 10.7

Suppose there is an MDP that under any policy produces the deterministic sequence of rewards +1, 0, +1, 0, +1, 0,… going on forever. Technically, this violates ergodicity; there is no stationary limiting distribution µ⇡ and the limit (10.7) does not exist. Nevertheless, the average reward (10.6) is well defined. What is it?

Now consider two states in this MDP. From A, the reward sequence is exactly as described above, starting with a +1, whereas, from B, the reward sequence starts with a 0 and then continues with +1, 0, +1, 0,…. We would like to compute the differential values of A and B. Unfortunately, the differential return (10.9) is not well defined when starting from these states as the implicit limit does not exist. To repair this, one could alternatively define the differential value of a state as

\(v_\pi(s) \doteq lim_{\gamma \rightarrow 1} lim_{h \rightarrow \infty} \sum_{t=0}^{h}\gamma^t (\mathbb{E}_\pi[R_{t+1}|S_0=s] - r(\pi))\)

Under this definition, what are the differential values of states A and B?

–

The average reward is 0.5

As a result the calculations for these for A are:

\((1 - 0.5) + \gamma(0 - 0.5) + \gamma^2(1 - 0.5) + etc\)

\(v(A) = 0.5 - \gamma 0.5 + \gamma^2 0.5 + etc\)

\(v(A) = 0.5(1 - \gamma + gamma^2 + etc)\)

\(v(A) = 0.5(1 / (1 + \gamma))\)

and as lim γ → 1, A = 0.25

Likewise the same for B, but now it's -0.5 + γ 0.5 etc, which means it will be -0.25

#### Exercise 10.8

The pseudocode in the box on page 251 updates R¯t using δ_{t} as an error
rather than simply Rt+1 - R¯t. Both errors work, but using δ_{t} is better. To see why,
consider the ring MRP of three states from Exercise 10.6. The estimate of the average
reward should tend towards its true value of 1/3 . Suppose it was already there and was
held stuck there. What would the sequence of Rt+1 - R¯t errors be? What would the
sequence of δ_{t} errors be (using Equation 10.10)? Which error sequence would produce
a more stable estimate of the average reward if the estimate were allowed to change in
response to the errors? Why?

The sequence of R_{t+1} - \bar{R}_{t} would be

2/3, -1/3, -1/3, 2/3, -1/3, -1/3, 2/3, -1/3, -1/3

We need the V values of each state to do the next part, so we solve using the differential values

V(A) = 2/3 + V(B)

V(B) = -1/3 + V(C)

V(C) = -1/3 + V(A)

One solution to this is V(A) = 0, V(B) = -2/3, V(C) = -1/3 (*note: I actually dunno how to do this very basic step without just setting V(A) to 0. Maybe I should go back to middle school. Claude seems to think it's normal to do this but I don't really get why it works.*)

the sequence of δ_{t} errors following \(R_{t+1} - \bar{R}_{t} +\hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w)\) would be

(2/3 + 0 - 2/3), (-1/3 + -1/3 - -2/3), (-1/3 + 0 - -1/3)

or

0, 0, 0, …

That is, the latter is more stable because if the reward and states are right, it doesn't perform updates at all, whereas the other one will cause oscillations since the updates aren't all 0.

#### Exercise 10.9

In the differential semi-gradient n-step Sarsa algorithm, the step-size parameter on the average reward, , needs to be quite small so that R¯ becomes a good long-term estimate of the average reward. Unfortunately, R¯ will then be biased by its initial value for many steps, which may make learning inecient. Alternatively, one could use a sample average of the observed rewards for R¯. That would initially adapt rapidly but in the long run would also adapt slowly. As the policy slowly changed, R¯ would also change; the potential for such long-term nonstationarity makes sample-average methods ill-suited. In fact, the step-size parameter on the average reward is a perfect place to use the unbiased constant-step-size trick from Exercise 2.7. Describe the specific changes needed to the boxed algorithm for differential semi-gradient n-step Sarsa to use this trick.

Per exercise 2.7, the constant-step-size trick is

\(\beta_n \doteq \alpha \bar{o}_n\)

where

\(\bar{o}_n \doteq \bar{o}_{n-1} + \alpha(1 - \bar{o}_{n-1})\)

So we'd have to make three primary changes to the algorithm for semi-gradient n-step sarsa.

First, we'd have to initialize \(\bar{o}_0\) at 0

Second, we'd have to replace \(\bar{R} \leftarrow \bar{R} + \beta\delta\) with \(\bar{R} \leftarrow \bar{R} + \alpha \bar{o}_n \delta\)

Finally, we'd have to repeatedly update \(\bar{o}_n\) after each action using \(\bar{o}_n \doteq \bar{o}_{n-1} + \alpha(1 - \bar{o}_{n-1})\)

### 11 - Off-policy Methods with Approximation

*Status: Checked with Claude Sonnet 3.5*

#### Exercise 11.1

Convert the equation of n-step off-policy TD (7.9) to semi-gradient form. Give accompanying definitions of the return for both the episodic and continuing cases

(7.9) \(V_{t+n} \doteq V_{t+n-1} + \alpha p_{t:t+n-1}[G_{t:t+n} - V_{t+n-1}(S_t)]\)

\(\hat{v}_{t+n}(S_t, w_t) \doteq \hat{v}_{t+n-1} + \alpha p_{t:t+n-1}[G_{t:t+n} - \hat{v}_{t+n-1}(S_t, w_t)]\)

where

\(w_{t+1} \doteq w_t + \alpha p_{t:t+n-1}[G_{t:t+n} - \hat{v}_{t+n-1}(S_t, w_t)] \nabla \hat{v}_{t+n-1}(S_t, w_t)\)

Episodic: \(G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^n \hat{v}_{t+n-1}(S_{t+n}, w_{t+n-1})\)

Continuous: \(G_{t:t+n} = R_{t+1} - \bar{R}_t + ... + R_{t+n} - \bar{R}_{t+n-1} + \hat{v}_{t+n-1}(S_{t+n}, w_{t+n-1})\)

#### Exercise 11.3

Exercise 11.3 (programming) Apply one-step semi-gradient Q-learning to Baird’s counterexample and show empirically that its weights diverge.

Code can be found in this colab.

#### TODO Starred Exercises to Complete Later

*Exercise 11.2 Convert the equations of n-step Q() (7.11 and 7.17) to semi-gradient form. Give definitions that cover both the episodic and continuing cases.

*Exercise 11.4 Prove (11.24). Hint: Write the RE as an expectation over possible states s of the expectation of the squared error given that St = s. Then add and subtract the true value of state s from the error (before squaring), grouping the subtracted true value with the return and the added true value with the estimated value. Then, if you expand the square, the most complex term will end up being zero, leaving you with (11.24). ⇤

### 12 - Eligibility Traces

*Status: Checked with o1-preview*

#### Exercise 12.1

Just as the return can be written recursively in terms of the first reward and itself one-step later (3.9), so can the λ-return. Derive the analogous recursive relationship from (12.2) and (12.1).

(12.1) \(G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^n \hat{v}(S_{t+n}, w_{t+n-1})\)

\(G_{t+1:t+n} = R_{t+2} + \gamma R_{t+3} + ... + \gamma^{n-2} R_{t+n} + \gamma^{n-1} \hat{v}(S_{t+n}, w_{t+n-1})\)

\(\gamma G_{t+1:t+n} = \gamma R_{t+2} + \gamma^2 R_{t+3} + ... + \gamma^{n-1} R_{t+n} + \gamma^n \hat{v}(S_{t+n}, w_{t+n-1})\)

\(G_{t:t+n} = R_{t+1} + \gamma G_{t+1:t+n}\)

(12.2) \(G_{t}^{\lambda} \doteq \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}\)

\(G_{t}^{\lambda} \doteq \sum_{n=1}^{\infty} \lambda^{n-1} [R_{t+1} + \gamma G_{t+1:t+n}]\)

\(G_{t}^{\lambda} \doteq \sum_{n=1}^{\infty} \lambda^{n-1} R_{t+1} + \sum_{n=1}^{\infty} \lambda^{n-1} \gamma G_{t+1:t+n}\)

geometric series converges to a/(1-r), second term is similar to the next G but we have to pull some terms out

\(G_{t}^{\lambda} \doteq \frac{R_{t+1}}{(1 - \lambda)} + \gamma \lambda G_{t+1}^\lambda\)

#### Exercise 12.2

Exercise 12.2 The parameter λ characterizes how fast the exponential weighting in Figure 12.2 falls off, and thus how far into the future the λ-return algorithm looks in determining its update. But a rate factor such as is sometimes an awkward way of characterizing the speed of the decay. For some purposes it is better to specify a time constant, or half-life. What is the equation relating λ and the half-life, \(\tau_\pi\), the time by which the weighting sequence will have fallen to half of its initial value?

If you want to cut things in half, that's ln(2), like the opposite of a tree. In each step, the step has a weight of \(\lambda^{t-1}\). If you want to find which time step makes this equal 1/2, that's \(\lambda^{\tau_\pi - 1} = 1/2\).

We can solve for \(\tau_\pi\), \(ln(\lambda^{\tau_\pi - 1}) = ln(1/2)\)

\((\tau_\pi - 1)ln(\lambda) = -ln(2)\)

\(\tau_\pi = -ln(2) / ln(\lambda) + 1\)

#### Exercise 12.3

Some insight into how TD(λ) can closely approximate the off-line λ-return algorithm can be gained by seeing that the latter’s error term (in brackets in (12.4)) can be written as the sum of TD errors (12.6) for a single fixed w. Show this, following the pattern of (6.6), and using the recursive relationship for the λ-return you obtained in Exercise 12.1.

error term: \([G_t^\lambda - \hat{v}(S_t, w_t)]\)

TD error 12.6: \(\delta_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w_t)\)

Recursive

\(G_{t}^{\lambda} \doteq \sum_{n=1}^{\infty} \lambda^{n-1} [R_{t+1} + \gamma G_{t+1:t+n}]\)

Rewriting:

\(G_t^\lambda - \hat{v}(S_t, w_t) = [\sum_{n=1}^{\infty} \lambda^{n-1} [R_{t+1} + \gamma G_{t+1:t+n}]] - \hat{v}(S_t, w_t)\)

\(G_t^\lambda - \hat{v}(S_t, w_t) = R_{t+1} + \lambda\gamma R_{t+2} + ... - \hat{v}(S_t, w_t)\)

\(G_t^\lambda - \hat{v}(S_t, w_t) = R_{t+1} + \lambda\gamma \hat{v}(S_{t+1}, w_t) - \lambda\gamma \hat{v}(S_{t+1}, w_t) + \lambda\gamma R_{t+2} + ... - \hat{v}(S_t, w_t)\)

\(G_t^\lambda - \hat{v}(S_t, w_t) = \delta_t + \lambda\gamma R_{t+2} + ... - \lambda\gamma \hat{v}(S_{t+1}, w_t)\)

etc etc, as done in other exercises

\(G_t^\lambda - \hat{v}(S_t, w_t) = \sum_{k=t}^\infty \lambda^{k-t}\gamma^{k-t}\delta_k\)

#### Exercise 12.4

Use your result from the preceding exercise to show that, if the weight updates over an episode were computed on each step but not actually used to change the weights (w remained fixed), then the sum of TD(λ)’s weight updates would be the same as the sum of the off-line λ-return algorithm’s updates

The off-line λ-return algorithm is

\(w_{t+1} \doteq w_t + \alpha [G_t^\lambda - \hat{v}(S_t, w_t)] \nabla \hat{v}(S_t, w_t)\)

which we showed can be written

\(w_{t+1} \doteq w_t + \alpha [\sum_{k=t}^{\infty}\gamma^{k-t}\delta_k] \nabla \hat{v}(S_t, w_t)\)

For each step in an episode, TD(λ) will update according to

\(z_t \doteq \gamma \lambda z_{t-1} + \nabla \hat{v}(S, w)\)

\(w_{t+1} \doteq w_t + \alpha\delta_tz_t\)

meaning if we dont change them we can sum across t

\(w_{t+1} \doteq w_t + \alpha[\sum_{k=t}^{\infty}\delta_k]z_t\)

With respect to z_{t}

\(z_t = \nabla \hat{v}(S_t,w_t)(1 + \gamma\lambda + (\gamma\lambda)^2 + ... )\)

\(\sum_{k=t}^{\infty}z_k = \sum_{k=t}^{\infty}[\gamma^{k-t}\lambda^{k-t}] \nabla\hat{v}(S_t, w_t)\)

z_{t} here would be updated t times where each time you multiply it by γλ and then add the gradient, so we can write this as

\(w_{t+1} \doteq w_t + \alpha[\sum_{k=t}^{\infty}\gamma^{k-t}\lambda^{k-t}\delta_k]\nabla\hat{v}(S_t, w_t)\)

this is the same as our off-line λ-return algorithm.

#### Exercise 12.5

Several times in this book (often in exercises) we have established that returns can be written as sums of TD errors if the value function is held constant. Why is (12.10) another instance of this? Prove (12.10).

It makes sense, n-step truncated TD(λ) is just a weighted sum of n-step TD methods for multiple n, which we have already shown in previous exercises to be a sum of TD errors, so this will be a weighted sum of sum of TD errors, i.e. a sum of TD errors.

\(\delta_t^{'} \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w_{t-1})\)

We start with the definition of truncated λ-return

\(G_{t:t+k}^{\lambda} \doteq (1 - \lambda) \sum_{n=1}^{h-t-1} \lambda^{n-1}G_{t:t+n} + \lambda^{h-t-1} G_{t:t+k}\)

\(G_{t:t+k}^{\lambda} \doteq (1 - \lambda) [G_{t:t+1} + \sum_{n=2}^{h-t-1} \lambda^{n-1}G_{t:t+n} + \lambda^{h-t-1} G_{t:t+k}]\)

\(G_{t:t+k}^{\lambda} \doteq (1 - \lambda) [R_{t+1} + \gamma \hat{v}(S_{t+1}, w_t) + \sum_{n=2}^{h-t-1} \lambda^{n-1}G_{t:t+n} + \lambda^{h-t-1} G_{t:t+k}]\)

\(G_{t:t+k}^{\lambda} \doteq (1 - \lambda) [\hat{v}(S_{t}, w_{t-1}) + \delta_t^{'} + \sum_{n=2}^{h-t-1} \lambda^{n-1}G_{t:t+n} + \lambda^{h-t-1} G_{t:t+k}]\)

\(G_{t:t+k}^{\lambda} \doteq (1 - \lambda) [\hat{v}(S_{t}, w_{t-1}) + \delta_t^{'} + \lambda G_{t:t+2} \sum_{n=3}^{h-t-1} \lambda^{n-1}G_{t:t+n} + \lambda^{h-t-1} G_{t:t+k}]\)

\(G_{t:t+k}^{\lambda} \doteq (1 - \lambda) [\hat{v}(S_{t}, w_{t-1}) + \delta_t^{'} + \lambda [R_{t+2} + \hat{v}(S_{t+2}, w_t)] \sum_{n=3}^{h-t-1} \lambda^{n-1}G_{t:t+n} + \lambda^{h-t-1} G_{t:t+k}]\)

…

\(G_{t:t+k}^\lambda = (1 - \lambda) ([\sum_{i=t}^{i+k-1} \lambda^{i-t}][...])\)

This first sum is equivalent to 1/(1-r) so that term cancels out the (1-λ) term

\(G_{t:t+k}^\lambda = \hat{v}(S_t, w_{t-1}) + \delta_t^{'} + \gamma\lambda \delta_{t+1}^{'} + ... + (\gamma\lambda)^{k}\delta_{t+k}\)

\(G_{t:t+k}^\lambda = \hat{v}(S_t, w_{t-1}) + \sum_{i=t}^{t+k-1} (\lambda \gamma)^{i-t}\delta_i^{'}\)

#### Exercise 12.6

Modify the pseudocode for Sarsa(λ) to use dutch traces (12.11) without the other distinctive features of a true online algorithm. Assume linear function approximation and binary features

(12.11) dutch traces \(z_t \doteq \gamma\lambda z_{t-1} + (1 - \alpha \gamma \lambda z_{t-1}^T x_t) x_t\)

In addition to initializing S, we will need to obtain an initial feature x as we do in True Online TD(λ), a feature function from states to vectors. When we take an action, we will need to observe R, S', and now x'.

We will replace the line \(z \leftarrow \gamma \lambda z\) with \(z \leftarrow \gamma\lambda z + (1 - \alpha \gamma \lambda z^T x) x\)

and when we also move \(x \leftarrow x'\)

#### Exercise 12.7

Generalize the three recursive equations above to their truncated versions,
defining G^{λ s}_{t:h} and G^{λ α}_{t:h}.

\(G_t^{\lambda a} \doteq R_{t+1} + \gamma_{t+1} ((1 - \lambda_{t+1})\hat{q}(S_{t+1}, A_{t+1}, w_t) + \lambda_{t+1}G_{t+1}^{\lambda a})\)

\(G_t^{\lambda a} \doteq R_{t+1} + \gamma_{t+1} ((1 - \lambda_{t+1})\hat{q}(S_{t+1}, A_{t+1}, w_t) + \lambda_{t+1}[R_{t+2} + \gamma_{t+2} ((1 - \lambda_{t+2})\hat{q}(S_{t+2}, A_{t+2}, w_{t+1}) + \lambda_{t+2}G_{t+2}^{\lambda a})])\)

\(G_t^{\lambda a} \doteq R_{t+1} + \gamma_{t+1} ((1 - \lambda_{t+1})\hat{q}(S_{t+1}, A_{t+1}, w_t) + \lambda_{t+1}R_{t+2} + \lambda_{t+1}\gamma_{t+2} ((1 - \lambda_{t+2})\hat{q}(S_{t+2}, A_{t+2}, w_{t+1}) + \lambda_{t+1}\lambda_{t+2}G_{t+2}^{\lambda a})])\)

\(G_t^{\lambda a} \doteq R_{t+1} + \lambda_{t+1}R_{t+2} + \gamma_{t+1} ((1 - \lambda_{t+1})\hat{q}(S_{t+1}, A_{t+1}, w_t) + \lambda_{t+1}\gamma_{t+2} ((1 - \lambda_{t+2})\hat{q}(S_{t+2}, A_{t+2}, w_{t+1}) + \lambda_{t+1}\lambda_{t+2}G_{t+2}^{\lambda a})])\)

\(G_t^{\lambda a} \doteq R_{t+1} + \lambda_{t+1}R_{t+2} + \gamma_{t+1} ((1 - \lambda_{t+1})\hat{q}(S_{t+1}, A_{t+1}, w_t) + \lambda_{t+1}\gamma_{t+2} ((1 - \lambda_{t+2})\hat{q}(S_{t+2}, A_{t+2}, w_{t+1}) + \lambda_{t+1}\lambda_{t+2}R_{t+3} + \lambda_{t+1}\lambda_{t+2}\gamma_{t+3} ((1 - \lambda_{t+3})\hat{q}(S_{t+3}, A_{t+3}, w_t) + \lambda_{t+1}\lambda_{t+2}\lambda_{t+3}G_{t+3}^{\lambda a})))])\)

\(G_t^{\lambda a} \doteq R_{t+1} + \lambda_{t+1}R_{t+2} + \lambda_{t+1}\lambda_{t+2}R_{t+3} + \gamma_{t+1} ((1 - \lambda_{t+1})\hat{q}(S_{t+1}, A_{t+1}, w_t) + \lambda_{t+1}\gamma_{t+2} ((1 - \lambda_{t+2})\hat{q}(S_{t+2}, A_{t+2}, w_{t+1}) + \lambda_{t+1}\lambda_{t+2}\gamma_{t+3} ((1 - \lambda_{t+3})\hat{q}(S_{t+3}, A_{t+3}, w_t) + \lambda_{t+1}\lambda_{t+2}\lambda_{t+3}G_{t+3}^{\lambda a})))])\)

…

\(G_{t:h}^{\lambda a} \doteq \sum_{k=t}^{h-1} \left(\prod_{i=t+1}^k \lambda_i\gamma_i\right)R_{k+1} + \left(\prod_{i=t+1}^k \lambda_i\gamma_i\right)(1-\lambda_{k+1})\hat{q}(S_{k+1}, A_{k+1}, w_k)\)

and likewise

\(G_{t:h}^{\lambda s} \doteq \sum_{k=t}^{h-1} \left(\prod_{i=t+1}^k \lambda_i\gamma_i\right)R_{k+1} + \left(\prod_{i=t+1}^k \lambda_i\gamma_i\right)(1-\lambda_{k+1})\bar{V}_k(S{k+1})\)

#### TODO Starred Exercises to do later

##### Exercise 12.8

Exercise 12.8 Prove that (12.24) becomes exact if the value function does not change. To save writing, consider the case of t = 0, and use the notation Vk = \hat{v}(Sk,w). ⇤

##### Exercise 12.9

The truncated version of the general off-policy return is denoted G^{λ s}_{t:h}.
Guess the correct equation, based on (12.24).

##### Exercise 12.10

Prove that (12.27) becomes exact if the value function does not change. To save writing, consider the case of t = 0, and use the notation Qk = qˆ(Sk, Ak, w). Hint: Start by writing out a 0 and Ga 0 , then Ga 0 Q0.

##### Exercise 12.11

The truncated version of the general o↵-policy return is denoted Ga t:h. Guess the correct equation for it, based on (12.27).

##### Exercise 12.12

Show in detail the steps outlined above for deriving (12.29) from (12.27). Start with the update (12.15), substitute Ga t from (12.26) for G t , then follow similar steps as led to (12.25).

##### Exercise 12.13

What are the dutch-trace and replacing-trace versions of o↵-policy eligibility traces for state-value and action-value methods?

##### Exercise 12.14

How might Double Expected Sarsa be extended to eligibility traces? ⇤

### 13 - Policy Gradient Methods

*Status: Checked with o1-preview*

#### Exercise 13.1

Use your knowledge of the gridworld and its dynamics to determine an exact symbolic expression for the optimal probability of selecting the right action in Example 13.1.

From state 1, you want to select right-left-right for an optimal return of -3

from state 2, you want to select left-right for an optimal return of -2

from state 3, you want to select right for an optimal return of -1

you don't know what state you're in, so the best that you can do is match the optimal first actions with equal probability, i.e. select right 2/3 of the time, select left 1/3 of the time.

#### Exercise 13.3

In Section 13.1 we considered policy parameterizations using the soft-max in action preferences (13.2) with linear action preferences (13.3). For this parameterization, prove that the eligibility vector is

(13.9) \(\nabla ln \pi(a|s, \theta) = x(s,a) - \sum_b \pi(b|s, \theta)x(s,b)\)

using the definitions and elementary calculus

(13.2) \(\pi(a|s, \theta) \doteq \frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}}\)

(13.3) \(h(s,a,\theta) = \theta^T x(s,a)\)

start from the eligibility vector

\(\nabla \pi(A_t|S_t,\theta_t) / \pi(A_t|S_t,\theta_t)\)

which from the text we can write as

\(\nabla ln \pi(a|s, \theta)\)

so

\(\nabla ln \pi(a|s, \theta) = \nabla \pi(a|s,\theta) / \pi(a|s,\theta)\)

substitute in 13.2

\(\nabla ln \pi(a|s, \theta) = \nabla \frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}} / \frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}}\)

substitute in 13.3

\(\nabla ln \pi(a|s, \theta) = \nabla \frac{e^{\theta^T x(s,a)}}{\sum_b e^{\theta^T x(s,a)}} / \frac{e^{\theta^T x(s,a)}}{\sum_b e^{\theta^T x(s,b)}}\)

use the same identity

\(\nabla ln \pi(a|s, \theta) = \nabla ln ( \frac{e^{\theta^T x(s,a)}}{\sum_b e^{\theta^T x(s,b)}} )\)

quotient rule

\(\nabla ln \pi(a|s, \theta) = \nabla (ln (e^{\theta^T x(s,a)}) - ln(\sum_b e^{\theta^T x(s,b)}))\)

\(\nabla ln \pi(a|s, \theta) = \nabla (\theta^T x(s,a) - ln(\sum_b e^{\theta^T x(s,b)}))\)

\(\nabla ln \pi(a|s, \theta) = \nabla\theta^T x(s,a) - \nabla ln(\sum_b e^{\theta^T x(s,b)}))\)

\(\nabla ln \pi(a|s, \theta) = x(s,a) - \nabla ln(\sum_b e^{\theta^T x(s,b)}))\)

The gradient of ln(f(x)) is f'(x)/f(x) due to chain rule

\(\nabla ln \pi(a|s, \theta) = x(s,a) - \frac{\nabla \sum_b e^{\theta^T x(s,b)}}{\sum_b e^{\theta^T x(s,b)}}\)

the gradient of the sum of exponentials also using the chain rule

\(\nabla ln \pi(a|s, \theta) = x(s,a) - \frac{\nabla(\theta^T x(s,b)) \sum_b e^{\theta^T x(s,b)}}{\sum_b e^{\theta^T x(s,b)}}\)

\(\nabla ln \pi(a|s, \theta) = x(s,a) - \frac{x(s,b) \sum_b e^{\theta^T x(s,b)}}{\sum_b e^{\theta^T x(s,b)}}\)

rather than cancel this upper term we can recognize that we are summing across b the fraction \(\frac{e^{\theta_T x(s,a)}}{\sum_b e^{\theta^T x(s,b)}}\) which is 13.2, so we can write this as

\(\nabla ln \pi(a|s, \theta) = x(s,a) - \sum_b \pi(b|s, \theta)x(s,b)\)

#### Exercise 13.4

Show that for the Gaussian policy parameterization (Equations 13.19 and 13.20) the eligibility vector has the following two parts:

\(\nabla ln \pi(a|s, \theta_\mu) = \frac{1}{\sigma(s, \theta)^2}(\alpha - \mu(s, \theta))x_\mu(s)\)

and

\(\nabla ln \pi(a|s, \theta_\sigma) = (\frac{(\alpha - \mu(s, \theta))^2}{\sigma(s,\theta)^2}-1)x_\sigma(s)\)

–

This seems extremely cumbersome

(13.19) \(\pi(a|s, \theta) \doteq \frac{1}{\sigma(s, \theta)\sqrt{2\pi}} exp(- \frac{(a - \mu(s,\theta))^2}{2\sigma(s, \theta)^2})\)

(13.20) \(\mu(s, \theta) \doteq \theta_\mu^T x_\mu(s)\) and \(\sigma(s, \theta) \doteq exp(\theta_\sigma^Tx_\sigma(s))\)

so the problem solving necessary here is just substituting and doing more calculus. For each part we will solve the upper term \(\nabla \pi (a|s, \theta_x)\) and then divide the result by \(\frac{1}{\sigma(s, \theta)\sqrt{2\pi}} exp(- \frac{(a - \mu(s,\theta))^2}{2\sigma(s, \theta)^2})\)

part A:

\(\nabla \pi (a|s, \theta_\mu) = \nabla(\frac{1}{\sigma(s, \theta)\sqrt{2\pi}} exp(- \frac{(a - \mu(s,\theta))^2}{2\sigma(s, \theta)^2}))\)

product rule

\(\nabla \pi (a|s, \theta_\mu) = \nabla(\frac{1}{\sigma(s, \theta)\sqrt{2\pi}})exp(- \frac{(a - \mu(s,\theta))^2}{2\sigma(s, \theta)^2}) + (\frac{1}{\sigma(s, \theta)\sqrt{2\pi}}) \nabla(exp(- \frac{(a - \mu(s,\theta))^2}{2\sigma(s, \theta)^2}))\)

first term goes to zero becasuse the partial is 0, since \(\sigma(s, \theta)\) is not using \(\theta_\mu\). So

\(\nabla \pi (a|s, \theta_\mu) = \frac{1}{\sigma(s, \theta)\sqrt{2\pi}} \nabla(exp(- \frac{(a - \mu(s,\theta))^2}{2\sigma(s, \theta)^2}))\)

now chain rule

\(\nabla \pi (a|s, \theta_\mu) = \frac{1}{\sigma(s, \theta)\sqrt{2\pi}} exp(- \frac{(a - \mu(s,\theta))^2}{2\sigma(s, \theta)^2}) * \nabla (- \frac{(a - \mu(s,\theta))^2}{2\sigma(s, \theta)^2})\)

This first term is cumbersome so let's just write it with 13.19

\(\nabla \pi (a|s, \theta_\mu) = \pi(a|s, \theta) * - \nabla (\frac{(a - \mu(s,\theta))^2}{2\sigma(s, \theta)^2})\)

quotient rule

\(\nabla \pi (a|s, \theta_\mu) = \pi(a|s, \theta) * -[(2\sigma(s, \theta)^2)(\nabla ((a - \mu(s,\theta))^2)) - ((a - \mu(s,\theta))^2)(\nabla(2\sigma(s, \theta)^2))]/[(2\sigma(s, \theta)^2)^2]\)

because it's a partial the right term goes to 0

\(\nabla \pi (a|s, \theta_\mu) = \pi(a|s, \theta) * -[(2\sigma(s, \theta)^2)(\nabla ((a - \mu(s,\theta))^2))] /[(2\sigma(s, \theta)^2)^2]\)

\(\nabla \pi (a|s, \theta_\mu) = \pi(a|s, \theta) * -[\nabla ((a - \mu(s,\theta))^2)] /[2\sigma(s, \theta)^2]\)

we can divide out the policy term to get what we are actually solving for so

\(\nabla ln(\pi (a|s, \theta_\mu)) = -[\nabla ((a - \mu(s,\theta))^2)] /[2\sigma(s, \theta)^2]\)

\(\nabla ln(\pi (a|s, \theta_\mu)) = -[-2(a - \mu(s,\theta) (\nabla(a - \mu(s,\theta)))] /[2\sigma(s, \theta)^2]\)

\(\nabla ln(\pi (a|s, \theta_\mu)) = [(a - \mu(s,\theta))x_\mu(s)] /[\sigma(s, \theta)^2]\)

pull out terms to get what we need

\(\nabla ln(\pi (a|s, \theta_\mu)) = \frac{1}{\sigma(s, \theta)^2} (a - \mu(s,\theta))x_\mu(s)\)

** part B**:

\(\nabla \pi (a|s, \theta_\sigma) = \nabla(\frac{1}{\sigma(s, \theta)\sqrt{2\pi}} exp(- \frac{(a - \mu(s,\theta))^2}{2\sigma(s, \theta)^2}))\)

We do the exact same as above but for the other direction

\(\nabla \pi (a|s, \theta_\sigma) = \nabla(\frac{1}{\sigma(s, \theta)\sqrt{2\pi}})exp(- \frac{(a - \mu(s,\theta))^2}{2\sigma(s, \theta)^2}) + (\frac{1}{\sigma(s, \theta)\sqrt{2\pi}}) \nabla(exp(- \frac{(a - \mu(s,\theta))^2}{2\sigma(s, \theta)^2}))\)

Let's tackle the nablas separately

chain rule

\(\nabla \frac{1}{\sigma(s,\theta)\sqrt{2\pi}} = \frac{\sqrt{2\pi}\nabla(\sigma(s, \theta))}{(\sigma(s, \theta)\sqrt{2\pi})^2}\)

\(\nabla \frac{1}{\sigma(s,\theta)\sqrt{2\pi}} = \frac{\sigma(s, \theta)\sqrt{2\pi} x_\sigma(s)}{(\sigma(s, \theta)\sqrt{2\pi})^2}\)

\(\nabla \frac{1}{\sigma(s,\theta)\sqrt{2\pi}} = \frac{x_\sigma(s)}{\sigma(s, \theta)\sqrt{2\pi}}\)

ok, now chain rule

\(\nabla exp(- \frac{(a - \mu(s,\theta))^2}{2\sigma(s, \theta)^2}) = exp(- \frac{(a - \mu(s,\theta))^2}{2\sigma(s, \theta)^2}) \nabla(-\frac{(a - \mu(s,\theta))^2}{2\sigma(s, \theta)^2})\)

\(\nabla(-\frac{(a - \mu(s,\theta))^2}{2\sigma(s, \theta)^2}) = - [(2\sigma(s, \theta)^2)(\nabla((a - \mu(s,\theta))^2)) - ((a - \mu(s,\theta))^2)(\nabla(2\sigma(s, \theta)^2))] / [(2\sigma(s, \theta)^2)^2]\)

since it's a partial the mu nabla term goes to 0

\(\nabla(-\frac{(a - \mu(s,\theta))^2}{2\sigma(s, \theta)^2}) = ((a - \mu(s,\theta))^2)(2\nabla(\sigma(s, \theta)^2))] / [(2\sigma(s, \theta)^2)^2]\)

chain rule again and again

\(\nabla(-\frac{(a - \mu(s,\theta))^2}{2\sigma(s, \theta)^2}) = ((a - \mu(s,\theta))^2)(2(2\sigma(s, \theta) \nabla(\sigma(s, \theta))))] / [(2\sigma(s, \theta)^2)^2]\)

\(\nabla(-\frac{(a - \mu(s,\theta))^2}{2\sigma(s, \theta)^2}) = ((a - \mu(s,\theta))^2)(2(2\sigma(s, \theta) \sigma(s, \theta)(x_\sigma(s))))] / [(2\sigma(s, \theta)^2)^2]\)

\(\nabla(-\frac{(a - \mu(s,\theta))^2}{2\sigma(s, \theta)^2}) = ((a - \mu(s,\theta))^2)(x_\sigma(s))] / \sigma(s, \theta)^2\)

coming back to the original we can factor out numerator from the first nabla and rewrite both terms using policy

\(\nabla \pi (a|s, \theta_\sigma) = x_\sigma(s) \frac{1}{\sigma(s, \theta)\sqrt{2\pi}}exp(- \frac{(a - \mu(s,\theta))^2}{2\sigma(s, \theta)^2}) - (\frac{1}{\sigma(s, \theta)\sqrt{2\pi}}) exp(- \frac{(a - \mu(s,\theta))^2}{2\sigma(s, \theta)^2}) ((a - \mu(s,\theta))^2)(x_\sigma(s))] / \sigma(s, \theta)^2\)

\(\nabla \pi (a|s, \theta_\sigma) = x_\sigma(s) \pi(a|s, \theta) - \pi(a|s, \theta) \frac{((a - \mu(s,\theta))^2)(x_\sigma(s))}{\sigma(s, \theta)^2}\)

\(\nabla ln\pi (a|s, \theta_\sigma) = x_\sigma(s) - \frac{((a - \mu(s,\theta))^2)(x_\sigma(s))}{\sigma(s, \theta)^2}\)

\(\nabla ln\pi (a|s, \theta_\sigma) = (\frac{(a - \mu(s,\theta))^2}{\sigma(s, \theta)^2} - 1)x_\sigma(s)\)

#### Exercise 13.5

A Bernoulli-logistic unit is a stochastic neuron-like unit used in some ANNs
(Section 9.7). Its input at time t is a feature vector x(S_{t}); its output, A_{t}, is a random
variable having two values, 0 and 1, with Pr{A_{t} = 1} = P_{t} and Pr{At = 0} = 1 - P_{t} (the
Bernoulli distribution). Let h(s, 0, θ) and h(s, 1, θ) be the preferences in state s for the
unit’s two actions given policy parameter θ. Assume that the difference between the
action preferences is given by a weighted sum of the unit’s input vector, that is, assume
that h(s, 1, θ) - h(s, 0, θ) = θ^{T} x(s), where θ is the unit’s weight vector.

(a) Show that if the exponential soft-max distribution (13.2) is used to convert action
preferences to policies, then P_{t} = π(1|St, θ_{t})=1/(1 + exp(-θ_{t}^{T} x(S_{t}))) (the logistic
function).

(b) What is the Monte-Carlo REINFORCE update of θ_{t} to θ_{t+1} upon receipt of return
G_{t}?

(c) Express the eligibility \(\nabla ln \pi(a|s, \theta)\) for a Bernoulli-logistic unit, in terms of a, x(s), and π(a|s, θ) by calculating the gradient.

Hint for part (c): Define P = π(1|s, θ) and compute the derivative of the logarithm, for
each action, using the chain rule on P. Combine the two results into one expression that
depends on a and P, and then use the chain rule again, this time on θ^{T} x(s), noting that
the derivative of the logistic function f(x)=1/(1 + e^{-x}) is f(x)(1 - f(x)).

–

(a) this would be \(\frac{e^{h(s,1,\theta)}}{e^{h(s,1,\theta)} + e^{h(s,0,\theta)}}\)

If we multiply top and bottom by \(1 / e^{h(s,1,\theta)}\) we get

\(\frac{1}{1 + \frac{e^{h(s,0,\theta)}}{e^{h(s,1,\theta)}}}\) which is also

\(\frac{1}{1 + e^{h(s,0,\theta) - h(s,1,\theta)}}\)

which is in turn the logistic function when using the given definition of θ^{T} x(s)

(c)

let's do as they suggest and define \(P = \pi(1|s, \theta)\), which means 1 = P and 0 = 1-P

since P is the logistic function we know \(\nabla P = P(1-P)\)

∇ ln(P) here is (1-P) P(1-P) x(s) = (1-P) x(s)

∇ ln(1-P) here is (-1/(1-P)) P(1-P) x(s) = -P x(s)

notice that P corresponds to a=1 and 1-P responds to a=0, meaning both are actually

(a-P) x(s)

(b) \(\theta \leftarrow \theta + \alpha \gamma^t G_t (A_t - P_t) x(S_t)\)

#### TODO Starred Exercises to Complete Later

*Exercise 13.2 Generalize the box on page 199, the policy gradient theorem (13.5), the proof of the policy gradient theorem (page 325), and the steps leading to the REINFORCE update equation (13.8), so that (13.8) ends up with a factor of t and thus aligns with the general algorithm given in the pseudocode.

### 14-16: No Exercises

### 17 - Frontiers

*Status: Checked with Claude Sonnet 3.5*

#### Exercise 17.1

This section has presented options for the discounted case, but discounting is arguably inappropriate for control when using function approximation (Section 10.4). What is the natural Bellman equation for a hierarchical policy, analogous to (17.4), but for the average reward setting (Section 10.3)? What are the two parts of the option model, analogous to (17.2) and (17.3), for the average reward setting? ⇤

(17.4) \(v_\pi(s) = \sum_{w \in \Omega(s)} \pi(w|s) [ r(s,w) + \sum_{s'}p(s'|s,w)v_\pi(s')]\)

to convert this to average reward we need to remove γ and then replace rewards with differences, here specifically for each option which we can denotate r(s,w,π)

\(v_\pi(s) = \sum_{w \in \Omega(s)} \pi(w|s) [ r(s,w) - r(s,w,\pi) + \sum_{s'}p(s'|s,w)v_\pi(s')]\)

the analogy to (17.2) would be

\(r(s,w) \doteq \mathbb{E}[R_1 - r(s,w,\pi) + R_2 - r(s,w\pi) + ... + R_\tau - r(s,w,\pi)]\)

and the analogy to 17.3 would be

\(p(s'|s,w) \doteq \sum_{k=1}^\infty Pr[S_k=s' | S_0 = s, A_{0:k-1} ~ \pi_w]\)

## Footnotes:

^{1}

Language modeling a close second, but loses points when considering that "RL for language models for games" counts as games.

^{2}

Not to mention fighting through it at a pretty rapid pace, considering it's intended to fill out two full semesters.

^{3}

i.e. there's a few in here that are like "this is the same as the previous exercise" which would obviously not fly if I was a student in a course but are fine here. I'm not trying to impress a grader, I am trying to powerlevel exercises and complete this textbook asap so I'll be unblocked on the work I need RL knowledge to perform well.

^{4}

Likewise, I read this book as self-study, so it's possible I just did literally everything wrong and the LLMs have misled me into thinking I would have passed a course based on this book

^{5}

This may read as obnoxious to say but I mostly mean relative to my peers in ML research, who typically have a lot more hours spent pen to paper on it.

^{6}

This is a bit of a euphemism.