Notes from David Silver's Reinforcement Learning course (Advanced Topics: RL), 2015 (COMPM050/COMPGI13).

Reinforcement learning characteristics:

- no supervisor (nothing top-down saying what's right and what's wrong as in supervised learning), only a reward signal
- feedback (reward) may be delayed (not instantaneous)
- e.g. giving up short-term gain may be better for long-term gain

- time is important (i.e. sequences of actions)
- the agent's actions affect the subsequent data it receives

A *reward* $R_t$ is a scalar feedback signal which indicates how well the agent is doing at step $t$.

Note that the reward must be scalar though you could take a reward vector over different aspects (e.g. this increases my income, but decreases my happiness) and take its dot product with some weights (e.g. I value my happiness more than my income) to retrieve a scalar value.

The *reward hypothesis* states that goals can be described by the maximization of expected cumulative reward and is the central assumption of reinforcement learning.

The term "reward" may be a bit confusing, since rewards may be negative as well, i.e. "punishment".

Rewards can come at each time step (for example, if you want an agent to do something quickly, you can set a reward of -1 for each time step), or at the end of the task, or at the end of each "episode" if the task has no end (but you need to chunk up time in some way), or for certain states, etc.

The general framework is that at each step $t$ the agent decides on and executes some action $A_t$, receives an observation $O_t$ of the environment, and receives a reward $R_t$.

The *history* is the sequence of observations, actions, and rewards:

$H_t = O_1, R_1, A_1, \dots, A_{t-1}, O_t, R_t$

We are essentially looking to build a mapping from this history to some action to take next.

Practically, we can't work with the full history, so we summarize it as *state*, some function of the history: $S_t = f(H_t)$.

We can represent a state for the environment as well (the environment state $S_t^e$, distinct from the agent's state $S_t^a$), though it is usually not visible to the agent.

We can more specifically define an *information state* (also called a *Markov state*), which is a state in which:

$P[S_{t+1} | S_t] = P[S_{t+1} | S_1, \dots, S_t]$

That is, we make a Markov assumption that the probability of the next state depends only on the current state. This assumption allows us to effectively ignore the history previous to the present state.

How we represent state greatly affects learning. Consider an experiment in which a rat either gets shocked or gets a piece of cheese. The rat observes the following two sequences:

- light, light, lever, bell -> shock
- bell, light, lever, lever -> cheese

Consider this sequence:

- lever, light, lever, bell

Will the rat get shocked or get cheese?

You might think "shock" because the same 3-item sequence is at the end (light, lever, bell). However, if you represent the state as numbers of each event, you'd say cheese (1 bell, 1 light, 2 levers). Or you could represent it as the entire 4-item sequence in which case it's not clear what the result will be.

In a *fully observable* environment, the agent directly observes environment state, i.e. $O_t = S_t^a = S_t^e$. This is a Markov decision process (MDP).

In a *partially observable* environment, the agent indirectly observes the environment (a lot of information is hidden). In this case the agent and environment states are not congruent. This is a partially observable Markov decision process (POMDP). Here we must construct the agent's state representation separately: one possibility is a distribution of beliefs over the environment state; another is to use a recurrent neural network to generate the next state; etc.

The main components that may be included in an RL agent:

- policy: the agent's behavior function
    - a map from state to action
    - it may be deterministic, i.e. $a = \pi(s)$
    - or it may be stochastic, i.e. $\pi(a|s) = P[A=a|S=s]$

- value function: how good is each state and/or action
    - a prediction of (i.e. expected) future reward
    - evaluates the goodness or badness of states
    - used to select between actions
    - e.g. $v_{\pi}(s) = E_{\pi}[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots | S_t = s]$
    - $V$ is the state-value function, which values states, and $Q$ is the action-value function, which values actions (given states)

- model: agent's representation of the environment
    - often has two components:
        - the *transitions* $P$ predict the next state (the "dynamics"), e.g. $P_{ss'}^a = P[S'=s'|S=s,A=a]$
        - the *rewards* $R$ predict the next immediate reward, e.g. $R_s^a = E[R|S=s,A=a]$
    - but there are many model-free methods as well

RL agents can be categorized according to which of these components they have:

- *value based* agents
    - have a value function
    - if it has a value function, the policy is implied (choose the highest-valued action)
- *policy based* agents
    - have an explicit policy
    - without a value function
- *actor critic* agents
    - have both a policy and a value function

Further distinction is made between model-free agents (the agent doesn't model the environment, just has a policy and/or value function) or model-based agents (has a model and a policy and/or value function).

Prediction vs control:

- prediction: evaluate the future, given a policy
- control: optimize the future, find the best policy

Markov decision processes formally describe a fully-observable environment for reinforcement learning. Almost all RL problems can be formalized as MDPs, including partially-observable RL problems: partially-observable problems can be converted into MDPs.

We make the Markov assumption (see above) for MDPs. That is, the current state captures all relevant information from the history, so the history can be thrown away.

We have a *state transition matrix* which encodes the state transition probabilities, i.e. the probability of going to some state $s'$ from a state $s$: $P_{ss'} = P[S_{t+1} = s' | S_t = s]$.

A Markov process then is a memoryless (i.e. Markov) random process, i.e. a stochastic sequence of states, i.e. a Markov chain. We define a Markov process as a tuple $(S, P)$, where $S$ is a finite set of states and $P$ is the state transition matrix.

A *Markov reward process* is a Markov chain with values. It is described by a tuple $(S, P, R, \gamma)$, where $R$ is a reward function, $R_s = E[R_{t+1} | S_t = s]$, and $\gamma \in [0, 1]$ is a discount factor.

The *return* $G_t$ is the total discounted reward from time-step $t$:

$G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$

The value function $v(s)$ of an MRP is the expected return starting from state $s$: $v(s) = E[G_t | S_t = s]$.

The Bellman Equation decomposes the value function into two parts:

- the immediate reward
$R_{t+1}$ - the discounted value of successor state
$\gamma v(S_{t+1})$

So it turns it into a recursive function:

$v(s) = E[R_{t+1} + \gamma v(S_{t+1}) | S_t = s]$

i.e.:

$v(s) = R_s + \gamma \sum_{s' \in S} P_{ss'} v(s')$

Or written more concisely using matrices:

$v = R + \gamma P v$

As an aside, the Bellman equation is a linear equation, so it can be solved directly:

$v = (I - \gamma P)^{-1} R$

where $I$ is the identity matrix.

This however has a computational complexity of $O(n^3)$ for $n$ states, so it is only feasible for small MRPs.

There are other iterative methods for solving MRPs which are more important/practical, e.g. dynamic programming, Monte-Carlo evaluation, and Temporal-Difference learning.
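The direct solve can be sketched in a few lines of NumPy; the tiny 3-state MRP below (transition matrix, rewards, discount) is made up purely for illustration:

```python
import numpy as np

# A made-up 3-state MRP: row-stochastic transition matrix P, reward vector R.
# The last state is absorbing (it only transitions to itself).
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([-1.0, -2.0, 0.0])
gamma = 0.9

# Direct solution of the Bellman equation v = R + gamma * P v,
# i.e. solve (I - gamma * P) v = R.
v = np.linalg.solve(np.eye(3) - gamma * P, R)
```

For large state spaces the $O(n^3)$ solve is replaced by the iterative methods below.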

A Markov decision process (MDP) is a Markov reward process with decisions. It is described by a tuple $(S, A, P, R, \gamma)$, where $A$ is a finite set of actions, $P_{ss'}^a = P[S_{t+1}=s' | S_t=s, A_t=a]$, and $R_s^a = E[R_{t+1} | S_t=s, A_t=a]$.

A policy $\pi$ is a distribution over actions given states: $\pi(a|s) = P[A_t = a | S_t = s]$.

A policy fully defines the behavior of an agent. Because it depends only on the current state (and not the history), it is said to be *stationary* (time-independent).

The *state-value function* $v_{\pi}(s)$ of an MDP is the expected return starting from state $s$ and then following policy $\pi$: $v_{\pi}(s) = E_{\pi}[G_t | S_t = s]$.

This is different from the value function of an MRP, which involves no policy for selecting actions; the process simply follows the transition dynamics.

The *action-value function* $q_{\pi}(s,a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$: $q_{\pi}(s,a) = E_{\pi}[G_t | S_t = s, A_t = a]$.

We can also define Bellman equations for these value functions (these are called Bellman expectation equations):

$v_{\pi}(s) = E_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s]$

$q_{\pi}(s,a) = E_{\pi}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]$

We can combine these:

$v_{\pi}(s) = \sum_{a \in A} \pi(a|s) \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_{\pi}(s') \right)$

There is some optimal state-value function, $v_*(s) = \max_{\pi} v_{\pi}(s)$.

Similarly, there is an optimal action-value function, $q_*(s,a) = \max_{\pi} q_{\pi}(s,a)$.

From this we can get the Bellman optimality equation:

$v_*(s) = \max_a q_*(s,a)$

$q_*(s,a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_*(s')$

Which can then be combined:

$v_*(s) = \max_a \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_*(s') \right)$

Which can equivalently be written:

$q_*(s,a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \max_{a'} q_*(s', a')$

The Bellman optimality equation is non-linear, unlike the Bellman equation we saw previously. In general there is no closed form solution, but we can use iterative methods, e.g. value iteration, policy iteration, Q-learning, and Sarsa.

"Dynamic" refers to the sequential or temporal aspect of the problem; "programming" refers to optimizing a "program", i.e. a policy.

Dynamic programming is a method for solving complex problems by breaking them into subproblems, then solving the subproblems (divide-and-conquer).

For dynamic programming to work, the problem must have two properties:

- optimal substructure; i.e. the optimal solutions of the subproblems tell us about the optimal solution for the overall problem
- the subproblems should "overlap"; i.e. they should recur throughout the overall problem. That way, by solving a subproblem, we are simultaneously solving many parts of the overall problem (the solutions can be cached and reused).

MDPs satisfy both of these properties with the Bellman equation (it is recursive decomposition) and the value function stores and reuses solutions.

Dynamic programming assumes full knowledge of the MDP, so it is used for *planning* in an MDP (i.e. prediction and control).

For *prediction* it takes a fully-specified MDP $(S, A, P, R, \gamma)$ and a policy $\pi$, and gives us the value function $v_{\pi}$.

For *control* it takes a fully-specified MDP and gives us the optimal value function $v_*$ (and thus an optimal policy $\pi_*$).

Policy evaluation involves taking a policy and an MDP and computing the expected reward for following that policy.

To solve this, we apply the Bellman expectation equation as an iterative update.

We start off with some arbitrary value function $v_1$ and apply *synchronous* backups (i.e. consider each state in turn):

- at each iteration $k+1$
- for all states $s \in S$
- update $v_{k+1}(s)$ from $v_k(s')$, where $s'$ is a successor state of $s$

This eventually converges to $v_{\pi}$.
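As a minimal sketch of iterative policy evaluation: under a fixed policy the MDP reduces to an MRP, so each synchronous backup is just one application of the Bellman expectation equation. The 2-state transition matrix and rewards below are illustrative:

```python
import numpy as np

# Under a fixed policy, the MDP reduces to an MRP with transitions P and rewards R.
# These numbers are made up for illustration.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
R = np.array([1.0, -1.0])
gamma = 0.9

v = np.zeros(2)  # arbitrary initial value function v_1
for _ in range(1000):
    v_new = R + gamma * P @ v  # synchronous Bellman expectation backup
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
```

The iterates converge to the same $v_{\pi}$ that the direct linear solve would produce.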

Now that we can evaluate a policy, we can figure out how to improve it.

Given a policy $\pi$, we alternate two steps:

- evaluate the policy $\pi$: $v_{\pi}(s) = E[R_{t+1} + \gamma R_{t+2} + \dots | S_t = s]$
- improve the policy by acting greedily wrt $v_{\pi}$: $\pi' = \text{greedy}(v_{\pi})$ (greedy just means we move to the state with the highest value)

And we can iteratively apply this approach (called *greedy policy improvement*), which will converge to the optimal policy $\pi_*$.

The policy and the value function influence each other, since the policy dictates which states are explored and the value function influences how the policy chooses states, so they push off each other to convergence.

With the greedy policy, we always choose the best state (remember that the value of a state takes into account future reward from that state as well!) so we update the value function for that state so that it is equal to or greater than it was before. This is how the greedy policy improves the value function.

Because the value function drives the greedy policy, improving the value function in turn improves the policy.

Eventually the value function will only be equal to (rather than greater than or equal to) what it was before; at this point convergence is achieved.

The solution is to apply the Bellman optimality equation as a one-step lookahead update:

$v_{k+1}(s) = \max_a \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_k(s') \right)$

Then apply these updates iteratively. This is *value iteration*.

Basically we iterate over states and apply this update until convergence, giving us the optimal value function.

The pseudocode is:

- using synchronous backups:
    - at each iteration $k + 1$
    - for all states $s \in S$
    - update $v_{k+1}(s)$ from $v_k(s')$
- until convergence

Unlike policy iteration, there is no explicit policy here, and the intermediate value functions may not correspond to any policy. But once we have $v_*$, we can read off an optimal policy by acting greedily with respect to it.
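The loop can be sketched as follows, on a made-up 2-state, 2-action MDP (all transition probabilities and rewards are invented for illustration):

```python
import numpy as np

# Made-up MDP: P[a] is the transition matrix for action a, R[a, s] the reward
# for taking action a in state s.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # action 0
              [[0.5, 0.5], [0.7, 0.3]]])  # action 1
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup: max over actions of R + gamma * P v
    v_new = np.max(R + gamma * P @ v, axis=0)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new

# Only at the end do we extract a policy, by acting greedily wrt v.
policy = np.argmax(R + gamma * P @ v, axis=0)
```

Note that the policy is computed once at the end; the iterations themselves work purely on value functions.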

The methods so far have used synchronous backups, i.e. each iteration we look at and update every single state simultaneously before updating any state again.

We can instead use *asynchronous backups* in which states are backed up individually in any order, without waiting for other states to update. This can significantly reduce computation. As long as we continue to select all states (again, order doesn't matter), we will still have convergence.

Three asynchronous dynamic programming methods:

Synchronous value iteration stores two copies of the value function, i.e. for all $s \in S$:

$v_{\text{new}}(s) \leftarrow \max_{a \in A} \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_{\text{old}}(s') \right)$

then $v_{\text{old}} \leftarrow v_{\text{new}}$.

*In-place* value iteration, on the other hand, only stores one copy of the value function, i.e. for all $s \in S$:

$v(s) \leftarrow \max_{a \in A} \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v(s') \right)$

That is, we immediately update the value function and always use the newest value function.

Sometimes the order can affect how quickly you reach the optimal value function. *Prioritised sweeping* uses the magnitude of the Bellman error to guide state selection, e.g.:

$\left| \max_{a \in A} \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v(s') \right) - v(s) \right|$

We backup the state with the largest remaining Bellman error, then update the Bellman error of affected states after each backup.

The magnitude of the Bellman error tells us how much our value estimate for that state changed; the intuition is that if this change was really big, we want to attend to that state first, since it will likely have a large influence on other values.

With *real-time dynamic programming* we only select and update the states that the agent actually visits.

These algorithms (i.e. algorithms based on the state-value function $v_{\pi}(s)$ or $v_*(s)$) have complexity $O(mn^2)$ per iteration, for $m$ actions and $n$ states; the equivalent algorithms based on the action-value function $q_{\pi}(s,a)$ or $q_*(s,a)$ have complexity $O(m^2 n^2)$.

Dynamic programming uses *full-width* backups which means we consider *all* actions and *all* successor states (i.e. we consider the full branching factor), which is really expensive. Furthermore, to actually do these branch lookaheads we need to know the MDP transitions and reward function (i.e. have a model of the dynamics of the environment).

So we run into the curse of dimensionality, where the number of states $n = |S|$ grows exponentially with the number of state variables.

One way around this is by sampling - instead of considering the entire branching factor, we sample a single complete trajectory.

But sometimes the problem is so big that even one backup is too expensive - in these cases, we can use *sample backups*, i.e. we start in a state, sample one action according to our policy, sample one transition according to our dynamics, etc, then backup this branch.

With sample backups, because we are sampling from the dynamics of the environment, this also frees us from needing a model of the dynamics of the environment.

We have an environment that can be represented as an MDP but we are not given the dynamics or reward function (i.e. we don't know what the MDP is). With *model-free* prediction methods, we can still learn a value function even without this knowledge (i.e. without needing to model the environment).

Not necessarily efficient, but effective.

MC methods learn directly from *complete* episodes of experience. Note that MC methods must learn from episodes (i.e. they are only applicable to *episodic MDPs*, where episodes eventually terminate). So by "complete" episode, this means the episode is run all the way to a terminal state. We do this a lot (each episode is a sample) and estimate the value of our start state as the mean return from these episodes.

Note that the *return* is the total discounted reward.

The value function is usually the expected return: $v_{\pi}(s) = E_{\pi}[G_t | S_t = s]$, where $G_t = R_{t+1} + \gamma R_{t+2} + \dots$

But Monte-Carlo policy evaluation uses the *empirical mean* return instead of the expected return (i.e. as mentioned before, we collect sample episodes and compute the mean of their returns).

So how do we get these empirical mean returns for *all* states in the environment?

There are two methods:

- *first-visit* Monte-Carlo policy evaluation: to evaluate a state $s$, the first time-step $t$ that state $s$ is visited in an episode (i.e. if the state is returned to later on, ignore it), increment the counter $N(s) \leftarrow N(s) + 1$, which tracks the number of visits to the state, and increment the total return $S(s) \leftarrow S(s) + G_t$; then the value is estimated as the mean return $V(s) = \frac{S(s)}{N(s)}$. By the law of large numbers, $V(s) \to v_{\pi}(s)$ as $N(s) \to \infty$; that is, with enough samples, this will converge on the true value.
- *every-visit* Monte-Carlo policy evaluation: same as the first-visit variant, except now we do these updates for *every* visit to the state (instead of just the first)
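A minimal sketch of first-visit MC evaluation; `sample_episode` is a hypothetical stand-in for running the policy in the environment, returning a list of `(state, reward)` pairs where the reward is the one received on leaving that state:

```python
from collections import defaultdict

def first_visit_mc(sample_episode, num_episodes, gamma=0.9):
    N = defaultdict(int)    # visit counts N(s)
    S = defaultdict(float)  # accumulated returns S(s)
    for _ in range(num_episodes):
        episode = sample_episode()
        # compute the return G_t at each step, working backwards
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        # only count the first visit to each state in this episode
        seen = set()
        for state, G in returns:
            if state not in seen:
                seen.add(state)
                N[state] += 1
                S[state] += G
    return {s: S[s] / N[s] for s in N}
```

Dropping the `seen` check turns this into the every-visit variant.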

The mean can be computed incrementally, i.e. we can do it in an online fashion. The mean $\mu_k$ of a sequence $x_1, \dots, x_k$ satisfies:

$\mu_k = \mu_{k-1} + \frac{1}{k} (x_k - \mu_{k-1})$

Using this, we'd change our Monte-Carlo update to:

$V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} (G_t - V(S_t))$

For non-stationary problems (which is the typical case) we might want to have a running mean (i.e. forget old episodes):

$V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))$

TD methods, like Monte-Carlo learning, learn from episodes, but they can learn from *incomplete* episodes by *bootstrapping* (similar to dynamic programming). We substitute the rest of the trajectory (the rest of the episode, before it is finished) with an estimate of what will happen from that state onwards. You take another step, generating an estimate for that step and updating the previous estimate with what you've learned from the new step.

So whereas with Monte-Carlo learning we update $V(S_t)$ towards the *actual* return $G_t$, with TD learning we update $V(S_t)$ towards the *estimated* return $R_{t+1} + \gamma V(S_{t+1})$:

$V(S_t) \leftarrow V(S_t) + \alpha (R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$

The quantity $R_{t+1} + \gamma V(S_{t+1})$ is called the *TD target*.

The quantity $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the *TD error* (the difference between our estimate before and after taking the step).

The fact that TD methods do not require complete episodes means they can be applied to non-terminating environments.

TDs are often more efficient for Markov environments because they exploit the Markov property (assuming the current state summarizes all previous states). MC methods, on the other hand, do not exploit this property.

Both MC and TD use samples, whereas dynamic programming does not (dynamic programming is exhaustive).

A compromise between TD (looks 1 step into the future) and MC (looks all the way to the end) is to look $n$ steps into the future.

So we define the $n$-step return:

$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$

So $n$-step TD learning updates the value function towards the $n$-step return:

$V(S_t) \leftarrow V(S_t) + \alpha (G_t^{(n)} - V(S_t))$

We don't have to commit to one value of $n$; we can average over *all* $n$-step returns, weighting the $n$-step return by $(1-\lambda)\lambda^{n-1}$, which gives the $\lambda$-return:

$G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$

So the update is (this is called *forward-view* $TD(\lambda)$):

$V(S_t) \leftarrow V(S_t) + \alpha (G_t^{\lambda} - V(S_t))$

Note that like MC this can only be computed from complete episodes.

There is also *backward-view* $TD(\lambda)$, which can update online, at every step, from incomplete sequences.

We compute *eligibility traces* for the credit assignment problem which helps us figure out the eligibility of the states we saw for credit assignment (i.e. responsibility for the reward at the end). These combine the frequency heuristic (assign credit to the most frequent states) and the recency heuristic (assign credit to the most recent states):

$E_0(s) = 0$

$E_t(s) = \gamma \lambda E_{t-1}(s) + \mathbb{1}(S_t = s)$

So in backward-view $TD(\lambda)$ we update every state in proportion to the TD error $\delta_t$ and its eligibility trace:

$V(s) \leftarrow V(s) + \alpha \delta_t E_t(s)$

When $\lambda = 0$, this is exactly equivalent to $TD(0)$; when $\lambda = 1$, the total update over an episode is equivalent to the Monte-Carlo update.
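Backward-view $TD(\lambda)$ can be sketched as below; `sample_episode` is a hypothetical stand-in returning a list of `(state, reward, next_state)` steps:

```python
from collections import defaultdict

def td_lambda(sample_episode, num_episodes, alpha=0.1, gamma=0.9, lam=0.8):
    V = defaultdict(float)
    for _ in range(num_episodes):
        E = defaultdict(float)  # eligibility traces, reset each episode
        for s, r, s2 in sample_episode():
            delta = r + gamma * V[s2] - V[s]  # TD error
            E[s] += 1.0  # bump eligibility of the visited state
            for state in list(E):
                V[state] += alpha * delta * E[state]
                E[state] *= gamma * lam  # decay all traces
    return V
```

With `lam=0` only the current state's value moves (plain $TD(0)$); with larger `lam`, earlier states along the trajectory share in each TD error.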

So how does an agent in a new environment learn how to maximize its reward? Another way of putting this: how do we optimize the value function of an unknown MDP?

There are two main paradigms for control:

- *on-policy*: learning about policy $\pi$ while sampling experiences from following $\pi$
- *off-policy*: learning about policy $\pi$ while sampling experiences from following some other policy $\mu$

Generally, the MDP might be unknown, but we can sample experiences, or the MDP might be known, but too big to use except by sampling it. So in either case, model-free control is appropriate.

We can take the same approach as we saw with dynamic programming: generalized policy iteration. That is, we evaluate our policy (estimate $v_{\pi}$ or $q_{\pi}$), improve it, and repeat.

So generalized policy iteration has two components:

- policy evaluation
- policy improvement (e.g. greedy policy improvement)

Unfortunately we can't just drop in a model-free policy evaluation method (e.g. Monte-Carlo evaluation) as our policy evaluation method to use with greedy policy improvement.

One reason is because of exploration - with model-free policy evaluation methods, we only sample experiences so there may be some we miss out on. If we use greedy policy improvement, we'll never explore those other experiences.

Another is because greedy policy improvement over $V(s)$ requires a model of the MDP:

$\pi'(s) = \argmax_{a \in A} \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a V(s') \right)$

That is, we need to know the transition dynamics and reward function to act greedily with respect to $V$.

However, greedy policy improvement over $Q(s,a)$ is model-free:

$\pi'(s) = \argmax_{a \in A} Q(s,a)$

So instead our Monte-Carlo policy evaluation estimates the action-value function, i.e. $Q \approx q_{\pi}$.

However, we still have the problem of exploration.

One way around this is *$\epsilon$-greedy exploration*, where with probability $\epsilon$ we choose an action at random, and with probability $1 - \epsilon$ we choose the greedy action.

For any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $\pi'$ with respect to $q_{\pi}$ is an improvement, i.e. $v_{\pi'}(s) \geq v_{\pi}(s)$.

So now our policy iteration strategy has become:

- policy evaluation: Monte-Carlo policy evaluation, $Q \approx q_{\pi}$
- policy improvement: $\epsilon$-greedy policy improvement
- every episode

We can do this update after each episode.

Given infinite time, this will eventually explore all states, but it could happen very slowly (there may be some hard-to-access states that take awhile to randomly reach). In practice, we don't want to explore ad infinitum, we want to explore until we find an optimal policy, then stop exploring.

*Greedy in the limit with infinite exploration* (GLIE) has two properties:

- all state-action pairs are explored infinitely many times: $\lim_{k \to \infty} N_k(s,a) = \infty$
- the policy converges on a greedy policy: $\lim_{k \to \infty} \pi_k (a|s) = \mathbb{1} (a = \argmax_{a' \in A} Q_k(s, a'))$

For example, $\epsilon$-greedy is GLIE if $\epsilon$ decays towards zero as $\epsilon_k = \frac{1}{k}$.

We can modify our Monte-Carlo control to be GLIE:

- sample the $k$th episode using $\pi$: $\{S_1, A_1, R_2, \dots, S_T\} \sim \pi$
- for each state $S_t$ and action $A_t$ in the episode:
    - $N(S_t, A_t) \leftarrow N(S_t, A_t) + 1$
    - $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)} (G_t - Q(S_t, A_t))$

That is, we keep track of how many times we've encountered this state-action pair and update its values according to how many times we've encountered it.

Then we improve the policy: $\epsilon \leftarrow \frac{1}{k}$, $\pi \leftarrow \epsilon\text{-greedy}(Q)$.

This converges to the optimal action-value function $q_*(s,a)$.

TD learning has several advantages over MC:

- lower variance
- online
- can use incomplete sequences

So we just use TD instead of MC for policy evaluation, where we apply TD to $Q(s,a)$. This gives us *Sarsa*:

$Q(S,A) \leftarrow Q(S,A) + \alpha (R + \gamma Q(S', A') - Q(S,A))$

We start with some state-action pair $(S, A)$, observe the reward $R$ and next state $S'$, then sample the next action $A'$ from our policy (hence the name: S, A, R, S', A').

So our (on-policy) policy iteration strategy has become:

- policy evaluation: Sarsa, $Q \approx q_{\pi}$
- policy improvement: $\epsilon$-greedy policy improvement
- every time-step

In greater detail:

- arbitrarily initialize $Q(s,a)$ for each $s \in \mathcal S, a \in \mathcal A(s)$, and set $Q(\text{terminal state}, \cdot) = 0$
- repeat, for each episode:
    - initialize $S$
    - choose $A$ from $S$ using policy derived from $Q$ (e.g. $\epsilon$-greedy)
    - repeat, for each step of episode:
        - take action $A$, observe $R, S'$
        - choose $A'$ from $S'$ using policy derived from $Q$ (e.g. $\epsilon$-greedy)
        - $Q(S,A) \leftarrow Q(S,A) + \alpha (R + \gamma Q(S', A') - Q(S, A))$
        - $S \leftarrow S'$, $A \leftarrow A'$
    - until $S$ is terminal

Here $\alpha$ is the step-size (learning rate) and $\gamma$ the discount factor.

This is on-policy because we are evaluating and improving the policy we are currently using.
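The pseudocode above can be sketched in Python; `env` is a hypothetical episodic environment exposing `reset() -> state` and `step(state, action) -> (reward, next_state, done)`:

```python
import random
from collections import defaultdict

def sarsa(env, actions, num_episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)], zero-initialized

    def eps_greedy(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        S = env.reset()
        A = eps_greedy(S)
        done = False
        while not done:
            R, S2, done = env.step(S, A)
            A2 = eps_greedy(S2)
            # Q(terminal, .) is treated as 0
            target = R if done else R + gamma * Q[(S2, A2)]
            Q[(S, A)] += alpha * (target - Q[(S, A)])
            S, A = S2, A2
    return Q
```

Note that `A2` is chosen by the same $\epsilon$-greedy policy being evaluated, which is what makes this on-policy.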

Sarsa converges to the optimal action-value function $q_*(s,a)$ under the following conditions:

- GLIE sequence of policies $\pi_t(a|s)$ (e.g. can use the same modifications we did with MC above)
- Robbins-Monro sequence of step-sizes $\alpha_t$: $\sum_{t=1}^{\infty} \alpha_t = \infty$ and $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$

These basically say:

- your step size is sufficiently large that your $Q$ value can change as much as it needs to (e.g. if your initial estimate is way off)
- eventually the changes to your $Q$ values need to trend towards 0

In practice we often don't worry about the Robbins-Monro sequence of step-sizes, and sometimes even GLIE, and Sarsa still works out.

What is the middle ground between TD and MC control?

We can again consider $n$-step returns.

The $n$-step Q-return is:

$q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})$

Then $n$-step Sarsa updates $Q(s,a)$ towards the $n$-step Q-return:

$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (q_t^{(n)} - Q(S_t, A_t))$

From this we get Sarsa$(\lambda)$.

With forward-view Sarsa$(\lambda)$, the $q^{\lambda}$ return combines all $n$-step Q-returns, weighting each by $(1-\lambda)\lambda^{n-1}$:

$q_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q_t^{(n)}$

And we use this with the general update:

$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (q_t^{\lambda} - Q(S_t, A_t))$

And in this way we are averaging over these $n$-step returns.

With backward-view Sarsa$(\lambda)$ we again use eligibility traces, but now we keep one trace per state-action pair:

$E_0(s,a) = 0$

$E_t(s,a) = \gamma \lambda E_{t-1}(s,a) + \mathbb{1}(S_t = s, A_t = a)$

Basically this says, every time we visit a state-action pair $(s,a)$ we bump up its eligibility, and the eligibility decays over time.

Sarsa$(\lambda)$ in greater detail:

- arbitrarily initialize $Q(s,a)$ for all $s \in \mathcal S, a \in \mathcal A(s)$
- repeat (for each episode):
    - $E(s,a) = 0$ for all $s \in \mathcal S, a \in \mathcal A(s)$
    - initialize $S, A$
    - repeat (for each step of episode):
        - take action $A$, observe $R, S'$
        - choose $A'$ from $S'$ using policy derived from $Q$ (e.g. $\epsilon$-greedy)
        - $\delta \leftarrow R + \gamma Q(S', A') - Q(S,A)$
        - $E(S,A) \leftarrow E(S,A) + 1$
        - for all $s \in \mathcal S, a \in \mathcal A(s)$:
            - $Q(s,a) \leftarrow Q(s,a) + \alpha \delta E(s,a)$
            - $E(s,a) \leftarrow \gamma \lambda E(s,a)$
        - $S \leftarrow S'$, $A \leftarrow A'$
    - until $S$ is terminal

Essentially what the $\lambda$ parameter controls is how far back along the trajectory credit (or blame) for a reward propagates.

Consider a situation where a long path is taken, with zero reward at each step, until the agent reaches a terminal state with a positive reward (one episode).

With $\lambda = 0$ (i.e. one-step Sarsa), only the value of the state-action pair immediately preceding the reward is updated within that episode.

With $\lambda > 0$, all the state-action pairs along the path receive updates, decaying with their distance from the reward, so information propagates back much faster.

In *off-policy learning* we evaluate a target policy $\pi(a|s)$ to compute $v_{\pi}(s)$ or $q_{\pi}(s,a)$, while following a different behavior policy $\mu(a|s)$.

This can be thought of as "learning by observation", i.e. the agent watches another agent perform some task, and from that observation it can actually learn a better policy than the one it observed.

This can also be thought of as "reusing" experience learned from old policies.

Off-policy learning has a couple advantages in what it enables us to do:

- learn about an optimal policy while following an *exploratory* policy
- learn about *multiple* policies while following one policy

So how can we do this?

Importance sampling gives us a means to estimate an expectation under one distribution using samples from another:

$E_{X \sim P}[f(X)] = \sum P(X) f(X) = \sum Q(X) \frac{P(X)}{Q(X)} f(X) = E_{X \sim Q}\left[\frac{P(X)}{Q(X)} f(X)\right]$
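As a concrete sketch, the identity above can be checked numerically; the discrete distributions $P$, $Q$ and function $f$ below are made up for illustration:

```python
import random

# Estimate E_P[f(X)] using samples drawn from Q, weighted by P(x)/Q(x).
random.seed(0)
P = {0: 0.9, 1: 0.1}   # target distribution
Q = {0: 0.5, 1: 0.5}   # behavior/sampling distribution
f = {0: 1.0, 1: 10.0}

n = 100_000
samples = random.choices([0, 1], weights=[Q[0], Q[1]], k=n)
estimate = sum(P[x] / Q[x] * f[x] for x in samples) / n
# true value: 0.9 * 1.0 + 0.1 * 10.0 = 1.9
```

The more $P$ and $Q$ disagree, the more the weights $P(x)/Q(x)$ vary, which is exactly the variance problem noted below.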

For off-policy Monte-Carlo, we can apply importance sampling and use returns generated from $\mu$ to evaluate $\pi$, weighting the return $G_t$ by the product of importance sampling ratios along the whole episode:

$G_t^{\pi/\mu} = \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} \frac{\pi(A_{t+1}|S_{t+1})}{\mu(A_{t+1}|S_{t+1})} \cdots \frac{\pi(A_T|S_T)}{\mu(A_T|S_T)} G_t$

Then update value towards the corrected return:

$V(S_t) \leftarrow V(S_t) + \alpha (G_t^{\pi/\mu} - V(S_t))$

(we can't use this if $\mu$ is zero where $\pi$ is non-zero)

However, importance sampling can increase variance to be too high to be practical (since Monte Carlo learning is already high variance). Because of the high variance, the target and behavior policies never get close enough to be useful.

So off-policy TD is preferred; it has much lower variance. We use the TD targets generated from $\mu$, corrected by a single importance sampling ratio:

$V(S_t) \leftarrow V(S_t) + \alpha \left( \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} (R_{t+1} + \gamma V(S_{t+1})) - V(S_t) \right)$

The best-known off-policy method is Q-learning, which does not require importance sampling.

We choose the next action using the behavior policy: $A_{t+1} \sim \mu(\cdot | S_t)$.

We also consider an alternative successor action from our target policy: $A' \sim \pi(\cdot | S_t)$.

Then we update $Q(S_t, A_t)$ towards the value of the alternative action:

$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (R_{t+1} + \gamma Q(S_{t+1}, A') - Q(S_t, A_t))$

There's a special case of Q-learning, which is usually what is meant when the Q-learning algorithm is referred to, where the target policy $\pi$ is greedy wrt $Q(s,a)$:

$\pi(S_{t+1}) = \argmax_{a'} Q(S_{t+1}, a')$

The behavior policy $\mu$ is e.g. $\epsilon$-greedy wrt $Q(s,a)$.

The Q-learning target then simplifies:

$R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')$

What happens here is both the behavior and target policies improve.

So the updates then are:

$Q(S,A) \leftarrow Q(S,A) + \alpha (R + \gamma \max_{a'} Q(S', a') - Q(S,A))$

and it converges to the optimal action-value function.
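This simplified Q-learning can be sketched much like Sarsa; `env` is again a hypothetical episodic environment with `reset() -> state` and `step(state, action) -> (reward, next_state, done)`:

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(num_episodes):
        S = env.reset()
        done = False
        while not done:
            # behavior policy: epsilon-greedy wrt Q
            if random.random() < epsilon:
                A = random.choice(actions)
            else:
                A = max(actions, key=lambda a: Q[(S, a)])
            R, S2, done = env.step(S, A)
            # target policy: greedy wrt Q, via the max over successor actions
            target = R if done else R + gamma * max(Q[(S2, a)] for a in actions)
            Q[(S, A)] += alpha * (target - Q[(S, A)])
            S = S2
    return Q
```

The only difference from the Sarsa sketch is the target: `max` over successor actions (the greedy target policy) rather than the action actually taken next (the behavior policy).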

If state spaces get very large (e.g. Go, with on the order of $10^{170}$ states), it becomes impractical to store a separate value for every state (or state-action pair).

Instead, we can consider *value function approximation* in which we don't need to distinctly map each state to some value, rather we use some function to approximate the value of states, such that nearby states take similar values (i.e. generalize to states we have not encountered before).

These allow us to scale up the model-free methods we've seen so far for both prediction and control.

So there is some true value function for a policy, $v_{\pi}(s)$, which we approximate with a parametrized function:

$\hat v(s, w) \approx v_{\pi}(s)$

where $w$ is a parameter vector (e.g. the weights of a neural network).

The function we learn could:

- take an input state $s$ and return $\hat v(s,w)$
- take an input state $s$ and an input action $a$ and return $\hat q(s,a,w)$
- take an input state $s$ and return values for all actions, $\hat q(s,a,w) \; \forall a \in \mathcal A$

We want our training method to work with non-stationary (our policy will be changing, causing its value function to change), non-iid (e.g. where I am next correlates highly with where I just was) data. This is a main difference between supervised and reinforcement learning.

Broadly, we can break value function approximation into two categories:

- *incremental methods*, in which the approximator (e.g. a neural network) makes an update after each time step (somewhat similar to online methods, but not exactly)
- *batch methods*, where the approximator looks at whole sets of data

As an aside, note that table lookup is actually a special case of linear value function approximation, where each state is a one-hot vector and its associated weight is its lookup value.

Basically, we use stochastic gradient descent to fit the approximator.

We can represent states with a feature vector.

Instead of computing mean squared error with respect to the true value function $v_{\pi}(s)$ (which we don't know), we substitute a target for it:

- for MC, the target is the return $G_t$, so the parameter update is:

$\Delta w = \alpha (G_t - \hat v(S_t, w)) \nabla_w \hat v(S_t, w)$

- for $TD(0)$, the target is the TD target $R_{t+1} + \gamma \hat v(S_{t+1}, w)$, so the parameter update is:

$\Delta w = \alpha (R_{t+1} + \gamma \hat v(S_{t+1}, w) - \hat v(S_t, w)) \nabla_w \hat v(S_t, w)$

- for $TD(\lambda)$, the target is the $\lambda$-return $G_t^{\lambda}$, so the parameter update is:

$\Delta w = \alpha (G_t^{\lambda} - \hat v(S_t, w)) \nabla_w \hat v(S_t, w)$

Monte-Carlo with value function approximation involves running episodes, and these episodes essentially are "training data", e.g. pairs $(S_1, G_1), (S_2, G_2), \dots, (S_T, G_T)$.

The result is basically a supervised learning problem. This will converge to a local optimum in both linear and non-linear cases.
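For the linear case this is very simple: $\hat v(s, w) = x(s)^T w$, so the gradient wrt $w$ is just the feature vector $x(s)$. A minimal sketch with made-up (state-feature, return) training pairs:

```python
import numpy as np

# Linear value function approximation: v_hat(s, w) = x(s) . w,
# trained by SGD towards Monte-Carlo returns G.
def mc_update(w, x_s, G, alpha=0.1):
    v_hat = x_s @ w
    # gradient of v_hat wrt w is just the feature vector x_s
    return w + alpha * (G - v_hat) * x_s

w = np.zeros(3)
# made-up training pairs: (features of visited state, observed return)
episodes = [(np.array([1.0, 0.0, 0.0]), 5.0),
            (np.array([0.0, 1.0, 0.0]), 2.0)] * 200
for x_s, G in episodes:
    w = mc_update(w, x_s, G)
```

Because the features here are one-hot, each weight converges to the mean return of its state, which is exactly the table-lookup special case mentioned earlier.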

TD learning ($TD(0)$) with value function approximation similarly treats TD targets as "training data", e.g. pairs $(S_1, R_2 + \gamma \hat v(S_2, w)), (S_2, R_3 + \gamma \hat v(S_3, w)), \dots$

But note that unlike Monte-Carlo, this sample is biased because we are using our neural network to estimate further reward at each step.

For control with value function approximation, we again use a form of generalized policy iteration:

- policy evaluation: approximate policy evaluation, $\hat q(\cdot, \cdot, w) \approx q_{\pi}$
- policy improvement: $\epsilon$-greedy policy improvement

So we want to approximate the action-value function (so we can remain model-free):

$\hat q(S, A, w) \approx q_{\pi}(S, A)$

Again we want to minimize the mean-squared error between the approximate action-value function $\hat q(s, a, w)$ and the true action-value function $q_{\pi}(s, a)$.

We can represent both the states and actions with feature vectors.

Like with approximating the value function, we don't know the true action-value function, so we can't optimize against it. Rather, we again substitute a target for $q_{\pi}(S, A)$:

- for MC, the target is the return $G_t$, so the parameter update is:

$\Delta w = \alpha (G_t - \hat q(S_t, A_t, w)) \nabla_w \hat q(S_t, A_t, w)$

- for $TD(0)$, the target is the TD target $R_{t+1} + \gamma \hat q(S_{t+1}, A_{t+1}, w)$, so the parameter update is:

$\Delta w = \alpha (R_{t+1} + \gamma \hat q(S_{t+1}, A_{t+1}, w) - \hat q(S_t, A_t, w)) \nabla_w \hat q(S_t, A_t, w)$

- for $TD(\lambda)$, the target is the $\lambda$-return $q_t^{\lambda}$. For forward-view $TD(\lambda)$:

$\Delta w = \alpha (q_t^{\lambda} - \hat q(S_t, A_t, w)) \nabla_w \hat q(S_t, A_t, w)$

For backward-view $TD(\lambda)$:

$\delta_t = R_{t+1} + \gamma \hat q(S_{t+1}, A_{t+1}, w) - \hat q(S_t, A_t, w)$

$E_t = \gamma \lambda E_{t-1} + \nabla_w \hat q(S_t, A_t, w)$

$\Delta w = \alpha \delta_t E_t$

Note that TD (weights) can sometimes blow up. A rough summary of when TD is guaranteed to converge ("ok") and when it *might* diverge ("maybe"):

| On/Off-Policy | Algorithm | Table Lookup | Linear | Non-Linear |
|---|---|---|---|---|
| On | MC | ok | ok | ok |
| On | TD(0) | ok | ok | maybe |
| On | TD($\lambda$) | ok | ok | maybe |
| Off | MC | ok | ok | ok |
| Off | TD(0) | ok | maybe | maybe |
| Off | TD($\lambda$) | ok | maybe | maybe |

Although MC works in all cases, it can be very, very slow (variance is too high), so we generally want to bootstrap (i.e. have $\lambda < 1$).

The reason TD may diverge in off-policy or non-linear approximations is because it does not follow the gradient of *any* objective function.

*Gradient TD* on the other hand follows the true gradient of projected Bellman error, so it will guaranteed converge for all cases.

For control algorithms:

| Algorithm | Table Lookup | Linear | Non-Linear |
|---|---|---|---|
| Monte-Carlo Control | ok | ~ok | maybe |
| Sarsa | ok | ~ok | maybe |
| Q-learning | ok | maybe | maybe |
| Gradient Q-learning | ok | ok | maybe |

Where "~ok" means it wiggles around the near-optimal value function (it might get a little worse, then get better, etc; we aren't guaranteed that every update will be an improvement).

The incremental methods waste experience - we have an experience, use it to update, then we toss that experience away. They aren't "sample efficient".

Batch methods try to find the value function that best fits all the data we've seen (all the experience).

With *experience replay* we just store all the experiences. Then, at every time step, we sample from these stored experiences and use them for a stochastic gradient descent update.

*Deep Q-networks* use experience replay and fixed Q-targets (essentially Q-learning):

- take action $a_t$ according to $\epsilon$-greedy policy
- store transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in replay memory $\mathcal D$
- sample random mini-batch of transitions $(s,a,r,s')$ from $\mathcal D$
- compute Q-learning targets wrt old, fixed parameters $w^-$
- optimize MSE between Q-network and Q-learning targets using a variant of stochastic gradient descent:

$\mathcal L(w) = E_{(s,a,r,s') \sim \mathcal D} \left[ \left( r + \gamma \max_{a'} Q(s', a'; w^-) - Q(s, a; w) \right)^2 \right]$

What prevents this from blowing up is experience replay and the fixed Q-targets.

Experience replay stabilizes updates by decorrelating the data (you're not seeing them in a specific sequence; they're randomized by sampling past experiences)

Here there are actually two neural networks (one with parameters $w$, which we are training, and a frozen one with old parameters $w^-$, which generates the Q-learning targets; every so often $w^-$ is replaced with the current $w$). Freezing the target network keeps the targets stable between updates.

Policy gradient methods work directly with the policy, i.e. without using a value or action-value function. The general class of methods that directly learn a policy is called *policy-based* reinforcement learning (in contrast to *value-based* reinforcement learning which just learns a value function).

To review: we have thus far tried to learn a value (or action-value) function, e.g. approximating it with parameters $w$: $\hat v(s, w) \approx v_{\pi}(s)$ or $\hat q(s, a, w) \approx q_{\pi}(s, a)$,

and generated an *implicit* policy *from* this value function by using some policy improvement method, e.g. $\epsilon$-greedy.

Policy gradient methods directly parametrize the policy itself:

$\pi_{\theta}(s, a) = P[a | s, \theta]$

So the policy itself can be approximated by some function approximator, e.g. a neural network.

Here we again focus on model-free methods.

Advantages of policy-based RL:

- better convergence properties: remember that value-based methods could wiggle around points or even diverge under certain conditions; policy-based methods tend to be more stable since you're just following the policy gradient
- effective in high-dimensional or continuous action spaces: we don't have to deal with any maximums, i.e. we don't have to figure out the maximum-valued action; this is good because that maximization itself can be very expensive, e.g. in continuous action spaces where there are many, many actions to maximize over. Policy-based methods give us the action to take in a more direct way.
- can learn stochastic policies: deterministic policies may be easily exploited (e.g. in rock-paper-scissors, where the optimal policy is actually a uniform random policy)

Disadvantages:

- typically converge to a local rather than global optimum
- evaluating a policy is typically inefficient and high variance: value-based methods, by using the max, aggressively push you towards what you think is the best policy, but policy gradient methods take small steps in the right direction.

Learning stochastic policies is useful in partially-observed environments (i.e. when we are dealing with a POMDP rather than an MDP), i.e. when we have limited information about states.

A simple example is the aliased gridworld, where we don't know our exact position; rather, we only have features like "is there a wall to my north", "is there a wall to my south", etc. *State aliasing* refers to environments where states which may actually be different can't be distinguished given the features we know about. For example, for the following grid world:

```
_____________________
| 0 | 1 | 2 | 3 | 4 |
|___|___|___|___|___|
| 5 | | 6 | | 7 |
|___| |___| |___|
```

Cells 1 and 3 are aliased, as are 5, 6 and 7. From the perspective of the agent, 1 and 3 are identical (both have walls just to the north and south), and 5, 6, and 7 are identical (both have walls to the west, east, and south).

A deterministic policy must always take the same action from aliased states. For example, if it learns to go left in cell 3, it will also go left in cell 1 (the two states are identical as far as the agent is concerned).

A stochastic policy, on the other hand, may learn to go left or right with 50/50 probability in cell 3 and will do the same for cell 1.

Remember that there is a deterministic optimal policy for any MDP, but this is not the case for POMDPs, where optimal policies may be stochastic.

So how do we optimize a policy? Remember, we want to find the best parameters

We have a few options:

- in episodic environments (i.e. where there is a clear notion of a "starting state") we can use the *start value*:
- in continuing environments we can use the *average value*:
- or the *average reward per time-step*:

where

Fortunately, the policy gradient is essentially the same for all of these objective functions.

So we want to optimize the policy, i.e. find

Let

Policy gradient algorithms search for a local maximum in

where

and

Computing gradients by finite differences just involves perturbing the parameters in each dimension and seeing if the objective function result is better:

- for each dimension $k \in [1,n]$:
    - estimate the $k$th partial derivative of the objective function wrt $\theta$ by perturbing $\theta$ by a small amount $\epsilon$ in the $k$th dimension:

- estimate

where

This requires

There are analytic ways of computing the gradient so we don't have to numerically compute it in each dimension.
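A minimal sketch of the finite-difference approach (the objective `J` and parameter vector below are made-up stand-ins, not anything from the lecture):

```python
import numpy as np

def finite_difference_gradient(J, theta, eps=1e-4):
    """Estimate the gradient of objective J at theta by perturbing each
    dimension separately (n + 1 evaluations with forward differences)."""
    grad = np.zeros_like(theta, dtype=float)
    J0 = J(theta)
    for k in range(len(theta)):
        u_k = np.zeros_like(theta, dtype=float)
        u_k[k] = eps
        grad[k] = (J(theta + u_k) - J0) / eps  # kth partial derivative
    return grad

# Toy check: J(theta) = -sum((theta - 1)^2), true gradient is -2(theta - 1)
theta = np.array([0.0, 2.0])
g = finite_difference_gradient(lambda t: -np.sum((t - 1.0) ** 2), theta)
```

Note the cost: one (or two, with central differences) objective evaluation per parameter dimension per gradient estimate, which is why analytic gradients are preferred.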

We assume the policy

We can use *likelihood ratios*. These exploit the following identity:

The quantity $\nabla_{\theta} \log \pi_{\theta}(s,a)$ is called the *score function*. You see this in maximum likelihood learning; here it essentially tells us how to adjust the policy to get more of a particular action.

For example, consider a (linear) softmax policy. We weight actions using a linear combination of features

The score function is:

i.e. the feature for the action we took, $\phi(s,a)$, minus the average feature over all the actions we could have taken.
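A minimal numeric check of this softmax score function, with made-up features and a two-action policy:

```python
import numpy as np

def softmax_policy(theta, features):
    """features has shape (n_actions, n_features); pi(a) is proportional
    to exp(phi(s,a) . theta)."""
    prefs = features @ theta
    prefs -= prefs.max()            # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def softmax_score(theta, features, a):
    """Score function: phi(s,a) minus the expected feature under pi."""
    pi = softmax_policy(theta, features)
    return features[a] - pi @ features

# Two actions with one-hot features; with theta = 0 the policy is uniform,
# so the score for action 0 is [1,0] - [0.5,0.5] = [0.5, -0.5]
phi = np.array([[1.0, 0.0], [0.0, 1.0]])
score = softmax_score(np.zeros(2), phi, a=0)
```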

For another example, we could use a (linear) Gaussian policy, which is good for continuous action spaces. The mean is a linear combination of state features

The policy is Gaussian, so

The score function is:

So how do we apply this?

Consider the simple case of a one-step MDP, i.e. our episodes consist of only one step. So we start in some state

We use likelihood ratios to compute the policy gradient:

You can see that this is still model-free since we replace

We can expand this to multi-step MDPs by just replacing the instantaneous reward

The *policy gradient theorem* tells us that this ends up being the true gradient of the policy (this holds for any of the policy objectives we mentioned, i.e. start state objective, average reward and average value objectives):

The policy gradient theorem: For any differentiable policy

We can use this for *Monte-Carlo Policy Gradient*, which is essentially the *REINFORCE* algorithm:

- update parameters by stochastic gradient ascent (get rid of the expectation, just sample it instead, see below)
- using policy gradient theorem
- using return $v_t$ as an unbiased sample of $Q^{\pi_{\theta}}(s_t, a_t)$:

REINFORCE:

- arbitrarily initialize $\theta$
- for each episode ${s_1, a_1, r_2, \dots, s_{T-1}, a_{T-1}, r_T} \sim \pi_{\theta}$ do (at the end of the episode):
    - for $t = 1$ to $T-1$ do:
        - $\theta \leftarrow \theta + \alpha \nabla_{\theta} \log \pi_{\theta}(s_t,a_t) v_t$
- return $\theta$
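A minimal REINFORCE sketch on a made-up one-step task (action 0 always pays 1, action 1 pays 0); the task and hyperparameters are mine, not the lecture's:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(prefs):
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

def reinforce(n_episodes=2000, alpha=0.1):
    """REINFORCE on a toy one-step episodic task: theta holds one
    preference per action, and each episode is a single action."""
    theta = np.zeros(2)
    for _ in range(n_episodes):
        pi = softmax(theta)
        a = rng.choice(2, p=pi)
        v = 1.0 if a == 0 else 0.0          # return of the one-step episode
        score = -pi                          # grad log pi for softmax:
        score[a] += 1.0                      # one-hot(a) - pi
        theta += alpha * score * v           # theta <- theta + alpha * score * v_t
    return theta

theta = reinforce()
```

After training, the policy should have shifted nearly all its probability onto the rewarded action.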

where, like before,

Monte-Carlo gradient methods, however, tend to be slow, again, because of high variance in the sampled rewards (for example, if playing an Atari game, in one episode I may get 10,000 points and in the next I may get 0; over the course of any episode there are so many things that could occur that can lead to very different outcomes; it is very noisy).

*Actor-critic policy gradient methods* try to get the best of both worlds - they learn a policy explicitly *and* learn a value function.

We use a *critic* to estimate the action-value function using a function approximator:

So instead of using the return to estimate the action-value function (which, as just stated, is high in variance), we estimate it directly with the critic. This reduces the variance so training is quicker.

So we have two sets of parameters:

- the critic: updates action-value function parameters
$w$ - the actor: updates policy parameters
$\theta$ , in direction suggested by critic

So the actor is the one that actually has the policy and is deciding and taking actions, the critic is just evaluating the actor.

Actor-critic algorithms follow an *approximate* policy gradient:

So we've replaced the true action value function

So the critic is basically just incorporating policy evaluation, i.e. how good is policy

For example, we could use a linear value function approximator

- initialize $s, \theta$
- sample $a \sim \pi_{\theta}$
- for each step do:
    - sample reward $r = \mathcal R_s^a$; sample transition $s' \sim P_s^a$
    - sample action $a' \sim \pi_{\theta}(s', a')$
    - $\delta = r + \gamma Q_w(s',a') - Q_w(s,a)$
    - $\theta = \theta + \alpha \nabla_{\theta} \log \pi_{\theta}(s,a) Q_w(s,a)$
    - $w \leftarrow w + \beta \delta \phi(s,a)$
    - $a \leftarrow a', s \leftarrow s'$

(note that because we are using TD we don't need to wait until the end of the episode, we can update at each step)
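A sketch of this actor-critic loop on a made-up single-state continuing task (action 0 pays 1, action 1 pays 0), with a linear critic over one-hot action features; every concrete choice here is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(prefs):
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

def qac(n_steps=5000, alpha=0.05, beta=0.1, gamma=0.5):
    """Action-value actor-critic: the critic Q_w(a) = w[a] is linear in
    one-hot action features, the actor is a softmax in theta."""
    theta = np.zeros(2)                      # actor parameters
    w = np.zeros(2)                          # critic parameters
    a = rng.choice(2, p=softmax(theta))
    for _ in range(n_steps):
        r = 1.0 if a == 0 else 0.0
        a_next = rng.choice(2, p=softmax(theta))
        delta = r + gamma * w[a_next] - w[a]  # TD error
        pi = softmax(theta)
        score = -pi
        score[a] += 1.0                       # grad log pi(s, a)
        theta += alpha * score * w[a]         # actor: step in critic's direction
        w[a] += beta * delta                  # critic: TD update (phi is one-hot)
        a = a_next
    return theta, w

theta, w = qac()
```

Note the update at each step, with no need to wait for the episode to end.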

This is similar to policy iteration: we evaluate the policy (like value-based RL), then improve it, except now we're improving it explicitly using its gradient (like policy-based RL).

How can we improve this?

We can reduce variance using a *baseline*. We subtract a baseline function $B(s)$ (any function of the state alone) from the policy gradient.

The expectation of the subtracted term is 0, so it does not affect the expectation with respect to our action-value function.

We choose a baseline function explicitly to reduce variance (e.g. mean 0, roughly the right scale, etc).

A good baseline is the state value function

So we subtract the state value function from the action-value function. This gives us what is called the *advantage function*

So we can rewrite the policy gradient using the advantage function:

So basically if we take an action that has a better-than-usual result, we update the policy to encourage (towards) the action; if worse-than-usual, we update the policy to discourage (away from) that action.

This can significantly reduce the variance of the policy gradient.

So what we really want is for the critic to estimate the advantage function, e.g. by estimating both

And then update both value functions by e.g. TD learning.

However, we can make this a bit simpler.

For the true value function

and it is actually an unbiased estimate of the advantage function:

Remember that

So we can just use the TD error to compute the policy gradient:

In practice we can use an approximate TD error:

So the critic only needs to learn one set of parameters,

We already know how to have the critic work on different time-scales (e.g. Monte Carlo requires the full episode,

We can do the same for the actor; i.e. have it consider the critic's response from multiple time-steps, i.e. the policy gradient can also be estimated at many time-scales.

The Monte-Carlo policy gradient uses error from the complete return,

The actor-critic policy gradient uses the one-step TD error,

(remember that the terms in parentheses, e.g.

We can combine the policy gradient with eligibility traces to get what we want - to consider the critic's response from multiple time-steps.

Like forward-view

where

Like backward-view

- by equivalence with $TD(\lambda)$, substituting $\phi(s) = \nabla_{\theta} \log \pi_{\theta}(s,a)$:

You can see it is similar to eligibility traces with value-based RL.

This update can be applied online, to incomplete sequences.

Previously we learned value functions directly from experience, then we looked at how to learn policies directly from experience. Those methods were all model-free. Here we'll see how we can learn a model directly from experience; i.e. the agent learns the dynamics of the environment (i.e. the state transition function and the reward function).

Then we can use that model to *plan*; i.e. construct a value function or policy by looking ahead/imagining/simulating what will happen in the future.

The general idea:

- the agent gains experience by interacting with the real world to learn a model for the world (i.e. an MDP)
- the agent uses this model to plan (e.g. solve the learned MDP), generating a value function or a policy
- the agent uses this value function or policy to decide how to act
- repeat

Advantages of model-based RL:

- can be more compact than learning a value function or policy directly
- can efficiently learn a model by supervised learning methods
- can reason about model uncertainty

Disadvantages:

- we first learn a model, then construct a value function: thus, two sources of approximation error

Formally, a model

We assume the state space

So a model

We typically assume conditional independence between state transitions and rewards:

So our goal is to estimate a model

This can be framed as a supervised learning problem:

Learning

So we pick a loss function, e.g. mean-squared error or KL divergence, and then find the parameters that minimize the empirical loss.

Some example models:

- table lookup model
- linear expectation model
- linear Gaussian model
- Gaussian process model
- deep belief network model
- etc

As a really simple example, consider a table lookup model.

We can just model

Similarly, to model rewards we can just take the average reward received for state-action pair

Alternatively, we can just record each experience tuple
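A sketch of such a table-lookup model (the class and method names are mine): we record experience tuples per state-action pair, and can either summarize them (mean reward) or sample from them directly.

```python
from collections import defaultdict
import random

class TableLookupModel:
    """Store experience tuples (s, a, r, s') and sample uniformly among
    those recorded for a given (s, a)."""
    def __init__(self):
        self.experience = defaultdict(list)   # (s, a) -> list of (r, s')

    def add(self, s, a, r, s_next):
        self.experience[(s, a)].append((r, s_next))

    def sample(self, s, a):
        """Simulate one step: pick a random recorded (r, s') for (s, a)."""
        return random.choice(self.experience[(s, a)])

    def mean_reward(self, s, a):
        rs = [r for r, _ in self.experience[(s, a)]]
        return sum(rs) / len(rs)

model = TableLookupModel()
model.add("A", "right", 0.0, "B")
model.add("A", "right", 1.0, "B")
```

Sampling a recorded tuple at random implicitly reproduces the empirical transition and reward probabilities.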

Given a model

One such algorithm is *sample-based planning*, which is a simple but powerful approach. We use the model *only* to generate samples, i.e. we sample experience from the model:

Then we apply model-free RL to the samples, e.g. Monte-Carlo control, Sarsa, Q-learning, etc.

Sample-based planning methods are often more efficient because, by sampling, we consider experiences that are more likely to happen.

This is basically the agent learning from the "imagined" or "simulated" world it has learned from experience.

It then acts based on its model-free learning and generates more experiences which further refine the model, which it then samples to learn a better value function or policy, which it then acts from, and so on.

This can be extended to incorporate uncertainties with Bayesian model-based reinforcement learning (not covered here).

The performance of model-based RL is limited to the optimal policy for the approximate MDP; i.e. it's only as good as the estimated model. So we have to consider the case where the model is imperfect/inaccurate, i.e.

There are two solutions:

- when the model is wrong, use model-free RL
- or reason explicitly about the model's uncertainty (e.g. Bayesian methods)

So we can generate experience in two ways:

- real experience, i.e. sampled from the environment (the true MDP)
- simulated experience, i.e. sampled from the model (the approximate MDP)

Model-free RL has no model and *learns* a value function (and/or policy) from *real* experience.

Model-based RL (here, using sample-based planning), learns a model from real experience and *plans* a value function (and/or) policy from *simulated* experience.

We can combine model-free RL and model-based RL with the *Dyna* architecture to make the most of both real and simulated experience:

- learn a model from real experience
- *learn and plan* value function (and/or policy) from *real and simulated* experience

The canonical Dyna algorithm is the *Dyna-Q* algorithm:

- initialize $Q(s,a)$ and $\text{Model}(s,a)$ for all $s \in \mathcal S$ and $a \in \mathcal A(s)$
- do forever:
    - $S \leftarrow$ current (nonterminal) state
    - $A \leftarrow \epsilon\text{-greedy}(S,Q)$
    - execute action $A$, observe resulting reward $R$ and state $S'$
    - $Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma \max_a Q(S', a) - Q(S,A)]$
    - $\text{Model}(S,A) \leftarrow R, S'$ (assuming deterministic environment)
    - repeat $n$ times:
        - $S \leftarrow$ random previously observed state
        - $A \leftarrow$ random action previously taken in $S$
        - $R,S' \leftarrow \text{Model}(S,A)$
        - $Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma \max_a Q(S', a) - Q(S,A)]$

If you train a Dyna-Q agent in one environment and then change the environment on it, its model suddenly becomes inaccurate, but, if exploration is encouraged (there will be more on this later; but one simple variation is Dyna-Q+ in which a little bonus is assigned to unexplored states) then it can reorient itself fairly quickly.
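A sketch of Dyna-Q on a made-up deterministic corridor task (the environment, seed, and hyperparameters are mine, not from the lecture):

```python
import random

random.seed(0)

# Toy deterministic corridor 0-1-2-3: moving right from state 2 reaches the
# goal (state 3, terminal) and pays reward 1; all other rewards are 0.
N, GOAL, ACTIONS = 4, 3, (-1, 1)

def step(s, a):
    s2 = min(max(s + a, 0), N - 1)
    return (1.0 if s2 == GOAL else 0.0), s2

Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
model = {}                                # (s, a) -> (r, s'): deterministic world

def eps_greedy(s, eps=0.1):
    if random.random() < eps:
        return random.choice(ACTIONS)
    best = max(Q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(s, a)] == best])

alpha, gamma, n_plan = 0.5, 0.9, 20
for _ in range(30):                       # episodes
    s = 0
    while s != GOAL:
        a = eps_greedy(s)
        r, s2 = step(s, a)
        # direct RL: Q-learning on real experience
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        model[(s, a)] = (r, s2)           # learn the (deterministic) model
        for _ in range(n_plan):           # planning: Q-learning on simulated experience
            ps, pa = random.choice(list(model))
            pr, ps2 = model[(ps, pa)]
            Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in ACTIONS) - Q[(ps, pa)])
        s = s2
```

The `n_plan` simulated updates per real step are what let Dyna-Q propagate a single real reward through the whole value table almost immediately.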

This part focuses on the planning part; i.e. expanding on sample-based planning.

The key ideas are sampling and forward search.

Forward search algorithms select the best action by lookahead. They build a search tree with the current state

*Simulation-based search* is a forward search paradigm that uses sample-based planning. We simulate episodes of experience from the current state with the model, i.e.:

then apply model-free RL to the simulated episodes, e.g.:

- if we use Monte-Carlo control, we have *Monte-Carlo search*
- if we use Sarsa, we have *TD search*

- given a model $\mathcal M_v$ and a *simulation policy* $\pi$
- for each action $a \in \mathcal A$:
    - simulate $K$ episodes from current (real) state $s_t$
    - evaluate the action by its mean return (*Monte-Carlo evaluation*)
- select the current (real) action with maximum value

(this approach is "simple" because we aren't improving the simulation policy

At time of writing, this is a state-of-the-art search method.

- given a model $\mathcal M_v$
- simulate $K$ episodes from current state $s_t$ using current simulation policy $\pi$
- build a search tree containing visited states and actions
- evaluate states $Q(s,a)$ by mean return of episodes from $s,a$ (Monte-Carlo evaluation):

(where

- after search is finished, select current (real) action with maximum value in the search tree:

So now instead of just evaluating the root actions, as we did with simple Monte-Carlo search, we're evaluating *every* state-action pair we visit.

With MCTS, the simulation policy

Each simulation consists of two phases: *in-tree* and *out-of-tree*:

- *tree policy* (improves): pick actions to maximize $Q(S,A)$; we can do this while we're "within" the tree
- *default policy* (fixed): pick actions randomly; we use this once we've gone "beyond" the tree (where we don't have any information)

Repeat for each simulation:

- evaluate states $Q(S,A)$ by Monte-Carlo evaluation
- improve tree policy, e.g. by $\epsilon\text{-greedy}(Q)$

This is basically Monte-Carlo control applied to simulated episodes of experience, starting from the current (i.e. root) state.

This converges on the optimal search tree,

For example: in a game of Go, white vs black, the agent is playing as black.

Starting at the current state, it rolls out (samples) one episode to a terminal state and sees that it wins:

So according to our tree policy, we select the next state in that branch (since it's the highest valued at the moment), and roll out another episode starting from there. We get to a terminal state and see that it loses:

So from our tree policy we pick another state and roll an episode out from there; at the terminal state it wins:

So then our tree policy directs us to the next state in that branch and we roll out an episode from there; at the terminal state it loses:

So then from our tree policy we pick another state and roll out an episode from there; at the terminal state it wins:

and so on.

(note that when I say "from our tree policy we pick another state and roll an episode out from there", technically we are always restarting from the root, i.e. rolling the episode out from the root, but our tree policy directs us to choose actions in such a way that we end up at the most promising node)

Advantages of MCTS:

- highly selective best-first search: we search through the space according to what is most promising, so we are quite efficient
- evaluates states dynamically (unlike e.g. DP): we don't have to worry about the entire state space, only the states that are relevant to us now
- uses sampling to break curse of dimensionality: we don't have to consider everything that could happen in the environment
- works for "black-box" models: we only need samples
- computationally efficient, anytime, parallelizable

MCTS is just one example of a family of very effective search algorithms. The key ideas are to use forward search and sampling.

For example, we can use TD instead of MC, i.e. we can use bootstrapping. So whereas MC tree search applies MC control to the sub-MDP from the current state, TD search applies Sarsa to the sub-MDP from the current state.

For simulation-based search, as we saw in model-free reinforcement learning, bootstrapping is helpful:

- TD search reduces variance but increases bias
- TD search is usually more efficient than MC search
- $TD(\lambda)$ search can be much more efficient than MC search

TD search:

- simulate episodes from current (real) state $s_t$
- estimate action-value function $Q(s,a)$
- for each step of simulation, update action-values by Sarsa:

- select actions based on action-values $Q(s,a)$, e.g. $\epsilon$-greedy
- you can also use function approximation for $Q$

This is an especially nice approach for domains where there are many ways to arrive at a given state. With bootstrapping, you'll already have some knowledge about a state if you re-visit it through another trajectory.

*Dyna-2* combines Dyna (using both real and simulated experience to learn and plan our value function and/or policy) with forward search. The agent stores two sets of feature weights (i.e. two value functions):

- long-term memory
- short-term (working) memory

Long-term memory is updated from real experience using TD learning (this can be thought of as general domain knowledge that applies to any episode).

Short-term memory is updated from simulated experience using TD search (this can be thought of as specific local knowledge about the current situation).

The value function is the sum of long and short-term memories.

How does a RL agent balance exploitation and exploration? That is, how do we balance making the best decision given what we know (exploitation) and gathering more information (exploration)?

The best long-term strategy may involve sacrificing some short-term gains by exploring.

Thus far we have just used $\epsilon$-greedy exploration.

Broadly, there are three approaches:

- random exploration
    - explore random actions (e.g. $\epsilon$-greedy, softmax)
- optimism in the face of uncertainty
    - estimate uncertainty on value
    - prefer to explore states/actions with highest uncertainty
- information state space
    - consider agent's information as part of its state
    - lookahead to see how information helps reward
    - this is the most "correct" way but also the most computationally difficult

We can explore two different spaces:

- state-action exploration
    - systematically explore state-space/action-space
    - e.g. pick a different action $A$ each time $S$ is visited
- parameter exploration
    - parametrize policy $\pi(A|S,u)$
    - e.g. pick different parameters and try them for a while
    - advantage: consistent exploration
    - disadvantage: doesn't know about state/action space

We'll focus on state-action exploration.

A very simplified version of the reinforcement learning problem is the *multi-armed bandit* problem, which is useful for introducing concepts around exploration vs exploitation.

A multi-armed bandit is a tuple

At each step

The *action-value* is the mean reward for an action

The *optimal value*

The *regret* is the opportunity loss for one step (i.e. the difference between the best we could have gotten and what we actually did get):

The *total regret* is the total opportunity loss:

So maximizing cumulative reward is equivalent to minimizing total regret.

The *count*

The *gap*

Regret is a function of gaps and the counts:

So basically, we want to have small counts of large gaps (i.e. we want to take few actions that are much worse than the optimal action).

However, we don't know what these gaps are.

With the strategies we've seen so far: greedy *never* explores, so we may just get stuck with a suboptimal action and total regret keeps accumulating linearly; $\epsilon$-greedy explores *forever*, so we will always sometimes randomly choose an action which is most likely to be suboptimal (not much of a chance we randomly pick the optimal action). Ideally, we want sublinear total regret (one whose growth flattens out over time).

First let's go more into the greedy algorithm:

- we consider algorithms that estimate $Q_t(a) \approx q(a)$
- we estimate the value of each action by Monte-Carlo evaluation, i.e.:

- the *greedy* algorithm selects the action with the highest value, i.e.:

As mentioned before, this can lock onto a suboptimal action forever, so it has linear total regret.

An enhancement on the greedy algorithm is the greedy algorithm with *optimistic initialization*, in which we initialize all values to the maximum possible, i.e.

Note that you can also initialize the counts

This optimistic initialization encourages exploration of unknown values, which is great. But a few unlucky samples can also cause the algorithm to lock onto a suboptimal action (though in practice, this method can work very well). So this has linear total regret as well.

Note that by assigning more confidence to the optimistic prior, you can withstand more unlucky samples and the algorithm will perform better.
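A sketch of greedy with optimistic initialization on made-up Bernoulli arms (arm probabilities, seed, and step count are mine):

```python
import random

random.seed(7)

def optimistic_greedy(probs, steps=5000, q_max=1.0):
    """Greedy bandit with optimistic initialization: every Q starts at the
    maximum possible reward with one phantom pull, so unexplored arms look
    attractive until real data pulls their estimates down."""
    k = len(probs)
    Q = [q_max] * k                      # optimistic initial values
    N = [1] * k                          # phantom count acting as the prior
    total = 0.0
    for _ in range(steps):
        a = max(range(k), key=lambda i: Q[i])      # purely greedy
        r = 1.0 if random.random() < probs[a] else 0.0
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]        # incremental mean
        total += r
    return Q, N, total

Q, N, total = optimistic_greedy([0.1, 0.9])
```

Raising the phantom count corresponds to putting more confidence in the optimistic prior, making the algorithm more robust to unlucky early samples.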

Now let's consider the decaying $\epsilon_t$-greedy algorithm, in which $\epsilon$ is reduced over time according to a schedule.

This has logarithmic asymptotic total regret.

In practice, that schedule is impossible because it requires knowledge of the gaps (i.e. of

So ideally we want to find an algorithm with sublinear regret for any multi-armed bandit which does not require knowledge of

There is a lower bound to total regret (i.e. no algorithm can possibly do better than this lower bound), so we want to get as close to that lower bound as possible.

The difficulty of a multi-armed bandit problem is in large part determined by the similarity between the optimal arm/action and the other arms/actions. That is, if suboptimal arms have a reward distribution very similar to the optimal one, it will take a long time to discern the optimal one (contrast to an easier situation where one arm is "obviously" optimal in that it consistently gives higher reward than all the other arms).

We can describe this formally with the gap

Theorem (Lai and Robbins):

Asymptotic total regret is at least logarithmic in number of steps:

Intuitively, the larger the gap

The principle of *optimism in the face of uncertainty* states that we should pick not the action that is certain to be best, but the action that has the potential to be best.

Consider the following example with three arms/actions:

We are most certain about

More formally (for "potential to yield higher rewards"), we estimate an upper confidence

- small $N_t(a) \to$ large $U_t(a)$ (estimated value is uncertain)
- large $N_t(a) \to$ small $U_t(a)$ (estimated value is accurate)

Then we select the action maximizing the *upper confidence bound* (UCB):

In statistics, there is a theorem known as *Hoeffding's Inequality*:

Let $X_1, \dots, X_t$ be i.i.d. random variables in $[0,1]$, and let $\bar X_t = \frac{1}{t} \sum_{\tau=1}^t X_\tau$ be the sample mean. Then:

$$P[\mathbb E[X] > \bar X_t + u] \leq e^{-2tu^2}$$

i.e. the probability that the true mean exceeds our empirical estimate by more than $u$ decays exponentially in $t$, and this holds for *any* distribution bounded in $[0,1]$.

We can apply Hoeffding's Inequality to the rewards of the bandit conditioned on selecting action

So we can pick a probability

For example, we can set

We can reduce

This ensures we select the optimal action as

This leads to the *UCB1 algorithm*:

This achieves a logarithmic asymptotic total regret.
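A sketch of UCB1 on made-up Bernoulli arms, using the standard bound $Q(a) + \sqrt{2 \ln t / N(a)}$ (arm probabilities and step count are mine):

```python
import math
import random

random.seed(0)

def ucb1(probs, steps=5000):
    """UCB1: play each arm once, then always pick the arm maximizing
    its value estimate plus an exploration bonus that shrinks as the
    arm's count grows."""
    k = len(probs)
    Q = [0.0] * k
    N = [0] * k
    def pull(a):
        return 1.0 if random.random() < probs[a] else 0.0
    for a in range(k):                   # initialization: try every arm once
        N[a] = 1
        Q[a] = pull(a)
    for t in range(k + 1, steps + 1):
        a = max(range(k),
                key=lambda i: Q[i] + math.sqrt(2 * math.log(t) / N[i]))
        r = pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]        # incremental mean
    return Q, N

Q, N = ucb1([0.1, 0.5, 0.9])
```

Suboptimal arms keep getting occasional pulls (the bonus grows like $\sqrt{\ln t}$), but only logarithmically often, which is what yields logarithmic total regret.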

This was the frequentist approach to multi-armed bandits; we made no assumptions about the distributions of the bandits.

Alternatively, we can use a Bayesian approach (*Bayesian bandits*) where we exploit prior knowledge about rewards,

Consider a distribution

Bayesian methods compute a posterior distribution over

So basically we describe

In more detail, the upper confidence bound method:

- we use Bayes' law to compute the posterior $p[w | R_1, \dots, R_{t-1}]$
- then we estimate an upper confidence from the posterior, e.g. a multiple of the posterior standard deviation
- then we pick the action that maximizes this upper confidence bound

So it's quite similar to UCB1, we just incorporate priors.

Alternative to the UCB method, we can use *probability matching*, where we select an action

Probability matching is optimistic in the face of uncertainty (uncertain actions have a higher probability of being max). But it can be difficult to compute

To do this, we can use *Thompson sampling*, which is sample-based probability matching:

We use Bayes law to compute the posterior distribution

For Bernoulli bandits, this achieves the Lai and Robbins lower bound on regret (i.e. it is optimal).
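A sketch of Thompson sampling for Bernoulli bandits (the arm probabilities, seed, and step count are made up): we keep a Beta posterior per arm, draw one sample from each posterior, and play the argmax.

```python
import random

random.seed(0)

def thompson(probs, steps=5000):
    """Thompson sampling: a Beta(successes + 1, failures + 1) posterior per
    arm; each step, sample from every posterior and play the best sample."""
    k = len(probs)
    wins = [0] * k
    losses = [0] * k
    for _ in range(steps):
        samples = [random.betavariate(wins[a] + 1, losses[a] + 1)
                   for a in range(k)]
        a = max(range(k), key=lambda i: samples[i])
        if random.random() < probs[a]:
            wins[a] += 1
        else:
            losses[a] += 1
    return wins, losses

wins, losses = thompson([0.2, 0.8])
```

Sampling from the posterior selects each arm exactly with the probability that it is the best arm, which is the probability-matching criterion.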

Note that "optimism in the face of uncertainty" won't work where the action-space is infinite (it will forever explore and never end up exploiting anything) or when it is expensive to explore (this isn't very "safe" exploration, e.g. if you have an expensive robot, you don't want it trying *everything*).

With exploration we gain information. If we could quantify the value of that information, we could compute the exploration-exploitation trade off perfectly (i.e. the long-term reward after getting information minus the immediate reward of exploitation).

Information gain is higher in uncertain situations - for instance, if you already know everything about a situation, there's no information to be gained. So we want to explore uncertain situations more in the optimal way.

We can introduce an *information state*

Each action

So now we have an MDP

For example, in a Bernoulli bandit (i.e. where there are only two outcomes, 0 or 1), we may have an information state

Now we have an infinite MDP, since there are infinitely many information states. We can solve this MDP with reinforcement learning, e.g. model-free reinforcement learning or Bayesian model-based reinforcement learning. The latter approach is known as *Bayes-adaptive* RL, where we characterize our information state as a posterior distribution.

For example, with Bernoulli bandits:

- start with a $\text{Beta}(\alpha_a, \beta_a)$ prior over reward function $\mathcal R^a$
- each time action $a$ is selected, update the posterior for $\mathcal R^a$:
    - $\text{Beta}(\alpha_a + 1, \beta_a)$ if reward is 0
    - $\text{Beta}(\alpha_a, \beta_a + 1)$ if reward is 1
- this defines the transition function $\tilde {\mathcal P}$ for the Bayes-adaptive MDP
- each information state $(\alpha, \beta)$ corresponds to a model $\text{Beta}(\alpha, \beta)$
- each state transition corresponds to a Bayesian model update
- solving the Bayes-adaptive MDP takes account of the value of information, because the effect of new information is factored into the model update

The Bernoulli case of the Bayes-adaptive MDP can be solved by dynamic programming; the solution is known as the *Gittins index*.
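A tiny sketch of these information-state transitions, following the update convention above (the reward sequence is hypothetical):

```python
def update_info_state(alpha, beta, reward):
    """One Bayes-adaptive state transition for a Bernoulli arm: the
    information state (alpha, beta) indexes a Beta model, and observing
    a reward moves us deterministically to the adjacent information state
    (alpha + 1 on reward 0, beta + 1 on reward 1, as above)."""
    if reward == 0:
        return alpha + 1, beta
    return alpha, beta + 1

state = (1, 1)                       # uniform Beta(1, 1) prior
for r in (1, 1, 0):                  # a hypothetical reward sequence
    state = update_info_state(*state, r)
```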

Exactly solving a Bayes-adaptive MDP is typically intractable, but methods like large-scale (large state space) planning methods (e.g. Monte-Carlo tree search) can be applied.

So far we've introduced state as only information state but we can bring in our previous notions of state as well.

A *contextual bandit* is a tuple

This is just like multi-armed bandits except that we've reintroduced states (also called "contexts").

At each step

- environment generates state $S_t \sim \mathcal S$
- agent selects action $A_t \in \mathcal A$
- environment generates reward $R_t \sim \mathcal R_{S_t}^{A_t}$

The goal is to maximize cumulative reward

(not covered in lecture)

The UCB approach can be generalized to full MDPs to give a model-free RL algorithm, i.e. we can maximize the upper confidence bound on the action-value function:

One thing worth noting is that this ignores that, in MDPs, the policy will likely improve (if we're doing control with the MDP) and the

You can also extend the general "optimism in the face of uncertainty" approach by replacing the reward for unknown or poorly estimated states with the maximum possible reward, as in the *Rmax algorithm*.

The information state space approach can also be extended to MDPs. Here, the augmented state is now