A quick refresher - a Markov Decision Process (MDP) involves a set of states and actions. Actions connect states with some uncertainty described by the dynamics

Markov Decision Processes as described so far have been fully-observed in the sense that we knew all of their parameters (transition probabilities and so on). Because everything was known in advance, we could conduct **offline planning**, that is, formulate a plan without needing to interact with the world.

MDP parameters aren't always known from the onset - we may not know the reward function **online planning**, in which we must interact with the world to learn more about it to better formulate a plan.

Online planning involves **reinforcement learning**, where agents can learn in what states rewards or goals are located without needing to know from the start.

Reinforcement learning in summary:

- the agent interacts with its environment and receives feedback in the form of rewards
- the agent's utility is defined by the reward function
- the agent must learn to act so as to maximize expected rewards
- learning is based on observed samples of outcomes

The ultimate goal of reinforcement learning is to learn a *policy* which returns an action to take given a state. To form a good policy we need to know the value of a given state; we do so by learning a *value function* which is the sum of rewards from the current state to some terminal state, following a fixed policy. This value function can be learned and approximated by any learning and approximation approach, e.g. neural networks.

One challenge with reinforcement learning is the **credit assignment problem** - a much earlier action could be responsible for the current outcome, but how is that responsibility assigned? And how it is quantified?

With reinforcement learning, we still assume a MDP, it's just not fully specified - that is, we don't know

If we do know **utility-based agent** can learn

If **Q-learning agent** can learn

A **reflex agent** can also directly learn the policy

Reinforcement learning agents can be *passive*, which means the agent has a fixed policy and learns

Alternatively, an *active* reinforcement learning agent changes its policy as it goes and learns.

Passive learning has the drawbacks that it can take awhile to converge on good estimates for the unknown quantities, and it may limit how much of the space is actually explored, and as such, there may be little or no information about some states and better paths may remain unknown.

We can broadly categorize various RL methods into three groups:

- critic-only methods, which first learn a value function and then use that to define a policy, e.g. TD learning.
- actor-only methods, which directly search the policy space. An example is an evolutionary approach where different policies are evolved.
- actor-critic methods, where a critic and an actor are both included and learned separately. The critic observes the actor and evaluates its policy, determining when it needs to change.

Model-based learning is a simple approach to reinforcement learning.

The basic idea:

- learn an approximate model (i.e.
$P(s'|s,a)$ , that is,$T(s,a,s')$ , and$R(s,a,s')$ ) based on experiences - solve for values (i.e. using value iteration or policy iteration) as if the learned model were correct

In more detail:

- learn an empirical MDP model
- count outcomes
$s'$ for each$s, a$ - normalize to get an estimate of
$\hat T(s,a,s')$ - discover each
$\hat R(s,a,s')$ when we experience$(s,a,s')$ - solve the learned MPD (e.g. value iteration or policy iteration)

In **temporal difference learning**, the agent moves from one state

We run this many times, reaching a terminal state, then restarting, and so on, to get better estimates of the rewards (utilities) for each state.

We keep track of rewards for visited states as

The main part of the algorithm is:

- if
$s'$ is new then$U[s'] = r'$ - if
$s$ is not null then - increment
$N_s[s]$ $U[s] = U[s] + \alpha(N_s[s])(r + \gamma U[s] - U[s])$

Where

Another way of thinking of TD learning:

Consider a sequence of values

We can rearrange these terms to give us:

The term *temporal difference error*. Basically our estimate

This approach treats all values with equal weight, though we may want to decay older values.

One method of active reinforcement learning is the greedy approach.

Given new estimates for the rewards, we recompute a new optimal policy and then use that policy to guide exploration of the space. This gives us new estimates for the rewards, and so on, until convergence.

Thus it is greedy in the sense that it always tries to go for policies that seem immediately better, although in the end that doesn't necessarily guarantee the overall optimal policy (this is the exploration vs exploitation problem).

One alternate approach is to randomly try a non-optimal action, thus exploring more of the space. This works, but can be slow to converge.

A approach better than TDL is to use an **exploration agent**, which favors exploring more when it is uncertain. More specifically, we can use the same TDL algorithm, but while

With model-free learning, instead of trying to estimate

With **passive reinforcement learning**, the agent is given an existing policy and just learns from the results of that policy's execution (that is, learns the state values; i.e. this is essentially just policy evaluation, except this is *not* offline, this involves interacting with the environment).

To compute the values for each state under **direct evaluation**:

- act according to
$\pi$ - every time we visit a state, record what the sum of discounted rewards turned out to be
- average those samples

Direct evaluation is simple, doesn't require any knowledge of *eventually* gets the correct average values. However, it throws out information about state connections, since each state is learned separately - for instance, if we have a state

Policy evaluation, on the other hand, *does* take in account the relationship between states, since the value of each state is a function of its child states, i.e.

However, we don't know

Where each

This is called **sample-based policy evaluation**.

One challenge here: when you try an action, you end up in a new state - how do you get back to the original state to try another action? We don't know anything about the MDP so we don't necessarily know what action will do this.

So really, we only get one sample, and then we're off to another state.

With *temporal difference learning*, we learn from each experience ("episode"); that is, we update

Basically, we have an estimate

So we specify a learning rate

This is an exponential moving average.

This update can be re-written as:

The term

So we still never learn

However, it doesn't help with coming up with a new policy, since we need Q-values to do so.

With *active reinforcement learning*, the agent is actively trying new things rather than following a fixed policy.

The fundamental trade-off in active reinforcement learning is **exploitation vs exploration**. When you land on a decent strategy, do you just stick with it? What if there's a better strategy out there? How do you balance using your current best strategy and searching for an even better one?

Remember that value iteration requires us to look at

However, we can't compute maximums from samples since the maximum is always unknown (there's always the possibility of a new sample being larger; we can only compute averages from samples).

We can instead iteratively compute Q-values (Q-value iteration):

Remember that while a value

Here the

This is the basis of the **Q-Learning algorithm**, which is just sample-based Q-value iteration.

We learn

- take an action
$a$ from a state$s$ and see the outcome as a sample$(s,a,s',r)$ . - consider the old estimate
$Q(s,a)$ - consider the new sample estimate:
$\text{sample} = R(s,a,s') + \gamma \max_{a'} Q(s', a')$ , where$R(s,a,s') = r$ , i.e. the reward we just received - incorporate this new estimate into a running average:

This can also be written:

These updates emulate Bellman updates as we do in known MDPs.

Q-learning converges to an optimal policy, even if you're acting suboptimal. When an optimal policy is still learned from suboptimal actions, it is called **off-policy learning**. Another way of saying this is that with off-policy learning, Q-values are updated not according to the current policy (i.e. the current actions), but according to a greedy policy (i.e. the greedy/best actions).

We still, however, need to explore and decrease the learning rate (but not too quickly or you'll stop learning things).

In Q-Learning, we don't need

With this we can just choose our optimal policy as:

The

Where

Again, we can back up (as with value iteration) to propagate these values through the network.

Note that a simpler version of Q-learning is *SARSA*, ("State Action Reward State Action", because the quintuple

SARSA, in contrast to Q-learning, is *on-policy* learning; that is, it updates states based on the current policy's actions, so Q-values are learned according to the current policy and not a greedy policy.

An action may be responsible for a reward later on, so we want to be able to learn that causality, i.e. propagate rewards. The default one-step Q-learning and SARSA algorithms only associate reward with the direct state-action pair

Up until now we have not considered how we *select actions*. So how do we? That is, how do we explore?

One simple method is to sometimes take random actions (

After the space is thoroughly explored, you don't want to keep moving randomly - so you can decrease

A simple modification for *soft-max* action selection, where actions are chosen based on their estimated

Alternatively, we can use *exploration functions*. Generally, we want to explore areas we have high uncertainty for. More specifically, an exploration function takes a value estimate

We can modify our Q-update to incorporate an exploration function:

This encourages the agent not only to try unknown states, but to also try states that lead to unknown states.

In addition to exploration and exploitation, we also introduce a concept of **regret**. Naturally, mistakes are made as the space is explored - regret is a measure of the total mistake cost. That is, it is the difference between your expected rewards and optimal expected rewards.

We can try to minimize regret - to do so, we must not only learn to be optimal, but we must *optimally* learn how to be optimal.

For example: both random exploration and exploration functions are optimal, but random exploration has higher regret.

Sometimes state spaces are far too large to satisfactorily explore. This can be a limit of memory (since Q-learning keeps a table of Q-values) or simply that there are too many states to visit in a reasonable time. In fact, this is the rule rather than the exception. So in practice we cannot learn about every state.

The general idea of **approximate Q-learning** is to transfer learnings from one state to other similar states. For example, if we learn from exploring one state that a fire pit is bad, then we can generalize that all fire pit states are probably bad.

This is an approach like machine learning - we want to learn general knowledge from a few training states; the states are represented by features (for example, we could have a binary feature **Q-function**; this method is called *linear function approximation*):

Note that we can do the same for value functions as well, i.e.

So we observe a transition

With exact Q-learning, we would update

With approximate Q-learning, we instead update the weights, and we do so in proportion to their feature values:

This is the same as least-squares regression.

That is, given a point

The derivative of the error with respect to a weight

Then we update the weight:

In terms of approximate Q-learning, the target

Q-learning tries to model the states by learning q-values. However, a feature-based Q-learning model that models the states well does not necessarily translate to a good feature-based policy (and vice-versa).

Instead of trying to model the unknown states, we can directly try to learn the policies that maximize rewards.

So we can use Q-learning and learn a decent solution, then fine-tune by hill climbing on feature weights. That is, we learn an initial linear Q-function, the nudge each feature weight and up and down to see if the resulting policy is better.

We test whether or not a policy is better by running many sample episodes.

If we have many features, we have to test many new policies, and this hill climbing approach becomes impractical. There are better methods (not discussed here).

A helpful table:

For a known MDP, we can compute an offline solution:

Goal | Technique |
---|---|

Compute |
Value or policy iteration |

Evaluate a fixed policy |
Policy evaluation |

For an unknown MDP, we can use model-based approaches:

Goal | Technique |
---|---|

Compute |
Value or policy iteration on the approximated MDP |

Evaluate a fixed policy |
Policy evaluation on the approximated MDP |

Or we can use model-free approaches:

Goal | Technique |
---|---|

Compute |
Q-learning |

Evaluate a fixed policy |
Value learning |

The previous Q-learning approach was *tabular* in that we essentially kept a table of mappings from *exact states* to values, but map *similar states* to similar values.

The general idea behind deep Q-learning is using a deep neural network to learn

This is essentially a regression problem, since Q-values are continuous. So we can use a squared error loss in the form of a Bellman equation:

Where the

Approximating Q-values using nonlinear functions is not very stable, so tricks are needed to get good performance.

One problem is **catastrophic forgetting**, in which similar states may lead to drastically different outcomes. For instance, there may be a state for which is a single move away from winning, and then another similar state where that same move leads to failure. When the agent wins from that first state, it will assign a high value to it. Then, when it loses from the similar state, it revises its value negatively, and in doing so it "overwrites" its assessment of the other state.

So catastrophic forgetting occurs when similar states lead to very different outcomes, and when this happens, the agent is unable to properly learn.

One trick for this is **experience replay** in which each experience tuple

Then the network is trained using random minibatches sampled from the replay memory. This essentially turns the task into a supervised learning task.

A deep Q-learning algorithm that includes experience replay and

- initialize replay memory
$D$ - initialize action-value function
$Q$ (with random weights) - observe initial state
$s$ - repeat
- select an action
$a$ - with probability
$\epsilon$ select a random action - otherwise select
$a = \argmax_{a'} Q(s,a')$

- with probability
- carry out action
$a$ - observe reward
$r$ and new state$s'$ - store experience
$<s,a,r,s'>$ in replay memory$D$ - sample random transitions
$<ss, aa, rr, ss'>$ from replay memory$D$ - calculate target for each minibatch transition
- if
$ss'$ is terminal state then$tt = rr$ - otherwise
$tt = rr + \gamma \max_{a'} Q(ss', aa')$

- if
- train the Q network using
$(tt - Q(ss, aa))^2$ as loss $s = s'$

- select an action

- Reinforcement Learning - Part 1. Brandon B.
- CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley (edX).
- Intro to Artificial Intelligence. CS271. Peter Norvig, Sebastian Thrun. Udacity.
- 11.3 Reinforcement Learning. Artificial Intelligence: Foundations of Computational Agents. David Poole and Alan Mackworth. Cambridge University Press, 2010.
- Reinforcement Learning in a Nutshell. V. Heidrich-Meisner, M. Lauer, C. Igel, M. Riedmiller.
- Asynchronous Methods for Deep Reinforcement Learning. Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P. Lillicrap, David Silver, Koray Kavukcuoglu.
- Demystifying Deep Reinforcement Learning. Tambet Matiisen.
- Reinforcement Learning: A Tutorial. Mance E. Harmon & Stephanie S. Harmon.
- Q-learning with Neural Networks. Brandon B.