# Reinforcement learning

A quick refresher: a Markov Decision Process (MDP) involves a set of states and actions. Actions connect states with some uncertainty described by the dynamics $P(s'|a,s)$ (i.e. the transition probabilities which underlie the transition function $T(s,a,s')$). Additionally, there is a reward function $R(s,a,s')$ that associates a reward or penalty with each transition. Solving an MDP yields a policy $\pi$, which prescribes an action to take in each state. We can then value a given state $s$ under a policy $\pi$ in terms of expected future rewards with a value function $V^{\pi}(s)$. So the ultimate goal here is to identify an optimal policy.

The Markov Decision Processes described so far have been fully specified, in the sense that we knew all of their parameters (transition probabilities and so on). Because everything was known in advance, we could conduct offline planning, that is, formulate a plan without needing to interact with the world.

MDP parameters aren't always known from the outset - we may not know the reward function $R$ or even the transition model $T$. In that case we must engage in online planning, interacting with the world to learn more about it and better formulate a plan.

Online planning involves reinforcement learning, where agents can learn in what states rewards or goals are located without needing to know from the start.

Reinforcement learning in summary:

• the agent interacts with its environment and receives feedback in the form of rewards
• the agent's utility is defined by the reward function
• the agent must learn to act so as to maximize expected rewards
• learning is based on observed samples of outcomes
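The interaction loop summarized above can be sketched in code. This is a minimal sketch; the two-state environment, the `step` function, and the `always_right` policy are all made-up toy examples, not a standard benchmark:

```python
# Hypothetical toy environment: two states (0 and 1); moving "right"
# reaches state 1, which yields a reward.
def step(state, action):
    """Return (next_state, reward) for the toy world."""
    next_state = 1 if action == "right" else 0
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

def run_episode(policy, steps=5):
    """The agent interacts with the environment and receives rewards."""
    state, total_reward = 0, 0.0
    for _ in range(steps):
        action = policy(state)                 # act according to the policy
        state, reward = step(state, action)    # feedback from the environment
        total_reward += reward
    return total_reward

# A fixed policy that always moves right maximizes reward in this toy world.
always_right = lambda s: "right"
print(run_episode(always_right))  # 5.0
```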

The ultimate goal of reinforcement learning is to learn a policy which returns an action to take given a state. To form a good policy we need to know the value of a given state; we do so by learning a value function which is the sum of rewards from the current state to some terminal state, following a fixed policy. This value function can be learned and approximated by any learning and approximation approach, e.g. neural networks.

One challenge with reinforcement learning is the credit assignment problem - a much earlier action could be responsible for the current outcome, but how is that responsibility assigned? And how is it quantified?

With reinforcement learning, we still assume an MDP, it's just not fully specified - that is, we don't know $R$ and we might not know $T$.

If we do know $T$, a utility-based agent can learn $R$ and thus $V$, which we can then use to solve the MDP.

If $T$ and $R$ are both unknown, a Q-learning agent can learn $Q(s,a)$ without needing either. Where $V(s)$ is the value over states, $Q(s,a)$ is the value over state-action pairs, and it can likewise be used to solve the MDP.

A reflex agent can also directly learn the policy $\pi(s)$ without needing to know $T$ or $R$.

Reinforcement learning agents can be passive, which means the agent has a fixed policy and learns $R$ and $T$ (if necessary) while executing that policy.

Alternatively, an active reinforcement learning agent changes its policy as it goes and learns.

Passive learning has drawbacks: it can take a while to converge on good estimates for the unknown quantities, and it may limit how much of the space is actually explored - there may be little or no information about some states, and better paths may remain unknown.

#### Critic, Actor, and Actor-Critic methods

We can broadly categorize various RL methods into three groups:

• critic-only methods, which first learn a value function and then use that to define a policy, e.g. TD learning.
• actor-only methods, which directly search the policy space. An example is an evolutionary approach where different policies are evolved.
• actor-critic methods, where a critic and an actor are both included and learned separately. The critic observes the actor and evaluates its policy, determining when it needs to change.

## Model-based learning

Model-based learning is a simple approach to reinforcement learning.

The basic idea:

• learn an approximate model (i.e. $P(s'|s,a)$, that is, $T(s,a,s')$, and $R(s,a,s')$) based on experiences
• solve for values (i.e. using value iteration or policy iteration) as if the learned model were correct

In more detail:

1. learn an empirical MDP model:
• count outcomes $s'$ for each $(s, a)$
• normalize to get an estimate $\hat T(s,a,s')$
• discover each $\hat R(s,a,s')$ when we experience $(s,a,s')$
2. solve the learned MDP (e.g. with value iteration or policy iteration)
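The counting and normalizing steps can be sketched as follows. The experience tuples here are made-up illustrative data:

```python
from collections import defaultdict

# Made-up experience tuples (s, a, s') observed while acting in the world.
experiences = [("A", "go", "B"), ("A", "go", "B"), ("A", "go", "C"),
               ("A", "go", "B")]

# Count outcomes s' for each (s, a).
counts = defaultdict(lambda: defaultdict(int))
for s, a, s_next in experiences:
    counts[(s, a)][s_next] += 1

# Normalize counts into an estimated transition model T_hat(s, a, s').
T_hat = {}
for (s, a), outcomes in counts.items():
    total = sum(outcomes.values())
    T_hat[(s, a)] = {s_next: n / total for s_next, n in outcomes.items()}

print(T_hat[("A", "go")])  # {'B': 0.75, 'C': 0.25}
```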

### Temporal Difference Learning (TDL or TD Learning)

In temporal difference learning, the agent moves from one state $s$ to the next $s'$, observes the reward along with the difference between the states' value estimates, then backs up (propagates) the values (as in value iteration) from one state to the next.

We run this many times, reaching a terminal state, then restarting, and so on, to get better estimates of the rewards (utilities) for each state.

We keep track of rewards for visited states as $U(s)$ and also the number of times we have visited each state as $N(s)$.

The main part of the algorithm is:

• if $s'$ is new then $U[s'] = r'$
• if $s$ is not null then
• increment $N_s[s]$
• $U[s] = U[s] + \alpha(N_s[s])(r + \gamma U[s'] - U[s])$

Where $\alpha$ is the learning rate and $\gamma$ is the discount factor.
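The steps above can be made concrete as a single update function. This is a sketch; the decaying learning rate $\alpha(n) = 1/n$ is one common choice, not prescribed by the text:

```python
def td_update(U, N, s, r, s_next, r_next, gamma=0.9):
    """One passive TD step: propagate value from s' back to s."""
    if s_next not in U:           # if s' is new then U[s'] = r'
        U[s_next] = r_next
    if s is not None:             # if s is not null
        N[s] = N.get(s, 0) + 1    # increment visit count N_s[s]
        alpha = 1.0 / N[s]        # learning rate decays with visits
        U[s] = U[s] + alpha * (r + gamma * U[s_next] - U[s])
    return U, N

U, N = {}, {}
td_update(U, N, None, 0.0, "s0", 0.0)   # first observation: s is null
td_update(U, N, "s0", 0.0, "s1", 1.0)   # s1 is new, gets its reward
print(U["s1"], U["s0"])  # 1.0 0.9
```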

Another way of thinking of TD learning:

Consider a sequence of values $v_1, v_2, v_3, \dots, v_t$. We want to estimate the value $v_{t+1}$. We might do so by averaging the observed values, e.g. $\hat v_{t+1} = \frac{v_1 + \dots + v_t}{t}$.

We can rearrange these terms to give us:

$$\begin{aligned} t \hat v_{t+1} &= v_1 + \dots + v_{t-1} + v_t \\ t \hat v_{t+1} &= (t-1) \hat v_t + v_t \\ \hat v_{t+1} &= (1-\frac{1}{t}) \hat v_t + \frac{v_t}{t} \\ \hat v_{t+1} &= \hat v_t + \frac{v_t - \hat v_t}{t} \end{aligned}$$

The term $v_t - \hat v_t$ is the temporal difference error. Basically our estimate $\hat v_{t+1}$ is derived by updating the previous estimate $\hat v_t$ proportionally to this error. For instance, if $v_t > \hat v_t$ then the next estimate is increased.

This approach treats all values with equal weight, though we may want to decay older values.

#### Greedy TDL

One method of active reinforcement learning is the greedy approach.

Given new estimates for the rewards, we recompute a new optimal policy and then use that policy to guide exploration of the space. This gives us new estimates for the rewards, and so on, until convergence.

Thus it is greedy in the sense that it always tries to go for policies that seem immediately better, although in the end that doesn't necessarily guarantee the overall optimal policy (this is the exploration vs exploitation problem).

One alternate approach is to randomly try a non-optimal action, thus exploring more of the space. This works, but can be slow to converge.

### Exploration agent

An approach better than greedy TDL is to use an exploration agent, which favors exploring more when it is uncertain. More specifically, we can use the same TDL algorithm, but while $N_s[s] < \epsilon$, where $\epsilon$ is some exploration threshold (a visit-count cutoff), we set $U[s] = R$, where $R$ is the largest reward we expect to get. Once $N_s[s] \geq \epsilon$, we start using the learned reward as with regular TDL.

## Model-free learning

With model-free learning, instead of trying to estimate $T$ or $R$, we take actions and compare the actual outcome to what we expected the outcome would be.

With passive reinforcement learning, the agent is given an existing policy and just learns from the results of that policy's execution (that is, learns the state values; i.e. this is essentially just policy evaluation, except this is not offline, this involves interacting with the environment).

To compute the values for each state under $\pi$, we can use direct evaluation:

• act according to $\pi$
• every time we visit a state, record what the sum of discounted rewards turned out to be
• average those samples

Direct evaluation is simple, doesn't require any knowledge of $T$ or $R$, and eventually gets the correct average values. However, it throws out information about state connections, since each state is learned separately. For instance, if we have a state $s_i$ with a positive reward, and another state $s_j$ that leads into it, it's possible that direct evaluation assigns state $s_j$ a negative value, which doesn't make sense: since $s_j$ leads to a state with a positive reward, it should also have some positive value. Given enough time/samples this will eventually resolve, but that can take a long time.
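Direct evaluation in code (a sketch over made-up episode data; each episode is recorded as a list of (state, reward) pairs):

```python
# Made-up episodes: each entry is (state, reward received from that state on).
episodes = [[("A", 0.0), ("B", 1.0)],
            [("A", 0.0), ("B", 1.0)],
            [("B", 1.0)]]
gamma = 1.0  # no discounting, to keep the numbers simple

returns = {}  # state -> list of observed sums of discounted rewards
for episode in episodes:
    for i, (s, _) in enumerate(episode):
        # sum of discounted rewards from this visit to the end of the episode
        G = sum(gamma**k * r for k, (_, r) in enumerate(episode[i:]))
        returns.setdefault(s, []).append(G)

# Average the samples for each state.
V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(V)  # {'A': 1.0, 'B': 1.0}
```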

Policy evaluation, on the other hand, does take in account the relationship between states, since the value of each state is a function of its child states, i.e.

$$V_{k+1}^{\pi}(s) = \sum_{s'} T(s, \pi(s), s') [R(s, \pi(s), s') + \gamma V_k^{\pi}(s')]$$

However, we don't know $T$ and $R$. Well, we could just try actions and take samples of outcomes $s'$ and average:

$$V_{k+1}^{\pi}(s) = \frac{1}{n}\sum_i \text{sample}_i$$

Where each $\text{sample}_i = R(s, \pi(s), s_i') + \gamma V_k^{\pi}(s_i')$. $R(s, \pi(s), s_i')$ is just the observed reward from taking the action.

This is called sample-based policy evaluation.

One challenge here: when you try an action, you end up in a new state - how do you get back to the original state to try another action? We don't know anything about the MDP so we don't necessarily know what action will do this.

So really, we only get one sample, and then we're off to another state.

With temporal difference learning, we learn from each experience ("episode"); that is, we update $V(s)$ each time we experience a transition $(s,a,s',r)$. The likely outcomes $s'$ will contribute updates more often. The policy is still fixed (given), and we're still doing policy evaluation.

Basically, we have an estimate $V(s)$, and then we take an action and get a new sample. We update $V(s)$ like so:

$$V^{\pi}(s) = (1 - \alpha) V^{\pi}(s) + (\alpha) \text{sample}$$

So we specify a learning rate $\alpha$ (usually small, e.g. $\alpha=0.1$) which controls how much of the old estimate we keep. This learning rate can be decreased over time.

This is an exponential moving average.

This update can be re-written as:

$$V^{\pi}(s) = V^{\pi}(s) + \alpha (\text{sample} - V^{\pi}(s))$$

The term $(\text{sample} - V^{\pi}(s))$ can be interpreted as an error, i.e. how off our current estimate $V^{\pi}(s)$ was from the observed sample.

So we still never learn $T$ or $R$, we just keep running sample averages instead; hence temporal difference learning is a model-free method for doing policy evaluation.
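The running-average update can be written as a one-line function (a minimal sketch; the state names and parameter values are arbitrary):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Model-free TD update: nudge V(s) toward the observed sample."""
    sample = r + gamma * V.get(s_next, 0.0)
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample
    return V[s]

V = {}
td0_update(V, "s", 1.0, "s2")   # first sample: V(s) = 0.9*0 + 0.1*1.0
print(V["s"])  # 0.1
```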

However, it doesn't help with coming up with a new policy, since we need Q-values to do so.

### Q-Learning

With active reinforcement learning, the agent is actively trying new things rather than following a fixed policy.

The fundamental trade-off in active reinforcement learning is exploitation vs exploration. When you land on a decent strategy, do you just stick with it? What if there's a better strategy out there? How do you balance using your current best strategy and searching for an even better one?

Remember that value iteration requires us to look at $\max_a$ over the set of possible actions from a state:

$$V_{k+1}(s) = \max_a \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma V_k(s')]$$

However, we can't compute maximums from samples since the maximum is always unknown (there's always the possibility of a new sample being larger; we can only compute averages from samples).

We can instead iteratively compute Q-values (Q-value iteration):

$$Q_{k+1}(s,a) = \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma \max_{a'} Q_k(s', a')]$$

Remember that while a value $V(s)$ is the value of a state, a Q-value $Q(s,a)$ is the value of an action (from a particular state).

Here the $\max$ term is pushed inside, and we are ultimately just computing an average, so we can compute this from samples.

This is the basis of the Q-Learning algorithm, which is just sample-based Q-value iteration.

We learn $Q(s,a)$ values as we go:

• take an action $a$ from a state $s$ and see the outcome as a sample $(s,a,s',r)$.
• consider the old estimate $Q(s,a)$
• consider the new sample estimate: $\text{sample} = R(s,a,s') + \gamma \max_{a'} Q(s', a')$, where $R(s,a,s') = r$, i.e. the reward we just received
• incorporate this new estimate into a running average:

$$Q(s,a) = (1 - \alpha)Q(s,a) + (\alpha)\text{sample}$$

This can also be written with the shorthand $=_{\alpha}$, meaning "update toward the right-hand side with learning rate $\alpha$":

$$Q(s,a) =_{\alpha} R(s,a,s') + \gamma \max_{a'} Q(s', a')$$

Q-learning converges to an optimal policy even if you act suboptimally. When an optimal policy is still learned from suboptimal actions, it is called off-policy learning. Another way of saying this: with off-policy learning, Q-values are updated not according to the current policy (i.e. the current actions), but according to a greedy policy (i.e. the greedy/best actions).

We still, however, need to explore and decrease the learning rate (but not too quickly or you'll stop learning things).

In Q-Learning, we don't need $P$ or the reward/utility function. We directly learn the rewards/utilities of state-action pairs, $Q(s,a)$.

With this we can just choose our optimal policy as:

$$\pi(s) = \argmax_a Q(s,a)$$

The $Q$ update formula is simply:

$$Q(s,a) = Q(s,a) + \alpha(r + \gamma \max_{a'} Q(s', a') - Q(s,a))$$

Where $\alpha$ is the learning rate and $\gamma$ is the discount factor.
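The update formula as a tabular sketch (the two-state setup and parameter values are made up for illustration):

```python
from collections import defaultdict

Q = defaultdict(float)  # Q(s, a) table, zero-initialized
alpha, gamma = 0.5, 0.9
actions = ["left", "right"]

def q_update(s, a, r, s_next):
    """Off-policy update: bootstrap from the best action available in s'."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

q_update("s0", "right", 1.0, "s1")  # Q(s0, right) = 0.5 * 1.0 = 0.5
q_update("s0", "right", 1.0, "s1")  # moves further toward the target
print(Q[("s0", "right")])  # 0.75
```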

Again, we can back up (as with value iteration) to propagate these values through the network.

Note that a closely related variant of Q-learning is SARSA ("State Action Reward State Action", because the quintuple $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$ is central to this method), which uses the update:

$$Q(s,a) = Q(s,a) + \alpha(r + \gamma Q(s', a') - Q(s,a))$$

SARSA, in contrast to Q-learning, is on-policy learning; that is, it updates states based on the current policy's actions, so Q-values are learned according to the current policy and not a greedy policy.
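The on-policy update can be sketched as follows (the Q-values and state names are made up to show how SARSA bootstraps from the action actually taken, not the greedy one):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy update: bootstrap from the action the policy actually takes."""
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
        r + gamma * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0))
    return Q[(s, a)]

Q = {("s1", "left"): 2.0, ("s1", "right"): 0.0}
# Q-learning would bootstrap from max_a' Q(s1, a') = 2.0; SARSA uses the
# action actually chosen next, here "right", whose value is 0.0.
sarsa_update(Q, "s0", "go", 1.0, "s1", "right")
print(Q[("s0", "go")])  # 0.5
```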

#### n-step Q-learning

An action may be responsible for a reward later on, so we want to be able to learn that causality, i.e. propagate rewards. The default one-step Q-learning and SARSA algorithms only associate reward with the direct state-action pair $s,a$ that immediately led to it. We can instead propagate these rewards further with $n$-step variations, e.g. $n$-step Q-learning updates $Q(s,a)$ with:

$$r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_a Q(s_{t+n}, a)$$

### Exploration vs exploitation

Up until now we have not considered how we select actions. So how do we? That is, how do we explore?

One simple method is to sometimes take random actions ($\epsilon$-greedy). With a small probability $\epsilon$, act randomly, with probability $1-\epsilon$, act on the current best policy.

After the space is thoroughly explored, you don't want to keep moving randomly - so you can decrease $\epsilon$ over time.
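ε-greedy selection in code (a sketch; the Q-values are made-up examples):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon act randomly, else act on the best Q-value."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

Q = {("s", "left"): 0.2, ("s", "right"): 0.8}
# With epsilon = 0 this is purely greedy:
print(epsilon_greedy(Q, "s", ["left", "right"], epsilon=0.0))  # right
```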

A simple modification of $\epsilon$-greedy action selection is soft-max action selection, where actions are chosen based on their estimated $Q(s,a)$ value. One specific method is to use a Gibbs or Boltzmann distribution, where the probability of selecting action $a$ in state $s$ is proportional to $e^{Q(s,a)/T}$, where $T>0$ is a temperature which influences how randomly actions are chosen. The higher the temperature, the more random the selection; as $T \to 0$, the best-valued action is always chosen. More specifically, in state $s$, action $a$ is chosen with the probability:

$$\frac{e^{Q(s,a)/T}}{\sum_{a'} e^{Q(s,a')/T}}$$
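The Boltzmann probabilities can be computed directly (a sketch; note that very small $T$ must be treated carefully in practice, since the formula divides by $T$):

```python
import math

def boltzmann_probs(q_values, temperature):
    """Softmax over Q-values: higher temperature -> closer to uniform."""
    exps = [math.exp(q / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

# High temperature: nearly random. Low temperature: nearly greedy.
print(boltzmann_probs([1.0, 2.0], temperature=100.0))  # ~[0.4975, 0.5025]
print(boltzmann_probs([1.0, 2.0], temperature=0.1))    # ~[0.00005, 0.99995]
```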

Alternatively, we can use exploration functions. Generally, we want to explore areas we have high uncertainty for. More specifically, an exploration function takes a value estimate $u$ and a visit count $n$ and returns an optimistic utility. For example: $f(u,n) = u + \frac{k}{n}$.

We can modify our Q-update to incorporate an exploration function:

$$Q(s,a) =_{\alpha} R(s,a,s') + \gamma \max_{a'} f(Q(s', a'), N(s',a'))$$

This encourages the agent not only to try unknown states, but to also try states that lead to unknown states.
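The exploration bonus shrinks as visit counts grow, which the example $f(u,n) = u + \frac{k}{n}$ makes concrete (the numbers below are arbitrary illustrations):

```python
def explore_value(u, n, k=1.0):
    """Optimistic utility: raw estimate plus a bonus for rarely-visited pairs."""
    return u + k / n

# A rarely-tried action looks better than a well-known, slightly better one:
print(explore_value(0.5, n=1))  # 1.5 -- big bonus, barely visited
print(explore_value(0.5, n=2))  # 1.0 -- bonus shrinking with visits
```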

In addition to exploration and exploitation, we also introduce a concept of regret. Naturally, mistakes are made as the space is explored - regret is a measure of the total mistake cost. That is, it is the difference between your expected rewards and optimal expected rewards.

We can try to minimize regret - to do so, we must not only learn to be optimal, but we must optimally learn how to be optimal.

For example: both random exploration and exploration functions eventually learn an optimal policy, but random exploration incurs higher regret along the way.

### Approximate Q-Learning

Sometimes state spaces are far too large to satisfactorily explore. This can be a limit of memory (since Q-learning keeps a table of Q-values) or simply that there are too many states to visit in a reasonable time. In fact, this is the rule rather than the exception. So in practice we cannot learn about every state.

The general idea of approximate Q-learning is to transfer learnings from one state to other similar states. For example, if we learn from exploring one state that a fire pit is bad, then we can generalize that all fire pit states are probably bad.

This is a machine-learning-style approach - we want to learn general knowledge from a few training states. The states are represented by features (for example, we could have a binary feature $\text{has fire pit}$). Then we describe Q-states in terms of features, e.g. as linear functions (called a Q-function; this method is called linear function approximation):

$$Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + \dots + w_n f_n(s,a)$$

Note that we can do the same for value functions as well, i.e.

$$V(s) = w_1 f_1(s) + w_2 f_2(s) + \dots + w_n f_n(s)$$

So we observe a transition $(s,a,r,s')$ and then we compute the difference of this observed transition from what we expected, i.e:

$$\text{difference} = [r + \gamma \max_{a'} Q(s', a')] - Q(s,a)$$

With exact Q-learning, we would update $Q(s,a)$ like so:

$$Q(s,a) = Q(s,a) + \alpha [\text{difference}]$$

With approximate Q-learning, we instead update the weights, and we do so in proportion to their feature values:

$$w_i = w_i + \alpha [\text{difference}] f_i (s,a)$$

This is the same update as in least-squares regression.

That is, given a point $x$, with features $f(x)$ and target value $y$, the error is:

$$\text{error}(w) = \frac{1}{2} (y - \sum_k w_k f_k(x))^2$$

The derivative of the error with respect to a weight $w_m$ is:

$$\frac{\partial \text{error}(w)}{\partial w_m} = - ( y- \sum_k w_k f_k(x)) f_m(x)$$

Then we update the weight:

$$w_m = w_m + \alpha (y - \sum_k w_k f_k(x)) f_m(x)$$

In terms of approximate Q-learning, the target $y$ is $r + \gamma \max_{a'} Q(s', a')$ and our prediction is $Q(s,a)$:

$$w_m = w_m + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s,a)] f_m (s,a)$$
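The weight update in code (a sketch with two made-up features and arbitrary parameter values):

```python
def approx_q(w, feats):
    """Linear Q-function: Q(s,a) = sum_i w_i * f_i(s,a)."""
    return sum(wi * fi for wi, fi in zip(w, feats))

def update_weights(w, feats, r, max_next_q, alpha=0.5, gamma=0.9):
    """Shift each weight in proportion to its feature's contribution."""
    difference = (r + gamma * max_next_q) - approx_q(w, feats)
    return [wi + alpha * difference * fi for wi, fi in zip(w, feats)]

w = [0.0, 0.0]
feats = [1.0, 0.5]  # f_1(s,a), f_2(s,a) for the observed (s, a)
w = update_weights(w, feats, r=1.0, max_next_q=0.0)
print(w)  # difference = 1.0, so w = [0.5, 0.25]
```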

### Policy Search

Q-learning tries to model the states by learning q-values. However, a feature-based Q-learning model that models the states well does not necessarily translate to a good feature-based policy (and vice-versa).

Instead of trying to model the unknown states, we can directly try to learn the policies that maximize rewards.

So we can use Q-learning to learn a decent solution, then fine-tune by hill climbing on the feature weights. That is, we learn an initial linear Q-function, then nudge each feature weight up and down to see if the resulting policy is better.

We test whether or not a policy is better by running many sample episodes.

If we have many features, we have to test many new policies, and this hill climbing approach becomes impractical. There are better methods (not discussed here).

### Summary

For a known MDP, we can compute an offline solution:

| Goal | Technique |
|------|-----------|
| Compute $V^*, Q^*, \pi^*$ | Value or policy iteration |
| Evaluate a fixed policy $\pi$ | Policy evaluation |

For an unknown MDP, we can use model-based approaches:

| Goal | Technique |
|------|-----------|
| Compute $V^*, Q^*, \pi^*$ | Value or policy iteration on the approximated MDP |
| Evaluate a fixed policy $\pi$ | Policy evaluation on the approximated MDP |

Or we can use model-free approaches:

| Goal | Technique |
|------|-----------|
| Compute $V^*, Q^*, \pi^*$ | Q-learning |
| Evaluate a fixed policy $\pi$ | Value learning |

## Deep Q-Learning

The previous Q-learning approach was tabular in that we essentially kept a table of mappings from $(s,a)$ to some value. However, we'd like to be a bit more flexible and not have to map exact states to values, but map similar states to similar values.

The general idea behind deep Q-learning is using a deep neural network to learn $Q(s,a)$, which gives us this kind of mapping.

This is essentially a regression problem, since Q-values are continuous. So we can use a squared error loss in the form of a Bellman equation:

$$L = \frac{1}{2}[ r + \gamma \max_{a'} Q(s', a') - Q(s, a)]^2$$

Where the $r + \gamma \max_{a'} Q(s', a')$ term is the target value and $Q(s,a)$ is the predicted value.

Approximating Q-values using nonlinear functions is not very stable, so tricks are needed to get good performance.

One problem is catastrophic forgetting, in which similar states may lead to drastically different outcomes. For instance, there may be a state which is a single move away from winning, and another similar state where that same move leads to failure. When the agent wins from the first state, it will assign a high value to it. Then, when it loses from the similar state, it revises its value negatively, and in doing so it "overwrites" its assessment of the other state.

So catastrophic forgetting occurs when similar states lead to very different outcomes, and when this happens, the agent is unable to properly learn.

One trick for this is experience replay, in which each experience tuple $(s,a,r,s')$ is saved (this collection of saved experiences is called "replay memory"). Memory size is often limited to keep only the last $n$ experiences.

Then the network is trained using random minibatches sampled from the replay memory. This essentially turns the task into a supervised learning task.
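A bounded replay memory with random minibatch sampling can be sketched as follows (`deque` with `maxlen` drops the oldest experiences automatically; the stored transitions below are placeholders):

```python
import random
from collections import deque

class ReplayMemory:
    """Keep only the last n experiences; sample random minibatches."""
    def __init__(self, n):
        self.buffer = deque(maxlen=n)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory(n=3)
for t in range(5):                  # store 5 experiences...
    memory.store(t, "a", 0.0, t + 1)
print(len(memory.buffer))           # 3 -- only the last n are kept
print(memory.sample(2))             # a random minibatch of 2 transitions
```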

A deep Q-learning algorithm that includes experience replay and $\epsilon$-greedy exploration follows:

• initialize replay memory $D$
• initialize action-value function $Q$ (with random weights)
• observe initial state $s$
• repeat
• select an action $a$
• with probability $\epsilon$ select a random action
• otherwise select $a = \argmax_{a'} Q(s,a')$
• carry out action $a$
• observe reward $r$ and new state $s'$
• store experience $(s, a, r, s')$ in replay memory $D$
• sample random transitions $(ss, aa, rr, ss')$ from replay memory $D$
• calculate target for each minibatch transition
• if $ss'$ is a terminal state then $tt = rr$
• otherwise $tt = rr + \gamma \max_{a'} Q(ss', aa')$
• train the Q network using $(tt - Q(ss, aa))^2$ as loss
• $s = s'$