A quick refresher - a Markov Decision Process (MDP) involves a set of states and actions. Actions connect states with some uncertainty described by the dynamics $T(s,a,s') = P(s' \mid s, a)$.
Markov Decision Processes as described so far have been fully specified, in the sense that we knew all of their parameters (transition probabilities and so on). Because everything was known in advance, we could conduct offline planning; that is, formulate a plan without needing to interact with the world.
MDP parameters aren't always known from the outset - we may not know the reward function $R(s,a,s')$ or the transition function $T(s,a,s')$.
Online planning involves reinforcement learning, where the agent learns through interaction which states contain rewards or goals, without needing to know this from the start.
Reinforcement learning in summary:
The ultimate goal of reinforcement learning is to learn a policy, which returns an action to take given a state. To form a good policy we need to know the value of a given state; we do so by learning a value function, which gives the expected sum of rewards from the current state to some terminal state, following a fixed policy. This value function can be learned and approximated by any learning and approximation approach, e.g. neural networks.
One challenge with reinforcement learning is the credit assignment problem - a much earlier action could be responsible for the current outcome, but how is that responsibility assigned? And how is it quantified?
With reinforcement learning, we still assume an MDP, it's just not fully specified - that is, we don't know the transition function $T(s,a,s')$ or the reward function $R(s,a,s')$.
If we do know $T$ and $R$, we can solve the MDP offline, e.g. with value or policy iteration.
A reflex agent can also directly learn the policy $\pi(s)$, mapping states straight to actions.
Reinforcement learning agents can be passive, which means the agent has a fixed policy and learns the values of states under that policy.
Alternatively, an active reinforcement learning agent changes its policy as it goes, based on what it learns.
Passive learning has drawbacks: it can take a while to converge on good estimates for the unknown quantities, and it may limit how much of the space is actually explored - there may be little or no information about some states, and better paths may remain unknown.
We can broadly categorize various RL methods into three groups: model-based learning, model-free learning, and methods that learn a policy directly.
Model-based learning is a simple approach to reinforcement learning.
The basic idea: learn an approximate model of the MDP from experience - that is, estimate the transition function $T(s,a,s')$ and the reward function $R(s,a,s')$ - and then solve the resulting MDP as if the learned model were correct.

In more detail: for each state-action pair $(s,a)$, count the outcome states $s'$ we observe and normalize these counts to estimate $\hat{T}(s,a,s')$; record the rewards we receive to estimate $\hat{R}(s,a,s')$. Then compute values or a policy for the estimated MDP (e.g. with value iteration), as sketched below.
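As a rough sketch of this counting approach (the function name and transition format are illustrative, not from the original notes):

```python
from collections import defaultdict

def estimate_model(transitions):
    """Estimate T(s,a,s') and R(s,a,s') from observed transitions.

    `transitions` is a list of (s, a, r, s_next) tuples gathered by
    acting in the environment; all names here are illustrative.
    """
    counts = defaultdict(lambda: defaultdict(int))   # (s,a) -> {s': count}
    rewards = defaultdict(list)                      # (s,a,s') -> [r, ...]
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        rewards[(s, a, s_next)].append(r)

    T = {}  # T[(s, a, s')] = estimated transition probability
    R = {}  # R[(s, a, s')] = estimated (average) reward
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        for s_next, n in outcomes.items():
            T[(s, a, s_next)] = n / total
            R[(s, a, s_next)] = sum(rewards[(s, a, s_next)]) / n
    return T, R
```

The estimated $\hat{T}$ and $\hat{R}$ can then be handed to value or policy iteration as if they were exact.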
In temporal difference learning, the agent moves from one state to the next according to its fixed policy, observing the reward received at each step.
We run this many times, reaching a terminal state, then restarting, and so on, to get better estimates of the rewards (utilities) for each state.
We keep track of the rewards (utilities) for visited states as estimates $V^\pi(s)$.
The main part of the algorithm is the update applied after each transition from $s$ to $s'$ with observed reward $r$:

$$V^\pi(s) \leftarrow V^\pi(s) + \alpha \left( r + \gamma V^\pi(s') - V^\pi(s) \right)$$

where $\alpha$ is the learning rate and $\gamma$ the discount factor.
Another way of thinking of TD learning:
Consider a sequence of values $x_1, x_2, \ldots, x_n$ and their running average $\bar{x}_n = \frac{1}{n} \sum_{i=1}^n x_i$.

We can rearrange these terms to give us:

$$\bar{x}_n = \bar{x}_{n-1} + \frac{1}{n} \left( x_n - \bar{x}_{n-1} \right)$$

That is, each new value nudges the average by a fraction of its difference from the old average - exactly the shape of the TD update.
This approach treats all values with equal weight, though we may want to decay older values.
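A quick numeric check of both forms of averaging (a hypothetical toy example):

```python
samples = [4.0, 8.0, 5.0, 9.0]

# Incremental (equal-weight) running average: x_bar += (x - x_bar) / n
x_bar = 0.0
for n, x in enumerate(samples, start=1):
    x_bar += (x - x_bar) / n
print(x_bar)  # 6.5, identical to sum(samples) / len(samples)

# Exponential moving average: a fixed alpha weights recent samples more
alpha, ema = 0.5, samples[0]
for x in samples[1:]:
    ema += alpha * (x - ema)
print(ema)  # 7.25: older samples decay geometrically
```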
One method of active reinforcement learning is the greedy approach.
Given new estimates for the rewards, we recompute a new optimal policy and then use that policy to guide exploration of the space. This gives us new estimates for the rewards, and so on, until convergence.
Thus it is greedy in the sense that it always goes for the policy that seems immediately best, although in the end that doesn't necessarily yield the overall optimal policy (this is the exploration vs exploitation problem).
One alternate approach is to randomly try a non-optimal action, thus exploring more of the space. This works, but can be slow to converge.
A better approach is to use an exploration agent, which favors exploring more when it is uncertain. More specifically, we can use the same TD learning algorithm, but while learning, select actions optimistically in proportion to our uncertainty (see the exploration functions described later).
With model-free learning, instead of trying to estimate $T$ and $R$, we learn values (or Q-values) directly from experience.
With passive reinforcement learning, the agent is given an existing policy and just learns from the results of that policy's execution - that is, it learns the state values. This is essentially policy evaluation, except it isn't offline: it involves interacting with the environment.
To compute the values for each state under the fixed policy $\pi$, one option is direct evaluation: each time we visit a state, record the sum of discounted rewards received from that point on, and average these observed returns over many episodes.

Direct evaluation is simple and doesn't require any knowledge of $T$ or $R$, but it learns each state's value separately, ignoring the connections between states, so it can be slow.
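A minimal sketch of direct evaluation, assuming episodes have already been collected under the fixed policy (the data format is illustrative):

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=0.9):
    """Average the discounted return observed after each visit to a state.

    `episodes` is a list of episodes, each a list of (state, reward)
    pairs in the order they were experienced under the fixed policy.
    """
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        # Walk backwards so G accumulates the discounted future reward.
        for s, r in reversed(episode):
            G = r + gamma * G
            returns[s].append(G)  # every visit contributes a sample
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```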
Policy evaluation, on the other hand, does take into account the relationship between states, since the value of each state is a function of the values of its successor states, i.e.

$$V^\pi(s) = \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^\pi(s') \right]$$
However, we don't know $T$ or $R$, so we can't compute this weighted average directly. Instead, we take the action $\pi(s)$ and average over the sample outcomes we actually observe.
This is called sample-based policy evaluation.
One challenge here: when you try an action, you end up in a new state - how do you get back to the original state to try another action? We don't know anything about the MDP so we don't necessarily know what action will do this.
So really, we only get one sample, and then we're off to another state.
With temporal difference learning, we learn from each experience ("episode"); that is, we update $V^\pi(s)$ each time we experience a transition $(s, \pi(s), s', r)$.
Basically, we have an estimate $V^\pi(s)$, and with each transition we get a new sample, $r + \gamma V^\pi(s')$, that the estimate should move toward.
So we specify a learning rate $\alpha$ and update:

$$V^\pi(s) \leftarrow (1 - \alpha) V^\pi(s) + \alpha \left[ r + \gamma V^\pi(s') \right]$$
This is an exponential moving average.
This update can be re-written as:

$$V^\pi(s) \leftarrow V^\pi(s) + \alpha \left[ r + \gamma V^\pi(s') - V^\pi(s) \right]$$
So we still never learn $T$ or $R$ - we learn the values directly from samples.
However, it doesn't help with coming up with a new policy, since we need Q-values to do so.
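A sketch of passive TD learning under these updates; the gym-style `reset()`/`step()` interface is an assumption for illustration:

```python
def td_value_learning(env, policy, gamma=0.9, alpha=0.1, n_episodes=1000):
    """Passive TD(0): follow the fixed policy, nudging V(s) toward each
    observed sample r + gamma * V(s').

    `env` is assumed to expose reset() and step(action) returning
    (next_state, reward, done); `policy` maps states to actions.
    """
    V = {}
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy[s])
            sample = r + gamma * V.get(s_next, 0.0)
            # Exponential moving average toward the new sample.
            V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample
            s = s_next
    return V
```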
With active reinforcement learning, the agent is actively trying new things rather than following a fixed policy.
The fundamental trade-off in active reinforcement learning is exploitation vs exploration. When you land on a decent strategy, do you just stick with it? What if there's a better strategy out there? How do you balance using your current best strategy and searching for an even better one?
Remember that value iteration requires us to look at the maximum over actions:

$$V_{k+1}(s) = \max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V_k(s') \right]$$
However, we can't compute maximums from samples since the maximum is always unknown (there's always the possibility of a new sample being larger; we can only compute averages from samples).
We can instead iteratively compute Q-values (Q-value iteration):

$$Q_{k+1}(s,a) = \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma \max_{a'} Q_k(s',a') \right]$$

Note that the max now sits inside the sum, applied to the next state's Q-values, so the outer quantity is an average - something we can estimate from samples.
Remember that while a value $V(s)$ is the expected utility of acting optimally from state $s$, a Q-value $Q(s,a)$ is the expected utility of taking action $a$ from state $s$ (and acting optimally thereafter).
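For a known MDP, Q-value iteration can be written straight from the update above (a sketch, reusing the `T`/`R` dictionary format from the model-based example):

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, iters=100):
    """Iterate Q(s,a) = sum_s' T(s,a,s') [R(s,a,s') + gamma * max_a' Q(s',a')]."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(iters):
        Q_new = {}
        for s in states:
            for a in actions:
                Q_new[(s, a)] = sum(
                    T.get((s, a, s2), 0.0)
                    * (R.get((s2 := s2, a, s2) if False else (s, a, s2), 0.0)
                       + gamma * max(Q[(s2, a2)] for a2 in actions))
                    for s2 in states
                )
        Q = Q_new
    return Q
```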
This is the basis of the Q-Learning algorithm, which is just sample-based Q-value iteration: for each observed transition $(s, a, r, s')$, we compute a sample $r + \gamma \max_{a'} Q(s',a')$ and update

$$Q(s,a) \leftarrow (1 - \alpha) Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') \right]$$
This can also be written:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$
These updates emulate Bellman updates as we do in known MDPs.
Q-learning converges to an optimal policy, even if you're acting suboptimally. When an optimal policy is still learned from suboptimal actions, it is called off-policy learning. Another way of saying this is that with off-policy learning, Q-values are updated not according to the current policy (i.e. the current actions), but according to a greedy policy (i.e. the greedy/best actions).
We still, however, need to explore and decrease the learning rate (but not too quickly or you'll stop learning things).
In Q-Learning, we don't need to know $T$ or $R$; we learn a Q-value for each state-action pair directly from experience.
With this we can just choose our optimal policy as:

$$\pi^*(s) = \arg\max_a Q(s,a)$$
Again, we can back up (as with value iteration) to propagate these values through the network.
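Putting the pieces together, a tabular Q-learning sketch (again assuming a gym-style environment interface; $\epsilon$-greedy exploration is discussed in more detail below):

```python
import random
from collections import defaultdict

def q_learning(env, actions, alpha=0.1, gamma=0.9, epsilon=0.1, n_episodes=5000):
    """Sample-based Q-value iteration with epsilon-greedy action selection."""
    Q = defaultdict(float)  # Q[(s, a)], missing entries default to 0
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy: occasionally explore a random action.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            # Off-policy target: greedy value of the next state.
            target = r + gamma * max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```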
Note that a simpler version of Q-learning is SARSA ("State Action Reward State Action", because of the quintuple $(s, a, r, s', a')$ used in its update):

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma Q(s',a') - Q(s,a) \right]$$
SARSA, in contrast to Q-learning, is on-policy learning; that is, it updates states based on the current policy's actions, so Q-values are learned according to the current policy and not a greedy policy.
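The difference shows up only in the update target - a sketch of both updates side by side (`Q` is a dict keyed by state-action pairs; names are illustrative):

```python
def q_learning_update(Q, actions, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the best (greedy) action in s_next."""
    target = r + gamma * max(Q.get((s_next, a_), 0.0) for a_ in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action a_next the policy actually takes."""
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```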
An action may be responsible for a reward later on, so we want to be able to learn that causality, i.e. propagate rewards. The default one-step Q-learning and SARSA algorithms only associate a reward with the state-action pair that immediately preceded it; multi-step variants (e.g. eligibility traces) propagate rewards further back.
Up until now we have not considered how we select actions. So how do we? That is, how do we explore?
One simple method is to sometimes take random actions ($\epsilon$-greedy exploration): with some small probability $\epsilon$, act randomly; with probability $1 - \epsilon$, act according to the current policy.

After the space is thoroughly explored, you don't want to keep moving randomly - so you can decrease $\epsilon$ over time.
A simple modification for this is to decay $\epsilon$ on a schedule (e.g. $\epsilon_t \propto 1/t$), so exploration fades as the estimates improve.
Alternatively, we can use exploration functions. Generally, we want to explore areas we have high uncertainty for. More specifically, an exploration function takes a value estimate $u$ and a visit count $n$ and returns an optimistic utility, e.g. $f(u,n) = u + \frac{k}{n}$ for some constant $k$.
We can modify our Q-update to incorporate an exploration function:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} f\big(Q(s',a'), N(s',a')\big) - Q(s,a) \right]$$

where $N(s',a')$ counts how many times the pair $(s',a')$ has been tried.
This encourages the agent not only to try unknown states, but to also try states that lead to unknown states.
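A sketch of this modified update with the $f(u,n) = u + k/n$ exploration function described above (the $n+1$ guard against division by zero is an implementation convenience, not from the original):

```python
def exploration_q_update(Q, N, actions, s, a, r, s_next,
                         alpha=0.1, gamma=0.9, k=1.0):
    """Q-update where next-state values pass through an exploration function."""
    N[(s, a)] = N.get((s, a), 0) + 1  # visit counts

    def f(u, n):
        # Optimistic value: rarely-visited pairs look better than they are.
        return u + k / (n + 1)

    target = r + gamma * max(
        f(Q.get((s_next, a_), 0.0), N.get((s_next, a_), 0)) for a_ in actions
    )
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (target - q_sa)
```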
In addition to exploration and exploitation, we also introduce a concept of regret. Naturally, mistakes are made as the space is explored - regret is a measure of the total mistake cost. That is, it is the difference between your expected rewards and optimal expected rewards.
We can try to minimize regret - to do so, we must not only learn to be optimal, but we must optimally learn how to be optimal.
For example: both random exploration and exploration functions eventually learn an optimal policy, but random exploration incurs higher regret along the way.
Sometimes state spaces are far too large to satisfactorily explore. This can be a limit of memory (since Q-learning keeps a table of Q-values) or simply that there are too many states to visit in a reasonable time. In fact, this is the rule rather than the exception. So in practice we cannot learn about every state.
The general idea of approximate Q-learning is to transfer learnings from one state to other similar states. For example, if we learn from exploring one state that a fire pit is bad, then we can generalize that all fire pit states are probably bad.
This is an approach like machine learning - we want to learn general knowledge from a few training states. The states are represented by features (for example, we could have a binary feature for whether a state contains a fire pit), and the Q-function is represented as a weighted linear combination of those features:

$$Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + \dots + w_n f_n(s,a)$$
Note that we can do the same for value functions as well, i.e. $V(s) = w_1 f_1(s) + \dots + w_n f_n(s)$.
So we observe a transition $(s, a, r, s')$ and compute the difference between the sample and our current estimate:

$$\text{difference} = \left[ r + \gamma \max_{a'} Q(s',a') \right] - Q(s,a)$$
With exact Q-learning, we would update $Q(s,a) \leftarrow Q(s,a) + \alpha \cdot \text{difference}$.
With approximate Q-learning, we instead update the weights, and we do so in proportion to their feature values:

$$w_i \leftarrow w_i + \alpha \cdot \text{difference} \cdot f_i(s,a)$$
This is the same as least-squares regression.
That is, given a point $x$, a target value $y$, and a prediction $\hat{y} = \sum_i w_i f_i(x)$, the squared error is $\frac{1}{2} (y - \hat{y})^2$.
The derivative of the error with respect to a weight $w_m$ is:

$$\frac{\partial \, \text{error}}{\partial w_m} = -(y - \hat{y}) f_m(x)$$
Then we update the weight (moving against the gradient):

$$w_m \leftarrow w_m + \alpha (y - \hat{y}) f_m(x)$$
In terms of approximate Q-learning, the target $y$ is the sample $r + \gamma \max_{a'} Q(s',a')$ and the prediction $\hat{y}$ is $Q(s,a)$, which recovers exactly the weight update above.
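A sketch of the full approximate Q-update; `features(s, a)` is a stand-in for whatever feature extractor the problem defines:

```python
def approx_q_update(w, features, actions, s, a, r, s_next,
                    alpha=0.01, gamma=0.9):
    """Update linear Q-function weights: Q(s,a) = sum_i w[i] * f_i(s,a).

    `features(s, a)` returns a list of feature values; it and the other
    names here are illustrative placeholders.
    """
    def q(s_, a_):
        return sum(wi * fi for wi, fi in zip(w, features(s_, a_)))

    target = r + gamma * max(q(s_next, a_) for a_ in actions)
    difference = target - q(s, a)
    for i, f_i in enumerate(features(s, a)):
        w[i] += alpha * difference * f_i  # update in proportion to feature value
    return w
```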
Q-learning tries to model the states by learning q-values. However, a feature-based Q-learning model that models the states well does not necessarily translate to a good feature-based policy (and vice-versa).
Instead of trying to model the unknown states, we can directly try to learn the policies that maximize rewards.
So we can use Q-learning to learn a decent solution, then fine-tune by hill climbing on the feature weights. That is, we learn an initial linear Q-function, then nudge each feature weight up and down to see if the resulting policy is better.
We test whether or not a policy is better by running many sample episodes.
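A sketch of that hill-climbing loop (the `evaluate` helper, which would run many sample episodes and return average reward, is an assumed placeholder):

```python
def hill_climb(w, evaluate, step=0.05, rounds=100):
    """Nudge each weight up and down; keep any change that improves the policy."""
    best = evaluate(w)
    for _ in range(rounds):
        for i in range(len(w)):
            for delta in (step, -step):
                w[i] += delta
                score = evaluate(w)  # run many sample episodes
                if score > best:
                    best = score     # keep the improvement
                else:
                    w[i] -= delta    # revert the nudge
    return w
```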
If we have many features, we have to test many new policies, and this hill climbing approach becomes impractical. There are better methods (not discussed here).
A helpful table:
For a known MDP, we can compute an offline solution:

|Goal|Technique|
|---|---|
|Compute $V^*$, $Q^*$, $\pi^*$|Value or policy iteration|
|Evaluate a fixed policy $\pi$|Policy evaluation|

For an unknown MDP, we can use model-based approaches:

|Goal|Technique|
|---|---|
|Compute $V^*$, $Q^*$, $\pi^*$|Value or policy iteration on the approximated MDP|
|Evaluate a fixed policy $\pi$|Policy evaluation on the approximated MDP|

Or we can use model-free approaches:

|Goal|Technique|
|---|---|
|Compute $V^*$, $Q^*$, $\pi^*$|Q-learning|
|Evaluate a fixed policy $\pi$|Value learning (TD)|
The previous Q-learning approach was tabular, in that we essentially kept a table of mappings from state-action pairs $(s,a)$ to Q-values.
The general idea behind deep Q-learning is to use a deep neural network to learn the Q-function, i.e. to approximate $Q(s,a;\theta)$ with network weights $\theta$.
This is essentially a regression problem, since Q-values are continuous. So we can use a squared error loss in the form of a Bellman equation:

$$L(\theta) = \left( r + \gamma \max_{a'} Q(s',a';\theta) - Q(s,a;\theta) \right)^2$$
Approximating Q-values using nonlinear functions is not very stable, so tricks are needed to get good performance.
One problem is catastrophic forgetting, in which similar states may lead to drastically different outcomes. For instance, there may be a state which is a single move away from winning, and another similar state where that same move leads to failure. When the agent wins from the first state, it will assign a high value to it. Then, when it loses from the similar state, it revises its value negatively, and in doing so it "overwrites" its assessment of the first state.
So catastrophic forgetting occurs when similar states lead to very different outcomes, and when this happens, the agent is unable to properly learn.
One trick for this is experience replay, in which each experience tuple $(s, a, r, s')$ is stored in a replay memory.
Then the network is trained using random minibatches sampled from the replay memory. This essentially turns the task into a supervised learning task.
A deep Q-learning algorithm that includes experience replay and $\epsilon$-greedy exploration is sketched below.
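The following condensed sketch uses PyTorch; the network architecture, environment interface, and hyperparameters are all illustrative assumptions rather than anything from the original notes:

```python
import random
from collections import deque

import torch
import torch.nn as nn

def dqn(env, n_states, n_actions, gamma=0.99, epsilon=0.1,
        batch_size=32, n_episodes=500):
    """Deep Q-learning with experience replay and epsilon-greedy exploration."""
    q_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(),
                          nn.Linear(64, n_actions))
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    replay = deque(maxlen=10_000)  # experience replay memory

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy action selection.
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                with torch.no_grad():
                    a = q_net(torch.as_tensor(s, dtype=torch.float32)).argmax().item()
            s_next, r, done = env.step(a)
            replay.append((s, a, r, s_next, done))
            s = s_next

            if len(replay) < batch_size:
                continue
            # Train on a random minibatch: breaks correlation between
            # consecutive samples, making this resemble supervised learning.
            batch = random.sample(replay, batch_size)
            ss, aa, rr, ss2, dd = map(list, zip(*batch))
            ss = torch.as_tensor(ss, dtype=torch.float32)
            ss2 = torch.as_tensor(ss2, dtype=torch.float32)
            rr = torch.as_tensor(rr, dtype=torch.float32)
            dd = torch.as_tensor(dd, dtype=torch.float32)
            aa = torch.as_tensor(aa)

            q_sa = q_net(ss).gather(1, aa.unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                # Bellman target; terminal states bootstrap nothing.
                target = rr + gamma * (1 - dd) * q_net(ss2).max(dim=1).values
            loss = nn.functional.mse_loss(q_sa, target)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return q_net
```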