Notes from John Schulman's Deep Reinforcement Learning lectures for MLSS 2016 in Cadiz.

- Lecture 1
- Lecture 2
- Lecture 3
- Lecture 4

Broadly, two approaches to RL:

- policy optimization: the policy is parameterized, and you try to optimize expected reward
    - includes policy gradients and derivative-free optimization (DFO)/evolutionary algorithms (though DFO doesn't work well for large numbers of parameters)

- dynamic programming: you can exactly solve some simple control problems (i.e. MDPs) with dynamic programming
    - includes policy iteration and value iteration (a tabular value-iteration sketch follows this list)
    - for more useful/realistic problems we have to use approximate versions of these algorithms (e.g. Q-learning)
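For concreteness, here is a minimal sketch of tabular value iteration on a finite MDP given as explicit transition and reward arrays. This is my own illustration, not from the lecture; the function name `value_iteration`, the array layout, and the defaults are assumptions.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Tabular value iteration for a finite MDP (illustrative sketch).

    P: transition probabilities, shape (n_states, n_actions, n_states)
    R: expected rewards, shape (n_states, n_actions)
    Returns the optimal value function V and a greedy policy.
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, Q.argmax(axis=1)  # greedy policy w.r.t. the converged values
```

This exact enumeration over states and actions is what stops scaling to realistic problems, which is why approximate versions such as Q-learning are needed.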

There are also actor-critic methods, which are policy gradient methods that use value functions.

*Deep reinforcement learning* is just reinforcement learning with nonlinear function approximators, usually updating parameters with stochastic gradient descent.

Policies:

- deterministic policies: $a = \pi(s)$
- stochastic policies: $a \sim \pi(a|s)$

Policies may be parameterized, i.e. $a = \pi_\theta(s)$ or $a \sim \pi_\theta(a|s)$, where $\theta$ is a vector of parameters (e.g. the weights of a neural network).
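As a minimal illustration (my own, not from the lecture), a stochastic policy over a discrete action space can be parameterized as a softmax over linear action scores; the name `softmax_policy` and the linear featurization are assumptions:

```python
import numpy as np

def softmax_policy(theta, s, n_actions):
    """Stochastic policy pi_theta(a|s): softmax over linear action scores.

    theta: parameter matrix of shape (n_actions, state_dim)  [assumed layout]
    s: state feature vector of shape (state_dim,)
    """
    scores = theta @ s                  # one score per action
    scores -= scores.max()              # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return np.random.choice(n_actions, p=probs)  # a ~ pi_theta(a|s)
```

A deterministic parameterized policy would instead return something like `int(np.argmax(theta @ s))`.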

Cross-entropy method (a DFO/evolutionary algorithm, for parameterized policies/policy optimization):

- initialize $\mu \in \mathbb{R}^d, \sigma \in \mathbb{R}^d$
- for each iteration:
    - collect $n$ samples of $\theta_i \sim N(\mu, \operatorname{diag}(\sigma^2))$ (i.e. sample a population of parameter vectors)
    - perform a noisy evaluation $R_i \sim \theta_i$ (i.e. for each parameter vector, evaluate its reward)
    - select the top $p$ percent of samples (e.g. $p = 20$); this is the *elite set* (the high-fitness individuals)
    - fit a Gaussian distribution, with diagonal covariance, to the elite set, obtaining a new $\mu, \sigma$
- return the final $\mu$
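A minimal NumPy sketch of the loop above (my own translation of the pseudocode; the population size, elite percentage, iteration count, and initialization defaults are assumptions):

```python
import numpy as np

def cross_entropy_method(evaluate, d, n=100, p=20, n_iters=50, seed=0):
    """Cross-entropy method over a d-dimensional parameter space.

    evaluate: function mapping a parameter vector theta to a (noisy) reward.
    n: population size; p: elite percentage  [defaults are my own choices]
    """
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(d), np.ones(d)      # initialize mu, sigma in R^d
    n_elite = max(1, int(n * p / 100))
    for _ in range(n_iters):
        # sample a population of parameter vectors: theta_i ~ N(mu, diag(sigma^2))
        thetas = mu + sigma * rng.standard_normal((n, d))
        # noisy evaluation: reward of each parameter vector
        rewards = np.array([evaluate(th) for th in thetas])
        # elite set: the top p percent of samples by reward
        elite = thetas[np.argsort(rewards)[-n_elite:]]
        # refit a diagonal Gaussian to the elite set
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu  # return the final mu
```

For example, `cross_entropy_method(lambda th: -np.sum((th - np.ones(3)) ** 2), d=3)` should return a vector near `np.ones(3)`, since CEM simply climbs the (noisy) reward landscape without using gradients.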