notes from John Schulman's Deep Reinforcement Learning lectures for MLSS 2016 in Cadiz.

• policy optimization: the policy is parameterized and you try to optimize expected reward
• includes policy gradients, derivative-free optimization (DFO)/evolutionary algorithms (though DFO doesn't work well for large numbers of parameter)
• dynamic programming: you can exactly solve some simple control problems (i.e. MDPs) with dynamic programming
• includes policy iteration, value iteration
• for more useful/realistic problems we have to use approximate versions of these algorithms (e.g. Q-learning)

There are also actor-critic methods which are policy gradient methods that use value functions.

Deep reinforcement learning is just reinforcement learning with nonlinear function approximators, usually updating parameters with stochastic gradient descent.

Policies:

• deterministic policies: $a = \pi(s)$
• stochastic policies: $a \sim \pi(a|s)$

Policies may be parameterized, i.e. $\pi_{\theta}$

Cross-entropy method (a DFO/evolutionary algorithm, for parameterized policies/policy optimization):

• initialize $\mu \in \mathbb R^d, \sigma \in \mathbb R^d$
• for each iteration
• collection $n$ samples of $\theta_i \sim N(\mu, \diag(\theta))$ (i.e. sample a population of parameter vectors)
• perform a noisy evaluation $R_i \sim \theta_i$ (i.e. for each parameter vector, evaluate its reward)
• select the top $p$ percent of samples (e.g. $p=20$); this is the elite set (the high-fitness individuals)
• fit a Gaussian distribution, with diagonal covariance, to the elite set, obtaining a new $\mu, \sigma$
• return the final $\mu$