Optimization is the task of finding the arguments to a function which yield its minimum or maximum value. An optimal argument is denoted with an asterisk, e.g.

In the context of machine learning, we are typically dealing with minimization. If necessary, a maximization problem can be reframed as minimization: to maximize a function

The function we want to optimize is called the **objective function**. When the particular optimization is minimization, the function may also be referred to as a **cost**, **loss**, or **error** function.

Optimization problems can be thought of a topology where you are looking for the global peak (if you are maximizing) or the globally lowest point (if you are minimizing). For simplicity, minimizing will be the assumed goal here, as you are often trying to minimize some error function.

Consider a very naive approach: a greedy random algorithm which starts at some position in this topology, then randomly tries moving to a new position and checks if it is better. If it is, it sets that as the current solution. It continues until it has some reason to stop, usually because it has found a minimum. This is a **local minimum**; that is, it is a minimum relative to its immediately surrounding points, but it is not necessarily the **global minimum**, which is the minimum of the entire function.

This algorithm is *greedy* in that it will always prefer a better scoring position, even if it is only marginally better. Thus it can be easy to get stuck in local optima - since any step away from it seems worse, even if the global optimum is right around the corner, so to speak.

Optimization may be accomplished **numerically** or **analytically**. The analytic approach involves computing derivatives and then identifying critical points (e.g. the second derivative test). These provide the exact optima.

However, this analytic approach is infeasible for many functions. In such cases we resort to numerical optimization, which involves, in a sense, guessing your way to the optima. The methods covered here are all numerical optimization methods.

Within optimization there are certain special cases:

As far as optimization goes, convex optimization is easier to deal with - convex functions have only global minima (any local minimum is a global minimum) without any saddle points.

In optimization we are typically looking to find the optimum across *all* points. However, we may only be interested in finding the optimum across a subset **constrained optimization** problem. The points that are in **feasible** points.

One method for constrained optimization is the **Karush-Kuhn-Tucker** (KKT) method.

It uses the **generalized Lagrangian** (sometimes called the generalized Lagrange function).

We must describe *equality constraints*, and *inequality constraints*, such that

For each constraint we also have the variables *KKT multipliers*.

Then we can define the generalized Lagrangian as:

This reframes the constrained minimization problem as an unconstrained optimization of the generalized Lagrangian.

That is, so long as at least one feasible point exists and

has the same objective function value and set of optimal points as

So long as the constraints are satisfied,

and if a constraint is violated, then

Broadly we can categorize optimization methods into those that use gradients and those that do not:

Non-gradient methods include:

- hill climbing
- simplex/amoeba/Nelder Mead
- genetic algorithms

Gradient methods include:

- gradient descent
- conjugate gradient
- quasi-newton

Gradient methods tend to be more efficient, but are not always possible to use (you don't always have the gradient).

**Gradient descent** (GD) is perhaps the common minimizing optimization (for maximizing, its equivalent is *gradient ascent*) in machine learning.

Say we have a function

In this example, the global minimum is visually obvious, but most of the time it is not (especially when dealing with far more dimensions). But we can apply the model of a ball rolling down a hill and expand it to any arbitrary

The position the ball is at is a potential solution; here it is some values for

More formally,

We define the *gradient* of

So we can rewrite

We can choose

Where **learning rate**), which controls the step size.

Finally we have:

We can use this to compute a value for

And repeat until we hit a global (or local) minimum.

This process is known in particular as **batch gradient descent** because each step is computed over the entire *batch* of data.

With batch gradient descent, the cost function is evaluated on all the training inputs for each step. This can be quite slow.

With **stochastic gradient descent** (SGD), you randomly shuffle your examples and look at only one example for each iteration of gradient descent (sometimes this is called *online* gradient descent to contrast with minibatch gradient descent, described below). Ultimately it is less direct than batch gradient descent but gets you close to the global minimum - the main advantage is that you're not iterating over your entire training set for each step, so though its path is more wandering, it ends up taking less time on large datasets.

The reason we randomly shuffle examples is to avoid "forgetting". For instance, say you have time series data where there are somewhat different patterns later in the data than earlier on. If that training data is presented in sequence, the algorithm will "forget" the patterns earlier on in favor of those it encounters later on (since the parameter updates learned from the later-on data will effectively erase the updates from the earlier-on data).

In fact, stochastic gradient descent can help with finding the global minimum because instead of computing over a single error surface, you are working with many different error surfaces varying with the example you are current looking at. So it is possible that in one of these surfaces a local minima does not exist or is less pronounced than in others, which make it easier to surpass.

There's another form of stochastic gradient descent called **minibatch gradient descent**. Here *too* large, resulting in greater time for convergence. But generally it is faster than SGD and has the benefit of aiding in local minima avoidance.

Minibatches also need to be properly representative of the overall dataset (e.g. be balanced for classes).

When the stochastic variant is used, a

An important point of clarification: an "epoch" and a training "iteration" are not necessarily the same thing.

One training iteration is one step of your optimization algorithm (i.e. one update pass). In the case of something like minibatch gradient descent, one training iteration will only look at one batch of examples.

An epoch, on the other hand, consists of enough training iterations to look at all your training examples.

So if you have a total of 1000 training examples and a batch size of 100, one epoch will consist of 10 training iterations.

The learning rate

**Conditioning** describes how much the output of a function varies with small changes in input.

In particular, if we have a function **condition number** as follows:

Which is the ratio of the magnitude of the largest and smallest eigenvalue; when this is large, we say we have a *poorly conditioned* matrix since it is overly sensitive to small changes in input. This has the practical implications of slow convergence.

In the context of gradient descent, if the Hessian is poorly conditioned, then gradient descent does not perform as well. This can be alleviated with **Newton's method**, where a Taylor series expansion is used to approximate

Solving for the critical point gives:

(As a reminder,

Such methods which also use second-order derivatives (i.e. the Hessian) are known as *second-order optimization algorithms*; those that use only the gradient are called *first-order* optimization algorithms.

**Simulated annealing** is similar to the greedy random approach but it has some randomness which can "shake" it out of local optima.

Annealing is a process in metal working where the metal starts at a very high temperature and gradually cools down. Simulated annealing uses a similar process to manage its randomness.

A simulated annealing algorithm starts with a high "*temperature*" (or "energy") which "cools" down (becomes less extreme) as progress is made.

Like the greedy random approach, the algorithm tries a random move. If the move is better, it is accepted as the new position. If the move is worse, then there is a chance it still may be accepted; the probability of this is based on the current temperature, the current error, and the previous error:

Each random move, whether accepted or not, is considered an iteration. After each iteration, the temperature is decreased according to a *cooling schedule*. An example cooling schedule is:

where

$T_{\text{init}}$ = the starting temperature$T_{\text{final}}$ = the minimum/ending temperature$k$ = the current iteration$k_{\max}$ = the maximum number of iterations

For this particular schedule, you probably don't want to set

The algorithm terminates when the temperature is at its minimum.

For a problem of **simplex**.

One vertex of the simplex is initialized with your best educated guess of the solution vector. That guess could be the output of some other optimization approach, even a previous Nelder-Mead run. If you have nothing to start with, a random vector can be used.

The other

Then at each step of the algorithm, you want to (illustrations are for

- Find the worst, second worst, and best scoring vertices
- Reflect the worst vertex to some point
$p'$ through the best side - If
$p'$ is better,*expand*by setting the worst vertex to a new point$p''$ , a bit further than$p'$ but in the same direction - If
$p'$ is worse, then*contract*by setting the worst vertex to a new point$p''$ , in the same direction as$p'$ but before crossing the best side

The algorithm terminates when one of the following occurs:

- The maximum number of iterations is reached
- The score is "good enough"
- The vertices have become close enough together

Then the best vertex is considered the solution.

This optimization method is very sensitive to how it is initialized; whether or not a good solution is found depends a great deal on its starting points.

**Particle swarm optimization** is similar to Nelder-Mead, but instead of three points, many more points are used. These points are called "particles".

Each particle has a position (a potential solution) and a velocity which indicates where the particle moves to in the next iteration. Particles also keep track of their current error to the training examples and its best position so far.

Globally, we also track the best position overall and the lowest error overall.

The velocity for each particle is computed according to:

- it's inertia (i.e. the current direction it is moving in)
- it's historic best position (i.e. the best position it's found so far)
- the global best position

The influence of these components are:

- inertia weight
- cognitive weight (for historic best position)
- social weight (for global best position)

These weights are parameters that must be tuned, but this method is quite robust to them (that is, they are not sensitive to these changes so you don't have to worry too much about getting them just right).

More particles are better, of course, but more intensive.

You can specify the number of epochs to run.

You can also incorporate a death-birth cycle in which low-performing particles (those that seem to be stuck, for instance) get destroyed and a new randomly-placed particle is initialized in its place.

Evolutionary algorithms are a type of algorithm which uses concepts from evolution - e.g. individuals, populations, fitness, reproduction, mutation - to search a solution space.

Genetic algorithms are the most common class of evolutionary algorithms.

- You have a
*population*of "**chromosomes**" (e.g. possible solutions or parameters, which are also called "*object variables*"). These chromosomes are interchangeably referred to as "*individuals*" - There may be some
**mutation**in the chromosomes (e.g. with binary chromosomes, sometimes 0s become 1s and vice versa or with continuous values, changes happen according to some step size) - Parents have children, in which their chromosomes
**crossover**- the front part of one chromosome combines with the back part of another. This is also called*recombination*. - The
**genotype**(the chromosome composition) is expressed as some**phenotype**(i.e. some genetically-determined properties) in some individuals - Then each of these individuals has some
**fitness**value resulting from their phenotypes - These fitnesses are turned into some probability of
**survival**(this*selection pressure*is what pushes the system towards an optimal individual) - Then the individuals are
*selected*randomly based on their survival probabilities - These individuals form the new chromosome population for the next
**generation**

Each of these steps requires some decisions by the implementer.

For instance, how do you translate a fitness score into a survival probability?

Well, the simplest way is:

Where

However, depending on how you calculate fitness, this may not be appropriate.

You could alternatively use a ranking method, in which you just look at the relative fitness rankings and not their actual values. So the most fit individual is most likely to survive, the second fit is a bit less likely, and so on.

You pick a probability constant

If you get stuck on local maxima you can try increasing the step size. When your populations start to get close to the desired value, you can decrease the step size so the changes are less sporadic (i.e. use simulated annealing).

When selecting a new population, you can incorporate a diversity rank in addition to a fitness rank. This diversity ranking tries to maximize the diversity of the new population. You select one individual for the new population, and then as you select your next individual, you try and find one which is distinct from the already selected individuals.

The general algorithm is as follows:

- randomly initialize a population of
$\mu$ individuals - compute fitness scores for each individual
- randomly choose
$\frac{\mu}{2}$ pairs of parents, weighted by fitness (see above for an example), to reproduce - with probability
$P_c$ (a hyperparameter, e.g. 0.8), perform crossover on the parents to form two children, which replaces the old population (you may also choose the keep some of the old population, rather than having two children per pair of parents, as per above - there is no universal genetic algorithm; you typically need to adjust it for a particular task) - randomly apply mutation to some of the population with probability
$P_m$ (a hyperparameter, e.g. 0.01) - repeat

The specifics of how crossover and mutation work depend on your particular problem.

With *evolution strategies*, each individual includes not only object variables but also "strategy" parameters which, which are variances and covariances (optional) of the object variables. These strategy parameters control mutation.

From each population of size

All of the object variables, as with genetic algorithms, are derived from the same parents, though each strategy parameter may be derived from a different pair of parents, selected at random (without any selection pressure). However, the best approach is to copy an object variable from one of the parents and set each strategy parameter to be the mean of its parents' corresponding strategy parameter.

Then mutation mutates both the strategy parameters and the object variables, starting with the strategy parameters. The mutation of the strategy parameters is called *self-adaptation*. The object variables are mutated according to the probability distribution specified by the (mutated) strategy parameters.

There are two approaches to selection for evolutionary strategy:

just involves taking the best$(\mu, \lambda)$ selection$\mu$ individuals from the$\lambda$ offspring.involves selecting the best$(\mu + \lambda)$ selection$\mu$ individuals from the union of the$\lambda$ offspring and the$\mu$ parents.

*Evolutionary programming* does not include recombination; changes to individuals rely solely on mutation. Mutations are based on a Gaussian distribution, where the standard deviation is the square root of a linear transform (parametrized according to the user) of the parent's fitness score. Each of the

Note that in *meta-evolutionary programming*, the variances are also part of the individual, i.e. subject to mutation (this is self-adaptation).

The next generation is selected from the union of the parents and the offspring via a process called * $q$-tournament selection*. Each candidate is paired with

Increasing

Note that Nelder-Mead, Particle Swarm, and genetic algorithm optimization methods are sometimes known as "derivative-free" because they do not involve computing derivatives in order to optimize.

Also known as the "Hessian technique".

Given a function

Where

We can approximate

Assuming the Hessian matrix is positive definite, we can show using calculus that the right-hand expression can be minimized to:

If *learning rate*):

- Initialize
$\theta$ . - Update
$\theta$ to$\theta' = \theta -\eta H^{-1} \nabla C$ , computing$H$ and$\nabla C$ at$\theta$ . - Update
$\theta'$ to$\theta'' = \theta - \eta H'^{-1} \nabla' C$ , computing$H'$ and$\nabla' C$ at$\theta'$ . - And so on.

The second derivatives in the Hessian tells us how the gradient is changing, which provides some advantages (such as convergence speed) over traditional gradient descent.

The Hessian matrix has

There are other advanced optimization algorithms, such as:

- Conjugate gradient
- BFGS
- L-BFGS

These shouldn't be implemented on your own since they require an advanced understanding of numerical computing, even just to understand what they're doing.

They are more complex, but (in the context of machine learning) there's no need to manually pick a learning rate

- Swarm Intelligence Optimization using Python. James McCaffrey. PyData 2015.
- Machine Learning. 2014. Andrew Ng. Stanford University/Coursera.
- Neural Networks and Deep Learning, Michael A Nielsen. Determination Press, 2015.
- Deep Learning. Yoshua Bengio, Ian Goodfellow, Aaron Courville.
- Genetic and Evolutionary Algorithms. Gareth Jones.
- Advanced Topics: Reinforcement Learning, Lecture 7. David Silver.
- An Interactive Tutorial on Numerical Optimization. Ben Frederickson.