Optimization is the task of finding the arguments to a function which yield its minimum or maximum value. An optimal argument is denoted with an asterisk, e.g.
In the context of machine learning, we are typically dealing with minimization. If necessary, a maximization problem can be reframed as minimization: to maximize a function
The function we want to optimize is called the objective function. When the particular optimization is minimization, the function may also be referred to as a cost, loss, or error function.
Optimization problems can be thought of a topology where you are looking for the global peak (if you are maximizing) or the globally lowest point (if you are minimizing). For simplicity, minimizing will be the assumed goal here, as you are often trying to minimize some error function.
Consider a very naive approach: a greedy random algorithm which starts at some position in this topology, then randomly tries moving to a new position and checks if it is better. If it is, it sets that as the current solution. It continues until it has some reason to stop, usually because it has found a minimum. This is a local minimum; that is, it is a minimum relative to its immediately surrounding points, but it is not necessarily the global minimum, which is the minimum of the entire function.
This algorithm is greedy in that it will always prefer a better scoring position, even if it is only marginally better. Thus it can be easy to get stuck in local optima - since any step away from it seems worse, even if the global optimum is right around the corner, so to speak.
Optimization may be accomplished numerically or analytically. The analytic approach involves computing derivatives and then identifying critical points (e.g. the second derivative test). These provide the exact optima.
However, this analytic approach is infeasible for many functions. In such cases we resort to numerical optimization, which involves, in a sense, guessing your way to the optima. The methods covered here are all numerical optimization methods.
Within optimization there are certain special cases:
As far as optimization goes, convex optimization is easier to deal with - convex functions have only global minima (any local minimum is a global minimum) without any saddle points.
In optimization we are typically looking to find the optimum across all points. However, we may only be interested in finding the optimum across a subset
One method for constrained optimization is the Karush-Kuhn-Tucker (KKT) method.
It uses the generalized Lagrangian (sometimes called the generalized Lagrange function).
We must describe
For each constraint we also have the variables
Then we can define the generalized Lagrangian as:
This reframes the constrained minimization problem as an unconstrained optimization of the generalized Lagrangian.
That is, so long as at least one feasible point exists and
has the same objective function value and set of optimal points as
So long as the constraints are satisfied,
and if a constraint is violated, then
Broadly we can categorize optimization methods into those that use gradients and those that do not:
Non-gradient methods include:
Gradient methods include:
Gradient methods tend to be more efficient, but are not always possible to use (you don't always have the gradient).
Gradient descent (GD) is perhaps the common minimizing optimization (for maximizing, its equivalent is gradient ascent) in machine learning.
Say we have a function
In this example, the global minimum is visually obvious, but most of the time it is not (especially when dealing with far more dimensions). But we can apply the model of a ball rolling down a hill and expand it to any arbitrary
The position the ball is at is a potential solution; here it is some values for
More formally,
We define the gradient of
So we can rewrite
We can choose
Where
Finally we have:
We can use this to compute a value for
And repeat until we hit a global (or local) minimum.
This process is known in particular as batch gradient descent because each step is computed over the entire batch of data.
With batch gradient descent, the cost function is evaluated on all the training inputs for each step. This can be quite slow.
With stochastic gradient descent (SGD), you randomly shuffle your examples and look at only one example for each iteration of gradient descent (sometimes this is called online gradient descent to contrast with minibatch gradient descent, described below). Ultimately it is less direct than batch gradient descent but gets you close to the global minimum - the main advantage is that you're not iterating over your entire training set for each step, so though its path is more wandering, it ends up taking less time on large datasets.
The reason we randomly shuffle examples is to avoid "forgetting". For instance, say you have time series data where there are somewhat different patterns later in the data than earlier on. If that training data is presented in sequence, the algorithm will "forget" the patterns earlier on in favor of those it encounters later on (since the parameter updates learned from the later-on data will effectively erase the updates from the earlier-on data).
In fact, stochastic gradient descent can help with finding the global minimum because instead of computing over a single error surface, you are working with many different error surfaces varying with the example you are current looking at. So it is possible that in one of these surfaces a local minima does not exist or is less pronounced than in others, which make it easier to surpass.
There's another form of stochastic gradient descent called minibatch gradient descent. Here
Minibatches also need to be properly representative of the overall dataset (e.g. be balanced for classes).
When the stochastic variant is used, a
An important point of clarification: an "epoch" and a training "iteration" are not necessarily the same thing.
One training iteration is one step of your optimization algorithm (i.e. one update pass). In the case of something like minibatch gradient descent, one training iteration will only look at one batch of examples.
An epoch, on the other hand, consists of enough training iterations to look at all your training examples.
So if you have a total of 1000 training examples and a batch size of 100, one epoch will consist of 10 training iterations.
The learning rate
Conditioning describes how much the output of a function varies with small changes in input.
In particular, if we have a function
Which is the ratio of the magnitude of the largest and smallest eigenvalue; when this is large, we say we have a poorly conditioned matrix since it is overly sensitive to small changes in input. This has the practical implications of slow convergence.
In the context of gradient descent, if the Hessian is poorly conditioned, then gradient descent does not perform as well. This can be alleviated with Newton's method, where a Taylor series expansion is used to approximate
Solving for the critical point gives:
(As a reminder,
Such methods which also use second-order derivatives (i.e. the Hessian) are known as second-order optimization algorithms; those that use only the gradient are called first-order optimization algorithms.
Simulated annealing is similar to the greedy random approach but it has some randomness which can "shake" it out of local optima.
Annealing is a process in metal working where the metal starts at a very high temperature and gradually cools down. Simulated annealing uses a similar process to manage its randomness.
A simulated annealing algorithm starts with a high "temperature" (or "energy") which "cools" down (becomes less extreme) as progress is made.
Like the greedy random approach, the algorithm tries a random move. If the move is better, it is accepted as the new position. If the move is worse, then there is a chance it still may be accepted; the probability of this is based on the current temperature, the current error, and the previous error:
Each random move, whether accepted or not, is considered an iteration. After each iteration, the temperature is decreased according to a cooling schedule. An example cooling schedule is:
where
For this particular schedule, you probably don't want to set
The algorithm terminates when the temperature is at its minimum.
For a problem of
One vertex of the simplex is initialized with your best educated guess of the solution vector. That guess could be the output of some other optimization approach, even a previous Nelder-Mead run. If you have nothing to start with, a random vector can be used.
The other
Then at each step of the algorithm, you want to (illustrations are for
The algorithm terminates when one of the following occurs:
Then the best vertex is considered the solution.
This optimization method is very sensitive to how it is initialized; whether or not a good solution is found depends a great deal on its starting points.
Particle swarm optimization is similar to Nelder-Mead, but instead of three points, many more points are used. These points are called "particles".
Each particle has a position (a potential solution) and a velocity which indicates where the particle moves to in the next iteration. Particles also keep track of their current error to the training examples and its best position so far.
Globally, we also track the best position overall and the lowest error overall.
The velocity for each particle is computed according to:
The influence of these components are:
These weights are parameters that must be tuned, but this method is quite robust to them (that is, they are not sensitive to these changes so you don't have to worry too much about getting them just right).
More particles are better, of course, but more intensive.
You can specify the number of epochs to run.
You can also incorporate a death-birth cycle in which low-performing particles (those that seem to be stuck, for instance) get destroyed and a new randomly-placed particle is initialized in its place.
Evolutionary algorithms are a type of algorithm which uses concepts from evolution - e.g. individuals, populations, fitness, reproduction, mutation - to search a solution space.
Genetic algorithms are the most common class of evolutionary algorithms.
Each of these steps requires some decisions by the implementer.
For instance, how do you translate a fitness score into a survival probability?
Well, the simplest way is:
Where
However, depending on how you calculate fitness, this may not be appropriate.
You could alternatively use a ranking method, in which you just look at the relative fitness rankings and not their actual values. So the most fit individual is most likely to survive, the second fit is a bit less likely, and so on.
You pick a probability constant
If you get stuck on local maxima you can try increasing the step size. When your populations start to get close to the desired value, you can decrease the step size so the changes are less sporadic (i.e. use simulated annealing).
When selecting a new population, you can incorporate a diversity rank in addition to a fitness rank. This diversity ranking tries to maximize the diversity of the new population. You select one individual for the new population, and then as you select your next individual, you try and find one which is distinct from the already selected individuals.
The general algorithm is as follows:
The specifics of how crossover and mutation work depend on your particular problem.
With evolution strategies, each individual includes not only object variables but also "strategy" parameters which, which are variances and covariances (optional) of the object variables. These strategy parameters control mutation.
From each population of size
All of the object variables, as with genetic algorithms, are derived from the same parents, though each strategy parameter may be derived from a different pair of parents, selected at random (without any selection pressure). However, the best approach is to copy an object variable from one of the parents and set each strategy parameter to be the mean of its parents' corresponding strategy parameter.
Then mutation mutates both the strategy parameters and the object variables, starting with the strategy parameters. The mutation of the strategy parameters is called self-adaptation. The object variables are mutated according to the probability distribution specified by the (mutated) strategy parameters.
There are two approaches to selection for evolutionary strategy:
Evolutionary programming does not include recombination; changes to individuals rely solely on mutation. Mutations are based on a Gaussian distribution, where the standard deviation is the square root of a linear transform (parametrized according to the user) of the parent's fitness score. Each of the
Note that in meta-evolutionary programming, the variances are also part of the individual, i.e. subject to mutation (this is self-adaptation).
The next generation is selected from the union of the parents and the offspring via a process called
Increasing
Note that Nelder-Mead, Particle Swarm, and genetic algorithm optimization methods are sometimes known as "derivative-free" because they do not involve computing derivatives in order to optimize.
Also known as the "Hessian technique".
Given a function
Where
We can approximate
Assuming the Hessian matrix is positive definite, we can show using calculus that the right-hand expression can be minimized to:
If
The second derivatives in the Hessian tells us how the gradient is changing, which provides some advantages (such as convergence speed) over traditional gradient descent.
The Hessian matrix has
There are other advanced optimization algorithms, such as:
These shouldn't be implemented on your own since they require an advanced understanding of numerical computing, even just to understand what they're doing.
They are more complex, but (in the context of machine learning) there's no need to manually pick a learning rate