When it comes down to it, a neural net is just a very sophisticated way of fitting a curve.
Neural networks with at least one hidden layer are universal approximators, which means they can approximate any (continuous) function. The approximation can be improved by increasing the number of hidden neurons in the network (though this also increases the risk of overfitting).
A key advantage of neural networks is that they are capable of learning features on their own, without much human involvement.
Neural networks can also learn distributed representations. Say we have animals of some type and color. We could learn representations for each of them, e.g. with a neuron for a red cat, one for a blue dog, one for a blue cat, etc. But this would mean learning many, many representations (for instance, with three types and three colors, we learn nine representations). With distributed representations, we instead have neurons that learn the different colors and other neurons which learn the different types (for instance, with three types and three colors, we have six neurons). Thus the representation of a given case, such as a blue dog, is distributed across the neurons, and ultimately much more compact.
One challenge with neural networks is that it is often hard to understand what a neural network has learned - they may be black boxes which do what we want, but we can't peer in and clearly see how it's doing it.
Artificial neural networks (ANNs) are (conceptually) based on biological neural networks such as the human brain.
Neural networks are composed of neurons which send signals to each other in response to certain inputs.
A single neuron takes in one or more inputs (via dendrites), processes them, and fires one output (via its axon).
Note that the term "unit" is often used instead of "neuron" when discussing artificial neural networks, to dissociate them from the biological version - while there is some basis in biological neural networks, the differences are vast, so it is misleading to present the two as closely analogous.
A perceptron, first described by Frank Rosenblatt in 1957, is an artificial neuron (a computational model of a biological neuron, first introduced in 1943 by Warren McCulloch and Walter Pitts).
Like a biological neuron, it has multiple inputs, processes them, and returns one output.
Each input has a weight associated with it.
In the simplest artificial neuron, a "binary" or "classic spin" neuron, the neuron "fires" an output of 1 if the weighted sum of its inputs is above some threshold, and -1 otherwise.
A single-layer perceptron can't learn XOR:
A line can't be drawn to separate the inputs for which XOR outputs 1 from those for which it outputs 0 - the two classes are not linearly separable, and a single-layer perceptron can only learn linearly separable functions.
A sigmoid neuron is another artificial neuron, similar to a perceptron. However, while the perceptron has a binary output, the sigmoid neuron has a continuous output:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

which, with $z = w \cdot x + b$, can also be written:

$$\frac{1}{1 + e^{-(w \cdot x + b)}}$$

Note that if $z$ is a large positive number the output approaches 1, and if $z$ is a large negative number the output approaches 0 - much like a perceptron.
Here is the sigmoid function visualized:
Which is a smoothed out step function (which is how a perceptron operates):
Sigmoid neurons are useful because small changes in weights and biases will only produce small changes in output from a given neuron (rather than switching between binary output values as is the case with the step function, which is typically too drastic).
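The contrast can be made concrete in code (a minimal sketch; the function names and threshold are our own choices):

```python
import math

def step(z, threshold=0.0):
    # Perceptron-style activation: jumps abruptly between -1 and 1
    return 1 if z > threshold else -1

def sigmoid(z):
    # Smooth activation: output varies continuously in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# A small change in input barely moves the sigmoid's output...
print(sigmoid(0.0), sigmoid(0.01))   # ~0.5 vs ~0.5025
# ...but can flip the step function's output entirely.
print(step(0.0), step(0.01))         # -1 vs 1
```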
The function that determines the output of a neuron is known as the activation function. In the binary/classic spin case, it might look like:
weights = [...]
inputs = [...]
sum_w = sum(w * x for w, x in zip(weights, inputs))

def activate(sum_w, threshold):
    return 1 if sum_w > threshold else -1
Or:
Note that
In some interpretations, the "binary" neuron returns "0" or "1" instead of "-1" or "1".
An activation function can generally be described as some function:
where
A common activation function is the sigmoid function, which takes input and squashes it to be in the range $(0, 1)$.
However, the sigmoid activation function has some problems. If the activation yields values at the tails of 0 or 1, the gradient ends up being almost 0. In backpropagation, this local gradient is multiplied with the gradient of the node's output against the total error - if this local gradient is near 0, it "kills" the gradient preventing any signal from going further backwards in the network. For this reason, when using the sigmoid activation function you must be careful of how you initialize the weights - if they are too large, you will "saturate" the network and kill the gradient in this way.
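The saturation effect is easy to see numerically (a sketch using the sigmoid and its derivative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# Near zero the local gradient is healthy; at the tails it vanishes,
# "killing" the gradient signal during backpropagation.
print(sigmoid_grad(0.0))    # 0.25 (the maximum possible value)
print(sigmoid_grad(10.0))   # ~4.5e-05
```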
Furthermore, sigmoid outputs are not zero-centered:
This is undesirable since neurons in later layers of processing in a Neural Network (more on this soon) would be receiving data that is not zero-centered. This has implications on the dynamics during gradient descent, because if the data coming into a neuron is always positive (e.g. $x > 0$ elementwise in $f = w^T x + b$), then the gradient on the weights $w$ will, during backpropagation, become either all positive or all negative (depending on the gradient of the whole expression $f$). This could introduce undesirable zig-zagging dynamics in the gradient updates for the weights. However, notice that once these gradients are added up across a batch of data the final update for the weights can have variable signs, somewhat mitigating this issue. Therefore, this is an inconvenience but it has less severe consequences compared to the saturated activation problem above. (CS231n Convolutional Neural Networks for Visual Recognition, Module 1: Modeling one neuron, Andrej Karpathy)
The tanh activation function is another option; it squashes values to be in the range $(-1, 1)$.
Note that $\tanh(x) = 2\sigma(2x) - 1$; that is, the tanh neuron is simply a scaled and shifted sigmoid neuron, but unlike the sigmoid its output is zero-centered.
The Rectified Linear Unit (ReLU) is $f(x) = \max(0, x)$. It is cheap to compute and does not saturate for positive inputs, but ReLU units can "die": a large gradient update can push the weights such that the neuron never activates on any input again, after which its gradient is always zero and it stops learning.
Leaky ReLUs are an attempt to fix this problem. Rather than outputting 0 when $x < 0$, a leaky ReLU has a small negative slope there, e.g. $f(x) = 0.01x$ for $x < 0$ (and $f(x) = x$ otherwise).
Note that
Another alternative is a randomized leaky ReLU, where the slope on negative inputs is sampled randomly (from some range) during training and fixed at test time.
There are also some units which defy the conventional activation form of $f(w^T x + b)$ - for example, the maxout unit, which computes $\max(w_1^T x + b_1, w_2^T x + b_2)$, a generalization of both the ReLU and the leaky ReLU.
Karpathy suggests:
Use the ReLU non-linearity, be careful with your learning rates and possibly monitor the fraction of "dead" units in a network. If this concerns you, give Leaky ReLU or Maxout a try. Never use sigmoid. Try tanh, but expect it to work worse than ReLU/Maxout. source
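For reference, ReLU and a leaky variant in code (the 0.01 slope is a common but arbitrary choice):

```python
def relu(z):
    # Outputs z when positive, 0 otherwise
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Like ReLU, but lets a small gradient through for negative inputs,
    # which helps avoid "dead" units
    return z if z > 0 else alpha * z

print(relu(2.0), relu(-2.0))              # 2.0 0.0
print(leaky_relu(2.0), leaky_relu(-2.0))  # 2.0 -0.02
```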
Activation Function | Propagation | Backpropagation |
---|---|---|
Sigmoid | $\frac{1}{1 + e^{-x}}$ | $\sigma(x)(1 - \sigma(x))$ |
Tanh | $\tanh(x)$ | $1 - \tanh^2(x)$ |
ReLU | $\max(0, x)$ | $1$ if $x > 0$, else $0$ |
Ramp | $\min(\max(0, x), 1)$ | $1$ if $0 < x < 1$, else $0$ |
There is also the hard-tanh activation function, which is an approximation of tanh that is cheaper to compute: $\text{hardtanh}(x) = \max(-1, \min(1, x))$ (linear in $[-1, 1]$, clipped outside it).
The softmax function (called such because it is like a "softened" maximum function) may be used as the output layer's activation function. It takes the form:

$$\text{softmax}(z)_j = \frac{e^{z_j}}{\sum_k e^{z_k}}$$

To clarify, we are summing over all the output neurons in the denominator.
This function has the properties that it sums to 1 and that all of its outputs are positive, which are useful for modeling probability distributions.
The cost function to use with softmax is the (categorical) cross-entropy loss function. It has the nice property of having a very big gradient when the target value is 1 and the output is almost 0.
The categorical cross-entropy loss:

$$C = -\sum_j y_j \log \hat{y}_j$$

where $y$ is the one-hot target vector and $\hat{y}$ is the output (e.g. softmax) vector.
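A minimal sketch of both in pure Python (the variable names are our own):

```python
import math

def softmax(z):
    # Subtract the max for numerical stability (doesn't change the result)
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_true, y_pred):
    # y_true is a one-hot vector, so only the target class contributes
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

probs = softmax([2.0, 1.0, 0.1])
print(probs)                         # sums to 1, all entries positive
print(cross_entropy([1, 0, 0], probs))
```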
You can base your activation function on radial basis functions (RBFs):
where
A radial basis function (RBF) is a function whose value depends only on the distance of its input from some fixed center point $c$, i.e. $\phi(x) = \phi(\|x - c\|)$.
can be in
A neural network that uses RBFs as its activation functions is known as radial basis function neural network.
The 1D Gaussian RBF is an example:
Defined as:

$$\phi(x) = e^{-\frac{(x - c)^2}{2\sigma^2}}$$
The Ricker Wavelet is another example:
Defined as:

$$\psi(x) = \frac{2}{\sqrt{3\sigma}\,\pi^{1/4}} \left(1 - \frac{x^2}{\sigma^2}\right) e^{-\frac{x^2}{2\sigma^2}}$$
A feed-forward neural network is a simple neural network with an input layer, an output layer, and one or more intermediate layers of neurons.
These layers are fully-connected in that every neuron of the previous layer is connected to every neuron in the following layer. Such layers are also called affine or dense layers.
When we describe the network in terms of layers as a "
This model is often called "feed-forward" because values go into the input layer and are fed into subsequent layers.
Different learning algorithms can train such a network so that its weights are adjusted appropriately for a given task. It's worth emphasizing that the structure of the network is distinct from the learning algorithm which tunes its weights and biases.
The most common algorithm for adjusting a neural network's weights and biases is backpropagation.
Backpropagation is just the calculation of partial derivatives (the gradient) by moving backwards through the network (from output to input), accumulating them by applying the chain rule. In particular, it computes the gradient of the loss function with respect to the weights in the network (i.e. the derivatives of the loss function with respect to each weight in the network) in order to update the weights.
We compute the total error for the network on the training data and then want to know how a change in an individual weight within the network affects this total error (i.e. the result of our cost function), e.g. $\frac{\partial E_{\text{total}}}{\partial w}$ for a weight $w$.
"Backpropagation" is almost just a special term for the chain rule in the context of training neural networks. This is because a neural network can be thought of as a composition of functions, in which case to compute the derivative of the overall function, you just apply the chain rule for computing derivatives.
To elaborate on thinking of neural network as a "composition of functions": each layer represents a function taking in the inputs of the previous layer's output, e.g. if the previous layer is a function that outputs a vector,
The general procedure for training a neural network with backpropagation is:
Consider the following simple neural net:
Here's a single neuron expanded:
Remember that a neuron processes its inputs by computing the dot product of its weights and inputs (i.e. the sum of its weight-input products) and then passes this resulting net input into its activation function (in this case, it is the sigmoid function).
Say we have passed some training data through the network and computed the total error as
Then we take this value and subtract it, multiplied by a learning rate
If we wanted to calculate the update value for
Any activation function can be used with backpropagation; it just must be differentiable everywhere.
(adapted from the CS231n notes cited below)
Refresher on derivatives: say you have a function
As a simple example, consider the function
Now consider the function
In code (adapted from the CS231 notes cited below)
# set some inputs
x = -2; y = 5; z = -4
# perform the forward pass
q = x + y # q becomes 3
f = q * z # f becomes -12
# perform the backward pass (backpropagation) in reverse order:
# first backprop through f = q * z
dfdz = q # df/dz = q, so gradient on z becomes 3
dfdq = z # df/dq = z, so gradient on q becomes -4
# now backprop through q = x + y
dqdx = 1.
dqdy = 1.
dfdx = dqdx * dfdq # dq/dx = 1. And the multiplication here is the chain rule!
dfdy = dqdy * dfdq # dq/dy = 1
So essentially you can decompose any function into smaller, simpler functions, compute the gradients for those, then use the chain rule to aggregate them into the original function's gradient.
Consider the neural network:
Where:
Given training data:
Where
Thus
For a node
For a layer
Note that
The feed-forward step is a straightforward algorithm:
OUT^0 = X
for i in 1...N
OUT^i = f^i(W^i \cdot OUT^{i-1}+b^i)
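In pure Python, the same loop might look like this (a sketch using list-based vectors; the tanh activation and the weight values are just illustrations):

```python
import math

def dense(W, b, x):
    # One fully-connected layer: W.x + b
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def feed_forward(layers, x):
    # layers is a list of (W, b, activation) triples
    out = x
    for W, b, f in layers:
        out = [f(v) for v in dense(W, b, out)]
    return out

# Tiny 2-2-1 network with arbitrary weights
net = [
    ([[0.5, -0.2], [0.1, 0.4]], [0.0, 0.1], math.tanh),
    ([[0.3, -0.7]], [0.2], math.tanh),
]
print(feed_forward(net, [1.0, 2.0]))
```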
With backpropagation, we are interested in the gradient
The main advantage of backpropagation is that it allows us to compute this gradient efficiently. There are other ways we could do it. For instance, we could manually calculate the partial derivatives of the cost function with respect to each individual weight, which, if we had
Some notation:
The error for a layer
For the output layer, this is straightforward (by applying the chain rule):
Since
Thus we have:
Note that for
For the hidden layer prior to the output,
We have already calculated the term
Since
Similarly, since
Thus:
This is how we compute
We are most interested in updating weights and biases, rather than knowing the errors themselves. That is, we are most interested in the quantities:
For any layer
These are relatively easy to derive.
We want to update the weights such that the error is lowered with the new weights. Thus we compute the gradient of the error with respect to the weights and biases to learn in which way the error is increasing and by how much. Then we move in the opposite direction by that amount (typically weighted by a learning rate).
We previously showed that
Then, knowing that
Thus:
Then we can use these for gradient descent.
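Putting the pieces together, here is a tiny end-to-end sketch: one hidden layer, sigmoid activations, quadratic cost, trained on a single example (the sizes, learning rate, and target are arbitrary illustrations):

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
# 2 inputs -> 2 hidden -> 1 output, small random weights (symmetry breaking)
W1 = [[random.gauss(0, 0.5) for _ in range(2)] for _ in range(2)]
b1 = [0.0, 0.0]
W2 = [[random.gauss(0, 0.5) for _ in range(2)]]
b2 = [0.0]

x, y = [1.0, 0.5], 1.0
lr = 0.5

for _ in range(1000):
    # Forward pass
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    o = sigmoid(sum(w * hi for w, hi in zip(W2[0], h)) + b2[0])

    # Backward pass: delta = dC/d(net input), chain rule layer by layer
    delta_o = (o - y) * o * (1 - o)
    delta_h = [delta_o * W2[0][j] * h[j] * (1 - h[j]) for j in range(2)]

    # Gradient descent updates
    for j in range(2):
        W2[0][j] -= lr * delta_o * h[j]
    b2[0] -= lr * delta_o
    for j in range(2):
        for i in range(2):
            W1[j][i] -= lr * delta_h[j] * x[i]
        b1[j] -= lr * delta_h[j]

print(o)  # approaches the target 1.0 as training proceeds
```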
A quick bit of notation:
So, to clarify, we are computing
Backpropagation guarantees neither that training will succeed nor that it will be quick.
Possible issues include:
Note that in high dimensions, local minima are usually not a problem.
More typically, there are saddle points, which slow down training but can be escaped in time. This is because with many dimensions, it is unlikely that a point is a minimum in all dimensions (if we consider that a point is a minimum in one dimension with probability
As training nears the global minimum,
Statistical (or "stochastic") training methods, contrasted with deterministic training methods (such as backpropagation as described above), involve some randomness to avoid local minima. They generally work by randomly leaving local minima to possibly find the global minimum. The severity of this randomness decreases over time so that a solution is "settled" upon (this gradual "cooling" of the randomness is the key part of simulated annealing).
Simulated annealing applied as a training method to a neural network is called Boltzmann training (neural networks trained in this way are called Boltzmann machines):
Where:
Steps 3 and 4 are repeated for each of the weights in the network as
The random weight change can be selected in a few ways, but one is just choosing it from a Gaussian distribution,
Boltzmann training uses the following cooling rate, which is necessary for convergence to a global minimum:
Where
The problem with Boltzmann training is that it can be very slow (the cooling rate as computed above is very low).
This can be resolved by using the Cauchy distribution instead of the Boltzmann distribution; the former has fatter tails so has a higher probability of selecting large step sizes. Thus the cooling rate can be much quicker:
The Cauchy distribution is:
where
This can be integrated, which makes selecting random weights much easier:
Where
Here we can just select a random number from a uniform distribution in
Cauchy training still may be slow so we can also use a method based on artificial specific heat (in annealing, there are discrete energy levels where phase changes occur, at which abrupt changes in the "specific heat" occur). In the context of artificial neural networks, we define the (pseudo)specific heat to be the average rate of change of temperature with the objective function. The idea is that there are parts where the objective function is sensitive to small changes in temperature, where the average value of the objective function makes an abrupt change, so the temperature must be changed slowly here so as not to get stuck in a local minimum. Where the average value of the objective function changes little with temperature, large changes in temperature can be used to quicken things.
Still, Cauchy training may be much slower than backprop, and can have issues of network paralysis (because it is possible to have very large random weight changes), esp. if a nonlinearity is used as the activation function (see the bit on network paralysis and the sigmoid function above).
Cauchy training may be combined with backprop to get the best of both worlds - it simply involves computing both the backprop and Cauchy weight updates and applying their weighted sum as the update. Then, the objective function's change is computed, and like with Cauchy training, if there is an improvement, the weight change is kept, otherwise, it is kept with a probability determined by the Boltzmann distribution.
The weighted sum of the individual weight updates is controlled by a coefficient
There is still the issue of the possibility of retaining a massive weight change due to the Cauchy distribution's infinite variance, which creates the possibility of network paralysis. The recommended approach here is to detect saturated neurons by looking at their
The amount weights and biases are adjusted is determined by a learning rate
# LEARNING_CONSTANT is defined elsewhere
def adjust_weight(weight, error, input):
return weight + error * input * LEARNING_CONSTANT
In some cases, a simulated annealing approach is used, where the learning constant may be tempered (made smaller, less drastic) as the network evolves, to avoid jittering the network out of the global optimum.
Over the course of training, it is often better to gradually decrease the learning rate as you approach an optimum so that you don't "overshoot" it.
The appropriate learning rate can vary across parameters, so it can help to have different adaptive learning rates for each parameter.
For example, the magnitudes of gradients are often very different across layers (starting small early on, growing larger further on).
The fan-in of a neuron (number of inputs) also has an effect, determining the size of "overshoot" effects (the more inputs there are, the more weights are changed simultaneously, all to adjust the same error, which is what can cause the overshooting).
So what you can do is manually set a global learning rate, then for each weight multiply this global learning rate by a local gain, determined empirically per weight.
One way to determine these learning rates is as follows:
This ensures that big gains decay rapidly when oscillations start.
Another tip: limit the gains to lie in some reasonable range, e.g.
Note that these adaptive learning rates are meant for full batch learning or for very big mini-batches. Otherwise, you may encounter gradient sign changes that are just due to sampling error of a mini-batch.
These adaptive learning rates can also be combined with momentum by using agreement in sign between the current gradient for a weight and the velocity for that weight.
Note that adaptive learning rates deal only with axis-aligned effects.
There are a variety of gradient descent algorithms used for training neural networks. Some of the more popular ones include the following.
Momentum is a technique that can be combined with gradient descent to improve its performance.
Conceptually, it applies the idea of velocity and friction to the error surface (imagine a ball rolling around the error surface to find a minimum).
We incorporate a matrix of velocity values
To do so, we break our gradient descent update rule (
Another hyperparameter,
You can see that if
A variation on regular momentum is Nesterov momentum:
Here we add the velocity to the parameters before computing the gradient.
Using gradient descent with this momentum is also called the Nesterov accelerated gradient.
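A sketch of both update rules on a one-dimensional quadratic (the momentum coefficient and learning rate are illustrative values):

```python
def grad(x):
    return 2 * x  # gradient of f(x) = x^2

mu, lr = 0.9, 0.1  # momentum coefficient and learning rate

# Classical momentum: accumulate a velocity, then step along it
x, v = 5.0, 0.0
for _ in range(100):
    v = mu * v - lr * grad(x)
    x += v
x_momentum = x

# Nesterov momentum: evaluate the gradient at the look-ahead position x + mu*v
x, v = 5.0, 0.0
for _ in range(100):
    v = mu * v - lr * grad(x + mu * v)
    x += v
x_nesterov = x

print(x_momentum, x_nesterov)  # both converge toward the minimum at 0
```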
Adagrad is an adaptive learning rate algorithm that keeps track of squared gradients over time. This allows it to distinguish frequently updated parameters from infrequently updated ones - as a result, learning rates can be adapted per parameter over time (e.g. higher learning rates are assigned to infrequently updated parameters). This makes it quite useful for sparse data, and also means that the learning rate does not need to be manually tuned.
More formally, for each parameter
The learning rate
An additional smoothing term $\epsilon$ (usually on the order of 1e-8) is added to the denominator to avoid division by zero.
As vector operations, this is written:
Note that as training progresses, the accumulated squared-gradient term in the denominator keeps growing, so the effective learning rate continually shrinks and eventually becomes infinitesimally small - at which point learning effectively halts.
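A per-parameter Adagrad sketch in one dimension (epsilon is the smoothing term; the learning rate is illustrative):

```python
def grad(x):
    return 2 * x  # gradient of f(x) = x^2

x = 5.0
lr, eps = 1.0, 1e-8
cache = 0.0  # running sum of squared gradients

for _ in range(200):
    g = grad(x)
    cache += g * g                      # accumulate squared gradient
    x -= lr * g / (cache ** 0.5 + eps)  # effective step shrinks over time

print(x)  # approaches 0, but ever more slowly as cache grows
```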
Adadelta is an improvement on Adagrad designed to deal with its halting learning rate. Whereas Adagrad takes the sum of all past squared gradients for a parameter, Adadelta takes only the past $w$ squared gradients (in practice implemented as an exponentially decaying running average rather than a stored window).
This is implemented as follows. We keep a running average
And then update the Adagrad equation to replace the matrix
Note that this running average is the same as the root mean squared (RMS) error criterion of the gradient, so this can be re-written:
There is one more enhancement that is part of Adadelta: the learning rate $\eta$ is replaced by a quantity based on the past parameter updates $\Delta\theta$, so that the update has the same (hypothetical) units as the parameter.
The RMS error of this term is also taken:
Thus the Adadelta update is:
Note that Adadelta without the numerator RMS term, that is:
is known as RMSprop; for RMSprop, typical values are $\gamma = 0.9$ and a learning rate $\eta = 0.001$.
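A one-dimensional RMSprop sketch (using the typical decay of 0.9; the learning rate here is larger than typical so this toy example converges quickly):

```python
def grad(x):
    return 2 * x  # gradient of f(x) = x^2

x = 5.0
lr, gamma, eps = 0.1, 0.9, 1e-8
avg_sq = 0.0  # exponentially decaying average of squared gradients

for _ in range(300):
    g = grad(x)
    avg_sq = gamma * avg_sq + (1 - gamma) * g * g
    x -= lr * g / (avg_sq ** 0.5 + eps)

print(x)  # converges toward the minimum at 0
```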
Adam, short for "Adaptive Moment Estimation", is, like Adagrad, Adadelta, and RMSprop, an adaptive learning rate algorithm. Like Adadelta and RMSprop, Adam keeps track of an exponentially decaying average of past squared gradients (here, it is
Note that the moment estimates $m$ and $v$ are initialized to zero and are thus biased towards zero early in training, so bias-corrected estimates $\hat{m}$ and $\hat{v}$ are computed.
Then the Adam update rule is simply:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$
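An Adam sketch in one dimension (beta1 = 0.9 and beta2 = 0.999 are the commonly recommended defaults; the learning rate is illustrative):

```python
def grad(x):
    return 2 * x  # gradient of f(x) = x^2

x = 5.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m, v = 0.0, 0.0  # first and second moment estimates

for t in range(1, 501):
    g = grad(x)
    m = beta1 * m + (1 - beta1) * g        # decaying mean of gradients
    v = beta2 * v + (1 - beta2) * g * g    # decaying mean of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    x -= lr * m_hat / (v_hat ** 0.5 + eps)

print(x)  # converges toward the minimum at 0
```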
Batch normalization is a normalization method for mini-batch training, which can improve training time (it allows for higher learning rates), act as a regularizer, and reduce the importance of proper parameter initialization. It is applied to intermediate representations in the network.
For a mini-batch
Each feature
where
Standardizing intermediate representations in this way can weaken the representational power of the layer, so two additional learnable parameters
When
Given a layer with some activation function
The bias is dropped because its effect is cancelled by the standardization.
During test time, we must use the mean and variance computed over the full training set (e.g. running averages accumulated during training) rather than per-mini-batch statistics.
Refer to Batch Normalized Recurrent Neural Networks (César Laurent, Gabriel Pereyra, Philémon Brakel, Ying Zhang, Yoshua Bengio) for more details.
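A sketch of the batch-norm forward pass for a single feature across a mini-batch (gamma and beta default to the identity transform here; epsilon is a small constant for numerical stability):

```python
def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    # Standardize the feature across the mini-batch...
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    normed = [(x - mean) / (var + eps) ** 0.5 for x in xs]
    # ...then apply the learnable scale (gamma) and shift (beta)
    return [gamma * x + beta for x in normed]

out = batch_norm([1.0, 2.0, 3.0, 4.0])
print(out)  # zero mean, ~unit variance across the batch
```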
We have some cost function, which is a function of our parameters, typically notated
For regression, this is often the mean squared error (MSE), also known as the quadratic cost:
Note that
So for a single example, the cost function is:
For deriving convenience, we'll include a
Deriving with respect to
Note that these are dependent on the derivative of the output layer's activation function,
For binary classification, a common cost function is the cross-entropy cost, also known as "log loss" or "logistic loss":

$$C = -\frac{1}{n} \sum_x [y \ln a + (1 - y) \ln(1 - a)]$$
where
The partial derivatives of the cross-entropy cost with respect to
This has the advantage that for some activation functions
However, as mentioned before, this saturation occurs with only some activation functions (like the sigmoid function). This isn't a problem, for instance, with linear activation functions, in which case quadratic cost is appropriate (though neural nets with linear activation functions are limited in what they can learn).
Thus we have:
The log-likelihood cost function is defined as, for a single training example:

$$C = -\ln a^L_y$$
That is, given an example that belongs to class
This is assuming that the output node's activation function outputs probability-like values (such as is the case with the softmax function).
This cost function's partial derivatives with respect to
For brevity, we've notated
Note that for softmax activation functions, we avoid the saturation problem with this cost function. Thus softmax output activations and the log-likelihood cost functions are a good pairing for problems requiring probability-like outputs (such as with classification problems).
Loss Function | Propagation | Backpropagation |
---|---|---|
Square | | |
Log | | |
Hinge | | |
LogSoftMax | | |
MaxMargin | | |
What are the best values to initialize weights and biases to?
Given normalized data, we could reasonably estimate that roughly half the weights will be negative and roughly half will be positive.
As a result, it may seem intuitive to initialize all weights to zero. But you should not - this causes every neuron to have the same output, which causes them to have the same gradients during backpropagation, which causes them to all have the same parameter updates. Thus none of the neurons will differentiate.
Alternatively, we could set each neuron's initial weights to be a random vector from a standard multidimensional normal distribution (mean of 0, standard deviation of 1), scaled by some value, e.g. 0.001
so that they are kept very small, but still non-zero. This process is known as symmetry breaking. The random initializations allow the neurons to differentiate themselves during training.
However, this can become problematic.
Consider that the net input to a neuron is:

$$z = \sum_i w_i x_i + b$$
The following extends to the general case, but for simplicity, consider an input
Then
The sum of
That is, it is also a normal distribution.
Thus
If for example,
This is most problematic for deep networks, since they may reduce the gradient signal that flows backwards by too much (in a weaker version of the gradient "killing" effect).
As the number of inputs to a neuron grows, so too will its output's variance. This can be controlled for (calibrated) by scaling its weight vector by the square root of its "fan-in" (its number of inputs); that is, you divide the random vector sampled from the standard multidimensional normal distribution by $\sqrt{n}$, where $n$ is the number of inputs.
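A sketch of this fan-in-calibrated initialization (pure Python; the fan-in of 100 is an arbitrary example):

```python
import random

random.seed(1)

def init_weights(fan_in):
    # Sample from a standard normal, then scale by 1/sqrt(fan_in)
    # so the neuron's output variance doesn't grow with its input count
    return [random.gauss(0, 1) / fan_in ** 0.5 for _ in range(fan_in)]

w = init_weights(100)
print(sum(wi * wi for wi in w))  # sum of squared weights is ~1
```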
An alternative to this fan-in scaling for the uncalibrated variances problem is sparse initialization, which is to set all weights to 0, and then break symmetry by randomly connecting every neuron to some fixed number (e.g. 10) of neurons below it by setting those weights to ones randomly sampled from the standard normal distribution like mentioned previously.
Biases are commonly initialized to be zero, though if using ReLUs, then you can set them to a small value like 0.01 so all the ReLUs fire at the start and are included in the gradient backpropagation update.
Elsewhere it is recommended that ReLU weights should be sampled from a zero-mean Gaussian distribution with standard deviation of $\sqrt{2/n}$, where $n$ is the fan-in.
Elsewhere it is recommended that you sample your weights uniformly from
where
Generally you should shuffle your data every training epoch so the network does not become biased towards a particular ordering.
However, there are cases in which your network may benefit from a meaningful ordering of input data; this approach is called curriculum learning.
Adding noise from a Gaussian distribution to each gradient update, i.e. $g_t \leftarrow g_t + N(0, \sigma_t^2)$,
with the variance annealed with the following schedule: $\sigma_t^2 = \frac{\eta}{(1 + t)^\gamma}$,
has been shown to make "networks more robust to poor initialization and helps training particularly deep and complex networks. They suspect that the added noise gives the model more chances to escape and find new local minima, which are more frequent for deeper models." (An overview of gradient descent optimization algorithms, Sebastian Ruder)
Adding noise to input, such as in the accompanying figure, can throw off a classifier. Few strategies are robust against these tricks, but one approach is to generate these adversarial examples and include them as part of the training set.
When you write code to compute the gradient, it can be very difficult to debug. Thus it is often useful to check the gradient by numerically approximating the gradient and comparing it to the computed gradient.
Say our implemented gradient function is
We choose some
Then we can numerically approximate the gradient at some scalar value
When
Where:
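A sketch of a centered-difference gradient check (the function, point, and epsilon are illustrative):

```python
def f(x):
    return x ** 3 - 2 * x  # example function

def analytic_grad(x):
    return 3 * x ** 2 - 2  # its derivative, as we might implement it

def numerical_grad(f, x, eps=1e-5):
    # Centered difference: (f(x + eps) - f(x - eps)) / (2 * eps)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 1.7
print(analytic_grad(x), numerical_grad(f, x))  # should agree very closely
```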
Start training with small, unequal weights to avoid saturating the network with large weights. If all the weights start equal, the network won't learn anything.
The practice of transfer learning involves taking a neural net trained for one task and applying it to a different task. For instance, you can take an image classification net trained for one classification task and reuse it for another by truncating the output layer - that is, taking the vectors from the second-to-last layer and using those as feature vectors for other tasks.
The architecture of a neural network describes how its layers are structured - e.g. how many layers there are, how many neurons in each, and how they are connected.
Neural networks are distinguished by their architecture.
The general structure of a neural network is: input layer -> 0 or more hidden layers -> output layer.
Neural networks always have one input layer, and the size of that input layer is equal to the input dimensions (i.e. one node per feature), though sometimes you may have an additional bias node.
Neural networks always have one output layer, and the size of that output layer depends on what you're doing. For instance, if your neural network will be a regressor (i.e. for a regression problem), then you'd have a single output node (unless you're doing multivariate regression). Same for binary classification. However with softmax (more than just two classes) you have one output node per class label, with each node outputting the probability the input is of the class associated with the node.
If your data is linearly separable, then you don't need any hidden layers (and you probably don't need a neural network either and a linear or generalized linear model may be plenty).
Neural networks with additional hidden layers become difficult to train; networks with multiple hidden layers are the subject of deep learning (detailed below). For many problems, one hidden layer suffices, and you may not see any performance improvement from adding additional hidden layers.
A rule of thumb for deciding the size of the hidden layer is that it should be somewhere between the input size and the output size (for example, the mean of their sizes).
Because neural networks can have so many parameters, it can be quite easy for them to overfit, so it is something to always keep an eye out for - especially with large networks, which have huge numbers of parameters.
As the network grows in number of layers and size, the network capacity increases, which is to say it is capable of representing more complex functions.
Simpler networks have fewer local minima; these are easier to converge to, but they tend to perform worse (they have higher loss). There is a great deal of variance across these local minima, so the outcome is quite sensitive to the random initialization - sometimes you land in a good local minimum, sometimes not. More complex networks have more local minima, but these tend to perform better, and there is less variance in how well they perform.
Higher-capacity networks run a greater risk of overfitting, but this overfitting can be (preferably) mitigated by other methods such as L2 regularization, dropout, and input noise. So don't let overfitting be the sole reason for going with a simpler network if a larger one seems appropriate.
Here are regularization examples for the same data from the previous image, with the neural net for 20 hidden neurons:
As you can see, regularization is effective at counteracting overfitting.
Another simple, but possibly expensive way of reducing overfitting is by increasing the amount of training data - it's unlikely to overfit many, many examples. However, this is seldom a practical option.
Generally, the methods for preventing overfitting include:
Regularization techniques are used to prevent neural networks from overfitting.
L2 regularization is the most common form of regularization. We penalize the squared magnitude of all parameters (weights) as part of the objective function, i.e. we add the term $\frac{\lambda}{2n} \sum_w w^2$, where $\lambda$ is the regularization strength and $n$ is the size of the training set.
L2 regularization is sometimes called weight decay since the added regularization term penalizes large weights, favoring smaller weights.
So a regularized cost function
This affects the partial derivative of the cost function with respect to weights in a simple way (again, biases are not included, so it does not change that partial derivative):
So your update rule would be:
Note that biases are typically not included by convention; regularizing them usually does not have an impact on the network's generalizability.
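The weight-decay effect is visible in the update rule itself (a sketch; the values of lmbda, n, and the gradient are illustrative):

```python
def l2_update(w, grad_w, lr=0.1, lmbda=0.5, n=100):
    # L2: shrink the weight by a factor proportional to its own size,
    # then take the usual gradient step (biases are not decayed)
    return (1 - lr * lmbda / n) * w - lr * grad_w

w = 2.0
w = l2_update(w, grad_w=0.0)  # even with a zero gradient, the weight decays
print(w)  # 2.0 * (1 - 0.1 * 0.5 / 100) = 1.999
```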
Similar to L2 regularization, except that the regularization term added to the objective function is $\frac{\lambda}{n} \sum_w |w|$, the sum of the absolute values of the weights.
The main difference between L1 and L2 regularization is that L1 regularization shrinks weights by a constant amount, whereas L2 regularization shrinks weights by an amount proportional to the weights themselves. This is made clearer by considering the derived update rules from gradient descent.
L1 regularization has the effect of causing weight vectors to become sparse, such that neurons only use a few of their inputs and ignore the rest as "noise". Generally L2 regularization is preferred to L1.
For L1, the partial derivative of the cost function with respect to the weights is: $\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} \operatorname{sgn}(w)$.
This ends up leading to the following update rule: $w \to w - \frac{\eta \lambda}{n} \operatorname{sgn}(w) - \eta \frac{\partial C_0}{\partial w}$.
Note that we say that $\operatorname{sgn}(0) = 0$, so no regularization is applied when $w = 0$.
Compare this with the update rule for L2 regularization: $w \to \left(1 - \frac{\eta \lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w}$.
In L2 regularization, we subtract a term proportional to $w$ itself, whereas in L1 regularization we subtract a constant amount $\frac{\eta \lambda}{n}$ (with the sign chosen to move $w$ toward zero).
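The difference in shrinkage behavior can be sketched in numpy (illustrative values only; the cost-gradient term is omitted to isolate the regularization effect):

```python
import numpy as np

def l1_shrink(w, eta=0.1, lam=0.5):
    # L1: shrink by a constant amount toward zero
    return w - eta * lam * np.sign(w)

def l2_shrink(w, eta=0.1, lam=0.5):
    # L2: shrink by an amount proportional to the weight itself
    return w - eta * lam * w

w = np.array([10.0, 0.1])
# L1 removes the same 0.05 from both the large and the small weight,
# so small weights are driven to (near) zero quickly -> sparsity.
# L2 removes 0.5 from the large weight but only 0.005 from the small one.
```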
This is just the combination of L1 and L2 regularization, such that the term introduced to the objective function is $\lambda_1 \sum_w |w| + \lambda_2 \sum_w w^2$.
This involves setting an absolute upper bound on the magnitude of the weight vectors; that is, after updating the parameters/weights, clamp every weight vector $\vec{w}$ so that it satisfies $\|\vec{w}\|_2 < c$ for some constant $c$ (typical values are on the order of 3 or 4).
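A sketch of the clamping step, assuming each weight vector is stored as a row of a matrix:

```python
import numpy as np

def max_norm_clamp(W, c=3.0):
    """Clamp each weight vector (row) of W to have L2 norm at most c."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    # scale down only the rows whose norm exceeds c
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[3.0, 4.0],   # norm 5 -> scaled down to norm 3
              [0.1, 0.2]])  # norm below c -> left unchanged
W_clamped = max_norm_clamp(W, c=3.0)
```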
Dropout is a regularization method which works well with the others mentioned so far (L1, L2, maxnorm). It does not involve modifying cost functions. Rather, the network itself is modified.
During training, we specify a probability $p$ of keeping a neuron active; at each training step, each neuron is retained with probability $p$ and dropped (temporarily removed, along with its connections) with probability $1 - p$.
This dropout is applied only at training time and applied per-layer (that is, it is applied after each layer, see the code example below). This prevents the network from relying too much on certain neurons.
One way to think about this is that, for each training step, a sub-network is sampled from the full network, and only those parameters are updated. Then on the next step, a different sub-sample is taken and updated, and so on.
To put it another way, dropping out neurons in this way has the effect of training multiple neural networks simultaneously. If we have multiple networks overfit to different training data, they are unlikely to all overfit in the same way. So their average should provide better results.
This has the additional advantage that neurons must learn to operate in the absence of other neurons, which can have the effect of the network learning more robust features. That is, the neurons of the network should be more resilient to the absence of some information.
At test time, all neurons are active (i.e. we don't use dropout at test time). With $p = 0.5$, there will be twice as many hidden neurons active as there were in training, so all weights are halved to compensate.
More generally, we must scale the activations by $p$ at test time, so that the expected output of each neuron matches what it was during training.
This scaling can instead be applied at training time (dividing the activations by $p$), which is more efficient since the test-time code is left untouched - this technique is called inverted dropout.
For comparison, here is an implementation of regular dropout and an implementation of inverted dropout (source from: https://cs231n.github.io/neural-networks-2/)
```python
# Dropout
# (W1, W2, W3, b1, b2, b3 are the network's parameters, defined elsewhere)
import numpy as np

p = 0.5  # probability of keeping a unit active. higher = less dropout

def train_step(X):
    """ X contains the data """
    # forward pass for example 3-layer neural network
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = np.random.rand(*H1.shape) < p  # first dropout mask
    H1 *= U1  # drop!
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = np.random.rand(*H2.shape) < p  # second dropout mask
    H2 *= U2  # drop!
    out = np.dot(W3, H2) + b3

    # backward pass: compute gradients... (not shown)
    # perform parameter update... (not shown)

def predict(X):
    # ensembled forward pass
    H1 = np.maximum(0, np.dot(W1, X) + b1) * p  # NOTE: scale the activations
    H2 = np.maximum(0, np.dot(W2, H1) + b2) * p  # NOTE: scale the activations
    out = np.dot(W3, H2) + b3
```
```python
# Inverted dropout
# (W1, W2, W3, b1, b2, b3 are the network's parameters, defined elsewhere)
import numpy as np

p = 0.5  # probability of keeping a unit active. higher = less dropout

def train_step(X):
    # forward pass for example 3-layer neural network
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p  # first dropout mask. Notice /p!
    H1 *= U1  # drop!
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = (np.random.rand(*H2.shape) < p) / p  # second dropout mask. Notice /p!
    H2 *= U2  # drop!
    out = np.dot(W3, H2) + b3

    # backward pass: compute gradients... (not shown)
    # perform parameter update... (not shown)

def predict(X):
    # ensembled forward pass
    H1 = np.maximum(0, np.dot(W1, X) + b1)  # no scaling necessary
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    out = np.dot(W3, H2) + b3
```
It is most common to use a single, global L2 regularization strength that is cross-validated. It is also common to combine this with dropout applied after all layers. The value of
$p=0.5$ is a reasonable default, but this can be tuned on validation data. https://cs231n.github.io/neural-networks-2/
In addition to regularization, training on more data can help prevent overfitting. This, unfortunately, is typically not a practical option. However, the training set can be artificially expanded by taking existing training data and modifying it in a way we'd expect to see in the real world.
For instance, if we were training a network to recognize handwritten digits, we may take our examples and rotate them slightly, since this could plausibly happen naturally.
A related technique is training on adversarial examples (detailed elsewhere), in which training examples are modified to be deliberately hard for the network to classify, so that it can be trained on more ambiguous/difficult examples.
The most common approach to dealing with overfitting is to apply some kind of regularization.
There are many hyperparameters to set with neural networks, such as:
TODO See: https://cs231n.github.io/neural-networks-3/#anneal
Not only are there many hyperparameters for neural networks; it can also be very difficult to choose good ones.
You could do a naive grid search and just try all possible combinations of hyperparameters, but this is infeasible because the number of combinations blows up.
You could randomly sample combinations as well, but this still has the problem of repeatedly trying hyperparameter values which may have no effect.
Instead, we can apply machine learning to this problem and try and learn what hyperparameters may perform well based on the attempts thus far. In particular, we can try and predict regions in the hyperparameter space that might do well. We'd want to also be able to be explicit about the uncertainty in our prediction.
We can use Gaussian process models to do so. The basic assumption of these models is that similar inputs give similar outputs.
However, what does "similar" mean? Is 200 hidden units "similar" to 300 hidden units or not? Fortunately, such models can also learn this scale of similarity for each hyperparameter.
These models predict a Gaussian distribution of values for each hyperparameter (hence the name).
A method for applying this:
So we might try a new combination, and it might not do that well, but we won't have replaced our current best.
This method for selecting hyperparameters is called Bayesian (hyperparameter) optimization, and is a better approach than by picking hyperparameters by hand (less prone to human error).
A big challenge in designing a neural network is calibrating its hyperparameters. From the start, it may be difficult to intuit what hyperparameters need tuning. There are so many to choose from: network architecture, number of epochs, cost function, weight initialization, learning rate, etc.
There are a few heuristics which may help.
When the learning rate $\eta$ is too high, the error tends to oscillate or even increase over training rather than steadily decrease.
Learning rates which are too low tend to show a slow decrease in error over training. You can try higher learning rates if this seems to be the case.
The learning rate does not need to be fixed. When starting out training, you may want a high learning rate to quickly get close to a minimum. But once you get closer, you may want to decrease the learning rate to carefully identify the best minimum.
The specification of how the learning rate decreases is called the learning rate schedule.
Some places recommend using a decaying learning rate, e.g. of the form $\eta_t = \frac{\eta_0}{1 + t/\tau}$, where $\eta_0$ is the initial learning rate, $t$ is the current iteration (or epoch), and $\tau$ controls how quickly the rate decays.
For the number of epochs, we can use a strategy called "early stopping", where we stop once some performance metric (e.g. classification accuracy on validation data) appears to stop improving. More precisely, "stop improving" can mean the performance metric doesn't improve for some $n$ consecutive epochs (this $n$ is often called the "patience").
However, neural networks sometimes plateau for a little bit and then keep on improving. In which case, adopting an early stopping strategy can be harmful. You can be somewhat conservative and set $n$ to a larger value to avoid stopping during a temporary plateau.
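The patience logic can be sketched as follows (`early_stopping` is a hypothetical helper, not a library function; it operates on a list of per-epoch validation accuracies):

```python
def early_stopping(accuracies, patience=10):
    """Return the epoch at which to stop: when the best validation
    accuracy has not improved for `patience` consecutive epochs.

    Returns the index of the epoch where stopping triggers,
    or None if it never triggers.
    """
    best = float("-inf")
    since_best = 0
    for epoch, acc in enumerate(accuracies):
        if acc > best:
            best = acc
            since_best = 0
        else:
            since_best += 1
        if since_best >= patience:
            return epoch
    return None

# a plateau followed by improvement: a larger patience survives it
accs = [0.70, 0.80, 0.80, 0.80, 0.85]
```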
A deep neural network is simply a neural network with more than one hidden layer. Deep learning is the field related to deep neural networks. These deep networks can perform much better than shallow networks (networks with just one hidden layer) because they can embody a complex hierarchy of concepts.
Many problems can be broken down into subproblems, each of which can be addressed by a separate neural network.
Say for example we want to know whether or not a face is in an image. We could break that down (decompose it) into subproblems like:
We could train a neural network on each of these subproblems. We could even break these subproblems further (e.g. "Is there an eyelash?", "Is there an iris?", etc) and train neural networks for those, and so on.
Then if we want to identify a face, we can aggregate these networks into a larger network.
This kind of multi-layered neural net is a deep neural network.
Multilayer neural networks must have nonlinear activation functions; otherwise, the network is equivalent to a single-layer network whose weights are the product of each layer's weights.
That is, a two-layer linear network with weight matrices $W_1$ and $W_2$ computes $W_2 (W_1 x) = (W_2 W_1) x$, which a single layer with weights $W = W_2 W_1$ computes just as well.
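This collapse is easy to verify numerically (a numpy sketch with arbitrary random weights):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # first layer weights
W2 = rng.normal(size=(2, 4))   # second layer weights
x = rng.normal(size=3)

# without a nonlinearity, two layers...
two_layer = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x      # ...collapse into a single weight matrix

# with a nonlinearity (here ReLU) in between, they do not collapse
nonlinear = W2 @ np.maximum(0, W1 @ x)
```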
Training deep neural networks (that is, neural networks with more than one hidden layer) is not as straightforward as it is with a single hidden layer - a simple stochastic gradient descent + backpropagation approach is not as effective or quick.
This is because of unstable gradients. This has two ways of showing up: vanishing gradients (gradients in earlier layers shrink toward zero, so those layers learn very slowly) and exploding gradients (gradients in earlier layers grow out of control).
These unstable gradients occur because gradients in earlier layers are the products of terms from the later layers (refer to backpropagation for details, but remember that the chain rule multiplies these terms together - many factors smaller than one shrink the gradient exponentially, and many factors larger than one grow it exponentially).
Certain neural networks, such as RNNs, can have unstable gradients, in which gradients may grow exponentially (an exploding gradient) or shrink exponentially until it reaches zero (a vanishing gradient).
With exploding gradients, the minimum is not found because, with such a large gradient, the steps don't effectively search the space.
With vanishing gradients, the minimum is not found because a gradient of zero means the space isn't searched at all.
Unstable gradients can occur as a result of drastic changes in the cost surface, as illustrated in the accompanying figure (from Pascanu et al via http://peterroelants.github.io/posts/rnn_implementation_part01/).
In the figure, the large jump in cost leads to a large gradient which causes the optimizer to make an exaggerated step.
There are methods for dealing with unstable gradients, including:
Normally, weights are updated by the size of the gradient (typically scaled by some learning rate). However, as demonstrated above, this can lead to an unstable gradient.
Resilient backpropagation (Rprop) ignores the size of the gradient and only considers its sign, and then uses two hyperparameters, $\eta^+$ and $\eta^-$, to adapt a separate step size $\Delta_{ij}$ for each weight.
If the sign of the gradient is unchanged from the previous iteration, the step size $\Delta_{ij}$ is increased multiplicatively by $\eta^+$; if the sign of the gradient changes in an iteration, the step size is instead decreased multiplicatively by $\eta^-$.
If the gradient's sign changes, this usually indicates that we have jumped past a local minimum.
Then the weight is updated by this computed step size in the opposite direction of its gradient: $w_{ij} \to w_{ij} - \operatorname{sgn}\left(\frac{\partial C}{\partial w_{ij}}\right) \Delta_{ij}$.
Typically, $\eta^+ = 1.2$ and $\eta^- = 0.5$.
This is essentially a separate adaptive learning rate for each weight, but ignoring the size of the gradient and only looking at its sign. That is, we increase step sizes multiplicatively by $\eta^+$ while the sign is stable and decrease them multiplicatively by $\eta^-$ when it flips.
Rprop is meant for full batch learning or for very large mini-batches. To use this technique with mini-batches, see Rmsprop.
Rmsprop is the mini-batch version of Rprop. It computes a moving average of the squared gradient for each weight: $E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2$ (with e.g. $\gamma = 0.9$).
Then it normalizes the gradient by dividing by the square root of this moving average: $w \to w - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t$.
Rmsprop can be used with momentum as well (i.e. update the velocity with this modified gradient).
The basic idea behind rmsprop is to adjust the learning rate per-parameter according to the (smoothed) sum of the previous gradients. Intuitively this means that frequently occurring features get a smaller learning rate (because the sum of their gradients is larger), and rare features get a larger learning rate. http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/
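The rmsprop update can be sketched as follows (a minimal numpy example; the hyperparameter values are common defaults, not prescriptions):

```python
import numpy as np

def rmsprop_update(w, grad, ms, eta=0.01, gamma=0.9, eps=1e-8):
    """One rmsprop step. ms is the running mean of squared gradients."""
    ms = gamma * ms + (1 - gamma) * grad ** 2   # update moving average
    w = w - eta * grad / np.sqrt(ms + eps)      # normalized gradient step
    return w, ms

w = np.array([1.0, 1.0])
ms = np.zeros(2)
# a large and a small gradient produce steps of roughly the same size,
# since each is normalized by its own running magnitude
w, ms = rmsprop_update(w, np.array([100.0, 0.001]), ms)
```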
In a regular neural network, the relationship between a pixel and one that is next to it is the same as its relationship with a pixel far away - the structural information of the image is totally lost. Convolutional nets are capable of encoding this structural information about the image; as a result, they are especially effective with image-based tasks.
Convolutional nets are based on three ideas:
A regular neural network is fully-connected in that every node from a layer is connected to every node in the following layer.
This is not the case with convolutional nets.
Typically we think of a layer as a line of neurons. With convolutional nets, it is more useful to think of the neurons arranged in a grid.
(Note: the following images are from http://neuralnetworksanddeeplearning.com/chap6.html TODO replace the graphics)
We do not fully connect this input layer to the hidden layer (which we'll call a convolutional layer). Rather, we connect regions of neurons to neurons in the hidden layer. These regions are local receptive fields, local to the neuron at their center (they may more simply be called windows).
We can move the local receptive field across the input one neuron at a time, or in greater steps. The size of this step is called the stride length.
These windows end up learning to detect salient features, but are less sensitive to where exactly they occur. For instance, for recognizing a human face, it may be important that we see an eye in one region, but it doesn't have to be in a particular exact position. A filter (also called a kernel) function is applied to each window to transform it into another vector (which is then passed to a pooling layer, see below).
One architectural decision with CNNs is the use of wide convolution or narrow convolution. When you reach the edges of your input (say, the edges of an image), do you stop there or do you pad the input with zeros (or some other value) so we can fit another window? Padding the input is wide convolution, not padding is narrow convolution. Note that, as depicted above, narrow convolution will yield a smaller feature map of size: input shape - filter shape + 1.
Note that this hyperparameter is sometimes called "border mode". A border mode of "valid" is equivalent to a narrow convolution.
There are a few different ways of handling padding for a wide convolution. Border modes of "half" (also called "same") and "full" correspond to different padding strategies.
Say we have a filter of size (width) $m$, with $m$ odd. A border mode of "half" (or "same") pads the input with a symmetric border of $\lfloor m/2 \rfloor$ zeros, so that the resulting feature map is the same size as the input.
There is also a "full" border mode which pads the input with a symmetric border of $m - 1$ zeros, yielding a feature map of size: input shape + filter shape - 1.
For example, say we have the following image:
xxx
xxx
xxx
Say we have a 3x3 filter.
For a border mode of "half"/"same", the padded image would look like (padding is indicated with o):
ooooo
oxxxo
oxxxo
oxxxo
ooooo
For a border mode of "full", the padded image would instead be:
ooooooo
ooooooo
ooxxxoo
ooxxxoo
ooxxxoo
ooooooo
ooooooo
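These padding sizes can be checked with numpy's `pad` (a sketch; `np.pad` pads with zeros by default, matching the diagrams above):

```python
import numpy as np

img = np.ones((3, 3))   # the 3x3 "image" above
m = 3                   # filter width

half = np.pad(img, m // 2)   # "half"/"same": border of floor(m/2) -> 5x5
full = np.pad(img, m - 1)    # "full": border of m - 1 -> 7x7
```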
Another change here is that the hidden layer has one set of weights and a bias that is shared across the entire layer (these weights and biases are accordingly referred to as shared weights and the shared bias, and together, they define a filter or a kernel).
As a result, if we have receptive fields of size $5 \times 5$, the output of the $j,k$-th hidden neuron is: $\sigma\left(b + \sum_{l=0}^{4} \sum_{m=0}^{4} w_{l,m} a_{j+l,\, k+m}\right)$, where $\sigma$ is the activation function, $b$ is the shared bias, $w_{l,m}$ are the shared weights, and $a_{x,y}$ is the input activation at position $x, y$.
Another way of writing the above is: $a' = \sigma(b + w * a)$, where $a'$ is the output feature map, $a$ is the input, and $*$ denotes the convolution operation.
The consequence of this sharing of weights and biases is that this layer detects the same feature across different receptive fields. For example, this layer could detect vertical edges anywhere in the image. If an edge shows up in the upper-right part of the image, the hidden neuron for that receptive field will fire. If an edge shows up in the lower-left part of the image, the hidden neuron for that receptive field will also fire, due to the fact that they all share weights and a bias.
As a result of this property, this mapping between layers is often called a feature map. Technically, the kernel/filter outputs the feature map, which is to say they are not the same thing, but in practice the terms "kernel" and "filter" are often used interchangeably with "feature map".
For example, say we have a 3x3 filter:
Each position in the filter is a weight to be learned; here we have initialized them to 0 (not necessarily the best choice, but this is just an example).
Each position in the filter lines up with a pixel as the filter slides across the image (as depicted above).
Let's say that the weights learned by the filter end up being the following:
Then say we place the filter over the following patch of pixels:
We want to combine (i.e. mix) the filter and the pixel values to produce a single pixel value (which will be a single pixel in the resulting feature map). We do so by convolving them as the sum of the element-wise product:
Pixel-by-pixel the feature map is produced in this way.
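This sliding-window computation can be sketched in numpy (note that, as is conventional in CNN implementations, the kernel is not flipped, so strictly speaking this is cross-correlation):

```python
import numpy as np

def convolve2d(image, kernel):
    """Narrow ("valid") convolution: slide the kernel over the image and
    take the sum of the element-wise product at each position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1   # output size: input - filter + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(16.0).reshape(4, 4)
kernel = np.zeros((3, 3))
kernel[1, 1] = 1.0                       # a filter that picks out the center pixel
feature_map = convolve2d(image, kernel)  # shape (2, 2)
```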
We may include multiple feature maps/filters, i.e. have the input connect to many hidden layers of this kind (this is typically how it's done in practice). Each layer would learn to detect a different feature.
Another benefit to sharing weights and biases across the layer is that it introduces some resilience to overfitting - the sharing of weights means that the layer cannot favor peculiarities in particular parts of the training data; it must take the whole example into account. As a result, regularization methods are seldom necessary for these layers.
In addition to convolutional layers there are also pooling layers, which often accompany convolutional layers (often one per convolutional layer) and follow after them. Pooling layers produce a condensed version of the feature map they are given (for this reason, this process is also known as subsampling, and pooling layers are sometimes called subsampling layers). For example, a $2 \times 2$ pooling neuron condenses a $2 \times 2$ region of the feature map into a single value, shrinking the feature map by a factor of two in each dimension.
There are a few different strategies for how this compression works. A common one is max-pooling, in which the pooling neuron just outputs the maximum value of its inputs. In some sense, max-pooling asks its region: was your feature present? And activates if it was. It isn't concerned with where in that region the feature was, since in practice, its precise location doesn't matter so much as its relative positioning to other features (especially with images).
Another pooling technique is L2 pooling. Say a pooling neuron has an $n \times n$ region of inputs; with L2 pooling, it outputs the square root of the sum of the squares of those inputs.
Another pooling technique is average-pooling in which the average value of the input is output.
There is also the
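A max-pooling pass over a feature map can be sketched in numpy (a minimal example; the $2 \times 2$ region size is just a common choice):

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Max-pooling: condense each size x size region to its maximum."""
    h, w = feature_map.shape
    out = feature_map[:h - h % size, :w - w % size]   # trim to a multiple of size
    out = out.reshape(h // size, size, w // size, size)
    return out.max(axis=(1, 3))   # max within each block

fm = np.array([[1.0, 2.0, 0.0, 0.0],
               [3.0, 4.0, 0.0, 1.0],
               [0.0, 0.0, 5.0, 6.0],
               [0.0, 2.0, 7.0, 8.0]])
pooled = max_pool(fm)   # each 2x2 block reduced to its max
```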
Generally, we have many feature maps (convolutional layers) and pooling layer pairs grouped together; conceptually it is often easier to think of these groups themselves as layers (called "convolutional-pooling layers").
The output layer is fully-connected (i.e. every neuron from the final convolutional-pooling layer is connected to every neuron in the output layer).
Often it helps to include another (or more) fully-connected layer just prior to the output layer. This can be thought of as aggregating and considering all the features coming from the convolutional-pooling layer.
It is also possible to insert additional convolutional-pooling layers (this practice is called hierarchical pooling). Conceptually, these take the features output by the previous convolutional-pooling layer and extract higher-level features. The way these convolutional-pooling layers connect to each other is a little different. Each of this new layer's input neurons (that is, the neurons in its first set of convolutional layers) takes as its input all of the outputs (within its local receptive field) from the preceding convolutional-pooling layer.
For example, if the preceding convolutional-pooling layer has 20 feature maps in it, and we have receptive fields of size $5 \times 5$, then each neuron in the new convolutional layer takes $20 \times 5 \times 5 = 500$ inputs.
Backpropagation is slightly different for a convolutional net because the typical backpropagation assumes fully-connected layers.
TODO add this
CNNs learn a convolution kernel and (for images) apply it to every pixel across the image:
A recurrent neural network is a feedback neural network, that is, it is a neural net where the outputs of neurons are fed back into their inputs. They have properties which give them advantages over feed-forward NNs for certain problems. In particular, RNNs are well-suited for handling sequences as input.
With machine learning, data is typically represented in vector form. This works for certain kinds of data, such as numerical data, but not necessarily for other kinds of data, like text. We usually end up coercing text into some vector representation (e.g. TF-IDF) and end up losing much of its structure (such as the order of words). This is ok for some tasks (such as topic detection), but for many others we are throwing out important information. We could use bigrams or trigrams or so on to preserve some structure but this becomes unmanageably large (we end up with very high-dimension vectors).
Recurrent neural networks are able to take sequences as input, i.e. iterate over a sequence, instead of fixed-size vectors, and as such can preserve the sequential structure of things like text and have a stronger concept of "context".
Basically, an RNN takes in each item in the sequence and updates a hidden representation (its state) based on that item and the hidden representation from the previous time step. If there is no previous hidden representation (i.e. we are looking at the first item in the sequence), we can initialize it as either all zeros or treat the initial hidden representation as another parameter to be learned.
Another way of putting this is that the core difference of an RNN from a regular feedforward network is that the output of a neuron is a function of its inputs and of its past state, e.g. $s_t = \tanh(U x_t + W s_{t-1})$, where $s_t$ is the hidden state at time step $t$, $x_t$ is the input at that step, and $U$ and $W$ are weight matrices.
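A single step of such a network can be sketched as follows (a minimal numpy example with arbitrary small random weights):

```python
import numpy as np

def rnn_step(x_t, s_prev, U, W):
    """One step of a vanilla RNN: the new state is a function of the
    current input and the previous state."""
    return np.tanh(U @ x_t + W @ s_prev)

rng = np.random.default_rng(0)
U = rng.normal(size=(4, 3)) * 0.1   # input-to-hidden weights
W = rng.normal(size=(4, 4)) * 0.1   # hidden-to-hidden weights

s = np.zeros(4)   # initial state: all zeros
for x_t in [np.ones(3), np.zeros(3), np.ones(3)]:   # a toy sequence
    s = rnn_step(x_t, s, U, W)
```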
In the most basic RNN, the hidden layer has two inputs: the input from the previous layer, and the layer's own output from the previous time step (so it loops back onto itself):
This simple network can be visualized over time as well:
Say we have a hidden layer $h$; at each time step it receives the current input as well as its own output from the previous time step.
This simple feedback mechanism offers a kind of short-term memory - the network "remembers" the output from the previous time step.
It also allows for variable-sized inputs and outputs - the inputs can be fed in one at a time and combined by this feedback mechanism.
The input item can be represented with one-hot encoding, i.e. each term maps to a vector of all zeroes and a single 1. For example, with a three-term vocabulary, the first term would be represented as $[1, 0, 0]$, the second as $[0, 1, 0]$, and the third as $[0, 0, 1]$.
Another way to represent these terms is with an embedding matrix, in which each term is mapped to some index of the matrix which points to some $d$-dimensional vector (the term's embedding); these embedding values are learned like the network's other parameters.
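Both representations can be sketched briefly (the vocabulary and embedding size here are illustrative):

```python
import numpy as np

vocab = ["the", "cat", "sat"]   # illustrative 3-term vocabulary
one_hot = np.eye(len(vocab))    # row i is the one-hot vector for term i

d = 4                           # illustrative embedding dimension
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), d))   # learned during training

cat_one_hot = one_hot[vocab.index("cat")]        # [0., 1., 0.]
cat_embedding = embeddings[vocab.index("cat")]   # a dense d-dimensional vector
```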
Convolutional neural networks, and feed-forward neural networks in general, treat an input the same no matter when they are given it. For RNNs, the hidden representation is like (short-term) "memory" for the network, so context is taken into account for inputs; that is, an input will be treated differently depending on what the previous input(s) was/were.
(Note that RNNs train very slowly on CPUs; they train significantly faster on GPUs.)
RNNs are trained using a variant of backpropagation called backpropagation through time (BPTT), which just involves unfolding the RNN a certain number of time steps, which results in what is essentially a regular feedforward network, and then applying backpropagation.
The gradient of the cost function with respect to the weights is computed by summing the weight gradients at each unrolled time step: $\frac{\partial C}{\partial W} = \sum_t \frac{\partial C_t}{\partial W}$.
This summing of the weight gradients at each time step is the main difference from regular feedforward networks; aside from that, BPTT is basically just backpropagation on an RNN unrolled up to some time step $T$.
However, if working with long sequences, this is effectively like training a deep network with many hidden layers (since an unrolled RNN is equivalent to one), which can be difficult (due to vanishing or exploding gradients). In practice, it's common to truncate the backpropagation by running it only a few time steps back.
The vanishing gradient problem in RNNs means long-term dependencies won't be learned - the effect of earlier steps "vanish" over time steps (this is the same problem of vanishing gradients in deep feedforward networks, given that an RNN is basically a deep neural net).
Exploding gradients are more easily dealt with - it's obvious when they occur (you'll see NaNs, for instance), and you can clip gradients at some maximum value, which can be quite effective (refer to this paper).
Some strategies for dealing with vanishing gradients:
Generally, however, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures, which were designed to mitigate vanishing gradients (for the purpose of better learning long-range dependencies), are used instead of vanilla RNNs.
This short-term memory of (vanilla) RNNs may be too short. RNNs may incorporate long short-term memory (LSTM) units instead, which just compute hidden states in a different way.
With an LSTM unit, we have memory stored and passed through a more involved series of steps. This memory is modified in each step, with something being added and something being removed at each step. The result is a neural network that can handle longer-term context.
These LSTM units have three gates (in contrast to the single activation function vanilla RNNs have): an input gate, a forget gate, and an output gate.
TODO include Chris Olah's LSTM diagrams: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
These gates are sigmoid functions combined with a pointwise multiplication operation. They are called gates because they tune how much of their input is passed on (i.e. sigmoids give a value in $[0, 1]$, where 0 blocks everything and 1 lets everything through).
The input gate determines how much of the input is let through, the forget gate determines how much of the previous state is let through. We compute a new "memory" (i.e. the LSTM unit's internal state) from the outputs of these gates. The output gate determines how much of this new memory to output as the hidden state.
In more detail:
The forget gate controls what is removed ("forgotten") from the cell state. The input to the forget gate is the concatenation of the cell's output from the previous step, $h_{t-1}$, with the current input, $x_t$.
The output of a forget gate is $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$, a vector of values in $[0, 1]$ (one per entry of the cell state).
Then our intermediate value of the cell state is $f_t \odot C_{t-1}$, where $C_{t-1}$ is the previous cell state and $\odot$ is pointwise multiplication - entries of the forget gate near 0 erase the corresponding entries of the cell state.
The input gate controls what information gets stored in the cell state. This gate also takes as input the concatenation of $h_{t-1}$ and $x_t$, and outputs $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$.
A candidate value vector for the cell state is computed as $\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$.
We pointwise multiply this candidate value vector with the input gate's output vector to get the vector that is passed to the cell state. This resulting vector is pointwise added to the updated cell state.
Thus our final updated value of the cell state is $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$.
We don't output this cell state $C_t$ directly; the output gate determines how much of it is output as the hidden state.
So the output of the output gate is just: $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$.
To get the final output of the cell, we pass the cell state $C_t$ through $\tanh$ and pointwise multiply it with the output gate's output: $h_t = o_t \odot \tanh(C_t)$.
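The gate computations can be collected into a single step function (a minimal numpy sketch with small random weights; biases initialized to zero):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    """One LSTM step following the standard gate equations."""
    z = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    f = sigmoid(Wf @ z + bf)            # forget gate
    i = sigmoid(Wi @ z + bi)            # input gate
    C_tilde = np.tanh(Wc @ z + bc)      # candidate cell state
    C = f * C_prev + i * C_tilde        # new cell state
    o = sigmoid(Wo @ z + bo)            # output gate
    h = o * np.tanh(C)                  # new hidden state / output
    return h, C

n_in, n_hid = 3, 4
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(n_hid, n_hid + n_in)) * 0.1 for _ in range(4)]
bs = [np.zeros(n_hid) for _ in range(4)]
h, C = lstm_step(np.ones(n_in), np.zeros(n_hid), np.zeros(n_hid), *Ws, *bs)
```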
An RNN can be thought of as an LSTM in which all input and output gates are 1 and all forget gates are 0, with an additional activation function (e.g. $\tanh$) applied.
There are many variations of LSTMs (see this paper for empirical comparisons between some of them), the most common of which is the Gated Recurrent Unit (GRU).
A gated recurrent unit (GRU) is a simpler LSTM unit; it includes only two gates (also both sigmoid functions) - the reset gate $r_t$ and the update gate $z_t$.
This LSTM variant has no separate cell state; it just passes on the previous hidden state, $h_{t-1}$, modifying it directly from step to step.
In this LSTM variant, the forget and input gates are combined into a single update gate. The value $z_t$ of this gate interpolates between the previous hidden state and a candidate state: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$, where the candidate is $\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t])$ (the reset gate $r_t$ controls how much of the previous state feeds into the candidate).
Essentially, we just add enough new information to replace what was forgotten.
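A GRU step can be sketched similarly (a minimal numpy example; biases omitted for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    """One GRU step (bias terms omitted for brevity)."""
    z_in = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ z_in)   # update gate
    r = sigmoid(Wr @ z_in)   # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))   # candidate
    return (1 - z) * h_prev + z * h_tilde   # interpolate old and new

n_in, n_hid = 3, 4
rng = np.random.default_rng(0)
Wz = rng.normal(size=(n_hid, n_hid + n_in)) * 0.1
Wr = rng.normal(size=(n_hid, n_hid + n_in)) * 0.1
Wh = rng.normal(size=(n_hid, n_hid + n_in)) * 0.1
h = gru_step(np.ones(n_in), np.zeros(n_hid), Wz, Wr, Wh)
```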
Bidirectional RNNs (BI-RNNs) are a variation on RNNs in which the RNN can not only look into the past, but can also look into the "future". The BI-RNN has two states, a forward state $\vec{s}_t$ (computed from the inputs read left-to-right, i.e. the past) and a backward state $\overleftarrow{s}_t$ (computed from the inputs read right-to-left, i.e. the future).
The output at position $t$ is then computed from both states, e.g. by concatenating them.
In people, "attention" is a mechanism by which we focus on one particular element of our environment, such that our perception of the focused element is in high-fidelity/resolution, whereas surrounding elements are at a lower resolution.
Attention mechanisms in recurrent neural networks emulate this behavior. This amounts to a weighted sum across input states (typically weights are normalized to sum to 1); higher weights indicate more "focus" or attention.
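The weighted sum can be sketched as follows (the states and scores here are illustrative; in a real model the scores themselves are produced by a small learned network):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())   # subtract max for numerical stability
    return e / e.sum()

def attend(states, scores):
    """Weighted sum over input states; weights normalized to sum to 1."""
    weights = softmax(scores)   # higher score -> more "focus"
    return weights @ states, weights

states = np.array([[1.0, 0.0],    # state for input position 0
                   [0.0, 1.0],    # position 1
                   [1.0, 1.0]])   # position 2
scores = np.array([0.1, 5.0, 0.1])   # strongly attend to position 1
context, weights = attend(states, scores)
```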
For instance, consider neural machine translation models. Their basic form consists of two RNNs, one which takes an input sentence (the encoder) and one which produces the translated output sentence (the decoder). The encoder takes the input sentence, produces a sentence embedding (i.e. a single vector meant to encapsulate the sentence's meaning), then the decoder takes that embedding and outputs the translated sentence.
Representing a sentence as a single embedding is challenging, especially since earlier parts of the sentence may be forgotten. There are some architectures such as the bidirectional variant that help with this, but attention mechanisms can help so that the decoder has access to the full spread of inputs and can "focus" more on translating individual parts when appropriate.
This means that instead of taking a single sentence embedding, each output word is produced through this weighted combination of all input states.
Note that these attention weights are stored for each step, since at each step the model distributes its attention differently. This can add up quickly.
Attention mechanisms can be thought of as an addressing system for selecting locations in memory (e.g. an array) in a weighted fashion.
The most basic one is probably the autoencoder, which is a feed-forward neural net which tries to predict its own input. While this isn’t exactly the world’s hardest prediction task, one makes it hard by somehow constraining the network. Often, this is done by introducing a bottleneck, where one or more of the hidden layers has much lower dimensionality than the inputs. Alternatively, one can constrain the hidden layer activations to be sparse (i.e. each unit activates only rarely), or feed the network corrupted versions of its inputs and make it reconstruct the clean ones (this is known as a denoising autoencoder). [https://www.metacademy.org/roadmaps/rgrosse/deep_learning]
An autoencoder is a feedforward neural network used for unsupervised learning. Autoencoders extract meaningful features by trying to output a reproduction of their input. That is, the output layer is the same size as the input layer, and the network tries to reconstruct its input at the output layer.
Generally the output of an autoencoder is notated $\hat{x}$, a reconstruction of the input $x$.
The first half (i.e. from the input layer up to the hidden layer) of the autoencoder architecture is called the encoder, and the latter half (i.e. from the hidden layer to the output layer) is called the decoder.
Often the weights of the decoder, $W^*$, are "tied" to the weights of the encoder, $W$; that is, the decoder simply uses the transpose of the encoder's weights: $W^* = W^T$.
Essentially what happens is the hidden layer learns a compressed representation of the input (given that it is a smaller size than the input/output layers, this is called an undercomplete hidden layer, the learned representation is called an undercomplete representation), since it needs to be reconstructed by the decoder back to its original form. That is, the network needs to find some way of representing the input with less information. In some sense, we do this already with language, where we may represent a photo with a word (or a thousand words with a photo).
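A forward pass of an undercomplete autoencoder with tied weights can be sketched as follows (sizes and weights are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_forward(x, W, b_enc, b_dec):
    """Forward pass with tied weights: the decoder uses W transposed."""
    h = sigmoid(W @ x + b_enc)        # encoder: compressed representation
    x_hat = sigmoid(W.T @ h + b_dec)  # decoder: reconstruction of x
    return h, x_hat

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 6)) * 0.1   # undercomplete: 6 inputs -> 2 hidden
x = rng.uniform(size=6)
h, x_hat = autoencoder_forward(x, W, np.zeros(2), np.zeros(6))
```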
Undercomplete hidden layers do a good job compressing data similar to their training set, but a poor job with other inputs.
On the other hand, the hidden layer may be larger than the input/output layers, in which case it is called an overcomplete hidden layer and the learned representation of the input is an overcomplete representation. There's no compression as a result, and there's no guarantee that anything meaningful will be learned (since the network can essentially just copy the input).
However, overcomplete representation as a concept is appealing because if we are using this autoencoder to learn features for us, we may want to learn many features. So how can we learn useful overcomplete representations?
Using a hidden layer size smaller than your input is tricky - encoding a lot of information into fewer bits is quite challenging.
Rather counterintuitively, a larger hidden layer helps, where some hidden units are randomly turned off during a training iteration - that way, the output isn't a mere copy of the input, and learning is easier since there is more "room" to represent the input. Such an autoencoder is called a sparse autoencoder.
In effect, what an autoencoder is learning is some higher-level representation of its input. In the case of an image, it may go from pixels to edges.
We can stack these sparse autoencoders on top of each other, so that higher and higher-level representations are learned. The sparse autoencoder that goes from pixels to edges can go into another one that learns how to go from edges to shapes, for example.
A denoising autoencoder is a way of learning useful overcomplete representations. The general idea is that we want the encoder to be robust to noise (that is, to be able to reconstruct the original input even in the presence of noise). So instead of inputting the original input, we input a corrupted (noisy) version of it and train the network to reconstruct the clean original.
There are many ways this noise can be added; two popular approaches are to randomly set some of the input elements to zero (masking noise) or to add Gaussian noise to the input.
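For instance (a quick numpy sketch; function names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_noise(x, p=0.3):
    """Masking noise: set each element of x to 0 with probability p."""
    return x * (rng.random(x.shape) >= p)

def gaussian_noise(x, sigma=0.1):
    """Additive Gaussian noise with standard deviation sigma."""
    return x + rng.normal(0.0, sigma, x.shape)

x = np.ones(8)
x_masked = mask_noise(x)      # some elements zeroed out
x_noisy = gaussian_noise(x)   # all elements jittered slightly
```

The autoencoder is then trained to map the corrupted input back to the clean `x`.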
Say our neural network is
For binary inputs, we can use cross-entropy (more precisely, the sum of Bernoulli cross-entropies, where $\hat{x}$ is the reconstruction of input $x$):

$$-\sum_k \left( x_k \log(\hat{x}_k) + (1 - x_k) \log(1 - \hat{x}_k) \right)$$
For real-valued inputs, we can use the sum of squared differences (i.e. the squared euclidean distance):

$$\frac{1}{2} \sum_k (\hat{x}_k - x_k)^2$$
And we use a linear activation function at the output.
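As a sketch, these two reconstruction losses in numpy (with `x` the input and `x_hat` the reconstruction; names are mine):

```python
import numpy as np

def bernoulli_cross_entropy(x, x_hat, eps=1e-12):
    """Sum of Bernoulli cross-entropies; for binary inputs, x_hat in (0, 1)."""
    x_hat = np.clip(x_hat, eps, 1 - eps)  # avoid log(0)
    return -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

def squared_error(x, x_hat):
    """Squared euclidean distance; for real-valued inputs (linear output units)."""
    return 0.5 * np.sum((x - x_hat) ** 2)
```

Better reconstructions give lower loss under either measure.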
Note that if you are using tied weights (the decoder's weight matrix is the transpose of the encoder's), the gradient for the shared weight matrix is the sum of the gradient contributions from its encoder role and its decoder role.
A contractive autoencoder is another way of learning useful overcomplete representations. We do so by adding an explicit term in the loss that penalizes uninteresting solutions (i.e. that penalizes just copying the input).
Thus we have a new loss function, extended from an existing loss function:

$$\mathcal{L} = \ell(f(x)) + \lambda \|J(x)\|_F^2$$

where $\ell(f(x))$ is the original reconstruction loss, $\lambda$ is a hyperparameter controlling the strength of the penalty, and $J(x) = \frac{\partial h(x)}{\partial x}$ is the Jacobian of the hidden layer $h(x)$ with respect to the input, whose squared Frobenius norm is

$$\|J(x)\|_F^2 = \sum_{j,k} \left( \frac{\partial h(x)_j}{\partial x_k} \right)^2$$
Intuitively, the term we're adding to the loss (the squared Frobenius norm of the Jacobian) increases the loss if we have non-zero partial derivatives of the hidden layer with respect to the input - that is, it pushes the encoder to be insensitive to changes in the input (which, taken alone, would mean throwing all information away).
We balance this out with the original loss function which, as usual, encourages the encoder to keep good information (information that is useful for reconstructing the original input).
By combining these two conflicting priorities, the result is that the encoder keeps only the good information (the latter term encourages it to throw all information away, the former term encourages it to keep only the good stuff). The $\lambda$ hyperparameter balances the tradeoff between the two.
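As a concrete sketch: for a sigmoid encoder $h = \mathrm{sigmoid}(Wx + b)$, the Jacobian has entries $h_j(1 - h_j) W_{jk}$, so the penalty can be computed cheaply without forming the full Jacobian (names are mine):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def contractive_penalty(W, b, x):
    """Squared Frobenius norm of the encoder Jacobian, for h = sigmoid(W x + b).

    For a sigmoid layer, dh_j/dx_k = h_j (1 - h_j) W_jk, so the norm
    factors as sum_j (h_j (1 - h_j))^2 * sum_k W_jk^2.
    """
    h = sigmoid(W @ x + b)
    return np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))
```

This penalty term would then be added (scaled by $\lambda$) to the reconstruction loss.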
Both denoising and contractive autoencoders perform well, and each has its own advantages.
Denoising autoencoders are simpler to implement in that they are a simple extension of regular autoencoders and do not require computing the Jacobian of the hidden layer.
Contractive autoencoders have a deterministic gradient (since no sampling is involved, i.e. no random noise), which means second-order optimizers can be used (conjugate gradient, L-BFGS, etc.), and they can be more stable than denoising autoencoders.
Autoencoders can have more than one hidden layer but they can be quite difficult to train (e.g. with small initial weights, the gradient dies).
They can be trained with unsupervised layer-by-layer pre-training (stacking RBMs), or care can be taken in weight initialization.
A shallow autoencoder is just an autoencoder with one hidden layer.
In particular, we can create a deep autoencoder by stacking (shallow) denoising autoencoders.
This typically works better than pre-training with RBMs.
Alternatively, (shallow) contractive autoencoders can be stacked, and they also work very well for pre-training.
The sparse coding model is another unsupervised neural network.
The general problem is that for each input $x^{(t)}$, we want to find a latent representation $h^{(t)}$ that is sparse (has many zeros) while still allowing a good reconstruction of the input.
Formally:

$$\min_D \frac{1}{T} \sum_{t=1}^{T} \min_{h^{(t)}} \frac{1}{2} \|x^{(t)} - D h^{(t)}\|_2^2 + \lambda \|h^{(t)}\|_1$$
Note that $D$ is called the dictionary; it plays the role of a decoder weight matrix, so the reconstruction of $x^{(t)}$ is $D h^{(t)}$.
The term $\lambda \|h^{(t)}\|_1$ is a sparsity penalty: the L1 norm encourages many elements of $h^{(t)}$ to be exactly zero.
We constrain the columns of $D$ to have unit norm; otherwise the sparsity penalty could be cheated by scaling $D$ up and $h^{(t)}$ down.
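The inner minimization over $h^{(t)}$ can be done with ISTA (iterative shrinkage-thresholding); a sketch, with names of my choosing:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(x, D, lam=0.1, n_iters=200):
    """Minimize 0.5 * ||x - D h||^2 + lam * ||h||_1 over h via ISTA."""
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the smooth part
    h = np.zeros(D.shape[1])
    for _ in range(n_iters):
        grad = D.T @ (D @ h - x)               # gradient of reconstruction term
        h = soft_threshold(h - grad / L, lam / L)
    return h

rng = np.random.default_rng(0)
D = rng.normal(size=(8, 16))
D /= np.linalg.norm(D, axis=0)  # unit-norm columns, as per the constraint
x = rng.normal(size=8)
h = sparse_code(x, D)           # many entries of h end up exactly zero
```

Each ISTA step is a gradient step on the reconstruction term followed by soft-thresholding, which is what produces exact zeros in $h$.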
Restricted Boltzmann machines (RBMs) are a type of neural network used for unsupervised learning; they try to extract meaningful features from the data.
Such methods are useful for when we have a small supervised training set, but perhaps abundant unlabeled data. We can train an RBM (or another unsupervised learning method) on the unlabeled data to learn useful features to use with the supervised training set - this approach is called semi-supervised learning.
Deep belief networks are a generative neural network. Given some feature values, a deep belief net can be run "backwards" and generate plausible inputs. For example, if you train a DBN on handwritten digits, it can be used to generate new images of handwritten digits.
Deep belief nets are also capable of unsupervised and semi-supervised learning. In an unsupervised setting, DBNs can still learn useful features.
So say we have trained a neural net which has learned our function - for example, one that takes a word and outputs a vector representation of that word.
We can re-use this network in a modular fashion so that we construct a larger neural net which can take a fixed-size set of words as input. For example, the following network takes in five words, from which we get their representations, which are then passed into another network
Using modular neural networks like the above is limiting in that we can only accept a fixed number of inputs.
We can get around this by adding an association module which merges two representations into one representation of the same size; by applying it repeatedly, we can combine any number of inputs.
As you can see, it can take either a representation from a word (via a word-representation module) or the output of another association module.
We probably don't want to merge words linearly though. Instead we might want to group words in some way:
This kind of model is a "recursive neural network" (sometimes "tree-structured neural network") because it has modules feeding into modules of the same type.
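A minimal sketch of an association module - a single tanh layer over the concatenated pair (weights here are random and untrained; all names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4  # size of every representation

# weights of the (hypothetical, untrained) association module
W = rng.normal(scale=0.1, size=(dim, 2 * dim))
b = np.zeros(dim)

def associate(left, right):
    """Merge two length-dim representations into one of the same size."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# word representations (normally learned; random here)
the, cat, sat = (rng.normal(size=dim) for _ in range(3))

# tree-structured merging: ((the cat) sat)
sentence = associate(associate(the, cat), sat)
```

Because the output has the same size as each input, the module can be applied recursively to cover a parse tree of any shape.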
In typical NNs, the architecture of the network is specified beforehand and is static - neurons don't change connections. In a nonlinear neural net, however, the connections between neurons become dynamic, so that new connections may form and old ones may break. This is more like how the human brain operates. But so far at least, these are very complex and difficult to train.
A Neural Turing Machine is a neural network enhanced with external addressable memory (and a means of interfacing with it). Like a Turing machine, it can simulate any arbitrary procedure - in fact, given an input sequence and a target output sequence, it can learn a procedure to map between the two on its own, trainable via gradient descent (as the entire thing is differentiable).
The basic architecture of NTMs is that there is a controller (which is a neural network, typically an RNN, e.g. an LSTM, or a standard feedforward network), read/write heads (the write "head" actually consists of two heads, an erase head and an add head, but is referred to as a single head), and a memory matrix $M_t$ of size $N \times M$.
Each row (of which there are $N$) is a memory "location", a vector of size $M$.
Unlike a normal Turing machine, the read and write operations are "blurry" in that they interact in some way with all elements in memory (normal Turing machines address one element at a time). There is an attentional "focus" mechanism that constrains the memory interaction to a smaller portion - each head outputs a weighting vector which determines how much it interacts (i.e. reads or writes) with each location.
At time $t$, a read head produces a weighting vector $w_t$ over the $N$ memory locations, with $\sum_i w_t(i) = 1$ and each $w_t(i) \geq 0$.
From this we get the read vector $r_t$, a blend of the memory rows weighted by $w_t$:

$$r_t = \sum_i w_t(i) M_t(i)$$
At time $t$, a write head produces a weighting vector $w_t$ and an erase vector $e_t$, whose $M$ elements all lie in the range $(0, 1)$.
Using these vectors, the memory vectors $M_{t-1}(i)$ from the previous time step are updated:

$$\tilde{M}_t(i) = M_{t-1}(i) [\mathbf{1} - w_t(i) e_t]$$

where $\mathbf{1}$ is a row vector of all ones and the multiplication against the memory location is pointwise (element-wise).
Thus a memory location is erased (all elements set to zero) if both the weighting $w_t(i)$ and the erase elements are 1; if either is zero, the memory is left unchanged.
The write head also produces an add vector $a_t$, which is added after the erase step:

$$M_t(i) = \tilde{M}_t(i) + w_t(i) a_t$$
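A sketch of blurry reading and writing in numpy, following the read/erase/add operations described above (names are mine):

```python
import numpy as np

def read(M, w):
    """Blurry read: a convex combination of all memory rows, weighted by w."""
    return w @ M

def write(M, w, e, a):
    """Blurry write: erase then add.
    Per location i: M[i] <- M[i] * (1 - w[i] * e) + w[i] * a
    """
    M = M * (1 - np.outer(w, e))  # erase step
    return M + np.outer(w, a)     # add step

N, width = 4, 3
M = np.ones((N, width))
w = np.array([1.0, 0.0, 0.0, 0.0])  # weighting sharply focused on location 0
e = np.ones(width)                  # fully erase wherever focused
a = np.array([5.0, 6.0, 7.0])

M = write(M, w, e, a)
r = read(M, w)  # recovers the written vector [5, 6, 7]
```

With a sharply focused weighting this reduces to ordinary addressed read/write; with a diffuse weighting every location is touched a little.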
So, how are these weight vectors $w_t$ produced?
For each head, two addressing mechanisms are combined to produce its weighting vectors:
Each head produces a length-$M$ key vector $k_t$, which is compared to each memory vector $M_t(i)$ by a similarity measure $K$ (e.g. cosine similarity). The head also emits a key strength $\beta_t$, and the content weighting is a softmax over the scaled similarities:

$$w_t^c(i) = \frac{\exp(\beta_t K(k_t, M_t(i)))}{\sum_j \exp(\beta_t K(k_t, M_t(j)))}$$
The location-based addressing mechanism is used to move across memory locations iteratively (i.e. given a current location, move to this next location; this is called a rotational shift) and for random-access jumps.
Each head outputs a scalar interpolation gate $g_t \in (0, 1)$, which blends the content weighting $w_t^c$ with the head's weighting from the previous time step:

$$w_t^g = g_t w_t^c + (1 - g_t) w_{t-1}$$
If the gate is zero, the content weighting is ignored and only the previous weighting is used.
(TODO not totally clear on this part) Next, the head also emits a shift weighting $s_t$, a normalized distribution over the allowed integer shifts (e.g. shifts of -1, 0, and +1).
Then we apply the rotation specified by $s_t$ to $w_t^g$ as a circular convolution (indices taken modulo $N$):

$$\tilde{w}_t(i) = \sum_{j=0}^{N-1} w_t^g(j)\, s_t(i - j)$$
Over time, the shift weighting, if it isn't "sharp", can cause weightings to disperse. For example, with permitted shifts of -1, 0, 1 and $s_t = [0.1, 0.8, 0.1]$, a single sharply-focused weighting gets blurred over three locations. To combat this, each head also emits a sharpening scalar $\gamma_t \geq 1$, applied as a final step:

$$w_t(i) = \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}}$$
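Putting the addressing steps together (content similarity, interpolation, rotational shift, sharpening) as a numpy sketch (names are mine):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def address(M, w_prev, k, beta, g, s, gamma):
    """Produce a weighting over the N memory locations."""
    N = len(w_prev)
    # 1. content-based weighting: cosine similarity to key k, scaled by beta
    sim = M @ k / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
    w_c = softmax(beta * sim)
    # 2. interpolation gate g in (0, 1) blends with the previous weighting
    w_g = g * w_c + (1 - g) * w_prev
    # 3. rotational shift: circular convolution with shift weighting s
    w_tilde = np.array([sum(w_g[j] * s[(i - j) % N] for j in range(N))
                        for i in range(N)])
    # 4. sharpening with gamma >= 1
    w = w_tilde ** gamma
    return w / w.sum()

# with the gate closed (g = 0), a one-hot shift weighting simply rotates
# the previous weighting by one location:
M = np.eye(4)
w_prev = np.array([1.0, 0.0, 0.0, 0.0])
s = np.array([0.0, 1.0, 0.0, 0.0])  # shift by +1
w = address(M, w_prev, np.ones(4), beta=1.0, g=0.0, s=s, gamma=1.0)
```

Here the shift weighting is represented as a length-$N$ vector over all rotations; in practice it is non-zero only on the few permitted shifts.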
Refer to the paper for example uses.
Neuroevolution is the process of applying evolutionary algorithms to neural networks to learn their parameters (weights) and/or architecture (topology).
Neuroevolution is flexible in its application; it may be used for supervised, unsupervised, and reinforcement learning tasks. An example application is state or action value evaluation, e.g. for game playing.
With neuroevolution, an important choice is the genetic representation (genotype) of the neural network. For instance, if the architecture is fixed by the user, the weights can just be genetically represented as a vector of real numbers. Then the standard genetic algorithm (i.e. fitness, mutation, crossover, etc) can be applied.
This simple representation of weights as a vector, evolved with a standard genetic algorithm, is called conventional neuroevolution (CNE).
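A toy CNE sketch - genomes are flat weight vectors, evolved with truncation selection and Gaussian mutation (no crossover; all names and the toy task are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(weights, X, y):
    """Negative squared error of a fixed-topology net (one tanh unit)
    whose weights are unpacked from the genome vector."""
    pred = np.tanh(X @ weights)
    return -np.sum((pred - y) ** 2)

# toy task: recover the weights of y = tanh(2*x0 - x1)
X = rng.normal(size=(50, 2))
y = np.tanh(X @ np.array([2.0, -1.0]))

pop = rng.normal(size=(20, 2))  # population of weight vectors (genomes)
for gen in range(100):
    scores = np.array([fitness(ind, X, y) for ind in pop])
    parents = pop[np.argsort(scores)[-5:]]           # truncation selection
    children = parents[rng.integers(0, 5, size=20)]  # clone a random parent
    pop = children + rng.normal(scale=0.1, size=children.shape)  # mutate

best = max(pop, key=lambda ind: fitness(ind, X, y))
```

No gradients are computed anywhere - selection pressure alone drives the weight vectors toward high fitness, which is what makes neuroevolution applicable when fitness is non-differentiable (e.g. game score).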
However, because the performance of a neural net is so dependent on topology, evolving the topology in addition to the weights can lead to better performance. One such method is NeuroEvolution of Augmenting Topologies (NEAT), of which there are many variations (e.g. RBF-NEAT, Cascade-NEAT).
With direct encoding, the parameters are mapped one-to-one onto the vector; that is, each weight is mapped to one number in the vector. However, there may be an advantage to using indirect encodings, in which information in one part of the vector may be linked to another part. This compacts the genetic representation in that not every value must be represented explicitly (some are shared, mapping to multiple connections).
A Compositional Pattern Producing Network (CPPN) is a neural network which functions as a pattern-generator. CPPNs typically include different activation functions (such as sine, for repeating patterns, or Gaussian, to create symmetric patterns). Although they were originally designed to produce two-dimensional patterns (e.g. images), CPPNs may be used to evolve indirectly encoded neural networks - they "exploit geometric domain properties to compactly describe the connectivity pattern of a large-scale ANN" (Risi & Togelius). The CPPN itself may be evolved using NEAT - this approach is called HyperNEAT.
One form of indirect encoding is the developmental approach, in which the network develops new connections as the game is being played.
In non-deterministic games, the fitness function may be noisy (since the same action can lead to different scores). One way around this is to average the performance over many independent plays.
For complex problems, it sometimes is too difficult to evolve the network directly to that problem. Instead, staging (also called incremental evolution) may be preferred, where the network is evolved on simpler problems that gradually increase towards the original complex task. Similarly, transfer learning may be useful here as well.
A challenge in evolving competitive AI is that there may not be a good enough opponent to play against and learn from. A method called competitive coevolution can be used, in which the fitness of one AI player depends on how it performs against another AI player drawn from the same or from another population.
A similar method called cooperative coevolution, where fitness is instead based on its performance in collaboration with other players, may make more sense in other contexts. It may be adapted more generally by applying it at the individual neuron level - that is, each neuron's fitness depends on how well it works with the other neurons in the network. The CoSyNE neuroevolution algorithm is based on this.
In many cases, there is no single performance metric that can be used; rather, performance is evaluated along many different dimensions. The simplest way around this is to combine the various metrics in some way - e.g. as a linear combination - but another way is cascading elitism, where "each generation contains separate selection events for each fitness function, ensuring equal selection pressure" (Risi & Togelius).
There is another class of algorithms, multiobjective evolutionary algorithms (MOEAs), in which multiple fitness functions are specified. These algorithms try to satisfy all of their given objectives and can also manage conflicts between objectives by identifying (mapping) them and deciding on tradeoffs. When a solution is found where no objective can be further improved without worsening another, the solution is said to be on the Pareto front. One such MOEA is NSGA-II (Non-dominated Sorting Genetic Algorithm II).
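A sketch of Pareto dominance and front extraction (assuming all objectives are being maximized; names are mine):

```python
def dominates(a, b):
    """True if solution a Pareto-dominates b: a is at least as good on
    every objective and strictly better on at least one."""
    return (all(ai >= bi for ai, bi in zip(a, b))
            and any(ai > bi for ai, bi in zip(a, b)))

def pareto_front(solutions):
    """The subset of solutions not dominated by any other solution."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o != s)]
```

MOEAs like NSGA-II rank the population by repeatedly peeling off such non-dominated fronts.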
There exist interactive evolution approaches in which a human can set or modify objectives during evolution, or even act as the fitness function themselves. Other ways humans can intervene include shaping, where the human can shape the environment to influence training, and demonstration, in which the human takes direct control and the network learns from that example.
Generative models are typically trained with maximum-likelihood estimation which can become intractable (due to the normalization/partition term).
Generative adversarial networks (GAN) are a method for training generative models with neural networks, trained with stochastic gradient descent instead of MLE.
Sampling from the model is achieved by inputting noise; the outputs of the network are the samples.
A conditional generative adversarial network (cGAN) is an extension which allows the model to condition on external information.
Note that denoising autoencoders have been used to achieve something similar. Denoising autoencoders learn to reconstruct empirical data from corrupted versions of it.
A GAN has two components: a generator $G$, which produces counterfeit samples (starting from noise), and a discriminator $D$, which tries to distinguish real training data from the generator's counterfeits.
These two are pitted against each other in an adversarial game. As such, the objective function here is a minimax value function:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
Breaking this down: the discriminator $D$ wants to maximize the value, i.e. assign high probability to real data $x$ and low probability to generated samples $G(z)$; the generator $G$ wants to minimize the second term, i.e. produce samples which the discriminator scores as real.
They are trained in alternation using stochastic gradient descent.
This paper incorporates conditioning into this general GAN framework. Some condition $y$ (e.g. a class label or other auxiliary information) is provided as an additional input to both the generator and the discriminator.
We have two functions: the generator $G(z \mid y)$ and the discriminator $D(x \mid y)$, both of which receive the condition $y$ as an extra input.
The generator learns to produce samples that are plausible for the given condition $y$.
The conditional GAN objective function becomes:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))]$$
The conditional data
In terms of cost functions: we have a batch of $m$ real training examples and a batch of $m$ generated (counterfeit) examples.
The cost equation for the discriminator $D$ is the usual binary cross-entropy, rewarding $D$ for assigning high probability to real examples and low probability to counterfeits:

$$J^{(D)} = -\frac{1}{m} \sum_{i=1}^{m} \left[ \log D(x^{(i)} \mid y^{(i)}) + \log(1 - D(G(z^{(i)} \mid y^{(i)}))) \right]$$
The cost equation for $G$ is the opposing term:

$$J^{(G)} = \frac{1}{m} \sum_{i=1}^{m} \log(1 - D(G(z^{(i)} \mid y^{(i)})))$$
Note that a "maximally confused" discriminator would output 0.5 for both true and counterfeit examples.
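We can check this numerically - with $D$ outputting 0.5 everywhere, the value function evaluates to $-\log 4$ (a sketch; names are mine):

```python
import numpy as np

def value(d_real, d_fake):
    """Monte Carlo estimate of the minimax value function:
    E[log D(x)] + E[log(1 - D(G(z)))]."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1 - d_fake))

# a maximally confused discriminator outputs 0.5 everywhere
d_real = np.full(100, 0.5)
d_fake = np.full(100, 0.5)
v = value(d_real, d_fake)  # -log(4), about -1.386
```

This is the value at the game's equilibrium, where the generator's distribution matches the data distribution and the discriminator can do no better than guessing.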
Note that we have to be careful how we draw the conditional data $y$ when generating new samples - arbitrary values of $y$ may not resemble anything seen during training.
Instead, we build a kernel density estimate of the conditional data seen during training and draw $y$ from that.
We have:
Together,
We simultaneously train $D$ and $G$: $D$ to correctly distinguish real from generated samples, and $G$ to produce samples that fool $D$.
However, we don't want to train $D$ to completion before updating $G$ (that would be computationally prohibitive and can lead to overfitting); instead, we alternate between $k$ steps of training $D$ and one step of training $G$.
Another problem is that early on, when $G$ is poor, $D$ can reject its samples with high confidence, so $\log(1 - D(G(z)))$ saturates and provides weak gradients. In practice, rather than training $G$ to minimize $\log(1 - D(G(z)))$, we can train it to maximize $\log D(G(z))$, which gives much stronger gradients early in learning.
The basic algorithm is:
```python
m = minibatch_size
pz = noise_prior
px = data_distribution

for i in range(epochs):
    for j in range(k):
        # sample m noise samples from the noise prior
        z = sample_minibatch(m, pz)
        # sample m examples from the data
        x = sample_minibatch(m, px)
        # update the discriminator by ascending its stochastic gradient
        update_discriminator(m, z, x)
    # sample m noise samples from the noise prior
    z = sample_minibatch(m, pz)
    # update the generator by descending its stochastic gradient
    update_generator(m, z)
```
Where update_discriminator ascends the stochastic gradient:

$$\nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D(x^{(i)}) + \log(1 - D(G(z^{(i)}))) \right]$$
and update_generator descends the stochastic gradient:

$$\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log(1 - D(G(z^{(i)})))$$
The paper used momentum for the gradient updates.