(trying a clearer explanation of backprop)
terms:
- $J$ = cost function
- $w_i$ = weights for layer $i$
- $b_i$ = biases for layer $i$
- $f_i$ = activation function for layer $i$
- $l_i = w_i f_{i-1}(l_{i-1}) + b_i$
- $n$ = number of layers, i.e. $i=n$ is the output layer
- $d$ = number of input dimensions
The goal of backprop is to compute weight and bias updates, i.e. to compute $\frac{\partial J}{\partial w_i}$ and $\frac{\partial J}{\partial b_i}$ for all $i \in [1, n]$. We do so by repeatedly applying the chain rule.
For a layer $i$ we want to compute $\frac{\partial J}{\partial l_i}$ because we can easily compute $\frac{\partial J}{\partial w_i}$ and $\frac{\partial J}{\partial b_i}$ from it:
$$
\begin{aligned}
\frac{\partial J}{\partial w_i} &= \frac{\partial J}{\partial l_i} \frac{\partial l_i}{\partial w_i} \\
\frac{\partial J}{\partial b_i} &= \frac{\partial J}{\partial l_i} \frac{\partial l_i}{\partial b_i}
\end{aligned}
$$
Also note that (these derivatives are quite easy to work out on your own):
$$
\begin{aligned}
\frac{\partial l_i}{\partial w_i} &= f_{i-1}(l_{i-1}) \\
\frac{\partial l_i}{\partial b_i} &= 1
\end{aligned}
$$
To clarify, $\frac{\partial l_i}{\partial w_i} = f_{i-1}(l_{i-1})$ just says that this derivative equals the output of the previous layer.
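These two local derivatives can be checked numerically. A minimal sketch, assuming scalar layers (one unit each) and a sigmoid activation — both arbitrary choices for illustration:

```python
import math

def f(x):  # activation f_{i-1}; sigmoid is an assumed example choice
    return 1.0 / (1.0 + math.exp(-x))

w_i, b_i = 0.7, -0.2   # parameters of layer i (arbitrary values)
l_prev = 0.5           # pre-activation of layer i-1 (arbitrary)

def l_i(w, b):
    return w * f(l_prev) + b   # l_i = w_i f_{i-1}(l_{i-1}) + b_i

# Central finite differences with respect to w_i and b_i:
eps = 1e-6
dl_dw = (l_i(w_i + eps, b_i) - l_i(w_i - eps, b_i)) / (2 * eps)
dl_db = (l_i(w_i, b_i + eps) - l_i(w_i, b_i - eps)) / (2 * eps)

assert abs(dl_dw - f(l_prev)) < 1e-6   # matches f_{i-1}(l_{i-1})
assert abs(dl_db - 1.0) < 1e-6         # matches 1
```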
With these in mind, we can carry out backpropagation.
We start with the output layer, i.e. with $i=n$.
$$
\begin{aligned}
\frac{\partial J}{\partial l_n} &= \frac{\partial J}{\partial f_n(l_n)} \frac{\partial f_n(l_n)}{\partial l_n} \\
&= \frac{\partial J}{\partial f_n(l_n)} f_n'(l_n)
\end{aligned}
$$
Note that $\frac{\partial J}{\partial f_n(l_n)}$ is just the derivative of the cost function with respect to the network's prediction $h(x) = f_n(l_n)$, i.e. $J'(h(x))$.
With $\frac{\partial J}{\partial l_n}$ we can compute the updates for $w_n$ and $b_n$ using the relationship shown above.
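As a concrete sketch of the output-layer step, assume a scalar output with sigmoid $f_n$ and squared-error cost $J = \frac{1}{2}(h(x) - y)^2$, so $J'(h(x)) = h(x) - y$ (both choices are assumptions for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

l_n, y = 0.3, 1.0          # output pre-activation and target (arbitrary)
h = sigmoid(l_n)           # prediction h(x) = f_n(l_n)
dJ_dh = h - y              # J'(h(x)) for squared-error cost
f_prime = h * (1.0 - h)    # sigmoid'(l_n) = f_n(l_n)(1 - f_n(l_n))
dJ_dl_n = dJ_dh * f_prime  # ∂J/∂l_n = J'(h(x)) f_n'(l_n)

# Sanity check against a numerical derivative of J with respect to l_n:
eps = 1e-6
J = lambda l: 0.5 * (sigmoid(l) - y) ** 2
numeric = (J(l_n + eps) - J(l_n - eps)) / (2 * eps)
assert abs(dJ_dl_n - numeric) < 1e-8
```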
Now we move backwards a layer to compute $\frac{\partial J}{\partial l_{n-1}}$. Starting with the full chain rule version:
$$
\frac{\partial J}{\partial l_{n-1}} = \frac{\partial J}{\partial f_n(l_n)} \frac{\partial f_n(l_n)}{\partial l_n} \frac{\partial l_n}{\partial f_{n-1}(l_{n-1})} \frac{\partial f_{n-1}(l_{n-1})}{\partial l_{n-1}}
$$
But we can simplify this a bit, especially because we've already computed $\frac{\partial J}{\partial l_n}$:
$$
\begin{aligned}
\frac{\partial J}{\partial l_{n-1}} &= \frac{\partial J}{\partial l_n} \frac{\partial l_n}{\partial f_{n-1}(l_{n-1})} \frac{\partial f_{n-1}(l_{n-1})}{\partial l_{n-1}} \\
&= \frac{\partial J}{\partial l_n} \frac{\partial l_n}{\partial f_{n-1}(l_{n-1})} f_{n-1}'(l_{n-1})
\end{aligned}
$$
Also, because:
$$
l_n = w_n f_{n-1}(l_{n-1}) + b_n
$$
it follows that (again, this is easy to show with basic derivative rules):
$$
\frac{\partial l_n}{\partial f_{n-1}(l_{n-1})} = w_n
$$
Therefore:
$$
\frac{\partial J}{\partial l_{n-1}} = \frac{\partial J}{\partial l_n} w_n f_{n-1}'(l_{n-1})
$$
Then we can again go from this to $\frac{\partial J}{\partial w_{n-1}}$ and $\frac{\partial J}{\partial b_{n-1}}$ using the relationship described earlier.
We can generalize what we just did for any layer $i$:
$$
\frac{\partial J}{\partial l_i} = \frac{\partial J}{\partial l_{i+1}} w_{i+1} f_i'(l_i)
$$
And then use the relationship described earlier to go from this to $\frac{\partial J}{\partial w_i}$ and $\frac{\partial J}{\partial b_i}$.
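One step of this recursion can be sketched numerically; the scalar values and sigmoid activation below are arbitrary assumptions for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

dJ_dl_next = 0.12   # ∂J/∂l_{i+1}, already computed one layer later
w_next = 0.8        # w_{i+1}
l_i = -0.4          # pre-activation of layer i

# ∂J/∂l_i = ∂J/∂l_{i+1} · w_{i+1} · f_i'(l_i)
dJ_dl_i = dJ_dl_next * w_next * sigmoid_prime(l_i)
```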
To summarize:
Compute for the output layer, i.e. $i=n$:
$$
\frac{\partial J}{\partial l_n} = J'(h(x)) f_n'(l_n)
$$
Then, working backwards, for each remaining layer $i = n-1, \dots, 1$ (the input layer has no parameters):
$$
\frac{\partial J}{\partial l_i} = \frac{\partial J}{\partial l_{i+1}} w_{i+1} f_i'(l_i)
$$
Then, for all layers $i \in [1, n]$, compute the weight and bias updates:
$$
\begin{aligned}
\text{weight update} &= \frac{\partial J}{\partial l_i} \frac{\partial l_i}{\partial w_i} = \frac{\partial J}{\partial l_i} f_{i-1}(l_{i-1}) \\
\text{bias update} &= \frac{\partial J}{\partial l_i} \frac{\partial l_i}{\partial b_i} = \frac{\partial J}{\partial l_i}
\end{aligned}
$$
Note that $f_0(l_0) = X$ (the input layer's output is just the input $X$).
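The whole summary can be sketched end-to-end. The network below is a chain of scalar layers (one unit each) with sigmoid activations and squared-error cost $J = \frac{1}{2}(h(x) - y)^2$ — all assumed for illustration, not part of the derivation itself. The analytic gradients are checked against finite differences:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(ws, bs, x):
    """Return pre-activations l_1..l_n and activations, with fs[0] = X."""
    ls, fs = [], [x]
    for w, b in zip(ws, bs):
        l = w * fs[-1] + b            # l_i = w_i f_{i-1}(l_{i-1}) + b_i
        ls.append(l)
        fs.append(sigmoid(l))
    return ls, fs

def backprop(ws, bs, x, y):
    _, fs = forward(ws, bs, x)
    n = len(ws)
    grads_w, grads_b = [0.0] * n, [0.0] * n
    h = fs[-1]                        # prediction h(x) = f_n(l_n)
    delta = (h - y) * h * (1.0 - h)   # ∂J/∂l_n = J'(h(x)) f_n'(l_n)
    for i in range(n - 1, -1, -1):    # 0-based index i = 1-based layer i+1
        grads_w[i] = delta * fs[i]    # ∂J/∂l · previous layer's output
        grads_b[i] = delta            # ∂J/∂l · 1
        if i > 0:                     # recursion: multiply by w and f'
            s = fs[i]
            delta = delta * ws[i] * s * (1.0 - s)
    return grads_w, grads_b

# Check against central finite differences (parameter values arbitrary):
ws, bs, x, y = [0.5, -0.3, 0.8], [0.1, 0.0, -0.2], 0.7, 1.0
gw, gb = backprop(ws, bs, x, y)

def cost(ws_, bs_):
    return 0.5 * (forward(ws_, bs_, x)[1][-1] - y) ** 2

eps = 1e-6
for i in range(len(ws)):
    hi = ws[:]; hi[i] += eps
    lo = ws[:]; lo[i] -= eps
    assert abs(gw[i] - (cost(hi, bs) - cost(lo, bs)) / (2 * eps)) < 1e-8
```

The gradient check is a useful habit when implementing backprop by hand: if the analytic and numerical gradients disagree, an index or sign in the recursion is usually the culprit.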