Introduction
First we must define how a feed-forward network functions.
The inputs to the model can be viewed as the 0th layer's output: $O^0 = X$.
Note: the indexing of the weight matrix may seem backwards, but doing it this way prevents us from having to take the transpose of the weight matrix when converting to matrix form.
$$
\begin{aligned}
z^L_j &= \sum_k W^L_{jk} O^{L-1}_k + B^L_j \\
z^L &= W^L O^{L-1} + B^L \\
O^L_j &= \sigma\left( \sum_k W^L_{jk} O^{L-1}_k + B^L_j \right) = \sigma(z^L_j) \\
O^L &= \sigma_v(z^L)
\end{aligned}
$$

Here $z^L$ is the weighted input to layer $L$, $\sigma$ is the activation function, and $\sigma_v$ denotes $\sigma$ applied elementwise to a vector.
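As a concrete illustration of these definitions, here is a minimal NumPy sketch of the forward pass. The sigmoid activation, the layer sizes, and the random initialisation are assumptions made for this example only, not part of the definitions above.

```python
import numpy as np

# A minimal sketch of the forward equations above, assuming a sigmoid activation.
def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights, biases):
    """Return the weighted inputs z^L and the outputs O^L of every layer.

    Each W^L has shape (n_L, n_{L-1}), matching the W^L_{jk} indexing above,
    so z^L = W^L O^{L-1} + B^L needs no transposes.
    """
    O, z = [x], [None]           # O^0 = X; the input layer has no weighted input
    for W, B in zip(weights, biases):
        z.append(W @ O[-1] + B)  # z^L = W^L O^{L-1} + B^L
        O.append(sigma(z[-1]))   # O^L = sigma_v(z^L)
    return z, O

# Illustrative 2-3-1 network with random parameters.
rng = np.random.default_rng(0)
sizes = [2, 3, 1]
weights = [rng.standard_normal((n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n) for n in sizes[1:]]
z, O = forward(np.array([0.5, -1.0]), weights, biases)
```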
Given that all of the required parameters are properly defined, this fully defines a feed-forward network. These definitions are not where the magic of a neural net happens (I intend to detail at some point how this formulation allows it to be a good function approximator).
The magic happens by combining an optimisation algorithm (gradient descent is the standard) with backpropagation.
Derivation
Backprop sounds very mysterious, but all it is is an algorithm to calculate the partial derivatives of the cost function with respect to each of the learnable parameters (the weights and biases), i.e. $\frac{\partial C}{\partial W^L_{jk}}$ and $\frac{\partial C}{\partial B^L_j}$.
The most common cost function is the squared error (below), but many others exist.
$$
C = \frac{1}{2n} \sum_i \left( O^l(x_i) - y_i \right)^2
$$

where the sum runs over the $n$ training examples and $O^l(x_i)$ is the output of the final layer $l$ for input $x_i$.
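Although the derivation below works for any differentiable cost, it is worth noting what the cost itself contributes: its derivative with respect to the network output is what the backward pass will start from. A minimal sketch for a single training example (dropping the $\frac{1}{n}$ average for clarity; the function names are just illustrative):

```python
import numpy as np

# Squared-error cost for a single example, and its derivative with respect to
# the network output O^l, which seeds the backward pass at the final layer.
def cost(O_last, y):
    return 0.5 * np.sum((O_last - y) ** 2)

def cost_grad(O_last, y):
    # dC/dO^l = O^l - y for the single-example squared error
    return O_last - y
```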
$$
\frac{\partial C}{\partial W^L_{jk}}
= \frac{\partial C}{\partial O^L_j} \frac{\partial O^L_j}{\partial W^L_{jk}}
= \sum_{a,b,\dots,f} \frac{\partial C}{\partial O^l_a} \frac{\partial O^l_a}{\partial O^{l-1}_b} \cdots \frac{\partial O^{L+1}_f}{\partial O^L_j} \frac{\partial O^L_j}{\partial W^L_{jk}}
$$
We now have a formulation for the result we wish to calculate; however, computing this directly for every node in the network would be quite expensive.
From this formulation it makes sense to compute the derivatives starting from the last layer and working backwards.
We will therefore derive the derivatives for a given layer in terms of those of the layer that follows it.
We will define $\delta^L_j = \frac{\partial C}{\partial O^L_j}$; this is the error of the output of the $j$th node in the $L$th layer.
$$
\begin{aligned}
\delta^L_j = \frac{\partial C}{\partial O^L_j}
&= \sum_a \frac{\partial C}{\partial z^{L+1}_a} \frac{\partial z^{L+1}_a}{\partial O^L_j} \\
&= \sum_a \frac{\partial C}{\partial z^{L+1}_a} W^{L+1}_{aj} \\
&= \sum_a \frac{\partial C}{\partial O^{L+1}_a} \frac{\partial O^{L+1}_a}{\partial z^{L+1}_a} W^{L+1}_{aj} \\
&= \sum_a \frac{\partial C}{\partial O^{L+1}_a} \sigma'(z^{L+1}_a) W^{L+1}_{aj} \\
&= \sum_a \delta^{L+1}_a \sigma'(z^{L+1}_a) W^{L+1}_{aj} \\
&= \sum_a \left( \delta^{L+1} \odot \sigma_v'(z^{L+1}) \right)_a W^{L+1}_{aj} \\
&= \sum_a \left( W^{L+1}_{\cdot,j} \right)_a \left( \delta^{L+1} \odot \sigma_v'(z^{L+1}) \right)_a \\
&= \left( W^{L+1}_{\cdot,j} \right)^T \left( \delta^{L+1} \odot \sigma_v'(z^{L+1}) \right)
\end{aligned}
$$
This shows that we can compute the error of a layer from the error of the layer that follows it.
$$
\frac{\partial C}{\partial O^L} = (W^{L+1})^T \left( \delta^{L+1} \odot \sigma_v'(z^{L+1}) \right) \tag{A1}
$$

$$
\frac{\partial C}{\partial B^L_j}
= \frac{\partial C}{\partial O^L_j} \frac{\partial O^L_j}{\partial B^L_j}
= \delta^L_j \, \sigma'(z^L_j) (1)
= \delta^L_j \, \sigma'(z^L_j)
$$

$$
\frac{\partial C}{\partial B^L} = \delta^L \odot \sigma_v'(z^L) \tag{A2}
$$

$$
\frac{\partial C}{\partial W^L_{jk}}
= \frac{\partial C}{\partial O^L_j} \frac{\partial O^L_j}{\partial W^L_{jk}}
= \delta^L_j \, \sigma'(z^L_j) \, O^{L-1}_k
$$

$$
\frac{\partial C}{\partial W^L} = \left( \delta^L \odot \sigma_v'(z^L) \right) \left( O^{L-1} \right)^T \tag{A3}
$$
This allows us to compute the derivative with respect to each parameter we can control, if we compute the error terms first.
So in what order should we compute these values? We can see that we should first compute $z^L$, $O^L$, and $\sigma_v'(z^L)$ for every layer, which can be done in a single forward pass. Then we can compute all of the error terms in a single backward pass, starting from the final layer, whose error comes directly from differentiating the cost with respect to the network output.
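Putting (A1)-(A3) and this ordering together, here is a self-contained NumPy sketch of one forward pass followed by one backward pass. It assumes a sigmoid activation and the single-example squared-error cost, and it deliberately carries the "naive" error $\delta^L = \frac{\partial C}{\partial O^L}$ exactly as derived above; all names and shapes are illustrative.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigma_prime(x):
    s = sigma(x)
    return s * (1.0 - s)

def backprop(x, y, weights, biases):
    # Forward pass: store the weighted inputs z^L and outputs O^L of every layer.
    O, z = [x], [None]
    for W, B in zip(weights, biases):
        z.append(W @ O[-1] + B)
        O.append(sigma(z[-1]))

    n_layers = len(weights)              # layers are numbered 1 .. n_layers
    delta = [None] * (n_layers + 1)
    grad_W = [None] * (n_layers + 1)
    grad_B = [None] * (n_layers + 1)

    # Error of the final layer, straight from the single-example squared error.
    delta[n_layers] = O[-1] - y

    # Backward pass: last layer first.
    for L in range(n_layers, 0, -1):
        grad_B[L] = delta[L] * sigma_prime(z[L])                              # (A2)
        grad_W[L] = np.outer(grad_B[L], O[L - 1])                             # (A3)
        if L > 1:
            delta[L - 1] = weights[L - 1].T @ (delta[L] * sigma_prime(z[L]))  # (A1)

    return grad_W[1:], grad_B[1:]

# Illustrative 2-3-1 network.
rng = np.random.default_rng(0)
sizes = [2, 3, 1]
weights = [rng.standard_normal((n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n) for n in sizes[1:]]
gW, gB = backprop(np.array([0.5, -1.0]), np.array([1.0]), weights, biases)
```

A gradient-descent step would then just subtract a small multiple of `gW` and `gB` from the corresponding weights and biases.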
This is all backpropagation is for a straightforward feed-forward network.
Speed up
We can see in our formulation that whenever we use $\delta^L$, it is in the context of $\delta^L \odot \sigma_v'(z^L)$; we compute this quantity in every step of the algorithm. We could therefore define $\lambda^L = \delta^L \odot \sigma_v'(z^L)$.
$$
\begin{aligned}
\delta^L &= (W^{L+1})^T \lambda^{L+1} \\
\delta^L \odot \sigma_v'(z^L) &= \left( (W^{L+1})^T \lambda^{L+1} \right) \odot \sigma_v'(z^L) \\
\lambda^L &= \left( (W^{L+1})^T \lambda^{L+1} \right) \odot \sigma_v'(z^L)
\end{aligned} \tag{B1}
$$

$$
\begin{aligned}
\frac{\partial C}{\partial B^L} &= \delta^L \odot \sigma_v'(z^L) \\
&= \lambda^L
\end{aligned} \tag{B2}
$$

$$
\begin{aligned}
\frac{\partial C}{\partial W^L} &= \left( \delta^L \odot \sigma_v'(z^L) \right) (O^{L-1})^T \\
&= \lambda^L (O^{L-1})^T
\end{aligned} \tag{B3}
$$
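In code, the only change from the earlier sketch is that the backward loop carries $\lambda^L$ directly, so $\delta^L \odot \sigma_v'(z^L)$ is never recomputed. Again this is only a sketch under the same assumptions (sigmoid activation, single-example squared error, illustrative names):

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigma_prime(x):
    s = sigma(x)
    return s * (1.0 - s)

def backprop_fast(x, y, weights, biases):
    # Forward pass, identical to before.
    O, z = [x], [None]
    for W, B in zip(weights, biases):
        z.append(W @ O[-1] + B)
        O.append(sigma(z[-1]))

    n_layers = len(weights)
    grad_W = [None] * n_layers
    grad_B = [None] * n_layers

    # lambda^l at the final layer: (O^l - y) elementwise-times sigma'(z^l).
    lam = (O[-1] - y) * sigma_prime(z[-1])

    for L in range(n_layers, 0, -1):
        grad_B[L - 1] = lam                                          # (B2)
        grad_W[L - 1] = np.outer(lam, O[L - 1])                      # (B3)
        if L > 1:
            lam = (weights[L - 1].T @ lam) * sigma_prime(z[L - 1])   # (B1)

    return grad_W, grad_B
```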
NOTE
This version of the equations is probably the one you have seen before. If we had defined $\delta^L_j = \frac{\partial C}{\partial z^L_j}$ instead, our first set of derived equations would have looked like this modified version. The original choice was made intentionally, to show that the algorithm can be derived even without that observation.
While this algorithm is extremely powerful, it's nothing that a student with some linear algebra and multivariate calculus knowledge couldn't derive.