Backpropagation by Hand
A Very Simple NN Example
The code below is a simple example of a neural network with one hidden layer, using the sigmoid activation function. The network is trained to learn the XOR function, a classic problem for neural networks. The input consists of two features, and the output is a single value (either 0 or 1). The network has an input layer with 2 neurons, a hidden layer with 2 neurons, and an output layer with 1 neuron. The weights and biases are initialized randomly, and the network is trained with the backpropagation algorithm, which adjusts the weights and biases based on the error between the predicted output and the actual output.
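The listing itself is not reproduced in this extract, so here is a minimal NumPy sketch of such a network. The variable names (`W1`, `W2`, `b1`, `b2`, `a1`, `a2`, `delta2`, `sigmoid_derivative`, `eta`) follow the snippets quoted later in this post, while the details (initialization scale, learning rate, number of epochs) are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(a):
    # derivative of the sigmoid, written in terms of its output a = sigmoid(z)
    return a * (1 - a)

# XOR: 4 samples with 2 input features each, and 1 target value per sample
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Step 1: initialize weights and biases with small random values
rng = np.random.default_rng(42)
W1 = rng.normal(scale=0.5, size=(2, 2))   # input (2) -> hidden (2)
b1 = rng.normal(scale=0.5, size=(1, 2))
W2 = rng.normal(scale=0.5, size=(2, 1))   # hidden (2) -> output (1)
b2 = rng.normal(scale=0.5, size=(1, 1))

eta = 0.5  # learning rate (assumed value)

for epoch in range(10000):
    # Step 2: forward pass
    z1 = X @ W1 + b1
    a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)

    # Step 3: compute the loss (squared error, summed over the batch)
    loss = 0.5 * np.sum((a2 - y) ** 2)

    # Step 4: backward pass
    delta2 = (a2 - y) * sigmoid_derivative(a2)          # error at the output layer
    delta1 = (delta2 @ W2.T) * sigmoid_derivative(a1)   # error at the hidden layer

    # gradient-descent updates of weights and biases
    W2 -= eta * a1.T @ delta2
    b2 -= eta * delta2.sum(axis=0, keepdims=True)
    W1 -= eta * X.T @ delta1
    b1 -= eta * delta1.sum(axis=0, keepdims=True)

print(np.round(a2, 3))  # after training, ideally close to [[0], [1], [1], [0]]
```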
Steps of Backpropagation
- Initialization: (Step 1)
  - Initialize the weights and biases of the network with small random values.
- Forward Pass: (See Step 2 in the for loop)
  - For each layer $ l $, compute the input $ z^l $ and output $ a^l $:
    - $z^l = W^l a^{l-1} + b^l$
    - $a^l = \sigma(z^l)$
  - Here, $ W^l $ are the weights (`W1` and `W2`), $ b^l $ are the biases (`b1` and `b2`), $ \sigma $ is the activation function (`sigmoid`), and $ a^{l-1} $ is the output from the previous layer (the first $ a $ is the input $ x $).
- Compute Loss: (See the code in Step 3)
  - Compute the loss $ L $ using a suitable loss function.
- Backward Pass: (See the code in Step 4)
  - Calculate the gradient of the loss with respect to the output of the last layer and use it to form the error $ \delta^L $:
    - $\delta^L = \nabla_a L \odot \sigma'(z^L)$
  - For each layer $ l $ from $ L-1 $ to 1, compute:
    - $\delta^l = ((W^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$
  - Update the weights and biases:
    - $W^l = W^l - \eta \cdot \delta^l \cdot (a^{l-1})^T$
    - $b^l = b^l - \eta \cdot \delta^l$
  - Here, $ \eta $ is the learning rate, $ \sigma' $ is the derivative of the activation function, and $ \odot $ denotes the element-wise product (see the sketch after this list for how these formulas map to code).
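As a rough sketch of how these per-layer formulas map to code (the helper below is illustrative and not part of the original example; it assumes column vectors and a sigmoid activation):

```python
def backward_layer(delta_next, W_next, a, a_prev):
    """Illustrative helper: one backward step for a hidden layer l.
    delta_next = delta^{l+1}, W_next = W^{l+1}, a = sigma(z^l), a_prev = a^{l-1}."""
    delta = (W_next.T @ delta_next) * (a * (1 - a))  # delta^l, since sigma'(z^l) = a * (1 - a)
    grad_W = delta @ a_prev.T                        # gradient for W^l
    grad_b = delta                                   # gradient for b^l
    return delta, grad_W, grad_b
```

The updates are then `W -= eta * grad_W` and `b -= eta * grad_b`. For the output-layer weights $W_2$ in our example, the same chain of derivatives can be written out explicitly: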
$$
\frac{d(\text{Loss})}{dW_2} = \frac{d(\text{Loss})}{da_2} \times \frac{da_2}{dz_2} \times \frac{dz_2}{dW_2}
$$
- $ \frac{d(\text{Loss})}{da_2} $: how much the loss changes if $ a_2 $ changes
- $ \frac{da_2}{dz_2} $: how much $ a_2 $ changes if $ z_2 $ changes
- $ \frac{dz_2}{dW_2} $: how much $ z_2 $ changes if $ W_2 $ changes
Let's go through this step by step very carefully.
The delta calculation in the code:
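```python
delta2 = (a2 - y) * sigmoid_derivative(a2)   # error term of the output layer
```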
and how we get it based on the chain rule:
$$
\frac{d(\text{Loss})}{dW_2} = \frac{d(\text{Loss})}{da_2} \times \frac{da_2}{dz_2} \times \frac{dz_2}{dW_2}
$$
Step 1: Loss function
Suppose the loss is Mean Squared Error (MSE):
$$
\text{Loss} = \frac{1}{2} (a_2 - y)^2
$$
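In code this is simply (for a single sample; in the batched sketch above the squared errors are summed over the four XOR examples):

```python
loss = 0.5 * (a2 - y) ** 2   # squared-error loss for one sample
```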
Step 2: Derivatives
Now calculate each part:
- $ \frac{d(\text{Loss})}{da_2} $
By the derivative of the MSE:
$$
\frac{d(\text{Loss})}{da_2} = (a_2 - y)
$$
- $ \frac{da_2}{dz_2} $
Since $ a_2 = \sigma(z_2) $ (sigmoid activation),
$$
\frac{da_2}{dz_2} = \sigma'(z_2)
$$
and in code, $\sigma'(z_2)$ can be computed as:
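```python
sigmoid_derivative(a2)   # i.e. a2 * (1 - a2)
```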
because:
$$
\sigma'(z_2) = a_2 (1 - a_2)
$$
- $ \frac{dz_2}{dW_2} $
$ z_2 = a_1 W_2 + b_2 $, so:
$$
\frac{dz_2}{dW_2} = a_1
$$
(just the previous layer’s activations)
Step 3: Chain them together
Thus:
$$
\text{delta2} = \frac{d(\text{Loss})}{da_2} \times \frac{da_2}{dz_2}
$$
which is exactly:
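```python
delta2 = (a2 - y) * sigmoid_derivative(a2)   # dLoss/da2 times da2/dz2
```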
Later, the update for $ W_2 $ uses `delta2` multiplied by $ a_1 $ (the previous layer's activations).
Summary:
- `(a2 - y)` is $\frac{d(\text{Loss})}{da_2}$
- `sigmoid_derivative(a2)` is $\frac{da_2}{dz_2}$
- We haven't yet multiplied by $a_1$; that happens during the weight update!
Why do we form the `delta2` vector without multiplying by $a_1$?
Besides using it later to update $W_2$ (by multiplying with $a_1$), we also need `delta2` to update $b_2$, which does not require multiplication by $a_1$. So we don't multiply by $a_1$ here.
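A sketch of the two updates, using the batched variables from the example at the top (the sum for the bias is over the four XOR samples):

```python
W2 -= eta * a1.T @ delta2                      # weight update: delta2 combined with a1
b2 -= eta * delta2.sum(axis=0, keepdims=True)  # bias update: delta2 alone, no a1 needed
```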