Backpropagation by Hand
$$
W^\ell \leftarrow W^\ell - \eta\,(a^{\ell-1})^\top \delta^\ell,\quad
b^\ell \leftarrow b^\ell - \eta\sum_i \delta^\ell_i.
$$
A Very Simple NN Example
The code below is a simple example of a neural network with one hidden layer, using the sigmoid activation function. The network is trained to learn the XOR function, a classic problem for neural networks. The input consists of two features, and the output is a single value (either 0 or 1). The network has an input layer with 2 neurons, a hidden layer with 2 neurons, and an output layer with 1 neuron. The weights and biases are initialized randomly, and the network is trained with the backpropagation algorithm, which adjusts the weights and biases based on the error between the predicted output and the actual output.
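The original listing is not reproduced here; the following is a minimal sketch consistent with the description above and with the variable names used later in this walkthrough (`W1`, `W2`, `b1`, `b2`, `sigmoid`, `sigmoid_derivative`, `delta2`). Details such as the initialization scale, learning rate, and number of epochs are assumptions.

```python
import numpy as np

def sigmoid(z):
    # squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(a):
    # derivative of the sigmoid, written in terms of its output a = sigmoid(z)
    return a * (1.0 - a)

# XOR dataset: 4 samples, 2 input features, 1 output
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Step 1: initialize weights and biases with small random values
rng = np.random.default_rng(42)
W1 = rng.normal(scale=0.5, size=(2, 2))   # input (2) -> hidden (2)
b1 = np.zeros((1, 2))
W2 = rng.normal(scale=0.5, size=(2, 1))   # hidden (2) -> output (1)
b2 = np.zeros((1, 1))

eta = 0.5  # learning rate

for epoch in range(20000):
    # Step 2: forward pass
    z1 = X @ W1 + b1
    a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)

    # Step 3: compute the loss (squared error summed over the batch)
    loss = 0.5 * np.sum((a2 - y) ** 2)
    if epoch % 5000 == 0:
        print(f"epoch {epoch}: loss = {loss:.4f}")

    # Step 4: backward pass
    delta2 = (a2 - y) * sigmoid_derivative(a2)          # error at the output layer
    delta1 = (delta2 @ W2.T) * sigmoid_derivative(a1)   # error at the hidden layer

    # gradient-descent updates
    W2 -= eta * a1.T @ delta2
    b2 -= eta * np.sum(delta2, axis=0, keepdims=True)
    W1 -= eta * X.T @ delta1
    b1 -= eta * np.sum(delta1, axis=0, keepdims=True)

print(np.round(a2, 3))  # predictions should move toward [0, 1, 1, 0]
```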
Steps of Backpropagation
- Initialization (Step 1)
  - Initialize the weights and biases of the network with small random values.
- Forward Pass (see Step 2 in the for loop)
  - For each layer $l$, compute the weighted input $z^l$ and the output $a^l$ (row-vector convention, matching the code, where samples are stacked as rows):
    - $z^l = a^{l-1} W^l + b^l$
    - $a^l = \sigma(z^l)$
  - Here, $W^l$ are the weights (`W1` and `W2`), $b^l$ are the biases (`b1` and `b2`), $\sigma$ is the activation function (`sigmoid`), and $a^{l-1}$ is the output of the previous layer (the first $a$ is $x$, i.e., the input).
- Compute Loss (see the code in Step 3)
  - Compute the loss $L$ using a suitable loss function.
- Backward Pass (see the code in Step 4)
  - Calculate the error at the output layer, $\delta^L$ (the gradient of the loss with respect to $z^L$):
    - $\delta^L = \nabla_a L \odot \sigma'(z^L)$
  - For each layer $l$ from $L-1$ down to 1, compute:
    - $\delta^l = \bigl(\delta^{l+1} (W^{l+1})^\top\bigr) \odot \sigma'(z^l)$
  - Update the weights and biases:
    - $W^l = W^l - \eta \, (a^{l-1})^\top \delta^l$
    - $b^l = b^l - \eta \sum_i \delta^l_i$
  - Here, $\eta$ is the learning rate and $\sigma'$ is the derivative of the activation function.
For the output-layer weights $W_2$ in our two-layer example, the chain rule gives:
$$
\frac{d(\text{Loss})}{dW_2} = \frac{d(\text{Loss})}{da_2} \times \frac{da_2}{dz_2} \times \frac{dz_2}{dW_2}
$$
- $ \frac{d(\text{Loss})}{da_2} $: how much the loss changes if $ a_2 $ changes
- $ \frac{da_2}{dz_2} $: how much $ a_2 $ changes if $ z_2 $ changes
- $ \frac{dz_2}{dW_2} $: how much $ z_2 $ changes if $ W_2 $ changes
Let's go through this step by step, very carefully.
The delta calculation in the code:
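Based on the summary further below (and Step 4 of the sketch above), this is presumably the single line:

```python
delta2 = (a2 - y) * sigmoid_derivative(a2)  # error signal at the output layer
```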
and how we get it from the chain rule:
$$
\frac{d(\text{Loss})}{dW_2} = \frac{d(\text{Loss})}{da_2} \times \frac{da_2}{dz_2} \times \frac{dz_2}{dW_2}
$$
Step 1: Loss function
Suppose the loss is Mean Squared Error (MSE):
$$
\text{Loss} = \frac{1}{2} (a_2 - y)^2
$$
Step 2: Derivatives
| Term | Derived from | Result |
|---|---|---|
| $\frac{d(\text{Loss})}{da_2}$ | $\text{Loss} = \frac{1}{2}(a_2 - y)^2$ | $a_2 - y$ |
| $\frac{da_2}{dz_2}$ | $a_2 = \sigma(z_2)$ | $\sigma'(z_2)$ |
| $\frac{dz_2}{dW_2}$ | $z_2 = a_1 W_2 + b_2$ | $a_1$ |

For the sigmoid activation, $\sigma'(z_2) = a_2 (1 - a_2)$.
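This is exactly what the `sigmoid_derivative` helper referenced below presumably computes, expressed in terms of the already-computed activation:

```python
def sigmoid_derivative(a):
    # derivative of the sigmoid via its output: sigma'(z) = a * (1 - a), with a = sigmoid(z)
    return a * (1 - a)
```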
Step 3: Chain them together
Thus:
$$
\text{delta2} = \frac{d(\text{Loss})}{da_2} \times \frac{da_2}{dz_2}
$$
which is exactly:
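```python
delta2 = (a2 - y) * sigmoid_derivative(a2)  # the same (presumed) line as above
```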
Later, the update for $W_2$ uses `delta2` multiplied by $a_1$ (the previous layer's activations).
Summary:
- `(a2 - y)` is $\frac{d(\text{Loss})}{da_2}$
- `sigmoid_derivative(a2)` is $\frac{da_2}{dz_2}$
- We haven't yet multiplied by $a_1$; that happens during the weight update!
Why do we compute the `delta2` vector without multiplying by $a_1$?
We will use it later to update $W_2$, where it is multiplied by $a_1$, but we also need it to update $b_2$, which does not involve $a_1$ at all. Keeping `delta2` free of $a_1$ lets us reuse it for both updates.
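As a sketch (reusing the names from the network above, with learning rate `eta`), the same `delta2` feeds both updates:

```python
W2 -= eta * a1.T @ delta2                          # weight update: needs a1
b2 -= eta * np.sum(delta2, axis=0, keepdims=True)  # bias update: no a1 involved
```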
Expansions
- Derive δ¹ Explicitly
Work through the chain rule for the hidden layer:
$$
\delta^1 = \bigl(\delta^2 W^{2\top}\bigr)\odot \sigma'(z^1).
$$
Mapping each term to code deepens your grasp of how errors “flow back” through layers (Backpropagation calculus - 3Blue1Brown).
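In code, this maps to a single line (a sketch using the names from the network above):

```python
# push the output error back through W2, then gate by the local sigmoid slope
delta1 = (delta2 @ W2.T) * sigmoid_derivative(a1)
```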
- Bias Gradients
Note that $\partial L/\partial b^\ell = \sum_i \delta^\ell_i$, which the code implements via `np.sum(deltaℓ, axis=0, keepdims=True)` (Backpropagation - Wikipedia).
- Alternative Losses
Show how the backprop equations simplify when using binary cross‐entropy:
$$
L = -y\log(a^2) - (1-y)\log(1-a^2)
\quad\Longrightarrow\quad
\delta^2 = a^2 - y,
$$
eliminating the explicit $\sigma'(z^2)$ term (Backpropagation - Wikipedia).
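A sketch of the corresponding change in the backward pass, assuming a sigmoid output unit and the names used above:

```python
# binary cross-entropy + sigmoid output: the sigmoid'(z2) factor cancels out
delta2 = a2 - y
```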
- Regularization in Loss
Introduce weight decay by adding $\tfrac{\lambda}{2}\|W\|^2$ to $L$. Its gradient $\lambda W$ simply augments $\partial L/\partial W$ before the update (History of artificial neural networks).
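A sketch of the modified update for `W2`; the decay strength `lam` is a hypothetical value, not taken from the original code:

```python
lam = 1e-3  # weight-decay (L2 regularization) strength -- hypothetical value
# the L2 penalty adds lam * W2 to the data-loss gradient before the update
W2 -= eta * (a1.T @ delta2 + lam * W2)
```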
- Vectorizing Over Batches
Generalize from a single example to a batch by keeping matrices `X` and `Y` shaped `(batch_size, features)`, ensuring the update formulas remain the same (Backpropagation - CS231n Deep Learning for Computer Vision).
- Modern Optimizers
Briefly demonstrate how to plug in Adam: maintain per-parameter running averages `m` and `v` and adjust `W` with
$$
m \leftarrow \beta_1 m + (1-\beta_1)\nabla W,\quad
v \leftarrow \beta_2 v + (1-\beta_2)(\nabla W)^2,
$$
then
$$
W \leftarrow W - \eta\,\frac{m/(1-\beta_1^t)}{\sqrt{v/(1-\beta_2^t)}+\epsilon}.
$$
This typically outperforms vanilla SGD on noisy or sparse data (History of artificial neural networks).
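A minimal sketch of an Adam-style step for a single parameter matrix; the hyperparameter values below are the usual defaults, not taken from the original code:

```python
import numpy as np

def adam_update(W, grad_W, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for one parameter matrix; returns the updated (W, m, v)."""
    m = beta1 * m + (1 - beta1) * grad_W          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad_W ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction for m
    v_hat = v / (1 - beta2 ** t)                  # bias correction for v
    W = W - eta * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v

# usage sketch: replace the plain SGD step for W2
# (m2 and v2 start as np.zeros_like(W2))
# grad_W2 = a1.T @ delta2
# W2, m2, v2 = adam_update(W2, grad_W2, m2, v2, t=epoch + 1)
```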