Backpropagation: The Heart of Neural Network Training



Backpropagation is the algorithm that makes deep learning possible. Let's understand how it works mathematically and implement it from scratch.

The Forward Pass

Consider a simple neural network with one hidden layer and one output layer. The forward pass can be written as:

z^{[1]} = W^{[1]} x + b^{[1]}
a^{[1]} = \sigma(z^{[1]})
z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}
a^{[2]} = \sigma(z^{[2]})

Where \sigma is the activation function (e.g., sigmoid, ReLU).
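To make the shapes concrete, here is a minimal NumPy sketch of these two layers, assuming 3 input features, 4 hidden units, and a single output (sizes chosen purely for illustration):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.random.randn(3, 1)                          # input column vector: 3 features
W1, b1 = np.random.randn(4, 3), np.zeros((4, 1))   # hidden layer parameters
W2, b2 = np.random.randn(1, 4), np.zeros((1, 1))   # output layer parameters

z1 = W1 @ x + b1        # shape (4, 1)
a1 = sigmoid(z1)        # hidden activations
z2 = W2 @ a1 + b2       # shape (1, 1)
a2 = sigmoid(z2)        # network output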

The Loss Function

For binary classification, we use the cross-entropy loss:

J = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(a^{[2](i)}) + (1 - y^{(i)}) \log(1 - a^{[2](i)}) \right]
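For reference, this loss can be computed directly in NumPy. This is a small sketch: a2 and y are assumed to be arrays of shape (1, m), and the eps term guards against log(0):

import numpy as np

def cross_entropy_loss(a2, y, eps=1e-12):
    # Mean binary cross-entropy over m examples
    m = y.shape[1]
    return -np.sum(y * np.log(a2 + eps) + (1 - y) * np.log(1 - a2 + eps)) / m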

Backpropagation Equations

The beauty of backpropagation lies in the chain rule. For the output layer, the chain rule combines \partial J / \partial a^{[2]} with the sigmoid derivative \sigma'(z^{[2]}); for the cross-entropy loss these terms cancel almost entirely, leaving a remarkably simple expression:

\frac{\partial J}{\partial z^{[2]}} = a^{[2]} - y

For the hidden layer:

\frac{\partial J}{\partial z^{[1]}} = (W^{[2]})^T \frac{\partial J}{\partial z^{[2]}} \odot \sigma'(z^{[1]})

Where \odot denotes element-wise multiplication.
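When \sigma is the sigmoid (as in the implementation below), its derivative has a convenient closed form:

\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma'(z) = \sigma(z)\,(1 - \sigma(z))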

Weight and Bias Updates

The gradients for weights and biases are:

\frac{\partial J}{\partial W^{[2]}} = \frac{1}{m} \frac{\partial J}{\partial z^{[2]}} (a^{[1]})^T

\frac{\partial J}{\partial b^{[2]}} = \frac{1}{m} \sum \frac{\partial J}{\partial z^{[2]}}

\frac{\partial J}{\partial W^{[1]}} = \frac{1}{m} \frac{\partial J}{\partial z^{[1]}} x^T

\frac{\partial J}{\partial b^{[1]}} = \frac{1}{m} \sum \frac{\partial J}{\partial z^{[1]}}
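Finally, each parameter is adjusted by a gradient-descent step with learning rate \alpha, which is exactly what the implementation below does:

W^{[l]} \leftarrow W^{[l]} - \alpha \frac{\partial J}{\partial W^{[l]}}, \qquad b^{[l]} \leftarrow b^{[l]} - \alpha \frac{\partial J}{\partial b^{[l]}}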

Implementation

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize weights randomly
        self.W1 = np.random.randn(hidden_size, input_size) * 0.01
        self.b1 = np.zeros((hidden_size, 1))
        self.W2 = np.random.randn(output_size, hidden_size) * 0.01
        self.b2 = np.zeros((output_size, 1))

    def forward(self, X):
        self.z1 = self.W1 @ X + self.b1
        self.a1 = sigmoid(self.z1)
        self.z2 = self.W2 @ self.a1 + self.b2
        self.a2 = sigmoid(self.z2)
        return self.a2

    def backward(self, X, y, learning_rate=0.01):
        m = X.shape[1]

        # Backward propagation
        dz2 = self.a2 - y
        dW2 = (1/m) * dz2 @ self.a1.T
        db2 = (1/m) * np.sum(dz2, axis=1, keepdims=True)
        dz1 = (self.W2.T @ dz2) * sigmoid_derivative(self.z1)
        dW1 = (1/m) * dz1 @ X.T
        db1 = (1/m) * np.sum(dz1, axis=1, keepdims=True)

        # Update parameters
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2
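As a quick usage sketch, the class above could be trained on a toy XOR dataset; the hyperparameters here are illustrative and may need tuning:

# Toy XOR dataset: columns are examples, rows are features
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]])
y = np.array([[0, 1, 1, 0]])

nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1)
for i in range(10000):
    a2 = nn.forward(X)                        # forward pass caches activations
    nn.backward(X, y, learning_rate=1.0)      # backprop and parameter update

print(nn.forward(X))   # outputs should drift toward [[0, 1, 1, 0]] as training progresses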

Key Insights

  1. Chain Rule: Backpropagation is just the chain rule applied systematically
  2. Efficiency: We compute gradients in reverse order, reusing computations
  3. Scalability: The algorithm scales to networks of any depth

Understanding these mathematical foundations is crucial for debugging neural networks and developing intuition about their behavior.