Backpropagation is the Chain Rule

Reverse-mode automatic differentiation on a computational DAG. The chain rule, organized for efficiency.

Every neural network is a long composition of differentiable operations: matrix multiplications, additions, and pointwise nonlinearities like sigmoid or tanh, stacked layer after layer. Training requires the gradient of the final loss with respect to every parameter — millions or billions of partial derivatives. Computing them naively, one parameter at a time, would take roughly one forward pass per parameter, far too expensive at that scale. Backpropagation is the trick that makes it cheap.

The trick is the chain rule applied to a computational graph. A deep network is a directed acyclic graph of operations: each node is one elementary computation, and the edges carry values forward. The chain rule says ∂(f∘g)/∂x = f'(g(x)) · g'(x). Backpropagation extends this to the whole graph by traversing it backward, multiplying local Jacobians along every edge. The forward pass costs one graph traversal; the backward pass costs another. Two passes give you every gradient.

Interactive: Computation Graph & Reverse Pass

A small graph for L = (sigmoid(w·x + b) − y*)². Adjust the inputs, then animate the backward pass: each highlighted edge is labeled with the local derivative being multiplied, and the orange grad badges accumulate the chain-rule product on every parent.
[Interactive readout: output L = (sigmoid(w·x + b) − y*)² = 0.1256 at the default inputs, plus gradients ∂L/∂w, ∂L/∂x, ∂L/∂b, filled in once the backward pass runs]

Forward pass shown in green: each node holds its current value. Hit "Animate backward pass" and watch gradients flow right to left. The label on each highlighted edge is the local derivative being multiplied — the chain rule applied one operation at a time. After the animation finishes, the orange numbers are exactly what gradient descent uses to update the parameters.
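
The same graph is easy to trace in plain Python. Below is a minimal sketch with hand-written forward and backward passes; the input values are illustrative choices, not the widget's defaults, so the numbers will differ from the readout above.

```python
import math

# Illustrative inputs (not the widget's defaults).
w, x, b, y_star = 0.5, 1.0, -0.3, 1.0

# Forward pass: evaluate each node in order, keeping every intermediate value.
z = w * x + b                     # linear node
a = 1.0 / (1.0 + math.exp(-z))    # sigmoid node
e = a - y_star                    # residual node
L = e ** 2                        # loss node

# Backward pass: multiply local derivatives from the loss back to each input.
dL_de = 2 * e                     # d(e^2)/de, seeded by dL/dL = 1
dL_da = dL_de * 1.0               # d(a - y*)/da = 1
dL_dz = dL_da * a * (1 - a)       # sigmoid'(z) = a(1 - a)
dL_dw = dL_dz * x                 # d(w*x + b)/dw = x
dL_dx = dL_dz * w                 # d(w*x + b)/dx = w
dL_db = dL_dz * 1.0               # d(w*x + b)/db = 1

print(f"L = {L:.4f}")
print(f"dL/dw = {dL_dw:.4f}, dL/dx = {dL_dx:.4f}, dL/db = {dL_db:.4f}")
```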

Interactive: The Chain Rule Visualized

Three plots show the inner h(x), the middle g(h(x)), and the outer f(g(h(x))). Slide x and watch the marked points move in lockstep. The product of the three numerical derivatives is the gradient backprop would compute.
[Interactive panels at x = 0.70]
inner h(x) = x²: h(0.70) = 0.490
middle g(u) = tanh(u): g(0.490) = 0.454
outer f(v) = v²: f(0.454) = 0.206
chain rule: df/dx = f'(g(h(x))) · g'(h(x)) · h'(x) = 0.9084 · 0.7937 · 1.4000 ≈ 1.0094
Slide x. The amber dot moves through every plot. Each factor above is the local derivative at that plot's marked point — multiply them to get the gradient that backpropagation would compute for x.
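
The product of the three factors can also be checked numerically against a finite difference. Here is a small sketch mirroring the plots above; the finite-difference step size is an illustrative choice.

```python
import math

h = lambda x: x ** 2            # inner
g = lambda u: math.tanh(u)      # middle
f = lambda v: v ** 2            # outer

def df_dx(x):
    """Chain rule: f'(g(h(x))) * g'(h(x)) * h'(x)."""
    u, v = h(x), g(h(x))
    f_prime = 2 * v                      # d(v^2)/dv
    g_prime = 1 - math.tanh(u) ** 2      # d(tanh u)/du
    h_prime = 2 * x                      # d(x^2)/dx
    return f_prime * g_prime * h_prime

x, eps = 0.70, 1e-6
numeric = (f(g(h(x + eps))) - f(g(h(x - eps)))) / (2 * eps)  # central difference
print(f"chain rule : {df_dx(x):.4f}")    # ~1.0094, matching the widget
print(f"finite diff: {numeric:.4f}")     # agrees to several decimals
```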

Interactive: Training a Tiny Neural Network

A 2-input → 4-hidden-tanh → 1-sigmoid network learning XOR or two-moons. Each weight update is plain gradient descent — exactly the rule from lesson 1 — with backpropagation supplying the gradient via the same DAG traversal you just watched.
[Interactive panels: network diagram (2 inputs → 4 tanh → 1 sigmoid), decision boundary, training loss]
Per epoch we run one pass through the data: forward, backward via autodiff, gradient-descent update on every parameter. That's lesson 1's rule, with backprop computing the gradient.

Each weight in this network has its gradient computed by traversing the same DAG you saw in the previous demos — just bigger. XOR is impossible for a linear classifier; the hidden tanh layer bends space until two straight lines can separate it. Two moons does the same trick with curved data. Try a learning rate that's too large and watch the loss explode; try one that's too small and watch it crawl.
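
For readers who want the loop spelled out, here is a minimal NumPy sketch of the same setup: a 2-input, 4-hidden-tanh, 1-sigmoid network trained on XOR with hand-written backprop and plain gradient descent. The seed, learning rate, epoch count, and initialization are illustrative choices rather than the widget's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: four points, two features, one binary target each.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

# Parameters for 2 -> 4 (tanh) -> 1 (sigmoid).
W1 = rng.normal(0.0, 1.0, (2, 4)); b1 = np.zeros((1, 4))
W2 = rng.normal(0.0, 1.0, (4, 1)); b2 = np.zeros((1, 1))

lr = 0.5  # learning rate (try larger or smaller to reproduce the demo's failure modes)

for epoch in range(5000):
    # Forward pass.
    z1 = X @ W1 + b1
    h = np.tanh(z1)
    z2 = h @ W2 + b2
    y_hat = 1.0 / (1.0 + np.exp(-z2))        # sigmoid output
    loss = np.mean((y_hat - Y) ** 2)         # mean squared error

    # Backward pass: the chain rule, layer by layer.
    d_yhat = 2 * (y_hat - Y) / len(X)        # dL/d(y_hat)
    d_z2 = d_yhat * y_hat * (1 - y_hat)      # through sigmoid'
    d_W2 = h.T @ d_z2
    d_b2 = d_z2.sum(axis=0, keepdims=True)
    d_h = d_z2 @ W2.T
    d_z1 = d_h * (1 - h ** 2)                # through tanh'
    d_W1 = X.T @ d_z1
    d_b1 = d_z1.sum(axis=0, keepdims=True)

    # Gradient descent: theta <- theta - lr * dL/dtheta.
    W1 -= lr * d_W1; b1 -= lr * d_b1
    W2 -= lr * d_W2; b2 -= lr * d_b2

print(f"final loss: {loss:.4f}")
print("predictions:", y_hat.round(2).ravel())  # typically close to [0, 1, 1, 0]
```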

The math objects

  • Computational DAG: a directed acyclic graph where each node is an elementary differentiable operation and edges carry intermediate values. Every neural network is one of these.
  • Chain rule: ∂(f∘g)/∂x = f'(g(x)) · g'(x). For a chain f∘g∘h, the derivative df/dx is just the product f'(g(h(x))) · g'(h(x)) · h'(x). For a graph, we sum products of edge derivatives over all paths from output to input — but we never have to enumerate those paths explicitly.
  • Forward pass: evaluate every node in topological order. Each node stores its data value.
  • Backward pass: seed the output node's gradient as 1. Walk the topological order in reverse. At each node, push (upstream grad × local derivative) onto every parent. After one reverse traversal, every node holds ∂L/∂(its value); see the code sketch after this list.
  • Reverse-mode automatic differentiation: the algorithmic name for backpropagation. It costs a small constant multiple of a single forward pass, no matter how many input parameters the network has; forward-mode, by contrast, would need one pass per input. That asymmetry is why deep learning works at all.
  • Gradient descent — still: backprop just produces the gradient. The actual update θ ← θ − η · ∂L/∂θ is exactly the rule from the Loss Landscapes lesson. Training a neural network is rolling a ball down a (very high-dimensional) loss surface.
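
To make the forward/backward recipe above concrete, here is a minimal sketch of reverse-mode autodiff over scalars. The Value class and its methods are illustrative, not any particular library's API; it builds the DAG during the forward pass, topologically sorts it, seeds the output gradient with 1, and walks the order in reverse.

```python
import math

class Value:
    """A scalar node in a computational DAG, with reverse-mode autodiff."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data                    # value from the forward pass
        self.grad = 0.0                     # d(output)/d(this node), filled by backward()
        self._parents = parents             # nodes this one was computed from
        self._local_grads = local_grads     # d(this)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def sigmoid(self):
        s = 1.0 / (1.0 + math.exp(-self.data))
        return Value(s, (self,), (s * (1.0 - s),))

    def backward(self):
        # Topological order of every node reachable from this one.
        order, seen = [], set()
        def visit(node):
            if node not in seen:
                seen.add(node)
                for p in node._parents:
                    visit(p)
                order.append(node)
        visit(self)

        # Seed the output gradient, then walk the order in reverse,
        # pushing (upstream grad * local derivative) onto every parent.
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += node.grad * local

# Example: out = sigmoid(w*x + b); after backward(), each leaf holds d(out)/d(leaf).
w, x, b = Value(0.5), Value(1.0), Value(-0.3)
out = (w * x + b).sigmoid()
out.backward()
print(out.data, w.grad, x.grad, b.grad)
```

Tensor-based frameworks perform the same traversal, just with tensor-valued nodes and a much larger set of operations.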

Key takeaways

  • A neural network is a directed acyclic graph of differentiable operations.
  • The chain rule says derivatives multiply along a chain — backprop applies it to whole graphs by walking edges in reverse.
  • Two graph traversals (one forward, one backward) give you every parameter's gradient.
  • Reverse-mode autodiff scales because its cost is independent of the number of inputs.
  • Once you have the gradient, training is the same gradient descent from lesson 1.