Reverse-mode automatic differentiation on a computational DAG. The chain rule, organized for efficiency.
Every neural network is a long composition of differentiable operations: matrix multiplications, additions, and pointwise nonlinearities like sigmoid or tanh, stacked layer after layer. Training requires the gradient of the final loss with respect to every parameter: millions or billions of partial derivatives. Computing them naively, perturbing one parameter at a time, would cost a full forward pass per parameter. Backpropagation is the trick that makes it cheap.
The trick is the chain rule applied to a computational graph. A deep network is a directed acyclic graph of operations: each node is one elementary computation, and the edges carry values forward. For a single composition, the chain rule says d(f∘g)/dx = f'(g(x)) · g'(x). Backpropagation extends this to the whole graph by traversing it backward, multiplying local Jacobians along every edge and summing the contributions wherever one value feeds several downstream operations. The forward pass costs one graph traversal; the backward pass costs another. Two passes give you every gradient.
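To make the backward traversal concrete, here is a minimal sketch of a scalar reverse-mode engine in Python. The `Value` class and its method names are illustrative choices, not the demo's actual code: each node records its value, its gradient, and the local derivative toward each parent, and `backward()` does one reverse topological sweep of the DAG.

```python
import math

class Value:
    """One node in the computational DAG: a scalar, its gradient, and the
    local derivatives that push the gradient back to its parents."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # upstream nodes (the DAG's edges)
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # d(a+b)/da = 1 and d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # d(a*b)/da = b and d(a*b)/db = a
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def tanh(self):
        t = math.tanh(self.data)
        # d tanh(x)/dx = 1 - tanh(x)^2
        return Value(t, (self,), (1.0 - t * t,))

    def backward(self):
        # Topologically order the DAG, then sweep it once in reverse,
        # multiplying local derivatives and summing over shared paths.
        order, visited = [], set()

        def visit(node):
            if node not in visited:
                visited.add(node)
                for p in node._parents:
                    visit(p)
                order.append(node)

        visit(self)
        self.grad = 1.0  # d(output)/d(output)
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += node.grad * local  # chain rule, accumulated
```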
Forward pass shown in green: each node holds its current value. Hit "Animate backward pass" and watch gradients flow right to left. The label on each highlighted edge is the local derivative being multiplied — the chain rule applied one operation at a time. After the animation finishes, the orange numbers are exactly what gradient descent uses to update the parameters.
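Assuming the `Value` sketch above, the demo's two passes look like this on a single tanh neuron. The specific input, weight, and bias are made up for illustration; the printed numbers play the role of the green values and orange gradients.

```python
# Forward pass: building the expression records each node's value.
x = Value(0.5)
w = Value(-1.2)
b = Value(0.3)
y = (x * w + b).tanh()   # one neuron: tanh(w*x + b)

# Backward pass: one reverse sweep fills in every gradient.
y.backward()
print(y.data)           # the green number on the output node
print(w.grad, b.grad)   # the orange numbers gradient descent would use
print(x.grad)           # gradients flow all the way back to the input
```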
Every weight in this network gets its gradient from the same backward DAG traversal you saw in the previous demos, just on a bigger graph. XOR is impossible for a linear classifier; the hidden tanh layer bends space until two straight lines can separate it. Two moons does the same trick with curved data. Try a learning rate that's too large and watch the loss explode; try one that's too small and watch it crawl.
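For the full network, here is a hedged NumPy sketch of the same idea: a small net with a tanh hidden layer trained on XOR by hand-written backprop. The hidden size, seed, step count, and learning rate are illustrative choices, not the demo's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets

W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)  # hidden tanh layer
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)  # sigmoid output layer
lr = 0.5  # try something huge to see the loss explode, tiny to see it crawl

for step in range(5000):
    # Forward pass: the same DAG every iteration, values stored at each node.
    H = np.tanh(X @ W1 + b1)               # hidden layer bends the space
    P = 1 / (1 + np.exp(-(H @ W2 + b2)))   # output probabilities
    loss = np.mean((P - Y) ** 2)

    # Backward pass: local derivatives multiplied edge by edge, right to left.
    dP = 2 * (P - Y) / len(X)       # d loss / d P
    dZ2 = dP * P * (1 - P)          # through the sigmoid
    dW2 = H.T @ dZ2; db2 = dZ2.sum(0)
    dH = dZ2 @ W2.T                 # back across the output weights
    dZ1 = dH * (1 - H ** 2)         # through the tanh
    dW1 = X.T @ dZ1; db1 = dZ1.sum(0)

    # Gradient descent on every parameter at once.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(loss)        # typically near zero at a moderate learning rate
print(P.round(2))  # should approach [[0], [1], [1], [0]]
```

The backward block is the network's DAG traversed in reverse: each line multiplies one local derivative and hands the result to the previous node, exactly the edge-by-edge animation above, just written out with matrices.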