Paul Werbos invented backpropagation in 1974.
No one cared.
The algorithm lay buried in his doctoral thesis,
forgotten for over a decade.
In 1986, Rumelhart, Hinton, and Williams
“rediscovered” the same algorithm.
It was published in Nature.
The world changed.
Why did 1974 fail
while 1986 succeeded?
The answer wasn’t the algorithm—it was context.
By 1986, computers were fast enough,
data was plentiful enough,
and most importantly, people were ready.
But the essential question remains.
What exactly is backpropagation?
The Definition of Learning: Reducing Wrongness
Neural network learning is simple.
Predict, fail, correct.
Repeat this process tens of thousands of times.
Show the network an image labeled “cat.”
The network predicts “dog.”
Wrong.
The loss function turns this “wrongness” into a number:
the distance between prediction and truth.
Reducing this distance is all that learning is.
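As a sketch, that number can be computed with a mean squared error loss, one common choice; the toy values here are purely illustrative:

```python
def mse_loss(predictions, targets):
    # Mean of squared distances between each prediction and its target.
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)

# The network outputs 0.2 ("probably dog") when the label is 1.0 ("cat").
loss = mse_loss([0.2], [1.0])
print(loss)  # about 0.64: a large distance, a wrong prediction
```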
The problem is method.
A neural network has millions of weights.
Which weights should be adjusted, and by how much,
to reduce the loss?
You can’t try every combination.
If there are 1 million weights
and each can take 1,000 values,
the possible combinations number 1000^1,000,000.
More than the atoms in the universe.
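That comparison can be checked with exponents alone. The ~10^80 atom count is the usual rough estimate, assumed here:

```python
import math

weights = 1_000_000
values_per_weight = 1_000

# 1000^1,000,000 = 10^(1,000,000 * log10(1000)) = 10^3,000,000.
exponent = weights * math.log10(values_per_weight)
atoms_exponent = 80  # rough estimate: ~10^80 atoms in the observable universe

print(exponent)                    # 3,000,000: the count has three million digits
print(exponent > atoms_exponent)   # True, by a margin beyond imagining
```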
A smarter approach is needed.
Gradient Descent: Navigating a Mountain Blindfolded
Imagine you’re lost on a fog-covered mountain.
You can’t see ahead.
You have no map.
You need to reach the lowest point in the valley.
What would you do?
Feel the slope beneath your feet.
Take one step in the steepest downward direction.
Feel the slope again.
Take another step.
This is gradient descent.
Think of the loss function as a mountain.
The loss changes depending on weight values.
Graph this, and you get a bumpy terrain.
Our goal is to find the lowest point.
That’s where loss is smallest.
Where the optimal weight combination lives.
The gradient tells us “which direction goes uphill.”
Go the opposite way to descend.
Mathematically, the gradient is the vector of partial derivatives.
Differentiate the loss function with respect to each weight,
and you know how much the loss changes when that weight changes.
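A minimal sketch of that loop, on a one-weight "mountain" f(w) = (w - 3)^2 whose lowest point we already know is w = 3. The function and step count are chosen for illustration:

```python
def loss(w):
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of (w - 3)^2 is 2*(w - 3): it points uphill.
    return 2.0 * (w - 3.0)

w = 0.0              # start somewhere on the fog-covered mountain
learning_rate = 0.1  # how big each step is

for _ in range(100):
    w -= learning_rate * gradient(w)  # step in the opposite, downhill direction

print(w)  # very close to 3.0, the bottom of the valley
```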
The Chain Rule: Differentiating Deep Networks
Here’s where problems arise.
As we saw in the structural difference between ML and DL, the core of deep learning is depth.
Multiple layers stacked together.
Input passes through the first layer to the second,
through the second to the third.
Loss is calculated at the very end.
But how do we know the effect of first-layer weights
on the final loss?
The answer is calculus’s chain rule.
The derivative of f(g(x)) is f'(g(x)) × g'(x).
To differentiate a composite function,
multiply the derivatives of each function.
A neural network is a giant composite function.
Layer3(Layer2(Layer1(input)))
Start from the last layer
and multiply each layer’s derivative in sequence,
and you can reach the first layer.
This is backpropagation.
Propagate the error backward.
From output toward input.
Calculate how much each weight contributed to the final error.
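The rule in miniature, with two small functions chosen only for illustration, plus a numerical check:

```python
def g(x): return 3 * x + 1     # inner function
def f(u): return u ** 2        # outer function

def dg(x): return 3            # g'(x)
def df(u): return 2 * u        # f'(u)

x = 2.0
# Chain rule: derivative of f(g(x)) is f'(g(x)) * g'(x).
analytic = df(g(x)) * dg(x)    # 2 * 7 * 3 = 42.0

# Check by nudging x slightly and measuring how f(g(x)) changes.
h = 1e-6
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)

print(analytic)  # 42.0
```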
How Backpropagation Works
Let’s look at the specifics.
Forward Pass
Input data passes through the network.
At each layer, it’s multiplied by weights and passed through activation functions.
A final output emerges.
Loss Calculation
Compare the output to the correct answer.
The loss function converts “how wrong” into a number.
Backward Pass
Start from the loss.
Calculate how the last layer’s weights affected the loss.
Pass that result to the previous layer.
Calculate how that layer’s weights affected the loss.
Repeat until you reach the first layer.
Weight Update
For each weight:
New weight = Current weight – (learning rate × gradient)
The learning rate determines how far to move at once.
Too large, and you overshoot the minimum.
Too small, and you never arrive.
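The four steps above, on the smallest network that still has depth: two weights in a chain, linear layers, squared-error loss. A toy sketch, not a real implementation:

```python
x, target = 1.0, 2.0   # one training example
w1, w2 = 0.5, 0.5      # one weight per layer
learning_rate = 0.1

for _ in range(200):
    # 1. Forward pass
    h = w1 * x           # first layer
    y = w2 * h           # second layer: the prediction

    # 2. Loss calculation
    loss = (y - target) ** 2

    # 3. Backward pass: last layer first, chain rule at each step
    dloss_dy = 2 * (y - target)
    dloss_dw2 = dloss_dy * h      # how much w2 moved the loss
    dloss_dh = dloss_dy * w2      # error handed back to the previous layer
    dloss_dw1 = dloss_dh * x      # how much w1 moved the loss

    # 4. Weight update: new = current - learning rate * gradient
    w1 -= learning_rate * dloss_dw1
    w2 -= learning_rate * dloss_dw2

print(w1 * w2 * x)  # the prediction has converged to the target, 2.0
```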
Why “Back” Propagation
The name matters.
You could calculate each weight’s influence with forward passes alone.
Change one weight slightly, recalculate the entire network,
and see how much the loss changed.
The problem is efficiency.
With 1 million weights,
you’d need 1 million forward passes.
Each forward pass calculates the entire network.
Backpropagation gets all gradients
with just one backward calculation.
One forward pass + one backward pass.
Regardless of how many weights exist.
This efficiency made deep learning possible.
Networks with hundreds of millions of parameters
could now be trained in practical time.
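Here is what the brute-force alternative looks like: a finite-difference gradient that pays one extra forward pass per weight. The three-weight "network" is only a stand-in:

```python
def forward(weights, x):
    # Stand-in for a full network evaluation: one "forward pass".
    out = x
    for w in weights:
        out = w * out
    return out

def loss(weights, x, target):
    return (forward(weights, x) - target) ** 2

def numerical_gradient(weights, x, target, h=1e-6):
    base = loss(weights, x, target)      # one baseline forward pass
    grads = []
    for i in range(len(weights)):        # then one more forward pass per weight
        bumped = list(weights)
        bumped[i] += h
        grads.append((loss(bumped, x, target) - base) / h)
    return grads

# 3 weights -> 4 forward passes; a million weights -> a million and one.
print(numerical_gradient([0.5, 0.8, 1.2], 1.0, 2.0))
```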
When Learning Fails
Backpropagation doesn’t always succeed.
Vanishing Gradient
As networks deepen,
the gradient propagated backward shrinks.
In the chain rule, multiplying values less than 1 repeatedly
converges the result toward zero.
Early layers barely get updated.
Learning stalls.
Exploding Gradient
The opposite phenomenon.
Gradients grow exponentially.
Weights diverge.
The network breaks.
Local Minimum
While descending the mountain, you fall into a small pit.
Every direction slopes upward.
But the true valley lies deeper below.
Gradient descent can’t distinguish this pit
from the global minimum.
Various techniques have been developed to solve these problems.
ReLU activation functions, Batch Normalization,
adaptive optimizers like Adam.
But the core principle remains unchanged.
Follow the gradient downward, propagate errors backward.
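Vanishing and exploding gradients are the same arithmetic with different per-layer factors. A toy demonstration; the factors 0.5 and 1.5 are illustrative:

```python
depth = 50  # a 50-layer network

vanishing = 1.0
exploding = 1.0
for _ in range(depth):
    vanishing *= 0.5  # every layer's derivative below 1: the product shrinks
    exploding *= 1.5  # every layer's derivative above 1: the product blows up

print(vanishing)  # ~9e-16: the first layers see essentially no gradient
print(exploding)  # ~6e8: updates this large tear the weights apart
```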
The Essence of Learning
Neural network learning in one sentence:
Correct each weight in proportion to the error and to its responsibility for it.
The loss function measures “how wrong.”
Backpropagation calculates “who’s responsible and by how much.”
Gradient descent determines “how much to correct.”
These three combine
so neural networks can extract patterns from data.
Discover features humans never taught.
Werbos in 1974 and Hinton in 1986
used the same math.
What changed was scale.
More data, faster computers, deeper networks.
Backpropagation was the only algorithm
that could handle that scale.

