Hi Pythonistas, Happy New Year!
We’ve already seen how backpropagation teaches a neural network. But when networks get very deep, something sneaky starts happening:
Gradients become too small, or they grow way too large.
Either way, learning breaks.
Let’s unpack this.
What Are Gradients, Again?
Gradients are just signals.
They tell each weight: “change a little” or “change a lot.”
In simple terms:
Small gradient → tiny update
Large gradient → big update
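A tiny sketch of what those signals do to a weight. The learning rate and gradient values below are made up purely for illustration:

```python
# Gradient descent update: new_weight = weight - learning_rate * gradient
learning_rate = 0.1
weight = 0.5

small_gradient = 0.001   # tiny signal -> tiny update
large_gradient = 50.0    # huge signal -> huge update

print(weight - learning_rate * small_gradient)  # 0.4999 (barely moves)
print(weight - learning_rate * large_gradient)  # -4.5   (jumps wildly)
```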
Sounds fine until we send these signals back through many layers.
That’s where the trouble starts.
Vanishing Gradients
In very deep networks, the chain rule multiplies gradients layer by layer as they flow backward. This is especially bad with sigmoid and tanh, whose derivatives are small over most of their range.
Each step makes them smaller.
By the time they reach the early layers? Almost zero
Result:
- Early layers stop learning
- Model trains painfully slowly
- Looks “stuck”
Imagine whispering instructions through 50 people.
By the time it reaches the first person, it’s complete silence.
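Here’s that whisper effect in plain numbers. Sigmoid’s derivative never exceeds 0.25, so even in the best case, 50 layers of it crush the gradient:

```python
# Backprop through 50 sigmoid layers multiplies the gradient
# by at most 0.25 per layer (sigmoid's maximum derivative).
gradient = 1.0
for layer in range(50):
    gradient *= 0.25

print(gradient)   # ~7.9e-31 -- effectively silence
```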
Exploding Gradients
The opposite problem.
Here:
- Repeated multiplications make gradients huge
- Values grow exponentially
Result:
- Wild weight updates
- Training becomes unstable
- Loss jumps all over the place
Like shouting instructions through 50 people, each one louder than the last. By the end, it’s not a message, it’s pure noise.
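The same sketch in reverse: if each layer scales the gradient by a factor just above 1 (1.5 here, chosen arbitrarily), it blows up fast.

```python
# If each of 50 layers scales the gradient by ~1.5, it explodes.
gradient = 1.0
for layer in range(50):
    gradient *= 1.5

print(gradient)   # ~6.4e8 -- the shouting effect
```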
How Do We Fix This?
Thankfully, deep learning has learned from its mistakes.
Better Activations
Old: Sigmoid, Tanh
New: ReLU, Leaky ReLU, GELU
These don’t squash values too aggressively, so gradients survive longer.
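A quick way to see this with PyTorch’s autograd (assuming you have torch installed): for a largish input, sigmoid’s gradient is nearly zero, while ReLU’s gradient is exactly 1.

```python
import torch

x = torch.tensor(5.0, requires_grad=True)

# Sigmoid saturates: its gradient at x = 5 is about 0.0066
torch.sigmoid(x).backward()
print(x.grad)        # tensor(0.0066)

x.grad = None        # reset before the next backward pass

# ReLU passes the gradient straight through for positive inputs
torch.relu(x).backward()
print(x.grad)        # tensor(1.)
```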
Smart Weight Initialization
Instead of random chaos:
- Xavier initialization
- He initialization
They set weights carefully at the start, so gradients stay in a healthy range.
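In PyTorch these are one-liners. The 256-unit layers below are just placeholders:

```python
import torch.nn as nn

tanh_layer = nn.Linear(256, 256)
relu_layer = nn.Linear(256, 256)

# Xavier (Glorot): scales weights to keep variance stable for tanh/sigmoid
nn.init.xavier_uniform_(tanh_layer.weight)

# He (Kaiming): accounts for ReLU zeroing out roughly half its inputs
nn.init.kaiming_uniform_(relu_layer.weight, nonlinearity="relu")
```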
Normalization
You’ve seen this already:
- BatchNorm
- LayerNorm
They stabilize activations layer by layer, making gradient flow much smoother.
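A minimal PyTorch sketch of both, on a made-up batch of activations:

```python
import torch
import torch.nn as nn

x = torch.randn(32, 128)   # batch of 32 samples, 128 features each

# BatchNorm normalizes each feature across the batch
batch_norm = nn.BatchNorm1d(128)

# LayerNorm normalizes each sample across its features
layer_norm = nn.LayerNorm(128)

print(batch_norm(x).mean().item())  # roughly 0 after normalization
print(layer_norm(x).mean().item())  # roughly 0 after normalization
```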
Gradient Clipping
Mostly for exploding gradients.
Set a maximum threshold
If gradients exceed it → clip them
Like a speed limit on a highway.
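In PyTorch it’s a single call between backward() and the optimizer step. The tiny model and data below are stand-ins:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 1)                 # stand-in for any network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x, y = torch.randn(8, 10), torch.randn(8, 1)

loss = F.mse_loss(model(x), y)
loss.backward()

# The speed limit: rescale gradients if their total norm exceeds 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
optimizer.zero_grad()
```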
Skip Connections (ResNets)
Add shortcuts:
- Gradients can flow directly across layers
- No need to pass through every single transformation
This idea made very deep networks practical.
One of the biggest breakthroughs in modern deep learning.
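Here’s a minimal sketch of a residual block using plain Linear layers (real ResNets use convolutions, but the `+ x` shortcut is the key idea):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        # The shortcut: gradients always have a direct path back through "+ x"
        return self.layers(x) + x

block = ResidualBlock(64)
print(block(torch.randn(4, 64)).shape)   # torch.Size([4, 64])
```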
What I Learned This Week
Vanishing gradients → network stops learning
Exploding gradients → network becomes unstable
Fixes:
- Better activations
- Smart initialization
- Normalization
- Gradient clipping
- Skip connections
What's next
Next week, we will learn about Fully Connected (Dense) Networks.