Hi Pythonistas!
Last time, we tackled overfitting: models memorizing instead of learning.
This week, let’s talk about another silent troublemaker during training.
Sometimes neurons get too excited. Outputs explode or completely vanish. Gradients go wild.
Training becomes unstable or painfully slow.
The solution? Normalization.
Why Do We Need Normalization?
Inside a neural network, every layer depends on the output of the previous one.
Now imagine this:
- Some inputs are huge
- Some are tiny
- Some neurons dominate
- Others barely get a chance to learn
Result?
- Slow convergence
- Unstable training
- Hard-to-tune learning rates
Normalization keeps everything on a similar scale, so learning stays smooth and predictable.
Think of it as:
"Everyone speaks at the same volume, so the model can actually listen."
Batch Normalization (BatchNorm)
Introduced in 2015, and honestly, it changed deep learning forever.
What it does:
- Normalizes activations within each mini-batch
- Then applies a learnable scale and shift, so the model doesn’t lose flexibility
Why it’s awesome:
- Faster training
- Acts like a mild regularizer (helps a bit with overfitting)
- Allows higher learning rates
BatchNorm = the strict class monitor
Keeping every batch disciplined and under control.
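Here’s a minimal PyTorch sketch (layer sizes are arbitrary, and Linear → BatchNorm → ReLU is just one common pattern) showing where BatchNorm sits and the learnable scale and shift it carries:

```python
import torch
import torch.nn as nn

# A tiny MLP block: Linear -> BatchNorm -> ReLU (sizes are arbitrary)
block = nn.Sequential(
    nn.Linear(16, 32),
    nn.BatchNorm1d(32),   # normalizes each of the 32 features over the mini-batch
    nn.ReLU(),
)

x = torch.randn(8, 16)    # a mini-batch of 8 samples
out = block(x)            # in training mode, uses this batch's mean and variance

bn = block[1]
print(bn.weight.shape, bn.bias.shape)   # learnable scale (gamma) and shift (beta): [32] each

# At inference time, switch to eval() so BatchNorm uses its running statistics
block.eval()
out_single = block(torch.randn(1, 16))  # fine with a single sample in eval mode
```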
Layer Normalization (LayerNorm)
BatchNorm depends on batch statistics, which becomes a problem when batches are small or their contents vary a lot.
LayerNorm takes a different approach:
- Normalizes across the features of a single sample
- Doesn’t care about batch size
Why it shines:
- Works even with batch size = 1
- Very stable
- Perfect for sequences
That’s why it’s everywhere in:
- RNNs
- NLP models
- Transformers
LayerNorm = a personal coach
Focused on each individual sample, not the whole class.
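A minimal PyTorch sketch (the feature size and sequence length are made up) showing that LayerNorm normalizes each sample over its own features, so the batch size truly doesn’t matter:

```python
import torch
import torch.nn as nn

d_model = 64                 # feature size, like the hidden size inside a Transformer layer
ln = nn.LayerNorm(d_model)

# Works the same whether the batch has 32 sequences or just 1
x_big   = torch.randn(32, 10, d_model)   # (batch, seq_len, features)
x_small = torch.randn(1, 10, d_model)

y_big, y_small = ln(x_big), ln(x_small)

# Each token's feature vector now has roughly zero mean and unit variance
print(y_big.mean(dim=-1).abs().max())            # close to 0
print(y_big.std(dim=-1, unbiased=False).mean())  # close to 1
```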
Other Normalization Variants
There’s more than just BatchNorm and LayerNorm:
- Instance Normalization: normalizes each sample and each channel separately. Common in style transfer models.
- Group Normalization: channels are split into groups, and normalization happens inside each group. Useful when batch sizes are small.
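For completeness, here’s what those two look like in PyTorch (channel and group counts are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 16, 32, 32)   # (batch, channels, height, width)

# InstanceNorm: normalizes each channel of each sample on its own
inst = nn.InstanceNorm2d(16)
y_inst = inst(x)

# GroupNorm: 16 channels split into 4 groups, normalized within each group
gn = nn.GroupNorm(num_groups=4, num_channels=16)
y_gn = gn(x)

# GroupNorm ignores the batch dimension, so it behaves the same with batch size 1
y_single = gn(x[:1])
print(y_inst.shape, y_gn.shape, y_single.shape)
```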
When to Use What?
Quick rule of thumb:
- CNNs / Vision models → BatchNorm
- RNNs / NLP / Transformers → LayerNorm
- Small batch sizes → LayerNorm or GroupNorm
What I Learned This Week
- Normalization keeps activations stable
- BatchNorm = per batch → great for vision
- LayerNorm = per sample → great for sequences
Calm neurons learn better, and calm training saves a lot of debugging time.
What’s Coming Next
Next week, we will learn about Vanishing & Exploding Gradients.