Hi Pythonistas!
Last time we met the driver of learning: the optimizer. Today we are looking at the speed of training, not the kind measured in km/h, but the quiet internal pacing of a neural network: how it chooses to learn, how fast it adapts, how boldly it updates its beliefs.
Every driver faces the same timeless question: How fast should I go?
What Exactly Is the Learning Rate?
Every time a model learns something, it tweaks its weights. Not dramatically, but in tiny, carefully calculated steps.
The learning rate decides: How big should those steps be? A deceptively simple decision that shapes everything.
Set it too small → your model crawls through training like it’s dragging a suitcase through beach sand.
Set it too big → your updates bounce around like a motorcycle with badly tuned suspension.
Set it just right → smooth, steady learning that actually converges.
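In code, that "tiny step" is nothing mysterious. Here is a minimal sketch of a single weight update under plain gradient descent (the numbers are made up purely for illustration):

```python
# One weight update under plain gradient descent (illustrative numbers only).
weight = 2.0          # current value of one weight
gradient = 0.8        # gradient of the loss with respect to that weight
learning_rate = 0.01  # the step size this whole post is about

weight = weight - learning_rate * gradient  # take a tiny, carefully calculated step
print(weight)  # 1.992
```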
The Goldilocks Zone of Learning
Deep learning has its own version of “not too hot, not too cold”:
Too Low: The Snail Mode
- Everything is stable
- Everything is safe
- Everything takes forever.
Sometimes the model gets stuck in a tiny dip (a local minimum) and assumes, "Ah yes, this must be the answer," when in reality the real valley is far ahead.
Too High: The Chaos Mode
This is the caffeinated mode. Loss shoots up, drops down, spins around but never settles. You’re learning fast, but not learning well.
Just Right: The Balanced Mode
This is the sweet spot. Stable, efficient, and converges like a well-behaved student.
The Goldilocks LR is what every training run hopes for.
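To see all three modes without any framework at all, here is a toy sketch of my own (not from any library): gradient descent on f(x) = x² with three different learning rates.

```python
# Toy illustration: gradient descent on f(x) = x**2, whose minimum is at x = 0.
def descend(lr, steps=20, x=5.0):
    for _ in range(steps):
        grad = 2 * x       # derivative of x**2
        x = x - lr * grad  # the update the learning rate controls
    return x

print(descend(0.001))  # too low  -> barely moves away from 5.0 (snail mode)
print(descend(1.1))    # too high -> overshoots and blows up (chaos mode)
print(descend(0.1))    # just right -> lands close to 0 (balanced mode)
```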
Learning Rate Schedulers: Smarter Speed Control
Using a fixed learning rate for the entire training run is like driving at the same speed on an empty highway and in crowded city traffic.
Not ideal.
Modern deep learning adjusts its speed depending on what phase it’s in.
This is where Learning Rate Schedulers shine.
Let’s walk through the ones I explored this week:
1. Step Decay
Reduce the learning rate after every few epochs. Like slowing down as you approach civilization after a long highway ride.
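In PyTorch (one common choice; the model and optimizer below are just placeholders), a minimal sketch looks like this:

```python
from torch import nn, optim

model = nn.Linear(10, 1)                           # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Halve the learning rate every 10 epochs.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
```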
2. Exponential Decay
The learning rate shrinks a little after every epoch. Smooth, gentle, predictable, like dimming a light gradually.
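A sketch of the same idea in PyTorch (again with a placeholder model):

```python
from torch import nn, optim

model = nn.Linear(10, 1)                           # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Multiply the learning rate by 0.95 at the end of every epoch.
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
```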
3. Cyclical Learning Rate
This one is great. Instead of always decreasing, the LR goes up and down in cycles. Why? Because sometimes increasing speed helps the model escape shallow traps.
It’s like jogging: Sprint → recover → sprint → recover.
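A possible PyTorch sketch (the base_lr, max_lr, and step_size_up values are example numbers, not recommendations):

```python
from torch import nn, optim

model = nn.Linear(10, 1)                                          # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# LR climbs from base_lr up to max_lr and back down, cycle after cycle.
scheduler = optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-4, max_lr=1e-2,
    step_size_up=2000, mode="triangular",
)
# Unlike the epoch-based schedulers above, CyclicLR is usually stepped once per batch.
```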
4. Cosine Annealing
Imagine your learning rate following a beautiful cosine wave. Starts high, dips slowly, with a graceful curve.
This technique has become a favorite in modern deep learning, especially in vision and transformers.
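A minimal PyTorch sketch, with T_max and eta_min chosen only for illustration:

```python
from torch import nn, optim

model = nn.Linear(10, 1)                           # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.1)

# LR follows a cosine curve from 0.1 down to eta_min over T_max epochs.
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50, eta_min=1e-5)
```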
5. Warm Restarts
Just when the LR becomes tiny: reset. A fresh burst of learning energy. This helps the optimizer explore new directions instead of settling too early.
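PyTorch combines this idea with cosine annealing; here is a sketch with illustrative cycle lengths:

```python
from torch import nn, optim

model = nn.Linear(10, 1)                           # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Cosine decay for T_0 epochs, then reset to the initial LR and start a longer cycle (T_mult=2).
scheduler = optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2, eta_min=1e-5
)
```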
How Do You Choose the Right One?
After enough failed experiments (and late-night debugging), some patterns become clear:
Small / beginner projects → Fixed LR works fine
Mid-sized or deeper networks → Step Decay or Exponential Decay gives great stability
Transformers / Vision Models / SOTA architectures → Cosine Annealing or Cyclical LR almost always performs better
Think of it as choosing the right rhythm for the right kind of dance.
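Whichever scheduler you pick, the wiring is roughly the same. Here is a sketch of where scheduler.step() fits in a PyTorch training loop (the model, data, and hyperparameters are all placeholders):

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(10, 1)                           # placeholder model
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

# Random placeholder data just to make the loop runnable.
loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=16)

for epoch in range(30):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()  # epoch-based schedulers step here; CyclicLR would step inside the batch loop
```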
What I Have Learned
Learning Rate = How fast your model learns
Too high → noisy and unstable
Too low → slow and stuck
LR schedulers → smarter, adaptive learning
Choose a scheduler based on model size and complexity
Next Week
Next week we will learn about Backpropagation.