From Basics to Bots: My Weekly AI Engineering Adventure-20

Vanishing & Exploding Gradients: When Learning Breaks

Posted by Afsal on 02-Jan-2026

Hi Pythonistas, Happy New Year!

We’ve already seen how backpropagation teaches a neural network. But when networks get very deep, something sneaky starts happening:

Gradients become too small, or they grow way too large.

Either way, learning breaks.

Let’s unpack this.

What Are Gradients, Again?

Gradients are just signals.

They tell each weight: “change a little” or “change a lot.”

In simple terms:

Small gradient → tiny update

Large gradient → big update
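
Here is that rule as a tiny sketch in plain Python (the learning rate and the gradient values below are just made-up numbers for illustration):

```python
# Gradient descent in one line: the gradient's size decides how far the weight moves.
learning_rate = 0.1

def update(weight, gradient):
    return weight - learning_rate * gradient

print(update(0.5, 0.001))  # small gradient -> the weight barely moves
print(update(0.5, 2.0))    # large gradient -> the weight moves a lot
```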

Sounds fine until we send these signals back through many layers.

That’s where the trouble starts.

Vanishing Gradients

In very deep networks, gradients get multiplied again and again as they travel backwards. This is especially bad with sigmoid and tanh, whose derivatives are at most 0.25 and 1 respectively, and usually much smaller.

Each step makes them smaller.

By the time they reach the early layers? Almost zero.

Result:

  • Early layers stop learning
  • The model trains painfully slowly
  • Training looks “stuck”

Imagine whispering instructions through 50 people.
By the time it reaches the first person, it’s complete silence.
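
Here is a small sketch of the effect in PyTorch (assuming you have torch installed; the depth of 20 and the width of 32 are arbitrary choices). It stacks sigmoid layers and compares the gradient size at the last layer with the first:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 20 blocks of Linear -> Sigmoid, stacked into one deep network.
layers = []
for _ in range(20):
    layers += [nn.Linear(32, 32), nn.Sigmoid()]
model = nn.Sequential(*layers)

x = torch.randn(8, 32)
model(x).sum().backward()   # a dummy "loss", just to get gradients flowing

linear_layers = [m for m in model if isinstance(m, nn.Linear)]
print("last layer grad norm :", linear_layers[-1].weight.grad.norm().item())
print("first layer grad norm:", linear_layers[0].weight.grad.norm().item())
```

The first layer’s gradient should come out many orders of magnitude smaller than the last layer’s.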

Exploding Gradients

The opposite problem.

Here, repeated multiplications make the gradients huge.

The values grow exponentially.

Result:

  • Wild weight updates
  • Training becomes unstable
  • Loss jumps all over the place

Like shouting instructions through 50 people. By the end, it’s not a message anymore, just pure noise.
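
And a matching sketch of the explosion (again PyTorch, with arbitrary sizes): the same depth, but with deliberately oversized weights and no squashing activation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 20 plain linear layers, nothing to squash the values in between.
model = nn.Sequential(*[nn.Linear(32, 32, bias=False) for _ in range(20)])

# std=1.0 is far too big for 32-unit layers; healthy inits use roughly 1/sqrt(32).
for layer in model:
    nn.init.normal_(layer.weight, std=1.0)

x = torch.randn(8, 32)
model(x).sum().backward()

print("first layer grad norm:", model[0].weight.grad.norm().item())  # astronomically large
```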

How Do We Fix This?

Thankfully, deep learning has learned from its mistakes.

Better Activations

Old: Sigmoid, Tanh

New: ReLU, Leaky ReLU, GELU

These don’t squash values too aggressively, so gradients survive longer.
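
A quick way to see why, in PyTorch: compare how much each activation shrinks a gradient that passes through it.

```python
import torch

x = torch.linspace(-5, 5, 11, requires_grad=True)

# Sigmoid's derivative is at most 0.25, so every sigmoid layer shrinks the gradient.
torch.sigmoid(x).sum().backward()
print("max gradient through sigmoid:", x.grad.max().item())   # 0.25

x.grad = None
# ReLU's derivative is exactly 1 for positive inputs, so the gradient passes through intact.
torch.relu(x).sum().backward()
print("max gradient through relu   :", x.grad.max().item())   # 1.0
```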

Smart Weight Initialization

Instead of random chaos:

  • Xavier initialization
  • He initialization

They set weights carefully at the start, so gradients stay in a healthy range.
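
A minimal sketch of doing this by hand in PyTorch (torch assumed; the 256-unit layers are arbitrary):

```python
import torch.nn as nn

layer_for_tanh = nn.Linear(256, 256)
nn.init.xavier_uniform_(layer_for_tanh.weight)   # Xavier: pairs well with sigmoid/tanh

layer_for_relu = nn.Linear(256, 256)
nn.init.kaiming_normal_(layer_for_relu.weight, nonlinearity="relu")   # He: pairs well with ReLU

print(layer_for_tanh.weight.std().item())   # small, scaled to the layer size
print(layer_for_relu.weight.std().item())
```

In practice PyTorch layers already come with reasonable defaults, but it helps to know what is happening under the hood.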

Normalization

You’ve seen this already:

  • BatchNorm
  • LayerNorm

They stabilize activations layer by layer, making gradient flow much smoother.
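
A minimal sketch of where those layers sit in a PyTorch model (the sizes are arbitrary):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 64),
    nn.BatchNorm1d(64),   # normalizes each feature across the batch
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.LayerNorm(64),     # normalizes each sample across its features
    nn.ReLU(),
)

out = model(torch.randn(16, 64))
print(out.shape)   # torch.Size([16, 64])
```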

Gradient Clipping 

Mostly for exploding gradients.

Set a maximum threshold

If gradients exceed it → clip them

Like a speed limit on a highway.
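
In PyTorch this is a one-liner between backward() and the optimizer step (the model, data, and the max_norm of 1.0 below are just placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = F.mse_loss(model(x), y)
loss.backward()

# The "speed limit": cap the total gradient norm at 1.0 before updating the weights.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```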

Skip Connections (ResNets)

Add shortcuts:

Gradients can flow directly across layers

No need to pass through every single transformation

This idea made very deep networks practical.

One of the biggest breakthroughs in modern deep learning.
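
Here is a minimal residual block sketched in PyTorch (the two-layer body and the 64-unit width are arbitrary choices):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        # The shortcut: the input is added straight back onto the output,
        # so gradients can flow around the body instead of only through it.
        return x + self.body(x)

block = ResidualBlock(64)
print(block(torch.randn(4, 64)).shape)   # torch.Size([4, 64])
```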

What I Learned This Week

Vanishing gradients → network stops learning

Exploding gradients → network becomes unstable

Fixes:

  • Better activations
  • Smart initialization
  • Normalization
  • Gradient clipping
  • Skip connections

What's Next?

Next week, we will learn about Fully Connected Networks (Dense).