From Basics to Bots: My Weekly AI Engineering Adventure-15

Optimizers - The Drivers of Learning

Posted by Afsal on 30-Sep-2025

Hi Pythonistas!

So far, we’ve met the layers (the structure), activation functions (the spark), and loss functions (the teacher).
But here’s the big question:

Once the loss tells us how wrong we are, who actually fixes the network’s weights?

Enter the Optimizer, the driver that moves the network toward better performance.

What is an Optimizer?

An optimizer decides how to adjust the weights of the neural network so that the loss gets smaller.
It’s like a GPS navigator:

Loss function = the map of where we are vs where we want to be.

Optimizer = the car driver choosing the route to reach the destination.
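To see where the optimizer fits, here is a minimal PyTorch sketch of a training loop. The tiny model, random data, and loss function are just placeholders so the snippet runs on its own; they are not a real training setup.

```python
import torch

# Placeholder model, data, and loss - stand-ins for whatever you are training.
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
data_loader = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(5)]

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for inputs, targets in data_loader:
    optimizer.zero_grad()                    # clear gradients from the previous step
    loss = loss_fn(model(inputs), targets)   # the loss says how wrong we are
    loss.backward()                          # compute gradients
    optimizer.step()                         # the optimizer fixes the weights
```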

Common Optimizers

1. Gradient Descent (the simplest idea)

Update weights in the direction that reduces loss.

Very slow if the dataset is huge.
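The update rule itself is tiny. Here is a minimal sketch of plain gradient descent on a made-up one-parameter loss, f(w) = (w - 3)^2:

```python
# Toy loss: f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = 0.0      # starting weight
lr = 0.1     # learning rate (step size)

for step in range(50):
    grad = 2 * (w - 3)    # gradient of the loss with respect to w
    w = w - lr * grad     # step in the direction that reduces the loss

print(round(w, 4))        # ~3.0, the minimum of the toy loss
```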

2. Stochastic Gradient Descent (SGD)

Instead of using the whole dataset at once, it updates the weights using small batches of data.

Much faster, and works well in practice.

Often combined with momentum to avoid zig-zagging.
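Here is a rough sketch of the same idea on a made-up linear-regression problem: each update only looks at a small random batch, so one pass over the data gives many cheap updates. All the numbers below are illustrative.

```python
import numpy as np

# Toy data: 1000 samples, 3 features, with known true weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
lr, batch_size = 0.1, 32

for epoch in range(20):
    idx = rng.permutation(len(X))                 # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)  # gradient on the batch only
        w -= lr * grad                             # same update rule, cheaper gradient

print(np.round(w, 2))                              # close to [2.0, -1.0, 0.5]
```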

3. Momentum

Think of a ball rolling downhill: it builds speed and doesn’t get stuck in tiny bumps.

Helps escape local minima and speeds up training.
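A minimal sketch of the momentum update on the same toy loss as above; the velocity term accumulates past gradients, which is what carries the "ball" through small bumps. The values of lr and beta are illustrative.

```python
# Same toy loss as before: f(w) = (w - 3)^2.
w, velocity = 0.0, 0.0
lr, beta = 0.1, 0.9       # beta controls how much past velocity is kept

for step in range(200):
    grad = 2 * (w - 3)
    velocity = beta * velocity + grad   # build up speed from past gradients
    w = w - lr * velocity               # step using the accumulated velocity

print(round(w, 3))        # ~3.0
```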

4. RMSProp

Adapts the learning rate for each parameter using a running average of its recent squared gradients.

Works well for recurrent networks.
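Here is a rough sketch of the RMSProp update on a made-up two-parameter problem where the parameters see very different gradient sizes; each one effectively gets its own step size.

```python
import numpy as np

# Toy problem: both parameters should reach 3.0, but one sees gradients
# roughly 100x larger than the other.
w = np.array([0.0, 0.0])
target = np.array([3.0, 3.0])
scales = np.array([10.0, 0.1])

lr, rho, eps = 0.01, 0.9, 1e-8
avg_sq = np.zeros(2)                                  # running average of squared gradients

for step in range(500):
    grad = 2 * scales * (w - target)
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2     # track recent gradient magnitude
    w = w - lr * grad / (np.sqrt(avg_sq) + eps)       # big gradients scaled down, tiny ones scaled up

print(np.round(w, 2))     # both parameters end up close to 3.0
```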

5. Adam (Adaptive Moment Estimation)

The superstar optimizer.

Combines momentum + RMSProp.

Fast, efficient, works in most situations.

Default choice in many frameworks.
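In PyTorch, using Adam is a one-line change. This minimal sketch uses a small made-up model and random data just to show the call and its usual hyperparameters.

```python
import torch

# Hypothetical model and data, just to demonstrate the optimizer call.
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))

# betas are the two decay rates Adam combines: one momentum-style, one RMSProp-style.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

x, y = torch.randn(64, 10), torch.randn(64, 1)
loss = torch.nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```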

6. AdamW

A variant of Adam with decoupled weight decay, which gives better regularization.

Common in modern architectures like Transformers.
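Using AdamW looks almost identical; the difference is that weight decay is applied directly to the weights instead of being mixed into the gradient update. The weight_decay value here is a common starting point, not a tuned recommendation.

```python
import torch

# Hypothetical model, just to show the constructor call.
model = torch.nn.Linear(10, 1)

# Decoupled weight decay: decay is applied to the weights themselves,
# separately from the gradient-based update.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```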

7. Others

Adagrad: adapts learning rates but slows down over time.

Nadam: Adam + Nesterov momentum.

LAMB / Lion: newer optimizers for very large models.

How to Choose?

Small/simple problems → SGD (with momentum).

General use case → Adam (safe default).

Very deep networks / Transformers → AdamW.

Recurrent networks → RMSProp.

Massive models → consider advanced ones like LAMB.
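Trying a few of these is cheap, because swapping the optimizer is usually a one-line change. Here is a rough PyTorch sketch; the learning rates and other values are illustrative, not recommendations.

```python
import torch

model = torch.nn.Linear(10, 1)   # hypothetical model

optimizers = {
    "sgd_momentum": torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9),
    "adam":         torch.optim.Adam(model.parameters(), lr=1e-3),
    "adamw":        torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01),
    "rmsprop":      torch.optim.RMSprop(model.parameters(), lr=1e-3),
}

optimizer = optimizers["adamw"]  # pick one based on the guidelines above
```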

What I have learned

Optimizers are the drivers of training.
They decide how fast, how smooth, and how safely your network moves toward better performance.

Start with SGD or Adam.

Upgrade to AdamW for modern deep learning.

Experiment when you need that extra edge.

What's next

Next week, we’ll dive into the Learning Rate, the speed of our driver.

Too fast and we overshoot; too slow and we never reach the destination.