Hi Pythonistas!
So far, we know:
How text is generated
How Transformers work
Now the obvious question:
Where does all this knowledge come from?
Answer: Training. And a lot of mistakes.
Training Is Not Teaching
The model is not taught facts. No one explains grammar.
Instead:
- The model guesses the next token
- We check if it’s wrong
- We correct it slightly
- Repeat this billions of times.
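The guess → check → correct loop can be sketched with a toy one-parameter "model" (hypothetical numbers; real LLMs repeat this over billions of tokens and parameters):

```python
# Toy sketch of the training loop: guess, check, correct slightly, repeat.
# A single-parameter model standing in for billions of parameters.

def train(pairs, lr=0.1, steps=200):
    w = 0.0  # the model's only parameter
    for _ in range(steps):
        for x, target in pairs:
            guess = w * x              # 1. the model guesses
            error = guess - target     # 2. we check how wrong it is
            w -= lr * error * x        # 3. we correct it slightly
    return w

# Learn that the "next token" is roughly 2x the input.
w = train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
print(round(w, 3))  # converges close to 2.0
```

No one "explains" the rule 2x to the model; it emerges from thousands of tiny corrections.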
The Training Data
At its core, training data is just text.
- Books
- Articles
- Code
- Conversations
- No labels
- No explanations
The Loss Function - Measuring "How Wrong"
Every prediction is scored.
If the correct next token had:
- High probability → small loss
- Low probability → big loss
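This is exactly what cross-entropy loss does: take the negative log of the probability the model assigned to the correct next token. A minimal sketch:

```python
import math

# Cross-entropy loss for one prediction:
# -log(probability the model assigned to the correct next token)
def token_loss(prob_of_correct_token):
    return -math.log(prob_of_correct_token)

print(round(token_loss(0.9), 3))   # high probability -> small loss (0.105)
print(round(token_loss(0.01), 3))  # low probability  -> big loss  (4.605)
```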
Backpropagation - Blame Goes Backward
Once loss is calculated:
- Gradients flow backward
- Every parameter gets a tiny correction
- Important idea: Later layers (closest to the output) receive their gradients first
- Earlier layers adjust indirectly, via the chain rule
This is how the entire network learns.
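Here is a hedged sketch of that backward flow through two stacked layers, applying the chain rule by hand (real frameworks like PyTorch automate this):

```python
# Backpropagation by hand through two "layers" (each just a multiply).
# Blame flows from the output backward toward the input.

def forward_backward(x, target, w1, w2):
    # forward pass
    h = w1 * x                 # earlier layer
    y = w2 * h                 # later layer (closer to the output)
    loss = (y - target) ** 2

    # backward pass
    dloss_dy = 2 * (y - target)
    dloss_dw2 = dloss_dy * h   # later layer gets its gradient first...
    dloss_dh = dloss_dy * w2   # ...then blame passes backward...
    dloss_dw1 = dloss_dh * x   # ...and the earlier layer adjusts via the chain rule
    return loss, dloss_dw1, dloss_dw2

loss, g1, g2 = forward_backward(x=1.0, target=4.0, w1=1.0, w2=2.0)
print(loss, g1, g2)  # 4.0 -8.0 -4.0
```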
Gradient Descent - Tiny Steps, Huge Journey
Weights are updated using gradient descent.
Too big a step → training explodes
Too small → training crawls
Learning rate controls this balance.
Training is millions of tiny nudges.
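The explode-vs-crawl trade-off is easy to see on a toy function like f(w) = w², whose gradient is 2w (illustrative numbers only, not real training):

```python
# Gradient descent on f(w) = w^2. The learning rate decides whether
# the weight shrinks toward the minimum, diverges, or barely moves.

def descend(lr, steps=20, w=10.0):
    for _ in range(steps):
        w -= lr * 2 * w  # w = w - lr * gradient
    return w

print(descend(lr=0.4))    # stable: shrinks toward 0
print(descend(lr=1.1))    # too big a step: overshoots and explodes
print(descend(lr=0.001))  # too small: training crawls, w barely moves
```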
Epochs, Batches, Steps (Quick Intuition)
Batch → small chunk of data
Step → one update
Epoch → full pass over data
Large models may never see a full epoch. The data is that big.
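The arithmetic tying these together is simple (the numbers below are made up for illustration, not from any real training run):

```python
# Back-of-the-envelope: how many steps make one epoch?
dataset_tokens = 10_000_000_000  # 10B tokens of training text (hypothetical)
tokens_per_batch = 4_000_000     # tokens processed per update step (hypothetical)

steps_per_epoch = dataset_tokens // tokens_per_batch
print(steps_per_epoch)  # 2500 steps = one full pass over the data
```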
Overfitting Is Always Lurking
If the model memorizes, training loss keeps dropping while validation loss starts rising.
Regularization, dropout, and validation help.
But scale itself is a powerful regularizer.
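Dropout, for example, randomly zeroes activations during training so the network can't rely on any single pathway to memorize. A minimal sketch (toy implementation, not a framework API):

```python
import random

# Minimal dropout: zero each activation with probability p during training,
# and scale the survivors by 1/(1-p) so the expected value is unchanged.
def dropout(activations, p=0.5, training=True):
    if not training:
        return activations  # dropout is disabled at inference time
    return [0.0 if random.random() < p else a / (1 - p) for a in activations]

random.seed(0)
print(dropout([1.0, 2.0, 3.0, 4.0], p=0.5))
```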
Training Is Expensive
- Massive compute
- Huge memory
- Long training time
That’s why most of us don’t train from scratch. But understanding this is crucial.
What I Learned This Week
- Models learn by predicting and failing
- Loss measures how wrong
- Gradients push weights to improve
- Training is slow, incremental, and costly
- No understanding - just optimization
What's Coming Next
Next week, we will learn about fine-tuning a model.