Hi Pythonistas!
So far, we’ve learned how to train a language model.
It can:
- Predict text well
- Continue patterns
- Mimic styles
But left alone, it’s unpredictable.
- Sometimes helpful
- Sometimes wrong
- Sometimes unsafe
So the next question is obvious:
How do we shape its behavior?
Pretraining vs Fine-Tuning
Pretraining:
- Huge dataset
- Generic objective
- Learn language
Fine-tuning:
- Smaller, curated data
- Specific goals
- Behave like this
Same model.
Different phase.
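Here's a minimal sketch of that idea, using a toy next-character counting "model" (a stand-in, not a real LM): the same training routine runs twice, first on generic text, then on a small curated set.

```python
from collections import defaultdict

def train(counts, corpus):
    """Update next-character counts in place; the same routine for both phases."""
    for text in corpus:
        for a, b in zip(text, text[1:]):
            counts[a][b] += 1
    return counts

# Phase 1: pretraining — large, generic text, generic objective (predict next char)
model = defaultdict(lambda: defaultdict(int))
train(model, ["the cat sat on the mat", "the dog ran in the park"])

# Phase 2: fine-tuning — smaller, curated data, same model, same objective
train(model, ["the cat is helpful"])

# One set of "weights" (here: counts), shaped by both phases.
print(dict(model["t"]))
```

Same model object, two passes of data. That's the whole trick.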
Supervised Fine-Tuning (SFT)
First alignment step.
Humans create:
- Prompts
- Ideal responses
The model learns: "When I see this kind of input, this is how I should respond."
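A rough sketch of how an SFT example is usually prepared (names and word-level "tokens" are simplifications for illustration): the prompt and ideal response are concatenated, and a loss mask ensures the model is only graded on the response part.

```python
# Hypothetical SFT example: format a (prompt, ideal response) pair into
# tokens plus a loss mask — the model learns only from the response.
def make_sft_example(prompt, response):
    tokens = prompt.split() + response.split()
    # 0 = ignore (prompt tokens), 1 = learn from (response tokens)
    mask = [0] * len(prompt.split()) + [1] * len(response.split())
    return tokens, mask

tokens, mask = make_sft_example(
    "User: what is Python ?",
    "Assistant: Python is a programming language .",
)

# Only masked-in tokens would feed the cross-entropy loss during fine-tuning.
trainable = [t for t, m in zip(tokens, mask) if m == 1]
print(trainable)
```

The curation effort goes into writing those ideal responses, not into the training code.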
Why Pretraining Isn’t Enough
A pretrained model:
- Can imitate anything
- Doesn’t know what’s good
- Has no concept of intent
Fine-tuning introduces:
- Helpfulness
- Clarity
- Politeness
Not intelligence. Direction.
Reinforcement Learning from Human Feedback (RLHF)
This is where things get interesting.
Instead of labeled answers:
- Humans rank responses: "This one is better than that one."
- A reward model learns these preferences.
The language model is then trained to maximize human preference.
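The reward model part can be sketched in a few lines. This is the standard pairwise preference loss (a Bradley–Terry style objective, commonly used in RLHF papers); the reward numbers here are made up for illustration.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_chosen - r_rejected): small when the reward model
    already scores the human-preferred response higher."""
    return -math.log(1 / (1 + math.exp(-(reward_chosen - reward_rejected))))

# A human ranked response A above response B.
loss_agree = preference_loss(2.0, 0.5)     # reward model agrees with the human
loss_disagree = preference_loss(0.5, 2.0)  # reward model disagrees

print(loss_agree, loss_disagree)
```

Training pushes the reward model toward low loss, i.e. toward scoring responses the way humans rank them. The language model is then tuned to chase that reward.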
Alignment Is Optimization, Not Ethics
Important reality check.
The model does not understand:
- Values
- Morals
- Safety
It learns patterns that look aligned.
Trade-offs Everywhere
More alignment:
- Safer
- More predictable
But also:
- Less creative
- More cautious
ChatGPT is:
- Pretrained
- Fine-tuned
- Aligned for conversation
That’s why it:
- Explains
- Refuses
- Asks clarifying questions
What I Learned This Week
- Pretraining learns language
- Fine-tuning shapes behavior
- SFT teaches good examples
- RLHF optimizes for human preference
- Alignment is engineering, not understanding
At this point, we understand how ChatGPT is built.
What's Coming Next
We will start building mini-gpt.