Hi Pythonistas!
In the last post, I built a single attention head.
That already felt powerful:
- each token looks at others
- assigns importance
- builds context
But there’s a limitation I didn’t notice immediately:
a single attention head can only learn one type of relationship at a time, because it computes just one attention pattern per token.
That’s not enough for language.
Code
import torch
import torch.nn as nn

# Head, n_embd, and dropout come from the previous post (the single attention head)

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        # num_heads independent attention heads, each with its own weights
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        # projects the concatenated head outputs back to the embedding size
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # run every head on the same input, then stack results along the last dim
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # [B, T, num_heads * head_size]
        out = self.proj(out)  # mix information across heads
        out = self.dropout(out)
        return out
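To see it run end to end, here's a minimal, self-contained sketch. The Head below is a stripped-down stand-in for the one from the last post (no causal mask or dropout, for brevity), and n_embd = 64, dropout = 0.1 are made-up demo values:

import torch
import torch.nn as nn
import torch.nn.functional as F

n_embd, dropout = 64, 0.1  # made-up hyperparameters for this demo

class Head(nn.Module):
    # simplified single attention head: no causal mask, no dropout
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

    def forward(self, x):
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # [B, T, T] attention scores
        wei = F.softmax(wei, dim=-1)
        return wei @ v  # [B, T, head_size]

x = torch.randn(4, 8, n_embd)  # [B, T, n_embd]
mha = MultiHeadAttention(num_heads=4, head_size=n_embd // 4)
print(mha(x).shape)  # torch.Size([4, 8, 64])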
What This Actually Does
Instead of one attention mechanism:
we run multiple attention heads in parallel
Each head:
- sees the same input
- learns different relationships
Step 1: Create Multiple Heads
self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
If num_heads is 4, we now have:
Head 1
Head 2
Head 3
Head 4
All independent.
Think of each head as a specialist:
one focuses on grammar
one on subject-object relations
one on long-distance dependencies
one on local patterns
None of this is explicitly coded; it’s learned during training.
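A quick sanity check that nn.ModuleList really holds independent parameter sets (using nn.Linear as a stand-in for Head, purely for illustration):

import torch.nn as nn

heads = nn.ModuleList([nn.Linear(32, 16) for _ in range(4)])
print(len(heads))                              # 4
print(heads[0].weight.equal(heads[1].weight))  # False: each head starts with its own random weights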
Step 2: Run All Heads
[h(x) for h in self.heads]
Each head produces:
[B, T, head_size]
Step 3: Concatenate Outputs
out = torch.cat(..., dim=-1)
This combines all heads into one tensor:
[B, T, n_embd]
Because:
n_embd = num_heads × head_size
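Here's the shape arithmetic in isolation, with made-up sizes (B=2, T=8, head_size=16, num_heads=4):

import torch

B, T, head_size, num_heads = 2, 8, 16, 4  # made-up sizes
head_outputs = [torch.randn(B, T, head_size) for _ in range(num_heads)]  # one [B, T, head_size] tensor per head
out = torch.cat(head_outputs, dim=-1)
print(out.shape)  # torch.Size([2, 8, 64]) -> n_embd = num_heads * head_size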
Step 4: Linear Projection
self.proj = nn.Linear(n_embd, n_embd)
out = self.proj(out)
Why?
Concatenation only stacks the head outputs side by side; the projection lets the model mix them.
Without it:
- heads remain independent
- the model can’t combine their insights
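To make that concrete, a tiny sketch with made-up sizes: after the projection, every output channel is a weighted combination of all input channels, so information from different heads gets blended:

import torch
import torch.nn as nn

proj = nn.Linear(64, 64)              # n_embd = 64 in this demo
concatenated = torch.randn(2, 8, 64)  # [B, T, n_embd] after torch.cat
mixed = proj(concatenated)            # every output channel now draws on all heads
print(mixed.shape)                    # torch.Size([2, 8, 64])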
Step 5: Dropout
out = self.dropout(out)
Randomly zeroes some activations during training:
- prevents overfitting
- improves generalization
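A small demo of how nn.Dropout behaves in its two modes (the 0.2 rate is made up):

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.2)
x = torch.ones(10)
drop.train()
print(drop(x))  # ~20% of entries zeroed, survivors scaled by 1/0.8 = 1.25
drop.eval()
print(drop(x))  # unchanged: dropout is a no-op at inference time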
What Changed for Me
With one head:
the model has a single "view" of the sequence
With multiple heads:
the model can look at the same sequence in different ways simultaneously
Example
Input:
"The cat sat on the mat"
Different heads might learn:
Head 1 → "cat ↔ sat"
Head 2 → "sat ↔ on"
Head 3 → "cat ↔ mat" (long-range)
Head 4 → local word patterns
All heads:
- share the same input
- have different weights
So they evolve differently during training.
Where This Fits
Embedding
↓
Multi-Head Attention ← (you are here)
↓
FeedForward
↓
Transformer Block
This is one of the key upgrades over older models.
Without multi-head:
- limited representation
- weaker context understanding
With multi-head:
- richer relationships
- better language modeling
This is a core idea behind models like GPT-2.
What's Coming Next
So far:
attention gathers information
Next step:
process that information.
Next comes the FeedForward layer, where the transformation happens.