Hi Pythonistas!
Up to now:
text → numbers ✔
numbers → vectors ✔
But those vectors are still independent.
The model still doesn’t know:
- which words relate to each other
- what context matters
- what to focus on
This is where everything changes.
Self-attention is the first place where the model actually starts using context.
import torch
import torch.nn as nn

# assumed hyperparameters, defined elsewhere in the full script
n_embd = 32      # embedding dimension
block_size = 8   # maximum context length

class Head(nn.Module):
    """One head of causal self-attention."""

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # lower-triangular matrix of ones, used to mask out future positions
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape                            # batch, time, channels
        k = self.key(x)                              # (B, T, head_size)
        q = self.query(x)                            # (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * (C ** -0.5)  # (B, T, T) similarity scores
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # hide the future
        wei = torch.softmax(wei, dim=-1)             # scores -> probabilities
        v = self.value(x)                            # (B, T, head_size)
        out = wei @ v                                # weighted sum of values
        return out
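Quick sanity check (a minimal sketch; the head_size and the batch shape below are arbitrary picks, not values from the model):

torch.manual_seed(42)
head = Head(head_size=16)
x = torch.randn(4, 8, n_embd)    # (batch=4, tokens=8, channels=n_embd)
out = head(x)
print(out.shape)                 # torch.Size([4, 8, 16])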
What This Actually Does
At a high level:
Each token looks at other tokens and decides how much they matter.
Step 1: Create Q, K, V
k = self.key(x)
q = self.query(x)
v = self.value(x)
Each token is projected into three different spaces:
Query (Q) → what am I looking for?
Key (K) → what do I contain?
Value (V) → what information do I pass?
Intuition
Think:
Token = "sat"
Query: "I want context about subject"
Key (from "cat"): "I am a subject"
Value: actual information from "cat"
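To make that concrete, here's a standalone sketch of just the three projections (the 32/16 dimensions are illustrative choices):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32)              # one sequence, 3 tokens ("the cat sat"), 32-dim embeddings
key = nn.Linear(32, 16, bias=False)    # three separate learned projections
query = nn.Linear(32, 16, bias=False)
value = nn.Linear(32, 16, bias=False)

k, q, v = key(x), query(x), value(x)
print(k.shape, q.shape, v.shape)       # each (1, 3, 16): same tokens, three different views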
Step 2: Compute Similarity
wei = q @ k.transpose(-2, -1)
This creates a matrix:
how much token i cares about token j
Example:
         the   cat   sat
sat      0.1   0.8   0.1
"sat" strongly attends to "cat"
Step 3: Scale
* (C ** -0.5)
This keeps the scores in a stable range: dot products grow with the vector dimension, so we shrink them by √C.
Without this:
- training becomes unstable
- softmax becomes too sharp (nearly one-hot), which kills the gradients
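You can verify the effect directly: with unit-variance vectors, raw dot products spread out to a standard deviation of about √C, and the scaling pulls that back to about 1 (a standalone sketch; C = 512 is an arbitrary choice that makes the spread obvious):

import torch

C = 512
q = torch.randn(1000, C)
k = torch.randn(1000, C)
scores = (q * k).sum(dim=-1)          # 1000 independent dot products

print(scores.std())                   # ~ sqrt(C) ≈ 22.6: wide spread
print((scores * C ** -0.5).std())     # ~ 1.0 after scaling

(The original Transformer paper scales by the key dimension, head_size ** -0.5; the code in this post uses C, which serves the same stabilizing purpose.)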
Step 4: Mask Future Tokens
wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
This is critical.
It ensures:
the model cannot see future tokens
Example
Input: "the cat sat"
When predicting "sat":
- can see "the", "cat"
- cannot see future words
This is what makes it GPT-style (causal).
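Here's the mask in isolation, on made-up scores for our three tokens (a standalone sketch):

import torch

T = 3
scores = torch.randn(T, T)             # raw scores for "the cat sat"
tril = torch.tril(torch.ones(T, T))    # 1s on and below the diagonal
masked = scores.masked_fill(tril == 0, float('-inf'))
print(masked)
# row 0 ("the") sees only itself; row 2 ("sat") sees all three
# the -inf entries become exactly 0 after softmax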
Step 5: Softmax
wei = torch.softmax(wei, dim=-1)
Convert scores → probabilities
[0.1, 0.8, 0.1]
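Softmax in isolation, with scores hand-picked to reproduce that row (ln 8 ≈ 2.0794 is the only contrived number here):

import torch

row = torch.tensor([0.0, 2.0794, 0.0])              # middle score = ln(8)
print(torch.softmax(row, dim=-1))                   # tensor([0.1000, 0.8000, 0.1000])

masked = torch.tensor([0.0, 2.0794, float('-inf')])
print(torch.softmax(masked, dim=-1))                # tensor([0.1111, 0.8889, 0.0000]): -inf becomes exactly 0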
Step 6: Apply to Values
out = wei @ v
Now we combine information:
weighted sum of important tokens
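With the [0.1, 0.8, 0.1] weights from the example, the output for "sat" is dominated by "cat"'s value vector (a standalone sketch with toy 2-dim values):

import torch

wei = torch.tensor([[0.1, 0.8, 0.1]])    # "sat" attends mostly to "cat"
v = torch.tensor([[1.0, 0.0],            # value carried by "the"
                  [0.0, 1.0],            # value carried by "cat"
                  [0.5, 0.5]])           # value carried by "sat"
out = wei @ v
print(out)                               # tensor([[0.1500, 0.8500]]): mostly "cat"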
Before this:
I thought models "process sequences"
After this:
they dynamically decide what to focus on
Every token:
- looks at others
- assigns importance
- builds its own context
Important Detail
This happens for:
- every token
- every layer
- every training step
Which means: context is continuously recomputed
Where This Fits
Text
↓
Tokenization
↓
Embedding
↓
Self-Attention ← (this is the core)
↓
Transformer
↓
Prediction
This is the idea behind models like GPT-2.
Without attention:
models struggle with long-range dependencies
With attention:
model can relate any token to any previous token
What's Coming Next
Right now we have one attention head. But in practice, one perspective is not enough.
Next step: Multi-Head Attention - multiple parallel views of the same sequence.
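As a preview, a hedged sketch (not the final implementation): multi-head attention is essentially several Heads run in parallel, with their outputs concatenated.

class MultiHeadAttention(nn.Module):
    """Sketch: several independent heads, outputs concatenated."""

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        # each head has its own Q/K/V projections, i.e. its own perspective
        return torch.cat([h(x) for h in self.heads], dim=-1)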