Hi Pythonistas!
Up to now, we built everything in isolation:
attention → gathers context
multi-head → multiple perspectives
feedforward → transforms information
But these are just pieces.
The real power comes from how they are combined.
That combination is the Transformer Block.
Code
class Block(nn.Module):
def __init__(self):
super().__init__()
head_size = n_embd // n_head
self.sa = MultiHeadAttention(n_head, head_size)
self.ffwd = FeedForward()
self.ln1 = nn.LayerNorm(n_embd)
self.ln2 = nn.LayerNorm(n_embd)
def forward(self, x):
x = x + self.sa(self.ln1(x))
x = x + self.ffwd(self.ln2(x))
return x
What This Actually Does
Each block performs two main operations:
1. Attention (look around)
2. FeedForward (process)
And it does this with:
normalization
residual connections
Step 1: Layer Normalization (Before Attention)
self.ln1 = nn.LayerNorm(n_embd)
Used here:
self.sa(self.ln1(x))
Why normalize?
- stabilizes training
- keeps values in a reasonable range
- prevents exploding/vanishing values
Step 2: Attention with Residual
x = x + self.sa(self.ln1(x))
This is important.
Instead of:
x = attention(x)
We do:
x = x + attention(x)
Why residual connection?
It preserves original information.
Think:
don’t overwrite, just refine
Step 3: Layer Normalization (Again)
self.ln2 = nn.LayerNorm(n_embd)
Used here:
self.ffwd(self.ln2(x))
Step 4: FeedForward with Residual
x = x + self.ffwd(self.ln2(x))
Same pattern:
- normalize
- process
- add back
What Changed for Me
Before this:
I saw attention and feedforward as separate layers
After this:
I realized the real unit is the block
Everything is built around this structure.
Full Flow Inside One Block
Input x
↓
LayerNorm
↓
Attention
↓
Add (residual)
↓
LayerNorm
↓
FeedForward
↓
Add (residual)
↓
Output x
Important Detail
This block is:
repeated multiple times
Example:
self.blocks = nn.Sequential(*[Block() for _ in range(n_layer)])
Why stacking works
Each layer:
- refines representation
- adds deeper understanding
Think:
Layer 1 → basic patterns
Layer 2 → relationships
Layer 3 → structure
Layer 4 → higher meaning
Where This Fits
Embedding
↓
[ Transformer Block × N ] ← (core engine)
↓
Linear Head
↓
Prediction
Why This Matters
This block is the core building unit of:
- GPT-2
- ChatGPT
Everything else is just scaling this.
Mental Model
Attention → gather context
FeedForward → process
Residual → preserve
LayerNorm → stabilize
What's Coming Next
Now we have the core engine.
Next step:
wrap everything into a full model GPT Model (final architecture + logits)