Hi Pythonistas!
Up to now:
- attention tells the model what to focus on
- multi-head gives multiple perspectives
But something is still missing.
Right now, the model is mostly gathering information.
It hasn’t really transformed it yet.
That’s where the FeedForward layer comes in.
code
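import torch.nn as nn

# n_embd (embedding size) and dropout (drop probability) are
# hyperparameters assumed to be defined earlier in the script.

class FeedForward(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # expand: n_embd -> 4 * n_embd
            nn.ReLU(),                      # non-linearity
            nn.Linear(4 * n_embd, n_embd),  # project back to n_embd
            nn.Dropout(dropout),            # active only in training mode
        )

    def forward(self, x):
        # x: (..., n_embd); the same network runs on every token position
        return self.net(x)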
What This Actually Does
This is just a small neural network applied to each token independently.
Important:
- No interaction between tokens here.
- That already happened in attention.
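A quick way to convince yourself of the "no interaction" claim is to change one token and check that the outputs for the other tokens do not move. This is only a sketch: n_embd = 64 and the 3-token input are assumed values, and dropout is left out so the comparison is deterministic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 64  # assumed embedding size for this demo

# Same layers as FeedForward, minus dropout so outputs are deterministic
ffwd = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd),
    nn.ReLU(),
    nn.Linear(4 * n_embd, n_embd),
)

x = torch.randn(1, 3, n_embd)           # (batch, tokens, n_embd)
x_changed = x.clone()
x_changed[0, 1] = torch.randn(n_embd)   # alter only the middle token

with torch.no_grad():
    out = ffwd(x)
    out_changed = ffwd(x_changed)

# Compare each token's output before and after the change
first_same = torch.allclose(out[0, 0], out_changed[0, 0], atol=1e-6)
middle_same = torch.allclose(out[0, 1], out_changed[0, 1], atol=1e-6)
last_same = torch.allclose(out[0, 2], out_changed[0, 2], atol=1e-6)
print(first_same, middle_same, last_same)  # True False True
```

Only the token that was edited gets a different output. With attention in the path, all three could change.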
Step 1: Expand Dimension
nn.Linear(n_embd, 4 * n_embd)
If:
n_embd = 64
This becomes:
64 → 256
Why expand?
This gives the model more "space" to learn complex patterns.
Think:
low dimension → limited expression
higher dimension → richer transformations
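To make the 64 → 256 step concrete, here is a small sketch that runs the expand layer alone and counts its parameters. The batch size of 2 and the 8 tokens are made-up numbers for illustration.

```python
import torch
import torch.nn as nn

n_embd = 64  # same value as in the example above
expand = nn.Linear(n_embd, 4 * n_embd)

x = torch.randn(2, 8, n_embd)   # (batch, tokens, n_embd), hypothetical sizes
h = expand(x)                   # only the last dimension changes
print(h.shape)                  # torch.Size([2, 8, 256])

# weight is (256, 64) plus a bias of 256
n_params = sum(p.numel() for p in expand.parameters())
print(n_params)                 # 16640
```

The batch and token dimensions pass through untouched, and each token gets 256 intermediate features to work with.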
Step 2: Activation
nn.ReLU()
Adds non-linearity.
Without this:
- the expand and project layers collapse into one linear layer
- the 4x expansion adds no expressive power
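The claim about non-linearity can be checked numerically. The sketch below folds two stacked linear layers (no activation in between) into a single 64 → 64 layer and compares the outputs. The names lin1, lin2 and merged are just for this demo, and n_embd = 64 is assumed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 64  # assumed embedding size

lin1 = nn.Linear(n_embd, 4 * n_embd)
lin2 = nn.Linear(4 * n_embd, n_embd)
merged = nn.Linear(n_embd, n_embd)

with torch.no_grad():
    # lin2(lin1(x)) = x @ (W2 @ W1).T + (W2 @ b1 + b2)
    merged.weight.copy_(lin2.weight @ lin1.weight)
    merged.bias.copy_(lin2.weight @ lin1.bias + lin2.bias)

    x = torch.randn(5, n_embd)
    stacked_out = lin2(lin1(x))                # no activation
    merged_out = merged(x)                     # one linear layer
    relu_out = lin2(torch.relu(lin1(x)))       # with ReLU in between

collapses = torch.allclose(stacked_out, merged_out, atol=1e-4)
relu_differs = not torch.allclose(relu_out, merged_out, atol=1e-4)
print(collapses, relu_differs)  # True True
```

Without the activation, the 256-wide middle layer buys nothing: a single 64 → 64 layer gives the same result. With ReLU in between, no such shortcut exists.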
Step 3: Project Back
nn.Linear(4 * n_embd, n_embd)
256 → 64
Bring it back to original size so it fits the rest of the model.
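Here is a minimal shape trace of expand → ReLU → project, with an assumed batch of 2 and 8 tokens. The last line adds the output back onto the input, which is my assumption about one way it needs to "fit the rest of the model"; the addition only works because the shapes match.

```python
import torch
import torch.nn as nn

n_embd = 64  # assumed embedding size
expand = nn.Linear(n_embd, 4 * n_embd)
project = nn.Linear(4 * n_embd, n_embd)

x = torch.randn(2, 8, n_embd)    # (batch, tokens, n_embd), hypothetical sizes
h = torch.relu(expand(x))        # 64 -> 256
out = project(h)                 # 256 -> 64

shapes = [tuple(x.shape), tuple(h.shape), tuple(out.shape)]
print(shapes)  # [(2, 8, 64), (2, 8, 256), (2, 8, 64)]

# Same shape in and out, so the result can be added to the input
# (an assumption about how later layers use it)
combined = x + out
```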
Step 4: Dropout
nn.Dropout(dropout)
Same idea as before:
- randomly zeroes some activations during training
- helps reduce overfitting
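A small sketch of what nn.Dropout does in training mode versus evaluation mode. The value dropout = 0.2 is an assumption for the demo, not necessarily the value used in this series.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = 0.2  # assumed drop probability
drop = nn.Dropout(dropout)
x = torch.ones(1000)

drop.train()                 # training mode: dropout is active
y_train = drop(x)
# Dropped entries become 0; survivors are scaled by 1 / (1 - p) = 1.25
zero_fraction = (y_train == 0).float().mean().item()
survivors = y_train[y_train != 0]

drop.eval()                  # evaluation mode: dropout does nothing
y_eval = drop(x)

print(round(zero_fraction, 2))   # roughly 0.2, varies with the seed
print(torch.equal(y_eval, x))    # True
```

The scaling of the survivors keeps the expected value of each activation the same, which is why nothing special is needed at inference time beyond calling eval() on the model.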
What Changed for Me
Before this, I thought attention does everything.
After implementing this:
- attention gathers information
- feedforward transforms it
Both are equally important.
Example Intuition
Input sentence:
"The cat sat"
After attention:
"sat" knows about "cat"
After feedforward:
that relationship gets processed and refined
Important Detail
FeedForward is applied:
independently per token
So:
no cross-token communication here
purely transformation
Where This Fits
Embedding
↓
Multi-Head Attention
↓
FeedForward ← (this)
↓
Transformer Block
Why This Matters
Without FeedForward:
- model only mixes information
- no deep transformation
With FeedForward:
model can build complex representations
Simple Mental Model
Attention → "what is important?"
FeedForward → "what do I do with it?"
What's Coming Next
Now we have:
- attention
- feedforward
Next step:
combine them properly into a Transformer Block, the core building unit.