From Basics to Bots: My Weekly AI Engineering Adventure-40

Hi Pythonistas!

Up to now:

attention tells the model what to focus on
multi-head gives multiple perspectives

But something is still missing.

Right now, the model is mostly gathering information.

It hasn’t really transformed it yet.

That’s where the FeedForward layer comes in.


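import torch.nn as nn

n_embd = 64    # embedding size (the example value used below)
dropout = 0.2  # dropout probability (assumed value, use your own setting)

class FeedForward(nn.Module):
    """Small per-token network: expand, activate, project back, dropout."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # expand: n_embd -> 4 * n_embd
            nn.ReLU(),                      # non-linearity
            nn.Linear(4 * n_embd, n_embd),  # project back: 4 * n_embd -> n_embd
            nn.Dropout(dropout),            # regularization during training
        )

    def forward(self, x):
        # x: (batch, tokens, n_embd); the layers act on the last dimension only
        return self.net(x)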

What This Actually Does

This is just a small neural network applied to each token independently.

Important:

No interaction between tokens here.
That already happened in attention.
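
A quick way to convince yourself of this is to change one token and check that the outputs for the other tokens stay the same. This is only a sketch: the sizes and the random input are my own assumptions, and I left out dropout so the comparison is deterministic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 64  # assumed value, same as the example in this post

# Same layer stack as FeedForward, minus dropout so outputs are deterministic
ffn = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd),
    nn.ReLU(),
    nn.Linear(4 * n_embd, n_embd),
)

x = torch.randn(1, 3, n_embd)           # (batch, tokens, n_embd)
x_changed = x.clone()
x_changed[0, 1] = torch.randn(n_embd)   # replace only the middle token

with torch.no_grad():
    out = ffn(x)
    out_changed = ffn(x_changed)

# Tokens 0 and 2 were not touched, so their outputs should not move
first_same = torch.allclose(out[0, 0], out_changed[0, 0])
last_same = torch.allclose(out[0, 2], out_changed[0, 2])
# Token 1 was replaced, so its output should change
middle_same = torch.allclose(out[0, 1], out_changed[0, 1])

print(first_same, last_same, middle_same)
```

If tokens could talk to each other here, changing the middle token would also shift the outputs of its neighbours.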

Step 1: Expand Dimension

nn.Linear(n_embd, 4 * n_embd)

If:

n_embd = 64

This becomes:

64 → 256

Why expand?

This gives the model more "space" to learn complex patterns.

Think:

low dimension → limited expression
higher dimension → richer transformations
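
To make the extra "space" concrete, here is a small sketch that counts what the expansion layer holds when n_embd = 64. The variable names are mine, just for illustration.

```python
import torch.nn as nn

n_embd = 64  # example value from this post
expand = nn.Linear(n_embd, 4 * n_embd)

n_weights = expand.weight.numel()  # one weight per (input, hidden unit) pair
n_biases = expand.bias.numel()     # one bias per hidden unit
n_total = n_weights + n_biases

print(expand.weight.shape, n_weights, n_biases, n_total)
```

Each token goes from 64 numbers to 256 hidden units, and every one of those units has its own set of 64 weights to learn.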

Step 2: Activation

nn.ReLU()

Adds non-linearity.

Without this:

the two Linear layers collapse into a single linear transformation
the extra dimensions add nothing, so no complex patterns can be learned here
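
Here is a small check of that claim, as a sketch with made-up input: two Linear layers stacked without an activation give the same result as one merged 64 → 64 linear map. The names `W`, `b`, `stacked`, and `merged` are mine.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 64  # example value from this post

lin1 = nn.Linear(n_embd, 4 * n_embd)
lin2 = nn.Linear(4 * n_embd, n_embd)

# Fold both layers into one weight matrix and one bias vector
W = lin2.weight @ lin1.weight            # (64, 256) @ (256, 64) -> (64, 64)
b = lin2.weight @ lin1.bias + lin2.bias  # (64,)

x = torch.randn(5, n_embd)
with torch.no_grad():
    stacked = lin2(lin1(x))                   # no activation in between
    merged = x @ W.T + b                      # single linear map
    with_relu = lin2(torch.relu(lin1(x)))     # the real FeedForward path

collapses = torch.allclose(stacked, merged, atol=1e-4)
relu_differs = not torch.allclose(with_relu, merged, atol=1e-4)

print(collapses, relu_differs)
```

With ReLU in the middle, the result can no longer be written as a single matrix multiply, and that is what makes the 256-wide hidden layer worth having.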

Step 3: Project Back

nn.Linear(4 * n_embd, n_embd)

256 → 64

Bring it back to the original size so it fits the rest of the model.
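
Tracing the shapes makes the round trip easy to see. This is a sketch with an assumed batch of 2 sequences and 8 tokens each; only the last dimension changes.

```python
import torch
import torch.nn as nn

n_embd = 64  # example value from this post
expand = nn.Linear(n_embd, 4 * n_embd)
project = nn.Linear(4 * n_embd, n_embd)

x = torch.randn(2, 8, n_embd)   # (batch, tokens, n_embd), assumed sizes
h = torch.relu(expand(x))       # last dimension grows to 4 * n_embd
y = project(h)                  # last dimension returns to n_embd

print(tuple(x.shape), tuple(h.shape), tuple(y.shape))
```

Because the output has the same shape as the input, whatever layer comes next can treat it exactly like the tensor that went in.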

Step 4: Dropout

nn.Dropout(dropout)

Same idea as before:

randomly zeroes some activations during training
helps reduce overfitting
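
A small sketch of what nn.Dropout does in training mode versus eval mode. The dropout value of 0.2 and the all-ones input are my own assumptions, picked to make the effect easy to read.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = 0.2  # assumed value for illustration
drop = nn.Dropout(dropout)
x = torch.ones(1000)

drop.train()                     # training mode: dropout is active
y_train = drop(x)
zero_fraction = (y_train == 0).float().mean().item()  # roughly `dropout`
kept_value = y_train.max().item()  # survivors are scaled by 1 / (1 - dropout)

drop.eval()                      # eval mode: dropout does nothing
y_eval = drop(x)
unchanged_in_eval = torch.equal(y_eval, x)

print(zero_fraction, kept_value, unchanged_in_eval)
```

The scaling of the surviving values keeps the expected activation the same in both modes, which is why you have to remember to call model.eval() before generating text.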

What Changed for Me

Before this, I thought attention does everything.

After implementing this:

attention gathers information
feedforward transforms it

Both are equally important.

Example Intuition

Input sentence:

"The cat sat"

After attention:

"sat" knows about "cat"

After feedforward:

that relationship gets processed and refined

Important Detail

FeedForward is applied:

independently per token

So:

no cross-token communication here
purely transformation

Where This Fits

Embedding
↓
Multi-Head Attention
↓
FeedForward ← (this)
↓
Transformer Block

Why This Matters

Without FeedForward:

model only mixes information
no deep transformation

With FeedForward:

model can build complex representations

Simple Mental Model

Attention → "what is important?"
FeedForward → "what do I do with it?"

What's Coming Next

Now we have:

attention
feedforward

Next step:

combine them properly into a Transformer Block, the core building unit