From Basics to Bots: My Weekly AI Engineering Adventure-40

FeedForward: Where the Model Actually "Processes" Information

Posted by Afsal on 22-May-2026

Hi Pythonistas!

Up to now:

  • attention tells the model what to focus on
  • multi-head gives multiple perspectives

But something is still missing.

Right now, the model is mostly gathering information.

It hasn’t really transformed it yet.

That’s where the FeedForward layer comes in.

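import torch.nn as nn

# Hyperparameters from earlier in the series; the values here are placeholders
n_embd = 64     # embedding size (the example value used below)
dropout = 0.2   # dropout probability (assumed value)

class FeedForward(nn.Module):
    """Per-token MLP: expand, apply ReLU, project back, then dropout."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # expand: n_embd -> 4 * n_embd
            nn.ReLU(),                      # non-linearity
            nn.Linear(4 * n_embd, n_embd),  # project back to n_embd
            nn.Dropout(dropout),            # active only in training mode
        )

    def forward(self, x):
        # x: (batch, tokens, n_embd); the output has the same shape
        return self.net(x)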

What This Actually Does

This is just a small neural network applied to each token independently.

Important:

  • No interaction between tokens here.
  • That already happened in attention.
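One way to check the "each token independently" claim is to run the network on the whole sequence, then on one token at a time, and compare. This is only a minimal sketch: the batch of 1, the 3 tokens, and n_embd = 64 are assumptions, and dropout is left out so the output is deterministic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 64  # assumed embedding size, matching the example in this post

# Same layers as FeedForward, minus dropout
net = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd),
    nn.ReLU(),
    nn.Linear(4 * n_embd, n_embd),
)

x = torch.randn(1, 3, n_embd)  # (batch, tokens, n_embd), e.g. "The cat sat"

with torch.no_grad():
    out_all = net(x)  # whole sequence in one call
    # One token at a time, stacked back along the token dimension
    out_each = torch.stack(
        [net(x[:, t, :]) for t in range(x.shape[1])], dim=1
    )

same = torch.allclose(out_all, out_each, atol=1e-5)
print(out_all.shape, same)
```

If both paths agree, no token's output depended on its neighbours.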

Step 1: Expand Dimension

nn.Linear(n_embd, 4 * n_embd)

If:

n_embd = 64

This becomes:

64 → 256

Why expand?

This gives the model more "space" to learn complex patterns.

Think:

low dimension → limited expression
higher dimension → richer transformations
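A quick shape check for the expansion step. The input shape (2 sequences of 5 tokens) is just an assumption for illustration.

```python
import torch
import torch.nn as nn

n_embd = 64  # assumed embedding size from the example above

expand = nn.Linear(n_embd, 4 * n_embd)
x = torch.randn(2, 5, n_embd)  # (batch, tokens, n_embd), assumed sizes

h = expand(x)  # only the last dimension changes
print(tuple(x.shape), "->", tuple(h.shape))

# 64 * 256 weights plus 256 biases
n_params = sum(p.numel() for p in expand.parameters())
print(n_params)  # 16640
```

The batch and token dimensions pass through untouched; only the feature dimension grows.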

Step 2: Activation

nn.ReLU()

Adds non-linearity.

Without this:

  • the two Linear layers collapse into a single linear transformation
  • FeedForward can't learn anything that one Linear layer couldn't
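To see why the activation matters, here is a small sketch showing that two stacked Linear layers with nothing in between can be folded into one weight matrix and one bias. The names lin1, lin2, W, and b are mine, and the sizes are the assumed n_embd = 64 from above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 64  # assumed embedding size

lin1 = nn.Linear(n_embd, 4 * n_embd)
lin2 = nn.Linear(4 * n_embd, n_embd)
x = torch.randn(10, n_embd)  # 10 token vectors

with torch.no_grad():
    # Two linear layers, no activation in between
    stacked = lin2(lin1(x))

    # The same computation folded into one linear layer
    W = lin2.weight @ lin1.weight            # (64, 256) @ (256, 64) -> (64, 64)
    b = lin2.weight @ lin1.bias + lin2.bias  # (64,)
    single = x @ W.T + b

    # With ReLU in between, the folding no longer works
    with_relu = lin2(torch.relu(lin1(x)))

collapses = torch.allclose(stacked, single, atol=1e-5)
relu_differs = not torch.allclose(with_relu, single, atol=1e-5)
print(collapses, relu_differs)
```

So without ReLU, the 4x expansion buys nothing: the pair behaves like a single 64 → 64 layer.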

Step 3: Project Back

nn.Linear(4 * n_embd, n_embd)

256 → 64

Bring it back to the original size so it fits the rest of the model.
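To follow the full 64 → 256 → 64 round trip, this sketch records the shape after every layer. The batch of 1 and the 3 tokens are assumptions, and dropout is skipped because it never changes shapes.

```python
import torch
import torch.nn as nn

n_embd = 64  # assumed embedding size

net = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd),
    nn.ReLU(),
    nn.Linear(4 * n_embd, n_embd),
)

x = torch.randn(1, 3, n_embd)  # (batch, tokens, n_embd)

shapes = [tuple(x.shape)]
h = x
with torch.no_grad():
    for layer in net:  # nn.Sequential can be iterated layer by layer
        h = layer(h)
        shapes.append(tuple(h.shape))

for s in shapes:
    print(s)
```

The first and last shapes match, which is what lets the output drop back into the rest of the model.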

Step 4: Dropout

nn.Dropout(dropout)

Same idea as before:

  • randomly zeroes some activations during training
  • helps reduce overfitting
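Here is a small sketch of how nn.Dropout behaves in training mode versus eval mode. The probability 0.2 is an assumed value, not necessarily the one used in this series.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = 0.2  # assumed dropout probability

drop = nn.Dropout(dropout)
x = torch.ones(1000)

drop.train()       # training mode: dropout is active
y_train = drop(x)  # each value is either 0 or scaled up by 1 / (1 - p)

drop.eval()        # eval mode: dropout does nothing
y_eval = drop(x)

zero_fraction = (y_train == 0).float().mean().item()
print(y_train.unique())  # the distinct values seen in training mode
print(zero_fraction)     # roughly the dropout probability
print(torch.equal(y_eval, x))
```

The scaling of the surviving values keeps the expected activation the same between training and inference, so remember to call model.eval() before generating text.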

What Changed for Me

Before this, I thought attention did everything.

After implementing this:

  • attention gathers information
  • feedforward transforms it

Both are equally important.

Example Intuition

Input sentence:

"The cat sat"

After attention:

"sat" knows about "cat"

After feedforward:

that relationship gets processed and refined

Important Detail

FeedForward is applied:

independently per token

So:

  • no cross-token communication here
  • purely transformation
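A small experiment that makes this concrete: change only the middle token and see which outputs move. This is a sketch under assumed sizes (3 tokens, n_embd = 64), with dropout left out so the two runs are comparable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 64  # assumed embedding size

net = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd),
    nn.ReLU(),
    nn.Linear(4 * n_embd, n_embd),
)

x = torch.randn(1, 3, n_embd)           # three tokens, e.g. "The cat sat"
x_changed = x.clone()
x_changed[0, 1] = torch.randn(n_embd)   # replace only the middle token

with torch.no_grad():
    out = net(x)
    out_changed = net(x_changed)

# Tokens 0 and 2 were not touched, so their outputs should not move
others_same = torch.allclose(out[0, [0, 2]], out_changed[0, [0, 2]], atol=1e-6)
# Token 1 was replaced, so its output should change
middle_differs = not torch.allclose(out[0, 1], out_changed[0, 1], atol=1e-6)
print(others_same, middle_differs)
```

Try the same experiment on an attention layer and the untouched tokens would change too, because attention is where tokens talk to each other.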


Where This Fits

Embedding
 ↓
Multi-Head Attention
 ↓
FeedForward   ← (this)
 ↓
Transformer Block

Why This Matters

Without FeedForward:

  • model only mixes information
  • no deep transformation

With FeedForward:

model can build complex representations

Simple Mental Model

Attention → "what is important?"
FeedForward → "what do I do with it?"

What's Coming Next

Now we have:

  • attention
  • feedforward 

Next step:

combine them properly into a Transformer Block, the core building unit.