From Basics to Bots: My Weekly AI Engineering Adventure-41

Transformer Block: Putting It All Together

Posted by Afsal on 29-May-2026

Hi Pythonistas!

So far, we've built every piece in isolation:

attention → gathers context
multi-head → multiple perspectives
feedforward → transforms information

But these are just pieces.

The real power comes from how they are combined.

That combination is the Transformer Block.

Code

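import torch.nn as nn

# n_embd, n_head, MultiHeadAttention and FeedForward are assumed to be
# defined already, as in the earlier posts.

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        head_size = n_embd // n_head  # each head works on an equal slice of the embedding
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward()
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # normalize → attend → add back
        x = x + self.ffwd(self.ln2(x))  # normalize → feed-forward → add back
        return x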
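
If you want to run the block without the classes from the earlier posts, here is a minimal sketch of the same wiring. Everything swapped in is an assumption, not the series' own code: PyTorch's built-in nn.MultiheadAttention stands in for MultiHeadAttention, a small two-layer MLP stands in for FeedForward, and the name SketchBlock, the causal mask and the toy sizes are made up for illustration.

```python
import torch
import torch.nn as nn

n_embd, n_head = 32, 4  # assumed toy sizes


class SketchBlock(nn.Module):
    """Same wiring as Block, with PyTorch built-ins as stand-ins."""

    def __init__(self):
        super().__init__()
        # Stand-in for MultiHeadAttention (assumption)
        self.sa = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        # Stand-in for FeedForward (assumption: 4x expansion with ReLU)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        T = x.shape[1]
        # True marks positions a token may NOT attend to (everything to its right)
        causal = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)
        attn_out, _ = self.sa(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out                # residual around attention
        x = x + self.ffwd(self.ln2(x))  # residual around feed-forward
        return x


torch.manual_seed(0)
x = torch.randn(2, 8, n_embd)  # (batch, time, n_embd)
out = SketchBlock()(x)
print(out.shape)  # torch.Size([2, 8, 32])
```

The output has the same shape as the input, so the result of one block can be fed straight into another.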

What This Actually Does

Each block performs two main operations:

1. Attention (look around)
2. FeedForward (process)

And it does this with:

normalization
residual connections

Step 1: Layer Normalization (Before Attention)

self.ln1 = nn.LayerNorm(n_embd)

Used here:

self.sa(self.ln1(x))

Why normalize?

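  • rescales each token's vector to zero mean and unit variance, then applies a learned scale and shift
  • keeps values in a reasonable range as they pass through many layers
  • stabilizes training by making exploding or vanishing activations less likely

Normalizing before the sublayer, rather than after the residual add, is the pre-norm layout that GPT-2 uses.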
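
To see the normalization concretely, here is a small check, assuming a toy n_embd of 32 and a deliberately badly scaled input. A freshly created nn.LayerNorm starts with scale 1 and shift 0, so each token's vector should come out with mean close to 0 and standard deviation close to 1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 32  # assumed toy size

# Badly scaled input: large spread and a non-zero offset
x = torch.randn(2, 8, n_embd) * 50 + 10

ln = nn.LayerNorm(n_embd)
y = ln(x)

# Statistics are computed per token, over the last (embedding) dimension
print("before:", x.mean(-1)[0, 0].item(), x.std(-1, unbiased=False)[0, 0].item())
print("after: ", y.mean(-1)[0, 0].item(), y.std(-1, unbiased=False)[0, 0].item())
```

Note that the statistics are taken over the embedding dimension of each token separately, not over the batch.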

Step 2: Attention with Residual

x = x + self.sa(self.ln1(x))

This is important.

Instead of:

x = attention(x)

We do:

x = x + attention(x)

Why residual connection?

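It preserves the original information: the layer only has to learn what to add to x, and gradients get a direct path back through the addition.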

Think:

don’t overwrite, just refine
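
A quick way to convince yourself, sketched with a hypothetical stand-in: a zero-initialized nn.Linear plays the role of an attention layer that has not learned anything yet. With the residual, the input passes through untouched and the gradient still reaches x.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 32  # assumed toy size
x = torch.randn(2, 8, n_embd, requires_grad=True)

# Hypothetical stand-in for attention: a layer that outputs all zeros
sublayer = nn.Linear(n_embd, n_embd)
nn.init.zeros_(sublayer.weight)
nn.init.zeros_(sublayer.bias)

# With the residual: nothing is overwritten
out = x + sublayer(x)
print(torch.equal(out, x))  # True

# The gradient flows back through the skip path
out.sum().backward()
print(torch.equal(x.grad, torch.ones_like(x)))  # True

# Without the residual, the same sublayer would wipe the input out
plain = sublayer(x)
print(plain.abs().max().item())  # 0.0
```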

Step 3: Layer Normalization (Again)

self.ln2 = nn.LayerNorm(n_embd)

Used here:

self.ffwd(self.ln2(x))

Same idea as Step 1, but ln2 is a separate layer with its own learned scale and shift.

Step 4: FeedForward with Residual

x = x + self.ffwd(self.ln2(x))

Same pattern:

  • normalize
  • process
  • add back
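
Since both steps follow the same three-part pattern, it can be written once and reused. The wrapper below is a hypothetical helper, not something from the series, and the feed-forward stand-in with its 4x expansion is an assumption too.

```python
import torch
import torch.nn as nn


class PreNormResidual(nn.Module):
    """Hypothetical helper: x + sublayer(LayerNorm(x))."""

    def __init__(self, n_embd, sublayer):
        super().__init__()
        self.ln = nn.LayerNorm(n_embd)
        self.sublayer = sublayer

    def forward(self, x):
        normed = self.ln(x)                # normalize
        processed = self.sublayer(normed)  # process
        return x + processed               # add back


torch.manual_seed(0)
n_embd = 32  # assumed toy size

# Stand-in feed-forward network (assumption)
ffwd = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd),
    nn.ReLU(),
    nn.Linear(4 * n_embd, n_embd),
)
step = PreNormResidual(n_embd, ffwd)

x = torch.randn(2, 8, n_embd)
out = step(x)
print(out.shape)  # torch.Size([2, 8, 32])
```

Any sublayer that maps (batch, time, n_embd) to the same shape could be wrapped this way.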

What Changed for Me

Before this:

I saw attention and feedforward as separate layers

After this:

I realized the real unit is the block

Everything is built around this structure.

Full Flow Inside One Block

Input x
 ↓
LayerNorm
 ↓
Attention
 ↓
Add (residual)
 ↓
LayerNorm
 ↓
FeedForward
 ↓
Add (residual)
 ↓
Output x

Important Detail

This block is:

repeated multiple times

Example:

self.blocks = nn.Sequential(*[Block() for _ in range(n_layer)])
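
Stacking is mechanically simple because a block returns the same shape it receives. The sketch below uses a hypothetical TinyBlock (pre-norm residual around a small MLP, no attention) purely to show the plumbing. The sizes are assumptions.

```python
import torch
import torch.nn as nn

n_embd = 32  # assumed toy size


class TinyBlock(nn.Module):
    """Hypothetical stand-in for Block: same in/out shape, no attention."""

    def __init__(self):
        super().__init__()
        self.ln = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return x + self.mlp(self.ln(x))


def count_params(module):
    return sum(p.numel() for p in module.parameters())


torch.manual_seed(0)
x = torch.randn(2, 8, n_embd)

for n_layer in (1, 2, 4):
    # Each TinyBlock() call creates a new block with its own weights
    blocks = nn.Sequential(*[TinyBlock() for _ in range(n_layer)])
    out = blocks(x)
    print(n_layer, tuple(out.shape), count_params(blocks))
```

The shape stays (2, 8, 32) no matter how many blocks are chained, and the parameter count grows linearly with n_layer because the weights are not shared between blocks.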

Why stacking works

Each layer:

  • refines representation
  • adds deeper understanding

A rough intuition (real models don't split this cleanly):

Layer 1 → basic patterns  
Layer 2 → relationships  
Layer 3 → structure  
Layer 4 → higher meaning  

Where This Fits

Embedding
 ↓
[ Transformer Block × N ]   ← (core engine)
 ↓
Linear Head
 ↓
Prediction
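
Here is a rough shape trace of that pipeline, not the final model. As a stand-in for Block it uses PyTorch's nn.TransformerEncoderLayer with norm_first=True, which has the same normalize → process → add back layout. Positional information and the causal mask are left out to keep the trace short, and all sizes and names (tok_emb, lm_head) are assumptions.

```python
import torch
import torch.nn as nn

vocab_size, n_embd, n_head, n_layer = 65, 32, 4, 2  # assumed toy sizes

torch.manual_seed(0)
tok_emb = nn.Embedding(vocab_size, n_embd)
blocks = nn.Sequential(*[
    nn.TransformerEncoderLayer(
        d_model=n_embd,
        nhead=n_head,
        dim_feedforward=4 * n_embd,
        dropout=0.0,
        batch_first=True,
        norm_first=True,  # LayerNorm before each sublayer, as in Block
    )
    for _ in range(n_layer)
])
lm_head = nn.Linear(n_embd, vocab_size)

idx = torch.randint(0, vocab_size, (2, 8))  # (batch, time) token ids
x = tok_emb(idx)                            # (batch, time, n_embd)
x = blocks(x)                               # (batch, time, n_embd)
logits = lm_head(x)                         # (batch, time, vocab_size)

# Prediction: a probability for every possible next token
probs = torch.softmax(logits[:, -1, :], dim=-1)
print(logits.shape, probs.shape)
```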

Why This Matters

This block is the core building unit of:

  • GPT-2
  • the GPT-family models behind ChatGPT

Much of what comes after is scaling this up: more blocks, wider embeddings, more heads.

Mental Model

Attention → gather context  
FeedForward → process  
Residual → preserve  
LayerNorm → stabilize  

What's Coming Next

Now we have the core engine.

Next step:

wrap everything into a full GPT model (final architecture + logits)