From Basics to Bots: My Weekly AI Engineering Adventure-41

Hi Pythonistas!

Up to now, we built everything in isolation:

attention → gathers context
multi-head → multiple perspectives
feedforward → transforms information

But these are just pieces.

The real power comes from how they are combined.

That combination is the Transformer Block.

Code

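import torch.nn as nn

# n_embd, n_head, MultiHeadAttention and FeedForward are assumed to be
# defined as in the earlier posts of this series.
class Block(nn.Module):
    def __init__(self):
        super().__init__()
        head_size = n_embd // n_head  # each head gets an equal slice of the embedding
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward()
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # normalize -> attend -> add back
        x = x + self.ffwd(self.ln2(x))  # normalize -> process -> add back
        return x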

What This Actually Does

Each block performs two main operations:

1. Attention (look around)
2. FeedForward (process)

And it does this with:

normalization
residual connections

Step 1: Layer Normalization (Before Attention)

self.ln1 = nn.LayerNorm(n_embd)

Used here:

self.sa(self.ln1(x))

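Why normalize?

stabilizes training
keeps each token's features in a consistent range (roughly zero mean and unit variance, followed by a learned scale and shift)
reduces the risk of activations exploding or vanishing as blocks are stacked

Note that the normalization happens before attention. This is the "pre-norm" layout used by GPT-2; the original Transformer paper normalized after each sub-layer instead.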
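To make "keeps values in a reasonable range" concrete, here is a small sketch. The tensor values and the size n_embd = 4 are made up for illustration. It checks that nn.LayerNorm rescales each token's feature vector on its own, so two tokens with very different magnitudes come out on the same scale.

```python
import torch
import torch.nn as nn

n_embd = 4  # illustrative size, not the value used in this series
ln = nn.LayerNorm(n_embd)

# Two tokens with very different scales: (batch=1, time=2, channels=4)
x = torch.tensor([[[1.0, 2.0, 3.0, 4.0],
                   [100.0, 200.0, 300.0, 400.0]]])

y = ln(x)

# Each token is normalized over its own channels (the last dimension).
token_mean = y.mean(dim=-1)
token_var = y.var(dim=-1, unbiased=False)

print(token_mean)  # close to 0 for both tokens
print(token_var)   # close to 1 for both tokens
```

After normalization the two tokens look almost identical, even though one started out 100 times larger. That is the kind of scale drift LayerNorm removes before attention sees the input.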

Step 2: Attention with Residual

x = x + self.sa(self.ln1(x))

This is important.

Instead of:

x = attention(x)

We do:

x = x + attention(x)

Why residual connection?

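It preserves the original information: attention only has to learn a correction that gets added to x, not a full replacement for it. The skip path also gives gradients a direct route back through many stacked blocks.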

Think:

don’t overwrite, just refine
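A quick way to see "don't overwrite, just refine" is to plug in a sub-layer that has learned nothing yet. In this sketch, a zero-initialized nn.Linear is a hypothetical stand-in for attention, and n_embd = 8 is an arbitrary size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_embd = 8  # illustrative size
x = torch.randn(2, 5, n_embd)  # (batch, time, channels)

# Hypothetical stand-in for attention: a linear layer that outputs zeros,
# i.e. a sub-layer that has not learned anything useful yet.
sublayer = nn.Linear(n_embd, n_embd)
nn.init.zeros_(sublayer.weight)
nn.init.zeros_(sublayer.bias)

without_residual = sublayer(x)    # x = attention(x)
with_residual = x + sublayer(x)   # x = x + attention(x)

print(torch.equal(with_residual, x))        # the input passes through unchanged
print(without_residual.abs().sum().item())  # the input is gone
```

With the residual, a sub-layer that contributes nothing still leaves x intact. Without it, the original signal is lost.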

Step 3: Layer Normalization (Again)

self.ln2 = nn.LayerNorm(n_embd)

Used here:

self.ffwd(self.ln2(x))

Step 4: FeedForward with Residual

x = x + self.ffwd(self.ln2(x))

Same pattern:

normalize
process
add back

What Changed for Me

Before this:

I saw attention and feedforward as separate layers

After this:

I realized the real unit is the block

Everything is built around this structure.

Full Flow Inside One Block

Input x
↓
LayerNorm
↓
Attention
↓
Add (residual)
↓
LayerNorm
↓
FeedForward
↓
Add (residual)
↓
Output x
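If you want to run this flow without the classes from the earlier posts, here is a self-contained sketch. It is not the Block from this series: nn.MultiheadAttention stands in for MultiHeadAttention, a small nn.Sequential stands in for FeedForward, and the 4x hidden size, the causal mask, and the name SketchBlock are my assumptions for illustration.

```python
import torch
import torch.nn as nn


class SketchBlock(nn.Module):
    """Pre-norm Transformer block built from PyTorch built-ins (stand-ins)."""

    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        # Stand-in for the series' MultiHeadAttention
        self.sa = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        # Stand-in for the series' FeedForward (4x expansion is an assumption)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # Assumed causal mask: True above the diagonal means
        # "this position may not attend to that future position".
        causal_mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)                                  # LayerNorm
        attn_out, _ = self.sa(h, h, h, attn_mask=causal_mask,
                              need_weights=False)        # Attention
        x = x + attn_out                                 # Add (residual)
        x = x + self.ffwd(self.ln2(x))                   # LayerNorm, FeedForward, Add
        return x


torch.manual_seed(0)
block = SketchBlock(n_embd=32, n_head=4)
x = torch.randn(2, 8, 32)  # (batch, time, channels)
with torch.no_grad():
    out = block(x)
print(out.shape)  # same shape as the input: torch.Size([2, 8, 32])
```

The output has the same shape as the input, which is what lets one block feed straight into the next.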

Important Detail

This block is repeated n_layer times, and each copy has its own weights.

Example:

self.blocks = nn.Sequential(*[Block() for _ in range(n_layer)])
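Here is a small sketch of what that one-liner gives you. TinyBlock is a hypothetical, stripped-down stand-in for Block (one residual sub-layer only), and the sizes are arbitrary. The point is only the stacking behaviour.

```python
import torch
import torch.nn as nn


class TinyBlock(nn.Module):
    """Hypothetical stand-in for Block: one pre-norm residual sub-layer."""

    def __init__(self, n_embd: int):
        super().__init__()
        self.ln = nn.LayerNorm(n_embd)
        self.ffwd = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        return x + self.ffwd(self.ln(x))


def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())


n_embd, n_layer = 16, 4  # illustrative values
blocks = nn.Sequential(*[TinyBlock(n_embd) for _ in range(n_layer)])

x = torch.randn(2, 5, n_embd)
out = blocks(x)  # block 1 -> block 2 -> ... -> block n_layer

print(out.shape)  # same shape as x
# True: the list comprehension builds n_layer separate blocks, each with its own weights
print(count_params(blocks) == n_layer * count_params(blocks[0]))
```

Stacking works mechanically because every block maps (batch, time, n_embd) to the same shape, so nn.Sequential can chain any number of them.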

Why stacking works

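Each layer:

refines the representation produced by the layer before it
can build on patterns that earlier layers already found

As a rough intuition (real models do not split this cleanly):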

Layer 1 → basic patterns
Layer 2 → relationships
Layer 3 → structure
Layer 4 → higher meaning

Where This Fits

Embedding
↓
[ Transformer Block × N ] ← (core engine)
↓
Linear Head
↓
Prediction
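To see how the shapes move through this pipeline, here is a minimal trace. It is only a sketch: nn.Identity is a placeholder for the stack of blocks, positional information is left out, the sizes are made up, and using argmax for the prediction is my assumption for illustration.

```python
import torch
import torch.nn as nn

vocab_size, n_embd = 50, 16  # illustrative values
B, T = 2, 6                  # batch size, sequence length

embedding = nn.Embedding(vocab_size, n_embd)
blocks = nn.Identity()       # placeholder for [ Transformer Block x N ]
lm_head = nn.Linear(n_embd, vocab_size)

idx = torch.randint(0, vocab_size, (B, T))  # token ids, shape (B, T)

x = embedding(idx)                   # (B, T) -> (B, T, n_embd)
x = blocks(x)                        # shape unchanged
logits = lm_head(x)                  # (B, T, n_embd) -> (B, T, vocab_size)
prediction = logits.argmax(dim=-1)   # (B, T): highest-scoring token id per position

print(x.shape, logits.shape, prediction.shape)
```

Only the embedding and the linear head change the last dimension. The blocks in the middle keep it at n_embd the whole way through.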

Why This Matters

This block is the core building unit of:

GPT-2
the GPT models behind ChatGPT

Architecturally, most of the rest is scaling this: more blocks, wider embeddings, more heads.

Mental Model

Attention → gather context
FeedForward → process
Residual → preserve
LayerNorm → stabilize

What's Coming Next

Now we have the core engine.

Next step:

wrap everything into a full GPT model (final architecture + logits)