From Basics to Bots: My Weekly AI Engineering Adventure-39

Multi-Head Attention: One View Isn't Enough

Posted by Afsal on 15-May-2026

Hi Pythonistas!

In the last post, I built a single attention head.

That already felt powerful:

  • each token looks at others
  • assigns importance
  • builds context

But there’s a limitation I didn’t notice immediately:

a single attention head can only learn one type of relationship at a time

That’s not enough for language.

Code

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        # num_heads independent attention heads (Head is the class from the last post)
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        # projection that mixes the concatenated head outputs
        # (n_embd and dropout are globals from the training script)
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # run every head on the same input, then stack results along the channel dim
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)     # blend information across heads
        out = self.dropout(out)  # regularize during training
        return out

What This Actually Does

Instead of one attention mechanism:

we run multiple attention heads in parallel

Each head:

  • sees the same input
  • learns different relationships

Step 1: Create Multiple Heads

self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

If num_heads is 4:

We now have:

Head 1
Head 2
Head 3
Head 4

All independent.

Think of each head as a specialist:

one focuses on grammar
one on subject-object relation
one on long-distance dependencies
one on local patterns

Not explicitly coded but learned during training.
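A tiny sketch shows what nn.ModuleList gives us here. I'm using nn.Linear as a stand-in for the Head class (the real Head lives in the previous post); the point is that every entry gets its own independent weights:

```python
import torch.nn as nn

num_heads, head_size = 4, 16

# stand-in for Head: four independent modules, each with its own parameters
heads = nn.ModuleList([nn.Linear(head_size, head_size) for _ in range(num_heads)])

print(len(heads))                          # 4
print(heads[0].weight is heads[1].weight)  # False -> separate weights
```

Because the weight tensors are separate, each head can drift toward a different specialty during training.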

Step 2: Run All Heads

[h(x) for h in self.heads]

Each head produces:

[B, T, head_size]
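To see those shapes concretely, here's a toy run with made-up sizes (again using nn.Linear as a stand-in for Head, which maps n_embd down to head_size):

```python
import torch
import torch.nn as nn

B, T, num_heads, head_size = 2, 8, 4, 16
n_embd = num_heads * head_size  # 64

# stand-in heads: each maps [B, T, n_embd] -> [B, T, head_size]
heads = nn.ModuleList([nn.Linear(n_embd, head_size) for _ in range(num_heads)])

x = torch.randn(B, T, n_embd)
outs = [h(x) for h in heads]
print(outs[0].shape)  # torch.Size([2, 8, 16]) -> [B, T, head_size]
```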

Step 3: Concatenate Outputs

out = torch.cat(..., dim=-1)

This combines all heads into one tensor:

[B, T, n_embd]

Because:

n_embd = num_heads × head_size
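The shape arithmetic is easy to verify with random tensors standing in for the per-head outputs:

```python
import torch

B, T, num_heads, head_size = 2, 8, 4, 16
n_embd = num_heads * head_size  # 64

# four per-head outputs, each [B, T, head_size]
head_outs = [torch.randn(B, T, head_size) for _ in range(num_heads)]

out = torch.cat(head_outs, dim=-1)
print(out.shape)  # torch.Size([2, 8, 64]) -> [B, T, n_embd]
```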

Step 4: Linear Projection

self.proj = nn.Linear(n_embd, n_embd)
out = self.proj(out)

Why?

Concatenation only places the head outputs side by side; the projection is what actually mixes them together.

Without this:

  • heads remain independent
  • model can’t combine insights
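A quick sketch of the projection step, with toy sizes: the shape stays [B, T, n_embd], but every output channel is now a weighted combination of all input channels, so information crosses head boundaries:

```python
import torch
import torch.nn as nn

B, T, n_embd = 2, 8, 64
proj = nn.Linear(n_embd, n_embd)

concatenated = torch.randn(B, T, n_embd)  # heads stacked side by side
out = proj(concatenated)
print(out.shape)  # torch.Size([2, 8, 64]) -> shape unchanged
# each of the 64 output channels mixes ALL 64 input channels,
# so the heads' separate views get blended
```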

Step 5: Dropout

out = self.dropout(out)

Adds randomness during training:

  • prevents overfitting
  • improves generalization
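Worth remembering: dropout only fires in training mode. A toy check makes the train/eval difference visible:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()  # training mode: roughly half the values are zeroed
zero_frac = (drop(x) == 0).float().mean()
print(zero_frac)  # close to 0.5

drop.eval()   # eval mode: dropout is a no-op
print(torch.equal(drop(x), x))  # True
```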

What Changed for Me

With one head:

model has a single "view" of the sequence

With multiple heads:

model can look at the same sequence in different ways simultaneously

Example

Input:

"The cat sat on the mat"

Different heads might learn:

Head 1 → "cat ↔ sat"
Head 2 → "sat ↔ on"
Head 3 → "cat ↔ mat" (long-range)
Head 4 → local word patterns


All heads:

  • share same input
  • have different weights

So they evolve differently during training.
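To run the whole class outside the full training script, here's a self-contained sketch. The Head below is a simplified version of the one from the previous post (causal mask and scaling included; I've left out its dropout and registered tril buffer to keep it short):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_embd, dropout, block_size = 64, 0.1, 8

class Head(nn.Module):
    """Minimal causal self-attention head (simplified from the last post)."""
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # scaled scores
        mask = torch.tril(torch.ones(T, T, device=x.device)).bool()
        wei = wei.masked_fill(~mask, float('-inf'))  # causal: no peeking ahead
        wei = F.softmax(wei, dim=-1)
        return wei @ v  # [B, T, head_size]

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.dropout(self.proj(out))

mha = MultiHeadAttention(num_heads=4, head_size=n_embd // 4)
x = torch.randn(2, block_size, n_embd)  # [B, T, n_embd]
print(mha(x).shape)                     # torch.Size([2, 8, 64])
```

Input shape in, same shape out: that's what lets us drop this block into the transformer stack without touching anything around it.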

Where This Fits

Embedding
 ↓
Multi-Head Attention   ← (you are here)
 ↓
FeedForward
 ↓
Transformer Block

This is one of the key upgrades from older models.

Without multi-head:

  • limited representation
  • weaker context understanding

With multi-head:

  • richer relationships
  • better language modeling

This is a core idea behind models like GPT-2.

What's Coming Next

So far:

attention gathers information

Next step:

process that information. That's where the FeedForward layer comes in, and where the transformation happens.