From Basics to Bots: My Weekly AI Engineering Adventure-38

Self-Attention: Where the Model Actually Starts "Thinking"

Posted by Afsal on 08-May-2026

Hi Pythonistas!

Up to now:

text → numbers ✔
numbers → vectors ✔

But those vectors are still independent.

The model still doesn’t know:

  • which words relate to each other
  • what context matters
  • what to focus on

This is where everything changes.

Self-attention is the first place where the model actually starts using context.

Here is the full single-head implementation in PyTorch. We'll walk through it step by step:

import torch
import torch.nn as nn

# assumed hyperparameters for this walkthrough
n_embd = 32      # embedding dimension
block_size = 8   # maximum context length

class Head(nn.Module):
    """One head of causal self-attention."""

    def __init__(self, head_size):
        super().__init__()
        # project each token's embedding into query, key and value spaces
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

        # lower-triangular matrix used to hide future positions
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape  # batch, time (tokens), channels (n_embd)

        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)

        # attention scores, scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2, -1) * (k.shape[-1] ** -0.5)  # (B, T, T)
        # each token may only attend to itself and earlier tokens
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = torch.softmax(wei, dim=-1)  # rows sum to 1

        v = self.value(x)  # (B, T, head_size)
        out = wei @ v      # (B, T, head_size): weighted sum of values

        return out
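
A quick smoke test (head_size = 16 here is just an example value, like the other hyperparameters above):

torch.manual_seed(0)

x = torch.randn(4, block_size, n_embd)   # random batch: (B=4, T=8, C=32)
head = Head(head_size=16)
out = head(x)

print(out.shape)   # torch.Size([4, 8, 16]) -> (B, T, head_size)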

What This Actually Does

At a high level:

Each token looks at other tokens and decides how much they matter.

Step 1: Create Q, K, V

k = self.key(x)
q = self.query(x)
v = self.value(x)

Each token is projected into three different spaces:

Query (Q) → what am I looking for?
Key (K) → what do I contain?
Value (V) → what information do I pass?
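
A minimal sketch of the three projections in isolation (n_embd = 32 and head_size = 16 are just example sizes):

import torch
import torch.nn as nn

n_embd, head_size = 32, 16
x = torch.randn(1, 5, n_embd)            # one sequence of 5 tokens

key = nn.Linear(n_embd, head_size, bias=False)
query = nn.Linear(n_embd, head_size, bias=False)
value = nn.Linear(n_embd, head_size, bias=False)

k, q, v = key(x), query(x), value(x)     # three views of the same tokens
print(k.shape, q.shape, v.shape)         # each: torch.Size([1, 5, 16])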

Intuition

Think:

Token = "sat"
Query: "I want context about subject"
Key (from "cat"): "I am a subject"
Value: actual information from "cat"

Step 2: Compute Similarity

wei = q @ k.transpose(-2, -1)

This creates a matrix:

how much token i cares about token j

Example (scores shown after softmax, so they're easy to read):

        the   cat   sat
sat    0.1   0.8   0.1

"sat" strongly attends to "cat"

Step 3: Scale

* (k.shape[-1] ** -0.5)

This is the standard 1/√dₖ scaling: dot products grow with the dimension of the query and key vectors, so dividing by √head_size keeps the scores in a stable range.

Without this:

  • training becomes unstable
  • softmax becomes too sharp
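
You can see the effect in a quick experiment (not from the post, just a sanity check): dot products of random vectors grow with dimension, and the scaling pulls their variance back to roughly 1.

import torch

d = 64
q = torch.randn(10000, d)
k = torch.randn(10000, d)

scores = (q * k).sum(dim=-1)          # 10000 raw dot products
print(scores.var())                   # ~64: variance grows with d
print((scores * d ** -0.5).var())     # ~1 after scaling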

Step 4: Mask Future Tokens

wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))

This is critical.

It ensures:

the model cannot see future tokens

Example

Input: "the cat sat"

When predicting "sat":

can see "the", "cat"
cannot see future words

This is what makes it GPT-style (causal)
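
Here's the mask on its own (a minimal sketch; the -inf entries become exactly 0 after the softmax in the next step):

import torch

T = 3
tril = torch.tril(torch.ones(T, T))
print(tril)
# tensor([[1., 0., 0.],
#         [1., 1., 0.],
#         [1., 1., 1.]])

wei = torch.zeros(T, T)
wei = wei.masked_fill(tril == 0, float('-inf'))
print(wei)
# tensor([[0., -inf, -inf],
#         [0., 0., -inf],
#         [0., 0., 0.]])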

Step 5: Softmax

wei = torch.softmax(wei, dim=-1)

Convert scores → probabilities
[0.1, 0.8, 0.1]
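
And because exp(-inf) = 0, masked positions get exactly zero weight while each row still sums to 1. A quick check:

import torch

row = torch.tensor([0.5, 2.0, float('-inf')])   # third position is masked
print(torch.softmax(row, dim=-1))
# tensor([0.1824, 0.8176, 0.0000]) -> the masked token is ignored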

Step 6: Apply to Values

out = wei @ v

Now we combine information:

weighted sum of important tokens
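
Reusing the "sat" row from earlier with made-up 2-dimensional values, the output for "sat" is literally 10% of "the", 80% of "cat", and 10% of itself:

import torch

wei = torch.tensor([0.1, 0.8, 0.1])    # attention weights for "sat"
v = torch.tensor([[1.0, 0.0],          # value of "the"
                  [0.0, 2.0],          # value of "cat"
                  [0.5, 0.5]])         # value of "sat"

out = wei @ v
print(out)   # tensor([0.1500, 1.6500]) -> dominated by "cat"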

Before this:

I thought models "process sequences"

After this:

they dynamically decide what to focus on

Every token:

  • looks at others
  • assigns importance
  • builds its own context

Important Detail

This happens for:

  • every token
  • every layer
  • every training step

Which means: context is continuously recomputed

Where This Fits

Text
 ↓
Tokenization
 ↓
Embedding
 ↓
Self-Attention   ← (this is the core)
 ↓
Transformer
 ↓
Prediction

This is the idea behind models like GPT-2.

Without attention:

models struggle with long-range dependencies

With attention:

the model can relate any token to any previous token

What's Coming Next

Right now we have one attention head. But in practice, one perspective is not enough.

Next step: Multi-Head Attention - multiple parallel views of the same sequence