From Basics to Bots: My Weekly AI Engineering Adventure-31

Attention - How the Model Knows What to Focus On

Posted by Afsal on 20-Mar-2026

Hi Pythonistas!

We have tokens. We turned them into embeddings.
Now the big problem: In a long sentence, which parts actually matter right now?
Humans do this naturally. Models don’t, unless we teach them. That’s exactly what attention does.

The Core Idea of Attention
Attention answers one simple question: When predicting the next token, which previous tokens should I care about more?
Not all tokens are equally important.
In the sentence:
"The animal didn’t cross the road because it was tired"
What does "it" refer to: the animal or the road?

Attention helps the model figure this out.

Attention Is Weighting, Not Memory

Important clarification:

  • Attention does not store memory
  • It assigns weights
  • Each token looks at other tokens and asks: How relevant are you to me right now?

Some tokens get:

  • High weight
  • More influence

Others fade into the background.
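These weights come from a softmax: raw relevance scores are turned into positive numbers that sum to 1, so a few tokens can dominate while the rest fade. A minimal sketch (the scores here are made-up toy numbers, not from a real model):

```python
import math

def softmax(scores):
    """Turn raw relevance scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical relevance scores of three earlier tokens for the current token
scores = [2.0, 0.5, -1.0]
weights = softmax(scores)
print(weights)  # the highest score gets by far the largest weight
```

Notice how the gap between scores gets amplified: the token scoring 2.0 ends up with most of the influence.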

Every token embedding is transformed into three vectors:

  • Query → what I’m looking for
  • Key → what I offer
  • Value → my actual content

Think of it like this:
I compare my query with everyone else’s keys. Whoever matches best, I listen to their values.
That's attention.
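That comparison is just a scaled dot product followed by a softmax. Here is a minimal sketch for a single token's query against a set of keys and values, with toy vectors chosen only to show the mechanics:

```python
import numpy as np

def attend(query, keys, values):
    """One token's query scores every key; the weights then mix the values."""
    scores = keys @ query / np.sqrt(len(query))  # scaled dot-product similarity
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax -> attention weights
    return weights @ values                      # weighted sum of values

query  = np.array([1.0, 0.0])
keys   = np.array([[1.0, 0.0],    # this key matches the query well
                   [0.0, 1.0]])   # this one matches poorly
values = np.array([[10.0, 0.0],
                   [0.0, 10.0]])

out = attend(query, keys, values)
# The output leans toward the first value row, because its key matched best
```

"Whoever matches best, I listen to their values" is literally the last line: values weighted by how well their keys matched my query.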

Self-Attention: Tokens Talking to Tokens

In self-attention, tokens attend to each other.
Each token:

  • Looks at all other tokens
  • Decides relevance
  • Builds a new, richer representation

This happens in parallel, not sequentially. This is why Transformers are fast.
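In matrix form, "in parallel" is one batch of matrix multiplies: every token's query scores every token's key at once. A toy sketch with random projection matrices standing in for learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                          # embedding size
X = rng.normal(size=(3, d))    # 3 token embeddings (random stand-ins)

# Learned projections in a real model; random here, just for illustration
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores = Q @ K.T / np.sqrt(d)                            # every token scores every token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)           # softmax per token (per row)
out = weights @ V                                        # all 3 tokens updated at once
```

There is no loop over positions: all tokens build their richer representation in the same matrix multiply, which is exactly why this parallelizes so well on GPUs.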

Context Becomes Dynamic

Because of attention:

  • The same word can behave differently
  • Meaning adapts to context

Earlier we said embeddings can be contextual. Attention is how that context is built.
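You can see this directly: feed the same word vector through self-attention next to two different neighbors, and you get two different outputs. A toy sketch (the vectors are random stand-ins for word embeddings like "bank", "river", "money"):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
bank  = rng.normal(size=d)   # stand-in embedding for "bank"
river = rng.normal(size=d)
money = rng.normal(size=d)

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def self_attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

ctx_a = self_attention(np.stack([river, bank]))[1]  # "bank" next to "river"
ctx_b = self_attention(np.stack([money, bank]))[1]  # "bank" next to "money"
# Same input embedding for "bank", two different contextual outputs
```

The static embedding for "bank" went in identical both times; the attention output differs because the neighbors it mixed in differ. That is contextual meaning being built.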

Why Attention Was a Breakthrough

Before attention:

  • RNNs processed tokens one by one
  • Long-range dependencies were painful

Attention:

  • Sees the entire sequence at once
  • Handles long-distance relationships easily

This single idea reshaped deep learning.

Attention Is Not "Understanding"

Attention does not mean:

  • Comprehension
  • Reasoning
  • Conscious focus

It’s:

  • Learned relevance scoring
  • Optimized for prediction

And yet it works shockingly well.

What I Learned This Week 

  • Attention decides what matters
  • Tokens assign weights to other tokens
  • Query, Key, Value enable matching
  • Self-attention lets tokens interact
  • Context becomes flexible and dynamic

What's Coming Next

Next week we will learn about the Transformer architecture.