Hi Pythonistas!
We have tokens. We turned them into embeddings.
Now the big problem: In a long sentence, which parts actually matter right now?
Humans do this naturally. Models don't, unless we teach them. That's exactly what attention does.
The Core Idea of Attention
Attention answers one simple question: When predicting the next token, which previous tokens should I care about more?
Not all tokens are equally important.
In the sentence:
"The animal didn’t cross the road because it was tired"
What does "it" refer to? The animal, or the road?
Attention helps the model figure this out.
Attention Is Weighting, Not Memory
Important clarification:
- Attention does not store memory
- It assigns weights
- Each token looks at other tokens and asks: How relevant are you to me right now?
Some tokens get:
- High weight
- More influence

Others fade into the background.
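That weighting is usually done with a softmax: raw relevance scores are turned into positive weights that sum to 1, so high-scoring tokens dominate and low-scoring ones fade. Here is a minimal sketch with made-up scores (the numbers are illustrative, not from a real model):

```python
import math

def softmax(scores):
    # Exponentiate each score, then normalize so the weights sum to 1
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical relevance scores of three earlier tokens for the current token
scores = [2.0, 0.5, -1.0]
weights = softmax(scores)
print(weights)  # the highest-scoring token gets most of the weight
```

Notice that nothing is stored: the weights are recomputed from scratch for every token, every time.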
Every token embedding is transformed into three vectors:
Query → what I’m looking for
Key → what I offer
Value → my actual content
Think of it like this:
I compare my query with everyone else’s keys. Whoever matches best, I listen to their values.
That's attention.
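The query/key/value matching above can be sketched in a few lines of NumPy. This is a toy version of scaled dot-product self-attention: the sizes, random projections, and seed are all illustrative, not from any particular model.

```python
import numpy as np

np.random.seed(0)
seq_len, d_model = 4, 8                 # 4 tokens, 8-dim embeddings (toy sizes)
x = np.random.randn(seq_len, d_model)   # token embeddings

# Projection matrices (random here; learned during training in a real model)
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q, K, V = x @ W_q, x @ W_k, x @ W_v     # query, key, value for every token

# Compare every query with every key, scale, then softmax each row
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Each token's new representation is a weighted mix of all the values
output = weights @ V
print(weights.shape, output.shape)      # (4, 4) and (4, 8)
```

One matrix multiply computes the scores for all token pairs at once, which is exactly the "in parallel, not sequentially" property discussed below.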
Self-Attention: Tokens Talking to Tokens
In self-attention, tokens attend to each other.
Each token:
- Looks at all other tokens
- Decides relevance
- Builds a new, richer representation
This happens in parallel, not sequentially. This is why Transformers are fast.
Context Becomes Dynamic
Because of attention:
- The same word can behave differently
- Meaning adapts to context
Earlier we said embeddings can be contextual. Attention is how that context is built.
Why Attention Was a Breakthrough
Before attention:
- RNNs processed tokens one by one
- Long-range dependencies were painful
Attention:
- Sees the entire sequence at once
- Handles long-distance relationships easily
This single idea reshaped deep learning.
Attention Is Not "Understanding"
Attention does not mean:
- Comprehension
- Reasoning
- Conscious focus
It’s:
- Learned relevance scoring
- Optimized for prediction

And yet it works shockingly well.
What I Learned This Week
- Attention decides what matters
- Tokens assign weights to other tokens
- Query, Key, Value enable matching
- Self-attention lets tokens interact
- Context becomes flexible and dynamic
What's Coming Next
Next week we will learn about the Transformer architecture.