Hi Pythonistas!
We have tokens. We turned them into embeddings.
Now the big problem: In a long sentence, which parts actually matter right now?
Humans do this naturally. Models don't, unless we teach them. That's exactly what attention does.
The Core Idea of Attention
Attention answers one simple question: When predicting the next token, which previous tokens should I care about more?
Not all tokens are equally important.
In the sentence:
"The animal didn’t cross the road because it was tired"
What does "it" refer to? The animal, or the road?
Attention helps the model figure this out.
Attention Is Weighting, Not Memory
Important clarification:
- Attention does not store memory
- It assigns weights
- Each token looks at other tokens and asks: How relevant are you to me right now?
Some tokens get:
- High weight
- More influence

Others fade into the background.
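That weighting is usually done with a softmax: raw relevance scores are turned into positive weights that sum to 1, so high-scoring tokens dominate and low-scoring ones fade. Here is a minimal sketch with made-up scores (the numbers are illustrative, not from a real model):

```python
import math

def softmax(scores):
    # Exponentiate each score, then normalize so the weights sum to 1
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical relevance scores of three earlier tokens for the current token
scores = [2.0, 0.5, -1.0]
weights = softmax(scores)
print(weights)  # the highest-scoring token gets most of the weight
```

Notice that nothing is stored: the weights are recomputed from scratch for every token, every time.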
Every token embedding is transformed into three vectors:
Query → what I’m looking for
Key → what I offer
Value → my actual content
Think of it like this:
I compare my query with everyone else’s keys. Whoever matches best, I listen to their values.
That's attention.
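The query/key/value matching above can be sketched in a few lines of NumPy. This is a toy version of scaled dot-product self-attention: the sizes, random projections, and seed are all illustrative, not from any particular model.

```python
import numpy as np

np.random.seed(0)
seq_len, d_model = 4, 8                 # 4 tokens, 8-dim embeddings (toy sizes)
x = np.random.randn(seq_len, d_model)   # token embeddings

# Projection matrices (random here; learned during training in a real model)
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q, K, V = x @ W_q, x @ W_k, x @ W_v     # query, key, value for every token

# Compare every query with every key, scale, then softmax each row
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Each token's new representation is a weighted mix of all the values
output = weights @ V
print(weights.shape, output.shape)      # (4, 4) and (4, 8)
```

One matrix multiply computes the scores for all token pairs at once, which is exactly the "in parallel, not sequentially" property discussed below.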
Self-Attention: Tokens Talking to Tokens
In self-attention, tokens attend to each other.
Each token:
- Looks at all other tokens
- Decides relevance
- Builds a new, richer representation
This happens in parallel, not sequentially. This is why Transformers are fast.
Context Becomes Dynamic
Because of attention:
- The same word can behave differently
- Meaning adapts to context
Earlier we said embeddings can be contextual. Attention is how that context is built.
Why Attention Was a Breakthrough
Before attention:
- RNNs processed tokens one by one
- Long-range dependencies were painful
Attention:
- Sees the entire sequence at once
- Handles long-distance relationships easily
This single idea reshaped deep learning.
Attention Is Not "Understanding"
Attention does not mean:
- Comprehension
- Reasoning
- Conscious focus
It’s:
- Learned relevance scoring
- Optimized for prediction

And yet it works shockingly well.
What I Learned This Week
- Attention decides what matters
- Tokens assign weights to other tokens
- Query, Key, Value enable matching
- Self-attention lets tokens interact
- Context becomes flexible and dynamic
What's Coming Next
Next week we will learn about the Transformer architecture.