Hi Pythonistas!
In the last post, I converted text into integers. At that point, the data looked like this:
[1, 0, 2, 2, 3]
Now the obvious question is: how does a model learn anything from just integers?
Short answer: it doesn’t.
Numbers like 1, 0, 2 don’t carry meaning by themselves. The model needs a richer representation.
That’s where embeddings come in.
Here's the code:
import torch
import torch.nn as nn

vocab_size = 100   # placeholder: use the vocabulary size your tokenizer produced in the last post
n_embd = 64        # embedding dimension
block_size = 64    # maximum sequence length

token_embedding = nn.Embedding(vocab_size, n_embd)      # one learnable vector per token id
position_embedding = nn.Embedding(block_size, n_embd)   # one learnable vector per position

def embed(x):
    B, T = x.shape                                               # batch size, sequence length
    tok = token_embedding(x)                                     # (B, T, n_embd)
    pos = position_embedding(torch.arange(T, device=x.device))   # (T, n_embd)
    return tok + pos                                             # broadcast add -> (B, T, n_embd)
We’re converting:
[1, 0, 2, 2, 3]
into:
[
[0.12, -0.45, ..., 0.88],
[0.91, 0.10, ..., -0.22],
...
]
Each token becomes a vector.
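A quick way to see this is to check the shapes. This is a minimal sketch using the embed function above; the batch of ids is made up for illustration:

x = torch.tensor([[1, 0, 2, 2, 3]])   # shape (B=1, T=5): one sequence of 5 token ids
out = embed(x)
print(out.shape)                      # torch.Size([1, 5, 64]): each id is now a 64-dim vector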
Step 1: Token Embedding
token_embedding = nn.Embedding(vocab_size, n_embd)
This is basically a lookup table.
Think of it like:
{
0: [vector],
1: [vector],
2: [vector],
...
}
So when we do:
tok = token_embedding(x)
Each integer gets replaced by its vector.
Example
Input:
[1, 0, 2]
Output:
[
[0.2, -0.1, ...],
[0.5, 0.3, ...],
[0.9, -0.7, ...]
]
These vectors are:
- randomly initialized
- learned during training
So the model slowly figures out:
- which tokens are similar
- which tokens are important
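If you want to convince yourself it really is just a lookup, this small check (a sketch, using the token_embedding defined earlier) shows that calling the layer is the same as indexing into its weight matrix:

ids = torch.tensor([1, 0, 2])

looked_up = token_embedding(ids)            # what the layer returns
indexed = token_embedding.weight[ids]       # manual row lookup in the weight matrix

print(torch.equal(looked_up, indexed))      # True: nn.Embedding is a learnable lookup table
print(token_embedding.weight.requires_grad) # True: the rows get updated during training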
Step 2: Position Embedding
position_embedding = nn.Embedding(block_size, n_embd)
This part is easy to miss, but very important.
Because:
the model has no idea about order
Without this:
"cat sat"
"sat cat"
would look identical.
How it works
pos = position_embedding(torch.arange(T))
This creates vectors like:
position 0 → vector
position 1 → vector
position 2 → vector
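Concretely, for a sequence of length T you get one learned vector per position (a small sketch using the position_embedding defined above):

T = 5
pos = position_embedding(torch.arange(T))   # positions 0, 1, 2, 3, 4
print(pos.shape)                            # torch.Size([5, 64]): one 64-dim vector per position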
Step 3: Combine Them
return tok + pos
We simply add:
token meaning + position info
So now each token knows:
- what it is
- where it is
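One way to see the effect (a minimal sketch with a made-up sequence): the token id 2 appears twice below, so its token vector is identical in both places, but after adding the position vectors the two occurrences end up different:

x = torch.tensor([[1, 0, 2, 2, 3]])        # token id 2 appears at positions 2 and 3
out = embed(x)

tok = token_embedding(x)
print(torch.equal(tok[0, 2], tok[0, 3]))   # True: same id, same token vector
print(torch.equal(out[0, 2], out[0, 3]))   # False: different positions, different final vectors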
What Changed for Me
Before this, I thought:
embeddings = just some preprocessing trick
After implementing it:
embeddings are where meaning starts to emerge
Because:
- similar tokens start getting similar vectors
- relationships get encoded numerically
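After training, you can probe this directly. Here's a sketch (the ids are placeholders; in practice you'd pick ids your tokenizer assigned to related words) using cosine similarity between two embedding rows:

import torch.nn.functional as F

id_a, id_b = 5, 9                          # hypothetical ids for two tokens you expect to be related
vec_a = token_embedding.weight[id_a]
vec_b = token_embedding.weight[id_b]

similarity = F.cosine_similarity(vec_a, vec_b, dim=0)
print(similarity.item())                   # closer to 1.0 -> the model treats these tokens as more similar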
The model never sees:
"hello"
It doesn’t even see:
[1, 0, 2, 2, 3]
It sees:
vectors in 64-dimensional space
That’s the actual input to the transformer.
Where This Fits
Text
↓
Tokenization
↓
Embedding ← (you are here)
↓
Attention
↓
Transformer
↓
Prediction
Why This Matters
Without embeddings:
- the model can’t learn relationships
- everything is just discrete integers
With embeddings:
- the model gets a continuous space to learn patterns
What's Coming Next
Now we have meaningful vectors. But how does the model decide which tokens to focus on?
That's where things get interesting: Self-Attention.