Hi Pythonistas!
We already covered the theory in the previous post. This time I wanted to do something more practical:
implement the simplest possible tokenizer and see what actually goes into a model.
No abstractions. No libraries. Just raw Python.
with open("data.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Every unique character in the dataset, sorted for a stable ordering
chars = sorted(list(set(text)))
vocab_size = len(chars)

# Lookup tables: string-to-integer and integer-to-string
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    return [stoi[c] for c in s]

def decode(l):
    return ''.join([itos[i] for i in l])
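To see the whole thing work end to end, here is the same pipeline run on an inline string instead of data.txt, so the snippet is self-contained:

```python
# Same pipeline as above, but on an inline string so it runs without data.txt
text = "hello"

chars = sorted(list(set(text)))   # ['e', 'h', 'l', 'o']
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    return [stoi[c] for c in s]

def decode(l):
    return ''.join([itos[i] for i in l])

print(encode("hello"))          # [1, 0, 2, 2, 3]
print(decode([1, 0, 2, 2, 3]))  # hello
```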
What This Actually Does
I didn't use any tokenizer library here. This is the most basic version you can build.
Step 1: Extract vocabulary
chars = sorted(list(set(text)))
This line builds the entire vocabulary from the dataset.
If the dataset is:
"hello"
You get:
['e','h','l','o']
That's it. No words, no tokens, just characters.
Step 2: Assign integer IDs
stoi = {ch: i for i, ch in enumerate(chars)}
Now every character gets mapped to a number:
{'e':0, 'h':1, 'l':2, 'o':3}
This is the first real transformation:
text → integers
Step 3: Encoding
encode("hello")
Output:
[1, 0, 2, 2, 3]
At this point, the model will never see text again.
Everything downstream works with numbers.
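One consequence of working with a fixed lookup table: any character that never appeared in the dataset has no ID, so encode simply fails on it. A quick sketch (vocabulary built only from "hello" here):

```python
# Vocabulary built only from the string "hello"
chars = sorted(list(set("hello")))
stoi = {ch: i for i, ch in enumerate(chars)}

def encode(s):
    return [stoi[c] for c in s]

try:
    encode("hex")  # 'x' was never in the dataset
except KeyError as e:
    print("unknown character:", e)  # unknown character: 'x'
```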
Step 4: Decoding
decode([1,0,2,2,3])
Output:
"hello"
This is mainly for debugging and output readability.
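Since encode and decode are exact inverses over the vocabulary, a round-trip assertion makes a cheap sanity check while debugging (again using "hello" as the dataset):

```python
chars = sorted(list(set("hello")))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    return [stoi[c] for c in s]

def decode(l):
    return ''.join([itos[i] for i in l])

# Round-trip check: decoding an encoding must return the original text
assert decode(encode("hello")) == "hello"
```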
This approach works, but it’s extremely naive.
Example:
"understanding"
becomes:
['u','n','d','e','r','s','t','a','n','d','i','n','g']
That’s 13 tokens for one word.
Which means:
longer sequences
slower training
weaker pattern learning
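The sequence-length cost is easy to measure. Comparing character-level tokens against a naive whitespace split (used here purely for comparison, not as a real tokenizer):

```python
sentence = "understanding tokenization from scratch"

# Character-level: one token per character, including spaces
char_tokens = list(sentence)

# A naive word-level split, just to show the difference in sequence length
word_tokens = sentence.split()

print(len(char_tokens))  # 39
print(len(word_tokens))  # 4
```

Roughly a 10x longer sequence for the same sentence, which is exactly why real tokenizers work with larger units.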
Why I Still Started Here
Even with all the limitations, this approach helped me:
understand the full pipeline end-to-end
remove any "magic" from tokenization
clearly see how text becomes model input
Where This Fits
Text
↓
Tokenization ← (this)
↓
Embedding
↓
Transformer
↓
Prediction
What's Coming Next
Now that text is converted into integers, the next question is: how does the model actually understand these numbers?
That’s where embeddings come in.