Hi Pythonistas!
We already covered the theory in the previous post. This time I wanted to do something more practical:
implement the simplest possible tokenizer and see what actually goes into a model.
No abstractions. No libraries. Just raw Python.
with open("data.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Every unique character in the dataset, sorted for a stable ordering
chars = sorted(list(set(text)))
vocab_size = len(chars)

# Lookup tables: string-to-integer and integer-to-string
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    return [stoi[c] for c in s]

def decode(l):
    return ''.join([itos[i] for i in l])
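To see the whole thing work end to end, here is the same pipeline run on an inline string instead of data.txt, so the snippet is self-contained:

```python
# Same pipeline as above, but on an inline string so it runs without data.txt
text = "hello"

chars = sorted(list(set(text)))   # ['e', 'h', 'l', 'o']
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    return [stoi[c] for c in s]

def decode(l):
    return ''.join([itos[i] for i in l])

print(encode("hello"))          # [1, 0, 2, 2, 3]
print(decode([1, 0, 2, 2, 3]))  # hello
```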
What This Actually Does
I didn't use any tokenizer library here. This is the most basic version you can build.
Step 1: Extract vocabulary
chars = sorted(list(set(text)))
This line builds the entire vocabulary from the dataset.
If the dataset is:
"hello"
You get:
['e','h','l','o']
That's it. No words, no tokens, just characters.
Step 2: Assign integer IDs
stoi = {ch: i for i, ch in enumerate(chars)}
Now every character gets mapped to a number:
{'e':0, 'h':1, 'l':2, 'o':3}
This is the first real transformation:
text → integers
Step 3: Encoding
encode("hello")
Output:
[1, 0, 2, 2, 3]
At this point, the model will never see text again.
Everything downstream works with numbers.
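One consequence of working with a fixed lookup table: any character that never appeared in the dataset has no ID, so encode simply fails on it. A quick sketch (vocabulary built only from "hello" here):

```python
# Vocabulary built only from the string "hello"
chars = sorted(list(set("hello")))
stoi = {ch: i for i, ch in enumerate(chars)}

def encode(s):
    return [stoi[c] for c in s]

try:
    encode("hex")  # 'x' was never in the dataset
except KeyError as e:
    print("unknown character:", e)  # unknown character: 'x'
```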
Step 4: Decoding
decode([1,0,2,2,3])
Output:
"hello"
This is mainly for debugging and output readability.
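Since encode and decode are exact inverses over the vocabulary, a round-trip assertion makes a cheap sanity check while debugging (again using "hello" as the dataset):

```python
chars = sorted(list(set("hello")))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    return [stoi[c] for c in s]

def decode(l):
    return ''.join([itos[i] for i in l])

# Round-trip check: decoding an encoding must return the original text
assert decode(encode("hello")) == "hello"
```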
This approach works, but it’s extremely naive.
Example:
"understanding"
becomes:
['u','n','d','e','r','s','t','a','n','d','i','n','g']
That’s 13 tokens for one word.
Which means:
longer sequences
slower training
weaker pattern learning
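The sequence-length cost is easy to measure. Comparing character-level tokens against a naive whitespace split (used here purely for comparison, not as a real tokenizer):

```python
sentence = "understanding tokenization from scratch"

# Character-level: one token per character, including spaces
char_tokens = list(sentence)

# A naive word-level split, just to show the difference in sequence length
word_tokens = sentence.split()

print(len(char_tokens))  # 39
print(len(word_tokens))  # 4
```

Roughly a 10x longer sequence for the same sentence, which is exactly why real tokenizers work with larger units.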
Why I Still Started Here
Even with all the limitations, this approach helped me:
understand the full pipeline end-to-end
remove any "magic" from tokenization
clearly see how text becomes model input
Where This Fits
Text
↓
Tokenization ← (this)
↓
Embedding
↓
Transformer
↓
Prediction
What's Coming Next
Now that text is converted into integers, the next question is: how does the model actually understand these numbers?
That’s where embeddings come in.