From Basics to Bots: My Weekly AI Engineering Adventure-29

How ChatGPT Sees Text - Tokens, Not Words

Posted by Afsal on 06-Mar-2026

Hi Pythonistas!

Last time, we learned the core idea: a language model predicts the next token.

What Is a Token?

A token is a chunk of text. It can be:

  • A full word
  • Part of a word
  • A single character
  • Punctuation

For example, a word like "unbelievable" might be split into:
un
believ
able

Different models split text differently, but the idea stays the same: text is broken into smaller pieces.
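To make this concrete, here is a toy greedy longest-match tokenizer. The vocabulary below is made up purely for illustration; real tokenizers (BPE, WordPiece) learn their vocabularies from huge amounts of text, so the actual pieces a model produces will differ.

```python
# Toy tokenizer: greedy longest-match over a tiny, hand-picked vocabulary.
# Real tokenizers learn their vocabularies from data; this one is invented
# just to show the splitting idea.

VOCAB = {"un", "believ", "able", "hello", "world", "ing"}

def tokenize(word: str) -> list[str]:
    """Split a word into the longest vocabulary pieces, left to right.
    Falls back to single characters for anything not in the vocabulary."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible match first.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character: keep it as-is
            i += 1
    return tokens

print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
```

Note how the character fallback means even a word full of unknown letters still gets tokenized somehow; this is the "handling unseen words gracefully" property mentioned below.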

Why Not Just Use Words?

Words sound natural to us, but they’re messy.
Problems with words:

  • New words appear all the time
  • Misspellings exist
  • Languages mix
  • Rare words explode vocabulary size

Tokens solve this by:

  • Reusing smaller pieces
  • Handling unseen words gracefully

Even if the model has never seen a word before, it can still process its parts.

From Text to Token IDs

Once text is split into tokens, something important happens: each token is mapped to a number.
For example:

hello → 42
world → 317

The model never sees the text again. From this point on, everything is numbers.
Text → tokens → token IDs.
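The whole pipeline fits in a few lines. The vocabulary and IDs below are just the made-up numbers from the example above; a real tokenizer assigns its own numbering.

```python
# Text -> tokens -> token IDs, with a toy vocabulary.
# The IDs (42, 317, ...) are invented for illustration only.

token_to_id = {"hello": 42, "world": 317, "un": 5, "believ": 6, "able": 7}

def encode(tokens: list[str]) -> list[int]:
    """Replace each token with its integer ID; the model only sees these."""
    return [token_to_id[t] for t in tokens]

def decode(ids: list[int]) -> list[str]:
    """Reverse lookup, so IDs can be turned back into text."""
    id_to_token = {i: t for t, i in token_to_id.items()}
    return [id_to_token[i] for i in ids]

print(encode(["hello", "world"]))  # [42, 317]
print(decode([42, 317]))           # ['hello', 'world']
```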

Why Numbers Matter

Neural networks don’t understand text. They understand:

  • Numbers
  • Vectors
  • Matrices

Turning text into numbers is not optional. It’s the only way learning can happen. This is where language officially becomes math.

Tokens Are the Model’s Alphabet

Think of tokens like letters in an alphabet. The model doesn’t know meaning. It learns how tokens follow each other, and which sequences are common.

Over time, it becomes very good at predicting what token usually comes next.
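A real model does this with a neural network, but the spirit of "learn which sequences are common" can be sketched with simple counting. The tiny corpus here is made up for illustration:

```python
from collections import Counter, defaultdict

# A minimal "next token" predictor: count which token follows which
# in a tiny corpus, then predict the most frequent follower.
corpus = ["the", "cat", "sat", "on", "the", "mat", "the", "cat", "ran"]

follower_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follower_counts[current][nxt] += 1

def predict_next(token: str) -> str:
    """Return the most common token seen after `token` in the corpus."""
    return follower_counts[token].most_common(1)[0][0]

print(predict_next("the"))  # 'cat'  ("the" is followed by cat, mat, cat)
```

Real language models replace the counting with learned probabilities over far longer contexts, but the job is the same: given what came before, score what comes next.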

A Small but Important Detail

Tokenization happens before learning. The tokenizer is fixed; the model adapts to it. The model cannot invent new tokens. This design choice shapes how the model learns language.

What I Learned This Week

  • ChatGPT does not see words
  • It sees tokens
  • Tokens are chunks of text
  • Tokens are converted into numbers
  • From that point on, everything is math

Once you understand tokens, you stop imagining language models as 'reading'. They’re not reading. They’re processing sequences.

What's Coming Next

Next week we will learn about embeddings.