From Basics to Bots: My Weekly AI Engineering Adventure-29

How ChatGPT Sees Text - Tokens, Not Words

Posted by Afsal on 06-Mar-2026

Hi Pythonistas!

Last time, we learned the core idea: a language model predicts the next token.

What Is a Token?

A token is a chunk of text. It can be:

  • A full word
  • Part of a word
  • A single character
  • Punctuation

For example, a word like "unbelievable" might be split into:
un
believ
able

Different models split text differently, but the idea stays the same: text is broken into smaller pieces.
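To make this concrete, here is a toy greedy longest-match tokenizer. The vocabulary below is made up purely for illustration; real tokenizers (BPE, WordPiece) learn their vocabularies from huge amounts of text, so the actual pieces a model produces will differ.

```python
# Toy tokenizer: greedy longest-match over a tiny, hand-picked vocabulary.
# Real tokenizers learn their vocabularies from data; this one is invented
# just to show the splitting idea.

VOCAB = {"un", "believ", "able", "hello", "world", "ing"}

def tokenize(word: str) -> list[str]:
    """Split a word into the longest vocabulary pieces, left to right.
    Falls back to single characters for anything not in the vocabulary."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible match first.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character: keep it as-is
            i += 1
    return tokens

print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
```

Note how the character fallback means even a word full of unknown letters still gets tokenized somehow; this is the "handling unseen words gracefully" property mentioned below.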

Why Not Just Use Words?

Words sound natural to us, but they’re messy.
Problems with words:

  • New words appear all the time
  • Misspellings exist
  • Languages mix
  • Rare words explode vocabulary size

Tokens solve this by:

  • Reusing smaller pieces
  • Handling unseen words gracefully

Even if the model has never seen a word before, it can still process its parts.

From Text to Token IDs

Once text is split into tokens, something important happens: each token is mapped to a number.
For example:

hello → 42
world → 317

The model never sees the text again. From this point on, everything is numbers.
Text → tokens → token IDs.
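The whole pipeline fits in a few lines. The vocabulary and IDs below are just the made-up numbers from the example above; a real tokenizer assigns its own numbering.

```python
# Text -> tokens -> token IDs, with a toy vocabulary.
# The IDs (42, 317, ...) are invented for illustration only.

token_to_id = {"hello": 42, "world": 317, "un": 5, "believ": 6, "able": 7}

def encode(tokens: list[str]) -> list[int]:
    """Replace each token with its integer ID; the model only sees these."""
    return [token_to_id[t] for t in tokens]

def decode(ids: list[int]) -> list[str]:
    """Reverse lookup, so IDs can be turned back into text."""
    id_to_token = {i: t for t, i in token_to_id.items()}
    return [id_to_token[i] for i in ids]

print(encode(["hello", "world"]))  # [42, 317]
print(decode([42, 317]))           # ['hello', 'world']
```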

Why Numbers Matter

Neural networks don’t understand text. They understand:

  • Numbers
  • Vectors
  • Matrices

Turning text into numbers is not optional. It’s the only way learning can happen. This is where language officially becomes math.

Tokens Are the Model’s Alphabet

Think of tokens like letters in an alphabet. The model doesn’t know meaning. It learns how tokens follow each other, and which sequences are common.

Over time, it becomes very good at predicting what token usually comes next.
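A real model does this with a neural network, but the spirit of "learn which sequences are common" can be sketched with simple counting. The tiny corpus here is made up for illustration:

```python
from collections import Counter, defaultdict

# A minimal "next token" predictor: count which token follows which
# in a tiny corpus, then predict the most frequent follower.
corpus = ["the", "cat", "sat", "on", "the", "mat", "the", "cat", "ran"]

follower_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follower_counts[current][nxt] += 1

def predict_next(token: str) -> str:
    """Return the most common token seen after `token` in the corpus."""
    return follower_counts[token].most_common(1)[0][0]

print(predict_next("the"))  # 'cat'  ("the" is followed by cat, mat, cat)
```

Real language models replace the counting with learned probabilities over far longer contexts, but the job is the same: given what came before, score what comes next.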

A Small but Important Detail

Tokenization happens before learning. The tokenizer is fixed; the model adapts to it. The model cannot invent new tokens. This design choice shapes how the model learns language.

What I Learned This Week

  • ChatGPT does not see words
  • It sees tokens
  • Tokens are chunks of text
  • Tokens are converted into numbers
  • From that point on, everything is math

Once you understand tokens, you stop imagining language models as 'reading'. They’re not reading. They’re processing sequences.

What's Coming Next

Next week we will learn about embeddings.