From Basics to Bots: My Weekly AI Engineering Adventure-42

Hi Pythonistas!

In the last post, I built the Transformer Block.
That's the core engine.
But an engine alone isn't a car.
Now it's time to wrap everything into a complete model, train it, and actually generate text.

This is the final post of this mini-series.

Code

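import torch
import torch.nn as nn
from torch.nn import functional as F

# vocab_size, n_embd, block_size, n_layer and Block are assumed to be
# defined as in the earlier posts of this series.

class GPT(nn.Module):
    def __init__(self):
        super().__init__()
        # Lookup tables: one vector per token id, one vector per position
        self.token_embedding = nn.Embedding(vocab_size, n_embd)
        self.position_embedding = nn.Embedding(block_size, n_embd)

        self.blocks = nn.Sequential(*[Block() for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)           # final LayerNorm
        self.head = nn.Linear(n_embd, vocab_size)  # projects to vocabulary scores

    def forward(self, idx, targets=None):
        B, T = idx.shape  # batch size, sequence length

        tok_emb = self.token_embedding(idx)  # (B, T, n_embd)
        # Use idx.device so positions live on the same device as the input
        pos_emb = self.position_embedding(torch.arange(T, device=idx.device))  # (T, n_embd)
        x = tok_emb + pos_emb  # broadcasts to (B, T, n_embd)

        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.head(x)  # (B, T, vocab_size)

        # Inference: no targets, so there is no loss to compute
        if targets is None:
            return logits, None

        # Flatten batch and time so every position is one classification example
        B, T, C = logits.shape
        logits = logits.view(B*T, C)
        targets = targets.view(B*T)

        loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]  # crop to the context window
            logits, _ = self(idx_cond)

            logits = logits[:, -1, :]        # keep only the last position
            probs = F.softmax(logits, dim=-1)

            next_token = torch.multinomial(probs, num_samples=1)  # sample one token id
            idx = torch.cat((idx, next_token), dim=1)             # append and continue

        return idx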

What This Actually Does

This class connects every piece we built so far into one model.

Input goes in. Predictions come out.

Step 1: Embeddings (again)

tok_emb = self.token_embedding(idx)
pos_emb = self.position_embedding(torch.arange(T, device=idx.device))
x = tok_emb + pos_emb

We covered this in post 37.

token embedding → what is this token?
position embedding → where is it in the sequence?
add them together → one vector per token that encodes both what it is and where it is
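To make the addition concrete, here's a minimal sketch in plain Python (no PyTorch). The sizes and the table values are made up for illustration, and the embed helper is hypothetical; in the real model both tables are learned.

```python
# Toy sizes, chosen only for illustration (not the post's hyperparameters).
vocab_size, block_size, n_embd = 5, 4, 3

# Stand-ins for the learned tables: one row per token id, one row per position.
token_table = [[float(t)] * n_embd for t in range(vocab_size)]       # row t is [t, t, t]
position_table = [[p * 0.5] * n_embd for p in range(block_size)]     # row p is [p/2, p/2, p/2]

def embed(idx):
    """idx: B sequences of T token ids -> nested list of shape (B, T, n_embd)."""
    out = []
    for seq in idx:
        rows = []
        for pos, tok in enumerate(seq):
            tok_vec = token_table[tok]        # what is this token?
            pos_vec = position_table[pos]     # where is it in the sequence?
            rows.append([a + b for a, b in zip(tok_vec, pos_vec)])
        out.append(rows)
    return out

x = embed([[2, 2, 4], [0, 1, 2]])
print(x[0][0])  # token 2 at position 0
print(x[0][1])  # token 2 at position 1
```

The same token (id 2) ends up with a different vector at position 0 than at position 1, which is exactly what the position embedding is for.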

Step 2: Pass Through All Blocks

x = self.blocks(x)

This runs all the transformer blocks in sequence.

Each block:

attends to context
transforms information
refines representation

With n_layer = 2, we run through 2 blocks.

Each block builds on the output of the one before it, so a deeper stack can capture more complex patterns.

Step 3: Final LayerNorm

x = self.ln_f(x)

One last normalization before the final layer.

Keeps values stable for the projection that follows.
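Here's a rough sketch of what that normalization does to a single vector, in plain Python. It assumes the default eps of 1e-5 and leaves out the learned scale and shift that nn.LayerNorm also applies (they start at 1 and 0), so treat it as an illustration rather than a replacement.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one vector to roughly zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)  # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

# Two inputs with the same pattern but very different scales.
small = layer_norm([0.1, 0.2, 0.3, 0.4])
large = layer_norm([100.0, 200.0, 300.0, 400.0])

print([round(v, 2) for v in small])
print([round(v, 2) for v in large])
```

Both inputs come out on nearly the same scale, so the linear layer that follows sees values in a predictable range no matter how large the activations were going in.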

Step 4: Linear Head (Logits)

logits = self.head(x)

This is the output layer.

It maps:

n_embd dimensions (64 here) → vocab_size

Each position now has a score for every character in the vocabulary.
These scores are called logits.
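As a small sketch of that projection for one position, here's the same computation in plain Python. The sizes, weights, and bias are invented toy values; the real layer has learned parameters of shape (vocab_size, n_embd).

```python
# Toy sizes and hand-picked parameters, for illustration only.
n_embd, vocab_size = 4, 3

weight = [
    [1.0, 0.0, 0.0, 0.0],   # row for vocabulary entry 0
    [0.0, 1.0, 0.0, 1.0],   # row for vocabulary entry 1
    [0.0, 0.0, 1.0, 0.0],   # row for vocabulary entry 2
]
bias = [0.0, 0.5, 0.0]

def linear_head(x):
    """Map one n_embd-sized vector to vocab_size scores: W @ x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

x = [1.0, 0.0, -1.0, 2.0]          # the vector for one position
logits = linear_head(x)
best = max(range(vocab_size), key=lambda i: logits[i])

print(logits)  # one score per vocabulary entry
print(best)    # index of the highest score
```

The logits are raw scores, not probabilities. They can be negative and they don't sum to 1; softmax handles that later.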

Step 5: Loss Calculation

B, T, C = logits.shape
logits = logits.view(B*T, C)
targets = targets.view(B*T)
loss = F.cross_entropy(logits, targets)

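During training, we compare predictions to the actual next characters.
F.cross_entropy expects the class dimension to come second, so we flatten (B, T, C) into (B*T, C) and the targets into (B*T). Every position becomes one classification example.
cross_entropy is small when the model puts high probability on the correct next character, and large when it doesn't.
The model's goal:
make this number smaller over time.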
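To get a feel for the numbers, here's cross-entropy for a single position worked out by hand in plain Python. The vocab_size of 65 is an assumption for illustration, and the cross_entropy helper is my own stand-in, not the PyTorch function.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target: logsumexp(logits) - logits[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

vocab_size = 65  # assumed vocabulary size, for illustration only

# An untrained model that scores every character equally.
uniform = cross_entropy([0.0] * vocab_size, target=3)

# A model that strongly favors character 3.
peaked = [5.0 if i == 3 else 0.0 for i in range(vocab_size)]
confident_right = cross_entropy(peaked, target=3)   # the answer really was 3
confident_wrong = cross_entropy(peaked, target=4)   # the answer was something else

print(round(uniform, 3), round(confident_right, 3), round(confident_wrong, 3))
```

A model that guesses uniformly scores ln(vocab_size), which is a handy baseline: if the first printed training loss is far from that, something is probably off. Being confidently wrong costs more than guessing.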

Step 6: Generation

def generate(self, idx, max_new_tokens):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]
        logits, _ = self(idx_cond)

        logits = logits[:, -1, :]
        probs = F.softmax(logits, dim=-1)

        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, next_token), dim=1)

    return idx

This is how text actually gets generated.

Loop:

Take the last block_size tokens (context window)
Run forward pass → get logits
Look at the last position only
Convert to probabilities via softmax
Sample one token
Append it and repeat

It's not finding the "best" answer.
It's sampling from a probability distribution.
That's why each run produces different output.
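The loop above can be sketched without PyTorch. Here toy_logits is a hypothetical stand-in for the trained model, the sizes are made up, and random.choices plays the role of torch.multinomial.

```python
import math
import random

block_size = 4   # toy context window
vocab_size = 3   # toy vocabulary

def toy_logits(context):
    """Hypothetical stand-in for the model: favors the token after the last one."""
    last = context[-1]
    return [2.0 if tok == (last + 1) % vocab_size else 0.0 for tok in range(vocab_size)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(idx, max_new_tokens, rng):
    idx = list(idx)
    for _ in range(max_new_tokens):
        context = idx[-block_size:]              # crop to the context window
        probs = softmax(toy_logits(context))     # scores -> probabilities
        next_token = rng.choices(range(vocab_size), weights=probs, k=1)[0]  # sample
        idx.append(next_token)                   # append and repeat
    return idx

out_a = generate([0], 10, random.Random(1))
out_b = generate([0], 10, random.Random(2))
print(out_a)
print(out_b)
```

Fixing the seed makes a run reproducible; changing it usually changes the output, because the next token is drawn from the distribution rather than picked as the top score.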

The Training Loop

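# model, optimizer, get_batch and estimate_loss are assumed to be defined already
for step in range(max_iters):

    # Every eval_interval steps, report train and validation loss
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {step}: train {losses['train']:.4f}, val {losses['val']:.4f}")

    xb, yb = get_batch('train')      # random batch of inputs and next-character targets
    logits, loss = model(xb, yb)     # forward pass

    optimizer.zero_grad()            # clear gradients from the previous step
    loss.backward()                  # compute gradients
    optimizer.step()                 # update weights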

This is where the model actually learns.

Each step:

Get a random batch of text
Run forward pass → compute loss
Compute gradients
Update weights

3000 steps. On average, each one nudges the model toward better next-character predictions.
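To see the same forward, gradient, update cycle with nothing hidden, here's a tiny sketch in plain Python. It is not the GPT above: the "model" is just a table of logits indexed by the current character, the gradient is written out by hand instead of calling backward(), and plain gradient descent with a made-up learning rate replaces the optimizer.

```python
import math
import random

text = "abababababababab"                 # toy training text
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = [stoi[c] for c in text]
V = len(chars)

# "Model": a V x V table of logits; row = current char, column = next char.
table = [[0.0] * V for _ in range(V)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_grad(x, y):
    """Cross-entropy for one (current, next) pair and its gradient w.r.t. the logits."""
    probs = softmax(table[x])
    loss = -math.log(probs[y])
    grad = [p - (1.0 if j == y else 0.0) for j, p in enumerate(probs)]
    return loss, grad

def mean_loss():
    pairs = list(zip(data[:-1], data[1:]))
    return sum(loss_and_grad(x, y)[0] for x, y in pairs) / len(pairs)

rng = random.Random(0)
learning_rate = 0.5                        # assumed value, for illustration
loss_before = mean_loss()
for step in range(200):
    i = rng.randrange(len(data) - 1)       # pick a random training example
    x, y = data[i], data[i + 1]
    loss, grad = loss_and_grad(x, y)       # forward pass and gradient
    for j in range(V):                     # gradient descent update
        table[x][j] -= learning_rate * grad[j]
loss_after = mean_loss()

print(round(loss_before, 4), round(loss_after, 4))
```

The loss starts at ln(2), a coin flip between two characters, and drops as the table learns that "a" is followed by "b" and "b" by "a". The GPT training loop runs the same cycle with far more parameters and lets PyTorch compute the gradients.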

What Changed for Me

Before building this:

GPT felt like magic

After building this:

it's just matrix multiplications, attention, and gradient descent

Everything we built piece by piece fits together:

Tokenization
↓
Embedding
↓
Transformer Blocks × N
↓
LayerNorm
↓
Linear Head
↓
Logits → Loss (training) / Probabilities → Token (generation)

The Full Picture

Text
↓
Tokenization (post 36)
↓
Embedding (post 37)
↓
Self-Attention (post 38)
↓
Multi-Head (post 39)
↓
FeedForward (post 40)
↓
Transformer Block (post 41)
↓
GPT Model ← (this)

We started from raw text.

We ended with a model that generates text.

What I Learned From This Series

No magic. Just:

character-level tokenization
vector representations
attention scores
residual connections
cross-entropy loss

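Put them together:

you have a language model.

A small one. But the same core ideas power GPT-2 and the much larger models that came after it.
The difference is mostly scale (and details like subword tokenization instead of characters), not the core architecture.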

Final Thought

When I started this series, I was trying to understand how these models work from the inside.
The only way I found that actually worked:
build it yourself.
No abstraction hides what you've already written with your own hands.

I hope this series helped you in the same way.
See you in the next adventure!