Hi Pythonistas!
In the last post, I built the Transformer Block.
That's the core engine.
But an engine alone isn't a car.
Now it's time to wrap everything into a complete model, train it, and actually generate text.
This is the final post of this mini-series.
Code
import torch
import torch.nn as nn
from torch.nn import functional as F

# vocab_size, n_embd, block_size, n_layer, device and Block come from the earlier posts.
class GPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, n_embd)
        self.position_embedding = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block() for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding(idx)  # (B, T, n_embd)
        pos_emb = self.position_embedding(torch.arange(T, device=device))  # (T, n_embd)
        x = tok_emb + pos_emb  # (B, T, n_embd)
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.head(x)  # (B, T, vocab_size)
        if targets is None:
            return logits, None
        # cross_entropy expects (N, C) logits and (N,) targets
        B, T, C = logits.shape
        logits = logits.view(B*T, C)
        targets = targets.view(B*T)
        loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]  # crop to the context window
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :]  # keep only the last position
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, next_token), dim=1)
        return idx
What This Actually Does
This class connects every piece we built so far into one model.
Input goes in. Predictions come out.
Step 1: Embeddings (again)
tok_emb = self.token_embedding(idx)
pos_emb = self.position_embedding(torch.arange(T, device=device))
x = tok_emb + pos_emb
We covered this in post 37.
token embedding → what is this token?
position embedding → where is it in the sequence?
add them together → each token's vector now encodes both what it is and where it sits
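To make the shapes concrete, here is a minimal standalone sketch of this step. The sizes (vocab_size = 65, block_size = 8, a batch of 4) are assumptions picked only for the illustration, not values from this series.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for this illustration.
vocab_size, block_size, n_embd = 65, 8, 64

token_embedding = nn.Embedding(vocab_size, n_embd)
position_embedding = nn.Embedding(block_size, n_embd)

idx = torch.randint(0, vocab_size, (4, block_size))  # (B, T) batch of token ids
B, T = idx.shape

tok_emb = token_embedding(idx)                                     # (B, T, n_embd)
pos_emb = position_embedding(torch.arange(T, device=idx.device))   # (T, n_embd)
x = tok_emb + pos_emb  # pos_emb is broadcast across the batch dimension

print(tok_emb.shape, pos_emb.shape, x.shape)
```

The same position vectors are added to every sequence in the batch, which is why pos_emb has no batch dimension.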
Step 2: Pass Through All Blocks
x = self.blocks(x)
This runs all the transformer blocks in sequence.
Each block:
- attends to context
- transforms information
- refines representation
With n_layer = 2, the input runs through 2 blocks.
Each block refines the representation produced by the one before it, and the shape stays (B, T, n_embd) the whole way through.
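Here is a small sketch of what nn.Sequential is doing with the stacked blocks. ToyBlock is a hypothetical stand-in (LayerNorm plus a small MLP with a residual connection, no attention), used only so the example runs on its own. It is not the Block from the previous post.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd, n_layer = 64, 2

class ToyBlock(nn.Module):
    """Hypothetical stand-in for Block: LayerNorm + MLP with a residual connection."""
    def __init__(self):
        super().__init__()
        self.ln = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return x + self.mlp(self.ln(x))  # output shape equals input shape

blocks = nn.Sequential(*[ToyBlock() for _ in range(n_layer)])

x = torch.randn(4, 8, n_embd)  # (B, T, n_embd)
out = blocks(x)

# nn.Sequential is just this loop: feed each block the previous block's output.
manual = x
for block in blocks:
    manual = block(manual)

print(out.shape, torch.allclose(out, manual))
```

Because every block returns the same shape it receives, you can stack as many as you like by changing n_layer.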
Step 3: Final LayerNorm
x = self.ln_f(x)
One last normalization before the final layer.
Keeps values stable for the projection that follows.
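A quick way to see what "stable" means here: the sketch below feeds deliberately shifted and scaled random values (an assumption for the demo) through a fresh nn.LayerNorm and checks the statistics of each token vector afterwards.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 64

# Made-up activations with a large mean and spread.
x = torch.randn(4, 8, n_embd) * 5 + 3

ln_f = nn.LayerNorm(n_embd)  # at init: scale = 1, shift = 0
y = ln_f(x)

# Normalization happens over the last dimension, separately for every token.
mean_per_token = y.mean(dim=-1)
var_per_token = y.var(dim=-1, unbiased=False)

print(mean_per_token.abs().max().item(), var_per_token.mean().item())
```

At initialization each token vector comes out with mean close to 0 and variance close to 1. During training the learned scale and shift can move away from that.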
Step 4: Linear Head (Logits)
logits = self.head(x)
This is the output layer.
It maps:
64 dimensions (n_embd) → vocab_size
Each position now has a score for every character in the vocabulary.
These scores are called logits.
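As a rough illustration of the head on its own, the sketch below applies an untrained nn.Linear to random activations. vocab_size = 65 and the batch shape are assumptions for the demo.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes for the demo.
B, T, n_embd, vocab_size = 4, 8, 64, 65

head = nn.Linear(n_embd, vocab_size)
x = torch.randn(B, T, n_embd)

logits = head(x)  # the linear layer acts on the last dimension only

print(logits.shape)
# Logits are raw scores: they can be negative and do not sum to 1.
print(logits[0, -1].min().item(), logits[0, -1].sum().item())
```

Logits only become probabilities after a softmax, which is what the generation step does later.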
Step 5: Loss Calculation
B, T, C = logits.shape
logits = logits.view(B*T, C)
targets = targets.view(B*T)
loss = F.cross_entropy(logits, targets)
During training, we compare predictions to actual next characters.
cross_entropy measures how wrong we are: the less probability the model gave to the correct next character, the higher the loss.
The model's goal:
make this number smaller over time.
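To see what cross_entropy computes, here is a sketch with random logits and targets (made-up data, with an assumed vocab_size of 65). It reproduces the loss by hand and also computes the loss of a model that guesses uniformly, which is a handy baseline to compare your first training steps against.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes for the demo.
B, T, vocab_size = 4, 8, 65

logits = torch.randn(B, T, vocab_size)
targets = torch.randint(0, vocab_size, (B, T))

# F.cross_entropy wants (N, C) logits and (N,) targets, hence the reshape.
flat_logits = logits.view(B * T, vocab_size)
flat_targets = targets.view(B * T)
loss = F.cross_entropy(flat_logits, flat_targets)

# Same value by hand: mean negative log-probability of the correct token.
log_probs = F.log_softmax(flat_logits, dim=-1)
manual = -log_probs[torch.arange(B * T), flat_targets].mean()

# All-equal logits mean a uniform guess, so the loss is ln(vocab_size).
uniform_loss = F.cross_entropy(torch.zeros(B * T, vocab_size), flat_targets)

print(loss.item(), manual.item())
print(uniform_loss.item(), math.log(vocab_size))
```

If a training run starts far above ln(vocab_size), the initialization is usually worth a second look.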
Step 6: Generation
def generate(self, idx, max_new_tokens):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]
        logits, _ = self(idx_cond)
        logits = logits[:, -1, :]
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, next_token), dim=1)
    return idx
This is how text actually gets generated.
Loop:
- Take the last block_size tokens (context window)
- Run forward pass → get logits
- Look at the last position only
- Convert to probabilities via softmax
- Sample one token
- Append it and repeat
It's not finding the "best" answer.
It's sampling from a probability distribution.
That's why each run produces different output.
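The difference between picking the "best" token and sampling is easy to see on a toy example. The three logits below are made up for the illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Made-up logits for a 3-token vocabulary, batch size 1.
logits = torch.tensor([[2.0, 1.0, 0.1]])
probs = F.softmax(logits, dim=-1)

# Greedy decoding: always returns the highest-probability token.
greedy = torch.argmax(probs, dim=-1, keepdim=True)

# Sampling: draw tokens in proportion to their probability.
samples = torch.multinomial(probs, num_samples=1000, replacement=True)
counts = torch.bincount(samples.flatten(), minlength=3)

print(probs)
print(greedy.item(), counts.tolist())
```

Greedy decoding would give the same continuation on every run. Sampling lets lower-probability tokens through some of the time, which is where the variety comes from.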
The Training Loop
for iter in range(max_iters):
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train {losses['train']:.4f}, val {losses['val']:.4f}")
    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
This is where the model actually learns.
Each step:
- Get a random batch of text
- Run forward pass → compute loss
- Compute gradients
- Update weights
3000 steps. The loss is noisy from batch to batch, but on average each step makes the model slightly better at predicting the next character.
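The loop relies on get_batch and estimate_loss, which are not shown in this post. Below is a self-contained sketch of how such helpers might look, trained end to end on a toy string. Everything in it is an assumption for illustration: the repeated "hello world" text, the sizes, the learning rate, and TinyModel, a bigram lookup table standing in for the GPT class so the example runs in seconds. It also skips the train/val split for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy data (assumption): a short repeated string, character-level tokens.
text = "hello world " * 200
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
data = torch.tensor([stoi[ch] for ch in text], dtype=torch.long)
vocab_size, block_size, batch_size = len(chars), 8, 16

def get_batch():
    # Hypothetical helper: random windows, with targets shifted by one character.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

class TinyModel(nn.Module):
    # Stand-in for GPT: a bigram table with the same (logits, loss) interface.
    def __init__(self):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.table(idx)  # (B, T, vocab_size)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

@torch.no_grad()
def estimate_loss(model, eval_iters=20):
    # Hypothetical helper: average the loss over several batches to reduce noise.
    model.eval()
    losses = torch.zeros(eval_iters)
    for k in range(eval_iters):
        xb, yb = get_batch()
        _, loss = model(xb, yb)
        losses[k] = loss.item()
    model.train()
    return losses.mean().item()

model = TinyModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

loss_before = estimate_loss(model)
for step in range(200):
    xb, yb = get_batch()
    _, loss = model(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
loss_after = estimate_loss(model)

print(f"before {loss_before:.4f}, after {loss_after:.4f}")
```

Averaging over several batches in estimate_loss is the reason the printed numbers are smoother than the per-step loss.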
What Changed for Me
Before building this:
GPT felt like magic
After building this:
it's just matrix multiplications, attention, and gradient descent
Everything we built piece by piece fits together:
Tokenization
↓
Embedding
↓
Transformer Blocks × N
↓
LayerNorm
↓
Linear Head
↓
Logits → Loss (training) / Probabilities → Token (generation)
The Full Picture
Text
↓
Tokenization (post 36)
↓
Embedding (post 37)
↓
Self-Attention (post 38)
↓
Multi-Head (post 39)
↓
FeedForward (post 40)
↓
Transformer Block (post 41)
↓
GPT Model ← (this)
We started from raw text.
We ended with a model that generates text.
What I Learned From This Series
No magic. Just:
- character-level tokenization
- vector representations
- attention scores
- residual connections
- cross-entropy loss
Put them together:
you have a language model.
A small one. But the same core ideas power GPT-2 and the much larger models that followed it.
The main difference is scale, not the basic architecture.
Final Thought
When I started this series, I was trying to understand how these models work from the inside.
The only way I found that actually worked:
build it yourself.
No abstraction hides what you've already written with your own hands.
I hope this series helped you in the same way.
See you in the next adventure!