From Basics to Bots: My Weekly AI Engineering Adventure

Comparing Sentences With Simple Math

Posted by Afsal on 22-Aug-2025

Hi Pythonistas!

This is the kickoff of my hands-on journey toward something big.

Goal: To go from Python basics all the way to building my own AI-powered bot from scratch. I'm sharing every step of what I learn along the way.

Disclaimer

I’m not an expert, just a curious Pythonista learning AI from scratch. Think of this as my learning journal. If you spot mistakes or have pro tips, please let me know!

Can a Computer Tell If Sentences Are Similar?

Humans know immediately that "cat sat on the mat" and "dog sat on the rug" feel similar. But to a computer? That’s not obvious. So, I built a simple tool using Python and a little math to teach my machine what "similarity" even means.

Step-by-Step: Comparing Sentences Like a Pythonista

Step 1 Install scikit-learn (pip install scikit-learn) and Import the Required Functions

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Step 2 List Out the Sentences

sentence1 = "cat sat on the mat"
sentence2 = "dog sat on the rug"

Step 3 Turn Them Into Numbers (Bag-of-Words Model)

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform([sentence1, sentence2])

CountVectorizer builds a vocabulary from all the sentences, then counts how often each vocabulary word appears in each sentence, so every sentence becomes a list of numbers.
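To get a feel for what "counting words" means, here is a minimal pure-Python sketch of a bag of words. It is not scikit-learn's actual code: the names tokenize and bag_of_words are my own, and the regular expression is only meant to imitate CountVectorizer's default behaviour of lowercasing the text and keeping tokens of two or more characters.

```python
import re
from collections import Counter

# Assumption: imitates CountVectorizer's default token pattern
# (lowercased tokens made of two or more word characters).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(sentence):
    """Lowercase the sentence and keep tokens of two or more characters."""
    return TOKEN_PATTERN.findall(sentence.lower())


def bag_of_words(sentence):
    """Count how often each token appears; word order is thrown away."""
    return Counter(tokenize(sentence))


counts = bag_of_words("The cat sat on the mat")
print(counts)

# Single-character words such as "I" and "a" are dropped by this pattern.
short = tokenize("I saw a cat")
print(short)
```

Two things stand out: a repeated word like "the" gets a count of 2 rather than 1, and very short words vanish entirely, which matters if your sentences rely on them.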

Step 4 See the Vocabulary and Word Counts

print("Vocabulary:", vectorizer.get_feature_names_out())
print("Vectors:\n", vectors.toarray())

You’ll see each sentence as a row of numbers, with one column per vocabulary word.

Step 5 Cosine Similarity Magic!

similarity = cosine_similarity(vectors)[0, 1]
print(f"Similarity score: {similarity:.3f}")

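cosine_similarity returns a matrix with one row and one column per sentence, so [0, 1] picks the score for the first sentence against the second. For word-count vectors, that score runs from 0 (no words in common) to 1 (same words in the same proportions), showing how much the sentences overlap in word usage.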

The Simple Math Behind Sentence Similarity

We're basically measuring "how much do these sentences point in the same direction in word-space?"

1. First, we create a vocabulary of every distinct word across both sentences. In our example:

Vocabulary is ['cat', 'dog', 'mat', 'on', 'rug', 'sat', 'the']

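2. Now check each word of the vocabulary against the sentence: if the word appears, put 1, otherwise 0. That gives us the vector. (CountVectorizer actually stores the count, which is 1 here because no word repeats within a sentence.)

sentence: "cat sat on the mat" changed to vector: [1 0 1 1 0 1 1]

sentence: "dog sat on the rug" changed to vector: [0 1 0 1 1 1 1]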
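The two steps above can be sketched by hand in plain Python. This is only an illustration under simple assumptions: build_vocabulary and to_vector are hypothetical helper names, and the tokenizer is a bare split() on whitespace, which is cruder than what CountVectorizer does.

```python
def build_vocabulary(sentences):
    """Collect every distinct word across all sentences, sorted alphabetically."""
    words = set()
    for sentence in sentences:
        words.update(sentence.lower().split())
    return sorted(words)


def to_vector(sentence, vocabulary):
    """Count how many times each vocabulary word occurs in the sentence."""
    tokens = sentence.lower().split()
    return [tokens.count(word) for word in vocabulary]


sentences = ["cat sat on the mat", "dog sat on the rug"]
vocabulary = build_vocabulary(sentences)
vectors = [to_vector(sentence, vocabulary) for sentence in sentences]

print("Vocabulary:", vocabulary)
for sentence, vector in zip(sentences, vectors):
    print(f"{sentence!r} -> {vector}")
```

Sorting the vocabulary matters: it fixes the position of each word, so the same column always means the same word in both vectors.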

Cosine similarity formula:

cos(θ) = (A · B) / (||A|| × ||B||)

A · B is the sum of the products of corresponding elements from A and B. ||A|| is the length of A: the square root of the sum of its squared elements.

A = vector1: [1 0 1 1 0 1 1]

B = vector2: [0 1 0 1 1 1 1]

A · B = (1*0) + (0*1) + (1*0) + (1*1) + (0*1) + (1*1) + (1*1) = 3

||A|| = sqrt(1^2 + 0^2 + 1^2 + 1^2 + 0^2 + 1^2 + 1^2) = sqrt(5)
||B|| = sqrt(0^2 + 1^2 + 0^2 + 1^2 + 1^2 + 1^2 + 1^2) = sqrt(5)

similarity = 3 / (sqrt(5) * sqrt(5)) = 3/5 = 0.6
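If you want to check the arithmetic without scikit-learn, here is a small sketch that uses only Python's math module. The variable names are my own, and the two vectors are typed in by hand from the example above.

```python
import math

vector_a = [1, 0, 1, 1, 0, 1, 1]  # "cat sat on the mat"
vector_b = [0, 1, 0, 1, 1, 1, 1]  # "dog sat on the rug"

# Dot product: multiply matching positions, then add everything up.
dot_product = sum(a * b for a, b in zip(vector_a, vector_b))

# Length of each vector: square root of the sum of squared elements.
norm_a = math.sqrt(sum(a * a for a in vector_a))
norm_b = math.sqrt(sum(b * b for b in vector_b))

similarity = dot_product / (norm_a * norm_b)

print("A . B =", dot_product)
print(f"||A|| = {norm_a:.3f}, ||B|| = {norm_b:.3f}")
print(f"similarity = {similarity:.3f}")
```

Only the positions where both vectors hold a 1 (on, sat, the) contribute to the dot product, which is why it comes out as 3.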

All it really does: counts overlap in the same word positions, scaled by the length of each vector. If the sentences use lots of the same words, cosine is close to 1. If they have nothing in common, it’s 0.

What I Learned

  • Word count vectors are a simple way for computers to start comparing language.
  • Cosine similarity can catch overlap (sat, on, the) even across short, simple examples.
  • Swapping out even one word quickly lowers the similarity: changing "sat" to "sit" in one sentence drops the score from 0.6 to 0.4.
  • The computer doesn’t read "meaning"; it just matches word usage!
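These takeaways are easy to try out. Below is a rough sketch, assuming a plain whitespace tokenizer and a hypothetical helper named sentence_similarity (it is not part of scikit-learn). It compares a few sentence pairs, including one where the words are identical but the meaning is reversed.

```python
import math
from collections import Counter


def sentence_similarity(first, second):
    """Cosine similarity between the word-count vectors of two sentences."""
    counts_a = Counter(first.lower().split())
    counts_b = Counter(second.lower().split())
    # Words missing from counts_b count as 0, so only shared words add up.
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence shares nothing with anything
    return dot / (norm_a * norm_b)


original = sentence_similarity("cat sat on the mat", "dog sat on the rug")
swapped = sentence_similarity("cat sit on the mat", "dog sat on the rug")
reordered = sentence_similarity("the dog chased the cat", "the cat chased the dog")
unrelated = sentence_similarity("cat sat", "dog ran")

print(f"original:  {original:.3f}")
print(f"sat->sit:  {swapped:.3f}")
print(f"reordered: {reordered:.3f}")
print(f"unrelated: {unrelated:.3f}")
```

With these inputs the scores should come out as 0.600, 0.400, 1.000 and 0.000. The reordered pair is the interesting one: who chased whom is flipped, yet the score is a perfect 1 because word counts ignore order.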

What’s Next

All these baby steps add up!
Next stop: we will build a simple search engine using this concept.