Hi Pythonistas!
This is the kickoff of my hands-on journey toward something big.
Goal: To go from Python basics all the way to building my own AI-powered bot from scratch. I'm sharing every step of what I learn along the way.
Disclaimer
I’m not an expert, just a curious Pythonista learning AI from scratch. Think of this as my learning journal. If you spot mistakes or have pro tips, please let me know!
Can a Computer Tell If Sentences Are Similar?
Humans know immediately that "cat sat on the mat" and "dog sat on the rug" feel similar. But to a computer? That’s not obvious. So, I built a simple tool using Python and a little math to teach my machine what "similarity" even means.
Step-by-Step: Comparing Sentences Like a Pythonista
Step 1 Install scikit-learn (pip install scikit-learn) and Import the Required Tools
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
Step 2 List Out the Sentences
sentence1 = "cat sat on the mat"
sentence2 = "dog sat on the rug"
Step 3 Turn Them Into Numbers (Bag-of-Words Model)
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform([sentence1, sentence2])
CountVectorizer builds a vocabulary from all the sentences, then counts how often each vocabulary word appears in each sentence, so both sentences become lists of numbers.
Step 4 See the Vocabulary and Word Counts
print("Vocabulary:", vectorizer.get_feature_names_out())
print("Vectors:\n", vectors.toarray())
You’ll see each sentence as a row of numbers, with one column per vocabulary word.
Step 5 Cosine Similarity Magic!
similarity = cosine_similarity(vectors)[0, 1]
print(f"Similarity score: {similarity:.3f}")
This gives us a number from 0 (completely different) to 1 (identical), showing how much the sentences overlap in word usage.
The Simple Math Behind Sentence Similarity
We're basically measuring "how much do these sentences point in the same direction in word-space?"
1. First we create a vocabulary: every distinct word across both sentences, sorted alphabetically. In our example:
Vocabulary is ['cat', 'dog', 'mat', 'on', 'rug', 'sat', 'the']
2. Now check each vocabulary word against the sentence: if the word appears, put its count (here always 1), else 0. That gives us the vector.
sentence: "cat sat on the mat" changed to vector: [1 0 1 1 0 1 1]
sentence: "dog sat on the rug" changed to vector: [0 1 0 1 1 1 1]
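The two steps above can be sketched in plain Python, without scikit-learn. This is only a minimal sketch, assuming the words are lowercase and separated by spaces; CountVectorizer does its own tokenizing, and the helper name build_vectors is my own invention, not part of the library.

```python
def build_vectors(sentences):
    """Return (vocabulary, vectors) using simple whitespace splitting."""
    # Step 1: vocabulary = every distinct word, sorted alphabetically
    vocabulary = sorted({word for s in sentences for word in s.split()})
    # Step 2: for each sentence, count how often each vocabulary word occurs
    vectors = [[s.split().count(word) for word in vocabulary] for s in sentences]
    return vocabulary, vectors


vocabulary, vectors = build_vectors(["cat sat on the mat", "dog sat on the rug"])
print(vocabulary)
for row in vectors:
    print(row)
```

Every word shows up at most once per sentence here, so the counts are all 0 or 1. A repeated word would get a count of 2 or more.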
Cosine similarity formula:
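cos(θ) = (A · B) / (||A|| × ||B||)
A · B is the sum of the products of corresponding elements from A and B.
||A|| is the length of A: the square root of the sum of its squared elements (same for ||B||).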
vector1: [1 0 1 1 0 1 1]
vector2: [0 1 0 1 1 1 1]
A · B = (1*0) + (0*1) + (1*0) + (1*1) + (0*1) + (1*1) + (1*1) = 3
||A|| = sqrt(1^2 + 0^2 + 1^2 + 1^2 + 0^2 + 1^2 + 1^2) = sqrt(5)
||B|| = sqrt(0^2 + 1^2 + 0^2 + 1^2 + 1^2 + 1^2 + 1^2) = sqrt(5)
similarity = 3 / (sqrt(5) * sqrt(5)) = 3/5 = 0.6
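To double-check the arithmetic, here is the same calculation written out with only Python's math module. It's a small sketch that hard-codes the two vectors from above; the variable names are just mine.

```python
import math

a = [1, 0, 1, 1, 0, 1, 1]  # "cat sat on the mat"
b = [0, 1, 0, 1, 1, 1, 1]  # "dog sat on the rug"

# Dot product: multiply matching positions, then add everything up
dot = sum(x * y for x, y in zip(a, b))
# Length (norm) of each vector: square root of the sum of squares
norm_a = math.sqrt(sum(x * x for x in a))
norm_b = math.sqrt(sum(y * y for y in b))

similarity = dot / (norm_a * norm_b)
print(dot, round(similarity, 3))
```

The rounding is there because floating-point square roots can leave the result a hair away from 0.6.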
All it really does: count the overlap in the same word positions, scaled by the length of each vector. If the sentences use lots of the same words, cosine is close to 1. If they have nothing in common, it’s 0.
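A quick way to see both extremes is to compare a sentence with itself and with a sentence that shares no words. This is a rough sketch using collections.Counter instead of scikit-learn; the function name cosine and the second test sentence are made up for illustration, and it assumes neither sentence is empty (an empty one would divide by zero).

```python
import math
from collections import Counter


def cosine(sentence_a, sentence_b):
    """Cosine similarity of two sentences from whitespace-split word counts."""
    counts_a = Counter(sentence_a.split())
    counts_b = Counter(sentence_b.split())
    # Only words present in both sentences contribute to the dot product
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)


same = cosine("cat sat on the mat", "cat sat on the mat")
disjoint = cosine("cat sat on the mat", "dogs chase red balls")
print(round(same, 3), round(disjoint, 3))
```

Identical sentences land at 1 (up to floating-point rounding), and sentences with no shared words land at exactly 0 because the dot product has nothing to add up.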
What I Learned
- Word count vectors are a simple way for computers to start comparing language.
- Cosine similarity can catch overlap (sat, on, the) even across short, simple examples.
- Swapping out just a few words quickly lowers the similarity: changing "sat" to "sit" in one sentence drops it from 0.6 to 0.4.
- The computer doesn’t read "meaning," just matches word usage!
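The last two points can be checked with the same scikit-learn tools from the steps above. This is a sketch, not a polished utility: the wrapper name sentence_similarity and the "film"/"movie" sentence pair are my own additions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def sentence_similarity(first, second):
    """Bag-of-words cosine similarity between two sentences."""
    vectors = CountVectorizer().fit_transform([first, second])
    return float(cosine_similarity(vectors)[0, 1])


original = sentence_similarity("cat sat on the mat", "dog sat on the rug")
# "sat" becomes "sit" in the first sentence, so one shared word is lost
one_word_changed = sentence_similarity("cat sit on the mat", "dog sat on the rug")
# Similar meaning, but not a single word in common
same_meaning = sentence_similarity("the film was great", "that movie is excellent")

print(f"{original:.3f} {one_word_changed:.3f} {same_meaning:.3f}")
```

The third pair means roughly the same thing to a human, yet it scores zero because the two sentences share no words. That is the limit of plain word counting.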
What’s Next
All these baby steps add up!
Next stop: We will make a simple search engine using this concept.