From Basics to Bots: My Weekly AI Engineering Adventure-10

Hi Pythonistas!

Today, let’s build a sentiment classifier using Logistic Regression, a fundamental and powerful classification algorithm in machine learning.

Step 1: Prepare Your Data
Download the IMDB dataset from this link:
We start by cleaning the text.

import re
import pandas as pd

def clean_text(text):
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove punctuation and digits
    text = text.lower().strip()               # Convert to lowercase and trim
    return text

df = pd.read_csv('IMDB_Dataset.csv')

df['clean_review'] = df['review'].apply(clean_text)

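This removes punctuation and digits from every review, converts the text to lowercase, and trims surrounding whitespace. The cleaned text is stored in a new clean_review column.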
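To see what the cleaning step does to a single string, here is a small sketch. The sample sentence is made up, and clean_text_strip_html is a hypothetical variant that is not part of the tutorial code. It assumes your copy of the dataset contains HTML line breaks such as <br />, so check a few rows before relying on it.

```python
import re

def clean_text(text):
    # Same cleaning logic as the tutorial function above
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text.lower().strip()

def clean_text_strip_html(text):
    # Hypothetical variant (an assumption, not the tutorial's method):
    # replace HTML tags with a space before removing punctuation
    text = re.sub(r'<[^>]+>', ' ', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Collapse the runs of whitespace left behind by the removals
    text = re.sub(r'\s+', ' ', text)
    return text.lower().strip()

sample = "Loved it.<br /><br />Would watch again!"  # made-up review
print(clean_text(sample))             # loved itbr br would watch again
print(clean_text_strip_html(sample))  # loved it would watch again
```

With the plain version, the letters inside the tag survive as a stray "br" token. Whether that matters for accuracy is something you would have to measure on your own data.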

Step 2: Convert Text to Features Using Bag-of-Words

Transform the cleaned text into numerical features using CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['clean_review'])

Step 3: Encode Your Sentiment Labels

Convert string labels into numeric form:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

y = le.fit_transform(df['sentiment'])
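It helps to know which number stands for which label. LabelEncoder sorts the class names, so the sketch below shows the mapping on a made-up list. It assumes the sentiment column holds the strings "positive" and "negative"; check df['sentiment'].unique() on your own file.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels; assumes the dataset uses these two strings
toy_labels = ["positive", "negative", "positive"]

toy_le = LabelEncoder()
toy_y = toy_le.fit_transform(toy_labels)

# classes_ is sorted, and the position of each class is its numeric code
print(toy_le.classes_)                   # ['negative' 'positive']
print(toy_y)                             # [1 0 1]

# inverse_transform maps codes back to the original strings
print(toy_le.inverse_transform([0, 1]))  # ['negative' 'positive']
```

Under that assumption, the 0 and 1 rows in a classification report would correspond to negative and positive reviews respectively.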

Step 4: Split Data and Train Logistic Regression Model

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=100)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Here we split the data into two sets: one for training the model and one for testing it. We use 80 percent of the reviews for training and 20 percent for testing. We then fit LogisticRegression on the training features and labels. Once training is done, we check the accuracy on the test set. Try changing the max_iter parameter and see what happens to the accuracy.

Output

Accuracy: 0.8745
              precision    recall  f1-score   support

           0       0.88      0.87      0.87      4961
           1       0.87      0.88      0.88      5039

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000
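If you want to try the max_iter experiment, a loop like the one below is one way to do it. This is a rough sketch: accuracy_by_max_iter is a hypothetical helper, and the demo runs on synthetic data from make_classification so that it is self-contained. The numbers it prints say nothing about the IMDB results.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def accuracy_by_max_iter(X_train, X_test, y_train, y_test, values):
    # Hypothetical helper: fit one model per max_iter value
    # and return a dict of {max_iter: test accuracy}
    results = {}
    for n in values:
        with warnings.catch_warnings():
            # Small max_iter values may stop before convergence; hide that warning
            warnings.simplefilter("ignore", ConvergenceWarning)
            clf = LogisticRegression(max_iter=n)
            clf.fit(X_train, y_train)
        results[n] = accuracy_score(y_test, clf.predict(X_test))
    return results

# Synthetic stand-in data so the sketch runs without the IMDB file
demo_X, demo_y = make_classification(n_samples=500, n_features=20, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(demo_X, demo_y, test_size=0.2, random_state=42)

demo_results = accuracy_by_max_iter(Xtr, Xte, ytr, yte, [5, 50, 500])
for n, acc in demo_results.items():
    print(f"max_iter={n}: accuracy={acc:.4f}")
```

To run it on the reviews, pass the X_train, X_test, y_train, and y_test variables from Step 4 instead of the synthetic arrays.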

Why Logistic Regression?

It’s simple, interpretable, and fast.
Performs well on text classification tasks.
Offers a solid baseline before moving to deep learning models.

What’s Next?

Next time, we will learn what a neural network is.