Hi Pythonistas!
Today, let’s build a sentiment classifier using Logistic Regression, a fundamental and powerful classification algorithm in machine learning.
Step 1: Prepare Your Data
Download the IMDB dataset from this link:
We start by cleaning the text:
import re
import pandas as pd
def clean_text(text):
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Keep only letters and whitespace (drops punctuation and digits)
    text = text.lower().strip()  # Convert to lowercase and trim
    return text
df = pd.read_csv('IMDB_Dataset.csv')
df['clean_review'] = df['review'].apply(clean_text)
This removes punctuation and digits from every review, lowercases the text, and stores the result in a new clean_review column.
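To see what the cleaning step does to a single review, here is a small sketch. The sample string and the clean_text_html helper are my own illustrations, not part of the original pipeline. The helper assumes the reviews may contain HTML line-break tags such as <br />, in which case the plain version leaves "br" fragments glued to neighbouring words.

```python
import re

def clean_text(text):
    # Same cleaning as above: keep letters and whitespace, lowercase, trim
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text.lower().strip()

def clean_text_html(text):
    # Hypothetical variant: replace anything that looks like an HTML tag
    # with a space, then apply the same cleaning
    text = re.sub(r'<[^>]+>', ' ', text)
    return clean_text(text)

# Made-up review used only for illustration
sample = "Loved it.<br /><br />Would watch again, 10/10!"

print(clean_text(sample))       # tag fragments survive as "br"
print(clean_text_html(sample))  # tags are removed before cleaning
```

If your copy of the dataset has no HTML tags, the original clean_text is enough.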
Step 2: Convert Text to Features Using Bag-of-Words
Transform the cleaned text into numerical features using CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['clean_review'])
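If bag-of-words is new to you, the idea can be sketched in plain Python. This is only an illustration on a made-up three-document corpus, not how CountVectorizer is implemented: the real class also applies its own tokenization, returns a sparse matrix, and with max_features=5000 keeps only the most frequent terms.

```python
from collections import Counter

# Tiny made-up corpus used only for illustration
docs = ["good movie", "bad movie", "good good plot"]

# Vocabulary: every distinct word, sorted alphabetically
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word,
# each cell holding how often that word occurs in that document
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each review becomes a row of word counts, and word order is discarded. That is exactly the kind of numeric input the classifier needs.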
Step 3: Encode Your Sentiment Labels
Convert string labels into numeric form:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(df['sentiment'])
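It helps to know which number stands for which sentiment. LabelEncoder sorts the distinct labels and numbers them in that order, so the sketch below mimics that behaviour in plain Python. It assumes the sentiment column holds the strings "positive" and "negative"; check your own file before relying on this.

```python
# Assumed label values; verify against df['sentiment'].unique()
labels = ["positive", "negative", "negative", "positive"]

# Sort the distinct labels and number them in that order
classes = sorted(set(labels))
mapping = {label: index for index, label in enumerate(classes)}

# Replace each string label with its integer code
encoded = [mapping[label] for label in labels]

print(mapping)
print(encoded)
```

Under that assumption, 0 means negative and 1 means positive, which is how to read the rows of the classification report in the next step. With the real encoder you can confirm the order by printing le.classes_.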
Step 4: Split Data and Train Logistic Regression Model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=100)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
Here we split the data into two sets: 80 percent for training and 20 percent for testing. We then fit LogisticRegression on the training features and labels. Once training is done, we check the accuracy on the test set. Try changing the max_iter parameter and see what happens to the accuracy.
Output
Accuracy: 0.8745
              precision    recall  f1-score   support
           0       0.88      0.87      0.87      4961
           1       0.87      0.88      0.88      5039
    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000
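The support column adds up to 10000 test reviews, which is what a 20 percent split gives if the file has 50000 rows (my inference from the report, not something stated above). Here is a rough plain-Python sketch of a shuffled 80/20 split. The split_indices function is hypothetical and will not reproduce the exact rows train_test_split picks, even with the same seed.

```python
import random

def split_indices(n_samples, test_size=0.2, seed=42):
    # Hypothetical helper: shuffle row indices reproducibly,
    # then cut off the first test_size fraction as the test set
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]

# 50000 is an assumed dataset size inferred from the report above
train_idx, test_idx = split_indices(50000)
print(len(train_idx), len(test_idx))
```

Shuffling before cutting matters. If the file happened to be sorted by sentiment, a plain head/tail cut would give the model a lopsided test set.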
Why Logistic Regression?
It’s simple, interpretable, and fast.
Performs well on text classification tasks.
Offers a solid baseline before moving to deep learning models.
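To make the "interpretable" point concrete: a fitted logistic regression holds one weight per word plus a bias, and the predicted probability is the sigmoid of the weighted sum of word counts. The sketch below uses made-up weights and a made-up bias purely for illustration; they are not taken from the model trained above.

```python
import math

def sigmoid(z):
    # Squash any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights: positive values push towards class 1,
# negative values push towards class 0
weights = {"great": 1.2, "boring": -1.5, "plot": 0.1}
bias = -0.2

def predict_proba(review):
    # Count each word, then take the sigmoid of bias + sum(weight * count)
    counts = {}
    for word in review.split():
        counts[word] = counts.get(word, 0) + 1
    z = bias + sum(weights.get(word, 0.0) * n for word, n in counts.items())
    return sigmoid(z)

print(predict_proba("great great plot"))  # above 0.5, leans towards class 1
print(predict_proba("boring plot"))       # below 0.5, leans towards class 0
```

With the real model, the learned weights live in model.coef_ and the matching words in vectorizer.get_feature_names_out(), so you can inspect which words push a review towards each class.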
What’s Next?
Next, we will learn what a neural network is.