Sentiment Analysis on IMDB Reviews Using Machine Learning
By Rohit Sharma
Updated on Aug 08, 2025 | 1.21K+ views
Understanding how audiences respond to a film has become increasingly important in today's digital world. In this project, we will perform sentiment analysis on IMDB reviews by applying machine learning models that classify text-based movie reviews as positive or negative. By analyzing thousands of real user reviews, we want to build an automated system that interprets sentiment, saving time and giving producers, marketers, and recommendation engines useful insights.
This beginner-level project combines a little bit of natural language processing (NLP) with classification algorithms such as Logistic Regression and Multinomial Naive Bayes.
For more project ideas like this one, check out our blog post on the Top 25+ Essential Data Science Projects GitHub to Explore in 2025.
To complete this project successfully, you should understand the fundamentals of Python syntax, pandas for handling data, and a few machine learning and natural language processing concepts such as tokenization and vectorization. It also helps to know how classification models such as Logistic Regression and Naive Bayes operate.
Let’s start building the project from scratch. So, without wasting any more time, let’s begin!
Let's import all the required libraries for data handling, visualization, preprocessing, and modeling before we begin working with the dataset. To do this, use the code listed below:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
Now we will use Google Colab's file upload feature to upload the dataset (in CSV format). To accomplish the same, use the code listed below:
# Import required libraries
import pandas as pd
import numpy as np
# Upload the dataset from your local system
from google.colab import files
uploaded = files.upload()
You will be asked to upload a file after running the aforementioned code. Upload the IMDB Dataset.csv file. (This file is available for download at https://www.kaggle.com/code/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews/input.)
Once the .csv file has been uploaded to Colab, load the dataset into a DataFrame. To do this, enter the code listed below:
# Load the uploaded CSV into a pandas DataFrame
# (named df because the later steps refer to it as df)
df = pd.read_csv('IMDB Dataset.csv')
# Display the first few rows
df.head()
Output:
review sentiment
0 One of the other reviewers has mentioned that ... positive
1 A wonderful little production. <br /><br />The... positive
2 I thought this was a wonderful way to spend ti... positive
3 Basically there's a family where a little boy ... negative
4 Petter Mattei's "Love in the Time of Money" is... positive
Before we can train the model, we have to make sure the text is consistent and suitable for analysis. To achieve that, in this step we will convert each review to lowercase, remove punctuation and numbers, and collapse extra whitespace.
Use the below-mentioned code to accomplish the same:
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # remove punctuation and numbers
    text = re.sub(r'\s+', ' ', text)  # remove extra spaces
    return text.strip()

# Apply cleaning to the 'review' column
df['cleaned_review'] = df['review'].apply(clean_text)
print(df[['review', 'cleaned_review']].head())
Output:
review \
0 One of the other reviewers has mentioned that ...
1 A wonderful little production. <br /><br />The...
2 I thought this was a wonderful way to spend ti...
3 Basically there's a family where a little boy ...
4 Petter Mattei's "Love in the Time of Money" is...
cleaned_review
0 one of the other reviewers has mentioned that ...
1 a wonderful little production br br the filmin...
2 i thought this was a wonderful way to spend ti...
3 basically theres a family where a little boy j...
4 petter matteis love in the time of money is a ...
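Note that the cleaned output above still contains stray "br" tokens, because the raw reviews include HTML line-break tags such as `<br />` and only the angle brackets and slash are removed. The following is a minimal sketch of one way to handle this, assuming it is acceptable to replace every HTML tag with a space before the other cleaning steps. The function name `clean_text_no_html` is hypothetical and not part of the original notebook, and using it would change the cleaned text (and possibly the scores) shown in this article.

```python
import re

def clean_text_no_html(text):
    """Variant of clean_text that strips HTML tags before the other steps."""
    text = text.lower()
    text = re.sub(r'<[^>]+>', ' ', text)   # replace HTML tags such as <br /> with a space
    text = re.sub(r'[^a-z\s]', '', text)   # remove punctuation and numbers
    text = re.sub(r'\s+', ' ', text)       # collapse repeated whitespace
    return text.strip()

sample = 'A wonderful little production. <br /><br />The filming technique is great!'
print(clean_text_no_html(sample))
```

Replacing the tag with a space rather than an empty string keeps the words on either side of the tag from being glued together.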
Machine learning models cannot read raw text, so each cleaned review has to be converted into a numerical vector.
To do this, we will use TF-IDF (Term Frequency–Inverse Document Frequency). It measures how important a word is within a review relative to how often that word appears across all reviews, so words that occur in almost every review receive less weight than words that are distinctive to a few.
Use the below-mentioned code:
# Step 1: Vectorize using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['cleaned_review'])
# Step 2: Use the sentiment labels ('positive' / 'negative') as the target
y = df['sentiment']
# Step 3: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
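To see what TF-IDF does on a small scale, here is a minimal sketch on three made-up reviews. The toy sentences and the variable names (`toy_reviews`, `toy_vectorizer`, `idf_by_word`) are illustrative assumptions, not part of the IMDB dataset or the original notebook. It prints the inverse document frequency that scikit-learn assigns to each word, so you can check that words appearing in every review get the lowest value.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up reviews, used only for illustration
toy_reviews = [
    "the movie was great",
    "the movie was terrible",
    "the plot was great",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_reviews)

# Map each vocabulary word to its inverse document frequency (IDF)
idf_by_word = {
    word: toy_vectorizer.idf_[index]
    for word, index in toy_vectorizer.vocabulary_.items()
}

# Words found in every review come first (lowest IDF)
for word, idf in sorted(idf_by_word.items(), key=lambda item: item[1]):
    print(f"{word:10s} idf = {idf:.3f}")

print("Matrix shape (reviews, vocabulary size):", toy_matrix.shape)
```

In the project code, `max_features=5000` additionally caps the vocabulary at the 5,000 most frequent terms, which keeps the feature matrix manageable for 50,000 reviews.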
In this step, we will train two models: Multinomial Naive Bayes and Logistic Regression.
Use the below-mentioned code:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
# Train Multinomial Naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)
# Train Logistic Regression model
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
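In the code above, the TF-IDF vectorizer is fitted on all reviews before the train-test split, so the test reviews influence the vocabulary and IDF weights. One common alternative is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline` and fit it on the training text only. The sketch below illustrates the idea on a handful of made-up reviews. The data and the names `toy_texts`, `toy_labels`, and `sentiment_pipeline` are assumptions for illustration, and `max_iter=1000` is a choice made here rather than a setting from the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up reviews, used only to keep the example self-contained
toy_texts = [
    "a wonderful and moving film",
    "great acting and a great story",
    "i loved every minute of it",
    "one of the best movies this year",
    "a boring and predictable plot",
    "terrible acting and a weak script",
    "i hated every minute of it",
    "one of the worst movies this year",
]
toy_labels = ["positive"] * 4 + ["negative"] * 4

# Split the raw text first, so the vectorizer never sees the test reviews
train_texts, test_texts, train_labels, test_labels = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The pipeline fits TF-IDF and the classifier together on the training text
sentiment_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
sentiment_pipeline.fit(train_texts, train_labels)

predictions = sentiment_pipeline.predict(test_texts)
for text, label in zip(test_texts, predictions):
    print(f"{label:9s} <- {text}")
```

A pipeline also accepts raw strings at prediction time, so there is no separate `vectorizer.transform` call to remember. With only eight toy reviews the predictions themselves are not meaningful.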
In this step, we will evaluate the performance of both models on the test data using key evaluation metrics.
To do so, use the below-mentioned code:
# Predict on test data
nb_preds = nb_model.predict(X_test)
logistic_preds = logistic_model.predict(X_test)
# Evaluate
from sklearn.metrics import accuracy_score, confusion_matrix
# Naive Bayes
print("Multinomial Naive Bayes Accuracy:", accuracy_score(y_test, nb_preds))
print("Confusion Matrix:\n", confusion_matrix(y_test, nb_preds))
# Logistic Regression
print("\nLogistic Regression Accuracy:", accuracy_score(y_test, logistic_preds))
print("Confusion Matrix:\n", confusion_matrix(y_test, logistic_preds))
Output:
Multinomial Naive Bayes Accuracy: 0.8503
Confusion Matrix:
[[4223 738]
[ 759 4280]]
Logistic Regression Accuracy: 0.8928
Confusion Matrix:
[[4369 592]
[ 480 4559]]
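To read these matrices: scikit-learn's `confusion_matrix` puts the true labels in rows and the predicted labels in columns, with labels in sorted order, so "negative" comes before "positive" here. The short sketch below copies the two matrices from the output above and derives accuracy, plus precision and recall for the positive class, by hand. The helper name `summarize` is hypothetical.

```python
import numpy as np

# Confusion matrices copied from the output above
# Rows: true label (negative, positive); columns: predicted label (negative, positive)
nb_cm = np.array([[4223, 738], [759, 4280]])
lr_cm = np.array([[4369, 592], [480, 4559]])

def summarize(cm):
    """Return accuracy, precision, and recall for the positive class."""
    tn, fp, fn, tp = cm.ravel()
    accuracy = (tp + tn) / cm.sum()
    precision = tp / (tp + fp)  # of the reviews predicted positive, how many were positive
    recall = tp / (tp + fn)     # of the truly positive reviews, how many were found
    return accuracy, precision, recall

for name, cm in [("Multinomial Naive Bayes", nb_cm), ("Logistic Regression", lr_cm)]:
    accuracy, precision, recall = summarize(cm)
    print(f"{name}: accuracy={accuracy:.4f}, precision={precision:.4f}, recall={recall:.4f}")
```

The same per-class figures can be obtained directly from `classification_report(y_test, nb_preds)`, which is imported at the start of the project but not called in the code above.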
In this step, we will build an interactive loop in which the user can submit their own reviews. Once a review is entered, the trained models determine whether its sentiment is positive or negative, and the results are displayed. The loop runs until the user types "exit".
To accomplish this, use the below-mentioned code:
while True:
    # Take user input
    user_input = input("Enter a product review (or type 'exit' to quit): ").strip()

    # Exit condition
    if user_input.lower() == "exit":
        print("Exiting Sentiment Analysis. Goodbye!")
        break

    # Clean the input the same way as the training reviews, then vectorize it
    vectorized_input = vectorizer.transform([clean_text(user_input)])

    # Predict using both models
    nb_prediction = nb_model.predict(vectorized_input)[0]
    logistic_prediction = logistic_model.predict(vectorized_input)[0]

    # Show the output
    print("\n--- Sentiment Analysis Result ---")
    print("Multinomial Naive Bayes Prediction:", nb_prediction.capitalize())
    print("Logistic Regression Prediction:", logistic_prediction.capitalize())
    print("-" * 50)
Output:
Enter a product review (or type 'exit' to quit): it was good
--- Sentiment Analysis Result ---
Multinomial Naive Bayes Prediction: Negative
Logistic Regression Prediction: Positive
--------------------------------------------------
Enter a product review (or type 'exit' to quit): it was dissapointing
--- Sentiment Analysis Result ---
Multinomial Naive Bayes Prediction: Negative
Logistic Regression Prediction: Positive
--------------------------------------------------
Enter a product review (or type 'exit' to quit): exit
Exiting Sentiment Analysis. Goodbye!
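In the sample run above, the two models disagree on both short inputs. When that happens, it can help to look at the class probabilities rather than only the predicted label, since a very short review gives the models little to work with. The sketch below shows one way to do this with `predict_proba`, which both `MultinomialNB` and `LogisticRegression` provide. It trains on a few made-up reviews so that it runs on its own. The helper `predict_with_confidence` and the `demo_*` names are hypothetical, and the probabilities it prints say nothing about the IMDB models.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Made-up training data, used only to keep the example self-contained
demo_texts = [
    "it was good and fun",
    "a great and touching movie",
    "it was bad and dull",
    "a terrible and boring movie",
]
demo_labels = ["positive", "positive", "negative", "negative"]

demo_vectorizer = TfidfVectorizer()
demo_X = demo_vectorizer.fit_transform(demo_texts)

demo_models = {
    "Multinomial Naive Bayes": MultinomialNB().fit(demo_X, demo_labels),
    "Logistic Regression": LogisticRegression().fit(demo_X, demo_labels),
}

def predict_with_confidence(text, vectorizer, model):
    """Return the predicted label and the probability the model assigns to it."""
    probabilities = model.predict_proba(vectorizer.transform([text]))[0]
    best = probabilities.argmax()
    return model.classes_[best], float(probabilities[best])

for name, model in demo_models.items():
    label, confidence = predict_with_confidence("it was good", demo_vectorizer, model)
    print(f"{name}: {label} (probability {confidence:.2f})")
```

A probability close to 0.5 suggests the model is nearly undecided, which is a reasonable signal to treat the prediction with caution or to ask for a longer review.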
In this project, we performed sentiment analysis on IMDB reviews. We cleaned the text, applied TF-IDF vectorization, and trained two classification models: Logistic Regression and Multinomial Naive Bayes.
Logistic Regression performed better on the test set, reaching about 89.3% accuracy compared with about 85.0% for Multinomial Naive Bayes. Now that the pipeline is set up, you can run real-time sentiment analysis on user reviews.
Colab link:
https://colab.research.google.com/drive/1XHVDbSXkxHw6lIoxL3xVVUVhABcp7ZCU?usp=sharing