Sentiment Analysis on IMDB Reviews Using Machine Learning

Updated on Aug 08, 2025 | 1.48K+ views

Table of Contents

What Do You Need to Know Up Front?
Technologies and Libraries Used in Sentiment Analysis on IMDB Reviews
Time Taken and Difficulty
How to Build a Model for Sentiment Analysis on IMDB Reviews
Conclusion

Understanding how audiences react to a film has become increasingly valuable. In this project, we will perform sentiment analysis on IMDB reviews by applying machine learning models to classify text-based movie reviews as positive or negative. By analyzing thousands of real user reviews, we aim to build an automated system that interprets sentiment, saving time and giving producers, marketers, and recommendation engines useful insights.

This beginner-level project combines a little bit of natural language processing (NLP) with classification algorithms such as Logistic Regression and Multinomial Naive Bayes.

For more project ideas like this one, check out our blog post on the Top 25+ Essential Data Science Projects GitHub to Explore in 2025.

What Do You Need to Know Up Front?

You should understand the fundamentals of Python syntax, pandas for handling data, and some machine learning and natural language processing concepts like tokenization and vectorization in order to complete this project successfully. It will also be useful to understand how classification models, such as logistic regression or Naive Bayes, operate.

Technologies and Libraries Used in Sentiment Analysis on IMDB Reviews

It will be helpful to have some basic knowledge of:

Python – Programming language
Pandas – Data handling and cleaning
Scikit-learn – ML models and preprocessing
NLTK / TextBlob – Text processing and sentiment tools (optional)
Google Colab – Development environment

Time Taken and Difficulty

Estimated Time: 1.5 to 2 hours
Difficulty: Beginner to Intermediate

How to Build a Model for Sentiment Analysis on IMDB Reviews

Let’s start building the project from scratch. So, without wasting any more time, let’s begin!

Step 1: Import Libraries

Let's import all the required libraries for data handling, visualization, preprocessing, and modeling before we begin working with the dataset. To do this, use the code listed below:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Step 2: Upload the dataset

Now we will use Google Colab's file upload feature to upload the dataset (in CSV format). To accomplish the same, use the code listed below:

# Import required libraries
import pandas as pd
import numpy as np

# Upload the dataset from your local system
from google.colab import files
uploaded = files.upload()

You will be asked to upload a file after running the aforementioned code. Upload the IMDB Dataset.csv file. (This file is available for download at https://www.kaggle.com/code/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews/input.)

Once the .csv file has been uploaded to Colab, load the dataset into a DataFrame. To do this, enter the code listed below:

# Load the uploaded CSV into a pandas DataFrame
# (named df because the later steps refer to df)
df = pd.read_csv('IMDB Dataset.csv')

# Display the first few rows
df.head()

Output:

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive

Step 3: Clean and Preprocess the IMDB Movie Reviews

Before we can train the model, we have to make sure the text is consistent and suitable for analysis. To achieve that, in this step, we will:

Convert text to lowercase
Remove punctuation and numbers
Collapse extra whitespace

Use the below-mentioned code to accomplish the same:

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # remove punctuation and numbers
    text = re.sub(r'\s+', ' ', text)         # remove extra spaces
    return text.strip()

# Apply cleaning to the 'review' column
df['cleaned_review'] = df['review'].apply(clean_text)
print(df[['review', 'cleaned_review']].head())

Output:

                                              review  \
0  One of the other reviewers has mentioned that ...
1  A wonderful little production. <br /><br />The...
2  I thought this was a wonderful way to spend ti...
3  Basically there's a family where a little boy ...
4  Petter Mattei's "Love in the Time of Money" is...

                                      cleaned_review
0  one of the other reviewers has mentioned that ...
1  a wonderful little production br br the filmin...
2  i thought this was a wonderful way to spend ti...
3  basically theres a family where a little boy j...
4  petter matteis love in the time of money is a ...
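
The cleaned output above still contains stray br tokens left over from the <br /> tags, because the regular expression removes the angle brackets and slash but keeps the letters. One possible way to handle this is sketched below. It assumes a hypothetical helper named clean_review (not part of the original notebook) and uses scikit-learn's built-in English stop-word list for an optional stopword-removal step.

```python
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def clean_review(text, drop_stopwords=False):
    """Hypothetical variant of clean_text that also strips HTML tags."""
    text = text.lower()
    text = re.sub(r'<[^>]+>', ' ', text)   # replace HTML tags such as <br /> with a space
    text = re.sub(r'[^a-z\s]', '', text)   # remove punctuation and numbers
    tokens = text.split()                  # splitting on whitespace also collapses extra spaces
    if drop_stopwords:
        tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
    return ' '.join(tokens)

print(clean_review("A wonderful little production. <br /><br />The filming"))
# a wonderful little production the filming
```

Swapping this in for clean_text would change the cleaned_review column, so the accuracy figures reported later in this article would not necessarily be reproduced exactly.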

Step 4: Split and Vectorize the Data

Machine learning models cannot read raw text, so each cleaned review must first be converted into a numerical vector.

To achieve this, we will employ TF-IDF (Term Frequency–Inverse Document Frequency). It weights a word by how often it appears in a review and down-weights words that appear across many reviews.

Use the below-mentioned code:

# Step 1: Vectorize using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['cleaned_review'])

# Step 2: Select the target labels (kept as 'positive' / 'negative' strings)
y = df['sentiment']

# Step 3: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
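
Note that the code above fits the vectorizer on every review before splitting, so the vocabulary and IDF weights are learned partly from reviews that later end up in the test set. A common alternative is to split the raw text first and fit the vectorizer on the training portion only. The sketch below shows that ordering on a small made-up dataset. The toy reviews and variable names are assumptions for illustration, and the accuracy figures reported later in this article come from the original ordering, not this one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up stand-in data; in the project this would be df['cleaned_review'] and df['sentiment']
toy_reviews = [
    "loved every minute", "a wonderful film", "great acting and story", "truly enjoyable",
    "a boring mess", "terrible script", "waste of time", "dull and predictable",
]
toy_labels = ["positive"] * 4 + ["negative"] * 4

# 1. Split the raw text first
train_text, test_text, toy_y_train, toy_y_test = train_test_split(
    toy_reviews, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# 2. Learn the vocabulary and IDF weights from the training text only
split_vectorizer = TfidfVectorizer(max_features=5000)
toy_X_train = split_vectorizer.fit_transform(train_text)

# 3. Reuse the fitted vectorizer on the test text (transform, not fit_transform)
toy_X_test = split_vectorizer.transform(test_text)

print(toy_X_train.shape, toy_X_test.shape)
```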

Step 5: Train Multinomial Naive Bayes and Logistic Regression Models

In this step, we will train both of the following models:

Logistic Regression
Multinomial Naive Bayes

Use the below-mentioned code:

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Train Multinomial Naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Train Logistic Regression model
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)

Step 6: Compare the Models

In this step, we will evaluate the performance of both models on the test data using key evaluation metrics.

To do so, use the below-mentioned code:

# Predict on test data
nb_preds = nb_model.predict(X_test)
logistic_preds = logistic_model.predict(X_test)

# Evaluate
from sklearn.metrics import accuracy_score, confusion_matrix

# Naive Bayes
print("Multinomial Naive Bayes Accuracy:", accuracy_score(y_test, nb_preds))
print("Confusion Matrix:\n", confusion_matrix(y_test, nb_preds))

# Logistic Regression
print("\nLogistic Regression Accuracy:", accuracy_score(y_test, logistic_preds))
print("Confusion Matrix:\n", confusion_matrix(y_test, logistic_preds))

Output:

Multinomial Naive Bayes Accuracy: 0.8503
Confusion Matrix:
 [[4223  738]
 [ 759 4280]]

Logistic Regression Accuracy: 0.8928
Confusion Matrix:
 [[4369  592]
 [ 480 4559]]
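
Accuracy alone hides how the errors are distributed. As a rough worked example, the sketch below derives accuracy, precision, and recall for the positive class from the Logistic Regression confusion matrix printed above. It assumes scikit-learn's default layout, in which rows are true labels and columns are predicted labels in sorted order (negative, then positive).

```python
# Logistic Regression confusion matrix copied from the output above
# rows = true label, columns = predicted label, order = [negative, positive]
lr_matrix = [[4369, 592],
             [480, 4559]]

tn, fp = lr_matrix[0]   # true negatives, false positives
fn, tp = lr_matrix[1]   # false negatives, true positives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the reviews predicted positive, the share that were positive
recall = tp / (tp + fn)      # of the truly positive reviews, the share that were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The same per-class figures can be obtained directly with classification_report(y_test, logistic_preds), which is already imported in Step 1.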

Step 7: Predict Sentiment Using User Input

In this step, we will build an interactive loop in which the user can submit their own reviews. Once a review is entered, both trained models predict whether its sentiment is positive or negative, and the results are displayed. The loop runs until the user types exit.

To accomplish this, use the below-mentioned code:

while True:
    # Take user input
    user_input = input("Enter a product review (or type 'exit' to quit): ").strip()

    # Exit condition
    if user_input.lower() == "exit":
        print("Exiting Sentiment Analysis. Goodbye!")
        break

    # Clean the input the same way as the training data, then vectorize it
    vectorized_input = vectorizer.transform([clean_text(user_input)])

    # Predict using both models
    nb_prediction = nb_model.predict(vectorized_input)[0]
    logistic_prediction = logistic_model.predict(vectorized_input)[0]

    # Show the output
    print("\n--- Sentiment Analysis Result ---")
    print("Multinomial Naive Bayes Prediction:", nb_prediction.capitalize())
    print("Logistic Regression Prediction:", logistic_prediction.capitalize())
    print("-" * 50)

Output:

Enter a product review (or type 'exit' to quit): it was good

--- Sentiment Analysis Result ---
Multinomial Naive Bayes Prediction: Negative
Logistic Regression Prediction: Positive
--------------------------------------------------
Enter a product review (or type 'exit' to quit): it was dissapointing

--- Sentiment Analysis Result ---
Multinomial Naive Bayes Prediction: Negative
Logistic Regression Prediction: Positive
--------------------------------------------------
Enter a product review (or type 'exit' to quit): exit
Exiting Sentiment Analysis. Goodbye!
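
The two models disagree on both sample inputs above, and a bare label gives no hint of how sure either model is. One option is to wrap the prediction in a function and report the class probability as well. The sketch below trains a tiny stand-in model on made-up reviews so that it runs on its own. The function predict_sentiment and the toy data are hypothetical, while predict_proba and classes_ are standard scikit-learn members available on both LogisticRegression and MultinomialNB.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training data so the example is self-contained (not the IMDB dataset)
toy_train_reviews = [
    "a great film", "wonderful and moving", "great cast and story",
    "a boring film", "terrible and dull", "boring waste of time",
]
toy_train_labels = ["positive", "positive", "positive",
                    "negative", "negative", "negative"]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(toy_train_reviews), toy_train_labels)

def predict_sentiment(text, vectorizer, model):
    """Return (label, probability of that label) for a single review."""
    features = vectorizer.transform([text])
    probs = model.predict_proba(features)[0]   # one probability per class in model.classes_
    best = probs.argmax()
    return str(model.classes_[best]), float(probs[best])

label, confidence = predict_sentiment("a great film", toy_vectorizer, toy_model)
print(label, round(confidence, 3))
```

In the project itself, the same function could be called with the trained vectorizer and either nb_model or logistic_model in place of the toy objects.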

Conclusion

In this project, we performed sentiment analysis on IMDB reviews. We cleaned the text, applied TF-IDF vectorization, and trained two classification models: Logistic Regression and Multinomial Naive Bayes.

Logistic Regression performed better on the test set, with an accuracy of 0.8928 versus 0.8503 for Multinomial Naive Bayes. Now that the pipeline is set up, you can run sentiment analysis on new user reviews as they are entered.

Colab link:
https://colab.research.google.com/drive/1XHVDbSXkxHw6lIoxL3xVVUVhABcp7ZCU?usp=sharing

Frequently Asked Questions (FAQs)

1. What is the main purpose of sentiment analysis in this project?

The purpose is to classify movie reviews as positive or negative automatically, using machine learning models. This removes the need to read each review manually and makes it possible to understand user opinions at scale.

2. Why did we use both Logistic Regression and Multinomial Naive Bayes models?

Both models were applied to compare the accuracy and performance on text data. Logistic Regression performs well with balanced datasets, whereas Naive Bayes is fairly quick and works best with text features.

3. What kind of preprocessing should be performed on text before training a sentiment model?

Common steps include lowercasing and removing HTML tags, punctuation, special characters, and stopwords. This project lowercases the text and strips punctuation and numbers. Cleaning helps the model learn from meaningful words rather than noise.

4. What changes are made to the input review prior to prediction?

The input review is cleaned and vectorized using the TF-IDF method. The TF-IDF method transforms the text into a numerical format that the models can comprehend.

5. Is it possible to apply this model in practical settings?

Yes. This model can be incorporated into review monitoring systems, customer feedback analysis, and social media sentiment tracking tools with additional fine-tuning and a larger dataset.

Rohit Sharma

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
