Sentiment Analysis on IMDB Reviews Using Machine Learning

By Rohit Sharma

Updated on Aug 08, 2025

Understanding how audiences react to a film has become increasingly important in today's digital world. In this project, we will perform sentiment analysis on IMDB reviews, applying machine learning models to classify text-based movie reviews as positive or negative. By analyzing thousands of real user reviews, we aim to build an automated system that interprets sentiment, saving time and giving producers, marketers, and recommendation engines useful insights.

This beginner-level project combines a little bit of natural language processing (NLP) with classification algorithms such as Logistic Regression and Multinomial Naive Bayes.

What Do You Need to Know Up Front?

You should understand the fundamentals of Python syntax, pandas for handling data, and some machine learning and natural language processing concepts like tokenization and vectorization in order to complete this project successfully. It will also be useful to understand how classification models, such as logistic regression or Naive Bayes, operate.

Technologies and Libraries Used in Sentiment Analysis on IMDB Reviews

It will be helpful to have some basic knowledge of:

  • Python – Programming language
  • Pandas – Data handling and cleaning
  • Scikit-learn – ML models and preprocessing
  • NLTK / TextBlob – Text processing and sentiment tools (optional)
  • Google Colab – Development environment

Time Taken and Difficulty

  • Estimated Time: 1.5 to 2 hours
  • Difficulty: Beginner to Intermediate 

How to Build a Model for Sentiment Analysis on IMDB Reviews

Let’s build the project from scratch, one step at a time.

Step 1: Import Libraries

Let's import all the required libraries for data handling, visualization, preprocessing, and modeling before we begin working with the dataset. To do this, use the code listed below:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Step 2: Upload the dataset 

Now we will use Google Colab's file upload feature to upload the dataset (in CSV format). To accomplish the same, use the code listed below:

# Import required libraries
import pandas as pd
import numpy as np

# Upload the dataset from your local system
from google.colab import files
uploaded = files.upload()

You will be asked to upload a file after running the aforementioned code. Upload the IMDB Dataset.csv file. (This file is available for download at https://www.kaggle.com/code/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews/input.)

Once the .csv file has been uploaded to Colab, load the dataset into a DataFrame. To do this, enter the code listed below:

# Load the uploaded CSV into a pandas DataFrame
# (named df because every later step refers to df)
df = pd.read_csv('IMDB Dataset.csv')

# Display the first few rows
df.head()

Output:

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
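Before cleaning the text, it can help to run a few quick checks on the loaded data. The sketch below uses a tiny hand-made DataFrame (`df_demo`, a hypothetical stand-in for the real `df`) so that it runs on its own; the same calls can be applied to the real DataFrame.

```python
import pandas as pd

# Tiny stand-in for the real DataFrame (illustrative rows only)
df_demo = pd.DataFrame({
    "review": ["Loved it", "Hated it", "Loved it", "Not bad"],
    "sentiment": ["positive", "negative", "positive", "positive"],
})

print(df_demo.shape)                         # (rows, columns)
print(df_demo["sentiment"].value_counts())   # class balance
print(df_demo.isna().sum())                  # missing values per column
print(df_demo.duplicated().sum())            # number of exact duplicate rows
```

A heavily imbalanced label column or a large number of duplicates would be worth knowing about before reading too much into the accuracy scores later on.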

Step 3: Clean and Preprocess the IMDB Movie Reviews

Before we train the model, the text needs to be consistent and suitable for analysis. In this step, we will:

  • Convert text to lowercase
  • Remove punctuation and numbers
  • Collapse extra whitespace

Note that this function does not remove stopwords; tokenization is handled later by the TF-IDF vectorizer.

Use the below-mentioned code to accomplish the same:

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # remove punctuation and numbers
    text = re.sub(r'\s+', ' ', text)         # remove extra spaces
    return text.strip()

# Apply cleaning to the 'review' column
df['cleaned_review'] = df['review'].apply(clean_text)
print(df[['review', 'cleaned_review']].head())

Output:

                                              review  \
0  One of the other reviewers has mentioned that ...   
1  A wonderful little production. <br /><br />The...   
2  I thought this was a wonderful way to spend ti...   
3  Basically there's a family where a little boy ...   
4  Petter Mattei's "Love in the Time of Money" is...   

                                      cleaned_review  
0  one of the other reviewers has mentioned that ...  
1  a wonderful little production br br the filmin...  
2  i thought this was a wonderful way to spend ti...  
3  basically theres a family where a little boy j...  
4  petter matteis love in the time of money is a ...
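The output above shows that HTML line breaks (`<br />`) survive cleaning as stray "br" tokens. Below is a minimal sketch of one way to drop them, assuming a simple regular expression is adequate for the markup in these reviews. `clean_text_html` is a hypothetical name and is not part of the original notebook.

```python
import re

def clean_text_html(text):
    """Variant of clean_text that drops HTML tags before removing punctuation."""
    text = re.sub(r'<[^>]+>', ' ', text)   # replace tags such as <br /> with a space
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)   # keep only letters and whitespace
    text = re.sub(r'\s+', ' ', text)       # collapse repeated whitespace
    return text.strip()

sample = "A wonderful little production. <br /><br />The filming technique is great!"
print(clean_text_html(sample))
# → a wonderful little production the filming technique is great
```

If you swap this in for clean_text, the accuracy figures reported later may shift slightly, since the vocabulary seen by the vectorizer changes.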

Step 4: Split and Vectorize the Data

Machine learning models cannot read raw text, so each cleaned review must first be converted into a numerical vector.

To do this, we will use TF-IDF (Term Frequency–Inverse Document Frequency), which weighs how important a word is to a single review relative to all reviews.
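To make the idea concrete, here is a small pure-Python sketch that computes TF-IDF weights by hand for three made-up reviews. It assumes the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is the variant scikit-learn documents as its default. The toy corpus and the `tfidf` helper are illustrative only.

```python
import math
from collections import Counter

# Three made-up reviews (illustrative only)
docs = [
    "great movie great cast",
    "boring movie",
    "great fun",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many reviews does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized review."""
    counts = Counter(tokens)
    # Term count multiplied by smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = {
        word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}

for doc, tokens in zip(docs, tokenized):
    print(doc, "->", {w: round(v, 3) for w, v in tfidf(tokens).items()})
```

In the second review, "boring" receives a higher weight than "movie" because "movie" also appears in another review and is therefore less distinctive.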

Use the below-mentioned code:

# Step 1: Vectorize using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['cleaned_review'])

# Step 2: Select the target labels ('positive' / 'negative' strings, used as-is)
y = df['sentiment']

# Step 3: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
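The code above fits the vectorizer on every review before splitting, so the IDF statistics also see the test reviews. A common alternative is to split the raw text first and let a scikit-learn Pipeline fit TF-IDF on the training portion only. The sketch below shows that pattern on a toy corpus that stands in for `df['cleaned_review']` and `df['sentiment']`; the data and the `stratify` choice are assumptions for illustration, not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in for the cleaned reviews and their labels (illustrative only)
reviews = [
    "a wonderful film with a great cast",
    "great story and wonderful acting",
    "loved every minute of it",
    "brilliant and moving",
    "a boring and terrible film",
    "terrible plot and awful acting",
    "hated every minute of it",
    "dull and disappointing",
]
labels = ["positive"] * 4 + ["negative"] * 4

# Split the raw text first; stratify keeps the class balance in both parts
train_texts, test_texts, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then trains the classifier
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(train_texts, y_train)

predictions = pipeline.predict(test_texts)
print(list(zip(test_texts, predictions)))
```

With this layout, calling `pipeline.predict` on new text applies the same vectorization automatically, so there is no separate `vectorizer.transform` step to remember.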

Step 5: Train Multinomial Naive Bayes and Logistic Regression Models

In this step, we will train both of the following models:

  • Logistic Regression
  • Multinomial Naive Bayes 

Use the below-mentioned code:

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Train Multinomial Naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Train Logistic Regression model
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)

Step 6: Compare the Models

In this step, we will evaluate the performance of both models on the test data using accuracy and the confusion matrix.

To do so, use the below-mentioned code:

# Predict on test data
nb_preds = nb_model.predict(X_test)
logistic_preds = logistic_model.predict(X_test)

# Evaluate
from sklearn.metrics import accuracy_score, confusion_matrix

# Naive Bayes
print("Multinomial Naive Bayes Accuracy:", accuracy_score(y_test, nb_preds))
print("Confusion Matrix:\n", confusion_matrix(y_test, nb_preds))

# Logistic Regression
print("\nLogistic Regression Accuracy:", accuracy_score(y_test, logistic_preds))
print("Confusion Matrix:\n", confusion_matrix(y_test, logistic_preds))

Output:

Multinomial Naive Bayes Accuracy: 0.8503
Confusion Matrix:
 [[4223  738]
 [ 759 4280]]

Logistic Regression Accuracy: 0.8928
Confusion Matrix:
 [[4369  592]
 [ 480 4559]]
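The confusion matrices above contain enough information to derive precision, recall, and F1 as well as accuracy. The sketch below recomputes them by hand from the printed numbers. It assumes scikit-learn's convention that rows are true labels and columns are predicted labels, with string labels sorted alphabetically (negative, then positive). `summarize` is a hypothetical helper.

```python
# Confusion matrices copied from the output above.
# Rows are true labels, columns are predicted labels, ordered (negative, positive).
matrices = {
    "Multinomial Naive Bayes": [[4223, 738], [759, 4280]],
    "Logistic Regression": [[4369, 592], [480, 4559]],
}

def summarize(matrix):
    """Compute accuracy plus precision, recall, and F1 for the positive class."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp)   # of the reviews predicted positive, how many were
    recall = tp / (tp + fn)      # of the truly positive reviews, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

for name, matrix in matrices.items():
    scores = summarize(matrix)
    print(name, {key: round(value, 4) for key, value in scores.items()})
```

The accuracy values reproduce the 0.8503 and 0.8928 reported above. The same per-class figures can also be obtained directly with classification_report, which Step 1 imports.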

Step 7: Predict Sentiment Using User Input

In this step, we will build an interactive loop in which the user can submit reviews. Once a review is entered, both trained models predict whether its sentiment is positive or negative, and the results are displayed. The loop runs until the user types exit.

To accomplish this, use the below-mentioned code:

while True:
    # Take user input
    user_input = input("Enter a product review (or type 'exit' to quit): ").strip()

    # Exit condition
    if user_input.lower() == "exit":
        print("Exiting Sentiment Analysis. Goodbye!")
        break

    # Clean the input the same way as the training data, then vectorize it
    vectorized_input = vectorizer.transform([clean_text(user_input)])

    # Predict using both models
    nb_prediction = nb_model.predict(vectorized_input)[0]
    logistic_prediction = logistic_model.predict(vectorized_input)[0]

    # Show the output
    print("\n--- Sentiment Analysis Result ---")
    print("Multinomial Naive Bayes Prediction:", nb_prediction.capitalize())
    print("Logistic Regression Prediction:", logistic_prediction.capitalize())
    print("-" * 50)

Output:

Enter a product review (or type 'exit' to quit): it was good

--- Sentiment Analysis Result ---
Multinomial Naive Bayes Prediction: Negative
Logistic Regression Prediction: Positive
--------------------------------------------------
Enter a product review (or type 'exit' to quit): it was dissapointing

--- Sentiment Analysis Result ---
Multinomial Naive Bayes Prediction: Negative
Logistic Regression Prediction: Positive
--------------------------------------------------
Enter a product review (or type 'exit' to quit): exit
Exiting Sentiment Analysis. Goodbye!
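In the output above, the two models disagree on both inputs, and very short reviews give either model little to work with. Printing the predicted probability next to the label can show how confident a model actually is. The sketch below relies on `predict_proba` and `classes_`, which both LogisticRegression and MultinomialNB provide. The helper name `predict_with_confidence` and the toy training data are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def predict_with_confidence(text, vectorizer, model):
    """Return (label, probability) for a single review."""
    features = vectorizer.transform([text])
    probabilities = model.predict_proba(features)[0]
    best = probabilities.argmax()
    return model.classes_[best], float(probabilities[best])

# Toy training data so that the sketch runs on its own (illustrative only)
train_texts = ["great wonderful film", "loved it great fun",
               "terrible boring film", "hated it awful mess"]
train_labels = ["positive", "positive", "negative", "negative"]

vectorizer = TfidfVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(train_texts), train_labels)

label, confidence = predict_with_confidence("it was great fun", vectorizer, model)
print(label, round(confidence, 3))
```

In the notebook, the same helper could be called with the trained `vectorizer` and either `nb_model` or `logistic_model`. A probability close to 0.5 suggests the prediction should be treated with caution.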

Conclusion

In this project, we performed sentiment analysis on IMDB reviews. We cleaned the text, applied TF-IDF vectorization, and trained two classification models: Logistic Regression and Multinomial Naive Bayes.

Logistic Regression performed better on the test set, with an accuracy of 0.8928 compared with 0.8503 for Multinomial Naive Bayes. Now that the pipeline is set up, you can run real-time sentiment analysis on user reviews.

Colab link:
https://colab.research.google.com/drive/1XHVDbSXkxHw6lIoxL3xVVUVhABcp7ZCU?usp=sharing

Frequently Asked Questions (FAQs)

1. What could be the main purpose of sentiment analysis for this project?

2. Why did we use both Logistic Regression and Multinomial Naive Bayes models?

3. What kind of preprocessing will one need to perform on text before training a sentiment model?

4. What changes are made to the input review prior to prediction?

5. Is it possible to apply this model in practical settings?

Rohit Sharma

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
