Bollywood Movie Analysis and Success Prediction with Machine Learning

Updated on Aug 12, 2025 | 9 min read | 2.06K+ views

Table of Contents

Essential Project Requirements
Behind the Scenes: How To Do Bollywood Movie Analysis
Smart Insights: Techniques That Power Bollywood Movie Analysis
Final Conclusion

Bollywood is one of the largest film industries in the world. But what makes a movie a hit or a flop?

In this project, you’ll analyse real Bollywood movie data to uncover trends behind successful films. You'll explore genres, budgets, lead stars, and release periods to see how they impact performance.

Then you'll build a machine learning model to predict whether a new movie is likely to succeed based on key features.

Essential Project Requirements

Before you start working on the Bollywood Movie Analysis project, make sure you're comfortable with these tools and concepts:

Python programming (You’ll use Python throughout for data processing, visualisation, and modelling.)
Pandas and Numpy (These libraries help you clean, explore, and structure the dataset for analysis.)
Matplotlib or Seaborn (You’ll use these tools to create count plots, bar charts, and a confusion matrix heatmap)
Scikit‑learn basics (You’ll rely on this for building a machine learning pipeline and training a Random Forest classifier)
Data preprocessing (Knowing how to scale features and encode categorical variables is key to building a working model)
Model evaluation (You’ll use metrics like accuracy, classification report, and confusion matrix to evaluate your results)

Behind the Scenes: How To Do Bollywood Movie Analysis

To analyse and predict Bollywood movie success, you’ll work with Python libraries designed for data handling, visualisation, and classification:

Tool / Library	Purpose
Python	Core language for scripting and model building
Pandas	Loads, cleans, and structures movie data
NumPy	Supports numerical operations like ROI calculations
Matplotlib / Seaborn	Creates visualisations such as bar charts and heatmaps
Scikit-learn	Handles preprocessing, model training, and evaluation
RandomForestClassifier	Predicts movie success using selected features
Pipeline & ColumnTransformer	Automates scaling, encoding, and classification steps

Smart Insights: Techniques That Power Bollywood Movie Analysis

To get the most out of your Bollywood Movie Analysis & Success Prediction project, you'll apply these key data science techniques:

Exploratory Data Analysis (EDA):
Explore patterns in movie success, including hit rates, ROI by genre, and top-performing stars.
Feature Engineering:
Create new features like ROI and define success labels to prepare data for modelling.
Data Visualisation:
Use bar charts and count plots to show trends in genres, lead actors, release periods, and revenue.
Data Preprocessing:
Encode categorical data and scale numerical values to prepare inputs for the machine learning model.
Classification Modelling:
Train a Random Forest classifier to predict whether a movie will be a hit or a flop based on its features.
Model Evaluation:
Use accuracy, classification report, and a confusion matrix to check how well the model performs.

Time Required to Complete the Project: You can complete the Bollywood Movie Analysis & Success Prediction project in about 3 to 4 hours.

Let’s build this project from scratch with clear, step-by-step guidance:

Download the Dataset
Import Required Libraries
Load the Dataset
Clean the Data and Create New Features
Explore and Visualise the Data (EDA)
Prepare the Data for Machine Learning
Train a Classification Model
Evaluate the Model
Predict a New Movie’s Outcome

Without any further delay, let’s get started!

Step 1: Download the Dataset

To start the analysis, download the dataset, which is freely available online. You can also find it on Kaggle by searching for the project name. The code in the next steps expects the file to be saved as 'Data for repository.csv' in your working directory.

Step 2: Import Required Libraries

To begin your Bollywood Movie Analysis & Success Prediction project, first import the essential Python libraries. These will help you handle data, create visualisations, and build the prediction model.

Here’s the list of tools you’ll use:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Step 3: Load the Dataset

Begin by reading the movie dataset using pandas. This block loads the CSV file and checks if it's present in the correct directory.

If the file is missing, the script will stop to prevent further errors.

print("--- Loading Dataset ---")
try:
    # Load the dataset from the provided CSV file
    df = pd.read_csv('Data for repository.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'Data for repository.csv' not found. Please ensure the file is in the correct directory.")
    raise SystemExit(1)  # stop here: the remaining steps need df
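
Once the file loads, it can help to confirm that the columns used later in this walkthrough are actually present. The snippet below is a minimal sketch of such a check. The helper name missing_columns and the two-row sample frame are hypothetical and stand in for the real CSV; the column names are the ones the later steps refer to.

```python
import pandas as pd

# Columns that the later steps of this walkthrough rely on
REQUIRED_COLUMNS = ['Genre', 'Lead Star', 'Director', 'Release Period',
                    'Budget(INR)', 'Revenue(INR)']


def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the frame."""
    return [col for col in required if col not in frame.columns]


# Hypothetical sample standing in for the real dataset
sample = pd.DataFrame({
    'Genre': ['drama', 'comedy'],
    'Lead Star': ['Star A', 'Star B'],
    'Budget(INR)': [50_000_000, 120_000_000],
})

absent = missing_columns(sample)
print("Missing columns:", absent)
```

With the real data you would call missing_columns(df) and stop if the returned list is not empty.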

Step 4: Clean the Data and Create New Features

In this step, you’ll clean the dataset by removing incomplete or invalid entries and prepare it for analysis.

print("Original shape of the dataset:", df.shape)

# Drop rows with any missing values for simplicity
df.dropna(inplace=True)
print("Shape after dropping rows with missing values:", df.shape)

# Convert currency columns to numeric, coercing errors to NaN
df['Revenue(INR)'] = pd.to_numeric(df['Revenue(INR)'], errors='coerce')
df['Budget(INR)'] = pd.to_numeric(df['Budget(INR)'], errors='coerce')

# Drop rows where currency conversion failed
df.dropna(subset=['Revenue(INR)', 'Budget(INR)'], inplace=True)
print("Cleaned shape after handling data types:", df.shape)

# Define Movie Success: A movie is a 'Hit' (1) if revenue > budget, else a 'Flop' (0)
df['Success'] = (df['Revenue(INR)'] > df['Budget(INR)']).astype(int)

# Calculate Return on Investment (ROI) for analysis
df['ROI'] = (df['Revenue(INR)'] - df['Budget(INR)']) / df['Budget(INR)']
print("'Success' and 'ROI' columns created.")

Output:

Original shape of the dataset: (1698, 14)
Shape after dropping rows with missing values: (1698, 14)
Cleaned shape after handling data types: (1698, 14)
'Success' and 'ROI' columns created.
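
To see how the 'Success' label and ROI behave, here is a small worked example on made-up numbers. The three rows are hypothetical and only chosen to keep the arithmetic easy. The zero-budget guard is an assumption added for illustration and is not part of the step above: it turns a zero budget into NaN so the division does not produce infinity.

```python
import numpy as np
import pandas as pd

# Hypothetical figures in INR, not taken from the real dataset
toy = pd.DataFrame({
    'Budget(INR)':  [100_000_000, 200_000_000, 0],
    'Revenue(INR)': [250_000_000, 150_000_000, 10_000_000],
})

# Same labelling rule as in the step above: Hit (1) when revenue exceeds budget
toy['Success'] = (toy['Revenue(INR)'] > toy['Budget(INR)']).astype(int)

# Treat a zero budget as missing so the division yields NaN instead of infinity
safe_budget = toy['Budget(INR)'].replace(0, np.nan)
toy['ROI'] = (toy['Revenue(INR)'] - toy['Budget(INR)']) / safe_budget

print(toy)
```

The first movie earns 2.5 times its budget, so its ROI is 1.5. The second loses a quarter of its budget, so its ROI is -0.25 and it is labelled a flop.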

Step 5: Exploratory Data Analysis (EDA)

Now that your data is clean, use Exploratory Data Analysis (EDA) to understand trends and patterns.

print("\n--- Exploratory Data Analysis (EDA) ---")
sns.set_style('whitegrid')

# a. Overall Success Rate (Hit vs. Flop)
plt.figure(figsize=(6, 5))
sns.countplot(x='Success', data=df, palette='pastel')
plt.title('Overall Movie Success Rate (Hit vs. Flop)')
plt.xticks([0, 1], ['Flop', 'Hit'])
plt.ylabel('Number of Movies')
plt.savefig('success_rate.png')  # Save the plot as an image
print("Generated 'success_rate.png'")

# b. Success by Release Period
plt.figure(figsize=(8, 6))
sns.countplot(x='Release Period', hue='Success', data=df, palette='viridis')
plt.title('Success Rate by Release Period')
plt.legend(title='Success', labels=['Flop', 'Hit'])
plt.savefig('release_period_success.png')
print("Generated 'release_period_success.png'")

# c. Top 10 Most Profitable Genres by Average ROI
plt.figure(figsize=(12, 7))
genre_roi = df.groupby('Genre')['ROI'].mean().sort_values(ascending=False).head(10)
sns.barplot(x=genre_roi.values, y=genre_roi.index, palette='plasma')
plt.title('Top 10 Most Profitable Genres (by average ROI)')
plt.xlabel('Average Return on Investment (ROI)')
plt.ylabel('Genre')
plt.savefig('top_genres_roi.png')
print("Generated 'top_genres_roi.png'")

# d. Top 10 Lead Stars by Number of Hit Movies
plt.figure(figsize=(12, 7))
star_hits = df[df['Success'] == 1]['Lead Star'].value_counts().head(10)
sns.barplot(x=star_hits.values, y=star_hits.index, palette='magma')
plt.title('Top 10 Lead Stars by Number of Hit Movies')
plt.xlabel('Number of Hit Movies')
plt.ylabel('Lead Star')
plt.savefig('top_lead_stars.png')
print("Generated 'top_lead_stars.png'")

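Output: four charts saved as success_rate.png, release_period_success.png, top_genres_roi.png, and top_lead_stars.png.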
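
Count plots show how many hits and flops fall in each group, but groups of different sizes are hard to compare by eye. A hit-rate table can complement the charts. The sketch below uses a hypothetical six-row frame; 'Holiday' and 'Normal' are assumed values for 'Release Period', and on the real data you would run the same groupby on df.

```python
import pandas as pd

# Hypothetical rows standing in for the cleaned dataset
toy = pd.DataFrame({
    'Release Period': ['Holiday', 'Holiday', 'Normal', 'Normal', 'Normal', 'Holiday'],
    'Success':        [1, 0, 1, 0, 0, 1],
})

# The mean of a 0/1 label is the share of hits in each group
hit_rate = (
    toy.groupby('Release Period')['Success']
       .agg(movies='count', hit_rate='mean')
       .sort_values('hit_rate', ascending=False)
)
print(hit_rate)
```

Keeping the movie count next to the rate makes it easier to spot groups whose rate rests on only a handful of films.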

Step 6: Prepare the Data for Machine Learning

Before training the model, you need to separate the input features from the target variable and set up a preprocessing pipeline.

print("\n--- Feature Engineering and Preprocessing for ML ---")

# Define features (X) and target (y)
features = ['Genre', 'Lead Star', 'Director', 'Release Period', 'Budget(INR)']
target = 'Success'
X = df[features]
y = df[target]

# Identify categorical and numerical features
categorical_features = ['Genre', 'Lead Star', 'Director', 'Release Period']
numerical_features = ['Budget(INR)']

# Create a preprocessing pipeline
# OneHotEncoder handles categorical variables
# StandardScaler scales numerical variables
# handle_unknown='ignore' avoids errors from unseen categories in test data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

print("Preprocessing pipeline for numerical and categorical features created.")

Output:

Preprocessing pipeline for numerical and categorical features created.
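
To make the preprocessing step more concrete, the sketch below fits the same kind of ColumnTransformer on a tiny made-up frame and then transforms a row whose genre was never seen during fitting. The data and the to_dense helper are hypothetical and only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def to_dense(matrix):
    """ColumnTransformer may return a sparse matrix; convert it for printing."""
    return matrix.toarray() if hasattr(matrix, 'toarray') else np.asarray(matrix)


# Hypothetical training rows and one new row with an unseen genre
train = pd.DataFrame({
    'Genre': ['drama', 'comedy', 'action'],
    'Budget(INR)': [100_000_000, 200_000_000, 300_000_000],
})
new = pd.DataFrame({'Genre': ['thriller'], 'Budget(INR)': [200_000_000]})

pre = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['Budget(INR)']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Genre']),
])

train_dense = to_dense(pre.fit_transform(train))
new_dense = to_dense(pre.transform(new))

print(train_dense.shape)  # one scaled budget column plus one column per genre
print(new_dense)          # the unseen genre becomes all zeros
```

Because of handle_unknown='ignore', the unseen genre does not raise an error. It is encoded as zeros in every genre column, which means the model receives no genre information for that row.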

Step 7: Model Training

Now it's time to train the machine learning model to predict movie success.

Random Forest is chosen here because it is robust and works well on the mix of scaled numerical and one-hot encoded categorical features produced by the preprocessing step.

Here is the code for this step:

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Create the full model pipeline
# Includes preprocessing and RandomForestClassifier
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train the model
model_pipeline.fit(X_train, y_train)
print("Random Forest model trained successfully.")

Model Training Summary

Data split into training (80%) and testing (20%) using stratified sampling.
A pipeline was built with preprocessing (scaling + encoding) and a RandomForestClassifier.
The model was successfully trained on the training data.
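
The effect of stratify=y can be checked on a small example. The sketch below uses 100 hypothetical labels (70 hits, 30 flops) and shows that both the training and the test set keep the same 70% share of hits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 hypothetical labels: 70 hits (1) and 30 flops (0)
y = pd.Series([1] * 70 + [0] * 30)
X = pd.DataFrame({'row_id': range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print("Hit share in train:", y_tr.mean())
print("Hit share in test: ", y_te.mean())
```

Without stratification the two shares can drift apart, which makes test metrics harder to interpret when one class is much larger than the other.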

Step 8: Model Evaluation

After training the model, it's time to check how well it performs. In this step, we evaluate the Random Forest classifier using the test data.

Here is the code for this step:

# Make predictions on the test set
y_pred = model_pipeline.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2%}")

# Print the classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Flop', 'Hit']))

# Plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Flop', 'Hit'], yticklabels=['Flop', 'Hit'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.savefig('confusion_matrix.png')
print("Generated 'confusion_matrix.png'")

Output:

Model Accuracy: 87.65%

Classification Report:

              precision    recall  f1-score   support

        Flop       0.80      0.74      0.77        95

    accuracy                           0.88       340
   macro avg       0.85      0.83      0.84       340
weighted avg       0.87      0.88      0.87       340
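
The numbers in a classification report all come from the four cells of the confusion matrix. The sketch below works this out on ten hypothetical labels (0 = Flop, 1 = Hit); the label lists are made up and are not the model's real predictions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Ten hypothetical movies: 0 = Flop, 1 = Hit
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tn + tp) / (tn + fp + fn + tp)
flop_precision = tn / (tn + fn)  # of all predicted flops, how many were flops
flop_recall = tn / (tn + fp)     # of all actual flops, how many were found

print(accuracy, flop_precision, flop_recall)

# The hand-computed values agree with scikit-learn's helpers
print(accuracy_score(y_true, y_hat),
      precision_score(y_true, y_hat, pos_label=0),
      recall_score(y_true, y_hat, pos_label=0))
```

Reading the report this way shows why recall for the smaller class can be noticeably lower than overall accuracy.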

Step 9: Prediction on a New, Hypothetical Movie

In this step, we test the trained model on a completely new movie that wasn't part of the dataset.

Here is the code:

# Create a new movie's data as a DataFrame
# The column names MUST match the features used for training
new_movie_data = pd.DataFrame({
    'Genre': ['comedy'],
    'Lead Star': ['Akshay Kumar'],
    'Director': ['Rohit Shetty'],
    'Release Period': ['Holiday'],
    'Budget(INR)': [1000000000]  # 100 Crore Budget
})

print("Predicting success for the following new movie:")
print(new_movie_data)

# Use the trained pipeline to predict the outcome
prediction = model_pipeline.predict(new_movie_data)
prediction_proba = model_pipeline.predict_proba(new_movie_data)

# Print the result
predicted_class = 'Hit' if prediction[0] == 1 else 'Flop'
print(f"\nPredicted Outcome: {predicted_class}")
print(f"Prediction Confidence -> Flop: {prediction_proba[0][0]:.2%}, Hit: {prediction_proba[0][1]:.2%}")

Output:

Predicting success for the following new movie:
    Genre     Lead Star      Director  Release Period  Budget(INR)
0  comedy  Akshay Kumar  Rohit Shetty         Holiday   1000000000

Predicted Outcome: Flop
Prediction Confidence -> Flop: 96.00%, Hit: 4.00%
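
Because the encoder ignores unknown categories, a value that differs from the training data only in spelling or letter case is silently treated as unseen. Before trusting a prediction, it may be worth checking the new row against the training data. The helper below is a hypothetical sketch, and the two small frames are made up for illustration.

```python
import pandas as pd


def unseen_categories(train_frame, new_row, categorical_columns):
    """Map each categorical column to the new-row values absent from training data."""
    unseen = {}
    for col in categorical_columns:
        known = set(train_frame[col].unique())
        missing = [value for value in new_row[col] if value not in known]
        if missing:
            unseen[col] = missing
    return unseen


# Hypothetical training data and a new movie with a lowercase genre
train_frame = pd.DataFrame({
    'Genre': ['Comedy', 'Drama'],
    'Lead Star': ['Star A', 'Star B'],
})
new_row = pd.DataFrame({'Genre': ['comedy'], 'Lead Star': ['Star A']})

report = unseen_categories(train_frame, new_row, ['Genre', 'Lead Star'])
print(report)
```

In the walkthrough you could call unseen_categories(X_train, new_movie_data, categorical_features). An empty result means every categorical value was seen during training.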

Final Conclusion

This project built a machine learning model to predict whether a Bollywood movie will be a hit or a flop. We used genre, lead star, director, release period, and budget as features, and trained a Random Forest classifier inside a complete preprocessing pipeline. The model reached 87.65% accuracy on the 340-movie test set, and the classification report and confusion matrix gave a more detailed view of its performance. We also showed how the trained pipeline can predict the outcome of a new, hypothetical movie.

Colab Link:
https://colab.research.google.com/drive/1-FZNs_lPzynoFMbvKOF3HJXKrL3I5P9j?usp=sharing

Frequently Asked Questions (FAQs)

1. What data is needed to predict a Bollywood movie's success?

You need key features like movie genre, lead actor, director, release period (festival or non-festival), and the movie's budget. These help the model learn patterns linked to box office outcomes.

2. Which machine learning model is best for Bollywood movie success prediction?

A Random Forest classifier is a reasonable choice because, once the categorical features are one-hot encoded, it works on mixed categorical and numerical inputs, captures non-linear relationships, and is relatively resistant to overfitting on small to medium datasets.

3. How accurate is the prediction model?

The model’s accuracy depends on data quality and the features used. In this project, the Random Forest model reached 87.65% accuracy on the test data, and the classification report and confusion matrix show how that performance is split between hits and flops.
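
One way to put the 87.65% figure in context is to compare it with a baseline that always predicts the larger class. The short calculation below uses the support counts from the classification report in Step 8 (95 flops out of 340 test movies); the variable names are only illustrative.

```python
# Support counts taken from the classification report in Step 8
flop_support = 95
total_support = 340
hit_support = total_support - flop_support

# Accuracy of a model that always predicts the majority class
baseline_accuracy = max(hit_support, flop_support) / total_support
print(f"Majority-class baseline: {baseline_accuracy:.2%}")
```

The baseline comes out at roughly 72%, so the gain from the trained model is the gap between that and 87.65%, not the full 87.65%.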

4. Can this model predict success for upcoming or hypothetical movies?

Yes. Once trained, the model can predict outcomes for new movies as long as you provide the required input features (genre, star, director, release period, and budget).

5. How is categorical data like 'Genre' or 'Director' handled in the model?

These features are transformed with scikit-learn's OneHotEncoder inside the preprocessing pipeline. It turns each category into its own 0/1 column, so the model can use the feature without assuming any order among the categories.

Rohit Sharma

844 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
