Bollywood Movie Analysis and Success Prediction with Machine Learning

By Rohit Sharma

Updated on Aug 07, 2025


Bollywood is one of the largest film industries in the world. But what makes a movie a hit or a flop?

In this project, you’ll analyse real Bollywood movie data to uncover trends behind successful films. You'll explore genres, budgets, lead stars, and release periods to see how they impact performance. 

Then you'll build a machine learning model to predict whether a new movie is likely to succeed based on key features.

 


Essential Project Requirements

Before you start working on the Bollywood movie analysis, make sure you're comfortable with these tools and concepts:

  • Python programming (You’ll use Python throughout for data processing, visualisation, and modelling.)
  • Pandas and NumPy (These libraries help you clean, explore, and structure the dataset for analysis.)
  • Matplotlib or Seaborn (You’ll use these tools to create count plots, bar charts, and a confusion-matrix heatmap.)
  • Scikit‑learn basics (You’ll rely on this for building a machine learning pipeline and training a Random Forest classifier)
  • Data preprocessing (Knowing how to scale features and encode categorical variables is key to building a working model)
  • Model evaluation (You’ll use metrics like accuracy, classification report, and confusion matrix to evaluate your results)


Behind the Scenes: How To Do Bollywood Movie Analysis

To analyse and predict Bollywood movie success, you’ll work with Python libraries designed for data handling, visualisation, and classification:

Tool / Library | Purpose
Python | Core language for scripting and model building
Pandas | Loads, cleans, and structures movie data
NumPy | Supports numerical operations like ROI calculations
Matplotlib / Seaborn | Creates visualisations such as bar charts and heatmaps
Scikit-learn | Handles preprocessing, model training, and evaluation
RandomForestClassifier | Predicts movie success using selected features
Pipeline & ColumnTransformer | Automates scaling, encoding, and classification steps


Smart Insights: Techniques That Power Bollywood Movie Analysis

To get the most out of your Bollywood Movie Analysis & Success Prediction project, you'll apply these key data science techniques:

  • Exploratory Data Analysis (EDA):
    Explore patterns in movie success, including hit rates, ROI by genre, and top-performing stars.
  • Feature Engineering:
    Create new features like ROI and define success labels to prepare data for modelling.
  • Data Visualisation:
    Use bar charts and count plots to show trends in genres, lead actors, release periods, and revenue.
  • Data Preprocessing:
    Encode categorical data and scale numerical values to prepare inputs for the machine learning model.
  • Classification Modelling:
    Train a Random Forest classifier to predict whether a movie will be a hit or a flop based on its features.
  • Model Evaluation:
    Use accuracy, classification report, and a confusion matrix to check how well the model performs.


Time Required to Complete the Project: You can complete the Bollywood Movie Analysis & Success Prediction project in about 3 to 4 hours.

Let’s build this project from scratch with clear, step-by-step guidance:

  1. Download the Dataset
  2. Import Required Libraries
  3. Load the Dataset
  4. Clean the Data and Create New Features
  5. Explore the Data (EDA)
  6. Prepare the Data for Machine Learning
  7. Train a Classification Model
  8. Evaluate the Model
  9. Predict a New Movie’s Outcome

Without any further delay, let’s get started!


Step 1: Download the Dataset

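To start the analysis, download the dataset, which is freely available online; searching Kaggle for the project name is one way to find it. The code in this walkthrough expects a CSV file named 'Data for repository.csv' with columns such as Genre, Lead Star, Director, Release Period, Budget(INR), and Revenue(INR).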

Step 2: Import Required Libraries

To begin your Bollywood Movie Analysis & Success Prediction project, first import the essential Python libraries. These will help you handle data, create visualisations, and build the prediction model.

Here’s the list of tools you’ll use:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


Step 3: Load the Dataset

Begin by reading the movie dataset using pandas. This block loads the CSV file and checks if it's present in the correct directory.

If the file is missing, the script will stop to prevent further errors.

print("--- Loading Dataset ---")
try:
    # Load the dataset from the provided CSV file
    df = pd.read_csv('Data for repository.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'Data for repository.csv' not found. Please ensure the file is in the correct directory.")
    raise SystemExit(1)  # stop here; every later step needs df
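
Before cleaning, it usually helps to look at each column's data type and missing-value count. The sketch below is one way to do that; it uses a tiny made-up DataFrame (the rows and the helper name summarise are hypothetical, not part of the dataset or the original code) so that it runs without the CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; in the project, df comes from pd.read_csv above
sample = pd.DataFrame({
    'Genre': ['drama', 'comedy', None],
    'Budget(INR)': [50_000_000, np.nan, 120_000_000],
    'Revenue(INR)': [80_000_000, 30_000_000, 90_000_000],
})

def summarise(frame):
    """Return one row per column with its dtype and number of missing values."""
    return pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'missing': frame.isnull().sum(),
    })

summary = summarise(sample)
print(sample.head())
print(summary)
```

With the real data, calling summarise(df) right after loading shows which columns would lose rows when missing values are dropped.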


Step 4: Clean the Data and Create New Features

In this step, you’ll clean the dataset by removing incomplete or invalid entries and prepare it for analysis.

print("Original shape of the dataset:", df.shape)

# Drop rows with any missing values for simplicity
df.dropna(inplace=True)
print("Shape after dropping rows with missing values:", df.shape)

# Convert currency columns to numeric, coercing errors to NaN
df['Revenue(INR)'] = pd.to_numeric(df['Revenue(INR)'], errors='coerce')
df['Budget(INR)'] = pd.to_numeric(df['Budget(INR)'], errors='coerce')

# Drop rows where currency conversion failed
df.dropna(subset=['Revenue(INR)', 'Budget(INR)'], inplace=True)
print("Cleaned shape after handling data types:", df.shape)

# Define Movie Success: A movie is a 'Hit' (1) if revenue > budget, else a 'Flop' (0)
df['Success'] = (df['Revenue(INR)'] > df['Budget(INR)']).astype(int)

# Calculate Return on Investment (ROI) for analysis
df['ROI'] = (df['Revenue(INR)'] - df['Budget(INR)']) / df['Budget(INR)']
print("'Success' and 'ROI' columns created.")

Output:

Original shape of the dataset: (1698, 14)

Shape after dropping rows with missing values: (1698, 14)

Cleaned shape after handling data types: (1698, 14)

'Success' and 'ROI' columns created.
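
To make the cleaning logic concrete, here is a small worked example on made-up numbers (the three rows are hypothetical). It follows the same steps as the code above: unparseable values become NaN, those rows are dropped, and the label and ROI are computed from what remains.

```python
import pandas as pd

# Hypothetical rows: one budget is stored as text that cannot be parsed
toy = pd.DataFrame({
    'Budget(INR)': ['100', 'unknown', '200'],
    'Revenue(INR)': ['150', '300', '100'],
})

for col in ['Budget(INR)', 'Revenue(INR)']:
    # errors='coerce' turns 'unknown' into NaN instead of raising an error
    toy[col] = pd.to_numeric(toy[col], errors='coerce')

# Remove the row whose budget could not be converted
toy = toy.dropna(subset=['Budget(INR)', 'Revenue(INR)']).copy()

# Same definitions as in the article: hit if revenue exceeds budget
toy['Success'] = (toy['Revenue(INR)'] > toy['Budget(INR)']).astype(int)
toy['ROI'] = (toy['Revenue(INR)'] - toy['Budget(INR)']) / toy['Budget(INR)']
print(toy)
```

The first row earns 150 on a budget of 100, so it is labelled a hit with an ROI of 0.5 (a 50% return); the last row earns 100 on a budget of 200, so it is a flop with an ROI of -0.5. Note that a budget of zero would produce an infinite or undefined ROI, so it may be worth checking for such rows in the real data.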


Step 5: Exploratory Data Analysis (EDA)

Now that your data is clean, use Exploratory Data Analysis (EDA) to understand trends and patterns.

print("\n--- Exploratory Data Analysis (EDA) ---")
sns.set_style('whitegrid')

# a. Overall Success Rate (Hit vs. Flop)
plt.figure(figsize=(6, 5))
sns.countplot(x='Success', data=df, palette='pastel')
plt.title('Overall Movie Success Rate (Hit vs. Flop)')
plt.xticks([0, 1], ['Flop', 'Hit'])
plt.ylabel('Number of Movies')
plt.savefig('success_rate.png')  # Save the plot as an image
print("Generated 'success_rate.png'")

# b. Success by Release Period
plt.figure(figsize=(8, 6))
sns.countplot(x='Release Period', hue='Success', data=df, palette='viridis')
plt.title('Success Rate by Release Period')
plt.legend(title='Success', labels=['Flop', 'Hit'])
plt.savefig('release_period_success.png')
print("Generated 'release_period_success.png'")

# c. Top 10 Most Profitable Genres by Average ROI
plt.figure(figsize=(12, 7))
genre_roi = df.groupby('Genre')['ROI'].mean().sort_values(ascending=False).head(10)
sns.barplot(x=genre_roi.values, y=genre_roi.index, palette='plasma')
plt.title('Top 10 Most Profitable Genres (by average ROI)')
plt.xlabel('Average Return on Investment (ROI)')
plt.ylabel('Genre')
plt.savefig('top_genres_roi.png')
print("Generated 'top_genres_roi.png'")

# d. Top 10 Lead Stars by Number of Hit Movies
plt.figure(figsize=(12, 7))
star_hits = df[df['Success'] == 1]['Lead Star'].value_counts().head(10)
sns.barplot(x=star_hits.values, y=star_hits.index, palette='magma')
plt.title('Top 10 Lead Stars by Number of Hit Movies')
plt.xlabel('Number of Hit Movies')
plt.ylabel('Lead Star')
plt.savefig('top_lead_stars.png')
print("Generated 'top_lead_stars.png'")

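Output:

[Figures: 'success_rate.png' (hit vs. flop counts), 'release_period_success.png' (hits and flops by release period), 'top_genres_roi.png' (top 10 genres by average ROI), 'top_lead_stars.png' (top 10 lead stars by number of hits)]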
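
Count plots show raw numbers, which can be hard to compare when groups differ in size. A hit rate per group is a useful companion view. The sketch below computes it on made-up rows (the data is hypothetical; the column names follow the dataset used here).

```python
import pandas as pd

# Hypothetical rows; column names follow the article's dataset
toy = pd.DataFrame({
    'Release Period': ['Holiday', 'Holiday', 'Normal', 'Normal', 'Normal', 'Holiday'],
    'Success': [1, 1, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the share of hits within each group
hit_rate = (
    toy.groupby('Release Period')['Success']
       .agg(movies='count', hit_rate='mean')
       .sort_values('hit_rate', ascending=False)
)
print(hit_rate)
```

The same pattern should work for 'Genre' or 'Lead Star' on the real df; the movies column helps you spot groups with too few films for the rate to mean much.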


Step 6: Prepare the Data for Machine Learning

Before training the model, you need to separate the input features from the target variable and set up a preprocessing pipeline.

print("\n--- Feature Engineering and Preprocessing for ML ---")

# Define features (X) and target (y)
features = ['Genre', 'Lead Star', 'Director', 'Release Period', 'Budget(INR)']
target = 'Success'
X = df[features]
y = df[target]

# Identify categorical and numerical features
categorical_features = ['Genre', 'Lead Star', 'Director', 'Release Period']
numerical_features = ['Budget(INR)']

# Create a preprocessing pipeline
# OneHotEncoder handles categorical variables
# StandardScaler scales numerical variables
# handle_unknown='ignore' avoids errors from unseen categories in test data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

print("Preprocessing pipeline for numerical and categorical features created.")

Output: 

Preprocessing pipeline for numerical and categorical features created.
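
To see what this preprocessor actually produces, the sketch below fits the same kind of ColumnTransformer on three made-up rows and then transforms a row whose genre was never seen. The data and the helper name to_dense are hypothetical; only the transformer setup mirrors the code above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training rows with one categorical and one numeric column
train = pd.DataFrame({
    'Genre': ['drama', 'comedy', 'drama'],
    'Budget(INR)': [10.0, 20.0, 30.0],
})
# 'thriller' never appears in the training rows
new = pd.DataFrame({'Genre': ['thriller'], 'Budget(INR)': [20.0]})

pre = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['Budget(INR)']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Genre']),
])

def to_dense(matrix):
    """Convert a SciPy sparse matrix to a NumPy array; pass arrays through."""
    return matrix.toarray() if hasattr(matrix, 'toarray') else matrix

dense_train = to_dense(pre.fit_transform(train))
dense_new = to_dense(pre.transform(new))

print(dense_train)  # one scaled budget column + one column per known genre
print(dense_new)    # unseen genre -> all genre columns are 0
```

The training matrix has three columns: the scaled budget plus one indicator each for 'comedy' and 'drama'. For the new row, the budget equals the training mean, so it scales to 0, and the unknown genre is encoded as zeros rather than raising an error.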


Step 7: Model Training

Now it's time to train the machine learning model to predict movie success.

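Random Forest is chosen here for its robustness on mixed tabular data. Scikit-learn's implementation needs numeric inputs, so the pipeline one-hot encodes the categorical columns before they reach the classifier.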

Here is the code for this step:

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Create the full model pipeline
# Includes preprocessing and RandomForestClassifier
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train the model
model_pipeline.fit(X_train, y_train)
print("Random Forest model trained successfully.")

Model Training Summary

  • Data split into training (80%) and testing (20%) using stratified sampling.
  • A pipeline was built with preprocessing (scaling + encoding) and a RandomForestClassifier.
  • The model was successfully trained on the training data.
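
The effect of stratified sampling is easy to check on a small example. The sketch below uses made-up labels (70 hits and 30 flops, a hypothetical ratio) and the same train_test_split arguments as the code above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels: 70 hits (1) and 30 flops (0)
y = pd.Series([1] * 70 + [0] * 30)
X = pd.DataFrame({'row_id': range(len(y))})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# With stratify=y, each split keeps the overall hit share
print("Hit share overall:", y.mean())
print("Hit share in train:", y_train.mean())
print("Hit share in test:", y_test.mean())
```

All three shares come out at 0.7 here. Without stratify, the test set could by chance contain noticeably more or fewer flops than the full dataset, which would make the evaluation less representative.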


Step 8: Model Evaluation

After training the model, it's time to check how well it performs. In this step, we evaluate the Random Forest classifier using the test data.

Here is the code for this step:

# Make predictions on the test set
y_pred = model_pipeline.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2%}")

# Print the classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Flop', 'Hit']))

# Plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Flop', 'Hit'], yticklabels=['Flop', 'Hit'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.savefig('confusion_matrix.png')
print("Generated 'confusion_matrix.png'")

Output:

Model Accuracy: 87.65%

Classification Report:

              precision    recall  f1-score   support

        Flop       0.80      0.74      0.77        95
         Hit       0.90      0.93      0.92       245

    accuracy                           0.88       340
   macro avg       0.85      0.83      0.84       340
weighted avg       0.87      0.88      0.87       340
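
Because the test set contains more hits than flops, accuracy is easier to interpret next to a simple baseline. The sketch below rebuilds only the test labels from the support column of the report (95 flops, 245 hits) and scores a classifier that always predicts the majority class. Fitting the baseline on these same labels is a simplification for illustration; in the project you would fit it on y_train.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Test-set class counts taken from the 'support' column of the report above
n_flop, n_hit = 95, 245
y_test_labels = np.array([0] * n_flop + [1] * n_hit)
X_dummy = np.zeros((len(y_test_labels), 1))  # the baseline ignores features

# Illustration only: fitted and scored on the same labels
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_dummy, y_test_labels)
baseline_accuracy = accuracy_score(y_test_labels, baseline.predict(X_dummy))

print(f"Majority-class baseline accuracy: {baseline_accuracy:.2%}")
```

Always predicting 'Hit' would be right for 245 of 340 movies, about 72.06%, so the Random Forest's 87.65% is a clear improvement over guessing the majority class. The lower recall for flops (0.74) shows where most of the remaining errors are.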


Step 9: Prediction on a New, Hypothetical Movie

In this step, we test the trained model on a completely new movie that wasn't part of the dataset.

Here is the code:

# Create a new movie's data as a DataFrame
# The column names MUST match the features used for training
new_movie_data = pd.DataFrame({
    'Genre': ['comedy'],
    'Lead Star': ['Akshay Kumar'],
    'Director': ['Rohit Shetty'],
    'Release Period': ['Holiday'],
    'Budget(INR)': [1000000000]  # 100 Crore Budget
})

print("Predicting success for the following new movie:")
print(new_movie_data)

# Use the trained pipeline to predict the outcome
prediction = model_pipeline.predict(new_movie_data)
prediction_proba = model_pipeline.predict_proba(new_movie_data)

# Print the result
predicted_class = 'Hit' if prediction[0] == 1 else 'Flop'
print(f"\nPredicted Outcome: {predicted_class}")
print(f"Prediction Confidence -> Flop: {prediction_proba[0][0]:.2%}, Hit: {prediction_proba[0][1]:.2%}")

Output:

Predicting success for the following new movie:
    Genre     Lead Star      Director  Release Period  Budget(INR)
0  comedy  Akshay Kumar  Rohit Shetty         Holiday   1000000000

Predicted Outcome: Flop
Prediction Confidence -> Flop: 96.00%, Hit: 4.00%
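
Because the encoder was created with handle_unknown='ignore', any value that never appeared in the training data (including a different spelling or capitalisation) is silently encoded as all zeros, and the prediction then rests on the remaining features. A quick check before predicting can flag this. The sketch below is one possible approach; the helper name unseen_categories and the sample rows are hypothetical.

```python
import pandas as pd

def unseen_categories(train_df, new_df, columns):
    """Map each column to the values in new_df that never occur in train_df."""
    report = {}
    for col in columns:
        known = set(train_df[col].unique())
        missing = [value for value in new_df[col] if value not in known]
        if missing:
            report[col] = missing
    return report

# Hypothetical training rows and a new movie to score
train_df = pd.DataFrame({
    'Genre': ['comedy', 'drama'],
    'Lead Star': ['Star A', 'Star B'],
})
new_movie = pd.DataFrame({'Genre': ['Comedy'], 'Lead Star': ['Star A']})

result = unseen_categories(train_df, new_movie, ['Genre', 'Lead Star'])
print(result)
```

Here 'Comedy' is reported because the training rows only contain the lowercase 'comedy'. In the project, the equivalent call would be unseen_categories(X_train, new_movie_data, categorical_features); an empty result means every categorical value was seen during training.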


Final Conclusion

This project developed a machine learning model to predict whether a Bollywood movie will be a hit or a flop. We used genre, lead star, director, release period, and budget as features, and trained a Random Forest classifier inside a complete preprocessing pipeline. On the 340-movie test set the model reached 87.65% accuracy, with higher recall for hits (0.93) than for flops (0.74), as shown in the classification report and confusion matrix. We also demonstrated how the pipeline can predict the outcome of a new, hypothetical movie.


Colab Link:
https://colab.research.google.com/drive/1-FZNs_lPzynoFMbvKOF3HJXKrL3I5P9j?usp=sharing


Rohit Sharma

827 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
