Bollywood Movie Analysis and Success Prediction with Machine Learning
By Rohit Sharma
Updated on Aug 07, 2025 | 9 min read | 1.19K+ views
Bollywood is one of the largest film industries in the world. But what makes a movie a hit or a flop?
In this project, you’ll analyse real Bollywood movie data to uncover trends behind successful films. You'll explore genres, budgets, lead stars, and release periods to see how they impact performance.
Then you'll build a machine learning model to predict whether a new movie is likely to succeed based on key features.
Before you start working on the Bollywood Movie Analysis & Success Prediction project, make sure you're comfortable with these tools and concepts:
To analyse and predict Bollywood movie success, you’ll work with Python libraries designed for data handling, visualisation, and classification:
Tool / Library | Purpose |
Python | Core language for scripting and model building |
Pandas | Loads, cleans, and structures movie data |
NumPy | Supports numerical operations like ROI calculations |
Matplotlib / Seaborn | Creates visualisations such as bar charts and heatmaps |
Scikit-learn | Handles preprocessing, model training, and evaluation |
RandomForestClassifier | Predicts movie success using selected features |
Pipeline & ColumnTransformer | Automates scaling, encoding, and classification steps |
To get the most out of your Bollywood Movie Analysis & Success Prediction project, you'll apply these key data science techniques:
Time Required to Complete the Project: You can complete the Bollywood Movie Analysis & Success Prediction project in about 3 to 4 hours.
Let’s build this project from scratch with clear, step-by-step guidance:
Without any further delay, let’s get started!
To start the analysis, first download the dataset, which is freely available online. You can also find it on Kaggle by searching for the project name. The code below expects the file to be named 'Data for repository.csv' and to sit in your working directory.
To begin your Bollywood Movie Analysis & Success Prediction project, first import the essential Python libraries. These will help you handle data, create visualisations, and build the prediction model.
Here’s the list of tools you’ll use:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Begin by reading the movie dataset using pandas. This block loads the CSV file and checks if it's present in the correct directory.
If the file is missing, the script will stop to prevent further errors.
print("--- Loading Dataset ---")
try:
    # Load the dataset from the provided CSV file
    df = pd.read_csv('Data for repository.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'Data for repository.csv' not found. Please ensure the file is in the correct directory.")
    # Stop here so later steps do not fail on a missing DataFrame
    raise SystemExit(1)
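Once the file loads, it can help to confirm that the columns used later in this walkthrough are actually present. The sketch below is one possible check, not part of the original notebook: the helper name `missing_columns` and the two-row sample frame are hypothetical, and the column list is taken from the code in the later steps.

```python
import pandas as pd

# Columns that the later steps of this walkthrough read from the dataset.
REQUIRED_COLUMNS = [
    "Genre", "Lead Star", "Director",
    "Release Period", "Budget(INR)", "Revenue(INR)",
]


def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from frame."""
    return [col for col in required if col not in frame.columns]


# Hypothetical two-row frame standing in for the real CSV.
sample = pd.DataFrame({
    "Genre": ["drama", "comedy"],
    "Lead Star": ["Star A", "Star B"],
    "Budget(INR)": [50_000_000, 120_000_000],
})

missing = missing_columns(sample)
print("Missing columns:", missing)
```

With the real data you would call `missing_columns(df)` right after loading and stop early if the returned list is not empty.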
In this step, you’ll clean the dataset by removing incomplete or invalid entries and prepare it for analysis.
print("Original shape of the dataset:", df.shape)
# Drop rows with any missing values for simplicity
df.dropna(inplace=True)
print("Shape after dropping rows with missing values:", df.shape)
# Convert currency columns to numeric, coercing errors to NaN
df['Revenue(INR)'] = pd.to_numeric(df['Revenue(INR)'], errors='coerce')
df['Budget(INR)'] = pd.to_numeric(df['Budget(INR)'], errors='coerce')
# Drop rows where currency conversion failed
df.dropna(subset=['Revenue(INR)', 'Budget(INR)'], inplace=True)
print("Cleaned shape after handling data types:", df.shape)
# Define Movie Success: A movie is a 'Hit' (1) if revenue > budget, else a 'Flop' (0)
df['Success'] = (df['Revenue(INR)'] > df['Budget(INR)']).astype(int)
# Calculate Return on Investment (ROI) for analysis
df['ROI'] = (df['Revenue(INR)'] - df['Budget(INR)']) / df['Budget(INR)']
print("'Success' and 'ROI' columns created.")
Output:
Original shape of the dataset: (1698, 14)
Shape after dropping rows with missing values: (1698, 14)
Cleaned shape after handling data types: (1698, 14)
'Success' and 'ROI' columns created.
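To see what the cleaning and labelling rules do row by row, here is a minimal sketch on a made-up three-movie table. The movie names and figures are hypothetical; only the column names and the rules (coerce to numeric, drop failures, Hit if revenue exceeds budget, ROI as profit divided by budget) come from the step above.

```python
import pandas as pd

# Hypothetical rows; the figures are made up for illustration only.
toy = pd.DataFrame({
    "Movie": ["A", "B", "C"],
    "Budget(INR)": ["100000000", "200000000", "50000000"],
    "Revenue(INR)": ["250000000", "n/a", "40000000"],
})

# errors='coerce' turns unparseable entries such as "n/a" into NaN.
for col in ["Budget(INR)", "Revenue(INR)"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Rows where the conversion failed are dropped, as in the step above.
toy = toy.dropna(subset=["Budget(INR)", "Revenue(INR)"]).reset_index(drop=True)

# Hit (1) if revenue exceeds budget, otherwise Flop (0).
toy["Success"] = (toy["Revenue(INR)"] > toy["Budget(INR)"]).astype(int)
# ROI is profit divided by budget.
toy["ROI"] = (toy["Revenue(INR)"] - toy["Budget(INR)"]) / toy["Budget(INR)"]

print(toy[["Movie", "Success", "ROI"]])
```

Movie B disappears because its revenue cannot be parsed. Movie A earns 2.5 times its budget, so its ROI is 1.5 and it counts as a Hit, while movie C recovers only 80% of its budget, giving an ROI of -0.2 and a Flop label. Note that a budget of zero would make the ROI division undefined, so such rows would need separate handling.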
Now that your data is clean, use Exploratory Data Analysis (EDA) to understand trends and patterns.
print("\n--- Exploratory Data Analysis (EDA) ---")
sns.set_style('whitegrid')
# a. Overall Success Rate (Hit vs. Flop)
plt.figure(figsize=(6, 5))
sns.countplot(x='Success', data=df, palette='pastel')
plt.title('Overall Movie Success Rate (Hit vs. Flop)')
plt.xticks([0, 1], ['Flop', 'Hit'])
plt.ylabel('Number of Movies')
plt.savefig('success_rate.png') # Save the plot as an image
print("Generated 'success_rate.png'")
# b. Success by Release Period
plt.figure(figsize=(8, 6))
sns.countplot(x='Release Period', hue='Success', data=df, palette='viridis')
plt.title('Success Rate by Release Period')
plt.legend(title='Success', labels=['Flop', 'Hit'])
plt.savefig('release_period_success.png')
print("Generated 'release_period_success.png'")
# c. Top 10 Most Profitable Genres by Average ROI
plt.figure(figsize=(12, 7))
genre_roi = df.groupby('Genre')['ROI'].mean().sort_values(ascending=False).head(10)
sns.barplot(x=genre_roi.values, y=genre_roi.index, palette='plasma')
plt.title('Top 10 Most Profitable Genres (by average ROI)')
plt.xlabel('Average Return on Investment (ROI)')
plt.ylabel('Genre')
plt.savefig('top_genres_roi.png')
print("Generated 'top_genres_roi.png'")
# d. Top 10 Lead Stars by Number of Hit Movies
plt.figure(figsize=(12, 7))
star_hits = df[df['Success'] == 1]['Lead Star'].value_counts().head(10)
sns.barplot(x=star_hits.values, y=star_hits.index, palette='magma')
plt.title('Top 10 Lead Stars by Number of Hit Movies')
plt.xlabel('Number of Hit Movies')
plt.ylabel('Lead Star')
plt.savefig('top_lead_stars.png')
print("Generated 'top_lead_stars.png'")
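Output: four charts saved as 'success_rate.png', 'release_period_success.png', 'top_genres_roi.png', and 'top_lead_stars.png'.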
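Count plots show raw numbers, which can be hard to compare when one release period has far more movies than another. A hit rate per group is often easier to read. The sketch below is an optional addition on hypothetical data; the 'Normal' label is an assumed category name, while 'Holiday' appears later in this walkthrough.

```python
import pandas as pd

# Hypothetical data; 'Normal' is an assumed category label.
toy = pd.DataFrame({
    "Release Period": ["Holiday", "Holiday", "Normal", "Normal", "Normal", "Holiday"],
    "Success": [1, 0, 1, 0, 0, 1],
})

# Because Success is 0 or 1, its mean within a group is the share of hits.
summary = (
    toy.groupby("Release Period")["Success"]
       .agg(movies="count", hit_rate="mean")
       .sort_values("hit_rate", ascending=False)
)
print(summary)
```

On the real dataset the same `groupby` call on `df` would give one row per release period, and the `movies` column shows how many films each rate is based on.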
Before training the model, you need to separate the input features from the target variable and set up a preprocessing pipeline.
print("\n--- Feature Engineering and Preprocessing for ML ---")
# Define features (X) and target (y)
features = ['Genre', 'Lead Star', 'Director', 'Release Period', 'Budget(INR)']
target = 'Success'
X = df[features]
y = df[target]
# Identify categorical and numerical features
categorical_features = ['Genre', 'Lead Star', 'Director', 'Release Period']
numerical_features = ['Budget(INR)']
# Create a preprocessing pipeline
# OneHotEncoder handles categorical variables
# StandardScaler scales numerical variables
# handle_unknown='ignore' avoids errors from unseen categories in test data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])
print("Preprocessing pipeline for numerical and categorical features created.")
Output:
Preprocessing pipeline for numerical and categorical features created.
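To make the effect of `handle_unknown='ignore'` concrete, here is a small sketch with a toy preprocessor built the same way as above. The data and the name `toy_preprocessor` are hypothetical, and only two features are used to keep the output small.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training rows.
train = pd.DataFrame({
    "Genre": ["comedy", "drama", "comedy"],
    "Budget(INR)": [100.0, 200.0, 300.0],
})
# 'thriller' never appears in the training rows.
unseen = pd.DataFrame({"Genre": ["thriller"], "Budget(INR)": [200.0]})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["Budget(INR)"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Genre"]),
])


def to_dense(matrix):
    """Return a dense array whether or not the transformer output is sparse."""
    return matrix.toarray() if hasattr(matrix, "toarray") else matrix


train_dense = to_dense(toy_preprocessor.fit_transform(train))
unseen_dense = to_dense(toy_preprocessor.transform(unseen))

print(train_dense.shape)   # one scaled budget column plus one column per genre
print(unseen_dense)        # the unseen genre is encoded as all zeros
```

The unseen genre does not raise an error; it simply gets zeros in every genre column. The same happens in the full pipeline for a lead star or director the model has never seen, which means the model receives no information from that feature for such a movie.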
Now it's time to train the machine learning model to predict movie success.
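Random Forest is chosen here for its robustness and because it works well on a mix of scaled numerical features and one-hot-encoded categorical features. Note that scikit-learn's implementation needs categorical columns converted to numbers first, which the preprocessing pipeline takes care of.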
Here is the code for this step:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
# Create the full model pipeline
# Includes preprocessing and RandomForestClassifier
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Train the model
model_pipeline.fit(X_train, y_train)
print("Random Forest model trained successfully.")
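The `stratify=y` argument in the split keeps the share of hits and flops the same in the training and test sets, which matters when one class is much larger than the other. A minimal sketch on hypothetical labels (30 flops and 70 hits, not the real dataset) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 30 flops (0) and 70 hits (1).
labels = np.array([0] * 30 + [1] * 70)
rows = np.arange(len(labels)).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    rows, labels, test_size=0.2, random_state=42, stratify=labels
)

# With stratification both shares match the overall 70% hit rate.
print("Hit share in train:", y_tr.mean())
print("Hit share in test: ", y_te.mean())
```

Without `stratify`, a random 20% sample could end up with noticeably more or fewer hits than the full data, which would make the test scores less representative.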
Model Training Summary
After training the model, it's time to check how well it performs. In this step, we evaluate the Random Forest classifier using the test data.
Here is the code for this step:
# Make predictions on the test set
y_pred = model_pipeline.predict(X_test)
# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2%}")
# Print the classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Flop', 'Hit']))
# Plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Flop', 'Hit'], yticklabels=['Flop', 'Hit'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.savefig('confusion_matrix.png')
print("Generated 'confusion_matrix.png'")
Output:
Model Accuracy: 87.65%

Classification Report:
              precision    recall  f1-score   support

        Flop       0.80      0.74      0.77        95
         Hit       0.90      0.93      0.92       245

    accuracy                           0.88       340
   macro avg       0.85      0.83      0.84       340
weighted avg       0.87      0.88      0.87       340
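The numbers in the report all come from four counts in the confusion matrix. The original output does not print those counts, so the sketch below uses a matrix inferred from the rounded report values and the class supports (95 flops, 245 hits); treat it as an assumption that happens to reproduce the reported figures, not as the notebook's actual output.

```python
# Confusion matrix inferred from the rounded report (an assumption, not
# printed in the original output). First word = actual, second = predicted.
flop_as_flop, flop_as_hit = 70, 25    # 95 actual flops
hit_as_flop, hit_as_hit = 17, 228     # 245 actual hits

total = flop_as_flop + flop_as_hit + hit_as_flop + hit_as_hit
accuracy = (flop_as_flop + hit_as_hit) / total

# Precision: of everything predicted as a class, how much was right.
flop_precision = flop_as_flop / (flop_as_flop + hit_as_flop)
hit_precision = hit_as_hit / (hit_as_hit + flop_as_hit)

# Recall: of everything that truly belongs to a class, how much was found.
flop_recall = flop_as_flop / (flop_as_flop + flop_as_hit)
hit_recall = hit_as_hit / (hit_as_hit + hit_as_flop)

# Baseline: always predicting the majority class 'Hit'.
baseline = (hit_as_flop + hit_as_hit) / total

print(f"accuracy={accuracy:.4f}  baseline={baseline:.4f}")
print(f"Flop precision={flop_precision:.2f} recall={flop_recall:.2f}")
print(f"Hit  precision={hit_precision:.2f} recall={hit_recall:.2f}")
```

The baseline is worth keeping in mind: because 245 of the 340 test movies are hits, a model that always answers 'Hit' would already score about 72%, so the gain from the trained model is the difference between that and 87.65%.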
In this step, we test the trained model on a completely new movie that wasn't part of the dataset.
Here is the code:
# Create a new movie's data as a DataFrame
# The column names MUST match the features used for training
new_movie_data = pd.DataFrame({
    'Genre': ['comedy'],
    'Lead Star': ['Akshay Kumar'],
    'Director': ['Rohit Shetty'],
    'Release Period': ['Holiday'],
    'Budget(INR)': [1000000000]  # 100 Crore Budget
})
print("Predicting success for the following new movie:")
print(new_movie_data)
# Use the trained pipeline to predict the outcome
prediction = model_pipeline.predict(new_movie_data)
prediction_proba = model_pipeline.predict_proba(new_movie_data)
# Print the result
predicted_class = 'Hit' if prediction[0] == 1 else 'Flop'
print(f"\nPredicted Outcome: {predicted_class}")
print(f"Prediction Confidence -> Flop: {prediction_proba[0][0]:.2%}, Hit: {prediction_proba[0][1]:.2%}")
Output:
Predicting success for the following new movie:
    Genre     Lead Star      Director Release Period  Budget(INR)
0  comedy  Akshay Kumar  Rohit Shetty        Holiday   1000000000

Predicted Outcome: Flop
Prediction Confidence -> Flop: 96.00%, Hit: 4.00%
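The "confidence" printed above comes from `predict_proba`, which for a scikit-learn random forest is the average of the class probabilities predicted by the individual trees. It is therefore closer to a share of tree votes than to a calibrated probability. The sketch below checks this on synthetic data; the dataset and the names `X_demo`, `y_demo`, and `forest` are hypothetical and unrelated to the movie data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data, used only to inspect how the forest votes.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

sample = X_demo[:1]
forest_proba = forest.predict_proba(sample)

# Average the per-tree probabilities by hand.
tree_mean = np.mean(
    [tree.predict_proba(sample) for tree in forest.estimators_], axis=0
)

print("Forest predict_proba:", forest_proba)
print("Mean over trees:     ", tree_mean)
```

Read this way, a 96% Flop score from a 100-tree forest would be consistent with roughly 96 of the trees leaning towards Flop for that movie, which says how strongly the trees agree rather than how likely the film is to fail in reality.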
This project built a machine learning model to predict whether a Bollywood movie will be a hit or a flop. We used genre, lead star, director, release period, and budget as features, and trained a Random Forest classifier inside a complete preprocessing pipeline. On the 340-movie test set the model reached 87.65% accuracy, with stronger results for hits (F1 score 0.92) than for flops (F1 score 0.77), as shown in the classification report and confusion matrix. We also demonstrated how it can predict the outcome of a new hypothetical movie.
Colab Link:
https://colab.research.google.com/drive/1-FZNs_lPzynoFMbvKOF3HJXKrL3I5P9j?usp=sharing
Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...