Startup Funding Analysis and Prediction: A Machine Learning Project

Updated on Aug 11, 2025 | 8 min read | 1.67K+ views

In India, the startup ecosystem has seen rapid expansion, with a surge in funding activity across various sectors.

This project focuses on Startup Funding Analysis and Prediction: we explore patterns in funding data, clean and preprocess the dataset, and train a model that predicts funding amounts. The aim is to uncover what influences investment decisions and how funding trends evolve.

Before We Dive In, What Should You Know?

Before you begin working on the Startup Funding Analysis and Prediction project, make sure you're comfortable with the following tools and concepts:

Python programming (You'll use Python for data loading, cleaning, exploration, and model development)
Pandas and NumPy (These libraries are essential for handling structured data, performing transformations, and generating insights)
Matplotlib or Seaborn (You'll use these for data visualisation: bar charts, correlation heatmaps, and trend analysis plots)
Scikit‑learn basics (Required for building preprocessing pipelines and training regression models like Random Forest)
Data preprocessing (You need to understand how to handle categorical and numerical features using encoders and scalers)
Model evaluation (Familiarity with metrics like R² score, RMSE, and MSE is necessary to interpret model performance effectively)

Techniques You'll Use for Learning

To get the most out of your Startup Funding Analysis and Prediction project, you’ll apply the following data science techniques:

Exploratory Data Analysis (EDA):

Analyse patterns in startup funding, identify top industries, cities, and investor trends over the years.

Feature Engineering:

Extract and clean information such as funding amounts, startup stages, and industry sectors to create relevant features.

Data Visualisation:

Use bar plots, trend lines, and heatmaps to visualise funding distribution, investor frequency, and sector-wise growth.

Regression Modelling:

Train a Random Forest Regressor to predict funding amounts based on startup attributes and funding rounds.

Time Required to Complete the Project: You can complete the Startup Funding Analysis and Prediction project in approximately 3 to 4 hours, depending on your familiarity with data preprocessing, visualisation, and model building in Python.

Let’s build this Startup Funding Analysis and Prediction project from scratch with clear, step-by-step guidance:

Step No.	Step Title	Purpose in the Project
1	Download the Dataset	Get the startup funding CSV file (available on Kaggle).
2	Import Required Libraries	Load pandas, NumPy, Matplotlib, Seaborn, and scikit-learn.
3	Load the Dataset	Read startup_funding.csv into a DataFrame for processing and analysis.
4	Clean and Prepare the Data	Standardise column names, convert amounts and dates, drop rows with missing key values, and normalise city, sector, and investment-type labels.
5	Explore the Data (EDA)	Analyse and visualise funding by year, city, sector, and investment type.
6	Build the Prediction Model	One-hot encode the categorical features and train a Random Forest Regressor on log-transformed funding amounts.
7	Evaluate the Model	Measure performance on the test set with R² and RMSE.
8	Predict for a New Startup	Estimate the funding amount for a hypothetical startup.

Alright, let's dive in!

Step 1: Download the Dataset

To begin your Startup Funding Analysis and Prediction project, you'll first need to download the dataset. It's freely available online, and you can find it on Kaggle by searching for a startup funding dataset. Save it as startup_funding.csv, the file name used in the loading step.

Step 2: Import Required Libraries

The first coding step is to import the Python packages used for data analysis and modelling.

The imports fall into four groups: pandas and NumPy for data handling, Matplotlib and Seaborn for plotting, scikit-learn for splitting the data, encoding features, building the pipeline, training the Random Forest, and computing metrics, and warnings to silence library warnings:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
warnings.filterwarnings('ignore')
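
Because the modelling step passes sparse_output=False to OneHotEncoder, and that parameter was introduced in scikit-learn 1.2, it can be worth confirming the installed version before going further. The helper below is a minimal sketch; supports_sparse_output is a hypothetical name, not part of scikit-learn.

```python
import sklearn


def supports_sparse_output(version: str) -> bool:
    """Return True if a scikit-learn version string is 1.2 or newer."""
    # Only the major and minor parts matter for this check.
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (1, 2)


print(
    "scikit-learn", sklearn.__version__,
    "| sparse_output supported:", supports_sparse_output(sklearn.__version__),
)
```

On older versions, upgrading scikit-learn is the simplest fix.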

Step 3: Load the Dataset

Now that we’ve imported all the required libraries, the next step is to load the dataset into memory. We’ll be working with a CSV file named startup_funding.csv.

If the file is missing, the script will stop to prevent further errors.

print("--- Loading Dataset ---")
try:
    df = pd.read_csv('startup_funding.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'startup_funding.csv' not found. Please ensure the file is in the correct directory.")
    exit()
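
Once the file is loaded, a quick count of missing values per column shows where cleaning is needed. The sketch below uses a small invented DataFrame as a stand-in for the real file, and missing_summary is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loaded data; the values are invented for illustration.
sample = pd.DataFrame({
    "CityLocation": ["Bengaluru", None, "Mumbai", "Pune"],
    "IndustryVertical": ["Fintech", "ECommerce", None, None],
    "AmountInUSD": ["1,000,000", "250,000", np.nan, "undisclosed"],
})


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


report = missing_summary(sample)
print(report)
```

Note that a text entry such as "undisclosed" is not counted as missing at this point. It only becomes NaN once the column is converted to numbers during cleaning.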

Step 4: Data Cleaning and Preparation

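Before we move on to analysis or modelling, we need to clean and prepare the data. Real-world data often comes with inconsistencies, missing values, and formatting issues. Here we standardise the column names, convert funding amounts and dates to proper types, drop rows with missing key fields, and normalise the city, sector, and investment-type labels.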

# Standardize column names
df.columns = ['SrNo', 'Date', 'StartupName', 'IndustryVertical', 'SubVertical',
              'CityLocation', 'InvestorsName', 'InvestmentType', 'AmountInUSD', 'Remarks']

# --- Clean 'AmountInUSD' column ---
df['AmountInUSD'] = df['AmountInUSD'].astype(str).str.replace(",", "", regex=False)
df['AmountInUSD'] = pd.to_numeric(df['AmountInUSD'], errors='coerce')

# --- Clean 'Date' column ---
df['Date'] = df['Date'].str.replace('05/072018', '05/07/2018')
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# Drop rows where key data is missing
df.dropna(subset=['Date', 'AmountInUSD', 'IndustryVertical', 'CityLocation', 'InvestmentType'], inplace=True)

# Extract Year and Month for trend analysis
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month

# --- Clean Categorical Columns ---
df['CityLocation'] = df['CityLocation'].str.title().replace({'Delhi': 'New Delhi', 'Bangalore': 'Bengaluru'})
df['IndustryVertical'] = df['IndustryVertical'].str.title().replace({
    'E-Commerce': 'ECommerce',
    'E-Commerce & M-Commerce': 'ECommerce',
    'Ecommerce': 'ECommerce'
})
df['InvestmentType'] = df['InvestmentType'].str.title().replace({
    'Seed/ Angel Funding': 'Seed Funding',
    'Seed / Angel Funding': 'Seed Funding',
    'Seed/Angel Funding': 'Seed Funding',
    'Angel / Seed Funding': 'Seed Funding',
    'Privateequity': 'Private Equity'
})

print("Data cleaning and preparation complete.")
print("Cleaned DataFrame shape:", df.shape)

Output:

Data cleaning and preparation complete.
Cleaned DataFrame shape: (836, 12)
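
To see what each cleaning rule does, it can help to run the same ideas on a few invented rows. This is a sketch, not the article's exact code: it assumes the dates are day-first (dd/mm/yyyy) and passes an explicit format, which the cleaning code above does not do.

```python
import pandas as pd

# Invented rows that mimic the kinds of problems described above.
raw = pd.DataFrame({
    "Date": ["05/072018", "13/01/2017", "not a date"],
    "AmountInUSD": ["1,300,000", "undisclosed", "50000"],
    "CityLocation": ["bangalore", "Delhi", "mumbai"],
})

# Strip thousands separators; text that is not a number becomes NaN.
amounts = pd.to_numeric(
    raw["AmountInUSD"].astype(str).str.replace(",", "", regex=False),
    errors="coerce",
)

# Repair the known typo, then parse with an explicit day-first format
# (assumption: the dataset stores dates as dd/mm/yyyy).
dates = pd.to_datetime(
    raw["Date"].str.replace("05/072018", "05/07/2018", regex=False),
    format="%d/%m/%Y",
    errors="coerce",
)

# Title-case first so that one mapping covers every capitalisation.
cities = raw["CityLocation"].str.title().replace(
    {"Delhi": "New Delhi", "Bangalore": "Bengaluru"}
)

print(pd.DataFrame({"Date": dates, "AmountInUSD": amounts, "CityLocation": cities}))
```

Rows that end up with NaN or NaT in a key column are the ones dropna removes, so the parsing choices directly affect how many rows survive cleaning.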

Step 5: Exploratory Data Analysis (EDA)

Now that your data is clean, use Exploratory Data Analysis (EDA) to understand trends and patterns.

sns.set_style("whitegrid")

# a. Year-on-Year Funding Trend
plt.figure(figsize=(12, 6))
# Convert totals from USD to billions of USD so the values match the axis label
yearly_funding = df.groupby('Year')['AmountInUSD'].sum().sort_index() / 1e9
sns.lineplot(x=yearly_funding.index, y=yearly_funding.values, marker='o')
plt.title('Total Startup Funding in India (Year-on-Year)')
plt.xlabel('Year')
plt.ylabel('Total Funding (in Billion USD)')
plt.xticks(yearly_funding.index.astype(int))
plt.savefig('yearly_funding_trend.png')
print("Generated 'yearly_funding_trend.png'")

# b. Top 10 Startup Hubs by Number of Deals
plt.figure(figsize=(12, 7))
top_cities = df['CityLocation'].value_counts().head(10)
sns.barplot(x=top_cities.values, y=top_cities.index, palette='viridis')
plt.title('Top 10 Startup Hubs in India (by Number of Funding Deals)')
plt.xlabel('Number of Deals')
plt.ylabel('City')
plt.savefig('top_startup_hubs.png')
print("Generated 'top_startup_hubs.png'")

# c. Top 10 Most Funded Sectors
plt.figure(figsize=(12, 7))
# Convert totals from USD to billions of USD so the values match the axis label
top_sectors = df.groupby('IndustryVertical')['AmountInUSD'].sum().sort_values(ascending=False).head(10) / 1e9
sns.barplot(x=top_sectors.values, y=top_sectors.index, palette='plasma')
plt.title('Top 10 Most Funded Sectors in India')
plt.xlabel('Total Funding (in Billion USD)')
plt.ylabel('Sector')
plt.savefig('top_funded_sectors.png')
print("Generated 'top_funded_sectors.png'")

# d. Most Common Investment Types
plt.figure(figsize=(10, 8))
investment_types = df['InvestmentType'].value_counts().head(5)
plt.pie(investment_types, labels=investment_types.index, autopct='%1.1f%%', startangle=140, colors=sns.color_palette('pastel'))
plt.title('Distribution of Top 5 Investment Types')
plt.ylabel('')
plt.savefig('investment_types_distribution.png')
print("Generated 'investment_types_distribution.png'")

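Output:

Generated 'yearly_funding_trend.png'
Generated 'top_startup_hubs.png'
Generated 'top_funded_sectors.png'
Generated 'investment_types_distribution.png'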
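
The charts in this step rest on two aggregations: a grouped sum for funding totals and value_counts for deal counts. The sketch below shows both on a handful of invented deals, without plotting, so the numbers behind a chart can be checked directly.

```python
import pandas as pd

# Invented deals, used only to show the aggregation pattern.
deals = pd.DataFrame({
    "Year": [2015, 2015, 2016, 2016, 2017],
    "CityLocation": ["Bengaluru", "Mumbai", "Bengaluru", "Bengaluru", "Pune"],
    "AmountInUSD": [2.0e9, 5.0e8, 1.5e9, 1.0e9, 3.0e8],
})

# Total funding per year, converted from USD to billions of USD.
yearly_billion = deals.groupby("Year")["AmountInUSD"].sum().div(1e9)

# Number of deals per city, most active first.
deals_per_city = deals["CityLocation"].value_counts()

print(yearly_billion)
print(deals_per_city)
```

Printing these Series before plotting is a cheap way to confirm that axis labels and units agree with the data.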

Step 6: Build the Funding Amount Prediction Model

In this step, we create a machine learning pipeline to predict how much funding a startup might raise based on its location, sector, and investment type.

# Define features and target
features = ['CityLocation', 'IndustryVertical', 'InvestmentType']
target = 'AmountInUSD'
X = df[features]

# Log-transform the target variable to handle skewness
y = np.log1p(df[target])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a preprocessing pipeline for categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), features)
    ], remainder='passthrough')

# Create the full model pipeline
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1))
])

# Train the model
print("Training the funding amount prediction model...")
model_pipeline.fit(X_train, y_train)
print("Model training complete.")

Output:

Training the funding amount prediction model...
Model training complete.
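
The log transform of the target deserves a closer look, since every prediction must later be converted back with its inverse. The sketch below uses invented amounts and a simple hand-written skewness function (a hypothetical helper) to illustrate how np.log1p reduces skew and how np.expm1 restores the original values.

```python
import numpy as np

# Invented, heavily skewed funding amounts in USD.
amounts = np.array([50_000, 200_000, 1_000_000, 5_000_000, 1_400_000_000], dtype=float)


def skewness(values: np.ndarray) -> float:
    """Mean cubed z-score; 0 for perfectly symmetric data."""
    centred = values - values.mean()
    return float((centred ** 3).mean() / values.std() ** 3)


log_amounts = np.log1p(amounts)   # log(1 + x), defined at x = 0
restored = np.expm1(log_amounts)  # inverse transform: exp(y) - 1

print(f"skewness before: {skewness(amounts):.2f}")
print(f"skewness after:  {skewness(log_amounts):.2f}")
print("round trip OK:", np.allclose(restored, amounts))
```

The transform compresses very large deals, so the model is trained on a better-behaved target, at the cost of errors being measured on a multiplicative rather than an absolute scale during training.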

Step 7: Evaluate the Model Performance

After training the model, it's important to check how well it performs on unseen data. In this step, we evaluate the model using R-squared (R²) and Root Mean Squared Error (RMSE) to measure prediction accuracy.

Here is the code for this step:

# Make predictions on the test set
y_pred_log = model_pipeline.predict(X_test)

# Inverse transform the predictions to get the actual currency amount
y_pred = np.expm1(y_pred_log)
y_test_actual = np.expm1(y_test)

# Calculate and print evaluation metrics
r2 = r2_score(y_test_actual, y_pred)
rmse = np.sqrt(mean_squared_error(y_test_actual, y_pred))
print(f"Model R-squared (R²): {r2:.2f}")
print(f"Root Mean Squared Error (RMSE): ${rmse:,.2f}")
print("R² shows the proportion of variance in funding amount that the model can predict.")
print("RMSE indicates the typical error in the model's prediction in USD.")

Output:

Model R-squared (R²): -0.01
Root Mean Squared Error (RMSE): $194,456,881.93
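
A negative R² is easier to interpret next to a baseline. By definition, always predicting the test-set mean gives R² = 0, and any other constant prediction scores below that. The sketch below uses invented amounts with one very large deal. The "median-like" predictor is an assumption standing in for a model that predicts typical deal sizes and misses the largest ones, which is one plausible way a model trained on log amounts can end up slightly below zero on the original scale.

```python
import numpy as np


def r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """R² = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error, in the same units as y_true."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Invented test-set amounts in USD with one very large deal.
y_true = np.array([1e6, 2e6, 3e6, 4e6, 500e6])

mean_baseline = np.full_like(y_true, y_true.mean())
median_like = np.full_like(y_true, np.median(y_true))

for name, pred in [("mean baseline", mean_baseline), ("median-like model", median_like)]:
    print(f"{name:>18}: R² = {r2(y_true, pred):+.3f}, RMSE = ${rmse(y_true, pred):,.0f}")
```

With a few huge deals in the test set, RMSE is dominated by those rows, so it is worth reporting metrics on the log scale as well when comparing models.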

Step 8: Prediction on a New, Hypothetical Startup

In this step, we use the trained pipeline to predict the funding amount for a fictional startup using specific input values.

Here is the code for this step:

# Create a new startup's data
new_startup_data = pd.DataFrame({
    'CityLocation': ['Bengaluru'],
    'IndustryVertical': ['Fintech'],
    'InvestmentType': ['Series A']
})

print("\nPredicting funding amount for the following new startup:")
print(new_startup_data)

# Predict funding amount
predicted_log_amount = model_pipeline.predict(new_startup_data)
predicted_amount = np.expm1(predicted_log_amount)

print(f"\nPredicted Funding Amount: ${predicted_amount[0]:,.2f}")

Output:

Predicting funding amount for the following new startup:
  CityLocation IndustryVertical InvestmentType
0    Bengaluru          Fintech       Series A

Predicted Funding Amount: $6,028,352.07

Final Conclusion

In this Startup Funding Analysis and Prediction project, we cleaned and preprocessed the funding data, explored trends by year, city, sector, and investment type, and trained a Random Forest Regressor to predict funding amounts. The model performed poorly (R² = -0.01, RMSE ≈ $194M), likely because funding amounts vary widely and depend on factors the dataset does not capture. The pipeline still returned an estimate for a hypothetical Bengaluru Fintech startup, but with accuracy this low that figure should be treated with caution. The project demonstrates a full data science workflow and highlights how hard it is to model funding amounts from a small set of categorical features.

Colab Link:
https://colab.research.google.com/drive/1c0TRJXsShjGjfFwkeL-WlDoIRgLyp2kA

Frequently Asked Questions (FAQs)

1. What is the goal of this Startup Funding Analysis and Prediction Project?

The project aims to analyse startup funding trends in India and build a machine learning model that predicts funding amounts based on factors like city location, industry vertical, and investment type.

2. Which machine learning model was used for prediction?

A Random Forest Regressor was used within a pipeline that includes preprocessing steps like OneHotEncoding for categorical features and log transformation for the target variable.

3. Why is the model performance low (R² = -0.01)?

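An R² of -0.01 means the model does no better on the test set than always predicting the mean funding amount. The available features (city, industry vertical, and investment type) don't capture most of what drives funding amounts. Factors like founder background, market trends, investor reputation, and startup traction aren't represented in the dataset, which limits model accuracy.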

4. How can the model be improved?

The model can be improved by:

Adding features such as startup age, number of employees, or past funding history
Handling outliers more effectively
Trying advanced models like Gradient Boosting or XGBoost
Using external data sources for more context
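
As an illustration of the outlier-handling suggestion, one common option is to cap extreme values at chosen percentiles before training. This is a sketch on invented data, not something the project above does. cap_outliers is a hypothetical helper, and the 1st and 99th percentiles are arbitrary starting points.

```python
import pandas as pd


def cap_outliers(series: pd.Series, lower_q: float = 0.01, upper_q: float = 0.99) -> pd.Series:
    """Clip values outside the given quantiles to the quantile boundaries."""
    lower, upper = series.quantile([lower_q, upper_q])
    return series.clip(lower=lower, upper=upper)


# Invented amounts: 99 ordinary deals and one extreme deal.
amounts = pd.Series([1e6] * 99 + [1e9])
capped = cap_outliers(amounts)

print("max before:", f"{amounts.max():,.0f}")
print("max after: ", f"{capped.max():,.0f}")
```

Capping keeps every row while limiting the influence of a few very large deals. Whether that helps depends on whether those deals are the ones you most want to predict.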

Rohit Sharma
