Startup Funding Analysis and Prediction: A Machine Learning Project

By Rohit Sharma

Updated on Aug 11, 2025 | 8 min read | 1.49K+ views

In India, the startup ecosystem has seen rapid expansion, with a surge in funding activity across various sectors. 

This project focuses on Startup Funding Analysis and Prediction: we explore patterns in funding data, clean and preprocess the dataset, and build a model that predicts funding amounts. The aim is to uncover what influences investment decisions and how funding trends evolve.

Before We Dive In, What Should You Know?

Before you begin working on the Startup Funding Analysis and Prediction project, make sure you're comfortable with the following tools and concepts:

  • Python programming (You'll use Python for data loading, cleaning, exploration, and model development)
  • Pandas and NumPy (These libraries are essential for handling structured data, performing transformations, and generating insights)
  • Matplotlib or Seaborn (You'll use these for data visualisation: line plots, bar charts, and pie charts)
  • Scikit‑learn basics (Required for building preprocessing pipelines and training regression models like Random Forest)
  • Data preprocessing (You need to understand how to handle categorical and numerical features using encoders and scalers)
  • Model evaluation (Familiarity with metrics like R² score, RMSE, and MSE is necessary to interpret model performance effectively)

Techniques You'll Use for Learning

To get the most out of your Startup Funding Analysis and Prediction project, you’ll apply the following data science techniques:

  • Exploratory Data Analysis (EDA): Analyse patterns in startup funding and identify the top industries, cities, and investment types over the years.
  • Feature Engineering: Extract and clean information such as funding amounts, startup stages, and industry sectors to create relevant features.
  • Data Visualisation: Use line plots, bar plots, and a pie chart to visualise yearly funding trends, deal counts by city, sector-wise funding, and investment types.
  • Regression Modelling: Train a Random Forest Regressor to predict funding amounts based on startup attributes and funding rounds.

Time Required to Complete the Project: You can complete the Startup Funding Analysis and Prediction project in approximately 3 to 4 hours, depending on your familiarity with data preprocessing, visualisation, and model building in Python.

Let’s build this Startup Funding Analysis and Prediction project from scratch with clear, step-by-step guidance:

Step No. | Step Title | Purpose in the Project
1 | Download the Dataset | Get the startup funding CSV file that the rest of the project reads.
2 | Import Required Libraries | Load the Python packages used for analysis, plotting, and modelling.
3 | Load the Dataset | Read the startup funding dataset into a DataFrame for processing and analysis.
4 | Clean and Prepare the Data | Rename columns, convert amounts and dates, drop rows with missing key fields, and standardise city, sector, and investment-type labels.
5 | Explore the Data (EDA) | Analyse and plot funding trends by year, city, sector, and investment type.
6 | Build the Prediction Model | Encode the categorical features and train a Random Forest Regressor on log-transformed funding amounts.
7 | Evaluate the Model | Use R² and RMSE to measure model performance on the test set.
8 | Predict for a New Startup | Estimate the funding amount for a hypothetical startup with given features.

Alright, let's dive in!

Step 1: Download the Dataset

To begin the Startup Funding Analysis and Prediction project, download the dataset. It is freely available online, and you can find it on Kaggle by searching for the project name. The code in the following steps expects a CSV file named startup_funding.csv in the working directory.

Step 2: Import Required Libraries

With the dataset downloaded, import the Python packages used for analysis and modelling: pandas and NumPy for data handling, Matplotlib and Seaborn for plotting, and scikit-learn for splitting the data, encoding categories, building the pipeline, training the Random Forest, and computing metrics. The last line silences library warnings to keep the output readable.

Here are the imports:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
warnings.filterwarnings('ignore')

Step 3: Load the Dataset

Now that we’ve imported all the required libraries, the next step is to load the dataset into memory. We’ll be working with a CSV file named startup_funding.csv.

If the file is missing, the script will stop to prevent further errors.

print("--- Loading Dataset ---")
try:
    df = pd.read_csv('startup_funding.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'startup_funding.csv' not found. Please ensure the file is in the correct directory.")
    # Stop here so later steps do not run without data
    raise SystemExit(1)
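
Before cleaning, it helps to check the column types and count the missing values in each column. The sketch below does this on a small made-up CSV; the rows, values, and the names sample_csv and sample_df are hypothetical stand-ins, not content from startup_funding.csv.

```python
import io
import pandas as pd

# Hypothetical sample rows standing in for startup_funding.csv
sample_csv = io.StringIO(
    'StartupName,CityLocation,AmountInUSD\n'
    'Alpha,Bengaluru,"1,000,000"\n'
    'Beta,,500000\n'
    'Gamma,Mumbai,\n'
)
sample_df = pd.read_csv(sample_csv)

# Shape, first rows, and dtypes give a quick overview of the data
print(sample_df.shape)
print(sample_df.head())
print(sample_df.dtypes)

# Count missing values in each column
missing_counts = sample_df.isnull().sum()
print(missing_counts)
```

On the real DataFrame, the same calls on df show which columns need attention in the cleaning step.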

Step 4: Data Cleaning and Preparation

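Before any analysis or modelling, we need to clean and prepare the data. Real-world data often comes with inconsistencies, missing values, and formatting issues. The code below renames the columns, converts funding amounts and dates to proper types, drops rows that are missing key fields, and standardises the city, sector, and investment-type labels.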

# Standardize column names
df.columns = ['SrNo', 'Date', 'StartupName', 'IndustryVertical', 'SubVertical',
              'CityLocation', 'InvestorsName', 'InvestmentType', 'AmountInUSD', 'Remarks']

# --- Clean 'AmountInUSD' column ---
df['AmountInUSD'] = df['AmountInUSD'].apply(lambda x: str(x).replace(",", ""))
df['AmountInUSD'] = pd.to_numeric(df['AmountInUSD'], errors='coerce')

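# --- Clean 'Date' column ---
# Repair one malformed entry, then parse; unparseable dates become NaT
df['Date'] = df['Date'].str.replace('05/072018', '05/07/2018', regex=False)
# Note: ambiguous strings such as 05/07/2018 are read month-first by default;
# pass dayfirst=True if the file stores dates as dd/mm/yyyy
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')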

# Drop rows where key data is missing
df.dropna(subset=['Date', 'AmountInUSD', 'IndustryVertical', 'CityLocation', 'InvestmentType'], inplace=True)

# Extract Year and Month for trend analysis
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month

# --- Clean Categorical Columns ---
df['CityLocation'] = df['CityLocation'].str.title().replace({'Delhi': 'New Delhi', 'Bangalore': 'Bengaluru'})
df['IndustryVertical'] = df['IndustryVertical'].str.title().replace({
    'E-Commerce': 'ECommerce',
    'E-Commerce & M-Commerce': 'ECommerce',
    'Ecommerce': 'ECommerce'
})
df['InvestmentType'] = df['InvestmentType'].str.title().replace({
    'Seed/ Angel Funding': 'Seed Funding',
    'Seed / Angel Funding': 'Seed Funding',
    'Seed/Angel Funding': 'Seed Funding',
    'Angel / Seed Funding': 'Seed Funding',
    'Privateequity': 'Private Equity'
})

print("Data cleaning and preparation complete.")
print("Cleaned DataFrame shape:", df.shape)

Output:

Data cleaning and preparation complete.
Cleaned DataFrame shape: (836, 12)
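
The coercion and label-cleaning steps above can be tried on a few made-up values to see what they do. This is a minimal sketch: the strings below are hypothetical, and whether the real file stores dates day-first is an assumption you should check against your copy of the data.

```python
import pandas as pd

# Hypothetical raw values, similar in shape to the columns cleaned above
raw_amounts = pd.Series(["1,300,000", "undisclosed", "500000", None])
raw_dates = pd.Series(["05/07/2018", "13/07/2018"])
raw_cities = pd.Series(["bangalore", "Delhi", "MUMBAI"])

# Strip thousands separators, then coerce anything non-numeric to NaN
amounts = pd.to_numeric(
    raw_amounts.astype(str).str.replace(",", "", regex=False), errors="coerce"
)
print(amounts.tolist())

# The same date strings parsed with two explicit formats
day_first = pd.to_datetime(raw_dates, format="%d/%m/%Y", errors="coerce")
month_first = pd.to_datetime(raw_dates, format="%m/%d/%Y", errors="coerce")
print(day_first)
print(month_first)  # the second entry has no valid month-first reading

# Title-case the labels, then map known variants to one spelling
cities = raw_cities.str.title().replace(
    {"Delhi": "New Delhi", "Bangalore": "Bengaluru"}
)
print(cities.tolist())
```

Rows that end up as NaN or NaT are exactly the ones removed by dropna, so the choice of date format directly affects how many rows survive cleaning.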

Step 5: Exploratory Data Analysis (EDA)

Now that your data is clean, use Exploratory Data Analysis (EDA) to understand trends and patterns.

sns.set_style("whitegrid")

# a. Year-on-Year Funding Trend
plt.figure(figsize=(12, 6))
# Sum per year and convert to billions of USD so the values match the axis label
yearly_funding = df.groupby('Year')['AmountInUSD'].sum().sort_index() / 1e9
sns.lineplot(x=yearly_funding.index, y=yearly_funding.values, marker='o')
plt.title('Total Startup Funding in India (Year-on-Year)')
plt.xlabel('Year')
plt.ylabel('Total Funding (in Billion USD)')
plt.xticks(yearly_funding.index.astype(int))
plt.savefig('yearly_funding_trend.png')
print("Generated 'yearly_funding_trend.png'")

# b. Top 10 Startup Hubs by Number of Deals
plt.figure(figsize=(12, 7))
top_cities = df['CityLocation'].value_counts().head(10)
sns.barplot(x=top_cities.values, y=top_cities.index, palette='viridis')
plt.title('Top 10 Startup Hubs in India (by Number of Funding Deals)')
plt.xlabel('Number of Deals')
plt.ylabel('City')
plt.savefig('top_startup_hubs.png')
print("Generated 'top_startup_hubs.png'")

# c. Top 10 Most Funded Sectors
plt.figure(figsize=(12, 7))
# Sum per sector and convert to billions of USD so the values match the axis label
top_sectors = df.groupby('IndustryVertical')['AmountInUSD'].sum().sort_values(ascending=False).head(10) / 1e9
sns.barplot(x=top_sectors.values, y=top_sectors.index, palette='plasma')
plt.title('Top 10 Most Funded Sectors in India')
plt.xlabel('Total Funding (in Billion USD)')
plt.ylabel('Sector')
plt.savefig('top_funded_sectors.png')
print("Generated 'top_funded_sectors.png'")

# d. Most Common Investment Types
plt.figure(figsize=(10, 8))
investment_types = df['InvestmentType'].value_counts().head(5)
plt.pie(investment_types, labels=investment_types.index, autopct='%1.1f%%', startangle=140, colors=sns.color_palette('pastel'))
plt.title('Distribution of Top 5 Investment Types')
plt.ylabel('')
plt.savefig('investment_types_distribution.png')
print("Generated 'investment_types_distribution.png'")

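Output:

Generated 'yearly_funding_trend.png'
Generated 'top_startup_hubs.png'
Generated 'top_funded_sectors.png'
Generated 'investment_types_distribution.png'

[Figures: year-on-year funding line chart; top 10 startup hubs by number of deals; top 10 most funded sectors; distribution of the top 5 investment types]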
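
Each chart in this step is drawn from a simple aggregation, and it can help to look at the numbers behind a plot. The sketch below runs the same kind of groupby and value_counts calls on a small hypothetical DataFrame (the deals in toy are made up); it also adds a per-year median, which is not part of the original code, to show how one very large round can dominate a yearly total.

```python
import pandas as pd

# Hypothetical deals; the real DataFrame has the same column names after Step 4
toy = pd.DataFrame({
    "Year": [2016, 2016, 2017, 2017, 2017],
    "CityLocation": ["Bengaluru", "Mumbai", "Bengaluru", "Bengaluru", "Pune"],
    "AmountInUSD": [2_000_000, 500_000, 1_500_000_000, 3_000_000, 1_000_000],
})

# Total funding per year, converted to billions of USD
yearly_billion = toy.groupby("Year")["AmountInUSD"].sum().sort_index() / 1e9
print(yearly_billion)

# Number of deals per city, largest first
deal_counts = toy["CityLocation"].value_counts()
print(deal_counts)

# The median is less sensitive than the sum to a single very large round
yearly_median = toy.groupby("Year")["AmountInUSD"].median()
print(yearly_median)
```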

Step 6: Build the Funding Amount Prediction Model

In this step, we create a machine learning pipeline to predict how much funding a startup might raise based on its location, sector, and investment type. 

# Define features and target
features = ['CityLocation', 'IndustryVertical', 'InvestmentType']
target = 'AmountInUSD'
X = df[features]

# Log-transform the target variable to handle skewness
y = np.log1p(df[target])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a preprocessing pipeline for categorical features
# (sparse_output requires scikit-learn 1.2+; older releases name it sparse)
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), features)
    ], remainder='passthrough')

# Create the full model pipeline
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1))
])

# Train the model
print("Training the funding amount prediction model...")
model_pipeline.fit(X_train, y_train)
print("Model training complete.")

Output: 

Training the funding amount prediction model...
Model training complete.
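
The target is log-transformed with np.log1p before training and later converted back with np.expm1. A minimal sketch of that round trip, using hypothetical amounts that span several orders of magnitude, shows how the transform compresses the range the model has to fit:

```python
import numpy as np

# Hypothetical funding amounts in USD spanning several orders of magnitude
amounts = np.array([50_000.0, 1_000_000.0, 25_000_000.0, 1_500_000_000.0])

# log1p computes log(1 + x), which is also defined at x = 0
log_amounts = np.log1p(amounts)
print(log_amounts.round(2))

# The ratio between the largest and smallest value shrinks sharply
print(amounts.max() / amounts.min())          # raw scale
print(log_amounts.max() / log_amounts.min())  # log scale

# expm1 inverts log1p, recovering the original amounts
recovered = np.expm1(log_amounts)
print(np.allclose(recovered, amounts))
```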

Step 7: Evaluate the Model Performance

After training the model, it's important to check how well it performs on unseen data. In this step, we evaluate the model using R-squared (R²) and Root Mean Squared Error (RMSE) to measure prediction accuracy.

Here is the code for this step:

# Make predictions on the test set
y_pred_log = model_pipeline.predict(X_test)

# Inverse transform the predictions to get the actual currency amount
y_pred = np.expm1(y_pred_log)
y_test_actual = np.expm1(y_test)

# Calculate and print evaluation metrics
r2 = r2_score(y_test_actual, y_pred)
rmse = np.sqrt(mean_squared_error(y_test_actual, y_pred))
print(f"Model R-squared (R²): {r2:.2f}")
print(f"Root Mean Squared Error (RMSE): ${rmse:,.2f}")
print("R² shows the proportion of variance in funding amount that the model can predict.")
print("RMSE indicates the typical error in the model's prediction in USD.")

Output: 

Model R-squared (R²): -0.01
Root Mean Squared Error (RMSE): $194,456,881.93
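
R² is negative whenever the predictions fit the test set worse than a constant equal to the mean of the test targets. Because the metrics above are computed after converting back to dollars, a few very large deals can dominate them. The sketch below illustrates this with hypothetical numbers (y_true and y_pred are made up and do not reproduce the results above): the same predictions are scored once on the dollar scale and once on the log1p scale.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Hypothetical actual and predicted amounts in USD; the last deal is an outlier
y_true = np.array([1e6, 2e6, 4e6, 8e6, 1e9])
y_pred = np.array([1.5e6, 2.5e6, 3e6, 6e6, 2e7])

# Metrics on the original dollar scale are dominated by the outlier
r2_usd = r2_score(y_true, y_pred)
rmse_usd = np.sqrt(mean_squared_error(y_true, y_pred))

# The same predictions compared on the log1p scale used for training
r2_log = r2_score(np.log1p(y_true), np.log1p(y_pred))

print(f"R² (USD scale): {r2_usd:.2f}")
print(f"RMSE (USD scale): ${rmse_usd:,.0f}")
print(f"R² (log scale): {r2_log:.2f}")
```

Reporting both scales can make it clearer whether a poor score reflects the model as a whole or a handful of extreme deals.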

Step 8: Prediction on a New, Hypothetical Startup

In this step, we use the trained pipeline to predict the funding amount for a fictional startup using specific input values.

Here is the code for this step:

# Create a new startup's data
new_startup_data = pd.DataFrame({
    'CityLocation': ['Bengaluru'],
    'IndustryVertical': ['Fintech'],
    'InvestmentType': ['Series A']
})

print("\nPredicting funding amount for the following new startup:")
print(new_startup_data)

# Predict funding amount
predicted_log_amount = model_pipeline.predict(new_startup_data)
predicted_amount = np.expm1(predicted_log_amount)

print(f"\nPredicted Funding Amount: ${predicted_amount[0]:,.2f}")

Output:

Predicting funding amount for the following new startup:
  CityLocation IndustryVertical InvestmentType
0    Bengaluru          Fintech       Series A

Predicted Funding Amount: $6,028,352.07
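
Because the encoder in the pipeline is created with handle_unknown='ignore', a category that never appeared in the training data is encoded as all zeros instead of raising an error, so the pipeline still returns a prediction for it. The sketch below shows this behaviour on hypothetical city names; the DataFrames train and new_rows are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training categories
train = pd.DataFrame({"CityLocation": ["Bengaluru", "Mumbai", "Pune"]})
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# One known city and one that never appeared during fitting
new_rows = pd.DataFrame({"CityLocation": ["Mumbai", "Kochi"]})
encoded = encoder.transform(new_rows).toarray()

print(encoder.categories_[0])
print(encoded)  # the row for the unseen city contains only zeros
```

A prediction for an unseen city, sector, or investment type therefore carries no information from that feature, which is worth keeping in mind when entering new startup data.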

Final Conclusion

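This Startup Funding Analysis and Prediction project covered a full data science workflow: cleaning and preprocessing the data, exploring funding trends, and training a Random Forest Regressor to predict funding amounts. The model performed poorly on the test set (R² = -0.01, RMSE ≈ $194M), which is roughly on par with always predicting the average funding amount. Likely reasons are the high variability of deal sizes and the fact that the model sees only three categorical features (city, sector, and investment type), while factors outside the dataset also influence funding. The pipeline returned an estimate of about $6.03M for a hypothetical Bengaluru Fintech startup raising a Series A round, but given the low R², that figure should be treated with caution. The project highlights the challenges of financial modelling with limited, categorical features.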

Colab Link:
https://colab.research.google.com/drive/1c0TRJXsShjGjfFwkeL-WlDoIRgLyp2kA

Frequently Asked Questions (FAQs)

1. What is the goal of this Startup Funding Analysis and Prediction Project?

2. Which machine learning model was used for prediction?

3. Why is the model performance low (R² = -0.01)?

4. How can the model be improved?

5. Are there more beginner-friendly Python projects similar to this one?

Rohit Sharma

834 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
