Startup Funding Analysis and Prediction: A Machine Learning Project
By Rohit Sharma
Updated on Aug 11, 2025 | 8 min read | 1.49K+ views
In India, the startup ecosystem has seen rapid expansion, with a surge in funding activity across various sectors.
This project focuses on Startup Funding Analysis and Prediction, where we explore patterns in funding data, clean and preprocess the dataset, and lay the groundwork for building a predictive model. The aim is to uncover what influences investment decisions and how funding trends evolve.
Before you begin working on the Startup Funding Analysis and Prediction project, make sure you're comfortable with the following tools and concepts:
To get the most out of your Startup Funding Analysis and Prediction project, you’ll apply the following data science techniques:
Analyse patterns in startup funding, identify top industries, cities, and investor trends over the years.
Extract and clean information such as funding amounts, startup stages, and industry sectors to create relevant features.
Use bar plots, trend lines, and heatmaps to visualise funding distribution, investor frequency, and sector-wise growth.
Train a Random Forest Regressor to predict funding amounts based on startup attributes and funding rounds.
Time Required to Complete the Project: You can complete the Startup Funding Analysis and Prediction project in approximately 3 to 4 hours, depending on your familiarity with data preprocessing, visualisation, and model building in Python.
Let’s build this Startup Funding Analysis and Prediction project from scratch with clear, step-by-step guidance:
Step No. | Step Title | Purpose in the Project |
1 | Load the Dataset | Read the startup funding dataset into a DataFrame for processing and analysis. |
2 | Check for Missing Values | Identify and handle missing or inconsistent values in the date, city, industry, investment type, and funding columns. |
3 | Explore the Data (EDA) | Analyse trends in funding by city, sector, year, and investment type. |
4 | Visualise Key Factors | Use line plots, bar plots, and pie charts to visualise funding trends, dominant sectors, and investment types. |
5 | Define Features and Target | Select city, sector, and investment type as features, and log-transform the funding amount as the target. |
6 | Preprocess the Data | One-hot encode the categorical variables inside a pipeline for model training. |
7 | Train a Regression Model | Build a machine learning model (Random Forest Regressor) to predict funding amounts. |
8 | Evaluate the Model | Use R-squared (R²) and Root Mean Squared Error (RMSE) to measure model performance. |
9 | Predict New Startup Funding | Predict the funding amount for a new startup with given features. |
Alright, let's dive in!
To begin your Startup Funding Analysis and Prediction project, first download the dataset. It is freely available online, and you can find it on Kaggle by searching for the project name.
Next, prepare the necessary tools and libraries. In this phase, we import the essential Python packages so that the environment is ready for data analysis and modelling.
Here are the imports used in this project:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
warnings.filterwarnings('ignore')
Now that we’ve imported all the required libraries, the next step is to load the dataset into memory. We’ll be working with a CSV file named startup_funding.csv.
If the file is missing, the script will stop to prevent further errors.
print("--- Loading Dataset ---")
try:
    df = pd.read_csv('startup_funding.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'startup_funding.csv' not found. Please ensure the file is in the correct directory.")
    exit()
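Before cleaning, it can help to see how many values are missing in each column. The sketch below uses a small made-up DataFrame (the `sample` name and its rows are illustrative, not taken from startup_funding.csv); on the real data you would call the same methods on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the real project would inspect `df` instead.
sample = pd.DataFrame({
    'StartupName': ['Alpha', 'Beta', 'Gamma', 'Delta'],
    'CityLocation': ['Bangalore', np.nan, 'Mumbai', 'Delhi'],
    'AmountInUSD': ['1,000,000', np.nan, np.nan, '250,000'],
})

# Count missing values per column, largest first
missing_counts = sample.isnull().sum().sort_values(ascending=False)
print(missing_counts)

# Share of rows that would survive dropping incomplete records
complete_share = len(sample.dropna()) / len(sample)
print(f"Rows with no missing values: {complete_share:.0%}")
```

A low share of complete rows is an early warning that dropping incomplete records will shrink the dataset considerably.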
Before we dive into analysis or modelling, we need to clean and prepare the data. Real-world data often comes with inconsistencies, missing values, and formatting issues.
# Standardize column names
df.columns = ['SrNo', 'Date', 'StartupName', 'IndustryVertical', 'SubVertical',
              'CityLocation', 'InvestorsName', 'InvestmentType', 'AmountInUSD', 'Remarks']
# --- Clean 'AmountInUSD' column ---
df['AmountInUSD'] = df['AmountInUSD'].apply(lambda x: str(x).replace(",", ""))
df['AmountInUSD'] = pd.to_numeric(df['AmountInUSD'], errors='coerce')
# --- Clean 'Date' column ---
df['Date'] = df['Date'].str.replace('05/072018', '05/07/2018')
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
# Drop rows where key data is missing
df.dropna(subset=['Date', 'AmountInUSD', 'IndustryVertical', 'CityLocation', 'InvestmentType'], inplace=True)
# Extract Year and Month for trend analysis
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
# --- Clean Categorical Columns ---
df['CityLocation'] = df['CityLocation'].str.title().replace({'Delhi': 'New Delhi', 'Bangalore': 'Bengaluru'})
df['IndustryVertical'] = df['IndustryVertical'].str.title().replace({
    'E-Commerce': 'ECommerce',
    'E-Commerce & M-Commerce': 'ECommerce',
    'Ecommerce': 'ECommerce'
})
df['InvestmentType'] = df['InvestmentType'].str.title().replace({
    'Seed/ Angel Funding': 'Seed Funding',
    'Seed / Angel Funding': 'Seed Funding',
    'Seed/Angel Funding': 'Seed Funding',
    'Angel / Seed Funding': 'Seed Funding',
    'Privateequity': 'Private Equity'
})
print("Data cleaning and preparation complete.")
print("Cleaned DataFrame shape:", df.shape)
Output:
Data cleaning and preparation complete.
Cleaned DataFrame shape: (836, 12)
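To see what `errors='coerce'` does in the cleaning code above, here is a minimal sketch on made-up values (the `raw` frame is hypothetical). It also parses the dates with an explicit day-first format, which is an assumption about how the file stores dates; check a few raw rows before relying on it.

```python
import pandas as pd

# Hypothetical raw values that mimic the kinds of problems handled above
raw = pd.DataFrame({
    'Date': ['05/07/2018', '05/072018', '13/01/2017'],
    'AmountInUSD': ['1,300,000', 'undisclosed', '500000'],
})

# Strip thousands separators, then coerce anything non-numeric to NaN
amount = pd.to_numeric(raw['AmountInUSD'].str.replace(',', '', regex=False),
                       errors='coerce')

# Assumed format: day/month/year. Strings that do not match become NaT.
date = pd.to_datetime(raw['Date'], format='%d/%m/%Y', errors='coerce')

cleaned = pd.DataFrame({'Date': date, 'AmountInUSD': amount})
print(cleaned)
print("Rows kept after dropna:", len(cleaned.dropna()))
```

Coercion never raises an error, so unparseable values disappear silently at the `dropna` step. Comparing the row count before and after cleaning shows how much data was lost this way.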
Now that your data is clean, use Exploratory Data Analysis (EDA) to understand trends and patterns.
sns.set_style("whitegrid")
# a. Year-on-Year Funding Trend
plt.figure(figsize=(12, 6))
yearly_funding = df.groupby('Year')['AmountInUSD'].sum().sort_index()
sns.lineplot(x=yearly_funding.index, y=yearly_funding.values / 1e9, marker='o')  # USD -> billions, to match the y-axis label
plt.title('Total Startup Funding in India (Year-on-Year)')
plt.xlabel('Year')
plt.ylabel('Total Funding (in Billion USD)')
plt.xticks(yearly_funding.index.astype(int))
plt.savefig('yearly_funding_trend.png')
print("Generated 'yearly_funding_trend.png'")
# b. Top 10 Startup Hubs by Number of Deals
plt.figure(figsize=(12, 7))
top_cities = df['CityLocation'].value_counts().head(10)
sns.barplot(x=top_cities.values, y=top_cities.index, palette='viridis')
plt.title('Top 10 Startup Hubs in India (by Number of Funding Deals)')
plt.xlabel('Number of Deals')
plt.ylabel('City')
plt.savefig('top_startup_hubs.png')
print("Generated 'top_startup_hubs.png'")
# c. Top 10 Most Funded Sectors
plt.figure(figsize=(12, 7))
top_sectors = df.groupby('IndustryVertical')['AmountInUSD'].sum().sort_values(ascending=False).head(10)
sns.barplot(x=top_sectors.values / 1e9, y=top_sectors.index, palette='plasma')  # USD -> billions, to match the x-axis label
plt.title('Top 10 Most Funded Sectors in India')
plt.xlabel('Total Funding (in Billion USD)')
plt.ylabel('Sector')
plt.savefig('top_funded_sectors.png')
print("Generated 'top_funded_sectors.png'")
# d. Most Common Investment Types
plt.figure(figsize=(10, 8))
investment_types = df['InvestmentType'].value_counts().head(5)
plt.pie(investment_types, labels=investment_types.index, autopct='%1.1f%%', startangle=140, colors=sns.color_palette('pastel'))
plt.title('Distribution of Top 5 Investment Types')
plt.ylabel('')
plt.savefig('investment_types_distribution.png')
print("Generated 'investment_types_distribution.png'")
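Output: the script saves four charts as yearly_funding_trend.png, top_startup_hubs.png, top_funded_sectors.png, and investment_types_distribution.png.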
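The techniques listed earlier mention heatmaps, which the plots in this step do not cover. One possible way to prepare heatmap input is a city-by-year pivot of deal counts. This is a sketch on invented rows (the `deals` frame is hypothetical); with the cleaned `df` you would pivot the same columns.

```python
import pandas as pd

# Hypothetical cleaned rows with the same column names as the project
deals = pd.DataFrame({
    'CityLocation': ['Bengaluru', 'Bengaluru', 'Mumbai', 'New Delhi', 'Mumbai'],
    'Year': [2016, 2017, 2016, 2017, 2017],
    'AmountInUSD': [1e6, 5e6, 2e6, 3e6, 4e6],
})

# Rows = cities, columns = years, cells = number of funding deals
deal_counts = deals.pivot_table(index='CityLocation', columns='Year',
                                values='AmountInUSD', aggfunc='count',
                                fill_value=0)
print(deal_counts)

# With seaborn imported as sns, the table could then be drawn with:
# sns.heatmap(deal_counts, annot=True, cmap='viridis')
```

Setting `fill_value=0` keeps city and year combinations with no deals visible as zeros instead of blanks.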
In this step, we create a machine learning pipeline to predict how much funding a startup might raise based on its location, sector, and investment type.
# Define features and target
features = ['CityLocation', 'IndustryVertical', 'InvestmentType']
target = 'AmountInUSD'
X = df[features]
# Log-transform the target variable to handle skewness
y = np.log1p(df[target])
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a preprocessing pipeline for categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), features)
    ], remainder='passthrough')
# Create the full model pipeline
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1))
])
# Train the model
print("Training the funding amount prediction model...")
model_pipeline.fit(X_train, y_train)
print("Model training complete.")
Output:
Training the funding amount prediction model...
Model training complete.
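The log transform applied to the target can be illustrated on its own. The amounts below are invented to mimic a skewed distribution; the sketch shows that `np.log1p` compresses the range and that `np.expm1` recovers the original values.

```python
import numpy as np

# Hypothetical funding amounts in USD, spanning several orders of magnitude
amounts = np.array([50_000, 200_000, 1_000_000, 25_000_000, 1_400_000_000],
                   dtype=float)

log_amounts = np.log1p(amounts)       # log(1 + x), defined at x = 0
recovered = np.expm1(log_amounts)     # exp(x) - 1, the inverse of log1p

print("Raw max/min ratio:", amounts.max() / amounts.min())
print("Log max/min ratio:", round(log_amounts.max() / log_amounts.min(), 2))
print("Round trip OK:", np.allclose(recovered, amounts))
```

Because the model is trained on the log scale, its predictions have to be passed through `np.expm1` before they can be read as dollar amounts.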
After training the model, it's important to check how well it performs on unseen data. In this step, we evaluate the model using R-squared (R²) and Root Mean Squared Error (RMSE) to measure prediction accuracy.
Here is the code for this step:
# Make predictions on the test set
y_pred_log = model_pipeline.predict(X_test)
# Inverse transform the predictions to get the actual currency amount
y_pred = np.expm1(y_pred_log)
y_test_actual = np.expm1(y_test)
# Calculate and print evaluation metrics
r2 = r2_score(y_test_actual, y_pred)
rmse = np.sqrt(mean_squared_error(y_test_actual, y_pred))
print(f"Model R-squared (R²): {r2:.2f}")
print(f"Root Mean Squared Error (RMSE): ${rmse:,.2f}")
print("R² shows the proportion of variance in funding amount that the model can predict.")
print("RMSE indicates the typical error in the model's prediction in USD.")
Output:
Model R-squared (R²): -0.01
Root Mean Squared Error (RMSE): $194,456,881.93
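A negative R² can look surprising. As a minimal sketch with invented numbers, the snippet below shows that always predicting the mean of the true values scores 0, so any score below 0 means the predictions fit worse than that constant baseline.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Hypothetical true funding amounts (USD) with one very large deal
y_true = np.array([1e6, 2e6, 3e6, 5e6, 500e6])

# Baseline: predict the mean of the true values for every startup
mean_pred = np.full_like(y_true, y_true.mean())

# A model that predicts a few million for everything misses the outlier
flat_pred = np.full_like(y_true, 3e6)

print("R² of mean baseline:", r2_score(y_true, mean_pred))
print("R² of flat guess:   ", round(r2_score(y_true, flat_pred), 3))
print("RMSE of flat guess: ", np.sqrt(mean_squared_error(y_true, flat_pred)))
```

A few very large deals dominate both metrics on the raw USD scale, so it may also be worth reporting R² on the log scale the model was trained on.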
In this step, we use the trained pipeline to predict the funding amount for a fictional startup using specific input values.
Here is the code for this step:
# Create a new startup's data
new_startup_data = pd.DataFrame({
    'CityLocation': ['Bengaluru'],
    'IndustryVertical': ['Fintech'],
    'InvestmentType': ['Series A']
})
print("\nPredicting funding amount for the following new startup:")
print(new_startup_data)
# Predict funding amount
predicted_log_amount = model_pipeline.predict(new_startup_data)
predicted_amount = np.expm1(predicted_log_amount)
print(f"\nPredicted Funding Amount: ${predicted_amount[0]:,.2f}")
Output:
Predicting funding amount for the following new startup:
CityLocation IndustryVertical InvestmentType
0 Bengaluru Fintech Series A
Predicted Funding Amount: $6,028,352.07
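The pipeline's `OneHotEncoder(handle_unknown='ignore')` matters at prediction time: a category the model never saw is encoded as all zeros rather than raising an error. A minimal sketch with hypothetical cities:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training categories
train = pd.DataFrame({'CityLocation': ['Bengaluru', 'Mumbai', 'New Delhi']})

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# 'Bengaluru' was seen during fit; 'Kochi' was not
new = pd.DataFrame({'CityLocation': ['Bengaluru', 'Kochi']})
encoded = encoder.transform(new).toarray()

print(encoder.categories_[0])
print(encoded)
```

A prediction for an unseen city or sector therefore still returns a number, but that input carries no information for the model, which is worth checking before trusting the output.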
In this Startup Funding Analysis and Prediction project, we cleaned and preprocessed the funding data, explored it visually, and trained a Random Forest Regressor to predict funding amounts. The model performed poorly (R² = -0.01, RMSE ≈ $194M): an R² below zero means it fits the test data slightly worse than always predicting the mean funding amount. This is likely because funding amounts vary enormously and depend on factors that city, sector, and investment type alone do not capture. The pipeline still produced a prediction for a hypothetical Bengaluru Fintech startup, but given these accuracy limits the figure should be treated with caution. The project walks through a full data science workflow and highlights how hard financial modelling is with only a few categorical features.
Colab Link:
https://colab.research.google.com/drive/1c0TRJXsShjGjfFwkeL-WlDoIRgLyp2kA
Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...