Indian Automobile Market Analysis Using Random Forest
By Rohit Sharma
Updated on Aug 07, 2025 | 4 views
The Indian automobile industry is one of the largest and fastest-growing in the world. In this Indian Automobile Market Analysis project, we explore how machine learning can help us understand pricing trends and predict used car prices.
For this project, we will use a dataset of car listings from India, perform data cleaning, preprocessing, and exploratory data analysis, and then train a Random Forest Regressor model to predict car prices.
upGrad's Online Data Science Courses will help you enhance your data science expertise. Master Python, ML, AI, SQL, and Tableau under expert guidance to develop practical skills and prepare for a successful career.
For over 23 data science projects in Python suitable for both beginners and experts, refer to: 23+ Data Science Projects in Python for Freshers and Experts to Succeed in 2025
Before you begin working on the Indian Automobile Market Analysis, make sure you're comfortable with the following tools and concepts:
Advance your data science career with upGrad's premier courses and industry mentors.
To analyse car pricing trends and predict used car prices in India, you’ll use Python libraries built for data handling, preprocessing, visualisation, and regression modelling.
| Tool / Library | Purpose |
| --- | --- |
| Python | Core programming language for the entire workflow |
| Pandas | Loads, cleans, and manipulates the car dataset |
| NumPy | Supports numerical operations and data formatting |
| Matplotlib / Seaborn | Builds visualisations like bar charts, scatter plots, and correlation heatmaps |
| Scikit-learn | Provides tools for preprocessing, regression modelling, and evaluation |
| RandomForestRegressor | Predicts car prices based on input features like fuel type, mileage, etc. |
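If RandomForestRegressor is new to you, here is a minimal, self-contained sketch of how it fits and predicts. The data below is made up purely for illustration (it is not the car dataset used later in the project).
# Toy illustration of RandomForestRegressor: price rises roughly with engine size
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X_toy = rng.uniform(800, 2500, size=(200, 1))              # e.g. engine capacity in CC
y_toy = 500 * X_toy.ravel() + rng.normal(0, 50000, 200)    # noisy "price" signal

toy_model = RandomForestRegressor(n_estimators=50, random_state=42)
toy_model.fit(X_toy, y_toy)
print(toy_model.predict([[1500]]))  # predicted "price" for a 1500 CC engine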
Are you new to Python? This course can help you enhance your skills for free - Learn Basic Python Programming
To get the most out of your Indian Automobile Market Analysis & Price Prediction project, you'll apply these core data science techniques: data cleaning and preprocessing, exploratory data analysis, one-hot encoding and feature scaling, regression modelling with a Random Forest, and model evaluation with R² and RMSE.
Check out this beginner-friendly Python project! - Sales Data Analysis Project
Time Required to Complete the Project: You can complete the Indian Automobile Market Analysis & Price Prediction project in about 3 to 4 hours. This includes data cleaning, exploration, model building, evaluation, and making predictions on new car data.
Let’s build this Indian Automobile Market Analysis & Price Prediction project from scratch with clear, step-by-step guidance:
Without any further delay, let’s get started!
Want to dive deeper into Python? Check this out! - Handwritten Digit Recognition with CNN Using Python
To start the analysis, you first need to download the dataset, which is freely available online. You can also download it from Kaggle by searching for an Indian car price dataset.
Next, to begin the Indian Automobile Market Analysis & Price Prediction project, import all the necessary Python libraries.
Here’s the list of tools you’ll use:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
Explore this project - Customer Purchase Behaviour Analysis Project Using Python
This step loads the dataset that contains information about car listings in India. The script ensures the file is available before proceeding.
If the file is missing, the script will stop to prevent further errors.
print("--- Loading Dataset ---")
try:
# Load the dataset from the provided CSV file
df = pd.read_csv('car_dataset_india.csv')
print("Dataset loaded successfully.")
except FileNotFoundError:
print("Error: 'car_dataset_india.csv' not found. Please ensure the file is in the correct directory.")
exit()
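Optionally, you can preview the first few rows and the column types right after loading to confirm everything imported correctly. A small sketch using standard pandas calls:
# Quick preview of the loaded data
print(df.head())   # first five listings
df.info()          # column names, dtypes, and non-null counts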
Check out this - COVID-19 Project: Data Visualization & Insights
This step involves checking the dataset’s structure, removing missing values, and dropping unnecessary columns.
print("Original shape:", df.shape)
print("Columns:", df.columns.tolist())
# Check for and drop any missing values for simplicity
df.dropna(inplace=True)
print("Shape after dropping missing values:", df.shape)
# Drop the Car_ID as it's just an identifier
df = df.drop('Car_ID', axis=1)
# Basic statistics
print("\nStatistical Summary:")
print(df.describe())
Output:
Original shape: (10000, 11)
Columns: ['Car_ID', 'Brand', 'Model', 'Year', 'Fuel_Type', 'Transmission', 'Price', 'Mileage', 'Engine_CC', 'Seating_Capacity', 'Service_Cost']
Shape after dropping missing values: (10000, 11)
Statistical Summary:
| Statistic | Year | Price | Mileage | Engine_CC | Seating_Capacity | Service_Cost |
| --- | --- | --- | --- | --- | --- | --- |
| count | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
| mean | 2019.54 | 1,946,064 | 19.97 | 1,542.07 | 5.52 | 14,969.13 |
| std | 2.88 | 883,794.50 | 5.78 | 557.49 | 1.12 | 5,777.75 |
| min | 2015 | 400,000 | 10.00 | 800 | 4 | 5,000 |
| 25% | 2017 | 1,180,000 | 14.90 | 1,000 | 5 | 9,900 |
| 50% | 2020 | 1,950,000 | 20.00 | 1,500 | 6 | 15,000 |
| 75% | 2022 | 2,700,000 | 25.00 | 2,000 | 7 | 20,000 |
| max | 2024 | 3,500,000 | 30.00 | 2,500 | 7 | 25,000 |
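Beyond dropping missing values, it is often worth running a couple of extra hygiene checks before modelling. This optional sketch looks for duplicate listings and confirms the column data types:
# Optional hygiene checks before modelling
print("Duplicate rows:", df.duplicated().sum())
print(df.dtypes)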
Need to spot fraud in transactions? Check this out - Fraud Detection in Transactions with Python: A Machine Learning Project
This step involves visualising key factors from the dataset to understand patterns and relationships.
sns.set_theme(style="whitegrid")
# a. Top 10 Car Brands by Count
plt.figure(figsize=(12, 7))
brand_counts = df['Brand'].value_counts().head(10)
sns.barplot(x=brand_counts.index, y=brand_counts.values, palette='viridis')
plt.title('Top 10 Most Common Car Brands in the Dataset')
plt.xlabel('Brand')
plt.ylabel('Number of Cars')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('top_car_brands.png')
print("Generated 'top_car_brands.png'")
# b. Fuel Type Distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='Fuel_Type', data=df, palette='magma')
plt.title('Distribution of Fuel Types')
plt.xlabel('Fuel Type')
plt.ylabel('Count')
plt.savefig('fuel_type_distribution.png')
print("Generated 'fuel_type_distribution.png'")
# c. Average Price by Brand
plt.figure(figsize=(12, 7))
avg_price_brand = df.groupby('Brand')['Price'].mean().sort_values(ascending=False).head(10)
sns.barplot(x=avg_price_brand.index, y=avg_price_brand.values, palette='plasma')
plt.title('Top 10 Brands by Average Price')
plt.xlabel('Brand')
plt.ylabel('Average Price (INR)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('avg_price_by_brand.png')
print("Generated 'avg_price_by_brand.png'")
# d. Mileage vs. Price
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Mileage', y='Price', data=df, hue='Fuel_Type', alpha=0.6)
plt.title('Mileage vs. Price by Fuel Type')
plt.xlabel('Mileage (km/l or km/charge)')
plt.ylabel('Price (INR)')
plt.savefig('mileage_vs_price.png')
print("Generated 'mileage_vs_price.png'")
# e. Engine CC vs. Price
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Engine_CC', y='Price', data=df, hue='Transmission', alpha=0.6)
plt.title('Engine Capacity (CC) vs. Price by Transmission')
plt.xlabel('Engine Capacity (CC)')
plt.ylabel('Price (INR)')
plt.savefig('engine_vs_price.png')
print("Generated 'engine_vs_price.png'")
# f. Correlation Heatmap for numerical features
plt.figure(figsize=(10, 8))
numerical_cols = df.select_dtypes(include=np.number).columns
correlation_matrix = df[numerical_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Numerical Features')
plt.tight_layout()
plt.savefig('correlation_heatmap.png')
print("Generated 'correlation_heatmap.png'")
Output:
The script saves six charts: top_car_brands.png, fuel_type_distribution.png, avg_price_by_brand.png, mileage_vs_price.png, engine_vs_price.png, and correlation_heatmap.png.
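If you want one more view of pricing patterns, an optional boxplot of price by transmission type (an extra chart not in the original workflow, reusing the same df and plotting style as above) can show whether automatic cars command a premium:
# Optional: price distribution by transmission type
plt.figure(figsize=(8, 6))
sns.boxplot(x='Transmission', y='Price', data=df, palette='Set2')
plt.title('Price Distribution by Transmission')
plt.xlabel('Transmission')
plt.ylabel('Price (INR)')
plt.tight_layout()
plt.savefig('price_by_transmission.png')
print("Generated 'price_by_transmission.png'")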
A Beginner-Friendly Project in Python - Complete Airline Passenger Traffic Analysis Project Using Python
Before training the model, we need to prepare the data. In this step, we select relevant features for prediction and apply preprocessing.
# Define features (X) and target (y)
# We will not use 'Model' for simplicity, as it has too many unique values.
features = ['Brand', 'Year', 'Fuel_Type', 'Transmission', 'Mileage', 'Engine_CC', 'Seating_Capacity', 'Service_Cost']
target = 'Price'
X = df[features]
y = df[target]
# Identify categorical and numerical features
categorical_features = ['Brand', 'Fuel_Type', 'Transmission']
numerical_features = ['Year', 'Mileage', 'Engine_CC', 'Seating_Capacity', 'Service_Cost']
# Create a preprocessing pipeline
# OneHotEncoder handles categorical variables.
# StandardScaler scales numerical features.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])
print("Preprocessing pipeline created.")
Output:
Preprocessing pipeline created.
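To verify the earlier claim that 'Model' has too many unique values to one-hot encode comfortably, you can check the cardinality of each categorical column with a quick pandas call:
# How many unique values does each categorical column have?
print(df[['Brand', 'Model', 'Fuel_Type', 'Transmission']].nunique())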
Also, Check this - Crime Rate Prediction by City Using Python and Machine Learning
In this step, we split the data into training and testing sets to evaluate model performance.
We build a pipeline that includes preprocessing and a RandomForestRegressor.
This pipeline ensures the model receives properly transformed data and is trained efficiently.
Here is the code for this step:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create the full model pipeline with a RandomForestRegressor
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])
# Train the model
model_pipeline.fit(X_train, y_train)
print("Random Forest Regressor model trained successfully.")
Output:
Random Forest Regressor model trained successfully.
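Since Random Forests expose feature importances, you can optionally inspect which inputs drive the predictions. This sketch assumes scikit-learn 1.0+ (needed for get_feature_names_out on the fitted ColumnTransformer):
# Which features matter most to the trained Random Forest?
feature_names = model_pipeline.named_steps['preprocessor'].get_feature_names_out()
importances = model_pipeline.named_steps['regressor'].feature_importances_

importance_df = pd.DataFrame({'feature': feature_names, 'importance': importances})
print(importance_df.sort_values('importance', ascending=False).head(10))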
Explore this project - Loan Default Risk Analysis Using Machine Learning Techniques
Now that the model is trained, it's time to evaluate its performance on unseen data.
We'll generate predictions on the test set and calculate key regression metrics like R² and RMSE to measure how well the model estimates car prices.
Here is the code for this step:
# Make predictions on the test set
y_pred = model_pipeline.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"R-squared (R²): {r2:.2f}")
print(f"Root Mean Squared Error (RMSE): ₹{rmse:,.2f}")
Output:
R-squared (R²): -0.06
Root Mean Squared Error (RMSE): ₹908,785.00
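An R² below zero means the model performs worse than simply predicting the average price, so it is worth comparing against a naive baseline. Here is a quick optional check using scikit-learn's DummyRegressor, reusing the train/test split from above:
# Sanity check: a baseline that always predicts the mean training price
from sklearn.dummy import DummyRegressor

baseline = DummyRegressor(strategy='mean')
baseline.fit(X_train, y_train)
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline.predict(X_test)))
print(f"Baseline RMSE (mean predictor): ₹{baseline_rmse:,.2f}")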
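Finally, to use the trained pipeline on new car data, pass a one-row DataFrame with the same feature columns. The values below are hypothetical and purely for illustration; replace them with a real listing's details (unseen brand or fuel categories are handled safely because the encoder uses handle_unknown='ignore'):
# Predict the price of a hypothetical new listing (values are made up for illustration)
new_car = pd.DataFrame([{
    'Brand': 'Maruti',          # hypothetical values; use a real listing's specs
    'Year': 2022,
    'Fuel_Type': 'Petrol',
    'Transmission': 'Manual',
    'Mileage': 21.0,
    'Engine_CC': 1200,
    'Seating_Capacity': 5,
    'Service_Cost': 9000
}])
predicted_price = model_pipeline.predict(new_car)[0]
print(f"Predicted price: ₹{predicted_price:,.0f}")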
Take a look at this exciting project - Customer Churn Prediction Project
We analysed the Indian automobile market by building a machine learning model to predict car prices. The data was cleaned, preprocessed, and used to train a Random Forest Regressor. On this dataset the evaluation scores were weak (R² of about -0.06), which suggests the available features carry little pricing signal here; even so, the workflow demonstrates an end-to-end approach you can reuse for understanding pricing patterns and supporting decision-making with richer data.
Colab Link:
https://colab.research.google.com/drive/1sQ1Utn8LKeJ_T1Sjyu--18MSLcU9Sa1v?usp=sharing