Indian Automobile Market Analysis Using Random Forest

By Rohit Sharma

Updated on Aug 07, 2025

The Indian automobile industry is one of the largest and fastest-growing in the world. In this Indian Automobile Market Analysis project, we explore how machine learning can help us understand pricing trends and predict used car prices.

For this project, we will use a dataset of car listings from India, perform data cleaning, preprocessing, and exploratory data analysis, and then train a Random Forest Regressor to predict prices.

upGrad's Online Data Science Courses will help you enhance your data science expertise. Master Python, ML, AI, SQL, and Tableau under expert guidance to develop practical skills and prepare for a successful career.

For over 23 data science projects in Python suitable for both beginners and experts, refer to: 23+ Data Science Projects in Python for Freshers and Experts to Succeed in 2025

Core Project Needs

Before you begin working on the Indian Automobile Market Analysis, make sure you're comfortable with the following tools and concepts:

  • Python programming (You’ll use Python throughout for data processing, visualisation, and modelling.)
  • Pandas and Numpy (These libraries help you clean, explore, and structure the dataset for analysis.)
  • Matplotlib or Seaborn (You’ll use these tools to create bar charts, scatter plots, and correlation heatmaps)
  • Scikit‑learn basics (You’ll use this library to build a machine learning pipeline, train the Random Forest Regressor, and evaluate predictions)
  • Feature Engineering and Encoding (You need to know how to convert categorical variables using techniques like One-Hot Encoding, sketched briefly after this list, and scale features when required)
  • Model evaluation (You’ll use regression metrics like the R² score and RMSE to evaluate your results)
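If One-Hot Encoding is new to you, here is a minimal sketch using pandas’ get_dummies (the project itself uses scikit-learn’s OneHotEncoder inside a pipeline, which does the same job):

import pandas as pd

# Toy example: one categorical column expanded into binary columns;
# each row gets a 1/True in exactly one of the new columns
cars = pd.DataFrame({'Fuel_Type': ['Petrol', 'Diesel', 'Electric', 'Petrol']})
encoded = pd.get_dummies(cars, columns=['Fuel_Type'])
print(encoded)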

Advance your data science career with upGrad's premier courses and industry mentors.

Tools and Tech: How to Do Indian Automobile Market Analysis

To analyse car pricing trends and predict used car prices in India, you’ll use Python libraries built for data handling, preprocessing, visualisation, and regression modelling.

Tool / Library          Purpose
Python                  Core programming language for the entire workflow
Pandas                  Loads, cleans, and manipulates the car dataset
NumPy                   Supports numerical operations and data formatting
Matplotlib / Seaborn    Builds visualisations like bar charts, scatter plots, and correlation heatmaps
Scikit-learn            Provides tools for preprocessing, regression modelling, and evaluation
RandomForestRegressor   Predicts car prices based on input features like fuel type, mileage, etc.

Are you new to Python? This course can help you enhance your skills for free - Learn Basic Python Programming

Techniques Used For Indian Automobile Market Analysis

To get the most out of your Indian Automobile Market Analysis & Price Prediction project, you'll apply these core data science techniques:

  • Exploratory Data Analysis (EDA):
    Study trends in car prices across fuel types, brands, transmission types, and seller categories.
  • Feature Engineering:
    Create and refine features such as mileage, engine capacity, and year to improve model performance.
  • Data Visualisation:
    Use scatter plots, heatmaps, and bar charts to highlight relationships between price and other key features.
  • Data Preprocessing:
    Handle missing values, apply OneHotEncoding to categorical features, and scale numerical columns for model readiness.
  • Regression Modelling:
    Train a Random Forest Regressor to predict used car prices based on multiple car features.
  • Model Evaluation:
    Assess performance using R² Score and RMSE to measure how closely predictions match actual prices (a quick worked example follows this list).
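To make the two metrics concrete, here is a small worked sketch (toy numbers, not the project data) that computes RMSE and R² by hand and checks them against scikit-learn:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([500000.0, 1200000.0, 900000.0, 700000.0])
y_pred = np.array([550000.0, 1100000.0, 950000.0, 650000.0])

# RMSE: square root of the average squared prediction error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("RMSE:", rmse, "| sklearn:", np.sqrt(mean_squared_error(y_true, y_pred)))
print("R²:  ", r2, "| sklearn:", r2_score(y_true, y_pred))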

Check out this beginner-friendly Python project! - Sales Data Analysis Project

Time Required to Complete the Project: You can complete the Indian Automobile Market Analysis & Price Prediction project in about 3 to 4 hours. This includes data cleaning, exploration, model building, evaluation, and making predictions on new car data.

Let’s build this Indian Automobile Market Analysis & Price Prediction project from scratch with clear, step-by-step guidance:

  1. Download the Dataset
  2. Import Required Libraries
  3. Load the Dataset
  4. Clean and Prepare the Data
  5. Explore the Data (EDA)
  6. Build the Preprocessing Pipeline
  7. Train the Random Forest Regressor
  8. Evaluate the Model

Without any further delay, let’s get started!

Wanna dive deeper into Python? Check this out! - Handwritten Digit Recognition with CNN Using Python

Step 1: Download the Dataset

To start the analysis, you first need to download the dataset, which is freely available online. You can also download it from Kaggle by searching for the project name.
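If you prefer to download programmatically, here is an optional sketch using the kagglehub package (pip install kagglehub); the dataset slug below is a hypothetical placeholder, so swap in the slug of the dataset you actually pick on Kaggle:

import kagglehub

# Hypothetical slug; replace with the real "owner/dataset-name" from Kaggle
path = kagglehub.dataset_download("some-user/indian-car-listings")
print("Dataset files downloaded to:", path)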

Step 2: Import Required Libraries

To start the Indian Automobile Market Analysis & Price Prediction project, first import all the necessary Python libraries.

Here’s the list of tools you’ll use:

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

Explore this project - Customer Purchase Behaviour Analysis Project Using Python

Step 3: Load the Dataset

This step loads the dataset that contains information about car listings in India. The script ensures the file is available before proceeding.

If the file is missing, the script will stop to prevent further errors.

print("--- Loading Dataset ---")

try:

    # Load the dataset from the provided CSV file

    df = pd.read_csv('car_dataset_india.csv')

    print("Dataset loaded successfully.")

except FileNotFoundError:

    print("Error: 'car_dataset_india.csv' not found. Please ensure the file is in the correct directory.")

    exit()

Check out this - COVID-19 Project: Data Visualization & Insights

Step 4: Data Cleaning and Preparation

This step involves checking the dataset’s structure, removing missing values, and dropping unnecessary columns.

print("Original shape:", df.shape)

print("Columns:", df.columns.tolist())



# Check for and drop any missing values for simplicity

df.dropna(inplace=True)

print("Shape after dropping missing values:", df.shape)



# Drop the Car_ID as it's just an identifier

df = df.drop('Car_ID', axis=1)



# Basic statistics

print("\nStatistical Summary:")

print(df.describe())

Output:

Original shape: (10000, 11)
Columns: ['Car_ID', 'Brand', 'Model', 'Year', 'Fuel_Type', 'Transmission', 'Price', 'Mileage', 'Engine_CC', 'Seating_Capacity', 'Service_Cost']
Shape after dropping missing values: (10000, 11)

Statistical Summary:
               Year         Price       Mileage    Engine_CC  Seating_Capacity  Service_Cost
count  10000.000000  1.000000e+04  10000.000000  10000.000000      10000.000000  10000.000000
mean    2019.543800  1.946064e+06     19.967300   1542.070000          5.515400  14969.130000
std        2.877553  8.837945e+05      5.778583    557.487394          1.121556   5777.753741
min     2015.000000  4.000000e+05     10.000000    800.000000          4.000000   5000.000000
25%     2017.000000  1.180000e+06     14.900000   1000.000000          5.000000   9900.000000
50%     2020.000000  1.950000e+06     20.000000   1500.000000          6.000000  15000.000000
75%     2022.000000  2.700000e+06     25.000000   2000.000000          7.000000  20000.000000
max     2024.000000  3.500000e+06     30.000000   2500.000000          7.000000  25000.000000

Need to spot fraud in transactions? Check this out - Fraud Detection in Transactions with Python: A Machine Learning Project

Step 5: Exploratory Data Analysis (EDA)

This step involves visualising key factors from the dataset to understand patterns and relationships.

sns.set_theme(style="whitegrid")

# a. Top 10 Car Brands by Count
plt.figure(figsize=(12, 7))
brand_counts = df['Brand'].value_counts().head(10)
sns.barplot(x=brand_counts.index, y=brand_counts.values, palette='viridis')
plt.title('Top 10 Most Common Car Brands in the Dataset')
plt.xlabel('Brand')
plt.ylabel('Number of Cars')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('top_car_brands.png')
print("Generated 'top_car_brands.png'")

# b. Fuel Type Distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='Fuel_Type', data=df, palette='magma')
plt.title('Distribution of Fuel Types')
plt.xlabel('Fuel Type')
plt.ylabel('Count')
plt.savefig('fuel_type_distribution.png')
print("Generated 'fuel_type_distribution.png'")

# c. Average Price by Brand
plt.figure(figsize=(12, 7))
avg_price_brand = df.groupby('Brand')['Price'].mean().sort_values(ascending=False).head(10)
sns.barplot(x=avg_price_brand.index, y=avg_price_brand.values, palette='plasma')
plt.title('Top 10 Brands by Average Price')
plt.xlabel('Brand')
plt.ylabel('Average Price (INR)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('avg_price_by_brand.png')
print("Generated 'avg_price_by_brand.png'")

# d. Mileage vs. Price
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Mileage', y='Price', data=df, hue='Fuel_Type', alpha=0.6)
plt.title('Mileage vs. Price by Fuel Type')
plt.xlabel('Mileage (km/l or km/charge)')
plt.ylabel('Price (INR)')
plt.savefig('mileage_vs_price.png')
print("Generated 'mileage_vs_price.png'")

# e. Engine CC vs. Price
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Engine_CC', y='Price', data=df, hue='Transmission', alpha=0.6)
plt.title('Engine Capacity (CC) vs. Price by Transmission')
plt.xlabel('Engine Capacity (CC)')
plt.ylabel('Price (INR)')
plt.savefig('engine_vs_price.png')
print("Generated 'engine_vs_price.png'")

# f. Correlation Heatmap for numerical features
plt.figure(figsize=(10, 8))
numerical_cols = df.select_dtypes(include=np.number).columns
correlation_matrix = df[numerical_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Numerical Features')
plt.tight_layout()
plt.savefig('correlation_heatmap.png')
print("Generated 'correlation_heatmap.png'")

Output: The script saves six charts as PNG files: top_car_brands.png, fuel_type_distribution.png, avg_price_by_brand.png, mileage_vs_price.png, engine_vs_price.png, and correlation_heatmap.png.

A Beginner-Friendly Project in Python - Complete Airline Passenger Traffic Analysis Project Using Python

Step 6: Feature Selection and Preprocessing Pipeline

Before training the model, we need to prepare the data. In this step, we select relevant features for prediction and apply preprocessing.

# Define features (X) and target (y)
# We will not use 'Model' for simplicity, as it has too many unique values.
features = ['Brand', 'Year', 'Fuel_Type', 'Transmission', 'Mileage', 'Engine_CC', 'Seating_Capacity', 'Service_Cost']
target = 'Price'

X = df[features]
y = df[target]

# Identify categorical and numerical features
categorical_features = ['Brand', 'Fuel_Type', 'Transmission']
numerical_features = ['Year', 'Mileage', 'Engine_CC', 'Seating_Capacity', 'Service_Cost']

# Create a preprocessing pipeline:
# OneHotEncoder handles categorical variables,
# StandardScaler scales numerical features.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

print("Preprocessing pipeline created.")

Output: 

Preprocessing pipeline created.
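As an optional sanity check (a sketch assuming the X defined above and scikit-learn 1.0+ for get_feature_names_out), you can fit the preprocessor on its own and confirm how many columns one-hot encoding produces:

# Fit the preprocessor alone and inspect the transformed matrix.
# The 8 input features expand once each brand/fuel/transmission
# category becomes its own 0/1 column.
X_transformed = preprocessor.fit_transform(X)
print("Transformed shape:", X_transformed.shape)
print("First few feature names:", preprocessor.get_feature_names_out()[:8])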

Also, Check this - Crime Rate Prediction by City Using Python and Machine Learning

Step 7: Model Training with Random Forest Regressor

In this step, we split the data into training and testing sets to evaluate model performance.

We build a pipeline that includes preprocessing and a RandomForestRegressor.

This pipeline ensures the model receives properly transformed data and is trained efficiently.

Here is the code for this step:

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the full model pipeline with a RandomForestRegressor
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])

# Train the model
model_pipeline.fit(X_train, y_train)
print("Random Forest Regressor model trained successfully.")

Model Training Summary

  • Data was split into training (80%) and testing (20%) using a fixed random state to ensure reproducibility.
  • A pipeline was created combining preprocessing (scaling for numerical features and one-hot encoding for categorical ones) with a RandomForestRegressor.
  • The model was successfully trained on the training dataset (a feature-importance sketch follows this summary).
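One advantage of a Random Forest is that it reports feature importances. Here is a short optional sketch (assuming scikit-learn 1.0+ for get_feature_names_out) that pulls them out of the trained pipeline above:

# Map the regressor's importances back to the transformed feature names
feature_names = model_pipeline.named_steps['preprocessor'].get_feature_names_out()
importances = model_pipeline.named_steps['regressor'].feature_importances_

top_features = (pd.Series(importances, index=feature_names)
                  .sort_values(ascending=False)
                  .head(10))
print(top_features)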

Explore this project - Loan Default Risk Analysis Using Machine Learning Techniques

Step 8: Model Evaluation

Now that the model is trained, it's time to evaluate its performance on unseen data.

We'll generate predictions on the test set and calculate key regression metrics like R² and RMSE to measure how well the model estimates car prices.

Here is the code for this step:

# Make predictions on the test set
y_pred = model_pipeline.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"R-squared (R²): {r2:.2f}")
print(f"Root Mean Squared Error (RMSE): ₹{rmse:,.2f}")

Output:

R-squared (R²): -0.06

Root Mean Squared Error (RMSE): ₹908,785.00
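To use the trained pipeline on a new listing, as promised earlier, pass a one-row DataFrame with the same columns as the training features. The car below is a made-up example; its category values should match those seen in the dataset (unseen categories are ignored by the encoder):

# Predict the price of a hypothetical new listing
new_car = pd.DataFrame([{
    'Brand': 'Maruti Suzuki',   # example values, not from the dataset
    'Year': 2022,
    'Fuel_Type': 'Petrol',
    'Transmission': 'Manual',
    'Mileage': 21.5,
    'Engine_CC': 1200,
    'Seating_Capacity': 5,
    'Service_Cost': 8000
}])

predicted_price = model_pipeline.predict(new_car)[0]
print(f"Predicted price: ₹{predicted_price:,.0f}")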


Take a look at this exciting project - Customer Churn Prediction Project

Final Conclusion

We analysed the Indian automobile market by building a machine learning model to predict car prices. The data was cleaned, preprocessed, and used to train a Random Forest Regressor inside a scikit-learn pipeline. The evaluation told a cautionary story: an R² of -0.06 and an RMSE of about ₹9.1 lakh mean the model predicts prices no better than simply guessing the average, suggesting the features in this dataset carry little real pricing signal. The project is still valuable as an end-to-end workflow, from cleaning and EDA through pipeline-based modelling and honest evaluation, and the EDA itself surfaces clear patterns in how prices vary across brands, fuel types, and transmissions.
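If you want to double-check that weak test-set score before drawing conclusions, an optional cross-validation sketch (not part of the original walkthrough) looks like this:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated R² on the full data; scores at or below zero
# across folds would confirm the features carry little price signal
scores = cross_val_score(model_pipeline, X, y, cv=5, scoring='r2')
print("R² per fold:", np.round(scores, 3))
print("Mean R²:", round(scores.mean(), 3))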


Colab Link:
https://colab.research.google.com/drive/1sQ1Utn8LKeJ_T1Sjyu--18MSLcU9Sa1v?usp=sharing

Frequently Asked Questions (FAQs)

1. What was the objective of this project?
To analyse the Indian automobile market and build a machine learning model that predicts used car prices from listing features such as brand, year, fuel type, transmission, mileage, engine capacity, seating capacity, and service cost.

2. Which algorithm was used for price prediction?
A Random Forest Regressor from scikit-learn, trained inside a pipeline that also handles preprocessing.

3. How was the data prepared before training the model?
Rows with missing values were dropped, the Car_ID identifier column was removed, numerical features were scaled with StandardScaler, and categorical features were one-hot encoded via a ColumnTransformer.

4. How was the model evaluated?
On a held-out 20% test set, using the R² score and RMSE. The model achieved an R² of -0.06 and an RMSE of about ₹9.1 lakh.

5. What insights can this project provide?
The exploratory analysis shows which brands dominate listings, how average prices differ by brand, and how price relates to fuel type, transmission, mileage, and engine capacity.

Rohit Sharma

827 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
