Indian Automobile Market Analysis Using Random Forest
By Rohit Sharma
Updated on Aug 07, 2025 | 4 views
The Indian automobile industry is one of the largest and fastest-growing in the world. In this Indian Automobile Market Analysis project, we explore how machine learning can help us understand pricing trends and predict used car prices.
For this project, we will use a dataset of car listings from India, perform data cleaning, preprocessing, and exploratory data analysis, and then train a Random Forest Regressor model to predict car prices.
upGrad's Online Data Science Courses will help you enhance your data science expertise. Master Python, ML, AI, SQL, and Tableau under expert guidance to develop practical skills and prepare for a successful career.
For over 23 data science projects in Python suitable for both beginners and experts, refer to: 23+ Data Science Projects in Python for Freshers and Experts to Succeed in 2025
Before you begin working on the Indian Automobile Market Analysis, make sure you're comfortable with the following tools and concepts:
Advance your data science career with upGrad's premier courses and industry mentors.
To analyse car pricing trends and predict used car prices in India, you’ll use Python libraries built for data handling, preprocessing, visualisation, and regression modelling.
| Tool / Library | Purpose |
| --- | --- |
| Python | Core programming language for the entire workflow |
| Pandas | Loads, cleans, and manipulates the car dataset |
| NumPy | Supports numerical operations and data formatting |
| Matplotlib / Seaborn | Builds visualisations like bar charts, scatter plots, and correlation heatmaps |
| Scikit-learn | Provides tools for preprocessing, regression modelling, and evaluation |
| RandomForestRegressor | Predicts car prices based on input features like fuel type, mileage, etc. |
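If RandomForestRegressor is new to you, here is a minimal, self-contained sketch of how it fits and predicts. The data below is made up purely for illustration (it is not the car dataset used later in the project).
# Toy illustration of RandomForestRegressor: price rises roughly with engine size
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X_toy = rng.uniform(800, 2500, size=(200, 1))              # e.g. engine capacity in CC
y_toy = 500 * X_toy.ravel() + rng.normal(0, 50000, 200)    # noisy "price" signal

toy_model = RandomForestRegressor(n_estimators=50, random_state=42)
toy_model.fit(X_toy, y_toy)
print(toy_model.predict([[1500]]))  # predicted "price" for a 1500 CC engine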
Are you new to Python? This course can help you enhance your skills for free - Learn Basic Python Programming
To get the most out of your Indian Automobile Market Analysis & Price Prediction project, you'll apply these core data science techniques: data cleaning and preprocessing, exploratory data analysis, one-hot encoding and feature scaling, regression modelling with a Random Forest, and model evaluation with R² and RMSE.
Check out this beginner-friendly Python project! - Sales Data Analysis Project
Time Required to Complete the Project: You can complete the Indian Automobile Market Analysis & Price Prediction project in about 3 to 4 hours. This includes data cleaning, exploration, model building, evaluation, and making predictions on new car data.
Let’s build this Indian Automobile Market Analysis & Price Prediction project from scratch with clear, step-by-step guidance:
Without any further delay, let’s get started!
Want to dive deeper into Python? Check this out! - Handwritten Digit Recognition with CNN Using Python
To start the analysis, you first need to download the dataset, which is freely available online. You can also download it from Kaggle by searching for an Indian car price dataset.
Next, to begin the Indian Automobile Market Analysis & Price Prediction project, import all the necessary Python libraries.
Here’s the list of tools you’ll use:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
Explore this project - Customer Purchase Behaviour Analysis Project Using Python
This step loads the dataset that contains information about car listings in India. The script ensures the file is available before proceeding.
If the file is missing, the script will stop to prevent further errors.
print("--- Loading Dataset ---")
try:
# Load the dataset from the provided CSV file
df = pd.read_csv('car_dataset_india.csv')
print("Dataset loaded successfully.")
except FileNotFoundError:
print("Error: 'car_dataset_india.csv' not found. Please ensure the file is in the correct directory.")
exit()
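Optionally, you can preview the first few rows and the column types right after loading to confirm everything imported correctly. A small sketch using standard pandas calls:
# Quick preview of the loaded data
print(df.head())   # first five listings
df.info()          # column names, dtypes, and non-null counts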
Check out this - COVID-19 Project: Data Visualization & Insights
This step involves checking the dataset’s structure, removing missing values, and dropping unnecessary columns.
print("Original shape:", df.shape)
print("Columns:", df.columns.tolist())
# Check for and drop any missing values for simplicity
df.dropna(inplace=True)
print("Shape after dropping missing values:", df.shape)
# Drop the Car_ID as it's just an identifier
df = df.drop('Car_ID', axis=1)
# Basic statistics
print("\nStatistical Summary:")
print(df.describe())
Output:
Original shape: (10000, 11)
Columns: ['Car_ID', 'Brand', 'Model', 'Year', 'Fuel_Type', 'Transmission', 'Price', 'Mileage', 'Engine_CC', 'Seating_Capacity', 'Service_Cost']
Shape after dropping missing values: (10000, 11)
Statistical Summary:
| Statistic | Year | Price | Mileage | Engine_CC | Seating_Capacity | Service_Cost |
| --- | --- | --- | --- | --- | --- | --- |
| count | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
| mean | 2019.54 | 1,946,064 | 19.97 | 1,542.07 | 5.52 | 14,969.13 |
| std | 2.88 | 883,794.50 | 5.78 | 557.49 | 1.12 | 5,777.75 |
| min | 2015 | 400,000 | 10.00 | 800 | 4 | 5,000 |
| 25% | 2017 | 1,180,000 | 14.90 | 1,000 | 5 | 9,900 |
| 50% | 2020 | 1,950,000 | 20.00 | 1,500 | 6 | 15,000 |
| 75% | 2022 | 2,700,000 | 25.00 | 2,000 | 7 | 20,000 |
| max | 2024 | 3,500,000 | 30.00 | 2,500 | 7 | 25,000 |
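Beyond dropping missing values, it is often worth running a couple of extra hygiene checks before modelling. This optional sketch looks for duplicate listings and confirms the column data types:
# Optional hygiene checks before modelling
print("Duplicate rows:", df.duplicated().sum())
print(df.dtypes)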
Need to spot fraud in transactions? Check this out - Fraud Detection in Transactions with Python: A Machine Learning Project
This step involves visualising key factors from the dataset to understand patterns and relationships.
sns.set_theme(style="whitegrid")
# a. Top 10 Car Brands by Count
plt.figure(figsize=(12, 7))
brand_counts = df['Brand'].value_counts().head(10)
sns.barplot(x=brand_counts.index, y=brand_counts.values, palette='viridis')
plt.title('Top 10 Most Common Car Brands in the Dataset')
plt.xlabel('Brand')
plt.ylabel('Number of Cars')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('top_car_brands.png')
print("Generated 'top_car_brands.png'")
# b. Fuel Type Distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='Fuel_Type', data=df, palette='magma')
plt.title('Distribution of Fuel Types')
plt.xlabel('Fuel Type')
plt.ylabel('Count')
plt.savefig('fuel_type_distribution.png')
print("Generated 'fuel_type_distribution.png'")
# c. Average Price by Brand
plt.figure(figsize=(12, 7))
avg_price_brand = df.groupby('Brand')['Price'].mean().sort_values(ascending=False).head(10)
sns.barplot(x=avg_price_brand.index, y=avg_price_brand.values, palette='plasma')
plt.title('Top 10 Brands by Average Price')
plt.xlabel('Brand')
plt.ylabel('Average Price (INR)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('avg_price_by_brand.png')
print("Generated 'avg_price_by_brand.png'")
# d. Mileage vs. Price
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Mileage', y='Price', data=df, hue='Fuel_Type', alpha=0.6)
plt.title('Mileage vs. Price by Fuel Type')
plt.xlabel('Mileage (km/l or km/charge)')
plt.ylabel('Price (INR)')
plt.savefig('mileage_vs_price.png')
print("Generated 'mileage_vs_price.png'")
# e. Engine CC vs. Price
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Engine_CC', y='Price', data=df, hue='Transmission', alpha=0.6)
plt.title('Engine Capacity (CC) vs. Price by Transmission')
plt.xlabel('Engine Capacity (CC)')
plt.ylabel('Price (INR)')
plt.savefig('engine_vs_price.png')
print("Generated 'engine_vs_price.png'")
# f. Correlation Heatmap for numerical features
plt.figure(figsize=(10, 8))
numerical_cols = df.select_dtypes(include=np.number).columns
correlation_matrix = df[numerical_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Numerical Features')
plt.tight_layout()
plt.savefig('correlation_heatmap.png')
print("Generated 'correlation_heatmap.png'")
Output:
The script saves six charts: top_car_brands.png, fuel_type_distribution.png, avg_price_by_brand.png, mileage_vs_price.png, engine_vs_price.png, and correlation_heatmap.png.
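If you want one more view of pricing patterns, an optional boxplot of price by transmission type (an extra chart not in the original workflow, reusing the same df and plotting style as above) can show whether automatic cars command a premium:
# Optional: price distribution by transmission type
plt.figure(figsize=(8, 6))
sns.boxplot(x='Transmission', y='Price', data=df, palette='Set2')
plt.title('Price Distribution by Transmission')
plt.xlabel('Transmission')
plt.ylabel('Price (INR)')
plt.tight_layout()
plt.savefig('price_by_transmission.png')
print("Generated 'price_by_transmission.png'")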
A Beginner-Friendly Project in Python - Complete Airline Passenger Traffic Analysis Project Using Python
Before training the model, we need to prepare the data. In this step, we select relevant features for prediction and apply preprocessing.
# Define features (X) and target (y)
# We will not use 'Model' for simplicity, as it has too many unique values.
features = ['Brand', 'Year', 'Fuel_Type', 'Transmission', 'Mileage', 'Engine_CC', 'Seating_Capacity', 'Service_Cost']
target = 'Price'
X = df[features]
y = df[target]
# Identify categorical and numerical features
categorical_features = ['Brand', 'Fuel_Type', 'Transmission']
numerical_features = ['Year', 'Mileage', 'Engine_CC', 'Seating_Capacity', 'Service_Cost']
# Create a preprocessing pipeline
# OneHotEncoder handles categorical variables.
# StandardScaler scales numerical features.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])
print("Preprocessing pipeline created.")
Output:
Preprocessing pipeline created.
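To verify the earlier claim that 'Model' has too many unique values to one-hot encode comfortably, you can check the cardinality of each categorical column with a quick pandas call:
# How many unique values does each categorical column have?
print(df[['Brand', 'Model', 'Fuel_Type', 'Transmission']].nunique())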
Also, Check this - Crime Rate Prediction by City Using Python and Machine Learning
In this step, we split the data into training and testing sets to evaluate model performance.
We build a pipeline that includes preprocessing and a RandomForestRegressor.
This pipeline ensures the model receives properly transformed data and is trained efficiently.
Here is the code for this step:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create the full model pipeline with a RandomForestRegressor
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])
# Train the model
model_pipeline.fit(X_train, y_train)
print("Random Forest Regressor model trained successfully.")
Output:
Random Forest Regressor model trained successfully.
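Since Random Forests expose feature importances, you can optionally inspect which inputs drive the predictions. This sketch assumes scikit-learn 1.0+ (needed for get_feature_names_out on the fitted ColumnTransformer):
# Which features matter most to the trained Random Forest?
feature_names = model_pipeline.named_steps['preprocessor'].get_feature_names_out()
importances = model_pipeline.named_steps['regressor'].feature_importances_

importance_df = pd.DataFrame({'feature': feature_names, 'importance': importances})
print(importance_df.sort_values('importance', ascending=False).head(10))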
Explore this project - Loan Default Risk Analysis Using Machine Learning Techniques
Now that the model is trained, it's time to evaluate its performance on unseen data.
We'll generate predictions on the test set and calculate key regression metrics like R² and RMSE to measure how well the model estimates car prices.
Here is the code for this step:
# Make predictions on the test set
y_pred = model_pipeline.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"R-squared (R²): {r2:.2f}")
print(f"Root Mean Squared Error (RMSE): ₹{rmse:,.2f}")
Output:
R-squared (R²): -0.06
Root Mean Squared Error (RMSE): ₹908,785.00
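An R² below zero means the model performs worse than simply predicting the average price, so it is worth comparing against a naive baseline. Here is a quick optional check using scikit-learn's DummyRegressor, reusing the train/test split from above:
# Sanity check: a baseline that always predicts the mean training price
from sklearn.dummy import DummyRegressor

baseline = DummyRegressor(strategy='mean')
baseline.fit(X_train, y_train)
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline.predict(X_test)))
print(f"Baseline RMSE (mean predictor): ₹{baseline_rmse:,.2f}")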
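Finally, to use the trained pipeline on new car data, pass a one-row DataFrame with the same feature columns. The values below are hypothetical and purely for illustration; replace them with a real listing's details (unseen brand or fuel categories are handled safely because the encoder uses handle_unknown='ignore'):
# Predict the price of a hypothetical new listing (values are made up for illustration)
new_car = pd.DataFrame([{
    'Brand': 'Maruti',          # hypothetical values; use a real listing's specs
    'Year': 2022,
    'Fuel_Type': 'Petrol',
    'Transmission': 'Manual',
    'Mileage': 21.0,
    'Engine_CC': 1200,
    'Seating_Capacity': 5,
    'Service_Cost': 9000
}])
predicted_price = model_pipeline.predict(new_car)[0]
print(f"Predicted price: ₹{predicted_price:,.0f}")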
Take a look at this exciting project - Customer Churn Prediction Project
We analysed the Indian automobile market by building a machine learning model to predict car prices. The data was cleaned, preprocessed, and used to train a Random Forest Regressor. On this dataset the evaluation scores were weak (R² of about -0.06), which suggests the available features carry little pricing signal here; even so, the workflow demonstrates an end-to-end approach you can reuse for understanding pricing patterns and supporting decision-making with richer data.
Colab Link:
https://colab.research.google.com/drive/1sQ1Utn8LKeJ_T1Sjyu--18MSLcU9Sa1v?usp=sharing