Used Car Price Prediction Using ML | Random Forest & EDA
By Rohit Sharma
Updated on Aug 04, 2025 | 10 min read | 1.22K+ views
India has a large and rapidly growing market for used cars. A model that predicts used car prices accurately can help both buyers and sellers make better decisions.
In this project, you'll build a machine learning model that estimates a car’s selling price based on factors like brand, fuel type, kilometres driven, and car age. Using data from CarDekho and a Random Forest Regressor, this project covers data cleaning, feature engineering, visualisation, and model evaluation, all in a single pipeline.
Fast-track your data science career with upGrad's Online Data Science Courses. Learn Python, Machine Learning, AI, SQL, Tableau, and more from industry experts. Enrol today!
Get cracking on your data science skills with our hand-picked Python projects!
Before starting your used car price prediction project, you should be comfortable with basic Python, the pandas library, and core machine learning concepts such as regression and train/test splits.
Also Read: Random Forest Algorithm: When to Use & How to Use? [With Pros & Cons]
Begin your data science career with upGrad's highly-rated courses, offering direct learning opportunities from seasoned industry mentors.
To build your used car price prediction model, you'll use powerful Python libraries focused on data preprocessing, regression modelling, and result visualisation:
| Tool / Library | Purpose |
| --- | --- |
| Python | Core programming language for data manipulation and machine learning |
| Google Colab | Cloud-based environment to write, run, and share code without local setup |
| Pandas | Loads and prepares structured car data (make, model, year, km driven, etc.) |
| NumPy | Performs efficient numerical computations and array operations |
| Matplotlib / Seaborn | Visualises price trends, feature relationships, and model performance |
| Scikit-learn | Trains and evaluates regression models, such as the Random Forest Regressor used here |
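If you're working locally rather than in Google Colab (which ships with all of these preinstalled), a single install command covers the full stack. A minimal setup line, assuming pip is available:

pip install pandas numpy matplotlib seaborn scikit-learn joblib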
Also Read: How to Create a Python Heatmap with Seaborn? [Comprehensive Explanation]
You can complete this used car price prediction project in 3 to 4 hours. It’s ideal for beginners who know basic Python and want to apply machine learning to a real-world pricing problem. You’ll learn how to clean and prepare car data, train a Random Forest Regressor inside a scikit-learn pipeline, and evaluate how well it predicts a used car’s price from features like car age, kilometres driven, fuel type, and brand.
To build an accurate used car price prediction model, you’ll apply proven techniques that turn raw car listings into meaningful price estimates: feature engineering (deriving car age and brand from the raw columns), one-hot encoding of categorical features, and tree-based regression with a Random Forest.
Also Read: Data Visualisation: The What, The Why, and The How!
Let’s build this project from scratch with clear, step-by-step guidance.
Before building any machine learning model, the first step is to load and inspect the dataset.
# main.py
# --- 1. Load the Dataset ---
# Import necessary libraries
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_absolute_error
import joblib # For saving the model
print("--- Step 1: Loading the Dataset ---")
# Load dataset
try:
    df = pd.read_csv('CAR DETAILS FROM CAR DEKHO.csv')
    print("Dataset loaded successfully.")
    print(df.head())
except FileNotFoundError:
    print("Error: 'CAR DETAILS FROM CAR DEKHO.csv' not found.")
    print("Please ensure the file is located in the same directory.")
    exit()
print("\n" + "="*50 + "\n")
Before training any model, it's essential to clean the dataset and extract useful features. This step helps improve model performance.
# --- 2. Data Cleaning and Feature Engineering ---
# Check for missing values
if df.isnull().sum().any():
    print(f"\nMissing values before cleaning:\n{df.isnull().sum()}")
    df.dropna(inplace=True)
else:
    print("\nNo missing values found in the dataset.")
# Feature Engineering: Create 'car_age' from 'year'
# A car's age is more useful than its manufacturing year
current_year = datetime.datetime.now().year
df['car_age'] = current_year - df['year']
# Feature Engineering: Extract 'brand' from 'name'
# The brand gives a strong signal; the full name is too specific
df['brand'] = df['name'].apply(lambda x: x.split(' ')[0])
# Save a copy for EDA before dropping columns
df_for_eda = df.copy()
# Drop the original 'name' and 'year' columns
df.drop(columns=['name', 'year'], inplace=True)
print("\n--- Data After Feature Engineering ---")
print(df.head())
Output:
No missing values found in the dataset.
--- Data After Feature Engineering ---
| selling_price | km_driven | fuel | seller_type | transmission | owner | car_age | brand |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 60000 | 70000 | Petrol | Individual | Manual | First Owner | 18 | Maruti |
| 135000 | 50000 | Petrol | Individual | Manual | First Owner | 18 | Maruti |
| 600000 | 100000 | Diesel | Individual | Manual | First Owner | 13 | Hyundai |
| 250000 | 46000 | Petrol | Individual | Manual | First Owner | 8 | Datsun |
| 450000 | 141000 | Diesel | Individual | Manual | Second Owner | 11 | Honda |
Let's explore the dataset visually to understand key patterns and relationships. These insights will guide feature selection and model decisions.
Here is the code for this step:
# --- 3. Exploratory Data Analysis (Plots) ---
print("\n--- Starting Exploratory Data Analysis (Plots) ---")
sns.set_style("whitegrid")
# Plot 1: Distribution of Selling Price
plt.figure(figsize=(10, 6))
sns.histplot(df_for_eda['selling_price'], kde=True, bins=50)
plt.title('Distribution of Selling Price', fontsize=16)
plt.xlabel('Selling Price (₹)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.show()
# Plot 2: Selling Price vs. Car Age
plt.figure(figsize=(10, 6))
sns.scatterplot(x='car_age', y='selling_price', data=df_for_eda, alpha=0.6)
plt.title('Selling Price vs. Car Age', fontsize=16)
plt.xlabel('Car Age (Years)', fontsize=12)
plt.ylabel('Selling Price (₹)', fontsize=12)
plt.show()
# Plot 3: Selling Price vs. Kilometers Driven
plt.figure(figsize=(10, 6))
sns.scatterplot(x='km_driven', y='selling_price', data=df_for_eda, alpha=0.6)
plt.title('Selling Price vs. Kilometers Driven', fontsize=16)
plt.xlabel('Kilometers Driven', fontsize=12)
plt.ylabel('Selling Price (₹)', fontsize=12)
plt.show()
# Plot 4: Categorical Features vs. Selling Price
categorical_features_for_plot = ['fuel', 'seller_type', 'transmission', 'owner']
for feature in categorical_features_for_plot:
    plt.figure(figsize=(10, 6))
    sns.boxplot(x=feature, y='selling_price', data=df_for_eda)
    plt.title(f'Selling Price vs. {feature.replace("_", " ").title()}', fontsize=16)
    plt.xlabel(feature.replace("_", " ").title(), fontsize=12)
    plt.ylabel('Selling Price (₹)', fontsize=12)
    plt.show()
# Plot 5: Correlation Heatmap for Numerical Features
plt.figure(figsize=(10, 8))
numeric_df = df_for_eda.select_dtypes(include=['number'])
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Numerical Features', fontsize=16)
plt.show()
Output: the script displays the five sets of plots described above (selling price distribution; price vs. car age; price vs. kilometres driven; boxplots of price by fuel, seller type, transmission, and owner; and a correlation heatmap of the numerical features).
Also Read: How Forecasting Works in Tableau? Predicting the Future with Data
Now that the data is clean and explored, let’s define what the model will learn from (features) and what it needs to predict (target). We'll also identify categorical and numerical columns, and split the data for training and testing.
# --- 4. Define Features (X) and Target (y) ---
# The target variable is what we want to predict
y = df['selling_price']
# The features include all columns except the target
X = df.drop(columns=['selling_price'])
# --- 5. Identify Categorical and Numerical Features ---
# Categorical features: text-based or discrete categories
categorical_features = ['fuel', 'seller_type', 'transmission', 'owner', 'brand']
# Numerical features: continuous values
numerical_features = ['km_driven', 'car_age']
# --- 6. Split Data into Training and Testing Sets ---
# We'll use 80% of the data for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"\nTraining set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
Output:
Training set size: 3472 samples
Testing set size: 868 samples
Also Read: 14 Essential Data Visualization Libraries for Python in 2025
To simplify the workflow and avoid data leakage, we’ll create a pipeline that handles both preprocessing and model training in one step. Categorical features will be one-hot encoded, and all other columns will pass through unchanged.
We'll use a Random Forest Regressor for prediction.
# --- 7. Create a Preprocessing and Modeling Pipeline ---
# Step 1: Define transformation for categorical features
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Step 2: Combine transformers for preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough'  # Keep numerical features as they are
)
# Step 3: Build the full pipeline with a Random Forest Regressor
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])
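If you want to see exactly which columns the encoder produces, you can fit the preprocessor on its own and list its output features. This is an optional check, assuming scikit-learn 1.0+ for get_feature_names_out():

# Optional: fit the preprocessor alone to inspect the encoded columns
preprocessor.fit(X_train)
print(preprocessor.get_feature_names_out()[:10])  # first few one-hot columns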
Now we’ll train the model using the training data. After that, we’ll test its performance using R-squared and Mean Absolute Error.
Evaluation plots will help us understand how well the predictions align with actual values.
# --- 8. Train the Model ---
print("\n--- Training the model... ---")
model_pipeline.fit(X_train, y_train)
print("Model training complete.")
# --- 9. Evaluate the Model with Plots ---
print("\n--- Evaluating the model... ---")
y_pred = model_pipeline.predict(X_test)
# Calculate evaluation metrics
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print("Model Performance on Test Data:")
print(f"R-squared (R²): {r2:.4f}")
print(f"Mean Absolute Error (MAE): ₹{mae:,.2f}")
print("\n* R-squared tells how much of the price variation the model explains.")
print("* MAE shows the average difference between actual and predicted prices.")
# --- Adding Evaluation Plots ---
print("\n--- Generating Evaluation Plots ---")
# Plot 1: Actual vs. Predicted Values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.7)
p1 = max(max(y_pred), max(y_test))
p2 = min(min(y_pred), min(y_test))
plt.plot([p1, p2], [p1, p2], 'r--') # Perfect prediction line
plt.title('Actual vs. Predicted Selling Price', fontsize=16)
plt.xlabel('Actual Price (₹)', fontsize=12)
plt.ylabel('Predicted Price (₹)', fontsize=12)
plt.show()
# Plot 2: Residuals Plot
residuals = y_test - y_pred
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_pred, y=residuals)
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residuals vs. Predicted Values', fontsize=16)
plt.xlabel('Predicted Price (₹)', fontsize=12)
plt.ylabel('Residuals (Actual - Predicted)', fontsize=12)
plt.show()
Output:
--- Training the model... ---
Model training complete.
--- Evaluating the model... ---
Model Performance on Test Data:
R-squared (R²): 0.7629
Mean Absolute Error (MAE): ₹114,277.24
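The prediction example in the next step loads the pipeline from disk, so save the trained pipeline first with joblib; the filename below matches the one loaded in that step:

# --- 10. Save the Trained Pipeline ---
joblib.dump(model_pipeline, 'used_car_price_model.joblib')
print("Model pipeline saved to 'used_car_price_model.joblib'.")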
Once the model is trained and saved, you can load it later to make predictions on new, unseen data. Here’s how to predict the price of a car based on its features.
# --- 11. Example of Predicting a New Car ---
print("\n--- Example Prediction ---")
# New car data as a DataFrame
new_car_data = pd.DataFrame({
    'km_driven': [50000],
    'fuel': ['Petrol'],
    'seller_type': ['Individual'],
    'transmission': ['Manual'],
    'owner': ['First Owner'],
    'car_age': [5],
    'brand': ['Maruti']
})
# Load the saved model pipeline
loaded_model = joblib.load('used_car_price_model.joblib')
# Make prediction
predicted_price = loaded_model.predict(new_car_data)
print(f"Predicted price for the new car: ₹{predicted_price[0]:,.2f}")
Output:
--- Example Prediction ---
Predicted price for the new car: ₹321,050.32
The model predicted a selling price of about ₹3.21 lakh for the given features.
This confirms that the pipeline works end-to-end, from preprocessing to prediction, and can serve as a practical starting point for pricing used cars.
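Because the pipeline accepts any DataFrame with the same columns, you can also score several listings at once. A minimal sketch, with invented feature values purely for illustration:

# Hypothetical listings; the feature values are made up for illustration
batch = pd.DataFrame({
    'km_driven': [30000, 120000],
    'fuel': ['Diesel', 'Petrol'],
    'seller_type': ['Dealer', 'Individual'],
    'transmission': ['Manual', 'Automatic'],
    'owner': ['First Owner', 'Second Owner'],
    'car_age': [3, 10],
    'brand': ['Hyundai', 'Honda']
})
for price in loaded_model.predict(batch):
    print(f"Predicted price: ₹{price:,.2f}")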
Also Read: Evaluation Metrics in Machine Learning: Top 10 Metrics You Should Know
This project demonstrated how to build a complete machine learning pipeline to predict used car prices. Starting from data cleaning and feature engineering, we explored the dataset, prepared it for modelling, and trained a Random Forest Regressor inside a single pipeline. The model explained roughly 76% of the price variation on the test set (R² of 0.7629, MAE of about ₹1.14 lakh) and produced a sensible estimate for a new listing.
Colab Link:
https://colab.research.google.com/drive/17Fek7Y9RsRoBVLvMJslVYZFRp8nmQJRm?usp=sharing