Used Car Price Prediction Using ML | Random Forest & EDA

By Rohit Sharma

Updated on Aug 04, 2025 | 10 min read | 1.22K+ views


India has a large and rapidly growing market for used cars. A model that predicts used car prices accurately can help both buyers and sellers make better decisions.

In this project, you'll build a machine learning model that estimates a car’s selling price based on factors like brand, fuel type, kilometres driven, and car age. Using data from CarDekho and a Random Forest Regressor, this project covers data cleaning, feature engineering, visualisation, and model evaluation, all in a single pipeline.


What Should You Know Before Building a Used Car Price Prediction Project?

Before starting your used car price prediction project, you should be comfortable with the following tools and concepts:

  • Python programming: used throughout the project for loading data, preprocessing, visualisation, and training the prediction model.
  • Pandas and NumPy: help you explore the dataset, handle missing values, convert categorical data, and prepare features for machine learning.
  • Matplotlib or Seaborn: used to visualise patterns such as car age vs. price, fuel type trends, and the impact of kilometres driven.
  • Regression algorithms: familiarity with models like Linear Regression and Random Forest Regressor is essential, as they predict car prices from input features.
  • Model evaluation metrics: know how R-squared, MAE, RMSE, and MAPE measure prediction accuracy. This project reports R-squared and MAE.
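
MAE, RMSE, and MAPE from the list above are simple enough to compute by hand, which makes them easier to interpret later. The sketch below uses NumPy and a handful of made-up prices (illustrative values, not rows from the CarDekho dataset).

```python
import numpy as np

# Made-up actual and predicted prices in rupees (illustrative only)
actual = np.array([100000.0, 200000.0, 400000.0, 500000.0])
predicted = np.array([110000.0, 180000.0, 400000.0, 550000.0])

errors = actual - predicted

# MAE: average absolute error, in the same unit as the price
mae = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error; penalises large misses more
rmse = np.sqrt(np.mean(errors ** 2))

# MAPE: average absolute error as a percentage of the actual price
mape = np.mean(np.abs(errors / actual)) * 100

print(f"MAE:  {mae:,.2f}")
print(f"RMSE: {rmse:,.2f}")
print(f"MAPE: {mape:.2f}%")
```

For these values the MAE is 20,000 and the MAPE is 7.5%. The RMSE comes out higher than the MAE because the single 50,000 miss is squared before averaging.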


Our Tech Stack for Used Car Price Prediction

To build your used car price prediction model, you'll use powerful Python libraries focused on data preprocessing, regression modelling, and result visualisation:

Tool / Library | Purpose
Python | Core programming language for data manipulation and machine learning
Google Colab | Cloud-based environment to write, run, and share code without local setup
Pandas | Loads and prepares structured car data (make, model, year, km driven, etc.)
NumPy | Performs efficient numerical computations and array operations
Matplotlib / Seaborn | Visualises price trends, feature relationships, and model performance
Scikit-learn | Trains and evaluates machine learning models like Linear Regression and Random Forest


Duration and Learning Outcomes

You can complete this used car price prediction project in 3 to 4 hours. It’s ideal for beginners who know basic Python and want to apply machine learning to a real-world pricing problem. You’ll learn how to clean and prepare car data, train a Random Forest Regressor, and evaluate how well the model predicts the price of a used car from features like car age, kilometres driven, fuel type, and transmission.

Smart Forecasting: Techniques That Drive Used Car Price Prediction

To build an accurate used car price prediction model, you’ll apply proven techniques that turn raw car listings into meaningful price estimates:

  • Feature Engineering: Derive informative variables such as car age and brand, and combine them with kilometres driven, fuel type, and transmission to improve prediction accuracy.
  • Regression Modeling: Use algorithms like Linear Regression and Random Forest Regressor to learn patterns between car attributes and their market prices.
  • Data Visualization: Plot actual vs. predicted prices and residuals to evaluate model performance and spot underfitting or overfitting visually.
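
As a rough illustration of the regression modelling point above, the sketch below fits both a Linear Regression and a Random Forest Regressor and compares their R-squared on held-out data. The data is synthetic and the price formula is an assumption made up for this example; it is not derived from the CarDekho dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): price decays
# non-linearly with age and falls with kilometres driven
rng = np.random.default_rng(42)
car_age = rng.uniform(0, 20, size=600)
km_driven = rng.uniform(0, 200_000, size=600)
price = (900_000 * np.exp(-0.15 * car_age)
         - 0.5 * km_driven
         + rng.normal(0, 10_000, size=600))

X = np.column_stack([car_age, km_driven])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model and record its R-squared on the held-out rows
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))
    print(f"{name}: R-squared = {scores[name]:.3f}")
```

A linear model can only draw a straight line through a curved depreciation pattern, whereas a tree ensemble can follow the curve. On real listings the gap may be smaller or larger, so it is worth running both as a sanity check.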


How to Build a Used Car Price Prediction Model

Let’s build this project from scratch with clear, step-by-step guidance:

  1. Load the Dataset
  2. Clean the Data and Engineer Features
  3. Explore the Data with Plots
  4. Define Features and Prepare Training Data
  5. Build a Preprocessing and Modeling Pipeline
  6. Train and Evaluate the Model
  7. Predict the Price of a Car

Let’s jump in and get started.

Step 1: Load the Dataset

Before building any machine learning model, the first step is to load and inspect the dataset.

# main.py

# --- 1. Load the Dataset ---

# Import necessary libraries
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_absolute_error
import joblib  # For saving the model

print("--- Step 1: Loading the Dataset ---")

# Load dataset
try:
    df = pd.read_csv('CAR DETAILS FROM CAR DEKHO.csv')
    print("Dataset loaded successfully.")
    print(df.head())
except FileNotFoundError:
    print("Error: 'CAR DETAILS FROM CAR DEKHO.csv' not found.")
    print("Please ensure the file is located in the same directory.")
    raise SystemExit(1)  # Stop here: the remaining steps need the dataset

print("\n" + "="*50 + "\n")

Step 2: Data Cleaning and Feature Engineering

Before training any model, it's essential to clean the dataset and extract useful features. This step helps improve model performance.

# --- 2. Data Cleaning and Feature Engineering ---


# Check for missing values
if df.isnull().sum().any():
    print(f"\nMissing values before cleaning:\n{df.isnull().sum()}")
    df.dropna(inplace=True)
else:
    print("\nNo missing values found in the dataset.")


# Feature Engineering: Create 'car_age' from 'year'
# A car's age is more useful than its manufacturing year
current_year = datetime.datetime.now().year
df['car_age'] = current_year - df['year']


# Feature Engineering: Extract 'brand' from 'name'
# The brand gives a strong signal; the full name is too specific
df['brand'] = df['name'].apply(lambda x: x.split(' ')[0])


# Save a copy for EDA before dropping columns
df_for_eda = df.copy()


# Drop the original 'name' and 'year' columns
df.drop(columns=['name', 'year'], inplace=True)


print("\n--- Data After Feature Engineering ---")
print(df.head())

Output: 

No missing values found in the dataset.

--- Data After Feature Engineering ---

selling_price  km_driven  fuel    seller_type  transmission  owner         car_age  brand
60000          70000      Petrol  Individual   Manual        First Owner   18       Maruti
135000         50000      Petrol  Individual   Manual        First Owner   18       Maruti
600000         100000     Diesel  Individual   Manual        First Owner   13       Hyundai
250000         46000      Petrol  Individual   Manual        First Owner   8        Datsun
450000         141000     Diesel  Individual   Manual        Second Owner  11       Honda
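
Because car_age is computed from the current year, the same dataset yields different ages depending on when the script runs. The sketch below shows one possible way to make the feature reproducible by pinning a reference year. `REFERENCE_YEAR` and the toy rows are assumptions introduced for this example; 2025 is chosen because it is consistent with the ages in the output above.

```python
import pandas as pd

# Toy rows that mimic the dataset's 'name' and 'year' columns (illustrative)
toy = pd.DataFrame({
    "name": ["Maruti 800 AC", "Hyundai Verna 1.6 SX", "Honda Amaze VX i-DTEC"],
    "year": [2007, 2012, 2014],
})

# Assumption: pin the reference year so car_age does not change between runs
REFERENCE_YEAR = 2025

toy["car_age"] = REFERENCE_YEAR - toy["year"]

# Vectorised equivalent of splitting each name and keeping the first word
toy["brand"] = toy["name"].str.split().str[0]

print(toy[["brand", "car_age"]])
```

Note that keeping only the first word shortens multi-word brand names, so a name starting with "Land Rover" would be recorded as "Land".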

 

Step 3: Exploratory Data Analysis (EDA) with Plots

Let's explore the dataset visually to understand key patterns and relationships. These insights will guide feature selection and model decisions.

Here is the code for this step:

# --- 3. Exploratory Data Analysis (Plots) ---

print("\n--- Starting Exploratory Data Analysis (Plots) ---")
sns.set_style("whitegrid")

# Plot 1: Distribution of Selling Price
plt.figure(figsize=(10, 6))
sns.histplot(df_for_eda['selling_price'], kde=True, bins=50)
plt.title('Distribution of Selling Price', fontsize=16)
plt.xlabel('Selling Price (₹)', fontsize=12)  # Values are in rupees, not lakhs
plt.ylabel('Frequency', fontsize=12)
plt.show()

# Plot 2: Selling Price vs. Car Age
plt.figure(figsize=(10, 6))
sns.scatterplot(x='car_age', y='selling_price', data=df_for_eda, alpha=0.6)
plt.title('Selling Price vs. Car Age', fontsize=16)
plt.xlabel('Car Age (Years)', fontsize=12)
plt.ylabel('Selling Price (₹)', fontsize=12)  # Values are in rupees, not lakhs
plt.show()

# Plot 3: Selling Price vs. Kilometers Driven
plt.figure(figsize=(10, 6))
sns.scatterplot(x='km_driven', y='selling_price', data=df_for_eda, alpha=0.6)
plt.title('Selling Price vs. Kilometers Driven', fontsize=16)
plt.xlabel('Kilometers Driven', fontsize=12)
plt.ylabel('Selling Price (₹)', fontsize=12)  # Values are in rupees, not lakhs
plt.show()

# Plot 4: Categorical Features vs. Selling Price
categorical_features_for_plot = ['fuel', 'seller_type', 'transmission', 'owner']
for feature in categorical_features_for_plot:
    plt.figure(figsize=(10, 6))
    sns.boxplot(x=feature, y='selling_price', data=df_for_eda)
    plt.title(f'Selling Price vs. {feature.replace("_", " ").title()}', fontsize=16)
    plt.xlabel(feature.replace("_", " ").title(), fontsize=12)
    plt.ylabel('Selling Price (₹)', fontsize=12)  # Values are in rupees, not lakhs
    plt.show()

# Plot 5: Correlation Heatmap for Numerical Features
plt.figure(figsize=(10, 8))
numeric_df = df_for_eda.select_dtypes(include=['number'])
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Numerical Features', fontsize=16)
plt.show()
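
The plots can be backed up with a few numbers. The sketch below computes the per-category median that a box plot draws as its centre line, plus the skewness of the price distribution. The toy values are made up for illustration; in the project you would run the same calls on `df_for_eda`.

```python
import pandas as pd

# Toy listings (illustrative values, not real CarDekho rows)
toy = pd.DataFrame({
    "fuel": ["Petrol", "Petrol", "Diesel", "Diesel", "Diesel"],
    "selling_price": [100000, 300000, 400000, 600000, 2000000],
})

# Median and count per category: the numbers a box plot summarises visually
summary = toy.groupby("fuel")["selling_price"].agg(["median", "count"])
print(summary)

# Positive skew indicates a long right tail (a few very expensive cars)
skewness = toy["selling_price"].skew()
print(f"Skewness: {skewness:.2f}")
```

A strongly positive skew is one reason some practitioners model the logarithm of the price instead of the raw value. The project here keeps the raw price.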


Step 4: Define Features and Prepare Training Data

Now that the data is clean and explored, let’s define what the model will learn from (features) and what it needs to predict (target). We'll also identify categorical and numerical columns, and split the data for training and testing.

# --- 4. Define Features (X) and Target (y) ---

# The target variable is what we want to predict
y = df['selling_price']

# The features include all columns except the target
X = df.drop(columns=['selling_price'])


# --- 5. Identify Categorical and Numerical Features ---

# Categorical features: text-based or discrete categories
categorical_features = ['fuel', 'seller_type', 'transmission', 'owner', 'brand']

# Numerical features: continuous values
numerical_features = ['km_driven', 'car_age']


# --- 6. Split Data into Training and Testing Sets ---

# We'll use 80% of the data for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"\nTraining set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")

Output:

Training set size: 3472 samples
Testing set size: 868 samples
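
The sizes above imply 4,340 rows in total (3,472 + 868). The minimal sketch below reproduces that arithmetic on a stand-in array and shows why `random_state` matters: the same seed yields the same split on every run. The array of row numbers is a placeholder, not the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 4,340 rows implied by the output above (3,472 + 868)
rows = np.arange(4340)

train_a, test_a = train_test_split(rows, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(rows, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))

# Same random_state gives the same split, so results are reproducible
same_split = np.array_equal(test_a, test_b)
print(f"Identical test sets across runs: {same_split}")

# No row appears in both sets
overlap = set(train_a) & set(test_a)
print(f"Rows shared between train and test: {len(overlap)}")
```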


Step 5: Build a Preprocessing and Modeling Pipeline

To simplify the workflow and avoid data leakage, we’ll create a pipeline that handles both preprocessing and model training in one step. Categorical features will be one-hot encoded, and all other columns will pass through unchanged. 

We'll use a Random Forest Regressor for prediction.

# --- 7. Create a Preprocessing and Modeling Pipeline ---

# Step 1: Define transformation for categorical features
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Step 2: Combine transformers for preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough'  # Keep numerical features as they are
)

# Step 3: Build the full pipeline with a Random Forest Regressor
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])
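
The `handle_unknown='ignore'` setting decides what happens when a prediction request contains a category the encoder never saw during training. The small sketch below, using made-up brand values, shows that behaviour in isolation.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training brands (illustrative)
train = pd.DataFrame({"brand": ["Maruti", "Hyundai", "Honda"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Categories are stored in sorted order: Honda, Hyundai, Maruti
print(encoder.categories_[0])

# A known brand gets a single 1; an unseen brand becomes an all-zero row
new = pd.DataFrame({"brand": ["Maruti", "Tesla"]})
encoded = encoder.transform(new).toarray()
print(encoded)
```

The unseen brand does not raise an error, but the model receives no brand signal for that row, so its estimate rests on the remaining features only.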

This step only defines the pipeline, so it prints nothing. The model is trained in the next step.

Step 6: Train and Evaluate the Model

Now we’ll train the model using the training data. After that, we’ll test its performance using R-squared and Mean Absolute Error. 

Evaluation plots will help us understand how well the predictions align with actual values.

# --- 8. Train the Model ---

print("\n--- Training the model... ---")
model_pipeline.fit(X_train, y_train)
print("Model training complete.")


# --- 9. Evaluate the Model with Plots ---

print("\n--- Evaluating the model... ---")
y_pred = model_pipeline.predict(X_test)

# Calculate evaluation metrics
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print("Model Performance on Test Data:")
print(f"R-squared (R²): {r2:.4f}")
print(f"Mean Absolute Error (MAE): ₹{mae:,.2f}")

print("\n* R-squared tells how much of the price variation the model explains.")
print("* MAE shows the average difference between actual and predicted prices.")


# --- Adding Evaluation Plots ---

print("\n--- Generating Evaluation Plots ---")

# Plot 1: Actual vs. Predicted Values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.7)
p1 = max(max(y_pred), max(y_test))
p2 = min(min(y_pred), min(y_test))
plt.plot([p1, p2], [p1, p2], 'r--')  # Perfect prediction line
plt.title('Actual vs. Predicted Selling Price', fontsize=16)
plt.xlabel('Actual Price (₹)', fontsize=12)  # Values are in rupees, not lakhs
plt.ylabel('Predicted Price (₹)', fontsize=12)
plt.show()

# Plot 2: Residuals Plot
residuals = y_test - y_pred
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_pred, y=residuals)
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residuals vs. Predicted Values', fontsize=16)
plt.xlabel('Predicted Price (₹)', fontsize=12)  # Values are in rupees, not lakhs
plt.ylabel('Residuals (Actual - Predicted)', fontsize=12)
plt.show()

Output: 

--- Training the model... ---

Model training complete.

--- Evaluating the model... ---

Model Performance on Test Data:

R-squared (R²): 0.7629
Mean Absolute Error (MAE): ₹114,277.24
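
These scores come from a single 80/20 split, so they can shift with a different random split. One common way to check stability is k-fold cross-validation. The sketch below runs it on synthetic stand-in data (an assumption for illustration); in the project you could pass `model_pipeline`, `X`, and `y` to `cross_val_score` instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; not the CarDekho dataset)
rng = np.random.default_rng(0)
car_age = rng.uniform(0, 20, size=300)
km_driven = rng.uniform(0, 200_000, size=300)
price = (800_000 - 30_000 * car_age - 1.5 * km_driven
         + rng.normal(0, 20_000, size=300))
X_demo = np.column_stack([car_age, km_driven])

model = RandomForestRegressor(n_estimators=50, random_state=42)

# 5-fold cross-validation: each fold serves once as the held-out set
r2_scores = cross_val_score(model, X_demo, price, cv=5, scoring="r2")

# scikit-learn reports MAE as a negative score, so flip the sign
mae_scores = -cross_val_score(
    model, X_demo, price, cv=5, scoring="neg_mean_absolute_error"
)

print(f"R-squared per fold: {np.round(r2_scores, 3)}")
print(f"Mean MAE across folds: {mae_scores.mean():,.0f}")
```

If the per-fold scores vary widely, the single-split figures should be read with more caution.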

Step 7: Predict the Price of a Car

Once the model is trained and saved, you can load it later to make predictions on new unseen data. Here's how to predict the price of a car based on its features.

# --- 11. Example of Predicting a New Car ---

print("\n--- Example Prediction ---")

# New car data as a DataFrame
new_car_data = pd.DataFrame({
    'km_driven': [50000],
    'fuel': ['Petrol'],
    'seller_type': ['Individual'],
    'transmission': ['Manual'],
    'owner': ['First Owner'],
    'car_age': [5],
    'brand': ['Maruti']
})

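# Save the trained pipeline so it can be reused without retraining
joblib.dump(model_pipeline, 'used_car_price_model.joblib')

# Load the saved model pipeline
loaded_model = joblib.load('used_car_price_model.joblib')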

# Make prediction
predicted_price = loaded_model.predict(new_car_data)
print(f"Predicted price for the new car: ₹{predicted_price[0]:,.2f}")

Output:

--- Example Prediction ---

Predicted price for the new car: ₹321,050.32

The model successfully predicted the selling price of the used car as ₹3.21 Lakhs based on the provided features.
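
The conversion from rupees to lakhs is a division by 100,000. A tiny helper such as the one below (the name `format_lakhs` is hypothetical, introduced for this example) keeps that formatting consistent wherever prices are reported.

```python
def format_lakhs(rupees: float) -> str:
    """Format a rupee amount in lakhs (1 lakh = 100,000 rupees)."""
    return f"₹{rupees / 100_000:.2f} Lakhs"


# The predicted price printed above
print(format_lakhs(321050.32))
```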

This confirms that the pipeline runs end to end, from preprocessing to prediction. Treat the estimate as approximate: with a test-set MAE of about ₹1.14 lakh, individual predictions can be off by a meaningful margin.


Final Conclusion

This project demonstrated how to build a complete machine learning pipeline to predict used car prices. Starting from data cleaning and feature engineering, we explored the dataset, prepared it for modeling, and trained a Random Forest Regressor using a pipeline. On the test data the model reached an R-squared of 0.7629 with a mean absolute error of ₹114,277.24, and it produced a plausible price estimate for a new, unseen car.


Colab Link:
https://colab.research.google.com/drive/17Fek7Y9RsRoBVLvMJslVYZFRp8nmQJRm?usp=sharing

