Used Car Price Prediction Using ML | Random Forest & EDA
By Rohit Sharma
Updated on Aug 04, 2025 | 10 min read | 1.22K+ views
India has a large and rapidly growing market for used cars. A model that predicts used car prices accurately can help both buyers and sellers make better decisions.
In this project, you'll build a machine learning model that estimates a car’s selling price based on factors like brand, fuel type, kilometres driven, and car age. Using data from CarDekho and a Random Forest Regressor, this project covers data cleaning, feature engineering, visualisation, and model evaluation, all in a single pipeline.
Fast-track your data science career with upGrad's Online Data Science Courses. Learn Python, Machine Learning, AI, SQL, Tableau, and more from industry experts. Enrol today!
Get cracking on your data science skills with our hand-picked Python projects!
Before starting your used car price prediction project, you should be comfortable with basic Python, the pandas library, and core machine learning concepts such as regression and train/test splits.
Also Read: Random Forest Algorithm: When to Use & How to Use? [With Pros & Cons]
Begin your data science career with upGrad's highly-rated courses, offering direct learning opportunities from seasoned industry mentors.
To build your used car price prediction model, you'll use powerful Python libraries focused on data preprocessing, regression modelling, and result visualisation:
| Tool / Library | Purpose |
| --- | --- |
| Python | Core programming language for data manipulation and machine learning |
| Google Colab | Cloud-based environment to write, run, and share code without local setup |
| Pandas | Loads and prepares structured car data (make, model, year, km driven, etc.) |
| NumPy | Performs efficient numerical computations and array operations |
| Matplotlib / Seaborn | Visualises price trends, feature relationships, and model performance |
| Scikit-learn | Trains and evaluates regression models, such as the Random Forest Regressor used here |
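If you're working locally rather than in Google Colab (which ships with all of these preinstalled), a single install command covers the full stack. A minimal setup line, assuming pip is available:

pip install pandas numpy matplotlib seaborn scikit-learn joblib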
Also Read: How to Create a Python Heatmap with Seaborn? [Comprehensive Explanation]
You can complete this used car price prediction project in 3 to 4 hours. It’s ideal for beginners who know basic Python and want to apply machine learning to a real-world pricing problem. You’ll learn how to clean and prepare car data, train a Random Forest Regressor inside a scikit-learn pipeline, and evaluate how well it predicts a used car’s price from features like car age, kilometres driven, fuel type, and brand.
To build an accurate used car price prediction model, you’ll apply proven techniques that turn raw car listings into meaningful price estimates: feature engineering (deriving car age and brand from the raw columns), one-hot encoding of categorical features, and tree-based regression with a Random Forest.
Also Read: Data Visualisation: The What, The Why, and The How!
Let’s build this project from scratch with clear, step-by-step guidance.
Before building any machine learning model, the first step is to load and inspect the dataset.
# main.py
# --- 1. Load the Dataset ---
# Import necessary libraries
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_absolute_error
import joblib # For saving the model
print("--- Step 1: Loading the Dataset ---")
# Load dataset
try:
    df = pd.read_csv('CAR DETAILS FROM CAR DEKHO.csv')
    print("Dataset loaded successfully.")
    print(df.head())
except FileNotFoundError:
    print("Error: 'CAR DETAILS FROM CAR DEKHO.csv' not found.")
    print("Please ensure the file is located in the same directory.")
    exit()
print("\n" + "="*50 + "\n")
Before training any model, it's essential to clean the dataset and extract useful features. This step helps improve model performance.
# --- 2. Data Cleaning and Feature Engineering ---
# Check for missing values
if df.isnull().sum().any():
    print(f"\nMissing values before cleaning:\n{df.isnull().sum()}")
    df.dropna(inplace=True)
else:
    print("\nNo missing values found in the dataset.")
# Feature Engineering: Create 'car_age' from 'year'
# A car's age is more useful than its manufacturing year
current_year = datetime.datetime.now().year
df['car_age'] = current_year - df['year']
# Feature Engineering: Extract 'brand' from 'name'
# The brand gives a strong signal; the full name is too specific
df['brand'] = df['name'].apply(lambda x: x.split(' ')[0])
# Save a copy for EDA before dropping columns
df_for_eda = df.copy()
# Drop the original 'name' and 'year' columns
df.drop(columns=['name', 'year'], inplace=True)
print("\n--- Data After Feature Engineering ---")
print(df.head())
Output:
No missing values found in the dataset.
--- Data After Feature Engineering ---
| selling_price | km_driven | fuel | seller_type | transmission | owner | car_age | brand |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 60000 | 70000 | Petrol | Individual | Manual | First Owner | 18 | Maruti |
| 135000 | 50000 | Petrol | Individual | Manual | First Owner | 18 | Maruti |
| 600000 | 100000 | Diesel | Individual | Manual | First Owner | 13 | Hyundai |
| 250000 | 46000 | Petrol | Individual | Manual | First Owner | 8 | Datsun |
| 450000 | 141000 | Diesel | Individual | Manual | Second Owner | 11 | Honda |
Let's explore the dataset visually to understand key patterns and relationships. These insights will guide feature selection and model decisions.
Here is the code for this step:
# --- 3. Exploratory Data Analysis (Plots) ---
print("\n--- Starting Exploratory Data Analysis (Plots) ---")
sns.set_style("whitegrid")
# Plot 1: Distribution of Selling Price
plt.figure(figsize=(10, 6))
sns.histplot(df_for_eda['selling_price'], kde=True, bins=50)
plt.title('Distribution of Selling Price', fontsize=16)
plt.xlabel('Selling Price (₹)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.show()
# Plot 2: Selling Price vs. Car Age
plt.figure(figsize=(10, 6))
sns.scatterplot(x='car_age', y='selling_price', data=df_for_eda, alpha=0.6)
plt.title('Selling Price vs. Car Age', fontsize=16)
plt.xlabel('Car Age (Years)', fontsize=12)
plt.ylabel('Selling Price (₹)', fontsize=12)
plt.show()
# Plot 3: Selling Price vs. Kilometers Driven
plt.figure(figsize=(10, 6))
sns.scatterplot(x='km_driven', y='selling_price', data=df_for_eda, alpha=0.6)
plt.title('Selling Price vs. Kilometers Driven', fontsize=16)
plt.xlabel('Kilometers Driven', fontsize=12)
plt.ylabel('Selling Price (₹)', fontsize=12)
plt.show()
# Plot 4: Categorical Features vs. Selling Price
categorical_features_for_plot = ['fuel', 'seller_type', 'transmission', 'owner']
for feature in categorical_features_for_plot:
    plt.figure(figsize=(10, 6))
    sns.boxplot(x=feature, y='selling_price', data=df_for_eda)
    plt.title(f'Selling Price vs. {feature.replace("_", " ").title()}', fontsize=16)
    plt.xlabel(feature.replace("_", " ").title(), fontsize=12)
    plt.ylabel('Selling Price (₹)', fontsize=12)
    plt.show()
# Plot 5: Correlation Heatmap for Numerical Features
plt.figure(figsize=(10, 8))
numeric_df = df_for_eda.select_dtypes(include=['number'])
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Numerical Features', fontsize=16)
plt.show()
Output: the script displays the five sets of plots described above (selling price distribution; price vs. car age; price vs. kilometres driven; boxplots of price by fuel, seller type, transmission, and owner; and a correlation heatmap of the numerical features).
Also Read: How Forecasting Works in Tableau? Predicting the Future with Data
Now that the data is clean and explored, let’s define what the model will learn from (features) and what it needs to predict (target). We'll also identify categorical and numerical columns, and split the data for training and testing.
# --- 4. Define Features (X) and Target (y) ---
# The target variable is what we want to predict
y = df['selling_price']
# The features include all columns except the target
X = df.drop(columns=['selling_price'])
# --- 5. Identify Categorical and Numerical Features ---
# Categorical features: text-based or discrete categories
categorical_features = ['fuel', 'seller_type', 'transmission', 'owner', 'brand']
# Numerical features: continuous values
numerical_features = ['km_driven', 'car_age']
# --- 6. Split Data into Training and Testing Sets ---
# We'll use 80% of the data for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"\nTraining set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
Output:
Training set size: 3472 samples
Testing set size: 868 samples
Also Read: 14 Essential Data Visualization Libraries for Python in 2025
To simplify the workflow and avoid data leakage, we’ll create a pipeline that handles both preprocessing and model training in one step. Categorical features will be one-hot encoded, and all other columns will pass through unchanged.
We'll use a Random Forest Regressor for prediction.
# --- 7. Create a Preprocessing and Modeling Pipeline ---
# Step 1: Define transformation for categorical features
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Step 2: Combine transformers for preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough'  # Keep numerical features as they are
)
# Step 3: Build the full pipeline with a Random Forest Regressor
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])
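If you want to see exactly which columns the encoder produces, you can fit the preprocessor on its own and list its output features. This is an optional check, assuming scikit-learn 1.0+ for get_feature_names_out():

# Optional: fit the preprocessor alone to inspect the encoded columns
preprocessor.fit(X_train)
print(preprocessor.get_feature_names_out()[:10])  # first few one-hot columns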
Now we’ll train the model using the training data. After that, we’ll test its performance using R-squared and Mean Absolute Error.
Evaluation plots will help us understand how well the predictions align with actual values.
# --- 8. Train the Model ---
print("\n--- Training the model... ---")
model_pipeline.fit(X_train, y_train)
print("Model training complete.")
# --- 9. Evaluate the Model with Plots ---
print("\n--- Evaluating the model... ---")
y_pred = model_pipeline.predict(X_test)
# Calculate evaluation metrics
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print("Model Performance on Test Data:")
print(f"R-squared (R²): {r2:.4f}")
print(f"Mean Absolute Error (MAE): ₹{mae:,.2f}")
print("\n* R-squared tells how much of the price variation the model explains.")
print("* MAE shows the average difference between actual and predicted prices.")
# --- Adding Evaluation Plots ---
print("\n--- Generating Evaluation Plots ---")
# Plot 1: Actual vs. Predicted Values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.7)
p1 = max(max(y_pred), max(y_test))
p2 = min(min(y_pred), min(y_test))
plt.plot([p1, p2], [p1, p2], 'r--') # Perfect prediction line
plt.title('Actual vs. Predicted Selling Price', fontsize=16)
plt.xlabel('Actual Price (₹)', fontsize=12)
plt.ylabel('Predicted Price (₹)', fontsize=12)
plt.show()
# Plot 2: Residuals Plot
residuals = y_test - y_pred
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_pred, y=residuals)
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residuals vs. Predicted Values', fontsize=16)
plt.xlabel('Predicted Price (₹)', fontsize=12)
plt.ylabel('Residuals (Actual - Predicted)', fontsize=12)
plt.show()
Output:
--- Training the model... ---
Model training complete.
--- Evaluating the model... ---
Model Performance on Test Data:
R-squared (R²): 0.7629
Mean Absolute Error (MAE): ₹114,277.24
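The prediction example in the next step loads the pipeline from disk, so save the trained pipeline first with joblib; the filename below matches the one loaded in that step:

# --- 10. Save the Trained Pipeline ---
joblib.dump(model_pipeline, 'used_car_price_model.joblib')
print("Model pipeline saved to 'used_car_price_model.joblib'.")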
Once the model is trained and saved, you can load it later to make predictions on new, unseen data. Here’s how to predict the price of a car based on its features.
# --- 11. Example of Predicting a New Car ---
print("\n--- Example Prediction ---")
# New car data as a DataFrame
new_car_data = pd.DataFrame({
    'km_driven': [50000],
    'fuel': ['Petrol'],
    'seller_type': ['Individual'],
    'transmission': ['Manual'],
    'owner': ['First Owner'],
    'car_age': [5],
    'brand': ['Maruti']
})
# Load the saved model pipeline
loaded_model = joblib.load('used_car_price_model.joblib')
# Make prediction
predicted_price = loaded_model.predict(new_car_data)
print(f"Predicted price for the new car: ₹{predicted_price[0]:,.2f}")
Output:
--- Example Prediction ---
Predicted price for the new car: ₹321,050.32
The model predicted a selling price of about ₹3.21 lakh for the given features.
This confirms that the pipeline works end-to-end, from preprocessing to prediction, and can serve as a practical starting point for pricing used cars.
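Because the pipeline accepts any DataFrame with the same columns, you can also score several listings at once. A minimal sketch, with invented feature values purely for illustration:

# Hypothetical listings; the feature values are made up for illustration
batch = pd.DataFrame({
    'km_driven': [30000, 120000],
    'fuel': ['Diesel', 'Petrol'],
    'seller_type': ['Dealer', 'Individual'],
    'transmission': ['Manual', 'Automatic'],
    'owner': ['First Owner', 'Second Owner'],
    'car_age': [3, 10],
    'brand': ['Hyundai', 'Honda']
})
for price in loaded_model.predict(batch):
    print(f"Predicted price: ₹{price:,.2f}")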
Also Read: Evaluation Metrics in Machine Learning: Top 10 Metrics You Should Know
This project demonstrated how to build a complete machine learning pipeline to predict used car prices. Starting from data cleaning and feature engineering, we explored the dataset, prepared it for modelling, and trained a Random Forest Regressor inside a single pipeline. The model explained roughly 76% of the price variation on the test set (R² of 0.7629, MAE of about ₹1.14 lakh) and produced a sensible estimate for a new listing.
Colab Link:
https://colab.research.google.com/drive/17Fek7Y9RsRoBVLvMJslVYZFRp8nmQJRm?usp=sharing