Medical Cost Prediction Using Linear Regression and Random Forest
By Rohit Sharma
Updated on Aug 04, 2025 | 11 min read | 1.55K+ views
Medical cost prediction is one of the most useful applications of machine learning: you can estimate insurance charges from patient attributes like age, BMI, smoking status, and more. This project focuses on building a medical cost prediction model in Python that helps you identify the factors that drive costs and predict expenses for new cases.
You’ll work with a real-world dataset and apply regression algorithms like Linear Regression and Random Forest Regressor. The process includes data cleaning, exploratory data analysis, feature engineering, and model evaluation.
Launch your data science career with upGrad's cutting-edge Online Data Science Courses. Master Python, Machine Learning, AI, SQL, and Tableau. Taught by industry leaders, these courses equip you with highly sought-after, job-ready skills. Don't wait – enroll today!
Enhance your theoretical understanding and practical skills by exploring our premier Python Data Science Projects, designed to launch your project development journey.
Also Read - Top 35 Linear Regression Projects in Machine Learning With Source Code
Learn data science with upGrad's industry-led, top-ranked courses for direct mentorship and career guidance.
To develop and evaluate the medical cost prediction model, you'll use core Python tools for data processing, visualization, and regression modeling:
Tool / Library | Purpose
--- | ---
Python | The main programming language for scripting and implementing the pipeline
Google Colab | Cloud-based platform for running notebooks with free GPU/TPU support
Pandas | Loads the insurance dataset and handles data manipulation and preprocessing
NumPy | Supports numerical operations for arrays and model input preparation
Matplotlib / Seaborn | Visualizes data trends, feature relationships, and model insights
scikit-learn | Performs data splitting, encoding, model training, and performance evaluation
LinearRegression | A baseline model for predicting continuous medical costs
RandomForestRegressor | A powerful ensemble model that captures non-linear relationships
MAE, MSE, R² Score | Evaluation metrics to measure prediction accuracy and model reliability
Also Read - Decision Tree vs Random Forest: Use Cases & Performance Metrics
To build an effective Medical Cost Prediction model, you’ll apply core regression techniques, Linear Regression and Random Forest Regressor, to estimate healthcare expenses from patient attributes, and evaluate them with MAE, MSE, and the R² score (a quick sketch of these metrics follows below).
Also Read - Evaluation Metrics in Machine Learning: Top 10 Metrics You Should Know
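To make those three metrics concrete before you use them, here is a minimal NumPy sketch of what MAE, MSE, and the R² score compute. The sample values below are made up purely for illustration:
import numpy as np
# Hypothetical actual vs. predicted charges, for illustration only
y_true = np.array([16884.92, 1725.55, 4449.46, 21984.47])
y_pred = np.array([15000.00, 2500.00, 5000.00, 20000.00])
# MAE: average absolute difference between predictions and actual values
mae = np.mean(np.abs(y_true - y_pred))
# MSE: average squared difference; penalizes large errors more heavily
mse = np.mean((y_true - y_pred) ** 2)
# R² score: fraction of the variance in y_true explained by the predictions
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"MAE: {mae:.2f} | MSE: {mse:.2f} | R²: {r2:.4f}")
scikit-learn's mean_absolute_error, mean_squared_error, and r2_score compute the same quantities; you'll use those built-ins in the evaluation step.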
You can complete this Medical Cost Prediction project in about 4 to 5 hours. It is made for beginners and intermediate learners who have basic Python skills and want hands-on experience in solving regression problems using healthcare data.
Developing a Medical Cost Prediction Model: A Step-by-Step Guide
Let’s walk through the steps to build this project from scratch:
Let's get started!
To begin the Medical Cost Prediction project, import all the essential Python libraries for data processing, visualization, modeling, and evaluation.
Here's the code for importing:
# Import necessary libraries
import pandas as pd # For data manipulation and loading
import numpy as np # For numerical operations
import matplotlib.pyplot as plt # For creating plots
import seaborn as sns # For advanced visualizations
from sklearn.model_selection import train_test_split # For splitting data
from sklearn.linear_model import LinearRegression # For linear regression modeling
from sklearn.ensemble import RandomForestRegressor # For training a Random Forest model
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score # For evaluation
Also Read - Libraries in Python Explained: List of Important Libraries
Now, load the dataset that contains medical insurance cost records. The code for this step is below:
# Make sure the 'insurance.csv' file is in the same directory as this script.
try:
    df = pd.read_csv('insurance.csv')
    print("Dataset loaded successfully!")
except FileNotFoundError:
    print("Error: 'insurance.csv' not found. Please make sure the file is in the correct directory.")
    exit()
Also Read - Data Cleaning Techniques: 15 Simple & Effective Ways To Clean Data
In this step, you’ll examine the dataset to understand its structure, detect missing values, and review basic statistics. This helps identify any data issues before preprocessing.
# Get a first look at the data and understand its structure.
print("--- Data Overview ---")
# Display the first 5 rows of the dataframe
print("First 5 rows of the dataset:")
print(df.head())
print("\n" + "="*50 + "\n")
# Get information about the columns, data types, and non-null values
print("Dataset Information:")
df.info()
print("\n" + "="*50 + "\n")
# Get statistical summary for numerical columns
print("Statistical Summary of Numerical Features:")
print(df.describe())
print("\n" + "="*50 + "\n")
# Check for any missing values in the dataset
print("Missing Values Check:")
print(df.isnull().sum())
print("\n" + "="*50 + "\n")
Output:
--- Data Overview ---
First 5 rows of the dataset:
age sex bmi children smoker region charges
0 19 female 27.900 0 yes southwest 16884.92400
1 18 male 33.770 1 no southeast 1725.55230
2 28 male 33.000 3 no southeast 4449.46200
3 33 male 22.705 0 no northwest 21984.47061
4 32 male 28.880 0 no northwest 3866.85520
==================================================
Dataset Information:
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null int64
1 sex 1338 non-null object
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 charges 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
==================================================
Statistical Summary of Numerical Features:
age bmi children charges
count 1338.000000 1338.000000 1338.000000 1338.000000
mean 39.207025 30.663397 1.094918 13270.422265
std 14.049960 6.098187 1.205493 12110.011237
min 18.000000 15.960000 0.000000 1121.873900
25% 27.000000 26.296250 0.000000 4740.287150
50% 39.000000 30.400000 1.000000 9382.033000
75% 51.000000 34.693750 2.000000 16639.912515
max 64.000000 53.130000 5.000000 63770.428010
==================================================
Missing Values Check:
age 0
sex 0
bmi 0
children 0
smoker 0
region 0
charges 0
dtype: int64
Also Read - Comprehensive Guide to Exploratory Data Analysis (EDA) in 2025: Tools, Types, and Best Practices
This step helps you explore relationships and trends between features and the target variable (charges) through clear and insightful plots.
The code for this step is as follows:
# Visualizing the data helps in understanding the relationships between features.
print("--- Generating Visualizations ---")
# Set the style for the plots
sns.set_style('whitegrid')
plt.figure(figsize=(12, 6))
# a) Distribution of the target variable 'charges'
sns.histplot(df['charges'], kde=True, bins=30)
plt.title('Distribution of Medical Charges')
plt.xlabel('Charges')
plt.ylabel('Frequency')
plt.show()
# Observation: Charges are right-skewed with a long tail on the higher end.
# b) Medical Charges by Smoking Status
plt.figure(figsize=(8, 6))
sns.boxplot(x='smoker', y='charges', data=df)
plt.title('Medical Charges by Smoking Status')
plt.show()
# Insight: Smokers tend to incur significantly higher charges.
# c) Medical Charges by Sex
plt.figure(figsize=(8, 6))
sns.boxplot(x='sex', y='charges', data=df)
plt.title('Medical Charges by Sex')
plt.show()
# d) Medical Charges by Region
plt.figure(figsize=(10, 7))
sns.boxplot(x='region', y='charges', data=df)
plt.title('Medical Charges by Region')
plt.show()
# e) Correlation Heatmap (after encoding categorical variables)
df_encoded = df.copy()
df_encoded['sex'] = df_encoded['sex'].astype('category').cat.codes
df_encoded['smoker'] = df_encoded['smoker'].astype('category').cat.codes
df_encoded['region'] = df_encoded['region'].astype('category').cat.codes
plt.figure(figsize=(10, 8))
sns.heatmap(df_encoded.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of All Features')
plt.show()
# Insight: 'age' and 'smoker' have the strongest correlation with charges.
# f) Scatter Plot: Age vs. Charges by Smoking Status
plt.figure(figsize=(12, 8))
sns.scatterplot(x='age', y='charges', hue='smoker', data=df, palette=['blue', 'red'], alpha=0.7)
plt.title('Age vs. Charges by Smoking Status')
plt.show()
# Insight: Smokers tend to have higher charges across all age groups.
Before feeding data into a machine learning model, we must convert categorical variables into a numerical format so the model can interpret them.
print("--- Preprocessing Data for Modeling ---")
# Apply one-hot encoding to convert categorical variables into binary format
# This creates new columns like 'sex_male', 'smoker_yes', 'region_northwest', etc.
# Setting drop_first=True to prevent multicollinearity
df_processed = pd.get_dummies(df, columns=['sex', 'smoker', 'region'], drop_first=True)
# Display the first few rows of the processed dataset
print("Dataset after one-hot encoding:")
print(df_processed.head())
print("\n" + "="*50 + "\n")
Output:
--- Preprocessing Data for Modeling ---
Dataset after one-hot encoding:
age bmi children charges sex_male smoker_yes region_northwest \
0 19 27.900 0 16884.92400 False True False
1 18 33.770 1 1725.55230 True False False
2 28 33.000 3 4449.46200 True False False
3 33 22.705 0 21984.47061 True False True
4 32 28.880 0 3866.85520 True False True
region_southeast region_southwest
0 False True
1 True False
2 True False
3 False False
4 False False
This preprocessing step prepares your data for building regression models effectively.
In this step, you'll train two regression models, Linear Regression and Random Forest Regressor, then evaluate their performance on test data.
Here is the code for training and evaluating model performance:
# a) Define features (X) and target (y)
X = df_processed.drop('charges', axis=1)
y = df_processed['charges']
# b) Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")
print("\n" + "="*50 + "\n")
# --- Model 1: Linear Regression ---
print("--- Training Linear Regression Model ---")
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
# Make predictions on the test set
y_pred_lr = lr_model.predict(X_test)
# Evaluate the model
print("Linear Regression - Model Evaluation:")
print(f"Mean Absolute Error (MAE): {mean_absolute_error(y_test, y_pred_lr):.2f}")
print(f"Mean Squared Error (MSE): {mean_squared_error(y_test, y_pred_lr):.2f}")
print(f"R-squared (R2 Score): {r2_score(y_test, y_pred_lr):.4f}")
print("\n" + "="*50 + "\n")
# --- Model 2: Random Forest Regressor ---
print("--- Training Random Forest Regressor Model ---")
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Make predictions on the test set
y_pred_rf = rf_model.predict(X_test)
# Evaluate the model
print("Random Forest - Model Evaluation:")
print(f"Mean Absolute Error (MAE): {mean_absolute_error(y_test, y_pred_rf):.2f}")
print(f"Mean Squared Error (MSE): {mean_squared_error(y_test, y_pred_rf):.2f}")
print(f"R-squared (R2 Score): {r2_score(y_test, y_pred_rf):.4f}")
print("\n" + "="*50 + "\n")
Output:
Training set size: 1070 samples
Test set size: 268 samples
==================================================
--- Training Linear Regression Model ---
Linear Regression - Model Evaluation:
Mean Absolute Error (MAE): 4181.19
Mean Squared Error (MSE): 33596915.85
R-squared (R2 Score): 0.7836
==================================================
--- Training Random Forest Regressor Model ---
Random Forest - Model Evaluation:
Mean Absolute Error (MAE): 2550.08
Mean Squared Error (MSE): 20942520.92
R-squared (R2 Score): 0.8651
Why two models? Linear Regression provides a fast, interpretable baseline for continuous targets, while Random Forest is an ensemble that can capture non-linear relationships and feature interactions. Comparing the two shows how much predictive accuracy the extra model complexity actually buys you.
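As a quick sanity check beyond a single train/test split, you could also compare both models with 5-fold cross-validation. This is a minimal sketch, assuming X, y, and the model imports from the earlier steps:
from sklearn.model_selection import cross_val_score
# Compare both models using 5-fold cross-validated R² scores
for name, model in [
    ("Linear Regression", LinearRegression()),
    ("Random Forest", RandomForestRegressor(n_estimators=100, random_state=42)),
]:
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    print(f"{name}: mean R² = {scores.mean():.4f} (std = {scores.std():.4f})")
Averaging over five folds reduces the chance that one lucky (or unlucky) split drives your conclusion about which model is better.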
Also explore this project - Airline Passenger Traffic Analysis Project Using Python
Use these plots to visually assess how well each model performs by comparing actual vs. predicted charges and analyzing the distribution of residuals.
# a) Linear Regression: Actual vs. Predicted
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_lr, alpha=0.7, edgecolors='k')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Charges')
plt.ylabel('Predicted Charges')
plt.title('Linear Regression: Actual vs. Predicted Charges')
plt.show()
# b) Linear Regression: Residuals Plot
residuals_lr = y_test - y_pred_lr
plt.figure(figsize=(10, 6))
plt.scatter(y_pred_lr, residuals_lr, alpha=0.7, edgecolors='k')
plt.hlines(0, xmin=y_pred_lr.min(), xmax=y_pred_lr.max(), colors='red', linestyles='--')
plt.xlabel('Predicted Charges')
plt.ylabel('Residuals (Actual - Predicted)')
plt.title('Linear Regression: Residuals Plot')
plt.show()
# c) Random Forest: Actual vs. Predicted
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_rf, alpha=0.7, edgecolors='k')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Charges')
plt.ylabel('Predicted Charges')
plt.title('Random Forest: Actual vs. Predicted Charges')
plt.show()
# d) Random Forest: Residuals Plot
residuals_rf = y_test - y_pred_rf
plt.figure(figsize=(10, 6))
plt.scatter(y_pred_rf, residuals_rf, alpha=0.7, edgecolors='k')
plt.hlines(0, xmin=y_pred_rf.min(), xmax=y_pred_rf.max(), colors='red', linestyles='--')
plt.xlabel('Predicted Charges')
plt.ylabel('Residuals (Actual - Predicted)')
plt.title('Random Forest: Residuals Plot')
plt.show()
Output: four diagnostic plots, an actual vs. predicted scatter plot and a residuals plot for each model.
Random Forest Regressor significantly outperforms Linear Regression on both error (lower MAE and MSE) and fit (higher R² score). This suggests Random Forest is more suitable for predicting medical insurance charges, likely because it captures the non-linear relationships and feature interactions, such as the combined effect of age and smoking status, that a linear model misses.
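To see which attributes drive the predictions, you could inspect the trained Random Forest's feature importances. A short sketch, assuming rf_model and X from the training step above:
# Rank features by how much the Random Forest relies on them when splitting
importances = pd.Series(rf_model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
Based on the EDA above, you would expect smoker_yes, bmi, and age to rank near the top.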
In this Medical Cost Prediction project, we trained and compared Linear Regression and Random Forest Regressor models. The Linear Regression model delivered only moderate performance, with larger prediction errors and visibly biased residuals. The Random Forest model achieved higher accuracy, lower errors, and a higher R-squared value; its predictions aligned closely with actual values, and its randomly distributed residuals indicate good generalization.
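Finally, to put the trained model to work on a new case, here is a minimal sketch of predicting charges for a hypothetical patient (the values are invented for illustration; it assumes rf_model and X from the steps above, and column names matching the one-hot encoded training data):
# Hypothetical new patient: 40-year-old male smoker, BMI 31, 2 children, southeast region
new_patient = pd.DataFrame([{
    'age': 40, 'bmi': 31.0, 'children': 2,
    'sex_male': True, 'smoker_yes': True,
    'region_northwest': False, 'region_southeast': True, 'region_southwest': False,
}], columns=X.columns)  # reuse the training column order
predicted_charge = rf_model.predict(new_patient)[0]
print(f"Predicted medical charges: {predicted_charge:,.2f}")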
Unlock the power of data with our popular Data Science courses, designed to make you proficient in analytics, machine learning, and big data!
Elevate your career by learning essential Data Science skills such as statistical modeling, big data processing, predictive analytics, and SQL!
Stay informed and inspired with our popular Data Science articles, offering expert insights, trends, and practical tips for aspiring data professionals!
Reference:
https://colab.research.google.com/drive/1TfDKBj1PBRBCIJzNV1qIktMVtZyxtRpr?usp=sharing