Medical Cost Prediction Using Linear Regression and Random Forest

By Rohit Sharma

Updated on Aug 04, 2025 | 11 min read | 1.55K+ views

Medical cost prediction is a practical application of machine learning: you estimate insurance charges from patient attributes such as age, BMI, and smoking status. This project builds a medical cost prediction model in Python that helps identify the factors driving cost and predicts charges for new patients.

You’ll work with a real-world dataset and apply regression algorithms like Linear Regression and Random Forest Regressor. The process includes data cleaning, exploratory data analysis, feature engineering, and model evaluation.

Preparation for Effective Project Development

  • Basic Python programming knowledge (For writing scripts, defining functions, and handling control flows)
  • Experience with data manipulation using Pandas and NumPy (Required for reading data, handling missing values, and structuring datasets)
  • Understanding of data visualization with Matplotlib and Seaborn (Helps in generating charts like countplots, histograms, and heatmaps for EDA)
  • Knowledge of data preprocessing techniques (Such as dropping irrelevant columns, encoding categorical variables using one-hot encoding, and splitting datasets)
  • Familiarity with Regression Algorithms (Since this is a regression task, you should understand models like Linear Regression and Random Forest Regressor, which can predict continuous target values.)

Predicting Medical Costs: A Look at the Tech That Makes It Happen

To develop and evaluate the medical cost prediction model, you'll use core Python tools for data processing, visualization, and regression modeling:

Tool / Library | Purpose
Python | The main programming language for scripting and implementing the pipeline
Google Colab | Cloud-based platform for running notebooks with free GPU/TPU support
Pandas | Loads the insurance dataset and handles data manipulation and preprocessing
NumPy | Supports numerical operations for arrays and model input preparation
Matplotlib / Seaborn | Visualizes data trends, feature relationships, and model insights
scikit-learn | Performs data splitting, encoding, model training, and performance evaluation
LinearRegression | A baseline model for predicting continuous medical costs
RandomForestRegressor | A powerful ensemble model that captures non-linear relationships
MAE, MSE, R² Score | Evaluation metrics to measure prediction accuracy and model reliability

How We're Predicting Medical Costs

To build an effective Medical Cost Prediction model, you’ll apply core regression techniques that help estimate healthcare expenses based on patient attributes:

  • Data preprocessing and cleaning
  • Exploratory Data Analysis (EDA)
  • Regression Algorithms (Linear Regression & Random Forest Regressor)
  • Feature Importance Analysis
  • Model evaluation 

Skill Level Requirements and Timeline

You can complete this Medical Cost Prediction project in about 4 to 5 hours. It is made for beginners and intermediate learners who have basic Python skills and want hands-on experience in solving regression problems using healthcare data.

Developing a Medical Cost Prediction Model: A Step-by-Step Guide

Let’s walk through the steps to build this project from scratch:

  • Load the Medical Insurance Dataset
    Import the dataset containing features like age, sex, BMI, smoking status, and region, along with medical charges.
  • Clean and Preprocess the Data
    Check for missing values and encode categorical variables as numbers.
  • Explore and Visualize Relationships
    Use visualizations like histograms, box plots, and heatmaps to identify key patterns, such as how smoking or age impacts medical costs.
  • Train Regression Models
    Apply models like Linear Regression and Random Forest Regressor to learn from the data and capture both linear and complex relationships.
  • Evaluate the Model’s Performance
    Measure accuracy using metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), and R² score to see how well the model predicts costs.
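To make the three metrics concrete, here is a minimal sketch that computes them directly from their definitions with NumPy. The function name `regression_metrics` and the toy values are illustrative assumptions, not part of the original project; in the project itself the scikit-learn functions are used instead.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))                      # average absolute error
    mse = np.mean(errors ** 2)                         # average squared error
    ss_res = np.sum(errors ** 2)                       # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                         # share of variance explained
    return {"mae": mae, "mse": mse, "rmse": np.sqrt(mse), "r2": r2}

# Toy values invented for illustration (not taken from the insurance data)
actual = [1000.0, 2000.0, 3000.0, 4000.0]
predicted = [1100.0, 1900.0, 3200.0, 3800.0]

metrics = regression_metrics(actual, predicted)
print(metrics)
```

MSE is expressed in squared currency units, so its square root (RMSE) is often easier to read. Applied to the MSE values reported in Step 6, the square root gives an RMSE of roughly 5,796 for Linear Regression and 4,576 for Random Forest.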

Let's get started!

Step 1: Import Required Libraries

To begin the Medical Cost Prediction project, import all the essential Python libraries for data processing, visualization, modeling, and evaluation:

Here's the code for importing:

# Import necessary libraries
import pandas as pd                 # For data manipulation and loading
import numpy as np                  # For numerical operations
import matplotlib.pyplot as plt     # For creating plots
import seaborn as sns               # For advanced visualizations

from sklearn.model_selection import train_test_split          # For splitting data
from sklearn.linear_model import LinearRegression             # For linear regression modeling
from sklearn.ensemble import RandomForestRegressor            # For training a Random Forest model
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score  # For evaluation

Step 2: Load the Dataset

Now, load the dataset that contains medical insurance cost records. The code for this step is below:

# Make sure the 'insurance.csv' file is in the same directory as this script.
try:
    df = pd.read_csv('insurance.csv')
    print("Dataset loaded successfully!")
except FileNotFoundError:
    # Stop here with a clear message instead of continuing without data
    raise SystemExit("Error: 'insurance.csv' not found. Please make sure the file is in the correct directory.")
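A quick schema check right after loading can catch a wrong or truncated file before any analysis runs. The sketch below is one possible way to do it; the helper `check_schema` is a hypothetical name, and the tiny stand-in frame exists only so the example runs without `insurance.csv`.

```python
import pandas as pd

# Column names as they appear in the dataset preview in Step 3
EXPECTED_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]

def check_schema(frame, expected=EXPECTED_COLUMNS):
    """Return the expected columns that are missing from `frame` (empty list if none)."""
    return [col for col in expected if col not in frame.columns]

# One-row stand-in for the real dataset (hypothetical, for illustration only)
sample = pd.DataFrame({
    "age": [19], "sex": ["female"], "bmi": [27.9], "children": [0],
    "smoker": ["yes"], "region": ["southwest"], "charges": [16884.924],
})

missing = check_schema(sample)
print("Missing columns:", missing)
```

In the project, the same check could be applied to `df` immediately after `pd.read_csv`.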

Step 3: Exploratory Data Analysis (EDA)

In this step, you’ll examine the dataset to understand its structure, detect missing values, and review basic statistics. This helps identify any data issues before preprocessing.

# Get a first look at the data and understand its structure.
print("--- Data Overview ---")

# Display the first 5 rows of the dataframe
print("First 5 rows of the dataset:")
print(df.head())
print("\n" + "="*50 + "\n")

# Get information about the columns, data types, and non-null values
print("Dataset Information:")
df.info()
print("\n" + "="*50 + "\n")

# Get statistical summary for numerical columns
print("Statistical Summary of Numerical Features:")
print(df.describe())
print("\n" + "="*50 + "\n")

# Check for any missing values in the dataset
print("Missing Values Check:")
print(df.isnull().sum())
print("\n" + "="*50 + "\n")

Output:

--- Data Overview ---
First 5 rows of the dataset:
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520

==================================================

Dataset Information:
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

==================================================

Statistical Summary of Numerical Features:
               age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010

==================================================

Missing Values Check:
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

Step 4: Data Visualization

This step helps you explore relationships and trends between features and the target variable (charges) through clear and insightful plots.

The code for this step is as follows:

# Visualizing the data helps in understanding the relationships between features.
print("--- Generating Visualizations ---")

# Set the style for the plots
sns.set_style('whitegrid')
plt.figure(figsize=(12, 6))

# a) Distribution of the target variable 'charges'
sns.histplot(df['charges'], kde=True, bins=30)
plt.title('Distribution of Medical Charges')
plt.xlabel('Charges')
plt.ylabel('Frequency')
plt.show()
# Observation: Charges are right-skewed with a long tail on the higher end.

# b) Medical Charges by Smoking Status
plt.figure(figsize=(8, 6))
sns.boxplot(x='smoker', y='charges', data=df)
plt.title('Medical Charges by Smoking Status')
plt.show()
# Insight: Smokers tend to incur significantly higher charges.

# c) Medical Charges by Sex
plt.figure(figsize=(8, 6))
sns.boxplot(x='sex', y='charges', data=df)
plt.title('Medical Charges by Sex')
plt.show()

# d) Medical Charges by Region
plt.figure(figsize=(10, 7))
sns.boxplot(x='region', y='charges', data=df)
plt.title('Medical Charges by Region')
plt.show()

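# e) Correlation Heatmap (after encoding categorical variables)
# Note: the integer codes for 'region' are arbitrary labels, so its correlations are not meaningful.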
df_encoded = df.copy()
df_encoded['sex'] = df_encoded['sex'].astype('category').cat.codes
df_encoded['smoker'] = df_encoded['smoker'].astype('category').cat.codes
df_encoded['region'] = df_encoded['region'].astype('category').cat.codes
plt.figure(figsize=(10, 8))
sns.heatmap(df_encoded.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of All Features')
plt.show()
# Insight: 'age' and 'smoker' have the strongest correlation with charges.

# f) Scatter Plot: Age vs. Charges by Smoking Status
plt.figure(figsize=(12, 8))
sns.scatterplot(x='age', y='charges', hue='smoker', data=df, palette={'yes': 'red', 'no': 'blue'}, alpha=0.7)
plt.title('Age vs. Charges by Smoking Status')
plt.show()
# Insight: Smokers tend to have higher charges across all age groups.
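The visual observations above (right-skewed charges, higher charges for smokers) can also be checked numerically. The following is a small sketch under stated assumptions: it uses only the five preview rows printed in Step 3 as a stand-in for the full dataset, and `summarize_charges` is a hypothetical helper name.

```python
import pandas as pd

# The five preview rows printed in Step 3, used as a stand-in for the full dataset
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "smoker": ["yes", "no", "no", "no", "no"],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})

def summarize_charges(frame):
    """Return the skewness of charges and the mean charge per smoking status."""
    # A positive skewness value indicates a longer tail on the high end
    skewness = float(frame["charges"].skew())
    mean_by_smoker = frame.groupby("smoker")["charges"].mean()
    return skewness, mean_by_smoker

skewness, mean_by_smoker = summarize_charges(sample)
print(f"Skewness of charges: {skewness:.2f}")
print(mean_by_smoker)
```

Calling `summarize_charges(df)` on the full dataset would give the figures that correspond to the histogram and the smoker box plot.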

Step 5: Data Preprocessing for Modeling

Before feeding data into a machine learning model, we must convert categorical variables into a numerical format so the model can interpret them.

print("--- Preprocessing Data for Modeling ---")

# Apply one-hot encoding to convert categorical variables into binary format
# This creates new columns like 'sex_male', 'smoker_yes', 'region_northwest', etc.
# Setting drop_first=True to prevent multicollinearity
df_processed = pd.get_dummies(df, columns=['sex', 'smoker', 'region'], drop_first=True)

# Display the first few rows of the processed dataset
print("Dataset after one-hot encoding:")
print(df_processed.head())
print("\n" + "="*50 + "\n")

Output:

--- Preprocessing Data for Modeling ---
Dataset after one-hot encoding:
   age     bmi  children      charges  sex_male  smoker_yes  region_northwest  \
0   19  27.900         0  16884.92400     False        True             False
1   18  33.770         1   1725.55230      True       False             False
2   28  33.000         3   4449.46200      True       False             False
3   33  22.705         0  21984.47061      True       False              True
4   32  28.880         0   3866.85520      True       False              True

   region_southeast  region_southwest
0             False              True
1              True             False
2              True             False
3             False             False
4             False             False

This preprocessing step prepares your data for building regression models effectively.
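As an alternative to `pd.get_dummies`, the encoding can live inside a scikit-learn pipeline so that the same transformation is applied at training and prediction time. This is a sketch of that swapped-in technique, not the approach used in the original notebook; it is fitted here on the five preview rows from Step 3 purely so that it runs on its own.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_cols = ["sex", "smoker", "region"]

# Preview rows from Step 3, standing in for the full dataset
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.900, 33.770, 33.000, 22.705, 28.880],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})
X = sample.drop(columns="charges")
y = sample["charges"]

# One-hot encode the categorical columns (dropping the first level of each)
# and pass the numeric columns through unchanged
preprocessor = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(drop="first"), categorical_cols)],
    remainder="passthrough",
)
model = Pipeline(steps=[("prep", preprocessor), ("regressor", LinearRegression())])
model.fit(X, y)

encoded_shape = model.named_steps["prep"].transform(X).shape
print("Encoded feature matrix shape:", encoded_shape)
```

With only three regions present in the stand-in rows, the encoder produces fewer region columns than it would on the full dataset, which contains all four.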

Step 6: Model Training and Evaluation

In this step, you'll train two regression models, Linear Regression and Random Forest Regressor, then evaluate their performance on test data.

Here is the code for training and evaluating model performance: 

# a) Define features (X) and target (y)
X = df_processed.drop('charges', axis=1)
y = df_processed['charges']

# b) Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")
print("\n" + "="*50 + "\n")

# --- Model 1: Linear Regression ---
print("--- Training Linear Regression Model ---")
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_lr = lr_model.predict(X_test)

# Evaluate the model
print("Linear Regression - Model Evaluation:")
print(f"Mean Absolute Error (MAE): {mean_absolute_error(y_test, y_pred_lr):.2f}")
print(f"Mean Squared Error (MSE): {mean_squared_error(y_test, y_pred_lr):.2f}")
print(f"R-squared (R2 Score): {r2_score(y_test, y_pred_lr):.4f}")
print("\n" + "="*50 + "\n")

# --- Model 2: Random Forest Regressor ---
print("--- Training Random Forest Regressor Model ---")
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_rf = rf_model.predict(X_test)

# Evaluate the model
print("Random Forest - Model Evaluation:")
print(f"Mean Absolute Error (MAE): {mean_absolute_error(y_test, y_pred_rf):.2f}")
print(f"Mean Squared Error (MSE): {mean_squared_error(y_test, y_pred_rf):.2f}")
print(f"R-squared (R2 Score): {r2_score(y_test, y_pred_rf):.4f}")
print("\n" + "="*50 + "\n")

Output:

Training set size: 1070 samples
Test set size: 268 samples

==================================================

--- Training Linear Regression Model ---
Linear Regression - Model Evaluation:
Mean Absolute Error (MAE): 4181.19
Mean Squared Error (MSE): 33596915.85
R-squared (R2 Score): 0.7836

==================================================

--- Training Random Forest Regressor Model ---
Random Forest - Model Evaluation:
Mean Absolute Error (MAE): 2550.08
Mean Squared Error (MSE): 20942520.92
R-squared (R2 Score): 0.8651

Why two models?

  • Linear Regression helps understand basic linear relationships.
  • Random Forest Regressor captures complex, non-linear patterns and often performs better.
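A single train/test split can be lucky or unlucky, so one way to compare the two models more robustly is k-fold cross-validation. The sketch below is an addition to the original workflow and runs on synthetic data: the generating formula, the coefficients, and the sample size are all assumptions chosen only to include a non-linear interaction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumed formula, not the insurance dataset):
# charges grow with age, and jump for smokers whose BMI exceeds 30
rng = np.random.default_rng(42)
n = 300
age = rng.integers(18, 65, size=n)
bmi = rng.normal(30, 6, size=n)
smoker = rng.integers(0, 2, size=n)
charges = 250 * age + 20000 * smoker * (bmi > 30) + rng.normal(0, 1000, size=n)
X = np.column_stack([age, bmi, smoker])

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# 5-fold cross-validated R2 for each model
cv_scores = {
    name: cross_val_score(model, X, charges, cv=5, scoring="r2")
    for name, model in models.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: mean R2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

In the project, replacing `X` and `charges` with the `X` and `y` defined in Step 6 would apply the same comparison to the real data.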

Step 7: Visualize Model Evaluation

Use these plots to visually assess how well each model performs by comparing actual vs. predicted charges and analyzing the distribution of residuals.

# a) Linear Regression: Actual vs. Predicted
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_lr, alpha=0.7, edgecolors='k')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Charges')
plt.ylabel('Predicted Charges')
plt.title('Linear Regression: Actual vs. Predicted Charges')
plt.show()

# b) Linear Regression: Residuals Plot
residuals_lr = y_test - y_pred_lr
plt.figure(figsize=(10, 6))
plt.scatter(y_pred_lr, residuals_lr, alpha=0.7, edgecolors='k')
plt.hlines(0, xmin=y_pred_lr.min(), xmax=y_pred_lr.max(), colors='red', linestyles='--')
plt.xlabel('Predicted Charges')
plt.ylabel('Residuals (Actual - Predicted)')
plt.title('Linear Regression: Residuals Plot')
plt.show()

# c) Random Forest: Actual vs. Predicted
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_rf, alpha=0.7, edgecolors='k')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Charges')
plt.ylabel('Predicted Charges')
plt.title('Random Forest: Actual vs. Predicted Charges')
plt.show()

# d) Random Forest: Residuals Plot
residuals_rf = y_test - y_pred_rf
plt.figure(figsize=(10, 6))
plt.scatter(y_pred_rf, residuals_rf, alpha=0.7, edgecolors='k')
plt.hlines(0, xmin=y_pred_rf.min(), xmax=y_pred_rf.max(), colors='red', linestyles='--')
plt.xlabel('Predicted Charges')
plt.ylabel('Residuals (Actual - Predicted)')
plt.title('Random Forest: Residuals Plot')
plt.show()

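Output: four plots, showing actual vs. predicted charges and residuals for each model.

Random Forest Regressor outperforms Linear Regression on this test split in both:

  • Prediction accuracy: its points lie closer to the ideal diagonal line, consistent with the lower MAE (2550.08 vs. 4181.19) and higher R² (0.8651 vs. 0.7836) from Step 6,
  • and residual behavior: its residuals show less systematic pattern and less spread.

This suggests Random Forest is more suitable for predicting medical insurance charges, likely due to:

  • Its ability to capture nonlinearities,
  • and to handle interactions between features like age, BMI, and smoking status.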
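Feature importance analysis is listed among the techniques for this project, and a fitted Random Forest exposes it through its `feature_importances_` attribute. The following sketch shows the idea on synthetic data; the column names mirror the encoded dataset, but the generating formula and its coefficients are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumed formula, not the insurance dataset)
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n),
    "smoker_yes": rng.integers(0, 2, size=n),
})
y = 250 * X["age"] + 24000 * X["smoker_yes"] + rng.normal(0, 1000, size=n)

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)

# Impurity-based importances; they are non-negative and sum to 1
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```

With the models from Step 6, the equivalent call would be `pd.Series(rf_model.feature_importances_, index=X_train.columns)`. Impurity-based importances describe what the fitted forest relied on, not causal effects.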

Conclusion

In this Medical Cost Prediction project, we trained Linear Regression and Random Forest Regressor models on the insurance dataset. Linear Regression served as a reasonable baseline (R² of 0.7836, MAE of 4181.19 on the test set), but its residuals showed systematic bias. Random Forest achieved lower errors and a higher R² (0.8651, MAE of 2550.08). Its predictions aligned more closely with the actual values and its residuals were more randomly distributed, which suggests it generalizes better on this test split.
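To use a trained model on a new patient, the new record has to be encoded into exactly the columns the model was trained on. Here is a minimal sketch of one way to do that. The helper `encode_like_training` and the example patient are hypothetical; the column list follows the one-hot encoded output shown in Step 5, without `charges`.

```python
import pandas as pd

# Feature columns after one-hot encoding in Step 5 (target 'charges' excluded)
TRAINING_COLUMNS = [
    "age", "bmi", "children", "sex_male", "smoker_yes",
    "region_northwest", "region_southeast", "region_southwest",
]

def encode_like_training(new_rows, training_columns=TRAINING_COLUMNS):
    """One-hot encode new records and align them to the training columns."""
    # No drop_first here: with a single row it would drop the only category present.
    dummies = pd.get_dummies(new_rows, columns=["sex", "smoker", "region"])
    # Keep only the training columns, filling categories that are absent with 0
    return dummies.reindex(columns=training_columns, fill_value=0).astype(float)

# Hypothetical new patient, invented for illustration
new_patient = pd.DataFrame([{
    "age": 40, "sex": "female", "bmi": 30.0,
    "children": 2, "smoker": "yes", "region": "southeast",
}])

encoded = encode_like_training(new_patient)
print(encoded.T)
```

Assuming `rf_model` from Step 6 is available, `rf_model.predict(encode_like_training(new_patient))` would then return the estimated charge.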

Reference:
https://colab.research.google.com/drive/1TfDKBj1PBRBCIJzNV1qIktMVtZyxtRpr?usp=sharing
