Medical Cost Prediction Using Linear Regression and Random Forest
By Rohit Sharma
Updated on Aug 04, 2025 | 11 min read | 1.55K+ views
Medical cost prediction is one of the most useful applications of machine learning: you can estimate insurance charges from patient attributes like age, BMI, smoking status, and more. This project focuses on building a medical cost prediction model in Python that helps you identify the factors that drive costs and predict expenses for new cases.
You’ll work with a real-world dataset and apply regression algorithms like Linear Regression and Random Forest Regressor. The process includes data cleaning, exploratory data analysis, feature engineering, and model evaluation.
Launch your data science career with upGrad's cutting-edge Online Data Science Courses. Master Python, Machine Learning, AI, SQL, and Tableau. Taught by industry leaders, these courses equip you with highly sought-after, job-ready skills. Don't wait – enroll today!
Enhance your theoretical understanding and practical skills by exploring our premier Python Data Science Projects, designed to launch your project development journey.
Also Read - Top 35 Linear Regression Projects in Machine Learning With Source Code
Learn data science with upGrad's industry-led, top-ranked courses for direct mentorship and career guidance.
To develop and evaluate the medical cost prediction model, you'll use core Python tools for data processing, visualization, and regression modeling:
Tool / Library | Purpose
--- | ---
Python | The main programming language for scripting and implementing the pipeline
Google Colab | Cloud-based platform for running notebooks with free GPU/TPU support
Pandas | Loads the insurance dataset and handles data manipulation and preprocessing
NumPy | Supports numerical operations for arrays and model input preparation
Matplotlib / Seaborn | Visualizes data trends, feature relationships, and model insights
scikit-learn | Performs data splitting, encoding, model training, and performance evaluation
LinearRegression | A baseline model for predicting continuous medical costs
RandomForestRegressor | A powerful ensemble model that captures non-linear relationships
MAE, MSE, R² Score | Evaluation metrics to measure prediction accuracy and model reliability
Also Read - Decision Tree vs Random Forest: Use Cases & Performance Metrics
To build an effective Medical Cost Prediction model, you’ll apply core regression techniques, Linear Regression and Random Forest Regressor, to estimate healthcare expenses from patient attributes, and evaluate them with MAE, MSE, and the R² score (a quick sketch of these metrics follows below).
Also Read - Evaluation Metrics in Machine Learning: Top 10 Metrics You Should Know
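To make those three metrics concrete before you use them, here is a minimal NumPy sketch of what MAE, MSE, and the R² score compute. The sample values below are made up purely for illustration:
import numpy as np
# Hypothetical actual vs. predicted charges, for illustration only
y_true = np.array([16884.92, 1725.55, 4449.46, 21984.47])
y_pred = np.array([15000.00, 2500.00, 5000.00, 20000.00])
# MAE: average absolute difference between predictions and actual values
mae = np.mean(np.abs(y_true - y_pred))
# MSE: average squared difference; penalizes large errors more heavily
mse = np.mean((y_true - y_pred) ** 2)
# R² score: fraction of the variance in y_true explained by the predictions
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"MAE: {mae:.2f} | MSE: {mse:.2f} | R²: {r2:.4f}")
scikit-learn's mean_absolute_error, mean_squared_error, and r2_score compute the same quantities; you'll use those built-ins in the evaluation step.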
You can complete this Medical Cost Prediction project in about 4 to 5 hours. It is made for beginners and intermediate learners who have basic Python skills and want hands-on experience in solving regression problems using healthcare data.
Developing a Medical Cost Prediction Model: A Step-by-Step Guide
Let’s walk through the steps to build this project from scratch:
Let's get started!
To begin the Medical Cost Prediction project, import all the essential Python libraries for data processing, visualization, modeling, and evaluation.
Here's the code for importing:
# Import necessary libraries
import pandas as pd # For data manipulation and loading
import numpy as np # For numerical operations
import matplotlib.pyplot as plt # For creating plots
import seaborn as sns # For advanced visualizations
from sklearn.model_selection import train_test_split # For splitting data
from sklearn.linear_model import LinearRegression # For linear regression modeling
from sklearn.ensemble import RandomForestRegressor # For training a Random Forest model
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score # For evaluation
Also Read - Libraries in Python Explained: List of Important Libraries
Now, load the dataset that contains medical insurance cost records. The code for this step is below:
# Make sure the 'insurance.csv' file is in the same directory as this script.
try:
    df = pd.read_csv('insurance.csv')
    print("Dataset loaded successfully!")
except FileNotFoundError:
    print("Error: 'insurance.csv' not found. Please make sure the file is in the correct directory.")
    exit()
Also Read - Data Cleaning Techniques: 15 Simple & Effective Ways To Clean Data
In this step, you’ll examine the dataset to understand its structure, detect missing values, and review basic statistics. This helps identify any data issues before preprocessing.
# Get a first look at the data and understand its structure.
print("--- Data Overview ---")
# Display the first 5 rows of the dataframe
print("First 5 rows of the dataset:")
print(df.head())
print("\n" + "="*50 + "\n")
# Get information about the columns, data types, and non-null values
print("Dataset Information:")
df.info()
print("\n" + "="*50 + "\n")
# Get statistical summary for numerical columns
print("Statistical Summary of Numerical Features:")
print(df.describe())
print("\n" + "="*50 + "\n")
# Check for any missing values in the dataset
print("Missing Values Check:")
print(df.isnull().sum())
print("\n" + "="*50 + "\n")
Output:
--- Data Overview ---
First 5 rows of the dataset:
age sex bmi children smoker region charges
0 19 female 27.900 0 yes southwest 16884.92400
1 18 male 33.770 1 no southeast 1725.55230
2 28 male 33.000 3 no southeast 4449.46200
3 33 male 22.705 0 no northwest 21984.47061
4 32 male 28.880 0 no northwest 3866.85520
==================================================
Dataset Information:
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null int64
1 sex 1338 non-null object
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 charges 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
==================================================
Statistical Summary of Numerical Features:
age bmi children charges
count 1338.000000 1338.000000 1338.000000 1338.000000
mean 39.207025 30.663397 1.094918 13270.422265
std 14.049960 6.098187 1.205493 12110.011237
min 18.000000 15.960000 0.000000 1121.873900
25% 27.000000 26.296250 0.000000 4740.287150
50% 39.000000 30.400000 1.000000 9382.033000
75% 51.000000 34.693750 2.000000 16639.912515
max 64.000000 53.130000 5.000000 63770.428010
==================================================
Missing Values Check:
age 0
sex 0
bmi 0
children 0
smoker 0
region 0
charges 0
dtype: int64
Also Read - Comprehensive Guide to Exploratory Data Analysis (EDA) in 2025: Tools, Types, and Best Practices
This step helps you explore relationships and trends between features and the target variable (charges) through clear and insightful plots.
The code for this step is as follows:
# Visualizing the data helps in understanding the relationships between features.
print("--- Generating Visualizations ---")
# Set the style for the plots
sns.set_style('whitegrid')
plt.figure(figsize=(12, 6))
# a) Distribution of the target variable 'charges'
sns.histplot(df['charges'], kde=True, bins=30)
plt.title('Distribution of Medical Charges')
plt.xlabel('Charges')
plt.ylabel('Frequency')
plt.show()
# Observation: Charges are right-skewed with a long tail on the higher end.
# b) Medical Charges by Smoking Status
plt.figure(figsize=(8, 6))
sns.boxplot(x='smoker', y='charges', data=df)
plt.title('Medical Charges by Smoking Status')
plt.show()
# Insight: Smokers tend to incur significantly higher charges.
# c) Medical Charges by Sex
plt.figure(figsize=(8, 6))
sns.boxplot(x='sex', y='charges', data=df)
plt.title('Medical Charges by Sex')
plt.show()
# d) Medical Charges by Region
plt.figure(figsize=(10, 7))
sns.boxplot(x='region', y='charges', data=df)
plt.title('Medical Charges by Region')
plt.show()
# e) Correlation Heatmap (after encoding categorical variables)
df_encoded = df.copy()
df_encoded['sex'] = df_encoded['sex'].astype('category').cat.codes
df_encoded['smoker'] = df_encoded['smoker'].astype('category').cat.codes
df_encoded['region'] = df_encoded['region'].astype('category').cat.codes
plt.figure(figsize=(10, 8))
sns.heatmap(df_encoded.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of All Features')
plt.show()
# Insight: 'age' and 'smoker' have the strongest correlation with charges.
# f) Scatter Plot: Age vs. Charges by Smoking Status
plt.figure(figsize=(12, 8))
sns.scatterplot(x='age', y='charges', hue='smoker', data=df, palette=['blue', 'red'], alpha=0.7)
plt.title('Age vs. Charges by Smoking Status')
plt.show()
# Insight: Smokers tend to have higher charges across all age groups.
Before feeding data into a machine learning model, we must convert categorical variables into a numerical format so the model can interpret them.
print("--- Preprocessing Data for Modeling ---")
# Apply one-hot encoding to convert categorical variables into binary format
# This creates new columns like 'sex_male', 'smoker_yes', 'region_northwest', etc.
# Setting drop_first=True to prevent multicollinearity
df_processed = pd.get_dummies(df, columns=['sex', 'smoker', 'region'], drop_first=True)
# Display the first few rows of the processed dataset
print("Dataset after one-hot encoding:")
print(df_processed.head())
print("\n" + "="*50 + "\n")
Output:
--- Preprocessing Data for Modeling ---
Dataset after one-hot encoding:
age bmi children charges sex_male smoker_yes region_northwest \
0 19 27.900 0 16884.92400 False True False
1 18 33.770 1 1725.55230 True False False
2 28 33.000 3 4449.46200 True False False
3 33 22.705 0 21984.47061 True False True
4 32 28.880 0 3866.85520 True False True
region_southeast region_southwest
0 False True
1 True False
2 True False
3 False False
4 False False
This preprocessing step prepares your data for building regression models effectively.
In this step, you'll train two regression models, Linear Regression and Random Forest Regressor, then evaluate their performance on test data.
Here is the code for training and evaluating model performance:
# a) Define features (X) and target (y)
X = df_processed.drop('charges', axis=1)
y = df_processed['charges']
# b) Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")
print("\n" + "="*50 + "\n")
# --- Model 1: Linear Regression ---
print("--- Training Linear Regression Model ---")
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
# Make predictions on the test set
y_pred_lr = lr_model.predict(X_test)
# Evaluate the model
print("Linear Regression - Model Evaluation:")
print(f"Mean Absolute Error (MAE): {mean_absolute_error(y_test, y_pred_lr):.2f}")
print(f"Mean Squared Error (MSE): {mean_squared_error(y_test, y_pred_lr):.2f}")
print(f"R-squared (R2 Score): {r2_score(y_test, y_pred_lr):.4f}")
print("\n" + "="*50 + "\n")
# --- Model 2: Random Forest Regressor ---
print("--- Training Random Forest Regressor Model ---")
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Make predictions on the test set
y_pred_rf = rf_model.predict(X_test)
# Evaluate the model
print("Random Forest - Model Evaluation:")
print(f"Mean Absolute Error (MAE): {mean_absolute_error(y_test, y_pred_rf):.2f}")
print(f"Mean Squared Error (MSE): {mean_squared_error(y_test, y_pred_rf):.2f}")
print(f"R-squared (R2 Score): {r2_score(y_test, y_pred_rf):.4f}")
print("\n" + "="*50 + "\n")
Output:
Training set size: 1070 samples
Test set size: 268 samples
==================================================
--- Training Linear Regression Model ---
Linear Regression - Model Evaluation:
Mean Absolute Error (MAE): 4181.19
Mean Squared Error (MSE): 33596915.85
R-squared (R2 Score): 0.7836
==================================================
--- Training Random Forest Regressor Model ---
Random Forest - Model Evaluation:
Mean Absolute Error (MAE): 2550.08
Mean Squared Error (MSE): 20942520.92
R-squared (R2 Score): 0.8651
Why two models? Linear Regression provides a fast, interpretable baseline for continuous targets, while Random Forest is an ensemble that can capture non-linear relationships and feature interactions. Comparing the two shows how much predictive accuracy the extra model complexity actually buys you.
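As a quick sanity check beyond a single train/test split, you could also compare both models with 5-fold cross-validation. This is a minimal sketch, assuming X, y, and the model imports from the earlier steps:
from sklearn.model_selection import cross_val_score
# Compare both models using 5-fold cross-validated R² scores
for name, model in [
    ("Linear Regression", LinearRegression()),
    ("Random Forest", RandomForestRegressor(n_estimators=100, random_state=42)),
]:
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    print(f"{name}: mean R² = {scores.mean():.4f} (std = {scores.std():.4f})")
Averaging over five folds reduces the chance that one lucky (or unlucky) split drives your conclusion about which model is better.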
Also explore this project - Airline Passenger Traffic Analysis Project Using Python
Use these plots to visually assess how well each model performs by comparing actual vs. predicted charges and analyzing the distribution of residuals.
# a) Linear Regression: Actual vs. Predicted
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_lr, alpha=0.7, edgecolors='k')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Charges')
plt.ylabel('Predicted Charges')
plt.title('Linear Regression: Actual vs. Predicted Charges')
plt.show()
# b) Linear Regression: Residuals Plot
residuals_lr = y_test - y_pred_lr
plt.figure(figsize=(10, 6))
plt.scatter(y_pred_lr, residuals_lr, alpha=0.7, edgecolors='k')
plt.hlines(0, xmin=y_pred_lr.min(), xmax=y_pred_lr.max(), colors='red', linestyles='--')
plt.xlabel('Predicted Charges')
plt.ylabel('Residuals (Actual - Predicted)')
plt.title('Linear Regression: Residuals Plot')
plt.show()
# c) Random Forest: Actual vs. Predicted
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_rf, alpha=0.7, edgecolors='k')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Charges')
plt.ylabel('Predicted Charges')
plt.title('Random Forest: Actual vs. Predicted Charges')
plt.show()
# d) Random Forest: Residuals Plot
residuals_rf = y_test - y_pred_rf
plt.figure(figsize=(10, 6))
plt.scatter(y_pred_rf, residuals_rf, alpha=0.7, edgecolors='k')
plt.hlines(0, xmin=y_pred_rf.min(), xmax=y_pred_rf.max(), colors='red', linestyles='--')
plt.xlabel('Predicted Charges')
plt.ylabel('Residuals (Actual - Predicted)')
plt.title('Random Forest: Residuals Plot')
plt.show()
Output: four diagnostic plots, an actual vs. predicted scatter plot and a residuals plot for each model.
Random Forest Regressor significantly outperforms Linear Regression on both error (lower MAE and MSE) and fit (higher R² score). This suggests Random Forest is more suitable for predicting medical insurance charges, likely because it captures the non-linear relationships and feature interactions, such as the combined effect of age and smoking status, that a linear model misses.
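To see which attributes drive the predictions, you could inspect the trained Random Forest's feature importances. A short sketch, assuming rf_model and X from the training step above:
# Rank features by how much the Random Forest relies on them when splitting
importances = pd.Series(rf_model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
Based on the EDA above, you would expect smoker_yes, bmi, and age to rank near the top.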
In this Medical Cost Prediction project, we trained and compared Linear Regression and Random Forest Regressor models. The Linear Regression model delivered only moderate performance, with larger prediction errors and visibly biased residuals. The Random Forest model achieved higher accuracy, lower errors, and a higher R-squared value; its predictions aligned closely with actual values, and its randomly distributed residuals indicate good generalization.
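Finally, to put the trained model to work on a new case, here is a minimal sketch of predicting charges for a hypothetical patient (the values are invented for illustration; it assumes rf_model and X from the steps above, and column names matching the one-hot encoded training data):
# Hypothetical new patient: 40-year-old male smoker, BMI 31, 2 children, southeast region
new_patient = pd.DataFrame([{
    'age': 40, 'bmi': 31.0, 'children': 2,
    'sex_male': True, 'smoker_yes': True,
    'region_northwest': False, 'region_southeast': True, 'region_southwest': False,
}], columns=X.columns)  # reuse the training column order
predicted_charge = rf_model.predict(new_patient)[0]
print(f"Predicted medical charges: {predicted_charge:,.2f}")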
Unlock the power of data with our popular Data Science courses, designed to make you proficient in analytics, machine learning, and big data!
Elevate your career by learning essential Data Science skills such as statistical modeling, big data processing, predictive analytics, and SQL!
Stay informed and inspired with our popular Data Science articles, offering expert insights, trends, and practical tips for aspiring data professionals!
Reference:
https://colab.research.google.com/drive/1TfDKBj1PBRBCIJzNV1qIktMVtZyxtRpr?usp=sharing