House Price Prediction Using Regression Algorithms
By Rohit Sharma
Updated on Aug 01, 2025 | 11 min read | 1.28K+ views
Buying or selling a house depends on valuing it correctly. But prices vary widely, even within a single locality, because of factors such as location, size, and condition.
The objective of this project is to predict house prices using machine learning. Regression algorithms such as Linear Regression, Random Forest, and Gradient Boosting learn patterns from past housing data and estimate the price of a house from its characteristics. This reduces guesswork in real estate decisions and makes those decisions easier to justify.
It is better to have at least some background in:
For this project, the following tools and libraries will be used:
| Tool/Library | Purpose |
| --- | --- |
| Python | Programming language used to build the model |
| Pandas | Handling, exploring, and cleaning tabular data |
| NumPy | Performing numerical operations and working with arrays |
| Matplotlib | Creating basic plots and visualizations |
| Seaborn | Creating advanced and aesthetic visualizations |
| Scikit-learn | Implementing regression models and evaluating performance |
To solve the house price prediction problem, we will train and compare five regression models: Linear Regression, Ridge Regression, Lasso Regression, Random Forest, and XGBoost.
You can complete this house price regression project in about 1.5 to 2 hours. It suits beginner to intermediate learners.
Let’s start building the project from scratch. We will start by:
Without any further delay, let’s start!
To build the house price prediction model, we will use the dataset available on Kaggle. Follow the steps mentioned below to download the dataset:
Now that you have downloaded the file, upload and load the dataset in Google Colab using the code below:
# Step 1: Import necessary libraries
import pandas as pd
# Step 2: Upload and read the train dataset
from google.colab import files
uploaded = files.upload() # Choose train.csv when prompted
# Load dataset into a DataFrame
train_data = pd.read_csv("train.csv")
# Display first 5 rows
print("First 5 rows of the dataset:")
print(train_data.head())
Output:
Saving train.csv to train (1).csv
Saving test.csv to test (1).csv
First 5 rows of the dataset:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \
0 1 60 RL 65.0 8450 Pave NaN Reg
1 2 20 RL 80.0 9600 Pave NaN Reg
2 3 60 RL 68.0 11250 Pave NaN IR1
3 4 70 RL 60.0 9550 Pave NaN IR1
4 5 60 RL 84.0 14260 Pave NaN IR1
LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold \
0 Lvl AllPub ... 0 NaN NaN NaN 0 2
1 Lvl AllPub ... 0 NaN NaN NaN 0 5
2 Lvl AllPub ... 0 NaN NaN NaN 0 9
3 Lvl AllPub ... 0 NaN NaN NaN 0 2
4 Lvl AllPub ... 0 NaN NaN NaN 0 12
YrSold SaleType SaleCondition SalePrice
0 2008 WD Normal 208500
1 2007 WD Normal 181500
2 2008 WD Normal 223500
3 2006 WD Abnorml 140000
4 2008 WD Normal 250000
[5 rows x 81 columns]
Now that the dataset is loaded, let's look at its shape, columns, and data types before we clean or preprocess it. This helps us identify data types, spot missing values, and plan the preprocessing steps.
Use the code below to do so:
# Step 3: Understand the Data Structure
# Check the shape of the dataset
print("Shape of training data:", train_data.shape)
# Get information about columns, non-null values, and data types
print("\nData Info:")
train_data.info()
# View summary statistics for numeric columns
print("\nStatistical Summary:")
print(train_data.describe())
Output:
Shape of training data: (1460, 81)
Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 MSZoning 1460 non-null object
3 LotFrontage 1201 non-null float64
4 LotArea 1460 non-null int64
5 Street 1460 non-null object
6 Alley 91 non-null object
7 LotShape 1460 non-null object
8 LandContour 1460 non-null object
9 Utilities 1460 non-null object
10 LotConfig 1460 non-null object
11 LandSlope 1460 non-null object
12 Neighborhood 1460 non-null object
13 Condition1 1460 non-null object
14 Condition2 1460 non-null object
15 BldgType 1460 non-null object
16 HouseStyle 1460 non-null object
17 OverallQual 1460 non-null int64
18 OverallCond 1460 non-null int64
19 YearBuilt 1460 non-null int64
20 YearRemodAdd 1460 non-null int64
21 RoofStyle 1460 non-null object
22 RoofMatl 1460 non-null object
23 Exterior1st 1460 non-null object
24 Exterior2nd 1460 non-null object
25 MasVnrType 588 non-null object
26 MasVnrArea 1452 non-null float64
27 ExterQual 1460 non-null object
28 ExterCond 1460 non-null object
29 Foundation 1460 non-null object
30 BsmtQual 1423 non-null object
31 BsmtCond 1423 non-null object
32 BsmtExposure 1422 non-null object
33 BsmtFinType1 1423 non-null object
34 BsmtFinSF1 1460 non-null int64
35 BsmtFinType2 1422 non-null object
36 BsmtFinSF2 1460 non-null int64
37 BsmtUnfSF 1460 non-null int64
38 TotalBsmtSF 1460 non-null int64
39 Heating 1460 non-null object
40 HeatingQC 1460 non-null object
41 CentralAir 1460 non-null object
42 Electrical 1459 non-null object
43 1stFlrSF 1460 non-null int64
44 2ndFlrSF 1460 non-null int64
45 LowQualFinSF 1460 non-null int64
46 GrLivArea 1460 non-null int64
47 BsmtFullBath 1460 non-null int64
48 BsmtHalfBath 1460 non-null int64
49 FullBath 1460 non-null int64
50 HalfBath 1460 non-null int64
51 BedroomAbvGr 1460 non-null int64
52 KitchenAbvGr 1460 non-null int64
53 KitchenQual 1460 non-null object
54 TotRmsAbvGrd 1460 non-null int64
55 Functional 1460 non-null object
56 Fireplaces 1460 non-null int64
57 FireplaceQu 770 non-null object
58 GarageType 1379 non-null object
59 GarageYrBlt 1379 non-null float64
60 GarageFinish 1379 non-null object
61 GarageCars 1460 non-null int64
62 GarageArea 1460 non-null int64
63 GarageQual 1379 non-null object
64 GarageCond 1379 non-null object
65 PavedDrive 1460 non-null object
66 WoodDeckSF 1460 non-null int64
67 OpenPorchSF 1460 non-null int64
68 EnclosedPorch 1460 non-null int64
69 3SsnPorch 1460 non-null int64
70 ScreenPorch 1460 non-null int64
71 PoolArea 1460 non-null int64
72 PoolQC 7 non-null object
73 Fence 281 non-null object
74 MiscFeature 54 non-null object
75 MiscVal 1460 non-null int64
76 MoSold 1460 non-null int64
77 YrSold 1460 non-null int64
78 SaleType 1460 non-null object
79 SaleCondition 1460 non-null object
80 SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
Statistical Summary:
Id MSSubClass LotFrontage LotArea OverallQual \
count 1460.000000 1460.000000 1201.000000 1460.000000 1460.000000
mean 730.500000 56.897260 70.049958 10516.828082 6.099315
std 421.610009 42.300571 24.284752 9981.264932 1.382997
min 1.000000 20.000000 21.000000 1300.000000 1.000000
25% 365.750000 20.000000 59.000000 7553.500000 5.000000
50% 730.500000 50.000000 69.000000 9478.500000 6.000000
75% 1095.250000 70.000000 80.000000 11601.500000 7.000000
max 1460.000000 190.000000 313.000000 215245.000000 10.000000
OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 ... \
count 1460.000000 1460.000000 1460.000000 1452.000000 1460.000000 ...
mean 5.575342 1971.267808 1984.865753 103.685262 443.639726 ...
std 1.112799 30.202904 20.645407 181.066207 456.098091 ...
min 1.000000 1872.000000 1950.000000 0.000000 0.000000 ...
25% 5.000000 1954.000000 1967.000000 0.000000 0.000000 ...
50% 5.000000 1973.000000 1994.000000 0.000000 383.500000 ...
75% 6.000000 2000.000000 2004.000000 166.000000 712.250000 ...
max 9.000000 2010.000000 2010.000000 1600.000000 5644.000000 ...
WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch \
count 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000
mean 94.244521 46.660274 21.954110 3.409589 15.060959
std 125.338794 66.256028 61.119149 29.317331 55.757415
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 25.000000 0.000000 0.000000 0.000000
75% 168.000000 68.000000 0.000000 0.000000 0.000000
max 857.000000 547.000000 552.000000 508.000000 480.000000
PoolArea MiscVal MoSold YrSold SalePrice
count 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000
mean 2.758904 43.489041 6.321918 2007.815753 180921.195890
std 40.177307 496.123024 2.703626 1.328095 79442.502883
min 0.000000 0.000000 1.000000 2006.000000 34900.000000
25% 0.000000 0.000000 5.000000 2007.000000 129975.000000
50% 0.000000 0.000000 6.000000 2008.000000 163000.000000
75% 0.000000 0.000000 8.000000 2009.000000 214000.000000
max 738.000000 15500.000000 12.000000 2010.000000 755000.000000
[8 rows x 38 columns]
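The non-null counts in the `info()` output can be condensed into a per-column summary of missing values, which makes it easier to decide which columns to drop and which to fill. The sketch below is one way to do that. `missing_summary` is a hypothetical helper that is not part of the original notebook, and the toy frame with made-up values stands in for `train_data`.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only the columns that actually have gaps
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("missing", ascending=False)

# Toy frame standing in for train_data (values are illustrative only)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
})
report = missing_summary(toy)
print(report)
```

Calling `missing_summary(train_data)` on the real dataset should list the same columns that show fewer than 1460 non-null values above.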
What does the output mean?
After exploring the dataset, we found that some columns contain missing values. Most scikit-learn models cannot train on missing values, and poorly handled gaps can hurt model performance, so let's deal with them before moving on.
Use the code given below to accomplish the same:
# Step 4: Handle Missing Values and Prepare the Data
import pandas as pd
# 1. Load the dataset (make sure train.csv is uploaded in Colab)
train_df = pd.read_csv('/content/train.csv')
# 2. Drop columns with too many missing values
train_df = train_df.drop(columns=['Alley', 'PoolQC', 'Fence', 'MiscFeature'])
# 3. Fill missing numerical values with the column median
num_cols = ['LotFrontage', 'GarageYrBlt', 'MasVnrArea']
for col in num_cols:
    train_df[col] = train_df[col].fillna(train_df[col].median())
# 4. Fill missing categorical values with the column mode (most frequent value)
cat_cols = ['MasVnrType', 'Electrical',
            'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
            'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']
for col in cat_cols:
    train_df[col] = train_df[col].fillna(train_df[col].mode()[0])
# Note: FireplaceQu (770 non-null) is not filled here. Its missing rows
# become all-zero dummy columns in the one-hot encoding step that follows.
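In this dataset's data description, a missing value in columns such as GarageType or BsmtQual generally means the house has no garage or basement, not that the value is unknown. Under that assumption, an alternative to mode imputation is to fill those columns with an explicit "None" category so the model can see the absence. The sketch below shows the idea on toy data. `fill_absent` is a hypothetical helper, not something the original notebook uses, and the reported scores in this article come from mode imputation.

```python
import numpy as np
import pandas as pd

def fill_absent(df: pd.DataFrame, cols, label: str = "None") -> pd.DataFrame:
    """Return a copy of df where NaN in the given columns becomes its own category."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].fillna(label)
    return out

# Toy rows standing in for train_df (values are illustrative only)
toy = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd"],
    "BsmtQual": ["Gd", "TA", np.nan],
})
filled = fill_absent(toy, ["GarageType", "BsmtQual"])
print(filled)
```

Whether this beats mode imputation on the real data is an empirical question; comparing both on the same train/test split is the simplest way to find out.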
Machine learning algorithms require numerical input, but many columns in our dataset are categorical (for example, Neighborhood and GarageType). To make them usable, we convert them to numbers: label encoding for columns with exactly two categories (mapped to 0 and 1), and one-hot encoding for the remaining columns (one 0/1 column per category).
Use the code below to do this:
from sklearn.preprocessing import LabelEncoder
# Identify all object (categorical) columns
categorical_cols = train_df.select_dtypes(include=['object']).columns
# Apply Label Encoding for columns with only two categories
label_enc = LabelEncoder()
for col in categorical_cols:
    if train_df[col].nunique() == 2:
        train_df[col] = label_enc.fit_transform(train_df[col])
# Apply One-Hot Encoding for remaining categorical columns
train_df = pd.get_dummies(train_df, columns=[col for col in categorical_cols if train_df[col].nunique() > 2])
Now we have a fully numeric and cleaned dataset.
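One practical detail with one-hot encoding: data you want to predict on later (for example Kaggle's test.csv) may lack some categories seen during training or contain new ones, so its dummy columns will not match the training columns. A minimal sketch of one way to handle this is to reindex the new data to the training columns. The frames below are toy data with illustrative values, and this step is not part of the original notebook.

```python
import pandas as pd

# Toy training data and new data; "Veenker" does not occur in the training rows
train = pd.DataFrame({"Neighborhood": ["NAmes", "CollgCr", "OldTown"]})
new = pd.DataFrame({"Neighborhood": ["CollgCr", "Veenker"]})

train_enc = pd.get_dummies(train, columns=["Neighborhood"], dtype=int)
new_enc = pd.get_dummies(new, columns=["Neighborhood"], dtype=int)

# Force the new data to have exactly the training columns, in the same order.
# Categories missing from the new data are filled with 0;
# categories unseen in training are dropped.
new_aligned = new_enc.reindex(columns=train_enc.columns, fill_value=0)
print(new_aligned)
```

A row whose category was never seen in training ends up with zeros in every dummy column, which is the same treatment `get_dummies` gives to missing values by default.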
Before we start training, we have to split the dataset into input features (X) and the target variable (y) so the model can learn the relationship between the inputs and the house price. We then hold out 20% of the rows as a test set and standardize the features (zero mean, unit variance). Standardization matters most for scale-sensitive models such as Ridge and Lasso regression; tree-based models are largely unaffected by it.
Here is the code to accomplish the same:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# 1. Separate features and target
X = train_df.drop('SalePrice', axis=1)
y = train_df['SalePrice']
# 2. Split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 3. Standardize the features using StandardScaler (fit on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
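The scaler above is fitted on the training split and then reused, unchanged, on the test split, so no information from the test rows leaks into preprocessing. The sketch below illustrates what that means on synthetic data. The numbers are made up and are not the housing dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (think: living area, room count)
rng = np.random.default_rng(0)
X = rng.normal(loc=[1500.0, 3.0], scale=[500.0, 1.0], size=(200, 2))
y = rng.normal(size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)  # mean and std learned from training rows only
X_te_s = scaler.transform(X_te)      # the same mean and std reused on test rows

# Training columns now have mean ~0 and standard deviation ~1
print(X_tr_s.mean(axis=0).round(6), X_tr_s.std(axis=0).round(6))
# Test columns are close to, but not exactly, mean 0 and std 1
print(X_te_s.mean(axis=0).round(3), X_te_s.std(axis=0).round(3))
```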
Now that the dataset is ready, we will train five regression models to predict house prices: Linear Regression, Ridge Regression, Lasso Regression, Random Forest, and XGBoost.
We will also use R² Score and Root Mean Squared Error (RMSE) as evaluation metrics.
Here is the code to do so:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Dictionary to store results
results = {}
# 1. Linear Regression
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
lr_preds = lr.predict(X_test_scaled)
results['Linear Regression'] = {
    'R2 Score': r2_score(y_test, lr_preds),
    'RMSE': np.sqrt(mean_squared_error(y_test, lr_preds))
}
# 2. Ridge Regression
ridge = Ridge()
ridge.fit(X_train_scaled, y_train)
ridge_preds = ridge.predict(X_test_scaled)
results['Ridge Regression'] = {
    'R2 Score': r2_score(y_test, ridge_preds),
    'RMSE': np.sqrt(mean_squared_error(y_test, ridge_preds))
}
# 3. Lasso Regression
lasso = Lasso()
lasso.fit(X_train_scaled, y_train)
lasso_preds = lasso.predict(X_test_scaled)
results['Lasso Regression'] = {
    'R2 Score': r2_score(y_test, lasso_preds),
    'RMSE': np.sqrt(mean_squared_error(y_test, lasso_preds))
}
# 4. Random Forest
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train_scaled, y_train)
rf_preds = rf.predict(X_test_scaled)
results['Random Forest'] = {
    'R2 Score': r2_score(y_test, rf_preds),
    'RMSE': np.sqrt(mean_squared_error(y_test, rf_preds))
}
# 5. XGBoost
xgb = XGBRegressor(random_state=42)
xgb.fit(X_train_scaled, y_train)
xgb_preds = xgb.predict(X_test_scaled)
results['XGBoost'] = {
    'R2 Score': r2_score(y_test, xgb_preds),
    'RMSE': np.sqrt(mean_squared_error(y_test, xgb_preds))
}
# Print the evaluation results (pandas was already imported as pd earlier)
results_df = pd.DataFrame(results).T
print(results_df)
Output:
R2 Score RMSE
Linear Regression 0.893465 28585.921223
Ridge Regression 0.893739 28549.160746
Lasso Regression 0.894667 28424.249288
Random Forest 0.890649 28961.361871
XGBoost 0.902643 27326.884052
What does the output mean?
Now that we have seen the performance of all the models, let’s quickly compare their results and understand how they performed on the test data:
| Model | R² Score | RMSE |
| --- | --- | --- |
| XGBoost | 0.9026 | 27,326.88 |
| Lasso Regression | 0.8947 | 28,424.25 |
| Ridge Regression | 0.8937 | 28,549.16 |
| Linear Regression | 0.8935 | 28,585.92 |
| Random Forest | 0.8906 | 28,961.36 |
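To make the two metrics in the table concrete, the sketch below computes them by hand with NumPy and checks the result against scikit-learn. RMSE is the square root of the mean squared residual, in the same units as the price. R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The four prices are made-up values chosen for easy arithmetic, not rows from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative actual and predicted prices (not taken from the dataset)
y_true = np.array([200000.0, 150000.0, 300000.0, 250000.0])
y_pred = np.array([210000.0, 140000.0, 290000.0, 260000.0])

residuals = y_true - y_pred

# RMSE: typical size of a prediction error, in price units
rmse = np.sqrt(np.mean(residuals ** 2))

# R2: share of the variance in y_true that the predictions explain
ss_res = np.sum(residuals ** 2)                  # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(rmse, r2)  # 10000.0 and roughly 0.968

# Both values agree with the scikit-learn functions used in the article
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
assert np.isclose(r2, r2_score(y_true, y_pred))
```

Read this way, an RMSE of about 27,000 means the best model's predictions are typically off by roughly that amount in the dataset's price units.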
In this project, we built regression models to predict house prices using a Kaggle dataset. After data cleaning, encoding, and model training, we compared five algorithms. Among them, XGBoost delivered the best results, with an R² score of 0.90 and the lowest RMSE of 27,326.88.
This suggests that gradient-boosted ensembles like XGBoost can capture complex relationships in housing data better than simple linear models. The margin is modest, though: the three linear models scored between 0.893 and 0.895, and Random Forest, which is also an ensemble, scored slightly lower at 0.891.
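Because the scores above come from a single 80/20 split, small differences between models may partly reflect which rows landed in the test set. One common way to check this is k-fold cross-validation. The sketch below shows the mechanics on synthetic data from `make_regression`. It is not run on the housing data, it uses only two scikit-learn models for brevity, and the resulting numbers say nothing about the housing results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem standing in for the prepared housing features
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

# Five folds: each row is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Ridge": Ridge(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    cv_scores[name] = scores
    print(f"{name}: mean R2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

If two models' mean scores differ by less than the spread across folds, the ranking between them should be treated as tentative.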
Colab Link:
https://colab.research.google.com/drive/1LdicyK51qMh6S_WPCWAJvsixGvaMchoS?usp=sharing
The objective is to predict property prices based on various features like location, number of bedrooms, size, and amenities using machine learning algorithms.
The Ames Housing Dataset and Kaggle’s House Prices – Advanced Regression Techniques dataset are commonly used for building predictive models.
Popular models include Linear Regression, Random Forest, XGBoost, Gradient Boosting, and Lasso Regression. Ensemble methods often yield better accuracy.
Key features include square footage, location (zipcode or neighborhood), number of rooms, year built, garage area, and quality of construction.
You can use Python with libraries such as Pandas, NumPy, Seaborn, Scikit-learn, XGBoost, and Matplotlib for data handling, modeling, and visualization.
Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...