House Price Prediction Using Regression Algorithms

Updated on Aug 01, 2025 | 11 min read | 1.51K+ views

Table of Contents

What Should You Know Beforehand?
Technologies and Libraries Used
Models That Will Be Utilized for Learning
Time Taken and Difficulty
How to Build the House Price Prediction Model
Conclusion

Buying or selling a house largely hinges on valuing it correctly in the market. Yet prices vary widely even within a locality, depending on factors such as location, size, and condition.

The prime objective of this project is to predict house prices using machine learning. Regression algorithms like Linear Regression, Random Forest, and Gradient Boosting are used to learn patterns from past housing data and estimate the price of a house based on its characteristics. This minimizes guesswork in making real estate decisions, thereby increasing confidence.


What Should You Know Beforehand?

It helps to have at least some background in:

Python basics – variables, loops, functions.
Pandas and NumPy – for handling and analyzing data.
Matplotlib and Seaborn – for visualizing patterns.
Machine learning concepts – especially regression and model evaluation.
Scikit-learn library – for building and testing regression models.

Technologies and Libraries Used

For this project, the following tools and libraries will be used:

Tool/Library	Purpose
Python	Programming language used to build the model
Pandas	Handling, exploring, and cleaning tabular data
NumPy	Performing numerical operations and working with arrays
Matplotlib	Creating basic plots and visualizations
Seaborn	Creating advanced and aesthetic visualizations
Scikit-learn	Implementing regression models and evaluating performance

Models That Will Be Utilized for Learning

To solve the house price prediction problem, we will use five popular regression models.

Linear Regression is the simplest. It assumes a linear relationship between the input features and the house price. It is easy to interpret but may fail to capture complex relationships.
Ridge Regression is a regularized form of linear regression. It penalizes large coefficients (an L2 penalty), which helps the model generalize and reduces overfitting.
Lasso Regression behaves like Ridge with one difference: its L1 penalty can shrink the coefficients of less important features to exactly zero. Hence, it can perform automatic feature selection and simplify the model.
Random Forest Regressor is more powerful. It builds many decision trees and averages their outputs, which reduces overfitting and usually improves accuracy.
XGBoost Regressor works differently. It builds trees sequentially, with each tree correcting the errors of the trees before it. It can be slower to train, but it often gives strong results on structured data.
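To make the difference between Ridge and Lasso concrete, here is a minimal sketch on synthetic data (not the housing dataset). The data, the alpha values, and the variable names are assumptions chosen for illustration. It counts how many coefficients each model sets to exactly zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 10 features, only 3 informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Alpha values are arbitrary picks for this sketch, not tuned settings.
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=5.0).fit(X, y)

# Count coefficients that are exactly zero in each fitted model.
ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))
print("Ridge coefficients set to zero:", ridge_zeros)
print("Lasso coefficients set to zero:", lasso_zeros)
```

Ridge shrinks coefficients toward zero without eliminating them, while Lasso typically drops the uninformative features entirely.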

Time Taken and Difficulty

You can complete this house price regression project in about 1.5 to 2 hours. It is suitable for beginner to intermediate learners.

How to Build the House Price Prediction Model

Let’s build the project from scratch. We will:

Load and explore the dataset
Handle missing values and encode categorical features
Visualize important relationships and correlations
Train and evaluate regression models
Compare the results to find the best model

Without any further delay, let’s start!

Step 1: Download the Dataset

To build the house price prediction model, we will use the dataset available on Kaggle. Follow the steps mentioned below to download the dataset:

Open a new tab in any web browser.
Type https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data.
On the House Prices - Advanced Regression Techniques page, in the right pane, under the Data Explorer section, click test.csv.
Click the download icon.
Click train.csv.
Click the download icon.

Step 2: Upload and Load the Dataset in Google Colab

Now that you have downloaded the file, upload and load the dataset in Google Colab using the code below:

# Import pandas and the Colab upload helper
import pandas as pd
from google.colab import files

# Upload the dataset; choose train.csv when prompted
uploaded = files.upload()

# Load the dataset into a DataFrame
train_data = pd.read_csv("train.csv")

# Display the first 5 rows
print("First 5 rows of the dataset:")
print(train_data.head())

Output:

Saving train.csv to train (1).csv
Saving test.csv to test (1).csv
First 5 rows of the dataset:

Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \
0 1 60 RL 65.0 8450 Pave NaN Reg
1 2 20 RL 80.0 9600 Pave NaN Reg
2 3 60 RL 68.0 11250 Pave NaN IR1
3 4 70 RL 60.0 9550 Pave NaN IR1
4 5 60 RL 84.0 14260 Pave NaN IR1

LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold \
0 Lvl AllPub ... 0 NaN NaN NaN 0 2
1 Lvl AllPub ... 0 NaN NaN NaN 0 5
2 Lvl AllPub ... 0 NaN NaN NaN 0 9
3 Lvl AllPub ... 0 NaN NaN NaN 0 2
4 Lvl AllPub ... 0 NaN NaN NaN 0 12

YrSold SaleType SaleCondition SalePrice
0 2008 WD Normal 208500
1 2007 WD Normal 181500
2 2008 WD Normal 223500
3 2006 WD Abnorml 140000
4 2008 WD Normal 250000

[5 rows x 81 columns]

Now that the dataset is successfully loaded, let's explore its shape, columns, and data types to plan preprocessing effectively.

Step 3: Understand the Data Structure

Let's take a quick look at its structure before we clean or preprocess the data. Doing so will help us identify data types and spot any missing values.

Use the code below to do so:

### Step 3: Understand the Data Structure

# Check the shape of the dataset
print("Shape of training data:", train_data.shape)

# Get information about columns, non-null values, and data types
print("\nData Info:")
train_data.info()

# View summary statistics for numeric columns
print("\nStatistical Summary:")
print(train_data.describe())

Output:

Shape of training data: (1460, 81)

Data Info:

RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             1460 non-null   int64
 1   MSSubClass     1460 non-null   int64
 2   MSZoning       1460 non-null   object
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64
 5   Street         1460 non-null   object
 6   Alley          91 non-null     object
 7   LotShape       1460 non-null   object
 8   LandContour    1460 non-null   object
 9   Utilities      1460 non-null   object
 10  LotConfig      1460 non-null   object
 11  LandSlope      1460 non-null   object
 12  Neighborhood   1460 non-null   object
 13  Condition1     1460 non-null   object
 14  Condition2     1460 non-null   object
 15  BldgType       1460 non-null   object
 16  HouseStyle     1460 non-null   object
 17  OverallQual    1460 non-null   int64
 18  OverallCond    1460 non-null   int64
 19  YearBuilt      1460 non-null   int64
 20  YearRemodAdd   1460 non-null   int64
 21  RoofStyle      1460 non-null   object
 22  RoofMatl       1460 non-null   object
 23  Exterior1st    1460 non-null   object
 24  Exterior2nd    1460 non-null   object
 25  MasVnrType     588 non-null    object
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object
 28  ExterCond      1460 non-null   object
 29  Foundation     1460 non-null   object
 30  BsmtQual       1423 non-null   object
 31  BsmtCond       1423 non-null   object
 32  BsmtExposure   1422 non-null   object
 33  BsmtFinType1   1423 non-null   object
 34  BsmtFinSF1     1460 non-null   int64
 35  BsmtFinType2   1422 non-null   object
 36  BsmtFinSF2     1460 non-null   int64
 37  BsmtUnfSF      1460 non-null   int64
 38  TotalBsmtSF    1460 non-null   int64
 39  Heating        1460 non-null   object
 40  HeatingQC      1460 non-null   object
 41  CentralAir     1460 non-null   object
 42  Electrical     1459 non-null   object
 43  1stFlrSF       1460 non-null   int64
 44  2ndFlrSF       1460 non-null   int64
 45  LowQualFinSF   1460 non-null   int64
 46  GrLivArea      1460 non-null   int64
 47  BsmtFullBath   1460 non-null   int64
 48  BsmtHalfBath   1460 non-null   int64
 49  FullBath       1460 non-null   int64
 50  HalfBath       1460 non-null   int64
 51  BedroomAbvGr   1460 non-null   int64
 52  KitchenAbvGr   1460 non-null   int64
 53  KitchenQual    1460 non-null   object
 54  TotRmsAbvGrd   1460 non-null   int64
 55  Functional     1460 non-null   object
 56  Fireplaces     1460 non-null   int64
 57  FireplaceQu    770 non-null    object
 58  GarageType     1379 non-null   object
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object
 61  GarageCars     1460 non-null   int64
 62  GarageArea     1460 non-null   int64
 63  GarageQual     1379 non-null   object
 64  GarageCond     1379 non-null   object
 65  PavedDrive     1460 non-null   object
 66  WoodDeckSF     1460 non-null   int64
 67  OpenPorchSF    1460 non-null   int64
 68  EnclosedPorch  1460 non-null   int64
 69  3SsnPorch      1460 non-null   int64
 70  ScreenPorch    1460 non-null   int64
 71  PoolArea       1460 non-null   int64
 72  PoolQC         7 non-null      object
 73  Fence          281 non-null    object
 74  MiscFeature    54 non-null     object
 75  MiscVal        1460 non-null   int64
 76  MoSold         1460 non-null   int64
 77  YrSold         1460 non-null   int64
 78  SaleType       1460 non-null   object
 79  SaleCondition  1460 non-null   object
 80  SalePrice      1460 non-null   int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB

Statistical Summary:
Id MSSubClass LotFrontage LotArea OverallQual \
count 1460.000000 1460.000000 1201.000000 1460.000000 1460.000000
mean 730.500000 56.897260 70.049958 10516.828082 6.099315
std 421.610009 42.300571 24.284752 9981.264932 1.382997
min 1.000000 20.000000 21.000000 1300.000000 1.000000
25% 365.750000 20.000000 59.000000 7553.500000 5.000000
50% 730.500000 50.000000 69.000000 9478.500000 6.000000
75% 1095.250000 70.000000 80.000000 11601.500000 7.000000
max 1460.000000 190.000000 313.000000 215245.000000 10.000000

OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 ... \
count 1460.000000 1460.000000 1460.000000 1452.000000 1460.000000 ...
mean 5.575342 1971.267808 1984.865753 103.685262 443.639726 ...
std 1.112799 30.202904 20.645407 181.066207 456.098091 ...
min 1.000000 1872.000000 1950.000000 0.000000 0.000000 ...
25% 5.000000 1954.000000 1967.000000 0.000000 0.000000 ...
50% 5.000000 1973.000000 1994.000000 0.000000 383.500000 ...
75% 6.000000 2000.000000 2004.000000 166.000000 712.250000 ...
max 9.000000 2010.000000 2010.000000 1600.000000 5644.000000 ...

WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch \
count 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000
mean 94.244521 46.660274 21.954110 3.409589 15.060959
std 125.338794 66.256028 61.119149 29.317331 55.757415
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 25.000000 0.000000 0.000000 0.000000
75% 168.000000 68.000000 0.000000 0.000000 0.000000
max 857.000000 547.000000 552.000000 508.000000 480.000000

PoolArea MiscVal MoSold YrSold SalePrice
count 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000
mean 2.758904 43.489041 6.321918 2007.815753 180921.195890
std 40.177307 496.123024 2.703626 1.328095 79442.502883
min 0.000000 0.000000 1.000000 2006.000000 34900.000000
25% 0.000000 0.000000 5.000000 2007.000000 129975.000000
50% 0.000000 0.000000 6.000000 2008.000000 163000.000000
75% 0.000000 0.000000 8.000000 2009.000000 214000.000000
max 738.000000 15500.000000 12.000000 2010.000000 755000.000000

[8 rows x 38 columns]

What does the output mean?

Total Rows: 1460 houses (each row is a house).
Total Columns: 81, made up of an Id column, 79 features (like area, year built, location, etc.), and the target SalePrice.
Column Types:
- 38 are numeric (e.g., LotArea, YearBuilt).
- 43 are text/categorical (e.g., Street, Neighborhood).
Some columns have missing values. For example:
- Alley: only 91 of 1460 houses have data; the rest are blank.
- MasVnrType, MasVnrArea, BsmtQual, etc. also have missing values.
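Reading missing counts off the info() listing is error-prone with 81 columns. A minimal sketch of a summary table follows. It uses a small made-up frame named toy in place of train_data; the values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data (values are made up for illustration).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, "Pave", np.nan],
})

# Count missing values per column and express them as a percentage of rows.
missing = toy.isnull().sum()
missing_pct = (missing / len(toy) * 100).round(1)
summary = pd.DataFrame({"missing": missing, "percent": missing_pct})

# Keep only columns that actually have gaps, worst first.
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```

Running the same lines on train_data instead of toy should list every column with gaps, which makes it easier to decide which columns to drop and which to fill.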

Step 4: Handle Missing Values

After exploring the dataset, we found that some columns contain missing values. The linear models in scikit-learn cannot be fit on data that contains NaN values, so let’s handle these gaps before moving on.

Use the code below to drop the sparsest columns and fill the remaining gaps:

# Step 4: Handle Missing Values and Prepare the Data

# 1. Load the dataset (make sure train.csv is uploaded in Colab)
import pandas as pd
train_df = pd.read_csv('/content/train.csv')

# 2. Drop columns with too many missing values
train_df = train_df.drop(columns=['Alley', 'PoolQC', 'Fence', 'MiscFeature'])

# 3. Fill missing numerical values with median
train_df['LotFrontage'] = train_df['LotFrontage'].fillna(train_df['LotFrontage'].median())
train_df['GarageYrBlt'] = train_df['GarageYrBlt'].fillna(train_df['GarageYrBlt'].median())
train_df['MasVnrArea'] = train_df['MasVnrArea'].fillna(train_df['MasVnrArea'].median())

# 4. Fill missing categorical values with mode
train_df['MasVnrType'] = train_df['MasVnrType'].fillna(train_df['MasVnrType'].mode()[0])
train_df['Electrical'] = train_df['Electrical'].fillna(train_df['Electrical'].mode()[0])
train_df['GarageType'] = train_df['GarageType'].fillna(train_df['GarageType'].mode()[0])
train_df['GarageFinish'] = train_df['GarageFinish'].fillna(train_df['GarageFinish'].mode()[0])
train_df['GarageQual'] = train_df['GarageQual'].fillna(train_df['GarageQual'].mode()[0])
train_df['GarageCond'] = train_df['GarageCond'].fillna(train_df['GarageCond'].mode()[0])
train_df['BsmtQual'] = train_df['BsmtQual'].fillna(train_df['BsmtQual'].mode()[0])
train_df['BsmtCond'] = train_df['BsmtCond'].fillna(train_df['BsmtCond'].mode()[0])
train_df['BsmtExposure'] = train_df['BsmtExposure'].fillna(train_df['BsmtExposure'].mode()[0])
train_df['BsmtFinType1'] = train_df['BsmtFinType1'].fillna(train_df['BsmtFinType1'].mode()[0])
train_df['BsmtFinType2'] = train_df['BsmtFinType2'].fillna(train_df['BsmtFinType2'].mode()[0])
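The column-by-column fills above follow one rule: median for numeric columns, mode for categorical ones. A compact sketch of that rule as a reusable helper is shown below. The function name fill_missing and the toy data are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd

def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric NaNs with the column median and other NaNs with the mode."""
    out = df.copy()
    for col in out.columns:
        if out[col].isnull().sum() == 0:
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Toy data (made up) with one numeric gap and one categorical gap.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "GarageType": ["Attchd", "Attchd", None, "Detchd"],
})
filled = fill_missing(toy)
print(filled)
```

One design point to keep in mind: in this dataset's data description, NA in columns such as GarageType means the house has no garage, so filling with a separate "None" category is an alternative worth considering instead of the mode.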

Step 5: Encode Categorical Features for Modeling

Machine learning algorithms require numerical input, but many columns in our dataset are categorical (for example, Neighborhood and GarageType). To make them usable for models, we first need to convert them into numbers. We will apply label encoding to columns with exactly two categories (mapping them to 0 and 1) and one-hot encoding to the remaining categorical columns.

Use the code below to do this:

from sklearn.preprocessing import LabelEncoder

# Identify all object (categorical) columns
categorical_cols = train_df.select_dtypes(include=['object']).columns

# Apply Label Encoding for columns with only two categories
label_enc = LabelEncoder()
for col in categorical_cols:
    if train_df[col].nunique() == 2:
        train_df[col] = label_enc.fit_transform(train_df[col])
        
# Apply One-Hot Encoding for remaining categorical columns
train_df = pd.get_dummies(train_df, columns=[col for col in categorical_cols if train_df[col].nunique() > 2])

Now we have a fully numeric and cleaned dataset.
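To see what the two encodings do, here is a minimal sketch on a made-up four-row frame. The column names come from the dataset, but the rows are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows (made up); CentralAir has two categories, Neighborhood has three.
toy = pd.DataFrame({
    "CentralAir": ["Y", "N", "Y", "Y"],
    "Neighborhood": ["CollgCr", "Veenker", "NoRidge", "CollgCr"],
})

# Label encoding: a single integer column. Classes are sorted, so N -> 0, Y -> 1.
toy["CentralAir"] = LabelEncoder().fit_transform(toy["CentralAir"])

# One-hot encoding: one indicator column per category value.
encoded = pd.get_dummies(toy, columns=["Neighborhood"])
print(encoded)
```

Label encoding keeps the column count unchanged, while one-hot encoding adds one column per category, which is why the full dataset grows well beyond its original width after this step.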

Step 6: Split the Dataset and Normalize Features

Before we start training, we have to split the dataset into input features (X) and the target variable (y) so the model can learn the relationship between the input variables and the house price. We then hold out 20% of the rows as a test set and standardize the features. Scaling matters most for the regularized linear models (Ridge and Lasso), whose penalties depend on coefficient size; the tree-based models are largely unaffected by it.

Here is the code to do this:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Separate features and target
X = train_df.drop('SalePrice', axis=1)
y = train_df['SalePrice']

# 2. Split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Normalize the features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
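The scaler is fit on the training set only and then reused on the test set, so no test-set statistics leak into training. A small sketch with made-up numbers shows the effect; the arrays here are assumptions for illustration, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test rows with two features on very different scales.
X_train = np.array([[1000.0, 1.0], [1500.0, 2.0], [2000.0, 3.0]])
X_test = np.array([[2500.0, 4.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from train
X_test_scaled = scaler.transform(X_test)        # reuses the train mean and std

print("Train means:", X_train_scaled.mean(axis=0))
print("Train stds:", X_train_scaled.std(axis=0))
print("Scaled test row:", X_test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1. The test row is not centered at 0 because it is measured against the training statistics, which is the intended behavior.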

Step 7: Train and Evaluate Five Regression Models

Now that the dataset is completely ready, we will train the following regression models to predict house prices:

Linear Regression
Ridge Regression
Lasso Regression
Random Forest Regressor
XGBoost Regressor

We will also use R² Score and Root Mean Squared Error (RMSE) as evaluation metrics.
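For reference, R² is 1 minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared error. The sketch below computes both by hand on four made-up prices and checks them against scikit-learn. The numbers are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted sale prices.
y_true = np.array([200000.0, 150000.0, 300000.0, 250000.0])
y_pred = np.array([210000.0, 140000.0, 290000.0, 260000.0])

# R2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE = sqrt(mean of squared errors), in the same units as the price.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

print("R2 (manual):", r2_manual, "| sklearn:", r2_score(y_true, y_pred))
print("RMSE (manual):", rmse_manual,
      "| sklearn:", np.sqrt(mean_squared_error(y_true, y_pred)))
```

Because RMSE is expressed in the target's units, an RMSE of 10,000 here means predictions are off by roughly 10,000 in price terms, which makes it easier to interpret than R² alone.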

Here is the code to train and evaluate all five models:

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Dictionary to store results
results = {}

# 1. Linear Regression
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
lr_preds = lr.predict(X_test_scaled)
results['Linear Regression'] = {
    'R2 Score': r2_score(y_test, lr_preds),
    'RMSE': np.sqrt(mean_squared_error(y_test, lr_preds))
}

# 2. Ridge Regression
ridge = Ridge()
ridge.fit(X_train_scaled, y_train)
ridge_preds = ridge.predict(X_test_scaled)
results['Ridge Regression'] = {
    'R2 Score': r2_score(y_test, ridge_preds),
    'RMSE': np.sqrt(mean_squared_error(y_test, ridge_preds))
}

# 3. Lasso Regression
lasso = Lasso()
lasso.fit(X_train_scaled, y_train)
lasso_preds = lasso.predict(X_test_scaled)
results['Lasso Regression'] = {
    'R2 Score': r2_score(y_test, lasso_preds),
    'RMSE': np.sqrt(mean_squared_error(y_test, lasso_preds))
}

# 4. Random Forest
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train_scaled, y_train)
rf_preds = rf.predict(X_test_scaled)
results['Random Forest'] = {
    'R2 Score': r2_score(y_test, rf_preds),
    'RMSE': np.sqrt(mean_squared_error(y_test, rf_preds))
}

# 5. XGBoost
xgb = XGBRegressor(random_state=42)
xgb.fit(X_train_scaled, y_train)
xgb_preds = xgb.predict(X_test_scaled)
results['XGBoost'] = {
    'R2 Score': r2_score(y_test, xgb_preds),
    'RMSE': np.sqrt(mean_squared_error(y_test, xgb_preds))
}

# Print the evaluation results
import pandas as pd
results_df = pd.DataFrame(results).T
print(results_df)

Output:

                   R2 Score          RMSE
Linear Regression  0.893465  28585.921223
Ridge Regression   0.893739  28549.160746
Lasso Regression   0.894667  28424.249288
Random Forest      0.890649  28961.361871
XGBoost            0.902643  27326.884052

What does the output mean?

XGBoost performs the best, with the highest R² score (~0.90) and the lowest RMSE. It fits this test split better than the others.
Random Forest performs well but is slightly worse than the linear models here, possibly because it was run with default hyperparameters. All five models land within about 0.012 R² of each other, so the ranking could change with a different train/test split.
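A single 80/20 split gives only one estimate per model. One common way to check whether a ranking is stable is k-fold cross-validation. The sketch below runs it on synthetic data with two of the model types; the data, the fold count, and the n_estimators value are assumptions for illustration, not the article's setup.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the prepared housing features.
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=42)

# Five shuffled folds; each model is trained and scored five times.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "Ridge": Ridge(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    cv_scores[name] = scores
    print(f"{name}: mean R2 = {scores.mean():.3f}, std = {scores.std():.3f}")
```

If the gap between two models is smaller than the spread across folds, the difference is probably not meaningful.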

Step 8: Model Comparison Results

Now that we have seen the performance of all the models, let’s quickly compare their results and understand how they performed on the test data:

Model	R² Score	RMSE
XGBoost	0.9026	27,326.88
Lasso Regression	0.8947	28,424.25
Ridge Regression	0.8937	28,549.16
Linear Regression	0.8935	28,585.92
Random Forest	0.8906	28,961.36
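The ranking above can be reproduced from the Step 7 numbers. The sketch below rebuilds the results table from the reported values, sorts it by RMSE, and adds each model's gap to the best one. The gap column name is an assumption introduced here.

```python
import pandas as pd

# R2 and RMSE values copied from the Step 7 output.
results = {
    "Linear Regression": {"R2 Score": 0.893465, "RMSE": 28585.921223},
    "Ridge Regression": {"R2 Score": 0.893739, "RMSE": 28549.160746},
    "Lasso Regression": {"R2 Score": 0.894667, "RMSE": 28424.249288},
    "Random Forest": {"R2 Score": 0.890649, "RMSE": 28961.361871},
    "XGBoost": {"R2 Score": 0.902643, "RMSE": 27326.884052},
}

# Sort from lowest to highest RMSE and measure each model's gap to the best.
ranked = pd.DataFrame(results).T.sort_values("RMSE")
ranked["RMSE gap vs best"] = ranked["RMSE"] - ranked["RMSE"].iloc[0]
print(ranked.round(2))
```

The gap column shows that the spread between the best and worst model is under 1,700 in RMSE terms, against a mean sale price of roughly 181,000 in the training data.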

Conclusion

In this project, we built a regression model to predict house prices using a Kaggle dataset. After data cleaning, encoding, and model training, we compared five algorithms. Among them, XGBoost delivered the best results with an R² score of 0.90 and the lowest RMSE of 27,326.88.

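On this split, the gradient-boosted XGBoost model captured the relationships in the housing data slightly better than the linear models, although the default Random Forest, also an ensemble, trailed them. The differences are small, so hyperparameter tuning and cross-validation would be needed before drawing a firm conclusion.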

Colab Link:
https://colab.research.google.com/drive/1LdicyK51qMh6S_WPCWAJvsixGvaMchoS?usp=sharing

Frequently Asked Questions (FAQs)

1. What is the goal of a House Price Prediction project?

The objective is to predict property prices based on various features like location, number of bedrooms, size, and amenities using machine learning algorithms.

2. What dataset is commonly used for this project?

The Ames Housing Dataset and Kaggle’s House Prices – Advanced Regression Techniques dataset are commonly used for building predictive models.

3. Which machine learning models are effective for house price prediction?

Popular models include Linear Regression, Random Forest, XGBoost, Gradient Boosting, and Lasso Regression. Ensemble methods often yield better accuracy.

4. What are the important features in house price prediction?

Key features include square footage, location (zipcode or neighborhood), number of rooms, year built, garage area, and quality of construction.

5. What tools are required to build the model?

You can use Python with libraries such as Pandas, NumPy, Seaborn, Scikit-learn, XGBoost, and Matplotlib for data handling, modeling, and visualization.

Rohit Sharma

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
