House Price Prediction Using Regression Algorithms

By Rohit Sharma

Updated on Aug 01, 2025 | 11 min read | 1.28K+ views

Buying or selling a house hinges on valuing it correctly. Yet prices vary widely, even within a single locality, because so many factors are involved: location, size, condition, and more.

The prime objective of this project is to predict house prices using machine learning. Regression algorithms such as Linear Regression, Random Forest, and gradient boosting (XGBoost) learn patterns from past housing data and estimate the price of a house from its characteristics. This reduces guesswork in real estate decisions and makes valuations easier to justify.

What Should You Know Beforehand?

It is better to have at least some background in:

  • Python basics – variables, loops, functions.
  • Pandas and NumPy – for handling and analyzing data.
  • Matplotlib and Seaborn – for visualizing patterns.
  • Machine learning concepts – especially regression and model evaluation.
  • Scikit-learn library – for building and testing regression models.

Technologies and Libraries Used

For this project, the following tools and libraries will be used:

Tool/Library    Purpose
Python          Programming language used to build the model
Pandas          Handling, exploring, and cleaning tabular data
NumPy           Performing numerical operations and working with arrays
Matplotlib      Creating basic plots and visualizations
Seaborn         Creating advanced and aesthetic visualizations
Scikit-learn    Implementing regression models and evaluating performance
XGBoost         Training the gradient-boosted tree regressor

Models That Will Be Utilized for Learning

To solve the house price prediction problem, we will use five popular regression models.

  • Linear Regression is the simplest. It assumes a linear relationship between the input features and the house price. It is easy to interpret, but may fail to capture complex relationships.
  • Ridge Regression is a regularized form of linear regression. It penalizes large coefficients, so the model generalizes better and is less prone to overfitting.
  • Lasso Regression behaves like Ridge with one difference: it can shrink the coefficients of less important features to exactly zero. Hence, it performs automatic feature selection and simplifies the model.
  • Random Forest Regressor is more powerful. It builds many decision trees and averages their outputs, which reduces overfitting and usually improves accuracy.
  • XGBoost Regressor works differently. It builds trees sequentially, with each tree correcting the errors of the trees before it. Training can be slower, but it often gives strong results on structured data.
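
The difference between Ridge and Lasso described above can be seen on a small synthetic dataset. The sketch below is illustrative only: the data, the alpha values, and the variable names are assumptions, not part of the project.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption for illustration): only the first two
# of the five features actually influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# alpha controls the strength of the penalty in both models
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Features dropped by Lasso:", int(np.sum(lasso.coef_ == 0)))
```

Ridge keeps every coefficient, only shrinking them slightly, while Lasso sets the coefficients of the three irrelevant features to exactly zero.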

Time Taken and Difficulty

You can complete this house price regression project in about 1.5 to 2 hours. It suits beginner to intermediate learners.

How to Build the House Price Prediction Model

Let’s build the project from scratch. We will:

  1. Load and explore the dataset
  2. Handle missing values and encode categorical features
  3. Visualize important relationships and correlations
  4. Train and evaluate regression models
  5. Compare the results to find the best model

Without any further delay, let’s start!

Step 1: Download the Dataset

To build the house price prediction model, we will use a dataset available on Kaggle. Follow the steps below to download it:

  1. Open a new tab in any web browser.
  2. Go to https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data.
  3. On the House Prices - Advanced Regression Techniques page, in the right pane, under the Data Explorer section, click test.csv.
  4. Click the download icon.
  5. Click train.csv.
  6. Click the download icon.

Step 2: Upload and Load the Dataset in Google Colab

Now that you have downloaded the file, upload and load the dataset in Google Colab using the code below:

# Step 1: Import necessary libraries
import pandas as pd

# Step 2: Upload and read the train dataset
from google.colab import files
uploaded = files.upload()  # Choose train.csv when prompted

# Load dataset into a DataFrame
train_data = pd.read_csv("train.csv")

# Display first 5 rows
print("First 5 rows of the dataset:")
print(train_data.head())

Output:

Saving train.csv to train (1).csv
Saving test.csv to test (1).csv
First 5 rows of the dataset:

   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg
1   2          20       RL         80.0     9600   Pave   NaN      Reg
2   3          60       RL         68.0    11250   Pave   NaN      IR1
3   4          70       RL         60.0     9550   Pave   NaN      IR1
4   5          60       RL         84.0    14260   Pave   NaN      IR1

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9
3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2
4         Lvl    AllPub  ...        0    NaN   NaN         NaN       0     12

  YrSold SaleType SaleCondition  SalePrice
0   2008       WD        Normal     208500
1   2007       WD        Normal     181500
2   2008       WD        Normal     223500
3   2006       WD       Abnorml     140000
4   2008       WD        Normal     250000

[5 rows x 81 columns]

Now that the dataset is successfully loaded, let's explore its shape, columns, and data types to plan preprocessing effectively.

Step 3: Understand the Data Structure

Let's take a quick look at its structure before we clean or preprocess the data. Doing so will help us identify data types and spot any missing values.

Use the code below to do so:

### Step 3: Understand the Data Structure

# Check the shape of the dataset
print("Shape of training data:", train_data.shape)

# Get information about columns, non-null values, and data types
print("\nData Info:")
train_data.info()

# View summary statistics for numeric columns
print("\nStatistical Summary:")
print(train_data.describe())

 Output:

Shape of training data: (1460, 81)

Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             1460 non-null   int64
 1   MSSubClass     1460 non-null   int64
 2   MSZoning       1460 non-null   object
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64
 5   Street         1460 non-null   object
 6   Alley          91 non-null     object
 7   LotShape       1460 non-null   object
 8   LandContour    1460 non-null   object
 9   Utilities      1460 non-null   object
 10  LotConfig      1460 non-null   object
 11  LandSlope      1460 non-null   object
 12  Neighborhood   1460 non-null   object
 13  Condition1     1460 non-null   object
 14  Condition2     1460 non-null   object
 15  BldgType       1460 non-null   object
 16  HouseStyle     1460 non-null   object
 17  OverallQual    1460 non-null   int64
 18  OverallCond    1460 non-null   int64
 19  YearBuilt      1460 non-null   int64
 20  YearRemodAdd   1460 non-null   int64
 21  RoofStyle      1460 non-null   object
 22  RoofMatl       1460 non-null   object
 23  Exterior1st    1460 non-null   object
 24  Exterior2nd    1460 non-null   object
 25  MasVnrType     588 non-null    object
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object
 28  ExterCond      1460 non-null   object
 29  Foundation     1460 non-null   object
 30  BsmtQual       1423 non-null   object
 31  BsmtCond       1423 non-null   object
 32  BsmtExposure   1422 non-null   object
 33  BsmtFinType1   1423 non-null   object
 34  BsmtFinSF1     1460 non-null   int64
 35  BsmtFinType2   1422 non-null   object
 36  BsmtFinSF2     1460 non-null   int64
 37  BsmtUnfSF      1460 non-null   int64
 38  TotalBsmtSF    1460 non-null   int64
 39  Heating        1460 non-null   object
 40  HeatingQC      1460 non-null   object
 41  CentralAir     1460 non-null   object
 42  Electrical     1459 non-null   object
 43  1stFlrSF       1460 non-null   int64
 44  2ndFlrSF       1460 non-null   int64
 45  LowQualFinSF   1460 non-null   int64
 46  GrLivArea      1460 non-null   int64
 47  BsmtFullBath   1460 non-null   int64
 48  BsmtHalfBath   1460 non-null   int64
 49  FullBath       1460 non-null   int64
 50  HalfBath       1460 non-null   int64
 51  BedroomAbvGr   1460 non-null   int64
 52  KitchenAbvGr   1460 non-null   int64
 53  KitchenQual    1460 non-null   object
 54  TotRmsAbvGrd   1460 non-null   int64
 55  Functional     1460 non-null   object
 56  Fireplaces     1460 non-null   int64
 57  FireplaceQu    770 non-null    object
 58  GarageType     1379 non-null   object
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object
 61  GarageCars     1460 non-null   int64
 62  GarageArea     1460 non-null   int64
 63  GarageQual     1379 non-null   object
 64  GarageCond     1379 non-null   object
 65  PavedDrive     1460 non-null   object
 66  WoodDeckSF     1460 non-null   int64
 67  OpenPorchSF    1460 non-null   int64
 68  EnclosedPorch  1460 non-null   int64
 69  3SsnPorch      1460 non-null   int64
 70  ScreenPorch    1460 non-null   int64
 71  PoolArea       1460 non-null   int64
 72  PoolQC         7 non-null      object
 73  Fence          281 non-null    object
 74  MiscFeature    54 non-null     object
 75  MiscVal        1460 non-null   int64
 76  MoSold         1460 non-null   int64
 77  YrSold         1460 non-null   int64
 78  SaleType       1460 non-null   object
 79  SaleCondition  1460 non-null   object
 80  SalePrice      1460 non-null   int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB

Statistical Summary:
                Id   MSSubClass  LotFrontage        LotArea  OverallQual  \
count  1460.000000  1460.000000  1201.000000    1460.000000  1460.000000   
mean    730.500000    56.897260    70.049958   10516.828082     6.099315   
std     421.610009    42.300571    24.284752    9981.264932     1.382997   
min       1.000000    20.000000    21.000000    1300.000000     1.000000   
25%     365.750000    20.000000    59.000000    7553.500000     5.000000   
50%     730.500000    50.000000    69.000000    9478.500000     6.000000   
75%    1095.250000    70.000000    80.000000   11601.500000     7.000000   
max    1460.000000   190.000000   313.000000  215245.000000    10.000000  

       OverallCond    YearBuilt  YearRemodAdd   MasVnrArea   BsmtFinSF1  ...  \
count  1460.000000  1460.000000   1460.000000  1452.000000  1460.000000  ...   
mean      5.575342  1971.267808   1984.865753   103.685262   443.639726  ...   
std       1.112799    30.202904     20.645407   181.066207   456.098091  ...   
min       1.000000  1872.000000   1950.000000     0.000000     0.000000  ...   
25%       5.000000  1954.000000   1967.000000     0.000000     0.000000  ...   
50%       5.000000  1973.000000   1994.000000     0.000000   383.500000  ...   
75%       6.000000  2000.000000   2004.000000   166.000000   712.250000  ...   
max       9.000000  2010.000000   2010.000000  1600.000000  5644.000000  ...  

        WoodDeckSF  OpenPorchSF  EnclosedPorch    3SsnPorch  ScreenPorch  \
count  1460.000000  1460.000000    1460.000000  1460.000000  1460.000000   
mean     94.244521    46.660274      21.954110     3.409589    15.060959   
std     125.338794    66.256028      61.119149    29.317331    55.757415   
min       0.000000     0.000000       0.000000     0.000000     0.000000   
25%       0.000000     0.000000       0.000000     0.000000     0.000000   
50%       0.000000    25.000000       0.000000     0.000000     0.000000   
75%     168.000000    68.000000       0.000000     0.000000     0.000000   
max     857.000000   547.000000     552.000000   508.000000   480.000000  

          PoolArea        MiscVal       MoSold       YrSold      SalePrice
count  1460.000000   1460.000000  1460.000000  1460.000000    1460.000000  
mean      2.758904     43.489041     6.321918  2007.815753  180921.195890  
std      40.177307    496.123024     2.703626     1.328095   79442.502883  
min       0.000000      0.000000     1.000000  2006.000000   34900.000000  
25%       0.000000      0.000000     5.000000  2007.000000  129975.000000  
50%       0.000000      0.000000     6.000000  2008.000000  163000.000000  
75%       0.000000      0.000000     8.000000  2009.000000  214000.000000  
max     738.000000  15500.000000    12.000000  2010.000000  755000.000000  

[8 rows x 38 columns]

What does the output mean?

  • Total Rows: 1460 houses (each row is a house).
  • Total Columns: 81, made up of 79 features (like area, year built, location, etc.), the Id column, and the target SalePrice.
  • Column Types:
    • Some are numeric (e.g., LotArea, YearBuilt).
    • Some are text/categorical (e.g., Street, Neighborhood).
  • Some columns have missing values. For example:
    • Alley: only 91 houses have a value; the rest are missing.
    • MasVnrType, MasVnrArea, BsmtQual, etc. also have missing values.
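
Reading missing values off the info() output gets tedious with 81 columns. A minimal sketch of a per-column summary follows; the helper name missing_report and the small sample frame are assumptions made for illustration, and in the project you would pass train_data to the helper instead.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame with hypothetical values; in the project you would
# call missing_report(train_data) instead.
sample = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
})

def missing_report(df):
    """Return the count and share of missing values for columns that have any."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "share": counts / len(df),
    })
    # Keep only columns with gaps, worst first
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

report = missing_report(sample)
print(report)
```

Columns with a very high share of missing values are candidates for dropping, while the rest can be imputed.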

Step 4: Handle Missing Values

After exploring the dataset, we found that some columns contain missing values. Many scikit-learn estimators, including the linear models used here, raise an error when the input contains NaN values, so let’s handle these gaps before moving on.

Use the code below to do so:

# Step 4: Handle Missing Values and Prepare the Data

# 1. Load the dataset (make sure train.csv is uploaded in Colab)
import pandas as pd
train_df = pd.read_csv('/content/train.csv')

# 2. Drop columns with too many missing values
train_df = train_df.drop(columns=['Alley', 'PoolQC', 'Fence', 'MiscFeature'])

# 3. Fill missing numerical values with the column median
num_cols = ['LotFrontage', 'GarageYrBlt', 'MasVnrArea']
for col in num_cols:
    train_df[col] = train_df[col].fillna(train_df[col].median())

# 4. Fill missing categorical values with the column mode (most frequent value)
cat_cols = [
    'MasVnrType', 'Electrical',
    'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
    'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
]
for col in cat_cols:
    train_df[col] = train_df[col].fillna(train_df[col].mode()[0])
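
The column lists above are written by hand, so it is easy to miss one. FireplaceQu, for example, shows only 770 non-null entries in the Data Info output and is not filled above. A minimal sketch of a catch-all follows; the helper name fill_remaining and the demo frame are hypothetical, and it applies the same median/mode rule to whatever columns still have gaps.

```python
import numpy as np
import pandas as pd

def fill_remaining(df):
    """Fill any leftover gaps: median for numeric columns, mode for the rest."""
    df = df.copy()
    for col in df.columns[df.isnull().any()]:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])
    return df

# Hypothetical miniature frame standing in for train_df
demo = pd.DataFrame({
    "FireplaceQu": ["Gd", np.nan, "TA", "Gd"],
    "GarageYrBlt": [2003.0, 1976.0, np.nan, 1998.0],
})
filled = fill_remaining(demo)
print(filled)
print("Remaining missing values:", int(filled.isnull().sum().sum()))
```

Whether the mode is the right fill is a judgment call. In this dataset a missing FireplaceQu or GarageType means the house has no fireplace or garage, so filling with a separate category such as "None" is a reasonable alternative.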

Step 5: Encode Categorical Features for Modeling

Machine learning algorithms require numerical input, but many columns in our dataset are categorical (such as Neighborhood and GarageType). We therefore convert them to numbers: label encoding maps the two categories of a binary column to 0 and 1, and one-hot encoding creates a separate 0/1 indicator column for each category of the remaining columns.

Use the code below to do so:

from sklearn.preprocessing import LabelEncoder

# Identify all object (categorical) columns
categorical_cols = train_df.select_dtypes(include=['object']).columns

# Apply Label Encoding for columns with only two categories
label_enc = LabelEncoder()
for col in categorical_cols:
    if train_df[col].nunique() == 2:
        train_df[col] = label_enc.fit_transform(train_df[col])
        
# Apply One-Hot Encoding for remaining categorical columns
train_df = pd.get_dummies(train_df, columns=[col for col in categorical_cols if train_df[col].nunique() > 2])

Now we have a fully numeric and cleaned dataset.
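
To see what the two encodings produce, here is a small sketch on a toy frame. The rows are invented for illustration; the column names are borrowed from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with invented rows
toy = pd.DataFrame({
    "CentralAir": ["Y", "N", "Y", "Y"],                       # two categories
    "GarageType": ["Attchd", "Detchd", "BuiltIn", "Attchd"],  # three categories
})

# Label encoding: categories are sorted, so N -> 0 and Y -> 1
toy["CentralAir"] = LabelEncoder().fit_transform(toy["CentralAir"])

# One-hot encoding: one indicator column per category
encoded = pd.get_dummies(toy, columns=["GarageType"])
print(encoded)
```

The binary column stays a single column of 0s and 1s, while GarageType is replaced by three indicator columns (GarageType_Attchd, GarageType_BuiltIn, GarageType_Detchd). This is why the feature count grows sharply after one-hot encoding.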

Step 6: Split the Dataset and Normalize Features

Before training, we separate the dataset into input features (X) and the target variable (y), then hold out 20% of the rows as a test set so we can measure performance on data the model has not seen. We then standardize the features with StandardScaler (zero mean, unit variance). This matters most for Ridge and Lasso, whose penalties are sensitive to feature scale; tree-based models such as Random Forest and XGBoost are largely unaffected.

Here is the code to do so:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Separate features and target
X = train_df.drop('SalePrice', axis=1)
y = train_df['SalePrice']

# 2. Split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Normalize the features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
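
The scaler above is fitted on the training rows only and then reused on the test rows, so no information from the test set leaks into preprocessing. The sketch below illustrates the effect on synthetic data; the two features and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Two hypothetical features on very different scales
X_demo = np.column_stack([
    rng.normal(10_000, 3_000, size=100),  # e.g., a lot area
    rng.integers(1, 11, size=100),        # e.g., a quality rating from 1 to 10
]).astype(float)
y_demo = rng.normal(size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scaler_demo = StandardScaler()
X_tr_s = scaler_demo.fit_transform(X_tr)  # learns mean and std from training rows only
X_te_s = scaler_demo.transform(X_te)      # reuses those training statistics

print("Train means:", np.round(X_tr_s.mean(axis=0), 6))
print("Train stds: ", np.round(X_tr_s.std(axis=0), 6))
print("Test means: ", np.round(X_te_s.mean(axis=0), 3))
```

The training columns end up with mean 0 and standard deviation 1. The test columns are close to that but not exact, which is expected because they were scaled with the training statistics.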

Step 7: Train and Evaluate Five Regression Models

Now that the dataset is completely ready, we will train the following regression models to predict house prices:

  • Linear Regression
  • Ridge Regression
  • Lasso Regression
  • Random Forest Regressor
  • XGBoost Regressor

We will also use R² Score and Root Mean Squared Error (RMSE) as evaluation metrics.

Here is the code to do so:

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Dictionary to store results
results = {}

# 1. Linear Regression
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
lr_preds = lr.predict(X_test_scaled)
results['Linear Regression'] = {
    'R2 Score': r2_score(y_test, lr_preds),
    'RMSE': np.sqrt(mean_squared_error(y_test, lr_preds))
}

# 2. Ridge Regression
ridge = Ridge()
ridge.fit(X_train_scaled, y_train)
ridge_preds = ridge.predict(X_test_scaled)
results['Ridge Regression'] = {
    'R2 Score': r2_score(y_test, ridge_preds),
    'RMSE': np.sqrt(mean_squared_error(y_test, ridge_preds))
}

# 3. Lasso Regression
lasso = Lasso()
lasso.fit(X_train_scaled, y_train)
lasso_preds = lasso.predict(X_test_scaled)
results['Lasso Regression'] = {
    'R2 Score': r2_score(y_test, lasso_preds),
    'RMSE': np.sqrt(mean_squared_error(y_test, lasso_preds))
}

# 4. Random Forest
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train_scaled, y_train)
rf_preds = rf.predict(X_test_scaled)
results['Random Forest'] = {
    'R2 Score': r2_score(y_test, rf_preds),
    'RMSE': np.sqrt(mean_squared_error(y_test, rf_preds))
}

# 5. XGBoost
xgb = XGBRegressor(random_state=42)
xgb.fit(X_train_scaled, y_train)
xgb_preds = xgb.predict(X_test_scaled)
results['XGBoost'] = {
    'R2 Score': r2_score(y_test, xgb_preds),
    'RMSE': np.sqrt(mean_squared_error(y_test, xgb_preds))
}

# Print the evaluation results
import pandas as pd
results_df = pd.DataFrame(results).T
print(results_df)

Output:

                   R2 Score          RMSE
Linear Regression  0.893465  28585.921223
Ridge Regression   0.893739  28549.160746
Lasso Regression   0.894667  28424.249288
Random Forest      0.890649  28961.361871
XGBoost            0.902643  27326.884052

What does the output mean?

  • XGBoost performs best, with the highest R² score (~0.90) and the lowest RMSE on the held-out test set.
  • Random Forest performs well but is slightly worse than the linear models here, possibly because of overfitting or because it was run with default hyperparameters.
  • The gaps are small: every model scores an R² between 0.89 and 0.90, and the numbers come from a single train/test split.
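
For reference, the two metrics used above are defined as RMSE = sqrt(mean((y - ŷ)²)) and R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of y from its mean. The sketch below computes both by hand; the prices and predictions are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Share of the target's variance explained by the predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical prices and predictions, for illustration only
actual = [200_000, 150_000, 300_000, 250_000]
predicted = [210_000, 140_000, 290_000, 260_000]

print("RMSE:", rmse(actual, predicted))          # 10000.0
print("R2:  ", round(r2(actual, predicted), 3))  # 0.968
```

Every prediction here is off by exactly 10,000, so the RMSE is 10,000. Read the same way, an RMSE of about 27,000 to 29,000 in the table above means a typical error of that size on houses whose mean sale price is about 181,000.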

Step 8: Model Comparison Results

Now that we have seen the performance of all the models, let’s quickly compare their results and understand how they performed on the test data:

Model              R² Score  RMSE
XGBoost            0.9026    27,326.88
Lasso Regression   0.8947    28,424.25
Ridge Regression   0.8937    28,549.16
Linear Regression  0.8935    28,585.92
Random Forest      0.8906    28,961.36
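
Because the scores above are so close and come from one split, a different random_state could change the ranking. One common way to check is k-fold cross-validation, which this project does not use. The sketch below is therefore an illustrative addition, run on synthetic data from make_regression rather than the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem standing in for the prepared housing features
X_cv, y_cv = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=42)

# Five folds: each row is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
}

cv_summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X_cv, y_cv, cv=cv, scoring="r2")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean R2 = {scores.mean():.4f}, std = {scores.std():.4f}")
```

If the difference between two models' mean scores is smaller than the spread across folds, the ranking should not be read as decisive.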

Conclusion

In this project, we built regression models to predict house prices using a Kaggle dataset. After data cleaning, encoding, and model training, we compared five algorithms. Among them, XGBoost delivered the best results, with an R² score of 0.90 and the lowest RMSE of 27,326.88.

This suggests that gradient boosting can capture relationships in housing data that simple linear models miss, although the margin here is small and Random Forest, also an ensemble, scored slightly below the linear models.

Colab Link:
https://colab.research.google.com/drive/1LdicyK51qMh6S_WPCWAJvsixGvaMchoS?usp=sharing

Frequently Asked Questions (FAQs)

1. What is the goal of a House Price Prediction project?

The objective is to predict property prices based on various features like location, number of bedrooms, size, and amenities using machine learning algorithms.

2. What dataset is commonly used for this project?

The Ames Housing dataset is the usual choice. Kaggle’s House Prices – Advanced Regression Techniques competition, used in this project, is built on it.

3. Which machine learning models are effective for house price prediction?

Popular models include Linear Regression, Random Forest, XGBoost, Gradient Boosting, and Lasso Regression. Ensemble methods often yield better accuracy.

4. What are the important features in house price prediction?

Key features include square footage, location (zipcode or neighborhood), number of rooms, year built, garage area, and quality of construction.

5. What tools are required to build the model?

You can use Python with libraries such as Pandas, NumPy, Seaborn, Scikit-learn, XGBoost, and Matplotlib for data handling, modeling, and visualization.
