Analyzing the Heights and Weights Dataset Using Linear Regression

By Rohit Sharma

Updated on Jul 30, 2025 | 7 min read | 1.15K+ views


Height-weight relationships are useful in health sciences, fitness planning, and even fashion design. In this machine learning project, we will explore the Heights and Weights dataset available on Kaggle, which consists of real-world data on individuals' heights (in inches) and weights (in pounds).

The goal of this project is to predict an individual’s weight based on their height using machine learning techniques, specifically Simple Linear Regression. We will train the model on the heights and weights dataset and then evaluate its prediction accuracy using common regression metrics.
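For intuition, simple linear regression fits a straight line whose slope and intercept are learned from the training data. In generic terms (these symbols are placeholders, not values from this dataset):

Weight ≈ b0 + b1 × Height

Here, b1 is the slope (the expected change in weight, in pounds, per additional inch of height) and b0 is the intercept (a mathematical offset rather than a physically meaningful weight).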

Explore more project ideas like this in our Top 25+ Essential Data Science Projects GitHub to Explore in 2025 blog.

What Should You Know Beforehand?

It is better to have at least some background in:

  • Basic Python programming
  • Core data libraries such as Pandas and NumPy
  • Elementary statistics and the idea of regression

Technologies and Libraries Used

For this project, the following tools and libraries will be used:

Tool/Library | Purpose
Python | Core programming language for the project
Pandas | To load, clean, and manipulate the dataset
NumPy | For numerical computations and array handling
Matplotlib | For creating basic plots and visualizations
Seaborn | For more advanced and aesthetically pleasing plots
Scikit-learn | To build, train, and evaluate the linear regression model
Google Colab | Online coding platform with free GPU access and easy sharing

Models That Will Be Utilized for Learning

To predict weight from height, we’ll use regression models that work well with continuous numerical data:

  • Simple Linear Regression (our baseline model)
  • Random Forest Regressor (for comparison, to check for any non-linear patterns)

Time Taken and Difficulty Level

This project should take about 1 to 2 hours, depending on how familiar you are with Python and the machine learning libraries. It falls under the beginner level, making it perfect for anyone getting started with data science or regression problems.

How to Build the Height and Weight Prediction Model

Let’s start building the project from scratch. We will start by:

  • Loading and exploring the dataset
  • Visualizing the relationship between height and weight
  • Preprocessing the data for modeling
  • Training and evaluating a regression model

Without any further delay, let’s start!

Step 1: Download the Dataset

To build the height and weight prediction model, we will use the dataset available on Kaggle. Follow the steps mentioned below to download the dataset:

  1. Open a new tab in any web browser. 
  2. Go to https://www.kaggle.com/datasets/burnoutminer/heights-and-weights-dataset.
  3. On the Height and Weight Dataset page, in the right pane, under the Data Explorer section, click SOCR-HeightWeight.csv.
  4. Click the download icon.
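If you’d rather skip the manual download, the official Kaggle CLI can fetch the dataset directly inside Colab. This step is optional and assumes you have a Kaggle account and have uploaded your kaggle.json API token to the Colab session (the kaggle package is usually preinstalled in Colab; install it with pip if it is missing):

# Optional alternative: download the dataset with the Kaggle CLI
# Assumes kaggle.json (your API token) has already been uploaded to the session
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

# Download and unzip the dataset files
!kaggle datasets download -d burnoutminer/heights-and-weights-dataset
!unzip -o heights-and-weights-dataset.zip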

Step 2: Upload and Load the Dataset in Google Colab

Now that the .csv file has been downloaded, let’s upload it to the Colab environment. Use the following code to open a file picker and load the dataset.

# Upload the dataset
from google.colab import files
uploaded = files.upload()

Output:

SOCR-HeightWeight.csv(text/csv) - 608346 bytes, last modified: 7/24/2025 - 100% done
Saving SOCR-HeightWeight.csv to SOCR-HeightWeight.csv

Once uploaded, read the file using Pandas. Here’s the code to do so:

# Import pandas
import pandas as pd

# Load the dataset
df = pd.read_csv("SOCR-HeightWeight.csv")

# Display the first 5 rows
df.head()

Output:

 

   Index  Height(Inches)  Weight(Pounds)
0      1        65.78331        112.9925
1      2        71.51521        136.4873
2      3        69.39874        153.0269
3      4        68.21660        142.3354
4      5        67.78781        144.2971

What does the output tell us?

The output tells us that:

  • The dataset includes an Index column along with Height(Inches) and Weight(Pounds).
  • All values are numerical, so the dataset is suitable for regression analysis.
  • No missing values are visible in this initial preview (a quick summary check follows below).
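As a quick sanity check beyond the five-row preview, you can also print summary statistics for the whole dataset. This snippet is a small optional addition on top of the article’s code:

# Optional: summary statistics for all numeric columns
# The 'count' row should match the number of rows, confirming there are no missing values
print(df.describe())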

Step 3: Perform Exploratory Data Analysis (EDA)

In this step, we will explore the height and weight dataset further to understand its patterns, distributions, and relationships. This helps identify trends and spot anomalies before modeling.

Use the code below to do the same:

# Import necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Shape of the dataset
print("Shape of the dataset:", df.shape)

# Basic info
print("\nDataset Info:")
print(df.info())

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Histogram of Height
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['Height(Inches)'], kde=True, color='skyblue')
plt.title('Distribution of Height')
plt.xlabel('Height (Inches)')
plt.ylabel('Frequency')

# Histogram of Weight
plt.subplot(1, 2, 2)
sns.histplot(df['Weight(Pounds)'], kde=True, color='salmon')
plt.title('Distribution of Weight')
plt.xlabel('Weight (Pounds)')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

# Scatter plot to show relationship
plt.figure(figsize=(6, 5))
sns.scatterplot(x='Height(Inches)', y='Weight(Pounds)', data=df, color='green')
plt.title('Height vs Weight')
plt.xlabel('Height (Inches)')
plt.ylabel('Weight (Pounds)')
plt.show()

Output:

Shape of the dataset: (25000, 3)

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Index           25000 non-null  int64  
 1   Height(Inches)  25000 non-null  float64
 2   Weight(Pounds)  25000 non-null  float64
dtypes: float64(2), int64(1)
memory usage: 586.1 KB
None

Missing Values:
Index             0
Height(Inches)    0
Weight(Pounds)    0
dtype: int64

What does the output tell us?

After running the combined EDA block, we found that:

  • There are 25,000 entries, no missing values, and three clean numeric columns: Index, Height(Inches), and Weight(Pounds).
  • The distribution plots for both height and weight are roughly symmetric and bell-shaped, with no visible outliers.
  • The scatter plot (Height vs Weight) shows a positive linear trend, which supports our decision to use Simple Linear Regression for prediction (a one-line correlation check follows below).
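To put a number on the trend seen in the scatter plot, you can compute the Pearson correlation between the two columns. This one-liner is an optional addition, not part of the original walkthrough:

# Optional: Pearson correlation between height and weight
# Values closer to +1 indicate a stronger positive linear relationship
print(df['Height(Inches)'].corr(df['Weight(Pounds)']))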

Step 4: Data Preprocessing

Before training the model, we need to prepare the dataset. We will do this by selecting relevant features and removing unnecessary columns. In our case, the Index column has no predictive purpose. Hence, we will drop it. 

Here is the code to do so:

# Drop the 'Index' column
df = df.drop('Index', axis=1)

# Separate features and target
X = df[['Height(Inches)']]   # Independent variable (feature)
y = df['Weight(Pounds)']     # Dependent variable (target)

Step 5: Train the Model (Linear Regression & Random Forest)

In this step, we will train and compare the following models to predict weight based on height:

  • Simple Linear Regression (baseline model)
  • Random Forest Regressor (to capture non-linear patterns, if there are any present)

Here’s the code to do so:

# Import libraries
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize models
lr_model = LinearRegression()
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train both models
lr_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)

# Predict
lr_preds = lr_model.predict(X_test)
rf_preds = rf_model.predict(X_test)

# Evaluate Linear Regression
lr_mse = mean_squared_error(y_test, lr_preds)
lr_r2 = r2_score(y_test, lr_preds)

# Evaluate Random Forest
rf_mse = mean_squared_error(y_test, rf_preds)
rf_r2 = r2_score(y_test, rf_preds)

# Print results
print("Linear Regression:")
print("  MSE:", lr_mse)
print("  R² Score:", lr_r2)

print("\nRandom Forest Regressor:")
print("  MSE:", rf_mse)
print("  R² Score:", rf_r2)

Output:

Linear Regression:

  MSE: 102.48790963792534
  R² Score: 0.26059113512888576

Random Forest Regressor:

  MSE: 146.73119618890016
  R² Score: -0.05860630388999

Linear Regression Output:

  • MSE (102.49): On average, the squared prediction error is about 102.49 pounds².
  • R² Score (0.26): Only about 26% of the variation in weight is explained by height.

Random Forest Regressor Output:

  • MSE (146.73): The model makes larger prediction errors than Linear Regression.
  • R² Score (-0.05): The model performs worse than predicting the mean weight every time.
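To make these numbers easier to interpret, you can inspect the fitted line's slope and intercept, convert MSE to RMSE (which is in pounds), and try a sample prediction. This is an optional add-on that reuses the variables from the code above; the 68-inch height is purely illustrative:

# Optional: interpret the trained linear regression model
import numpy as np

# Slope and intercept of the fitted line: Weight = intercept + slope * Height
print("Slope (lbs per inch):", lr_model.coef_[0])
print("Intercept (lbs):", lr_model.intercept_)

# RMSE is in the same unit as the target (pounds), so it is easier to read than MSE
print("Linear Regression RMSE (lbs):", np.sqrt(lr_mse))

# Predict the weight of a hypothetical 68-inch-tall person
sample = pd.DataFrame({'Height(Inches)': [68]})
print("Predicted weight for 68 inches:", lr_model.predict(sample)[0])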

Conclusion

In this project, we built regression models on the heights and weights dataset to predict a person’s weight from their height. We analyzed and visualized the data and trained two different models: Simple Linear Regression and Random Forest Regressor.

The results showed that Linear Regression performed noticeably better in terms of both MSE and R². This is not surprising, since the exploratory analysis already suggested a roughly linear relationship between height and weight. The Random Forest Regressor, on the other hand, likely overfit this very simple, single-feature dataset and therefore underperformed.
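As a quick visual wrap-up, and a possible extension of the project, you can overlay the fitted regression line on the test data. This sketch reuses the variables defined in Step 5 and is not part of the original walkthrough:

# Optional: visualize the fitted regression line against the test data
import numpy as np
import matplotlib.pyplot as plt

# Sort the test points by height so the line is drawn cleanly from left to right
order = np.argsort(X_test['Height(Inches)'].values)

plt.figure(figsize=(6, 5))
plt.scatter(X_test['Height(Inches)'], y_test, color='lightgray', s=10, label='Actual (test set)')
plt.plot(X_test['Height(Inches)'].values[order], lr_preds[order], color='red', linewidth=2, label='Linear Regression fit')
plt.title('Linear Regression Fit: Height vs Weight')
plt.xlabel('Height (Inches)')
plt.ylabel('Weight (Pounds)')
plt.legend()
plt.show()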


Colab Link: https://colab.research.google.com/drive/1v_PwqDHThTREOkIiHF1Z0jKtpVNOIxFb?usp=sharing


