Analyzing the Heights and Weights Dataset Using Linear Regression
By Rohit Sharma
Updated on Jul 30, 2025 | 7 min read | 1.15K+ views
Height-weight relationships are useful in health sciences, fitness planning, and even fashion design. In this machine learning project, we explore the heights and weights dataset available on Kaggle, which contains real-world data on individuals' heights (in inches) and weights (in pounds).
The goal of this project is to predict an individual’s weight from their height using Simple Linear Regression. We will train the model on part of the heights and weights dataset and then evaluate its prediction accuracy on held-out data using common regression metrics.
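For intuition, simple linear regression fits a straight line, weight ≈ intercept + slope × height, by least squares. The sketch below computes that slope and intercept in closed form with NumPy. The helper name `fit_simple_linear_regression` and the five sample points are made up for illustration and are not taken from the Kaggle dataset.

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # slope = sum of cross-deviations / sum of squared x-deviations
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # the fitted line always passes through the point (x_mean, y_mean)
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical heights (inches) and weights (pounds) that lie exactly on a line
heights = [64.0, 66.0, 68.0, 70.0, 72.0]
weights = [110.0, 116.0, 122.0, 128.0, 134.0]

slope, intercept = fit_simple_linear_regression(heights, weights)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

On this toy data the slope comes out as 3 pounds per inch. Scikit-learn's LinearRegression, used later in the project, solves the same least-squares problem for us.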
It is better to have at least some background in:
For this project, the following tools and libraries will be used:
Tool/Library | Purpose
Python | Core programming language for the project
Pandas | To load, clean, and manipulate the dataset
NumPy | For numerical computations and array handling
Matplotlib | For creating basic plots and visualizations
Seaborn | For more advanced and aesthetically pleasing plots
Scikit-learn | To build, train, and evaluate the linear regression model
Google Colab | Online coding platform with free GPU access and easy sharing
To predict weight from height, we’ll use two regression models that work with continuous numerical data: Simple Linear Regression and Random Forest Regressor.
This project will take 1 to 2 hours, depending on how familiar you are with Python and its machine learning libraries. It is a beginner-level project, well suited to anyone getting started with data science or regression problems.
Let’s start building the project from scratch. We will download and load the dataset, explore it, prepare the features, and then train and evaluate the models.
Without any further delay, let’s start!
To build the height and weight prediction model, we will use the dataset available on Kaggle. Follow the steps mentioned below to download the dataset:
Now that the .csv file has been downloaded, let’s upload it to the Colab environment. Use the following code to open a file picker and load the dataset.
# Upload the dataset
from google.colab import files
uploaded = files.upload()
Output:
SOCR-HeightWeight.csv(text/csv) - 608346 bytes, last modified: 7/24/2025 - 100% done
Saving SOCR-HeightWeight.csv to SOCR-HeightWeight.csv
Once uploaded, read the file using Pandas. Here’s the code to do so:
# Import pandas
import pandas as pd
# Load the dataset
df = pd.read_csv("SOCR-HeightWeight.csv")
# Display the first 5 rows
df.head()
Output:
   Index  Height(Inches)  Weight(Pounds)
0      1        65.78331        112.9925
1      2        71.51521        136.4873
2      3        69.39874        153.0269
3      4        68.21660        142.3354
4      5        67.78781        144.2971
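What does the output tell us?
The output shows that the dataset has three columns: Index, a running row number that starts at 1; Height(Inches); and Weight(Pounds). Height and weight are continuous values recorded to several decimal places.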
In this step, we will explore the dataset to understand its distributions and the relationship between height and weight, and to spot anomalies before modeling.
The code below prints the dataset's shape, column types, and missing-value counts, then plots a histogram of each variable and a scatter plot of height against weight:
# Import necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Shape of the dataset
print("Shape of the dataset:", df.shape)
# Basic info
# (df.info() prints its report directly and returns None, so print() also emits a trailing "None")
print("\nDataset Info:")
print(df.info())
# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())
# Histogram of Height
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['Height(Inches)'], kde=True, color='skyblue')
plt.title('Distribution of Height')
plt.xlabel('Height (Inches)')
plt.ylabel('Frequency')
# Histogram of Weight
plt.subplot(1, 2, 2)
sns.histplot(df['Weight(Pounds)'], kde=True, color='salmon')
plt.title('Distribution of Weight')
plt.xlabel('Weight (Pounds)')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
# Scatter plot to show relationship
plt.figure(figsize=(6, 5))
sns.scatterplot(x='Height(Inches)', y='Weight(Pounds)', data=df, color='green')
plt.title('Height vs Weight')
plt.xlabel('Height (Inches)')
plt.ylabel('Weight (Pounds)')
plt.show()
Output:
Shape of the dataset: (25000, 3)
Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Index           25000 non-null  int64
 1   Height(Inches)  25000 non-null  float64
 2   Weight(Pounds)  25000 non-null  float64
dtypes: float64(2), int64(1)
memory usage: 586.1 KB
None
Missing Values:
Index             0
Height(Inches)    0
Weight(Pounds)    0
dtype: int64
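What does the output tell us?
After running the combined EDA block, we found that the dataset has 25,000 rows and 3 columns, that Index is an integer column while Height(Inches) and Weight(Pounds) are floats, and that no column has missing values, so no imputation is needed.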
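The scatter plot shows the height-weight relationship visually; a Pearson correlation coefficient puts a number on it. The sketch below is an optional addition, not part of the original notebook. The helper name `height_weight_correlation` and the five-row sample are invented for illustration; only the column names match the dataset.

```python
import pandas as pd

def height_weight_correlation(df, x_col="Height(Inches)", y_col="Weight(Pounds)"):
    """Return the Pearson correlation between two numeric columns of df."""
    return df[x_col].corr(df[y_col])

# Tiny made-up sample that reuses the dataset's column names
sample = pd.DataFrame({
    "Height(Inches)": [64.0, 66.0, 68.0, 70.0, 72.0],
    "Weight(Pounds)": [112.0, 118.0, 121.0, 130.0, 133.0],
})

r = height_weight_correlation(sample)
print(f"Pearson r: {r:.3f}")
print(f"r squared: {r ** 2:.3f}")
```

Calling `height_weight_correlation(df)` on the loaded dataset should work the same way. For a single-feature linear regression, the square of this correlation equals the R² the model achieves on the data it was fitted to, so it gives an early hint of how much height alone can explain.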
Before training the model, we need to prepare the dataset by selecting the relevant feature and removing unnecessary columns. In our case, the Index column is only a row counter and has no predictive value, so we will drop it.
Here is the code to do so:
# Drop the 'Index' column
df = df.drop('Index', axis=1)
# Separate features and target
X = df[['Height(Inches)']] # Independent variable (feature)
y = df['Weight(Pounds)'] # Dependent variable (target)
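The double brackets around the feature column matter: scikit-learn estimators expect a two-dimensional feature matrix and a one-dimensional target. A small sketch on a made-up three-row frame (the values are illustrative, not rows from the dataset) shows the difference in shape:

```python
import pandas as pd

# Made-up rows that reuse the dataset's column names
sample = pd.DataFrame({
    "Height(Inches)": [65.8, 71.5, 69.4],
    "Weight(Pounds)": [113.0, 136.5, 153.0],
})

X_sample = sample[["Height(Inches)"]]  # list of columns -> DataFrame (2D)
y_sample = sample["Weight(Pounds)"]    # single column label -> Series (1D)

print(type(X_sample).__name__, X_sample.shape)
print(type(y_sample).__name__, y_sample.shape)
```

Passing a one-dimensional Series as the feature would make `fit` raise an error asking for a 2D array, which is why the feature is selected with a list of column names.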
In this step, we will train and compare two models to predict weight based on height: Linear Regression and Random Forest Regressor.
Here’s the code to do so:
# Import libraries
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize models
lr_model = LinearRegression()
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
# Train both models
lr_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)
# Predict
lr_preds = lr_model.predict(X_test)
rf_preds = rf_model.predict(X_test)
# Evaluate Linear Regression
lr_mse = mean_squared_error(y_test, lr_preds)
lr_r2 = r2_score(y_test, lr_preds)
# Evaluate Random Forest
rf_mse = mean_squared_error(y_test, rf_preds)
rf_r2 = r2_score(y_test, rf_preds)
# Print results
print("Linear Regression:")
print(" MSE:", lr_mse)
print(" R² Score:", lr_r2)
print("\nRandom Forest Regressor:")
print(" MSE:", rf_mse)
print(" R² Score:", rf_r2)
Output:
Linear Regression:
MSE: 102.48790963792534
R² Score: 0.26059113512888576
Random Forest Regressor:
MSE: 146.73119618890016
R² Score: -0.05860630388999
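Linear Regression Output: an MSE of about 102.49 (an RMSE of roughly 10.1 pounds) and an R² of about 0.26, meaning height alone explains roughly 26% of the variance in weight on the test set.
Random Forest Regressor Output: an MSE of about 146.73 and an R² of about -0.06. A negative R² means the model did worse on the test set than a constant prediction equal to the mean test-set weight.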
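To see where these numbers come from, the sketch below recomputes MSE and R² by hand with NumPy, using the standard definition R² = 1 - SS_res / SS_tot. The helper name `mse_and_r2` and the three-point arrays are made up for illustration.

```python
import numpy as np

def mse_and_r2(y_true, y_pred):
    """Return (MSE, R^2) using R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    ss_res = np.sum(residuals ** 2)                 # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return mse, 1.0 - ss_res / ss_tot

y_true = [120.0, 130.0, 140.0]

# Always predicting the mean gives R^2 = 0
mean_mse, mean_r2 = mse_and_r2(y_true, [130.0, 130.0, 130.0])

# Predictions worse than the mean give a negative R^2
bad_mse, bad_r2 = mse_and_r2(y_true, [140.0, 130.0, 120.0])

print("mean baseline:", mean_mse, mean_r2)
print("bad model:", bad_mse, bad_r2)
```

The second case is the situation the Random Forest ended up in on the test set: its squared error exceeded that of the constant-mean baseline, which pushes R² below zero.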
In this project, we built regression models on the heights and weights dataset to predict a person's weight from their height. We analyzed and visualized the data and trained two different models: Simple Linear Regression and Random Forest Regressor.
The results showed that Linear Regression performed better on both MSE and R². This is not surprising, since the relationship between height and weight is expected to be roughly linear. Even so, an R² of about 0.26 means that height alone leaves most of the variation in weight unexplained. The Random Forest Regressor underperformed, most likely because it overfit noise in the training data, a common risk for a flexible model on a simple dataset with a single feature.
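One way to check the overfitting explanation is to compare each model's R² on the training split with its R² on the test split. The sketch below does this on synthetic data, not on the Kaggle file: the sample size, the slope of 3, the intercept of -80, and the noise level are all assumptions chosen only to mimic a weak, noisy linear relationship.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset (all parameters are assumptions)
rng = np.random.default_rng(0)
n = 2000
height = rng.normal(68.0, 2.0, size=n)
weight = -80.0 + 3.0 * height + rng.normal(0.0, 10.0, size=n)

X = height.reshape(-1, 1)  # 2D feature matrix with one column
X_train, X_test, y_train, y_test = train_test_split(
    X, weight, test_size=0.2, random_state=42
)

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # For regressors, .score() returns R^2
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train R2={scores[name][0]:.3f}, test R2={scores[name][1]:.3f}")
```

A large drop from training R² to test R² suggests that a model has memorized noise rather than learned the trend. Running the same comparison with the project's `lr_model` and `rf_model` would show whether that is what happened here; limiting tree depth (for example with `max_depth`) is one common way to reduce it.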
Colab link: https://colab.research.google.com/drive/1v_PwqDHThTREOkIiHF1Z0jKtpVNOIxFb?usp=sharing
Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...