Analyzing the Heights and Weights Dataset Using Linear Regression

By Rohit Sharma

Updated on Jul 30, 2025 | 7 min read | 1.15K+ views


Height-weight relationships are useful in health sciences, fitness planning, and even fashion design. In this machine learning project, we will explore the Heights and Weights dataset available on Kaggle, which consists of real-world data on individuals' heights (in inches) and weights (in pounds).

The goal of this project is to predict an individual’s weight based on their height using machine learning techniques, specifically Simple Linear Regression. We will train the model on the heights and weights dataset and then evaluate its prediction accuracy using common regression metrics.
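For intuition, simple linear regression fits a straight line whose slope and intercept are learned from the training data. In generic terms (these symbols are placeholders, not values from this dataset):

Weight ≈ b0 + b1 × Height

Here, b1 is the slope (the expected change in weight, in pounds, per additional inch of height) and b0 is the intercept (a mathematical offset rather than a physically meaningful weight).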

Explore more project ideas like this in our Top 25+ Essential Data Science Projects GitHub to Explore in 2025 blog.

What Should You Know Beforehand?

It is better to have at least some background in:

  • Basic Python programming
  • Core data libraries such as Pandas and NumPy
  • Elementary statistics and the idea of regression

Technologies and Libraries Used

For this project, the following tools and libraries will be used:

Tool/Library | Purpose
Python | Core programming language for the project
Pandas | To load, clean, and manipulate the dataset
NumPy | For numerical computations and array handling
Matplotlib | For creating basic plots and visualizations
Seaborn | For more advanced and aesthetically pleasing plots
Scikit-learn | To build, train, and evaluate the linear regression model
Google Colab | Online coding platform with free GPU access and easy sharing

Models That Will Be Utilized for Learning

To predict weight from height, we’ll use regression models that work well with continuous numerical data:

  • Simple Linear Regression (our baseline model)
  • Random Forest Regressor (for comparison, to check for any non-linear patterns)

Time Taken and Difficulty Level

This project should take about 1 to 2 hours, depending on how familiar you are with Python and the machine learning libraries. It falls under the beginner level, making it perfect for anyone getting started with data science or regression problems.

How to Build the Height and Weight Prediction Model

Let’s start building the project from scratch. We will start by:

  • Loading and exploring the dataset
  • Visualizing the relationship between height and weight
  • Preprocessing the data for modeling
  • Training and evaluating a regression model

Without any further delay, let’s start!

Step 1: Download the Dataset

To build the height and weight prediction model, we will use the dataset available on Kaggle. Follow the steps mentioned below to download the dataset:

  1. Open a new tab in any web browser. 
  2. Go to https://www.kaggle.com/datasets/burnoutminer/heights-and-weights-dataset.
  3. On the Height and Weight Dataset page, in the right pane, under the Data Explorer section, click SOCR-HeightWeight.csv.
  4. Click the download icon.
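If you’d rather skip the manual download, the official Kaggle CLI can fetch the dataset directly inside Colab. This step is optional and assumes you have a Kaggle account and have uploaded your kaggle.json API token to the Colab session (the kaggle package is usually preinstalled in Colab; install it with pip if it is missing):

# Optional alternative: download the dataset with the Kaggle CLI
# Assumes kaggle.json (your API token) has already been uploaded to the session
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

# Download and unzip the dataset files
!kaggle datasets download -d burnoutminer/heights-and-weights-dataset
!unzip -o heights-and-weights-dataset.zip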

Step 2: Upload and Load the Dataset in Google Colab

Now that the .csv file has been downloaded, let’s upload it to the Colab environment. Use the following code to open a file picker and load the dataset.

# Upload the dataset
from google.colab import files
uploaded = files.upload()

Output:

SOCR-HeightWeight.csv(text/csv) - 608346 bytes, last modified: 7/24/2025 - 100% done
Saving SOCR-HeightWeight.csv to SOCR-HeightWeight.csv

Once uploaded, read the file using Pandas. Here’s the code to do so:

# Import pandas
import pandas as pd

# Load the dataset
df = pd.read_csv("SOCR-HeightWeight.csv")

# Display the first 5 rows
df.head()

Output:

 

   Index  Height(Inches)  Weight(Pounds)
0      1        65.78331        112.9925
1      2        71.51521        136.4873
2      3        69.39874        153.0269
3      4        68.21660        142.3354
4      5        67.78781        144.2971

What does the output tell us?

The output tells us that:

  • The dataset includes an Index column along with Height(Inches) and Weight(Pounds).
  • All values are numerical, so the dataset is suitable for regression analysis.
  • No missing values are visible in this initial preview (a quick summary check follows below).
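As a quick sanity check beyond the five-row preview, you can also print summary statistics for the whole dataset. This snippet is a small optional addition on top of the article’s code:

# Optional: summary statistics for all numeric columns
# The 'count' row should match the number of rows, confirming there are no missing values
print(df.describe())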

Step 3: Perform Exploratory Data Analysis (EDA)

In this step, we will explore the height and weight dataset further to understand its patterns, distributions, and relationships. This helps identify trends and spot anomalies before modeling.

Use the code below to do the same:

# Import necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Shape of the dataset
print("Shape of the dataset:", df.shape)

# Basic info
print("\nDataset Info:")
print(df.info())

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Histogram of Height
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['Height(Inches)'], kde=True, color='skyblue')
plt.title('Distribution of Height')
plt.xlabel('Height (Inches)')
plt.ylabel('Frequency')

# Histogram of Weight
plt.subplot(1, 2, 2)
sns.histplot(df['Weight(Pounds)'], kde=True, color='salmon')
plt.title('Distribution of Weight')
plt.xlabel('Weight (Pounds)')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

# Scatter plot to show relationship
plt.figure(figsize=(6, 5))
sns.scatterplot(x='Height(Inches)', y='Weight(Pounds)', data=df, color='green')
plt.title('Height vs Weight')
plt.xlabel('Height (Inches)')
plt.ylabel('Weight (Pounds)')
plt.show()

Output:

Shape of the dataset: (25000, 3)

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Index           25000 non-null  int64  
 1   Height(Inches)  25000 non-null  float64
 2   Weight(Pounds)  25000 non-null  float64
dtypes: float64(2), int64(1)
memory usage: 586.1 KB
None

Missing Values:
Index             0
Height(Inches)    0
Weight(Pounds)    0
dtype: int64

What does the output tell us?

After running the combined EDA block, we found that:

  • There are 25,000 entries, no missing values, and three clean numeric columns: Index, Height(Inches), and Weight(Pounds).
  • The distribution plots for both height and weight are roughly symmetric and bell-shaped, with no visible outliers.
  • The scatter plot (Height vs Weight) shows a positive linear trend, which supports our decision to use Simple Linear Regression for prediction (a one-line correlation check follows below).
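To put a number on the trend seen in the scatter plot, you can compute the Pearson correlation between the two columns. This one-liner is an optional addition, not part of the original walkthrough:

# Optional: Pearson correlation between height and weight
# Values closer to +1 indicate a stronger positive linear relationship
print(df['Height(Inches)'].corr(df['Weight(Pounds)']))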

Step 4: Data Preprocessing

Before training the model, we need to prepare the dataset. We will do this by selecting relevant features and removing unnecessary columns. In our case, the Index column has no predictive purpose. Hence, we will drop it. 

Here is the code to do so:

# Drop the 'Index' column
df = df.drop('Index', axis=1)

# Separate features and target
X = df[['Height(Inches)']]   # Independent variable (feature)
y = df['Weight(Pounds)']     # Dependent variable (target)

Step 5: Train the Model (Linear Regression & Random Forest)

In this step, we will train and compare the following models to predict weight based on height:

  • Simple Linear Regression (baseline model)
  • Random Forest Regressor (to capture non-linear patterns, if there are any present)

Here’s the code to do so:

# Import libraries
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize models
lr_model = LinearRegression()
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train both models
lr_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)

# Predict
lr_preds = lr_model.predict(X_test)
rf_preds = rf_model.predict(X_test)

# Evaluate Linear Regression
lr_mse = mean_squared_error(y_test, lr_preds)
lr_r2 = r2_score(y_test, lr_preds)

# Evaluate Random Forest
rf_mse = mean_squared_error(y_test, rf_preds)
rf_r2 = r2_score(y_test, rf_preds)

# Print results
print("Linear Regression:")
print("  MSE:", lr_mse)
print("  R² Score:", lr_r2)

print("\nRandom Forest Regressor:")
print("  MSE:", rf_mse)
print("  R² Score:", rf_r2)

Output:

Linear Regression:

  MSE: 102.48790963792534
  R² Score: 0.26059113512888576

Random Forest Regressor:

  MSE: 146.73119618890016
  R² Score: -0.05860630388999

Linear Regression Output:

  • MSE (102.49): On average, the squared prediction error is about 102.49 pounds².
  • R² Score (0.26): Only about 26% of the variation in weight is explained by height.

Random Forest Regressor Output:

  • MSE (146.73): The model makes larger prediction errors than Linear Regression.
  • R² Score (-0.05): The model performs worse than predicting the mean weight every time.
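To make these numbers easier to interpret, you can inspect the fitted line's slope and intercept, convert MSE to RMSE (which is in pounds), and try a sample prediction. This is an optional add-on that reuses the variables from the code above; the 68-inch height is purely illustrative:

# Optional: interpret the trained linear regression model
import numpy as np

# Slope and intercept of the fitted line: Weight = intercept + slope * Height
print("Slope (lbs per inch):", lr_model.coef_[0])
print("Intercept (lbs):", lr_model.intercept_)

# RMSE is in the same unit as the target (pounds), so it is easier to read than MSE
print("Linear Regression RMSE (lbs):", np.sqrt(lr_mse))

# Predict the weight of a hypothetical 68-inch-tall person
sample = pd.DataFrame({'Height(Inches)': [68]})
print("Predicted weight for 68 inches:", lr_model.predict(sample)[0])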

Conclusion

In this project, we built regression models on the heights and weights dataset to predict a person’s weight from their height. We analyzed and visualized the data and trained two different models: Simple Linear Regression and Random Forest Regressor.

The results showed that Linear Regression performed noticeably better in terms of both MSE and R². This is not surprising, since the exploratory analysis already suggested a roughly linear relationship between height and weight. The Random Forest Regressor, on the other hand, likely overfit this very simple, single-feature dataset and therefore underperformed.
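As a quick visual wrap-up, and a possible extension of the project, you can overlay the fitted regression line on the test data. This sketch reuses the variables defined in Step 5 and is not part of the original walkthrough:

# Optional: visualize the fitted regression line against the test data
import numpy as np
import matplotlib.pyplot as plt

# Sort the test points by height so the line is drawn cleanly from left to right
order = np.argsort(X_test['Height(Inches)'].values)

plt.figure(figsize=(6, 5))
plt.scatter(X_test['Height(Inches)'], y_test, color='lightgray', s=10, label='Actual (test set)')
plt.plot(X_test['Height(Inches)'].values[order], lr_preds[order], color='red', linewidth=2, label='Linear Regression fit')
plt.title('Linear Regression Fit: Height vs Weight')
plt.xlabel('Height (Inches)')
plt.ylabel('Weight (Pounds)')
plt.legend()
plt.show()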


Colab Link: https://colab.research.google.com/drive/1v_PwqDHThTREOkIiHF1Z0jKtpVNOIxFb?usp=sharing


