Airbnb Listing Analysis Project Using R

By Rohit Sharma

Updated on Aug 11, 2025 | 14 min read | 1.88K+ views

This Airbnb Listing Analysis Project Using R explores key insights from Airbnb data, focusing on price, ratings, and property features.

The project involves data cleaning, exploratory analysis, correlation analysis, and clustering to segment listings based on pricing and customer ratings.

Using powerful R packages like dplyr, ggplot2, and factoextra, it identifies patterns that help understand market dynamics. A correlation heatmap is also used to visualize relationships between numerical features.

How Long Will It Take and What Skills Do You Need for This Project

Before starting the Airbnb Listing Analysis Project Using R, it’s important to know the estimated time commitment, technical complexity, and what tools or skills are involved. The table below gives all the important information you need.

Aspect	Details
Estimated Duration	2–3 hours
Difficulty Level	Beginner to Intermediate
Tools Required	R, Google Colab (with R kernel), RStudio (optional)
R Packages Used	tidyverse (dplyr, ggplot2, tidyr), lubridate, corrplot, factoextra, ggcorrplot, stats
Key Skills Needed	Data Cleaning, EDA, Data Visualization, Clustering, Correlation Analysis

Complete Breakdown of the Airbnb Listing Analysis Project in R

Below are the individual steps that will be used to build this Airbnb Listing Analysis Project Using R. Each step includes the code with brief comments added for clarity.

Step 1: Initial Setup: Enable R Support in Google Colab

To work with R in Google Colab, you first need to switch the default programming language from Python to R. Follow these steps to configure your notebook:

Open Google Colab and create a new notebook.
In the top menu bar, go to Runtime.
Select Change runtime type.
Under the Language dropdown, choose R.
Click Save to apply the changes.
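
Once the runtime has been switched, a quick check along these lines can confirm that the notebook is running R rather than Python. This is a minimal sketch using only base R; it is not part of the original notebook.

```r
# Print the R version string of the active runtime
print(R.version.string)

# Show the library paths where install.packages() will place packages
print(.libPaths())
```

If the first cell fails with a Python syntax error, the runtime is most likely still set to Python.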

Step 2: Install and Load Required R Packages

In this step, we prepare the environment by installing and loading the necessary R packages. These libraries will help with data manipulation, working with dates, and visualizing correlations later in this Airbnb listing analysis project. Here’s the code:

# Install required packages (only once)
install.packages("tidyverse")   # For data manipulation and plots
install.packages("lubridate")   # For date parsing
install.packages("corrplot")    # For correlation heatmap

# Load the libraries
library(tidyverse)   # Load tidyverse for dplyr, ggplot2, etc.
library(lubridate)   # Load lubridate to handle date formats
library(corrplot)    # Load corrplot for correlation visualization

The output confirms that the libraries and packages are installed and loaded correctly:

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
corrplot 0.95 loaded
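
Because install.packages() reinstalls a package every time the cell runs, it can help to install only what is missing. The following is a minimal sketch of that idea; missing_packages() and the do_install flag are hypothetical names introduced here, not part of the original notebook.

```r
# Packages this project loads in Step 2
required <- c("tidyverse", "lubridate", "corrplot")

# Hypothetical helper: return the packages that are not yet installed
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(required)
print(to_install)

# Flip this flag to TRUE to actually install whatever is missing
do_install <- FALSE
if (do_install && length(to_install) > 0) {
  install.packages(to_install)
}
```

On a fresh Colab runtime the list is usually non-empty, since packages installed in an earlier session are not kept.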

Step 3: Load and Preview the Airbnb Listing Dataset

In this section, we import the Airbnb dataset into our R environment and examine its basic structure. This gives us an idea of what variables are available and how the data is formatted. Here’s the code:

# Read the Airbnb data from your upload path
data <- read.csv("airnb.csv", stringsAsFactors = FALSE)   # Load CSV file without converting strings to factors

# Preview the structure
str(data)   # Check the structure and types of variables
head(data)  # Display the first few rows of the dataset

The output of the above step gives us a preview of the dataset we’re working with.

'data.frame': 953 obs. of 7 variables:
$ Title : chr "Chalet in Skykomish, Washington, US" "Cabin in Hancock, New York, US" "Cabin in West Farmington, Ohio, US" "Home in Blue Ridge, Georgia, US" ...
$ Detail : chr "Sky Haus - A-Frame Cabin" "The Catskill A-Frame - Mid-Century Modern Cabin" "The Triangle: A-Frame Cabin for your city retreat" "*Summer Sizzle* 5 Min to Blue Ridge* Pets* Hot tub" ...
$ Date : chr "Jun 11 - 16" "Jun 6 - 11" "Jul 9 - 14" "Jun 11 - 16" ...
$ Price.in.dollar. : chr "306.00" "485.00" "119.00" "192.00" ...
$ Offer.price.in.dollar.: chr "229.00" "170.00" "522.00" "348.00" ...
$ Review.and.rating : chr "4.85 (531)" "4.77 (146)" "4.91 (515)" "4.94 (88)" ...
$ Number.of.bed : chr "4 beds" "4 beds" "4 beds" "5 beds" ...

	Title	Detail	Date	Price.in.dollar.	Offer.price.in.dollar.	Review.and.rating	Number.of.bed
	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>
1	Chalet in Skykomish, Washington, US	Sky Haus - A-Frame Cabin	Jun 11 - 16	306.00	229.00	4.85 (531)	4 beds
2	Cabin in Hancock, New York, US	The Catskill A-Frame - Mid-Century Modern Cabin	Jun 6 - 11	485.00	170.00	4.77 (146)	4 beds
3	Cabin in West Farmington, Ohio, US	The Triangle: A-Frame Cabin for your city retreat	Jul 9 - 14	119.00	522.00	4.91 (515)	4 beds
4	Home in Blue Ridge, Georgia, US	Summer Sizzle 5 Min to Blue Ridge* Pets* Hot tub	Jun 11 - 16	192.00	348.00	4.94 (88)	5 beds
5	Treehouse in Grandview, Texas, US	Luxury Treehouse Couples Getaway w/ Peaceful Views	Jun 4 - 9	232.00	196.00	4.99 (222)	1 queen bed
6	Tiny home in Puerto Escondido, Mexico	Casa Tiny near Casa Wabi	Jun 21 - 26	261.00	148.00	4.84 (555)	1 double bed
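
Two things in this preview are worth a closer look: the dotted column names and the fact that the price columns load as text. The sketch below reproduces both on a tiny inline CSV. The header spellings ("Price(in dollar)", "Number of bed") and the "1,050.00" value are assumptions made for illustration; the real file may differ.

```r
# Hypothetical two-row CSV; header spellings and the "1,050.00" value are assumptions
csv_text <- 'Title,Price(in dollar),Number of bed
"Cabin in Hancock, New York, US","485.00",4 beds
"Home in Blue Ridge, Georgia, US","1,050.00",5 beds'

raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# read.csv() rewrites headers into syntactic R names (spaces and brackets become dots)
print(names(raw))

# A single value with a thousands separator keeps the whole column as character
str(raw)

# Strip "$" and "," before converting so such values do not turn into NA
price_num <- as.numeric(gsub("[$,]", "", raw$Price.in.dollar.))
print(price_num)
```

If the real data contains values with thousands separators, removing only the "$" sign would leave them as NA after as.numeric().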

Step 4: Rename Dataset Columns for Simplicity

This step simplifies the dataset by renaming the column headers to shorter, more accessible names. Clean column names make it easier to work with data in future steps. Here’s the code:

# Rename columns for easier access (names are assigned by position, so the order must match)
colnames(data) <- c("title", "detail", "date", "price", "offer_price", "review_rating", "num_beds")
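
Assigning names by position works only while the column order stays the same. As a hedged alternative, the sketch below renames by matching the old names, so a reordered file does not silently mislabel columns. The toy data frame and the new_names lookup are illustrative, covering three of the seven columns.

```r
# Toy data frame using three of the original column names from Step 3
toy <- data.frame(
  Title = "Cabin in Hancock, New York, US",
  Price.in.dollar. = "485.00",
  Review.and.rating = "4.77 (146)",
  stringsAsFactors = FALSE
)

# Lookup: element names are the old column names, values are the new ones
new_names <- c(
  Title = "title",
  Price.in.dollar. = "price",
  Review.and.rating = "review_rating"
)

# Stop early if an expected column is missing
stopifnot(all(names(new_names) %in% names(toy)))

# Rename only the matched columns, wherever they sit in the data frame
names(toy)[match(names(new_names), names(toy))] <- unname(new_names)
print(names(toy))
```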

Step 5: Clean and Prepare Airbnb Data for Analysis

This step transforms messy textual columns into structured numeric formats. It handles currency symbols, extracts ratings and reviews safely from strings, and converts bed counts for easier analysis later. Here’s the code to clean the data:

# Clean and prepare the dataset

# 1. Remove $ symbols and convert price columns to numeric
data$price <- as.numeric(gsub("\\$", "", data$price))
data$offer_price <- as.numeric(gsub("\\$", "", data$offer_price))

# 2. Extract rating and review count from strings like "4.85 (531)"

# Create new 'rating' column: keep the text before the first space, e.g. "4.85"
data$rating <- suppressWarnings(as.numeric(sub(" .*", "", data$review_rating)))

# Set rating to NA if the string does not start with a number (like "New")
data$rating[!grepl("^[0-9\\.]+", data$review_rating)] <- NA

# Create 'reviews' column: extract digits inside brackets like "(123)"
# Caution: [[1]] keeps only the first listing's match, so that single count (531 here)
# is recycled to every row; 'reviews' ends up constant and is dropped in Step 7
data$reviews <- as.numeric(gsub("[^0-9]", "", regmatches(data$review_rating, gregexpr("\\([0-9]+\\)", data$review_rating))[[1]]))

# 3. Convert number of beds to numeric (remove text like "queen bed")
data$num_beds <- as.numeric(gsub("[^0-9]", "", data$num_beds))

# 4. Check the cleaned data
head(data)

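The output of the above code is shown below. Note that the reviews column reads 531 on every row even though review_rating shows a different count for each listing: the [[1]] in the reviews extraction keeps only the first listing's match and recycles it to all rows.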

	title	detail	date	price	offer_price	review_rating	num_beds	rating	reviews
	<chr>	<chr>	<chr>	<dbl>	<dbl>	<chr>	<dbl>	<dbl>	<dbl>
1	Chalet in Skykomish, Washington, US	Sky Haus - A-Frame Cabin	Jun 11 - 16	306	229	4.85 (531)	4	4.85	531
2	Cabin in Hancock, New York, US	The Catskill A-Frame - Mid-Century Modern Cabin	Jun 6 - 11	485	170	4.77 (146)	4	4.77	531
3	Cabin in West Farmington, Ohio, US	The Triangle: A-Frame Cabin for your city retreat	Jul 9 - 14	119	522	4.91 (515)	4	4.91	531
4	Home in Blue Ridge, Georgia, US	Summer Sizzle 5 Min to Blue Ridge* Pets* Hot tub	Jun 11 - 16	192	348	4.94 (88)	5	4.94	531
5	Treehouse in Grandview, Texas, US	Luxury Treehouse Couples Getaway w/ Peaceful Views	Jun 4 - 9	232	196	4.99 (222)	1	4.99	531
6	Tiny home in Puerto Escondido, Mexico	Casa Tiny near Casa Wabi	Jun 21 - 26	261	148	4.84 (555)	1	4.84	531
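
To get a per-listing review count instead of a single recycled value, the extraction has to work on the whole column at once. The following is a minimal sketch of one vectorized way to do that in base R; parse_review_rating() is a hypothetical helper, and the thousands-separator handling is an assumption about how larger counts might be written.

```r
# Hypothetical helper: split strings like "4.85 (531)" into rating and review count
parse_review_rating <- function(x) {
  # Rating: leading number such as "4.85"; NA when the string starts with text (e.g. "New")
  rating_txt <- ifelse(grepl("^[0-9.]+", x), sub("^([0-9.]+).*", "\\1", x), NA_character_)

  # Reviews: digits inside parentheses, e.g. "(531)" or "(1,234)"; NA when absent
  has_count <- grepl("\\([0-9,]+\\)", x)
  count_txt <- ifelse(has_count, sub(".*\\(([0-9,]+)\\).*", "\\1", x), NA_character_)

  data.frame(
    rating  = as.numeric(rating_txt),
    reviews = as.numeric(gsub(",", "", count_txt))
  )
}

# Three values taken from the preview above, plus a "New" listing
sample_values <- c("4.85 (531)", "4.77 (146)", "New", "4.94 (88)")
parsed <- parse_review_rating(sample_values)
print(parsed)
```

Applied to the full column (for example parse_review_rating(data$review_rating)), this would give each row its own count, which changes the summary statistics and makes reviews usable as a clustering or correlation feature.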

Step 6: Summarize and Handle Missing Data in Airbnb Listings

This step helps you inspect the dataset by viewing summary statistics and checking for missing values. If desired, it also creates a cleaner version of the data by dropping rows with any NA values. Here’s the code:

# View summary statistics of the cleaned dataset
summary(data)

# Count how many missing values are in each column
colSums(is.na(data))

# Optional: Create a clean version of the data without NAs
clean_data <- data %>% drop_na()

The output of the above code is:

    title              detail              date               price       
 Length:953         Length:953         Length:953         Min.   : 16.00  
 Class :character   Class :character   Class :character   1st Qu.: 82.75  
 Mode  :character   Mode  :character   Mode  :character   Median :134.50  
                                                          Mean   :170.00  
                                                          3rd Qu.:220.25  
                                                          Max.   :986.00  
                                                          NA's   :1       
  offer_price    review_rating         num_beds          rating     
 Min.   : 16.0   Length:953         Min.   : 1.000   Min.   :3.670  
 1st Qu.: 63.0   Class :character   1st Qu.: 1.000   1st Qu.:4.820  
 Median :118.0   Mode  :character   Median : 2.000   Median :4.890  
 Mean   :150.9                      Mean   : 2.183   Mean   :4.863  
 3rd Qu.:182.0                      3rd Qu.: 3.000   3rd Qu.:4.960  
 Max.   :819.0                      Max.   :22.000   Max.   :5.000  
 NA's   :788                                         NA's   :22     
    reviews   
 Min.   :531  
 1st Qu.:531  
 Median :531  
 Mean   :531  
 3rd Qu.:531  
 Max.   :531  

        title        detail          date         price   offer_price review_rating      num_beds        rating       reviews 
            0             0             0             1           788             0             0            22             0 
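
Since offer_price is missing in 788 of 953 rows, dropping every row with any NA can keep at most 165 listings. A hedged alternative is to drop rows only where the columns you actually need are missing. The sketch below shows the difference on a toy data frame (the values are illustrative) using base R's complete.cases().

```r
# Toy data: offer_price is mostly missing, as in the summary above
toy <- data.frame(
  price       = c(306, 485, 119, NA, 232),
  offer_price = c(229, NA, NA, NA, NA),
  rating      = c(4.85, 4.77, NA, 4.94, 4.99)
)

# Dropping rows with ANY missing value keeps only one listing
all_complete <- toy[complete.cases(toy), ]
print(nrow(all_complete))

# Dropping rows only where price or rating is missing keeps more of the data
needed <- c("price", "rating")
partly_complete <- toy[complete.cases(toy[, needed]), ]
print(nrow(partly_complete))
```

This is the same idea Step 9 uses when it selects price and rating first and only then removes missing values.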

Step 7: Finalize Numeric Data and View Correlation Matrix

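This step builds the numeric subset used for the correlation analysis, removes the reviews column (it is constant at 531 on every row, as the summary in Step 6 shows, so its correlation with any other variable is undefined), and then calculates and displays the correlation matrix for the remaining numeric variables. The original post never shows how numeric_data is created, so the first statement below is an assumed definition chosen to match the printed matrix. Here’s the code:

# Assumed definition (not shown in the original post): numeric columns, complete rows only
numeric_data <- data %>% select(price, offer_price, rating, num_beds, reviews) %>% drop_na()

# Drop 'reviews' since it is constant (zero variance)
numeric_data <- numeric_data %>% select(-reviews)

# Calculate correlation matrix
cor_matrix <- cor(numeric_data)

# Print correlation matrix
print(cor_matrix)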

The output for the above step is:

                price offer_price      rating   num_beds
price       1.0000000  0.31912435  0.20888333 0.37843215
offer_price 0.3191244  1.00000000 -0.02489574 0.05964721
rating      0.2088833 -0.02489574  1.00000000 0.04495290
num_beds    0.3784321  0.05964721  0.04495290 1.00000000
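
A constant column has a standard deviation of zero, which is why it has to be removed before calling cor(). Rather than dropping such columns by name, they can be detected automatically. This is a minimal sketch on toy values (the prices and ratings are illustrative); has_variance is a name introduced here.

```r
# Toy data: 'reviews' is constant, like the column produced in Step 5
toy <- data.frame(
  price   = c(100, 150, 200, 250),
  rating  = c(4.5, 4.8, 4.7, 4.9),
  reviews = rep(531, 4)
)

# Flag columns whose standard deviation is greater than zero
has_variance <- vapply(toy, function(col) sd(col, na.rm = TRUE) > 0, logical(1))
print(has_variance)

# Correlate only the columns that actually vary
cor_toy <- cor(toy[, has_variance, drop = FALSE])
print(cor_toy)
```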

Step 8: Visualize Correlation Between Features

In this step, we use the corrplot package to create a heatmap that shows how strongly different numeric Airbnb features relate to each other. It helps you spot relationships like price vs rating or number of beds. The code for this step is:

# Plot the correlation heatmap
corrplot(cor_matrix, 
         method = "color",         # Use colored squares to show correlation strength
         addCoef.col = "black",    # Display numeric correlation values inside boxes
         tl.col = "black",         # Set text labels (feature names) to black
         number.cex = 0.8,         # Adjust size of the numbers shown
         title = "Correlation Heatmap of Airbnb Features",  # Title of the plot
         mar = c(0,0,1,0))         # Set margin to make space for the title

The above code gives us a graph of a correlation heatmap of Airbnb features.

The above plot shows that:

Price and Number of Beds are Linked: Listings with more beds tend to have higher prices (correlation of 0.38, the strongest in the matrix, though still only moderate).
Ratings Barely Depend on Price or Beds: Ratings correlate weakly with price (0.21) and almost not at all with offer price (-0.02) or number of beds (0.04), suggesting other factors drive guest satisfaction.
Price and Offer Price Are Only Loosely Related: The correlation between listed price and offer price is 0.32, so the offer price does not track the listed price closely and discounts are not a fixed proportion of it.
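
Reading the strongest relationships off a heatmap gets harder as the number of features grows. The sketch below ranks the variable pairs by absolute correlation instead. It re-enters the matrix printed in Step 7 (rounded to seven decimals) so that it runs on its own; in the notebook you could use cor_matrix directly.

```r
# Correlation matrix copied from the Step 7 output, rounded to seven decimals
vars <- c("price", "offer_price", "rating", "num_beds")
cor_matrix <- matrix(
  c(1.0000000,  0.3191244,  0.2088833, 0.3784321,
    0.3191244,  1.0000000, -0.0248957, 0.0596472,
    0.2088833, -0.0248957,  1.0000000, 0.0449529,
    0.3784321,  0.0596472,  0.0449529, 1.0000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Take each pair once (upper triangle) and sort by absolute correlation
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
pairs <- data.frame(
  var1 = vars[idx[, "row"]],
  var2 = vars[idx[, "col"]],
  r    = cor_matrix[idx]
)
pairs <- pairs[order(-abs(pairs$r)), ]
print(pairs, row.names = FALSE)
```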

Step 9: Prepare Data for Clustering Analysis

Before applying clustering algorithms like K-Means, it's important to prepare the data by selecting only relevant features, handling missing values, and scaling the values to ensure each variable contributes equally. Here’s the code:

# Subset the data to include only relevant features
cluster_data <- data %>% select(price, rating)

# Remove rows with missing values to avoid errors in clustering
cluster_data <- na.omit(cluster_data)

# Scale the data so all features are on the same range (mean = 0, sd = 1)
scaled_data <- scale(cluster_data)
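
Scaling matters here because price ranges over hundreds of dollars while rating stays between roughly 3.7 and 5; without it, K-Means distances would be driven almost entirely by price. The sketch below checks what scale() does on a few illustrative values: each column ends up with mean 0 and standard deviation 1, and the original centers and scales are kept as attributes.

```r
# Toy values on very different ranges, similar in spirit to price and rating
toy <- data.frame(
  price  = c(80, 120, 150, 300, 950),
  rating = c(4.2, 4.8, 4.9, 5.0, 4.6)
)

scaled_toy <- scale(toy)

# After scaling, each column has mean 0 and standard deviation 1
print(round(colMeans(scaled_toy), 10))
print(apply(scaled_toy, 2, sd))

# The original means and standard deviations are stored as attributes
print(attr(scaled_toy, "scaled:center"))
print(attr(scaled_toy, "scaled:scale"))
```

The stored attributes are what you would use to convert cluster centers back into dollars and rating points.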

Step 10: Install Factoextra for Visualizing Clusters

To visualize the results of clustering in a more intuitive way, we use the factoextra package. It provides powerful tools to display clusters and evaluate their quality.

# Install the factoextra package (only once)
install.packages("factoextra")

The output confirms that the package and its dependencies were installed:

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
also installing the dependencies ‘rbibutils’, ‘Deriv’, ‘microbenchmark’, ‘Rdpack’, ‘doBy’, ‘SparseM’, ‘MatrixModels’, ‘minqa’, ‘nloptr’, ‘reformulas’, ‘RcppEigen’, ‘lazyeval’, ‘carData’, ‘Formula’, ‘pbkrtest’, ‘quantreg’, ‘lme4’, ‘crosstalk’, ‘estimability’, ‘mvtnorm’, ‘numDeriv’, ‘viridis’, ‘car’, ‘DT’, ‘ellipse’, ‘emmeans’, ‘flashClust’, ‘leaps’, ‘multcompView’, ‘scatterplot3d’, ‘ggsci’, ‘cowplot’, ‘ggsignif’, ‘gridExtra’, ‘polynom’, ‘rstatix’, ‘plyr’, ‘abind’, ‘dendextend’, ‘FactoMineR’, ‘ggpubr’, ‘reshape2’, ‘ggrepel’

Step 11: Determine the Optimal Number of Clusters Using the Elbow Method

To choose the best number of clusters (k) for K-Means, we use the Elbow Method. It plots the within-cluster sum of squares (WSS) for different values of k. The "elbow point" is where the WSS begins to decrease more slowly, indicating a good choice for k. Here’s the code:

# Load the library for visualization
library(factoextra)

# Use the Elbow Method to find the optimal number of clusters
fviz_nbclust(scaled_data, kmeans, method = "wss") +
  labs(title = "Elbow Method for Optimal k")

The above code will give us a graph using the Elbow method:

The above graph shows that:

WSS decreases as k increases, because adding more clusters reduces the distance between points and their cluster centers.
The "elbow point" is where the curve starts to flatten; here, adding more clusters doesn’t significantly improve the model.
In this plot, the elbow is around k = 3, which is likely the optimal number of clusters.
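
To see the numbers behind the elbow curve, the within-cluster sum of squares can be computed by hand from kmeans(). The sketch below does this on simulated data that stands in for scaled_data (three loose groups, an assumption made so the example runs on its own); total_wss() is a hypothetical helper.

```r
set.seed(123)

# Simulated stand-in for scaled_data: three loose groups in two dimensions
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.4), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.4), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.4), ncol = 2)
)

# Hypothetical helper: total within-cluster sum of squares for a given k
total_wss <- function(x, k) {
  # With one cluster, WSS is the squared distance of every point to the overall mean
  if (k == 1) return(sum(scale(x, scale = FALSE)^2))
  kmeans(x, centers = k, nstart = 25)$tot.withinss
}

k_values <- 1:6
wss <- vapply(k_values, function(k) total_wss(toy, k), numeric(1))

# The elbow is where the drop from one k to the next becomes small
print(data.frame(k = k_values, wss = round(wss, 2)))
```

On this simulated data the large drops stop after k = 3, which is the pattern to look for in the plot for the real listings.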

Step 12: Apply K-Means Clustering to Airbnb Listings

Now that we’ve determined the optimal number of clusters (k = 3), we run the K-Means algorithm on the scaled data. Each listing is grouped into a cluster based on price and rating. This helps identify patterns like premium, budget, or mid-range listings. Here’s the code:

# Apply K-Means clustering with 3 centers
set.seed(123)
kmeans_result <- kmeans(scaled_data, centers = 3, nstart = 25)

# Attach cluster results to the original dataset
cluster_data$cluster <- as.factor(kmeans_result$cluster)

# View how many listings fall into each cluster
table(cluster_data$cluster)

The above code gives the output:

  1   2   3 
 97 705 128 

The above output is a frequency count of the 930 listings that have both a price and a rating, grouped into 3 clusters by the K-Means algorithm. The counts alone do not say what each cluster represents; that comes from the cluster centers and the scatter plot in Step 13:

Cluster 1 (97 listings): The smallest group, about 10% of listings. In the plot it corresponds to high-price, high-rating listings.
Cluster 2 (705 listings): The largest group, about 76% of listings, covering low to mid-price listings with top ratings.
Cluster 3 (128 listings): About 14% of listings, with lower prices and relatively lower ratings.
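
Cluster labels become much easier to interpret once you look at the average price and rating in each group, in original units rather than scaled ones. The following is a minimal sketch of that summary on simulated listings (the three groups and their values are assumptions for illustration); in the notebook the same aggregate() call could be applied to cluster_data.

```r
set.seed(123)

# Simulated listings: budget/high-rated, premium/high-rated, budget/lower-rated
toy <- data.frame(
  price  = c(rnorm(30, 120, 20), rnorm(30, 450, 50), rnorm(30, 130, 20)),
  rating = pmin(5, c(rnorm(30, 4.9, 0.03), rnorm(30, 4.9, 0.03), rnorm(30, 4.4, 0.1)))
)

# Cluster on scaled values, as in the steps above
km <- kmeans(scale(toy), centers = 3, nstart = 25)
toy$cluster <- factor(km$cluster)

# Average price and rating per cluster, in original units
centers <- aggregate(cbind(price, rating) ~ cluster, data = toy, FUN = mean)
centers$n <- as.vector(table(toy$cluster))
print(centers)
```

Cluster numbers are arbitrary and can change between runs or seeds, so it is the center values, not the labels, that identify a segment.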

Step 13: Visualize the Airbnb Clusters Using Price and Rating

Now that we’ve grouped the Airbnb listings into 3 clusters using K-means based on their price and rating, it’s time to visualize how these clusters look. We'll use a scatter plot where each point is a listing, colored by its assigned cluster. This will help us understand the patterns and separation between groups.

# Plot from cluster_data, which holds the 'cluster' column added in Step 12
ggplot(cluster_data, aes(x = price, y = rating, color = cluster)) +
  geom_point(size = 2, alpha = 0.7) +
  labs(title = "K-means Clustering of Listings", x = "Price", y = "Rating") +
  theme_minimal()

The above code gives us a graph that shows the K-Means clustering of listings.

The above plot shows that:

Cluster 1 (red) includes high-price, high-rating listings.
Cluster 2 (green) contains low to mid-price listings with top ratings.
Cluster 3 (blue) represents lower-price listings with relatively lower ratings.

Step 14: Visualize Feature Correlation with ggcorrplot

To better understand how numerical variables in our Airbnb dataset relate to each other, we can use a correlation heatmap. This helps reveal which features move together (positive correlation) or move in opposite directions (negative correlation). We'll use the ggcorrplot package to create a cleaner and more customizable correlation plot.

# Step 14: Install and load ggcorrplot
install.packages("ggcorrplot")  # Run only once
library(ggcorrplot)

Step 15: Plotting the Correlation Heatmap Using ggcorrplot

Now that we’ve prepared our correlation matrix (cor_matrix), we can generate a clean and intuitive heatmap using the ggcorrplot() function. This visualization allows us to quickly see which numerical features are positively or negatively correlated. Here’s the code:

# Create the correlation heatmap using ggcorrplot
ggcorrplot(cor_matrix, 
           lab = TRUE,                            # Show correlation coefficients
           title = "Correlation Heatmap of Numerical Features", 
           lab_size = 3,                          # Size of text labels
           colors = c("blue", "white", "red"))    # Color gradient from negative to positive

The above code gives us the correlation heatmap of numerical features:

The above plot shows that:

The heatmap shows how strongly numerical features (like price, rating, beds) are linked; red means a positive link, blue means a negative link, and white means little to no connection. The deeper the color, the stronger the link.
Each cell shows a number: closer to 1 means a strong positive relationship, closer to -1 a strong negative one, and closer to 0 a weak or no relationship.
For example, “price” and “num_beds” have the strongest positive correlation (0.38, the deepest red off the diagonal), while “offer_price” and “rating” have almost no correlation (-0.02, near white).

Conclusion

In this Airbnb Listing Analysis project, we used R in Google Colab to explore and cluster property listings based on key features such as price and rating. After cleaning and preprocessing the data, we visualized the correlations between price, offer price, rating, and number of beds using corrplot and ggcorrplot.

We then applied K-Means clustering to segment the listings into three distinct groups. The resulting clusters revealed patterns in pricing and review ratings that can help hosts better position their listings.

Colab Link:
https://colab.research.google.com/drive/1JUDKOTc5ZYf0kqAtff6yEWIBkqfPXC0i#scrollTo=7iKlZaBkFJuP

Frequently Asked Questions (FAQs)

1. How do you determine the ideal number of clusters in an Airbnb listing dataset?

To determine the optimal number of clusters, techniques like the Elbow Method and Silhouette Analysis are commonly used. These methods help evaluate how well the data fits different cluster counts, ensuring more accurate segmentation in K-Means clustering.
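
As a complement to the Elbow Method used in the project, Silhouette Analysis can be sketched as follows. The example assumes the cluster package is available (it ships with most R installations as a recommended package) and uses simulated data in place of scaled_data so that it runs on its own.

```r
library(cluster)  # provides silhouette()

set.seed(123)

# Simulated stand-in for scaled_data: three loose groups in two dimensions
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.4), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.4), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.4), ncol = 2)
)
dist_matrix <- dist(toy)

# Average silhouette width for k = 2..6 (higher means better-separated clusters)
k_values <- 2:6
avg_sil <- vapply(k_values, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, dist_matrix)
  mean(sil[, "sil_width"])
}, numeric(1))
names(avg_sil) <- k_values
print(round(avg_sil, 3))

# Pick the k with the highest average silhouette width
best_k <- as.integer(names(avg_sil)[which.max(avg_sil)])
print(best_k)
```

With factoextra loaded, fviz_nbclust(scaled_data, kmeans, method = "silhouette") produces the corresponding plot.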

2. Why is clustering useful for Airbnb listing analysis?

Clustering groups similar listings based on features like price and rating, which helps in market segmentation, dynamic pricing strategies, and identifying high-performing property categories. It’s especially useful for hosts, analysts, and investors aiming to make data-driven decisions.

3. Can I perform Airbnb listing clustering without labeled data?

Yes, clustering is an unsupervised machine learning technique, which means it doesn’t require labeled outcomes. The model automatically finds patterns and groups similar listings without needing predefined categories.

4. What preprocessing steps are necessary before clustering Airbnb data?

Before clustering, it's important to:

Handle missing values
Normalize or scale numeric features
Remove outliers
Select relevant variables (e.g., price, rating, reviews)
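
The project itself does not remove outliers, so here is a hedged sketch of one common approach, the 1.5 × IQR rule, in base R. The price vector is illustrative: 986 is the maximum price from the summary in Step 6, and the other values are made up.

```r
# Illustrative prices; 986 is the maximum from the Step 6 summary, the rest are made up
prices <- c(82, 95, 110, 134, 150, 180, 220, 986)

# Quartiles and interquartile range
q <- unname(quantile(prices, c(0.25, 0.75), na.rm = TRUE))
iqr <- q[2] - q[1]

# Keep values within 1.5 * IQR of the quartiles
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
keep <- prices >= lower & prices <= upper

print(c(lower = lower, upper = upper))
print(prices[keep])
```

Whether to drop such listings is a judgment call: very expensive properties may be a genuine segment rather than noise, and K-Means is sensitive to them either way.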

5. What other clustering algorithms can be tried apart from K-Means in R?

While K-Means is popular, you can experiment with:

DBSCAN for density-based clustering
Hierarchical Clustering for tree-based grouping
K-Medoids (PAM) for robust clustering
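
Two of these alternatives can be tried with packages that normally come with R. The sketch below runs hierarchical clustering (hclust() from stats) and K-Medoids (pam() from the cluster package) on simulated data standing in for scaled_data. DBSCAN is left out because it needs an additional package that this project does not install.

```r
library(cluster)  # provides pam()

set.seed(123)

# Simulated stand-in for scaled_data: three loose groups in two dimensions
toy <- scale(rbind(
  matrix(rnorm(60, mean = 0, sd = 0.4), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.4), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.4), ncol = 2)
))

# Hierarchical clustering with Ward linkage, cut into 3 groups
hc <- hclust(dist(toy), method = "ward.D2")
hc_clusters <- cutree(hc, k = 3)

# K-Medoids (PAM) with 3 clusters; medoids are actual data points
pam_fit <- pam(toy, k = 3)

# Cross-tabulate the two assignments to see how far they agree
print(table(hierarchical = hc_clusters, pam = pam_fit$clustering))
```

Because PAM uses real observations as cluster centers, it tends to be less affected by extreme prices than K-Means.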

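6. What are some other R projects that I can work on?

You can also try:

Gender Recognition Using Voice
Titanic Survival Prediction
World Happiness Report Analysis
Wine Quality Prediction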
