Airbnb Listing Analysis Project Using R
By Rohit Sharma
Updated on Aug 11, 2025 | 14 min read | 1.49K+ views
This Airbnb Listing Analysis Project Using R explores key insights from Airbnb data, focusing on price, ratings, and property features.
The project involves data cleaning, exploratory analysis, correlation analysis, and clustering to segment listings based on pricing and customer ratings.
Using powerful R packages like dplyr, ggplot2, and factoextra, it identifies patterns that help understand market dynamics. A correlation heatmap is also used to visualize relationships between numerical features.
Before starting the Airbnb Listing Analysis Project Using R, it’s important to know the estimated time commitment, technical complexity, and what tools or skills are involved. The table below gives all the important information you need.
Aspect | Details |
Estimated Duration | 2–3 hours |
Difficulty Level | Beginner to Intermediate |
Tools Required | R, Google Colab (with R kernel), RStudio (optional) |
R Packages Used | tidyverse (dplyr, ggplot2, readr), lubridate, corrplot, ggcorrplot, factoextra, stats |
Key Skills Needed | Data Cleaning, EDA, Data Visualization, Clustering, Correlation Analysis |
Below are the individual steps that will be used to build this Airbnb Listing Analysis Project Using R. Each step includes the code with brief comments added for clarity.
To work with R in Google Colab, you first need to switch the notebook's runtime from Python to R. In the Colab menu, open Runtime > Change runtime type, choose R as the runtime type, and save the setting before running any code.
In this step, we prepare the environment by installing and loading the necessary R packages. These libraries will help with data manipulation, working with dates, and visualizing correlations later in this Airbnb listing analysis project. Here’s the code:
# Step 1: Install required packages (only once)
install.packages("tidyverse") # For data manipulation and plots
install.packages("lubridate") # For date parsing
install.packages("corrplot") # For correlation heatmap
# Step 1.1: Load the libraries
library(tidyverse) # Load tidyverse for dplyr, ggplot2, etc.
library(lubridate) # Load lubridate to handle date formats
library(corrplot) # Load corrplot for correlation visualization
The output confirms that the libraries and packages are installed and loaded correctly:
Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
corrplot 0.95 loaded
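Because the install step only needs to run once, it can help to check which packages are already present before installing anything. The following is a minimal sketch in base R; the helper name missing_packages is hypothetical and not part of the original project.

```r
# Hypothetical helper: return the packages from 'pkgs' that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used in Step 1 of this project
to_install <- missing_packages(c("tidyverse", "lubridate", "corrplot"))
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install call is left commented out so the check itself has no side effects; in a fresh Colab session you would typically uncomment it.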
In this section, we import the Airbnb dataset into our R environment and examine its basic structure. This gives us an idea of what variables are available and how the data is formatted. Here’s the code:
# Step 2: Read the Airbnb data from your upload path
data <- read.csv("airnb.csv", stringsAsFactors = FALSE) # Load CSV file without converting strings to factors
# Step 2.1: Preview the structure
str(data) # Check the structure and types of variables
head(data) # Display the first few rows of the dataset
The output of the above step gives us a preview of the dataset we’re working with.
'data.frame': 953 obs. of 7 variables:
 $ Title                 : chr "Chalet in Skykomish, Washington, US" "Cabin in Hancock, New York, US" "Cabin in West Farmington, Ohio, US" "Home in Blue Ridge, Georgia, US" ...
 $ Detail                : chr "Sky Haus - A-Frame Cabin" "The Catskill A-Frame - Mid-Century Modern Cabin" "The Triangle: A-Frame Cabin for your city retreat" "*Summer Sizzle* 5 Min to Blue Ridge* Pets* Hot tub" ...
 $ Date                  : chr "Jun 11 - 16" "Jun 6 - 11" "Jul 9 - 14" "Jun 11 - 16" ...
 $ Price.in.dollar.      : chr "306.00" "485.00" "119.00" "192.00" ...
 $ Offer.price.in.dollar.: chr "229.00" "170.00" "522.00" "348.00" ...
 $ Review.and.rating     : chr "4.85 (531)" "4.77 (146)" "4.91 (515)" "4.94 (88)" ...
 $ Number.of.bed         : chr "4 beds" "4 beds" "4 beds" "5 beds" ...
Row | Title | Detail | Date | Price.in.dollar. | Offer.price.in.dollar. | Review.and.rating | Number.of.bed |
Type | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
1 | Chalet in Skykomish, Washington, US | Sky Haus - A-Frame Cabin | Jun 11 - 16 | 306.00 | 229.00 | 4.85 (531) | 4 beds |
2 | Cabin in Hancock, New York, US | The Catskill A-Frame - Mid-Century Modern Cabin | Jun 6 - 11 | 485.00 | 170.00 | 4.77 (146) | 4 beds |
3 | Cabin in West Farmington, Ohio, US | The Triangle: A-Frame Cabin for your city retreat | Jul 9 - 14 | 119.00 | 522.00 | 4.91 (515) | 4 beds |
4 | Home in Blue Ridge, Georgia, US | *Summer Sizzle* 5 Min to Blue Ridge* Pets* Hot tub | Jun 11 - 16 | 192.00 | 348.00 | 4.94 (88) | 5 beds |
5 | Treehouse in Grandview, Texas, US | Luxury Treehouse Couples Getaway w/ Peaceful Views | Jun 4 - 9 | 232.00 | 196.00 | 4.99 (222) | 1 queen bed |
6 | Tiny home in Puerto Escondido, Mexico | Casa Tiny near Casa Wabi | Jun 21 - 26 | 261.00 | 148.00 | 4.84 (555) | 1 double bed |
This step simplifies the dataset by renaming the column headers to shorter, more accessible names. Clean column names make it easier to work with data in future steps. Here’s the code:
# Step 3: Rename columns for easier access
colnames(data) <- c("title", "detail", "date", "price", "offer_price", "review_rating", "num_beds")
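Assigning colnames() positionally works as long as the CSV columns arrive in exactly the expected order. As a hedged alternative, the sketch below renames by matching on the original names instead; the toy data frame and the lookup vector are illustrative assumptions, not part of the original project.

```r
# Toy frame whose columns arrive in a different order than expected
toy <- data.frame(
  Number.of.bed    = "4 beds",
  Title            = "Chalet in Skykomish, Washington, US",
  Price.in.dollar. = "306.00",
  stringsAsFactors = FALSE
)

# Map each original name to its short replacement
lookup <- c(Title = "title",
            Price.in.dollar. = "price",
            Number.of.bed = "num_beds")

# Rename by matching on the old names, so column order does not matter
idx <- match(names(lookup), names(toy))
names(toy)[idx] <- unname(lookup)

print(names(toy))
```

Matching by name means a reordered export still gets the right labels, whereas the positional version would silently mislabel columns.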
This step transforms messy textual columns into structured numeric formats. It handles currency symbols, extracts ratings and reviews safely from strings, and converts bed counts for easier analysis later. Here’s the code to clean the data:
# Step 4: Clean and prepare the dataset safely
# 1. Remove $ symbols and convert price columns to numeric
data$price <- as.numeric(gsub("\\$", "", data$price))
data$offer_price <- as.numeric(gsub("\\$", "", data$offer_price))
# 2. Safely extract rating and reviews using string matching
# Create new 'rating' column: extract only values like "4.85" at the start
data$rating <- suppressWarnings(as.numeric(sub(" .*", "", data$review_rating)))
# Set rating to NA if it couldn't be converted (like "New")
data$rating[!grepl("^[0-9\\.]+", data$review_rating)] <- NA
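# Create 'reviews' column: extract digits inside brackets like "(123)"
# NOTE: the trailing [[1]] keeps only the first row's match ("(531)"), and R recycles
# that single value, so every listing ends up with reviews = 531 (visible in the output below)
data$reviews <- as.numeric(gsub("[^0-9]", "", regmatches(data$review_rating, gregexpr("\\([0-9]+\\)", data$review_rating))[[1]]))
data$reviews[is.na(data$reviews)] <- NA # No-op: values that are already NA simply stay NA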
# 3. Convert number of beds to numeric (remove text like "queen bed")
data$num_beds <- as.numeric(gsub("[^0-9]", "", data$num_beds))
# 4. Check the cleaned data
head(data)
The output of the above code is:
Row | title | detail | date | price | offer_price | review_rating | num_beds | rating | reviews |
Type | <chr> | <chr> | <chr> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> | <dbl> |
1 | Chalet in Skykomish, Washington, US | Sky Haus - A-Frame Cabin | Jun 11 - 16 | 306 | 229 | 4.85 (531) | 4 | 4.85 | 531 |
2 | Cabin in Hancock, New York, US | The Catskill A-Frame - Mid-Century Modern Cabin | Jun 6 - 11 | 485 | 170 | 4.77 (146) | 4 | 4.77 | 531 |
3 | Cabin in West Farmington, Ohio, US | The Triangle: A-Frame Cabin for your city retreat | Jul 9 - 14 | 119 | 522 | 4.91 (515) | 4 | 4.91 | 531 |
4 | Home in Blue Ridge, Georgia, US | *Summer Sizzle* 5 Min to Blue Ridge* Pets* Hot tub | Jun 11 - 16 | 192 | 348 | 4.94 (88) | 5 | 4.94 | 531 |
5 | Treehouse in Grandview, Texas, US | Luxury Treehouse Couples Getaway w/ Peaceful Views | Jun 4 - 9 | 232 | 196 | 4.99 (222) | 1 | 4.99 | 531 |
6 | Tiny home in Puerto Escondido, Mexico | Casa Tiny near Casa Wabi | Jun 21 - 26 | 261 | 148 | 4.84 (555) | 1 | 4.84 | 531 |
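In the output above, reviews is 531 for every listing even though the review_rating strings show different counts, because the Step 4 extraction indexes only the first match with [[1]]. A minimal sketch of a per-row extraction follows, using base R on a toy vector; the "New" entry is a hypothetical example of a listing with no review count.

```r
# Toy input mirroring the review_rating strings shown above;
# "New" is a hypothetical value for a listing without reviews
review_rating <- c("4.85 (531)", "4.77 (146)", "4.91 (515)", "New")

# Flag the rows that actually contain a "(digits)" group
has_count <- grepl("\\([0-9]+\\)", review_rating)

# Start with NA everywhere, then fill in only the rows that have a count
reviews <- rep(NA_real_, length(review_rating))
reviews[has_count] <- as.numeric(
  sub(".*\\(([0-9]+)\\).*", "\\1", review_rating[has_count])
)

print(reviews)
```

Applied to the real column, the same pattern would replace review_rating with data$review_rating and store the result in data$reviews. Note that the summary and correlation outputs shown in this article were produced with the constant value, so they would change.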
This step helps you inspect the dataset by viewing summary statistics and checking for missing values. If desired, it also creates a cleaner version of the data by dropping rows with any NA values. Here’s the code:
# View summary statistics of the cleaned dataset
summary(data)
# Count how many missing values are in each column
colSums(is.na(data))
# Optional: Create a clean version of the data without NAs
clean_data <- data %>% drop_na()
The output of the above code is:
Character columns (title, detail, date, review_rating): Length 953, Class character, Mode character
Statistic | price | offer_price | num_beds | rating | reviews |
Min. | 16.00 | 16.0 | 1.000 | 3.670 | 531 |
1st Qu. | 82.75 | 63.0 | 1.000 | 4.820 | 531 |
Median | 134.50 | 118.0 | 2.000 | 4.890 | 531 |
Mean | 170.00 | 150.9 | 2.183 | 4.863 | 531 |
3rd Qu. | 220.25 | 182.0 | 3.000 | 4.960 | 531 |
Max. | 986.00 | 819.0 | 22.000 | 5.000 | 531 |
NA's | 1 | 788 | - | 22 | - |
Column | Missing values |
title | 0 |
detail | 0 |
date | 0 |
price | 1 |
offer_price | 788 |
review_rating | 0 |
num_beds | 0 |
rating | 22 |
reviews | 0 |
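Since offer_price is missing in 788 of the 953 rows, dropping every row that has any NA discards most of the dataset. The sketch below, on a small made-up data frame, contrasts that with dropping rows only where the columns needed for a given analysis are missing; the toy values and the variable names are illustrative assumptions.

```r
# Toy data with the same kind of gaps as the Airbnb columns
toy <- data.frame(
  price       = c(306, 485, NA, 192),
  offer_price = c(229, NA, NA, NA),
  rating      = c(4.85, 4.77, 4.91, NA)
)

# Dropping rows with ANY missing value keeps only fully populated rows
all_complete <- toy[complete.cases(toy), ]

# Dropping rows only where the analysis columns are missing keeps more data
needed <- c("price", "rating")
analysis_complete <- toy[complete.cases(toy[, needed]), ]

print(nrow(all_complete))
print(nrow(analysis_complete))
```

With tidyr loaded, the column-restricted version can also be written as drop_na(price, rating), which is the same idea applied to the project's data frame.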
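This step removes the reviews column, which holds the same value (531) for every listing because of the extraction in Step 4 and therefore has zero variance, and then recalculates and displays the correlation matrix for the remaining numeric variables. The numeric_data object is not defined in any earlier step shown here, so the code below first builds it from clean_data; that first line is an assumption based on the columns in the printed matrix. Here’s the code:
# Assumed setup (not in the original listing): numeric columns from the NA-free data
numeric_data <- clean_data %>% select(price, offer_price, rating, num_beds, reviews)
# Drop 'reviews' since it is constant (zero variance), which leaves its correlations undefined
numeric_data <- numeric_data %>% select(-reviews)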
# Recalculate correlation matrix
cor_matrix <- cor(numeric_data)
# Print correlation matrix
print(cor_matrix)
The output for the above step is:
 | price | offer_price | rating | num_beds |
price | 1.0000000 | 0.31912435 | 0.20888333 | 0.37843215 |
offer_price | 0.3191244 | 1.00000000 | -0.02489574 | 0.05964721 |
rating | 0.2088833 | -0.02489574 | 1.00000000 | 0.04495290 |
num_beds | 0.3784321 | 0.05964721 | 0.04495290 | 1.00000000 |
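cor() returns NA for any pair of columns that contains a missing value unless it is told how to handle them. As a hedged illustration of an alternative to removing incomplete rows up front, the sketch below uses the use = "pairwise.complete.obs" option of cor() on made-up data, so each coefficient is computed from the rows available for that pair.

```r
# Toy numeric data with one missing rating
toy <- data.frame(
  price    = c(100, 150, 200, 250, 300),
  num_beds = c(1, 2, 2, 3, 4),
  rating   = c(4.5, NA, 4.8, 4.7, 4.9)
)

# Default behaviour: pairs involving 'rating' come back as NA
cor_default <- cor(toy)

# Pairwise deletion: each pair uses only the rows where both values exist
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

print(round(cor_pairwise, 2))
```

Pairwise deletion keeps more data per coefficient, but each entry may then be based on a different subset of rows, which is worth keeping in mind when comparing them.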
In this step, we use the corrplot package to create a heatmap that shows how strongly different numeric Airbnb features relate to each other. It helps you spot relationships like price vs rating or number of beds. The code for this step is:
# Plot the correlation heatmap
corrplot(cor_matrix,
method = "color", # Use colored squares to show correlation strength
addCoef.col = "black", # Display numeric correlation values inside boxes
tl.col = "black", # Set text labels (feature names) to black
number.cex = 0.8, # Adjust size of the numbers shown
title = "Correlation Heatmap of Airbnb Features", # Title of the plot
mar = c(0,0,1,0)) # Set margin to make space for the title
The above code gives us a graph of a correlation heatmap of Airbnb features.
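The above plot shows that price has its strongest positive correlation with num_beds (0.38), followed by offer_price (0.32) and rating (0.21). The remaining pairs are close to zero: offer_price vs rating (-0.02), offer_price vs num_beds (0.06), and rating vs num_beds (0.04).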
Before applying clustering algorithms like K-Means, it's important to prepare the data by selecting only relevant features, handling missing values, and scaling the values to ensure each variable contributes equally. Here’s the code:
# Subset the data to include only relevant features
cluster_data <- data %>% select(price, rating)
# Remove rows with missing values to avoid errors in clustering
cluster_data <- na.omit(cluster_data)
# Scale the data so all features are on the same range (mean = 0, sd = 1)
scaled_data <- scale(cluster_data)
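To see what scale() does, the sketch below standardizes a small made-up price/rating table and checks that each column ends up with mean 0 and standard deviation 1. The toy values are assumptions chosen only to show two very different ranges.

```r
# Toy features on very different ranges, like price and rating
cluster_toy <- data.frame(
  price  = c(50, 100, 150, 400),
  rating = c(4.2, 4.8, 4.9, 5.0)
)

# Standardize: subtract each column's mean, divide by its standard deviation
scaled_toy <- scale(cluster_toy)

# After scaling, column means are (numerically) 0 and standard deviations are 1
print(round(colMeans(scaled_toy), 10))
print(apply(scaled_toy, 2, sd))

# The original means and standard deviations are kept as attributes
print(attr(scaled_toy, "scaled:center"))
print(attr(scaled_toy, "scaled:scale"))
```

Without this step, the distance calculations in K-Means would be dominated by price, whose raw values span hundreds of dollars while ratings span about one point.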
To visualize the results of clustering in a more intuitive way, we use the factoextra package. It provides powerful tools to display clusters and evaluate their quality.
# Install the factoextra package (only once)
install.packages("factoextra")
Running the above code prints the installation log, including the dependencies that are installed along with factoextra:
Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
also installing the dependencies ‘rbibutils’, ‘Deriv’, ‘microbenchmark’, ‘Rdpack’, ‘doBy’, ‘SparseM’, ‘MatrixModels’, ‘minqa’, ‘nloptr’, ‘reformulas’, ‘RcppEigen’, ‘lazyeval’, ‘carData’, ‘Formula’, ‘pbkrtest’, ‘quantreg’, ‘lme4’, ‘crosstalk’, ‘estimability’, ‘mvtnorm’, ‘numDeriv’, ‘viridis’, ‘car’, ‘DT’, ‘ellipse’, ‘emmeans’, ‘flashClust’, ‘leaps’, ‘multcompView’, ‘scatterplot3d’, ‘ggsci’, ‘cowplot’, ‘ggsignif’, ‘gridExtra’, ‘polynom’, ‘rstatix’, ‘plyr’, ‘abind’, ‘dendextend’, ‘FactoMineR’, ‘ggpubr’, ‘reshape2’, ‘ggrepel’
To choose the best number of clusters (k) for K-Means, we use the Elbow Method. It plots the within-cluster sum of squares (WSS) for different values of k. The "elbow point" is where the WSS begins to decrease more slowly, indicating a good choice for k. Here’s the code:
# Load the library for visualization
library(factoextra)
# Use the Elbow Method to find the optimal number of clusters
fviz_nbclust(scaled_data, kmeans, method = "wss") +
labs(title = "Elbow Method for Optimal k")
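The above code plots the total within-cluster sum of squares (WSS) against the number of clusters k.
The above graph shows that the drop in WSS slows after k = 3, so k = 3 is the value used for K-Means in the next step.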
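The quantity behind the elbow plot can also be computed directly with base R, which makes it easier to see what is being plotted. This is a minimal sketch on synthetic data with three well-separated groups (an assumption made for illustration, not the Airbnb data); it reads the total within-cluster sum of squares from the tot.withinss element of each kmeans() result.

```r
set.seed(123)

# Synthetic 2-D data: three tight groups centered near 0, 3, and 6
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
ks <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# The elbow is where adding another cluster stops reducing WSS by much
print(data.frame(k = ks, wss = round(wss, 2)))
```

On data like this, the WSS falls sharply up to the true number of groups and only marginally afterwards. On real listings the bend is usually less pronounced, so the choice of k remains a judgment call.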
Now that we’ve determined the optimal number of clusters (k = 3), we run the K-Means algorithm on the scaled data. Each listing is grouped into a cluster based on price and rating. This helps identify patterns like premium, budget, or mid-range listings. Here’s the code:
# Apply K-Means clustering with 3 centers
set.seed(123)
kmeans_result <- kmeans(scaled_data, centers = 3, nstart = 25)
# Attach cluster results to the original dataset
cluster_data$cluster <- as.factor(kmeans_result$cluster)
# View how many listings fall into each cluster
table(cluster_data$cluster)
The above code gives the output:
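  1   2   3
 97 705 128
The above output is a frequency count of the 930 listings that have both a price and a rating, grouped into 3 clusters by the K-Means algorithm: cluster 1 contains 97 listings, cluster 2 contains 705, and cluster 3 contains 128. The cluster numbers are arbitrary labels and do not imply any ranking.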
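Cluster sizes alone do not say which group is budget, mid-range, or premium. One way to characterize the clusters is to average the original (unscaled) features within each one. The sketch below does this on a small made-up table; the toy values and the profile variable are illustrative assumptions.

```r
set.seed(123)

# Toy listings with three rough price tiers
toy <- data.frame(
  price  = c(60, 70, 80, 150, 160, 170, 400, 420, 450),
  rating = c(4.6, 4.7, 4.5, 4.9, 4.8, 4.9, 4.95, 5.0, 4.9)
)

# Cluster on scaled features, as in the main analysis
km <- kmeans(scale(toy), centers = 3, nstart = 25)
toy$cluster <- factor(km$cluster)

# Average price and rating per cluster, on the original scale
profile <- aggregate(cbind(price, rating) ~ cluster, data = toy, FUN = mean)
print(profile)
```

On the project's data, the equivalent call would use cluster_data in place of toy. Comparing the per-cluster means is what supports labels such as "budget" or "premium".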
Now that we’ve grouped the Airbnb listings into 3 clusters using K-means based on their price and rating, it’s time to visualize how these clusters look. We'll use a scatter plot where each point is a listing, colored by its assigned cluster. This will help us understand the patterns and separation between groups.
# Plot price against rating, colored by the 'cluster' column stored in cluster_data
ggplot(cluster_data, aes(x = price, y = rating, color = cluster)) +
geom_point(size = 2, alpha = 0.7) +
labs(title = "K-means Clustering of Listings", x = "Price", y = "Rating") +
theme_minimal()
The above code gives us a scatter plot of price against rating in which each listing is colored by its K-Means cluster.
To better understand how numerical variables in our Airbnb dataset relate to each other, we can use a correlation heatmap. This helps reveal which features move together (positive correlation) or move in opposite directions (negative correlation). We'll use the ggcorrplot package to create a cleaner and more customizable correlation plot.
# Step 14: Install and load ggcorrplot
install.packages("ggcorrplot") # Run only once
library(ggcorrplot)
Now that we’ve prepared our correlation matrix (cor_matrix), we can generate a clean and intuitive heatmap using the ggcorrplot() function. This visualization allows us to quickly see which numerical features are positively or negatively correlated. Here’s the code:
# Create the correlation heatmap using ggcorrplot
ggcorrplot(cor_matrix,
lab = TRUE, # Show correlation coefficients
title = "Correlation Heatmap of Numerical Features",
lab_size = 3, # Size of text labels
colors = c("blue", "white", "red")) # Color gradient from negative to positive
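The above code gives us the correlation heatmap of numerical features.
The above plot shows the same coefficients as the corrplot heatmap, with red tiles for positive and blue tiles for negative correlations: price vs num_beds is the strongest pair (0.38), and offer_price vs rating is the only negative one (-0.02).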
In this Airbnb Listing Analysis project, we used R in Google Colab to explore and cluster property listings based on key features such as price and rating. After cleaning and preprocessing the data, we visualized the correlations between the numeric features using corrplot and ggcorrplot.
We then applied K-Means clustering to segment the listings into three distinct groups. The resulting clusters revealed patterns in pricing and review ratings that can help hosts better position their listings.
Colab Link:
https://colab.research.google.com/drive/1JUDKOTc5ZYf0kqAtff6yEWIBkqfPXC0i#scrollTo=7iKlZaBkFJuP
834 articles published
Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...