Airbnb Listing Analysis Project Using R

By Rohit Sharma

Updated on Aug 11, 2025 | 14 min read | 1.49K+ views


This Airbnb Listing Analysis Project Using R explores key insights from Airbnb data, focusing on price, ratings, and property features. 

The project involves data cleaning, exploratory analysis, correlation analysis, and clustering to segment listings based on pricing and customer ratings.

Using powerful R packages like dplyr, ggplot2, and factoextra, it identifies patterns that help understand market dynamics. A correlation heatmap is also used to visualize relationships between numerical features.


How Long Will It Take and What Skills Do You Need for This Project

Before starting the Airbnb Listing Analysis Project Using R, it’s important to know the estimated time commitment, technical complexity, and what tools or skills are involved. The table below gives all the important information you need.

Aspect              Details
Estimated Duration  2–3 hours
Difficulty Level    Beginner to Intermediate
Tools Required      R, Google Colab (with R kernel), RStudio (optional)
R Packages Used     dplyr, ggplot2, factoextra, ggcorrplot, cluster, readr, stats
Key Skills Needed   Data Cleaning, EDA, Data Visualization, Clustering, Correlation Analysis


Complete Breakdown of the Airbnb Listing Analysis Project in R

Below are the individual steps that will be used to build this Airbnb Listing Analysis Project Using R. Each step includes the code with brief comments added for clarity.

Step 1: Initial Setup: Enable R Support in Google Colab

To work with R in Google Colab, you first need to switch the default programming language from Python to R. Follow these steps to configure your notebook:

  1. Open Google Colab and create a new notebook.
  2. In the top menu bar, go to Runtime.
  3. Select Change runtime type.
  4. Under the Language dropdown, choose R.
  5. Click Save to apply the changes.

Step 2: Install and Load Required R Packages

In this step, we prepare the environment by installing and loading the necessary R packages. These libraries will help with data manipulation, working with dates, and visualizing correlations later in this Airbnb listing analysis project. Here’s the code:

# Step 2: Install required packages (only once)
install.packages("tidyverse")   # For data manipulation and plots
install.packages("lubridate")   # For date parsing
install.packages("corrplot")    # For correlation heatmap

# Step 2.1: Load the libraries
library(tidyverse)   # Load tidyverse for dplyr, ggplot2, etc.
library(lubridate)   # Load lubridate to handle date formats
library(corrplot)    # Load corrplot for correlation visualization

The output confirms that the libraries and packages are installed and loaded correctly:

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
 dplyr     1.1.4      readr    2.1.5
 forcats   1.0.0      stringr  1.5.1
 ggplot2   3.5.2      tibble   3.3.0
 lubridate 1.9.4      tidyr    1.3.1
 purrr     1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter() masks stats::filter()
 dplyr::lag()    masks stats::lag()
Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

corrplot 0.95 loaded
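
Since the installation only needs to happen once, it can help to check which packages are already available before installing. The following is a minimal sketch, not part of the original notebook; the variable names `needed`, `is_installed`, and `to_install` are illustrative, and the install call is left commented out so the snippet has no side effects:

```r
# Packages this tutorial loads in Step 2
needed <- c("tidyverse", "lubridate", "corrplot")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Keep only the packages that are not yet available
to_install <- needed[!is_installed]

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)   # uncomment to install the missing ones
} else {
  message("All required packages are already installed.")
}
```

In a fresh Colab session the site library is usually reset, so the install step may still be needed each time the runtime restarts.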


Step 3: Load and Preview the Airbnb Listing Dataset

In this section, we import the Airbnb dataset into our R environment and examine its basic structure. This gives us an idea of what variables are available and how the data is formatted. Here’s the code:

# Step 3: Read the Airbnb data from your upload path
data <- read.csv("airnb.csv", stringsAsFactors = FALSE)   # Load CSV file without converting strings to factors

# Step 3.1: Preview the structure
str(data)   # Check the structure and types of variables
head(data)  # Display the first few rows of the dataset

The output of the above step gives us a preview of the dataset we’re working with.

'data.frame': 953 obs. of  7 variables:
 $ Title                 : chr  "Chalet in Skykomish, Washington, US" "Cabin in Hancock, New York, US" "Cabin in West Farmington, Ohio, US" "Home in Blue Ridge, Georgia, US" ...
 $ Detail                : chr  "Sky Haus - A-Frame Cabin" "The Catskill A-Frame - Mid-Century Modern Cabin" "The Triangle: A-Frame Cabin for your city retreat" "*Summer Sizzle* 5 Min to Blue Ridge* Pets* Hot tub" ...
 $ Date                  : chr  "Jun 11 - 16" "Jun 6 - 11" "Jul 9 - 14" "Jun 11 - 16" ...
 $ Price.in.dollar.      : chr  "306.00" "485.00" "119.00" "192.00" ...
 $ Offer.price.in.dollar.: chr  "229.00" "170.00" "522.00" "348.00" ...
 $ Review.and.rating     : chr  "4.85 (531)" "4.77 (146)" "4.91 (515)" "4.94 (88)" ...
 $ Number.of.bed         : chr  "4 beds" "4 beds" "4 beds" "5 beds" ...

 

  | Title | Detail | Date | Price.in.dollar. | Offer.price.in.dollar. | Review.and.rating | Number.of.bed
  | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr>
1 | Chalet in Skykomish, Washington, US | Sky Haus - A-Frame Cabin | Jun 11 - 16 | 306.00 | 229.00 | 4.85 (531) | 4 beds
2 | Cabin in Hancock, New York, US | The Catskill A-Frame - Mid-Century Modern Cabin | Jun 6 - 11 | 485.00 | 170.00 | 4.77 (146) | 4 beds
3 | Cabin in West Farmington, Ohio, US | The Triangle: A-Frame Cabin for your city retreat | Jul 9 - 14 | 119.00 | 522.00 | 4.91 (515) | 4 beds
4 | Home in Blue Ridge, Georgia, US | *Summer Sizzle* 5 Min to Blue Ridge* Pets* Hot tub | Jun 11 - 16 | 192.00 | 348.00 | 4.94 (88) | 5 beds
5 | Treehouse in Grandview, Texas, US | Luxury Treehouse Couples Getaway w/ Peaceful Views | Jun 4 - 9 | 232.00 | 196.00 | 4.99 (222) | 1 queen bed
6 | Tiny home in Puerto Escondido, Mexico | Casa Tiny near Casa Wabi | Jun 21 - 26 | 261.00 | 148.00 | 4.84 (555) | 1 double bed

 

Step 4: Rename Dataset Columns for Simplicity

This step simplifies the dataset by renaming the column headers to shorter, more accessible names. Clean column names make it easier to work with data in future steps. Here’s the code:

# Step 4: Rename columns for easier access
colnames(data) <- c("title", "detail", "date", "price", "offer_price", "review_rating", "num_beds")
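
Assigning a whole vector to colnames() depends on the columns arriving in exactly this order. As a hedged alternative, the sketch below renames by matching old names to new ones, so a reordered CSV would not silently mislabel columns. The one-row data frame `raw` and the lookup vector `name_map` are illustrative names, not part of the original notebook:

```r
# One-row toy frame with the same raw column names as the CSV preview in Step 3
raw <- data.frame(
  Title = "Cabin in Hancock, New York, US",
  Detail = "The Catskill A-Frame - Mid-Century Modern Cabin",
  Date = "Jun 6 - 11",
  Price.in.dollar. = "485.00",
  Offer.price.in.dollar. = "170.00",
  Review.and.rating = "4.77 (146)",
  Number.of.bed = "4 beds",
  stringsAsFactors = FALSE
)

# Lookup from old name to new name; the order of entries does not matter
name_map <- c(
  Number.of.bed = "num_beds",
  Title = "title",
  Detail = "detail",
  Date = "date",
  Price.in.dollar. = "price",
  Offer.price.in.dollar. = "offer_price",
  Review.and.rating = "review_rating"
)

# Fail early if the file contains a column the lookup does not cover
stopifnot(all(names(raw) %in% names(name_map)))

# Rename each column by looking up its current name
names(raw) <- unname(name_map[names(raw)])
print(names(raw))
```

The positional approach in the tutorial is shorter; the lookup version trades a few lines for protection against column-order changes.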


Step 5: Clean and Prepare Airbnb Data for Analysis

This step transforms messy textual columns into structured numeric formats. It handles currency symbols, extracts ratings and reviews safely from strings, and converts bed counts for easier analysis later. Here’s the code to clean the data:

# Step 5: Clean and prepare the dataset safely

# 1. Remove $ symbols and convert price columns to numeric
data$price <- as.numeric(gsub("\\$", "", data$price))
data$offer_price <- as.numeric(gsub("\\$", "", data$offer_price))

# 2. Safely extract rating and reviews using string matching

# Create new 'rating' column: extract only values like "4.85" at the start
data$rating <- suppressWarnings(as.numeric(sub(" .*", "", data$review_rating)))

# Set rating to NA if it couldn't be converted (like "New")
data$rating[!grepl("^[0-9\\.]+", data$review_rating)] <- NA

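# Create 'reviews' column: extract digits inside brackets like "(123)"
# Caution: [[1]] keeps only the first listing's match, so its count (531) is
# recycled to every row; this is why 'reviews' is constant in the output below
data$reviews <- as.numeric(gsub("[^0-9]", "", regmatches(data$review_rating, gregexpr("\\([0-9]+\\)", data$review_rating))[[1]]))
data$reviews[is.na(data$reviews)] <- NA  # No effect: NA values are already NA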

# 3. Convert number of beds to numeric (remove text like "queen bed")
data$num_beds <- as.numeric(gsub("[^0-9]", "", data$num_beds))

# 4. Check the cleaned data
head(data)

The output of the above code is:

 

 

  | title | detail | date | price | offer_price | review_rating | num_beds | rating | reviews
  | <chr> | <chr> | <chr> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> | <dbl>
1 | Chalet in Skykomish, Washington, US | Sky Haus - A-Frame Cabin | Jun 11 - 16 | 306 | 229 | 4.85 (531) | 4 | 4.85 | 531
2 | Cabin in Hancock, New York, US | The Catskill A-Frame - Mid-Century Modern Cabin | Jun 6 - 11 | 485 | 170 | 4.77 (146) | 4 | 4.77 | 531
3 | Cabin in West Farmington, Ohio, US | The Triangle: A-Frame Cabin for your city retreat | Jul 9 - 14 | 119 | 522 | 4.91 (515) | 4 | 4.91 | 531
4 | Home in Blue Ridge, Georgia, US | *Summer Sizzle* 5 Min to Blue Ridge* Pets* Hot tub | Jun 11 - 16 | 192 | 348 | 4.94 (88) | 5 | 4.94 | 531
5 | Treehouse in Grandview, Texas, US | Luxury Treehouse Couples Getaway w/ Peaceful Views | Jun 4 - 9 | 232 | 196 | 4.99 (222) | 1 | 4.99 | 531
6 | Tiny home in Puerto Escondido, Mexico | Casa Tiny near Casa Wabi | Jun 21 - 26 | 261 | 148 | 4.84 (555) | 1 | 4.84 | 531
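
Note that every row in this output shows 531 reviews, even though the review_rating strings differ. The `[[1]]` index in the reviews line keeps only the first listing's match, and R recycles that single value across all rows. A minimal sketch of a per-row extraction follows, using the six review_rating values from the preview plus a made-up "New" entry to show the missing case; the helper name `extract_reviews` is hypothetical:

```r
# Hypothetical helper: pull the count inside "(...)" for each row separately
extract_reviews <- function(x) {
  has_count <- grepl("\\([0-9,]+\\)", x)      # rows that contain "(digits)"
  out <- rep(NA_real_, length(x))             # default to NA when no count is present
  counts <- sub(".*\\(([0-9,]+)\\).*", "\\1", x[has_count])
  out[has_count] <- as.numeric(gsub(",", "", counts))  # tolerate "1,234"-style counts
  out
}

# Values taken from the preview, plus one illustrative listing without reviews
review_rating <- c("4.85 (531)", "4.77 (146)", "4.91 (515)",
                   "4.94 (88)", "4.99 (222)", "4.84 (555)", "New")

reviews <- extract_reviews(review_rating)
print(reviews)
# 531 146 515  88 222 555  NA
```

Applied to the tutorial's data, this would be `data$reviews <- extract_reviews(data$review_rating)`. The later steps and printed outputs in this article assume the constant column, so the summary and correlation results would differ if this version were used.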

 

Step 6: Summarize and Handle Missing Data in Airbnb Listings

This step helps you inspect the dataset by viewing summary statistics and checking for missing values. If desired, it also creates a cleaner version of the data by dropping rows with any NA values. Here’s the code:

# View summary statistics of the cleaned dataset
summary(data)

# Count how many missing values are in each column
colSums(is.na(data))

# Optional: Create a clean version of the data without NAs
clean_data <- data %>% drop_na()

The output of the above code is:

    title              detail              date               price
 Length:953         Length:953         Length:953         Min.   : 16.00
 Class :character   Class :character   Class :character   1st Qu.: 82.75
 Mode  :character   Mode  :character   Mode  :character   Median :134.50
                                                          Mean   :170.00
                                                          3rd Qu.:220.25
                                                          Max.   :986.00
                                                          NA's   :1
  offer_price    review_rating         num_beds          rating
 Min.   : 16.0   Length:953         Min.   : 1.000   Min.   :3.670
 1st Qu.: 63.0   Class :character   1st Qu.: 1.000   1st Qu.:4.820
 Median :118.0   Mode  :character   Median : 2.000   Median :4.890
 Mean   :150.9                      Mean   : 2.183   Mean   :4.863
 3rd Qu.:182.0                      3rd Qu.: 3.000   3rd Qu.:4.960
 Max.   :819.0                      Max.   :22.000   Max.   :5.000
 NA's   :788                                         NA's   :22
    reviews
 Min.   :531
 1st Qu.:531
 Median :531
 Mean   :531
 3rd Qu.:531
 Max.   :531

        title        detail          date         price   offer_price review_rating      num_beds        rating       reviews
            0             0             0             1           788             0             0            22             0
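
Because offer_price is missing in 788 of the 953 rows, dropping every row with any NA keeps at most 165 listings. A small sketch of the difference between dropping on all columns and dropping only on the columns an analysis needs is shown below. It uses base R's complete.cases() on a toy data frame with made-up values, not the Airbnb data:

```r
# Toy data (hypothetical values) mimicking the pattern above:
# offer_price is mostly missing, price and rating are mostly present
toy <- data.frame(
  price       = c(306, 485, NA, 192, 232),
  offer_price = c(229, NA, NA, NA, 196),
  rating      = c(4.85, 4.77, 4.91, NA, 4.99)
)

# Dropping rows with an NA in ANY column keeps only fully complete rows
all_complete <- toy[complete.cases(toy), ]

# Dropping rows only where the columns of interest are missing keeps more data
price_rating <- toy[complete.cases(toy[, c("price", "rating")]), ]

nrow(all_complete)   # 2
nrow(price_rating)   # 3
```

The same idea is available in tidyr as `drop_na(price, rating)`, which restricts the check to the named columns.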


Step 7: Finalize Numeric Data and View Correlation Matrix

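This step builds the numeric subset used for correlation and removes the reviews column. As the summary in Step 6 shows, reviews holds the same value (531) in every row, so it has zero variance and cor() cannot produce a meaningful coefficient for it. The correlation matrix is then calculated for the remaining numeric variables. Note that numeric_data is not defined in the code shown earlier; the definition below is an assumption consistent with the column order of the output. Here’s the code:

# Build the numeric subset from the NA-free data
# (assumed definition: the code that created this object is not shown in the article)
numeric_data <- clean_data %>% select(price, offer_price, rating, num_beds, reviews)

# Drop 'reviews' since it is constant (531 in every row)
numeric_data <- numeric_data %>% select(-reviews)

# Calculate correlation matrix
cor_matrix <- cor(numeric_data)

# Print correlation matrix
print(cor_matrix)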

The output for the above step is:

                price offer_price      rating   num_beds
price       1.0000000  0.31912435  0.20888333 0.37843215
offer_price 0.3191244  1.00000000 -0.02489574 0.05964721
rating      0.2088833 -0.02489574  1.00000000 0.04495290
num_beds    0.3784321  0.05964721  0.04495290 1.00000000

Step 8: Visualize Correlation Between Features

In this step, we use the corrplot package to create a heatmap that shows how strongly different numeric Airbnb features relate to each other. It helps you spot relationships like price vs rating or number of beds. The code for this step is:

# Plot the correlation heatmap
corrplot(cor_matrix, 
         method = "color",         # Use colored squares to show correlation strength
         addCoef.col = "black",    # Display numeric correlation values inside boxes
         tl.col = "black",         # Set text labels (feature names) to black
         number.cex = 0.8,         # Adjust size of the numbers shown
         title = "Correlation Heatmap of Airbnb Features",  # Title of the plot
         mar = c(0,0,1,0))         # Set margin to make space for the title

The above code gives us a graph of a correlation heatmap of Airbnb features.

The above plot shows that:

  • Price and Number of Beds are Linked: Listings with more beds tend to have higher prices (correlation of 0.38, the strongest in the matrix but still only moderate).
  • Ratings Are Largely Independent of Offer Price and Beds: Ratings show almost no correlation with offer price (-0.02) or number of beds (0.04), and only a weak positive correlation with price (0.21), suggesting other factors drive guest satisfaction.
  • Price and Offer Price Are Only Loosely Related: The correlation between listed price and offer price is 0.32, so the offer price tracks the listed price only weakly. Because offer_price is missing for 788 of the 953 listings, this figure rests on a small subset of the data.
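
For readers who want to see what each cell of the heatmap measures, here is a small sketch that computes a Pearson correlation by hand (covariance divided by the product of the standard deviations) and compares it with cor(). It uses only the six price and bed values from the preview in Step 5, so the result will not match the 0.38 reported for the full dataset:

```r
# First six listings from the preview (price in dollars, number of beds)
price    <- c(306, 485, 119, 192, 232, 261)
num_beds <- c(4, 4, 4, 5, 1, 1)

# Pearson correlation by definition: cov(x, y) / (sd(x) * sd(y))
r_manual <- cov(price, num_beds) / (sd(price) * sd(num_beds))

# Built-in equivalent (Pearson is the default method)
r_builtin <- cor(price, num_beds)

# The two values should agree up to floating-point error
print(c(manual = r_manual, builtin = r_builtin))
print(isTRUE(all.equal(r_manual, r_builtin)))
```

With only six points the coefficient is very unstable, which is one reason the tutorial computes it on the full cleaned data instead.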

Step 9: Prepare Data for Clustering Analysis

Before applying clustering algorithms like K-Means, it's important to prepare the data by selecting only relevant features, handling missing values, and scaling the values to ensure each variable contributes equally. Here’s the code:

# Subset the data to include only relevant features
cluster_data <- data %>% select(price, rating)

# Remove rows with missing values to avoid errors in clustering
cluster_data <- na.omit(cluster_data)

# Standardize the data so all features are on the same scale (mean = 0, sd = 1)
scaled_data <- scale(cluster_data)
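
To confirm what scale() does, the sketch below standardizes the six price and rating values from the Step 5 preview and checks that each column ends up with mean 0 and standard deviation 1. It is an illustration on a tiny subset, not a replacement for the step above:

```r
# Six listings from the preview (price in dollars, rating out of 5)
cluster_data <- data.frame(
  price  = c(306, 485, 119, 192, 232, 261),
  rating = c(4.85, 4.77, 4.91, 4.94, 4.99, 4.84)
)

# scale() subtracts each column's mean and divides by its standard deviation
scaled_data <- scale(cluster_data)

# After scaling, column means are (numerically) 0 and standard deviations are 1
print(round(colMeans(scaled_data), 10))
print(apply(scaled_data, 2, sd))

# The original means and standard deviations are kept as attributes
print(attr(scaled_data, "scaled:center"))
print(attr(scaled_data, "scaled:scale"))
```

Without this step, price (tens to hundreds of dollars) would dominate the Euclidean distances used by K-Means, and rating (a range of roughly 3.7 to 5) would barely influence the clusters.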

Step 10: Install Factoextra for Visualizing Clusters

To visualize the results of clustering in a more intuitive way, we use the factoextra package. It provides powerful tools to display clusters and evaluate their quality.

# Install the factoextra package (only once)
install.packages("factoextra")

The above code gives output confirming that the package and its dependencies are being installed:

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘rbibutils’, ‘Deriv’, ‘microbenchmark’, ‘Rdpack’, ‘doBy’, ‘SparseM’, ‘MatrixModels’, ‘minqa’, ‘nloptr’, ‘reformulas’, ‘RcppEigen’, ‘lazyeval’, ‘carData’, ‘Formula’, ‘pbkrtest’, ‘quantreg’, ‘lme4’, ‘crosstalk’, ‘estimability’, ‘mvtnorm’, ‘numDeriv’, ‘viridis’, ‘car’, ‘DT’, ‘ellipse’, ‘emmeans’, ‘flashClust’, ‘leaps’, ‘multcompView’, ‘scatterplot3d’, ‘ggsci’, ‘cowplot’, ‘ggsignif’, ‘gridExtra’, ‘polynom’, ‘rstatix’, ‘plyr’, ‘abind’, ‘dendextend’, ‘FactoMineR’, ‘ggpubr’, ‘reshape2’, ‘ggrepel’


Step 11: Determine the Optimal Number of Clusters Using the Elbow Method

To choose the best number of clusters (k) for K-Means, we use the Elbow Method. It plots the within-cluster sum of squares (WSS) for different values of k. The "elbow point" is where the WSS begins to decrease more slowly, indicating a good choice for k. Here’s the code:

# Load the library for visualization
library(factoextra)

# Use the Elbow Method to find the optimal number of clusters
fviz_nbclust(scaled_data, kmeans, method = "wss") +
  labs(title = "Elbow Method for Optimal k")

The above code will give us a graph using the Elbow method:

The above graph shows that:

  • WSS decreases as k increases, because adding more clusters reduces the distance between points and their cluster center.
  • The "elbow point" is where the curve starts to flatten; here, adding more clusters doesn’t significantly improve the model.
  • In this plot, the elbow is around k = 3, which is likely the optimal number of clusters.
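
The quantity behind the elbow plot can also be computed directly with base R. The sketch below is an illustration of the idea, not the internals of fviz_nbclust(): it runs kmeans() for k = 1 to 6 on simulated data with three well-separated groups (made-up data, not the Airbnb listings) and collects the total within-cluster sum of squares for each k:

```r
set.seed(123)

# Toy data: three well-separated groups of 20 points each (hypothetical)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares (WSS) for each candidate k
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# WSS drops sharply up to the true number of groups, then flattens
print(data.frame(k = k_values, wss = round(wss, 2)))
```

Plotting `wss` against `k_values` with `plot(k_values, wss, type = "b")` gives the same kind of curve as the factoextra chart. On real data the bend is usually less sharp than in this simulated example, so choosing k involves some judgment.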

Step 12: Apply K-Means Clustering to Airbnb Listings

Now that we’ve determined the optimal number of clusters (k = 3), we run the K-Means algorithm on the scaled data. Each listing is grouped into a cluster based on price and rating. This helps identify patterns like premium, budget, or mid-range listings. Here’s the code:

# Apply K-Means clustering with 3 centers
set.seed(123)
kmeans_result <- kmeans(scaled_data, centers = 3, nstart = 25)

# Attach cluster results to the original dataset
cluster_data$cluster <- as.factor(kmeans_result$cluster)

# View how many listings fall into each cluster
table(cluster_data$cluster)

The above code gives the output:

  1   2   3 
 97 705 128 

The above output is a frequency count of Airbnb listings grouped into 3 clusters by the K-Means algorithm based on price and rating. The counts show only how large each cluster is; what each cluster represents has to be read from the cluster centers or from the scatter plot in Step 13:

  • Cluster 1 (97 listings): The smallest group. In the Step 13 plot it corresponds to high-price, high-rating listings.
  • Cluster 2 (705 listings): The largest group, about three quarters of the 930 clustered listings. It corresponds to low to mid-price listings with top ratings.
  • Cluster 3 (128 listings): A mid-sized group of lower-price listings with relatively lower ratings.
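
Cluster sizes alone do not say what a cluster contains, so it is worth summarizing each cluster in the original units. The following sketch shows one way to do that with aggregate(). The twelve price and rating values are made up for illustration, and the object `cluster_summary` is a hypothetical name; cluster numbering from kmeans() is arbitrary and may differ between runs and datasets:

```r
set.seed(123)

# Toy price/rating data (hypothetical values, not the Airbnb listings)
cluster_data <- data.frame(
  price  = c(60, 75, 90, 110, 130, 400, 450, 520, 95, 85, 120, 70),
  rating = c(4.9, 4.95, 4.85, 4.9, 5.0, 4.9, 4.8, 4.95, 4.2, 4.0, 4.3, 4.1)
)

# Same preparation and clustering calls as in the tutorial
scaled_data <- scale(cluster_data)
kmeans_result <- kmeans(scaled_data, centers = 3, nstart = 25)
cluster_data$cluster <- as.factor(kmeans_result$cluster)

# Average price and rating per cluster, in original (unscaled) units
cluster_summary <- aggregate(cbind(price, rating) ~ cluster,
                             data = cluster_data, FUN = mean)

# Add the number of listings in each cluster
cluster_summary$n <- as.vector(table(cluster_data$cluster))

print(cluster_summary)
```

Run on the tutorial's own cluster_data, the same aggregate() call would give the mean price and rating behind each of the three counts above.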


Step 13: Visualize the Airbnb Clusters Using Price and Rating

Now that we’ve grouped the Airbnb listings into 3 clusters using K-means based on their price and rating, it’s time to visualize how these clusters look. We'll use a scatter plot where each point is a listing, colored by its assigned cluster. This will help us understand the patterns and separation between groups.

# Plot from cluster_data, which contains the 'cluster' column added in Step 12
ggplot(cluster_data, aes(x = price, y = rating, color = cluster)) +
  geom_point(size = 2, alpha = 0.7) +
  labs(title = "K-means Clustering of Listings", x = "Price", y = "Rating") +
  theme_minimal()

The above code gives us a graph that shows the K-Means clustering of listings.


The above plot shows that:

  • Cluster 1 (red) includes high-price, high-rating listings.
  • Cluster 2 (green) contains low to mid-price listings with top ratings.
  • Cluster 3 (blue) represents lower-price listings with relatively lower ratings.

Step 14: Visualize Feature Correlation with ggcorrplot

To better understand how numerical variables in our Airbnb dataset relate to each other, we can use a correlation heatmap. This helps reveal which features move together (positive correlation) or move in opposite directions (negative correlation). We'll use the ggcorrplot package to create a cleaner and more customizable correlation plot.

# Step 14: Install and load ggcorrplot
install.packages("ggcorrplot")  # Run only once
library(ggcorrplot)

Step 15: Plotting the Correlation Heatmap Using ggcorrplot

Now that we’ve prepared our correlation matrix (cor_matrix), we can generate a clean and intuitive heatmap using the ggcorrplot() function. This visualization allows us to quickly see which numerical features are positively or negatively correlated. Here’s the code:

# Create the correlation heatmap using ggcorrplot
ggcorrplot(cor_matrix, 
           lab = TRUE,                            # Show correlation coefficients
           title = "Correlation Heatmap of Numerical Features", 
           lab_size = 3,                          # Size of text labels
           colors = c("blue", "white", "red"))    # Color gradient from negative to positive

The above code gives us the correlation heatmap of numerical features:

The above plot shows that:

  • The heatmap shows how strongly numerical features (like price, rating, beds) are linked; red means a positive link, blue means a negative link, and white means little to no connection. The deeper the color, the stronger the relationship.
  • Each cell shows a number: closer to 1 means a strong positive relationship, closer to -1 means a strong negative relationship, and closer to 0 means a weak or no relationship.
  • For example, “price” and “num_beds” have the strongest positive correlation in this dataset (0.38, light red), while “num_beds” and “rating” (0.04) and “offer_price” and “rating” (-0.02) have almost no correlation (near white).
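
To read the strongest relationships from the numbers rather than from the colors, the off-diagonal entries can be ranked by absolute value. The sketch below rebuilds the matrix from the values printed in Step 7 so it runs on its own; the data frame name `pairs` is illustrative:

```r
# Correlation matrix as printed in Step 7
vars <- c("price", "offer_price", "rating", "num_beds")
cor_matrix <- matrix(
  c(1.0000000,  0.31912435,  0.20888333, 0.37843215,
    0.3191244,  1.00000000, -0.02489574, 0.05964721,
    0.2088833, -0.02489574,  1.00000000, 0.04495290,
    0.3784321,  0.05964721,  0.04495290, 1.00000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Keep each pair once by looking only at the upper triangle
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)

# Sort from strongest to weakest relationship, ignoring the sign
pairs <- pairs[order(-abs(pairs$r)), ]
print(pairs, row.names = FALSE)
```

The first row of the result is price and num_beds (about 0.38), and the last is offer_price and rating (about -0.02), which matches the reading of the heatmap.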

Conclusion

In this Airbnb Listing Analysis project, we used R in Google Colab to explore and cluster property listings based on key features such as price and rating. After cleaning and preprocessing the data, we visualized correlations between the numeric features using corrplot and ggcorrplot.

We then applied K-Means clustering to segment the listings into three distinct groups. The resulting clusters revealed patterns in pricing and review ratings that can help hosts better position their listings.


Colab Link:
https://colab.research.google.com/drive/1JUDKOTc5ZYf0kqAtff6yEWIBkqfPXC0i#scrollTo=7iKlZaBkFJuP


Rohit Sharma

834 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
