Spotify Music Data Analysis Project in R

By Rohit Sharma

Updated on Jul 29, 2025 | 19 min read | 1.36K+ views

What makes a song popular? Well, in this Spotify Music Data Analysis project using R, we will analyze some real Spotify music data to decode the musical DNA behind various hit songs. 

We will peek into over 2,000 tracks, examining features like energy, danceability, valence, and acousticness through statistical analysis and compelling visualizations.

This blog will explain every step from data cleaning to building a Random Forest model that achieves 76% accuracy in predicting song popularity. You'll learn essential data science skills, including correlation analysis, feature engineering, and machine learning.


Upskill Yourself With The Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025

What Should You Know Before Starting the Spotify Music Data Analysis Project?

Before starting the Spotify Music Data Analysis project in R, it’s important to know some key concepts and skills. These will help you use the dataset and identify relevant patterns:

| Skill/Concept | What You Should Know |
| --- | --- |
| Basic R Programming | Use variables, functions, and data frames confidently in R. |
| Tidyverse Usage | Work with dplyr, ggplot2, and tidyr for data manipulation and visualization. |
| Reading and Exploring Data | Use functions like read_csv(), head(), str(), and summary() to load and inspect data. |
| Spotify Audio Features | Understand terms like danceability, energy, valence, and tempo. |
| Data Cleaning Techniques | Handle missing values, rename columns, filter rows, and change data types as needed. |
| Data Visualization | Create plots (e.g., histograms, boxplots, scatter plots) using ggplot2. |
| Working with Dates | Use the lubridate package to parse and manipulate date columns. |
| Grouping and Aggregation | Apply group_by() and summarise() to analyze trends by artist, genre, or year. |
| Google Colab for R | Run R code in Colab by switching the runtime and uploading your data correctly. |


Time, Effort & Skills: What You Need Before Getting Started

This music data analysis project requires the following:

| Aspect | Details |
| --- | --- |
| Estimated Time | 2–3 hours for a basic walkthrough; 4–5 hours for in-depth analysis and polish. |
| Skill Level Required | Beginner to Intermediate; suitable for learners with basic R and data-handling knowledge. |
| Learning Curve | Low to Moderate; the concepts are straightforward, with some exploration required. |
| Time Commitment | Short-term; the entire project can be completed in one or two sittings. |

Read This: Benefits of Learning R: Why It’s Essential for Data Science

What Powers the Analysis: Tools, Models, and R Packages

To run this music data analysis project, we need certain tools, models, and R libraries, which are listed in the table below.

| Category | Details |
| --- | --- |
| Platform | Google Colab (with R runtime) |
| Programming Language | R |
| Machine Learning Models | Logistic Regression (binary classification); Random Forest (improved accuracy and feature importance) |
| Key R Libraries | dplyr (data manipulation), ggplot2 (data visualization), readr (reading CSV files), caret (model training and evaluation), randomForest (random forest implementation) |
| Supporting Libraries | corrplot and GGally (correlation and pair plots), plotly (interactive visualizations), viridis (color palettes), reshape2 (data reshaping), knitr (report formatting) |

Steps and Code Explanation for the Spotify Music Data Analysis in R

This section walks through the project step by step, explaining the R code behind each stage so you can follow exactly what it does and why.

Step 1: Configure Google Colab for R Programming

To begin working with R in Google Colab, you'll first need to switch the notebook's runtime from Python to R. This setup allows you to run R code seamlessly within the Colab environment.

Steps to switch to R:

  1. Open a new notebook in Google Colab
  2. Click on Runtime in the top menu
  3. Select Change runtime type
  4. In the "Language" dropdown, choose R
  5. Click Save
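Before moving on, you can confirm the notebook is actually running an R kernel. This quick check is an optional addition to the original steps:

# Confirm the Colab runtime is now R
print(R.version.string)  # should print something like "R version 4.x.x (...)"

If this errors out or reports Python, the runtime hasn't switched; repeat the steps above.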

Step 2: Install and Load Essential R Libraries

Before starting the analysis of the data, we need to install and load all the required R packages. This step ensures that the environment is ready for data cleaning, visualization, and modeling. The code for this step is:

# Installing required packages (run once)
# These are like tools we need for our analysis
install.packages(c("dplyr", "ggplot2", "readr", "corrplot",
                   "plotly", "GGally", "caret", "viridis",
                   "reshape2", "knitr"))

# Loading libraries (run every time you start)
# Think of this as opening your toolbox
library(dplyr)      # For data manipulation - like Excel functions
library(ggplot2)    # For creating beautiful charts
library(readr)      # For reading CSV files
library(corrplot)   # For correlation plots
library(plotly)     # For interactive charts
library(GGally)     # For advanced plotting
library(caret)      # For machine learning
library(viridis)    # For beautiful color schemes
library(reshape2)   # For data reshaping
library(knitr)      # For nice table formatting

# Let's confirm everything loaded successfully
print("SpotiTunes Analytics Setup Complete!")

Running this code installs the packages and confirms that all the libraries loaded:

Installing packages into ‘/usr/local/lib/R/site-library’

(as ‘lib’ is unspecified)

 

also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘shape’, ‘future.apply’, ‘numDeriv’, ‘progressr’, ‘SQUAREM’, ‘diagram’, ‘lava’, ‘prodlim’, ‘patchwork’, ‘proxy’, ‘iterators’, ‘clock’, ‘gower’, ‘hardhat’, ‘ipred’, ‘sparsevctrs’, ‘timeDate’, ‘lazyeval’, ‘crosstalk’, ‘ggstats’, ‘S7’, ‘e1071’, ‘foreach’, ‘ModelMetrics’, ‘plyr’, ‘pROC’, ‘recipes’, ‘gridExtra’

 

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

corrplot 0.95 loaded

Attaching package: ‘plotly’

The following object is masked from ‘package:ggplot2’:

    last_plot

The following object is masked from ‘package:stats’:

    filter

The following object is masked from ‘package:graphics’:

    layout

Loading required package: lattice

Loading required package: viridisLite

[1] "SpotiTunes Analytics Setup Complete!"

Here’s Something For You: The Ultimate R Cheat Sheet for Data Science Enthusiasts

Step 3: Upload and Load the Spotify Dataset

Now that the setup is ready, it's time to upload the dataset and load it into R so we can begin exploring the Spotify music data. The code for this step is:

# Upload your CSV file to Colab first, then read it
# In Colab, click the file icon on the left, then upload your Spotify data.csv

# Reading the dataset
# This loads your Spotify data into R's memory
spotify_data <- read_csv("Spotify data.csv")

# Let's take our first look at the data
print("Dataset loaded successfully!")
print(paste("Number of songs:", nrow(spotify_data)))
print(paste("Number of features:", ncol(spotify_data)))

# Display first few rows to understand our data
head(spotify_data)

After the successful execution, the code will give us the output as:

New names:

`` -> `...1`

Rows: 2017 Columns: 17

── Column specification ────────────────────────────────────────────────────────

Delimiter: ","

chr  (2): song_title, artist

dbl (15): ...1, acousticness, danceability, duration_ms, energy, instrumenta...

 

Use `spec()` to retrieve the full column specification for this data.

Specify the column types or set `show_col_types = FALSE` to quiet this message.

[1] "Dataset loaded successfully!"

[1] "Number of songs: 2017"

[1] "Number of features: 17"

| ...1 | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence | target | song_title | artist |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.01020 | 0.833 | 204600 | 0.434 | 0.021900 | 2 | 0.1650 | -8.795 | 1 | 0.4310 | 150.062 | 4 | 0.286 | 1 | Mask Off | Future |
| 1 | 0.19900 | 0.743 | 326933 | 0.359 | 0.006110 | 1 | 0.1370 | -10.401 | 1 | 0.0794 | 160.083 | 4 | 0.588 | 1 | Redbone | Childish Gambino |
| 2 | 0.03440 | 0.838 | 185707 | 0.412 | 0.000234 | 2 | 0.1590 | -7.148 | 1 | 0.2890 | 75.044 | 4 | 0.173 | 1 | Xanny Family | Future |
| 3 | 0.60400 | 0.494 | 199413 | 0.338 | 0.510000 | 5 | 0.0922 | -15.236 | 1 | 0.0261 | 86.468 | 4 | 0.230 | 1 | Master Of None | Beach House |
| 4 | 0.18000 | 0.678 | 392893 | 0.561 | 0.512000 | 5 | 0.4390 | -11.648 | 0 | 0.0694 | 174.004 | 4 | 0.904 | 1 | Parallel Lines | Junior Boys |
| 5 | 0.00479 | 0.804 | 251333 | 0.560 | 0.000000 | 8 | 0.1640 | -6.682 | 1 | 0.1850 | 85.023 | 4 | 0.264 | 1 | Sneakin’ | Drake |

The above table gives us an idea of what the data looks like.
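If the column-specification message bothers you, you can silence it and drop the auto-named row-index column `...1` (it is just a row counter). This clean-up is optional; the walkthrough below keeps the dataset exactly as loaded:

# Optional: quieter import, then drop the row-index column
spotify_data <- read_csv("Spotify data.csv", show_col_types = FALSE)
spotify_data <- spotify_data %>% select(-`...1`)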


Step 4: Explore and Understand the Dataset

Before starting our analysis or modeling, it's important to check the dataset. This step will help us understand the structure, identify missing values, and get familiar with the available features. The code for this step is:

# Getting to know our dataset better
# This is like getting familiar with a new playlist

# Basic information about the dataset
str(spotify_data)  # Shows data types and structure
summary(spotify_data)  # Shows statistical summaries

# Check for missing values
# Missing data can affect our analysis
missing_values <- colSums(is.na(spotify_data))
print("Missing values in each column:")
print(missing_values)

# Let's look at the column names to understand what we have
print("Our musical features:")
print(colnames(spotify_data))

The output for this code is: 

spc_tbl_ [2,017 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)

 $ ...1            : num [1:2017] 0 1 2 3 4 5 6 7 8 9 ...

 $ acousticness    : num [1:2017] 0.0102 0.199 0.0344 0.604 0.18 0.00479 0.0145 0.0202 0.0481 0.00208 ...

 $ danceability    : num [1:2017] 0.833 0.743 0.838 0.494 0.678 0.804 0.739 0.266 0.603 0.836 ...

 $ duration_ms     : num [1:2017] 204600 326933 185707 199413 392893 ...

 $ energy          : num [1:2017] 0.434 0.359 0.412 0.338 0.561 0.56 0.472 0.348 0.944 0.603 ...

 $ instrumentalness: num [1:2017] 2.19e-02 6.11e-03 2.34e-04 5.10e-01 5.12e-01 0.00 7.27e-06 6.64e-01 0.00 0.00 ...

 $ key             : num [1:2017] 2 1 2 5 5 8 1 10 11 7 ...

 $ liveness        : num [1:2017] 0.165 0.137 0.159 0.0922 0.439 0.164 0.207 0.16 0.342 0.571 ...

 $ loudness        : num [1:2017] -8.79 -10.4 -7.15 -15.24 -11.65 ...

 $ mode            : num [1:2017] 1 1 1 1 0 1 1 0 0 1 ...

 $ speechiness     : num [1:2017] 0.431 0.0794 0.289 0.0261 0.0694 0.185 0.156 0.0371 0.347 0.237 ...

 $ tempo           : num [1:2017] 150.1 160.1 75 86.5 174 ...

 $ time_signature  : num [1:2017] 4 4 4 4 4 4 4 4 4 4 ...

 $ valence         : num [1:2017] 0.286 0.588 0.173 0.23 0.904 0.264 0.308 0.393 0.398 0.386 ...

 $ target          : num [1:2017] 1 1 1 1 1 1 1 1 1 1 ...

 $ song_title      : chr [1:2017] "Mask Off" "Redbone" "Xanny Family" "Master Of None" ...

 $ artist          : chr [1:2017] "Future" "Childish Gambino" "Future" "Beach House" ...

 - attr(*, "spec")=

  .. cols(

  ..   ...1 = col_double(),

  ..   acousticness = col_double(),

  ..   danceability = col_double(),

  ..   duration_ms = col_double(),

  ..   energy = col_double(),

  ..   instrumentalness = col_double(),

  ..   key = col_double(),

  ..   liveness = col_double(),

  ..   loudness = col_double(),

  ..   mode = col_double(),

  ..   speechiness = col_double(),

  ..   tempo = col_double(),

  ..   time_signature = col_double(),

  ..   valence = col_double(),

  ..   target = col_double(),

  ..   song_title = col_character(),

  ..   artist = col_character()

  .. )

 - attr(*, "problems")=<externalptr> 

 

     ...1       acousticness        danceability     duration_ms     

 Min.   :   0   Min.   :2.840e-06   Min.   :0.1220   Min.   :  16042  

 1st Qu.: 504   1st Qu.:9.630e-03   1st Qu.:0.5140   1st Qu.: 200015  

 Median :1008   Median :6.330e-02   Median :0.6310   Median : 229261  

 Mean   :1008   Mean   :1.876e-01   Mean   :0.6184   Mean   : 246306  

 3rd Qu.:1512   3rd Qu.:2.650e-01   3rd Qu.:0.7380   3rd Qu.: 270333  

 Max.   :2016   Max.   :9.950e-01   Max.   :0.9840   Max.   :1004627  

     energy       instrumentalness         key            liveness     

 Min.   :0.0148   Min.   :0.0000000   Min.   : 0.000   Min.   :0.0188  

 1st Qu.:0.5630   1st Qu.:0.0000000   1st Qu.: 2.000   1st Qu.:0.0923  

 Median :0.7150   Median :0.0000762   Median : 6.000   Median :0.1270  

 Mean   :0.6816   Mean   :0.1332855   Mean   : 5.343   Mean   :0.1908  

 3rd Qu.:0.8460   3rd Qu.:0.0540000   3rd Qu.: 9.000   3rd Qu.:0.2470  

 Max.   :0.9980   Max.   :0.9760000   Max.   :11.000   Max.   :0.9690  

    loudness            mode         speechiness          tempo       

 Min.   :-33.097   Min.   :0.0000   Min.   :0.02310   Min.   : 47.86  

 1st Qu.: -8.394   1st Qu.:0.0000   1st Qu.:0.03750   1st Qu.:100.19  

 Median : -6.248   Median :1.0000   Median :0.05490   Median :121.43  

 Mean   : -7.086   Mean   :0.6123   Mean   :0.09266   Mean   :121.60  

 3rd Qu.: -4.746   3rd Qu.:1.0000   3rd Qu.:0.10800   3rd Qu.:137.85  

 Max.   : -0.307   Max.   :1.0000   Max.   :0.81600   Max.   :219.33  

 time_signature     valence           target        song_title       

 Min.   :1.000   Min.   :0.0348   Min.   :0.0000   Length:2017       

 1st Qu.:4.000   1st Qu.:0.2950   1st Qu.:0.0000   Class :character  

 Median :4.000   Median :0.4920   Median :1.0000   Mode  :character  

 Mean   :3.968   Mean   :0.4968   Mean   :0.5057                     

 3rd Qu.:4.000   3rd Qu.:0.6910   3rd Qu.:1.0000                     

 Max.   :5.000   Max.   :0.9920   Max.   :1.0000                     

    artist         

 Length:2017       

 Class :character  

 Mode  :character  

                                         

[1] "Missing values in each column:"

            ...1     acousticness     danceability      duration_ms 

               0                0                0                0 

          energy instrumentalness              key         liveness 

               0                0                0                0 

        loudness             mode      speechiness            tempo 

               0                0                0                0 

  time_signature          valence           target       song_title 

               0                0                0                0 

          artist 

               0 

[1] "🎵 Our musical features:"

 [1] "...1"             "acousticness"     "danceability"     "duration_ms"     

 [5] "energy"           "instrumentalness" "key"              "liveness"        

 [9] "loudness"         "mode"             "speechiness"      "tempo"           

[13] "time_signature"   "valence"          "target"           "song_title"      

[17] "artist"          

The above output shows that:

  • The dataset has 2,017 songs and 17 features, including both audio metrics and song metadata.
  • No missing values were detected, which means that the data is clean.
  • Features include acousticness, energy, tempo, valence, song_title, and artist.

Also Read: R vs Python Data Science: The Difference

Step 5: Clean and Enrich the Data

Before analysis, we need to remove duplicates and create new categorical variables for better insight. This step makes the dataset easier to interpret and visualize. Here is the code:

# Cleaning our data - like tuning instruments before a concert

# Remove any duplicate songs
spotify_clean <- spotify_data %>%
  distinct(song_title, artist, .keep_all = TRUE)

print(paste("Removed", nrow(spotify_data) - nrow(spotify_clean), "duplicate songs"))

# Create a popularity category for easier analysis
# Target = 1 means popular, Target = 0 means less popular
spotify_clean <- spotify_clean %>%
  mutate(
    popularity_label = ifelse(target == 1, "Popular", "Less Popular"),
    # Create energy level categories
    energy_level = case_when(
      energy < 0.3 ~ "Low Energy",
      energy < 0.7 ~ "Medium Energy",
      TRUE ~ "High Energy"
    ),
    # Create danceability categories
    dance_level = case_when(
      danceability < 0.3 ~ "Not Danceable",
      danceability < 0.7 ~ "Moderately Danceable",
      TRUE ~ "Very Danceable"
    )
  )

print("Data cleaning complete!")

The output for the above code is:

[1] "Removed 35 duplicate songs."

[1] "Data cleaning complete!"

This means:

  • 35 duplicate songs were removed from the dataset.
  • New columns like popularity_label, energy_level, and dance_level were successfully added.
  • The data is now cleaned and categorized for easier analysis.
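As an optional sanity check (not part of the original code), you can tabulate the new categorical columns to confirm the cut-offs split the data sensibly:

# Quick counts for each derived category
table(spotify_clean$energy_level)
table(spotify_clean$dance_level)
table(spotify_clean$popularity_label)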

Step 6: Analyze Song Popularity Trends

In this step, we will check how popular and less popular songs differ in terms of key musical features. We also visualize the overall distribution of popularity in the dataset. The code for this section is:

# Let's explore our musical universe!

# 1. Basic statistics about popular vs less popular songs
popularity_summary <- spotify_clean %>%
  group_by(popularity_label) %>%
  summarise(
    count = n(),
    avg_energy = round(mean(energy), 3),
    avg_danceability = round(mean(danceability), 3),
    avg_valence = round(mean(valence), 3),
    avg_loudness = round(mean(loudness), 3)
  )

print("Popularity Analysis:")
knitr::kable(popularity_summary)

# 2. Distribution of song popularity
ggplot(spotify_clean, aes(x = popularity_label)) +
  geom_bar(fill = c("#FF6B35", "#004E64"), alpha = 0.8) +
  labs(title = "Distribution of Song Popularity",
       subtitle = "How many popular vs less popular songs do we have?",
       x = "Popularity Category",
       y = "Number of Songs") +
  theme_minimal() +
  theme(plot.title = element_text(size = 16, hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

The output for this section will be as follows:

[1] " Popularity Analysis:"

popularity_label

count

avg_energy

avg_danceability

avg_valence

avg_loudness

Less Popular 989 0.673 0.589 0.469 -6.813
Popular 993 0.693 0.648 0.528 -7.312

 

The above code also gives us a graph:

This output shows:

  • Balanced classes: Popular (993) vs Less Popular (989), great for modeling without heavy rebalancing.
  • “Feel‑good” bias: popular songs score slightly higher on danceability, energy, and valence, suggesting they’re more upbeat and danceable.
  • Loudness isn’t the driver: Popular tracks are actually a bit quieter on average (−7.31 dB vs −6.81 dB), so being louder doesn’t map directly to popularity here.
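The group means differ only slightly, so it is worth checking whether those gaps are statistically meaningful. Here is a minimal sketch using base R's Welch t-test, added for illustration rather than taken from the original notebook:

# Test whether the danceability and valence gaps between groups are significant
t.test(danceability ~ popularity_label, data = spotify_clean)
t.test(valence ~ popularity_label, data = spotify_clean)

A small p-value here suggests the difference between popular and less popular songs is real rather than sampling noise.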

Here’s Something You Should Know: Data Preprocessing in Machine Learning: 11 Key Steps You Must Know!

Step 7: Visualize the Sound of Popularity

This step gives us more insights into how energy, danceability, valence, and tempo vary across popular and less popular songs. Visualizations help us uncover subtle patterns and trends in musical features. The code for this step is:

# Let's visualize the musical DNA of popular songs!

# 1. Energy vs Danceability scatter plot
p1 <- ggplot(spotify_clean, aes(x = energy, y = danceability, color = popularity_label)) +
  geom_point(alpha = 0.6, size = 2) +
  scale_color_manual(values = c("#FF6B35", "#004E64")) +
  labs(title = "🕺 Energy vs Danceability",
       subtitle = "Do popular songs have more energy and danceability?",
       x = "Energy Level",
       y = "Danceability",
       color = "Popularity") +
  theme_minimal()

print(p1)

# 2. Valence (Musical Happiness) distribution
p2 <- ggplot(spotify_clean, aes(x = valence, fill = popularity_label)) +
  geom_histogram(alpha = 0.7, bins = 30, position = "identity") +
  scale_fill_manual(values = c("#FF6B35", "#004E64")) +
  labs(title = "Musical Happiness Distribution",
       subtitle = "Are popular songs happier?",
       x = "Valence (0 = Sad, 1 = Happy)",
       y = "Number of Songs",
       fill = "Popularity") +
  theme_minimal()

print(p2)

# 3. Tempo analysis
p3 <- ggplot(spotify_clean, aes(x = popularity_label, y = tempo, fill = popularity_label)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = c("#FF6B35", "#004E64")) +
  labs(title = "Tempo Comparison",
       subtitle = "Is there an optimal tempo for popularity?",
       x = "Popularity Category",
       y = "Tempo (BPM)",
       fill = "Popularity") +
  theme_minimal()

print(p3)

The above code will give us three graphs, each showing different characteristics:

The energy vs danceability scatter plot shows:

  • Popular songs generally cluster in high energy and high danceability zones.
  • Less popular tracks are more spread out across low-to-mid ranges.
  • There's a visible positive trend; more energy often aligns with more danceability.

The valence (musical happiness) distribution shows:

  • Popular songs lean more toward the higher valence (happier) end of the scale.
  • Less popular songs show more spread across low-to-mid valence, suggesting a mix of moods.
  • Happier songs are slightly more likely to be popular.

The tempo box plot shows:

  • Both groups show similar median tempos, around 120–125 BPM.
  • Popular songs have more tempo outliers, showing variety.
  • There’s no clear “optimal” tempo, but both fall in the danceable BPM range.
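To back the box-plot reading with numbers, a short dplyr summary (added here for illustration) shows how close the group medians actually sit:

# Median and spread of tempo per popularity group
spotify_clean %>%
  group_by(popularity_label) %>%
  summarise(median_tempo = median(tempo),
            iqr_tempo = IQR(tempo))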

You Can’t Miss This: What’s Special About Machine Learning?

Step 8: Measure Relationships with a Correlation Matrix

In this step, we quantify how musical features move together and which ones line up most with popularity (target). We’ll build and plot a correlation matrix, then list the features most (positively/negatively) associated with popularity.

# Let's find relationships between musical features

# Select numerical features for correlation
numerical_features <- spotify_clean %>%
  select(acousticness, danceability, energy, instrumentalness,
         liveness, loudness, speechiness, tempo, valence, target)

# Create correlation matrix
correlation_matrix <- cor(numerical_features)

# Visualize correlations
corrplot(correlation_matrix,
         method = "color",
         type = "upper",
         tl.cex = 0.8,
         tl.col = "black",
         title = "Musical Features Correlation Matrix",
         mar = c(0,0,2,0))

# Find strongest correlations with popularity (target)
target_correlations <- correlation_matrix[,"target"] %>%
  sort(decreasing = TRUE)

print("Features most correlated with popularity:")
print(round(target_correlations, 3))

The output of the above code is:

[1] "Features most correlated with popularity:"

Feature

Value

target 1.000
danceability 0.183
speechiness 0.162
instrumentalness 0.150
valence 0.119
energy 0.048
tempo 0.034
liveness 0.024
loudness -0.066
acousticness -0.134

The correlation heatmap drawn by corrplot shows that:

  • Popularity (target) has only weak correlations with all features; nothing alone strongly predicts it.
  • Energy ↔ Loudness are strongly positively correlated, while acousticness is strongly negative with both (louder, energetic songs are less acoustic).
  • Danceability and valence show a moderate positive link; happier songs tend to be more danceable.
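To read any single relationship off the matrix as a plain number, you can compute the pairwise correlations directly; this spot-check is an optional addition:

# Spot-check the strongest pairwise relationships
cor(spotify_clean$energy, spotify_clean$loudness)       # expect a strong positive value
cor(spotify_clean$energy, spotify_clean$acousticness)   # expect a strong negative value
cor(spotify_clean$danceability, spotify_clean$valence)  # expect a moderate positive value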


Step 9: Build Deeper Insights with Rich Visuals

This step focuses on advanced, storytelling-style visualizations to highlight differences in song features and reveal which artists have the most consistent hit rate. Here’s the code:

# Create beautiful, insightful visualizations

# 1. Multi-feature comparison using violin plots
features_long <- spotify_clean %>%
  select(popularity_label, energy, danceability, valence, acousticness) %>%
  reshape2::melt(id.vars = "popularity_label")

ggplot(features_long, aes(x = popularity_label, y = value, fill = popularity_label)) +
  geom_violin(alpha = 0.7) +
  facet_wrap(~variable, scales = "free_y") +
  scale_fill_manual(values = c("#FF6B35", "#004E64")) +
  labs(title = "Musical Feature Comparison",
       subtitle = "How do different features vary between popular and less popular songs?",
       x = "Popularity Category",
       y = "Feature Value",
       fill = "Popularity") +
  theme_minimal() +
  theme(strip.text = element_text(size = 12, face = "bold"))

# 2. Top artists analysis
top_artists <- spotify_clean %>%
  group_by(artist) %>%
  summarise(
    total_songs = n(),
    popular_songs = sum(target),
    popularity_rate = round(popular_songs / total_songs * 100, 1)
  ) %>%
  filter(total_songs >= 3) %>%  # Artists with at least 3 songs
  arrange(desc(popularity_rate)) %>%
  head(10)

ggplot(top_artists, aes(x = reorder(artist, popularity_rate), y = popularity_rate)) +
  geom_col(fill = "#004E64", alpha = 0.8) +
  coord_flip() +
  labs(title = "Top 10 Artists by Success Rate",
       subtitle = "Artists with highest percentage of popular songs (min 3 songs)",
       x = "Artist",
       y = "Success Rate (%)") +
  theme_minimal()

The above code will give us two graphs that explain the following:

The violin plots show that:

  • Popular songs tend to have higher energy, danceability, and valence compared to less popular ones.
  • Acousticness is generally lower in popular songs, suggesting a preference for more produced or electronic sounds.
  • These patterns highlight the musical qualities that may contribute to a track's popularity.

The top-artists chart shows that:

  • All listed artists have a 100% success rate, meaning every one of their songs in the dataset is marked as popular.
  • Artists like Blood Orange, A$AP Rocky, and Beach House consistently released tracks that performed well.
  • Only artists with at least 3 songs were considered to ensure meaningful success rates.


Step 10: Train a Logistic Regression to Predict Popularity

Here we split the data into train/test sets, fit a logistic regression using key audio features, evaluate its accuracy, and inspect which features are most statistically significant. The code for this section is:

# Let's build a simple model to predict song popularity!

# Prepare data for modeling
set.seed(123)  # For reproducible results

# Split data into training and testing sets
train_index <- createDataPartition(spotify_clean$target, p = 0.8, list = FALSE)
train_data <- spotify_clean[train_index, ]
test_data <- spotify_clean[-train_index, ]

print(paste("Training songs:", nrow(train_data)))
print(paste("Testing songs:", nrow(test_data)))

# Select features for our model
model_features <- c("acousticness", "danceability", "energy", "instrumentalness",
                   "liveness", "loudness", "speechiness", "tempo", "valence")

# Build a simple logistic regression model
model <- glm(target ~ .,
             data = train_data[, c("target", model_features)],
             family = "binomial")

# Make predictions
predictions <- predict(model, test_data, type = "response")
predicted_classes <- ifelse(predictions > 0.5, 1, 0)

# Calculate accuracy
accuracy <- mean(predicted_classes == test_data$target)
print(paste("Model Accuracy:", round(accuracy * 100, 2), "%"))

# Feature importance
feature_importance <- summary(model)$coefficients[,4]  # p-values
important_features <- sort(feature_importance[2:length(feature_importance)])
print("Most important features (lowest p-values):")
print(round(important_features, 4))

The output for this section is as follows:

[1] "Training songs: 1586"

[1] "Testing songs: 396"

[1] "Model Accuracy: 64.14 %"

[1] "Most important features (lowest p-values):"

Feature

Value

speechiness 0.0000
instrumentalness 0.0000
loudness 0.0000
acousticness 0.0000
danceability 0.0000
valence 0.0006
tempo 0.0827
energy 0.0854
liveness 0.6461

Model Summary:

  • Accuracy: The model predicts song popularity with 64.14% accuracy, a decent baseline for logistic regression.
  • Top predictive features (based on lowest p-values):
    • Speechiness, instrumentalness, loudness, and acousticness are strongly significant (p-value ≈ 0).
    • Valence is also significant; it reflects musical positivity.
    • Liveness seems not useful (high p-value ≈ 0.64).
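Accuracy alone can hide class-specific errors. Since caret is already loaded, a confusion matrix gives a fuller picture (sensitivity, specificity, and per-class counts). This evaluation step is an addition to the original notebook:

# Fuller evaluation of the logistic model on the test set
caret::confusionMatrix(
  factor(predicted_classes, levels = c(0, 1)),
  factor(test_data$target, levels = c(0, 1)),
  positive = "1"
)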

Also Read: Data Science Project Ideas for Beginners and Python IDEs for Data Science and Machine Learning

Step 11: Upgrade Your Model with Random Forest

Now that we’ve tested a basic logistic regression, in this step, we will build a more powerful model. The Random Forest algorithm handles complex relationships and feature interactions better, which often leads to higher accuracy. Here’s the code:

# Install if not already installed
if (!require(randomForest)) install.packages("randomForest")
library(randomForest)

# Train a Random Forest model
set.seed(123)
rf_model <- randomForest(as.factor(target) ~ .,
                        data = train_data[, c("target", model_features)],
                        ntree = 100,
                        importance = TRUE)

# Make predictions
rf_predictions <- predict(rf_model, test_data, type = "prob")[,2]
rf_classes <- ifelse(rf_predictions > 0.5, 1, 0)

# Accuracy of the Random Forest model
rf_accuracy <- mean(rf_classes == test_data$target)
print(paste("Random Forest Accuracy:", round(rf_accuracy * 100, 2), "%"))

# Feature importance scores
importance(rf_model)

The output for this section is:

Loading required package: randomForest

Warning message in library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
“there is no package called ‘randomForest’”

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

randomForest 4.7-1.2

Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:ggplot2’:

    margin

The following object is masked from ‘package:dplyr’:

    combine

[1] "Random Forest Accuracy: 76.01 %"

 

| Feature | 0 | 1 | MeanDecreaseAccuracy | MeanDecreaseGini |
| --- | --- | --- | --- | --- |
| acousticness | 11.913743 | -0.2209466 | 11.076556 | 74.99875 |
| danceability | 14.352573 | 8.3464287 | 16.966182 | 93.23475 |
| energy | 14.201425 | 0.5790907 | 14.407994 | 82.00614 |
| instrumentalness | 26.440829 | 21.0849145 | 31.723116 | 121.86330 |
| liveness | 1.832211 | 3.1778978 | 3.360536 | 60.18430 |
| loudness | 15.432559 | 6.9422589 | 16.070757 | 106.90894 |
| speechiness | 11.330946 | 15.0804005 | 17.905247 | 101.84949 |
| tempo | 5.465601 | 2.3359464 | 5.790140 | 69.96652 |
| valence | 13.130002 | 3.1291881 | 12.043707 | 81.33633 |

The Random Forest lifts accuracy to 76.01%, a solid gain over the logistic regression’s 64.14%, and its importance scores tell us which musical traits matter most.

Key takeaways:

  • Accuracy ↑: 76.01% vs 64.14% (logistic). Random Forest captures non‑linear relationships better.
  • Most influential features: Instrumentalness, loudness, speechiness, danceability (highest MeanDecreaseGini / MeanDecreaseAccuracy).
  • Weaker signals: Liveness and tempo contribute the least to predictions here.
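The importance table is easier to scan as a chart, and the randomForest package ships a built-in plot for exactly this. An optional one-liner:

# Visualize variable importance from the fitted forest
varImpPlot(rf_model, main = "Random Forest Feature Importance")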

Conclusion

In this Spotify Music Data Analysis project, we used R in Google Colab to look into what makes a song popular by analyzing audio features such as energy, danceability, loudness, and speechiness. After cleaning and visualizing the dataset, we trained two models: a Logistic Regression and a more powerful Random Forest classifier.

The Random Forest model outperformed Logistic Regression, reaching 76.01% accuracy versus 64.14%, and identified instrumentalness, loudness, and speechiness as key predictors of popularity.
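Since the 76.01% figure comes from a single train/test split, a natural next step is k-fold cross-validation for a more stable estimate. Here is a minimal sketch using caret's train() with the same features; this is an assumed extension, and exact numbers will vary with the seed and folds:

# 5-fold cross-validated random forest (sketch)
set.seed(123)
cv_model <- caret::train(
  as.factor(target) ~ .,
  data = spotify_clean[, c("target", model_features)],
  method = "rf",
  trControl = caret::trainControl(method = "cv", number = 5)
)
print(cv_model)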


Colab Link:
https://colab.research.google.com/drive/17HoemeiOPtDJgoWqNVtxOia5jRxX5fEl#scrollTo=06bFYquyv76C

Frequently Asked Questions (FAQs)

1. What is the goal of the Spotify Music Data Analysis project in R?

2. Which tools and libraries are used in this music analysis project?

3. Can I use other algorithms to improve song popularity prediction?

4. Is this project suitable for R beginners?

5. What are some other interesting R projects like this one?

Rohit Sharma

803 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
