Spotify Music Data Analysis Project in R

By Rohit Sharma

Updated on Jul 29, 2025 | 19 min read | 1.36K+ views

What makes a song popular? Well, in this Spotify Music Data Analysis project using R, we will analyze some real Spotify music data to decode the musical DNA behind various hit songs. 

We will peek into over 2,000 tracks, examining features like energy, danceability, valence, and acousticness through statistical analysis and compelling visualizations.

This blog will explain every step from data cleaning to building a Random Forest model that achieves 76% accuracy in predicting song popularity. You'll learn essential data science skills, including correlation analysis, feature engineering, and machine learning.


Upskill Yourself With The Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025

What Should You Know Before Starting the Spotify Music Data Analysis Project?

Before starting the Spotify Music Data Analysis project in R, it’s important to know some key concepts and skills. These will help you use the dataset and identify relevant patterns:

| Skill/Concept | What You Should Know |
| --- | --- |
| Basic R Programming | Use variables, functions, and data frames confidently in R. |
| Tidyverse Usage | Work with dplyr, ggplot2, and tidyr for data manipulation and visualization. |
| Reading and Exploring Data | Use functions like read_csv(), head(), str(), and summary() to load and inspect data. |
| Spotify Audio Features | Understand terms like danceability, energy, valence, and tempo. |
| Data Cleaning Techniques | Handle missing values, rename columns, filter rows, and change data types as needed. |
| Data Visualization | Create plots (e.g., histograms, boxplots, scatter plots) using ggplot2. |
| Working with Dates | Use the lubridate package to parse and manipulate date columns. |
| Grouping and Aggregation | Apply group_by() and summarise() to analyze trends by artist, genre, or year. |
| Google Colab for R | Run R code in Colab by switching the runtime and uploading your data correctly. |


Time, Effort & Skills: What You Need Before Getting Started

This music data analysis project requires the following:

| Aspect | Details |
| --- | --- |
| Estimated Time | 2–3 hours for a basic walkthrough; 4–5 hours for in-depth analysis and polish. |
| Skill Level Required | Beginner to Intermediate; suitable for learners with basic R and data-handling knowledge. |
| Learning Curve | Low to Moderate; the concepts are straightforward, with some exploration required. |
| Time Commitment | Short-term; the entire project can be completed in one or two sittings. |

Read This: Benefits of Learning R: Why It’s Essential for Data Science

What Powers the Analysis: Tools, Models, and R Packages

To run this music data analysis project, we need certain tools, models, and R libraries, which are listed in the table below.

| Category | Details |
| --- | --- |
| Platform | Google Colab (with R runtime) |
| Programming Language | R |
| Machine Learning Models | Logistic Regression (binary classification); Random Forest (improved accuracy and feature importance) |
| Key R Libraries | dplyr (data manipulation), ggplot2 (data visualization), readr (reading CSV files), caret (model training and evaluation), randomForest (random forest implementation) |
| Supporting Libraries | corrplot and GGally (correlation and pair plots), plotly (interactive visualizations), viridis (color palettes), reshape2 (data reshaping), knitr (report formatting) |

Steps and Code Explanation for the Spotify Music Data Analysis in R

This section walks through the project step by step, explaining the R code behind each stage so you can follow exactly what it does and why.

Step 1: Configure Google Colab for R Programming

To begin working with R in Google Colab, you'll first need to switch the notebook's runtime from Python to R. This setup allows you to run R code seamlessly within the Colab environment.

Steps to switch to R:

  1. Open a new notebook in Google Colab
  2. Click on Runtime in the top menu
  3. Select Change runtime type
  4. In the "Language" dropdown, choose R
  5. Click Save
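Before moving on, you can confirm the notebook is actually running an R kernel. This quick check is an optional addition to the original steps:

# Confirm the Colab runtime is now R
print(R.version.string)  # should print something like "R version 4.x.x (...)"

If this errors out or reports Python, the runtime hasn't switched; repeat the steps above.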

Step 2: Install and Load Essential R Libraries

Before starting the analysis of the data, we need to install and load all the required R packages. This step ensures that the environment is ready for data cleaning, visualization, and modeling. The code for this step is:

# Installing required packages (run once)
# These are like tools we need for our analysis
install.packages(c("dplyr", "ggplot2", "readr", "corrplot",
                   "plotly", "GGally", "caret", "viridis",
                   "reshape2", "knitr"))

# Loading libraries (run every time you start)
# Think of this as opening your toolbox
library(dplyr)      # For data manipulation - like Excel functions
library(ggplot2)    # For creating beautiful charts
library(readr)      # For reading CSV files
library(corrplot)   # For correlation plots
library(plotly)     # For interactive charts
library(GGally)     # For advanced plotting
library(caret)      # For machine learning
library(viridis)    # For beautiful color schemes
library(reshape2)   # For data reshaping
library(knitr)      # For nice table formatting

# Let's confirm everything loaded successfully
print("SpotiTunes Analytics Setup Complete!")

Running this code installs the packages and confirms that all the libraries loaded:

Installing packages into ‘/usr/local/lib/R/site-library’

(as ‘lib’ is unspecified)

 

also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘shape’, ‘future.apply’, ‘numDeriv’, ‘progressr’, ‘SQUAREM’, ‘diagram’, ‘lava’, ‘prodlim’, ‘patchwork’, ‘proxy’, ‘iterators’, ‘clock’, ‘gower’, ‘hardhat’, ‘ipred’, ‘sparsevctrs’, ‘timeDate’, ‘lazyeval’, ‘crosstalk’, ‘ggstats’, ‘S7’, ‘e1071’, ‘foreach’, ‘ModelMetrics’, ‘plyr’, ‘pROC’, ‘recipes’, ‘gridExtra’

 

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

corrplot 0.95 loaded

Attaching package: ‘plotly’

The following object is masked from ‘package:ggplot2’:

    last_plot

The following object is masked from ‘package:stats’:

    filter

The following object is masked from ‘package:graphics’:

    layout

Loading required package: lattice

Loading required package: viridisLite

[1] "SpotiTunes Analytics Setup Complete!"

Here’s Something For You: The Ultimate R Cheat Sheet for Data Science Enthusiasts

Step 3: Upload and Load the Spotify Dataset

Now that the setup is ready, it's time to upload the dataset and load it into R so we can begin exploring the Spotify music data. The code for this step is:

# Upload your CSV file to Colab first, then read it
# In Colab, click the file icon on the left, then upload your Spotify data.csv

# Reading the dataset
# This loads your Spotify data into R's memory
spotify_data <- read_csv("Spotify data.csv")

# Let's take our first look at the data
print("Dataset loaded successfully!")
print(paste("Number of songs:", nrow(spotify_data)))
print(paste("Number of features:", ncol(spotify_data)))

# Display first few rows to understand our data
head(spotify_data)

After the successful execution, the code will give us the output as:

New names:

`` -> `...1`

Rows: 2017 Columns: 17

── Column specification ────────────────────────────────────────────────────────

Delimiter: ","

chr  (2): song_title, artist

dbl (15): ...1, acousticness, danceability, duration_ms, energy, instrumenta...

 

Use `spec()` to retrieve the full column specification for this data.

Specify the column types or set `show_col_types = FALSE` to quiet this message.

[1] "Dataset loaded successfully!"

[1] "Number of songs: 2017"

[1] "Number of features: 17"

| ...1 | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence | target | song_title | artist |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.01020 | 0.833 | 204600 | 0.434 | 0.021900 | 2 | 0.1650 | -8.795 | 1 | 0.4310 | 150.062 | 4 | 0.286 | 1 | Mask Off | Future |
| 1 | 0.19900 | 0.743 | 326933 | 0.359 | 0.006110 | 1 | 0.1370 | -10.401 | 1 | 0.0794 | 160.083 | 4 | 0.588 | 1 | Redbone | Childish Gambino |
| 2 | 0.03440 | 0.838 | 185707 | 0.412 | 0.000234 | 2 | 0.1590 | -7.148 | 1 | 0.2890 | 75.044 | 4 | 0.173 | 1 | Xanny Family | Future |
| 3 | 0.60400 | 0.494 | 199413 | 0.338 | 0.510000 | 5 | 0.0922 | -15.236 | 1 | 0.0261 | 86.468 | 4 | 0.230 | 1 | Master Of None | Beach House |
| 4 | 0.18000 | 0.678 | 392893 | 0.561 | 0.512000 | 5 | 0.4390 | -11.648 | 0 | 0.0694 | 174.004 | 4 | 0.904 | 1 | Parallel Lines | Junior Boys |
| 5 | 0.00479 | 0.804 | 251333 | 0.560 | 0.000000 | 8 | 0.1640 | -6.682 | 1 | 0.1850 | 85.023 | 4 | 0.264 | 1 | Sneakin’ | Drake |

The above table gives us an idea of what the data looks like.
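If the column-specification message bothers you, you can silence it and drop the auto-named row-index column `...1` (it is just a row counter). This clean-up is optional; the walkthrough below keeps the dataset exactly as loaded:

# Optional: quieter import, then drop the row-index column
spotify_data <- read_csv("Spotify data.csv", show_col_types = FALSE)
spotify_data <- spotify_data %>% select(-`...1`)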


Step 4: Explore and Understand the Dataset

Before starting our analysis or modeling, it's important to check the dataset. This step will help us understand the structure, identify missing values, and get familiar with the available features. The code for this step is:

# Getting to know our dataset better
# This is like getting familiar with a new playlist

# Basic information about the dataset
str(spotify_data)  # Shows data types and structure
summary(spotify_data)  # Shows statistical summaries

# Check for missing values
# Missing data can affect our analysis
missing_values <- colSums(is.na(spotify_data))
print("Missing values in each column:")
print(missing_values)

# Let's look at the column names to understand what we have
print("Our musical features:")
print(colnames(spotify_data))

The output for this code is: 

spc_tbl_ [2,017 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)

 $ ...1            : num [1:2017] 0 1 2 3 4 5 6 7 8 9 ...

 $ acousticness    : num [1:2017] 0.0102 0.199 0.0344 0.604 0.18 0.00479 0.0145 0.0202 0.0481 0.00208 ...

 $ danceability    : num [1:2017] 0.833 0.743 0.838 0.494 0.678 0.804 0.739 0.266 0.603 0.836 ...

 $ duration_ms     : num [1:2017] 204600 326933 185707 199413 392893 ...

 $ energy          : num [1:2017] 0.434 0.359 0.412 0.338 0.561 0.56 0.472 0.348 0.944 0.603 ...

 $ instrumentalness: num [1:2017] 2.19e-02 6.11e-03 2.34e-04 5.10e-01 5.12e-01 0.00 7.27e-06 6.64e-01 0.00 0.00 ...

 $ key             : num [1:2017] 2 1 2 5 5 8 1 10 11 7 ...

 $ liveness        : num [1:2017] 0.165 0.137 0.159 0.0922 0.439 0.164 0.207 0.16 0.342 0.571 ...

 $ loudness        : num [1:2017] -8.79 -10.4 -7.15 -15.24 -11.65 ...

 $ mode            : num [1:2017] 1 1 1 1 0 1 1 0 0 1 ...

 $ speechiness     : num [1:2017] 0.431 0.0794 0.289 0.0261 0.0694 0.185 0.156 0.0371 0.347 0.237 ...

 $ tempo           : num [1:2017] 150.1 160.1 75 86.5 174 ...

 $ time_signature  : num [1:2017] 4 4 4 4 4 4 4 4 4 4 ...

 $ valence         : num [1:2017] 0.286 0.588 0.173 0.23 0.904 0.264 0.308 0.393 0.398 0.386 ...

 $ target          : num [1:2017] 1 1 1 1 1 1 1 1 1 1 ...

 $ song_title      : chr [1:2017] "Mask Off" "Redbone" "Xanny Family" "Master Of None" ...

 $ artist          : chr [1:2017] "Future" "Childish Gambino" "Future" "Beach House" ...

 - attr(*, "spec")=

  .. cols(

  ..   ...1 = col_double(),

  ..   acousticness = col_double(),

  ..   danceability = col_double(),

  ..   duration_ms = col_double(),

  ..   energy = col_double(),

  ..   instrumentalness = col_double(),

  ..   key = col_double(),

  ..   liveness = col_double(),

  ..   loudness = col_double(),

  ..   mode = col_double(),

  ..   speechiness = col_double(),

  ..   tempo = col_double(),

  ..   time_signature = col_double(),

  ..   valence = col_double(),

  ..   target = col_double(),

  ..   song_title = col_character(),

  ..   artist = col_character()

  .. )

 - attr(*, "problems")=<externalptr> 

 

     ...1       acousticness        danceability     duration_ms     

 Min.   :   0   Min.   :2.840e-06   Min.   :0.1220   Min.   :  16042  

 1st Qu.: 504   1st Qu.:9.630e-03   1st Qu.:0.5140   1st Qu.: 200015  

 Median :1008   Median :6.330e-02   Median :0.6310   Median : 229261  

 Mean   :1008   Mean   :1.876e-01   Mean   :0.6184   Mean   : 246306  

 3rd Qu.:1512   3rd Qu.:2.650e-01   3rd Qu.:0.7380   3rd Qu.: 270333  

 Max.   :2016   Max.   :9.950e-01   Max.   :0.9840   Max.   :1004627  

     energy       instrumentalness         key            liveness     

 Min.   :0.0148   Min.   :0.0000000   Min.   : 0.000   Min.   :0.0188  

 1st Qu.:0.5630   1st Qu.:0.0000000   1st Qu.: 2.000   1st Qu.:0.0923  

 Median :0.7150   Median :0.0000762   Median : 6.000   Median :0.1270  

 Mean   :0.6816   Mean   :0.1332855   Mean   : 5.343   Mean   :0.1908  

 3rd Qu.:0.8460   3rd Qu.:0.0540000   3rd Qu.: 9.000   3rd Qu.:0.2470  

 Max.   :0.9980   Max.   :0.9760000   Max.   :11.000   Max.   :0.9690  

    loudness            mode         speechiness          tempo       

 Min.   :-33.097   Min.   :0.0000   Min.   :0.02310   Min.   : 47.86  

 1st Qu.: -8.394   1st Qu.:0.0000   1st Qu.:0.03750   1st Qu.:100.19  

 Median : -6.248   Median :1.0000   Median :0.05490   Median :121.43  

 Mean   : -7.086   Mean   :0.6123   Mean   :0.09266   Mean   :121.60  

 3rd Qu.: -4.746   3rd Qu.:1.0000   3rd Qu.:0.10800   3rd Qu.:137.85  

 Max.   : -0.307   Max.   :1.0000   Max.   :0.81600   Max.   :219.33  

 time_signature     valence           target        song_title       

 Min.   :1.000   Min.   :0.0348   Min.   :0.0000   Length:2017       

 1st Qu.:4.000   1st Qu.:0.2950   1st Qu.:0.0000   Class :character  

 Median :4.000   Median :0.4920   Median :1.0000   Mode  :character  

 Mean   :3.968   Mean   :0.4968   Mean   :0.5057                     

 3rd Qu.:4.000   3rd Qu.:0.6910   3rd Qu.:1.0000                     

 Max.   :5.000   Max.   :0.9920   Max.   :1.0000                     

    artist         

 Length:2017       

 Class :character  

 Mode  :character  

                                         

[1] "Missing values in each column:"

            ...1     acousticness     danceability      duration_ms 

               0                0                0                0 

          energy instrumentalness              key         liveness 

               0                0                0                0 

        loudness             mode      speechiness            tempo 

               0                0                0                0 

  time_signature          valence           target       song_title 

               0                0                0                0 

          artist 

               0 

[1] "🎵 Our musical features:"

 [1] "...1"             "acousticness"     "danceability"     "duration_ms"     

 [5] "energy"           "instrumentalness" "key"              "liveness"        

 [9] "loudness"         "mode"             "speechiness"      "tempo"           

[13] "time_signature"   "valence"          "target"           "song_title"      

[17] "artist"          

The above output shows that:

  • The dataset has 2,017 songs and 17 features, including both audio metrics and song metadata.
  • No missing values were detected, which means that the data is clean.
  • Features include acousticness, energy, tempo, valence, song_title, and artist.

Also Read: R vs Python Data Science: The Difference

Step 5: Clean and Enrich the Data

Before analysis, we need to remove duplicates and create new categorical variables for better insight. This step makes the dataset easier to interpret and visualize. Here is the code:

# Cleaning our data - like tuning instruments before a concert

# Remove any duplicate songs
spotify_clean <- spotify_data %>%
  distinct(song_title, artist, .keep_all = TRUE)

print(paste("Removed", nrow(spotify_data) - nrow(spotify_clean), "duplicate songs"))

# Create a popularity category for easier analysis
# Target = 1 means popular, Target = 0 means less popular
spotify_clean <- spotify_clean %>%
  mutate(
    popularity_label = ifelse(target == 1, "Popular", "Less Popular"),
    # Create energy level categories
    energy_level = case_when(
      energy < 0.3 ~ "Low Energy",
      energy < 0.7 ~ "Medium Energy",
      TRUE ~ "High Energy"
    ),
    # Create danceability categories
    dance_level = case_when(
      danceability < 0.3 ~ "Not Danceable",
      danceability < 0.7 ~ "Moderately Danceable",
      TRUE ~ "Very Danceable"
    )
  )

print("Data cleaning complete!")

The output for the above code is:

[1] "Removed 35 duplicate songs."

[1] "Data cleaning complete!"

This means:

  • 35 duplicate songs were removed from the dataset.
  • New columns like popularity_label, energy_level, and dance_level were successfully added.
  • The data is now cleaned and categorized for easier analysis.
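As an optional sanity check (not part of the original code), you can tabulate the new categorical columns to confirm the cut-offs split the data sensibly:

# Quick counts for each derived category
table(spotify_clean$energy_level)
table(spotify_clean$dance_level)
table(spotify_clean$popularity_label)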

Step 6: Analyze Song Popularity Trends

In this step, we will check how popular and less popular songs differ in terms of key musical features. We also visualize the overall distribution of popularity in the dataset. The code for this section is:

# Let's explore our musical universe!

# 1. Basic statistics about popular vs less popular songs
popularity_summary <- spotify_clean %>%
  group_by(popularity_label) %>%
  summarise(
    count = n(),
    avg_energy = round(mean(energy), 3),
    avg_danceability = round(mean(danceability), 3),
    avg_valence = round(mean(valence), 3),
    avg_loudness = round(mean(loudness), 3)
  )

print("Popularity Analysis:")
knitr::kable(popularity_summary)

# 2. Distribution of song popularity
ggplot(spotify_clean, aes(x = popularity_label)) +
  geom_bar(fill = c("#FF6B35", "#004E64"), alpha = 0.8) +
  labs(title = "Distribution of Song Popularity",
       subtitle = "How many popular vs less popular songs do we have?",
       x = "Popularity Category",
       y = "Number of Songs") +
  theme_minimal() +
  theme(plot.title = element_text(size = 16, hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

The output for this section will be as follows:

[1] " Popularity Analysis:"

popularity_label

count

avg_energy

avg_danceability

avg_valence

avg_loudness

Less Popular 989 0.673 0.589 0.469 -6.813
Popular 993 0.693 0.648 0.528 -7.312

 

The above code also gives us a graph:

This output shows:

  • Balanced classes: Popular (993) vs Less Popular (989), great for modeling without heavy rebalancing.
  • “Feel‑good” bias: popular songs score slightly higher on danceability, energy, and valence, suggesting they’re more upbeat and danceable.
  • Loudness isn’t the driver: Popular tracks are actually a bit quieter on average (−7.31 dB vs −6.81 dB), so being louder doesn’t map directly to popularity here.
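The group means differ only slightly, so it is worth checking whether those gaps are statistically meaningful. Here is a minimal sketch using base R's Welch t-test, added for illustration rather than taken from the original notebook:

# Test whether the danceability and valence gaps between groups are significant
t.test(danceability ~ popularity_label, data = spotify_clean)
t.test(valence ~ popularity_label, data = spotify_clean)

A small p-value here suggests the difference between popular and less popular songs is real rather than sampling noise.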

Here’s Something You Should Know: Data Preprocessing in Machine Learning: 11 Key Steps You Must Know!

Step 7: Visualize the Sound of Popularity

This step gives us more insights into how energy, danceability, valence, and tempo vary across popular and less popular songs. Visualizations help us uncover subtle patterns and trends in musical features. The code for this step is:

# Let's visualize the musical DNA of popular songs!

# 1. Energy vs Danceability scatter plot
p1 <- ggplot(spotify_clean, aes(x = energy, y = danceability, color = popularity_label)) +
  geom_point(alpha = 0.6, size = 2) +
  scale_color_manual(values = c("#FF6B35", "#004E64")) +
  labs(title = "🕺 Energy vs Danceability",
       subtitle = "Do popular songs have more energy and danceability?",
       x = "Energy Level",
       y = "Danceability",
       color = "Popularity") +
  theme_minimal()

print(p1)

# 2. Valence (Musical Happiness) distribution
p2 <- ggplot(spotify_clean, aes(x = valence, fill = popularity_label)) +
  geom_histogram(alpha = 0.7, bins = 30, position = "identity") +
  scale_fill_manual(values = c("#FF6B35", "#004E64")) +
  labs(title = "Musical Happiness Distribution",
       subtitle = "Are popular songs happier?",
       x = "Valence (0 = Sad, 1 = Happy)",
       y = "Number of Songs",
       fill = "Popularity") +
  theme_minimal()

print(p2)

# 3. Tempo analysis
p3 <- ggplot(spotify_clean, aes(x = popularity_label, y = tempo, fill = popularity_label)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = c("#FF6B35", "#004E64")) +
  labs(title = "Tempo Comparison",
       subtitle = "Is there an optimal tempo for popularity?",
       x = "Popularity Category",
       y = "Tempo (BPM)",
       fill = "Popularity") +
  theme_minimal()

print(p3)

The above code will give us three graphs, each showing different characteristics:

The energy vs danceability scatter plot shows:

  • Popular songs generally cluster in high energy and high danceability zones.
  • Less popular tracks are more spread out across low-to-mid ranges.
  • There's a visible positive trend; more energy often aligns with more danceability.

The valence (musical happiness) distribution shows:

  • Popular songs lean more toward the higher valence (happier) end of the scale.
  • Less popular songs show more spread across low-to-mid valence, suggesting a mix of moods.
  • Happier songs are slightly more likely to be popular.

The tempo box plot shows:

  • Both groups show similar median tempos, around 120–125 BPM.
  • Popular songs have more tempo outliers, showing variety.
  • There’s no clear “optimal” tempo, but both fall in the danceable BPM range.
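To back the box-plot reading with numbers, a short dplyr summary (added here for illustration) shows how close the group medians actually sit:

# Median and spread of tempo per popularity group
spotify_clean %>%
  group_by(popularity_label) %>%
  summarise(median_tempo = median(tempo),
            iqr_tempo = IQR(tempo))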

You Can’t Miss This: What’s Special About Machine Learning?

Step 8: Measure Relationships with a Correlation Matrix

In this step, we quantify how musical features move together and which ones line up most with popularity (target). We’ll build and plot a correlation matrix, then list the features most (positively/negatively) associated with popularity.

# Let's find relationships between musical features

# Select numerical features for correlation
numerical_features <- spotify_clean %>%
  select(acousticness, danceability, energy, instrumentalness,
         liveness, loudness, speechiness, tempo, valence, target)

# Create correlation matrix
correlation_matrix <- cor(numerical_features)

# Visualize correlations
corrplot(correlation_matrix,
         method = "color",
         type = "upper",
         tl.cex = 0.8,
         tl.col = "black",
         title = "Musical Features Correlation Matrix",
         mar = c(0,0,2,0))

# Find strongest correlations with popularity (target)
target_correlations <- correlation_matrix[,"target"] %>%
  sort(decreasing = TRUE)

print("Features most correlated with popularity:")
print(round(target_correlations, 3))

The output of the above code is:

[1] "Features most correlated with popularity:"

Feature

Value

target 1.000
danceability 0.183
speechiness 0.162
instrumentalness 0.150
valence 0.119
energy 0.048
tempo 0.034
liveness 0.024
loudness -0.066
acousticness -0.134

The correlation heatmap drawn by corrplot shows that:

  • Popularity (target) has only weak correlations with all features; nothing alone strongly predicts it.
  • Energy ↔ Loudness are strongly positively correlated, while acousticness is strongly negative with both (louder, energetic songs are less acoustic).
  • Danceability and valence show a moderate positive link; happier songs tend to be more danceable.
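To read any single relationship off the matrix as a plain number, you can compute the pairwise correlations directly; this spot-check is an optional addition:

# Spot-check the strongest pairwise relationships
cor(spotify_clean$energy, spotify_clean$loudness)       # expect a strong positive value
cor(spotify_clean$energy, spotify_clean$acousticness)   # expect a strong negative value
cor(spotify_clean$danceability, spotify_clean$valence)  # expect a moderate positive value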


Step 9: Build Deeper Insights with Rich Visuals

This step focuses on advanced, storytelling-style visualizations to highlight differences in song features and reveal which artists have the most consistent hit rate. Here’s the code:

# Create beautiful, insightful visualizations

# 1. Multi-feature comparison using violin plots
features_long <- spotify_clean %>%
  select(popularity_label, energy, danceability, valence, acousticness) %>%
  reshape2::melt(id.vars = "popularity_label")

ggplot(features_long, aes(x = popularity_label, y = value, fill = popularity_label)) +
  geom_violin(alpha = 0.7) +
  facet_wrap(~variable, scales = "free_y") +
  scale_fill_manual(values = c("#FF6B35", "#004E64")) +
  labs(title = "Musical Feature Comparison",
       subtitle = "How do different features vary between popular and less popular songs?",
       x = "Popularity Category",
       y = "Feature Value",
       fill = "Popularity") +
  theme_minimal() +
  theme(strip.text = element_text(size = 12, face = "bold"))

# 2. Top artists analysis
top_artists <- spotify_clean %>%
  group_by(artist) %>%
  summarise(
    total_songs = n(),
    popular_songs = sum(target),
    popularity_rate = round(popular_songs / total_songs * 100, 1)
  ) %>%
  filter(total_songs >= 3) %>%  # Artists with at least 3 songs
  arrange(desc(popularity_rate)) %>%
  head(10)

ggplot(top_artists, aes(x = reorder(artist, popularity_rate), y = popularity_rate)) +
  geom_col(fill = "#004E64", alpha = 0.8) +
  coord_flip() +
  labs(title = "Top 10 Artists by Success Rate",
       subtitle = "Artists with highest percentage of popular songs (min 3 songs)",
       x = "Artist",
       y = "Success Rate (%)") +
  theme_minimal()

The above code will give us two graphs that explain the following:

The violin plots show that:

  • Popular songs tend to have higher energy, danceability, and valence compared to less popular ones.
  • Acousticness is generally lower in popular songs, suggesting a preference for more produced or electronic sounds.
  • These patterns highlight the musical qualities that may contribute to a track's popularity.

The top-artists chart shows that:

  • All listed artists have a 100% success rate, meaning every one of their songs in the dataset is marked as popular.
  • Artists like Blood Orange, A$AP Rocky, and Beach House consistently released tracks that performed well.
  • Only artists with at least 3 songs were considered to ensure meaningful success rates.


Step 10: Train a Logistic Regression to Predict Popularity

Here we split the data into train/test sets, fit a logistic regression using key audio features, evaluate its accuracy, and inspect which features are most statistically significant. The code for this section is:

# Let's build a simple model to predict song popularity!

# Prepare data for modeling
set.seed(123)  # For reproducible results

# Split data into training and testing sets
train_index <- createDataPartition(spotify_clean$target, p = 0.8, list = FALSE)
train_data <- spotify_clean[train_index, ]
test_data <- spotify_clean[-train_index, ]

print(paste("Training songs:", nrow(train_data)))
print(paste("Testing songs:", nrow(test_data)))

# Select features for our model
model_features <- c("acousticness", "danceability", "energy", "instrumentalness",
                   "liveness", "loudness", "speechiness", "tempo", "valence")

# Build a simple logistic regression model
model <- glm(target ~ .,
             data = train_data[, c("target", model_features)],
             family = "binomial")

# Make predictions
predictions <- predict(model, test_data, type = "response")
predicted_classes <- ifelse(predictions > 0.5, 1, 0)

# Calculate accuracy
accuracy <- mean(predicted_classes == test_data$target)
print(paste("Model Accuracy:", round(accuracy * 100, 2), "%"))

# Feature importance
feature_importance <- summary(model)$coefficients[,4]  # p-values
important_features <- sort(feature_importance[2:length(feature_importance)])
print("Most important features (lowest p-values):")
print(round(important_features, 4))

The output for this section is as follows:

[1] "Training songs: 1586"

[1] "Testing songs: 396"

[1] "Model Accuracy: 64.14 %"

[1] "Most important features (lowest p-values):"

Feature

Value

speechiness 0.0000
instrumentalness 0.0000
loudness 0.0000
acousticness 0.0000
danceability 0.0000
valence 0.0006
tempo 0.0827
energy 0.0854
liveness 0.6461

Model Summary:

  • Accuracy: The model predicts song popularity with 64.14% accuracy, a decent baseline for logistic regression.
  • Top predictive features (based on lowest p-values):
    • Speechiness, instrumentalness, loudness, and acousticness are strongly significant (p-value ≈ 0).
    • Valence is also significant; it reflects musical positivity.
    • Liveness seems not useful (high p-value ≈ 0.64).
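Accuracy alone can hide class-specific errors. Since caret is already loaded, a confusion matrix gives a fuller picture (sensitivity, specificity, and per-class counts). This evaluation step is an addition to the original notebook:

# Fuller evaluation of the logistic model on the test set
caret::confusionMatrix(
  factor(predicted_classes, levels = c(0, 1)),
  factor(test_data$target, levels = c(0, 1)),
  positive = "1"
)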

Also Read: Data Science Project Ideas for Beginners and Python IDEs for Data Science and Machine Learning

Step 11: Upgrade Your Model with Random Forest

Now that we’ve tested a basic logistic regression, in this step, we will build a more powerful model. The Random Forest algorithm handles complex relationships and feature interactions better, which often leads to higher accuracy. Here’s the code:

# Install if not already installed
if (!require(randomForest)) install.packages("randomForest")
library(randomForest)

# Train a Random Forest model
set.seed(123)
rf_model <- randomForest(as.factor(target) ~ .,
                        data = train_data[, c("target", model_features)],
                        ntree = 100,
                        importance = TRUE)

# Make predictions
rf_predictions <- predict(rf_model, test_data, type = "prob")[,2]
rf_classes <- ifelse(rf_predictions > 0.5, 1, 0)

# Accuracy of the Random Forest model
rf_accuracy <- mean(rf_classes == test_data$target)
print(paste("Random Forest Accuracy:", round(rf_accuracy * 100, 2), "%"))

# Feature importance scores
importance(rf_model)

The output for this section is:

Loading required package: randomForest

Warning message in library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
“there is no package called ‘randomForest’”

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

randomForest 4.7-1.2

Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:ggplot2’:

    margin

The following object is masked from ‘package:dplyr’:

    combine

[1] "Random Forest Accuracy: 76.01 %"

 

| Feature | 0 | 1 | MeanDecreaseAccuracy | MeanDecreaseGini |
| --- | --- | --- | --- | --- |
| acousticness | 11.913743 | -0.2209466 | 11.076556 | 74.99875 |
| danceability | 14.352573 | 8.3464287 | 16.966182 | 93.23475 |
| energy | 14.201425 | 0.5790907 | 14.407994 | 82.00614 |
| instrumentalness | 26.440829 | 21.0849145 | 31.723116 | 121.86330 |
| liveness | 1.832211 | 3.1778978 | 3.360536 | 60.18430 |
| loudness | 15.432559 | 6.9422589 | 16.070757 | 106.90894 |
| speechiness | 11.330946 | 15.0804005 | 17.905247 | 101.84949 |
| tempo | 5.465601 | 2.3359464 | 5.790140 | 69.96652 |
| valence | 13.130002 | 3.1291881 | 12.043707 | 81.33633 |

The Random Forest lifts accuracy to 76.01%, a solid gain over the logistic regression’s 64.14%, and its importance scores tell us which musical traits matter most.

Key takeaways:

  • Accuracy ↑: 76.01% vs 64.14% (logistic). Random Forest captures non‑linear relationships better.
  • Most influential features: Instrumentalness, loudness, speechiness, danceability (highest MeanDecreaseGini / MeanDecreaseAccuracy).
  • Weaker signals: Liveness and tempo contribute the least to predictions here.
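The importance table is easier to scan as a chart, and the randomForest package ships a built-in plot for exactly this. An optional one-liner:

# Visualize variable importance from the fitted forest
varImpPlot(rf_model, main = "Random Forest Feature Importance")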

Conclusion

In this Spotify Music Data Analysis project, we used R in Google Colab to look into what makes a song popular by analyzing audio features such as energy, danceability, loudness, and speechiness. After cleaning and visualizing the dataset, we trained two models: a Logistic Regression and a more powerful Random Forest classifier.

The Random Forest model outperformed Logistic Regression, reaching 76.01% accuracy versus 64.14%, and identified instrumentalness, loudness, and speechiness as key predictors of popularity.
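Since the 76.01% figure comes from a single train/test split, a natural next step is k-fold cross-validation for a more stable estimate. Here is a minimal sketch using caret's train() with the same features; this is an assumed extension, and exact numbers will vary with the seed and folds:

# 5-fold cross-validated random forest (sketch)
set.seed(123)
cv_model <- caret::train(
  as.factor(target) ~ .,
  data = spotify_clean[, c("target", model_features)],
  method = "rf",
  trControl = caret::trainControl(method = "cv", number = 5)
)
print(cv_model)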


Colab Link:
https://colab.research.google.com/drive/17HoemeiOPtDJgoWqNVtxOia5jRxX5fEl#scrollTo=06bFYquyv76C

Frequently Asked Questions (FAQs)

1. What is the goal of the Spotify Music Data Analysis project in R?

2. Which tools and libraries are used in this music analysis project?

3. Can I use other algorithms to improve song popularity prediction?

4. Is this project suitable for R beginners?

5. What are some other interesting R projects like this one?

Rohit Sharma

803 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
