Spotify Music Data Analysis Project in R
By Rohit Sharma
Updated on Jul 29, 2025 | 19 min read | 1.36K+ views
What makes a song popular? Well, in this Spotify Music Data Analysis project using R, we will analyze some real Spotify music data to decode the musical DNA behind various hit songs.
We will peek into over 2,000 tracks, examining features like energy, danceability, valence, and acousticness through statistical analysis and compelling visualizations.
This blog will explain every step from data cleaning to building a Random Forest model that achieves 76% accuracy in predicting song popularity. You'll learn essential data science skills, including correlation analysis, feature engineering, and machine learning.
Before starting the Spotify Music Data Analysis project in R, it’s important to know some key concepts and skills. These will help you use the dataset and identify relevant patterns:
Skill/Concept | What You Should Know
Basic R Programming | Use variables, functions, and data frames confidently in R.
Tidyverse Usage | Work with dplyr, ggplot2, and tidyr for data manipulation and visualization.
Reading and Exploring Data | Use functions like read_csv(), head(), str(), and summary() to load and inspect data.
Spotify Audio Features | Understand terms like danceability, energy, valence, and tempo.
Data Cleaning Techniques | Handle missing values, rename columns, filter rows, and change data types as needed.
Data Visualization | Create plots (e.g., histograms, boxplots, scatter plots) using ggplot2.
Working with Dates | Use the lubridate package to parse and manipulate date columns.
Grouping and Aggregation | Apply group_by() and summarise() to analyze trends by artist, genre, or year.
Google Colab for R | Run R code in Colab by switching the runtime and uploading your data correctly.
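If a couple of the idioms listed above feel rusty, here is a minimal warm-up sketch that exercises them on a small hypothetical data frame (the artist names and dates below are made up for illustration, not taken from the Spotify dataset):

# Warm-up: tidyverse grouping/aggregation plus lubridate date parsing
library(dplyr)
library(lubridate)

tracks <- tibble::tibble(
  artist   = c("Artist A", "Artist A", "Artist B"),
  released = c("2020-01-15", "2021-06-01", "2019-11-20"),
  energy   = c(0.8, 0.6, 0.4)
)

tracks %>%
  mutate(release_year = year(ymd(released))) %>%  # parse dates, extract the year
  group_by(artist) %>%                            # group rows by artist
  summarise(avg_energy = mean(energy))            # aggregate within each group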
This music data analysis project requires the following:
Aspect | Details
Estimated Time | 2–3 hours for a basic walkthrough; 4–5 hours for in-depth analysis and polish.
Skill Level Required | Beginner to Intermediate. This project is suitable for learners with basic R and data handling knowledge.
Learning Curve | Low to Moderate. The concepts are straightforward, with some exploration required.
Time Commitment | Short-term. The entire project can be completed in one or two sittings.
To run this music data analysis project, we need certain tools, models, and R libraries, which are listed in the table below.
Category | Details
Platform | Google Colab (with R runtime)
Programming Language | R
Machine Learning Models | Logistic Regression (binary classification); Random Forest (improved accuracy and feature importance)
Key R Libraries | dplyr (data manipulation); ggplot2 (data visualization); readr (reading CSV files); caret (model training and evaluation); randomForest (random forest implementation)
Supporting Libraries | corrplot and GGally (correlation and pair plots); plotly (interactive visualizations); viridis (color palettes); reshape2 (data reshaping); knitr (report formatting)
With the prerequisites covered, we can start building the project. This section walks through each step of the analysis, presenting the R code along with its output.
To begin working with R in Google Colab, you'll first need to switch the notebook's runtime from Python to R. This setup allows you to run R code seamlessly within the Colab environment.
Steps to switch to R:
1. Open the Runtime menu and click Change runtime type.
2. In the dialog, select R as the runtime type.
3. Click Save; the notebook will now execute R code.
Before starting the analysis of the data, we need to install and load all the required R packages. This step ensures that the environment is ready for data cleaning, visualization, and modeling. The code for this step is:
# Installing required packages (run once)
# These are like tools we need for our analysis
install.packages(c("dplyr", "ggplot2", "readr", "corrplot",
"plotly", "GGally", "caret", "viridis",
"reshape2", "knitr"))
# Loading libraries (run every time you start)
# Think of this as opening your toolbox
library(dplyr) # For data manipulation - like Excel functions
library(ggplot2) # For creating beautiful charts
library(readr) # For reading CSV files
library(corrplot) # For correlation plots
library(plotly) # For interactive charts
library(GGally) # For advanced plotting
library(caret) # For machine learning
library(viridis) # For beautiful color schemes
library(reshape2) # For data reshaping
library(knitr) # For nice table formatting
# Let's confirm everything loaded successfully
print("SpotiTunes Analytics Setup Complete!")
Running this code prints output confirming that all packages installed and loaded successfully:
Installing packages into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘shape’, ‘future.apply’, ‘numDeriv’, ‘progressr’, ‘SQUAREM’, ‘diagram’, ‘lava’, ‘prodlim’, ‘patchwork’, ‘proxy’, ‘iterators’, ‘clock’, ‘gower’, ‘hardhat’, ‘ipred’, ‘sparsevctrs’, ‘timeDate’, ‘lazyeval’, ‘crosstalk’, ‘ggstats’, ‘S7’, ‘e1071’, ‘foreach’, ‘ModelMetrics’, ‘plyr’, ‘pROC’, ‘recipes’, ‘gridExtra’
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:
    filter, lag

The following objects are masked from ‘package:base’:
    intersect, setdiff, setequal, union

corrplot 0.95 loaded

Attaching package: ‘plotly’

The following object is masked from ‘package:ggplot2’:
    last_plot

The following object is masked from ‘package:stats’:
    filter

The following object is masked from ‘package:graphics’:
    layout

Loading required package: lattice
Loading required package: viridisLite

[1] "SpotiTunes Analytics Setup Complete!"
As the setup is now ready, it's time to upload the dataset and load it into R. This step allows us to begin exploring and analyzing the contents of the Spotify music data. The code used in this process is:
# Upload your CSV file to Colab first, then read it
# In Colab, click the file icon on the left, then upload your Spotify data.csv
# Reading the dataset
# This loads your Spotify data into R's memory
spotify_data <- read_csv("Spotify data.csv")
# Let's take our first look at the data
print("Dataset loaded successfully!")
print(paste("Number of songs:", nrow(spotify_data)))
print(paste("Number of features:", ncol(spotify_data)))
# Display first few rows to understand our data
head(spotify_data)
After successful execution, the code gives us the following output:
New names:
• `` -> `...1`
Rows: 2017 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): song_title, artist
dbl (15): ...1, acousticness, danceability, duration_ms, energy, instrumenta...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
[1] "Dataset loaded successfully!"
[1] "Number of songs: 2017"
[1] "Number of features: 17"
This output confirms that the dataset loaded correctly, with 2,017 songs and 17 features, and the head() call previews the first few rows so we can see what the data looks like.
Before starting our analysis or modeling, it's important to check the dataset. This step will help us understand the structure, identify missing values, and get familiar with the available features. The code for this step is:
# Getting to know our dataset better
# This is like getting familiar with a new playlist
# Basic information about the dataset
str(spotify_data) # Shows data types and structure
summary(spotify_data) # Shows statistical summaries
# Check for missing values
# Missing data can affect our analysis
missing_values <- colSums(is.na(spotify_data))
print("Missing values in each column:")
print(missing_values)
# Let's look at the column names to understand what we have
print("Our musical features:")
print(colnames(spotify_data))
The output for this code is:
spc_tbl_ [2,017 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ ...1             : num [1:2017] 0 1 2 3 4 5 6 7 8 9 ...
$ acousticness     : num [1:2017] 0.0102 0.199 0.0344 0.604 0.18 0.00479 0.0145 0.0202 0.0481 0.00208 ...
$ danceability     : num [1:2017] 0.833 0.743 0.838 0.494 0.678 0.804 0.739 0.266 0.603 0.836 ...
$ duration_ms      : num [1:2017] 204600 326933 185707 199413 392893 ...
$ energy           : num [1:2017] 0.434 0.359 0.412 0.338 0.561 0.56 0.472 0.348 0.944 0.603 ...
$ instrumentalness : num [1:2017] 2.19e-02 6.11e-03 2.34e-04 5.10e-01 5.12e-01 0.00 7.27e-06 6.64e-01 0.00 0.00 ...
$ key              : num [1:2017] 2 1 2 5 5 8 1 10 11 7 ...
$ liveness         : num [1:2017] 0.165 0.137 0.159 0.0922 0.439 0.164 0.207 0.16 0.342 0.571 ...
$ loudness         : num [1:2017] -8.79 -10.4 -7.15 -15.24 -11.65 ...
$ mode             : num [1:2017] 1 1 1 1 0 1 1 0 0 1 ...
$ speechiness      : num [1:2017] 0.431 0.0794 0.289 0.0261 0.0694 0.185 0.156 0.0371 0.347 0.237 ...
$ tempo            : num [1:2017] 150.1 160.1 75 86.5 174 ...
$ time_signature   : num [1:2017] 4 4 4 4 4 4 4 4 4 4 ...
$ valence          : num [1:2017] 0.286 0.588 0.173 0.23 0.904 0.264 0.308 0.393 0.398 0.386 ...
$ target           : num [1:2017] 1 1 1 1 1 1 1 1 1 1 ...
$ song_title       : chr [1:2017] "Mask Off" "Redbone" "Xanny Family" "Master Of None" ...
$ artist           : chr [1:2017] "Future" "Childish Gambino" "Future" "Beach House" ...
(A "spec" attribute records the column specification: every column is read as col_double() except song_title and artist, which are col_character().)

The summary() statistics for the numeric columns:

Feature | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max.
...1 | 0 | 504 | 1008 | 1008 | 1512 | 2016
acousticness | 2.840e-06 | 9.630e-03 | 6.330e-02 | 1.876e-01 | 2.650e-01 | 9.950e-01
danceability | 0.1220 | 0.5140 | 0.6310 | 0.6184 | 0.7380 | 0.9840
duration_ms | 16042 | 200015 | 229261 | 246306 | 270333 | 1004627
energy | 0.0148 | 0.5630 | 0.7150 | 0.6816 | 0.8460 | 0.9980
instrumentalness | 0.0000000 | 0.0000000 | 0.0000762 | 0.1332855 | 0.0540000 | 0.9760000
key | 0.000 | 2.000 | 6.000 | 5.343 | 9.000 | 11.000
liveness | 0.0188 | 0.0923 | 0.1270 | 0.1908 | 0.2470 | 0.9690
loudness | -33.097 | -8.394 | -6.248 | -7.086 | -4.746 | -0.307
mode | 0.0000 | 0.0000 | 1.0000 | 0.6123 | 1.0000 | 1.0000
speechiness | 0.02310 | 0.03750 | 0.05490 | 0.09266 | 0.10800 | 0.81600
tempo | 47.86 | 100.19 | 121.43 | 121.60 | 137.85 | 219.33
time_signature | 1.000 | 4.000 | 4.000 | 3.968 | 4.000 | 5.000
valence | 0.0348 | 0.2950 | 0.4920 | 0.4968 | 0.6910 | 0.9920
target | 0.0000 | 0.0000 | 1.0000 | 0.5057 | 1.0000 | 1.0000

(song_title and artist are character columns of length 2,017.)

The missing-value check and column listing:

[1] "Missing values in each column:"
Every column reports 0 missing values.

[1] "Our musical features:"
[1] "...1"            "acousticness"     "danceability"    "duration_ms"
[5] "energy"          "instrumentalness" "key"             "liveness"
[9] "loudness"        "mode"             "speechiness"     "tempo"
[13] "time_signature"  "valence"          "target"          "song_title"
[17] "artist"
The above output shows that the dataset holds 2,017 songs and 17 columns, that every column except song_title and artist is numeric, and that no column contains missing values. The unnamed first column (...1) is simply a row index carried over from the CSV.
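Optionally, we can tidy this up right away. The snippet below is a small optional sketch: it re-reads the file with show_col_types = FALSE to silence readr's column-type message and gives the unnamed index column a clearer name (track_id is a label we choose here, not something in the dataset):

# Optional tidy-up: quiet the readr message and rename the index column
spotify_data <- read_csv("Spotify data.csv", show_col_types = FALSE) %>%
  rename(track_id = `...1`)   # "track_id" is our own name for the row index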
Before analysis, we need to remove duplicates and create new categorical variables for better insight. This step makes the dataset easier to interpret and visualize. Here is the code:
# Cleaning our data - like tuning instruments before a concert
# Remove any duplicate songs
spotify_clean <- spotify_data %>%
distinct(song_title, artist, .keep_all = TRUE)
print(paste("Removed", nrow(spotify_data) - nrow(spotify_clean), "duplicate songs"))
# Create a popularity category for easier analysis
# Target = 1 means popular, Target = 0 means less popular
spotify_clean <- spotify_clean %>%
mutate(
popularity_label = ifelse(target == 1, "Popular", "Less Popular"),
# Create energy level categories
energy_level = case_when(
energy < 0.3 ~ "Low Energy",
energy < 0.7 ~ "Medium Energy",
TRUE ~ "High Energy"
),
# Create danceability categories
dance_level = case_when(
danceability < 0.3 ~ "Not Danceable",
danceability < 0.7 ~ "Moderately Danceable",
TRUE ~ "Very Danceable"
)
)
print("Data cleaning complete!")
The output for the above code is:
[1] "Removed 35 duplicate songs." [1] "Data cleaning complete!" |
This means that 35 duplicate tracks were dropped, leaving 1,982 unique songs, and that three new categorical columns were created: popularity_label, energy_level, and dance_level.
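As a quick sanity check on the new columns, we can count how many songs land in each category using base R's table():

# How many songs fall into each of the new categories?
table(spotify_clean$energy_level)
table(spotify_clean$dance_level)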
In this step, we will check how popular and less popular songs differ in terms of key musical features. We also visualize the overall distribution of popularity in the dataset. The code for this section is:
# Let's explore our musical universe!
# 1. Basic statistics about popular vs less popular songs
popularity_summary <- spotify_clean %>%
group_by(popularity_label) %>%
summarise(
count = n(),
avg_energy = round(mean(energy), 3),
avg_danceability = round(mean(danceability), 3),
avg_valence = round(mean(valence), 3),
avg_loudness = round(mean(loudness), 3)
)
print("Popularity Analysis:")
knitr::kable(popularity_summary)
# 2. Distribution of song popularity
ggplot(spotify_clean, aes(x = popularity_label)) +
geom_bar(fill = c("#FF6B35", "#004E64"), alpha = 0.8) +
labs(title = "Distribution of Song Popularity",
subtitle = "How many popular vs less popular songs do we have?",
x = "Popularity Category",
y = "Number of Songs") +
theme_minimal() +
theme(plot.title = element_text(size = 16, hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
The output for this section will be as follows:
[1] " Popularity Analysis:"
popularity_label |
count |
avg_energy |
avg_danceability |
avg_valence |
avg_loudness |
Less Popular | 989 | 0.673 | 0.589 | 0.469 | -6.813 |
Popular | 993 | 0.693 | 0.648 | 0.528 | -7.312 |
The code also produces a bar chart showing how many songs fall into each popularity category.
This output shows that the two classes are nearly balanced (993 popular vs. 989 less popular songs), and that popular songs score slightly higher on average for energy, danceability, and valence, while being marginally quieter.
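To back the bar chart with exact numbers, a one-line check reports the class balance as percentages:

# Share of popular vs. less popular songs, in percent
round(prop.table(table(spotify_clean$popularity_label)) * 100, 1)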
This step gives us more insights into how energy, danceability, valence, and tempo vary across popular and less popular songs. Visualizations help us uncover subtle patterns and trends in musical features. The code for this step is:
# Let's visualize the musical DNA of popular songs!
# 1. Energy vs Danceability scatter plot
p1 <- ggplot(spotify_clean, aes(x = energy, y = danceability, color = popularity_label)) +
geom_point(alpha = 0.6, size = 2) +
scale_color_manual(values = c("#FF6B35", "#004E64")) +
labs(title = "🕺 Energy vs Danceability",
subtitle = "Do popular songs have more energy and danceability?",
x = "Energy Level",
y = "Danceability",
color = "Popularity") +
theme_minimal()
print(p1)
# 2. Valence (Musical Happiness) distribution
p2 <- ggplot(spotify_clean, aes(x = valence, fill = popularity_label)) +
geom_histogram(alpha = 0.7, bins = 30, position = "identity") +
scale_fill_manual(values = c("#FF6B35", "#004E64")) +
labs(title = "Musical Happiness Distribution",
subtitle = "Are popular songs happier?",
x = "Valence (0 = Sad, 1 = Happy)",
y = "Number of Songs",
fill = "Popularity") +
theme_minimal()
print(p2)
# 3. Tempo analysis
p3 <- ggplot(spotify_clean, aes(x = popularity_label, y = tempo, fill = popularity_label)) +
geom_boxplot(alpha = 0.7) +
scale_fill_manual(values = c("#FF6B35", "#004E64")) +
labs(title = "Tempo Comparison",
subtitle = "Is there an optimal tempo for popularity?",
x = "Popularity Category",
y = "Tempo (BPM)",
fill = "Popularity") +
theme_minimal()
print(p3)
The above code will give us three graphs, each showing different characteristics:
Graph 1 (Energy vs. Danceability): the two groups overlap heavily, with popular songs skewing slightly toward higher danceability.
Graph 2 (Musical Happiness Distribution): popular songs lean a little happier, consistent with their higher average valence (0.528 vs. 0.469).
Graph 3 (Tempo Comparison): the boxplots for both groups look nearly identical, suggesting tempo alone does little to separate popular from less popular songs.
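Visual impressions like these can also be checked statistically. The optional sketch below runs Welch two-sample t-tests (base R) to test whether mean danceability and valence genuinely differ between the two groups:

# Do the group means differ significantly? Welch two-sample t-tests
t.test(danceability ~ popularity_label, data = spotify_clean)
t.test(valence ~ popularity_label, data = spotify_clean)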
In this step, we quantify how musical features move together and which ones line up most with popularity (target). We’ll build and plot a correlation matrix, then list the features most (positively/negatively) associated with popularity.
# Let's find relationships between musical features
# Select numerical features for correlation
numerical_features <- spotify_clean %>%
select(acousticness, danceability, energy, instrumentalness,
liveness, loudness, speechiness, tempo, valence, target)
# Create correlation matrix
correlation_matrix <- cor(numerical_features)
# Visualize correlations
corrplot(correlation_matrix,
method = "color",
type = "upper",
tl.cex = 0.8,
tl.col = "black",
title = "Musical Features Correlation Matrix",
mar = c(0,0,2,0))
# Find strongest correlations with popularity (target)
target_correlations <- correlation_matrix[,"target"] %>%
sort(decreasing = TRUE)
print("Features most correlated with popularity:")
print(round(target_correlations, 3))
The output of the above code is:
[1] "Features most correlated with popularity:"
Feature | Correlation with target
target | 1.000
danceability | 0.183
speechiness | 0.162
instrumentalness | 0.150
valence | 0.119
energy | 0.048
tempo | 0.034
liveness | 0.024
loudness | -0.066
acousticness | -0.134
The code also renders the correlation matrix as a color-coded plot. Together with the ranked list above, it tells us that:
No single audio feature is strongly correlated with popularity: every correlation with target stays below 0.2 in absolute value.
Danceability (0.183), speechiness (0.162), and instrumentalness (0.150) show the strongest positive associations with popular songs.
Acousticness (-0.134) and loudness (-0.066) are the only features negatively associated with popularity.
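The matrix also captures how features relate to each other, not just to the target. Here is a small sketch that flattens the correlation matrix and ranks the strongest feature-to-feature pairs (each pair counted once):

# Rank feature pairs by the absolute strength of their correlation
cor_pairs <- as.data.frame(as.table(correlation_matrix))
names(cor_pairs) <- c("feature_1", "feature_2", "correlation")
cor_pairs %>%
  filter(as.character(feature_1) < as.character(feature_2)) %>%  # drop diagonal and mirrored duplicates
  arrange(desc(abs(correlation))) %>%
  head(5)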
This step focuses on advanced, storytelling-style visualizations to highlight differences in song features and reveal which artists have the most consistent hit rate. Here’s the code:
# Create beautiful, insightful visualizations
# 1. Multi-feature comparison using violin plots
features_long <- spotify_clean %>%
select(popularity_label, energy, danceability, valence, acousticness) %>%
reshape2::melt(id.vars = "popularity_label")
ggplot(features_long, aes(x = popularity_label, y = value, fill = popularity_label)) +
geom_violin(alpha = 0.7) +
facet_wrap(~variable, scales = "free_y") +
scale_fill_manual(values = c("#FF6B35", "#004E64")) +
labs(title = "Musical Feature Comparison",
subtitle = "How do different features vary between popular and less popular songs?",
x = "Popularity Category",
y = "Feature Value",
fill = "Popularity") +
theme_minimal() +
theme(strip.text = element_text(size = 12, face = "bold"))
# 2. Top artists analysis
top_artists <- spotify_clean %>%
group_by(artist) %>%
summarise(
total_songs = n(),
popular_songs = sum(target),
popularity_rate = round(popular_songs / total_songs * 100, 1)
) %>%
filter(total_songs >= 3) %>% # Artists with at least 3 songs
arrange(desc(popularity_rate)) %>%
head(10)
ggplot(top_artists, aes(x = reorder(artist, popularity_rate), y = popularity_rate)) +
geom_col(fill = "#004E64", alpha = 0.8) +
coord_flip() +
labs(title = "Top 10 Artists by Success Rate",
subtitle = "Artists with highest percentage of popular songs (min 3 songs)",
x = "Artist",
y = "Success Rate (%)") +
theme_minimal()
The above code will give us two graphs that explain the following:
The violin plots show that energy, danceability, valence, and acousticness differ only modestly in distribution across the two groups, with popular songs shifted toward higher danceability and valence.
The top-artists chart ranks the ten artists (with at least three songs in the dataset) by the percentage of their tracks labeled popular, highlighting who delivers hits most consistently.
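To see the exact numbers behind the chart, we can print the underlying table with knitr (already loaded during setup):

# Inspect the data behind the top-artists chart
knitr::kable(top_artists)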
Here we split the data into train/test sets, fit a logistic regression using key audio features, evaluate its accuracy, and inspect which features are most statistically significant. The code for this section is:
# Let's build a simple model to predict song popularity!
# Prepare data for modeling
set.seed(123) # For reproducible results
# Split data into training and testing sets
train_index <- createDataPartition(spotify_clean$target, p = 0.8, list = FALSE)
train_data <- spotify_clean[train_index, ]
test_data <- spotify_clean[-train_index, ]
print(paste("Training songs:", nrow(train_data)))
print(paste("Testing songs:", nrow(test_data)))
# Select features for our model
model_features <- c("acousticness", "danceability", "energy", "instrumentalness",
"liveness", "loudness", "speechiness", "tempo", "valence")
# Build a simple logistic regression model
model <- glm(target ~ .,
data = train_data[, c("target", model_features)],
family = "binomial")
# Make predictions
predictions <- predict(model, test_data, type = "response")
predicted_classes <- ifelse(predictions > 0.5, 1, 0)
# Calculate accuracy
accuracy <- mean(predicted_classes == test_data$target)
print(paste("Model Accuracy:", round(accuracy * 100, 2), "%"))
# Feature importance
feature_importance <- summary(model)$coefficients[,4] # p-values
important_features <- sort(feature_importance[2:length(feature_importance)])
print("Most important features (lowest p-values):")
print(round(important_features, 4))
The output for this section is as follows:
[1] "Training songs: 1586"
[1] "Testing songs: 396"
[1] "Model Accuracy: 64.14 %"
[1] "Most important features (lowest p-values):"
Feature | p-value
speechiness | 0.0000
instrumentalness | 0.0000
loudness | 0.0000
acousticness | 0.0000
danceability | 0.0000
valence | 0.0006
tempo | 0.0827
energy | 0.0854
liveness | 0.6461
Model Summary: the logistic regression reaches 64.14% accuracy on the test set. Speechiness, instrumentalness, loudness, acousticness, and danceability are highly significant predictors (p < 0.001), valence is also significant (p = 0.0006), while tempo, energy, and liveness do not pass the conventional 5% threshold.
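Accuracy alone can hide class-specific errors. Since caret is already loaded, this optional sketch computes the full confusion matrix, including sensitivity and specificity for the "popular" class:

# A fuller evaluation: confusion matrix with per-class metrics
caret::confusionMatrix(
  factor(predicted_classes, levels = c(0, 1)),
  factor(test_data$target,  levels = c(0, 1)),
  positive = "1"
)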
Now that we’ve tested a basic logistic regression, in this step, we will build a more powerful model. The Random Forest algorithm handles complex relationships and feature interactions better, which often leads to higher accuracy. Here’s the code:
# Install if not already installed
if (!require(randomForest)) install.packages("randomForest")
library(randomForest)
# Train a Random Forest model
set.seed(123)
rf_model <- randomForest(as.factor(target) ~ .,
data = train_data[, c("target", model_features)],
ntree = 100,
importance = TRUE)
# Make predictions
rf_predictions <- predict(rf_model, test_data, type = "prob")[,2]
rf_classes <- ifelse(rf_predictions > 0.5, 1, 0)
# Accuracy of the Random Forest model
rf_accuracy <- mean(rf_classes == test_data$target)
print(paste("Random Forest Accuracy:", round(rf_accuracy * 100, 2), "%"))
# Feature importance scores
importance(rf_model)
The output for this section is:
Loading required package: randomForest
Warning message in library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
“there is no package called ‘randomForest’”
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:ggplot2’:
    margin

The following object is masked from ‘package:dplyr’:
    combine

[1] "Random Forest Accuracy: 76.01 %"
The Random Forest propels the accuracy to 76.01%, which is a solid gain over the logistic regression’s 64.14%, and it tells us which musical traits matter most.
Key takeaways:
Random Forest lifts test accuracy from 64.14% to 76.01%, a gain of nearly 12 percentage points over logistic regression.
The jump suggests that nonlinear relationships and feature interactions, which logistic regression cannot capture, matter for predicting popularity.
The importance() scores single out instrumentalness, loudness, and speechiness among the strongest predictors, as the chart sketched below makes easier to see.
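The importance scores are easier to digest as a chart. randomForest ships with varImpPlot(), so a single line visualizes both importance measures:

# Plot feature importance (MeanDecreaseAccuracy and MeanDecreaseGini)
varImpPlot(rf_model, main = "Random Forest Feature Importance")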
In this Spotify Music Data Analysis project, we used R in Google Colab to look into what makes a song popular by analyzing audio features such as energy, danceability, loudness, and speechiness. After cleaning and visualizing the dataset, we trained two models: a Logistic Regression and a more powerful Random Forest classifier.
The Random Forest model outperformed Logistic Regression with an accuracy of 76.01%, as compared to 64.14%, identifying instrumentalness, loudness, and speechiness as key predictors of popularity.
Colab Link:
https://colab.research.google.com/drive/17HoemeiOPtDJgoWqNVtxOia5jRxX5fEl#scrollTo=06bFYquyv76C