Song Recommendation System Using Machine Learning
By Rohit Sharma
Updated on Aug 05, 2025 | 7 min read | 1.16K+ views
Ever noticed how, after you like or listen to a song on Spotify, JioSaavn, YouTube, or any other platform, your feed starts showing more similar content? That’s machine learning at work: recommender systems personalize the user experience and keep you engaged. In this project, we will build a song recommendation system that follows the same principle.
Using metadata like genre, artist, and track name, the system suggests songs that are similar to the one you select. The dataset used is the TCC CEDS Music Dataset (tcc_ceds_music.csv), which contains rich information on songs released between 1950 and 2019.
To follow along smoothly, you should have a basic understanding of Python and of the libraries we will work with: pandas, NumPy, Matplotlib, Seaborn, and scikit-learn.
Let’s build the project from scratch.
To begin, we’ll upload our dataset and import the essential Python libraries. The dataset file (tcc_ceds_music.csv) needs to be manually uploaded into Colab for access. Here is the code to do so:
from google.colab import files
# This will prompt you to upload the file manually
uploaded = files.upload()
Output:
tcc_ceds_music.csv(text/csv) - 27655251 bytes, last modified: 7/30/2025 - 100% done
Saving tcc_ceds_music.csv to tcc_ceds_music.csv
Now that the dataset file has been uploaded, let’s import the libraries and load the file using Pandas. Use the code below to do so:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Load the dataset
data = pd.read_csv('tcc_ceds_music.csv')
data.head()
Output:
|   | Unnamed: 0 | artist_name | track_name | release_date | genre | lyrics | len | dating | violence | world/life | ... | sadness | feelings | danceability | loudness | acousticness | instrumentalness | valence | energy | topic | age |
| 0 | 0 | mukesh | mohabbat bhi jhoothi | 1950 | pop | hold time feel break feel untrue convince spea... | 95 | 0.000598 | 0.063746 | 0.000598 | ... | 0.380299 | 0.117175 | 0.357739 | 0.454119 | 0.997992 | 0.901822 | 0.339448 | 0.137110 | sadness | 1.0 |
| 1 | 4 | frankie laine | i believe | 1950 | pop | believe drop rain fall grow believe darkest ni... | 51 | 0.035537 | 0.096777 | 0.443435 | ... | 0.001284 | 0.001284 | 0.331745 | 0.647540 | 0.954819 | 0.000002 | 0.325021 | 0.263240 | world/life | 1.0 |
| 2 | 6 | johnnie ray | cry | 1950 | pop | sweetheart send letter goodbye secret feel bet... | 24 | 0.002770 | 0.002770 | 0.002770 | ... | 0.002770 | 0.225422 | 0.456298 | 0.585288 | 0.840361 | 0.000000 | 0.351814 | 0.139112 | music | 1.0 |
| 3 | 10 | pérez prado | patricia | 1950 | pop | kiss lips want stroll charm mambo chacha merin... | 54 | 0.048249 | 0.001548 | 0.001548 | ... | 0.225889 | 0.001548 | 0.686992 | 0.744404 | 0.083935 | 0.199393 | 0.775350 | 0.743736 | romantic | 1.0 |
| 4 | 12 | giorgos papadopoulos | apopse eida oneiro | 1950 | pop | till darling till matter know till dream live ... | 48 | 0.001350 | 0.001350 | 0.417772 | ... | 0.068800 | 0.001350 | 0.291671 | 0.646489 | 0.975904 | 0.000246 | 0.597073 | 0.394375 | romantic | 1.0 |
5 rows × 31 columns
The output shows us that the dataset has successfully loaded. The dataset contains detailed columns, such as artist_name, track_name, lyrics, genre, valence, acousticness, etc.
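Before relying on the loaded data, it can help to confirm its shape and check for missing values in the three text columns the recommender uses. The following is a minimal sketch on a small stand-in DataFrame; the `sample` rows and the `text_columns` name are hypothetical, and in the notebook you would run the same calls on `data`.

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset; in the notebook, use `data` instead.
sample = pd.DataFrame({
    "artist_name": ["mukesh", "frankie laine", None],
    "track_name": ["mohabbat bhi jhoothi", "i believe", "cry"],
    "genre": ["pop", "pop", "pop"],
})

# Columns that feed the recommender later on
text_columns = ["genre", "artist_name", "track_name"]

# Number of rows and columns
n_rows, n_cols = sample.shape

# Count of missing values in each text column
missing = sample[text_columns].isna().sum()

print(n_rows, n_cols)
print(missing)
```

If any of these counts is non-zero on the real dataset, the `fillna('')` calls used later when combining the features keep those rows from producing missing text.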
In this step, we will explore the dataset visually. Doing so helps us understand the dataset’s structure, uncover patterns, and pinpoint which features may contribute the most to our song recommendation system.
First, we will look at the top 10 genres to see the musical diversity in the dataset. Use the code mentioned below to do so:
plt.figure(figsize=(10,6))
sns.countplot(
    y='genre',
    data=data,
    order=data['genre'].value_counts().index[:10],
    palette='pastel'
)
plt.title('Top 10 Genres in the Dataset')
plt.xlabel('Number of Songs')
plt.ylabel('Genre')
plt.show()
Output:
The plot shows us which genres dominate the dataset. These genres might influence recommendations depending on song distribution.
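The same distribution can also be read as numbers rather than bars. This is a small sketch, assuming a hypothetical `genres` Series in place of `data['genre']`; it reports each genre's share of the songs.

```python
import pandas as pd

# Hypothetical genre labels standing in for data['genre']
genres = pd.Series(["pop", "pop", "rock", "jazz", "pop", "rock"], name="genre")

# Fraction of songs per genre, largest first, limited to the top 10
genre_share = genres.value_counts(normalize=True).head(10)

print(genre_share.round(2))
```

A genre with a very large share is likely to appear often among the recommendations, simply because there are more candidate songs carrying that label.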
Next, we will identify the most frequently appearing artists. This helps us see which artists are over-represented and could therefore bias our recommendations.
Use the following code to do so:
top_artists = data['artist_name'].value_counts().head(10)
plt.figure(figsize=(10,6))
sns.barplot(
    x=top_artists.values,
    y=top_artists.index,
    palette='viridis'
)
plt.title('Top 10 Artists by Song Count')
plt.xlabel('Number of Songs')
plt.ylabel('Artist Name')
plt.show()
Output:
The output tells us which artists have the most representation. This information can be used to analyze how artist popularity will affect the similarity scores.
To prepare the data for the recommender, we combine the relevant text features and transform them into numerical vectors. This lets us measure how similar one song is to another based on its metadata.
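We will run the following sub-steps in one block:
1. Combine the genre, artist name, and track name into a single text field.
2. Vectorize the combined text using TF-IDF.
3. Compute cosine similarity scores between songs.
Use the code below to do so: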
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# 1. Combine important text fields into a single string
data['combined_features'] = (
    data['genre'].fillna('') + ' ' +
    data['artist_name'].fillna('') + ' ' +
    data['track_name'].fillna('')
)
# 2. Vectorize the combined text using TF-IDF
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(data['combined_features'])
# 3. Compute similarity scores between songs
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
By turning text features into TF-IDF vectors, we will highlight the importance of each word with respect to the entire dataset. By then using cosine similarity, we can compare any pair of songs with respect to their metadata profile and find those that are most alike.
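To see what these similarity scores look like, here is a minimal sketch on three made-up "genre artist track" strings. The `combined` list is hypothetical and not taken from the dataset; the vectorizer settings match the ones used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical combined "genre artist track" strings for three songs
combined = [
    "pop frankie laine believe",
    "pop frankie laine jezebel",
    "rock elvis presley hound dog",
]

# Same vectorizer settings as in the main code
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(combined)

# Pairwise cosine similarity: a 3 x 3 matrix with 1.0 on the diagonal
sim = cosine_similarity(matrix)

print(sim.round(2))
```

The first two strings share genre and artist, so their score is well above zero, while the third shares no words with either and scores zero against both. This also shows that the artist name weighs heavily: songs by the same artist tend to be each other's nearest neighbours.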
Now that we have a matrix showing how similar each song is to every other song, let’s write a function that uses this information to suggest similar tracks. This is the core of our recommendation system.
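We’ll define a function recommend_songs() that:
1. Converts the input song name to lowercase and checks whether it exists in the dataset.
2. Finds the index of the matching song and fetches its similarity scores.
3. Sorts the scores from highest to lowest and skips the input song itself.
4. Returns the names of the 10 most similar songs.
Use the code below to do so: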
# Function to recommend similar songs using cosine similarity
def recommend_songs(song_name, data, similarity_matrix):
    # Convert input song name to lowercase
    song_name = song_name.lower()
    # Lowercase all track names without adding a column to the caller's DataFrame
    track_names_lower = data['track_name'].str.lower()
    # Check if the song exists in the dataset
    if song_name not in track_names_lower.values:
        return "Sorry, this song is not in our database."
    # Get the index of the first matching song (assumes the default 0..n-1 index)
    idx = data[track_names_lower == song_name].index[0]
    # Fetch similarity scores for the song
    sim_scores = list(enumerate(similarity_matrix[idx]))
    # Sort songs by similarity score (highest first)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get top 10 most similar songs (skipping the first entry, normally the input song itself)
    sim_scores = sim_scores[1:11]
    # Extract the indices of the recommended songs
    song_indices = [i[0] for i in sim_scores]
    # Return the names of the recommended songs
    return data['track_name'].iloc[song_indices].values
# Replace "i believe" with any song title from the dataset to get recommendations
recommend_songs("i believe", data, cosine_sim)
Output:
array(["that's my desire", "after you've gone", 'laura',
"that ain't right", 'jezebel', "your cheatin' heart", 'wanted man',
'granada', "you've changed", 'high noon'], dtype=object)
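The full similarity matrix holds one score for every pair of songs, so for N songs it stores N × N values, which can strain memory on a large dataset. One possible alternative, sketched below, computes scores only for the requested song. The function name `recommend_on_demand`, its `top_n` parameter, and the `toy` data are hypothetical; the sketch relies on `TfidfVectorizer` normalizing each row to unit length by default, so a plain dot product (`linear_kernel`) gives the cosine similarity.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


def recommend_on_demand(song_name, data, tfidf_matrix, top_n=10):
    """Return the top_n rows most similar to song_name, or None if it is absent."""
    is_match = (data["track_name"].str.lower() == song_name.lower()).to_numpy()
    matches = np.flatnonzero(is_match)
    if matches.size == 0:
        return None
    idx = matches[0]  # row position of the first matching title
    # Rows are L2-normalized by default, so the dot product equals cosine similarity
    scores = linear_kernel(tfidf_matrix[idx], tfidf_matrix).ravel()
    scores[idx] = -1.0  # exclude the query song itself
    top = np.argsort(scores)[::-1][:top_n]
    return data.iloc[top][["track_name", "artist_name"]].assign(score=scores[top])


# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "genre": ["pop", "pop", "rock"],
    "artist_name": ["frankie laine", "frankie laine", "elvis presley"],
    "track_name": ["i believe", "jezebel", "hound dog"],
})
toy_text = toy["genre"] + " " + toy["artist_name"] + " " + toy["track_name"]
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_text)

result = recommend_on_demand("I Believe", toy, toy_matrix, top_n=2)
print(result)
```

Unlike slicing off the first sorted entry, this variant excludes the query song by its position, and it returns the artist alongside each title so that songs sharing a name can be told apart.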
In this project, we built a content-based song recommender system using the tcc_ceds_music.csv dataset. By converting each song’s genre, artist name, and track name into TF-IDF vectors, we measured how alike two songs are using cosine similarity. Given a track like “I Believe”, the system returns the ten tracks with the most similar metadata.
This project highlights a practical application of machine learning in building music recommendation engines for platforms like Spotify, JioSaavn, and YouTube. The approach is based entirely on content: it needs no listening history, only the metadata of the songs themselves.
Colab Link:
https://colab.research.google.com/drive/1WmHNjM6Bs2Zn9p3wWPzN3un5WNMuC0qi
Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...