Movie Recommendation System: How To Build it with Machine Learning?

By Rohit Sharma

Updated on Jul 04, 2025 | 14 min read | 20.99K+ views

Did you know? YouTube’s recommendation engine drives roughly 70% of what people watch on the platform. That makes it the real power player behind what you watch next. But unlike Netflix, YouTube’s system plays by a different set of rules, with its own unique goals and challenges.

Movie recommendation systems use machine learning to analyze user preferences, watch history, ratings, and behavior to suggest films tailored to individual tastes. These systems learn patterns over time to deliver more accurate and personalized suggestions.

For example, Netflix uses collaborative filtering and deep learning to recommend movies based on what similar users watched. This helps viewers discover content they’re likely to enjoy, even if they’ve never searched for it before.

In this blog, you’ll learn how movie recommendation systems use machine learning to deliver personalized viewing experiences.

If you want to learn machine learning, upGrad’s online AI and ML courses can help you. By the end of the program, participants will be equipped with the skills to build AI models, analyze complex data, and solve industry-specific challenges.

Build Your Own Movie Recommendation System with ML

You’ve seen Netflix suggest that crime drama at 3 AM, or YouTube nudge you toward that oddly addictive mini doc. But have you ever thought: “How do these platforms know what I’ll like?”

That’s where Movie Recommendation Systems come in. They can be built using either collaborative filtering or content-based filtering.

Collaborative Filtering: Collaborative filtering finds users with similar tastes and recommends what they liked. It analyzes user behavior patterns like ratings, views, and clicks. If User A and User B both loved the same movies, the system assumes they have similar preferences. It then suggests movies User A enjoyed to User B. This "people like you also liked" approach doesn't need to understand movie content. It relies purely on user interaction patterns to make predictions.

Content-Based Filtering: Content-based filtering analyzes movie characteristics to find similar content. It examines features like genre, director, actors, and plot keywords. If you rate action movies with specific actors highly, it suggests more action films with those actors. This method focuses on item attributes, not user opinions. It works well for new users since it only needs a few preferences to start making recommendations.

In this guide, we’ll build a movie recommendation system using:

Collaborative Filtering (using user behavior)
Content-Based Filtering (based on movie features)

Let’s begin.

Step 1: Set Up Your Environment

Before you dive into building your movie recommendation system, you need to set up a clean environment with the right libraries. We'll use Google Colab for this tutorial because it's cloud-based, beginner-friendly, and doesn’t require local setup. All you need is a Google account.

The key libraries we’ll use are:

pandas for handling and manipulating data
scikit-surprise for building collaborative filtering models
scikit-learn for content-based filtering (TF-IDF, cosine similarity)

Start by running the following in a Colab cell to install everything you need:

!pip install scikit-surprise
!pip install pandas scikit-learn

Once installed, you can import them and start writing your code right away. Colab also gives you GPU and TPU access if you need it later for larger-scale models.

Step 2: Create a Movie Ratings Dataset

To build a movie recommendation system, you first need a dataset that simulates real-world user behavior. Since this is a hands-on tutorial, we’ll start with a small, hypothetical dataset. This helps you focus on understanding the logic without being overwhelmed by thousands of rows.

Think of this as a mini version of what platforms like Netflix or Amazon Prime collect every time you watch or rate something.

Each entry in the dataset contains:

A user_id: the viewer
A movie: the name of the movie watched
A rating: the viewer’s score, from 1 (didn’t like it) to 5 (loved it)

Let’s define this sample data in code:

import pandas as pd

# Simulated user ratings for various movies
ratings_data = {
    'user_id': ['U1', 'U1', 'U2', 'U2', 'U3', 'U3', 'U4', 'U4', 'U5'],
    'movie': ['Inception', 'Avengers', 'Inception', 'Titanic', 'Avengers', 'Shrek', 'Titanic', 'Shrek', 'Inception'],
    'rating': [5, 4, 4, 5, 3, 5, 4, 4, 3]
}

# Create a DataFrame
ratings_df = pd.DataFrame(ratings_data)

# Display the dataset
ratings_df

When you run this, you’ll see a table that looks like this:

user_id	movie	rating
U1	Inception	5
U1	Avengers	4
U2	Inception	4
U2	Titanic	5
U3	Avengers	3
U3	Shrek	5
U4	Titanic	4
U4	Shrek	4
U5	Inception	3

This dataset will be the foundation of your collaborative filtering model. Later, we’ll use this to predict ratings for movies a user hasn’t watched yet, just like a real recommendation engine would do behind the scenes.

If you're working with a larger dataset in the future (like MovieLens), this same logic will apply, just at a bigger scale.
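Before modelling, it can help to check how much of the user-movie grid is actually filled in. The sketch below does this in plain Python on the same nine ratings; the variable names are only illustrative.

```python
# The same nine (user, movie, rating) rows as the sample dataset above
ratings = [
    ("U1", "Inception", 5), ("U1", "Avengers", 4),
    ("U2", "Inception", 4), ("U2", "Titanic", 5),
    ("U3", "Avengers", 3), ("U3", "Shrek", 5),
    ("U4", "Titanic", 4), ("U4", "Shrek", 4),
    ("U5", "Inception", 3),
]

users = {user for user, _, _ in ratings}
movies = {movie for _, movie, _ in ratings}

# Fraction of user-movie pairs that have a rating
density = len(ratings) / (len(users) * len(movies))
print(f"{len(users)} users x {len(movies)} movies, {density:.0%} of cells rated")
```

Real datasets are usually far sparser than this toy example, which is why the data sparsity challenge comes up later in this guide.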

Step 3: Build a Collaborative Filtering Recommender System

Now that you’ve got your sample data ready, let’s move on to building the actual recommendation engine. We'll start with Collaborative Filtering, one of the most common and powerful techniques used in platforms like Netflix, Amazon, and YouTube.

This approach makes recommendations by analyzing user behavior. In simple terms, if two users liked similar movies in the past, they’re likely to enjoy the same ones in the future. We'll implement this using the Surprise library, which is perfect for building and testing recommender systems.

1. Prepare Your Data for the Surprise Library

The Surprise library requires your dataset to be in a specific format: a dataframe with three columns — user ID, item (movie) name, and rating. We already have that in ratings_df. Now let’s load it into Surprise:

from surprise import Dataset, Reader

# Define the rating scale (our ratings go from 1 to 5)
reader = Reader(rating_scale=(1, 5))

# Load the dataframe into Surprise's format
data = Dataset.load_from_df(ratings_df[['user_id', 'movie', 'rating']], reader)

# Build the full training set
trainset = data.build_full_trainset()

What’s happening here?

Reader tells Surprise the scale of your ratings.
Dataset.load_from_df converts your Pandas dataframe to a format Surprise understands.
build_full_trainset() takes all your data and prepares it for training.

2. Train the Collaborative Filtering Model

Next, we’ll use a K-Nearest Neighbors (KNN) algorithm to find similar users based on their ratings. We’ll use cosine similarity as our metric (a common choice in recommendations).

from surprise import KNNBasic

# Define similarity options
sim_options = {
    'name': 'cosine',      # Use cosine similarity
    'user_based': True     # Compute similarities between users (not items)
}

# Initialize the algorithm
model = KNNBasic(sim_options=sim_options)

# Train the model on the training set
model.fit(trainset)

Why KNN? It’s simple, intuitive, and works well for small to medium-sized datasets. It compares users and recommends items liked by similar users.

Why cosine similarity? It measures how similar two users are based on the "angle" between their rating vectors, rather than the difference in absolute rating values.
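To make the "angle" idea concrete, here is a minimal sketch of cosine similarity in plain Python. It assumes users are compared only on the movies they have both rated, which is a common convention for user-based filtering; the function name cosine_on_common is made up for this example.

```python
import math

def cosine_on_common(ratings_a, ratings_b):
    """Cosine similarity using only the movies both users have rated."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0  # no overlap: treat the users as not similar
    dot = sum(ratings_a[m] * ratings_b[m] for m in common)
    norm_a = math.sqrt(sum(ratings_a[m] ** 2 for m in common))
    norm_b = math.sqrt(sum(ratings_b[m] ** 2 for m in common))
    return dot / (norm_a * norm_b)

u1 = {"Inception": 5, "Avengers": 4}
u2 = {"Inception": 4, "Titanic": 5}
u3 = {"Avengers": 3, "Shrek": 5}

print(cosine_on_common(u1, u2))  # one shared movie, so the score is 1.0
print(cosine_on_common(u2, u3))  # no shared movies, so the score is 0.0
```

Notice the catch: when two users share just one movie, the score is always 1.0 no matter what ratings they gave. On tiny datasets like ours, that makes similarity scores much less informative than they look.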

Once this step runs, you now have a trained model ready to make movie predictions!

Step 4: Generate Movie Recommendations for a User

Now that your collaborative filtering model is trained, it’s time to put it to work. You’ll predict how much a user might enjoy movies they haven’t rated yet. Based on those predictions, you’ll recommend the highest-rated ones.

1. Identify Unwatched Movies for a User:

Let’s focus on User U5. Based on our sample data, U5 has only rated Inception. Let’s find out which movies they haven’t rated yet, and then predict ratings for those.

# Get all unique movie names
all_movies = ratings_df['movie'].unique()

# Movies that U5 has already rated
watched_by_u5 = ratings_df[ratings_df['user_id'] == 'U5']['movie'].values

# Find the unwatched ones
unwatched_by_u5 = [movie for movie in all_movies if movie not in watched_by_u5]

print(f"U5 has not watched: {unwatched_by_u5}")

Output:

U5 has not watched: ['Avengers', 'Titanic', 'Shrek']

Now let’s predict how U5 would rate each of these movies.

2. Predict Ratings for Unwatched Movies

# Predict ratings for each unwatched movie
for movie in unwatched_by_u5:
    prediction = model.predict('U5', movie)
    estimated_rating = round(prediction.est, 2)
    print(f"Predicted rating for U5 → {movie}: {estimated_rating}")

Sample Output:

Predicted rating for U5 → Avengers: 4.0
Predicted rating for U5 → Titanic: 5.0
Predicted rating for U5 → Shrek: 4.11

What just happened?

The model compared U5 with other users who have rated the same movies. U5 has only rated Inception, so only U1 and U2 count as neighbors.
It looked for patterns: U2, a user similar to U5, rated Titanic 5, so the model assumes U5 might like it too. The Avengers estimate comes from U1’s rating of 4 in the same way.
Neither of U5’s neighbors rated Shrek, so Surprise falls back to the average of all ratings in the dataset (about 4.11).
It gave you an estimated rating that reflects U5’s likely preference. With only nine ratings, treat these numbers as a demonstration rather than a reliable signal.

3. Recommend the Top-Rated Movies

Now, you probably don’t want to recommend everything, just the best picks. Let’s sort and show the top recommendations:

# Store predictions in a list
predictions = []

for movie in unwatched_by_u5:
    prediction = model.predict('U5', movie)
    predictions.append((movie, round(prediction.est, 2)))

# Sort by predicted rating in descending order
recommendations = sorted(predictions, key=lambda x: x[1], reverse=True)

print("Recommended movies for U5:")
for movie, rating in recommendations:
    print(f"{movie} (Predicted Rating: {rating})")

Output:

Recommended movies for U5:
Titanic (Predicted Rating: 5.0)
Shrek (Predicted Rating: 4.11)
Avengers (Predicted Rating: 4.0)

You now know how to:

Detect what a user hasn’t watched
Predict how much they’d like those movies
Sort predictions to recommend the best ones

Just like that, you've created a functioning recommendation system, one that mimics how platforms like Netflix suggest shows and movies.
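If you’re curious what a user-based KNN predictor does under the hood, the sketch below re-creates the idea in plain Python: take a similarity-weighted average of the neighbors’ ratings, and fall back to the overall mean when no similar user has rated the movie. It is a simplified illustration, not Surprise’s actual code, and the helper names are made up for this example.

```python
import math

ratings = {
    "U1": {"Inception": 5, "Avengers": 4},
    "U2": {"Inception": 4, "Titanic": 5},
    "U3": {"Avengers": 3, "Shrek": 5},
    "U4": {"Titanic": 4, "Shrek": 4},
    "U5": {"Inception": 3},
}

def cosine_on_common(a, b):
    """Cosine similarity over the movies both users rated (0.0 if none)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[m] * b[m] for m in common)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in common))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in common))
    return dot / (norm_a * norm_b)

def predict(user, movie, k=40):
    """Similarity-weighted average of the neighbors' ratings for `movie`."""
    all_ratings = [r for movies in ratings.values() for r in movies.values()]
    global_mean = sum(all_ratings) / len(all_ratings)

    # (similarity, rating) for every other user who rated the movie
    neighbors = [
        (cosine_on_common(ratings[user], other_ratings), other_ratings[movie])
        for other, other_ratings in ratings.items()
        if other != user and movie in other_ratings
    ]
    # Keep the k most similar neighbors that have a positive similarity
    top = [(s, r) for s, r in sorted(neighbors, reverse=True)[:k] if s > 0]
    if not top:
        return global_mean  # nothing to go on: use the overall average
    return sum(s * r for s, r in top) / sum(s for s, _ in top)

for movie in ["Avengers", "Titanic", "Shrek"]:
    print(movie, round(predict("U5", movie), 2))
```

The fallback branch matters more than it seems. On sparse data, many predictions end up being the global mean simply because no neighbor has rated the movie.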

Step 5: Build a Content-Based Recommender

Let’s say U5 loved Titanic, a romance-drama. With content-based filtering, you’ll recommend movies that are similar in content, not just in who liked them.

In this step, we’ll use TF-IDF (Term Frequency–Inverse Document Frequency) to convert text data (genres) into vectors and use cosine similarity to find how close different movies are.

1. Create Metadata for Movies

Start by building a mini dataset of movies and their genres. This data could be fetched from IMDb or TMDb in a real-world project.

movie_data = {
    'movie': ['Inception', 'Avengers', 'Titanic', 'Shrek'],
    'genre': ['Sci-Fi Action', 'Action Fantasy', 'Romance Drama', 'Animation Comedy']
}

movies_df = pd.DataFrame(movie_data)
movies_df

Output:

movie	genre
Inception	Sci-Fi Action
Avengers	Action Fantasy
Titanic	Romance Drama
Shrek	Animation Comedy

Each movie now has a genre description we’ll convert into numerical values for similarity calculation.

2. Convert Genres to Feature Vectors

Use TfidfVectorizer to extract keyword importance from each genre string.

from sklearn.feature_extraction.text import TfidfVectorizer

# Convert genre text to TF-IDF vectors
tfidf = TfidfVectorizer(stop_words='english')
genre_matrix = tfidf.fit_transform(movies_df['genre'])

What’s happening? Each genre string is transformed into a vector based on term frequency. Words that occur often in one movie but rarely across others get higher weight, great for distinguishing unique genres.
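To see where those weights come from, here is a hand-rolled TF-IDF sketch in plain Python. It assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by unit-length scaling, which is how scikit-learn documents its default settings; the tokenize and tfidf helpers are written just for this example.

```python
import math
import re

genres = {
    "Inception": "Sci-Fi Action",
    "Avengers": "Action Fantasy",
    "Titanic": "Romance Drama",
    "Shrek": "Animation Comedy",
}

def tokenize(text):
    # Lowercase and keep runs of 2+ word characters, so "Sci-Fi" -> ["sci", "fi"]
    return re.findall(r"\b\w\w+\b", text.lower())

docs = {movie: tokenize(text) for movie, text in genres.items()}
n_docs = len(docs)

# Document frequency: in how many movies does each term appear?
df = {}
for tokens in docs.values():
    for term in set(tokens):
        df[term] = df.get(term, 0) + 1

def tfidf(tokens):
    # Term count times smoothed IDF, then scale the vector to unit length
    weights = {
        t: tokens.count(t) * (math.log((1 + n_docs) / (1 + df[t])) + 1)
        for t in set(tokens)
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

print(tfidf(docs["Inception"]))
```

In Inception’s vector, "action" gets a lower weight than "sci" or "fi" because it also appears in Avengers, so it is less useful for telling the movies apart.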

3. Compute Cosine Similarity Between Movies

Now we’ll compare each movie to every other movie based on genre vectors using cosine similarity.

from sklearn.metrics.pairwise import linear_kernel

# Compute pairwise cosine similarity
cosine_sim = linear_kernel(genre_matrix, genre_matrix)

This will give you a 4x4 similarity matrix comparing each movie with every other movie, including itself (score of 1.0).
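A quick way to sanity-check a similarity matrix is to look at which movies share any genre words at all. The sketch below uses simple 0/1 word vectors instead of TF-IDF weights (an assumption made to keep it short), so the non-zero values will differ from the real matrix, but the zeros fall in the same places.

```python
import math
import re

genres = {
    "Inception": "Sci-Fi Action",
    "Avengers": "Action Fantasy",
    "Titanic": "Romance Drama",
    "Shrek": "Animation Comedy",
}

def tokens(text):
    # Lowercased set of words with 2+ characters
    return set(re.findall(r"\b\w\w+\b", text.lower()))

def binary_cosine(a, b):
    """Cosine similarity between two word sets treated as 0/1 vectors."""
    return len(a & b) / math.sqrt(len(a) * len(b))

titles = list(genres)
for t1 in titles:
    row = [round(binary_cosine(tokens(genres[t1]), tokens(genres[t2])), 2) for t2 in titles]
    print(f"{t1:<10}", row)
```

Only Inception and Avengers overlap (both are tagged "Action"). Titanic and Shrek share no words with any other movie, so their similarity to everything else is 0.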

4. Create a Movie Recommendation Function

Now we’ll write a function to recommend the top N similar movies based on genre.

# Create a reverse lookup of movie names to their index
indices = pd.Series(movies_df.index, index=movies_df['movie'])

def content_recommendations(title, num_recs=2):
    idx = indices[title]
    
    # Get similarity scores for the selected movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sort scores (excluding the movie itself)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:num_recs+1]
    
    # Get indices of most similar movies
    movie_indices = [i[0] for i in sim_scores]
    
    return movies_df['movie'].iloc[movie_indices]

Try It Out: Recommend Similar Movies to Titanic

content_recommendations('Titanic')

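Output:

0 Inception
1 Avengers
Name: movie, dtype: object

That’s it! The function returns Inception and Avengers, but look at the scores before trusting the result: Titanic’s genre string ("Romance Drama") shares no words with any other movie in this tiny catalog, so every similarity score is 0.

Why these two? With all scores tied at 0, the sort keeps the original row order, and the function simply returns the first two other movies. TF-IDF on genre labels only matches shared words; it cannot pick up themes or tone. Try content_recommendations('Inception') instead: Avengers comes first because both movies carry the "Action" label.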

You just built a fully functional content-based movie recommender using:

Movie metadata (genre)
TF-IDF to convert text to vectors
Cosine similarity to find closeness
A custom function to recommend movies

Unlike collaborative filtering, this method doesn't depend on user ratings. It can work even if a movie is brand new.

Blending Both Methods: In real-world scenarios, companies blend collaborative and content-based filtering to build hybrid systems. You can average the scores from both systems or use more advanced ensemble methods.
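As a rough illustration of the averaging idea, the sketch below blends a predicted rating with a genre similarity score. The movies, scores, and the 0.7 weight are all hypothetical; in practice you would tune the weight on held-out data.

```python
def scale_rating(rating, low=1, high=5):
    """Map a 1-5 predicted rating onto 0..1 so it is comparable to a similarity."""
    return (rating - low) / (high - low)

def hybrid_score(cf_score, content_score, alpha=0.7):
    """Weighted blend of a collaborative score and a content score (both 0..1)."""
    return alpha * cf_score + (1 - alpha) * content_score

# Hypothetical inputs: (predicted rating, genre similarity to a movie the user liked)
candidates = {
    "Movie A": (4.5, 0.10),
    "Movie B": (3.8, 0.90),
    "Movie C": (3.0, 0.20),
}

ranked = sorted(
    candidates,
    key=lambda m: hybrid_score(scale_rating(candidates[m][0]), candidates[m][1]),
    reverse=True,
)
print(ranked)
```

Here Movie B overtakes Movie A: its predicted rating is lower, but its strong content match more than makes up the difference.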

Next Steps (Optional):

Use MovieLens dataset for scale
Integrate with Flask or Streamlit for a working app
Try item-based filtering or matrix factorization with SVD
Expand metadata: actors, tags, plot summaries

You will learn more about ML techniques with upGrad’s free Unsupervised Learning: Clustering course. Explore K-Means, Hierarchical Clustering, and practical applications to uncover hidden patterns in unlabelled data.


Next, let’s look at some of the challenges of building a movie recommendation system with machine learning and how you can overcome them.

Challenges of Building a Movie Recommendation System and How to Overcome Them

While building a movie recommendation system sounds exciting, there are several practical challenges that can trip you up along the way. For example, how do you recommend a movie to a brand-new user who hasn't rated anything? Or how do you make sure your model doesn't just suggest blockbuster hits while ignoring hidden gems?

Building an effective recommendation engine means going beyond just algorithms. You need to solve for data sparsity, ensure fairness, and adapt to changing user preferences in real time. All while delivering a personalized experience.

To make this easier, here’s a breakdown of some common challenges and how you can solve them:

Challenges	Solutions
Cold Start Problem: It’s hard to recommend movies to new users or include new movies without prior data.	Use content-based filtering initially, and encourage onboarding feedback (e.g., favorite genres) to collect early signals.
Data Sparsity: Most users only rate a few movies, making user-user comparisons weak.	Apply matrix factorization (like SVD) or hybrid models to infer hidden relationships between users and items.
Popularity Bias: Models tend to over-recommend popular movies and ignore niche titles.	Introduce diversity constraints or re-rank recommendations to include lesser-known but relevant options.
Scalability: As data grows, models can slow down or become memory-intensive.	Use approximate nearest neighbors (e.g., FAISS) or distributed computing to speed up similarity searches.
Dynamic User Preferences: A user’s taste may evolve over time, making static models less effective.	Incorporate time-aware or session-based models that adapt based on recent user behavior.
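
One simple way to make a model time-aware is to let older ratings count for less. The sketch below uses an exponential decay with a 30-day half-life; both the half-life and the sample ratings are hypothetical choices for illustration.

```python
def decayed_weight(days_ago, half_life_days=30):
    """Weight that halves every `half_life_days`, so recent ratings count more."""
    return 0.5 ** (days_ago / half_life_days)

def time_weighted_average(rated):
    """rated: list of (rating, days_ago) pairs."""
    weights = [decayed_weight(days) for _, days in rated]
    total = sum(w * rating for w, (rating, _) in zip(weights, rated))
    return total / sum(weights)

# A user who rated a genre 1 star two months ago and 5 stars today
history = [(5, 0), (1, 60)]
print(round(time_weighted_average(history), 2))
```

A plain average of these two ratings would be 3.0. The decayed version lands much closer to the recent 5-star rating, which better reflects the user’s current taste.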

You can get a better understanding of more advanced ML models with upGrad’s free Fundamentals of Deep Learning and Neural Networks course. Get expert-led deep learning training and hands-on insights, and earn a free certification.

Next, let’s look at how upGrad can help you learn how to build a movie recommendation system with machine learning.

How Can upGrad Help You Build Machine Learning Models?

Movie recommendation systems are at the core of user engagement for platforms like Netflix, Prime Video, and YouTube. These ML-based systems help millions discover content they actually care about. Learning ML applications can set you apart in today’s AI-powered job market.

With upGrad, you can learn how to build smart, scalable ML systems from scratch. With expert mentorship, real-world projects, and personalized guidance, you'll gain the confidence to turn machine learning into a tool for impact.

If you're unsure where to begin or which area to focus on, upGrad’s expert career counselors can guide you based on your goals. You can also visit a nearby upGrad offline center to explore course options, get hands-on experience, and speak directly with mentors!

References:
https://labelyourdata.com/articles/movie-recommendation-with-machine-learning
https://qz.com/1178125/youtubes-recommendations-drive-70-of-what-we-watch

Frequently Asked Questions (FAQs)

1. How can I handle multilingual movie titles or metadata in a recommendation system?

Supporting multilingual content means normalizing metadata across languages. You can detect languages using tools like langdetect and translate metadata using APIs like Google Translate. For better scalability and accuracy, multilingual embeddings like LASER, mBERT, or XLM-RoBERTa can map text from different languages into a shared semantic space. This ensures recommendations are language-agnostic and relevant even across regional or international catalogs.

2. What if users have inconsistent rating behavior, like someone who never gives 5 stars?

This issue is known as rating bias. Some users consistently rate harshly, while others overrate. To correct this, normalize the data using mean-centering (subtracting the user’s average rating) or z-score normalization. This makes the system focus on relative preferences rather than raw scores. Collaborative filtering methods like SVD incorporate bias terms to automatically adjust for such patterns during training, improving fairness in predictions.
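
Here is a minimal sketch of both normalizations in plain Python. The two users and their ratings are hypothetical.

```python
from statistics import mean, pstdev

def mean_center(user_ratings):
    """Subtract the user's own average from each of their ratings."""
    avg = mean(list(user_ratings.values()))
    return {movie: r - avg for movie, r in user_ratings.items()}

def z_score(user_ratings):
    """Mean-center, then divide by the user's own rating spread."""
    values = list(user_ratings.values())
    avg, spread = mean(values), pstdev(values)
    if spread == 0:
        return {movie: 0.0 for movie in user_ratings}  # user rates everything the same
    return {movie: (r - avg) / spread for movie, r in user_ratings.items()}

harsh_rater = {"Inception": 3, "Titanic": 2, "Shrek": 1}     # never gives 5 stars
generous_rater = {"Inception": 5, "Titanic": 4, "Shrek": 3}  # rates everything high

print(mean_center(harsh_rater))
print(mean_center(generous_rater))
```

After mean-centering, both users end up with the same values: Inception is their favorite and Shrek their least favorite, even though their raw scores never overlap at the top.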

3. Can I recommend movies based on mood or emotional tone instead of genre?

Yes, mood-based recommenders are gaining traction. You can extract emotional tone from user reviews using sentiment analysis or from audio features using emotion recognition APIs. Tag movies with labels like "heartwarming," "dark," or "uplifting" and map them to user preferences. These models help recommend content aligned with the user’s emotional state, making recommendations more personal and dynamic than genre-only systems.

4. How do I make recommendations when only clickstream data (no ratings) is available?

This is a form of implicit feedback. Instead of ratings, you work with user behaviors like clicks, watch time, pauses, or replays. Models like Implicit ALS (Alternating Least Squares) or Bayesian Personalized Ranking (BPR) are built specifically to handle such data. You can also apply weights to different actions—for example, a full watch might carry more weight than a trailer view. This helps in building a nuanced behavior-based profile.
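
The weighting idea can be sketched as follows. The action names, weights, and event log are all hypothetical; you would choose weights based on what each action means on your platform.

```python
from collections import defaultdict

# Hypothetical weights: stronger signals count for more
ACTION_WEIGHTS = {"trailer_view": 0.5, "click": 1.0, "full_watch": 3.0, "replay": 4.0}

# Hypothetical event log: (user_id, movie, action)
events = [
    ("U1", "Inception", "click"),
    ("U1", "Inception", "full_watch"),
    ("U1", "Titanic", "trailer_view"),
    ("U2", "Titanic", "full_watch"),
    ("U2", "Titanic", "replay"),
]

def build_preferences(event_log):
    """Sum action weights into one implicit score per (user, movie) pair."""
    scores = defaultdict(float)
    for user, movie, action in event_log:
        scores[(user, movie)] += ACTION_WEIGHTS.get(action, 0.0)  # ignore unknown actions
    return dict(scores)

print(build_preferences(events))
```

Scores like these can then stand in for ratings, or serve as confidence values for an implicit-feedback model.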

5. How can I prevent my recommendation system from reinforcing stereotypes or biases in content?

To reduce bias, start with auditing your training data for overrepresentation of specific genres, actors, or creators. Use fairness-aware algorithms that enforce diversity during training or in post-processing (re-ranking). For example, you might cap the number of action movies shown in a single recommendation list or ensure gender/race representation in cast metadata. Regular monitoring and feedback loops are critical to maintain balance over time.
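
The capping idea can be applied as a post-processing step on any ranked list. In this sketch, the function name, the titles, and the cap of two per genre are hypothetical.

```python
def rerank_with_cap(ranked_items, max_per_genre=2, list_size=5):
    """Walk a ranked list and skip items once their genre has hit the cap."""
    counts = {}
    result = []
    for title, genre in ranked_items:
        if counts.get(genre, 0) >= max_per_genre:
            continue  # this genre already fills its share of the list
        counts[genre] = counts.get(genre, 0) + 1
        result.append(title)
        if len(result) == list_size:
            break
    return result

# Hypothetical model output, best first: (title, primary genre)
ranked = [
    ("Action 1", "Action"), ("Action 2", "Action"), ("Action 3", "Action"),
    ("Drama 1", "Drama"), ("Action 4", "Action"), ("Comedy 1", "Comedy"),
]
print(rerank_with_cap(ranked, max_per_genre=2, list_size=4))
```

The top two action titles keep their spots, while the third and fourth give way to the best drama and comedy.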

6. What are the risks of overfitting in collaborative filtering models, and how can I avoid it?

Overfitting happens when your model learns user patterns too specifically and fails on new data. In collaborative filtering, this often results from too many latent factors or too little regularization. To avoid it, use cross-validation, add regularization, and limit the number of dimensions in matrix factorization. Also, track performance on held-out data with metrics like RMSE or Precision@k to check that the model generalizes before you deploy it.
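
Both metrics are easy to compute by hand. The sketch below uses hypothetical held-out ratings and a hypothetical recommendation list.

```python
import math

def rmse(pairs):
    """Root mean squared error over (actual, predicted) rating pairs."""
    return math.sqrt(sum((actual - pred) ** 2 for actual, pred in pairs) / len(pairs))

def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommendations that the user actually liked."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

# Hypothetical held-out data: (actual rating, predicted rating)
held_out = [(5, 4.5), (3, 3.5), (4, 4.0), (2, 3.0)]
print(round(rmse(held_out), 3))

# Hypothetical ranked list and the set of movies the user really liked
print(precision_at_k(["Titanic", "Shrek", "Avengers"], {"Titanic", "Avengers"}, k=2))
```

A large gap between the training score and the held-out score is the usual sign of overfitting.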

7. How do I update recommendations in real time without retraining the entire model?

For real-time updates, use incremental learning models that allow partial retraining, like online variants of matrix factorization. Alternatively, you can precompute user or item embeddings and cache results for quick lookups. Systems like FAISS, Annoy, or Milvus are useful for performing fast approximate nearest neighbor searches on large-scale vector data. You can also decouple model training and serving pipelines using message brokers like Kafka.

8. Can I use knowledge graphs to improve recommendations beyond ratings and genres?

Yes, knowledge graphs enhance recommendations with contextual relationships. For instance, if a user likes a movie by a particular director, the system can recommend others from the same director—even if there’s no rating history. Tools like Neo4j or RDF-based triple stores let you model entities (actors, studios, themes) and their links. You can also combine this graph with collaborative filtering for hybrid recommendation logic.

9. How do you test if a recommendation system is actually improving user satisfaction?

Offline metrics like RMSE or AUC are a good start, but they don’t always reflect real user engagement. To truly evaluate impact, use A/B testing in production. Randomly assign users to different recommendation models and track metrics like click-through rate (CTR), watch time, conversion, and churn reduction. Pair that with user surveys or feedback mechanisms to validate qualitative satisfaction and long-term retention.
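
When you compare click-through rates between two variants, a two-proportion z-test is one common way to check whether the difference is more than noise. The click and view counts below are hypothetical.

```python
import math

def ctr_z_test(clicks_a, views_a, clicks_b, views_b):
    """Two-proportion z-test comparing the click-through rates of variants A and B."""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a, p_b, z, p_value

# Hypothetical experiment: current model (A) vs new model (B)
p_a, p_b, z, p_value = ctr_z_test(clicks_a=200, views_a=10_000, clicks_b=260, views_b=10_000)
print(f"CTR A={p_a:.2%}, CTR B={p_b:.2%}, z={z:.2f}, p={p_value:.4f}")
```

A small p-value suggests the lift is unlikely to be chance, but you should still confirm it against longer-term metrics like retention before rolling out the new model.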

10. Is it possible to combine both collaborative and content-based filtering in a production system?

Yes, that’s called a hybrid recommendation system, and it’s used by major platforms like Netflix. You can combine scores from both methods using weighted averaging, build meta-models that learn when to trust which signal, or even switch strategies based on the user type (cold-start vs. power user). Hybrid systems tend to outperform standalone models as they blend user behavior with item metadata and semantics.

11. How do I scale a recommendation engine to handle millions of users and movies?

Scalability is a key challenge. You can use distributed computing frameworks like Apache Spark MLlib to handle large-scale training, and vector search engines like FAISS for real-time recommendation retrieval. Break your system into microservices, with separate APIs for training, updating, and serving. Use feature stores like Feast to standardize data pipelines. Caching, batching predictions, and asynchronous processing will also help reduce latency and cost at scale.

Rohit Sharma

844 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
