Movie Rating Analysis Project in R

By Rohit Sharma

Updated on Jul 29, 2025 | 26 min read | 1.54K+ views


Most of us check movie ratings and reviews online before deciding what to watch. Many factors feed into those ratings, so let's build a project in R to see how it works. This movie rating analysis project analyzes a movie dataset to understand what drives higher audience ratings for movies.

We will use R in Google Colab, perform tasks like data cleaning, visualizations, feature engineering, and build machine learning models to predict average movie ratings and classify highly rated movies. 

In this movie rating analysis project, we will use tools like ggplot2, dplyr, randomForest, and pROC to understand patterns and evaluate model performance.

Jump into the world of data science with upGrad’s top online data science courses. No classrooms. No limits. Just real skills, real projects, and real career growth. Your future in data starts right here.

Looking for More Practice in R? Don’t Miss This:  Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025

What You Should Know Before Starting This Movie Rating Analysis Project 

Before starting this project, there are a few things you should know.

  • You should understand basic R syntax and how to work with data frames (data.frame/tibble).
  • You need to be familiar with dplyr and ggplot2 for data cleaning, transformation, and visualization.
  • You must know how to handle missing values and create new features through basic feature engineering.
  • You should understand the difference between regression and classification, and when to apply each.
  • You need to know how to split data, train models like Random Forest, and evaluate performance using metrics like RMSE, Accuracy, and AUC.

Step into the future with expert-led courses that cover it all: analytics, machine learning, and generative AI. Start your data science career journey now!

Tools and R Libraries You'll Be Using in This Project

The following tools will be used in this project for various purposes.

| Category | Tool / Library | Purpose |
| --- | --- | --- |
| Environment | Google Colab (R) | Cloud-based coding platform to run R interactively |
| Data Manipulation | dplyr | Filtering, transforming, and summarizing data |
| Visualization | ggplot2 | Creating informative and aesthetic visualizations |
| Date Handling | lubridate | Parsing and extracting components from date columns |
| Data Cleaning | janitor | Cleaning column names and simplifying messy datasets |
| Modeling | randomForest | Building regression and classification models |
| Evaluation | pROC | Calculating AUC and plotting ROC curves |
| Core Tidyverse | tidyverse | Includes readr, tibble, stringr, and more for the general workflow |

Project Duration, Difficulty, and Skill Level Required

This project requires the following things:

| Aspect | Details |
| --- | --- |
| Estimated Duration | 4–6 hours (spread across 1–2 sessions, depending on pace) |
| Difficulty Level | Beginner to Lower-Intermediate |
| Skill Level Required | Basic understanding of R, data manipulation, and a general interest in data analysis |

You Can’t Miss This: Machine Learning with R: Everything You Need to Know

Step-by-Step Movie Ratings Analysis Project Breakdown 

In this section, we’ll break down the process of creating this movie rating analysis project, step by step, along with the code and output involved.

Step 1: Set up R in Google Colab & download the dataset

We need a data source; movie datasets like this one can be found on websites such as Kaggle. Download the required dataset, then open Colab and do the following (a short check for getting the file into your session is sketched after these steps):

  1. Open Google Colab, and create a New Notebook.
  2. Switch the runtime to R: click Runtime ▸ Change runtime type ▸ set the Language to R ▸ click Save.
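
Once the runtime is set to R, upload the downloaded CSV through the Files pane in Colab's left sidebar (the file name used later in this project is Movie-Dataset-Latest.csv). A quick, optional check in R confirms the file is visible to your session before you try to read it:

# Confirm the uploaded dataset is available in the working directory
file.exists("Movie-Dataset-Latest.csv")   # should return TRUE once the upload finishes
list.files(pattern = "\\.csv$")           # lists all CSV files Colab can currently see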

Step 2: Installing and Loading Required R Packages

This step checks and installs any missing R packages you'll need for the project, and then loads them into your session. These libraries are important as they are involved in data cleaning, visualization, model building, and evaluation. Here’s the code:

# -------------------------------
# 1) INSTALL & LOAD PACKAGES
# -------------------------------
# Helper: install only what's missing
need <- c("tidyverse", "lubridate", "skimr", "janitor", "broom", "GGally", "pROC")
to_install <- setdiff(need, rownames(installed.packages()))
if (length(to_install)) install.packages(to_install)  # Installs only missing packages
# Load them
library(tidyverse)  # Core data science tools: dplyr, ggplot2, readr, etc.
library(lubridate)  # Makes it easy to work with date columns
library(skimr)      # Provides quick summary stats of your dataset
library(janitor)    # Cleans messy column names, helps explore categorical vars
library(broom)      # Tidies up model output for easier analysis
library(GGally)     # Optional: for quick EDA (pair plots)
library(pROC)       # Used to evaluate classification models (AUC, ROC)

Running this produces output confirming that the packages were installed and loaded:

Installing packages into ‘/usr/local/lib/R/site-library’

(as ‘lib’ is unspecified)

also installing the dependencies ‘patchwork’, ‘snakecase’, ‘ggstats’, ‘S7’, ‘plyr’

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──

✔ dplyr    1.1.4     ✔ readr    2.1.5

✔ forcats  1.0.0     ✔ stringr  1.5.1

✔ ggplot2  3.5.2     ✔ tibble   3.3.0

✔ lubridate 1.9.4     ✔ tidyr    1.3.1

✔ purrr    1.1.0    

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──

✖ dplyr::filter() masks stats::filter()

✖ dplyr::lag()    masks stats::lag()

ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘janitor’

The following objects are masked from ‘package:stats’:

   chisq.test, fisher.test

Type 'citation("pROC")' for a citation.

Attaching package: ‘pROC’

The following objects are masked from ‘package:stats’:

   cov, smooth, var

Step 3: Reading the Dataset into R

This step will load the uploaded movie dataset into your R session and will give us a quick glimpse of the structure and sample rows to confirm it has loaded correctly. The code for this section is:

# -------------------------------
# 2) READ THE DATA
# -------------------------------
# Set your file name (change the path if needed)
file_path <- "Movie-Dataset-Latest.csv"  # Name of your uploaded CSV file
movies_raw <- readr::read_csv(file_path)  # Load the CSV into R as a tibble
# Take a quick look
glimpse(movies_raw)      # Check column names, types, and structure
head(movies_raw, 3)      # View the first 3 rows to verify content

The above code will give us the output as:

New names:

`` -> `...1`

Rows: 9463 Columns: 9

── Column specification ────────────────────────────────────────────────────────

Delimiter: ","

chr  (2): title, overview

dbl  (5): ...1, id, popularity, vote_average, vote_count

lgl  (1): video

date (1): release_date

Use `spec()` to retrieve the full column specification for this data.

Specify the column types or set `show_col_types = FALSE` to quiet this message.

Rows: 9,463

Columns: 9

$ ...1         <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ id           <dbl> 19404, 278, 238, …
$ title        <chr> "Dilwale Dulhania Le Jayenge", "The Shawshank Redemption", …
$ release_date <date> 1995-10-20, 1994-09-23, 1972-03-14, 2020-07-31, …
$ overview     <chr> "Raj is a rich, carefree, happy-go-lucky second generatio…", …
$ popularity   <dbl> 25.884, 60.110, 62.784, 28.316, 38.661, …
$ vote_average <dbl> 8.7, 8.7, 8.7, 8.6, 8.6, 8.6, …
$ vote_count   <dbl> 3304, 20369, 15219, …
$ video        <lgl> FALSE, FALSE, FALSE, FALSE, …

The first three rows returned by head() look like this:

| ...1 | id | title | release_date | popularity | vote_average | vote_count | video |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 19404 | Dilwale Dulhania Le Jayenge | 1995-10-20 | 25.884 | 8.7 | 3304 | FALSE |
| 1 | 278 | The Shawshank Redemption | 1994-09-23 | 60.110 | 8.7 | 20369 | FALSE |
| 2 | 238 | The Godfather | 1972-03-14 | 62.784 | 8.7 | 15219 | FALSE |

The overview column holds each movie's plot summary, shown separately here for readability:

  • Dilwale Dulhania Le Jayenge: "Raj is a rich, carefree, happy-go-lucky second generation NRI. Simran is the daughter of Chaudhary Baldev Singh, who in spite of being an NRI is very strict about adherence to Indian values. Simran has left for India to be married to her childhood fiancé. Raj leaves for India with a mission at his hands, to claim his lady love under the noses of her whole family. Thus begins a saga."
  • The Shawshank Redemption: "Framed in the 1940s for the double murder of his wife and her lover, upstanding banker Andy Dufresne begins a new life at the Shawshank prison, where he puts his accounting skills to work for an amoral warden. During his long stretch in prison, Dufresne comes to be admired by the other inmates -- including an older prisoner named Red -- for his integrity and unquenchable sense of hope."
  • The Godfather: "Spanning the years 1945 to 1955, a chronicle of the fictional Italian-American Corleone crime family. When organized crime family patriarch, Vito Corleone barely survives an attempt on his life, his youngest son, Michael steps in to take care of the would-be killers, launching a campaign of bloody revenge."

Step 4: Cleaning and Basic Preprocessing

In this step, we clean up the column names, remove unnecessary columns, safely parse dates, and create new features like year and decade. We also check for missing values to prepare the dataset for analysis. The code is given below:

# -------------------------------
# CLEANING & BASIC PREP (robust)
# -------------------------------
# Peek at the raw names so you can see what's actually there
names(movies_raw)
movies <- movies_raw %>%
  janitor::clean_names() %>%                           # Convert column names to snake_case
  select(-tidyselect::any_of(c("unnamed_0", "x1"))) %>%# Drop unwanted columns if they exist
  mutate(
    # Parse release_date safely (ymd ignores if already Date)
    release_date = lubridate::ymd(release_date),
    year         = lubridate::year(release_date),      # Extract release year
    decade       = floor(year / 10) * 10,              # Derive decade from year
    # Convert 'video' to factor if it exists
    video        = if ("video" %in% names(.)) as.factor(video) else NULL
  )
glimpse(movies)  # Check cleaned structure and new columns
# Check missing counts across all columns
movies %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  t()  # Transpose for easier reading

The output for the above section is:

  • '...1'
  • 'id'
  • 'title'
  • 'release_date'
  • 'overview'
  • 'popularity'
  • 'vote_average'
  • 'vote_count'
  • 'video'

Rows: 9,463

Columns: 10

$ id           <dbl> 19404, 278, 238, …
$ title        <chr> "Dilwale Dulhania Le Jayenge", "The Shawshank Redemption", …
$ release_date <date> 1995-10-20, 1994-09-23, 1972-03-14, 2020-07-31, …
$ overview     <chr> "Raj is a rich, carefree, happy-go-lucky second generatio…", …
$ popularity   <dbl> 25.884, 60.110, 62.784, 28.316, 38.661, …
$ vote_average <dbl> 8.7, 8.7, 8.7, 8.6, 8.6, 8.6, …
$ vote_count   <dbl> 3304, 20369, 15219, …
$ video        <fct> FALSE, FALSE, FALSE, FALSE, …
$ year         <dbl> 1995, 1994, 1972, 2020, 1993, …
$ decade       <dbl> 1990, 1990, 1970, 2020, 1990, …

The missing-value counts per column are:

id             0
title          0
release_date   0
overview      14
popularity     0
vote_average   0
vote_count     0
video          0
year           0
decade         0

The above output shows that, after cleaning and preprocessing:

  • The dataset has 9,463 rows and 10 columns.
  • Columns include key features like title, release_date, popularity, vote_average, vote_count, and video.
  • Two new columns were successfully added:
    • year – extracted from the release_date
    • decade – derived from year for broader trend analysis
  • The column names were cleaned to snake_case (e.g., vote_average, not voteAverage)
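
The only column with missing values is overview (14 rows). We don't use it for modeling, but if you prefer a dataset with no NAs in the text columns, one optional approach (a small sketch, not required for the rest of the project) is to replace the missing overviews with empty strings:

# Optional: replace missing overviews with empty strings
movies <- movies %>%
  mutate(overview = dplyr::coalesce(overview, ""))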

Step 5: Exploring the Distribution of Movie Ratings

In this step, we visualize how movie ratings (vote_average) are distributed across the dataset. This helps us understand if most movies are rated highly, poorly, or somewhere in between. The code for this step is:

# -------------------------------
# 4.1 DISTRIBUTION OF RATINGS
# -------------------------------
ggplot(movies, aes(x = vote_average)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +  # Histogram with 30 bins
  labs(
    title = "Distribution of Movie Ratings",  # Chart title
    x = "Average Rating",                     # X-axis label
    y = "Number of Movies"                    # Y-axis label
  ) +
  theme_minimal()  # Clean, minimal visual style

This step gives us a graphical representation of the distribution of movie ratings:

Step 6: Visualizing the Most Voted Movies

This step helps identify the top 15 movies that received the highest number of votes. It gives us an idea of which movies had the most audience engagement or visibility. The code for this step is:

# -------------------------------
# 4.2 MOST VOTED MOVIES
# -------------------------------
movies %>%
  arrange(desc(vote_count)) %>%                   # Sort movies by highest vote count
  slice_head(n = 15) %>%                          # Take the top 15
  ggplot(aes(x = reorder(title, vote_count), y = vote_count)) +  # Reorder titles by vote count
  geom_col(fill = "darkorange") +                 # Bar chart with orange bars
  coord_flip() +                                  # Flip coordinates for horizontal bars
  labs(
    title = "Top 15 Most Voted Movies",           # Chart title
    x = "Movie Title",                            # X-axis label
    y = "Vote Count"                              # Y-axis label
  ) +
  theme_minimal()                                 # Clean visual style

The output for this step gives us a graphical representation of the top 15 most voted movies.


Something For You: R For Data Science: Why Should You Choose R for Data Science?

Step 7: Top Rated Movies with Minimum 1000 Votes

This step highlights the top 15 movies with the highest average ratings, but only includes those that received at least 1000 votes. This helps avoid bias from little-known movies with very few ratings. The code for this step is:

# -------------------------------
# 4.3 TOP RATED MOVIES (MIN 1000 VOTES)
# -------------------------------
min_votes <- 1000  # to avoid including unknown films with 1 vote
top_rated <- movies %>%
  filter(vote_count >= min_votes) %>%             # Keep only movies with 1000+ votes
  arrange(desc(vote_average)) %>%                 # Sort by highest rating
  slice_head(n = 15)                              # Pick top 15
ggplot(top_rated, aes(x = reorder(title, vote_average), y = vote_average)) +
  geom_col(fill = "seagreen") +                   # Green horizontal bars
  coord_flip() +                                  # Flip for horizontal layout
  labs(
    title = paste("Top 15 Highest Rated Movies (vote_count >=", min_votes, ")"),  # Dynamic title
    x = "Movie Title",                            # X-axis label
    y = "Average Rating"                          # Y-axis label
  ) +
  theme_minimal()                                 # Clean style

The output gives us a graph depicting the top 15 highest rated movies, which have received a minimum of 1000 votes.

Step 8: Tracking Average Ratings Over Time

This step shows how the average movie rating has changed over the years. It helps identify any trends, such as whether movies are being rated more favorably or harshly over time. The code is given below:

# -------------------------------
# 4.4 AVERAGE RATING OVER TIME
# -------------------------------
rating_year <- movies %>%
  filter(!is.na(year)) %>%                        # Remove rows with missing years
  group_by(year) %>%                              # Group by release year
  summarise(
    avg_rating = mean(vote_average, na.rm = TRUE),# Calculate average rating
    n = n()                                       # Count movies per year
  ) %>%
  filter(n >= 5)  # keep years with at least 5 movies to avoid unreliable averages
ggplot(rating_year, aes(x = year, y = avg_rating)) +
  geom_line(color = "steelblue") +                # Line for rating trend
  geom_point() +                                  # Add points for each year
  labs(
    title = "Average Movie Rating by Year",       # Chart title
    x = "Year",                                   # X-axis label
    y = "Average Rating"                          # Y-axis label
  ) +
  theme_minimal()                                 # Clean visual theme

This gives us a graph that shows the average rating over time for various movies:

Step 9: Popularity vs. Vote Count (Log Scale)

This step looks at the relationship between a movie’s popularity score and its vote count. Both axes use a logarithmic scale to better visualize patterns across a wide range of values. The code for this step is:

# -------------------------------
# 4.5 POPULARITY vs VOTE COUNT
# -------------------------------
ggplot(movies, aes(x = popularity, y = vote_count)) +
  geom_point(alpha = 0.5, color = "purple") +      # Scatter plot with semi-transparent purple points
  scale_x_log10() +                                # Log scale for popularity (x-axis)
  scale_y_log10() +                                # Log scale for vote count (y-axis)
  labs(
    title = "Popularity vs Vote Count (Log-Log Scale)",  # Chart title
    x = "Popularity (log)",                       # X-axis label
    y = "Vote Count (log)"                        # Y-axis label
  ) +
  theme_minimal()                                  # Clean theme

This gives us a graph on the comparison between popularity and vote count.

The above output shows:

  • We can see that more popular movies usually get more votes, but it’s not a perfect rule.
  • Most movies have low popularity and fewer votes, while only a few are very popular and highly voted.
  • The graph uses log scales to make it easier to see patterns among both small and large values.
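
If you want to put a number on this relationship, a quick optional check (using a log10 transform with +1 to avoid taking the log of zero) is to compute the correlation on the log scale:

# Correlation between popularity and vote count on the log scale
with(movies, cor(log10(popularity + 1), log10(vote_count + 1), use = "complete.obs"))

A value close to 1 would indicate a near-perfect positive relationship; the spread visible in the scatter plot suggests the correlation is positive but far from perfect.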

Sharpen Your R Skills With This R Language Tutorial

Step 10: Feature Engineering

In this step, we will create new columns from existing data to help with future modeling tasks. These include a log-transformed vote count, a binary flag for high ratings, and a rating category (Low, Average, High). The code is:

# -------------------------------
# 5.1 FEATURE ENGINEERING
# -------------------------------
movies_fe <- movies %>%
  mutate(
    log_vote_count = log1p(vote_count),         # Use log(1 + x) to avoid log(0) issues
    high_rating    = vote_average >= 8,         # TRUE if rating is 8 or higher (for classification)
    rating_bucket  = case_when(                 # Group ratings into categories
      vote_average < 5  ~ "Low",
      vote_average < 7  ~ "Average",
      TRUE              ~ "High"
    )
  ) %>%
  mutate(rating_bucket = factor(rating_bucket, levels = c("Low", "Average", "High")))  # Ordered factor
# Check the new structure
glimpse(movies_fe)  # See new columns
# Count how many movies fall into each rating bucket
movies_fe %>%
  count(rating_bucket)

The output is:

Rows: 9,463

Columns: 13

$ id             <dbl> 19404, 278, 238, …
$ title          <chr> "Dilwale Dulhania Le Jayenge", "The Shawshank Redemptio…", …
$ release_date   <date> 1995-10-20, 1994-09-23, 1972-03-14, 2020-07-31, …
$ overview       <chr> "Raj is a rich, carefree, happy-go-lucky second generat…", …
$ popularity     <dbl> 25.884, 60.110, 62.784, 28.316, 38.661, …
$ vote_average   <dbl> 8.7, 8.7, 8.7, 8.6, 8.6, 8.6, …
$ vote_count     <dbl> 3304, 20369, 15219, …
$ video          <fct> FALSE, FALSE, FALSE, FALSE, …
$ year           <dbl> 1995, 1994, 1972, 2020, 1993, …
$ decade         <dbl> 1990, 1990, 1970, 2020, 1990, …
$ log_vote_count <dbl> 8.103193, 9.921819, 9.630366, 7.215975, 9.405825, …
$ high_rating    <lgl> TRUE, TRUE, TRUE, TRUE, …
$ rating_bucket  <fct> High, High, High, High, …

The counts per rating bucket are:

| rating_bucket | n |
| --- | --- |
| Low | 264 |
| Average | 5894 |
| High | 3305 |

The above output means that three new columns were added:

  • log_vote_count: a log-scaled version of vote_count, useful for modeling.
  • high_rating: a TRUE/FALSE flag indicating whether a movie has a rating of 8 or higher.
  • rating_bucket: categorizes movies as Low (below 5), Average (5 to below 7), or High (7 and above).

Distribution of rating_bucket:

  • Low: 264 movies
  • Average: 5,894 movies
  • High: 3,305 movies

Step 11: Distribution of Vote Counts (Raw vs. Log Transformed)

This step helps us understand the distribution of vote_count before and after applying log transformation. It shows how log scaling helps reduce skewness and makes the data more suitable for modeling. The code for this step is:

# -------------------------------
# 5.2.1 DISTRIBUTION OF VOTE COUNTS
# -------------------------------

par(mfrow = c(1, 2))  # Show two plots side by side

# Histogram of raw vote counts
hist(movies$vote_count, breaks = 30, 
     main = "Vote Count (Raw)", col = "tomato", xlab = "vote_count")

# Histogram of log-transformed vote counts
hist(movies_fe$log_vote_count, breaks = 30, 
     main = "Vote Count (Log Transformed)", col = "steelblue", xlab = "log_vote_count")

par(mfrow = c(1, 1))  # Reset plot layout to default

The output gives us a graph:

The above graphs show:

  • On the left, the raw vote_count is heavily right-skewed: most movies have very few votes, while only a handful have a lot.
  • On the right, log_vote_count is far more balanced. The log transformation spreads the data out and makes it easier to work with in models.
  • This helps machine learning models pick up patterns without being dominated by extreme values (a quick numeric illustration of the transformation follows below).
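
To see what log1p (the log of 1 plus x) does to skewed counts, here is a tiny optional illustration run on a few made-up vote counts:

# log1p compresses large counts far more than small ones
log1p(c(0, 10, 100, 1000, 10000))
# Roughly 0.00, 2.40, 4.62, 6.91, 9.21: going from 1,000 to 10,000 votes adds only about 2.3 on the log scale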

Must Read For Beginners: Data Preprocessing in Machine Learning: 11 Key Steps You Must Know!

Step 12: Rating Bucket Counts

In this section, we will create a plot that shows how movies are distributed across the three rating categories we created earlier: Low, Average, and High.

# -------------------------------
# 5.2.2 RATING BUCKET COUNTS
# -------------------------------

ggplot(movies_fe, aes(x = rating_bucket)) +
  geom_bar(fill = "goldenrod") +
  labs(
    title = "Number of Movies by Rating Bucket",
    x = "Rating Category",
    y = "Number of Movies"
  ) +
  theme_minimal()

We get a graph like this in the output:

This graph tells us that:

  • The Average rating category dominates, meaning most movies are rated between 5 and 7.
  • A smaller number of movies fall into the High (≥7) or Low (<5) buckets.
  • This helps us understand the class balance before applying classification models.
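
Class balance matters even more for the high_rating flag we will classify later, since far fewer movies score 8 or above. A quick optional check of the proportions looks like this:

# Share of movies flagged as highly rated (vote_average >= 8) vs. not
movies_fe %>%
  count(high_rating) %>%
  mutate(prop = round(n / sum(n), 3))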

Step 13: Popularity by Rating Bucket

In this step, we will create a plot that compares the average popularity of movies in each rating category (Low, Average, High). We’ll use the code:

# -------------------------------
# 5.2.3 POPULARITY BY RATING BUCKET
# -------------------------------

movies_fe %>%
  group_by(rating_bucket) %>%
  summarise(avg_popularity = mean(popularity, na.rm = TRUE)) %>%
  ggplot(aes(x = rating_bucket, y = avg_popularity, fill = rating_bucket)) +
  geom_col() +
  labs(
    title = "Average Popularity by Rating Bucket",
    x = "Rating Category",
    y = "Average Popularity"
  ) +
  theme_minimal()

This code will give us the output as:

What This Shows:

  • We can see which rating group tends to have more buzz or attention (as measured by popularity).
  • If highly rated movies are also more popular, that could suggest a positive correlation between quality and attention.
  • If Average movies are most popular, it might indicate mass appeal over critical acclaim.
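
Because popularity is heavily right-skewed (a few blockbusters have very large scores), the mean can be pulled up by outliers. An optional variation of the same summary uses the median instead, which is more robust:

# Median popularity per rating bucket (less sensitive to outliers than the mean)
movies_fe %>%
  group_by(rating_bucket) %>%
  summarise(median_popularity = median(popularity, na.rm = TRUE))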

Step 14: Train/Test Split

This step prepares your data for machine learning by splitting it into two parts: training data (to build the model) and testing data (to evaluate how well it works). We’ll use the code:

# -------------------------------
# 6.1 TRAIN/TEST SPLIT
# -------------------------------
set.seed(42)  # for reproducibility
# Drop rows with missing model features
movies_model <- movies_fe %>%
  drop_na(vote_average, popularity, log_vote_count, year)
# Split into 80% train, 20% test
n <- nrow(movies_model)
train_index <- sample(seq_len(n), size = 0.8 * n)
train <- movies_model[train_index, ]
test  <- movies_model[-train_index, ]
cat("Train size:", nrow(train), " | Test size:", nrow(test), "\n")

This will give the output:

Train size: 7570  | Test size: 1893 

Which means that:

  • 7,570 rows will be used for training the model
  • 1,893 rows will be used for testing how well it performs

Supercharge Your Learning in R. Explore This Blog: Best R Libraries Data Science: Tools for Analysis, Visualization & ML

Step 15: Linear Regression Model (Baseline)

In this step, we're building a simple linear regression model to predict a movie's average rating using three features: popularity, log of vote count, and release year. We’ll use the code:

# -------------------------------
# 6.2 LINEAR REGRESSION
# -------------------------------
lm_model <- lm(vote_average ~ popularity + log_vote_count + year, data = train)
# Summary of model coefficients
summary(lm_model)

This gives the output:

Call:

lm(formula = vote_average ~ popularity + log_vote_count + year,

   data = train)

Residuals:
    Min      1Q  Median      3Q     Max
-4.0309 -0.4821  0.0175  0.5186  2.3704

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     3.437e+01  1.040e+00   33.06  < 2e-16 ***
popularity      1.975e-04  3.387e-05    5.83 5.77e-09 ***
log_vote_count  2.229e-01  8.284e-03   26.91  < 2e-16 ***
year           -1.461e-02  5.212e-04  -28.03  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7547 on 7566 degrees of freedom
Multiple R-squared:  0.1576, Adjusted R-squared:  0.1573
F-statistic: 471.8 on 3 and 7566 DF,  p-value: < 2.2e-16

This means that:

  • Movies with more votes and higher popularity tend to have better ratings (both coefficients are positive and highly significant).
  • Newer movies tend to have slightly lower ratings (the year coefficient is negative).
  • With an R² of about 0.16, these three features explain only a modest share of the variation in ratings.

Step 16: Evaluate the Linear Regression Model

In this step, we predict ratings using the linear model and calculate how close the predictions are to the actual ratings. The code for this step is:

# Predict ratings
test$pred_rating <- predict(lm_model, newdata = test)
# Define metrics
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))  # Root Mean Square Error
mae  <- function(actual, pred) mean(abs(actual - pred))        # Mean Absolute Error
r2   <- function(actual, pred) cor(actual, pred)^2             # R-squared
# Calculate metrics
lm_rmse <- rmse(test$vote_average, test$pred_rating)
lm_mae  <- mae(test$vote_average, test$pred_rating)
lm_r2   <- r2(test$vote_average, test$pred_rating)
# Print results (these lines produce the output shown below)
cat("Linear Regression Results\n")
cat("RMSE :", round(lm_rmse, 3), "\n")
cat("MAE  :", round(lm_mae, 3), "\n")
cat("R^2  :", round(lm_r2, 3), "\n")

This gives the output as:

Linear Regression Results

RMSE : 0.763

MAE  : 0.607

R^2  : 0.114 

The above output means that:

  • On average, the rating predictions are off by about 0.6 to 0.7 points.
  • With an R² of 0.114 on the test set, the model explains only about 11% of the variation in ratings, so there is plenty of room for improvement.
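
To make the model concrete, here is a small optional example that predicts the rating of a single hypothetical movie (the feature values below are made up purely for illustration):

# Predict the rating for one hypothetical movie
new_movie <- data.frame(
  popularity     = 50,            # hypothetical popularity score
  log_vote_count = log1p(5000),   # hypothetical: 5,000 votes, log-transformed like the training data
  year           = 2015           # hypothetical release year
)
predict(lm_model, newdata = new_movie)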

Step 17: Logistic Regression (for Classification)

We're now predicting whether a movie is highly rated (8 or above) using logistic regression. The code for this is:

glm_model <- glm(high_rating ~ popularity + log_vote_count + year,
                 data = train, family = binomial)
summary(glm_model)

The output for the above step is:

Call:

glm(formula = high_rating ~ popularity + log_vote_count + year,

   family = binomial, data = train)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)     68.2042129  5.5443763  12.302  < 2e-16 ***
popularity       0.0003712  0.0001180   3.147  0.00165 **
log_vote_count   0.7791792  0.0526512  14.799  < 2e-16 ***
year            -0.0384784  0.0028399 -13.549  < 2e-16 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

        Null deviance: 2505.7  on 7569  degrees of freedom

Residual deviance:  2171.9   on 7566  degrees of freedom

AIC: 2179.9

Number of Fisher Scoring iterations: 7

The above output means that: 

  • More votes and higher popularity increase the chance of being highly rated.
  • Newer movies are slightly less likely to be highly rated.
  • All three features (popularity, vote count, year) are statistically significant; their p-values are well below 0.05. (To read the coefficients as odds ratios, see the short sketch below.)
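
Because logistic regression coefficients are on the log-odds scale, an optional way to make them easier to interpret is to exponentiate them into odds ratios (values above 1 increase the odds of a high rating, values below 1 decrease them):

# Convert log-odds coefficients to odds ratios
exp(coef(glm_model))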

Upscale Your Knowledge. Read Now: 18 Types of Regression in Machine Learning You Should Know

Step 18: Evaluate Logistic Regression Model

This step checks how well our model predicts if a movie is highly rated (yes/no). The code for this step is:

# Predict probability of being highly rated
test$prob_high <- predict(glm_model, newdata = test, type = "response")
# Predict TRUE/FALSE based on 0.5 threshold
test$pred_high <- test$prob_high >= 0.5
# Accuracy: how many predictions were correct
accuracy <- mean(test$pred_high == test$high_rating)
# AUC: measures how well the model separates TRUE from FALSE
roc_obj <- roc(response = test$high_rating, predictor = test$prob_high)
auc_val <- auc(roc_obj)
# Print results (these lines produce the output shown below)
cat("📊 Logistic Regression Results\n")
cat("Accuracy:", round(accuracy, 3), "\n")
cat("AUC     :", round(as.numeric(auc_val), 3), "\n")

The output is:

Setting levels: control = FALSE, case = TRUE

Setting direction: controls < cases

📊 Logistic Regression Results

Accuracy: 0.957 

AUC     : 0.638 

This means that:

  • Accuracy = 0.957 → the model got 95.7% of predictions correct. Keep in mind that most movies are not highly rated, so even a model that always predicts "not high" would score close to this (see the quick check below).
  • AUC = 0.638 → the model is only moderately good at separating high-rated from not-high-rated movies.
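
To see how much of that accuracy comes from class imbalance, an optional check compares the model against an "always predict FALSE" baseline and inspects the confusion matrix (at this point test$high_rating is still a logical TRUE/FALSE column):

# Baseline: accuracy of always predicting "not highly rated"
mean(!test$high_rating)

# Confusion matrix for the logistic regression predictions
table(Predicted = test$pred_high, Actual = test$high_rating)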

Step 19: Install and Load Random Forest Packages

In this step, we make sure all the required libraries for building and evaluating Random Forest models are installed and loaded. These packages will help us with model training, predictions, and evaluation metrics. The code used is:

# -------------------------------------------------
# 0) INSTALL & LOAD
# -------------------------------------------------
need <- c("randomForest", "pROC", "tidyverse")
to_install <- setdiff(need, rownames(installed.packages()))
if (length(to_install)) install.packages(to_install)
library(randomForest)
library(pROC)
library(tidyverse)

Upon successfully installing the package, we get the output as:

Installing package into ‘/usr/local/lib/R/site-library’

(as ‘lib’ is unspecified)

randomForest 4.7-1.2

Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:dplyr’:

   combine

The following object is masked from ‘package:ggplot2’:

   margin

Step 20: Prepare Data and Create Train/Test Split

Before training our Random Forest models, we need to prepare the data properly. In this step, we select relevant features, handle missing values, and split the dataset into training and testing sets. We’ll use the code:

# -------------------------------------------------
# 1) PREP DATA & SPLIT
# -------------------------------------------------
set.seed(123)
# Drop rows with NAs in the columns used for modeling
movies_model <- movies_fe %>%
  drop_na(vote_average, popularity, log_vote_count, year, high_rating)
# Make the classification target a factor (randomForest needs a factor for classification)
movies_model <- movies_model %>%
  mutate(high_rating = factor(high_rating, levels = c(FALSE, TRUE), labels = c("No", "Yes")))
# 80/20 split
n <- nrow(movies_model)
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- movies_model[train_idx, ]
test  <- movies_model[-train_idx, ]
cat("Train size:", nrow(train), " | Test size:", nrow(test), "\n")

This gives the output:

Train size: 7570  | Test size: 1893 

The above output means that:

  • Train size: 7570
    We randomly selected 80% of the data (7,570 movies) to train our model. This is where the model learns patterns from known data.
  • Test size: 1893
    The remaining 20% (1,893 movies) is reserved as unseen data to test how well the model performs on new examples.

Want to Upskill? Read This:  What Is Data Acquisition: Key Components & Role in Machine Learning

Step 21: Random Forest – Regression

In this step, we build a Random Forest regression model to predict the average movie rating using key features like popularity, number of votes (log-transformed), and release year. Random Forest improves accuracy by combining multiple decision trees and reduces overfitting.

# -------------------------------------------------
# 2) RANDOM FOREST - REGRESSION
# -------------------------------------------------
# Formula: pick a few sensible predictors (you can add more later)
rf_reg <- randomForest(
  vote_average ~ popularity + log_vote_count + year,
  data = train,
  ntree = 500,       # number of trees
  mtry  = 2,         # number of variables tried at each split (tune this)
  importance = TRUE  # so we can plot variable importance
)
print(rf_reg)

The above code will give the output:

Call:

randomForest(formula = vote_average ~ popularity + log_vote_count +      year, data = train, ntree = 500, mtry = 2, importance = TRUE)

              Type of random forest: regression

                    Number of trees: 500

No. of variables tried at each split: 2

         Mean of squared residuals: 0.5491158

                   % Var explained: 18.19

The above output means that:

  • Model Type: A regression random forest was trained using 500 trees to predict movie ratings.
  • Error: The model's average squared error on training data is 0.549, showing a decent fit.
  • Performance: It explains 18.19% of the variation in ratings — slightly better than linear regression (which explained ~11%).
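
With only three predictors there is not much to tune, but if you want to check whether a different mtry helps, an optional quick comparison of the out-of-bag error looks like this (for regression forests, fit$mse stores the OOB mean squared error after each tree):

# Compare OOB error for each possible mtry value (1 to 3 predictors per split)
set.seed(123)
for (m in 1:3) {
  fit <- randomForest(
    vote_average ~ popularity + log_vote_count + year,
    data  = train,
    ntree = 300,
    mtry  = m
  )
  cat("mtry =", m, "| OOB MSE:", round(tail(fit$mse, 1), 4), "\n")
}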

Step 22: Evaluate Random Forest Regression

In this step, we test how well the Random Forest regression model predicts movie ratings on unseen data using common evaluation metrics. We’ll use the following code:

# -------------------------------------------------
# 2A) EVALUATE - REGRESSION
# -------------------------------------------------
# Predict on test
test$rf_pred_rating <- predict(rf_reg, newdata = test)
# Metrics
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
mae  <- function(actual, pred) mean(abs(actual - pred))
r2   <- function(actual, pred) cor(actual, pred)^2
rf_rmse <- rmse(test$vote_average, test$rf_pred_rating)
rf_mae  <- mae(test$vote_average, test$rf_pred_rating)
rf_r2   <- r2(test$vote_average, test$rf_pred_rating)
cat("📊 Random Forest Regression (Test)\n")
cat("RMSE :", round(rf_rmse, 3), "\n")
cat("MAE  :", round(rf_mae, 3), "\n")
cat("R^2  :", round(rf_r2, 3), "\n")

The output is given below:

Random Forest Regression (Test)

RMSE : 0.723

MAE  : 0.562

R^2  : 0.233 

This means that:

  • RMSE = 0.723: On average, the predicted movie ratings are about 0.72 points off from the actual ratings.
  • MAE = 0.562: The average absolute error in predictions is 0.56, which means the predictions are fairly close.
  • R² = 0.233: The model explains about 23.3% of the variation in movie ratings, better than linear regression, but still leaves room for improvement.
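
Beyond the summary metrics, a quick optional plot of predicted versus actual ratings on the test set makes the fit easier to judge (points near the dashed line are predicted well):

# Predicted vs actual ratings for the Random Forest regression model
ggplot(test, aes(x = vote_average, y = rf_pred_rating)) +
  geom_point(alpha = 0.4, color = "steelblue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(
    title = "Random Forest: Predicted vs Actual Ratings (Test Set)",
    x = "Actual Rating",
    y = "Predicted Rating"
  ) +
  theme_minimal()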

Step 23: Variable Importance – Regression

This step helps us understand which features influenced the rating predictions the most in the Random Forest regression model. For regression forests, randomForest reports this through %IncMSE (how much the mean squared error increases when a variable is permuted) and IncNodePurity (the total decrease in node impurity from splits on that variable). We use the code:

# -------------------------------------------------
# 2B) VARIABLE IMPORTANCE - REGRESSION
# -------------------------------------------------
importance(rf_reg)
varImpPlot(rf_reg, main = "Variable Importance (Random Forest - Regression)")

The output of this code, along with the graph, is given below:

 

                 %IncMSE IncNodePurity
popularity      64.65503      1594.713
log_vote_count 139.71055      1774.393
year           138.46185      1276.424

The above importance table (and the accompanying plot) means that:

  • log_vote_count (number of votes) is the most important. It shows that more votes usually mean a more reliable rating.
  • year (release year) also matters; older or newer movies may be rated differently.
  • popularity helps too, but not as much.

Step 24: Random Forest – Classification

In this step, we train a Random Forest classifier to predict whether a movie is highly rated (“Yes”) or not (“No”) using popularity, log_vote_count, and year. We also turn on variable importance to see which features matter most. We’ll use the code:

# -------------------------------------------------
# 3) RANDOM FOREST - CLASSIFICATION
# -------------------------------------------------
rf_clf <- randomForest(
  high_rating ~ popularity + log_vote_count + year,
  data = train,
  ntree = 500,       # grow 500 trees
  mtry  = 2,         # try 2 predictors at each split
  importance = TRUE  # keep importance stats for later plotting
)
print(rf_clf)        # shows OOB error, confusion matrix, etc.

We get the output as:

Call:

 randomForest(formula = high_rating ~ popularity + log_vote_count +      year, data = train, ntree = 500, mtry = 2, importance = TRUE) 

               Type of random forest: classification

                     Number of trees: 500

No. of variables tried at each split: 2

        OOB estimate of  error rate: 3.79%

Confusion matrix:
        No Yes  class.error
No    7223  46  0.006328243
Yes    241  60  0.800664452

The above output means that:

  • Low overall error: the model made only 3.79% mistakes on the training data (out-of-bag estimate), which looks good overall.
  • Very good at predicting "No" (not highly rated): it correctly classified almost all of the movies that are not highly rated.
  • Struggles with "Yes" (highly rated): about 80% of the highly rated movies were misclassified. This is a classic symptom of class imbalance; one optional way to counter it is sketched below.
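
One common remedy, shown here only as an optional sketch, is to train the forest on balanced bootstrap samples so each tree sees as many "Yes" as "No" examples. This usually trades a little overall accuracy for better recall on the minority class:

# Optional: retrain with stratified, balanced sampling per tree
min_class <- min(table(train$high_rating))   # size of the minority class ("Yes")
rf_clf_bal <- randomForest(
  high_rating ~ popularity + log_vote_count + year,
  data     = train,
  ntree    = 500,
  mtry     = 2,
  strata   = train$high_rating,              # stratify the per-tree samples by class
  sampsize = c(min_class, min_class)         # draw equal numbers of "No" and "Yes" for each tree
)
print(rf_clf_bal)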

Step 25: Evaluating the Random Forest Classification Model

In this step, we evaluate how well the Random Forest model predicts whether a movie is highly rated or not on unseen test data. The code is: 

# Class predictions
test$rf_pred_class <- predict(rf_clf, newdata = test, type = "class")
# Probabilities for ROC/AUC
test$rf_prob_yes <- predict(rf_clf, newdata = test, type = "prob")[, "Yes"]
# Accuracy
accuracy <- mean(test$rf_pred_class == test$high_rating)
# Confusion matrix (simple)
conf_mat <- table(Predicted = test$rf_pred_class, Actual = test$high_rating)
# ROC / AUC (positive class = "Yes")
roc_obj <- roc(response = test$high_rating, predictor = test$rf_prob_yes, levels = c("No", "Yes"))
auc_val <- auc(roc_obj)
# Print results
cat("Random Forest Classification (Test)\n")
cat("Accuracy:", round(accuracy, 3), "\n")
cat("AUC     :", round(as.numeric(auc_val), 3), "\n\n")
print(conf_mat)
# Plot ROC
plot(roc_obj, main = "ROC Curve - Random Forest Classification")

We get the following output and graph:

Setting direction: controls < cases

Random Forest Classification (Test)

Accuracy: 0.958 

AUC     : 0.789 

         Actual

Predicted   No  Yes

      No  1803   64

      Yes   15   11

This shows that:

  • Accuracy = 95.8%
    The model predicted most movies correctly.
  • AUC = 0.789
    The model is good at telling apart high-rated and not-high-rated movies.
  • Confusion Matrix:
    • 1803 movies correctly predicted as Not High Rated
    • 11 movies correctly predicted as High Rated
    • 64 high-rated movies were missed (the model said "No")
    • 15 movies were wrongly flagged as High Rated
  • ROC Curve:
    The curve is well above the diagonal, meaning the model performs better than random guessing.
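
Since accuracy hides how the minority class is handled, an optional follow-up is to compute recall and precision for the "Yes" class directly from the confusion matrix stored in conf_mat (from the numbers above, recall is 11 / (11 + 64), roughly 0.15):

# Recall: share of actual "Yes" movies the model found
conf_mat["Yes", "Yes"] / sum(conf_mat[, "Yes"])

# Precision: share of predicted "Yes" movies that really are highly rated
conf_mat["Yes", "Yes"] / sum(conf_mat["Yes", ])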

Practice Makes Perfect. Start Here: The Ultimate R Cheat Sheet for Data Science Enthusiasts

Step 26: Variable Importance – Classification

This section shows which features were most important in helping the Random Forest classify movies as high-rated or not. The code for this section is:

# -------------------------------------------------
# 3B) VARIABLE IMPORTANCE - CLASSIFICATION
# -------------------------------------------------
importance(rf_clf)
varImpPlot(rf_clf, main = "Variable Importance (Random Forest - Classification)")

The output for this section is:

 

                     No       Yes MeanDecreaseAccuracy MeanDecreaseGini
popularity     36.90262  6.076136             39.86347         224.4033
log_vote_count 72.79150 51.165184             82.72379         233.9531
year           36.97069 57.166905             55.27109         119.1909

The above output means that:

  • log_vote_count is the most important feature for predicting if a movie is highly rated.
  • popularity and year also help, but they’re less useful than log_vote_count.
  • The bars in the plot show how much each feature helps improve:
    • Accuracy (left plot)
    • Decision quality inside the trees (right plot)

Conclusion

In this Movie Ratings Prediction project, we used R in Google Colab to build Random Forest models for both regression and classification tasks. The goal was to predict a movie's average rating and whether it qualifies as a highly rated film based on features like popularity, vote count (log-transformed), and release year.

After preprocessing and visualizing the data, we trained the models on 80% of the dataset. The regression model achieved an RMSE of 0.723 and an R² of 0.233, while the classification model reached 95.8% accuracy with an AUC of 0.789.

Unlock the power of data with our popular Data Science courses, designed to make you proficient in analytics, machine learning, and big data!

Elevate your career by learning essential Data Science skills such as statistical modeling, big data processing, predictive analytics, and SQL!

Stay informed and inspired with our popular Data Science articles, offering expert insights, trends, and practical tips for aspiring data professionals!

Colab Link:
https://colab.research.google.com/drive/1LptCihglOyzCkkiJ7e-OVjC6NsipxTtW#scrollTo=Ms1l6gDvZVBe

Frequently Asked Questions (FAQs)

1. What does this project aim to achieve using movie data?

2. Which tools, libraries, and techniques are used in this project?

3. Can other machine learning models be used instead of Random Forest?

4. How can I make this project more advanced or real-world ready?

5. What are some similar machine learning projects I can try in R?

Rohit Sharma

796 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
