Movie Rating Analysis Project in R
By Rohit Sharma
Updated on Jul 29, 2025 | 26 min read | 1.54K+ views
Most of us check movie reviews online before watching a film. Many factors feed into those ratings, so let's build a project in R to see how it works. This movie rating analysis project analyzes a movie dataset to understand what drives higher audience ratings.
We will use R in Google Colab, perform tasks like data cleaning, visualizations, feature engineering, and build machine learning models to predict average movie ratings and classify highly rated movies.
In this movie rating analysis project, we will use tools like ggplot2, dplyr, randomForest, and pROC to understand patterns and evaluate model performance.
Before starting this project, there are a few things you should know.
The following tools will be used in this project for various purposes.
| Category | Tool / Library | Purpose |
| --- | --- | --- |
| Environment | Google Colab (R) | Cloud-based coding platform to run R interactively |
| Data Manipulation | dplyr | Filtering, transforming, and summarizing data |
| Visualization | ggplot2 | Creating informative and aesthetic visualizations |
| Date Handling | lubridate | Parsing and extracting components from date columns |
| Data Cleaning | janitor | Cleaning column names and simplifying messy datasets |
| Modeling | randomForest | Building regression and classification models |
| Evaluation | pROC | Calculating AUC and plotting ROC curves |
| Core Tidyverse | tidyverse | Includes readr, tibble, stringr, and more for general workflow |
This project requires the following:

| Aspect | Details |
| --- | --- |
| Estimated Duration | 4–6 hours (spread across 1–2 sessions, depending on pace) |
| Difficulty Level | Beginner to Lower-Intermediate |
| Skill Level Required | Basic understanding of R, data manipulation, and a general interest in data analysis |
In this section, we’ll break down the process of creating this movie rating analysis project, step by step, along with the code and output involved.
We need a data source, and that can be found on websites like Kaggle. Download the required dataset and then open Colab and do the following:
This step checks and installs any missing R packages you'll need for the project, and then loads them into your session. These libraries are important as they are involved in data cleaning, visualization, model building, and evaluation. Here’s the code:
# -------------------------------
# 1) INSTALL & LOAD PACKAGES
# -------------------------------
# Helper: install only what's missing
need <- c("tidyverse", "lubridate", "skimr", "janitor", "broom", "GGally", "pROC")
to_install <- setdiff(need, rownames(installed.packages()))
if (length(to_install)) install.packages(to_install) # Installs only missing packages
# Load them
library(tidyverse) # Core data science tools: dplyr, ggplot2, readr, etc.
library(lubridate) # Makes it easy to work with date columns
library(skimr) # Provides quick summary stats of your dataset
library(janitor) # Cleans messy column names, helps explore categorical vars
library(broom) # Tidies up model output for easier analysis
library(GGally) # Optional: for quick EDA (pair plots)
library(pROC) # Used to evaluate classification models (AUC, ROC)
This produces output confirming that the libraries are installed and loaded:
Installing packages into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
also installing the dependencies ‘patchwork’, ‘snakecase’, ‘ggstats’, ‘S7’, ‘plyr’
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4      ✔ readr 2.1.5
✔ forcats 1.0.0    ✔ stringr 1.5.1
✔ ggplot2 3.5.2    ✔ tibble 3.3.0
✔ lubridate 1.9.4  ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Attaching package: ‘janitor’
The following objects are masked from ‘package:stats’: chisq.test, fisher.test
Type 'citation("pROC")' for a citation.
Attaching package: ‘pROC’
The following objects are masked from ‘package:stats’: cov, smooth, var
This step will load the uploaded movie dataset into your R session and will give us a quick glimpse of the structure and sample rows to confirm it has loaded correctly. The code for this section is:
# -------------------------------
# 2) READ THE DATA
# -------------------------------
# Set your file name (change the path if needed)
file_path <- "Movie-Dataset-Latest.csv" # Name of your uploaded CSV file
movies_raw <- readr::read_csv(file_path) # Load the CSV into R as a tibble
# Take a quick look
glimpse(movies_raw) # Check column names, types, and structure
head(movies_raw, 3) # View the first 3 rows to verify content
The above code will give us the output as:
New names: • `` -> `...1` Rows: 9463 Columns: 9 ── Column specification ──────────────────────────────────────────────────────── Delimiter: "," chr (2): title, overview dbl (5): ...1, id, popularity, vote_average, vote_count lgl (1): video date (1): release_date ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. Rows: 9,463 Columns: 9 $ ...1 <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,… $ id <dbl> 19404, 278, 238, 724089, 424, 696374, 761053, 240, 283566… $ title <chr> "Dilwale Dulhania Le Jayenge", "The Shawshank Redemption"… $ release_date <date> 1995-10-20, 1994-09-23, 1972-03-14, 2020-07-31, 1993-11-… $ overview <chr> "Raj is a rich, carefree, happy-go-lucky second generatio… $ popularity <dbl> 25.884, 60.110, 62.784, 28.316, 38.661, 18.395, 29.495, 4… $ vote_average <dbl> 8.7, 8.7, 8.7, 8.6, 8.6, 8.6, 8.6, 8.6, 8.6, 8.6, 8.6, 8.… $ vote_count <dbl> 3304, 20369, 15219, 1360, 12158, 2172, 922, 9164, 405, 23… $ video <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
In this step, we clean up the column names, remove unnecessary columns, safely parse dates, and create new features like year and decade. We also check for missing values to prepare the dataset for analysis. The code is given below:
# -------------------------------
# 3) CLEANING & BASIC PREP (robust)
# -------------------------------
# Peek at the raw names so you can see what's actually there
names(movies_raw)
movies <- movies_raw %>%
janitor::clean_names() %>% # Convert column names to snake_case
select(-tidyselect::any_of(c("unnamed_0", "x1"))) %>%# Drop unwanted columns if they exist
mutate(
# Parse release_date safely (ymd ignores if already Date)
release_date = lubridate::ymd(release_date),
year = lubridate::year(release_date), # Extract release year
decade = floor(year / 10) * 10, # Derive decade from year
# Convert 'video' to factor if it exists
video = if ("video" %in% names(.)) as.factor(video) else NULL
)
glimpse(movies) # Check cleaned structure and new columns
# Check missing counts across all columns
movies %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
t() # Transpose for easier reading
The output for the above section is:
Rows: 9,463 Columns: 10 $ id <dbl> 19404, 278, 238, 724089, 424, 696374, 761053, 240, 283566… $ title <chr> "Dilwale Dulhania Le Jayenge", "The Shawshank Redemption"… $ release_date <date> 1995-10-20, 1994-09-23, 1972-03-14, 2020-07-31, 1993-11-… $ overview <chr> "Raj is a rich, carefree, happy-go-lucky second generatio… $ popularity <dbl> 25.884, 60.110, 62.784, 28.316, 38.661, 18.395, 29.495, 4… $ vote_average <dbl> 8.7, 8.7, 8.7, 8.6, 8.6, 8.6, 8.6, 8.6, 8.6, 8.6, 8.6, 8.… $ vote_count <dbl> 3304, 20369, 15219, 1360, 12158, 2172, 922, 9164, 405, 23… $ video <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F… $ year <dbl> 1995, 1994, 1972, 2020, 1993, 2020, 2020, 1974, 2021, 201… $ decade <dbl> 1990, 1990, 1970, 2020, 1990, 2020, 2020, 1970, 2020, 201…
After cleaning and preprocessing, the dataset still has 9,463 rows but now 10 columns: the unnamed index column was dropped, names are in snake_case, release_date is a proper Date, video is a factor, and two derived columns (year and decade) were added. The missing-value summary tells us whether any columns need attention before analysis.
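As a quick spot-check of the decade derivation used in the cleaning step, `floor(year / 10) * 10` maps each year onto the start of its decade (sample years taken from the glimpse output above):

```r
# Spot-check of the decade formula used in the cleaning step
years <- c(1995, 1972, 2020)      # sample release years from the glimpse output
decades <- floor(years / 10) * 10 # same expression as in the mutate() call
decades                           # 1990 1970 2020
```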
In this step, we visualize how movie ratings (vote_average) are distributed across the dataset. This helps us understand if most movies are rated highly, poorly, or somewhere in between. The code for this step is:
# -------------------------------
# 4.1 DISTRIBUTION OF RATINGS
# -------------------------------
ggplot(movies, aes(x = vote_average)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") + # Histogram with 30 bins
labs(
title = "Distribution of Movie Ratings", # Chart title
x = "Average Rating", # X-axis label
y = "Number of Movies" # Y-axis label
) +
theme_minimal() # Clean, minimal visual style
This step gives us a graphical representation of the distribution of movie ratings:
This step helps identify the top 15 movies that received the highest number of votes. It gives us an idea of which movies had the most audience engagement or visibility. The code for this step is:
# -------------------------------
# 4.2 MOST VOTED MOVIES
# -------------------------------
movies %>%
arrange(desc(vote_count)) %>% # Sort movies by highest vote count
slice_head(n = 15) %>% # Take the top 15
ggplot(aes(x = reorder(title, vote_count), y = vote_count)) + # Reorder titles by vote count
geom_col(fill = "darkorange") + # Bar chart with orange bars
coord_flip() + # Flip coordinates for horizontal bars
labs(
title = "Top 15 Most Voted Movies", # Chart title
x = "Movie Title", # X-axis label
y = "Vote Count" # Y-axis label
) +
theme_minimal() # Clean visual style
The output for this step gives us a graphical representation of the top 15 most voted movies.
This step highlights the top 15 movies with the highest average ratings, but only includes those that received at least 1000 votes. This helps avoid bias from little-known movies with very few ratings. The code for this step is:
# -------------------------------
# 4.3 TOP RATED MOVIES (MIN 1000 VOTES)
# -------------------------------
min_votes <- 1000 # to avoid including unknown films with 1 vote
top_rated <- movies %>%
filter(vote_count >= min_votes) %>% # Keep only movies with 1000+ votes
arrange(desc(vote_average)) %>% # Sort by highest rating
slice_head(n = 15) # Pick top 15
ggplot(top_rated, aes(x = reorder(title, vote_average), y = vote_average)) +
geom_col(fill = "seagreen") + # Green horizontal bars
coord_flip() + # Flip for horizontal layout
labs(
title = paste("Top 15 Highest Rated Movies (vote_count >=", min_votes, ")"), # Dynamic title
x = "Movie Title", # X-axis label
y = "Average Rating" # Y-axis label
) +
theme_minimal() # Clean style
The output gives us a graph depicting the top 15 highest rated movies, which have received a minimum of 1000 votes.
This step shows how the average movie rating has changed over the years. It helps identify any trends, such as whether movies are being rated more favorably or harshly over time. The code is given below:
# -------------------------------
# 4.4 AVERAGE RATING OVER TIME
# -------------------------------
rating_year <- movies %>%
filter(!is.na(year)) %>% # Remove rows with missing years
group_by(year) %>% # Group by release year
summarise(
avg_rating = mean(vote_average, na.rm = TRUE),# Calculate average rating
n = n() # Count movies per year
) %>%
filter(n >= 5) # keep years with at least 5 movies to avoid unreliable averages
ggplot(rating_year, aes(x = year, y = avg_rating)) +
geom_line(color = "steelblue") + # Line for rating trend
geom_point() + # Add points for each year
labs(
title = "Average Movie Rating by Year", # Chart title
x = "Year", # X-axis label
y = "Average Rating" # Y-axis label
) +
theme_minimal() # Clean visual theme
This gives us a graph that shows the average rating over time for various movies:
This step looks at the relationship between a movie’s popularity score and its vote count. Both axes use a logarithmic scale to better visualize patterns across a wide range of values. The code for this step is:
# -------------------------------
# 4.5 POPULARITY vs VOTE COUNT
# -------------------------------
ggplot(movies, aes(x = popularity, y = vote_count)) +
geom_point(alpha = 0.5, color = "purple") + # Scatter plot with semi-transparent purple points
scale_x_log10() + # Log scale for popularity (x-axis)
scale_y_log10() + # Log scale for vote count (y-axis)
labs(
title = "Popularity vs Vote Count (Log-Log Scale)", # Chart title
x = "Popularity (log)", # X-axis label
y = "Vote Count (log)" # Y-axis label
) +
theme_minimal() # Clean theme
This gives us a graph on the comparison between popularity and vote count.
The above output shows a broadly positive relationship: movies with higher popularity scores tend to accumulate more votes, though the wide scatter means popularity alone does not determine engagement.
In this step, we will create new columns from existing data to help with future modeling tasks. These include a log-transformed vote count, a binary flag for high ratings, and a rating category (Low, Average, High). The code is:
# -------------------------------
# 5.1 FEATURE ENGINEERING
# -------------------------------
movies_fe <- movies %>%
mutate(
log_vote_count = log1p(vote_count), # Use log(1 + x) to avoid log(0) issues
high_rating = vote_average >= 8, # TRUE if rating is 8 or higher (for classification)
rating_bucket = case_when( # Group ratings into categories
vote_average < 5 ~ "Low",
vote_average < 7 ~ "Average",
TRUE ~ "High"
)
) %>%
mutate(rating_bucket = factor(rating_bucket, levels = c("Low", "Average", "High"))) # Ordered factor
# Check the new structure
glimpse(movies_fe) # See new columns
# Count how many movies fall into each rating bucket
movies_fe %>%
count(rating_bucket)
The output is:
Rows: 9,463 Columns: 13 $ id <dbl> 19404, 278, 238, 724089, 424, 696374, 761053, 240, 2835… $ title <chr> "Dilwale Dulhania Le Jayenge", "The Shawshank Redemptio… $ release_date <date> 1995-10-20, 1994-09-23, 1972-03-14, 2020-07-31, 1993-1… $ overview <chr> "Raj is a rich, carefree, happy-go-lucky second generat… $ popularity <dbl> 25.884, 60.110, 62.784, 28.316, 38.661, 18.395, 29.495,… $ vote_average <dbl> 8.7, 8.7, 8.7, 8.6, 8.6, 8.6, 8.6, 8.6, 8.6, 8.6, 8.6, … $ vote_count <dbl> 3304, 20369, 15219, 1360, 12158, 2172, 922, 9164, 405, … $ video <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,… $ year <dbl> 1995, 1994, 1972, 2020, 1993, 2020, 2020, 1974, 2021, 2… $ decade <dbl> 1990, 1990, 1970, 2020, 1990, 2020, 2020, 1970, 2020, 2… $ log_vote_count <dbl> 8.103192, 9.921819, 9.630366, 7.215975, 9.405825, 7.683… $ high_rating <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T… $ rating_bucket <fct> High, High, High, High, High, High, High, High, High, H…
The above output confirms that three new columns were added: log_vote_count (the log-transformed vote count), high_rating (TRUE when vote_average is 8 or higher), and rating_bucket (an ordered factor with levels Low, Average, High). The final count() call reports how many movies fall into each bucket.
This step helps us understand the distribution of vote_count before and after applying log transformation. It shows how log scaling helps reduce skewness and makes the data more suitable for modeling. The code for this step is:
# -------------------------------
# 5.2.1 DISTRIBUTION OF VOTE COUNTS
# -------------------------------
par(mfrow = c(1, 2)) # Show two plots side by side
# Histogram of raw vote counts
hist(movies$vote_count, breaks = 30,
main = "Vote Count (Raw)", col = "tomato", xlab = "vote_count")
# Histogram of log-transformed vote counts
hist(movies_fe$log_vote_count, breaks = 30,
main = "Vote Count (Log Transformed)", col = "steelblue", xlab = "log_vote_count")
par(mfrow = c(1, 1)) # Reset plot layout to default
The output gives us a graph:
The above graphs show that raw vote counts are heavily right-skewed (most movies receive few votes while a handful receive tens of thousands), and that the log transformation produces a far more symmetric distribution that is better behaved for modeling.
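To see numerically why the log transform tames the skew, here is a small base-R sketch with made-up counts that mimic the vote distribution (the values are illustrative, not from the dataset):

```r
# Made-up, heavily right-skewed counts (a few blockbusters, many small films)
x <- c(1, 3, 10, 50, 20000)
c(mean = mean(x), median = median(x))    # mean is dragged far above the median
lx <- log1p(x)                           # same transform used for log_vote_count
c(mean = mean(lx), median = median(lx))  # mean and median are now much closer
```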
In this section, we will create a plot that shows how movies are distributed across the three rating categories we created earlier: Low, Average, and High.
# -------------------------------
# 5.2.2 RATING BUCKET COUNTS
# -------------------------------
ggplot(movies_fe, aes(x = rating_bucket)) +
geom_bar(fill = "goldenrod") +
labs(
title = "Number of Movies by Rating Bucket",
x = "Rating Category",
y = "Number of Movies"
) +
theme_minimal()
We get a graph like this in the output:
This graph shows how many movies fall into each of the three rating buckets, making the class balance visible at a glance.
In this step, we will create a plot that compares the average popularity of movies in each rating category (Low, Average, High). We’ll use the code:
# -------------------------------
# 5.2.3 POPULARITY BY RATING BUCKET
# -------------------------------
movies_fe %>%
group_by(rating_bucket) %>%
summarise(avg_popularity = mean(popularity, na.rm = TRUE)) %>%
ggplot(aes(x = rating_bucket, y = avg_popularity, fill = rating_bucket)) +
geom_col() +
labs(
title = "Average Popularity by Rating Bucket",
x = "Rating Category",
y = "Average Popularity"
) +
theme_minimal()
This code will give us the output as:
This shows the average popularity score within each rating category, letting us check whether higher-rated movies also tend to be more popular.
This step prepares your data for machine learning by splitting it into two parts: training data (to build the model) and testing data (to evaluate how well it works). We’ll use the code:
# -------------------------------
# 6.1 TRAIN/TEST SPLIT
# -------------------------------
set.seed(42) # for reproducibility
# Drop rows with missing model features
movies_model <- movies_fe %>%
drop_na(vote_average, popularity, log_vote_count, year)
# Split into 80% train, 20% test
n <- nrow(movies_model)
train_index <- sample(seq_len(n), size = 0.8 * n)
train <- movies_model[train_index, ]
test <- movies_model[-train_index, ]
cat("Train size:", nrow(train), " | Test size:", nrow(test), "\n")
This will give the output:
Train size: 7570 | Test size: 1893
Which means the 9,463 movies were split into 7,570 training rows (80%) and 1,893 test rows (20%); no rows were lost to missing values in the model features.
In this step, we're building a simple linear regression model to predict a movie's average rating using three features: popularity, log of vote count, and release year. We’ll use the code:
# -------------------------------
# 6.2 LINEAR REGRESSION
# -------------------------------
lm_model <- lm(vote_average ~ popularity + log_vote_count + year, data = train)
# Summary of model coefficients
summary(lm_model)
This gives the output:
Call:
lm(formula = vote_average ~ popularity + log_vote_count + year, data = train)

Residuals:
    Min      1Q  Median      3Q     Max
-4.0309 -0.4821  0.0175  0.5186  2.3704

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     3.437e+01  1.040e+00   33.06  < 2e-16 ***
popularity      1.975e-04  3.387e-05    5.83 5.77e-09 ***
log_vote_count  2.229e-01  8.284e-03   26.91  < 2e-16 ***
year           -1.461e-02  5.212e-04  -28.03  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7547 on 7566 degrees of freedom
Multiple R-squared: 0.1576, Adjusted R-squared: 0.1573
F-statistic: 471.8 on 3 and 7566 DF, p-value: < 2.2e-16
This means that all three predictors are highly significant (p < 0.001): ratings rise with vote count (log_vote_count coefficient ≈ 0.223) and, very slightly, with popularity, while newer movies score a little lower (year coefficient ≈ −0.015). Even so, the model explains only about 15.8% of the rating variance on the training data.
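To make these coefficients concrete, here is a small back-of-the-envelope calculation; the estimates are copied from the summary above, not recomputed:

```r
# Coefficient estimates copied from the lm summary above
b_log_votes <- 2.229e-01   # log_vote_count
b_year      <- -1.461e-02  # year
# Because log1p(x) is close to log(x) for large x, doubling the vote count adds
# about log(2) to log_vote_count, so the predicted rating rises by roughly:
round(b_log_votes * log(2), 3)   # about 0.155 rating points
# And a movie released ten years later is predicted to score about this much lower:
round(b_year * 10, 3)            # about -0.146 rating points
```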
In this step, we predict ratings using the linear model and calculate how close the predictions are to the actual ratings. The code for this step is:
# Predict ratings
test$pred_rating <- predict(lm_model, newdata = test)
# Define metrics
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2)) # Root Mean Square Error
mae <- function(actual, pred) mean(abs(actual - pred)) # Mean Absolute Error
r2 <- function(actual, pred) cor(actual, pred)^2 # R-squared
# Calculate metrics
lm_rmse <- rmse(test$vote_average, test$pred_rating)
lm_mae <- mae(test$vote_average, test$pred_rating)
lm_r2 <- r2(test$vote_average, test$pred_rating)
# Print them (this produces the results shown below)
cat("Linear Regression Results\n")
cat("RMSE :", round(lm_rmse, 3), "\n")
cat("MAE :", round(lm_mae, 3), "\n")
cat("R^2 :", round(lm_r2, 3), "\n")
This gives the output as:
Linear Regression Results
RMSE : 0.763
MAE : 0.607
R^2 : 0.114
The above output means that, on unseen test data, predictions miss the true rating by about 0.76 points on average (RMSE 0.763, MAE 0.607), and the model explains only about 11% of the variance (R² = 0.114). Popularity, vote count, and year are clearly not enough to predict ratings well on their own.
We're now predicting whether a movie is highly rated (8 or above) using logistic regression. The code for this is:
glm_model <- glm(high_rating ~ popularity + log_vote_count + year,
data = train, family = binomial)
summary(glm_model)
The output for the above step is:
Call:
glm(formula = high_rating ~ popularity + log_vote_count + year, family = binomial, data = train)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)     68.2042129  5.5443763  12.302  < 2e-16 ***
popularity       0.0003712  0.0001180   3.147  0.00165 **
log_vote_count   0.7791792  0.0526512  14.799  < 2e-16 ***
year            -0.0384784  0.0028399 -13.549  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 2505.7 on 7569 degrees of freedom
Residual deviance: 2171.9 on 7566 degrees of freedom
AIC: 2179.9

Number of Fisher Scoring iterations: 7
The above output means that all three predictors are significant: log_vote_count is the strongest (more votes means a higher chance of an 8+ rating), year is negative (newer movies are less likely to be rated that highly), and popularity has a small positive effect. The drop from the null deviance (2505.7) to the residual deviance (2171.9) confirms the model fits better than an intercept-only baseline.
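Because logistic regression works on the log-odds scale, exponentiating the coefficients turns them into odds ratios. A minimal sketch using the estimates copied from the summary above:

```r
# Coefficients copied from the glm summary above
coefs <- c(popularity = 0.0003712, log_vote_count = 0.7791792, year = -0.0384784)
round(exp(coefs), 3)
# Each +1 in log_vote_count multiplies the odds of an 8+ rating by about 2.18,
# while each additional release year multiplies them by about 0.962 (a ~3.8% drop).
```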
This step checks how well our model predicts if a movie is highly rated (yes/no). The code for this step is:
# Predict probability of being highly rated
test$prob_high <- predict(glm_model, newdata = test, type = "response")
# Predict TRUE/FALSE based on 0.5 threshold
test$pred_high <- test$prob_high >= 0.5
# Accuracy: how many predictions were correct
accuracy <- mean(test$pred_high == test$high_rating)
# AUC: measures how well the model separates TRUE from FALSE
roc_obj <- roc(response = test$high_rating, predictor = test$prob_high)
auc_val <- auc(roc_obj)
# Print them (this produces the results shown below)
cat("📊 Logistic Regression Results\n")
cat("Accuracy:", round(accuracy, 3), "\n")
cat("AUC :", round(as.numeric(auc_val), 3), "\n")
The output is:
Setting levels: control = FALSE, case = TRUE
Setting direction: controls < cases
📊 Logistic Regression Results
Accuracy: 0.957
AUC : 0.638
This means that while accuracy looks impressive at 95.7%, it is inflated by class imbalance: highly rated movies are rare, so predicting "not high" is usually right. The AUC of 0.638 tells the truer story: the model is only modestly better than chance at separating highly rated movies from the rest.
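A quick base-R sanity check of why raw accuracy flatters the model here. With roughly 4% of movies highly rated (hypothetical round numbers consistent with the results above), a degenerate "model" that always predicts FALSE is almost as accurate:

```r
# Hypothetical class balance: 957 ordinary movies, 43 highly rated ones
actual <- c(rep(FALSE, 957), rep(TRUE, 43))
always_no <- rep(FALSE, length(actual))  # predicts "not highly rated" every time
mean(always_no == actual)                # 0.957 accuracy with zero skill
```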
In this step, we make sure all the required libraries for building and evaluating Random Forest models are installed and loaded. These packages will help us with model training, predictions, and evaluation metrics. The code used is:
# -------------------------------------------------
# 0) INSTALL & LOAD
# -------------------------------------------------
need <- c("randomForest", "pROC", "tidyverse")
to_install <- setdiff(need, rownames(installed.packages()))
if (length(to_install)) install.packages(to_install)
library(randomForest)
library(pROC)
library(tidyverse)
Upon successfully installing the package, we get the output as:
Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:dplyr’: combine
The following object is masked from ‘package:ggplot2’: margin
Before training our Random Forest models, we need to prepare the data properly. In this step, we select relevant features, handle missing values, and split the dataset into training and testing sets. We’ll use the code:
# -------------------------------------------------
# 1) PREP DATA & SPLIT
# -------------------------------------------------
set.seed(123)
# Drop rows with NAs in the features used for modeling
movies_model <- movies_fe %>%
drop_na(vote_average, popularity, log_vote_count, year, high_rating)
# Make the classification target a factor (randomForest needs a factor for classification)
movies_model <- movies_model %>%
mutate(high_rating = factor(high_rating, levels = c(FALSE, TRUE), labels = c("No", "Yes")))
# 80/20 split
n <- nrow(movies_model)
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- movies_model[train_idx, ]
test <- movies_model[-train_idx, ]
cat("Train size:", nrow(train), " | Test size:", nrow(test), "\n")
This gives the output:
Train size: 7570 | Test size: 1893
The above output confirms the same 80/20 split: 7,570 movies for training and 1,893 for testing, this time with high_rating converted to a No/Yes factor because randomForest requires a factor target for classification.
In this step, we build a Random Forest regression model to predict the average movie rating using key features like popularity, number of votes (log-transformed), and release year. Random Forest improves accuracy by combining multiple decision trees and reduces overfitting.
# -------------------------------------------------
# 2) RANDOM FOREST - REGRESSION
# -------------------------------------------------
# Formula: pick a few sensible predictors (you can add more later)
rf_reg <- randomForest(
vote_average ~ popularity + log_vote_count + year,
data = train,
ntree = 500, # number of trees
mtry = 2, # number of variables tried at each split (tune this)
importance = TRUE # so we can plot variable importance
)
print(rf_reg)
The above code will give the output:
Call:
randomForest(formula = vote_average ~ popularity + log_vote_count + year, data = train, ntree = 500, mtry = 2, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 2
Mean of squared residuals: 0.5491158
% Var explained: 18.19
The above output means the forest's out-of-bag estimate of the mean squared residual is 0.549, with about 18.2% of the rating variance explained, already better than the linear model's training R² of 15.8%.
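randomForest reports "% Var explained" as 1 − MSE(OOB) / Var(y). With the numbers above we can back out the training-set rating variance, purely as a consistency check:

```r
mse_oob <- 0.5491158       # Mean of squared residuals from the output above
pct_var <- 18.19 / 100     # % Var explained, as a fraction
implied_var <- mse_oob / (1 - pct_var)
round(implied_var, 3)       # about 0.671: the variance of vote_average in training
```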
In this step, we test how well the Random Forest regression model predicts movie ratings on unseen data using common evaluation metrics. We’ll use the following code:
# -------------------------------------------------
# 2A) EVALUATE - REGRESSION
# -------------------------------------------------
# Predict on test
test$rf_pred_rating <- predict(rf_reg, newdata = test)
# Metrics
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
mae <- function(actual, pred) mean(abs(actual - pred))
r2 <- function(actual, pred) cor(actual, pred)^2
rf_rmse <- rmse(test$vote_average, test$rf_pred_rating)
rf_mae <- mae(test$vote_average, test$rf_pred_rating)
rf_r2 <- r2(test$vote_average, test$rf_pred_rating)
cat("📊 Random Forest Regression (Test)\n")
cat("RMSE :", round(rf_rmse, 3), "\n")
cat("MAE :", round(rf_mae, 3), "\n")
cat("R^2 :", round(rf_r2, 3), "\n")
The output is given below:
Random Forest Regression (Test)
RMSE : 0.723
MAE : 0.562
R^2 : 0.233
This means the Random Forest beats linear regression on every test metric: RMSE falls from 0.763 to 0.723, MAE from 0.607 to 0.562, and R² roughly doubles from 0.114 to 0.233. The improvement is real, but most of the rating variance is still unexplained by these three features.
This step helps us understand which features influenced the rating predictions the most in the Random Forest regression model. Random Forest provides built-in measures for this; for regression models these are %IncMSE (how much prediction error increases when a variable's values are permuted) and IncNodePurity (how much the variable reduces node variance across all trees). We use the code:
# -------------------------------------------------
# 2B) VARIABLE IMPORTANCE - REGRESSION
# -------------------------------------------------
importance(rf_reg)
varImpPlot(rf_reg, main = "Variable Importance (Random Forest - Regression)")
The output of this code, along with the graph, is given below:
| | %IncMSE | IncNodePurity |
| --- | --- | --- |
| popularity | 64.65503 | 1594.713 |
| log_vote_count | 139.71055 | 1774.393 |
| year | 138.46185 | 1276.424 |
The above output means that log_vote_count and year are the strongest predictors by %IncMSE (permuting either sharply degrades accuracy), while popularity matters least by that measure; by node purity, log_vote_count again ranks first.
In this step, we train a Random Forest classifier to predict whether a movie is highly rated (“Yes”) or not (“No”) using popularity, log_vote_count, and year. We also turn on variable importance to see which features matter most. We’ll use the code:
# -------------------------------------------------
# 3) RANDOM FOREST - CLASSIFICATION
# -------------------------------------------------
rf_clf <- randomForest(
high_rating ~ popularity + log_vote_count + year,
data = train,
ntree = 500, # grow 500 trees
mtry = 2, # try 2 predictors at each split
importance = TRUE # keep importance stats for later plotting
)
print(rf_clf) # shows OOB error, confusion matrix, etc.
We get the output as:
Call:
randomForest(formula = high_rating ~ popularity + log_vote_count + year, data = train, ntree = 500, mtry = 2, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 3.79%
Confusion matrix:
      No Yes  class.error
No  7223  46 0.006328243
Yes  241  60 0.800664452
The above output shows an out-of-bag error rate of just 3.79%, but the confusion matrix exposes the imbalance problem: the model misclassifies only 0.6% of "No" movies yet misses 80% of the genuinely high-rated ones (241 of 301). As with logistic regression, high accuracy is mostly the majority class talking.
In this step, we evaluate how well the Random Forest model predicts whether a movie is highly rated or not on unseen test data. The code is:
# Class predictions
test$rf_pred_class <- predict(rf_clf, newdata = test, type = "class")
# Probabilities for ROC/AUC
test$rf_prob_yes <- predict(rf_clf, newdata = test, type = "prob")[, "Yes"]
# Accuracy
accuracy <- mean(test$rf_pred_class == test$high_rating)
# Confusion matrix (simple)
conf_mat <- table(Predicted = test$rf_pred_class, Actual = test$high_rating)
# ROC / AUC (positive class = "Yes")
roc_obj <- roc(response = test$high_rating, predictor = test$rf_prob_yes, levels = c("No", "Yes"))
auc_val <- auc(roc_obj)
# Print results
cat("Random Forest Classification (Test)\n")
cat("Accuracy:", round(accuracy, 3), "\n")
cat("AUC :", round(as.numeric(auc_val), 3), "\n\n")
print(conf_mat)
# Plot ROC
plot(roc_obj, main = "ROC Curve - Random Forest Classification")
We get the following output and graph:
Setting direction: controls < cases
Random Forest Classification (Test)
Accuracy: 0.958
AUC : 0.789

          Actual
Predicted   No  Yes
      No  1803   64
      Yes   15   11
This shows that test accuracy is 95.8% and the AUC improves to 0.789, well above the logistic model's 0.638. The confusion matrix, however, shows that of the 75 truly high-rated movies in the test set, only 11 were identified, so recall on the positive class is still low.
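Reading precision and recall straight off the confusion matrix makes the imbalance explicit; the counts below are copied from the printed matrix above:

```r
# Counts from the test-set confusion matrix above
tp <- 11   # predicted Yes, actually Yes
fp <- 15   # predicted Yes, actually No
fn <- 64   # predicted No, actually Yes
precision <- tp / (tp + fp)  # of the movies flagged "high", how many really are
recall    <- tp / (tp + fn)  # of the truly high-rated movies, how many we caught
round(c(precision = precision, recall = recall), 3)  # about 0.423 and 0.147
```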
This section shows which features were most important in helping the Random Forest classify movies as high-rated or not. The code for this section is:
# -------------------------------------------------
# 3B) VARIABLE IMPORTANCE - CLASSIFICATION
# -------------------------------------------------
importance(rf_clf)
varImpPlot(rf_clf, main = "Variable Importance (Random Forest - Classification)")
The output for this section is:
| | No | Yes | MeanDecreaseAccuracy | MeanDecreaseGini |
| --- | --- | --- | --- | --- |
| popularity | 36.90262 | 6.076136 | 39.86347 | 224.4033 |
| log_vote_count | 72.79150 | 51.165184 | 82.72379 | 233.9531 |
| year | 36.97069 | 57.166905 | 55.27109 | 119.1909 |
The above output means that log_vote_count is the most important feature overall (highest MeanDecreaseAccuracy and MeanDecreaseGini), year is especially useful for identifying the "Yes" class, and popularity contributes the least.
In this Movie Ratings Prediction project, we used R in Google Colab to build Random Forest models for both regression and classification tasks. The goal was to predict a movie's average rating and whether it qualifies as a highly rated film based on features like popularity, vote count (log-transformed), and release year.
After preprocessing and visualizing the data, we trained the models on 80% of the dataset. The regression model achieved an RMSE of 0.723 and an R² of 0.233, while the classification model reached 95.8% accuracy with an AUC of 0.789.
Colab Link:
https://colab.research.google.com/drive/1LptCihglOyzCkkiJ7e-OVjC6NsipxTtW#scrollTo=Ms1l6gDvZVBe
Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...