Player Performance Analysis & Prediction Using R
By Rohit Sharma
Updated on Jul 31, 2025 | 14 min read | 1.26K+ views
This player performance analysis project focuses on predicting the total points scored by NBA players using machine learning techniques in R. We will use the 2023 NBA player stats dataset to identify key performance metrics of players like minutes played, field goals, assists, and more.
The data will be cleaned, preprocessed, and modeled using algorithms such as Linear Regression, Random Forest, and XGBoost. The goal of this project is to build an accurate prediction model and identify the most influential features that affect a player’s scoring.
Find the Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025 for your next R project.
Before starting the player performance analysis project, there are a few things we need to know in order to ensure a successful outcome.
The following tools and libraries will be used in this project:
Category | Name | Purpose
Programming Language | R | Data analysis and machine learning
Platform | Google Colab (R runtime) | Cloud-based coding environment
Data Manipulation | dplyr, tidyverse, janitor | Cleaning, transforming, and wrangling data
Visualization | ggplot2, vip | Data and feature importance visualization
ML Framework | caret | Building and tuning machine learning models
Algorithms | lm, ranger, xgbTree | Linear Regression, Random Forest, XGBoost
Model Evaluation | caret (e.g., postResample) | Measuring accuracy (RMSE, R²)
Data Summary | skimr | Exploring dataset structure and distributions
Encoding Categorical | caret::dummyVars | One-hot encoding of categorical variables
Different projects call for different timelines and skill levels. The requirements for this one are listed below:
Aspect | Details
Estimated Duration | 3–5 hours
Difficulty Level | Beginner to Intermediate
Skill Level Needed | Beginner to Intermediate
Must Read: ML Types Explained: A Complete Guide to Data Types in Machine Learning
The step-by-step breakdown of this project, along with the code for each step, is given below.
Before starting the project, it's important to set up Google Colab to run R instead of the default Python environment. This ensures that all R code executes correctly within the notebook.
Follow These Steps:
1. Open a new notebook in Google Colab
2. Click on Runtime in the top menu
3. Select Change runtime type
4. In the Language dropdown, choose R
5. Click Save to apply the changes
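If you want to confirm that the runtime switched correctly, a quick optional check is to print the R version from a notebook cell:
# Optional: confirm the notebook is running R rather than Python
R.version.string   # prints something like "R version 4.x.x (...)"
If this prints an R version string, the environment is ready for the steps below.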
In this step, we install and load all the necessary R packages used for data manipulation, visualization, machine learning, and model evaluation. The code for this step is:
# Install required packages (only needed once)
install.packages(c("tidyverse", "caret", "janitor", "skimr", "ranger", "xgboost", "vip", "e1071"))
# Load libraries into the session
library(tidyverse) # For data manipulation and visualization
library(caret) # For machine learning workflow
library(janitor) # For cleaning column names
library(skimr) # For quick data summary
library(ranger) # Random Forest algorithm
library(xgboost) # XGBoost algorithm
library(vip) # Visualize feature importance
library(e1071) # Required for some caret functionalities
The above code installs and loads all the libraries required to run this project; the console output looks like this:
Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘shape’, ‘future.apply’, ‘numDeriv’, ‘progressr’, ‘SQUAREM’, ‘diagram’, ‘lava’, ‘prodlim’, ‘iterators’, ‘clock’, ‘gower’, ‘hardhat’, ‘ipred’, ‘sparsevctrs’, ‘timeDate’, ‘foreach’, ‘ModelMetrics’, ‘plyr’, ‘pROC’, ‘recipes’, ‘reshape2’, ‘snakecase’, ‘RcppEigen’, ‘yardstick’, ‘proxy’

── Attaching core tidyverse packages ─────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0
── Conflicts ───────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading required package: lattice

Attaching package: ‘caret’
The following object is masked from ‘package:purrr’:
    lift

Attaching package: ‘janitor’
The following objects are masked from ‘package:stats’:
    chisq.test, fisher.test

Attaching package: ‘xgboost’
The following object is masked from ‘package:dplyr’:
    slice

Attaching package: ‘vip’
The following object is masked from ‘package:utils’:
    vi
This step reads the NBA player statistics CSV file into R and cleans the column names for easier use in later steps.
# Load the CSV file into R
nba_raw <- read_csv("2023_nba_player_stats.csv")
# Clean column names (e.g., converts to snake_case)
nba <- nba_raw %>% clean_names()
The output of the above code is:
Rows: 539 Columns: 30
── Column specification ────────────────────────────────
Delimiter: ","
chr  (3): PName, POS, Team
dbl (27): Age, GP, W, L, Min, PTS, FGM, FGA, FG%, 3PM, 3PA, 3P%, FTM, FTA, F...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The above output means that the dataset contains 539 rows and 30 columns: 3 character columns (PName, POS, Team) and 27 numeric columns holding per-player statistics such as age, games played, minutes, and shooting figures. An optional way to explore the data further is shown below.
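Because skimr is listed in the tools table and loaded earlier, an optional way to explore the dataset's structure and distributions before modeling is the sketch below (an addition, not part of the original walkthrough):
# Optional: column-by-column summary of types, missing values, and distributions
skim(nba_raw)
# Optional: compact look at the cleaned, snake_case column names
glimpse(nba)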
Upskill Now: R Tutorial for Beginners: Become an Expert in R Programming
In this step, we will remove unnecessary columns, convert categorical features into numeric format (one-hot encoding), and prepare the dataset for model training. We use the following code for this step:
# Remove columns not useful for modeling (e.g., name, fantasy points, unknown column)
nba_clean <- nba %>%
select(-p_name, -x, -fp)
# Convert 'team' and 'pos' to categorical variables
nba_clean <- nba_clean %>%
mutate(team = as.factor(team), pos = as.factor(pos))
# Create dummy variables (one-hot encoding) for categorical columns
dummies <- dummyVars(pts ~ ., data = nba_clean)
nba_encoded <- predict(dummies, newdata = nba_clean) %>% as.data.frame()
# Combine the target variable 'pts' with the encoded features
nba_ready <- bind_cols(pts = nba_clean$pts, nba_encoded)
# Remove any rows that have missing values
nba_ready <- nba_ready %>% drop_na()
In the above step, we dropped the player name, unnamed index, and fantasy points columns, converted team and pos to factors, one-hot encoded those categorical columns with dummyVars(), recombined the encoded features with the target variable pts, and removed any rows containing missing values. A quick optional sanity check on the result follows.
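As a quick optional sanity check (not part of the original steps), you can confirm that the encoded dataset has no missing values and see how many feature columns the one-hot encoding produced:
# Optional sanity check on the modeling dataset
dim(nba_ready)          # number of rows and columns after encoding
sum(is.na(nba_ready))   # should be 0 after drop_na()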
We divide the data into training (80%) and testing (20%) sets. This helps us train the model on one portion of the data and evaluate it on another to ensure reliable results.
# Set seed so results stay the same every time you run it
set.seed(42)
# Split the data: 80% training, 20% testing
split <- caret::createDataPartition(nba_ready$pts, p = 0.8, list = FALSE)
# Training data (80%)
train_df <- nba_ready[split, ]
# Testing data (20%)
test_df <- nba_ready[-split, ]
# Print the number of rows in each set to confirm the split
cat("Training set rows:", nrow(train_df), "\n")
cat("Testing set rows:", nrow(test_df))
The output for this step is:
Train rows: 432
Test rows: 107
This means that the data has been successfully split into two sets: 432 rows (about 80%) for training and 107 rows (about 20%) for testing.
Here’s Something Fun: Car Data Analysis Project Using R
In this step, we will set up how the model will be trained and validated using cross-validation. We will also define preprocessing steps to normalize the data for better model performance. The code for this step is:
# Define how the model will train and validate
ctrl <- trainControl(
method = "repeatedcv", # Use repeated cross-validation
number = 5, # Split data into 5 folds
repeats = 3, # Repeat the 5-fold CV 3 times
verboseIter = TRUE # Display training progress
)
# Preprocessing steps: normalize the data by centering and scaling
preprocess_steps <- c("center", "scale")
In the above step, we define a training strategy using repeated cross-validation: the training data is split into 5 folds, each fold takes a turn as the validation set while the model trains on the remaining folds, and the whole procedure is repeated 3 times, giving 15 resamples in total for more stable performance estimates; verboseIter = TRUE prints progress for each fold.
We also set preprocessing steps to center (subtract the mean) and scale (divide by the standard deviation) every predictor so that all features are on a comparable scale, as illustrated in the sketch below.
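For intuition, here is a minimal sketch of what centering and scaling do to a single numeric column; caret applies the same transformation automatically to every predictor during training, and the column min (minutes played) is used here purely as an example:
# Illustration only: centering subtracts the mean, scaling divides by the standard deviation
x <- nba_ready$min
x_standardized <- (x - mean(x)) / sd(x)
round(c(mean = mean(x_standardized), sd = sd(x_standardized)), 4)   # approximately 0 and 1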
In this step, we will train a linear regression model to predict total points scored (pts) using all other features in the training set. We will use the training strategy and preprocessing steps defined earlier. The code for this section is:
# Set seed to ensure consistent results
set.seed(42)
# Train a linear regression model using caret
lm_fit <- caret::train(
pts ~ ., # Formula: predict 'pts' using all other variables
data = train_df, # Use the training data
method = "lm", # Use linear regression
trControl = ctrl, # Use the cross-validation setup from Step 6
preProcess = preprocess_steps # Normalize the data (center and scale)
)
The output for the above code looks like this:
+ Fold1.Rep1: intercept=TRUE   - Fold1.Rep1: intercept=TRUE
+ Fold2.Rep1: intercept=TRUE   - Fold2.Rep1: intercept=TRUE
+ Fold3.Rep1: intercept=TRUE   - Fold3.Rep1: intercept=TRUE
+ Fold4.Rep1: intercept=TRUE   - Fold4.Rep1: intercept=TRUE
+ Fold5.Rep1: intercept=TRUE   - Fold5.Rep1: intercept=TRUE
+ Fold1.Rep2: intercept=TRUE   - Fold1.Rep2: intercept=TRUE
+ Fold2.Rep2: intercept=TRUE   - Fold2.Rep2: intercept=TRUE
+ Fold3.Rep2: intercept=TRUE   - Fold3.Rep2: intercept=TRUE
+ Fold4.Rep2: intercept=TRUE   - Fold4.Rep2: intercept=TRUE
+ Fold5.Rep2: intercept=TRUE   - Fold5.Rep2: intercept=TRUE
+ Fold1.Rep3: intercept=TRUE   - Fold1.Rep3: intercept=TRUE
+ Fold2.Rep3: intercept=TRUE   - Fold2.Rep3: intercept=TRUE
+ Fold3.Rep3: intercept=TRUE   - Fold3.Rep3: intercept=TRUE
+ Fold4.Rep3: intercept=TRUE   - Fold4.Rep3: intercept=TRUE
+ Fold5.Rep3: intercept=TRUE   - Fold5.Rep3: intercept=TRUE
Aggregating results
Fitting final model on full training set
The above output shows caret training and validating the linear regression on each of the 5 folds across the 3 repeats (Fold1.Rep1 through Fold5.Rep3, 15 fits in total), then aggregating the resampled results and fitting the final model on the full training set. The fitted object can be inspected as shown below.
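Once training finishes, you can review the cross-validated performance of the linear model by printing the fitted caret object (an optional inspection step):
# Cross-validated RMSE, R-squared, and MAE for the linear regression model
print(lm_fit)
lm_fit$results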
In this step, we are going to train a Random Forest model to predict player points. Random Forest is a powerful algorithm that combines many decision trees to improve prediction accuracy and reduce overfitting. The code for this step is:
# Set seed to make results reproducible
set.seed(42)
# Train a Random Forest model using the 'ranger' method
rf_fit <- caret::train(
pts ~ ., # Predict 'pts' using all other variables
data = train_df, # Use the training data
method = "ranger", # Use the Random Forest algorithm
trControl = ctrl, # Apply cross-validation settings
preProcess = preprocess_steps, # Normalize data
tuneLength = 5 # Try 5 different tuning settings
)
The output for this step is a fold-by-fold training log similar to the one shown for linear regression, this time listing the Random Forest tuning settings tried in each fold.
In the above step, caret trains Random Forest models through the ranger package, evaluates 5 different tuning combinations (mtry, splitrule, and min.node.size) with repeated cross-validation, and keeps the best-performing combination for the final model; the selected settings can be inspected as shown below.
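To see which of the tuning settings caret selected for the Random Forest, an optional inspection of the fitted object looks like this:
# Best hyperparameter combination chosen by cross-validation
rf_fit$bestTune
# Resampled performance for every tuning combination tried
rf_fit$results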
In this step, we will then train an XGBoost model, which is a high-performance algorithm known for accuracy and speed. It builds trees one at a time and improves with each step, making it great for predictions. The code for this step is:
# Set seed for reproducible results
set.seed(42)
# Train an XGBoost model using caret
xgb_fit <- caret::train(
pts ~ ., # Predict 'pts' using all features
data = train_df, # Use the training set
method = "xgbTree", # Use XGBoost with decision trees
trControl = ctrl, # Apply cross-validation setup
preProcess = preprocess_steps, # Normalize the data
tuneLength = 10, # Try 10 combinations of tuning settings
verbose = FALSE # Turn off training messages
)
The output of the above step is again a fold-by-fold training log, this time covering the 10 XGBoost tuning combinations.
In the above step, caret evaluates combinations of XGBoost parameters such as the number of boosting rounds, tree depth, and learning rate with repeated cross-validation and keeps the best combination for the final model; verbose = FALSE suppresses XGBoost's own per-iteration messages. The tuning results can be visualized as shown below.
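With 10 tuning combinations, it can help to see how performance varied across the XGBoost settings; caret's plot method for trained models and the bestTune element make this easy (optional):
# Resampled RMSE across the XGBoost tuning combinations
plot(xgb_fit)
# The single best combination kept for the final model
xgb_fit$bestTune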
Here’s a Fun R Project: Forest Fire Project Using R - A Step-by-Step Guide
In this step, we then compare how well each model (Linear Regression, Random Forest, XGBoost) performed using cross-validation. We’ll print a summary and visualize the results using a boxplot. The code for this step is given below:
# Combine the results of all trained models for comparison
results <- resamples(list(
Linear_Regression = lm_fit,
Random_Forest = rf_fit,
XGBoost = xgb_fit
))
# Print summary of model performance (e.g., RMSE, R-squared)
summary(results)
# Create a boxplot to visually compare model performance
bwplot(results)
The output for this code is:
Call:
summary.resamples(object = results)

Models: Linear_Regression, Random_Forest, XGBoost
Number of resamples: 15

MAE
                          Min.      1st Qu.       Median         Mean      3rd Qu.         Max. NA's
Linear_Regression 1.410072e-13 2.057361e-13 2.310297e-13 2.352077e-13 2.584762e-13 3.490010e-13    0
Random_Forest     1.467092e+01 1.761664e+01 2.025753e+01 2.087942e+01 2.408861e+01 2.967785e+01    0
XGBoost           1.992400e+01 2.131183e+01 2.334781e+01 2.395862e+01 2.618161e+01 3.044348e+01    0

RMSE
                          Min.      1st Qu.       Median         Mean      3rd Qu.         Max. NA's
Linear_Regression 1.818315e-13 2.714956e-13 3.004007e-13 3.035248e-13 3.449525e-13 4.519012e-13    0
Random_Forest     2.231048e+01 2.803128e+01 3.310515e+01 3.543857e+01 4.210753e+01 5.497088e+01    0
XGBoost           2.992772e+01 3.333336e+01 3.590806e+01 3.721283e+01 4.088298e+01 4.789698e+01    0

Rsquared
                       Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
Linear_Regression 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000    0
Random_Forest     0.9921872 0.9945075 0.9959446 0.9955249 0.9966669 0.9982100    0
XGBoost           0.9906932 0.9940607 0.9951215 0.9947949 0.9957946 0.9972477    0
The bwplot(results) call produces a boxplot comparing the distributions of MAE, RMSE, and R-squared across the 15 resamples for the three models.
The above output means that all three models predict player points very accurately across the cross-validation resamples. Linear Regression shows errors close to zero (RMSE on the order of 1e-13) and a perfect R-squared of 1.0, which indicates that total points in this dataset are essentially an exact linear combination of the shooting statistics. Random Forest and XGBoost have larger but still small errors, with mean RMSE of roughly 35.4 and 37.2 points respectively and R-squared values above 0.99.
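If you prefer the comparison as numbers rather than a plot, the mean cross-validated metrics per model can be pulled out of the resamples summary; this is a small convenience snippet, not part of the original walkthrough:
# Mean RMSE and R-squared for each model across the 15 resamples
res_summary <- summary(results)
res_summary$statistics$RMSE[, "Mean"]
res_summary$statistics$Rsquared[, "Mean"]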
Must read: R For Data Science: Why Should You Choose R for Data Science?
Now that we’ve trained our models, we’ll use the XGBoost model to predict player points on the test data. We’ll then compare the predicted values to the actual points to see how well the model is performing. We’ll use this code:
# Predict player points using the trained XGBoost model
predictions <- predict(xgb_fit, newdata = test_df)
# Compare predicted vs actual values in a new table
comparison <- data.frame(
Actual = test_df$pts, # Real player points
Predicted = round(predictions, 1) # Predicted points, rounded to 1 decimal
)
# Show the first 10 rows of comparison
head(comparison, 10)
The output for the above code is:
   Actual Predicted
1    1959    1871.2
2    1826    1915.2
3    1490    1513.3
4    1447    1477.0
5    1431    1466.9
6    1357    1325.3
7    1347    1456.8
8    1302    1243.1
9    1298    1226.0
10   1290    1250.2
The above output shows the actual and predicted season point totals for the first 10 players in the test set. The predictions track the actual values closely, mostly within a few dozen points. A quick way to view the fit across the whole test set is shown below.
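To judge the fit across the entire test set rather than just ten rows, an optional scatter plot of predicted versus actual points can be drawn with the already-loaded ggplot2 (a sketch, not part of the original steps):
# Points close to the dashed diagonal indicate accurate predictions
ggplot(comparison, aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(title = "XGBoost: Predicted vs Actual Player Points",
       x = "Actual points", y = "Predicted points")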
In this step, we calculate accuracy metrics to measure how well the model performed on the test set. These metrics help us understand how close our predictions are to the actual values. The code for this step is:
# Calculate model performance metrics on test data
postResample(pred = predictions, obs = test_df$pts)
The output for the above step is:
RMSE      32.7604351856363
Rsquared  0.994865387369555
MAE       21.8286527433546
The above output means that the RMSE is about 32.76, so a typical prediction error is roughly 33 points over the season; the R-squared of about 0.995 means the model explains roughly 99.5% of the variation in total points on the test set; and the MAE of about 21.83 means the average absolute error is around 22 points. A manual computation of the same metrics is sketched below.
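For reference, the same three metrics can be computed by hand from the predictions, which makes their definitions concrete; this sketch mirrors what postResample() reports for regression:
# Manual computation of the test-set metrics
errors <- test_df$pts - predictions
rmse <- sqrt(mean(errors^2))             # root mean squared error
mae  <- mean(abs(errors))                # mean absolute error
r2   <- cor(test_df$pts, predictions)^2  # squared correlation, as postResample uses for regression
c(RMSE = rmse, Rsquared = r2, MAE = mae)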
This step helps you understand which features mattered the most when predicting player points using the XGBoost model. We’ll use the vip package to generate a simple plot of the top 10 features. The code for this step is:
# Show the 10 most important features used by the XGBoost model
vip(xgb_fit, num_features = 10)
The above graph ranks the ten features that contributed most to the XGBoost model's point predictions, so you can see at a glance which statistics, such as field goals made, free throws made, and minutes played, drive the scoring estimates.
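If you prefer the importance scores as a table instead of a plot, the vip package also provides vi(), which returns the same information as a data frame (an optional extra):
# Variable importance scores, sorted from most to least important
vi(xgb_fit)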
In this Player Performance Analysis project, we used R in Google Colab to build a regression model that predicts NBA player points based on game statistics such as field goals made, free throws, and minutes played.
After preprocessing and encoding the data, we trained models including Linear Regression, Random Forest, and XGBoost.
The XGBoost model, used for the final predictions, achieved an RMSE of 32.76 and an R-squared value of 0.995 on the test set, meaning it explains about 99.5% of the variation in player points, which is highly accurate prediction performance.
Colab Link:
https://colab.research.google.com/drive/1skF1-VvZ2Q5yebC7Ag6CBAcSbP07-lR6#scrollTo=HfH19rxEDMrN