Player Performance Analysis & Prediction Using R

By Rohit Sharma

Updated on Jul 31, 2025 | 14 min read | 1.26K+ views

This player performance analysis project focuses on predicting the total points scored by NBA players using machine learning techniques in R. We will use the 2023 NBA player stats dataset to identify key performance metrics of players like minutes played, field goals, assists, and more.

The data will be cleaned, preprocessed, and modeled using algorithms such as Linear Regression, Random Forest, and XGBoost. The goal of this project is to build an accurate prediction model and identify the most influential features that affect a player’s scoring.


Looking for your next R project? Browse the Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025.

What Should We Know Before Starting the Project?

Before starting the player performance analysis project, there are a few things we need to know in order to ensure a successful outcome.

  • We must be familiar with basic R syntax and data frames.
  • We should have knowledge of data cleaning and preprocessing techniques.
  • We must understand how to split data into training and testing sets.
  • We should be familiar with regression and basic machine learning concepts.
  • We must know how to evaluate models using metrics like RMSE and R².


Tools, Technologies, and R Libraries Used in This Player Performance Analysis Project

The following tools and libraries will be used in this project:

| Category | Name | Purpose |
| --- | --- | --- |
| Programming Language | R | Data analysis and machine learning |
| Platform | Google Colab (R runtime) | Cloud-based coding environment |
| Data Manipulation | dplyr, tidyverse, janitor | Cleaning, transforming, and wrangling data |
| Visualization | ggplot2, vip | Data and feature-importance visualization |
| ML Framework | caret | Building and tuning machine learning models |
| Algorithms | lm, ranger, xgbTree | Linear Regression, Random Forest, XGBoost |
| Model Evaluation | caret (e.g., postResample) | Measuring accuracy (RMSE, R²) |
| Data Summary | skimr | Exploring dataset structure and distributions |
| Categorical Encoding | caret::dummyVars | One-hot encoding of categorical variables |

What to Expect: Duration, Difficulty & Skill Level

Here is what to expect in terms of time and skill requirements:

| Aspect | Details |
| --- | --- |
| Estimated Duration | 3–5 hours |
| Difficulty Level | Beginner to Intermediate |
| Skill Level Needed | Beginner to Intermediate |

Must Read: ML Types Explained: A Complete Guide to Data Types in Machine Learning

Step-by-Step Player Performance Analysis Project Breakdown

The step-by-step breakdown of this project, with code, is given below.

Step 1: Configure Google Colab to Use R

Before starting the project, it's important to set up Google Colab to run R instead of the default Python environment. This ensures that all R code executes correctly within the notebook.

Follow These Steps:

1. Open a new notebook in Google Colab

2. Click on Runtime in the top menu

3. Select Change runtime type

4. In the Language dropdown, choose R

5. Click Save to apply the changes

Step 2: Install and Load Required R Libraries

In this step, we install and load all the necessary R packages used for data manipulation, visualization, machine learning, and model evaluation. The code for this step is:

# Install required packages (only needed once)
install.packages(c("tidyverse", "caret", "janitor", "skimr", "ranger", "xgboost", "vip", "e1071"))

# Load libraries into the session
library(tidyverse)   # For data manipulation and visualization
library(caret)       # For machine learning workflow
library(janitor)     # For cleaning column names
library(skimr)       # For quick data summary
library(ranger)      # Random Forest algorithm
library(xgboost)     # XGBoost algorithm
library(vip)         # Visualize feature importance
library(e1071)       # Required for some caret functionalities

The above code will install all the required libraries to run this project:

Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘shape’, ‘future.apply’, ‘numDeriv’, ‘progressr’, ‘SQUAREM’, ‘diagram’, ‘lava’, ‘prodlim’, ‘iterators’, ‘clock’, ‘gower’, ‘hardhat’, ‘ipred’, ‘sparsevctrs’, ‘timeDate’, ‘foreach’, ‘ModelMetrics’, ‘plyr’, ‘pROC’, ‘recipes’, ‘reshape2’, ‘snakecase’, ‘RcppEigen’, ‘yardstick’, ‘proxy’

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading required package: lattice

Attaching package: ‘caret’
The following object is masked from ‘package:purrr’:
    lift

Attaching package: ‘janitor’
The following objects are masked from ‘package:stats’:
    chisq.test, fisher.test

Attaching package: ‘xgboost’
The following object is masked from ‘package:dplyr’:
    slice

Attaching package: ‘vip’
The following object is masked from ‘package:utils’:
    vi

Step 3: Load and Clean the Dataset

This step reads the NBA player statistics CSV file into R and cleans the column names for easier use in later steps.

# Load the CSV file into R
nba_raw <- read_csv("2023_nba_player_stats.csv")

# Clean column names (e.g., converts to snake_case)
nba <- nba_raw %>% clean_names()

The output of the above code is:

Rows: 539 Columns: 30

── Column specification ────────────────────────────────────────────────────────

Delimiter: ","

chr  (3): PName, POS, Team

dbl (27): Age, GP, W, L, Min, PTS, FGM, FGA, FG%, 3PM, 3PA, 3P%, FTM, FTA, F...

Use `spec()` to retrieve the full column specification for this data.

Specify the column types or set `show_col_types = FALSE` to quiet this message.

The above output means:

  • The dataset has 539 rows (players) and 30 columns (stats).
  • There are 3 text columns (player name, position, team) and 27 numeric columns (like age, points, assists, etc.).
  • The message confirms the file was loaded correctly and the data types were auto-detected.

Upskill Now: R Tutorial for Beginners: Become an Expert in R Programming

Step 4: Clean, Encode, and Prepare the Data

In this step, we will remove unnecessary columns, convert categorical features into numeric format (one-hot encoding), and prepare the dataset for model training. We use the following code for this step:

# Remove columns not useful for modeling (e.g., name, fantasy points, unknown column)
nba_clean <- nba %>%
  select(-p_name, -x, -fp)

# Convert 'team' and 'pos' to categorical variables
nba_clean <- nba_clean %>%
  mutate(team = as.factor(team), pos = as.factor(pos))

# Create dummy variables (one-hot encoding) for categorical columns
dummies <- dummyVars(pts ~ ., data = nba_clean)
nba_encoded <- predict(dummies, newdata = nba_clean) %>% as.data.frame()

# Combine the target variable 'pts' with the encoded features
nba_ready <- bind_cols(pts = nba_clean$pts, nba_encoded)

# Remove any rows that have missing values
nba_ready <- nba_ready %>% drop_na()

In the above step:

  • We remove unnecessary columns like player name and irrelevant stats.
  • We convert team and position into numeric format using one-hot encoding.
  • We prepare a clean dataset with only numeric values, ready for model training.
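One-hot encoding turns each level of a factor into its own 0/1 column. Here is a minimal base-R sketch of the same idea, using model.matrix() as a stand-in for caret::dummyVars(); the toy position values below are made up for illustration:

```r
# Toy factor column: three players at three positions
df <- data.frame(pos = factor(c("PG", "SG", "C")))

# "~ pos - 1" drops the intercept so every level gets its own 0/1 column
encoded <- model.matrix(~ pos - 1, data = df)

colnames(encoded)    # "posC" "posPG" "posSG" (levels are sorted alphabetically)
encoded[1, "posPG"]  # 1 (the first player is a point guard)
```

caret::dummyVars() does the same job across every factor column at once, which is why it is the more convenient choice in the main pipeline.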

Step 5: Split the Data into Training and Testing Sets

We divide the data into training (80%) and testing (20%) sets. This helps us train the model on one portion of the data and evaluate it on another to ensure reliable results.

# Set seed so results stay the same every time you run it
set.seed(42)

# Split the data: 80% training, 20% testing
split <- caret::createDataPartition(nba_ready$pts, p = 0.8, list = FALSE)

# Training data (80%)
train_df <- nba_ready[split, ]

# Testing data (20%)
test_df <- nba_ready[-split, ]

# Print the number of rows in each set to confirm the split
cat("Training set rows:", nrow(train_df), "\n")
cat("Testing set rows:", nrow(test_df))

The output for this step is:

Training set rows: 432 
Testing set rows: 107

This means that the data has been successfully split into two sets:

  • 432 rows for training (used to build the model)
  • 107 rows for testing (used to evaluate model performance)
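Under the hood, an 80/20 split is just random index sampling. A base-R sketch of the same idea (without createDataPartition's stratification on pts, so the exact counts differ slightly from caret's 432/107):

```r
# Plain random 80/20 split in base R, for comparison with
# caret::createDataPartition(), which additionally stratifies on the target.
set.seed(42)
n <- 539                                   # rows in the dataset
train_idx <- sample(n, size = round(0.8 * n))

length(train_idx)       # 431 training rows
n - length(train_idx)   # 108 test rows
```

Stratified sampling (what createDataPartition does) keeps the distribution of pts similar in both sets, which matters when the target is skewed.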

Here’s Something Fun: Car Data Analysis Project Using R

Step 6: Define Training Method and Preprocessing Steps

In this step, we will set up how the model will be trained and validated using cross-validation. We will also define preprocessing steps to normalize the data for better model performance. The code for this step is:

# Define how the model will train and validate
ctrl <- trainControl(
  method = "repeatedcv",  # Use repeated cross-validation
  number = 5,             # Split data into 5 folds
  repeats = 3,            # Repeat the 5-fold CV 3 times
  verboseIter = TRUE      # Display training progress
)

# Preprocessing steps: normalize the data by centering and scaling
preprocess_steps <- c("center", "scale")

In the above step, we define a training strategy using repeated cross-validation, which means:

  • The data is split into 5 parts (folds).
  • The model is trained and validated multiple times (3 repeats), each time using different folds.
  • This helps us get a more reliable and stable model.

We also set preprocessing steps to:

  • Center the data (subtract the mean)
  • Scale the data (divide by standard deviation)
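Concretely, centering and scaling convert each feature to z-scores. A base-R sketch with made-up minutes-played values:

```r
# What "center" and "scale" do: subtract the mean, divide by the
# standard deviation. The values here are made up for illustration.
minutes <- c(10, 20, 30, 40, 50)

centered_scaled <- (minutes - mean(minutes)) / sd(minutes)

# Base R's scale() performs the same transformation in one call
all.equal(as.vector(scale(minutes)), centered_scaled)  # TRUE

mean(centered_scaled)  # 0 (up to floating-point rounding)
sd(centered_scaled)    # 1
```

After this transformation every feature has mean 0 and standard deviation 1, so no single feature dominates just because it is measured on a larger scale.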

Step 7: Train a Linear Regression Model

In this step, we will train a linear regression model to predict total points scored (pts) using all other features in the training set. We will use the training strategy and preprocessing steps defined earlier. The code for this section is:

# Set seed to ensure consistent results
set.seed(42)

# Train a linear regression model using caret
lm_fit <- caret::train(
  pts ~ .,                 # Formula: predict 'pts' using all other variables
  data = train_df,         # Use the training data
  method = "lm",           # Use linear regression
  trControl = ctrl,        # Use the cross-validation setup from Step 6
  preProcess = preprocess_steps  # Normalize the data (center and scale)
)

The output for the above code looks like this:

+ Fold1.Rep1: intercept=TRUE 
- Fold1.Rep1: intercept=TRUE 
+ Fold2.Rep1: intercept=TRUE 
- Fold2.Rep1: intercept=TRUE 
+ Fold3.Rep1: intercept=TRUE 
- Fold3.Rep1: intercept=TRUE 
+ Fold4.Rep1: intercept=TRUE 
- Fold4.Rep1: intercept=TRUE 
+ Fold5.Rep1: intercept=TRUE 
- Fold5.Rep1: intercept=TRUE 
(the same +/- pair repeats for each of the 5 folds in Rep2 and Rep3)
Aggregating results
Fitting final model on full training set

The above output means:

  • The model is trained in parts (folds) to check how well it performs.
  • It does this 3 times to make sure the results are reliable.
  • At the end, it builds one final model using all the training data.

Step 8: Train a Random Forest Model

In this step, we are going to train a Random Forest model to predict player points. Random Forest is a powerful algorithm that combines many decision trees to improve prediction accuracy and reduce overfitting. The code for this step is:

# Set seed to make results reproducible
set.seed(42)

# Train a Random Forest model using the 'ranger' method
rf_fit <- caret::train(
  pts ~ .,                 # Predict 'pts' using all other variables
  data = train_df,         # Use the training data
  method = "ranger",       # Use the Random Forest algorithm
  trControl = ctrl,        # Apply cross-validation settings
  preProcess = preprocess_steps,  # Normalize data
  tuneLength = 5           # Try 5 different tuning settings
)

As the model trains, caret prints cross-validation progress for each tuning setting it tries. In this step:

  • The model will try 5 different combinations of tuning settings (like number of trees and depth).
  • You’ll see cross-validation results for each setting, showing how well the model performed.
  • At the end, it selects the best settings and trains the final Random Forest model using the training data.

Step 9: Train an XGBoost Model

In this step, we will then train an XGBoost model, which is a high-performance algorithm known for accuracy and speed. It builds trees one at a time and improves with each step, making it great for predictions. The code for this step is:

# Set seed for reproducible results
set.seed(42)

# Train an XGBoost model using caret
xgb_fit <- caret::train(
  pts ~ .,                 # Predict 'pts' using all features
  data = train_df,         # Use the training set
  method = "xgbTree",      # Use XGBoost with decision trees
  trControl = ctrl,        # Apply cross-validation setup
  preProcess = preprocess_steps,  # Normalize the data
  tuneLength = 10,         # Try 10 combinations of tuning settings
  verbose = FALSE          # Turn off training messages
)

As training runs, caret evaluates each of the tuning combinations via cross-validation. In this step:

  • The model will test 10 different tuning combinations (like learning rate, depth, etc.).
  • You'll see a performance score (like RMSE) for each combination.
  • At the end, it picks the best settings and builds the final XGBoost model using the full training data.

Here’s a Fun R Project: Forest Fire Project Using R - A Step-by-Step Guide

Step 10: Compare All Model Performances

In this step, we then compare how well each model (Linear Regression, Random Forest, XGBoost) performed using cross-validation. We’ll print a summary and visualize the results using a boxplot. The code for this step is given below:

# Combine the results of all trained models for comparison
results <- resamples(list(
  Linear_Regression = lm_fit,
  Random_Forest = rf_fit,
  XGBoost = xgb_fit
))

# Print summary of model performance (e.g., RMSE, R-squared)
summary(results)

# Create a boxplot to visually compare model performance
bwplot(results)

The output for this code is:

Call:
summary.resamples(object = results)

Models: Linear_Regression, Random_Forest, XGBoost 
Number of resamples: 15 

MAE 
                          Min.      1st Qu.       Median         Mean      3rd Qu.         Max. NA's
Linear_Regression 1.410072e-13 2.057361e-13 2.310297e-13 2.352077e-13 2.584762e-13 3.490010e-13    0
Random_Forest     1.467092e+01 1.761664e+01 2.025753e+01 2.087942e+01 2.408861e+01 2.967785e+01    0
XGBoost           1.992400e+01 2.131183e+01 2.334781e+01 2.395862e+01 2.618161e+01 3.044348e+01    0

RMSE 
                          Min.      1st Qu.       Median         Mean      3rd Qu.         Max. NA's
Linear_Regression 1.818315e-13 2.714956e-13 3.004007e-13 3.035248e-13 3.449525e-13 4.519012e-13    0
Random_Forest     2.231048e+01 2.803128e+01 3.310515e+01 3.543857e+01 4.210753e+01 5.497088e+01    0
XGBoost           2.992772e+01 3.333336e+01 3.590806e+01 3.721283e+01 4.088298e+01 4.789698e+01    0

Rsquared 
                       Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
Linear_Regression 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000    0
Random_Forest     0.9921872 0.9945075 0.9959446 0.9955249 0.9966669 0.9982100    0
XGBoost           0.9906932 0.9940607 0.9951215 0.9947949 0.9957946 0.9972477    0

The bwplot() call produces side-by-side boxplots of MAE, RMSE, and R² for the three models (figure not shown here).

The above output means:

  • Linear Regression shows near-perfect scores (near-zero error and R² = 1). This is a sign of data leakage rather than genuine skill: total points is an exact linear combination of other columns (2·FGM + 3PM + FTM), so a linear model can reproduce it perfectly.
  • Random Forest performs well with lower error (RMSE ≈ 35, MAE ≈ 21) and very high accuracy (R² ≈ 0.995).
  • XGBoost also performs well but slightly worse than Random Forest, with higher errors and slightly lower R².
  • Overall, Random Forest is the best-performing model among the three, balancing error and accuracy well.
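The leakage is easy to verify from basketball's scoring rules: every field goal is worth 2 points, a three-pointer adds 1 extra point on top of its field goal, and a free throw adds 1, so pts = 2·FGM + 3PM + FTM exactly. A quick base-R check with a made-up stat line:

```r
# Made-up stat line: 8 field goals made (3 of them threes), 5 free throws
fgm  <- 8   # all field goals made (twos + threes)
x3pm <- 3   # three-pointers made (a subset of fgm)
ftm  <- 5   # free throws made

# Score it the long way: twos are worth 2, threes 3, free throws 1
pts <- 2 * (fgm - x3pm) + 3 * x3pm + ftm

# ...which simplifies to the linear identity 2*FGM + 3PM + FTM
pts == 2 * fgm + x3pm + ftm  # TRUE
pts                          # 24
```

Because the target is a deterministic linear function of three predictors, Linear Regression's "perfect" cross-validation scores tell us nothing about real predictive power; dropping FGM, 3PM, and FTM from the feature set would give a more honest comparison.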

Must read: R For Data Science: Why Should You Choose R for Data Science?

Step 11: Make Predictions on Unseen Data

Now that we’ve trained our models, we’ll use the XGBoost model to predict player points on the test data (Random Forest edged it out in cross-validation, but XGBoost serves well for this demonstration). We’ll then compare the predicted values to the actual points to see how well the model performs. We’ll use this code:

# Predict player points using the trained XGBoost model
predictions <- predict(xgb_fit, newdata = test_df)

# Compare predicted vs actual values in a new table
comparison <- data.frame(
  Actual = test_df$pts,             # Real player points
  Predicted = round(predictions, 1) # Predicted points, rounded to 1 decimal
)

# Show the first 10 rows of comparison
head(comparison, 10)

The output for the above code is:

|    | Actual | Predicted |
| --- | --- | --- |
| 1  | 1959 | 1871.2 |
| 2  | 1826 | 1915.2 |
| 3  | 1490 | 1513.3 |
| 4  | 1447 | 1477.0 |
| 5  | 1431 | 1466.9 |
| 6  | 1357 | 1325.3 |
| 7  | 1347 | 1456.8 |
| 8  | 1302 | 1243.1 |
| 9  | 1298 | 1226.0 |
| 10 | 1290 | 1250.2 |

The above output shows:

  • The table shows how close the predicted player points are to the actual points in the test data.
  • For example, the first player actually scored 1959 points, and the model predicted 1871.2, which is quite close.
  • Most predictions are within a reasonable range, showing that the model is performing well on unseen data.
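We can quantify "quite close" for these ten rows directly, using the actual and predicted values from the table above:

```r
# The ten actual vs. predicted values shown in the table above
actual    <- c(1959, 1826, 1490, 1447, 1431, 1357, 1347, 1302, 1298, 1290)
predicted <- c(1871.2, 1915.2, 1513.3, 1477.0, 1466.9, 1325.3, 1456.8,
               1243.1, 1226.0, 1250.2)

# Mean absolute error for just these ten players
mean(abs(actual - predicted))  # 57.84
```

Note that this is higher than the overall test-set MAE of 21.83 reported in the next step, which makes sense: these rows are the highest scorers, where absolute errors tend to be larger.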

Step 12: Evaluate Model Accuracy

In this step, we calculate accuracy metrics to measure how well the model performed on the test set. These metrics help us understand how close our predictions are to the actual values. The code for this step is:

# Calculate model performance metrics on test data
postResample(pred = predictions, obs = test_df$pts)

The output for the above step is:

      RMSE   Rsquared        MAE 
32.7604352  0.9948654 21.8286527 

The above output means:

  • RMSE = 32.76. On average, the predictions are about 33 points off from the actual values.
  • MAE = 21.83. The average absolute difference between actual and predicted points is 22 points, which is very good.
  • R² = 0.995. This means the model explains 99.5% of the variation in player points, a very strong performance.
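The formulas behind these three metrics are simple, and a base-R sketch makes them concrete (the obs/pred vectors here are made-up numbers, not the project's test set; R² is computed as the squared correlation, matching postResample's approach):

```r
# Hand-rolled versions of the metrics caret::postResample() reports.
obs  <- c(100, 200, 300, 400)   # made-up actual values
pred <- c(110, 190, 320, 380)   # made-up predictions

rmse <- sqrt(mean((obs - pred)^2))   # root mean squared error
mae  <- mean(abs(obs - pred))        # mean absolute error
r2   <- cor(obs, pred)^2             # R-squared (squared correlation)

round(c(RMSE = rmse, MAE = mae), 2)
#  RMSE   MAE 
# 15.81 15.00
```

RMSE penalizes large misses more heavily than MAE because errors are squared before averaging, which is why RMSE (32.76) exceeds MAE (21.83) on the test set.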

Step 13: Visualize Top 10 Important Features

This step helps you understand which features mattered the most when predicting player points using the XGBoost model. We’ll use the vip package to generate a simple plot of the top 10 features. The code for this step is:

# Show the 10 most important features used by the XGBoost model
vip(xgb_fit, num_features = 10)

The resulting feature-importance plot shows:

  • fgm (Field Goals Made) is the most important feature; it has the biggest impact on predicting player points.
  • fga (Field Goals Attempted) also plays a major role, showing that shooting activity strongly relates to points scored.
  • Other helpful features include ftm (Free Throws Made) and min (Minutes played), though their influence is smaller.
  • Features like 3-pointers and position (e.g., pos.SG) have very little influence in this model.

Conclusion

In this Player Performance Analysis project, we used R in Google Colab to build a regression model that predicts NBA player points based on game statistics such as field goals made, free throws, and minutes played. 

After preprocessing and encoding the data, we trained models including Linear Regression, Random Forest, and XGBoost.

On the held-out test set, the XGBoost model achieved an RMSE of 32.76 and an R-squared value of 0.995, meaning it explains 99.5% of the variation in player points, a highly accurate result. Keep in mind that Random Forest scored slightly better in cross-validation, and Linear Regression's perfect scores reflected data leakage rather than real predictive power.


Colab Link:
https://colab.research.google.com/drive/1skF1-VvZ2Q5yebC7Ag6CBAcSbP07-lR6#scrollTo=HfH19rxEDMrN

Frequently Asked Questions (FAQs)

1. What is the Player Performance Analysis project in R all about?

2. Which R packages are used in this project?

3. What are some advanced models that can improve prediction accuracy?

4. What skills do I need before starting this project?

5. What are some other beginner-friendly R projects I can try?

Rohit Sharma

802 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
