Player Performance Analysis & Prediction Using R

By Rohit Sharma

Updated on Jul 31, 2025 | 14 min read | 1.26K+ views

This player performance analysis project focuses on predicting the total points scored by NBA players using machine learning techniques in R. We will use the 2023 NBA player stats dataset to identify key performance metrics of players like minutes played, field goals, assists, and more.

The data will be cleaned, preprocessed, and modeled using algorithms such as Linear Regression, Random Forest, and XGBoost. The goal of this project is to build an accurate prediction model and identify the most influential features that affect a player’s scoring.


Looking for your next R project? Browse the Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025.

What Should We Know Before Starting the Project?

Before starting the player performance analysis project, there are a few things we need to know in order to ensure a successful outcome.

  • We must be familiar with basic R syntax and data frames.
  • We should have knowledge of data cleaning and preprocessing techniques.
  • We must understand how to split data into training and testing sets.
  • We should be familiar with regression and basic machine learning concepts.
  • We must know how to evaluate models using metrics like RMSE and R².


Tools, Technologies, and R Libraries Used in This Player Performance Analysis Project

The following tools and libraries will be used in this project:

| Category | Name | Purpose |
| --- | --- | --- |
| Programming Language | R | Data analysis and machine learning |
| Platform | Google Colab (R runtime) | Cloud-based coding environment |
| Data Manipulation | dplyr, tidyverse, janitor | Cleaning, transforming, and wrangling data |
| Visualization | ggplot2, vip | Data and feature-importance visualization |
| ML Framework | caret | Building and tuning machine learning models |
| Algorithms | lm, ranger, xgbTree | Linear Regression, Random Forest, XGBoost |
| Model Evaluation | caret (e.g., postResample) | Measuring accuracy (RMSE, R²) |
| Data Summary | skimr | Exploring dataset structure and distributions |
| Categorical Encoding | caret::dummyVars | One-hot encoding of categorical variables |

What to Expect: Duration, Difficulty & Skill Level

Here is what to expect in terms of time and skill requirements:

| Aspect | Details |
| --- | --- |
| Estimated Duration | 3–5 hours |
| Difficulty Level | Beginner to Intermediate |
| Skill Level Needed | Beginner to Intermediate |

Must Read: ML Types Explained: A Complete Guide to Data Types in Machine Learning

Step-by-Step Player Performance Analysis Project Breakdown

The step-by-step breakdown of this project, with code, is given below.

Step 1: Configure Google Colab to Use R

Before starting the project, it's important to set up Google Colab to run R instead of the default Python environment. This ensures that all R code executes correctly within the notebook.

Follow These Steps:

1. Open a new notebook in Google Colab

2. Click on Runtime in the top menu

3. Select Change runtime type

4. In the Language dropdown, choose R

5. Click Save to apply the changes

Step 2: Install and Load Required R Libraries

In this step, we install and load all the necessary R packages used for data manipulation, visualization, machine learning, and model evaluation. The code for this step is:

# Install required packages (only needed once)
install.packages(c("tidyverse", "caret", "janitor", "skimr", "ranger", "xgboost", "vip", "e1071"))

# Load libraries into the session
library(tidyverse)   # For data manipulation and visualization
library(caret)       # For machine learning workflow
library(janitor)     # For cleaning column names
library(skimr)       # For quick data summary
library(ranger)      # Random Forest algorithm
library(xgboost)     # XGBoost algorithm
library(vip)         # Visualize feature importance
library(e1071)       # Required for some caret functionalities

The above code will install all the required libraries to run this project:

Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
also installing the dependencies ‘listenv’, ‘parallelly’, ‘future’, ‘globals’, ‘shape’, ‘future.apply’, ‘numDeriv’, ‘progressr’, ‘SQUAREM’, ‘diagram’, ‘lava’, ‘prodlim’, ‘iterators’, ‘clock’, ‘gower’, ‘hardhat’, ‘ipred’, ‘sparsevctrs’, ‘timeDate’, ‘foreach’, ‘ModelMetrics’, ‘plyr’, ‘pROC’, ‘recipes’, ‘reshape2’, ‘snakecase’, ‘RcppEigen’, ‘yardstick’, ‘proxy’

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading required package: lattice

Attaching package: ‘caret’
The following object is masked from ‘package:purrr’:
    lift

Attaching package: ‘janitor’
The following objects are masked from ‘package:stats’:
    chisq.test, fisher.test

Attaching package: ‘xgboost’
The following object is masked from ‘package:dplyr’:
    slice

Attaching package: ‘vip’
The following object is masked from ‘package:utils’:
    vi

Step 3: Load and Clean the Dataset

This step reads the NBA player statistics CSV file into R and cleans the column names for easier use in later steps.

# Load the CSV file into R
nba_raw <- read_csv("2023_nba_player_stats.csv")

# Clean column names (e.g., converts to snake_case)
nba <- nba_raw %>% clean_names()

The output of the above code is:

Rows: 539 Columns: 30

── Column specification ────────────────────────────────────────────────────────

Delimiter: ","

chr  (3): PName, POS, Team

dbl (27): Age, GP, W, L, Min, PTS, FGM, FGA, FG%, 3PM, 3PA, 3P%, FTM, FTA, F...

Use `spec()` to retrieve the full column specification for this data.

Specify the column types or set `show_col_types = FALSE` to quiet this message.

The above output means:

  • The dataset has 539 rows (players) and 30 columns (stats).
  • There are 3 text columns (player name, position, team) and 27 numeric columns (like age, points, assists, etc.).
  • The message confirms the file was loaded correctly and the data types were auto-detected.

Upskill Now: R Tutorial for Beginners: Become an Expert in R Programming

Step 4: Clean, Encode, and Prepare the Data

In this step, we will remove unnecessary columns, convert categorical features into numeric format (one-hot encoding), and prepare the dataset for model training. We use the following code for this step:

# Remove columns not useful for modeling (e.g., name, fantasy points, unknown column)
nba_clean <- nba %>%
  select(-p_name, -x, -fp)

# Convert 'team' and 'pos' to categorical variables
nba_clean <- nba_clean %>%
  mutate(team = as.factor(team), pos = as.factor(pos))

# Create dummy variables (one-hot encoding) for categorical columns
dummies <- dummyVars(pts ~ ., data = nba_clean)
nba_encoded <- predict(dummies, newdata = nba_clean) %>% as.data.frame()

# Combine the target variable 'pts' with the encoded features
nba_ready <- bind_cols(pts = nba_clean$pts, nba_encoded)

# Remove any rows that have missing values
nba_ready <- nba_ready %>% drop_na()

In the above step:

  • We remove unnecessary columns like player name and irrelevant stats.
  • We convert team and position into numeric format using one-hot encoding.
  • We prepare a clean dataset with only numeric values, ready for model training.
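One-hot encoding turns each level of a factor into its own 0/1 column. Here is a minimal base-R sketch of the same idea, using model.matrix() as a stand-in for caret::dummyVars(); the toy position values below are made up for illustration:

```r
# Toy factor column: three players at three positions
df <- data.frame(pos = factor(c("PG", "SG", "C")))

# "~ pos - 1" drops the intercept so every level gets its own 0/1 column
encoded <- model.matrix(~ pos - 1, data = df)

colnames(encoded)    # "posC" "posPG" "posSG" (levels are sorted alphabetically)
encoded[1, "posPG"]  # 1 (the first player is a point guard)
```

caret::dummyVars() does the same job across every factor column at once, which is why it is the more convenient choice in the main pipeline.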

Step 5: Split the Data into Training and Testing Sets

We divide the data into training (80%) and testing (20%) sets. This helps us train the model on one portion of the data and evaluate it on another to ensure reliable results.

# Set seed so results stay the same every time you run it
set.seed(42)

# Split the data: 80% training, 20% testing
split <- caret::createDataPartition(nba_ready$pts, p = 0.8, list = FALSE)

# Training data (80%)
train_df <- nba_ready[split, ]

# Testing data (20%)
test_df <- nba_ready[-split, ]

# Print the number of rows in each set to confirm the split
cat("Training set rows:", nrow(train_df), "\n")
cat("Testing set rows:", nrow(test_df))

The output for this step is:

Training set rows: 432 
Testing set rows: 107

This means that the data has been successfully split into two sets:

  • 432 rows for training (used to build the model)
  • 107 rows for testing (used to evaluate model performance)
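Under the hood, an 80/20 split is just random index sampling. A base-R sketch of the same idea (without createDataPartition's stratification on pts, so the exact counts differ slightly from caret's 432/107):

```r
# Plain random 80/20 split in base R, for comparison with
# caret::createDataPartition(), which additionally stratifies on the target.
set.seed(42)
n <- 539                                   # rows in the dataset
train_idx <- sample(n, size = round(0.8 * n))

length(train_idx)       # 431 training rows
n - length(train_idx)   # 108 test rows
```

Stratified sampling (what createDataPartition does) keeps the distribution of pts similar in both sets, which matters when the target is skewed.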

Here’s Something Fun: Car Data Analysis Project Using R

Step 6: Define Training Method and Preprocessing Steps

In this step, we will set up how the model will be trained and validated using cross-validation. We will also define preprocessing steps to normalize the data for better model performance. The code for this step is:

# Define how the model will train and validate
ctrl <- trainControl(
  method = "repeatedcv",  # Use repeated cross-validation
  number = 5,             # Split data into 5 folds
  repeats = 3,            # Repeat the 5-fold CV 3 times
  verboseIter = TRUE      # Display training progress
)

# Preprocessing steps: normalize the data by centering and scaling
preprocess_steps <- c("center", "scale")

In the above step, we define a training strategy using repeated cross-validation, which means:

  • The data is split into 5 parts (folds).
  • The model is trained and validated multiple times (3 repeats), each time using different folds.
  • This helps us get a more reliable and stable model.

We also set preprocessing steps to:

  • Center the data (subtract the mean)
  • Scale the data (divide by standard deviation)
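Concretely, centering and scaling convert each feature to z-scores. A base-R sketch with made-up minutes-played values:

```r
# What "center" and "scale" do: subtract the mean, divide by the
# standard deviation. The values here are made up for illustration.
minutes <- c(10, 20, 30, 40, 50)

centered_scaled <- (minutes - mean(minutes)) / sd(minutes)

# Base R's scale() performs the same transformation in one call
all.equal(as.vector(scale(minutes)), centered_scaled)  # TRUE

mean(centered_scaled)  # 0 (up to floating-point rounding)
sd(centered_scaled)    # 1
```

After this transformation every feature has mean 0 and standard deviation 1, so no single feature dominates just because it is measured on a larger scale.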

Step 7: Train a Linear Regression Model

In this step, we will train a linear regression model to predict total points scored (pts) using all other features in the training set. We will use the training strategy and preprocessing steps defined earlier. The code for this section is:

# Set seed to ensure consistent results
set.seed(42)

# Train a linear regression model using caret
lm_fit <- caret::train(
  pts ~ .,                 # Formula: predict 'pts' using all other variables
  data = train_df,         # Use the training data
  method = "lm",           # Use linear regression
  trControl = ctrl,        # Use the cross-validation setup from Step 6
  preProcess = preprocess_steps  # Normalize the data (center and scale)
)

The output for the above code looks like this:

+ Fold1.Rep1: intercept=TRUE 
- Fold1.Rep1: intercept=TRUE 
+ Fold2.Rep1: intercept=TRUE 
- Fold2.Rep1: intercept=TRUE 
+ Fold3.Rep1: intercept=TRUE 
- Fold3.Rep1: intercept=TRUE 
+ Fold4.Rep1: intercept=TRUE 
- Fold4.Rep1: intercept=TRUE 
+ Fold5.Rep1: intercept=TRUE 
- Fold5.Rep1: intercept=TRUE 
(the same +/- pair repeats for each of the 5 folds in Rep2 and Rep3)
Aggregating results
Fitting final model on full training set

The above output means:

  • The model is trained in parts (folds) to check how well it performs.
  • It does this 3 times to make sure the results are reliable.
  • At the end, it builds one final model using all the training data.

Step 8: Train a Random Forest Model

In this step, we are going to train a Random Forest model to predict player points. Random Forest is a powerful algorithm that combines many decision trees to improve prediction accuracy and reduce overfitting. The code for this step is:

# Set seed to make results reproducible
set.seed(42)

# Train a Random Forest model using the 'ranger' method
rf_fit <- caret::train(
  pts ~ .,                 # Predict 'pts' using all other variables
  data = train_df,         # Use the training data
  method = "ranger",       # Use the Random Forest algorithm
  trControl = ctrl,        # Apply cross-validation settings
  preProcess = preprocess_steps,  # Normalize data
  tuneLength = 5           # Try 5 different tuning settings
)

As the model trains, caret prints cross-validation progress for each tuning setting it tries. In this step:

  • The model will try 5 different combinations of tuning settings (like number of trees and depth).
  • You’ll see cross-validation results for each setting, showing how well the model performed.
  • At the end, it selects the best settings and trains the final Random Forest model using the training data.

Step 9: Train an XGBoost Model

In this step, we will then train an XGBoost model, which is a high-performance algorithm known for accuracy and speed. It builds trees one at a time and improves with each step, making it great for predictions. The code for this step is:

# Set seed for reproducible results
set.seed(42)

# Train an XGBoost model using caret
xgb_fit <- caret::train(
  pts ~ .,                 # Predict 'pts' using all features
  data = train_df,         # Use the training set
  method = "xgbTree",      # Use XGBoost with decision trees
  trControl = ctrl,        # Apply cross-validation setup
  preProcess = preprocess_steps,  # Normalize the data
  tuneLength = 10,         # Try 10 combinations of tuning settings
  verbose = FALSE          # Turn off training messages
)

As training runs, caret evaluates each of the tuning combinations via cross-validation. In this step:

  • The model will test 10 different tuning combinations (like learning rate, depth, etc.).
  • You'll see a performance score (like RMSE) for each combination.
  • At the end, it picks the best settings and builds the final XGBoost model using the full training data.

Here’s a Fun R Project: Forest Fire Project Using R - A Step-by-Step Guide

Step 10: Compare All Model Performances

In this step, we then compare how well each model (Linear Regression, Random Forest, XGBoost) performed using cross-validation. We’ll print a summary and visualize the results using a boxplot. The code for this step is given below:

# Combine the results of all trained models for comparison
results <- resamples(list(
  Linear_Regression = lm_fit,
  Random_Forest = rf_fit,
  XGBoost = xgb_fit
))

# Print summary of model performance (e.g., RMSE, R-squared)
summary(results)

# Create a boxplot to visually compare model performance
bwplot(results)

The output for this code is:

Call:
summary.resamples(object = results)

Models: Linear_Regression, Random_Forest, XGBoost 
Number of resamples: 15 

MAE 
                          Min.      1st Qu.       Median         Mean      3rd Qu.         Max. NA's
Linear_Regression 1.410072e-13 2.057361e-13 2.310297e-13 2.352077e-13 2.584762e-13 3.490010e-13    0
Random_Forest     1.467092e+01 1.761664e+01 2.025753e+01 2.087942e+01 2.408861e+01 2.967785e+01    0
XGBoost           1.992400e+01 2.131183e+01 2.334781e+01 2.395862e+01 2.618161e+01 3.044348e+01    0

RMSE 
                          Min.      1st Qu.       Median         Mean      3rd Qu.         Max. NA's
Linear_Regression 1.818315e-13 2.714956e-13 3.004007e-13 3.035248e-13 3.449525e-13 4.519012e-13    0
Random_Forest     2.231048e+01 2.803128e+01 3.310515e+01 3.543857e+01 4.210753e+01 5.497088e+01    0
XGBoost           2.992772e+01 3.333336e+01 3.590806e+01 3.721283e+01 4.088298e+01 4.789698e+01    0

Rsquared 
                       Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
Linear_Regression 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000    0
Random_Forest     0.9921872 0.9945075 0.9959446 0.9955249 0.9966669 0.9982100    0
XGBoost           0.9906932 0.9940607 0.9951215 0.9947949 0.9957946 0.9972477    0

The bwplot() call produces side-by-side boxplots of MAE, RMSE, and R² for the three models (figure not shown here).

The above output means:

  • Linear Regression shows near-perfect scores (near-zero error and R² = 1). This is a sign of data leakage rather than genuine skill: total points is an exact linear combination of other columns (2·FGM + 3PM + FTM), so a linear model can reproduce it perfectly.
  • Random Forest performs well with lower error (RMSE ≈ 35, MAE ≈ 21) and very high accuracy (R² ≈ 0.995).
  • XGBoost also performs well but slightly worse than Random Forest, with higher errors and slightly lower R².
  • Overall, Random Forest is the best-performing model among the three, balancing error and accuracy well.
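The leakage is easy to verify from basketball's scoring rules: every field goal is worth 2 points, a three-pointer adds 1 extra point on top of its field goal, and a free throw adds 1, so pts = 2·FGM + 3PM + FTM exactly. A quick base-R check with a made-up stat line:

```r
# Made-up stat line: 8 field goals made (3 of them threes), 5 free throws
fgm  <- 8   # all field goals made (twos + threes)
x3pm <- 3   # three-pointers made (a subset of fgm)
ftm  <- 5   # free throws made

# Score it the long way: twos are worth 2, threes 3, free throws 1
pts <- 2 * (fgm - x3pm) + 3 * x3pm + ftm

# ...which simplifies to the linear identity 2*FGM + 3PM + FTM
pts == 2 * fgm + x3pm + ftm  # TRUE
pts                          # 24
```

Because the target is a deterministic linear function of three predictors, Linear Regression's "perfect" cross-validation scores tell us nothing about real predictive power; dropping FGM, 3PM, and FTM from the feature set would give a more honest comparison.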

Must read: R For Data Science: Why Should You Choose R for Data Science?

Step 11: Make Predictions on Unseen Data

Now that we’ve trained our models, we’ll use the XGBoost model to predict player points on the test data (Random Forest edged it out in cross-validation, but XGBoost serves well for this demonstration). We’ll then compare the predicted values to the actual points to see how well the model performs. We’ll use this code:

# Predict player points using the trained XGBoost model
predictions <- predict(xgb_fit, newdata = test_df)

# Compare predicted vs actual values in a new table
comparison <- data.frame(
  Actual = test_df$pts,             # Real player points
  Predicted = round(predictions, 1) # Predicted points, rounded to 1 decimal
)

# Show the first 10 rows of comparison
head(comparison, 10)

The output for the above code is:

|    | Actual | Predicted |
| --- | --- | --- |
| 1  | 1959 | 1871.2 |
| 2  | 1826 | 1915.2 |
| 3  | 1490 | 1513.3 |
| 4  | 1447 | 1477.0 |
| 5  | 1431 | 1466.9 |
| 6  | 1357 | 1325.3 |
| 7  | 1347 | 1456.8 |
| 8  | 1302 | 1243.1 |
| 9  | 1298 | 1226.0 |
| 10 | 1290 | 1250.2 |

The above output shows:

  • The table shows how close the predicted player points are to the actual points in the test data.
  • For example, the first player actually scored 1959 points, and the model predicted 1871.2, which is quite close.
  • Most predictions are within a reasonable range, showing that the model is performing well on unseen data.
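We can quantify "quite close" for these ten rows directly, using the actual and predicted values from the table above:

```r
# The ten actual vs. predicted values shown in the table above
actual    <- c(1959, 1826, 1490, 1447, 1431, 1357, 1347, 1302, 1298, 1290)
predicted <- c(1871.2, 1915.2, 1513.3, 1477.0, 1466.9, 1325.3, 1456.8,
               1243.1, 1226.0, 1250.2)

# Mean absolute error for just these ten players
mean(abs(actual - predicted))  # 57.84
```

Note that this is higher than the overall test-set MAE of 21.83 reported in the next step, which makes sense: these rows are the highest scorers, where absolute errors tend to be larger.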

Step 12: Evaluate Model Accuracy

In this step, we calculate accuracy metrics to measure how well the model performed on the test set. These metrics help us understand how close our predictions are to the actual values. The code for this step is:

# Calculate model performance metrics on test data
postResample(pred = predictions, obs = test_df$pts)

The output for the above step is:

      RMSE   Rsquared        MAE 
32.7604352  0.9948654 21.8286527 

The above output means:

  • RMSE = 32.76. On average, the predictions are about 33 points off from the actual values.
  • MAE = 21.83. The average absolute difference between actual and predicted points is 22 points, which is very good.
  • R² = 0.995. This means the model explains 99.5% of the variation in player points, a very strong performance.
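The formulas behind these three metrics are simple, and a base-R sketch makes them concrete (the obs/pred vectors here are made-up numbers, not the project's test set; R² is computed as the squared correlation, matching postResample's approach):

```r
# Hand-rolled versions of the metrics caret::postResample() reports.
obs  <- c(100, 200, 300, 400)   # made-up actual values
pred <- c(110, 190, 320, 380)   # made-up predictions

rmse <- sqrt(mean((obs - pred)^2))   # root mean squared error
mae  <- mean(abs(obs - pred))        # mean absolute error
r2   <- cor(obs, pred)^2             # R-squared (squared correlation)

round(c(RMSE = rmse, MAE = mae), 2)
#  RMSE   MAE 
# 15.81 15.00
```

RMSE penalizes large misses more heavily than MAE because errors are squared before averaging, which is why RMSE (32.76) exceeds MAE (21.83) on the test set.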

Step 13: Visualize Top 10 Important Features

This step helps you understand which features mattered the most when predicting player points using the XGBoost model. We’ll use the vip package to generate a simple plot of the top 10 features. The code for this step is:

# Show the 10 most important features used by the XGBoost model
vip(xgb_fit, num_features = 10)

The resulting feature-importance plot shows:

  • fgm (Field Goals Made) is the most important feature; it has the biggest impact on predicting player points.
  • fga (Field Goals Attempted) also plays a major role, showing that shooting activity strongly relates to points scored.
  • Other helpful features include ftm (Free Throws Made) and min (Minutes played), though their influence is smaller.
  • Features like 3-pointers and position (e.g., pos.SG) have very little influence in this model.

Conclusion

In this Player Performance Analysis project, we used R in Google Colab to build a regression model that predicts NBA player points based on game statistics such as field goals made, free throws, and minutes played. 

After preprocessing and encoding the data, we trained models including Linear Regression, Random Forest, and XGBoost.

On the held-out test set, the XGBoost model achieved an RMSE of 32.76 and an R-squared value of 0.995, meaning it explains 99.5% of the variation in player points, a highly accurate result. Keep in mind that Random Forest scored slightly better in cross-validation, and Linear Regression's perfect scores reflected data leakage rather than real predictive power.


Colab Link:
https://colab.research.google.com/drive/1skF1-VvZ2Q5yebC7Ag6CBAcSbP07-lR6#scrollTo=HfH19rxEDMrN

Frequently Asked Questions (FAQs)

1. What is the Player Performance Analysis project in R all about?

2. Which R packages are used in this project?

3. What are some advanced models that can improve prediction accuracy?

4. What skills do I need before starting this project?

5. What are some other beginner-friendly R projects I can try?

Rohit Sharma

802 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
