Student Performance Analysis In R With Code and Explanation

By Rohit Sharma

Updated on Aug 05, 2025 | 20 min read | 1.2K+ views


This Student Performance Analysis in R project will focus on key factors that influence students’ final grades using a dataset of Portuguese secondary school students. We'll use Google Colab to run the project. 

The project includes data cleaning, visual exploration, correlation analysis, and a simple linear regression model to predict student performance based on features like prior grades, study time, and absences. 

Shape tomorrow with upGrad’s Data Science programs. Build practical skills in AI, Machine Learning, and Data Analytics for the next generation of tech leaders. Enrol now and fast-track your career.

Build Your Data Science Ambitions: Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025

How Long This Project Takes and What Skills You Need to Do It

This Student Performance Analysis in R project is beginner-friendly and easy to complete in one sitting. The skills and timeline of this project are given in the table below:

Aspect               Details
Estimated Duration   1.5 to 2 hours
Difficulty Level     Easy to Moderate
Skill Level Needed   Beginner in R and basic data analysis
Tools Required       Google Colab, R, ggplot2, corrplot, dplyr
Project Type         Exploratory Data Analysis, Regression

Take charge of your future with upGrad’s Data Science and AI programs. Learn from industry experts, master cutting-edge tools, and build a career that stands out in the AI-driven world. Enrol today and get ahead.

What Should You Know Before Starting the Student Performance Analysis Project?

To get the most out of this Student Performance Analysis in R project, it's helpful to have a basic understanding of a few core concepts. While the steps are beginner-friendly, being familiar with the following will ensure a smoother learning experience:

  • A basic understanding of R programming syntax and how to execute code in Google Colab
  • Familiarity with data frames and how to view and manipulate tabular data
  • The fundamentals of data visualization using ggplot2
  • Basic statistical concepts such as the mean, correlation, and linear regression
  • The logic behind predictive modeling, especially how independent variables affect a target variable (a short warm-up sketch follows this list)
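If correlation and linear regression are new to you, here is a minimal warm-up sketch run on R's built-in mtcars dataset (not the student data used in this project) that shows both ideas in a few lines:

# Warm-up on R's built-in mtcars dataset (not the student data used in this project)
cor(mtcars$disp, mtcars$mpg)                 # strength and direction of a linear relationship

toy_model <- lm(mpg ~ disp, data = mtcars)   # simple linear regression: predict mpg from engine size
summary(toy_model)                           # coefficients, p-values, and R-squared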

The R Tools and Libraries Powering This Project

This project is entirely built in R using Google Colab, which allows you to run R code without installing anything on your local machine. We'll use a few essential R libraries to clean data, explore patterns, visualize relationships, and build a regression model.

Category               Name / Package            Purpose
Platform               Google Colab (R kernel)   Run and share R code in the cloud
Programming Language   R                         Perform data manipulation, analysis, and modeling
Data Wrangling         dplyr, tidyverse          Filter, select, transform, and manage data
Visualization          ggplot2, corrplot         Create plots, graphs, and correlation matrices
Data Summary           skimr (optional)          Get quick overviews of datasets
Modeling               Base R (lm)               Build and evaluate linear regression models

A Detailed Walkthrough of the Student Performance Analysis Project

This section will break down the entire project into individual steps to help you understand the concepts of data analysis and modeling used in this project. 

Step 1: Configure Google Colab to Run R Code

Google Colab runs Python by default, so we first need to switch the environment to R. This allows you to write and execute R code directly in the notebook.

To set it up:

  • Open Google Colab and start a new notebook
  • Go to the Runtime menu at the top
  • Click on Change runtime type
  • In the pop-up window, select R from the Language dropdown
  • Click Save to apply the changes

Step 2: Install and Load Required R Libraries

Before starting data analysis, we need to install and load the libraries that will help with data cleaning, visualization, and correlation analysis. We only need to install the packages once; after that, simply load them each time you run the notebook. The code to install and load the libraries is given below:

# Install required packages (only needed once, skip if already installed)
install.packages("tidyverse")  # Collection of packages for data manipulation and visualization
install.packages("skimr")      # Provides an overview of dataset structure and summaries
install.packages("corrplot")   # Helps in visualizing correlation matrices

# Load the libraries into the current session
library(tidyverse)   # Loads ggplot2, dplyr, readr, and other useful packages
library(skimr)       # Useful for summarizing datasets quickly
library(corrplot)    # Used to draw correlation plots

The above code installs and loads the required libraries. The output is:

Installing package into ‘/usr/local/lib/R/site-library’

(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’

(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’

(as ‘lib’ is unspecified)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──

✔ dplyr    1.1.4     ✔ readr    2.1.5

✔ forcats  1.0.0     ✔ stringr  1.5.1

✔ ggplot2  3.5.2     ✔ tibble   3.3.0

✔ lubridate 1.9.4     ✔ tidyr    1.3.1

✔ purrr    1.1.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──

✖ dplyr::filter() masks stats::filter()

✖ dplyr::lag()    masks stats::lag()

ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

corrplot 0.95 loaded

Also Read: R For Data Science: Why Should You Choose R for Data Science?

Step 3: Load the Dataset into Your R Environment

Now that the libraries are ready, the next step is to bring your data into the notebook. Upload the student-por.csv file to your Colab session (for example, through the Files pane on the left), then load it and preview the first few records to understand its structure. Here’s the code:

# Load the uploaded dataset using its filename
student_data <- read.csv("student-por.csv")

# Display the first six rows to get a quick look at the data
head(student_data)

This gives us an overview of the dataset we’re working with. The output for the above code is:

 

  school sex age address famsize Pstatus Medu Fedu Mjob     Fjob     ⋯ famrel freetime goout Dalc Walc health absences G1 G2 G3
1 GP     F   18  U       GT3     A       4    4    at_home  teacher  ⋯ 4      3        4     1    1    3      4        0  11 11
2 GP     F   17  U       GT3     T       1    1    at_home  other    ⋯ 5      3        3     1    1    3      2        9  11 11
3 GP     F   15  U       LE3     T       1    1    at_home  other    ⋯ 4      3        2     2    3    3      6        12 13 12
4 GP     F   15  U       GT3     T       4    2    health   services ⋯ 3      2        2     1    1    5      0        14 14 14
5 GP     F   16  U       GT3     T       3    3    other    other    ⋯ 4      3        2     1    2    5      0        11 13 13
6 GP     M   16  U       LE3     T       4    3    services other    ⋯ 5      4        2     1    2    5      6        12 12 13

(Character columns display as <chr> and integer columns as <int> in the notebook; the preview hides the middle columns of the full 33-column data frame.)

Step 4: Clean the Column Names for Better Usability

Some column names may contain dots (.), which can make referencing them in code a bit tricky. Replacing them with underscores (_) makes the column names easier to work with. Here’s the code:

# Replace all dots in column names with underscores for cleaner access
colnames(student_data) <- gsub("\\.", "_", colnames(student_data))

# Display the cleaned column names
colnames(student_data)

The above code cleans the dataset. The output for the above code is:

'school'  'sex'  'age'  'address'  'famsize'  'Pstatus'  'Medu'  'Fedu'  'Mjob'  'Fjob'
'reason'  'guardian'  'traveltime'  'studytime'  'failures'  'schoolsup'  'famsup'  'paid'
'activities'  'nursery'  'higher'  'internet'  'romantic'  'famrel'  'freetime'  'goout'
'Dalc'  'Walc'  'health'  'absences'  'G1'  'G2'  'G3'

Step 5: Explore the Structure and Quality of the Dataset

In this step, we'll examine the overall structure of the dataset, summarize its contents, and check for any missing values. This helps us understand what we're working with and identify any cleanup needed before deeper analysis. The code for this step is:

# View the structure of the dataset: shows data types and sample values
str(student_data)
# Get summary statistics for each column (min, max, mean, median, etc.)
summary(student_data)

# Load dplyr for data manipulation
library(dplyr)

# Check how many missing values exist in each column
student_data %>%
  summarise_all(~sum(is.na(.))) %>%                     # Count NAs in each column
  pivot_longer(cols = everything(),                    # Convert to long format
               names_to = "Column", 
               values_to = "Missing_Values") %>%
  filter(Missing_Values > 0)                            # Show only columns with missing data

The output for the above code is:

'data.frame': 649 obs. of  33 variables:
 $ school    : chr  "GP" "GP" "GP" "GP" ...
 $ sex       : chr  "F" "F" "F" "F" ...
 $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
 $ address   : chr  "U" "U" "U" "U" ...
 $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
 $ Pstatus   : chr  "A" "T" "T" "T" ...
 $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
 $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
 $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
 $ Fjob      : chr  "teacher" "other" "other" "services" ...
 $ reason    : chr  "course" "course" "other" "home" ...
 $ guardian  : chr  "mother" "father" "mother" "mother" ...
 $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
 $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
 $ failures  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ schoolsup : chr  "yes" "no" "yes" "no" ...
 $ famsup    : chr  "no" "yes" "no" "yes" ...
 $ paid      : chr  "no" "no" "no" "no" ...
 $ activities: chr  "no" "no" "no" "yes" ...
 $ nursery   : chr  "yes" "no" "yes" "yes" ...
 $ higher    : chr  "yes" "yes" "yes" "yes" ...
 $ internet  : chr  "no" "yes" "yes" "yes" ...
 $ romantic  : chr  "no" "no" "no" "yes" ...
 $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
 $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
 $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
 $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
 $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
 $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
 $ absences  : int  4 2 6 0 0 6 0 2 0 0 ...
 $ G1        : int  0 9 12 14 11 12 13 10 15 12 ...
 $ G2        : int  11 11 13 14 13 12 12 13 16 12 ...
 $ G3        : int  11 11 12 14 13 13 13 13 17 13 ...

    school              sex                 age           address         
 Length:649         Length:649         Min.   :15.00   Length:649        
 Class :character   Class :character   1st Qu.:16.00   Class :character  
 Mode  :character   Mode  :character   Median :17.00   Mode  :character  
                                       Mean   :16.74                     
                                       3rd Qu.:18.00                     
                                       Max.   :22.00                     

   famsize            Pstatus               Medu            Fedu      
 Length:649         Length:649         Min.   :0.000   Min.   :0.000  
 Class :character   Class :character   1st Qu.:2.000   1st Qu.:1.000  
 Mode  :character   Mode  :character   Median :2.000   Median :2.000  
                                       Mean   :2.515   Mean   :2.307  
                                       3rd Qu.:4.000   3rd Qu.:3.000  
                                       Max.   :4.000   Max.   :4.000  

     Mjob               Fjob              reason            guardian        
 Length:649         Length:649         Length:649         Length:649        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  

   traveltime      studytime        failures        schoolsup        
 Min.   :1.000   Min.   :1.000   Min.   :0.0000   Length:649        
 1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000   Class :character  
 Median :1.000   Median :2.000   Median :0.0000   Mode  :character  
 Mean   :1.569   Mean   :1.931   Mean   :0.2219                     
 3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000                     
 Max.   :4.000   Max.   :4.000   Max.   :3.0000                     

    famsup              paid            activities          nursery         
 Length:649         Length:649         Length:649         Length:649        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  

    higher            internet           romantic           famrel     
 Length:649         Length:649         Length:649         Min.   :1.000  
 Class :character   Class :character   Class :character   1st Qu.:4.000  
 Mode  :character   Mode  :character   Mode  :character   Median :4.000  
                                                          Mean   :3.931  
                                                          3rd Qu.:5.000  
                                                          Max.   :5.000  

    freetime        goout            Dalc            Walc          health     
 Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000  
 1st Qu.:3.00   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.00   1st Qu.:2.000  
 Median :3.00   Median :3.000   Median :1.000   Median :2.00   Median :4.000  
 Mean   :3.18   Mean   :3.185   Mean   :1.502   Mean   :2.28   Mean   :3.536  
 3rd Qu.:4.00   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.00   3rd Qu.:5.000  
 Max.   :5.00   Max.   :5.000   Max.   :5.000   Max.   :5.00   Max.   :5.000  

    absences            G1              G2              G3       
 Min.   : 0.000   Min.   : 0.0   Min.   : 0.00   Min.   : 0.00  
 1st Qu.: 0.000   1st Qu.:10.0   1st Qu.:10.00   1st Qu.:10.00  
 Median : 2.000   Median :11.0   Median :11.00   Median :12.00  
 Mean   : 3.659   Mean   :11.4   Mean   :11.57   Mean   :11.91  
 3rd Qu.: 6.000   3rd Qu.:13.0   3rd Qu.:13.00   3rd Qu.:14.00  
 Max.   :32.000   Max.   :19.0   Max.   :19.00   Max.   :19.00  

The missing-value check at the end returns an empty table (columns Column <chr> and Missing_Values <int> with no rows), which means no column contains missing data.

Here’s a Must Build R Project: Trend Analysis Project on COVID-19 using R

Step 6: Count Missing Values in Each Column

While the previous step filtered down to columns that actually contain missing values (and returned an empty table), here we do a quick scan that reports the count of missing values for every column, zero or not. Here’s the code:

# Count and display the number of missing values in each column
colSums(is.na(student_data))

The output for the above code shows the number of missing values in each column.

    school        sex        age    address    famsize    Pstatus       Medu       Fedu 
         0          0          0          0          0          0          0          0 
      Mjob       Fjob     reason   guardian traveltime  studytime   failures  schoolsup 
         0          0          0          0          0          0          0          0 
    famsup       paid activities    nursery     higher   internet   romantic     famrel 
         0          0          0          0          0          0          0          0 
  freetime      goout       Dalc       Walc     health   absences         G1         G2 
         0          0          0          0          0          0          0          0 
        G3 
         0 

Step 7: Load ggplot2 for Data Visualization

To create clear and informative plots, we'll use the ggplot2 package, which is part of the tidyverse collection and was therefore already attached in Step 2; loading it again here is harmless and simply makes the dependency explicit. This package will help us visualize trends and patterns in the student data. The code to load ggplot2 is:

# Load the ggplot2 package for creating data visualizations
library(ggplot2)

Build This R Project: Natural Disaster Prediction Analysis Project in R

Step 8: Visualize the Distribution of Final Grades

In this step, we'll create a histogram to see how students' final grades (G3) are distributed. This gives us a quick view of whether most students performed well, poorly, or fell in the middle. Here’s the code:

# Create a histogram to visualize how the final grades (G3) are distributed
ggplot(student_data, aes(x = G3)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +  # Set bar color and outline
  labs(title = "Distribution of Final Grades (G3)",                  # Add a chart title
       x = "Final Grade",                                            # Label for x-axis
       y = "Number of Students")                                     # Label for y-axis

The above code gives us an output of the graph of the distribution of grades in G3.

The above graph shows:

  • Most students scored between 10 and 14: The tallest bars are between these values, showing this is the most common grade range.
  • Very few students scored below 5 or above 17: The bars on the extreme ends are very short, indicating only a small number of students got very low or very high grades.
  • A small spike at 0: A noticeable number of students received a grade of 0, which might indicate dropouts or missing final assessments (a quick check follows this list).
  • The overall distribution is slightly left-skewed: The few very low scores pull the mean below the median, so slightly more students scored above the average than below it.
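To put a number on that spike, you can count the zero grades directly. This is a small optional check using dplyr, which is already loaded:

# Optional check: how many students received a final grade of 0?
student_data %>%
  filter(G3 == 0) %>%   # keep only rows where the final grade is zero
  nrow()                # count those rows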

Step 9: Compare Final Grades by Gender

This section visualizes how final grades differ between female and male students. Using a boxplot, we can observe whether one group tends to perform better academically. Here’s the code:

ggplot(student_data, aes(x = sex, y = G3, fill = sex)) +
  geom_boxplot() +
  labs(title = "Final Grades by Gender",
       x = "Gender",
       y = "Final Grade")

The above code gives us a boxplot of final grades by gender.

The above graph shows that:

  • Female students (F) have a slightly higher median final grade than male students (M).
  • Male students show more variability in their grades, with a longer lower whisker and more outliers, especially on the lower end.
  • Female students' grades are more concentrated, indicating less variation and a slightly better performance on average.
  • Both genders have some students who scored very low (0 or near 0), which may indicate dropouts or failures.

Step 10: Study Time vs Final Grades

This step explores whether students who dedicate more weekly time to studying tend to perform better in their final grades. Here’s the code:

ggplot(student_data, aes(x = factor(studytime), y = G3, fill = factor(studytime))) +
  geom_boxplot() +
  labs(title = "Study Time vs Final Grade",
       x = "Study Time (1 = <2 hrs, 4 = >10 hrs)",
       y = "Final Grade")

The above code gives us a boxplot of final grades for each study-time group.


The above plot shows that:

  • Median Grades Increase Slightly:
    As study time increases (from group 1 to 4), the median final grade also tends to rise. 
  • Lowest Performance in Group 1:
    Students who study <2 hours weekly generally score lower.
  • Highest Spread in Groups 3 and 4:
    More study time is associated with higher variability in performance; some students still perform poorly despite studying more.
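You can back up the boxplot with numbers by summarizing the grades per study-time group. A short optional sketch using dplyr:

# Median and mean final grade for each study-time group (1 = <2 hrs ... 4 = >10 hrs)
student_data %>%
  group_by(studytime) %>%
  summarise(median_G3 = median(G3),
            mean_G3   = mean(G3),
            students  = n())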

New to R? Here’s A Fun R Project: Car Data Analysis Project Using R

Step 11: Analyze the Impact of Internet Access on Final Grades

This step explores whether having internet access at home affects students' academic performance. We'll use a boxplot to compare the final grades (G3) of students who have internet access versus those who do not. Here’s the code:

ggplot(student_data, aes(x = internet, y = G3, fill = internet)) +
  geom_boxplot() +
  labs(title = "Internet Access vs Final Grade",
       x = "Internet Access (Yes/No)",
       y = "Final Grade")

The above code gives us the graph:

The above plot shows that:

  • Students with internet access (yes) generally have slightly higher median grades than those without internet (no).
  • The interquartile range (middle 50%) of grades is wider for students with internet, suggesting more variability in performance.
  • Students without internet have a lower upper range, indicating fewer top performers.
  • A few outliers (extremely low or high grades) appear in both groups, but the internet group includes more students with grades above 15.
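If you want to check whether the gap between the two groups is more than random noise, a two-sample t-test is a quick optional follow-up (treating the usual t-test assumptions as acceptable here):

# Compare mean final grades of students with and without home internet access
t.test(G3 ~ internet, data = student_data)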

Step 12: Install and Load Correlation Plot Library

To visualize relationships between numeric variables, we'll use the corrplot package, which creates an easy-to-read graphical representation of a correlation matrix. We already installed and loaded it in Step 2, so the install below is only needed if you're starting from a fresh session. Here’s the code:

# Install corrplot if not already installed
install.packages("corrplot")

# Load the library
library(corrplot)

Step 13: Filter Numeric Columns for Correlation Analysis

Before creating a correlation plot, we need to isolate only the numeric columns from the dataset. This ensures the correlation matrix is accurate and relevant. We'll use select_if(is.numeric) from dplyr for this. Here’s the code:

# Filter numeric columns
numeric_data <- student_data %>% select_if(is.numeric)

# View column names
colnames(numeric_data)

The output for the above code is:

'age'  'Medu'  'Fedu'  'traveltime'  'studytime'  'failures'  'famrel'  'freetime'
'goout'  'Dalc'  'Walc'  'health'  'absences'  'G1'  'G2'  'G3'
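A side note: select_if() still works, but newer versions of dplyr recommend the where() helper for the same job. An equivalent, optional form is:

# Equivalent way to keep only numeric columns using the newer where() helper
numeric_data <- student_data %>% select(where(is.numeric))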

Here’s a Fun R Project For You: Player Performance Analysis & Prediction Using R

Step 14: Calculate Correlation Between Numeric Features

Now that we've isolated the numeric columns, we calculate the correlation matrix to understand how these variables relate to each other. This matrix shows the strength and direction of linear relationships between pairs of numeric features. Here’s the code:

# Calculate correlation between numeric features
cor_matrix <- cor(numeric_data)

# View the correlation matrix rounded to 2 decimal places
round(cor_matrix, 2)

The output for the above step is:

 

             age  Medu  Fedu traveltime studytime failures famrel freetime goout  Dalc  Walc health absences    G1    G2    G3
age         1.00 -0.11 -0.12       0.03     -0.01     0.32  -0.02     0.00  0.11  0.13  0.09  -0.01     0.15 -0.17 -0.11 -0.11
Medu       -0.11  1.00  0.65      -0.27      0.10    -0.17   0.02    -0.02  0.01 -0.01 -0.02   0.00    -0.01  0.26  0.26  0.24
Fedu       -0.12  0.65  1.00      -0.21      0.05    -0.17   0.02     0.01  0.03  0.00  0.04   0.04     0.03  0.22  0.23  0.21
traveltime  0.03 -0.27 -0.21       1.00     -0.06     0.10  -0.01     0.00  0.06  0.09  0.06  -0.05    -0.01 -0.15 -0.15 -0.13
studytime  -0.01  0.10  0.05      -0.06      1.00    -0.15   0.00    -0.07 -0.08 -0.14 -0.21  -0.06    -0.12  0.26  0.24  0.25
failures    0.32 -0.17 -0.17       0.10     -0.15     1.00  -0.06     0.11  0.05  0.11  0.08   0.04     0.12 -0.38 -0.39 -0.39
famrel     -0.02  0.02  0.02      -0.01      0.00    -0.06   1.00     0.13  0.09 -0.08 -0.09   0.11    -0.09  0.05  0.09  0.06
freetime    0.00 -0.02  0.01       0.00     -0.07     0.11   0.13     1.00  0.35  0.11  0.12   0.08    -0.02 -0.09 -0.11 -0.12
goout       0.11  0.01  0.03       0.06     -0.08     0.05   0.09     0.35  1.00  0.25  0.39  -0.02     0.09 -0.07 -0.08 -0.09
Dalc        0.13 -0.01  0.00       0.09     -0.14     0.11  -0.08     0.11  0.25  1.00  0.62   0.06     0.17 -0.20 -0.19 -0.20
Walc        0.09 -0.02  0.04       0.06     -0.21     0.08  -0.09     0.12  0.39  0.62  1.00   0.11     0.16 -0.16 -0.16 -0.18
health     -0.01  0.00  0.04      -0.05     -0.06     0.04   0.11     0.08 -0.02  0.06  0.11   1.00    -0.03 -0.05 -0.08 -0.10
absences    0.15 -0.01  0.03      -0.01     -0.12     0.12  -0.09    -0.02  0.09  0.17  0.16  -0.03     1.00 -0.15 -0.12 -0.09
G1         -0.17  0.26  0.22      -0.15      0.26    -0.38   0.05    -0.09 -0.07 -0.20 -0.16  -0.05    -0.15  1.00  0.86  0.83
G2         -0.11  0.26  0.23      -0.15      0.24    -0.39   0.09    -0.11 -0.08 -0.19 -0.16  -0.08    -0.12  0.86  1.00  0.92
G3         -0.11  0.24  0.21      -0.13      0.25    -0.39   0.06    -0.12 -0.09 -0.20 -0.18  -0.10    -0.09  0.83  0.92  1.00
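Since the question of interest is what drives G3, it can also help to pull out just that column of the matrix and sort it. A small optional sketch:

# Correlation of every numeric feature with the final grade, strongest first
sort(cor_matrix[, "G3"], decreasing = TRUE)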

 

Step 15: Visualize Correlation Between Numeric Features

To better understand the strength and direction of relationships between numeric variables, we use a correlation heatmap. This colorful plot helps us quickly identify strong positive or negative correlations among the features. Here’s the code:

# Create a colorful correlation plot
corrplot(cor_matrix, method = "color", type = "upper", tl.cex = 0.8)

The above code produces a color-coded heatmap of the correlation matrix.
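If you prefer reading exact coefficients over judging colors, corrplot can print the numbers instead; this variant is optional:

# Same matrix, but with the correlation coefficients printed in each cell
corrplot(cor_matrix, method = "number", type = "upper", tl.cex = 0.8, number.cex = 0.6)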

Step 16: Build a Linear Regression Model to Predict Final Grades

In this step, we create a simple linear regression model to predict the final grade (G3) based on variables like first and second period grades (G1, G2), study time, number of failures, and absences. The model summary gives us coefficients, significance levels, and overall model performance. Here’s the code:

# Build a linear regression model
model <- lm(G3 ~ G1 + G2 + studytime + failures + absences, data = student_data)

# View model summary
summary(model)

The output for the above code is:

Call:
lm(formula = G3 ~ G1 + G2 + studytime + failures + absences, data = student_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.0716 -0.4624 -0.0796  0.6346  5.8068 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.15519    0.25863  -0.600  0.54868    
G1           0.13946    0.03623   3.849  0.00013 ***
G2           0.88571    0.03393  26.107  < 2e-16 ***
studytime    0.09670    0.06181   1.564  0.11820    
failures    -0.21829    0.09086  -2.402  0.01657 *  
absences     0.02337    0.01079   2.165  0.03077 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.254 on 643 degrees of freedom
Multiple R-squared:  0.8506,    Adjusted R-squared:  0.8494 
F-statistic:   732 on 5 and 643 DF,  p-value: < 2.2e-16

The above output means that:

  • G1 and G2 are strong predictors – Especially G2, which has the biggest positive impact on final grades (G3).
  • Failures reduce grades – More past failures lead to lower final grades.
  • Study time isn’t very impactful – In this model, study time doesn’t show a strong effect on G3.
  • Model fits well – With an R-squared of 85%, the model explains most of the variation in final grades.
  • Absences show a small positive effect – This might be due to other factors in the data, and could be explored further.
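Note that the R-squared above is measured on the same data the model was fitted on. As an optional extension, a simple train/test split gives a rough sense of how the model generalizes; the sketch below assumes an 80/20 split and uses only base R:

# Optional: hold out 20% of students to check how the model generalizes
set.seed(42)                                                  # make the split reproducible
train_idx  <- sample(nrow(student_data), size = floor(0.8 * nrow(student_data)))
train_data <- student_data[train_idx, ]
test_data  <- student_data[-train_idx, ]

holdout_model <- lm(G3 ~ G1 + G2 + studytime + failures + absences, data = train_data)
test_pred     <- predict(holdout_model, newdata = test_data)

# Root mean squared error on the held-out students (in grade points)
sqrt(mean((test_data$G3 - test_pred)^2))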

Also Read: 18 Types of Regression in Machine Learning You Should Know

Step 17: Predict Final Grades Using the Model

Now that we’ve trained the regression model, let’s use it to predict students’ final grades (G3). We’ll compare the predicted results with the actual values to see how closely the model performs. Here’s the code:

# Add predicted G3 values to the dataset
student_data$predicted_G3 <- predict(model, student_data)

# View first few actual vs predicted
head(student_data[, c("G3", "predicted_G3")])

The output gives us a table:

 

  G3 predicted_G3
1 11      9.87447
2 11     11.08285
3 12     13.36611
4 14     14.48723
5 13     13.08645
6 13     12.48040
The above output shows:

  • G3 is the actual final grade the student received.
  • predicted_G3 is the grade predicted by the linear regression model based on features like G1, G2, study time, failures, and absences.
  • For example, in row 1, the student scored 11, but the model predicted 9.87, which is slightly lower.
  • In row 3, the actual score is 12, but the model predicted 13.37, slightly overestimating.
  • Overall, predictions are fairly close to actual values, indicating the model performs reasonably well.
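Eyeballing a handful of rows is a start; to summarize the error across all students, you can compute the mean absolute error, i.e. how many grade points the prediction is off by on average. A quick optional sketch:

# Average absolute gap between actual and predicted final grades (in grade points)
mean(abs(student_data$G3 - student_data$predicted_G3))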

Step 18: Visualize Actual vs Predicted Final Grades

This step creates a scatter plot that compares the actual final grades (G3) with the predicted grades from the regression model. A red dashed line indicates perfect prediction, where actual equals predicted. Points closer to this line represent better predictions. Here’s the code:

ggplot(student_data, aes(x = G3, y = predicted_G3)) +
  geom_point(color = "blue", alpha = 0.6) +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  labs(title = "Actual vs Predicted Final Grades",
       x = "Actual G3",
       y = "Predicted G3")

The above code gives us a graph of the predicted vs actual grades.

The above output shows that:

  • The blue dots represent actual vs predicted final grades for each student.
  • The red dashed line represents the ideal scenario where the predicted grade equals the actual grade.
  • Most of the dots are close to the red line, showing that the model's predictions are relatively accurate.
  • Some outliers appear far from the line, indicating prediction errors that may need further investigation or model improvement.
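Another common follow-up is to look at the residuals (actual minus predicted grade); if the model behaves well, they should cluster around zero. A short optional sketch using ggplot2:

# Histogram of residuals: actual grade minus predicted grade
ggplot(student_data, aes(x = G3 - predicted_G3)) +
  geom_histogram(binwidth = 0.5, fill = "orange", color = "black") +
  labs(title = "Residuals of the Linear Regression Model",
       x = "Actual minus Predicted G3",
       y = "Number of Students")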

Conclusion

In this Student Performance Analysis project, we built a linear regression model in R using Google Colab to predict students’ final grades (G3) based on features like first and second period grades (G1 and G2), study time, failures, and absences.

After preprocessing the data and exploring key relationships, we trained the model and evaluated its fit using the R-squared value and an actual-versus-predicted plot.

The model achieved an R-squared of 0.85, indicating that it explains 85% of the variance in final grades. Overall, the model shows strong predictive accuracy, especially when prior performance is considered.

Unlock the power of data with our popular Data Science courses, designed to make you proficient in analytics, machine learning, and big data!

Elevate your career by learning essential Data Science skills such as statistical modeling, big data processing, predictive analytics, and SQL!

Stay informed and inspired with our popular Data Science articles, offering expert insights, trends, and practical tips for aspiring data professionals!

Colab Link:
https://colab.research.google.com/drive/1XcU-XxV2j76DnWrfDdn7TpuTEh3xq9wi#scrollTo=jqe2jLRsgpUk

Frequently Asked Questions (FAQs)

1. How does analyzing student data help in educational decision-making?

2. What kind of dataset is used in this student performance project?

3. How does a linear regression model work in the context of this analysis?

4. What are the steps involved in building this model in R?

5. What are similar beginner-friendly machine learning projects in R?

Rohit Sharma

823 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
