Student Performance Analysis In R With Code and Explanation

By Rohit Sharma

Updated on Aug 05, 2025 | 20 min read | 1.2K+ views


This Student Performance Analysis in R project will focus on key factors that influence students’ final grades using a dataset of Portuguese secondary school students. We'll use Google Colab to run the project. 

The project includes data cleaning, visual exploration, correlation analysis, and a simple linear regression model to predict student performance based on features like prior grades, study time, and absences. 

Shape tomorrow with upGrad’s Data Science programs. Build practical skills in AI, Machine Learning, and Data Analytics for the next generation of tech leaders. Enrol now and fast-track your career.

Build Your Data Science Ambitions: Top 25+ R Projects for Beginners to Boost Your Data Science Skills in 2025

How Long This Project Takes and What Skills You Need to Do It

This Student Performance Analysis in R project is beginner-friendly and easy to complete in one sitting. The skills and timeline of this project are given in the table below:

Aspect               Details
Estimated Duration   1.5 to 2 hours
Difficulty Level     Easy to Moderate
Skill Level Needed   Beginner in R and basic data analysis
Tools Required       Google Colab, R, ggplot2, corrplot, dplyr
Project Type         Exploratory Data Analysis, Regression

Take charge of your future with upGrad’s Data Science and AI programs. Learn from industry experts, master cutting-edge tools, and build a career that stands out in the AI-driven world. Enrol today and get ahead.

What Should You Know Before Starting the Student Performance Analysis Project?

To get the most out of this Student Performance Analysis in R project, it's helpful to have a basic understanding of a few core concepts. While the steps are beginner-friendly, being familiar with the following will ensure a smoother learning experience:

  • A basic understanding of R programming syntax and how to execute code in Google Colab
  • Familiarity with data frames and how to view and manipulate tabular data
  • The fundamentals of data visualization using ggplot2
  • Basic statistical concepts such as the mean, correlation, and linear regression
  • The logic behind predictive modeling, especially how independent variables affect a target variable (a short warm-up sketch follows this list)
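If correlation and linear regression are new to you, here is a minimal warm-up sketch run on R's built-in mtcars dataset (not the student data used in this project) that shows both ideas in a few lines:

# Warm-up on R's built-in mtcars dataset (not the student data used in this project)
cor(mtcars$disp, mtcars$mpg)                 # strength and direction of a linear relationship

toy_model <- lm(mpg ~ disp, data = mtcars)   # simple linear regression: predict mpg from engine size
summary(toy_model)                           # coefficients, p-values, and R-squared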

The R Tools and Libraries Powering This Project

This project is entirely built in R using Google Colab, which allows you to run R code without installing anything on your local machine. We'll use a few essential R libraries to clean data, explore patterns, visualize relationships, and build a regression model.

Category               Name / Package            Purpose
Platform               Google Colab (R kernel)   Run and share R code in the cloud
Programming Language   R                         Perform data manipulation, analysis, and modeling
Data Wrangling         dplyr, tidyverse          Filter, select, transform, and manage data
Visualization          ggplot2, corrplot         Create plots, graphs, and correlation matrices
Data Summary           skimr (optional)          Get quick overviews of datasets
Modeling               Base R (lm)               Build and evaluate linear regression models

A Detailed Walkthrough of the Student Performance Analysis Project

This section will break down the entire project into individual steps to help you understand the concepts of data analysis and modeling used in this project. 

Step 1: Configure Google Colab to Run R Code

Google Colab runs Python by default, so we first need to switch the environment to R. This allows you to write and execute R code directly in the notebook.

To set it up:

  • Open Google Colab and start a new notebook
  • Go to the Runtime menu at the top
  • Click on Change runtime type
  • In the pop-up window, select R from the Language dropdown
  • Click Save to apply the changes

Step 2: Install and Load Required R Libraries

Before starting data analysis, we need to install and load the libraries that will help with data cleaning, visualization, and correlation analysis. We only need to install the packages once; after that, simply load them each time you run the notebook. The code to install and load the libraries is given below:

# Install required packages (only needed once, skip if already installed)
install.packages("tidyverse")  # Collection of packages for data manipulation and visualization
install.packages("skimr")      # Provides an overview of dataset structure and summaries
install.packages("corrplot")   # Helps in visualizing correlation matrices

# Load the libraries into the current session
library(tidyverse)   # Loads ggplot2, dplyr, readr, and other useful packages
library(skimr)       # Useful for summarizing datasets quickly
library(corrplot)    # Used to draw correlation plots

The above code installs and loads the required libraries. The output is:

Installing package into ‘/usr/local/lib/R/site-library’

(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’

(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’

(as ‘lib’ is unspecified)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──

✔ dplyr    1.1.4     ✔ readr    2.1.5

✔ forcats  1.0.0     ✔ stringr  1.5.1

✔ ggplot2  3.5.2     ✔ tibble   3.3.0

✔ lubridate 1.9.4     ✔ tidyr    1.3.1

✔ purrr    1.1.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──

✖ dplyr::filter() masks stats::filter()

✖ dplyr::lag()    masks stats::lag()

ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

corrplot 0.95 loaded

Also Read: R For Data Science: Why Should You Choose R for Data Science?

Step 3: Load the Dataset into Your R Environment

Now that the libraries are ready, the next step is to bring your data into the notebook. Upload the student-por.csv file to your Colab session (for example, through the Files pane on the left), then load it and preview the first few records to understand its structure. Here’s the code:

# Load the uploaded dataset using its filename
student_data <- read.csv("student-por.csv")

# Display the first six rows to get a quick look at the data
head(student_data)

This gives us an overview of the dataset we’re working with. The output for the above code is:

 

  school sex age address famsize Pstatus Medu Fedu Mjob     Fjob     ⋯ famrel freetime goout Dalc Walc health absences G1 G2 G3
1 GP     F   18  U       GT3     A       4    4    at_home  teacher  ⋯ 4      3        4     1    1    3      4        0  11 11
2 GP     F   17  U       GT3     T       1    1    at_home  other    ⋯ 5      3        3     1    1    3      2        9  11 11
3 GP     F   15  U       LE3     T       1    1    at_home  other    ⋯ 4      3        2     2    3    3      6        12 13 12
4 GP     F   15  U       GT3     T       4    2    health   services ⋯ 3      2        2     1    1    5      0        14 14 14
5 GP     F   16  U       GT3     T       3    3    other    other    ⋯ 4      3        2     1    2    5      0        11 13 13
6 GP     M   16  U       LE3     T       4    3    services other    ⋯ 5      4        2     1    2    5      6        12 12 13

(Character columns display as <chr> and integer columns as <int> in the notebook; the preview hides the middle columns of the full 33-column data frame.)

Step 4: Clean the Column Names for Better Usability

Some column names may contain dots (.), which can make referencing them in code a bit tricky. Replacing them with underscores (_) makes the column names easier to work with. Here’s the code:

# Replace all dots in column names with underscores for cleaner access
colnames(student_data) <- gsub("\\.", "_", colnames(student_data))

# Display the cleaned column names
colnames(student_data)

The above code cleans the dataset. The output for the above code is:

'school'  'sex'  'age'  'address'  'famsize'  'Pstatus'  'Medu'  'Fedu'  'Mjob'  'Fjob'
'reason'  'guardian'  'traveltime'  'studytime'  'failures'  'schoolsup'  'famsup'  'paid'
'activities'  'nursery'  'higher'  'internet'  'romantic'  'famrel'  'freetime'  'goout'
'Dalc'  'Walc'  'health'  'absences'  'G1'  'G2'  'G3'

Step 5: Explore the Structure and Quality of the Dataset

In this step, we'll examine the overall structure of the dataset, summarize its contents, and check for any missing values. This helps us understand what we're working with and identify any cleanup needed before deeper analysis. The code for this step is:

# View the structure of the dataset: shows data types and sample values
str(student_data)
# Get summary statistics for each column (min, max, mean, median, etc.)
summary(student_data)

# Load dplyr for data manipulation
library(dplyr)

# Check how many missing values exist in each column
student_data %>%
  summarise_all(~sum(is.na(.))) %>%                     # Count NAs in each column
  pivot_longer(cols = everything(),                    # Convert to long format
               names_to = "Column", 
               values_to = "Missing_Values") %>%
  filter(Missing_Values > 0)                            # Show only columns with missing data

The output for the above code is:

'data.frame': 649 obs. of  33 variables:
 $ school    : chr  "GP" "GP" "GP" "GP" ...
 $ sex       : chr  "F" "F" "F" "F" ...
 $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
 $ address   : chr  "U" "U" "U" "U" ...
 $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
 $ Pstatus   : chr  "A" "T" "T" "T" ...
 $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
 $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
 $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
 $ Fjob      : chr  "teacher" "other" "other" "services" ...
 $ reason    : chr  "course" "course" "other" "home" ...
 $ guardian  : chr  "mother" "father" "mother" "mother" ...
 $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
 $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
 $ failures  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ schoolsup : chr  "yes" "no" "yes" "no" ...
 $ famsup    : chr  "no" "yes" "no" "yes" ...
 $ paid      : chr  "no" "no" "no" "no" ...
 $ activities: chr  "no" "no" "no" "yes" ...
 $ nursery   : chr  "yes" "no" "yes" "yes" ...
 $ higher    : chr  "yes" "yes" "yes" "yes" ...
 $ internet  : chr  "no" "yes" "yes" "yes" ...
 $ romantic  : chr  "no" "no" "no" "yes" ...
 $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
 $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
 $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
 $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
 $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
 $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
 $ absences  : int  4 2 6 0 0 6 0 2 0 0 ...
 $ G1        : int  0 9 12 14 11 12 13 10 15 12 ...
 $ G2        : int  11 11 13 14 13 12 12 13 16 12 ...
 $ G3        : int  11 11 12 14 13 13 13 13 17 13 ...

    school              sex                 age           address         
 Length:649         Length:649         Min.   :15.00   Length:649        
 Class :character   Class :character   1st Qu.:16.00   Class :character  
 Mode  :character   Mode  :character   Median :17.00   Mode  :character  
                                       Mean   :16.74                     
                                       3rd Qu.:18.00                     
                                       Max.   :22.00                     

   famsize            Pstatus               Medu            Fedu      
 Length:649         Length:649         Min.   :0.000   Min.   :0.000  
 Class :character   Class :character   1st Qu.:2.000   1st Qu.:1.000  
 Mode  :character   Mode  :character   Median :2.000   Median :2.000  
                                       Mean   :2.515   Mean   :2.307  
                                       3rd Qu.:4.000   3rd Qu.:3.000  
                                       Max.   :4.000   Max.   :4.000  

     Mjob               Fjob              reason            guardian        
 Length:649         Length:649         Length:649         Length:649        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  

   traveltime      studytime        failures        schoolsup        
 Min.   :1.000   Min.   :1.000   Min.   :0.0000   Length:649        
 1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000   Class :character  
 Median :1.000   Median :2.000   Median :0.0000   Mode  :character  
 Mean   :1.569   Mean   :1.931   Mean   :0.2219                     
 3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000                     
 Max.   :4.000   Max.   :4.000   Max.   :3.0000                     

    famsup              paid            activities          nursery         
 Length:649         Length:649         Length:649         Length:649        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  

    higher            internet           romantic           famrel     
 Length:649         Length:649         Length:649         Min.   :1.000  
 Class :character   Class :character   Class :character   1st Qu.:4.000  
 Mode  :character   Mode  :character   Mode  :character   Median :4.000  
                                                          Mean   :3.931  
                                                          3rd Qu.:5.000  
                                                          Max.   :5.000  

    freetime        goout            Dalc            Walc          health     
 Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000  
 1st Qu.:3.00   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.00   1st Qu.:2.000  
 Median :3.00   Median :3.000   Median :1.000   Median :2.00   Median :4.000  
 Mean   :3.18   Mean   :3.185   Mean   :1.502   Mean   :2.28   Mean   :3.536  
 3rd Qu.:4.00   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.00   3rd Qu.:5.000  
 Max.   :5.00   Max.   :5.000   Max.   :5.000   Max.   :5.00   Max.   :5.000  

    absences            G1              G2              G3       
 Min.   : 0.000   Min.   : 0.0   Min.   : 0.00   Min.   : 0.00  
 1st Qu.: 0.000   1st Qu.:10.0   1st Qu.:10.00   1st Qu.:10.00  
 Median : 2.000   Median :11.0   Median :11.00   Median :12.00  
 Mean   : 3.659   Mean   :11.4   Mean   :11.57   Mean   :11.91  
 3rd Qu.: 6.000   3rd Qu.:13.0   3rd Qu.:13.00   3rd Qu.:14.00  
 Max.   :32.000   Max.   :19.0   Max.   :19.00   Max.   :19.00  

The missing-value check at the end returns an empty table (columns Column <chr> and Missing_Values <int> with no rows), which means no column contains missing data.

Here’s a Must Build R Project: Trend Analysis Project on COVID-19 using R

Step 6: Count Missing Values in Each Column

While the previous step filtered down to columns that actually contain missing values (and returned an empty table), here we do a quick scan that reports the count of missing values for every column, zero or not. Here’s the code:

# Count and display the number of missing values in each column
colSums(is.na(student_data))

The output for the above code shows the number of missing values in each column.

    school        sex        age    address    famsize    Pstatus       Medu       Fedu 
         0          0          0          0          0          0          0          0 
      Mjob       Fjob     reason   guardian traveltime  studytime   failures  schoolsup 
         0          0          0          0          0          0          0          0 
    famsup       paid activities    nursery     higher   internet   romantic     famrel 
         0          0          0          0          0          0          0          0 
  freetime      goout       Dalc       Walc     health   absences         G1         G2 
         0          0          0          0          0          0          0          0 
        G3 
         0 

Step 7: Load ggplot2 for Data Visualization

To create clear and informative plots, we'll use the ggplot2 package, which is part of the tidyverse collection and was therefore already attached in Step 2; loading it again here is harmless and simply makes the dependency explicit. This package will help us visualize trends and patterns in the student data. The code to load ggplot2 is:

# Load the ggplot2 package for creating data visualizations
library(ggplot2)

Build This R Project: Natural Disaster Prediction Analysis Project in R

Step 8: Visualize the Distribution of Final Grades

In this step, we'll create a histogram to see how students' final grades (G3) are distributed. This gives us a quick view of whether most students performed well, poorly, or fell in the middle. Here’s the code:

# Create a histogram to visualize how the final grades (G3) are distributed
ggplot(student_data, aes(x = G3)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +  # Set bar color and outline
  labs(title = "Distribution of Final Grades (G3)",                  # Add a chart title
       x = "Final Grade",                                            # Label for x-axis
       y = "Number of Students")                                     # Label for y-axis

The above code gives us an output of the graph of the distribution of grades in G3.

The above graph shows:

  • Most students scored between 10 and 14: The tallest bars are between these values, showing this is the most common grade range.
  • Very few students scored below 5 or above 17: The bars on the extreme ends are very short, indicating only a small number of students got very low or very high grades.
  • A small spike at 0: A noticeable number of students received a grade of 0, which might indicate dropouts or missing final assessments (a quick check follows this list).
  • The overall distribution is slightly left-skewed: The few very low scores pull the mean below the median, so slightly more students scored above the average than below it.
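To put a number on that spike, you can count the zero grades directly. This is a small optional check using dplyr, which is already loaded:

# Optional check: how many students received a final grade of 0?
student_data %>%
  filter(G3 == 0) %>%   # keep only rows where the final grade is zero
  nrow()                # count those rows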

Step 9: Compare Final Grades by Gender

This section visualizes how final grades differ between female and male students. Using a boxplot, we can observe whether one group tends to perform better academically. Here’s the code:

ggplot(student_data, aes(x = sex, y = G3, fill = sex)) +
  geom_boxplot() +
  labs(title = "Final Grades by Gender",
       x = "Gender",
       y = "Final Grade")

The above code gives us a boxplot of final grades by gender.

The above graph shows that:

  • Female students (F) have a slightly higher median final grade than male students (M).
  • Male students show more variability in their grades, with a longer lower whisker and more outliers, especially on the lower end.
  • Female students' grades are more concentrated, indicating less variation and a slightly better performance on average.
  • Both genders have some students who scored very low (0 or near 0), which may indicate dropouts or failures.

Step 10: Study Time vs Final Grades

This step explores whether students who dedicate more weekly time to studying tend to perform better in their final grades. Here’s the code:

ggplot(student_data, aes(x = factor(studytime), y = G3, fill = factor(studytime))) +
  geom_boxplot() +
  labs(title = "Study Time vs Final Grade",
       x = "Study Time (1 = <2 hrs, 4 = >10 hrs)",
       y = "Final Grade")

The above code gives us a boxplot of final grades for each study-time group.


The above plot shows that:

  • Median Grades Increase Slightly:
    As study time increases (from group 1 to 4), the median final grade also tends to rise. 
  • Lowest Performance in Group 1:
    Students who study <2 hours weekly generally score lower.
  • Highest Spread in Groups 3 and 4:
    More study time is associated with higher variability in performance; some students still perform poorly despite studying more.
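You can back up the boxplot with numbers by summarizing the grades per study-time group. A short optional sketch using dplyr:

# Median and mean final grade for each study-time group (1 = <2 hrs ... 4 = >10 hrs)
student_data %>%
  group_by(studytime) %>%
  summarise(median_G3 = median(G3),
            mean_G3   = mean(G3),
            students  = n())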

New to R? Here’s A Fun R Project: Car Data Analysis Project Using R

Step 11: Analyze the Impact of Internet Access on Final Grades

This step explores whether having internet access at home affects students' academic performance. We'll use a boxplot to compare the final grades (G3) of students who have internet access versus those who do not. Here’s the code:

ggplot(student_data, aes(x = internet, y = G3, fill = internet)) +
  geom_boxplot() +
  labs(title = "Internet Access vs Final Grade",
       x = "Internet Access (Yes/No)",
       y = "Final Grade")

The above code gives us the graph:

The above plot shows that:

  • Students with internet access (yes) generally have slightly higher median grades than those without internet (no).
  • The interquartile range (middle 50%) of grades is wider for students with internet, suggesting more variability in performance.
  • Students without internet have a lower upper range, indicating fewer top performers.
  • A few outliers (extremely low or high grades) appear in both groups, but the internet group includes more students with grades above 15.
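If you want to check whether the gap between the two groups is more than random noise, a two-sample t-test is a quick optional follow-up (treating the usual t-test assumptions as acceptable here):

# Compare mean final grades of students with and without home internet access
t.test(G3 ~ internet, data = student_data)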

Step 12: Install and Load Correlation Plot Library

To visualize relationships between numeric variables, we'll use the corrplot package, which creates an easy-to-read graphical representation of a correlation matrix. We already installed and loaded it in Step 2, so the install below is only needed if you're starting from a fresh session. Here’s the code:

# Install corrplot if not already installed
install.packages("corrplot")

# Load the library
library(corrplot)

Step 13: Filter Numeric Columns for Correlation Analysis

Before creating a correlation plot, we need to isolate only the numeric columns from the dataset. This ensures the correlation matrix is accurate and relevant. We'll use select_if(is.numeric) from dplyr for this. Here’s the code:

# Filter numeric columns
numeric_data <- student_data %>% select_if(is.numeric)

# View column names
colnames(numeric_data)

The output for the above code is:

'age'  'Medu'  'Fedu'  'traveltime'  'studytime'  'failures'  'famrel'  'freetime'
'goout'  'Dalc'  'Walc'  'health'  'absences'  'G1'  'G2'  'G3'
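A side note: select_if() still works, but newer versions of dplyr recommend the where() helper for the same job. An equivalent, optional form is:

# Equivalent way to keep only numeric columns using the newer where() helper
numeric_data <- student_data %>% select(where(is.numeric))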

Here’s a Fun R Project For You: Player Performance Analysis & Prediction Using R

Step 14: Calculate Correlation Between Numeric Features

Now that we've isolated the numeric columns, we calculate the correlation matrix to understand how these variables relate to each other. This matrix shows the strength and direction of linear relationships between pairs of numeric features. Here’s the code:

# Calculate correlation between numeric features
cor_matrix <- cor(numeric_data)

# View the correlation matrix rounded to 2 decimal places
round(cor_matrix, 2)

The output for the above step is:

 

             age  Medu  Fedu traveltime studytime failures famrel freetime goout  Dalc  Walc health absences    G1    G2    G3
age         1.00 -0.11 -0.12       0.03     -0.01     0.32  -0.02     0.00  0.11  0.13  0.09  -0.01     0.15 -0.17 -0.11 -0.11
Medu       -0.11  1.00  0.65      -0.27      0.10    -0.17   0.02    -0.02  0.01 -0.01 -0.02   0.00    -0.01  0.26  0.26  0.24
Fedu       -0.12  0.65  1.00      -0.21      0.05    -0.17   0.02     0.01  0.03  0.00  0.04   0.04     0.03  0.22  0.23  0.21
traveltime  0.03 -0.27 -0.21       1.00     -0.06     0.10  -0.01     0.00  0.06  0.09  0.06  -0.05    -0.01 -0.15 -0.15 -0.13
studytime  -0.01  0.10  0.05      -0.06      1.00    -0.15   0.00    -0.07 -0.08 -0.14 -0.21  -0.06    -0.12  0.26  0.24  0.25
failures    0.32 -0.17 -0.17       0.10     -0.15     1.00  -0.06     0.11  0.05  0.11  0.08   0.04     0.12 -0.38 -0.39 -0.39
famrel     -0.02  0.02  0.02      -0.01      0.00    -0.06   1.00     0.13  0.09 -0.08 -0.09   0.11    -0.09  0.05  0.09  0.06
freetime    0.00 -0.02  0.01       0.00     -0.07     0.11   0.13     1.00  0.35  0.11  0.12   0.08    -0.02 -0.09 -0.11 -0.12
goout       0.11  0.01  0.03       0.06     -0.08     0.05   0.09     0.35  1.00  0.25  0.39  -0.02     0.09 -0.07 -0.08 -0.09
Dalc        0.13 -0.01  0.00       0.09     -0.14     0.11  -0.08     0.11  0.25  1.00  0.62   0.06     0.17 -0.20 -0.19 -0.20
Walc        0.09 -0.02  0.04       0.06     -0.21     0.08  -0.09     0.12  0.39  0.62  1.00   0.11     0.16 -0.16 -0.16 -0.18
health     -0.01  0.00  0.04      -0.05     -0.06     0.04   0.11     0.08 -0.02  0.06  0.11   1.00    -0.03 -0.05 -0.08 -0.10
absences    0.15 -0.01  0.03      -0.01     -0.12     0.12  -0.09    -0.02  0.09  0.17  0.16  -0.03     1.00 -0.15 -0.12 -0.09
G1         -0.17  0.26  0.22      -0.15      0.26    -0.38   0.05    -0.09 -0.07 -0.20 -0.16  -0.05    -0.15  1.00  0.86  0.83
G2         -0.11  0.26  0.23      -0.15      0.24    -0.39   0.09    -0.11 -0.08 -0.19 -0.16  -0.08    -0.12  0.86  1.00  0.92
G3         -0.11  0.24  0.21      -0.13      0.25    -0.39   0.06    -0.12 -0.09 -0.20 -0.18  -0.10    -0.09  0.83  0.92  1.00
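Since the question of interest is what drives G3, it can also help to pull out just that column of the matrix and sort it. A small optional sketch:

# Correlation of every numeric feature with the final grade, strongest first
sort(cor_matrix[, "G3"], decreasing = TRUE)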

 

Step 15: Visualize Correlation Between Numeric Features

To better understand the strength and direction of relationships between numeric variables, we use a correlation heatmap. This colorful plot helps us quickly identify strong positive or negative correlations among the features. Here’s the code:

# Create a colorful correlation plot
corrplot(cor_matrix, method = "color", type = "upper", tl.cex = 0.8)

The above code produces a color-coded heatmap of the correlation matrix.
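If you prefer reading exact coefficients over judging colors, corrplot can print the numbers instead; this variant is optional:

# Same matrix, but with the correlation coefficients printed in each cell
corrplot(cor_matrix, method = "number", type = "upper", tl.cex = 0.8, number.cex = 0.6)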

Step 16: Build a Linear Regression Model to Predict Final Grades

In this step, we create a simple linear regression model to predict the final grade (G3) based on variables like first and second period grades (G1, G2), study time, number of failures, and absences. The model summary gives us coefficients, significance levels, and overall model performance. Here’s the code:

# Build a linear regression model
model <- lm(G3 ~ G1 + G2 + studytime + failures + absences, data = student_data)

# View model summary
summary(model)

The output for the above code is:

Call:
lm(formula = G3 ~ G1 + G2 + studytime + failures + absences, data = student_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.0716 -0.4624 -0.0796  0.6346  5.8068 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.15519    0.25863  -0.600  0.54868    
G1           0.13946    0.03623   3.849  0.00013 ***
G2           0.88571    0.03393  26.107  < 2e-16 ***
studytime    0.09670    0.06181   1.564  0.11820    
failures    -0.21829    0.09086  -2.402  0.01657 *  
absences     0.02337    0.01079   2.165  0.03077 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.254 on 643 degrees of freedom
Multiple R-squared:  0.8506,    Adjusted R-squared:  0.8494 
F-statistic:   732 on 5 and 643 DF,  p-value: < 2.2e-16

The above output means that:

  • G1 and G2 are strong predictors – Especially G2, which has the biggest positive impact on final grades (G3).
  • Failures reduce grades – More past failures lead to lower final grades.
  • Study time isn’t very impactful – In this model, study time doesn’t show a strong effect on G3.
  • Model fits well – With an R-squared of 85%, the model explains most of the variation in final grades.
  • Absences show a small positive effect – This might be due to other factors in the data, and could be explored further.
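Note that the R-squared above is measured on the same data the model was fitted on. As an optional extension, a simple train/test split gives a rough sense of how the model generalizes; the sketch below assumes an 80/20 split and uses only base R:

# Optional: hold out 20% of students to check how the model generalizes
set.seed(42)                                                  # make the split reproducible
train_idx  <- sample(nrow(student_data), size = floor(0.8 * nrow(student_data)))
train_data <- student_data[train_idx, ]
test_data  <- student_data[-train_idx, ]

holdout_model <- lm(G3 ~ G1 + G2 + studytime + failures + absences, data = train_data)
test_pred     <- predict(holdout_model, newdata = test_data)

# Root mean squared error on the held-out students (in grade points)
sqrt(mean((test_data$G3 - test_pred)^2))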

Also Read: 18 Types of Regression in Machine Learning You Should Know

Step 17: Predict Final Grades Using the Model

Now that we’ve trained the regression model, let’s use it to predict students’ final grades (G3). We’ll compare the predicted results with the actual values to see how closely the model performs. Here’s the code:

# Add predicted G3 values to the dataset
student_data$predicted_G3 <- predict(model, student_data)

# View first few actual vs predicted
head(student_data[, c("G3", "predicted_G3")])

The output gives us a table:

 

  G3 predicted_G3
1 11      9.87447
2 11     11.08285
3 12     13.36611
4 14     14.48723
5 13     13.08645
6 13     12.48040
The above output shows:

  • G3 is the actual final grade the student received.
  • predicted_G3 is the grade predicted by the linear regression model based on features like G1, G2, study time, failures, and absences.
  • For example, in row 1, the student scored 11, but the model predicted 9.87, which is slightly lower.
  • In row 3, the actual score is 12, but the model predicted 13.37, slightly overestimating.
  • Overall, predictions are fairly close to actual values, indicating the model performs reasonably well.
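Eyeballing a handful of rows is a start; to summarize the error across all students, you can compute the mean absolute error, i.e. how many grade points the prediction is off by on average. A quick optional sketch:

# Average absolute gap between actual and predicted final grades (in grade points)
mean(abs(student_data$G3 - student_data$predicted_G3))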

Step 18: Visualize Actual vs Predicted Final Grades

This step creates a scatter plot that compares the actual final grades (G3) with the predicted grades from the regression model. A red dashed line indicates perfect prediction, where actual equals predicted. Points closer to this line represent better predictions. Here’s the code:

ggplot(student_data, aes(x = G3, y = predicted_G3)) +
  geom_point(color = "blue", alpha = 0.6) +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  labs(title = "Actual vs Predicted Final Grades",
       x = "Actual G3",
       y = "Predicted G3")

The above code gives us a graph of the predicted vs actual grades.

The above output shows that:

  • The blue dots represent actual vs predicted final grades for each student.
  • The red dashed line represents the ideal scenario where the predicted grade equals the actual grade.
  • Most of the dots are close to the red line, showing that the model's predictions are relatively accurate.
  • Some outliers appear far from the line, indicating prediction errors that may need further investigation or model improvement.
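Another common follow-up is to look at the residuals (actual minus predicted grade); if the model behaves well, they should cluster around zero. A short optional sketch using ggplot2:

# Histogram of residuals: actual grade minus predicted grade
ggplot(student_data, aes(x = G3 - predicted_G3)) +
  geom_histogram(binwidth = 0.5, fill = "orange", color = "black") +
  labs(title = "Residuals of the Linear Regression Model",
       x = "Actual minus Predicted G3",
       y = "Number of Students")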

Conclusion

In this Student Performance Analysis project, we built a linear regression model in R using Google Colab to predict students’ final grades (G3) based on features like first and second period grades (G1 and G2), study time, failures, and absences.

After preprocessing the data and exploring key relationships, we trained the model and evaluated its fit using the R-squared value and an actual-versus-predicted plot.

The model achieved an R-squared of 0.85, indicating that it explains 85% of the variance in final grades. Overall, the model shows strong predictive accuracy, especially when prior performance is considered.

Unlock the power of data with our popular Data Science courses, designed to make you proficient in analytics, machine learning, and big data!

Elevate your career by learning essential Data Science skills such as statistical modeling, big data processing, predictive analytics, and SQL!

Stay informed and inspired with our popular Data Science articles, offering expert insights, trends, and practical tips for aspiring data professionals!

Colab Link:
https://colab.research.google.com/drive/1XcU-XxV2j76DnWrfDdn7TpuTEh3xq9wi#scrollTo=jqe2jLRsgpUk

Frequently Asked Questions (FAQs)

1. How does analyzing student data help in educational decision-making?

2. What kind of dataset is used in this student performance project?

3. How does a linear regression model work in the context of this analysis?

4. What are the steps involved in building this model in R?

5. What are similar beginner-friendly machine learning projects in R?

Rohit Sharma

823 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
