Wine Quality Prediction Project in R
By Rohit Sharma
Updated on Jul 24, 2025 | 11 min read | 1.36K+ views
Predicting wine quality based on physicochemical properties is a great way to apply regression analysis in R. This wine quality prediction project covers key processes like data preprocessing, feature engineering, model development, and evaluation using libraries such as caret, ggplot2, and randomForest.
The goal of this project is to build a model for quality control in the wine industry.
In this project, you'll learn how to preprocess data, visualize trends, build a model, and evaluate its performance, all using R in Google Colab.
Here are some things you should know before starting this wine quality prediction model:
To work on this project, you'll need the following tools and libraries.
Category | Tools & Libraries |
Programming Language | R |
IDE | RStudio or Google Colab |
Data Manipulation | dplyr, tidyverse |
Data Visualization | ggplot2, corrplot |
Machine Learning | caret, randomForest |
Model Evaluation | Built-in R functions, caret |
Feature Engineering | dplyr, custom R functions |
To predict wine quality from its chemical attributes, this project trains and evaluates a Random Forest regression model.
Let's walk through the steps involved in building this model with R in Google Colab, along with a look at the data used in this wine quality prediction project.
Use the WineQT.csv dataset, which contains physicochemical properties of wines (such as acidity, sugar content, and pH) along with the corresponding quality ratings. The copy used here comes from Kaggle.
Follow these steps to upload and read the WineQT.csv dataset in Google Colab:
Before moving ahead with data preprocessing and model development, make sure all necessary packages are installed. Run the following commands in your R environment (installation is required only once):
Use This Code:
# Install packages (only once)
install.packages("caret")
install.packages("ggplot2")
install.packages("randomForest")
install.packages("corrplot")
These packages support data visualization, regression modeling, and correlation analysis, which are essential for the wine quality prediction workflow.
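Since installation is needed only once, it can help to check which packages are already available before installing anything. The sketch below is one way to do that; the helper name `missing_packages` is an assumption made for this example and is not part of any package.

```r
# Return the subset of `pkgs` that cannot be loaded in this R session.
# `missing_packages` is a hypothetical helper written for this example.
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

required <- c("caret", "ggplot2", "randomForest", "corrplot")
to_install <- missing_packages(required)

if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install only what is missing
} else {
  message("All required packages are already installed.")
}
```

Passing only the missing names to install.packages() avoids reinstalling packages every time the notebook is rerun.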
Once the required packages are installed, load the dataset and perform a quick inspection to understand its structure.
Use This Code:
# Load the uploaded dataset
wine <- read.csv("WineQT.csv")
# View the first few rows of the dataset
head(wine)
# Check the dimensions (number of rows and columns)
dim(wine)
This step confirms that the dataset has loaded correctly and gives an overview of the available features and records. head() prints the first six rows with all thirteen columns:
Output
| fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality | Id |
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <int> | <int> |
1 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |
2 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 | 1 |
3 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 | 2 |
4 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 | 3 |
5 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 4 |
6 | 7.4 | 0.66 | 0.00 | 1.8 | 0.075 | 13 | 40 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 5 |
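A quick programmatic check can complement the visual inspection. The sketch below embeds the first three rows from the output above as text, so it runs without WineQT.csv; with the real file, the same checks would apply to the `wine` data frame. The name `expected_cols` is an assumption introduced for this example.

```r
# Three rows copied from the head() output, embedded as text so the
# example does not depend on the WineQT.csv file being present.
csv_text <- "fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality,Id
7.4,0.70,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5,0
7.8,0.88,0.00,2.6,0.098,25,67,0.9968,3.20,0.68,9.8,5,1
7.8,0.76,0.04,2.3,0.092,15,54,0.9970,3.26,0.65,9.8,5,2"

wine_sample <- read.csv(text = csv_text)

# Column names the rest of the workflow relies on (hypothetical helper vector).
expected_cols <- c(
  "fixed.acidity", "volatile.acidity", "citric.acid", "residual.sugar",
  "chlorides", "free.sulfur.dioxide", "total.sulfur.dioxide", "density",
  "pH", "sulphates", "alcohol", "quality", "Id"
)

# Report any expected column that is absent, and confirm all columns are numeric.
missing_cols <- setdiff(expected_cols, names(wine_sample))
all_numeric  <- all(vapply(wine_sample, is.numeric, logical(1)))

print(dim(wine_sample))
print(missing_cols)
print(all_numeric)
```

Checking names rather than positions makes later steps less fragile if the file is ever saved with columns in a different order.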
Proper data cleaning and preparation are critical for building an accurate predictive model. Follow these steps.
Use this code:
# See the structure of the dataset
str(wine)
# Get summary statistics
summary(wine)
# Check column names
colnames(wine)
You’ll get a data output like this:
Output
'data.frame': 1143 obs. of 13 variables:
$ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 6.7 ...
$ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.58 ...
$ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.08 ...
$ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 1.8 ...
$ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.097 ...
$ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 15 ...
$ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 65 ...
$ density : num 0.998 0.997 0.997 0.998 0.998 ...
$ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.28 ...
$ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.54 ...
$ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 9.2 ...
$ quality : int 5 5 5 6 5 5 5 7 7 5 ...
$ Id : int 0 1 2 3 4 5 6 7 8 10 ...
fixed.acidity volatile.acidity citric.acid residual.sugar
Min. : 4.600 Min. :0.1200 Min. :0.0000 Min. : 0.900
1st Qu.: 7.100 1st Qu.:0.3925 1st Qu.:0.0900 1st Qu.: 1.900
Median : 7.900 Median :0.5200 Median :0.2500 Median : 2.200
Mean : 8.311 Mean :0.5313 Mean :0.2684 Mean : 2.532
3rd Qu.: 9.100 3rd Qu.:0.6400 3rd Qu.:0.4200 3rd Qu.: 2.600
Max. :15.900 Max. :1.5800 Max. :1.0000 Max. :15.500
chlorides free.sulfur.dioxide total.sulfur.dioxide density
Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 21.00 1st Qu.:0.9956
Median :0.07900 Median :13.00 Median : 37.00 Median :0.9967
Mean :0.08693 Mean :15.62 Mean : 45.91 Mean :0.9967
3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 61.00 3rd Qu.:0.9978
Max. :0.61100 Max. :68.00 Max. :289.00 Max. :1.0037
pH sulphates alcohol quality
Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
1st Qu.:3.205 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
Median :3.310 Median :0.6200 Median :10.20 Median :6.000
Mean :3.311 Mean :0.6577 Mean :10.44 Mean :5.657
3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
Id
Min. : 0
1st Qu.: 411
Median : 794
Mean : 805
3rd Qu.:1210
Max. :1597
If missing values are present, handle them appropriately (e.g., remove or impute).
Use this code:
# Check for missing values in each column
colSums(is.na(wine))
You’ll get a data output like this:
Output
fixed.acidity 0
volatile.acidity 0
citric.acid 0
residual.sugar 0
chlorides 0
free.sulfur.dioxide 0
total.sulfur.dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
quality 0
Id 0
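The wine dataset has no missing values, so no action is needed here. For datasets that do, the two options mentioned above (remove or impute) might look like the following sketch. It uses a small made-up data frame, and `impute_median` is a hypothetical helper written for this example.

```r
# Toy data frame with deliberately missing values (not the wine data).
toy <- data.frame(
  alcohol = c(9.4, NA, 9.8, 10.0),
  pH      = c(3.51, 3.20, NA, 3.16),
  quality = c(5L, 5L, 5L, 6L)
)

# Option 1: drop every row that contains at least one NA.
toy_complete <- na.omit(toy)

# Option 2: replace NAs in each numeric column with that column's median.
impute_median <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]]) && anyNA(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}
toy_imputed <- impute_median(toy)

# Confirm that no missing values remain after imputation.
print(colSums(is.na(toy_imputed)))
```

Removing rows is simplest but discards data; median imputation keeps every row at the cost of slightly shrinking the variance of the imputed columns.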
Use ggplot2 to visualize how wine quality scores are distributed across the dataset. This helps assess class distribution and potential imbalance.
Use this code:
# Load ggplot2 for plotting
library(ggplot2)
# Bar chart of quality scores, with a title and axis labels for readability
ggplot(wine, aes(x = quality)) +
  geom_bar(fill = "tomato") +
  labs(title = "Distribution of Wine Quality", x = "Quality Score", y = "Count")
Running this code produces a bar chart showing how many wines received each quality score.
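The same distribution can be read off as numbers with base R's table() and prop.table(). The sketch below uses only the first ten quality values shown in the str() output earlier, so the counts are illustrative and do not describe the full dataset.

```r
# First ten quality values from the str() output (a tiny sample, for illustration).
quality_sample <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Count how many wines fall into each quality score.
counts <- table(quality_sample)

# Convert counts to proportions to judge class imbalance.
shares <- round(prop.table(counts), 2)

print(counts)
print(shares)
```

On the full data, replacing `quality_sample` with `wine$quality` would give the exact bar heights from the chart.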
Analyzing feature correlations helps identify which variables most influence wine quality. Use the corrplot package to visualize the relationships:
Use this code:
library(corrplot)
# Compute correlation matrix
cor_matrix <- cor(wine)
# Plot correlation heatmap
corrplot(cor_matrix, method = "color", type = "lower", tl.cex = 0.7)
This step provides insights into multicollinearity and highlights strong predictors of wine quality.
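A heatmap is easiest to read alongside a ranked list of correlations with the target. The sketch below shows one way to extract that list from a correlation matrix. It runs on synthetic data generated for this example (the column names are borrowed from the wine data, but the values and relationships are invented), so the printed numbers say nothing about real wines.

```r
set.seed(42)
n <- 200

# Synthetic stand-in for the wine data; values are simulated, not real.
toy <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18),
  pH               = rnorm(n, mean = 3.3, sd = 0.15)
)

# Quality is constructed to rise with alcohol and fall with volatile acidity.
toy$quality <- 5.6 + 0.5 * (toy$alcohol - 10.4) -
  1.5 * (toy$volatile.acidity - 0.53) + rnorm(n, sd = 0.5)

cor_matrix <- cor(toy)

# Keep the quality column, drop its self-correlation, and sort by strength.
quality_cor <- cor_matrix[, "quality"]
quality_cor <- quality_cor[names(quality_cor) != "quality"]
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]

print(round(ranked, 2))
```

Sorting by absolute value keeps strong negative predictors near the top instead of burying them at the bottom of the list.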
Before training, standardize the predictors with preProcess() from the caret package so that each has a mean of 0 and a standard deviation of 1. The target column, quality, is excluded from scaling and added back afterwards:
Use this code:
library(caret)
# Exclude the target variable (quality) during scaling
preproc <- preProcess(wine[, -12], method = c("center", "scale"))
# Apply preprocessing
wine_scaled <- predict(preproc, wine[, -12])
# Add the quality column back
wine_scaled$quality <- wine$quality
# Check results
head(wine_scaled)
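Every predictor is now on a comparable scale, while quality keeps its original values. Because only column 12 (quality) is excluded, the Id column is scaled along with the chemical properties.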
Output
| fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | Id | quality |
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <int> |
1 | -0.5213514 | 0.9389212 | -1.364429 | -0.46621734 | -0.2312936 | -0.45026992 | -0.3634510 | 0.55561117 | 1.2701390 | -0.57340683 | -0.9629603 | -1.734859 | 5 |
2 | -0.2924654 | 1.9409632 | -1.364429 | 0.05003827 | 0.2341441 | 0.91551896 | 0.6431950 | 0.03614877 | -0.7086174 | 0.13082384 | -0.5933413 | -1.732703 | 5 |
3 | -0.2924654 | 1.2729352 | -1.161059 | -0.17121413 | 0.1072065 | -0.06004452 | 0.2466375 | 0.14004125 | -0.3256323 | -0.04523383 | -0.5933413 | -1.730548 | 5 |
4 | 1.6530654 | -1.3991767 | 1.482750 | -0.46621734 | -0.2524499 | 0.13506817 | 0.4296640 | 0.65950365 | -0.9639408 | -0.45603505 | -0.5933413 | -1.728393 | 6 |
5 | -0.5213514 | 0.9389212 | -1.364429 | -0.46621734 | -0.2312936 | -0.45026992 | -0.3634510 | 0.55561117 | 1.2701390 | -0.57340683 | -0.9629603 | -1.726238 | 5 |
6 | -0.5213514 | 0.7162452 | -1.364429 | -0.53996814 | -0.2524499 | -0.25515722 | -0.1804245 | 0.55561117 | 1.2701390 | -0.57340683 | -0.9629603 | -1.724083 | 5 |
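To see what centering and scaling do to a single column, the transformation can be reproduced by hand in base R. The sketch below uses the first ten fixed.acidity values from the str() output. The results differ from the table above because preProcess() uses the mean and standard deviation of all 1143 rows, not just these ten.

```r
# First ten fixed.acidity values from the str() output.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 6.7)

# Center (subtract the mean), then scale (divide by the standard deviation).
z_manual <- (fixed_acidity - mean(fixed_acidity)) / sd(fixed_acidity)

# Base R's scale() applies the same transformation by default.
z_scale <- as.numeric(scale(fixed_acidity))

# The two approaches agree, and the result has mean 0 and sd 1.
print(all.equal(z_manual, z_scale))
print(round(c(mean = mean(z_manual), sd = sd(z_manual)), 6))
```

A stricter variant, not used in this article, would compute the mean and standard deviation on the training rows only and apply them to the test rows, so that no information from the test set influences preprocessing.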
After scaling the features, use the following steps to create training and test sets, then train the Random Forest regression model:
Use this code:
set.seed(123) # For reproducibility
# Create a partition: 80% training, 20% testing
splitIndex <- createDataPartition(wine_scaled$quality, p = 0.8, list = FALSE)
# Split the scaled data
train_data <- wine_scaled[splitIndex, ]
test_data <- wine_scaled[-splitIndex, ]
library(randomForest)
# Train the Random Forest model with 100 trees and feature importance enabled
rf_model <- randomForest(quality ~ ., data = train_data, ntree = 100, importance = TRUE)
# View model summary
print(rf_model)
Output
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:ggplot2’:
margin
Call:
randomForest(formula = quality ~ ., data = train_data, ntree = 100, importance = TRUE)
Type of random forest: regression
Number of trees: 100
No. of variables tried at each split: 4
Mean of squared residuals: 0.3594949
% Var explained: 43.3
Setting importance = TRUE stores feature importance scores for later inspection. Based on out-of-bag samples, the model has a mean squared residual of about 0.36 and explains 43.3% of the variance in quality.
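Note that the formula quality ~ . uses every other column as a predictor, including Id, which is a row identifier and not a chemical property. A possible variant, not the setup used for the results in this article, is to drop Id by name before splitting and training. The sketch below shows the column selection on a small made-up data frame; results from a model trained this way would differ from the figures reported here.

```r
# Toy frame mirroring part of the scaled data's layout (values are made up).
toy_scaled <- data.frame(
  alcohol = c(-0.96, -0.59, -0.59, 0.20),
  Id      = c(-1.73, -1.72, -1.71, -1.70),
  quality = c(5L, 5L, 5L, 6L)
)

# Drop the identifier by name, which is safer than relying on its position.
model_data <- toy_scaled[, setdiff(names(toy_scaled), "Id"), drop = FALSE]

print(names(model_data))
```

Selecting by name keeps the code correct even if the column order in the CSV changes.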
Once the Random Forest model is trained, evaluate its performance on the test set using RMSE and R² metrics:
Use this code:
# Make predictions on the test set
predictions <- predict(rf_model, newdata = test_data)
# Calculate RMSE (Root Mean Squared Error)
rmse <- sqrt(mean((predictions - test_data$quality)^2))
# Calculate R-squared as the squared correlation between predicted and actual values
r2 <- cor(predictions, test_data$quality)^2
# Display evaluation results
cat("RMSE:", rmse, "\n")
cat("R-squared:", r2, "\n")
These metrics help assess the model's predictive accuracy. A lower RMSE and a higher R² indicate better performance in estimating wine quality.
This gives the following output:
RMSE: 0.6176461
R-squared: 0.4683847
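To make the metrics concrete, here is a small worked example on made-up numbers (five hypothetical actual scores and predictions, not model output). It also computes two additions that are not part of the article's code: mean absolute error, and R-squared defined as 1 - SSE/SST. The squared correlation used above is never smaller than that second definition, and the two can differ when predictions are biased or badly scaled.

```r
# Hypothetical actual scores and predictions, for illustration only.
actual    <- c(5, 6, 5, 7, 6)
predicted <- c(5.2, 5.8, 5.4, 6.5, 6.1)

errors <- predicted - actual

# Root mean squared error: penalizes large misses more heavily.
rmse <- sqrt(mean(errors^2))

# Mean absolute error: the average size of a miss.
mae <- mean(abs(errors))

# R-squared as squared correlation (the definition used in the article).
r2_cor <- cor(predicted, actual)^2

# R-squared as 1 - SSE/SST (share of variance explained by the raw predictions).
r2_sse <- 1 - sum(errors^2) / sum((actual - mean(actual))^2)

cat("RMSE:", rmse, "\n")
cat("MAE:", mae, "\n")
cat("R-squared (squared correlation):", r2_cor, "\n")
cat("R-squared (1 - SSE/SST):", r2_sse, "\n")
```

Reporting both R-squared variants on the test set is a cheap way to spot predictions that track the target well but are systematically shifted.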
Understanding which features most significantly influence wine quality can provide valuable insights. Use the randomForest functions importance() and varImpPlot() to evaluate feature contributions.
Use this code:
# Display feature importance scores
importance(rf_model)
# Visualize feature importance
varImpPlot(rf_model)
This analysis highlights the most impactful chemical properties, helping to interpret the model and guide further refinement or domain-specific decisions.
varImpPlot() draws the feature importance plot for the trained model. It ranks each feature (chemical property) by how much it contributes to predicting wine quality. For a regression forest trained with importance = TRUE, the two panels show %IncMSE and IncNodePurity.
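For a regression forest, importance() returns a matrix with one row per feature and the columns %IncMSE and IncNodePurity, which can be sorted to get a ranked list. The sketch below demonstrates the sorting step on a placeholder matrix of the same shape. The feature names and numbers are invented for illustration and are not results from this model.

```r
# Placeholder matrix shaped like importance(rf_model) for a regression forest.
# Feature names and values are invented for illustration only.
imp <- matrix(
  c(12.1, 30.5, 8.4,    # %IncMSE column
    40.2, 95.7, 22.3),  # IncNodePurity column
  ncol = 2,
  dimnames = list(
    c("feature_a", "feature_b", "feature_c"),
    c("%IncMSE", "IncNodePurity")
  )
)

# Sort features by %IncMSE, most important first.
imp_sorted <- imp[order(imp[, "%IncMSE"], decreasing = TRUE), , drop = FALSE]

print(imp_sorted)
```

With the trained model, replacing `imp` with `importance(rf_model)` would produce the ranking behind the left panel of the plot.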
In this Wine Quality Prediction project, we developed a Random Forest regression model using R in Google Colab to predict wine quality based on physicochemical properties such as alcohol, acidity, and sulphates.
After cleaning and scaling the data, we visualized feature distributions and correlations, then trained the model on 80% of the data and tested it on the remaining 20%. The model's performance was evaluated using two key metrics: Root Mean Squared Error (RMSE) and R-squared (R²).
The model achieved an RMSE of 0.6176, meaning its predictions typically miss the actual quality score by about 0.62 points on the 3-to-8 scale seen in this dataset. The R-squared value of 0.4684 means the model explains roughly 46.8% of the variance in wine quality on the test set.
While not perfect, this level of accuracy is acceptable for a first model and shows that chemical properties like alcohol and volatile acidity significantly influence wine quality. With further feature tuning or model selection, performance can be improved for even more accurate predictions.
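When comparing candidate models, a cross-validated RMSE is more stable than a single train/test split. The sketch below is a minimal base R version of 5-fold cross-validation for a linear regression baseline. It runs on synthetic data generated for this example (the predictors `x1` and `x2` are hypothetical), so the resulting number is not comparable to the wine results.

```r
set.seed(123)
n <- 150

# Synthetic data: two hypothetical predictors and a noisy linear target.
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$quality <- 5.6 + 0.4 * toy$x1 - 0.3 * toy$x2 + rnorm(n, sd = 0.6)

# Assign each row to one of k folds at random.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other folds, predict the held-out fold, record RMSE.
fold_rmse <- vapply(seq_len(k), function(i) {
  train <- toy[fold != i, ]
  test  <- toy[fold == i, ]
  fit   <- lm(quality ~ ., data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((pred - test$quality)^2))
}, numeric(1))

cv_rmse <- mean(fold_rmse)
cat("Cross-validated RMSE:", cv_rmse, "\n")
```

The same loop could wrap randomForest() in place of lm(), which would allow a like-for-like comparison between a simple baseline and the forest on the real data.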
Reference:
https://colab.research.google.com/drive/1X3sSAFbp05irIjrNX8cnh0NzU_1wsdT7#scrollTo=YyHvbMpYv4yB