
Wine Quality Prediction Project in R

By Rohit Sharma

Updated on Jul 24, 2025 | 11 min read | 1.36K+ views


Predicting wine quality based on physicochemical properties is a great way to apply regression analysis in R. This wine quality prediction project covers key processes like data preprocessing, feature engineering, model development, and evaluation using libraries such as caret, ggplot2, and randomForest. 

The goal of this project is to build a model that predicts a wine’s quality score from its chemical measurements, the kind of model that can support quality control in the wine industry.

In this wine quality prediction project, you’ll learn how to preprocess data, visualize trends, build a model, and evaluate its performance, all using R in Google Colab.


What Should You Know Before Starting This Project?

Here is what you should know before starting this wine quality prediction model:

  • A basic understanding of R programming and the RStudio interface
  • Familiarity with data manipulation using packages like dplyr and the tidyverse
  • Knowledge of regression techniques and model evaluation methods
  • An understanding of feature engineering and data preprocessing techniques
  • Some experience working with machine learning libraries like caret or randomForest


Technologies and Libraries Used For This Project

You’ll need the following tools and libraries for this project.

Category | Tools & Libraries
Programming Language | R
IDE | RStudio
Data Manipulation | dplyr, tidyverse
Data Visualization | ggplot2
Machine Learning | caret, randomForest
Model Evaluation | Built-in R functions, caret
Feature Engineering | dplyr, custom R functions


What Models Will Be Utilized for Learning

The following regression models are well suited to predicting wine quality from its chemical attributes. The walkthrough below trains the Random Forest model; the others are natural candidates for comparison:

  • Linear Regression – establishes a baseline model for predicting wine quality
  • Random Forest Regression – handles non-linear relationships and can improve accuracy
  • Ridge and Lasso Regression – apply regularization; Lasso also performs feature selection
  • Decision Tree Regression – helps interpret how individual features affect wine quality predictions


Time Taken and Difficulty

  • Estimated Time: 6–8 hours, depending on familiarity with R and regression modeling 
  • Difficulty Level: Intermediate
    This project requires a good understanding of R programming, data preprocessing, and machine learning workflows. Past experience with regression models and feature engineering will make the implementation smoother.


How to Build a Wine Quality Prediction Project in R

Let’s walk through the steps involved in building this model with R in Google Colab, along with a look at the data used in this wine quality prediction project.

Step 1: Download the Dataset

Use the WineQT.csv dataset, which contains physicochemical properties of wines (such as acidity, sugar content, and pH) along with the corresponding quality ratings. Here, we’ve used the version available on Kaggle.

Step 2: Upload and Read the Dataset in Google Colab

Follow these steps to upload and read the WineQT.csv dataset in Google Colab:

  1. Upload the Dataset: 
    • Open your Colab notebook.
    • Use the file upload option

Step 3: Install Required Packages

Before moving ahead with data preprocessing and model development, make sure all necessary packages are installed. Run the following commands in your R environment (installation is required only once):

Use This Code:

# Install packages (only once)
install.packages("caret")
install.packages("ggplot2")
install.packages("randomForest")
install.packages("corrplot")

These packages support data visualization, regression modeling, and correlation analysis, which are essential for the wine quality prediction workflow.
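Since installation is only needed once, it can help to check which packages are already present before installing anything. The lines below are a minimal sketch of that idea; `missing_packages` is a hypothetical helper name introduced here for illustration, and the install call is left commented out because it needs network access.

```r
# Return the subset of `pkgs` that is not installed yet.
# `missing_packages` is a hypothetical helper, not part of any package.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

required <- c("caret", "ggplot2", "randomForest", "corrplot")
to_install <- missing_packages(required)
print(to_install)

# Install only what is missing (uncomment to run; needs network access)
# if (length(to_install) > 0) install.packages(to_install)
```

If `to_install` prints `character(0)`, all four packages are already available and the install step can be skipped.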


Step 4: Load and Inspect the Dataset

Once the required packages are installed, load the dataset and perform a quick inspection to understand its structure.

Use This Code:

# Load the uploaded dataset
wine <- read.csv("WineQT.csv")
# View the first few rows of the dataset
head(wine)
# Check the dimensions (number of rows and columns)
dim(wine)

 

This step confirms that the dataset has loaded correctly and gives an overview of the available features and records: head(wine) prints the first six rows, and dim(wine) returns the number of rows and columns.

Output

  fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
          <dbl>            <dbl>       <dbl>          <dbl>     <dbl>
1           7.4             0.70        0.00            1.9     0.076
2           7.8             0.88        0.00            2.6     0.098
3           7.8             0.76        0.04            2.3     0.092
4          11.2             0.28        0.56            1.9     0.075
5           7.4             0.70        0.00            1.9     0.076
6           7.4             0.66        0.00            1.8     0.075

  free.sulfur.dioxide total.sulfur.dioxide density    pH sulphates
                <dbl>                <dbl>   <dbl> <dbl>     <dbl>
1                  11                   34  0.9978  3.51      0.56
2                  25                   67  0.9968  3.20      0.68
3                  15                   54  0.9970  3.26      0.65
4                  17                   60  0.9980  3.16      0.58
5                  11                   34  0.9978  3.51      0.56
6                  13                   40  0.9978  3.51      0.56

  alcohol quality    Id
    <dbl>   <int> <int>
1     9.4       5     0
2     9.8       5     1
3     9.8       5     2
4     9.8       6     3
5     9.4       5     4
6     9.4       5     5

 

 

Step 5: Clean and Prepare the Data

Proper data cleaning and preparation are critical for building an accurate predictive model. Start by inspecting the dataset’s structure, summary statistics, and column names.

Use this code:

# See the structure of the dataset
str(wine)
# Get summary statistics
summary(wine)
# Check column names
colnames(wine)

You’ll get a data output like this:

Output

'data.frame': 1143 obs. of  13 variables:
 $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 6.7 ...
 $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.58 ...
 $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.08 ...
 $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 1.8 ...
 $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.097 ...
 $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 15 ...
 $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 65 ...
 $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
 $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.28 ...
 $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.54 ...
 $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 9.2 ...
 $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
 $ Id                  : int  0 1 2 3 4 5 6 7 8 10 ...

 fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
 Min.   : 4.600   Min.   :0.1200   Min.   :0.0000   Min.   : 0.900  
 1st Qu.: 7.100   1st Qu.:0.3925   1st Qu.:0.0900   1st Qu.: 1.900  
 Median : 7.900   Median :0.5200   Median :0.2500   Median : 2.200  
 Mean   : 8.311   Mean   :0.5313   Mean   :0.2684   Mean   : 2.532  
 3rd Qu.: 9.100   3rd Qu.:0.6400   3rd Qu.:0.4200   3rd Qu.: 2.600  
 Max.   :15.900   Max.   :1.5800   Max.   :1.0000   Max.   :15.500  
   chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
 Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
 1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 21.00       1st Qu.:0.9956  
 Median :0.07900   Median :13.00       Median : 37.00       Median :0.9967  
 Mean   :0.08693   Mean   :15.62       Mean   : 45.91       Mean   :0.9967  
 3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 61.00       3rd Qu.:0.9978  
 Max.   :0.61100   Max.   :68.00       Max.   :289.00       Max.   :1.0037  
       pH          sulphates         alcohol         quality     
 Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
 1st Qu.:3.205   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
 Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
 Mean   :3.311   Mean   :0.6577   Mean   :10.44   Mean   :5.657  
 3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
 Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000  
       Id      
 Min.   :   0  
 1st Qu.: 411  
 Median : 794  
 Mean   : 805  
 3rd Qu.:1210  
 Max.   :1597  

  • 'fixed.acidity'
  • 'volatile.acidity'
  • 'citric.acid'
  • 'residual.sugar'
  • 'chlorides'
  • 'free.sulfur.dioxide'
  • 'total.sulfur.dioxide'
  • 'density'
  • 'pH'
  • 'sulphates'
  • 'alcohol'
  • 'quality'
  • 'Id'

Next, check each column for missing values. If any are present, handle them appropriately (e.g., remove the affected rows or impute the values).

Use this code:

# Check for missing values in each column
colSums(is.na(wine))

You’ll get a data output like this:

Output

fixed.acidity 0
volatile.acidity 0
citric.acid 0
residual.sugar 0
chlorides 0
free.sulfur.dioxide 0
total.sulfur.dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
quality 0
Id 0
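Every column reports zero missing values here, so no cleanup is needed for this dataset. For datasets where that is not the case, the two options mentioned above (removing or imputing) could look roughly like the sketch below. The `toy` data frame and the `impute_median` helper are made up for illustration, with NA values inserted on purpose; median imputation is just one common choice, not something the original workflow prescribes.

```r
# Toy data frame with NAs introduced on purpose (illustrative values only)
toy <- data.frame(
  alcohol = c(9.4, NA, 9.8, 9.8, 9.4),
  pH      = c(3.51, 3.20, NA, 3.16, 3.51),
  quality = c(5L, 5L, 5L, 6L, 5L)
)

# Option 1: drop every row that has at least one missing value
toy_complete <- na.omit(toy)

# Option 2: replace missing values in numeric columns with the column median
# (`impute_median` is a hypothetical helper written for this sketch)
impute_median <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]]) && anyNA(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}
toy_imputed <- impute_median(toy)

# Confirm that no missing values remain
colSums(is.na(toy_imputed))
```

Dropping rows is simplest but shrinks the dataset; imputation keeps every row at the cost of introducing estimated values.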


Step 6: Visualize Wine Quality Distribution

Use ggplot2 to visualize how wine quality scores are distributed across the dataset. This helps assess class distribution and potential imbalance.

Use this code:

# Plot the distribution of quality scores, with a title and axis labels for readability
library(ggplot2)
ggplot(wine, aes(x = quality)) +
  geom_bar(fill = "tomato") +
  labs(title = "Distribution of Wine Quality", x = "Quality Score", y = "Count")

Running this code produces a bar chart of the wine quality distribution.
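The same distribution can also be read as numbers, which makes any imbalance easier to quantify. As a small sketch, the lines below tabulate only the first ten quality scores shown in the str(wine) output from Step 5; on the full dataset the same calls would be applied to wine$quality.

```r
# First ten quality scores, copied from the str(wine) output in Step 5
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

counts <- table(quality)       # number of wines per quality score
shares <- prop.table(counts)   # the same counts expressed as proportions

print(counts)
print(round(shares, 2))
```

In this ten-row sample, seven wines have a score of 5, which already hints at how strongly the middle scores dominate.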

Step 7: Explore Correlation Between Features

Analyzing feature correlations helps identify which variables are most strongly associated with wine quality. Use the corrplot package to visualize the relationships:

Use this code:

library(corrplot)
# Compute correlation matrix
cor_matrix <- cor(wine)
# Plot correlation heatmap
corrplot(cor_matrix, method = "color", type = "lower", tl.cex = 0.7)

This step provides insights into multicollinearity and highlights strong predictors of wine quality.
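To read the strongest predictors off the matrix rather than the heatmap, the quality column can be extracted and sorted. The sketch below uses only the first ten rows of three predictors, copied from the str(wine) output in Step 5, so the resulting numbers are illustrative and not the correlations of the full dataset. On the full data the same lines would apply to cor(wine); leaving out the Id column first (it is a row identifier, not a chemical property) would be an assumption on my part, since the original code keeps it.

```r
# First ten rows of three predictors and the target,
# copied from the str(wine) output in Step 5
sample_wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.58),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.54),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 9.2),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

cor_matrix <- cor(sample_wine)

# Correlation of every predictor with quality, strongest first
quality_cor <- cor_matrix[, "quality"]
quality_cor <- quality_cor[names(quality_cor) != "quality"]
quality_cor <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(round(quality_cor, 2))
```

Sorting by absolute value keeps strong negative correlations near the top alongside strong positive ones.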


Step 8: Scale the Features

Before splitting the data, center and scale the predictor columns using preProcess() from the caret package. The target column (quality, column 12) is left out of the transformation and added back afterwards:

Use this code:

library(caret)
# Exclude the target variable (quality) during scaling
preproc <- preProcess(wine[, -12], method = c("center", "scale"))
# Apply preprocessing
wine_scaled <- predict(preproc, wine[, -12])
# Add the quality column back
wine_scaled$quality <- wine$quality
# Check results
head(wine_scaled)

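Centering and scaling give every predictor a mean of 0 and a standard deviation of 1, so all features are on a comparable scale. Because only column 12 (quality) is excluded, the Id column is scaled along with the chemical properties, as the output shows.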

Output

  fixed.acidity volatile.acidity citric.acid residual.sugar  chlorides
          <dbl>            <dbl>       <dbl>          <dbl>      <dbl>
1    -0.5213514        0.9389212   -1.364429    -0.46621734 -0.2312936
2    -0.2924654        1.9409632   -1.364429     0.05003827  0.2341441
3    -0.2924654        1.2729352   -1.161059    -0.17121413  0.1072065
4     1.6530654       -1.3991767    1.482750    -0.46621734 -0.2524499
5    -0.5213514        0.9389212   -1.364429    -0.46621734 -0.2312936
6    -0.5213514        0.7162452   -1.364429    -0.53996814 -0.2524499

  free.sulfur.dioxide total.sulfur.dioxide    density         pH   sulphates
                <dbl>                <dbl>      <dbl>      <dbl>       <dbl>
1         -0.45026992           -0.3634510 0.55561117  1.2701390 -0.57340683
2          0.91551896            0.6431950 0.03614877 -0.7086174  0.13082384
3         -0.06004452            0.2466375 0.14004125 -0.3256323 -0.04523383
4          0.13506817            0.4296640 0.65950365 -0.9639408 -0.45603505
5         -0.45026992           -0.3634510 0.55561117  1.2701390 -0.57340683
6         -0.25515722           -0.1804245 0.55561117  1.2701390 -0.57340683

     alcohol        Id quality
       <dbl>     <dbl>   <int>
1 -0.9629603 -1.734859       5
2 -0.5933413 -1.732703       5
3 -0.5933413 -1.730548       5
4 -0.5933413 -1.728393       6
5 -0.9629603 -1.726238       5
6 -0.9629603 -1.724083       5
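The code in this step learns the centering and scaling values from all rows, including those that later end up in the test set. A common alternative is to learn them from the training rows only and then apply the same values to the test rows. The sketch below shows that idea with base R's scale() on simulated data; the data frame, column choice, and simple random split are all assumptions made for illustration, not the article's dataset or method.

```r
set.seed(123)

# Simulated stand-in for two predictors and the target (not the real dataset)
n <- 100
df <- data.frame(
  alcohol = rnorm(n, mean = 10.4, sd = 1),
  pH      = rnorm(n, mean = 3.3, sd = 0.15),
  quality = sample(3:8, n, replace = TRUE)
)

# Simple random 80/20 split (not stratified)
train_idx  <- sample(seq_len(n), size = round(0.8 * n))
predictors <- c("alcohol", "pH")

# Learn centering and scaling values from the training rows only
train_scaled <- scale(df[train_idx, predictors])
centers <- attr(train_scaled, "scaled:center")
scales  <- attr(train_scaled, "scaled:scale")

# Apply the same values to the held-out rows
test_scaled <- scale(df[-train_idx, predictors], center = centers, scale = scales)

# Training columns have mean 0 and standard deviation 1 by construction
print(round(colMeans(train_scaled), 10))
print(round(apply(train_scaled, 2, sd), 10))
```

With caret, the equivalent would be to call preProcess() on the training rows and then predict() on both the training and test sets.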

 

Step 9: Split Scaled Data and Train Random Forest Model

After scaling the features, use the following steps to create training and test sets, then train the Random Forest regression model:

Use this code:

set.seed(123)  # For reproducibility
# Create a partition: 80% training, 20% testing
splitIndex <- createDataPartition(wine_scaled$quality, p = 0.8, list = FALSE)
# Split the scaled data
train_data <- wine_scaled[splitIndex, ]
test_data  <- wine_scaled[-splitIndex, ]
library(randomForest)
# Train the Random Forest model with 100 trees and feature importance enabled
rf_model <- randomForest(quality ~ ., data = train_data, ntree = 100, importance = TRUE)
# View model summary
print(rf_model)

Output

randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:ggplot2’:

    margin

Call:
 randomForest(formula = quality ~ ., data = train_data, ntree = 100, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 100
No. of variables tried at each split: 4

          Mean of squared residuals: 0.3594949
                    % Var explained: 43.3

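The printed summary reports out-of-bag (OOB) estimates computed on the training data: a mean of squared residuals of about 0.36 and roughly 43% of variance explained. Setting importance = TRUE stores the feature importance scores used in Step 11.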
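The two numbers in the model summary can be converted into more familiar quantities with a little arithmetic. The sketch below copies them from the printed output above; the relation "% Var explained = 1 - MSE / Var(quality)" follows the pseudo R-squared described in the randomForest documentation, so the implied variance should be read as approximate.

```r
# Values copied from the printed model summary above
oob_mse       <- 0.3594949
var_explained <- 43.3 / 100

# RMSE is the square root of the mean squared residual
oob_rmse <- sqrt(oob_mse)

# "% Var explained" is 1 - MSE / Var(quality), so the implied variance of
# quality in the training data is roughly MSE / (1 - share explained)
implied_variance <- oob_mse / (1 - var_explained)

cat("OOB RMSE:", round(oob_rmse, 3), "\n")
cat("Implied variance of quality:", round(implied_variance, 3), "\n")
```

An OOB RMSE of about 0.60 is a useful sanity check: the error measured on the held-out test set should land in the same neighborhood.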

Step 10: Make Predictions and Evaluate Model

Once the Random Forest model is trained, evaluate its performance on the test set using RMSE and R² metrics:

Use this code:

# Make predictions on the test set

predictions <- predict(rf_model, newdata = test_data)
# Calculate RMSE (Root Mean Squared Error)
rmse <- sqrt(mean((predictions - test_data$quality)^2))
# Calculate R-squared as the squared correlation between predictions and actual values
r2 <- cor(predictions, test_data$quality)^2
# Display evaluation results
cat("RMSE:", rmse, "\n")
cat("R-squared:", r2, "\n")

These metrics help assess the model's predictive accuracy. A lower RMSE and a higher R² indicate better performance in estimating wine quality.

This produces the following output:

RMSE: 0.6176461
R-squared: 0.4683847
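To make the metric definitions concrete, the sketch below works through them on six made-up values (not taken from the dataset). It also computes R-squared as 1 - SSE/SST, a second common definition that I am adding for comparison; unlike the squared correlation used in this step, it drops when predictions are systematically too high or too low.

```r
# Small made-up vectors, used only to show the arithmetic
actual    <- c(5, 6, 5, 7, 6, 5)
predicted <- c(5.2, 5.8, 5.4, 6.1, 6.0, 5.3)

# Root mean squared error and mean absolute error
rmse <- sqrt(mean((predicted - actual)^2))
mae  <- mean(abs(predicted - actual))

# R-squared as squared correlation (the formula used in this step)
r2_cor <- cor(predicted, actual)^2

# R-squared as 1 - SSE/SST, which also penalizes biased predictions
sse    <- sum((actual - predicted)^2)
sst    <- sum((actual - mean(actual))^2)
r2_sst <- 1 - sse / sst

cat("RMSE:", rmse, "\n")
cat("MAE:", mae, "\n")
cat("R-squared (squared correlation):", r2_cor, "\n")
cat("R-squared (1 - SSE/SST):", r2_sst, "\n")
```

RMSE is larger than MAE here because squaring gives the single 0.9-point miss more weight than the smaller errors.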


Step 11: Analyze Feature Importance

Understanding which features most significantly influence wine quality can provide valuable insights. Use the randomForest functions importance() and varImpPlot() to evaluate feature contributions.

Use this code:

# Display feature importance scores
importance(rf_model)
# Visualize feature importance
varImpPlot(rf_model)

This analysis highlights the most impactful chemical properties, helping to interpret the model and guide further refinement or domain-specific decisions.

For a regression forest trained with importance = TRUE, importance() reports two scores per feature: %IncMSE (how much prediction error grows when that feature’s values are permuted) and IncNodePurity (the total reduction in node impurity from splits on that feature). varImpPlot() plots both scores, with the most important features at the top.
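The permutation idea behind the %IncMSE score can be illustrated without the randomForest package. The sketch below is a simplified stand-in, not the package's implementation: it uses simulated data, a linear model instead of a forest, and a held-out test set instead of out-of-bag samples. All names in it (sim, x1 to x3, perm_importance) are hypothetical.

```r
set.seed(123)

# Simulated data (not the wine dataset): y depends strongly on x1,
# weakly on x2, and not at all on x3
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 2 * sim$x1 + 0.5 * sim$x2 + rnorm(n)

train <- sim[1:200, ]
test  <- sim[201:300, ]

# A linear model stands in for the random forest to keep this base-R only
fit <- lm(y ~ x1 + x2 + x3, data = train)
mse <- function(model, data) mean((data$y - predict(model, newdata = data))^2)
baseline <- mse(fit, test)

# Shuffle one predictor at a time and record how much the test MSE grows
perm_importance <- sapply(c("x1", "x2", "x3"), function(col) {
  shuffled <- test
  shuffled[[col]] <- sample(shuffled[[col]])
  mse(fit, shuffled) - baseline
})

print(round(sort(perm_importance, decreasing = TRUE), 3))
```

Shuffling a feature the model relies on breaks its link to the target and inflates the error, while shuffling an irrelevant feature changes little, which is why x1 should rank well above x3.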

Conclusion

In this Wine Quality Prediction project, we developed a Random Forest regression model using R in Google Colab to predict wine quality based on physicochemical properties such as alcohol, acidity, and sulphates.

After inspecting and scaling the data, we visualized the quality distribution and feature correlations, then trained the model on 80% of the data and tested it on the remaining 20%. The model's performance was evaluated using two key metrics: Root Mean Squared Error (RMSE) and R-squared (R²).

The model achieved an RMSE of 0.6176, meaning its typical prediction error is about 0.62 points on the quality scale. The R-squared value of 0.4684 means the model explains roughly 46.8% of the variance in wine quality.

While not perfect, this level of accuracy is acceptable for a first model and suggests that chemical properties like alcohol and volatile acidity carry real signal about wine quality. Hyperparameter tuning, feature engineering, or comparing other models may improve performance further.


Reference:
https://colab.research.google.com/drive/1X3sSAFbp05irIjrNX8cnh0NzU_1wsdT7#scrollTo=YyHvbMpYv4yB

