Wine Quality Prediction Model

By Rohit Sharma

Updated on Aug 01, 2025 | 10 min read


Wine quality prediction is one of the most popular beginner datasets for machine learning. The goal here is to predict the quality of wines from their physicochemical properties. The quality measurement is a numeric score, but since we will simplify it into low- and high-quality classes, the task becomes a binary classification problem.

The dataset used here is WineQT.csv, downloadable from Kaggle. The dataset contains attributes such as acidity, sugar content, pH, alcohol content, and sulphates.

In this project, we will build and evaluate classification models that predict red wine quality as accurately as possible from the said physicochemical features. Some of these features are fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, and free sulfur dioxide.

Explore more project ideas like this in our Top 25+ Essential Data Science Projects GitHub to Explore in 2025 blog.

What Should You Know Beforehand?

It is better to have at least some background in Python programming and basic machine learning concepts before starting.


Technologies and Libraries Used

For this project, the following tools and libraries will be used:

  • Google Colab: Online environment for writing and running Python code seamlessly
  • Python: Core programming language for building the model
  • Pandas & NumPy: For reading, processing, and analyzing structured data
  • Matplotlib & Seaborn: To visualize distributions, correlations, and feature patterns
  • Scikit-learn: For training classification models and evaluating their performance

Models That Will Be Used

The following are the classification models that will be applied and compared for the wine quality prediction:

  • Logistic Regression: A statistical model that estimates the probability of a wine being high-quality based on its chemical properties (see the short sketch after this list). It is simple, fast, and works well when the classes are roughly linearly separable. 
  • Decision Tree Classifier: Splits the wine features (such as acidity or alcohol) into branches to classify quality, much like a flowchart of yes/no decisions. Easy to interpret, but prone to overfitting on noisy data.
  • Random Forest Classifier: Grows many decision trees and then combines their predictions to get more accurate results. Good at capturing complex patterns in wine features while being less prone to overfitting.
  • K-Nearest Neighbors (KNN): Classifies a wine based on the majority quality of its closest feature neighbors. This works well when the features are properly scaled, but it is slower with large datasets.
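To make the first bullet concrete: logistic regression passes a weighted sum of the features through the sigmoid function to turn it into a probability. Here is a minimal sketch; the weights and feature values below are made up for illustration, not learned from the dataset:

import numpy as np

def sigmoid(z):
    # Squash any real number into the (0, 1) range
    return 1 / (1 + np.exp(-z))

# Hypothetical weighted sum of two scaled features plus a bias (w·x + b)
z = 0.9 * 1.2 + (-0.7) * (-0.5) + 0.1
print(sigmoid(z))  # ~0.82, i.e., the wine would be predicted as high quality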

Time Taken and Difficulty

You can complete this wine quality prediction project in about 1.5 to 2 hours. It’s a beginner-level machine learning project that helps you apply core concepts such as exploratory data analysis, data preprocessing, model training, and evaluation.

How to Build the Wine Quality Prediction Model

Let’s start building the project from scratch. Here’s the plan:

  1. Loading and exploring the dataset
  2. Preprocessing the features
  3. Visualizing relationships between physicochemical properties
  4. Training and evaluating classification models

Without any further delay, let’s start!

Step 1: Download the Dataset

To build the wine quality prediction model, we will use the dataset available on Kaggle. It consists of a single CSV file with 13 columns.

Follow the steps mentioned below to download the dataset:

  1. Open a new tab in any web browser. 
  2. Go to https://www.kaggle.com/datasets/yasserh/wine-quality-dataset.
  3. On the Wine Quality Dataset page, in the right pane, under the Data Explorer section, click WineQT.csv.
  4. Click the download icon.

Step 2: Upload and Load the Dataset in Google Colab

Now that the .csv file has been downloaded, let’s upload it to the Colab environment. Use the following code to open a file picker and upload the dataset.

# Upload the CSV file to Google Colab
from google.colab import files
uploaded = files.upload()

Once uploaded, load the file into a Pandas DataFrame. Here’s the code to do so:

# Import pandas
import pandas as pd

# Read the uploaded file into a DataFrame
df = pd.read_csv('WineQT.csv')

# Display the first few rows
df.head()

Output:

 

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076
1            7.8              0.88         0.00             2.6      0.098
2            7.8              0.76         0.04             2.3      0.092
3           11.2              0.28         0.56             1.9      0.075
4            7.4              0.70         0.00             1.9      0.076

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56
1                 25.0                  67.0   0.9968  3.20       0.68
2                 15.0                  54.0   0.9970  3.26       0.65
3                 17.0                  60.0   0.9980  3.16       0.58
4                 11.0                  34.0   0.9978  3.51       0.56

   alcohol  quality  Id
0      9.4        5   0
1      9.8        5   1
2      9.8        5   2
3      9.8        6   3
4      9.4        5   4
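Alternatively, if you keep your datasets in Google Drive, you can mount the drive instead of re-uploading the file in every session. A minimal sketch, assuming the CSV sits at the top level of MyDrive (adjust the path to match your folder structure):

import pandas as pd
from google.colab import drive

# Mount Google Drive into the Colab filesystem
drive.mount('/content/drive')

# Hypothetical path - adjust to wherever WineQT.csv lives in your Drive
df = pd.read_csv('/content/drive/MyDrive/WineQT.csv')
df.head()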

 

 

Step 3: Explore the Dataset

In this step, we will explore the dataset to understand its structure and contents. Use the code below:

# Step 3: Basic Data Exploration

# 1. Dataset dimensions
print("Dataset Shape:", df.shape)

# 2. First 5 rows
print("\nSample Data:")
print(df.head())

# 3. Info: data types and non-null counts
print("\nData Info:")
print(df.info())

# 4. Check for missing values
print("\nMissing Values in Each Column:")
print(df.isnull().sum())

# 5. Summary statistics
print("\nStatistical Summary:")
print(df.describe())

Output:

Dataset Shape: (1143, 13)

Sample Data:

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076
1            7.8              0.88         0.00             2.6      0.098
2            7.8              0.76         0.04             2.3      0.092
3           11.2              0.28         0.56             1.9      0.075
4            7.4              0.70         0.00             1.9      0.076

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56
1                 25.0                  67.0   0.9968  3.20       0.68
2                 15.0                  54.0   0.9970  3.26       0.65
3                 17.0                  60.0   0.9980  3.16       0.58
4                 11.0                  34.0   0.9978  3.51       0.56

   alcohol  quality  Id
0      9.4        5   0
1      9.8        5   1
2      9.8        5   2
3      9.8        6   3
4      9.4        5   4

Data Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1143 entries, 0 to 1142
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1143 non-null   float64
 1   volatile acidity      1143 non-null   float64
 2   citric acid           1143 non-null   float64
 3   residual sugar        1143 non-null   float64
 4   chlorides             1143 non-null   float64
 5   free sulfur dioxide   1143 non-null   float64
 6   total sulfur dioxide  1143 non-null   float64
 7   density               1143 non-null   float64
 8   pH                    1143 non-null   float64
 9   sulphates             1143 non-null   float64
 10  alcohol               1143 non-null   float64
 11  quality               1143 non-null   int64  
 12  Id                    1143 non-null   int64  
dtypes: float64(11), int64(2)
memory usage: 116.2 KB
None

Missing Values in Each Column:

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
Id                      0
dtype: int64

Statistical Summary:

       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    1143.000000       1143.000000  1143.000000     1143.000000
mean        8.311111          0.531339     0.268364        2.532152
std         1.747595          0.179633     0.196686        1.355917
min         4.600000          0.120000     0.000000        0.900000
25%         7.100000          0.392500     0.090000        1.900000
50%         7.900000          0.520000     0.250000        2.200000
75%         9.100000          0.640000     0.420000        2.600000
max        15.900000          1.580000     1.000000       15.500000

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  1143.000000          1143.000000           1143.000000  1143.000000
mean      0.086933            15.615486             45.914698     0.996730
std       0.047267            10.250486             32.782130     0.001925
min       0.012000             1.000000              6.000000     0.990070
25%       0.070000             7.000000             21.000000     0.995570
50%       0.079000            13.000000             37.000000     0.996680
75%       0.090000            21.000000             61.000000     0.997845
max       0.611000            68.000000            289.000000     1.003690

                pH    sulphates      alcohol      quality           Id
count  1143.000000  1143.000000  1143.000000  1143.000000  1143.000000
mean      3.311015     0.657708    10.442111     5.657043   804.969379
std       0.156664     0.170399     1.082196     0.805824   463.997116
min       2.740000     0.330000     8.400000     3.000000     0.000000
25%       3.205000     0.550000     9.500000     5.000000   411.000000
50%       3.310000     0.620000    10.200000     6.000000   794.000000
75%       3.400000     0.730000    11.100000     6.000000  1209.500000
max       4.010000     2.000000    14.900000     8.000000  1597.000000

What does the output mean?

The output shows that:

  • The dataset contains 1,143 rows and 13 columns, mostly of type float64, with quality and Id as int64.
  • There are zero missing values, so no imputation (a common feature engineering technique) is needed.
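It is also worth checking how the raw quality scores are distributed before we binarize them in the next step. Judging from the summary statistics above (median 6, quartiles at 5 and 6), most wines score 5 or 6. A quick check:

# Distribution of the raw quality scores (before binarization)
print(df['quality'].value_counts().sort_index())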

Step 4: Preprocessing the Features

Before we can train any model, we need to prepare our dataset. In this step, we will:

  • Drop the Id column. It is just a row identifier and doesn't help in predicting wine quality.
  • Convert quality into a binary label. We will simplify the prediction into two classes:
    • 0 for low quality (score ≤ 5)
    • 1 for high quality (score ≥ 6)
  • Normalize the features. Scaling improves performance for most ML algorithms, especially distance-based ones like KNN.

Here is the code to accomplish all this:

from sklearn.preprocessing import StandardScaler

# 1. Drop the 'Id' column (not useful for prediction)
df.drop('Id', axis=1, inplace=True)

# 2. Binarize the 'quality' column: 0 for low (<=5), 1 for high (>=6)
df['quality'] = df['quality'].apply(lambda x: 1 if x >= 6 else 0)

# 3. Split features and target
X = df.drop('quality', axis=1)
y = df['quality']

# 4. Normalize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
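Note that the scaler above is fit on the full dataset for simplicity; in Step 6 we will fit it on the training split only, which avoids data leakage. Before moving on, it is also worth confirming that the binarized classes are reasonably balanced. Judging from the test split used later (102 low vs. 127 high), the dataset leans slightly toward high-quality wines:

# Check the class balance after binarization
print(y.value_counts())
print(y.value_counts(normalize=True).round(2))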

Step 5: Visualizing Feature Relationships

Before moving ahead, let’s visualize the data. This will help us understand trends, locate correlations, and find patterns between physicochemical properties and wine quality. 

Heatmap – Feature Correlation Matrix

Use the code below:

import seaborn as sns
import matplotlib.pyplot as plt

# Create a correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Physicochemical Properties')
plt.show()

Output:

The output shows how the features are correlated with each other and with the target label (quality).
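If you prefer numbers to colors, you can also rank the features by their correlation with the binarized quality label. A small optional sketch:

# Rank features by their correlation with the binarized quality label
corr_with_quality = df.corr()['quality'].drop('quality').sort_values(ascending=False)
print(corr_with_quality)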

Boxplots – Feature Distribution by Wine Quality

Use the code below:

# Boxplot for alcohol content
plt.figure(figsize=(8, 5))
sns.boxplot(x='quality', y='alcohol', data=df)
plt.title('Alcohol Content vs Wine Quality')
plt.xlabel('Wine Quality (0 = Low, 1 = High)')
plt.ylabel('Alcohol Content')
plt.show()

Output:

The output shows how alcohol content differs between low-quality and high-quality wines.
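The same comparison works for any feature. A short sketch that repeats the boxplot for a few other columns (the feature list here is just an illustrative pick):

# Repeat the boxplot for a few more features
for feature in ['volatile acidity', 'sulphates', 'citric acid']:
    plt.figure(figsize=(8, 5))
    sns.boxplot(x='quality', y=feature, data=df)
    plt.title(f'{feature.title()} vs Wine Quality')
    plt.xlabel('Wine Quality (0 = Low, 1 = High)')
    plt.ylabel(feature.title())
    plt.show()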

Count Plot – Distribution of Wine Quality

Here is the code:

# Count plot of the binary quality column
plt.figure(figsize=(6, 4))
sns.countplot(x='quality', data=df)
plt.title('Distribution of Wine Quality (Binary)')
plt.xlabel('Wine Quality (0 = Low, 1 = High)')
plt.ylabel('Count')
plt.show()

Output:

The output shows how many wines fall into each quality category after binarization.

From the three visualizations above, we learn that:

  • Alcohol and sulphates tend to increase with wine quality.
  • Volatile acidity is often lower in high-quality wine.
  • Most wines in this dataset are of low to medium quality.

Step 6: Training and Evaluating Classification Models

Now that the dataset is ready and well understood, let’s train the models. We will train the following classification models:

  • Logistic Regression
  • Decision Tree Classifier
  • Random Forest Classifier
  • K-Nearest Neighbors (KNN)

Once trained, we will also compare their accuracy and classification reports.

Here is the code to do so:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score

# 1. Splitting the dataset
X = df.drop('quality', axis=1)
y = df['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Initialize Models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'KNN': KNeighborsClassifier()
}

# 4. Train and Evaluate
for name, model in models.items():
    print(f"\n--- {name} ---")
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    
    acc = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {acc:.4f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))

Output:

--- Logistic Regression ---

Accuracy: 0.7686

Classification Report:

              precision    recall  f1-score   support

           0       0.74      0.75      0.74       102

           1       0.79      0.79      0.79       127

    accuracy                           0.77       229

   macro avg       0.77      0.77      0.77       229

weighted avg       0.77      0.77      0.77       229

--- Decision Tree ---

Accuracy: 0.6900

Classification Report:

              precision    recall  f1-score   support

           0       0.65      0.65      0.65       102

           1       0.72      0.72      0.72       127

    accuracy                           0.69       229

   macro avg       0.69      0.69      0.69       229

weighted avg       0.69      0.69      0.69       229

--- Random Forest ---

Accuracy: 0.7773

Classification Report:

              precision    recall  f1-score   support

           0       0.75      0.75      0.75       102

           1       0.80      0.80      0.80       127

    accuracy                           0.78       229

   macro avg       0.77      0.78      0.77       229

weighted avg       0.78      0.78      0.78       229

--- KNN ---

Accuracy: 0.7249

Classification Report:

              precision    recall  f1-score   support

           0       0.70      0.67      0.68       102

           1       0.74      0.77      0.76       127

    accuracy                           0.72       229

   macro avg       0.72      0.72      0.72       229

weighted avg       0.72      0.72      0.72       229

What does the output mean?

The output shows that Random Forest is the best model here. It offers the highest accuracy and F1-score across both classes.
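A single train/test split can be noisy, so it is worth double-checking this ranking with cross-validation before drawing conclusions. A minimal sketch using 5-fold cross-validation (the fold count is an arbitrary but common choice); wrapping each model in a pipeline re-fits the scaler inside every fold, so no test data leaks into the scaling:

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Cross-validate each model, re-fitting the scaler within every fold
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
    print(f"{name}: mean accuracy = {scores.mean():.4f} (+/- {scores.std():.4f})")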

Model Comparison Summary:

Model                  Accuracy    F1-Score (Class 1)    Description
Random Forest          0.7773      0.80                  Best overall performance; good balance between precision and recall
Logistic Regression    0.7686      0.79                  Very close to Random Forest; a simple model, but still reliable
KNN                    0.7249      0.76                  Decent precision, but slightly lower recall, especially for low-quality wines
Decision Tree          0.6900      0.72                  Lowest accuracy, likely due to overfitting and weaker generalization

 

Conclusion

Random Forest achieved the best performance with an accuracy of 77.7% and an F1 score of 0.80 for Class 1, indicating a strong balance between precision and recall. Logistic Regression also performed well, presenting an accuracy of 76.86% and a decent F1 score of 0.79 for Class 1.

Decision Tree was the poorest performer in terms of accuracy, at 69%. It also posted the lowest precision and recall for both Class 0 and Class 1, which we can attribute to overfitting and weaker generalization. Next, KNN managed an accuracy of 72.49%, but its F1-score for Class 1 (0.76) trailed those of Logistic Regression and Random Forest.
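If you want to reuse the best model outside this notebook, you can persist it along with the fitted scaler. A minimal sketch using joblib (the file names are arbitrary):

import joblib

# Save the trained Random Forest model and the fitted scaler
joblib.dump(models['Random Forest'], 'wine_quality_rf.joblib')
joblib.dump(scaler, 'wine_quality_scaler.joblib')

# Later: reload the model and predict on new, scaled samples
rf = joblib.load('wine_quality_rf.joblib')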


Colab link:
https://colab.research.google.com/drive/1HlSNLTr-8RfynDOE_GrmXkza1wiHKuFF?usp=sharing

Frequently Asked Questions (FAQs)

1. What is Wine Quality Prediction?

It is a machine learning task in which a wine's quality is predicted from its physicochemical properties. In this project, we simplified the numeric quality score into a binary label (low vs. high) and treated it as a classification problem.

2. Which dataset is commonly used for this project?

The Wine Quality dataset from Kaggle. We used WineQT.csv, which contains 1,143 red wine samples and 13 columns.

3. What are the best algorithms for wine quality prediction?

In this project, Random Forest performed best (77.73% accuracy), closely followed by Logistic Regression (76.86%); KNN and Decision Tree trailed behind.

4. What are the key features that influence wine quality?

Based on our analysis, alcohol and sulphates tend to increase with quality, while volatile acidity is usually lower in high-quality wines.

5. What tools are used in building the model?

Google Colab, Python, Pandas, NumPy, Matplotlib, Seaborn, and Scikit-learn.

