Wine Quality Prediction Model
By Rohit Sharma
Updated on Aug 01, 2025 | 10 min read | 1.27K+ views
Wine quality prediction is a classic machine learning exercise. The goal is to predict the quality of a wine from its physicochemical properties. Although the raw quality score is numeric, we will binarize it into low and high quality, which turns the task into a binary classification problem.
The dataset used here is WineQT.csv, downloadable from Kaggle. The dataset contains attributes such as acidity, sugar content, pH, alcohol content, and sulphates.
In this project, we will build and evaluate classification models that predict red wine quality from these physicochemical features, including fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, and free sulfur dioxide.
Explore more project ideas like this in our Top 25+ Essential Data Science Projects GitHub to Explore in 2025 blog.
It is better to have at least some background in:
- Python programming
- Basic statistics
- Fundamental machine learning concepts (classification, train/test splits, evaluation metrics)
For this project, the following tools and libraries will be used:
| Tool/Library | Purpose |
| --- | --- |
| Google Colab | Online environment for writing and running Python code seamlessly |
| Python | Core programming language for building the model |
| Pandas & NumPy | For reading, processing, and analyzing structured data |
| Matplotlib & Seaborn | To visualize distributions, correlations, and feature patterns |
| Scikit-learn | For training classification models and evaluating their performance |
The following classification models will be applied and compared for wine quality prediction:
- Logistic Regression
- Decision Tree
- Random Forest
- K-Nearest Neighbors (KNN)
You can complete this wine quality prediction project in about 1.5 to 2 hours. It’s a beginner-level machine learning project that helps you apply basic concepts such as exploratory data analysis, data preprocessing, model training, and evaluation.
Let’s start building the project from scratch, beginning with downloading the dataset and loading it into Google Colab. Without any further delay, let’s start!
To build the wine quality prediction model, we will use the dataset available on Kaggle. It is a single CSV file with 1,143 rows and 13 columns.
Follow the steps mentioned below to download the dataset:
1. Open the WineQT dataset page on Kaggle.
2. Sign in to your Kaggle account.
3. Click the Download button and save WineQT.csv to your computer.
Now that the .csv file has been downloaded, let’s upload it to the Colab environment. Use the following code to open a file picker and upload the file.
# Upload the CSV file to Google Colab
from google.colab import files
uploaded = files.upload()
Once uploaded, load the file into a Pandas DataFrame. Here’s the code to do so:
# Import pandas
import pandas as pd
# Read the uploaded file into a DataFrame
df = pd.read_csv('WineQT.csv')
# Display the first few rows
df.head()
Output:
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | Id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 | 1 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 | 2 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 | 3 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 4 |
In this step, we will explore the dataset to understand its structure and contents. Use the code given below to accomplish this:
# Step 2: Basic Data Exploration
# 1. Dataset dimensions
print("Dataset Shape:", df.shape)
# 2. First 5 rows
print("\nSample Data:")
print(df.head())
# 3. Info: data types and non-null counts
print("\nData Info:")
print(df.info())
# 4. Check for missing values
print("\nMissing Values in Each Column:")
print(df.isnull().sum())
# 5. Summary statistics
print("\nStatistical Summary:")
print(df.describe())
Output:
Dataset Shape: (1143, 13)
Sample Data:
fixed acidity volatile acidity citric acid residual sugar chlorides \
0 7.4 0.70 0.00 1.9 0.076
1 7.8 0.88 0.00 2.6 0.098
2 7.8 0.76 0.04 2.3 0.092
3 11.2 0.28 0.56 1.9 0.075
4 7.4 0.70 0.00 1.9 0.076
free sulfur dioxide total sulfur dioxide density pH sulphates \
0 11.0 34.0 0.9978 3.51 0.56
1 25.0 67.0 0.9968 3.20 0.68
2 15.0 54.0 0.9970 3.26 0.65
3 17.0 60.0 0.9980 3.16 0.58
4 11.0 34.0 0.9978 3.51 0.56
alcohol quality Id
0 9.4 5 0
1 9.8 5 1
2 9.8 5 2
3 9.8 6 3
4 9.4 5 4
Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1143 entries, 0 to 1142
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 fixed acidity 1143 non-null float64
1 volatile acidity 1143 non-null float64
2 citric acid 1143 non-null float64
3 residual sugar 1143 non-null float64
4 chlorides 1143 non-null float64
5 free sulfur dioxide 1143 non-null float64
6 total sulfur dioxide 1143 non-null float64
7 density 1143 non-null float64
8 pH 1143 non-null float64
9 sulphates 1143 non-null float64
10 alcohol 1143 non-null float64
11 quality 1143 non-null int64
12 Id 1143 non-null int64
dtypes: float64(11), int64(2)
memory usage: 116.2 KB
None
Missing Values in Each Column:
fixed acidity 0
volatile acidity 0
citric acid 0
residual sugar 0
chlorides 0
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
quality 0
Id 0
dtype: int64
Statistical Summary:
fixed acidity volatile acidity citric acid residual sugar \
count 1143.000000 1143.000000 1143.000000 1143.000000
mean 8.311111 0.531339 0.268364 2.532152
std 1.747595 0.179633 0.196686 1.355917
min 4.600000 0.120000 0.000000 0.900000
25% 7.100000 0.392500 0.090000 1.900000
50% 7.900000 0.520000 0.250000 2.200000
75% 9.100000 0.640000 0.420000 2.600000
max 15.900000 1.580000 1.000000 15.500000
chlorides free sulfur dioxide total sulfur dioxide density \
count 1143.000000 1143.000000 1143.000000 1143.000000
mean 0.086933 15.615486 45.914698 0.996730
std 0.047267 10.250486 32.782130 0.001925
min 0.012000 1.000000 6.000000 0.990070
25% 0.070000 7.000000 21.000000 0.995570
50% 0.079000 13.000000 37.000000 0.996680
75% 0.090000 21.000000 61.000000 0.997845
max 0.611000 68.000000 289.000000 1.003690
pH sulphates alcohol quality Id
count 1143.000000 1143.000000 1143.000000 1143.000000 1143.000000
mean 3.311015 0.657708 10.442111 5.657043 804.969379
std 0.156664 0.170399 1.082196 0.805824 463.997116
min 2.740000 0.330000 8.400000 3.000000 0.000000
25% 3.205000 0.550000 9.500000 5.000000 411.000000
50% 3.310000 0.620000 10.200000 6.000000 794.000000
75% 3.400000 0.730000 11.100000 6.000000 1209.500000
max 4.010000 2.000000 14.900000 8.000000 1597.000000
What does the output mean?
The output shows us that:
- The dataset has 1,143 rows and 13 columns, matching the shape (1143, 13).
- No column contains missing values, so no imputation is needed.
- All columns are numeric: 11 float64 features plus the int64 quality and Id columns, so no encoding is required.
- Quality scores range from 3 to 8 with a mean of about 5.66, and Id is just a row identifier with no predictive value.
Before we can train any model, we need to prep our dataset. In this step, we will:
- Drop the Id column, since it carries no predictive information.
- Binarize the quality column: 0 for low quality (score <= 5), 1 for high quality (score >= 6).
- Split the data into features (X) and target (y).
- Standardize the features so they are on a comparable scale.
Here is the code to accomplish all this:
from sklearn.preprocessing import StandardScaler
# 1. Drop the 'Id' column (not useful for prediction)
df.drop('Id', axis=1, inplace=True)
# 2. Binarize the 'quality' column: 0 for low (<=5), 1 for high (>=6)
df['quality'] = df['quality'].apply(lambda x: 1 if x >= 6 else 0)
# 3. Split features and target
X = df.drop('quality', axis=1)
y = df['quality']
# 4. Normalize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
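As a quick sanity check (an optional sketch, not part of the original walkthrough), you can verify that the scaling step worked: after StandardScaler, every feature column should have a mean of roughly 0 and a standard deviation of roughly 1.
import numpy as np
# Each scaled column should now have mean ≈ 0 and std ≈ 1
print("Column means:", np.round(X_scaled.mean(axis=0), 3))
print("Column stds: ", np.round(X_scaled.std(axis=0), 3))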
Before moving ahead, let’s visualize the data. This will help us understand trends, locate correlations, and find patterns between physicochemical properties and wine quality.
Heatmap – Feature Correlation Matrix
Use the code below:
import seaborn as sns
import matplotlib.pyplot as plt
# Create a correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Physicochemical Properties')
plt.show()
Output:
The output shows how features are correlated with each other and the target label (quality).
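To make the heatmap easier to read, you can also print each feature’s correlation with the binarized quality label, sorted from most positive to most negative (a small optional addition to the walkthrough):
# Correlation of every feature with the target label, sorted
print(df.corr()['quality'].sort_values(ascending=False))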
Boxplots – Feature Distribution by Wine Quality
Use the code below:
# Boxplot for alcohol content
plt.figure(figsize=(8, 5))
sns.boxplot(x='quality', y='alcohol', data=df)
plt.title('Alcohol Content vs Wine Quality')
plt.xlabel('Wine Quality (0 = Low, 1 = High)')
plt.ylabel('Alcohol Content')
plt.show()
Output:
The output shows how key features, such as alcohol content, differ between high-quality and low-quality wines.
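To put numbers behind the boxplot, a quick group-wise summary (an optional sketch, assuming the preprocessing step above has already run) compares mean feature values across the two classes:
# Mean of selected features for low (0) vs high (1) quality wines
print(df.groupby('quality')[['alcohol', 'volatile acidity', 'sulphates']].mean())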
Histogram – Distribution of Wine Quality
Here is the code:
# Histogram of the binary quality column
plt.figure(figsize=(6, 4))
sns.countplot(x='quality', data=df)
plt.title('Distribution of Wine Quality (Binary)')
plt.xlabel('Wine Quality (0 = Low, 1 = High)')
plt.ylabel('Count')
plt.show()
Output:
The output shows how many wines fall into each quality category after binarization.
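To confirm the exact counts behind the plot, you can tally the classes directly (a one-line optional check):
# Number of wines in each binary quality class
print(df['quality'].value_counts())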
After seeing all three visualizations above, we learned that:
- Alcohol content has the strongest positive relationship with wine quality, while volatile acidity is negatively correlated with it.
- High-quality wines tend to have noticeably higher alcohol content than low-quality ones.
- After binarization, the two quality classes are roughly balanced, with slightly more high-quality wines.
Now that our dataset is ready and we have understood it, let’s train the models. We will train the following classification models:
- Logistic Regression
- Decision Tree
- Random Forest
- K-Nearest Neighbors (KNN)
Once trained, we will also compare their accuracy and classification reports.
Here is the code to do so:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score
# 1. Splitting the dataset
X = df.drop('quality', axis=1)
y = df['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 2. Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 3. Initialize Models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'KNN': KNeighborsClassifier()
}
# 4. Train and Evaluate
for name, model in models.items():
    print(f"\n--- {name} ---")
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    acc = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {acc:.4f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
Output:
--- Logistic Regression ---
Accuracy: 0.7686
Classification Report:
precision recall f1-score support
0 0.74 0.75 0.74 102
1 0.79 0.79 0.79 127
accuracy 0.77 229
macro avg 0.77 0.77 0.77 229
weighted avg 0.77 0.77 0.77 229
--- Decision Tree ---
Accuracy: 0.6900
Classification Report:
precision recall f1-score support
0 0.65 0.65 0.65 102
1 0.72 0.72 0.72 127
accuracy 0.69 229
macro avg 0.69 0.69 0.69 229
weighted avg 0.69 0.69 0.69 229
--- Random Forest ---
Accuracy: 0.7773
Classification Report:
precision recall f1-score support
0 0.75 0.75 0.75 102
1 0.80 0.80 0.80 127
accuracy 0.78 229
macro avg 0.77 0.78 0.77 229
weighted avg 0.78 0.78 0.78 229
--- KNN ---
Accuracy: 0.7249
Classification Report:
precision recall f1-score support
0 0.70 0.67 0.68 102
1 0.74 0.77 0.76 127
accuracy 0.72 229
macro avg 0.72 0.72 0.72 229
weighted avg 0.72 0.72 0.72 229
What does the output mean?
The output shows that Random Forest is the best model here. It offers the highest accuracy and F1-score across both classes.
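To look one level deeper than accuracy, you can inspect the confusion matrix of the top model (an optional sketch; it assumes the models dictionary and the scaled test split from the training code above are still in scope):
from sklearn.metrics import confusion_matrix
# Rows = actual class, columns = predicted class (0 = low, 1 = high)
rf = models['Random Forest']
print(confusion_matrix(y_test, rf.predict(X_test_scaled)))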
Model Comparison Summary:
| Model | Accuracy | F1-Score (Class 1) | Description |
| --- | --- | --- | --- |
| Random Forest | 0.7773 | 0.80 | Best overall performance; good balance between precision and recall |
| Logistic Regression | 0.7686 | 0.79 | Very close to Random Forest; a simple model but still reliable |
| KNN | 0.7249 | 0.76 | Reasonable performance, but lower precision and recall than the top two models |
| Decision Tree | 0.6900 | 0.72 | Lowest accuracy, likely due to overfitting and poor generalization |
Random Forest achieved the best performance, with an accuracy of 77.73% and an F1-score of 0.80 for Class 1, indicating a strong balance between precision and recall. Logistic Regression also performed well, with an accuracy of 76.86% and a solid F1-score of 0.79 for Class 1.
Decision Tree was the poorest performer in terms of accuracy, at 69%, and it also had lower precision and recall for both classes; we can attribute this to overfitting and a failure to generalize well. Next, KNN managed an accuracy of 72.49%, but its F1-score for Class 1 (0.76) was lower than those of Logistic Regression and Random Forest.
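As a natural follow-up (an optional sketch, not part of the original walkthrough), the fitted Random Forest can also reveal which physicochemical properties influenced its predictions the most through its feature importances:
# Rank features by their importance in the fitted Random Forest
importances = pd.Series(models['Random Forest'].feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))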
Colab link:
https://colab.research.google.com/drive/1HlSNLTr-8RfynDOE_GrmXkza1wiHKuFF?usp=sharing