Wine Quality Prediction Model
By Rohit Sharma
Updated on Aug 01, 2025 | 10 min read | 1.27K+ views
Wine quality prediction is a classic machine learning exercise. The goal is to predict the quality of a wine from its physicochemical properties. Although the raw quality score is numeric, we will binarize it into low and high quality, which turns the task into a binary classification problem.
The dataset used here is WineQT.csv, downloadable from Kaggle. The dataset contains attributes such as acidity, sugar content, pH, alcohol content, and sulphates.
In this project, we will build and evaluate classification models that predict red wine quality from these physicochemical features, including fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, and free sulfur dioxide.
Explore more project ideas like this in our Top 25+ Essential Data Science Projects GitHub to Explore in 2025 blog.
It is better to have at least some background in:
- Python programming
- Basic statistics
- Fundamental machine learning concepts (classification, train/test splits, evaluation metrics)
For this project, the following tools and libraries will be used:
| Tool/Library | Purpose |
| --- | --- |
| Google Colab | Online environment for writing and running Python code seamlessly |
| Python | Core programming language for building the model |
| Pandas & NumPy | For reading, processing, and analyzing structured data |
| Matplotlib & Seaborn | To visualize distributions, correlations, and feature patterns |
| Scikit-learn | For training classification models and evaluating their performance |
The following classification models will be applied and compared for wine quality prediction:
- Logistic Regression
- Decision Tree
- Random Forest
- K-Nearest Neighbors (KNN)
You can complete this wine quality prediction project in about 1.5 to 2 hours. It’s a beginner-level machine learning project that helps you apply basic concepts such as exploratory data analysis, data preprocessing, model training, and evaluation.
Let’s start building the project from scratch, beginning with downloading the dataset and loading it into Google Colab. Without any further delay, let’s start!
To build the wine quality prediction model, we will use the dataset available on Kaggle. It is a single CSV file with 1,143 rows and 13 columns.
Follow the steps mentioned below to download the dataset:
1. Open the WineQT dataset page on Kaggle.
2. Sign in to your Kaggle account.
3. Click the Download button and save WineQT.csv to your computer.
Now that the .csv file has been downloaded, let’s upload it to the Colab environment. Use the following code to open a file picker and upload the file.
# Upload the CSV file to Google Colab
from google.colab import files
uploaded = files.upload()
Once uploaded, load the file into a Pandas DataFrame. Here’s the code to do so:
# Import pandas
import pandas as pd
# Read the uploaded file into a DataFrame
df = pd.read_csv('WineQT.csv')
# Display the first few rows
df.head()
Output:
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | Id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 | 1 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 | 2 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 | 3 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 4 |
In this step, we will explore the dataset to understand its structure and contents. Use the code given below to accomplish this:
# Step 2: Basic Data Exploration
# 1. Dataset dimensions
print("Dataset Shape:", df.shape)
# 2. First 5 rows
print("\nSample Data:")
print(df.head())
# 3. Info: data types and non-null counts
print("\nData Info:")
print(df.info())
# 4. Check for missing values
print("\nMissing Values in Each Column:")
print(df.isnull().sum())
# 5. Summary statistics
print("\nStatistical Summary:")
print(df.describe())
Output:
Dataset Shape: (1143, 13)
Sample Data:
fixed acidity volatile acidity citric acid residual sugar chlorides \
0 7.4 0.70 0.00 1.9 0.076
1 7.8 0.88 0.00 2.6 0.098
2 7.8 0.76 0.04 2.3 0.092
3 11.2 0.28 0.56 1.9 0.075
4 7.4 0.70 0.00 1.9 0.076
free sulfur dioxide total sulfur dioxide density pH sulphates \
0 11.0 34.0 0.9978 3.51 0.56
1 25.0 67.0 0.9968 3.20 0.68
2 15.0 54.0 0.9970 3.26 0.65
3 17.0 60.0 0.9980 3.16 0.58
4 11.0 34.0 0.9978 3.51 0.56
alcohol quality Id
0 9.4 5 0
1 9.8 5 1
2 9.8 5 2
3 9.8 6 3
4 9.4 5 4
Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1143 entries, 0 to 1142
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 fixed acidity 1143 non-null float64
1 volatile acidity 1143 non-null float64
2 citric acid 1143 non-null float64
3 residual sugar 1143 non-null float64
4 chlorides 1143 non-null float64
5 free sulfur dioxide 1143 non-null float64
6 total sulfur dioxide 1143 non-null float64
7 density 1143 non-null float64
8 pH 1143 non-null float64
9 sulphates 1143 non-null float64
10 alcohol 1143 non-null float64
11 quality 1143 non-null int64
12 Id 1143 non-null int64
dtypes: float64(11), int64(2)
memory usage: 116.2 KB
None
Missing Values in Each Column:
fixed acidity 0
volatile acidity 0
citric acid 0
residual sugar 0
chlorides 0
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
quality 0
Id 0
dtype: int64
Statistical Summary:
fixed acidity volatile acidity citric acid residual sugar \
count 1143.000000 1143.000000 1143.000000 1143.000000
mean 8.311111 0.531339 0.268364 2.532152
std 1.747595 0.179633 0.196686 1.355917
min 4.600000 0.120000 0.000000 0.900000
25% 7.100000 0.392500 0.090000 1.900000
50% 7.900000 0.520000 0.250000 2.200000
75% 9.100000 0.640000 0.420000 2.600000
max 15.900000 1.580000 1.000000 15.500000
chlorides free sulfur dioxide total sulfur dioxide density \
count 1143.000000 1143.000000 1143.000000 1143.000000
mean 0.086933 15.615486 45.914698 0.996730
std 0.047267 10.250486 32.782130 0.001925
min 0.012000 1.000000 6.000000 0.990070
25% 0.070000 7.000000 21.000000 0.995570
50% 0.079000 13.000000 37.000000 0.996680
75% 0.090000 21.000000 61.000000 0.997845
max 0.611000 68.000000 289.000000 1.003690
pH sulphates alcohol quality Id
count 1143.000000 1143.000000 1143.000000 1143.000000 1143.000000
mean 3.311015 0.657708 10.442111 5.657043 804.969379
std 0.156664 0.170399 1.082196 0.805824 463.997116
min 2.740000 0.330000 8.400000 3.000000 0.000000
25% 3.205000 0.550000 9.500000 5.000000 411.000000
50% 3.310000 0.620000 10.200000 6.000000 794.000000
75% 3.400000 0.730000 11.100000 6.000000 1209.500000
max 4.010000 2.000000 14.900000 8.000000 1597.000000
What does the output mean?
The output shows us that:
- The dataset has 1,143 rows and 13 columns, matching the shape (1143, 13).
- No column contains missing values, so no imputation is needed.
- All columns are numeric: 11 float64 features plus the int64 quality and Id columns, so no encoding is required.
- Quality scores range from 3 to 8 with a mean of about 5.66, and Id is just a row identifier with no predictive value.
Before we can train any model, we need to prep our dataset. In this step, we will:
- Drop the Id column, since it carries no predictive information.
- Binarize the quality column: 0 for low quality (score <= 5), 1 for high quality (score >= 6).
- Split the data into features (X) and target (y).
- Standardize the features so they are on a comparable scale.
Here is the code to accomplish all this:
from sklearn.preprocessing import StandardScaler
# 1. Drop the 'Id' column (not useful for prediction)
df.drop('Id', axis=1, inplace=True)
# 2. Binarize the 'quality' column: 0 for low (<=5), 1 for high (>=6)
df['quality'] = df['quality'].apply(lambda x: 1 if x >= 6 else 0)
# 3. Split features and target
X = df.drop('quality', axis=1)
y = df['quality']
# 4. Normalize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
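As a quick sanity check (an optional sketch, not part of the original walkthrough), you can verify that the scaling step worked: after StandardScaler, every feature column should have a mean of roughly 0 and a standard deviation of roughly 1.
import numpy as np
# Each scaled column should now have mean ≈ 0 and std ≈ 1
print("Column means:", np.round(X_scaled.mean(axis=0), 3))
print("Column stds: ", np.round(X_scaled.std(axis=0), 3))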
Before moving ahead, let’s visualize the data. This will help us understand trends, locate correlations, and find patterns between physicochemical properties and wine quality.
Heatmap – Feature Correlation Matrix
Use the code below:
import seaborn as sns
import matplotlib.pyplot as plt
# Create a correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Physicochemical Properties')
plt.show()
Output:
The output shows how features are correlated with each other and the target label (quality).
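To make the heatmap easier to read, you can also print each feature’s correlation with the binarized quality label, sorted from most positive to most negative (a small optional addition to the walkthrough):
# Correlation of every feature with the target label, sorted
print(df.corr()['quality'].sort_values(ascending=False))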
Boxplots – Feature Distribution by Wine Quality
Use the code below:
# Boxplot for alcohol content
plt.figure(figsize=(8, 5))
sns.boxplot(x='quality', y='alcohol', data=df)
plt.title('Alcohol Content vs Wine Quality')
plt.xlabel('Wine Quality (0 = Low, 1 = High)')
plt.ylabel('Alcohol Content')
plt.show()
Output:
The output shows how key features, such as alcohol content, differ between high-quality and low-quality wines.
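To put numbers behind the boxplot, a quick group-wise summary (an optional sketch, assuming the preprocessing step above has already run) compares mean feature values across the two classes:
# Mean of selected features for low (0) vs high (1) quality wines
print(df.groupby('quality')[['alcohol', 'volatile acidity', 'sulphates']].mean())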
Histogram – Distribution of Wine Quality
Here is the code:
# Histogram of the binary quality column
plt.figure(figsize=(6, 4))
sns.countplot(x='quality', data=df)
plt.title('Distribution of Wine Quality (Binary)')
plt.xlabel('Wine Quality (0 = Low, 1 = High)')
plt.ylabel('Count')
plt.show()
Output:
The output shows how many wines fall into each quality category after binarization.
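To confirm the exact counts behind the plot, you can tally the classes directly (a one-line optional check):
# Number of wines in each binary quality class
print(df['quality'].value_counts())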
After seeing all three visualizations above, we learned that:
- Alcohol content has the strongest positive relationship with wine quality, while volatile acidity is negatively correlated with it.
- High-quality wines tend to have noticeably higher alcohol content than low-quality ones.
- After binarization, the two quality classes are roughly balanced, with slightly more high-quality wines.
Now that our dataset is ready and we have understood it, let’s train the models. We will train the following classification models:
- Logistic Regression
- Decision Tree
- Random Forest
- K-Nearest Neighbors (KNN)
Once trained, we will also compare their accuracy and classification reports.
Here is the code to do so:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score
# 1. Splitting the dataset
X = df.drop('quality', axis=1)
y = df['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 2. Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 3. Initialize Models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'KNN': KNeighborsClassifier()
}
# 4. Train and Evaluate
for name, model in models.items():
    print(f"\n--- {name} ---")
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    acc = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {acc:.4f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
Output:
--- Logistic Regression ---
Accuracy: 0.7686
Classification Report:
precision recall f1-score support
0 0.74 0.75 0.74 102
1 0.79 0.79 0.79 127
accuracy 0.77 229
macro avg 0.77 0.77 0.77 229
weighted avg 0.77 0.77 0.77 229
--- Decision Tree ---
Accuracy: 0.6900
Classification Report:
precision recall f1-score support
0 0.65 0.65 0.65 102
1 0.72 0.72 0.72 127
accuracy 0.69 229
macro avg 0.69 0.69 0.69 229
weighted avg 0.69 0.69 0.69 229
--- Random Forest ---
Accuracy: 0.7773
Classification Report:
precision recall f1-score support
0 0.75 0.75 0.75 102
1 0.80 0.80 0.80 127
accuracy 0.78 229
macro avg 0.77 0.78 0.77 229
weighted avg 0.78 0.78 0.78 229
--- KNN ---
Accuracy: 0.7249
Classification Report:
precision recall f1-score support
0 0.70 0.67 0.68 102
1 0.74 0.77 0.76 127
accuracy 0.72 229
macro avg 0.72 0.72 0.72 229
weighted avg 0.72 0.72 0.72 229
What does the output mean?
The output shows that Random Forest is the best model here. It offers the highest accuracy and F1-score across both classes.
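To look one level deeper than accuracy, you can inspect the confusion matrix of the top model (an optional sketch; it assumes the models dictionary and the scaled test split from the training code above are still in scope):
from sklearn.metrics import confusion_matrix
# Rows = actual class, columns = predicted class (0 = low, 1 = high)
rf = models['Random Forest']
print(confusion_matrix(y_test, rf.predict(X_test_scaled)))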
Model Comparison Summary:
| Model | Accuracy | F1-Score (Class 1) | Description |
| --- | --- | --- | --- |
| Random Forest | 0.7773 | 0.80 | Best overall performance; good balance between precision and recall |
| Logistic Regression | 0.7686 | 0.79 | Very close to Random Forest; a simple model but still reliable |
| KNN | 0.7249 | 0.76 | Reasonable performance, but lower precision and recall than the top two models |
| Decision Tree | 0.6900 | 0.72 | Lowest accuracy, likely due to overfitting and poor generalization |
Random Forest achieved the best performance, with an accuracy of 77.73% and an F1-score of 0.80 for Class 1, indicating a strong balance between precision and recall. Logistic Regression also performed well, with an accuracy of 76.86% and a solid F1-score of 0.79 for Class 1.
Decision Tree was the poorest performer in terms of accuracy, at 69%, and it also had lower precision and recall for both classes; we can attribute this to overfitting and a failure to generalize well. Next, KNN managed an accuracy of 72.49%, but its F1-score for Class 1 (0.76) was lower than those of Logistic Regression and Random Forest.
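As a natural follow-up (an optional sketch, not part of the original walkthrough), the fitted Random Forest can also reveal which physicochemical properties influenced its predictions the most through its feature importances:
# Rank features by their importance in the fitted Random Forest
importances = pd.Series(models['Random Forest'].feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))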
Colab link:
https://colab.research.google.com/drive/1HlSNLTr-8RfynDOE_GrmXkza1wiHKuFF?usp=sharing