Indian Food Analysis and Region Prediction Using Machine Learning
By Rohit Sharma
Updated on Aug 04, 2025 | 10 min read | 151 views
Indian food is remarkably diverse, with dishes that vary by region and culture. In this project, you’ll explore the Indian Food dataset to find patterns in ingredients, diet types, and flavor profiles.
Using machine learning, you’ll build a classifier that predicts the region a dish belongs to from its features. From data cleaning to model evaluation, this project shows how Indian Food Analysis using Machine Learning can surface patterns in culinary data.
Before starting the Indian Food Analysis project, you should be familiar with:
For the Indian Food Analysis project, the tools and libraries used are below:
Tool / Library | Purpose |
Python | Core programming language |
Google Colab | Environment for running and sharing Python notebooks |
Pandas | Reading, cleaning, and analyzing the dataset |
NumPy | Handling numerical data efficiently |
Matplotlib / Seaborn | Visualizing cuisine trends and ingredient patterns |
Scikit-learn | TF-IDF vectorization, label encoding, model training, and evaluation |
For our Indian Food Analysis project, we’ll use two simple approaches:
Frequency Analysis: We count how often each ingredient appears, then look at preparation time and cooking style to understand regional and cultural food patterns in India.
One-Hot Encoding and Aggregation: We convert categorical data, such as diet type and ingredients, into numerical columns so that patterns can be counted and compared.
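As a rough illustration of both approaches, the sketch below one-hot encodes a comma-separated ingredients column, counts how many dishes use each ingredient, and aggregates usage per region. The `toy` DataFrame and the variable names are made up for this example; only the `ingredients` and `region` column names come from the dataset.

```python
import pandas as pd

# Toy rows shaped like the dataset's 'ingredients' and 'region' columns (illustrative only).
toy = pd.DataFrame({
    "ingredients": [
        "Gram flour, ghee, sugar",
        "Carrots, milk, sugar, ghee",
        "Rice, milk, sugar",
    ],
    "region": ["West", "North", "East"],
})

# Normalize case and spacing so "Ghee" and " ghee" count as the same ingredient.
cleaned = (
    toy["ingredients"]
    .str.lower()
    .str.replace(r"\s*,\s*", ",", regex=True)
    .str.strip()
)

# One-hot encoding: one 0/1 column per distinct ingredient.
ingredient_flags = cleaned.str.get_dummies(sep=",")

# Frequency analysis: number of dishes that use each ingredient.
ingredient_counts = ingredient_flags.sum().sort_values(ascending=False)

# Aggregation: ingredient usage summed per region.
usage_by_region = ingredient_flags.groupby(toy["region"]).sum()

print(ingredient_counts.head())
print(usage_by_region)
```

On the real dataset, the same calls would run on `df['ingredients']` and `df['region']` once the data is loaded.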
You can complete this Indian Food Analysis project in 2 to 3 hours. It’s suitable for beginners to intermediate.
Let’s create this project from scratch with clear, step-by-step guidance:
By following these steps, you'll extract meaningful insights from the Indian food data and strengthen your data analysis workflow.
Without any further delay, let's begin!
We begin by importing the necessary libraries and loading the Indian Food dataset using pandas.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from scipy.sparse import hstack
# --- Step 1: Load and Initial Data Exploration ---
print("Step 1: Loading and Exploring Data")
# Load the dataset
try:
    df = pd.read_csv('indian_food.csv')
    print("Dataset loaded successfully!")
    print("Preview of the data:")
    print(df.head())
except FileNotFoundError:
    print("Error: 'indian_food.csv' not found. Please place it in the working directory.")
    # Stop here: the remaining steps need df
    raise SystemExit(1)
Output:
First 5 rows of the dataset:
name ingredients \
0 Balu shahi Maida flour, yogurt, oil, sugar
1 Boondi Gram flour, ghee, sugar
2 Gajar ka halwa Carrots, milk, sugar, ghee, cashews, raisins
3 Ghevar Flour, ghee, kewra, milk, clarified butter, su...
4 Gulab jamun Milk powder, plain flour, baking powder, ghee,...
diet prep_time cook_time flavor_profile course state \
0 vegetarian 45 25 sweet dessert West Bengal
1 vegetarian 80 30 sweet dessert Rajasthan
2 vegetarian 15 60 sweet dessert Punjab
3 vegetarian 15 30 sweet dessert Rajasthan
4 vegetarian 15 40 sweet dessert West Bengal
region
0 East
1 West
2 North
3 West
4 East
In this step, we clean the dataset to prepare it for analysis and modeling.
Here is the code:
# --- Step 2: Data Cleaning and Preprocessing ---
print("\nStep 2: Cleaning and Preprocessing Data")
# Replace placeholder missing values (-1 and '-1') with NaN
df.replace(-1, np.nan, inplace=True)
df.replace('-1', np.nan, inplace=True)
# Drop rows where the target variable 'region' is missing
df.dropna(subset=['region'], inplace=True)
# Fill missing values in 'flavor_profile' with the most frequent category (mode)
mode_flavor = df['flavor_profile'].mode()[0]
df['flavor_profile'] = df['flavor_profile'].fillna(mode_flavor)
print(f"Filled missing 'flavor_profile' with '{mode_flavor}'.")
# Display cleaned dataset information
print("\nData Info after cleaning:")
df.info()
print("\nMissing values count after cleaning:")
print(df.isnull().sum())
Output:
--- Cleaning and Preprocessing Data ---
Filled missing 'flavor_profile' with 'spicy'.
Data Info after cleaning:
Index: 241 entries, 0 to 254
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 241 non-null object
1 ingredients 241 non-null object
2 diet 241 non-null object
3 prep_time 212 non-null float64
4 cook_time 214 non-null float64
5 flavor_profile 241 non-null object
6 course 241 non-null object
7 state 230 non-null object
8 region 241 non-null object
dtypes: float64(2), object(7)
Missing values count after cleaning:
name 0
ingredients 0
diet 0
prep_time 29
cook_time 27
flavor_profile 0
course 0
state 11
region 0
dtype: int64
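This leaves the target and the columns used as features later (ingredients, flavor_profile, and diet) free of missing values. The remaining gaps in prep_time, cook_time, and state are left as they are, since those columns are not used by the model.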
In this step, we explore the dataset to understand patterns and distributions across different categories.
Here is the code:
# --- Step 3: Exploratory Data Analysis (EDA) & Visualization ---
print("\nStep 3: Performing Exploratory Data Analysis (EDA)")
sns.set_style("whitegrid")
plt.figure(figsize=(12, 6))
# Plot 1: Distribution of Dishes by Region
plt.subplot(2, 2, 1)
sns.countplot(y=df['region'], order=df['region'].value_counts().index, palette='viridis')
plt.title('Number of Dishes by Region')
plt.xlabel('Number of Dishes')
plt.ylabel('Region')
# Plot 2: Distribution of Dishes by Course
plt.subplot(2, 2, 2)
sns.countplot(x=df['course'], order=df['course'].value_counts().index, palette='plasma')
plt.title('Distribution of Dishes by Course')
plt.xlabel('Course')
plt.ylabel('Count')
# Plot 3: Diet Distribution (Vegetarian vs. Non-Vegetarian)
plt.subplot(2, 2, 3)
df['diet'].value_counts().plot.pie(autopct='%1.1f%%', colors=['#66b3ff', '#99ff99'],
                                   wedgeprops={'edgecolor': 'white'})
plt.title('Diet Distribution')
plt.ylabel('') # Remove y-label
# Plot 4: Distribution of Flavor Profile
plt.subplot(2, 2, 4)
sns.countplot(x=df['flavor_profile'], order=df['flavor_profile'].value_counts().index, palette='magma')
plt.title('Flavor Profiles')
plt.xlabel('Flavor Profile')
plt.ylabel('Count')
plt.tight_layout()
plt.show()
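Output: a 2×2 grid of plots showing the number of dishes by region, the distribution of dishes by course, the diet distribution, and the flavor profiles.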
This visual summary helps us understand the diversity and distribution of Indian cuisine across regions and categories.
In this step, we prepare the dataset for machine learning by changing both textual and categorical features into a numerical format.
Here is the code:
# --- Step 4: Feature Engineering and Preparation for Model ---
print("\nStep 4: Preparing Data for Machine Learning Model")
# Target Variable (y): Predicting the 'region'
target = df['region']
# Encode target labels into numeric values
le = LabelEncoder()
y = le.fit_transform(target)
# Display region-to-number mapping
print("Region to Number Mapping:")
for i, class_name in enumerate(le.classes_):
    print(f"{i} -> {class_name}")
# Features (X): Using 'ingredients', 'flavor_profile', and 'diet'
# 1. Vectorize the text in 'ingredients' using TF-IDF
print("\nVectorizing 'ingredients' using TF-IDF...")
tfidf = TfidfVectorizer(stop_words='english')
X_ingredients = tfidf.fit_transform(df['ingredients'])
print("Shape of TF-IDF features:", X_ingredients.shape)
# 2. One-hot encode 'flavor_profile' and 'diet'
print("One-hot encoding 'flavor_profile' and 'diet'...")
X_categorical = pd.get_dummies(df[['flavor_profile', 'diet']], drop_first=True)
print("Shape of categorical features:", X_categorical.shape)
# Combine TF-IDF and one-hot encoded features
X = hstack([X_ingredients, X_categorical.values])
print("Shape of final combined feature matrix (X):", X.shape)
Output:
--- Preparing Data for Machine Learning Model ---
Region to Number Mapping:
0 -> Central
1 -> East
2 -> North
3 -> North East
4 -> South
5 -> West
Vectorizing 'ingredients' using TF-IDF...
Shape of TF-IDF features: (241, 322)
One-hot encoding 'flavor_profile' and 'diet'...
Shape of categorical features: (241, 4)
Shape of final combined feature matrix (X): (241, 326)
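One detail worth knowing about the TF-IDF step: with default settings, `TfidfVectorizer` splits text into single words, so a multi-word ingredient such as "garam masala" becomes two separate features. The sketch below contrasts that with an ingredient-level tokenizer. The `split_on_commas` helper and the sample dishes are hypothetical and not part of the original notebook; whether this helps accuracy on the real data would need to be tested.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample ingredient strings (illustrative only).
dishes = [
    "Chicken, yogurt, garam masala",
    "Rice flour, coconut, jaggery",
    "Maida flour, yogurt, oil, sugar",
]


def split_on_commas(text):
    """Hypothetical tokenizer: treat each comma-separated ingredient as one token."""
    return [part.strip() for part in text.split(",") if part.strip()]


# Default behaviour: word-level tokens ("garam" and "masala" are separate features).
word_level = TfidfVectorizer()
word_level.fit(dishes)

# Alternative: ingredient-level tokens ("garam masala" stays a single feature).
# token_pattern=None signals that the default regex is intentionally unused.
ingredient_level = TfidfVectorizer(tokenizer=split_on_commas, token_pattern=None)
ingredient_level.fit(dishes)

print(sorted(word_level.vocabulary_))
print(sorted(ingredient_level.vocabulary_))
```

Ingredient-level tokens give fewer, more specific features, while word-level tokens let "rice flour" and "maida flour" share the word "flour".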
In this step, we build a classification model to predict the region of a dish using its textual and categorical features.
Here is the code for this step:
# --- Step 5: Building and Training the Classification Model ---
print("\nStep 5: Building and Training the Classification Model")
# Split the dataset into training and test sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")
# Initialize the Random Forest Classifier
# It’s robust and handles feature importance well.
model = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    class_weight='balanced'
)
# Train the model on the training data
print("\nTraining the RandomForest model...")
model.fit(X_train, y_train)
print("Model training complete!")
Output:
--- Building and Training the Classification Model ---
Training set size: 192 samples
Test set size: 49 samples
Model training complete!
Now that the model is trained, let's evaluate how well it performs on the unseen test data.
Here is the code:
# --- Step 6: Evaluating the Model ---
print("\nStep 6: Evaluating the Model's Performance")
# Make predictions on the test set
y_pred = model.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
# Display a classification report (precision, recall, f1-score)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=le.classes_, zero_division=0))
# Display the confusion matrix using heatmap
print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=le.classes_, yticklabels=le.classes_)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Region')
plt.ylabel('Actual Region')
plt.show()
Output:
-- Evaluating the Model's Performance ---
Model Accuracy: 59.18%
Classification Report:
precision recall f1-score support
Central 0.00 0.00 0.00 1
East 0.60 0.50 0.55 6
North 0.64 0.70 0.67 10
North East 0.00 0.00 0.00 5
South 0.64 0.58 0.61 12
West 0.55 0.80 0.65 15
accuracy 0.59 49
macro avg 0.40 0.43 0.41 49
weighted avg 0.53 0.59 0.55 49
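With only 49 test dishes, and a single Central sample among them, one train/test split gives a noisy estimate. One possible refinement, sketched below, is stratified cross-validation with the preprocessing wrapped in a scikit-learn `Pipeline`, so that TF-IDF is fitted only on each training fold. The `toy` DataFrame is a made-up stand-in for `df`, and the scores it produces say nothing about the real dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the cleaned dataset (illustrative rows only).
north = ["Chicken, yogurt, garam masala", "Paneer, cream, tomato", "Wheat flour, ghee, potato"]
south = ["Rice, coconut, curry leaves", "Urad dal, rice, mustard seeds", "Tamarind, lentils, drumstick"]
toy = pd.DataFrame({
    "ingredients": north * 2 + south * 2,
    "flavor_profile": ["spicy"] * 12,
    "diet": ["non vegetarian", "vegetarian", "vegetarian"] * 2 + ["vegetarian"] * 6,
    "region": ["North"] * 6 + ["South"] * 6,
})

# TF-IDF on the text column, one-hot encoding on the categorical columns.
preprocess = ColumnTransformer([
    ("ingredients", TfidfVectorizer(), "ingredients"),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["flavor_profile", "diet"]),
])

pipeline = Pipeline([
    ("features", preprocess),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42,
                                     class_weight="balanced")),
])

# Each fold refits the whole pipeline, so no test-fold text leaks into TF-IDF.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(
    pipeline,
    toy[["ingredients", "flavor_profile", "diet"]],
    toy["region"],
    cv=cv,
    scoring="accuracy",
)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

On the real data, the number of folds would be limited by the smallest region, because every fold needs at least one dish from each class.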
Confusion Matrix: a heatmap with actual regions on the rows and predicted regions on the columns.
In this step, we'll use the trained model to predict the regional origin of a new dish based on its ingredients, flavor profile, and diet type.
Here is the code to do so:
# --- Step 7: Making a Prediction on a New Dish --
print("\nStep 7: Predicting Region for a New Dish")
# Define a new dish
new_dish_ingredients = "Chicken, yogurt, onion, tomato, garam masala, ginger, garlic"
new_dish_flavor = "spicy"
new_dish_diet = "non vegetarian"
# Step 1: TF-IDF vectorization for the ingredients
new_ingredients_tfidf = tfidf.transform([new_dish_ingredients])
# Step 2: One-hot encode the categorical features
new_cat_data = {'flavor_profile': [new_dish_flavor], 'diet': [new_dish_diet]}
new_cat_df = pd.DataFrame(new_cat_data)
# Match columns used during training
encoded_cols = pd.get_dummies(new_cat_df).reindex(columns=X_categorical.columns, fill_value=0)
# Step 3: Combine the features
new_dish_features = hstack([new_ingredients_tfidf, encoded_cols.astype(int).values])
# Step 4: Make prediction
prediction_encoded = model.predict(new_dish_features)
predicted_region = le.inverse_transform(prediction_encoded)
# Show result
print(f"New Dish Ingredients: '{new_dish_ingredients}'")
print(f"Predicted Region: {predicted_region[0]}")
Output:
--- Example: Predicting Region for a New Dish ---
New Dish Ingredients: 'Chicken, yogurt, onion, tomato, garam masala, ginger, garlic'
Predicted Region: North
This makes sense, as these ingredients and spices are common in North Indian cuisine, especially in dishes like Butter Chicken or Chicken Curry.
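A single predicted label hides how confident the model is. The sketch below wraps the same encoding steps in a helper that returns a probability per region via `predict_proba`. The helper name `predict_region_proba` and the four training rows are assumptions made for this example; in the notebook, the fitted `tfidf`, `X_categorical`, and `model` objects would take their place.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative training set (not the real data).
train = pd.DataFrame({
    "ingredients": [
        "Chicken, yogurt, garam masala",
        "Paneer, cream, tomato",
        "Rice, coconut, curry leaves",
        "Urad dal, rice, mustard seeds",
    ],
    "flavor_profile": ["spicy", "spicy", "spicy", "sour"],
    "diet": ["non vegetarian", "vegetarian", "vegetarian", "vegetarian"],
    "region": ["North", "North", "South", "South"],
})

# Fit the encoders and the model the same way as in the earlier steps.
tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(train["ingredients"])
X_cat = pd.get_dummies(train[["flavor_profile", "diet"]], drop_first=True, dtype=int)
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(hstack([X_text, csr_matrix(X_cat.values)]).tocsr(), train["region"])


def predict_region_proba(ingredients, flavor, diet):
    """Hypothetical helper: return {region: probability} for one dish."""
    text_part = tfidf.transform([ingredients])
    new_row = pd.DataFrame({"flavor_profile": [flavor], "diet": [diet]})
    # Align the one-hot columns with the ones seen during training.
    cat_part = pd.get_dummies(new_row, dtype=int).reindex(
        columns=X_cat.columns, fill_value=0
    )
    features = hstack([text_part, csr_matrix(cat_part.values)]).tocsr()
    probabilities = model.predict_proba(features)[0]
    return dict(zip(model.classes_, probabilities))


probs = predict_region_proba(
    "Chicken, yogurt, onion, garam masala", "spicy", "non vegetarian"
)
for region, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{region}: {p:.2f}")
```

A top probability that is close to the runner-up suggests the prediction should be treated with caution.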
Extending the Model:
You’re not limited to Random Forest. Other models, such as SVM or XGBoost, can be tried for comparison and may perform better.
Try different algorithms and tune hyperparameters using tools like GridSearchCV or RandomizedSearchCV to improve accuracy and generalizability.
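A minimal sketch of such a search with `GridSearchCV` is shown below. The synthetic `X_demo` and `y_demo` from `make_classification` stand in for the real `X_train` and `y_train`, and the parameter grid is an arbitrary starting point, not a recommendation from the original project. Macro F1 is used for scoring here because the regions are imbalanced.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the real feature matrix and labels (assumption for this demo).
X_demo, y_demo = make_classification(
    n_samples=120, n_features=20, n_informative=6, n_classes=3, random_state=42
)

# Small illustrative grid; widen or narrow it based on available compute.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
    "min_samples_leaf": [1, 2],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42, class_weight="balanced"),
    param_grid=param_grid,
    scoring="f1_macro",  # treats every class equally, regardless of size
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated macro F1:", round(search.best_score_, 3))
```

After the search, `search.best_estimator_` holds a model refitted on all the data passed to `fit`, which can then be evaluated once on the held-out test set.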
This project built a model to predict the region of Indian dishes using ingredients, flavor profile, and diet. After cleaning and analyzing the data, we used TF-IDF and one-hot encoding for feature preparation.
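A Random Forest classifier was trained and reached 59.18% accuracy on the test set, with no correct predictions for the Central and North East regions, which have very few test samples. It predicted the region of a sample new dish as “North.” Other models like SVM or XGBoost can also be explored. This project shows how machine learning can uncover patterns in Indian cuisine.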
Colab Link:
https://colab.research.google.com/drive/1ZCsBI2OON1MQjVe9NGnVAN6tseAjNn5s
The goal was to build a machine learning model that can accurately predict the regional origin of Indian dishes. This prediction was based on features such as the dish’s ingredients, its flavor profile, and the diet type (vegetarian or non-vegetarian).
We used the Random Forest Classifier, a popular ensemble method. Random Forests suit this task because they work with a mix of TF-IDF and one-hot encoded features without any feature scaling. They are also less prone to overfitting than a single decision tree and often give reasonable accuracy without extensive parameter tuning.
The ingredients column was treated as text data and vectorized using TF-IDF (Term Frequency–Inverse Document Frequency). This technique helps convert textual information into numerical features by weighing the importance of each term in the dataset. For flavor_profile and diet, we applied one-hot encoding to convert categorical values into binary vectors.
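For reference, the weighting can be written out as follows, assuming scikit-learn's default `TfidfVectorizer` settings (smoothed IDF and L2 normalization). The symbols are introduced here for illustration: n is the number of dishes, df(t) is the number of dishes whose ingredients contain term t, and tf(t, d) is the count of term t in dish d.

```latex
\[
\mathrm{idf}(t) = \ln\!\left(\frac{1 + n}{1 + \mathrm{df}(t)}\right) + 1,
\qquad
w(t, d) = \mathrm{tf}(t, d)\,\mathrm{idf}(t),
\qquad
\hat{w}(t, d) = \frac{w(t, d)}{\sqrt{\sum_{t'} w(t', d)^{2}}}
\]
```

With n = 241 dishes, a term that appears in every dish gets idf = ln(242/242) + 1 = 1, while a term that appears in only one dish gets idf = ln(242/2) + 1 ≈ 5.80, so rare ingredients carry more weight than common ones.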
Yes, the model can predict the region for a new dish if provided with its ingredients, flavor profile, and diet type. During prediction, the new inputs must be preprocessed in the same way as the training data: ingredients should be vectorized using the same fitted TF-IDF model, and categorical values should be one-hot encoded with the same column structure.
Absolutely. While Random Forests are a strong baseline, other models like Support Vector Machines (SVM), XGBoost, or even deep learning approaches such as LSTM or BERT (for ingredient text) could yield better performance. Enhancing the dataset, fine-tuning hyperparameters, incorporating ingredient quantities, or using word embeddings instead of TF-IDF could also boost accuracy.
Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...