Indian Food Analysis and Region Prediction Using Machine Learning

Updated on Aug 04, 2025 | 10 min read | 345 views

Table of Contents

Here's What You Should Know First:
Utilized Technologies and Libraries
Models We'll Be Using for Learning
Duration and Complexity of Project
How to Build an Indian Food Analysis Model
Final Conclusion

Indian cuisine is highly diverse, with dishes that vary by region and culture. In this project, you’ll explore the Indian Food dataset to find patterns in ingredients, diet types, and flavor profiles.

Using machine learning, you’ll build a classifier that predicts the region a dish belongs to from its features. From data cleaning to model evaluation, this project shows how machine learning can help surface patterns in culinary data.

Here's What You Should Know First:

Before starting the Indian Food Analysis project, you should be familiar with:

Python programming (variables, functions, conditionals, loops)
Pandas and NumPy for data cleaning and preprocessing
Seaborn and Matplotlib for plotting distributions and insights
Scikit-learn for TF-IDF, encoding, model building, and evaluation

Utilized Technologies and Libraries

For the Indian Food Analysis project, the tools and libraries used are below:

Tool / Library	Purpose
Python	Core programming language
Google Colab	Environment for running and sharing Python notebooks
Pandas	Reading, cleaning, and analyzing the dataset
NumPy	Handling numerical data efficiently
Matplotlib / Seaborn	Visualizing cuisine trends and ingredient patterns
Scikit-learn	TF-IDF vectorization, label encoding, train/test splitting, Random Forest training, and evaluation
SciPy	Combining sparse feature matrices (hstack)

Models We'll Be Using for Learning

For our Indian Food Analysis project, we’ll combine two simple analysis techniques with one classifier:

Frequency Analysis: First, we analyze which ingredients are used most often, along with preparation time and cooking style, to understand India's regional and cultural food patterns.

One-Hot Encoding and TF-IDF: We convert categorical columns (flavor profile and diet) and the ingredient text into numerical features so that patterns can be analyzed and modeled.

Random Forest Classifier: We train a Random Forest on these features to predict the region of a dish.
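
As a rough illustration of the frequency-analysis idea, the sketch below counts how many dishes mention each ingredient. The three ingredient strings are copied from the dataset preview shown in Step 1; the helper name count_ingredients is hypothetical and not part of the original notebook.

```python
from collections import Counter

# Ingredient strings for three dishes, copied from the dataset preview
sample_ingredients = [
    "Maida flour, yogurt, oil, sugar",
    "Gram flour, ghee, sugar",
    "Carrots, milk, sugar, ghee, cashews, raisins",
]

def count_ingredients(rows):
    """Count how many dishes mention each ingredient (case-insensitive)."""
    counts = Counter()
    for row in rows:
        # Split on commas, trim whitespace, and lowercase each ingredient;
        # a set ensures each ingredient is counted once per dish
        items = {item.strip().lower() for item in row.split(",") if item.strip()}
        counts.update(items)
    return counts

ingredient_counts = count_ingredients(sample_ingredients)
print(ingredient_counts.most_common(3))
```

With the full dataset, the same function could be applied to df['ingredients'] to rank ingredients overall or within each region.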

Duration and Complexity of Project

You can complete this Indian Food Analysis project in 2 to 3 hours. It is suitable for beginner to intermediate learners.

How to Build an Indian Food Analysis Model

Let’s create this project from scratch with clear, step-by-step guidance:

Load and Explore the Indian Food Dataset
Clean and Preprocess the Data
Perform Exploratory Data Analysis (EDA) and Visualization
Engineer Features with TF-IDF and One-Hot Encoding
Build and Train the Classification Model
Evaluate the Model
Predict the Region of a New Dish

By following these steps, you'll extract meaningful insights from the Indian food data and strengthen your data analysis workflow.

Without any further delay, let's begin!

Step 1: Load and Explore the Dataset

We begin by importing the necessary libraries and loading the Indian Food dataset using pandas.

# Imports used across all steps of the project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse import hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# --- Step 1: Load and Initial Data Exploration ---
print("Step 1: Loading and Exploring Data")
# Load the dataset; stop with a clear message if the file is missing
try:
    df = pd.read_csv('indian_food.csv')
except FileNotFoundError:
    raise SystemExit("Error: 'indian_food.csv' not found. Please place it in the working directory.")
print("Dataset loaded successfully!")
print("Preview of the data:")
print(df.head())

Output:

First 5 rows of the dataset:

   name            ingredients                                         diet        prep_time  cook_time  flavor_profile  course   state        region
0  Balu shahi      Maida flour, yogurt, oil, sugar                     vegetarian  45         25         sweet           dessert  West Bengal  East
1  Boondi          Gram flour, ghee, sugar                             vegetarian  80         30         sweet           dessert  Rajasthan    West
2  Gajar ka halwa  Carrots, milk, sugar, ghee, cashews, raisins        vegetarian  15         60         sweet           dessert  Punjab       North
3  Ghevar          Flour, ghee, kewra, milk, clarified butter, su...   vegetarian  15         30         sweet           dessert  Rajasthan    West
4  Gulab jamun     Milk powder, plain flour, baking powder, ghee,...   vegetarian  15         40         sweet           dessert  West Bengal  East
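
The preview does not show it, but the cleaning step that follows treats -1 as a placeholder for missing values. A quick way to see how many such placeholders each column holds is sketched below. The small DataFrame is made-up toy data standing in for indian_food.csv, and count_placeholders is a hypothetical helper.

```python
import pandas as pd

# Toy data standing in for indian_food.csv (values are illustrative only)
toy = pd.DataFrame({
    "name": ["Dish A", "Dish B", "Dish C"],
    "prep_time": [45, -1, 15],
    "flavor_profile": ["sweet", "-1", "-1"],
    "region": ["East", "West", "-1"],
})

def count_placeholders(frame, placeholders=(-1, "-1")):
    """Return the number of placeholder entries in each column."""
    return frame.isin(list(placeholders)).sum()

placeholder_counts = count_placeholders(toy)
print(placeholder_counts)
```

Running the same check on the real DataFrame before Step 2 shows which columns will lose the most values once the placeholders become NaN.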

Step 2: Data Cleaning and Preprocessing

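In this step, we clean the dataset to prepare it for analysis and modeling: we convert the -1 placeholders to NaN, drop rows that have no region label, and fill missing flavor profiles with the most common value.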

Here is the code:

# --- Step 2: Data Cleaning and Preprocessing ---
print("\nStep 2: Cleaning and Preprocessing Data")
# Replace placeholder missing values (-1 and '-1') with NaN
df = df.replace([-1, '-1'], np.nan)
# Drop rows where the target variable 'region' is missing
df = df.dropna(subset=['region'])
# Fill missing values in 'flavor_profile' with the most frequent category (mode).
# Assigning the result back avoids a chained in-place fillna,
# which newer pandas versions warn about.
mode_flavor = df['flavor_profile'].mode()[0]
df['flavor_profile'] = df['flavor_profile'].fillna(mode_flavor)
print(f"Filled missing 'flavor_profile' with '{mode_flavor}'.")

# Display cleaned dataset information
print("\nData Info after cleaning:")
df.info()
print("\nMissing values count after cleaning:")
print(df.isnull().sum())

Output:

--- Cleaning and Preprocessing Data ---

Filled missing 'flavor_profile' with 'spicy'.

Data Info after cleaning:

Index: 241 entries, 0 to 254
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   name            241 non-null    object
 1   ingredients     241 non-null    object
 2   diet            241 non-null    object
 3   prep_time       212 non-null    float64
 4   cook_time       214 non-null    float64
 5   flavor_profile  241 non-null    object
 6   course          241 non-null    object
 7   state           230 non-null    object
 8   region          241 non-null    object
dtypes: float64(2), object(7)

Missing values count after cleaning:

name               0
ingredients        0
diet               0
prep_time         29
cook_time         27
flavor_profile     0
course             0
state             11
region             0
dtype: int64

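After this step, the target (region) and the columns used later as features (ingredients, flavor_profile, and diet) have no missing values. The prep_time, cook_time, and state columns still contain gaps, but they are not used as model features in this project.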
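
If you later decide to use prep_time or cook_time as features, their missing values would need handling first. One common option is median imputation, sketched below on made-up numbers. This is an optional extension, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy timing data with gaps (values are illustrative only)
times = pd.DataFrame({
    "prep_time": [10.0, np.nan, 20.0, 120.0],
    "cook_time": [30.0, 45.0, np.nan, 60.0],
})

# Fill each column with its own median; the median is less sensitive
# to extreme values (such as 120 above) than the mean
medians = times.median()
times_filled = times.fillna(medians)

print(times_filled)
```

To avoid leaking test information, the medians would ideally be computed on the training rows only and then reused for the test rows.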

Step 3: Exploratory Data Analysis (EDA) and Visualization

In this step, we explore the dataset to understand patterns and distributions across different categories.

Here is the code:

# --- Step 3: Exploratory Data Analysis (EDA) & Visualization ---
print("\nStep 3: Performing Exploratory Data Analysis (EDA)")
sns.set_style("whitegrid")
plt.figure(figsize=(12, 6))
# Plot 1: Distribution of Dishes by Region
plt.subplot(2, 2, 1)
sns.countplot(y=df['region'], order=df['region'].value_counts().index, palette='viridis')
plt.title('Number of Dishes by Region')
plt.xlabel('Number of Dishes')
plt.ylabel('Region')
# Plot 2: Distribution of Dishes by Course
plt.subplot(2, 2, 2)
sns.countplot(x=df['course'], order=df['course'].value_counts().index, palette='plasma')
plt.title('Distribution of Dishes by Course')
plt.xlabel('Course')
plt.ylabel('Count')

# Plot 3: Diet Distribution (Vegetarian vs. Non-Vegetarian)
plt.subplot(2, 2, 3)
df['diet'].value_counts().plot.pie(autopct='%1.1f%%', colors=['#66b3ff', '#99ff99'],
                                   wedgeprops={'edgecolor': 'white'})
plt.title('Diet Distribution')
plt.ylabel('')  # Remove y-label
# Plot 4: Distribution of Flavor Profile
plt.subplot(2, 2, 4)
sns.countplot(x=df['flavor_profile'], order=df['flavor_profile'].value_counts().index, palette='magma')
plt.title('Flavor Profiles')
plt.xlabel('Flavor Profile')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

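Output:

[Figure: 2x2 grid of plots titled 'Number of Dishes by Region', 'Distribution of Dishes by Course', 'Diet Distribution', and 'Flavor Profiles']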

This visual summary helps us understand the diversity and distribution of Indian cuisine across regions and categories.
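
Plots are easier to interpret alongside the underlying counts. The sketch below uses pd.crosstab to tabulate one categorical column against another. The toy rows are invented for illustration; with the real data you would pass df['region'] and df['diet'] instead.

```python
import pandas as pd

# Invented rows standing in for the cleaned dataset
toy = pd.DataFrame({
    "region": ["North", "North", "South", "South", "West", "West"],
    "diet": ["vegetarian", "non vegetarian", "vegetarian",
             "vegetarian", "vegetarian", "non vegetarian"],
})

# Rows are regions, columns are diet types, cells are dish counts
counts = pd.crosstab(toy["region"], toy["diet"])

# Share of each diet type within a region (each row sums to 1)
shares = pd.crosstab(toy["region"], toy["diet"], normalize="index")

print(counts)
print(shares.round(2))
```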

Step 4: Feature Engineering and Preparation for Model

In this step, we prepare the dataset for machine learning by converting the ingredient text and the categorical columns into numeric features, and by encoding the region labels as integers.

Here is the code:

# --- Step 4: Feature Engineering and Preparation for Model ---
print("\nStep 4: Preparing Data for Machine Learning Model")
# Target Variable (y): Predicting the 'region'
target = df['region']
# Encode target labels into numeric values
le = LabelEncoder()
y = le.fit_transform(target)
# Display region-to-number mapping
print("Region to Number Mapping:")
for i, class_name in enumerate(le.classes_):
    print(f"{i} -> {class_name}")
# Features (X): Using 'ingredients', 'flavor_profile', and 'diet'
# 1. Vectorize the text in 'ingredients' using TF-IDF
print("\nVectorizing 'ingredients' using TF-IDF...")
tfidf = TfidfVectorizer(stop_words='english')
X_ingredients = tfidf.fit_transform(df['ingredients'])
print("Shape of TF-IDF features:", X_ingredients.shape)

# 2. One-hot encode 'flavor_profile' and 'diet'
# dtype=int yields 0/1 columns rather than booleans
print("One-hot encoding 'flavor_profile' and 'diet'...")
X_categorical = pd.get_dummies(df[['flavor_profile', 'diet']], drop_first=True, dtype=int)
print("Shape of categorical features:", X_categorical.shape)

# Combine TF-IDF and one-hot encoded features
# hstack returns COO format; convert to CSR so rows can be sliced efficiently
X = hstack([X_ingredients, X_categorical.values]).tocsr()
print("Shape of final combined feature matrix (X):", X.shape)

Output:

--- Preparing Data for Machine Learning Model ---

Region to Number Mapping:

0 -> Central
1 -> East
2 -> North
3 -> North East
4 -> South
5 -> West

Vectorizing 'ingredients' using TF-IDF...

Shape of TF-IDF features: (241, 322)

One-hot encoding 'flavor_profile' and 'diet'...

Shape of categorical features: (241, 4)

Shape of final combined feature matrix (X): (241, 326)
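
To make the TF-IDF step less of a black box, here is a small pure-Python sketch of the weighting that TfidfVectorizer applies with its default settings: raw term counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The tokenizer is a simplified stand-in (lowercase words of two or more letters) and the three documents are invented, so treat this as an illustration rather than a replacement for the library.

```python
import math
import re
from collections import Counter

# Invented mini-corpus of ingredient strings
docs = [
    "Gram flour, ghee, sugar",
    "Carrots, milk, sugar, ghee",
    "Rice, coconut, sugar",
]

def tokenize(text):
    # Simplified tokenizer: lowercase words of two or more letters
    return re.findall(r"[a-z]{2,}", text.lower())

tokenized = [tokenize(d) for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed IDF, matching scikit-learn's default (smooth_idf=True)
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def tfidf_row(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    tf = Counter(tokens)
    raw = {t: tf[t] * idf[t] for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = [tfidf_row(tokens) for tokens in tokenized]
print(weights[0])
```

A term that appears in every document, such as sugar here, gets the minimum IDF of 1, while rarer terms such as gram receive larger weights. That is why distinctive ingredients carry more signal than ubiquitous ones.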

Step 5: Building and Training the Classification Model

In this step, we build a classification model to predict the region of a dish from its TF-IDF ingredient features and its one-hot encoded categorical features.

Here is the code for this step:

# --- Step 5: Building and Training the Classification Model ---
print("\nStep 5: Building and Training the Classification Model")
# Split the dataset into training and test sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")
# Initialize the Random Forest Classifier
# class_weight='balanced' gives more weight to regions with few dishes
model = RandomForestClassifier(
    n_estimators=100, 
    random_state=42, 
    class_weight='balanced'
)
# Train the model on the training data
print("\nTraining the RandomForest model...")
model.fit(X_train, y_train)
print("Model training complete!")

Output:

--- Building and Training the Classification Model ---

Training set size: 192 samples
Test set size: 49 samples
Model training complete!
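
In the walkthrough above, the TF-IDF vocabulary is fitted on all rows before the train/test split, so the test rows influence the features. A common alternative is to wrap preprocessing and the classifier in a scikit-learn Pipeline so that everything is fitted on the training rows only. The sketch below shows that pattern on a tiny made-up dataset. It is an optional variation, not the notebook's code, and the toy rows and labels are invented.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Invented toy rows; replace with the cleaned DataFrame from Step 2
toy = pd.DataFrame({
    "ingredients": [
        "rice, coconut, curry leaves", "rice, lentils, tamarind",
        "coconut, rice flour, jaggery", "urad dal, rice, curry leaves",
        "wheat flour, ghee, paneer", "paneer, cream, tomato",
        "wheat flour, potato, ghee", "milk, sugar, ghee, cardamom",
    ],
    "flavor_profile": ["spicy", "sour", "sweet", "spicy",
                       "spicy", "spicy", "spicy", "sweet"],
    "diet": ["vegetarian"] * 8,
    "region": ["South"] * 4 + ["North"] * 4,
})

features = toy[["ingredients", "flavor_profile", "diet"]]
labels = toy["region"]

preprocess = ColumnTransformer([
    # A string column name (not a list) passes 1-D text to TfidfVectorizer
    ("ingredients", TfidfVectorizer(), "ingredients"),
    # handle_unknown="ignore" tolerates categories unseen during fitting
    ("categories", OneHotEncoder(handle_unknown="ignore"),
     ["flavor_profile", "diet"]),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=42, stratify=labels
)

# Fitting the pipeline fits TF-IDF and the encoder on training rows only
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(predictions)
```

A side benefit of this design is that new dishes can be passed to pipeline.predict as a raw DataFrame row, with no manual vectorizing or column alignment.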

Step 6: Evaluating the Model

Now that the model is trained, let's evaluate how well it performs on the unseen test data.

Here is the code:

# --- Step 6: Evaluating the Model ---
print("\nStep 6: Evaluating the Model's Performance")
# Make predictions on the test set
y_pred = model.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
# Display a classification report (precision, recall, f1-score)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=le.classes_, zero_division=0))
# Display the confusion matrix using heatmap
print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=le.classes_, yticklabels=le.classes_)
plt.title('Confusion Matrix') 
plt.xlabel('Predicted Region')
plt.ylabel('Actual Region')
plt.show()

Output:

--- Evaluating the Model's Performance ---

Model Accuracy: 59.18%

Classification Report:

              precision    recall  f1-score   support

     Central       0.00      0.00      0.00         1
        East       0.60      0.50      0.55         6
       North       0.64      0.70      0.67        10
  North East       0.00      0.00      0.00         5
       South       0.64      0.58      0.61        12
        West       0.55      0.80      0.65        15

    accuracy                           0.59        49
   macro avg       0.40      0.43      0.41        49
weighted avg       0.53      0.59      0.55        49

Confusion Matrix:

[Figure: heatmap titled 'Confusion Matrix', with predicted region on the x-axis and actual region on the y-axis]
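
The macro and weighted averages in the report can be reproduced by hand from the per-class rows, which also shows why they differ. The sketch below recomputes the recall averages from the printed values and adds a majority-class baseline for context. The baseline is an addition for comparison, and the results are approximate because the report's values are already rounded to two decimals.

```python
# Per-class recall and support, copied from the classification report
recall = {"Central": 0.00, "East": 0.50, "North": 0.70,
          "North East": 0.00, "South": 0.58, "West": 0.80}
support = {"Central": 1, "East": 6, "North": 10,
           "North East": 5, "South": 12, "West": 15}

total = sum(support.values())

# Macro average: every class counts equally, so the two classes
# with zero recall pull it down sharply
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each class is weighted by its number of test samples
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

# Baseline: always predict the most frequent test class
majority_baseline = max(support.values()) / total

print(f"macro recall:      {macro_recall:.2f}")
print(f"weighted recall:   {weighted_recall:.2f}")
print(f"majority baseline: {majority_baseline:.2f}")
```

The weighted recall matches the reported accuracy of about 0.59, roughly double what you would get by always predicting West (15 of 49 test dishes, about 0.31). The lower macro average reflects that no Central or North East dish was identified correctly.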

Step 7: Predicting the Region of a New Indian Dish

In this step, we'll use the trained model to predict the regional origin of a new dish based on its ingredients, flavor profile, and diet type.

Here is the code to do so:

# --- Step 7: Making a Prediction on a New Dish ---
print("\nStep 7: Predicting Region for a New Dish")
# Define a new dish
new_dish_ingredients = "Chicken, yogurt, onion, tomato, garam masala, ginger, garlic"
new_dish_flavor = "spicy"
new_dish_diet = "non vegetarian"
# Step 1: TF-IDF vectorization for the ingredients
new_ingredients_tfidf = tfidf.transform([new_dish_ingredients])
# Step 2: One-hot encode the categorical features
new_cat_data = {'flavor_profile': [new_dish_flavor], 'diet': [new_dish_diet]}
new_cat_df = pd.DataFrame(new_cat_data)
# Match columns used during training
encoded_cols = pd.get_dummies(new_cat_df).reindex(columns=X_categorical.columns, fill_value=0)
# Step 3: Combine the features
new_dish_features = hstack([new_ingredients_tfidf, encoded_cols.astype(int).values])
# Step 4: Make prediction
prediction_encoded = model.predict(new_dish_features)
predicted_region = le.inverse_transform(prediction_encoded)
# Show result
print(f"New Dish Ingredients: '{new_dish_ingredients}'")
print(f"Predicted Region: {predicted_region[0]}")

Output:

--- Example: Predicting Region for a New Dish ---

New Dish Ingredients: 'Chicken, yogurt, onion, tomato, garam masala, ginger, garlic'
Predicted Region: North

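This is plausible, since these ingredients and spices are common in North Indian cuisine, especially in dishes like Butter Chicken or Chicken Curry. A single example does not validate the model, though; the test-set metrics from Step 6 remain the better guide to its reliability.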
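
A single predicted label hides how confident the model is. RandomForestClassifier also exposes predict_proba, which returns one probability per class. The sketch below demonstrates the call on a tiny invented dataset; with the article's objects you would call model.predict_proba(new_dish_features) and pair the result with le.classes_.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented training examples (not from the real dataset)
train_text = [
    "rice, coconut, curry leaves", "rice, lentils, tamarind",
    "wheat flour, ghee, paneer", "paneer, cream, tomato",
]
train_region = ["South", "South", "North", "North"]

vectorizer = TfidfVectorizer()
X_toy = vectorizer.fit_transform(train_text)

toy_model = RandomForestClassifier(n_estimators=50, random_state=42)
toy_model.fit(X_toy, train_region)

# Probabilities come back in the order given by toy_model.classes_
new_dish = vectorizer.transform(["paneer, tomato, ghee"])
probabilities = toy_model.predict_proba(new_dish)[0]

for region, p in zip(toy_model.classes_, probabilities):
    print(f"{region}: {p:.2f}")
```

A prediction where the top two regions have similar probabilities deserves less trust than one with a clear winner.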

Extending the Model:

You’re not limited to Random Forest. You can try other models for comparison and for better performance:

Logistic Regression: Simple and interpretable
XGBoost or LightGBM: Powerful gradient boosting algorithms
Naive Bayes (for example, MultinomialNB): Often effective when features are mostly text-based, such as TF-IDF vectors
SVM (Support Vector Machines): Good for high-dimensional data
Neural Networks: Can be experimented with for deep learning-based text classification
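
One hedged way to compare several of these candidates is cross-validation on identical features. The sketch below scores three scikit-learn classifiers on a small invented text dataset. The data, the choice of three folds, and the model settings are all assumptions for illustration; with the real data you would pass the ingredient text and region labels instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy corpus: six dishes per region
texts = [
    "rice, coconut, curry leaves", "rice, lentils, tamarind",
    "coconut, rice flour, jaggery", "urad dal, rice, curry leaves",
    "coconut, tamarind, mustard seeds", "rice, urad dal, coconut",
    "wheat flour, ghee, paneer", "paneer, cream, tomato",
    "wheat flour, potato, ghee", "milk, sugar, ghee, cardamom",
    "paneer, spinach, cream", "wheat flour, ghee, sugar",
]
regions = ["South"] * 6 + ["North"] * 6

candidates = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC(),
}

mean_scores = {}
for name, estimator in candidates.items():
    # The vectorizer sits inside the pipeline, so it is refitted per fold
    pipeline = make_pipeline(TfidfVectorizer(), estimator)
    scores = cross_val_score(pipeline, texts, regions, cv=3)
    mean_scores[name] = scores.mean()

for name, score in mean_scores.items():
    print(f"{name}: {score:.2f}")
```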

Try different algorithms and tune hyperparameters using tools like GridSearchCV or RandomizedSearchCV to improve accuracy and generalizability.
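
For hyperparameter tuning, GridSearchCV tries every combination in a parameter grid and keeps the one with the best cross-validated score. The sketch below tunes a small grid for a TF-IDF plus Random Forest pipeline on invented data. The grid values are arbitrary starting points, not recommendations.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented toy corpus: six dishes per region
texts = [
    "rice, coconut, curry leaves", "rice, lentils, tamarind",
    "coconut, rice flour, jaggery", "urad dal, rice, curry leaves",
    "coconut, tamarind, mustard seeds", "rice, urad dal, coconut",
    "wheat flour, ghee, paneer", "paneer, cream, tomato",
    "wheat flour, potato, ghee", "milk, sugar, ghee, cardamom",
    "paneer, spinach, cream", "wheat flour, ghee, sugar",
]
regions = ["South"] * 6 + ["North"] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", RandomForestClassifier(random_state=42)),
])

# Parameter names follow the "<step name>__<parameter>" convention
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "model__n_estimators": [25, 50],
    "model__max_depth": [None, 5],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, regions)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.2f}")
```

RandomizedSearchCV takes a similar setup but samples a fixed number of combinations, which scales better when the grid is large.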

Final Conclusion

This project built a model to predict the region of Indian dishes using ingredients, flavor profile, and diet. After cleaning and analyzing the data, we used TF-IDF and one-hot encoding for feature preparation.

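A Random Forest classifier was trained and reached 59.18% accuracy on a 49-dish test set. It performed reasonably on the larger regions (North, South, and West) but did not correctly identify any Central or North East dishes, which have very few examples. For a sample chicken dish, it predicted “North,” which is plausible. Other models like SVM or XGBoost can also be explored. This project shows how machine learning can uncover patterns in Indian cuisine, even with a small dataset.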

Colab Link:
https://colab.research.google.com/drive/1ZCsBI2OON1MQjVe9NGnVAN6tseAjNn5s

Frequently Asked Questions (FAQs)

1. What was the main objective of this project?

The goal was to build a machine learning model that can accurately predict the regional origin of Indian dishes. This prediction was based on features such as the dish’s ingredients, its flavor profile, and the diet type (vegetarian or non-vegetarian).

2. Which machine learning model was used and why?

We used the Random Forest Classifier, a popular ensemble method. Random Forests work well on the mix of TF-IDF features and one-hot encoded categories used here. Averaging many trees also reduces the overfitting risk of a single decision tree, and the model gives a reasonable baseline without extensive parameter tuning.

3. How were the ingredients and other features processed?

The ingredients column was treated as text data and vectorized using TF-IDF (Term Frequency–Inverse Document Frequency). This technique helps convert textual information into numerical features by weighing the importance of each term in the dataset. For flavor_profile and diet, we applied one-hot encoding to convert categorical values into binary vectors.

4. Is the model capable of making predictions for new, unseen dishes?

Yes, the model can predict the region for a new dish if provided with its ingredients, flavor profile, and diet type. During prediction, the new inputs must be preprocessed in the same way as the training data. That is, the ingredients should be vectorized with the same fitted TF-IDF vectorizer, and the categorical values should be one-hot encoded with the same column structure.
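
The column-alignment requirement is easy to get wrong, so here is a small sketch of it. One-hot encoding a single new row produces only the columns for the values that row contains; reindex then restores the training layout. The toy values are invented, and train_columns plays the role of X_categorical.columns from the walkthrough.

```python
import pandas as pd

# Invented training categories covering every value seen during training
train = pd.DataFrame({
    "flavor_profile": ["spicy", "sweet", "sour", "bitter"],
    "diet": ["vegetarian", "non vegetarian", "vegetarian", "vegetarian"],
})
train_encoded = pd.get_dummies(train, drop_first=True, dtype=int)
train_columns = train_encoded.columns

new_row = pd.DataFrame({"flavor_profile": ["sweet"], "diet": ["vegetarian"]})

# Encoding one row alone yields just two columns
new_raw = pd.get_dummies(new_row, dtype=int)

# Align to the training layout; columns absent from the new row become 0
new_aligned = new_raw.reindex(columns=train_columns, fill_value=0)

print(list(train_columns))
print(new_aligned.iloc[0].tolist())
```

Without the reindex step, the new row would have a different number of columns than the training matrix and the model would reject it.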

5. Can the model be improved further?

Absolutely. While Random Forests are a strong baseline, other models like Support Vector Machines (SVM), XGBoost, or even deep learning approaches such as LSTM or BERT (for ingredient text) could yield better performance. Enhancing the dataset, fine-tuning hyperparameters, incorporating ingredient quantities, or using word embeddings instead of TF-IDF could also boost accuracy.

6. What are some similar machine learning projects for beginners?

For those interested in further machine learning projects involving classification or prediction, several excellent choices are available:

Rohit Sharma

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
