Indian Food Analysis and Region Prediction Using Machine Learning

By Rohit Sharma

Updated on Aug 04, 2025 | 10 min read


Indian food is remarkably diverse, with dishes that vary by region and culture. In this project, you'll explore the Indian Food dataset to uncover patterns in ingredients, diet types, and flavor profiles.

Using machine learning, you’ll build a classifier to predict the region a dish belongs to based on its features. From data cleaning to model evaluation, this project shows how Indian Food Analysis using Machine Learning can help uncover hidden insights from culinary data.

Discover our comprehensive array of data science projects in Python to inspire your next significant idea.

Boost your data science skills with upGrad's Online Data Science Courses. Learn Python, Machine Learning, Artificial Intelligence, Tableau, SQL, and more from expert faculty. Enroll now.

Here's What You Should Know First:

Before starting the Indian Food Analysis project, you should be familiar with:

  • Python basics (variables, loops, and functions)
  • Pandas for loading and manipulating tabular data
  • Basic machine learning concepts such as classification and train/test splits

Also Read - Step-by-Step Guide to Learning Python for Data Science

Enhance your data science career through upGrad's esteemed courses, benefiting from the mentorship and guidance of industry leaders.

Technologies and Libraries Used

For the Indian Food Analysis project, the tools and libraries used are listed below:

  • Python: Core programming language
  • Google Colab: Environment for running and sharing Python notebooks
  • Pandas: Reading, cleaning, and analyzing the dataset
  • NumPy: Handling numerical data efficiently
  • SciPy: Combining sparse and dense feature matrices (hstack)
  • Matplotlib / Seaborn: Visualizing cuisine trends and ingredient patterns
  • Scikit-learn: Feature encoding, model training, and evaluation
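If you work in Google Colab, all of these libraries come preinstalled. If you run the notebook locally instead, you may need to install them first; a typical notebook setup cell looks like this:

# Run once in a notebook cell if any library is missing.
# Google Colab already ships with all of these preinstalled.
!pip install pandas numpy matplotlib seaborn scikit-learn scipy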

Also Read - 60 Most Asked Pandas Interview Questions and Answers [ANSWERED + CODE]

Models and Techniques We'll Be Using

For our Indian Food Analysis project, we'll combine three simple yet insightful techniques:

TF-IDF Vectorization: We will convert the free-text ingredients column into numerical features, weighting each ingredient term by how distinctive it is across dishes.

One-Hot Encoding: We will convert categorical features, such as flavor_profile and diet, into binary columns (see the sketch just below this list).

Random Forest Classification: We will train an ensemble of decision trees on the combined features to predict the region a dish belongs to.
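To make the encoding idea concrete, here is a minimal sketch (using a made-up two-row frame, not the project dataset) of how pd.get_dummies turns categories into indicator columns:

import pandas as pd

# Toy example: two categorical columns become binary indicator columns.
toy = pd.DataFrame({'flavor_profile': ['sweet', 'spicy'],
                    'diet': ['vegetarian', 'non vegetarian']})
print(pd.get_dummies(toy))
# Columns like 'flavor_profile_sweet' and 'diet_vegetarian' now hold
# True/False (0/1) indicators, one column per category value.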

Also Read - Label Encoder vs One Hot Encoder in Machine Learning

Duration and Complexity of the Project

You can complete this Indian Food Analysis project in 2 to 3 hours. It's suitable for beginner to intermediate learners.

How to Build an Indian Food Analysis Model

Let’s create this project from scratch with clear, step-by-step guidance:

  • Load and Explore the Indian Food Dataset
  • Clean and Preprocess the Data
  • Perform Exploratory Data Analysis (EDA) and Visualization
  • Engineer Features with TF-IDF and One-Hot Encoding
  • Build and Train a Random Forest Classifier
  • Evaluate the Model
  • Predict the Region of a New Dish

By following these steps, you'll extract meaningful insights from Indian food data and strengthen your end-to-end machine learning workflow.

Without any further delay, let's begin!

Also Read - Detailed Guide on Datasets in Machine Learning: Steps to Build Machine Learning Datasets

Step 1: Load and Explore the Dataset

We begin by importing the necessary libraries and loading the Indian Food dataset using pandas.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from scipy.sparse import hstack
# --- Step 1: Load and Initial Data Exploration ---
print("Step 1: Loading and Exploring Data")

# Load the dataset
try:
    df = pd.read_csv('indian_food.csv')
    print("Dataset loaded successfully!")
    print("Preview of the data:")
    print(df.head())
except FileNotFoundError:
    print("Error: 'indian_food.csv' not found. Please place it in the working directory.")
    raise

Output:

Step 1: Loading and Exploring Data
Dataset loaded successfully!
Preview of the data:
             name                                        ingredients  \
0      Balu shahi                    Maida flour, yogurt, oil, sugar   
1          Boondi                            Gram flour, ghee, sugar   
2  Gajar ka halwa       Carrots, milk, sugar, ghee, cashews, raisins   
3          Ghevar  Flour, ghee, kewra, milk, clarified butter, su...   
4     Gulab jamun  Milk powder, plain flour, baking powder, ghee,...   

         diet  prep_time  cook_time flavor_profile   course        state  \
0  vegetarian         45         25          sweet  dessert  West Bengal   
1  vegetarian         80         30          sweet  dessert    Rajasthan   
2  vegetarian         15         60          sweet  dessert       Punjab   
3  vegetarian         15         30          sweet  dessert    Rajasthan   
4  vegetarian         15         40          sweet  dessert  West Bengal   

  region  
0   East  
1   West  
2  North  
3   West  
4   East  


Step 2: Data Cleaning and Preprocessing

In this step, we clean the dataset to prepare it for analysis and modeling.

Here is the code:

# --- Step 2: Data Cleaning and Preprocessing ---
print("\nStep 2: Cleaning and Preprocessing Data")
# Replace placeholder missing values (-1 and '-1') with NaN
df.replace(-1, np.nan, inplace=True)
df.replace('-1', np.nan, inplace=True)
# Drop rows where the target variable 'region' is missing
df.dropna(subset=['region'], inplace=True)
# Fill missing values in 'flavor_profile' with the most frequent category (mode)
mode_flavor = df['flavor_profile'].mode()[0]
# Assign back instead of chained inplace fillna, which is deprecated in pandas 2.x
df['flavor_profile'] = df['flavor_profile'].fillna(mode_flavor)
print(f"Filled missing 'flavor_profile' with '{mode_flavor}'.")

# Display cleaned dataset information
print("\nData Info after cleaning:")
df.info()
print("\nMissing values count after cleaning:")
print(df.isnull().sum())

Output:

Step 2: Cleaning and Preprocessing Data
Filled missing 'flavor_profile' with 'spicy'.

Data Info after cleaning:
<class 'pandas.core.frame.DataFrame'>
Index: 241 entries, 0 to 254
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   name            241 non-null    object 
 1   ingredients     241 non-null    object 
 2   diet            241 non-null    object 
 3   prep_time       212 non-null    float64
 4   cook_time       214 non-null    float64
 5   flavor_profile  241 non-null    object 
 6   course          241 non-null    object 
 7   state           230 non-null    object 
 8   region          241 non-null    object 
dtypes: float64(2), object(7)

Missing values count after cleaning:
name               0
ingredients        0
diet               0
prep_time         29
cook_time         27
flavor_profile     0
course             0
state             11
region             0
dtype: int64

The target column region now has no missing values. Note that prep_time and cook_time still contain gaps; the model below doesn't use them as features, but if you want to extend the analysis to include them, a simple median imputation works (see the sketch below).
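Here is that optional imputation sketch. It isn't part of the pipeline that follows, so treat it as an extension:

# Optional extension: fill the remaining numeric gaps with column medians
# if you later want to use prep_time and cook_time as model features.
for col in ['prep_time', 'cook_time']:
    df[col] = df[col].fillna(df[col].median())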

Also Read - Data Cleaning Techniques: 15 Simple & Effective Ways To Clean Data

Step 3: Exploratory Data Analysis (EDA) and Visualization

In this step, we explore the dataset to understand patterns and distributions across different categories.

Here is the code:

# --- Step 3: Exploratory Data Analysis (EDA) & Visualization ---
print("\nStep 3: Performing Exploratory Data Analysis (EDA)")
sns.set_style("whitegrid")
plt.figure(figsize=(12, 6))
# Plot 1: Distribution of Dishes by Region
plt.subplot(2, 2, 1)
sns.countplot(y=df['region'], order=df['region'].value_counts().index, palette='viridis')
plt.title('Number of Dishes by Region')
plt.xlabel('Number of Dishes')
plt.ylabel('Region')
# Plot 2: Distribution of Dishes by Course
plt.subplot(2, 2, 2)
sns.countplot(x=df['course'], order=df['course'].value_counts().index, palette='plasma')
plt.title('Distribution of Dishes by Course')
plt.xlabel('Course')
plt.ylabel('Count')

# Plot 3: Diet Distribution (Vegetarian vs. Non-Vegetarian)
plt.subplot(2, 2, 3)
df['diet'].value_counts().plot.pie(autopct='%1.1f%%', colors=['#66b3ff', '#99ff99'],
                                   wedgeprops={'edgecolor': 'white'})
plt.title('Diet Distribution')
plt.ylabel('')  # Remove y-label
# Plot 4: Distribution of Flavor Profile
plt.subplot(2, 2, 4)
sns.countplot(x=df['flavor_profile'], order=df['flavor_profile'].value_counts().index, palette='magma')
plt.title('Flavor Profiles')
plt.xlabel('Flavor Profile')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

Output:

[A 2x2 grid of plots: number of dishes by region, distribution of dishes by course, the vegetarian vs. non-vegetarian diet pie chart, and flavor profile counts.]

This visual summary helps us understand the diversity and distribution of Indian cuisine across regions and categories. For a quick numeric complement to the plots, see the aggregation sketch below.
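Here is that aggregation sketch. It assumes the cleaned df from Step 2 and simply summarizes the existing columns:

# Average preparation and cooking time per region (NaNs are skipped)
print(df.groupby('region')[['prep_time', 'cook_time']].mean().round(1))

# Most common flavor profile in each region
print(df.groupby('region')['flavor_profile'].agg(lambda s: s.mode()[0]))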

Also Read - What They Don't Tell You About Exploratory Data Analysis in Python!

Step 4: Feature Engineering and Preparation for Model

In this step, we prepare the dataset for machine learning by converting both textual and categorical features into a numerical format.

Here is the code:

# --- Step 4: Feature Engineering and Preparation for Model ---
print("\nStep 4: Preparing Data for Machine Learning Model")
# Target Variable (y): Predicting the 'region'
target = df['region']
# Encode target labels into numeric values
le = LabelEncoder()
y = le.fit_transform(target)
# Display region-to-number mapping
print("Region to Number Mapping:")
for i, class_name in enumerate(le.classes_):
    print(f"{i} -> {class_name}")
# Features (X): Using 'ingredients', 'flavor_profile', and 'diet'
# 1. Vectorize the text in 'ingredients' using TF-IDF
print("\nVectorizing 'ingredients' using TF-IDF...")
tfidf = TfidfVectorizer(stop_words='english')
X_ingredients = tfidf.fit_transform(df['ingredients'])
print("Shape of TF-IDF features:", X_ingredients.shape)

# 2. One-hot encode 'flavor_profile' and 'diet'
print("One-hot encoding 'flavor_profile' and 'diet'...")
X_categorical = pd.get_dummies(df[['flavor_profile', 'diet']], drop_first=True)
print("Shape of categorical features:", X_categorical.shape)

# Combine TF-IDF and one-hot encoded features
X = hstack([X_ingredients, X_categorical.values])
print("Shape of final combined feature matrix (X):", X.shape)

Output:

Step 4: Preparing Data for Machine Learning Model

Region to Number Mapping:

0 -> Central
1 -> East
2 -> North
3 -> North East
4 -> South
5 -> West

Vectorizing 'ingredients' using TF-IDF...

Shape of TF-IDF features: (241, 322)

One-hot encoding 'flavor_profile' and 'diet'...

Shape of categorical features: (241, 4)

Shape of final combined feature matrix (X): (241, 326)
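To sanity-check the vectorization, you can inspect the vocabulary the vectorizer learned. A small sketch (get_feature_names_out requires scikit-learn 1.0 or newer):

import numpy as np

# Peek at the ingredient terms the vectorizer learned
terms = tfidf.get_feature_names_out()
print(f"Vocabulary size: {len(terms)}")
print("Sample terms:", terms[:10])

# Terms with the highest average TF-IDF weight across all dishes
mean_weights = np.asarray(X_ingredients.mean(axis=0)).ravel()
top = mean_weights.argsort()[::-1][:10]
print("Top-weighted terms:", [terms[i] for i in top])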

Also Read - Top 6 Techniques Used in Feature Engineering [Machine Learning]

Step 5: Building and Training the Classification Model

In this step, we build a classification model to predict the region of a dish using its textual and numerical features.

Here is the code for this step:

# --- Step 5: Building and Training the Classification Model ---
print("\nStep 5: Building and Training the Classification Model")
# Split the dataset into training and test sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")
# Initialize the Random Forest Classifier
# A robust baseline that also exposes feature importances.
model = RandomForestClassifier(
    n_estimators=100, 
    random_state=42, 
    class_weight='balanced'
)
# Train the model on the training data
print("\nTraining the RandomForest model...")
model.fit(X_train, y_train)
print("Model training complete!")

Output:

Step 5: Building and Training the Classification Model
Training set size: 192 samples
Test set size: 49 samples

Training the RandomForest model...
Model training complete!
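With only 241 dishes, a single 80/20 split can be noisy. As a complement, you can estimate accuracy with stratified cross-validation; a minimal sketch (3 folds rather than 5, because the rarest region has only a few dishes):

from sklearn.model_selection import cross_val_score

# For classifiers, an integer cv uses stratified folds by default
scores = cross_val_score(model, X, y, cv=3, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")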

Step 6: Evaluating the Model

Now that the model is trained, let's evaluate how well it performs on the unseen test data.

Here is the code:

# --- Step 6: Evaluating the Model ---
print("\nStep 6: Evaluating the Model's Performance")
# Make predictions on the test set
y_pred = model.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
# Display a classification report (precision, recall, f1-score)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=le.classes_, zero_division=0))
# Display the confusion matrix using heatmap
print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=le.classes_, yticklabels=le.classes_)
plt.title('Confusion Matrix') 
plt.xlabel('Predicted Region')
plt.ylabel('Actual Region')
plt.show()

Output:

Step 6: Evaluating the Model's Performance
Model Accuracy: 59.18%

Classification Report:

              precision    recall  f1-score   support

     Central       0.00      0.00      0.00         1
        East       0.60      0.50      0.55         6
       North       0.64      0.70      0.67        10
  North East       0.00      0.00      0.00         5
       South       0.64      0.58      0.61        12
        West       0.55      0.80      0.65        15

    accuracy                           0.59        49
   macro avg       0.40      0.43      0.41        49
weighted avg       0.53      0.59      0.55        49

Confusion Matrix:

[Heatmap of the confusion matrix: actual vs. predicted regions. The misclassifications concentrate in the under-represented Central and North East classes.]
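Since Random Forests expose feature importances, you can also check which features drive the predictions. A sketch that maps the importances back to the TF-IDF terms and one-hot columns built in Step 4:

import numpy as np

# Combined feature names: the TF-IDF terms followed by the one-hot columns
feature_names = list(tfidf.get_feature_names_out()) + list(X_categorical.columns)
importances = model.feature_importances_

# Print the ten most influential features
for i in np.argsort(importances)[::-1][:10]:
    print(f"{feature_names[i]:<25} {importances[i]:.4f}")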

Also Read - Top 5 Machine Learning Models Explained For Beginners

Step 7: Predicting the Region of a New Indian Dish

In this step, we'll use the trained model to predict the regional origin of a new dish based on its ingredients, flavor profile, and diet type.

Here is the code to do so:

# --- Step 7: Making a Prediction on a New Dish ---
print("\nStep 7: Predicting Region for a New Dish")
# Define a new dish
new_dish_ingredients = "Chicken, yogurt, onion, tomato, garam masala, ginger, garlic"
new_dish_flavor = "spicy"
new_dish_diet = "non vegetarian"
# Step 1: TF-IDF vectorization for the ingredients
new_ingredients_tfidf = tfidf.transform([new_dish_ingredients])
# Step 2: One-hot encode the categorical features
new_cat_data = {'flavor_profile': [new_dish_flavor], 'diet': [new_dish_diet]}
new_cat_df = pd.DataFrame(new_cat_data)
# Match columns used during training
encoded_cols = pd.get_dummies(new_cat_df).reindex(columns=X_categorical.columns, fill_value=0)
# Step 3: Combine the features
new_dish_features = hstack([new_ingredients_tfidf, encoded_cols.astype(int).values])
# Step 4: Make prediction
prediction_encoded = model.predict(new_dish_features)
predicted_region = le.inverse_transform(prediction_encoded)
# Show result
print(f"New Dish Ingredients: '{new_dish_ingredients}'")
print(f"Predicted Region: {predicted_region[0]}")

Output:

Step 7: Predicting Region for a New Dish

New Dish Ingredients: 'Chicken, yogurt, onion, tomato, garam masala, ginger, garlic'
Predicted Region: North

This makes sense, as these ingredients and spices are common in North Indian cuisine, especially in dishes like Butter Chicken or Chicken Curry.
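For repeated predictions, it's convenient to wrap Step 7 in a small helper. Here is a sketch (predict_region is a hypothetical name; it reuses the fitted tfidf, X_categorical, model, and le from the earlier steps):

def predict_region(ingredients, flavor, diet):
    """Predict a dish's region from its raw attributes (hypothetical helper)."""
    ing_vec = tfidf.transform([ingredients])  # same fitted TF-IDF as training
    cat_df = pd.DataFrame({'flavor_profile': [flavor], 'diet': [diet]})
    # Align one-hot columns with the training-time structure
    cat_vec = pd.get_dummies(cat_df).reindex(columns=X_categorical.columns, fill_value=0)
    features = hstack([ing_vec, cat_vec.astype(int).values])
    return le.inverse_transform(model.predict(features))[0]

# Example usage
print(predict_region("Rice, coconut, curry leaves, mustard seeds", "spicy", "vegetarian"))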

Extending the Model:

You're not limited to Random Forest. Models such as Support Vector Machines (SVM) or XGBoost are worth trying for comparison, and you can tune hyperparameters with tools like GridSearchCV or RandomizedSearchCV to improve accuracy and generalization.
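Here is a minimal tuning sketch with GridSearchCV. The grid below is a small, illustrative choice rather than a prescription; expand it as your compute budget allows:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# A small, illustrative hyperparameter grid
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 2, 4],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42, class_weight='balanced'),
    param_grid, cv=3, scoring='accuracy', n_jobs=-1
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.2%}")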

Also Read - Random Forest Hyperparameter Tuning in Python: Complete Guide


Final Conclusion

This project built a model to predict the region of Indian dishes using ingredients, flavor profile, and diet. After cleaning and analyzing the data, we used TF-IDF and one-hot encoding for feature preparation.

A Random Forest classifier was trained and reached about 59% test accuracy, a reasonable baseline for a six-class problem, and it correctly predicted the region of a new dish as “North.” Other models like SVM or XGBoost can also be explored. This project shows how machine learning can uncover patterns in Indian cuisine.

Unlock the power of data with our popular Data Science courses, designed to make you proficient in analytics, machine learning, and big data!

Elevate your career by learning essential Data Science skills such as statistical modeling, big data processing, predictive analytics, and SQL!

Stay informed and inspired with our popular Data Science articles, offering expert insights, trends, and practical tips for aspiring data professionals!

Colab Link:
https://colab.research.google.com/drive/1ZCsBI2OON1MQjVe9NGnVAN6tseAjNn5s

Frequently Asked Questions (FAQs)

1. What was the main objective of this project?

The goal was to build a machine learning model that can accurately predict the regional origin of Indian dishes. This prediction was based on features such as the dish’s ingredients, its flavor profile, and the diet type (vegetarian or non-vegetarian).

2. Which machine learning model was used and why?

We used the Random Forest Classifier, a popular ensemble method. Random Forests are well-suited for this task because they handle both numerical and categorical data efficiently. They also reduce the risk of overfitting and provide good accuracy without extensive parameter tuning.

3. How were the ingredients and other features processed?

The ingredients column was treated as text data and vectorized using TF-IDF (Term Frequency–Inverse Document Frequency). This technique helps convert textual information into numerical features by weighing the importance of each term in the dataset. For flavor_profile and diet, we applied one-hot encoding to convert categorical values into binary vectors.
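A minimal sketch of the TF-IDF idea, using two made-up ingredient strings:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["gram flour, ghee, sugar", "rice flour, coconut, jaggery"]
vec = TfidfVectorizer()
matrix = vec.fit_transform(docs)
print(vec.get_feature_names_out())   # alphabetical vocabulary
print(matrix.toarray().round(2))     # one row per dish; shared terms like
                                     # 'flour' receive lower IDF weight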

4. Is the model capable of making predictions for new, unseen dishes?

Yes, the model can predict the region for a new dish if provided with its ingredients, flavor profile, and diet type. During prediction, the new inputs must be preprocessed in the same way as the training data, i.e., ingredients must be vectorized with the same fitted TF-IDF model, and categorical values must be one-hot encoded with the same column structure.

5. Can the model be improved further?

Absolutely. While Random Forests are a strong baseline, other models like Support Vector Machines (SVM), XGBoost, or even deep learning approaches such as LSTM or BERT (for ingredient text) could yield better performance. Enhancing the dataset, fine-tuning hyperparameters, incorporating ingredient quantities, or using word embeddings instead of TF-IDF could also boost accuracy.

6. What are some similar machine learning projects for beginners?

If you're interested in more beginner-friendly classification or prediction projects, explore upGrad's collection of data science projects in Python linked at the start of this article.

Rohit Sharma

834 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
