Crop Production Prediction using Random Forest Regressor

By Rohit Sharma

Updated on Aug 11, 2025 | 8 min read | 1.62K+ views

Table of Contents

Heads Up! What You Should Know First
Tech Stack Used
Models We'll Use for Learning
Time Taken and Difficulty
How to Build a Crop Production Prediction Model
Conclusion

Accurately predicting crop production is essential for planning, food security, and agricultural growth.

In this Crop Production Prediction project, we use machine learning to estimate crop yield (in quintals per hectare) from factors like crop type, state, and cost of cultivation.

The project involves cleaning and encoding agricultural data, training a Random Forest Regressor, and evaluating how well it can predict future yields. This approach helps identify key patterns and supports data-driven decision-making in agriculture.

Heads Up! What You Should Know First

Before starting the Crop Production Prediction project, it’s useful to have a basic understanding of the following:

Python programming (variables, functions, loops, basic syntax)
Pandas and NumPy (for handling and analysing data)
Matplotlib or Seaborn (for creating charts and visualising trends)
Categorical data handling (understanding encoding techniques such as one-hot encoding)
Scikit-learn (Basic knowledge of machine learning workflows like model training and evaluation)
Regression basics (Knowing how regression models work helps interpret the output better)

Tech Stack Used

For this Crop Production Prediction project, the following tools and libraries are used:

Tool/Library	Purpose
Python	A programming language for building the entire project
Google Colab	Cloud-based environment to write and run Python code
Pandas	Data manipulation and analysis of crop data
NumPy	Efficient numerical computations
Matplotlib / Seaborn	Data visualisation and trend plotting
Scikit-learn	Machine learning toolkit used for training and evaluation
OneHotEncoder	Encoding categorical agricultural features
RandomForestRegressor	Predicting crop production using regression

Models We'll Use for Learning

To predict crop yield from historical agricultural data, we will use one regression model together with one preprocessing step:

Random Forest Regressor: An ensemble learning method that builds multiple decision trees and averages their predictions, which usually improves accuracy and reduces overfitting compared with a single tree. It works well on datasets that mix categorical and numerical features, such as crop, state, and cultivation cost, once the categorical columns have been encoded.
OneHotEncoder (Preprocessing): Converts categorical features such as crop name and state into numerical indicator columns so that the model can process them.

Time Taken and Difficulty

You can complete the Crop Production Prediction project in about 2 to 3 hours. It's a beginner-friendly, practical project ideal for learning how to handle real-world agricultural data, preprocess mixed data types, and build a regression model for yield prediction using Python.

How to Build a Crop Production Prediction Model

Let’s start building the project from scratch. We'll go step-by-step through the process of:

Loading the agricultural production dataset
Renaming columns and selecting the features and target
Encoding categorical features using OneHotEncoding
Splitting the data into training and testing sets
Training a Random Forest Regressor
Evaluating and visualising model performance

Without any further delay, let’s get started!

Step 1: Import Necessary Libraries

Before we dive into this project, you'll need to grab the dataset for model training and import the necessary libraries. First, head over to Kaggle to download the dataset, and then you can bring in the libraries.

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Step 2: Upload and Read the Dataset in Google Colab

Now that you have downloaded the dataset, upload the CSV file to Google Colab using the code below:

from google.colab import files
uploaded = files.upload()

Once uploaded, use the following Python code to read and check the data:

# Load the data from the uploaded CSV file.
try:
    df = pd.read_csv('datafile (1).csv')
    print("Dataset loaded successfully!")
    print("First 5 rows of the dataset:")
    print(df.head())
except FileNotFoundError:
    print("Error: 'datafile (1).csv' not found. Please ensure the file is in the correct directory.")
    raise  # Stop here: every later step needs df

Output:

First 5 rows of the dataset:

    Crop           State  Cost of Cultivation (`/Hectare) A2+FL  \
0  ARHAR   Uttar Pradesh                                9794.05
1  ARHAR       Karnataka                               10593.15
2  ARHAR         Gujarat                               13468.82
3  ARHAR  Andhra Pradesh                               17051.66
4  ARHAR     Maharashtra                               17130.55

   Cost of Cultivation (`/Hectare) C2  Cost of Production (`/Quintal) C2  \
0                            23076.74                            1941.55
1                            16528.68                            2172.46
2                            19551.90                            1898.30
3                            24171.65                            3670.54
4                            25270.26                            2775.80

   Yield (Quintal/ Hectare)
0                      9.83
1                      7.47
2                      9.59
3                      6.42
4                      8.72

Step 3: Data Preprocessing

In this step, we simplify long or complex column names to make them easier to reference throughout the code.

Here is the code:

# The column names are a bit complex. Let's rename them for easier access.
df.rename(columns={
    'Cost of Cultivation (`/Hectare) A2+FL': 'cost_cultivation_a2_fl',
    'Cost of Cultivation (`/Hectare) C2': 'cost_cultivation_c2',
    'Cost of Production (`/Quintal) C2': 'cost_production_c2',
    'Yield (Quintal/ Hectare) ': 'yield_quintal_per_hectare'  # Note the trailing space in the original name
}, inplace=True)
print("\nColumns renamed for easier use.")
print("New column names:", df.columns.tolist())

Output:

Columns renamed for easier use.

New column names: ['Crop', 'State', 'cost_cultivation_a2_fl', 'cost_cultivation_c2', 'cost_production_c2', 'yield_quintal_per_hectare']
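
The trailing space in the yield column name is easy to miss, and `rename` silently skips keys that do not match any column. One way to guard against this, sketched below on a made-up one-row frame, is to strip whitespace from all headers before renaming. This is an optional variation, not the approach used in the project code above.

```python
import pandas as pd

# Illustrative frame with one header carrying a trailing space, as in the dataset.
toy = pd.DataFrame({"Crop": ["ARHAR"], "Yield (Quintal/ Hectare) ": [9.83]})

# Strip surrounding whitespace from every header, then rename without the space.
toy.columns = toy.columns.str.strip()
toy = toy.rename(columns={"Yield (Quintal/ Hectare)": "yield_quintal_per_hectare"})

print(toy.columns.tolist())
```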

Step 4: Feature Selection and Encoding

In this step, we prepare the features for model training. This includes selecting relevant columns and converting categorical data into a numerical format using encoding.

Here is the code:

# Select features and target
# We'll use 'Crop', 'State', and cultivation costs as input features to predict 'yield_quintal_per_hectare'.
# We exclude 'cost_production_c2' because it's derived from yield, which would cause data leakage.
features = ['Crop', 'State', 'cost_cultivation_a2_fl', 'cost_cultivation_c2']
target = 'yield_quintal_per_hectare'
X = df[features]
y = df[target]
# Identify categorical and numerical columns
categorical_features = ['Crop', 'State']
numerical_features = ['cost_cultivation_a2_fl', 'cost_cultivation_c2']
# Set up the preprocessing pipeline
# Categorical features will be one-hot encoded
# Numerical features will be passed through unchanged
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='passthrough'  # Leave numerical features as is
)
# Apply preprocessing transformations to feature data
X_processed = preprocessor.fit_transform(X)
print(f"\nData preprocessed. Shape of processed features: {X_processed.shape}")

Output:

Data preprocessed. Shape of processed features: (49, 25)

Step 5: Split Data into Training and Testing Sets

In this step, we divide the dataset into training and testing subsets to evaluate model performance fairly.

Here is the code:

# We use 80% of the data for training and 20% for testing.
# Setting 'random_state' ensures consistent results on each run.
X_train, X_test, y_train, y_test = train_test_split(
    X_processed, y, test_size=0.2, random_state=42
)
print("Data split into training and testing sets.")
print(f"  Training set size: {X_train.shape[0]} samples")
print(f"  Testing set size: {X_test.shape[0]} samples")

Output:

Data split into training and testing sets.

Training set size: 39 samples

Testing set size: 10 samples
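
The 39/10 split follows from how `train_test_split` sizes the test set: 20% of 49 rows is 9.8, which is rounded up to 10, leaving 39 rows for training. The sketch below reproduces this on 49 placeholder rows and also checks that a fixed `random_state` gives the same partition on a second call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 49 placeholder rows, matching the dataset size but not its contents.
X_demo = np.arange(49).reshape(-1, 1)
y_demo = np.arange(49)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
print(len(X_tr), len(X_te))

# Same random_state, same partition.
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
same_split = np.array_equal(X_te, X_te2)
print(same_split)
```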

Step 6: Train the Random Forest Regressor Model

In this step, we train a machine learning model using the Random Forest algorithm to learn from the training data.

Here is the code:

# Create a RandomForestRegressor with 100 decision trees
# 'random_state=42' ensures reproducibility
model = RandomForestRegressor(n_estimators=100, random_state=42)
# Fit the model on the training data
model.fit(X_train, y_train)
print("\nRandomForestRegressor model trained successfully!")

Output:

RandomForestRegressor model trained successfully!
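
To see what "combining multiple decision trees" means in practice, the following sketch averages the predictions of the individual trees by hand and compares the result with the forest's own output. The data is synthetic and the names `X_demo`, `y_demo`, and `forest` are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, generated only for this demonstration.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(60, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 1, size=60)

forest = RandomForestRegressor(n_estimators=25, random_state=42).fit(X_demo, y_demo)

# Collect each tree's predictions, then average across trees.
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

# The forest's prediction is the mean of its trees' predictions.
matches = np.allclose(manual_mean, forest.predict(X_demo))
print(matches)
```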

Step 7: Make Predictions and Evaluate the Model

In this step, we use the trained model to predict crop yield on the test set and evaluate its performance using common regression metrics.

Here is the code:

# Predict crop yield for the test features
y_pred = model.predict(X_test)
print("Predictions made on the test set.")
# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)  # Average absolute difference
mse = mean_squared_error(y_test, y_pred)   # Average squared difference
rmse = np.sqrt(mse)                        # Square root of MSE
r2 = r2_score(y_test, y_pred)              # Proportion of variance explained by the model
# Print the results
print("\n--- Model Evaluation ---")
print(f"R-squared (R²): {r2:.2f}")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")

Output:

Predictions made on the test set.

--- Model Evaluation ---

R-squared (R²): 0.95

Mean Absolute Error (MAE): 24.41

Mean Squared Error (MSE): 4201.85

Root Mean Squared Error (RMSE): 64.82
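
For readers who want to see what each metric measures, here is a small worked sketch that computes them directly with NumPy. The four actual and predicted values are invented for the example. The last line checks that the RMSE reported above is the square root of the reported MSE.

```python
import numpy as np

# Invented values, chosen so the arithmetic is easy to follow.
actual = np.array([10.0, 8.0, 12.0, 6.0])
predicted = np.array([9.0, 8.5, 11.0, 7.5])

errors = actual - predicted              # [1.0, -0.5, 1.0, -1.5]
mae = np.mean(np.abs(errors))            # average absolute error
mse = np.mean(errors ** 2)               # average squared error
rmse = np.sqrt(mse)                      # same units as the target

ss_res = np.sum(errors ** 2)                     # unexplained variation
ss_tot = np.sum((actual - actual.mean()) ** 2)   # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")

# Consistency check on the article's reported figures.
reported_rmse = round(float(np.sqrt(4201.85)), 2)
print(reported_rmse)
```

Because squaring weights large errors more heavily, an RMSE well above the MAE suggests that a few predictions are far off while most are close.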

Step 8: Visualise the Results

The final step in this Crop Production Prediction project is to visualise model performance by plotting predicted crop yield against actual yield for the test set.

Here is the code:

# Set visual style
plt.style.use('seaborn-v0_8-whitegrid')
# Create a scatter plot to compare actual vs predicted values
fig, ax = plt.subplots(figsize=(10, 6))
# Scatter plot: Actual vs Predicted
ax.scatter(y_test, y_pred, alpha=0.7, edgecolors='k', s=80)
# Reference line for perfect predictions
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
# Set labels and title
ax.set_xlabel('Actual Yield (Quintal/Hectare)', fontsize=12)
ax.set_ylabel('Predicted Yield (Quintal/Hectare)', fontsize=12)
ax.set_title('Actual vs. Predicted Crop Yield', fontsize=16, fontweight='bold')
ax.grid(True)
# Display the plot
plt.show()

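Output: a scatter plot of actual vs. predicted yield for the test set, with a red dashed diagonal marking perfect predictions.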

Conclusion

This project built a machine learning model to predict crop yield in India using a Random Forest Regressor. On the 10-sample test set, the model reached an R² of 0.95, with an MAE of 24.41 and an RMSE of 64.82 quintals per hectare. The RMSE is well above the MAE, which indicates that a few large errors account for much of the total error.

The scatter plot comparing actual vs. predicted crop yields supports this reading. Most data points fall close to the red diagonal line, showing strong agreement between the model’s predictions and real values, with a few outliers. Because the dataset has only 49 rows, these results are best treated as an encouraging first indication rather than proof that the model generalises across all crop types and yield levels.
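
To use a trained model on a new case, the new row has to pass through the already fitted preprocessor with `transform`, not `fit_transform`, so that it gets the same columns the model was trained on. The sketch below shows the idea end to end on six made-up training rows. All values are placeholders and the result is not a real yield estimate.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder

# Made-up training rows standing in for the real dataset.
train = pd.DataFrame({
    "Crop": ["ARHAR", "ARHAR", "COTTON", "COTTON", "GRAM", "GRAM"],
    "State": ["Gujarat", "Karnataka", "Gujarat", "Karnataka", "Gujarat", "Karnataka"],
    "cost_cultivation_a2_fl": [10000.0, 11000.0, 20000.0, 21000.0, 9000.0, 9500.0],
    "cost_cultivation_c2": [18000.0, 19000.0, 30000.0, 31000.0, 15000.0, 16000.0],
    "yield_quintal_per_hectare": [9.0, 8.0, 12.0, 11.0, 7.0, 6.5],
})
features = ["Crop", "State", "cost_cultivation_a2_fl", "cost_cultivation_c2"]

preprocessor = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["Crop", "State"])],
    remainder="passthrough",
)
X_train = preprocessor.fit_transform(train[features])

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_train, train["yield_quintal_per_hectare"])

# A new case needs the same feature columns as the training data.
new_case = pd.DataFrame([{
    "Crop": "COTTON",
    "State": "Gujarat",
    "cost_cultivation_a2_fl": 20500.0,
    "cost_cultivation_c2": 30500.0,
}])
prediction = model.predict(preprocessor.transform(new_case[features]))
print(round(float(prediction[0]), 2))
```

A random forest averages training targets, so its prediction always lies within the range of yields it saw during training. It cannot extrapolate beyond that range.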

Colab Link:
https://colab.research.google.com/drive/1qD7GtAU9eDAsEDn2W7lOhMWUJloWEtr_?usp=sharing

Frequently Asked Questions (FAQs)

1. What is the goal of this project?

The main goal is to predict crop yield (in Quintal/Hectare) using machine learning based on features like crop type, state, and cost of cultivation. This helps in improving planning and productivity in the agricultural sector.

2. Which machine learning algorithm was used for prediction?

We used the Random Forest Regressor, an ensemble learning method that combines multiple decision trees to improve accuracy and reduce overfitting.

3. How was the data prepared before modelling?

The data was preprocessed using:

OneHotEncoding for categorical features like crop and state.
Train-test split to separate data for model training and evaluation.

4. How was the model’s performance evaluated?

We used the following regression metrics:

Mean Absolute Error (MAE)
Mean Squared Error (MSE)
Root Mean Squared Error (RMSE)
R² Score
A scatter plot of actual vs. predicted values was also used for visual validation.

5. What are the practical applications of this model?

The model can be used by:

Farmers, to estimate the expected yield based on their inputs.
Policymakers, to make region-specific decisions on food production.
Researchers, to study agricultural trends and improve food security planning.
