Crop Production Prediction using Random Forest Regressor

By Rohit Sharma

Updated on Aug 11, 2025 | 8 min read | 1.27K+ views

Accurately predicting crop production is essential for planning, food security, and agricultural growth.

In this Crop Production Prediction project, we use machine learning to estimate crop yield (quintals per hectare) from factors such as crop type, state, and cost of cultivation.

The project involves cleaning and encoding agricultural data, training a Random Forest Regressor, and evaluating how well it can predict future yields. This approach helps identify key patterns and supports data-driven decision-making in agriculture.

Heads Up! What You Should Know First

Before starting the Crop Production Prediction project, it’s useful to have a basic understanding of the following:

  • Python programming (variables, functions, loops, basic syntax)
  • Pandas and NumPy (for handling and analysing data)
  • Matplotlib or Seaborn (for creating charts and visualising trends)
  • Categorical data handling (understanding encoding techniques such as one-hot encoding)
  • Scikit-learn (basic knowledge of machine learning workflows like model training and evaluation)
  • Regression basics (knowing how regression models work helps interpret the output)

Tech Stack Used

For this Crop Production Prediction project, the following tools and libraries are used:

Tool/Library | Purpose
Python | A programming language for building the entire project
Google Colab | Cloud-based environment to write and run Python code
Pandas | Data manipulation and analysis of crop data
NumPy | Efficient numerical computations
Matplotlib / Seaborn | Data visualisation and trend plotting
Scikit-learn | Machine learning toolkit used for training and evaluation
OneHotEncoder | Encoding categorical agricultural features
RandomForestRegressor | Predicting crop production using regression

Models We'll Use for Learning

To predict crop production while analysing historical agricultural data, we will use a robust regression-based machine learning model:

  • Random Forest Regressor: An ensemble learning method that builds multiple decision trees and averages their predictions to improve accuracy. It handles datasets that mix categorical and numerical features, such as crop type, state, and cultivation cost.
  • OneHotEncoder (Preprocessing): Used to convert categorical features such as crop name, state, or season into a numerical format so that the model can process them correctly.

Time Taken and Difficulty

You can complete the Crop Production Prediction project in about 2 to 3 hours. It's a beginner-friendly, practical project ideal for learning how to handle real-world agricultural data, preprocess mixed data types, and build a regression model for yield prediction using Python.

How to Build a Crop Production Prediction Model

Let’s start building the project from scratch. We'll go step-by-step through the process of:

  1. Loading the agricultural production dataset
  2. Handling missing values and preprocessing categorical features
  3. Encoding the data using OneHotEncoding
  4. Training a Random Forest Regressor
  5. Evaluating model performance

Without any further delay, let’s get started!

Step 1: Import Necessary Libraries

Before we dive into this project, you'll need to grab the dataset for model training and import the necessary libraries. First, head over to Kaggle to download the dataset, and then you can bring in the libraries.

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Step 2: Upload and Read the Dataset in Google Colab

Now that you have downloaded the dataset file, upload it to Google Colab using the code below:

from google.colab import files
uploaded = files.upload()

Once uploaded, use the following Python code to read and check the data:

# Load the data from the uploaded CSV file.
try:
    df = pd.read_csv('datafile (1).csv')
    print("Dataset loaded successfully!")
    print("First 5 rows of the dataset:")
    print(df.head())
except FileNotFoundError:
    print("Error: 'datafile (1).csv' not found. Please ensure the file is in the correct directory.")
    raise  # Re-raise so later cells do not run without df

Output:

First 5 rows of the dataset:
    Crop           State  Cost of Cultivation (`/Hectare) A2+FL  \
0  ARHAR   Uttar Pradesh                                9794.05
1  ARHAR       Karnataka                               10593.15
2  ARHAR         Gujarat                               13468.82
3  ARHAR  Andhra Pradesh                               17051.66
4  ARHAR     Maharashtra                               17130.55

   Cost of Cultivation (`/Hectare) C2  Cost of Production (`/Quintal) C2  \
0                            23076.74                            1941.55
1                            16528.68                            2172.46
2                            19551.90                            1898.30
3                            24171.65                            3670.54
4                            25270.26                            2775.80

   Yield (Quintal/ Hectare)
0                       9.83
1                       7.47
2                       9.59
3                       6.42
4                       8.72

Step 3: Data Preprocessing

In this step, we simplify long or complex column names to make them easier to reference throughout the code.

Here is the code:

# The column names are a bit complex. Let's rename them for easier access.
df.rename(columns={
    'Cost of Cultivation (`/Hectare) A2+FL': 'cost_cultivation_a2_fl',
    'Cost of Cultivation (`/Hectare) C2': 'cost_cultivation_c2',
    'Cost of Production (`/Quintal) C2': 'cost_production_c2',
    'Yield (Quintal/ Hectare) ': 'yield_quintal_per_hectare'  # Note the trailing space in the original name
}, inplace=True)
print("\nColumns renamed for easier use.")
print("New column names:", df.columns.tolist())

Output: 

Columns renamed for easier use.

New column names: ['Crop', 'State', 'cost_cultivation_a2_fl', 'cost_cultivation_c2', 'cost_production_c2', 'yield_quintal_per_hectare']
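
The project outline lists handling missing values, but the walkthrough does not show a check for them. A minimal sketch of one, using a small hypothetical frame (the numbers are made up, and dropping rows is only one of several reasonable choices), might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap, standing in for the renamed crop data
sample = pd.DataFrame({
    "Crop": ["ARHAR", "ARHAR", "COTTON"],
    "cost_cultivation_c2": [20000.0, np.nan, 15000.0],
    "yield_quintal_per_hectare": [9.0, 7.0, 10.0],
})

# Count missing values per column
missing_counts = sample.isna().sum()
print(missing_counts)

# Simplest option: drop incomplete rows (imputation is an alternative)
cleaned = sample.dropna().reset_index(drop=True)
print(f"Rows before: {len(sample)}, rows after: {len(cleaned)}")
```

Running the same `isna().sum()` call on `df` shows whether the real dataset needs any cleaning before encoding.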

Step 4: Feature Selection and Encoding

In this step, we prepare the features for model training. This includes selecting relevant columns and converting categorical data into a numerical format using encoding.

Here is the code:

# Select features and target
# We'll use 'Crop', 'State', and cultivation costs as input features to predict 'yield_quintal_per_hectare'.
# We exclude 'cost_production_c2' because it's derived from yield, which would cause data leakage.
features = ['Crop', 'State', 'cost_cultivation_a2_fl', 'cost_cultivation_c2']
target = 'yield_quintal_per_hectare'
X = df[features]
y = df[target]
# Identify categorical and numerical columns
categorical_features = ['Crop', 'State']
numerical_features = ['cost_cultivation_a2_fl', 'cost_cultivation_c2']
# Set up the preprocessing pipeline
# Categorical features will be one-hot encoded
# Numerical features will be passed through unchanged
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='passthrough'  # Leave numerical features as is
)
# Apply preprocessing transformations to feature data
X_processed = preprocessor.fit_transform(X)
print(f"\nData preprocessed. Shape of processed features: {X_processed.shape}")

Output: 

 Data preprocessed. Shape of processed features: (49, 25)
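
The code above fits the encoder on the full dataset before any split. An alternative, not used in this walkthrough, is to wrap the preprocessing and the model in a scikit-learn Pipeline so the encoder is fitted on training rows only. The sketch below assumes hypothetical stand-in data with the same column layout:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in data; all values are made up for illustration
toy = pd.DataFrame({
    "Crop": ["ARHAR", "COTTON", "GRAM", "MAIZE"] * 5,
    "State": ["Gujarat", "Karnataka"] * 10,
    "cost_cultivation_a2_fl": [float(10000 + 500 * i) for i in range(20)],
    "cost_cultivation_c2": [float(18000 + 700 * i) for i in range(20)],
    "yield_quintal_per_hectare": [float(5 + (i % 7)) for i in range(20)],
})
X_toy = toy.drop(columns="yield_quintal_per_hectare")
y_toy = toy["yield_quintal_per_hectare"]

# Split first, so the encoder only ever sees training rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

pipeline = Pipeline(steps=[
    ("prep", ColumnTransformer(
        transformers=[
            ("cat", OneHotEncoder(handle_unknown="ignore"), ["Crop", "State"])
        ],
        remainder="passthrough",  # numerical columns pass through unchanged
    )),
    ("model", RandomForestRegressor(n_estimators=100, random_state=42)),
])
pipeline.fit(X_tr, y_tr)

# The fitted pipeline accepts raw rows, including a new observation
new_row = pd.DataFrame([{
    "Crop": "ARHAR", "State": "Gujarat",
    "cost_cultivation_a2_fl": 12000.0, "cost_cultivation_c2": 21000.0,
}])
new_prediction = pipeline.predict(new_row)
print(new_prediction)
```

A side benefit of this design is that predicting for a new crop record needs a single `predict` call on raw data, with no separate encoding step.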

Step 5: Split Data into Training and Testing Sets

In this step, we divide the dataset into training and testing subsets to evaluate model performance fairly.

Here is the code:

# We use 80% of the data for training and 20% for testing.
# Setting 'random_state' ensures consistent results on each run.
# train_test_split was already imported in Step 1.
X_train, X_test, y_train, y_test = train_test_split(
    X_processed, y, test_size=0.2, random_state=42
)
print("Data split into training and testing sets.")
print(f"  Training set size: {X_train.shape[0]} samples")
print(f"  Testing set size: {X_test.shape[0]} samples")

Output:

Data split into training and testing sets.

   Training set size: 39 samples

   Testing set size: 10 samples
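
With only 10 test samples, a single split can give a noisy estimate. One common complement, shown here as a hedged sketch on synthetic data rather than the crop dataset, is k-fold cross-validation, where every row is used for testing exactly once. The 5-fold setting and the `make_regression` data are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in with 49 rows, the same size as the dataset above
X_demo, y_demo = make_regression(
    n_samples=49, n_features=5, noise=10.0, random_state=42
)

model_cv = RandomForestRegressor(n_estimators=100, random_state=42)
folds = KFold(n_splits=5, shuffle=True, random_state=42)

# One R² score per fold
scores = cross_val_score(model_cv, X_demo, y_demo, cv=folds, scoring="r2")
print("R² per fold:", np.round(scores, 2))
print(f"Mean R²: {scores.mean():.2f} (std {scores.std():.2f})")
```

The spread across folds gives a rough sense of how much the score depends on which rows land in the test set.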

Step 6: Train the Random Forest Regressor Model

In this step, we train a machine learning model using the Random Forest algorithm to learn from the training data.

Here is the code:

# Create a RandomForestRegressor with 100 decision trees
# 'random_state=42' ensures reproducibility
model = RandomForestRegressor(n_estimators=100, random_state=42)
# Fit the model on the training data
model.fit(X_train, y_train)
print("\nRandomForestRegressor model trained successfully!")

Output:

RandomForestRegressor model trained successfully!

Step 7: Make Predictions and Evaluate the Model

In this step, we use the trained model to predict crop yield on the test set and evaluate its performance using common regression metrics.

Here is the code:

# Predict crop yield for the test features
y_pred = model.predict(X_test)
print("Predictions made on the test set.")
# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)  # Average absolute difference
mse = mean_squared_error(y_test, y_pred)   # Average squared difference
rmse = np.sqrt(mse)                        # Square root of MSE
r2 = r2_score(y_test, y_pred)              # Proportion of variance explained by the model
# Print the results
print("\n--- Model Evaluation ---")
print(f"R-squared (R²): {r2:.2f}")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")

Output:

Predictions made on the test set.

--- Model Evaluation ---

R-squared (R²): 0.95

Mean Absolute Error (MAE): 24.41

Mean Squared Error (MSE): 4201.85

Root Mean Squared Error (RMSE): 64.82
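
To see what these four metrics measure, the sketch below computes them by hand with NumPy on four hypothetical values and compares the results with scikit-learn. The numbers are made up and unrelated to the output above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values
actual = np.array([10.0, 20.0, 30.0, 40.0])
predicted = np.array([12.0, 18.0, 33.0, 37.0])

errors = actual - predicted                    # [-2, 2, -3, 3]

mae_manual = np.mean(np.abs(errors))           # (2 + 2 + 3 + 3) / 4 = 2.5
mse_manual = np.mean(errors ** 2)              # (4 + 4 + 9 + 9) / 4 = 6.5
rmse_manual = np.sqrt(mse_manual)              # same units as the target

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)                   # 26
ss_tot = np.sum((actual - actual.mean()) ** 2) # 500
r2_manual = 1 - ss_res / ss_tot                # 0.948

print(f"MAE:  {mae_manual:.2f} (sklearn: {mean_absolute_error(actual, predicted):.2f})")
print(f"MSE:  {mse_manual:.2f} (sklearn: {mean_squared_error(actual, predicted):.2f})")
print(f"RMSE: {rmse_manual:.2f}")
print(f"R²:   {r2_manual:.3f} (sklearn: {r2_score(actual, predicted):.3f})")
```

Because MSE squares each error, a few large misses raise RMSE much more than MAE, which is worth keeping in mind when the two values differ widely.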

Step 8: Visualise the Results

The final step in this Crop Production Prediction project is to visualise model performance by plotting predicted crop yield against actual crop yield.

Here is the code:

# Set visual style
plt.style.use('seaborn-v0_8-whitegrid')
# Create a scatter plot to compare actual vs predicted values
fig, ax = plt.subplots(figsize=(10, 6))
# Scatter plot: Actual vs Predicted
ax.scatter(y_test, y_pred, alpha=0.7, edgecolors='k', s=80)
# Reference line for perfect predictions
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
# Set labels and title
ax.set_xlabel('Actual Yield (Quintal/Hectare)', fontsize=12)
ax.set_ylabel('Predicted Yield (Quintal/Hectare)', fontsize=12)
ax.set_title('Actual vs. Predicted Crop Yield', fontsize=16, fontweight='bold')
ax.grid(True)
# Display the plot
plt.show()

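Output:

[Figure: scatter plot of actual vs. predicted crop yield (quintal/hectare), with a red dashed diagonal marking perfect predictions]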

Conclusion

This project built a machine learning model to predict crop yield (quintals per hectare) in India using a Random Forest Regressor. On the 10-sample test set, the model reached an R² of 0.95, with an MAE of 24.41 and an RMSE of 64.82. The RMSE is well above the MAE, which indicates that a few large errors account for much of the squared error.

The scatter plot comparing actual vs. predicted crop yields supports this reading. Most data points fall close to the red diagonal line, showing strong agreement between the model's predictions and real values. There are a few outliers, and with only 10 test samples the result is best read as an encouraging indication that the model generalises across crop types and yield levels, not as a firm conclusion.

Colab Link:
https://colab.research.google.com/drive/1qD7GtAU9eDAsEDn2W7lOhMWUJloWEtr_?usp=sharing

Rohit Sharma

834 articles published

Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...
