Crop Production Prediction using Random Forest Regressor
By Rohit Sharma
Updated on Aug 11, 2025 | 8 min read | 1.27K+ views
Accurately predicting crop production is essential for planning, food security, and agricultural growth.
In this Crop Production Prediction project, we use machine learning to estimate crop yield (quintals per hectare) from crop type, state, and cost of cultivation.
The project involves cleaning and encoding agricultural data, training a Random Forest Regressor, and evaluating how well it can predict future yields. This approach helps identify key patterns and supports data-driven decision-making in agriculture.
Before starting the Crop Production Prediction project, it’s useful to have a basic understanding of the following:
For this Crop Production Prediction project, the following tools and libraries are used:
Tool/Library | Purpose |
Python | A programming language for building the entire project |
Google Colab | Cloud-based environment to write and run Python code |
Pandas | Data manipulation and analysis of crop data |
NumPy | Efficient numerical computations |
Matplotlib / Seaborn | Data visualisation and trend plotting |
Scikit-learn | Machine learning toolkit used for training and evaluation |
OneHotEncoder | Encoding categorical agricultural features |
RandomForestRegressor | Predicting crop production using regression |
To predict crop production while analysing historical agricultural data, we will use a robust regression-based machine learning model: the Random Forest Regressor.
You can complete the Crop Production Prediction project in about 2 to 3 hours. It's a beginner-friendly, practical project ideal for learning how to handle real-world agricultural data, preprocess mixed data types, and build a regression model for yield prediction using Python.
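Let’s start building the project from scratch. We'll go step by step through loading the data, renaming columns, preprocessing and encoding the features, splitting the data, training the model, evaluating it, and visualising the results.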
Without any further delay, let’s get started!
Before we dive into this project, you'll need to grab the dataset for model training and import the necessary libraries. First, head over to Kaggle to download the dataset, and then you can bring in the libraries.
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Now that you have downloaded the dataset file, upload it to Google Colab using the code below:
from google.colab import files
uploaded = files.upload()
Once uploaded, use the following Python code to read and check the data:
# Load the data from the uploaded CSV file.
try:
    df = pd.read_csv('datafile (1).csv')
    print("Dataset loaded successfully!")
    print("First 5 rows of the dataset:")
    print(df.head())
except FileNotFoundError:
    print("Error: 'datafile (1).csv' not found. Please ensure the file is in the correct directory.")
    exit()
Output:
First 5 rows of the dataset:
Crop State Cost of Cultivation (`/Hectare) A2+FL \
0 ARHAR Uttar Pradesh 9794.05
1 ARHAR Karnataka 10593.15
2 ARHAR Gujarat 13468.82
3 ARHAR Andhra Pradesh 17051.66
4 ARHAR Maharashtra 17130.55
Cost of Cultivation (`/Hectare) C2 Cost of Production (`/Quintal) C2 \
0 23076.74 1941.55
1 16528.68 2172.46
2 19551.90 1898.30
3 24171.65 3670.54
4 25270.26 2775.80
Yield (Quintal/ Hectare)
0 9.83
1 7.47
2 9.59
3 6.42
4 8.72
In this step, we simplify long or complex column names to make them easier to reference throughout the code.
Here is the code:
# The column names are a bit complex. Let's rename them for easier access.
df.rename(columns={
    'Cost of Cultivation (`/Hectare) A2+FL': 'cost_cultivation_a2_fl',
    'Cost of Cultivation (`/Hectare) C2': 'cost_cultivation_c2',
    'Cost of Production (`/Quintal) C2': 'cost_production_c2',
    'Yield (Quintal/ Hectare) ': 'yield_quintal_per_hectare'  # Note the trailing space in the original name
}, inplace=True)
print("\nColumns renamed for easier use.")
print("New column names:", df.columns.tolist())
Output:
Columns renamed for easier use.
New column names: ['Crop', 'State', 'cost_cultivation_a2_fl', 'cost_cultivation_c2', 'cost_production_c2', 'yield_quintal_per_hectare']
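The trailing space in the yield column name is easy to miss when typing the rename mapping by hand. As a minimal sketch, assuming you would rather normalise whitespace in every header before renaming, the snippet below uses a hypothetical helper, `clean_column_name`, on a small stand-in frame (not the real dataset):

```python
import pandas as pd


def clean_column_name(name: str) -> str:
    """Strip surrounding whitespace and collapse internal runs of whitespace.

    Hypothetical helper for illustration; it is not part of the original project.
    """
    return " ".join(name.split())


# Two-row stand-in that mimics the raw header, including the trailing space.
raw = pd.DataFrame({
    "Crop": ["ARHAR", "ARHAR"],
    "Yield (Quintal/ Hectare) ": [9.83, 7.47],
})

# DataFrame.rename accepts a function and applies it to every column label.
cleaned = raw.rename(columns=clean_column_name)
print(cleaned.columns.tolist())
```

After this step the rename mapping can use the header text without worrying about invisible trailing characters.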
In this step, we prepare the features for model training. This includes selecting relevant columns and converting categorical data into a numerical format using encoding.
Here is the code:
# Select features and target
# We'll use 'Crop', 'State', and cultivation costs as input features to predict 'yield_quintal_per_hectare'.
# We exclude 'cost_production_c2' because it's derived from yield, which would cause data leakage.
features = ['Crop', 'State', 'cost_cultivation_a2_fl', 'cost_cultivation_c2']
target = 'yield_quintal_per_hectare'
X = df[features]
y = df[target]
# Identify categorical and numerical columns
categorical_features = ['Crop', 'State']
numerical_features = ['cost_cultivation_a2_fl', 'cost_cultivation_c2']
# Set up the preprocessing pipeline
# Categorical features will be one-hot encoded
# Numerical features will be passed through unchanged
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='passthrough'  # Leave numerical features as is
)
# Apply preprocessing transformations to feature data
X_processed = preprocessor.fit_transform(X)
print(f"\nData preprocessed. Shape of processed features: {X_processed.shape}")
Output:
Data preprocessed. Shape of processed features: (49, 25)
In this step, we divide the dataset into training and testing subsets to evaluate model performance fairly.
Here is the code:
# We use 80% of the data for training and 20% for testing.
# Setting 'random_state' ensures consistent results on each run.
X_train, X_test, y_train, y_test = train_test_split(
    X_processed, y, test_size=0.2, random_state=42
)
print("Data split into training and testing sets.")
print(f" Training set size: {X_train.shape[0]} samples")
print(f" Testing set size: {X_test.shape[0]} samples")
Output:
Data split into training and testing sets.
Training set size: 39 samples
Testing set size: 10 samples
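With only 10 test samples, a single split can give a score that depends heavily on which rows happen to land in the test set. One common alternative, shown here as a sketch rather than as part of the original project, is K-fold cross-validation, which averages the score over several splits. The data below is synthetic and exists only to make the example runnable:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in: 49 rows and 4 numeric features, with the target driven by the first feature.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1, size=(49, 4))
y_demo = 10 * X_demo[:, 0] + rng.normal(0, 0.5, size=49)

# Five folds: each row is used for testing exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=42),
    X_demo,
    y_demo,
    cv=cv,
    scoring="r2",
)

print("R² per fold:", np.round(scores, 2))
print(f"Mean R²: {scores.mean():.2f} (std {scores.std():.2f})")
```

The spread between folds gives a rough sense of how much a single train/test split could mislead.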
In this step, we train a machine learning model using the Random Forest algorithm to learn from the training data.
Here is the code:
# Create a RandomForestRegressor with 100 decision trees
# 'random_state=42' ensures reproducibility
model = RandomForestRegressor(n_estimators=100, random_state=42)
# Fit the model on the training data
model.fit(X_train, y_train)
print("\nRandomForestRegressor model trained successfully!")
Output:
RandomForestRegressor model trained successfully!
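In the steps above, the encoder is fitted on all 49 rows before the data is split, so it has already seen the categories in the test rows. A possible alternative, sketched below under the assumption that you split the raw features first, is to wrap the preprocessing and the model in a scikit-learn `Pipeline` so that both are fitted on the training rows only. The data frame here is synthetic, and the names `demo`, `pipeline`, and `demo_pred` are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in with the same column names as the renamed dataset.
rng = np.random.default_rng(0)
n = 40
demo = pd.DataFrame({
    "Crop": rng.choice(["ARHAR", "COTTON", "GRAM"], size=n),
    "State": rng.choice(["Gujarat", "Karnataka", "Maharashtra"], size=n),
    "cost_cultivation_a2_fl": rng.uniform(9000, 30000, size=n),
    "cost_cultivation_c2": rng.uniform(15000, 50000, size=n),
})
# Made-up target so the example has something to learn.
demo["yield_quintal_per_hectare"] = (
    demo["cost_cultivation_c2"] / 2500 + rng.normal(0, 0.5, size=n)
)

X_demo = demo.drop(columns="yield_quintal_per_hectare")
y_demo = demo["yield_quintal_per_hectare"]

# Split the raw features first, then let the pipeline handle the encoding.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

pipeline = Pipeline(steps=[
    ("preprocess", ColumnTransformer(
        transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["Crop", "State"])],
        remainder="passthrough",
    )),
    ("model", RandomForestRegressor(n_estimators=100, random_state=42)),
])

pipeline.fit(X_tr, y_tr)            # encoder and forest both see training rows only
demo_pred = pipeline.predict(X_te)  # unseen categories are encoded as all zeros
print(demo_pred.shape)
```

Because `handle_unknown="ignore"` is set, a crop or state that appears only in the test rows does not raise an error at prediction time.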
In this step, we use the trained model to predict crop yield on the test set and evaluate its performance using common regression metrics.
Here is the code:
# Predict crop yield for the test features
y_pred = model.predict(X_test)
print("Predictions made on the test set.")
# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred) # Average absolute difference
mse = mean_squared_error(y_test, y_pred) # Average squared difference
rmse = np.sqrt(mse) # Square root of MSE
r2 = r2_score(y_test, y_pred) # Proportion of variance explained by the model
# Print the results
print("\n--- Model Evaluation ---")
print(f"R-squared (R²): {r2:.2f}")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
Output:
Predictions made on the test set.
--- Model Evaluation ---
R-squared (R²): 0.95
Mean Absolute Error (MAE): 24.41
Mean Squared Error (MSE): 4201.85
Root Mean Squared Error (RMSE): 64.82
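To make the four metrics concrete, here is a small worked example that computes them by hand with NumPy. The four values are made up for illustration and are unrelated to the crop data:

```python
import numpy as np

# Toy values chosen so the arithmetic is easy to follow.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

errors = y_true - y_hat                  # [-2, 2, -3, 3]

mae = np.mean(np.abs(errors))            # (2 + 2 + 3 + 3) / 4 = 2.5
mse = np.mean(errors ** 2)               # (4 + 4 + 9 + 9) / 4 = 6.5
rmse = np.sqrt(mse)                      # square root of 6.5

ss_res = np.sum(errors ** 2)             # residual sum of squares = 26
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares = 500
r2 = 1 - ss_res / ss_tot                 # 1 - 26 / 500 = 0.948

print(f"MAE={mae:.2f}, MSE={mse:.2f}, RMSE={rmse:.2f}, R²={r2:.2f}")
# prints: MAE=2.50, MSE=6.50, RMSE=2.55, R²=0.95
```

Because the errors are squared before averaging, RMSE penalises large misses more heavily than MAE does. An RMSE well above the MAE, as in the results above, usually means a few predictions are far off.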
The final step in this Crop Production Prediction project is to visualise model performance by plotting predicted crop yield against actual crop yield for the test set.
Here is the code:
# Set visual style
plt.style.use('seaborn-v0_8-whitegrid')
# Create a scatter plot to compare actual vs predicted values
fig, ax = plt.subplots(figsize=(10, 6))
# Scatter plot: Actual vs Predicted
ax.scatter(y_test, y_pred, alpha=0.7, edgecolors='k', s=80)
# Reference line for perfect predictions
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
# Set labels and title
ax.set_xlabel('Actual Yield (Quintal/Hectare)', fontsize=12)
ax.set_ylabel('Predicted Yield (Quintal/Hectare)', fontsize=12)
ax.set_title('Actual vs. Predicted Crop Yield', fontsize=16, fontweight='bold')
ax.grid(True)
# Display the plot
plt.show()
Output: a scatter plot of actual versus predicted yield, with a dashed red diagonal marking perfect predictions.
This project built a machine learning model to predict crop yield in India using a Random Forest Regressor. On the 10-sample test set, the model reached an R² of 0.95, with an MAE of 24.41 and an RMSE of 64.82 quintals per hectare. Because the test set is so small, these figures are best read as a rough indication rather than a precise estimate.
The scatter plot comparing actual and predicted crop yields supports this result. Most data points fall close to the red diagonal line, showing strong agreement between the model’s predictions and the real values. A few outliers remain, which is consistent with the RMSE being well above the MAE, but the overall trend suggests that the model captures yield differences across crop types.
Colab Link:
https://colab.research.google.com/drive/1qD7GtAU9eDAsEDn2W7lOhMWUJloWEtr_?usp=sharing
Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...