Precision, Recall, and F1 Score Explained: From Basics to Advanced
By Rahul Singh
Updated on Jun 08, 2026 | 10 min read | 3.5K+ views
Precision, Recall, and F1 Score are key evaluation metrics used to measure the performance of machine learning classification models. Unlike accuracy, which only shows overall correctness, these metrics provide deeper insights into how well a model identifies positive cases and handles prediction errors.
They are particularly valuable when working with imbalanced datasets, where one class significantly outnumbers another. In such situations, a model can achieve high accuracy while still performing poorly, making precision, recall, and F1 score more reliable indicators of real-world performance.
In this blog, you will learn what precision, recall, and F1 score mean, how to calculate each one, when to use which metric, and how they all work together.
Before jumping into formulas, let us look at the problem these metrics solve.
Imagine a spam email classifier. You test it on 1,000 emails. 950 are normal and 50 are spam. If your model predicts every single email as "not spam," it is 95% accurate. But it catches zero spam emails. That is a useless model.
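The spam scenario above can be reproduced in a few lines of plain Python. This is a minimal sketch; the label encoding (1 = spam, 0 = not spam) and the variable names are assumptions made for illustration.

```python
# 1,000 emails: 950 normal (0) and 50 spam (1), as in the scenario above.
y_true = [0] * 950 + [1] * 50

# A "model" that predicts "not spam" for every single email.
y_pred = [0] * 1000

# Accuracy: share of predictions that match the actual label.
correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
accuracy = correct / len(y_true)

# How many real spam emails were actually flagged as spam?
spam_caught = sum(1 for actual, predicted in zip(y_true, y_pred)
                  if actual == 1 and predicted == 1)

print(f"Accuracy: {accuracy:.0%}")
print(f"Spam caught: {spam_caught} of {sum(y_true)}")
```

The accuracy comes out at 95%, yet the count of spam emails caught is 0 of 50.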
Precision, recall, and F1 score exist to expose exactly this kind of failure.
To understand these metrics, you first need to know four terms:
| Term | What It Means | Example |
| --- | --- | --- |
| True Positive (TP) | Correctly predicted positive | Spam email correctly flagged as spam |
| True Negative (TN) | Correctly predicted negative | Normal email correctly marked as not spam |
| False Positive (FP) | Predicted positive but actually negative | Normal email wrongly marked as spam |
| False Negative (FN) | Predicted negative but actually positive | Spam email wrongly marked as safe |
These four values form the foundation of the precision, recall, and F1 score calculations.
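As a rough sketch, the four counts can be tallied directly from a list of actual labels and a list of predicted labels. The helper name `confusion_counts` and the sample labels are hypothetical, chosen only to illustrate the definitions in the table.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for binary labels."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # spam correctly flagged as spam
        elif predicted == positive:
            fp += 1  # normal email wrongly marked as spam
        elif actual == positive:
            fn += 1  # spam wrongly marked as safe
        else:
            tn += 1  # normal email correctly marked as not spam
    return tp, tn, fp, fn


# Hypothetical labels: 1 = spam, 0 = not spam.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

For these sample labels the tally is 3 true positives, 3 true negatives, 1 false positive, and 1 false negative.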
Precision answers one question: Of all the cases your model predicted as positive, how many were actually positive?
Precision Formula:
Precision = TP / (TP + FP)
If your spam classifier flags 100 emails as spam and 80 of them are actually spam, your precision is 80%.
High precision means fewer false alarms. When false positives are costly, like in legal document review or fraud detection, you want high precision.
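The precision formula translates into a one-line function. This is a sketch rather than a library API: the function name is made up, and returning 0.0 when the model makes no positive predictions is an assumed convention (the ratio is undefined in that case).

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    predicted_positive = tp + fp
    # Assumption: report 0.0 when nothing was predicted positive,
    # since the ratio is undefined there.
    return tp / predicted_positive if predicted_positive else 0.0


# 100 emails flagged as spam, 80 of them really spam (so 20 false positives).
spam_precision = precision(tp=80, fp=20)
print(f"Precision: {spam_precision:.0%}")
```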
Recall answers a different question: Of all the actual positive cases, how many did your model catch?
Recall Formula:
Recall = TP / (TP + FN)
If there were 50 real spam emails and your model caught 40 of them, your recall is 80%.
High recall means fewer misses. When missing a positive case is dangerous, like in cancer diagnosis or fraud detection, recall becomes more important.
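Recall can be sketched the same way. The function name and the 0.0 fallback for a dataset with no actual positives are assumptions for illustration.

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    actual_positive = tp + fn
    # Assumption: report 0.0 when the data contains no actual positives,
    # since the ratio is undefined there.
    return tp / actual_positive if actual_positive else 0.0


# 50 real spam emails, 40 caught by the model (so 10 missed).
spam_recall = recall(tp=40, fn=10)
print(f"Recall: {spam_recall:.0%}")
```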
Here is the tricky part. Precision and recall usually pull against each other.
If you lower your classification threshold to catch more positives (more spam), recall goes up but precision drops. If you raise the threshold to be more selective, precision improves but recall falls.
This is the classic precision-recall trade-off, and it is one of the most important things to understand in model evaluation.
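To see the trade-off in numbers, the sketch below applies three different thresholds to the same set of predicted probabilities. The scores, labels, and the helper `precision_recall_at` are all hypothetical.

```python
# Hypothetical spam probabilities from a model, with the true labels (1 = spam).
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]


def precision_recall_at(threshold, scores, labels):
    """Treat score >= threshold as a positive prediction; return (precision, recall)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.80, 0.50, 0.25):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

With this toy data, the strict threshold of 0.80 gives perfect precision but only half the spam is caught, while the loose threshold of 0.25 catches all the spam at the cost of more false alarms.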
The F1 score addresses the precision-recall trade-off by combining both metrics into a single number.
The F1 score is the harmonic mean of precision and recall. Why the harmonic mean? Because it penalizes extreme values. A model with 100% precision and 2% recall would have an arithmetic mean of 51%, which sounds decent even though the model misses almost every positive case. The harmonic mean gives a much more honest result: about 4%.
F1 Score Formula:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
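A short sketch makes the difference between the two means concrete. The function names and the lopsided example values are illustrative assumptions.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def arithmetic_mean(precision, recall):
    """Plain average, shown only for comparison."""
    return (precision + recall) / 2


# A lopsided model: every positive prediction is right, but it finds very few positives.
lopsided_precision, lopsided_recall = 1.0, 0.02

print(f"Arithmetic mean: {arithmetic_mean(lopsided_precision, lopsided_recall):.2f}")
print(f"F1 (harmonic):   {f1_from_pr(lopsided_precision, lopsided_recall):.2f}")
```

The arithmetic mean lands near 0.5, while the harmonic mean stays close to the weaker of the two values.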
Let us work through a real example.
You have a model that detects fraudulent transactions:
| Metric | Value |
| --- | --- |
| True Positives (TP) | 70 |
| False Positives (FP) | 30 |
| False Negatives (FN) | 20 |
Step 1: Calculate Precision
Precision = 70 / (70 + 30) = 70 / 100 = 0.70
Step 2: Calculate Recall
Recall = 70 / (70 + 20) = 70 / 90 ≈ 0.78
Step 3: Calculate F1 Score
F1 = 2 × (0.70 × 0.78) / (0.70 + 0.78) = 2 × 0.546 / 1.48 ≈ 0.74
The F1 score is 0.74, which means the model is reasonably well-balanced but has room to improve.
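The three steps can be checked in code. The sketch below also computes F1 directly from the counts using the equivalent form 2TP / (2TP + FP + FN), which avoids rounding the intermediate values.

```python
# Counts from the fraud-detection example above.
tp, fp, fn = 70, 30, 20

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Equivalent form computed straight from the counts.
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1 Score:  {f1:.2f}")
```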
F1 scores range from 0 to 1. As a rough rule of thumb (the right bar depends on your domain and baseline):
| Score | Interpretation |
| --- | --- |
| 0.90 to 1.00 | Excellent |
| 0.80 to 0.89 | Good |
| 0.70 to 0.79 | Acceptable |
| Below 0.70 | Needs improvement |
This is where most beginners get confused. Each metric serves a different purpose. Choosing the wrong one can lead to bad decisions about your model.
Use precision when false positives are expensive. For example, if your model wrongly flags a good customer as fraudulent, they get locked out of their account, and that damages trust.
Use recall when false negatives are expensive, that is, when missing a real case is worse than raising a few false alarms.
Use the F1 score when you want a balanced view and the costs of false positives and false negatives are roughly equal. It is especially useful when the dataset is imbalanced and accuracy alone would be misleading.
| Metric | Best Used When |
| --- | --- |
| Accuracy | Dataset is balanced, all errors cost the same |
| Precision | False positives are very costly |
| Recall | False negatives are very costly |
| F1 Score | Balance is needed, dataset is imbalanced |
When your data is skewed, looking at accuracy, precision, recall, and F1 score together gives a much more complete picture than accuracy alone.
Once you move beyond binary classification (positive vs negative), things get more complex. Multi-class problems need a way to average F1 scores across all classes. This is where macro, micro, and weighted F1 come in.
Macro F1 calculates F1 for each class independently and then takes the simple average.
Macro F1 = (F1 class 1 + F1 class 2 + ... + F1 class N) / N
This treats every class equally, regardless of how many examples it has. Good when all classes matter equally.
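Here is a minimal sketch of macro averaging in plain Python. The three classes and their (TP, FP, FN) counts are hypothetical, and the helper name is made up for illustration.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from raw counts; 0.0 when the class has no TP, FP, or FN at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical per-class counts as (TP, FP, FN) for a three-class problem.
counts = {
    "cat": (90, 10, 10),
    "dog": (40, 20, 10),
    "bird": (5, 5, 15),
}

per_class_f1 = {name: f1_from_counts(*c) for name, c in counts.items()}

# Macro F1: simple average, so the small "bird" class counts as much as "cat".
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

for name, score in per_class_f1.items():
    print(f"{name}: {score:.2f}")
print(f"Macro F1: {macro_f1:.2f}")
```

Because every class gets the same weight, the weak score on the rare class pulls the macro average down noticeably.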
Micro F1 aggregates the total true positives, false positives, and false negatives across all classes, then calculates a single F1 score from those combined counts.
Good when you care about overall performance across all examples. Keep in mind that larger classes contribute more to the pooled counts, so they dominate the score.
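Micro averaging can be sketched by pooling the counts before applying the F1 formula once. The per-class (TP, FP, FN) counts below are hypothetical.

```python
# Hypothetical per-class counts as (TP, FP, FN) for a three-class problem.
counts = {
    "cat": (90, 10, 10),
    "dog": (40, 20, 10),
    "bird": (5, 5, 15),
}

# Pool the counts across all classes first.
total_tp = sum(tp for tp, _, _ in counts.values())
total_fp = sum(fp for _, fp, _ in counts.values())
total_fn = sum(fn for _, _, fn in counts.values())

# Then compute a single F1 score from the pooled counts.
micro_f1 = 2 * total_tp / (2 * total_tp + total_fp + total_fn)

print(f"Pooled TP={total_tp} FP={total_fp} FN={total_fn}")
print(f"Micro F1: {micro_f1:.2f}")
```

When every example has exactly one label, each misclassification counts as one false positive and one false negative, so micro F1 works out to the same value as accuracy.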
Weighted F1 calculates F1 for each class but weights it by the number of actual examples in that class. Good when some classes naturally appear more often and you want to reflect that in your evaluation.
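A weighted average can be sketched by using each class's support (its number of actual examples, TP + FN) as the weight. The counts and names below are hypothetical.

```python
# Hypothetical per-class counts as (TP, FP, FN) for a three-class problem.
counts = {
    "cat": (90, 10, 10),
    "dog": (40, 20, 10),
    "bird": (5, 5, 15),
}

weighted_sum = 0.0
total_support = 0
for name, (tp, fp, fn) in counts.items():
    class_f1 = 2 * tp / (2 * tp + fp + fn)
    support = tp + fn  # number of actual examples of this class
    weighted_sum += class_f1 * support
    total_support += support

weighted_f1 = weighted_sum / total_support
print(f"Weighted F1: {weighted_f1:.2f}")
```

Since the largest class has the best score in this toy data, the weighted result sits well above a simple per-class average.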
| Scenario | Recommended |
| --- | --- |
| All classes equally important | Macro F1 |
| Class imbalance, care about total correct | Micro F1 |
| Some classes more common, weight matters | Weighted F1 |
The F1 score assumes precision and recall are equally important. But that is not always true. The F-beta score lets you control this balance.
F-beta = (1 + beta²) × (Precision × Recall) / ((beta² × Precision) + Recall)
When beta is greater than 1, recall matters more. When beta is less than 1, precision matters more. F1 is simply F-beta where beta equals 1.
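The F-beta formula can be sketched as a small function. The name `f_beta` and the sample precision and recall values are assumptions for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R); 0.0 if the denominator is zero."""
    beta_sq = beta ** 2
    denom = beta_sq * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denom


# A hypothetical model with modest precision but strong recall.
p, r = 0.5, 0.9

f_half = f_beta(p, r, beta=0.5)  # leans toward precision
f_one = f_beta(p, r, beta=1.0)   # plain F1
f_two = f_beta(p, r, beta=2.0)   # leans toward recall

print(f"F0.5={f_half:.2f}  F1={f_one:.2f}  F2={f_two:.2f}")
```

Because this model's recall is higher than its precision, the score rises as beta grows and falls as beta shrinks.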
Let us see how to calculate precision, recall, and F1 score using scikit-learn, one of the most common Python libraries for this.
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report
# Actual labels
y_true = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]
# Predicted labels
y_pred = [0, 1, 0, 0, 1, 1, 1, 0, 1, 0]
# Calculate individual metrics
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
# Full per-class report (precision, recall, F1, and support)
print(classification_report(y_true, y_pred))
The classification_report function is especially useful. It gives you precision, recall, and F1 score for every class in your dataset, along with support (number of examples per class).
For multi-class averaging:
# Macro averaging
f1_macro = f1_score(y_true, y_pred, average='macro')
# Weighted averaging
f1_weighted = f1_score(y_true, y_pred, average='weighted')
Precision, recall, and F1 score are not just numbers to fill in a report. They tell you what kind of mistakes your model is making and whether those mistakes are acceptable for your use case.
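To quickly recap: precision tells you how many of your positive predictions were correct, recall tells you how many of the actual positives you caught, and the F1 score balances the two in a single number.
Once you understand how precision, recall, and F1 score are calculated, you will never look at a model's accuracy number the same way again. These metrics give you the real story.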
Precision tells you how many of your positive predictions were actually correct. Recall tells you how many of all the actual positives your model managed to catch. Precision is about quality of predictions; recall is about coverage.
When one class has far more examples than the other, a model can score high accuracy by just predicting the majority class every time. The F1 score accounts for both false positives and false negatives, which forces the model to actually perform well on the minority class too.
What counts as a good F1 score depends on the problem. In many general applications, an F1 score above 0.80 is considered good. For high-stakes domains like medical diagnosis or fraud detection, teams often aim for above 0.90. Always compare against a baseline model before judging the number.
F1 also applies to multi-class problems: you calculate F1 per class and then average the results. The three main averaging methods are macro (equal weight per class), micro (pool the counts across classes first), and weighted (weight by class frequency). Scikit-learn supports all three.
Prioritize recall when the cost of missing a true positive is high. Medical diagnosis, fraud detection, and safety-critical systems are good examples. Missing a cancer case or a security threat is far more dangerous than triggering a false alarm.
A precision of 1.0 means every single prediction your model made as positive was actually positive. There were zero false positives. However, the model might still have low recall, meaning it missed many actual positives. Precision and recall must always be looked at together.
For multi-label classification, you calculate precision, recall, and F1 per label and then average them using macro, micro, or weighted methods. Libraries like scikit-learn handle this directly through the average parameter in the metric functions.
To trade precision against recall, you can adjust the classification threshold. Lowering it increases recall but may drop precision. Raising it does the opposite. Tools like the precision-recall curve help you find the threshold that gives the best F1 score for your specific use case.
Macro F1 treats all classes equally by averaging F1 scores across classes without considering class sizes. Micro F1 pools all true positives, false positives, and false negatives together before computing a single score. Micro F1 is more influenced by larger classes.
Precision, recall, and F1 are designed primarily for classification. For regression tasks, you use different metrics like RMSE or MAE. In information retrieval (like search engines), precision and recall have slightly adapted definitions but serve the same conceptual purpose.
With class imbalance, a model can score well on the majority class while recall for the minority class stays low. The F1 score helps expose this by penalizing models that do not achieve both good precision and good recall. Macro F1 and the per-class report are especially useful here because they keep minority-class problems visible, whereas weighted F1 is dominated by the larger classes.