Precision, Recall, and F1 Score Explained: From Basics to Advanced

Updated on Jun 08, 2026 | 10 min read | 3.79K+ views

Table of Contents

What Is Precision, Recall, and F1 Score?
The F1 Score Formula and How to Calculate It
When to Use Precision vs Recall vs F1 Score
Advanced Concepts: Macro, Micro, and Weighted F1 Scores
Precision, Recall, and F1 Score in Python
Conclusion

Precision, Recall, and F1 Score are key evaluation metrics used to measure the performance of machine learning classification models. Unlike accuracy, which only shows overall correctness, these metrics provide deeper insights into how well a model identifies positive cases and handles prediction errors.

They are particularly valuable when working with imbalanced datasets, where one class significantly outnumbers another. In such situations, a model can achieve high accuracy while still performing poorly, making precision, recall, and F1 score more reliable indicators of real-world performance.

In this blog, you will learn what precision, recall, and F1 score mean, how to calculate each one, when to use which metric, and how they all work together.

What Is Precision, Recall, and F1 Score?

Before jumping into formulas, let us look at the problem these metrics solve.

Imagine a spam email classifier. You test it on 1,000 emails. 950 are normal and 50 are spam. If your model predicts every single email as "not spam," it is 95% accurate. But it catches zero spam emails. That is a useless model.

Precision, recall, and F1 score exist to expose exactly this kind of failure.
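The spam scenario above can be reproduced in a few lines. This is a minimal sketch in plain Python; the label encoding (1 for spam, 0 for normal) is an assumption made for illustration.

```python
# Reproduce the spam example: 950 normal emails (0) and 50 spam emails (1).
y_true = [0] * 950 + [1] * 50

# A "model" that labels every email as not spam.
y_pred = [0] * 1000

# Accuracy: share of emails whose predicted label matches the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Count how many of the 50 spam emails were actually caught.
spam_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = spam_caught / sum(y_true)

print(f"Accuracy: {accuracy:.2%}")  # Accuracy: 95.00%
print(f"Spam caught: {spam_caught} of {sum(y_true)} (recall = {recall:.2f})")
```

The accuracy looks strong, yet the share of spam caught is zero, which is the gap the metrics below are meant to reveal.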

To understand these metrics, you first need to know four terms:

Term	What It Means	Example
True Positive (TP)	Correctly predicted positive	Spam email correctly flagged as spam
True Negative (TN)	Correctly predicted negative	Normal email correctly marked as not spam
False Positive (FP)	Predicted positive but actually negative	Normal email wrongly marked as spam
False Negative (FN)	Predicted negative but actually positive	Spam email wrongly marked as safe

These four values form the foundation of the precision, recall, and F1 score calculations.
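As a rough illustration, the four counts can be tallied directly from two lists of labels. The helper name `confusion_counts` and the sample labels are hypothetical, chosen only for this sketch.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # spam correctly flagged as spam
        elif predicted != positive and actual != positive:
            tn += 1  # normal email correctly left alone
        elif predicted == positive and actual != positive:
            fp += 1  # normal email wrongly flagged as spam
        else:
            fn += 1  # spam wrongly marked as safe
    return tp, tn, fp, fn

# Hypothetical labels: 1 = spam, 0 = not spam.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=3 TN=3 FP=1 FN=1
```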

Precision

Precision answers one question: Of all the cases your model predicted as positive, how many were actually positive?

Precision Formula:

Precision = TP / (TP + FP)

If your spam classifier flags 100 emails as spam and 80 of them are actually spam, your precision is 80%.

High precision means fewer false alarms. When false positives are costly, like in spam filtering or legal document review, you want high precision.

Recall

Recall answers a different question: Of all the actual positive cases, how many did your model catch?

Recall Formula:

Recall = TP / (TP + FN)

If there were 50 real spam emails and your model caught 40 of them, your recall is 80%.

High recall means fewer misses. When missing a positive case is dangerous, like in cancer diagnosis or fraud detection, recall becomes more important.
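Both formulas translate directly into code. The sketch below plugs in the two spam examples from this section; returning 0.0 when the denominator is zero is a convention assumed here, not part of the formulas themselves.

```python
def precision(tp, fp):
    # Assumed convention: 0.0 when the model made no positive predictions.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Assumed convention: 0.0 when there are no actual positives.
    return tp / (tp + fn) if (tp + fn) else 0.0

# Precision example: 100 emails flagged, 80 truly spam -> TP=80, FP=20
print(precision(tp=80, fp=20))  # 0.8

# Recall example: 50 real spam emails, 40 caught -> TP=40, FN=10
print(recall(tp=40, fn=10))     # 0.8
```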

The Precision-Recall Trade-off

Here is the tricky part. Precision and recall usually pull against each other.

If you lower your classification threshold to catch more positives (more spam), recall goes up but precision drops. If you raise the threshold to be more selective, precision improves but recall falls.

This is the classic precision-recall trade-off, and it is one of the most important things to understand in model evaluation.
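To make the trade-off concrete, here is a small sketch that applies three thresholds to the same set of predicted spam probabilities. The scores and labels are made up for illustration and do not come from a real model.

```python
# Hypothetical spam probabilities from a model, paired with true labels (1 = spam).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

def precision_recall_at(threshold):
    """Predict spam when score >= threshold, then compute precision and recall."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.85, 0.50, 0.25):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.85  precision=1.00  recall=0.40
# threshold=0.50  precision=0.80  recall=0.80
# threshold=0.25  precision=0.71  recall=1.00
```

On this toy data, lowering the threshold raises recall from 0.40 to 1.00 while precision falls from 1.00 to 0.71.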

The F1 Score Formula and How to Calculate It

The F1 score is the solution to the precision-recall trade-off problem. It combines both into a single number.

The F1 score formula uses the harmonic mean of precision and recall. Why the harmonic mean? Because it penalizes extreme values. A model with 100% precision and 0% recall would have an arithmetic mean of 50%, which sounds decent but describes a model that catches no positives at all. The harmonic mean of those two values is 0, which is a much more honest result.

F1 Score Formula:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
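A quick comparison shows how differently the two means treat lopsided models. This is a sketch with made-up precision and recall pairs; treating F1 as 0.0 when both inputs are zero is an assumed convention.

```python
def arithmetic_mean(p, r):
    return (p + r) / 2

def f1(p, r):
    # Harmonic mean of precision and recall; 0.0 when both are 0 (assumed convention).
    return 2 * p * r / (p + r) if (p + r) else 0.0

for p, r in [(1.0, 0.0), (0.9, 0.1), (0.6, 0.6)]:
    print(f"P={p:.1f} R={r:.1f}  arithmetic={arithmetic_mean(p, r):.2f}  F1={f1(p, r):.2f}")
# P=1.0 R=0.0  arithmetic=0.50  F1=0.00
# P=0.9 R=0.1  arithmetic=0.50  F1=0.18
# P=0.6 R=0.6  arithmetic=0.60  F1=0.60
```

The arithmetic mean rates the first two models the same as a coin flip, while F1 only rewards the model whose precision and recall are both reasonable.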

F1 Score Calculation Example

Let us work through a real example.

You have a model that detects fraudulent transactions:

Metric	Value
True Positives (TP)	70
False Positives (FP)	30
False Negatives (FN)	20

Step 1: Calculate Precision

Precision = 70 / (70 + 30) = 70 / 100 = 0.70

Step 2: Calculate Recall

Recall = 70 / (70 + 20) = 70 / 90 = 0.78

Step 3: Calculate F1 Score

F1 = 2 × (0.70 × 0.78) / (0.70 + 0.78) = 2 × 0.546 / 1.48 = 0.74

The F1 score is 0.74, which means the model is reasonably well-balanced but has room to improve.
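The three steps can be checked in code. The sketch below also uses the equivalent count-based form F1 = 2TP / (2TP + FP + FN), which follows from substituting the precision and recall formulas into the F1 formula.

```python
tp, fp, fn = 70, 30, 20

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Equivalent form written directly in terms of counts.
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

print(f"Precision: {precision:.2f}")  # Precision: 0.70
print(f"Recall:    {recall:.2f}")     # Recall:    0.78
print(f"F1:        {f1:.2f}")         # F1:        0.74
```

Without intermediate rounding the F1 value is about 0.737, which still rounds to the 0.74 obtained by hand.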

F1 Score Range

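F1 scores range from 0 to 1. As a rough rule of thumb (what counts as acceptable depends heavily on the problem and the baseline):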

Score	Interpretation
0.90 to 1.00	Excellent
0.80 to 0.89	Good
0.70 to 0.79	Acceptable
Below 0.70	Needs improvement

When to Use Precision vs Recall vs F1 Score

This is where most beginners get confused. Each metric serves a different purpose. Choosing the wrong one can lead to bad decisions about your model.

Use Precision When

False positives are expensive. If your model wrongly flags a good customer as fraudulent, they get locked out of their account. That damages trust.

Prioritize precision in applications such as:

Email spam filtering (a false positive loses important emails)
Legal document classification
Medical test result systems where unnecessary treatment is harmful

Use Recall When

False negatives are expensive. Missing a real case is worse than having a few false alarms.

Prioritize recall in applications such as:

Cancer or disease detection (missing a positive case can cost a life)
Security threat detection
Quality control in manufacturing (missing a defect means a faulty product ships)

Use F1 Score When

You want a balanced view and the cost of false positives and false negatives is roughly equal. The F1 score is especially useful when:

Your dataset is imbalanced
You cannot afford to prioritize one error type over another
You are comparing multiple models and want one number to rank them

Accuracy vs Precision, Recall, and F1 Score

Metric	Best Used When
Accuracy	Dataset is balanced, all errors cost the same
Precision	False positives are very costly
Recall	False negatives are very costly
F1 Score	Balance is needed, dataset is imbalanced

When your data is skewed, looking at accuracy, precision, recall, and F1 score together gives a much more complete picture than accuracy alone.

Advanced Concepts: Macro, Micro, and Weighted F1 Scores

Once you move beyond binary classification (positive vs negative), things get more complex. Multi-class problems need a way to average F1 scores across all classes. This is where macro, micro, and weighted F1 come in.

1. Macro F1 Score

Calculates F1 for each class independently and then takes the simple average.

Macro F1 = (F1 class 1 + F1 class 2 + ... + F1 class N) / N

This treats every class equally, regardless of how many examples it has. Good when all classes matter equally.

2. Micro F1 Score

Aggregates the total true positives, false positives, and false negatives across all classes, then calculates a single F1 score from those combined counts.

Good when you care about overall performance across all examples. Keep in mind that micro F1 is dominated by the larger classes, and in single-label multi-class problems it equals accuracy.

3. Weighted F1 Score

Calculates F1 for each class but weights it by the number of actual examples in that class. Good when some classes naturally appear more often and you want to reflect that in your evaluation.

Which One Should You Use?

Scenario	Recommended
All classes equally important	Macro F1
Overall correctness across all examples matters most	Micro F1
Some classes more common, weight matters	Weighted F1
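The three averaging schemes can be compared on a small example. This is a plain-Python sketch on hypothetical three-class data where class "a" is common and class "c" is rare; the helper names are made up for illustration.

```python
def per_class_counts(y_true, y_pred, label):
    """Return (TP, FP, FN) treating `label` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical data: six "a" examples, three "b", one "c".
y_true = ["a", "a", "a", "a", "a", "a", "b", "b", "b", "c"]
y_pred = ["a", "a", "a", "a", "a", "b", "b", "b", "a", "a"]

labels = sorted(set(y_true))
counts = {lab: per_class_counts(y_true, y_pred, lab) for lab in labels}
support = {lab: y_true.count(lab) for lab in labels}
f1_by_class = {lab: f1_from_counts(*counts[lab]) for lab in labels}

# Macro: simple average of per-class F1.
macro_f1 = sum(f1_by_class.values()) / len(labels)
# Weighted: per-class F1 weighted by the number of true examples in each class.
weighted_f1 = sum(f1_by_class[lab] * support[lab] for lab in labels) / len(y_true)
# Micro: pool TP, FP, FN across classes, then compute one F1.
micro_f1 = f1_from_counts(*(sum(c[i] for c in counts.values()) for i in range(3)))

print(f"Macro F1:    {macro_f1:.2f}")     # Macro F1:    0.48
print(f"Weighted F1: {weighted_f1:.2f}")  # Weighted F1: 0.66
print(f"Micro F1:    {micro_f1:.2f}")     # Micro F1:    0.70
```

The rare class "c" is never predicted correctly, so its F1 is 0. Macro F1 drops sharply as a result, while micro and weighted F1 are pulled up by the common class.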

The F-Beta Score

The F1 score assumes precision and recall are equally important. But that is not always true. The F-beta score lets you control this balance.

F-beta = (1 + beta²) × (Precision × Recall) / ((beta² × Precision) + Recall)

When beta is greater than 1, recall matters more. When beta is less than 1, precision matters more. F1 is simply F-beta where beta equals 1.
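The effect of beta is easiest to see with numbers. The sketch below evaluates the formula for a hypothetical model with precision 0.50 and recall 1.00; returning 0.0 on a zero denominator is an assumed convention.

```python
def fbeta(precision, recall, beta=1.0):
    b2 = beta ** 2
    denom = b2 * precision + recall
    # Assumed convention: 0.0 when precision and recall are both 0.
    return (1 + b2) * precision * recall / denom if denom else 0.0

precision, recall = 0.50, 1.00  # hypothetical model: high recall, low precision

print(f"F0.5: {fbeta(precision, recall, beta=0.5):.2f}")  # F0.5: 0.56
print(f"F1:   {fbeta(precision, recall, beta=1.0):.2f}")  # F1:   0.67
print(f"F2:   {fbeta(precision, recall, beta=2.0):.2f}")  # F2:   0.83
```

Because this model is stronger on recall, the recall-leaning F2 rates it highest and the precision-leaning F0.5 rates it lowest.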

Precision, Recall, and F1 Score in Python

Let us see how to calculate precision, recall, and F1 score using scikit-learn, one of the most widely used Python libraries for this.

from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

# Actual labels
y_true = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]

# Predicted labels
y_pred = [0, 1, 0, 0, 1, 1, 1, 0, 1, 0]

# Calculate individual metrics
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1 Score:  {f1:.2f}")

# Per-class report (works for binary and multi-class labels)
print(classification_report(y_true, y_pred))

The classification_report function is especially useful. It gives you precision, recall, and F1 score for every class in your dataset, along with support (number of examples per class).

For multi-class averaging:

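# Macro averaging: unweighted mean of per-class F1
f1_macro = f1_score(y_true, y_pred, average='macro')

# Micro averaging: F1 from counts pooled across classes
f1_micro = f1_score(y_true, y_pred, average='micro')

# Weighted averaging: per-class F1 weighted by support
f1_weighted = f1_score(y_true, y_pred, average='weighted')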

Conclusion

Precision, recall, and F1 score are not just numbers to fill in a report. They tell you what kind of mistakes your model is making and whether those mistakes are acceptable for your use case.

To quickly recap:

Precision measures how clean your positive predictions are
Recall measures how complete your positive coverage is
F1 score balances both into a single, honest metric
Accuracy alone is often misleading, especially with imbalanced data

Once you understand the formulas for precision, recall, and F1 score, you will never look at a model's accuracy number the same way again. These metrics give you the real story.

Frequently Asked Questions (FAQs)

1. What is the difference between precision and recall?

Precision tells you how many of your positive predictions were actually correct. Recall tells you how many of all the actual positives your model managed to catch. Precision is about quality of predictions; recall is about coverage.

2. Why is the F1 score better than accuracy for imbalanced datasets?

When one class has far more examples than the other, a model can score high accuracy by just predicting the majority class every time. The F1 score accounts for both false positives and false negatives, which forces the model to actually perform well on the minority class too.

3. What is a good F1 score for a machine learning model?

It depends on the problem. In most general applications, an F1 score above 0.80 is considered good. For high-stakes domains like medical diagnosis or fraud detection, you would want it above 0.90. Always compare against a baseline model before judging the number.

4. Can F1 score be used for multi-class classification?

Yes. For multi-class problems, you combine per-class results into a single number. The three main averaging methods are macro (average the per-class F1 scores with equal weight), micro (pool the counts across all classes before computing one score), and weighted (average the per-class F1 scores weighted by class frequency). Scikit-learn supports all three.

5. When should I prioritize recall over precision?

Prioritize recall when the cost of missing a true positive is high. Medical diagnosis, fraud detection, and safety-critical systems are good examples. Missing a cancer case or a security threat is far more dangerous than triggering a false alarm.

6. What does a precision of 1.0 mean?

A precision of 1.0 means every single prediction your model made as positive was actually positive. There were zero false positives. However, the model might still have low recall, meaning it missed many actual positives. Precision and recall must always be looked at together.

7. How are precision, recall, and F1 score calculated for multi-label classification?

For multi-label classification, you calculate precision, recall, and F1 per label and then average them using macro, micro, or weighted methods. Libraries like scikit-learn handle this directly through the average parameter in the metric functions.

8. How do I improve F1 score without retraining my model?

You can adjust the classification threshold. Lowering it increases recall but may drop precision. Raising it does the opposite. Tools like the precision-recall curve help you find the threshold that gives the best F1 score for your specific use case.
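As a minimal sketch of this idea, the snippet below tries every distinct predicted score as a threshold and keeps the one with the highest F1. The scores and labels are hypothetical. In practice the threshold should be chosen on a validation set rather than on the test set.

```python
# Hypothetical predicted probabilities and true labels (1 = positive).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

def f1_at(threshold):
    """F1 when every score >= threshold is predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Candidate thresholds: every distinct score the model produced.
best_threshold = max(sorted(set(scores)), key=f1_at)

print(f"Best threshold: {best_threshold:.2f}")         # Best threshold: 0.30
print(f"F1 at best:     {f1_at(best_threshold):.2f}")  # F1 at best:     0.83
```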

9. What is the difference between micro and macro F1 score?

Macro F1 treats all classes equally by averaging F1 scores across classes without considering class sizes. Micro F1 pools all true positives, false positives, and false negatives together before computing a single score. Micro F1 is more influenced by larger classes.

10. Are precision, recall, and F1 score used only in classification tasks?

Primarily yes. These metrics are designed for classification. For regression tasks, you use different metrics like RMSE or MAE. However, in information retrieval (like search engines), precision and recall have slightly adapted definitions but serve the same conceptual purpose.

11. How does class imbalance affect precision, recall, and F1 score?

Class imbalance makes it easy for a model to score well on the majority class while recall on the minority class suffers. Per-class and macro F1 scores help expose this, because a poorly handled minority class drags them down. Weighted F1 accounts for class frequency, but it is dominated by the majority class, so read it alongside the per-class scores rather than on its own.

Rahul Singh

97 articles published

Rahul Singh is an Associate Content Writer at upGrad, with a strong interest in Data Science, Machine Learning, and Artificial Intelligence. He combines technical development skills with data-driven s...