F1 Score in Machine Learning: What It Is and Why It Matters

By Sriram

Updated on Jun 25, 2026 | 5 min read | 1.36K+ views

The F1 score is a key evaluation metric for classification models that combines precision and recall into a single measure by calculating their harmonic mean. Its value ranges from 0 to 1, where 1 indicates perfect performance, and 0 represents the poorest performance.

The F1 score is particularly useful when working with imbalanced datasets, as it provides a more balanced assessment of model performance than accuracy, which can sometimes produce misleading results.

This blog breaks down what the F1 score is, how to calculate it, why precision and recall matter, and when you should actually trust this metric over others. You'll also get a clear look at its limitations, because no metric is perfect.

What Is F1 Score in Machine Learning?

The F1 score is a single number that balances two things: how many of the actual positives your model catches, and how many of its positive predictions are actually correct.

It's the harmonic mean of precision and recall. That distinction matters more than most people realise.

If precision is 1.0 and recall is 0.1, the arithmetic average gives you 0.55. The F1 score gives you about 0.18. That's the more honest answer, because a model that misses nine out of ten real positives is close to worthless regardless of how precise its few predictions are.

The formula:

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

The score ranges from 0 to 1. Closer to 1 means better. A score of 0 means the model did not get a single positive case right.
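
To see how the harmonic mean differs from a plain average, here is a minimal Python sketch. The helper names arithmetic_mean and f1_from_precision_recall are illustrative, not part of any library, and returning 0.0 when both inputs are zero is an assumed convention that keeps the result inside the 0-to-1 range.

```python
def arithmetic_mean(precision: float, recall: float) -> float:
    """Plain average of the two metrics."""
    return (precision + recall) / 2


def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 formula above)."""
    if precision + recall == 0:
        # Assumed convention: avoid dividing by zero and report 0.0.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up (precision, recall) pairs, from balanced to very lopsided.
pairs = [(0.9, 0.9), (1.0, 0.5), (1.0, 0.1)]

for p, r in pairs:
    avg = arithmetic_mean(p, r)
    f1 = f1_from_precision_recall(p, r)
    print(f"precision={p:.1f} recall={r:.1f} average={avg:.3f} F1={f1:.3f}")
```

When the two metrics are equal, both averages agree. The more lopsided the pair, the further F1 falls below the plain average.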

Score Range   What It Signals
0.9 to 1.0    Strong model performance
0.7 to 0.9    Decent, room to improve
0.5 to 0.7    Weak, investigate class balance
Below 0.5     Model is struggling

The F1 score doesn't care about true negatives. That's intentional. For tasks like fraud detection or disease diagnosis, getting the positives right is what counts.

Precision and Recall: The Two Things F1 Balances

You can't understand F1 without understanding what it's balancing. Precision and recall often pull in opposite directions, and that tension is exactly why F1 exists.

Precision answers: of everything the model labelled positive, how many actually were?

Precision = True Positives / (True Positives + False Positives)

High precision means fewer false alarms. A spam filter with high precision won't send legitimate emails to your junk folder.

Recall answers: of all the actual positives, how many did the model catch?

Recall = True Positives / (True Positives + False Negatives)

High recall means fewer misses. A cancer screening tool needs high recall because missing a real case is far worse than a false positive.
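
Both formulas start from the same three counts, so it can help to see them derived from raw labels. The sketch below is a plain-Python illustration with made-up spam labels. The function names confusion_counts, precision and recall are hypothetical helpers, and returning 0.0 on an empty denominator is an assumed convention.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, fn


def precision(tp, fp):
    """Share of predicted positives that were really positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that the model caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up example: 1 = spam, 0 = legitimate email.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn}")
print(f"Precision: {precision(tp, fp):.2f}")
print(f"Recall:    {recall(tp, fn):.2f}")
```

Here one legitimate email is wrongly flagged (a false positive) and one spam email slips through (a false negative), so both metrics come out at 0.75.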

Here's the tension. Push precision up and recall often drops. You catch only the cases you're very sure about, but you miss more real ones. Push recall up and precision drops. You catch more real cases, but you're also flagging a lot of noise.

The F1 score forces you to be honest about both.
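
The trade-off is easiest to see by moving a decision threshold over a model's scores. The following sketch uses invented scores and labels, and precision_recall_at is a hypothetical helper rather than a library function.

```python
# Invented model scores (higher = more likely positive) and true labels.
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]


def precision_recall_at(threshold, scores, labels):
    """Label a sample positive when its score is >= threshold, then score it."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# A strict, a moderate and a lenient threshold.
for threshold in (0.8, 0.5, 0.25):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data the strict threshold gives perfect precision but only half the recall, and the lenient one gives perfect recall at the cost of precision.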

When Precision Matters More

Think about a legal document classifier. Flagging an innocent document as suspicious has real consequences. You'd rather miss a few bad ones than wrongly flag good ones.

When Recall Matters More

Think about medical tests or fraud detection. Missing a real case is the expensive mistake. You'd rather investigate ten false alarms than let one real fraud slip through.

F1 doesn't pick a side. It tells you how well you're doing at both simultaneously.

How to Calculate F1 Score in Machine Learning

Let's walk through a concrete example. Say you're building a model to detect fraudulent transactions.

Out of 200 transactions, your model makes these calls:

  • True Positives (correctly flagged fraud): 40
  • False Positives (wrongly flagged as fraud): 10
  • False Negatives (missed fraud): 20
  • True Negatives (correctly cleared): 130

Step 1: Calculate Precision

Precision = 40 / (40 + 10) = 40 / 50 = 0.80

Step 2: Calculate Recall

Recall = 40 / (40 + 20) = 40 / 60 = 0.667

Step 3: Calculate F1

F1 = 2 × (0.80 × 0.667) / (0.80 + 0.667)
F1 = 2 × 0.5336 / 1.467
F1 = 0.727

So the model's F1 score is roughly 0.73. Not bad, but there's clearly room to reduce false negatives.
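
The three steps above can be double-checked in a few lines of plain Python. This sketch also uses the algebraically equivalent count form, F1 = 2 × TP / (2 × TP + FP + FN), which skips the intermediate rounding. The variable names are illustrative.

```python
# Counts from the fraud example above.
tp, fp, fn, tn = 40, 10, 20, 130

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 from precision and recall (the formula used in Step 3).
f1_from_rates = 2 * precision * recall / (precision + recall)

# Equivalent form written directly in terms of the counts.
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

# Accuracy, for comparison. It is the only figure here that uses tn.
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1:        {f1_from_rates:.3f} (from rates), {f1_from_counts:.3f} (from counts)")
print(f"Accuracy:  {accuracy:.3f}")
```

Note that the 130 true negatives lift accuracy to 0.85 but never enter the F1 calculation.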

Doing This in Python

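from sklearn.metrics import f1_score

# Ground-truth labels: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
# Model predictions for the same ten samples
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# f1_score treats label 1 as the positive class by default (pos_label=1)
score = f1_score(y_true, y_pred)
print(f"F1 Score: {score:.3f}")  # prints: F1 Score: 0.727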

Scikit-learn handles the calculation cleanly. But knowing the manual steps helps you debug when something looks off.

Macro, Micro, and Weighted F1: What's the Difference?

Binary classification is the easy case. What about multi-class problems, where you're classifying into three, five, or ten categories?

That's where F1 variants come in.

Macro F1: Calculates F1 for each class separately, then averages them. Every class gets equal weight. Good when you care equally about all classes, even rare ones.

Micro F1: Aggregates all true positives, false positives, and false negatives across all classes before calculating. It's dominated by whichever class has the most examples.

Weighted F1: Like macro, but each class's F1 is weighted by how many true samples it has (its support). Frequent classes count for more, so the score reflects the class distribution of your data rather than treating every class equally.
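
Here is a minimal plain-Python sketch of the three averaging schemes on a small, made-up three-class problem. The helper names per_class_counts, f1_from_counts and averaged_f1 are hypothetical, and scoring a class with no true positives as 0.0 is an assumed convention. In practice, scikit-learn's f1_score takes an average argument ("macro", "micro" or "weighted") for the same purpose.

```python
from collections import Counter


def per_class_counts(y_true, y_pred, label):
    """TP, FP and FN for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 written in terms of counts; 0.0 if there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def averaged_f1(y_true, y_pred):
    """Return (macro, micro, weighted) F1 for a multi-class problem."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = {lab: per_class_counts(y_true, y_pred, lab) for lab in labels}
    per_class = {lab: f1_from_counts(*counts[lab]) for lab in labels}
    support = Counter(y_true)  # number of true samples per class

    # Macro: unweighted mean of the per-class scores.
    macro = sum(per_class.values()) / len(labels)

    # Micro: pool the counts across classes first, then compute one F1.
    total_tp = sum(c[0] for c in counts.values())
    total_fp = sum(c[1] for c in counts.values())
    total_fn = sum(c[2] for c in counts.values())
    micro = f1_from_counts(total_tp, total_fp, total_fn)

    # Weighted: mean of per-class scores, weighted by support.
    weighted = sum(per_class[lab] * support[lab] for lab in labels) / len(y_true)
    return macro, micro, weighted


# Made-up imbalanced data: 6 cats, 3 dogs, 1 bird (the bird is missed).
y_true = ["cat"] * 6 + ["dog"] * 3 + ["bird"]
y_pred = ["cat"] * 5 + ["dog"] + ["dog", "dog", "cat"] + ["cat"]

macro, micro, weighted = averaged_f1(y_true, y_pred)
print(f"Macro F1:    {macro:.3f}")
print(f"Micro F1:    {micro:.3f}")
print(f"Weighted F1: {weighted:.3f}")
```

On this data the missed bird drags macro F1 well below the other two, because the rare class counts as much as the common ones. Micro F1 comes out equal to plain accuracy here, which is expected when every sample has exactly one label.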

Variant       Best For
Macro F1      Equal importance across all classes
Micro F1      Dominant class matters most
Weighted F1   Class imbalance with proportional importance

Don't just pick weighted by default. Think about what a mistake on a rare class actually costs your use case.

When Should You Actually Use the F1 Score?

Not every problem needs F1. That's something beginners often overlook.

The F1 score shines when your dataset is imbalanced. If 98% of your data is class A and 2% is class B, accuracy will look great even if your model completely ignores class B. The F1 score for class B catches that.
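
A quick sketch of that 98/2 scenario, assuming a lazy model that always predicts class A and treating the rare class B as the positive class. The data is synthetic and the variable names are illustrative.

```python
# Synthetic data: 98 samples of class A, 2 of class B.
y_true = ["A"] * 98 + ["B"] * 2
# A lazy model that never predicts the rare class.
y_pred = ["A"] * 100

# Accuracy: share of predictions that match the true label.
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# F1 with class B as the positive class.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == "B" and p == "B")
fp = sum(1 for t, p in zip(y_true, y_pred) if t != "B" and p == "B")
fn = sum(1 for t, p in zip(y_true, y_pred) if t == "B" and p != "B")
denom = 2 * tp + fp + fn
f1_class_b = 2 * tp / denom if denom else 0.0

print(f"Accuracy:        {accuracy:.2f}")
print(f"F1 (class B):    {f1_class_b:.2f}")
```

Accuracy reports 0.98 while the F1 score for class B is 0.00, which is the gap the paragraph above describes.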

Use F1 Score When                                         Don't Use F1 Score When
Positive cases are rare.                                  True negatives matter equally.
Both false positives and false negatives are important.   Classes are balanced.
Comparing models on imbalanced datasets.                  Accuracy is easier for your audience to understand.

There's also the F-beta score, which lets you weight precision and recall differently. F2 weights recall higher. F0.5 weights precision higher. If your use case clearly favours one, consider moving away from the standard F1.
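
The F-beta score is usually written as (1 + β²) × Precision × Recall / (β² × Precision + Recall), which reduces to F1 when β = 1. The sketch below applies it to the precision and recall from the fraud example (40/50 and 40/60). The function name fbeta is a hypothetical helper, and the zero-denominator fallback is an assumed convention.

```python
def fbeta(precision: float, recall: float, beta: float) -> float:
    """F-beta score: beta > 1 favours recall, beta < 1 favours precision."""
    beta_sq = beta ** 2
    denom = beta_sq * precision + recall
    if denom == 0:
        # Assumed convention: report 0.0 rather than divide by zero.
        return 0.0
    return (1 + beta_sq) * precision * recall / denom


# Precision and recall from the fraud detection example.
precision = 40 / 50
recall = 40 / 60

for beta in (0.5, 1, 2):
    print(f"F{beta}: {fbeta(precision, recall, beta):.3f}")
```

Because recall (0.667) is the weaker of the two here, F2 comes out lower than F1 and F0.5 comes out higher.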

Limitations of the F1 Score

It doesn't account for true negatives. For some problems, that's fine. For others, it's a serious blind spot.

Think about a content moderation model that correctly removes harmful content but also removes a lot of safe content. High recall, lower precision, decent F1. But your users are furious because safe posts keep getting removed. Because F1 ignores true negatives, it can't tell you what share of safe posts is wrongly removed, and that is the number your users feel.

F1 also doesn't tell you anything about probability calibration. A model can have a good F1 and still be poorly calibrated, meaning its confidence scores don't reflect real-world probabilities. If you're using model outputs for ranking or risk scoring, that's a problem.

And F1 is a threshold-dependent metric. Change the classification threshold, and your F1 changes. If you're comparing models, make sure their thresholds were chosen the same way, for example tuned on the same validation set. Threshold-free metrics like AUC-ROC don't have this problem.
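
To illustrate the threshold dependence, this sketch recomputes F1 for one set of invented scores at several cut-offs and picks the best one. The helper f1_at_threshold and the candidate thresholds are assumptions made for the example.

```python
# Invented model scores and true labels (1 = positive).
scores = [0.92, 0.81, 0.77, 0.64, 0.58, 0.41, 0.33, 0.12]
labels = [1, 1, 0, 1, 0, 1, 0, 0]


def f1_at_threshold(threshold, scores, labels):
    """F1 when every score >= threshold is labelled positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Candidate thresholds, chosen arbitrarily for the illustration.
thresholds = [0.9, 0.7, 0.5, 0.3, 0.1]
f1_by_threshold = {t: f1_at_threshold(t, scores, labels) for t in thresholds}

for t, f1 in f1_by_threshold.items():
    print(f"threshold={t:.1f} F1={f1:.3f}")

best_threshold = max(f1_by_threshold, key=f1_by_threshold.get)
print(f"Best threshold on this data: {best_threshold}")
```

The same model scores anywhere from 0.400 to 0.727 depending on the cut-off, so an F1 figure means little unless you know how the threshold was picked. In a real project the threshold would be tuned on validation data, not on the test set.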

No single metric tells the whole story. F1 is a strong starting point, not a final answer.

Conclusion

The F1 score gives you a grounded view of model performance when accuracy just isn't enough. It's not perfect, and it wasn't designed to be. What it does well is force you to confront the tradeoff between catching real positives and avoiding false ones.

Know what your problem actually needs before you reach for any metric. If false negatives are your nightmare scenario, lean toward recall. If false positives are costly, lean toward precision. If you're not sure which matters more, the F1 score is where you start.

FAQs

1. Is a higher F1 score always better in machine learning?

A higher F1 score usually indicates better balance between precision and recall, but it isn't automatically the best outcome. The ideal score depends on your application, dataset, and business goals. Always review precision and recall alongside the F1 score to understand where your model performs well or struggles.

2. What is considered a good F1 score in machine learning?

There's no fixed benchmark for a good F1 score in machine learning. Scores above 0.80 are generally considered strong for many classification tasks, while applications like fraud detection or medical diagnosis may require even higher values because prediction errors carry significant consequences.

3. Why can two models have the same F1 score but different performance?

Two models can achieve identical F1 scores while having very different precision and recall values. One model might prioritize fewer false positives, while another focuses on catching more positive cases. Looking at all three metrics together provides a more complete evaluation than relying on a single score.
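
A small illustration with two hypothetical models whose precision and recall are mirror images of each other. The numbers are invented and f1 is an illustrative helper.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Two invented models with mirrored strengths.
model_a = {"precision": 0.90, "recall": 0.60}  # cautious: few false alarms
model_b = {"precision": 0.60, "recall": 0.90}  # aggressive: few misses

f1_a = f1(model_a["precision"], model_a["recall"])
f1_b = f1(model_b["precision"], model_b["recall"])

print(f"Model A F1: {f1_a:.2f}")
print(f"Model B F1: {f1_b:.2f}")
```

Both models score 0.72, yet Model A suits a task where false alarms are costly and Model B suits one where misses are costly.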

4. How does changing the classification threshold affect the F1 score?

The F1 score changes when you adjust the classification threshold because precision and recall also change. Lowering the threshold usually increases recall but reduces precision, while raising it often has the opposite effect. Finding the right threshold helps improve overall model performance.

5. Can the F1 score be used for multiclass classification problems?

Yes. The F1 score works for multiclass classification by calculating separate scores for each class and combining them using macro, micro, or weighted averaging. The best averaging method depends on whether all classes are equally important or some classes occur more frequently.

6. How do you calculate the F1 score in Python?

Libraries like scikit-learn make this straightforward. After training your classifier, you compare the actual labels with the predicted labels using the f1_score() function from sklearn.metrics.

7. What is the difference between accuracy score and F1 score in machine learning?

The accuracy score measures the percentage of correct predictions across all classes. The F1 score focuses only on balancing precision and recall for the positive class. When classes are highly imbalanced, the F1 score often provides a more meaningful evaluation than accuracy alone.

8. Is the F1 score suitable for regression models?

No. The F1 score is designed only for classification problems where predictions belong to discrete classes. Regression models predict continuous numerical values, so evaluation metrics such as Mean Squared Error (MSE), Mean Absolute Error (MAE), or R-squared are more appropriate.

9. What is the difference between the F1 score and the F-beta score?

The standard F1 score gives equal importance to precision and recall. The F-beta score lets you assign more weight to one metric depending on your objective. For example, F2 emphasizes recall, while F0.5 places greater importance on precision during model evaluation.

10. Why is the harmonic mean used in the F1 score formula?

The F1 score formula uses the harmonic mean because it penalizes large differences between precision and recall. If either metric is very low, the final score also drops significantly. This prevents a model from appearing strong simply because it performs well on one metric.

11. Should you rely only on the F1 score to evaluate a machine learning model?

No. While the F1 score is valuable, it shouldn't be your only evaluation metric. Combining it with accuracy, precision, recall, ROC-AUC, confusion matrix analysis, and business-specific measures provides a clearer understanding of how the model performs in real-world scenarios.
