Introduction
How do you know if a model is "good"? The answer depends entirely on your use case. A model that's excellent for one application might be dangerous for another. Understanding evaluation metrics is essential for making informed decisions about model deployment.
This part covers the key metrics you'll encounter when evaluating AI systems, with a focus on understanding what they mean for real-world applications.
The Confusion Matrix
For classification problems, the confusion matrix is the foundation for understanding model performance. It shows how predictions compare to actual outcomes.
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP): correctly identified positive | False Negative (FN): missed positive |
| Actual Negative | False Positive (FP): false alarm | True Negative (TN): correctly identified negative |
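The four cells can be tallied directly from paired labels. A minimal pure-Python sketch (the helper name `confusion_counts` is ours, not from any particular library):

```python
def confusion_counts(actual, predicted):
    """Tally TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn

actual    = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
print(confusion_counts(actual, predicted))  # (2, 1, 1, 2)
```

Every classification metric below is a ratio built from these four counts.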
Real-World Example: Medical Screening
- True Positive: Test correctly identifies a disease
- False Positive: Healthy person flagged as having the disease (causes unnecessary stress and follow-up tests)
- False Negative: Person with the disease flagged as healthy (dangerous - the disease goes untreated)
- True Negative: Test correctly identifies a healthy person
Core Classification Metrics
Accuracy
(TP + TN) / Total

The percentage of all predictions that are correct. Simple to understand but can be misleading with imbalanced data.
When Accuracy Fails
In fraud detection where 99.9% of transactions are legitimate, a model that predicts "not fraud" for everything would have 99.9% accuracy but catch zero fraud. Accuracy alone is not enough.
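This failure mode is easy to demonstrate with synthetic numbers (the 1-in-1,000 fraud rate here is illustrative):

```python
# 1,000 transactions, only 1 fraudulent (label 1); model always predicts "not fraud"
actual = [1] + [0] * 999
predicted = [0] * 1000

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
fraud_caught = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
print(accuracy)      # 0.999 - looks excellent
print(fraud_caught)  # 0 - catches nothing
```

Recall on the fraud class here is 0/1 = 0%, which is what accuracy hides.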
Precision
TP / (TP + FP)

Of all positive predictions, what percentage were actually positive? High precision means few false alarms.
When Precision Matters
Email spam filtering: High precision means legitimate emails rarely go to spam. Users trust the filter when it marks something as spam.
Recall (Sensitivity)
TP / (TP + FN)

Of all actual positives, what percentage were correctly identified? High recall means few missed cases.
When Recall Matters
Cancer screening: High recall is critical - missing a cancer case (false negative) has severe consequences. Better to have some false alarms than miss true cases.
F1 Score
2 * (Precision * Recall) / (Precision + Recall)

The harmonic mean of precision and recall. Useful when you need a single metric that balances both concerns.
Why Harmonic Mean?
The harmonic mean penalizes extreme imbalances. A model with 100% precision and 1% recall would have ~2% F1 score, reflecting that it's not useful despite high precision.
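The extreme-imbalance case is easy to verify directly. A sketch computing all three metrics from confusion counts (helper name is ours; the 1 TP / 0 FP / 99 FN numbers are chosen to give exactly 100% precision and 1% recall):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 100% precision but 1% recall: 1 TP, 0 FP, 99 FN
p, r, f1 = precision_recall_f1(tp=1, fp=0, fn=99)
print(p, r, round(f1, 4))  # 1.0 0.01 0.0198
```

The harmonic mean drags F1 down toward the weaker of the two components, unlike a simple average (which would report a misleading ~50%).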
The Precision-Recall Trade-off
There's typically a trade-off between precision and recall. You can increase one at the expense of the other by adjusting the decision threshold.
Understanding the Trade-off
- Higher threshold (more conservative): fewer positive predictions, but those made are more confident. Higher precision, lower recall.
- Lower threshold (more aggressive): more positive predictions, catching more true positives but also more false positives. Higher recall, lower precision.
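Sweeping the threshold over a toy set of model scores makes the trade-off concrete (the scores and labels below are invented for illustration):

```python
def classify(scores, threshold):
    """Predict positive (1) when the model's score meets the threshold."""
    return [1 if s >= threshold else 0 for s in scores]

scores = [0.95, 0.80, 0.60, 0.40, 0.20]   # model confidence per example
actual = [1,    1,    0,    1,    0]

for threshold in (0.3, 0.5, 0.7, 0.9):
    pred = classify(scores, threshold)
    tp = sum(a and p for a, p in zip(actual, pred))
    fp = sum((not a) and p for a, p in zip(actual, pred))
    fn = sum(a and (not p) for a, p in zip(actual, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    print(threshold, round(precision, 2), round(recall, 2))
```

At threshold 0.3 this example yields precision 0.75 and recall 1.0; at 0.9 it yields precision 1.0 and recall 0.33 - the same model, tuned to opposite priorities.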
Choosing the Right Balance
| Scenario | Priority | Reasoning |
|---|---|---|
| Medical screening | High Recall | Missing a disease is worse than a false alarm |
| Spam filtering | High Precision | Sending good email to spam is very frustrating |
| Fraud detection | Context-dependent | Balance blocking fraud vs. customer friction |
| Content moderation | High Recall | Missing harmful content has regulatory/safety risks |
ROC-AUC
The Receiver Operating Characteristic (ROC) curve and Area Under Curve (AUC) provide a threshold-independent view of classifier performance.
ROC-AUC
Area under the ROC curve (0.5 = random, 1.0 = perfect)

The ROC curve plots the True Positive Rate against the False Positive Rate at every threshold. AUC summarizes this into a single number representing the model's ability to distinguish between classes.
Interpreting AUC
- AUC = 0.5: random guessing (no discrimination)
- AUC 0.7-0.8: acceptable discrimination
- AUC 0.8-0.9: excellent discrimination
- AUC > 0.9: outstanding discrimination
When to Use ROC-AUC
ROC-AUC is useful when you want to evaluate a model's overall discriminative ability without committing to a specific threshold. However, it can be misleading with highly imbalanced datasets - in such cases, Precision-Recall AUC may be more informative.
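AUC has an equivalent interpretation that is easy to compute directly: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A brute-force pairwise sketch (fine for small examples; real libraries use the sorted-rank formulation instead):

```python
def auc(scores, labels):
    """AUC as the probability a random positive outranks a random negative
    (ties count as half), by direct pairwise comparison."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.7, 0.3], [1, 1, 0, 0]))  # 1.0 - perfect separation
print(auc([0.9, 0.2, 0.7, 0.8], [1, 1, 0, 0]))  # 0.5 - no better than chance
```

Note that no threshold appears anywhere in the computation, which is exactly the sense in which ROC-AUC is threshold-independent.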
Regression Metrics
When predicting continuous values (not categories), different metrics apply.
Mean Absolute Error (MAE)
Average of |Actual - Predicted|

The average magnitude of errors. Easy to interpret - in the same units as the target variable.
Mean Squared Error (MSE) / RMSE
Average of (Actual - Predicted)^2

Squares errors before averaging, penalizing large errors more heavily. RMSE (the square root of MSE) returns the error to the original units.
R-squared (R2)
1 - (Sum of squared residuals / Total sum of squares)

The proportion of variance explained by the model. At most 1; a value of 0 means the model does no better than always predicting the mean, and it can be negative for models that do worse.
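All three regression metrics follow directly from their formulas. A pure-Python sketch (helper name and sample values are ours, for illustration):

```python
import math

def regression_metrics(actual, predicted):
    """MAE, RMSE, and R-squared for continuous predictions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                       # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)      # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2

actual    = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 9.0]
mae, rmse, r2 = regression_metrics(actual, predicted)
print(mae, round(rmse, 3), round(r2, 3))  # 0.25 0.354 0.975
```

Note that RMSE (0.354) exceeds MAE (0.25) here: squaring weights the two 0.5-unit misses more heavily than a simple average of magnitudes would.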
Beyond Technical Metrics
Technical metrics are necessary but not sufficient. Business value and risk assessment require additional considerations.
Business Metrics
- Revenue Impact: How does model performance translate to business outcomes?
- Cost of Errors: What's the cost of false positives vs. false negatives?
- User Experience: How does the model affect customer satisfaction?
- Operational Efficiency: Time and resources saved or consumed
Fairness Metrics
- Demographic Parity: Equal positive prediction rates across groups
- Equal Opportunity: Equal true positive rates across groups
- Predictive Parity: Equal precision across groups
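Checking the first two of these amounts to computing per-group rates and comparing them. A sketch under assumed inputs (parallel lists of labels, predictions, and group membership; the function name and data are ours):

```python
def group_rates(actual, predicted, groups):
    """Positive-prediction rate (demographic parity check) and true-positive
    rate (equal opportunity check) per group. 1 = positive."""
    rates = {}
    for g in set(groups):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        preds = [predicted[i] for i in idx]
        pos_rate = sum(preds) / len(preds)
        pos_actual = [i for i in idx if actual[i] == 1]
        tpr = (sum(predicted[i] for i in pos_actual) / len(pos_actual)
               if pos_actual else None)
        rates[g] = (pos_rate, tpr)
    return rates

actual    = [1, 0, 1, 1, 0, 1]
predicted = [1, 0, 0, 1, 1, 1]
groups    = ["A", "A", "A", "B", "B", "B"]
print(group_rates(actual, predicted, groups))
```

In this toy data, group B receives positive predictions at rate 1.0 versus 1/3 for group A, and its true-positive rate is 1.0 versus 0.5 - the kind of gap these metrics are designed to surface.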
The Fairness-Accuracy Trade-off
Optimizing for fairness metrics may reduce overall accuracy, and different fairness definitions can conflict with each other. There's no universally "correct" fairness metric - the choice depends on context and values.
Practical Evaluation Guidelines
Questions to Ask
- What metric best aligns with business objectives?
- What are the relative costs of different error types?
- How will the model perform across different population segments?
- What is the minimum acceptable performance for deployment?
- How will performance be monitored over time?
Common Pitfalls
- Using accuracy alone with imbalanced data
- Evaluating on data too similar to training data
- Ignoring performance differences across subgroups
- Not considering the cost of different error types
- Setting thresholds without business context
Key Takeaways
- The confusion matrix is the foundation for classification metrics
- Accuracy can be misleading with imbalanced data - always consider precision and recall
- Precision prioritizes avoiding false alarms; recall prioritizes avoiding missed cases
- F1 score balances precision and recall into a single metric
- ROC-AUC provides threshold-independent evaluation
- Business context should drive metric selection and threshold setting
- Fairness metrics are essential but involve trade-offs