Introduction
How do you know if a model is "good"? The answer depends entirely on your use case. A model that's excellent for one application might be dangerous for another. Understanding evaluation metrics is essential for making informed decisions about model deployment.
This part covers the key metrics you'll encounter when evaluating AI systems, with a focus on understanding what they mean for real-world applications.
The Confusion Matrix
For classification problems, the confusion matrix is the foundation for understanding model performance. It shows how predictions compare to actual outcomes.
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP): correctly identified positive | False Negative (FN): missed positive |
| Actual Negative | False Positive (FP): false alarm | True Negative (TN): correctly identified negative |
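The four cells can be tallied directly from paired labels. A minimal pure-Python sketch (the helper name `confusion_counts` is ours, not from any particular library):

```python
def confusion_counts(actual, predicted):
    """Tally TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn

actual    = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
print(confusion_counts(actual, predicted))  # (2, 1, 1, 2)
```

Every classification metric below is a ratio built from these four counts.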
Real-World Example: Medical Screening
- True Positive: Test correctly identifies a disease
- False Positive: Healthy person flagged as having the disease (causes unnecessary stress and follow-up tests)
- False Negative: Person with the disease flagged as healthy (dangerous - the disease goes untreated)
- True Negative: Test correctly identifies a healthy person
Core Classification Metrics
Accuracy
(TP + TN) / Total

The percentage of all predictions that are correct. Simple to understand but can be misleading with imbalanced data.
When Accuracy Fails
In fraud detection where 99.9% of transactions are legitimate, a model that predicts "not fraud" for everything would have 99.9% accuracy but catch zero fraud. Accuracy alone is not enough.
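This failure mode is easy to demonstrate with synthetic numbers (the 1-in-1,000 fraud rate here is illustrative):

```python
# 1,000 transactions, only 1 fraudulent (label 1); model always predicts "not fraud"
actual = [1] + [0] * 999
predicted = [0] * 1000

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
fraud_caught = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
print(accuracy)      # 0.999 - looks excellent
print(fraud_caught)  # 0 - catches nothing
```

Recall on the fraud class here is 0/1 = 0%, which is what accuracy hides.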
Precision
TP / (TP + FP)

Of all positive predictions, what percentage were actually positive? High precision means few false alarms.
When Precision Matters
Email spam filtering: High precision means legitimate emails rarely go to spam. Users trust the filter when it marks something as spam.
Recall (Sensitivity)
TP / (TP + FN)

Of all actual positives, what percentage were correctly identified? High recall means few missed cases.
When Recall Matters
Cancer screening: High recall is critical - missing a cancer case (false negative) has severe consequences. Better to have some false alarms than miss true cases.
F1 Score
2 * (Precision * Recall) / (Precision + Recall)

The harmonic mean of precision and recall. Useful when you need a single metric that balances both concerns.
Why Harmonic Mean?
The harmonic mean penalizes extreme imbalances. A model with 100% precision and 1% recall would have ~2% F1 score, reflecting that it's not useful despite high precision.
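The extreme-imbalance case is easy to verify directly. A sketch computing all three metrics from confusion counts (helper name is ours; the 1 TP / 0 FP / 99 FN numbers are chosen to give exactly 100% precision and 1% recall):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 100% precision but 1% recall: 1 TP, 0 FP, 99 FN
p, r, f1 = precision_recall_f1(tp=1, fp=0, fn=99)
print(p, r, round(f1, 4))  # 1.0 0.01 0.0198
```

The harmonic mean drags F1 down toward the weaker of the two components, unlike a simple average (which would report a misleading ~50%).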
The Precision-Recall Trade-off
There's typically a trade-off between precision and recall. You can increase one at the expense of the other by adjusting the decision threshold.
Understanding the Trade-off
- Higher threshold (more conservative): fewer positive predictions, but those made are more confident. Higher precision, lower recall.
- Lower threshold (more aggressive): more positive predictions, catching more true positives but also more false positives. Higher recall, lower precision.
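Sweeping the threshold over a toy set of model scores makes the trade-off concrete (the scores and labels below are invented for illustration):

```python
def classify(scores, threshold):
    """Predict positive (1) when the model's score meets the threshold."""
    return [1 if s >= threshold else 0 for s in scores]

scores = [0.95, 0.80, 0.60, 0.40, 0.20]   # model confidence per example
actual = [1,    1,    0,    1,    0]

for threshold in (0.3, 0.5, 0.7, 0.9):
    pred = classify(scores, threshold)
    tp = sum(a and p for a, p in zip(actual, pred))
    fp = sum((not a) and p for a, p in zip(actual, pred))
    fn = sum(a and (not p) for a, p in zip(actual, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    print(threshold, round(precision, 2), round(recall, 2))
```

At threshold 0.3 this example yields precision 0.75 and recall 1.0; at 0.9 it yields precision 1.0 and recall 0.33 - the same model, tuned to opposite priorities.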
Choosing the Right Balance
| Scenario | Priority | Reasoning |
|---|---|---|
| Medical screening | High Recall | Missing a disease is worse than a false alarm |
| Spam filtering | High Precision | Sending good email to spam is very frustrating |
| Fraud detection | Context-dependent | Balance blocking fraud vs. customer friction |
| Content moderation | High Recall | Missing harmful content has regulatory/safety risks |
ROC-AUC
The Receiver Operating Characteristic (ROC) curve and Area Under Curve (AUC) provide a threshold-independent view of classifier performance.
ROC-AUC
Area under the ROC curve (0.5 = random, 1.0 = perfect)

The ROC curve plots the True Positive Rate against the False Positive Rate at every threshold. AUC summarizes this into a single number representing the model's ability to distinguish between classes.
Interpreting AUC
- AUC = 0.5: random guessing (no discrimination)
- AUC 0.7-0.8: acceptable discrimination
- AUC 0.8-0.9: excellent discrimination
- AUC > 0.9: outstanding discrimination
When to Use ROC-AUC
ROC-AUC is useful when you want to evaluate a model's overall discriminative ability without committing to a specific threshold. However, it can be misleading with highly imbalanced datasets - in such cases, Precision-Recall AUC may be more informative.
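AUC has an equivalent interpretation that is easy to compute directly: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A brute-force pairwise sketch (fine for small examples; real libraries use the sorted-rank formulation instead):

```python
def auc(scores, labels):
    """AUC as the probability a random positive outranks a random negative
    (ties count as half), by direct pairwise comparison."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.7, 0.3], [1, 1, 0, 0]))  # 1.0 - perfect separation
print(auc([0.9, 0.2, 0.7, 0.8], [1, 1, 0, 0]))  # 0.5 - no better than chance
```

Note that no threshold appears anywhere in the computation, which is exactly the sense in which ROC-AUC is threshold-independent.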
Regression Metrics
When predicting continuous values (not categories), different metrics apply.
Mean Absolute Error (MAE)
Average of |Actual - Predicted|

The average magnitude of errors. Easy to interpret - in the same units as the target variable.
Mean Squared Error (MSE) / RMSE
Average of (Actual - Predicted)^2

Squares errors before averaging, penalizing large errors more heavily. RMSE (the square root of MSE) returns the error to the original units.
R-squared (R2)
1 - (Sum of squared residuals / Total sum of squares)

The proportion of variance explained by the model. At most 1; a value of 0 means the model does no better than always predicting the mean, and it can be negative for models that do worse.
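All three regression metrics follow directly from their formulas. A pure-Python sketch (helper name and sample values are ours, for illustration):

```python
import math

def regression_metrics(actual, predicted):
    """MAE, RMSE, and R-squared for continuous predictions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                       # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)      # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2

actual    = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 9.0]
mae, rmse, r2 = regression_metrics(actual, predicted)
print(mae, round(rmse, 3), round(r2, 3))  # 0.25 0.354 0.975
```

Note that RMSE (0.354) exceeds MAE (0.25) here: squaring weights the two 0.5-unit misses more heavily than a simple average of magnitudes would.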
Beyond Technical Metrics
Technical metrics are necessary but not sufficient. Business value and risk assessment require additional considerations.
Business Metrics
- Revenue Impact: How does model performance translate to business outcomes?
- Cost of Errors: What's the cost of false positives vs. false negatives?
- User Experience: How does the model affect customer satisfaction?
- Operational Efficiency: Time and resources saved or consumed
Fairness Metrics
- Demographic Parity: Equal positive prediction rates across groups
- Equal Opportunity: Equal true positive rates across groups
- Predictive Parity: Equal precision across groups
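Checking the first two of these amounts to computing per-group rates and comparing them. A sketch under assumed inputs (parallel lists of labels, predictions, and group membership; the function name and data are ours):

```python
def group_rates(actual, predicted, groups):
    """Positive-prediction rate (demographic parity check) and true-positive
    rate (equal opportunity check) per group. 1 = positive."""
    rates = {}
    for g in set(groups):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        preds = [predicted[i] for i in idx]
        pos_rate = sum(preds) / len(preds)
        pos_actual = [i for i in idx if actual[i] == 1]
        tpr = (sum(predicted[i] for i in pos_actual) / len(pos_actual)
               if pos_actual else None)
        rates[g] = (pos_rate, tpr)
    return rates

actual    = [1, 0, 1, 1, 0, 1]
predicted = [1, 0, 0, 1, 1, 1]
groups    = ["A", "A", "A", "B", "B", "B"]
print(group_rates(actual, predicted, groups))
```

In this toy data, group B receives positive predictions at rate 1.0 versus 1/3 for group A, and its true-positive rate is 1.0 versus 0.5 - the kind of gap these metrics are designed to surface.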
The Fairness-Accuracy Trade-off
Optimizing for fairness metrics may reduce overall accuracy, and different fairness definitions can conflict with each other. There's no universally "correct" fairness metric - the choice depends on context and values.
Practical Evaluation Guidelines
Questions to Ask
- What metric best aligns with business objectives?
- What are the relative costs of different error types?
- How will the model perform across different population segments?
- What is the minimum acceptable performance for deployment?
- How will performance be monitored over time?
Common Pitfalls
- Using accuracy alone with imbalanced data
- Evaluating on data too similar to training data
- Ignoring performance differences across subgroups
- Not considering the cost of different error types
- Setting thresholds without business context
Key Takeaways
- The confusion matrix is the foundation for classification metrics
- Accuracy can be misleading with imbalanced data - always consider precision and recall
- Precision prioritizes avoiding false alarms; recall prioritizes avoiding missed cases
- F1 score balances precision and recall into a single metric
- ROC-AUC provides threshold-independent evaluation
- Business context should drive metric selection and threshold setting
- Fairness metrics are essential but involve trade-offs