Fairness Metrics & Measurement
Introduction to Fairness Metrics
Fairness metrics provide quantitative measures for evaluating whether AI systems produce equitable outcomes across different demographic groups. These metrics translate ethical principles into measurable criteria that can be tested, monitored, and enforced.
Understanding fairness metrics is essential for AI professionals because different metrics capture different notions of fairness, and no single metric can satisfy all fairness criteria simultaneously. Choosing appropriate metrics requires understanding the context, stakeholders, and potential harms of each application.
Fairness metrics fall into three main categories: group fairness (comparing outcomes across groups), individual fairness (similar individuals receive similar treatment), and counterfactual fairness (outcomes remain consistent when protected attributes change).
Understanding the Confusion Matrix
Many fairness metrics are derived from the confusion matrix, which categorizes model predictions into four outcomes. Understanding these outcomes is essential for calculating fairness metrics.
Binary Classification Confusion Matrix
- True Positive (TP): Model correctly predicts positive outcome for positive cases
- True Negative (TN): Model correctly predicts negative outcome for negative cases
- False Positive (FP): Model incorrectly predicts positive for negative cases (Type I error)
- False Negative (FN): Model incorrectly predicts negative for positive cases (Type II error)
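These four outcomes can be tallied directly from labels and predictions. A minimal sketch in Python (the function name is illustrative):

```python
def confusion_counts(y_true, y_pred):
    """Tally the four confusion-matrix outcomes for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

counts = confusion_counts([1, 0, 1, 0, 1], [1, 0, 0, 1, 1])
# → {"TP": 2, "TN": 1, "FP": 1, "FN": 1}
```

Computing these counts separately for each demographic group is the starting point for every group fairness metric below.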
Group Fairness Metrics
Group fairness metrics compare aggregate outcomes across protected groups. These are the most commonly used metrics in fairness assessments and regulatory compliance.
Demographic Parity
Also known as: Statistical Parity, Group Fairness
Demographic parity requires that the selection rate (the percentage receiving positive outcomes) be equal across groups. The 80% rule in US employment law is based on this concept, requiring the selection rate for any protected group to be at least 80% of the rate for the highest-selected group.
Strengths
- Easy to understand and compute
- Does not require ground truth labels
- Aligns with legal disparate impact standards
Limitations
- Ignores differences in qualification rates
- May require selecting less qualified candidates
- Does not consider prediction accuracy
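Because demographic parity needs only decision rates, it can be checked without ground truth. A minimal sketch of the 80% rule check (function names are illustrative):

```python
def selection_rate(decisions):
    """Fraction of positive decisions (1 = selected)."""
    return sum(decisions) / len(decisions)

def disparate_impact_ratio(group_a, group_b):
    """Ratio of the lower selection rate to the higher one.
    The EEOC four-fifths rule flags ratios below 0.8."""
    ra, rb = selection_rate(group_a), selection_rate(group_b)
    return min(ra, rb) / max(ra, rb)

# Selection rates of 0.5 vs 0.3 give a ratio of 0.6, below the 0.8 threshold.
ratio = disparate_impact_ratio([1, 1, 0, 0], [1, 0, 0, 1, 0, 1, 0, 0, 0, 0])
```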
Equalized Odds
Also known as: Separation, Conditional Procedure Accuracy Equality
Equalized odds requires that the model have equal true positive rates and equal false positive rates across groups. This means qualified individuals have equal chances of being selected, and unqualified individuals have equal chances of being rejected.
Strengths
- Accounts for actual qualifications
- Balances errors across groups
- More nuanced than demographic parity
Limitations
- Requires accurate ground truth labels
- Labels may themselves be biased
- Cannot generally hold together with calibration when base rates differ
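Checking equalized odds amounts to comparing per-group true positive and false positive rates. A minimal sketch (function names are illustrative):

```python
def rates(y_true, y_pred):
    """Return (TPR, FPR) for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

def equalized_odds_gaps(y_true_a, y_pred_a, y_true_b, y_pred_b):
    """Absolute TPR and FPR differences between two groups;
    both near zero means equalized odds approximately holds."""
    tpr_a, fpr_a = rates(y_true_a, y_pred_a)
    tpr_b, fpr_b = rates(y_true_b, y_pred_b)
    return abs(tpr_a - tpr_b), abs(fpr_a - fpr_b)
```

In practice the gaps are compared against a tolerance chosen for the application rather than required to be exactly zero.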
Equal Opportunity
Also known as: True Positive Rate Parity
Equal opportunity focuses only on ensuring that qualified individuals across groups have equal chances of receiving positive outcomes. It does not constrain false positive rates, making it easier to achieve than full equalized odds.
Strengths
- Easier to achieve than equalized odds
- Focuses on benefit distribution
- Appropriate when FP costs are low
Limitations
- Allows unequal false positive rates
- May harm groups with higher FPR
- Still requires accurate labels
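Equal opportunity drops the false positive constraint, leaving only the TPR comparison. A minimal sketch (function names are illustrative):

```python
def true_positive_rate(y_true, y_pred):
    """Share of actual positives that the model predicts positive."""
    positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(positives) / len(positives)

def equal_opportunity_gap(y_true_a, y_pred_a, y_true_b, y_pred_b):
    """Absolute TPR difference; false positive rates are ignored."""
    return abs(true_positive_rate(y_true_a, y_pred_a)
               - true_positive_rate(y_true_b, y_pred_b))
```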
Calibration
Also known as: Predictive Parity, Sufficiency
Calibration requires that risk scores mean the same thing across groups. A 70% risk score should correspond to a 70% actual positive rate in every group. This ensures scores are equally reliable for all individuals.
Strengths
- Scores have consistent meaning
- Important for threshold decisions
- Enables fair individual comparisons
Limitations
- Does not equalize outcomes
- Can coexist with disparate impact
- Cannot hold simultaneously with equalized odds when base rates differ
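A common way to check calibration is to bin individuals by score and compare each bin's observed positive rate to its scores, separately for each group. A minimal sketch (function names and bin edges are illustrative):

```python
def calibration_by_bin(scores, outcomes, edges=(0.0, 0.5, 1.0)):
    """Mean observed outcome per score bin. For a calibrated model,
    the observed rate in each bin should match its scores, for every group."""
    report = {}
    for lo, hi in zip(edges, edges[1:]):
        in_bin = [o for s, o in zip(scores, outcomes)
                  if lo <= s < hi or (hi == edges[-1] and s == hi)]
        report[(lo, hi)] = sum(in_bin) / len(in_bin) if in_bin else None
    return report

# Scores of 0.7-0.8 should come true roughly 70-80% of the time.
report = calibration_by_bin([0.2, 0.3, 0.7, 0.8], [0, 1, 1, 1])
```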
The Impossibility Theorem
A fundamental result in fairness research demonstrates that certain fairness metrics cannot be simultaneously satisfied except in trivial cases. This has profound implications for AI fairness practice.
For any classifier with imperfect accuracy applied to groups with different base rates, the following three conditions cannot all be satisfied simultaneously:
- Calibration (predictive parity)
- Equal false positive rates across groups
- Equal false negative rates across groups
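The tension can be seen in the identity relating precision (PPV) to the error rates and the base rate p: PPV = TPR·p / (TPR·p + FPR·(1−p)). If two groups share the same TPR and FPR but have different base rates, their PPVs must differ, so predictive parity fails. A small numeric check (the rates are illustrative):

```python
def ppv(tpr, fpr, base_rate):
    """Precision implied by TPR, FPR, and base rate p:
    PPV = TPR*p / (TPR*p + FPR*(1 - p))."""
    p = base_rate
    return tpr * p / (tpr * p + fpr * (1 - p))

# Equal error rates, different base rates -> different precision,
# so calibration cannot also hold.
ppv_a = ppv(tpr=0.8, fpr=0.2, base_rate=0.5)   # = 0.8
ppv_b = ppv(tpr=0.8, fpr=0.2, base_rate=0.25)  # ≈ 0.571
```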
Implications for Practice
- No Universal Solution: There is no single fairness metric that works for all situations
- Tradeoffs Required: Achieving one form of fairness may require sacrificing another
- Context Matters: The appropriate metric depends on the specific application and its potential harms
- Stakeholder Input: Decisions about fairness tradeoffs should involve affected communities
The Fairness-Accuracy Tradeoff
Fairness constraints can also trade off against predictive performance. At one extreme, maximum fairness means equal outcomes across all groups; at the other, maximum accuracy means the best overall prediction performance. Optimizing for one often comes at some cost to the other.
Individual Fairness
While group fairness metrics compare aggregate outcomes, individual fairness focuses on treating similar individuals similarly. This addresses concerns that group metrics may still allow unfair treatment of specific individuals.
Core Principle
Individual fairness requires that individuals who are similar with respect to a task receive similar predictions. This is formalized through the concept of a similarity metric that defines what "similar" means in a given context.
Formally: d(f(x), f(x')) ≤ L · d(x, x'), where d is a distance metric, f is the model, and L is a Lipschitz constant. Similar inputs (small d(x, x')) should produce similar outputs (small d(f(x), f(x'))).
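Given distance metrics for inputs and outputs, the Lipschitz condition can be audited directly on pairs of individuals. A minimal sketch (function names are illustrative, and the distance metrics are the hard part in practice):

```python
def lipschitz_violations(pairs, model, dist_in, dist_out, lipschitz=1.0):
    """Return pairs (x, x') where dist_out(f(x), f(x')) > L * dist_in(x, x'),
    i.e. similar individuals who receive dissimilar predictions."""
    return [(x, x2) for x, x2 in pairs
            if dist_out(model(x), model(x2)) > lipschitz * dist_in(x, x2)]

# A hard threshold violates the condition for inputs straddling the cutoff.
violations = lipschitz_violations(
    [(0.49, 0.51), (0.1, 0.2)],
    model=lambda x: round(x),
    dist_in=lambda a, b: abs(a - b),
    dist_out=lambda a, b: abs(a - b),
)
# → [(0.49, 0.51)]
```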
Challenges of Individual Fairness
- Defining Similarity: Who decides which features determine similarity? This requires domain expertise and value judgments
- Feature Selection: Should protected attributes be included in similarity calculations?
- Scalability: Comparing all pairs of individuals is computationally expensive for large datasets
- Conflict with Group Fairness: Individual and group fairness can sometimes be incompatible
Counterfactual Fairness
Counterfactual fairness asks whether an individual's outcome would have been different had they belonged to a different group, while all other relevant factors remained the same.
Formal Definition
A prediction is counterfactually fair if the prediction would be the same in a counterfactual world where only the protected attribute is different.
Consider a hiring algorithm. Counterfactual fairness asks: "Would this candidate have received the same hiring decision if they were a different gender, holding all other qualifications constant?" This requires reasoning about causal relationships between features.
Implementing Counterfactual Fairness
- Causal Graph: Construct a causal model showing relationships between features, protected attributes, and outcomes
- Path-Specific Effects: Identify which causal paths from protected attributes to outcomes should be blocked
- Counterfactual Reasoning: Use causal inference techniques to estimate counterfactual outcomes
- Fair Prediction: Ensure predictions do not change based on protected attribute intervention
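A full treatment requires a causal model, but a first diagnostic is to intervene on the protected attribute alone and compare predictions. This tests only the direct path; attributes that causally influence other features need the full counterfactual machinery above. A minimal sketch (names are illustrative):

```python
def counterfactual_flip_test(model, individual, attr="gender", values=("A", "B")):
    """Flip only the protected attribute and check whether the prediction
    changes. Checks the direct path only, not effects that flow through
    descendant features."""
    preds = []
    for v in values:
        x = dict(individual)  # copy, then intervene on the attribute
        x[attr] = v
        preds.append(model(x))
    return preds[0] == preds[1]

# A model that ignores the protected attribute passes the flip test.
fair_model = lambda x: x["score"] >= 10
passes = counterfactual_flip_test(fair_model, {"score": 12, "gender": "A"})
# → True
```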
Choosing Appropriate Metrics
Selecting fairness metrics requires understanding the application context, potential harms, and stakeholder values. Different scenarios call for different metrics.
Hiring & Employment
Equal opportunity ensures qualified candidates from all groups have equal chances of selection.
Lending & Credit
Calibration ensures risk scores mean the same thing across groups for fair pricing.
Criminal Justice
Equalized odds balances both false positives and false negatives across groups.
Resource Allocation
Demographic parity may be appropriate when historical data is unreliable or biased.
| Metric | Best When | Regulatory Alignment |
|---|---|---|
| Demographic Parity | Labels are unreliable or reflect historical bias | EEOC 80% Rule, EU AI Act |
| Equalized Odds | Both FP and FN costs matter equally | Criminal justice assessments |
| Equal Opportunity | Focus on ensuring deserving receive benefits | Employment opportunity contexts |
| Calibration | Scores inform threshold decisions | Credit scoring requirements |
| Individual Fairness | Similar cases must be treated similarly | Anti-discrimination principles |
Measuring Fairness in Practice
Implementing fairness measurement requires careful attention to data collection, metric calculation, and interpretation.
Implementation Steps
- Define Protected Groups: Identify which demographic attributes to evaluate (race, gender, age, etc.)
- Collect Group Membership: Determine how to identify group membership while complying with data protection regulations
- Calculate Baseline Rates: Measure outcome rates and error rates for each group
- Apply Metrics: Calculate chosen fairness metrics across groups
- Set Thresholds: Define acceptable disparity levels based on regulatory and ethical standards
- Monitor Continuously: Track fairness metrics over time as data distributions change
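The steps above can be combined into a periodic report that compares each group against the best-served group and flags disparities beyond a chosen threshold. A minimal sketch using the four-fifths threshold (names and threshold are illustrative):

```python
def fairness_report(decisions_by_group, min_ratio=0.8):
    """Compare each group's selection rate to the highest-rate group
    and flag groups below the chosen disparity threshold."""
    rates = {g: sum(d) / len(d) for g, d in decisions_by_group.items()}
    best = max(rates.values())
    return {g: {"rate": r, "ratio": r / best, "flag": r / best < min_ratio}
            for g, r in rates.items()}

# group_b's rate (0.2) is 40% of group_a's (0.5), so it is flagged.
report = fairness_report({"group_a": [1, 1, 0, 0], "group_b": [1, 0, 0, 0, 0]})
```

Running such a report on every scoring batch, rather than once at deployment, is what makes the continuous-monitoring step actionable.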
In many jurisdictions, collecting demographic data requires explicit consent and has legal restrictions. Organizations must balance the need for fairness measurement with privacy regulations like GDPR, which limits processing of special category data. Consider using proxy methods or statistically sound estimation techniques where direct collection is not possible.
Key Takeaways
- Fairness metrics quantify different notions of equity: demographic parity, equalized odds, equal opportunity, and calibration
- The impossibility theorem shows that calibration and equal error rates across groups cannot all hold simultaneously when base rates differ
- Individual fairness requires similar treatment for similar individuals based on task-relevant features
- Counterfactual fairness uses causal reasoning to ensure outcomes are independent of protected attributes
- Metric selection depends on application context, potential harms, and stakeholder values
- Continuous monitoring is essential as fairness metrics can drift over time with changing data