Part 6 of 6

Model Risks & Limitations


Introduction

AI models are not infallible - they have fundamental limitations and face numerous risks that can cause failures in production. Understanding these risks is essential for AI governance, as mitigation must be built into systems from the start, not added as an afterthought.

This part covers the major risk categories that affect AI systems: drift, adversarial attacks, the black box problem, inherent model limitations, and operational risks, closing with a framework for risk-based governance.

Model Drift

Performance degradation over time as the world changes

All models degrade over time. The patterns learned during training may not hold as circumstances change. This is one of the most common causes of AI system failures in production.

Types of Drift

  • Data Drift (Covariate Shift): The distribution of input data changes. Example: Customer demographics shift, product mix changes, seasonal patterns evolve.
  • Concept Drift: The relationship between inputs and outputs changes. Example: What constitutes "spam" evolves as tactics change.
  • Label Drift: The meaning or distribution of labels changes. Example: Categories are redefined, class balance shifts.

Mitigation Strategies

  • Continuous monitoring of prediction distributions
  • Regular comparison against ground truth when available
  • Statistical tests for distribution changes
  • Scheduled retraining pipelines
  • Trigger-based retraining when drift exceeds thresholds
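A common statistical test for data drift is the two-sample Kolmogorov-Smirnov test, which measures the largest gap between the empirical distributions of training-time and live inputs. A minimal pure-Python sketch (the function names and the 0.2 threshold are illustrative, not a standard):

```python
import bisect

def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the empirical CDFs of the two samples."""
    ref, cur = sorted(reference), sorted(current)

    def ecdf(sample, x):
        # fraction of the sample that is <= x
        return bisect.bisect_right(sample, x) / len(sample)

    # the gap between step-function ECDFs can only peak at an observed value
    candidates = ref + cur
    return max(abs(ecdf(ref, v) - ecdf(cur, v)) for v in candidates)

def drift_alert(reference, current, threshold=0.2):
    """Trigger-based retraining hook: fire when the KS statistic between
    training-time and live feature values exceeds a chosen threshold."""
    return ks_statistic(reference, current) > threshold
```

In practice a library routine (e.g. a two-sample KS test with p-values) is preferable; the point here is that drift detection reduces to comparing a reference window against a live window, one feature at a time.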

Real-World Example: COVID-19

The pandemic caused dramatic drift across virtually all AI systems. Demand forecasting models failed as purchasing patterns changed overnight. Fraud detection systems saw false positive rates spike as normal behavior shifted. Credit scoring models based on employment stability became unreliable.

Adversarial Attacks

Intentional manipulation to cause model failures

Adversarial attacks exploit the way models learn, crafting inputs that cause incorrect predictions while appearing normal to humans. These attacks reveal fundamental fragilities in how AI systems perceive the world.

Attack Types

  • Evasion Attacks: Modify inputs at inference time to cause misclassification. Example: Adding imperceptible noise to an image to change its predicted class.
  • Poisoning Attacks: Corrupt training data to influence model behavior (covered in Part 5).
  • Model Extraction: Query the model to reconstruct its parameters or training data.
  • Prompt Injection: For language models, craft inputs that override intended behavior.
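The mechanics of an evasion attack are easiest to see on a linear model, where the gradient of the score with respect to the input is just the weight vector. The sketch below applies a fast-gradient-sign-style perturbation (the weights and epsilon are made-up toy values; real attacks target deep networks via backpropagated gradients):

```python
def sign(v):
    return 1.0 if v > 0 else -1.0 if v < 0 else 0.0

def linear_score(w, b, x):
    # decision rule: classify as positive when w.x + b > 0
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def evasion_perturb(w, x, eps):
    """FGSM-style evasion for a linear scorer: the gradient of the score
    w.r.t. x is simply w, so nudge every feature by eps against the
    gradient's sign. No feature changes by more than eps."""
    return [xi - eps * sign(wi) for wi, xi in zip(w, x)]
```

With w = [0.4, -0.3, 0.2], b = 0 and x = [0.2, 0.0, 0.1], the clean score is about 0.1 (positive), yet an eps of 0.15 shifts the score by -0.135 and flips the classification, even though each feature moved by at most 0.15. This is the core fragility: many small, coordinated changes add up to a large change in the output.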

Mitigation Strategies

  • Adversarial training - include adversarial examples in training
  • Input validation and sanitization
  • Ensemble methods - harder to fool multiple models
  • Rate limiting and anomaly detection for queries
  • Defense in depth - multiple security layers

Physical-World Attacks

Adversarial attacks aren't limited to digital inputs. Researchers have demonstrated attacks using physical objects: stickers that cause stop signs to be misclassified, glasses that defeat facial recognition, and clothing patterns that make people invisible to detection systems. These have serious implications for safety-critical AI systems.

The Black Box Problem

Inability to explain why models make specific predictions

Deep learning models are often called "black boxes" because their internal decision-making processes are opaque. This creates challenges for accountability, debugging, regulatory compliance, and user trust.

Why It Matters

  • Regulatory Requirements: Laws like GDPR require "meaningful information about the logic involved" in automated decisions.
  • Debugging: Hard to fix errors if you don't understand why they occur.
  • Trust: Users and stakeholders may not accept decisions they can't understand.
  • Liability: Who is responsible when an unexplainable AI makes a harmful decision?

Approaches to Explainability

  • Use interpretable models where possible (trade-off with performance)
  • Post-hoc explanation methods (LIME, SHAP)
  • Attention visualization for transformer models
  • Concept-based explanations
  • Counterfactual explanations - "what would need to change"
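Counterfactual explanations are simplest to compute for linear models, where "what would need to change" has a closed form. A toy sketch, assuming a hypothetical two-feature loan scorer (the weights, feature meanings, and function name are illustrative):

```python
def counterfactual_change(w, b, x, feature):
    """Counterfactual explanation for a linear scorer w.x + b: how much
    would this single feature need to change to reach the decision
    boundary (score = 0), holding everything else fixed?"""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    if w[feature] == 0:
        return None  # this feature cannot move the decision at all
    return -score / w[feature]
```

For example, with w = [0.5, -2.0] (income, debt ratio), b = -1.0 and an applicant x = [1.0, 0.8], the score is -2.1 (denied); the counterfactual on the income feature is +4.2, i.e. "income would need to rise by 4.2 units to reach the boundary". For nonlinear models, methods like LIME and SHAP approximate this kind of local reasoning instead of solving it exactly.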

The Explainability Trade-off

There's often a trade-off between model performance and interpretability. Simpler, more interpretable models (decision trees, linear models) may perform worse than complex deep learning models. Organizations must decide where on this spectrum their applications should fall based on risk and requirements.

Inherent Model Limitations

Beyond specific risks, AI models have fundamental limitations that should inform expectations and governance.

No True Understanding

Models learn statistical patterns, not true understanding. They can fail spectacularly on examples outside their training distribution, even when those examples seem trivial to humans.

Correlation vs. Causation

Models find correlations, which may be spurious. A model might learn that hospital admission predicts death - not because admission causes death, but because sick people are admitted.

Distribution Dependence

Models only work well on data similar to their training data. Performance degrades on out-of-distribution inputs, often without warning.

Brittleness

Small input changes can cause large output changes. Models lack the robustness of human perception and reasoning.

No Common Sense

Models lack the background knowledge humans use for reasoning. They can make errors that seem absurd to humans but are consistent with their training.

Uncertainty Blindness

Many models don't know what they don't know. They may produce confident predictions on inputs they've never seen before.
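One concrete source of this blindness is that softmax "confidence" is not calibrated uncertainty: the output always sums to 1 and usually has a dominant class, no matter how meaningless the input was. A small sketch (the example logits are arbitrary):

```python
import math

def softmax(logits):
    """Convert raw scores to a probability distribution. Note the output
    always sums to 1, even for logits produced from garbage input."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Feeding an out-of-distribution input through a classifier might yield logits like [4.0, 1.0, 0.5], and softmax will report roughly 93% "confidence" in the top class. Techniques such as calibration, ensembles, and explicit out-of-distribution detection exist precisely because this number cannot be trusted on its own.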

Operational Risks

Beyond model-specific risks, operational factors can cause AI system failures.

Common Operational Risks

  • Data Pipeline Failures: Broken ETL processes, schema changes, data source outages
  • Feature Store Issues: Stale features, incorrect calculations, missing values
  • Infrastructure Problems: Scaling failures, latency spikes, resource exhaustion
  • Deployment Errors: Wrong model version deployed, configuration mistakes
  • Monitoring Gaps: Failures go undetected because the right metrics aren't tracked
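A cheap defense against several of these failure modes is a schema check at the model's front door, so that broken pipelines produce loud validation errors instead of silent garbage predictions. A minimal sketch (the schema format and function name are illustrative):

```python
def validate_features(row, schema):
    """Pre-inference guard against pipeline failures: check that every
    expected feature is present, non-null, and of the declared type.
    Returns a list of problems; an empty list means the row is usable."""
    problems = []
    for name, expected_type in schema.items():
        if name not in row or row[name] is None:
            problems.append(f"missing or null feature: {name}")
        elif not isinstance(row[name], expected_type):
            problems.append(
                f"wrong type for {name}: expected {expected_type.__name__}, "
                f"got {type(row[name]).__name__}"
            )
    return problems
```

An upstream schema change (say, an ID field that silently becomes a string) is then caught before inference rather than discovered weeks later in degraded metrics. Production feature stores and data-validation tools generalize this idea with range checks and distribution checks.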

MLOps Best Practices

Robust operational practices include: version control for data, code, and models; automated testing pipelines; canary deployments; comprehensive monitoring; runbooks for incident response; and regular chaos engineering to test resilience.

Risk Assessment Framework

A systematic approach to AI risk assessment helps prioritize mitigation efforts.

Risk Assessment Questions

  • Impact: What's the worst-case outcome of a model failure?
  • Frequency: How often might failures occur?
  • Detectability: How quickly would failures be noticed?
  • Reversibility: Can the harm from failures be undone?
  • Human Oversight: Is there human review before consequential actions?
  • Attack Surface: Who might want to attack this system, and how?
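These questions can be turned into a rough prioritization score. The sketch below borrows the risk-priority-number idea from FMEA (severity x occurrence x detection); the 1-10 scales, example systems, and ratings are illustrative assumptions, not calibrated values:

```python
def risk_priority(impact, frequency, detect_difficulty):
    """FMEA-style risk priority number. Each factor is rated 1 (low) to
    10 (high); detect_difficulty is high when failures would go
    unnoticed for a long time. Higher products get mitigated first."""
    for rating in (impact, frequency, detect_difficulty):
        if not 1 <= rating <= 10:
            raise ValueError("ratings must be between 1 and 10")
    return impact * frequency * detect_difficulty

def triage(systems):
    """Sort (name, impact, frequency, detect_difficulty) tuples by
    descending risk priority."""
    return sorted(systems, key=lambda s: risk_priority(*s[1:]), reverse=True)
```

For instance, a hypothetical loan-approval model rated (9, 3, 7) scores 189 and outranks a spell-checker rated (2, 6, 2) at 24, matching the intuition that governance effort should flow to the first system.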

Risk-Based Governance

Not all AI systems need the same level of governance. High-risk applications (healthcare, criminal justice, autonomous vehicles) require rigorous testing, monitoring, and human oversight. Lower-risk applications (content recommendations, spell check) can tolerate more automation. Match governance intensity to risk level.

Building Resilient AI Systems

Key Principles

  • Defense in Depth: Multiple layers of protection, not single points of failure
  • Graceful Degradation: Systems should fail safely, not catastrophically
  • Human-in-the-Loop: Keep humans involved in high-stakes decisions
  • Continuous Monitoring: Detect problems before they cause significant harm
  • Regular Testing: Probe systems for weaknesses before attackers do
  • Incident Response: Have plans ready for when (not if) things go wrong
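Graceful degradation and human-in-the-loop oversight can both be expressed as a thin wrapper around the model call. A minimal sketch, assuming a model that returns a (label, confidence) pair (the callables and the 0.7 floor are illustrative):

```python
def predict_with_fallback(model, features, fallback, confidence_floor=0.7):
    """Graceful degradation: trust the model only when it responds and is
    confident enough; otherwise fall back to a safe default rule (which
    may be 'route this case to a human')."""
    try:
        label, confidence = model(features)
    except Exception:
        return fallback(features)   # model unavailable: fail safe, not hard
    if confidence < confidence_floor:
        return fallback(features)   # low confidence: defer to the default
    return label
```

The same pattern covers outages (the model service raises), uncertainty (confidence below the floor), and high-stakes routing (the fallback escalates to a human reviewer) with one code path, which makes it easy to test and monitor.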

Key Takeaways

  • Model drift causes all models to degrade over time - continuous monitoring is essential
  • Adversarial attacks can manipulate models in ways that may be invisible to humans
  • The black box problem creates challenges for accountability and regulatory compliance
  • Models have fundamental limitations - no true understanding, correlation-based, distribution-dependent
  • Operational risks are as important as model risks
  • Risk assessment should consider impact, frequency, detectability, and reversibility
  • Resilient systems require defense in depth, graceful degradation, and human oversight