Module 7 - Part 3 of 6

Privacy-Preserving Machine Learning

📚 Estimated: 2.5-3 hours 🎓 Advanced Level 🔒 Technical Focus

🔒 Introduction

Privacy-Preserving Machine Learning (PPML) encompasses a suite of technical approaches that enable AI systems to learn from data while protecting individual privacy. These technologies can significantly reduce privacy risks and support compliance with data protection principles.

For AI governance professionals, understanding PPML is essential for: (1) assessing whether privacy-enhancing technologies should be required for specific AI deployments, (2) evaluating vendor claims about privacy protection, and (3) advising on technical safeguards in DPIAs and compliance assessments.

💡 Key Insight

PPML technologies are increasingly referenced in regulatory guidance. The EU AI Act mentions privacy-preserving techniques as relevant safeguards, and supervisory authorities consider whether privacy-enhancing technologies could have been employed when assessing DPIA adequacy. Understanding these technologies is now a professional competency for AI governance roles.

🛠 Core PPML Technologies

Four main categories of privacy-preserving technologies are relevant to AI systems. Each addresses different aspects of privacy protection and has distinct strengths and limitations.

📊 Differential Privacy
A mathematical framework that adds calibrated noise to data or query results, providing provable privacy guarantees while preserving aggregate statistical properties.
Keywords: statistical noise, privacy budget, provable guarantees

🌐 Federated Learning
A distributed training approach in which models learn from data held on local devices, so raw data never leaves user control.
Keywords: decentralized, local training, model aggregation

👥 Secure Multi-Party Computation
Cryptographic protocols that allow multiple parties to jointly compute functions over their combined data without revealing their individual inputs.
Keywords: cryptographic, joint computation, no data sharing

🔒 Homomorphic Encryption
An encryption scheme that allows computation directly on encrypted data, producing encrypted results that, once decrypted, match the results of the same operations on plaintext.
Keywords: compute on ciphertext, never decrypted, end-to-end

📊 Differential Privacy in Depth

Differential privacy (DP) provides a mathematical definition of privacy with provable guarantees. It bounds how much the output of any analysis can reveal about whether any specific individual's data was included in the input dataset.

📜 The Epsilon Parameter

Differential privacy is parameterized by epsilon (ε), the "privacy budget." Lower epsilon means stronger privacy but potentially less accurate results. Typical values range from 0.1 (very strong) to 10 (weaker). Organizations must balance privacy guarantees against utility needs.

  • ε = 0.1: Very strong privacy, significant noise, some utility loss
  • ε = 1: Standard privacy level, reasonable utility preservation
  • ε = 5-10: Weaker privacy, higher utility, may be insufficient for sensitive data
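To make the privacy budget concrete, here is a minimal sketch (in Python, using NumPy) of the Laplace mechanism applied to a counting query. The function name and parameters are illustrative, not from any particular DP library:

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float, rng) -> float:
    """Release a count with epsilon-DP via the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so the noise scale is 1 / epsilon:
    smaller epsilon means a larger scale and hence more noise.
    """
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(seed=0)
true_count = 1000
for eps in (0.1, 1.0, 10.0):
    noisy = laplace_count(true_count, eps, rng)
    print(f"epsilon={eps:>4}: noisy count = {noisy:.1f}")
```

Running this shows the trade-off from the bullet list directly: at ε = 0.1 the released count can be off by tens, while at ε = 10 it is typically within a fraction of a unit of the truth.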

Differential Privacy Mechanism Flow

📊 Original Data → 🎲 Add Calibrated Noise → 📈 Noisy Result → 🔒 Privacy Protected

Applications in AI

  • Training Data Protection: DP-SGD (Differentially Private Stochastic Gradient Descent) trains ML models with privacy guarantees
  • Query Answering: Private responses to database queries about aggregate statistics
  • Synthetic Data: Generate privacy-preserving synthetic datasets for sharing or testing
  • Model Publishing: Release trained models with provable privacy guarantees about training data
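The core of DP-SGD can be sketched in a few lines: clip each example's gradient to bound its influence, then add calibrated Gaussian noise before updating the weights. The sketch below uses NumPy and logistic regression; the function name, hyperparameter values, and data are illustrative, and real use would rely on a vetted library with proper privacy accounting:

```python
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, clip_norm=1.0, noise_mult=1.0, rng=None):
    """One DP-SGD step for logistic regression (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    preds = 1.0 / (1.0 + np.exp(-X @ w))           # sigmoid
    per_example_grads = (preds - y)[:, None] * X   # one gradient row per example
    # Clip each example's gradient so no individual can move the model too far.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_norm)
    # Add Gaussian noise calibrated to the clipping bound, then average.
    noise = rng.normal(0.0, noise_mult * clip_norm, size=w.shape)
    noisy_mean_grad = (clipped.sum(axis=0) + noise) / len(X)
    return w - lr * noisy_mean_grad

rng = np.random.default_rng(42)
X = rng.normal(size=(64, 3))
y = (X[:, 0] > 0).astype(float)   # label depends only on the first feature
w = np.zeros(3)
for _ in range(200):
    w = dp_sgd_step(w, X, y, rng=rng)
```

The clipping step is what turns an ordinary gradient update into one with bounded sensitivity, which is what makes the subsequent noise addition yield a formal (ε, δ) guarantee once the steps are accounted for.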
📖 Real-World Implementation: Apple

Apple uses local differential privacy for collecting usage statistics from iOS devices. Each device adds noise locally before sending data, meaning Apple never receives exact individual values. This allows aggregate analysis (e.g., popular emoji usage) while protecting any single user's specific inputs.
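The classic building block behind local DP of this kind is randomized response: each device flips its true bit with a known probability before reporting, and the collector debiases the aggregate. A minimal sketch (function names and the 30% figure are illustrative):

```python
import numpy as np

def randomized_response(bit: int, epsilon: float, rng) -> int:
    """Report a bit under epsilon-local-DP via randomized response.

    The true bit is kept with probability e^eps / (1 + e^eps) and
    flipped otherwise, so the collector never learns any single
    user's value with certainty.
    """
    p_keep = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return bit if rng.random() < p_keep else 1 - bit

def estimate_mean(reports, epsilon):
    """Debias the aggregate by inverting the known flip probability."""
    p = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return (np.mean(reports) - (1 - p)) / (2 * p - 1)

rng = np.random.default_rng(7)
true_bits = rng.random(100_000) < 0.30   # 30% of users "use the emoji"
reports = [randomized_response(int(b), 1.0, rng) for b in true_bits]
print(round(estimate_mean(reports, 1.0), 3))  # close to 0.30
```

No individual report is trustworthy, yet across 100,000 users the population-level estimate is accurate to within about a percentage point.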

🌐 Federated Learning

Federated learning (FL) enables model training across decentralized data sources without centralizing the raw data. Instead of collecting data, the model travels to the data, computes updates locally, and only shares model parameters.

Federated Learning Architecture

💻 Local Device 1: train locally
💻 Local Device 2: train locally
💻 Local Device N: train locally
      ↓ (model updates only)
🖥 Central Server: aggregate updates only
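The aggregation rule most commonly used is FedAvg: each client takes gradient steps on its own data, and the server averages the resulting weights, weighted by client dataset size. A minimal sketch with NumPy and linear regression (the three "clients," learning rate, and data are illustrative):

```python
import numpy as np

def local_step(w, X, y, lr=0.5):
    """One local gradient step for linear regression on a client's own data."""
    grad = 2 * X.T @ (X @ w - y) / len(X)
    return w - lr * grad

def fed_avg(client_data, rounds=50, dim=2):
    """FedAvg sketch: clients train locally, the server averages the weights.

    Only model weights cross the network; each client's (X, y) stays local.
    """
    w_global = np.zeros(dim)
    for _ in range(rounds):
        local_weights = [local_step(w_global.copy(), X, y) for X, y in client_data]
        sizes = np.array([len(X) for X, _ in client_data], dtype=float)
        w_global = np.average(local_weights, axis=0, weights=sizes)
    return w_global

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])
clients = []
for n in (40, 60, 80):  # three parties with different amounts of local data
    X = rng.normal(size=(n, 2))
    clients.append((X, X @ true_w + 0.01 * rng.normal(size=n)))

print(fed_avg(clients))  # converges close to [2.0, -1.0]
```

The server never sees any client's X or y, only weight vectors, which is exactly why the residual leakage risks discussed below (gradient inversion, memorization) focus on what those weights can reveal.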

✅ Privacy Benefits

  • Raw data never leaves user devices or organizational boundaries
  • Supports data minimization: only model updates are transmitted
  • Reduces the need to move raw data between organizations
  • Can help satisfy data localization requirements, since raw data stays in place

⚠ Privacy Limitations

Federated learning alone does not guarantee complete privacy. Model updates can potentially leak information about training data through:

  • Gradient Inversion: Attackers may partially reconstruct training data from gradients
  • Membership Inference: Determining if specific data was used in training
  • Model Memorization: Models may memorize and reveal specific training examples

Best practice: Combine federated learning with differential privacy (DP-FL) for stronger guarantees.

📖 Healthcare Application

Multiple hospitals want to train a diagnostic AI model but cannot share patient data due to GDPR/HIPAA constraints. Federated learning enables each hospital to train locally on their patient data, sharing only model improvements. The resulting model benefits from diverse data across institutions without any hospital accessing another's patient records.

👥 Secure Multi-Party Computation

Secure Multi-Party Computation (SMPC) uses cryptographic protocols to enable multiple parties to jointly compute a function over their inputs while keeping those inputs private. Each party learns only the output, not other parties' inputs.

🔒 How SMPC Works

In SMPC, data is split into "secret shares" distributed among the parties. Computations are performed on these shares using cryptographic protocols; no single party ever sees the complete data, and the result can be reconstructed only when the shares are combined.
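Additive secret sharing, the simplest such scheme, can be sketched in a few lines. Any subset of fewer than all the shares looks uniformly random, yet the shares sum to the original value modulo a public prime (the modulus and the three-bank scenario are illustrative):

```python
import secrets

P = 2**61 - 1  # public prime modulus; all arithmetic is mod P

def share(value: int, n_parties: int):
    """Split a value into n additive secret shares mod P.

    Any n-1 shares are uniformly random; only their full sum
    reveals the value.
    """
    shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Three parties each secret-share a private figure.
inputs = [120, 45, 300]
all_shares = [share(v, 3) for v in inputs]

# Each party locally adds up the one share it holds from every input...
partial_sums = [sum(col) % P for col in zip(*all_shares)]
# ...and combining the partial sums yields only the aggregate.
print(reconstruct(partial_sums))  # 465
```

Addition of shares is trivial, as shown; secure multiplication requires more machinery (e.g. Beaver triples), which is one reason full SMPC protocols carry a high performance cost.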

AI Applications of SMPC

  • Private Model Inference: Run predictions without the model owner seeing inputs or the data owner seeing the model
  • Secure Aggregation: Combine data from multiple sources without any party seeing others' data
  • Privacy-Preserving Benchmarking: Compare performance metrics across organizations without revealing sensitive data
  • Collaborative Training: Multiple parties train models together without sharing raw data
📖 Financial Industry Use Case

Banks want to collaborate on an anti-money laundering AI but cannot share customer transaction data. Using SMPC, each bank secret-shares their transaction patterns. Joint computation identifies suspicious patterns across institutions without any bank seeing another's customer data. The output reveals only the aggregate analysis results.

🔒 Homomorphic Encryption

Homomorphic Encryption (HE) allows computations to be performed directly on encrypted data. The result, when decrypted, matches what would have been obtained from computing on the plaintext. Data remains encrypted throughout processing.

💡 Types of Homomorphic Encryption

  • Partially Homomorphic (PHE): supports one operation (addition or multiplication) an unlimited number of times
  • Somewhat Homomorphic (SHE): supports both operations, but only a limited number of times
  • Fully Homomorphic (FHE): supports unlimited additions and multiplications, and can therefore compute any function
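A PHE scheme can be demonstrated concretely with a toy Paillier cryptosystem, which is additively homomorphic: multiplying two ciphertexts yields a ciphertext of the sum of the plaintexts. This sketch uses deliberately tiny primes for readability; real deployments use moduli of 2048 bits or more and a vetted library:

```python
from math import gcd
import secrets

# Toy Paillier keypair (illustrative primes only -- NOT secure).
p, q = 1009, 1013
n = p * q
n2 = n * n
g = n + 1                                      # standard generator choice
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)
mu = pow(lam, -1, n)                           # valid because g = n + 1

def encrypt(m: int) -> int:
    r = secrets.randbelow(n - 2) + 1
    while gcd(r, n) != 1:
        r = secrets.randbelow(n - 2) + 1
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    x = pow(c, lam, n2)
    return ((x - 1) // n) * mu % n

c1, c2 = encrypt(42), encrypt(17)
# Multiplying ciphertexts adds the plaintexts -- no decryption needed.
c_sum = (c1 * c2) % n2
print(decrypt(c_sum))  # 59
```

The holder of the private key never has to see the individual inputs 42 and 17, only the encrypted sum, which is the essence of the "compute on ciphertext" property described above.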

AI Applications

  • Private Inference: Send encrypted data to cloud AI, receive encrypted predictions, decrypt locally
  • Secure Outsourcing: Process sensitive data on untrusted cloud infrastructure
  • Privacy-Preserving ML as a Service: Users get predictions without revealing their data to the service

⚠ Practical Limitations

Homomorphic encryption introduces significant computational overhead, commonly cited as anywhere from 10x to well over 1,000x slower than plaintext operations. This currently makes it impractical for training large models, though it is increasingly viable for inference on smaller models and for specific operations. Performance is improving rapidly with hardware acceleration and algorithmic advances.

📈 Technology Comparison

Selecting the appropriate PPML technology depends on threat model, performance requirements, and regulatory context. Often, combining techniques provides stronger protection.

| Criteria                  | Differential Privacy   | Federated Learning   | SMPC                      | HE              |
|---------------------------|------------------------|----------------------|---------------------------|-----------------|
| Privacy Guarantee         | Provable (ε)           | Moderate             | Cryptographic             | Cryptographic   |
| Performance Impact        | Low                    | Medium               | High                      | Very High       |
| Model Accuracy            | Reduced                | Preserved            | Preserved                 | Preserved       |
| Implementation Complexity | Low-Medium             | Medium               | High                      | High            |
| Best For                  | Analytics, Publishing  | Distributed Training | Multi-party Collaboration | Cloud Inference |

✅ Governance Recommendations

  • Require PPML assessment in DPIA process for high-risk AI
  • Document why chosen technique is appropriate for threat model
  • Consider combining techniques (e.g., DP + FL) for stronger protection
  • Evaluate vendor privacy claims technically: request epsilon values and the protocols used
  • Include PPML requirements in AI procurement specifications

📚 Key Takeaways

  • PPML Enables Compliance: These technologies support data minimization and privacy-by-design requirements
  • Differential Privacy: Provides mathematical privacy guarantees through calibrated noise
  • Federated Learning: Keeps data decentralized but may need additional protections
  • SMPC: Enables multi-party computation without data sharing
  • Homomorphic Encryption: Allows computation on encrypted data but with performance costs
  • Combination is Key: Often multiple techniques together provide robust protection
  • Governance Integration: PPML should be assessed in DPIAs and procurement processes