Module 7 - Part 3 of 6

Privacy-Preserving Machine Learning

📚 Estimated: 2.5-3 hours 🎓 Advanced Level 🔒 Technical Focus

🔒 Introduction

Privacy-Preserving Machine Learning (PPML) encompasses a suite of technical approaches that enable AI systems to learn from data while protecting individual privacy. These technologies can significantly reduce privacy risks and support compliance with data protection principles.

For AI governance professionals, understanding PPML is essential for: (1) assessing whether privacy-enhancing technologies should be required for specific AI deployments, (2) evaluating vendor claims about privacy protection, and (3) advising on technical safeguards in DPIAs and compliance assessments.

💡 Key Insight

PPML technologies are increasingly referenced in regulatory guidance. The EU AI Act mentions privacy-preserving techniques as relevant safeguards, and supervisory authorities consider whether privacy-enhancing technologies could have been employed when assessing DPIA adequacy. Understanding these technologies is now a professional competency for AI governance roles.

🛠 Core PPML Technologies

Four main categories of privacy-preserving technologies are relevant to AI systems. Each addresses different aspects of privacy protection and has distinct strengths and limitations.

📊 Differential Privacy
A mathematical framework that adds calibrated noise to data or query results, providing provable privacy guarantees while preserving aggregate statistical properties.
Keywords: statistical noise, privacy budget, provable guarantees

🌐 Federated Learning
A distributed training approach in which models learn from data held on local devices, so raw data never leaves user control.
Keywords: decentralized, local training, model aggregation

👥 Secure Multi-Party Computation
Cryptographic protocols that allow multiple parties to jointly compute functions over their combined data without revealing their individual inputs.
Keywords: cryptographic, joint computation, no data sharing

🔒 Homomorphic Encryption
An encryption scheme that allows computation directly on encrypted data, producing encrypted results that, once decrypted, match the results of the same operations on plaintext.
Keywords: compute on ciphertext, never decrypted, end-to-end

📊 Differential Privacy in Depth

Differential privacy (DP) provides a mathematical definition of privacy with provable guarantees. It bounds how much the output of any analysis can reveal about whether any specific individual's data was included in the input dataset.

📜 The Epsilon Parameter

Differential privacy is parameterized by epsilon (ε), the "privacy budget." Lower epsilon means stronger privacy but potentially less accurate results. Typical values range from 0.1 (very strong) to 10 (weaker). Organizations must balance privacy guarantees against utility needs.

  • ε = 0.1: Very strong privacy, significant noise, some utility loss
  • ε = 1: Standard privacy level, reasonable utility preservation
  • ε = 5-10: Weaker privacy, higher utility, may be insufficient for sensitive data
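To make the privacy budget concrete, here is a minimal sketch (in Python, using NumPy) of the Laplace mechanism applied to a counting query. The function name and parameters are illustrative, not from any particular DP library:

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float, rng) -> float:
    """Release a count with epsilon-DP via the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so the noise scale is 1 / epsilon:
    smaller epsilon means a larger scale and hence more noise.
    """
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(seed=0)
true_count = 1000
for eps in (0.1, 1.0, 10.0):
    noisy = laplace_count(true_count, eps, rng)
    print(f"epsilon={eps:>4}: noisy count = {noisy:.1f}")
```

Running this shows the trade-off from the bullet list directly: at ε = 0.1 the released count can be off by tens, while at ε = 10 it is typically within a fraction of a unit of the truth.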

Differential Privacy Mechanism Flow

📊 Original Data → 🎲 Add Calibrated Noise → 📈 Noisy Result → 🔒 Privacy Protected

Applications in AI

  • Training Data Protection: DP-SGD (Differentially Private Stochastic Gradient Descent) trains ML models with privacy guarantees
  • Query Answering: Private responses to database queries about aggregate statistics
  • Synthetic Data: Generate privacy-preserving synthetic datasets for sharing or testing
  • Model Publishing: Release trained models with provable privacy guarantees about training data
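The core of DP-SGD can be sketched in a few lines: clip each example's gradient to bound its influence, then add calibrated Gaussian noise before updating the weights. The sketch below uses NumPy and logistic regression; the function name, hyperparameter values, and data are illustrative, and real use would rely on a vetted library with proper privacy accounting:

```python
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, clip_norm=1.0, noise_mult=1.0, rng=None):
    """One DP-SGD step for logistic regression (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    preds = 1.0 / (1.0 + np.exp(-X @ w))           # sigmoid
    per_example_grads = (preds - y)[:, None] * X   # one gradient row per example
    # Clip each example's gradient so no individual can move the model too far.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_norm)
    # Add Gaussian noise calibrated to the clipping bound, then average.
    noise = rng.normal(0.0, noise_mult * clip_norm, size=w.shape)
    noisy_mean_grad = (clipped.sum(axis=0) + noise) / len(X)
    return w - lr * noisy_mean_grad

rng = np.random.default_rng(42)
X = rng.normal(size=(64, 3))
y = (X[:, 0] > 0).astype(float)   # label depends only on the first feature
w = np.zeros(3)
for _ in range(200):
    w = dp_sgd_step(w, X, y, rng=rng)
```

The clipping step is what turns an ordinary gradient update into one with bounded sensitivity, which is what makes the subsequent noise addition yield a formal (ε, δ) guarantee once the steps are accounted for.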
📖 Real-World Implementation: Apple

Apple uses local differential privacy for collecting usage statistics from iOS devices. Each device adds noise locally before sending data, meaning Apple never receives exact individual values. This allows aggregate analysis (e.g., popular emoji usage) while protecting any single user's specific inputs.
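The classic building block behind local DP of this kind is randomized response: each device flips its true bit with a known probability before reporting, and the collector debiases the aggregate. A minimal sketch (function names and the 30% figure are illustrative):

```python
import numpy as np

def randomized_response(bit: int, epsilon: float, rng) -> int:
    """Report a bit under epsilon-local-DP via randomized response.

    The true bit is kept with probability e^eps / (1 + e^eps) and
    flipped otherwise, so the collector never learns any single
    user's value with certainty.
    """
    p_keep = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return bit if rng.random() < p_keep else 1 - bit

def estimate_mean(reports, epsilon):
    """Debias the aggregate by inverting the known flip probability."""
    p = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return (np.mean(reports) - (1 - p)) / (2 * p - 1)

rng = np.random.default_rng(7)
true_bits = rng.random(100_000) < 0.30   # 30% of users "use the emoji"
reports = [randomized_response(int(b), 1.0, rng) for b in true_bits]
print(round(estimate_mean(reports, 1.0), 3))  # close to 0.30
```

No individual report is trustworthy, yet across 100,000 users the population-level estimate is accurate to within about a percentage point.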

🌐 Federated Learning

Federated learning (FL) enables model training across decentralized data sources without centralizing the raw data. Instead of collecting data, the model travels to the data, computes updates locally, and only shares model parameters.

Federated Learning Architecture

💻 Local Device 1: train locally
💻 Local Device 2: train locally
💻 Local Device N: train locally
      ↓ (model updates only)
🖥 Central Server: aggregate updates only
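The aggregation rule most commonly used is FedAvg: each client takes gradient steps on its own data, and the server averages the resulting weights, weighted by client dataset size. A minimal sketch with NumPy and linear regression (the three "clients," learning rate, and data are illustrative):

```python
import numpy as np

def local_step(w, X, y, lr=0.5):
    """One local gradient step for linear regression on a client's own data."""
    grad = 2 * X.T @ (X @ w - y) / len(X)
    return w - lr * grad

def fed_avg(client_data, rounds=50, dim=2):
    """FedAvg sketch: clients train locally, the server averages the weights.

    Only model weights cross the network; each client's (X, y) stays local.
    """
    w_global = np.zeros(dim)
    for _ in range(rounds):
        local_weights = [local_step(w_global.copy(), X, y) for X, y in client_data]
        sizes = np.array([len(X) for X, _ in client_data], dtype=float)
        w_global = np.average(local_weights, axis=0, weights=sizes)
    return w_global

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])
clients = []
for n in (40, 60, 80):  # three parties with different amounts of local data
    X = rng.normal(size=(n, 2))
    clients.append((X, X @ true_w + 0.01 * rng.normal(size=n)))

print(fed_avg(clients))  # converges close to [2.0, -1.0]
```

The server never sees any client's X or y, only weight vectors, which is exactly why the residual leakage risks discussed below (gradient inversion, memorization) focus on what those weights can reveal.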

✅ Privacy Benefits

  • Raw data never leaves user devices or organizational boundaries
  • Supports data minimization: only model updates are transmitted
  • Reduces the need to move raw data between organizations
  • Can help satisfy data localization requirements, since raw data stays in place

⚠ Privacy Limitations

Federated learning alone does not guarantee complete privacy. Model updates can potentially leak information about training data through:

  • Gradient Inversion: Attackers may partially reconstruct training data from gradients
  • Membership Inference: Determining if specific data was used in training
  • Model Memorization: Models may memorize and reveal specific training examples

Best practice: Combine federated learning with differential privacy (DP-FL) for stronger guarantees.

📖 Healthcare Application

Multiple hospitals want to train a diagnostic AI model but cannot share patient data due to GDPR/HIPAA constraints. Federated learning enables each hospital to train locally on their patient data, sharing only model improvements. The resulting model benefits from diverse data across institutions without any hospital accessing another's patient records.

👥 Secure Multi-Party Computation

Secure Multi-Party Computation (SMPC) uses cryptographic protocols to enable multiple parties to jointly compute a function over their inputs while keeping those inputs private. Each party learns only the output, not other parties' inputs.

🔒 How SMPC Works

In SMPC, data is split into "secret shares" distributed among the parties. Computations are performed on these shares using cryptographic protocols; no single party ever sees the complete data, and the result can be reconstructed only when the shares are combined.
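Additive secret sharing, the simplest such scheme, can be sketched in a few lines. Any subset of fewer than all the shares looks uniformly random, yet the shares sum to the original value modulo a public prime (the modulus and the three-bank scenario are illustrative):

```python
import secrets

P = 2**61 - 1  # public prime modulus; all arithmetic is mod P

def share(value: int, n_parties: int):
    """Split a value into n additive secret shares mod P.

    Any n-1 shares are uniformly random; only their full sum
    reveals the value.
    """
    shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Three parties each secret-share a private figure.
inputs = [120, 45, 300]
all_shares = [share(v, 3) for v in inputs]

# Each party locally adds up the one share it holds from every input...
partial_sums = [sum(col) % P for col in zip(*all_shares)]
# ...and combining the partial sums yields only the aggregate.
print(reconstruct(partial_sums))  # 465
```

Addition of shares is trivial, as shown; secure multiplication requires more machinery (e.g. Beaver triples), which is one reason full SMPC protocols carry a high performance cost.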

AI Applications of SMPC

  • Private Model Inference: Run predictions without the model owner seeing inputs or the data owner seeing the model
  • Secure Aggregation: Combine data from multiple sources without any party seeing others' data
  • Privacy-Preserving Benchmarking: Compare performance metrics across organizations without revealing sensitive data
  • Collaborative Training: Multiple parties train models together without sharing raw data
📖 Financial Industry Use Case

Banks want to collaborate on an anti-money laundering AI but cannot share customer transaction data. Using SMPC, each bank secret-shares their transaction patterns. Joint computation identifies suspicious patterns across institutions without any bank seeing another's customer data. The output reveals only the aggregate analysis results.

🔒 Homomorphic Encryption

Homomorphic Encryption (HE) allows computations to be performed directly on encrypted data. The result, when decrypted, matches what would have been obtained from computing on the plaintext. Data remains encrypted throughout processing.

💡 Types of Homomorphic Encryption

  • Partially Homomorphic (PHE): supports one operation (addition or multiplication) an unlimited number of times
  • Somewhat Homomorphic (SHE): supports both operations, but only a limited number of times
  • Fully Homomorphic (FHE): supports unlimited additions and multiplications, and can therefore compute any function
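A PHE scheme can be demonstrated concretely with a toy Paillier cryptosystem, which is additively homomorphic: multiplying two ciphertexts yields a ciphertext of the sum of the plaintexts. This sketch uses deliberately tiny primes for readability; real deployments use moduli of 2048 bits or more and a vetted library:

```python
from math import gcd
import secrets

# Toy Paillier keypair (illustrative primes only -- NOT secure).
p, q = 1009, 1013
n = p * q
n2 = n * n
g = n + 1                                      # standard generator choice
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)
mu = pow(lam, -1, n)                           # valid because g = n + 1

def encrypt(m: int) -> int:
    r = secrets.randbelow(n - 2) + 1
    while gcd(r, n) != 1:
        r = secrets.randbelow(n - 2) + 1
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    x = pow(c, lam, n2)
    return ((x - 1) // n) * mu % n

c1, c2 = encrypt(42), encrypt(17)
# Multiplying ciphertexts adds the plaintexts -- no decryption needed.
c_sum = (c1 * c2) % n2
print(decrypt(c_sum))  # 59
```

The holder of the private key never has to see the individual inputs 42 and 17, only the encrypted sum, which is the essence of the "compute on ciphertext" property described above.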

AI Applications

  • Private Inference: Send encrypted data to cloud AI, receive encrypted predictions, decrypt locally
  • Secure Outsourcing: Process sensitive data on untrusted cloud infrastructure
  • Privacy-Preserving ML as a Service: Users get predictions without revealing their data to the service

⚠ Practical Limitations

Homomorphic encryption introduces significant computational overhead, commonly cited as anywhere from 10x to well over 1,000x slower than plaintext operations. This currently makes it impractical for training large models, though it is increasingly viable for inference on smaller models and for specific operations. Performance is improving rapidly with hardware acceleration and algorithmic advances.

📈 Technology Comparison

Selecting the appropriate PPML technology depends on threat model, performance requirements, and regulatory context. Often, combining techniques provides stronger protection.

| Criteria                  | Differential Privacy   | Federated Learning   | SMPC                      | HE              |
|---------------------------|------------------------|----------------------|---------------------------|-----------------|
| Privacy Guarantee         | Provable (ε)           | Moderate             | Cryptographic             | Cryptographic   |
| Performance Impact        | Low                    | Medium               | High                      | Very High       |
| Model Accuracy            | Reduced                | Preserved            | Preserved                 | Preserved       |
| Implementation Complexity | Low-Medium             | Medium               | High                      | High            |
| Best For                  | Analytics, Publishing  | Distributed Training | Multi-party Collaboration | Cloud Inference |

✅ Governance Recommendations

  • Require PPML assessment in DPIA process for high-risk AI
  • Document why chosen technique is appropriate for threat model
  • Consider combining techniques (e.g., DP + FL) for stronger protection
  • Evaluate vendor privacy claims technically: request epsilon values and the protocols used
  • Include PPML requirements in AI procurement specifications

📚 Key Takeaways

  • PPML Enables Compliance: These technologies support data minimization and privacy-by-design requirements
  • Differential Privacy: Provides mathematical privacy guarantees through calibrated noise
  • Federated Learning: Keeps data decentralized but may need additional protections
  • SMPC: Enables multi-party computation without data sharing
  • Homomorphic Encryption: Allows computation on encrypted data but with performance costs
  • Combination is Key: Often multiple techniques together provide robust protection
  • Governance Integration: PPML should be assessed in DPIAs and procurement processes