Introduction
Building a machine learning system is not just about algorithms - it's about a complete pipeline that transforms raw data into deployed, monitored predictions. Understanding this pipeline helps professionals ask the right questions, estimate realistic timelines, and identify potential failure points.
Surveys of practitioners consistently report that data scientists spend roughly 80% of their time on data preparation - not on the "sexy" parts of building models. Knowing the full pipeline explains why AI projects often take longer and cost more than expected.
The ML Pipeline Stages
Problem Definition
Define the business problem, success metrics, and how ML will address it. This is where many projects go wrong - solving the wrong problem or setting unrealistic expectations.
Data Collection
Gather the data needed to train and validate the model. This may involve accessing internal databases, acquiring external data, or creating new data collection mechanisms.
Data Preprocessing
Clean, transform, and prepare raw data for modeling. This includes handling missing values, encoding categories, normalizing scales, and creating features.
Model Training
Select algorithms, train models on the prepared data, and tune hyperparameters. This iterative process finds the best model configuration for the problem.
Model Validation
Evaluate model performance on held-out data. Ensure the model generalizes well and doesn't just memorize training examples (overfitting).
Deployment
Move the validated model into production where it can make predictions on real data. This includes infrastructure setup, integration, and release planning.
Monitoring & Maintenance
Continuously monitor model performance in production. Models degrade over time as data patterns change, requiring ongoing attention and periodic retraining.
Stage Deep Dive: Data Preprocessing
Data preprocessing is where most time is spent. Understanding common preprocessing tasks helps explain why data quality is so critical.
Data Cleaning
- Handling missing values (impute or remove)
- Removing duplicates
- Fixing inconsistencies
- Correcting errors and outliers
- Standardizing formats
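The cleaning steps above can be sketched on a toy record set. This is a minimal pure-Python illustration with hypothetical field names; real projects typically use a library such as pandas for this work.

```python
# Sketch of common cleaning steps: dedup, imputation, format standardization.
# Field names and values are hypothetical.
from statistics import mean

records = [
    {"customer_id": 1, "age": 34,   "country": "us"},
    {"customer_id": 2, "age": None, "country": "US"},   # missing value
    {"customer_id": 2, "age": None, "country": "US"},   # exact duplicate
    {"customer_id": 3, "age": 29,   "country": "usa"},  # inconsistent format
]

# Remove exact duplicates while preserving order.
seen, deduped = set(), []
for r in records:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# Impute missing ages with the mean of the observed ages.
ages = [r["age"] for r in deduped if r["age"] is not None]
for r in deduped:
    if r["age"] is None:
        r["age"] = mean(ages)

# Standardize inconsistent country codes to one canonical form.
country_map = {"us": "US", "usa": "US"}
for r in deduped:
    r["country"] = country_map.get(r["country"].lower(), r["country"].upper())

print(len(deduped))       # 3 rows remain after deduplication
print(deduped[1]["age"])  # 31.5 - imputed mean of 34 and 29
```

Whether to impute or drop missing values depends on how much data you have and why the values are missing; mean imputation is just one common default.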
Data Transformation
- Encoding categorical variables
- Normalizing numerical scales
- Handling date/time features
- Text tokenization
- Image resizing/augmentation
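Two of the transformations above - categorical encoding and scale normalization - can be sketched in a few lines. These are simplified pure-Python versions; in practice libraries such as scikit-learn provide equivalents (e.g. one-hot encoders and min-max scalers).

```python
# Minimal sketches of one-hot encoding and min-max normalization.

def one_hot(values):
    """Encode a categorical column as one-hot vectors (one column per category)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories

def min_max_scale(values):
    """Rescale a numeric column to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

encoded, cats = one_hot(["red", "blue", "red"])
print(cats)     # ['blue', 'red']
print(encoded)  # [[0, 1], [1, 0], [0, 1]]
print(min_max_scale([10, 20, 40])[0], min_max_scale([10, 20, 40])[2])  # 0.0 1.0
```

Note that in a real pipeline the categories and scaling bounds must be learned from the training set only, then reused on validation, test, and live data - otherwise information leaks across the split.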
Feature Engineering
Feature engineering is the process of creating new input variables that help the model learn better. Good features can dramatically improve model performance - often more than changing algorithms.
Example: Predicting Customer Churn
Raw data might include transaction dates. Feature engineering could create: "days since last purchase," "purchase frequency," "trend in purchase amounts" - features that better capture churn risk signals.
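The churn features described above can be derived from raw transaction dates in a few lines. The data below is hypothetical, and "today" is fixed rather than taken from the clock so the example is reproducible.

```python
# Sketch: deriving churn-risk features from raw purchase dates.
from datetime import date

purchases = [date(2024, 1, 5), date(2024, 2, 10), date(2024, 3, 1)]
today = date(2024, 4, 1)  # fixed reference date for reproducibility

# Recency: how long since the customer last bought anything.
days_since_last = (today - max(purchases)).days

# Frequency: purchases per ~30-day period over the customer's active span.
span_days = (max(purchases) - min(purchases)).days
purchase_frequency = len(purchases) / (span_days / 30)

print(days_since_last)  # 31
```

A trend feature (e.g. the slope of purchase amounts over time) would follow the same idea: collapse a sequence of raw events into a single number the model can use directly.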
Stage Deep Dive: Training and Validation
Data Splitting
Data is typically split into three sets to ensure models generalize well:
Training Set (60-80%)
- Used to train the model
- Model learns patterns from this data
- Largest portion of data
Validation Set (10-20%)
- Used to tune hyperparameters
- Helps prevent overfitting
- Guides model selection
Test Set (10-20%)
- Final evaluation only
- Never seen during training
- Estimates real-world performance
Cross-Validation
- Multiple train/validation splits
- More robust performance estimates
- Better use of limited data
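The splitting scheme above can be sketched in pure Python. This is an illustrative version with an assumed 60/20/20 ratio; in practice scikit-learn's `train_test_split` and `KFold` handle this, including stratification.

```python
# Sketch: 60/20/20 train/validation/test split plus k-fold index generation.
import random

def split(data, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle once, then carve the data into train/validation/test sets."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(n * train_frac), int(n * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def k_fold_indices(n, k):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, val

train, val, test = split(list(range(100)))
print(len(train), len(val), len(test))  # 60 20 20
```

The fixed seed matters: without it, rerunning the pipeline reshuffles the data and test examples can silently leak into training.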
Overfitting vs. Underfitting
Overfitting: The model memorizes training data but fails on new data. Like a student who memorizes answers without understanding concepts.
Underfitting: The model is too simple to capture patterns. Like using a straight line to fit curved data.
Proper validation helps detect both problems before deployment.
Stage Deep Dive: Deployment
Getting a model into production is often harder than building it. Deployment involves technical, operational, and organizational challenges.
Deployment Patterns
- Batch Inference: Predictions generated periodically (e.g., nightly scoring of all customers)
- Real-time API: Predictions on-demand via API calls (e.g., fraud detection at transaction time)
- Edge Deployment: Model runs on end-user devices (e.g., mobile apps, IoT devices)
- Streaming: Continuous predictions on data streams (e.g., real-time anomaly detection)
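The batch-inference pattern is the simplest of the four and can be sketched directly: score every record in one pass, typically from a scheduled job. The model class and record layout below are hypothetical stand-ins, not a real serving framework.

```python
# Sketch of the batch-inference pattern: score all customers in one pass,
# as a nightly cron job might. Model and data are toy stand-ins.

class ThresholdModel:
    """Toy stand-in for a trained churn model."""
    def predict(self, features):
        return 1 if features["days_since_last_purchase"] > 60 else 0

def nightly_batch_scoring(model, customers):
    """Score every customer and return {customer_id: prediction}."""
    scores = {}
    for customer in customers:
        scores[customer["id"]] = model.predict(customer["features"])
    return scores

customers = [
    {"id": "a1", "features": {"days_since_last_purchase": 90}},
    {"id": "b2", "features": {"days_since_last_purchase": 10}},
]
print(nightly_batch_scoring(ThresholdModel(), customers))  # {'a1': 1, 'b2': 0}
```

A real-time API wraps the same `predict` call behind an HTTP endpoint instead of a loop; the model code is identical, only the serving wrapper changes.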
Shadow Mode and A/B Testing
Before full deployment, models are often tested in "shadow mode" - running alongside existing systems without affecting decisions. A/B testing then gradually rolls out the new model to measure real-world impact.
Stage Deep Dive: Monitoring & Maintenance
ML models are not "set and forget." They require ongoing monitoring and maintenance - a fact often underestimated in project planning.
What to Monitor
- Prediction Quality: Accuracy, precision, recall on live data
- Input Data Quality: Missing values, out-of-range values, unexpected categories
- Data Drift: Changes in input data distributions over time
- Concept Drift: Changes in the relationship between inputs and outputs
- System Performance: Latency, throughput, error rates
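A simple data-drift check compares the distribution of a live feature against its distribution at training time. The sketch below uses the Population Stability Index (PSI) over equal-width bins; the 0.2 alert threshold is a common rule of thumb, not a standard, and the data is synthetic.

```python
# Sketch: detecting data drift with the Population Stability Index (PSI).
import math

def psi(expected, actual, bins=10):
    """PSI between two samples of one numeric feature (higher = more drift)."""
    lo, hi = min(expected + actual), max(expected + actual)
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        # Small epsilon keeps empty bins out of log(0).
        return [(c / len(values)) or 1e-6 for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((a - b) * math.log(a / b) for a, b in zip(p, q))

training_data = [0.1 * i for i in range(100)]    # feature at training time
live_data = [0.1 * i + 3.0 for i in range(100)]  # shifted live feature

drift = psi(training_data, live_data)
print(drift > 0.2)  # True - distribution has shifted; investigate retraining
```

Checks like this catch data drift directly; concept drift usually has to be caught indirectly, by tracking prediction quality once ground-truth labels arrive.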
Model Decay
All models degrade over time as the world changes. A customer churn model trained on pre-pandemic data may not work well post-pandemic. Plan for regular retraining and model refresh cycles.
The Iterative Nature of ML
The ML pipeline is not linear - it's iterative. Insights from later stages often lead back to earlier stages:
- Poor model performance may reveal data quality issues
- Validation results may suggest new features to engineer
- Production monitoring may identify needed retraining
- Business feedback may redefine success metrics
Project Planning Implication
ML projects are inherently uncertain. Unlike traditional software where requirements can be specified upfront, ML success depends on data quality and pattern learnability that can only be discovered through experimentation. Build iteration time into project plans.
Key Takeaways
- The ML pipeline has seven stages from problem definition through monitoring
- Data preprocessing typically consumes the bulk of project time - often cited at around 80%
- Proper data splitting prevents overfitting and ensures reliable performance estimates
- Deployment is often harder than model building - plan accordingly
- Models require ongoing monitoring and periodic retraining
- The pipeline is iterative - insights at later stages drive changes to earlier stages
- Build uncertainty and iteration time into ML project plans