Introduction
Creating an LLM like GPT-4 or Claude involves multiple training phases, each serving a distinct purpose. Understanding these phases helps professionals evaluate model capabilities, anticipate limitations, and make informed decisions about when and how to customize models for specific needs.
The Three-Phase Training Process
Modern LLMs typically go through three main training phases:
Pre-training
Learning language from massive text corpora
The model learns to predict the next token by training on enormous text datasets - often trillions of tokens from books, websites, code, and other sources. This phase is extremely expensive (millions of dollars in compute) and creates the model's base capabilities.
- Self-supervised: No human labeling required
- Learns grammar, facts, reasoning patterns, coding
- Creates a general-purpose "foundation"
- Training data quality critically affects outcomes
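The core idea - predicting the next token from the text itself, with no human labels - can be illustrated with a toy count-based model. This is a deliberately minimal sketch, not a transformer: real pre-training learns the same objective with a neural network over trillions of tokens.

```python
from collections import Counter, defaultdict

def train_bigram(corpus_tokens):
    """Count how often each token follows each preceding token.
    The "labels" are just the next tokens in the text itself,
    which is why pre-training is called self-supervised."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, token):
    """Return the token most often seen following `token` in training."""
    return counts[token].most_common(1)[0][0]

tokens = "the cat sat on the mat the cat ran".split()
model = train_bigram(tokens)
print(predict_next(model, "the"))  # "cat" follows "the" most often here
```

The same principle scales up: better data and more parameters produce richer statistics than these simple counts, which is why data quality matters so much.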
Supervised Fine-Tuning (SFT)
Teaching the model to follow instructions
The pre-trained model is fine-tuned on curated examples of instruction-following. Human contractors write high-quality responses to diverse prompts, teaching the model the expected format and style of helpful responses.
- Uses human-written examples of good responses
- Transforms raw language model into assistant
- Teaches appropriate response formats
- Much smaller dataset than pre-training
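In practice, each SFT example is a prompt/response pair rendered into a single training string. The template and field names below are hypothetical, for illustration - real formats vary by lab - but the pattern is representative: loss is typically computed only on the response tokens, so the model learns to answer rather than echo prompts.

```python
# Hypothetical instruction-following examples; real SFT datasets
# contain thousands of human-written pairs like these.
sft_examples = [
    {"prompt": "Explain photosynthesis in one sentence.",
     "response": "Plants convert sunlight, water, and CO2 into sugar and oxygen."},
]

TEMPLATE = "### Instruction:\n{prompt}\n\n### Response:\n{response}"

def format_example(ex):
    """Render a prompt/response pair into the single text string the
    model is fine-tuned on. During training, the loss mask usually
    covers only the response portion."""
    return TEMPLATE.format(**ex)

print(format_example(sft_examples[0]))
```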
RLHF (Reinforcement Learning from Human Feedback)
Aligning with human preferences
Human evaluators rank different model outputs. A reward model learns these preferences, and reinforcement learning optimizes the LLM to produce higher-ranked responses. This is how models learn to be helpful, harmless, and honest.
- Humans rank/compare model outputs
- Reward model learns human preferences
- RL optimizes for higher rewards
- Critical for safety and alignment
Why Three Phases?
Pre-training creates capability (what the model CAN do). SFT and RLHF create alignment (what the model SHOULD do). A pre-trained model might generate harmful content or simply continue text instead of following instructions. The subsequent phases shape it into a useful, safe assistant.
RLHF in Depth
RLHF is perhaps the most important innovation in making LLMs useful and safe. Here's how it works:
The RLHF Process
1. Generate Comparisons: For a given prompt, generate multiple responses
2. Human Ranking: Human evaluators rank responses from best to worst
3. Train Reward Model: A separate model learns to predict human rankings
4. Optimize with RL: Use PPO (Proximal Policy Optimization) to train the LLM to maximize reward model scores
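Step 3 above is commonly implemented with a Bradley-Terry style pairwise loss: the reward model is penalized whenever it scores the human-rejected response above the human-chosen one. A minimal sketch of that loss (scores here are arbitrary illustrative numbers):

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Pairwise loss used to train reward models:
    -log(sigmoid(r_chosen - r_rejected)). It pushes the reward of
    the human-preferred response above the rejected one."""
    margin = score_chosen - score_rejected
    return -math.log(1 / (1 + math.exp(-margin)))

# When the reward model already agrees with the human ranking,
# the loss is small; when it disagrees, the loss is large.
print(preference_loss(2.0, 0.0) < preference_loss(0.0, 2.0))  # True
```

Once trained, the reward model scores any candidate response, giving the RL step (PPO) a differentiable stand-in for human judgment.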
Constitutional AI (CAI)
Anthropic (Claude's creator) developed Constitutional AI as an evolution of RLHF. Instead of relying solely on human feedback, the model is trained to critique and revise its own outputs according to a set of principles (a "constitution"). This reduces reliance on human labeling while maintaining alignment.
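The critique-and-revise loop can be sketched as pseudocode-style Python. Everything here is a hypothetical simplification: `ask_model` is a stub standing in for a real LLM call, and the two-principle "constitution" is illustrative, not Anthropic's actual one.

```python
# Hypothetical sketch of a Constitutional AI self-critique loop.
CONSTITUTION = [
    "Avoid producing harmful or dangerous content.",
    "Be honest about uncertainty.",
]

def ask_model(prompt):
    """Stub for an LLM call; a real implementation would query a model."""
    return f"[model output for: {prompt[:40]}...]"

def constitutional_revision(user_prompt):
    """Draft a response, then critique and revise it once per principle.
    The revised outputs become training data, reducing the need for
    human-labeled comparisons."""
    draft = ask_model(user_prompt)
    for principle in CONSTITUTION:
        critique = ask_model(
            f"Critique this response against the principle '{principle}':\n{draft}")
        draft = ask_model(
            f"Revise the response to address this critique:\n{critique}\n"
            f"Original response:\n{draft}")
    return draft
```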
Fine-Tuning for Specific Use Cases
Organizations often want to customize models for their specific needs. Several approaches exist, with different trade-offs:
Full Fine-Tuning
Update all model parameters on your data. Maximum customization but requires significant compute and data. Risk of "catastrophic forgetting" of general capabilities.
LoRA (Low-Rank Adaptation)
Freeze the base model and train small low-rank "adapter" matrices alongside it. Far more efficient - the number of trainable parameters often shrinks by orders of magnitude. Preserves base capabilities while adding specialized skills.
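The mechanics are simple: instead of updating a large frozen weight matrix W, LoRA learns two small matrices B and A whose product is added to it, so the effective weight is W + BA. A NumPy sketch with illustrative sizes (the dimensions and rank below are arbitrary, chosen only to show the parameter savings):

```python
import numpy as np

d, k, r = 1024, 1024, 8  # weight shape and a small LoRA rank (illustrative)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, k))          # frozen pre-trained weight
A = rng.normal(size=(r, k)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                 # B starts at zero, so the adapter
                                     # initially leaves the model unchanged

def adapted_forward(x):
    """Effective weight is W + B @ A; only A and B receive gradients."""
    return x @ (W + B @ A).T

full_params = W.size
lora_params = A.size + B.size
print(lora_params / full_params)  # tiny fraction of parameters trained
```

Because B is initialized to zero, training starts from exactly the base model's behavior and only gradually adds the specialization.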
QLoRA
Combines LoRA with quantization (reducing precision). Enables fine-tuning large models on consumer hardware. Popular for open-source model customization.
Prompt Tuning
Learn optimal "soft prompts" prepended to inputs. Extremely efficient but less flexible than LoRA. Good for narrow task adaptation.
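A "soft prompt" is just a block of trainable embedding vectors prepended to the input embeddings; the model's weights stay frozen. The sketch below uses made-up sizes (100-token vocabulary, 16-dimensional embeddings, 5 virtual tokens) purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_embed = rng.normal(size=(100, 16))  # frozen token embedding table
soft_prompt = np.zeros((5, 16))           # 5 trainable "virtual tokens";
                                          # only these vectors are updated

def embed_with_soft_prompt(token_ids):
    """Prepend the trained soft-prompt vectors to the frozen token
    embeddings; the base model itself is never modified."""
    return np.concatenate([soft_prompt, vocab_embed[token_ids]], axis=0)

x = embed_with_soft_prompt([3, 7, 9])
print(x.shape)  # (8, 16): 5 virtual tokens + 3 real tokens
```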
| Method | Compute Cost | Customization | Forgetting Risk |
|---|---|---|---|
| Full Fine-Tuning | Very High | Maximum | High |
| LoRA | Low | High | Low |
| QLoRA | Very Low | High | Low |
| Prompt Tuning | Minimal | Limited | None |
When to Fine-Tune vs. Prompt
Many use cases can be addressed through clever prompting without fine-tuning. Understanding when fine-tuning is truly necessary saves resources.
Use Prompting When:
- Task can be explained with examples and instructions
- Rapid iteration is needed
- Limited training data is available
- Base model already has relevant capabilities
Use Fine-Tuning When:
- Specific output format or style is required consistently
- Domain-specific knowledge needs to be deeply embedded
- Performance on a specific task is critical
- Reducing inference costs through shorter prompts is important
- You have high-quality training data (thousands of examples)
The Cost-Benefit Calculation
Fine-tuning has upfront costs (data preparation, training compute) but can reduce ongoing costs (shorter prompts, fewer API calls). For high-volume applications, fine-tuning may be economical even if prompting works.
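That trade-off reduces to a break-even calculation. The dollar figures below are hypothetical placeholders - real prices vary by provider and model - but the arithmetic is the point:

```python
# Hypothetical numbers for illustration; real prices vary by provider.
finetune_cost = 500.00            # one-off: data prep + training ($)
prompt_cost_per_call = 0.004      # long few-shot prompt per API call ($)
finetuned_cost_per_call = 0.001   # short prompt after fine-tuning ($)

savings_per_call = prompt_cost_per_call - finetuned_cost_per_call
break_even_calls = finetune_cost / savings_per_call
print(round(break_even_calls))  # calls needed before fine-tuning pays off
```

With these example numbers, fine-tuning pays for itself after roughly 167,000 calls - trivial volume for a high-traffic application, but possibly never reached by an internal tool.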
Data Requirements for Fine-Tuning
The quality and quantity of fine-tuning data significantly impacts results:
- Minimum: ~100-500 high-quality examples for narrow tasks
- Recommended: 1,000-10,000 examples for robust performance
- Quality > Quantity: Clean, consistent examples outperform noisy large datasets
- Diversity: Cover the range of expected inputs and edge cases
- Format: Match production prompt/response format exactly
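These checks can be partially automated before a training run. The sketch below assumes a hypothetical JSONL format with `prompt`/`response` keys - actual required fields depend on the training framework or API you use:

```python
import json

REQUIRED_KEYS = {"prompt", "response"}  # hypothetical schema

def validate_examples(jsonl_text):
    """Return a list of problems found in a JSONL fine-tuning file:
    invalid JSON lines, missing keys, or empty fields."""
    problems = []
    for i, line in enumerate(jsonl_text.strip().splitlines(), 1):
        try:
            ex = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {i}: not valid JSON")
            continue
        if not REQUIRED_KEYS <= ex.keys():
            problems.append(f"line {i}: missing keys {REQUIRED_KEYS - ex.keys()}")
        elif any(not str(ex[k]).strip() for k in REQUIRED_KEYS):
            problems.append(f"line {i}: empty field")
    return problems

data = '{"prompt": "Hi", "response": "Hello!"}\n{"prompt": "Hi"}'
print(validate_examples(data))  # flags line 2 for a missing "response" key
```

Automated checks like these catch formatting problems, but accuracy and bias review (see the governance note below) still requires human judgment.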
Governance Consideration
Fine-tuning data must be carefully curated. Biased or low-quality training examples will be learned by the model. Ensure data is reviewed for accuracy, bias, and appropriateness before training.
Key Takeaways
- LLM training has three phases: pre-training, supervised fine-tuning, and RLHF
- Pre-training creates capability; SFT and RLHF create alignment
- RLHF uses human preferences to optimize for helpful, harmless responses
- LoRA and QLoRA enable efficient fine-tuning without full parameter updates
- Many use cases can be addressed with prompting before resorting to fine-tuning
- Fine-tuning requires high-quality data - quality matters more than quantity
- Constitutional AI (CAI) reduces reliance on human feedback for alignment