Introduction
Creating an LLM like GPT-4 or Claude involves multiple training phases, each serving a distinct purpose. Understanding these phases helps professionals evaluate model capabilities, anticipate limitations, and make informed decisions about when and how to customize models for specific needs.
The Three-Phase Training Process
Modern LLMs typically go through three main training phases:
Pre-training
Learning language from massive text corpora
The model learns to predict the next token by training on enormous text datasets - often trillions of tokens from books, websites, code, and other sources. This phase is extremely expensive (millions of dollars in compute) and creates the model's base capabilities.
- Self-supervised: No human labeling required
- Learns grammar, facts, reasoning patterns, coding
- Creates a general-purpose "foundation"
- Training data quality critically affects outcomes
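The core idea - predicting the next token from the text itself, with no human labels - can be illustrated with a toy count-based model. This is a deliberately minimal sketch, not a transformer: real pre-training learns the same objective with a neural network over trillions of tokens.

```python
from collections import Counter, defaultdict

def train_bigram(corpus_tokens):
    """Count how often each token follows each preceding token.
    The "labels" are just the next tokens in the text itself,
    which is why pre-training is called self-supervised."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, token):
    """Return the token most often seen following `token` in training."""
    return counts[token].most_common(1)[0][0]

tokens = "the cat sat on the mat the cat ran".split()
model = train_bigram(tokens)
print(predict_next(model, "the"))  # "cat" follows "the" most often here
```

The same principle scales up: better data and more parameters produce richer statistics than these simple counts, which is why data quality matters so much.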
Supervised Fine-Tuning (SFT)
Teaching the model to follow instructions
The pre-trained model is fine-tuned on curated examples of instruction-following. Human contractors write high-quality responses to diverse prompts, teaching the model the expected format and style of helpful responses.
- Uses human-written examples of good responses
- Transforms raw language model into assistant
- Teaches appropriate response formats
- Much smaller dataset than pre-training
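In practice, each SFT example is a prompt/response pair rendered into a single training string. The template and field names below are hypothetical, for illustration - real formats vary by lab - but the pattern is representative: loss is typically computed only on the response tokens, so the model learns to answer rather than echo prompts.

```python
# Hypothetical instruction-following examples; real SFT datasets
# contain thousands of human-written pairs like these.
sft_examples = [
    {"prompt": "Explain photosynthesis in one sentence.",
     "response": "Plants convert sunlight, water, and CO2 into sugar and oxygen."},
]

TEMPLATE = "### Instruction:\n{prompt}\n\n### Response:\n{response}"

def format_example(ex):
    """Render a prompt/response pair into the single text string the
    model is fine-tuned on. During training, the loss mask usually
    covers only the response portion."""
    return TEMPLATE.format(**ex)

print(format_example(sft_examples[0]))
```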
RLHF (Reinforcement Learning from Human Feedback)
Aligning with human preferences
Human evaluators rank different model outputs. A reward model learns these preferences, and reinforcement learning optimizes the LLM to produce higher-ranked responses. This is how models learn to be helpful, harmless, and honest.
- Humans rank/compare model outputs
- Reward model learns human preferences
- RL optimizes for higher rewards
- Critical for safety and alignment
Why Three Phases?
Pre-training creates capability (what the model CAN do). SFT and RLHF create alignment (what the model SHOULD do). A pre-trained model might generate harmful content or simply continue text instead of following instructions. The subsequent phases shape it into a useful, safe assistant.
RLHF in Depth
RLHF is perhaps the most important innovation in making LLMs useful and safe. Here's how it works:
The RLHF Process
1. Generate Comparisons: For a given prompt, generate multiple responses
2. Human Ranking: Human evaluators rank responses from best to worst
3. Train Reward Model: A separate model learns to predict human rankings
4. Optimize with RL: Use PPO (Proximal Policy Optimization) to train the LLM to maximize reward model scores
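Step 3 above is commonly implemented with a Bradley-Terry style pairwise loss: the reward model is penalized whenever it scores the human-rejected response above the human-chosen one. A minimal sketch of that loss (scores here are arbitrary illustrative numbers):

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Pairwise loss used to train reward models:
    -log(sigmoid(r_chosen - r_rejected)). It pushes the reward of
    the human-preferred response above the rejected one."""
    margin = score_chosen - score_rejected
    return -math.log(1 / (1 + math.exp(-margin)))

# When the reward model already agrees with the human ranking,
# the loss is small; when it disagrees, the loss is large.
print(preference_loss(2.0, 0.0) < preference_loss(0.0, 2.0))  # True
```

Once trained, the reward model scores any candidate response, giving the RL step (PPO) a differentiable stand-in for human judgment.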
Constitutional AI (CAI)
Anthropic (Claude's creator) developed Constitutional AI as an evolution of RLHF. Instead of relying solely on human feedback, the model is trained to critique and revise its own outputs according to a set of principles (a "constitution"). This reduces reliance on human labeling while maintaining alignment.
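The critique-and-revise loop can be sketched as pseudocode-style Python. Everything here is a hypothetical simplification: `ask_model` is a stub standing in for a real LLM call, and the two-principle "constitution" is illustrative, not Anthropic's actual one.

```python
# Hypothetical sketch of a Constitutional AI self-critique loop.
CONSTITUTION = [
    "Avoid producing harmful or dangerous content.",
    "Be honest about uncertainty.",
]

def ask_model(prompt):
    """Stub for an LLM call; a real implementation would query a model."""
    return f"[model output for: {prompt[:40]}...]"

def constitutional_revision(user_prompt):
    """Draft a response, then critique and revise it once per principle.
    The revised outputs become training data, reducing the need for
    human-labeled comparisons."""
    draft = ask_model(user_prompt)
    for principle in CONSTITUTION:
        critique = ask_model(
            f"Critique this response against the principle '{principle}':\n{draft}")
        draft = ask_model(
            f"Revise the response to address this critique:\n{critique}\n"
            f"Original response:\n{draft}")
    return draft
```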
Fine-Tuning for Specific Use Cases
Organizations often want to customize models for their specific needs. Several approaches exist, with different trade-offs:
Full Fine-Tuning
Update all model parameters on your data. Maximum customization but requires significant compute and data. Risk of "catastrophic forgetting" of general capabilities.
LoRA (Low-Rank Adaptation)
Freeze the base model and train small low-rank "adapter" matrices alongside it. Far more efficient - the number of trainable parameters often shrinks by orders of magnitude. Preserves base capabilities while adding specialized skills.
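The mechanics are simple: instead of updating a large frozen weight matrix W, LoRA learns two small matrices B and A whose product is added to it, so the effective weight is W + BA. A NumPy sketch with illustrative sizes (the dimensions and rank below are arbitrary, chosen only to show the parameter savings):

```python
import numpy as np

d, k, r = 1024, 1024, 8  # weight shape and a small LoRA rank (illustrative)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, k))          # frozen pre-trained weight
A = rng.normal(size=(r, k)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                 # B starts at zero, so the adapter
                                     # initially leaves the model unchanged

def adapted_forward(x):
    """Effective weight is W + B @ A; only A and B receive gradients."""
    return x @ (W + B @ A).T

full_params = W.size
lora_params = A.size + B.size
print(lora_params / full_params)  # tiny fraction of parameters trained
```

Because B is initialized to zero, training starts from exactly the base model's behavior and only gradually adds the specialization.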
QLoRA
Combines LoRA with quantization (reducing precision). Enables fine-tuning large models on consumer hardware. Popular for open-source model customization.
Prompt Tuning
Learn optimal "soft prompts" prepended to inputs. Extremely efficient but less flexible than LoRA. Good for narrow task adaptation.
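A "soft prompt" is just a block of trainable embedding vectors prepended to the input embeddings; the model's weights stay frozen. The sketch below uses made-up sizes (100-token vocabulary, 16-dimensional embeddings, 5 virtual tokens) purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_embed = rng.normal(size=(100, 16))  # frozen token embedding table
soft_prompt = np.zeros((5, 16))           # 5 trainable "virtual tokens";
                                          # only these vectors are updated

def embed_with_soft_prompt(token_ids):
    """Prepend the trained soft-prompt vectors to the frozen token
    embeddings; the base model itself is never modified."""
    return np.concatenate([soft_prompt, vocab_embed[token_ids]], axis=0)

x = embed_with_soft_prompt([3, 7, 9])
print(x.shape)  # (8, 16): 5 virtual tokens + 3 real tokens
```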
| Method | Compute Cost | Customization | Forgetting Risk |
|---|---|---|---|
| Full Fine-Tuning | Very High | Maximum | High |
| LoRA | Low | High | Low |
| QLoRA | Very Low | High | Low |
| Prompt Tuning | Minimal | Limited | None |
When to Fine-Tune vs. Prompt
Many use cases can be addressed through clever prompting without fine-tuning. Understanding when fine-tuning is truly necessary saves resources.
Use Prompting When:
- Task can be explained with examples and instructions
- Rapid iteration is needed
- Limited training data is available
- Base model already has relevant capabilities
Use Fine-Tuning When:
- Specific output format or style is required consistently
- Domain-specific knowledge needs to be deeply embedded
- Performance on a specific task is critical
- Reducing inference costs through shorter prompts is important
- You have high-quality training data (thousands of examples)
The Cost-Benefit Calculation
Fine-tuning has upfront costs (data preparation, training compute) but can reduce ongoing costs (shorter prompts, fewer API calls). For high-volume applications, fine-tuning may be economical even if prompting works.
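That trade-off reduces to a break-even calculation. The dollar figures below are hypothetical placeholders - real prices vary by provider and model - but the arithmetic is the point:

```python
# Hypothetical numbers for illustration; real prices vary by provider.
finetune_cost = 500.00            # one-off: data prep + training ($)
prompt_cost_per_call = 0.004      # long few-shot prompt per API call ($)
finetuned_cost_per_call = 0.001   # short prompt after fine-tuning ($)

savings_per_call = prompt_cost_per_call - finetuned_cost_per_call
break_even_calls = finetune_cost / savings_per_call
print(round(break_even_calls))  # calls needed before fine-tuning pays off
```

With these example numbers, fine-tuning pays for itself after roughly 167,000 calls - trivial volume for a high-traffic application, but possibly never reached by an internal tool.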
Data Requirements for Fine-Tuning
The quality and quantity of fine-tuning data significantly impacts results:
- Minimum: ~100-500 high-quality examples for narrow tasks
- Recommended: 1,000-10,000 examples for robust performance
- Quality > Quantity: Clean, consistent examples outperform noisy large datasets
- Diversity: Cover the range of expected inputs and edge cases
- Format: Match production prompt/response format exactly
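These checks can be partially automated before a training run. The sketch below assumes a hypothetical JSONL format with `prompt`/`response` keys - actual required fields depend on the training framework or API you use:

```python
import json

REQUIRED_KEYS = {"prompt", "response"}  # hypothetical schema

def validate_examples(jsonl_text):
    """Return a list of problems found in a JSONL fine-tuning file:
    invalid JSON lines, missing keys, or empty fields."""
    problems = []
    for i, line in enumerate(jsonl_text.strip().splitlines(), 1):
        try:
            ex = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {i}: not valid JSON")
            continue
        if not REQUIRED_KEYS <= ex.keys():
            problems.append(f"line {i}: missing keys {REQUIRED_KEYS - ex.keys()}")
        elif any(not str(ex[k]).strip() for k in REQUIRED_KEYS):
            problems.append(f"line {i}: empty field")
    return problems

data = '{"prompt": "Hi", "response": "Hello!"}\n{"prompt": "Hi"}'
print(validate_examples(data))  # flags line 2 for a missing "response" key
```

Automated checks like these catch formatting problems, but accuracy and bias review (see the governance note below) still requires human judgment.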
Governance Consideration
Fine-tuning data must be carefully curated. Biased or low-quality training examples will be learned by the model. Ensure data is reviewed for accuracy, bias, and appropriateness before training.
Key Takeaways
- LLM training has three phases: pre-training, supervised fine-tuning, and RLHF
- Pre-training creates capability; SFT and RLHF create alignment
- RLHF uses human preferences to optimize for helpful, harmless responses
- LoRA and QLoRA enable efficient fine-tuning without full parameter updates
- Many use cases can be addressed with prompting before resorting to fine-tuning
- Fine-tuning requires high-quality data - quality matters more than quantity
- Constitutional AI (CAI) reduces reliance on human feedback for alignment