Introduction
Large Language Models (LLMs) like GPT-4 and Claude have captured global attention, but how do they actually work? This part explains the core concepts - transformers, attention, and tokenization - at a level that enables meaningful engagement with these technologies without requiring a PhD in machine learning.
The Core Idea: Next Word Prediction
At their heart, LLMs do something surprisingly simple: they predict the next word (or more precisely, token) in a sequence. Given the text "The cat sat on the...", the model predicts that "mat" is a likely next word.
From Simple to Sophisticated
While the task is simple, when scaled to billions of parameters and trained on trillions of words, next-word prediction gives rise to sophisticated behaviors. To predict well, the model must learn grammar, facts, reasoning patterns, and more - all encoded in its parameters.
This is why LLMs seem to "understand" and "reason" - they've learned patterns so deeply that they can generate coherent, contextually appropriate text. However, they're doing statistical pattern matching, not true understanding.
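As a toy illustration of next-word prediction (not how real LLMs work internally, since they learn these statistics in neural network weights rather than explicit counts), a bigram model counts which word follows which in a corpus and predicts the most frequent continuation:

```python
from collections import Counter, defaultdict

def train_bigram(corpus: str) -> dict:
    """Count, for each word, how often each other word follows it."""
    words = corpus.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts: dict, word: str) -> str:
    """Return the statistically most likely next word."""
    return counts[word].most_common(1)[0][0]

corpus = "the cat sat on the mat the cat sat on the rug the cat slept"
model = train_bigram(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" most often
```

LLMs do conceptually the same thing, but condition on thousands of preceding tokens rather than just one word, which is what makes the learned patterns so much richer.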
Tokenization: Breaking Down Text
Before processing text, LLMs break it into tokens - the basic units the model works with. Tokens are not always complete words.
Tokenization Example
The sentence "Tokenization is fascinating!" might be split by a subword tokenizer into pieces like ["Token", "ization", " is", " fascinat", "ing", "!"] (the exact split depends on the tokenizer's learned vocabulary).
Notice how a single word can be split into multiple subword tokens.
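Real tokenizers (BPE, WordPiece, SentencePiece) use vocabularies learned from data, but the core matching step can be sketched as a greedy longest-match over a hand-picked toy vocabulary:

```python
def tokenize(text: str, vocab: set) -> list:
    """Greedy longest-match subword tokenization (simplified sketch)."""
    tokens = []
    i = 0
    while i < len(text):
        # Take the longest substring starting at i that is in the vocab
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as its own token
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"Token", "ization", " is", " fascinat", "ing", "!"}
print(tokenize("Tokenization is fascinating!", vocab))
# → ['Token', 'ization', ' is', ' fascinat', 'ing', '!']
```

Because the vocabulary is learned from mostly English text, common English words become single tokens while rare words and other languages fragment into many pieces.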
Why Tokenization Matters
- Context Limits: Models can only process a fixed number of tokens at once (e.g., 128K). Text beyond the limit must be truncated or otherwise condensed.
- Pricing: API costs are typically based on token count.
- Performance: Rare or unusual words are split into many tokens, which can make them harder for the model to handle.
- Languages: Non-English languages often require more tokens per word.
Practical Implication
A rough rule: 1 token is approximately 0.75 English words, or about 4 characters. When working with LLM APIs, understanding tokenization helps estimate costs and context usage.
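That rule of thumb can be turned into a quick estimator (the heuristic, not an exact count; for exact numbers you would use the provider's own tokenizer):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters-per-token heuristic."""
    return max(1, round(len(text) / 4))

def estimate_cost(text: str, usd_per_1k_tokens: float) -> float:
    """Approximate API cost for the input text."""
    return estimate_tokens(text) / 1000 * usd_per_1k_tokens

prompt = "Summarize the following report in three bullet points."
print(estimate_tokens(prompt))
```

This is only a ballpark for English text; code, non-English text, and unusual formatting can tokenize much less efficiently.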
The Transformer Architecture
The transformer, introduced in the 2017 paper "Attention Is All You Need", is the architecture behind all modern LLMs. Its key innovation is the "attention mechanism", which allows the model to weigh the relevance of different parts of the input when making predictions.
What Problem Does Attention Solve?
Consider: "The animal didn't cross the street because it was too tired."
What does "it" refer to? The animal. A model needs to connect "it" back to "animal" across the sentence. Attention enables this by allowing every word to "attend to" every other word, determining which are most relevant.
How Attention Works (Conceptually)
For each word in the input, attention:
- Scores how relevant every other word is to understanding this word
- Forms a weighted combination of all the words, based on those relevance scores
- Uses that combination as the word's context-aware representation
Query
"What am I looking for?" - Each word asks what information it needs
Key
"What do I contain?" - Each word advertises what information it offers
Value
"Here's my content" - The actual information to be retrieved
Score
Match queries to keys to determine relevance weights
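The query/key/value scoring above can be sketched in pure Python. Real implementations work on matrices with learned projection weights; the small vectors here are made up purely for illustration:

```python
import math

def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    # Score: how well does the query match each key?
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Output: weighted combination of the value vectors
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# Toy vectors: the query matches the first key much more strongly
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
print(attention(query, keys, values))  # output leans toward the first value
```

The division by the square root of the vector dimension keeps the dot products from growing too large, which would otherwise push the softmax toward near-binary weights.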
Self-Attention
LLMs use "self-attention" - the input sequence attends to itself. This allows the model to build representations that capture relationships between all parts of the input, regardless of distance.
Transformer Components
A transformer consists of stacked layers, each containing attention and processing components:
Key Components
- Embedding Layer: Converts tokens to numerical vectors the model can process
- Positional Encoding: Adds position information (since attention is position-agnostic)
- Multi-Head Attention: Multiple parallel attention mechanisms, each learning different relationships
- Feed-Forward Networks: Process the attended representations
- Layer Normalization: Stabilizes training
- Output Layer: Predicts probability distribution over next tokens
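One classic way to add the position information mentioned above is the sinusoidal scheme from the original transformer paper (many modern models use learned or rotary position embeddings instead). Each dimension oscillates at a different frequency, giving every position a unique pattern:

```python
import math

def positional_encoding(position: int, d_model: int) -> list:
    """Sinusoidal positional encoding for one position (simplified sketch)."""
    pe = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (i / d_model))
        pe.append(math.sin(position * freq))  # even dimensions: sine
        pe.append(math.cos(position * freq))  # odd dimensions: cosine
    return pe[:d_model]

# Position 0 encodes as alternating 0s and 1s: sin(0)=0, cos(0)=1
print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

These vectors are added to the token embeddings, so the otherwise position-agnostic attention layers can distinguish "dog bites man" from "man bites dog".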
Scale Creates Capability
GPT-4 is rumored to have over 1 trillion parameters arranged in these layers. This massive scale, combined with training on internet-scale text, enables the sophisticated behaviors we observe - but the fundamental architecture is the same transformer stack.
How LLMs Generate Text
Text generation happens one token at a time through "autoregressive" generation:
- Process the input prompt through all transformer layers
- Output a probability distribution over possible next tokens
- Sample a token from this distribution (various strategies exist)
- Append the sampled token to the sequence
- Repeat until a stopping condition is met
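The loop above can be sketched with a stand-in model. Here `toy_model` is a made-up function returning fixed probabilities; a real LLM would run the full transformer stack at that step:

```python
import random

def toy_model(tokens):
    """Stand-in for a transformer: returns a next-token distribution."""
    if tokens[-1] == "the":
        return {"cat": 0.6, "dog": 0.3, "<end>": 0.1}
    return {"the": 0.7, "<end>": 0.3}

def generate(prompt_tokens, max_tokens=10, seed=0):
    """Autoregressive generation: predict, sample, append, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        dist = toy_model(tokens)                        # steps 1-2: distribution
        choices, probs = zip(*dist.items())
        token = rng.choices(choices, weights=probs)[0]  # step 3: sample
        if token == "<end>":                            # step 5: stop condition
            break
        tokens.append(token)                            # step 4: append, repeat
    return tokens

print(generate(["the"]))
```

Note that each generated token is fed back in as input, which is why generation cost grows with output length and why an early sampling mistake can steer the rest of the output.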
Sampling Strategies
Temperature
Controls randomness. Low = more deterministic, high = more creative/random
Top-K Sampling
Only consider the K most likely tokens
Top-P (Nucleus)
Consider tokens until cumulative probability reaches P
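The three strategies above compose naturally: temperature reshapes the distribution, then top-k or top-p restricts the candidate set before sampling. A self-contained sketch over a made-up set of logits:

```python
import math, random

def sample(logits: dict, temperature=1.0, top_k=None, top_p=None, seed=0):
    """Apply temperature, then optional top-k / top-p filtering, then sample."""
    # Temperature: divide logits before softmax (low T sharpens, high T flattens)
    items = sorted(logits.items(), key=lambda kv: -kv[1])
    scaled = [(tok, logit / temperature) for tok, logit in items]
    m = max(l for _, l in scaled)
    weights = [(tok, math.exp(l - m)) for tok, l in scaled]
    total = sum(w for _, w in weights)
    probs = [(tok, w / total) for tok, w in weights]
    # Top-k: keep only the k most likely tokens
    if top_k is not None:
        probs = probs[:top_k]
    # Top-p (nucleus): keep tokens until cumulative probability reaches p
    if top_p is not None:
        kept, cum = [], 0.0
        for tok, p in probs:
            kept.append((tok, p))
            cum += p
            if cum >= top_p:
                break
        probs = kept
    toks, ps = zip(*probs)
    return random.Random(seed).choices(toks, weights=ps)[0]

logits = {"mat": 2.0, "rug": 1.0, "moon": -3.0}
print(sample(logits, temperature=0.5, top_k=2))
```

Setting `top_k=1` recovers greedy decoding (always the most likely token), while a high temperature with no filtering lets low-probability tokens like "moon" through occasionally.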
Context Windows and Limitations
Every LLM has a maximum context length - the total number of tokens it can process at once (input + generated output).
Context Window Sizes
- GPT-3: 2,048 tokens originally (4,096 in later GPT-3.5-era models)
- GPT-4: 8K to 128K tokens depending on version
- Claude: Up to 200K tokens
- Gemini: Up to 1M tokens (claimed)
Why Context Matters
The context window determines how much information the model can consider when generating responses. Beyond the window, information is effectively "forgotten." This affects applications requiring long documents, extended conversations, or complex reasoning chains.
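A common way to cope in chat applications is to keep only the most recent messages that fit. A sketch using the rough 4-characters-per-token estimate from earlier (a real application would use the provider's tokenizer and often summarize dropped history instead of discarding it):

```python
def fit_to_window(messages, window_tokens, reserved_output=500):
    """Keep the most recent messages that fit in the context window,
    reserving room for the model's generated output."""
    budget = window_tokens - reserved_output
    kept = []
    for msg in reversed(messages):  # newest first
        cost = max(1, len(msg) // 4)
        if cost > budget:
            break  # older messages are dropped ("forgotten")
        kept.append(msg)
        budget -= cost
    return list(reversed(kept))

history = ["old question " * 50, "older answer " * 50, "latest question"]
print(fit_to_window(history, window_tokens=700))
```

The oldest message silently falls out of the window here, which is exactly the "forgetting" behavior users notice in long conversations.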
What LLMs Are (and Aren't)
LLMs Are:
- Extremely sophisticated pattern matching systems
- Trained to predict statistically likely continuations
- Capable of impressive generalization from their training
- Useful tools for many text-based tasks
LLMs Are Not:
- Databases of facts (they don't "look things up")
- Reasoning engines with reliable logic
- Conscious or truly understanding entities
- Guaranteed to be accurate or truthful
Key Takeaways
- LLMs fundamentally work by predicting the next token in a sequence
- Tokenization breaks text into subword units - affects pricing and context limits
- Transformers use attention mechanisms to weigh relationships between all parts of input
- Self-attention allows capturing long-range dependencies in text
- Generation is autoregressive - one token at a time, with sampling strategies controlling randomness
- Context windows limit how much text the model can consider at once
- LLMs are sophisticated pattern matchers, not knowledge bases or reasoning engines