Introduction
Large Language Models (LLMs) like GPT-4 and Claude have captured global attention, but how do they actually work? This part explains the core concepts - transformers, attention, and tokenization - at a level that enables meaningful engagement with these technologies without requiring a PhD in machine learning.
The Core Idea: Next Word Prediction
At their heart, LLMs do something surprisingly simple: they predict the next word (or more precisely, token) in a sequence. Given the text "The cat sat on the...", the model predicts that "mat" is a likely next word.
From Simple to Sophisticated
While the task is simple, when scaled to billions of parameters and trained on trillions of words, next-word prediction gives rise to sophisticated behaviors. To predict well, the model must learn grammar, facts, reasoning patterns, and more - all encoded in its parameters.
This is why LLMs seem to "understand" and "reason" - they've learned patterns so deeply that they can generate coherent, contextually appropriate text. However, they're doing statistical pattern matching, not true understanding.
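As a toy illustration of next-word prediction (not how real LLMs work internally, since they learn these statistics in neural network weights rather than explicit counts), a bigram model counts which word follows which in a corpus and predicts the most frequent continuation:

```python
from collections import Counter, defaultdict

def train_bigram(corpus: str) -> dict:
    """Count, for each word, how often each other word follows it."""
    words = corpus.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts: dict, word: str) -> str:
    """Return the statistically most likely next word."""
    return counts[word].most_common(1)[0][0]

corpus = "the cat sat on the mat the cat sat on the rug the cat slept"
model = train_bigram(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" most often
```

LLMs do conceptually the same thing, but condition on thousands of preceding tokens rather than just one word, which is what makes the learned patterns so much richer.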
Tokenization: Breaking Down Text
Before processing text, LLMs break it into tokens - the basic units the model works with. Tokens are not always complete words.
Tokenization Example
The sentence "Tokenization is fascinating!" might be split by a subword tokenizer into pieces like ["Token", "ization", " is", " fascinat", "ing", "!"] (the exact split depends on the tokenizer's learned vocabulary).
Notice how a single word can be split into multiple subword tokens.
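Real tokenizers (BPE, WordPiece, SentencePiece) use vocabularies learned from data, but the core matching step can be sketched as a greedy longest-match over a hand-picked toy vocabulary:

```python
def tokenize(text: str, vocab: set) -> list:
    """Greedy longest-match subword tokenization (simplified sketch)."""
    tokens = []
    i = 0
    while i < len(text):
        # Take the longest substring starting at i that is in the vocab
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as its own token
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"Token", "ization", " is", " fascinat", "ing", "!"}
print(tokenize("Tokenization is fascinating!", vocab))
# → ['Token', 'ization', ' is', ' fascinat', 'ing', '!']
```

Because the vocabulary is learned from mostly English text, common English words become single tokens while rare words and other languages fragment into many pieces.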
Why Tokenization Matters
- Context Limits: Models can only process a fixed number of tokens at once (e.g., 128K). Text beyond the limit must be truncated or otherwise condensed.
- Pricing: API costs are typically based on token count.
- Performance: Rare or unusual words are split into many tokens, which can make them harder for the model to handle.
- Languages: Non-English languages often require more tokens per word.
Practical Implication
A rough rule: 1 token is approximately 0.75 English words, or about 4 characters. When working with LLM APIs, understanding tokenization helps estimate costs and context usage.
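That rule of thumb can be turned into a quick estimator (the heuristic, not an exact count; for exact numbers you would use the provider's own tokenizer):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters-per-token heuristic."""
    return max(1, round(len(text) / 4))

def estimate_cost(text: str, usd_per_1k_tokens: float) -> float:
    """Approximate API cost for the input text."""
    return estimate_tokens(text) / 1000 * usd_per_1k_tokens

prompt = "Summarize the following report in three bullet points."
print(estimate_tokens(prompt))
```

This is only a ballpark for English text; code, non-English text, and unusual formatting can tokenize much less efficiently.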
The Transformer Architecture
The transformer, introduced in the 2017 paper "Attention Is All You Need", is the architecture behind all modern LLMs. Its key innovation is the "attention mechanism", which allows the model to weigh the relevance of different parts of the input when making predictions.
What Problem Does Attention Solve?
Consider: "The animal didn't cross the street because it was too tired."
What does "it" refer to? The animal. A model needs to connect "it" back to "animal" across the sentence. Attention enables this by allowing every word to "attend to" every other word, determining which are most relevant.
How Attention Works (Conceptually)
For each word in the input, attention:
- Scores how relevant every other word is to understanding this word
- Forms a weighted combination of all the words, based on those relevance scores
- Uses that combination as the word's context-aware representation
Query
"What am I looking for?" - Each word asks what information it needs
Key
"What do I contain?" - Each word advertises what information it offers
Value
"Here's my content" - The actual information to be retrieved
Score
Match queries to keys to determine relevance weights
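The query/key/value scoring above can be sketched in pure Python. Real implementations work on matrices with learned projection weights; the small vectors here are made up purely for illustration:

```python
import math

def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    # Score: how well does the query match each key?
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Output: weighted combination of the value vectors
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# Toy vectors: the query matches the first key much more strongly
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
print(attention(query, keys, values))  # output leans toward the first value
```

The division by the square root of the vector dimension keeps the dot products from growing too large, which would otherwise push the softmax toward near-binary weights.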
Self-Attention
LLMs use "self-attention" - the input sequence attends to itself. This allows the model to build representations that capture relationships between all parts of the input, regardless of distance.
Transformer Components
A transformer consists of stacked layers, each containing attention and processing components:
Key Components
- Embedding Layer: Converts tokens to numerical vectors the model can process
- Positional Encoding: Adds position information (since attention is position-agnostic)
- Multi-Head Attention: Multiple parallel attention mechanisms, each learning different relationships
- Feed-Forward Networks: Process the attended representations
- Layer Normalization: Stabilizes training
- Output Layer: Predicts probability distribution over next tokens
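One classic way to add the position information mentioned above is the sinusoidal scheme from the original transformer paper (many modern models use learned or rotary position embeddings instead). Each dimension oscillates at a different frequency, giving every position a unique pattern:

```python
import math

def positional_encoding(position: int, d_model: int) -> list:
    """Sinusoidal positional encoding for one position (simplified sketch)."""
    pe = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (i / d_model))
        pe.append(math.sin(position * freq))  # even dimensions: sine
        pe.append(math.cos(position * freq))  # odd dimensions: cosine
    return pe[:d_model]

# Position 0 encodes as alternating 0s and 1s: sin(0)=0, cos(0)=1
print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

These vectors are added to the token embeddings, so the otherwise position-agnostic attention layers can distinguish "dog bites man" from "man bites dog".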
Scale Creates Capability
GPT-4 is rumored to have over 1 trillion parameters arranged in these layers. This massive scale, combined with training on internet-scale text, enables the sophisticated behaviors we observe - but the fundamental architecture is the same transformer stack.
How LLMs Generate Text
Text generation happens one token at a time through "autoregressive" generation:
- Process the input prompt through all transformer layers
- Output a probability distribution over possible next tokens
- Sample a token from this distribution (various strategies exist)
- Append the sampled token to the sequence
- Repeat until a stopping condition is met
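The loop above can be sketched with a stand-in model. Here `toy_model` is a made-up function returning fixed probabilities; a real LLM would run the full transformer stack at that step:

```python
import random

def toy_model(tokens):
    """Stand-in for a transformer: returns a next-token distribution."""
    if tokens[-1] == "the":
        return {"cat": 0.6, "dog": 0.3, "<end>": 0.1}
    return {"the": 0.7, "<end>": 0.3}

def generate(prompt_tokens, max_tokens=10, seed=0):
    """Autoregressive generation: predict, sample, append, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        dist = toy_model(tokens)                        # steps 1-2: distribution
        choices, probs = zip(*dist.items())
        token = rng.choices(choices, weights=probs)[0]  # step 3: sample
        if token == "<end>":                            # step 5: stop condition
            break
        tokens.append(token)                            # step 4: append, repeat
    return tokens

print(generate(["the"]))
```

Note that each generated token is fed back in as input, which is why generation cost grows with output length and why an early sampling mistake can steer the rest of the output.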
Sampling Strategies
Temperature
Controls randomness. Low = more deterministic, high = more creative/random
Top-K Sampling
Only consider the K most likely tokens
Top-P (Nucleus)
Consider tokens until cumulative probability reaches P
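The three strategies above compose naturally: temperature reshapes the distribution, then top-k or top-p restricts the candidate set before sampling. A self-contained sketch over a made-up set of logits:

```python
import math, random

def sample(logits: dict, temperature=1.0, top_k=None, top_p=None, seed=0):
    """Apply temperature, then optional top-k / top-p filtering, then sample."""
    # Temperature: divide logits before softmax (low T sharpens, high T flattens)
    items = sorted(logits.items(), key=lambda kv: -kv[1])
    scaled = [(tok, logit / temperature) for tok, logit in items]
    m = max(l for _, l in scaled)
    weights = [(tok, math.exp(l - m)) for tok, l in scaled]
    total = sum(w for _, w in weights)
    probs = [(tok, w / total) for tok, w in weights]
    # Top-k: keep only the k most likely tokens
    if top_k is not None:
        probs = probs[:top_k]
    # Top-p (nucleus): keep tokens until cumulative probability reaches p
    if top_p is not None:
        kept, cum = [], 0.0
        for tok, p in probs:
            kept.append((tok, p))
            cum += p
            if cum >= top_p:
                break
        probs = kept
    toks, ps = zip(*probs)
    return random.Random(seed).choices(toks, weights=ps)[0]

logits = {"mat": 2.0, "rug": 1.0, "moon": -3.0}
print(sample(logits, temperature=0.5, top_k=2))
```

Setting `top_k=1` recovers greedy decoding (always the most likely token), while a high temperature with no filtering lets low-probability tokens like "moon" through occasionally.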
Context Windows and Limitations
Every LLM has a maximum context length - the total number of tokens it can process at once (input + generated output).
Context Window Sizes
- GPT-3: 2,048 tokens originally (4,096 in later GPT-3.5-era models)
- GPT-4: 8K to 128K tokens depending on version
- Claude: Up to 200K tokens
- Gemini: Up to 1M tokens (claimed)
Why Context Matters
The context window determines how much information the model can consider when generating responses. Beyond the window, information is effectively "forgotten." This affects applications requiring long documents, extended conversations, or complex reasoning chains.
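A common way to cope in chat applications is to keep only the most recent messages that fit. A sketch using the rough 4-characters-per-token estimate from earlier (a real application would use the provider's tokenizer and often summarize dropped history instead of discarding it):

```python
def fit_to_window(messages, window_tokens, reserved_output=500):
    """Keep the most recent messages that fit in the context window,
    reserving room for the model's generated output."""
    budget = window_tokens - reserved_output
    kept = []
    for msg in reversed(messages):  # newest first
        cost = max(1, len(msg) // 4)
        if cost > budget:
            break  # older messages are dropped ("forgotten")
        kept.append(msg)
        budget -= cost
    return list(reversed(kept))

history = ["old question " * 50, "older answer " * 50, "latest question"]
print(fit_to_window(history, window_tokens=700))
```

The oldest message silently falls out of the window here, which is exactly the "forgetting" behavior users notice in long conversations.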
What LLMs Are (and Aren't)
LLMs Are:
- Extremely sophisticated pattern matching systems
- Trained to predict statistically likely continuations
- Capable of impressive generalization from their training
- Useful tools for many text-based tasks
LLMs Are Not:
- Databases of facts (they don't "look things up")
- Reasoning engines with reliable logic
- Conscious or truly understanding entities
- Guaranteed to be accurate or truthful
Key Takeaways
- LLMs fundamentally work by predicting the next token in a sequence
- Tokenization breaks text into subword units - affects pricing and context limits
- Transformers use attention mechanisms to weigh relationships between all parts of input
- Self-attention allows capturing long-range dependencies in text
- Generation is autoregressive - one token at a time, with sampling strategies controlling randomness
- Context windows limit how much text the model can consider at once
- LLMs are sophisticated pattern matchers, not knowledge bases or reasoning engines