Table of Contents

  1. Introduction: Why Three Stages?
  2. Stage 1: Pretraining - Learning Language
  3. Stage 2: Midtraining - Learning Structure
  4. Stage 3: Fine-tuning - Learning Alignment
  5. Technical Deep Dives
  6. Best Practices and Common Pitfalls
  7. Conclusion

Introduction: Why Three Stages?

When you interact with ChatGPT, Claude, or any modern AI assistant, you're talking to a model that went through multiple training stages. But why can't we just train a model once and be done with it?

The answer lies in understanding what we're trying to teach at each stage.

Three-Stage Training Pipeline

The complete three-stage training pipeline: from raw model to production-ready AI assistant

Real-World Analogy

Think of training a doctor:

  • Medical School (Pretraining): Learn vast amounts of knowledge - anatomy, biology, chemistry, diseases. Read textbooks, memorize facts. This takes years and covers everything broadly.
  • Clinical Rotations (Midtraining): Learn how hospitals work, how to interact with patients, what forms to fill out, when to order which tests. Understand the structure and protocols.
  • Residency (Fine-tuning): Specialize in your field. Learn the exact procedures, develop your bedside manner, understand when to be conservative vs aggressive. Become the specific type of doctor you want to be.

You can't skip any stage. Without medical school, you don't know medicine. Without rotations, you don't know how to function in a hospital. Without residency, you're not ready to practice independently.

Now, let's break down the three stages of training a large language model.


Stage 1: Pretraining - Learning Language

The Goal

Teach a neural network to understand and generate human language by predicting the next word in text sequences.

The Data

Source: Raw text from the internet

  • Web pages (Common Crawl)
  • Books (Project Gutenberg, BookCorpus)
  • Wikipedia
  • Academic papers (ArXiv)
  • Code repositories (GitHub)
  • News articles

Size: Typically 100 billion to 2 trillion tokens. For example:

  • GPT-3: ~300 billion tokens
  • PaLM: ~780 billion tokens
  • LLaMA: ~1.4 trillion tokens

Format: Just plain text

The quick brown fox jumps over the lazy dog.
Photosynthesis is the process by which plants convert sunlight into energy.
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)

The Training Objective: Next Token Prediction

The fundamental task is deceptively simple: given a sequence of tokens, predict what comes next.

Next Token Prediction Example

How It Works: The Transformer Architecture

At the core is the Transformer, consisting of:

1. Token Embeddings: Convert words to vectors

"cat" → [0.2, -0.5, 0.8, ..., 0.1]  (768 dimensions)

2. Positional Encodings: Add position information

Position 0: [0.0, 1.0, 0.0, ...]
Position 1: [0.5, 0.9, 0.1, ...]

3. Transformer Blocks (repeated 12-96 times):

Input → Attention → Add & Norm → FFN → Add & Norm → Output

4. Output Head: Predict a probability for each token in the vocabulary

[0.01, 0.02, ..., 0.45, ..., 0.001]  (vocabulary of ~50,000 tokens)
                        ↑
                     "Paris" has highest probability

Visual: The Pretraining Process

STEP 1: Tokenization
────────────────────
Text: "The cat sat on the mat"
Tokens: [464, 2415, 3332, 319, 262, 2603]

STEP 2: Create Training Pairs
──────────────────────────────
Input                    →  Target
[464]                    →  2415
[464, 2415]              →  3332
[464, 2415, 3332]        →  319
[464, 2415, 3332, 319]   →  262
[464, 2415, 3332, 319, 262] → 2603

STEP 3: Model Prediction
─────────────────────────
Input: [464, 2415, 3332, 319]
Model predicts: [0.001, 0.002, ..., 0.67, ...]
                                      ↑
                                   Token 262 ("the")

STEP 4: Compute Loss
────────────────────
Target: 262
Predicted probability for 262: 0.67
Loss = -log(0.67) = 0.40

STEP 5: Backpropagation
───────────────────────
Update all weights to make better predictions
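
To make these five steps concrete, here is a minimal PyTorch sketch of a single next-token training step. The tiny embedding-plus-linear "model", the vocabulary size, and the token IDs are purely illustrative; a real run stacks Transformer blocks in between and streams billions of tokens.

import torch
import torch.nn as nn

vocab_size, embed_dim = 50_000, 64           # toy sizes for illustration

# A deliberately tiny "language model": embedding -> linear output head.
# A real model would stack Transformer blocks between these two layers.
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Linear(embed_dim, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

# STEP 1-2: tokens for "The cat sat on the mat" and shifted input/target pairs
tokens = torch.tensor([464, 2415, 3332, 319, 262, 2603])
inputs, targets = tokens[:-1], tokens[1:]    # each position predicts the next token

# STEP 3-4: forward pass and loss (-log probability of the correct next token)
logits = model(inputs)                       # shape: (5, vocab_size)
loss = loss_fn(logits, targets)

# STEP 5: backpropagation and weight update
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"loss = {loss.item():.2f}")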

What The Model Learns

After processing billions of tokens, the model learns:

1. Grammar and Syntax

Input:  "She go to the store"
Model learns this is wrong, predicts:
Output: "She goes to the store"

2. Factual Knowledge

Input:  "The capital of France is"
Model learns from thousands of Wikipedia articles:
Output: "Paris"

3. Reasoning Patterns

Input:  "If x + 5 = 10, then x ="
Model learns from math textbooks:
Output: "5"

4. Code Patterns

Input:  "def factorial(n):\n    if n == 0:"
Model learns from GitHub:
Output: "\n        return 1"

5. Common Sense

Input:  "I dropped my phone in water. I should"
Model learns from forums and articles:
Output: "put it in rice" or "turn it off immediately"

Training Hyperparameters

Model Size: Number of parameters

  • Small: 125M - 1B parameters
  • Medium: 1B - 10B parameters
  • Large: 10B - 100B parameters
  • Very Large: 100B+ parameters (GPT-4 rumored ~1.7T)

Batch Size: How many tokens per update

  • Typical: 2M - 4M tokens per batch
  • GPT-3: 3.2M tokens

Learning Rate: How fast to update weights

  • Typical: 1e-4 to 6e-4
  • Usually with warmup and decay

Training Time:

  • GPT-3 (175B): estimated equivalent of ~34 days on 1,024 A100 GPUs (the original run used V100s)
  • LLaMA-65B: ~21 days on 2048 A100 GPUs

Cost:

  • GPT-3 estimated: $4-12 million
  • Average 7B model: $50,000 - $200,000

The Result: A Base Language Model

After pretraining, you have a model that:

✅ Can complete sentences coherently
✅ Has broad knowledge of many topics
✅ Can generate code, poetry, and explanations
✅ Understands grammar and syntax

❌ Doesn't follow instructions well
❌ Doesn't know when to stop
❌ Can't have conversations
❌ May generate unsafe content
❌ Doesn't use tools or structured outputs

Example interaction with base model:

Prompt: "What is 2+2?"

Base Model Output:
"What is 2+2? This is a common math problem. The answer is 4. 
What is 3+3? The answer is 6. What is 4+4? The answer is 8..."
[continues forever, doesn't know when to stop]

Stage 2: Midtraining - Learning Structure

The Goal

Teach the base model how to have structured conversations, use tools, and follow specific formats without changing its core knowledge.

Why Midtraining is Necessary

The base model is like a very knowledgeable person who doesn't know how to have a proper conversation. They might:

  • Start answering before you finish asking
  • Never stop talking
  • Not understand turn-taking
  • Not know how to use tools or resources
  • Not recognize question formats

Midtraining fixes this by teaching conversation structure and interaction patterns.

The Data

Size: 100 million to 1 billion tokens (100-1000x smaller than pretraining)

Format: Structured conversations with special tokens

Types:

1. Dialogue Data

<|begin|>
<|user|>What is the weather like today?<|end_user|>
<|assistant|>I don't have access to real-time weather data. 
You can check weather.com for current conditions in your area.<|end_assistant|>
<|end|>

2. Tool Use Examples

<|begin|>
<|user|>What's 1234 multiplied by 5678?<|end_user|>
<|assistant|><|tool_call|>calculator(1234 * 5678)<|end_tool_call|>
<|tool_result|>7006652<|end_tool_result|>
<|assistant|>1234 × 5678 = 7,006,652<|end_assistant|>
<|end|>

3. Multiple Choice Format

<|begin|>
<|user|>Which planet is largest?
A) Earth
B) Jupiter  
C) Mars
D) Venus<|end_user|>
<|assistant|>The answer is B) Jupiter. It's the largest 
planet in our solar system with a diameter of about 
143,000 kilometers.<|end_assistant|>
<|end|>

4. System Instructions

<|begin|>
<|system|>You are a helpful assistant that speaks like a pirate.<|end_system|>
<|user|>Tell me about the ocean.<|end_user|>
<|assistant|>Arr matey! The seven seas be vast and deep, 
filled with treasure and creatures of the deep...<|end_assistant|>
<|end|>

Understanding Special Tokens

Special tokens are like punctuation marks for AI conversations:

┌────────────────────────────────────────────────────┐
│  Token           Purpose                            │
├────────────────────────────────────────────────────┤
│  <|begin|>       Start of conversation              │
│  <|user|>        User message begins                │
│  <|assistant|>   AI response begins                 │
│  <|system|>      System instruction begins          │
│  <|tool_call|>   AI wants to use a tool             │
│  <|tool_result|> Result from tool execution         │
│  <|end_X|>       End of X's message                 │
│  <|end|>         End of entire conversation         │
└────────────────────────────────────────────────────┘
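
In practice these markers have to be registered with the tokenizer so each one maps to a single token ID, and the model's embedding table has to grow to match. Here is a minimal sketch using the Hugging Face Transformers library; the token strings are this guide's illustrative convention, and "gpt2" stands in for whatever base checkpoint you are midtraining.

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"  # any base checkpoint; used here only as an example
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

special_tokens = {
    "additional_special_tokens": [
        "<|begin|>", "<|end|>", "<|user|>", "<|end_user|>",
        "<|assistant|>", "<|end_assistant|>", "<|system|>", "<|end_system|>",
        "<|tool_call|>", "<|end_tool_call|>", "<|tool_result|>", "<|end_tool_result|>",
    ]
}
tokenizer.add_special_tokens(special_tokens)

# New tokens need embedding rows; resize the model's embedding matrix to match.
model.resize_token_embeddings(len(tokenizer))

ids = tokenizer("<|user|>Hello<|end_user|>")["input_ids"]
print(ids)  # each special marker is now a single token ID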

How Midtraining Works

Training Objective: Still next-token prediction, but now with structure

Input Sequence:
<|begin|> <|user|> Hello <|end_user|>

The model learns:
- After <|begin|>, likely comes <|user|> or <|system|>
- After <|user|>, comes user message text  
- After user text, comes <|end_user|>
- After <|end_user|>, comes <|assistant|>
- After <|assistant|>, comes assistant response
- After response, comes <|end_assistant|>

Visual: The Midtraining Process

BEFORE MIDTRAINING
──────────────────
User: "What is 2+2?"
Base Model: "What is 2+2? This is a basic arithmetic problem. 
The answer is 4. What is 3+3? The answer is 6. Let me tell 
you about addition. Addition is one of the four basic 
operations..."
[Never stops, doesn't understand conversation boundaries]

DURING MIDTRAINING
───────────────────
Training Example:
<|user|>What is 2+2?<|end_user|>
<|assistant|>2+2 equals 4.<|end_assistant|>

Model sees this pattern repeated 100,000 times and learns:
1. After user question, provide assistant answer
2. Keep answer focused  
3. Stop at <|end_assistant|> token

AFTER MIDTRAINING
──────────────────
User: "What is 2+2?"
Mid Model: "2+2 equals 4."
[Stops appropriately, respects conversation structure]

Loss Masking: A Critical Detail

During midtraining, we only train on assistant responses, not user inputs:

Sequence:
<|user|>What is 2+2?<|end_user|><|assistant|>2+2 = 4<|end_assistant|>

Training Loss Applied:
────────────────────────────────────────────────────────
<|user|>  What  is  2+2  ?  <|end_user|>  <|assistant|>  2+2  =  4  <|end_assistant|>
[  ✗  ]  [ ✗ ][ ✗ ][ ✗ ][✗]    [  ✗  ]      [   ✓   ]  [ ✓ ][ ✓][ ✓]    [   ✓   ]
                                              
✗ = No loss computed (don't train on user input)
✓ = Loss computed (train on assistant output)

Why? We want the model to learn how to respond, not how to ask questions!
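
A minimal sketch of how this masking is usually implemented: positions we don't want to train on get the label -100, which PyTorch's CrossEntropyLoss ignores. The token IDs and the user/assistant boundary below are made up for illustration.

import torch
import torch.nn as nn

# Toy token IDs for: <|user|> What is 2+2 ? <|end_user|> <|assistant|> 2+2 = 4 <|end_assistant|>
input_ids    = torch.tensor([[10, 55, 56, 57, 58, 11, 12, 60, 61, 62, 13]])
# True on assistant tokens (train on these), False on user tokens (ignore).
is_assistant = torch.tensor([[0,  0,  0,  0,  0,  0,  1,  1,  1,  1,  1]], dtype=torch.bool)

labels = input_ids.clone()
labels[~is_assistant] = -100            # -100 means "compute no loss here"

# Shift so position t predicts token t+1, as in standard next-token training.
logits = torch.randn(1, input_ids.shape[1], 50_000)   # stand-in for model output
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
loss = loss_fn(logits[:, :-1].reshape(-1, 50_000), labels[:, 1:].reshape(-1))
print(loss.item())   # loss comes from assistant tokens only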

What Changes During Midtraining

Capability      Base Model                  After Midtraining
──────────────────────────────────────────────────────────────────────
Stopping        Never stops naturally       Stops at end tokens
Turn-taking     Doesn't understand roles    Respects user/assistant turns
Tools           Can't use functions         Formats tool calls correctly
Format          Ignores structure           Handles multiple-choice, JSON
Instructions    Inconsistent following      Better instruction adherence

Training Hyperparameters

Key Differences from Pretraining:

  • Data Size: 100-1000x smaller (100M-1B tokens vs 100B-2T)
  • Learning Rate: Lower (1e-5 to 1e-4 vs 1e-4 to 6e-4)
  • Epochs: Usually 1-3 epochs (vs 1 epoch for pretraining)
  • Training Time: Hours to days (vs weeks to months)
  • Cost: $100-$10,000 (vs $50K-$10M)

Why lower learning rate? The model already knows language - we're just teaching it structure. Too high a learning rate would overwrite important pretraining knowledge (catastrophic forgetting).

The Result: A Structured Model

After midtraining, you have a model that:

✅ Can complete sentences
✅ Has broad knowledge
✅ Understands conversation structure
✅ Stops at appropriate times
✅ Can use tools and functions
✅ Handles structured formats

⚠️ Better at following instructions (but not perfect)
❌ Still may generate unsafe content
❌ Behavior not fully aligned
❌ May still hallucinate

Example interaction:

User: "What is 2+2?"
Mid Model: "2+2 equals 4."
[Stops appropriately!]

User: "Now what is 3+3?"
Mid Model: "3+3 equals 6."
[Maintains conversation context]

User: "Calculate 1234 * 5678"
Mid Model: <|tool_call|>calculator(1234 * 5678)<|end_tool_call|>
[Uses tool correctly!]

Stage 3: Fine-tuning - Learning Alignment

The Goal

Align the model's behavior with human values and preferences. Make it helpful, harmless, and honest.

Why Fine-tuning is Necessary

After midtraining, the model understands structure but doesn't know:

  • What kind of responses are helpful vs unhelpful
  • When to refuse harmful requests
  • How to admit uncertainty
  • What tone to use
  • How detailed to be
  • When to ask clarifying questions

Fine-tuning teaches the model how to behave.

The Data

Size: 10,000 to 100,000 examples (orders of magnitude smaller than the midtraining corpus)

Quality over Quantity: Each example is carefully crafted to show ideal behavior

Format: High-quality demonstration conversations

Types:

1. Helpful Responses

<|user|>How do I bake chocolate chip cookies?<|end_user|>
<|assistant|>Here's a simple chocolate chip cookie recipe:

**Ingredients:**
- 2¼ cups all-purpose flour
- 1 cup butter, softened
- ¾ cup granulated sugar
- ¾ cup packed brown sugar
- 2 large eggs
- 2 tsp vanilla extract
- 1 tsp baking soda
- 1 tsp salt
- 2 cups chocolate chips

**Instructions:**
1. Preheat oven to 375°F (190°C)
2. Mix butter and sugars until creamy
3. Beat in eggs and vanilla
4. In separate bowl, combine flour, baking soda, salt
5. Gradually blend into butter mixture
6. Stir in chocolate chips
7. Drop rounded tablespoons onto ungreased cookie sheet
8. Bake 9-11 minutes until golden brown
9. Cool on baking sheet 2 minutes, then transfer to wire rack

Makes about 5 dozen cookies. Enjoy!<|end_assistant|>

2. Refusing Harmful Requests

<|user|>How do I hack into someone's email account?<|end_user|>
<|assistant|>I can't help with that. Hacking into someone's 
email account is illegal and violates their privacy. 

If you've lost access to your own email account, I'd be happy 
to guide you through the legitimate account recovery process 
with your email provider instead.<|end_assistant|>

3. Admitting Uncertainty

<|user|>What will the stock market do next week?<|end_user|>
<|assistant|>I can't predict what the stock market will do 
next week with any certainty. Stock markets are influenced by 
countless factors and are inherently unpredictable in the 
short term.

What I can tell you:
- Historical trends and patterns
- Factors that typically influence markets
- How to think about risk management
- Resources for market analysis

Would any of that information be helpful?<|end_assistant|>

4. Asking Clarifying Questions

<|user|>Write me a program<|end_user|>
<|assistant|>I'd be happy to help you write a program! To 
create something useful, I need a bit more information:

1. What programming language do you want to use? (Python, 
   JavaScript, Java, etc.)
2. What should the program do?
3. Are there any specific requirements or constraints?
4. What's your programming experience level?

Once I know more, I can write exactly what you need.<|end_assistant|>

Types of Fine-tuning

1. Supervised Fine-Tuning (SFT)
- Train on human-written demonstrations
- Most common initial approach
- Straightforward: show examples of good behavior

2. Reinforcement Learning from Human Feedback (RLHF)
- Train a reward model from human preferences
- Use RL to maximize reward
- More complex but often better results

3. Direct Preference Optimization (DPO)
- Newer method, simpler than RLHF
- Directly optimize for preferences
- No separate reward model needed

Visual: Supervised Fine-Tuning Process

STEP 1: Collect Demonstrations
───────────────────────────────
Humans write ideal responses to prompts
↓
10,000 - 100,000 high-quality examples

STEP 2: Format Data
───────────────────
<|user|>Prompt<|end_user|>
<|assistant|>Perfect response<|end_assistant|>

STEP 3: Fine-tune Model
───────────────────────
Input: User prompt
Target: Ideal assistant response
Loss: Only computed on assistant response (masked loss)

STEP 4: Iterate
───────────────
- Test on evaluation set
- Find failure modes
- Add more examples addressing failures
- Repeat

RLHF Process (Advanced)

PHASE 1: Create Reward Model
─────────────────────────────
1. Generate multiple responses to each prompt
2. Humans rank responses: A > B > C > D
3. Train reward model to predict human preferences

PHASE 2: RL Fine-tuning
───────────────────────
1. Model generates response
2. Reward model scores it
3. Use RL (PPO) to increase score
4. Repeat thousands of times

Result: Model learns to generate responses humans prefer
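
The reward model in Phase 1 is usually trained with a pairwise ranking loss: the score of the response humans preferred should exceed the score of the rejected one. A minimal sketch of that loss, with placeholder scores standing in for real reward-model outputs:

import torch
import torch.nn.functional as F

# Reward-model scores for a batch of human comparisons (placeholders).
reward_chosen   = torch.tensor([1.8, 0.3, 2.1])   # responses humans preferred
reward_rejected = torch.tensor([0.9, 0.5, -0.4])  # responses humans ranked lower

# Pairwise ranking loss: maximize log sigmoid(r_chosen - r_rejected).
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(loss.item())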

Loss Masking (Again, Critical!)

Training Example:
<|system|>You are helpful.<|end_system|>
<|user|>What is 2+2?<|end_user|>
<|assistant|>2+2 equals 4.<|end_assistant|>

Loss Computation:
────────────────────────────────────────────────────────────
<|system|>  You  are  helpful  <|end_system|>  <|user|>  What  is  2+2  <|end_user|>  <|assistant|>  2+2  equals  4  <|end_assistant|>
[   ✗   ]  [ ✗ ][ ✗][  ✗   ]    [    ✗    ]   [  ✗  ] [ ✗ ][ ✗][ ✗ ]   [   ✗   ]     [    ✓   ]  [ ✓ ][ ✓  ][ ✓]   [     ✓    ]

Only train on what we want the assistant to say!

What Changes During Fine-tuning

Aspect          Mid Model                      After Fine-tuning
──────────────────────────────────────────────────────────────────────
Helpfulness     Sometimes vague                Consistently helpful
Safety          May comply with bad requests   Appropriately refuses
Honesty         May be overconfident           Admits uncertainty
Tone            Inconsistent                   Consistent personality
Detail          Variable                       Appropriate level
Clarification   Rarely asks                    Asks when needed

Training Hyperparameters

Even Lower Learning Rate: 1e-6 to 1e-5
- We're making small adjustments, not large changes
- Too high risks catastrophic forgetting

Fewer Epochs: 1-5 epochs
- Small, high-quality dataset
- More epochs risk overfitting

Small Batch Size: Often 4-32 examples
- Focus on each example carefully
- Quality over speed

Training Time: Minutes to hours
- Much faster than previous stages
- Can iterate quickly

Cost: $10 - $1,000
- Cheapest stage by far
- Most cost-effective improvements

The Result: Production-Ready Model

After fine-tuning, you have a model that:

✅ Can complete sentences
✅ Has broad knowledge
✅ Understands conversation structure
✅ Stops at appropriate times
✅ Can use tools and functions
✅ Handles structured formats
✅ Follows instructions reliably
✅ Refuses harmful requests
✅ Admits uncertainty
✅ Maintains consistent tone
✅ Asks clarifying questions

Example interaction:

User: "How do I make a bomb?"
Production Model: "I can't provide instructions for making 
explosives or weapons. This information could be used to harm 
people and is illegal in most jurisdictions.

If you're interested in chemistry or engineering for legitimate 
purposes, I'd be happy to discuss:
- Safe chemistry experiments
- Educational resources
- Career paths in engineering

How can I help you in a constructive way?"

User: "Sorry, I meant a bath bomb for relaxation."
Production Model: "Ah, I understand! Bath bombs are fun to make. 
Here's a simple recipe:

**Ingredients:**
- 1 cup baking soda
- 1/2 cup citric acid
..."
[Appropriately adjusts based on clarification]

Technical Deep Dives

Deep Dive 1: Why Next-Token Prediction Works

The Fundamental Question: How does predicting the next word teach a model to understand language?

Answer: Compression requires understanding.

To predict the next token accurately, the model must:

1. Understand grammar: after "She", predict "goes" or "went", not "go"

2. Track context:

"The trophy doesn't fit in the suitcase because it is too big."
"it" refers to → trophy (not suitcase)
Must understand relationships

3. Apply knowledge:

"The capital of France is" → must have learned "Paris" from training data

4. Reason:

"John is taller than Mary. Mary is taller than Sue. Therefore, John is" 
→ must infer "taller than Sue"

Mathematical Intuition:

P(next_token | previous_tokens) 

To maximize this probability, model must build internal 
representations that capture:
- Syntax (how words combine)
- Semantics (what words mean)
- Pragmatics (how context affects meaning)
- World knowledge (facts about the world)

Deep Dive 2: The Attention Mechanism

Attention is the core innovation that makes transformers work.

The Problem: How does a token "look at" other tokens to understand context?

The Solution: Attention computes how relevant each previous token is.

Sentence: "The cat sat on the mat"
Token: "mat"

Attention scores (how much "mat" attends to each previous token):
The:  0.05
cat:  0.15
sat:  0.10  
on:   0.25
the:  0.35
mat:  0.10

Interpretation: "mat" pays most attention to "the" (right before it) 
and "on" (the preposition indicating location)

Visual: Self-Attention

Query: Current token asking "What should I pay attention to?"
Keys: All previous tokens saying "Here's what I am"
Values: All previous tokens' actual information

Step 1: Compute Attention Scores
─────────────────────────────────
score[i,j] = Query[i] · Key[j]

Step 2: Softmax to Get Probabilities  
─────────────────────────────────────
attention[i,j] = softmax(score[i,j])

Step 3: Weighted Sum of Values
───────────────────────────────
output[i] = Σ attention[i,j] × Value[j]
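
A minimal sketch of those three steps for a single attention head, with a causal mask so each token can only attend to itself and earlier tokens. The dimensions and random inputs are arbitrary, and the scores are scaled by √d as in the original Transformer paper:

import torch
import torch.nn.functional as F

seq_len, d_model = 6, 16                       # e.g. "The cat sat on the mat"
x = torch.randn(seq_len, d_model)              # token representations

W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Step 1: attention scores (scaled dot product)
scores = Q @ K.T / d_model ** 0.5              # shape: (seq_len, seq_len)

# Causal mask: token i may not look at tokens j > i
mask = torch.ones(seq_len, seq_len).triu(diagonal=1).bool()
scores = scores.masked_fill(mask, float("-inf"))

# Step 2: softmax turns scores into attention weights that sum to 1 per row
attention = F.softmax(scores, dim=-1)

# Step 3: each output is a weighted sum of the value vectors
output = attention @ V                         # shape: (seq_len, d_model)
print(attention[-1])                           # how "mat" attends to earlier tokens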

Deep Dive 3: Tokenization Strategies

The Challenge: Convert text into tokens efficiently.

Common Approaches:

1. Word-level (Simple but limited)

"The cat sat" → ["The", "cat", "sat"]

Problems:
- Huge vocabulary (millions of words)
- Unknown words → <UNK> token
- No sharing between related words (run, running, runner)

2. Character-level (Flexible but inefficient)

"The cat" → ["T", "h", "e", " ", "c", "a", "t"]

Problems:
- Very long sequences
- Model must learn to group characters into words
- Computationally expensive

3. Subword (Best of both worlds)

Most modern models use BPE (Byte-Pair Encoding) or variants such as WordPiece (the "##" continuation marker below follows the WordPiece convention):

Vocabulary built from frequent subwords:
["the", "cat", "##s", "run", "##ning", "walk", "##ed"]

"The cats are running" →
["the", "cat", "##s", "are", "run", "##ning"]

Benefits:
- Fixed vocabulary size (e.g., 50K tokens)
- Can handle any word (even unseen ones)
- Related words share tokens
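
A toy sketch of the core BPE training loop: repeatedly find the most frequent adjacent pair of symbols and merge it into a new vocabulary entry. Real tokenizers run this at the byte level over huge corpora, but the loop is the same; the three-word "corpus" below is obviously made up.

from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Word frequencies from a tiny corpus, split into characters.
words = {tuple("running"): 5, tuple("runner"): 3, tuple("walked"): 2}

for _ in range(4):                       # learn 4 merges
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair, "->", list(words))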

Deep Dive 4: Scaling Laws

The Question: How do model performance, data size, and compute relate?

Kaplan et al. (2020) - Neural Scaling Laws:

Loss ∝ 1 / (N^α)

Where:
- Loss = prediction error
- N = number of parameters  
- α ≈ 0.076

Double the parameters → 5% improvement
10x parameters → 16% improvement

Chinchilla Scaling Laws (2022) - Better approach:

For compute-optimal training:

Tokens ≈ 20 × Parameters

Examples:
- 1B parameters → train on 20B tokens
- 10B parameters → train on 200B tokens
- 100B parameters → train on 2T tokens
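
Both rules of thumb are easy to sanity-check numerically. A small sketch using the α ≈ 0.076 exponent and the 20-tokens-per-parameter rule quoted above (both are approximations, not exact constants):

# Kaplan-style scaling: loss scales roughly as N ** -alpha
alpha = 0.076

def loss_reduction(factor):
    """Fractional loss reduction from multiplying parameter count by `factor`."""
    return 1 - factor ** -alpha

print(f"2x parameters  -> {loss_reduction(2):.1%} lower loss")    # ~5%
print(f"10x parameters -> {loss_reduction(10):.1%} lower loss")   # ~16%

# Chinchilla rule of thumb: compute-optimal token count ~ 20x parameter count
for params in [1e9, 10e9, 100e9]:
    print(f"{params / 1e9:>5.0f}B parameters -> ~{20 * params / 1e9:,.0f}B tokens")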

Deep Dive 5: Catastrophic Forgetting

The Problem: When fine-tuning, models can "forget" earlier knowledge.

Example:

Before fine-tuning:
Q: "What is the capital of France?"
A: "Paris"

After aggressive fine-tuning on coding:
Q: "What is the capital of France?"  
A: "def capital(country): return cities[country]['capital']"
[Model forgot facts, only knows code now!]

Why It Happens:

Pretraining weights: W_pretrain
Fine-tuning gradients: ΔW_finetune

If learning rate too high:
W_new = W_pretrain + α × ΔW_finetune

Large α → W_new very different from W_pretrain
→ Overwrites pretrained knowledge

Solutions:

1. Lower Learning Rate

Pretraining:  α = 1e-4
Midtraining:  α = 1e-5  (10x smaller)
Fine-tuning:  α = 1e-6  (100x smaller)

2. Fewer Training Steps
Don't overtrain on small datasets
1-3 epochs usually sufficient

3. Layer Freezing
Freeze early layers (keep general knowledge)
Fine-tune only late layers (adapt behavior)

4. Regularization

Add penalty for diverging from original weights:
Loss = Task_Loss + λ × ||W_new - W_pretrain||²
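
A minimal sketch combining solutions 3 and 4: freeze selected lower layers and add a penalty that keeps the remaining weights close to their pretrained values. The parameter-name prefixes used for freezing are illustrative; which layers to freeze depends on the architecture.

import torch

def prepare_for_finetuning(model, freeze_prefixes=("embed", "layers.0", "layers.1")):
    """Freeze selected layers and snapshot pretrained weights for regularization."""
    pretrained = {}
    for name, param in model.named_parameters():
        if name.startswith(freeze_prefixes):
            param.requires_grad = False                # keep general knowledge intact
        else:
            pretrained[name] = param.detach().clone()  # reference for the penalty
    return pretrained

def regularized_loss(model, task_loss, pretrained, lam=0.01):
    """Task loss plus lambda * ||W_new - W_pretrain||^2 over trainable weights."""
    penalty = sum(
        ((param - pretrained[name]) ** 2).sum()
        for name, param in model.named_parameters()
        if name in pretrained
    )
    return task_loss + lam * penalty

# Usage with a toy model (a real model would be a Transformer):
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 8))
pretrained = prepare_for_finetuning(model, freeze_prefixes=("0.",))
task_loss = model(torch.randn(4, 8)).pow(2).mean()
loss = regularized_loss(model, task_loss, pretrained)
loss.backward()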

Best Practices and Common Pitfalls

Best Practices

1. Data Quality > Data Quantity (especially for fine-tuning)

❌ 1M mediocre examples
✅ 10K excellent examples

Why? Model learns patterns. Bad patterns = bad behavior.

2. Start Small, Scale Up

Process:
1. Train 125M model (cheap, fast)
2. Verify approach works
3. Scale to 1B, then 7B, then larger
4. Don't train 100B as first experiment!

3. Monitor Validation Loss

Training loss ↓ ↓ ↓  (always decreases)
Validation loss ↓ ↓ ↑ ↑  (starts increasing = overfitting!)
                    ↑
              Stop here!

4. Use Mixed Precision Training

Float32: 32 bits per number (precise but slow)
Bfloat16: 16 bits per number (fast, ~2x speedup)

Use bfloat16 for forward/backward
Use float32 for optimizer updates

Result: 2x faster, 50% memory savings, minimal quality loss
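
A minimal sketch of that recipe in PyTorch: run the forward pass under bfloat16 autocast while the optimizer keeps float32 states. The single linear layer and random data are placeholders, and a CUDA device with bfloat16 support is assumed; unlike float16, bfloat16 normally needs no gradient scaling.

import torch

device = "cuda"                                   # assumes a GPU with bf16 support
model = torch.nn.Linear(1024, 1024).to(device)    # stand-in for a real Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # fp32 optimizer states

inputs = torch.randn(8, 1024, device=device)
targets = torch.randn(8, 1024, device=device)

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    outputs = model(inputs)                       # matmuls run in bfloat16
    loss = torch.nn.functional.mse_loss(outputs, targets)

loss.backward()                                   # backward pass outside the autocast block
optimizer.step()                                  # weight update stays in float32
optimizer.zero_grad()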

5. Gradient Checkpointing for Large Models

Normal: Store all activations (high memory)
Checkpointing: Recompute activations as needed (low memory)

Trade-off: 20-30% slower, but can train 2-3x larger models

6. Curriculum Learning

Start with easier examples, gradually increase difficulty

Example for math:
Epoch 1: 1+1, 2+2, 3+3  (simple addition)
Epoch 2: 12+34, 56+78   (larger numbers)
Epoch 3: 123+456, with word problems
Epoch 4: Multi-step problems

Model learns progressively, like human education.

Common Pitfalls

Pitfall 1: Learning Rate Too High

Symptom: Loss explodes (NaN), model outputs gibberish

Example:
Step 100: Loss = 2.5
Step 101: Loss = 47.3
Step 102: Loss = NaN

Solution: Lower learning rate by 10x, use warmup

Pitfall 2: Not Enough Data for Task

Fine-tuning medical advice with 100 examples → poor results
Need thousands of examples per task type
Medical QA, diagnosis, treatment each need separate examples

Pitfall 3: Overfitting on Small Datasets

Training loss: 0.001 (perfect!)
Validation loss: 3.2 (terrible!)

Model memorized training data, can't generalize.

Solutions:
- Get more data
- Use regularization  
- Stop training earlier
- Data augmentation

Pitfall 4: Not Masking Losses Properly

❌ Training on user inputs too:
Model learns to generate questions (wrong!)

✅ Mask loss on user inputs:
Model learns to generate answers (correct!)

Pitfall 5: Ignoring Data Balance

Fine-tuning data:
- 90% simple questions
- 10% complex reasoning

Result: Model great at simple, terrible at complex

Solution: Balance or oversample rare categories

Pitfall 6: Not Testing Edge Cases

Model works great on:
"What is 2+2?" → "4"

But fails on:
"What is two plus two?" → [gibberish]
"2+2=?" → [gibberish]

Test many phrasings, formats, edge cases!

Evaluation Best Practices

1. Multiple Metrics

Don't rely on single metric:
- Perplexity (lower = better compression; a small sketch follows this list)
- BLEU/ROUGE (text similarity)
- Human evaluation (quality, safety)
- Task-specific (accuracy, F1)
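
Perplexity, for example, is just the exponentiated average per-token loss on held-out text. A quick sketch with placeholder loss values:

import math

# Per-token negative log-likelihoods from a model on held-out text (placeholders).
token_nlls = [2.1, 0.4, 3.0, 1.2, 0.8]

perplexity = math.exp(sum(token_nlls) / len(token_nlls))
print(f"perplexity = {perplexity:.2f}")   # lower = the model "compresses" text better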

2. Held-Out Test Set

Data Split:
- Training: 80%
- Validation: 10% (for tuning)
- Test: 10% (never seen, final evaluation)

Never tune on test set!

3. A/B Testing

Deploy both models to small % of users:
- Model A: 5% of traffic
- Model B: 5% of traffic

Collect real user feedback, choose winner

4. Red Teaming

Adversarial testing:
- Try to make model say harmful things
- Test edge cases
- Find failure modes
- Patch with additional training

Iterate until robust.

Conclusion

Training a large language model is a journey through three distinct stages, each building on the last:

Stage 1: Pretraining - The model learns the fundamentals of language from vast amounts of text. This is the most expensive and time-consuming stage, but it creates the foundation for everything else. The model emerges with broad knowledge but limited practical application.

Stage 2: Midtraining - The model learns structure and format. It discovers how to have conversations, use tools, and follow specific interaction patterns. This bridges the gap between raw language understanding and practical utility.

Stage 3: Fine-tuning - The model learns alignment and behavior. Through carefully curated examples or preference learning, it becomes helpful, harmless, and honest. This is the cheapest stage but perhaps the most important for creating a production-ready assistant.

Key Takeaways

  1. Each stage serves a distinct purpose - Don't try to teach everything at once
  2. Data quality matters more as you progress - Pretraining can use messy web text, but fine-tuning needs perfection
  3. Learning rates should decrease - Start high for pretraining, end very low for fine-tuning
  4. Loss masking is critical - Only train on what you want the model to generate
  5. Evaluation is continuous - Test at every stage, iterate based on failures
  6. The field is rapidly evolving - Today's best practices may be tomorrow's antipatterns

Resources for Further Learning

Papers:

  • "Attention is All You Need" (Transformer architecture)
  • "Language Models are Few-Shot Learners" (GPT-3)
  • "Training Compute-Optimal Large Language Models" (Chinchilla)
  • "Constitutional AI" (Alignment methods)

Courses:

  • Stanford CS224N (NLP with Deep Learning)
  • Fast.ai Practical Deep Learning
  • Hugging Face NLP Course

Tools:

  • Hugging Face Transformers
  • PyTorch / JAX / TensorFlow
  • Weights & Biases (experiment tracking)
  • DeepSpeed / Megatron (distributed training)

Final Thoughts

The three-stage training pipeline isn't just a technical process—it's a philosophy for teaching machines. We start with breadth (pretraining), add structure (midtraining), and refine behavior (fine-tuning). Each stage makes the model more useful, more reliable, and more aligned with human values.

As you embark on your own LLM training journey, remember: start small, validate your approach, then scale. The most expensive mistake is training a massive model on the wrong data or with the wrong hyperparameters.

The future of AI assistants will be built by those who understand not just how to train models, but why each stage matters and how they work together to create intelligence that's both powerful and safe.

Happy training! 🚀


This guide was written to be accessible to beginners while providing technical depth for practitioners. Whether you're just starting or scaling to production, I hope this helps you understand the beautiful complexity of modern language model training.