🧠 How LLMs Learn

An Interactive Journey Through Neural Network Training

🎯 What You'll Learn

🔗 Building on Lesson 3

In Lesson 3, you learned about embeddings (how words become numbers), attention (how models focus on relevant words), and transformers (the architecture powering modern LLMs).

Now you'll see these concepts in action! Watch as a neural network learns from scratch, transforming from random guesses to accurate predictions.

The Big Picture

Have you ever wondered how AI assistants can write text that sounds human? The secret is surprisingly simple: they predict one word (or token) at a time!

In this lesson, you'll watch a tiny neural network learn to predict the next word in the sentence: "the cat sat on the mat"

1️⃣

Words → Numbers

See how words are converted into mathematical vectors (embeddings) that capture their meaning.

2️⃣

Learning Process

Watch the network make mistakes, calculate errors, and gradually improve its predictions.

3️⃣

From 8% to 99%

See the probability of the correct answer jump from barely better than random to near-perfect!

📚 Step 1: Understanding the Task

What are we teaching the AI?

We're training a simple neural network to predict the next word in a sentence. This is the fundamental task that powers large language models!

🔑 Key Concept: Just as you can predict that "the cat" is likely followed by "sat" or "jumped", neural networks learn these patterns by seeing many examples. This is called next token prediction.
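To make the idea concrete, here is a toy sketch (not the lesson's network) that reduces next token prediction to simple counting: look at which word follows each word in a tiny corpus and turn the counts into probabilities. The corpus and variable names are illustrative.

```python
from collections import Counter, defaultdict

# Toy corpus: the same sentence used throughout this lesson.
corpus = "the cat sat on the mat".split()

# Count which word follows each context word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

# Turn counts into next-word probabilities for the context "the".
counts = follows["the"]
total = sum(counts.values())
probs = {word: c / total for word, c in counts.items()}
print(probs)  # {'cat': 0.5, 'mat': 0.5}
```

A neural network does the same job, but instead of a lookup table of counts it learns a compressed, generalizable representation, which is why it can handle contexts it has never seen verbatim.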

💡 Real-World Connection

When you type on your phone and it suggests the next word, that's the same technology! Modern LLMs use this same principle but with billions of parameters and massive amounts of text.

🏗️ Step 2: Network Architecture

How does the network work?

The network has three main layers that transform words into predictions:

  • 🔢 Embedding Layer: Converts words into numbers (vectors). Think of this like giving each word a unique set of coordinates in a multi-dimensional space. Words with similar meanings end up closer together!
  • 🧮 Hidden Layer: Learns patterns and relationships between words. This is where the "thinking" happens - the network discovers which words tend to follow others.
  • 🎯 Output Layer: Produces probabilities for each possible next word. The network gives each word a score - higher scores mean "more likely to come next".
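The three layers above can be sketched as a single forward pass in NumPy. This is a hedged sketch, not the lesson's exact code: the vocabulary, dimensions (8-dim embeddings, 16 hidden units), and function names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]   # 5-word vocabulary
V, D, H = len(vocab), 8, 16                  # vocab size, embedding dim, hidden dim

# Randomly initialized parameters: these are what training will adjust.
E = rng.normal(0, 0.1, (V, D))    # embedding layer: one vector per word
W1 = rng.normal(0, 0.1, (D, H))   # hidden layer weights
W2 = rng.normal(0, 0.1, (H, V))   # output layer weights

def predict_next(word):
    x = E[vocab.index(word)]      # 1. look up the word's embedding
    h = np.tanh(x @ W1)           # 2. hidden layer transforms it
    logits = h @ W2               # 3. one score per vocabulary word
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax -> probabilities
    return dict(zip(vocab, probs))

print(predict_next("cat"))        # before training: roughly uniform, ~0.2 each
```

Notice that before any training, every word gets about the same probability (1/5 = 20%). Training is the process of reshaping these random weights so the right word wins.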

🔗 Connecting to Lesson 3

Remember the embedding visualizations from Lesson 3? You saw how words cluster by meaning and how you can do arithmetic with word vectors (King - Man + Woman ≈ Queen).

Now you'll see something even more amazing: how these embeddings are learned from scratch! Watch as the network starts with random numbers and gradually learns meaningful representations through training.

The embedding vector you'll see changing in the visualizations below is the network learning to represent the word "cat" in a way that helps predict what comes next.

🎯 Step 3: Training Progress & Visualizations

What's happening now?

Press play to start the training animation. Four visualizations update as the network trains: Loss Over Time, Prediction Probabilities, Embedding Vector, and Hidden Layer Activations.

💡 Key Takeaways

What did we learn?

  • 🔢 Words become numbers (Embeddings): Neural networks convert text into mathematical vectors that capture meaning, grammar, and context. Similar words have similar vectors!
  • 📉 Learning = reducing error (Loss): The network adjusts weights to minimize loss. When loss goes down, predictions get better!
  • 🎯 Predictions improve (Probability): Over time, the correct answer gets higher probability. Watch how "sat" goes from 8% to 99%!
  • 🔄 Backpropagation: Errors flow backward to update all layers. The network learns from its mistakes!
  • 🚀 Scale matters: Real LLMs use billions of parameters and massive datasets. This tiny example has just ~100 parameters!
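The takeaways above can be tied together in one small training loop. This is a minimal sketch in NumPy, not the lesson's exact demo: the hidden layer is omitted for brevity, the sizes and learning rate are illustrative, and with only five words random guessing starts near 20% rather than the demo's 8%.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["the", "cat", "sat", "on", "mat"]
V, D = len(vocab), 8
# Training pairs from "the cat sat on the mat": (context word, next word).
pairs = [("the", "cat"), ("cat", "sat"), ("sat", "on"), ("on", "the"), ("the", "mat")]

E = rng.normal(0, 0.1, (V, D))   # embedding layer (random at first)
W = rng.normal(0, 0.1, (D, V))   # output layer
lr = 0.5                          # learning rate

losses = []
for epoch in range(200):
    total = 0.0
    for ctx, tgt in pairs:
        i, t = vocab.index(ctx), vocab.index(tgt)
        logits = E[i] @ W
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                 # softmax
        total += -np.log(probs[t])           # cross-entropy loss
        # Backpropagation: the error signal is (probs - one_hot(target)),
        # and it flows backward to update both layers.
        d = probs.copy()
        d[t] -= 1.0
        grad_W = np.outer(E[i], d)
        grad_E = W @ d
        W -= lr * grad_W
        E[i] -= lr * grad_E
    losses.append(total / len(pairs))

final = np.exp(E[vocab.index("cat")] @ W)
final /= final.sum()
print(f"loss: {losses[0]:.2f} -> {losses[-1]:.2f}")
print(f"P(sat | cat): {final[vocab.index('sat')]:.2f}")  # climbs from ~0.2 toward 1.0
```

Note one detail the loop makes visible: "the" is followed by both "cat" and "mat" in the training data, so the network can never be certain after "the". It converges to roughly 50/50 there, which is the best any model could do with this data.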

🌍 From Simple to Sophisticated

What you just saw: A tiny network learning a single six-word sentence built from a 5-word vocabulary.

Modern LLMs:

  • Trained on billions of words from books, websites, and articles
  • Use billions of parameters (175B+, vs our ~100)
  • Have embeddings with thousands of dimensions (12,000+, vs our 8)
  • Can understand context, write code, translate languages, and more!

Note: Exact specifications for recent models are often proprietary and not publicly disclosed.

But the core principle is exactly the same: predict the next token, learn from mistakes, repeat!

⚙️ Training Details You Saw

Learning Rate: Notice how the loss decreased smoothly? That's because we used a small learning rate (0.01). Too large, and training becomes unstable; too small, and it takes forever!
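The learning-rate trade-off is easiest to see on the simplest possible loss, f(w) = w² with gradient 2w (a stand-in for illustration, not the lesson's actual loss):

```python
# Gradient descent on f(w) = w^2, whose gradient is 2w.
def descend(lr, steps=20, w=1.0):
    for _ in range(steps):
        w = w - lr * 2 * w   # the update rule: w <- w - lr * gradient
    return w

print(descend(0.01))  # small lr: slow but steady progress toward the minimum at 0
print(descend(0.4))   # moderate lr: converges quickly
print(descend(1.1))   # too large: each step overshoots, and w blows up
```

With lr = 1.1 every step overshoots the minimum and lands farther away than it started, which is exactly the instability described above.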

Convergence: Around epoch 30, the loss stopped improving much. This is called convergence - the model has learned as much as it can from this data.

Overfitting Risk: We trained on just one sentence. In real scenarios, we'd test on new sentences to ensure the model generalizes and doesn't just memorize!

🔬 From Feedforward to Transformers

This lesson showed you a simple feedforward network. Modern LLMs use Transformer architecture (which you learned about in Lesson 3) including:

  • Self-Attention: Helps the model focus on relevant words (like understanding "lies" differently in "speak no lies" vs "he lies down")
  • Positional Encoding: Tells the model where each word is in the sentence
  • Multiple Layers: Stack many transformer blocks to learn complex patterns
  • Batch Training: Process thousands of examples simultaneously for efficiency
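The self-attention step listed above can be sketched in a few lines of NumPy, assuming the standard scaled dot-product formulation; the sizes, random inputs, and variable names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
seq_len, d = 4, 8                      # 4 tokens, 8-dim vectors (illustrative)
X = rng.normal(size=(seq_len, d))      # token embeddings (+ positional encoding)

# Learned projection matrices for queries, keys, and values.
Wq, Wk, Wv = rng.normal(size=(3, d, d))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores = Q @ K.T / np.sqrt(d)          # relevance of every token to every other
scores -= scores.max(axis=1, keepdims=True)    # subtract row max for stability
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # softmax: each row sums to 1
out = weights @ V                      # each token becomes a weighted mix of values
```

Each row of `weights` says how much one token "attends" to every other token, which is how the model can treat "lies" differently depending on its neighbors.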

But they all start with the same foundation you just learned: embeddings, learning from data, and next token prediction!

🚀 Ready for the Next Lesson?

Your Progress
5 of 8 lessons complete • You're doing great! 💪
🛠️ Next: Working with LLMs

You've seen how LLMs are trained. Now learn practical techniques to use them without training from scratch!

📚 What You'll Learn:

  • Prompt Engineering: Craft better instructions for better outputs (free!)
  • RAG: Give models access to external knowledge and documents
  • Fine-Tuning: Teach models specialized behaviors and styles
  • Trade-offs: When to use each approach and how to decide