From Text to Intelligence: How AI Understands Language
You've learned about neural networks and how they work. But language presents unique challenges that make it different from other types of data like images or numbers.
"Dog bites man" vs "Man bites dog"
Same words, completely different meaning! The sequence is crucial.
"I went to the bank" - River or money?
The same word can mean different things based on surrounding words.
King → Queen, Man → Woman
Words have semantic relationships that AI needs to understand.
Sentences can be 3 words or 300 words
Unlike fixed-size images, text length varies dramatically.
Before a neural network can process text, we need to break it into smaller pieces called tokens (pieces of text like words, subwords, or characters).
Why? Neural networks work with numbers, not text. Tokenization is the first step in converting text to numbers.
Throughout these lessons, we'll use "token" and "word" somewhat interchangeably for simplicity, though technically tokens can be smaller than words.
Modern LLMs use sophisticated subword tokenization (such as byte-pair encoding), which splits rare words into smaller pieces so any text can be handled with a fixed vocabulary.
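To make this concrete, here is a toy tokenizer sketch. The vocabulary is made up for illustration; real tokenizers learn their vocabularies from huge amounts of text, but the core idea of mapping text pieces to integer IDs is the same.

```python
# A toy subword tokenizer: greedy longest-match against a tiny
# hand-made vocabulary (hypothetical, not a real LLM vocabulary).
VOCAB = ["un", "break", "able", "the", " ", "cat"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize(text):
    """Greedily match the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for tok in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return tokens

tokens = tokenize("unbreakable")
print(tokens)                          # ['un', 'break', 'able']
print([TOKEN_TO_ID[t] for t in tokens])  # [0, 1, 2]
```

Notice how "unbreakable" (a word the vocabulary doesn't contain) still gets represented, as three smaller pieces.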
Once we have tokens, we convert each one into an embedding vector (a list of numbers that represents the token's meaning). These vectors capture the meaning, grammar, and relationships of words.
Key Insight: Similar words get similar embedding vectors! Words like "king" and "queen" will have vectors that are close together in the embedding space.
"cat" → [0, 0, 1, 0, 0]
"dog" → [0, 0, 0, 1, 0]
Problem: No relationship captured!
Cat and dog are just as different from each other as either is from "car".
"cat" → [0.2, 0.8, -0.3, 0.5, ...]
"dog" → [0.3, 0.7, -0.2, 0.6, ...]
Success: Similarity captured!
Cat and dog vectors are close because they're both animals!
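You can measure this closeness with cosine similarity. The dense vectors below are made-up numbers chosen to illustrate the point; real embeddings are learned during training.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# One-hot vectors: every pair of distinct words is equally unrelated.
cat_onehot, dog_onehot, car_onehot = [0, 0, 1, 0, 0], [0, 0, 0, 1, 0], [1, 0, 0, 0, 0]
print(cosine(cat_onehot, dog_onehot))  # 0.0
print(cosine(cat_onehot, car_onehot))  # 0.0

# Toy "learned" embeddings (hypothetical numbers): cat and dog end up close.
cat = [0.2, 0.8, -0.3, 0.5]
dog = [0.3, 0.7, -0.2, 0.6]
car = [-0.6, 0.1, 0.9, -0.4]
print(round(cosine(cat, dog), 3))  # high: similar animals
print(round(cosine(cat, car), 3))  # low: unrelated concepts
```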
This 2D visualization shows how words cluster by meaning. Real embeddings use hundreds to thousands of dimensions!
Embeddings capture relationships so well that you can do math with words!
This works because the "gender" direction in the embedding space is consistent across word pairs!
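Here is a sketch of that word arithmetic with tiny hypothetical 3-dimensional embeddings. The numbers are invented so that the second component acts as the "gender" direction; real analogies use learned vectors with hundreds of dimensions, but the principle is the same.

```python
# Hypothetical embeddings: the second component plays the role of the
# "gender" direction (king - queen == man - woman == [0, 0.6, 0]).
emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.2, 0.1],
    "man":   [0.5, 0.9, 0.3],
    "woman": [0.5, 0.3, 0.3],
}

def add(u, v):  return [a + b for a, b in zip(u, v)]
def sub(u, v):  return [a - b for a, b in zip(u, v)]
def dist(u, v): return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

# king - man + woman
target = add(sub(emb["king"], emb["man"]), emb["woman"])

# Find the nearest word to the result (excluding the input words).
nearest = min((w for w in emb if w not in ("king", "man", "woman")),
              key=lambda w: dist(emb[w], target))
print(nearest)  # queen
```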
At their core, language models have one job: predict the next token given the previous tokens.
This simple task is incredibly powerful! By learning to predict the next word, models learn grammar, facts, reasoning, and even creativity.
"The cat sat on the ___"
Show the model millions of examples:
"The cat" → predict "sat"
"cat sat" → predict "on"
"sat on" → predict "the"
Model learns patterns from data!
Takes weeks/months on powerful GPUs
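As a miniature stand-in for "learning from examples", the sketch below just counts which word follows which in a toy corpus. A real model learns smooth probabilities over billions of words with gradient descent, but the objective, predicting what comes next, is the same.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical); real training data is billions of words.
corpus = "the cat sat on the mat the cat ran".split()

# "Training": count how often each word follows each other word.
next_word = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_word[prev][nxt] += 1

# After "the", which word is most likely?
print(next_word["the"].most_common())  # [('cat', 2), ('mat', 1)]
```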
Use the trained model to create text:
Start: "Once upon"
Predict: "a" → "Once upon a"
Predict: "time" → "Once upon a time"
Keep predicting to generate stories!
Happens in seconds when you use modern AI assistants
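The generation loop itself is simple: predict, append, repeat. In this sketch, `predict_next` is a hypothetical lookup table standing in for a trained network; a real model outputs a probability for every token in its vocabulary at each step.

```python
# Hypothetical next-token "model": maps the last two tokens to the next one.
TABLE = {
    ("Once", "upon"): "a",
    ("upon", "a"):    "time",
    ("a", "time"):    ".",
}

def predict_next(context):
    """Return the most likely next token given the last two tokens."""
    return TABLE.get(tuple(context[-2:]), ".")

tokens = ["Once", "upon"]
while tokens[-1] != ".":          # stop when the model predicts an end marker
    tokens.append(predict_next(tokens))

print(" ".join(tokens))  # Once upon a time .
```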
Training is when the model learns from data (expensive, slow, done once by large AI companies).
Inference is when you use the trained model (fast, cheap, happens every time you chat with AI).
When generating text, models don't always pick the highest-probability word; always taking the top choice would be boring and repetitive!
Temperature controls randomness: low temperature makes the model favor the most likely words (predictable), while high temperature spreads probability across more options (varied and creative).
This is why AI assistants can give different answers to the same question - they're sampling from the probability distribution, not always picking the top choice!
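Temperature works by rescaling the model's raw scores (logits) before turning them into probabilities with a softmax. The scores below are made up for illustration.

```python
import math, random

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities.
    Low temperature sharpens the distribution (more predictable);
    high temperature flattens it (more varied)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for four candidate next words.
words = ["mat", "roof", "moon", "piano"]
logits = [3.0, 2.0, 1.0, 0.1]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

# Sampling instead of always taking the top word:
random.seed(0)
probs = softmax_with_temperature(logits, 1.0)
print(random.choices(words, weights=probs, k=5))
```

At temperature 0.2 almost all the probability lands on "mat"; at 2.0 the distribution is much flatter, so rarer words get sampled more often.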
When predicting the next word, not all previous words are equally important. The model needs to focus on relevant words and ignore irrelevant ones.
Example: "The animal didn't cross the street because it was too tired"
What does "it" refer to? The animal or the street? Attention helps the model figure this out!
Modern transformers use multiple attention heads in parallel. Each head can focus on different aspects: one might track grammatical structure, another word meaning, another long-range references like pronouns.
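Here is a minimal sketch of scaled dot-product attention for a single head, in plain Python. In a real transformer the queries, keys, and values are computed with learned weight matrices; here the vectors are invented so that the query for "it" matches the key for "animal".

```python
import math

def attention(query, keys, values):
    """Weight each value by how well its key matches the query."""
    d = len(query)
    # dot-product match scores, scaled by sqrt(dimension)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # softmax turns scores into attention weights that sum to 1
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # output = weighted sum of the value vectors
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return weights, out

# Hypothetical 2-d vectors for the tokens "animal", "street", "it".
keys   = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.1]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query  = [1.0, 0.2]   # query for "it", built to match "animal"

weights, _ = attention(query, keys, values)
print([round(w, 3) for w in weights])  # largest weight on "animal"
```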
Language models can only "remember" a limited number of tokens at once - this is called the context window or context length.
If your conversation or document exceeds this limit, the model "forgets" the earliest parts. Longer context windows allow models to work with entire books or long conversations!
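The simplest way to handle an over-long input is to keep only the most recent tokens. This sketch assumes a hypothetical 8-token window; real windows are thousands to hundreds of thousands of tokens.

```python
# Hypothetical tiny context window for illustration.
CONTEXT_WINDOW = 8

def fit_to_context(tokens, limit=CONTEXT_WINDOW):
    """Drop the oldest tokens, keeping only the `limit` most recent."""
    return tokens[-limit:]

history = "once upon a time there was a very sleepy cat".split()
print(len(history))             # 10 tokens: too many for the window
print(fit_to_context(history))  # the 8 most recent tokens survive
```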
The Transformer is the architecture that powers modern LLMs. It combines embeddings, attention, and feed-forward networks in a clever way.
Key Innovation: Transformers replaced older RNN/LSTM architectures because they can process all words in parallel, making them much faster to train!
Remember how we said order matters in language? But embeddings alone don't capture position!
The Problem: The embedding for "cat" is the same whether it's the first word or the last word in a sentence.
The Solution: Positional encoding adds information about where each word appears in the sequence. It's like giving each word a "position tag" so the model knows "this is word #3" vs "this is word #7".
This is added to the embedding vector before it enters the transformer, allowing the model to understand both what the word is and where it appears.
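One common scheme is sinusoidal positional encoding, from the original Transformer paper: each position gets a unique pattern of sine and cosine values. The embedding below is a made-up example vector.

```python
import math

def positional_encoding(position, dim):
    """Sinusoidal encoding vector for one position, with `dim` dimensions."""
    enc = []
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        enc.append(math.sin(angle))  # even dimensions use sine
        enc.append(math.cos(angle))  # odd dimensions use cosine
    return enc[:dim]

embedding_dim = 4
cat_embedding = [0.2, 0.8, -0.3, 0.5]   # hypothetical embedding for "cat"

# The same word at position 0 vs position 3 produces different inputs:
for pos in (0, 3):
    pe = positional_encoding(pos, embedding_dim)
    combined = [e + p for e, p in zip(cat_embedding, pe)]
    print(pos, [round(x, 3) for x in combined])
```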
Each transformer block contains two main components: a self-attention layer, which lets each token gather information from the other tokens, and a feed-forward network, which transforms each position independently.
Two additional tricks make transformers work well: residual connections (each sublayer's output is added back to its input) and layer normalization (keeping values in a stable numeric range).
These allow stacking many transformer blocks (large models can have 96+ layers!)
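A sketch of those two tricks, using a toy stand-in for the attention and feed-forward sublayers. The pattern inside each block is: normalize, apply the sublayer, then add the input back.

```python
import math

def layer_norm(x, eps=1e-5):
    """Rescale a vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or feed-forward."""
    return [v * 0.5 for v in x]

def block(x):
    # residual connection: output = input + sublayer(normalized input)
    x = [a + b for a, b in zip(x, toy_sublayer(layer_norm(x)))]  # "attention"
    x = [a + b for a, b in zip(x, toy_sublayer(layer_norm(x)))]  # "feed-forward"
    return x

# Because each block adds to its input rather than replacing it, the
# original signal survives even through a deep stack of blocks:
x = [0.2, 0.8, -0.3, 0.5]
for _ in range(96):        # 96 stacked blocks, as in large models
    x = block(x)
print([round(v, 2) for v in x])
```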
The principles you've learned apply to models of all sizes. But when we scale up dramatically, something magical happens: emergent abilities appear!
When models get large enough, they suddenly gain abilities they weren't explicitly trained for: learning new tasks from a few examples in the prompt, multi-step reasoning, translation, and arithmetic.
Parameters: Hundreds of billions to trillions
Use cases: Conversational AI, coding assistants, content creation
Can understand images, write code, solve math problems, and engage in nuanced conversations.
Parameters: Billions
Use cases: Analysis, writing, coding, research
Designed for safety and helpfulness, with strong reasoning and extended context windows.
Parameters: 7B to 70B variants
Use cases: Research, open-source applications
Community-driven models that enable researchers and developers to build custom applications.
Parameters: 110M to 340M
Use cases: Search, classification, Q&A
Models that understand context from both directions, powering search engines and classification tasks.
LLMs are evolving rapidly. Here are some exciting directions based on current research and industry trends:
Training on text, images, audio, and video together
Models that can understand and generate across all media types!
Smaller models with better performance
Techniques like distillation and quantization make models faster and cheaper to run.
LLMs that can use tools and take actions
Models that can browse the web, run code, and interact with APIs to accomplish tasks.
Domain-specific LLMs for medicine, law, science
Fine-tuned models with deep expertise in specific fields.
LLMs are already transforming industries.
You've completed the fourth lesson in your machine learning journey!
You've learned the concepts - now it's time to see them in action! Watch a real neural network learn to predict the next word, going from random guesses to 99% accuracy.