Published on 2026-04-30
How LLMs Actually Work, Part 1: Tokens and Transformers
The first part of a practical deep dive into LLMs - from next-token prediction and tokenization through embeddings, attention, transformer blocks, and training.
Introduction
Over the last few years, AI has gone from being a niche research topic to something millions of people use every day.
We now have tools like ChatGPT, Claude, Gemini, Copilot, Cursor, Perplexity, and countless AI-powered products.
As software engineers, we often use these tools through APIs and SDKs. We build chatbots, AI assistants, RAG systems, and agents.
But many developers still do not fully understand what is happening underneath.
Terms like tokens, embeddings, transformers, and attention are used everywhere. This series breaks the full picture into three focused articles:
- Part 1 (this post): How text becomes predictions - tokens, embeddings, transformers, attention, and training
- Part 2: Inference, Memory, and RAG - parameters, inference, context windows, hallucinations, RAG, and fine-tuning
- Part 3: Agents and the Future of AI - reasoning, modern architectures, and AI agents
By the end of the series, you should have a solid mental model of how modern LLMs work and why they have become the foundation of modern AI applications.
The Core Idea Behind Every LLM
What Is A Large Language Model?
LLM stands for Large Language Model.
At its core, an LLM is a machine learning model trained on enormous amounts of text data.
The goal is surprisingly simple:
Predict the next token.
Everything else emerges from that ability.
The Next Token Prediction Concept
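Imagine the model sees an unfinished sentence such as "The sky is".
The model might assign a high probability to "blue", smaller probabilities to tokens like "clear" or "falling", and a tiny probability to almost everything else in its vocabulary.
It then picks a next token, either the most likely one or one sampled from those probabilities, appends it to the text, and repeats the process.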
This happens over and over until an entire response is generated.
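The loop described above can be sketched in a few lines of Python. This is only an illustration: `TOY_MODEL` is a made-up lookup table with invented probabilities, `next_token_probs` is a hypothetical stand-in for a real neural network, and the sketch always picks the most likely token (greedy decoding) rather than sampling.

```python
# Hypothetical next-token table with made-up probabilities.
# A real LLM computes these with a neural network over the whole context.
TOY_MODEL = {
    "the": {"sky": 0.6, "sea": 0.4},
    "sky": {"is": 0.9, "was": 0.1},
    "is": {"blue": 0.7, "clear": 0.2, "falling": 0.1},
    "blue": {"<end>": 1.0},
}


def next_token_probs(tokens):
    """Stand-in for a model: this toy version only looks at the last token."""
    return TOY_MODEL.get(tokens[-1], {"<end>": 1.0})


def generate(prompt_tokens, max_new_tokens=10):
    """Repeatedly pick the most likely next token and append it."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        best = max(probs, key=probs.get)  # greedy choice
        if best == "<end>":
            break
        tokens.append(best)
    return tokens


print(generate(["the"]))  # ['the', 'sky', 'is', 'blue']
```

Swapping the `max` call for a weighted random choice would turn this into sampling, which is why the same prompt can produce different responses.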
Why Next Token Prediction Is More Powerful Than It Sounds
At first, predicting the next token sounds trivial.
But language contains:
- Knowledge
- Logic
- Patterns
- Human reasoning
- Programming concepts
- Mathematics
When a model learns enough language patterns, surprisingly powerful capabilities begin to emerge.
How Text Becomes Something A Neural Network Can Understand
What Are Tokens?
Computers do not understand words.
They understand numbers.
Before processing text, the model converts text into smaller units called tokens.
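For example, a long word like "unbelievable" may become several pieces such as "un", "believ", and "able", while a short, common word often stays a single token.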
Different models use different tokenization strategies.
Why Models Do Not See Words
The model never sees actual text.
Instead, it sees token IDs: each token is replaced by an integer that indexes into the model's vocabulary.
Internally, everything becomes numbers.
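To make the text-to-IDs step concrete, here is a minimal sketch. `TOY_VOCAB` is a hypothetical seven-entry vocabulary, and the greedy longest-match splitting is a simplification chosen for readability; production tokenizers use learned schemes with vocabularies of tens of thousands of entries or more, so the actual pieces and IDs differ from model to model.

```python
# Hypothetical vocabulary mapping token text to token ID.
TOY_VOCAB = {"un": 0, "believ": 1, "able": 2, "token": 3, "s": 4, " ": 5, "<unk>": 6}


def tokenize(text, vocab=TOY_VOCAB):
    """Split text into the longest pieces found in the vocabulary."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest possible substring first, then shrink.
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                pieces.append(text[i:end])
                i = end
                break
        else:
            # No known piece starts here: emit an unknown marker.
            pieces.append("<unk>")
            i += 1
    return pieces


def encode(text, vocab=TOY_VOCAB):
    """Convert text to the list of integer IDs the model would receive."""
    return [vocab[piece] for piece in tokenize(text, vocab)]


print(tokenize("unbelievable tokens"))  # ['un', 'believ', 'able', ' ', 'token', 's']
print(encode("unbelievable tokens"))    # [0, 1, 2, 5, 3, 4]
```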
Why Tokens Matter
Tokens affect:
- API pricing
- Context windows
- Memory limits
- Processing speed
- Inference costs
This is why AI providers often charge per token instead of per request.
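As a rough illustration of per-token billing, the arithmetic looks like this. The function name and the prices are invented for the example, and the split into separate input and output prices is an assumption that matches how many providers bill, not a rule.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    """Estimate the cost of one request from its token counts.

    Prices are expressed per one million tokens.
    """
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


# Hypothetical prices: 3.00 per million input tokens, 15.00 per million output tokens.
cost = estimate_cost(2_000, 500, 3.00, 15.00)
print(f"{cost:.4f}")  # 0.0135
```

The same request sent with a longer prompt costs more even though it is still one request, which is the practical reason token counts matter.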
How Models Understand Meaning
What Are Embeddings?
Token IDs themselves have no meaning.
The model converts tokens into vectors called embeddings.
A vector is simply a list of numbers.
These vectors capture semantic meaning.
Semantic Meaning In Vector Space
Words with similar meanings tend to be close together.
This is how the model begins understanding relationships between concepts.
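"Close together" is usually measured with cosine similarity. The sketch below uses hand-written three-dimensional vectors in a hypothetical `EMBEDDINGS` table purely for illustration; real embeddings are learned during training and have hundreds or thousands of dimensions.

```python
import math

# Hypothetical, hand-written embeddings. Real ones are learned, not chosen.
EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


cat_dog = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["dog"])
cat_car = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["car"])
print(cat_dog > cat_car)  # True: "cat" is closer to "dog" than to "car"
```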
A Famous Example
One of the most famous embedding relationships is:
king - man + woman ≈ queen
This demonstrated that neural networks can learn meaningful relationships from data.
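The commonly cited analogy king - man + woman ≈ queen can be reproduced with hand-made vectors. Everything here is an assumption made for the demo: the `VECTORS` table has two dimensions chosen by hand (roughly "royalty" and "gender"), whereas real embeddings learn such directions from data and only approximately.

```python
import math

# Hypothetical 2-D vectors: [royalty, gender]. Chosen by hand for illustration.
VECTORS = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def analogy(a, b, c):
    """Return the word whose vector is closest to a - b + c."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    # Exclude the input words, as analogy evaluations usually do.
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(VECTORS[w], target))


print(analogy("king", "man", "woman"))  # queen
```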
The Transformer Breakthrough
Why AI Changed In 2017
The biggest breakthrough in modern AI came from a research paper called:
Attention Is All You Need
This paper introduced the Transformer architecture.
Nearly every modern LLM is based on transformers.
High Level Transformer Architecture
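Unlike earlier recurrent models, which read text one token at a time, the transformer processes all the tokens in a sequence in parallel and uses attention to relate them to each other. This is what enables models to use context efficiently.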
Understanding Attention
Why Context Matters
Consider this sentence:
The animal didn't cross the street because it was too tired.
What does "it" refer to?
Humans instantly understand that "it" means "animal".
Transformers achieve something similar using attention.
Self Attention Explained
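Self-attention allows every token to look at other tokens in the sequence and weigh how relevant each one is. In GPT-style models, a token can only look at the tokens that come before it.
This helps the model determine which words matter for interpreting the current one.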
Query, Key, And Value
Each token generates:
- Query
- Key
- Value
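Attention scores are calculated by comparing one token's query with every token's key, typically with a dot product.
The scores are normalized with softmax, and the resulting weights are used to blend the value vectors into the token's new representation.
The higher the score, the more attention one token pays to another.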
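Here is a minimal sketch of that calculation, known as scaled dot-product attention: scores are query-key dot products divided by the square root of the key dimension, softmax turns them into weights, and the output is a weighted sum of values. The sketch takes queries, keys, and values as given, so the learned projections that produce them, and the causal mask used in GPT-style models, are left out.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Returns (outputs, weights): one output vector and one weight row per query.
    """
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one coordinate at a time.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = attention([[5.0, 0.0]], keys, values)
print(weights[0][0] > weights[0][1])  # True: the query matches the first key best
```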
Multi Head Attention
Instead of using one attention mechanism, transformers use multiple attention heads.
Each head works on its own slice of the representation, so different heads can specialize in different relationships, such as:
- Grammar
- Long range dependencies
- Coding patterns
- Semantic meaning
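The splitting idea can be sketched as follows. This is a simplified illustration under a loud assumption: each head simply receives a slice of every token vector and uses it as query, key, and value, whereas real transformers apply separate learned projection matrices per head plus a final output projection. The function names are hypothetical.

```python
import math


def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention; returns one output vector per query."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


def split_heads(vectors, num_heads):
    """Give each head its own slice of every token vector."""
    if len(vectors[0]) % num_heads != 0:
        raise ValueError("vector size must be divisible by num_heads")
    head_dim = len(vectors[0]) // num_heads
    return [[v[h * head_dim:(h + 1) * head_dim] for v in vectors]
            for h in range(num_heads)]


def multi_head_self_attention(x, num_heads):
    """Run attention independently per head, then concatenate the results."""
    head_outputs = [attend(h, h, h) for h in split_heads(x, num_heads)]
    return [sum((head_outputs[h][t] for h in range(num_heads)), [])
            for t in range(len(x))]


tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
result = multi_head_self_attention(tokens, num_heads=2)
print(len(result), len(result[0]))  # 3 4: same shape as the input
```

Because each head computes its own attention weights, one head can focus on a nearby word while another focuses on something far away in the same pass.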
Inside A Transformer Block
Feed Forward Networks
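After attention, each token's vector passes through a small feed-forward network, typically two linear layers with a nonlinearity in between, applied to every position independently.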
Complete Transformer Layer
A transformer layer contains:
- Multi Head Attention
- Feed Forward Networks
- Residual Connections
- Layer Normalization
Modern LLMs stack dozens or even hundreds of these layers.
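The way those four pieces fit together can be sketched structurally. Several things here are assumptions for the sake of a short example: the block uses the "pre-norm" ordering (normalize, apply the sublayer, add the result back), which is one common arrangement, while the original paper normalizes after the addition; `uniform_causal_attention` and `relu_feed_forward` are hypothetical stand-ins with no learned weights; and layer normalization omits its learned scale and shift.

```python
import math


def layer_norm(vector, eps=1e-5):
    """Rescale a vector to zero mean and unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]


def add(a, b):
    """Residual connection: element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


def uniform_causal_attention(tokens):
    """Stand-in for attention: average each token with the tokens before it."""
    outputs = []
    for t in range(len(tokens)):
        visible = tokens[: t + 1]
        outputs.append([sum(column) / len(visible) for column in zip(*visible)])
    return outputs


def relu_feed_forward(vector):
    """Stand-in for the feed-forward network: just the nonlinearity."""
    return [max(0.0, x) for x in vector]


def transformer_block(x, attention_fn, feed_forward_fn):
    """One pre-norm block: two sublayers, each wrapped in norm + residual."""
    attended = attention_fn([layer_norm(token) for token in x])
    x = [add(token, a) for token, a in zip(x, attended)]
    x = [add(token, feed_forward_fn(layer_norm(token))) for token in x]
    return x


hidden = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
for _ in range(3):  # stacking layers means feeding the output back in
    hidden = transformer_block(hidden, uniform_causal_attention, relu_feed_forward)
print(len(hidden), len(hidden[0]))  # 2 4: the shape is preserved by every layer
```

Because every block keeps the input shape, blocks can be stacked as deep as the compute budget allows.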
How Models Learn
The Training Objective
Training is simple in theory.
The model repeatedly predicts the next token and compares its prediction with the correct answer.
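That comparison is typically done with cross-entropy loss: the negative log of the probability the model gave to the token that actually came next. A minimal sketch, with made-up probabilities and a hypothetical function name:

```python
import math


def cross_entropy(predicted_probs, correct_token):
    """Loss is low when the model gave the correct token a high probability."""
    return -math.log(predicted_probs[correct_token])


# Hypothetical model output for the context "The sky is".
probs = {"blue": 0.7, "clear": 0.2, "falling": 0.1}

confident_and_right = cross_entropy(probs, "blue")
surprised = cross_entropy(probs, "falling")
print(confident_and_right < surprised)  # True
```

A perfect prediction (probability 1.0 on the correct token) gives a loss of zero, and the loss grows without bound as that probability approaches zero.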
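The Training Pipeline
In outline: raw text is split into tokens, the model predicts the next token at every position, each prediction is scored against the token that actually came next, and the weights are adjusted to reduce the error.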
Backpropagation
When the model makes mistakes:
- Calculate error
- Send error backward through the network
- Update weights
- Repeat
Over billions of examples, the model gradually improves.
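The calculate, send backward, update, repeat cycle can be shown on the smallest possible "model". As a deliberate simplification, the only weights here are three raw scores (logits), one per token in a three-token vocabulary, so the backward pass is a single step: for softmax with cross-entropy, the gradient with respect to each logit is the predicted probability minus 1 for the correct token and minus 0 otherwise. Real backpropagation applies the chain rule through every layer of the network. The names and learning rate are hypothetical.

```python
import math


def softmax(logits):
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(logits, target, learning_rate=0.5):
    """One cycle: compute the error, compute gradients, update the weights."""
    probs = softmax(logits)
    loss = -math.log(probs[target])                    # calculate error
    grads = [p - (1.0 if i == target else 0.0)         # gradient of the loss
             for i, p in enumerate(probs)]
    new_logits = [w - learning_rate * g                # update weights
                  for w, g in zip(logits, grads)]
    return new_logits, loss


logits = [0.0, 0.0, 0.0]  # start with no preference between the three tokens
losses = []
for _ in range(20):        # repeat
    logits, loss = train_step(logits, target=2)
    losses.append(loss)

print(losses[-1] < losses[0])  # True: the loss shrinks as training proceeds
```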
Why GPUs Matter
Training requires:
- Massive matrix operations
- Billions of parameters
- Trillions of tokens
GPUs excel at performing these calculations in parallel.
Without GPUs and similar accelerators, modern LLMs would not exist at their current scale.
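To see why parallel hardware helps, consider a plain matrix multiplication. In the sketch below (pure Python, written for clarity rather than speed, with hypothetical function names), every output cell is computed independently of the others, which is exactly the kind of work that can be spread across thousands of cores. The operation count of roughly 2 x m x k x n is the usual approximation of one multiply and one add per inner-loop step.

```python
def matmul(a, b):
    """Multiply an m x k matrix by a k x n matrix, one output cell at a time."""
    rows, inner, cols = len(a), len(b), len(b[0])
    # Each (i, j) cell depends only on row i of a and column j of b,
    # so all cells could be computed at the same time.
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


def matmul_flops(m, k, n):
    """Approximate floating-point operations for an (m x k) @ (k x n) product."""
    return 2 * m * k * n


print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
print(matmul_flops(4096, 4096, 4096))              # 137438953472
```

A single product of two 4096 x 4096 matrices already takes over a hundred billion operations, and training repeats products like this an enormous number of times.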
What's Next
You now have the foundation: text becomes tokens, tokens become embeddings, transformers use attention to build context, and training teaches the model to predict the next token at scale.
In Part 2, we cover what happens when you actually use a model - inference, context windows, memory limits, hallucinations, RAG, and fine-tuning.
Then in Part 3, we look at reasoning, modern architectures like mixture of experts, and how LLMs become AI agents.