Published on 2026-04-30

How LLMs Actually Work, Part 1: Tokens and Transformers

The first part of a practical deep dive into LLMs - from next-token prediction and tokenization through embeddings, attention, transformer blocks, and training.

Artificial Intelligence, Large Language Models, Generative AI, Machine Learning, Software Engineering

Introduction

Over the last few years, AI has gone from being a niche research topic to something millions of people use every day.

We now have tools like ChatGPT, Claude, Gemini, Copilot, Cursor, Perplexity, and countless AI-powered products.

As software engineers, we often use these tools through APIs and SDKs. We build chatbots, AI assistants, RAG systems, and agents.

But many developers still do not fully understand what is happening underneath.

Terms like tokens, embeddings, transformers, and attention are used everywhere. This series breaks the full picture into three focused articles:

  1. Part 1 (this post): How text becomes predictions - tokens, embeddings, transformers, attention, and training
  2. Part 2: Inference, Memory, and RAG - parameters, inference, context windows, hallucinations, RAG, and fine-tuning
  3. Part 3: Agents and the Future of AI - reasoning, modern architectures, and AI agents

By the end of the series, you should have a solid mental model of how modern LLMs work and why they have become the foundation of modern AI applications.


The Core Idea Behind Every LLM

What Is A Large Language Model?

LLM stands for Large Language Model.

At its core, an LLM is a machine learning model trained on enormous amounts of text data.

The goal is surprisingly simple:

Predict the next token.

Everything else emerges from that ability.

The Next-Token Prediction Concept

Imagine the model sees:

text
The capital of France is

The model might assign probabilities like:

text
Paris  -> 95%
London -> 2%
Berlin -> 1%
Madrid -> 1%
Other  -> 1%

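In the simplest case, it picks the most likely next token, appends it to the text, and repeats the process. In practice, models often sample from these probabilities instead, which is why the same prompt can produce different answers.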

Flow from input text through tokenization and the network to predicting the next token

This happens over and over until an entire response is generated.
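The loop described above can be sketched in a few lines of Python. This is a toy illustration, not how a real model is implemented: `TOY_MODEL` and `toy_next_token_probs` are hypothetical stand-ins for the neural network and simply look up hand-written probabilities. Always taking the highest-probability token (greedy decoding) is only one possible selection strategy.

```python
# Hypothetical stand-in for a trained model: maps the text so far to
# probabilities for a few candidate next tokens (the rest are omitted).
TOY_MODEL = {
    "The capital of France is": {" Paris": 0.95, " London": 0.02, " Berlin": 0.01, " Madrid": 0.01},
    "The capital of France is Paris": {".": 0.90, ",": 0.10},
}
END = "<end>"  # made-up marker meaning "stop generating"


def toy_next_token_probs(text):
    # Any text the toy model does not know simply ends the response.
    return TOY_MODEL.get(text, {END: 1.0})


def generate(prompt, max_tokens=10):
    text = prompt
    for _ in range(max_tokens):
        probs = toy_next_token_probs(text)
        next_token = max(probs, key=probs.get)  # greedy: take the most likely token
        if next_token == END:
            break
        text += next_token  # append, then feed the longer text back in
    return text


print(generate("The capital of France is"))
```

Each pass through the loop produces exactly one token, which is why long responses take longer to generate than short ones.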

Why Next-Token Prediction Is More Powerful Than It Sounds

At first, predicting the next token sounds trivial.

But language contains:

  • Knowledge
  • Logic
  • Patterns
  • Human reasoning
  • Programming concepts
  • Mathematics

When a model learns enough language patterns, surprisingly powerful capabilities begin to emerge.


How Text Becomes Something A Neural Network Can Understand

What Are Tokens?

Computers do not understand words.

They understand numbers.

Before processing text, the model converts text into smaller units called tokens.

For example:

text
I love programming

may become:

text
["I", " love", " program", "ming"]

Different models use different tokenization strategies.
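One simple strategy is greedy longest-match against a fixed vocabulary, sketched below. The `VOCAB` set is hypothetical and was chosen by hand to reproduce the split shown above. Real tokenizers, such as those based on byte-pair encoding, learn their vocabulary from data and use more involved rules.

```python
# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
VOCAB = {"I", " love", " program", "ming"}


def tokenize(text, vocab):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink it.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matches: fall back to one character.
            tokens.append(text[i])
            i += 1
    return tokens


print(tokenize("I love programming", VOCAB))
```

Note that the space belongs to the token that follows it, so joining the tokens back together reproduces the original text exactly.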

Why Models Do Not See Words

The model never sees actual text.

Instead, it sees token IDs:

text
"I"        -> 123
" love"    -> 847
" program" -> 2938
"ming"     -> 621

Internally, everything becomes numbers.
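As a minimal sketch, the mapping is just a lookup table in both directions. The IDs below are the illustrative ones from the example above, not the IDs of any real tokenizer, and `encode` and `decode` are names chosen for this sketch.

```python
# Illustrative IDs taken from the example above.
TOKEN_TO_ID = {"I": 123, " love": 847, " program": 2938, "ming": 621}
ID_TO_TOKEN = {token_id: token for token, token_id in TOKEN_TO_ID.items()}


def encode(tokens):
    # Tokens -> numbers: this is what the model actually receives.
    return [TOKEN_TO_ID[token] for token in tokens]


def decode(token_ids):
    # Numbers -> text: this is how the model's output becomes readable again.
    return "".join(ID_TO_TOKEN[token_id] for token_id in token_ids)


ids = encode(["I", " love", " program", "ming"])
print(ids)
print(decode(ids))
```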

Why Tokens Matter

Tokens affect:

  • API pricing
  • Context windows
  • Memory limits
  • Processing speed
  • Inference costs

This is why AI providers often charge per token instead of per request.
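A rough back-of-the-envelope check might look like the sketch below. Both constants are hypothetical placeholders, not the limits or prices of any real provider, and many providers price input and output tokens differently.

```python
CONTEXT_WINDOW = 8_000            # hypothetical limit, in tokens
PRICE_PER_MILLION_TOKENS = 2.00   # hypothetical price, in dollars


def estimate(prompt_tokens, output_tokens):
    total = prompt_tokens + output_tokens
    fits = total <= CONTEXT_WINDOW                      # does the request fit at all?
    cost = total / 1_000_000 * PRICE_PER_MILLION_TOKENS  # what would it cost?
    return fits, cost


fits, cost = estimate(prompt_tokens=1_500, output_tokens=500)
print(fits, cost)
```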


How Models Understand Meaning

What Are Embeddings?

Token IDs themselves have no meaning.

The model converts tokens into vectors called embeddings.

A vector is simply a list of numbers.

text
King = [0.21, -0.84, 0.55, ...]

These vectors capture semantic meaning.
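Mechanically, an embedding layer is a big table with one row of numbers per token ID. The sketch below assumes a tiny table filled with random values. `VOCAB_SIZE` and `EMBED_DIM` are made-up sizes, and in a real model the values are learned during training rather than left random.

```python
import random

VOCAB_SIZE = 3000  # assumed; just large enough for the example IDs used earlier
EMBED_DIM = 4      # assumed; real models use hundreds or thousands of dimensions

random.seed(0)
# One row per token ID. Training would gradually adjust these numbers.
embedding_table = [
    [random.uniform(-1, 1) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    # Looking up an embedding is just indexing into the table.
    return [embedding_table[token_id] for token_id in token_ids]


vectors = embed([123, 847, 2938, 621])
print(len(vectors), len(vectors[0]))
```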

Semantic Meaning In Vector Space

Words with similar meanings tend to be close together.

Example of related words clustering in embedding space by meaning

This is how the model begins understanding relationships between concepts.
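"Close together" is commonly measured with cosine similarity, which compares the direction of two vectors. The three-dimensional vectors below are invented for illustration and do not come from a real model.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors: "cat" and "dog" point in a similar direction, "car" does not.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])
print(cat_dog, cat_car)  # the first value is much closer to 1 than the second
```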

A Famous Example

One of the most famous embedding relationships is:

text
King - Man + Woman ≈ Queen

This result, popularized by the word2vec embeddings in 2013, showed that neural networks can learn meaningful relationships from data.
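The arithmetic can be tried on hand-built vectors. In the sketch below, the two dimensions loosely stand for "royalty" and "maleness", and the numbers are invented so that the example works cleanly. With real embeddings the result only lands near Queen, and the input words are usually excluded from the search, as done here.

```python
# Invented 2-D vectors: [royalty, maleness]
toy_embeddings = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.0, 0.5],
}


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def analogy(a, b, c):
    # Compute a - b + c, then find the closest remaining word.
    target = [x - y + z for x, y, z in zip(toy_embeddings[a], toy_embeddings[b], toy_embeddings[c])]
    candidates = [word for word in toy_embeddings if word not in (a, b, c)]
    return min(candidates, key=lambda word: squared_distance(toy_embeddings[word], target))


print(analogy("king", "man", "woman"))
```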


The Transformer Breakthrough

Why AI Changed In 2017

One of the biggest breakthroughs in modern AI came from a 2017 research paper called:

Attention Is All You Need

This paper introduced the Transformer architecture.

Nearly every modern LLM is based on transformers.

High Level Transformer Architecture

High-level LLM stack from tokenizer and embeddings through transformer layers to next-token probabilities

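Unlike earlier recurrent models, which read text one token at a time, the transformer processes all tokens in a sequence in parallel and lets each one draw on context from the others. That is what makes training on huge datasets practical.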


Understanding Attention

Why Context Matters

Consider this sentence:

text
The animal did not cross the road because it was tired.

What does "it" refer to?

Humans instantly understand that "it" refers to the animal, not the road.

Transformers achieve something similar using attention.

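Self-Attention Explained

Attention allows every token to look at other tokens in the sequence and pull in information from them. (In GPT-style models that generate text left to right, each token can only look at itself and the tokens before it.)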

Self-attention connects every token to every other token to build context-aware understanding

This helps the model determine which other words are relevant to each token.

Query, Key, And Value

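Each token generates three vectors:

  • Query: what this token is looking for
  • Key: what this token offers to others
  • Value: the information this token passes along

Attention scores are calculated by comparing each token's query with every token's key.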

Query and key produce attention scores, then values are weighted and combined

The higher the score, the more attention one token pays to another.
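Here is a minimal pure-Python sketch of this calculation, following the scaled dot-product form from the original paper: score each query against each key, divide by the square root of the key size, turn the scores into weights with softmax, and use the weights to blend the values. In a real model the queries, keys, and values come from multiplying each token's vector by learned weight matrices. Here they are passed in directly as made-up numbers, and no masking is applied.

```python
import math


def softmax(scores):
    # Subtract the max for numerical stability; the result sums to 1.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Score this token's query against every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Blend the value vectors according to the attention weights.
        blended = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(blended)
    return outputs


# Two tokens with made-up 2-D queries, keys, and values.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]

for row in attention(queries, keys, values):
    print(row)
```

The first token's query matches the first key best, so its output is pulled mostly toward the first value vector, and the reverse holds for the second token.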

Multi-Head Attention

Instead of using one attention mechanism, transformers use multiple attention heads.

Multi-head attention runs several parallel attention heads then merges their outputs

Different heads can learn to track different relationships, such as:

  • Grammar
  • Long-range dependencies
  • Coding patterns
  • Semantic meaning
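The bookkeeping behind "several parallel heads" is mostly slicing and concatenation. The sketch below shows only that part, under simplifying assumptions: each head gets an equal slice of a token's vector, and the attention step inside each head is left out. In a real transformer the slices are taken from the learned query, key, and value projections, and a further learned projection is applied after merging. The function names are chosen for this sketch.

```python
def split_heads(vector, num_heads):
    # Each head gets an equal, contiguous slice of the vector.
    assert len(vector) % num_heads == 0, "vector length must divide evenly"
    head_dim = len(vector) // num_heads
    return [vector[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]


def merge_heads(heads):
    # Concatenate the per-head results back into one vector.
    return [x for head in heads for x in head]


token_vector = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
heads = split_heads(token_vector, num_heads=4)
print(heads)               # 4 slices of 2 numbers each
print(merge_heads(heads))  # same length as the original vector
```

Because each head works on a smaller slice, running several heads costs roughly the same as one full-size head.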


Inside A Transformer Block

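Feed-Forward Networks

After attention, each token's vector passes through a small feed-forward network, typically two linear layers with a non-linear activation in between. It is applied to every token position independently.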

Feed-forward sublayer refines representations after the attention output

Complete Transformer Layer

A transformer layer contains:

  • Multi-Head Attention
  • Feed-Forward Networks
  • Residual Connections
  • Layer Normalization

One transformer layer with multi-head attention, feed-forward, and residual plus layer norm

Modern LLMs stack dozens of these layers, and the largest models use around a hundred or more.
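To show how the four pieces fit together, here is a simplified sketch of one layer acting on a single token's vector. Several things are assumptions made for brevity: `attention_fn` is a hypothetical placeholder for the attention step (which in reality mixes information across tokens), the layer normalization omits its learned scale and shift, the weights are made up, and the "add, then normalize" ordering follows the original paper while many newer models normalize before each sublayer instead.

```python
import math


def layer_norm(x, eps=1e-5):
    # Rescale the vector to mean 0 and (roughly) variance 1.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def feed_forward(x, w1, w2):
    # w1[i] holds the weights producing hidden unit i; ReLU is the non-linearity.
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row))) for row in w1]
    # w2[i] holds the weights producing output value i.
    return [sum(hi * w for hi, w in zip(hidden, row)) for row in w2]


def transformer_layer(x, attention_fn, w1, w2):
    # Sublayer 1: attention, added back to the input (residual), then normalized.
    x = layer_norm([a + b for a, b in zip(x, attention_fn(x))])
    # Sublayer 2: feed-forward, again with a residual connection and normalization.
    x = layer_norm([a + b for a, b in zip(x, feed_forward(x, w1, w2))])
    return x


# Made-up weights: 4 inputs -> 2 hidden units -> 4 outputs.
w1 = [[0.1, 0.2, 0.3, 0.4], [-0.1, 0.1, -0.1, 0.1]]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

output = transformer_layer([1.0, 2.0, 3.0, 4.0], lambda v: [0.5 * t for t in v], w1, w2)
print(output)
```

The output has the same size as the input, which is what allows many such layers to be stacked one after another.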


How Models Learn

The Training Objective

Training is simple in theory.

The model repeatedly predicts the next token and compares its prediction with the token that actually comes next in the training text.
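That comparison is usually expressed as a cross-entropy loss: the negative logarithm of the probability the model assigned to the correct token. A small sketch, reusing the illustrative probabilities from the France example earlier:

```python
import math


def cross_entropy(predicted_probs, correct_token):
    # Low when the correct token got high probability, high otherwise.
    return -math.log(predicted_probs[correct_token])


probs = {"Paris": 0.95, "London": 0.02, "Berlin": 0.01, "Madrid": 0.01}

print(cross_entropy(probs, "Paris"))   # small loss: the model was confident and right
print(cross_entropy(probs, "London"))  # large loss: the correct token got little probability
```

Training tries to push this number down, averaged over a very large number of predictions.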

The Training Pipeline

Training loop from dataset through forward pass, loss, backpropagation, and weight updates repeated at scale

Backpropagation

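After each prediction, the model learns from its error:

  1. Calculate the error (the loss)
  2. Send the error backward through the network to measure how much each weight contributed
  3. Update the weights slightly to reduce the error
  4. Repeat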

Over billions of examples, the model gradually improves.
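The same loop can be run on a deliberately tiny "model" whose only weights are three scores (logits), one per candidate token. This is a sketch under heavy simplification: with a single layer, the gradient of the cross-entropy loss has a simple closed form (predicted probability minus 1 for the correct token, predicted probability for the others), whereas real backpropagation applies the chain rule through every layer. The vocabulary, learning rate, and step count are arbitrary choices.

```python
import math


def softmax(logits):
    highest = max(logits)
    exps = [math.exp(v - highest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["Paris", "London", "Berlin"]
target = 0                 # index of the correct next token, "Paris"
logits = [0.0, 0.0, 0.0]   # the toy model's only weights; it starts out undecided
learning_rate = 0.5

losses = []
for step in range(50):
    probs = softmax(logits)                  # forward pass: make a prediction
    losses.append(-math.log(probs[target]))  # 1. calculate the error
    # 2. how much each weight contributed to the error
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    # 3. nudge each weight in the direction that reduces the error
    logits = [w - learning_rate * g for w, g in zip(logits, grads)]
    # 4. repeat

print(losses[0], losses[-1])
print(softmax(logits)[target])
```

The loss shrinks with each step and the probability of the correct token climbs, which is the same behavior training produces at a vastly larger scale.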

Why GPUs Matter

Training requires:

  • Massive matrix operations
  • Billions of parameters
  • Trillions of tokens

GPUs excel at performing these calculations in parallel.

Without GPUs, modern LLMs would not exist at their current scale.


What's Next

You now have the foundation: text becomes tokens, tokens become embeddings, transformers use attention to build context, and training teaches the model to predict the next token at scale.

In Part 2, we cover what happens when you actually use a model - inference, context windows, memory limits, hallucinations, RAG, and fine-tuning.

Then in Part 3, we look at reasoning, modern architectures like mixture of experts, and how LLMs become AI agents.