Published on 2026-05-29
How LLMs Actually Work, Part 2: Inference, Memory, and RAG
The second part of a practical deep dive into LLMs - parameters, inference, context windows, hallucinations, retrieval augmented generation, and fine-tuning.
Introduction
In Part 1, we covered the core mechanics of LLMs: next-token prediction, tokenization, embeddings, transformers, attention, and training.
Training teaches a model patterns from massive datasets. But when you send a prompt to ChatGPT or call an API, you are doing something different - inference.
This post covers what happens at runtime and how engineers work around the model's built-in limits:
- Parameters and what they store
- The inference pipeline and temperature
- Context windows and memory
- Why models hallucinate
- RAG for grounding answers in real data
- Fine-tuning and RLHF for specialization
Series overview:
- Part 1: Tokens and Transformers
- Part 2 (this post): Inference, Memory, and RAG
- Part 3: Agents and the Future of AI
Parameters And Knowledge
What Are Parameters?
Parameters are the learned weights inside the model.
Examples:
- GPT-2: 1.5 billion parameters
- GPT-3: 175 billion parameters
Parameters store the patterns learned during training.
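To make "learned weights" concrete, here is a minimal sketch that counts the parameters in a single fully connected layer. The function name and the layer sizes are illustrative assumptions, not taken from any specific model.

```python
def dense_layer_param_count(n_inputs: int, n_outputs: int) -> int:
    """Count the learned numbers in one fully connected layer."""
    weights = n_inputs * n_outputs  # one weight per input-output connection
    biases = n_outputs              # one bias per output unit
    return weights + biases


# Hypothetical layer sizes, chosen only for illustration
print(dense_layer_param_count(512, 2048))
```

A full model stacks many such layers, which is how totals reach the billions.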
Bigger Does Not Always Mean Better
Larger models generally perform better, but parameter count is not the only factor.
The following can often matter more than simply adding parameters:
- Better training data
- Better architecture
- Better alignment
What Happens When You Ask ChatGPT A Question?
Training Vs Inference
Training adjusts the model's parameters.
Inference runs the trained model, with its parameters fixed, to generate output for a user's prompt.
Inference Pipeline
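As a rough sketch, inference can be pictured as a loop: tokenize the prompt, predict a distribution over the next token, sample one token, append it, and repeat until a stop token or a length limit. The toy below assumes a hypothetical lookup table (`NEXT_TOKEN_PROBS`) in place of a real neural network and a whitespace tokenizer in place of a real one, so it only illustrates the shape of the loop.

```python
import random

# Hypothetical stand-in for a trained model: next-token probabilities
# keyed by the previous token. A real LLM computes these with a network.
NEXT_TOKEN_PROBS = {
    "the": {"cat": 0.9, "sat": 0.1},
    "cat": {"sat": 0.8, "<eos>": 0.2},
    "sat": {"<eos>": 1.0},
}


def tokenize(text):
    # Toy tokenizer: lowercase and split on whitespace
    return text.lower().split()


def predict_next(tokens):
    # Unknown tokens fall back to ending the sequence
    return NEXT_TOKEN_PROBS.get(tokens[-1], {"<eos>": 1.0})


def generate(prompt, max_new_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = tokenize(prompt)
    if not tokens:
        return ""
    for _ in range(max_new_tokens):
        probs = predict_next(tokens)
        candidates, weights = zip(*probs.items())
        next_token = rng.choices(candidates, weights=weights, k=1)[0]
        if next_token == "<eos>":  # stop token ends generation
            break
        tokens.append(next_token)  # output becomes part of the next input
    return " ".join(tokens)


print(generate("the"))
```

The key point is the last line of the loop: each generated token is fed back in as input for the next prediction.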
Temperature
Temperature controls how random the next-token choice is. It rescales the model's output scores before a token is sampled.
Low temperature:
- Predictable
- Stable
- Better suited to factual tasks (though it does not guarantee correctness)
High temperature:
- More creative
- More diverse
- Less predictable
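The effect of temperature can be sketched with a temperature-scaled softmax: scores (logits) are divided by the temperature before being turned into probabilities. The logit values below are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
print(softmax_with_temperature(logits, 0.5))  # sharper: top token dominates
print(softmax_with_temperature(logits, 2.0))  # flatter: choices more even
```

Low temperature concentrates probability on the top-scoring token, while high temperature spreads it across more candidates.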
Context Windows And Memory
What Is A Context Window?
A context window is the maximum number of tokens a model can consider at once, counting both the prompt and the generated output.
Examples:
- 8K tokens
- 32K tokens
- 128K tokens
- 1M tokens
Short Term Memory
The model only remembers what exists inside the current context.
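One common consequence is that applications must trim conversation history to fit the window. Here is a minimal sketch that keeps the most recent messages within a token budget. It assumes a whitespace word count as a crude stand-in for a real tokenizer, and `fit_to_context` is a hypothetical helper name.

```python
def count_tokens(text):
    # Crude assumption: one token per whitespace-separated word
    return len(text.split())


def fit_to_context(messages, max_tokens):
    """Keep the newest messages whose combined size fits the budget."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break  # older messages fall outside the window
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["my name is Ada", "I like Python", "what is my name"]
print(fit_to_context(history, max_tokens=7))
```

With a small enough budget, the oldest message is dropped, and the model can no longer see it.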
Long Term Memory
Persistent memory is usually implemented externally using:
- Databases
- RAG systems
- Conversation history
- Memory layers
Most LLMs do not naturally remember past conversations.
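External memory can be sketched as a store that lives outside the model and is injected into the prompt on each request. The `MemoryStore` class below is a hypothetical in-memory example. A production system would more likely use a database or a retrieval layer.

```python
from collections import defaultdict


class MemoryStore:
    """Toy long-term memory: facts saved per user, outside the model."""

    def __init__(self):
        self._facts = defaultdict(list)

    def remember(self, user_id, fact):
        self._facts[user_id].append(fact)

    def build_prompt(self, user_id, question):
        # Stored facts are pasted into the context on every request
        facts = self._facts.get(user_id, [])
        memory = "\n".join(f"- {fact}" for fact in facts) or "- (nothing stored)"
        return f"Known about this user:\n{memory}\n\nUser: {question}"


store = MemoryStore()
store.remember("user-1", "Prefers Python examples")
print(store.build_prompt("user-1", "How do I read a file?"))
```

The model itself is unchanged. It only appears to remember because the application re-supplies the facts inside the context window.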
This section covers the basics. If you want to go deeper into how teams actually build memory in production (layered architectures, vector retrieval, compression tradeoffs, agent state, and what to ask in an AI engineering interview), see Memory Management in LLMs: How AI Actually Remembers Things.
Why LLMs Hallucinate
Prediction Is Not Truth
One of the biggest misconceptions about LLMs is that they know facts.
They do not.
They predict statistically likely text.
This distinction matters because the most likely text is not always the true text.
Why Hallucinations Happen
Hallucinations occur when:
- Information is missing
- Context is insufficient
- Training data contains ambiguity
The model generates a plausible answer even if it is incorrect.
Retrieval Augmented Generation (RAG)
What Is RAG?
RAG stands for Retrieval Augmented Generation.
Instead of relying only on what the model learned during training, a RAG system follows three steps:
- Retrieve information
- Add it to the prompt
- Generate an answer
Document Indexing
Typically, documents are split into chunks, each chunk is converted to an embedding, and the embeddings are stored in a searchable index.
Query Time Retrieval
At query time, the user's question is embedded the same way, the most similar chunks are retrieved from the index, and those chunks are added to the prompt.
This approach powers many production AI applications today.
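The indexing and query-time steps can be sketched in a few lines. This example assumes a bag-of-words vector as a stand-in for a real embedding model, and an in-memory list as a stand-in for a vector database. All function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    # Stand-in for an embedding model: word counts instead of dense vectors
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(documents):
    # Indexing: embed every document once, ahead of time
    return [(doc, embed(doc)) for doc in documents]


def retrieve(index, query, k=2):
    # Query time: embed the question and rank documents by similarity
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(query, passages):
    # Augmentation: retrieved text is placed in the prompt before the question
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


docs = [
    "refunds are processed within five days",
    "our office is closed on sunday",
    "shipping takes two weeks",
]
index = build_index(docs)
question = "when are refunds processed"
print(build_prompt(question, retrieve(index, question, k=1)))
```

The resulting prompt is what gets sent to the model, so the answer can draw on the retrieved text rather than on training data alone.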
Improving Models After Training
Fine Tuning
Fine-tuning continues training a pretrained model on a smaller, task-specific dataset. Organizations often use it to specialize models for specific domains.
Examples:
- Healthcare
- Legal
- Finance
- Customer Support
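The core idea of fine-tuning, starting from already-trained weights and nudging them with new data, can be illustrated with a one-parameter toy model. This is only a sketch: the "model" is `y = weight * x`, and the starting weight and domain data are made up.

```python
def fine_tune(weight, data, learning_rate=0.01, epochs=50):
    """Continue gradient descent from an existing weight on new data."""
    for _ in range(epochs):
        for x, y in data:
            error = weight * x - y
            # Gradient of squared error (weight * x - y) ** 2 w.r.t. weight
            weight -= learning_rate * 2 * error * x
    return weight


pretrained_weight = 2.0  # hypothetical result of earlier, general training
domain_data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # domain follows y = 3x
print(fine_tune(pretrained_weight, domain_data))
```

The weight starts at its pretrained value and moves toward the value that fits the domain data. Real fine-tuning does the same across millions or billions of parameters.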
RLHF
RLHF stands for Reinforcement Learning From Human Feedback.
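Humans rank model outputs, and those rankings are used to train a reward model that scores responses.
The LLM is then optimized to produce responses the reward model scores highly.
Through this process, the model learns: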
- Helpfulness
- Safety
- Tone
- Conversational quality
This is one reason ChatGPT feels natural compared to earlier models.
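As a small illustration of how human rankings become training signal, the sketch below expands one ranked list of responses into (preferred, rejected) pairs, a form commonly used when training a reward model. The function name and sample responses are hypothetical.

```python
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() preserves input order, so the first item of each
    # pair is always ranked higher than the second
    return list(combinations(ranked_responses, 2))


ranked = ["clear and polite answer", "correct but curt answer", "off-topic answer"]
for preferred, rejected in ranking_to_pairs(ranked):
    print(f"prefer {preferred!r} over {rejected!r}")
```

Each pair tells the reward model which of two responses a human liked more.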
What's Next
You now understand the gap between training and production use: models predict tokens at inference time, forget everything outside the context window, and often need RAG or fine-tuning to be reliable in real applications.
In Part 3, we explore whether LLMs truly reason, where architectures are heading (mixture of experts, multimodal models), and how LLMs become AI agents that plan, use tools, and execute tasks.