Published on 2026-05-29

How LLMs Actually Work, Part 2: Inference, Memory, and RAG

The second part of a practical deep dive into LLMs - parameters, inference, context windows, hallucinations, retrieval augmented generation, and fine-tuning.

Artificial Intelligence, Large Language Models, Generative AI, Machine Learning, Software Engineering, RAG

Introduction

In Part 1, we covered the core mechanics of LLMs: next-token prediction, tokenization, embeddings, transformers, attention, and training.

Training teaches a model patterns from massive datasets. But when you send a prompt to ChatGPT or call an API, you are doing something different - inference.

This post covers what happens at runtime and how engineers work around the model's built-in limits:

Parameters and what they store
The inference pipeline and temperature
Context windows and memory
Why models hallucinate
RAG for grounding answers in real data
Fine-tuning and RLHF for specialization

Series overview:

Part 1: Tokens and Transformers
Part 2 (this post): Inference, Memory, and RAG
Part 3: Agents and the Future of AI

Parameters And Knowledge

What Are Parameters?

Parameters are the learned weights inside the model.

Examples:

GPT-2: 1.5 billion parameters
GPT-3: 175 billion parameters

Parameters store the patterns learned during training.
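To get a feel for what these counts mean in practice, here is a back-of-the-envelope sketch that estimates how much memory the weights alone would need. The function name `weight_memory_gb` and the choice of 2 bytes per parameter (16-bit floats) are assumptions for illustration; real deployments vary with precision and quantization.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Estimate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Parameter counts from the examples above
models = {"GPT-2": 1.5e9, "GPT-3": 175e9}

for name, params in models.items():
    # Assumes 16-bit weights; activations and caches are not counted
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB at 16-bit precision")
```

Under these assumptions the estimate is roughly 3 GB for GPT-2 and 350 GB for GPT-3, which is why the largest models are served across multiple accelerators.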

Bigger Does Not Always Mean Better

Larger models generally perform better, but parameter count is not the only lever. The following can often matter more than simply adding parameters:

Better training data
Better architecture
Better alignment

What Happens When You Ask ChatGPT A Question?

Training Vs Inference

Training adjusts the model's parameters.

Inference uses those fixed parameters to generate output for a user's prompt.

Inference Pipeline

Inference tokenizes the prompt once, then repeats a loop: run the transformer, sample the next token, and append it to the sequence, until the response is complete.
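The loop can be sketched in a few lines. This is a toy, not a real model: `TOY_MODEL` is a hypothetical lookup table standing in for the transformer, `tokenize` just splits on whitespace, and `<eos>` is an assumed end-of-sequence marker.

```python
import random

# Hypothetical stand-in for a transformer: maps the last token to
# a probability distribution over possible next tokens.
TOY_MODEL = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "<eos>": 0.1},
    "dog": {"sat": 0.7, "<eos>": 0.3},
    "sat": {"<eos>": 1.0},
}


def tokenize(text: str) -> list[str]:
    """Toy tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def generate(prompt: str, max_new_tokens: int = 10, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    tokens = tokenize(prompt)                      # 1. tokenize the prompt once
    for _ in range(max_new_tokens):
        last = tokens[-1] if tokens else ""
        probs = TOY_MODEL.get(last, {"<eos>": 1.0})   # 2. run the "model"
        next_token = rng.choices(
            list(probs), weights=list(probs.values())
        )[0]                                       # 3. sample the next token
        if next_token == "<eos>":                  # stop when the model signals the end
            break
        tokens.append(next_token)                  # 4. append and repeat
    return tokens


print(generate("The"))
```

A real model conditions on the whole sequence rather than only the last token, but the control flow is the same: one token per pass, fed back in as input.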

Temperature

Temperature controls randomness.

Low temperature:

Predictable
Stable
More consistent across runs (not a guarantee of factual accuracy)

High temperature:

More creative
More diverse
Less predictable
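One common way temperature is applied is by dividing the model's raw scores (logits) by the temperature before the softmax. The sketch below assumes that formulation; the logits are made-up numbers for three candidate tokens.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)                         # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature almost all the probability lands on the top-scoring token; at high temperature the distribution flattens, so sampling picks lower-ranked tokens more often.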

Context Windows And Memory

What Is A Context Window?

A context window is the maximum number of tokens a model can consider at once, counting both the prompt and the text it generates.

Examples:

8K tokens
32K tokens
128K tokens
1M tokens

Short Term Memory

The current conversation text is held in the model's fixed context window, which acts as temporary memory.

The model only remembers what exists inside the current context.
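A minimal sketch of what this means for a chat application: when the history outgrows the window, something has to be dropped. The example below keeps the most recent messages that fit. Both `count_tokens` (one token per word) and `fit_to_context` are hypothetical helpers; a real system would use the model's tokenizer and often a smarter policy such as summarization.

```python
def count_tokens(text: str) -> int:
    """Crude assumption: one token per whitespace-separated word."""
    return len(text.split())


def fit_to_context(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages that fit within max_tokens."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):      # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break                           # older messages fall out of "memory"
        kept.append(message)
        used += cost
    return list(reversed(kept))             # restore chronological order


history = [
    "My name is Ada.",
    "I am planning a trip to Lisbon.",
    "What should I pack?",
]
print(fit_to_context(history, max_tokens=12))
```

With a 12-token budget the first message no longer fits, so the model would never see the user's name when answering the latest question.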

Long Term Memory

Persistent memory is usually implemented externally using:

Databases
RAG systems
Conversation history
Memory layers

Most LLMs do not naturally remember past conversations.
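The external approach can be sketched as follows: facts are saved outside the model and pasted back into the prompt in a later session. `MemoryStore` is a hypothetical in-memory class standing in for a real database, and the prompt layout is only one possible choice.

```python
class MemoryStore:
    """Hypothetical in-memory stand-in for a database of remembered facts."""

    def __init__(self) -> None:
        self._facts: dict[str, list[str]] = {}

    def remember(self, user_id: str, fact: str) -> None:
        self._facts.setdefault(user_id, []).append(fact)

    def recall(self, user_id: str) -> list[str]:
        return list(self._facts.get(user_id, []))


def build_prompt(store: MemoryStore, user_id: str, question: str) -> str:
    """Inject stored facts into the prompt so the model can 'remember' them."""
    facts = store.recall(user_id)
    memory_block = "\n".join(f"- {fact}" for fact in facts) or "- (nothing stored)"
    return f"Known facts about the user:\n{memory_block}\n\nUser: {question}"


store = MemoryStore()
store.remember("user-1", "Name is Ada")                      # saved in an earlier session
print(build_prompt(store, "user-1", "What is my name?"))     # a later, fresh session
```

The model itself is unchanged between sessions; the application does the remembering and spends context-window tokens to pass it along.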

This section covers the basics. If you want to go deeper into how teams actually build memory in production (layered architectures, vector retrieval, compression tradeoffs, agent state, and what to ask in an AI engineering interview), see Memory Management in LLMs: How AI Actually Remembers Things.

Why LLMs Hallucinate

Prediction Is Not Truth

One of the biggest misconceptions about LLMs is that they know facts.

They do not.

They predict statistically likely text.

The distinction matters because fluent, confident text is not evidence of a correct answer.

Why Hallucinations Happen

Hallucinations occur when:

Information is missing
Context is insufficient
Training data contains ambiguity

The model generates a plausible answer even if it is incorrect.
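A toy illustration of why this happens: decoding always returns some token, whether the probability distribution is sharply peaked or nearly flat. The candidate answers and probabilities below are made up, and `greedy_pick` and `entropy_bits` are hypothetical helpers, not how any particular model reports uncertainty.

```python
import math


def entropy_bits(probs: list[float]) -> float:
    """Shannon entropy in bits: higher means the distribution is flatter."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def greedy_pick(vocab: list[str], probs: list[float]) -> str:
    """Return the highest-probability token, however small its lead."""
    return vocab[probs.index(max(probs))]


vocab = ["1887", "1889", "1891", "1893"]   # made-up candidate answers
confident = [0.02, 0.94, 0.02, 0.02]       # the model has seen this fact often
unsure = [0.24, 0.26, 0.25, 0.25]          # the model is essentially guessing

for name, probs in (("confident", confident), ("unsure", unsure)):
    print(name, greedy_pick(vocab, probs), round(entropy_bits(probs), 2))
```

Both cases produce the same answer in the same fluent form. Nothing in the output text tells the reader that the second one was close to a coin toss.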

Retrieval Augmented Generation (RAG)

What Is RAG?

RAG stands for Retrieval Augmented Generation.

Instead of relying only on training data, a RAG system does three things:

Retrieve relevant documents
Add them to the prompt
Generate an answer from the combined prompt
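The middle step, adding retrieved text to the prompt, is plain string assembly. The sketch below assumes the documents have already been retrieved; `build_rag_prompt`, the instruction wording, and the sample documents are all made up for illustration.

```python
def build_rag_prompt(question: str, documents: list[str]) -> str:
    """Combine retrieved documents and the user's question into one prompt."""
    context = "\n\n".join(
        f"[{i}] {doc}" for i, doc in enumerate(documents, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is not enough, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up documents standing in for real retrieval results
retrieved = [
    "Refunds are available within 30 days of purchase.",
    "Refunds are issued to the original payment method.",
]
print(build_rag_prompt("How long do I have to request a refund?", retrieved))
```

Numbering the documents is a design choice: it lets the prompt ask the model to cite which snippet supports its answer.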

Document Indexing

Indexing happens offline: raw documents pass through an embedding model, and the resulting vectors are stored in a vector database.

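Query Time Retrieval

At query time, the user's question is embedded with the same embedding model, the most similar vectors are looked up in the vector database, and the matching documents are added to the prompt.

This approach powers many production AI applications today.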
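Indexing and query-time retrieval can be sketched together in a few lines. Everything here is a simplification: `embed` uses word counts instead of a real embedding model, a Python list stands in for the vector database, and the documents are made up.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Offline: embed each document once and store the vectors
documents = [
    "Refunds are available within 30 days of purchase.",
    "Shipping takes five business days.",
    "Support is open on weekdays.",
]
index = [(doc, embed(doc)) for doc in documents]


def retrieve(query: str, k: int = 1) -> list[str]:
    """Query time: embed the question and return the k most similar documents."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


print(retrieve("How many days does shipping take?"))
```

The query shares two words with the shipping document and only one with the refund document, so the shipping document ranks first. Real embedding models capture meaning rather than exact word overlap, but the ranking step works the same way.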

Improving Models After Training

Fine Tuning

Organizations often specialize models for specific domains.

Examples:

Healthcare
Legal
Finance
Customer Support

Fine-tuning updates a pretrained base model on domain-specific data to produce a specialized model.
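The idea of starting from pretrained weights rather than from scratch can be shown with a deliberately tiny example. The "model" below has a single parameter and the datasets are made up; it is a stand-in for the concept, not a recipe for fine-tuning a real LLM.

```python
def train(weight: float, data: list[tuple[float, float]], lr: float, epochs: int) -> float:
    """Gradient descent on mean squared error for the model y = weight * x."""
    for _ in range(epochs):
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight


general_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # general pattern: y = 2x
domain_data = [(1.0, 3.0), (2.0, 6.0)]                 # domain pattern: y = 3x

base = train(0.0, general_data, lr=0.05, epochs=200)    # "pretraining" from scratch
tuned = train(base, domain_data, lr=0.05, epochs=200)   # fine-tuning starts from base
print(round(base, 3), round(tuned, 3))
```

The same training routine runs twice. The only difference is the starting weights and the data, which is the essence of fine-tuning: the specialized model is the base model nudged toward the domain.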

RLHF

RLHF stands for Reinforcement Learning From Human Feedback.

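Humans rank model outputs. A reward model is trained on those rankings, and the LLM is then optimized to produce outputs the reward model scores highly.

Through this process the model learns: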

Helpfulness
Safety
Tone
Conversational quality

This is one reason ChatGPT feels natural compared to earlier models.
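One common way to turn human rankings into a training signal is a pairwise (Bradley-Terry style) loss for a reward model: the loss is small when the output humans preferred receives the higher score. The sketch below assumes that formulation; the reward values are made up, and whether a given product uses exactly this loss is not something this post can confirm.

```python
import math


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), written as log(1 + exp(-diff))."""
    diff = reward_chosen - reward_rejected
    return math.log1p(math.exp(-diff))


# Reward model agrees with the human ranking: low loss
print(round(pairwise_loss(2.0, 0.0), 3))

# Reward model disagrees with the human ranking: high loss
print(round(pairwise_loss(0.0, 2.0), 3))
```

Minimizing this loss over many ranked pairs pushes the reward model to score preferred outputs higher, and that reward model then guides the reinforcement learning step.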

What's Next

You now understand the gap between training and production use: models predict tokens at inference time, forget everything outside the context window, and often need RAG or fine-tuning to be reliable in real applications.

In Part 3, we explore whether LLMs truly reason, where architectures are heading (mixture of experts, multimodal models), and how LLMs become AI agents that plan, use tools, and execute tasks.
