Published on 2026-05-29

How LLMs Actually Work, Part 2: Inference, Memory, and RAG

The second part of a practical deep dive into LLMs - parameters, inference, context windows, hallucinations, retrieval augmented generation, and fine-tuning.

Tags: Artificial Intelligence, Large Language Models, Generative AI, Machine Learning, Software Engineering, RAG

Introduction

In Part 1, we covered the core mechanics of LLMs: next-token prediction, tokenization, embeddings, transformers, attention, and training.

Training teaches a model patterns from massive datasets. But when you send a prompt to ChatGPT or call an API, you are doing something different - inference.

This post covers what happens at runtime and how engineers work around the model's built-in limits:

  • Parameters and what they store
  • The inference pipeline and temperature
  • Context windows and memory
  • Why models hallucinate
  • RAG for grounding answers in real data
  • Fine-tuning and RLHF for specialization

Series overview:

  1. Part 1: Tokens and Transformers
  2. Part 2 (this post): Inference, Memory, and RAG
  3. Part 3: Agents and the Future of AI

Parameters And Knowledge

What Are Parameters?

Parameters are the learned weights inside the model.

Examples:

  • GPT-2: 1.5 billion parameters
  • GPT-3: 175 billion parameters

Parameters store the patterns learned during training.
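To make those counts concrete, here is a rough sketch of how much memory the weights alone would occupy. The bytes-per-parameter table and the function name are illustrative assumptions, and the figure ignores activations, caches, and any runtime overhead.

```python
# Rough memory needed just to hold the weights (no activations, no caches).
# Assumed storage sizes per parameter for three common numeric formats.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(num_params: int, dtype: str = "fp16") -> float:
    """Return the approximate weight size in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for name, params in [("GPT-2", 1_500_000_000), ("GPT-3", 175_000_000_000)]:
    print(f"{name}: {weight_memory_gb(params):.0f} GB in fp16")
```

At 2 bytes per parameter, that works out to about 3 GB for GPT-2 and about 350 GB for GPT-3, which is why the largest models are split across multiple accelerators.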

Bigger Does Not Always Mean Better

Larger models generally perform better, but parameter count is not the only lever.

These factors can often matter more than simply increasing parameter count:

  • Better training data
  • Better architecture
  • Better alignment


What Happens When You Ask ChatGPT A Question?

Training Vs Inference

Training adjusts the model's parameters.

Inference runs the trained model, with its parameters frozen, to produce output for a user's prompt.

Inference Pipeline

Inference tokenizes the prompt once, then repeatedly runs the transformer, samples the next token, and appends it to the sequence until the response is complete.
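The loop can be sketched in a few lines. Everything here is a hypothetical stand-in: the whitespace tokenizer, the lookup-table "model", and the `<eos>` stop token are toy assumptions, not how a production system is built.

```python
# Hypothetical stand-ins: a whitespace tokenizer and a lookup-table "model".
NEXT_TOKEN = {"the": "cat", "cat": "sat", "sat": "down", "down": "<eos>"}

def tokenize(text: str) -> list[str]:
    return text.lower().split()

def toy_model(tokens: list[str]) -> str:
    """Predict the next token from the last token only (a real model uses all of them)."""
    return NEXT_TOKEN.get(tokens[-1], "<eos>")

def generate(prompt: str, max_new_tokens: int = 10) -> str:
    tokens = tokenize(prompt)              # tokenize the prompt once
    for _ in range(max_new_tokens):
        next_token = toy_model(tokens)     # one pass over the current sequence
        if next_token == "<eos>":          # a stop token ends the response
            break
        tokens.append(next_token)          # append it and feed the sequence back in
    return " ".join(tokens)

print(generate("The"))  # the cat sat down
```

The key point is the feedback: each generated token becomes part of the input for the next step, which is why long responses take longer to produce.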

Temperature

Temperature controls randomness.

Low temperature:

  • Predictable
  • Stable
  • Better suited to factual tasks (less random, but not guaranteed correct)

High temperature:

  • More creative
  • More diverse
  • Less predictable
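One common way to apply temperature is to divide the model's raw scores (logits) by it before the softmax. The sketch below assumes that formulation; the logits are made-up numbers for three candidate tokens.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                          # subtract the max to avoid overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature almost all probability lands on the top-scoring token, so sampling becomes nearly deterministic. At high temperature the distribution flattens and less likely tokens get picked more often.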

Context Windows And Memory

What Is A Context Window?

A context window is the maximum number of tokens, prompt and generated output combined, that a model can consider at once.

Examples:

  • 8K tokens
  • 32K tokens
  • 128K tokens
  • 1M tokens

Short Term Memory

Short-term memory is the current conversation text held in the model's fixed context window.

The model only remembers what exists inside the current context.
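A minimal sketch of why older messages get "forgotten": when the conversation outgrows the window, something has to be dropped. The word-count tokenizer and the drop-oldest-first policy are simplifying assumptions; real applications use the model's tokenizer and often summarize instead of truncating.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())

def fit_to_window(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages whose combined size fits the budget."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):      # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break                           # everything older is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))             # restore chronological order

history = ["my name is Ada", "what is a token", "explain attention briefly"]
print(fit_to_window(history, max_tokens=7))
# ['what is a token', 'explain attention briefly']
```

With a budget of 7 tokens, the first message no longer fits, so the model would never see the user's name again unless it is stored somewhere else.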

Long Term Memory

Persistent memory is usually implemented externally using:

  • Databases
  • RAG systems
  • Conversation history
  • Memory layers

Most LLMs do not naturally remember past conversations.
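As a rough illustration of external memory, the sketch below stores facts in a database and injects them into the prompt of a later conversation. The table layout, function names, and prompt wording are all hypothetical; only the standard-library `sqlite3` calls are real.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path here would survive restarts
conn.execute("CREATE TABLE memory (user_id TEXT, fact TEXT)")

def remember(user_id: str, fact: str) -> None:
    """Persist one fact about a user outside the model."""
    conn.execute("INSERT INTO memory VALUES (?, ?)", (user_id, fact))
    conn.commit()

def build_prompt(user_id: str, question: str) -> str:
    """Load stored facts and place them in the prompt ahead of the question."""
    rows = conn.execute(
        "SELECT fact FROM memory WHERE user_id = ? ORDER BY rowid", (user_id,)
    ).fetchall()
    facts = "\n".join(f"- {fact}" for (fact,) in rows)
    return f"Known facts about the user:\n{facts}\n\nQuestion: {question}"

remember("u1", "Prefers Python examples")
print(build_prompt("u1", "How do I read a file?"))
```

The model itself is unchanged. It appears to remember because the application re-supplies the stored facts inside the context window on every request.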

This section covers the basics. If you want to go deeper into how teams actually build memory in production (layered architectures, vector retrieval, compression tradeoffs, agent state, and what to ask in an AI engineering interview), see Memory Management in LLMs: How AI Actually Remembers Things.


Why LLMs Hallucinate

Prediction Is Not Truth

One of the biggest misconceptions about LLMs is that they know facts.

They do not.

They predict statistically likely text.

This distinction is extremely important.

Why Hallucinations Happen

Hallucinations occur when:

  • Information is missing
  • Context is insufficient
  • Training data contains ambiguity

The model generates a plausible answer even if it is incorrect.
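A toy example can show the mechanism. The bigram counter below is far simpler than an LLM, and the three-sentence corpus is invented, but it fails in the same way: asked about something it has never seen, it still returns the statistically likely continuation.

```python
from collections import Counter, defaultdict

corpus = [
    "the capital of france is paris",
    "the capital of italy is rome",
    "the capital of france is paris",
]

# Count which word follows each word (a bigram model, not a transformer).
followers: dict[str, Counter] = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1

def predict_next(prompt: str) -> str:
    """Return the most frequent follower of the prompt's last word."""
    last = prompt.split()[-1]
    return followers[last].most_common(1)[0][0]

print(predict_next("the capital of atlantis is"))  # paris
```

Nothing in the model signals that "atlantis" was never in the data. It produces a fluent, plausible, and wrong answer, which is what a hallucination looks like.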


Retrieval Augmented Generation (RAG)

What Is RAG?

RAG stands for Retrieval Augmented Generation.

Instead of relying only on training data:

  1. Retrieve information
  2. Add it to the prompt
  3. Generate an answer

In short, RAG retrieves relevant documents and adds them to the prompt, and the LLM then generates an answer grounded in them.
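Step 2, adding retrieved text to the prompt, is mostly string assembly. The sketch below is one possible layout; the instruction wording, the numbering scheme, and the sample chunks are assumptions rather than a standard format.

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Number the retrieved chunks and place them ahead of the question."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the context below. "
        "If the answer is not there, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Made-up chunks standing in for whatever the retrieval step returned.
retrieved = ["Refunds are accepted within 30 days.", "Shipping takes 3 to 5 days."]
print(build_rag_prompt("What is the refund window?", retrieved))
```

Numbering the chunks makes it possible to ask the model to cite which one it used, which helps when checking answers.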

Document Indexing

Indexing happens offline: raw documents are split into chunks, each chunk is passed through an embedding model, and the resulting vectors are stored in a vector database.
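The indexing path can be sketched with toy components. The word-count "embedding" and the in-memory list are stand-ins chosen so the example runs without dependencies; a real pipeline would call a trained embedding model and write to an actual vector database.

```python
import math
from collections import Counter

def chunk_text(text: str, chunk_size: int = 8) -> list[str]:
    """Split text into chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def embed(text: str, vocab: list[str]) -> list[float]:
    """Toy embedding: a unit-length word-count vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    vector = [float(counts[word]) for word in vocab]
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]

def build_index(documents: list[str], chunk_size: int = 8) -> tuple[list[str], list[dict]]:
    """Chunk every document, embed each chunk, and keep text and vector together."""
    chunks = [c for doc in documents for c in chunk_text(doc, chunk_size)]
    vocab = sorted({w for c in chunks for w in c.lower().split()})
    index = [{"text": c, "vector": embed(c, vocab)} for c in chunks]
    return vocab, index

vocab, index = build_index(["Refunds are accepted within 30 days of purchase with a receipt."])
print(len(index), "chunks,", len(vocab), "dimensions")
```

Storing the original text next to each vector matters: the search runs over vectors, but it is the text that gets added to the prompt.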

Query Time Retrieval

At query time, the question is embedded, and a similarity search returns the most relevant text chunks.

This approach powers many production AI applications today.
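Query-time retrieval can be sketched as a brute-force cosine similarity search. The three-dimensional vectors and chunk labels are made up for illustration; real embeddings have hundreds or thousands of dimensions, and vector databases typically use approximate search rather than scoring every entry.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vector: list[float], index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Score every stored chunk against the query and return the k best texts."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Made-up vectors standing in for embedded chunks.
index = [
    ("Refund policy", [0.9, 0.1, 0.0]),
    ("Shipping times", [0.1, 0.9, 0.0]),
    ("Office hours", [0.0, 0.1, 0.9]),
]
query_vector = [0.8, 0.2, 0.0]  # pretend this came from embedding "how do refunds work?"
print(top_k(query_vector, index, k=1))  # ['Refund policy']
```

The question and the chunks must be embedded by the same model, otherwise their vectors are not comparable.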


Improving Models After Training

Fine Tuning

Organizations often specialize models for specific domains.

Examples:

  • Healthcare
  • Legal
  • Finance
  • Customer Support

Fine-tuning updates a pretrained base model on domain-specific data to produce a specialized model.
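The core idea, continuing training from existing weights rather than starting over, can be shown with a one-parameter toy model. The "pretrained" weight, the domain data, and the hyperparameters are invented; real fine-tuning updates billions of weights with the same kind of gradient step.

```python
def mse(w: float, data: list[tuple[float, float]]) -> float:
    """Mean squared error of the model y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def fine_tune(w: float, data: list[tuple[float, float]],
              learning_rate: float = 0.05, steps: int = 50) -> float:
    """Continue gradient descent from an existing weight instead of a random one."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * grad
    return w

pretrained_w = 1.0                                   # pretend this came from pretraining
domain_data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]   # the domain follows y = 3x

tuned_w = fine_tune(pretrained_w, domain_data)
print(mse(pretrained_w, domain_data), "->", mse(tuned_w, domain_data))
```

The error on the domain data drops sharply after fine-tuning. The tradeoff is that the weight has moved away from its pretrained value, which in real models can mean losing some general ability.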

RLHF

RLHF stands for Reinforcement Learning From Human Feedback.

Humans rank model outputs, and those rankings are used as a training signal.

From that feedback, the model learns:

  • Helpfulness
  • Safety
  • Tone
  • Conversational quality

This is one reason ChatGPT feels natural compared to earlier models.
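One common way to turn human rankings into a training signal is to fit a reward model on pairs of responses, using a pairwise preference loss. The sketch below covers only that step, with a hypothetical two-feature linear reward model. Full RLHF then optimizes the LLM against the reward model with reinforcement learning, which is not shown.

```python
import math

def reward(weights: list[float], features: list[float]) -> float:
    """Linear reward model: a weighted sum of response features."""
    return sum(w * f for w, f in zip(weights, features))

def preference_loss(weights: list[float], chosen: list[float], rejected: list[float]) -> float:
    """Negative log-likelihood that the human-preferred response scores higher."""
    margin = reward(weights, chosen) - reward(weights, rejected)
    return math.log(1.0 + math.exp(-margin))

def train_reward_model(pairs, steps: int = 200, learning_rate: float = 0.1) -> list[float]:
    weights = [0.0, 0.0]
    for _ in range(steps):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            scale = -1.0 / (1.0 + math.exp(margin))   # derivative of the loss w.r.t. margin
            for i in range(len(weights)):
                weights[i] -= learning_rate * scale * (chosen[i] - rejected[i])
    return weights

# Hypothetical features per response: [helpfulness signal, rudeness signal].
# Each pair is (response the human preferred, response the human rejected).
pairs = [([0.9, 0.1], [0.4, 0.8]), ([0.7, 0.0], [0.6, 0.9])]
weights = train_reward_model(pairs)
print(weights)
```

After training, the first weight is positive and the second is negative: the reward model has learned from rankings alone that helpfulness is preferred and rudeness is not.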


What's Next

You now understand the gap between training and production use: models predict tokens at inference time, forget everything outside the context window, and often need RAG or fine-tuning to be reliable in real applications.

In Part 3, we explore whether LLMs truly reason, where architectures are heading (mixture of experts, multimodal models), and how LLMs become AI agents that plan, use tools, and execute tasks.