Published on 2026-06-18

Memory Management in LLMs: How AI Actually Remembers Things

Context windows, short-term state, long-term retrieval, RAG, and the memory architectures that separate demo chatbots from production AI systems.

LLMsAI EngineeringGenerative AIMemory SystemsRAGVector DatabasesLangChainLangGraphAI Agents

Introduction

The most common misconception I hear about Large Language Models is also the most dangerous one for anyone building AI products:

"ChatGPT remembers everything."

It does not. Not in any human sense of the word.

What feels like memory is usually a stack of engineering tricks layered on top of a stateless prediction engine. The model reads a chunk of text called the context window, predicts the next token, returns a response, and then the request is over. Unless your application stores something externally, that conversation effectively never happened.

I learned this the hard way while building AI systems that were supposed to feel personal: interview practice platforms, coding assistants, support agents. Users expect continuity. They expect the system to remember their name, their project, their preferences, and the thread of a conversation from three days ago. Raw LLMs give you none of that by default.

After a year of wrestling with this in production, I am convinced that memory is one of the hardest unsolved problems in AI engineering. Not model size. Not prompt wording. Memory. Because intelligence without continuity feels broken, and continuity without good retrieval becomes expensive chaos.

This post is my attempt to explain how memory actually works in LLM systems: what lives inside the context window, what gets persisted outside it, and the tradeoffs you have to make when you ship something real.

If you have read Part 2 of my LLM series, some of this will feel familiar. That post covers context windows and RAG at a high level. Here I want to go deeper into the engineering: how teams actually design memory layers, why bigger context windows do not solve the problem, and what separates a demo chatbot from a system that holds up under long-running agent workflows.


LLMs Do Not Remember. They Read.

Before we talk about solutions, we need to be precise about the constraint.

An LLM at inference time is a next-token prediction function. You pass in tokens. It returns a probability distribution over the vocabulary. You sample one token, append it, and repeat until you hit a stop condition. The model weights are frozen. Nothing from your conversation gets written back into those weights during a chat session.

That means every "memory" behavior you have ever seen in a consumer AI product was built outside the model:

  • PostgreSQL rows holding user profiles
  • Vector databases storing embedded conversation chunks
  • Redis keys tracking active agent state
  • Summaries generated by a separate LLM call and cached for reuse
  • LangGraph checkpoints persisting workflow state between tool calls

The model is the reasoning engine. Your application is the memory system. Confusing those two layers is how you end up with a product that works in a 10-message demo and falls apart at message 200.


The Context Window Is Working Memory, Not Storage

The context window is the maximum number of tokens the model can attend to in a single forward pass. Current production models sit anywhere from 8K to 128K tokens, with some experimental models pushing toward 1M+. But regardless of the number, the mental model is the same: working memory, not a hard drive.

Everything the model can "know" during a request must fit inside that window:

  • System prompt and instructions
  • Retrieved documents from RAG
  • Tool outputs and agent scratchpads
  • Recent chat history
  • The current user message

Current conversation text held in the model's fixed context window as temporary memory

Once you exceed the limit, something has to go. Most APIs silently truncate from the oldest messages. Some let you configure what gets dropped. Either way, information that falls out of the window is gone from the model's perspective.

Why Bigger Windows Do Not Fix Memory

It is tempting to think that 128K or 1M token windows make memory engineering obsolete. They do not. They shift the problem.

Cost scales with context length. Transformer attention is roughly O(n²) in sequence length during prefill. A request with 100K tokens of context costs dramatically more than one with 2K tokens, even if the generated output is identical.

Latency grows with context. Prefilling a long context before the first output token can add seconds of wait time. Users notice.

Retrieval quality degrades in long contexts. Research on the "lost in the middle" effect shows that models often miss critical information buried in the center of very long prompts, even when that information is technically present. More context can mean worse focus, not better.

Attention gets noisy. When you dump 500 pages into a prompt, the model has no guaranteed mechanism for prioritizing the one paragraph that matters. You are relying on softmax weights distributed across thousands of tokens. That is a fragile bet.

Bigger windows are useful. They give you more room for recent history and retrieved chunks. But they are not a memory architecture. They are a larger scratch pad.


Short-Term Memory: What Lives in the Active Context

In LLM systems, short-term memory usually means anything currently loaded into the context window or held in ephemeral runtime state.

That includes:

  • The last N turns of conversation
  • An agent's active plan and todo list
  • Tool call results from the current workflow step
  • A scratchpad the model uses for chain-of-thought reasoning

Here is a concrete example. You tell the model:

"My name is Parth. I am building an AI interview platform with FastAPI and LangGraph."

Five messages later you ask:

"What stack am I using?"

The model answers correctly because those earlier tokens are still in context. It did not "remember" you. It read your name and project details off the page, the same way you would if someone handed you a transcript.

The failure mode shows up when the conversation grows. A support agent handling a billing dispute over 40 messages will eventually truncate the message where the user explained their account ID. The model then asks for information the user already provided. From the user's perspective, the AI got dumber. From the engineer's perspective, the relevant tokens fell off the left edge of the window.

KV Cache: The Hidden Cost of Short-Term Memory

During autoregressive generation, models cache key-value pairs from previous tokens so they do not recompute attention over the entire prefix on every new token. This KV cache is what makes streaming responses feasible.

But the cache grows with context length. For a 70B parameter model serving concurrent users, KV cache memory can become the binding constraint on throughput, not GPU FLOPs. This is one reason production systems aggressively limit how much history they inject per request, even when the model supports much larger windows.

Short-term memory is not free. It has a direct line to your inference bill.


Why Sending Full Chat History Breaks in Production

The simplest memory strategy, and the one most prototypes use, is to append every message to an array and send the full array on each request.

This works until it does not.

Consider a coding assistant where a single session might include:

  • 200 user messages
  • 200 assistant replies
  • 50 tool outputs (file contents, linter errors, test results)
  • A system prompt and retrieved docs

You can hit 100K tokens in a long debugging session without trying. At that point you are paying premium inference prices, waiting several seconds for prefill, and still risking truncation of the message where the user said "do not refactor the auth module."

The counterintuitive lesson: too much memory can make the model less useful. LLMs do not have a built-in relevance filter. If you flood context with everything, the model may fixate on recent noise and miss the constraint you stated an hour ago.

This is why production systems treat memory as a retrieval problem, not a storage problem.


Long-Term Memory: Persistence Outside the Model

Long-term memory is anything that survives across requests, sessions, or days. The model never stores it internally. Your application does, and selectively injects it back into context when relevant.

Common storage backends:

| Memory type | Storage | Retrieval method | | -------------------- | ---------------------------------------- | -------------------------------- | | User preferences | Structured DB (Postgres, Redis) | Key lookup by user ID | | Conversation history | Object store or DB | Time-range query + summarization | | Semantic knowledge | Vector DB (Pinecone, Weaviate, pgvector) | Embedding similarity search | | Entity relationships | Knowledge graph | Graph traversal | | Agent workflow state | Checkpoint store (LangGraph, Temporal) | State ID lookup |

When a returning user opens your app, the flow looks like this:

  1. Load structured profile data (name, plan tier, preferences)
  2. Query vector store for semantically relevant past conversations
  3. Fetch any active agent state from a checkpoint
  4. Assemble a context bundle within your token budget
  5. Send the bundle plus the new user message to the LLM

The user experiences continuity. Under the hood, it is just careful prompt assembly.


Vector Search Made Long-Term Memory Practical

Before embeddings went mainstream, long-term memory in AI apps mostly meant keyword search or rigid SQL schemas. Both break down quickly when users describe things in natural language.

Vector search changed the game. Text gets converted into dense embedding vectors. Similar meaning maps to nearby points in high-dimensional space. At query time, you embed the user's message, run a similarity search, and retrieve the closest chunks.

Example:

Earlier in a conversation, the user said:

"I hate open offices. I need quiet to focus."

Weeks later they ask:

"Can you suggest a good place to work from this afternoon?"

No keyword overlap exists between those two utterances. But a good embedding model will place them close together, and retrieval will surface the preference about quiet environments.

RAG retrieves relevant documents, adds them to the prompt, then the LLM generates an answer

That feels like memory to the user. Under the hood it is:

  1. Chunk and embed text at write time
  2. Store vectors with metadata (user ID, timestamp, source)
  3. Embed the query at read time
  4. Retrieve top-k chunks above a similarity threshold
  5. Inject retrieved text into the prompt

The engineering details matter enormously: chunk size, overlap, metadata filtering, reranking, and hybrid search (vector + BM25) often matter more than the choice of embedding model.


What to Store: A Memory Taxonomy

Not everything deserves to be remembered forever. Unfiltered memory accumulation turns your vector store into a junk drawer.

Most production systems borrow a taxonomy from cognitive science:

Episodic memory captures specific events. "On March 12, the user debugged a failing WebSocket connection in their interview platform." Useful for recalling context about past sessions.

Semantic memory captures stable facts. "User prefers Python over JavaScript." "User's company is on the Pro plan." These change slowly and get stored in structured profiles.

Procedural memory captures how to do things. "When deploying this project, always run the migration script first." "This user wants unified diffs, not full file rewrites." Critical for agents that repeat workflows.

Preference memory captures tone and style. "User prefers concise answers." "User gets frustrated when you suggest libraries they have already rejected."

The write path is as important as the read path. Good systems do not store every message verbatim. They run an extraction step (often another LLM call) that decides: is this a durable fact, a ephemeral aside, or noise?

python
# Simplified memory write decision async def maybe_store_message(user_id: str, message: str, reply: str): extraction = await llm.extract( prompt=f""" Given this exchange, extract durable memories worth storing. Return JSON: {{"store": bool, "type": "semantic|episodic|preference", "content": str}} User: {message} Assistant: {reply} """ ) if extraction["store"]: await vector_store.upsert( user_id=user_id, content=extraction["content"], memory_type=extraction["type"], created_at=datetime.utcnow(), )

Getting this extraction step wrong is one of the most common sources of memory pollution I see in production systems.


Memory Compression and Its Failure Modes

Raw conversation logs do not scale. A 500-message thread might be 80K tokens. Storing and retrieving that verbatim on every request is wasteful.

The standard approach is summarization: periodically compress older turns into a dense summary and replace the raw messages in context.

Instead of 100 messages about a project, context might contain:

"User is building an AI interview platform (FastAPI, LangGraph, Next.js). Current blocker: WebSocket latency during voice transcription. Prefers async patterns. Deadline: end of month."

That summary might be 50 tokens instead of 15,000. The savings are obvious.

The risk is equally obvious: compression is lossy. Summaries drop nuance. They flatten contradictions. They erase the user's exact wording, which sometimes matters. A summary might say "user prefers PostgreSQL" when the user actually said "PostgreSQL for now, but we might switch to Supabase if the team grows."

Mature systems use tiered memory:

  • Hot context: last 10 to 20 messages, verbatim
  • Warm summary: compressed history from the current session
  • Cold storage: full raw logs in a database, retrieved only when a query specifically needs deep history

The art is deciding when to promote something from cold storage back into hot context.


RAG Is the Memory Layer for Most Production Apps

Retrieval Augmented Generation gets described as "document search for LLMs." That undersells it. In most production architectures I have seen, RAG is the memory subsystem.

The retrieval pipeline handles:

  • Company documentation and knowledge bases
  • Previous chat transcripts
  • CRM records and ticket history
  • Codebase indexes for coding agents
  • Slack threads and email archives

Offline indexing path from raw documents through an embedding model into vectors stored in a vector database

At query time the question is embedded and similarity search returns the most relevant text chunks

The hard part is not indexing documents. It is retrieval quality at query time: picking the right chunks, ranking them, deduplicating overlapping results, and fitting them into a token budget without crowding out recent conversation history.

I have seen teams spend weeks tuning embedding models when their real problem was chunk boundaries splitting a critical sentence across two vectors, neither of which retrieved well on their own.


Agents Need a Different Memory Model Than Chatbots

A simple Q&A chatbot can survive with conversation history and a vector store. An agent cannot.

Agents operate over time. They plan, call tools, observe results, revise plans, and recover from failures. A coding agent that forgets which file it edited, which API key failed, or what constraint the user stated three steps ago will loop endlessly or make destructive changes.

Frameworks like LangGraph treat memory as first-class state. A typical agent memory stack looks like:

Checkpoint state (short-term): current plan, pending tool calls, intermediate results. Persisted so the agent can resume after a crash or a long-running tool call.

Thread history (medium-term): recent messages and tool outputs for the current task.

Long-term store (cross-session): user preferences, past task outcomes, learned procedures.

python
# LangGraph-style state shape (simplified) class AgentState(TypedDict): messages: Annotated[list, add_messages] plan: str current_step: int tool_results: dict user_preferences: dict # loaded from long-term store at session start

Without this layered state, agents behave like goldfish. With it, they can handle multi-step workflows that span minutes or hours. The memory architecture is not optional. It is the foundation of agent reliability.


A Layered Memory Architecture That Works in Practice

After building several agent systems, the pattern I keep returning to has six layers. No single layer is sufficient on its own.

Layer 1: Immediate context. The current user message and the last few turns. Always verbatim. This is what makes the conversation feel responsive and coherent.

Layer 2: Session history. Recent messages beyond the immediate window, possibly truncated by a sliding window or token budget.

Layer 3: Session summary. A running compression of older turns within the current session. Updated every N messages or when context usage crosses a threshold.

Layer 4: Vector retrieval. Semantic search over embedded past conversations, documents, and extracted facts. Top-k results, reranked, deduplicated.

Layer 5: Structured profile. User preferences, account data, and stable facts from a relational database. Loaded by user ID, not by similarity.

Layer 6: External tools. Live lookups via APIs, search engines, or database queries when cached memory is insufficient or stale.

The assembly logic, which decides how many tokens to allocate to each layer, is where most of the engineering effort lives. A common starting budget for a 32K context model:

  • 2K tokens: system prompt and instructions
  • 4K tokens: immediate context (recent messages)
  • 2K tokens: session summary
  • 6K tokens: retrieved memories and documents
  • 1K tokens: structured profile
  • Remaining: reserved for model output and tool results

These numbers shift based on the use case. A document Q&A bot allocates more to retrieval. A personal assistant allocates more to profile and preference data.


The Retrieval Problem Is Harder Than Storage

Storage is a solved infrastructure problem. S3, Postgres, and Pinecone all work fine. The hard question is:

Given a user message, which memories should enter context right now?

This requires scoring across multiple signals:

  • Relevance: embedding similarity to the current query
  • Recency: exponential decay so recent memories rank higher
  • Importance: user-stated preferences and explicit "remember this" commands get boosted
  • Frequency: memories referenced often in past sessions get promoted

A simplified scoring function:

python
def memory_score(memory, query_embedding, now): relevance = cosine_similarity(memory.embedding, query_embedding) age_hours = (now - memory.created_at).total_seconds() / 3600 recency = math.exp(-age_hours / 168) # half-life of ~1 week importance = memory.importance # 0.0 to 1.0, set at write time return 0.5 * relevance + 0.3 * recency + 0.2 * importance

Getting this wrong produces two failure modes I see constantly:

  1. Memory amnesia: the system forgets things the user expects it to know because retrieval thresholds are too aggressive or embeddings miss semantic overlap.

  2. Memory hallucination: retrieved chunks from unrelated past conversations pollute context and the model confidently cites wrong information.

Both are worse than having no long-term memory at all.


Forgetting Is a Feature, Not a Bug

Human memory works because we forget. Irrelevant details fade. Temporary frustrations get deprioritized. Only patterns that repeat get consolidated into long-term storage.

AI systems need the same discipline. If you permanently store every typo, every half-formed thought, and every "actually, never mind" message, your vector store fills with noise. Retrieval quality degrades. Costs rise. Users get weird answers grounded in conversations they forgot they had.

Production memory systems need explicit deletion policies:

  • TTL expiration for ephemeral session data
  • Relevance decay that down-weights old memories over time
  • User-controlled deletion ("forget what I said about my salary")
  • Privacy boundaries so one user's memories never leak into another's retrieval results
  • Deduplication when the same fact gets extracted multiple times with slightly different wording

Building a memory system means building cognition infrastructure: write policies, read policies, decay curves, and privacy controls. It is closer to designing a database schema for a social network than to writing a good system prompt.


Where This Is Heading

Most memory systems in production today are still primitive compared to what users expect. Even advanced assistants struggle with long-horizon consistency, evolving preferences, and goal tracking across weeks of interaction.

The research and product direction is clear though:

  • Memory graphs that link entities, events, and relationships instead of flat chunk retrieval
  • Adaptive compression that preserves high-salience details while aggressively summarizing noise
  • Importance scoring at write time, not just read time
  • Reflection loops where the system consolidates and reorganizes memories during idle periods, similar to sleep-dependent memory consolidation in humans
  • Hybrid architectures combining retrieval, structured state, and larger but selectively pruned context windows

The models will keep getting better at reasoning within whatever context you provide. But the gap between "smart model" and "useful product" will keep being filled by memory engineering.


What I Would Tell a Recruiter (And What I Tell Myself)

If you are evaluating someone for an AI engineering role, ask them about memory. Not "what is RAG" (everyone knows the acronym now). Ask:

  • How do you decide what to store versus what to discard?
  • How do you handle context window limits in a 200-message agent session?
  • What happens when retrieval returns the wrong memories?
  • How do you test memory quality over time?

The answers reveal whether someone has shipped AI products or only run notebooks.

For me, the lesson after building these systems is simple: memory is the real battlefield in AI engineering. Model benchmarks measure reasoning in isolation. Users judge products by continuity, personalization, and whether the system remembers what matters.

Intelligence without memory feels broken. Memory without good retrieval is chaos. The teams that figure out the full stack, write path, read path, compression, decay, and privacy, will build the AI products that actually stick.

That is what makes this problem worth working on.