
February 9, 2026 · 8 min read

Why Your Mental Model of AI is Probably Wrong

Most teams still misunderstand how LLMs work. An LLM isn't a brain. It's a very fast clerk with a 2 million page desk and a 5-second attention span. Here's what actually breaks AI products in 2026 and how to build around it.

The Mental Model Costing You Money


Most product builders think: It's intelligent. It thinks through problems. It learns from conversations. It remembers what you told it. The more context you give it, the smarter it gets.

Here's the reality: AI is a text prediction engine. It predicts text based on what's currently visible. Everything resets between sessions. And even within a session, having something "in view" doesn't mean the model can use it effectively.

Think of it like texting someone who can see two million pages at once but has to find one specific sentence in five seconds. They can technically "see" everything. Good luck getting the right answer under pressure. Recent pages get attention. Pages matching the current question get attention. Everything else? Background noise.

Would you design a financial advisory product assuming that person could reliably recall every client detail? No. You'd design around their constraints.

That's exactly what breaks AI products in 2026.

What Actually Happens With 1 Million+ Tokens

By now, everyone in your Slack knows what a token is. In 2026, tokens are how you measure both cost and speed. But "unlimited context" is mostly an expensive illusion.

Here's the reality for long-context models:

  • Gemini 3 Pro (and variants): advertised up to 1M–10M tokens depending on tier/preview, but effective retrieval often caps lower in practice due to attention patterns.
  • GPT-5 family: around 400K tokens common, with tiered pricing that can spike for extended reasoning or long inputs.
  • Claude Opus 4.6 / Sonnet 4.5: up to 1M tokens in beta for top tiers (standard often 200K), with strong consistency but costs rise sharply beyond base limits.

Larger windows help, but they don't eliminate retrieval issues. Studies (including the seminal "Lost in the Middle" paper and 2025–2026 follow-ups) show a U-shaped performance curve: accuracy peaks at the start and end of the context and often drops sharply in the middle, even for models claiming 1M+ context windows.
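You can measure this on your own stack with a simple position probe: plant one "needle" fact at varying depths in filler text and check whether the model recovers it. The sketch below is illustrative, not a standard benchmark harness; `call_model` is a stub you'd wire to your own provider's client.

```python
def call_model(prompt: str) -> str:
    """Stub: replace with a call to your LLM provider's client."""
    raise NotImplementedError("wire up your LLM client here")

def build_haystack(needle: str, filler_line: str,
                   total_lines: int, depth: float) -> str:
    """Bury one needle line at a relative depth: 0.0 = head, 1.0 = tail."""
    lines = [filler_line] * total_lines
    lines.insert(int(depth * total_lines), needle)
    return "\n".join(lines)

def probe(needle: str, question: str, expected: str,
          depths=(0.0, 0.25, 0.5, 0.75, 1.0)) -> dict[float, bool]:
    """Run the same retrieval question with the needle at each depth."""
    results = {}
    for d in depths:
        haystack = build_haystack(
            needle, "The weather was unremarkable.", 2000, d)
        answer = call_model(f"{haystack}\n\nQuestion: {question}")
        results[d] = expected.lower() in answer.lower()
    return results  # if the U-curve holds, mid-depth misses are likeliest
```

Run it against your actual model and context length before trusting any vendor's headline window size.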

Between sessions, complete reset. When you start a new chat, everything is gone unless someone built a system to retrieve and inject previous conversations.

How "Memory" Actually Works in 2026

ChatGPT has a "Memory" feature. Claude has Projects that "remember" your work. Doesn't that contradict everything?

No. It proves the point.

When ChatGPT or Claude says it has "memory," it doesn't actually remember things the way people do. What you're seeing (pun intended) is a ghostwriter at work: a background process that silently clips your past data and pastes it into the current prompt before you even hit enter.

Here's the mechanism:

  • You tell it something important: "I'm a vegetarian"
  • ChatGPT stores that in a database (not in the chat)
  • When you start a new chat days later, it looks up those saved facts in the database
  • It finds "User is vegetarian" and inserts that into the new conversation's context
  • The model responds as if it "remembers"

It's not memory. It's a clever way of feeding the model the right information at the right time so it appears consistent and personal.
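The whole mechanism fits in a few lines. This is a minimal sketch assuming a SQLite fact store; `store_fact` and `build_prompt` are illustrative names, not any vendor's actual implementation.

```python
import sqlite3

# The "memory": a plain database that outlives every chat session.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE memories (user_id TEXT, fact TEXT)")

def store_fact(user_id: str, fact: str) -> None:
    """Persist a fact OUTSIDE the conversation."""
    db.execute("INSERT INTO memories VALUES (?, ?)", (user_id, fact))

def build_prompt(user_id: str, user_message: str) -> str:
    """Look up saved facts and paste them into the prompt before the
    (stateless) model ever sees the new message."""
    rows = db.execute(
        "SELECT fact FROM memories WHERE user_id = ?", (user_id,))
    memory_block = "\n".join(f"- {row[0]}" for row in rows)
    return (f"Known facts about this user:\n{memory_block}\n\n"
            f"User: {user_message}")

store_fact("u1", "User is vegetarian")
prompt = build_prompt("u1", "Suggest a dinner recipe.")
print(prompt)
```

Days later, in a brand-new session, the same lookup runs again and the model "remembers" — because your code re-fed it the fact, not because anything persisted inside the model.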

The distinction matters because you need to build that system if you want your product to "remember" anything.

The base model? Still stateless. Still resets between sessions. Still has attention decay within long contexts.

The Healthcare "Needle in a Haystack" Disaster

I'm seeing this pattern at its most dangerous in healthcare AI deployments, specifically "Ambient Clinical Intelligence" systems that listen to patient conversations and auto-generate clinical notes. Teams hit high scores (often 90%+) on controlled Needle-in-a-Haystack benchmarks, a lab test where the model finds one specific fact in a long document, then ship to production and discover how fragile those results are in messy, real-world workflows.

If you're using a 1-million-token window as a substitute for a clinical database, you're taking serious clinical and regulatory risks.

The Recurring Failure Mode

Consider a typical setup. A health-tech startup builds an intake assistant on Gemini 3 Pro's 1-million-token window. The pitch: patients upload their lifetime medical history (PDFs, scans, previous transcripts), and the AI "understands" the full longitudinal context.

The Multi-Million Risk of a Miss

A patient uploads a 300-page historical file (approximately 450,000 tokens). On page 12, buried under years of routine physicals, is a documented allergic reaction to penicillin.

The patient then spends 20 minutes in a "live" ambient session discussing their current flu symptoms. Without structured safeguards, the model, biased toward the start of the prompt and the most recent content, can overlook the allergy and suggest amoxicillin, a penicillin-class antibiotic.

This kind of miss is exactly what regulators worry about. Under HIPAA, such failures can trigger investigations or "willful neglect" findings. Depending on severity and intent, penalties can reach tens of thousands of dollars per incident, with annual caps up to $2 million.

Why This Keeps Happening

The allergy information falls into the Dead Zone (the middle 80%) where attention is lowest. The model prioritises the system prompt and recent messages. Everything else gets ignored.

When the patient returns the next day, the model is stateless. Unless you've built a system to retrieve that allergy and pin it to the top of the new session's context, the AI has no memory of the previous encounter.

Naively passing full long histories (~450K tokens) on every turn with 2026 pricing can run $1–$3+ per response on high-tier models. For high-volume scenarios (thousands of interactions), costs scale to tens of thousands monthly while retrieval risks remain.

The Rebuild: Architecture Over Context Size

The solution wasn't better prompts or a bigger context window. It was treating this as an infrastructure problem.

  • Rip critical medical data out of the conversation: Allergies, chronic conditions, current medications get extracted via structured output and written to the EHR (electronic health record) database in real-time. Not "captured in conversation." This data is retrieved and injected prominently at the start of every response, regardless of when it was mentioned.
  • Design for the Dead Zone: Critical safety information (allergies, drug interactions, red flags) gets repeated in the system prompt for every single response. Not mentioned once and assumed to be "in memory."
  • Chunk and summarise strategically: Every 10 minutes of conversation triggers automatic summarisation of key medical facts. That summary stays at the top of the context window where attention is highest. Full transcript stored separately.
  • Validate against external truth: Before suggesting any medication or treatment, the system queries the patient's EHR database directly. It doesn't rely on what's "in the conversation."
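A minimal sketch of the first two bullets, extraction into a store plus per-turn re-injection. The names (`ehr`, `SAFETY_FIELDS`, `build_turn_prompt`) are hypothetical, and a real system would write to an actual EHR via structured output rather than a dict:

```python
# Stand-in for a real EHR database, keyed by patient.
ehr: dict[str, dict[str, list[str]]] = {}

SAFETY_FIELDS = ("allergies", "medications", "chronic_conditions")

def write_to_ehr(patient_id: str, field: str, value: str) -> None:
    """Extracted facts leave the conversation and land in durable storage."""
    record = ehr.setdefault(patient_id, {f: [] for f in SAFETY_FIELDS})
    if value not in record[field]:
        record[field].append(value)

def build_turn_prompt(patient_id: str, recent_summary: str,
                      user_turn: str) -> str:
    """Re-inject safety data at the HEAD of every prompt, where attention
    is highest -- regardless of when it was originally mentioned."""
    record = ehr.get(patient_id, {})
    safety = "\n".join(
        f"{field.upper()}: {', '.join(vals) or 'none recorded'}"
        for field, vals in record.items())
    return ("=== SAFETY (always verify) ===\n" + safety +
            "\n\n=== SUMMARY ===\n" + recent_summary +
            "\n\n=== CURRENT ===\n" + user_turn)

write_to_ehr("p1", "allergies", "penicillin")
prompt = build_turn_prompt("p1", "Flu symptoms, day 2.", "What should I take?")
print(prompt.splitlines()[0])  # → === SAFETY (always verify) ===
```

The allergy never sits in the Dead Zone because it is rebuilt into the head of the context on every single turn.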

The rebuilt system works. Not because Gemini got smarter, but because the architecture respects what transformers actually do: they attend to edges and relevant patterns, not comprehensive middle context.

The Architecture of Survival

Once you understand that even massive context windows don't create memory, certain design principles become critical.

Rule 1: Stop Using the Context Window as a Database

Information that matters later doesn't belong in the conversation.

Rip "memory" out of the LLM. If it's a user preference, put it in a database. If it's a compliance rule, hard-code it or serve it from a deterministic source. Don't ask the model to "remember"; tell it what it's looking at.

Example: that financial advisory product from earlier? Risk tolerance, investment goals, and regulatory constraints should live in a database. Each conversation retrieves them and injects them into the system prompt. Don't rely on the AI "remembering" them from three days ago, or even from three hours ago if they landed in the middle of the context.

The economics matter here.

Injecting 500,000 tokens of conversation history into every message costs $2.00 per response on Gemini 3 Pro (at the high-volume $4/M-token tier). Storing preferences in a database, using RAG (retrieval-augmented generation) to fetch relevant context, and injecting 500 tokens? $0.002 per response.
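The arithmetic is worth sanity-checking, assuming a flat $4 per million input tokens (the tier quoted above):

```python
# Back-of-envelope cost check, assuming $4 per million input tokens.
PRICE_PER_TOKEN = 4.00 / 1_000_000

def cost_per_response(input_tokens: int) -> float:
    return input_tokens * PRICE_PER_TOKEN

full_history = cost_per_response(500_000)  # dump the whole conversation
rag_injection = cost_per_response(500)     # retrieve only what's relevant

print(f"${full_history:.2f} vs ${rag_injection:.3f}")  # → $2.00 vs $0.002
print(f"{full_history / rag_injection:.0f}x cheaper")  # → 1000x cheaper
```

A thousand-fold difference per response, before you even count the retrieval-accuracy benefit of not burying the answer mid-context.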

Your unit economics die if you use context windows like databases.

Rule 2: The Infinite Context Trap

This is the new failure mode of 2026.

In 2024, products failed because the context window was too small. In 2026, products fail because teams dump everything into a massive context window and expect the model to find what matters.

This is the law of physics teams keep trying to break: transformers prioritise the head and tail of context. If you bury critical constraints in the middle 80% of a 1-million-token window, you've effectively deleted them.

Your options aren't interchangeable. Pick what fits your constraints:

  • Summarise and promote: Every 10–15 exchanges, generate a structured summary of critical information. Keep that summary at the top where attention is highest.
  • Strategic repetition: Important constraints, preferences, or safety information should appear in the system prompt for every response.
  • Retrieve, don't dump: Store full history separately. Retrieve only what's relevant to the current query. Build a search problem, not a memory problem.
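A sketch of the first option, summarise and promote. Here `summarise` is a stub where a real system would make an LLM summarisation call, and the 10-exchange threshold is illustrative:

```python
SUMMARY_EVERY = 10  # illustrative: keep the last 10 exchanges verbatim

def summarise(messages: list[str]) -> str:
    """Stub: a real system would call a model with a summarisation prompt."""
    return f"[summary of {len(messages)} earlier messages]"

def build_context(history: list[str], new_message: str) -> list[str]:
    """Compress the old middle, pin the summary at the head (high
    attention), and keep the recent tail verbatim."""
    if len(history) <= SUMMARY_EVERY:
        return history + [new_message]
    summary = summarise(history[:-SUMMARY_EVERY])
    return [summary] + history[-SUMMARY_EVERY:] + [new_message]

ctx = build_context([f"msg {i}" for i in range(25)], "current question")
print(ctx[0])    # → [summary of 15 earlier messages]
print(len(ctx))  # summary + 10 recent + 1 new = 12 entries
```

Nothing critical ever drifts into the middle: it either lives in the pinned summary at the head or in the fresh tail.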

Perplexity doesn't keep your entire search history in context. It retrieves relevant past searches when they'd help with the current query. That's architectural intelligence.

Rule 3: Your Marketing Cannot Outrun Your Architecture

If you tell users the AI "remembers" or "learns" from conversations, you're making promises about architecture, not model capabilities.

ChatGPT can say "I remember you're vegetarian" because OpenAI built a database system to enable that. The base model? Still stateless.

Be honest about what you've actually built:

  • "I can access your previous sessions when you reference them" (requires retrieval system you built)
  • "I remember your preferences" (requires preference database you maintain)
  • "I learn from our conversations" (requires fine-tuning pipeline or continuous injection you designed)

All of these are possible. None of them happen automatically just because you have a large context window.

The demo that "learns" over 5 minutes won't scale to 5,000 users over 5 months without deliberate memory architecture.

In upcoming posts, I'll cover the three things LLMs fundamentally cannot do, even with all the advances in 2026.

Your Turn

What AI product idea have you been working on? Are you treating your context window like a database or like a search index?

The difference determines whether your product scales or your costs explode.

Drop a comment or send me a message. I read everything.

TL;DR Prompt

Want a compact summary? Copy this into your LLM:

Summarise this article: why million-token context windows don't create memory, what the "Dead Zone" means, how ChatGPT and Claude actually implement memory features, and the three architectural rules for building AI products that work in 2026.