I spent months building a RAG system. FAISS for vector search. BM25 for keyword fallback. Cohere for reranking. LanceDB for persistence. Redis for caching. A FastAPI layer to orchestrate it all. Docker Compose to hold the pieces together.
It worked. Patients could ask whether a medical procedure was covered under Medi-Cal, and the system would retrieve the relevant policy sections, synthesize them, and return a clear answer. I was proud of the architecture.
Then I watched someone dump the entire policy document — all 200 pages — into a model with a million-token context window and get the same answer in one shot. No embeddings pipeline. No vector store. No reranker. No infrastructure at all.
That's the moment I started questioning whether classical RAG has a future.
The Case Against RAG
The argument is straightforward. RAG exists because models couldn't hold enough context. You had to chop documents into chunks, embed them, store them in a vector database, and retrieve the relevant pieces at query time. Every step introduced failure modes:
Chunking → semantic boundaries get split
Embedding → meaning gets compressed and distorted
Retrieval → wrong chunks get returned (or right ones get missed)
Reranking → adds latency, still imperfect
Synthesis → model has to reconstruct context from fragments
Each component needs its own tuning, monitoring, and maintenance. The chunking strategy that works for legal documents fails on medical policy manuals. The embedding model that handles English well collapses on multilingual content. The retrieval threshold that produces high precision kills recall.
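The first failure mode is easy to demonstrate. Here's a toy sketch, using a hypothetical `naive_chunk` helper (not from any RAG library), of how fixed-size chunking severs a fact mid-sentence:

```python
def naive_chunk(text: str, chunk_size: int) -> list[str]:
    """Split text into fixed-size character chunks, ignoring sentence boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# An invented policy sentence for illustration.
policy = ("Outpatient physical therapy is covered for up to 24 visits "
          "per year when prescribed by a physician.")

chunks = naive_chunk(policy, 60)
# The coverage limit ("24 visits") and its condition ("when prescribed
# by a physician") now live in different chunks. A retriever that
# returns only one of them hands the model half the fact.
for c in chunks:
    print(repr(c))
```

Smarter chunkers split on sentences or sections, but the boundary problem never fully goes away; it just moves.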
Now compare that to: paste the entire corpus into the context window. Ask your question. Get your answer.
Context windows are growing fast. Gemini 1.5 Pro ships with a million-token window. Claude handles 200K tokens natively. These aren't experimental features — they're production capabilities. And the cost per token keeps dropping. What cost $30 per million tokens two years ago now costs under $1.
Why RAG Isn't Dead Yet
But here's the thing — I've been building RAG systems, and I know what they can do that long context can't.
Scale. A million tokens is roughly 750K words. That sounds like a lot until you realize a single enterprise knowledge base might have 50 million words across thousands of documents. You can't fit that in any context window. RAG can search across it in milliseconds.
Freshness. Long context requires you to re-ingest the entire corpus every time something changes. RAG lets you update a single document's embeddings without touching anything else. For systems where data changes daily — support tickets, medical guidelines, compliance rules — this matters enormously.
Cost at scale. Processing a million tokens per query is cheap for one-off questions. It's ruinous at 10,000 queries per hour. RAG retrieves only the relevant chunks, keeping per-query token costs predictable and low.
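The back-of-envelope math, using the ~$1 per million input tokens figure from above and an assumed 4K-token retrieved context for RAG (illustrative numbers, not any vendor's pricing):

```python
# Hourly token-cost comparison at high query volume.
# Assumptions: $1 per million input tokens; long context sends the full
# 1M-token corpus with every query, RAG sends only ~4K retrieved tokens.
PRICE_PER_TOKEN = 1.0 / 1_000_000
QUERIES_PER_HOUR = 10_000

long_context_tokens = 1_000_000  # whole corpus, every query
rag_tokens = 4_000               # top-k chunks only (assumed)

long_context_hourly = long_context_tokens * PRICE_PER_TOKEN * QUERIES_PER_HOUR
rag_hourly = rag_tokens * PRICE_PER_TOKEN * QUERIES_PER_HOUR

print(f"long context: ${long_context_hourly:,.0f}/hour")  # $10,000/hour
print(f"RAG:          ${rag_hourly:,.0f}/hour")           # $40/hour
```

Two-plus orders of magnitude per hour, and the gap scales linearly with query volume.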
Precision. When you need to find a specific clause in a 500-page contract, retrieval with good reranking actually outperforms long-context models. The "needle in a haystack" problem is real — models with huge context windows can miss information buried in the middle of their input.
The Hybrid Future
The smart play isn't picking sides. It's building systems that use both strategies where each excels:
Query arrives
│
├─ Corpus < 500K tokens?
│ └─ Stuff it all in context. Done.
│
├─ Corpus > 500K tokens?
│ └─ RAG retrieval → top-k chunks → feed to model
│
└─ Need real-time data?
└─ RAG with streaming index updates
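The decision tree above fits in a few lines. A minimal sketch of such a routing layer — the threshold and strategy names are hypothetical, and a real router would also weigh latency and cost:

```python
LONG_CONTEXT_LIMIT = 500_000  # tokens; tune per model and budget

def route(corpus_tokens: int, needs_realtime: bool) -> str:
    """Pick a context strategy per query."""
    if needs_realtime:
        return "rag_streaming"  # RAG with streaming index updates
    if corpus_tokens < LONG_CONTEXT_LIMIT:
        return "long_context"   # stuff the whole corpus in. Done.
    return "rag_topk"           # retrieve top-k chunks, feed to model

print(route(120_000, False))    # long_context
print(route(5_000_000, False))  # rag_topk
print(route(120_000, True))     # rag_streaming
```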
This is where I think production ML is heading. The retrieval pipeline doesn't disappear — it becomes one strategy in a routing layer that picks the right approach per query. Small corpus? Long context. Large corpus? Retrieval. Mixed requirements? Both.
What Changes in Practice
For ML engineers, the shift means different things depending on what you're building:
If you're starting a new project — don't default to RAG. Ask first: can my entire corpus fit in a context window? If yes, skip the infrastructure. You can always add retrieval later when scale demands it.
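The "does it fit" question doesn't need infrastructure to answer. A crude check using the common ~4 characters per token heuristic (real tokenizers vary by model and language, so treat this as a first pass, not a guarantee):

```python
CHARS_PER_TOKEN = 4  # rough heuristic; use the model's tokenizer for exact counts

def estimated_tokens(docs: list[str]) -> int:
    """Estimate total token count for a corpus of document strings."""
    return sum(len(d) for d in docs) // CHARS_PER_TOKEN

def fits_in_context(docs: list[str], window: int = 1_000_000) -> bool:
    """True if the whole corpus plausibly fits in one context window."""
    return estimated_tokens(docs) <= window

# Synthetic corpus: ~3.5M chars ≈ 875K tokens — under a 1M-token window.
corpus = ["x" * 2_000_000, "y" * 1_500_000]
print(fits_in_context(corpus))  # True: skip the retrieval stack for now
```

If the answer is a clear yes with headroom, start with long context and add retrieval only when the corpus outgrows the window.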
If you're maintaining an existing RAG system — benchmark it against a long-context baseline. You might find that your carefully tuned pipeline is solving a problem that the model itself can now handle. If so, you just eliminated six services from your architecture.
If you're building at enterprise scale — RAG is still your foundation. But invest in better routing. The next generation of production systems will dynamically choose between retrieval and full-context based on corpus size, query type, latency requirements, and cost constraints.
What I Learned
Building that MedCoverage system taught me something important: the value of RAG was never the retrieval itself. It was the ability to make unstructured information queryable, reliable, and fast. If long context can deliver the same result with less complexity, the answer isn't to defend RAG — it's to follow the capability.
The best ML engineers aren't loyal to architectures. They're loyal to outcomes. Right now, the outcome is shifting.