Classical vs agentic RAG: a practical decision framework
Ahmet Özel argues that classical and agentic RAG are not competing choices but a spectrum, and offers a practical decision guide with runnable code for choosing between them.
Treat RAG architecture as a tunable dial rather than a binary choice — defaulting to classical RAG and measuring retrieval quality before adding agent complexity can cut costs and latency without sacrificing answer quality.
Ahmet Özel presents a practical framework for deciding between classical and agentic RAG, framing the two not as competing architectures but as opposite ends of a dial. Classical RAG follows a fixed pipeline — embed the query, retrieve the top-k chunks from a vector store, stuff them into the prompt, and generate an answer — making it cheap, fast, and predictable. Agentic RAG, by contrast, lets the LLM decide what to do: reformulate the query, retrieve, self-check the result, retrieve again from a different source, call a tool, and only then answer. That flexibility comes at the cost of more tokens, higher latency, and non-deterministic behavior.
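The two ends of the dial described above can be sketched in a few lines. This is a minimal illustration, not the code from Özel's repository: the bag-of-words `embed`, the in-memory `CHUNKS` list, the stopword-based `reformulate`, and the echo-style `generate` are all hypothetical stand-ins for an embedding model, a vector store, LLM query rewriting, and an LLM call.

```python
import math
import re
from collections import Counter

# Toy corpus standing in for chunked documents (illustrative data only).
CHUNKS = [
    "Classical RAG retrieves the top-k chunks once and generates one answer.",
    "Agentic RAG lets the model reformulate queries and call tools in a loop.",
    "ChromaDB is a vector store used for similarity search.",
]
STOPWORDS = {"what", "which", "is", "the", "a", "of", "for", "does", "do", "in", "to"}

def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (assumption; a real system uses a model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list[tuple[float, str]]:
    """Return the k highest-scoring (score, chunk) pairs."""
    q = embed(query)
    return sorted(((cosine(q, embed(c)), c) for c in CHUNKS), reverse=True)[:k]

def generate(query: str, context: list[str]) -> str:
    """Placeholder for the LLM call: returns the prompt that would be sent."""
    return f"Question: {query}\nContext:\n" + "\n".join(f"- {c}" for c in context)

def classical_rag(query: str, k: int = 2) -> str:
    """Fixed pipeline: one retrieval, one generation."""
    return generate(query, [chunk for _, chunk in retrieve(query, k)])

def reformulate(query: str) -> str:
    """Stand-in for LLM query rewriting: drop stopwords."""
    return " ".join(w for w in re.findall(r"[a-z]+", query.lower()) if w not in STOPWORDS)

def agentic_rag(query: str, max_steps: int = 3, min_score: float = 0.3) -> tuple[str, int]:
    """Loop of retrieve, self-check, reformulate; returns (answer, steps used).
    The score threshold is a crude substitute for an LLM self-check."""
    context, current, steps = [], query, 0
    for steps in range(1, max_steps + 1):
        score, chunk = retrieve(current, k=1)[0]
        if chunk not in context:
            context.append(chunk)
        if score >= min_score:  # self-check passed, stop looping
            break
        current = reformulate(current)
    return generate(query, context), steps

if __name__ == "__main__":
    print(classical_rag("What does agentic RAG do in a loop?"))
    print(agentic_rag("What does agentic RAG do in a loop?")[1], "step(s) used")
```

The `max_steps` bound matters in practice: without it, an agentic loop that never passes its self-check keeps spending tokens, which is the cost and latency risk the summary describes.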
Özel's decision heuristics center on a few key questions: whether the answer lives in a single chunk or requires combining multiple documents; whether multiple sources (vector DB, SQL table, external API) must be queried; and whether latency and cost constraints are tight. A hybrid pattern he favors is to run classical RAG first and escalate only the queries that fail a confidence or self-check step to the agentic path, keeping most queries cheap. He also stresses that architecture debates are meaningless without an eval set of real questions with known-good answers. That eval set should track retrieval quality (recall@k, hit rate), answer faithfulness and relevance, and cost and latency per query.
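The hybrid pattern can be sketched as a small router. This is an illustrative sketch under stated assumptions, not Özel's implementation: the `Answer` type, the `threshold` value of 0.7, and both stub pipelines are hypothetical, and the confidence score is assumed to come from whatever self-check step the classical path exposes.

```python
from dataclasses import dataclass
from typing import Callable

# Each pipeline is assumed to return (answer_text, confidence in [0, 1]).
Pipeline = Callable[[str], tuple[str, float]]

@dataclass
class Answer:
    text: str
    confidence: float
    path: str  # "classical" or "agentic"

def hybrid_answer(query: str, classical: Pipeline, agentic: Pipeline,
                  threshold: float = 0.7) -> Answer:
    """Run the cheap path first; escalate only when confidence is too low."""
    text, confidence = classical(query)
    if confidence >= threshold:
        return Answer(text, confidence, "classical")
    text, confidence = agentic(query)
    return Answer(text, confidence, "agentic")

# Hypothetical stubs so the sketch runs without a model or vector store.
def classical_stub(query: str) -> tuple[str, float]:
    return ("Answer from one retrieval pass", 0.9 if "single" in query else 0.4)

def agentic_stub(query: str) -> tuple[str, float]:
    return ("Answer after multi-step retrieval", 0.8)

if __name__ == "__main__":
    for q in ["single fact lookup", "compare two quarterly reports"]:
        result = hybrid_answer(q, classical_stub, agentic_stub)
        print(f"{q!r} -> {result.path} (confidence {result.confidence})")
```

Logging the `path` field per query gives the escalation rate, which is one way to check that most traffic really does stay on the cheap path.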
The accompanying repository walks through both architectures end to end using ChromaDB for vector search and supports OpenAI, Gemini, Claude, Ollama, and vLLM, enabling fully local or hosted-model runs. It includes chunking and retrieval steps, the agentic tool-selection loop, and evaluation metrics for comparing the two approaches on custom data.
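The retrieval metrics named in the summary, recall@k and hit rate, are simple to compute once an eval set exists. The sketch below uses common definitions of both metrics; the repository's own implementation may differ, and the document IDs and eval entries are made-up illustrative data.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def hit_rate(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Fraction of queries with at least one relevant document in the top-k."""
    if not runs:
        return 0.0
    hits = sum(1 for retrieved, relevant in runs if set(retrieved[:k]) & relevant)
    return hits / len(runs)

# Hypothetical eval set: (ranked retrieved IDs, known-good relevant IDs) per query.
RUNS = [
    (["d1", "d3", "d2"], {"d1", "d2"}),  # one of two relevant docs in the top 2
    (["d4", "d5", "d6"], {"d9"}),        # retrieval missed entirely
]

if __name__ == "__main__":
    k = 2
    mean_recall = sum(recall_at_k(r, rel, k) for r, rel in RUNS) / len(RUNS)
    print(f"mean recall@{k}: {mean_recall}")   # 0.25
    print(f"hit rate@{k}: {hit_rate(RUNS, k)}")  # 0.5
```

Running the same eval set through both the classical and agentic paths gives a like-for-like comparison before any added agent complexity is justified.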
Key facts
- 01 Classical RAG uses a fixed pipeline: embed query, retrieve top-k chunks, generate. That means one retrieval and one generation.
- 02 Agentic RAG lets the LLM loop over query reformulation, multi-source retrieval, and tool calls before answering.
- 03 Özel recommends defaulting to classical RAG and escalating to agentic only when multi-step reasoning or multiple sources are required.
- 04 A hybrid pattern runs classical RAG first and escalates only low-confidence queries to the agentic path.
- 05 Most 'RAG is bad' complaints are attributed to retrieval problems: bad chunking, wrong embedding model, or no reranking.
- 06 The repo uses ChromaDB for vector search and supports OpenAI, Gemini, Claude, Ollama, and vLLM.
- 07 Recommended eval metrics include recall@k, hit rate, faithfulness, relevance, and cost and latency per query.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 8, 2026 · 15:36 UTC.