RAG Systems
Retrieval-Augmented Generation grounds your LLM in real data at inference time. From a single vector lookup to an autonomous retrieval agent, learn how RAG systems scale in power and complexity.
What is RAG?
RAG (Retrieval-Augmented Generation) is a technique where relevant information is retrieved from a knowledge base and handed to the LLM to generate a grounded answer. Rather than relying solely on what the model learned during training, RAG grounds responses in external data at inference time.
An LLM's training data has a cutoff date and is finite. RAG gives it access to current, private, or domain-specific knowledge it was never trained on, by retrieving that knowledge at the moment of answering.
Core Concepts
Before diving into the RAG variants, it helps to understand the building blocks every RAG system is built on.
Chunking
Documents are too large to fit into a prompt wholesale, so they are split into smaller pieces called chunks. Chunk size is a tunable parameter: too small and you lose context, too large and you introduce noise. Common strategies include fixed-size chunking (e.g., 512 tokens with overlap), sentence-level splitting, and recursive character splitting.
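As a rough illustration of fixed-size chunking with overlap, the sketch below slides a window over a token list. The function name `chunk_tokens` and its parameters are assumptions made for this example, and whitespace splitting stands in for a real tokenizer.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


# Whitespace splitting is a stand-in for a real tokenizer (assumption).
words = "the quick brown fox jumps over the lazy dog".split()
chunks = chunk_tokens(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last token of the previous one, which is what keeps a sentence that straddles a boundary from being cut off entirely.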
Embeddings
Each chunk is passed through an embedding model (e.g., OpenAI's text-embedding-3-small, or BAAI's bge-large, available on Hugging Face), which converts it into a high-dimensional vector: a list of floating-point numbers that encodes the semantic meaning of the text. Chunks that are semantically similar end up with vectors that are close together in this high-dimensional space.
Vector Stores
These embedding vectors are stored in a vector database (e.g., Pinecone, Weaviate, pgvector, Chroma). At query time, the user's question is also embedded, and the vector store retrieves the chunks whose vectors are closest to the query vector, typically using cosine similarity or dot product as the similarity measure.
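To make "closest to the query vector" concrete, here is a minimal brute-force sketch of cosine-similarity search over an in-memory dictionary. The three-dimensional vectors and chunk ids are invented for illustration, and a real vector database would use an index rather than scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=3):
    """Return the k (chunk_id, score) pairs most similar to the query, best first."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec))
              for chunk_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 3-dimensional vectors (assumption); real embeddings are far larger.
store = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "careers-page": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.0]  # hypothetical embedding of a refund question
results = top_k(query, store, k=2)
print(results)
```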
The Retrieval-Generation Pipeline
User query
    |
    v
Embed query --> Vector search --> Top-k chunks
                                       |
                                       v
                                  Inject into prompt --> LLM --> Final answer
The retrieved chunks are injected into the LLM's prompt as context, and the model generates an answer grounded in that retrieved information.
Naive RAG (Baseline)
Think of it like an open-book exam with sticky notes.
flowchart LR
subgraph Ingestion
D[Documents] --> C[Chunker]
C --> E[Embedding Model]
E --> VS[(Vector Store)]
end
subgraph Query
Q([User Query]) --> EQ[Embed Query]
EQ -->|cosine similarity| VS
VS -->|top-k chunks| P[Prompt Builder]
P --> LLM[LLM]
LLM --> R([Response])
end
How it works
- Pre-read all documents and write "sticky notes" (embeddings) for each chunk.
- A question comes in.
- Find the sticky notes most semantically similar to the question.
- Hand those notes to the LLM: "Answer using these."
Technical details
The retrieval step is a single approximate nearest neighbor (ANN) search over the vector store. Libraries like FAISS use ANN algorithms (e.g., HNSW: Hierarchical Navigable Small World graphs) to make this search fast even over millions of vectors, trading a small amount of accuracy for a large gain in speed.
The top-k chunks (typically k=3 to 10) are concatenated into the prompt. The model sees something like:
Context:
[Chunk 1] ...
[Chunk 2] ...
[Chunk 3] ...
Question: What is the refund policy?
Answer:
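A prompt builder along these lines could be sketched as follows. The function name `build_prompt`, the instruction sentence, and the sample chunk texts are assumptions for illustration, not a prescribed template.

```python
def build_prompt(question, chunks):
    """Concatenate retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(
        f"[Chunk {i}] {text}" for i, text in enumerate(chunks, start=1)
    )
    return (
        # The instruction wording below is an assumption; tune it for your model.
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunk texts (assumption), standing in for retrieved results.
retrieved = [
    "Refunds are handled by the support team.",
    "Refund requests must include an order number.",
]
prompt = build_prompt("What is the refund policy?", retrieved)
print(prompt)
```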
Weaknesses
- Retrieval is purely similarity-based. A query like "What did the CEO say last quarter?" might retrieve chunks that sound relevant but miss the actual quote when the quote's wording is not close to the query in embedding space.
- No mechanism to verify whether the retrieved chunks are actually sufficient to answer the question.
- Chunk boundaries can cut off important context.
Real-world examples
PDF Chatbots
Early document Q&A tools that embed a document and answer questions with a single vector lookup. Simple and effective for narrow, well-structured documents.
Internal Knowledge Bases
Basic assistants that retrieve the closest FAQ entry and generate a response from it. Common first step for teams adding AI to their internal docs.
Customer Support Bots
Entry-level bots that match a user's question to the nearest help article using embeddings. Fast to build, but limited to questions that closely resemble an existing help article.
Hybrid RAG (Smarter search)
Same open-book exam, but smarter search.
flowchart LR
subgraph Ingestion
D[Documents] --> C[Chunker]
C --> E[Embedding Model]
C --> IDX["Inverted Index\nBM25"]
E --> VS[(Vector Store)]
end
subgraph Query
Q([User Query]) --> EQ[Embed Query]
Q --> KQ[Keyword Query]
EQ -->|ANN search| VS
KQ -->|BM25 scoring| IDX
VS -->|ranked list A| RRF[RRF Fusion]
IDX -->|ranked list B| RRF
RRF -->|merged candidates| RE["Cross-Encoder\nReranker"]
RE -->|top-k reranked| P[Prompt Builder]
P --> LLM[LLM]
LLM --> R([Response])
end
How it works
- Same chunking and storage setup as Naive RAG.
- A question comes in.
- Run two searches in parallel: semantic search (embeddings) finds conceptually related chunks, keyword search (BM25) finds exact word matches.
- Merge and rerank the results from both searches.
- Hand the best combined results to the LLM.
Technical details
BM25 (Best Match 25) is a classical information retrieval algorithm that scores documents using term frequency and inverse document frequency, as TF-IDF does, while adding term-frequency saturation and document-length normalization. It excels at exact keyword matching, which dense embeddings can miss, especially for rare terms, proper nouns, or technical jargon.
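The scoring idea can be sketched in a few lines of plain Python. This is one common formulation of BM25 (the smoothed IDF used by Lucene-style implementations), with `k1=1.5` and `b=0.75` chosen as typical defaults; the function name and the sample documents are assumptions for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the query terms."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        term_counts = Counter(doc)
        # Longer-than-average documents are penalized, shorter ones boosted.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_terms:
            tf = term_counts[term]
            if tf == 0:
                continue
            n = doc_freq[term]
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            # tf saturates: repeated occurrences add less and less.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "the refund policy allows returns within 30 days".split(),
    "shipping usually takes five business days".split(),
    "contact support to request a refund".split(),
]
scores = bm25_scores(["refund", "policy"], docs)
print(scores)
```

The first document matches both query terms, including the rarer "policy", so it scores highest; the second matches neither and scores zero.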
The two result sets are merged using Reciprocal Rank Fusion (RRF), a simple but effective algorithm that combines ranked lists without needing to normalize scores across different scales:
RRF_score(chunk) = sum over each ranker of: 1 / (k + rank)
Here rank is the chunk's position in that ranker's list, and k is a smoothing constant (commonly 60) that dampens the advantage of the very top ranks. Chunks near the top in both rankings get boosted.
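A minimal sketch of Reciprocal Rank Fusion follows, assuming each ranker returns a list of chunk ids ordered best first. The function name and the chunk ids are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


vector_hits = ["c3", "c1", "c7"]  # hypothetical semantic search results
bm25_hits = ["c1", "c9", "c3"]    # hypothetical keyword search results
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # → ['c1', 'c3', 'c9', 'c7']
```

Only ranks are used, never raw scores, which is why the cosine similarities and BM25 scores do not need to be put on a common scale first.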
After fusion, a cross-encoder reranker (e.g., Cohere Rerank, or a local cross-encoder trained on MS MARCO) is often applied. Unlike the bi-encoders used in vector search, which embed the query and each chunk separately, a cross-encoder takes the query and a candidate chunk together as input and outputs a relevance score. This is slower but typically more accurate, so it is practical only as a second-pass filter over the top candidates.
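The second-pass shape of reranking can be sketched without committing to a particular model. In the example below, `score_pair` is a pluggable callable that scores a (query, chunk) pair jointly; `toy_overlap_score` is only a word-overlap placeholder for a real cross-encoder, and both names are assumptions for this illustration.

```python
def rerank(query, candidates, score_pair, top_k=3):
    """Score each (query, chunk) pair jointly and keep the top_k chunks."""
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def toy_overlap_score(query, chunk):
    # Placeholder for a cross-encoder forward pass (assumption):
    # the fraction of query words that appear in the chunk.
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)


candidates = [
    "Shipping takes five business days.",
    "Our refund policy allows returns within 30 days.",
    "Refund requests go through the support team.",
]
best = rerank("refund policy", candidates, toy_overlap_score, top_k=2)
print(best)
```

Because the scorer runs once per candidate, the cost grows with the size of the candidate list, which is why it is applied to the fused shortlist rather than the whole corpus.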
Weaknesses
- More infrastructure to maintain (vector store + inverted index).
- Reranking adds latency.
- Still a single retrieval pass: no ability to recognize when retrieved results are insufficient.
Real-world examples
Elasticsearch Enterprise Search
Enterprise search systems that combine BM25 keyword matching with dense vector search for more accurate and robust document retrieval across large corpora.
Legal & Medical Q&A
Tools where exact terminology (drug names, legal citations) must be matched precisely, but conceptual context also matters. Hybrid retrieval handles both.
Notion AI / Confluence AI
Tools that blend keyword and semantic search over large internal wikis, giving users results that match both the literal words and the intent behind their queries.
Agentic RAG (Most powerful)
Instead of a student doing one lookup, you have a detective working a case.
flowchart TD
Q([User Query]) --> AG["Agent\nReAct Loop"]
AG -->|plan + search query| T1[Vector Search Tool]
AG -->|if needed| T2[Web Search Tool]
AG -->|if needed| T3[SQL Tool]
AG -->|if needed| T4[Calculator Tool]
T1 & T2 & T3 & T4 -->|observation| AG
AG -->|sufficient context?| CHK{Enough to answer?}
CHK -->|no: refine query| AG
CHK -->|yes| P[Prompt Builder]
P --> LLM[LLM]
LLM --> R([Response])
How it works
- A question comes in.
- The agent plans: "What do I actually need to find to answer this?"
- It retrieves something.
- It reads the result and asks itself: "Do I have enough? What's still missing?"
- If not satisfied, it retrieves again with a refined query, or switches tools entirely (web search, SQL, calculator, etc.).
- Repeats until it can answer confidently.
- Generates the final response.
In Naive and Hybrid RAG, the LLM is the endpoint of the pipeline. In Agentic RAG, the LLM runs the pipeline.
Technical details
Agentic RAG typically implements a ReAct loop (Reasoning + Acting), where the LLM interleaves thought steps with tool calls:
Thought: The user is asking about Q3 revenue. I should search the earnings docs first.
Action: vector_search("Q3 revenue 2024")
Observation: [retrieved chunks about total revenue...]
Thought: These chunks mention total revenue but not the regional breakdown.
Action: vector_search("Q3 revenue breakdown by region 2024")
Observation: [retrieved chunks with regional data...]
Thought: I now have enough to answer.
Final Answer: ...
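The control flow behind a trace like this can be sketched as a bounded loop. Everything here is a stand-in: `scripted_decide` replaces the LLM call with a fixed script, `fake_vector_search` replaces a real retrieval tool, and the action dictionary format is an assumption made for this example.

```python
def run_agent(question, decide, tools, max_steps=5):
    """Alternate between a decision step and a tool call until an answer appears."""
    observations = []
    for _ in range(max_steps):
        action = decide(question, observations)  # in practice, an LLM call
        if action["type"] == "answer":
            return action["text"], observations
        tool = tools[action["tool"]]
        observations.append(tool(action["input"]))
    # The step limit guards against endless retrieval loops.
    return "Step limit reached without a confident answer.", observations


def fake_vector_search(query):
    # Hypothetical corpus with placeholder text, not real data.
    corpus = {
        "Q3 revenue 2024": "Chunks about total revenue.",
        "Q3 revenue breakdown by region 2024": "Chunks with regional data.",
    }
    return corpus.get(query, "No results.")


def scripted_decide(question, observations):
    # Fixed script standing in for the model's reasoning (assumption).
    if len(observations) == 0:
        return {"type": "tool", "tool": "vector_search",
                "input": "Q3 revenue 2024"}
    if len(observations) == 1:
        return {"type": "tool", "tool": "vector_search",
                "input": "Q3 revenue breakdown by region 2024"}
    return {"type": "answer",
            "text": f"Answer based on {len(observations)} observations."}


tools = {"vector_search": fake_vector_search}
answer, trace = run_agent("How did Q3 revenue break down by region?",
                          scripted_decide, tools)
print(answer)
print(trace)
```

Returning the list of observations alongside the answer gives a simple trace of what the agent retrieved at each step, which helps when debugging.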
Query rewriting is another key technique: rather than passing the raw user query into retrieval, the agent rewrites it into a more precise search query based on what it already knows. This significantly improves retrieval quality for multi-hop questions (questions that require chaining multiple pieces of information).
Weaknesses
- Multiple retrieval calls mean higher latency and cost.
- The agent can get stuck in retrieval loops or make poor decisions about when it has enough information.
- Harder to debug and trace than a fixed pipeline.
Real-world examples
Perplexity AI
Iteratively searches the web, evaluates results, and refines its queries before synthesizing a final answer. The retrieval loop is the core of the product.
Financial Research Assistants
Pull from SEC filings, run SQL queries over earnings data, and cross-reference news sources before generating an investment summary. Multi-source, multi-hop retrieval.
Devin / OpenHands
Iteratively read code, run tests, interpret errors, and search documentation until a bug is resolved. The agent decides when it has gathered enough context to act.
Mental Models
Each RAG variant trades complexity for capability. Here is a one-line model for each to help you choose the right one for a given problem.
| Type | Mental Model | Best For |
|---|---|---|
| Naive RAG | Find → Answer | Simple Q&A over a well-structured document corpus |
| Hybrid RAG | Find smarter → Answer | Domains with precise terminology or large mixed corpora |
| Agentic RAG | Plan → Find → Think → Find again → Answer | Multi-hop questions, cross-source reasoning, open-ended research |