Part 2: RAG Systems

00

What is RAG?

RAG (Retrieval-Augmented Generation) is a technique where relevant information is retrieved from a knowledge base and handed to the LLM to generate a grounded answer. Rather than relying solely on what the model learned during training, RAG grounds responses in external data at inference time.

Key Concept

An LLM's training data has a cutoff date and is finite. RAG gives it access to current, private, or domain-specific knowledge it was never trained on, by retrieving that knowledge at the moment of answering.

01

Core Concepts

Before diving into the RAG variants, it helps to understand the building blocks every RAG system relies on.

Chunking

Documents are too large to fit into a prompt wholesale, so they are split into smaller pieces called chunks. Chunk size is a tunable parameter: too small and you lose context, too large and you introduce noise. Common strategies include fixed-size chunking (e.g., 512 tokens with overlap), sentence-level splitting, and recursive character splitting.
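Fixed-size chunking with overlap can be sketched as below. This is a minimal illustration, not a production chunker: the function name `chunk_tokens`, its defaults, and the use of whitespace splitting in place of a real tokenizer are all assumptions made for the example.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


# Assumption: whitespace splitting stands in for a real tokenizer.
words = "the quick brown fox jumps over the lazy dog again".split()
for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

The overlap means the last token of one chunk reappears at the start of the next, which reduces the chance that a sentence is cut off exactly at a boundary.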

Embeddings

Each chunk is passed through an embedding model (e.g., text-embedding-3-small from OpenAI, or BAAI's bge-large, available on Hugging Face) which converts it into a high-dimensional vector: a list of floating-point numbers that encodes the semantic meaning of the text. Chunks that are semantically similar end up with vectors that are close together in this high-dimensional space.
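To make "close together" concrete, here is a small sketch of cosine similarity. The three-dimensional vectors are hand-written for illustration only; real embedding models produce hundreds or thousands of dimensions, and the variable names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors, invented for this example (not output of any real model).
refund_policy = [0.9, 0.1, 0.0]
money_back = [0.8, 0.2, 0.1]
office_hours = [0.0, 0.2, 0.9]

print(cosine_similarity(refund_policy, money_back))    # semantically close pair
print(cosine_similarity(refund_policy, office_hours))  # unrelated pair
```

The first pair scores much higher than the second, which is the property retrieval relies on.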

Vector Stores

These embedding vectors are stored in a vector database (e.g., Pinecone, Weaviate, pgvector, Chroma). At query time, the user's question is also embedded, and the vector store retrieves the chunks whose vectors are closest to the query vector, typically using cosine similarity or dot product as the similarity measure.
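The sketch below shows the idea behind a vector store with a brute-force, in-memory stand-in. The class `InMemoryVectorStore` is hypothetical and does not mirror the API of any of the databases named above; the vectors and chunk texts are invented for the example. Vectors are normalized on insert, so the dot product used at query time equals cosine similarity.

```python
import heapq
import math


class InMemoryVectorStore:
    """Exact (brute-force) nearest-neighbor search over unit vectors."""

    def __init__(self):
        self._items = []  # list of (unit_vector, chunk_text)

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0:
            raise ValueError("cannot normalize a zero vector")
        return [x / norm for x in vector]

    def add(self, vector, text):
        self._items.append((self._normalize(vector), text))

    def search(self, query_vector, k=3):
        """Return the k (score, text) pairs with the highest similarity."""
        q = self._normalize(query_vector)
        scored = (
            (sum(a * b for a, b in zip(q, vec)), text)
            for vec, text in self._items
        )
        return heapq.nlargest(k, scored, key=lambda pair: pair[0])


store = InMemoryVectorStore()
store.add([0.9, 0.1, 0.0], "Refunds are issued within 30 days.")
store.add([0.0, 0.2, 0.9], "The office opens at 9 am.")
store.add([0.7, 0.3, 0.1], "Contact support to request your money back.")

for score, text in store.search([1.0, 0.0, 0.0], k=2):
    print(round(score, 3), text)
```

A real vector database replaces the linear scan in `search` with an index so that lookups stay fast as the collection grows.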

The Retrieval-Generation Pipeline

Pipeline

User query
   |
   v
Embed query --> Vector search --> Top-k chunks
                                      |
                                      v
                              Inject into prompt --> LLM --> Final answer

The retrieved chunks are injected into the LLM's prompt as context, and the model generates an answer grounded in that retrieved information.

02

Naive RAG (Baseline)

📚

Think of it like an open-book exam with sticky notes.

flowchart LR
    subgraph Ingestion
        D[Documents] --> C[Chunker]
        C --> E[Embedding Model]
        E --> VS[(Vector Store)]
    end

    subgraph Query
        Q([User Query]) --> EQ[Embed Query]
        EQ -->|cosine similarity| VS
        VS -->|top-k chunks| P[Prompt Builder]
        P --> LLM[LLM]
        LLM --> R([Response])
    end

How it works

Pre-read all documents and write "sticky notes" (embeddings) for each chunk.
A question comes in.
Find the sticky notes most semantically similar to the question.
Hand those notes to the LLM: "Answer using these."

Technical details

The retrieval step is a single approximate nearest neighbor (ANN) search over the vector store. Libraries like FAISS use ANN algorithms (e.g., HNSW: Hierarchical Navigable Small World graphs) to make this search fast even over millions of vectors, trading a small amount of accuracy for a large gain in speed.

The top-k chunks (typically k=3 to 10) are concatenated into the prompt. The model sees something like:

Prompt template

Context:
[Chunk 1] ...
[Chunk 2] ...
[Chunk 3] ...

Question: What is the refund policy?
Answer:
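A prompt in the layout shown above could be assembled roughly like this. The helper name `build_prompt` and the placeholder chunk texts are assumptions for illustration.

```python
def build_prompt(question, chunks):
    """Format retrieved chunks and the question into a single prompt string."""
    context_lines = [
        f"[Chunk {i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    ]
    return (
        "Context:\n"
        + "\n".join(context_lines)
        + f"\n\nQuestion: {question}\nAnswer:"
    )


print(build_prompt(
    "What is the refund policy?",
    ["<text of first retrieved chunk>", "<text of second retrieved chunk>"],
))
```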

Weaknesses

Retrieval is purely similarity-based. A query like "What did the CEO say last quarter?" might retrieve chunks that are topically related but miss the actual quote, because the quote's wording sits far from the query in embedding space.
No mechanism to verify whether the retrieved chunks are actually sufficient to answer the question.
Chunk boundaries can cut off important context.

Real-world examples

📄 PDF Chatbots

Early document Q&A tools that embed a document and answer questions with a single vector lookup. Simple and effective for narrow, well-structured documents.

🗂️ Internal Knowledge Bases

Basic assistants that retrieve the closest FAQ entry and generate a response from it. Common first step for teams adding AI to their internal docs.

💬 Customer Support Bots

Entry-level bots that match a user's question to the nearest help article using embeddings. Fast to build, but limited to questions that map closely onto a single existing article.

03

Hybrid RAG (Smarter search)

🔍

Same open-book exam, but smarter search.

flowchart LR
    subgraph Ingestion
        D[Documents] --> C[Chunker]
        C --> E[Embedding Model]
        C --> IDX["Inverted Index\nBM25"]
        E --> VS[(Vector Store)]
    end

    subgraph Query
        Q([User Query]) --> EQ[Embed Query]
        Q --> KQ[Keyword Query]

        EQ -->|ANN search| VS
        KQ -->|BM25 scoring| IDX

        VS -->|ranked list A| RRF[RRF Fusion]
        IDX -->|ranked list B| RRF

        RRF -->|merged candidates| RE["Cross-Encoder\nReranker"]
        RE -->|top-k reranked| P[Prompt Builder]
        P --> LLM[LLM]
        LLM --> R([Response])
    end

How it works

Same chunking and storage setup as Naive RAG.
A question comes in.
Run two searches in parallel: semantic search (embeddings) finds conceptually related chunks, keyword search (BM25) finds exact word matches.
Merge and rerank the results from both searches.
Hand the best combined results to the LLM.

Technical details

BM25 (Best Match 25) is a classical information retrieval algorithm that scores documents using term frequency and inverse document frequency (the same signals as TF-IDF), with term-frequency saturation and document-length normalization. It excels at exact keyword matching, which dense embeddings can miss, especially for rare terms, proper nouns, or technical jargon.
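A compact sketch of BM25 scoring follows. It uses the common Okapi formulation with a non-negative IDF variant; the defaults `k1=1.5` and `b=0.75` are typical choices rather than values from this guide, and lowercased whitespace tokenization is a simplification. The sample documents and the error code in them are invented.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    if not documents:
        return []
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for d in tokenized for term in set(d))

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(
                1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5)
            )
            # Longer-than-average documents are penalized via b.
            length_norm = 1 - b + b * len(doc) / avg_len
            # k1 controls how quickly repeated terms stop adding score.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "error code E1042 means the disk is full",
    "the disk stores files and folders",
    "restart the service after a crash",
]
print(bm25_scores("E1042 disk", docs))
```

The first document wins because it contains the rare token `E1042`, the kind of exact identifier an embedding model may blur.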

The two result sets are merged using Reciprocal Rank Fusion (RRF), a simple but effective algorithm that combines ranked lists without needing to normalize scores across different scales:

RRF formula

RRF_score(chunk) = sum over each ranker of: 1 / (k + rank)

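Where rank is the chunk's 1-based position in a ranker's list, and k is a smoothing constant (60 is a common default) that dampens the advantage of the very top positions in any single ranking.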
Chunks near the top in both rankings get boosted.
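The formula above translates almost directly into code. In this sketch the function name `reciprocal_rank_fusion` and the chunk ids are hypothetical; each input ranking is a list of chunk ids ordered best first.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):  # ranks start at 1
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


semantic = ["c3", "c1", "c7"]  # ranked list A (vector search)
keyword = ["c1", "c9", "c3"]   # ranked list B (BM25)
print(reciprocal_rank_fusion([semantic, keyword]))
# ['c1', 'c3', 'c9', 'c7']
```

`c1` and `c3` appear in both lists and so outrank `c9` and `c7`, which each appear in only one. Only ranks are used, so the raw cosine and BM25 scores never need to be put on a common scale.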

After fusion, a cross-encoder reranker (e.g., Cohere Rerank, or a local cross-encoder trained on MS MARCO) is often applied. Unlike the bi-encoders used in vector search, which embed the query and each chunk separately, a cross-encoder takes the query and a candidate chunk together as input and outputs a relevance score. This is slower but typically more accurate, which is why it is used as a second-pass filter over a small set of top candidates rather than over the whole corpus.
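The second-pass pattern can be sketched independently of any particular model. Here `rerank` accepts any function that scores a (query, chunk) pair jointly; `word_overlap` is a deliberately crude toy stand-in for a cross-encoder, not a real one, and both names are hypothetical.

```python
def rerank(query, candidates, score_pair, top_k=3):
    """Score each (query, chunk) pair jointly and keep the best top_k chunks."""
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def word_overlap(query, chunk):
    """Toy scorer: fraction of query words that also appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / max(len(query_words), 1)


candidates = [
    "shipping takes five days",
    "the refund policy allows returns",
    "our refund policy covers damaged items",
]
print(rerank("refund policy for damaged items", candidates, word_overlap, top_k=2))
```

In a real system `score_pair` would wrap a call to a reranking model, and `candidates` would be the merged list coming out of fusion.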

Weaknesses

More infrastructure to maintain (vector store + inverted index).
Reranking adds latency.
Still a single retrieval pass: no ability to recognize when retrieved results are insufficient.

Real-world examples

🔍 Elasticsearch Enterprise Search

Enterprise search systems that combine BM25 keyword matching with dense vector search for more accurate and robust document retrieval across large corpora.

⚖️ Legal & Medical Q&A

Tools where exact terminology (drug names, legal citations) must be matched precisely, but conceptual context also matters. Hybrid retrieval handles both.

📝 Notion AI / Confluence AI

Tools that blend keyword and semantic search over large internal wikis, giving users results that match both the literal words and the intent behind their queries.

04

Agentic RAG (Most powerful)

🕵

Instead of a student doing one lookup, you have a detective working a case.

flowchart TD
    Q([User Query]) --> AG["Agent\nReAct Loop"]

    AG -->|plan + search query| T1[Vector Search Tool]
    AG -->|if needed| T2[Web Search Tool]
    AG -->|if needed| T3[SQL Tool]
    AG -->|if needed| T4[Calculator Tool]

    T1 & T2 & T3 & T4 -->|observation| AG

    AG -->|sufficient context?| CHK{Enough to answer?}
    CHK -->|no: refine query| AG
    CHK -->|yes| P[Prompt Builder]
    P --> LLM[LLM]
    LLM --> R([Response])

How it works

A question comes in.
The agent plans: "What do I actually need to find to answer this?"
It retrieves something.
It reads the result and asks itself: "Do I have enough? What's still missing?"
If not satisfied, it retrieves again with a refined query, or switches tools entirely (web search, SQL, calculator, etc.).
Repeats until it can answer confidently.
Generates the final response.

Key Distinction

In Naive and Hybrid RAG, the LLM is the endpoint of the pipeline. In Agentic RAG, the LLM runs the pipeline.

Technical details

Agentic RAG typically implements a ReAct loop (Reasoning + Acting), where the LLM interleaves thought steps with tool calls:

ReAct trace example

Thought: The user is asking about Q3 revenue. I should search the earnings docs first.
Action: vector_search("Q3 revenue 2024")
Observation: [retrieved chunks about total revenue...]

Thought: These chunks mention total revenue but not the regional breakdown.
Action: vector_search("Q3 revenue breakdown by region 2024")
Observation: [retrieved chunks with regional data...]

Thought: I now have enough to answer.
Final Answer: ...

Query rewriting is another key technique: rather than passing the raw user query into retrieval, the agent rewrites it into a more precise search query based on what it already knows. This significantly improves retrieval quality for multi-hop questions (questions that require chaining multiple pieces of information).
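The control flow of the ReAct loop described above can be sketched as follows. Everything here is an assumption for illustration: `run_agent`, the action dictionary format, and the stubs are hypothetical, and `scripted_decide` replaces the LLM with a fixed script that replays the trace shown earlier, including the rewritten second query. The `max_steps` budget is one simple guard against endless retrieval loops.

```python
def run_agent(question, decide, tools, max_steps=5):
    """Alternate between decisions and tool calls until an answer emerges."""
    history = []  # (tool_name, tool_input, observation) triples so far
    for _ in range(max_steps):
        action = decide(question, history)  # the "Thought" step
        if action["type"] == "final":
            return action["answer"], history
        observation = tools[action["tool"]](action["input"])  # the "Action"
        history.append((action["tool"], action["input"], observation))
    return None, history  # step budget exhausted without an answer


def scripted_decide(question, history):
    """Stand-in for the LLM: a fixed script instead of real reasoning."""
    if not history:
        return {"type": "tool", "tool": "vector_search",
                "input": "Q3 revenue 2024"}
    if len(history) == 1:
        # Rewritten query, refined after reading the first observation.
        return {"type": "tool", "tool": "vector_search",
                "input": "Q3 revenue breakdown by region 2024"}
    return {"type": "final",
            "answer": f"Answer based on {len(history)} searches"}


def fake_vector_search(query):
    """Stub tool that echoes the query instead of searching anything."""
    return f"[retrieved chunks for: {query}]"


answer, history = run_agent(
    "How did Q3 revenue break down by region?",
    scripted_decide,
    {"vector_search": fake_vector_search},
)
print(answer)
```

Swapping `scripted_decide` for a function that prompts a model with the question and the accumulated history turns this skeleton into an actual agent; more tools are added by extending the `tools` dictionary.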

Weaknesses

Multiple retrieval calls mean higher latency and cost.
The agent can get stuck in retrieval loops or make poor decisions about when it has enough information.
Harder to debug and trace than a fixed pipeline.

Real-world examples

🔍 Perplexity AI

Iteratively searches the web, evaluates results, and refines its queries before synthesizing a final answer. The retrieval loop is the core of the product.

📈 Financial Research Assistants

Pull from SEC filings, run SQL queries over earnings data, and cross-reference news sources before generating an investment summary. Multi-source, multi-hop retrieval.

💻 Devin / OpenHands

Iteratively read code, run tests, interpret errors, and search documentation until a bug is resolved. The agent decides when it has gathered enough context to act.

05

Mental Models

Each RAG variant trades complexity for capability. Here is a one-line model for each to help you choose the right one for a given problem.

Type	Mental Model	Best For
Naive RAG	Find → Answer	Simple Q&A over a well-structured document corpus
Hybrid RAG	Find smarter → Answer	Domains with precise terminology or large mixed corpora
Agentic RAG	Plan → Find → Think → Find again → Answer	Multi-hop questions, cross-source reasoning, open-ended research