Machine Learning

The RAG Architecture Deep

How retrieval-augmented generation (RAG) wires a retriever to a generator at query time.

5 min read · intro

Why retrieve at all

A language model only knows what was baked into its weights at training time. Retrieval-augmented generation, or RAG, works around this limit by fetching relevant text from an external store and inserting it into the prompt before the model answers. The model's weights stay frozen, while the knowledge stays fresh and editable.

The two-stage flow

A RAG system splits work into two stages:

  • Retrieval turns the user question into a query, searches a knowledge store, and returns the most relevant passages.
  • Generation places those passages into the prompt as context and asks the model to answer using them.
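The two stages can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the word-overlap scoring is a toy stand-in for a real retriever, the names `retrieve`, `build_prompt`, and `STORE` are hypothetical, and the passages are made-up sample data. The sketch stops at the finished prompt, since the call to the model depends on whichever generator you use.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, passages, k=2):
    """Stage 1: return the k passages sharing the most words with the question.

    Word overlap is an assumption made for brevity; real systems usually
    rank by embedding similarity or a keyword search engine.
    """
    q = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_passages):
    """Stage 2 input: place retrieved passages into the prompt as context."""
    context = "\n".join(f"- {p}" for p in context_passages)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )

# Made-up knowledge store for illustration only.
STORE = [
    "The warranty on the X100 lasts 24 months.",
    "The X100 ships with a USB-C cable.",
    "Our office is closed on public holidays.",
]

question = "How long is the X100 warranty?"
top = retrieve(question, STORE, k=2)
prompt = build_prompt(question, top)
print(prompt)
```

The generator only ever sees `prompt`, so the quality of the answer depends heavily on which passages the retrieval stage selects.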

The offline index

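Before any query arrives, documents are split into chunks, each chunk is embedded into a vector, and the vectors are stored in a vector index. This indexing runs ahead of time and is repeated only when documents are added or changed, so each query needs only a fast lookup instead of reprocessing the whole corpus.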
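The chunk, embed, and store steps might look like the sketch below. Everything here is an assumption for illustration: `embed` is a toy hashed bag-of-words vector standing in for a learned embedding model, the index is a plain Python list instead of a vector database, and the chunk size of 8 words and the sample documents are arbitrary.

```python
import math
import re
import zlib

DIM = 64  # toy vector size, chosen arbitrarily for this sketch

def chunk(text, size=8):
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Toy embedding: hashed bag-of-words counts, L2-normalized."""
    vec = [0.0] * DIM
    for tok in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(tok.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]

def build_index(documents):
    """Offline step: chunk every document and store (vector, chunk) pairs."""
    return [(embed(c), c) for doc in documents for c in chunk(doc)]

def search(index, question, k=1):
    """Query step: embed the question and rank chunks by cosine similarity."""
    q = embed(question)
    def score(item):
        return sum(a * b for a, b in zip(q, item[0]))
    ranked = sorted(index, key=score, reverse=True)
    return [text for _, text in ranked[:k]]

# Made-up documents for illustration only.
DOCS = [
    "Refunds are accepted within 30 days of purchase with a receipt. "
    "Store credit is issued after 30 days.",
    "Shipping takes five business days.",
]

index = build_index(DOCS)  # runs ahead of time
print(search(index, "How long does shipping take?", k=1))
```

Because the vectors are normalized, the dot product in `search` equals cosine similarity. Only `search` runs per query; `build_index` is the part done in advance.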

Why it matters

RAG lets you update knowledge by changing the store rather than retraining the model. It also grounds answers in retrieved sources, which reduces made-up facts and lets the system cite where each claim came from.
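Both points can be seen in a small sketch, assuming a hypothetical `KnowledgeStore` class keyed by source id. The "generator" here is a stand-in that simply quotes the retrieved passage, which is enough to show that editing the store changes the answer while the answering code stays untouched, and that the source id can travel with the passage as a citation. The file names and prices are made-up sample data.

```python
import re

def tokens(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class KnowledgeStore:
    """Editable store mapping a source id to its passage text (hypothetical)."""

    def __init__(self):
        self.passages = {}

    def upsert(self, source_id, text):
        """Add a passage, or overwrite the one stored under source_id."""
        self.passages[source_id] = text

    def top_passage(self, question):
        """Return the (source_id, text) pair with the largest word overlap."""
        q = tokens(question)
        return max(self.passages.items(), key=lambda kv: len(q & tokens(kv[1])))

def answer_with_citation(store, question):
    """Stand-in for generation: quote the retrieved passage and cite its source."""
    source_id, text = store.top_passage(question)
    return f"{text} [source: {source_id}]"

store = KnowledgeStore()
store.upsert("pricing.md", "The Pro plan costs 10 dollars per month.")
store.upsert("support.md", "Support is available on weekdays.")

question = "How much does the Pro plan cost?"
before = answer_with_citation(store, question)

# The knowledge changes: only the store is edited, none of the code above.
store.upsert("pricing.md", "The Pro plan costs 12 dollars per month.")
after = answer_with_citation(store, question)

print(before)
print(after)
```

In a real system the frozen model plays the role of `answer_with_citation`, and the update is a re-index of the changed document.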

Key idea

RAG separates a frozen generator from an editable knowledge store, retrieving relevant passages at query time so answers stay current and grounded in real sources.

Check yourself

1. What does the retrieval stage add to the prompt?

2. Why does RAG let you update knowledge without retraining?