The RAG Architecture Deep Dive

How retrieval-augmented generation wires a retriever to a generator at query time.

Why retrieve at all

A language model knows only what was baked into its weights at training time. Retrieval-augmented generation, or RAG, addresses this by fetching relevant text from an external store and inserting it into the prompt before the model answers. The model stays frozen while the knowledge stays fresh and editable.

The two-stage flow

A RAG system splits work into two stages:

Retrieval turns the user question into a query, searches a knowledge store, and returns the most relevant passages.
Generation places those passages into the prompt as context and asks the model to answer using them.
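
The two stages above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the word-overlap scorer stands in for a real retriever, and STORE, retrieve, build_prompt, answer, and echo_model are hypothetical names chosen for this example. The generator is passed in as a plain function so that any model client could be swapped in.

```python
from __future__ import annotations

import re
from typing import Callable


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Stage 1: rank passages by word overlap with the question.

    Assumption: a toy scorer. A real system would typically use
    vector similarity or a keyword search engine here.
    """
    q = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Place the retrieved passages into the prompt as numbered context."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(context, start=1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer(question: str, passages: list[str],
           generate: Callable[[str], str], k: int = 2) -> str:
    """Stage 2: hand the assembled prompt to the generator."""
    return generate(build_prompt(question, retrieve(question, passages, k)))


# Hypothetical knowledge store, for illustration only.
STORE = [
    "The warranty covers parts and labor for two years.",
    "Returns are accepted within 30 days of purchase.",
    "Our office is closed on public holidays.",
]


def echo_model(prompt: str) -> str:
    """Placeholder generator: returns the prompt so the flow can be inspected."""
    return prompt


if __name__ == "__main__":
    print(answer("How long is the warranty?", STORE, echo_model, k=1))
```

Running the script prints the assembled prompt, which shows exactly what the model would receive. Replacing echo_model with a call to an actual model turns the sketch into a working pipeline without touching the retrieval code.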

The offline index

Before any query arrives, documents are split into chunks, each chunk is embedded into a vector, and the vectors are stored in a vector index. This indexing step runs ahead of time and is repeated only when the documents change, so queries stay fast.
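
As a rough sketch of the chunk, embed, and store steps, the Python below builds a tiny in-memory index. Everything here is an assumption made for illustration: the fixed word-count chunker, the hashed bag-of-words embed function (a stand-in for a real embedding model), the 64-dimension size, and the list of tuples used in place of a real vector index.

```python
from __future__ import annotations

import hashlib
import math

DIM = 64  # assumed toy embedding size


def chunk_text(text: str, size: int = 8) -> list[str]:
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str, dim: int = DIM) -> list[float]:
    """Toy hashed bag-of-words embedding, L2-normalized.

    Assumption: stands in for a learned embedding model. SHA-256 is used
    instead of the built-in hash() so buckets are stable across runs.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(documents: list[str], size: int = 8) -> list[tuple[str, list[float]]]:
    """Offline step: chunk every document and store (chunk, vector) pairs."""
    return [(chunk, embed(chunk))
            for doc in documents
            for chunk in chunk_text(doc, size)]


def search(index: list[tuple[str, list[float]]], query: str, k: int = 1) -> list[str]:
    """Query-time step: rank stored chunks by cosine similarity to the query.

    Vectors are unit length, so the dot product equals the cosine.
    """
    q = embed(query)
    ranked = sorted(index,
                    key=lambda item: sum(a * b for a, b in zip(q, item[1])),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


if __name__ == "__main__":
    index = build_index(["alpha beta gamma delta", "epsilon zeta"], size=2)
    print([chunk for chunk, _ in index])
    print(search(index, "alpha beta", k=1))
```

Only search runs at query time. The chunking and embedding cost is paid in build_index, which is the point of doing the indexing ahead of time. The linear scan in search is fine for a handful of chunks; a production vector index would use an approximate nearest-neighbor structure instead.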

Why it matters

RAG lets you update knowledge by changing the store rather than retraining the model. It grounds answers in real sources, which tends to reduce made-up facts and lets the system cite where each claim came from.

Key idea