Why retrieve at all
A language model only knows what was baked into its weights at training time. Retrieval-augmented generation, or RAG, works around this by fetching relevant text from an external store and inserting it into the prompt before the model answers. The model stays frozen while the knowledge stays fresh and editable.
The two-stage flow
A RAG system splits work into two stages:
- Retrieval turns the user question into a query, searches a knowledge store, and returns the most relevant passages.
- Generation places those passages into the prompt as context and asks the model to answer using them.
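The two stages can be sketched in a few lines of Python. This is a minimal illustration, not a production design: `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical helper names, the "embedding" is a toy bag-of-words vector standing in for a real embedding model, and the sample store is made up. The sketch stops at the finished prompt, since the call to the model depends on whichever model you use.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding (assumption): L2-normalized word counts, not a real model."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}

def cosine(a, b):
    """Cosine similarity of two already-normalized sparse vectors."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())

def retrieve(question, store, k=2):
    """Stage 1: rank passages by similarity to the question, keep the top k."""
    q = embed(question)
    # Passages are embedded here for brevity; a real system would look up
    # vectors computed ahead of time.
    ranked = sorted(store, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

def build_prompt(question, passages):
    """Stage 2 (first half): place the passages into the prompt as context."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Made-up knowledge store for illustration.
store = [
    "The warranty covers parts and labor for two years.",
    "Returns are accepted within thirty days of purchase.",
    "Our office is closed on public holidays.",
]
question = "How long is the warranty?"
passages = retrieve(question, store, k=1)
prompt = build_prompt(question, passages)
print(prompt)  # this string is what would be sent to the model
```

The numbered `[1]`, `[2]` labels in the context are one simple way to let the model point back at the passage it used.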
The offline index
Before any query, documents are split into chunks, each chunk is embedded into a vector, and the vectors are stored in a vector index. This indexing step happens ahead of time and is repeated only for documents that change, so queries stay fast.
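The indexing steps might look like the sketch below. It assumes fixed-size word chunks and a toy bag-of-words "embedding"; real systems typically use a learned embedding model and a dedicated vector index. The names `chunk`, `embed`, and `build_index` are hypothetical, and the plain list of pairs only stands in for a vector index.

```python
import math
import re
from collections import Counter

def chunk(text, max_words=40):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def embed(text):
    """Toy embedding (assumption): L2-normalized word counts, not a real model."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}

def build_index(documents, max_words=40):
    """Chunk every document, embed each chunk, and store (chunk, vector) pairs."""
    index = []
    for doc in documents:
        for piece in chunk(doc, max_words):
            index.append((piece, embed(piece)))
    return index

# Made-up documents for illustration.
documents = [
    "RAG fetches relevant text from an external store and adds it to the prompt.",
    "Documents are split into chunks and each chunk is embedded into a vector.",
]
index = build_index(documents, max_words=8)
for piece, vector in index:
    print(len(vector), "terms:", piece)
```

Because the vectors are computed here rather than at query time, a query only has to embed the question and compare it against what is already stored.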
Why it matters
RAG lets you update knowledge by changing the store, not retraining the model. It grounds answers in real sources, which reduces made-up facts and lets the system cite where each claim came from.
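A small sketch of both points: editing one store entry changes what gets retrieved, and keeping a source ID next to each passage gives the system something to cite. To stay short it ranks by plain word overlap instead of embeddings, and the source IDs, passages, and the `tokens` and `retrieve` helpers are all made up for illustration.

```python
import re

def tokens(text):
    """Lowercased set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, store):
    """Return the (source_id, passage) pair sharing the most words with the question."""
    q = tokens(question)
    return max(store.items(), key=lambda item: len(q & tokens(item[1])))

# Hypothetical store keyed by source ID, so each passage can be cited.
store = {
    "policy.txt#1": "Returns are accepted within 30 days of purchase.",
    "hours.txt#1": "The store is open from 9 to 5 on weekdays.",
}
question = "Within how many days are returns accepted?"
before = retrieve(question, store)

# The policy changes: edit the store entry. No model weights are touched.
store["policy.txt#1"] = "Returns are accepted within 60 days of purchase."
after = retrieve(question, store)

print(before)
print(after)
```

The same question now retrieves the updated passage, and the source ID returned with it can be shown alongside the answer.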
Key idea
RAG separates a frozen generator from an editable knowledge store, retrieving relevant passages at query time so answers stay current and grounded in real sources.