Retrieval Augmented Generation Pipeline
A language model knows only what was in its training data, and it can confidently invent facts, a failure called hallucination. Retrieval augmented generation, or RAG, addresses both problems by fetching relevant documents at question time and feeding them to the model as context.
The pipeline runs in four stages:
- Chunk and index, splitting source documents into passages, embedding each, and storing them in a vector database
- Retrieve, embedding the user question and pulling the most similar passages with semantic search
- Augment, inserting those passages into the prompt alongside the question
- Generate, where the model answers using the supplied context
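The four stages above can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions: `embed` uses plain word counts as a stand-in for a learned embedding model, an in-memory list stands in for a vector database, and the generate step is left out because it depends on whichever model you call. The function names and the sample passages are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding (assumption): lowercase word counts, not a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def build_index(passages):
    """Index stage: store each passage next to its vector (stand-in for a vector database)."""
    return [(p, embed(p)) for p in passages]

def retrieve(index, question, k=2):
    """Retrieve stage: embed the question and return the k most similar passages."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [p for p, _ in ranked[:k]]

def augment(question, passages):
    """Augment stage: place numbered passages in the prompt alongside the question."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical passages, already chunked.
passages = [
    "The warranty covers battery defects for two years.",
    "Returns are accepted within thirty days of purchase.",
    "Our office is closed on public holidays.",
]
index = build_index(passages)
question = "How long does the warranty cover the battery?"
top = retrieve(index, question, k=1)
prompt = augment(question, top)
print(prompt)  # this string would be sent to the model in the generate stage
```

Numbering the passages in the prompt is one simple way to let the model refer back to the text it used.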
Because the answer is grounded in retrieved text, RAG reduces hallucination, and the model can cite the passages it used. It also lets you update knowledge by changing the document store, with no retraining of the model.
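Design choices shape quality. Chunk size trades context against precision: chunks that are too large dilute relevance, and chunks that are too small lose the surrounding meaning. The number of retrieved passages affects both coverage and prompt length, since fetching more passages raises the chance of including the right one but leaves less room in the prompt.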
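To make the chunk-size trade-off concrete, here is a rough sketch of a word-based splitter. The `overlap` parameter is an addition that the text above does not describe; it is a common way to keep some shared context between neighboring chunks. The function name and the sample text are hypothetical, and real systems often split on sentences or tokens instead of words.

```python
def chunk_words(text, size, overlap=0):
    """Split text into chunks of up to `size` words.

    `overlap` (an assumption, not from the text above) repeats that many
    words from the end of one chunk at the start of the next.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and less than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

text = "one two three four five six seven eight nine ten"
small = chunk_words(text, size=3)             # many short, precise chunks
large = chunk_words(text, size=6, overlap=2)  # fewer, broader chunks
print(len(small), small)
print(len(large), large)
```

On this ten-word input, the small setting yields four chunks and the large setting yields two, which shows how the same text becomes either many narrow passages or a few broad ones.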
The dominant failure mode is a retrieval miss. If the right passage is not fetched, the model cannot use it, so the answer ends up incomplete or wrong. This makes a strong retriever and good chunking the heart of a reliable RAG system.
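One way to check the retriever on its own is to measure how often the known-relevant passage shows up in the top k results. The sketch below assumes a small hand-labeled set of question and passage-id pairs, and it uses a toy shared-word retriever instead of semantic search. The passage ids, the questions, and the function names are all hypothetical.

```python
import re

def tokens(text):
    """Lowercase word set for a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve_ids(passages, question, k):
    """Toy retriever (assumption): rank passage ids by shared-word count."""
    q = tokens(question)
    ranked = sorted(
        passages,
        key=lambda pid: len(q & tokens(passages[pid])),
        reverse=True,
    )
    return ranked[:k]

def hit_rate_at_k(passages, labeled, k):
    """Fraction of questions whose labeled passage appears in the top k."""
    hits = sum(
        1 for question, gold in labeled
        if gold in retrieve_ids(passages, question, k)
    )
    return hits / len(labeled)

passages = {
    "p1": "The warranty covers battery defects for two years.",
    "p2": "Returns are accepted within thirty days of purchase.",
    "p3": "Our office is closed on public holidays.",
}
# Hand-labeled pairs: (question, id of the passage that answers it).
labeled = [
    ("How long is the battery warranty?", "p1"),
    ("When is the office closed?", "p3"),
    ("Can I send an item back for a refund?", "p2"),
]
rate = hit_rate_at_k(passages, labeled, k=1)
print(f"hit rate at k=1: {rate:.2f}")
```

Here the third question shares no words with the returns passage, so the toy retriever misses it and the hit rate at k=1 is two out of three. No prompt wording can recover from that miss, because the model never sees the passage.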
Key idea
RAG retrieves relevant passages and feeds them as context so a language model answers from grounded text, reducing hallucination without retraining.