Grounding generation in retrieval
Retrieval-augmented generation (RAG) answers a question by first fetching relevant passages and then asking a language model to answer from them. This grounds the model in real, current, source-backed text instead of relying only on what it absorbed into its trained weights.
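The "answer using them" step usually comes down to how the prompt is assembled. Below is a minimal sketch that assumes a plain-text prompt with numbered passages. The function name `build_grounded_prompt` and the instruction wording are illustrative choices, not a fixed API or a required template.

```python
def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Assemble a prompt asking the model to answer only from numbered passages.

    Hypothetical helper: the instruction text is one reasonable wording, not a standard.
    """
    # Number each passage so the answer can cite it as [1], [2], ...
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the passages below. "
        "Cite passages by number, and say so if they do not contain the answer.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_grounded_prompt(
    "When was the library first released?",
    ["The library was first released in 2019.", "It is written in Rust."],
)
print(prompt)
```

Telling the model to admit when the passages lack the answer is one common way to reduce guessing when retrieval misses.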
The full flow
- Index time: documents are chunked, each chunk is embedded, and vectors plus metadata land in a vector store.
- Query time: the question is embedded, candidates are retrieved, optionally reranked, then passed as context to the generator.
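Both phases can be sketched end to end in a few lines. This is a toy under loud assumptions: `embed` is a hypothetical stand-in for a real embedding model (a sparse term-frequency vector compared by cosine similarity), and `VectorStore` is a hypothetical in-memory list rather than any real vector database. Reranking and generation are left out here.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of `size` words (assumes size > overlap)."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    # Start a new window only while it would add words beyond the previous window's overlap.
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]


def embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a sparse term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class VectorStore:
    """Toy in-memory store holding (vector, chunk text, metadata) triples."""

    def __init__(self) -> None:
        self.entries: list[tuple[Counter, str, dict]] = []

    def add(self, text: str, metadata: dict) -> None:
        self.entries.append((embed(text), text, metadata))

    def search(self, query: str, k: int = 3) -> list[tuple[float, str, dict]]:
        q = embed(query)  # query time: embed the question with the same "model"
        scored = [(cosine(q, vec), text, meta) for vec, text, meta in self.entries]
        scored.sort(key=lambda s: s[0], reverse=True)
        return scored[:k]


# Index time: chunk each document, embed each chunk, store vector plus metadata.
docs = {
    "returns.txt": "Customers may return items within 30 days of delivery for a full refund.",
    "shipping.txt": "Standard shipping takes five business days; express shipping takes two.",
}
store = VectorStore()
for source, text in docs.items():
    for i, piece in enumerate(chunk(text)):
        store.add(piece, {"source": source, "chunk": i})

# Query time: retrieve candidates (reranking and generation would follow).
top = store.search("How many days do I have to return an item?", k=2)
for score, text, meta in top:
    print(f"{score:.3f} {meta['source']}#{meta['chunk']}: {text}")
```

Keeping the metadata next to each vector is what later lets the generator cite a source and chunk rather than just quoting text.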
Why each stage matters
- Chunking sets what can be retrieved at all.
- Retrieval decides which passages enter the context.
- Reranking sharpens the order so the best passages lead.
- Generation writes the answer and should cite its sources.
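The reranking stage takes the retrieved candidates and reorders them with a second, usually more expensive, scorer. The sketch below assumes a hypothetical `overlap_score` (the share of question terms that appear in a passage) purely for illustration; production systems typically use a cross-encoder model at this point instead.

```python
import re


def terms(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(question: str, passage: str) -> float:
    """Hypothetical reranking score: fraction of question terms found in the passage."""
    q = terms(question)
    return len(q & terms(passage)) / len(q) if q else 0.0


def rerank(question: str, candidates: list[str], top_n: int = 3) -> list[str]:
    # Reorder retrieved candidates so the best-scoring passages lead the context.
    ranked = sorted(candidates, key=lambda p: overlap_score(question, p), reverse=True)
    return ranked[:top_n]


candidates = [
    "Express shipping takes two business days.",
    "Refunds are issued to the original payment method.",
    "Refunds for returned items are issued within ten days.",
]
best = rerank("When are refunds issued for returned items?", candidates, top_n=2)
print(best)
```

Trimming to `top_n` after reranking is also one way to limit distracting context: weaker passages never reach the generator.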
Where it goes wrong
- Missing context: retrieval fails, so the model guesses or hallucinates.
- Distracting context: irrelevant passages crowd out the right one.
- Ignored context: the model has the answer but does not use it.
A good RAG system tunes every stage and evaluates retrieval and generation separately, since a weak link anywhere caps the quality of the whole system.
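On the retrieval side, two common metrics are recall@k (how many of the relevant chunks made it into the top k) and mean reciprocal rank (how high the first relevant chunk appears). The sketch below assumes a small hand-labeled evaluation set; the chunk IDs are hypothetical. Generation would be scored separately, for example by checking whether each answer is supported by the passages it cites.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical labels: retrieved chunk IDs in rank order, plus the IDs marked relevant.
eval_set = [
    (["c7", "c2", "c9"], {"c2"}),
    (["c4", "c1", "c3"], {"c8"}),  # retrieval miss: the generator would be left guessing
    (["c5", "c6", "c0"], {"c5", "c0"}),
]
k = 3
mean_recall = sum(recall_at_k(r, rel, k) for r, rel in eval_set) / len(eval_set)
mrr = sum(reciprocal_rank(r, rel) for r, rel in eval_set) / len(eval_set)
print(f"recall@{k} = {mean_recall:.2f}, MRR = {mrr:.2f}")  # recall@3 = 0.67, MRR = 0.50
```

A low recall here points at chunking or retrieval; high recall with poor answers points at reranking or generation instead.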
Key idea
RAG chunks and indexes documents, then at query time retrieves, reranks, and feeds passages to a generator that answers from them, so the system is only as strong as its weakest stage.