RAG
(giving it a textbook)
RAG stands for Retrieval-Augmented Generation. It means the LLM reads a document before answering your question. You ask a question. A search engine finds the right pages in your library. The pages get pasted into the model’s context. The model answers with those pages in front of it. This is how a chatbot can cite your company’s internal docs without being retrained.
Knowledge turned on. The model now reads your documents before answering.
The process
Your documents get chopped into small chunks, a paragraph or two each. An embedding model converts each chunk into a numerical fingerprint, called an embedding. These fingerprints go into a database built to search by similarity.
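Here is a minimal sketch of that indexing step in Python. Everything in it is illustrative: the handbook text, the chunk and embed functions, and the plain list standing in for the database. The embed function is a toy that counts hashed words; a real system would call a trained embedding model instead.

```python
import hashlib

def chunk(document: str) -> list[str]:
    # Split on blank lines so each chunk is roughly a paragraph.
    return [p.strip() for p in document.split("\n\n") if p.strip()]

def embed(text: str, dims: int = 64) -> list[float]:
    # Toy fingerprint: hash each word into one of `dims` buckets and count.
    # A real system would call an embedding model here instead.
    vec = [0.0] * dims
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dims] += 1.0
    return vec

handbook = (
    "Refunds are issued within 14 days of purchase.\n\n"
    "Standard shipping takes 3 to 5 business days."
)
index = [(text, embed(text)) for text in chunk(handbook)]  # the "database"
```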
You ask a question. The question also gets a fingerprint. The database finds the chunks whose fingerprints are closest to your question’s fingerprint. Those chunks are the ones most likely to answer the question.
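Continuing the toy index above, the search step is just a similarity comparison between fingerprints. Real vector databases do the same comparison, only far faster and at much larger scale.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Similarity between two fingerprints: higher means more alike.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def retrieve(question: str, index, top_k: int = 3) -> list[str]:
    q = embed(question)  # the question gets a fingerprint too
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

print(retrieve("How long do refunds take?", index))
```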
The LLM receives your question with the top few chunks pasted above it. The model reads both and writes an answer based on what the chunks said. If the chunks mention your company’s refund policy, the answer mentions that policy. If nothing relevant exists in your library, the model says so.
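The last step is just pasting the retrieved chunks above the question and sending the whole thing to the model. In this sketch, ask_llm is a placeholder for whichever LLM API you actually use.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    context = "\n\n".join(chunks)
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

question = "How long do refunds take?"
prompt = build_prompt(question, retrieve(question, index))
# answer = ask_llm(prompt)   # placeholder: call your LLM of choice here
```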
You’ve encountered this when…
You chatted with a customer support bot on a bank or software website and it answered specifically about your account or their product. That is almost always RAG. Two years ago, those bots were scripted and useless. They got smart when RAG arrived.
A familiar example
Think about asking a colleague a policy question. If they have read the handbook recently, they answer from memory. If they haven’t, they say “hold on” and open the handbook to the right page before answering. RAG is the second behaviour. The LLM pauses, flips to the right page, reads it, then responds.
A typical RAG system retrieves between 3 and 10 chunks per question. Each chunk is around 500 words. The LLM sees all of them plus your question, then writes the answer.
Variants include
Vector databases
The storage system where embeddings live. Pinecone, Weaviate, Qdrant, and pgvector are the common names. Most companies use one of these without thinking about it, hidden inside a product.
Reranking
A second pass after retrieval. The first pass pulls back a generous set, say 50 chunks. A second model then reads the question and each chunk together and scores how relevant the chunk really is. The top 5 go to the LLM. Reranking improves answers noticeably on hard questions.
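A rough sketch of that second pass, with a simple word-overlap score standing in for a real reranker model:

```python
def overlap_score(question: str, chunk_text: str) -> float:
    # Toy relevance score: fraction of question words found in the chunk.
    # A real reranker is a model that reads question and chunk together.
    q_words = set(question.lower().split())
    c_words = set(chunk_text.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0

def rerank(question: str, candidates: list[str], keep: int = 5) -> list[str]:
    # The first pass retrieved many candidates; keep only the best few.
    ranked = sorted(candidates, key=lambda c: overlap_score(question, c), reverse=True)
    return ranked[:keep]
```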
Graph RAG
Instead of chunks in a flat pile, the documents are arranged as a graph of connected ideas. The retrieval follows links between concepts. Better for complex questions across many documents, more work to set up.
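As a toy picture of the idea: chunks become nodes, links between related chunks become edges, and retrieval walks outward from the best-matching node. Real Graph RAG systems build and traverse these graphs far more carefully; the node names below are made up for illustration.

```python
from collections import deque

# Toy knowledge graph: each node is a chunk, edges link related chunks.
graph = {
    "refund-policy": ["payment-methods", "return-shipping"],
    "payment-methods": ["refund-policy"],
    "return-shipping": ["refund-policy", "warehouse-hours"],
    "warehouse-hours": ["return-shipping"],
}

def retrieve_connected(start: str, max_hops: int = 2) -> list[str]:
    # Walk outward from the best-matching node, collecting linked chunks.
    seen, queue, result = {start}, deque([(start, 0)]), []
    while queue:
        node, hops = queue.popleft()
        result.append(node)
        if hops < max_hops:
            for neighbour in graph.get(node, []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, hops + 1))
    return result

print(retrieve_connected("refund-policy"))  # ['refund-policy', 'payment-methods', ...]
```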
The breaking point
RAG does not make the LLM smarter. It hands the model the right page of the textbook before asking the question. If the textbook is wrong, the answer is wrong. If the chunks miss the key paragraph, the model invents a plausible answer from the chunks it did get. Garbage in, garbage out, at the speed of a chatbot.
Your takeaway
When a company chatbot answers specifically about your account or their product, a RAG system fetched the right document a second before you saw the answer. This is the unglamorous workhorse behind most “smart” AI features in production today.