
RAG

(giving it a textbook)

RAG stands for Retrieval-Augmented Generation. It means the LLM reads a document before answering your question. You ask a question. A search engine finds the right pages in your library. The pages get pasted into the model’s context. The model answers with those pages in front of it. This is how a chatbot can cite your company’s internal docs without being retrained.

[Diagram: USER (input) → LLM (language model) → RESPONSE, with KNOWLEDGE (your documents) feeding into the LLM]

Knowledge turned on. The model now reads your documents before answering.

The process

Your documents get chopped into small chunks, a paragraph or two each. An embedding model converts each chunk into a numerical fingerprint, called an embedding. These fingerprints go into a special database built for similarity search.

You ask a question. The question also gets a fingerprint. The database finds the chunks whose fingerprints are closest to your question’s fingerprint. Those chunks are the ones most likely to answer the question.
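The "closest fingerprint" idea can be sketched in a few lines. This is a toy illustration, not a real system: the three-number fingerprints and the example chunks are made up, and real embeddings have hundreds of dimensions produced by a trained model.

```python
import math

def cosine_similarity(a, b):
    """How closely two fingerprints point in the same direction (1.0 = identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-number fingerprints; a real embedding model would produce these.
chunks = {
    "Refunds are issued within 14 days.": [0.9, 0.1, 0.2],
    "Our office is closed on Sundays.":   [0.1, 0.8, 0.3],
}

# Pretend fingerprint of the question "How long do refunds take?"
question_fingerprint = [0.85, 0.15, 0.25]

# The database's job: find the chunk whose fingerprint is closest.
best = max(chunks, key=lambda text: cosine_similarity(chunks[text], question_fingerprint))
print(best)  # the refund chunk wins
```

The refund chunk's fingerprint points in nearly the same direction as the question's, so it scores highest; that is the entire trick behind retrieval.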

The LLM receives your question with the top few chunks pasted above it. The model reads both and writes an answer based on what the chunks said. If the chunks mention your company’s refund policy, the answer mentions that policy. If nothing relevant exists in your library, the model says so.
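The "pasted above it" step is ordinary string assembly. A minimal sketch, with an instruction wording I've made up for illustration (real systems tune this prompt heavily):

```python
def build_prompt(question, chunks):
    """Paste the retrieved chunks above the question before sending to the LLM."""
    context = "\n\n".join(f"[Document {i + 1}]\n{c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the documents below. "
        "If they do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

prompt = build_prompt(
    "What is the refund window?",
    ["Refunds are issued within 14 days of purchase."],
)
print(prompt)
```

The final instruction ("say so" when nothing relevant appears) is what lets the model admit that your library has no answer instead of inventing one.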

You’ve encountered this when…

You chatted with a customer support bot on a bank or software website and it answered specifically about your account or their product. That is almost always RAG. Until recently, those bots were scripted and useless. They got smart when RAG arrived.

A familiar example

Think about asking a colleague a policy question. If they have read the handbook recently, they answer from memory. If they haven’t, they say “hold on” and open the handbook to the right page before answering. RAG is the second behaviour. The LLM pauses, flips to the right page, reads it, then responds.

A typical RAG system retrieves between 3 and 10 chunks per question. Each chunk is a few hundred words. The LLM sees all of them plus your question, then writes the answer.

Variants include

Vector databases

The storage system where embeddings live. Pinecone, Weaviate, Qdrant, and pgvector are the common names. Most companies use one of these without thinking about it, hidden inside a product.

Reranking

A second pass after retrieval. The first pass finds 50 chunks. A smaller model scores each chunk’s relevance. The top 5 go to the LLM. Reranking improves answers noticeably on hard questions.
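The reranking pass can be sketched as a sort-and-trim. Here a toy word-overlap score stands in for the smaller scoring model, and the candidate chunks are invented for illustration:

```python
def rerank(question, candidates, score_fn, keep=5):
    """Second pass: score every retrieved chunk, keep only the best few."""
    ranked = sorted(candidates, key=lambda chunk: score_fn(question, chunk), reverse=True)
    return ranked[:keep]

# Toy relevance score: shared words. A real reranker is a trained model.
def word_overlap(question, chunk):
    return len(set(question.lower().split()) & set(chunk.lower().split()))

candidates = [
    "Refunds are issued within 14 days.",
    "The cafeteria menu changes weekly.",
    "Refund requests go through the billing portal.",
]

top = rerank("how do I request a refund", candidates, word_overlap, keep=2)
```

The first pass casts a wide net (50 chunks); the second pass, being slower but more careful, only has to judge those 50. That split is what makes reranking affordable.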

Graph RAG

Instead of chunks in a flat pile, the documents are arranged as a graph of connected ideas. The retrieval follows links between concepts. Better for complex questions that span many documents, but more work to set up.

The breaking point

RAG does not make the LLM smarter. It hands the model the right page of the textbook before asking the question. If the textbook is wrong, the answer is wrong. If the chunks miss the key paragraph, the model invents a plausible answer from the chunks it did get. Garbage in, garbage out, at the speed of a chatbot.

Your takeaway

When a company chatbot answers specifically about your account or their product, a RAG system fetched the right document a second before you saw the answer. This is the unglamorous workhorse behind most “smart” AI features in production today.
