AI reading your files
This is how AI reads your own files before answering you.
RAG stands for Retrieval-Augmented Generation. LLM means large language model. With RAG, the LLM reads relevant documents before answering your question. You ask a question. A search engine finds the right pages in your library. The pages get pasted into the model’s context. The model answers with those pages in front of it. This is how a chatbot can cite your company’s internal docs without being retrained.
The process
Your documents get chopped into small chunks, a paragraph or two each. A computer converts each chunk into a numerical fingerprint, called an embedding. These fingerprints go into a special database.
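The chopping and fingerprinting steps can be sketched in a few lines. This is a toy illustration, not how a production system does it: `chunk_text` and `toy_embed` are hypothetical helper names, the 50-word chunk size is arbitrary, and the word-hashing trick is only a stand-in for a real embedding model, which would capture meaning rather than spelling.

```python
import math
import zlib

def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def toy_embed(chunk: str, dims: int = 16) -> list[float]:
    """Toy stand-in for an embedding model: hash each word into one of `dims` slots."""
    vec = [0.0] * dims
    for word in chunk.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dims] += 1.0
    # Scale to length 1 so fingerprints of long and short chunks are comparable.
    length = math.sqrt(sum(v * v for v in vec))
    return [v / length for v in vec] if length else vec

# 180 words of sample text, standing in for your documents.
document = "Refunds are issued within 14 days. " * 30
chunks = chunk_text(document, max_words=50)

# A plain list stands in for the "special database" of fingerprints.
database = [(chunk, toy_embed(chunk)) for chunk in chunks]
```

Each entry in `database` pairs a chunk with its fingerprint, which is all the search step needs.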
You ask a question. The question also gets a fingerprint. The database finds the chunks whose fingerprints are closest to your question’s fingerprint. Those chunks often answer the question.
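Here is a minimal sketch of that "closest fingerprint" search. It assumes a toy fingerprint made of word counts (the hypothetical `embed` function) and measures closeness with cosine similarity, one common choice. A real system would use an embedding model and a vector database instead of sorting a Python list.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy fingerprint: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count fingerprints (1.0 means identical direction)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(question, chunks, k=2):
    """Return the k chunks whose fingerprints are closest to the question's."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes three to five business days.",
]
best = top_k("How many days do refunds take?", chunks, k=1)
print(best)
```

The refund chunk wins because it shares the most words with the question. Word counts only match exact spellings; real embeddings can also match a question about "getting my money back" to the refund chunk.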
The LLM receives your question with the top few chunks pasted above it. The model reads both and writes an answer based on what the chunks said. If the chunks mention your company’s refund policy, the answer mentions that policy. If nothing relevant exists in your library, a well-instructed model says so, though this is not guaranteed.
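The "pasting" step is plain text assembly. The sketch below shows one possible layout. The function name `build_prompt` and the wording of the instruction are assumptions made for illustration; real products phrase and order their prompts in many different ways.

```python
def build_prompt(question, chunks):
    """Put the retrieved chunks above the question, with an instruction to stay within them."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt(
    "How long do refunds take?",
    ["Refunds are issued within 14 days of purchase."],
)
print(prompt)
```

The resulting string is what gets sent to the LLM. The instruction in the first line is what nudges the model to admit when the chunks do not cover the question.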
You’ve encountered this when…
You chatted with a customer support bot on a bank or software website and it answered a question about your account or their product. That is often RAG. Not long ago, most of those bots were scripted and of little use. Many of them got smart when RAG arrived.
A familiar example
Think about asking a colleague a policy question. If they read the handbook last week, they answer from memory. If they haven’t, they say “hold on” and open the handbook to the right page before answering. RAG is the second behaviour. The LLM pauses, flips to the right page, reads it, then responds.
A typical RAG system retrieves between 3 and 10 chunks per question. Each chunk is around 500 words. The LLM sees all of them plus your question, then writes the answer.
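To see how much text that is, here is the arithmetic as a short sketch. It assumes the common rule of thumb that one token is about 0.75 English words; the true ratio varies by model and language, and `context_tokens` is a hypothetical helper name.

```python
WORDS_PER_TOKEN = 0.75  # assumption: rough rule of thumb for English text

def context_tokens(num_chunks, words_per_chunk=500):
    """Estimate how many tokens the retrieved chunks add to the prompt."""
    return round(num_chunks * words_per_chunk / WORDS_PER_TOKEN)

low = context_tokens(3)    # 3 chunks  -> 1,500 words
high = context_tokens(10)  # 10 chunks -> 5,000 words
print(low, high)
```

Under these assumptions the retrieved chunks add roughly 2,000 to 6,700 tokens to each question.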
Variants and building blocks
Vector databases
The storage system where embeddings live. Pinecone, Weaviate, Qdrant, and pgvector are common names. Many companies use one of these without thinking about it, because it is hidden inside a product they already use.
Reranking
A second pass after retrieval. The first pass finds, say, 50 chunks. A smaller, specialised model then scores how relevant each chunk is to the question. The top 5 go to the LLM. Reranking often improves answers on hard questions.
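The two-pass idea can be sketched as follows. Both scoring functions here are toy stand-ins chosen for illustration: the first pass counts shared words, and the "reranker" counts shared words after ignoring a small hypothetical list of filler words. A real reranker is a trained model, not a word count.

```python
import re

STOPWORDS = {"the", "is", "a", "what", "where", "are", "our"}  # tiny illustrative list

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def first_pass_score(question, chunk):
    """Cheap and crude: count every shared word, filler words included."""
    return len(tokens(question) & tokens(chunk))

def rerank_score(question, chunk):
    """More careful stand-in: count only shared words that carry meaning."""
    return len((tokens(question) - STOPWORDS) & tokens(chunk))

def retrieve_then_rerank(question, chunks, first_k=50, final_k=5):
    # Pass 1: keep the first_k chunks with the best cheap score.
    candidates = sorted(chunks, key=lambda c: first_pass_score(question, c), reverse=True)[:first_k]
    # Pass 2: re-score only those candidates and keep the best final_k.
    return sorted(candidates, key=lambda c: rerank_score(question, c), reverse=True)[:final_k]

chunks = [
    "The office is the place where the team is based.",
    "Our refund rules are listed online.",
    "Lunch starts at noon.",
]
result = retrieve_then_rerank("What is the refund policy?", chunks, first_k=2, final_k=1)
print(result)
```

The first pass ranks the office chunk highest because it shares "is" and "the" with the question. The second pass corrects this and puts the refund chunk on top.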
Graph RAG
Instead of chunks in a flat pile, the documents are arranged as a graph of connected ideas. The retrieval follows links between concepts. This works better for complex questions that span many documents, but it takes more work to set up.
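A rough sketch of "following links" is below. The graph, the concept names, and the `follow_links` helper are all invented for illustration. Real Graph RAG systems build the graph automatically from the documents and use more elaborate traversal and summarising steps.

```python
from collections import deque

# Hypothetical concept graph: each concept lists the concepts it links to.
graph = {
    "refund policy": ["payment methods", "return shipping"],
    "payment methods": ["fraud checks"],
    "return shipping": [],
    "fraud checks": [],
}

def follow_links(start, graph, max_hops=1):
    """Collect the starting concept plus everything reachable within max_hops links."""
    seen = [start]
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not go further than max_hops from the start
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.append(neighbour)
                queue.append((neighbour, hops + 1))
    return seen

one_hop = follow_links("refund policy", graph, max_hops=1)
two_hops = follow_links("refund policy", graph, max_hops=2)
print(one_hop)
print(two_hops)
```

With one hop, a refund question also pulls in payment methods and return shipping. With two hops it reaches fraud checks as well, which a flat search on the word "refund" would be unlikely to find.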
The breaking point
RAG does not make the LLM smarter. It hands the model the right page of the textbook before asking the question. If the textbook is wrong, the answer is wrong. If the chunks miss the key paragraph, the model may invent a plausible-sounding answer from the chunks it did get. Garbage in, garbage out, at the speed of a chatbot.
RAG vs massive context
RAG first caught on partly because models had small context windows. An early chat model could only read about 4,000 or 8,000 tokens at a time. Anything bigger needed RAG. That constraint has largely gone.
Some modern frontier models hold up to 1 million tokens in their context window, which acts as their working memory. That is roughly 1,500 pages of text. You can drop a whole book into the prompt. You can drop a hundred customer interview transcripts. The model reads the lot.
RAG still matters when your data is bigger than 1 million tokens. Think of a company with millions of customer records, a developer searching their entire git history, or a law firm with thousands of contracts. RAG is now for collections too big to fit in the model’s context. For smaller jobs, paste the data and skip the search step.
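That rule of thumb can be written as a small check. The 1 million token limit is the figure used above; the 0.75 words-per-token ratio is an assumption, and `choose_approach` is a hypothetical helper. Actual limits and ratios differ from model to model.

```python
CONTEXT_LIMIT_TOKENS = 1_000_000  # figure from the text; real limits vary by model
WORDS_PER_TOKEN = 0.75            # assumption: rough rule of thumb for English text

def estimate_tokens(word_count):
    return round(word_count / WORDS_PER_TOKEN)

def choose_approach(word_count, limit=CONTEXT_LIMIT_TOKENS):
    """Paste everything if it fits in the context window, otherwise search first."""
    if estimate_tokens(word_count) <= limit:
        return "paste into the prompt"
    return "use RAG"

book = choose_approach(100_000)        # one book of about 100,000 words
archive = choose_approach(50_000_000)  # a large archive of about 50 million words
print(book, "|", archive)
```

The sketch only checks whether the data fits. It leaves out other things you might weigh, such as the cost and speed of sending a very long prompt with every question.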
Your takeaway
When a company chatbot answers a question about your account or their product, a RAG system has usually fetched the right document a second before you saw the answer. This is the unglamorous workhorse behind most “smart” AI features in production today.
Use this page with your AI
Copy the prompt below. Paste it into Claude, Copilot, Gemini, or any AI you use. The AI will ask you simple questions, then teach the page back to you using your own work as the example.
You are an expert teacher and AI strategist. Read the page below as your reference material. PAGE: RAG CONTENT: ## Essence RAG stands for Retrieval-Augmented Generation. LLM means large language model. It means the LLM reads a document before answering your question. You ask a question. A search engine finds the right pages in your library. The pages get pasted into the model's context. The model answers with those pages in front of it. This is how a chatbot can cite your company's internal docs without being retrained. ## The process Your documents get chopped into small chunks, a paragraph or two each. A computer converts each chunk into a numerical fingerprint, called an embedding. These fingerprints go into a special database. You ask a question. The question also gets a fingerprint. The database finds the chunks whose fingerprints are closest to your question's fingerprint. Those chunks often answer the question. The LLM receives your question with the top few chunks pasted above it. The model reads both and writes an answer based on what the chunks said. If the chunks mention your company's refund policy, the answer mentions that policy. If nothing relevant exists in your library, the model says so. ## You've encountered this when... You chatted with a customer support bot on a bank or software website and it answered about your account or their product. That is often RAG. Two years ago, those bots were scripted and useless. They got smart when RAG arrived. ## A familiar example Think about asking a colleague a policy question. If they read the handbook last week, they answer from memory. If they haven't, they say "hold on" and open the handbook to the right page before answering. RAG is the second behaviour. The LLM pauses, flips to the right page, reads it, then responds. A typical RAG system retrieves between 3 and 10 chunks per question. Each chunk is around 500 words. The LLM sees all of them plus your question, then writes the answer. 
## Variants include - Vector databases. The storage system where embeddings live. Pinecone, Weaviate, Qdrant, and pgvector are the common names. Most companies use one of these without thinking about it, hidden inside a product. - Reranking. A second pass after retrieval. The first pass finds 50 chunks. A smaller model scores each chunk's relevance. The top 5 go to the LLM. Reranking improves answers on hard questions. - Graph RAG. Instead of chunks in a flat pile, the documents are arranged as a graph of connected ideas. The retrieval follows links between concepts. Better for complex questions across many documents, more work to set up. ## The breaking point RAG does not make the LLM smarter. It hands the model the right page of the textbook before asking the question. If the textbook is wrong, the answer is wrong. If the chunks miss the key paragraph, the model invents a plausible answer from the chunks it did get. Garbage in, garbage out, at the speed of a chatbot. ## RAG vs massive context The first version of RAG existed because models had small context windows. A model could only read 4,000 or 8,000 tokens at a time. Anything bigger needed RAG. That constraint is gone. Modern frontier models hold up to 1 million tokens in their working memory. That is roughly 1,500 pages of text. You can drop a whole book into the prompt. You can drop a hundred customer interview transcripts. The model reads the lot. RAG still matters when your data is bigger than 1 million tokens. A company with millions of customer records, a developer searching their entire git history, a law firm with thousands of contracts. RAG is now for databases too big to fit in the model's context. For smaller jobs, paste the data and skip the search step. ## Your takeaway A company chatbot can answer about your account or their product. A RAG system fetched the right document a second before you saw the answer. This is the unglamorous workhorse behind most "smart" AI features in production today. 
Your job: 1. Ask me 3 to 5 simple questions about my work, my situation, and what I would actually use AI for. One question at a time. Wait for my answer between each. 2. Once you have my answers, explain the key ideas from the page back to me, using my answers as the example. 3. Suggest one concrete next step I could take this week. Tie it back to the page. 4. Push back if my answer is vague. Ask me to be specific. Rules for you: - Do not flatter me. - Do not agree with me when I am wrong. - If you do not know something, say so. - Be brief. Two paragraphs at most per turn. - Ask one question at a time. Do not stack questions.