What I have built
Open architecture. Take what helps.
Build 1 of N: Master Agent
What it is and why I built it
I built a Master Agent that runs my work the way a senior assistant would. It plans before it acts. It uses tools to read files, write code, search memory, and ship documents. It learns from its own mistakes and from my corrections. I built it because chat tools forget what I told them last week, work in 60-second bursts, and tell me what I want to hear. My agent remembers months of context, runs for hours on its own, and pushes back when I am wrong.
The components
Planner
The Planner reads my request and writes a plan before any work starts. The plan lists each step, flags steps that could break things, and surfaces gaps in what I asked. If gaps exist, the agent stops and asks me to fill them. This matters because most agent failures come from rushing into half-understood work.
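The planner's contract fits in a few lines. This is a sketch of the shape, not the real implementation; the class and field names are my own illustrations:

```python
# Illustrative sketch of a planner's output; names are assumptions.
from dataclasses import dataclass, field

@dataclass
class Step:
    description: str
    risky: bool = False          # flagged: this step could break things

@dataclass
class Plan:
    steps: list
    gaps: list = field(default_factory=list)  # missing info to ask for

    def ready(self) -> bool:
        # Execution starts only when no gaps remain.
        return not self.gaps

plan = Plan(
    steps=[Step("read deploy config"), Step("edit deploy script", risky=True)],
    gaps=["Which environment: staging or production?"],
)
```

The point of the shape is the gate: a plan with open gaps never reaches the Executor.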
Executor
The Executor runs the plan one step at a time. It calls tools to read files, edit code, run tests, send messages, or build documents. It reads each tool result before the next step starts. This matters because acting blind on tool output is how agents loop on failure.
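The loop itself is simple; what matters is reading the result before moving on. A sketch, with the tool names and result shape as assumptions:

```python
# Illustrative executor loop; tool registry and result shape are assumptions.
def run_plan(steps, tools):
    """Run steps in order, reading each tool result before the next step."""
    results = []
    for step in steps:
        result = tools[step["tool"]](step["args"])
        results.append(result)
        if not result.get("ok"):
            break  # stop and replan rather than loop blind on a failure
    return results

tools = {"read_file": lambda path: {"ok": True, "data": f"contents of {path}"}}
results = run_plan([{"tool": "read_file", "args": "notes.md"}], tools)
```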
RAG memory layer (Pinecone)
RAG stands for retrieval-augmented generation. Pinecone is a database that stores chunks of past work as embeddings, lists of numbers that capture meaning. When I ask a new question, the agent finds the most related past chunks and feeds them in. This matters because it lets the agent draw on months of past projects without holding all of them in mind at once.
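Retrieval boils down to nearest-neighbour search over vectors. A toy sketch with cosine similarity, where a plain list stands in for the real Pinecone index:

```python
# Toy retrieval sketch; a list of (text, vector) pairs stands in for Pinecone.
import math

def cosine(a, b):
    """Similarity between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, memory, top_k=2):
    """Return the top_k most related chunks of past work."""
    ranked = sorted(memory, key=lambda m: cosine(query_vec, m[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

memory = [("pricing page copy", [1.0, 0.1]), ("deploy runbook", [0.0, 1.0])]
hits = retrieve([0.9, 0.2], memory, top_k=1)  # query sits close to the pricing chunk
```

Real embeddings have hundreds of dimensions and the index handles the ranking, but the mechanic is the same.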
Process manager (PM2)
PM2 keeps the agent running on my computer. It starts the dashboard when I boot up. It restarts services if they crash. It also runs background jobs like nightly memory cleanup. This matters because an agent that needs me to babysit its uptime is not autonomous.
Sub-agents per business
I run two businesses. Each one has its own sub-agent profile. The profile holds context about that business, its voice, its customers, and its rules. The Master Agent loads the right profile when the work calls for it. This matters because mixing business context creates wrong answers and off-brand work.
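A profile is just scoped context loaded per run. A sketch, with the business names and fields made up for illustration:

```python
# Illustrative profiles; business names and fields are made up.
PROFILES = {
    "consultancy": {"voice": "direct, no jargon", "rules": ["never quote prices in public"]},
    "studio": {"voice": "warm, visual", "rules": ["always show before and after"]},
}

def load_profile(business):
    # Exactly one profile per run keeps business contexts from mixing.
    if business not in PROFILES:
        raise KeyError(f"no profile for {business!r}")
    return PROFILES[business]
```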
Memory layer
Three layers store what the agent knows. Episodic memory holds each chat as it happens. Causal memory links chats that caused or led to each other. Consolidated memory takes a day of related chats and rolls them into one summary. Old memory never gets deleted, only flagged when newer memory replaces it. This matters because flat chat history degrades the agent over weeks of use.
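The consolidation step can be sketched in a few lines. The field names are assumptions; the rule that matters is flag, never delete:

```python
# Sketch of nightly consolidation: old memory is flagged, never deleted.
def consolidate(episodes):
    summary = "; ".join(e["text"] for e in episodes)
    for e in episodes:
        e["superseded"] = True   # flagged but still retrievable
    return {"text": summary, "superseded": False}

day = [{"text": "picked Pinecone for memory"}, {"text": "set retrieval top_k to 5"}]
rolled = consolidate(day)
```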
Monitoring and autoheal
A small watcher process reads logs from my agent and from cloud services like Vercel and Railway. When it spots an error, it tries to fix it on its own. For high-risk fixes, it asks a stronger model and waits for me to sign off. This matters because my time should go to new work, not patching old bugs.
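The triage rule at the heart of the watcher is one branch. A sketch, with the risk labels as assumptions:

```python
# Illustrative triage rule; the risk labels are assumptions.
def triage(error):
    """Low-risk errors get fixed on the spot; high-risk ones wait for sign-off."""
    if error.get("risk") == "high":
        return "ask-stronger-model-and-wait"
    return "autofix"
```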
Model routing
Different jobs go to different models. Claude Sonnet plans and reasons. Claude Haiku runs cheap fast tasks like classification or one-line edits. Gemini Flash handles bulk research fan-out. Gemini Pro handles big synthesis steps. Claude Opus runs only on the riskiest fixes. This matters because using the most expensive model for every job burns money and slows everything down.
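The routing table matching the split above, sketched as a plain lookup. The task labels are my own shorthand, not an API:

```python
# Routing table matching the split above; task labels are my own shorthand.
ROUTES = {
    "plan": "claude-sonnet",
    "classify": "claude-haiku",
    "research-fanout": "gemini-flash",
    "synthesis": "gemini-pro",
    "risky-fix": "claude-opus",
}

def route(task_kind):
    # Unknown work defaults to the cheap model, not the expensive one.
    return ROUTES.get(task_kind, "claude-haiku")
```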
Synthesis daemon
A daemon is a background program that runs all the time. The Synthesis Daemon takes scattered notes from many runs and turns them into one clean output. It uses Gemini Pro because that model handles long context well. This matters because agents produce raw chunks that need stitching.
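The stitching step, minus the model, is dedupe and merge. A sketch; the real daemon hands the merged notes to Gemini Pro instead of joining strings:

```python
# Sketch of the stitching step; the real daemon hands this to Gemini Pro.
def synthesize(notes):
    """Dedupe scattered notes and stitch them into one clean output."""
    deduped = list(dict.fromkeys(n.strip() for n in notes if n.strip()))
    return "\n".join(f"- {n}" for n in deduped)

draft = synthesize(["latency is the bottleneck", "",
                    "latency is the bottleneck", "cache the embeddings"])
```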
Honesty protocol
The agent ends plans, design choices, and claims about new tools with a short footer. The footer names the weakest part of the answer, what the agent does not know, and how confident it is. This matters because confident wrong answers waste more time than honest gaps.
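The footer is three fixed fields. A sketch of the builder, with the field names as assumptions:

```python
# Illustrative footer builder; the field names are assumptions.
def honesty_footer(weakest, unknowns, confidence):
    return "\n".join([
        f"Weakest part: {weakest}",
        "Unknowns: " + ("; ".join(unknowns) if unknowns else "none"),
        f"Confidence: {confidence}",
    ])

footer = honesty_footer("the cost estimate", ["actual churn rate"], "medium")
```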
Failure distillation
Every night a job reads yesterday's failures from my logs. A small model writes one short rule per failure, like a note to self. The rules get fed back as guardrails. This matters because the agent stops repeating the same kind of mistake across weeks.
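The nightly rule writer, sketched without the model. In the real version a small model writes the rule text; here a template stands in:

```python
# Sketch of the nightly rule writer; a template stands in for the small model.
def distil(failures):
    rules, seen = [], set()
    for f in failures:
        rule = f"When {f['context']}, do not {f['mistake']}."
        if rule not in seen:          # one rule per kind of failure
            seen.add(rule)
            rules.append(rule)
    return rules

rules = distil([
    {"context": "editing deploy scripts", "mistake": "skip the dry run"},
    {"context": "editing deploy scripts", "mistake": "skip the dry run"},
])
```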
Decisions and corrections log
Every plan and every choice the agent makes gets logged. Every time I edit a message or override the agent, that gets logged too. Both feed back into the agent's context as patterns of how I work. This matters because the agent learns my standards and starts thinking the way I think.
Skills as sub-processes
A skill is a markdown file with focused instructions for one kind of work, like cold outreach copy or design specs. When the agent needs that skill, it spawns a small focused model call with only that skill's instructions plus the input. The result comes back to the Master Agent. This matters because the main agent stays clean while specialists handle craft work.
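Spawning a skill is a focused call with a narrow prompt. A sketch, where `call_model` stands in for a real model call:

```python
# Sketch of spawning a skill; call_model stands in for a real model call.
def run_skill(skill_md, user_input, call_model):
    """Focused call: only this skill's instructions plus the input."""
    prompt = f"{skill_md}\n\nINPUT:\n{user_input}"
    return call_model(prompt)   # result goes back to the Master Agent

echo = lambda prompt: prompt.upper()   # stand-in model for the sketch
out = run_skill("Write cold outreach in two sentences.", "lead: acme.com", echo)
```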
Railway worker for long jobs
Some jobs take hours. I send those to a worker on Railway, a cloud platform with no per-task time limit. The worker runs the same agent stack as the chat. It writes progress to my database and sends a Telegram message when done. This matters because true autonomy means I can walk away.
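The worker loop reduces to: do a step, persist progress, announce when done. A sketch, where `write_progress` and `notify` stand in for the database write and the Telegram message:

```python
# Sketch of the worker loop; write_progress and notify stand in for the
# database write and the Telegram message.
def run_long_job(steps, write_progress, notify):
    for i, step in enumerate(steps, start=1):
        step()
        write_progress(i, len(steps))   # the chat can read status at any time
    notify("job done")

progress, messages = [], []
run_long_job(
    [lambda: None, lambda: None],
    lambda i, n: progress.append((i, n)),
    messages.append,
)
```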
How it all connects
I send a request to the dashboard. The Master Agent loads the right business profile, pulls related memory from Pinecone, reads any guardrails from past failures, and writes a plan. If the plan has gaps, the agent stops and asks. If it is clean, the Executor runs each step. Each step calls tools. Each tool result feeds the next step. After the agent edits a file, it runs the type checker and tests on its own. If something breaks, it fixes it and tries again. Long jobs go to the Railway worker. Short jobs run in the dashboard. Decisions and corrections get logged to make the next run smarter. Memory consolidates each night so the agent stays sharp without choking on old context. The watcher sees all of it and fixes what it can.
How I verify it works
A working agent and a known-working agent are different things. I run a small evaluation harness over the agent's output. Each run produces a score against a fixed rubric: did it answer the question, did it cite sources when needed, did it ask for missing information instead of guessing, did it stay within budget. The harness runs on every change. If the score drops, the change does not ship.
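The rubric and the shipping gate can be sketched directly. The rubric keys mirror the questions above; everything else is illustration:

```python
# Sketch of the rubric scorer; keys mirror the rubric questions above.
RUBRIC = ["answered", "cited_when_needed", "asked_instead_of_guessing", "within_budget"]

def score(run):
    return sum(bool(run.get(k)) for k in RUBRIC) / len(RUBRIC)

def ships(new_score, baseline):
    # If the score drops, the change does not ship.
    return new_score >= baseline

s = score({"answered": True, "cited_when_needed": True,
           "asked_instead_of_guessing": True, "within_budget": False})
```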
What gets graded
Accuracy. The agent's answers get checked against ground-truth examples. New examples get added every week from real failures.
Honesty. The honesty protocol asks the agent to flag what it does not know. The harness checks whether the flagged uncertainty matches actual uncertainty.
Cost. Every run logs token spend. A run that produces a good answer at 5x normal cost is a regression, not a success.
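The cost gate is one comparison. A sketch, with the 5x threshold taken from the rule above:

```python
# Sketch of the cost gate; the 5x threshold comes from the rule above.
def cost_regression(tokens, baseline_tokens, factor=5.0):
    """True when a run spends more than factor times the normal token budget."""
    return tokens > factor * baseline_tokens
```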
What I do not check
I do not run my agent against academic benchmarks. The benchmarks measure what benchmarks measure. My agent does work that matters to me. Real failures are the better signal.
What is coming next
More builds will land here. The architecture above generalises to other domains. New entries will appear when they ship, not before. Stub pages and roadmaps are noise.
Use this page with your AI
Copy the prompt below. Paste it into Claude, Copilot, Gemini, or any AI you use. The AI will ask you simple questions, then teach the page back to you using your own work as the example.
You are an expert teacher and AI strategist. Read the page below as your reference material. PAGE: What I have built CONTENT: ## What I have built Open architecture. Take what helps. ## Build 1: Master Agent I built a Master Agent that runs my work the way a senior assistant would. ### What it is and why I built it I built a Master Agent that runs my work the way a senior assistant would. It plans before it acts. It uses tools to read files, write code, search memory, and ship documents. It learns from its own mistakes and from my corrections. I built it because chat tools forget what I told them last week, work in 60-second bursts, and tell me what I want to hear. My agent remembers months of context, runs for hours on its own, and pushes back when I am wrong. ### Components Planner: reads the request and writes a plan before any work starts. Lists each step, flags steps that could break things, surfaces gaps. Executor: runs the plan one step at a time. Calls tools, reads each result before the next step starts. RAG memory layer (Pinecone): stores past work as math vectors. Finds related past chunks and feeds them into the prompt. Process manager (PM2): keeps the agent running, restarts on crash, runs nightly background jobs. Sub-agents: each business has its own profile with context, voice, customers, and rules. Memory layer: episodic, causal, and consolidated. Old memory gets flagged, never deleted. Monitoring and autoheal: a watcher reads logs and tries fixes on its own. High-risk fixes go to a stronger model for sign-off. Model routing: Sonnet plans, Haiku handles cheap tasks, Flash handles research fan-out, Pro handles synthesis, Opus for the riskiest fixes. Synthesis daemon: takes scattered notes from many runs and stitches them into one clean output. Honesty protocol: every plan and claim ends with a footer naming the weakest part, what the agent does not know, and how confident it is. 
Failure distillation: nightly job reads yesterday's failures, writes one rule per failure, feeds rules back as guardrails. Decisions and corrections log: every override I make gets logged. The agent learns my standards. Skills as sub-processes: a skill is a markdown file for one kind of work. The agent spawns a focused model call when it needs that skill. Railway worker: long jobs run on Railway with no time limit. Progress writes to the database. Done triggers a Telegram message. ### How it all connects Request → planner loads profile and memory → executor runs steps → tools called per step → type checker and tests run automatically → long jobs go to Railway → decisions logged → memory consolidates nightly → watcher monitors everything. ## How I verify it works A working agent and a known-working agent are different things. I run a small evaluation harness over the agent's output. Each run produces a score against a fixed rubric: did it answer the question, did it cite sources when needed, did it ask for missing information instead of guessing, did it stay within budget. The harness runs on every change. If the score drops, the change does not ship. ### What gets graded Accuracy: answers checked against ground-truth examples. New examples added every week from real failures. Honesty: the protocol asks the agent to flag what it does not know. The harness checks whether flagged uncertainty matches actual uncertainty. Cost: every run logs token spend. A good answer at 5x normal cost is a regression. ### What I do not check I do not run my agent against academic benchmarks. The benchmarks measure what benchmarks measure. My agent does work that matters to me. Real failures are the better signal. ## What is coming next More builds will land here. The architecture above generalises to other domains. The next page entries will appear when they ship, not before. Stub pages and roadmaps are noise. Your job: 1. 
Ask me 3 to 5 simple questions about my work, my situation, and what I would actually use AI for. One question at a time. Wait for my answer between each. 2. Once you have my answers, explain the key ideas from the page back to me, using my answers as the example. 3. Suggest one concrete next step I could take this week. Tie it back to the page. 4. Push back if my answer is vague. Ask me to be specific. Rules for you: - Do not flatter me. - Do not agree with me when I am wrong. - If you do not know something, say so. - Be brief. Two paragraphs at most per turn. - Ask one question at a time. Do not stack questions.