Where AI systems are going
What is coming next in AI architecture. The honest version, not the hype.
Every AI explainer ends with predictions. Most are wrong. Some are right but boring. This page covers the architectural shifts with credible technical backing, the labs building them, and the honest reasons each one might not work.
Persistent memory layers
Old AI starts every chat from scratch. Developers fix this by pasting all your past conversations into the prompt every time. That wastes tokens and slows everything down. New systems use persistent memory: the model writes structured notes about you to a database and reads them back when needed.
What is being built
The A-MEM paper (Xu et al., 2025) shows agents creating smart notes from chats. They extract tags, write short summaries, and link related memories using cosine similarity. The Mem0 project tracks user states and session states with entity linking for fast recall. Both replace raw vector dumps with structured graphs.
Pasting the full chat history into a standard prompt takes around 16,000 tokens and gives low recall accuracy on long tasks. A-MEM stores linked summary notes in around 2,500 tokens with higher recall accuracy. Memory structure beats memory size.
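A minimal sketch of the note-linking idea, not A-MEM's actual code. It assumes each note already has an embedding vector (in practice produced by an embedding model) and links any two notes whose cosine similarity clears a threshold. The `MemoryNote` class, the `add_note` helper, the 0.8 threshold, and the toy vectors are all hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class MemoryNote:
    note_id: str
    summary: str
    tags: list[str]
    embedding: list[float]  # assumed to come from an embedding model
    links: set[str] = field(default_factory=set)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def add_note(store: dict[str, MemoryNote], note: MemoryNote, threshold: float = 0.8) -> None:
    """Insert a note and link it, in both directions, to every similar note."""
    for other in store.values():
        if cosine_similarity(note.embedding, other.embedding) >= threshold:
            note.links.add(other.note_id)
            other.links.add(note.note_id)
    store[note.note_id] = note


store: dict[str, MemoryNote] = {}
add_note(store, MemoryNote("n1", "Prefers TypeScript for front-end work", ["language"], [1.0, 0.0, 0.1]))
add_note(store, MemoryNote("n2", "Is migrating a React app to TypeScript", ["language", "project"], [0.9, 0.1, 0.2]))
add_note(store, MemoryNote("n3", "Is allergic to peanuts", ["health"], [0.0, 1.0, 0.0]))
```

Here the two TypeScript notes end up linked and the unrelated health note stays on its own. At read time the model fetches one note and follows its links instead of loading the whole history.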
The Model Context Protocol (MCP) standardises how memory connects to the model. Custom code gives way to open standards.
The honest counter
Memory poisoning is real. A bad fact stored on Monday ruins answers on Friday. The MEXTRA attack (Wang et al., 2025) shows hackers can steal stored agent memories using prompt injection. Privacy across long sessions needs strict isolation between users.
Long-running autonomous agents
Early models ran for fifty milliseconds and typed a text answer. Modern agents run for hours or days. They fix software bugs, dig through research papers, and write code without human input between steps. This needs new infrastructure.
What is being built
The Agent Sandbox project on Kubernetes gives agents safe, persistent execution environments. Agents pause and resume work across sessions. H2O Super Agent orchestrates dynamic routing and long memory.
The Instruction Tool Retrieval method (2026) cuts context tokens by 95% and total episode costs by 70% by retrieving only the tools needed for each step instead of holding the whole instruction set in context.
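A rough sketch of per-step tool retrieval, not the method from the paper. It scores each tool by word overlap with the current step and keeps only the top matches. The tool names, descriptions, and the `retrieve_tools` helper are hypothetical, and a real system would likely use embeddings instead of word overlap.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase words with trailing punctuation stripped."""
    return {word.strip(".,").lower() for word in text.split()}


# Hypothetical tool catalogue: name -> one-line description.
TOOLS = {
    "read_file": "read the contents of a file from disk",
    "run_tests": "run the unit tests and report failures",
    "send_email": "send an email message to a recipient",
    "query_db": "run a sql query against the database",
}


def retrieve_tools(step: str, tools: dict[str, str], k: int = 2) -> list[str]:
    """Return up to k tool names whose descriptions share the most words with the step."""
    step_words = tokenize(step)
    scored = [(len(step_words & tokenize(desc)), name) for name, desc in tools.items()]
    # Highest overlap first, ties broken alphabetically; tools with no overlap are dropped.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:k] if score > 0]


selected = retrieve_tools("run the failing unit tests", TOOLS, k=1)
```

Only the selected tool descriptions go into the prompt for that step. The rest of the catalogue stays out of context until a later step needs it.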
The honest counter
An agent stuck in a loop can spend thousands of dollars overnight. Long-running tasks need active credentials, which become hacker bait. An autonomous coding agent recently deleted a production database through a single rogue API call. Unsupervised agents need monitoring and least-privilege sandboxes, not blanket execution rights.
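One way to cap the damage is a guard that the agent loop must call before every step. The sketch below is an assumption about how such a guard could look, not a feature of any named product. The `AgentGuard` class, its limits, and the per-step cost figure are hypothetical.

```python
from collections import Counter


class BudgetExceeded(RuntimeError):
    pass


class LoopDetected(RuntimeError):
    pass


class AgentGuard:
    """Stops an agent run when it would overspend or keeps repeating one action."""

    def __init__(self, max_cost_usd: float, max_repeats: int = 3):
        self.max_cost_usd = max_cost_usd
        self.max_repeats = max_repeats
        self.spent_usd = 0.0
        self.action_counts = Counter()

    def check(self, action: str, cost_usd: float) -> None:
        """Call before each step; raises instead of letting the step run."""
        self.action_counts[action] += 1
        if self.action_counts[action] > self.max_repeats:
            raise LoopDetected(f"action repeated {self.action_counts[action]} times: {action}")
        if self.spent_usd + cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"step would exceed the ${self.max_cost_usd:.2f} budget")
        self.spent_usd += cost_usd


guard = AgentGuard(max_cost_usd=1.00, max_repeats=3)
stopped_by = None
try:
    for _ in range(10):
        guard.check("retry: npm install", cost_usd=0.05)
except (LoopDetected, BudgetExceeded) as exc:
    stopped_by = type(exc).__name__
```

The fourth identical retry trips the loop check long before the budget runs out. The guard does not replace least-privilege credentials. It only limits how long a mistake can keep running.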
Multi-agent coordination
One model handling everything breaks on complex tasks. The model forgets earlier steps. The model loses focus halfway through. Multi-agent systems split the work: a manager agent plans, worker agents execute, a tester agent checks. This sounds great. The reality is more mixed.
What is being built
Microsoft's AutoGen framework helps agent teams build software. The MAC-Flow paper (Lee and Zhang, 2025) uses flow matching for coordination, speeding up inference 14.5 times by distilling behaviours into one-step policies. The R3DM framework improves how agents discover roles using contrastive learning, increasing win rates by 20% on benchmarks.
The honest counter
The DeepMind paper “Towards a Science of Scaling Agent Systems” (Kim and Liu, 2026) tested 180 different agent configurations. More agents do not guarantee success. Single agents beat multi-agent teams on sequential tasks. Coordination drops performance by 39 to 70 percent on logic-heavy work because agents copy each other's errors. Multi-agent systems consume 15 times more tokens than single agents. Multi-agent setups win only on parallel tasks, with a 9.2% gain on parallel web navigation. Tool interface design matters more than agent count.
Model routing as a first-class pattern
Sending every prompt to the most expensive model wastes money. Sending every prompt to the cheapest model breaks on hard tasks. A routing gateway evaluates each prompt, checks task complexity, and picks the right model. This is now a first-class architectural component, not an optimisation.
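A minimal sketch of a routing gateway, assuming a crude keyword and length heuristic stands in for a real complexity classifier. The model names `small-model` and `large-model`, the hint list, and the 200-word cutoff are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


# Hypothetical signals that a prompt needs the stronger model.
HARD_HINTS = ("prove", "refactor", "debug", "step by step", "architecture")


def route(prompt: str, long_prompt_words: int = 200) -> Route:
    """Pick a model tier from simple signals in the prompt text."""
    text = prompt.lower()
    if any(hint in text for hint in HARD_HINTS):
        return Route("large-model", "prompt contains a hard-task hint")
    if len(text.split()) > long_prompt_words:
        return Route("large-model", "prompt is long")
    return Route("small-model", "no complexity signal found")


easy = route("What is the capital of France?")
hard = route("Debug this race condition in my queue")
```

Returning the reason alongside the model makes routing decisions easy to log and audit. A semantic router would replace the keyword check with an embedding or classifier score and keep the same interface.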
What is being built
Microsoft's Foundry Model Router selects between massive and small models in real time. Developers test routing decisions with the RouteLens tool. Static routing uses the model name in code. Semantic routing uses the prompt's meaning to decide. Disaggregated routing splits work across hardware: prefill is compute-bound and needs fast computation, decode is memory-bound and needs memory bandwidth, so the two phases can run on different chips.
The llm-d project on Kubernetes implements disaggregated serving by separating the prefill phase (reading the prompt) from the decode phase (writing the answer). Semantic routing saves around 50% on cost. Disaggregated serving lets each phase run on the hardware that suits it.
The honest counter
Splitting tasks across chips creates network complexity. Moving data between machines adds latency. The network becomes the new bottleneck. Small companies do not have the infrastructure to support disaggregated serving. It remains a giant-scale tool.
AI-first data transformers
Old data pipelines broke on messy input. Engineers wrote regex per source, parsers per format, and one engineer per integration. The new pattern: hand the messy input to a small model with a target schema. The model normalises the input into clean structured data. Validate with Zod or JSON Schema. Fall back to a stronger model on failure.
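The validate-then-fall-back loop can be sketched in a few lines. This version uses only the standard library in place of Zod or JSON Schema, and the two model functions are hypothetical stubs that return canned JSON. The `SCHEMA` fields are an assumption for illustration.

```python
import json
from typing import Callable, Optional

# Hypothetical target schema: field name -> required Python type.
SCHEMA = {"name": str, "email": str, "age": int}


def validate(raw: str) -> Optional[dict]:
    """Return the parsed record if it matches SCHEMA exactly, else None."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or set(record) != set(SCHEMA):
        return None
    for key, expected_type in SCHEMA.items():
        value = record[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            return None
    return record


def normalise(messy: str, small_model: Callable[[str], str], strong_model: Callable[[str], str]) -> dict:
    """Try the small model first; fall back to the strong model if validation fails."""
    for model in (small_model, strong_model):
        record = validate(model(messy))
        if record is not None:
            return record
    raise ValueError("no model produced schema-valid output")


def small_model(text: str) -> str:
    # Stub: returns age as a string, which fails validation.
    return '{"name": "Ada Lovelace", "email": "ada@example.com", "age": "36"}'


def strong_model(text: str) -> str:
    # Stub: returns a record that matches the schema.
    return '{"name": "Ada Lovelace", "email": "ada@example.com", "age": 36}'


record = normalise("Ada Lovelace <ada@example.com>, 36 yrs", small_model, strong_model)
```

The validator is the cheap, boring layer that decides when the expensive model is needed. Nothing reaches the database unless it passes.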
What is being built
Chained transformers: one model extracts raw text, the next translates it, the final model formats it. Languages like PPL control execution flow across the chain. The COCOS framework (Jeonghun et al., 2025) teaches small models self-correction via reinforcement learning. The CorrectBench dataset (Sun et al., 2025) measures self-correction accuracy gains.
Self-correction adds around 5% accuracy on tested datasets. The trade-off is 40% slower execution and higher token costs. Future pipelines route dynamically: simple chain-of-thought for easy rows, self-correction loops only for the hardest data.
The honest counter
LLM transformers cost more per row than regex. Each call adds 1 to 3 seconds of latency. They hallucinate at edge cases. The wins come from coverage and maintenance time, not raw speed. Use this for varied input sources. Use regex for clean repeat data.
Research injection sequences
AI training data goes stale fast. The current pattern: query analysis, web search, distillation, injection into prompt context, generation with citations. The pattern is now reaching its limits. The next generation runs the search loop differently.
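The current single-pass pattern can be sketched end to end, stopping just before generation. Everything here is a stand-in: `CORPUS` is a hypothetical two-page replacement for web search, the stop-word list is an assumption, and distillation is reduced to truncation.

```python
STOP_WORDS = {"what", "is", "the", "a", "of", "in", "how", "does"}

# Hypothetical stand-in for a web search index: URL -> page text.
CORPUS = {
    "https://example.com/mcp": "The Model Context Protocol standardises how tools and memory connect to a model.",
    "https://example.com/cats": "Cats sleep for most of the day.",
}


def analyse_query(question: str) -> list[str]:
    """Query analysis: keep the content words of the question."""
    words = [w.strip("?.,").lower() for w in question.split()]
    return [w for w in words if w and w not in STOP_WORDS]


def search(keywords: list[str], corpus: dict[str, str]) -> list[tuple[str, str]]:
    """Search: return every page that mentions at least one keyword."""
    return [(url, text) for url, text in corpus.items()
            if any(k in text.lower() for k in keywords)]


def distil(hits: list[tuple[str, str]], max_chars: int = 120) -> list[tuple[str, str]]:
    """Distillation, reduced here to truncating each page."""
    return [(url, text[:max_chars]) for url, text in hits]


def build_prompt(question: str, snippets: list[tuple[str, str]]) -> str:
    """Injection: number the sources so the model can cite them as [n]."""
    sources = "\n".join(f"[{i}] {text} ({url})" for i, (url, text) in enumerate(snippets, start=1))
    return ("Answer using only the sources below and cite them as [n].\n\n"
            f"Sources:\n{sources}\n\nQuestion: {question}")


question = "What is the Model Context Protocol?"
keywords = analyse_query(question)
prompt = build_prompt(question, distil(search(keywords, CORPUS)))
```

The search runs once and whatever it returns is what the model sees. An agentic loop would call `search` repeatedly with revised keywords before building the final prompt.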
What is being built
Agentic search loops read databases multiple times instead of once. Anthropic reported that its multi-agent research system beat a single-agent setup by about 90% on its internal evaluation of complex research tasks. The agents formulate hypotheses, search in parallel, discard dead ends, and return only the precise file spans needed. The main context stays clean.
SearchGPT uses models trained to decide autonomously when to search the web. These are customised versions of GPT-5.5 trained with model distillation. It synthesises results into fluent answers, not blue links.
Citation accuracy is a growing problem. A 2025 study found Llama models fabricated up to 85.6% of their citations. The CiteVerifier framework (Ansari et al., 2026) checks every generated link against real academic databases and rejects unsupported claims.
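The basic check is simple to sketch, though this is not CiteVerifier's implementation. The example assumes citations appear as DOIs in the answer text and that a trusted index is available as a set. The `KNOWN_DOIS` entries and the `check_citations` helper are hypothetical.

```python
import re

# Hypothetical trusted index; a real checker would query an academic database.
KNOWN_DOIS = {"10.1000/real.001", "10.1000/real.002"}

# Matches a DOI up to the next whitespace, ")" or "]".
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s)\]]+")


def check_citations(answer: str, known: set[str]) -> tuple[list[str], list[str]]:
    """Split the DOIs cited in an answer into verified and unverified lists."""
    cited = [doi.rstrip(".,;") for doi in DOI_PATTERN.findall(answer)]
    verified = [doi for doi in cited if doi in known]
    unverified = [doi for doi in cited if doi not in known]
    return verified, unverified


answer = "Memory helps (doi:10.1000/real.001). Agents always win (doi:10.1000/fake.999)."
verified, unverified = check_citations(answer, KNOWN_DOIS)
```

A pipeline can reject or regenerate any answer with a non-empty `unverified` list. This only proves the source exists. It does not prove the source supports the claim.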
The honest counter
More search cycles add cost and latency. Often the quality gain does not justify the spend. Garbage sources produce garbage answers regardless of how many cycles you run. Source quality is the binding constraint, not search depth.
Still speculative
Two more architectural shifts get a lot of research attention. Both might land. Both might not. Worth knowing about, not worth betting your stack on this year.
On-device inference
Apple ships foundation models for iPhones at 3.5 bits per weight. Liquid AI builds fast audio models for edge devices. Stanford's OpenJarvis framework runs personal agents offline. Local models handle around 88.7% of daily questions at low latency without sharing data.
The counter: hardware limits software. Mobile chips lack the memory for large reasoning models. Small models hallucinate more. They cannot execute complex multi-step logic reliably without guidance. Heavy local processing drains batteries fast.
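The memory limit is easy to see with arithmetic. The sketch below counts weight storage only and ignores the KV cache and activations, so real usage is higher. The 3-billion and 70-billion parameter counts are illustrative assumptions, not figures from the sources above.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


small_quantised = weight_memory_gb(3e9, 3.5)    # 3B parameters at 3.5 bits per weight
small_fp16 = weight_memory_gb(3e9, 16)          # the same model at 16 bits per weight
large_quantised = weight_memory_gb(70e9, 3.5)   # 70B parameters at 3.5 bits per weight
```

Under these assumptions the small model drops from 6 GB to about 1.3 GB when quantised, which fits on a phone. The 70B model still needs about 30.6 GB for weights alone, which does not.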
World models for embodied agents
Google DeepMind's Genie 3 generates interactive 3D worlds from text at 24 frames per second. Robots train inside these synthetic worlds before touching real hardware. SenseTime's KaiWu and Toshiba's ReCoRe extend the same idea with multimodal perception and contrastive features.
The counter: predicting video pixels is not understanding physics. Rodney Brooks argues that learning physics from video is silly when physics engines already exist in code. Video models hallucinate physical interactions. A robot trained in a fake video world fails in reality.
Every shift on this page has labs working on it and reasons it might fail. The pattern across all of them: the next architecture is more honest about what AI does well, more skeptical about what it does badly, and more willing to fall back to the boring layer underneath when the new layer breaks. Build for the boring case. Stretch for the new one when it earns its place.
Use this page with your AI
Copy the prompt below. Paste it into Claude, Copilot, Gemini, or any AI you use. The AI will ask you simple questions, then teach the page back to you using your own work as the example.
You are an expert teacher and AI strategist. Read the page below as your reference material. PAGE: Where AI systems are going CONTENT: ## Lede Every AI explainer ends with predictions. Most are wrong. Some are right but boring. This page covers the architectural shifts with credible technical backing, the labs building them, and the honest reasons each one might not work. ## Persistent memory layers Old AI starts every chat from scratch. Developers fix this by pasting all your past conversations into the prompt every time. That wastes tokens and slows everything down. New systems use persistent memory: the model writes structured notes about you to a database and reads them back when needed. The A-MEM paper (Xu et al., 2025) shows agents creating smart notes from chats. They extract tags, write short summaries, and link related memories using cosine similarity. The Mem0 project tracks user states and session states with entity linking for fast recall. Both replace raw vector dumps with structured graphs. Standard prompts paste all chat history at around 16,000 tokens with low recall accuracy on long tasks. A-MEM stores linked summary notes at around 2,500 tokens with higher recall accuracy. Memory structure beats memory size. The Model Context Protocol (MCP) standardises how memory connects to the model. Custom code gives way to open standards. Counter: Memory poisoning is real. A bad fact stored on Monday ruins answers on Friday. The MEXTRA attack (Wang et al., 2025) shows hackers can steal stored agent memories using prompt injection. Privacy across long sessions needs strict isolation between users. ## Long-running autonomous agents Early models ran for fifty milliseconds and typed a text answer. Modern agents run for hours or days. They fix software bugs, research papers, and write code without human input between steps. This needs new infrastructure. The Agent Sandbox project on Kubernetes gives agents safe, persistent execution environments. 
Agents pause and resume work across sessions. H2O Super Agent orchestrates dynamic routing and long memory. The Instruction Tool Retrieval method (2026) cuts context tokens by 95% and total episode costs by 70% by retrieving only the tools needed for each step instead of holding the whole instruction set in context. Counter: An agent stuck in a loop can spend thousands of dollars overnight. Long-running tasks need active credentials, which become hacker bait. An autonomous coding agent recently deleted a production database through a single rogue API call. Unsupervised agents need monitoring and least-privilege sandboxes, not blanket execution rights. ## Multi-agent coordination One model handling everything breaks on complex tasks. The model forgets earlier steps. The model loses focus halfway through. Multi-agent systems split the work: a manager agent plans, worker agents execute, a tester agent checks. This sounds great. The reality is more mixed. Microsoft's AutoGen framework helps agent teams build software. The MAC-Flow paper (Lee and Zhang, 2025) uses flow matching for coordination, speeding up inference 14.5 times by distilling behaviours into one-step policies. The R3DM framework improves how agents discover roles using contrastive learning, increasing win rates by 20% on benchmarks. Counter: The DeepMind paper "Towards a Science of Scaling Agent Systems" (Kim and Liu, 2026) tested 180 different agent configurations. More agents do not guarantee success. Single agents beat multi-agent teams on sequential tasks. Coordination drops performance by 39 to 70 percent on logic-heavy work because agents copy each other's errors. Multi-agent systems consume 15 times more tokens than single agents. Multi-agent only wins on parallel tasks: 9.2% gain on parallel web navigation. Tool interface design matters more than agent count. ## Model routing as a first-class pattern Sending every prompt to the most expensive model wastes money. 
Sending every prompt to the cheapest model breaks on hard tasks. A routing gateway evaluates each prompt, checks task complexity, and picks the right model. This is now a first-class architectural component, not an optimisation. Microsoft's Foundry Model Router selects between massive and small models in real time. Developers test routing decisions with the RouteLens tool. Static routing uses the model name in code. Semantic routing uses the prompt's meaning to decide. Disaggregated routing splits work across hardware: prefill needs fast computation, decode needs memory bandwidth, and they can run on different chips. The llm-d project on Kubernetes implements disaggregated serving by separating the prefill phase from the decode phase. Semantic routing saves around 50% on cost. Disaggregated routing maximises hardware speed. Counter: Splitting tasks across chips creates network complexity. Moving data between machines adds latency. The network becomes the new bottleneck. Small companies do not have the infrastructure to support disaggregated serving. It remains a giant-scale tool. ## AI-first data transformers Old data pipelines broke on messy input. Engineers wrote regex per source, parsers per format, and one engineer per integration. The new pattern: hand the messy input to a small model with a target schema. The model normalises the input into clean structured data. Validate with Zod or JSON Schema. Fall back to a stronger model on failure. Chained transformers: one model extracts raw text, the next translates it, the final model formats it. Languages like PPL control execution flow across the chain. The COCOS framework (Jeonghun et al., 2025) teaches small models self-correction via reinforcement learning. The CorrectBench dataset (Sun et al., 2025) measures self-correction accuracy gains. Self-correction adds around 5% accuracy on tested datasets. The trade is 40% slower execution and higher token costs. 
Future pipelines route dynamically: simple chain-of-thought for easy rows, self-correction loops only for the hardest data. Counter: LLM transformers cost more per row than regex. Latency adds 1 to 3 seconds. They hallucinate at edge cases. The wins come from coverage and maintenance time, not raw speed. Use this for varied input sources. Use regex for clean repeat data. ## Research injection sequences AI training data goes stale fast. The current pattern: query analysis, web search, distillation, injection into prompt context, generation with citations. The pattern is now reaching its limits. The next generation runs the search loop differently. Agentic search loops read databases multiple times instead of once. Anthropic found multi-agent search beats single-agent search by 90% on complex research tasks. The agents formulate hypotheses, search in parallel, discard dead ends, and return only the precise file spans needed. The main context stays clean. SearchGPT trains models to decide autonomously when to search the web, using customised versions of GPT-5.5 trained with model distillation. It synthesises results into fluent answers, not blue links. A 2025 study found Llama models fabricated up to 85.6% of their citations. The CiteVerifier framework (Ansari et al., 2026) checks every generated link against real academic databases and rejects unsupported claims. Counter: More search cycles add cost and latency. Often the quality gain does not justify the spend. Garbage sources produce garbage answers regardless of how many cycles you run. Source quality is the binding constraint, not search depth. ## Still speculative On-device: Apple ships foundation models for iPhones at 3.5 bits per weight. Liquid AI builds fast audio models for edge devices. Stanford's OpenJarvis framework runs personal agents offline. Local models handle around 88.7% of daily questions at low latency without sharing data. Counter: hardware limits software. 
Mobile chips lack the memory for large reasoning models. Small models hallucinate more. They cannot execute complex multi-step logic reliably without guidance. Heavy local processing drains batteries fast. World models: Google DeepMind's Genie 3 generates interactive 3D worlds from text at 24 frames per second. Robots train inside these synthetic worlds before touching real hardware. SenseTime's KaiWu and Toshiba's ReCoRe extend the same idea with multimodal perception and contrastive features. Counter: predicting video pixels is not understanding physics. Rodney Brooks argues that learning physics from video is silly when physics engines already exist in code. Video models hallucinate physical interactions. A robot trained in a fake video world fails in reality. ## Closing Every shift on this page has labs working on it and reasons it might fail. The pattern across all of them: the next architecture is more honest about what AI does well, more skeptical about what it does badly, and more willing to fall back to the boring layer underneath when the new layer breaks. Build for the boring case. Stretch for the new one when it earns its place. Your job: 1. Ask me 3 to 5 simple questions about my work, my situation, and what I would actually use AI for. One question at a time. Wait for my answer between each. 2. Once you have my answers, explain the key ideas from the page back to me, using my answers as the example. 3. Suggest one concrete next step I could take this week. Tie it back to the page. 4. Push back if my answer is vague. Ask me to be specific. Rules for you: - Do not flatter me. - Do not agree with me when I am wrong. - If you do not know something, say so. - Be brief. Two paragraphs at most per turn. - Ask one question at a time. Do not stack questions.