Tressoir: Unifying Online, Offline, and HIL Design and Evolution of Multi-Agent Systems via Interpretable Blueprints
Amadou Latyr Ngom (MIT, G5 Labs), Ziniu Wu (MIT, G5 Labs), Jason Mohoney (MIT), James Moore (MIT, G5 Labs), Alex Zhang (MIT), Samuel Madden (MIT, G5 Labs), Tim Kraska (MIT, G5 Labs)
Architectural Patterns & Composition
Tressoir jointly designs and evolves multi-agent architectures, prompts, tools, and knowledge through human-readable Interpretable Blueprints that encode both online design intent and offline-generated high-quality components. It supports automated, human-guided, and hybrid optimization modes, making multi-agent system development more systematic and reproducible.
Presentation
Talk
Paper Session 1: Agent Design
Wednesday, May 27 · 11:15 AM – 11:25 AM
Bayshore Ballroom
Poster
Wednesday, May 27 · 5:15 PM – 6:45 PM
Carmel / Monterey
Abstract
We explore a principled approach that jointly designs and evolves the architectures, prompts, tools, and knowledge of multi-agent systems, whether online, offline, or with human guidance. First, we propose Interpretable Blueprints (IBs), which pair an online-interpretable system ontology (describing architectures, invariants, domain knowledge, etc.) with offline-materialized components proven to be high-quality or cost-effective. Second, we propose a supervising interpreter that co-interprets the IB and the task to construct a specialized agentic system on the fly, without assuming any pre-existing implementation, thereby enabling maximal adaptation to the task. IBs are also the primary online communication mechanism between agents. Offline learning is a special case of this approach: learning IBs encode the learning strategies that let the interpreter orchestrate metrics collection and IB improvement. Human guidance is possible at every layer, whether by co-editing IBs or by steering online or offline interpretation in ways that the system learns from over time. To instantiate this vision, we develop Tressoir, an IB-centric framework that unifies online, offline, and human-guided evolution under a single mechanism. Tressoir is tailored for long-running, complex projects whose tasks build on each other and require continual learning during or between executions. Its generality further allows it to bootstrap itself: its own features are now self-generated with high-level human guidance. We also evaluate Tressoir on shorter-term benchmarks. On SWE-Bench-Pro's Qutebrowser subset, Tressoir with Claude 4.6 Opus reaches 75.9% vs. 57.0% for SWE-Agent; on ScreenSpot-Pro, it lifts Gemini 3 Flash from a 69.1% baseline to 83.1%; and on Bird-Critic Flash, Tressoir with Gemini 3 Flash tools scores 56.0%, exceeding SQL-ACT with Claude 4.6 Opus at 52.0%.
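To make the pairing described in the abstract concrete, here is a minimal Python sketch of how an IB might combine an ontology with offline-materialized components, and how an interpreter might assemble a system for a task. It is an illustration under stated assumptions, not Tressoir's actual design or API: the names `Component`, `InterpretableBlueprint`, and `interpret`, the `quality` and `cost` fields, and the cost-budget selection rule are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical offline-materialized component."""
    name: str
    kind: str       # assumed categories, e.g. "prompt" or "tool"
    quality: float  # assumed offline-measured score (higher is better)
    cost: float     # assumed relative cost per use


@dataclass
class InterpretableBlueprint:
    """Hypothetical IB: a readable ontology plus materialized components."""
    ontology: dict[str, list[str]]  # agent role -> component kinds it needs
    components: list[Component] = field(default_factory=list)


def interpret(ib: InterpretableBlueprint, task_roles: list[str],
              max_cost: float) -> dict[str, list[str]]:
    """Toy stand-in for a supervising interpreter.

    For each role the task needs, pick the highest-quality component of
    each required kind whose cost fits within max_cost.
    """
    system: dict[str, list[str]] = {}
    for role in task_roles:
        chosen = []
        for kind in ib.ontology.get(role, []):
            candidates = [c for c in ib.components
                          if c.kind == kind and c.cost <= max_cost]
            if candidates:
                chosen.append(max(candidates, key=lambda c: c.quality).name)
        system[role] = chosen
    return system


ib = InterpretableBlueprint(
    ontology={"planner": ["prompt"], "coder": ["prompt", "tool"]},
    components=[
        Component("plan_v1", "prompt", quality=0.70, cost=1.0),
        Component("plan_v2", "prompt", quality=0.85, cost=3.0),
        Component("shell", "tool", quality=0.90, cost=1.0),
    ],
)

cheap_system = interpret(ib, ["planner", "coder"], max_cost=2.0)
rich_system = interpret(ib, ["planner", "coder"], max_cost=5.0)
print(cheap_system)
print(rich_system)
```

In this sketch the same blueprint yields different systems under different budgets, which loosely mirrors the idea of building a task-specific system at interpretation time rather than assuming a fixed implementation.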