

Context, Reasoning, and Hierarchy: A Cost–Performance Study of Compound LLM Agent Design in an Adversarial POMDP

Igor Bogdanov (Carleton University), Chung-Horng Lung (Carleton University), Thomas Kunz (Carleton University), Jie Gao (Carleton University), Adrian Taylor (Defence R&D Canada), Marzia Zaman (Cistel Technology)

Architectural Patterns & Composition

A controlled study of compound LLM agent design in CybORG, an adversarial cyber defense environment, that separates the effects of context design, reasoning depth, and task decomposition on agent performance. It gives practitioners empirical guidance on which design choices genuinely improve outcomes versus which merely increase inference cost via token consumption.

Presentation

Talk

Paper Session 1: Agent Design

Wednesday, May 27 · 10:45 AM – 10:55 AM

Bayshore Ballroom

Poster

Wednesday, May 27 · 5:15 PM – 6:45 PM

Carmel / Monterey

Abstract

Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several interacting design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve performance and which merely increase inference cost. We present a controlled study of compound LLM agent design in CybORG CAGE-2, a cyber defense environment modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non-positive, so all configurations operate in a failure-mitigation mode and errors compound over time. Our evaluation spans five model families, six models, and twelve configurations (3,475 episodes) with token-level cost accounting. We systematically vary context representation (raw observations vs. a deterministic, programmatic environment state-tracking layer with compressed history), deliberation (self-questioning, self-critique, and self-improvement tools, with optional chain-of-thought prompting), and hierarchical decomposition (monolithic ReAct vs. delegation to specialized sub-agents). We find that: (1) Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4× worse mean return while using 1.8–2.7× more tokens. We call this destructive interaction pattern a deliberation cascade. (3) Hierarchical decomposition without deliberation tools achieves the best absolute performance for most models, and context engineering is generally more cost-effective than deliberation. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, since decomposition and deeper reasoning can interfere when combined.
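To make the "deterministic, programmatic environment state-tracking layer with compressed history" more concrete, the following is a minimal Python sketch of what such a layer might look like. It is an illustration under stated assumptions, not the authors' implementation and not the CybORG API: the `StateTracker` class, the host names, the status strings, the action labels, and the observation format (a dict from host name to status) are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StateTracker:
    """Hypothetical state-tracking layer; not the paper's code or CybORG's API."""

    max_history: int = 5  # assumed cap on retained history entries
    host_status: dict = field(default_factory=dict)
    history: deque = field(default_factory=deque)

    def update(self, step: int, action: str, observation: dict) -> None:
        # Assumption: an observation maps host name -> status string.
        changed = {
            host: status
            for host, status in observation.items()
            if self.host_status.get(host) != status
        }
        self.host_status.update(changed)
        # Record only steps where something changed, so history stays compressed.
        if changed:
            self.history.append((step, action, changed))
            while len(self.history) > self.max_history:
                self.history.popleft()

    def render(self) -> str:
        # Produce a compact text summary to place in the agent's context
        # instead of the full sequence of raw observations.
        lines = ["Current state:"]
        for host in sorted(self.host_status):
            lines.append(f"  {host}: {self.host_status[host]}")
        lines.append("Recent changes:")
        for step, action, changed in self.history:
            delta = ", ".join(f"{h}->{s}" for h, s in sorted(changed.items()))
            lines.append(f"  t={step} after {action}: {delta}")
        return "\n".join(lines)


if __name__ == "__main__":
    tracker = StateTracker(max_history=2)
    tracker.update(0, "Monitor", {"User1": "clean", "Op_Server": "clean"})
    tracker.update(1, "Monitor", {"User1": "compromised", "Op_Server": "clean"})
    tracker.update(2, "Restore User1", {"User1": "clean", "Op_Server": "clean"})
    print(tracker.render())
```

The point of the sketch is that the tracker is plain code with no model calls: the agent receives a short, deterministic summary whose size is bounded by the number of hosts and `max_history`, rather than a transcript that grows with every step.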

ACM CAIS 2026 Sponsors