

Context, Reasoning, and Hierarchy: A Cost–Performance Study of Compound LLM Agent Design in an Adversarial POMDP

Igor Bogdanov (Carleton University), Chung-Horng Lung (Carleton University), Thomas Kunz (Carleton University), Jie Gao (Carleton University), Adrian Taylor (Defence R&D Canada), Marzia Zaman (Cistel Technology)

Architectural Patterns & Composition

Abstract

Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several interacting design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet this design space remains empirically underexplored, and practitioners lack guidance on which design choices improve performance versus merely increase inference costs via token consumption. We present a controlled study of compound LLM agent design in CybORG CAGE-2, a cyber defense environment modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non-positive, so all configurations operate in a failure-mitigation regime where errors compound over time. Our evaluation spans five model families, six models, and twelve configurations (3,475 evaluated episodes) with token-level cost accounting. We systematically vary context representation (raw observations vs. a deterministic, programmatic environment state-tracking layer with compressed history), deliberation (self-questioning, self-critique, and self-improvement tools, with optional explicit chain-of-thought prompting), and hierarchical decomposition (monolithic planner vs. delegation to specialized Analyst and ActionChooser sub-agents). We find that: (1) Programmatic state abstraction consistently delivers the largest returns per token spent (RPTS), improving mean episode return by up to 76% relative to raw observations, without increasing LLM API calls. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4× worse mean return while using 2–3× more tokens. We call this destructive interaction pattern a "deliberation cascade". (3) Hierarchical decomposition without deliberation tools achieves the best absolute performance for most models, and context engineering is generally more cost-effective than deliberation.
These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, as these strategies can interfere when combined.
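The abstract's cost metric, returns per token spent (RPTS), can be sketched as mean episode return divided by mean tokens consumed. This is a minimal illustrative reading, not the authors' implementation; the `Episode` fields and the exact normalization are assumptions for the sketch.

```python
# Hypothetical sketch of a returns-per-token-spent (RPTS) metric.
# Field and function names are illustrative assumptions, not the
# paper's actual code or definition.
from dataclasses import dataclass

@dataclass
class Episode:
    total_return: float  # non-positive in CAGE-2's failure-mitigation regime
    tokens_used: int     # prompt + completion tokens across all LLM API calls

def rpts(episodes: list[Episode]) -> float:
    """Mean episode return per mean token consumed.

    Since returns are non-positive, values closer to zero are better:
    less reward is lost per token spent.
    """
    mean_return = sum(e.total_return for e in episodes) / len(episodes)
    mean_tokens = sum(e.tokens_used for e in episodes) / len(episodes)
    return mean_return / mean_tokens
```

Under this reading, a configuration that halves token usage while holding mean return fixed doubles its RPTS magnitude penalty per token, which is why the abstract frames cheap programmatic state abstraction as the most cost-effective lever.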

ACM CAIS 2026 Sponsors