FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast
Igor Bogdanov (Carleton University), Chung-Horng Lung (Carleton University), Thomas Kunz (Carleton University), Jie Gao (Carleton University), Adrian Taylor (Defence R&D Canada), Marzia Zaman (Cistel Technology)
Architectural Patterns & Composition
Abstract
Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. Within each stage, attempts whose per-step reward falls below a threshold are terminated early. A dedicated reflection agent, using the same underlying LLM (no distillation from a stronger model), analyzes the trajectory and produces reusable knowledge artifacts: textual heuristics (RULES), few-shot demonstrations (EXAMPLES), or both (MIXED). These artifacts are appended to the acting agent's prompt, and the next attempt starts a fresh episode from the initial state. The best-performing instance's memory is propagated with replacement to the population between stages, while a graduation criterion freezes converged instances and excludes them from further training. We establish a baseline by evaluating an empty-memory hierarchical ReAct agent on CybORG CAGE-2, a stochastic network-defense environment modelled as a Partially Observable Markov Decision Process (POMDP), where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed baseline rewards. FORGE improves average evaluation return by up to 7.1× over baseline, and reduces catastrophic-failure rates (episode return below -100) to as low as 0-6% in the strongest configurations. Across 46 experiments totaling 910 evaluated episodes and 4B+ tokens, we find that (1) knowledge transfer via population broadcast is critical, improving performance by 22.7-74.3% over the no-broadcast ablation; (2) all three representations yield large improvements in the replicated Gemini study, with substantial gains also observed in single-session probes across three additional model families; and (3) weaker baseline models benefit disproportionately, suggesting FORGE mitigates capability gaps rather than amplifying strong models.
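To make the staged protocol concrete, the sketch below mirrors the outer loop described in the abstract: within-stage attempts with early termination on low per-step reward, same-model reflection that appends artifacts to prompt-injected memory, between-stage broadcast of the best instance's memory with replacement, and graduation of converged instances. This is a minimal illustration under stated assumptions, not the authors' implementation: the environment, reflection, and convergence stubs (run_episode, reflect, has_converged) are toy placeholders standing in for CybORG CAGE-2 rollouts and same-LLM reflection calls, and all thresholds are arbitrary.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Instance:
    memory: list[str] = field(default_factory=list)  # prompt-injected artifacts (RULES / EXAMPLES / MIXED)
    graduated: bool = False
    best_return: float = float("-inf")


# --- Toy stand-ins for the environment, acting agent, and reflection agent. ---
# In the paper these would be CybORG CAGE-2 episodes and same-LLM reflection calls.

def run_episode(memory: list[str], step_threshold: float = -5.0, horizon: int = 30) -> tuple[float, list[float]]:
    """Toy rollout: per-step rewards improve slightly with memory size; the attempt
    is terminated early the first time a step reward falls below `step_threshold`."""
    trajectory, total = [], 0.0
    for _ in range(horizon):
        r = random.gauss(mu=-3.0 + 0.1 * len(memory), sigma=2.0)
        trajectory.append(r)
        total += r
        if r < step_threshold:
            break  # early termination on a catastrophic step
    return total, trajectory


def reflect(trajectory: list[float]) -> list[str]:
    """Toy reflection: emit one textual artifact derived from the trajectory."""
    return [f"heuristic learned from trajectory of length {len(trajectory)}"]


def has_converged(inst: Instance, target: float = -20.0) -> bool:
    """Toy graduation criterion: freeze an instance once its best return clears a target."""
    return inst.best_return >= target


def forge(population: list[Instance], num_stages: int, attempts_per_stage: int) -> list[Instance]:
    for _ in range(num_stages):
        # Within-stage loop: each non-graduated instance runs fresh episodes,
        # reflecting on each trajectory and appending artifacts to its memory.
        for inst in population:
            if inst.graduated:
                continue
            for _ in range(attempts_per_stage):
                ret, trajectory = run_episode(inst.memory)
                inst.best_return = max(inst.best_return, ret)
                inst.memory.extend(reflect(trajectory))

        # Between stages: broadcast the best-performing instance's memory to the
        # rest of the population (with replacement), then graduate converged instances.
        active = [i for i in population if not i.graduated]
        if active:
            best = max(active, key=lambda i: i.best_return)
            for inst in active:
                if inst is not best:
                    inst.memory = list(best.memory)
            for inst in active:
                if has_converged(inst):
                    inst.graduated = True
    return population


if __name__ == "__main__":
    population = [Instance() for _ in range(4)]
    forge(population, num_stages=5, attempts_per_stage=3)
    print([round(i.best_return, 1) for i in population])
```

The broadcast step replaces every active instance's memory with the current best, which is the source of the 22.7-74.3% gain reported over the no-broadcast ablation; the graduation flag simply removes converged instances from further stages.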