SAPO: Secure Automated Prompt Optimization via Multi-Agent Collaboration

Emmanuel Aboah Boateng (DoorDash), Zachary Johnson (Microsoft), Tian Xia (Microsoft), Sarah Zhang (Amazon), Aidan Jay (Microsoft), Junyao Feng (Microsoft), Aditya Mate (Microsoft), Ehi Nosakhare (Microsoft)

Security & Privacy · Architectural Patterns & Composition

SAPO is a multi-agent prompt optimization framework that treats safety as a first-class constraint, jointly maximizing task performance and robustness to adversarial inputs. It closes the gap left by accuracy-only prompt optimization, which routinely produces prompts vulnerable to jailbreaks and harmful outputs.

Presentation

Talk

Paper Session 5: Security & Governance

Thursday, May 28 · 11:30 AM – 11:40 AM

Bayshore Ballroom

Poster

Thursday, May 28 · 4:30 PM – 6:00 PM

Carmel

Abstract

Prompt optimization is essential for deploying language models in specialized tasks, yet existing automated prompt optimization methods focus almost exclusively on task performance while treating safety as an afterthought. This gap is consequential: prompts optimized purely for accuracy can become susceptible to adversarial inputs that elicit harmful, biased, or confidential outputs. We introduce SAPO (Secure Automated Prompt Optimization), a multi-agent framework that formulates prompt optimization as a constrained multi-objective problem, maximizing task performance subject to explicit security constraints. SAPO coordinates four specialized agents through a central orchestrator: a Prompt Generation Agent for candidate creation, a Security Check Agent for adversarial robustness evaluation, a Performance Evaluation Agent for task accuracy measurement, and a Critic Agent that synthesizes cross-agent feedback to adaptively rebalance optimization weights across iterations. A security constraint favors for selection those candidate prompts that exceed minimum thresholds on both the security score and the overall score. The framework extends naturally to model migration scenarios where prompts must transfer across model families without sacrificing safety or performance. Experiments across six tasks from the Instruction Induction and BIG-Bench Hard benchmarks, evaluated against HarmBench for adversarial robustness, demonstrate that SAPO achieves perfect security scores while also attaining the highest aggregated task-accuracy score, outperforming single-objective baseline methods by at least 2.6%.
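
The threshold-based selection and weight rebalancing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Candidate`, `overall_score`, `select_candidate`, and `rebalance`, the weighted-sum form of the overall score, the threshold values, and the fixed-step rebalancing heuristic are all assumptions made for illustration. In SAPO the scores would come from the Security Check and Performance Evaluation agents, and rebalancing is driven by the Critic Agent's feedback rather than a fixed rule.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical container for one candidate prompt and its scores."""
    prompt: str
    performance: float  # task accuracy in [0, 1] (assumed scale)
    security: float     # adversarial robustness in [0, 1] (assumed scale)


def overall_score(c: Candidate, w_perf: float, w_sec: float) -> float:
    # Assumption: the overall score is a weighted sum of the two objectives.
    return w_perf * c.performance + w_sec * c.security


def select_candidate(candidates, w_perf=0.5, w_sec=0.5,
                     min_security=0.9, min_overall=0.7) -> Candidate:
    """Favor candidates that clear both thresholds; otherwise fall back
    to the best-scoring candidate overall (assumed fallback behavior)."""
    feasible = [
        c for c in candidates
        if c.security >= min_security
        and overall_score(c, w_perf, w_sec) >= min_overall
    ]
    pool = feasible or list(candidates)
    return max(pool, key=lambda c: overall_score(c, w_perf, w_sec))


def rebalance(w_sec: float, best: Candidate,
              min_security=0.9, step=0.1):
    """Stand-in for the critic: shift weight toward security when the
    selected prompt misses the security threshold, away from it otherwise.
    Returns the new (w_perf, w_sec) pair, which always sums to 1."""
    if best.security < min_security:
        w_sec = min(1.0, w_sec + step)
    else:
        w_sec = max(0.0, w_sec - step)
    return 1.0 - w_sec, w_sec


if __name__ == "__main__":
    pool = [
        Candidate("accurate but unsafe", performance=0.95, security=0.5),
        Candidate("balanced", performance=0.80, security=1.0),
        Candidate("safe but weak", performance=0.60, security=1.0),
    ]
    best = select_candidate(pool)
    print(best.prompt, rebalance(0.5, best))
```

The fallback to the unconstrained pool mirrors the abstract's wording that qualifying prompts are "favored" rather than strictly required; a hard-constraint variant would instead reject the iteration when no candidate is feasible.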
