VISTA: Verifier-in-the-Loop Agentic Reinforcement Learning for Quantum Program Synthesis
Cong Yu (Aalto University), Tuo Shi (Aalto University), Valter Uotila (Aalto University), Shilong Deng (University of Liverpool), Lei You (Technical University of Denmark), Bo Zhao (Aalto University)
Architectural Patterns & Composition
Vista is a verifier-in-the-loop RL system for quantum program synthesis. It schedules staged verification calls (compilers, simulators, optimizers) efficiently during training, so that agents learn from correctness signals rather than text plausibility. It demonstrates that formally verified program synthesis is tractable as a learning problem when verification cost is treated as a first-class system constraint.
Presentation
Talk
Paper Session 4: Agent Memory & Planning
Thursday, May 28 · 9:10 AM – 9:20 AM
Bayshore Ballroom
Poster
Thursday, May 28 · 4:30 PM – 6:00 PM
Carmel
Abstract
Quantum program synthesis increasingly depends on external evaluators such as parsers, simulators, and optimizers. In OpenQASM 3.0 circuit generation, artifact quality is determined not by text plausibility but by staged execution against tool-defined quantum semantics. This makes verifier-in-the-loop training a systems problem: verifier stages differ sharply in cost, latency, and informativeness, so executing the full verifier on every candidate is inefficient, while collapsing all verifier outcomes into a single reward can destabilize learning. We present Vista, a verifier-in-the-loop agentic reinforcement learning (RL) system for quantum program synthesis, instantiated for OpenQASM 3.0 quantum circuit generation. Vista introduces two mechanisms: (i) hierarchical verified reward optimization, which converts staged verifier outcomes into stable learning signals spanning feasibility, behavior, objective quality, and utility; and (ii) budget-aware gated evaluation, which schedules expensive verifier stages using partial evidence from earlier stages. Vista outperforms four classes of baselines—frontier LLM agents, quantum-specific agents, RL post-training agents, and RL agentic tool-use agents. Across quantum optimization tasks, it achieves 1.13× higher executability at Pass@10, improves semantic solution quality by 1.10×, and cuts verifier cost by 1.77× under matched-budget evaluation.
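To make the two mechanisms concrete, the following is a minimal illustrative sketch, not Vista's implementation. It assumes a fixed list of verifier stages ordered from cheap to expensive, a weighted sum as the hierarchical reward, and a per-stage score threshold as the gate. The names `Stage` and `gated_evaluate`, the toy checks, and all costs, weights, and thresholds are hypothetical stand-ins; a real system would call an actual OpenQASM 3.0 parser, simulator, and optimizer.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str
    cost: float                    # hypothetical cost units per verifier call
    weight: float                  # hypothetical contribution to the reward
    check: Callable[[str], float]  # returns a score in [0, 1]
    gate: float                    # minimum score required to run the next stage


def gated_evaluate(candidate: str, stages: List[Stage],
                   budget: float) -> Tuple[float, float, List[str]]:
    """Run stages in order, stopping when the budget or a gate blocks progress."""
    reward, spent, ran = 0.0, 0.0, []
    for stage in stages:
        if spent + stage.cost > budget:
            break  # the next stage is too expensive for the remaining budget
        score = stage.check(candidate)
        spent += stage.cost
        ran.append(stage.name)
        reward += stage.weight * score  # each level adds its own reward term
        if score < stage.gate:
            break  # weak early evidence: skip the costlier stages
    return reward, spent, ran


# Toy stand-ins for real verifier stages (assumptions for illustration only).
def toy_feasibility(src: str) -> float:
    return 1.0 if src.lstrip().startswith("OPENQASM 3") else 0.0


def toy_behavior(src: str) -> float:
    return 1.0 if "measure" in src else 0.0


def toy_quality(src: str) -> float:
    lines = [ln for ln in src.splitlines() if ln.strip()]
    return min(1.0, 4 / max(1, len(lines)))  # shorter programs score higher


STAGES = [
    Stage("feasibility", cost=1.0, weight=0.2, check=toy_feasibility, gate=1.0),
    Stage("behavior", cost=5.0, weight=0.3, check=toy_behavior, gate=1.0),
    Stage("quality", cost=20.0, weight=0.5, check=toy_quality, gate=0.0),
]

GOOD = "OPENQASM 3.0;\nqubit[2] q;\nh q[0];\nbit[2] c = measure q;"
BAD = "this is not a circuit"

if __name__ == "__main__":
    print(gated_evaluate(GOOD, STAGES, budget=30.0))  # all three stages run
    print(gated_evaluate(GOOD, STAGES, budget=10.0))  # quality stage skipped
    print(gated_evaluate(BAD, STAGES, budget=30.0))   # stops after feasibility
```

In this sketch a candidate that fails the cheap feasibility check never reaches the expensive stages, which is the intuition behind scheduling costly verifier calls from partial evidence. Keeping one reward term per stage, rather than a single pass/fail bit, is one simple way to preserve graded feedback across levels.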