
All Accepted Papers

Composing Policy Gradients and Prompt Optimization for Language Model Programs

Noah Ziems (University of Notre Dame), Dilara Soylu (Stanford University), Lakshya A Agrawal (UC Berkeley), Isaac Miller (Normal Computing), Liheng Lai (UC Berkeley), Chen Qian (Databricks), Kaiqiang Song (Zoom, Inc.), Meng Jiang (University of Notre Dame), Dan Klein (UC Berkeley), Matei Zaharia (UC Berkeley, Databricks), Karel D’Oosterlinck (Contextual AI), Christopher Potts (Stanford University), Omar Khattab (MIT)

Architectural Patterns & Composition

Abstract

Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that combine multiple LM calls with distinct prompt templates and other tools, and it is not clear how best to leverage GRPO to improve these systems. We begin to address this challenge by generalizing GRPO to multi-prompt systems: we group LM calls by module across rollouts and handle variable-length and interrupted trajectories. We find for the first time that GRPO (and its multi-module counterpart) composes well with automatic prompt optimization; together they improve accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks over the post-trained LM, with 5% gains over prompt optimization on its own. Our approach is released as an open-source learning algorithm for compound AI systems.
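The core idea of the generalization, grouping LM calls by module across rollouts before computing group-relative advantages, can be illustrated with a minimal sketch. This is not the paper's released implementation; the data layout (`rollouts` as dicts with a scalar `reward` and a list of module names in `calls`) and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

def module_grouped_advantages(rollouts):
    """Illustrative GRPO-style advantage computation, grouped by module.

    rollouts: list of dicts with
      "reward": scalar reward for the whole trajectory,
      "calls":  names of modules invoked (variable length, since
                trajectories may be truncated or interrupted).
    Returns: {module: {rollout_index: advantage}}.
    """
    # Group rollouts by which module they invoked (each module counted
    # once per rollout); interrupted rollouts simply contribute to
    # fewer groups.
    groups = defaultdict(list)  # module -> [(rollout_index, reward)]
    for i, rollout in enumerate(rollouts):
        for module in set(rollout["calls"]):
            groups[module].append((i, rollout["reward"]))

    # Within each module's group, normalize rewards to group-relative
    # advantages: (reward - group mean) / group std.
    advantages = defaultdict(dict)
    for module, pairs in groups.items():
        rewards = [r for _, r in pairs]
        mu = mean(rewards)
        sigma = pstdev(rewards) or 1.0  # degenerate group -> zero advantage
        for i, r in pairs:
            advantages[module][i] = (r - mu) / sigma
    return advantages
```

A rollout that invokes only some modules contributes advantages only to those modules' groups, which is one simple way variable-length trajectories can be accommodated.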

ACM CAIS 2026 Sponsors