Valkyrie: A Microservice-Based Framework for Scalable Evaluation of AI Agents
Jarett Forzano (Vals AI), Omar Almatov (Vals AI), Langston Nashold (Vals AI), Nikil Ravi (Vals AI), Orestes Kassian (Vals AI)
Evaluation & Benchmarking, Engineering & Operations
Summary
A microservice-based benchmarking framework that decouples benchmark code, agent logic, and execution infrastructure for scalable, reproducible evaluation of AI agents.
Description
Existing agent evaluation frameworks couple benchmark code, agent logic, and execution infrastructure into monolithic repositories, and they rely on local execution, ephemeral storage, and trust rather than verification. We present Valkyrie, a microservice-based benchmarking framework that decouples these three concerns into independently deployable services, scales task execution horizontally across isolated Daytona sandboxes, and persists all results in the evaluating organization's own cloud account. We validate the system by running four agents on SWE-Bench Verified and Terminal-Bench 2.0, demonstrating end-to-end orchestration across multiple benchmarks without per-run manual configuration.
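To make the separation of concerns concrete, the following is a minimal sketch of how benchmark code, agent logic, and execution infrastructure could sit behind independent interfaces. It is not Valkyrie's actual API: every name here (`Task`, `Result`, `Agent`, `Benchmark`, `ResultStore`, `run_evaluation`, `local_sandbox`) is hypothetical, the thread pool merely stands in for horizontal scaling across sandboxes, and `local_sandbox` is a placeholder that runs the work in-process rather than calling Daytona.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from threading import Lock
from typing import Callable, Protocol


@dataclass(frozen=True)
class Task:
    benchmark: str
    task_id: str
    prompt: str


@dataclass(frozen=True)
class Result:
    benchmark: str
    task_id: str
    agent: str
    passed: bool


class Agent(Protocol):            # agent logic (hypothetical interface)
    name: str
    def solve(self, task: Task) -> str: ...


class Benchmark(Protocol):        # benchmark code (hypothetical interface)
    name: str
    def tasks(self) -> list[Task]: ...
    def grade(self, task: Task, output: str) -> bool: ...


class ResultStore(Protocol):      # persistent storage (hypothetical interface)
    def save(self, result: Result) -> None: ...


def local_sandbox(work: Callable[[], str]) -> str:
    """Placeholder for an isolated sandbox; here it just runs in-process."""
    return work()


def run_evaluation(agents, benchmarks, store, sandbox=local_sandbox, max_workers=8):
    """Fan every (agent, benchmark, task) triple out to a worker and persist each result."""
    jobs = [(a, b, t) for a in agents for b in benchmarks for t in b.tasks()]

    def run_one(job):
        agent, bench, task = job
        output = sandbox(lambda: agent.solve(task))   # execution is swappable
        result = Result(bench.name, task.task_id, agent.name, bench.grade(task, output))
        store.save(result)                            # results outlive the worker
        return result

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, jobs))


# Toy stand-ins so the sketch runs end to end.
class ReverseBenchmark:
    name = "reverse"
    def tasks(self):
        return [Task(self.name, f"t{i}", p) for i, p in enumerate(["abc", "xyz"])]
    def grade(self, task, output):
        return output == task.prompt[::-1]


class ReverseAgent:
    name = "reverser"
    def solve(self, task):
        return task.prompt[::-1]


class EchoAgent:
    name = "echo"
    def solve(self, task):
        return task.prompt


class InMemoryStore:
    def __init__(self):
        self.rows, self._lock = [], Lock()
    def save(self, result):
        with self._lock:
            self.rows.append(result)


store = InMemoryStore()
results = run_evaluation([ReverseAgent(), EchoAgent()], [ReverseBenchmark()], store)
```

Because the orchestrator depends only on the three interfaces, adding an agent or a benchmark in this sketch means adding an item to a list rather than editing the run loop, which is the property the description attributes to decoupled services.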