ViBench: A Benchmark on Vibe Coding
Peter Zhong (Replit & Carnegie Mellon University), Pashootan Vaezipoor (Georgian AI Lab), Fuyang Cui (Georgian AI Lab), Vaibhav Kumar (Replit), James Austin (Replit), Azin Asgarian (Georgian AI Lab), Toby Ho (Replit), Paul Inder (Georgian AI Lab), Imen Kedir (Replit), Zhen Li (Replit), Nicholas Ondo (Replit), Asna Shafiq (Georgian AI Lab), Ibrahim Sheikh (Replit), Edouard Sioufi (Replit), Setareh Soltanieh (Georgian AI Lab), Ben Wilde (Georgian AI Lab), Jacky Zhao (Replit), Ryan Carelli (Replit), Heather Miller (Carnegie Mellon University), Michele Catasta (Replit)
Evaluation & Benchmarking
Abstract
Vibe coding, in which users describe applications in natural language and AI agents autonomously handle all implementation, has rapidly emerged as a mainstream development paradigm. Yet existing benchmarks mainly evaluate an agent's ability to complete code or fix bugs, not to build applications end to end from a user's perspective. We introduce ViBench, the first open-source benchmark for evaluating AI agents on realistic, end-to-end web application development. ViBench comprises tasks derived from production traces across 15 applications spanning diverse domains, covering both zero-to-one creation and feature extension, the two core workflows of vibe coding. Unlike in prior benchmarks, tasks are specified entirely through user-facing requirements with no implementation constraints, and evaluation occurs at the abstraction level of an end user interacting with the application. We contribute an adaptive automatic evaluator that uses REPL-based browser automation to verify arbitrary implementations without assuming anything about their code structure, achieving 99% step-level agreement with human experts across 1,082 test steps. Evaluating nine models (five closed-source and four open-weight) across 105 artifacts, we find that even the best models are far from reliable: Claude Opus 4.6 and GPT-5.2 lead at only 46% and 42% Pass@1 (a fully correct artifact on the first attempt), respectively, while no open-weight model exceeds 12% Pass@1. Feature extension on a stable reference codebase proves easier than zero-to-one creation, but errors compound when models extend their own code: seven of the nine models degrade in the Vibe-on-Vibe setting, with complete-failure rates rising sharply. Notably, open-weight models that perform competitively on coding benchmarks such as SWE-bench struggle to generalize to end-to-end vibe coding, exhibiting complete-failure rates of up to 40%, underscoring that end-to-end application development demands capabilities beyond what current coding benchmarks measure. As vibe coding matures into mainstream development practice, ViBench provides the open-source infrastructure needed to measure progress, compare models, and guide practitioners toward reliable AI-assisted development, beyond just a "Vibe Check".
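To make the evaluation approach concrete, the sketch below is a minimal, hypothetical illustration of REPL-style browser verification, not the paper's actual evaluator: it drives a running app with Playwright and records a pass/fail verdict per user-level step. The URL, the step schema, and the example todo-app steps are assumptions for this illustration.

```python
# Hypothetical sketch of REPL-style browser verification (illustrative only;
# this is not ViBench's evaluator). Requires: pip install playwright.
from playwright.sync_api import sync_playwright

def run_steps(url: str, steps: list[dict]) -> list[bool]:
    """Execute user-level test steps against a live app one at a time,
    so later steps can react to whatever earlier steps revealed."""
    verdicts = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        for step in steps:
            try:
                # Steps are phrased in terms of what a user sees and does,
                # with no reference to the app's code structure.
                if step["action"] == "click":
                    page.get_by_text(step["target"]).click()
                elif step["action"] == "fill":
                    page.get_by_label(step["target"]).fill(step["value"])
                elif step["action"] == "expect_text":
                    assert step["target"] in page.inner_text("body")
                verdicts.append(True)   # step behaved as a user would expect
            except Exception:
                verdicts.append(False)  # user-visible requirement not met
        browser.close()
    return verdicts

# Example: check a todo app purely through its UI (hypothetical app/steps).
verdicts = run_steps("http://localhost:3000", [
    {"action": "fill", "target": "New task", "value": "buy milk"},
    {"action": "click", "target": "Add"},
    {"action": "expect_text", "target": "buy milk"},
])
```

Because each step executes interactively against the live page, a verifier in this style can adapt its next action to the state the previous action produced, which is what allows arbitrary implementations to be judged solely by user-visible behavior.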