ViBench: A Benchmark on Vibe Coding
Peter Zhong (Replit & Carnegie Mellon University), Pashootan Vaezipoor (Georgian AI Lab), Fuyang Cui (Georgian AI Lab), Vaibhav Kumar (Replit), James Austin (Replit), Azin Asgarian (Georgian AI Lab), Toby Ho (Replit), Paul Inder (Georgian AI Lab), Imen Kedir (Replit), Zhen Li (Replit), Nicholas Ondo (Replit), Asna Shafiq (Georgian AI Lab), Ibrahim Sheikh (Replit), Edouard Sioufi (Replit), Setareh Soltanieh (Georgian AI Lab), Ben Wilde (Georgian AI Lab), Jacky Zhao (Replit), Ryan Carelli (Replit), Heather Miller (Carnegie Mellon University), Michele Catasta (Replit)
Evaluation & Benchmarking
Abstract
Vibe coding, in which users describe applications in natural language and AI agents autonomously handle all implementation, has rapidly emerged as a mainstream development paradigm. Yet existing benchmarks mainly evaluate an agent's ability to complete code or fix bugs, not to build applications end to end from a user's perspective. We introduce ViBench, the first open-source benchmark for evaluating AI agents on realistic, end-to-end web application development. ViBench comprises tasks derived from production traces across 15 applications spanning diverse domains, covering both zero-to-one creation and feature extension, the two core workflows of vibe coding. Unlike in prior benchmarks, tasks are specified entirely through user-facing requirements with no implementation constraints, and evaluation occurs at the abstraction level of an end user interacting with the application. We contribute an adaptive automatic evaluator that uses REPL-based browser automation to verify arbitrary implementations without assuming anything about their code structure, achieving 99% step-level agreement with human experts across 1,082 test steps. Evaluating nine models (five closed-source and four open-weight) across 105 artifacts, we find that even the best models are far from reliable: Claude Opus 4.6 and GPT-5.2 lead at only 46% and 42% Pass@1 (a fully correct artifact on the first attempt), respectively, while no open-weight model exceeds 12% Pass@1. Feature extension on a stable reference codebase proves easier than zero-to-one creation, but errors compound when models extend their own code: seven of the nine models degrade in the Vibe-on-Vibe setting, with complete-failure rates rising sharply. Notably, open-weight models that perform competitively on coding benchmarks such as SWE-bench struggle to generalize to end-to-end vibe coding, exhibiting complete-failure rates of up to 40%, underscoring that end-to-end application development demands capabilities beyond what current coding benchmarks measure. As vibe coding matures into mainstream development practice, ViBench provides the open-source infrastructure needed to measure progress, compare models, and guide practitioners toward reliable AI-assisted development, beyond just a "Vibe Check".
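To make the evaluation approach concrete, the sketch below is a minimal, hypothetical illustration of REPL-style browser verification, not the paper's actual evaluator: it drives a running app with Playwright and records a pass/fail verdict per user-level step. The URL, the step schema, and the example todo-app steps are assumptions for this illustration.

```python
# Hypothetical sketch of REPL-style browser verification (illustrative only;
# this is not ViBench's evaluator). Requires: pip install playwright.
from playwright.sync_api import sync_playwright

def run_steps(url: str, steps: list[dict]) -> list[bool]:
    """Execute user-level test steps against a live app one at a time,
    so later steps can react to whatever earlier steps revealed."""
    verdicts = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        for step in steps:
            try:
                # Steps are phrased in terms of what a user sees and does,
                # with no reference to the app's code structure.
                if step["action"] == "click":
                    page.get_by_text(step["target"]).click()
                elif step["action"] == "fill":
                    page.get_by_label(step["target"]).fill(step["value"])
                elif step["action"] == "expect_text":
                    assert step["target"] in page.inner_text("body")
                verdicts.append(True)   # step behaved as a user would expect
            except Exception:
                verdicts.append(False)  # user-visible requirement not met
        browser.close()
    return verdicts

# Example: check a todo app purely through its UI (hypothetical app/steps).
verdicts = run_steps("http://localhost:3000", [
    {"action": "fill", "target": "New task", "value": "buy milk"},
    {"action": "click", "target": "Add"},
    {"action": "expect_text", "target": "buy milk"},
])
```

Because each step executes interactively against the live page, a verifier in this style can adapt its next action to the state the previous action produced, which is what allows arbitrary implementations to be judged solely by user-visible behavior.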