Forge: Closing the Agentic Reliability Gap Between Self-Hosted and Frontier Language Models

Antoine Zambelli (Texas Instruments, Inc)

Security & Privacy · Evaluation & Benchmarking

An open-source guardrail framework that enables an 8B self-hosted model to achieve 99% agentic workflow accuracy, matching frontier APIs.

Presentation

Demo session

Thursday, May 28 · 4:30 PM – 6:00 PM

San Jose

Description

Agentic workflows (multi-step processes where language models call tools, interpret results, and make routing decisions) are a critical capability for deploying AI systems beyond simple question-answering. While frontier API models achieve high accuracy on individual function calls, multi-step workflows expose a compounding reliability problem: even 95% per-step accuracy yields only 77% completion over five steps. We present Forge, an open-source Python framework that adds tool-agnostic guardrails to self-hosted language models running on consumer hardware. Forge's guardrail stack (retry nudges, step enforcement, error recovery, context compaction, and hardware-aware VRAM budgeting) operates independently of the specific tools or workflows being executed. We evaluate Forge across 50+ model/backend configurations using a custom eval harness with 9 agentic scenarios run 50 times each.

Our key findings:
(1) an 8-billion-parameter model with Forge achieves 99% accuracy on agentic workflows, within 1 percentage point of frontier APIs with the same framework;
(2) the same 8B model with Forge outperforms frontier APIs without guardrails, which is the best configuration a consumer can achieve through API alone;
(3) even frontier models drop to 49–87% completion without guardrails, with error recovery scoring 0% universally: an architectural gap, not a capability gap;
(4) the serving backend is a hidden variable, with identical model weights producing 0% accuracy on one backend and 78% on another;
(5) larger models do not always outperform smaller ones, with 8B models beating 14B variants of the same family across multiple configurations.

These findings suggest that the reliability gap between self-hosted and frontier models is primarily a framework problem: guardrails close the mechanical gap, while the intelligence gap between model tiers manifests only in reasoning-heavy scenarios where frontier models excel without assistance.
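The compounding figure quoted in the description can be reproduced with a few lines of Python. The sketch below is illustrative only and is not taken from Forge: the function names are hypothetical, and it assumes every step succeeds independently with the same probability. The second helper shows, under the further assumption that retry attempts are independent, why a retry-style guardrail can raise per-step reliability; it does not model Forge's actual retry nudges.

```python
def workflow_completion(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step of a workflow succeeds.

    Assumption (not from the source): steps are independent and share
    the same success probability, so completion is p ** n.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


def accuracy_with_retries(per_step_accuracy: float, max_attempts: int) -> float:
    """Per-step success when a step may be attempted up to max_attempts times.

    Assumption (hypothetical model): attempts are independent, so the step
    fails only if every attempt fails.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    return 1.0 - (1.0 - per_step_accuracy) ** max_attempts


if __name__ == "__main__":
    # 95% per-step accuracy over five steps, as in the description.
    baseline = workflow_completion(0.95, 5)
    print(f"no retries: {baseline:.1%}")  # prints "no retries: 77.4%"

    # Same workflow if each step were allowed one retry (hypothetical).
    retried = workflow_completion(accuracy_with_retries(0.95, 2), 5)
    print(f"one retry per step: {retried:.1%}")
```

Real retries are rarely independent, since a model often repeats the same mistake, so the second number should be read as an optimistic bound on what retrying alone could achieve.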