OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction

Skyler Hallinan (University of Southern California), Thejas Venkatesh (Samaya AI), Xiang Ren (University of Southern California), Sai Praneeth Karimireddy (University of Southern California), Ashwin Paranjape (Samaya AI), Yuhao Zhang (Samaya AI), Jack Hessel (Samaya AI)

Evaluation & Benchmarking · Architectural Patterns & Composition

OpaqueToolsBench evaluates whether LLM agents can learn to use poorly-documented tools through interaction and self-generated documentation improvement, across three environments: general function calling, chess, and long-horizon agentic tasks. Most current agents show limited ability to adapt to opaque tools, exposing a practical gap between benchmark performance and real-world tool-use reliability.

Presentation

Talk

Paper Session 2: Agent Evaluation

Wednesday, May 27 · 1:50 PM – 2:00 PM

Bayshore Ballroom

Poster

Wednesday, May 27 · 5:15 PM – 6:45 PM

Carmel / Monterey

Abstract

Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks. While most existing benchmarks assume simple, perfectly documented tools, real-world tools (e.g., general “search” APIs) are often opaque, lacking clear best practices or failure modes. Can LLM agents improve their performance in environments with opaque tools by interacting and subsequently improving documentation? To study this, we create OpaqueToolsBench, a benchmark consisting of three distinct task-oriented environments: general function calling, interactive chess playing, and long-trajectory agentic search. Each environment provides underspecified tools that models must learn to use effectively to complete the task. Results on OpaqueToolsBench suggest existing methods for automatically documenting tools are expensive and unreliable when tools are opaque. To address this, we propose a simple framework, ToolObserver, that iteratively refines tool documentation by observing execution feedback from tool-calling trajectories. Our approach outperforms existing methods on OpaqueToolsBench across datasets, even in relatively hard settings. Furthermore, for test-time tool exploration settings, our method is also efficient, consuming 3.5-7.5× fewer total tokens than the best baseline.
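The abstract describes ToolObserver only at a high level: run tool-calling trajectories, observe the execution feedback, and revise the tool documentation. As an illustration of that general loop, and not the authors' implementation, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `opaque_search` tool and its hidden constraints, the `Observation` record, and the rule-based `refine_docs` function, which stands in for what would presumably be an LLM-driven documentation editor.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical type for a tool: takes a query, returns (success flag, feedback text).
Tool = Callable[[str], Tuple[bool, str]]


@dataclass
class Observation:
    """One tool call from a trajectory: the input tried and what came back."""
    query: str
    ok: bool
    feedback: str


def opaque_search(query: str) -> Tuple[bool, str]:
    """Hypothetical opaque tool with undocumented constraints:
    it rejects queries longer than three words and any non-lowercase query."""
    if len(query.split()) > 3:
        return False, "error: query too long"
    if query != query.lower():
        return False, "error: no results"
    return True, f"results for '{query}'"


def collect_trajectory(tool: Tool, queries: List[str]) -> List[Observation]:
    """Call the tool on each query and record the execution feedback."""
    return [Observation(q, *tool(q)) for q in queries]


def refine_docs(docs: str, observations: List[Observation]) -> str:
    """Stand-in for an LLM editor (assumption): append one note per
    previously undocumented failure seen in the observations."""
    notes: List[str] = []
    for obs in observations:
        if obs.ok:
            continue
        note = f"- Observed failure ({obs.feedback}) for input like: {obs.query!r}"
        if note not in docs and note not in notes:
            notes.append(note)
    return docs if not notes else docs + "\n" + "\n".join(notes)


def observe_and_refine(tool: Tool, docs: str, batches: List[List[str]]) -> str:
    """Iterate: gather a trajectory, then fold its feedback into the docs."""
    for queries in batches:
        docs = refine_docs(docs, collect_trajectory(tool, queries))
    return docs


# Example run: start from an underspecified description and refine it twice.
refined = observe_and_refine(
    opaque_search,
    "search(query): searches.",
    [["Chess Openings", "chess openings"], ["a very long query about chess"]],
)
print(refined)
```

In this toy setting the documentation simply accumulates failure notes. A realistic refiner would also need to generalize from individual failures to usage guidance, which is where the cost and reliability concerns raised in the abstract would come into play.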
