Trace-Level Analysis of Information Contamination in Multi-Agent Systems

Anna Mazhar (Cornell University), Huzaifa Suri (University of Illinois Urbana-Champaign), Sainyam Galhotra (Cornell University)

Evaluation & Benchmarking

A study showing how uncertainty in heterogeneous input artifacts—PDFs, spreadsheets, slide decks—propagates and amplifies through multi-agent workflows, producing qualitatively different execution trajectories under controlled perturbations. The results show that outcome-only evaluation of agentic systems systematically misses contamination-induced failures that are only visible at the trace level.

Presentation

Talk

Paper Session 2: Agent Evaluation

Wednesday, May 27 · 1:30 PM – 1:40 PM

Bayshore Ballroom

Poster

Wednesday, May 27 · 5:15 PM – 6:45 PM

Carmel / Monterey

Abstract

Reasoning over heterogeneous artifacts (PDFs, spreadsheets, slide decks, etc.) increasingly occurs within structured agent workflows that iteratively extract, transform, and reference external information. In these workflows, uncertainty is not merely an input-quality issue: it can redirect decomposition and routing decisions, reshape intermediate state, and produce qualitatively different execution trajectories. We study this phenomenon by treating uncertainty as a controlled variable: we inject structured perturbations into artifact-derived representations, execute fixed workflows under comprehensive logging, and quantify contamination via trace divergence in plans, tool invocations, and intermediate state. Across 614 paired runs on 32 GAIA tasks with three different language models, we find that trace divergence and answer correctness are decoupled: workflows may diverge substantially yet recover correct answers, or remain structurally similar while producing incorrect outputs. We characterize three manifestation types: silent semantic corruption, behavioral detours with recovery, and combined structural disruption. We also characterize their control-flow signatures: rerouting, extended execution, and early termination. We measure operational costs and characterize why commonly used verification guardrails fail to intercept contamination. We contribute (i) a formal taxonomy of contamination manifestations in structured workflows, (ii) a trace-based measurement framework for detecting and localizing contamination across agent interactions, and (iii) empirical evidence with implications for targeted verification, defensive design, and cost control.
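
The paired-run comparison described in the abstract can be illustrated with a small sketch. The abstract does not specify how trace divergence is computed, so everything here is an assumption made for illustration: the divergence score (one minus the `difflib.SequenceMatcher` similarity ratio over tool-invocation names), the function names `trace_divergence` and `classify_pair`, the 0.25 threshold, and the outcome labels are all hypothetical and are not the authors' method.

```python
from difflib import SequenceMatcher
from typing import Sequence


def trace_divergence(baseline: Sequence[str], perturbed: Sequence[str]) -> float:
    """Hypothetical divergence score between two tool-invocation traces.

    Returns 0.0 for identical traces and 1.0 for traces with nothing in
    common. This is an illustrative stand-in, not the paper's metric.
    """
    if not baseline and not perturbed:
        return 0.0  # two empty traces are trivially identical
    matcher = SequenceMatcher(None, list(baseline), list(perturbed), autojunk=False)
    return 1.0 - matcher.ratio()


def classify_pair(divergence: float, answer_correct: bool,
                  threshold: float = 0.25) -> str:
    """Label a paired run by trace divergence and answer correctness.

    The threshold and labels are assumptions; they only illustrate that
    divergence and correctness can vary independently.
    """
    diverged = divergence > threshold
    if diverged and answer_correct:
        return "diverged_but_recovered"
    if not diverged and not answer_correct:
        return "similar_but_incorrect"
    if diverged:
        return "diverged_and_incorrect"
    return "similar_and_correct"


if __name__ == "__main__":
    # Hypothetical traces: the perturbed run takes one extra detour step.
    clean = ["read_pdf", "extract_table", "answer"]
    noisy = ["read_pdf", "web_search", "extract_table", "answer"]
    score = trace_divergence(clean, noisy)
    print(round(score, 3), classify_pair(score, answer_correct=False))
```

In this toy case the perturbed trace stays close to the baseline, so an incorrect final answer would be labeled `similar_but_incorrect`: the kind of failure that a structural comparison alone would not flag, and that an outcome-only check would report without locating its cause.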

Authors

Anna Mazhar, Cornell University
Huzaifa Suri, University of Illinois Urbana-Champaign
Sainyam Galhotra, Cornell University