Trace-Level Analysis of Information Contamination in Multi-Agent Systems
Anna Mazhar (Cornell University), Huzaifa Suri (University of Illinois Urbana-Champaign), Sainyam Galhotra (Cornell University)
Evaluation & Benchmarking
A study showing how uncertainty in heterogeneous input artifacts (PDFs, spreadsheets, slide decks) propagates and amplifies through multi-agent workflows, producing qualitatively different execution trajectories under controlled perturbations. The results show that outcome-only evaluation of agentic systems systematically misses contamination-induced failures that are visible only at the trace level.
Presentation
Talk
Paper Session 2: Agent Evaluation
Wednesday, May 27 · 1:30 PM – 1:40 PM
Bayshore Ballroom
Poster
Wednesday, May 27 · 5:15 PM – 6:45 PM
Carmel / Monterey
Abstract
Reasoning over heterogeneous artifacts (PDFs, spreadsheets, slide decks, etc.) increasingly occurs within structured agent workflows that iteratively extract, transform, and reference external information. In these workflows, uncertainty is not merely an input-quality issue: it can redirect decomposition and routing decisions, reshape intermediate state, and produce qualitatively different execution trajectories. We study this phenomenon by treating uncertainty as a controlled variable: we inject structured perturbations into artifact-derived representations, execute fixed workflows under comprehensive logging, and quantify contamination via trace divergence in plans, tool invocations, and intermediate state. Across 614 paired runs on 32 GAIA tasks with three language models, we find a decoupling between trace divergence and answer correctness: workflows may diverge substantially yet recover correct answers, or remain structurally similar while producing incorrect outputs. We characterize three manifestation types (silent semantic corruption, behavioral detours with recovery, and combined structural disruption) and their control-flow signatures (rerouting, extended execution, early termination). We measure operational costs and characterize why commonly used verification guardrails fail to intercept contamination. We contribute (i) a formal taxonomy of contamination manifestations in structured workflows, (ii) a trace-based measurement framework for detecting and localizing contamination across agent interactions, and (iii) empirical evidence with implications for targeted verification, defensive design, and cost control.
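The paired-run comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' framework: the `Run` schema, the use of `difflib.SequenceMatcher` over tool-call sequences as the divergence measure, the 0.3 threshold, and the mapping of the four outcomes onto the named manifestation types are all assumptions made here for illustration, based only on the abstract's wording.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Run:
    """One logged workflow execution (hypothetical schema, not from the paper)."""
    tool_calls: list[str]  # ordered tool names invoked during the run
    answer: str            # final answer produced by the workflow


def trace_divergence(baseline: Run, perturbed: Run) -> float:
    """Return 1 minus the sequence similarity of the tool-call traces, in [0, 1].

    Assumption: a simple edit-style similarity stands in for whatever
    divergence measure the paper actually uses.
    """
    matcher = SequenceMatcher(
        None, baseline.tool_calls, perturbed.tool_calls, autojunk=False
    )
    return 1.0 - matcher.ratio()


def classify(baseline: Run, perturbed: Run, gold: str, threshold: float = 0.3) -> str:
    """Label a paired run by trace divergence and answer correctness.

    The threshold and the label mapping are illustrative assumptions.
    """
    diverged = trace_divergence(baseline, perturbed) > threshold
    correct = perturbed.answer.strip().lower() == gold.strip().lower()
    if diverged and correct:
        return "behavioral detour with recovery"
    if not diverged and not correct:
        return "silent semantic corruption"
    if diverged and not correct:
        return "combined structural disruption"
    return "unaffected"


# Example paired runs (made-up traces, for illustration only).
clean = Run(["parse_pdf", "search", "calc", "answer"], "42")
silent = Run(["parse_pdf", "search", "calc", "answer"], "41")
detour = Run(["ocr", "search", "browse", "search", "answer"], "42")

print(classify(clean, silent, gold="42"))
print(classify(clean, detour, gold="42"))
```

The `silent` run is the case an outcome-plus-structure check would miss in one direction (identical trace, wrong answer), while the `detour` run is the case outcome-only evaluation would miss in the other (correct answer, substantially different trace). A fuller version would also compare plans and intermediate state, which the abstract lists alongside tool invocations.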