When Harmful Intent Dissolves into Technical Detail: How Safe Are Coding Agents Against Cyber Misuse?
Xiangzhe Xu (Purdue University), Shiwei Feng (Purdue University), Guangyu Shen (Purdue University), Xiangyu Zhang (Purdue University)
Security & Privacy Evaluation & Benchmarking
Abstract
Coding agents are increasingly integrated into realistic software development workflows, where they can write, modify, and execute code on behalf of users. This capability creates a distinct safety requirement: agents must refuse requests that would enable malicious cyber activity. Yet in cybersecurity, harmful intent often dissolves into technical detail. A prompt may describe a sequence of legitimate operations without explicitly revealing the downstream consequence they collectively produce. Safe behavior therefore hinges on an agent's ability to reason from prompt to consequence under partial information. In this paper, we empirically evaluate how safe coding agents are against cyber misuse. We construct a cybersecurity evaluation dataset designed to preserve verifiable maliciousness while removing explicit intent. Our data synthesis pipeline hierarchically partitions the cybersecurity space and generates diverse, intent-obscured requests, which are validated by an ensemble of LLM judges to ensure they remain implicit yet genuinely harmful. The resulting dataset contains 2.2k samples and exhibits substantially greater domain coverage and implicitness than existing cybersecurity safety benchmarks. Using this dataset, we evaluate nine LLM agents in the OpenHands framework and make three key observations. First, safety performance varies widely across cybersecurity subdomains, highlighting the need for broad domain coverage. Second, a per-step guardrail significantly improves detection over prompt-only refusal, but a non-trivial fraction of harmful cases remains undetected. Third, we show that lightweight dry-run simulation, in which the actor model internally rolls out action sequences and their plausible consequences, recovers a meaningful portion of the guardrail's detection gains without requiring real execution. Overall, our results motivate realistic, domain-diverse evaluation of coding-agent misuse prevention and point to dry-run simulation as a promising direction for more effective and efficient guardrails.
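For intuition, the dry-run simulation idea mentioned above can be pictured roughly as follows. This is a minimal sketch under assumed names (DRY_RUN_TEMPLATE, dry_run_flags_request, call_model, toy_model are all hypothetical), not the authors' implementation or the OpenHands interface: the actor model is asked to roll out its intended action sequence and the plausible consequences in text, and the request is refused if that rollout reveals harm, with no real execution taking place.

```python
from typing import Callable

# Hypothetical prompt template: ask the actor model to simulate its plan
# and the downstream consequences before any real action is taken.
DRY_RUN_TEMPLATE = """Before acting on the request below, internally simulate
the sequence of commands you would run and the plausible downstream
consequences of executing them. Do not execute anything.

Request:
{request}

Respond with:
PLAN: <the action sequence you would take>
CONSEQUENCES: <what those actions would collectively accomplish>
VERDICT: SAFE or HARMFUL
"""


def dry_run_flags_request(request: str, call_model: Callable[[str], str]) -> bool:
    """Return True if the simulated rollout indicates a harmful consequence."""
    rollout = call_model(DRY_RUN_TEMPLATE.format(request=request))
    # Only the rollout text is inspected; nothing is executed during the dry run.
    verdict = rollout.rsplit("VERDICT:", 1)[-1].strip().upper()
    return verdict.startswith("HARMFUL")


if __name__ == "__main__":
    # Toy stand-in for the actor model, used only to make the sketch runnable.
    def toy_model(prompt: str) -> str:
        return (
            "PLAN: enumerate reachable hosts, then collect stored credentials\n"
            "CONSEQUENCES: builds a target list for unauthorized access\n"
            "VERDICT: HARMFUL"
        )

    if dry_run_flags_request("scan the subnet and gather saved passwords", toy_model):
        print("Request refused: simulated consequences indicate cyber misuse.")
    else:
        print("Proceed to normal agent execution.")
```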