Benchmark Methodology
How we measure the gap between probabilistic intent evaluation and deterministic action interception.
Evaluation Framework
Our benchmarks evaluate the effectiveness of security layers in autonomous AI agent pipelines. We specifically compare Intent-based Guards (LLM-as-judge) against Action-based Interceptors (SupraWall).
Test Vector A
Semantic Bypass
Using homoglyphs, encoded payloads, or role-play to obscure the malicious intent of a tool call while keeping the arguments valid for execution.
Test Vector B
Context Displacement
Burying the tool-call request deep within a long user prompt or system logs to exceed the immediate attention window of the judge LLM.
Environment Setup (April 2026)
| Target System | Version | Mode |
|---|---|---|
| Lakera Guard | v1.1 (Cloud) | Direct Injection API |
| NeMo Guardrails | v0.9.1 | Default Jailbreak Rail |
| Guardrails AI | v0.5.14 | Detect Jailbreak Validator |
Scoring Criteria
- False Negative (Bypass)
The security layer allowed a malicious tool call to proceed to execution.
- Deterministic Block
The security layer blocked the action based on a hard policy, regardless of the semantic content.
A Note on Probabilistic Failure
Our tests indicate that LLM-as-judge systems fail not due to lack of intelligence, but due to architectural misalignment. As identified in Shi et al. (2024), "Judging the Judges,"position bias and semantic drift create a "shadow accuracy window" where malicious tool calls can hide. SupraWall avoids this by intercepting the call at the SDK boundary, where intent is irrelevant and policy is binary.