Benchmark Methodology | SupraWall Documentation

Evaluation Framework

Our benchmarks evaluate the effectiveness of security layers in autonomous AI agent pipelines. We specifically compare Intent-based Guards (LLM-as-judge) against Action-based Interceptors (SupraWall).

Test Vector A

Semantic Bypass

Using homoglyphs, encoded payloads, or role-play to obscure the malicious intent of a tool call while keeping the arguments valid for execution.

Test Vector B

Context Displacement

Burying the tool-call request deep within a long user prompt or system logs to exceed the immediate attention window of the judge LLM.

Environment Setup (April 2026)

Target System	Version	Mode
Lakera Guard	v1.1 (Cloud)	Direct Injection API
NeMo Guardrails	v0.9.1	Default Jailbreak Rail
Guardrails AI	v0.5.14	Detect Jailbreak Validator

Scoring Criteria

False Negative (Bypass)
The security layer allowed a malicious tool call to proceed to execution.
Deterministic Block
The security layer blocked the action based on a hard policy, regardless of the semantic content.

A Note on Probabilistic Failure

Our tests indicate that LLM-as-judge systems fail not due to lack of intelligence, but due to architectural misalignment. As identified in Shi et al. (2024), "Judging the Judges,"position bias and semantic drift create a "shadow accuracy window" where malicious tool calls can hide. SupraWall avoids this by intercepting the call at the SDK boundary, where intent is irrelevant and policy is binary.