Documentation / benchmarks

Benchmark Methodology

How we measure the gap between probabilistic intent evaluation and deterministic action interception.

Evaluation Framework

Our benchmarks evaluate the effectiveness of security layers in autonomous AI agent pipelines. We specifically compare Intent-based Guards (LLM-as-judge) against Action-based Interceptors (SupraWall).

Test Vector A

Semantic Bypass

Using homoglyphs, encoded payloads, or role-play to obscure the malicious intent of a tool call while keeping the arguments valid for execution.

Test Vector B

Context Displacement

Burying the tool-call request deep within a long user prompt or system logs to exceed the immediate attention window of the judge LLM.

Environment Setup (April 2026)

Target SystemVersionMode
Lakera Guardv1.1 (Cloud)Direct Injection API
NeMo Guardrailsv0.9.1Default Jailbreak Rail
Guardrails AIv0.5.14Detect Jailbreak Validator

Scoring Criteria

  • False Negative (Bypass)

    The security layer allowed a malicious tool call to proceed to execution.

  • Deterministic Block

    The security layer blocked the action based on a hard policy, regardless of the semantic content.

A Note on Probabilistic Failure

Our tests indicate that LLM-as-judge systems fail not due to lack of intelligence, but due to architectural misalignment. As identified in Shi et al. (2024), "Judging the Judges,"position bias and semantic drift create a "shadow accuracy window" where malicious tool calls can hide. SupraWall avoids this by intercepting the call at the SDK boundary, where intent is irrelevant and policy is binary.