
How to Set Token Limits on AI Agents

Setting token limits on AI agents requires distinguishing between per-call limits (max tokens for a single LLM invocation), per-session limits (total tokens across an entire agent run), and per-day budget caps (dollar spend converted from token consumption). This guide covers all three layers and their implementation across LangChain, CrewAI, AutoGen, and the OpenAI Assistants API.

TL;DR

  • Soft limits (warnings) fail because the expensive API call has already been made by the time the warning fires.
  • Hard caps terminate execution deterministically before the call is made — no LLM reasoning can override them.
  • Three distinct layers: per-call max_tokens, per-session token tracking, and per-day dollar budget caps.
  • Each framework (LangChain, CrewAI, AutoGen, OpenAI Assistants) requires different configuration — this guide covers all four.

Hard Cap vs. Soft Limit: The Critical Distinction

Most developers, when they first implement token controls, reach for the same pattern: a conditional check that logs a warning when a counter exceeds a threshold. This is a soft limit, and it is nearly useless for preventing runaway costs.

The fundamental problem is sequencing. When your code logs a warning — or even raises a Python exception after the fact — the LLM API call has already completed. The tokens have been consumed. The money has been spent. The warning is a notification that something expensive just happened, not a prevention mechanism.

In a looping agent context, this distinction is catastrophic. If your soft limit fires on call 101, you've already paid for all 100 calls before it. And if your exception handling is imperfect — if the loop catches and swallows the exception — your soft limit may never actually stop anything.

"A soft limit is a speed bump. A hard cap is a wall."

Hard caps terminate execution deterministically. The enforcement happens at the interception layer, before the API call is dispatched. No amount of LLM reasoning, exception handling variation, or framework quirk can override a hard cap — because the call never reaches the API in the first place.
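The sequencing difference can be sketched in a few lines of plain Python. This is an illustrative stand-in, not SupraWall's actual implementation; the class and method names are invented for the example.

```python
class BudgetExceeded(Exception):
    """Raised when a call would push spend past the hard cap."""

class HardCap:
    """Illustrative hard cap: the check runs BEFORE the API call is dispatched."""

    def __init__(self, max_session_tokens: int):
        self.max_session_tokens = max_session_tokens
        self.used = 0

    def check(self, estimated_tokens: int) -> None:
        # Pre-call enforcement: if the next call would exceed the budget,
        # raise now. The request never reaches the API, so nothing is billed.
        if self.used + estimated_tokens > self.max_session_tokens:
            raise BudgetExceeded(
                f"would use {self.used + estimated_tokens} of "
                f"{self.max_session_tokens} session tokens"
            )

    def record(self, actual_tokens: int) -> None:
        # Called after a successful response to update the running total.
        self.used += actual_tokens
```

Contrast this with a soft limit, which compares `self.used` to the threshold only after `record()` runs, when the tokens are already paid for.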

The Three Types of Token Limits

Effective token control requires all three layers working together. Each addresses a distinct failure mode. Implementing only one or two leaves meaningful gaps.

Layer 01

Per-Call Limits

Max tokens for a single LLM invocation. Prevents any single call from being catastrophically expensive. Easiest to implement, weakest protection — doesn't catch accumulation across multiple calls.

from langchain_openai import ChatOpenAI

# Caps the completion length of a single call
llm = ChatOpenAI(
    model="gpt-4o",
    max_tokens=4096,
)
Implementation: Easy · Protection: Low

Layer 02

Per-Session Limits

Total tokens for an entire agent session. Requires tracking across all calls in a session. Catches gradual context accumulation that per-call limits miss entirely.

# With SupraWall:
protect(agent, budget={
  "session_tokens": 100_000
})
Implementation: Medium · Protection: Medium

Layer 03

Per-Day Budget Caps

Converts token count to dollar spend. The most powerful approach because it maps directly to actual cost regardless of model, call pattern, or context size variation.

protect(agent, budget={
  "daily_limit_usd": 10
})
Implementation: Hard · Protection: High
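Converting token counts to dollar spend is simple arithmetic over the provider's price sheet. The prices below are placeholders for illustration; substitute your provider's current per-million-token rates.

```python
# Placeholder prices in USD per 1M tokens -- check your provider's pricing page.
PRICING_USD_PER_1M = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def call_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call, computed from its token counts."""
    price = PRICING_USD_PER_1M[model]
    return (input_tokens * price["input"]
            + output_tokens * price["output"]) / 1_000_000
```

Summing this per-call cost over a day is what a daily budget cap compares against its limit.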

Implementation Guide by Framework

Every major agent framework handles token limits differently. Native controls are inconsistent — some count iterations, some count wall-clock time, and most don't track cumulative spend at all. The following configurations show the correct approach for each framework, combining native controls with SupraWall's budget layer.

01

LangChain

LangChain's AgentExecutor offers max_iterations and max_execution_time as native guardrails. These stop runaway loops at the step level, but they don't track token spend, don't distinguish between a cheap and an expensive iteration, and can't enforce a daily dollar budget. SupraWall adds the budget and circuit breaker layer on top of the native controls.

from langchain.agents import AgentExecutor
from suprawall.langchain import protect

# Native: max_iterations stops loops but doesn't track spend
agent = AgentExecutor(
    agent=llm_agent,
    tools=tools,
    max_iterations=25,        # stops after 25 steps
    max_execution_time=120,   # stops after 120 seconds
)

# SupraWall: adds dollar budget + token tracking on top
secured = protect(agent, budget={
    "daily_limit_usd": 10,
    "session_tokens": 200_000,    # 200K tokens per session
    "circuit_breaker": {
        "max_identical_calls": 5,
        "window_seconds": 30,
    }
})

Note: the max_iterations native control and the SupraWall budget are complementary, not redundant. Native controls catch step-count loops; SupraWall catches expensive-but-short sessions and context inflation.

02

CrewAI

CrewAI has no native budget enforcement mechanism. Multi-agent crews can spawn multiple agents working in parallel, each billing independently, with no central cost tracking. The correct approach is to wrap the entire crew and define per-agent and crew-level budgets simultaneously.

from suprawall.crewai import protect_crew

secured_crew = protect_crew(
    crew,
    budget={
        "per_agent_daily_usd": 5.00,    # $5/day per agent in the crew
        "crew_daily_usd": 20.00,        # $20/day for the entire crew
        "session_tokens_per_agent": 50_000,
    },
    on_budget_exceeded="notify_and_pause"  # sends webhook before halting
)

The crew_daily_usd cap acts as a ceiling across all agents combined, while the per_agent_daily_usd cap prevents any single agent from consuming the entire crew budget. Either cap alone triggers the hard stop; whichever is reached first takes precedence.
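The two-level check reduces to a pair of comparisons. A minimal sketch (hypothetical function, not the protect_crew internals):

```python
def crew_call_allowed(agent_spend: float, crew_spend: float,
                      per_agent_cap: float, crew_cap: float) -> bool:
    """A call is allowed only while BOTH caps still have headroom.
    Whichever cap is reached first blocks further calls."""
    return agent_spend < per_agent_cap and crew_spend < crew_cap
```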

03

AutoGen

AutoGen's native max_turns parameter counts conversation turns, not tokens. This is particularly deceptive: a single turn can consume 100K tokens if the context is large, making turn-based limits an unreliable proxy for cost control. SupraWall replaces the default GroupChatManager with a token-aware variant.

import autogen
from suprawall.autogen import SupraWallGroupChatManager

# Native: max_turns counts conversation turns, not tokens
# This is insufficient — a single turn can consume 100K tokens

# SupraWall: token-aware enforcement
manager = SupraWallGroupChatManager(
    groupchat=group_chat,
    budget={
        "session_tokens": 500_000,    # total tokens across all agents in session
        "daily_limit_usd": 25.00,
    }
)

The session_tokens limit here applies to the entire group chat session — all agents combined. This correctly models AutoGen's multi-agent cost structure where a single conversation involves multiple participants each consuming tokens simultaneously.

04

OpenAI Assistants API

The OpenAI Assistants API exposes max_prompt_tokens and max_completion_tokens at the run level. These correctly limit individual runs, but they don't aggregate across multiple runs within a single user session. A session that makes 20 runs each consuming 10K tokens has consumed 200K tokens total — with no native mechanism to detect or halt that.

# Native: max_prompt_tokens and max_completion_tokens at run level
# These don't aggregate across multiple runs in a session

# Correct approach with SupraWall session tracking:
from suprawall.openai import SecureAssistantSession

session = SecureAssistantSession(
    assistant_id="asst_...",
    budget={
        "session_tokens": 200_000,  # across all runs in this session
        "daily_limit_usd": 15.00,
    }
)
response = await session.run("Analyze Q4 results")
# Session tracks tokens cumulatively across all runs

SecureAssistantSession maintains a cumulative token counter that persists across all runs within the session object's lifecycle. This correctly models how Assistants API costs actually accrue in production.

Monitoring and Alerting

Token limits without observability are incomplete. A hard cap that silently kills an agent in production — without notifying your on-call team — is nearly as bad as no cap at all. The correct configuration layers alerts at 50% and 80% of the budget before the hard 100% cutoff.

This gives your team two intervention windows: one to investigate whether the spend is expected, and one final warning to take action before the agent halts. SupraWall supports webhook delivery to any target — Slack, PagerDuty, custom endpoints.

secured = protect(
    agent,
    budget={
        "daily_limit_usd": 100,
        "alerts": [
            {"threshold": 0.5, "channel": "slack",     "webhook": "https://hooks.slack.com/..."},
            {"threshold": 0.8, "channel": "pagerduty", "webhook": "https://events.pagerduty.com/..."},
            {"threshold": 1.0, "action": "halt"},   # hard cap at 100%
        ]
    }
)
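Under the hood, threshold alerting reduces to a crossing check: an alert fires exactly when cumulative spend moves from below a threshold to at or above it. A sketch of that logic (illustrative, not SupraWall's internals):

```python
def crossed_thresholds(prev_spend: float, new_spend: float,
                       daily_limit: float,
                       thresholds=(0.5, 0.8, 1.0)) -> list[float]:
    """Return the thresholds crossed by moving from prev_spend to new_spend."""
    return [t for t in thresholds
            if prev_spend < t * daily_limit <= new_spend]
```

Tracking the previous spend ensures each alert fires once per day, rather than on every call after the threshold is passed.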

Each alert fires with a structured payload that includes the agent ID, current spend, daily limit, and session ID. This gives your team everything needed to locate and investigate the agent without manually querying API logs.

{
  "event": "budget_threshold_reached",
  "agent_id": "research-agent-v2",
  "threshold_pct": 80,
  "current_spend_usd": 80.00,
  "daily_limit_usd": 100.00,
  "session_id": "sess_xK7m9...",
  "timestamp": "2026-03-15T14:32:17Z"
}

50% Alert — Investigate

Half the daily budget consumed. Check whether this is expected traffic volume or an early sign of a loop. No action required unless patterns look abnormal.

80% Alert — Prepare to Intervene

Budget is nearly exhausted. If this is unexpected, halt the agent manually before the hard cap fires. The 80% alert is your last human decision point.

100% Hard Cap — Auto-Halt

SupraWall raises BudgetExceeded. Agent terminates gracefully. All subsequent tool calls are blocked until the budget resets at midnight UTC or is manually extended.

Incident Log

Every budget event is logged with full context in the SupraWall audit trail: agent ID, session ID, total tokens consumed, total spend, and halt reason.

Quick Reference

Native vs. SupraWall Controls by Framework

Framework         | Native Token Control                   | Tracks Cumulative Spend? | SupraWall Integration
LangChain         | max_iterations, max_execution_time     | No                       | protect() wrapper
CrewAI            | None                                   | No                       | protect_crew() wrapper
AutoGen           | max_turns (turns only, not tokens)     | No                       | SupraWallGroupChatManager
OpenAI Assistants | max_prompt/completion_tokens (per run) | No                       | SecureAssistantSession

Frequently Asked Questions

What's the difference between max_tokens and a budget limit?

max_tokens limits a single LLM call's response length. A budget limit (daily_limit_usd) tracks cumulative spend across all calls in a session or day and halts when the threshold is reached. They operate at entirely different granularities and you typically need both.

Does SupraWall work with all LLM providers?

Yes. SupraWall's budget enforcement is provider-agnostic. It intercepts at the agent framework level (LangChain, CrewAI, AutoGen), not at the LLM API level. This means it works with OpenAI, Anthropic, Google, Mistral, or any model your framework supports.

What happens to the agent when it hits the budget?

By default, SupraWall raises a BudgetExceeded exception, which terminates the agent gracefully. You can configure on_budget_exceeded to 'notify' (continue with warning), 'halt' (terminate), or 'require_approval' (pause pending human review). The agent state is preserved so you can resume after approval.

Can I set different limits for different agent roles?

Yes. Define per-agent scopes: research agents get $5/day, billing agents get $20/day, orchestrators get $50/day. Team-level and organization caps are also supported, allowing nested budget hierarchies where the most restrictive applicable limit always wins.

Do token limits affect agent performance?

Per-call max_tokens can affect response quality if set too low — the model may truncate reasoning or output. Budget caps and circuit breakers only activate when their threshold is reached, adding zero latency to normal operation. We recommend setting per-call max_tokens at 2× your expected response length to leave headroom without enabling unlimited context.

Set Hard Budget Caps on Your Agents.