
The Silent Killer of Your AI Program: The Runaway Agent You Never See

Tokenmaxxing didn’t kill enterprise AI ambitions. The runaway agent you couldn’t see did.


The Budget Was Gone Before Anyone Read the Alert

In April 2026, Uber’s CTO said something every enterprise AI leader should read twice: “I’m back to the drawing board, because the budget I thought I would need is blown away already.”

Uber had given 5,000 engineers access to Claude Code in December 2025. By March, 84% of them were using it daily. By April — four months in, eight months before year-end — the entire 2026 AI budget was gone. Microsoft quietly cancelled internal Claude Code licenses across its Experiences & Devices division. A separate enterprise reportedly accumulated a $500 million AI bill in a single month after deploying access without usage caps. One healthcare company consumed a trillion tokens over six months — $6 million in unplanned costs — before the finance team even understood what was driving it.

J.R. Storment, executive director of the FinOps Foundation, put it plainly: “In April and May, I started hearing from companies: ‘Oh my god, we are 3x over our entire 2026 token budget and it’s only April.’”

This isn’t a spending story. It’s a visibility and control story. And the thing nobody is saying clearly enough is this:

The root cause is almost always a single runaway agent that nobody could see until it was too late.


What “Tokenmaxxing” Actually Means, and Why It’s the Wrong Frame

Silicon Valley coined “tokenmaxxing” to describe the behavior of burning tokens to hit an internal metric — leaderboards, adoption scores, innovation theater. Amazon ran an internal ranking called KiroRank; engineers started assigning agents pointless busywork to climb it.

But that framing is a distraction. Most enterprises blowing their token budgets aren’t doing it deliberately. They’re doing it because one workflow went off the rails and nobody had the controls to stop it.

Here’s the mechanism:

Every LLM API call is stateless. Agents resend the full conversation history on every turn to maintain context, so the input billed on each call grows with the length of the run, and the cumulative total grows roughly with the square of the number of turns. What starts as a simple task (classify this ticket, draft this response) spirals into millions of tokens as the loop grows. A single sub-workflow stuck in a retry storm can double your bill in minutes. A single agent looping on a failed tool call can consume the equivalent of 10,000 normal tasks before any alert fires.
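To put rough numbers on that mechanism, here is a minimal sketch. It assumes every turn adds a fixed number of tokens to the history and that the whole history is billed as input on each call. It ignores output-token billing, prompt caching, and context-window truncation, all of which change the real figure. The function name and parameters are illustrative, not taken from any vendor SDK.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int, system_tokens: int = 0) -> int:
    """Total input tokens sent across `turns` calls when each call resends all prior history.

    Assumption: each turn appends `tokens_per_turn` tokens (new message plus the
    previous reply) to the history, and nothing is cached or truncated.
    """
    total = 0
    history = system_tokens
    for _ in range(turns):
        history += tokens_per_turn  # this turn's new content joins the history
        total += history            # the entire history is sent as input again
    return total


print(cumulative_input_tokens(10, 2_000))   # 110000
print(cumulative_input_tokens(100, 2_000))  # 10100000
```

Under these assumptions, ten times as many turns costs roughly ninety times as many input tokens. That is why a loop that nobody interrupts is so much more expensive than the same number of independent requests.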

A single chatbot query uses 2,000–4,000 tokens. A single agentic workflow with tool calls, planning steps, and verification loops? 50,000 to 500,000 tokens. Multiply that by hundreds of agent runs per day, across dozens of workflows, across multiple teams and harnesses — and the math turns catastrophic without warning.

Token consumption in the enterprise has grown 13x since January 2025. Goldman Sachs projects a 24-fold increase by 2030. Gartner’s March 2026 analysis found that agentic models require between 5 and 30 times more tokens per task than a standard chatbot, and it forecasts that 40% of AI agent projects will be cancelled by 2027 due to cost overruns alone. Not technical failure. Not market fit. Just economics.

The per-token price dropped 67% year-over-year between Q1 2025 and Q1 2026. Bills are still going up. Volume is outrunning price compression, and it’s accelerating. This is not a problem that solves itself.
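A back-of-the-envelope check makes the point concrete. The sketch below treats the 13x volume growth and the 67% price drop as if they covered the same period, which the figures above do not strictly claim, so read the result as illustrative rather than as a measured number. The function names are hypothetical.

```python
def bill_multiplier(volume_multiplier: float, price_drop: float) -> float:
    """Change in total spend when volume scales by `volume_multiplier`
    and the per-token price falls by the fraction `price_drop`."""
    return volume_multiplier * (1.0 - price_drop)


def break_even_volume(price_drop: float) -> float:
    """Volume growth at which the bill stays flat despite the price drop."""
    return 1.0 / (1.0 - price_drop)


print(round(bill_multiplier(13, 0.67), 2))  # 4.29
print(round(break_even_volume(0.67), 2))    # 3.03
```

On those assumptions, a 67% price cut is fully absorbed once volume triples, and 13x volume still leaves the bill more than four times larger.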


My Hypothesis: The Runaway Agent Is the Root Cause You Never See

Here is what I believe is happening inside most enterprises right now, and I’ve heard versions of this story from hundreds of customers at DAIS and RSA this year:

The observability dashboard shows spend climbing. Somebody sets a monthly alert at $50,000. An agent starts misbehaving — a retry loop, a stuck workflow, a subagent hitting an external tool that keeps returning errors. The agent runs. And runs. And runs. Nobody sees it because it’s not throwing exceptions. It’s just consuming tokens. Doing exactly what it was told to do. Repeatedly.

By the time the alert email lands, or worse, by the time the monthly bill arrives, the damage is done. That one workflow has consumed three months of planned budget. Every other agent in the environment is now at risk of being throttled or killed because one agent that nobody noticed kept quietly burning tokens.

The runaway agent is not a hypothetical. In the stories I hear, it is the budget blowout, almost every time.

And the dirty secret of observability tools is exactly this: observability tells you what happened. It does not stop what is happening. You cannot react to an alert email fast enough. By the time you read it, the agent has moved on.


What Enterprises Actually Need: Granular, Deterministic Controls

Enterprises don’t need better dashboards. They need policies. Real-time, enforced, hard-blocking policies at the level of granularity that actually matches how agents operate.

Here’s what that looks like in practice. You need controls at every meaningful unit of execution:

Per workflow — not per application, not per account. One agent may have twenty sub-workflows. The one that breaks is never the one the budget was assigned to. You need a ceiling on each workflow independently, enforced at the interceptor, in real time, producing a hard block — not an email.

Per agent — each autonomous agent should carry its own budget envelope. If Agent A starts burning through its allocation, Agent B should keep running. Budget problems should not cascade.

Per subagent — in multi-agent architectures, orchestrators spawn subagents that spawn subagents. Each hop can multiply cost. Each hop needs its own governor.

Per model family — Claude, GPT, Gemini. When an enterprise runs multi-model workflows, it needs to know exactly what each model family is consuming and to enforce a limit on each. A policy that says “this workflow may not spend more than $X on Opus-class models per day” is a real control. A monthly aggregate is not.

Per harness — Bedrock, LangChain, Claude Code, Databricks Agent Bricks, LlamaIndex. Agents run across many harnesses simultaneously. Cost controls need to be harness-aware, not just model-aware.

Per agent family (tags) — group agents by team, product, business unit, or function, and apply cost policies to the family. Marketing agents get one envelope. Finance agents get another. Developer tools get a hard daily ceiling.

Explicit model whitelists — a policy that says “this workflow may only call Claude Sonnet 4.6 or Haiku 4.5” is a cost control and a governance control simultaneously. Unauthorized model usage — an agent quietly upgrading itself to a frontier model for a task that doesn’t need it — is a cost leak and a compliance risk at the same time.

Per stage: development vs. production — the same policy should behave differently by environment. In development, alert and let the agent continue. In production, hard-block and stop. Stage-aware enforcement is the difference between a guardrail that helps developers learn and a guardrail that actually protects the production budget.

Per workspace — in Databricks and similar environments, workspaces map to teams, projects, and business units. Cost controls must respect workspace boundaries and be enforceable at that granularity.

This is not theoretical. These are the controls enterprises are asking for. Not once in passing. Consistently. Urgently. Because the alternative — monthly budget alerts and dashboards — has already failed them.
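As a rough illustration of how those scopes might compose, here is a minimal sketch of a scoped budget check. It is not LangGuard's policy format or any vendor's API. Every class, field, and name in it (`BudgetPolicy`, `CallContext`, the scope keys, the model and workflow names) is hypothetical, and the cost of each call is assumed to be estimated before the call is made.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    DEVELOPMENT = "development"
    PRODUCTION = "production"


@dataclass(frozen=True)
class CallContext:
    """Where a model call comes from. All fields are illustrative."""
    workflow: str
    agent: str
    model: str
    model_family: str
    harness: str
    stage: Stage
    tags: tuple[str, ...] = ()


@dataclass
class BudgetPolicy:
    # USD ceilings keyed by (scope, name), e.g. ("workflow", "ticket-triage").
    ceilings: dict[tuple[str, str], float]
    # Optional model whitelist per workflow; workflows not listed are unrestricted.
    allowed_models: dict[str, set[str]] = field(default_factory=dict)
    spent: dict[tuple[str, str], float] = field(default_factory=dict)

    def _scopes(self, ctx: CallContext) -> list[tuple[str, str]]:
        keys = [("workflow", ctx.workflow), ("agent", ctx.agent),
                ("model_family", ctx.model_family), ("harness", ctx.harness)]
        return keys + [("tag", tag) for tag in ctx.tags]

    def check(self, ctx: CallContext, estimated_cost: float) -> tuple[bool, list[str]]:
        """Return (permitted, violations) for one call and record spend if permitted."""
        violations = []
        allowed = self.allowed_models.get(ctx.workflow)
        if allowed is not None and ctx.model not in allowed:
            violations.append(f"model '{ctx.model}' not allowed for workflow '{ctx.workflow}'")
        for key in self._scopes(ctx):
            ceiling = self.ceilings.get(key)
            if ceiling is not None and self.spent.get(key, 0.0) + estimated_cost > ceiling:
                violations.append(f"{key[0]} '{key[1]}' would exceed ${ceiling:.2f}")
        # Stage-aware: development only warns, production hard-blocks.
        permitted = not violations or ctx.stage is Stage.DEVELOPMENT
        if permitted:
            for key in self._scopes(ctx):
                self.spent[key] = self.spent.get(key, 0.0) + estimated_cost
        return permitted, violations


policy = BudgetPolicy(
    ceilings={("workflow", "ticket-triage"): 5.00, ("tag", "finance"): 100.00},
    allowed_models={"ticket-triage": {"small-model"}},
)
triage = CallContext("ticket-triage", "agent-a", "small-model", "family-x",
                     "harness-y", Stage.PRODUCTION, ("finance",))
drafting = CallContext("reply-drafting", "agent-b", "small-model", "family-x",
                       "harness-y", Stage.PRODUCTION, ("finance",))

print(policy.check(triage, 4.00))    # permitted: within the workflow ceiling
print(policy.check(triage, 4.00))    # blocked: would exceed the workflow ceiling
print(policy.check(drafting, 1.00))  # permitted: the other workflow is unaffected
```

The design choice worth noticing is that the check runs before the call and returns a decision, so one workflow hitting its ceiling does not touch the budget of any other workflow, agent, or tag.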


Observability ≠ Control

There is a category of vendor that will sell you visibility into your AI spend. You will get a beautiful dashboard. You will see your token consumption by model. You will see your cost curve going up and to the right.

You will not be able to stop it.

The gap between seeing the spend and stopping it is where your budget lives. Observability is a retrospective discipline. It is essential. It is not sufficient.

What the enterprise needs is a runtime enforcement layer that sits outside the LLM’s own reasoning — deterministic, policy-driven, and structurally incapable of being bypassed by the model itself. One that evaluates every tool call, every model invocation, every token consumed against a live policy, and acts in real time: ALLOW, BLOCK, or ESCALATE.
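To show what "deterministic" means here, the sketch below makes the decision with plain comparisons against counters, with no model in the loop. It is an illustration under stated assumptions, not Arbiter's implementation. The `RunGuard` class, its thresholds, and the reading of ESCALATE as "proceed but notify a human" are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ALLOW = "ALLOW"
    BLOCK = "BLOCK"
    ESCALATE = "ESCALATE"


@dataclass
class RunGuard:
    """Per-run guard that sits between the agent and its tools (hypothetical)."""
    token_ceiling: int
    escalate_fraction: float = 0.8   # escalate once projected use passes 80% of the ceiling
    max_repeated_failures: int = 3   # identical failing calls tolerated before blocking
    tokens_used: int = 0
    failures: Counter = field(default_factory=Counter)

    def evaluate(self, tool: str, args_fingerprint: str, estimated_tokens: int) -> Decision:
        # Retry storm: the same call has already failed too many times.
        if self.failures[(tool, args_fingerprint)] >= self.max_repeated_failures:
            return Decision.BLOCK
        projected = self.tokens_used + estimated_tokens
        if projected > self.token_ceiling:
            return Decision.BLOCK
        if projected > self.escalate_fraction * self.token_ceiling:
            return Decision.ESCALATE
        return Decision.ALLOW

    def record_usage(self, tokens: int) -> None:
        self.tokens_used += tokens

    def record_failure(self, tool: str, args_fingerprint: str) -> None:
        self.failures[(tool, args_fingerprint)] += 1


# Simulate an agent looping on a tool call that keeps returning an error.
guard = RunGuard(token_ceiling=100_000)
for attempt in range(10):
    decision = guard.evaluate("crm_lookup", "id=42", estimated_tokens=5_000)
    if decision is Decision.BLOCK:
        break
    guard.record_usage(5_000)
    guard.record_failure("crm_lookup", "id=42")

print(attempt, decision.value, guard.tokens_used)  # 3 BLOCK 15000
```

In this simulation the loop is cut off on the fourth attempt, after 15,000 tokens, instead of running until an alert email is read.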

That is not the same thing as guardrails. Guardrails are in the model. The model hallucinates. Guardrails can be bypassed; this has been demonstrated publicly with Fable 5 and others. Your cost controls cannot live inside the thing you’re trying to control.


The Call to Action — Because Watching the Budget Climb Is Not a Strategy

If you are reading this and you recognize your situation in any of the stories above, here is what to do:

Don’t sit around watching spend climb. The dashboard is not a control plane. By the time the spend is visible, the runaway is already well underway.

Connect LangGuard. Our runtime enforcement layer sits on the path every agent takes — outside the model, across every harness, covering every workflow. We see every token consumed. We enforce every policy in real time. We hard-block the runaway without touching the workflows that are behaving.

Download the Tokenmaxxing Policy Pack. We’ve built a ready-to-deploy policy set for the most common runaway patterns: retry storms, subagent loops, unauthorized model upgrades, development agents hitting production-scale model endpoints. It takes minutes to deploy against your existing agent infrastructure.

Turn on Arbiter. Arbiter is LangGuard’s deterministic enforcement engine. It doesn’t predict what might happen. It evaluates what is happening, right now, against your policies, and acts. ALLOW. BLOCK. ESCALATE. No latency. No drift. No model-in-the-loop.

Get daily dashboards that mean something. Not spend-after-the-fact. Policy enforcement events, per workflow, per agent, per stage. Violations caught before they become budget events. The receipts your finance team actually needs.

Don’t let runaway agents destroy your AI growth objectives. You approved these agents because you believed in what they can do for the business. The ones behaving should keep running. The one that went off the rails should have been stopped at the first violation — not discovered in the next billing cycle.

The enterprises that will win with agentic AI are not the ones that burn the most tokens. They’re the ones that govern which tokens get burned, by which agents, for which purposes — and stop the ones that shouldn’t be running at all.

That’s not a limitation on AI ambition. That’s the infrastructure that makes AI ambition sustainable.


LangGuard is the runtime governance layer for enterprise AI agents. Arbiter enforces per-workflow cost policies in real time — outside the model, across every harness, before the bill arrives.

*Request a free trial → Download the Tokenmaxxing Policy Pack → Read the platform overview →*