Token Economics: Profiling and Reducing LLM Costs in Multi-Agent Systems
A single agent call costs pennies. A multi-step workflow with retries and context assembly can cost dollars. Here is how to see where every token goes and systematically reduce spend.
The Hidden Cost of Multi-Step Agents
When engineers estimate LLM costs, they typically think in terms of a single API call: a prompt goes in, a completion comes back, you multiply tokens by price per million. At GPT-4o rates of $2.50 per million input tokens and $10 per million output tokens, a single call with 2,000 input tokens and 500 output tokens costs about a cent. Manageable. But a production multi-agent workflow is not a single call — it’s a cascade of calls where each step inflates the context for the next. A five-step workflow that starts with a 2,000-token system prompt compounds context at every step: the system prompt is repeated in every call, the conversation history grows with each tool result and reasoning trace, and tool descriptions add a fixed overhead to every call. By step five, you’re sending 15,000–25,000 input tokens per call, not 2,000.
The following table illustrates how tokens accumulate in a typical five-step agent workflow — an insurance claims processor that retrieves policy data, analyzes coverage, queries a knowledge base, drafts a decision, and validates compliance. The numbers are representative of real production workloads:
| Step | System Prompt | Conversation History | Tool Descriptions | Tool Results | Total Input | Output | Cumulative Cost |
|---|---|---|---|---|---|---|---|
| 1 — Retrieve policy | 1,800 | 0 | 1,200 | 0 | 3,000 | 400 | $0.012 |
| 2 — Analyze coverage | 1,800 | 2,600 | 1,200 | 3,200 | 8,800 | 600 | $0.040 |
| 3 — Query knowledge base | 1,800 | 5,800 | 1,200 | 4,500 | 13,300 | 500 | $0.078 |
| 4 — Draft decision | 1,800 | 9,200 | 1,200 | 5,800 | 18,000 | 1,200 | $0.135 |
| 5 — Validate compliance | 1,800 | 13,800 | 1,200 | 7,200 | 24,000 | 800 | $0.203 |
What this table makes viscerally clear is that roughly two-thirds of the tokens you pay for are repeated context — the system prompt, tool descriptions, and conversation history — rather than new information. The system prompt alone accounts for 9,000 tokens across five calls, 13% of total input, for identical content repeated verbatim every time. Conversation history dominates: by step five, you’re paying to re-read 13,800 tokens of prior exchanges before the model generates a single new token. At 10,000 workflow executions per day, this single workflow costs roughly $2,030 daily. Scale to ten such workflows and you’re looking at roughly $600,000 per month — most of it spent re-reading context the model has already processed in previous steps. The compounding nature of context accumulation means that optimizing the last step of a workflow has almost no impact; the leverage is in controlling what goes into the first three steps.
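The per-step numbers in the table follow directly from the token counts and the GPT-4o rates quoted earlier. A minimal sketch of that cost model (step token counts taken from the table above; the loop is purely illustrative):

```python
# GPT-4o rates used throughout this article, in dollars per token.
INPUT_PRICE = 2.50 / 1_000_000
OUTPUT_PRICE = 10.00 / 1_000_000

# (input_tokens, output_tokens) per step, from the claims-workflow table.
steps = [(3_000, 400), (8_800, 600), (13_300, 500), (18_000, 1_200), (24_000, 800)]

cumulative = 0.0
for i, (inp, out) in enumerate(steps, start=1):
    step_cost = inp * INPUT_PRICE + out * OUTPUT_PRICE
    cumulative += step_cost
    print(f"step {i}: ${step_cost:.4f}, cumulative ${cumulative:.3f}")

# Cumulative after step 5 is ~$0.203; at 10,000 executions/day that's ~$2,030 daily.
daily_cost = cumulative * 10_000
```

Running this reproduces the Cumulative Cost column, which makes it easy to re-price the same workflow against other models by swapping the two rate constants.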
Profiling Token Usage
You cannot optimize what you cannot see, and most teams have surprisingly poor visibility into where their tokens go. The first step toward cost control is instrumenting every LLM call to record input tokens, output tokens, model used, latency, and — critically — the span context that tells you which step in which workflow consumed those tokens. Without span-level attribution, you know your monthly bill is $45,000 but you have no idea whether that’s driven by one expensive workflow running a million times or a hundred workflows each contributing a small share. Token profiling means attaching cost data to every span in your trace, just as you’d attach latency data, so you can slice and aggregate by workflow, step, model, tenant, and time period.
The highest-leverage profiling insight is separating input tokens from output tokens at every span. Input tokens reveal context bloat — are you sending 20,000 tokens of tool descriptions when only two tools are relevant to this step? Output tokens reveal verbosity — is the model generating 2,000-token reasoning traces when a 200-token structured output would suffice? In practice, most teams find that 3–5 spans account for 60–80% of their total token spend. Common culprits include: retrieval steps that dump raw documents into context instead of extracted passages, tool calls that return full API responses (including metadata, pagination, and unused fields) instead of filtered results, and planning steps that accumulate the full reasoning trace of every previous step instead of a summary.
Once you have per-span token data, build three views. First, a cost heatmap by workflow and step: for every workflow, which step is the most expensive, and how does that compare across workflow types? This immediately shows you where to focus optimization. Second, a token growth curve per workflow execution: plot cumulative input tokens across steps and look for super-linear growth, which indicates unconstrained context accumulation. Third, a top-N cost drivers view that ranks individual spans by total cost over a time period. A span that runs 50,000 times a day at $0.003 each contributes $150 daily and $55,000 annually — that single span might be worth a week of optimization work. These views transform token spend from an opaque line item on your cloud bill into an actionable engineering metric you can trend, alert on, and optimize.
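A span-level token profiler can be sketched in a few lines. The `record` API and the price table below are assumptions for illustration, not a real tracing library; in production you would attach these same fields as attributes on your existing trace spans (e.g. via OpenTelemetry) rather than keep them in an in-process dict:

```python
from collections import defaultdict
from dataclasses import dataclass

# (input, output) dollars per million tokens — illustrative price table.
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

@dataclass
class SpanCost:
    calls: int = 0
    input_tokens: int = 0
    output_tokens: int = 0
    cost: float = 0.0

class TokenProfiler:
    def __init__(self):
        # Keyed by (workflow, step) so spend can be sliced per span.
        self.spans = defaultdict(SpanCost)

    def record(self, workflow, step, model, input_tokens, output_tokens):
        inp_price, out_price = PRICES[model]
        s = self.spans[(workflow, step)]
        s.calls += 1
        s.input_tokens += input_tokens
        s.output_tokens += output_tokens
        s.cost += (input_tokens * inp_price + output_tokens * out_price) / 1e6

    def top_cost_drivers(self, n=3):
        # The "top-N cost drivers" view: rank spans by total spend.
        return sorted(self.spans.items(), key=lambda kv: kv[1].cost, reverse=True)[:n]
```

Feeding every LLM call through `record` gives you the raw data for all three views: the heatmap and growth curve are just aggregations over the same `(workflow, step)` keys.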
Context Compression Strategies
Since context tokens dominate LLM costs in multi-step workflows, the most impactful optimization is reducing what goes into the prompt without degrading task quality. The simplest technique is step-result summarization: instead of passing the raw output of step N as context to step N+1, run a cheap summarization pass that distills the output to its essential information. A retrieval step that returns 3,000 tokens of raw document text might yield a 400-token summary containing the three facts the next step actually needs. At $2.50 per million input tokens, eliminating 2,600 tokens from every subsequent step in a five-step workflow saves $0.026 per execution. That’s $260 per day at 10,000 executions, or $95,000 annually — from a single optimization.
Sliding window context takes this further by limiting how far back the conversation history extends. Instead of accumulating the full history of all previous steps, keep only the last two or three steps in full detail and drop everything older. For workflows where step five genuinely needs information from step one, use hierarchical context: maintain the last two steps in full, steps before that as one-paragraph summaries, and anything older than five steps as a single sentence of key facts. This mirrors how human experts work — you remember recent details precisely and older context as compressed knowledge. The implementation is straightforward: after each step, run a context manager that classifies prior steps into tiers and compresses accordingly. The cost of the summarization calls is a fraction of the savings from reduced context in downstream steps.
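The tiering described above can be sketched as a small context-assembly function. The `summarize` and `key_facts` placeholders stand in for cheap LLM calls (or extractive heuristics) that you would supply; only the tier boundaries come from the scheme in the text:

```python
def summarize(text: str) -> str:
    # Placeholder: in practice a cheap summarization model call.
    # Here, naively keep the first sentence.
    return text.split(". ")[0] + "."

def key_facts(text: str) -> str:
    # Placeholder for a one-line distillation of an old step.
    return summarize(text)

def build_context(step_outputs: list[str]) -> list[str]:
    """Compress prior step outputs into tiers before the next LLM call."""
    context = []
    n = len(step_outputs)
    for i, out in enumerate(step_outputs):
        age = n - i  # 1 = most recent step
        if age <= 2:
            context.append(out)             # last two steps: full detail
        elif age <= 5:
            context.append(summarize(out))  # steps 3-5 back: summary
        else:
            context.append(key_facts(out))  # older: single key-facts line
    return context
```

Running `build_context` after every step keeps downstream prompts bounded instead of letting history grow without limit, while the hierarchical tiers preserve a trace of early steps for workflows that need them.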
In most production workflows, 80% of the information passed as context to a given step is irrelevant to that step’s task. A compliance-checking step doesn’t need the raw API response from the data retrieval step — it needs three extracted fields. Audit your top five most expensive spans and ask: for each piece of context this step receives, would removing it change the output? You’ll typically find that aggressively pruning context to only what’s decision-relevant reduces input tokens by 40–60% with zero quality degradation.
Model Cascading: Right-Size Every Call
Not every LLM call in a workflow requires a frontier model. A five-step agent might need GPT-4o for complex reasoning in one step and could use GPT-4o-mini or Claude 3.5 Haiku for the other four. Model cascading assigns each step to the cheapest model that meets the quality threshold for that step’s task. Data extraction, formatting, simple classification, and template-based generation rarely need frontier-model reasoning — they need reliable instruction following, which smaller models handle well. Reserve expensive models for steps that involve multi-hop reasoning, ambiguous inputs, or nuanced judgment where model capability directly correlates with output quality.
The most effective cascading pattern is confidence-based escalation. Route every request to the cheapest applicable model first. If the model’s output meets a confidence threshold — measured by self-reported confidence, output format validity, or a lightweight verifier — accept the result. If confidence is below the threshold, re-route to the next tier. This try-cheap-first approach means you only pay frontier prices for the subset of requests that genuinely need it. In practice, 60–80% of requests across most workflow steps are handled successfully by the cheapest tier, yielding a blended cost reduction of 3–5x compared to using the frontier model uniformly.
| Model Tier | Example Models | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Accuracy (typical tasks) | Best For |
|---|---|---|---|---|---|
| Tier 1 — Lightweight | GPT-4o-mini, Claude 3.5 Haiku, Gemini 2.0 Flash | $0.15–$0.25 | $0.60–$1.00 | 85–92% | Extraction, classification, formatting, simple Q&A |
| Tier 2 — Mid-range | GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Pro | $2.00–$3.00 | $8.00–$15.00 | 92–97% | Multi-step reasoning, synthesis, code generation |
| Tier 3 — Frontier | Claude 4 Opus, GPT-4.5, o3 | $10.00–$15.00 | $30.00–$60.00 | 96–99% | Complex judgment, ambiguous inputs, high-stakes decisions |
Implementation requires instrumenting each span with the model used and tracking accuracy per model-step combination. Start by profiling your workflow to identify which steps have uniform, predictable inputs (good candidates for Tier 1) and which steps have high variance or require nuanced interpretation (keep on Tier 2 or 3). Run a two-week A/B test: route 10% of traffic for candidate steps to the cheaper model and compare quality metrics. If quality holds within your tolerance (typically <2% degradation on task success rate), promote the cheaper model to 100% for that step. A five-step workflow that moves three steps from Tier 2 to Tier 1 reduces per-execution cost from $0.20 to $0.06 — a 70% reduction that compounds across millions of daily executions.
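The confidence-based escalation loop itself is short. In this sketch, `call_model` and `confidence` are assumptions standing in for your LLM client and your verifier (output-format validation, a self-reported score, or a lightweight judge model), and the tier names are illustrative:

```python
# Cheapest to most expensive — illustrative model names, not an endorsement.
TIERS = ["gpt-4o-mini", "gpt-4o", "claude-opus"]

def cascade(prompt, call_model, confidence, threshold=0.8):
    """Try the cheapest model first; escalate while confidence is low.

    call_model(model, prompt) -> output string
    confidence(output) -> float in [0, 1]
    """
    output = None
    for model in TIERS:
        output = call_model(model, prompt)
        if confidence(output) >= threshold:
            return model, output
    # Frontier tier is the last resort: accept its answer regardless.
    return TIERS[-1], output
```

The threshold is the routing knob to tune continuously: raise it and more traffic escalates (higher cost, higher quality); lower it and more traffic stays on the cheap tier.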
Caching and Deduplication
Semantic caching is the single highest-ROI optimization for workflows that handle repeated or similar queries. The idea is simple: before sending a request to the LLM, compute an embedding of the prompt and check whether a sufficiently similar prompt has been seen recently. If the cosine similarity exceeds a threshold (typically 0.95–0.98), return the cached response instead of making an API call. For customer-facing agents that handle common queries — “What’s my account balance?”, “How do I reset my password?”, “What are your business hours?” — cache hit rates of 15–30% are common. At 100,000 queries per day with an average cost of $0.01 per query, a 25% cache hit rate saves $250 daily, or $91,000 annually. The cache infrastructure (a vector store plus TTL-based eviction) costs a fraction of that.
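A semantic cache reduces to embed, compare, and evict. In this sketch `embed` is an assumption standing in for your embedding model, and the linear scan stands in for a vector store with approximate nearest-neighbor search; the threshold and TTL defaults follow the ranges discussed in this section:

```python
import math
import time

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, embed, threshold=0.95, ttl_seconds=900):
        self.embed = embed          # prompt -> vector (your embedding model)
        self.threshold = threshold  # cosine similarity for a hit
        self.ttl = ttl_seconds      # staleness bound for volatile data
        self.entries = []           # (embedding, response, expires_at)

    def get(self, prompt):
        now = time.time()
        # TTL-based eviction: drop expired entries before lookup.
        self.entries = [e for e in self.entries if e[2] > now]
        vec = self.embed(prompt)
        for emb, response, _ in self.entries:
            if cosine(vec, emb) >= self.threshold:
                return response  # cache hit: no API call made
        return None

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response, time.time() + self.ttl))
```

The call site is `cached = cache.get(prompt)`, falling through to the LLM and a `cache.put` only on a miss, so the hit rate translates directly into avoided API calls.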
Provider-level prompt caching is a complementary optimization that targets the system prompt and static context prefix. Anthropic’s prompt caching, for example, caches the first portion of the prompt that remains constant across requests, charging only 10% of normal input cost for cached tokens. If your system prompt and tool descriptions total 3,000 tokens and they’re identical across calls, prompt caching reduces the effective input cost for those tokens by 90%. For multi-agent systems, tool-call deduplication eliminates redundant work: if Agent A and Agent B both need the same customer record, a shared tool-call cache ensures the external API is hit once and both agents receive the cached result. This reduces both latency and cost, especially for expensive or rate-limited external APIs.
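Tool-call deduplication can be sketched as a shared cache keyed on the tool name and a canonical serialization of its arguments. The `fetch` callable below is an assumption standing in for the real external API client; a production version would also need concurrency control and a TTL:

```python
import hashlib
import json

class ToolCallCache:
    def __init__(self, fetch):
        self.fetch = fetch       # fetch(tool, args) -> result (the real API)
        self.results = {}
        self.upstream_calls = 0  # for observing deduplication

    def call(self, tool: str, **args):
        # Canonical key: tool name + sorted JSON of arguments, so the same
        # logical request hashes identically regardless of argument order.
        key = hashlib.sha256(
            (tool + json.dumps(args, sort_keys=True)).encode()
        ).hexdigest()
        if key not in self.results:
            self.upstream_calls += 1
            self.results[key] = self.fetch(tool, args)
        return self.results[key]
```

If Agent A and Agent B both request the same customer record through this cache, the upstream API is hit once and both receive the stored result.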
Semantic caching introduces a staleness risk: if the underlying data changes but the query is similar enough to hit the cache, you return outdated results. Mitigate this with TTL-based eviction (expire entries after 5–15 minutes for volatile data, 1–24 hours for stable data), event-driven invalidation (clear relevant cache entries when upstream data changes), and confidence scoring (never cache low-confidence responses). Monitor your cache hit rate alongside output quality — a rising hit rate with declining quality scores means your TTLs are too long.
Measuring ROI: Cost Per Outcome
The metric that actually matters for LLM cost management is not cost per token or cost per API call — it’s cost per successful task completion. This metric captures the full economic reality: retries, fallbacks, human escalations, and failed attempts that consumed tokens without delivering value. A workflow that costs $0.15 per execution but succeeds 98% of the time has an effective cost of $0.15 / 0.98 = $0.153 per successful outcome. A cheaper workflow at $0.06 per execution that fails 20% of the time — requiring retries or human escalation at $2.50 per escalation — has an effective cost of (0.80 × $0.06 + 0.20 × ($0.06 + $2.50)) / 0.80 = $0.70 per successful outcome. The “cheap” option costs 4.6x more per actual result.
To calculate cost per outcome accurately, you need to track three components at the workflow level: direct LLM cost (sum of all token costs across all spans in a single execution), retry cost (LLM cost of any automated retry attempts), and escalation cost (the human labor cost when the agent fails and a person takes over). Consider this example for a document processing workflow running 50,000 executions per month:
| Metric | Frontier Model (GPT-4o) | Budget Model (GPT-4o-mini) |
|---|---|---|
| Cost per execution | $0.18 | $0.03 |
| Success rate | 97% | 81% |
| Retry rate (auto) | 2% | 12% |
| Escalation rate (human) | 1% | 7% |
| Human escalation cost | $3.00 | $3.00 |
| Monthly LLM cost | $9,180 | $1,680 |
| Monthly escalation cost | $1,500 | $10,500 |
| Total monthly cost | $10,680 | $12,180 |
| Cost per successful outcome | $0.22 | $0.30 |
The budget model looks 6x cheaper on a per-call basis but ends up costing 14% more when you account for failures and escalations. This is why token cost alone is a misleading metric. The optimal strategy is often a hybrid: use model cascading to route easy tasks to cheap models (where they succeed reliably) and expensive models only for hard tasks (where accuracy matters most). Track cost per successful outcome as your north-star metric, segment it by workflow and difficulty tier, and optimize the routing thresholds continuously. Teams that adopt this outcome-oriented cost framework typically find they can reduce total spend by 40–60% compared to a uniform “use the cheapest model everywhere” approach, because they eliminate the hidden costs of failure.
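The cost-per-outcome formula behind the table reduces to a few lines: total cost is direct LLM spend plus retry spend plus escalation labor, divided by the number of successful executions. A sketch, using the same inputs as the table above:

```python
def cost_per_outcome(executions, llm_cost, success_rate,
                     retry_rate, escalation_rate, escalation_cost):
    """Effective cost per successful task completion."""
    direct = executions * llm_cost                          # all first attempts
    retries = executions * retry_rate * llm_cost            # automated retries
    escalations = executions * escalation_rate * escalation_cost  # human labor
    total = direct + retries + escalations
    return total / (executions * success_rate)

# Inputs from the document-processing table: 50,000 executions/month.
frontier = cost_per_outcome(50_000, 0.18, 0.97, 0.02, 0.01, 3.00)  # ~$0.22
budget = cost_per_outcome(50_000, 0.03, 0.81, 0.12, 0.07, 3.00)    # ~$0.30
```

Segmenting this calculation by workflow and difficulty tier, then re-running it whenever routing thresholds change, is what turns cost per successful outcome into an operational metric rather than a one-off analysis.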
Cutting LLM costs by switching to a cheaper model without measuring downstream impact is the most common mistake in production AI systems. A 2% drop in task success rate sounds negligible, but at 50,000 executions per month, that’s 1,000 additional failures requiring human intervention at $3 each — $3,000 per month in hidden costs. Always measure cost per successful outcome, not cost per API call. Factor in retry tokens, escalation labor, customer churn from degraded quality, and engineering time spent debugging failures. The cheapest model is rarely the most economical.