
How to Cut Your AI Workflow Costs 60% Without Sacrificing Output Quality

Published: April 19, 2026

The fastest-growing expense in AI-enabled engineering teams isn't compute or storage. It's LLM tokens. Teams running multi-step agentic workflows routinely report monthly API bills that were unplanned, uncapped, and significantly higher than early estimates.

The good news: most of that spend is unnecessary. The gap between a naively implemented agentic workflow and a cost-optimized one is typically 40–60%, with no degradation in output quality. The difference is architectural, not a matter of prompting tricks.

Here's a systematic breakdown of where token spend goes in production workflows, and the specific design patterns that eliminate waste.


Where the Tokens Actually Go

Before optimizing, you need a clear picture of token distribution. In a typical multi-step agentic workflow:

| Step Type | % of Total Token Spend | Optimization Potential |
| --- | --- | --- |
| Large context reads (full file/document input) | 30–40% | High — most context isn't used |
| Repetitive system prompt re-injection | 15–25% | High — can be cached |
| Reasoning on simple classification/routing tasks | 10–20% | High — wrong model for the job |
| Actual implementation/generation steps | 20–30% | Low — this is the value |
| Verification and validation steps | 5–10% | Medium — can use smaller models |
| Error recovery and retry cycles | 5–15% | Medium — proper prompting reduces retries |

The insight: the high-value generation steps that justify LLM usage often represent only 20–30% of total token spend. The rest is overhead that good architecture eliminates.


Pattern 1: Effort-Calibrated Model Routing

The highest-leverage optimization is using the right model for each step. This is now a first-class feature in Anthropic's API via effort controls, and implicit in the cost differential between model tiers.

Model cost reference (April 2026):

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best For |
| --- | --- | --- | --- |
| Claude Haiku 4.5 | $0.80 | $4.00 | Classification, routing, extraction, formatting |
| GPT-4o Mini | $0.15 | $0.60 | Simple transformations, validation, summarization |
| Claude Sonnet 4.6 | $3.00 | $15.00 | Structured reasoning, code review, analysis |
| Gemini 2.0 Flash | $0.10 | $0.40 | Large context processing, document analysis |
| Claude Opus 4.7 | $5.00 | $25.00 | Complex implementation, autonomous coding |
| GPT-4o | $2.50 | $10.00 | Generation, synthesis, complex tasks |

A workflow that routes every step to Opus 4.7 pays over 30x more per input token (and over 40x more per output token) than the same workflow using GPT-4o Mini for the appropriate steps.

Practical routing heuristic:

  • Extract, classify, format, validate → Use Haiku 4.5 or GPT-4o Mini
  • Analyze, plan, review, structure → Use Sonnet 4.6 or Gemini Flash
  • Implement, generate, reason autonomously → Use Opus 4.7 or GPT-4o
  • Verify assertion, check output → Use Haiku 4.5 or GPT-4o Mini

A five-step workflow — classify → analyze → plan → implement → verify — might cost $0.50 per run using uniform Opus 4.7 routing, and $0.08 per run using the above routing strategy, for equivalent output quality.
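The routing heuristic above can be encoded as a small lookup table. A minimal sketch in TypeScript: the task categories and model tiers come from this post, but the model identifier strings and the `routeModel` helper are illustrative, not real provider model IDs.

```typescript
// Task categories from the routing heuristic above.
type TaskKind =
  | 'extract' | 'classify' | 'format' | 'validate' | 'verify'
  | 'analyze' | 'plan' | 'review'
  | 'implement' | 'generate';

// Cheapest tier that handles each category well (illustrative IDs).
const MODEL_FOR_TASK: Record<TaskKind, string> = {
  extract: 'claude-haiku-4.5',
  classify: 'claude-haiku-4.5',
  format: 'claude-haiku-4.5',
  validate: 'claude-haiku-4.5',
  verify: 'claude-haiku-4.5',
  analyze: 'claude-sonnet-4.6',
  plan: 'claude-sonnet-4.6',
  review: 'claude-sonnet-4.6',
  implement: 'claude-opus-4.7',
  generate: 'claude-opus-4.7',
};

function routeModel(kind: TaskKind): string {
  return MODEL_FOR_TASK[kind];
}
```

Because the mapping is a plain record keyed by a union type, the compiler flags any task category you forget to route.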


Pattern 2: Context Window Minimization

Every token in the context window costs money, including tokens that don't affect the model's output. Naively constructed agent workflows stuff the entire available context with "just in case" information.

What eats context unnecessarily:

  1. Full file contents when only a function is needed: Loading a 500-line file when the relevant function is 20 lines costs 25x more input tokens
  2. Entire conversation history: Most multi-turn agent conversations include earlier turns that are no longer relevant to the current step
  3. Verbose tool output: A tool that returns full JSON when the workflow needs one field
  4. Repeated static instructions: System prompts re-injected at every step even when they don't change

The fixes:

```typescript
// Instead of: read the full file
const content = await readFile('src/api/auth.ts');

// Do: extract only the relevant section
const section = await readFileLines('src/api/auth.ts', 78, 102);
```

```typescript
// Instead of: pass full tool output
const searchResults = await searchCode(query); // returns 50 results with full context

// Do: truncate before injection
const topResults = searchResults
  .slice(0, 5)
  .map(r => ({ path: r.path, line: r.line, snippet: r.snippet }));
```

For context that genuinely must be large, Gemini 2.0 Flash at $0.10/1M tokens is 50x cheaper than Opus 4.7 for identical context sizes. Use the cheapest model that can handle the context scale.
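The 50x figure is straightforward to check. A small sketch; `inputCost` is an illustrative helper, and the per-1M prices come from the Pattern 1 table.

```typescript
// Input cost (USD) for processing `tokens` at a given per-1M-token rate.
function inputCost(tokens: number, pricePerMTokens: number): number {
  return (tokens / 1_000_000) * pricePerMTokens;
}

// A 200k-token document: Gemini 2.0 Flash ($0.10/1M) vs Opus 4.7 ($5.00/1M).
const flashCost = inputCost(200_000, 0.1); // $0.02
const opusCost = inputCost(200_000, 5.0);  // $1.00, a 50x difference
```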


Pattern 3: Prompt Caching

Anthropic, OpenAI, and Google all support prompt caching — a mechanism where the static prefix of your prompt (system instructions, reference documents, tool definitions) is cached and not re-billed on subsequent calls with the same prefix.

Anthropic's cache pricing: cache writes cost 25% more than standard input, but cache hits cost 90% less. A system prompt repeated across 1,000 API calls with caching enabled costs:

  • Without caching: 1,000 × full prompt cost
  • With caching: 1 cache write + 999 × 10% of prompt cost

For a 4,000-token system prompt at Sonnet 4.6 pricing ($3/1M):

| Scenario | Cost for 1,000 Calls |
| --- | --- |
| Without caching | $12.00 |
| With caching | $1.21 (1 write + 999 hits) |
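That arithmetic can be sketched as a helper, using the multipliers quoted above (cache writes at 1.25x the base input price, cache hits at 0.10x); `cachedPromptCost` is an illustrative name, not a provider API.

```typescript
// Total input cost (USD) for `calls` identical calls with a cached prefix.
// Assumes the first call writes the cache (1.25x base input price) and
// every subsequent call hits it (0.10x base input price).
function cachedPromptCost(
  promptTokens: number,
  basePricePerM: number,
  calls: number,
): number {
  const perCallBase = (promptTokens / 1_000_000) * basePricePerM;
  const write = perCallBase * 1.25;
  const hits = (calls - 1) * perCallBase * 0.1;
  return write + hits;
}

// 4,000-token system prompt at Sonnet 4.6 input pricing ($3/1M), 1,000 calls:
const uncached = 1_000 * (4_000 / 1_000_000) * 3; // $12.00
const cached = cachedPromptCost(4_000, 3, 1_000); // ~$1.21
```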

Implementation requirements for cache hits:

  • The cached prefix must be identical across calls (same tokens, same order)
  • The cache prefix must be at least 1,024 tokens (Anthropic minimum)
  • Mark prefix boundaries explicitly with cache_control fields on content blocks in the request body

The implication: system prompts and reference documents should be static and placed at the start of every prompt. Variable inputs go at the end, after the cached prefix. This structure maximizes cache hit rate.
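With Anthropic's Messages API, that structure looks roughly like the following. The cache_control marker on a system content block is real Anthropic API surface; the model ID, prompt contents, and the `buildRequest` helper are placeholders for illustration.

```typescript
// Static prefix first (and marked for caching); variable input last.
const STATIC_SYSTEM_PROMPT = '...'; // >= 1,024 tokens of fixed instructions

function buildRequest(variableInput: string) {
  return {
    model: 'claude-sonnet-4-6', // placeholder model ID
    max_tokens: 1024,
    system: [
      {
        type: 'text',
        text: STATIC_SYSTEM_PROMPT,
        cache_control: { type: 'ephemeral' }, // cache everything up to here
      },
    ],
    messages: [
      // Variable part goes after the cached prefix.
      { role: 'user', content: variableInput },
    ],
  };
}
```

Anything before the cache_control marker must be byte-identical across calls for the cache to hit; anything after it can vary freely.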


Pattern 4: Output Length Control

Output tokens cost 4–5x more than input tokens across every major provider (see the pricing table in Pattern 1). Workflows that request verbose outputs when compact outputs suffice pay a significant premium.

Common output bloat causes:

  1. No format constraint: The model defaults to explanatory prose when you need structured data
  2. Asking for reasoning when you need the conclusion: "Explain your analysis and then provide..." generates unnecessary reasoning tokens
  3. Full document generation when incremental updates are sufficient: Regenerating an entire file when only a function changed

The fixes:

  • Specify output format explicitly: "Return a JSON object with fields: path, line_number, issue_type, severity"
  • Separate reasoning from output: Use a reasoning step (cheap, short output) → implementation step (full output only for the final artifact)
  • Constrain length: "Respond in under 100 words" or "Return only the modified function, not the entire file"

A code review step that says "Review this code and provide detailed analysis of every potential issue, explain your reasoning, and suggest improvements for each" might generate 2,000 output tokens. Rewritten as "Return a JSON array of issues: [{severity, file, line, description, fix}]" for the same review typically generates 300–500 output tokens with equivalent actionability.
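Output constrained this way is also trivially machine-readable. A sketch, assuming the JSON array shape requested above; the `Issue` interface and `parseIssues` helper are illustrative, not part of any provider SDK.

```typescript
// Shape requested from the model: [{severity, file, line, description, fix}]
interface Issue {
  severity: 'low' | 'medium' | 'high';
  file: string;
  line: number;
  description: string;
  fix: string;
}

// Parse the model's structured review output; throws if it isn't
// the requested JSON array.
function parseIssues(modelOutput: string): Issue[] {
  const parsed = JSON.parse(modelOutput);
  if (!Array.isArray(parsed)) throw new Error('expected a JSON array of issues');
  return parsed as Issue[];
}

// Hypothetical model output for the review step above:
const sample =
  '[{"severity":"high","file":"src/api/auth.ts","line":88,' +
  '"description":"token not validated","fix":"verify signature before use"}]';
```

The downstream workflow can then filter or route on `severity` directly instead of re-prompting a model to interpret prose.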


Pattern 5: Workflow-Level Task Budgets

Even with per-step optimization, workflow-level costs can spike when:

  • An error recovery loop runs more iterations than expected
  • A retrieval step returns more context than anticipated
  • A model takes a more verbose reasoning path than usual

Anthropic's task budgets (Opus 4.7 feature) let you set a total token ceiling per workflow run. The model adapts its behavior to work within the budget — terminating early or summarizing rather than exhausting the limit.

For providers without native task budgets, implement them at the orchestration layer:

```typescript
class WorkflowExecutor {
  private totalTokensUsed = 0;

  constructor(private tokenBudget: number) {}

  async executeStep(step: WorkflowStep): Promise<StepResult> {
    // Terminate gracefully before starting a step that would exceed the budget.
    if (this.totalTokensUsed >= this.tokenBudget) {
      return { status: 'budget_exceeded', partial_result: this.collectPartialResults() };
    }
    const result = await this.callModel(step);
    this.totalTokensUsed += result.usage.total_tokens;
    return result;
  }
}
```

Setting a budget doesn't mean workflows fail when they hit it — it means they terminate gracefully with whatever they've produced, rather than running to exhaustion.


Real Cost Numbers: Before and After

A representative code review workflow — receiving a 200-line PR diff, running security analysis, checking for performance regressions, and generating a review comment:

Before optimization:

| Step | Model | Tokens | Cost |
| --- | --- | --- | --- |
| Load full file context | Opus 4.7 | 8,000 in / 200 out | $0.045 |
| Security analysis | Opus 4.7 | 2,000 in / 1,500 out | $0.048 |
| Performance analysis | Opus 4.7 | 2,000 in / 1,500 out | $0.048 |
| Generate review | Opus 4.7 | 4,000 in / 800 out | $0.040 |
| **Total** | | | **$0.181** |

After optimization:

| Step | Model | Tokens | Cost |
| --- | --- | --- | --- |
| Extract relevant sections | Haiku 4.5 | 8,000 in / 200 out | $0.007 |
| Security analysis (cached system prompt) | Sonnet 4.6 | 500 in / 400 out | $0.008 |
| Performance analysis (cached system prompt) | Sonnet 4.6 | 500 in / 400 out | $0.008 |
| Generate review (structured output) | Sonnet 4.6 | 1,200 in / 300 out | $0.008 |
| **Total** | | | **$0.031** |

83% cost reduction on equivalent output quality. The security and performance analyses are actually better — Sonnet 4.6 with a focused context outperforms Opus 4.7 with a bloated context on bounded analysis tasks.


Implementation in AgenticNode

AgenticNode's execution model surfaces the data you need to optimize cost:

Real-time token and cost tracking in the Glass Window shows per-step token consumption and running cost as your workflow executes. You can see exactly which step is the cost driver without waiting for a billing report.

Node-level model selection lets you assign different models to different nodes in the visual canvas. Route your classification nodes to Haiku, your analysis nodes to Sonnet, and your generation nodes to Opus — without writing code to manage the routing logic.

BYOK with full provider support means you can switch to Gemini 2.0 Flash for large-context steps at $0.10/1M input, use Haiku for fast classification, and use Opus only where the benchmark justifies the price.

Structured output nodes constrain model outputs to JSON schemas, preventing verbose responses and reducing output token counts by 40–60% on structured data steps.


Summary

Cutting AI workflow costs 60% is architectural, not magic:

  1. Route by model tier — Classification and validation steps don't need Opus; use Haiku/Mini and save 80%+ on those steps
  2. Minimize context inputs — Extract only the relevant sections; don't load full files when functions will do
  3. Implement prompt caching — Static system prompts cached = 90% cheaper on cache hits
  4. Control output length — Structured JSON output uses 60–70% fewer tokens than explanatory prose
  5. Set workflow budgets — Cap total token spend per run; handle graceful termination at the orchestration layer

The optimization floor: every step at the cheapest appropriate model, with minimal context and structured outputs, cache-enabled for static prefixes. That combination typically achieves 55–65% cost reduction versus a naively constructed workflow.

Build your first agentic workflow

The visual workflow editor is live. Design, execute, and observe multi-agent pipelines — no framework code required.
