Cost Mechanics

Why Your Claude Code Bill Keeps Growing — And What To Do About It

Anthropic's own enterprise data puts the average Claude Code cost at $13 per developer per active day — roughly $150–250 per developer per month. For a team of eight, that's up to $2,000 a month before you've written a single invoice.

Most teams accept that number. What they don't accept — because they don't see it — is how much of it is waste.

Not waste from using the wrong model or running too many sessions. Waste from a mechanic that almost nobody explains clearly: the context you carry forward costs as much as the work you're doing now.

The Bill Is Not What You Think It Is

When developers think about Claude Code costs, they think about prompts. The longer the prompt, the higher the cost. Straightforward enough.

That mental model is wrong — or at least dangerously incomplete.

Claude Code doesn't just send your latest message to the model. It sends the entire conversation history on every turn: every prior message, every tool call result, every file excerpt Claude read, every failed attempt, every test output, every debugging spiral. All of it, every time.

The token accounting breaks into four buckets:

Bucket	What it is	Cost
`input_tokens`	Fresh tokens in this turn	Base input rate
`output_tokens`	Model response tokens	Output rate (5× input on Sonnet)
`cache_creation_input_tokens`	Tokens written to prompt cache	1.25× input rate
`cache_read_input_tokens`	Tokens read from prompt cache	0.1× input rate (90% discount)

On Sonnet 4.6, input is $3/million tokens, output is $15/million. Cache reads drop that input cost to $0.30/million — a 90% discount on stable context. That's the good news.

The bad news is what happens when the cache doesn't save you.

The Compounding Effect Nobody Warns You About

Here's the math most developers never see.

Assume a session with a 50,000-token fixed prefix (system prompt, CLAUDE.md, tool definitions) and 5,000 new tokens added each turn through conversation and tool outputs.

Turn	Input tokens this turn	Cumulative input total
1	55,000	55,000
5	75,000	325,000
10	100,000	775,000
20	150,000	2,050,000

By turn 20, you've consumed 2 million input tokens — not because your prompts got longer, but because the history got heavier. The cost curve is quadratic. It can feel exponential when agent loops, failed edits, and verbose test outputs accelerate the growth rate of each turn.

Put simply: by turn 20, you're paying roughly 10× what you'd expect if you thought only the current prompt mattered.

This is where most unplanned cost spikes come from. Not a single expensive session, but a long session that nobody ended.

This is the problem TokenWatch is built to solve. You shouldn't need to manually watch a token counter to know when a session has crossed into expensive territory. TokenWatch tracks context growth across your team's sessions and alerts you when a developer's active session is approaching the threshold where clearing saves more than it costs — before the bill reflects it.

What Prompt Caching Actually Does (And Doesn't Do)

Prompt caching is real and it works — but only on the stable prefix of a conversation.

Anthropic caches in this order: tools → system prompt → messages. If your CLAUDE.md, system instructions, and tool definitions stay consistent, those tokens get cached and re-read at the 90% discount. For a developer running long sessions with consistent project context, cache reads can dominate the token bill in a good way. One reported case showed 90%+ of tokens as cache reads across 10 billion tokens of usage — effectively $800 in actual subscription cost against $15,000+ in API-equivalent consumption.

But caching breaks easily:

Changing tool definitions invalidates everything downstream
Toggling web search or citations invalidates system and message cache
Session resume can trigger a full cache-creation pass rather than a cache read
Expired TTL (5 minutes by default on API; 1 hour on subscriptions) forces reprocessing

One GitHub-reported session showed 60.9 million cache-read tokens across 197 turns in a single 12-hour session. The cache was doing its job — but a resume mid-session triggered unexpected cache-creation that consumed roughly 20% of the user's 5-hour quota on its own.

The lesson: caching reduces the cost of long sessions, but it doesn't eliminate the compounding problem. And when caching breaks, you feel it immediately.

Tokenocalypse: When Anthropic Made It Worse

In March and April 2026, Claude Code users started reporting that their usage limits were evaporating — Max 20x plans exhausted within 70 minutes of reset. The community named it "Tokenocalypse."

Anthropic's postmortem identified three overlapping causes:

Effort change — On March 4, Anthropic silently changed default effort from high to medium, degrading quality and causing more retries.
Cache/thinking clearing bug — A March 26 bug accidentally cleared thinking tokens every turn instead of only on idle cache misses, destroying cache efficiency.
Verbosity reduction prompt — Shipped April 16, reduced quality, reverted April 20.

All three were resolved by v2.1.116 according to Anthropic's April 23 postmortem. However, GitHub issue #46917 had already documented a separate finding: v2.1.100 was adding ~20,000 extra cache_creation_input_tokens per request compared to v2.1.98 — confirmed by version-by-version token measurements, not community speculation.

v2.1.98:  49,726 cache_creation_input_tokens
v2.1.100: 69,922 cache_creation_input_tokens  (+20,196 per cold request)
v2.1.101: ~72,000 cache_creation_input_tokens

Anthropic's official position is that the April issues were resolved by v2.1.116. Some community reports on versions as late as v2.1.137 continued to question whether token inflation had fully disappeared. The current release as of this writing is v2.1.141.

The practical takeaway: Claude Code's token behavior is not static. A version update can meaningfully change your usage patterns without you knowing. Teams that aren't monitoring session-level costs are flying blind when this happens.

The `/clear` Command: What It Does and When to Use It

/clear resets your active conversation context. It removes accumulated chat history, tool outputs, failed attempts, and stale file excerpts from the working context. It does not touch your repo, your git state, your CLAUDE.md, or your billing.

What Anthropic's own best practices say:

If you have corrected Claude more than twice on the same issue, the context is probably cluttered with failed approaches — clear and restart with a better prompt.

Good moments to /clear:

After finishing and committing a task
When switching to an unrelated feature
After a debugging spiral that went nowhere
When /context shows the session is bloated
Before starting a clean implementation pass

The tradeoff is real: clearing loses local reasoning history. Claude may re-read files it already processed. You need to restate where you are. That's the cost of a clean context — and for most sessions past turn 15 or 20, it's a cost worth paying.

The rule of thumb that works in practice: clear after commits, not mid-task.

Handover Before You Clear: How to Not Lose Your Work

The reason developers avoid /clear is fear of losing context. That fear is solvable.

Before clearing a long session, ask Claude to produce a HANDOVER.md with:

Goal — what you were trying to accomplish
Current status — what's done, what isn't
Files changed — paths and what changed in each
Decisions made — choices that future sessions need to respect
Commands run — tests, builds, migrations
Known failures — what broke and the leading hypothesis
Next steps — in order
Suggested next prompt — the exact prompt to start the fresh session

A good handover turns volatile session state into durable project state. The new session reads the file, inspects only the listed files, and continues — without re-scanning the entire repo.

This is also the pattern for developer handoffs across a team. When one developer ends a session on a feature, the handover file means the next developer (or the same developer the next morning) picks up in 60 seconds instead of 20 minutes.

The Cost-Saving Playbook

These are the high-leverage moves, in order of impact:

1. Clear frequently between tasks. The single most effective cost control. Don't let sessions run to turn 30 across multiple unrelated features.

2. Keep CLAUDE.md short. Every token in CLAUDE.md is part of the fixed prefix and is read on every turn. Trim it to what Claude genuinely needs.

3. Use /compact instead of /clear when staying on the same task. /compact summarizes the conversation history rather than clearing it — retaining context while reducing token volume.

4. Lower thinking effort for simple tasks. Extended thinking tokens are billed as output tokens at output rates. Use MAX_THINKING_TOKENS=8000 or /effort to cap this on tasks that don't need deep reasoning.

5. Route models by task complexity. Haiku 4.5 at $1/million input handles typo fixes, formatting, and simple lookups. Sonnet 4.6 handles most coding. Opus only when the task genuinely requires it. The cost difference between Haiku and Opus is 5× on input, 5× on output.

6. Scope prompts to specific files. "Only inspect these files first: path/to/file1, path/to/file2" prevents Claude from scanning the whole repo and loading irrelevant context.

7. Use subagents for verbose tasks. Running tests, fetching docs, and processing logs generate large outputs. Delegate these to subagents so the verbose output stays in the subagent's context rather than your main conversation.

8. Monitor with /usage or ccusage. /usage shows your local session estimate. ccusage (community tool) provides a live dashboard. Both give you visibility before the console bill does.

The Problem With All of These Strategies

Every item on that playbook requires the same thing: you need to know when to act.

You need to know that a session has reached turn 20. That context is bloated. That an agent loop has been running for 40 iterations. That a developer's session spiked to 3× the team average this afternoon.

Most teams don't know any of this until the Anthropic invoice arrives.

That's the gap TokenWatch fills. It monitors session-level token growth across your team — Claude Code, Cursor, and Cline — and alerts you when a session crosses the threshold where clearing or handing over makes economic sense. Not after the fact. Before the cost compounds further.

The best Claude Code workflows are not just about better prompts. They're about knowing when the context has outrun its value — and acting before the bill reflects it.

TokenWatch

Know when a session crosses the line — before the invoice does.

Session-level cost monitoring across Claude Code, Cursor, and Cline.

Get Early Access →