Where Your AI Tokens Actually Go
I've been running an AI desktop assistant for a few months - the kind with dozens of skills, background automations, memory, the works. I wanted to know what it costs. Not the subscription price, the actual token economics.
The data wasn't always there
So I built an audit. Charter sizes, proxy estimates, a Book of Why framework, budget tiers. I even filed 10 backlog items for instrumentation work, then marked all 10 "shipped" on the same day.
The problem: none of it was real. Every number was derived from len(text) / 4. No actual API telemetry. The single most important variable - prompt cache hit rate - was listed as "unknown."
I flagged the gap to the platform team: the assistant didn't expose token counts. There was no API-level telemetry available to me - copilot.log was all OAuth, zero usage data.
Ten days later, a new release quietly started logging full API telemetry to process logs: input_tokens, output_tokens, cache_read_tokens, model, duration, billing units. Everything I'd asked for.
I didn't notice for nine more days.
Two scripts, one evening: a log parser and a skill attribution engine. 8,761 live API events from 9 days of usage.
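For a sense of how small the log parser can be, here is a minimal sketch. It is not my actual script, and the line layout is an assumption: it supposes each telemetry record is a JSON object at the end of a log line. Only the field names (input_tokens, output_tokens, cache_read_tokens) come from the logs described above; parse_events is a made-up name.

```python
import json
from typing import Iterable, Iterator

# Field names as reported in the telemetry; the line layout itself is assumed.
TOKEN_FIELDS = ("input_tokens", "output_tokens", "cache_read_tokens")


def parse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per log line that ends in a JSON usage record."""
    for line in lines:
        start = line.find("{")
        if start == -1:
            continue  # no JSON payload on this line (e.g. OAuth chatter)
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # truncated or non-JSON payload
        if not isinstance(record, dict):
            continue
        # Keep only records that carry all three token counters.
        if all(field in record for field in TOKEN_FIELDS):
            yield record
```

Anything that does not parse is skipped rather than raised, which seems like the right default for a first pass over logs whose format you don't control.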
Three things I got wrong
1. The system prompt tax doesn't exist.
My main hypothesis was that the ~25,000-token system prompt (instructions, tool definitions, memory, rules) created a per-turn floor that made lightweight operations disproportionately expensive. A trivial "capture an idea" skill would pay the same 25K overhead as a complex spec-writing session.
The cache hit rate is 99.6%. After the first turn in a session, the entire static prefix is served from cache. The floor tax I was worried about is a rounding error.
2. Background work dominates, not interactive use.
| Category | % of cost |
|---|---|
| Background automations | 38% |
| Inbox processing | 12% |
| Heartbeat checks | 7% |
| Background maintenance | 7% |
| All interactive skills combined | ~10% |
I'd been optimizing skill charter sizes - trimming instructions, lazy-loading reference files, compressing prompts. That work was real (it cut some skills by 50-60%), but it was optimizing the 10% while the 64% in the four background rows ran unexamined.
The automations fire full sessions with the complete system prompt for things like "check if auth tokens are still valid" and "restart a Teams connection." Avg output: under 200 tokens. Avg input: 140,000 tokens. That ratio should make you uncomfortable.
3. The proxy was 300x wrong, not 10-50x.
I'd estimated the proxy data (character count divided by 4) understated real costs by "10-50x." The actual factor was ~300x. The proxy only captured user-visible text in the conversation. It missed the system prompt, tool definitions (~65K tokens), tool call arguments and results, and intermediate reasoning turns.
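To make the gap concrete, here is the proxy next to a measured count. This is a sketch: the function names are made up, and the numbers in the usage note are illustrative, not from my logs.

```python
def proxy_tokens(visible_text: str) -> int:
    """The len(text) / 4 heuristic: about four characters per token."""
    return len(visible_text) // 4


def understatement_factor(actual_input_tokens: int, visible_text: str) -> float:
    """How many times larger the measured input is than the proxy estimate."""
    estimate = max(proxy_tokens(visible_text), 1)  # avoid dividing by zero
    return actual_input_tokens / estimate
```

A 400-character exchange comes out as 100 proxy tokens. If the API reports 30,000 input tokens for that turn, understatement_factor(30_000, "x" * 400) gives 300.0. The text the user sees is a small slice of what the model is sent.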
If you're estimating AI costs from conversation text length, you're not in the right order of magnitude.
What I'd actually measure
If you're running any kind of AI agent or assistant and want to understand costs:
Cache hit rate. This is the number that changes everything. Modern API providers cache the static prefix of your prompt. If your cache rate is >95%, your system prompt size barely matters. If it's <50%, it's your biggest cost driver. You can't set budgets without knowing this.
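A sketch of the calculation, with one caveat stated loudly: providers differ on whether the reported input count already includes cached tokens. I don't know which convention your logs follow, so the sketch makes it a flag. The field names match the telemetry above; cache_hit_rate and input_includes_cached are made-up names.

```python
from typing import Iterable, Mapping


def cache_hit_rate(
    events: Iterable[Mapping[str, int]],
    input_includes_cached: bool = True,
) -> float:
    """Share of prompt tokens served from cache across all events.

    input_includes_cached is an assumption to verify against your
    provider: True if input_tokens already counts cached tokens,
    False if cached tokens are reported separately from input_tokens.
    """
    cached = 0
    inputs = 0
    for event in events:
        cached += event["cache_read_tokens"]
        inputs += event["input_tokens"]
    total = inputs if input_includes_cached else inputs + cached
    return cached / total if total else 0.0
```

The rate is computed over summed tokens rather than averaged per call, so one huge uncached session is not drowned out by many small cached ones.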
Background-to-interactive ratio. List every automated process that fires an AI call. Calculate total tokens consumed by automation vs. human-initiated work. In my case the split was 64% to ~10% of cost - roughly six times more background than foreground. Yours might be different, but you should know the number.
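Once each event is tagged with where it came from, the ratio is a few lines. This sketch assumes an attribution step has already added an "origin" key with the value "background" or "interactive". That key and both function names are hypothetical, and it counts raw tokens, not billed cost.

```python
from collections import Counter
from typing import Iterable, Mapping


def tokens_by_origin(events: Iterable[Mapping]) -> Counter:
    """Sum input + output tokens per origin label."""
    totals: Counter = Counter()
    for event in events:
        totals[event["origin"]] += event["input_tokens"] + event["output_tokens"]
    return totals


def background_ratio(events: Iterable[Mapping]) -> float:
    """Background tokens divided by interactive tokens."""
    totals = tokens_by_origin(events)
    interactive = totals["interactive"]
    if interactive == 0:
        return float("inf")  # no human-initiated usage in the sample
    return totals["background"] / interactive
```

If cached tokens are billed at a discount, a cost-weighted version will give a different number than this token-weighted one.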
Output/input ratio per automation. Any automation producing <500 output tokens against >100K input tokens is a candidate for a smaller model or a batched approach. You're warming up a full context window to generate a sentence.
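That rule of thumb translates into a simple filter. In this sketch the "automation" key is a hypothetical label from the attribution step, flag_lopsided is a made-up name, and the default thresholds are the ones from the paragraph above.

```python
from collections import defaultdict
from statistics import mean
from typing import Iterable, List, Mapping


def flag_lopsided(
    events: Iterable[Mapping],
    max_output: int = 500,
    min_input: int = 100_000,
) -> List[str]:
    """Names of automations whose average call reads a lot and writes little."""
    by_name = defaultdict(list)
    for event in events:
        by_name[event["automation"]].append(event)

    flagged = []
    for name, group in by_name.items():
        avg_in = mean(e["input_tokens"] for e in group)
        avg_out = mean(e["output_tokens"] for e in group)
        if avg_out < max_output and avg_in > min_input:
            flagged.append(name)
    return sorted(flagged)
```

An automation averaging 140,000 tokens in and under 200 out would be flagged. A spec-writing session that reads the same context but writes several thousand tokens would not.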
The meta-lesson
I built an audit framework, filed backlog items, marked them shipped, and spent three weeks believing I'd done the work. The instrumentation gap wasn't technical - for the last nine of those days the data was sitting in log files. It was an accountability gap: I marked items "shipped" with the same timestamp without checking whether the outputs existed.
AI-assisted workflows make this worse, not better. It's easy to generate the scaffolding - the schema, the queries, the report template - and mistake that for the measurement itself. The scaffolding shipped. The measurement didn't.