The Token Squeeze and the Three-Tier Answer for Organizations

What started this

I've been playing with AI coding tools and open models for a while now. The open ones haven't worked well out of the box.

In plain English: "open" models, hosted vs local. Open models are AI models that are publicly available, like Meta's Llama or Alibaba's Qwen. There are two ways to use them. Open-hosted means someone else runs them on the cloud and you pay per use (Ollama Cloud, OpenRouter, and similar). Open-local means running them on your own hardware — your laptop or a server you own — and paying nothing per use, though you need the machine. The big AI providers (Anthropic, OpenAI, Google) don't release their models. These are the frontier models — the most capable and most expensive. You can only get to them through their paid APIs. They're closed for business reasons — intellectual property, competitive moat — not technical ones. There's no reason in principle these companies couldn't share their models; they just don't, because the models are the business.

The reason I started looking at this is the same reason a lot of people are: Claude and OpenAI are getting expensive and rate-limited. The same is happening on ChatGPT, Copilot, and most of the coding agents. If you use these tools at work, you've probably felt it already.

The obvious move is to switch to open or local models. But they don't just work as well.

Why "just use open models" doesn't work out of the box

Two things get in the way.

Open models have to be tweaked. They often seem to work at first — they answer questions, generate code — but my testing showed they're much less reliable and accurate than the frontier models at real coding work. Some return their tool calls as plain text instead of the structured format the coding tools expect. Some crash outright if you give them a list of tools. Each one needs its own small fix before you can use it in a real agent.

In plain English: "tool calls." AI models can talk but can't actually do anything — open files, run code, anything physical. They hand off real work to the software around them through structured requests. Without that loop, an AI coding assistant can only describe code, not actually change it.

So the real problem isn't cost. It's getting the model to work with the tools at all.

Two articles got me thinking about this:

Harness Design for Long-Running Apps (Anthropic)
Continually Improving Agent Harness (Cursor)

Both argue that the harness — the code around the model — matters more than the model itself.

In plain English: "harness." The software around an AI model. It handles tool calls, remembers context across multiple messages, and recovers from errors. Think of the model as a jet engine and the harness as the cockpit, instruments, and wiring that turn that engine into something usable.

My results mostly agree with that. The catch is that the harness only matters more than the model if the model can actually talk to the harness. Some can't.

What I built

I built a small proxy that sits between my coding tools (Cursor, Crush, OpenCode) and whatever AI model I want to use — Anthropic's Claude, OpenAI's GPT, an open model on Ollama Cloud, or one running on my own machine. The build details are in the technical writeup: Building a Local Model Proxy.

In plain English: "proxy." Coding tools like Cursor let you point them at a custom URL instead of using their built-in AI. A proxy is a small piece of software at that URL. It receives each request, picks which AI model to forward it to, and returns the result. The coding tool just sees what looks like a normal AI service.

Cursor and the other coding tools all have a "bring your own URL" feature — they let you point at any OpenAI-compatible endpoint. That's the hook I used to slot in my proxy. I tried setting up a few other commercial coding tools as well, but each one had its own quirks (custom auth, hidden endpoint settings), and the time-and-energy cost added up. I focused on Cursor on the commercial side and put the rest of the effort into the open-source tools.

The same hook lets you plug in commercial services if you don't want to build something. OpenRouter, HuggingFace inference endpoints, Together AI, Groq, Fireworks AI, and Replicate all expose OpenAI-compatible APIs over big catalogs of open and hosted models. Point Cursor at one of those and you cover a lot of the routing problem.

I built my own because I wanted to route per-tool, run some models on hardware I already own, and skip the per-token fee on the work that fits.

Then I built an evaluator. It runs the same four tests against every model I want to compare. The tests use the real tool definitions the coding tools actually send (read a file, write a file, edit a file). That way I'm comparing models fairly, not against different setups.

Results

Model	Proxy name	T1	T2	T3	T4	Latency	Tools via
Claude Opus 4.7	`claude-opus-4-7`	✅	✅	✅	✅	15s	Anthropic API
Claude Sonnet 4.6	`claude-sonnet-4-6`	✅	✅	✅	✅	12s	Anthropic API
Claude Haiku 4.5	`claude-haiku-4-5`	✅	✅	✅	✅	9s	Anthropic API
DeepSeek V4 Pro	`deepseek-v4-pro`	✅	✅	✅	✅	16s	Ollama Cloud
Qwen 2.5 Coder 32B	`qwen2.5-coder-32b`	✅	✅	✅	✅	60–70s	Local
Qwen 2.5 Coder 7B	`qwen2.5-coder-7b`	✅	✅	✅	⚠️	16s	Local
Llama 3.2	`llama3.2`	❌	—	—	❌	14s	Local, chat only
Qwen 2.5 7B	`qwen2.5-7b`	❌	—	—	❌	22s	Local, chat only

Three tiers, plus a floor

In plain English: the three tiers (plus a floor). Frontier models are the expensive, capable ones from Anthropic and OpenAI. Hosted-open means open models someone else runs on the cloud — cheaper per use, but you still pay per use. Local means open models you run yourself on your own hardware — cheapest per use, but you need the hardware. Below the floor is where the smallest open models live; they can't handle the tool-call loop that modern coding assistants need.

The results split into three groups, with a fourth group that fails outright:

Frontier         → Claude Opus, Sonnet, Haiku           pass everything, fast, expensive
Hosted-open      → DeepSeek V4 Pro (Ollama Cloud)        pass everything, fast, ~1/10 cost
Local capable    → Qwen 2.5 Coder 32B + 7B               pass, slow, ~1/100 cost
Below the floor  → Llama 3.2, Qwen 2.5 7B (chat only)    can't do tool calls at all

The big local model (Qwen 32B) passes every test, but takes about a minute per call. The smaller local one is faster but it stumbles on the trickier edit task. The plain chat-only models don't return tool calls in any usable form, so the tests can't even start.

Rough cost picture, thumb in the wind:

Frontier         ████████████████████████████  1x       baseline
Hosted-open      ███                            ~1/10x   ~90% cheaper
Local            ▎                              ~1/100x  ~99% cheaper (plus hardware)

These are my own projections, not real vendor numbers. The big point is the scale. Hosted-open is roughly 10x cheaper than frontier. Local is roughly 100x cheaper, plus you need hardware. For any organization paying frontier prices today, that's real money over a year.

What this means for organizations

Organizations that depend on frontier AI APIs are going to feel the squeeze. Token costs are going up. Rate limits are getting tighter. Things that ran fine last quarter cost more or hit caps this quarter. And AI usage in most companies will keep growing.

So you have two basic choices: accept the rising costs, or find alternatives. There isn't really a middle option — the frontier providers aren't going to suddenly get cheaper, and the usage caps aren't going to relax.

If you go the alternatives route, you don't have to pick one tier. You can match the tier to the job:

Coding assistant that a developer is actively waiting on → frontier or hosted-open. Slow models will frustrate people.
Overnight batch work, scheduled summaries, background agents → a local model is fine.
Anything with sensitive data that can't leave the building → local, even if it's slow.
Plain chatbot that doesn't need tools → cheap local models are fine.

A proxy layer is what makes this routing happen automatically across an organization. Without it, each developer picks for themselves and the savings never really show up.

Where the opportunities are

There's a real opportunity here for builders to help organizations with this.

A few things people could work on:

Build the proxy + evaluator setup inside an organization. Route work to the cheapest tier that can actually handle the job.
Sell a managed version for companies that don't want to build it themselves. The proxy is the easy part. The evaluator that figures out which tier each job needs is where the real work is.
Make agents smarter about which models they pick. The agent should know that Llama 3.2 can't handle its tools and route around it, not just fail.
Build coding tools and agents that aren't model-agnostic. Today's tools try to support every model. But model behavior varies a lot, and supporting "any model" is the source of most of the bugs I hit. A coding tool built specifically for Qwen 2.5 Coder, with the harness tuned for exactly how Qwen handles tool calls and code, could be cheaper, faster, and more reliable than a tool trying to cover every model under the sun.

A year ago this whole layer didn't really need to exist. It does now.

Final thoughts

Picking an AI model used to be about chat quality. Now it's about what the model can do, what the data policy allows, how long users will wait, and what you want to pay. It's more of an infrastructure question than a model-picking one.

If you want to see how the proxy actually works — the architecture, the bugs I hit along the way, the evaluator, and the per-model tweaks — that's in the technical writeup.