Building a Local Model Proxy: The Harness, the Bugs, and the Tweaks

Heads up — this is the technical companion to The Token Squeeze and the Three-Tier Answer. It gets into FastAPI code, bug fixes, and per-model tweaks. If you want the strategic version, start there. If you want to see what it actually takes to make open and local AI models work with modern coding tools, you're in the right place.

Why I built this

The coding tools I use — Cursor, Crush, OpenCode — each want their own API keys and config. Switching between Anthropic, OpenAI, and a local model meant changing three things in three different places.

The open and local models I tried also didn't work out of the box with those tools. Some refused tool calls. Some returned them as plain text. A couple crashed when I gave them a tool schema.

Two articles got me looking at this:

Harness Design for Long-Running Apps (Anthropic)
Continually Improving Agent Harness (Cursor)

Both argue that the harness — the code around the model that handles tool calls, multi-turn state, and error recovery — matters more than the model itself. So picking a cheaper model isn't just a price question. You also need a harness that actually works with that model.

Cursor and the others all have a "bring your own URL" feature — they let you point at any OpenAI-compatible endpoint. I used that hook to slot in my own piece of software: a middleperson that fixes harness issues and evaluates models.

The same hook lets you plug in commercial services. OpenRouter, HuggingFace inference endpoints, Together AI, Groq, Fireworks AI, and Replicate all expose OpenAI-compatible APIs over big catalogs of open and hosted models. I leveraged the same pattern to build my own — a tool fixer and evaluator that sits at a local URL. Point Cursor, Claude Code CLI, or any other agent or coding tool at it and you get per-tool routing, model swapping, and a common evaluation harness in one place.

This is what I built, what broke, and what I had to tweak to make it work.

The architecture

The middleperson is a small local FastAPI gateway at http://127.0.0.1:8081. OpenAI-compatible API on the front. Three backends on the back. The tool fixer lives inside the proxy and rewrites requests and responses on the fly. The evaluator hits the same proxy from the side, so it tests exactly what the real coding tools would see.

                       ┌────────────────────┐
                       │     Evaluator      │
                       │   (evaluator.py)   │
                       └─────────┬──────────┘
                                 │ same tests, same harness
                                 ▼
┌──────────────┐      ┌────────────────────┐      ┌─────────────────┐
│   Cursor     │      │      Proxy         │      │  Anthropic API  │
│   Crush      │      │   + Tool Fixer     │      │  OpenAI API     │
│   OpenCode   │ ───▶ │                    │ ───▶ │  Ollama (local) │
│ Claude Code  │      │   127.0.0.1:8081   │      │  Ollama Cloud   │
└──────────────┘      └────────────────────┘      └─────────────────┘

Every coding tool talks to the same proxy. One config file. No API key shuffling.

Design decisions

Normalize every provider's response to OpenAI format so clients need no special handling.
Bind to 127.0.0.1 only.
Per-model strip_tools flag for models that crash when handed a tool schema.
One config file for all routing rules so adding or swapping a model doesn't touch any client.

Connecting the tools

The open-source coding tools were easy. The commercial side has more friction, and I ended up focusing on the open-source side for that reason.

Open tools

Crush — open-source terminal-based coding assistant from Charmbracelet. Edit crush.json, add a local-proxy provider pointed at http://localhost:8081/v1, reference models by their real names. Drop in, done.

OpenCode — open-source coding agent. Same story: drop an OpenAI-compatible provider into opencode.jsonc (it uses @ai-sdk/openai-compatible under the hood) and it works.

What broke and what I fixed

Plenty of things broke. The "OpenAI-compatible" promise turned out to be a lot thinner than the docs suggested.

The biggest piece of work was translating between Anthropic's response format and OpenAI's. Claude's API returns tool calls inside tool_use blocks. OpenAI clients expect a tool_calls array. I had to extract the one and re-emit it as the other, including in streaming responses (which has its own format).

Ollama had its own surprises. Some local models crashed if you gave them a list of tools — they hadn't been trained for it. So I added a strip_tools flag per model. Other models (Qwen 2.5 Coder 32B in particular) understood tool use well enough to talk about it, but emitted the tool call as plain-text JSON instead of a structured call. I wrote a small rescue parser that detects JSON-in-text and promotes it into a real tool_calls block before sending it back to the client.

Then a long tail of smaller things: wrong port, environment variables not loading from the right directory, multi-turn tool messages losing their tool_call_id, model IDs mismatched between Crush's config and the proxy's public names. Each one was easy to fix once I knew what to look for.

The full list of bugs and fixes is in the appendix at the end of this post.

The evaluator

A proxy that routes requests is useful, but it doesn't tell you which models actually work. So I built an evaluator — a separate script that throws the same tests at every model the proxy can reach, and grades them all the same way.

Each test asks the model to do something a coding tool would actually do. The tests use the real tool definitions Crush and OpenCode send out (write_file, read_file, edit_file), so I'm not testing a sanitized lab version — I'm testing exactly what a real coding session would look like.

Four tests, in order:

T1: Tool call structure   → Does the model return a tool_calls block?
T2: Code correctness      → Does ast.parse() succeed on the generated code?
T3: Multi-turn            → Can it continue after a tool result comes back?
T4: Read + edit           → Can it fix a bug in provided code using a tool?

The tests build on each other. T1 is the basic gatekeeper: can the model even ask for a tool in the right format? If it can't, the later tests are pointless. T2 then checks whether the code the model writes is actually valid Python. T3 checks whether the model can hold a conversation when a tool result is sent back. T4 is the hardest — read a real file with a bug in it, figure out the fix, and apply the fix through a tool call.

If T1 fails, the script doesn't run T2 through T4. A model that can't start the tool-call loop doesn't get partial credit for tests it never reached.

Output is a comparison table and a Markdown report.

Results

Model	Proxy name	T1	T2	T3	T4	Latency	Tools via
Claude Opus 4.7	`claude-opus-4-7`	✅	✅	✅	✅	15s	Anthropic API
Claude Sonnet 4.6	`claude-sonnet-4-6`	✅	✅	✅	✅	12s	Anthropic API
Claude Haiku 4.5	`claude-haiku-4-5`	✅	✅	✅	✅	9s	Anthropic API
DeepSeek V4 Pro	`deepseek-v4-pro`	✅	✅	✅	✅	16s	Ollama Cloud
Qwen 2.5 Coder 32B	`qwen2.5-coder-32b`	✅	✅	✅	✅	60–70s	Local
Qwen 2.5 Coder 7B	`qwen2.5-coder-7b`	✅	✅	✅	⚠️	16s	Local
Llama 3.2	`llama3.2`	❌	—	—	❌	14s	Local, chat only
Qwen 2.5 7B	`qwen2.5-7b`	❌	—	—	❌	22s	Local, chat only

Verified end-to-end in the open-source tools: Qwen 2.5 Coder 32B and DeepSeek V4 Pro both read and edited real files through Crush and OpenCode using real tool calls.

A few things stand out in these results. Only three of the open and local models I tested made it through the full test suite — out of dozens of open models out there. The smallest ones couldn't even handle the basic tool-call test — T1 failed, so the rest never ran. The largest local model that worked (Qwen 32B) was very slow against the same harness — about a minute per call, against 10–15 seconds for the frontier models. If you've been running open models thinking you're getting the same results as Claude or GPT, the answer is probably "not really." You're either slower than frontier, or your model is silently failing at tool calls and you may not realize it because nothing in the UI surfaces the failure.

The two real problems both showed up in this experiment. First, getting open models to do tool calls at all — most need per-model fixes, and the smallest ones can't be fixed.

The bigger-picture read of this table — the three-tier framing, the cost magnitudes, and what it means for organizations — is in The Token Squeeze and the Three-Tier Answer.

Final thoughts

A few takeaways from doing this.

Getting open and local models to work the same way as the frontier ones is a lot more work than the marketing suggests. Almost every model I tried needed some kind of fix. Some had bad formats for tool calls. Some crashed when given a tool schema. Some sent tool calls as plain text. Each one had its own quirks I had to work around. The bug list at the bottom isn't fun to read, but it's the actual surface area of the problem.

The "harness matters more than the model" argument is true, but only up to a point. The harness only matters more than the model if the model is good enough to use the harness in the first place. If a model can't ask for a tool in the right format, none of the rest matters. The smallest open models I tested fell into that bucket. They can chat, but they can't actually do anything with tools.

For what this all means for organizations buying or building AI tooling, see the companion piece.

Appendix: full bug list

Bug	Fix
Port 8080 already in use	Changed to 8081 in `config.yaml`
`tools` and `tool_choice` silently dropped	Added both fields explicitly to `ChatCompletionRequest`
Anthropic provider never sent tools to the API	Rewrote `_build_params` to convert OpenAI tool schema → Anthropic format
Anthropic provider only returned text, never `tool_calls`	Rewrote `chat()` to extract `tool_use` blocks and return structured `tool_calls`
Anthropic streaming didn't handle tool use	Added `content_block_start/delta` handling for `tool_use` in `chat_stream()`
Multi-turn tool messages lost `tool_call_id` and `tool_calls`	Changed to `msg.model_dump(exclude_unset=True)` in Ollama provider
`content: null` stripped on tool-call turns	Changed from `exclude_none=True` to `exclude_unset=True`
Ollama models crash when receiving tool definitions	Added per-model `strip_tools` flag in `config.yaml` and `ProviderRoute`
`load_dotenv()` failed when proxy started from wrong directory	Fixed to `load_dotenv(Path(__file__).parent / ".env", override=True)`
Crush model IDs didn't match proxy public names	Updated `crush.json` to use the proxy's real model names
`codellama` not installed in Ollama	Switched the default open model to `qwen2.5-coder:7b`
`deepseek-v4-pro:cloud` returning 401	Set `DEEPSEEK_API_KEY` Windows env var, restarted Ollama
`qwen2.5-coder:32b` outputs tool calls as plain text JSON	Added `_rescue_tool_calls()` parser in Ollama provider to promote JSON-in-text to structured `tool_calls`