The Coding-Tool Bake-Off: Methodology Notes

When I started building graph-live, I had a question I hadn't seen a clean answer to: when two coding tools wrap the same underlying model, does the choice of tool actually matter? Or am I really just paying for the model behind it?

The only way to find out was to put them on the same tasks and watch.

This post is the methodology and the headline findings. A companion post — earl-mcgowen.com/blog/graph-live-build — walks through the real production code the winning tool wrote.

What I tested

Four coding tools (harnesses) and the models behind them:

Qwen Code running on Qwen3.7-Max via OpenRouter
Claude Code running on Claude Opus 4.7 via OpenRouter
CodeWhale running on DeepSeek V4 Pro
Crush (Charmbracelet's TUI agent) — model-agnostic, configured for the bake-off against DeepSeek V4 Pro to isolate the harness from the model

A "harness" is the tool wrapping the model — the agent that takes your prompt, sends it to the model, parses the model's output, executes code, and decides what to do next. The model writes. The harness orchestrates.
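A minimal sketch of that orchestration loop, assuming a harness can be reduced to four steps: send the prompt, extract the code, check it, and retry on failure. This is not how any of the four tools is implemented. The names `run_harness`, `model`, `extract`, and `check` are hypothetical, and the model is stubbed with canned replies.

```python
from typing import Callable, Tuple


def run_harness(
    task: str,
    model: Callable[[str], str],    # sends a prompt, returns the raw model text
    extract: Callable[[str], str],  # pulls the code out of the raw text
    check: Callable[[str], bool],   # stands in for executing/validating the code
    max_turns: int = 3,
) -> Tuple[str, int]:
    """Return (final_code, turns_used)."""
    prompt = task
    code = ""
    for turn in range(1, max_turns + 1):
        raw = model(prompt)          # the model writes
        code = extract(raw)          # the harness parses
        if check(code):              # the harness executes and decides
            return code, turn
        # Feed the failed attempt back and ask for a fix.
        prompt = f"{task}\n\nThe previous attempt failed:\n{code}\nFix it."
    return code, max_turns


# Stub model (hypothetical): wrong on the first call, right on the second.
replies = iter(["SELECT 1 +", "SELECT 1 + 1"])
code, turns = run_harness(
    task="Write SQL that returns 2.",
    model=lambda prompt: next(replies),
    extract=str.strip,
    check=lambda sql: sql.endswith("1 + 1"),
)
print(code, turns)
```

Every argument other than `model` is a harness decision, which is why two tools on the same model can still diverge.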

The deliberate pairing of CodeWhale and Crush on the same model was the cleanest way to ask the methodology question: if the outputs diverge, the difference is the harness, not the model behind it.

The two rounds

Two tasks, both representative of the real graph-live build:

Round 1 — Backend / SQL. Write a non-trivial SQL query: a multi-table join with averaging, unit conversion, grouping, and ranking. The kind of query a data analyst writes a dozen times a week.
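To give a sense of the shape of such a query, here is a small sketch run through Python's built-in `sqlite3`. It is not the actual bake-off prompt or schema: the `regions`, `stations`, and `readings` tables, their columns, and the sample rows are all invented for illustration. The ranking uses a window function, which assumes SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema and data, invented for this example.
conn.executescript("""
CREATE TABLE regions  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE stations (id INTEGER PRIMARY KEY, region_id INTEGER, name TEXT);
CREATE TABLE readings (station_id INTEGER, temp_f REAL);
INSERT INTO regions  VALUES (1, 'north'), (2, 'south');
INSERT INTO stations VALUES (1, 1, 'a'), (2, 1, 'b'), (3, 2, 'c');
INSERT INTO readings VALUES (1, 32.0), (2, 50.0), (3, 68.0), (3, 86.0);
""")

QUERY = """
SELECT
    r.name AS region,
    AVG((rd.temp_f - 32.0) * 5.0 / 9.0) AS avg_temp_c,          -- unit conversion + averaging
    RANK() OVER (ORDER BY AVG(rd.temp_f) DESC) AS warmth_rank   -- ranking
FROM readings AS rd
JOIN stations AS s ON s.id = rd.station_id                      -- multi-table join
JOIN regions  AS r ON r.id = s.region_id
GROUP BY r.name                                                 -- grouping
ORDER BY warmth_rank
"""

rows = conn.execute(QUERY).fetchall()
for region, avg_temp_c, warmth_rank in rows:
    print(warmth_rank, region, round(avg_temp_c, 1))
```

A query like this spans many lines, so any step that drops or truncates lines between the model and the database produces something that no longer runs.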

Round 2 — Frontend / UI. Build a clean component for the graph-live page with reasonable styling, layout, and TypeScript types.

Each tool got the same prompt in each round. Each output was reviewed by hand: correctness first, then code quality, then design taste.

Findings

Qwen Code passed cleanly on SQL

Qwen Code on Qwen3.7-Max produced correct multi-line SQL, and the harness's minimal but robust output handling passed it through intact. The result ran on the first try.

CodeWhale failed SQL — but the model didn't

This was the round where the harness mattering more than the model showed up most clearly. CodeWhale's elaborate output extractor mis-parsed the model's response and shipped a truncated query. The underlying DeepSeek V4 Pro answer was correct; the harness's output handling broke it.

Output extraction is how the harness pulls the executable code out of the model's full response, which often includes explanation, reasoning, or formatting around the actual code. More elaborate isn't always better. Sometimes it's just more surface area for bugs.
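For a concrete picture, here is a deliberately small extractor, assuming the model wraps its code in a markdown-style fence. It is a sketch, not the extractor any of the tested tools uses. The names `extract_code`, `FENCE`, and `TICKS` are hypothetical, and the sample response is invented.

```python
import re

# Three backticks, built this way so the example can itself sit inside a fence.
TICKS = "`" * 3

# Opening fence with an optional language tag, the body, then the closing fence.
FENCE = re.compile(TICKS + r"[^\n]*\n(.*?)" + TICKS, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the longest fenced block, or the stripped response if no fence exists."""
    blocks = FENCE.findall(response)
    if not blocks:
        return response.strip()
    return max(blocks, key=len).strip()


# Invented model response: prose, a fenced multi-line query, more prose.
response = (
    "Here is the query:\n\n"
    + TICKS + "sql\n"
    + "SELECT region, AVG(temp_c)\nFROM readings\nGROUP BY region;\n"
    + TICKS + "\n\nIt groups readings by region."
)
sql = extract_code(response)
print(sql)
```

Picking the longest block and falling back to the raw text are both judgment calls. Each extra rule of this kind (language filters, line-by-line heuristics, multiple passes) is another place where a multi-line answer can get cut short.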

Frontend was tighter — but split on what "best" meant

All four tools produced correct frontend code. The differentiation moved up the stack:

Qwen Code had the best design taste: it was the only one of the four that added custom styling, and it chose the most readable layout and used TypeScript types appropriately.
CodeWhale (DeepSeek V4 Pro) had the cleanest code structure — best file organization and clearest separation of concerns.

That's the second finding: the right tool depends on the job. On SQL, correctness was the differentiator. On frontend, all four cleared the correctness bar, and the differentiator became design taste or code structure depending on what you cared about most.

Same model, two harnesses, different output

This is the cleanest version of the methodology finding. CodeWhale and Crush both ran against DeepSeek V4 Pro on the same prompts. Their outputs differed. Not because of the model — because of how each harness formatted the prompt, parsed the response, and decided what to do with it.
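One simple way to make that kind of divergence visible is a line diff of what two harnesses returned for the same prompt. The sketch below uses Python's standard `difflib`. The two outputs are invented placeholders, not the real CodeWhale or Crush results, and `diff_outputs` is a hypothetical helper.

```python
import difflib
from typing import List


def diff_outputs(name_a: str, out_a: str, name_b: str, out_b: str) -> List[str]:
    """Unified diff of two harness outputs, one list entry per diff line."""
    return list(
        difflib.unified_diff(
            out_a.splitlines(),
            out_b.splitlines(),
            fromfile=name_a,
            tofile=name_b,
            lineterm="",
        )
    )


# Placeholder outputs for illustration only.
harness_a_sql = "SELECT region, AVG(temp_c)\nFROM readings\nGROUP BY region;"
harness_b_sql = "SELECT region, AVG(temp_c)\nFROM readings"  # last line lost

diff = diff_outputs("harness_a", harness_a_sql, "harness_b", harness_b_sql)
print("\n".join(diff))
```

Diffing the raw model responses alongside the extracted code helps show whether a difference came from the model's answer or from what the harness did with it.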

If you're shopping for a coding tool and assuming the model is the product, you're missing half of what you're actually buying.

What I'd do differently next time

A few things this bake-off didn't isolate that I'd want to:

Prompt formatting. Each harness has its own internal prompting style. Some of the differences I attributed to harness behavior are probably due in part to prompt-template differences. Worth pulling apart.
Tool-use depth. The frontend round didn't stress tool-calling much. A real agentic build with multi-step file edits and refactors would.
Cost per task. I didn't track tokens consumed per round. A proper buying-decision bake-off needs that column.
Larger N. Two rounds with one prompt each is a sample, not a study. Three or four prompts per round, each run more than once, would help separate real findings from run-to-run variance.
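The cost and sample-size points could be covered by recording one row per run and summarizing per tool and task. A minimal sketch, assuming each run yields a pass/fail verdict and a token count; the `Run` record, the `summarize` helper, and every number below are hypothetical placeholders, not measurements from this bake-off.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Run:
    tool: str
    task: str
    passed: bool
    tokens: int  # total tokens consumed by this run


def summarize(runs: List[Run]) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Pass rate and mean token use for each (tool, task) pair."""
    grouped: Dict[Tuple[str, str], List[Run]] = defaultdict(list)
    for run in runs:
        grouped[(run.tool, run.task)].append(run)
    summary = {}
    for key, group in grouped.items():
        summary[key] = {
            "runs": len(group),
            "pass_rate": sum(r.passed for r in group) / len(group),
            "mean_tokens": sum(r.tokens for r in group) / len(group),
        }
    return summary


# Placeholder records, invented for illustration only.
runs = [
    Run("tool_a", "sql", True, 1200),
    Run("tool_a", "sql", True, 1000),
    Run("tool_b", "sql", False, 900),
    Run("tool_b", "sql", True, 1100),
]
summary = summarize(runs)
for (tool, task), stats in summary.items():
    print(tool, task, stats)
```

With repeated runs, a single failure shows up as a pass rate rather than a verdict, and the token column gives the cost side of the comparison.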

What this means for buying decisions

If your team is rolling out frontier coding tools across engineering, the cheapest hour you can spend before signing the contract is a controlled bake-off on your stack. Pick three or four representative tasks. Run the candidates side by side. Read the outputs.

The cost discipline isn't about cheaping out on AI. It's about knowing what each dollar is buying — the model, the harness, or both — before committing to a per-seat-per-month bill that compounds across a team.

For graph-live, I picked Qwen Code on Qwen3.7-Max as the workhorse and never regretted it. The companion post walks through what it actually built.