Comparing alternative models with Claude in production - Part 1

As Anthropic walks its inverse hero arc, we're seeing announcements of enshittification (credit inclusion changes that got postponed in the face of new priorities) and enspookification (#1512786) of the kind we previously only saw from OpenAI. So I decided to see if, and how closely, some of the competing models match Claude in terms of real results, not benchmarks.

I quickly integrated pi-agent into my framework for this, after @rolznz prodded me the other day. It works well - better than opencode - so I used it as a replacement for Claude Code in this test.

Task at hand

I've been doing some work on form re-entry in SN's code, which needs a lot of code analysis, because not all "forms" need their re-entry guarded (sometimes re-entry is a feature, not a bug). So as a first test, I decided to plug agents into my (bespoke) analysis framework. The intended workflow is this:

1. The bot gets a skill to interact with private repositories on a private forge.
2. It gets a prompt that says: "execute the task from issue n in repo org/repo and reply with a comment".
3. It locates and reads the issue, which tells the bot to:
   - Read up on the solution that can be found in file xyz.js (and other context, such as PRs and issues that are already done)
   - Find form components that need to be guarded but are not
   - For each component found, analyze it end-to-end and describe what it does and how it benefits from guarding
   - Decompose the work into subtasks (spawn a sub-agent for each component found)
4. It posts a comment to the issue with the results.

I deliberately gave no output format instructions, to see what each model's default behavior is.
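The workflow above can be sketched roughly as follows. This is a minimal illustration, not the actual framework: `SubTask`, `build_prompt`, and `decompose` are hypothetical names, and the component names in the usage part are placeholders.

```python
from dataclasses import dataclass


@dataclass
class SubTask:
    """One sub-agent assignment: analyze a single form component end-to-end."""
    component: str
    instructions: str


def build_prompt(issue_number: int, repo: str) -> str:
    # Mirrors the one-line prompt quoted above; deliberately no output format.
    return (
        f"execute the task from issue {issue_number} in repo {repo} "
        "and reply with a comment"
    )


def decompose(components: list[str]) -> list[SubTask]:
    # One sub-task per unguarded component, as the issue instructs.
    return [
        SubTask(
            component=name,
            instructions=(
                f"Analyze {name} end-to-end, describe what it does "
                "and how it benefits from guarding re-entry."
            ),
        )
        for name in components
    ]


if __name__ == "__main__":
    print(build_prompt(42, "org/repo"))  # issue number is a placeholder
    for task in decompose(["ExampleFormA", "ExampleFormB"]):  # hypothetical names
        print(task.component, "->", task.instructions)
```

Keeping the top-level prompt this short puts the burden on the model to find and follow the issue itself, which is the part being tested.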

Models involved

I selected the following (hosted) models:

claude-4.8-opus-xhigh - this is the baseline, as I normally pass these types of analyses to Claude anyway
glm-5.2 - I've used GLM as my backup LLM for a while now
qwen-3.7-plus - I toyed with using -max but I figured this should do
kimi-k2.7-coder - I simply wanted to know if Kimi finally got better at code, because it's been hit & miss in the past

Results

Model	Attempts	Cost	Complete	Correct	Clarity	Comment
Opus	`1`	26932	`+`	`+`	`+/-`	Even though I am used to it now, Opus has a peculiar jargon. It found all but one item
GLM	`1`	684	`++`	`+`	`+/-`	Found an item that no other model found, not even Opus; very verbose
Qwen	`1`	419	`-`	`+/-`	`+`	Missed 4 items, misclassified one, output readable / medium verbose
Kimi	`2`	2970	`-`	`-`	`+`	Missed 6 items, mentioned one but without a recommendation, output very clear (doesn't understand that GitHub is spookhub and tried endlessly to connect to GitHub)
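To put the Cost column in perspective, here is a small sketch that expresses each model's cost relative to the Opus baseline. The figures are copied from the table; the ratio is unit-free, so it holds whatever unit the costs are in. It assumes the reported cost covers the whole run, including Kimi's second attempt.

```python
# Cost figures copied from the table above (unit as reported there).
costs = {"Opus": 26932, "GLM": 684, "Qwen": 419, "Kimi": 2970}


def relative_to_baseline(costs: dict[str, float], baseline: str = "Opus") -> dict[str, float]:
    """Return how many times cheaper each model is than the baseline."""
    base = costs[baseline]
    return {model: base / cost for model, cost in costs.items() if model != baseline}


ratios = relative_to_baseline(costs)
for model, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {ratio:.1f}x cheaper than Opus")
```

This works out to roughly 64x for Qwen, 39x for GLM, and 9x for Kimi.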

Conclusions

GLM 5.2 comes really close to Claude's output.
For the next comparison I'm going to switch to qwen-3.7-max instead of -plus.
I'm going to ditch Kimi: failing on an agentic dispatch means it's not useful, and it's much more expensive than the other non-Claude models because it uses wayyy too many tokens.

Could stackers please suggest a model to replace Kimi? No GPT, preferably recently released, and available through ppq.ai

0 replies \ @optimism OP 1h

Did additional tests with qwen-3.7-max, minimax-m3, deepseek-v4-pro.

The result table is now:

Model	Attempts	Cost	Complete	Correct	Clarity	Comment
Opus	`1`	26932	`+`	`+`	`+/-`	Even though I am used to it now, Opus has a peculiar jargon. It found all but one item
GLM	`1`	684	`++`	`+`	`+/-`	Found an item that no other model found, not even Opus; very verbose
Qwen (plus)	`1`	419	`-`	`+/-`	`+`	Missed 4 items, misclassified one, output readable / medium verbose
Kimi	`2`	2970	`-`	`-`	`+`	Missed 6 items, mentioned one but without a recommendation, output very clear (doesn't understand that GitHub is spookhub and tried endlessly to connect to GitHub)
Qwen (max)	`2`	3576	`++`	`+`	`+`	Initially defaulted to GitHub and analyzed a GH issue instead. After I improved the pointer to which issue to solve, it found another item that neither Opus nor GLM found
Minimax	`1`	747	`-`	`+`	`+`	Missed 13 (!!!) items.
Deepseek	`1`	358	`+`	`+/-`	`+/-`	Proposes solutions for things that don't need fixing and keeps yapping about them. CHEAP!

Conclusion after the additional tests:

qwen-3.7-max assumes too much. For future iterations I will (generically) tune the prompts a little bit.
minimax-m3 missed too many items to be useful, especially since it's more expensive than GLM and Deepseek.
deepseek-v4-pro does have false positives, but it has a low false-negative rate and it's cheap. I'll keep this.
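One way to put numbers on "false positives" and "missed items" is to compare each model's reported components against a hand-verified reference set. The sketch below assumes such a reference set exists; `error_rates` is a hypothetical helper and the component names are placeholders, not the actual findings.

```python
def error_rates(expected: set[str], reported: set[str]) -> dict[str, float]:
    """Compare a model's reported components against a reference set."""
    true_pos = expected & reported
    false_pos = reported - expected  # flagged, but does not need guarding
    false_neg = expected - reported  # needs guarding, but was missed
    return {
        "recall": len(true_pos) / len(expected) if expected else 0.0,
        "false_negative_rate": len(false_neg) / len(expected) if expected else 0.0,
        "false_discovery_rate": len(false_pos) / len(reported) if reported else 0.0,
    }


# Placeholder component names, not the actual findings.
reference = {"FormA", "FormB", "FormC", "FormD"}
reported = {"FormA", "FormB", "FormC", "FormX"}

rates = error_rates(reference, reported)
print(rates)
```

For this kind of audit a missed item is usually worse than a spurious one, since a false positive only costs review time.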

Next round will include:

claude-4.8-opus-xhigh
glm-5.2
qwen-3.7-max
deepseek-v4-pro
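As it happens, this shortlist matches what you get by filtering the result table on the Complete column alone. The sketch below shows that filter; the numeric mapping of the marks is an assumption made for illustration, and the filter is an observation about the table rather than the selection rule actually used.

```python
# Qualitative marks from the result table, mapped to an ordinal scale.
# The numeric mapping is an assumption made for this sketch.
MARKS = {"-": 0, "+/-": 1, "+": 2, "++": 3}

# (model, cost, complete, correct, clarity) rows copied from the table above.
rows = [
    ("claude-4.8-opus-xhigh", 26932, "+", "+", "+/-"),
    ("glm-5.2", 684, "++", "+", "+/-"),
    ("qwen-3.7-plus", 419, "-", "+/-", "+"),
    ("kimi-k2.7-coder", 2970, "-", "-", "+"),
    ("qwen-3.7-max", 3576, "++", "+", "+"),
    ("minimax-m3", 747, "-", "+", "+"),
    ("deepseek-v4-pro", 358, "+", "+/-", "+/-"),
]


def shortlist(rows, min_complete="+"):
    """Keep models whose Complete mark is at least min_complete."""
    threshold = MARKS[min_complete]
    return [
        model
        for model, _cost, complete, _correct, _clarity in rows
        if MARKS[complete] >= threshold
    ]


print(shortlist(rows))
```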

95 sats \ 0 replies \ @evestacker 14h -100 sats

Since you’re running agentic workflows through ppq.ai, you should throw DeepSeek-V4-Pro or Minimax-M3 into your next comparison run. Both are recent releases that punch way above their weight on complex repository context mapping without bleeding tokens like Kimi did. DeepSeek in particular has massive cost efficiency right now for raw code-analysis loops.