As Anthropic is walking their inverse hero arc - we're seeing announcements of enshittification (credit inclusion changes that got postponed in the face of new priorities) and enspookification (#1512786) that we were previously only used to seeing from OpenAI - I decided to see whether, and how closely, some of the competing models match it in terms of real results, not benchmarks.
I quickly integrated pi-agent into my framework for this, after @rolznz prodded me the other day. It works well - better than opencode - so I used it as the base replacement for Claude Code in this test.
Task at hand
I've been doing some work on form re-entry in SN's code, which needs a lot of code analysis, because not all "forms" need their re-entry guarded (sometimes this is a feature, not a bug). So as a first test, I decided to plug agents into my (bespoke) analysis framework. The intended workflow is this:
- The bot gets a skill to interact with private repositories on a private forge
- It gets a prompt that says: "execute the task from issue `n` in repo `org/repo` and reply with a comment"
- It will locate and read the issue, which tells the bot to:
  - Read up on the solution that can be found in file `xyz.js` (and other context such as PRs and issues that are already done)
  - Find form components that need to be guarded but are not
  - For each component found, analyze end-to-end, describe what it does and how it benefits from guarding it
  - Subtask decomposition instructions (to spawn a sub-agent for each found component)
- Post a comment to the issue with the results
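The workflow above can be sketched roughly as follows. This is a minimal illustration, not the actual framework: `SubTask`, `build_prompt`, `decompose`, and `render_comment` are hypothetical names, and the forge access and the sub-agent calls themselves are left out.

```python
from dataclasses import dataclass


@dataclass
class SubTask:
    """One unit of work handed to a sub-agent (hypothetical structure)."""
    component: str
    instructions: str


def build_prompt(issue: int, repo: str) -> str:
    # Mirrors the one-line prompt quoted in the workflow.
    return f"execute the task from issue {issue} in repo {repo} and reply with a comment"


def decompose(components: list[str]) -> list[SubTask]:
    # One sub-task per unguarded form component, as the issue instructs.
    return [
        SubTask(
            component=name,
            instructions=(
                f"Analyze {name} end-to-end, describe what it does "
                "and how it benefits from guarding it."
            ),
        )
        for name in components
    ]


def render_comment(findings: dict[str, str]) -> str:
    # Collect the sub-agent results into a single issue comment.
    return "\n".join(f"- {component}: {summary}" for component, summary in findings.items())
```

The output format in `render_comment` is an arbitrary choice for the sketch; in the actual test the format was deliberately left to each model.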
I specifically did not detail output format instructions, to see what the default behavior is.
Models involved
I selected the following (hosted) models:
- `claude-4.8-opus-xhigh` - this is the baseline, as I normally pass this type of analysis to Claude anyway
- `glm-5.2` - I've used GLM as my backup LLM for a while now
- `qwen-3.7-plus` - I toyed with using `-max` but I figured this should do
- `kimi-k2.7-coder` - I simply wanted to know if Kimi finally got better at code, because it's been hit & miss in the past
Results
| Model | Attempts | Cost | Complete | Correct | Clarity | Comment |
|---|---|---|---|---|---|---|
| Opus | 1 | 26932 | + | + | +/- | Even though I am used to it now, Opus has a peculiar jargon. It found all but one item |
| GLM | 1 | 684 | ++ | + | +/- | Found an item that no other model found, not even Opus; very verbose |
| Qwen | 1 | 419 | - | +/- | + | Missed 4 items, misclassified one, output readable / medium verbose |
| Kimi | 2 | 2970 | - | - | + | Missed 6 items, mentioned one but without recommendation, output very clear (doesn't understand that github is spookhub and tried endlessly to connect to github) |
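To put the Cost column in perspective, here is a small sketch that computes how many times cheaper each model was than the Opus baseline. The numbers are copied from the table; the ratio is unit-free, so it holds whatever unit the costs are in. `cost_ratio` is just a helper name for this example, and Kimi's figure is taken as listed in its row, which shows two attempts.

```python
# Cost column from the results table (unit as reported there).
COSTS = {"Opus": 26932, "GLM": 684, "Qwen": 419, "Kimi": 2970}


def cost_ratio(baseline: str, costs: dict[str, int]) -> dict[str, float]:
    """Return how many times cheaper each model is than the baseline."""
    base = costs[baseline]
    return {
        model: round(base / cost, 1)
        for model, cost in costs.items()
        if model != baseline
    }


ratios = cost_ratio("Opus", COSTS)
print(ratios)  # {'GLM': 39.4, 'Qwen': 64.3, 'Kimi': 9.1}
```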
Conclusions
- GLM 5.2 really comes close to Claude output.
- I'm going to switch the next comparison to `qwen-3.7-max` instead of `-plus`.
- I'm going to ditch Kimi, because failing on an agentic dispatch means you're not useful, and it's much more expensive than the other backup candidates because it uses wayyy too many tokens.
Could stackers please suggest a model to replace Kimi? No GPT, preferably recently released, and available through ppq.ai
Did additional tests with `qwen-3.7-max`, `minimax-m3`, `deepseek-v4-pro`. The result table is now:

1+++/-1++++/-1-+/-+2--+2++++1-++1++/-+/-

Conclusion after the additional tests:
- `qwen-3.7-max` assumes too much. For future iterations I will (generically) tune the prompts a little bit.
- `minimax-m3` missed too much for it to be useful, especially because it's more expensive than GLM/Deepseek.
- `deepseek-v4-pro` does have false positives, but it has a low false-negative rate and is cheap. I'll keep this.

Next round will include:
- `claude-4.8-opus-xhigh`
- `glm-5.2`
- `qwen-3.7-max`
- `deepseek-v4-pro`

Since you’re running agentic workflows through ppq.ai, you should throw DeepSeek-V4-Pro or Minimax-M3 into your next comparison run. Both are recent releases that punch way above their weight on complex repository context mapping without bleeding tokens like Kimi did. DeepSeek in particular has massive cost efficiency right now for raw code-analysis loops.