I ran additional tests with qwen-3.7-max, minimax-m3, and deepseek-v4-pro.

The result table is now:

| Model | Attempts | Cost | Complete | Correct | Clarity | Comment |
|---|---|---|---|---|---|---|
| Opus | 1 | 26932 | + | + | +/- | Even though I am used to it now, Opus has a peculiar jargon. It found all but one item. |
| GLM | 1 | 684 | ++ | + | +/- | Found an item that no other model found, not even Opus; very verbose |
| Qwen (plus) | 1 | 419 | - | +/- | + | Missed 4 items, misclassified one; output readable, medium verbose |
| Kimi | 2 | 2970 | - | - | + | Missed 6 items, mentioned one but without a recommendation; output very clear (doesn't understand that github is spookhub and tried endlessly to connect to github) |
| Qwen (max) | 2 | 3576 | ++ | + | + | Initially defaulted to GitHub and analyzed a GH issue instead. After an improved pointer to which issue to solve, found another item that neither Opus nor GLM found |
| Minimax | 1 | 747 | - | + | + | Missed 13 (!!!) items. |
| Deepseek | 1 | 358 | + | +/- | +/- | Proposes solutions for things that don't need fixing and keeps yapping about them. CHEAP! |

Conclusion after the additional tests:

  1. qwen-3.7-max assumes too much. For future iterations I will (generically) tune the prompts a little bit.
  2. minimax-m3 missed too much to be useful, especially since it costs more than GLM and Deepseek.
  3. deepseek-v4-pro does produce false positives, but it has a low false-negative rate and is cheap. I'll keep this.
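
To make the false positive / false negative distinction concrete, here is a minimal sketch of how a run could be scored, assuming each model's output is reduced to a set of item IDs and compared against a hand-made reference list. The item IDs and the `score` helper are hypothetical and not taken from the actual tests.

```python
def score(expected: set[str], reported: set[str]) -> dict[str, float]:
    """Compare the items a model reported against a reference list."""
    missed = expected - reported      # false negatives: real items not found
    spurious = reported - expected    # false positives: "fixes" nobody needed
    return {
        "missed": len(missed),
        "spurious": len(spurious),
        # Share of the reference items the model failed to find.
        "miss_rate": len(missed) / len(expected) if expected else 0.0,
    }


# Hypothetical data: four real items, one missed, one invented by the model.
expected = {"item-1", "item-2", "item-3", "item-4"}
reported = {"item-1", "item-2", "item-3", "item-9"}
result = score(expected, reported)
```

A model like the one described in point 3 would show a nonzero `spurious` count with a low `miss_rate`, while one like point 2 would show the reverse.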

Next round will include:

  1. claude-4.8-opus-xhigh
  2. glm-5.2
  3. qwen-3.7-max
  4. deepseek-v4-pro