Did additional tests with qwen-3.7-max, minimax-m3, deepseek-v4-pro.
The result table is now:
| Model | Attempts | Cost | Complete | Correct | Clarity | Comment |
|---|---|---|---|---|---|---|
| Opus | 1 | 26932 | + | + | +/- | Even though I am used to it now, Opus has a peculiar jargon. It found all but one item |
| GLM | 1 | 684 | ++ | + | +/- | Found an item that no other model found, also not Opus; very verbose |
| Qwen (plus) | 1 | 419 | - | +/- | + | Missed 4 items, misclassified one, output readable / medium verbose |
| Kimi | 2 | 2970 | - | - | + | Missed 6 items, mentioned one but without recommendation, output very clear (doesn't understand that github is spookhub and tried endlessly to connect to github) |
| Qwen (max) | 2 | 3576 | ++ | + | + | Initially defaulted to GitHub and analyzed a GH issue instead. Found another item both Opus and GLM did not find after improved pointer to which issue to solve |
| Minimax | 1 | 747 | - | + | + | Missed 13 (!!!) items. |
| Deepseek | 1 | 358 | + | +/- | +/- | Proposes solutions for things that don't need fixing and keeps yapping about them. CHEAP! |
Conclusion after the additional tests:
qwen-3.7-max assumes too much. For future iterations I will (generically) tune the prompts a little bit.
minimax-m3 missed too much for it to be useful, especially because it's more expensive than GLM/Deepseek.
deepseek-v4-pro does produce false positives, but it has a low false-negative rate and is cheap. I'll keep this.
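To put the cost argument in perspective, here is a rough sketch that ranks the models by the Cost column and expresses each as a multiple of the cheapest one. It assumes the figures are directly comparable per-test totals (the unit is not stated above); `COSTS`, `cost_ratio` and `cheapest_first` are names made up for this illustration.

```python
# Cost figures copied from the results table; the unit is not stated in the post.
COSTS = {
    "Opus": 26932,
    "GLM": 684,
    "Qwen (plus)": 419,
    "Kimi": 2970,
    "Qwen (max)": 3576,
    "Minimax": 747,
    "Deepseek": 358,
}


def cost_ratio(model: str, baseline: str) -> float:
    """Return how many times more expensive `model` is than `baseline`."""
    return COSTS[model] / COSTS[baseline]


def cheapest_first() -> list:
    """Return the model names sorted from lowest to highest cost."""
    return sorted(COSTS, key=COSTS.get)


if __name__ == "__main__":
    for name in cheapest_first():
        ratio = cost_ratio(name, "Deepseek")
        print(f"{name:12s} {COSTS[name]:>6d}  {ratio:5.1f}x Deepseek")
```

By these numbers Minimax costs roughly twice as much as Deepseek and slightly more than GLM, which is the comparison behind dropping it.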
Next round will include:
claude-4.8-opus-xhigh
glm-5.2
qwen-3.7-max
deepseek-v4-pro