Comparing alternative models with Claude in production - Part 1 - Analysis

optimism

I ran additional tests with `qwen-3.7-max`, `minimax-m3`, and `deepseek-v4-pro`.

The result table is now:

| Model       | Attempts |  Cost | Complete | Correct | Clarity | Comment                                                                                                                                                           |
| ----------- | -------: | ----: | :------: | :-----: | :-----: | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Opus        |      `1` | 26932 |    `+`   |   `+`   |  `+/-`  | Even though I am used to it now, Opus has a peculiar jargon. It found all but one item                                                                            |
| GLM         |      `1` |   684 |   `++`   |   `+`   |  `+/-`  | Found an item that no other model found, not even Opus; very verbose                                                                                              |
| Qwen (plus) |      `1` |   419 |    `-`   |  `+/-`  |   `+`   | Missed 4 items, misclassified one, output readable / medium verbose                                                                                               |
| Kimi        |      `2` |  2970 |    `-`   |   `-`   |   `+`   | Missed 6 items, mentioned one but without recommendation, output very clear (doesn't understand that github is spookhub and tried endlessly to connect to github) |
| Qwen (max)  |      `2` |  3576 |   `++`   |   `+`   |   `+`   | Initially defaulted to GitHub and analyzed a GH issue instead. With a clearer pointer to which issue to solve, found another item that neither Opus nor GLM found |
| Minimax     |      `1` |   747 |    `-`   |   `+`   |   `+`   | Missed 13 (!!!) items.                                                                                                                                            |
| Deepseek    |      `1` |   358 |    `+`   |  `+/-`  |  `+/-`  | Proposes solutions for things that don't need fixing and keeps yapping about them. CHEAP!                                                                         |
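
To put the Cost column in perspective, here is a minimal sketch that expresses each model's cost as a fraction of the Opus cost. The numbers are copied from the table; `relative_costs` is a hypothetical helper name, and I am assuming the Cost values are directly comparable across rows (same unit, and for models with two attempts, covering whatever the table's figure covers).

```python
# Cost values copied from the result table; the unit is whatever the table uses.
costs = {
    "Opus": 26932,
    "GLM": 684,
    "Qwen (plus)": 419,
    "Kimi": 2970,
    "Qwen (max)": 3576,
    "Minimax": 747,
    "Deepseek": 358,
}


def relative_costs(costs, baseline="Opus"):
    """Return each model's cost as a fraction of the baseline model's cost."""
    base = costs[baseline]
    return {model: cost / base for model, cost in costs.items()}


if __name__ == "__main__":
    ratios = relative_costs(costs)
    # Print the cheapest model first.
    for model, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
        print(f"{model:<12} {costs[model]:>6} {ratio:7.1%} of Opus")
```

On these numbers Deepseek comes out at roughly 1.3% of the Opus cost and Qwen (max) at roughly 13%, so even the most expensive alternative is well below Opus.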

Conclusions after the additional tests:

1. `qwen-3.7-max` assumes too much. For future iterations I will (generically) tune the prompts a little bit.
2. `minimax-m3` missed too many items to be useful, especially since it is more expensive than GLM and Deepseek.
3. `deepseek-v4-pro` does produce false positives, but it has a low false-negative rate and is cheap. I'll keep it.

Next round will include:

1. `claude-4.8-opus-xhigh`
2. `glm-5.2`
3. `qwen-3.7-max`
4. `deepseek-v4-pro`
