the edge can be taken off with prompt engineering
It sounds like you're having more luck than me recently. It depends on the task and scope though.
At root I think the problem is they're trained to get from point A to point B. They get to point B more and more reliably, especially if you specify B well, but they tend to choose retarded routes when A and B are far apart.
it's a moving target.
Yes, once I get used to a model's quirks, they release new ones with their own quirks.
It sounds like you're having more luck than me recently.
I'm still happy with my Dec-Feb investment in bespoke orchestration. I'm semi-happy with Claude, relatively unhappy with GPT - I use it less and less - and neutral with GLM. I'm 99% skeptical about codegen still. Analysis is fine, false positive rate is under 20% for me now, maybe even under 10%.
I do get tired of reading all the slop, but in some of my use cases ("analyze this 600k line diff for x,y,z") my choice is between going over well-structured, opti-instructed slop and generally poor slop code from a third party. I prefer "my slop" over "their slop"; it's simply better slop, lol.
Analysis is fine, false positive rate is under 20% for me now, maybe even under 10%.
I think the difference between analysis and codegen is that analysis is like going from point A to point B via as many routes as possible. Codegen requires picking one of a few great routes.
I think that the edge can be taken off with prompt engineering (directly; indirection with "skills" / "rules" / AGENTS.md is generally inefficient). But it's a moving target. For example, I noticed a regression in Opus 4.8 vs 4.5-4.7 where it stopped consistently tagging the issue a PR closes, which I suspect is a side effect of their alignment tuning. For now I just close them manually; I'm not going to adjust anything for a minor-version regression causing a minor inconvenience. However, this does illustrate that it's all very fluid, and that generic best practices, let alone reliable, ossified skill definitions, are still far off.
I'm still a big fan of "bespoke everything", but right now, that means you need bespoke tooling too. Which is expensive, especially if you need to continuously develop it.