reply on: Stacker Saloon \ stacker news

247 sats \ 4 replies \ @optimism 3 Jun \ parent \ on: Stacker Saloon

I think that the edge can be taken off with prompt engineering (directly - indirection with "skills" / "rules" / AGENTS.md is generally inefficient).

But it's a moving target. For example, I noticed a regression in Opus 4.8 vs 4.5-4.7 where it stopped consistently tagging the issue a PR closes, which I suspect to be a side effect of their alignment tuning. For now I just close them manually; I'm not going to adjust anything for a minor version regression causing a minor inconvenience. However, this does illustrate that it's all very fluid, and that a generic best practice, let alone reliable, ossified skill definitions, is still far off.
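As a stopgap for the missing tag, a check along these lines could flag PR descriptions that lack a closing reference. This is only a minimal sketch and not something from the thread: it assumes GitHub-style closing keywords (close/fix/resolve and their variants, followed by `#N`), and the names `closed_issues` and `missing_closing_tag` are hypothetical.

```python
import re

# Assumption: GitHub-style closing keywords followed by "#<number>".
# Other forges or cross-repo references would need a different pattern.
CLOSING_REF = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)",
    re.IGNORECASE,
)


def closed_issues(pr_body: str) -> list[int]:
    """Return issue numbers referenced with a closing keyword, in order."""
    return [int(number) for number in CLOSING_REF.findall(pr_body)]


def missing_closing_tag(pr_body: str) -> bool:
    """True when the PR body has no closing reference at all."""
    return not closed_issues(pr_body)


if __name__ == "__main__":
    print(closed_issues("Closes #12 and fixes #7"))
    print(missing_closing_tag("Refactor parser, see #5"))
```

A plain mention such as "see #5" is deliberately not counted, since only a keyword plus issue number closes anything under the assumed convention.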

I'm still a big fan of "bespoke everything", but right now, that means you need bespoke tooling too. Which is expensive, especially if you need to continuously develop it.

70 sats \ 3 replies \ @k00b 3 Jun

the edge can be taken off with prompt engineering

It sounds like you're having more luck than me recently. It depends on the task and scope though.

At root I think the problem is they're trained to get from point A to point B. They get to point B more and more reliably, especially if you specify B well, but they tend to choose retarded routes when A and B are far apart.

it's a moving target.

Yes, once I get used to a model's quirks, they release new ones with their own quirks.

124 sats \ 2 replies \ @optimism 3 Jun

It sounds like you're having more luck than me recently.

I'm still happy with my Dec-Feb investment in bespoke orchestration. I'm semi-happy with Claude, relatively unhappy with GPT - I use it less and less - and neutral with GLM. I'm 99% skeptical about codegen still. Analysis is fine, false positive rate is under 20% for me now, maybe even under 10%.

I do get tired of reading all the slop, but in some of my use cases ("analyze this 600k line diff for x,y,z") the choice I have is either to go over well-structured, opti-instructed slop, or generally poor slop code from a third party. I prefer "my slop" over "their slop"; it's simply better slop, lol.

70 sats \ 1 reply \ @k00b 3 Jun

Analysis is fine, false positive rate is under 20% for me now, maybe even under 10%.

I think the difference between analysis and codegen is that analysis is like going from point A to point B via as many routes as possible. Codegen requires picking one of a few great routes.

124 sats \ 0 replies \ @optimism 3 Jun

I don't even let it suggest point B haha