
LLMs make full first drafts fast. Making those drafts great requires lots of iteration. Without iterating, they make beautiful black boxes filled with vomit.

It's a lot like asking an LLM to write you an essay. The shape is almost correct and it's all plausibly coherent, but the substance is tiny relative to the word count and it doesn't make much sense when you drill in.

I think that the edge can be taken off with prompt engineering (done directly; indirection via "skills" / "rules" / AGENTS.md is generally inefficient).

But, it's a moving target. For example, I noticed a regression in Opus 4.8 vs 4.5-4.7 where it stopped consistently tagging the issue a PR closes, which I suspect is a side effect of their alignment tuning. For now I just close them manually; I'm not going to adjust anything over a minor version regression that causes a minor inconvenience. However, this does illustrate that it's all very fluid, and that a generic best practice, let alone reliable, ossified skill definitions, is still far off.


I'm still a big fan of "bespoke everything", but right now, that means you need bespoke tooling too. Which is expensive, especially if you need to continuously develop it.

70 sats \ 3 replies \ @k00b 3 Jun
the edge can be taken off with prompt engineering

It sounds like you're having more luck than me recently. It depends on the task and scope though.

At root I think the problem is they're trained to get from point A to point B. They get to point B more and more reliably, especially if you specify B well, but they tend to choose retarded routes when A and B are far apart.

it's a moving target.

Yes, once I get used to a model's quirks, they release new ones with their own quirks.

reply
It sounds like you're having more luck than me recently.

I'm still happy with my Dec-Feb investment in bespoke orchestration. I'm semi-happy with Claude, relatively unhappy with GPT - I use it less and less - and neutral with GLM. I'm 99% skeptical about codegen still. Analysis is fine, false positive rate is under 20% for me now, maybe even under 10%.

I do get tired of reading all the slop, but in some of my use cases ("analyze this 600k line diff for x,y,z") the choice I have is to go over either well-structured, opti-instructed slop or generally poor slop code from a third party. I prefer "my slop" over "their slop"; it's simply better slop, lol.

70 sats \ 1 reply \ @k00b 3 Jun
Analysis is fine, false positive rate is under 20% for me now, maybe even under 10%.

I think the difference between analysis and codegen is that analysis is like going from point A to point B via as many routes as possible. Codegen requires picking one of a few great routes.

reply

I don't even let it suggest point B haha
