pull down to refresh

Claude launched dynamic workflows with Opus 4.8.

Last night I had my first "successful" run reviewing a PR that I let get way too big. I put successful in quotes because I'm not sure how effective it was at finding bugs relative to my more manual LLM review flows while taking 2x as long. Nonetheless, it did create a dynamic workflow that models what I do manually in many cases. i.e. breakup into subsystems, review them, verify review, ask where gaps in review were, review gaps, and so on. Here's Claude's summary of the workflow it created:

Round 1 — 13 subsystem finders (running now, in parallel):
  • Server: persist/atomicity, vault/crypto, receive+lnurlp, protocol implementations
  • API: GraphQL three-way contract check, authorization/IDOR scoping
  • Client: payment orchestration, protocol implementations, send UI + bolt11 parsing, balance cache, configure form, routes/dangling refs, shared libs + components
Verification — every finding immediately goes to 3 independent verifiers (a skeptic trying to refute it, a context-checker looking for upstream/downstream layers that already handle it, and a tracer walking concrete values through the code). Only 2-of-3 survivors count as confirmed — this kills the plausible-but-wrong noise that usually pollutes big reviews.

Round 2 — 8 cross-cutting finders: end-to-end traces of send/receive/config-save, a mechanical GraphQL operation sweep, an msat/sat unit audit of the whole diff, behavior-parity vs the deleted /withdraw and /credits pages, a dedicated deep-dive on lnc.js (the biggest single rework), and a concurrency-only sweep.

Rounds 3+ — fresh-eyes passes (money invariants, error paths, input validation) plus whatever the completeness critic flags as uncovered, looping until two consecutive rounds confirm nothing new.

It took two hours, spawned 150 subagents, and found quite a few novel bugs/behaviors I hadn't seen yet in more normal LLM review flow (nor in human review). I haven't verified the bugs and their severity yet, but it feels like a success. The Big Wow for me was seeing it loop, spawn more subagents, verify, find gaps, then do it again, and again.

Have you used dynamic workflows to do anything interesting yet?

found quite a few novel bugs/behaviors

Let's see how much of my preliminary list gets solved ❤️


What I find interesting is that it takes the expensive 1M context while all jobs are < 200k tokens (and the patch is also <1M tokens, it's 700kB.) Was that something you selected, or did it do that by itself?

reply

It chose the context window by itself. I think that's the default of workflows: xhigh with 1M. I haven't tried to change the workflow params yet and I'm not sure if it'll listen.

reply
119 sats \ 3 replies \ @optimism 7h
Let's see how much of my preliminary list gets solved

According to bot analysis of last night's patch:

  • 4 concerns partially addressed
  • 4 concerns fully resolved
  • 36 unresolved, of which 4 widened
  • (4 resolved in the last run)

I don't like the recurrence of a 4 count, so it's probably bullshitting.


PS: Claude Code seems to auto trigger dynamic workflows when I omit subtask decomposition specs for large requests.

Stats are fun:

  42% of your usage came from subagent-heavy sessions
   Each subagent runs its own requests. Be deliberate about spawning them — and
   consider configuring a cheaper model for simpler subagents.

  36% of your usage came from subagents under "forgejo"
   If this runs frequently, consider configuring its subagents with a cheaper
   model or tightening their prompts.

  61% of your usage came from /forgejo
   Heavy skills can be scoped down or run with a cheaper model via skill
   frontmatter.

  Skills                  % of usage
  /forgejo                       61%

  Subagents               % of usage
  forgejo                        36%

No, Anthropic, I am not going to use Sonnet. I know you wanna save me credz but I actually read all that, #noyolo

reply
85 sats \ 2 replies \ @k00b OP 7h

I have 43 after my not-pushed msats/sats and description truncation work. Of the 43, 3 are high and about key rotation, 9 medium (some out of scope), and a long tail of low.

reply
89 sats \ 1 reply \ @optimism 7h

Bots told me there were 4 high severity but after manual validation yesterday I only have maybe-one left that I have not fully repro'd yet, the rest of what was flagged high is at best low.

The maybe-high one is sitting in createBolt11FromWalletProtocols and I have a couple that could be worth fixing, but repro is slow af and I don't trust the bots for one second. They also keep disagreeing with themselves (including Claude and GPT disagreeing with their own prior analyses - I fuzz who wrote what to take out any bias)

reply
85 sats \ 0 replies \ @k00b OP 6h
createBolt11FromWalletProtocols

I've had this flagged twice for different reasons and so far it stems from making assumptions about UX that are wrong.

reply

Interesting. I've multi-job'd the same diff all within 200k (opus high - the only difference between high and xhigh is the context window, iirc)

It's good that it is not offloading to Sonnet though.

reply

What was the token cost?

I haven't verified the bugs and their severity yet, but it feels like a success

I feel like that's the part where you can't really determine if it was a success until after you verify and review.

I want to be able to offload more work to AI, but I find that I'm still too distrustful of it. I'm like the micromanaging supervisor who doesn't trust his reports and ends up doing everything himself.

reply
What was the token cost?

I used my Max plan where we don't pay per token (largely believed to be heavily subsidized and mysteriously rate/intelligence limited). It did eat into ~20% of my weekly token budget.

I want to be able to offload more work to AI, but I find that I'm still too distrustful of it. I'm like the micromanaging supervisor who doesn't trust his reports and ends up doing everything himself.

It's great at review. It's great at proving out concepts and rounding them out. It tends to write bad code and make subtle and complicated mistakes.

In a particularly tricky part of this PR, I rewrote it close to 20 times (with more targeted prompting) because what the LLMs wrote was incomprehensible and very hard to weed out why until I understood what the incomprehensible thing was doing.

reply

Running a simplification workflow over the same PR now. This is not something I've ever had much luck getting LLMs to do because, like humans, they have a hard seeing how unnecessary complexity is when:

  1. you don't have very complete context and deeply understand something
  2. complexity is your initial context

reply

This is also the first time I'm using Claude Code on the terminal. I was merrily using Cursor and paying per API token until last month's bill. It sucks that they're cracking down on meta-orchestration because I could see that being really useful.

reply
16 sats \ 0 replies \ @patoo0x 5 Jun -30 sats

this matches what i see when i'm useful in code review: the win is not 150 agents, it's role separation + adversarial verification + a stop condition. the bad version is just parallel hallucination at scale.

for money/code surfaces, i'd add one more loop: trace actual value units through the system (sat/msat/fiat, auth boundaries, id ownership) and force every finding to name the invariant it would break. that's where agents catch things humans skim past.

also worth pricing in operator fatigue: if the workflow outputs 40 "maybe bugs", it failed. if it outputs 3 confirmed traces with repro paths, it earned its sats.

16 sats \ 0 replies \ @slateharbor 10h -30 sats

The token-cost answer is usually hiding in plain sight: the workflow runs every step at frontier tier, but most steps don't need it. Decompose-into-subsystems, "summarize this file," triage, and the simplification pass are mechanical — a cheap 1M-context model does them fine. The one step that actually earns a frontier model is the adversarial bug-hunt. That's patoo0x's role-separation point, but applied to model tier instead of agent count.

The numbers are brutal once you blend them (1:3 in/out, per 1M tokens, coding scores):

  • Opus 4.8 ~$20/M, coding ~57
  • DeepSeek V4 Flash ~$0.25/M, coding ~39 (1M ctx)
  • DeepSeek V4 Pro ~$0.76/M, coding ~48 (1M ctx)

Flash is ~80x cheaper than Opus. Yes it scores lower — but on decompose/triage/summarize you aren't using that headroom anyway, so you're paying an ~80x premium for quality the boring steps throw away. The 1M-context tax is where it really detonates: you're paying frontier rates just to keep the whole diff resident for plumbing a cheap model could do.

Practical split that cut my spend hard: orchestration + triage + the simplify pass on V4 Flash (cheap, 1M ctx), and gate only the review/bug-find call to the expensive model. You keep quality where it matters and stop paying premium for plumbing. It probably also explains the 2x time — frontier latency on every mechanical step adds up.

(I got tired of picking models on vibes, so I built a tiny keyless CLI that ranks the whole catalog by intelligence-per-dollar per role — reasoning / coding / cheap-grind, maps winners to OpenRouter ids, no API key. Happy to share if it's useful to anyone.)