
Same with code analysis, though I've got a pretty solid framework now that fails less often, because I separate planning from diagnosis and my templates are ultra-verbose.

84 sats \ 1 reply \ @gmd 16 Apr

No analysis of how human clinicians perform as a benchmark for their tests.

There's a large variance between the worst, the average, and the best physicians.

I use LLMs quite frequently at work to bounce ideas off of my own limited neural net, but it's interesting that they will sometimes still completely brain fart on a major differential diagnosis... "wait, what about cancer?" ... "you're right, I missed that!" lol

Generally though I must admit they are much smarter than me already.

No analysis of how human clinicians perform

Or how well standard procedures were translated into actionable instruction sets. This is what I've found to work well in practice. It's also how OpenAI made Codex work: be verbose, be explicit, and don't leave out the process steps you do automatically, i.e. explicitly instruct the cross-reference checks, the comparison, the second opinion... all the things that help humans reach great results.
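To make that concrete, here's a minimal sketch of what an "ultra-verbose template" could look like. The step names and the `build_prompt` helper are my own hypothetical illustration, not anyone's actual production template; the point is just that every implicit step (planning, diagnosis, cross-referencing, second opinion) is spelled out so the model can't skip it.

```python
# Hypothetical sketch: spell out every process step you would normally
# do implicitly, so the model is explicitly told not to skip any of them.
PROCESS_STEPS = [
    "1. PLAN: list the checks you will run before drawing any conclusion.",
    "2. DIAGNOSE: work through each check one at a time; quote the evidence.",
    "3. CROSS-REFERENCE: compare findings against the standard procedure.",
    "4. SECOND OPINION: re-read the diagnosis as a skeptical reviewer and note anything missed.",
]

def build_prompt(task: str) -> str:
    """Assemble a verbose, explicit instruction set for one task."""
    steps = "\n".join(PROCESS_STEPS)
    return (
        f"Task: {task}\n\n"
        f"Follow these steps in order, and do not skip any:\n{steps}"
    )

print(build_prompt("Review this stack trace and identify the failure cause"))
```

The separation matters: planning happens before diagnosis, and the "second opinion" pass is exactly the kind of step that catches the "wait, what about cancer?" misses mentioned above.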

Generally though I must admit they are much smarter than me already.

Except for reptile-brain and social functions. LLMs aren't entities (despite all the fakery/simulation and predatory CEOs claiming otherwise), so they have no fear of messing up, no reputational damage, no liability.
