'LLMs should not be trusted for patient-facing diagnostic reasoning,' boffins advise
People ask AI for all kinds of advice, including the kind of questions you'd ask a physician. However, the next time you're tempted to ask ChatGPT whether that growth on your face is skin cancer, consider this: research shows today's leading AI models fail at early differential diagnosis in more than 8 out of 10 cases.
A research team led by Harvard medical student Arya Rao this week published in JAMA Network Open the results of a study that tested 21 leading off-the-shelf AI models on 29 standardized clinical vignettes. The bots all did fairly well when given a full portfolio of medical information and asked to make a final diagnosis, with the leading models correct 91 percent of the time. Early differential diagnosis, where clinicians weigh various possibilities while trying to rule certain conditions out, is where that more-than-80-percent failure rate comes in.
"Every model we tested failed on the vast majority of cases," Rao told The Register in an email. "That's the stage where uncertainty matters most, and it's where these systems are weakest."
In other words, it's yesterday's midnight, anxiety-fueled WebMD rabbit hole all over again, just supercharged with AI that's probably even more likely to get things wrong than you are on your own.
...read more at theregister.com
Same with code analysis, though I've got a pretty good framework now that fails less often because I separate planning from diagnosis and my templates are ultra-verbose.
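A minimal sketch of that separation, with `call_llm` as a hypothetical stand-in for whatever model client you actually use:

```python
# Two-stage pipeline: plan first, diagnose second, so the model has to
# commit to a full differential before it's allowed to narrow down.

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your actual model client."""
    raise NotImplementedError

def analyze(case_notes: str) -> str:
    # Stage 1: planning only -- enumerate possibilities, no conclusions.
    plan = call_llm(
        "List every plausible explanation for the findings below, "
        "with the evidence for and against each. Do NOT pick one yet.\n\n"
        + case_notes
    )
    # Stage 2: diagnosis, grounded in the explicit plan from stage 1.
    return call_llm(
        "Given this differential:\n" + plan +
        "\n\nNow rule candidates out one by one, citing the evidence, "
        "and state what additional information would change the ranking."
    )
```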
No analysis of how human clinicians perform as a benchmark for their tests.
There's a large variance between the worst, average, and best physicians.
I use LLMs quite frequently at work to help bounce ideas off of my own limited neural net, but it's interesting that they will sometimes still completely brain fart on a major differential diagnosis... "wait what about cancer?" ... "you're right I missed that!" lol
Generally, though, I must admit they are much smarter than me already.
Or how well standard procedures were translated into actionable instruction sets. This is what I've found to work well in practice. It's also how OpenAI made Codex work: be verbose, be explicit, and don't forget the process steps that you do automatically, i.e. instruct the cross-reference checks, the comparison, the second opinion... all of these are what help humans get to great results. Something like the sketch below.
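Purely as an illustration, a template along those lines; the step names and wording are my own, not anything from the study or from OpenAI:

```python
# Ultra-verbose instruction template: every step a careful human does
# implicitly is spelled out, so the model can't silently skip a check.
VERBOSE_REVIEW_TEMPLATE = """\
You are reviewing: {subject}

Follow EVERY step below in order. Do not skip or merge steps.

1. Restate the input in your own words and list anything ambiguous.
2. Cross-reference: check each claim against the source material and
   flag every mismatch explicitly.
3. Comparison: lay out the top candidate conclusions side by side,
   with evidence for and against each.
4. Second opinion: argue against your own leading conclusion as a
   skeptical colleague would.
5. Only now state your conclusion, plus what would change your mind.
"""

# Example use: fill in whatever you're asking the model to review.
print(VERBOSE_REVIEW_TEMPLATE.format(subject="the attached case notes"))
```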
Except for reptile-brain and social functions. LLMs aren't an entity (despite all the fakery/simulation and predatory CEOs claiming otherwise), so they don't have a fear of messing up, reputation damage, or liability.