It will mostly be defense-in-depth findings: for example, cases where every path to exploiting an overflow is in fact covered by clever workarounds (which the bots generally miss), but the underlying issue is not 100% solved and, in many cases, not documented. I have the same situation in my own C++ codebases, where I have historically had to covert-patch some findings.
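To make the overflow case concrete, here is a minimal sketch. All names (`kMaxPayload`, `end_offset_unchecked`, `end_offset_checked`) are hypothetical and not taken from Bitcoin Core or any other real codebase. It assumes the "clever workaround" is a length limit enforced somewhere far from the arithmetic.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical limit, assumed to be enforced elsewhere (e.g. at the network layer).
constexpr std::uint32_t kMaxPayload = 1u << 20;

// Original shape: only safe because every caller has already rejected
// len > kMaxPayload and keeps offset small. A scanner looking at this
// function in isolation sees a possible wraparound and reports it.
inline std::uint32_t end_offset_unchecked(std::uint32_t offset, std::uint32_t len) {
    return offset + len;  // wraps modulo 2^32 if the caller-side invariant is ever broken
}

// Defense-in-depth shape: the invariant is re-checked (and thereby documented)
// right next to the arithmetic. Returns false instead of wrapping.
inline bool end_offset_checked(std::uint32_t offset, std::uint32_t len, std::uint32_t& out) {
    if (len > kMaxPayload) {
        return false;  // same limit the callers are supposed to enforce
    }
    if (offset > std::numeric_limits<std::uint32_t>::max() - len) {
        return false;  // offset + len would not fit in 32 bits
    }
    out = offset + len;
    return true;
}
```

The first function is not exploitable as long as the distant check holds, which is why such a report is technically a false positive on severity. The second version is what "fixing it anyway" tends to look like.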
This still holds for me:
Dealing with reports that look plausibly correct but are wrong imposes an asymmetric cost on project maintainers: it’s cheap and easy to prompt an LLM to find a “problem” in code, but slow and expensive to respond to it.
But what they say right below that is also true for me:
we dramatically improved our techniques for harnessing these models — steering them, scaling them, and stacking them to generate large amounts of signal and filter out the noise.
I have partially done this too. It's not finished, and I fear it never will be, but it's now easier for me to triage external findings (more than 95% of which are still either false positives or carry a mislabeled severity), because I have built a repository of every false positive my harness found and can process reports much more rapidly. Valid findings do get fixed, though I am rather displeased by what's left of the developer communities at this point: there's zero new inflow of talent if you don't count Claude and GPT as talent, and a lot of people have stopped caring, including maintainers. I am observing this beyond my own repos too. Getting to actual, well-reviewed merges is hard right now.
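As a rough illustration of the false-positive repository idea, here is a minimal sketch. The types and field names (`Finding`, `FalsePositiveRepo`, `bug_class`) are assumptions made up for the example, not the actual harness. It assumes a finding can be identified by file, function, and bug class, and it deliberately leaves line numbers out of the key so a match survives unrelated edits to the file.

```cpp
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical shape of one finding, whether from the harness or an external report.
struct Finding {
    std::string file;
    std::string function;
    std::string bug_class;  // e.g. "integer-overflow"
};

enum class Triage { KnownFalsePositive, NeedsReview };

class FalsePositiveRepo {
public:
    // Store a finding that was already analyzed and rejected, with the reason why.
    void record(const Finding& f, std::string rationale) {
        known_[key(f)] = std::move(rationale);
    }

    // Anything not seen before still needs a human look.
    Triage triage(const Finding& f) const {
        return known_.count(key(f)) != 0 ? Triage::KnownFalsePositive
                                         : Triage::NeedsReview;
    }

    // Returns the stored rationale, or nullptr if the finding is unknown.
    const std::string* rationale(const Finding& f) const {
        auto it = known_.find(key(f));
        return it == known_.end() ? nullptr : &it->second;
    }

private:
    // Key on file, function, and bug class only (no line numbers).
    static std::string key(const Finding& f) {
        return f.file + '\n' + f.function + '\n' + f.bug_class;
    }

    std::unordered_map<std::string, std::string> known_;
};
```

An incoming report that matches a recorded entry can be answered with the stored rationale. Everything else goes to manual review.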
Wonder how many bugs it will find in Bitcoin Core...
It has already found some things that are not actively exploitable but should still be fixed as good defensive programming practice.
belt and suspenders, as Claude likes to say