
Traditional software has a simple contract: same input, same output. Run the test suite, pass or fail, move on. Deterministic systems are boring but testable.

Agents break this contract completely. The same prompt can produce three different decisions depending on temperature, context-window state, or even what other requests the model processed earlier. You can't write unit tests for judgment calls, and you can't assert that an agent will choose the right action when the criteria for "right" depend on context you didn't anticipate. (A quick sketch of what that looks like is at the bottom of this post.)

What people do instead is integration testing in production: ship the agent, watch it run, fix whatever breaks. It's chaos engineering disguised as a deployment strategy.

I've tried building structured evaluations for agent behavior: scorecards, rubrics, scenario testing (second sketch below). They work fine until the agent hits a real-world situation that wasn't in the test scenarios, which is always, because reality is combinatorially infinite.

The industry hasn't really solved this. There are frameworks that promise observability and evaluation pipelines, but they mostly tell you post hoc what the agent did wrong. That's useful for debugging, not for preventing the mistake.

What's your approach to validating agent behavior before it goes live? Or have you accepted that production is the only real test?
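
For concreteness, here's roughly what I mean by the broken contract. This is a minimal sketch, not code against any real provider: `call_model` is a hypothetical stand-in for whatever completion API you use, and the example tally in the comments is made up.

```python
from collections import Counter

def call_model(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical stand-in for your provider's completion call.
    Assume it returns the single action the agent chose, e.g. 'approve'."""
    raise NotImplementedError  # wire up your actual client here

def decision_distribution(prompt: str, n: int = 20) -> Counter:
    # Run the identical prompt n times and tally the distinct decisions.
    # A deterministic system would give a Counter with exactly one key.
    return Counter(call_model(prompt) for _ in range(n))

# At temperature > 0 you typically get a spread, not a point, e.g.
# Counter({'escalate': 11, 'approve': 7, 'refund': 2}),
# so there is no single expected value for a unit test to assert on.
```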
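
And this is the shape of the scorecard/scenario harness I was describing. Again a sketch under stated assumptions: the `Scenario` fields, `run_agent`, and the sampling count are all invented for illustration. Each scenario encodes its rubric as a set of acceptable actions, and the agent is sampled several times per scenario because any single run is noise.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    name: str
    prompt: str
    acceptable: set[str]  # rubric: decisions that count as "right" here
    forbidden: set[str] = field(default_factory=set)  # instant-fail actions

def run_agent(prompt: str) -> str:
    """Hypothetical agent entry point; returns the action it chose."""
    raise NotImplementedError

def score(scenarios: list[Scenario], trials: int = 10) -> dict[str, float]:
    """Sample each scenario `trials` times and report its pass rate.
    Any forbidden action zeroes the scenario outright."""
    results: dict[str, float] = {}
    for s in scenarios:
        actions = [run_agent(s.prompt) for _ in range(trials)]
        if any(a in s.forbidden for a in actions):
            results[s.name] = 0.0
        else:
            results[s.name] = sum(a in s.acceptable for a in actions) / trials
    return results
```

The failure mode is exactly the one above: the rubric only covers the situations someone thought to write down.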