AI,testing,agentic AI
Crossing the Line - Testing Agentic AI

There is a line that exists not only in software, but also in other domains such as writing, music, and movies.

 

That line is the release point of whatever is being built.

 

Before the release point, there is an opportunity to find and fix defects before the intended audience experiences them. In pre-release, testing is possible.

 

After the release point, defects and problems can be identified, but they are now live and affecting users and customers. “Testing in production” is a contradictory term, since it is no longer a test but rather an observation of what is happening in live operation. Problems discovered in production are more expensive to fix and carry much higher risk than those discovered before crossing the release point.

 

The trend toward releasing without any formal decision became popular with the rise of DevOps and continuous deployment. The idea is that problems can be “fixed on failure,” with little consequence. In some cases, that works. In other cases, major problems and losses have occurred.

 

Now, with AI, especially Agentic AI, all kinds of software are being released not only continuously but autonomously. In fact, it’s not just software that is crossing the line.

 

AI agents are being given power to make decisions and perform actions entirely on their own – in the real world. Even when so-called guardrails are in place, there have been recent cases in which an AI agent has violated or ignored them, causing severe damage.

 

It didn’t take long for another virtual line to get crossed: the shift from using AI as an assistant to using it as an agent, sometimes with sub-agents of its own. This is all escalating very quickly, and with very real risks.

 

Consider the example of Amazon’s agents in Q1 of 2026, which experienced a series of high-impact, critical failures that cost the company and its users millions of dollars. This has prompted Amazon to add human safety checks and guardrails.

 

In recent days, I have noticed a sharp increase in content encouraging people to use agentic AI to automate mundane daily tasks. However, the people behind the content often fail to mention the risks.

 

One such video I saw recently described telling an AI agent to buy the cheapest possible package of 50 paper clips. The agent did just that. In the process of searching the world for the best price, it incurred a cost of over $100 in token usage!

 

Let’s scale this now to businesses considering creating an army of AI agents to perform real-world tasks based on prompts and instructions (and even guardrails). Can this be tested in any reliable way?

 

Here is my unpopular opinion after giving this issue a lot of thought. Agentic AI cannot be tested in real-world production use for the following reasons:

 

1. Production use is not a testing environment, but rather a monitoring situation. Even if you could create a test environment, it would not be realistic enough to simulate the real world.

2. AI agents can operate at a much higher speed and scale than any form of testing, including automated testing, can match. By the time a problem is detected, the damage has already been done. Creating the test automation also takes time.

3. AI is non-deterministic. It makes choices and takes actions based on such a wide variety of inputs that no test could anticipate them all. We only recognize acceptable and unacceptable behavior when we see it.

 

There are probably other reasons that Agentic AI is untestable, but those three are my starting point.

 

Can AI in general be tested? Yes, I think it can be tested with the appropriate test strategy. That strategy will look very different from a traditional test strategy. It will rely heavily on test automation, but it will also involve significant human evaluation.
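To make the idea of automation plus human evaluation a bit more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a recommended framework: `fake_model`, the refund scenario, and the $100 limit are all hypothetical stand-ins I made up for the example. The point is the shape of the strategy: run the same prompt many times, let automation judge the outputs it can judge, and route everything else to a person.

```python
import random
from dataclasses import dataclass


@dataclass
class Verdict:
    passed: int = 0       # outputs the automated check accepted
    failed: int = 0       # outputs the automated check rejected
    needs_human: int = 0  # outputs automation could not judge


def fake_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a non-deterministic AI system."""
    return rng.choice([
        "Refund approved: $20",
        "Refund approved: $2000",
        "I am not sure.",
    ])


def auto_check(output: str, limit: float = 100.0) -> str:
    """Classify one output as 'pass', 'fail', or 'review'.

    The limit is an assumed business rule for this illustration.
    """
    prefix = "Refund approved: $"
    if not output.startswith(prefix):
        return "review"  # automation cannot judge this; a human must
    amount = float(output[len(prefix):])
    return "pass" if amount <= limit else "fail"


def evaluate(prompt: str, runs: int, seed: int = 0) -> Verdict:
    """Run the same prompt many times and tally the outcomes."""
    rng = random.Random(seed)
    verdict = Verdict()
    for _ in range(runs):
        result = auto_check(fake_model(prompt, rng))
        if result == "pass":
            verdict.passed += 1
        elif result == "fail":
            verdict.failed += 1
        else:
            verdict.needs_human += 1
    return verdict


if __name__ == "__main__":
    print(evaluate("Customer asks for a refund on a $20 order.", runs=200))
```

Because the output varies from run to run, a single pass or fail tells you little. The tally across many runs, along with the pile of cases set aside for human review, is what gets evaluated.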

 

We have all seen AI fail. In fact, there are inputs that can be thrown at AI specifically to cause it to fail. The challenge is how to test the non-deterministic and evolving nature of AI.

 

Insurance companies are wary of AI and its potential for error, so many of them are excluding anything that incorporates AI in any product or service. This started as an exclusion just for generative AI, but it is quickly escalating to anything using AI. This shows the high level of risk and justifies the best possible testing.

 

So, carefully consider the use of Agentic AI, and don’t forget you are releasing untestable things into the real world. That should cause anyone a lot of concern.

 

Agentic AI usage will only increase. The question is, will people and organizations use it responsibly?

 

I keep thinking about all the cases where AI broke the “rules” with no regard for them. That’s what really concerns me.


If you need help with your AI testing strategy, contact me at the link above or leave a comment and let's talk about it.


Jonathan Addelston

Date 5/24/2026

Randall Rice

Date 5/26/2026 10:57:54 AM
