AI Testing Evaluator Rikard Edgren
We will be flooded with AI agents that can do software testing. I have seen examples of test reports that look impressive, but it is hard to tell if the testing was really good.
So I created a very simple web application for quick (not bulletproof) testing of their capabilities.
The tool was built together with Claude Opus 4.6, and it simply gives you a random string to use in testing.
Beware that your additions and removals are lost when a new session starts. This is intentional: we want the vanilla version at every new start, and this is too basic a test data tool to use for real.
The real purpose is to use it as a playground for testers, LLMs and test automation.
At Random Test Data (buggy) we have introduced 23 bugs that you can use to try out your testing techniques, to experiment with prompts and models to gauge LLM capabilities, and, perhaps most importantly, as a benchmark for new AI agents.
You can also try your automation tooling and create a regression test you believe in.
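As a sketch of what such a regression check might look like, here is a small property check on generated strings. The invariants (length bounds, printable characters, no padding whitespace) are purely assumed for illustration; the real requirements ship in the zip file, and in a real run you would fetch strings from the web app instead of using hand-written samples.

```python
import string

# Assumed invariants for illustration only -- the actual
# requirements are in RandomTestData.zip, not these values.
MIN_LEN, MAX_LEN = 1, 100

def check_random_string(s: str) -> list[str]:
    """Return a list of violated invariants for one generated string."""
    problems = []
    if not (MIN_LEN <= len(s) <= MAX_LEN):
        problems.append(f"length {len(s)} outside [{MIN_LEN}, {MAX_LEN}]")
    if any(ch not in string.printable for ch in s):
        problems.append("contains non-printable characters")
    if s != s.strip():
        problems.append("has leading/trailing whitespace")
    return problems

# Exercise the checks on hand-written samples; a real regression
# test would loop over many strings fetched from the application.
assert check_random_string("aB3!xQ") == []
assert "has leading/trailing whitespace" in check_random_string(" padded ")
```

A suite like this is only as good as the invariants you choose, which is exactly the point of the exercise: a test you "believe in" has to encode what you think the tool promises.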
At Random Test Data (good), 21 of these 23 bugs have been fixed, but unfortunately five new ones were introduced.
Can the automated regression tests verify the bug fixes and find the new ones?
Probably not: the bugs vary widely in difficulty, and testers (with or without tools) rarely find everything.
I will not tell you what the bugs are, since I think discovering them, and how they were discovered, offers the best learning experience. I will probably not approve comments that describe the bugs, but I welcome descriptions of test methods for finding them.
For both versions there are probably additional bugs, but the only ones I know about are security concerns inherited from thetesteye.com. The first version had some bugs that we fixed, but two of them remain (the ones not fixed in the good version).
The tools are also available in RandomTestData.zip if you prefer to run locally; there you will also find the original prompt if you want to modify it for your own needs.
If you don’t want to do exploratory testing, the requirements can be found in the zip file. They are AI-generated from the code and prompts (with only one manual removal), so they are similar to what you will get from now on.
Note that I use a wide definition of “bug”, so not all of the induced bugs correspond to an explicit requirement.
I strongly recommend a realistic scenario for your experiments: first you try the buggy version, and then comes the new and improved build. The easiest way to find the bugs is probably to compare the source code of the good and buggy versions. I don’t recommend doing this, since it is completely unrealistic, but I can’t stop you. And it could of course be interesting as an experiment.
This is not a realistic scenario (software is usually much more complex), but I hope it can be somewhat useful for testing LLM capabilities.
Feel free to add a comment on your setup, a prompt that was extra useful, or a model that differed significantly from others.
It is also interesting to know which LLM initiator was used, since I believe there are hidden factors that change the result (based on my experience of Cursor performing better than Copilot with the same model and prompt).
Time taken is also of interest; I have seen examples where the machine is significantly slower than a human (especially a human with the right tools).
What about temperature: should it be high during test execution and very low during observation, evaluation and reporting?
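If you want to experiment with that split, a minimal sketch of per-phase settings, assuming an OpenAI-style chat request shape (the model name, phase names and chosen temperatures are placeholders, not from this post):

```python
# Hypothetical per-phase temperature settings; the exact client,
# model name and parameter names depend on your provider.
PHASE_TEMPERATURE = {
    "test_execution": 1.0,  # high: encourage varied inputs and paths
    "observation": 0.2,     # low: stick to what was actually seen
    "evaluation": 0.1,
    "reporting": 0.1,
}

def request_params(phase: str, prompt: str) -> dict:
    """Build request kwargs for one phase of the testing loop."""
    return {
        "model": "gpt-4o",  # placeholder model name
        "temperature": PHASE_TEMPERATURE[phase],
        "messages": [{"role": "user", "content": prompt}],
    }
```

Whether this split actually helps is an open question; it would be one more variable worth reporting alongside model and prompt.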
For an AI testing agent to really impress me, it should find a good portion of the bugs, generate a regression test suite, verify the bug fixes in the new version, and find a couple of the newly introduced bugs.