How can we Quality Assure an AI solution?

When I came to test my first commercially written computer program (or app, as you may know them now) back in the late 1970s, one of the “tablets of stone” I was handed on testing theory was that each test must have a documented expected result in advance of any test execution. The actual test results would then be compared with what was expected, giving a straightforward set of pass or fail outcomes.

To my knowledge, despite the many and varied technical advances since that time, this tablet of stone remains in place and is equally valid today, regardless of the methodology or technologies used to build and deploy the solution.

Now may be the time to re-evaluate.

Let’s take a real-world example: we wish to develop an application that predicts what the weather will be in 24 hours’ time for locations in the UK. We are seeking a prediction consisting of:

  • Maximum and minimum temperatures
  • Amount of rainfall
  • Maximum and minimum wind speed
  • Predominant wind direction


I have a “black-box” application. Inside this application is a mass of historic data points relating to past weather patterns. The prediction engine is AI in nature, using self-learning and neural techniques. All we need to do to keep the application running is record each day’s set of measurable weather data, and each day it will produce a prediction of what will happen 24 hours later.


So how do we approach a test strategy that will verify that such an application “works”?

How should we define “works”? Assume the obvious checks are covered: can it accept and validate input, can it produce understandable output, and does it work in a timely manner (e.g. I need results in 10 minutes, not after 10 hours of “thinking” time)? We then quickly arrive at the key question: “Is the prediction output correct?” Referring to my tablet of stone, this equates to: does the actual result match the expected result?

Now comes the crux: how do I know what the expected result is? If we wait 24 hours each time, we can say whether the prediction was right or wrong or, more precisely, how large the difference was between what was predicted and what happened. For example, missing a maximum temperature prediction by a single degree is presumably less bad than missing it by 10 degrees. We need a tolerance for rightness.
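
A tolerance of this kind can be sketched in a few lines of Python. This is only an illustration: the function names, the two-degree tolerance and the temperatures are assumptions, not figures from a real forecasting system.

```python
def absolute_miss(predicted, actual):
    """Size of the gap between what was predicted and what happened."""
    return abs(predicted - actual)


def within_tolerance(predicted, actual, tolerance):
    """Pass when the prediction misses by no more than the agreed tolerance."""
    return absolute_miss(predicted, actual) <= tolerance


# Hypothetical maximum temperatures in degrees centigrade.
near_miss = within_tolerance(predicted=18.0, actual=19.0, tolerance=2.0)
far_miss = within_tolerance(predicted=18.0, actual=28.0, tolerance=2.0)
print(near_miss, far_miss)  # True False
```

The pass or fail outcome survives, but the tolerance itself now has to be agreed with the business before the test is run.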

(As a sidebar, there are complicating factors here. The difference between a minimum temperature of +1 degree centigrade and one of -1 degree centigrade, a difference of two degrees, may be more significant than the same difference between 17 and 19 degrees centigrade. For example, my key question could have been: do I need to send out gritting machines tomorrow because of icy roads?)
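
One way to capture the gritting question is to test the decision rather than the raw number. The sketch below assumes, purely for illustration, that gritters go out whenever the minimum temperature is at or below a threshold of 0 degrees centigrade; the function name and threshold are hypothetical.

```python
def same_decision(predicted, actual, threshold=0.0):
    """True when forecast and outcome fall on the same side of the threshold.

    Assumed rule (hypothetical): act when the temperature is at or below
    `threshold`, for example send out gritting machines.
    """
    return (predicted <= threshold) == (actual <= threshold)


# A two-degree miss across freezing changes the decision...
print(same_decision(predicted=1.0, actual=-1.0))   # False
# ...while the same miss at 17 versus 19 degrees does not.
print(same_decision(predicted=17.0, actual=19.0))  # True
```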

My pass criteria for the application are then focussed on its accuracy, and accuracy may carry a different weight for different outcomes. There may be a higher weighting for extreme events, such as correctly predicting hurricane-force winds or heatwaves.
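
As a rough sketch of weighted accuracy, each prediction can carry a weight so that extreme events count for more. The weights below (1 for an ordinary day, 3 for an extreme event) are arbitrary assumptions chosen only to show the arithmetic.

```python
def weighted_accuracy(results):
    """Weighted share of correct predictions, between 0 and 1.

    `results` is a list of (is_correct, weight) pairs.
    """
    total = sum(weight for _, weight in results)
    if total == 0:
        raise ValueError("no weighted results to score")
    earned = sum(weight for is_correct, weight in results if is_correct)
    return earned / total


# Hypothetical outcomes: three ordinary days and one extreme event.
results = [
    (True, 1.0),   # ordinary day, predicted correctly
    (True, 1.0),   # ordinary day, predicted correctly
    (False, 1.0),  # ordinary day, missed
    (True, 3.0),   # extreme event, predicted correctly
]
score = weighted_accuracy(results)
print(f"{score:.2f}")  # 0.83, i.e. 5 of a possible 6 points
```

Unweighted, the same four predictions would score 0.75, so the choice of weights is itself part of the expected result and needs business agreement.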

We also need to consider the business need here (as ever). We can assume, and should verify, that we expect this application to be better than what we do currently, whether that is weather experts working unaided or weather experts using previous-generation (pre-AI) computations.

One common approach, therefore, would be a parallel run, with the old and new methods of forecasting running alongside each other and both measured against the actual outcomes. We can give weighting “bonus points” for predicting the more difficult extreme events. We could also compare both methods with a random model and with a deliberately simple model, for example one that assumes tomorrow’s weather will be identical to today’s.
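
A parallel run of this kind might be scored as in the sketch below. It uses mean absolute error as the yardstick, which is one common choice rather than the only one, and every temperature in it is invented for illustration. The simple baseline just repeats the previous day’s observation.

```python
def mean_absolute_error(predicted, actual):
    """Average size of the miss across all forecast days."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("need two non-empty series of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical daily maximum temperatures in degrees centigrade.
observed = [13.0, 14.0, 15.0, 19.0, 12.0, 13.0]
actual = observed[1:]  # what happened on each forecast day

forecasts = {
    "repeat_previous_day": observed[:-1],  # simple baseline
    "old_method": [13.5, 15.5, 16.0, 15.0, 13.0],
    "new_method": [14.0, 15.5, 18.0, 13.0, 12.5],
}

scores = {
    name: mean_absolute_error(series, actual)
    for name, series in forecasts.items()
}
# Lowest error first.
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: {score:.2f}")
```

If the new method cannot beat the simple baseline, that is a finding in itself, whatever it does against the old method.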

Do we have to wait for a parallel run before we know if the new application is better than the current method?

One technique is to take the original source data and partition it as randomly as possible. Only one partition is used as source primer data for the application. The remaining data is used as input for a test run, with the historic actual outcomes as the measuring stick for the correctness of the application. The test-run results are then compared with the actual outcomes in the test set that has been “hidden” from the application.
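
The partitioning step could look something like the following sketch. The function name, the 80/20 split and the fixed seed are assumptions made for the example; a fixed seed simply makes the split repeatable between test runs.

```python
import random


def split_history(records, test_fraction=0.2, seed=42):
    """Randomly partition historic records into primer data and hidden test data."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


# Hypothetical history: ten days identified only by day number.
primer, hidden = split_history(range(10))
print(len(primer), len(hidden))  # 8 2
```

Because weather records are ordered in time, a purely random split may let the application see days either side of a hidden day, so holding back the most recent period instead is an option worth weighing.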

Whichever method we choose, our tablet of stone still holds true, just. But we need a more subtle way of expressing the expected result: it becomes more of a bounded expected result. Examples could be:

  • If the result is right nine times out of ten
  • If the result is better than the previous method x% of the time
  • If the results show that extreme events are predicted correctly more often than the more mundane events
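
Bounded expected results like these can still be written down before the test is run. In the sketch below the outcome lists are invented, and the 75% bound stands in for the unspecified “x%” purely as an assumption.

```python
def meets_bound(outcomes, minimum_rate):
    """True when the share of successful outcomes reaches the agreed bound."""
    if not outcomes:
        raise ValueError("no outcomes to assess")
    return sum(outcomes) / len(outcomes) >= minimum_rate


# Hypothetical results from ten forecast days.
right_within_tolerance = [True] * 9 + [False]
better_than_old_method = [True, True, False, True, True,
                          True, False, True, True, True]

# Expected result 1: right nine times out of ten.
print(meets_bound(right_within_tolerance, minimum_rate=0.9))   # True
# Expected result 2: better than the previous method 75% of the time (assumed bound).
print(meets_bound(better_than_old_method, minimum_rate=0.75))  # True
```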

Whatever we choose, the construction of our expected result column requires more thought, more business or end-user involvement and more real-world consideration, and that can be no bad thing. So while that tablet of stone stays intact, we may need to add a few chalked-on caveats as we move into the brave new world where machines, at last, start to do things better than we can.


  • Phil Lupton
    Account Director, Sogeti UK