
How can we Quality Assure an AI solution?

When I came to test my first commercially written computer programme (or app, as you may know them now) back in the late 1970s, one of the “tablets of stone” I was handed on testing theory was this: each test, in advance of any test execution, must have a documented expected result. The actual test results would then be compared to what was expected, resulting in a straight set of pass or fail outcomes.

To my knowledge, despite all of the many and varied technical advances since that time, this tablet of stone remains in place and is equally valid today, regardless of the methodology or technologies being used for the solution.

Now may be the time to re-evaluate.

Let’s take a real-world example: we wish to develop an application that predicts what the weather will be the following day for locations in the UK. We are seeking a prediction consisting of:

  • Maximum and minimum temperatures
  • Amount of rainfall
  • Maximum and minimum wind speed
  • Predominant wind direction

I have a “black-box” application. Inside this application is a mass of historic data points relating to past weather patterns, and the prediction engine is AI-based, using self-learning and neural-network techniques. All we need to do to keep the application running is record each day’s set of measured weather data, and each day it will produce a prediction for what will happen 24 hours later.

So how do we approach a test strategy that will verify that such an application “works”?

How should we define “works”? If we assume the obvious checks (can it accept and validate input, can it produce understandable output, does it work in a timely manner, e.g. I need results in 10 minutes and not after 10 hours of “thinking” time), then we quickly arrive at the key question: “Is the prediction output correct?” Referring to my tablet of stone, this equates to: does the actual result match the expected result?

Now comes the crux: how do I know what the expected result is? If we wait 24 hours each time, then we can say whether the prediction was right or wrong, or more precisely, what the differential was between what was predicted and what actually happened. For example, missing a maximum temperature prediction by a single degree is presumably less bad than missing it by 10 degrees. We need a tolerance of “rightness”.

{As a sidebar, there are complicating factors here: a difference of two degrees in the minimum temperature, say a prediction of +1 degree centigrade against an actual of -1, may be more significant than a similar two-degree difference between 17 and 19 degrees centigrade. For example, my key question could have been: do I need to send out gritting machines tomorrow because of icy roads?}
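To make that concrete, below is a minimal sketch of what a “tolerance of rightness” check might look like in Python. The tolerance values, the near-freezing rule and the function names are illustrative assumptions only, not taken from any real forecasting system.

# Sketch: comparing a predicted value with the actual outcome within a tolerance.
# The tolerances and the near-freezing rule are assumptions for illustration.

def within_tolerance(predicted: float, actual: float, tolerance: float) -> bool:
    """Pass if the prediction is within +/- tolerance of the actual value."""
    return abs(predicted - actual) <= tolerance

def min_temp_check(predicted: float, actual: float) -> bool:
    """Context-dependent check for the minimum temperature.

    Missing by two degrees around freezing (predicting +1 C when it was -1 C)
    matters more than the same miss at 17 C versus 19 C, because it changes the
    answer to "do I send out the gritting machines?", so the tolerance tightens
    when either value is near zero.
    """
    near_freezing = min(predicted, actual) <= 2.0
    tolerance = 1.0 if near_freezing else 3.0   # assumed thresholds
    return within_tolerance(predicted, actual, tolerance)

# A two-degree miss fails near freezing but passes in milder weather.
print(min_temp_check(1.0, -1.0))   # False
print(min_temp_check(19.0, 17.0))  # True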

My pass criteria for the application are then focussed on its accuracy, and the levels of accuracy may carry different weight for different outcomes. There may be a higher weighting for extreme events, such as correctly predicting hurricane-force winds or heat-waves.
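One way of turning “different weight for different outcomes” into a concrete pass criterion is a weighted score. Everything in the sketch below, the per-variable weights, the extreme-event multiplier and the sample errors, is an assumption made purely for illustration.

# Sketch: a weighted accuracy score in which extreme events count for more.
# The weights and the extreme-event multiplier are illustrative assumptions.

VARIABLE_WEIGHTS = {
    "max_temp": 1.0,
    "min_temp": 1.0,
    "rainfall": 1.0,
    "max_wind": 1.5,   # wind misses assumed to matter more
}

def daily_score(errors: dict, extreme_event: bool) -> float:
    """Lower is better: the sum of absolute errors, weighted per variable, and
    multiplied up on days with extreme events (hurricane-force winds,
    heat-waves) so that missing those days hurts the score the most."""
    base = sum(VARIABLE_WEIGHTS[name] * abs(err) for name, err in errors.items())
    return base * (3.0 if extreme_event else 1.0)

# The same raw errors score three times worse on an extreme-event day.
errors = {"max_temp": 2.0, "min_temp": 1.0, "rainfall": 0.5, "max_wind": 5.0}
print(daily_score(errors, extreme_event=False))  # 11.0
print(daily_score(errors, extreme_event=True))   # 33.0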

We also need to consider the business need here (as ever). We can assume, and verify, that we expect this application to be better than what we do currently, whether that is weather experts working alone or weather experts using previous-generation (pre-AI) computations.

One common approach, therefore, would be a parallel run, with the old and new methods of forecasting running alongside each other and compared against the actual outcomes. We can award weighting “bonus points” for predicting the more difficult extreme events. We could also compare both models with a random model, and with a naive model that, for example, simply assumes tomorrow’s weather will be identical to today’s.
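As a sketch of how that parallel run might be scored, the snippet below compares predictions against the measured outcomes, using a naive persistence baseline as one of the contenders. The plain mean absolute error used here is a stand-in; in practice it would be the weighted, extreme-event-aware score sketched above, and the prediction lists named in the trailing comments are hypothetical.

# Sketch: scoring a parallel run of old method, new AI application and a naive
# persistence baseline against what actually happened over the same period.

def persistence_model(todays_weather: dict) -> dict:
    """Naive baseline: assume tomorrow's weather will be identical to today's."""
    return dict(todays_weather)

def mean_abs_error(predictions: list, actuals: list) -> float:
    """Average absolute error across all days and all predicted variables."""
    total, count = 0.0, 0
    for predicted, actual in zip(predictions, actuals):
        for key, value in actual.items():
            total += abs(predicted[key] - value)
            count += 1
    return total / count

# Toy example: two days of outcomes, with the persistence baseline echoing the
# previous day's record as its prediction for the next day.
actuals    = [{"max_temp": 14.0, "rainfall": 2.0}, {"max_temp": 11.0, "rainfall": 6.0}]
yesterdays = [{"max_temp": 15.0, "rainfall": 0.0}, {"max_temp": 14.0, "rainfall": 2.0}]
baseline   = [persistence_model(day) for day in yesterdays]
print(mean_abs_error(baseline, actuals))  # 2.5

# In the real parallel run we would collect a daily prediction from each method
# plus the measured outcome, then compare, e.g.
#   mean_abs_error(ai_predictions, actuals) versus
#   mean_abs_error(old_method_predictions, actuals) and the baseline above.
# The new application only earns its keep if it beats both.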

Do we have to wait for a parallel run before we know if the new application is better than the current method?

One technique is to take the original source data and partition it as randomly as possible. Only one such partition is used as the primer data for the application. The remaining data is then used in a test run: the historic records become the test input, and their known actual outcomes become the measuring stick for the correctness of the application’s predictions. The test-run results are then compared with the actual results from the data that has been “hidden” from the application.
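A rough sketch of that partitioning idea follows, assuming the historic data is simply a list of daily records; the 80/20 split, the fixed seed and the primer/hidden naming are all assumptions made for illustration.

# Sketch: randomly partition the historic data, prime the application with one
# partition, and keep the rest hidden as the test set.

import random

def split_history(records: list, primer_fraction: float = 0.8, seed: int = 42):
    """Shuffle the historic daily records and split them into primer and hidden sets."""
    shuffled = records[:]                  # copy, so the original order is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the test run repeatable
    cut = int(len(shuffled) * primer_fraction)
    return shuffled[:cut], shuffled[cut:]

# primer_data feeds the application as its source data; hidden_data is replayed
# day by day as test input, with each record's known outcome acting as the
# measuring stick for the prediction the application produces:
#   primer_data, hidden_data = split_history(historic_records)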

Whichever method we choose, our tablet of stone still holds true – just. But we need a more subtle way of expressing the expected result; it becomes more of a bounded expected result. Examples (sketched as simple checks after the list) could be:

  • If the result is right nine times out of ten
  • If the result is better than the previous method x% of the time
  • If the results show that extreme events are predicted correctly more often than the more mundane events
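
To show how such bounded expected results might be written down as explicit checks, here is one last sketch. The thresholds simply mirror the examples in the list above (with the “better than the previous method” fraction picked arbitrarily in place of x%), and the hit-rate inputs are assumed to come from a test or parallel run.

# Sketch: bounded expected results expressed as explicit pass criteria.
# The thresholds mirror the list above and are otherwise arbitrary.

def passes_bounded_criteria(new_hits: int, total_days: int,
                            days_better_than_old: int,
                            extreme_hit_rate: float,
                            mundane_hit_rate: float,
                            required_better_fraction: float = 0.6) -> bool:
    """Return True only if every bounded expectation is met."""
    right_nine_in_ten  = new_hits / total_days >= 0.9
    better_than_before = days_better_than_old / total_days >= required_better_fraction
    extremes_caught    = extreme_hit_rate >= mundane_hit_rate
    return right_nine_in_ten and better_than_before and extremes_caught

# Example: right 92% of the time, better than the old method on 70% of days, and
# extreme events caught more reliably than mundane ones, so the application passes.
print(passes_bounded_criteria(92, 100, 70, extreme_hit_rate=0.8, mundane_hit_rate=0.75))  # True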

Whatever we choose, the construction of our expected-result column requires more thought, more business and end-user involvement, and more real-world grounding, and that can be no bad thing. So while that tablet of stone stays intact, we may need to add a few chalked-on caveats as we move into the brave new world where machines, at last, start to do things better than we can.
