Tuesday 21 September 2010

Software testing and a talk by Peter Dybjer

Another thing I wanted to write about and not forget was a lovely LFCS seminar I heard on Tuesday by Peter Dybjer. It was entitled "Program testing and constructive validity" and was about applying insights from software testing to the problem of validating judgements in formal systems, specifically Martin-Löf type theory.

Of course, I was one of a handful of members of the audience more comfortable with testing than with type theory, and I was wondering whether there is anything that can be taken back the other way: can we gain any insight into software testing from this? In the end I think the answer is "probably not really" (and of course, it would not have been reasonable to expect anything else), but I am thinking about testing anyway because I'm very much looking forward to a Dagstuhl workshop on Model-based testing that's taking place in a few weeks' time, so the connection was interesting to see. The rest of this post gives a little background on model-based testing, and then a few words on the way Peter used testing and on possible connections.


The basic idea of model-based testing is that you make use of a model of the system's behaviour in testing the actual system (the "system under test", SUT). For example, you might have a finite state machine that is supposed to capture the interaction between the system and the outside world, in some suitable sense, and then you analyse the FSM in order both to generate a sufficient set of concrete tests and to check the system's output. A testing model could be a grammar, or a UML model, or what you will, provided that it can somehow express requirements. Given a suitable testing model, test generation, running and results checking can be done automatically, and the hope is that the effort of building or deriving the models for MBT is small compared with the effort of writing or generating tests in other ways.

Of course if the testing model is wrong - that is, its behaviour is not what the customer actually requires - we are in trouble; generally, any discrepancy shown up by running a test is supposed to expose a bug in the SUT. If, however, we knew that the SUT was perfect, we could equally well regard a discrepancy as exposing a bug in the testing model. (And hands up everyone who's ever found that a test failed because there was a bug in the test...) The point is that running one test and seeing the expected result is a simple experiment that fractionally, Bayesianly, increases our confidence that the testing model and the SUT are consistent in some appropriate sense; and more importantly, running one test and not seeing the expected result not only reduces our confidence of that, but gives us a strong hint about where in the models to look for the source of the discrepancy. (You would not pay much to a testing company that would run your tests and tell you afterwards how many failed, but not which ones.)
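To make this concrete, here is a minimal sketch in Python of the sort of thing I mean; the turnstile model, the test generator and the Turnstile class are all my own toy inventions rather than anything from a real MBT tool. The FSM is the testing model, every input sequence up to a small bound becomes a test together with the outputs the model predicts, and each test is run against a separately written implementation standing in for the SUT.

```python
# A minimal, hypothetical sketch of model-based testing: a finite state
# machine acts as the testing model, tests are walks through it, and each
# step's expected output is checked against the system under test.
from itertools import product

# Testing model: a toy turnstile FSM. Keys are (state, input) pairs,
# values are (next state, expected output).
MODEL = {
    ('locked', 'coin'):   ('unlocked', 'thank you'),
    ('locked', 'push'):   ('locked',   'no entry'),
    ('unlocked', 'coin'): ('unlocked', 'thank you'),
    ('unlocked', 'push'): ('locked',   'clunk'),
}

def generate_tests(max_length=3):
    """Enumerate every input sequence up to max_length, together with
    the sequence of outputs the model predicts for it."""
    for n in range(1, max_length + 1):
        for inputs in product(['coin', 'push'], repeat=n):
            state, expected = 'locked', []
            for event in inputs:
                state, out = MODEL[(state, event)]
                expected.append(out)
            yield list(inputs), expected

class Turnstile:
    """The system under test: an independent, possibly buggy implementation."""
    def __init__(self):
        self.locked = True

    def step(self, event):
        if event == 'coin':
            self.locked = False
            return 'thank you'
        if self.locked:
            return 'no entry'
        self.locked = True
        return 'clunk'

failures = 0
for inputs, expected in generate_tests():
    sut = Turnstile()
    actual = [sut.step(event) for event in inputs]
    if actual != expected:
        failures += 1
        # A failure names the offending input sequence, not just "something failed".
        print('FAIL on', inputs, ':', actual, 'versus', expected)
print(failures, 'discrepancies between the testing model and the SUT')
```

Note that a failing test is reported against the exact input sequence that exposed it - the "which ones failed" information the parenthetical above was after.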

In Peter's talk, the thing under scrutiny is a logical system: a collection of rules for how to construct judgements such as "under these assumptions, a and a' are equal as members of type A". The issue is: how do you convince somebody that your system is correct? Of course there are different kinds of conviction that might be required. Somebody might want to be convinced that the system as a whole has some desirable property, such as not contradicting itself (this is what Peter called a meta-mathematical issue). Someone else might want to be convinced that the rules of your system are individually true, and you might want, as far as possible, to address each rule separately, without requiring them to understand your system as a whole, and without relying on complex mathematical gadgets in your argument. This is what Peter called a pre-mathematical approach: if we do it successfully, then the conviction produced helps to provide a foundation for the system, a reason for considering it reliable as a way of answering questions. The trouble with complex mathematical gadgets, and what he is trying to avoid in this work, is that results produced using them are only as reliable as the gadgets themselves. We may recall the famous Knuth quotation "Beware of bugs in the above code; I have only proved it correct, not tried it."
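To illustrate what testing a single rule might look like, here is a toy example of my own (not Peter's construction): the candidate judgement is "0 + a = a : Nat", with naturals in unary notation and addition defined by recursion on its second argument, so the equation is not true merely by unfolding the definition, and checking an instance genuinely computes something.

```python
# A toy sketch of gaining confidence in one rule at a time by testing it on
# concrete instances, rather than by appeal to a model of the whole system.
def numeral(n):
    """The unary numeral for n: zero is ('zero',), successor is ('succ', m)."""
    return ('zero',) if n == 0 else ('succ', numeral(n - 1))

def add(a, b):
    """a + b, defined by recursion on b."""
    return a if b == ('zero',) else ('succ', add(a, b[1]))

zero = ('zero',)
for n in range(20):
    a = numeral(n)
    # Evaluate both sides of the candidate rule "0 + a = a" and compare.
    assert add(zero, a) == a, f'rule 0 + a = a fails at a = {n}'
print('rule 0 + a = a held on every instance tried')
```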

One traditional way to convince someone that a logical system is correct, which does require a complex mathematical gadget, is to explain it using a realizability model. This is basically where you provide a relationship between things in the system and things the person you need to convince already has a good understanding of and knows how to compute with. You demonstrate that everything in your system has at least one related well-understood thing, and that all the computations that can happen in your system correspond to computations that can happen in the well-understood place. This is fine provided that you have a realizability model that you're happy to accept you understand, but it rests completely on that, and there isn't a way to accept the realizability model bit by bit: you have to swallow it whole.
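At the risk of trivialising it, here is a cartoon of that idea in code. It is nothing like an actual realizability model; it only gestures at the shape of the requirement that every term the system accepts is realized by something well understood, and that computations correspond. The little table of terms and the interpretations are made up for the purpose.

```python
# A cartoon of the realizability idea: interpret each type of the small
# system as a collection of values we already trust (plain Python values),
# and check that every term the system accepts has a realizer of the right
# kind in the interpretation of its type.
interpret = {
    'Nat':  lambda v: isinstance(v, int) and v >= 0,
    'Bool': lambda v: isinstance(v, bool),
}

# (term of the system, type the system assigns it, candidate realizer)
terms = [
    ('zero',        'Nat',  0),
    ('succ zero',   'Nat',  1),
    ('iszero zero', 'Bool', True),
]

for term, ty, realizer in terms:
    assert interpret[ty](realizer), f'{term} lacks a realizer in {ty}'

# ...and a computation in the system ("succ zero + zero evaluates to
# succ zero") should correspond to a computation on realizers (1 + 0 == 1).
assert 1 + 0 == 1
print('every listed term is realized, and the sample computation corresponds')
```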

A testing model, although it can be used to generate dynamic tests together with their expected results, is a static object: it captures a whole set of requirements on the SUT. Running, and checking the result of, an individual test, on the other hand, is a dynamic procedure. This seems a little bit like: a logical system is to a realizability model as a testing model is to the SUT. You produce a test from the testing model, together with its expected results. To find out whether it passes or fails, you have to run it on the SUT. You could, of course, use the same testing model against two different versions of the SUT, run the same test, and find that on one SUT it passed and on the other it failed. In Peter's world, you produce a test from the logical system, but there doesn't seem to be an explicit notion of a model against which the test is run, such that different models might give different results. I get a bit confused here, but I think this is because it is the logical system itself which is under scrutiny: a discrepancy (between it and the single, implicit model) would be taken as a bug in the logical system itself. Peter makes a strong distinction between the meta-mathematical semantics of a system, in which realizability models live, and the pre-mathematical world in which his testing approach lives. In the pre-mathematical world, the real world does duty for the SUT.
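The "same test, different verdicts" point is perhaps easiest to see in a few lines of deliberately trivial code of my own (the model and both SUT versions are made up):

```python
# One test is generated once from the model, then run against two versions
# of the system under test; the verdict depends on the version.
def model_square(x):           # the testing model's prediction
    return x * x

def sut_v1(x):                 # one version of the SUT
    return x * x

def sut_v2(x):                 # another version, with a bug
    return x + x

test_input = 3
expected = model_square(test_input)
for name, sut in [('v1', sut_v1), ('v2', sut_v2)]:
    verdict = 'pass' if sut(test_input) == expected else 'FAIL'
    print(name, verdict)       # prints: v1 pass, then v2 FAIL
```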

An interesting aspect of Peter's talk which might have applicability in the MBT world is that a running test can get stuck for lack of suitable data needed during the course of its execution, in which case it solves its problem by going back to the logical system to generate that data on demand, rather like the questioning of the questioner that happens in game semantics. This obviates the need to have a complex structure against which to test - all the complication is in the logical system. In MBT, the sort-of-equivalent would be that instead of generating a single-level test and running it to completion against the SUT - which, in the case of a test that invokes complex computation, may lead to difficulty in debugging the SUT if the test fails - one might generate a test which is somehow able, and instructed, to invoke the testing model's version of certain functionality when required - functionality which obviously has to be actually present in the testing model, not something deep in the design of the SUT. This is just using the testing model to stub out parts of the system's functionality, but I wonder whether anyone has explored using this approach to increase the efficiency of MBT. There would be some engineering problems in making it practical, I suppose, and I don't know whether it would be worthwhile.
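One very crude way this might look in code is the following sketch, with all the names hypothetical: the part of the SUT under test takes its helper functionality as a parameter, and the harness passes in the testing model's own version of that helper, so a failure points at the part under test rather than at the helper computation.

```python
# Using the testing model to stub out part of the system's functionality:
# the discount rule is supplied by the model, so only the checkout logic
# of the SUT is actually on trial in this test.
def model_discount(total):
    """The testing model's reference version of the discount rule."""
    return total * 0.9 if total > 100 else total

def model_checkout(items):
    """The testing model's prediction for the whole checkout step."""
    return round(model_discount(sum(items)), 2)

def sut_checkout(items, discount):
    """The SUT fragment under test, with its helper injected."""
    total = sum(items)
    return round(discount(total), 2)

test_items = [60, 70]
expected = model_checkout(test_items)
actual = sut_checkout(test_items, discount=model_discount)
print('pass' if actual == expected else 'FAIL', actual, expected)
```

The mechanism here is just ordinary dependency injection, which is perhaps why it feels like plain stubbing; the only twist is that the stub comes from the testing model rather than being written by hand.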

Anyway, thanks to Peter for a very thought-provoking talk, and to John Longley for explaining to me (again) what a realizability model is. All errors are mine, and I apologise for them.
