Software Engineering (Research)

Tuesday, 21 September 2010

Software testing and a talk by Peter Dybjer

Another thing I wanted to write about and not forget was a lovely LFCS seminar I heard on Tuesday by Peter Dybjer. It was entitled "Program testing and constructive validity" and was about applying insights from software testing to the problem of validating judgements in formal systems, specifically Martin-Löf type theory.

Of course, I was one of a handful of members of the audience more comfortable with testing than with type theory, and I was wondering whether there is anything that can be taken back the other way: can we gain any insight into software testing from this? In the end I think probably the answer is "probably not really" (and of course, it would not have been reasonable to expect anything else), but I am thinking about testing anyway because I'm very much looking forward to a Dagstuhl workshop on Model-based testing that's taking place in a few weeks' time, so it was interesting to see. The rest of this post has a little background on model-based testing, and then a few words on the way Peter used testing and possible connections.

The basic idea of model-based testing is that you make use of a model of the system's behaviour in testing the actual system (the "system under test", SUT). For example, you might have a finite state machine that is supposed to capture the interaction between the system and the outside world, in some suitable sense, and then you analyse the FSM in order both to generate a sufficient set of concrete tests and to check the system's output. A testing model could be a grammar, or a UML model, or what you will, provided that it can somehow express requirements. Given a suitable testing model, test generation, running and results checking can be done automatically, and the hope is that the effort of building or deriving the models for MBT is small compared with the effort of writing or generating tests in other ways. Of course if the testing model is wrong - that is, its behaviour is not what the customer actually requires - we are in trouble; generally, any discrepancy shown up by running a test is supposed to expose a bug in the SUT. If, however, we knew that the SUT was perfect, we could equally well regard a discrepancy as exposing a bug in the testing model. (And hands up everyone who's ever found that a test failed because there was a bug in the test...) The point is that running one test and seeing the expected result is a simple experiment that fractionally, Bayesianly, increases our confidence that the testing model and the SUT are consistent in some appropriate sense; and more importantly, running one test and not seeing the expected result not only reduces our confidence of that, but gives us a strong hint about where in the models to look for the source of the discrepancy. (You would not pay much to a testing company that would run your tests and tell you afterwards how many failed, but not which ones.)

In Peter's talk, the thing under scrutiny is a logical system: a collection of rules for how to construct judgements such as "under these assumptions, a and a' are equal as members of type A". The issue is: how do you convince somebody that your system is correct? Of course there are different kinds of conviction that might be required. Somebody might want to be convinced that the system as a whole has some desirable property, such as not contradicting itself (this is what Peter called a meta-mathematical issue). Someone else might want to be convinced that the rules of your system are individually true, and you might want, as far as possible, to address each rule separately, without depending on their understanding your system as a whole, and without relying on complex mathematical gadgets in your argument. This is what Peter called a pre-mathematical approach: if we do it successfully, then the conviction produced helps to provide a foundation for the system, a reason for considering it reliable as a way of answering questions. The trouble with complex mathematical gadgets, and what he is trying to avoid in this work, is that results produced using them are only as reliable as they are themselves. We may recall the famous Knuth quotation "Beware of bugs in the above code; I have only proved it correct, not tried it."

One traditional way to convince someone that a logical system is correct, which does require a complex mathematical gadget, is to explain it using a realizability model. This is basically where you provide a relationship between things in the system and things the person you need to convince already has a good understanding of and knows how to compute with. You demonstrate that everything in your system has at least one related well-understood thing, and that all the computations that can happen in your system correspond to computations that can happen in the well-understood place. This is fine provided that you have a realizability model that you're happy to accept you understand, but it rests complete on that, and there isn't a way to accept the realizability model bit by bit: you have to swallow it whole.

A testing model, although it can be used to generate dynamic tests together with their expected results, is a static object: it captures a whole set of requirements on the SUT. Running, and checking the result of, an individual test, on the other hand, is a dynamic procedure. This seems a little bit like: a logical system is to a realizability model as a testing model is to the SUT. You produce a test from the testing model, together with its expected results. To find out whether it passes or fails, you have to run it on the SUT. You could, of course, use the same testing model against two different versions of the SUT, run the same test, and find that on one SUT it passed and on the other it failed. In Peter's world, you produce a test from the logical system, but there doesn't seem to be an explicit notion of a model against which the test is run, such that different models might give different results. I get a bit confused here, but I think this is because here it is the logical system itself which is under scrutiny: a discrepancy (between it and the single, implicit model) would be taken as a bug in the logical system itself. Peter makes a strong distinction between the meta-mathematical semantics of a system, in which realizability models live, and the pre-mathematical world in which his testing approach lives. In the pre-mathematical world, the real world does duty for the SUT.

An interesting aspect of Peter's talk which might have applicability in the MBT world is that a running test can get stuck for lack of suitable data needed during the course of its execution, in which case it solves its problem by going back to the logical system to generate that data on demand, rather like the questioning of the questioner that happens in game semantics. This obviates the need to have a complex structure against which to test - all the complication is in the logical system. In MBT, the sort-of-equivalent would be that instead of generating a single-level test and running it to completion against the SUT - which, in the case of a test that invokes complex computation, may lead to difficulty in debugging the SUT if the test fails - one might generate a test which is somehow able and instructed to invoke the testing model's version of certain functionality - which obviously has to be functionality actually present in the testing model, not something deep in the design of the SUT - when required. This is just using the testing model to stub out parts of the system's functionality, but I wonder whether anyone has explored using this approach to increase the efficiency of MBT. There would be some engineering problems in making it practical, I suppose, and I don't know whether it would be worthwhile.

Anyway, thanks to Peter for a very thought-provoking talk, and to John Longley for explaining to me (again) what a realizability model is. All errors are mine, and I apologise for them.

Wednesday, 15 September 2010

On sequential composition of bidirectional model transformations

This is a partial brain dump prompted by being part of the way through reading this paper by Martin Hofmann, Benjamin Pierce and Daniel Wagner. It's also influenced, naturally, by interesting conversations with Benjamin and various other people over the years. ObDisclaimer: this is a blog post, not a paper, and it does not contain a related work section. Comments and pointers to work I might not know about most welcome! This post may still change.

First a quick introduction for anyone who doesn't already know, following the presentation here because it's my blog: a bidirectional model transformation is a program which allows you to keep two (or more, but two is enough for now) models consistent, when either of them may change. So such a transformation embodies a notion of consistency:

T \subseteq M x N

(just one notion of consistency, not two directional notions) along with a pair of functions

Tforwards : M x N -> N

which tells you how to modify a model from N if you take an old version of it together with an authoritative m in M, and dually

Tbackwards : M x N -> M

In some particularly simple cases, we may be able to do without one of the arguments in one or both of these functions; for example, if N is an abstraction - a view - of M, containing no information not also present in M, then the transformation does not need to consult an old version of n in N and the N argument is redundant. In general, however, either model may contain information that the other does not.

One goes on to ask what constraints must hold on T, Tforwards and Tbackwards to make them form a "reasonable" bidirectional transformation. For a start, the functions have to be "correct" - that is, they must succeed in enforcing consistency - and "hippocratic" - that is, if their arguments are already consistent, then the function must not propose any change. This doesn't feel like enough, however: the problem is that a transformation can be correct and hippocratic and still have the property that (almost) "all bets are off" if its arguments are not consistent to start with. In that case, a transformation which only has to be correct and hippocratic is at liberty to throw away all information from the old model and just pick its favourite from all the models that are consistent with the given authoritative model. Intuitively this is not sensible: although there are real decisions for the transformation writer to make - there can certainly be multiple sensible choices of how to restore consistency - we do expect a transformation, whose job is to restore consistency, to do its job without avoidable vandalism.

There are various ways to handle this; one can put extra constraints ("undoability", "PUTPUT" in the asymmetric case) on the transformation, but these tend to be considered too strong; or if transformations are built up from transformation elements as in bidirectional programming, one at least intuitively gets something like this for free (continuity..). Meertens discusses the use of metrics; I think there may be more to be said, though, at least in the context of models specifically, and perhaps more generally.

What has any of this to do with sequential composition? Well, as good little computer scientists, we naturally want to be able to compose things once we've defined them. Consistency is a relation, and we know how to compose relations; it's natural to say that the notion of consistency captured by T1;T2 should be

(T1;T2)(m,p) iff exists n s.t. T1(m,n) and T2(n,p)

This immediately exposes the problem for defining the corresponding transformation functions: the apparent need to "guess" an element of the elided missing model type. Hmm. Let's try to be good little software engineers instead, take a step back, and think about why we might want to be able to compose bidirectional model transformations, in order to get some requirements for a notion of composition.

In model-driven development, we may have a chain, or even a network, of models describing different aspects of the world and/or the artefacts we are building, each being worked on by different people. At certain points in time, e.g. software release dates, we will want at least a certain collection of these to be guaranteed consistent, in the sense that there exists a model of them all in the model-theoretic sense, viz., the actual software in the world which should be consistent with all (or a specified subset of) the models in the MDD sense. An individual bidirectional transformation captures the correctness of some pair of models. Imagine a graph with model types as nodes and bidirectional transformations as edges.

We would certainly like it if overall consistency meant nothing more nor less than the conjunction of the notions of consistency captured by each of the transformations. We'd also like confluence: if our models are inconsistent, there may be different sequences in which we can apply our bidirectional transformations so as to restore consistency, and we'd like it if they all gave the same result. A little more precisely: suppose we label some of our nodes as authoritative i.e. not to be touched, and others as result. Now imposing consistency means applying the transformations in some order, without ever applying a transformation in the direction of an authoritative model, until nothing changes - that is, the whole set of models is consistent - and then looking at the result model(s). It's only too easy to imagine doing this blindly at 3am when a release is overdue: wouldn't it be nice to know that the result model(s) will depend only on the set of models we started with, and on which of them we deemed authoritative, but not on the order in which we applied the transformations? Unfortunately, a moment's thought shows that this is unachievable without imposing restrictions on the transformations (undoability, or PUTPUT in the lens world) which are generally thought too strong.

An important special case is applying a transformation when the non-authoritative model does not exist yet: model sets can be equipped with a special no-information model to act as the argument in this case. In such a case, the transformation functions should compose just as the ordinary functions they are.

An interesting aspect of this kind of "composition" is that all the models involved may have equal status: in particular, they all actually exist, perhaps even in the same CMS if you're really lucky.

There is at least one other scenario in which we may wish to compose bidirectional transformations, though, and it's pragmatically a bit different. Writing bidirectional transformations correctly is hard work, and there is every chance that you will want to reuse a transformation written by someone else. If you just want straight reuse, well and good, but more likely, you will want to customise the transformation that you buy in or otherwise reuse. IIRR (can't be bothered to check right now) this ability to customize transformations was one of the requirements in the OMG document that led to the QVT standard - however, the standard does not address it, at least for QVT-R.

There are various kinds of customisation you might want to do, of course - perhaps you want parameterised or higher order transformations - but let's suppose that what you want is related to sequential composition in the sense that you want to do one or both of:

- guddling systematically with your models before they are given as input to the reused transformation

- guddling systematically with the output of the transformation, before continuing to use the output model.

First let's consider the easiest case, in which the guddling of a model doesn't depend on any other model and doesn't alter the type of the model. Given T on M x N and f : M -> M we can define

(f;T) (m,n) iff T(f(m),n)

(f;T)forward (m,n) = Tforward (f(m),n)

(f;T)backward (m,n) = Tbackward (f(m),n)

In fact, looking at this we can see that it doesn't matter if f : M' -> M; provided the target of f matches the appropriate input type of T we'll be fine.

Dually, we can define T;g where g : N -> N' : no problem.

My guess is that this is actually enough to meet most customisation needs, but suppose not. Maybe we buy a nontrivially bidirectional transformation

S: FavouriteObscureModellingLanguage <-> UML

from the vendor of our FavouriteObscureModellingLanguage tool, and another nontrivially bidirectional transformation

T: UML <-> RDBMS

from our database vendor, and suppose we don't care about UML and there's noone in our company who speaks UML, but we want to be able to work on our model in FavouriteObscureModellingLanguage and on our RDBMS schema and keep those in sync. If we are happy to generate an initial UML model from one end or the other and keep it around, then we're back in the original case - it just so happens that the UML model will only change when a transformation tells it to. But what if our CMS expert is a vehement hater of UML and will not permit a UML model to sully his pristine repository? Are there interesting conditions under which we can still somehow use S and T to give us a composed

S;T: FavouriteObscureModellingLanguage <-> RDBMS ?

Obviously, since none of us ever touch UML, the absence of the UML model isn't really depriving our company of any information. In particular, provided we have all previous versions of the models we do touch in our CMS, together with a timeline on when transformations were run, we could in principle just rebuild the latest UML model by reapplying all the previously run transformations, use it to apply this transformation, then throw it away. Storing and using previous versions is pretty much like storing and using edits, and this is closely related to what the graph transformation people do: parse the input graph to see where it came from - hoping fervently that the parse is unique - then do the same on the other side.

What makes this impractical is that to rerun stage n of the transformation - say, the one that ended up modifying r : RDBMS to r' to be consistent with authoritative f : FavouriteObscureModellingLanguage - we really do need to construct the very UML model we had at the end of that transformation stage. It won't do to somehow guess any old u : UML s.t. S(f,u) and T(u,r') and hope. For if the one we really had was u', it could well be that Tforward(Sforward(f',u'),r') is not the same as Tforward(Sforward(f',u),r') which would mean we're stuffed next time we try to amend r' in accord with authoritative f'. Indeed, our attempt at composition could fail to be hippocratic, because it could happen that S(f',u') and T(u',r') but one of those fails with our u in place of the genuine u'. Maybe there are some friendly conditions under which this can't happen (beyond the lens case) but I don't know any offhand. It might be useful to compare "keep as much information as we possibly can" vs "keep only the information we're forced to keep" policies for choosing u.

Still to come: relating this properly to the paper that prompted this post, or, come to that, to my own previous work on an algebraic approach to bidirectional transformations.

Wednesday, 8 September 2010

Combinatorial species

I'm reading an article (to appear in proceedings of Haskell'10) on combinatorial species by Brent Yorgey. Very interesting, although as I don't speak Haskell, the explanations were sometimes harder to understand than the things they were explaining. I did offer to tutor Haskell this year in the hope that I'd learn it, but I doubt it's going to happen. For purposes of reading this paper, the main thing you have to know, e.g. from here is that a "class" is an interface, and an "instance" of a "class" is a thing that implements the interface! I do hope they don't cover this aspect of Haskell in the course that precedes our first Java course...

Thursday, 14 May 2009

Milner's bigraphs

[This post is still changing!]

I'm currently attending a course by Robin Milner on Bigraphs. This is interesting in many ways, not least, to me, for how many of the things I've been interested in over the years it ties together. There must obviously be connections with MDD, but what's less clear to me right now is which connections are worth exploring. In some sense the bigraph formalism is a metametamodelling language, in which the metamodels are the bigraphical reactive systems and the models are maybe the states (configurations) of those systems? A bigraph consists of a place structure and a link structure: the place structure gives the containment hierarchy of parts, while the link structure specifies what can talk to what. A reaction rule can allow a ground bigraph to change either or both of these structures. Anyway, the course is not over yet, but I ordered the book.

This morning Jean Krivine gave a nice talk about the application of bigraphs to biological systems.

Off to hear about the category theory...

Monday, 11 August 2008

Expressing bidirectional transformations

As part of the preparation for my invited talk at the International Conference on Graph Transformations I have been trying to understand the relationships between triple graph grammars, lenses and QVT. In particular, how does their expressivity compare, and how does it relate to the general notions of computability and coherent transformations? Some useful email from PC chair Reiko Heckel made me realise that I was failing to think carefully enough about basic notions.

A function f : M -> N is called computable if it can be implemented by a Turing machine; it turns out that this is a very robust notion, so that, for example, you can say "it's computable if you can write a program that computes it" and it basically doesn't matter what programming language you're thinking about. In particular, expressivity for actual programming languages is something of a non-question: they are all equi-expressive.

Is the situation equally boring for bidirectional transformations between M and N? We have to decide what M and N are, but that's probably not very important, any more than it's important to know what the type of our ordinary function was. Any of the bidirectional languages we've been thinking about can be looked at as means of specifying three things:

the consistency relation, a subset of M x N
the forward transformation, a function M x N -> N
the backward transformation, a function M x N -> M

Each of these three elements can be considered as an ordinary function, and then we can ask whether the corresponding function is computable or not. If you wanted to write your bidirectional transformation in your favourite ordinary programming language, you could write three programs: the first would take elements of M and N and return true if they were consistent, else false, the second would take elements of M and N and return an element of N, etc. Since your favourite programming language can express all computable functions, and since we can't hope for any more (because if we had a bidirectional transformation language which could express something with an uncomputable component, we could use it as an ordinary programming language by disregarding everything else), this gives us a notion of computable bidirectional transformation, and any standard programming language can express all computable bidirectional transformations. Boring.

Next attempt: the point about bidirectional languages is that you don't write separate programs for those three elements: you write one text, which somehow encodes all three. There's something deeply unsatisfactory about saying "yes, my language can express bidirectional transformations, you just [do something equivalent to writing the three things separately]". Formally, what exactly is unsatisfying? On the face of it, the notion of "one text to express all three things" is not really a formal notion, but a pragmatic one. The point of it is to ease the software engineering process. In order for a bidirectional transformation to be sane (whatever that means) the three elements have to stay consistent. For example, one kind of sanity, discussed last time, is correctness -- after applying a transformation (defined in 2. or 3.) we expect consistency (defined in 1.) to be restored. So, if we change the definition of consistency (1.) we may well have to change the transformations (2. and 3.) to make sure that the whole bidirectional transformation stays correct. Busy software engineers are not very good at remembering that when you do X you also have to do Y. (Let X be "update the code" and Y be "update the documentation" for example...) So the name of the game is to remind the engineer about Y when they do X; or, to be a bit more ambitious, to make it impossible for them to do X without also doing Y. For example, it's better to have a variable called widgetsInStock (and no comment) than to have a variable called x and a comment "x is the number of widgets in stock".

No mention of "same text" yet -- specifying several things in one text is just one possible way to achieve that aim. At the extreme, if we complain to our Java programmer that we don't like the use of Java as a bidirectional programming language, because the three things are specified in different methods, we will not be much happier if the programmer modifies their bidirectional transformation class to have only one method which begins with a switch statement...

What the language of lenses, and to a lesser extent the other languages like TGGs and QVT, do, is in effect to limit the expressivity of the language to only those triples that are well-behaved in some language-dependent sense. In Focal, for example, it's a key part of the language design to ensure that the basic lenses satisfy some lens laws, and that the ways of composing basic lenses into compound lenses preserve those lens laws. The idea is that it is not possible for the lens programmer to write a bidirectional transformation which is insane, for Focal's idea of sanity. Even, maybe, that it isn't possible for the programmer to think about writing an insane program!

That's interesting. Given a language which can express all computable bidirectional transformations -- including the many insane ones -- an obvious question is "is it decidable whether a transformation is sane?" Excuse me while I demonstrate that it's too long since I tutored Computability and Intractability, but it's at least tempting to think that the answer to that is always going to be No, by Rice's Theorem. Here's Rice's Theorem as stated in Papadimitriou's Computational Complexity, p62.

Suppose that C is a proper, non-empty subset of the set of all recursively enumerable languages. Then the following problem is undecidable: Given a Turing machine M, is L(M) in C?

To apply this, let's suppose we wrap up a given computable bidirectional transformation in a Turing machine which takes as input an indicator of what the machine is supposed to do (check consistency, apply a forward transformation, or apply a backward transformation) together with appropriate arguments for the task. If the machine is to check consistency, the appropriate arguments are a pair of models and a boolean: the machine should accept the input iff the boolean correctly indicates whether or not the models are consistent. If it is to do a forward transformation, the arguments are a triple of models, and the machine accepts iff the third model is the transformation's output on being given the first two; and dually.

Regardless of whether the transformation is sane or not, this is a reasonable definition of a Turing machine, so the language accepted by it is recursively enumerable. Modulo some boring coding, it'll have the form of a set of strings which might be

(consistent, M, N, true)
(consistent, M, N, false)
(forward, M, N, N')
(backward, M, N, M')

and there will be enough of those strings to specify the whole of the transformation.

Now let C be the subset of that set of recursively enumerable languages which represent sane transformations. The amazing thing about Rice's theorem is that we don't have to decide on a notion of sane: provided we ensure that there is some sane transformation, that's enough: the theorem tells us that it is undecidable whether a given Turing machine's language is in C or not.

Unfortunately, this isn't what we wanted to know. Given a TM, there's no reason to think that it'll be obvious whether or not it even is a wrapped-up bidirectional transformation at all; so it's not very interesting that it's undecidable whether or not it's a wrapped-up sane bidirectional transformation. That said, usually the kind of thing we did there is sorted out by some more boring coding, so I guess that "morally" Rice's thm is going to tell us that no non-trivial notion of sanity will be decidable on any language that can express all computable bidirectional transformations. I guess this with enough confidence that I can't muster the confidence to actually check right now...

Conclusion: that wasn't the right question. Onwards and upwards...

Monday, 4 August 2008

What's still wrong with code generation?

Generating code from models is fine, and can certainly save time compared with actually typing that boilerplate stuff that's obvious from looking at the model. Life gets interesting - and that's sometimes a euphemism - when you start editing the code, while still editing the model too, and you need to keep the two in sync. This is often called "round-trip engineering", but if I remember correctly, that's actually a trademark of Rational. Let's call it model-code synchronisation.

Typically, you'll develop a model in UML or some such, then you'll generate code from the model. The generated code will often have special comments of some kind scattered through it to show the tools which bits of the code are autogenerated, and there will often be a notion of some parts of the code "belonging" to the tool, which is interesting. If you add code, code written by you is distinguished from autogenerated code, and the tool is supposed to leave it alone. If you modify code that was autogenerated, you'll probably have to do something to indicate to the tool that you don't want your changes thrown away; maybe you delete one of its special comments. Now, what happens when you modify the model? If you add new structural elements, it's pretty simple: when you press the button, new code gets generated for them and it doesn't interfere (much) with what was there already. What if you delete a structural element from the model? If you do so before any hand-written code has been added that's "owned" by this element, no problem: the autogenerated code should be deleted too. If you have hand-written code for it, then deleting the element from the model is a slightly surprising thing to do. What should the tool do about this? It could restore consistency to the world by deleting the code corresponding to the deleted model element, and in the end that may be the only sensible thing to do. Maybe, though, the presence of that code indicates that more work needs to be done? Maybe it even indicates that deleting the model element is a mistake? It might be better for the tool to indicate in some way that this was a surprising thing to do.

In this particular case, we can probably cover it by something like:

enforce a hard separation between code "owned" by the tool and code that "should not be touched" by the tool;
any time the tool needs to change or delete code that "should not be touched", do something special, e.g., require confirmation.

However, it's not hard to go on inventing more and more puzzling cases, so we need a systematic way to think about what's reasonable behaviour and what's not.

Since code can be seen as a special kind of model - after all, it records some information about a software system - this is a special case of bidirectional model transformations, which I've been interested in for a while. In general we can think about a definition of model-code synchronisation as involving (a) a definition of what it is for the code and model to be in sync (b) a pair of recipes for fixing it when they aren't. Let's assume that the person pressing the button has just finished editing either the code or the model, and doesn't want their work changed, so that synchronisation should change either the model, or the code, but not both. (How much it complicates things if we allow both to be changed deserves some thought, but the answer is probably "not much".) What should be true? Well, two things seem fairly uncontroversial:

if it happens to be the case that the code and the model are already in sync, then nothing should happen. I called this the transformation being hippocratic - "first, do no harm".
after the synchronisation, the code and the model should be in sync. I called this the transformation being correct, since the job of the transformation is to restore consistency.

The first of these deserves a bit of comment: isn't it automatic from the second? Well, not if there can be more than one model that's in sync with the same code, or vice versa. (That is, not if the consistency relation can be non-bijective.) Is that likely? Absolutely. For example, a UML model can include sequence diagrams for any parts of its behaviour, so there's an endless choice of model for the same code there. For an example the other way round, anything that's specified in code but not visible in the model can be changed without altering consistency, too.

Notice that in the process of thinking about that, it becomes obvious that there's a real choice about what "being in sync" means, too. How much model is there expected to be for a given code body? If there's a class diagram but no state diagrams, is that in sync, or should there have to be a state diagram for each (important?) class? Which classes are important, and what kind of state diagram? The right answers should, ideally, depend on the needs of the user: they won't be defined once and for all by the tool vendor.

Unfortunately, correctness and hippocraticness - nor their near ancestors, the basic lens laws of Foster, Pierce et al. - are not enough to ensure sensible behaviour. For example, the transformation that silently deletes the user's handwritten code to restore consistency when an element is deleted from the model can perfectly well be correct and hippocratic. More another time on what else we might require...

Thursday, 24 July 2008

Logic and algorithms workshop

This week Edinburgh is hosting an ICMS event called Logic and Algorithms . I didn't register for it, but I've been going into the odd talk (and not eating the lunches, as required). This morning saw a lovely tutorial by Phokion Kolaitis on Foundations and applications of schema mappings. Very useful, since this is something I've realised for a while that I know almost nothing about, despite its obvious connections with model transformation. More anon, maybe.
Well, OK... I didn't promise that this blog was readable, let alone correct. So here goes with some notes; caveat lector.

There are two related concepts.

Data integration, aka federation, lets you answer queries over multiple data sources as though they were integrated. The data stays where it is, but a new schema is created, against which the user can write queries; something must explain how to translate the query and its results so that the query is done over the actual data sources with their actual schemas, and the results presented to the user in a consistent way.
Data exchange, aka translation, really does move the data. You take some data with its original schema; you take a new schema; you explain how to fit the old data into the new schema.

At the time I thought "ah, 1. is model merging, 2. is model transformation". Well, sort of.

The way this field talks about these problems is using schema mappings. A schema mapping is a triple (S,T,\Sigma) [how do I do Greek in this thing, or maths for that matter?! never mind for now!]. S and T are the source and target schemas - let's be thinking about 2. for now - and Sigma is a set of dependencies. There are two kinds of dependencies: source-to-target dependencies Sigma_ST, which model transformation people might think of as the specification of a consistency relation that forms part of a model transformation, and constraints Sigma_T over the target schema, which we might think of as part of the target metamodel. (We see already an example of the common phenomenon that related problems in different areas often put their concept boundaries in different places: another kind of translation problem...)

Then the data exchange problem is to take this triple, and a model I for S, and create a model J for T which satisfies Sigma. We might say: given a pair of metamodels S (plus some constraints which are not specified because we're about to be given the data as well and it is guaranteed to satisfy whatever it has to: let's write S+) and T plus Sigma_T (let's write T+), and a consistency relation Sigma_ST, and a model I for S+, construct J which is a model for T+ such that (I,J) satisfy the consistency relation.

At this point, the database people observe that solutions may not exist, or if they exist, may not be unique, and they go on to define and study the existence-of-solutions problem which is: for a given M and I, does a solution J always exist? This is initially puzzling from a model transformation point of view, because in our world, we would not expect it to be useful to try to apply a model transformation which is defined only by its consistency relation. It's sort of the whole point that we have to have some way of choosing among solutions: we expect the consistency relation to be only part of the specification of a model transformation; we expect to add a constructive procedure of some kind to explain how to construct a target model, at which point non-existence is "just" a bug in the program. Still, I'd like to be able to say confidently whether or not we can write transformations which are not actually executable (at all, never mind non-determinacy) in (fragments of) our favourite bidirectional transformation languages, so maybe it would be useful to think about this more.

Back in the database world, they observe that there's an important question concerning the language in which the constraints are written. FOL is too strong - the existence of solutions problem is undecidable - so they look at fragments. It turns out that what happens is very sensitive to which fragment they choose.

Time to go to the next talk; maybe even more anon. Do let me know if you're reading this: perhaps I should offer a small prize for the first comment.