Tech ramblings by Marcin

Schematron to the rescue!

2010-10-21 08:40

In an ideal world all the standards fit well into their places. It is sufficient to use just one serious standard, because all the problems can be solved with it - the standardization process is there for a reason. But that happens only in an ideal world, which we're not living in.

In an ideal world, when dealing with XML instances you'd be more than satisfied using XML Schema, or RELAX NG, or any other simple XML schema language to declare your data structure. With that you get rigid rules as to what XML documents should look like. There doesn't seem to be much room to deviate from the specs. Well, in fact there is.

The main problem of XML, aside from its verbosity, is the inability to create concise rules for the input or output document as a whole. Perhaps it's a nice feature, because XML Schema should only be used to describe a data structure, not to impose business rules on it. Perhaps not. Nevertheless, it's not what I needed in one of the projects I've worked on.

My need was to actually check the business validity of such documents. This was used in a Web Service environment, a pretty stupid WS whose sole role was to fetch data from a database and pack it into appropriate XML structures. Errors might occur in the database's views or in the WS - as usual. They might be duplicated data or elements appearing where they shouldn't. The resulting documents validated correctly against the XML schema, but the result was simply wrong from the business point of view.

What I needed was an XML formalization language: the ability to write rules that assert certain conditions and report whenever they are not met. I was in need of a tool to write business rules to tame such XML entities.

The simplest way I found to solve this was to use Schematron! - "a language for making assertions about patterns found in XML documents". This neat tool is a set of XSL templates that you use in conjunction with a rule set on the documents you want to check. As a result of the check you get another XML document with test assertions - whether failed or succeeded.

With Schematron you write a set of rules you expect the document to satisfy, then you use the Schematron XSL template to produce XSL rules specific to your case. Now you only need to apply the newly generated XSL template to your XML document to check rule compliance. Easy; if not, check the diagram below.
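The two-step pipeline can be sketched as an Ant build file. This is only an illustration under assumptions: the file names rules.sch and build/report.xml, the lib/ directory, and the choice of the iso_svrl_for_xslt1.xsl stylesheet from the ISO Schematron reference implementation are all hypothetical and may differ from what the example project really uses.

```xml
<project name="schematron-check" default="validate">

  <!-- Step 1: compile the Schematron rules into a case-specific XSL stylesheet.
       Assumption: the ISO Schematron stylesheets (including the skeleton that
       iso_svrl_for_xslt1.xsl imports) have been unpacked into lib/. -->
  <target name="compile-rules">
    <mkdir dir="build"/>
    <xslt in="rules.sch"
          out="build/rules.xsl"
          style="lib/iso_svrl_for_xslt1.xsl"/>
  </target>

  <!-- Step 2: apply the generated stylesheet to the document under test.
       The output is the XML report listing fired rules and failed asserts. -->
  <target name="validate" depends="compile-rules">
    <xslt in="getAllPhones.xml"
          out="build/report.xml"
          style="build/rules.xsl"/>
  </target>

</project>
```

Any XSLT processor run twice by hand would do the same job; Ant is used here only because it keeps the whole pipeline in XML.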

How does it look?

The rules' file may look like this:


  TouK Schematron test harness

  checking GetMigrationOffers
    Report date.
    Unique offers allowed.
    Each offer has to have an @abc attribute

    Each offer has to have a tariff
    Each offer has to have a promotion

  checking GetAllPhones
    TACs should be unique. TAC: ,
    handsetId:
    offerId:
    
  



Here we see two rules, one named getMigrationOffers and the other getAllPhones. The rules, mainly their asserts, seem pretty self-explanatory, but for the sake of completeness I'll describe the rules for getAllPhones.

There is one rule, which checks the uniqueness of tac elements. This rule tries to ensure that each handset has a list of unique tac elements as its children. However, tac elements with the same value may appear in different handset elements.
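To make the description concrete, here is a minimal sketch of how such a uniqueness rule could be written in ISO Schematron. It is not the original rule file: the XPath test, the attribute names id and offerId, and the assumption that the tac and handset elements are in no namespace are all guesses made for illustration. Only the titles and the assert message come from the text above.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <title>TouK Schematron test harness</title>

  <pattern>
    <title>checking GetAllPhones</title>

    <!-- The rule fires once for every tac element in the document. -->
    <rule context="tac">
      <!-- Hypothetical test: among the tac children of the same parent,
           exactly one (this node itself) may carry this value.
           current() is the tac the rule fired on; "." is each sibling. -->
      <assert test="count(../tac[. = current()]) = 1">
        TACs should be unique. TAC: <value-of select="."/>,
        handsetId: <value-of select="../@id"/>
        offerId: <value-of select="../@offerId"/>
      </assert>
    </rule>
  </pattern>
</schema>
```

With a test of this shape, every tac that has a twin under the same handset fails the assert, so one duplicated value produces two failures. The same value under two different handsets is not reported, because the comparison never leaves the parent element.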

Given an input XML in the form of:


   
      
         
            12028006
            20070705
            35535302
            01216100
            01216100
         
         
            12028006
            20070705
            35535302
            01216100
         
         
            12028006
            20070705
            35535302
            01216100
         
       
   

And passing those two files through the processing pipeline you get a report:


    
   
   
   
   
   
   
   
   
      
        TACs should be unique. TAC: 01216100, 
        handsetId: 95
        offerId: 103021
   
   
   
      
        TACs should be unique. TAC: 01216100, 
        handsetId: 95
        offerId: 103021
   
   
   
   
   
[...]

After running the validation, the report presents us with the result. It shows that there are actually non-unique tacs. Unfortunately the rule itself is not optimal, as it is executed for each tac node. It would be much better to create a rule operating on groups of tacs, so that it is evaluated once per handset.
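A rule that works on groups of tacs might look like the sketch below. It is an assumption-laden illustration, not code from the example project: the handset and tac element names are taken from the prose, while the id attribute and the XPath test are hypothetical.

```xml
<pattern xmlns="http://purl.oclc.org/dsdl/schematron">
  <title>checking GetAllPhones (one check per handset)</title>

  <!-- The rule fires once per handset instead of once per tac. -->
  <rule context="handset">
    <!-- tac[not(. = preceding-sibling::tac)] keeps only the first occurrence
         of each value, so the two counts match only when no value repeats. -->
    <assert test="count(tac) = count(tac[not(. = preceding-sibling::tac)])">
      TACs should be unique within a handset.
      handsetId: <value-of select="@id"/>
    </assert>
  </rule>
</pattern>
```

This version reports each offending handset once rather than once per duplicated tac, at the cost of no longer naming the repeated value in the message. Whether it is actually faster depends on the XSLT processor, since the comparison inside a single handset is still pairwise.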

Performance consideration

As you may have seen, Schematron offers quite a lot of potential when it comes to building rules - maybe not the easiest to comprehend, since they are written in XPath, but good enough.

However, with all the XML processing involved, it may take a considerable amount of time to execute such validations. For example, processing the rules for the file getMigrationOffers.xml takes about 2.296s - the file has 82 offer elements, which the rules operate on. But validating the other file, getAllPhones.xml, takes 5.324s, with 3113 tac elements and the rule iterating over all of them.

This overhead is too much in most situations. That's why this solution is not really suited to the normal execution pipeline - it would be unwise to have Schematron check each request and thus entangle it in my Web Service's normal flow.

What may be more desirable is to set up a continuous integration server with a project that queries the Web Service and checks the rules in this manner.

Conclusion

So, what's so great about having one XML generate another XML? Perhaps nothing; I think it would take just about a day to write some shell, Python, or <other text processing tool> script that would perform equally well (or even better). However, we would lose technology homogeneity and employ other environments not specific to our primary target platform, and that seems bad. Of course, using some powerful text processing tool to impose the same rules might be much more efficient, though less coherent.

What is your approach to such situations? Have you used Schematron or any other similar tool?

Code for this example is available on GitHub - http://github.com/zygm0nt/schematron-example.