Extending XML Document Validation with Schematron


XML documents are in common use nowadays, and so is XML Schema (XSD) for validating them. Validation is often needed to ensure that structure, content and relations are correct. However, validation with a schema (XSD) alone covers only part of this: it can describe the basic XML structure (which elements are valid and in what order) and some basic content validation of a single XML node. Schematron can be used to cover the remaining part of XML validation, such as:

  • Advanced structure validation
    e.g. element A should have either attribute X or attribute Y, but not both and always one of them
  • Structure depending on content
    e.g. when attribute A of element B has value ‘x’, then it should have child element C
  • Content validation on multiple nodes
    e.g. sum of all percentage elements should be 100
  • Relations between elements
    e.g. for each employee element with a manager attribute there should be another employee element with an id attribute having the same value (meaning the manager of an employee should exist)
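
As a rough sketch of the first two cases, the rules below use the placeholder names from the bullets (elements A, B and C, attributes X, Y and A). These names are purely illustrative and not taken from a real document; the rule and assert elements used here are explained in the following paragraphs.

```xml
<!-- Sketch only: A, B, C, X and Y are the placeholder names from the list above. -->
<pattern name="Advanced structure and content-dependent structure">
  <!-- Element A must carry exactly one of the attributes X and Y. -->
  <rule context="A">
    <assert test="(@X and not(@Y)) or (@Y and not(@X))">A needs either X or Y, but not both</assert>
  </rule>
  <!-- When attribute A of element B has the value 'x', child element C is required. -->
  <rule context="B[@A='x']">
    <assert test="C">B with A='x' must have a child element C</assert>
  </rule>
</pattern>
```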


Schematron, just like an XSD, is an XML document itself. Each validation rule is defined by a rule element. The rule element has a context attribute that defines the node (or nodes) of your target XML to which the rule applies. An XPath expression is used to define this context. For example, suppose an XML document contains one or more Department elements and we want to define a rule for each Department element in the document:

<rule context="Department">
</rule>

A rule element has one or more report and/or assert elements. Both contain a test attribute with the actual validation rule, the test. The only difference between them is that a report element results in an output error when the test evaluates to (boolean) true, whereas an assert element results in an output error when the test evaluates to (boolean) false.
The test attribute also uses an XPath expression to define the validation rule.
Let’s say our Department element has two attributes, “name” and “abbr” (abbreviation) and two business rules apply:

  1. abbr should contain at least two characters
  2. abbr should contain fewer characters than name

Defining these rules with Schematron results in the following (in XML, the < character is written as &lt;):

<rule context="Department">
  <report test="string-length(@abbr) &lt; 2">Abbreviation too short</report>
  <assert test="string-length(@abbr) &lt; string-length(@name)">Abbreviation too long</assert>
</rule>
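
Because report and assert differ only in the polarity of the test, either one can be rewritten as the other by negating the test. The fragment below is a small sketch of that idea for the first business rule; the two rule elements are meant as alternatives, not to be used together in one pattern.

```xml
<!-- Alternative 1: report fires when the test is true. -->
<rule context="Department">
  <report test="string-length(@abbr) &lt; 2">Abbreviation too short</report>
</rule>

<!-- Alternative 2: assert fires when the test is false, so the test is inverted. -->
<rule context="Department">
  <assert test="string-length(@abbr) &gt;= 2">Abbreviation too short</assert>
</rule>
```

Which form to choose is mostly a matter of readability: assert states what must hold, report states what must not occur.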

To complete the Schematron XML document: a rule element is a child of a pattern element. The pattern element is used to group rules and to provide a name for the group. It exists only for readability and has no further technical meaning. With schema as the root element we can finish the Schematron document.
The complete Schematron document for our little example:

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.ascc.net/xml/schematron" >
  <pattern name="Number of characters in abbr attribute">
    <rule context="Department">
      <report test="string-length(@abbr) &lt; 2">Abbreviation too short</report>
      <assert test="string-length(@abbr) &lt; string-length(@name)">Abbreviation too long</assert>
    </rule>
  </pattern>
</schema>
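
To try this schema, a small input document helps. The one below is a hypothetical example: the Departments root element and all attribute values are made up for illustration. The comments indicate which test each Department is expected to trip.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical input; the Departments root element and all values are made up. -->
<Departments>
  <!-- Passes both tests: 3 characters, shorter than "Marketing". -->
  <Department name="Marketing" abbr="MKT"/>
  <!-- Triggers the report: abbr has only 1 character ("Abbreviation too short"). -->
  <Department name="Finance" abbr="F"/>
  <!-- Fails the assert: abbr (3) is not shorter than name (2) ("Abbreviation too long"). -->
  <Department name="HR" abbr="HRM"/>
</Departments>
```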

Below is another example, with the rule that the sum of all Percent elements anywhere within a Total element (children or even grandchildren) should be 100.

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.ascc.net/xml/schematron" >
     <pattern name="Sum equals 100%.">
          <rule context="Total">
               <assert test="sum(.//Percent) = 100">Sum is not 100%.</assert>
          </rule>
     </pattern>
</schema>
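
A hypothetical input for this rule could look as follows. The Group and Sub wrapper elements are invented here only to show Percent elements at different depths; the three values add up to 100, so the assert should hold. Changing any one of them would make the sum differ from 100 and produce the message "Sum is not 100%.".

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical input; Group and Sub are made-up wrapper elements. -->
<Total>
  <!-- Direct child of Total. -->
  <Percent>50</Percent>
  <Group>
    <!-- Grandchild of Total. -->
    <Percent>30</Percent>
    <Sub>
      <!-- Great-grandchild of Total. -->
      <Percent>20</Percent>
    </Sub>
  </Group>
</Total>
```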

Before continuing with a complex example with relations between elements: how do we get this to work?
In fact, that’s quite easy. You only need to be able to run XSLT transformations!
The beauty of Schematron is that it’s not a new technology, just a clever use of XSLT transformations. No new language is needed and you don’t even need to learn XSLT; basic knowledge of XPath and XML is sufficient.

The trick is to transform the Schematron XML containing your validation rules into an XSLT stylesheet that contains those rules. You then validate XML documents by running an XSLT transformation with this generated stylesheet. And how do you generate the rules XSLT from your Schematron document? Also with an XSLT transformation! The rules XSLT is the result of transforming your Schematron XML (with your validation rules) with a provided Schematron XSLT (iso_schematron_skeleton_for_xslt1.xsl or iso_schematron_skeleton_for_xslt2.xsl, downloadable from schematron.com).
So it’s a two-step approach. First you transform your Schematron rules XML with the Schematron XSLT, resulting in a new XSLT that contains your rules. Then you use this generated XSLT to validate XML documents by running a transformation. This final transformation outputs your errors, or nothing when the validation succeeds.
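
To make the idea of "rules as XSLT" more concrete, here is a hand-written XSLT 1.0 stylesheet that behaves roughly like the Department rules from the earlier example. It is a simplified illustration only, not the actual output of the Schematron skeleton (which is more elaborate), and the plain-text output format is an assumption made for this sketch.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hand-written sketch, NOT the real generated stylesheet. -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Emit plain text: one line per violated rule, nothing when all rules pass. -->
  <xsl:output method="text"/>

  <!-- The rule context becomes the match pattern of a template. -->
  <xsl:template match="Department">
    <!-- A report fires when its test is true. -->
    <xsl:if test="string-length(@abbr) &lt; 2">Abbreviation too short&#10;</xsl:if>
    <!-- An assert fires when its test is false, hence the not(). -->
    <xsl:if test="not(string-length(@abbr) &lt; string-length(@name))">Abbreviation too long&#10;</xsl:if>
    <!-- Keep walking the tree so nested elements are checked as well. -->
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Suppress the default copying of text nodes to the output. -->
  <xsl:template match="text()"/>
</xsl:stylesheet>
```

Running this stylesheet against an input document is the second step of the process; the first step exists so that you never have to write such a stylesheet by hand.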

Figure: Schematron two-step validation process

In a production environment the rules are usually predefined or change infrequently, so the generated XSLT can be stored (or cached).

To show the possibilities of Schematron validation, I conclude this blog with the promised complex example with element relation rules.
Let’s start with the target XML, i.e. the XML data that has to be validated. With example data it is easier to understand the Schematron rules. (The attribute names naam and afk are Dutch for name and abbreviation.)

<?xml version="1.0" encoding="UTF-8"?>
<company>
  <department naam="The Floor" afk="fl">
    <employees>
      <employee id="10" manager="15">
        <name>J. Jansen</name>
        <salary>1000</salary>
      </employee>
      <employee id="11" manager="20">
        <name>P. Klaasen</name>
        <salary>1100</salary>
      </employee>
    </employees>
  </department>
  <department naam="Managers" afk="man">
    <employees>
      <employee id="15" manager="25">
        <name>M. A. Neger</name>
        <salary>1700</salary>
      </employee>
      <employee id="20" manager="25">
        <name>L.E. Ader</name>
        <salary>1500</salary>
      </employee>
      <employee id="25">
        <name>P.R. Esident</name>
        <salary>2500</salary>
      </employee>
    </employees>
  </department>
</company>

We want to implement the following business rules:

  • All employees of department “The Floor” should earn less than every manager (= employee in department “Managers”).
  • An employee may not be their own manager.
  • There is only one manager without a manager (only one president).
  • The relation manager and employee is a valid one, so the manager of an employee must exist. This means that for each employee with a manager attribute there must be a manager with attribute id with the same value.
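
For contrast with the valid data above, the hypothetical variant below is constructed so that each of the four business rules is violated at least once. All ids, names and salaries in it are made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical invalid variant of the data above; all values are made up. -->
<company>
  <department naam="The Floor" afk="fl">
    <employees>
      <!-- Salary rule: earns 1600, more than manager 20 (1500). -->
      <employee id="10" manager="20">
        <name>A. Example</name>
        <salary>1600</salary>
      </employee>
      <!-- Own-manager rule: id and manager are the same. -->
      <employee id="11" manager="11">
        <name>B. Example</name>
        <salary>1000</salary>
      </employee>
      <!-- Manager-exists rule: there is no employee with id 99. -->
      <employee id="12" manager="99">
        <name>C. Example</name>
        <salary>1000</salary>
      </employee>
    </employees>
  </department>
  <department naam="Managers" afk="man">
    <employees>
      <!-- Single-president rule: both managers lack a manager attribute. -->
      <employee id="20">
        <name>D. Example</name>
        <salary>1500</salary>
      </employee>
      <employee id="25">
        <name>E. Example</name>
        <salary>2500</salary>
      </employee>
    </employees>
  </department>
</company>
```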

In Schematron XML these rules result in:

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.ascc.net/xml/schematron" >
  <pattern name="All floor emp earn less than managers">
    <!-- XML names are case-sensitive, so they match the lowercase names and the naam attribute of the data. -->
    <rule context="department[@naam='The Floor']/employees/employee">
      <!-- Fires when the salary is greater than or equal to at least one manager salary. -->
      <report test="salary &gt;= //department[@naam='Managers']/employees/employee/salary">Too much</report>
    </rule>
  </pattern>
  <pattern name="Emp not own manager">
    <rule context="employee[@manager]">
      <assert test="@manager != @id">Own manager</assert>
    </rule>
  </pattern>
  <pattern name="Only one manager without manager">
    <rule context="department[@naam='Managers']/employees">
      <!-- Fails for zero presidents as well as for more than one. -->
      <assert test="count(employee[not(@manager)]) = 1">There must be exactly one president</assert>
    </rule>
  </pattern>
  <pattern name="Manager relation exists">
    <rule context="employee[@manager]">
      <assert test="/company/department[@naam='Managers']/employees/employee[@id=current()/@manager]">Not a valid manager</assert>
    </rule>
  </pattern>
</schema>

More information can be found at schematron.com.
An easy step by step Schematron tutorial can be found here.

 


About Author

Emiel is a senior Java & SOA consultant at AMIS, Nieuwegein (The Netherlands).

4 Comments

  1. Chris, you are right, performance can be an issue. Aside from the hardware, performance depends largely on the size and complexity of the XML file and the number and complexity of the Schematron rules (and thus the size and complexity of the final XSLT). For a big insurance company we’ve implemented business validation of an insurance policy. The performance trick was to cache (in Java) the binary result of compiling the final XSLT. So each validation requires only one transformation, and that transformation has already been compiled.

  2. One of the main reasons most places don’t do plain schema validation in production is the performance impact.  How does this impact performance?  This looks like a really nice tool if it performs well.
    Thanks
    Chris

  3. There are lots of use cases or reasons to use Schematron. A real-world example: for an insurance company I’ve written a validation web service. The incoming parameters were an XML document reflecting an insurance application, the name of the insurance application type, and the version. The application had to be validated against lots of specific insurance business rules. The web service first validated the document against an XSD, and after successful parsing it was validated against the insurance business rules using Schematron. Schematron was chosen because the rules change quite often. So these rules (in Schematron XML) were stored in a database, and the web service generated (and cached) the Schematron XSLT after changes in the rules (and also in case of a new version).

  4. Sumit Tambekar on

    Hi Emiel,
    Nice article.
    Schematron can be a good option if requirement says fixed business rules to be enforced on data flow.
    Thanks,
    Sumit Dnyanesh Tambekar