XML CDM Development

Development and Runtime Experiences with a Canonical Data Model Part I: Standards & Guidelines

Introduction

In my previous blog post I explained what a Canonical Data Model (CDM) is and why you should use one. This post is about how to do that. I will share my experiences in creating and using a CDM, gained on several projects, both small and large. All of these experiences relate to an XML-based CDM. This series consists of three parts. This post contains part I: Standards & Guidelines. The next post, part II, is about XML Namespace Standards, and the last post contains part III, about Dependency Management & Interface Tailoring.
This first part, about standards and naming conventions, primarily applies to XML, but the same principles and ideas mostly apply to other formats, like JSON, as well. The second part, about XML namespace standards, is, as the name indicates, only applicable to a CDM in XML format. The third part, about dependency management & interface tailoring, applies to all kinds of data formats.

Developing a CDM

First, about the way of creating a CDM: it’s not feasible to create a complete CDM upfront and only then start designing and developing services. You can only determine the usage, completeness and quality of the data while developing the services and gaining experience in using them. A CDM is a ‘living’ model and will change over time.
When the software modules (systems or data stores) which are to be connected by the integration layer are being developed at the same time, the CDM will change very often. While developing software you always encounter shortcomings in the design, unforeseen functional flaws, unexpected requirements or restrictions, and changes in design because of new insights or changed functionality. So sometimes the CDM will even change on a daily basis. This fits perfectly with modern Agile software development methodologies, like Scrum, where changes are welcome.
When the development stage is finished and the integration layer (SOA environment) is in a maintenance stage, the CDM will still change, but at a much slower pace. It will keep on changing because of maintenance changes and modifications of connected systems or trading partners. Changes and modifications due to new functionality also cause new data entities and structures which have to be added to the CDM. These changes and modifications occur because business processes change over time, caused by a changing world, ranging from technical innovations to changes in social behavior.
Either way, the CDM will never be finished and reach a final, unchanging state, so a CDM should be flexible and created in such a way that it welcomes changes.

When you start creating a CDM, it’s wise to define standards and guidelines for defining and using the CDM beforehand. Make one person (or a group of persons in a large project) responsible for developing and defining the CDM. This means he defines the data definitions and structures of the CDM. When using XML, this person is responsible for creating and maintaining the XML Schema Definition (XSD) files which represent the CDM. He develops the CDM based on requests from developers and designers. He must be able to understand the needs of the developers, but he should also keep the model consistent, flexible and future-proof. This means he must have experience in data modeling, in the data format (e.g. XML or JSON) and in the specification language (e.g. XSD) being used. Of course, he also guards the standards and guidelines which have been set. When needed, he must also be able to deny requests for a CDM change from (senior) developers and designers in order to preserve a well-defined CDM, and to provide an alternative which meets their needs as well.

Standards & Guidelines

There are more or less three types of standards and guidelines when defining an XML data model:

  • Naming Conventions
  • Structure Standards
  • Namespace Standards

Naming Conventions

The most important advice is to define naming conventions upfront and stick to them. As with naming conventions in programming languages, there are a lot of options and often it’s a matter of personal preference. Changing conventions because of differing personal preferences is not a good idea. Mixed conventions result in ugly code. Nevertheless, I do have some recommendations.

Nodes versus types
The first one is to make a distinction between the name of a node (element or attribute) and an XML type. I’ve been in a project where the standard was to give them exactly the same name. In XML this is possible! But the drawback was that there were connecting systems and programming languages which couldn’t handle this. For example, the standard Java library for XML parsing, JAXP, had an issue with this. The Java code which was generated under the hood used the name of an XML type for a Java class name and the name of an element as a Java variable name. In Java it is not possible to use an identical name for both. In that specific project, this had to be fixed manually in the generated Java source code. That is not what you want! It can easily be avoided by using different names for types and elements.

Specific name for types
A second recommendation, which complements the advice above, is to use a specific naming convention for XML types, so their names always differ from node names. The advantage for developers is that they can recognize from the name if something is an XML node or an XML type. This eases XML development and makes the software code easier to read and understand and thus to maintain.
I’ve often seen a naming convention that tries to implement this by prescribing that the name of an XML type should be suffixed with the token “Type”. I personally do not like this specific naming convention. Suppose you have a “Person” entity, so you end up with an XML type named “PersonType”. This makes perfect sense, doesn’t it? But how about a “Document” entity? You end up with an XML type named “DocumentType” and guess what: there is also going to be a “DocumentType” entity, resulting in an XML type named “DocumentTypeType”…!? That is very confusing in the first place. Secondly, you end up with an element and an XML type with the same name! The name “DocumentType” is used as the name of an element (of type “DocumentTypeType”) and “DocumentType” is also used as the name of an XML type (of an element named “Document”).
From experience I can tell you there are more entities with a name that ends with “Type” than you would expect!
My advice is to prefix the name of an XML type with the character “t”. This not only prevents this problem, but it’s also shorter. Additionally, you can distinguish an XML node from an XML type by the start of its name. This naming convention results in element names like “Person”, “Document” and “DocumentType” versus type names “tPerson”, “tDocument” and “tDocumentType”.

Use CamelCase – not_underscores
The third recommendation is to use camel case for names instead of using underscores as separators between the words which make up the name of a node or type. This shortens a name while the name can still be read easily. I have a slight preference for starting a name with an uppercase character, because then I can use camel case beginning with a lowercase character for local variables in logic or transformations (BPEL, XSLT, etc.) in the integration layer or tooling. This results in a node named “DocumentType” of type “tDocumentType”, and when it is used in a local variable in code, this variable is named “documentType”.
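The naming conventions above can be checked mechanically. The following is a minimal Python sketch, not part of any tooling mentioned in this post: the function names check_names and local_variable_name and the sample schema are made up for illustration. It flags types without the “t” prefix and elements that are not in upper camel case, and derives the lowercase local variable name from a node name.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"  # XML Schema namespace in ElementTree notation


def check_names(xsd_text):
    """Return a list of naming convention violations found in an XSD document."""
    problems = []
    for node in ET.fromstring(xsd_text).iter():
        name = node.get("name")
        if not name:
            continue
        if node.tag in (XS + "complexType", XS + "simpleType"):
            # Types: a lowercase "t" followed by an uppercase letter.
            if not (name[0] == "t" and name[1:2].isupper()):
                problems.append(f"type '{name}' should start with the prefix 't'")
        elif node.tag == XS + "element":
            # Elements: start with an uppercase letter, no underscores.
            if not name[0].isupper() or "_" in name:
                problems.append(f"element '{name}' should be UpperCamelCase")
    return problems


def local_variable_name(node_name):
    """Turn a node name like 'DocumentType' into 'documentType'."""
    return node_name[0].lower() + node_name[1:]


# Hypothetical schema with two deliberate violations.
SAMPLE = """<schema xmlns="http://www.w3.org/2001/XMLSchema">
  <complexType name="tDocumentType"><sequence>
    <element name="Code" type="string" minOccurs="0"/>
    <element name="doc_name" type="string" minOccurs="0"/>
  </sequence></complexType>
  <complexType name="PersonType"><sequence/></complexType>
</schema>"""

for problem in check_names(SAMPLE):
    print(problem)
print(local_variable_name("DocumentType"))
```

A check like this could run as part of the build, so that the person responsible for the CDM does not have to police the conventions by hand.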

Structure Standards

I also have some recommendations about standards which apply to the XML structure of the CDM.

Use elements only
The first one is to never use attributes, so use elements only. You can never expand an attribute and create child elements in it. This may not be necessary at the moment, but may be necessary sometime in the future. Also, an attribute cannot have the ‘null’ value, in contrast with an element. You can argue that an empty value can represent the null value. But this is only possible with string type attributes (otherwise the XML is considered invalid when validating against its schema) and often there is a difference between an empty string and a null value. Another disadvantage is that you cannot have multiple attributes with the same name inside an element.
Furthermore, using elements makes XML more readable for humans, so this helps developers in their coding and debugging. A good read about this subject is “Principles of XML design: When to use elements versus attributes”. This article contains a nice statement: “Elements are the extensible engine for expressing structure in XML.” And that’s exactly what you want when developing a CDM that will change over time.
The last advantage is that when the CDM only consists of elements, processing layers can add their own ‘processing’ attributes only for the purpose of helping the processing itself. This means that the result, the XML which is used in communicating with the world outside of the processing system, should be free of attributes again. Also processing attributes can be added in the interface, to provide extra information about the functionality of the interface. For example, when retrieving orders with operation getOrders, you might want to indicate for each order whether it has to be returned with or without customer product numbers:

<getOrdersRequest>
  <Orders>
    <Order includeCustProdIds='false'>
      <Id>123</Id>
    </Order>
    <Order includeCustProdIds='true'>
      <Id>125</Id>
    </Order>
    <Order includeCustProdIds='false'>
      <Id>128</Id>
    </Order>
  </Orders>
</getOrdersRequest>

Beware that these attributes are processing or functionality related, so they should not be part of the data of the entity. And ask yourself if they are really necessary. You might consider providing this extra functionality in a new operation, e.g. operation getCustProdIds to retrieve customer product ids or operation getOrderWithCustIds to retrieve an order with customer product numbers.
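As a rough illustration of keeping processing attributes inside the processing system, here is a small Python sketch. The helper strip_attributes is hypothetical, and the sketch assumes the CDM itself really contains no attributes at all (an xsi:nil attribute, for instance, would be stripped too).

```python
import xml.etree.ElementTree as ET


def strip_attributes(root):
    """Remove every attribute from root and all of its descendants, in place."""
    for node in root.iter():
        node.attrib.clear()
    return root


REQUEST = """<getOrdersRequest>
  <Orders>
    <Order includeCustProdIds="false"><Id>123</Id></Order>
    <Order includeCustProdIds="true"><Id>125</Id></Order>
  </Orders>
</getOrdersRequest>"""

request = ET.fromstring(REQUEST)

# The processing layer reads the attribute while it handles the request...
wanted = [order.findtext("Id") for order in request.iter("Order")
          if order.get("includeCustProdIds") == "true"]

# ...and removes all attributes before the XML leaves the processing system.
clean = strip_attributes(request)
print(wanted)
print(ET.tostring(clean, encoding="unicode"))
```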

All elements optional
The next advice is to make all the elements optional! Unexpectedly, there is always a system or business process which doesn’t need a certain (child) element that you initially thought would always be necessary. On one project this was the case with id elements. Each data entity had to have an id element, because the id element contains the functional unique identifying value for the data entity. But then there came a business case with a front end system that had screens in which the data entity was being created. Some of the input data had to be validated before the unique identifying value was known. So the request to the validation system contained the entity data without the identifying id element, and the mandatory id element had to be changed to an optional element. Of course, you can solve this by creating a request which only contains the data that is used, in separate elements, so without the use of the CDM element representing the entity. But one of the powers of a CDM is that there is one definition of an entity.
At that specific project, over time, more and more mandatory elements turned out to be optional somewhere. This will likely happen at your project as well!

Use a ‘plural container’ element
There is, of course, one exception: an element which should be mandatory. That is the ‘plural container’ element, which is only a wrapper element around a single element which may occur multiple times. This is my next recommendation: when a data entity (XML structure) contains another data entity as a child element and this child element occurs two or more times, or there is a slight chance that this will happen in the future, then create a mandatory ‘plural container’ element which acts as a wrapper element that contains these child elements. A nice example of this is an address. More often than you might think, a data entity contains more than one address. When you have an order as data entity, it may contain a delivery address and a billing address, while you initially started with only the delivery address. So suppose there is initially only one address and the XML is created like this:

<Order>
  <Id>123</Id>
  <CustomerId>456</CustomerId>
  <Address>
    <Street>My Street</Street>
    <ZipCode>23456</ZipCode>
    <City>A-town</City>
    <CountryCode>US</CountryCode>
    <UsageType>Delivery</UsageType>
  </Address>
  <Product>...</Product>
  <Product>...</Product>
  <Product>...</Product>
</Order>

Then you have a problem with backwards compatibility when you have to add the billing address. This is why it’s wise to create a plural container element for addresses, and for products as well. The name of this element is the plural of the name of the element it contains. The XML above then becomes:

<Order>
  <Id>123</Id>
  <CustomerId>456</CustomerId>
  <Addresses>
    <Address>
      <Street>My Street</Street>
      <ZipCode>23456</ZipCode>
      <City>A-town</City>
      <CountryCode>US</CountryCode>
      <UsageType>Delivery</UsageType>
    </Address>
  </Addresses>
  <Products>
    <Product>...</Product>
    <Product>...</Product>
    <Product>...</Product>
  </Products>
</Order>

In the structure definition, the XML Schema Definition (XSD), define the plural container element to be single and mandatory. Make its child elements optional and without a maximum number of occurrences. First, this results in maximum flexibility, and second, in this way there is only one way of constructing XML data that doesn’t have any child elements. In contrast, when you make the plural container element optional, you can create XML data that doesn’t have any child elements in two ways: by omitting the plural container element completely and by adding it without any child elements. You may want to solve this by dictating that the container always has at least one child element, but then the next advantage, discussed below, is lost.
So the XML data example above will be modeled as follows:

<complexType name="tOrder">
  <sequence>
    <element name="Id" type="string" minOccurs="0" maxOccurs="1"/>
    <element name="CustomerId" type="string" minOccurs="0" maxOccurs="1"/>
    <element name="Addresses" minOccurs="1" maxOccurs="1">
      <complexType>
        <sequence>
          <element name="Address" type="tns:tAddress" minOccurs="0" maxOccurs="unbounded"/>
        </sequence>
      </complexType>
    </element>
    <element name="Products" minOccurs="1" maxOccurs="1">
      <complexType>
        <sequence>
          <element name="Product" type="tns:tProduct" minOccurs="0" maxOccurs="unbounded"/>
        </sequence>
      </complexType>
    </element>
  </sequence>
</complexType>
<complexType name="tAddress">
  <sequence>
    ...
  </sequence>
</complexType>
<complexType name="tProduct">
  <sequence>
    ...
  </sequence>
</complexType>

There is another advantage of this construction for developers. When there is a mandatory plural container element, this element acts as a kind of anchor or ‘join point’ when XML data has to be modified in the software and, for example, child elements have to be added. As this element is mandatory, it’s always present in the XML data that has to be changed, even if there are no child elements yet. So the code of a software developer can safely ‘navigate’ to this element and make changes, e.g. adding child elements. This eases the work of a developer.
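To make the ‘join point’ idea concrete, here is a minimal Python sketch using the standard library. The function add_address and the sample order are made up for illustration; the point is only that the code can navigate to the Addresses container without first checking whether any Address exists.

```python
import xml.etree.ElementTree as ET


def add_address(order, street, usage_type):
    """Append an Address to the mandatory Addresses container of an Order."""
    addresses = order.find("Addresses")
    if addresses is None:
        # Should not happen when the XML is valid against the schema.
        raise ValueError("Order has no Addresses container")
    address = ET.SubElement(addresses, "Address")
    ET.SubElement(address, "Street").text = street
    ET.SubElement(address, "UsageType").text = usage_type
    return address


# An order without any addresses or products yet: the containers are still there.
order = ET.fromstring("<Order><Id>123</Id><Addresses/><Products/></Order>")

add_address(order, "My Street", "Delivery")
add_address(order, "Other Street", "Billing")

usage_types = [a.findtext("UsageType") for a in order.find("Addresses")]
print(usage_types)
```

With an optional container, the same function would first have to create the container and insert it at the right position in the sequence, which is exactly the extra work this standard avoids.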

Be careful with restrictions
You never know beforehand with which systems or trading partners the integration layer will connect in the future. Keep this in mind when you define restrictions in your CDM. For example, restricting a string type to a list of possible values (enumeration) is very risky. What do you do when another possible value is added in the future?
Even a more flexible restriction, like a regular expression, can soon become too strict as well. Take for example the top level domain names on the internet. They were once restricted to two character abbreviations for countries, some three character abbreviations (“net”, “com”, “org”, “gov”, “edu”) and one four character word, “info”, but that’s history now!
This risk applies to all restrictions: restrictions on character length, numeric restrictions, restrictions on value ranges, etc.
Likewise, I bet that the length of product ids in the new version of your ERP system will exceed the current one.
My advice is to minimize restrictions as much as possible in your CDM, preferably no restrictions at all!
Instead, define restrictions on the interfaces, the APIs to the connecting systems. When for example the product id of your current ERP system is restricted to 8 characters, it makes perfect sense to define a restriction on the interface with that system. More on this in part III, in my last blog post, in the section about Interface Tailoring.
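The idea of restricting at the interface instead of in the CDM could look roughly like this in Python. Everything here is hypothetical: the constant, the function name to_erp_product_id and the error message are only for illustration, and the 8 character limit is the one from the ERP example above.

```python
ERP_MAX_PRODUCT_ID_LENGTH = 8  # limit of the current ERP system in the example


def to_erp_product_id(cdm_product_id):
    """Check a CDM product id against the restriction of the ERP interface.

    The CDM itself places no length restriction on the id; only this
    interface does, because only this system has the limit.
    """
    if len(cdm_product_id) > ERP_MAX_PRODUCT_ID_LENGTH:
        raise ValueError(
            f"product id '{cdm_product_id}' exceeds "
            f"{ERP_MAX_PRODUCT_ID_LENGTH} characters for the ERP interface"
        )
    return cdm_product_id


print(to_erp_product_id("AB123456"))
```

When the ERP system is replaced by one with longer ids, only this interface check changes and the CDM stays as it is.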

String type for id elements
Actually, this one is the same as the one above about restrictions. I want to discuss it separately because of its importance and because it often goes wrong. Defining an id element as a numeric type is a way of applying a numeric restriction to a string type id.
The advice is to make all identifying elements (id, code, etc.) of type string and never a numeric type! Even when they always get a numeric value… for now! The integration layer may in the future connect to another system that uses non-numeric values for an id element, or an existing system may be replaced by a system that uses non-numeric ids. Only make those elements numeric which truly contain numbers, so the value has a numeric meaning. You can check this by asking yourself whether it functionally makes sense to calculate with the value or not. So, for example, phone numbers should be strings. Also, when there is a check (algorithm) based on the sequence of the digits to determine whether a number is valid or not (e.g. a bank account check digit), this means the number serves as an identification and thus should be a string type element! Another way to detect numbers which are used as identification is to determine whether it matters when you add a leading zero to the value. If that does matter, it means the value is not used numerically. After all, leading zeros don’t change a numeric value.
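The leading zero test can be illustrated with a few lines of Python. The function name survives_numeric_round_trip and the sample values are made up; the sketch only shows what is lost when an identifying value passes through a numeric type.

```python
def survives_numeric_round_trip(value):
    """True if storing value as an integer and reading it back keeps it intact."""
    try:
        return str(int(value)) == value
    except ValueError:
        # Not numeric at all, so a numeric type could not even hold it.
        return False


# As numbers, leading zeros make no difference...
print(int("007") == int("7"))
# ...but as identifiers these are two different values.
print("007" == "7")

print(survives_numeric_round_trip("42"))          # a plain number
print(survives_numeric_round_trip("0612345678"))  # a phone number loses its zero
print(survives_numeric_round_trip("A-17"))        # a non-numeric id from a new system
```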

Determine null usage
The usage of the null value in XML (xsi:nil="true") always leads to lots of discussions. The most important advice is to explicitly define standards & rules and communicate them! Decide whether the usage of null is allowed or not. If so, determine in what situations it is allowed and what it functionally means. Ask yourself how it is used and how it differs from an element being absent (optional elements).
For example, I’ve been in a project where a lot of data was updated in the database. An element being absent meant that a value didn’t change, while a null value meant for a container element that its record had to be deleted, and for a ‘value’ element that the database value had to be set to null.
The most important advice in this is: make up your mind, decide, document and communicate it!
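Once such rules have been decided, they are easy to express in code. The Python sketch below assumes that null is expressed with the standard xsi:nil attribute and follows the semantics of the project described above for ‘value’ elements; the function update_action and the action names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The xsi:nil attribute in ElementTree's {namespace}name notation.
XSI_NIL = "{http://www.w3.org/2001/XMLSchema-instance}nil"


def update_action(entity, field):
    """Decide what an update message means for one 'value' element."""
    node = entity.find(field)
    if node is None:
        return "unchanged"   # element absent: leave the database value alone
    if node.get(XSI_NIL) in ("true", "1"):
        return "set_null"    # element present with xsi:nil: set the value to null
    return "set_value"       # element present with content: store the new value


PERSON = """<Person xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Id>123</Id>
  <MiddleName xsi:nil="true"/>
  <LastName>Smith</LastName>
</Person>"""

person = ET.fromstring(PERSON)
for field in ("FirstName", "MiddleName", "LastName"):
    print(field, update_action(person, field))
```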

To summarize this first part of naming conventions and guidelines:

  • Keep in mind that a CDM keeps on changing, so it’s never finished
  • Define naming and structure standards upfront
  • and communicate your standards and guidelines!

When creating a CDM in the XML format, you also have to think about namespaces and how to design the XML. This is what the second part, in my next blog post, is all about. When you are not defining a CDM in the XML format, you can skip that one and go straight to the third and last blog post, about dependency management & interface tailoring.
