This blog post is part III, the last part of a trilogy on how to create and use a Canonical Data Model (CDM). In part I, I shared my experiences in developing a CDM and provided a set of standards and guidelines for creating one. Part II was all about XML namespace standards. This part is about using a CDM in the integration layer: how to use it in a run time environment, and what the consequences are for the development of the services that are part of the integration layer.
Dependency Management & Interface Tailoring
When you’ve decided to use a CDM, it’s quite tempting to put the XSD files that make up the CDM in a central place in the run time environment, where all the services can reference them. In this way there is only one model, one ‘truth’ for all the services. However, there are a few problems you run into quite fast when using such a central run time CDM.
The first challenge is to maintain backwards compatibility. This means that when there is a change in the CDM, the change is implemented in such a way that the CDM supports both the ‘old’ data format, according to the CDM before the change, and the new data format with the change. When you’re in the development stage of the project, the CDM will change quite frequently, in large projects even on a daily basis. When these changes are backwards compatible, the services which have already been developed and are considered finished do not need to change (unless of course the change also involves a functional change of a finished service). When these changes are not backwards compatible, every finished software component, so every finished service, has to be checked to see whether it is hit by the change. Since all services use the same set of central XSD definitions, many will be hit by a change in these definitions.
If you’re lucky, you have good unit tests or code analysis tools you can use to detect this. Still, you may ask yourself whether these tests and tools will catch 100% of the affected services. When services are hit, they have to be modified, tested and released again. To reduce maintenance and rework of all finished services, there will be pressure to maintain backwards compatibility as much as possible.
Maintaining backwards compatibility in practice means:
- that all elements that are added to the CDM have to be optional;
- that you can increase the maximum occurrence of an element, but never reduce it;
- that you can make mandatory elements optional, but not vice versa;
- and that structure changes are much more difficult.
Take, for example, a data element that has to be split up into multiple elements. Let’s take a product id element of type string and split it up into a container element that can hold multiple product identifications for the same product. The identification container element will have child elements for product id, product id type and an optional owner id for the ‘owner’ of the identification (e.g. a customer may have his own product identification). One way of applying this change and still maintaining backwards compatibility is by using an XML choice construction:
```xml
<complexType name="tProduct">
  <sequence>
    <choice minOccurs="0" maxOccurs="1">
      <element name="Id" type="string" />
      <element name="Identifications">
        <complexType>
          <sequence>
            <element name="Identification" minOccurs="0" maxOccurs="unbounded">
              <complexType>
                <sequence>
                  <element name="Id" type="string" />
                  <element name="IdType" type="string" />
                  <element name="IdOwner" type="string" minOccurs="0"/>
                </sequence>
              </complexType>
            </element>
          </sequence>
        </complexType>
      </element>
    </choice>
    <element name="Name" type="string" />
    ...
  </sequence>
</complexType>
```
There are other ways to implement this change and remain backwards compatible, but they all result in a redundant and verbose data model. As you can imagine, this soon results in a very ugly CDM, which is hard to read and understand.
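To see what the choice construction means for actual messages, here is a minimal sketch of two product fragments that would both be accepted by the tProduct type above, as far as the elements shown are concerned. The wrapping Products and Product element names and all values are assumptions made for illustration; they do not come from the CDM itself.

```xml
<Products>
  <!-- Old format: a single Id element directly under the product -->
  <Product>
    <Id>1500</Id>
    <Name>Garden chair</Name>
  </Product>
  <!-- New format: the Identifications container takes the place of Id -->
  <Product>
    <Identifications>
      <Identification>
        <Id>1500</Id>
        <IdType>internal</IdType>
      </Identification>
      <Identification>
        <Id>GC-77</Id>
        <IdType>customer</IdType>
        <IdOwner>ACME</IdOwner>
      </Identification>
    </Identifications>
    <Name>Garden chair</Name>
  </Product>
</Products>
```

Every service that reads a product now has to be prepared for both shapes, even though only one of them reflects the current data model.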
Hidden functional bugs
There is another danger. When keeping backwards compatibility in this way, the services which were finished technically don’t break and still run. But they might break functionally! This break is even more dangerous because it may not be visible immediately, and it can take quite a long time before this hidden functional bug is discovered. Perhaps the service already runs in a production environment and executes with unnoticed functional bugs!
Take the example above and consider that a service has already been developed which does something with orders. Besides order handling, it also sends the product ids in an order to a CRM system, but only the product ids in the range 1000-2000. The check in the service on the product id being in the range 1000-2000 will be based upon the original product id field. But what happens if the CDM is changed as described in the previous paragraph, so that the original product id field is part of a choice and thus becomes optional? This unchanged service now might handle orders that contain products with the newer data definition, in which the new “Identification” element is used instead of the old “Id” element. If you’re lucky, the check on the range fails with a run time exception! Lucky, because you’re immediately notified of this functional flaw. It will probably be detected quite early in a test environment when it’s common functionality. But what if it is rare functionality? Then the danger is that it might not be detected and you end up with a run time exception in a production environment. That is not what you want, but at least it is detected!
The real problem is that there is a realistic chance that the check doesn’t throw an exception and doesn’t log an error or warning. It might conclude that the product id is not in the range 1000-2000 because the product id field is not there, while the product identification actually is in that range! The message just uses the new way of modeling the product identification, with the new “Identification” element. The result is a service that has a functional bug while it seems to run correctly!
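The silent variant of this bug is easy to reproduce when the range check is written in XPath. The sketch below is a hypothetical XSLT 1.0 filter written against the old product format; the Order, Product and CrmProducts element names are assumptions for illustration, and namespaces are left out for brevity.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical CRM filter, written when a product still had a direct Id child -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/Order">
    <CrmProducts>
      <!-- Selects only products whose direct Id child is in the range 1000-2000.
           If a product has no direct Id child, both comparisons are false:
           no exception, no warning, the product is simply not selected. -->
      <xsl:for-each select="Product[Id &gt;= 1000 and Id &lt;= 2000]">
        <ProductId>
          <xsl:value-of select="Id"/>
        </ProductId>
      </xsl:for-each>
    </CrmProducts>
  </xsl:template>
</xsl:stylesheet>
```

For a product that carries its identification only as Identifications/Identification/Id with value 1500, the predicate is false, because Id selects an empty node-set and in XPath 1.0 a comparison against an empty node-set is always false. The product is skipped and never reaches the CRM system, while the transformation itself completes without any error.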
Backward compatibility in time
Sometimes you have no choice and you have to make changes which are not backward compatible. This can cause another problem: you’re not backwards compatible in time. You might be developing newer versions of services. But what if in production there is a problem with one of these new services using the new CDM and you want to go back to a previous version of that service? You have to go back to the old version of the CDM as well, because the old version is not compatible with the new CDM. But that also means that none of the newer services can run, because they depend on the new CDM. So you have to revert to the old versions for all of the new services using the new CDM!
The root cause of these problems is that all software components (services) are dependent on the central run time CDM!
So this central run time CDM introduces dependencies between all (versions of) components. This heavily conflicts with one of the base principles of SOA: loosely coupled, independent services.
There is another problem with a central CDM which has more to do with programming concepts, but which also impacts the usage of services, resulting in a slower development cycle. The interface of a service, which is described in its contract (WSDL), should reflect the functionality of the service. However, if you’re using a central CDM, the CDM is used by all the services. This means that the entities in the CDM contain all the data elements which are needed in the contracts of all the services. So basically a CDM entity consists of a ‘merge’ of all these data elements. The result is that the entities will be quite large, detailed and extensive. The services use these CDM entities in their contracts, while functionally only a (small) part of the elements is used in a single service.
This makes the interface of a service very unclear, ambiguous and meaningless.
Another side effect is that it makes no sense to validate (request) messages, because all elements will be optional.
Take for example a simple service that returns the street and city based upon the postal code and house number (this is very common functionality in The Netherlands). The interface would be nice and clear and almost self-describing when the service contract dictates that the input (request) only is a postal code and house number and the output (response) only contains the street name and the city. But with a central CDM, the input will be an entity of type address, and so will the output. With some bad luck, the address entity also contains all kinds of elements for foreign addresses, post office boxes, etc. I’ve seen exactly this example in a real project, with an address entity containing more than 30 child elements, while the service only needed four of them: two elements, postal code and house number, as input and two elements, street and city, as output. You might consider solving this by defining these separate elements as input and output and not using the entity element. But that’s not the idea of a central CDM!
Take notice that this is just a small example. I’ve seen this problem in a project with lawsuit entities. You can imagine how large such an entity can become, with hundreds of elements. Services individually only used some of the elements of the lawsuit entity, but these elements were scattered across the entire entity. So it does not help either to split up the type definition of a lawsuit entity into several sub types. In that project almost all the services needed one or more lawsuit entities, resulting in interface contracts (WSDL) which were all very generic and didn’t make sense. You needed the (up to date) documentation of the service in order to know which elements you had to use in the input and which elements were returned as output, because the definitions of the request and response messages were not useful as they contained complete entities.
The solution to both of the problems described above is not to use a central run time CDM, but only a design time CDM.
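For the address lookup service described above, a contract based on a central run time CDM might look like the following sketch. The namespaces, the file name CentralCDM.xsd and the type name TAddress are assumptions for illustration.

```xml
<schema targetNamespace="http://example.org/AddressService"
        xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:cdm="http://example.org/CentralCDM"
        version="1.0">
  <!-- Hypothetical central CDM, shared by all services at run time -->
  <import namespace="http://example.org/CentralCDM" schemaLocation="CentralCDM.xsd"/>
  <element name="getAddressRequest">
    <complexType>
      <sequence>
        <!-- The full address entity, although only postal code and house number are used -->
        <element name="Address" type="cdm:TAddress"/>
      </sequence>
    </complexType>
  </element>
  <element name="getAddressResponse">
    <complexType>
      <sequence>
        <!-- The same full entity again, although only street and city are returned -->
        <element name="Address" type="cdm:TAddress"/>
      </sequence>
    </complexType>
  </element>
</schema>
```

Request and response are indistinguishable in this contract, so a consumer cannot tell from the interface alone which elements to fill in and which ones to expect back.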
This design time CDM has no namespace (or a dummy one). When a service is developed, a hard copy is made of (a part of) the CDM at that moment to a (source) location specific for that service. Then a service specific namespace has to be applied to this local copy of the (service specific) CDM.
And now you can shape this local copy of the CDM to your needs! Tailor it by removing elements that the service contract (WSDL) doesn’t need. You can also apply more restrictions to the remaining elements by making optional elements mandatory, reducing the maximum occurrences of an element and even creating data value restrictions for an element (e.g. setting a maximum string length). By doing this, you can tailor the interface in such a way that it reflects the functionality of the service!
You can even have two different versions of an entity in this copy of the CDM. For example one to use in the input message and one in the output message.
Let’s demonstrate this with the example above: an address with only postal code and house number for the input message and an address with street and city for the output message. The design time CDM contains the full address entity, while the local, tailored copy of the service CDM contains two tailored address entities. The service XSD, which contains the message definitions of the request and response payloads, then uses this tailored copy:
```xml
<!-- Design time CDM: the full address entity, with a dummy namespace -->
<schema targetNamespace="DUMMY_NAMESPACE" xmlns="http://www.w3.org/2001/XMLSchema" version="1.0">
  <complexType name="TAddress">
    <sequence>
      <element name="Department" type="string" minOccurs="0"/>
      <element name="Street" type="string" minOccurs="0"/>
      <element name="Number" type="string" minOccurs="0"/>
      <element name="PostalCode" type="string" minOccurs="0"/>
      <element name="City" type="string" minOccurs="0"/>
      <element name="County" type="string" minOccurs="0"/>
      <element name="State" type="string" minOccurs="0"/>
      <element name="Country" type="string" minOccurs="0"/>
    </sequence>
  </complexType>
</schema>
```

```xml
<!-- Service specific CDM: the local, tailored copy with its own namespace -->
<schema targetNamespace="http://nl.amis.AddressServiceCDM" xmlns="http://www.w3.org/2001/XMLSchema" version="1.0">
  <complexType name="TAddressInput">
    <sequence>
      <element name="Number" type="string" minOccurs="0"/>
      <element name="PostalCode" type="string" minOccurs="1"/>
    </sequence>
  </complexType>
  <complexType name="TAddressOutput">
    <sequence>
      <element name="Street" type="string" minOccurs="1"/>
      <element name="City" type="string" minOccurs="1"/>
    </sequence>
  </complexType>
</schema>
```

```xml
<!-- Service XSD: request and response payloads, built on the tailored entities -->
<schema targetNamespace="http://nl.amis.AddressService" xmlns="http://www.w3.org/2001/XMLSchema" xmlns:cdm="http://nl.amis.AddressServiceCDM" version="1.0">
  <import namespace="http://nl.amis.AddressServiceCDM" schemaLocation="AddressServiceCDM.xsd"/>
  <element name="getAddressRequest">
    <complexType>
      <sequence>
        <element name="Address" type="cdm:TAddressInput" minOccurs="1"/>
      </sequence>
    </complexType>
  </element>
  <element name="getAddressResponse">
    <complexType>
      <sequence>
        <element name="Address" type="cdm:TAddressOutput" minOccurs="1"/>
      </sequence>
    </complexType>
  </element>
</schema>
```
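The tailored copy above only removes elements and tightens occurrences. A data value restriction, as mentioned earlier, could be added in the same local copy. The sketch below is one possible variant: the type name TPostalCode and the maximum length of 7 characters (four digits, an optional space and two letters) are assumptions, not part of the original service.

```xml
<schema targetNamespace="http://nl.amis.AddressServiceCDM"
        xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:tns="http://nl.amis.AddressServiceCDM"
        version="1.0">
  <!-- Hypothetical value restriction, defined only in this service's copy -->
  <simpleType name="TPostalCode">
    <restriction base="string">
      <maxLength value="7"/>
    </restriction>
  </simpleType>
  <complexType name="TAddressInput">
    <sequence>
      <!-- House number made mandatory for this variant of the service -->
      <element name="Number" type="string" minOccurs="1"/>
      <element name="PostalCode" type="tns:TPostalCode" minOccurs="1"/>
    </sequence>
  </complexType>
</schema>
```

Because the restriction lives in the service specific copy, it does not affect any other service that uses an address, and validating request messages against this schema becomes meaningful again.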
When you’re finished tailoring, you can still deploy these service interfaces (WSDL) containing the shaped data definitions (XSDs) to a central run time location. However each service must have its own location within this run time location, to store these tailored data definitions (XSDs). When you do this, you can also store the service interface (abstract WSDL) in there as well. In this way there is only one copy of a service interface, that is used by the implementing service as well as by consuming services.
I’ve worked in a project with SOAP services where the conventions dictated that the filename of a WSDL is the same as the name of the service. The message payloads were not defined in this WSDL, but were included from an external XSD file. This XSD also had the same filename as the service name. This service XSD defined the payload of the messages, but it did not contain CDM entities or CDM type definitions. They were included from another XSD with the fixed name CDM.xsd. This local, service specific CDM.xsd contained the tailored (stripped and restricted) copy of the central design time CDM, but had the same target namespace as the service.wsdl and the service.xsd.
This approach also gave the opportunity to add operation specific elements to the message definitions in the service.xsd. These operation specific elements were not part of the central CDM and did not belong there due to their nature (operation specific). These operation specific elements were rarely needed, but when they were needed, they did not pollute the CDM, because you don’t need to somehow add them to the CDM. Think of switches and options on operations which act on functionality, e.g. a boolean type element “includeProductDescription” in the request message for operation “getOrder”.
Note: The services in the project all did use a little generic XML, of which the definition (XSD) was stored in a central run time location. However, these data definitions are technical data fields and therefore are not part of the CDM. Examples are header fields that are used for security, a generic response entity containing messages (error, warning, info) and optional paging information elements in case a response contains a collection. You need a central type definition when you are using generic functionality (e.g. from a software library) in all services and consuming software.
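A minimal sketch of this file convention, applied to the address service, could look as follows. The namespace http://example.org/AddressService and the file name AddressService.xsd are assumptions; only the fixed name CDM.xsd comes from the project conventions described above. First the local CDM.xsd, which shares the target namespace of the service:

```xml
<!-- CDM.xsd: tailored copy of the design time CDM, local to this service -->
<schema targetNamespace="http://example.org/AddressService"
        xmlns="http://www.w3.org/2001/XMLSchema"
        version="1.0">
  <complexType name="TAddressInput">
    <sequence>
      <element name="Number" type="string"/>
      <element name="PostalCode" type="string"/>
    </sequence>
  </complexType>
</schema>
```

Then the service XSD, which pulls in the local CDM with an include instead of an import, because both files have the same target namespace:

```xml
<!-- AddressService.xsd: message payloads only, no CDM type definitions -->
<schema targetNamespace="http://example.org/AddressService"
        xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:tns="http://example.org/AddressService"
        version="1.0">
  <include schemaLocation="CDM.xsd"/>
  <element name="getAddressRequest">
    <complexType>
      <sequence>
        <element name="Address" type="tns:TAddressInput"/>
      </sequence>
    </complexType>
  </element>
</schema>
```

With a single namespace per service, consumers only need one prefix for everything that belongs to the service contract.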
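As a sketch of such an operation specific element, the request message for “getOrder” could be defined as below. The namespace and the OrderId element are assumptions for illustration; only the “includeProductDescription” switch is taken from the example above.

```xml
<schema targetNamespace="http://example.org/OrderService"
        xmlns="http://www.w3.org/2001/XMLSchema"
        version="1.0">
  <element name="getOrderRequest">
    <complexType>
      <sequence>
        <!-- Hypothetical key of the order to retrieve -->
        <element name="OrderId" type="string"/>
        <!-- Operation specific switch: defined in the service XSD, not in the CDM -->
        <element name="includeProductDescription" type="boolean" minOccurs="0"/>
      </sequence>
    </complexType>
  </element>
</schema>
```

The switch only has meaning for this one operation, so it stays in the message definition and the CDM remains a pure description of business data.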
With this approach of a design time CDM and tailored interfaces:
- There are no run time dependencies on the CDM and thus no dependencies between (versions of) services
- Contract breach and hidden functional bugs are prevented. (Because of the different namespaces, services have to copy each data element individually when passing an entity, or part of an entity, to their output)
- Service interfaces reflect the service functionality
- Method specific parameters can be added without polluting the CDM
- And – most important – the CDM can change without limitations and as often as you want to!
The result is that the CDM will grow over time into a nice, clean and mature model that reflects the business data model of the organization, while not impeding and even promoting the agility of service development. And that is exactly what you want with a CDM!
When to use a central run time CDM
A final remark about a central run time CDM. There are situations where this can be a good solution. That is for smaller integration projects and in the case when all the systems and data sources which are to be connected with the integration layer are already in place, so they are not being developed. They probably already run in production for a while.
This means that the data and the data format which have to be passed through the integration layer and are used in the services are already fixed. You could state that the CDM is already there, although it still has to be described and documented in a data model. It’s likely that it’s also a project where there is a ‘one go’ to production, instead of frequent delivery cycles.
When after a while one system is replaced by another system, or the integration layer is extended by connecting one or more systems, and as a result the CDM has to be changed, you can add versioning to the CDM. Create a copy of the existing CDM, give it a new version (e.g. with a version number in the namespace) and make the changes to the CDM which are needed. This is also a good opportunity to clean up the CDM by removing unwanted legacy that was kept for backwards compatibility. Use this newer version of the CDM for all new development and maintenance of services.
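Applied to the product example from earlier in this post, a new CDM version might look like the sketch below. The namespace http://example.org/cdm/v2 is an assumption; it only illustrates carrying the version number in the namespace. The choice construction is gone, because the old “Id” element no longer has to be supported in this version. Other elements of the product are omitted.

```xml
<schema targetNamespace="http://example.org/cdm/v2"
        xmlns="http://www.w3.org/2001/XMLSchema"
        version="2.0">
  <complexType name="tProduct">
    <sequence>
      <!-- Only the new identification structure remains -->
      <element name="Identifications">
        <complexType>
          <sequence>
            <element name="Identification" minOccurs="0" maxOccurs="unbounded">
              <complexType>
                <sequence>
                  <element name="Id" type="string"/>
                  <element name="IdType" type="string"/>
                  <element name="IdOwner" type="string" minOccurs="0"/>
                </sequence>
              </complexType>
            </element>
          </sequence>
        </complexType>
      </element>
      <element name="Name" type="string"/>
    </sequence>
  </complexType>
</schema>
```

Services that still reference the previous version keep running against the old namespace, while new and refactored services reference the cleaned-up one.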
Again, only use this central run time CDM for smaller projects and when it is a ‘one go’ to production (e.g. replacement of one system). As soon as the project becomes larger and/or integration of systems keeps on going, switch over to the design time CDM approach.
You can easily switch over by starting to develop the new services with the design time CDM approach and keeping the ‘old’ services running with the central run time CDM. As soon as there is a change in an ‘old’ service, refactor it to the new approach of the design time CDM. In time there will be no more services using the run time CDM, so the run time CDM can be removed.
After reading this blog post, together with the previous two blog posts which make up the trilogy about my experiences with a Canonical Data Model, you should have a good understanding of how to set up a CDM and use it in your projects. Hopefully it helps you in making valuable decisions about creating and using a CDM, and your projects will benefit from it.