Data is really important to any organization. Data tells us what the organization is doing and where it is going, and how it can improve the quality and efficiency of its processes and achieve better results. Data can be one of the key products an organization delivers to its customers. I am sure you have read nothing new so far.
In order to make effective use of data, an organization needs to understand its own data. There must be a shared interpretation of the meaning of the data used to interpret events and steer the company. While I am sure that sounds self-evident, it is not so easy to achieve. In fact, most organizations I have visited acknowledge the importance of their data, yet they do not know what data they possess, nor is the meaning of the data sets they are aware of straightforward. It turns out that in many organizations, even if people use the same words, they frequently do not actually work from common definitions. Apparently obvious words such as risk or cost or even profit margin are, on closer inspection, interpreted and calculated in various ways. This may sound silly and of course it should not happen. But it does.
With this in mind, I am quite happy with the release last week of the new Data Catalog service on Oracle Cloud Infrastructure. The Data Catalog is intended as an enterprise-wide catalog of shared business terms and definitions, of definitions of all business data objects and their attributes, of all data assets (sources of actual data such as databases, content management systems and file storage) and of the relations between all of these.
The vision Oracle stated at OOW2019 for OCI Data Catalog:
A single collaborative solution for data professionals to collect, organize, find, access, enrich and activate technical, business and operational metadata to support self-service data discovery and governance for trusted data assets in Oracle Cloud and beyond.
The Data Catalog should facilitate finding out the meaning of business terms, finding the location of a data set, and learning about the technical format and structure of data sets and about the constraints, quality, usage, ownership, sensitivity and costs of data. A proper Data Catalog is an absolute must-have for data engineers, data analysts and data scientists. Along with the data steward, these roles are both contributors to the Data Catalog and heavy users of the information stored in it.
The overview movie introducing the Data Catalog service seems to highlight all key requirements, such as the combination of technical, business and operational meta data. Technical meta data describes the structure of tables and CSV files, of NoSQL documents and of other technical implementation constructs, using technically allowed naming (frequently with underscores, for example) and references to technical data types (such as VARCHAR2 and BOOLEAN). The business meta data consists of the functional descriptions of concepts, terms and business objects and their properties as they are used in the day-to-day business of an organization. The technical and business meta data should be linked, but they are each valuable in their own right. The operational meta data describes the usage of data and the freshness and overall quality of specific data sets.
The overview movie for OCI Data Catalog describes traceability: offering auditors the option to learn where the data used in a specific report is coming from, meaning which data assets, with which qualities, combined in which ways.
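To make the link between technical and business meta data concrete, here is a minimal sketch in Python. It is my own illustration, not the data model of OCI Data Catalog: the class names (BusinessTerm, TechnicalAttribute, DataEntity), the example table and the column names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BusinessTerm:
    """Business meta data: a functional definition in everyday language."""
    name: str
    definition: str


@dataclass
class TechnicalAttribute:
    """Technical meta data: a column or field as it is physically implemented."""
    name: str        # technical name, typically with underscores
    data_type: str   # technical type, such as VARCHAR2
    term: Optional[BusinessTerm] = None  # optional link to the business meaning


@dataclass
class DataEntity:
    """A table, file or similar container together with its attributes."""
    name: str
    attributes: List[TechnicalAttribute] = field(default_factory=list)

    def unlinked(self) -> List[str]:
        """Return the names of attributes without a business term attached."""
        return [a.name for a in self.attributes if a.term is None]


# Hypothetical example data: one linked attribute, one still unlinked.
margin = BusinessTerm("Profit margin", "Net profit divided by revenue.")
orders = DataEntity("SALES_ORDERS", [
    TechnicalAttribute("PRFT_MRGN_PCT", "NUMBER", term=margin),
    TechnicalAttribute("CUST_NM", "VARCHAR2"),
])
print(orders.unlinked())
```

Each side is usable on its own, which matches the point that both kinds of meta data are valuable in their own right. The link is what lets someone go from a cryptic column name to an agreed definition, and the unlinked list shows where that work is still open.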
So Data Catalog seems to check all the right boxes. And on top of that, in addition to recording meta data and allowing such data to be enriched and easily searched, the Data Catalog service can also harvest technical meta data from various types of data assets on Oracle Cloud Infrastructure, and beyond. Supported data asset types include Oracle Database, MySQL, Hive (on OCI), Kafka Topics on Oracle Streaming and Files (on OCI Object Storage). File formats that the harvesters can interpret: csv, xml, json, Excel, Apache ORC, Apache Parquet. The harvester creates Data Entities for Tables and Views, for Files and for Event Payloads. The attributes for these entities are created from columns and fields.
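To give a feel for what harvesting a file amounts to, here is a rough sketch in Python that derives one entity with attributes from the header and sample rows of a CSV file. This is not how the OCI harvester is implemented and it does not call any OCI API; the function names harvest_csv and infer_type, the coarse type names and the sample data are assumptions made purely for illustration.

```python
import csv
import io


def infer_type(values):
    """Guess a coarse data type from sample values (illustrative only)."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "STRING"
    for type_name, cast in (("INTEGER", int), ("DECIMAL", float)):
        try:
            for v in non_empty:
                cast(v)
            return type_name
        except ValueError:
            continue  # at least one value does not fit; try the next type
    return "STRING"


def harvest_csv(entity_name, text):
    """Build an entity description: one attribute per column in the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return {"entity": entity_name, "attributes": []}
    header, data = rows[0], rows[1:]
    attributes = []
    for i, column in enumerate(header):
        samples = [row[i] for row in data if i < len(row)]
        attributes.append({"name": column, "type": infer_type(samples)})
    return {"entity": entity_name, "attributes": attributes}


# Hypothetical file content standing in for an object in file storage.
sample = "order_id,cust_nm,amount\n1,Acme,10.50\n2,Globex,7\n"
entity = harvest_csv("orders.csv", sample)
for attribute in entity["attributes"]:
    print(attribute["name"], attribute["type"])
```

Note that everything such a harvester can produce comes straight from the file itself: names like cust_nm and a guessed type. None of the business meaning is captured this way.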
The result of harvesting technical information is of course fairly technical in nature. It is a bottom-up approach to populating the Data Catalog that is ideally complemented by a top-down approach, working from business terminology and glossary down to business object definitions.
First Increment and Next Steps
I am happy that Oracle has introduced Data Catalog on OCI. Meta Data is incredibly important and a good tool for meta-data management has been sorely missed. Having said this, I hope that we will see rapid further evolution of Data Catalog beyond its present state. I am fairly happy with what is there, but I am sad for the many things that are not yet in the product. Some of my initial findings:
- there does not seem to be a concept of a business data object – data entities seem to be primarily produced by harvesting technical data containers; I do not see how to map such a technical meta-data object to a corresponding business meta-data object that is defined top-down
- I have not seen (yet) how links can be created between data entities (similar to Relationships in ERD and foreign keys in Database Design)
- the console does not support creating Data Entities and Attributes; this can be done through the harvester and through the OCI REST API
- the operational meta-data mentioned in the overview movie (data usage, data freshness) is not currently part of the Data Catalog service
- the traceability discussed in the movie that would help auditors understand how data used in a report originated in data assets and was combined and processed does not seem to be implemented yet
- in general, governance does not really seem to be part of the service yet; rules pertaining to data entities are not included, nor is the area of data sensitivity and access rules addressed; versioning or change management for meta-data definitions is currently not supported
I hope we will soon see an extension of the currently available functionality, to turn this much needed service into one that provides the broader support we are really looking for. It is all too easy to focus on harvesting technical meta data (it makes for great demos) and not enough on the end-to-end process required to really do meta-data management and mine our data for all it’s worth.
Resources
Data Catalog Documentation : https://docs.cloud.oracle.com/en-us/iaas/data-catalog/using/index.htm
3 minute overview movie – including some of the vision for where the service is not yet, but is hopefully going: https://docs.cloud.oracle.com/en-us/iaas/data-services-assets/dc-overview.mp4
Interactive and guided tour – showing the current state of the service: https://docs.cloud.oracle.com/en-us/iaas/data-services-assets/dc-service-tour.html
Pricing for Data Catalog: ??? I have not been able to find information regarding the cost of using Data Catalog.
Hands-on Lab (Oracle OpenWorld 2019) on Data Catalog: https://static.rainfocus.com/oracle/oow19/sess/1554313399960001FQa1/PF/OOW2019_HOL4992_DataCatalog_Final_15686028810830015IdJ.pdf
Oracle Data Integration Cloud: Data Catalog Service Deep Dive (OOW 2019) : https://static.rainfocus.com/oracle/oow19/sess/1554312265193001yTvE/PF/PRO4988_OCI_Data_Catalog_Final_SRC_1568935778222001omuG.pdf