The importance of data has never been in doubt in the world of Oracle. Through machine learning and predictive analytics as well as real-time streaming data and Big Data, the data spectrum has broadened considerably. With the quickly expanding range of storage and processing facilities, analysis algorithms and visualization means, the opportunity to retrieve value from more types of data has grown considerably. At the same time that the importance of data in general and certainly the relevance of SQL is growing, it seems that the central role of the RDBMS as all-encompassing enterprise data source is somewhat fading.
As data sets grow and the rate of transactions increases, we are starting to look for different ways to record these transactions and to query and analyze the data. Hadoop and NoSQL as alternative data stores and Apache Spark as data processing engine are rapidly coming to the front. Oracle embraces these technologies and wraps them together with the relational database under Big Data SQL. Functional programming and the idea that the data processing function should be moved to the data rather than the data to the function – as implemented among others in Apache Spark – play an important role in making even very large data sets accessible in interactive query mode.
When the rate of data events becomes really high – for example in Internet of Things environments with large numbers of devices constantly reporting data or with web sites receiving high peak loads of visitors – traditional transaction processing in a single OLTP engine is not feasible. New mechanisms have evolved to handle and safely persist such high volumes of messages; Apache Kafka is the most prominent of these and is widely adopted, including by Oracle. Note that most NoSQL databases and the TimesTen In-Memory Database can also process very high volumes of new records.
In addition to being received and safeguarded, these messages also need to be processed very quickly; many of them may only be relevant in real time – security breach, fire sensors, failing equipment indicators – and the volume of the data kept in data stores such as the corporate data lake has to be reduced in size more or less immediately. Streaming data processing [aka real-time analytics] is a term used for finding meaning in real time data streams – by aggregating in time windows, matching patterns in subsequently arriving events and detecting events with values above threshold levels. Oracle has announced Data Flow Machine Learning (DFML) Cloud Service for the same purpose: analyzing real time streams of data.
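To make two of these techniques a little more concrete, here is a minimal Python sketch of aggregation in fixed (tumbling) time windows and of threshold detection. It is only an illustration under assumptions: the sensor readings, the window size and the function names are hypothetical, and it says nothing about how DFML or any other product implements stream analytics.

```python
from collections import defaultdict


def tumbling_window_averages(events, window_seconds):
    """Group (timestamp, value) events into fixed windows and average each window."""
    buckets = defaultdict(list)
    for timestamp, value in events:
        # Every event belongs to exactly one window, identified by its start time.
        window_start = (timestamp // window_seconds) * window_seconds
        buckets[window_start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


def above_threshold(events, threshold):
    """Return the events whose value exceeds the threshold."""
    return [(timestamp, value) for timestamp, value in events if value > threshold]


# Hypothetical temperature readings: (seconds since start, degrees).
readings = [(0, 20.0), (3, 22.0), (7, 21.0), (12, 80.0), (14, 90.0), (21, 23.0)]

window_averages = tumbling_window_averages(readings, 10)
alerts = above_threshold(readings, 75.0)

print(window_averages)
print(alerts)
```

A real streaming engine does the same kind of work incrementally on an unbounded stream instead of on a finished list, and also has to cope with late and out-of-order events.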
This article discusses some of the announcements and trends, product offerings and roadmaps around Big Data, Fast Data, Data Integration and the governance and quality of data. My previous article on machine learning and predictive analytics is closely related.
Some data are more equal – Big Data SQL the great data equalizer
Most enterprises have a lot of variety in the data they deal with. Some data is highly structured and other data is very unstructured, some data is bound by strict integrity rules and quality constraints and other data is free of any restrictions, some data is “hot” – currently very much in demand – and other data can be stone cold. Some data needs to be extremely accurate, down to a prescribed number of fractional digits, and other data is only approximate. Some is highly confidential and other publicly accessible. Some is around in small quantities and other in huge volumes.
Over the years, companies including Oracle have come to the realization that all this differentiation in data justifies or even mandates a differentiation in how the data is stored and processed. It does not make sense to treat the hottest transactional data in the same way as the archived records from 30 years ago. Yet many organizations have been doing exactly that: store it all in the enterprise relational database. It works, keeps all data accessible for those rare instances where that really old data is required and most importantly: keeps all data accessible in the same way – through straightforward SQL queries.
Recent technological developments – within and outside of Oracle – have made a different approach feasible – with all the benefits of the central enterprise database and increased performance and scalability and potentially lower hardware and license costs. NoSQL data stores and Hadoop (distributed file system) are proven solutions for embracing and storing massive volumes of data in a variety of formats. Apache Spark – as well as several other job execution engines running on top of Hadoop – has made processing and accessing data from Hadoop very simple and straightforward. SQL statements can be executed against Hadoop with results being returned in just a few seconds. Most NoSQL engines also support some flavor of SQL – isn’t that ironic – to cater for the fact that developers, data scientists and tools and programming languages can all speak SQL.
Oracle acknowledges the fact that not all data will always be in the Oracle Database. With partitions and shards, external tables and database links, in-database archiving, compression and other features, Oracle has facilitated storing ever more data within the database universe. With Big Data SQL, it steps outside that comfort zone. Data can be on Hadoop, data can be in NoSQL databases as well as in the Oracle Database. Through Big Data SQL, it is transparent to the developer or data scientist where the data resides and in what format it is. Any Oracle style SQL statement can be executed against the Oracle Database [platform] and will be run against all relevant data sources. If the query involves data on Hadoop and/or in NoSQL stores, then Big Data SQL will execute the query in a federated way in each data node.
Big Data SQL, described as a ‘franchised query engine’, enables scalable, integrated access in situ to the entire Big Data Management System (BDMS). The Oracle Database is central in this approach: it holds the meta-data about all distributed and hybrid data sources, it coordinates query execution and constructs final result sets, it handles workload management and can do data optimization. Even if none of the data returned to the client application is actually retrieved from tables in the Oracle Database, it will still have played a key role by translating the initial query into the jobs to be executed against all participating data stores (a massively parallel query) and composing the result. The Oracle Database platform will also enforce authorization rules – including RAS and Redaction – and perform auditing.
Big Data SQL and the Big Data connectors – to link the Oracle Database to Hadoop – are products offered alongside Oracle Database. They are also offered as part of the Big Data Appliance, an engineered system for running diverse workloads on Hadoop and NoSQL systems.
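The coordinating role can be illustrated with a small Python sketch: a coordinator sends the same filter to every data store, each store evaluates it locally and returns only the matching rows, and the coordinator composes the final result. This is a conceptual toy under assumptions, not a description of how Big Data SQL works internally; the store names, the rows and the functions `scan_store` and `federated_query` are all hypothetical.

```python
def scan_store(rows, predicate, columns):
    """Runs 'at the data': filter and project locally, return only matching rows."""
    return [{c: row[c] for c in columns} for row in rows if predicate(row)]


def federated_query(stores, predicate, columns):
    """Coordinator: push the same filter to every store and merge the partial results."""
    result = []
    for name, rows in stores.items():
        for row in scan_store(rows, predicate, columns):
            result.append({"source": name, **row})
    return result


# Hypothetical data spread over three kinds of stores.
stores = {
    "oracle_db": [
        {"order_id": 1, "amount": 250, "year": 2016},
        {"order_id": 2, "amount": 40, "year": 2016},
    ],
    "hadoop": [
        {"order_id": 3, "amount": 900, "year": 1987},
        {"order_id": 4, "amount": 15, "year": 1990},
    ],
    "nosql": [
        {"order_id": 5, "amount": 120, "year": 2015},
    ],
}

large_orders = federated_query(stores, lambda r: r["amount"] > 100, ["order_id", "amount"])
print(large_orders)
```

The point of the design is that only the filtered and projected rows travel back to the coordinator; the caller sees one result set and does not need to know which store held which row.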
Big Data Movement
Big Data does not come out of nowhere into an enterprise’s data lake. It takes some good thinking to devise the flow of data from its origin to a location, format, quality and granularity level that makes it useful for further processing. The next illustration tries to visualize the various activities that have to be considered to bring data from sources such as the internet of things, social media, web site activity, IT infrastructure and traditional transaction engines into data reservoirs (aka data lakes aka data warehouses) where data science can be performed.
Roughly speaking, there has to be
· capture (including preprocessing) resulting in data stored and available for great things
· process and analysis (to understand the data and turn it into actionable models)
· report and action (provide insight, give recommendations, perform automatic actions)
· governance (to understand the data, guarantee its quality and confidentiality, manage retention and purging)
Depending on the nature of the data, the source and channel, the volume, the urgency and the format, different technologies can be applied.
It is not simple to fully understand the Oracle portfolio around big data integration. There are a number of products and cloud services and they seem to have some overlap in functionality, use case and in subscription or license conditions. Some are well established on premises offerings that are now also made available from the cloud, others are built largely from the ground up as native cloud services. Some have been announced but are not yet generally available.
The next figure maps the phases and main activities around big data capture, process & analysis, report & act and governance to the Oracle product portfolio.
An area that is rapidly increasing in importance is at the bottom in this figure: governance. In recent years, Oracle has acquired two products that are today called Oracle Enterprise Data Quality (OEDQ) and Oracle Enterprise Metadata Management (OEMM).
OEDQ is used to inspect and improve the quality of incoming data before it gets added to the data warehouse or data lake. It is used to filter and enrich data.
More support for data governance is on the near term roadmap – for more tactical data quality management, through reporting, dashboards and trend analysis. In 2017, OEDQ will be available as a subscription-based offering on the public cloud (on JCS), in addition to the existing on premises model. Next to its standalone role, OEDQ is also part of the newly announced Cloud Data Integration Suite and a core platform component in various Oracle SaaS Cloud Services, such as Sales Cloud, Customer Data Management and Procurement Cloud.
Oracle Enterprise Metadata Management
Oracle Enterprise Metadata Management is another recent acquisition that is closely related to OEDQ. It is a product that holds the essence of all data of an organization: its metadata, describing structure and meaning, provenance and usage, lifecycles, value ranges and integrity constraints. OEMM provides data lineage, impact analysis, semantic definition and semantic usage analysis for any metadata asset within the catalog. OEMM’s algorithms stitch together metadata from each of the providers, yielding the complete path of data from source to report or vice versa.
OEMM can harvest and catalog metadata from virtually any metadata provider, including relational databases, Hadoop, ETL and BI products and data modeling tools. A cross enterprise business glossary is part of OEMM – allowing users to share and collaborate on business terminology and its relationship to the metadata and thereby to the actual business data. Questions addressed by OEMM include: what is the meaning of the data in this table or this file, where did the data come from, how does this attribute over here compare to this field over there and what will happen if we change the definition of this attribute from number to string or from a maximum length of 5 to 10?
The vision is to provide a Control Center for enterprise data that enables you to treat it as a capital asset. This offering will bring together at least the capabilities from OEMM and OEDQ as well as others. The first version (CY 2017/2018) will be focused on Data Quality Governance.
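Lineage and impact analysis can be thought of as walking a graph of "feeds into" relations between metadata assets. The Python sketch below shows that idea with a plain breadth-first traversal. It is an assumption-laden illustration, not OEMM's algorithm or API: the asset names, the edge list and the helper functions are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Turn (from_asset, to_asset) pairs into an adjacency map."""
    graph = defaultdict(set)
    for source, target in edges:
        graph[source].add(target)
    return graph


def reachable(graph, start):
    """Breadth-first search: every asset that can be reached from start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Hypothetical metadata harvested from a source system, an ETL tool and a BI tool.
edges = [
    ("crm.customers.zip", "etl.load_customers"),
    ("etl.load_customers", "dwh.dim_customer.zip"),
    ("dwh.dim_customer.zip", "report.sales_by_region"),
    ("erp.orders.total", "dwh.fact_orders.total"),
    ("dwh.fact_orders.total", "report.sales_by_region"),
]

downstream = build_graph(edges)
upstream = build_graph((target, source) for source, target in edges)

# Impact analysis: what is affected if the source attribute changes?
impact = reachable(downstream, "crm.customers.zip")
# Lineage: where does the data in this report come from?
lineage = reachable(upstream, "report.sales_by_region")

print(sorted(impact))
print(sorted(lineage))
```

Following the edges forward answers the "what will happen if we change this attribute" question, and following them backward answers "where did the data come from".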
GoldenGate – the real time streaming platform – deserves to be mentioned here as it continues to be the cornerstone of most real time replication and event sourcing scenarios involving structured data sources. GoldenGate can extract transactions in near real time from a plethora of data platforms with no impact on running transactions. The extracted changes can be forwarded and processed into a variety of targets and formats. Next to replication of data sets, GoldenGate is also frequently pivotal in zero downtime database upgrade scenarios as well as in disaster recovery use cases. GoldenGate for Big Data – announced in 2015 – adds a real-time transactional data streaming platform into big data systems, such as Hadoop, Kafka, Spark and Flume.
GoldenGate Cloud Service is a new PaaS Solution on OPC providing real-time data replication from on premises (or third party cloud) to Oracle Cloud with low impact and heterogeneous support. GoldenGate Cloud Service can be used for cloud on-boarding, dev/test, query offloading/live reporting, real-time DW and high availability use cases.
The GoldenGate Cloud Service offers real time data replication – and other GoldenGate use cases – from the cloud, based on a subscription. It covers cloud to cloud (Oracle and 3rd party cloud), and on premises to cloud scenarios and potentially also on premises to on premises. In addition to replication, the cloud service is also put forward to handle migrations between cloud vendors. When used for a zero downtime upgrade, GoldenGate is used for only a short period of time; this fits in very nicely with a short term cloud subscription.
The challenges GoldenGate CS helps tackle:
• Taking Too Long to Upload Data to Cloud – GGCS change data capture allows incremental data updates instead of loading large data sets at once and reduces the dependency on the performance of the network connection.
• Cost of Migrating to Cloud is Too High – GGCS’s non-intrusive change data capture allows migration without impacting the existing source systems and is transparent to the users. Automated and repeatable setup reduces man hours in the migration.
• On-Premises and Cloud are Disconnected – GGCS replication provides real-time updates from on premises databases with sub-second latency.
• No clear way to access services from different environments – GGCS supports hybrid cloud environments across different data servers, versions and platforms.
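The idea behind change data capture, shipping only what changed instead of reloading the full data set, can be sketched in a few lines of Python. This is a simplified illustration under assumptions: the change record layout (`op`, `key`, `row`) and the `apply_changes` function are hypothetical and do not represent GoldenGate's trail format or API.

```python
def apply_changes(target, changes):
    """Apply ordered insert/update/delete records to a replica keyed by primary key."""
    for change in changes:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            # Merge the changed columns into the existing row (or create the row).
            target[key] = {**target.get(key, {}), **change["row"]}
        elif op == "delete":
            target.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return target


# Hypothetical replica in the cloud and a batch of captured source changes.
replica = {
    1: {"name": "Ann", "city": "Utrecht"},
    2: {"name": "Bob", "city": "Delft"},
}
changes = [
    {"op": "update", "key": 1, "row": {"city": "Leiden"}},
    {"op": "insert", "key": 3, "row": {"name": "Cas", "city": "Breda"}},
    {"op": "delete", "key": 2},
]

apply_changes(replica, changes)
print(replica)
```

Only three small records cross the network here, regardless of how large the source table is, which is why this approach depends less on network performance than a bulk load does.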
Other mechanisms for handling real time data are rapidly evolving. Oracle announced the Data Flow Machine Learning Cloud Service, suggested an Oracle PaaS Event Bus based on Apache Kafka, is extending the Stream eXplorer product as Stream Analytics and is working on Oracle Functions – serverless processing engines that can respond to real time events. GoldenGate can be both a source and a target for these mechanisms.
Download the AMIS OOW16 Highlights for an overview of announcements at OOW16.