Apache Spark appeared in many sessions at Oracle OpenWorld 2016. It is clear that Oracle is embracing, leveraging and endorsing Spark at various levels. Apache Spark is “a fast and general engine for large-scale data processing”. Spark has taken over from Hadoop MapReduce as the most prominent distributed job engine: it organizes jobs, which includes sending the function to the distributed data and gathering the results. You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. Spark runs against data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source. Spark has an associated stack of libraries, including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming, that make it easier to express specific jobs in specific languages and to use domain-specific algorithms.
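To make the idea of "sending the function to the data and gathering the results" a little more concrete, here is a minimal sketch in plain Python. It does not use Spark at all: the in-memory `PARTITIONS` list stands in for data blocks spread over cluster nodes, a thread pool stands in for the executors, and the names `count_words` and `run_job` are made up for this illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

# Stand-in for data blocks that would live on different cluster nodes
# (hypothetical sample data, not taken from any real cluster).
PARTITIONS: List[List[str]] = [
    ["spark sql", "spark ml"],
    ["hadoop yarn", "spark streaming"],
    ["hadoop hdfs"],
]


def count_words(lines: Iterable[str]) -> Counter:
    """The function that gets shipped to each partition."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def run_job(partitions: List[List[str]],
            task: Callable[[Iterable[str]], Counter]) -> Counter:
    """Apply `task` to every partition in parallel, then merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(task, partitions))
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds the counts together
    return total


word_counts = run_job(PARTITIONS, count_words)
```

A real engine such as Spark additionally takes care of scheduling, data locality and recovering from failed nodes; the sketch only shows the shape of the computation.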
Spark SQL, for example, allows a query to be expressed in SQL, fed in over a JDBC-style connection from a client program and run against data on a distributed Hadoop cluster. The results are returned to the client program in the same shape as from any JDBC operation against a relational database. Simply put: (big) data stored on a cheap, potentially huge, scale-out distributed file system on commodity hardware can be accessed in pretty much the same way as data stored in an expensive RDBMS. The larger the dataset to process, the more the Spark/Hadoop solution may shine. Note: Spark 2.0 is [close to] SQL 2003 compliant; it runs all 99 TPC-DS queries unmodified.
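As a rough illustration of running plain SQL against files on a cluster, the sketch below uses the PySpark `SparkSession` API that ships with Spark 2.0. Note that it runs the query in-process rather than over the JDBC-style connection described above. The table name `sales`, its columns, the path `hdfs:///data/sales` and the helper `revenue_per_region` are assumptions invented for this example.

```python
from typing import Any, List

# Hypothetical table and column names, chosen only for illustration.
QUERY = """
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region
ORDER BY total DESC
"""


def revenue_per_region(spark: Any, path: str) -> List[Any]:
    """Expose the Parquet files under `path` as a table and query it with SQL."""
    sales = spark.read.parquet(path)        # e.g. an hdfs:// location
    sales.createOrReplaceTempView("sales")  # make the DataFrame visible to SQL
    return spark.sql(QUERY).collect()       # result rows come back to the caller


def main() -> None:
    # Imported here so that the module also loads where PySpark is not installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sales-report").getOrCreate()
    for row in revenue_per_region(spark, "hdfs:///data/sales"):  # assumed path
        print(row["region"], row["total"])
    spark.stop()
```

Calling `main()` requires a working Spark installation and data at the assumed path; the query itself is ordinary SQL that could just as well be sent to a relational database.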
Spark ML (Machine Learning) is somewhat similar in that it offers easy programmatic access to a rich set of advanced algorithms and utilities that mine the data on Hadoop, for example to classify, cluster, filter and otherwise analyze it. The objective is to find patterns and models from which predictions can be made on fresh data.
Oracle is doing various things with Spark:
- leveraging Spark under the covers in Big Data Discovery, Big Data Preparation, Data Flow Machine Learning for “streaming ETL” and more
- allowing customers to have Spark jobs (perhaps to be called microservices) deployed as applications on Application Container Cloud (not formally announced, yet shown on some slides)
- Oracle R Advanced Analytics for Hadoop (ORAAH): providing access to Spark ML through R, and providing a high-performance platform for running R on Spark clusters
- enhancing Spark processing on SPARC processors using DAX (data accelerators) with Apache Spark SQL – 20x performance improvement over non-DAX accelerated CPUs
A high level goal expressed by Oracle:
Queries can be expressed in various ways, including SQL and R, and can be executed in various (even hybrid) ways against different data sources. To developers and data scientists, it should be transparent where the data lives and how the algorithms are implemented and executed. The functionally expressed queries can be transformed into a job on Spark using Spark SQL or Spark ML, a SQL query or R analysis on Oracle RDBMS, etc.
Below you will find a number of examples of references to Spark, from a wide range of sessions and presentations.
From Larry’s keynotes:
From the keynote by Thomas Kurian:
From the General Session by Inderjeet Singh:
The session Big Data Predictive Analytics and Machine Learning – Strategy and Roadmap by Charles Berger and Marcos Arancibia:
From session CON 6704 Analytics Pipeline with Apache Spark SQL and Machine Learning by Brad Carlile: