Spark with a K – how Apache Spark is omnipresent at Oracle OpenWorld 2016

Lucas Jellema

Apache Spark made numerous appearances in many different sessions during Oracle OpenWorld 2016. It is clear that Oracle is embracing, leveraging, and endorsing Spark at various levels. Apache Spark is “a fast and general engine for large-scale data processing”. Spark has taken over from Hadoop MapReduce as the most prominent distributed job engine, one that organizes jobs – including sending the function to the distributed data and gathering the results. You can run Spark in its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. Spark runs against data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source. Spark has an associated stack of libraries, including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming, that make it easy to express specific jobs in specific languages and to use domain-specific algorithms.
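
To make that division of work concrete, here is a minimal PySpark sketch (not taken from any OpenWorld session): the aggregation runs on the executors, close to the data in HDFS, and only the small result set travels back to the driver. The application name, HDFS path, and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The cluster manager (standalone, YARN, Mesos) is chosen when the job is
# submitted, e.g. spark-submit --master yarn aggregate_orders.py
spark = (SparkSession.builder
         .appName("order-aggregation-sketch")   # hypothetical application name
         .getOrCreate())

# Read a distributed dataset straight from HDFS (hypothetical location)
orders = spark.read.parquet("hdfs:///data/orders")

# The grouping and summing are executed where the data lives;
# only the per-country totals are gathered back by the driver.
totals = (orders.groupBy("country")
                .agg(F.sum("amount").alias("total_amount"))
                .orderBy(F.desc("total_amount")))

totals.show(10)
spark.stop()
```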

Spark SQL, for example, allows a query to be expressed in SQL, fed in over a JDBC-style connection from a client program, and run against data on a distributed Hadoop cluster. The results are returned to the client program in the same shape as from any JDBC operation against a relational database. Simply put: (big) data stored on a cheap, potentially huge, scale-out distributed file system on commodity hardware can be accessed in pretty much the same way as data stored in an expensive RDBMS. The larger the dataset to process, the more the Spark/Hadoop solution may shine. Note: Spark 2.0 is [close to] SQL 2003 compliant; it runs all 99 TPC-DS queries unmodified.
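
The following sketch illustrates that idea with hypothetical table, path, and column names: plain SQL is executed against JSON files on HDFS and the result comes back as a DataFrame, much like a JDBC result set. With the Spark Thrift JDBC/ODBC server in place, the same statement could be sent from an ordinary JDBC client program.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

# Expose a distributed dataset (JSON files on HDFS) as a SQL-queryable view
spark.read.json("hdfs:///data/clickstream").createOrReplaceTempView("clicks")

# Plain SQL against data on the Hadoop cluster; the result is returned
# to the client program as a DataFrame, much like a JDBC result set
popular_pages = spark.sql("""
    SELECT page, COUNT(*) AS hits
    FROM clicks
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 25
""")
popular_pages.show()
```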

Spark ML (Machine Learning) is somewhat similar in that it offers easy programmatic access to a rich set of advanced algorithms and utilities that mine the data on Hadoop in order to, for example, classify, cluster, filter, and otherwise analyze the data, with the objective of finding patterns and models from which predictions can be made on fresh data.
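
As an illustration, here is a minimal Spark ML sketch, assuming a hypothetical customer dataset on HDFS: the numeric columns are assembled into a feature vector and a k-means model is trained that can subsequently assign a cluster to fresh records.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("spark-ml-sketch").getOrCreate()

# Hypothetical dataset of customer attributes stored on the Hadoop cluster
customers = spark.read.parquet("hdfs:///data/customers")

# Combine the (assumed) numeric columns into the single vector column Spark ML expects
assembler = VectorAssembler(
    inputCols=["age", "orders_per_year", "avg_order_value"],
    outputCol="features")
features = assembler.transform(customers)

# Train a clustering model on the distributed data
model = KMeans(k=5, seed=42, featuresCol="features").fit(features)

# Score records: each row gets a 'prediction' column holding its cluster
clustered = model.transform(features)
clustered.select("customer_id", "prediction").show(10)
```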

Oracle is doing various things with Spark:

  • Leveraging Spark under the covers in Big Data Discovery, Big Data Preparation, Data Flow Machine Learning for “streaming ETL”, and more
  • Allowing customers to have Spark jobs (perhaps to be called microservices) deployed as applications on Application Container Cloud (not formally announced yet, but shown on some slides)
  • Oracle R Advanced Analytics for Hadoop (ORAAH) – providing access to Spark ML through R and a high-performance platform for running R on Spark clusters
  • Enhancing Spark processing on SPARC processors using DAX (Data Analytics Accelerators) with Apache Spark SQL – a 20x performance improvement over non-DAX-accelerated CPUs

 

A high-level goal expressed by Oracle:

[slide image]

Queries can be expressed in various ways, including SQL and R, and can be executed in various – even hybrid – ways against different data sources. To developers and data scientists, it should be transparent where the data lives and how the algorithms are implemented and executed. The functionally expressed query can be transformed into a job on Spark using Spark SQL or Spark ML, into a SQL query or R analysis on the Oracle RDBMS, and so on.
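
As a rough illustration of that hybrid idea (and explicitly not a description of how Oracle's own tooling works), the sketch below joins a table living in an Oracle database, read over JDBC, with fact data living on the Hadoop cluster. Connection details, table names, and columns are all hypothetical, and the Oracle JDBC driver would need to be on the Spark classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hybrid-query-sketch").getOrCreate()

# Dimension data from the Oracle RDBMS, pulled in over JDBC (hypothetical connection)
products = (spark.read.format("jdbc")
            .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
            .option("dbtable", "SALES.PRODUCTS")
            .option("user", "scott")
            .option("password", "tiger")
            .load())

# Fact data from HDFS (hypothetical location)
sales = spark.read.parquet("hdfs:///data/sales")

products.createOrReplaceTempView("products")
sales.createOrReplaceTempView("sales")

# One query spanning both sources; Spark plans where each part is executed
spark.sql("""
    SELECT p.category, SUM(s.amount) AS revenue
    FROM sales s
    JOIN products p ON s.product_id = p.product_id
    GROUP BY p.category
    ORDER BY revenue DESC
""").show()
```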

Below you will find a number of examples of references to Spark – from a wide range of sessions and presentations.

From Larry’s keynotes:

[slide images]

From the keynote by Thomas Kurian:

[slide images]

From the General Session by Inderjeet Singh:

[slide images]

From the session Big Data Predictive Analytics and Machine Learning – Strategy and Roadmap by Charles Berger and Marcos Arancibia:

[slide images]

From session CON 6704 Analytics Pipeline with Apache Spark SQL and Machine Learning by Brad Carlile:

[slide images]
