Data analytics is what turns data into business value. Oracle has a long history in Data Analytics – from Oracle Discoverer and its predecessors such as Data Browser through OBI EE and Endeca to several cloud services. Oracle Analytics Cloud is the strategic offering to adopt going forward – not BI Cloud or ESSBase Cloud.
I have set up a trial account on Oracle Cloud and will now start exploring Oracle (Autonomous) Analytics Cloud. I will provision an instance, load a data set, do some data wrangling and enrichment, explore the data set and finally create some visualizations that will provide some business insights. In this article, I will share my experiences – to help anyone who is also interested get started. And to remember how I did it all, for a next time.
Provision Instance of OAC
Before starting out I had already set up an Oracle Public Cloud Trial Account.
From the dashboard in the trial identity domain – amis2019 – I navigate to Analytics:
In order to provision a fresh instance of Oracle Analytics Cloud (OAC), I click on the obvious button.
After pressing Next, I have a chance to review. Of course I like what I see, and I move forward : by pressing Create I instruct the Oracle Public Cloud to provision my OAC instance.
I can now enter the OAC instance and start performing administration activities (such as granting access to other users) or – more interesting – start adding data sets. The URL to my OAC instance is found from the drop down menu in the OAC Service Console page.
Loading and Preparing Data
When I started to investigate how to load data into the OAC instance, I ran into a bit of a disappointment: OAC cannot handle JSON data (or XML for that matter).
My dataset was prepared in a Jupyter Notebook – a Pandas Dataframe as saved as JSON file. It was not too hard to revisit my Notebook and save the data frame as a CSV file – leaving out all attributes with complex values. However, isn’t OAC supposed to be the enterprise alternative to Jupyter Notebooks? I am unpleasantly surprised by this limitation.
The dataset that I will explore, wrangle and try to extract some business insights from is the session catalog from Oracle OpenWorld and CodeOne 2018. Details on some 2000 sessions – including title and abstract, duration, speakers and their companies, tracks, rooms, levels – offer a nice set of data to make some sense of. The Excel worksheet created from the CSV export from Pandas Data Frame in Jupyter Notebook can be downloaded here.
Login to Oracle Analytics Cloud. Click on Create and select Data Set:
Drag the Excel file to the icon and drop it
Uploading and processing will start automatically.
When the upload is complete and the data has been loaded, I get an overview of my fresh data set:
I can inspect columns – and if so desired perform refinement on them. For example: I can set the correct data type for column length – from text to number:
By pressing the Add button, I save the data set and ensure it will stay around for further explorations.
If I now click on the ‘hamburger menu’
I can go no Projects:
And create a new project to which I can then add my data set.
I am taken to the central data exploration page – one of three tabs in a project: Prepare, Visualize, Narrate. Roughly translated as wrangle the data into a shape fit to visualize, explore the data through visualizations and prepare a (visual) report for your target audience.
Let’s switch (back) to the Preparation tab for one moment. I want to add a few additional columns to the data set.
Prepare Data: Enrich Data Set
The data preparation tab shows a list of recommendations on the far right. These recommendations are actions on the data set that OAC suggests based on profiling the data and smart algorithm baked into OAC. It offers to create new features (aka columns or attribute) from existing columns – in order to make visualization easier later on. For those of you who are trained to normalize data: this is the opposite! It is quite common in data analytics to enrich a data set with data derived from that very same data set. This may feel as not adding any value at all. However, once you start creating visualizations and answering business questions, you will be glad with those smarter, more convenient features.
I think I may want to explore the length of the session title. A maximum of 80 characters is allowed, and I am usually struggling to submit session proposals that fit. I am wondering if others are struggling just as much. And if there are perhaps ways around that limitation: did anyone manage to get in titles of more than 80 characters?
I click on the little add icon in the upper left hand corner, next to Preparation Script.
The Create Column panel is shown. I enter the name of the new column – title_length – and look for an operator to use under String. The operator I need is called char_length. I add it to my column definition. Once I start typing the expression inside the parentheses, a dropdown list appears of functions and columns. I select title.
Next I press Add Step.
and the column is added to my data set.
Slightly more complex: the column Track contains an array (JSON style notation) with strings. OAC just sees one long string – it cannot interpret individual string values. Now perhaps this is something I should have resolved in my Jupyter Notebook, when I had the chance. But here we are.
Can I extract columns primary_track and secondary_track from this array definition?
Hint: here is the documentation on the functions available for use in the expression editor: https://docs.oracle.com/en/cloud/paas/analytics-cloud/acabi/functions.html#GUID-BBA975C7-B2C5-4C94-A007-28775680F6A5
To be honest, it was not easy to get this expression figured out. I prefer a little notebook style Pandas wrangling. The help offered by the expression editor is limited. Note: char(39) resolves to apostrophe ‘ – I could find no other way to insert the apostrophe into the expression.
CASE WHEN POSITION(char(39) IN track)=0 THEN ‘ ‘ ELSE SUBSTRING(track FROM 1+Locate(char(39), track) FOR Locate(char(39) , track,3)-3) END
Press Add Step to commit my new derived column.
In a similar way I can create a secondary_track column:
CASE WHEN POSITION(‘,’ IN track)=0 THEN ‘ ‘ ELSE SUBSTRING(track FROM 3+Locate(‘,’, track, 3) FOR (Locate(char(39), track, Locate(‘,’, track, 3)+3) – Locate(‘,’, track, 3) – 3)) END
(first find the comma that separates the first from the second track, then find the second apostrophe after the comma)
Note: in the remainder of the exploration, it seems that these derived columns cause errors:
Maybe now is a good time to some exploration and visualization.
Click on Save – to save the project:
Explore/Visualize Data Set
Switch to the Visualize tab.
Right click on the Length data element and select Explain length from the dropdown menu.
Explaining a data element results in an interesting data exploration report for a specific data element. Let’s have a look.
The values for (session) length vary from 15 to 240 (that is a session that lasts for 6 hours!). Most sessions clearly last for up to 50 minutes.
Let’s explain the data element day:
and the derived attribute title_length
That is surprising: the title length has to be no more than 80 characters. How can there be 85 unique values? Some speakers apparently are more equal than others. The most common title length is 79 characters.
Looking at Key Drivers for data element oracle_speaker (an element that indicates whether a session has a speaker associated with it who works for Oracle), OAC provides some insights: if a speaker has a Java Champion associated with it, the chances are much smaller than normal that oracle_speaker has the value ‘Y’.
Using the data element explanation feature, we can get a fair understanding of the data, its values and the correlations between features. Note: correlation does not mean causality!
Now let’s quickly do a real visualization.
Drag day to the main panel:
and drop as Columns.
Drag RowCount to the panel and drop as Values:
The first visualization is presented:
Let’s add a little more insight by using the hour component of the (start) time to the Rows:
Switch to different graph style
A simple bar chart:
Let’s visualize the numbers of sessions per conference:
And the number of sessions per session type, per conference:
I have always wondered how many sessions at these conferences have at least one speaker employed by Oracle. This treemap answers that question visually:
Green means Oracle Speaker in a session, blue is no Oracle speaker. The size of the box corresponds to the number of sessions. OpenWorld and CodeOne are shown on the left respectively the right side.
Not an obvious visualization – but quite cool and easy to produce: a Sankey diagram that shows how many of the Oracle OpenWorld or CodeOne sessions have N or Y an Oraxole speaker associated with them.
In visualizations, the true power of OAC shines. The challenge becomes for the data analyst to find the right visualization for the relevant data elements for a specific question. OAC provides some help with that too, by exposing a best visualization option for a data element:
I have selected elements title_length, oracle_speaker and row_count. From the context menu I now select Create Best Visualization. Let’s see.
An horizontal stack bar that shows the number of occurrences for each title length value – in green for sessions with an Oracle speaker and in blue for non Oracle associated sessions. It was my assumption that only the Oracle sessions could have titles of more than 80 characters, but apparently that is not the case. Perhaps sponsors have special privileges as well?!
I am impressed with the speakers who manage with titles of just 10 or 14 characters. How do they do it?
A last visualization: a tag cloud on session type – where the size of the tag represents the number of sessions of that particular type. The color indicates the conference. Creating this tag cloud took about 7 seconds of dragging a few data elements. Really nice.
Oracle Analytics Clouds offers great visualizations.
This tremendous list of visualizations and the ease with which they can be applied to the data set is surely one of the USPs of Oracle Analytics Cloud. Once the data is in good shape, we have a powerful set at our fingertips.
Its data profiling functionality helps to quickly get going. Data preparation is not quite that simple. Creating new, derived columns using expressions is hard going. I hope that will improve. I have not yet even scratched the surface of the visualizations, nor have I touched on machine learning and the narrations functionality, where reports are composed to present the findings.
I have also not yet explored the Data Flow capability in OAC. From the documentation: “Data flows let you use one or multiple data sets, combine the data sets, rename columns, change attributes of the data, and integrate the data sets to produce a curated data set that you can use to create visualizations.” Perhaps there I will have a better experience than in the data preparation tab. In a future article I hope to address some of these topics.
Note: most of what was shown in this article can also be done with Oracle Data Visualization Desktop – a Windows or Mac desktop tool, freely downloadable from Oracle eDelivery: https://www.oracle.com/technetwork/middleware/oracle-data-visualization/downloads/oracle-data-visualization-desktop-2938957.html
The Excel document with the Oracle OpenWorld and CodeOne 2018 session catalog – https://github.com/lucasjellema/Oracle-OpenWorld-CodeOne-2018-SessionCatalog/blob/master/oow2018-sessions-clean.xlsx
YouTube – How To Start Your Oracle Analytics Cloud Trial – https://www.youtube.com/watch?v=h6NsAEwVaDo
Documentation: User Guide for Oracle Data Visualization Desktop https://docs.oracle.com/en/middleware/bi/data-visualization-desktop/index.html
Documentation: Analytics Cloud – https://docs.oracle.com/en/cloud/paas/analytics-cloud/index.html