If data is available from REST APIs, Jupyter Notebooks are a fine vehicle for retrieving that data and storing it in a meaningful, processable format. This article introduces an example of such a dataset:
Oracle OpenWorld 2018 was a conference that took place in October 2018 in San Francisco. Over 30,000 attendees participated and visited some 2,000 sessions. Raw data from the session catalog is available from an API: a REST service that is consumed by the session catalog web UI. This API does not seem intended for data analysts such as us, and the data is not in great shape.
The data does contain many details, some purely intended for the web UI and others relevant for any analysis situation. From this raw source, we will gather the data into our local data lake: a collection of raw JSON files. Next (discussed in subsequent articles), we will wrangle this data: whip it into shape, filter, cleanse and enrich it so that it is easily usable for business intelligence and data science tasks such as machine learning.
The Jupyter Notebook that calls out to the REST API to retrieve the data is found in this GitHub repo: https://github.com/lucasjellema/DataAnalytics-IntroductionDataWrangling-JupyterNotebooks/tree/master/CaseOfOracleOpenWorld2018. This repo contains five Jupyter Notebooks: the first one retrieves the raw data, the subsequent ones perform further refinements and, finally, visualizations.
This notebook retrieves the raw data from the remote API using a series of HTTP REST requests. The data is gathered with separate calls per session type (22 different types), per event (two events: Oracle OpenWorld and CodeOne) and in batches of 50 sessions. This means that close to 60 HTTP calls are made. The resulting data is in JSON format and is written to local files, one per combination of event and session type (44 files in total), in the folder /datalake.
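As an indication of how these calls can be orchestrated, here is a minimal sketch of the driver loop. It relies on the helper function loadSessionDataForSessionType and the variables sessionTypes and dataLake that appear in the snippets later in this article; only one sample session type id is shown (taken from the Postman example below), where the notebook itself defines all 22:

# ASSUMED setup: only the Developer Session type id is shown here;
# the notebook defines entries for all 22 session types
dataLake = './datalake/'
sessionTypes = {'DEV': '1522435540042001BxTD'}

for conference in ['oow', 'codeone']:
    for key in sessionTypes:
        # one paged call sequence per conference and session type
        ss = loadSessionDataForSessionType(conference, sessionTypes[key])
        # one JSON file per conference/session type combination: 44 files in all
        ss.to_json("{0}oow2018-sessions_{1}_{2}.json".format(dataLake, conference, key),
                   force_ascii=False)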
By perusing the notebook, you will get an idea of how to perform REST calls, how to process the response body and its JSON content, how to turn JSON content into a Pandas DataFrame and how to write the content of such a DataFrame to a JSON file on disk. You will also get a glimpse of the raw data. In the next article, we will discuss the Jupyter Notebook that does the main data wrangling: turning this raw, not very accessible set of data files into a single, refined data warehouse, to the delight of data analysts.
Implementation Details
The session catalog for Oracle OpenWorld 2018 is available through a public website at https://events.rainfocus.com/widget/oracle/oow18/catalogoow18?. The session catalog for the co-located CodeOne conference is published at https://events.rainfocus.com/widget/oracle/oow18/catalogcodeone18?. Both websites use a common backend API to search for and retrieve details of conference sessions; the root endpoint of this REST API is https://events.rainfocus.com/api/search.
A typical call to this API uses HTTP headers (rfwidgetid: 'KKA8rC3VuZo5clh8gX5Aq07XFonUTLyU', rfapiprofileid: 'uGiII5rYGOjoHXOZx0ch4r7f1KzFC0zd') and form values to retrieve specific session information.
A bare-bones API call in Python looks like this (courtesy of Postman):
import requests

url = "https://events.rainfocus.com/api/search"
querystring = {"search.sessiontype": "1522435540042001BxTD"}

payload = "------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"size\"\r\n\r\n50\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"type\"\r\n\r\nsession\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"from\"\r\n\r\n30\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW--"

headers = {
    'content-type': "multipart/form-data; boundary=----WebKitFormBoundary7MA4YWxkTrZu0gW",
    'rfapiprofileid': "uGiII5rYGOjoHXOZx0ch4r7f1KzFC0zd",
    'rfwidgetid': "KKA8rC3VuZo5clh8gX5Aq07XFonUTLyU",
    'cache-control': "no-cache",
    'Postman-Token': "dadd9f76-6a7f-41ed-8f31-04359976c622"
}

response = requests.request("POST", url, data=payload, headers=headers, params=querystring)

# print first 4000 characters of the response
print(response.text[:4000])
This request returns a single batch of 50 CodeOne sessions of type Developer Session, starting at record 30 (as set by the size and from form fields).
Our challenge is to retrieve the complete data on all sessions for both conferences, CodeOne and Oracle OpenWorld. With the code above as a starting point, we can create a Python program that pulls session data from the (semi-)public API and stores it in local files in our data lake (folder /datalake). Note that we have to make an API call for each session type and for both conferences in order to gather all data. Each response is turned into a Pandas DataFrame that allows for easy manipulation. This is an example of the top entries of one of these DataFrames:
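To illustrate the retrieval logic itself, here is a rough sketch of what a function like loadSessionDataForSessionType (used in the snippets below) could look like. Treat it as an outline rather than the notebook's actual code: the exact response structure (here assumed to be sectionList[0]['items']) and the way the conference is selected are not visible in the raw Postman call, and both assumptions are flagged in the comments.

import requests
import pandas as pd

url = "https://events.rainfocus.com/api/search"
headers = {
    'rfapiprofileid': "uGiII5rYGOjoHXOZx0ch4r7f1KzFC0zd",
    'rfwidgetid': "KKA8rC3VuZo5clh8gX5Aq07XFonUTLyU",
    'cache-control': "no-cache"
}

def loadSessionDataForSessionType(conference, sessionTypeId, batchSize=50):
    """Page through all sessions of one type; return them as a single DataFrame."""
    frames = []
    offset = 0
    while True:
        # requests builds the multipart/form-data body (and its boundary) from
        # the files argument, so no explicit content-type header is needed;
        # the field names size/type/from come from the Postman example above
        form = {
            'size': (None, str(batchSize)),
            'type': (None, 'session'),
            'from': (None, str(offset))
        }
        # NOTE: how the conference ('oow' vs 'codeone') is passed to the API is
        # not visible in the Postman call; presumably another search parameter
        # is involved, so the argument is left unused in this sketch
        response = requests.post(url, headers=headers, files=form,
                                 params={'search.sessiontype': sessionTypeId})
        body = response.json()
        # ASSUMPTION: the sessions sit under sectionList[0]['items'];
        # inspect the raw response to confirm the exact structure
        sections = body.get('sectionList') or [{}]
        items = sections[0].get('items') or []
        if not items:
            break
        frames.append(pd.json_normalize(items))
        if len(items) < batchSize:
            break
        offset += batchSize
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()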
Writing the JSON data fetched from the API to a local JSON document on disk is extremely simple. These few lines of code take care of this:
# get session details for session type BOF (Birds of a Feather) at the CodeOne conference
key = 'BOF'
ss = loadSessionDataForSessionType('codeone', sessionTypes[key])

# write details to a JSON file called oow2018-sessions_codeone_BOF.json in the data lake (folder ./datalake)
ss.to_json("{0}oow2018-sessions_codeone_{1}.json".format(dataLake, key), force_ascii=False)
To get the data back into a Python program or another (or the same) Jupyter Notebook, we need do no more than this:
# as a test, try to load data from one of the generated files
conference = 'oow'  # could also be codeone
sessionType = 'HOL'  # could also be one of 21 other values, such as TUT, DEV, GEN, BOF, ...
sessionPandas = pd.read_json("{0}oow2018-sessions_{1}_{2}.json".format(dataLake, conference, sessionType))
sessionPandas.head(5)
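Loading the contents of the entire data lake back into a single DataFrame is hardly more work. A quick sketch, assuming the dataLake variable from the snippets above and the file naming pattern used throughout:

import glob
import pandas as pd

# gather all 44 raw JSON files from the data lake into one DataFrame
files = glob.glob("{0}oow2018-sessions_*.json".format(dataLake))
allSessions = pd.concat((pd.read_json(f) for f in files), ignore_index=True, sort=False)
print(allSessions.shape)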