When Screen Scraping became API calling – Gathering Oracle OpenWorld 2017 Session Catalog with Node

A dataset with all sessions of the upcoming Oracle OpenWorld 2017 conference is nice to have – for experiments and demonstrations with many technologies. The session catalog is exposed on a website: https://events.rainfocus.com/catalog/oracle/oow17/catalogoow17

(screenshot: the Oracle OpenWorld 2017 session catalog website)

With searching, filtering and scrolling, all available sessions can be inspected. If data is available in a browser, it can also be retrieved programmatically and persisted locally, for example as a JSON document. A typical approach for this is web scraping: having a server side program act like a browser, retrieve the HTML from the web site and query the data from the response. This process is described, for Node and the Cheerio library, in this article: https://codeburst.io/an-introduction-to-web-scraping-with-node-js-1045b55c63f7

However, server side screen scraping of HTML will only be successful when the HTML is static. Dynamic HTML is constructed in the browser by executing JavaScript code that manipulates the browser DOM. If that is the mechanism behind a web site, server side scraping is at the very least considerably more complex (as it requires the server to emulate a modern web browser to a large degree). Selenium has been used in such cases – to provide a server side, programmatically accessible browser engine. Alternatively, screen scraping can also be performed inside the browser itself – as is supported for example by the Getsy library.

As you will find in this article – when server side scraping fails, client side scraping may be a much too complex solution. It is very well possible that the rich client web application uses a REST API that provides the data as a JSON document – an API that our server side program can also easily leverage. That turned out to be the case for the OOW 2017 website, so instead of complex HTML parsing and server side or even client side scraping, the challenge at hand resolves to nothing more than a little bit of REST calling.

Server Side Scraping

Server side scraping starts with client side inspection of a web site, using the developer tools in your favorite browser.

(screenshot: inspecting the session catalog page with the browser developer tools)

A simple first step with cheerio to get hold of the content of the H1 tag:

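In essence, the first step looks something like this – a minimal sketch, assuming the request-promise and cheerio modules are installed:

```javascript
// sketch: fetch the catalog page and print the content of the H1 tag
const request = require('request-promise');
const cheerio = require('cheerio');

const url = 'https://events.rainfocus.com/catalog/oracle/oow17/catalogoow17';

request(url)
  .then(function (html) {
    const $ = cheerio.load(html);
    console.log($('h1').first().text());
  })
  .catch(function (err) {
    console.error('Failed to retrieve the page:', err.message);
  });
```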

Now let’s inspect in the web page where we find those session details:

(screenshot: DOM inspection showing the li elements with CSS class rf-list-item)

We are looking for LI elements with a CSS class of rf-list-item. Extending our little Node program with queries for these elements:

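Extended with a query for those list items, the sketch becomes something like this:

```javascript
// sketch: query the fetched HTML for li elements with CSS class rf-list-item
const request = require('request-promise');
const cheerio = require('cheerio');

const url = 'https://events.rainfocus.com/catalog/oracle/oow17/catalogoow17';

request(url)
  .then(function (html) {
    const $ = cheerio.load(html);
    const sessions = $('li.rf-list-item');
    console.log('Number of sessions found: ' + sessions.length);
    sessions.each(function (index, element) {
      console.log($(element).text());
    });
  });
```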

The result is disappointing. Apparently the document we have pulled with request-promise does not contain these list items. As I mentioned before, that is not necessarily surprising: these items are added to the DOM at runtime by JavaScript code executed after an Ajax call is used to fetch the session data.

Analyzing the REST API Calls

Using the Developer Tools in the browser, it is not hard to figure out which call was made to fetch these results:

(screenshot: the network tab showing the request to https://events.rainfocus.com/api/search)

The URL is there: https://events.rainfocus.com/api/search. Now the question is: which headers and parameters are sent as part of the request to the API – and which HTTP method (GET, POST, …) should be used?

The information in the browser tools reveals:

(screenshot: the request headers and form data sent to the search API)

A little experimenting with custom calls to the API in Postman made clear that rfWidgetId and rfApiProfileId are required form data.

(screenshot: the Postman request with rfWidgetId and rfApiProfileId as form data)

Postman provides an excellent feature for quickly generating source code, in many languages and technologies, for the REST call you have just put together:

(screenshot: Postman's code generation for the REST call)

REST Calling in Node

My first stab:

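It went roughly like this – a sketch derived from the Postman-generated sample; the placeholder values and the name of the search form field are assumptions:

```javascript
// sketch: a single POST to the search API with the required form data
const request = require('request');

const options = {
  method: 'POST',
  url: 'https://events.rainfocus.com/api/search',
  headers: { 'content-type': 'application/x-www-form-urlencoded' },
  form: {
    rfWidgetId: '<value taken from the browser developer tools>',     // required
    rfApiProfileId: '<value taken from the browser developer tools>', // required
    search: 'CON1'   // search term; the exact field name is an assumption
  }
};

request(options, function (error, response, body) {
  if (error) throw new Error(error);
  // the body contains a JSON document with the matching sessions
  console.log(body);
});
```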

With the sample generated by Postman as a starting point, it is not hard to create a Node application that iterates through all session types (TUT, BOF, GEN, CON, …):

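In outline it looks like this – a sketch in which the list of session types, the form field names and the response handling are simplified; the number blocks and the delay between requests are explained below:

```javascript
// sketch: iterate over session types and number blocks, spacing the requests 500 ms apart
const request = require('request');

const sessionTypes = ['TUT', 'BOF', 'GEN', 'CON', 'HOL'];   // abbreviated list of session types
const allSessions = [];

let delay = 0;
sessionTypes.forEach(function (sessionType) {
  for (let block = 1; block <= 9; block++) {
    delay += 500;                                   // space the requests 500 ms apart
    const searchTerm = sessionType + block + '*';   // e.g. CON1* – wildcard padding (exact syntax is an assumption)
    setTimeout(function () {
      request({
        method: 'POST',
        url: 'https://events.rainfocus.com/api/search',
        form: {
          rfWidgetId: '<widget id>',
          rfApiProfileId: '<api profile id>',
          search: searchTerm
        }
      }, function (error, response, body) {
        if (error) return console.error(error);
        // the sessions would be extracted from the JSON response and collected here
        allSessions.push(JSON.parse(body));
      });
    }, delay);
  }
});
```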

To limit the size of the individual (requests and) responses, I have decided to search the sessions of each type in 9 blocks – CON1, CON2, CON3 and so on. The search string is padded with wildcards, so CON1 will return all sessions with an identifier starting with CON1.

To be nice to the OOW 2017 server – and to prevent being blocked by any filters and protections – I fire the requests spaced apart, with a 500 ms delay between each of them.

Because this code is for one-time use only and is not constrained by time limits, I have not put much effort into parallelizing the work, creating the most elegant code in the world, etc. It is simply not worth it. This will do the job – once – and that is all I need. (Although I do want to extend the code to download the slide decks for the presentations in an automated fashion; for each conference, it takes me several hours to manually download slide decks to take with me on the plane ride home – only to find out each year that I am too tired to actually browse through those presentations.)

The Node code for constructing a local file with all OOW 2017 sessions: