Having access to useful data to create demonstrations and sample applications can be quite a challenge. Demonstrating the power of data visualizations (for example with ADF DVT) or the capabilities of pattern recognition (such as through Oracle Database 12c Match_Recognize) requires a data set that allows for interesting manipulation and presentation. Around the internet, many data sets are available – although few are provided in a form that can easily be processed. For example: many such data sets are published for human consumption – which means they are typically published as HTML data that is to be rendered in a web browser. Fortunately, there are techniques – commonly referred to as screen scraping – that allow us to retrieve data from such human-oriented user interfaces and turn it into data that our programs can handle.
This blog article discusses a data set that has a lot of potential for sample applications and visual demonstrations: the 2014 Eurovision Song Contest, a fairly pointless singing competition in which 26 European and semi-European countries are represented by their best national vocal performers – or perhaps the second best – during a four-hour-plus television spectacle that ends with 36 countries calling in to submit their votes for these 26 performances. All the votes – or rather the points awarded by each voting country – are added up to produce a winner. This year, Conchita Wurst of Austria was crowned the winner.
What makes this song contest so interesting is that most of the vote details are public. At www.eurovision.tv/page/results?event=1893, the EBU (European Broadcasting Union) publishes a lot of information about the voting. The ranks awarded to all competing countries are provided per voting country, and a distinction is made between the individual professional jury members, the jury aggregate, the popular vote (or televote) and the aggregate across jury and general public. With 36 countries voting for 26 acts and each vote consisting of five individual jury members as well as the jury aggregate and the popular vote, we have an interesting data set of several thousand data points in several dimensions.
The challenge discussed in this article is how to turn the data published in a human-oriented, per-country, HTML-based user interface into a format that we can process programmatically, such as XML or, in this case, JSON. I will be using JSoup, an open source Java library, for the screen scraping and (the reference implementation of) JSON-P for the creation of the JSON formatted data set. The data set is available for download with this article. Obviously, the next steps are to use this data for visualization, demonstrations of analytical operations and perhaps a little pattern recognition.
Burning questions can then be answered, such as: what if only the popular vote counted? What would the outcome be if votes from neighboring countries did not count? What if the lowest and highest scoring jury members were disregarded for each country? What if every country awarded points to every act, not just to its top 10?
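To give an idea of how such a what-if question could be approached once the data is available as objects, here is a minimal sketch that recomputes totals from the televote ranks only. It assumes the standard Eurovision scale (12, 10, 8, 7, 6, 5, 4, 3, 2 and 1 point for ranks 1 through 10, nothing below that). The class name, the method names and the nested-map input shape are my own illustrative choices, not part of the published data set.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for a "televote only" what-if analysis.
final class WhatIf {

    // Points for ranks 1..10 on the standard Eurovision scale.
    private static final int[] POINTS = {12, 10, 8, 7, 6, 5, 4, 3, 2, 1};

    /** Translates a rank into points; ranks outside the top 10 earn nothing. */
    static int pointsForRank(int rank) {
        return (rank >= 1 && rank <= POINTS.length) ? POINTS[rank - 1] : 0;
    }

    /**
     * Input (assumed shape): voting country -> (contestant -> televote rank).
     * Output: contestant -> total points if only the televote counted.
     */
    static Map<String, Integer> televoteOnlyTotals(Map<String, Map<String, Integer>> teleRanksByVoter) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map<String, Integer> ranks : teleRanksByVoter.values()) {
            for (Map.Entry<String, Integer> entry : ranks.entrySet()) {
                totals.merge(entry.getKey(), pointsForRank(entry.getValue()), Integer::sum);
            }
        }
        return totals;
    }
}
```

Sorting the resulting map by value would give the alternative ranking; the other questions (ignoring neighbors, dropping the extreme jurors) would need a different aggregation but could follow the same pattern.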
The Data Set
The data set is available at http://www.eurovision.tv/page/results?event=1893. Here we see, for example, how the jury and the population of Greece cast their votes for each of the contestants. We can see how the combined jury put Ukraine in 5th place and how the 10th place in the televote meant that in the end Ukraine was awarded 5 points for an overall 6th place.
The country whose voting results we want to inspect can be selected from a drop down list. When we select a different country, a GET request is sent to the EBU server and an HTML page is returned.
The source of the HTML page can easily be inspected. It is our guide when we use JSoup to perform the screen scraping. For example, the select list with all the countries is identified as a SELECT element with id equal to select-country-356. We will use this element to fetch the list of all countries to process.
The voting results themselves are contained in a table without id. Every country is represented by a TR element that contains several TD elements, one for each of the results.
With this information in hand, it is time for JSoup.
Screen Scraping with JSoup
JSoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
JSoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do. It can be downloaded from http://jsoup.org/, a site that also contains documentation and samples.
In this example, I am using JDeveloper as my IDE, but of course NetBeans, IDEA and Eclipse could be used just as easily. I have created a Java Desktop Client application and added the JSoup jar file to the project. The POJOs Country and CountryScore hold the data that the screen scraping extracts from the HTML documents.
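A minimal sketch of what these two POJOs could look like. The property names code, label, juryRank, teleRank and countryScores follow the JSON structure described later in this article; everything else (constructors, nullable Integer ranks, the omission of the individual juror ranks) is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// One row of a country's voting results: how that country ranked one contestant.
class CountryScore {
    private final String code;      // identifier of the contestant country
    private final String label;     // display name of the contestant country
    private final Integer juryRank; // null when no jury vote is available
    private final Integer teleRank; // null when no televote is available

    CountryScore(String code, String label, Integer juryRank, Integer teleRank) {
        this.code = code;
        this.label = label;
        this.juryRank = juryRank;
        this.teleRank = teleRank;
    }

    String getCode() { return code; }
    String getLabel() { return label; }
    Integer getJuryRank() { return juryRank; }
    Integer getTeleRank() { return teleRank; }
}

// A voting country together with the scores it handed out.
class Country {
    private final String code;
    private final String label;
    private final List<CountryScore> countryScores = new ArrayList<>();

    Country(String code, String label) {
        this.code = code;
        this.label = label;
    }

    String getCode() { return code; }
    String getLabel() { return label; }

    void addScore(CountryScore score) { countryScores.add(score); }

    List<CountryScore> getCountryScores() {
        return Collections.unmodifiableList(countryScores);
    }
}
```

Using Integer rather than int for the ranks leaves room for the cases where a country has no jury vote or no televote.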
The class CountryVotesRetriever does the heavy lifting. First it creates a collection of Country POJO instances, using the HTML SELECT element (the dropdown list with all the countries) as its source.
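A minimal sketch of how this first stage could be written with JSoup. The id select-country-356 comes from the page source; that each OPTION carries an identifier in its value attribute and the country name as its text is an assumption, and the class and method names are hypothetical. The example parses an inline HTML fragment so that it runs without a network connection (it does need the JSoup jar on the classpath).

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical sketch: read the country dropdown into {value, label} pairs.
final class CountryListSketch {

    static List<String[]> extractCountries(Document doc) {
        List<String[]> countries = new ArrayList<>();
        // CSS-style selector: all OPTION elements inside the SELECT with the given id.
        for (Element option : doc.select("select#select-country-356 option")) {
            String value = option.attr("value");
            if (value.isEmpty()) {
                continue; // skip a placeholder entry such as "choose a country"
            }
            countries.add(new String[] {value, option.text()});
        }
        return countries;
    }

    public static void main(String[] args) {
        // Made-up fragment that mimics the assumed structure of the dropdown.
        String html = "<select id=\"select-country-356\">"
                + "<option value=\"\">Choose</option>"
                + "<option value=\"AL\">Albania</option>"
                + "<option value=\"GR\">Greece</option>"
                + "</select>";
        for (String[] country : extractCountries(Jsoup.parse(html))) {
            System.out.println(country[0] + " = " + country[1]);
        }
    }
}
```

Against the live site, the Document would come from Jsoup.connect(url).get() instead of Jsoup.parse(html).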
The second stage in the program is an iteration over all countries. For each country, the HTML document with its voting results is retrieved. This document is then processed to produce the CountryScores collection for the country. There are some special cases we have to cater for: some countries do not organize a televote, or may run into technical issues during the live event that cause the televote to be skipped. The same thing could happen to the jury voting. That means that for each country we can have jury votes, a televote, or both.
Sometimes – perhaps due to heavy traffic to the EBU site – the GET request fails. That is why I have built a little retry mechanism into the main method, to allow the program to retry the retrieval of each HTML document up to five times if it has to.
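Such a retry mechanism could be sketched along the following lines. This is not the code from the project: the Retry class, the IoCall interface and the choice to rethrow the last failure as an UncheckedIOException are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper that retries an I/O action a limited number of times.
final class Retry {

    /** An action that may fail with an I/O error, such as an HTTP GET. */
    @FunctionalInterface
    interface IoCall<T> {
        T call() throws IOException;
    }

    /**
     * Runs the action until it succeeds or maxAttempts is reached.
     * After the final failed attempt the last IOException is rethrown, wrapped.
     */
    static <T> T withRetries(int maxAttempts, IoCall<T> action) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (IOException e) {
                lastFailure = e;
                System.err.println("Attempt " + attempt + " of " + maxAttempts
                        + " failed: " + e.getMessage());
            }
        }
        throw new UncheckedIOException("Giving up after " + maxAttempts + " attempts", lastFailure);
    }
}
```

With JSoup on the classpath, retrieving a page would then read something like Retry.withRetries(5, () -> Jsoup.connect(url).get()). A short pause between attempts might be kinder to a busy server; I have left that out here.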
The code that processes each country is not perfectly elegant, but it will do (after all, it is code that will be used only once). Selectors are used to get a handle on the crucial elements in the HTML document. For every TR element (the voting results regarding a specific country), a CountryScore object is created and added to the corresponding collection of POJO instances.
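The row-by-row processing can be sketched as follows. The real results table has more columns than shown here (the individual jurors, for example) and I have not reproduced its exact layout: the three-column order (country name, jury rank, televote rank) is a simplifying assumption, as are the class and method names. Empty or non-numeric cells become null, which covers countries without a jury vote or televote.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical sketch: turn the TR/TD structure of a results table into objects.
final class ResultsTableSketch {

    static final class Row {
        final String label;
        final Integer juryRank; // null when the cell is empty
        final Integer teleRank; // null when the cell is empty

        Row(String label, Integer juryRank, Integer teleRank) {
            this.label = label;
            this.juryRank = juryRank;
            this.teleRank = teleRank;
        }
    }

    /** Returns the rank in a cell, or null if the cell holds no number. */
    static Integer parseRank(String text) {
        String trimmed = text.trim();
        return trimmed.matches("\\d+") ? Integer.valueOf(trimmed) : null;
    }

    static List<Row> parseRows(Document doc) {
        List<Row> rows = new ArrayList<>();
        for (Element tr : doc.select("table tr")) {
            Elements cells = tr.select("td");
            if (cells.size() < 3) {
                continue; // header rows (TH cells) and malformed rows are skipped
            }
            rows.add(new Row(cells.get(0).text(),
                    parseRank(cells.get(1).text()),
                    parseRank(cells.get(2).text())));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up fragment using the assumed column order.
        String html = "<table><tr><th>Country</th><th>Jury</th><th>Tele</th></tr>"
                + "<tr><td>Ukraine</td><td>5</td><td>10</td></tr>"
                + "<tr><td>Austria</td><td>1</td><td></td></tr></table>";
        for (Row row : parseRows(Jsoup.parse(html))) {
            System.out.println(row.label + " jury=" + row.juryRank + " tele=" + row.teleRank);
        }
    }
}
```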
Turning the Java objects into a JSON document
Once all data is loaded in the Java object graph, we can forget about screen scraping, HTML and JSoup. The data is already in our hands. However, it is not yet persistent. At this point, we could insert it into a database, or create an XML document. Instead, I have chosen to create a JSON document to capture the data in a persistent, elegant and accessible way. I am using the reference implementation of JSON-P, launched with Java EE 7 but just as easily usable in a standalone Java SE application. The JAR file (just a few dozen KB) can be downloaded from https://jsonp.java.net/download.html.
Clearly I will not explain the details of either JSON or JSON-P. I will just very briefly introduce the code I created to turn the Java Object graph into a JSON document.
The JSON format I am after is an object that has a single property: an array called countries that contains JSON objects with three properties each: code, label and an array called countryScores. This array holds JSON objects with properties like code, label, juryRank, teleRank etc.
The JSON-P library provides us with a few simple statements to compose the JSON document in this way. With the JsonArrayBuilder, we create, well, an array of JSON objects. With the JsonObjectBuilder, we compose a JSON object. As you will notice, we work more or less from the inside out: first the most fine grained objects and arrays, then the enveloping, highest level objects.
Once we have created the overarching JsonObject – called model – we can use a writer to turn it into a String or a Document.
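A minimal sketch of this inside-out composition, using the javax.json package of the Java EE 7 JSON-P API. The property names follow the format described above; the sample values and the class and method names are made up for illustration. Newer releases of the API live in the jakarta.json package instead.

```java
import java.io.StringWriter;
import javax.json.Json;
import javax.json.JsonArrayBuilder;
import javax.json.JsonObject;
import javax.json.JsonWriter;

// Hypothetical sketch: compose the document from the inside out, then serialize it.
final class JsonSketch {

    static JsonObject buildModel() {
        // Innermost level: the scores that one voting country handed out.
        JsonArrayBuilder scores = Json.createArrayBuilder()
                .add(Json.createObjectBuilder()
                        .add("code", "UA")
                        .add("label", "Ukraine")
                        .add("juryRank", 5)
                        .add("teleRank", 10));

        // Next level: the voting countries, each with its countryScores array.
        JsonArrayBuilder countries = Json.createArrayBuilder()
                .add(Json.createObjectBuilder()
                        .add("code", "GR")
                        .add("label", "Greece")
                        .add("countryScores", scores));

        // Outermost level: the enveloping object with its single property.
        return Json.createObjectBuilder().add("countries", countries).build();
    }

    static String toJson(JsonObject model) {
        StringWriter out = new StringWriter();
        try (JsonWriter writer = Json.createWriter(out)) {
            writer.writeObject(model);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toJson(buildModel()));
    }
}
```

In the real program the builders would of course be fed from loops over the Country and CountryScore collections rather than from literals.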
The string written to the console can easily be copied and pasted into a JSON file. You can download this file here: esc2014.json. This data can be the starting point of visualizations, analysis and other demonstrations and samples. I hope you will be able to make good use of it.
A little visualization example: in red the countries that did not qualify for the final, in yellow countries that did not participate this year (but have done so in the past); in green the countries performing in the grand finale.
The JDeveloper project for scraping, processing and JSON formatting of the Eurovision Song Contest Voting results: DataGatheringEurovisionSongContest.
The data file: esc2014.zip
The Human Interface for the ESC 2014 Voting Results: http://www.eurovision.tv/page/results?event=1893.
The JSoup project home page: http://jsoup.org/.