Screenscraping from Java using jsoup – effective data gathering from websites

In a recent article I discussed screenscraping in what was, in hindsight, a fairly clumsy way (https://technology.amis.nl/blog/12786/building-java-object-graph-with-tour-de-france-results-using-screen-scraping-java-util-parser-and-assorted-facilities). While preparing for a series of articles on data visualizations, I needed statistics on the Olympic Games, more specifically the overall medal count per country during the 2008 Beijing Olympic Games. This information is readily available from dozens of websites. However, I could not find one that offered the data in an easy-to-process XML or CSV format: all websites had human consumers in mind.

With screenscraping, we use a programmatic facility to consume content that is intended to be displayed on screen to human users, and then extract the required data from that content. Some web pages are easier to scrape than others: this depends on the richness of the HTML (the poorer the better for scraping), the required interactivity (JavaScript, AJAX: the less the better) and the structure used to present the data (tables, frequently despised by web developers, work rather well).

I came across a tool for screenscraping from Java called jsoup (http://jsoup.org/). It turned out to be so incredibly easy to use that I thought I should share it.

Getting going with jsoup is as easy as can be:

1. download jsoup-1.6.1.jar (or whatever the latest version is) from http://jsoup.org/download

2. add this jar as a dependency in your project and/or application CLASSPATH

3. make use of jsoup in the code that does the screenscraping.
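To give a first impression of step 3, here is a minimal sketch that parses an HTML string instead of a live page, so it can be tried without network access. The class name `JsoupHello`, the helper `firstText` and the HTML fragment are made up for this illustration; only `Jsoup.parse`, `select`, `first`, `text`, `attr` and `title` are jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupHello {

    // Hypothetical HTML fragment, made up for this illustration.
    static final String HTML =
        "<html><head><title>Sample page</title></head><body>"
        + "<p class=\"intro\">Hello <a href=\"http://jsoup.org/\">jsoup</a></p>"
        + "</body></html>";

    // Returns the text of the first element matching the CSS selector,
    // or an empty string if nothing matches.
    static String firstText(String html, String cssQuery) {
        Element first = Jsoup.parse(html).select(cssQuery).first();
        return first == null ? "" : first.text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(doc.title());                    // Sample page
        System.out.println(firstText(HTML, "p.intro"));     // Hello jsoup
        System.out.println(doc.select("a").attr("href"));   // http://jsoup.org/
    }
}
```

The same `select` calls work on a `Document` fetched with `Jsoup.connect(url).get()`, which is what the real example further on uses.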

What follows is a simple example of code that uses jsoup (more examples at http://jsoup.org/cookbook/).

One of the websites offering the overall medal count is http://www.databaseolympics.com/games/gamesyear.htm?g=26. The page looks as follows:

Image

More importantly for our purposes, the underlying HTML source of the page looks like this:

Image

In terms of screenscraping, this means the following. The medal count for each country sits inside a TABLE element with class pt8. Each country has its own TR element. Only the first TR element does not represent a country score, as it is the table header. The first TD element in a TR represents the country; the name of the country can be retrieved as the text content of the A element in that TD. The next four TD elements contain the numbers of Gold, Silver and Bronze medals and the Total.

The corresponding Java code with jsoup boils down to:

    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class OlympicMedalMirrorProcessor {

        // address of the medal count page mentioned above, without the query string
        static final String baseUrl = "http://www.databaseolympics.com/games/gamesyear.htm";

        public static void main(String[] args) throws IOException {
            // fetch and parse the page for the 2008 Games
            Document doc = Jsoup.connect(baseUrl + "?g=26").get();
            String title = doc.title();
            System.out.println(title);
            // the medal counts live in the first TABLE element with class pt8
            Element table = doc.select("table.pt8").get(0);
            Elements trs = table.select("tr");
            boolean firstRow = true;
            for (Element tr : trs) {
                if (firstRow) {
                    // the first TR is the table header, not a country
                    firstRow = false;
                    continue;
                }
                Elements tds = tr.select("td");
                int tdCount = 1;
                String country = null;
                Integer gold = null;
                Integer silver = null;
                Integer bronze = null;
                Integer total = null;
                // process the cells of this row in order
                for (Element td : tds) {
                    switch (tdCount++) {
                    case 1:
                        // the country name is the text of the A element
                        country = td.select("a").text();
                        break;
                    case 2:
                        gold = Integer.parseInt(td.text());
                        break;
                    case 3:
                        silver = Integer.parseInt(td.text());
                        break;
                    case 4:
                        bronze = Integer.parseInt(td.text());
                        break;
                    case 5:
                        total = Integer.parseInt(td.text());
                        break;
                    }
                }
                System.out.println(country + ": gold " + gold + " silver " + silver + " bronze " + bronze + " total " +
                                   total);
            } // table rows
        }
    }

Image
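Since the goal was to get the medal counts into an easy-to-process format, the same extraction can also be sketched so that it produces CSV lines. The example below is only an illustration under stated assumptions: the class `MedalTableSketch`, the method `toCsvLines` and the HTML sample are hypothetical, the sample merely mimics the table structure described above, and its country names and figures are made up. It parses an inline string, so it runs without network access.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

public class MedalTableSketch {

    // Hypothetical sample with the structure described in the article;
    // the country names and numbers are made up.
    static final String SAMPLE =
        "<table class=\"pt8\">"
        + "<tr><td>Country</td><td>Gold</td><td>Silver</td><td>Bronze</td><td>Total</td></tr>"
        + "<tr><td><a href=\"#\">Country A</a></td><td>3</td><td>2</td><td>1</td><td>6</td></tr>"
        + "<tr><td><a href=\"#\">Country B</a></td><td>1</td><td>0</td><td>4</td><td>5</td></tr>"
        + "</table>";

    // Turns every data row into one CSV line: country,gold,silver,bronze,total.
    // Note: country names containing a comma would need quoting; not handled here.
    static List<String> toCsvLines(String html) {
        List<String> lines = new ArrayList<String>();
        Elements rows = Jsoup.parse(html).select("table.pt8 tr");
        for (int i = 1; i < rows.size(); i++) {       // index 0 is the header row
            Elements tds = rows.get(i).select("td");
            if (tds.size() < 5) {
                continue;                             // skip rows that do not fit the layout
            }
            StringBuilder line = new StringBuilder(tds.get(0).select("a").text());
            for (int col = 1; col <= 4; col++) {
                line.append(',').append(Integer.parseInt(tds.get(col).text()));
            }
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : toCsvLines(SAMPLE)) {
            System.out.println(line);
        }
    }
}
```

To run this against the real page, the `Jsoup.parse(html)` call could be replaced by a document fetched with `Jsoup.connect(...).get()`, as in the code above.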
