This year’s Tour de France was quite a spectacle. Great performances, exciting stages, unexpected events: it had it all. Analyzing the race events as they unfolded during the stages of this year’s Tour is something I am keen to attempt. Using Jupyter Notebooks, Python and Pandas, with Plotly for visualization, I am sure I can extract more detailed stories from raw race data. The starting point for such analysis activities is… the data.
However, I have not been able to find public sources for detailed data, such as time series with the GPS location of riders or even groups during the stages of the TdF. So it felt like a dead end before I had even started. Then I remembered Strava. Strava (Swedish for strive) is a platform for tracking performance and deep diving into the collected data.
Strava collects data from athletes regarding their activities – such as running, cycling, walking and hiking. Members can upload data – and tens of millions do so, including some well known cyclists – see this blog article for a list of over 40 Tour de France contenders who publish [some of] their data on Strava: https://blog.strava.com/tour-de-france-riders-to-follow-18148/.
Strava data can be retrieved using an official API (https://developers.strava.com/) for which Python libraries have been developed [https://github.com/hozn/stravalib]. However, only personal data can be retrieved. For use of data from other members, “you will have to make an application and request that athletes sign in with Strava, and grant your application certain permissions using OAuth 2.0.”
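To give an impression of what a call to the official API looks like, here is a minimal sketch that builds (but does not send) a request for the streams of one of your own activities. The endpoint path and the `keys` and `key_by_type` parameters are taken from my reading of the Strava API v3 documentation; the helper name, the activity id and the access token are placeholders made up for this example.

```python
import urllib.parse
import urllib.request

# Base URL of the official Strava API (v3).
API_BASE = "https://www.strava.com/api/v3"


def build_streams_request(activity_id, access_token,
                          keys=("time", "distance", "altitude")):
    """Build a GET request for the streams of one activity.

    Hypothetical helper: it only constructs the request object,
    nothing is sent until it is passed to urllib.request.urlopen().
    """
    query = urllib.parse.urlencode({
        "keys": ",".join(keys),   # which streams to return
        "key_by_type": "true",    # key the result by stream type
    })
    url = f"{API_BASE}/activities/{activity_id}/streams?{query}"
    # The OAuth 2.0 access token goes into a Bearer authorization header.
    headers = {"Authorization": f"Bearer {access_token}"}
    return urllib.request.Request(url, headers=headers)


# Placeholder values; replace with your own activity id and token.
request = build_streams_request(1234567890, "YOUR_ACCESS_TOKEN")
print(request.full_url)
```

Sending the request with `urllib.request.urlopen(request)` only succeeds with a valid token, and only for activities that the token is authorized to read.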
Data from public athletes can be looked up on the Strava website ([https://www.strava.com/athletes/search]). An overview is provided of activities for which data was uploaded by an athlete.
For example Steven Kruijswijk (team Jumbo Visma, #3 in final classification):
For each of these activities, details can be inspected on the website; here for example TdF Stage 14 for Steven Kruijswijk:
And some on screen stats analysis:
It is not hard to find out the requests made by the web application to retrieve the data that is presented – using the request analysis features in the browser Developer Tools:
It turns out that the responses to these requests are plain JSON documents that can easily be interpreted.
The URL used by the Strava webapp to retrieve the data uses the activity identifier as its primary key, and adds request parameters to instruct the API backend about the information elements to return. The URL is composed like this:
The value 2548396565 is the activity id for Steven Kruijswijk’s data set for the TdF 2019 Stage 14 performance recording.
I do not yet know the meaning or even relevance of the last parameter.
It is important to realize that this URL can only be accessed from an authenticated browser session, that is, a browser logged in with a valid Strava account. I have not used this URL to collect data programmatically and repeatedly from a computer program; I have only copied the JSON response in the browser and pasted it into a JSON text file. For Stage 14 in the 2019 Tour de France, I have collected data for a number of riders, including the stage winner (Thibaut Pinot), an early front runner (Marco Haller), one of the most active riders in this year’s TdF (Thomas de Gendt) and of course Steven Kruijswijk.
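Once such a JSON document has been saved to a file, getting it into Pandas takes only a few lines. The sketch below assumes that the document is a single JSON object in which each stream name (for example `time`, `distance`, `altitude`) maps to a list of samples; that shape, the function names and the file name in the comment are assumptions for illustration, so adjust them to what your saved files actually contain.

```python
import json

import pandas as pd


def streams_to_frame(streams):
    """Turn a dict of stream name -> list of samples into a DataFrame.

    Assumed shape: {"time": [...], "distance": [...], ...}.
    Entries that are not lists are ignored.
    """
    columns = {
        name: pd.Series(values)
        for name, values in streams.items()
        if isinstance(values, list)
    }
    return pd.DataFrame(columns)


def load_streams(path):
    """Read a saved JSON text file and return its streams as a DataFrame."""
    with open(path, encoding="utf-8") as handle:
        return streams_to_frame(json.load(handle))


# Small in-memory example; with a real file you would call something like
# load_streams("kruijswijk-stage14.json")  (hypothetical file name).
example = {"time": [0, 1, 2], "distance": [0.0, 5.5, 11.0]}
frame = streams_to_frame(example)
print(frame)
```

Wrapping each list in a `pd.Series` means that streams of unequal length are padded with `NaN` instead of raising an error.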
In future articles, I will show you some of the analysis of this detailed data using Python, Pandas and Plotly in a Jupyter Notebook. One early glimpse:
This chart shows, for each of three riders, the time gap with the stage winner (Thibaut Pinot) at each distance during the stage. The black dotted line is the altitude profile as recorded by Thibaut’s tracking device. The official stage profile as published by the Tour de France organization is shown below and should correspond with this dotted line.
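A time gap per distance can be computed by asking, for a common set of distance marks, at what elapsed time each rider reached that mark and then subtracting the winner's time. The sketch below is one way to do that with linear interpolation via `numpy.interp`; it is not necessarily how the chart above was produced. It assumes that distance is in meters and never decreases, that time is in elapsed seconds, and that both recordings started at the same moment. The function name and the sample numbers are made up.

```python
import numpy as np


def time_gap_by_distance(winner_distance, winner_time,
                         rider_distance, rider_time, grid):
    """Return the rider's time gap to the winner at each distance in grid.

    Both riders' elapsed times are linearly interpolated onto the same
    distance marks, so the recordings do not need identical sampling.
    Positive values mean the rider reached that mark later than the winner.
    """
    winner_at = np.interp(grid, winner_distance, winner_time)
    rider_at = np.interp(grid, rider_distance, rider_time)
    return rider_at - winner_at


# Made-up samples: distance in meters, elapsed time in seconds.
winner_distance = [0, 1000, 2000]
winner_time = [0, 100, 200]
rider_distance = [0, 1000, 2000]
rider_time = [0, 110, 230]

grid = np.array([0, 500, 1000, 1500, 2000])
gaps = time_gap_by_distance(winner_distance, winner_time,
                            rider_distance, rider_time, grid)
print(gaps)
```

If the recordings did not start at the same moment, the gaps are shifted by a constant offset, which would need to be corrected before comparing riders.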