
Tour de France Data Analysis using Strava data in Jupyter Notebook with Python, Pandas and Plotly – Step 1: single rider loading, exploration, wrangling, visualization

In this article, I will show how to analyze the performance of Steven Kruijswijk during stage 14 of the 2019 Tour de France in a Jupyter Notebook using Python, Pandas and Plotly. Strava collects data from athletes about their activities, such as running, cycling, walking and hiking. Members can upload data, and tens of millions do so, including some well-known cyclists such as Steven Kruijswijk. In my previous article I explained how to retrieve the Strava data for a specific rider for a stage in the Tour de France. Now it is time to make some sense of that data. In a subsequent article, we will analyze the race by bringing together the Strava data from several cyclists racing in that 14th stage up the Col du Tourmalet.

The raw JSON files with Strava data as well as the Jupyter Notebook under scrutiny are in this GitHub repository: https://github.com/lucasjellema/data-analytics-strava-tour-de-france.

This is the 14th stage of the 2019 Tour de France: 117.5 km (shortened to 111 km on the actual race day; read the report here) through the Pyrenees, finishing with an hors catégorie climb up the flanks of the Col du Tourmalet.

image

The altitude profile is challenging:

image

The brief race summary states:

Thibaut Pinot claimed his third stage win in the Tour de France after Porrentruy 2012 and L’Alpe d’Huez 2015 as he stormed to victory at the top of Tourmalet while Julian Alaphilippe, second on the line with a deficit of six seconds, retained the yellow jersey and extended his lead over Steven Kruijswijk and Geraint Thomas. The ranking for stage 14: https://www.letour.fr/en/rankings/stage-14

Steven Kruijswijk started the day as 3rd in the general classification and – spoiler alert! – was still 3rd after this day (and even on the final day of the Tour de France). Let’s take a look at his performance for the day.

Read Rider Data into Pandas

Let’s load the JSON data from file into a Pandas DataFrame, just like we almost always do with data to analyze in a Python-based Jupyter Notebook. Then we can inspect, wrangle and explore the data set and start preparing for visualization and further processing. (Check the contents and code snippets of the notebook.)

image

From the data file, we obviously get time, altitude, distance, velocity, geo position (lat/long), watts, cadence and temperature.

In order to assign meaning to these values, we need to know the units for each of them. While some are trivial (latlng and watts), others may be less so. I struggled a bit at first with velocity: I expected km/h and was even prepared for miles/h, but the values did not seem to make sense. Of course, the unit turned out to be meters/second, not that hard at all. And to be clear for American readers: temperature is in degrees Celsius.

So from the raw data file, we get time (in seconds since the start of the recording; there is no absolute time), altitude (in meters), distance (in meters), velocity (in meters/second), geo position (lat/long), watts (in watts, i.e. J/s), cadence (revolutions/minute), gradient (in %) and temperature (Celsius).

In order to have speed in the more intuitive km/hour, and latitude and longitude in individual columns that are easier to process, I have created a simple, reusable function for loading and pre-wrangling the data ever so slightly.

image
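A minimal sketch of what such a loading function might look like, not the notebook's actual code. It assumes the JSON file holds one equally long list per stream under the keys `time`, `distance`, `altitude`, `velocity` and `latlng`; those key names, the function names `streams_to_frame` and `load_rider_data`, and the sample values are all assumptions for illustration.

```python
import json

import pandas as pd


def streams_to_frame(streams: dict) -> pd.DataFrame:
    """Turn a dict of equally long lists (one per stream) into a DataFrame."""
    # Every stream except latlng is a flat list and maps directly to a column.
    flat = {key: values for key, values in streams.items() if key != "latlng"}
    df = pd.DataFrame(flat)

    # velocity is recorded in m/s; 1 m/s equals 3.6 km/h.
    df["speed"] = df["velocity"] * 3.6

    # latlng holds [lat, lng] pairs; split them into two numeric columns.
    latlng = pd.DataFrame(streams["latlng"], columns=["lat", "lng"], index=df.index)
    df["lat"] = latlng["lat"]
    df["lng"] = latlng["lng"]
    return df


def load_rider_data(path: str) -> pd.DataFrame:
    """Read a JSON file (assumed layout, see above) and pre-wrangle it."""
    with open(path) as f:
        return streams_to_frame(json.load(f))


# Made-up sample in the assumed layout, so the sketch runs without a file.
sample = {
    "time": [0, 1, 2],
    "distance": [0.0, 10.0, 21.0],
    "altitude": [450.0, 450.2, 450.5],
    "velocity": [0.0, 10.0, 11.0],
    "latlng": [[43.23, 0.07], [43.2301, 0.0701], [43.2302, 0.0702]],
}
df = streams_to_frame(sample)
print(df[["time", "speed", "lat", "lng"]])
```

Keeping the conversion in a function that takes a plain dict makes it easy to reuse for other riders, whose files can then be loaded with `load_rider_data("<path to file>")`.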

Pandas offers some convenient methods to inspect the structure and other metadata attributes of the data. Let’s explore the data a little.

image

An overview of the values and their distribution on all numerical columns:

image
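The inspection calls typically used for this kind of exploration can be sketched as follows. The small data frame is made up, so that the snippet runs on its own; in the notebook the same calls would be applied to the rider data.

```python
import pandas as pd

# Made-up stand-in for the rider data frame.
df = pd.DataFrame({
    "time": [0, 1, 2, 3],
    "altitude": [450.0, 452.0, 455.0, 459.0],
    "speed": [30.0, 28.0, 25.0, 21.0],
})

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type per column
df.info()          # column names, non-null counts and memory usage
print(df.head(2))  # first two rows

# count, mean, std, min, quartiles and max for every numerical column
summary = df.describe()
print(summary)
```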

Visualize distance vs time

Let’s show distance vs time, as an example of what visualizations we can create and how. Here I use the straightforward, built-in, matplotlib-based plot function. A little later in this notebook, I will use what is perhaps the new standard in Python plotting: Plotly and Plotly Express.

Note: the plot function really plots distance versus index. However, in this data frame it happens that time and index coincide.

image
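Relying on the index only works as long as there is exactly one row per second, starting at zero. A minimal sketch of a plot that names the time column explicitly is shown here; the column names and the sample values (deliberately sampled every 10 seconds, so that index and time differ) are assumptions for illustration.

```python
import pandas as pd

# Made-up sample, one row per 10 seconds: here index and time do NOT coincide.
df = pd.DataFrame({
    "time": [0, 10, 20, 30],
    "distance": [0.0, 95.0, 180.0, 290.0],
})

# Passing x explicitly plots distance against time instead of against the index.
ax = df.plot(x="time", y="distance", legend=False)
ax.set_xlabel("time (s)")
ax.set_ylabel("distance (m)")
```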

And here is altitude versus time.

image

It is a little bit distorted compared to the stage profile shown earlier in the article. That profile was altitude vs distance. With constant speed, the shapes would have been the same.

Here is a scatterplot that shows altitude vs distance as recorded by Steven Kruijswijk’s GPS device during stage 14:

image

Visualizing using Plotly and Plotly Express

Plotly is a well-known open source library for visualization that can be used quite easily in notebooks such as the one I am discussing in this article. In the spring of 2019, Plotly Express was announced: a terse, consistent, high-level API for rapid data exploration and figure generation. In other words, many of the visualizations we need can be created with Plotly Express in just a few lines of code, while still leveraging all of Plotly.

In this notebook, I am using Plotly 4.1 (https://plot.ly/python/getting-started/), which works offline by default and includes Plotly Express.

If you do not know Plotly yet, please read this article by Will Koehrsen, who declares his love for Plotly in a moving and convincing way.

image

Create an interactive Plotly chart (you can hover, zoom, pan) that shows distance as a function of time. This shows the distance Steven Kruijswijk covered as Stage 14 progressed. Steepness of the curve is indicative of speed – and indirectly of gradient I presume.

image

With ‘plain’ Plotly, without Plotly Express, the previous chart would be created as follows:

image

The next plot shows altitude vs distance. Spot the Col du Tourmalet…

image

The next plot is a scatter plot that at least suggests an unsurprising relationship between gradient (steepness) and speed. Note that there are a few outliers: incorrect readouts from whatever device Kruijswijk was using to record his data.

image

Using Plotly under the covers of Plotly Express

Plotly Express is a layer on top of Plotly that allows us to rapidly create charts and visualize data according to very common patterns. If we want more than these patterns (customized axes and legends, multiple data sets in one chart, custom hover labels and more), we can bypass Plotly Express and dive into Plotly itself.

Here I will create an interactive plotly chart that shows both speed and altitude as a function of distance – using two y-axes. This brings out the altitude contours for the stage (see that Col du Tourmalet) and shows the speed as Steven rides up and down the mountains.

image

The chart is plotted like this:

image

Some smoothing on the speed data may be useful.
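One simple option for such smoothing is a centered rolling mean in Pandas. The sketch below uses a window of 3 samples on made-up values; the window size and the `speed_smooth` column name are arbitrary choices, and for per-second data a larger window would probably be more suitable.

```python
import pandas as pd

# Made-up speed readings (km/h) with one obvious dip.
df = pd.DataFrame({"speed": [30.0, 32.0, 10.0, 34.0, 36.0]})

# Centered rolling mean; min_periods=1 keeps values at both ends of the series
# by averaging over the samples that are available there.
df["speed_smooth"] = df["speed"].rolling(window=3, center=True, min_periods=1).mean()
print(df)
```

The smoothed column can then be plotted instead of, or next to, the raw speed.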

Note that Plotly has built-in facilities for zooming and panning. I can take a closer look at the start of the ascent of the Col du Tourmalet:

image

In the next article, I will bring in Strava data from other riders and use Plotly to tell the story of Stage 14, with the early breakaway and the nail-biting finale.

Resources

Article on mining Strava Data, explicitly discussing segment data for the Col du Tourmalet: http://olivernash.org/2014/05/25/mining-the-strava-data/

Report on the 14th stage of the 2019 Tour de France: https://www.letour.fr/en/news/2019/stage-14/thibaut-pinot-takes-revenge-for-crosswinds-disaster/1280846 . Another report: http://www.cyclingnews.com/tour-de-france/stage-14/results/ with the story of the stage, including a summary of the breakaways and the extended neutralised section of the stage, as a result of which the actual race was 109 km in length. And the live blog from The Guardian at the time: https://www.theguardian.com/sport/live/2019/jul/20/tour-de-france-2019-stage-14-takes-race-up-to-finish-on-tourmalet-live.

Online JSON Editor – convenient for quickly checking on JSON data copied/pasted from Strava – https://jsoneditoronline.org/

Introduction to Interactive Time Series Visualizations with Plotly in Python by Will Koehrsen – https://towardsdatascience.com/introduction-to-interactive-time-series-visualizations-with-plotly-in-python-d3219eb7a7af

Installing plotly (4.1): https://plot.ly/python/getting-started/ . Introducing Plotly Express: https://medium.com/plotly/introducing-plotly-express-808df010143d . Also the article "Python’s One Liner graph creation library with animations Hans Rosling Style": https://mlwhiz.com/blog/2019/05/05/plotly_express/

Plotly Reference on Axes, Annotations, Shapes etc: https://plot.ly/python/reference/#layout-xaxis

Technical Environment

For this notebook, I made use of Jupyter Notebook 5.7 with the Jupyter Lab extension 1.0.4 installed (https://jupyterlab.readthedocs.io/en/stable/getting_started/installation.html) in combination with plotly 4.1:

conda install -c conda-forge jupyterlab

Installing plotly (4.1):

conda install -c plotly plotly=4.1.0

conda install -c plotly chart-studio=1.0.0

conda install jupyterlab=1.0 "ipywidgets>=7.5"

(see: https://plot.ly/python/getting-started/)