Data sets are often too small. We do not have all the data we need to interpret, explain, visualize or train a meaningful model. Quite often, however, our data sets are too large. Or, more specifically, they have a higher resolution than is necessary or even desirable. We may have a time series with a value for every other second, although meaningful changes do not happen more often than every 30 seconds, or at even longer intervals. We may have a measurement for every meter of a metric that does not vary meaningfully over less than 50 meters.
Having data at too high a resolution is not a good thing. For one, a large data set can be unwieldy to work with: too big for our algorithms and equipment. Furthermore, high resolution data may contain meaningless twitching, local noise that can distort our findings. And of course we may have values along a continuous axis, a floating point range with several apparently significant digits that are not really significant at all. We cannot meaningfully measure temperature in millidegrees or distance in micrometers. Ideally, we work with meaningful values and only significant digits. When we are really only interested in comparison and similarity detection, we can frequently settle for even less specific values, as for example the SAX algorithm does.
In many cases, we (should) want to lower the resolution of our data set, quite possibly along two axes: the X-axis (time, distance, simply count of measurement) and the Y-axis (the signal value).
I will assume in this article that we work with data in Pandas (in the context of Jupyter Notebooks running a Python kernel). And I will show some simple examples of reducing the size of the dataset without loss of meaningful information. The sources for this article are on GitHub: https://github.com/lucasjellema/data-analytic-explorations/tree/master/around-the-fort .
For this example, I will work with a data set collected by Strava, representing a walk that lasted about one hour and covered close to 5 km. The Strava data set contains over 1300 observations, each recording longitude and latitude, altitude, distance and speed. These measurements are taken roughly every 2-4 seconds. This results in a high-resolution chart when plotted using plotly express:
import plotly.express as px
fig = px.scatter(s, x="time", y="altitude", title='Altitude vs Distance in our Walk Around the Fort', render_mode='svg')
fig.show()
Here I have shown both a scatter and a line chart. Both contain over 1300 values.
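The line chart can be produced in much the same way; a minimal sketch, assuming the same data frame s and plotly express imported as px:
# line chart of the same 1300+ altitude observations
fig = px.line(s, x="time", y="altitude", title='Altitude during our Walk Around the Fort', render_mode='svg')
fig.show()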
For my purposes, I do not need data at this high resolution. In fact, the variation in walking speed within 30 second periods is quite high, but not in a meaningful way. I prefer to have values smoothed over windows of 30 seconds or longer. Note: I am primarily interested in the altitude measurements, so let's focus on those.
I will discuss three methods for horizontal resolution reduction: resample for time series, do-it-yourself grouping and aggregation for any data series, and PAA (piecewise aggregate approximation). Next, we will look at vertical resolution reduction: quantiles, equi-height binning and symbolic representation through SAX.
Horizontal Reduction of Resolution
When we are dealing with a time series, it is easy to change the resolution of the data set, simply by resampling the DataFrame. Let's say we take the average of the altitude over each 30 second window. That is as easy as:
a = dataFrame.resample('30S').mean()['altitude'].to_frame(name='altitude')
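Note that resample only works against a datetime-like index (or a datetime column passed via the on argument). A minimal sketch, assuming the Strava time column parses as timestamps:
# resample needs a datetime-like index: parse the time column and use it as the index
s_t = s.set_index(pd.to_datetime(s['time']))
a = s_t.resample('30S')['altitude'].mean().to_frame(name='altitude')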
In our case, however, the index of the data set is not a timestamp to begin with. One of the dimensions is time, another is distance, and it seems most appropriate to sample the set of altitude values by distance. Taking the altitude once every 25 meters (the average of all measurements in each 25 meter section) seems quite enough.
I am sure this can be done in several ways. The approach I show here takes two steps:
- assign a distance window to each observation (into which 25 meter window does each observation fall?)
- compute the average altitude value for all observations in each window
The code for this:
distance_window_width = 25
s['distance_window'] = s['distance'].apply(lambda distance: distance_window_width * round(distance / distance_window_width))
And subsequently the aggregation:
d = s[['altitude','distance_window']].copy().groupby(s['distance_window']).mean()
In a chart we can see the effect of the reduced data resolution: first a line chart (with interpolation), then a bar chart that is a more faithful representation of the data set as it currently stands, with small windows for which average values have been calculated.
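These charts could be produced with plotly express along these lines (a sketch; d is the aggregated frame created above and still contains the distance_window column):
# line chart: the per-window averages connected with interpolation
fig = px.line(d, x='distance_window', y='altitude', title='Average Altitude per 25 m Distance Window')
fig.show()
# bar chart: one bar per 25 meter window, a more faithful picture of the aggregated data
fig = px.bar(d, x='distance_window', y='altitude')
fig.show()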
At this point we have smoothed the curve and averaged out small fluctuations. Instead of taking the average, we could consider other methods of determining the value that represents a window: the mode is one option, the median another, and explicit exclusion of outliers yet another; see the sketch below.
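For instance, swapping the mean for the median only requires a different aggregation call (a sketch, reusing the distance_window column created above):
# median per 25 meter window: less sensitive to occasional GPS outliers than the mean
d_median = s[['altitude','distance_window']].copy().groupby(s['distance_window']).median()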
PAA – Piecewise Aggregate Approximation
A popular way of reducing the horizontal resolution of data sets is PAA (piecewise aggregate approximation). In essence, it looks at data per window and calculates the value representing that window, just as we have been doing with our simple averaging.
It is worthwhile to read through some of the PAA resources. Here I will just show how to leverage a Python library that implements PAA, or how to create a paa function (based on code from such a library) and invoke it for resolution reduction.
I have created a function paa – the code was copied from https://github.com/seninp/saxpy/blob/master/saxpy/paa.py.
# use PAA for lowering the data set's resolution
# taken from https://github.com/seninp/saxpy/blob/master/saxpy/paa.py
import numpy as np

def paa(series, paa_segments):
    """PAA implementation."""
    series_len = len(series)
    # check for the trivial case
    if (series_len == paa_segments):
        return np.copy(series)
    else:
        res = np.zeros(paa_segments)
        # check when we are even
        if (series_len % paa_segments == 0):
            inc = series_len // paa_segments
            for i in range(0, series_len):
                idx = i // inc
                np.add.at(res, idx, series[i])
                # res[idx] = res[idx] + series[i]
            return res / inc
        # and process when we are odd
        else:
            for i in range(0, paa_segments * series_len):
                idx = i // series_len
                pos = i // paa_segments
                np.add.at(res, idx, series[pos])
                # res[idx] = res[idx] + series[pos]
            return res / series_len
With this function in my Notebook, I can create a low resolution data set with PAA like this (note that I have full control over the number of windows or segments the PAA result should have):
# to bring down the number of data points from 1300 to a much lower number, use the PAA algorithm like this:
e = paa(series=s['altitude'], paa_segments=130)
# create a Pandas DataFrame from the numpy.ndarray
de = pd.DataFrame(data=e[:],   # values
                  index=e[:],  # use the PAA values as index as well
                  columns=['altitude'])
# add a column x that holds the row number of each row
de['x'] = range(1, len(de) + 1)
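The PAA result can be inspected with the same kind of bar chart as before; a sketch, using the x column we just added:
# bar chart of the 130 PAA segments
fig = px.bar(de, x='x', y='altitude', title='Altitude per PAA Segment (130 segments)')
fig.show()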
Vertical Reduction of Resolution
The altitude values calculated for each 25 meter distance window are on a continuous scale. Each value can differ from all other values and is expressed as a floating point number with many decimal digits. Of course these values are only crude estimates of the actual altitude in real life. The GPS facilities in my smartphone do not allow for fine-grained altitude determination. So pretending that the altitude for each horizontal window is known in great detail is not meaningful.
There are several ways of dealing with this continuous value range. By simply rounding values we can at least get rid of misleading decimal digits. We can further reduce resolution by creating a fixed number of value ranges or bins (value categories) and assigning each window to a bin or category. This simplifies our data set enormously, to a level where calculations seem quite crude, but are perhaps more honest. For comparing signals and finding repeating patterns and other similarities, such a simplification is frequently not only justified but also a source of faster as well as better results.
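Rounding itself is a one-liner; a sketch (one decimal, i.e. 0.1 m, is already generous for smartphone GPS altitude; the column name altitude_rounded is just for illustration):
# drop the misleading decimal digits: round the altitude to one decimal (0.1 m)
d['altitude_rounded'] = d['altitude'].round(1)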
A simple approach would be to decide on a small number of altitude levels, say six different levels, and assign each value to one of them. Pandas has the qcut function that we can leverage (it assigns a quantile to each record, attempting to get an equal number of records into each quantile, which results in quantiles or bins that cover different value ranges):
number_of_bins = 6
d['altitude_bin'] = pd.qcut(d['altitude'], number_of_bins, labels=False)
The corresponding bar chart that shows all bin values looks as follows:
If we want the altitude value at the upper edge of the bin to which an observation is assigned, here is what we can do:
number_of_bins = 6
d['altitude_bin'] = pd.qcut(d['altitude'], number_of_bins, labels=False)
categories, edges = pd.qcut(d['altitude'], number_of_bins, retbins=True, labels=False)
df = pd.DataFrame({'original_altitude': d['altitude'],
                   'altitude_bin': edges[1:][categories]},
                  columns=['original_altitude', 'altitude_bin'])
df['altitude_bin'].value_counts()
Instead of quantiles that each contain the same number of values, we can use bins that each cover the same value range (say, 50 cm of altitude each). In Pandas this is done with the function cut instead of qcut:
number_of_bins = 6
d['altitude_bin'] = pd.cut(d['altitude'], number_of_bins, labels=False)
Almost the same code, assigning bin index values to each record, based on bins that each cover the same amount of altitude.
In a bar chart, this is what the altitude bin (labeled 0 through 5; these are unitless labels) vs distance looks like:
You can find out the bin ranges quite easily:
number_of_bins = 6
d['altitude_bin'] = pd.cut(d['altitude'], number_of_bins)
d['altitude_bin'].value_counts(sort=False)
Symbolic Representation – SAX
The bin labels in the previous section may look like measurements, being numeric and all. But in fact they are unitless labels: they could just as well have been labeled A through F (see the sketch below). They are ordered, but have no size associated with them.
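A sketch of what that could look like, using pd.cut with explicit letter labels (the column name altitude_label is just for illustration):
# same equi-height binning as before, but with letters instead of numbers as bin labels
d['altitude_label'] = pd.cut(d['altitude'], 6, labels=list('ABCDEF'))
d['altitude_label'].value_counts(sort=False)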
The concept of symbolic representation of time series (and using that compact representation for efficient similarity analysis) has been researched extensively. The most prominent theory in this field to date is called SAX (Symbolic Aggregate approXimation). It also assigns a label to each observed value, going about it in a slightly more subtle way than equi-height bins or quantiles. Check out one of the many resources on SAX, for example starting from here: https://www.cs.ucr.edu/~eamonn/SAX.htm.
Creating a SAX representation of our walk around the fort is not hard at all:
# imports from the saxpy library (https://github.com/seninp/saxpy)
from saxpy.znorm import znorm
from saxpy.alphabet import cuts_for_asize
from saxpy.sax import ts_to_string

# how many different categories to use, i.e. how many letters in the SAX alphabet
alphabet_size = 7
# normalize the altitude data series
data_znorm = znorm(s['altitude'])
# use PAA for horizontal resolution reduction from 1300+ data points to 130 segments
# Note: this is a fairly slow step
data_paa = paa(data_znorm, 130)
# create the SAX representation for the 130 data points
sax_representation_altitude_series = ts_to_string(data_paa, cuts_for_asize(alphabet_size))
sax_representation_altitude_series
What started out as a set of 1300+ floating point values has now been reduced to a string of 130 characters (with a 7 letter alphabet, the set basically fits in 130 * 3 bits). Did we lose information? Well, we gave up on a lot of fake accuracy. And for many purposes, this resolution of our data set is quite enough, for example for looking for repeating patterns. It would seem that "ddddddccccccccaaaabbb" is a nicely repeating pattern. Four times? We did four laps around the fort!
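A quick, crude way to check how often a candidate sub-pattern occurs in the SAX string (a sketch; the exact substring to look for depends on your own output):
# count non-overlapping occurrences of a candidate sub-pattern in the SAX string
pattern = 'ddddddcccccccc'  # hypothetical substring, pick one from your own SAX output
print(sax_representation_altitude_series.count(pattern))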
Here is the bar chart that visualizes the SAX pattern. Not unlike the previous bar charts – yet even further condensed.
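A sketch of how such a chart could be produced from the SAX string, mapping each letter to its rank in the alphabet for the bar height (the column names are just for illustration):
# one row per SAX segment: the letter and its rank in the alphabet as a crude level
sax_df = pd.DataFrame({
    'segment': range(1, len(sax_representation_altitude_series) + 1),
    'letter': list(sax_representation_altitude_series)
})
sax_df['level'] = sax_df['letter'].apply(lambda c: ord(c) - ord('a'))
fig = px.bar(sax_df, x='segment', y='level', hover_data=['letter'], title='SAX Representation of the Altitude Series')
fig.show()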
Resources
The sources for this article are on GitHub: https://github.com/lucasjellema/data-analytic-explorations/tree/master/around-the-fort .