I have time series data in my Pandas Data Frame, and I want to present an aggregation of the data by day of the week in an orderly fashion: sorted by day of the week. Not alphabetically, but sorted the way humans would order the days, starting from Monday and working our way to Saturday and Sunday.
After a little searching, I learned how to order data in a Data Frame based on any arbitrary, user-defined ordering. The trick is ‘categorical data’: a variable with a limited, and usually fixed, number of possible values that may have a meaningful order. The lexical order of a categorical variable may not be the same as its logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order; see the Pandas documentation on categorical data listed under Resources.
This StackOverflow thread showed me the way: https://stackoverflow.com/questions/47741400/pandas-dataframe-group-and-sort-by-weekday
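To make the difference between lexical and logical order concrete, here is a minimal sketch using the “one”, “two”, “three” example from the documentation. The series values are made up for illustration and the variable names are my own.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Illustrative values only; not taken from the article's data set.
s = pd.Series(["three", "one", "two", "one"])

# As plain strings, sorting is lexical: "one" < "three" < "two".
lexical = s.sort_values().tolist()

# As an ordered categorical, sorting follows the declared category order.
number_type = CategoricalDtype(categories=["one", "two", "three"], ordered=True)
c = s.astype(number_type)
logical = c.sort_values().tolist()

print(lexical)           # ['one', 'one', 'three', 'two']
print(logical)           # ['one', 'one', 'two', 'three']
print(c.min(), c.max())  # one three
```

The same mechanism works for weekday names: declare the names as categories in the desired order, and every sort or group operation respects that order.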
The starting point is a data frame with time series data: each row is stamped with a date and carries a deathCount value. The data is not sorted in any way.
I want to aggregate: grouping by day of the week, I want to calculate the mean value for deathCount, and I want to present the results ordered by day of the week, using the categorical ordering rather than the lexical ordering.
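The problem is easy to reproduce with a small stand-in frame. The sketch below assumes the column names date and deathCount from the article; the dates and counts themselves are invented for illustration. Grouping on the plain weekday name yields the groups in alphabetical order.

```python
import pandas as pd

# Made-up values; only the column names come from the article.
data = pd.DataFrame({
    "date": pd.to_datetime(
        ["2019-01-09", "2019-01-07", "2019-01-13", "2019-01-08", "2019-01-14"]
    ),
    "deathCount": [410, 395, 380, 402, 405],
})

# Group on the weekday name as an ordinary string column.
by_name = data.groupby(data["date"].dt.day_name())["deathCount"].mean()

# groupby sorts its keys, and strings sort alphabetically.
print(by_name.index.tolist())  # ['Monday', 'Sunday', 'Tuesday', 'Wednesday']
```

Sunday lands between Monday and Tuesday, which is exactly the ordering I want to avoid.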
Using the formal categorical type route, I get the result I desire:
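    from pandas.api.types import CategoricalDtype

    # weekday names in the order humans expect
    cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    cat_type = CategoricalDtype(categories=cats, ordered=True)

    # dt.day_name() replaces dt.weekday_name, which was removed in pandas 1.0
    data['Day of Week'] = data['date'].dt.day_name()
    data['Day of Week'] = data['Day of Week'].astype(cat_type)

    # group on the categorical column; the groups follow the category order
    df_weekday = data.groupby('Day of Week', observed=False).mean(numeric_only=True)
    df_weekday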
I define the CategoricalDtype called cat_type and explicitly set the type of the Day of Week column to this categorical type. This defines the ordering of this column.
By changing the order of the weekday names in the cats list, I can define a different ordering. It is mine to govern!
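As a sketch of that freedom, the snippet below declares a week that starts on Sunday. The list name sunday_first and the sample values are my own, chosen only for illustration. It also shows one thing to watch out for: a value that is not among the declared categories becomes NaN when the column is converted.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical alternative ordering: the week starts on Sunday.
sunday_first = ['Sunday', 'Monday', 'Tuesday', 'Wednesday',
                'Thursday', 'Friday', 'Saturday']
cat_type = CategoricalDtype(categories=sunday_first, ordered=True)

days = pd.Series(['Monday', 'Sunday', 'Wednesday']).astype(cat_type)
sorted_days = days.sort_values().tolist()
print(sorted_days)  # ['Sunday', 'Monday', 'Wednesday']

# A misspelled or unknown name is not a category and turns into NaN.
with_typo = pd.Series(['Monday', 'Funday']).astype(cat_type)
print(with_typo.isna().tolist())  # [False, True]
```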
In this case, a simpler – less formal, less clear perhaps – option is available through the reindex operation that I can perform on a Data Frame:
    cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

    # create a new data frame with the death counts grouped by day of the week
    # reindex is used to order the week days in a logical order
    # (learned from https://stackoverflow.com/questions/47741400/pandas-dataframe-group-and-sort-by-weekday)
    # dt.day_name() replaces dt.weekday_name, which was removed in pandas 1.0
    df_weekday = data.groupby(data['date'].dt.day_name()).mean(numeric_only=True).reindex(cats)
    df_weekday
The effect is the same – by reindexing the data frame using the cats list, I order the data frame’s rows in the order prescribed by the list.
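A small sketch of what reindex does on its own, with invented mean values standing in for the aggregated result. Note that reindex returns a row for every label in the list, so a weekday that never occurs in the data shows up as a NaN row; dropna removes such rows if they are not wanted.

```python
import pandas as pd

cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Made-up aggregated values, deliberately out of order and incomplete.
means = pd.DataFrame(
    {"deathCount": [410.0, 380.0, 400.0]},
    index=["Wednesday", "Sunday", "Monday"],
)

# Rows are rearranged to follow cats; missing weekdays get NaN.
ordered = means.reindex(cats)
print(ordered.index.tolist() == cats)  # True

# Keep only the weekdays that actually have data, still in logical order.
present = ordered.dropna()
print(present.index.tolist())  # ['Monday', 'Wednesday', 'Sunday']
```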
After ensuring the rows in data frame df_weekday are in a meaningful order, I can plot the bar chart with the bars in a meaningful order:
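A minimal plotting sketch, assuming matplotlib is installed and that df_weekday has a deathCount column indexed by weekday name. The values below are invented stand-ins for the real aggregate, and the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import pandas as pd

cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Stand-in for the aggregated frame; values are made up for illustration.
df_weekday = pd.DataFrame(
    {"deathCount": [400.0, 402.0, 410.0, 398.0, 405.0, 390.0, 380.0]},
    index=cats,
)

# The bars are drawn in row order, so the logical weekday order carries over.
ax = df_weekday.plot.bar(y="deathCount", legend=False)
ax.set_xlabel("Day of Week")
ax.set_ylabel("mean deathCount")

labels = [tick.get_text() for tick in ax.get_xticklabels()]
print(labels)
```

The plot takes its bar order from the rows of the data frame, which is why the ordering has to be fixed before plotting.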
Resources
Pandas Documentation on Categorical Data: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
Pandas Doc on reindex: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html