Correlation is a powerful thing. When two metrics rise and fall in a similar way, surely that cannot be just coincidence. It has to be meaningful in some way.
In our minds, correlation is easily turned into causality. We are wired to think like that: to find the narrative in what we observe. So the phenomenon represented by one metric is all too readily considered an influencing factor on the phenomenon measured by the other metric, which in turn can lead to hasty and wrong conclusions.
However, correlation is a powerful thing. And coinciding patterns are frequently not just coincidence. There is an explanation. A common one: an overarching phenomenon that influences both aspects represented by the metrics. A common cause. Sometimes that common cause is obvious, and the correlation has not led to new insights. And sometimes it is not obvious at all. Correlation, while not representing causality, then helps us get closer to a root cause and to understanding the behavior we are observing, which may lead to predicting and influencing it.
This notion, that correlation may help us find a common and perhaps root cause, was revitalized by a recent experience I had with log file analysis, using a tool with two powerful functions. Let me show what got me inspired.
Log File Analysis – Clustering and Common Trend Detection
Here I am looking at a week’s worth of error messages from all collected log files:
Over 30,000 messages. Analyzing these feels like cleaning Augeas’ stables. Not humanly feasible. What are all the different problems that occurred? When did they occur? Why did they occur? Is there a pattern in the incidence of errors that may help understand the cause?
This tool offered a great option: Cluster the currently selected log records:
This means that the log messages are inspected for their pattern: all messages indicating a full disk, a failed HTTP call or a Time Out exception, for example, will be similar to each other. The name of the disk, the specific endpoint, or the JDBC Data Source and SQL statement will be different, but the overall message pattern will be the same for all messages relating to a similar issue. By clustering the log entries based on their common pattern, we can very quickly get a grip on which types of errors we are dealing with.
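To make the idea of a message pattern concrete, here is a minimal Python sketch. It is not the algorithm the tool uses, which is not described here. It simply masks the parts of a message that typically vary and counts what remains. The masking rules, function names and sample messages are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical masking rules: replace the variable parts of a message
# (URLs, file system paths, numbers) with fixed placeholders.
MASKS = [
    (re.compile(r"https?://\S+"), "<URL>"),
    (re.compile(r"(?:/[\w.\-]+)+"), "<PATH>"),
    (re.compile(r"\d+"), "<NUM>"),
]

def message_pattern(message):
    """Reduce a log message to its pattern by masking variable tokens."""
    for regex, placeholder in MASKS:
        message = regex.sub(placeholder, message)
    return message

def cluster_messages(messages):
    """Count how many messages share each pattern."""
    return Counter(message_pattern(m) for m in messages)

# Invented sample messages, not taken from any real log.
sample = [
    "No space left on device /dev/sda1",
    "No space left on device /dev/sdb2",
    "HTTP call to https://api.example.com/orders failed with status 503",
    "HTTP call to https://api.example.com/stock failed with status 500",
    "Timeout after 30 s on data source ds_hr",
]
clusters = cluster_messages(sample)
for pattern, count in clusters.most_common():
    print(count, pattern)
```

Five messages collapse into three patterns here. A real tool will need something smarter than a handful of regular expressions, but the principle of separating the fixed text from the variable parts is the same.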
In this case, we went from 30K log entries to 171 clusters or 171 common error patterns. That is a far more manageable number to comprehend and address. Especially when we know that some are rare and others quite frequent.
When you look at the figure, the red rectangle highlights a little graph called trend. It presents the occurrence of the message pattern over time: when, and how often, the pattern occurred over the last seven days. And it will strike you, as it struck me, that another pattern emerges. The first two message clusters seem to have the same incidence: equally often and with the same distribution over time. The same applies to the 3rd and 4th clusters, and the 5th and 6th also seem to coincide.
This log analysis tool I am using offers another powerful feature: it does the analysis of coincidence or correlation for us. When I switch to the Trends tab, I get the next overview:
And the hunt for common causes – or perhaps direct causality after all – is on. From 171 clusters, we are down to 47 trends, or 47 sets of messages that are highly correlated. For example, the second trend shown here, with 25 similar trends, represents when expanded 420 error messages across 26 message patterns with almost perfect coincidence. I do not know yet whether one pattern represents the root cause and the other 25 are its effects, or whether they all share a common root cause that I will have to find by looking at these 26 message patterns and the phenomena they represent. But I do know I am looking at a data set I can comprehend. Instead of one big haystack, I have multiple much smaller ones – and far less elusive needles.
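How could clusters be grouped into trends? The sketch below is one simple way to do it, and only an assumption on my part, not the method the tool applies: count the occurrences of each message pattern per time bucket, compute the Pearson correlation coefficient between those count series, and greedily put a pattern in an existing group when it correlates strongly with that group's first member. The cluster names, the counts and the 0.95 threshold are invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a flat series; treat as unrelated
    return cov / sqrt(var_x * var_y)

def group_trends(series_by_cluster, threshold=0.95):
    """Greedily group clusters whose series correlate with a group's first member."""
    groups = []
    for name, counts in series_by_cluster.items():
        for group in groups:
            representative = series_by_cluster[group[0]]
            if pearson(counts, representative) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])  # no similar trend found: start a new group
    return groups

# Made-up counts per time bucket for four hypothetical message clusters.
bucket_counts = {
    "disk full":        [0, 5, 9, 1, 0, 0, 4, 8],
    "write failed":     [0, 6, 10, 1, 0, 0, 5, 9],
    "http 503":         [3, 0, 0, 7, 2, 0, 0, 1],
    "upstream timeout": [6, 0, 0, 14, 4, 0, 0, 2],
}
trend_groups = group_trends(bucket_counts)
print(trend_groups)
```

With these numbers the four clusters fall into two trends. Note that the grouping says nothing about which pattern is cause and which is effect. It only narrows down where to look.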
Note: the tool I am using for my log file analysis is Oracle Management Cloud Log Analytics.
Common Trend Detection
Ever since I discovered the function on my pocket calculator to determine the correlation coefficient of a set of (x,y) coordinates, I have had a soft spot for correlation.
The idea of getting some structure and perhaps a first inkling of whether and how things hang together appeals to me.
And the cluster and trend analysis features in Log Analytics have rekindled that interest. I am suddenly keen on looking at other correlations in time series data and in spatial or geo data. Can I find phenomena that play out similarly across a geography? That vary in a similar way across locations?
Here is an inspiring article on comparing maps with data distributions – which turns out to be quite similar to image comparison: http://www.innovativegis.com/basis/mapanalysis/Topic10/Topic10.htm#Compare_maps
See for example: Spatial Correlation in the Python PySAL library (https://pysal.readthedocs.io/en/v1.12.0/users/tutorials/autocorrelation.html#) and the GRASS GIS add-on module r.niche.similarity – https://svn.osgeo.org/grass/grass-addons/grass6/raster/r.niche.similarity/r.niche.similarity.html
On Serial Correlation (aka Autocorrelation) in Time Series Data: https://www.quantstart.com/articles/Serial-Correlation-in-Time-Series-Analysis .
Some questions I would like to look into:
- Twitter-analysis – solely based on times and days of week and month of tweets
  - can I determine the time zone for a Twitter-account?
  - can I determine the country for a Twitter-account (based on national holidays, coincidence of tweet-activity with nationwide events)?
  - can I detect robots on Twitter (from a mechanical pattern of posting tweets – both frequency, repeating pattern, all hours of the day)? (see this article on spotting Twitter bots)
  - can I cluster a group of 150 Twitter-accounts by country?
  - can I find Twitter accounts with highly similar tweeting behavior (which could be an indication of a bot or just a copycat retweeter)?
- Can I find data sets with similar time-trends among a much larger collection of datasets? (similar to what the Log (Cluster) Trends feature does)
  - for example: temperature readings for various locations; can I find the locations that show similar up and down trends?
- Can I find data sets with similar time-trends and a certain shift in time: they go up and down in a similar way but with a certain lag (one second, two hours, three days)?
- Can I find the rooms with east, south and west facing windows from the temperature-readings of these rooms?
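As a first stab at the question about trends that are shifted in time, here is a small Python sketch. It slides one series along the other and reports the lag at which the Pearson correlation is highest. The function names and the temperature readings are invented for illustration, and the readings are constructed so that the second series is simply the first one delayed by three steps. Real measurements will be far less tidy.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a flat series; treat as unrelated
    return cov / sqrt(var_x * var_y)

def best_lag(reference, other, max_lag):
    """Return (lag, r) for the delay of `other` that correlates best with `reference`.

    Both series are assumed to have the same length and sampling interval.
    """
    best = (0, -1.0)
    for lag in range(max_lag + 1):
        # Pair reference[t] with other[t + lag]: `other` follows `lag` steps later.
        xs = reference[: len(reference) - lag]
        ys = other[lag:]
        r = pearson(xs, ys)
        if r > best[1]:
            best = (lag, r)
    return best

# Invented readings: the second room follows the first with a delay of 3 steps.
room_a = [15, 16, 18, 21, 24, 22, 20, 18, 17, 16, 15, 15]
room_b = [15, 15, 15, 15, 16, 18, 21, 24, 22, 20, 18, 17]

lag, r = best_lag(room_a, room_b, max_lag=5)
print(lag, round(r, 3))
```

The same approach might give a hint for the rooms question: a room whose temperature curve trails another by several hours could well have its windows facing further west. That is a hypothesis to test, not a conclusion.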
I hope to write additional articles on these questions.