The challenge in this article is a simple one. I have a table with records (duh!) and each record describes the occurrence of a certain payload. In this example the payload will be a color. Every record also has a sequence number to indicate well, when it appeared relative to the other observations. A subset of the data would look like this:
The challenges I will discuss in this article are simple:
- which sequences of three color observations occur in the data set
- how often do these sequences of three occur (give me the top 3 sequences)
- given the colors that have been observed, which combinations of three can be created
- which possible combinations (or color sequences of three colors) have not been observed at all
Using Oracle SQL features such as Analytical Functions, Outer Join and Subquery Factoring – these questions become very easy to answer.
First, let’s look at the sequences of three colors that have been observed and recorded. In order to construct the sequences, we need to combine each record with the subsequent two records. And creating a row result using other rows in the result set is almost the definition of Analytical Functions. So that is what we will use. In fact, I will use the LEAD function to take for each record the next two records into account for deriving the sequence. Note that records that do not have at least two subsequent records will not produce useful sequence results so they will be excluded. The query becomes:
The results start with:
The next challenge is to find out how often each color_sequence occurs. We get the query for addressing this challenge easily by adding the count(seq) and the group by color_sequence:
The results start with:
To only get the top 3 we can use – until 12c is upon us – the inline view construction:
All the possible combinations of three colors using the different colors that have been observed is an easy enough challenge using the cross join (to get the cartesian product):
This results in 125 records (apparently we have five different colors in our data); the first few are:
The final challenge is little bit harder. Here are looking for all the color combinations that could have been – as is demonstrated by the previous query producing 125 records – but were never actually witnessed. So any color combination that is possible but that was never recorded should be produced by this query. It is a query to look for the things that do not exist in our data. And that call for the anti-join pattern. It goes like this:
The query reuses query from the previous step that produces all possible combinations. Next, the subquery sequences is the one used at the beginning of this article. Only when we look at the main query see we something new. The possible color combinations are left outer joined with the color sequences that were recorded. The left outer join means that we will at least keep all color combinations and try to join them with the recorded sequences whenever possible. Any combination that did not actually occur – or at least was not recorded – will not be joined with a real sequence record but will instead be outer joined. Well, these combinations are the ones we are looking for: the color combinations that could not be joined for real because they have not been recorded for real. All we need to do to complete the solution is filter away all the color combinations that were joined, leaving the ones that were outer joined.
Some of the 106 results produced:
Resources
Download the SQL scripts that go with this article: color_sequence_investigation.txt.