Oracle 8i (8.1.6 if I remember correctly) started with the introduction of a that wonderful new phenomenon in SQL: Analytical Functions. Functions that basically allow the result for one row in the result set to be calculated using the values in other rows. This allowed to look forward (lag) and back (lead) in result sets as well as calculate aggregates across partitions in every direction. Analytical Functions have helped to produce quite advanced analyses of data using very elegant, compact and surprisingly well performing queries.
With Oracle Database 12c, our SQL just got a little better equipped to perform analysis. The MATCH_RECOGNIZE operator has been introduced – one you may know if you have ever dabbled in CQL, the Continuous Query Language that is at the heart of several complex event processors. This operator goes beyond Analytical Functions in its capabilities to analyze a data set. The comparisons MATCH_RECOGNIZE allows us to make between rows in order to decide whether or not to produce a result are more advanced (and sometimes more fuzzy) than the straightforward comparison operators available with analytical functions. That sounds a little cryptic. Let’s look at some examples.
This painting is called: Who’s afraid of red, yellow and blue. Suppose we have a data set that contains color observations:
Our challenge in terms of SQL could be: find any occurrences of the combination red, yellow and blue in this data set. Now here is a challenge Analytical Functions are very well prepared to deal with. Using two LEAD expressions to look from each row to the next and the second next, each sequence of red, yellow and blue is easily found:
Using the new MATCH_RECOGNIZE we can solve the same challenge in a much more elaborate way. So why show it all? Because it helps me explain what MATCH_RECOGNIZE is and where it goes beyond what analytical functions can do.
The Match_Recognize section ins the query acts on whatever rows are produced by the from and where clauses of the query – in a way that is somewhat similar to the MODEL clause. In this case, those rows are taken and ordered by the seq column.
Two new ‘column expressions’ are introduced (and therefore will be produced in the outcome of the SELECT) through the measures redseq and match_num. The Pattern we instruct the MATCH_RECOGNIZE to look for is defined through a regular expression: RED YELLOW BLUE. This expression is found whenever the conditions defined by RED, YELLOW and BLUE occur in this sequence – with every condition occurring exactly once (we could have defined RED+ YELLOW{0,1} BLUE* to make it look more like a regular expression, but we would have found very different paintings as well).
The conditions labeled RED, YELLOW and BLUE have been defined as: an occurrence (a row in the result set) where payload is ‘red’, ‘yellow’ and ‘blue’ respectively. The measure redseq takes it value from the payload on the row matching the RED condition. MATCH_NUMBER() is the sequence assigned by the MATCH_RECOGNIZE operator to the pattern matches.
Did we ever consecutively hire three employees in the same job?
This question must have kept you awake at night I am sure. In the context of table EMP, do we have three subsequent records (in term of the hiredate) with the same value for JOB? With MATCH_RECOGNIZE, the answer is found like this:
The notion of looking for a regular expression becomes a little more clear in this example. The condition designated with SAME_JOB is that the job value in the record is the same value as the JOB in the first record of the set that is part of the match. Using that expression, we look for a series of three occurrences (looking for SAME_JOB{1} would be silly, as each row satisfies that condition on his own).
We can do this using Analytical Functions, but already we get a slightly uncomfortable feeling about that query: what if the requirement goes from three consecutive hire to four? Or five? Or a dynamic number? In those shifting situations, MATCH_RECOGNIZE may be the better option:
After this example of doing things in a more difficult way using a new operator than we already could using existing functionality, let’s go one step further and do something that we cannot do so easily with more traditional means.
Find the longest sequence of related observations
Finding a sequence of related observations is a fairly common challenge. Some examples are: Voting records, Ball possession in football, Days with or without rain, Passing vehicles (make and model or category), DNA records. And where LAG and LEAD are perfect when the conditions for finding the sequence are strict, MATCH_RECOGNIZE will be the tool of choice when the condition is a little bit fuzzier, such as allowing small gaps in the sequence. The following table serves as our example data source:
We want to find out the longest sequence in the same category (the longest string of A or B occurrences) – for now without gaps.
My solution using MATCH_RECOGNIZE (and granted that I am still quite new to this operator) looks like this:
Using traditional Analytical Functions, I can still come up with a solution, though elegance may not be the word term that springs to mind when you look at the SQL:
And when the challenge changes – it will become clear that MATCH_RECOGNIZE has a clear edge over analytical functions in some circumstances. What if we allowed a single gap in the sequence? So AAAAABAAAABBAABBB would be considered to contain the string AAAAABAAAA as the longest sequence of As (intersected with no more than non-A). Would analytical functions still be up for the job? If it is, it certainly becomes a real challenge. With MATCH_RECOGNIZE, adapting the query to fit that changed challenge is quite simple:
We only need to add DIFFERENT_CATEGORY{{0,1} to the pattern regular expression and we have incorporated the relaxation of the sequence matching rule.
Here is the result of this query:
This will not be my last article discussing MATCH_RECOGNIZE. I hope you have at least gotten a first impression and perhaps a little stirring of the juices out of excitement (about this latest addition to SQL).
To top it off, a list of what we can use in the Pattern expression:
Supported operators for the pattern clause include:
- * for 0 or more iterations
- + for 1 or more iterations
- ? for 0 or 1 iterations
- { n } for exactly n iterations (n > 0)
- { n, } for n or more iterations (n >= 0)
- { n, m } for between n and m (inclusive) iterations (0 <= n <= m, 0 < m)
- { , m } for between 0 and m (inclusive) iterations (m > 0)
- reluctant qualifiers – *?, +?, ??, {n}?, {n,}?, { n, m }?, {,m}?
- | for alternation (OR)
- grouping using () parentheses
- exclusion using {- and -}
- empty pattern using ()
- ^ and $ for start and end of a partition
Resources
Download Oracle Database 12c at eDelivery or from OTN: http://www.oracle.com/technetwork/database/enterprise-edition/downloads/index.html.
Oracle Database 12c Documentation: http://www.oracle.com/pls/db121/homepage. Specific section on MATCHP_RECOGNIZE is in the Data Warehousing Guide: http://docs.oracle.com/cd/E16655_01/server.121/e17749/pattern.htm#BABJBJAA.
From the first paragraph: “This allowed to look forward (lag) and back (lead) in result sets” – NO, it’s the other way around.