Oracle Database 12c: Pattern Matching through MATCH_RECOGNIZE in SQL

0

Oracle 8i (8.1.6 if I remember correctly) started with the introduction of a that wonderful new phenomenon in SQL: Analytical Functions. Functions that basically allow the result for one row in the result set to be calculated using the values in other rows. This allowed to look forward (lag) and back (lead) in result sets as well as calculate aggregates across partitions in every direction. Analytical Functions have helped to produce quite advanced analyses of data using very elegant, compact and surprisingly well performing queries.

With Oracle Database 12c, our SQL just got a little better equipped to perform analysis. The MATCH_RECOGNIZE operator has been introduced – one you may know if you have ever dabbled in CQL, the Continuous Query Language that is at the heart of several complex event processors. This operator goes beyond Analytical Functions in its capabilities to analyze a data set. The comparisons MATCH_RECOGNIZE allows us to make between rows in order to decide whether or not to produce a result are more advanced (and sometimes more fuzzy) than the straightforward comparison operators available with analytical functions. That sounds a little cryptic. Let’s look at some examples.

image

This painting is called: Who’s afraid of red, yellow and blue. Suppose we have a data set that contains color observations:

image

Our challenge in terms of SQL could be: find any occurrences of the combination red, yellow and blue in this data set. Now here is a challenge Analytical Functions are very well prepared to deal with. Using two LEAD expressions to look from each row to the next and the second next, each sequence of red, yellow and blue is easily found:

image

Using the new MATCH_RECOGNIZE we can solve the same challenge in a much more elaborate way. So why show it all? Because it helps me explain what MATCH_RECOGNIZE is and where it goes beyond what analytical functions can do.

image

The Match_Recognize section ins the query acts on whatever rows are produced by the from and where clauses of the query – in a way that is somewhat similar to the MODEL clause. In this case, those rows are taken and ordered by the seq column.

image

Two new ‘column expressions’ are introduced (and therefore will be produced in the outcome of the SELECT) through the measures redseq and match_num. The Pattern we instruct the MATCH_RECOGNIZE to look for is defined through a regular expression: RED YELLOW BLUE. This expression is found whenever the conditions defined by RED, YELLOW and BLUE occur in this sequence – with every condition occurring exactly once (we could have defined RED+ YELLOW{0,1} BLUE* to make it look more like a regular expression, but we would have found very different paintings as well).

The conditions labeled RED, YELLOW and BLUE have been defined as: an occurrence (a row in the result set) where payload is ‘red’, ‘yellow’ and ‘blue’ respectively. The measure redseq takes it value from the payload on the row matching the RED condition. MATCH_NUMBER() is the sequence assigned by the MATCH_RECOGNIZE operator to the pattern matches.

Did we ever consecutively hire three employees in the same job?

This question must have kept you awake at night I am sure. In the context of table EMP, do we have three subsequent records (in term of the hiredate) with the same value for JOB? With MATCH_RECOGNIZE, the answer is found like this:

image

The notion of looking for a regular expression becomes a little more clear in this example. The condition designated with SAME_JOB is that the job value in the record is the same value as the JOB in the first record of the set that is part of the match. Using that expression, we look for a series of three occurrences (looking for SAME_JOB{1} would be silly, as each row satisfies that condition on his own).

We can do this using Analytical Functions, but already we get a slightly uncomfortable feeling about that query: what if the requirement goes from three consecutive hire to four? Or five? Or a dynamic number? In those shifting situations, MATCH_RECOGNIZE may be the better option:

image

After this example of doing things in a more difficult way using a new operator than we already could using existing functionality, let’s go one step further and do something that we cannot do so easily with more traditional means.

Find the longest sequence of related observations

Finding a sequence of related observations is a fairly common challenge. Some examples are: Voting records, Ball possession in football, Days with or without rain, Passing vehicles (make and model or category), DNA records. And where LAG and LEAD are perfect when the conditions for finding the sequence are strict, MATCH_RECOGNIZE will be the tool of choice when the condition is a little bit fuzzier, such as allowing small gaps in the sequence. The following table serves as our example data source:

image

We want to find out the longest sequence in the same category (the longest string of A or B occurrences) – for now without gaps.

My solution using MATCH_RECOGNIZE (and granted that I am still quite new to this operator) looks like this:

image

Using traditional Analytical Functions, I can still come up with a solution, though elegance may not be the word term that springs to mind when you look at the SQL:

image

And when the challenge changes – it will become clear that MATCH_RECOGNIZE has a clear edge over analytical functions in some circumstances. What if we allowed a single gap in the sequence? So AAAAABAAAABBAABBB would be considered to contain the string AAAAABAAAA as the longest sequence of As (intersected with no more than non-A). Would analytical functions still be up for the job? If it is, it certainly becomes a real challenge. With MATCH_RECOGNIZE, adapting the query to fit that changed challenge is quite simple:

image

We only need to add DIFFERENT_CATEGORY{{0,1} to the pattern regular expression and we have incorporated the relaxation of the sequence matching rule.

Here is the result of this query:

image

This will not be my last article discussing MATCH_RECOGNIZE. I hope you have at least gotten a first impression and perhaps a little stirring of the juices out of excitement (about this latest addition to SQL).

To top it off, a list of what we can use in the Pattern expression:

Supported operators for the pattern clause include:

  • * for 0 or more iterations
  • + for 1 or more iterations
  • ? for 0 or 1 iterations
  • { n } for exactly n iterations (n > 0)
  • { n, } for n or more iterations (n >= 0)
  • { n, m } for between n and m (inclusive) iterations (0 <= n <= m, 0 < m)
  • { , m } for between 0 and m (inclusive) iterations (m > 0)
  • reluctant qualifiers – *?, +?, ??, {n}?, {n,}?, { n, m }?, {,m}?
  • | for alternation (OR)
  • grouping using () parentheses
  • exclusion using {- and -}
  • empty pattern using ()
  • ^ and $ for start and end of a partition

Resources

Download Oracle Database 12c at eDelivery or from OTN: http://www.oracle.com/technetwork/database/enterprise-edition/downloads/index.html.

Oracle Database 12c Documentation: http://www.oracle.com/pls/db121/homepage. Specific section on MATCHP_RECOGNIZE is in the Data Warehousing Guide: http://docs.oracle.com/cd/E16655_01/server.121/e17749/pattern.htm#BABJBJAA.

Share.

About Author

Lucas Jellema, active in IT (and with Oracle) since 1994. Oracle ACE Director for Fusion Middleware. Consultant, trainer and instructor on diverse areas including Oracle Database (SQL & PLSQL), Service Oriented Architecture, BPM, ADF, Java in various shapes and forms and many other things. Author of the Oracle Press book: Oracle SOA Suite 11g Handbook. Frequent presenter on conferences such as JavaOne, Oracle OpenWorld, ODTUG Kaleidoscope, Devoxx and OBUG. Presenter for Oracle University Celebrity specials.

Leave a Reply