During the OUGF 2014 Harmony Conference in Finland, I attended a seminar by Chris Date on Temporal Data and the Relational Model. He introduced an approach within the relational model to describe temporal data and deal with (temporal) intervals and suggested how such could be included in the SQL language. At the present, we are quite a long way away from having his proposals incorporated in the SQL standard (in fact the standard seems to be moving in a different direction).
Edit (14th June 2014): However, the use of temporal data and valid time intervals is already upon us (and has always been, and the relational model has always been able to handle it even if it does not necessarily have specialist, intrinsic support for it). Note: Chris Date explicitly commented [on the approach he discussed during the seminar on temporal data in the relational model]: “It’s merely a set of carefully thought out shorthands for things you can already do in any relationally complete system. The only really new construct is the INTERVAL type generator, and (as I did say in the seminar, though only very briefly) the relational model has nothing to say regarding what types and type generators need to be supported”
End of Edit
Chris’ excellent seminar has sparked me into trying several things out with SQL as we know it today – in this case in the form of the Oracle 12c Database. My first stab at things is an attempt to perform what Chris calls the PACK operation on intervals. Suppose we have a table that records the chairmanships of a certain organization. The structure of the table is simple: name of the chairperson, date_from and date_to. The interval is defined through these latter two date columns.
The data we currently have describes the following situation:
Apparently, we have had several (brief) periods without chairman (gaps) as well as two time stretches with two contenders (conflicts or collisions). These are subjects for later articles. What we are discussing in this article is the case with John and Helen (marked with the arrows). Apparently, both John and Helen have served as chairman for two terms. If it is meaningful in any way that they served these two terms – or whether this is perhaps not important at all or even an accidentally created set of records – is not clear from the data.
In many situations, we will just want to know who has been chairman during a certain period or when a certain chairperson sat in the chair. In these common situations, it is a little clumsy to have these two intervals for both John and Helen. For most intents and purposes, their two terms should be presented as a single combined term. The PACK operation introduced by Chris Date does exactly that: it takes intervals that meet (one starts immediately after the other one ends) and combines these intervals into a single, continuous interval. After performing the PACK operation on the data visualized in the figure overhead, the data would look like this:
In this article, I will first create the table chairmanships and insert the records corresponding with this visualization. I will then write the SQL Query that will return the packed chairmanships. Finally, I will demonstrate a MERGE statement that can be used to perform the DML operation that will PACK the data in a single statement – updating the table so it will only contain the packed data.
Note: Chris Date also defines the UNPACK operation. This takes the interval records and expands them into records that convey the same information using unity-intervals: intervals with a duration of one unit of date, in this case a single day. The UNPACK operation is by and large a concept that does not actually have to be performed in a real SQL implementation; it is important though to describe what exactly is the meaning of certain operations – they should be thought of a as dealing with unpacked records. With open ended intervals, this prompted Chris Date to remark that you would have to “unpack eternity” (where eternity is described as the maximum date range in SQL, about 10k years).
The DDL statement that creates our table with chairmanships:
create table chairmanships ( chairperson varchar2(50) , date_from date , date_to date );
Here is the data loaded into the table – the same data that is visualized in the previous figures:
Note that these are closed:closed date intervals. Both date_from and date_to are included in the chairmanship period. Jim was chairman on February 1st 1980 and also on May 4th 1992- but not on any date outside that range.
It is not hard to see where there is an opportunity for packing records:
A first query can be constructed that will return an indication for each record whether the date_to of previous chairmanship record (lag) for the same chairperson (partition by chairperson) was exactly one date before the date_from of the current record. When that is the case, the record can be fused with its predecessor.
select chairperson , date_from , date_to , case date_from - lag(date_to) over (partition by chairperson order by date_from asc) when 1 then 'PACK' end action , rank() over (partition by chairperson order by date_from asc) rnk from chairmanships order by date_from
Here is the result from that query:
The next step we can take with that result is to create the concatenation of date intervals. We start with every record that does not have the PACK action set. Then we use a recursive subquery to find for all these records, adjoining records that can be concatenated or merged into it. When such a record is found, the date_to is extended to the end of the recursively added record and a next recursive iteration is performed. The SQL query now looks like this:
with chairpeople as ( select chairperson , date_from , date_to , case date_from - lag(date_to) over (partition by chairperson order by date_from asc) when 1 then 'PACK' end action , rank() over (partition by chairperson order by date_from asc) rnk from chairmanships ) , packing_chairs (chair, date_from, date_to, lvl) as ( select chairperson, date_from, date_to, 1 from chairpeople where action is null union all select p.chair, p.date_from, c.date_to, lvl+1 from chairpeople c join packing_chairs p on (c.chairperson = p.chair and c.rnk = p.lvl+1) where c.action='PACK' ) select chair, date_from, date_to, lvl from packing_chairs
and the resulting records are shown here:
Only one of all chairperson records for the same chair starting with the same DATE_FROM should be retained and it should be the one with the latest DATE_TO (or the highest value for LVL, the recursion level indicator). That simply calls for a GROUP BY on chair and date_from, as the next query does:
with chairpeople as ( select chairperson , date_from , date_to , case date_from - lag(date_to) over (partition by chairperson order by date_from asc) when 1 then 'PACK' end action , rank() over (partition by chairperson order by date_from asc) rnk from chairmanships ) , packing_chairs (chair, date_from, date_to, lvl) as ( select chairperson, date_from, date_to, 1 from chairpeople where action is null union all select p.chair, p.date_from, c.date_to, lvl+1 from chairpeople c join packing_chairs p on (c.chairperson = p.chair and c.rnk = p.lvl+1) where c.action='PACK' ) , packed_chairs as ( select chair, date_from, max(date_to) date_to from packing_chairs group by chair, date_from ) select * from packed_chairs order by date_from
The final result with only the packed records looks like this:
Or, visually:
Suppose the situation becomes slightly more complicated (as depicted in the next figure), with more than two consecutive intervals for a chairperson – would that make any difference? (it would not, the query still packs these records) Or what if the chairmanship records define overlapping chairmanships for the same chairman, would these records be packed as well? (at the moment they would not)
Update the table – Pack the records
In one single sweeping SQL statement, I would like the records to be packed. We have established the fact that it is not meaningful from a business perspective to retain multiple consecutive chairmanship records for the same chairperson. These records have come into existence because of administrative erratics are application glitches. Anyways, we want the table in the data to be just the packed records. So our challenge is to create the DML statement that will make it so.
What needs to happen in the cases where packing is called for, is that we go from multiple consecutive records to a single one. So our statement has to perform an insert or an update and multiple delete operations all in one go. In Oracle SQL at least, that calls for the MERGE operation.
The MERGE in Oracle SQL can perform three types of DML actions from a single statement: INSERT, UPDATE and DELETE. In this case, we will not use INSERT and only perform an update (of the last chairmanship record in a pack) and a delete (of all non-last records in a pack).
In order to determine whether a record should be updated or deleted, the query has been extended with the packing_order, derived as the rank within the partition defined by the chairman and date_from.
The query looks like this:
with chairpeople as ( select chairperson , date_from , date_to , rowid rid , case date_from - lag(date_to) over (partition by chairperson order by date_from asc) when 1 then 'PACK' end action , rank() over (partition by chairperson order by date_from asc) rnk from chairmanships order by date_from ) , packing_chairs (chair, date_from, date_to, lvl) as ( select chairperson, date_from, date_to, 1 from chairpeople where action is null union all select p.chair, p.date_from, c.date_to, lvl+1 from chairpeople c join packing_chairs p on (c.chairperson = p.chair and c.rnk = p.lvl+1) where c.action='PACK' ) select chair, date_from, date_to, lvl , rank() over (partition by chair,date_from order by lvl desc) packing_order from packing_chairs
and the results are:
We have two records with a packing_order > 1 which means they are not the primary [to-be-updated] record in their pack, and therefore will have to go.
The MERGE statement is now written like this:
merge into chairmanships trg using ( with chairpeople as ( select chairperson , date_from , date_to , rowid rid , case date_from - lag(date_to) over (partition by chairperson order by date_from asc) when 1 then 'PACK' end action , rank() over (partition by chairperson order by date_from asc) rnk from chairmanships order by date_from ) , packing_chairs (chair, date_from, date_to, lvl) as ( select chairperson, date_from, date_to, 1 from chairpeople where action is null union all select p.chair, p.date_from, c.date_to, lvl+1 from chairpeople c join packing_chairs p on (c.chairperson = p.chair and c.rnk = p.lvl+1) where c.action='PACK' ) select chair, date_from, date_to, lvl , rank() over (partition by chair,date_from order by lvl desc) packing_order from packing_chairs ) src on (src.date_to = trg.date_to and src.chair = trg.chairperson) when matched then update set trg.date_from = src.date_from delete where packing_order > 1 -- when not matched -- then insert is optional in the merge statement and is not required here, so we leave it out
Finally, after the merge, a select * from chairmanships produces the following result:
Instead of 9 rows, we now have 7 because the two John records have been packed, as have the two Helen records. Note that this could only be done because the interval ‘met’: one started right after the other one ended. Had there been a gap, no packing would have taken place. And with this current query, had there been a partial overlap – then no packing would have been done either – though that might very well be a functional requirement, one that can be implemented fairly easily.
Resources
Oracle SQL Merge statement introduced – blog article on AMIS Technology Blog – https://technology.amis.nl/2006/10/14/have-merge-remove-records-from-target-that-are-not-in-the-source-oracle-10g/
Note from Finland – observation on the OUGF 2014 Harmony conference – https://technology.amis.nl/2014/06/06/part-2-notes-from-finland-impressions-from-the-second-day-of-ougf-2014-harmony-conference/ and https://technology.amis.nl/2014/06/05/notes-from-finland-impressions-from-the-ougf-2014-harmony-conference/
create table chairmanships ( chairperson varchar2(50) , date_from date , date_to date ); insert into chairmanships ( chairperson, date_from, date_to) values ( 'Jim' , to_date('01-02-1980', 'DD-MM-YYYY') , to_date('04-05-1992', 'DD-MM-YYYY') ); insert into chairmanships( chairperson, date_from, date_to) values ( 'John' , to_date('10-05-1992', 'DD-MM-YYYY') , to_date('08-11-1995', 'DD-MM-YYYY') ); insert into chairmanships( chairperson, date_from, date_to) values ( 'John' , to_date('09-11-1995', 'DD-MM-YYYY') , to_date('10-04-1996', 'DD-MM-YYYY') ); insert into chairmanships( chairperson, date_from, date_to) values ( 'Helen' , to_date('11-05-1996', 'DD-MM-YYYY') , to_date('08-11-1998', 'DD-MM-YYYY') ); insert into chairmanships( chairperson, date_from, date_to) values ( 'Helen' , to_date('09-11-1998', 'DD-MM-YYYY') , to_date('10-02-2004', 'DD-MM-YYYY') ); insert into chairmanships( chairperson, date_from, date_to) values ( 'Jane' , to_date('29-03-2004', 'DD-MM-YYYY') , to_date('01-01-2007', 'DD-MM-YYYY') ); insert into chairmanships( chairperson, date_from, date_to) values ( 'Mary' , to_date('30-03-2004', 'DD-MM-YYYY') , to_date('02-11-2005', 'DD-MM-YYYY') ); insert into chairmanships( chairperson, date_from, date_to) values ( 'Bob' , to_date('02-01-2007', 'DD-MM-YYYY') , null ); insert into chairmanships( chairperson, date_from, date_to) values ( 'Hank' , to_date('16-05-2013', 'DD-MM-YYYY') , null );
The challenge of managing data in the context of time is well known. Having looked at this challenge area for a while I devised an approach which uses a standard Oracle environment and a simple interface to remove complexity from the developer and the user. The approach addresses the knotty issues of referential integrity. The demo at http://www.youtube.com/watch?v=V1EcsuJxUno is against a real business requirement. The point of this post is that the temporal challenge can be addressed simply without the need for special frameworks or sql extensions. While the demo of TRE is through a thin interface this purely simplifies the ease of use for development. The underlying db is a standard relational database model.
Hi again.
I’ve just realized that this is a classic case for Pattern Matching (Oracle 12 only):
select * from chairmanships
match_recognize (
partition by chairperson
order by date_from
measures frst.date_from as date_from, decode(classifier(),’NXT’,nxt.date_to,frst.date_to) as date_to
one row per match
after match skip past last row
pattern (frst nxt*)
define nxt as nxt.date_from = prev(nxt.date_to)+1)
order by chairperson,date_from;
Thanks,
Oren.
One more comment…
An alternative to the recursive subquery can be using the analytic function LAST_VALUE with the IGNORE NULLS option. For example:
select chairperson,
date_from,
max(date_to) keep (dense_rank last order by date_to) date_to
from (select chairperson,
LAST_VALUE(NEW_PERIOD_DATE_FROM IGNORE NULLS) OVER(PARTITION BY CHAIRPERSON ORDER BY DATE_FROM) DATE_FROM,
date_to
from (select chairperson,
date_from,
date_to,
case when LNNVL(date_from – lag(date_to) over(partition by chairperson order by date_from) = 1) then date_from end new_period_date_from
from chairmanships)
)
group by chairperson,
date_from
order by date_from;
It’s worth mentioning that in this solution we “tag” each record that represents the first period in the sequence of adjacent periods (or simply the only period, if there is no such sequence), while in the original solution we tag each record that is not the first period. Note the use of LNNVL in my example to get exactly the complement set.
Thanks,
Oren.
Hi Oren,
That is a very elegant solution indeed. I had been wondering yesterday about a solution without the recursive subquery – as I believe that would be more straightforward. I like you approach very much and I prefer it over my own. Thanks for that contribution. I also like the use of LNNVL. I have vague recollections of once hearing about that operator, but I have never actually used it. This seems like a very good application of it.
Let’s see if anyone can come up with other, inventive and perhaps even more concise ways of resolving this challenge.
thanks
Lucas
Hi Lucas.
As always, it’s a pleasure reading your well-written post.
Using “max(date_to)” may give wrong results because date_to may be null and MAX ignore nulls. For example, if Jim has 2 adjacent periods – one from 01-02-1980 until 04-05-1992 and the second from 05-05-1992 until Null – we expect to get one packed period from 01-02-1980 until Null, but we would get one period from 01-02-1980 until 04-05-1992.
So we need to change the “max(date_to)” – for example to:
nullif(max(nvl(date_to,date’9999-12-31′)),date’9999-12-31′)
or to:
max(date_to) keep (dense_rank last order by date_to)
(I’m fond of the FIRST and LAST functions – http://www.db-oriented.com/2013/08/08/tip003/)
Thanks,
Oren.
Hi Oren,
Thank you for your kind words. You are quite right with your observation. Thanks for pointing that out and providing the improved solution. I may have been influenced by Chris Date into ignoring NULLs completely. His literal words were “I despise NULL”.
thanks,
Lucas
Hi Lucas,
there is one old technique, named as “Start_of_group” on our russian oracle forum: (formatted version: https://gist.github.com/xtender/415103f3e724ae3a2847 )
You can also check your query and start_of_group method with more difficult additional data, try to add this:
Formatted: https://gist.github.com/xtender/ae3bc349c76a429cf6ae
PS. I added links to gist, because i’m not sure that syntax highlighter works fine in comments system.
Hi Sayan,
Thanks for your pretty cool contribution. I am wondering about the grp value – what does it signify? The number of periods packed together in a group? I am not I understand why to group on that value. It would make sense if it indicates the sequence number of each pack – but it does not seem like it does? Counting the how-many-ieth flag == 1 a row contains (== sequence number of period) seems meaningful. Well, I have to find time to take a closer look to fully appreciate your query. Thanks a lot for reacting!
kind regards
Lucas
(I wonder if a chairperson has multiple chairmanship periods that are disconnected and even on different ends of someone else’s chairmanship, what would be the result)
Lucas,
It’s exactly sequence number of each pack. Every row with flag=1 is the first row in group, so rolling sum of “flag” is just sequence number of this group. So, with additional data, i provided before, Jim has 2 different groups:
1. 01-02-1980 – 04-05-1992
and than
2. 01-01-2000 – 01-05-2000, 02-05-2000 – NULL
So grp will be 1 for first period, and 2 for second.
Ah, now I see. I knew I should have taken more time to interpret your query. Well, in this round about way, now I understand what is going on. This makes it quite elegant.
Thanks again,
Lucas