SQL Challenge - packing time intervals and merging valid time periods image70

SQL Challenge – packing time intervals and merging valid time periods

During the OUGF 2014 Harmony Conference in Finland, I attended a seminar by Chris Date on Temporal Data and the Relational Model. He introduced an approach within the relational model to describe temporal data and deal with (temporal) intervals and suggested how such could be included in the SQL language. At the present, we are quite a long way away from having his proposals incorporated in the SQL standard (in fact the standard seems to be moving in a different direction).

Edit (14th June 2014): However, the use of temporal data and valid time intervals is already upon us (and has always been, and the relational model has always been able to handle it even if it does not necessarily have specialist, intrinsic support for it). Note: Chris Date explicitly commented [on the approach he discussed during the seminar on temporal data in the relational model]: “It’s merely a set of carefully thought out shorthands for things you can already do in any relationally complete system.  The only really new construct is the INTERVAL type generator, and (as I did say in the seminar, though only very briefly) the relational model has nothing to say regarding what types and type generators need to be supported”

End of Edit

Chris’ excellent seminar has sparked me into trying several things out with SQL as we know it today – in this case in the form of the Oracle 12c Database. My first stab at things is an attempt to perform what Chris calls the PACK operation on intervals. Suppose we have a table that records the chairmanships of a certain organization. The structure of the table is simple: name of the chairperson, date_from and date_to. The interval is defined through these latter two date columns.

The data we currently have describes the following situation:

image

Apparently, we have had several (brief) periods without chairman (gaps) as well as two time stretches with two contenders (conflicts or collisions). These are subjects for later articles. What we are discussing in this article is the case with John and Helen (marked with the arrows). Apparently, both John and Helen have served as chairman for two terms. If it is meaningful in any way that they served these two terms – or whether this is perhaps not important at all or even an accidentally created set of records – is not clear from the data.

In many situations, we will just want to know who has been chairman during a certain period or when a certain chairperson sat in the chair. In these common situations, it is a little clumsy to have these two intervals for both John and Helen. For most intents and purposes, their two terms should be presented as a single combined term. The PACK operation introduced by Chris Date does exactly that: it takes intervals that meet (one starts immediately after the other one ends) and combines these intervals into a single, continuous interval. After performing the PACK operation on the data visualized in the figure overhead, the data would look like this:

image

In this article, I will first create the table chairmanships and insert the records corresponding with this visualization. I will then write the SQL Query that will return the packed chairmanships. Finally, I will demonstrate a MERGE statement that can be used to perform the DML operation that will PACK the data in a single statement – updating the table so it will only contain the packed data.

Note: Chris Date also defines the UNPACK operation. This takes the interval records and expands them into records that convey the same information using unity-intervals: intervals with a duration of one unit of date, in this case a single day. The UNPACK operation is by and large a concept that does not actually have to be performed in a real SQL implementation; it is important though to describe what exactly is the meaning of certain operations – they should be thought of a as dealing with unpacked records. With open ended intervals, this prompted Chris Date to remark that you would have to “unpack eternity” (where eternity is described as the maximum date range in SQL, about 10k years).

The DDL statement that creates our table with chairmanships:

create table chairmanships
( chairperson varchar2(50)
, date_from date
, date_to date
);

Here is the data loaded into the table – the same data that is visualized in the previous figures:

image

Note that these are closed:closed date intervals. Both date_from and date_to are included in the chairmanship period. Jim was chairman on February 1st 1980 and also on May 4th 1992- but not on any date outside that range.

It is not hard to see where there is an opportunity for packing records:

image

A first query can be constructed that will return an indication for each record whether the date_to of previous chairmanship record (lag) for the same chairperson (partition by chairperson) was exactly one date before the date_from of the current record. When that is the case, the record can be fused with its predecessor.

select chairperson
, date_from
, date_to
, case date_from - lag(date_to) over (partition by chairperson order by date_from asc)
when 1 then 'PACK'
end action
, rank() over (partition by chairperson order by date_from asc) rnk
from chairmanships
order
by date_from

Here is the result from that query:

image

The next step we can take with that result is to create the concatenation of date intervals. We start with every record that does not have the PACK action set. Then we use a recursive subquery to find for all these records, adjoining records that can be concatenated or merged into it. When such a record is found, the date_to is extended to the end of the recursively added record and a next recursive iteration is performed. The SQL query now looks like this:

with chairpeople as
( select chairperson
, date_from
, date_to
, case date_from - lag(date_to) over (partition by chairperson order by date_from asc)
when 1 then 'PACK'
end action
, rank() over (partition by chairperson order by date_from asc) rnk
from chairmanships
)
, packing_chairs (chair, date_from, date_to, lvl) as
( select chairperson, date_from, date_to, 1
from chairpeople
where action is null
union all
select p.chair, p.date_from, c.date_to, lvl+1
from chairpeople c
join
packing_chairs p
on (c.chairperson = p.chair and c.rnk = p.lvl+1)
where c.action='PACK'
)
select chair, date_from, date_to, lvl
from packing_chairs

and the resulting records are shown here:

image

Only one of all chairperson records for the same chair starting with the same DATE_FROM should be retained and it should be the one with the latest DATE_TO (or the highest value for LVL, the recursion level indicator). That simply calls for a GROUP BY on chair and date_from, as the next query does:

with chairpeople as
( select chairperson
, date_from
, date_to
, case date_from - lag(date_to) over (partition by chairperson order by date_from asc)
when 1 then 'PACK'
end action
, rank() over (partition by chairperson order by date_from asc) rnk
from chairmanships
)
, packing_chairs (chair, date_from, date_to, lvl) as
( select chairperson, date_from, date_to, 1
from chairpeople
where action is null
union all
select p.chair, p.date_from, c.date_to, lvl+1
from chairpeople c
join
packing_chairs p
on (c.chairperson = p.chair and c.rnk = p.lvl+1)
where c.action='PACK'
)
, packed_chairs as
( select chair, date_from, max(date_to) date_to
from packing_chairs
group
by chair, date_from
)
select *
from packed_chairs
order
by date_from

The final result with only the packed records looks like this:

image

Or, visually:

image

Suppose the situation becomes slightly more complicated (as depicted in the next figure), with more than two consecutive intervals for a chairperson – would that make any difference? (it would not, the query still packs these records) Or what if the chairmanship records define overlapping chairmanships for the same chairman, would these records be packed as well? (at the moment they would not)

image

Update the table – Pack the records

In one single sweeping SQL statement, I would like the records to be packed. We have established the fact that it is not meaningful from a business perspective to retain multiple consecutive chairmanship records for the same chairperson. These records have come into existence because of administrative erratics are application glitches. Anyways, we want the table in the data to be just the packed records. So our challenge is to create the DML statement that will make it so.

What needs to happen in the cases where packing is called for, is that we go from multiple consecutive records to a single one. So our statement has to perform an insert or an update and multiple delete operations all in one go. In Oracle SQL at least, that calls for the MERGE operation.

The MERGE in Oracle SQL can perform three types of DML actions from a single statement: INSERT, UPDATE and DELETE. In this case, we will not use INSERT and only perform an update (of the last chairmanship record in a pack) and a delete (of all non-last records in a pack).

In order to determine whether a record should be updated or deleted, the query has been extended with the packing_order, derived as the rank within the partition defined by the chairman and date_from.

The query looks like this:

with chairpeople as
( select chairperson
, date_from
, date_to
, rowid rid
, case date_from - lag(date_to) over (partition by chairperson order by date_from asc)
when 1 then 'PACK'
end action
, rank() over (partition by chairperson order by date_from asc) rnk
from chairmanships
order
by date_from
)
, packing_chairs (chair, date_from, date_to, lvl) as
( select chairperson, date_from, date_to, 1
from chairpeople
where action is null
union all
select p.chair, p.date_from, c.date_to, lvl+1
from chairpeople c
join
packing_chairs p
on (c.chairperson = p.chair and c.rnk = p.lvl+1)
where c.action='PACK'
)
select chair, date_from, date_to, lvl
, rank() over (partition by chair,date_from order by lvl desc) packing_order
from packing_chairs

and the results are:

image

We have two records with a packing_order > 1 which means they are not the primary [to-be-updated] record in their pack, and therefore will have to go.

The MERGE statement is now written like this:

merge into chairmanships trg
using
( with chairpeople as
( select chairperson
, date_from
, date_to
, rowid rid
, case date_from - lag(date_to) over (partition by chairperson order by date_from asc)
when 1 then 'PACK'
end action
, rank() over (partition by chairperson order by date_from asc) rnk
from chairmanships
order
by date_from
)
, packing_chairs (chair, date_from, date_to, lvl) as
( select chairperson, date_from, date_to, 1
from chairpeople
where action is null
union all
select p.chair, p.date_from, c.date_to, lvl+1
from chairpeople c
join
packing_chairs p
on (c.chairperson = p.chair and c.rnk = p.lvl+1)
where c.action='PACK'
)
select chair, date_from, date_to, lvl
, rank() over (partition by chair,date_from order by lvl desc) packing_order
from packing_chairs
) src
on (src.date_to = trg.date_to and src.chair = trg.chairperson)
when matched
then
update set trg.date_from = src.date_from
delete where packing_order > 1
-- when not matched
-- then insert is optional in the merge statement and is not required here, so we leave it out

Finally, after the merge, a select * from chairmanships produces the following result:

image

Instead of 9 rows, we now have 7 because the two John records have been packed, as have the two Helen records. Note that this could only be done because the interval ‘met’: one started right after the other one ended. Had there been a gap, no packing would have taken place. And with this current query, had there been a partial overlap – then no packing would have been done either – though that might very well be a functional requirement, one that can be implemented fairly easily.

Resources

Oracle SQL Merge statement introduced – blog article on AMIS Technology Blog – https://technology.amis.nl/2006/10/14/have-merge-remove-records-from-target-that-are-not-in-the-source-oracle-10g/

Note from Finland – observation on the OUGF 2014 Harmony conference – https://technology.amis.nl/2014/06/06/part-2-notes-from-finland-impressions-from-the-second-day-of-ougf-2014-harmony-conference/ and https://technology.amis.nl/2014/06/05/notes-from-finland-impressions-from-the-ougf-2014-harmony-conference/

create table chairmanships
( chairperson varchar2(50)
, date_from date
, date_to date
);

insert into chairmanships ( chairperson, date_from, date_to)
values
( 'Jim' , to_date('01-02-1980', 'DD-MM-YYYY') , to_date('04-05-1992', 'DD-MM-YYYY')
);

insert into chairmanships( chairperson, date_from, date_to)
values
( 'John' , to_date('10-05-1992', 'DD-MM-YYYY') , to_date('08-11-1995', 'DD-MM-YYYY')
);

insert into chairmanships( chairperson, date_from, date_to)
values
( 'John' , to_date('09-11-1995', 'DD-MM-YYYY') , to_date('10-04-1996', 'DD-MM-YYYY')
);

insert into chairmanships( chairperson, date_from, date_to)
values
( 'Helen' , to_date('11-05-1996', 'DD-MM-YYYY') , to_date('08-11-1998', 'DD-MM-YYYY')
);

insert into chairmanships( chairperson, date_from, date_to)
values
( 'Helen' , to_date('09-11-1998', 'DD-MM-YYYY') , to_date('10-02-2004', 'DD-MM-YYYY')
);

insert into chairmanships( chairperson, date_from, date_to)
values
( 'Jane' , to_date('29-03-2004', 'DD-MM-YYYY') , to_date('01-01-2007', 'DD-MM-YYYY')
);

insert into chairmanships( chairperson, date_from, date_to)
values
( 'Mary' , to_date('30-03-2004', 'DD-MM-YYYY') , to_date('02-11-2005', 'DD-MM-YYYY')
);

insert into chairmanships( chairperson, date_from, date_to)
values
( 'Bob' , to_date('02-01-2007', 'DD-MM-YYYY') , null
);

insert into chairmanships( chairperson, date_from, date_to)
values
( 'Hank' , to_date('16-05-2013', 'DD-MM-YYYY') , null
);

10 Comments

  1. Rob Squire June 12, 2014
    • Lucas Jellema June 9, 2014
    • Lucas Jellema June 9, 2014
      • Sayan Malakshinov June 11, 2014
        • Lucas Jellema June 11, 2014
          • Sayan Malakshinov June 11, 2014
          • Lucas Jellema June 11, 2014