Advanced SQL to find valid periods – juggling with outer joins, running totals and analytical functions

Lucas Jellema December 23, 2012 Database, Oracle, PL/SQL 3 Comments

After teaching the Advanced SQL Masterclass recently, one of the attendees confronted me with an interesting challenge for me to solve with all the tools I had been discussing all day. This challenge originates in the world of insurance and revolves around policies. Apparently (I am not much of an insurance expert) policies can have periods of inclusions (when they are valid) as well as periods of exclusion (when the policy is definitely not valid, even when there are inclusions that suggest the opposite. The exclusions win, so to say.

Visually, I can describe the situation for one specific policy like this:

Here we see a policy with five inclusions (in green), that partly overlap, as well as four exclusions (in red). The challenge is that we need to find out from a table that contains all periods of inclusion (green) and exclusion (red) what the resulting periods of validity are for the policy. Visually that would be deriving the blue bars in the following figure:

This article describes how this challenge can be approached in SQL.

First I create the table that holds the policy details:

and load the data:

Note that this data corresponds exactly with the first picture in this article.

The query to retrieve the periods of inclusion is built in a few steps. First I will fuse the periods together: overlapping periods of inclusion are merged together as are overlapping periods of exclusion. Next each period of inclusion is joined with every overlapping period of exclusion. A single period of inclusion is added to cater for situations where an inclusion does not overlap at all with an exclusion or where the inclusion ends with a period that does not overlap. Then using the analytical lag function, the end of the joined exclusion for the previous record is added to each row. Finally, from the resulting records, the periods of inclusion are queried.

Step 1: Fusing Periods Together

Instead of dealing with multiple potentially overlapping periods of inclusion or exclusion for one policy, it is convenient to work with non-overlapping stretches of inclusion or exclusion. The query that produces these fused periods:

is written as follows:

The inner query compares the start date for each period with the end date of the previous period for the same policy and of the same type – inclusion or exclusion. If the start date comes after the end date of the previous period, then there is no overlap and a new period begins (the case returns 1). If that start date falls before the previous period’s end date, then there is overlap and the periods should be merged. In this case, the case produced 0 as result.

One query up, a running count is calculated – used to assign a period sequence number to each row. All records that are merged together share the same period sequence number.

Finally the outmost query returns one row (because of the distinct) for each period. This row contains the beginning of the period – the first_value over the partition defined by policy_id, in_or_out indication and the period number. The row also has the end date of the period, using the last_value in a similar vein as the first_value; note the unbounded following clause to ensure that the window contains all records in the partition.

Step 2: Joining each period of inclusion with the overlapping periods of exclusion

The next step is to join the periods of inclusion with the corresponding i.e. overlapping periods of exclusion – as is shown in the next picture:

The query for joining:

In addition, every for inclusion segment we add a single row to cater for:

– inclusions that have no overlapping exclusion at all (for example 31/10 – 1/6)

– inclusion that have a last part without overlapping (following) exclusion (21/7 – 12/8)

The extra row is added with a simple UNION ALL:

The result of step 2 is:

Step 3: Enriching each record with the end date of the previous row’s exclusion

To the result shown in the previous figure, we add the end of the exclusion joined in the previous inclusion record (for the same policy in the same consecutive period). This is simply done with the analytical LAG function:

and the result is:

Step 4: Retrieve the valid inclusion segments with their start and end

Finally we need to filter those records that describe valid inclusion segments. That means:

If the period is positive: an inclusion joined with an exclusion and this exclusion should start after the period start (that is why the first row for policy 2 disqualifies)
If the period is negative, it is the inclusion without joined exclusion – at the end of the period; the previous exclusion should end before the period end (that is why the row marked -1 for policy 1 is excluded)

To find the start and end of each valid inclusion segment:

If the period is positive: the inclusion starts with the period_start or the prev_ex_end and it ends with the exclusion_start (or the period_end ) (for example the first record in the figure)
If the period is negative: the start date is the prev_ex_end (or when that is null the period_start) and its end is period_end (for example the last record in the figure)

The final query to achieve these requirements is:

The result of this query is:

and this fortunately corresponds to the visual solution:

Summary

SQL challenges typically can be dealt with in a number of ways. I have shown you mine. I would be interested in learning about yours. What solutions to this challenge can you come up with? More elegant, better readable, better performing or more fun – let me know.

Resources

Download the sources for this article: policyPeriods.txt.

Tags:analytical function, lag, lead, running total, sql

About The Author

Lucas Jellema

Lucas Jellema, active in IT (and with Oracle) since 1994. Oracle ACE Director and Oracle Developer Champion. Solution architect and developer on diverse areas including SQL, JavaScript, Kubernetes & Docker, Machine Learning, Java, SOA and microservices, events in various shapes and forms and many other things. Author of the Oracle Press book Oracle SOA Suite 12c Handbook. Frequent presenter on user groups and community events and conferences such as JavaOne, Oracle Code, CodeOne, NLJUG JFall and Oracle OpenWorld.

3 Comments

siraprise April 24, 2014

Interesting case in point … I haven’t counted the number of lines of SQL in here, but in a language that implements the features defined in “Temporal Data & The Relational Model”, it gets to be as simple as

USING (STEND) <<
PACK(INSPOL WHERE INOUT=’I’) ON STEND
MINUS (INSPOL WHERE INOUT=’O’)
>>
where INSPOL is the table as follows :

CREATE TABLE INSPOL (
policy_ID NUMBER(10)
STEND DATE_INTERVAL
INOUT VARCHAR (1) ) ;
MetteMusens January 18, 2013

Hi Lucas

This is my version of a solution.

Step 1 – split into all possible intervals
Step 2 – join in the in and outs
Step 3 – get rid of all apart from just a I
Step 4 – make adjaecent intervals into one big

WITH timeline AS
(
— chop in intervals not matter if in or out
SELECT mydate start_date, lead(mydate) OVER (ORDER BY mydate) – (1) end_date
FROM
(
SELECT start_date mydate, start_date, end_date, in_or_out FROM policy_periods
UNION
SELECT end_date+(1), start_date, end_date, in_or_out FROM policy_periods
)
)
, intervals AS(
—- join the data in approp. intervals
SELECT policy_id, start_date, end_date, wm_concat(in_or_out) event_part_ids
FROM
(
SELECT distinct policy_id, t.start_date, t.end_date, in_or_out
FROM timeline t
JOIN policy_periods i ON nvl(i.end_date, DATE ‘9999-12-31’) >= t.start_date AND i.start_date <= nvl(t.end_date, DATE '9999-12-31')
order by start_date, end_date, in_or_out
)
GROUP BY policy_id, start_date, end_date
having wm_concat(in_or_out) = 'I'
)
—————-
— make complete intervals of 2 or more intervals after eacho other
select policy_id
, min(connect_by_root start_date) as start_date
, end_date
from intervals
where connect_by_isleaf = 1
connect by nocycle prior end_date+1=start_date and prior policy_id = policy_id
group by policy_id, end_date
order by policy_id, end_date
;

Best regards
Mette

Erik van Roon January 14, 2013

Lucas,

You said “I would be interested in learning about yours”
Ok, here goes.

I too accepted the challenge, and tried to build a solution without first taking a sneak peek at yours.
I hoped to be able to solve it, and end up with a similar solution.
Like you said: “SQL challenges typically can be dealt with in a number of ways”.
I ended up with a query that (in my very humble and irrelevant opinion) is somewhat simpler, and therefor perhaps easier to understand for a developer who didn’t build it, but is forced to do maintenance on it.
I’d hereby like to offer you my solution.

I’m not trying to say my solution is better then yours.
In fact, I can imagine that I overlooked something.
Although my query produces the same end-result, and although I did my best to implement the requirements, not to produce the desired data, There might still be something wrong with it.
If so, I’d like to hear about it.

I added some extra records (policy_id = 3) to your data for 2 test-cases that I thought should be in there.
a) 3 periods that overlap: record 2 overlaps 1, 3 overlaps 2 but 3 does not overlap 1
b) 2 periods that do not overlap, but do ‘connect’: period 1 ends at date X and period 2 starts at X+1
(in fact your query sees the last case as 2 separate periods)

My added data:
3 overlapping periods

insert into policy_periods values (3, to_date('01-01-2012','DD-MM-YYYY'), to_date('01-06-2012','DD-MM-YYYY'), 'I');
insert into policy_periods values (3, to_date('01-03-2012','DD-MM-YYYY'), to_date('01-09-2012','DD-MM-YYYY'), 'I');
insert into policy_periods values (3, to_date('01-08-2012','DD-MM-YYYY'), to_date('01-11-2012','DD-MM-YYYY'), 'I');

2 connecting periods
insert into policy_periods values (3, to_date(’15-11-2012′,’DD-MM-YYYY’), to_date(’15-12-2012′,’DD-MM-YYYY’), ‘I’);
insert into policy_periods values (3, to_date(’16-12-2012′,’DD-MM-YYYY’), to_date(’31-12-2012′,’DD-MM-YYYY’), ‘I’);
commit;

So, what does my query do?
First, like you, I decided that within an ID and type (in or out) I need to fuse periods that overlap (or connect) together.
So in the WITH clause there is query ‘aggregated_periods’.
Here, in the innermost query, I check for each record if it’s period overlaps (or connects) to the previous period (for this id and type).
If it doesn’t, the start_date is selected. If it does, the start_date is selected as being NULL, because I don’t care about that start_date.
Next, from the result-set I select the last_value of this start_date with the ‘ignore nulls’ option. This way all records that ‘belong together’ because they overlap/connect will get the same start_date: the first (and only not-null) one.
Then from this result-set for every record I select the MAX(end_date) for the combination of id, type and start_date.
Now all records that ‘belong together’ have the same start and end date, so a final DISTINCT completes my quest for aggregated periods.

Since the ‘out’ records are dominant (they overrule everything) I need to somehow select the periods that are NOT blocked by any out-period
The query ‘valid_periods’ determines these periods.
For every ‘out’ period in aggregated_periods it selects a period as:
start_date = the day after the end_date of the previous out-period, NULL means ‘the beginning of time’
end_date = the day before the start_date of this out-period, NULL means ‘the end of time’
One extra period is added for the period starting the day after the last ‘out’ end_date until the end of time.

Finally the main query takes the ‘in’ periods from aggregated_periods.
Valid_periods does not return records if there are no ‘out’ periods, so the ‘in’ periods are outer joined to valid_periods that overlap.
If no records are returned by valid_periods this causes a join to NULL values, meaning: ‘from the beginning until the end of time’
Than for every selected record the start_date becomes the greatest of the in- and valid-start_date, en the end_date becomes the least of the in- and valid-end_date

SELECT policy_id
,      in_or_out
,      start_date
,      end_date
FROM   policy_periods
ORDER BY policy_id
,        in_or_out
,        start_date
,        end_date
;

      POLICY_ID I START_DAT END_DATE
--------------- - --------- ---------
              1 I 10-OCT-10 15-MAY-11
              1 I 20-NOV-11 04-JUN-12
              1 I 05-APR-12 12-AUG-12
              1 I 31-OCT-12 01-MAR-13
              1 I 31-JAN-13 01-JUN-13
              1 O 15-OCT-10 19-OCT-10
              1 O 14-FEB-11 10-SEP-11
              1 O 01-FEB-12 15-APR-12
              1 O 01-JUN-12 21-JUL-12
              2 I 20-NOV-10 24-JUN-11
              2 I 05-APR-11 12-AUG-12
              2 O 14-FEB-10 10-SEP-11
              3 I 01-JAN-12 01-JUN-12
              3 I 01-MAR-12 01-SEP-12
              3 I 01-AUG-12 01-NOV-12
              3 I 15-NOV-12 15-DEC-12
              3 I 16-DEC-12 31-DEC-12

17 rows selected.

WITH aggregated_periods
AS   (SELECT DISTINCT
             policy_id
      ,      in_or_out
      ,      start_date
      ,      MAX(end_date) OVER (PARTITION BY policy_id,in_or_out,start_date)  end_date
      FROM   (SELECT policy_id
              ,      in_or_out
              ,      LAST_VALUE (start_date IGNORE NULLS) OVER (PARTITION BY policy_id,in_or_out ORDER BY start_date_org NULLS LAST)   start_date
              ,      end_date
              FROM   (SELECT policy_id
                      ,      in_or_out
                      ,      start_date          start_date_org
                      ,      CASE
                               WHEN start_date -1 <= LAG (end_date) OVER (PARTITION BY policy_id,in_or_out ORDER BY start_date,end_date)
                               THEN NULL         -- overlap with or connecting to previous period, so not interested in this start_date
                               ELSE start_date
                             END                 start_date
                      ,      end_date
                      FROM   policy_periods
                     )
             )
     )
,    valid_periods
AS   (SELECT policy_id
      ,      LAG(end_date) OVER (PARTITION BY policy_id ORDER BY start_date) + 1    start_date
      ,      start_date -1                                                          end_date
      FROM   aggregated_periods
      WHERE  in_or_out        = 'O'
      UNION
      SELECT policy_id
      ,      MAX(end_date) + 1     start_date
      ,      NULL                  end_date
      FROM   aggregated_periods
      WHERE  in_or_out        = 'O'
      GROUP BY policy_id
     )
SELECT per_in.policy_id
,      GREATEST (per_in.start_date
                ,NVL(per_val.start_date
                    ,per_in.start_date
                    )
                )               start_date
,      LEAST    (per_in.end_date
                ,NVL(per_val.end_date
                    ,per_in.end_date
                    )
                )               end_date
FROM   aggregated_periods           per_in
LEFT OUTER JOIN   valid_periods     per_val
  ON   per_val.policy_id       = per_in.policy_id
  -- And valid_period should overlap in-period
  AND  (   per_val.start_date IS NULL
        OR per_val.start_date <= per_in.end_date
       )
  AND  (   per_val.end_date  IS NULL
        OR per_val.end_date   >= per_in.start_date
       )
WHERE  per_in.in_or_out    = 'I'
ORDER BY per_in.policy_id
,        start_date
;

      POLICY_ID START_DAT END_DATE
--------------- --------- ---------
              1 10-OCT-10 14-OCT-10
              1 20-OCT-10 13-FEB-11
              1 20-NOV-11 31-JAN-12
              1 16-APR-12 31-MAY-12
              1 22-JUL-12 12-AUG-12
              1 31-OCT-12 01-JUN-13
              2 11-SEP-11 12-AUG-12
              3 01-JAN-12 01-NOV-12
              3 15-NOV-12 31-DEC-12

9 rows selected.