After teaching the Advanced SQL Masterclass recently, one of the attendees confronted me with an interesting challenge for me to solve with all the tools I had been discussing all day. This challenge originates in the world of insurance and revolves around policies. Apparently (I am not much of an insurance expert) policies can have periods of inclusions (when they are valid) as well as periods of exclusion (when the policy is definitely not valid, even when there are inclusions that suggest the opposite. The exclusions win, so to say.
Visually, I can describe the situation for one specific policy like this:
Here we see a policy with five inclusions (in green), that partly overlap, as well as four exclusions (in red). The challenge is that we need to find out from a table that contains all periods of inclusion (green) and exclusion (red) what the resulting periods of validity are for the policy. Visually that would be deriving the blue bars in the following figure:
This article describes how this challenge can be approached in SQL.
First I create the table that holds the policy details:
Note that this data corresponds exactly with the first picture in this article.
The query to retrieve the periods of inclusion is built in a few steps. First I will fuse the periods together: overlapping periods of inclusion are merged together as are overlapping periods of exclusion. Next each period of inclusion is joined with every overlapping period of exclusion. A single period of inclusion is added to cater for situations where an inclusion does not overlap at all with an exclusion or where the inclusion ends with a period that does not overlap. Then using the analytical lag function, the end of the joined exclusion for the previous record is added to each row. Finally, from the resulting records, the periods of inclusion are queried.
Step 1: Fusing Periods Together
Instead of dealing with multiple potentially overlapping periods of inclusion or exclusion for one policy, it is convenient to work with non-overlapping stretches of inclusion or exclusion. The query that produces these fused periods:
is written as follows:
The inner query compares the start date for each period with the end date of the previous period for the same policy and of the same type – inclusion or exclusion. If the start date comes after the end date of the previous period, then there is no overlap and a new period begins (the case returns 1). If that start date falls before the previous period’s end date, then there is overlap and the periods should be merged. In this case, the case produced 0 as result.
One query up, a running count is calculated – used to assign a period sequence number to each row. All records that are merged together share the same period sequence number.
Finally the outmost query returns one row (because of the distinct) for each period. This row contains the beginning of the period – the first_value over the partition defined by policy_id, in_or_out indication and the period number. The row also has the end date of the period, using the last_value in a similar vein as the first_value; note the unbounded following clause to ensure that the window contains all records in the partition.
Step 2: Joining each period of inclusion with the overlapping periods of exclusion
The next step is to join the periods of inclusion with the corresponding i.e. overlapping periods of exclusion – as is shown in the next picture:
The query for joining:
In addition, every for inclusion segment we add a single row to cater for:
– inclusions that have no overlapping exclusion at all (for example 31/10 – 1/6)
– inclusion that have a last part without overlapping (following) exclusion (21/7 – 12/8)
The extra row is added with a simple UNION ALL:
The result of step 2 is:
Step 3: Enriching each record with the end date of the previous row’s exclusion
To the result shown in the previous figure, we add the end of the exclusion joined in the previous inclusion record (for the same policy in the same consecutive period). This is simply done with the analytical LAG function:
and the result is:
Step 4: Retrieve the valid inclusion segments with their start and end
Finally we need to filter those records that describe valid inclusion segments. That means:
- If the period is positive: an inclusion joined with an exclusion and this exclusion should start after the period start (that is why the first row for policy 2 disqualifies)
- If the period is negative, it is the inclusion without joined exclusion – at the end of the period; the previous exclusion should end before the period end (that is why the row marked -1 for policy 1 is excluded)
To find the start and end of each valid inclusion segment:
- If the period is positive: the inclusion starts with the period_start or the prev_ex_end and it ends with the exclusion_start (or the period_end ) (for example the first record in the figure)
- If the period is negative: the start date is the prev_ex_end (or when that is null the period_start) and its end is period_end (for example the last record in the figure)
The final query to achieve these requirements is:
The result of this query is:
and this fortunately corresponds to the visual solution:
Summary
SQL challenges typically can be dealt with in a number of ways. I have shown you mine. I would be interested in learning about yours. What solutions to this challenge can you come up with? More elegant, better readable, better performing or more fun – let me know.
Resources
Download the sources for this article: policyPeriods.txt.
Interesting case in point … I haven’t counted the number of lines of SQL in here, but in a language that implements the features defined in “Temporal Data & The Relational Model”, it gets to be as simple as
USING (STEND) <<
PACK(INSPOL WHERE INOUT=’I’) ON STEND
MINUS (INSPOL WHERE INOUT=’O’)
>>
where INSPOL is the table as follows :
CREATE TABLE INSPOL (
policy_ID NUMBER(10)
STEND DATE_INTERVAL
INOUT VARCHAR (1) ) ;
Hi Lucas
This is my version of a solution.
Step 1 – split into all possible intervals
Step 2 – join in the in and outs
Step 3 – get rid of all apart from just a I
Step 4 – make adjaecent intervals into one big
WITH timeline AS
(
— chop in intervals not matter if in or out
SELECT mydate start_date, lead(mydate) OVER (ORDER BY mydate) – (1) end_date
FROM
(
SELECT start_date mydate, start_date, end_date, in_or_out FROM policy_periods
UNION
SELECT end_date+(1), start_date, end_date, in_or_out FROM policy_periods
)
)
, intervals AS(
—- join the data in approp. intervals
SELECT policy_id, start_date, end_date, wm_concat(in_or_out) event_part_ids
FROM
(
SELECT distinct policy_id, t.start_date, t.end_date, in_or_out
FROM timeline t
JOIN policy_periods i ON nvl(i.end_date, DATE ‘9999-12-31’) >= t.start_date AND i.start_date <= nvl(t.end_date, DATE '9999-12-31')
order by start_date, end_date, in_or_out
)
GROUP BY policy_id, start_date, end_date
having wm_concat(in_or_out) = 'I'
)
—————-
— make complete intervals of 2 or more intervals after eacho other
select policy_id
, min(connect_by_root start_date) as start_date
, end_date
from intervals
where connect_by_isleaf = 1
connect by nocycle prior end_date+1=start_date and prior policy_id = policy_id
group by policy_id, end_date
order by policy_id, end_date
;
Best regards
Mette
Lucas,
You said “I would be interested in learning about yours”
Ok, here goes.
I too accepted the challenge, and tried to build a solution without first taking a sneak peek at yours.
I hoped to be able to solve it, and end up with a similar solution.
Like you said: “SQL challenges typically can be dealt with in a number of ways”.
I ended up with a query that (in my very humble and irrelevant opinion) is somewhat simpler, and therefor perhaps easier to understand for a developer who didn’t build it, but is forced to do maintenance on it.
I’d hereby like to offer you my solution.
I’m not trying to say my solution is better then yours.
In fact, I can imagine that I overlooked something.
Although my query produces the same end-result, and although I did my best to implement the requirements, not to produce the desired data, There might still be something wrong with it.
If so, I’d like to hear about it.
I added some extra records (policy_id = 3) to your data for 2 test-cases that I thought should be in there.
a) 3 periods that overlap: record 2 overlaps 1, 3 overlaps 2 but 3 does not overlap 1
b) 2 periods that do not overlap, but do ‘connect’: period 1 ends at date X and period 2 starts at X+1
(in fact your query sees the last case as 2 separate periods)
My added data:
3 overlapping periods
2 connecting periods
insert into policy_periods values (3, to_date(’15-11-2012′,’DD-MM-YYYY’), to_date(’15-12-2012′,’DD-MM-YYYY’), ‘I’);
insert into policy_periods values (3, to_date(’16-12-2012′,’DD-MM-YYYY’), to_date(’31-12-2012′,’DD-MM-YYYY’), ‘I’);
commit;
So, what does my query do?
First, like you, I decided that within an ID and type (in or out) I need to fuse periods that overlap (or connect) together.
So in the WITH clause there is query ‘aggregated_periods’.
Here, in the innermost query, I check for each record if it’s period overlaps (or connects) to the previous period (for this id and type).
If it doesn’t, the start_date is selected. If it does, the start_date is selected as being NULL, because I don’t care about that start_date.
Next, from the result-set I select the last_value of this start_date with the ‘ignore nulls’ option. This way all records that ‘belong together’ because they overlap/connect will get the same start_date: the first (and only not-null) one.
Then from this result-set for every record I select the MAX(end_date) for the combination of id, type and start_date.
Now all records that ‘belong together’ have the same start and end date, so a final DISTINCT completes my quest for aggregated periods.
Since the ‘out’ records are dominant (they overrule everything) I need to somehow select the periods that are NOT blocked by any out-period
The query ‘valid_periods’ determines these periods.
For every ‘out’ period in aggregated_periods it selects a period as:
start_date = the day after the end_date of the previous out-period, NULL means ‘the beginning of time’
end_date = the day before the start_date of this out-period, NULL means ‘the end of time’
One extra period is added for the period starting the day after the last ‘out’ end_date until the end of time.
Finally the main query takes the ‘in’ periods from aggregated_periods.
Valid_periods does not return records if there are no ‘out’ periods, so the ‘in’ periods are outer joined to valid_periods that overlap.
If no records are returned by valid_periods this causes a join to NULL values, meaning: ‘from the beginning until the end of time’
Than for every selected record the start_date becomes the greatest of the in- and valid-start_date, en the end_date becomes the least of the in- and valid-end_date