Advanced SQL to find valid periods – juggling with outer joins, running totals and analytical functions
After teaching the Advanced SQL Masterclass recently, one of the attendees confronted me with an interesting challenge for me to solve with all the tools I had been discussing all day. This challenge originates in the world of insurance and revolves around policies. Apparently (I am not much of an insurance expert) policies can have periods of inclusions (when they are valid) as well as periods of exclusion (when the policy is definitely not valid, even when there are inclusions that suggest the opposite. The exclusions win, so to say.
Visually, I can describe the situation for one specific policy like this:
Here we see a policy with five inclusions (in green), that partly overlap, as well as four exclusions (in red). The challenge is that we need to find out from a table that contains all periods of inclusion (green) and exclusion (red) what the resulting periods of validity are for the policy. Visually that would be deriving the blue bars in the following figure:
This article describes how this challenge can be approached in SQL.
First I create the table that holds the policy details:
Note that this data corresponds exactly with the first picture in this article.
The query to retrieve the periods of inclusion is built in a few steps. First I will fuse the periods together: overlapping periods of inclusion are merged together as are overlapping periods of exclusion. Next each period of inclusion is joined with every overlapping period of exclusion. A single period of inclusion is added to cater for situations where an inclusion does not overlap at all with an exclusion or where the inclusion ends with a period that does not overlap. Then using the analytical lag function, the end of the joined exclusion for the previous record is added to each row. Finally, from the resulting records, the periods of inclusion are queried.
Step 1: Fusing Periods Together
Instead of dealing with multiple potentially overlapping periods of inclusion or exclusion for one policy, it is convenient to work with non-overlapping stretches of inclusion or exclusion. The query that produces these fused periods:
is written as follows:
The inner query compares the start date for each period with the end date of the previous period for the same policy and of the same type – inclusion or exclusion. If the start date comes after the end date of the previous period, then there is no overlap and a new period begins (the case returns 1). If that start date falls before the previous period’s end date, then there is overlap and the periods should be merged. In this case, the case produced 0 as result.
One query up, a running count is calculated – used to assign a period sequence number to each row. All records that are merged together share the same period sequence number.
Finally the outmost query returns one row (because of the distinct) for each period. This row contains the beginning of the period – the first_value over the partition defined by policy_id, in_or_out indication and the period number. The row also has the end date of the period, using the last_value in a similar vein as the first_value; note the unbounded following clause to ensure that the window contains all records in the partition.
Step 2: Joining each period of inclusion with the overlapping periods of exclusion
The next step is to join the periods of inclusion with the corresponding i.e. overlapping periods of exclusion – as is shown in the next picture:
The query for joining:
In addition, every for inclusion segment we add a single row to cater for:
- inclusions that have no overlapping exclusion at all (for example 31/10 – 1/6)
- inclusion that have a last part without overlapping (following) exclusion (21/7 – 12/8)
The extra row is added with a simple UNION ALL:
The result of step 2 is:
Step 3: Enriching each record with the end date of the previous row’s exclusion
To the result shown in the previous figure, we add the end of the exclusion joined in the previous inclusion record (for the same policy in the same consecutive period). This is simply done with the analytical LAG function:
and the result is:
Step 4: Retrieve the valid inclusion segments with their start and end
Finally we need to filter those records that describe valid inclusion segments. That means:
- If the period is positive: an inclusion joined with an exclusion and this exclusion should start after the period start (that is why the first row for policy 2 disqualifies)
- If the period is negative, it is the inclusion without joined exclusion – at the end of the period; the previous exclusion should end before the period end (that is why the row marked -1 for policy 1 is excluded)
To find the start and end of each valid inclusion segment:
- If the period is positive: the inclusion starts with the period_start or the prev_ex_end and it ends with the exclusion_start (or the period_end ) (for example the first record in the figure)
- If the period is negative: the start date is the prev_ex_end (or when that is null the period_start) and its end is period_end (for example the last record in the figure)
The final query to achieve these requirements is:
The result of this query is:
and this fortunately corresponds to the visual solution:
Summary
SQL challenges typically can be dealt with in a number of ways. I have shown you mine. I would be interested in learning about yours. What solutions to this challenge can you come up with? More elegant, better readable, better performing or more fun – let me know.
Resources
Download the sources for this article: policyPeriods.txt.
Hi Lucas
This is my version of a solution.
Step 1 – split into all possible intervals
Step 2 – join in the in and outs
Step 3 – get rid of all apart from just a I
Step 4 – make adjaecent intervals into one big
WITH timeline AS
(
— chop in intervals not matter if in or out
SELECT mydate start_date, lead(mydate) OVER (ORDER BY mydate) – (1) end_date
FROM
(
SELECT start_date mydate, start_date, end_date, in_or_out FROM policy_periods
UNION
SELECT end_date+(1), start_date, end_date, in_or_out FROM policy_periods
)
)
, intervals AS(
—- join the data in approp. intervals
SELECT policy_id, start_date, end_date, wm_concat(in_or_out) event_part_ids
FROM
(
SELECT distinct policy_id, t.start_date, t.end_date, in_or_out
FROM timeline t
JOIN policy_periods i ON nvl(i.end_date, DATE ’9999-12-31′) >= t.start_date AND i.start_date <= nvl(t.end_date, DATE '9999-12-31')
order by start_date, end_date, in_or_out
)
GROUP BY policy_id, start_date, end_date
having wm_concat(in_or_out) = 'I'
)
—————-
– make complete intervals of 2 or more intervals after eacho other
select policy_id
, min(connect_by_root start_date) as start_date
, end_date
from intervals
where connect_by_isleaf = 1
connect by nocycle prior end_date+1=start_date and prior policy_id = policy_id
group by policy_id, end_date
order by policy_id, end_date
;
Best regards
Mette