Yesterday just prior to our AMIS Query on ADF Faces I was talking to Toon Koppelaars about quite something else: the end_date column in tables that contain contiguous records with BEGIN_DATE and END_DATE where there is no overlap and there are no gaps. Every BEGIN_DATE is the day after the END_DATE of the previous record and every END_DATE is the day before the BEGIN_DATE of the next record. Except of course for the first and last record. We agreed that in this situation, the END_DATE is really redundant information: it is always available from the next record. Situations where you will find this type of record include product price history, resume entries, inventory records, allocation data etc.
You could debate whether END_DATE is or is not a property of the record. When you do your ER modelling, would you include the end date as an attribute? And if you do, is that because you are implicitly taking into account the problems you may have if you have to derive the END_DATE from the BEGIN_DATE of the next record? It is a bit like building two fences between two adjacent gardens: one fence can clearly mark the border of both gardens, but somehow – lack of trust in your neighbour’s intentions? – you decide to have two coinciding fences.
Is it a problem to have this END_DATE column when it is really only redundant – or at least denormalized – information? Well, you have to go through some additional steps:
- When you insert a new record at the end, you have to update the current latest record and set its END_DATE to the BEGIN_DATE of the new record minus one
- When you insert a new record somewhere in the middle, you have to update two existing records: change the END_DATE of the record before the new record and the BEGIN_DATE of the record following it
- If you change the BEGIN_DATE of an existing record, you have to update the END_DATE of the preceding record; if you change the END_DATE, you have to update the BEGIN_DATE of the following record
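To make the middle-insert case concrete, here is a sketch of the required DML, using the PRODUCT_PRICES table from the example later in this post. The new price of 900 and the specific dates are hypothetical, invented purely for illustration:

```sql
-- Sketch: inserting a new JDeveloper price period in the middle of the history.
-- 1. Close off the record that currently covers the start of the new period:
update product_prices
set    end_date   = to_date('31-MAY-2004', 'DD-MON-YYYY')  -- new BEGIN_DATE minus one
where  product    = 'JDeveloper'
and    begin_date = to_date('21-SEP-2003', 'DD-MON-YYYY');

-- 2. Insert the new record; its END_DATE must line up with the
--    BEGIN_DATE of the record that follows it (16-NOV-04):
insert into product_prices (product, price, begin_date, end_date)
values ('JDeveloper', 900,
        to_date('01-JUN-2004', 'DD-MON-YYYY'),
        to_date('15-NOV-2004', 'DD-MON-YYYY'));
```

Forget either statement, or get one date wrong, and you have introduced a gap or an overlap, which is exactly the maintenance burden discussed above.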
Think of the garden fences: when you start a new garden between two existing gardens, you need to move two fences and build two new ones. When you start a new garden at the end of the street, you need to build two fences. If you buy a piece of the garden next door, you have to move two fences.
Of course, for all these updates you will have to work around the mutating table challenge!
Alternatively, you can force the user to make all the correct changes by enforcing business rules that explicitly check that there are no gaps between the records – all end_date and begin_date are adjacent – and that there is no overlap between the records.
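Such a rule can itself be checked in SQL. A sketch of a validation query, using the PRODUCT_PRICES table from the example below and assuming the rule is enforced per product:

```sql
-- Flag records whose END_DATE does not line up with the next record's
-- BEGIN_DATE: either a gap or an overlap in the history.
select *
from ( select product
       ,      begin_date
       ,      end_date
       ,      lead(begin_date) over
                (partition by product order by begin_date) next_begin
       from   product_prices
     )
where  next_begin is not null
and    end_date <> next_begin - 1
/
```

A clean table returns no rows; any row returned pinpoints exactly where the two fences no longer coincide.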
It sounds cumbersome. And it is. So why do it?
Well, the reasons most frequently given for having this END_DATE column – or a TO_POSITION column, since the same reasoning applies to data describing positions – are:
- It is convenient to quickly find the current record: WHERE END_DATE IS NULL
- It is convenient in reports to show explicitly, via BEGIN_DATE and END_DATE, for which period each record applies
- It is convenient to quickly compute for how long a certain record was valid: END_DATE – BEGIN_DATE
It is not hard to see that for these three cases, having the END_DATE is convenient. But not at all necessary. Let’s take a look at how Analytical Functions can help us to get these three pieces of functionality just as easily, without using END_DATE.
I have a simple example with product pricing data. Note: these are not actual figures, I have just made something up. I start out traditionally, using a column END_DATE.
create table product_prices
( product    varchar2(50)
, price      number(8,2)
, begin_date date
, end_date   date
)
/
insert into product_prices values ( 'JDeveloper', 2500, '01-JAN-01', '20-SEP-03' )
/
insert into product_prices values ( 'JDeveloper', 1000, '21-SEP-03', '15-NOV-04' )
/
insert into product_prices values ( 'JDeveloper',  800, '16-NOV-04', '15-JUL-05' )
/
insert into product_prices values ( 'JDeveloper',    0, '16-JUL-05', null )
/
insert into product_prices values ( 'Eclipse'   ,    0, '30-JUN-01', null )
/
insert into product_prices values ( 'JBuilder'  , 5000, '06-MAR-02', '30-AUG-03' )
/
insert into product_prices values ( 'JBuilder'  , 2500, '31-AUG-03', '30-AUG-04' )
/
insert into product_prices values ( 'JBuilder'  , 1500, '31-AUG-04', '30-JAN-05' )
/
insert into product_prices values ( 'JBuilder'  ,  500, '31-JAN-05', '28-FEB-05' )
/
insert into product_prices values ( 'JBuilder'  , 1500, '01-MAR-05', null )
/
Selecting the current price for all products is now quite simple:
select product
,      price
from   product_prices
where  end_date is null
/

PRODUCT                                                 PRICE
-------------------------------------------------- ----------
JDeveloper                                                  0
Eclipse                                                     0
JBuilder                                                 1500
Finding out for every price record from when until when it was valid is also very simple:
select product
,      price
,      begin_date
,      end_date
,      trunc(nvl(end_date, sysdate) - begin_date) days
from   product_prices
order
by     product
,      begin_date
/

PRODUCT                                                 PRICE BEGIN_DAT END_DATE        DAYS
-------------------------------------------------- ---------- --------- --------- ----------
Eclipse                                                     0 30-JUN-01                 1553
JBuilder                                                 5000 06-MAR-02 30-AUG-03        542
JBuilder                                                 2500 31-AUG-03 30-AUG-04        365
JBuilder                                                 1500 31-AUG-04 30-JAN-05        152
JBuilder                                                  500 31-JAN-05 28-FEB-05         28
JBuilder                                                 1500 01-MAR-05                  213
JDeveloper                                               2500 01-JAN-01 20-SEP-03        992
JDeveloper                                               1000 21-SEP-03 15-NOV-04        421
JDeveloper                                                800 16-NOV-04 15-JUL-05        241
JDeveloper                                                  0 16-JUL-05                   76
So it would seem that going through the trouble of having the additional column and the triggers to deal correctly with inserts and updates really pays off. Or does it? Using Analytical Functions, we can forget about the END_DATE column and retrieve the same information almost as easily. First, to find the current price record for each product – there are several ways to go about it, I only show three of them:
alter table product_prices drop column end_date
/
select distinct
       product
,      first_value(price) over
         (partition by product order by begin_date desc) current_price
from   product_prices
/

PRODUCT                                            CURRENT_PRICE
-------------------------------------------------- -------------
Eclipse                                                        0
JBuilder                                                    1500
JDeveloper                                                     0
Or:
select product
,      price
from ( select product
       ,      price
       ,      row_number() over
                (partition by product order by begin_date desc) rn
       from   product_prices
     )
where  rn = 1
/
or:
select product
,      price
from ( select product
       ,      price
       ,      lead(begin_date - 1) over
                (partition by product order by begin_date) end_date
       from   product_prices
     )
where  end_date is null
/
Now to find the period for which each price record is applicable:
select product
,      price
,      begin_date
,      lead(begin_date - 1) over
         (partition by product order by begin_date) end_date
from   product_prices
order
by     product
,      begin_date
/

PRODUCT                                                 PRICE BEGIN_DAT END_DATE
-------------------------------------------------- ---------- --------- ---------
Eclipse                                                     0 30-JUN-01
JBuilder                                                 5000 06-MAR-02 30-AUG-03
JBuilder                                                 2500 31-AUG-03 30-AUG-04
JBuilder                                                 1500 31-AUG-04 30-JAN-05
JBuilder                                                  500 31-JAN-05 28-FEB-05
JBuilder                                                 1500 01-MAR-05
JDeveloper                                               2500 01-JAN-01 20-SEP-03
JDeveloper                                               1000 21-SEP-03 15-NOV-04
JDeveloper                                                800 16-NOV-04 15-JUL-05
JDeveloper                                                  0 16-JUL-05
Now it would seem that with only a few simple enhancements to the queries, we can achieve exactly the same results without the END_DATE column as we did with it. And we can forget the complexity of the business rules or the triggers to deal with new or updated records, as well as the redundant column itself. One fence should be enough!
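For reporting convenience, the derived END_DATE could even be hidden behind a view, so that report writers never have to touch the analytic function themselves. A sketch (the view name is invented for illustration):

```sql
create or replace view product_prices_v as
select product
,      price
,      begin_date
,      lead(begin_date - 1) over
         (partition by product order by begin_date) end_date
from   product_prices
/

-- Reports can now query the view as if END_DATE were a stored column:
select * from product_prices_v where end_date is null
/
```

The derivation lives in exactly one place, and the base table keeps a single fence per border.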
hi
Why do you not answer my question?
I want to know the date format in Oracle Warehouse Builder.
I wanted to filter the data based on a column Date.
The date format of this field is 'MM/DD/YYYY HH24:MI:SS'.
I want to filter the data based on a single day.
I am giving the filter condition in Filter transformation is
TO_DATE(INOUTGRP1.DATE,'DD-MON-YYYY') = TO_DATE('13-12-2004','DD-MON-YYYY')
But it is not working. If you know the solution to this problem, please let me know.
Just a quick response on the overhead of the analytical functions…
Lucas’ first query, third alternative, regarding current product prices executes with plan:
(Note: the SYS… index is the primary key on (product,begin_date)).
--------------------------------------------------------
| Id  | Operation                     | Name           |
--------------------------------------------------------
|   0 | SELECT STATEMENT              |                |
|*  1 |  VIEW                         |                |
|   2 |   WINDOW BUFFER               |                |
|   3 |    TABLE ACCESS BY INDEX ROWID| PRODUCT_PRICES |
|   4 |     INDEX FULL SCAN           | SYS_C0014032   |
--------------------------------------------------------
In the redundant enddate scenario a FULL table scan would be done.
The second query regarding fully qualified periods, executes with the same plan:
-------------------------------------------------------
| Id  | Operation                    | Name           |
-------------------------------------------------------
|   0 | SELECT STATEMENT             |                |
|   1 |  WINDOW BUFFER               |                |
|   2 |   TABLE ACCESS BY INDEX ROWID| PRODUCT_PRICES |
|   3 |    INDEX FULL SCAN           | SYS_C0014032   |
-------------------------------------------------------
Again the redundant enddate scenario executes with a FULL table scan.
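For readers who want to reproduce these plans themselves, something along the following lines should work (using DBMS_XPLAN, which is available from Oracle 9i onwards; shown here for the third of Lucas' current-price alternatives):

```sql
explain plan for
select product
,      price
from ( select product
       ,      price
       ,      lead(begin_date - 1) over
                (partition by product order by begin_date) end_date
       from   product_prices
     )
where  end_date is null
/

select * from table(dbms_xplan.display)
/
```

The exact plan you get depends, of course, on the indexes present and on optimizer statistics.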
Granted, the plans with the analytic functions are (still) most likely to be more expensive (both CPU and I/O) than the redundant data model scenario: this mainly depends on the clustering factor of the primary key index.
The execution plans DO NOT scare me however, AND they are MUCH better than the queries we had to write in the days when analytical functions were not available to us (nested subqueries to find the maximum begin date before sysdate, etc.).
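The kind of pre-analytic query being referred to might have looked something like this sketch, using a correlated subquery to find the latest BEGIN_DATE per product:

```sql
-- Old-style "current record" query: one extra scan per outer row
-- (or a sort/aggregate), instead of a single windowed pass.
select pp.product
,      pp.price
from   product_prices pp
where  pp.begin_date = ( select max(pp2.begin_date)
                         from   product_prices pp2
                         where  pp2.product    = pp.product
                         and    pp2.begin_date <= sysdate )
/
```

Compared to this, the single-pass WINDOW BUFFER plans above are both easier to read and typically cheaper.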
By ‘do not scare me’ I mean that I am more than happy to trade off the overhead of these queries against the code that I would have to write to implement the ‘redundant enddate’ business rule, including the runtime overhead it generates in transactions (as described by Lucas above). I have yet to encounter my first production database (one that has been running for a few years) that has redundancy in its data model AND has no violations of that redundancy…
I very often have discussions with ‘designers’ who build redundancy into their data models ‘for performance reasons’… With the current state of the optimizer technology available to us, I always start by disagreeing with these design decisions. I seek proof that the queries this designed-in redundancy is targeting will actually turn out to be the bottlenecks of the system to be built. Unsurprisingly, such proof can rarely be given… And then I veto the redundancy. 🙂
What you say is perfectly correct; however, I would be a little concerned about the efficiency of needing to use the analytic function every time I wanted to work out, for a given point in time, all the records that were applicable. I have not tested it, but I am guessing the analytic approach would be slower, and in a warehouse situation (which is where I normally work) the number of rows in the table you are scanning can be significantly large, which might make the performance differential significant enough to warrant not doing it that way.
We also use an END_DATE of NULL for partitioning purposes, to segregate current open rows from historical closed rows, so that is another use for the physical END_DATE column.
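A sketch of what such partitioning might look like, using range partitioning on END_DATE (the table name, partition names and boundary date are invented; in Oracle range partitioning, rows with a NULL partitioning key fall into the MAXVALUE partition, so all open rows end up together):

```sql
create table product_prices_part
( product    varchar2(50)
, price      number(8,2)
, begin_date date
, end_date   date
)
partition by range (end_date)
( partition p_closed values less than (date '2006-01-01')
, partition p_open   values less than (maxvalue)  -- NULL end_date lands here
)
/
```

Queries for current rows (END_DATE IS NULL) can then be pruned to the open partition, which a derived end date cannot offer.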
Also, it makes the assumption that a record cannot go away and then come back, which for some of our tables is not the case. Maybe the government comes along and wants to make a public footpath between the two gardens: now you have a gap.
Nice article.