Death of the End Date – How LAG and LEAD help fight redundancy

4

Yesterday just prior to our AMIS Query on ADF Faces I was talking to Toon Koppelaars about quite something else: the end_date column in tables that contain contiguous records with BEGIN_DATE and END_DATE where there is no overlap and there are no gaps. Every BEGIN_DATE is the day after the END_DATE of the previous record and every END_DATE is the day before the BEGIN_DATE of the next record. Except of course for the first and last record. We agreed that in this situation, the END_DATE is really redundant information: it is always available from the next record. Situations where you will find this type of record include product price history, resume entries, inventory records, allocation data etc.

You could enter a debate whether END_DATE is or is not a property of the record. When you do your ER-modelling, would you include the end date as an attribute? And if you do, is that because you are implicitly taking into account the problems you may have if you have to derive the END_DATE from the BEGIN_DATE of a next record? It is a bit like building to fences between two adjacent gardens: once fence can clearly mark the border of both gardens, but somehow – lack of trust in your neighbour’s intentions? – you decide to have two coinciding fences.

Is it a problem if you have this END_DATE column where it is really only redundant or at least denormalized information?Well, you have to go through some additional steps:

  • When you insert a new record at the end, you have to update the current latest record and set is END_DATE to the BEGIN_DATE minus one of the new record
  • When you insert a new record somewhere in the middle, you have two update two existing records: change the END_DATE of the one before the new record and update the BEGIN_DATE of the record following the new record
  • If you change the BEGIN_DATE or END_DATE of an existing record, you have to update the END_DATE or BEGIN_DATE of the preceding or following record

Think of the garden fences: when you start a new garden between two existing gardens, you need move two fences and build two new ones. When you start a new garden at the end of street, you need to build two fences. If you buy piece of the garden next-doors, you have to move two fences.

Of course, for all these updates you will have to work around the mutating table challenge!

Alternatively, you can force the user to make all the correct changes by enforcing business rules that explicitly check that there are no gaps between the records – all end_date and begin_date are adjacent – and that there is no overlap between the records.

It sounds cumbersome. And it is. So why do it?

Well, the reasons most frequently given for having this END_DATE column – or this TO_POSITION because the same reasoning applies to data describing position – are:

  • It is convenient to quicly find the current record: WHERE END_DATE IS NULL
  • It is convenient to show in reports for records for which period they apply by showing BEGIN DATE and END DATE explicitly
  • To quickly find out for how long a certain record was valid: END_DATE – BEGIN_DATE

It is not hard to see that for these three cases, having the END_DATE is convenient. But not at all necessary. Let’s take a look at how Analytical Functions can help us to get these three pieces of functionality just as easily, without using END_DATE.

I have a simple example with product pricing data. Note: these are not actual figures, I have just made something up. I start out traditionally, using a column END_DATE.

create table product_prices
( product    varchar2(50)
, price      number(8,2)
, begin_date date
, end_date   date
)
/
insert into product_prices
values
( 'JDeveloper'
, 2500
, '01-JAN-01'
, '20-SEP-03'
)
/
insert into product_prices
values
( 'JDeveloper'
, 1000
, '21-SEP-03'
, '15-NOV-04'
)
/
insert into product_prices
values
( 'JDeveloper'
, 800
, '16-NOV-04'
, '15-JUL-05'
)
/
insert into product_prices
values
( 'JDeveloper'
, 0
, '16-JUL-05'
, null
)
/

insert into product_prices
values
( 'Eclipse'
, 0
, '30-JUN-01'
, null
)
/

insert into product_prices
values
( 'JBuilder'
, 5000
, '06-MAR-02'
, '30-AUG-03'
)
/
insert into product_prices
values
( 'JBuilder'
, 2500
, '31-AUG-03'
, '30-AUG-04'
)
/
insert into product_prices
values
( 'JBuilder'
, 1500
, '31-AUG-04'
, '30-JAN-05'
)
/
insert into product_prices
values
( 'JBuilder'
, 500
, '31-JAN-05'
, '28-FEB-05'
)
/
insert into product_prices
values
( 'JBuilder'
, 1500
, '01-MAR-05'
, null
)
/

Selecting the current price for all products is now quite simple:

select product
,      price
from   product_prices
where  end_date is null
/
PRODUCT                                                 PRICE
-------------------------------------------------- ----------
JDeveloper                                                  0
Eclipse                                                     0
JBuilder                                                 1500

Finding out for every price record from when until when it was valid is also very simple:

select product
,      price
,      begin_date
,      end_date
,      trunc(nvl(end_date,sysdate) - begin_date) days
from   product_prices
order
by     product
,      begin_date
/
PRODUCT                                                 PRICE BEGIN_DAT END_DATE        DAYS
-------------------------------------------------- ---------- --------- --------- ----------
Eclipse                                                     0 30-JUN-01                 1553
JBuilder                                                 5000 06-MAR-02 30-AUG-03        542
JBuilder                                                 2500 31-AUG-03 30-AUG-04        365
JBuilder                                                 1500 31-AUG-04 30-JAN-05        152
JBuilder                                                  500 31-JAN-05 28-FEB-05         28
JBuilder                                                 1500 01-MAR-05                  213
JDeveloper                                               2500 01-JAN-01 20-SEP-03        992
JDeveloper                                               1000 21-SEP-03 15-NOV-04        421
JDeveloper                                                800 16-NOV-04 15-JUL-05        241
JDeveloper                                                  0 16-JUL-05                   76

So it would seem that going through the trouble of having the additional column and the triggers to deal correctly with inserts and updates really pays off. Or does it? Using Analytical Functions, we can forget about the END_DATE column and retrieve the same information almost as easily. First, to find the current price record for each product – there are several ways to go about it, I only show three of them:

alter table product_prices drop column end_date
/
select distinct
       product
,      first_value( price) over (partition by product
                                 order by begin_date desc
                               ) current_price
from   product_prices
/
PRODUCT                                            CURRENT_PRICE
-------------------------------------------------- -------------
Eclipse                                                        0
JBuilder                                                    1500
JDeveloper                                                     0

Or:

select product
,      price
from   ( select product
         ,      price
         ,      row_number() over ( partition by product
                                    order by begin_date desc
                                  ) rn
         from   product_prices
       )
where  rn = 1
/

or:

select product
,      price
from   ( select product
         ,      price
         ,      lead(begin_date-1) over ( partition by product
                                 order by begin_date
                                ) end_date
         from   product_prices
       )
where  end_date is null
/

Now to find the period for which each price record is applicable:

select product
,      price
,      begin_date
,      lead(begin_date-1) over ( partition by product
                                 order by begin_date
                                ) end_date
from   product_prices
order
by     product
,      begin_date
/
PRODUCT                                                 PRICE BEGIN_DAT END_DATE
-------------------------------------------------- ---------- --------- ---------
Eclipse                                                     0 30-JUN-01
JBuilder                                                 5000 06-MAR-02 30-AUG-03
JBuilder                                                 2500 31-AUG-03 30-AUG-04
JBuilder                                                 1500 31-AUG-04 30-JAN-05
JBuilder                                                  500 31-JAN-05 28-FEB-05
JBuilder                                                 1500 01-MAR-05
JDeveloper                                               2500 01-JAN-01 20-SEP-03
JDeveloper                                               1000 21-SEP-03 15-NOV-04
JDeveloper                                                800 16-NOV-04 15-JUL-05
JDeveloper                                                  0 16-JUL-05

Now it would seem that with only a few simple enhancements to the queries, we can achieve exactly results without the END_DATE as we did with the END_DATE. And we can forget the complexity of the business rules or the triggers to deal with new or updated records as well as the redundant column. One fence should be enough!

Share.

About Author

Lucas Jellema, active in IT (and with Oracle) since 1994. Oracle ACE Director for Fusion Middleware. Consultant, trainer and instructor on diverse areas including Oracle Database (SQL & PLSQL), Service Oriented Architecture, BPM, ADF, Java in various shapes and forms and many other things. Author of the Oracle Press book: Oracle SOA Suite 11g Handbook. Frequent presenter on conferences such as JavaOne, Oracle OpenWorld, ODTUG Kaleidoscope, Devoxx and OBUG. Presenter for Oracle University Celebrity specials.

4 Comments

  1. I want to know the date format in Oracle Warehouse Builder.
    I wanted to filter the data based on a column Date.
    The date format of this field is ‘MM/DD/YYYY HH24:MI:SS’.
    I want to filetr the data based on a single day.
    I am giving the filter condition in Filter transformation is
    TO_DATE(INOUTGRP1.DATE,’DD-MON-YYYY’)=TO_DATE(’13-12-2004′,’DD-MON-YYYY’)
    But it is not working.If u know the solution of this problem, please let me know.

  2. Just a quick response on the overhead of the analytical functions…

    Lucas’ first query, third alternative, regarding current product prices executes with plan:
    (Note: the SYS… index is the primary key on (product,begin_date)).
    ——————————————————–
    | Id | Operation | Name |
    ——————————————————–
    | 0 | SELECT STATEMENT | |
    |* 1 | VIEW | |
    | 2 | WINDOW BUFFER | |
    | 3 | TABLE ACCESS BY INDEX ROWID| PRODUCT_PRICES |
    | 4 | INDEX FULL SCAN | SYS_C0014032 |
    ——————————————————–
    In the redundant enddate scenario a FULL table scan would be done.

    The second query regarding fully qualified periods, executes with the same plan:
    ——————————————————-
    | Id | Operation | Name |
    ——————————————————-
    | 0 | SELECT STATEMENT | |
    | 1 | WINDOW BUFFER | |
    | 2 | TABLE ACCESS BY INDEX ROWID| PRODUCT_PRICES |
    | 3 | INDEX FULL SCAN | SYS_C0014032 |
    ——————————————————-
    Again the redundant enddate scenario executes with a FULL table scan.

    Granted, the plans with the analytic functions are (still) most likely to be more expensive (both

    cpu+I/O) than the redundant datamodel scenario: this mainly depends on the clustering factor for the

    primary key index.

    The execution plans DO NOT scare me however, AND are MUCH better than the queries we had to write in

    the days when we did not have available to us the analytical functions (nested subqueries to find

    maximum begindate before sysdate, etc.).

    By ‘do not scare me’ I mean that I’m more than happy to trade-off the overhead for these queries

    against the code that I’d have to write to implement the ‘redundant enddate’ business rule including

    the runtime overhead it generates in transactions (as described by Lucas above). I have yet to

    encounter the first production database (that is running for a few years) that has redundancy in

    it’s datamodel, AND has no violations with regards to this redundancy…

    I very often have discussions with ‘designers’ that create datamodels with redundancy designed into

    the model: ‘For performance reasons’… With the current state of the optimizer technology available

    to us, I always first disagree with these design decisions. I always seek proof that the queries

    that this designed-redundancy is targetting, will come out as the bottle-necks of the system to be

    built. To no suprise such proof can rarely be given… And then I veto the redundancy. :-)

  3. What you say is perfectly correct, however, I would have to be a little concerned with the efficiency of needing to use
    the analytic function every time I wanted to work out for my given point in time all those records which were applicable.
    I’ve not tested it but I’m guessing the analytic approach would be slower – and in a warehouse situation (which is what
    I’m normally in) then the number of rows on the table you are scanning can be significantly large which might make the
    performance differential significant enough to warrant not doing it that way.

    We also use the END_DATE of NULL for partitioning purposes – to segregate current open rows from historical closed rows so
    that’s another use for the physical END_DATE column.

    Also, it does make the assumption that the record can’t go away and then come back – which for some of our tables is not
    the case. Maybe the government come along and want to make a public footpath between the two gardens – now you have a gap.

    Nice article.