Yesterday just prior to our AMIS Query on ADF Faces I was talking to Toon Koppelaars about quite something else: the end_date column in tables that contain contiguous records with BEGIN_DATE and END_DATE where there is no overlap and there are no gaps. Every BEGIN_DATE is the day after the END_DATE of the previous record and every END_DATE is the day before the BEGIN_DATE of the next record. Except of course for the first and last record. We agreed that in this situation, the END_DATE is really redundant information: it is always available from the next record. Situations where you will find this type of record include product price history, resume entries, inventory records, allocation data etc.
You could debate whether END_DATE is or is not a property of the record. When you do your ER modelling, would you include the end date as an attribute? And if you do, is that because you are implicitly taking into account the problems you may have if you have to derive the END_DATE from the BEGIN_DATE of the next record? It is a bit like building two fences between two adjacent gardens: one fence can clearly mark the border of both gardens, but somehow – lack of trust in your neighbour’s intentions? – you decide to have two coinciding fences.
Is it a problem to have this END_DATE column when it is really only redundant – or at least denormalized – information? Well, you have to go through some additional steps:
- When you insert a new record at the end, you have to update the current latest record and set its END_DATE to the BEGIN_DATE of the new record minus one
- When you insert a new record somewhere in the middle, you have to update two existing records: change the END_DATE of the record before the new record and the BEGIN_DATE of the record following it
- If you change the BEGIN_DATE of an existing record, you have to update the END_DATE of the preceding record; if you change the END_DATE, you have to update the BEGIN_DATE of the following record
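To make the middle-insert case concrete, here is a sketch of the required DML, using the PRODUCT_PRICES table from the example later in this post. The new price of 900 and the specific dates are hypothetical, invented purely for illustration:

```sql
-- Sketch: inserting a new JDeveloper price period in the middle of the history.
-- 1. Close off the record that currently covers the start of the new period:
update product_prices
set    end_date   = to_date('31-MAY-2004', 'DD-MON-YYYY')  -- new BEGIN_DATE minus one
where  product    = 'JDeveloper'
and    begin_date = to_date('21-SEP-2003', 'DD-MON-YYYY');

-- 2. Insert the new record; its END_DATE must line up with the
--    BEGIN_DATE of the record that follows it (16-NOV-04):
insert into product_prices (product, price, begin_date, end_date)
values ('JDeveloper', 900,
        to_date('01-JUN-2004', 'DD-MON-YYYY'),
        to_date('15-NOV-2004', 'DD-MON-YYYY'));
```

Forget either statement, or get one date wrong, and you have introduced a gap or an overlap, which is exactly the maintenance burden discussed above.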
Think of the garden fences: when you start a new garden between two existing gardens, you need to move two fences and build two new ones. When you start a new garden at the end of the street, you need to build two fences. If you buy a piece of the garden next door, you have to move two fences.
Of course, for all these updates you will have to work around the mutating table challenge!
Alternatively, you can force the user to make all the correct changes by enforcing business rules that explicitly check that there are no gaps between the records – all end_date and begin_date are adjacent – and that there is no overlap between the records.
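Such a rule can itself be checked in SQL. A sketch of a validation query, using the PRODUCT_PRICES table from the example below and assuming the rule is enforced per product:

```sql
-- Flag records whose END_DATE does not line up with the next record's
-- BEGIN_DATE: either a gap or an overlap in the history.
select *
from ( select product
       ,      begin_date
       ,      end_date
       ,      lead(begin_date) over
                (partition by product order by begin_date) next_begin
       from   product_prices
     )
where  next_begin is not null
and    end_date <> next_begin - 1
/
```

A clean table returns no rows; any row returned pinpoints exactly where the two fences no longer coincide.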
It sounds cumbersome. And it is. So why do it?
Well, the reasons most frequently given for having this END_DATE column – or a TO_POSITION column, since the same reasoning applies to data describing positions – are:
- It is convenient to quickly find the current record: WHERE END_DATE IS NULL
- It is convenient in reports to show explicitly, via BEGIN_DATE and END_DATE, for which period each record applies
- It is convenient to quickly compute for how long a certain record was valid: END_DATE – BEGIN_DATE
It is not hard to see that for these three cases, having the END_DATE is convenient. But not at all necessary. Let’s take a look at how Analytical Functions can help us to get these three pieces of functionality just as easily, without using END_DATE.
I have a simple example with product pricing data. Note: these are not actual figures, I have just made something up. I start out traditionally, using a column END_DATE.
create table product_prices
( product    varchar2(50)
, price      number(8,2)
, begin_date date
, end_date   date
)
/
insert into product_prices values ( 'JDeveloper', 2500, '01-JAN-01', '20-SEP-03' )
/
insert into product_prices values ( 'JDeveloper', 1000, '21-SEP-03', '15-NOV-04' )
/
insert into product_prices values ( 'JDeveloper',  800, '16-NOV-04', '15-JUL-05' )
/
insert into product_prices values ( 'JDeveloper',    0, '16-JUL-05', null )
/
insert into product_prices values ( 'Eclipse'   ,    0, '30-JUN-01', null )
/
insert into product_prices values ( 'JBuilder'  , 5000, '06-MAR-02', '30-AUG-03' )
/
insert into product_prices values ( 'JBuilder'  , 2500, '31-AUG-03', '30-AUG-04' )
/
insert into product_prices values ( 'JBuilder'  , 1500, '31-AUG-04', '30-JAN-05' )
/
insert into product_prices values ( 'JBuilder'  ,  500, '31-JAN-05', '28-FEB-05' )
/
insert into product_prices values ( 'JBuilder'  , 1500, '01-MAR-05', null )
/
Selecting the current price for all products is now quite simple:
select product
,      price
from   product_prices
where  end_date is null
/

PRODUCT                                                 PRICE
-------------------------------------------------- ----------
JDeveloper                                                  0
Eclipse                                                     0
JBuilder                                                 1500
Finding out for every price record from when until when it was valid is also very simple:
select product
,      price
,      begin_date
,      end_date
,      trunc(nvl(end_date, sysdate) - begin_date) days
from   product_prices
order
by     product
,      begin_date
/

PRODUCT                                                 PRICE BEGIN_DAT END_DATE        DAYS
-------------------------------------------------- ---------- --------- --------- ----------
Eclipse                                                     0 30-JUN-01                 1553
JBuilder                                                 5000 06-MAR-02 30-AUG-03        542
JBuilder                                                 2500 31-AUG-03 30-AUG-04        365
JBuilder                                                 1500 31-AUG-04 30-JAN-05        152
JBuilder                                                  500 31-JAN-05 28-FEB-05         28
JBuilder                                                 1500 01-MAR-05                  213
JDeveloper                                               2500 01-JAN-01 20-SEP-03        992
JDeveloper                                               1000 21-SEP-03 15-NOV-04        421
JDeveloper                                                800 16-NOV-04 15-JUL-05        241
JDeveloper                                                  0 16-JUL-05                   76
So it would seem that going through the trouble of having the additional column and the triggers to deal correctly with inserts and updates really pays off. Or does it? Using Analytical Functions, we can forget about the END_DATE column and retrieve the same information almost as easily. First, to find the current price record for each product – there are several ways to go about it, I only show three of them:
alter table product_prices drop column end_date
/
select distinct
       product
,      first_value(price) over
         (partition by product order by begin_date desc) current_price
from   product_prices
/

PRODUCT                                            CURRENT_PRICE
-------------------------------------------------- -------------
Eclipse                                                        0
JBuilder                                                    1500
JDeveloper                                                     0
Or:
select product
,      price
from ( select product
       ,      price
       ,      row_number() over
                (partition by product order by begin_date desc) rn
       from   product_prices
     )
where  rn = 1
/
or:
select product
,      price
from ( select product
       ,      price
       ,      lead(begin_date - 1) over
                (partition by product order by begin_date) end_date
       from   product_prices
     )
where  end_date is null
/
Now to find the period for which each price record is applicable:
select product
,      price
,      begin_date
,      lead(begin_date - 1) over
         (partition by product order by begin_date) end_date
from   product_prices
order
by     product
,      begin_date
/

PRODUCT                                                 PRICE BEGIN_DAT END_DATE
-------------------------------------------------- ---------- --------- ---------
Eclipse                                                     0 30-JUN-01
JBuilder                                                 5000 06-MAR-02 30-AUG-03
JBuilder                                                 2500 31-AUG-03 30-AUG-04
JBuilder                                                 1500 31-AUG-04 30-JAN-05
JBuilder                                                  500 31-JAN-05 28-FEB-05
JBuilder                                                 1500 01-MAR-05
JDeveloper                                               2500 01-JAN-01 20-SEP-03
JDeveloper                                               1000 21-SEP-03 15-NOV-04
JDeveloper                                                800 16-NOV-04 15-JUL-05
JDeveloper                                                  0 16-JUL-05
Now it would seem that with only a few simple enhancements to the queries, we can achieve exactly the same results without the END_DATE column as we did with it. And we can forget the complexity of the business rules or the triggers to deal with new or updated records, as well as the redundant column itself. One fence should be enough!
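For reporting convenience, the derived END_DATE could even be hidden behind a view, so that report writers never have to touch the analytic function themselves. A sketch (the view name is invented for illustration):

```sql
create or replace view product_prices_v as
select product
,      price
,      begin_date
,      lead(begin_date - 1) over
         (partition by product order by begin_date) end_date
from   product_prices
/

-- Reports can now query the view as if END_DATE were a stored column:
select * from product_prices_v where end_date is null
/
```

The derivation lives in exactly one place, and the base table keeps a single fence per border.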
hi
Why do you not answer my question?
I want to know the date format in Oracle Warehouse Builder.
I wanted to filter the data based on a column Date.
The date format of this field is 'MM/DD/YYYY HH24:MI:SS'.
I want to filter the data based on a single day.
I am giving the filter condition in Filter transformation is
TO_DATE(INOUTGRP1.DATE,'DD-MON-YYYY') = TO_DATE('13-12-2004','DD-MON-YYYY')
But it is not working. If you know the solution to this problem, please let me know.
Just a quick response on the overhead of the analytical functions…
Lucas’ first query, third alternative, regarding current product prices executes with plan:
(Note: the SYS… index is the primary key on (product,begin_date)).
--------------------------------------------------------
| Id  | Operation                     | Name           |
--------------------------------------------------------
|   0 | SELECT STATEMENT              |                |
|*  1 |  VIEW                         |                |
|   2 |   WINDOW BUFFER               |                |
|   3 |    TABLE ACCESS BY INDEX ROWID| PRODUCT_PRICES |
|   4 |     INDEX FULL SCAN           | SYS_C0014032   |
--------------------------------------------------------
In the redundant enddate scenario a FULL table scan would be done.
The second query regarding fully qualified periods, executes with the same plan:
-------------------------------------------------------
| Id  | Operation                    | Name           |
-------------------------------------------------------
|   0 | SELECT STATEMENT             |                |
|   1 |  WINDOW BUFFER               |                |
|   2 |   TABLE ACCESS BY INDEX ROWID| PRODUCT_PRICES |
|   3 |    INDEX FULL SCAN           | SYS_C0014032   |
-------------------------------------------------------
Again the redundant enddate scenario executes with a FULL table scan.
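For readers who want to reproduce these plans themselves, something along the following lines should work (using DBMS_XPLAN, which is available from Oracle 9i onwards; shown here for the third of Lucas' current-price alternatives):

```sql
explain plan for
select product
,      price
from ( select product
       ,      price
       ,      lead(begin_date - 1) over
                (partition by product order by begin_date) end_date
       from   product_prices
     )
where  end_date is null
/

select * from table(dbms_xplan.display)
/
```

The exact plan you get depends, of course, on the indexes present and on optimizer statistics.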
Granted, the plans with the analytic functions are (still) most likely to be more expensive (both CPU and I/O) than the redundant data model scenario: this mainly depends on the clustering factor of the primary key index.
The execution plans DO NOT scare me however, AND they are MUCH better than the queries we had to write in the days when analytical functions were not available to us (nested subqueries to find the maximum begin date before sysdate, etc.).
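The kind of pre-analytic query being referred to might have looked something like this sketch, using a correlated subquery to find the latest BEGIN_DATE per product:

```sql
-- Old-style "current record" query: one extra scan per outer row
-- (or a sort/aggregate), instead of a single windowed pass.
select pp.product
,      pp.price
from   product_prices pp
where  pp.begin_date = ( select max(pp2.begin_date)
                         from   product_prices pp2
                         where  pp2.product    = pp.product
                         and    pp2.begin_date <= sysdate )
/
```

Compared to this, the single-pass WINDOW BUFFER plans above are both easier to read and typically cheaper.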
By ‘do not scare me’ I mean that I am more than happy to trade off the overhead of these queries against the code that I would have to write to implement the ‘redundant enddate’ business rule, including the runtime overhead it generates in transactions (as described by Lucas above). I have yet to encounter my first production database (one that has been running for a few years) that has redundancy in its data model AND has no violations of that redundancy…
I very often have discussions with ‘designers’ who build redundancy into their data models ‘for performance reasons’… With the current state of the optimizer technology available to us, I always start by disagreeing with these design decisions. I seek proof that the queries this designed-in redundancy is targeting will actually turn out to be the bottlenecks of the system to be built. Unsurprisingly, such proof can rarely be given… And then I veto the redundancy. 🙂
What you say is perfectly correct; however, I would be a little concerned about the efficiency of needing to use the analytic function every time I wanted to work out, for a given point in time, all the records that were applicable. I have not tested it, but I am guessing the analytic approach would be slower, and in a warehouse situation (which is where I normally work) the number of rows in the table you are scanning can be significantly large, which might make the performance differential significant enough to warrant not doing it that way.
We also use an END_DATE of NULL for partitioning purposes, to segregate current open rows from historical closed rows, so that is another use for the physical END_DATE column.
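A sketch of what such partitioning might look like, using range partitioning on END_DATE (the table name, partition names and boundary date are invented; in Oracle range partitioning, rows with a NULL partitioning key fall into the MAXVALUE partition, so all open rows end up together):

```sql
create table product_prices_part
( product    varchar2(50)
, price      number(8,2)
, begin_date date
, end_date   date
)
partition by range (end_date)
( partition p_closed values less than (date '2006-01-01')
, partition p_open   values less than (maxvalue)  -- NULL end_date lands here
)
/
```

Queries for current rows (END_DATE IS NULL) can then be pruned to the open partition, which a derived end date cannot offer.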
Also, it makes the assumption that a record cannot go away and then come back, which for some of our tables is not the case. Maybe the government comes along and wants to make a public footpath between the two gardens: now you have a gap.
Nice article.