While working on a project for one of our clients, an interesting requirement popped up. What they wanted was the following:
“From this list of dates, we need the ID of the most recent subscription”. Here’s a sample from the list:
ID, CAT, START_DATE, END_DATE
1, 0, 29-DEC-04, 17-JUL-05
2, 0, 29-DEC-04, 17-JUL-05
3, 1, 29-DEC-04, 17-JUL-05
4, 1, 29-DEC-04, 25-OCT-05
5, 2, 29-DEC-04, 17-JUL-05
6, 2, 29-DEC-04,
7, 3, 29-DEC-04,
8, 3, 29-DEC-04, 17-JUL-05
9, 4, 29-DEC-04,
“However”, they said, “these are the additional requirements. When there’s no End Date filled in, the subscription is still valid, so that is the most recent one. When more subscriptions have no End Date, or the End Dates are equal, look at the Start Dates. The Start Dates are compared in the same way as the End Dates: the most recent Start Date is the one we’re interested in. When they are equal, take the one with the highest ID. If a Start Date is not filled in, it’s invalid, and we’re not interested in invalid ones, so ignore it. If none of the Start Dates are filled in, take the one with the highest ID, ‘cause then it doesn’t matter which one you take. This must be done for each category (Cat), of course.”
“Of course…”
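If you want to play along at home, a minimal setup could look like this. It is my own sketch based on the sample above (it is not the demonstration script mentioned at the end of this post); the table and column names simply follow the listing:

create table date_list
( id         number
, cat        number
, start_date date
, end_date   date
);

insert into date_list values (1, 0, date '2004-12-29', date '2005-07-17');
insert into date_list values (2, 0, date '2004-12-29', date '2005-07-17');
insert into date_list values (3, 1, date '2004-12-29', date '2005-07-17');
insert into date_list values (4, 1, date '2004-12-29', date '2005-10-25');
insert into date_list values (5, 2, date '2004-12-29', date '2005-07-17');
insert into date_list values (6, 2, date '2004-12-29', null);
insert into date_list values (7, 3, date '2004-12-29', null);
insert into date_list values (8, 3, date '2004-12-29', date '2005-07-17');
insert into date_list values (9, 4, date '2004-12-29', null);
commit;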
The first reaction was to tackle this procedurally, mainly because of the way the requirement was stated: it reads more or less like pseudocode. “If the end date is greater than the following end date within the same category then…”
But, following the mantra “If you can do it in SQL, do it”, this was the result:
select id
     , cat
     , start_date
     , end_date
  from date_list main_q
 where not exists
       (select 1
          from date_list nest_q
         where nest_q.cat = main_q.cat
           and decode (nest_q.end_date, main_q.end_date
                      , decode (nest_q.start_date, main_q.start_date
                               , trunc (sysdate, 'DDD')
                                 + decode (nest_q.id, least (nest_q.id, main_q.id), 0, 1)
                               , nvl (nest_q.start_date, main_q.start_date - 1)
                               )
                      , nvl (nest_q.end_date, main_q.end_date + 1)
                      )
               >
               decode (main_q.end_date, nest_q.end_date
                      , decode (main_q.start_date, nest_q.start_date
                               , trunc (sysdate, 'DDD')
                                 + decode (main_q.id, least (nest_q.id, main_q.id), 0, 1)
                               , nvl (main_q.start_date, nest_q.start_date - 1)
                               )
                      , nvl (main_q.end_date, nest_q.end_date + 1)
                      )
       )
Wow… That’s a lot of code to tackle this problem. Can you imagine having to maintain this query? What about changing it to meet an additional requirement? In order to get the correct results, each row has to be compared to the values within the same group (the same category), which means querying the same table again.
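For readers trying to decipher it, my own reading of the DECODE/NVL construction (an interpretation, not the original author’s commentary) is this: a missing date is substituted with a value that sorts just past the other row’s date, and a full tie on both dates is broken by adding one day only for the row with the higher ID, so the highest ID wins. A tiny illustration of the NULL trick:

-- an open-ended subscription (NULL end date) is made to outrank a closed one
-- by substituting "the other row's end date + 1"
select nvl(to_date(null), date '2005-07-17' + 1) as open_ended
     , date '2005-07-17'                         as closed_end
  from dual;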
When the creator of this query explained it to me, I had a hard time following the solution. I was impressed, and overwhelmed, by the code he had created.
Then suddenly it dawned on me: Analytical Functions… Because of the requirement (“This must be done for each category”), the date_list had to be partitioned by Cat.
The first part of the query was done. It had to be:
partition by cat
Another requirement was the priority of the records within each category: first check the End_Date, then the Start_Date, and lastly the ID. The order by of the partition is dictated by this requirement:
order by end_date desc nulls first, start_date desc nulls last, id desc
More parts of the query were falling into place. Having the Analytical part of the query in place, I could assemble the following query:
select id
     , cat
     , start_date
     , end_date
  from (select id
             , cat
             , start_date
             , end_date
             , row_number() over (partition by cat
                                      order by end_date desc nulls first
                                             , start_date desc nulls last
                                             , id desc
                                  ) rn
          from date_list
       )
 where rn = 1
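As a sanity check, working through the nine sample rows by hand (so a derivation from the requirements, not an actual test run), rn = 1 should pick these rows:

ID, CAT, START_DATE, END_DATE
2, 0, 29-DEC-04, 17-JUL-05   (ties on both dates, highest ID wins)
4, 1, 29-DEC-04, 25-OCT-05   (latest End Date)
6, 2, 29-DEC-04,             (open-ended beats a closed subscription)
7, 3, 29-DEC-04,             (open-ended beats a closed subscription)
9, 4, 29-DEC-04,             (only row in its category)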
Being very pleased with myself, and having checked that both queries returned the same results, I did some timed tests on them. Initially both queries showed a similar execution time, with a mere 60 records in the table.
What about the statistics? There is a dramatic difference.
Original Query:
Elapsed: 00:00:00.00

Statistics
---------------------------------------------------
          0  recursive calls
         12  db block gets
        127  consistent gets
          0  physical reads
          0  redo size
       1717  bytes sent via SQL*Net to client
       1976  bytes received via SQL*Net from client
          4  SQL*Net roundtrips to/from client
          2  sorts (memory)
          0  sorts (disk)
         30  rows processed
Analytical Function:
Elapsed: 00:00:00.00

Statistics
---------------------------------------------------
          0  recursive calls
         12  db block gets
          5  consistent gets
          0  physical reads
          0  redo size
       1717  bytes sent via SQL*Net to client
        802  bytes received via SQL*Net from client
          4  SQL*Net roundtrips to/from client
          3  sorts (memory)
          0  sorts (disk)
         30  rows processed
Encouraged by this result, I also did some timed tests with more records. By duplicating the contents of the date_list table into itself (one way to script this is sketched after the statistics below), I increased the record count to 7,680 and got these results:
Original Query:
Elapsed: 00:00:03.03

Statistics
---------------------------------------------------
          0  recursive calls
          4  db block gets
      75938  consistent gets
          0  physical reads
          0  redo size
       1838  bytes sent via SQL*Net to client
       2066  bytes received via SQL*Net from client
          4  SQL*Net roundtrips to/from client
          2  sorts (memory)
          0  sorts (disk)
         30  rows processed
Analytical Function:
Elapsed: 00:00:00.00

Statistics
---------------------------------------------------
          0  recursive calls
          4  db block gets
         25  consistent gets
          0  physical reads
          0  redo size
       1838  bytes sent via SQL*Net to client
        802  bytes received via SQL*Net from client
          4  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
         30  rows processed
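The duplication mentioned above could be as simple as repeating something like the following until the table is big enough; this is my own sketch, the original test script may have done it differently. Starting from 60 rows, seven doublings give the 7,680 rows used here:

insert into date_list select * from date_list;
commit;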
Increasing the record count to 1.9 million gave these results:
Original Query:
… I don’t know, it never gave any results…
Analytical Function:
Elapsed: 00:01:34.07

Statistics
---------------------------------------------------
          0  recursive calls
         97  db block gets
       6020  consistent gets
      34942  physical reads
          0  redo size
       1898  bytes sent via SQL*Net to client
        794  bytes received via SQL*Net from client
          4  SQL*Net roundtrips to/from client
          2  sorts (memory)
          1  sorts (disk)
         30  rows processed
For the initial tests, I omitted the creation of an index on the CAT column. The differences in the statistics are even more dramatic than the results shown above: testing with 1.9 million records, the Analytical Function gave results in approximately the same time as shown, but with more consistent gets. The Original Query? Well, I’m still waiting…
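For completeness, the index in question would presumably be something along these lines (the index name is mine):

create index date_list_cat_idx on date_list (cat);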
To quote Tom Kyte: “Analytics Rock and Roll”.
Acknowledgement
Thanks to Frans van Deursen from Atos Origin, who came up with the very creative original SQL.
I spent quite some time creating this demonstration script, so you’d better use it 😉