Analytical Power

While working on a project for one of our clients, an interesting requirement popped up. What they wanted was the following:
“From this list of dates, we need the ID of the most recent subscription.” Here’s a sample from the list:

ID, CAT, START_DATE, END_DATE
1, 0, 29-DEC-04, 17-JUL-05
2, 0, 29-DEC-04, 17-JUL-05
3, 1, 29-DEC-04, 17-JUL-05
4, 1, 29-DEC-04, 25-OCT-05
5, 2, 29-DEC-04, 17-JUL-05
6, 2, 29-DEC-04,
7, 3, 29-DEC-04,
8, 3, 29-DEC-04, 17-JUL-05
9, 4, 29-DEC-04,
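
If you want to follow along at home, a table like the one below will do. The column data types are an assumption on my part; the author’s own demonstration script is mentioned at the end of this post.

create table date_list
( id         number
, cat        number
, start_date date
, end_date   date
);

-- the first and the sixth sample rows; the others follow the same pattern
insert into date_list values (1, 0, to_date('29-12-2004', 'DD-MM-YYYY'), to_date('17-07-2005', 'DD-MM-YYYY'));
insert into date_list values (6, 2, to_date('29-12-2004', 'DD-MM-YYYY'), null);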

“However,” they said, “these are the additional requirements. When there’s no End Date filled in, the subscription is still valid, so that is the most recent one. When there are multiple subscriptions with no End Date, or the End Dates are equal, look at the Start Dates. Comparing the Start Dates works just like comparing the End Dates: the most recent Start Date is the one we’re interested in. When they are equal, take the one with the highest ID. If a Start Date is not filled in, it’s an invalid subscription, and we’re not interested in invalid ones, so ignore it. If none of the Start Dates are filled in, take the one with the highest ID, ‘cause then it doesn’t matter which one to take. This must be done for each category (Cat), of course.”
“Of course…”

The first reaction was to tackle this procedurally, mainly because the requirement was stated in more or less pseudocode: if the end date is greater than the following end date within the same category, then…
But, following the mantra “If you can do it in SQL, do it”, this was the result:

select id
     , cat
     , start_date
     , end_date
  from date_list main_q
 where not exists (select 1
                     from date_list nest_q
                    where nest_q.cat = main_q.cat
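                       -- the nested DECODEs compute one comparable value per row pair:
                       -- end dates are compared first (a NULL end date counts as later
                       -- than any filled-in one), then start dates (a NULL start date
                       -- counts as earlier), and finally the higher ID wins a full tie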
                      and Decode (nest_q.end_date, main_q.end_date
                                 , Decode (nest_q.start_date, main_q.start_date
                                          ,Trunc (sysdate,'DDD')
                                           +
                                           Decode (nest_q.id
                                                  ,Least (nest_q.id, main_q.id)
                                                  , 0, 1)
                                          , nvl(nest_q.start_date
                                               , main_q.start_date-1)
                                           )
                                 , nvl(nest_q.end_date, main_q.end_date+1))
                        >
                        Decode (main_q.end_date, nest_q.end_date
                               , Decode (main_q.start_date, nest_q.start_date
                                        , Trunc (sysdate,'DDD')
                                          +
                                          Decode (main_q.id
                                                 ,Least (nest_q.id, main_q.id)
                                                 , 0, 1)
                                          , nvl(main_q.start_date
                                               ,nest_q.start_date-1)
                                          )
                               , nvl(main_q.end_date, nest_q.end_date+1))
                               )

Wow… that’s a lot of code to tackle this problem. Can you imagine having to maintain this query? Or changing it to meet an additional requirement? To get the correct results, every row has to be compared to the other rows within the same group (the same category), which means querying the same table again.
When the creator of this query explained it to me, I had a hard time following the solution. I was impressed and overwhelmed by the code he had created.

Then suddenly it dawned on me: Analytical Functions… Because of the requirement (“This must be done for each category”) the date_list had to be partitioned by Cat.
The first part of the query was done. It had to be

partition by cat

Another requirement was the priority of the records within each category: first check the End_Date, then the Start_Date, and lastly the ID. The order by within each partition is dictated by this requirement:

order by end_date desc nulls first, start_date desc  nulls last, id desc 
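
To check that this ordering really matches the rules, it helps to sort the sample rows of a single category by hand. This little check query is my own illustration, not part of the original demonstration script:

select id
     , start_date
     , end_date
  from date_list
 where cat = 3
 order by end_date desc nulls first
        , start_date desc nulls last
        , id desc

For category 3, row 7 (no End_Date, so still valid) comes out before row 8 (ended 17-JUL-05), exactly as the requirement demands.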

More parts of the query were falling into place. With the analytical part done, I could assemble the following query:

select id
     , cat
     , start_date
     , end_date
  from (select id
             , cat
             , start_date
             , end_date
             , row_number() over (partition by cat
                                      order by end_date desc nulls first
                                             , start_date desc nulls last
                                             , id desc
                                 ) rn
          from date_list
        )
 where rn = 1

Being very pleased with myself, and having checked that the results of both queries matched, I did some timed tests on both of them. With a mere 60 records in the table, both queries initially showed a similar execution time.
What about the statistics? There is a dramatic difference.
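
The figures below look like SQL*Plus autotrace output. If you want to reproduce them, something along these lines should do; the exact settings used for this article are an assumption on my part:

set timing on
set autotrace traceonly statistics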

Original Query:

Elapsed: 00:00:00.00
Statistics
---------------------------------------------------
          0  recursive calls
         12  db block gets
        127  consistent gets
          0  physical reads
          0  redo size
       1717  bytes sent via SQL*Net to client
       1976  bytes received via SQL*Net from client
          4  SQL*Net roundtrips to/from client
          2  sorts (memory)
          0  sorts (disk)
         30  rows processed

Analytical Function:

Elapsed: 00:00:00.00
Statistics
---------------------------------------------------
          0  recursive calls
         12  db block gets
          5  consistent gets
          0  physical reads
          0  redo size
       1717  bytes sent via SQL*Net to client
        802  bytes received via SQL*Net from client
          4  SQL*Net roundtrips to/from client
          3  sorts (memory)
          0  sorts (disk)
         30  rows processed

Encouraged by this result, I also did some timed tests with more records, duplicating the contents of the date_list table into itself until the record count reached 7680.
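The duplication itself can be done with a simple insert-as-select, run repeatedly until the table is big enough. This is only a sketch of the idea; the original demonstration script may have done it differently:

-- doubles the number of rows each time it is run
insert into date_list select * from date_list;
commit;

With 7680 records, I got these results: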

Original Query:

Elapsed: 00:00:03.03
Statistics
---------------------------------------------------
          0  recursive calls
          4  db block gets
      75938  consistent gets
          0  physical reads
          0  redo size
       1838  bytes sent via SQL*Net to client
       2066  bytes received via SQL*Net from client
          4  SQL*Net roundtrips to/from client
          2  sorts (memory)
          0  sorts (disk)
         30  rows processed

Analytical Function:

Elapsed: 00:00:00.00
Statistics
---------------------------------------------------
          0  recursive calls
          4  db block gets
         25  consistent gets
          0  physical reads
          0  redo size
       1838  bytes sent via SQL*Net to client
        802  bytes received via SQL*Net from client
          4  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
         30  rows processed

Increasing the record count to 1.9 million gave these results:

Original Query:
… I don’t know, it never gave any results…

Analytical Function:

Elapsed: 00:01:34.07
Statistics
---------------------------------------------------
          0  recursive calls
         97  db block gets
       6020  consistent gets
      34942  physical reads
          0  redo size
       1898  bytes sent via SQL*Net to client
        794  bytes received via SQL*Net from client
          4  SQL*Net roundtrips to/from client
          2  sorts (memory)
          1  sorts (disk)
         30  rows processed

For the initial tests I had omitted the creation of an index on the CAT column. With such an index in place, the differences in statistics are even more dramatic than the results shown: with 1.9 million records, the Analytical Function returned in approximately the same time, only with more consistent gets, while for the Original Query… well, I’m still waiting.
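
Creating that index is a one-liner; the index name below is my own invention:

create index date_list_cat_i on date_list (cat);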

To quote Tom Kyte: “Analytics Rock and Roll”

Acknowledgement
Thanks to Frans van Deursen from Atos Origin, who came up with the very creative original SQL.

I spent quite some time creating this demonstration script, so you’d better use it 😉
