
Reaching Milestones – some Analytical SQL-etics

 

Last week, this weblog hit the 9M post reads mark. Since we started technology.amis.nl/blog in July 2004, it took us some 5 years to get to that point. That made me wonder when we reached earlier milestones, such as 100k, 1M and 5M reads. This led me to some SQL puzzles that might have seemed difficult before I actually got started, but turned out to be rather straightforward – thanks to analytical functions.

In this article, a brief report on how to reconstruct history using running sums and a bit of LAG/LEAD analysis.

Suppose we have a table that records the number of weblog post reads for every day since the start of the blog:

create table blog_statistics
( day_stamp   date
, number_of_reads number(4,0)
)
/

And for the sake of the argument, let’s have some sample data in this table – a little under three years’ worth:

insert into blog_statistics
( day_stamp, number_of_reads)
select sysdate - 1000 + day_seq   -- one row per day, ending (roughly) today
,      50 * dbms_random.value     -- implicitly rounded by the number(4,0) column
from   ( select level  day_seq
         from   dual
         connect
         by     level < 1000
       )
/
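
To get a feel for the generated data before hunting for milestones, a quick sanity check with a simple aggregate query could look like this:

select count(*)              number_of_days
,      min(day_stamp)        first_day
,      max(day_stamp)        last_day
,      sum(number_of_reads)  total_reads
from   blog_statistics
/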

We can easily find out on what day we first reached the 500 reads mark:

select min(day_stamp)
from   ( select day_stamp
         ,      sum(number_of_reads) over ( order by day_stamp asc) running_reads_total
         from   blog_statistics
       )
where  running_reads_total > 500
/


MIN(DAY_S
---------
06-APR-09

Using the running sum we get with sum() over (order by), we list the cumulative post read count from one day to the next; the first row whose running sum crosses the 500 mark holds the milestone date.
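
For illustration, a peek at the first few days and their running total could go along these lines (the inline view orders the rows before rownum filters them):

select *
from   ( select day_stamp
         ,      number_of_reads
         ,      sum(number_of_reads) over ( order by day_stamp asc) running_reads_total
         from   blog_statistics
         order
         by     day_stamp
       )
where  rownum <= 5
/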

Finding several milestone dates – the 1000, 5000 and 10000 marks – in a single query can be done like this:

select min(day_stamp)
,      milestones.milestone
from   ( select day_stamp
         ,      sum(number_of_reads) over ( order by day_stamp asc) running_reads_total
         from   blog_statistics
       )
,      ( select 1000 mark
         ,      1    milestone
         from   dual
         union all
         select 5000
         ,      2
         from   dual
         union all
         select 10000
         ,      3
         from   dual
       ) milestones
where  running_reads_total > milestones.mark
group
by     milestones.milestone
/

MIN(DAY_S  MILESTONE
--------- ----------
28-APR-09          1
22-SEP-09          2
26-APR-10          3
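
As an aside, the three UNION ALL branches in the milestones view can be generated more compactly with the connect by level trick that is used later in this article – a sketch using decode:

select decode(level, 1, 1000, 2, 5000, 3, 10000)  mark
,      level                                      milestone
from   dual
connect
by     level <= 3
/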

The next question I would like to have answered: when did we achieve the first 25% of the grand total of post reads we have received until today?

select min(day_stamp)
from   ( select day_stamp
         ,      sum(number_of_reads) over ( order by day_stamp asc)/ sum(number_of_reads) over () running_percentage
         from   blog_statistics
       )
where  running_percentage > 0.25


MIN(DAY_S
---------
12-JUL-09

Then of course it would be nice to see the dates on which we reached the 50%, 75% and 100% marks as well:

select min(day_stamp)
,      milestones.percentage_mark
from   ( select day_stamp
         ,      sum(number_of_reads) over ( order by day_stamp asc)/ sum(number_of_reads) over () running_percentage
         from   blog_statistics
       )
,      ( select 0.25 * level percentage_mark from dual connect by level <= 4 ) milestones
where  running_percentage >= percentage_mark
group
by     milestones.percentage_mark
order
by     milestones.percentage_mark


MIN(DAY_S PERCENTAGE_MARK
--------- ---------------
12-JUL-09             .25
14-NOV-09              .5
21-MAR-10             .75
29-JUL-10               1
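
Since the running percentage calculation returns in every query from here on, it could be factored out into a WITH clause – a matter of taste, but it keeps the subsequent queries a little shorter. The previous query in that style, as a sketch:

with running_percentages as
( select day_stamp
  ,      sum(number_of_reads) over ( order by day_stamp asc)
         / sum(number_of_reads) over ()  running_percentage
  from   blog_statistics
)
select min(day_stamp)
,      milestones.percentage_mark
from   running_percentages
,      ( select 0.25 * level percentage_mark from dual connect by level <= 4 ) milestones
where  running_percentage >= percentage_mark
group
by     milestones.percentage_mark
order
by     milestones.percentage_mark
/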

Finally, let’s see how the number of days needed to get from one percentage mark to the next evolves over time. Is that period getting shorter or longer?

Using the LAG function, it is easy to retrieve on each row the day stamp of the previous milestone. Calculating the number of days between two consecutive milestones then becomes a simple subtraction.
 

select day_stamp
,      day_stamp - lag(day_stamp) over (order by day_stamp asc)  days
from   (
select min(day_stamp) day_stamp
,      milestones.percentage_mark
from   ( select day_stamp
         ,      sum(number_of_reads) over ( order by day_stamp asc)/ sum(number_of_reads) over () running_percentage
         from   blog_statistics
       )
,      ( select 0.25 * level percentage_mark from dual connect by level <= 4 ) milestones
where  running_percentage >= percentage_mark
group
by     milestones.percentage_mark
order
by     milestones.percentage_mark
)
order
by     day_stamp
/

DAY_STAMP       DAYS
--------- ----------
12-JUL-09
14-NOV-09        125
21-MAR-10        127
29-JUL-10        130
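
LEAD is the mirror image of LAG: instead of looking back at the previous milestone, it looks ahead to the next one. The same gaps can therefore be expressed as the number of days until the next milestone – a sketch along these lines:

select day_stamp
,      lead(day_stamp) over (order by day_stamp asc) - day_stamp  days_to_next
from   ( select min(day_stamp) day_stamp
         from   ( select day_stamp
                  ,      sum(number_of_reads) over ( order by day_stamp asc)
                         / sum(number_of_reads) over ()  running_percentage
                  from   blog_statistics
                )
         ,      ( select 0.25 * level percentage_mark from dual connect by level <= 4 ) milestones
         where  running_percentage >= percentage_mark
         group
         by     milestones.percentage_mark
       )
/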

From this very limited, randomly generated data set, we seem to spot a pattern: the period needed to get from one milestone to the next appears to be slowly increasing.
 

Trying with 10 milestones:
 

select day_stamp
,      day_stamp - lag(day_stamp) over (order by day_stamp asc)  days
from   (
select min(day_stamp) day_stamp
,      milestones.percentage_mark
from   ( select day_stamp
         ,      sum(number_of_reads) over ( order by day_stamp asc)/ sum(number_of_reads) over () running_percentage
         from   blog_statistics
       )
,      ( select 0.1 * level percentage_mark from dual connect by level <= 10 ) milestones
where  running_percentage >= percentage_mark
group
by     milestones.percentage_mark
order
by     milestones.percentage_mark
)
order
by     day_stamp
/

DAY_STAMP       DAYS
--------- ----------
04-MAY-09
20-JUN-09         47
04-AUG-09         45
17-SEP-09         44
14-NOV-09         58
01-JAN-10         48
18-FEB-10         48
13-APR-10         54
04-JUN-10         52
29-JUL-10         55

It makes you wonder exactly how random the results from dbms_random are.
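
One way to get a feel for that randomness is a quick histogram of the generated daily read counts. For a uniform distribution over 0 to 50, buckets of width 10 should be roughly equally filled – a sketch:

select trunc(number_of_reads / 10) * 10  bucket
,      count(*)                          days_in_bucket
from   blog_statistics
group
by     trunc(number_of_reads / 10) * 10
order
by     bucket
/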
