Last week, this weblog hit the 9M post reads mark. Since we started technology.amis.nl/blog in July 2004, it took us some five years to get to that point. And I started to wonder when we reached earlier milestones, like 100k, 1M and 5M. This led me to some SQL puzzles that might have seemed difficult before I actually set to work, but that turned out to be rather straightforward – thanks to analytical functions.
In this article, a brief report on how to reconstruct history using running sums and a bit of LAG/LEAD analysis.
Suppose we have a table that records the number of weblog post reads for every day since the start of the blog:
create table blog_statistics
( day_stamp        date
, number_of_reads  number(4,0)
)
/
And for the sake of argument, let’s load a little under three years’ worth of sample data into this table:
insert into blog_statistics
( day_stamp, number_of_reads )
select sysdate - 1000 + day_seq
,      50 * dbms_random.value
from   ( select level day_seq
         from   dual
         connect by level < 1000
       )
/
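A quick sanity check on the generated data never hurts. This little query (not part of the original set of statements, and its output will obviously vary with sysdate and the random values) reports the row count and the date range we just created:

select count(*)       number_of_days
,      min(day_stamp) first_day
,      max(day_stamp) last_day
from   blog_statistics
/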
We can find out easily on what day we first reached the 500 mark:
select min(day_stamp)
from   ( select day_stamp
         ,      sum(number_of_reads) over (order by day_stamp asc) running_reads_total
         from   blog_statistics
       )
where  running_reads_total > 500
/

MIN(DAY_S
---------
06-APR-09
Using the running sum we get with sum() over (order by), we list the total post read count from one day to the next; the first row whose running sum crosses the 500 mark gives us the milestone date.
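To see what that running sum actually looks like before it is filtered, we can list it next to the daily counts; in this sketch, each row’s running_reads_total equals the previous row’s total plus the current day’s number_of_reads:

select day_stamp
,      number_of_reads
,      sum(number_of_reads) over (order by day_stamp asc) running_reads_total
from   blog_statistics
order by day_stamp
/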
Finding several milestone dates in a single pass – the 1000, 5000 and 10000 marks – can be done with this query:
select min(day_stamp)
,      milestones.milestone
from   ( select day_stamp
         ,      sum(number_of_reads) over (order by day_stamp asc) running_reads_total
         from   blog_statistics
       )
,      ( select 1000 mark , 1 milestone from dual union all
         select 5000 , 2 from dual union all
         select 10000 , 3 from dual
       ) milestones
where  running_reads_total > milestones.mark
group by milestones.milestone
/

MIN(DAY_S  MILESTONE
--------- ----------
28-APR-09          1
22-SEP-09          2
26-APR-10          3
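As an aside, the comma-separated join above can also be written in ANSI join syntax; a sketch of the equivalent query (the inline view alias totals is mine):

select min(day_stamp)
,      milestones.milestone
from   ( select day_stamp
         ,      sum(number_of_reads) over (order by day_stamp asc) running_reads_total
         from   blog_statistics
       ) totals
       join
       ( select 1000 mark , 1 milestone from dual union all
         select 5000 , 2 from dual union all
         select 10000 , 3 from dual
       ) milestones
       on ( totals.running_reads_total > milestones.mark )
group by milestones.milestone
order by milestones.milestone
/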
The next question I would like to have answered is: when did we achieve the first 25% of the grand total of post reads we have received until today?
select min(day_stamp)
from   ( select day_stamp
         ,      sum(number_of_reads) over (order by day_stamp asc)
                / sum(number_of_reads) over () running_percentage
         from   blog_statistics
       )
where  running_percentage > 0.25
/

MIN(DAY_S
---------
12-JUL-09
Then of course it would be nice to see the dates on which we reached the 50% and 75% marks as well:
select min(day_stamp)
,      milestones.percentage_mark
from   ( select day_stamp
         ,      sum(number_of_reads) over (order by day_stamp asc)
                / sum(number_of_reads) over () running_percentage
         from   blog_statistics
       )
,      ( select 0.25 * level percentage_mark
         from   dual
         connect by level <= 4
       ) milestones
where  running_percentage >= percentage_mark
group by milestones.percentage_mark
order by milestones.percentage_mark
/

MIN(DAY_S PERCENTAGE_MARK
--------- ---------------
12-JUL-09             .25
14-NOV-09              .5
21-MAR-10             .75
29-JUL-10               1
Finally, let’s see how the number of days needed to get from one percentage mark to the next evolves over time. Is that period shortening or increasing?
Using the LAG function, it is easy to find the day stamp of the previous milestone, and therefore to calculate the number of days between two consecutive milestones.
select day_stamp
,      day_stamp - lag(day_stamp) over (order by day_stamp asc) days
from   ( select min(day_stamp) day_stamp
         ,      milestones.percentage_mark
         from   ( select day_stamp
                  ,      sum(number_of_reads) over (order by day_stamp asc)
                         / sum(number_of_reads) over () running_percentage
                  from   blog_statistics
                )
         ,      ( select 0.25 * level percentage_mark
                  from   dual
                  connect by level <= 4
                ) milestones
         where  running_percentage >= percentage_mark
         group by milestones.percentage_mark
         order by milestones.percentage_mark
       )
/

DAY_STAMP       DAYS
--------- ----------
12-JUL-09
14-NOV-09        125
21-MAR-10        127
29-JUL-10        130
From this very limited, randomly generated data, we seem to spot a pattern: the period needed to get from one milestone to the next is increasing.
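The LAG-based query looks backward; LEAD, its forward-looking counterpart, could answer the mirror-image question: from each milestone, how many days until the next one is reached. A sketch, reusing the inline views from above:

select day_stamp
,      percentage_mark
,      lead(day_stamp) over (order by day_stamp asc) - day_stamp days_to_next
from   ( select min(day_stamp) day_stamp
         ,      milestones.percentage_mark
         from   ( select day_stamp
                  ,      sum(number_of_reads) over (order by day_stamp asc)
                         / sum(number_of_reads) over () running_percentage
                  from   blog_statistics
                )
         ,      ( select 0.25 * level percentage_mark
                  from   dual
                  connect by level <= 4
                ) milestones
         where  running_percentage >= percentage_mark
         group by milestones.percentage_mark
       )
order by day_stamp
/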
Trying with 10 milestones:
select day_stamp
,      day_stamp - lag(day_stamp) over (order by day_stamp asc) days
from   ( select min(day_stamp) day_stamp
         ,      milestones.percentage_mark
         from   ( select day_stamp
                  ,      sum(number_of_reads) over (order by day_stamp asc)
                         / sum(number_of_reads) over () running_percentage
                  from   blog_statistics
                )
         ,      ( select 0.1 * level percentage_mark
                  from   dual
                  connect by level <= 10
                ) milestones
         where  running_percentage >= percentage_mark
         group by milestones.percentage_mark
         order by milestones.percentage_mark
       )
/

DAY_STAMP       DAYS
--------- ----------
04-MAY-09
20-JUN-09         47
04-AUG-09         45
17-SEP-09         44
14-NOV-09         58
01-JAN-10         48
18-FEB-10         48
13-APR-10         54
04-JUN-10         52
29-JUL-10         55
It makes you wonder how random the results from dbms_random are exactly.
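For what it’s worth, dbms_random can be seeded to make experiments like this one repeatable; a minimal sketch (the seed value 42 is an arbitrary choice):

begin
  dbms_random.seed(42); -- a fixed seed makes subsequent dbms_random calls reproducible
end;
/

Re-running the insert statement after seeding then produces the same ‘random’ reads on every run.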
Hi Lucas,
You pasted the wrong insert statement: the query results suggest the table contains 499 rows, while the insert statement says 999.
And congratulations on reaching the 9M mark 🙂
Regards,
Rob.

Well spotted! I did change the 500 to 1000 in the insert statement when I was preparing the blog article, after I had created and run the SQL. It does not change the meaningful content of the article one bit, however. And thanks for the congrats.
Lucas