SQL–Only Counting Records Sufficiently Spaced apart using Analytics with Windowing Clause and Anti Join image

SQL–Only Counting Records Sufficiently Spaced apart using Analytics with Windowing Clause and Anti Join

A nice SQL challenge was presented to me by a colleague. The challenges basically consisted of this table. A table contains records that describe login events. Each record has a login timestamp and the identifier of the person logging in. The challenge is to count “unique” login events. These have been defined as unique, non-overlapping two hour periods in which a person has logged in. Such as period starts with a login that is at least two hours after the start of a previous period for that same person.

image

Visually this can be shown like this:

image

All the records marked with a star are counted. All other records are within 2 hours from a starred record and are therefore excluded because they are considered duplicate logins.

EDIT 4th March 2021: Solution

You may have read this article before – when it read a litle differently. I have made substantial changes to the article because it originally contained an incorrect solution. A solution that I did not test properly (!) and that I had not peer reviewed – before publishing it for all to see. One person who saw, read and carefully checked the article was Iudith Mentzel (see the first comment) and I owe them a great debt. They gently pointed out my flawed logic and went so far as to provide a correct solution. And not just one solution – but three.

You are now reading a rewrite of the original article – based on Iudith’s solutions for which I can take no credit (apart from having been the inspiration perhaps). I want to thank Iudith for their wonderful help and we gratefully make use of the first solution provided.

The three great solutions provided by Iudith are available in this Live SQL environment to try out.

image

The solution I like best uses the SQL MATCH_RECOGNIZE clause – not very well known but worthy of your close attention. It is a gem.

A very brief explanation of the logic in this SQL statement:

At the heart of this query is the regular expression pattern (block_start same_block*). This expression can be interpreted as “collect all rows from a starting record with as many rows as possible that satisfy same_block“; a row satisfies same_block if it is within two hours of the login_time of the block_start. The first subsequent row outside the match collection is the block_start for the next match. For each match, we return only one record – with the earliest login time for that block. The overall query returns records per personid and for each personid the first login_time of a block of two hours.

image

select personid
,      to_char(login_time,'HH24:MI') block_start_login_time
from logins
match_recognize (
  partition by personid -- regard every person independently
  order by login_time
  measures 
         match_number()     mn,
         block_start.login_time    as login_time  
  one row per match -- every block is represented by a single record 
  pattern (block_start same_block*) -- collect a log in record and all subsequent rows in same two hour block; once a row is assigned to a block it cannot start or be part of another block  
  define
       same_block as (login_time <= block_start.login_time + interval '2' hour)  -- a login row is in the same block with its subsequent rows within 2 hours
)

 

Using Recursive Subquery

A second solution provided by Iudith uses Recursive Subquery. This mechanism is also not widely used. It is the ANSI compliant and more powerful successor to Oracle proprietary CONNECT BY.

The starting rows are the first logins for every personid and specify both login_time and the end of the two hour block that starts at that login_time. This is the level 1 iteration of the recursive subquery, the root-blockstarters. The next iteration finds all rows that are outside the initial block for a person and determines their rank when ordered by login time. Sloppily put: finds the earliest login that is not in the two hour block started by the first round of root block starters.

Each subsequent iteration does the same thing: it finds the records that start two hour or more after the block starter (rn=1) returned for a personid in the previous iteration.

When all recursive iterations are done, we need to make sure that only the rn =1 records are retained – the blockstarters.

image

select personid
,      to_char(login_time,'HH24:MI')  block_start_login_time
from   ( select *
         from   logins
         model
         partition by (personid)
         dimension by (row_number() over ( partition by personid
                                           order by login_time
                                         ) as rn   
                      )
         measures ( login_time
                  , 'N' blockstarter_yn
                  )
         rules ( blockstarter_yn[any] -- assign a value to the cell count_yn for all login_times 
                 order by rn   -- go through the cells ordered by login time, starting with the earliest
                             = case
                               when cv(rn) = 1 
                               then 'Y'  -- first login by a person
                               when login_time[cv()] > -- when the login time in a cell is more than two hours later later than the last (or max) 
                                                       -- login time in an earlier cell (for a row that was identified as a block starter  
                                    max( case when blockstarter_yn = 'Y'
                                              then login_time
                                         end
                                       ) [rn < cv()] 
                                       + interval '2' hour     -- when the login_time 
                               then 'Y'
                               else 'N'
                               end 
               )
       )
where blockstarter_yn = 'Y'

The result in Live SQL with the iteration that produced the record is shown below:image

Using the Model Clause

The third solution Iudith came up with is in my eyes the most cryptic one. I am sure though that this is a very personal opinion and for some of you this is the obvious choice. The query is shown below – with highlighted the essential cell assignment. Here the value of the cell blockstarter_yn is assigned – Y for a row that is the first login for a person after at least 2 hours after a previous blockstarting login and N for a row that is not a blockstarter.

Because the value of this cell for a row depends on values assigned to its predecessors when ordered by login_time, the definition of the rn dimension and the order by rn is important.

The actual cell value is derived in the case expression, that can be read as follows:

  • if rn is 1, we are dealing with a root blockstarter – the first login record for a personid and therefore the value returned is Y
  • if the login_time is more than two hours later than the last login_time for all previous records, then is row starts a new block and the value Y should be produced
  • else: this row is no a block starter and value N is assigned to the cell

image

The model clause uses partition by (personid) to determine blockstarters per person. The measures (results) produced from the model clause are the login_time and the new cell blockstarter_yn (note that the placeholder value for empty cells is N). The outer query filters the results on blockstarter=’Y’ – for obvious reasons.

select personid
,      to_char(login_time,'HH24:MI')  block_start_login_time
from   ( select *
         from   logins
         model
         partition by (personid)
         dimension by (row_number() over ( partition by personid
                                           order by login_time
                                         ) as rn   
                      )
         measures ( login_time
                  , 'N' blockstarter_yn
                  )
         rules ( blockstarter_yn[any] -- assign a value to the cell count_yn for all login_times 
                 order by rn   -- go through the cells ordered by login time, starting with the earliest
                             = case
                               when cv(rn) = 1 
                               then 'Y'  -- first login by a person
                               when login_time[cv()] > -- when the login time in a cell is more than two hours later later than the last (or max) 
                                                       -- login time in an earlier cell (for a row that was identified as a block starter  
                                    max( case when blockstarter_yn = 'Y'
                                              then login_time
                                         end
                                       ) [rn < cv()] 
                                       + interval '2' hour     -- when the login_time 
                               then 'Y'
                               else 'N'
                               end 
               )
       )
where blockstarter_yn = 'Y'

My original, faulty approach

I followed the following logic while constructing a solution for this challenge:

  • For each login record, find the first login record of the same person in the period of two hours prior to the record; that can be the record itself, but could also be an earlier login record
  • All records that are the first in a two-hour period are “block starters”, records that start a two-hour period in which we can ignore all subsequent logins for that same person (these block starters are definitely to be included in the final result)
  • Find all records (we call them duplicate_logins) that fall into a two hour login block for a person (a two-hour period from the login time of a block starter); these are logins that we should not count
  • Select all records that are not a duplicate_login – these are records that may have a preceding login in the two hours before their login timestamp but that login is within a two hour block and they are not. Note: since the block_starters are also counted as duplicate_login, they must be added separately to the result
    EDIT: This logic is faulty: two (or more) records that are both outside the two hour range from a block starter can be within a two hour range of each other; my solution incorrectly would include both records instead of only the first of these two

Each bullet is implemented in the query with an inline view:

earliest_logins: For each login record, find the first login record of the same person in the period of two hours prior to the record; that can be the record itself, but could also be an earlier login record

image

block_starters: All records that are the first in a two-hour period are “block starters”, records that start a two-hour period in which we can ignore all subsequent logins for that same person (these block starters are definitely to be included in the final result)image

duplicate_logins: Find all records (we call them duplicate_logins) that fall into a two hour login block for a person (a two-hour period from the login time of a block starter); these are logins that we should not count

image

and finally:

Select all records that are not a duplicate_login – these are records that may have a preceding login in the two hours before their login timestamp but that login is within a two hour block and they are not. Note: since the block_starters are also counted as duplicate_login, they must be added separately to the result

EDIT: This logic is faulty: two (or more) records that are both outside the two hour range from a block starter can be within a two hour range of each other; my solution incorrectly would include both records instead of only the first of these two

image

The last step that should be added is of course the count operation (select l.personid, count(login_time) from final group by personid).

Overall query and (faulty) result (Note: Person 1, Login Time 03:00 should not have been included because of the Person 1, Login Time 02:39 result)

image

This result is for this set of records:

image

You can follow along with this challenge in the live database environment offered by LiveSQL at this link.

Resources

The three great solutions provided by Iudith are available in this Live SQL environment to try out.

Anti Join in SQL – my earlier anti search pattern exploration https://technology.amis.nl/it/anti-search-patterns-sql-to-look-for-what-is-not-there-part-one/

 

 

select personid , to_char(login_time,’HH24:MI’) block_start_login_time from logins match_recognize ( partition by personid — regard every person independently order by login_time measures match_number() mn, block_start.login_time as login_time one row per match — every block is represented by a single record pattern (block_start same_block*) — collect a log in record and all subsequent rows in same two hour block; once a row is assigned to a block it cannot start or be part of another block define same_block as (login_time <= block_start.login_time + interval ‘2’ hour) — a login row is in the same block with its subsequent rows within 2 hours )

3 Comments

  1. Iudith Mentzel March 3, 2021
    • Lucas Jellema March 4, 2021
    • Lucas Jellema March 4, 2021

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.