SQL–Only Counting Records Sufficiently Spaced apart using Analytics with Windowing Clause and Anti Join

Lucas Jellema

A nice SQL challenge was presented to me by a colleague. The challenge basically consisted of this: a table contains records that describe login events. Each record has a login timestamp and the identifier of the person logging in. The challenge is to count “unique” login events. These have been defined as unique, non-overlapping two-hour periods in which a person has logged in. Such a period starts with a login that is at least two hours after the start of a previous period for that same person.
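The queries in this article assume a LOGINS table with (at least) a PERSONID and a LOGIN_TIME column. A minimal sketch of such a table, loaded with the sample timestamps discussed in this article and in the comments below (the DATE datatype is an assumption; the real table may have more columns):

```sql
-- Hypothetical minimal LOGINS table matching the queries in this article
create table logins
( personid   number
, login_time date
);

-- Sample data: the 12 login events used throughout this article
insert into logins values (1, to_date('2021-01-01 00:00','YYYY-MM-DD HH24:MI'));
insert into logins values (1, to_date('2021-01-01 01:00','YYYY-MM-DD HH24:MI'));
insert into logins values (1, to_date('2021-01-01 01:59','YYYY-MM-DD HH24:MI'));
insert into logins values (1, to_date('2021-01-01 02:00','YYYY-MM-DD HH24:MI'));
insert into logins values (1, to_date('2021-01-01 02:39','YYYY-MM-DD HH24:MI'));
insert into logins values (1, to_date('2021-01-01 03:00','YYYY-MM-DD HH24:MI'));
insert into logins values (1, to_date('2021-01-01 04:59','YYYY-MM-DD HH24:MI'));
insert into logins values (2, to_date('2021-01-01 01:01','YYYY-MM-DD HH24:MI'));
insert into logins values (2, to_date('2021-01-01 01:30','YYYY-MM-DD HH24:MI'));
insert into logins values (2, to_date('2021-01-01 02:00','YYYY-MM-DD HH24:MI'));
insert into logins values (2, to_date('2021-01-01 05:00','YYYY-MM-DD HH24:MI'));
insert into logins values (2, to_date('2021-01-01 06:00','YYYY-MM-DD HH24:MI'));
```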


Visually this can be shown like this:

image

All the records marked with a star are counted. All other records are within 2 hours from a starred record and are therefore excluded because they are considered duplicate logins.

EDIT 4th March 2021: Solution

You may have read this article before – when it read a little differently. I have made substantial changes to the article because it originally contained an incorrect solution – a solution that I did not test properly (!) and that I had not peer reviewed before publishing it for all to see. One person who saw, read and carefully checked the article was Iudith Mentzel (see the first comment) and I owe them a great debt. They gently pointed out my flawed logic and went so far as to provide a correct solution. And not just one solution – but three.

You are now reading a rewrite of the original article – based on Iudith’s solutions for which I can take no credit (apart from having been the inspiration perhaps). I want to thank Iudith for their wonderful help and we gratefully make use of the first solution provided.

The three great solutions provided by Iudith are available in this Live SQL environment to try out.


The solution I like best uses the SQL MATCH_RECOGNIZE clause – not very well known but worthy of your close attention. It is a gem.

A very brief explanation of the logic in this SQL statement:

At the heart of this query is the regular expression pattern (block_start same_block*). This expression can be interpreted as “collect a starting record together with as many subsequent rows as possible that satisfy same_block”; a row satisfies same_block if it is within two hours of the login_time of the block_start. The first subsequent row outside the match is the block_start for the next match. For each match, we return only one record – with the earliest login time for that block. The overall query returns records per personid: for each personid the first login_time of every two-hour block.


select personid
,      to_char(login_time,'HH24:MI') block_start_login_time
from logins
match_recognize (
  partition by personid -- regard every person independently
  order by login_time
  measures 
         match_number()     mn,
         block_start.login_time    as login_time  
  one row per match -- every block is represented by a single record 
  pattern (block_start same_block*) -- collect a login record and all subsequent rows in the same two hour block; once a row is assigned to a block it cannot start or be part of another block
  define
       same_block as (login_time <= block_start.login_time + interval '2' hour)  -- a login row is in the same block when it is within 2 hours of the block start
)
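Against the sample data set, this query returns one row per two-hour block – five in total, the same result set Iudith reports in the comments below. Note that 02:00 is not among them: with the two-hour limit treated as inclusive, 02:00 still falls in the block started at 00:00, while 02:39 starts a new one.

```
PERSONID  BLOCK_START_LOGIN_TIME
--------  ----------------------
       1  00:00
       1  02:39
       1  04:59
       2  01:01
       2  05:00
```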

 

Using Recursive Subquery

A second solution provided by Iudith uses a recursive subquery. This mechanism is also not widely used. It is the ANSI-compliant and more powerful successor to Oracle's proprietary CONNECT BY.

The starting rows are the first logins for every personid; they specify both the login_time and the end of the two-hour block that starts at that login_time. This is the level 1 iteration of the recursive subquery: the root blockstarters. The next iteration finds all rows that are outside the initial block for a person and determines their rank when ordered by login time. Sloppily put: it finds the earliest login that is not in the two-hour block started by the first round of root blockstarters.

Each subsequent iteration does the same thing: it finds the records that start two hours or more after the block starter (rn = 1) returned for a personid in the previous iteration.

When all recursive iterations are done, we need to make sure that only the rn = 1 records are retained – the blockstarters.

image

with recurs (personid, lvl, login_time, last_time, rn) as
( select personid
  ,      1 lvl                               -- iteration level
  ,      min(login_time)                     -- first login per person: a root blockstarter
  ,      min(login_time) + interval '2' hour -- end of the two hour block it starts
  ,      1
  from   logins
  group by personid
  union all
  select r.personid
  ,      r.lvl + 1
  ,      l.login_time
  ,      l.login_time + interval '2' hour
  ,      row_number() over ( partition by r.personid
                             order by l.login_time
                           ) rn
  from   recurs r
  ,      logins l
  where  l.personid   = r.personid
  and    l.login_time > r.last_time -- only logins after the current two hour block
  and    r.rn = 1                   -- iterate only from the blockstarter of the previous round
)
select personid
,      to_char(login_time,'HH24:MI')  block_start_login_time
from   recurs
where  rn = 1 -- retain only the blockstarters
order by personid
,        login_time

The result in Live SQL also shows the iteration (lvl) that produced each record.

Using the Model Clause

The third solution Iudith came up with is in my eyes the most cryptic one. I am sure though that this is a very personal opinion and for some of you this is the obvious choice. The query is shown below; its essential part is the cell assignment. Here the value of the cell blockstarter_yn is assigned: Y for a row that is the first login for a person at least 2 hours after a previous block-starting login, and N for a row that is not a blockstarter.

Because the value of this cell for a row depends on values assigned to its predecessors when ordered by login_time, the definition of the rn dimension and the order by rn is important.

The actual cell value is derived in the case expression, which can be read as follows:

  • if rn is 1, we are dealing with a root blockstarter – the first login record for a personid – and therefore the value returned is Y
  • if the login_time is more than two hours later than the latest login_time of any previous block-starting record, then this row starts a new block and the value Y is produced
  • else: this row is not a block starter and the value N is assigned to the cell


The model clause uses partition by (personid) to determine blockstarters per person. The measures (results) produced from the model clause are the login_time and the new cell blockstarter_yn (note that the initial value for the cells is N). The outer query filters the results on blockstarter_yn = 'Y' – for obvious reasons.

select personid
,      to_char(login_time,'HH24:MI')  block_start_login_time
from   ( select *
         from   logins
         model
         partition by (personid)
         dimension by (row_number() over ( partition by personid
                                           order by login_time
                                         ) as rn
                      )
         measures ( login_time
                  , 'N' blockstarter_yn
                  )
         rules ( blockstarter_yn[any] -- assign a value to the cell blockstarter_yn for all login_times
                 order by rn   -- go through the cells ordered by login time, starting with the earliest
                             = case
                               when cv(rn) = 1
                               then 'Y'  -- first login by a person
                               when login_time[cv()] > -- when the login time in a cell is more than two hours later than the last (or max)
                                                       -- login time in an earlier cell for a row that was identified as a block starter
                                    max( case when blockstarter_yn = 'Y'
                                              then login_time
                                         end
                                       ) [rn < cv()]
                                       + interval '2' hour
                               then 'Y'
                               else 'N'
                               end
               )
       )
where blockstarter_yn = 'Y'

My original, faulty approach

This is the logic I followed while constructing a solution for this challenge:

  • For each login record, find the first login record of the same person in the period of two hours prior to the record; that can be the record itself, but could also be an earlier login record
  • All records that are the first in a two-hour period are “block starters”, records that start a two-hour period in which we can ignore all subsequent logins for that same person (these block starters are definitely to be included in the final result)
  • Find all records (we call them duplicate_logins) that fall into a two hour login block for a person (a two-hour period from the login time of a block starter); these are logins that we should not count
  • Select all records that are not a duplicate_login – these records may have a preceding login in the two hours before their login timestamp, but that login is itself inside a two-hour block while they are not. Note: since the block_starters are also counted as duplicate_logins, they must be added separately to the result
    EDIT: This logic is faulty: two (or more) records that are both outside the two hour range from a block starter can be within a two hour range of each other; my solution incorrectly would include both records instead of only the first of these two

Each bullet is implemented in the query with an inline view:

earliest_logins: For each login record, find the first login record of the same person in the period of two hours prior to the record; that can be the record itself, but could also be an earlier login record


block_starters: All records that are the first in a two-hour period are “block starters”, records that start a two-hour period in which we can ignore all subsequent logins for that same person (these block starters are definitely to be included in the final result)

duplicate_logins: Find all records (we call them duplicate_logins) that fall into a two hour login block for a person (a two-hour period from the login time of a block starter); these are logins that we should not count


and finally:

Select all records that are not a duplicate_login – these records may have a preceding login in the two hours before their login timestamp, but that login is itself inside a two-hour block while they are not. Note: since the block_starters are also counted as duplicate_logins, they must be added separately to the result

EDIT: This logic is faulty: two (or more) records that are both outside the two hour range from a block starter can be within a two hour range of each other; my solution incorrectly would include both records instead of only the first of these two


The last step that should be added is of course the count operation (select personid, count(login_time) from final group by personid).
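The same counting step applies to the corrected solutions. A sketch, wrapping the MATCH_RECOGNIZE query from earlier in this article in an inline view (unique_login_count is an invented column alias):

```sql
select personid
,      count(*) unique_login_count
from   ( select personid
         ,      login_time
         from   logins
         match_recognize (
           partition by personid
           order by login_time
           measures block_start.login_time as login_time
           one row per match
           pattern (block_start same_block*)
           define
             same_block as (login_time <= block_start.login_time + interval '2' hour)
         )
       )
group by personid
order by personid
-- for the sample data: person 1 has 3 unique logins, person 2 has 2
```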

Overall query and (faulty) result (Note: Person 1, Login Time 03:00 should not have been included, because of the Person 1, Login Time 02:39 result)
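Pieced together from the four bullets above, the overall faulty query may have looked roughly like this. This is a hypothetical reconstruction: the inline view names (earliest_logins, block_starters, duplicate_logins) come from the article, all other identifiers are my assumptions.

```sql
-- Hypothetical reconstruction of the faulty query; do not use (see the EDIT notes)
with earliest_logins as
( select l.personid
  ,      l.login_time
  ,      min(l.login_time) over ( partition by l.personid
                                  order by l.login_time
                                  range between interval '2' hour preceding
                                            and current row
                                ) earliest_login
  from   logins l
)
, block_starters as
( select personid, login_time
  from   earliest_logins
  where  login_time = earliest_login -- first login in its own two hour window
)
, duplicate_logins as
( select l.personid, l.login_time
  from   logins l
         join block_starters b
         on   l.personid = b.personid
         and  l.login_time >= b.login_time
         and  l.login_time <= b.login_time + interval '2' hour
)
-- anti join: keep logins that are not duplicates, then add the block starters back
select l.personid, l.login_time
from   logins l
where  not exists ( select 1
                    from   duplicate_logins d
                    where  d.personid   = l.personid
                    and    d.login_time = l.login_time )
union
select personid, login_time
from   block_starters
```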

image

This result is for this set of records:

image

You can follow along with this challenge in the live database environment offered by LiveSQL at this link.

Resources

The three great solutions provided by Iudith are available in this Live SQL environment to try out.

Anti Join in SQL – my earlier anti search pattern exploration https://technology.amis.nl/it/anti-search-patterns-sql-to-look-for-what-is-not-there-part-one/

 

 


3 thoughts on “SQL–Only Counting Records Sufficiently Spaced apart using Analytics with Windowing Clause and Anti Join”

  1. Hello Lucas,

    As by my understanding of the original challenge description, I think that this solution does have a problem.

    Specifically, for your sample data set, the rows to be counted should be as follows:

    PERSONID LOGIN_TIME COUNT_YN
    ——————————————-
    1 2021-01-01 00:00 Y
    1 2021-01-01 01:00
    1 2021-01-01 01:59
    1 2021-01-01 02:00
    1 2021-01-01 02:39 Y
    1 2021-01-01 03:00
    1 2021-01-01 04:59 Y

    2 2021-01-01 01:01 Y
    2 2021-01-01 01:30
    2 2021-01-01 02:00
    2 2021-01-01 05:00 Y
    2 2021-01-01 06:00

    12 rows selected.

    In other words, for personid = 1 the solution should return 3 rows only, and not 4.

    Depending on whether the 2 hours interval limit is inclusive or not, the solution should include either the row with time 02:00
    ( if the 2 hour limit is exclusive ) or the row with time 02:39 (if the 2 hour limit is inclusive),
    but, anyway, it cannot contain the row with time 03:00, which is less than 2 hours after the previously included row.

    The problem is that, for personid=1, the “block_starters” subquery only returns the first row (time 00:00),
    while in fact there are 3 such rows.

    It looks to me that the problem cannot be solved without using one form or another of a “procedural” approach.

    Here are 3 solutions, using 3 different approaches ( considering the 2 hours interval as being inclusive ):

    ——————————————
    a. Using a recursive WITH query
    ——————————————

    with recurs (personid, lvl, login_time, last_time, rn) as (
    select personid,
    1 lvl,
    min(login_time),
    min(login_time) + interval '2' hour,
    1
    from logins
    group by personid
    union all
    select r.personid,
    r.lvl + 1,
    l.login_time,
    l.login_time + interval '2' hour,
    row_number() over(partition by r.personid
    order by l.login_time) rn
    from recurs r,
    logins l
    where l.personid = r.personid
    and l.login_time > r.last_time
    and r.rn = 1
    )

    select personid, login_time
    from recurs
    where rn = 1
    order by personid, login_time
    /

    PERSONID LOGIN_TIME
    ——————————–
    1 2021-01-01 00:00
    1 2021-01-01 02:39
    1 2021-01-01 04:59
    2 2021-01-01 01:01
    2 2021-01-01 05:00

    5 rows selected.

    ——————————————
    b. Using MATCH_RECOGNIZE
    ——————————————

    select personid, login_time
    from logins
    match_recognize (
    partition by personid
    order by login_time
    measures
    match_number() mn,
    strt.login_time as login_time
    one row per match
    pattern (strt same*)
    define
    same as (login_time <= strt.login_time + interval '2' hour)
    )
    /

    PERSONID LOGIN_TIME
    ——————————–
    1 2021-01-01 00:00
    1 2021-01-01 02:39
    1 2021-01-01 04:59
    2 2021-01-01 01:01
    2 2021-01-01 05:00

    5 rows selected.

    ——————————————
    c. Using the MODEL clause
    ——————————————

    select personid, login_time
    from (
    select *
    from logins
    model
    partition by (personid)
    dimension by (row_number() over (partition by personid
    order by login_time) as rn)
    measures (
    login_time,
    'N' count_yn
    )
    rules (
    count_yn[any] order by rn =
    case
    when cv(rn) = 1 then 'Y'
    when max(case when count_yn = 'Y'
    then login_time
    end) [rn < cv()] + interval '2' hour < login_time[cv()] then 'Y'
    else 'N'
    end
    )
    )
    where count_yn = 'Y'
    /

    PERSONID LOGIN_TIME
    ——————————–
    1 2021-01-01 00:00
    1 2021-01-01 02:39
    1 2021-01-01 04:59
    2 2021-01-01 01:01
    2 2021-01-01 05:00

    5 rows selected.

    Thanks a lot for your attention & Best Regards,

    Iudith Mentzel

    1. Dear Iudith,

      This is both wonderful and horrible.

      The horrible part is that I suffered from hubris. I was overjoyed that I had found a solution for my colleague’s problem. I did not thoroughly test it. Heck, I did not even seriously check the result for a 9 record test set against my own stated specification. And in that mood of pride and joy, I published my faulty work in a blog article for the whole world to see. The joke is on me.

      The wonderful part of course is you – and the fact that you did a thorough peer review of my code. And tore it to pieces, quite deservedly so. I want to thank you for your diligent work in finding and sharing the three solutions you arrived at. They are elegant – especially the one with MATCH_RECOGNIZE, I really see the beauty in that one. So that is the solution I have sent this morning to my colleague – along with a humble apology regarding my triumphant mail of yesterday.

      I am still curious to find out which of your solutions will provide the best performance. Will that be MATCH_RECOGNIZE? I hope so; as I said, that one appeals most to me.

      Thank you for pointing the right way. Thank you for being gentle in your rebuke – if it could even be called that. I certainly deserved worse. It is a little amazing that without soliciting a solution (can you help me with my homework…) the world wide community (you in particular) handed me this elegant solution.

      Thank you very much! I will update the article and credit you for the right approach(es).

      kind regards,
      Lucas

    2. Hi Iudith,

      I have completely rewritten the article (and the Live SQL script) based on your great solutions. Once again, thank you very much for your insights and the elegance of the SQL statements. Trying to explain them to the reader felt a bit like providing the analysis to someone else’s chess match. I can explain the moves – even if I could not come up with them myself.

      kind regards,
      Lucas
