SQL Challenge: Find World Cup Football matches with a comeback

4

As I recently said in another blog post, the World Cup Football is such a wonderful source of data that begs to be digested using SQL in many different ways. While working on a tag cloud producer – topic for a subsequent article – I ran into a special challenge. It is easy to describe, yet not so easy to resolve. Or at least, it took me some time to come up with this approach. I would like to invite any reader who has a suggestion as to how this can be resolved.

For my tag cloud regarding the matches in the World Cup tournament, I want to identify those matches that has a comeback in them. That is: matches where the winner has comeback from trailing the opponent . For example: The Netherland were 0-1 behind against Spain (after the penalty kick) and went on to win the game. So this match Spain-The Netherlands has a comeback.

I have made use of a compact notation of the way the scoring process went,  using a string with 0 and 1 characters. For example: ‘011111’ describes the Spain-Netherlands match. The zero indicates a goal for the home team (Spain), each 1 represents 1 goal for the visitors (The Netherlands). ‘1000’ is the scoring process in Brazil vs Croatia – with Croatia coming into the lead before conceding the match.

To find the matches with a comeback in them, I have to find the strings that somewhere midway have more ones than zeros but in the end have more zeros than ones – or vice versa.

The raw data looks like this:

image

with the Spain-Netherlands match highlighted. It reads_ match number 3, group B, number 1 in the group playing at home vs the number two in the group playing away.

My approach hinges on two things:

  • use of recursive subquery factoring to create records for each of the intermediate scores in each of the matches (so Spain vs Netherlands produces six rows for 1-0 or ‘0’, 1-1 or ‘01’, 1-2 or ‘011’, 1-3 or ‘0111’, 1-4 or ‘01111’ and 1-5 or ‘011111’respectively)
  • use of translate and length to count the number of 1 and 0 characters in a string (note: translate(‘01011’, ‘10’,’1’) results in ‘111’ and translate(‘01011’, ‘01’,’0’) results in ‘00’ )

The first step in creating the SQL query that finds all matches with a comeback in them:

using a recursive subquery, this query retrieves all substrings – and therefore all intermediate scores – from the final score

with scoring_process (match_id, intermediate_score, iteration) as
( select id, substr(scoring_process,1,1), 1
  from   match_results
  union all
  select id,substr(scoring_process,1,iteration+1), iteration+1
  from   match_results mr
         join
         scoring_process sp
         on (sp.match_id = mr.id and length(scoring_process)>= iteration + 1)
)
select *
from   scoring_process

image

The next step calculates the running_score: the difference between the number of home and away goals – using the translate operator that replaces every occurrence in the source string of the indicated character with its indicated replacement. When no replacement is specified for a character, it is simply not included in the resulting string. Note: the leading ‘0’ and ‘1’ are added because length(translate(‘000’,’10’, ‘1’)) results in NULL. I could have used NVL instead of adding the extra character.

with scoring_process (match_id, intermediate_score, iteration) as
( select id, substr(scoring_process,1,1), 1
  from   match_results
  union all
  select id,substr(scoring_process,1,iteration+1), iteration+1
  from   match_results mr
         join
         scoring_process sp
         on (sp.match_id = mr.id and length(scoring_process)>= iteration + 1)
)
, match_stages as
( select sp.*
  ,      length(translate('1'||intermediate_score, '10','1')) - length(translate('0'||intermediate_score, '01','0')) running_score
  from   scoring_process sp
)         
select *
from   match_stages

image

To find the matches with a true comeback in them, I need to identify the match_id’s that at one point in time had a negative running_score and at some other moment a positive running_score.

Using simple aggregating (MAX and MIN) along with HAVING I can find the match_id that have running_scores on opposite sides of zero.

with scoring_process (match_id, intermediate_score, iteration) as
( select id, substr(scoring_process,1,1), 1
  from   match_results
  union all
  select id,substr(scoring_process,1,iteration+1), iteration+1
  from   match_results mr
         join
         scoring_process sp
         on (sp.match_id = mr.id and length(scoring_process)>= iteration + 1)
)
, match_stages as
( select sp.*
  ,      length(translate('1'||intermediate_score, '10','1')) - length(translate('0'||intermediate_score, '01','0')) running_score
  from   scoring_process sp
)         
select ms.match_id
from   match_stages ms
having max(running_score)> 0 and min(running_score) < 0
group
by     ms.match_id
&#91;/code&#93;

<a href="https://technology.amis.nl/wp-content/uploads/2014/06/image303.png"><img style="background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border: 0px;" title="image" src="https://technology.amis.nl/wp-content/uploads/2014/06/image_thumb312.png" alt="image" width="86" height="97" border="0" /></a>

(note: I have only used the first 4 matches in Groups A and B. Surprising that just these 8 matches have three comebacks in them.)

The next level of challenge would be to find ‘dramatic comebacks’ where a team was two goals behind at some stage and yet managed to win the game. Any suggestions for resolving that one?

The script to create the table and the data:



CREATE TABLE "MATCH_RESULTS"
   (    "GROUP1" VARCHAR2(1 BYTE),
    "HOME" NUMBER(1,0),
    "AWAY" NUMBER(1,0),
    "HOME_GOALS" NUMBER(2,0),
    "AWAY_GOALS" NUMBER(2,0),
    "SDM_ID" NUMBER(2,0),
    "LOCAL_START_TIME" DATE,
    "SCORING_PROCESS" VARCHAR2(20 BYTE),
    "WEATHER_CATEGORY" VARCHAR2(20 BYTE),
    "ID" NUMBER(2,0)
   );

Insert into MATCH_RESULTS (GROUP1,HOME,AWAY,HOME_GOALS,AWAY_GOALS,SDM_ID,LOCAL_START_TIME,SCORING_PROCESS,WEATHER_CATEGORY,ID) values ('A',1,2,3,1,5,to_timestamp('12-JUN-14','DD-MON-RR HH.MI.SSXFF AM'),'1000',null,1);
Insert into MATCH_RESULTS (GROUP1,HOME,AWAY,HOME_GOALS,AWAY_GOALS,SDM_ID,LOCAL_START_TIME,SCORING_PROCESS,WEATHER_CATEGORY,ID) values ('A',3,4,1,0,3,to_timestamp('13-JUN-14','DD-MON-RR HH.MI.SSXFF AM'),'0',null,2);
Insert into MATCH_RESULTS (GROUP1,HOME,AWAY,HOME_GOALS,AWAY_GOALS,SDM_ID,LOCAL_START_TIME,SCORING_PROCESS,WEATHER_CATEGORY,ID) values ('A',1,3,0,0,6,to_timestamp('17-JUN-14','DD-MON-RR HH.MI.SSXFF AM'),'0',null,17);
Insert into MATCH_RESULTS (GROUP1,HOME,AWAY,HOME_GOALS,AWAY_GOALS,SDM_ID,LOCAL_START_TIME,SCORING_PROCESS,WEATHER_CATEGORY,ID) values ('A',4,2,0,4,2,to_timestamp('18-JUN-14','DD-MON-RR HH.MI.SSXFF AM'),'1111',null,18);
Insert into MATCH_RESULTS (GROUP1,HOME,AWAY,HOME_GOALS,AWAY_GOALS,SDM_ID,LOCAL_START_TIME,SCORING_PROCESS,WEATHER_CATEGORY,ID) values ('B',1,2,1,5,12,to_timestamp('13-JUN-14','DD-MON-RR HH.MI.SSXFF AM'),'011111',null,3);
Insert into MATCH_RESULTS (GROUP1,HOME,AWAY,HOME_GOALS,AWAY_GOALS,SDM_ID,LOCAL_START_TIME,SCORING_PROCESS,WEATHER_CATEGORY,ID) values ('B',3,4,3,1,7,to_timestamp('13-JUN-14','DD-MON-RR HH.MI.SSXFF AM'),'0010',null,4);
Insert into MATCH_RESULTS (GROUP1,HOME,AWAY,HOME_GOALS,AWAY_GOALS,SDM_ID,LOCAL_START_TIME,SCORING_PROCESS,WEATHER_CATEGORY,ID) values ('B',1,3,0,2,1,to_timestamp('18-JUN-14','DD-MON-RR HH.MI.SSXFF AM'),'11',null,19);
Insert into MATCH_RESULTS (GROUP1,HOME,AWAY,HOME_GOALS,AWAY_GOALS,SDM_ID,LOCAL_START_TIME,SCORING_PROCESS,WEATHER_CATEGORY,ID) values ('B',4,2,2,3,10,to_timestamp('18-JUN-14','DD-MON-RR HH.MI.SSXFF AM'),'10011',null,20);

About Author

Lucas Jellema, active in IT (and with Oracle) since 1994. Oracle ACE Director and Oracle Developer Champion. Solution architect and developer on diverse areas including SQL, JavaScript, Kubernetes & Docker, Machine Learning, Java, SOA and microservices, events in various shapes and forms and many other things. Author of the Oracle Press book Oracle SOA Suite 12c Handbook. Frequent presenter on user groups and community events and conferences such as JavaOne, Oracle Code, CodeOne, NLJUG JFall and Oracle OpenWorld.

4 Comments

  1. Vinod Gowda on

    with scoring_process (match_id, intermediate_score, iteration) as
    ( select id, substr(scoring_process,1,1), 1
    from match_results
    union all
    select id,substr(scoring_process,1,iteration+1), iteration+1
    from match_results mr
    join
    scoring_process sp
    on (sp.match_id = mr.id and length(scoring_process)>= iteration + 1)
    ),
    match_stages
    as
    (
    select
    match_id,
    intermediate_score,
    REGEXP_COUNT(intermediate_score,'[1]’) as oppnent_cnt,
    REGEXP_COUNT(intermediate_score,'[0]’) as home_cnt,
    (REGEXP_COUNT(intermediate_score,'[0]’) – REGEXP_COUNT(intermediate_score,'[1]’)) as home_running_score
    from scoring_process)
    select
    MATCH_ID,
    max(HOME_CNT),
    max(oppnent_cnt),
    min(home_running_score)
    from match_stages
    group by MATCH_ID
    having max(HOME_CNT) > max(oppnent_cnt) and min(home_running_score) < -1;

  2. The first thing I do is find out who won. If there are more ‘1’ than ‘0’, the winner is 1. If there are more ‘0’ than ‘1’, I call the winner -1. If they are equal it was a draw and I throw out that match.

    select id, scoring_process,
    sign(length(replace(scoring_process||’01’,’0′,”))
    – length(replace(scoring_process||’01’,’1′,”)))
    winner
    from match_results
    where sign(length(replace(scoring_process||’01’,’0′,”))
    – length(replace(scoring_process||’01’,’1′,”)))
    != 0;

    To get the comebacks, within scoring_process I translate ‘1’ to 1 and ‘0’ to -1 and I take a running sum. I then take the maximum of that running sum, adjusting the sign depending on who the winner was, and that tells me how far behind the winner got before coming back. If the winner was never behind, I throw out that match.

    select id, scoring_process,
    max(-winner*column_value) max_behind
    from (
    select id, scoring_process,
    sign(length(replace(scoring_process||’01’,’0′,”))
    – length(replace(scoring_process||’01’,’1′,”)))
    winner
    from match_results
    where sign(length(replace(scoring_process||’01’,’0′,”))
    – length(replace(scoring_process||’01’,’1′,”)))
    != 0
    ) mr,
    table(cast(multiset(
    select sum(
    decode(to_number(substr(mr.scoring_process,level,1)),0,-1,1)
    ) over (order by level)
    from dual
    connect by level <= 0;

  3. Anton Scheffer on

    To find a comeback you can also use something like
    with scoring_process ( id, scoring_process, running_total, iteration )
    as ( select id
    , scoring_process
    , case when substr( scoring_process, 1, 1 ) = ‘1’ then 1 else -1 end
    , 1
    from match_results
    union all
    select id
    , scoring_process
    , running_total + case when substr( scoring_process, iteration + 1, 1 ) = ‘1’ then 1 else -1 end
    , iteration + 1
    from scoring_process
    where substr( scoring_process, iteration + 1, 1 ) is not null
    )
    select id
    from scoring_process
    group by id
    having max(running_total) > 0
    and min(running_total) < 0

    These query also list matches with a comeback, but which result in a draw.
    If you really only want the games with don't end in a draw you can add
    max( running_total ) keep ( dense_rank last order by iteration ) 0

    For your dramatic comebacks you can add
    and ( ( max( running_total ) keep ( dense_rank last order by iteration ) 1
    )
    or ( max( running_total ) keep ( dense_rank last order by iteration ) > 0
    and max(running_total) < -1
    )
    )