As I recently said in another blog post, the World Cup Football is such a wonderful source of data that begs to be digested using SQL in many different ways. While working on a tag cloud producer – topic for a subsequent article – I ran into a special challenge. It is easy to describe, yet not so easy to resolve. Or at least, it took me some time to come up with this approach. I would like to invite any reader who has a suggestion as to how this can be resolved.
For my tag cloud regarding the matches in the World Cup tournament, I want to identify those matches that has a comeback in them. That is: matches where the winner has comeback from trailing the opponent . For example: The Netherland were 0-1 behind against Spain (after the penalty kick) and went on to win the game. So this match Spain-The Netherlands has a comeback.
I have made use of a compact notation of the way the scoring process went, using a string with 0 and 1 characters. For example: ‘011111’ describes the Spain-Netherlands match. The zero indicates a goal for the home team (Spain), each 1 represents 1 goal for the visitors (The Netherlands). ‘1000’ is the scoring process in Brazil vs Croatia – with Croatia coming into the lead before conceding the match.
To find the matches with a comeback in them, I have to find the strings that somewhere midway have more ones than zeros but in the end have more zeros than ones – or vice versa.
The raw data looks like this:
with the Spain-Netherlands match highlighted. It reads_ match number 3, group B, number 1 in the group playing at home vs the number two in the group playing away.
My approach hinges on two things:
- use of recursive subquery factoring to create records for each of the intermediate scores in each of the matches (so Spain vs Netherlands produces six rows for 1-0 or ‘0’, 1-1 or ‘01’, 1-2 or ‘011’, 1-3 or ‘0111’, 1-4 or ‘01111’ and 1-5 or ‘011111’respectively)
- use of translate and length to count the number of 1 and 0 characters in a string (note: translate(‘01011’, ‘10’,’1’) results in ‘111’ and translate(‘01011’, ‘01’,’0’) results in ‘00’ )
The first step in creating the SQL query that finds all matches with a comeback in them:
using a recursive subquery, this query retrieves all substrings – and therefore all intermediate scores – from the final score
with scoring_process (match_id, intermediate_score, iteration) as ( select id, substr(scoring_process,1,1), 1 from match_results union all select id,substr(scoring_process,1,iteration+1), iteration+1 from match_results mr join scoring_process sp on (sp.match_id = mr.id and length(scoring_process)>= iteration + 1) ) select * from scoring_process
The next step calculates the running_score: the difference between the number of home and away goals – using the translate operator that replaces every occurrence in the source string of the indicated character with its indicated replacement. When no replacement is specified for a character, it is simply not included in the resulting string. Note: the leading ‘0’ and ‘1’ are added because length(translate(‘000’,’10’, ‘1’)) results in NULL. I could have used NVL instead of adding the extra character.
with scoring_process (match_id, intermediate_score, iteration) as ( select id, substr(scoring_process,1,1), 1 from match_results union all select id,substr(scoring_process,1,iteration+1), iteration+1 from match_results mr join scoring_process sp on (sp.match_id = mr.id and length(scoring_process)>= iteration + 1) ) , match_stages as ( select sp.* , length(translate('1'||intermediate_score, '10','1')) - length(translate('0'||intermediate_score, '01','0')) running_score from scoring_process sp ) select * from match_stages
To find the matches with a true comeback in them, I need to identify the match_id’s that at one point in time had a negative running_score and at some other moment a positive running_score.
Using simple aggregating (MAX and MIN) along with HAVING I can find the match_id that have running_scores on opposite sides of zero.
with scoring_process (match_id, intermediate_score, iteration) as ( select id, substr(scoring_process,1,1), 1 from match_results union all select id,substr(scoring_process,1,iteration+1), iteration+1 from match_results mr join scoring_process sp on (sp.match_id = mr.id and length(scoring_process)>= iteration + 1) ) , match_stages as ( select sp.* , length(translate('1'||intermediate_score, '10','1')) - length(translate('0'||intermediate_score, '01','0')) running_score from scoring_process sp ) select ms.match_id from match_stages ms having max(running_score)> 0 and min(running_score) < 0 group by ms.match_id [/code] <a href="https://technology.amis.nl/wp-content/uploads/2014/06/image303.png"><img style="background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border: 0px;" title="image" src="https://technology.amis.nl/wp-content/uploads/2014/06/image_thumb312.png" alt="image" width="86" height="97" border="0" /></a> (note: I have only used the first 4 matches in Groups A and B. Surprising that just these 8 matches have three comebacks in them.) The next level of challenge would be to find ‘dramatic comebacks’ where a team was two goals behind at some stage and yet managed to win the game. Any suggestions for resolving that one? The script to create the table and the data: [code language='SQL'] CREATE TABLE "MATCH_RESULTS" ( "GROUP1" VARCHAR2(1 BYTE), "HOME" NUMBER(1,0), "AWAY" NUMBER(1,0), "HOME_GOALS" NUMBER(2,0), "AWAY_GOALS" NUMBER(2,0), "SDM_ID" NUMBER(2,0), "LOCAL_START_TIME" DATE, "SCORING_PROCESS" VARCHAR2(20 BYTE), "WEATHER_CATEGORY" VARCHAR2(20 BYTE), "ID" NUMBER(2,0) ); Insert into MATCH_RESULTS (GROUP1,HOME,AWAY,HOME_GOALS,AWAY_GOALS,SDM_ID,LOCAL_START_TIME,SCORING_PROCESS,WEATHER_CATEGORY,ID) values ('A',1,2,3,1,5,to_timestamp('12-JUN-14','DD-MON-RR HH.MI.SSXFF AM'),'1000',null,1); Insert into MATCH_RESULTS (GROUP1,HOME,AWAY,HOME_GOALS,AWAY_GOALS,SDM_ID,LOCAL_START_TIME,SCORING_PROCESS,WEATHER_CATEGORY,ID) values ('A',3,4,1,0,3,to_timestamp('13-JUN-14','DD-MON-RR HH.MI.SSXFF AM'),'0',null,2); Insert into MATCH_RESULTS (GROUP1,HOME,AWAY,HOME_GOALS,AWAY_GOALS,SDM_ID,LOCAL_START_TIME,SCORING_PROCESS,WEATHER_CATEGORY,ID) values ('A',1,3,0,0,6,to_timestamp('17-JUN-14','DD-MON-RR HH.MI.SSXFF AM'),'0',null,17); Insert into MATCH_RESULTS (GROUP1,HOME,AWAY,HOME_GOALS,AWAY_GOALS,SDM_ID,LOCAL_START_TIME,SCORING_PROCESS,WEATHER_CATEGORY,ID) values ('A',4,2,0,4,2,to_timestamp('18-JUN-14','DD-MON-RR HH.MI.SSXFF AM'),'1111',null,18); Insert into MATCH_RESULTS (GROUP1,HOME,AWAY,HOME_GOALS,AWAY_GOALS,SDM_ID,LOCAL_START_TIME,SCORING_PROCESS,WEATHER_CATEGORY,ID) values ('B',1,2,1,5,12,to_timestamp('13-JUN-14','DD-MON-RR HH.MI.SSXFF AM'),'011111',null,3); Insert into MATCH_RESULTS (GROUP1,HOME,AWAY,HOME_GOALS,AWAY_GOALS,SDM_ID,LOCAL_START_TIME,SCORING_PROCESS,WEATHER_CATEGORY,ID) values ('B',3,4,3,1,7,to_timestamp('13-JUN-14','DD-MON-RR HH.MI.SSXFF AM'),'0010',null,4); Insert into MATCH_RESULTS (GROUP1,HOME,AWAY,HOME_GOALS,AWAY_GOALS,SDM_ID,LOCAL_START_TIME,SCORING_PROCESS,WEATHER_CATEGORY,ID) values ('B',1,3,0,2,1,to_timestamp('18-JUN-14','DD-MON-RR HH.MI.SSXFF AM'),'11',null,19); Insert into MATCH_RESULTS (GROUP1,HOME,AWAY,HOME_GOALS,AWAY_GOALS,SDM_ID,LOCAL_START_TIME,SCORING_PROCESS,WEATHER_CATEGORY,ID) values ('B',4,2,2,3,10,to_timestamp('18-JUN-14','DD-MON-RR HH.MI.SSXFF AM'),'10011',null,20);
with scoring_process (match_id, intermediate_score, iteration) as
( select id, substr(scoring_process,1,1), 1
from match_results
union all
select id,substr(scoring_process,1,iteration+1), iteration+1
from match_results mr
join
scoring_process sp
on (sp.match_id = mr.id and length(scoring_process)>= iteration + 1)
),
match_stages
as
(
select
match_id,
intermediate_score,
REGEXP_COUNT(intermediate_score,'[1]’) as oppnent_cnt,
REGEXP_COUNT(intermediate_score,'[0]’) as home_cnt,
(REGEXP_COUNT(intermediate_score,'[0]’) – REGEXP_COUNT(intermediate_score,'[1]’)) as home_running_score
from scoring_process)
select
MATCH_ID,
max(HOME_CNT),
max(oppnent_cnt),
min(home_running_score)
from match_stages
group by MATCH_ID
having max(HOME_CNT) > max(oppnent_cnt) and min(home_running_score) < -1;
My first reply got cut off for some reason, so I wrote my own blog post about “dramatic comebacks”. In fact, my solution shows all the comebacks and says how far behind the winner got, so the definition of “dramatic” could be a bind variable or substitution variable. See >
http://stewashton.wordpress.com/2014/06/27/sql-to-find-world-cup-matches-with-comebacks/
The first thing I do is find out who won. If there are more ‘1’ than ‘0’, the winner is 1. If there are more ‘0’ than ‘1’, I call the winner -1. If they are equal it was a draw and I throw out that match.
select id, scoring_process,
sign(length(replace(scoring_process||’01’,’0′,”))
– length(replace(scoring_process||’01’,’1′,”)))
winner
from match_results
where sign(length(replace(scoring_process||’01’,’0′,”))
– length(replace(scoring_process||’01’,’1′,”)))
!= 0;
To get the comebacks, within scoring_process I translate ‘1’ to 1 and ‘0’ to -1 and I take a running sum. I then take the maximum of that running sum, adjusting the sign depending on who the winner was, and that tells me how far behind the winner got before coming back. If the winner was never behind, I throw out that match.
select id, scoring_process,
max(-winner*column_value) max_behind
from (
select id, scoring_process,
sign(length(replace(scoring_process||’01’,’0′,”))
– length(replace(scoring_process||’01’,’1′,”)))
winner
from match_results
where sign(length(replace(scoring_process||’01’,’0′,”))
– length(replace(scoring_process||’01’,’1′,”)))
!= 0
) mr,
table(cast(multiset(
select sum(
decode(to_number(substr(mr.scoring_process,level,1)),0,-1,1)
) over (order by level)
from dual
connect by level <= 0;
To find a comeback you can also use something like
with scoring_process ( id, scoring_process, running_total, iteration )
as ( select id
, scoring_process
, case when substr( scoring_process, 1, 1 ) = ‘1’ then 1 else -1 end
, 1
from match_results
union all
select id
, scoring_process
, running_total + case when substr( scoring_process, iteration + 1, 1 ) = ‘1’ then 1 else -1 end
, iteration + 1
from scoring_process
where substr( scoring_process, iteration + 1, 1 ) is not null
)
select id
from scoring_process
group by id
having max(running_total) > 0
and min(running_total) < 0
These query also list matches with a comeback, but which result in a draw.
If you really only want the games with don't end in a draw you can add
max( running_total ) keep ( dense_rank last order by iteration ) 0
For your dramatic comebacks you can add
and ( ( max( running_total ) keep ( dense_rank last order by iteration ) 1
)
or ( max( running_total ) keep ( dense_rank last order by iteration ) > 0
and max(running_total) < -1
)
)