Analyzing Match Scores in ‘fuzzy searches’ – explaining the shades of gray in a non-binary world

In a previous article – Not all has to be black and white in SQL Queries – , I discuss how not all searches are about perfect matches. The search for your lifelong partner, or your next job are fine examples of searches where a perfect match is hard to obtain. No single person or no single job will score one 100% on all your criteria. Unfortunately, such binary searches is exactly what SQL is good at. WHERE ALL CRITERIA ARE MET is the filter SQL applies to the records inspected. Of course you can use OR operators to leave some room for non-perfect matches and instead of using the equality (=) you can go for between or LIKE. But a structured approach to ‘fuzzy’ searches where the shades of gray most searches inevitably end up in is not readily available.

The prequel to this articles takes a few first steps in illustrating how scores against different criteria can be collected, how different weight factors for search criteria can be taken into account and how the first two search-match-patterns (number range, discrete values) started to emerge. One major question we ended that article with: how can we explain when the search is performed and the match scores for each record is known, how the score was arrived at. What criteria did the record meet – to some extent- and which ones did it fluke completely.

This article demonstrates a slightly different approach to the fuzzy search challenge and shows a way to not only calculate the scores but also construct a detailed analysis of the composition of the score.

I will use a collection of object types report the individual scores of a record against every criteria. The object type is defined like this:

create type query_match_score_t as object
( score number(20,10)
, weight number(5,2)
, criteria varchar2(2000)
)

and the query_match_score_table_t type is the collection type of instances of this type

create type query_match_score_table_t as
table of query_match_score_t

The score against a criteria is created like this as an object type

select empno
,      query_match_score_t( number_range_score(sal, 1500, 1000, 2000), 2
                          , 'Salary must be as close to 1500 as possible, and at least between 1000 and 2000') score
from   emp

Analyzing Match Scores in 'fuzzy searches' - explaining the shades of gray in a non-binary world fuzzyquery004

Multiple criteria can be combined now in a aggregation, using the COLLECT operator to assemble query_match_score_table_t collections per Employee and a simple SUM to calculate the weighted overall match score for each employee. The query looks like this:

select empno
,      sum(score * weight) weighted_score
,      cast(collect(query_match_score_t( score, weight,criteria)) as query_match_score_table_t) criteria
from   ( select empno
         ,      number_range_score(sal, 1500, 1000, 2000) score
         ,      2 weight
         ,      'Salary must be as close to 1500 as possible, and at least between 1000 and 2000'
         from   emp
         union all
         select empno
         ,      number_range_score(extract (year from hiredate), 1982, 1982, 1983) score
         ,      1 weight
         ,      'Employee is ideally hired in 1982; a hiredate in 1983 is also somewhat valuable'
         from   emp
       )
group
by     empno
order
by     weighted_score desc                                                                                    

Instead of simply adding the scores one by one, we now use separate query blocks for every criteria – that are then unioned together. These are simple to write, can have their own where clause and additional logic – such as joins to other tables – and are easy to add or remove. Note that adding a criteria only requires the addition of a UNION ALL + query. The aggregation – COLLECT and SUM – will simply take additional results into account, they do not have to be changed for that.

The result of running this query looks a little confusing at first

     EMPNO WEIGHTED_SCORE
---------- --------------
CRITERIA(SCORE, WEIGHT, CRITERIA)
--------------------------------------------------------------------------------
      7934            2.2
QUERY_MATCH_SCORE_TABLE_T(QUERY_MATCH_SCORE_T(0, 0, 'base'), QUERY_MATCH_SCORE_T
(1, 1, 'Employee is ideally hired in 1982; a hiredate in 1983 is also somewhat v
aluable'), QUERY_MATCH_SCORE_T(.6, 2, 'Salary must be as close to 1500 as possib
le, and at least between 1000 and 2000'))

      7844              2
QUERY_MATCH_SCORE_TABLE_T(QUERY_MATCH_SCORE_T(0, 0, 'base'), QUERY_MATCH_SCORE_T
(0, 1, 'Employee is ideally hired in 1982; a hiredate in 1983 is also somewhat v
aluable'), QUERY_MATCH_SCORE_T(1, 2, 'Salary must be as close to 1500 as possibl
e, and at least between 1000 and 2000'))

      7499            1.6
QUERY_MATCH_SCORE_TABLE_T(QUERY_MATCH_SCORE_T(0, 0, 'base'), QUERY_MATCH_SCORE_T
(.8, 2, 'Salary must be as close to 1500 as possible, and at least between 1000
and 2000'), QUERY_MATCH_SCORE_T(0, 1, 'Employee is ideally hired in 1982; a hire
date in 1983 is also somewhat valuable'))
 ....

There is no top 5 yet and the layout is not pretty. Note that the latter is not our priority – we are looking to find a way to make the fuzzy search available to applications that want to integrate in their User Interface. However, we can prettify the layout a little bit:

select empno
,      (select ename||' - '||job||' - '||hiredate||' - '||sal from emp where empno = scores.empno) details
,      weighted_score
,      ( select cast(collect( substr(criteria,1,105)||': '||score||' ('||weight||')') as string_table)
         from table(criteria)
         where score > 0) criteria
from   ( select empno
         ,      sum(score * weight) weighted_score
         ,      cast(collect(query_match_score_t( score, weight,criteria)) as query_match_score_table_t) criteria
         from   (
                 select empno
                 ,      0 score
                 ,      0 weight
                 ,      'base' criteria
                 from   emp e
                 union all
                 select empno
                 ,      number_range_score(sal, 1500, 1000, 2000) score
                 ,      2 weight
                 ,      'Salary must be as close to 1500 as possible, and at least between 1000 and 2000'  criteria
                 from   emp
                 union all
                 select empno
                 ,      number_range_score(extract (year from hiredate), 1982, 1982, 1983) score
                 ,      1 weight
                 ,      'Employee is ideally hired in 1982; a hiredate in 1983 is also somewhat valuable'
                 from   emp
                )
         group
         by     empno
         order
         by     weighted_score desc
       ) scores
where  rownum < 6

We have made use of a very simple COLLECTION type called string_table:

create or replace type string_table as table of varchar2(4000)

The result of running this query is a little bit nicer. There are only 5 records, properly ordered. And they neatly explain why they have scored as they have:

     EMPNO
----------
DETAILS
--------------------------------------------------------------------------------
WEIGHTED_SCORE
--------------
CRITERIA
--------------------------------------------------------------------------------
      7934
MILLER - CLERK - 23-JAN-82 - 1300
           2.2
STRING_TABLE('Employee is ideally hired in 1982; a hiredate in 1983 is also some
what valuable: 1 (1)', 'Salary must be as close to 1500 as possible, and at leas
t between 1000 and 2000: .6 (2)')

      7844
TURNER - SALESMAN - 08-SEP-81 - 1500
             2
STRING_TABLE('Salary must be as close to 1500 as possible, and at least between
1000 and 2000: 1 (2)')

      7499
ALLEN - SALESMAN - 20-FEB-81 - 1600
           1.6
STRING_TABLE('Salary must be as close to 1500 as possible, and at least between
1000 and 2000: .8 (2)')
 ....

We can learn from these results that MILLER (a CLERK, hired in 1982 and making 1300 in whatever currency EMP is defined with) scores 1 for the hiredate criteria (with a weight of 1) and earns an additional 0.6 points for the salary criteria (with weight 2). This means an overall score of 1*1+2*0.6 = 2.2. In comparison, number 2 – TURNER – only scores on salary (exactly 1500) but not on Hiredate.

This also tells us that if we change the relative weights of the search criteria, the scores may start to shift quite a bit.

Adding a Criteria

I would like to show how easy it is with this query structure to add an additional fuzzy criteria. Borrowing from the previous article: With a relative weight of 1.5 – more important than the hiredate, less important than the salary criteria – we add a criteria regarding the job of the ideal employees. We would prefer CLERKS but can value ANALYSTS (60%) and SALESMEN (40%) higher than employees in other jobs.

The stand-alone query that gives us scores for this criteria looks like this:

select empno
,      case job when 'CLERK' then 1 when 'ANALYST' then 0.6 when 'SALESMAN' then 0.4 end score
,      1.5 weight
,      'We prefer CLERKs; ANALYST scores 60% and SALESMAN is 40%'
from   emp

When integrated into the entrie query, it becomes:

select empno
,      (select ename||' - '||job||' - '||hiredate||' - '||sal from emp where empno = scores.empno) details
,      weighted_score
,      ( select cast(collect( criteria||': '||score||' ('||weight||')') as string_table)
         from table(criteria)
         where score > 0) criteria
from   ( select empno
         ,      sum(score * weight) weighted_score
         ,      cast(collect(query_match_score_t( score, weight,criteria)) as query_match_score_table_t) criteria
         from   (
                 select empno
                 ,      0 score
                 ,      0 weight
                 ,      'base' criteria
                 from   emp e
                 union all
                 select empno
                 ,      number_range_score(sal, 1500, 1000, 2000) score
                 ,      2 weight
                 ,      'Salary must be as close to 1500 as possible, and at least between 1000 and 2000'  criteria
                 from   emp
                 union all
                 select empno
                 ,      number_range_score(extract (year from hiredate), 1982, 1982, 1983) score
                 ,      1 weight
                 ,      'Employee is ideally hired in 1982; a hiredate in 1983 is also somewhat valuable'
                 from   emp
                 union all
                 select empno
                 ,      case job when 'CLERK' then 1 when 'ANALYST' then 0.6 when 'SALESMAN' then 0.4 end  score
                 ,      1.5 weight
                 ,      'We prefer CLERKs; ANALYST scores 60% and SALESMAN is 40%'
                 from   emp
                )
         group
         by     empno
         order
         by     weighted_score desc
       ) scores
where  rownum < 6

And the results are like this:

     EMPNO DETAILS
--------------------------------------------------------------------------------
WEIGHTED_SCORE
--------------
CRITERIA
--------------------------------------------------------------------------------
      7934 MILLER - CLERK - 23-JAN-82 - 1300
           3.7
STRING_TABLE('We prefer CLERKs; ANALYST scores 60% and SALESMAN is 40%: 1 (1.5)'
, 'Employee is ideally hired in 1982; a hiredate in 1983 is also somewhat valuab
le: 1 (1)', 'Salary must be as close to 1500 as possible, and at least between 1
000 and 2000: .6 (2)')

      7844 TURNER - SALESMAN - 08-SEP-81 - 1500
           2.6
STRING_TABLE('We prefer CLERKs; ANALYST scores 60% and SALESMAN is 40%: .4 (1.5)
', 'Salary must be as close to 1500 as possible, and at least between 1000 and 2000: 1 (2)')

      7499 ALLEN - SALESMAN - 20-FEB-81 - 1600
           2.2
STRING_TABLE('Salary must be as close to 1500 as possible, and at least between 1000 and 2000: .8 (2)', 'We prefer CLERKs; ANALYST scores 60% and SALESMAN is 40%: .4 (1.5)')

Well, can we make this more dynamic? Store the search criteria in a table and iterate over them when we calculate the scores per record? So that we do not have to change the query at all? Isn’t this something Oracle 10g R1 Expression Filter could come in handy for? And we would still like additional match score patterns (sounds like, date range, taxonomy, geographical proximity range).

 

2 Comments

  1. Lucas Jellema September 21, 2008
  2. Edwin Biemond September 21, 2008