Not all has to be black and white in SQL Queries: returning match scores instead of only perfect fits

0

Recently I sat in on a very interesting presentation on an advanced search and matching engine called Elise. The power of this engine lies in the fact that it does not just query records, returning the set of records that satisfy the search criteria – although it has some pretty advanced search filters and operators, but is able to score all records according to the match criteria. Every record gets a score – which can be zero, meaning that not a single search criteria matched even the smallest way. Higher scores mean better matches on more, or on the most important, criteria.

Many search operations do not necessarily need perfect matches. They need possibly multiple results that are all matches to a certain degree. Google for example returns web pages that meet your search criteria to a certain degree. The most relevant matches are at the top of the search results – or at least that is the intention. Searching for interesting job vacancies, holiday destinations, new real estate, electronic devices, dating partners on a e-dating site, books on Oracle technology are other examples of searches that look for best matches given criteria, not necessarily perfects fits for all criteria (usually they do not exist).

In this article, I will demonstrate a first attempt at building SQL queries that determine match scores that allow applications to list records in order of relevancy (based on the score). The user can specify the search criteria, for example a certain range around numerical and date values or a list of desirable values for a particular attribute and the level of desirability for each value, as well as the relative weight for each of the search criteria.....

Let’s say we are looking for employees who earn approximately 1500 and started in 1982 or thereabouts. Well, to be a little bit more specific:

  • the salary should be as close as possible to 1500 and be at least between 1000 and 2000
  • the hiredate should ideally be in 1982; however, 1983 is somewhat acceptable (half the score for 1982)
  • we value a match on salary higher than one on hiredate (twice as important)
  • we want the top 5 scoring Employees

(our underlying database as you might have guessed is good old SCOTT.EMP)

Our first approach to writing queries could be something like:

select * <br />from   emp<br />where  salary = 1500 <br />AND    extract (year from hiredate) = 1982<br />

which is nowhere near what we want (this query may well return zero records).

A ‘normal’ SQL alternative could look like:

select * <br />from   emp<br />where salary between 1000 and 2000<br />AND   extract (year from hiredate)<br />      between 1981 and 1983<br /><br />

which I think is better – but does not make any distinction between fair, good and near perfect matches. So can we make that distinction? How do we rate perfect, good and fair matches?

I first try to find a SQL way of determining a score for an employee record based on the criteria for salary. It could be something like:

select case<br />       when sal &lt; 1500<br />       then case<br />            when sal &gt; 1000<br />            then (sal - 1000)/(1500 - 1000)<br />            else 0<br />            end<br />       when sal &lt; 2000<br />       then (sal - 2000)/(1500 - 2000)<br />       else 0<br />       end  score<br />from   emp <br />

Here the CASE expression determines the score. The awarding of scores can be visualized with a picture like this one:

As long a the salary is between 1000 and 2000 – you will get some points in the score for this criteria. Top score is awarded to a salary of exactly 1500, scores on either side drop from 1 to 0 as the salary moves away from 1500.

For the hiredate, we can some something similar. If the year is 1982, you get top score. If it is 1983, you get half of that. If it is beyond that, you get nought. The SQL could be:

select case extract (year from hiredate)<br />       when 1982 <br />       then 1<br />       when 1983 <br />       then 0.5<br />       end  score <br />from   emp <br />

Note that even though hiredate is a date, through the use of EXTRACT(Year from Hiredate) we have effectively reduced it to a purely numerical value.

We now have expressions that will give us scores on the two criteria. The question now becomes how to combine these two queries in producing an aggregated result – taking the relative weight factors for the criteria in to consideration as well. A simple SELECT statement can do the trick. Here we use LEFT OUTER JOIN to cater for the fact that some employees are not returned by the per-criteria query blocks, yet should be in the overall result:

select ename<br />,      sal<br />,      hiredate<br />,        2 * nvl( emp_sal_around_1500.score, 0)<br />       + nvl(emp_hiredate_1982.score, 0)  score<br />from   emp e<br />       left outer join<br />       ( select case<br />                when sal &lt; 1500<br />                then case<br />                     when sal &gt; 1000<br />                     then (sal - 1000)/(1500 - 1000)<br />                     else 0<br />                     end<br />                when sal &lt; 2000<br />                then (sal - 2000)/(1500 - 2000)<br />                else 0<br />                end  score<br />         from   emp<br />         where  job &lt;&gt; 'PRESIDENT'  -- filter because of in promptu demand that anyone in the role of President <br />                                    -- should not score points on salary (this justifies the left outer join)<br />       ) emp_sal_around_1500<br />       on (e.rowid = emp_sal_around_1500.rowid)<br />       left outer join<br />       ( select case extract (year from hiredate)<br />                when 1982 <br />                then 1<br />                when 1983 <br />                then 0.5<br />                end  score <br />         from   emp<br />         where  hiredate is not null<br />       ) emp_hiredate_1982<br />       on (e.rowid = emp_hiredate_1982.rowid)<br />&nbsp;

This query returns all employees and adds for each one the scores contributed by each criteria.

We’re almost done. We still have the ‘ we only want the top 5′ demand. The simplest solution is to use this query as an inline view that we select from – the "select from (select…) where… " construction. We have to make sure that the employees are ordered by their scores – descending. From this set we then select the first 5 rows:

select *<br />from   ( select ename<br />         ,      sal<br />         ,      hiredate<br />         ,        2 * nvl( emp_sal_around_1500.score, 0)<br />                + nvl(emp_hiredate_1982.score, 0)  score<br />         from   emp e<br />                left outer join<br />                ( select case<br />                         when sal &lt; 1500<br />                         then case<br />                              when sal &gt; 1000<br />                     then (sal - 1000)/(1500 - 1000)<br />                     else 0<br />                     end<br />                when sal &lt; 2000<br />                then (sal - 2000)/(1500 - 2000)<br />                else 0<br />                end  score<br />                  from   emp<br />                ) emp_sal_around_1500<br />                on (e.rowid = emp_sal_around_1500.rowid)<br />                left outer join<br />                ( select case extract (year from hiredate)<br />                         when 1982 <br />                         then 1<br />                         when 1983 <br />                         then 0.5<br />                         end  score <br />                  from
emp<br />                ) emp_
hiredate_1982<br />                on (e.rowid = emp_hiredate_1982.rowid)<br />       order<br />       by     score desc<br />       )<br />where rownum &lt; 6<br />&nbsp;

When we run this query – the results are "revealing":

In a second article that follows on this one, I will look into an obvious question that such results raise: whence the score of 2.2 for MILLER? Why has TURNER a score of 2? Did they score on all criteria or just on one? What will be the effect of changing the relative weights of the criteria?

For now we will focus on a way to make this query a little easier to write and read, by recognizing the fact that we have a ‘calculate score on match criteria’  pattern on our hands. A numerical attribute is compared with a target value (the ideal bull’s eye, top score) as well as a range around that ultimate goal. As we saw for Salary: 1500 was our wet dream; however, the range from 1000 to 2000 was attractive enough to also add to the score. We used the CASE expression to calculate the score. For reasons of performance, that is probably still the optimal solution. However, creating a small reusable component in the form of a PL/SQL function that implements this ‘pattern’ definitely helps our development productivity and the readability of the code.

The PL/SQL function:

create or replace function number_range_score<br />( value in number<br />, target in number<br />, low_end in number default null -- pass in target if there is no scoring range below the target value; null means any value lower than target has a full score<br />, high_end in number default null -- pas in target if there is no scoring range over the target value; null means any value higher than target has a full score<br />) return number<br /> is<br />begin<br />  return<br />  case<br />  when value = target<br />  then 1<br />  when value &lt; target<br />  then case<br />       when low_end is null<br />       then 1<br />       when value &gt; low_end<br />       then (value-low_end)/(target - low_end )<br />       else 0<br />       end<br />  when value &gt; target<br />  then case<br />       when high_end is null<br />       then 1<br />       when value &gt; high_end<br />       then 0<br />       else (value - high_end)/(target - high_end)<br />       end<br />  else 0<br />  end  ;<br />end number_range_score;<br />&nbsp;

And the query now becomes:

select *<br />from   ( select ename<br />         ,      sal<br />         ,      hiredate<br />         ,        2 * number_range_score(sal, 1500, 1000, 2000)<br />                + number_range_score(extract (year from hiredate), 1982, 1982, 1983)  score<br />         from   emp e<br />         order<br />         by     score desc<br />       )<br />where rownum &lt; 6<br /><br />

It is not hard to see which query is easier to write and to read.

Let’s see how easy it is to add a search criteria to the whole fuzzy business: With a relative weight of 1.5 – more important than the hiredate, less important than the salary criteria – we add a criteria regarding the job of the ideal employees. We would prefer CLERKS but can value ANALYSTS (60%) and SALESMEN (40%) higher than employees in other jobs.

A quick SQL expression to calculate the score for this criteria:

case job when 'CLERK' then 1 when 'ANALYST' then 0.6 when 'SALESMAN' then 0.4 end <br />

And adding that to the query is easy enough:

select *<br />from   ( select ename<br />         ,      sal<br />         ,      hiredate<br />         ,        2 * number_range_score(sal, 1500, 1000, 2000)<br />                + number_range_score(extract (year from hiredate), 1982, 1982, 1983)  <br />                + 1.5 * case job <br />                        when 'CLERK' then 1 <br />                        when 'ANALYST' then 0.6 <br />                        when 'SALESMAN' then 0.4 <br />                        else 0 end<br />                score<br />         from   emp e<br />         order<br />         by     score desc<br />       )<br />where rownum &lt; 6<br /><br />

You will be pleased to know that now MILLER’s lead is further confirmed – 3.7 – with TURNER and ALLEN running second and third (2.6 and 2.2 respectively).

Next steps along the path of fuzzy searches include "score explanations", hierarchical data dependencies (when I look for LONDON, UK should score some points but BELGIUM should not; when I look for NETHERLANDS, AMSTERDAM scores full marks), additional patterns and integration of the fuzzy search queries in applications – for example ADF.

Share.

About Author

Lucas Jellema, active in IT (and with Oracle) since 1994. Oracle ACE Director for Fusion Middleware. Consultant, trainer and instructor on diverse areas including Oracle Database (SQL & PLSQL), Service Oriented Architecture, BPM, ADF, Java in various shapes and forms and many other things. Author of the Oracle Press book: Oracle SOA Suite 11g Handbook. Frequent presenter on conferences such as JavaOne, Oracle OpenWorld, ODTUG Kaleidoscope, Devoxx and OBUG. Presenter for Oracle University Celebrity specials.

Comments are closed.