Routeplanners have always been somewhat magical in my eyes. Finding the quickest route from A to B in a graph of thousands of points and connections. I have tried to imagine how to approach such as problem. I have even come up with several solutions that I think might help to implement a performant system for unveiling the fasted itinerary between two points in the graph.
The other day, while preparing for the Quiz on SQL and PL/SQL that Alex and I will do during ODTUG 2006 in two weeks time, I got hooked again on this same challenge. This time, the challenge translated to: write a SQL query that will find the quickest path between two points.
In this article I will describe what solution I came up with this time, using the Oracle Hierarchical Query and some of its more recent operators. I will demonstrate a query that will give us the quickest route between any two points in the graph shown below:
The starting situation is the graph depicted above. This is the graphical representation of the points and connections. The same information is held in two tables: POINTS and CONNECTIONS.
Their create scripts are like this:
create table points ( id number(5) , name varchar2(30) ) / create table connections ( from_id number(5) , to_id number(5) , distance number(5) ) /
They contain the data for the graph shown above, with five points and a total of six connections. Their contents are shown below:
The question I would like to answer with a SQL query is: what is the shortest route – in terms of distance – from A to D. Of course with this simple graph we can easily tell ourselves by merely glancing at the graph. It is from A to C to D, a total distance of 7. But can a query give me the same answer?
I create a helper-view to simplify the query we will write later on. This view joins together points (source), connections and points (destination), so that it will be easier to write the hierarchical query. Note: we could also have used an in-line view with the same query:
create or replace view point_connections as select from_point.name from_point , to_point.name to_point , conn.distance distance from points from_point , connections conn , points to_point where conn.from_id = from_point.id and conn.to_id = to_point.id /
This view contains all connections. Note: I have assumed a directed graph where a connection from A to B does not necessarily mean there is also a connection from B to A. It would be simple to include those reverse connections as well, for example by extending the view definition with a UNION ALL and the same query with the from_id and to_id switched in the WHERE clause.
The query I intend to write will treat the graph as a tree. It will perform a CONNECT BY query, starting at Point A and using the point_connection to connect by prior to_point = from_point:
select level number_of_transfers , sys_connect_by_path(distance,'+') route_cost , CONNECT_BY_ROOT from_point||sys_connect_by_path(to_point,'/') route from point_connections pc connect by nocycle prior to_point = from_point start with from_point = 'A' -- starting point of the trip /
This query will return all routes starting at A , including the ones that do not end at D. Note that NOCYCLE was added to prevent routes from circling endlessly through the graph. Suppose that A connects to C connects to B connects back to A – a very real possibility. In that case, without the NOCYCLE, we would have blown up the query. Since in this query all routes have to start at A, the expression CONNECT_BY_ROOT from_point is a little overdone, as it will always return 'A'. However, to make the query as generic as possible, I have used CONNECT_BY_ROOT from_point to return for each record the value of from_point in the first step of the route.
For each route it will return an indication of the total distance, through the route_cost column that contains an expression seemingly adding together the distances for all connections travelled in the returned route:
NUMBER_OF_TRANSFERS ROUTE_COST ROUTE ------------------- --------------- --------- 1 +3 A/B 2 +3+5 A/B/D 2 +3+2 A/B/E 3 +3+2+1 A/B/E/D 1 +6 A/C 2 +6+1 A/C/D 2 +6+2 A/C/E 3 +6+2+1 A/C/E/D 8 rows selected.
We need to filter the above query result on only the routes that end at D. Subsequently, we need to order them by route_cost, returning the shortest route first. The latter is achieved using a simple PL/SQL function that calculates the distance based on the route_cose expression, simply by performing a little piece of Dynamic SQL. Note: this trick of turning the value from a SYS_CONNNECT_BY_PATH expression into a calculated value is described in Mastering Oracle PL/SQL: Practical Solutions by Connor McDonald, Chaim Katz, Christopher Beck, Joel R. Kallman, David C. Knox.
The PL/SQL function used for calculating the ROUTE_COST is like this:
create or replace function evaluate_string ( p_string in varchar2 ) return number is l_result number; begin execute immediate 'select '||p_string||' from dual' into l_result; return l_result; end evaluate_string; /
The p_string that is passed into this function is assumed to contain a valid SQL expression. This expression is pasted into a simple query against dual. The result is returned. Usage for example:
select evaluate_string('5*4') from dual / EVALUATE_STRING('5*4') ---------------------- 20
Which is quite silly as it is the exact same thing as
select 5*4 from dual /
The value of this little function is apparent when the query returns an expression we cannot predict. For example:
select ename , sal ||case job when 'SALESMAN' then '+' when 'CLERK' then '*' when 'PRESIDENT' then '/' else '-' end ||((sysdate - hiredate)/25) new_sal from emp /
To know the values for new_sal, we need to dynamically calculate those values from the new_sal expressions, like this:
select ename , evaluate_string(new_sal) new_salary from ( select ename , sal ||case job when 'SALESMAN' then '+' when 'CLERK' then '*' when 'PRESIDENT' then '/' else '-' end ||((sysdate - hiredate)/25) new_sal from emp ) /
Note: this function is wide open to SQL Injection. Be sure to not allow other programs or users to make use of the function or add a test on the value of p_string – for example p_string may not contain the word FROM. Without such measures, someone might abuse the function with calls like:
declare l_result number; begin l_result:= evaluate_string('(select sal from emp where ename=''KING'')'); end; /
thereby getting at our most secret data!
We can now rewrite the query to include filter and order by clauses, finally getting what we want: the quickest route from A to D:
select graph.* , evaluate_string(graph.route_cost) distance from ( select level number_of_transfers , sys_connect_by_path(distance,'+') route_cost , CONNECT_BY_ROOT from_point||sys_connect_by_path(to_point,'/') route from point_connections pc connect by nocycle prior to_point = from_point start with from_point = 'A' -- starting point of the trip ) graph where regexp_like(route,'D$') -- D is the final destination order by distance /
So clearly the quickest way to get from A to D is through C with a total distance of 7.
Eating a little more of the pudding
To prove our approach a little further, let's build a road from E to D. Our graph(ic) now looks like this:
The data in the table CONNECTIONS is changed like this:
If we execute our query again – note: the exact same query as we are interested in finding an answer to the exact same question – we get a new answer. And already this answer is not entirely trivial. What would you think is the quickest way from A to D, just by looking at the picture of the graph? The answer:
So going from A to B to E to D results in the quicket route.
I haven't got a clue as to the performance of this query with larger numbers of points and connections. I would expect we should do anything we can to only start routes at meaningful points and try to stop investigating a route as soon as we know it goes nowhere. Well, that is for later.