Find the number of descendant nodes in a SQL query – how many employees work (indirectly) for me? And what do they earn on average?
Every once in a while you run into a SQL challenge that seems extremely simple at first glance and turns out to be not so very simple when you look a bit deeper. Hierarchical queries are not part of my every day diet, but they occur quite regularly. So I feel comfortable with the Oracle CONNECT BY syntax and the 9i and 10g extensions like SYS_CONNECT_BY_PATH, CONNECT_BY_ROOT, CONNECT_BY_IS_LEAF, CONNECT_BY_ISCYCLE and NOCYCLE (see the article New CONNECT BY Features in Oracle Database 10g by Jonathan Gennick for more details).
The challenge I ran into the other day was deceptively simple; it translated to: query all employees from the EMP table and show for each employee the number of employees that report to them, either directly or indirectly. For KING the answer is simple: all employees report to him, either directly or through the management chain. But how do you write the query for this. Somewhat efficiently.
This same question can be asked about Bills of Materials, Directory Structures and their contents, Nodes in an XML Document…. In this article I briefly discuss how we could fairly easily answer these questions. I do not present a thorough study into the performance characteristics though. I would be happy to hear your thoughts on that.
I also have this nagging feeling that I am doing something the hard way where a very simple way exists that I completely overloop. Again, I would like to hear your thoughts on that too – if you phrase them gently.
The mother of all hierarchical queries
The mother sample tables must be the EMP table in th SCOTT schema. If you do not know about it… well, what can I say. Where have you been? So I will assume some knowledge of this table, that contains Employees that have a name, an identification, a job and a salary and of course a boss. Except for Mr. King who – ironically perhaps – is in fact a PRESIDENT who does not report to anyone.
The typical introduction of the basic hierarchical query capabilities of the Oracle SQL engine is the following query:
select lpad(' ',3*level)||ename Employee from emp connect by mgr = prior empno start with mgr is null order siblings by ename
that renders all our employees in a tree-like structure that illustrates the organization structure of our company:
We can see for ourselves that all employees are under de KING node. And that the BLAKE node has 5 direct reports. And that JONES has two direct reports as well as two indirects. But how can we let our query count them.
Counting the number of child nodes
So now what do we do to count the number of child nodes? As you might have guessed: in line view to the rescue. And the Oracle 10g CONNECT_BY_ROOT feature (although using the SYS_CONNECT_BY_PATH, introduced in 9i would probably work too).
Let's start with a simple first step:
select connect_by_root(ename) root from emp connect by mgr = prior empno /
This query does not have a START WITH condition. So we start a tree from each record in the EMP table. The tree for KING will be same one as depicted above, the tree for ADAMS will be a very small one – only containing ADAMS himself. All in all, the in line view results in 39 records. For each record, we ask for its root: where did the tree it is part of start. The result are like this:
This in turn gives us a good indication of the number of subordinates for each node: the number of occurrences of each employee in this resultset indicates the number of tree-nodes that had this employee as their root-node. So by counting the number of occurrences of each employee, we have the number of subordinates for each employee (including the employee himself). To calculate the number of child nodes for each node, we can now use this query:
select ( count(*) -1 ) Child_Node_Count , root from ( select connect_by_root(ename) root from emp connect by mgr = prior empno ) group by root /
The result of this query:
Now we have the the answers we were so desperate for. KING has 13 child nodes and BLAKE the five we had already seen in the earlier query.
Calculating aggregates over all child nodes
Let's make this a little more interesting than just counting the subordinates. Let's try to calculate aggregate values over these subordinates. For example, let's try to find out the average salary for the subordinates of each employee.
Clearly our approach so far will not get us there. Yet the CONNECT_BY_ROOT function is the key to this answer too. The query may look a bit complex at first. We will dicuss it step by step:
select ename , child_node_count_under_root , avg_under_root from ( select root , avg(sal) over ( partition by root order by lvl desc rows between unbounded preceding and 1 preceding ) avg_under_root , count(empno) over ( partition by root order by lvl desc rows between unbounded preceding and 1 preceding ) child_node_count_under_root , ename from ( select connect_by_root(ename) root , sal , empno , ename , level lvl from emp connect by mgr = prior empno ) ) where ename = root /
The inner-most query is the same as we used before. However, this time we have included the LEVEL pseudo-attribute that tells us the level in a particular tree of each node. We know of course that the root node will always be at the lowest level (1) of its own tree.
When calculating the average salary over all child nodes for a certain (root) node, we can make use of an Analytical expression. We calculate the average over a partition that contains all employees with the same root. However, that partition contains the root itself as well. We want to calculate the average over the nodes under the root, not including the root. We can exclude the root node by ordering the partition by level, descending in this case. That ensures that the root node for a certain partition is also the last node in that partition. Using a windowing clause, we can restrict the the number of records that participates in the aggregate calculation. We state here that our window is from the beginning of the partition (unbounded preceding) until the row just before the current row (1 preceding). That means that for the last row in the partition (the root node), the average is calculated over all records in the partition except the root node. Which is what we are looking for.
Finally the outer most query filters out all records except the root nodes – the nodes where the root equals the ename. The result of this query:
Inside Oracle Database 10g New CONNECT BY Features in Oracle Database 10g By Jonathan Gennick on OTN
Trees in SQL: Nested Sets and Materialized Path by Vadim Tropashko
- Hierarchical query with nodes from different tables – DEPT and EMP nodes in one tree
- Selecting a 'pruned tree' with selected nodes and all their ancestors – Hierarchical SQL Queries and Bottom-Up Trees
- Bill of Materials and Hierarchical Queries – Which Component contains a component that…
- Oracle Database 10g Release 2 – SQL and PL/SQL Features
- SQL*Plus or Report style Break Groups in SQL Query