Anti-Search patterns – SQL to look for what is NOT there – Part Two
It is a strange thing really: querying the database to find missing data. Queries usually report on what is there. Not the queries in this article (and its prequel): they report on what is not there. We explore a number of approaches to uncovering missing information – because the fact that something is not found in itself is a fact we may want to know about. And it gives me a nice motive to demonstrate a number of fun SQL features.
We are still very much engaged in the EMP and DEPT tables in the SCOTT schema that back in the early ’80s was pivotal in demonstrating the capabilities of the first Oracle RDBMS releases. It still does an excellent job for most of my demos today. Must be me…
This article delves into urgent matters such as ‘which letters do not occur in the names of the employees’, ‘in which months of the year was no one hired’, ‘who in each department does not have a colleague with less experience (or a lower salary)’, ‘which employees do not have a colleague in the same job (either in their own department or in the entire company)’, and more.
Letters not used in any Employee Name
Our first search for information that is not there is a quest for the letters (from the alphabet) that are not used in any of the Employee Names in table EMP. We have 14 employees, each with a 4 to 8 letter ENAME. We wonder (yes, we do) which of our 26 letters in the western european alphabet is not used in any of those 14 names.
Several approaches to this puzzling challenge are available to us, and I am sure there are many more than I will demonstrate here. All of these make use of the alphabet query that was introduced in the article about static and somewhat dynamic domains: http://technology.amis.nl/blog/9577/retrieving-the-values-for-domains-in-sql-queries-in-the-oracle-rdbms-months-days-numbers-and-custom-value-sets.
The first approach creates one long string out of all the employee enames, then joins that string with every letter in the alphabet. Only rows where the letter does not occur in the string are returned, thereby giving us the list of letters not used in any ename:
with alphabet as ( select chr(64+rownum) letter from dual connect by rownum < 27 ) , employeenames as ( select listagg(ename,',') within group (order by sal) enames from emp ) select letter from alphabet cross join employeenames where instr( enames, letter ) = 0
The second approach, though similar, uses two special options: first a join condition that does not just look for equality in two columns but instead uses instr( ename, letter ) >0 . The second is the outer join with the specific filter for un-joined rows. Every letter that is not joined to any ENAME is returned by this query.
with alphabet as ( select chr(64+rownum) letter from dual connect by rownum < 27 ) select letter from alphabet left outer join emp on instr( ename, letter ) > 0 where emp.rowid is null
The third approach uses the MINUS set operator to substract from the ALPHABET all letters that occur in any of the ENAMES – and note how the CONNECT BY is used to turn a string of characters into a rowset with a single row for each character in the string (it is almost pivoting, right there!).
with alphabet as ( select chr(64+rownum) letter from dual connect by rownum < 27 ) , employeenames as ( select listagg(ename,',') within group (order by sal) enames from emp ) , ename_letters as ( select substr(enames, level, 1) letter from employeenames connect by level <= length(enames) ) select letter from alphabet minus select letter from ename_letters
The jobs that a department does not contain
Another pressing matter is to find out which JOBs do not occur in each department. The EMP table has a JOB column that specifies for each employee which Job he or she is in. A simple query
select distinct job from emp
reveals the various jobs that the organization that is described by table EMP contains. Suppose we want to find out which of these various jobs do not occur in a specific department. We need to craft a query that returns for each department a comma separated values list of the jobs that are not represented in the department.
Again, the outer query comes in handy, to join each department with each job, even if the job is not represented in the department. In order to make this work, we need to use the partition by clause. This clause instructs the SQL engine to perform the outer join for each department again. Every JOB therefore is present in the end result, not just once, but once for every department. Next, we will filter on all joins of JOB for a DEPARTMENT where an outer join was required – or where the DEPARTMENT does not have an EMP row to do a real join.
select distinct deptno , position from emp partition by (deptno) right outer join ( select distinct job position from emp ) functions on (emp.job = functions.position) where emp.rowid is null order by deptno , position
The next step is to turn this query result into the comma separated aggregate that was requested.
select deptno , listagg(position, ',') within group (order by position) unfilled_jobs from ( select deptno , position from emp partition by (deptno) right outer join ( select distinct job position from emp ) functions on (emp.job = functions.position) where emp.rowid is null ) group by deptno order by deptno
The rarest jobs
A simple query to find the most rare job (in terms of the largest number of departments not having that job), which is the same thing as the smallest number of departments having that job:
select job , count(distinct deptno) num_of_depts from emp group by job order by num_of_depts
In which months was no one hired
Every EMLOYEE has a hiredate. From the HIREDATE column, it is easy to find out in which year and month someone was hired. However, in search of more information not there, we would like to find out in which months there was no one hired at all.
The pattern is emerging: outer join and filter on the virtual join cases:
with months as ( select 0 + rownum month_num , to_char( to_date('01-'||rownum||'-2010', 'DD-MM-YYYY'), 'MONTH') month_name from dual connect by level < 13 ) select months.month_name from emp e right outer join months on extract(month from e.hiredate)= months.month_num where e.rowid is null order by months.month_num
We make use of the MONTHS domain, also discussed in the article about static and somewhat dynamic domains: http://technology.amis.nl/blog/9577/retrieving-the-values-for-domains-in-sql-queries-in-the-oracle-rdbms-months-days-numbers-and-custom-value-sets.
A somewhat alternative approach uses simple aggregates and a filter on the employee-hire count:
with months as ( select 0 + rownum month_num , to_char( to_date('01-'||rownum||'-2010', 'DD-MM-YYYY'), 'MONTH') month_name from dual connect by level < 13 ) select months.month_name , count(empno) "number of hires" from emp e right outer join months on extract(month from e.hiredate)= months.month_num having count(empno) = 0 group by months.month_name , months.month_num order by months.month_num
This next elaborate query shows the number of hired employees in each month per department. Again, using the PARTITION BY clause, the outer join has been applied to every department, instead of just once for every row in the MONTHS set. The result of this query would help to get an answer two the question: for each department – give me the months in which no employee was hired.
with months as ( select 0 + rownum month_num , to_char( to_date('01-'||rownum||'-2010', 'DD-MM-YYYY'), 'MONTH') month_name from dual connect by level < 13 ) select months.month_name , deptno department , count(empno) "number of hires" from emp e partition by (deptno) right outer join months on extract(month from e.hiredate)= months.month_num group by months.month_name , months.month_num , deptno order by months.month_num , deptno
Who in each department does not have a colleague with less experience
One way of reading this section title is: finding a record (EMP) for which there is not a next record (EMP) [where hiredate is more recent]. When a query can be expressed in terms of a next record, you should immediately think Analytical Functions, and more specifically: LEAD (and LAG).
Using lead we can look for the next less experienced colleague for each employee:
select ename , lead(ename) over (partition by deptno order by hiredate asc) next_less_experienced_in_dept , deptno , hiredate from emp order by deptno , hiredate
Then we filter on those rows for which next_less_experienced_in_dept is null to find the most junior employees in each department.
select ename , deptno from ( select ename , lead(ename) over (partition by deptno order by hiredate asc) next_less_experienced_in_dept , deptno from emp ) where next_less_experienced_in_dept is null
Now it is easy to check if these least experienced employees are also the least-earners:
select ename , deptno from ( select ename , lead(ename) over (partition by deptno order by sal desc) next_less_paid_in_dept , deptno from emp ) where next_less_paid_in_dept is null
It turns out that MILLER and JAMES make an appearance in both result sets, ADAMS is least experiences in 20 yet SMITH makes less.
An alternative approach to the ‘find least experienced employees’, and even more compact, is this query:
select deptno , max(ename) keep (dense_rank last order by hiredate asc) from emp group by deptno
This query selects for each department the ename for the record that is ranked last when ordered by hiredate. I have never quite gotten used to this syntax, but it can be quite useful.
The equivalent for salary should be obvious:
select deptno , max(ename) keep (dense_rank last order by sal desc) from emp group by deptno
Our final question in this article: which employees do not have a colleague in the same job (either in their own department or in the entire company)
select ename , deptno , job from ( select ename , count(*) over (partition by job, deptno) -1 num_of_colleagues , deptno , job from emp ) where num_of_colleagues = 0
It turns out that it is lonely both at the top and at the bottom (PRESIDENT, MANAGER and CLERK).
- Anti-Search patterns – SQL to look for what is NOT there – Part One
- SQL*Plus or Report style Break Groups in SQL Query
- Oracle RDBMS 11gR2 – LISTAGG – New aggregation operator for creating (comma) delimited strings
- Oracle Database 11gR2 – New analytical function NTH_VALUE
- Is this a bug in Analytical Functions? NULL is supposed to be UNEQUAL to NULL?
- The AMIS Summary of Oracle OpenWorld 2013 is available for download – 60-page white paper
- On the integrity of data in Java applications – presentation from JFall 2013
- Enriching XMLType data using relational data – XQuery and fn:collection in action
- Java 8 – Collection enhancements leveraging Lambda Expressions – or: How Java emulates SQL
- Oracle Database SQL – Recursive Subquery to inspect events in football matches – find the MVP
- Oracle Database 12c: Find most valuable player using MATCH_RECOGNIZE in SQL
- Oracle Database 12c: Pattern Matching through MATCH_RECOGNIZE in SQL
- Oracle Database 12c: joining and outer joining with collections
- Oracle Database 12c: PL/SQL package UTL_CALL_STACK for programmatically inspecting the PL/SQL Call Stack
- Oracle Database 12c: In Line PL/SQL Functions in SQL queries