To understand how Analytical Functions are used in SQL statements, you really have to look at the four different parts of an Analytical ‘clause’:
- the analytical function, for example AVG, LEAD, PERCENTILE_RANK
- the partitioning clause, for example PARTITION BY job or PARTITION BY dept, job
- the order by clause, for example order by job nulls last
- the windowing clause, for example RANGE UNBOUNDED PRECEDING or ROWS UNBOUNDED FOLLOWING
Note that typically only the Analytical Function is required and the other clauses are optional. However, for certain functions, an Order By clause may be required.
This post will focus on the Order By and Windowing Clauses. Part 1 of this ‘series’ Analytical SQL Functions – theory and examples – Part 1 on the Partition Clause discussed the Partition Clause and the over-all structure of Analytical Expressions.
Order By Clause
You may wonder what role an ORDER BY clause can play in an Analytical Function. We will investigate.
There are some Analytical Functions that return information about the relative position of a record (ROW_NUMBER, RANK, DENSE_RANK). Some other functions allow you to compare a record with precedessors or successors when sorted in a specific order (LAG and LEAD). The order by clause is crucial and even mandatory for these functions.
The following query for example returns the
select ename , deptno , sal , hiredate , row_number() over ( partition by deptno order by sal desc, comm desc , hiredate desc ) sal_rank_in_dept , row_number() over (order by sal desc) sal_rank_overall from emp order by deptno , sal desc , comm desc , hiredate desc
Results:
ENAME DEPTNO SAL HIREDATE SAL_RANK_IN_DEPT SAL_RANK_OVERALL ---------- ---------- ---------- --------- ---------------- ---------------- CLARK 10 2450 09-JUN-81 1 6 MILLER 10 1300 23-JAN-82 2 9 KING 20 5000 18-NOV-81 1 1 SCOTT 20 3000 19-APR-87 2 2 FORD 20 3000 03-DEC-81 3 3 JONES 20 2975 02-APR-81 4 4 ADAMS 20 1100 23-MAY-87 5 12 SMITH 20 800 17-DEC-80 6 14 BLAKE 30 2850 01-MAY-81 1 5 ALLEN 30 1600 20-FEB-81 2 7 TURNER 30 1500 08-SEP-81 3 8 MARTIN 30 1250 28-SEP-81 4 11 WARD 30 1250 22-FEB-81 5 10 JAMES 30 950 03-DEC-81 6 13
To make row_number() meaningful, you must specify within what ordering you want to have the row_number(). The same record can have multiple row_numbers – as we see in the results above – in different partitions and for different orderings of that partition. It should be obvious that the order by clause is required to determine the ordering in which the row_number() is derived.
The order by clause as used within Analytical Expressions is no different from the Order By used at the end of Select statements. Order By can make use of multiple columns, make use of functions – just not analytical functions! – , be in ASCENDING or DESCENDING order and either put NULL values first (NULLS FIRST) or last (NULLS LAST).
You cannot order by an analytical function within an Analytical Expression. That means for example that you cannot order the employees within a department partition by their difference with the average salary in their job. The following query is not allowed:
select ename , deptno , job , sal , row_number() over ( partition by deptno order by (sal - avg(sal) over (partition by job)) ) sal_rank_in_dept from emp order by deptno , sal desc
However, the next query solves the issue and gives us the answer we were looking for. Again, a useful trick: if we need analytical results within an Analytical Function, we use an in-line view to produce those results and enrich the records we process in the outer Analytical Function with those results.
select ename , deptno , job , sal , avg_sal_in_job avg_job_sal , sal - avg_sal_in_job sal_min_job_avg , row_number() over ( partition by deptno order by (sal - avg_sal_in_job) ) sal_min_job_avg_rank_in_dept from ( select ename , deptno , job , sal , avg(sal) over (partition by job) avg_sal_in_job from emp ) order by deptno , sal desc
Results:
ENAME DEPTNO JOB SAL AVG_JOB_SAL SAL_MIN_JOB_AVG SAL_MIN_JOB_AVG_RANK_IN_DEPT ---------- ---------- --------- ---------- ----------- --------------- ---------------------------- CLARK 10 MARKETEER 2450 2450 0 1 MILLER 10 CLERK 1300 1037.5 262.5 2 KING 20 PRESIDENT 5000 5000 0 4 SCOTT 20 ANALYST 3000 3000 0 2 FORD 20 ANALYST 3000 3000 0 3 JONES 20 MANAGER 2975 2912.5 62.5 6 ADAMS 20 CLERK 1100 1037.5 62.5 5 SMITH 20 CLERK 800 1037.5 -237.5 1 BLAKE 30 MANAGER 2850 2912.5 -62.5 4 ALLEN 30 SALESMAN 1600 1400 200 6 TURNER 30 SALESMAN 1500 1400 100 5 MARTIN 30 SALESMAN 1250 1400 -150 1 WARD 30 SALESMAN 1250 1400 -150 2 JAMES 30 CLERK 950 1037.5 -87.5 3
This tells us – as if we really wanted to know – for example that MARTIN is the employee in his department with the highest gap between his salary and the average salary in his Job. In that respect is JAMES, who works as CLERK in the same department as MARTIN, is somewhat better off: he comes third in terms of gap with the average salary in his job category. I admit though that this query is quite farfetched – trying to illustrate all concepts in the context of EMP and DEPT is perhaps overambitious.
Rank vs. Row_Number
I always have trouble figuring out why I should RANK (or DENSE_RANK) instead of ROW_NUMBER. Also see SQL for Data Warehousing in the Oracle 10g Data Warehousing Guide. ROW_NUMBER seems to give me the result I am looking for, however RANK is specifically created to answer this type of question.
select ename , hiredate , sal , rank() over ( partition by deptno order by sal desc, comm desc , hiredate desc ) sal_rank_in_dept , rank() over (order by sal desc) sal_rank_overall from emp order by sal desc , comm desc , hiredate desc
ENAME HIREDATE SAL SAL_RANK_IN_DEPT SAL_RANK_OVERALL ---------- --------- ---------- ---------------- ---------------- KING 18-NOV-81 5000 1 1 SCOTT 19-APR-87 3000 2 2 FORD 03-DEC-81 3000 3 2 JONES 02-APR-81 2975 4 4 BLAKE 01-MAY-81 2850 1 5 CLARK 09-JUN-81 2450 1 6 ALLEN 20-FEB-81 1600 2 7 TURNER 08-SEP-81 1500 3 8 MILLER 23-JAN-82 1300 2 9 MARTIN 28-SEP-81 1250 4 10 WARD 22-FEB-81 1250 5 10 ADAMS 23-MAY-87 1100 5 12 JAMES 03-DEC-81 950 6 13 SMITH 17-DEC-80 800 6 14
A difference between RANK() and ROW_NUMBER(): ROW_NUMBER always assigns different values to each row. Therefore, if two rows have a tie – the order by clause makes no distinction – they are assigned different rownumbers in a random way. RANK() – and DENSE_RANK() for that matter – will assign them the same value. If we use a completely decisive order by expression – i.e. no ties – it seems that there is no difference between RANK and ROW_NUMBER.
Running Aggregations and the implicit window
In addition to the pretty obvious role of the ORDER BY clause for Analytical Functions that clearly depend on ordering of the records, such as ROW_NUMBER, RANK, LAG and LEAD, there is a less obvious effect of the order by clause. A term that may help put you in the right frame of mind is ‘running’ as in running total (or running average, running count).
Take for example this query:
select ename , job , sal , avg(sal) over (partition by job) job_avg , avg(sal) over () total_avg from emp where sal < 2500 order by job
ENAME JOB SAL JOB_AVG TOTAL_AVG ---------- --------- ---------- ---------- ---------- SMITH CLERK 800 1037.5 1355.55556 ADAMS CLERK 1100 1037.5 1355.55556 JAMES CLERK 950 1037.5 1355.55556 MILLER CLERK 1300 1037.5 1355.55556 CLARK MARKETEER 2450 2450 1355.55556 ALLEN SALESMAN 1600 1400 1355.55556 TURNER SALESMAN 1500 1400 1355.55556 WARD SALESMAN 1250 1400 1355.55556 MARTIN SALESMAN 1250 1400 1355.55556
We know that this result was arrived upon by calculating for each row the average salary over the partition of rows (out of the total set of rows returned by the query) with the same Job as the current row. For SMITH the JOB_AVG value was calculated as the Average Salary for all Employees returned by the query (i.e. with a salary < 2500) in the same Job as SMITH (i.e. Clerks).
Now if we change the query by adding an order by clause in the Analytical ‘clause’ – like this:
select ename , job , sal , hiredate , avg(sal) over ( partition by job order by hiredate ) job_avg from emp where sal < 2500 order by job , hiredate
we get a result that may surprise you:
ENAME JOB SAL HIREDATE JOB_AVG ---------- --------- ---------- --------- ---------- SMITH CLERK 800 17-DEC-80 800 JAMES CLERK 950 03-DEC-81 875 MILLER CLERK 1300 23-JAN-82 1016.66667 ADAMS CLERK 1100 23-MAY-87 1037.5 CLARK MARKETEER 2450 09-JUN-81 2450 ALLEN SALESMAN 1600 20-FEB-81 1600 WARD SALESMAN 1250 22-FEB-81 1425 TURNER SALESMAN 1500 08-SEP-81 1450 MARTIN SALESMAN 1250 28-SEP-81 1400
You see that the value for JOB_AVG is no longer constant within a partition. Here is the explanation: by specifying the order by clause, we in fact told the SQL engine two things
- order the rows within the JOB-partitions on hiredate (so the first hired employee in a job is listed first in the partition)
- (and this is what you did not see): apply a default windowing clause, specified as ROWS UNBOUNDED PRECEDING
This windowing clause means: for each employee, do not use all records within the partition to calculate the average salary. Instead, use the current record and all preceding records in the partition. The window is a subset of the partition; it can never extend beyond the boundaries of the partition. When no windowing clause is specified – no explicit clause nor an implicit as result of an order by clause – the default setting is to use all records in the partition.
In the previous query, the JOB_AVG value found for SMITH – the first hired CLERK – is arrived at in the following way: take all Employees returned by the query and from that set the partition of all CLERKS. Order them by hiredate. Now starting from the current record – SMITH – take his Salary and the Salary-values of all preceding rows. Since there are no preceding rows in the partition, SMITH’s salary is the only one used for calculating the average and is therefore equal to the average.
JOB_AVG for Turner – the third hired employee in the group of SALESMEN – is calculated using the Salaries of TURNER himself and his two previously hired colleagues – ALLEN and WARD. The last SALESMAN, MARTIN, is outside the window.
It follows from the above that only for the last hired employees in each partition – the ones where the window overlaps with the entire partition – the JOB_AVG is equal to salary of all employees with the same job (as found in the previous query). This applies to ADAMS, CLARK and MARTIN.
The following query, with the explicit windowing clause rows unbounded preceding, is exactly
equivalent to the previous one.
select ename , job , sal , hiredate , avg(sal) over ( partition by job order by hiredate rows unbounded preceding ) job_avg from emp where sal < 2500 order by job
The next image compares the running average (with an order by – and therefore implicit windowing clause) with the average calculated with the same partition clause but without order by:
The reverse of the above query is the following, where the JOB_AVG is calculated over the current employee and all later hired colleagues in the same Job:
select ename , job , sal , hiredate , avg(sal) over ( partition by job order by hiredate range between current row and unbounded following ) job_avg from emp where sal < 2500 order by job
ENAME JOB SAL HIREDATE JOB_AVG ---------- --------- ---------- --------- ---------- SMITH CLERK 800 17-DEC-80 1037.5 JAMES CLERK 950 03-DEC-81 1116.66667 MILLER CLERK 1300 23-JAN-82 1200 ADAMS CLERK 1100 23-MAY-87 1100 CLARK MARKETEER 2450 09-JUN-81 2450 ALLEN SALESMAN 1600 20-FEB-81 1400 WARD SALESMAN 1250 22-FEB-81 1333.33333 TURNER SALESMAN 1500 08-SEP-81 1375 MARTIN SALESMAN 1250 28-SEP-81 1250
Good tutorial with great explanation!
Other examples of analytical functions we can see here http://www.youtube.com/watch?v=9zuCJq_YrhI
Regards
Very Good, explaination becomes meaningfull with SQL-Plus diagrams. Thanks.
Nice tutorial!
Very well explained.
Another article discussing Analytical Functions: Analytic functions by Example, by Shouvik Basu, November 15, 2004
The tutorial was very helpful for me. Thanks.
nice tutorial.very helpful and descriptive.