Analytical SQL Functions – theory and examples – Part 2 on the Order By and Windowing Clauses

7

To understand how Analytical Functions are used in SQL statements, you really have to look at the four different parts of an Analytical ‘clause':

  • the analytical function, for example AVG, LEAD, PERCENTILE_RANK
  • the partitioning clause, for example PARTITION BY job or PARTITION BY dept, job
  • the order by clause, for example order by job nulls last
  • the windowing clause, for example RANGE UNBOUNDED PRECEDING or ROWS UNBOUNDED FOLLOWING

Note that typically only the Analytical Function is required and the other clauses are optional. However, for certain functions, an Order By clause may be required.

This post will focus on the Order By and Windowing Clauses. Part 1 of this ‘series’ Analytical SQL Functions – theory and examples – Part 1 on the Partition Clause discussed the Partition Clause and the over-all structure of Analytical Expressions.

Order By Clause

You may wonder what role an ORDER BY clause can play in an Analytical Function. We will investigate.

There are some Analytical Functions that return information about the relative position of a record (ROW_NUMBER, RANK, DENSE_RANK). Some other functions allow you to compare a record with precedessors or successors when sorted in a specific order (LAG and LEAD). The order by clause is crucial and even mandatory for these functions.

The following query for example returns the

select ename
,      deptno
,      sal
,      hiredate
,      row_number() over ( partition by deptno
                           order by sal desc, comm desc , hiredate desc
                          ) sal_rank_in_dept
,      row_number() over (order by sal desc) sal_rank_overall
from   emp
order
by     deptno
,      sal desc
,      comm desc
,      hiredate desc

Results:

ENAME          DEPTNO        SAL HIREDATE  SAL_RANK_IN_DEPT SAL_RANK_OVERALL
---------- ---------- ---------- --------- ---------------- ----------------
CLARK              10       2450 09-JUN-81                1                6
MILLER             10       1300 23-JAN-82                2                9
KING               20       5000 18-NOV-81                1                1
SCOTT              20       3000 19-APR-87                2                2
FORD               20       3000 03-DEC-81                3                3
JONES              20       2975 02-APR-81                4                4
ADAMS              20       1100 23-MAY-87                5               12
SMITH              20        800 17-DEC-80                6               14
BLAKE              30       2850 01-MAY-81                1                5
ALLEN              30       1600 20-FEB-81                2                7
TURNER             30       1500 08-SEP-81                3                8
MARTIN             30       1250 28-SEP-81                4               11
WARD               30       1250 22-FEB-81                5               10
JAMES              30        950 03-DEC-81                6               13

To make row_number() meaningful, you must specify within what ordering you want to have the row_number(). The same record can have multiple row_numbers – as we see in the results above – in different partitions and for different orderings of that partition. It should be obvious that the order by clause is required to determine the ordering in which the row_number() is derived.

The order by clause as used within Analytical Expressions is no different from the Order By used at the end of Select statements. Order By can make use of multiple columns, make use of functions – just not analytical functions! – , be in ASCENDING or DESCENDING order and either put NULL values first (NULLS FIRST) or last (NULLS LAST).

You cannot order by an analytical function within an Analytical Expression. That means for example that you cannot order the employees within a department partition by their difference with the average salary in their job. The following query is not allowed:

select ename
,      deptno
,      job
,      sal
,      row_number() over ( partition by deptno
                           order by (sal - avg(sal) over (partition by job))
                          ) sal_rank_in_dept
from   emp
order
by     deptno
,      sal desc

However, the next query solves the issue and gives us the answer we were looking for. Again, a useful trick: if we need analytical results within an Analytical Function, we use an in-line view to produce those results and enrich the records we process in the outer Analytical Function with those results.

select ename
,      deptno
,      job
,      sal
,      avg_sal_in_job avg_job_sal
,      sal - avg_sal_in_job sal_min_job_avg
,      row_number() over ( partition by deptno
                           order by (sal - avg_sal_in_job)
                          ) sal_min_job_avg_rank_in_dept
from   ( select ename
         ,      deptno
         ,      job
         ,      sal
         ,      avg(sal) over (partition by job) avg_sal_in_job
         from   emp
       )
order
by     deptno
,      sal desc

Results:

ENAME          DEPTNO JOB              SAL AVG_JOB_SAL SAL_MIN_JOB_AVG SAL_MIN_JOB_AVG_RANK_IN_DEPT
---------- ---------- --------- ---------- ----------- --------------- ----------------------------
CLARK              10 MARKETEER       2450        2450               0                            1
MILLER             10 CLERK           1300      1037.5           262.5                            2
KING               20 PRESIDENT       5000        5000               0                            4
SCOTT              20 ANALYST         3000        3000               0                            2
FORD               20 ANALYST         3000        3000               0                            3
JONES              20 MANAGER         2975      2912.5            62.5                            6
ADAMS              20 CLERK           1100      1037.5            62.5                            5
SMITH              20 CLERK            800      1037.5          -237.5                            1
BLAKE              30 MANAGER         2850      2912.5           -62.5                            4
ALLEN              30 SALESMAN        1600        1400             200                            6
TURNER             30 SALESMAN        1500        1400             100                            5
MARTIN             30 SALESMAN        1250        1400            -150                            1
WARD               30 SALESMAN        1250        1400            -150                            2
JAMES              30 CLERK            950      1037.5           -87.5                            3

This tells us – as if we really wanted to know – for example that MARTIN is the employee in his department with the highest gap between his salary and the average salary in his Job. In that respect is JAMES, who works as CLERK in the same department as MARTIN, is somewhat better off: he comes third in terms of gap with the average salary in his job category. I admit though that this query is quite farfetched – trying to illustrate all concepts in the context of EMP and DEPT is perhaps overambitious.

Rank vs. Row_Number

I always have trouble figuring out why I should RANK (or DENSE_RANK) instead of ROW_NUMBER. Also see SQL for Data Warehousing in the Oracle 10g Data Warehousing Guide. ROW_NUMBER seems to give me the result I am looking for, however RANK is specifically created to answer this type of question.

select ename
,      hiredate
,      sal
,      rank() over ( partition by deptno
                     order by sal desc, comm desc , hiredate desc
                   ) sal_rank_in_dept
,      rank() over (order by sal desc) sal_rank_overall
from   emp
order
by     sal desc
,      comm desc
,      hiredate desc
ENAME      HIREDATE         SAL SAL_RANK_IN_DEPT SAL_RANK_OVERALL
---------- --------- ---------- ---------------- ----------------
KING       18-NOV-81       5000                1                1
SCOTT      19-APR-87       3000                2                2
FORD       03-DEC-81       3000                3                2
JONES      02-APR-81       2975                4                4
BLAKE      01-MAY-81       2850                1                5
CLARK      09-JUN-81       2450                1                6
ALLEN      20-FEB-81       1600                2                7
TURNER     08-SEP-81       1500                3                8
MILLER     23-JAN-82       1300                2                9
MARTIN     28-SEP-81       1250                4               10
WARD       22-FEB-81       1250                5               10
ADAMS      23-MAY-87       1100                5               12
JAMES      03-DEC-81        950                6               13
SMITH      17-DEC-80        800                6               14

A difference between RANK() and ROW_NUMBER(): ROW_NUMBER always assigns different values to each row. Therefore, if two rows have a tie – the order by clause makes no distinction – they are assigned different rownumbers in a random way. RANK() – and DENSE_RANK() for that matter – will assign them the same value. If we use a completely decisive order by expression – i.e. no ties – it seems that there is no difference between RANK and ROW_NUMBER.

Running Aggregations and the implicit window

In addition to the pretty obvious role of the ORDER BY clause for Analytical Functions that clearly depend on ordering of the records, such as ROW_NUMBER, RANK, LAG and LEAD, there is a less obvious effect of the order by clause. A term that may help put you in the right frame of mind is ‘running’ as in running total (or running average, running count).

Take for example this query:

select ename
,      job
,      sal
,      avg(sal) over (partition by job) job_avg
,      avg(sal) over () total_avg
from   emp
where  sal < 2500
order
by     job
ENAME      JOB              SAL    JOB_AVG  TOTAL_AVG
---------- --------- ---------- ---------- ----------
SMITH      CLERK            800     1037.5 1355.55556
ADAMS      CLERK           1100     1037.5 1355.55556
JAMES      CLERK            950     1037.5 1355.55556
MILLER     CLERK           1300     1037.5 1355.55556
CLARK      MARKETEER       2450       2450 1355.55556
ALLEN      SALESMAN        1600       1400 1355.55556
TURNER     SALESMAN        1500       1400 1355.55556
WARD       SALESMAN        1250       1400 1355.55556
MARTIN     SALESMAN        1250       1400 1355.55556

We know that this result was arrived upon by calculating for each row the average salary over the partition of rows (out of the total set of rows returned by the query) with the same Job as the current row. For SMITH the JOB_AVG value was calculated as the Average Salary for all Employees returned by the query (i.e. with a salary < 2500) in the same Job as SMITH (i.e. Clerks).

Now if we change the query by adding an order by clause in the Analytical ‘clause’ – like this:

select ename
,      job
,      sal
,      hiredate
,      avg(sal) over ( partition by job
                       order by     hiredate
                     ) job_avg
from   emp
where  sal &lt; 2500
order
by     job
,      hiredate

we get a result that may surprise you:

ENAME      JOB              SAL HIREDATE     JOB_AVG
---------- --------- ---------- --------- ----------
SMITH      CLERK            800 17-DEC-80        800
JAMES      CLERK            950 03-DEC-81        875
MILLER     CLERK           1300 23-JAN-82 1016.66667
ADAMS      CLERK           1100 23-MAY-87     1037.5
CLARK      MARKETEER       2450 09-JUN-81       2450
ALLEN      SALESMAN        1600 20-FEB-81       1600
WARD       SALESMAN        1250 22-FEB-81       1425
TURNER     SALESMAN        1500 08-SEP-81       1450
MARTIN     SALESMAN        1250 28-SEP-81       1400

You see that the value for JOB_AVG is no longer constant within a partition. Here is the explanation: by specifying the order by clause, we in fact told the SQL engine two things

  1. order the rows within the JOB-partitions on hiredate (so the first hired employee in a job is listed first in the partition)
  2. (and this is what you did not see): apply a default windowing clause, specified as ROWS UNBOUNDED PRECEDING

This windowing clause means: for each employee, do not use all records within the partition to calculate the average salary. Instead, use the current record and all preceding records in the partition. The window is a subset of the partition; it can never extend beyond the boundaries of the partition. When no windowing clause is specified – no explicit clause nor an implicit as result of an order by clause – the default setting is to use all records in the partition.

In the previous query, the JOB_AVG value found for SMITH – the first hired CLERK – is arrived at in the following way: take all Employees returned by the query and from that set the partition of all CLERKS. Order them by hiredate. Now starting from the current record – SMITH – take his Salary and the Salary-values of all preceding rows. Since there are no preceding rows in the partition, SMITH’s salary is the only one used for calculating the average and is therefore equal to the average.

JOB_AVG for Turner – the third hired employee in the group of SALESMEN – is calculated using the Salaries of TURNER himself and his two previously hired colleagues – ALLEN and WARD. The last SALESMAN, MARTIN, is outside the window.

It follows from the above that only for the last hired employees in each partition – the ones where the window overlaps with the entire partition – the JOB_AVG is equal to salary of all employees with the same job (as found in the previous query). This applies to ADAMS, CLARK and MARTIN.

The following query, with the explicit windowing clause rows unbounded preceding, is exactly
equivalent to the previous one.

select ename
,      job
,      sal
,      hiredate
,      avg(sal) over ( partition by job
                       order by     hiredate
                       rows unbounded preceding
                     ) job_avg
from   emp
where  sal &lt; 2500
order
by     job

The next image compares the running average (with an order by – and therefore implicit windowing clause) with the average calculated with the same partition clause but without order by:

The reverse of the above query is the following, where the JOB_AVG is calculated over the current employee and all later hired colleagues in the same Job:

select ename
,      job
,      sal
,      hiredate
,      avg(sal) over ( partition by job
                       order by     hiredate
                       range between current row and unbounded following
                     ) job_avg
from   emp
where  sal &lt; 2500
order
by     job
ENAME      JOB              SAL HIREDATE     JOB_AVG
---------- --------- ---------- --------- ----------
SMITH      CLERK            800 17-DEC-80     1037.5
JAMES      CLERK            950 03-DEC-81 1116.66667
MILLER     CLERK           1300 23-JAN-82       1200
ADAMS      CLERK           1100 23-MAY-87       1100
CLARK      MARKETEER       2450 09-JUN-81       2450
ALLEN      SALESMAN        1600 20-FEB-81       1400
WARD       SALESMAN        1250 22-FEB-81 1333.33333
TURNER     SALESMAN        1500 08-SEP-81       1375
MARTIN     SALESMAN        1250 28-SEP-81       1250
1 2
Share.

About Author

Lucas Jellema, active in IT (and with Oracle) since 1994. Oracle ACE Director for Fusion Middleware. Consultant, trainer and instructor on diverse areas including Oracle Database (SQL & PLSQL), Service Oriented Architecture, BPM, ADF, Java in various shapes and forms and many other things. Author of the Oracle Press book: Oracle SOA Suite 11g Handbook. Frequent presenter on conferences such as JavaOne, Oracle OpenWorld, ODTUG Kaleidoscope, Devoxx and OBUG. Presenter for Oracle University Celebrity specials.

7 Comments