Windowing Clause
Of course instead of having Oracle apply an implicit, default windowing clause as result of the order by, you can define explicit windows yourself. When no windowing clause is specified, the window is exactly the same as the partition.
Note that for some functions, the order by clause is mandatory, such as for FIRST_VALUE, LAST_VALUE, ROW_NUMBER, LAG and LEAD, RANK and DENSE_RANK.
select ename , job , sal , hiredate , first_value(sal) over ( partition by job order by hiredate range between current row and unbounded following ) job_avg , first_value(sal) over ( partition by job order by hiredate rows between current row and 2 following ) job_avg from emp where sal < 2500 order by job
ENAME JOB SAL HIREDATE JOB_AVG JOB_AVG ---------- --------- ---------- --------- ---------- ---------- SMITH CLERK 800 17-DEC-80 800 800 JAMES CLERK 950 03-DEC-81 950 950 MILLER CLERK 1300 23-JAN-82 1300 1300 ADAMS CLERK 1100 23-MAY-87 1100 1100 CLARK MARKETEER 2450 09-JUN-81 2450 2450 ALLEN SALESMAN 1600 20-FEB-81 1600 1600 WARD SALESMAN 1250 22-FEB-81 1250 1250 TURNER SALESMAN 1500 08-SEP-81 1500 1500 MARTIN SALESMAN 1250 28-SEP-81 1250 1250
Let’s turn to the Windowing clause in somewhat more detail. It has the following syntax:
{ROWS | RANGE} {BETWEEN {UNBOUNDED PRECEDING | CURRENT ROW | value_expr {PRECEDING | FOLLOWING}} AND { UNBOUNDED FOLLOWING | CURRENT ROW | value_expr { PRECEDING | FOLLOWING } } | { UNBOUNDED PRECEDING | CURRENT ROW | value_expr PRECEDING}}
Well, that says it all, doesn’t it?
In plain language: the window can be expressed as a (value) range or in terms of rows.
range
A range means: an interval expressed as two boundaries with respect to the current row. Or in even plainer language: from the record with a value of X lower (=preceding) than the current row to the record with a value Y lower (=preceding) than the current row. Where it says lower (=preceding) in the sentence, you can replace it with higher(=following). That is, the range can start and end before, start before and end after or start and end after the current record.
Range can only be used with numeric or date values (a range cannot be expressed in terms of varchar2 or blob, which seems logical). A logical offset can be specified with constants such as RANGE 10 PRECEDING, or an expression that evaluates to a constant, or by an interval specification like RANGE INTERVAL N DAY/MONTH/YEAR PRECEDING or an expression that evaluates to an interval. With logical offset, there can only be one expression in the ORDER BY expression list in the function, with type compatible to NUMERIC if offset is numeric, or DATE if an interval is specified.
An example – calculate the number of employees who earn between 200 less and 400 more than each employee:
select ename , sal , count(sal) over ( order by sal asc range between 200 preceding and 400 following ) num_of_emps_in_range from emp order by sal asc
Results:
ENAME SAL NUM_OF_EMPS_IN_RANGE ---------- ---------- -------------------- SMITH 800 3 JAMES 950 6 ADAMS 1100 6 WARD 1250 6 MARTIN 1250 6 MILLER 1300 6 TURNER 1500 3 ALLEN 1600 2 CLARK 2450 2 BLAKE 2850 4 JONES 2975 4 SCOTT 3000 4 FORD 3000 4 KING 5000 1
This tells us that there are 5 employees who earn between 200 less and 400 more than MARTIN. They are: ADAMS, WARD, MILLER, TURNER and ALLEN.
Note that the current record is included in the window in this example, since the start point is before it while the end point is after it. The value of NUM_OF_EMPS_IN_RANGE for MARTIN is 6: himself and these 5 others.
Note too that where the end-point falls outside the partition – even though in this example the partition coincides with entire set of records – it is simply cut-off. Oracle has no problem with you specifying a window with boundaries far outside the scope of the actual range of values.
Also note that the boundaries for a range window can be specified using the keyword UNBOUNDED. Instead of specifying a ridiculously large value to ensure all records from a certain startpoint or until a certain endpoint are included, you can use this keyword.
An example – give me the number of employees earning more than the current employee:
select ename , sal , count(sal) over ( order by sal asc range between 1 following and unbounded following ) num_of_emps_earning_more from emp order by sal asc
Results:
ENAME SAL NUM_OF_EMPS_EARNING_MORE ---------- ---------- ------------------------ SMITH 800 13 JAMES 950 12 ADAMS 1100 11 WARD 1250 9 MARTIN 1250 9 MILLER 1300 8 TURNER 1500 7 ALLEN 1600 6 CLARK 2450 5 BLAKE 2850 4 JONES 2975 3 SCOTT 3000 1 FORD 3000 1 KING 5000 0
For example for ALLEN: there are 6 employees earning more than him. The Salary range window starting at 1601 (ALLEN’s salary+1) and extending indefinitely into the following salary values contains six records: from CLARK to KING.
As I will discuss later, the values specifying the boundaries of a range window do not have to be constants, they can just as well be expression: column values, functions etc. Let’s run a simple query that uses a range window that depends on the job: for CLERKS the salary window will be 250 on both sides of the current record, for managers 750 and for the others 500:
select ename , job , case job when 'CLERK' then 250 when 'MANAGER' then 750 else 500 end window_size , sal , count(sal) over ( order by sal asc range between case job when 'CLERK' then 250 when 'MANAGER' then 750 else 500 end preceding and case job when 'CLERK' then 250 when 'MANAGER' then 750 else 500 end following ) num_of_emps_in_range from emp order by sal asc
Results:
ENAME JOB WINDOW_SIZE SAL NUM_OF_EMPS_IN_RANGE ---------- --------- ----------- ---------- -------------------- SMITH CLERK 250 800 2 JAMES CLERK 250 950 3 ADAMS CLERK 250 1100 5 WARD SALESMAN 500 1250 8 MARTIN SALESMAN 500 1250 8 MILLER CLERK 250 1300 5 TURNER SALESMAN 500 1500 6 ALLEN SALESMAN 500 1600 6 CLARK MARKETEER 500 2450 2 BLAKE MANAGER 750 2850 5 JONES MANAGER 750 2975 5 SCOTT ANALYST 500 3000 4 FORD ANALYST 500 3000 4 KING PRESIDENT 500 5000 1
MARTIN, who is a SALESMAN, has a window size of 500. That means that we count the employees who earn more than 500 less or less than 500 more than MARTIN. That adds up to 8 people: from SMITH to ALLEN.
As said earlier, range windows can be used for Numeric and Date values. For Date Values the range can be expressed with numeric value (interpreted as the number of days before and/or after a date) or as INTERVAL DAY TO SECOND or YEAR TO MONTH; see SQL Reference on INTERVAL datatype.
This query counts the number of employees hired between 90 before and one year and 3 months after each employee.
select ename , hiredate , hiredate - INTERVAL '90' DAY start_window , hiredate + INTERVAL '1-3' YEAR TO MONTH end_window , sal , count(hiredate) over ( order by hiredate asc range BETWEEN INTERVAL '90' DAY PRECEDING AND INTERVAL '1-3' YEAR TO MONTH FOLLOWING ) num_of_emps_in_range from emp order by hiredate asc
The results:
ENAME HIREDATE START_WIN END_WINDO SAL NUM_OF_EMPS_IN_RANGE ---------- --------- --------- --------- ---------- -------------------- SMITH 17-DEC-80 18-SEP-80 17-MAR-82 800 12 ALLEN 20-FEB-81 22-NOV-80 20-MAY-82 1600 12 WARD 22-FEB-81 24-NOV-80 22-MAY-82 1250 12 JONES 02-APR-81 02-JAN-81 02-JUL-82 2975 11 BLAKE 01-MAY-81 31-JAN-81 01-AUG-82 2850 11 CLARK 09-JUN-81 11-MAR-81 09-SEP-82 2450 9 TURNER 08-SEP-81 10-JUN-81 08-DEC-82 1500 6 MARTIN 28-SEP-81 30-JUN-81 28-DEC-82 1250 6 KING 18-NOV-81 20-AUG-81 18-FEB-83 5000 6 JAMES 03-DEC-81 04-SEP-81 03-MAR-83 950 6 FORD 03-DEC-81 04-SEP-81 03-MAR-83 3000 6 MILLER 23-JAN-82 25-OCT-81 23-APR-83 1300 4 SCOTT 19-APR-87 19-JAN-87 19-JUL-88 3000 2 ADAMS 23-MAY-87 22-FEB-87 23-AUG-88 1100 2
Here we see that in the period from 90 days before to a year and three months after KING was hired, 5 other employees were hired (remember that the value 6 displayed for KING includes KING himself). We can easily see that this period around KING’s hiredate starts at 30-JUN-1981 and ends at 18-FEB-1983. In this period, everyone from TURNER to MILLER was hired.
Note by the way how you can use the functions FIRST_VALUE and LAST_VALUE to retrieve values from the first and last record in a window. This is described by Tom Kyte in his Expert One-on-One Oracle, Apress, 2003).
select ename , hiredate , first_value(hiredate) over ( order by hiredate asc range BETWEEN INTERVAL '90' DAY PRECEDING AND INTERVAL '1-3' YEAR TO MONTH FOLLOWING ) first_value_in_window , last_value(hiredate) over ( order by hiredate asc range BETWEEN INTERVAL '90' DAY PRECEDING AND INTERVAL '1-3' YEAR TO MONTH FOLLOWING ) last_value_in_window , count(hiredate) over ( order by hiredate asc range BETWEEN INTERVAL '90' DAY PRECEDING AND INTERVAL '1-3' YEAR TO MONTH FOLLOWING ) num_of_emps_in_range from emp order by hiredate asc
Results:
ENAME HIREDATE FIRST_VAL LAST_VALU NUM_OF_EMPS_IN_RANGE ---------- --------- --------- --------- -------------------- SMITH 17-DEC-80 17-DEC-80 23-JAN-82 12 ALLEN 20-FEB-81 17-DEC-80 23-JAN-82 12 WARD 22-FEB-81 17-DEC-80 23-JAN-82 12 JONES 02-APR-81 20-FEB-81 23-JAN-82 11 BLAKE 01-MAY-81 20-FEB-81 23-JAN-82 11 CLARK 09-JUN-81 02-APR-81 23-JAN-82 9 TURNER 08-SEP-81 08-SEP-81 23-JAN-82 6 MARTIN 28-SEP-81 08-SEP-81 23-JAN-82 6 KING 18-NOV-81 08-SEP-81 23-JAN-82 6 JAMES 03-DEC-81 08-SEP-81 23-JAN-82 6 FORD 03-DEC-81 08-SEP-81 23-JAN-82 6 MILLER 23-JAN-82 18-NOV-81 23-JAN-82 4 SCOTT 19-APR-87 19-APR-87 23-MAY-87 2 ADAMS 23-MAY-87 19-APR-87 23-MAY-87 2
In comparison with the previous query, where we had calculated the boundaries of the window, do we see here the actual values for the first and last record in the window at each record.
The range can be expressed in terms of column values or a function performed on column values:
select ename , job , sal , hiredate , avg(sal) over ( partition by job order by sal range 0.25*sal preceding ) job_avg from emp where sal < 2500 order by job
ENAME JOB SAL HIREDATE JOB_AVG ---------- --------- ---------- --------- ---------- SMITH CLERK 800 17-DEC-80 800 JAMES CLERK 950 03-DEC-81 875 ADAMS CLERK 1100 23-MAY-87 1025 MILLER CLERK 1300 23-JAN-82 1200 CLARK MARKETEER 2450 09-JUN-81 2450 WARD SALESMAN 1250 22-FEB-81 1250 MARTIN SALESMAN 1250 28-SEP-81 1250 TURNER SALESMAN 1500 08-SEP-81 1333.33333 ALLEN SALESMAN 1600 20-FEB-81 1400
or somewhat fancier, including a column – COMM – that is not even explicitly selected:
select ename , job , sal , hiredate , avg(sal) over ( partition by job order by sal range nvl(comm,-100)+ round(0.25*sal) preceding ) job_avg from emp where sal < 2500 order by job
rows
The window can also be expressed in terms of a number of rows instead of a range. Rows offer the advantage of not being limited to NUMBER and DATE values as well as ORDER BY expressions on single value. So with using a ROWS expression for the window, you can order the records by more than one column.
The specification of the windowing clause using ROWS is almost the same as definition using RANGE:
ROWS BETWEEN # PRECEDING|CURRENT ROW|FOLLOWING AND # PRECEDING|CURRENT ROW|FOLLOWING
The window is specified by indicating a startpoint and an endpoint (‘the window-boundaries’) in terms of a number of rows. When the window is specified as a range, there is no telling upfront how many rows the window (if any) will contain. With rows, you exactly what number the window will contain – but you have no clue what the spread in values within that window will be. Note: you do not always know for sure how many rows there will be in the window; if you specify unbounded precding or unbounded following you can usually not predict how many rows the window will contain. But even with a fixed number preceding or following, the actual number or records in the window is not certain. For example with a window set up as ROWS 5 PRECEDING and 3 FOLLOWING, you would expect the window to always contain 9 rows. However, the first 5 records will have a smaller window, because they do not have 5 preceding; the first record has none preceding, the second only one etc. The same of course applies to the last three records in the partition.
Here is an example of using a ROWS based window. This query selects for each employee the average salary for all colleagues in the same job that were hired later than the employee. Note that we could have done this query with a RANGE based window just as easily. I do not know whether – when you can use both RANGE or ROWS – there is a preference for either one, from the viewpoint of performance for example. One would expect that if there is any preference, it would be to use rows whenever you can – but that is just a hunch.
select ename , job , sal , hiredate , avg(sal) over ( partition by job order by hiredate rows between 1 following and unbounded following ) job_avg from emp order by job
The results:
ENAME JOB SAL HIREDATE JOB_AVG ---------- --------- ---------- --------- ---------- FORD ANALYST 3000 03-DEC-81 3000 SCOTT ANALYST 3000 19-APR-87 SMITH CLERK 800 17-DEC-80 1116.66667 JAMES CLERK 950 03-DEC-81 1200 MILLER CLERK 1300 23-JAN-82 1100 ADAMS CLERK 1100 23-MAY-87 JONES MANAGER 2975 02-APR-81 2850 BLAKE MANAGER 2850 01-MAY-81 CLARK MARKETEER 2450 09-JUN-81 KING PRESIDENT 5000 18-NOV-81 ALLEN SALESMAN 1600 20-FEB-81 1333.33333 WARD SALESMAN 1250 22-FEB-81 1375 TURNER SALESMAN 1500 08-SEP-81 1250 MARTIN SALESMAN 1250 28-SEP-81
The value of rows in the windowing clause is also a numeric expression that can be derived from column values or calculations upon them. The next query calculates for each employee the average salary over the current record and the salaries for a number of employees with the same job, earning less than the current employee; the number of employees is calculated as 1% of the salary of the current row – and is therefore dynamic.
select ename , job , sal , hiredate , avg(sal) over ( partition by job order by sal rows round(0.01*sal) preceding ) job_avg from emp where sal < 2500 order by job
Static or Fixed Windows
The Window clause can not be set to a fixed window such as: the first and second row in the partition. For example see the following reporting requirement: Select for the numbers 4 and lower on the salary-ranking per department the difference with the average salary of the numbers 2 and 3. For all employees their name, deptno and salary. Your first idea could be something like the following:
select ename , deptno , sal , sal - avg(sal) over ( partition by deptno order by sal desc range between unbounded -2 preceding and unbounded -3 preceding ) diff_nr2_3_sal from emp order by deptno , sal desc
Unfortunately, that is not allowed. Unbounded is perhaps something like infinity in mathematics; infinity minus x is still infinity.
There is a workaround in this case – and probably for most cases. If you use an in-line view to add the row_number within the partition of interest to each record, we have the information we need to work with a quasi-fixed window.
For our example, the next query will do the trick. In the in-line view we select all columns required in the outer query, enriched with the row_number (ranking) of the employee in terms of salary within the department. In the outer query, for all employees ranked 4 or below, we will calculate the difference between his salary and the average of the salaries of the numbers 2 and 3 in that department. To get at the numbers 2 and 3, we need a fixed window ( rows between unbounded -2 preceding and unbounded-3 preceding). As we discussed, that is not allowed. However, now that we know the position of each employee in the partition, we can easily find the window that contains the number 2 and 3 in the department: the (current-rownumber – 2 preceding) takes us to the nr2 in the partition and (current-rownumber – 3 preceding) takes us to the nr 3. So defining the window as
rows between rn-2 preceding and rn-3 preceding
, we have set up a fixed window that always contains the 2nd and 3rd row in the partition.
Note: we cannot use a negative argument for the rows preceding or following (ORA-01428: argument ‘-1’ is out of range). The employees who come 1st or 2nd in their department would give us negative arguments if we were to use rn-2 and rn-3. Therefore we use the
greatest(rn-2,0)
and
greatest(rn-3,0)
. Since the CASE effectively blocks out the result of all employees ranked 1, 2 or 3, it does not really matter what the avg(sal) returns.
select ename , deptno , sal , rn , case when rn <4 then null else avg(sal) over ( partition by deptno order by sal desc rows between greatest(rn-2,0) preceding and greatest(rn-3,0) preceding ) end avg_sal_nrs2_3 , case when rn <4 then null else sal - avg(sal) over ( partition by deptno order by sal desc rows between greatest(rn-2,0) preceding and greatest(rn-3,0) preceding ) end diff_nr2_3_sal from ( select ename , deptno , sal , row_number() over ( partition by deptno order by sal desc ) rn from emp ) order by deptno , sal desc
The results:
ENAME DEPTNO SAL RN AVG_SAL_NRS2_3 DIFF_NR2_3_SAL ---------- ---------- ---------- ---------- -------------- -------------- CLARK 10 2450 1 MILLER 10 1300 2 KING 20 5000 1 SCOTT 20 3000 2 FORD 20 3000 3 JONES 20 2975 4 3000 -25 ADAMS 20 1100 5 3000 -1900 SMITH 20 800 6 3000 -2200 BLAKE 30 2850 1 ALLEN 30 1600 2 TURNER 30 1500 3 WARD 30 1250 4 1550 -300 MARTIN 30 1250 5 1550 -300 JAMES 30 950 6 1550 -600
They tell us for example that JAMES earns 600 less than the average of the numbers 2 and 3 in his department (deptno 30). The numbers two and three apparently are ALLEN and TURNER who make 1600 and 1500 which averages to 1550 which is indeed 600 more than the 940 JAMES is earning.
Using the row_number as additional information for each record, gathered from an inline-view, we were able to work with static windows!
SECOND_VALUE
SQL provides the FIRST_VALUE (and also LAST_VALUE) function. FIRST_VALUE allows us to get the value of a certain column (or expression) for the first record in a window. Note that by using IGNORE_NULLS , you can ensure that the first NON NULL value is returned, even that does not belong to the first – or even the second or third – record in the window.
An example of using FIRST_VALUE is the following, to find out the name of the highest paid employee in each department:
select distinct deptno , first_value(ename) over (partition by deptno order by sal desc ) from emp order by deptno
Results:
DEPTNO FIRST_VALU ---------- ---------- 10 CLARK 20 KING 30 BLAKE
There is not built-in SECOND_VALUE or ONE_BUT_LAST_VALUE etc., but using the fixed window ‘trick’ we discussed above, it is quite easy to simulate a SECOND_VALUE function. To get the SECOND_VALUE in a window, we will instead get at the FIRST_VALUE in a window that has a boundary that is one lower than the original window. For example, to find the name of the runner-up in terms of salary in each department, we would execute the following query:
select distinct deptno , first_value(ename) over ( partition by deptno order by sal desc rows rn-2 preceding ) runner_up from ( select ename , deptno , sal , row_number() over ( partition by deptno order by sal desc ) rn from emp ) where rn > 1 order by deptno
The outcome:
DEPTNO RUNNER_UP ---------- ---------- 10 MILLER 20 SCOTT 30 ALLEN
To verify the correctness of this result, we execute the in-line view:
SQL> select ename 2 , deptno 3 , sal 4 , row_number() over ( partition by deptno 5 order by sal desc 6 ) rn 7 from emp 8 / ENAME DEPTNO SAL RN ---------- ---------- ---------- ---------- CLARK 10 2450 1 MILLER 10 1300 2 KING 20 5000 1 SCOTT 20 3000 2 FORD 20 3000 3 JONES 20 2975 4 ADAMS 20 1100 5 SMITH 20 800 6 BLAKE 30 2850 1 ALLEN 30 1600 2 TURNER 30 1500 3 WARD 30 1250 4 MARTIN 30 1250 5 JAMES 30 950 6
Lag and Lead
When the fixed window contains a single record, such as in the example of the SECOND_VALUE approach, you can also make use of the LAG and LEAD functions. These functions allow you to retrieve the value from a row either preceding the current row or trailing it by an indicated number of rows.
LEAD provides access to more than one row of a table at the same time without a self join. Given a series of rows returned from a query and a position of the cursor, LEAD provides access to a row at a given physical offset beyond that position. If you do not specify offset, then its default is 1. The optional default value is returned if the offset goes beyond the scope of the table. If you do not specify default, then its default value is null. You cannot use LEAD or any other analytic function for value_expr. That is, you cannot nest analytic functions, but you can use other built-in function expressions for value_expr. For example the following query returns the next hiree in the same job looking from a certain employee:
select ename , job , hiredate , lead(ename) over ( partition by job order by hiredate) as next_hiree_in_job from emp order by job , hiredate ENAME JOB HIREDATE NEXT_HIREE ---------- --------- --------- ---------- FORD ANALYST 03-DEC-81 SCOTT SCOTT ANALYST 19-APR-87 SMITH CLERK 17-DEC-80 JAMES JAMES CLERK 03-DEC-81 MILLER MILLER CLERK 23-JAN-82 ADAMS ADAMS CLERK 23-MAY-87 JONES MANAGER 02-APR-81 BLAKE BLAKE MANAGER 01-MAY-81 CLARK MARKETEER 09-JUN-81 KING PRESIDENT 18-NOV-81 ALLEN SALESMAN 20-FEB-81 WARD WARD SALESMAN 22-FEB-81 TURNER TURNER SALESMAN 08-SEP-81 MARTIN MARTIN SALESMAN 28-SEP-81
That means that our previous query to get at the SECOND_VALUE – the runners-up in terms of salary per department, can also be written using LAG:
select distinct deptno , lag(ename, rn-2) over ( partition by deptno order by sal desc ) lag_runner_up , first_value(ename) over ( partition by deptno order by sal desc rows rn-2 preceding ) first_value_runner_up from ( select ename , deptno , sal , row_number() over ( partition by deptno order by sal desc ) rn from emp ) where rn > 1 order by deptno / DEPTNO LAG_RUNNER FIRST_VALU ---------- ---------- ---------- 10 MILLER MILLER 20 SCOTT SCOTT 30 ALLEN ALLEN
Resources
Links to useful documentation:
Oracle Documentation:
SQL Reference
– Introduction/Overview Analytic Functions
– AVG()
Oracle 10g – SQL for Analysis in Data Warehouses
– Reporting Aggregate Function
Good tutorial with great explanation!
Other examples of analytical functions we can see here http://www.youtube.com/watch?v=9zuCJq_YrhI
Regards
Very Good, explaination becomes meaningfull with SQL-Plus diagrams. Thanks.
Nice tutorial!
Very well explained.
Another article discussing Analytical Functions: Analytic functions by Example, by Shouvik Basu, November 15, 2004
The tutorial was very helpful for me. Thanks.
nice tutorial.very helpful and descriptive.