To understand how Analytical Functions are used in SQL statements, you really have to look at the four different parts of an Analytical ‘clause’:
- the analytical function, for example AVG, LEAD, PERCENTILE_RANK
- the partitioning clause, for example PARTITION BY job or PARTITION BY dept, job
- the order by clause, for example order by job nulls last
- the windowing clause, for example RANGE UNBOUNDED PRECEDING or ROWS UNBOUNDED FOLLOWING
Note that typically only the Analytical Function is required and the other clauses are optional. However, for certain functions, an Order By clause may be required.
This post will focus on the Partition Clause. This next post talks about the ORDER BY and WINDOWING clauses. This post is in a way a successor to this one on Analytical Functions used for complex aggregation issues.
Analytical Functions
(also see documentation)
There are many categories of Analytical Functions. The following list is taken from the 10g Data Warehousing Guide:
- Ranking Functions (RANK and DENSE_RANK, CUME_DIST, PERCENT_RANK, NTILE ,ROW_NUMBER )
- Windowing Aggregate Functions(SUM|AVG|MAX|MIN|COUNT|STDDEV|VARIANCE|FIRST_VALUE|LAST_VALUE)
- Reporting Aggregate Functions
- LAG/LEAD Functions
- FIRST/LAST Functions
- Inverse Percentile Functions(PERCENTILE_CONT, PERCENTILE_DISC)
- Hypothetical Rank and Distribution Functions(RANK | DENSE_RANK | PERCENT_RANK | CUME_DIST)
- Linear Regression Functions (REGR_COUNT, REGR_AVGY and REGR_AVGX, REGR_SLOPE and REGR_INTERCEPT, REGR_R2, REGR_SXX, REGR_SYY, and REGR_SXY )
- Other Statistical Functions
- WIDTH_BUCKET Function
The basic syntax of a query using an Analytical Function is used in the next query:
select distinct d.dname , avg(e.sal) over () average_salary_in_dept from dept d natural join emp e
DNAME AVERAGE_SALARY_IN_DEPT -------------- ---------------------- ACCOUNTING 2073.21429 RESEARCH 2073.21429 SALES 2073.21429
By using
over()
right after the function, we indicate that we want to use the Analytical ‘version’ of the function AVG. Later we will see how we add functionality and complexity to the Analytical expression by adding clauses to the over() expression. Used like this, over() simply indicates: perform the analytical function on the entire set of records returned by the query.
We have already seen that some the Analytical functions are the familiar aggregation operators in a new context, like AVG(), SUM() and COUNT(). Note that in some of them – for example AVG and COUNT – you can use the DISTINCT operator, like this:
select distinct count(distinct mgr) over () number_of_mgrs from emp
This query selects the number of different managers our employees have. Note: without using the distinct, we would get a higher number because several employees share the same manager.
NUMBER_OF_MGRS -------------- 6
Once we have discussed the other clauses we will return to discuss a variety of Analytical Functions at our disposal.
Partition Clause
The PARTITION clause is used for grouping the records. When you use partitions, the analytical function will derive its value for the current record from the records in the same partition as the current record, instead of using the entire set of records returned by the query as is done for Analytical expressions without Partition clause.
We want to see a list of all departments (that have employees) with the average salary in that department. The following two queries give the same result, one, traditional, with and one without using the group by. The second, analytical query, can more easily be extended to return other information as well, such as an aggregation at a different grouping level.
select d.dname , avg(e.sal) from dept d natural join emp e group by d.dname
Using an analytical function:
select distinct d.dname , avg(e.sal) over (partition by d.dname) Average_Salary_in_Dept from dept d natural join emp e
Read the analytical part of this query as follows: ‘for each record returned, select the average of salary values for all rows in the partition (subset, group) with the same department-name as the current record’s department-name‘. This also illustrated in this picture:
We can see that the average salary for ACCOUNTING is arrived at by taking the partition that consists of all Employees in department ACCOUNTING and then calculating the average of all their salaries.
The result in all cases is the following:
DNAME AVG(E.SAL) -------------- ---------- ACCOUNTING 1875 RESEARCH 2645.83333 SALES 1566.66667
By the way, it is interesting to see how the query can contain different – orthogonal – ways of partitioning the rows. See for example:
select ename , job , sal , deptno , avg(sal) over (partition by job) job_avg , avg(sal) over (partition by deptno) deptno_avg , avg(sal) over () total_avg from emp where sal < 2500 order by job
The results:
ENAME JOB SAL DEPTNO JOB_AVG DEPTNO_AVG TOTAL_AVG ---------- --------- ---------- ---------- ---------- ---------- ---------- MILLER CLERK 1300 10 1037.5 1875 1355.55556 SMITH CLERK 800 20 1037.5 950 1355.55556 ADAMS CLERK 1100 20 1037.5 950 1355.55556 JAMES CLERK 950 30 1037.5 1310 1355.55556 CLARK MARKETEER 2450 10 2450 1875 1355.55556 ALLEN SALESMAN 1600 30 1400 1310 1355.55556 TURNER SALESMAN 1500 30 1400 1310 1355.55556 WARD SALESMAN 1250 30 1400 1310 1355.55556 MARTIN SALESMAN 1250 30 1400 1310 1355.55556
Now look at this query, showing the average salary in the department alongside every employee in that department. And let’s throw in the difference between each employee’s salary and the department’s average:
select e.ename , d.dname , e.sal , avg(e.sal) over (partition by d.dname) Average_Salary_in_Dept , e.sal - avg(e.sal) over (partition by d.dname) Diff_With_Dept_Avg from dept d natural join emp e order by d.dname
The results:
ENAME DNAME SAL AVERAGE_SALARY_IN_DEPT DIFF_WITH_DEPT_AVG ---------- -------------- ---------- ---------------------- ------------------ CLARK ACCOUNTING 2450 1875 575 MILLER ACCOUNTING 1300 1875 -575 SMITH RESEARCH 800 2645.83333 -1845.8333 SCOTT RESEARCH 3000 2645.83333 354.166667 ADAMS RESEARCH 1100 2645.83333 -1545.8333 FORD RESEARCH 3000 2645.83333 354.166667 KING RESEARCH 5000 2645.83333 2354.16667 JONES RESEARCH 2975 2645.83333 329.166667 ALLEN SALES 1600 1566.66667 33.3333333 BLAKE SALES 2850 1566.66667 1283.33333 MARTIN SALES 1250 1566.66667 -316.66667 JAMES SALES 950 1566.66667 -616.66667 TURNER SALES 1500 1566.66667 -66.666667 WARD SALES 1250 1566.66667 -316.66667
This picture illustrates the processing of the Analytical expressions in this query:
Perhaps people who are disgruntled because they earn less than the department’s average can be cheered up by comparing their salary in the same query with the average salary in their job-category and within the group of people working for the same manager:
select e.ename , d.dname , e.sal , e.mgr , e.job , avg(e.sal) over (partition by d.dname) Avg_in_Dept , avg(e.sal) over (partition by e.job) Avg_in_Job , avg(e.sal) over (partition by e.mgr) Avg_under_Mgr from dept d natural join emp e order by d.dname
The results are:
ENAME DNAME SAL MGR JOB AVG_IN_DEPT AVG_IN_JOB AVG_UNDER_MGR ---------- -------------- ---------- ---------- --------- ----------- ---------- ------------- MILLER ACCOUNTING 1300 7782 CLERK 1875 1037.5 1300 CLARK ACCOUNTING 2450 7839 MARKETEER 1875 2450 2758.33333 SCOTT RESEARCH 3000 7566 ANALYST 2645.83333 3000 3000 FORD RESEARCH 3000 7566 ANALYST 2645.83333 3000 3000 ADAMS RESEARCH 1100 7788 CLERK 2645.83333 1037.5 1100 KING RESEARCH 5000 PRESIDENT 2645.83333 5000 5000 JONES RESEARCH 2975 7839 MANAGER 2645.83333 2912.5 2758.33333 SMITH RESEARCH 800 7902 CLERK 2645.83333 1037.5 800 JAMES SALES 950 7698 CLERK 1566.66667 1037.5 1310 WARD SALES 1250 7698 SALESMAN 1566.66667 1400 1310 TURNER SALES 1500 7698 SALESMAN 1566.66667 1400 1310 MARTIN SALES 1250 7698 SALESMAN 1566.66667 1400 1310 ALLEN SALES 1600 7698 SALESMAN 1566.66667 1400 1310 BLAKE SALES 2850 7839 MANAGER 1566.66667 2912.5 2758.33333
Miller at least will feel better; he may be substantially lower paid than this colleagues in Department ACCOUNTING but he is much better off than the average CLERK. Since he is the only subordinate for Mgr 7782, there is no useful comparison there.
This picture illustrates this query with Analytical processing along three different partition-dimensions:
Note that when no explicit partition is defined, the entire set of records returned by the query is treated as the default partition.
Suppose we would like to find out the number different managers that are responsible for employees in a certain job – how many managers manage CLERKS? – as well as how many different managers are responsible for employees in a certain department – how many managers are involved in managing employees working in Department 30? – as well as the overall number of managers. We have to deal with three partitions – one which is the entire set of records or the default partition. We also need not just to count the Managers, because that way we would count duplicates. We will count using the distinct operator.
select job , deptno , mgr , count(distinct mgr) over (partition by job) num_of_job_mgrs , count(distinct mgr) over (partition by deptno) num_of_dept_mgrs , count(distinct mgr) over () number_of_mgrs from emp order by job
JOB DEPTNO MGR NUM_OF_JOB_MGRS NUM_OF_DEPT_MGRS NUMBER_OF_MGRS --------- ---------- ---------- --------------- ---------------- -------------- ANALYST 20 7566 1 4 6 ANALYST 20 7566 1 4 6 CLERK 30 7698 4 2 6 CLERK 10 7782 4 2 6 CLERK 20 7788 4 4 6 CLERK 20 7902 4 4 6 MANAGER 20 7839 1 4 6 MANAGER 30 7839 1 2 6 MARKETEER 10 7839 1 2 6 PRESIDENT 20 0 4 6 SALESMAN 30 7698 1 2 6 SALESMAN 30 7698 1 2 6 SALESMAN 30 7698 1 2 6 SALESMAN 30 7698 1 2 6
More advanced partition clauses
The value on which to partition can be calculated using columnvalues, functions or even scalar subqueries:
select ename , to_char(hiredate, 'DAY') day_of_week , sal , hiredate , avg(sal) over ( partition by to_char(hiredate, 'DAY') ) day_of_week_avg from emp where sal < 2500 order by to_char(hiredate, 'DAY')
ENAME DAY_OF_WE SAL HIREDATE DAY_OF_WEEK_AVG ---------- --------- ---------- --------- --------------- ALLEN FRIDAY 1600 20-FEB-81 1600 MARTIN MONDAY 1250 28-SEP-81 1250 ADAMS SATURDAY 1100 23-MAY-87 1200 MILLER SATURDAY 1300 23-JAN-82 1200 WARD SUNDAY 1250 22-FEB-81 1250 JAMES THURSDAY 950 03-DEC-81 950 CLARK TUESDAY 2450 09-JUN-81 1975 TURNER TUESDAY 1500 08-SEP-81 1975 SMITH WEDNESDAY 800 17-DEC-80 800
To illustrate how wildly exotic the partition clause can be defined, see the following utterly pointless example:
select ename , mod(hiredate - (select min(hiredate) from emp),7) pointless , sal , hiredate , avg(sal) over ( partition by mod(hiredate - (select min(hiredate) from emp),7)) pointless_avg from emp where sal < 2500 order by mod(hiredate - (select min(hiredate) from emp),7)
ENAME POINTLESS SAL HIREDATE POINTLESS_AVG ---------- ---------- ---------- --------- ------------- SMITH 0 800 17-DEC-80 800 JAMES 1 950 03-DEC-81 950 ALLEN 2 1600 20-FEB-81 1600 ADAMS 3 1100 23-MAY-87 1200 MILLER 3 1300 23-JAN-82 1200 WARD 4 1250 22-FEB-81 1250 MARTIN 5 1250 28-SEP-81 1250 CLARK 6 2450 09-JUN-81 1975 TURNER 6 1500 08-SEP-81 1975
This example illustrates how you can define very advanced ways of grouping the records returned by the query to perform analysis on. If you feel like grouping employees by the 3rd character in their name, the sum of the last two digits in their empno, the ratio between the number of days since the hiredate and their salary or whatever far fetched way of grouping you can come up with – it can be done! However, the partition clause can not use Analytical Functions directly – you cannot for example partition by rank-on-salary-within-department. To do something like that, you need an in-line view construction.
If you would like to calculate the average salary for the employees grouped by their salary rank in their own department, you would have to do something like this:
select distinct sal_rank , avg(sal) over (partition by sal_rank) avg_sal_for_rank from ( select empno , sal , deptno , rank() over ( partition by deptno order by sal desc ) sal_rank from emp )
SAL_RANK AVG_SAL_FOR_RANK ---------- ---------------- 1 3433.33333 2 2225 3 1500 4 1825 5 1100 6 875
This tells us for example that the best earning employees from each department on average make some 3433 (dollars?). The Runners-up in each department on average make 2225; by executing the inline view
select empno , sal , deptno , rank() over ( partition by deptno order by sal desc ) sal_rank from emp
EMPNO SAL DEPTNO SAL_RANK ---------- ---------- ---------- ---------- 7782 2450 10 1 7934 1300 10 2 7839 5000 20 1 7788 3000 20 2 7902 3000 20 2 7566 2975 20 4 7876 1100 20 5 7369 800 20 6 7698 2850 30 1 7499 1600 30 2 7844 1500 30 3 7521 1250 30 4 7654 1250 30 4 7900 950 30 6
we can easily see that the runners-up are 7934, 7788 and 7902 and 7499 respectively earning 1300, 3000, 3000 and 1600. On average this is 2225.
Finally you should realize that the partition clause can partition the records on more than one value. All examples thusfar grouped on a single value, but the following query with the multi-dimensional partion clause is perfectly valid. In this case we want to know the average salary per department and per hireyear:
select ename , deptno , hiredate , sal , avg(sal) over (partition by deptno, to_char(hiredate,'yyyy')) avg_sal from emp order by deptno , hiredate
The result of this query – not so very interesting due to the small number of records. The most important effect is in department 20.
ENAME DEPTNO HIREDATE SAL AVG_SAL ---------- ---------- --------- ---------- ---------- CLARK 10 09-JUN-81 2450 2450 MILLER 10 23-JAN-82 1300 1300 SMITH 20 17-DEC-80 800 800 JONES 20 02-APR-81 2975 3658.33333 KING 20 18-NOV-81 5000 3658.33333 FORD 20 03-DEC-81 3000 3658.33333 SCOTT 20 19-APR-87 3000 2050 ADAMS 20 23-MAY-87 1100 2050 ALLEN 30 20-FEB-81 1600 1566.66667 WARD 30 22-FEB-81 1250 1566.66667 BLAKE 30 01-MAY-81 2850 1566.66667 TURNER 30 08-SEP-81 1500 1566.66667 MARTIN 30 28-SEP-81 1250 1566.66667 JAMES 30 03-DEC-81 950 1566.66667
To Be Continued…
Part II, to be posted in a few days time, will discuss the Order By and Windowing Clauses. After that I intend to discuss several specific Analytical Functions and a number of advanced cases.
Thanks for an excellent overview with examples
Excellent,very clear to understand………..
Excellent,very clear to understand………..
Excellent, I like the article, it was very clear the explanation.
Thanks
Very Good Explanation. Please keep it up. We read your notes. I would like to know about
the defination of Inline views & why we use it, also why we use SELECT CLAUSE in a SELECT STATEMENT, LIKE
SELECT EMP_ID, DEPT_NAME,
SELECT (SAL…..FROM XYZ)
Thanks,
hi , this is just super. the way the author has put in his ideas are super. keep it up.
A very useful introductory article was just published in Oracle Magazine and on OTN: Find Answers Faster By Jonathan Gennick and Anthony Molinaro Analytic functions provide powerful ways to view your data.
I like your site.