Analytical SQL Functions – theory and examples – Part 1 on the Partition Clause

9

To understand how Analytical Functions are used in SQL statements, you really have to look at the four different parts of an Analytical ‘clause’:

  • the analytical function, for example AVG, LEAD, PERCENTILE_RANK
  • the partitioning clause, for example PARTITION BY job or PARTITION BY dept, job
  • the order by clause, for example order by job nulls last
  • the windowing clause, for example RANGE UNBOUNDED PRECEDING or ROWS UNBOUNDED FOLLOWING

Note that typically only the Analytical Function is required and the other clauses are optional. However, for certain functions, an Order By clause may be required.

This post will focus on the Partition Clause. This next post talks about the ORDER BY and WINDOWING clauses. This post is in a way a successor to this one on Analytical Functions used for complex aggregation issues.

Analytical Functions

(also see documentation)

There are many categories of Analytical Functions. The following list is taken from the 10g Data Warehousing Guide:

The basic syntax of a query using an Analytical Function is used in the next query:

select  distinct
        d.dname
,       avg(e.sal) over () average_salary_in_dept
from    dept d
        natural join
        emp  e
DNAME          AVERAGE_SALARY_IN_DEPT
-------------- ----------------------
ACCOUNTING                 2073.21429
RESEARCH                   2073.21429
SALES                      2073.21429

By using

over()

right after the function, we indicate that we want to use the Analytical ‘version’ of the function AVG. Later we will see how we add functionality and complexity to the Analytical expression by adding clauses to the over() expression. Used like this, over() simply indicates: perform the analytical function on the entire set of records returned by the query.

We have already seen that some the Analytical functions are the familiar aggregation operators in a new context, like AVG(), SUM() and COUNT(). Note that in some of them – for example AVG and COUNT – you can use the DISTINCT operator, like this:

select  distinct
        count(distinct mgr) over () number_of_mgrs
from    emp

This query selects the number of different managers our employees have. Note: without using the distinct, we would get a higher number because several employees share the same manager.

NUMBER_OF_MGRS
--------------
             6

Once we have discussed the other clauses we will return to discuss a variety of Analytical Functions at our disposal.

Partition Clause

The PARTITION clause is used for grouping the records. When you use partitions, the analytical function will derive its value for the current record from the records in the same partition as the current record, instead of using the entire set of records returned by the query as is done for Analytical expressions without Partition clause.

We want to see a list of all departments (that have employees) with the average salary in that department. The following two queries give the same result, one, traditional, with and one without using the group by. The second, analytical query, can more easily be extended to return other information as well, such as an aggregation at a different grouping level.

select  d.dname
,       avg(e.sal)
from    dept d
        natural join
        emp  e
group
by      d.dname

Using an analytical function:

select  distinct
        d.dname
,       avg(e.sal) over (partition by d.dname)  Average_Salary_in_Dept
from    dept d
        natural join
        emp  e

Read the analytical part of this query as follows: ‘for each record returned, select the average of salary values for all rows in the partition (subset, group) with the same department-name as the current record’s department-name‘. This also illustrated in this picture:

We can see that the average salary for ACCOUNTING is arrived at by taking the partition that consists of all Employees in department ACCOUNTING and then calculating the average of all their salaries.

The result in all cases is the following:

DNAME          AVG(E.SAL)
-------------- ----------
ACCOUNTING           1875
RESEARCH       2645.83333
SALES          1566.66667

By the way, it is interesting to see how the query can contain different – orthogonal – ways of partitioning the rows. See for example:

select ename
,      job
,      sal
,      deptno
,      avg(sal) over (partition by job) job_avg
,      avg(sal) over (partition by deptno) deptno_avg
,      avg(sal) over () total_avg
from   emp
where  sal < 2500
order
by     job

The results:

ENAME      JOB              SAL     DEPTNO    JOB_AVG DEPTNO_AVG  TOTAL_AVG
---------- --------- ---------- ---------- ---------- ---------- ----------
MILLER     CLERK           1300         10     1037.5       1875 1355.55556
SMITH      CLERK            800         20     1037.5        950 1355.55556
ADAMS      CLERK           1100         20     1037.5        950 1355.55556
JAMES      CLERK            950         30     1037.5       1310 1355.55556
CLARK      MARKETEER       2450         10       2450       1875 1355.55556
ALLEN      SALESMAN        1600         30       1400       1310 1355.55556
TURNER     SALESMAN        1500         30       1400       1310 1355.55556
WARD       SALESMAN        1250         30       1400       1310 1355.55556
MARTIN     SALESMAN        1250         30       1400       1310 1355.55556

Now look at this query, showing the average salary in the department alongside every employee in that department. And let's throw in the difference between each employee's salary and the department's average:

select  e.ename
,       d.dname
,       e.sal
,       avg(e.sal) over (partition by d.dname)  Average_Salary_in_Dept
,       e.sal - avg(e.sal) over (partition by d.dname)  Diff_With_Dept_Avg
from    dept d
        natural join
        emp  e
order
by      d.dname

The results:

ENAME      DNAME                 SAL AVERAGE_SALARY_IN_DEPT DIFF_WITH_DEPT_AVG
---------- -------------- ---------- ---------------------- ------------------
CLARK      ACCOUNTING           2450                   1875                575
MILLER     ACCOUNTING           1300                   1875               -575
SMITH      RESEARCH              800             2645.83333         -1845.8333
SCOTT      RESEARCH             3000             2645.83333         354.166667
ADAMS      RESEARCH             1100             2645.83333         -1545.8333
FORD       RESEARCH             3000             2645.83333         354.166667
KING       RESEARCH             5000             2645.83333         2354.16667
JONES      RESEARCH             2975             2645.83333         329.166667
ALLEN      SALES                1600             1566.66667         33.3333333
BLAKE      SALES                2850             1566.66667         1283.33333
MARTIN     SALES                1250             1566.66667         -316.66667
JAMES      SALES                 950             1566.66667         -616.66667
TURNER     SALES                1500             1566.66667         -66.666667
WARD       SALES                1250             1566.66667         -316.66667

This picture illustrates the processing of the Analytical expressions in this query:

Perhaps people who are disgruntled because they earn less than the department's average can be cheered up by comparing their salary in the same query with the average salary in their job-category and within the group of people working for the same manager:

select  e.ename
,       d.dname
,       e.sal
,       e.mgr
,       e.job
,       avg(e.sal) over (partition by d.dname)  Avg_in_Dept
,       avg(e.sal) over (partition by e.job)  Avg_in_Job
,       avg(e.sal) over (partition by e.mgr)  Avg_under_Mgr
from    dept d
        natural join
        emp  e
order
by      d.dname

The results are:

ENAME      DNAME                 SAL        MGR JOB       AVG_IN_DEPT AVG_IN_JOB AVG_UNDER_MGR
---------- -------------- ---------- ---------- --------- ----------- ---------- -------------
MILLER     ACCOUNTING           1300       7782 CLERK            1875     1037.5          1300
CLARK      ACCOUNTING           2450       7839 MARKETEER        1875       2450    2758.33333
SCOTT      RESEARCH             3000       7566 ANALYST    2645.83333       3000          3000
FORD       RESEARCH             3000       7566 ANALYST    2645.83333       3000          3000
ADAMS      RESEARCH             1100       7788 CLERK      2645.83333     1037.5          1100
KING       RESEARCH             5000            PRESIDENT  2645.83333       5000          5000
JONES      RESEARCH             2975       7839 MANAGER    2645.83333     2912.5    2758.33333
SMITH      RESEARCH              800       7902 CLERK      2645.83333     1037.5           800
JAMES      SALES                 950       7698 CLERK      1566.66667     1037.5          1310
WARD       SALES                1250       7698 SALESMAN   1566.66667       1400          1310
TURNER     SALES                1500       7698 SALESMAN   1566.66667       1400          1310
MARTIN     SALES                1250       7698 SALESMAN   1566.66667       1400          1310
ALLEN      SALES                1600       7698 SALESMAN   1566.66667       1400          1310
BLAKE      SALES                2850       7839 MANAGER    1566.66667     2912.5    2758.33333

Miller at least will feel better; he may be substantially lower paid than this colleagues in Department ACCOUNTING but he is much better off than the average CLERK. Since he is the only subordinate for Mgr 7782, there is no useful comparison there.

This picture illustrates this query with Analytical processing along three different partition-dimensions:

Note that when no explicit partition is defined, the entire set of records returned by the query is treated as the default partition.

Suppose we would like to find out the number different managers that are responsible for employees in a certain job - how many managers manage CLERKS? - as well as how many different managers are responsible for employees in a certain department - how many managers are involved in managing employees working in Department 30? - as well as the overall number of managers. We have to deal with three partitions - one which is the entire set of records or the default partition. We also need not just to count the Managers, because that way we would count duplicates. We will count using the distinct operator.

select job
,      deptno
,      mgr
,      count(distinct mgr) over (partition by job) num_of_job_mgrs
,      count(distinct mgr) over (partition by deptno) num_of_dept_mgrs
,      count(distinct mgr) over () number_of_mgrs
from   emp
order
by     job
JOB           DEPTNO        MGR NUM_OF_JOB_MGRS NUM_OF_DEPT_MGRS NUMBER_OF_MGRS
--------- ---------- ---------- --------------- ---------------- --------------
ANALYST           20       7566               1                4              6
ANALYST           20       7566               1                4              6
CLERK             30       7698               4                2              6
CLERK             10       7782               4                2              6
CLERK             20       7788               4                4              6
CLERK             20       7902               4                4              6
MANAGER           20       7839               1                4              6
MANAGER           30       7839               1                2              6
MARKETEER         10       7839               1                2              6
PRESIDENT         20                          0                4              6
SALESMAN          30       7698               1                2              6
SALESMAN          30       7698               1                2              6
SALESMAN          30       7698               1                2              6
SALESMAN          30       7698               1                2              6

More advanced partition clauses

The value on which to partition can be calculated using columnvalues, functions or even scalar subqueries:

select ename
,      to_char(hiredate, 'DAY') day_of_week
,      sal
,      hiredate
,      avg(sal) over ( partition by to_char(hiredate, 'DAY')
                     ) day_of_week_avg
from   emp
where  sal < 2500
order
by     to_char(hiredate, 'DAY')
ENAME      DAY_OF_WE        SAL HIREDATE  DAY_OF_WEEK_AVG
---------- --------- ---------- --------- ---------------
ALLEN      FRIDAY          1600 20-FEB-81            1600
MARTIN     MONDAY          1250 28-SEP-81            1250
ADAMS      SATURDAY        1100 23-MAY-87            1200
MILLER     SATURDAY        1300 23-JAN-82            1200
WARD       SUNDAY          1250 22-FEB-81            1250
JAMES      THURSDAY         950 03-DEC-81             950
CLARK      TUESDAY         2450 09-JUN-81            1975
TURNER     TUESDAY         1500 08-SEP-81            1975
SMITH      WEDNESDAY        800 17-DEC-80             800

To illustrate how wildly exotic the partition clause can be defined, see the following utterly pointless example:

select ename
,      mod(hiredate - (select min(hiredate) from emp),7) pointless
,      sal
,      hiredate
,      avg(sal) over ( partition by mod(hiredate - (select min(hiredate) from emp),7)) pointless_avg
from   emp
where  sal < 2500
order
by     mod(hiredate - (select min(hiredate) from emp),7)
ENAME       POINTLESS        SAL HIREDATE  POINTLESS_AVG
---------- ---------- ---------- --------- -------------
SMITH               0        800 17-DEC-80           800
JAMES               1        950 03-DEC-81           950
ALLEN               2       1600 20-FEB-81          1600
ADAMS               3       1100 23-MAY-87          1200
MILLER              3       1300 23-JAN-82          1200
WARD                4       1250 22-FEB-81          1250
MARTIN              5       1250 28-SEP-81          1250
CLARK               6       2450 09-JUN-81          1975
TURNER              6       1500 08-SEP-81          1975

This example illustrates how you can define very advanced ways of grouping the records returned by the query to perform analysis on. If you feel like grouping employees by the 3rd character in their name, the sum of the last two digits in their empno, the ratio between the number of days since the hiredate and their salary or whatever far fetched way of grouping you can come up with - it can be done! However, the partition clause can not use Analytical Functions directly - you cannot for example partition by rank-on-salary-within-department. To do something like that, you need an in-line view construction.

If you would like to calculate the average salary for the employees grouped by their salary rank in their own department, you would have to do something like this:

select distinct
       sal_rank
,      avg(sal) over (partition by sal_rank) avg_sal_for_rank
from   ( select empno
         ,      sal
         ,      deptno
         ,      rank() over ( partition by deptno
                              order by sal desc
                            ) sal_rank
         from emp
       )
  SAL_RANK AVG_SAL_FOR_RANK
---------- ----------------
         1       3433.33333
         2             2225
         3             1500
         4             1825
         5             1100
         6              875

This tells us for example that the best earning employees from each department on average make some 3433 (dollars?). The Runners-up in each department on average make 2225; by executing the inline view

select empno
         ,      sal
         ,      deptno
         ,      rank() over ( partition by deptno
                              order by sal desc
                            ) sal_rank
         from emp
     EMPNO        SAL     DEPTNO   SAL_RANK
---------- ---------- ---------- ----------
      7782       2450         10          1
      7934       1300         10          2
      7839       5000         20          1
      7788       3000         20          2
      7902       3000         20          2
      7566       2975         20          4
      7876       1100         20          5
      7369        800         20          6
      7698       2850         30          1
      7499       1600         30          2
      7844       1500         30          3
      7521       1250         30          4
      7654       1250         30          4
      7900        950         30          6

we can easily see that the runners-up are 7934, 7788 and 7902 and 7499 respectively earning 1300, 3000, 3000 and 1600. On average this is 2225.

Finally you should realize that the partition clause can partition the records on more than one value. All examples thusfar grouped on a single value, but the following query with the multi-dimensional partion clause is perfectly valid. In this case we want to know the average salary per department and per hireyear:

select ename
,      deptno
,      hiredate
,      sal
,      avg(sal) over (partition by deptno, to_char(hiredate,'yyyy')) avg_sal
from   emp
order
by     deptno
,      hiredate

The result of this query - not so very interesting due to the small number of records. The most important effect is in department 20.

ENAME          DEPTNO HIREDATE         SAL    AVG_SAL
---------- ---------- --------- ---------- ----------
CLARK              10 09-JUN-81       2450       2450
MILLER             10 23-JAN-82       1300       1300
SMITH              20 17-DEC-80        800        800
JONES              20 02-APR-81       2975 3658.33333
KING               20 18-NOV-81       5000 3658.33333
FORD               20 03-DEC-81       3000 3658.33333
SCOTT              20 19-APR-87       3000       2050
ADAMS              20 23-MAY-87       1100       2050
ALLEN              30 20-FEB-81       1600 1566.66667
WARD               30 22-FEB-81       1250 1566.66667
BLAKE              30 01-MAY-81       2850 1566.66667
TURNER             30 08-SEP-81       1500 1566.66667
MARTIN             30 28-SEP-81       1250 1566.66667
JAMES              30 03-DEC-81        950 1566.66667

To Be Continued...

Part II, to be posted in a few days time, will discuss the Order By and Windowing Clauses. After that I intend to discuss several specific Analytical Functions and a number of advanced cases.

Share.

About Author

Lucas Jellema, active in IT (and with Oracle) since 1994. Oracle ACE Director for Fusion Middleware. Consultant, trainer and instructor on diverse areas including Oracle Database (SQL & PLSQL), Service Oriented Architecture, BPM, ADF, Java in various shapes and forms and many other things. Author of the Oracle Press book: Oracle SOA Suite 11g Handbook. Frequent presenter on conferences such as JavaOne, Oracle OpenWorld, ODTUG Kaleidoscope, Devoxx and OBUG. Presenter for Oracle University Celebrity specials.

9 Comments

  1. Very Good Explanation. Please keep it up. We read your notes. I would like to know about
    the defination of Inline views & why we use it, also why we use SELECT CLAUSE in a SELECT STATEMENT, LIKE
    SELECT EMP_ID, DEPT_NAME,
    SELECT (SAL…..FROM XYZ)
    Thanks,

  2. hi , this is just super. the way the author has put in his ideas are super. keep it up.

  3. A very useful introductory article was just published in Oracle Magazine and on OTN: Find Answers Faster By Jonathan Gennick and Anthony Molinaro Analytic functions provide powerful ways to view your data.

  4. Pingback: » Analytical SQL Functions - theory and examples - Part 2 on the Order By and Windowing Clauses