Introduction to Analytical Functions – Use of Analytics as alternative for Simple Aggregations

4

I will attempt to introduce analytical functions in a step-by-step manner. A first step into Analytical country could be the following, starting from a very simple aggregation query, exploring several types of aggregations that are increasingly complex and increasingly impossible to implement without analytics (at least in a efficient, performant way):

select  d.dname
,       avg(e.sal)
from    dept d
        natural join
        emp  e
group
by      d.dname

The following two queries give the same result, without using the group by. They can therefore more easily be extended to return other information as well, such as an aggegation at a different grouping level.

Using an analytical function:

select  distinct
        d.dname
,       avg(e.sal) over (partition by d.dname)  -- the analytical avg function
from    dept d
        natural join
        emp  e

Read the analytical part of this query as follows: ‘for each record returned, select the average of salary values for all rows in the partition (subset, group) with the same department-name as the current record’s department-name‘.

Using a scalar subquery:

select  d.dname
,       (select avg(e.sal) from emp e where e.deptno = d.deptno)
from    dept d

The result in all cases is the following:

DNAME          AVG(E.SAL)
-------------- ----------
ACCOUNTING           1875
RESEARCH       2645.83333
SALES          1566.66667

Now it was my impression that the following should also work:

create type numTable as table of number
/

select  d.dname
,       ( select avg(column_value)
          from   table
                 ( cast( d.salaries as numTable)
                 )
        ) Department_Avg_Salary
from    ( select  cast
                  ( multiset
                    ( select e.sal
                      from   emp e
                      where e.deptno = d1.deptno
                     ) as numTable
                   ) salaries
          ,       d1.dname
          from    dept d1
        ) d

I am not sure exactly where to go with the example, but it is a somewhat other approach. However, it does not work: it returns the average salary for the first department in all four rows. I have posted a question on the OTN Forum to see whether anyone knows what I might be doing wrong.

I have been given the correct query on the OTN Forum, by Barbara A Boehmer:

select d.dname
,      ( select avg(column_value)
         from   table (cast (d.salaries as numTable))
         where  d.dname = dname) Department_Avg_Salary
from   ( select cast
                ( multiset
                  ( select e.sal
                    from   emp e
                    where  e.deptno = d1.deptno
                  ) as numTable
                ) salaries
         ,      d1.dname
         from    dept d1
       ) d

I must confess that I do not completely understand why the additional

where  d.dname = dname

is required since I thought that each row’s salaries value only contained salary values for that row anyway. Perhaps I have misjudged the order of evaluation…?

Let’s say we want to see not only the average salary per department but also the over-all average – and the difference between each department’s average and the company’s average.

It seems a question that can not easly be answered using the standard ‘select avg group by’ construction, only through a construction like:

select  dname
,       avg_sal
,       avg_sal - total_avg difference
from    ( select  d.dname
          ,       avg(e.sal) avg_sal
          ,       total_avg
          from    dept d
                  natural join
                  emp  e
          ,      ( select avg(sal) total_avg from emp) avg_sal
          group
          by      d.dname, total_avg
          )

which is just using a scalar subquery.

Another solution based on scalar subqueries:

select  dname
,       avg_sal
,       avg_sal - total_avg difference
from    ( select  d.dname
          ,       (select avg(e.sal) from emp e where e.deptno = d.deptno) avg_sal
          ,       (select avg(e.sal) from emp e) total_avg
          from    dept d
          )

The Analytical Solution looks as follows:

select  dname
,       avg_sal
,       avg_sal - total_avg difference
from    ( select  distinct
                  d.dname
        ,       avg(e.sal) over (partition by d.dname) avg_sal
        ,       avg(e.sal) over () total_avg
        from    dept d
                natural join
                emp  e
        )

or simply:

select  distinct
        d.dname
,       avg(e.sal) over (partition by d.dname) avg_sal
,       avg(e.sal) over (partition by d.dname) - avg(e.sal) over () diff_with_company_avg
from    dept d
        natural join
        emp  e

Read the analytical definition of diff_with_company_avg as follows: ‘for each record returned, select the average of salary values for all rows in the partition (subset, group) with the same department-name as the current record’s department-name and substract from that value the average of all salary values in the complete set of records‘.

The next stage of complexity: show all employees with for each their name, their salary and the average salary in their job.

With a normal group by, this cannot be done. With scalar subqeuries:

select  e.ename
,       e.job
,       e.sal
,       (select avg(e.sal) from emp e2 where e2.job = e.job) avg_sal_in_job
from    emp  e

The Analytical Solution:

select  e.ename
,       e.job
,       e.sal
,       avg(e.sal) over (partition by e.job) avg_sal_in_job
from    emp  e

Read the analytical definition of avg_sal_in_job as follows: ‘for each record returned, select the average of salary values for all rows in the partition (subset, group) with the same job as the current record’s job‘.

ENAME      JOB              SAL AVG_SAL_IN_JOB
---------- --------- ---------- --------------
SMITH      CLERK            800            800
ALLEN      SALESMAN        1600           1600
WARD       SALESMAN        1250           1250
JONES      MANAGER         2975           2975
MARTIN     SALESMAN        1250           1250
BLAKE      MANAGER         2850           2850
CLARK      MARKETEER       2450           2450
SCOTT      ANALYST         3000           3000
KING       PRESIDENT       5000           5000
TURNER     SALESMAN        1500           1500
ADAMS      CLERK           1100           1100
JAMES      CLERK            950            950
FORD       ANALYST         3000           3000
MILLER     CLERK           1300           1300

Adding another request: per employee, show also the department and department’s average salary.

Based on Scalar-Subqueries:

select  e.ename
,       e.job
,       e.sal
,       (select avg(e2.sal) from emp e2 where e2.job = e.job) avg_sal_in_job
,       d.dname
,       (select avg(e2.sal) from emp e2 where e2.deptno = e.deptno) avg_sal_in_dept
from    emp  e
,       dept d
where   e.deptno = d.deptno

Note: in the query above we could not use the NATURAL JOIN construction. When you use a natural join, the columns used in the join – such as DEPTNO in this case – may not be used with a qualifier in the query (otherwise ORA-25155: column used in NATURAL join cannot have qualifier is raised). That would mean that for selecting the average salary in the current deptartment, no proper join condition can be set for joining with e2.deptno. The join-condition e2.deptno = deptno will return all rows!

Based on Analytical Functions:

select  e.ename
,       e.job
,       e.sal
,       avg(e.sal) over (partition by e.job) avg_sal_in_job
,       d.dname
,       avg(e.sal) over (partition by deptno) avg_sal_in_dept
from    emp  e
        natural join
        dept d

The results:

ENAME      JOB              SAL AVG_SAL_IN_JOB DNAME          AVG_SAL_IN_DEPT
---------- --------- ---------- -------------- -------------- ---------------
SCOTT      ANALYST         3000           3000 RESEARCH            2645.83333
FORD       ANALYST         3000           3000 RESEARCH            2645.83333
MILLER     CLERK           1300         1037.5 ACCOUNTING                1875
JAMES      CLERK            950         1037.5 SALES               1566.66667
SMITH      CLERK            800         1037.5 RESEARCH            2645.83333
ADAMS      CLERK           1100         1037.5 RESEARCH            2645.83333
JONES      MANAGER         2975         2912.5 RESEARCH            2645.83333
BLAKE      MANAGER         2850         2912.5 SALES               1566.66667
CLARK      MARKETEER       2450           2450 ACCOUNTING                1875
KING       PRESIDENT       5000           5000 RESEARCH            2645.83333
ALLEN      SALESMAN        1600           1400 SALES               1566.66667
MARTIN     SALESMAN        1250           1400 SALES               1566.66667
TURNER     SALESMAN        1500           1400 SALES               1566.66667
WARD       SALESMAN        1250           1400 SALES               1566.66667

Filtering on aggregate values

Let’s return for a moment to the first query in this post. Assume that we only want to return results for departments that have an average salary over 2000. Using the traditional Aggregation Syntax with Group By clause, we can make use of the HAVING clause:

select  d.dname
,       avg(e.sal)
from    dept d
        natural join
        emp  e
having  avg(e.sal) > 2000
group
by      d.dname

However, Analytical Functions can only be used in the SELECT clause, not in the WHERE clause. In fact, the Analytical calculation is done only just prior to the ORDER BY clause, which is the last thing the SQL Engine does before retuning the resultset for a query. So we have no equivalent for the HAVING clause with the Analtytical AVG() function. The solution then is to use an in-line view:

select dname
,      avg_sal
from   (  select  distinct
                  d.dname
          ,       avg(e.sal) over (partition by d.dname)  avg_sal
          from    dept d
                  natural join
                  emp  e
       )
where  avg_sal > 2000

Links to useful documentation:
Oracle Documentation:
SQL Reference
- Introduction/Overview Analytic Functions
- AVG()
Oracle 10g – SQL for Analysis in Data Warehouses
- Reporting Aggregate Function
Other resources:
Presentation from New York Oracle Usergroup.

Analytic Functions

In a next post I will start to unravel the structure of analytical queries. In this post – after all – you have seen nothing than an advanced sort of way to do aggregations, allowing multiple aggregations at different group levels and with different grouping conditions. Note that all we could do using analytical functions in these examples could also be done using Scalar Subqueries, although I personally prefer the leaner syntax with the Analytical Functions.

Share.

About Author

Lucas Jellema, active in IT (and with Oracle) since 1994. Oracle ACE Director for Fusion Middleware. Consultant, trainer and instructor on diverse areas including Oracle Database (SQL & PLSQL), Service Oriented Architecture, BPM, ADF, Java in various shapes and forms and many other things. Author of the Oracle Press book: Oracle SOA Suite 11g Handbook. Frequent presenter on conferences such as JavaOne, Oracle OpenWorld, ODTUG Kaleidoscope, Devoxx and OBUG. Presenter for Oracle University Celebrity specials.

4 Comments

  1. Pingback: Antivirus free downloads

  2. A very useful introductory article was just published in Oracle Magazine and on OTN: Find Answers Faster By Jonathan Gennick and Anthony Molinaro Analytic functions provide powerful ways to view your data.

  3. Pingback: » Analytical SQL Functions - theory and examples - Part 1 on the Partition Clause