I will attempt to introduce analytical functions in a step-by-step manner. A first step into Analytical country could be the following, starting from a very simple aggregation query, exploring several types of aggregations that are increasingly complex and increasingly impossible to implement without analytics (at least in a efficient, performant way):
select d.dname , avg(e.sal) from dept d natural join emp e group by d.dname
The following two queries give the same result, without using the group by. They can therefore more easily be extended to return other information as well, such as an aggegation at a different grouping level.
Using an analytical function:
select distinct d.dname , avg(e.sal) over (partition by d.dname) -- the analytical avg function from dept d natural join emp e
Read the analytical part of this query as follows: ‘for each record returned, select the average of salary values for all rows in the partition (subset, group) with the same department-name as the current record’s department-name‘.
Using a scalar subquery:
select d.dname , (select avg(e.sal) from emp e where e.deptno = d.deptno) from dept d
The result in all cases is the following:
DNAME AVG(E.SAL) -------------- ---------- ACCOUNTING 1875 RESEARCH 2645.83333 SALES 1566.66667
Now it was my impression that the following should also work:
create type numTable as table of number / select d.dname , ( select avg(column_value) from table ( cast( d.salaries as numTable) ) ) Department_Avg_Salary from ( select cast ( multiset ( select e.sal from emp e where e.deptno = d1.deptno ) as numTable ) salaries , d1.dname from dept d1 ) d
I am not sure exactly where to go with the example, but it is a somewhat other approach. However, it does not work: it returns the average salary for the first department in all four rows. I have posted a question on the OTN Forum to see whether anyone knows what I might be doing wrong.
I have been given the correct query on the OTN Forum, by Barbara A Boehmer:
select d.dname , ( select avg(column_value) from table (cast (d.salaries as numTable)) where d.dname = dname) Department_Avg_Salary from ( select cast ( multiset ( select e.sal from emp e where e.deptno = d1.deptno ) as numTable ) salaries , d1.dname from dept d1 ) d
I must confess that I do not completely understand why the additional
where d.dname = dname
is required since I thought that each row’s salaries value only contained salary values for that row anyway. Perhaps I have misjudged the order of evaluation…?
Let’s say we want to see not only the average salary per department but also the over-all average – and the difference between each department’s average and the company’s average.
It seems a question that can not easly be answered using the standard ‘select avg group by’ construction, only through a construction like:
select dname , avg_sal , avg_sal - total_avg difference from ( select d.dname , avg(e.sal) avg_sal , total_avg from dept d natural join emp e , ( select avg(sal) total_avg from emp) avg_sal group by d.dname, total_avg )
which is just using a scalar subquery.
Another solution based on scalar subqueries:
select dname , avg_sal , avg_sal - total_avg difference from ( select d.dname , (select avg(e.sal) from emp e where e.deptno = d.deptno) avg_sal , (select avg(e.sal) from emp e) total_avg from dept d )
The Analytical Solution looks as follows:
select dname , avg_sal , avg_sal - total_avg difference from ( select distinct d.dname , avg(e.sal) over (partition by d.dname) avg_sal , avg(e.sal) over () total_avg from dept d natural join emp e )
or simply:
select distinct d.dname , avg(e.sal) over (partition by d.dname) avg_sal , avg(e.sal) over (partition by d.dname) - avg(e.sal) over () diff_with_company_avg from dept d natural join emp e
Read the analytical definition of diff_with_company_avg as follows: ‘for each record returned, select the average of salary values for all rows in the partition (subset, group) with the same department-name as the current record’s department-name and substract from that value the average of all salary values in the complete set of records‘.
The next stage of complexity: show all employees with for each their name, their salary and the average salary in their job.
With a normal group by, this cannot be done. With scalar subqeuries:
select e.ename , e.job , e.sal , (select avg(e.sal) from emp e2 where e2.job = e.job) avg_sal_in_job from emp e
The Analytical Solution:
select e.ename , e.job , e.sal , avg(e.sal) over (partition by e.job) avg_sal_in_job from emp e
Read the analytical definition of avg_sal_in_job as follows: ‘for each record returned, select the average of salary values for all rows in the partition (subset, group) with the same job as the current record’s job‘.
ENAME JOB SAL AVG_SAL_IN_JOB ---------- --------- ---------- -------------- SMITH CLERK 800 800 ALLEN SALESMAN 1600 1600 WARD SALESMAN 1250 1250 JONES MANAGER 2975 2975 MARTIN SALESMAN 1250 1250 BLAKE MANAGER 2850 2850 CLARK MARKETEER 2450 2450 SCOTT ANALYST 3000 3000 KING PRESIDENT 5000 5000 TURNER SALESMAN 1500 1500 ADAMS CLERK 1100 1100 JAMES CLERK 950 950 FORD ANALYST 3000 3000 MILLER CLERK 1300 1300
Adding another request: per employee, show also the department and department’s average salary.
Based on Scalar-Subqueries:
select e.ename , e.job , e.sal , (select avg(e2.sal) from emp e2 where e2.job = e.job) avg_sal_in_job , d.dname , (select avg(e2.sal) from emp e2 where e2.deptno = e.deptno) avg_sal_in_dept from emp e , dept d where e.deptno = d.deptno
Note: in the query above we could not use the NATURAL JOIN construction. When you use a natural join, the columns used in the join – such as DEPTNO in this case – may not be used with a qualifier in the query (otherwise ORA-25155: column used in NATURAL join cannot have qualifier is raised). That would mean that for selecting the average salary in the current deptartment, no proper join condition can be set for joining with e2.deptno. The join-condition e2.deptno = deptno will return all rows!
Based on Analytical Functions:
select e.ename , e.job , e.sal , avg(e.sal) over (partition by e.job) avg_sal_in_job , d.dname , avg(e.sal) over (partition by deptno) avg_sal_in_dept from emp e natural join dept d
The results:
ENAME JOB SAL AVG_SAL_IN_JOB DNAME AVG_SAL_IN_DEPT ---------- --------- ---------- -------------- -------------- --------------- SCOTT ANALYST 3000 3000 RESEARCH 2645.83333 FORD ANALYST 3000 3000 RESEARCH 2645.83333 MILLER CLERK 1300 1037.5 ACCOUNTING 1875 JAMES CLERK 950 1037.5 SALES 1566.66667 SMITH CLERK 800 1037.5 RESEARCH 2645.83333 ADAMS CLERK 1100 1037.5 RESEARCH 2645.83333 JONES MANAGER 2975 2912.5 RESEARCH 2645.83333 BLAKE MANAGER 2850 2912.5 SALES 1566.66667 CLARK MARKETEER 2450 2450 ACCOUNTING 1875 KING PRESIDENT 5000 5000 RESEARCH 2645.83333 ALLEN SALESMAN 1600 1400 SALES 1566.66667 MARTIN SALESMAN 1250 1400 SALES 1566.66667 TURNER SALESMAN 1500 1400 SALES 1566.66667 WARD SALESMAN 1250 1400 SALES 1566.66667
Filtering on aggregate values
Let’s return for a moment to the first query in this post. Assume that we only want to return results for departments that have an average salary over 2000. Using the traditional Aggregation Syntax with Group By clause, we can make use of the HAVING clause:
select d.dname , avg(e.sal) from dept d natural join emp e having avg(e.sal) > 2000 group by d.dname
However, Analytical Functions can only be used in the SELECT clause, not in the WHERE clause. In fact, the Analytical calculation is done only just prior to the ORDER BY clause, which is the last thing the SQL Engine does before retuning the resultset for a query. So we have no equivalent for the HAVING clause with the Analtytical AVG() function. The solution then is to use an in-line view:
select dname , avg_sal from ( select distinct d.dname , avg(e.sal) over (partition by d.dname) avg_sal from dept d natural join emp e ) where avg_sal > 2000
Links to useful documentation:
Oracle Documentation:
SQL Reference
– Introduction/Overview Analytic Functions
– AVG()
Oracle 10g – SQL for Analysis in Data Warehouses
– Reporting Aggregate Function
Other resources:
Presentation from New York Oracle Usergroup.
Analytic Functions
In a next post I will start to unravel the structure of analytical queries. In this post – after all – you have seen nothing than an advanced sort of way to do aggregations, allowing multiple aggregations at different group levels and with different grouping conditions. Note that all we could do using analytical functions in these examples could also be done using Scalar Subqueries, although I personally prefer the leaner syntax with the Analytical Functions.
Very nice article. I can’t wait reading the continution of this article on Analytical Functions.
A very useful introductory article was just published in Oracle Magazine and on OTN: Find Answers Faster By Jonathan Gennick and Anthony Molinaro Analytic functions provide powerful ways to view your data.