My challenge of the day: a query that returns aggregate values and that can be manipulated dynamically to aggregate sometimes grouped by one dimension, sometimes by another single dimension or by multiple dimensions. The query will be used to build a report page that displays a table with the rows returned by the query. On this page, the user can hide and display columns. Whenever a column is hidden, the query should no longer group by that column. However, it should still be the same query.
In this article, we will look at a way to bring this dynamic, conditional aggregation behavior into a SQL query: how the query can aggregate on different levels without any change to the query itself. We will use wildly interesting concepts such as the database (application) context, the CASE statement and, last but not least, the CUBE (aggregate) operator.
In a subsequent article we will also take a look at the web page that we create on top of this query.
Let’s start with the mother of all queries:
select deptno
,      job
,      avg(sal)     "average salary"
,      count(empno) "num in job"
from   scott.emp
group  by deptno
,         job
The results from this initial query show aggregate values – average salary and count of staff in job – per Department and per Job:
DEPTNO JOB       average salary num in job
------ --------- -------------- ----------
    10 PRESIDENT           5000          1
    10 CLERK               1300          1
    10 MANAGER             2450          1
    20 MANAGER             2975          1
    20 ANALYST             3000          2
    20 CLERK                950          2
    30 SALESMAN            1400          4
    30 CLERK                950          1
    30 MANAGER             2850          1
We want to change the query so that it only groups per JOB for departments other than 30. We can change the query using a CASE statement, like this:
select deptno
,      case deptno when 30 then 'ALL' else job end job
,      avg(sal)     "average salary"
,      count(empno) "num in job"
from   scott.emp
group  by deptno
,         case deptno when 30 then 'ALL' else job end
order  by deptno
,         job
The results:
DEPTNO JOB       average salary num in job
------ --------- -------------- ----------
    10 CLERK               1300          1
    10 MANAGER             2450          1
    10 PRESIDENT           5000          1
    20 ANALYST             3000          2
    20 CLERK                950          2
    20 MANAGER             2975          1
    30 ALL           1566.66667          6
Here you see that we group per JOB within departments 10 and 20, but not within department 30. This is our hook into a more general approach to conditional aggregation.
We want to be able to set switches that specify grouping by Department, Job and Hireyear with simple statements like:
exec query_switcher.group_by_job_yn('Y')
exec query_switcher.group_by_department_yn('N')
The best way to pass information into SQL queries is through the use of an (application) context. The optimizer treats references to SYS_CONTEXT as bind variables – making for a much more efficient execution plan than we get through calls to functions in packages that return the values of global package variables.
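To illustrate the difference, here is a minimal sketch (the 'department' attribute and the emp_settings.get_department function are made up for this example and are not part of the package we build below):

-- SYS_CONTEXT reference: the value is picked up much like a bind variable,
-- so the same shared cursor and plan serve every context setting
select *
from   scott.emp
where  deptno = sys_context('query_switch_ctx', 'department');

-- packaged function returning a global variable: a PL/SQL call the
-- optimizer cannot look into, and which may be evaluated per row
select *
from   scott.emp
where  deptno = emp_settings.get_department;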
To set up a context, we first create the package to manage the context contents:
create or replace package query_switcher
is
  procedure group_by_department_yn(p_yn in varchar2);
  procedure group_by_job_yn(p_yn in varchar2);
  procedure group_by_hireyear_yn(p_yn in varchar2);
end;
Now we can create the context:
create context query_switch_ctx using query_switcher
And then the package body:
create or replace package body query_switcher
is
  procedure group_by_department_yn(p_yn in varchar2)
  is
  begin
    DBMS_SESSION.SET_CONTEXT('query_switch_ctx', 'group_by_department', p_yn);
  end;

  procedure group_by_job_yn(p_yn in varchar2)
  is
  begin
    DBMS_SESSION.SET_CONTEXT('query_switch_ctx', 'group_by_job', p_yn);
  end;

  procedure group_by_hireyear_yn(p_yn in varchar2)
  is
  begin
    DBMS_SESSION.SET_CONTEXT('query_switch_ctx', 'group_by_hireyear', p_yn);
  end;
end;
With all this in place, we can create the query itself that uses the switches from the context for deciding which behavior to assume:
select case sys_context('query_switch_ctx', 'group_by_department')
         when 'N' then null
         else deptno
       end deptno
,      case sys_context('query_switch_ctx', 'group_by_job')
         when 'N' then 'ALL'
         else job
       end job
,      case sys_context('query_switch_ctx', 'group_by_hireyear')
         when 'N' then to_number(null)
         else extract(year from hiredate)
       end hireyear
,      avg(sal)     "average salary"
,      count(empno) "num in job"
from   scott.emp
group  by case sys_context('query_switch_ctx', 'group_by_department')
            when 'N' then null
            else deptno
          end
,         case sys_context('query_switch_ctx', 'group_by_job')
            when 'N' then 'ALL'
            else job
          end
,         case sys_context('query_switch_ctx', 'group_by_hireyear')
            when 'N' then to_number(null)
            else extract(year from hiredate)
          end
order  by deptno
,         job
,         hireyear
Before running the query, we can throw the switches to indicate on which columns we want to group.
For example, no column grouping at all – only the overall aggregates:
exec query_switcher.group_by_job_yn('N')
exec query_switcher.group_by_hireyear_yn('N')
exec query_switcher.group_by_department_yn('N')
The results:
DEPTNO JOB       HIREYEAR average salary num in job
------ --------- -------- -------------- ----------
       ALL                    2073.21429         14
We would like to group by Hireyear, without changing the query. To achieve that objective, we have to throw the right switch:
exec query_switcher.group_by_hireyear_yn('Y')
And the results:
DEPTNO JOB       HIREYEAR average salary num in job
------ --------- -------- -------------- ----------
       ALL           1980            800          1
       ALL           1981         2282.5         10
       ALL           1982           1300          1
       ALL           1987           2050          2
Try another grouping – throw an additional switch:
exec query_switcher.group_by_job_yn('Y')
And the results:
DEPTNO JOB       HIREYEAR average salary num in job
------ --------- -------- -------------- ----------
       ANALYST       1981           3000          1
       ANALYST       1987           3000          1
       CLERK         1980            800          1
       CLERK         1981            950          1
       CLERK         1982           1300          1
       CLERK         1987           1100          1
       MANAGER       1981     2758.33333          3
       PRESIDENT     1981           5000          1
       SALESMAN      1981           1400          4
Change the grouping – no more grouping on Job and start grouping on Department:
exec query_switcher.group_by_department_yn('Y')
exec query_switcher.group_by_job_yn('N')
And the query returns:
DEPTNO JOB       HIREYEAR average salary num in job
------ --------- -------- -------------- ----------
    10 ALL           1981           3725          2
    10 ALL           1982           1300          1
    20 ALL           1980            800          1
    20 ALL           1981         2987.5          2
    20 ALL           1987           2050          2
    30 ALL           1981     1566.66667          6

exec query_switcher.group_by_department_yn('N')
exec query_switcher.group_by_hireyear_yn('N')
exec query_switcher.group_by_job_yn('Y')

DEPTNO JOB       HIREYEAR average salary num in job
------ --------- -------- -------------- ----------
       ANALYST                      3000          2
       CLERK                      1037.5          4
       MANAGER                2758.33333          3
       PRESIDENT                    5000          1
       SALESMAN                     1400          4
And on and on…
So we achieved what we set out to do: the ability to dynamically change the grouping of the query – no change needed to the query itself, only manipulation of switches in a context. We could create a View on top of this query and use it from Oracle Forms, APEX or Java web applications.
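A minimal sketch of such a view (the name emp_aggregates is made up; I leave the ordering to the consuming application):

create or replace view emp_aggregates
as
select case sys_context('query_switch_ctx', 'group_by_department')
         when 'N' then null else deptno
       end deptno
,      case sys_context('query_switch_ctx', 'group_by_job')
         when 'N' then 'ALL' else job
       end job
,      case sys_context('query_switch_ctx', 'group_by_hireyear')
         when 'N' then to_number(null) else extract(year from hiredate)
       end hireyear
,      avg(sal)     "average salary"
,      count(empno) "num in job"
from   scott.emp
group  by case sys_context('query_switch_ctx', 'group_by_department')
            when 'N' then null else deptno
          end
,         case sys_context('query_switch_ctx', 'group_by_job')
            when 'N' then 'ALL' else job
          end
,         case sys_context('query_switch_ctx', 'group_by_hireyear')
            when 'N' then to_number(null) else extract(year from hiredate)
          end;

The application then simply throws the switches and selects from emp_aggregates.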
Enter: CUBE
I had a feeling that the CUBE aggregate operator could also be used in situations like this, to achieve similar functionality in a perhaps even leaner way. CUBE has been around since 9iR2 – about 2002 – and I have not been able to find many situations where CUBE could meaningfully be used. But here I have one.
Let’s return to a two-dimensional aggregation situation, aggregating on DEPTNO and JOB:
select deptno
,      job
,      sum(sal)
from   emp
group  by cube(deptno, job)
The CUBE causes the aggregation to be performed in multiple ways in this single query. Aggregation is done for:
- all combinations of deptno and job
- all values of deptno (over all job values)
- all values of job (over all deptno values)
- over all (grand total)
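In other words, CUBE(deptno, job) is shorthand for these four grouping sets. A sketch of the equivalent query spelled out (I did not actually run this variant):

select deptno
,      job
,      sum(sal)
from   emp
group  by grouping sets ((deptno, job), (deptno), (job), ())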
The outcome of the CUBE query:
DEPTNO JOB         SUM(SAL)
------ --------- ----------
                      29025
       CLERK           4150
       ANALYST         6000
       MANAGER         8275
       SALESMAN        5600
       PRESIDENT       5000
    10                 8750
    10 CLERK           1300
    10 MANAGER         2450
    10 PRESIDENT       5000
    20                10875
    20 CLERK           1900
    20 ANALYST         6000
    20 MANAGER         2975
    30                 9400
    30 CLERK            950
    30 MANAGER         2850
    30 SALESMAN        5600
When CUBE is used in a query, we can also use the GROUPING function. For each result row, this function indicates whether a specific column has been aggregated over (1) or is one of the columns actually grouped on (0). The previous query with GROUPING looks like this:
select deptno
,      job
,      sum(sal)
,      grouping(job)
,      grouping(deptno)
from   emp
group  by cube(deptno, job)
and results in:
DEPTNO JOB         SUM(SAL) GROUPING(JOB) GROUPING(DEPTNO)
------ --------- ---------- ------------- ----------------
                      29025             1                1
       CLERK           4150             0                1
       ANALYST         6000             0                1
       MANAGER         8275             0                1
       SALESMAN        5600             0                1
       PRESIDENT       5000             0                1
    10                 8750             1                0
    10 CLERK           1300             0                0
    10 MANAGER         2450             0                0
    10 PRESIDENT       5000             0                0
    20                10875             1                0
    20 CLERK           1900             0                0
    20 ANALYST         6000             0                0
    20 MANAGER         2975             0                0
    30                 9400             1                0
    30 CLERK            950             0                0
    30 MANAGER         2850             0                0
    30 SALESMAN        5600             0                0
We can apply a filter on the grouping results:
select *
from   ( select deptno
         ,      job
         ,      sum(sal)
         ,      grouping(job)    gr_job
         ,      grouping(deptno) gr_deptno
         from   emp
         group  by cube(deptno, job)
       )
where  gr_job    = case sys_context('query_switch_ctx', 'group_by_job')
                     when 'Y' then 0 else 1
                   end
and    gr_deptno = case sys_context('query_switch_ctx', 'group_by_department')
                     when 'Y' then 0 else 1
                   end
With switches:
exec query_switcher.group_by_department_yn('N')
exec query_switcher.group_by_job_yn('Y')
And outcome:
DEPTNO JOB         SUM(SAL)     GR_JOB  GR_DEPTNO
------ --------- ---------- ---------- ----------
       ANALYST         6000          0          1
       CLERK           4150          0          1
       MANAGER         8275          0          1
       PRESIDENT       5000          0          1
       SALESMAN        5600          0          1
So we see that the CUBE operator helps us make the core query a little simpler – no group by clause riddled with CASE statements. And when we use CUBE, we can simply apply a where clause to determine through plain filtering which aggregation levels should be effective.
Note by the way the slightly counterintuitive values of the GROUPING function compared to the switches I defined: when we say ‘group by Job’, we want to return all rows where grouping(job) equals 0. Probably my misunderstanding…
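If that inversion bothers you, the inline view could expose a more intuitive flag instead – a small sketch (the *_grouped column names are made up):

select deptno
,      job
,      sum(sal)
,      1 - grouping(deptno) deptno_grouped  -- 1: this row is grouped per department
,      1 - grouping(job)    job_grouped     -- 1: this row is grouped per job
from   emp
group  by cube(deptno, job)

The where clause can then compare these flags directly against the switches, for example job_grouped = decode(sys_context('query_switch_ctx', 'group_by_job'), 'Y', 1, 0), which reads the right way around.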
Hi Lucas,
You can use CUBE as an inefficient row generator (“select rownum from (select 1 from dual group by cube(1,1,1,1)) where rownum <= 13”) or maybe for some obscure BI query or report. I’ll have to ask some BI colleagues about this. However, you said it perfectly: “CUBE will aggregate along all dimensions and at all levels – while we will always need just a subset”. I think CUBE has become almost obsolete for normal development since the advent of grouping sets. With grouping sets you can specify exactly which groupings you want, without having to calculate everything first and then filter.
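For instance, just the per-Department and per-Job totals – a quick sketch, not tested against your data:

select deptno
,      job
,      sum(sal)
from   emp
group  by grouping sets ((deptno), (job))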
Regards,
Rob.
Hi Rob,
Thanks for your response. Yes, you are right on both counts. I did not use HAVING because of the way in which I expect the application to make use of this query: manipulating the where clause is much easier in the application than fiddling with the HAVING clause, which is why I approached it like this. And of course CUBE will aggregate along all dimensions and at all levels – while we will always need just a subset. How costly does that prove to be? The percentage of unnecessary aggregations can grow quite large, depending on the logic in the application – up to 95% or more. So perhaps this is not a very good example of CUBE usage after all. Do you know one?
regards, Lucas
Lucas,
Nice post.
Two remarks though:
1) the CUBE query will look much simpler when using the HAVING clause, since there won’t be a need for the inline view in that case (see the sketch after these remarks).
2) the CUBE query will be less performant than your first query, as it has to calculate all grouping sets first before it can filter on them. To verify: in the execution plan you’ll see a “GENERATE CUBE” step with a much larger number in the A-Rows column than the number of rows in your table.
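For the first remark, something along these lines – again just a sketch, not tested:

select deptno
,      job
,      sum(sal)
from   emp
group  by cube(deptno, job)
having grouping(job)    = case sys_context('query_switch_ctx', 'group_by_job')
                            when 'Y' then 0 else 1
                          end
and    grouping(deptno) = case sys_context('query_switch_ctx', 'group_by_department')
                            when 'Y' then 0 else 1
                          end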
Regards,
Rob.