IN or EXIST or doesn't it matter 13422386 1019544571447648 7687716130941590224 o1

IN or EXIST or doesn’t it matter

In my previous post about package constants I mentioned the application I am working on right now. They gave me the task to speed up the application since performance was getting worse and worse. I analyzed the statspack results together with a DBA and we found two queries that together took 40% of the logical IO’s of the system. That’s a lot for only two queries, especially when you look how big our application is; we have many queries.

The two queries were both in one procedure. That must have been an off day for the original programmer :-). The queries were small and had an IN statement in the where clause. I simply rewrote the queries to use an EXIST and they became blazingly fast. That was strange. I attended the Tom Kyte seminar in Utrecht in 2005 and he claimed that it didn’t matter anymore. IN or EXIST, the database would see this and the optimizer would have the same execution plan for both, but not in my case. How could that be? I was running the queries in a 9R2 database.


So I wrote a (meaningless) example to show the difference between the IN and EXIST statements. The source of this test script is as follows:

set timing off
set autotrace off
set echo on

create table my_blocked_objects (object_name varchar2(200) not null)
/

create table my_objects (object_name varchar2(200) not null)
/

create index my_blk_idx on my_blocked_objects(object_name)
/

create index my_obj_idx on my_objects(object_name)
/

insert into my_blocked_objects
select owner || '.' || object_name
from   all_objects
/

insert into my_objects
select user || '.' || object_name
from   user_objects
/

commit
/

begin
  dbms_stats.gather_table_stats(user,'MY_BLOCKED_OBJECTS');
  dbms_stats.gather_table_stats(user,'MY_OBJECTS');
end;
/

set timing on
set autotrace on explain

select count(*)
from   my_objects obj
where  obj.object_name in (select blk.object_name
                           from   my_blocked_objects blk)
/

select count(*)
from   my_objects obj
where  exists (select ''
               from   my_blocked_objects blk
               where  blk.object_name = obj.object_name)
/

drop table my_blocked_objects
/

drop table my_objects
/

The autotrace results of this script on my database was:

SQL> select count(*)
  2  from   my_objects obj
  3  where  obj.object_name in (select blk.object_name
  4                             from   my_blocked_objects blk)
  5  /

  COUNT(*)
----------
      3376

Elapsed: 00:00:00.01

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=166 Card=1 Bytes=132
          )

   1    0   SORT (AGGREGATE)
   2    1     HASH JOIN (Cost=166 Card=3378 Bytes=445896)
   3    2       INDEX (FULL SCAN) OF 'MY_OBJ_IDX' (NON-UNIQUE) (Cost=2
          6 Card=3378 Bytes=101340)

   4    2       VIEW OF 'VW_NSO_1' (Cost=156 Card=27148 Bytes=2769096)
   5    4         SORT (UNIQUE) (Cost=156 Card=27148 Bytes=814440)
   6    5           INDEX (FULL SCAN) OF 'MY_BLK_IDX' (NON-UNIQUE) (Co
          st=26 Card=27522 Bytes=825660)




SQL>
SQL> select count(*)
  2  from   my_objects obj
  3  where  exists (select ''
  4                 from   my_blocked_objects blk
  5                 where  blk.object_name = obj.object_name)
  6  /

  COUNT(*)
----------
      3376

Elapsed: 00:00:00.00

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=3 Card=1 Bytes=30)
   1    0   SORT (AGGREGATE)
   2    1     INDEX (FULL SCAN) OF 'MY_OBJ_IDX' (NON-UNIQUE) (Cost=26
          Card=169 Bytes=5070)

   3    2       INDEX (RANGE SCAN) OF 'MY_BLK_IDX' (NON-UNIQUE) (Cost=
          1 Card=1 Bytes=30)

Again not the same execution plan. The IN version was less efficient compared to the EXIST version of the query. And this was not what Tom Kyte was saying during his seminar. I then tested the same script on another instance; but also a 9R2 database. That gave the following results:

SQL> select count(*)
  2  from   my_objects obj
  3  where  obj.object_name in (select blk.object_name
  4                             from   my_blocked_objects blk)
  5  /

  COUNT(*)
----------
      2220

Elapsed: 00:00:00.01

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=12 Card=1 Bytes=62)
   1    0   SORT (AGGREGATE)
   2    1     HASH JOIN (SEMI) (Cost=12 Card=2223 Bytes=137826)
   3    2       TABLE ACCESS (FULL) OF 'MY_OBJECTS' (Cost=4 Card=2223
          Bytes=71136)

   4    2       INDEX (FAST FULL SCAN) OF 'MY_BLK_IDX' (NON-UNIQUE) (C
          ost=5 Card=24923 Bytes=747690)




SQL>
SQL> select count(*)
  2  from   my_objects obj
  3  where  exists (select ''
  4                 from   my_blocked_objects blk
  5                 where  blk.object_name = obj.object_name)
  6  /

  COUNT(*)
----------
      2220

Elapsed: 00:00:00.00

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=12 Card=1 Bytes=62)
   1    0   SORT (AGGREGATE)
   2    1     HASH JOIN (SEMI) (Cost=12 Card=2223 Bytes=137826)
   3    2       TABLE ACCESS (FULL) OF 'MY_OBJECTS' (Cost=4 Card=2223
          Bytes=71136)

   4    2       INDEX (FAST FULL SCAN) OF 'MY_BLK_IDX' (NON-UNIQUE) (C
          ost=5 Card=24923 Bytes=747690)

Wow the same execution plan, but a slightly higher cost (Cost=12 compared to Cost=3). How was this possible???? TKPROF output showed the same results. It took some time to figure this out; but the difference was introduced by different optimizer parameters in the database. The database with two execution plans had the following parameters:

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------
object_cache_optimal_size            integer     102400
optimizer_dynamic_sampling           integer     0
optimizer_features_enable            string      8.1.7
optimizer_index_caching              integer     0
optimizer_index_cost_adj             integer     10
optimizer_max_permutations           integer     80000
optimizer_mode                       string      CHOOSE

The database with one execution plan had the following parameters:

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------
object_cache_optimal_size            integer     102400
optimizer_dynamic_sampling           integer     1
optimizer_features_enable            string      9.2.0
optimizer_index_caching              integer     0
optimizer_index_cost_adj             integer     100
optimizer_max_permutations           integer     2000
optimizer_mode                       string      CHOOSE

 

For some reason the DBA’s kept the optimizer_features_enable on 8.1.7 after the migration of the database. This prevented the optimizer from using new features like optimizing IN where clauses. The argument was that it could be possible that another value of this parameter could negatively impact performance. That could be true; but no one tested that; they just took tha assumption.
Setting the parameter to 9.2.0 would have automatically optimized the two bad queries that led to this post. They would never have been a problem. I would never have found them as being problem queries. My lesson learnt is that you never should rely on assumptions; you have to test your assumptions.

 

Tom Kyte was right in this case. There is no difference between IN or EXIST. I believe the difference in cost between the two databases (cost=3 vs. cost=12) is hard to compare because of the difference of the optimizer_features_enable parameter.

7 Comments

  1. Andrew Markiewicz September 11, 2006
  2. Patrick Roozen June 2, 2006
  3. Alex Gorbachev May 26, 2006
  4. karl May 25, 2006
  5. karl May 25, 2006
  6. karl May 25, 2006
  7. Marco Gralike May 24, 2006