Easy setup of the query_partition_clause of row_number() for dedup(lication)

0

Finding and getting rid of duplicate records in a table is easy….

The row_number() over ( <query_partition_clause> <order_by_clause> ) analytic function is helpful in determining not only the number of duplicate records, but also in identifying them. Setting up the partition clause though with each and every attribute that is in your partition query, can become a bit cumbersome when dealing with tables with a lot of columns.

So I wanted to have an option to generate not only all necessary column names for this clause, but also to exclude one or more columns from the list of column names, based on my input.
For this purpose, I created 2 overloaded dedup.get_partition_cols functions.
How to use (one of) them, is shown in the following example code.
The package code is provided at the end of this article.

Create test table t with 600 records, a composite primary key, and duplicate values in the last column

SQL~ create table t
  2  as
  3  select name
  4       , seqnr
  5       , name||' record ' note
  6  from (select case when rownum between   1 and 100 then 'harry'
  7                    when rownum between 101 and 200 then 'ann'
  8                    when rownum between 201 and 300 then 'bob'
  9                    when rownum between 301 and 400 then 'ronald'
 10                    when rownum between 401 and 500 then 'patrick'
 11                    when rownum between 501 and 600 then 'marie' end name
 12             , case when rownum between   1 and 100 then rownum-1
 13                    when rownum between 101 and 200 then rownum-101
 14                    when rownum between 201 and 300 then rownum-201
 15                    when rownum between 301 and 400 then rownum-301
 16                    when rownum between 401 and 500 then rownum-401
 17                    when rownum between 501 and 600 then rownum-501 end seqnr
 18        from dual
 19        connect by rownum != 601);

Table created.

SQL~ alter table t add (constraint t_pk primary key (name, seqnr));

Table altered.

SQL~ select count(*) from t;

  COUNT(*)
----------
       600

1 row selected.

Check if t contains duplicate records, and use a partition clause with all columns
At generation, the column list in i, in combination with the “new_value” directive of SQLPLUS, is
transferred to a substitution variable [1], which can then be used in various “duplicate rows” statements.
I purposely don’t use a noprint option for i, because I like the value of i to be shown here.

SQL~ set verify off heading off feedback off pagesize 0 define +
SQL~ col i new_value 1
SQL~ select dedup.get_partition_cols ('t',0) i from dual;
NAME,SEQNR,NOTE
SQL~ prompt ++1 
NAME,SEQNR,NOTE
SQL~ select count(*)
  2  from ( select row_number() over ( partition by ++1 order by rownum ) rn
  3         from t )
  4  where rn != 1 ;
         0

No surprise here, because the table contains a primary key and all columns are setup in the partition clause…
But now with pk columns excluded!

SQL~ select dedup.get_partition_cols ('t',1) i from dual;
NOTE
SQL~ prompt ++1 
NOTE
SQL~ select count(*)
  2  from ( select row_number() over ( partition by ++1 order by rownum ) rn
  3         from t )
  4  where rn != 1 ;
       594

Delete duplicates

SQL~ delete t
  2  where rowid in ( select rid
  3                   from ( select rowid rid
  4                               , row_number() over ( partition by ++1 order by rownum ) rn
  5                          from t )
  6                   where rn != 1 );
SQL~ select count(*) from t;
         6

Package and package body dedup

CREATE OR REPLACE PACKAGE Dedup
  IS

/*
Q: How to use the dedup functions?
A: The first dedup.get_partition_cols function has 2 parameters, one for the table name, and one for 
     -- excluding no columns (value=0), 
     -- excluding primary key columns (value=1)
     -- excluding primary key columns and unique key columns (value=2).
   With the second, overloaded function, you can exclude specific columns (10 max).  

Q: How are the column names extracted? 
A: To list all table columns, 11gR2 listagg syntax is used:
============ 
SQL~ select listagg(column_name,',') within group(order by rownum)
     from user_tab_columns 
     where table_name = 'T';
============
This statement gets executed when you call dedup.get_partition_cols('t',0).
All other calls result in an exclusion of one or more column names from the list of all columns.
*/
-------------------------------------------------------------------------------------------------

FUNCTION Get_Partition_Cols
  ( P_table_name    varchar2
  , P_exclude_col01 varchar2 := NULL
  , P_exclude_col02 varchar2 := NULL
  , P_exclude_col03 varchar2 := NULL
  , P_exclude_col04 varchar2 := NULL
  , P_exclude_col05 varchar2 := NULL
  , P_exclude_col06 varchar2 := NULL
  , P_exclude_col07 varchar2 := NULL
  , P_exclude_col08 varchar2 := NULL
  , P_exclude_col09 varchar2 := NULL
  , P_exclude_col10 varchar2 := NULL )
  RETURN VARCHAR2;

-------------------------------------------------------------------------------------------------

FUNCTION Get_Partition_Cols
   ( P_table_name   varchar2
   , P_exclude_cols pls_integer )
  RETURN VARCHAR2;

-------------------------------------------------------------------------------------------------

END Dedup;
/
CREATE OR REPLACE PACKAGE BODY Dedup
  IS

-------------------------------------------------------------------------------------------------

FUNCTION Get_Partition_Cols
  ( P_table_name    varchar2
  , P_exclude_col01 varchar2 := NULL
  , P_exclude_col02 varchar2 := NULL
  , P_exclude_col03 varchar2 := NULL
  , P_exclude_col04 varchar2 := NULL
  , P_exclude_col05 varchar2 := NULL
  , P_exclude_col06 varchar2 := NULL
  , P_exclude_col07 varchar2 := NULL
  , P_exclude_col08 varchar2 := NULL
  , P_exclude_col09 varchar2 := NULL
  , P_exclude_col10 varchar2 := NULL )
  RETURN VARCHAR2
AS
   v_table_name varchar2(30) := upper(trim(P_table_name ));
   v_ret varchar2(2000);
begin
  if (     P_exclude_col01 is null 
       and P_exclude_col02 is null
       and P_exclude_col03 is null
       and P_exclude_col04 is null
       and P_exclude_col05 is null
       and P_exclude_col06 is null
       and P_exclude_col07 is null
       and P_exclude_col08 is null
       and P_exclude_col09 is null
       and P_exclude_col10 is null ) then
    select listagg(column_name,',') within group(order by rownum) 
      into v_ret
    from user_tab_columns 
    where table_name = v_table_name;
  else
    select listagg(column_name,',') within group(order by rownum) 
      into v_ret
    from user_tab_columns uts
    where uts.table_name = v_table_name
      and column_name not in ( nvl(upper(trim(P_exclude_col01)),' ')
                             , nvl(upper(trim(P_exclude_col02)),' ')
                             , nvl(upper(trim(P_exclude_col03)),' ')
                             , nvl(upper(trim(P_exclude_col04)),' ')
                             , nvl(upper(trim(P_exclude_col05)),' ')
                             , nvl(upper(trim(P_exclude_col06)),' ')
                             , nvl(upper(trim(P_exclude_col07)),' ')
                             , nvl(upper(trim(P_exclude_col08)),' ')
                             , nvl(upper(trim(P_exclude_col09)),' ')
                             , nvl(upper(trim(P_exclude_col10)),' ') );
  end if;
  return v_ret;
exception
when others then
  raise;  
end Get_Partition_Cols;  

-------------------------------------------------------------------------------------------------

FUNCTION Get_Partition_Cols
   ( P_table_name   varchar2
   , P_exclude_cols pls_integer )
  RETURN VARCHAR2
AS
   v_table_name varchar2(30) := upper(trim(P_table_name ));
   v_ret varchar2(2000);
begin
  if P_exclude_cols = 0 then   -- no exclusion
    select listagg(column_name,',') within group(order by rownum) 
      into v_ret
    from user_tab_columns 
    where table_name = v_table_name;
  elsif P_exclude_cols = 1 then  -- exclude pk cols only
    select listagg(column_name,',') within group(order by rownum) 
      into v_ret
    from user_tab_columns uts
    where uts.table_name = v_table_name
      and not exists ( select 1 
                       from user_ind_columns ucs
                          , user_indexes     uis
                       where ucs.index_name = uis.index_name
                         and ucs.table_name = uis.table_name
                         and uis.table_name = v_table_name
                         and uis.uniqueness = 'UNIQUE'
                         and ucs.column_name = uts.column_name
                         and uts.nullable = 'N' );
  elsif P_exclude_cols = 2 then  -- exclude pk and uk cols
    select listagg(column_name,',') within group(order by rownum) 
      into v_ret
    from user_tab_columns uts
    where uts.table_name = v_table_name
      and not exists ( select 1 
                       from user_ind_columns ucs
                          , user_indexes     uis
                       where ucs.index_name = uis.index_name
                         and ucs.table_name = uis.table_name
                         and uis.table_name = v_table_name
                         and uis.uniqueness = 'UNIQUE'
                         and ucs.column_name = uts.column_name );                         
  end if;
  return v_ret;
exception
when others then
  raise;  
end Get_Partition_Cols;  

-------------------------------------------------------------------------------------------------

END Dedup;
/
Share.

About Author

Harry Dragstra is Senior Oracle DBA. He is specialized in Oracle RAC, Oracle Virtualization, and has a a strong interest in APEX development and PL/SQL.

Leave a Reply