Remove duplicate rows with Oracle SQL ROW_NUMBER – intelligent and efficient data cleansing

1

I am quite sure I have addressed this before. But I could not find it readily and I needed it today. So here it goes:

I am loading data from some external source. Well actually, from several external sources. The loading of the data is done in a better safe than sorry way: it is very important that all records come in. At the risk of loading duplicates – we will get them all!

However, once they are all in, there are more in than we want: I have loaded quite a few duplicate records. That will hurt the later on, so the duplicates have to go. With Oracle SQL Analytical Functions, that is quite easy. In two steps:

1) identify the duplicates

2) remove the duplicates

The data I am dealing with is Olympic Medals – over the period 1984-2008. I have found several sources on the internet. I have written a simple Java program that uses JSoup for screenscraping and subsequently invokes a PL/SQL package to create database records.

The key table – the one that is potentially loaded with the duplicate records – is called MEDAL_SCORE. It records for a country (cty_id) how many medals of a certain metal (gold, silver, bronze; column medal) it has won during a specific edition of the Summer Olympics (oly_id).

Image

The query to identify duplicates is fairly easy: when the partition defined by oly_id, cty_id and medal has more than one record – we have a duplicate. We need to say only once how many medals of a medal type a country has one in an Olympiad. Therefore the duplicate counting query becomes:

select count(*)
from   ( select id
         ,      row_number() over ( partition by oly_id, evt_id, cty_id,medal
                                    order by id) rn
         from   medal_score
       )
where  rn>1

When I execute this query, I discover that I am currently at 1098 duplicates – out of a total of 2544 records. Well, as long as the important ones are all there, that is no issue. Especially since removing the duplicates is as simple as:

delete from medal_score
where  rowid in
( select rwid
  from ( select rowid rwid
         ,      row_number() over ( partition by oly_id, evt_id, cty_id,medal
                                    order by id) rn
         from   medal_score
       )
  where  rn>1
)

Image

Share.

About Author

Lucas Jellema, active in IT (and with Oracle) since 1994. Oracle ACE Director for Fusion Middleware. Consultant, trainer and instructor on diverse areas including Oracle Database (SQL & PLSQL), Service Oriented Architecture, BPM, ADF, Java in various shapes and forms and many other things. Author of the Oracle Press book: Oracle SOA Suite 11g Handbook. Frequent presenter on conferences such as JavaOne, Oracle OpenWorld, ODTUG Kaleidoscope, Devoxx and OBUG. Presenter for Oracle University Celebrity specials.

1 Comment

  1. Martin Schapendonk on

    Alternative without analytic functions:
    delete from medal_score
    where rowid in (
    select rowid
    from medal_score t
    where exists (
    select 1
    from medal_score
    where oly_id = t.oly_id
    and cty_id = t.cty_id
    and medal = t.medal
    and rowid > t.rowid
    )
    );
     
     

Leave a Reply